Ingestion specs — promotion steps

A promotion transforms parsed data and writes it to a target table. A spec can have several promotions (one per output dataset), and each runs an ordered list of steps.

Promotion shape

{
  "dataset_id": "brands",
  "domain": "catalog",
  "conformed_table": "brands",
  "source_table": "marque",
  "source_tables": ["marque"],
  "merge_keys": ["organization_id", "code"],
  "update_columns": ["name", "updated_at"],
  "steps": [ /* ordered transformation steps */ ]
}

Field	Meaning
`dataset_id`	Logical dataset name.
`domain`	Logical domain (e.g. `catalog`, `sales`, `inventory`).
`conformed_table`	The target silver/gold table.
`source_table` / `source_tables`	The source table(s) this promotion reads.
`merge_keys`	Columns that identify a row for upsert.
`update_columns`	On upsert, which columns to update (`null`/omitted = all).
`steps`	The ordered pipeline (below).

Write behavior

Promotions write with one of three modes — merge (upsert on merge_keys), append (insert only), or overwrite partition (replace a partition). In practice, a promotion with merge_keys upserts on those keys; update_columns narrows which columns are written on a match (e.g. only enrich postal_code without touching the rest of the row).

Ordering

Steps run top to bottom. Each step sees the output of the previous one, so order matters — e.g. add/rename columns before generate_id, and run an id_mapping_output only after the id exists.

Step catalog

Every step is an object with a type. Below, grouped by purpose.

Shaping

coerce_types — cast columns to SQL types

{ "type": "coerce_types", "columns": { "quantity": "INT", "price": "DOUBLE", "date": "TIMESTAMP" } }

Casts each listed column to the given Spark SQL type.

filter — keep rows matching a condition

{ "type": "filter", "condition": "quantity > 0 AND date IS NOT NULL" }

condition is a SQL WHERE expression (without the WHERE keyword).

deduplicate — remove duplicate rows

{ "type": "deduplicate", "columns": ["no_ligmvt"], "strategy": "keep_last", "order_by": ["updated_at"], "order_desc": true }

strategy: drop (default), keep_first, or keep_last. order_by / order_desc decide which row survives for keep_first/keep_last.

add_column — add a computed column

{ "type": "add_column", "column": "created_at", "expression": "current_timestamp()" }

expression is any SQL scalar expression (literals, functions, CASE, references to other columns).

rename_columns — rename columns

{ "type": "rename_columns", "columns": { "no_marque": "code", "nom": "name" } }

Maps source → target names.

select_columns — keep only these columns

{ "type": "select_columns", "columns": ["id", "code", "name"] }

Drops everything not listed.

drop_columns — remove these columns

{ "type": "drop_columns", "columns": ["_tmp", "_extra"] }

inject_value — inject a runtime value

{ "type": "inject_value", "column": "organization_id", "value_key": "organization_id" }

Writes a value from the runtime context (e.g. organization_id, created_by) as a literal column.

Joining

join — join another dataset

{
  "type": "join",
  "target_dataset": "shops",
  "source_column": "shop_code",
  "target_column": "code",
  "join_type": "left",
  "select_columns": [ { "source": "id", "alias": "shop_id" }, { "source": "name" } ],
  "columns_to_drop": ["_tmp"],
  "deduplicate_on": ["code"]
}

join_type: left (default), inner, outer, right. select_columns picks (and optionally aliases) columns from the joined dataset.

sequential_join — flexible join with fallback & expressions

{
  "type": "sequential_join",
  "target_dataset": "libtaille",
  "join_column": "no_libtaille",
  "target_join_column": null,
  "fallback_join_column": null,
  "select_columns": [],
  "select_expressions": { "_size_label": "libelle" },
  "deduplicate_on": ["no_libtaille"],
  "join_type": "left"
}

Adds two capabilities over join: a fallback_join_column (COALESCE-style secondary key) and select_expressions (custom SQL expressions selected from the target).

aggregation_join — group + aggregate, then join

{
  "type": "aggregation_join",
  "target_dataset": "detcde",
  "group_by_columns": ["no_ligcde"],
  "aggregate_columns": [
    { "source_column": "pa", "function": "avg", "alias": "_detcde_pa", "cast_type": "double" }
  ],
  "join_column": "no_ligcde",
  "join_type": "left",
  "order_by": null,
  "order_desc": false
}

Aggregates the target with function ∈ avg, max, min, sum, count, first, then joins the result on join_column.

Reshaping

unpivot — wide to long

{
  "type": "unpivot",
  "id_columns": ["product_id"],
  "value_columns": ["jan", "feb", "mar"],
  "variable_column": "month",
  "value_column": "value"
}

Turns { id, col1, col2, … } into { id, variable, value } rows. variable_column/value_column default to variable/value.

Identity & taxonomy

generate_id — deterministic UUID from key columns

{ "type": "generate_id", "output_column": "id", "key_columns": ["code"], "namespace": null }

Derives a deterministic id from key_columns — the same key values produce the same id across runs (idempotent), which is what makes re-ingestion safe. An optional namespace scopes the derivation (UUID v5). Defaults: output_column = id, key_columns = all non-null columns.

id_mapping_output — persist a code → id mapping

{ "type": "id_mapping_output", "mapping_type": "brands", "code_column": "code", "id_column": "id", "extra_columns": [] }

Persists a mapping of code_column → id_column under mapping_type, so later promotions (in this or another spec) can resolve foreign keys against it.

id_mapping_join — resolve an FK via a persisted mapping

{ "type": "id_mapping_join", "mapping_type": "brands", "source_column": "brand_code", "output_column": "brand_id", "extra_output_columns": ["shop_id"], "deduplicate": true }

Looks source_column up in the mapping created by an earlier id_mapping_output and writes the resolved id into output_column.

taxonomy_mapping — resolve a label to a taxonomy id

{ "type": "taxonomy_mapping", "taxonomy_type": "size", "source_column": "_size_label", "source_id_column": "no_libtaille", "output_column": "taxonomy_id" }

Resolves a text label (e.g. a size or color) to a Solya taxonomy id. source_id_column optionally provides the POS reference for disambiguation.

Sub-option reference

Enum	Values
Deduplicate strategy	`drop`, `keep_first`, `keep_last`
Join type	`left`, `inner`, `outer`, `right`
Aggregate function	`avg`, `max`, `min`, `sum`, `count`, `first`
Write mode	`merge`, `append`, `overwrite_partition`

Cross-spec foreign keys work via the id_mapping_output → id_mapping_join pair: one promotion publishes the mapping, later promotions consume it. See worked examples.

​Promotion shape

​Write behavior

​Ordering

​Step catalog

​Shaping

​Joining

​Reshaping

​Identity & taxonomy

​Sub-option reference

Promotion shape

Write behavior

Ordering

Step catalog

Shaping

Joining

Reshaping

Identity & taxonomy

Sub-option reference