Ingestion specs — parsing

Parsing turns the raw file into a typed table. It has four parts:

"parsing": {
  "parser": { /* how to read the file */ },
  "mapping": { /* source column -> target field + type */ },
  "extra_fields": { /* what to do with unmapped columns */ },
  "validation": { /* required columns, min rows */ }
}

Parser

The parser object selects a reader (type) and configures it.

`type`	For
`csv`	CSV / delimited text
`excel`	Excel workbooks
`polaris_sav`	Polaris `.sav` SQL-dump archives

Common options

Field	Meaning
`header_row`	0-indexed row holding the column names (default `0`).
`skip_rows`	Rows to skip before the header.
`max_rows`	Cap on rows parsed (`null` = no limit).
`strip_columns`	Columns to drop (by index array, or `true` for all).
`supports_reparse`	Whether the file can be re-parsed after initial ingest.

CSV options

encoding (e.g. utf-8, iso-8859-1), delimiter (,, ;, \t, |…), has_header.

Excel options

sheet_names (array, or null for all sheets), plus section_context for files where data is grouped under section headers:

"section_context": [
  { "label": "Magasin", "output_column": "_section_magasin" },
  { "label": "Poste",   "output_column": "_section_poste" }
]

Each recognized section header value is carried onto the rows beneath it, into the named output column.

SAV options (Polaris)

sql_filename (path to the SQL dump inside the archive, e.g. 0-full.sql), tables (table names to extract), extract_media (pull binary assets).

Column mapping

mapping.columns maps each source column to a target field and a Spark SQL type:

"mapping": {
  "columns": {
    "Numéro":        { "target": "ticket_number", "type": "STRING" },
    "Montant":       { "target": "amount",         "type": "DOUBLE" },
    "Date":          { "target": "payment_date",   "type": "STRING" }
  }
}

type is any Spark SQL type (it’s a free-form string, not a fixed enum). Common values: STRING, INT, LONG, DOUBLE, TIMESTAMP.

Extra fields

Unmapped columns are handled by extra_fields.mode:

Mode	Behavior
`ignore`	Drop unmapped columns.
`passthrough`	Keep them as-is.
`collect`	Gather them into a list in `target_column`.
`store_json`	Store them as a JSON object in `target_column`.

"extra_fields": { "mode": "store_json", "target_column": "_extra_fields" }

Validation

Guard rails applied after parsing:

"validation": {
  "required_columns": ["ticket_number", "payment_method"],
  "min_rows": 1
}

required_columns — must be present and non-null.
min_rows — minimum row count.

Parsed columns then flow into the promotion pipeline, where they’re transformed and written to silver/gold.

​Parser

​Common options

​CSV options

​Excel options

​SAV options (Polaris)

​Column mapping

​Extra fields

​Validation

Parser

Common options

CSV options

Excel options

SAV options (Polaris)

Column mapping

Extra fields

Validation