Parser
The `parser` object selects a reader (`type`) and configures it.
| `type` | For |
|---|---|
| `csv` | CSV / delimited text |
| `excel` | Excel workbooks |
| `polaris_sav` | Polaris `.sav` SQL-dump archives |
Common options
| Field | Meaning |
|---|---|
| `header_row` | 0-indexed row holding the column names (default `0`). |
| `skip_rows` | Rows to skip before the header. |
| `max_rows` | Cap on rows parsed (`null` = no limit). |
| `strip_columns` | Columns to drop (by index array, or `true` for all). |
| `supports_reparse` | Whether the file can be re-parsed after initial ingest. |
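As a rough illustration of how `skip_rows`, `header_row`, and `max_rows` could interact, here is a minimal Python sketch. The function `apply_common_options` is hypothetical, and it assumes that skipped rows are dropped first and that `header_row` is counted from the first remaining row; the actual reader may order these steps differently.

```python
def apply_common_options(rows, header_row=0, skip_rows=0, max_rows=None):
    """Split raw rows into (column_names, data_rows).

    Assumption: skip_rows are dropped first, and header_row is counted
    from the first row that remains afterwards.
    """
    remaining = rows[skip_rows:]
    columns = remaining[header_row]
    data = remaining[header_row + 1:]
    if max_rows is not None:  # None stands in for null: no limit
        data = data[:max_rows]
    return columns, data


# Illustrative input: one preamble line, a header, then three data rows.
raw = [
    ["Exported report"],
    ["id", "name"],
    ["1", "Ada"],
    ["2", "Grace"],
    ["3", "Edsger"],
]

columns, data = apply_common_options(raw, header_row=0, skip_rows=1, max_rows=2)
print(columns)  # ['id', 'name']
print(data)     # [['1', 'Ada'], ['2', 'Grace']]
```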
CSV options
`encoding` (e.g. `utf-8`, `iso-8859-1`), `delimiter` (`,`, `;`, `\t`, `|`…), `has_header`.
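The three CSV options map naturally onto Python's standard `csv` module. The sketch below is not the reader's actual implementation; `read_csv_bytes` is a hypothetical helper, and the generated `col_N` names used when `has_header` is false are an assumption made for illustration.

```python
import csv
import io


def read_csv_bytes(payload, encoding="utf-8", delimiter=",", has_header=True):
    """Decode raw bytes and parse them using the given CSV options."""
    text = payload.decode(encoding)
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    if not rows:
        return [], []
    if has_header:
        return rows[0], rows[1:]
    # Assumption: headerless files get positional placeholder names.
    return [f"col_{i}" for i in range(len(rows[0]))], rows


# A semicolon-delimited file encoded as ISO-8859-1.
payload = "id;name\n1;Zoë\n".encode("iso-8859-1")
columns, data = read_csv_bytes(payload, encoding="iso-8859-1", delimiter=";")
print(columns)  # ['id', 'name']
print(data)     # [['1', 'Zoë']]
```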
Excel options
`sheet_names` (array, or `null` for all sheets), plus `section_context` for files
where data is grouped under section headers.
SAV options (Polaris)
`sql_filename` (path to the SQL dump inside the archive, e.g. `0-full.sql`), `tables`
(table names to extract), `extract_media` (pull binary assets).
Column mapping
`mapping.columns` maps each source column to a target field and a Spark SQL type.
`type` is any Spark SQL type (it’s a free-form string, not a fixed enum). Common values:
`STRING`, `INT`, `LONG`, `DOUBLE`, `TIMESTAMP`.
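To make the mapping concrete, the following Python sketch turns a mapping into Spark SQL `CAST` expressions. The key names `source` and `target`, the sample column names, and the helper `to_select_exprs` are all hypothetical; only `type` as a free-form Spark SQL type string comes from the description above.

```python
# Hypothetical mapping: the "source" and "target" key names are assumptions.
mapping = {
    "columns": [
        {"source": "Order ID", "target": "order_id", "type": "LONG"},
        {"source": "Amount", "target": "amount", "type": "DOUBLE"},
        {"source": "Placed At", "target": "placed_at", "type": "TIMESTAMP"},
    ]
}


def to_select_exprs(mapping):
    """Render one Spark SQL CAST expression per mapped column."""
    return [
        f"CAST(`{col['source']}` AS {col['type']}) AS `{col['target']}`"
        for col in mapping["columns"]
    ]


exprs = to_select_exprs(mapping)
for expr in exprs:
    print(expr)
# CAST(`Order ID` AS LONG) AS `order_id`
# CAST(`Amount` AS DOUBLE) AS `amount`
# CAST(`Placed At` AS TIMESTAMP) AS `placed_at`
```

Because `type` is passed through as a string, an unsupported value would only surface when Spark evaluates the expression, not when the mapping is built.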
Extra fields
Unmapped columns are handled by `extra_fields.mode`:
| Mode | Behavior |
|---|---|
| `ignore` | Drop unmapped columns. |
| `passthrough` | Keep them as-is. |
| `collect` | Gather them into a list in `target_column`. |
| `store_json` | Store them as a JSON object in `target_column`. |
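The four modes can be sketched on a single row as follows. This is an illustration only: `apply_extra_fields` is a hypothetical helper, the default `extras` column name is made up, and it assumes `collect` gathers the unmapped values (not the names) and `store_json` serializes to a JSON string.

```python
import json


def apply_extra_fields(row, mapped, mode, target_column="extras"):
    """Apply one extra_fields mode to a row dict.

    `mapped` is the set of column names covered by the column mapping.
    """
    known = {k: v for k, v in row.items() if k in mapped}
    extra = {k: v for k, v in row.items() if k not in mapped}
    if mode == "ignore":
        return known
    if mode == "passthrough":
        return dict(row)
    if mode == "collect":
        # Assumption: the list holds the unmapped values in column order.
        return {**known, target_column: list(extra.values())}
    if mode == "store_json":
        return {**known, target_column: json.dumps(extra, sort_keys=True)}
    raise ValueError(f"unknown extra_fields.mode: {mode!r}")


row = {"id": 1, "color": "red", "note": "x"}
mapped = {"id"}

print(apply_extra_fields(row, mapped, "ignore"))       # {'id': 1}
print(apply_extra_fields(row, mapped, "passthrough"))  # {'id': 1, 'color': 'red', 'note': 'x'}
print(apply_extra_fields(row, mapped, "collect"))      # {'id': 1, 'extras': ['red', 'x']}
print(apply_extra_fields(row, mapped, "store_json"))   # {'id': 1, 'extras': '{"color": "red", "note": "x"}'}
```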
Validation
Guard rails applied after parsing: `required_columns` (must be present and non-null) and `min_rows` (minimum row count).
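A minimal Python sketch of these two checks, assuming rows are represented as dicts. The `validate` function and its message wording are hypothetical, and whether the real pipeline fails fast or reports every problem is not specified here.

```python
def validate(rows, required_columns=(), min_rows=0):
    """Return a list of problems; an empty list means the rows passed."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for column in required_columns:
        for index, row in enumerate(rows):
            if column not in row:
                problems.append(f"row {index}: missing column {column!r}")
            elif row[column] is None:
                problems.append(f"row {index}: null in required column {column!r}")
    return problems


rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": None}]
problems = validate(rows, required_columns=["id", "name"], min_rows=3)
for problem in problems:
    print(problem)
# expected at least 3 rows, got 2
# row 1: null in required column 'name'
```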
Parsed columns then flow into the promotion pipeline,
where they’re transformed and written to silver/gold.

