Overview
The Open Data Integration Nomenclature (ODIN) is Crux's standard for declarative data delivery. It defines a delivery nomenclature that encourages industry-standard GitOps practices. ODIN specs are abstracted from their underlying control planes and workflow frameworks, but they work natively with the Sphere by Crux when you onboard data through the Crux platform.
ODIN specification components
As you build a data product in Crux Studio, the Crux platform generates a standard YAML data format with the following features:
Metadata and specifications, in YAML, for the steps that run in data pipelines
Versioned YAML files that are backward compatible, so that a later version supports all statements in an earlier version
Routines that validate the YAML and confirm that the fields and structure are set correctly
YAML files organized in a tree, where a child YAML file can point to its parent with the parent: field, as shown in the sketch below
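For illustration, here is a minimal sketch of a child specification pointing to its parent; the IDs and layout are hypothetical:

# Hypothetical child spec; parent: references the id of the parent ODIN spec
id: org_name_file_pattern_child_v1
parent: org_name_file_pattern_v1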
As you connect to a data source, select file patterns, model data, identify its tables and schemas, and specify the desired ingestion schedule, deadlines, and destinations, Crux dynamically builds and updates the ODIN specification for each dataset in your data product.
ODIN framework
The ODIN YAML file stores information including:
Pipeline ID (Airflow-specific)
Dataset ID
Data Product ID
Org ID
Source connection and extraction information
Normalizer specification (if applicable)
Null value types
Schema history and schema validations
Context and environment variables
Schedule interval
Availability deadline
Destination information
Sample ODIN file
id: org_name_file_pattern_v1
version: 1.x
annotations:
  ENV: PRODUCTION
metadata:
  org_id: Orabcde
  data_product_id: Pr1234abcde
  dataset_id: Ds123456
  run_uber_step: true
global:
  extract:
    action_class: pipeline.crux_pdk.actions.extract.extractor.ShortCircuitExtractor
    connection:
      type: SFTP
      conf: ${INGESTION-ENGINE-SFTP_TYPE}
  process:
    action_class: pipeline.crux_pdk.actions.process.java_processor.JavaProcessor
    crux_api_conf: ${ORG_NAME_FILE_PATTERN_V1_API}
    endpoint: ${API_HOST}
pipelines:
  - id: org_name_file_pattern_csv_ca8d
    global:
      global:
        file_available_start_date: 2025-02-14
        supplier_implied_date_regex: file_pattern_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
        provenance_file_patterns:
          origin_patterns:
            - file_pattern_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
          return_patterns:
            - file_pattern_(?P<YYYY>\d{4})(?P<MM>\d{2})(?P<DD>\d{2})
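        # The named groups YYYY, MM, and DD in the patterns above capture the
        # four-digit year and two-digit month and day from each file name,
        # which supply the supplier-implied date for the file.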
      schema_def:
        na_values:
          - ""
          - -1.#IND
          - NONE
          - -1.#QNAN
          - "#N/A"
          - +nan
          - NA
          - 1.#IND
          - "#N/A N/A"
          - "#NA"
          - N/A
          - NaN
          - +NaN
          - "NULL"
          - None
          - -nan
          - n/a
          - -NaN
          - "null"
          - nan
          - 1.#QNAN
          - none
          - "Null"
        fields:
          - name: Date
            data_type: DATE
            date_format: "%Y%m%d"
          - name: Date_Effective
            data_type: DATE
            date_format: "%Y%m%d"
          - name: Index_Symbol
            data_type: STRING
          - name: Internal_Key
            data_type: STRING
          - name: ISIN
            data_type: STRING
          - name: Company_Name
            data_type: STRING
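        # The date_format values above are strftime-style directives;
        # "%Y%m%d" parses values such as 20250214.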
    steps:
      - id: extract
        category: short_circuit
        conf:
          fetch_method: fetch_all_files_in_dir_with_state
          remote_path: /source/folder/sub-folder/
          file_patterns:
            - file_pattern_((202502(1[4-9]|2\d)|2025(0[3-9]|1\d)\d\d)|(202[6-9]|20[3-9]\d|2[1-2]\d\d)\d\d\d\d).csv
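            # The pattern above loosely matches file dates from the
            # file_available_start_date (2025-02-14) onward: the rest of
            # February 2025, March through December 2025, and any date in
            # 2026 or later.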
      - id: process
        conf:
          encoding: utf-8
          delimiter: ;
          escapechar: "\0"
          error_bad_lines: true
          skip_footer: false
          file_has_header: true
          strip_trailing_delimiter: false
          handle_extended_whitespace_chars: false
availability_deadlines:
  - deadline_minute: "30"
    deadline_hour: "21"
    deadline_day_of_month: "*"
    deadline_month: "*"
    deadline_day_of_week: "*"
    deadline_year: "*"
    file_frequency: daily
    timezone: UTC
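    # The deadline_* fields follow cron conventions: this daily dataset is
    # expected to be available by 21:30 UTC each day.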
destinations:
  - destination_id: AQZ2paE57OqePz7K5sG6nsunYC
    name: My Snowflake Destination
dag:
  max_active_runs: 1
  owner: CruxInformatics
  schedule_interval: "*/15 15-16 * * *"
  priority_weight: 1
  dag_start_date: 2025-03-06
  dag_catchup: false
  enable_delivery_cache: false
  queue: kubernetes
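  # schedule_interval uses standard cron syntax: "*/15 15-16 * * *" runs
  # the pipeline every 15 minutes during the 15:00 and 16:00 hours.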
Managing ODIN with cruxctl
When reviewing your data product in Crux Studio, you can view the ODIN specification file for each dataset directly in the Crux app. Additionally, you can use the cruxctl command-line interface (CLI) to verify the dataset specification before deploying it to production.
You can manage and make changes to the ODIN specification for each dataset by performing the following operations:
View ODIN in the Review and deploy step as you onboard your data product.
Export ODIN to download the YAML file.
Open the YAML file in your favorite text editor.
Make any needed changes and save your updates.
Run the following command to validate your updated YAML file:
cruxctl dataset validate [your_file.yaml]
When this validation check passes, run the following command to deploy:
cruxctl dataset apply [your_file.yaml]
Note: Running the cruxctl dataset apply command shortcuts the deploy action and applies any changes directly to your DAG. The deployment is reflected in the details of your data product and in the Review and deploy step in Crux Studio.