The Crux domain model helps you organize and customize your data to improve business decisions and processes. This guide explains the structure and elements of a Crux domain model.

Overview

Data products are at the top of the Crux domain model ontology. A data product is a model-ready solution delivering external data that is ready for immediate analysis. Data becomes a data product after being extracted (ingested from the source) and processed (normalized and standardized). Data products are organized into datasets, tables, and table schemas, allowing for a flexible and scalable approach to external data management.

Key concepts

The domain model is built on the following key concepts:

Data product: A collection of one or more datasets that have been processed and normalized.
Dataset: A dataset is structured data within a data product, defined by its source, file pattern, and schedule, forming a data pipeline for efficient analysis. All data in the dataset share a common file pattern and are delivered on the same schedule.
Table: A table is a dataset component that organizes data into two-dimensional data structures, similar to rows and columns in a spreadsheet or an SQL table.
Table schema: Defines the structure of the data in a table, with the flexibility to evolve over time.

At its simplest, a data product is a collection of datasets, and the following diagram illustrates how these components relate.

Data products

Every Crux data product consists of four primary components:

Source Connection: Defines the connection to the data provider, where the data originates.
Destination Connection: Specifies the target destination where the processed data is delivered.
Dataset(s): The core organizational structure of your data product.
Delivery: The mechanism that keeps data up-to-date by fetching it from the source.
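
As a rough illustration, the four components could be modeled as a simple record. This is a minimal Python sketch, not the Crux API: the class names, field names, and example values (including the URIs and the "daily" schedule label) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Connection:
    """Hypothetical stand-in for a source or destination connection."""
    name: str
    uri: str


@dataclass
class DataProduct:
    """Illustrative data product holding the four components listed above."""
    name: str
    source: Connection                                   # where the data originates
    destination: Connection                              # where processed data is delivered
    datasets: List[str] = field(default_factory=list)    # names of the datasets
    delivery_schedule: str = "daily"                     # assumed schedule label


product = DataProduct(
    name="example_prices",
    source=Connection("provider_sftp", "sftp://provider.example/outbound"),
    destination=Connection("warehouse", "s3://example-bucket/prices"),
    datasets=["eod_prices"],
)
```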

Datasets

All data in a dataset is sourced through a shared data pipeline that follows a common ingestion schedule. A dataset consists of one or more tables. Each table contains one or more table schemas, which govern the structure of individual resources.

Dataset: The top-level entity with a file name pattern and schedule that controls when data is fetched and processed.
Table: A sequence of resources (data files) grouped over time. Each table has a name and a file_pattern that defines how files are structured.
Table schema: The schema defines the structure of data in the table. Schemas can change over time, with optional start_date and end_date fields marking when these changes occur.
Resource: The individual data files. Resources include metadata like file paths, remote modified times, and remote and local sizes. They exist in two states: raw (before Crux processing) and processed (after Crux processing).
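
The nesting described above, and the way start_date and end_date can select the schema in effect on a given day, can be sketched as follows. This is an illustrative Python model under stated assumptions, not Crux's implementation: only name, file_pattern, start_date, and end_date come from the text; every other name is hypothetical, and treating end_date as inclusive is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class TableSchema:
    columns: Dict[str, str]              # column name -> data type
    start_date: Optional[date] = None    # None means no lower bound
    end_date: Optional[date] = None      # None means still in effect

    def is_active(self, day: date) -> bool:
        # Assumption: both bounds are inclusive.
        after_start = self.start_date is None or day >= self.start_date
        before_end = self.end_date is None or day <= self.end_date
        return after_start and before_end


@dataclass
class Resource:
    path: str
    state: str = "raw"                   # "raw" or "processed"


@dataclass
class Table:
    name: str
    file_pattern: str
    schemas: List[TableSchema] = field(default_factory=list)
    resources: List[Resource] = field(default_factory=list)

    def schema_for(self, day: date) -> Optional[TableSchema]:
        """Return the first schema whose date range covers `day`, if any."""
        for schema in self.schemas:
            if schema.is_active(day):
                return schema
        return None


@dataclass
class Dataset:
    name: str
    schedule: str
    tables: List[Table] = field(default_factory=list)


# Hypothetical example: a "volume" column is added from 2024 onward.
prices = Table(
    name="prices",
    file_pattern="prices_*.csv",
    schemas=[
        TableSchema({"ticker": "string", "close": "float"},
                    end_date=date(2023, 12, 31)),
        TableSchema({"ticker": "string", "close": "float", "volume": "integer"},
                    start_date=date(2024, 1, 1)),
    ],
)
dataset = Dataset(name="eod_prices", schedule="daily", tables=[prices])
```

Calling prices.schema_for(date(2024, 6, 1)) returns the second schema, so a file received in 2024 would be read with the volume column, while earlier files keep the original two-column structure.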

Learn more about datasets here.

Tables

A table is a dataset component that organizes data into two-dimensional data structures, similar to rows and columns in a spreadsheet or an SQL table. Tables can also support external workflows involving unstructured and “raw” data, such as images, audio, video, PDF, and other files.

Table Schemas

A table schema defines the structure, relationships, and constraints of data within a dataset table. It specifies the organization, types, and constraints of data elements, including:

Columns: Canonical names, data types (e.g., integer, string, date), and semantic types (CUSIP, ISIN, SSN, Address, etc.) for each field in a table.
Relationships: Connections between tables (e.g., foreign keys, primary keys), when applicable.
Constraints: Rules that ensure data quality and consistency, such as required fields, unique values, or minimum and maximum value ranges.

A table schema serves as a blueprint for guiding data integration, validation, and transformation, ensuring data adheres to its declared state for reliable insights and interoperability.
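
To make the validation role concrete, here is a small Python sketch that checks a row against declared column types and two of the constraint kinds mentioned above (required fields and a minimum value). The schema layout, the rule keys ("type", "required", "min"), and the function name are assumptions made for illustration; they do not reflect how Crux represents or enforces schemas.

```python
from typing import Any, Dict, List

# Hypothetical schema: column name -> expected Python type and constraints.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "ticker": {"type": str, "required": True},
    "close": {"type": float, "required": True, "min": 0.0},
    "volume": {"type": int, "required": False, "min": 0},
}


def validate_row(row: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of violations; an empty list means the row is valid."""
    errors: List[str] = []
    for name, rule in schema.items():
        value = row.get(name)
        if value is None:
            # Missing values only matter for required columns.
            if rule.get("required", False):
                errors.append(f"{name}: missing required value")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
    return errors


ok = validate_row({"ticker": "ABC", "close": 10.5}, SCHEMA)
bad = validate_row({"close": -1.0}, SCHEMA)
```

Here ok is an empty list, while bad reports the missing ticker and the negative close. Returning all violations rather than stopping at the first one is a design choice that makes it easier to report every problem in a file at once.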

Deliveries

Deliveries are the scheduled processes that fetch new data from a source for a given dataset. They ensure that data is continually updated and ready for analysis.

Summary

The Crux domain model provides a consistent structure for managing external data. Understanding how data products, datasets, tables, table schemas, and deliveries interact is key to using Crux effectively. This nested hierarchy supports flexible data ingestion, processing, and schema evolution, which makes it well suited to data engineers working with external data.

Learn more

Learn about the structure and organization of data products built with Crux.