
Datasets

A dataset is a core element of a data product. All data in a dataset is sourced with a shared data pipeline and follows a common schedule.

Written by Jon Tam
Updated over 6 months ago

In Crux, a dataset is the core element of your data product and serves as the foundation for organizing and managing external data. Its hierarchical structure allows for flexible handling of data over time, ensuring seamless ingestion, processing, and delivery of that data.

Components of a dataset

A dataset follows a nested hierarchy:

  • A dataset is composed of tables.

  • Each table contains one or more table schemas, which define the structure of the data.

  • Resources (the individual files) are managed within this structure and processed based on the defined schema.
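
To make this nesting concrete, here is a minimal sketch in Python. The class and field names below are illustrative assumptions that mirror the hierarchy described in this article; they are not Crux's API.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional


@dataclass
class TableSchema:
    """One version of a table's structure, optionally bounded in time."""
    fields: dict[str, str]              # column name -> type
    start_date: Optional[date] = None   # when this schema version becomes active
    end_date: Optional[date] = None     # when this schema version stops being active


@dataclass
class Resource:
    """An individual data file, with basic metadata."""
    file_path: str
    modified_time: datetime
    remote_size: int
    local_size: int


@dataclass
class Table:
    """A sequence of resources grouped over time, identified by a file pattern."""
    name: str
    file_pattern: str
    schemas: list[TableSchema] = field(default_factory=list)
    resources: list[Resource] = field(default_factory=list)


@dataclass
class Dataset:
    """The top-level unit: one shared pipeline and schedule over its tables."""
    name: str
    schedule: str                       # e.g. a cron-style expression
    tables: list[Table] = field(default_factory=list)
```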

Datasets

The dataset has a file name pattern that identifies which files belong to it and a schedule that controls when data is fetched and processed. All data in a dataset is sourced with a shared data pipeline and has a common ingestion schedule. You can think of a dataset as an individual analytics stream designed to extract, process, and deliver data on a scheduled basis.
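
Continuing the illustrative sketch above, a dataset that receives one dated file per day could be described like this (the names, file pattern, and cron-style schedule are invented for the example):

```python
# Hypothetical example, reusing the Dataset and Table classes sketched above:
# one dated CSV arrives per day and is fetched each morning.
daily_prices = Dataset(
    name="vendor_daily_prices",
    schedule="0 6 * * *",  # illustrative cron expression: every day at 06:00 UTC
    tables=[
        Table(
            name="prices",
            file_pattern=r"prices_(?P<date>\d{8})\.csv",
        )
    ],
)
```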

Tables

A table is a sequence of data resources (data files) grouped over time. Each table has a file_pattern that defines how the files are structured and named. Tables represent the time-based data structure within a dataset, allowing flexibility in data organization.
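
For example, a date-aware pattern, shown here as a hypothetical regular expression rather than Crux's actual pattern syntax, is one way dated files can be grouped into a single table over time:

```python
import re

# Hypothetical pattern: dated files such as prices_20240131.csv all belong to one table.
pattern = re.compile(r"prices_(?P<date>\d{8})\.csv")

incoming = ["prices_20240130.csv", "prices_20240131.csv", "metadata.txt"]

for filename in incoming:
    match = pattern.fullmatch(filename)
    if match:
        print(f"{filename} -> table 'prices', data date {match.group('date')}")
    else:
        print(f"{filename} -> does not match this table's file_pattern")
```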

Table Schema

The table schema defines the structure of the data within each table. Schemas can evolve over time to reflect changes in the data, with optional start_date and end_date fields indicating when different schema versions are active. This allows datasets to accommodate changing data formats and field structures without interruption.
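
As an illustration of how date-bounded schema versions can be resolved, here is a sketch that builds on the hypothetical TableSchema class above; it is an assumption for clarity, not Crux's implementation:

```python
from datetime import date


def active_schema(schemas: list[TableSchema], as_of: date) -> TableSchema:
    """Return the schema version whose date window contains `as_of`.

    A missing start_date or end_date is treated as an open-ended bound.
    """
    for schema in schemas:
        starts_ok = schema.start_date is None or schema.start_date <= as_of
        ends_ok = schema.end_date is None or as_of <= schema.end_date
        if starts_ok and ends_ok:
            return schema
    raise LookupError(f"no schema version is active on {as_of}")


# Example: a `currency` column is added starting 2024-07-01.
v1 = TableSchema(fields={"ticker": "string", "price": "float"},
                 end_date=date(2024, 6, 30))
v2 = TableSchema(fields={"ticker": "string", "price": "float", "currency": "string"},
                 start_date=date(2024, 7, 1))

print(active_schema([v1, v2], date(2024, 3, 15)).fields)  # v1: ticker, price
print(active_schema([v1, v2], date(2024, 8, 1)).fields)   # v2: adds currency
```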

Resources

Resources are the individual data files that make up a dataset. Each resource contains metadata such as file path, modified time, and remote and local sizes. Resources can exist in two states:

  • Raw Resource: The file as it is ingested from the source before Crux processing.

  • Processed Resource: The file after it has gone through Crux’s processing, normalized and ready for use.
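
Here is a minimal standalone sketch of these two states; the class, enum, and file paths are invented for illustration and are not how Crux represents resources internally:

```python
from dataclasses import dataclass, replace
from datetime import datetime
from enum import Enum


class ResourceState(Enum):
    RAW = "raw"              # as ingested from the source, before Crux processing
    PROCESSED = "processed"  # normalized and ready for use after processing


@dataclass(frozen=True)
class ResourceFile:
    file_path: str
    modified_time: datetime
    remote_size: int
    local_size: int
    state: ResourceState = ResourceState.RAW


raw = ResourceFile(
    file_path="incoming/prices_20240131.csv",
    modified_time=datetime(2024, 1, 31, 6, 5),
    remote_size=1_048_576,
    local_size=1_048_576,
)

# After processing, the same logical resource is tracked in its normalized form.
processed = replace(
    raw,
    file_path="processed/prices_20240131.csv",
    state=ResourceState.PROCESSED,
)
print(raw.state.value, "->", processed.state.value)  # raw -> processed
```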

Dataset schedule

Datasets can run once or on a recurring basis. A schedule associated with the dataset's pipeline and extraction routine determines how frequently data is ingested from the supplier and delivered to the destination. This keeps the dataset dynamic: data is regularly refreshed and processed according to the schedule.
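
As a rough sketch of what a recurring schedule implies, the example below uses only Python's standard library and assumes a daily 06:00 UTC cadence; actual Crux schedules are configured in the platform:

```python
from datetime import datetime, timedelta

# Hypothetical recurring schedule: ingest from the supplier once a day at 06:00 UTC.
first_run = datetime(2024, 1, 1, 6, 0)
interval = timedelta(days=1)


def next_run(after: datetime) -> datetime:
    """Return the first scheduled run strictly after `after`."""
    if after < first_run:
        return first_run
    periods = (after - first_run) // interval + 1
    return first_run + periods * interval


print(next_run(datetime(2024, 1, 31, 12, 0)))  # 2024-02-01 06:00:00
```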

Note: Data availability is determined by data suppliers. Although Crux does not control when data updates are available at the source, Crux can help monitor data for delays and notify you when data is late or missed.

Learn more about schedules here.

Summary

Datasets provide the structure needed to organize raw data into manageable and actionable units, ensuring that the data can be processed and analyzed efficiently. In short, a dataset consists of multiple tables; each table contains one or more schemas, which govern the structure of individual resources. With evolving schemas and a flexible table structure, datasets allow continuous data ingestion without interruption, even as the underlying data changes over time.

Learn more

Learn about the structure and organization of data products built with Crux.
