
Table Schemas

Understand how schemas work in the Crux app.

Written by Jon Tam
Updated over 6 months ago

A table schema defines the structure, relationships, and constraints of the data within a dataset table, which holds the data records. Each table is defined by an inbound (raw) schema, representing the source data, and an outbound (processed, flat-file) schema, which reflects how the data is stored after processing. Depending on the data transformation involved, these schemas can be identical or different. A schema ensures that the data follows a predefined structure, which is critical for validation, processing, and downstream use.

A table schema acts as a blueprint that guides data integration, validation, and transformation, ensuring the data matches its declared structure so it can support reliable insights and interoperability.

Components of a table schema

A table schema specifies the organization, types, and constraints of data elements, including:

  • Columns: Describe canonical names, data types, and semantic types.

    • Fields: The schema includes fields that map to the data records within the source file. Each field represents a data attribute, such as a name, description, or value.

    • Field Types: Each field is associated with a specific data type, such as STRING, INTEGER, DATE, or FLOAT, defining the format and constraints for the data values.

    • Semantic Types: Each field can carry a semantic type, such as CUSIP, ISIN, SSN, or Address, that identifies what the value represents.

  • Relationships: Connections between tables (e.g., foreign keys, primary keys), when applicable.

  • Constraints: Rules that ensure data quality and consistency, such as required fields, unique values, or minimum and maximum value ranges.

Example

For a file containing historical mortgage rates, the schema might include the following fields and corresponding formats:

  • name field in a STRING format

  • description field in a STRING format

  • date field in a DATE format, indicating when the mortgage rate became effective

  • mortgage rate field in a FLOAT format, reflecting the interest rate applied

  • Any additional fields that provide further data insights
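
To make this concrete, below is a minimal sketch of how the mortgage-rate schema above could be written down, assuming a simple column-and-table representation in Python. The Column and TableSchema classes and the mortgage_rate field name are illustrative only and are not part of Crux's API.

  from dataclasses import dataclass, field
  from typing import Optional

  # Illustrative classes only; they are not part of Crux's API.
  @dataclass
  class Column:
      name: str                            # canonical field name
      data_type: str                       # STRING, INTEGER, DATE, FLOAT, ...
      semantic_type: Optional[str] = None  # e.g. CUSIP, ISIN, SSN, Address
      required: bool = False

  @dataclass
  class TableSchema:
      table: str
      columns: list[Column] = field(default_factory=list)

  mortgage_rates = TableSchema(
      table="historical_mortgage_rates",   # hypothetical table name
      columns=[
          Column("name", "STRING"),
          Column("description", "STRING"),
          Column("date", "DATE", required=True),            # when the rate became effective
          Column("mortgage_rate", "FLOAT", required=True),  # the interest rate applied
      ],
  )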

💡 When onboarding a self-service data product, you can modify data column names or their data types when doing so is safe.

Schema types

Each schema is one of the following types:

  1. Input Schema: The original schema provided by the data supplier, reflecting the raw data format.

  2. Base Schema: The initial mapping between the input schema and Crux's standard output representation, aligning the incoming data with the system's structure. At Crux, each table is linked to a schema for easier data management.

Schema versions

Data tables can evolve over time as fields are added, removed, reordered, or renamed. Crux automatically detects these changes and organizes them into schema versions, ensuring consistency as the dataset evolves.

Each schema version represents a specific schema instance for a given file at a point in time.
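
To picture what a version change looks like, the sketch below compares two hypothetical schema versions, each represented as a simple mapping of column names to data types, and reports which columns were added, removed, or retyped. This is only an illustration; Crux performs this detection for you automatically.

  # Hypothetical sketch of a schema-version comparison; not Crux functionality.
  def diff_schema_versions(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
      return {
          "added":   sorted(new.keys() - old.keys()),
          "removed": sorted(old.keys() - new.keys()),
          "retyped": sorted(c for c in old.keys() & new.keys() if old[c] != new[c]),
      }

  v1 = {"name": "STRING", "date": "DATE", "mortgage_rate": "FLOAT"}
  v2 = {"name": "STRING", "date": "DATE", "mortgage_rate": "FLOAT", "region": "STRING"}
  print(diff_schema_versions(v1, v2))  # {'added': ['region'], 'removed': [], 'retyped': []}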

Data quality rules

When managing schemas as part of Crux self-service, you can apply additional configurations for each data column to maintain high data quality, such as:

  • Default values for specific fields

  • Expected distribution values

  • Allowed ENUM values

  • Minimum and maximum values for fields

  • Row count validations to ensure dataset completeness

  • Tests for individual column values to detect any issues or inconsistencies
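
The sketch below shows how a few of these rules could be expressed as simple checks against a batch of records. The column names, thresholds, and allowed values are hypothetical; in Crux you configure these rules through self-service rather than writing code like this.

  # Illustrative only: a few of the rules above expressed as simple checks.
  # Column names, thresholds, and allowed values are hypothetical.
  rules = {
      "mortgage_rate": {"min": 0.0, "max": 25.0},     # minimum and maximum values
      "region":        {"enum": {"US", "UK", "EU"}},  # allowed ENUM values
      "description":   {"default": "N/A"},            # default value for a field
  }
  min_expected_rows = 1                               # row count validation

  def validate(rows: list[dict]) -> list[str]:
      """Return human-readable data quality failures for a batch of records."""
      failures = []
      if len(rows) < min_expected_rows:
          failures.append(f"expected at least {min_expected_rows} row(s), got {len(rows)}")
      for i, row in enumerate(rows):
          for col, rule in rules.items():
              value = row.get(col, rule.get("default"))
              if value is None:
                  failures.append(f"row {i}: {col} is missing")
                  continue
              if "min" in rule and value < rule["min"]:
                  failures.append(f"row {i}: {col}={value} is below {rule['min']}")
              if "max" in rule and value > rule["max"]:
                  failures.append(f"row {i}: {col}={value} is above {rule['max']}")
              if "enum" in rule and value not in rule["enum"]:
                  failures.append(f"row {i}: {col}={value!r} is not an allowed value")
      return failures

  print(validate([{"mortgage_rate": 6.8, "region": "US"}]))  # []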

You can configure notification rules to receive alerts when data quality tests fail or schema inconsistencies arise, ensuring prompt action is taken.

Summary

Schemas are the backbone of structured datasets, providing a clear structure that ensures data is organized, validated, and processed correctly. By maintaining high data quality and adapting to changes in data formats through schema versions, you and your team can manage evolving datasets with minimal disruption. Crux’s ability to automatically detect schema changes and offer data quality management ensures that data pipelines remain robust and reliable over time.

Learn more

Learn about the structure and organization of data products built with Crux.
