A table schema defines the structure, relationships, and constraints of the data within a dataset table, which contains the data records. Each table is defined by an inbound (raw) schema, representing the source data, and an outbound (processed, flat-file) schema, which reflects how the data is stored after processing. Depending on the data transformation involved, these schemas can be identical or different. A schema ensures that the data follows a predefined structure, which is critical for validation, processing, and downstream use.
A table schema is a blueprint that guides data integration, validation, and transformation, ensuring data adheres to its declared structure for reliable insights and interoperability.
Components of a table schema
A table schema specifies the organization, types, and constraints of data elements, including:
Columns: Describe canonical names, data types, and semantic types.
Fields: Map to the data records within the source file. Each field represents a data attribute, such as a name, description, or value.
Field Types: Each field is associated with a specific data type, such as STRING, INTEGER, DATE, or FLOAT, which defines the format and constraints for its values.
Semantic Types: The business meaning of each field in a table, such as CUSIP, ISIN, SSN, or Address.
Relationships: Connections between tables (e.g., foreign keys, primary keys), when applicable.
Constraints: Rules that ensure data quality and consistency, such as required fields, unique values, or minimum and maximum value ranges.
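The components above can be sketched as a simple data model. This is an illustrative example only, not Crux's internal representation or API; the class and attribute names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Column:
    """One column in a table schema (illustrative, not Crux's API)."""
    name: str                           # canonical column name
    data_type: str                      # e.g. "STRING", "INTEGER", "DATE", "FLOAT"
    semantic_type: Optional[str] = None # e.g. "CUSIP", "ISIN", "SSN", "Address"
    required: bool = False              # constraint: value must be present
    unique: bool = False                # constraint: values must be unique

@dataclass
class TableSchema:
    """A table's columns plus their types and constraints."""
    table_name: str
    columns: List[Column] = field(default_factory=list)

# A small schema with a unique, required identifier column
schema = TableSchema(
    table_name="securities",
    columns=[
        Column("cusip", "STRING", semantic_type="CUSIP", required=True, unique=True),
        Column("price", "FLOAT"),
    ],
)
```

A relationship such as a foreign key could be modeled the same way, as a reference from one column to a column in another table.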
Example
For a file containing historical mortgage rates, the schema may contain the following fields and corresponding formats:
name: field in a STRING format
description: field in a STRING format
date: field in a DATE format, indicating when the mortgage rate became effective
mortgage rate: field in a FLOAT format, reflecting the interest rate applied
Any additional fields that provide further data insights
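The mortgage-rate schema above can be expressed as a field-to-type mapping with a minimal record validator. This is a hedged sketch: the dict layout and function are hypothetical, not part of Crux; only the field names and types come from the example.

```python
from datetime import date

# Hypothetical representation of the mortgage-rate schema described above
MORTGAGE_RATE_SCHEMA = {
    "name": str,             # STRING
    "description": str,      # STRING
    "date": date,            # DATE — when the rate became effective
    "mortgage_rate": float,  # FLOAT — the interest rate applied
}

def validate_record(record: dict) -> list:
    """Return a list of presence/type errors for one data record."""
    errors = []
    for field_name, expected in MORTGAGE_RATE_SCHEMA.items():
        if field_name not in record:
            errors.append("missing field: " + field_name)
        elif not isinstance(record[field_name], expected):
            errors.append(field_name + ": expected " + expected.__name__)
    return errors

record = {
    "name": "30yr fixed",
    "description": "US weekly average",
    "date": date(2024, 1, 4),
    "mortgage_rate": 6.62,
}
# validate_record(record) → []  (no errors)
```

A record that omits a field or supplies the wrong type would produce one error entry per violation.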
💡 When onboarding a self-service data product, you can modify data column names or their data types when doing so is safe.
Schema types
Each schema is one of the following types:
Input Schema: The original schema provided by the data supplier, reflecting the raw data format.
Base Schema: The initial mapping between the input schema and Crux's standard output representation, aligning the incoming data with the system's structure. At Crux, each table is linked to a schema for easier data management.
Schema versions
Data tables can evolve over time as fields are added, removed, reordered, or renamed. Crux automatically detects these changes and organizes them into schema versions, ensuring consistency as the dataset evolves.
The schema version represents a specific schema instance for a given file at a point in time.
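Detecting what changed between two schema versions amounts to comparing their field sets and types. The sketch below illustrates the idea under assumed inputs (schemas as name-to-type dicts); Crux performs this detection automatically, and this is not its actual mechanism.

```python
def diff_schemas(old: dict, new: dict) -> dict:
    """Report fields added, removed, or retyped between two schema versions.

    Both arguments are hypothetical field-name -> data-type mappings.
    """
    common = set(old) & set(new)
    return {
        "added":   sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(k for k in common if old[k] != new[k]),
    }

v1 = {"name": "STRING", "rate": "FLOAT"}
v2 = {"name": "STRING", "rate": "STRING", "date": "DATE"}
# diff_schemas(v1, v2) → {"added": ["date"], "removed": [], "retyped": ["rate"]}
```

A non-empty diff would signal that a new schema version should be recorded for the file.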
Data quality rules
When managing schemas as part of Crux self-service, you can apply additional configurations for each data column to maintain high data quality, such as:
Default values for specific fields
Expected distribution values
Allowed ENUM values
Minimum and maximum values for fields
Row count validations to ensure dataset completeness
Tests for individual column values to detect any issues or inconsistencies
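Several of the rules listed above (allowed ENUM values, min/max bounds, row counts) can be sketched as simple checks over a batch of rows. The function, column names, and thresholds below are hypothetical, chosen only to illustrate the kinds of validations described; this is not Crux's configuration format.

```python
def run_quality_checks(rows, min_rows, allowed_status, rate_bounds):
    """Apply example data quality rules to a batch of rows.

    rows: list of dicts with hypothetical "status" and "rate" columns.
    Returns a list of human-readable failure messages (empty = all passed).
    """
    failures = []
    # Row count validation to ensure dataset completeness
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below minimum {min_rows}")
    lo, hi = rate_bounds
    for i, row in enumerate(rows):
        # Allowed ENUM values for a column
        if row.get("status") not in allowed_status:
            failures.append(f"row {i}: status not in allowed ENUM values")
        # Minimum and maximum values for a column
        rate = row.get("rate", 0.0)
        if not lo <= rate <= hi:
            failures.append(f"row {i}: rate {rate} outside [{lo}, {hi}]")
    return failures

rows = [{"status": "ACTIVE", "rate": 6.5}]
# run_quality_checks(rows, 1, {"ACTIVE", "INACTIVE"}, (0.0, 25.0)) → []
```

Each returned failure message is the kind of event that could trigger a notification rule.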
You can configure notification rules to receive alerts when data quality tests fail or schema inconsistencies arise, ensuring prompt action is taken.
Summary
Schemas are the backbone of structured datasets, providing a clear structure that ensures data is organized, validated, and processed correctly. By maintaining high data quality and adapting to changes in data formats through schema versions, you and your team can manage evolving datasets with minimal disruption. Crux’s ability to automatically detect schema changes and offer data quality management ensures that data pipelines remain robust and reliable over time.
Learn more
Learn about the structure and organization of data products built with Crux.