
Dataset deadlines

Ensure data freshness with default deadlines, or customize your own to monitor data updates and receive alerts for missed or delayed deliveries.

Written by Jon Tam
Updated over 5 months ago

Dataset delivery deadlines establish boundaries for when data is considered late or missing. Each external dataset has its own update schedule, controlled by the data supplier, which makes it challenging to time ingestion to align closely with when the supplier publishes new data. At the same time, you want to avoid checking for new updates too early, since this wastes valuable processing and computing resources.

Crux offers a solution that keeps you informed about time-sensitive data while preventing alert fatigue: you can configure a deadline for each dataset, tied to the time period in which data is ingested.
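Deadlines, like ingestion schedules, are written in standard five-field cron syntax (minute, hour, day of month, month, day of week), as in the examples below. As a minimal illustration of how a deadline expression resolves to a concrete timestamp, here is a sketch in Python using the croniter library; the library choice and all values are illustrative assumptions, not Crux's internal tooling:

```python
from datetime import datetime
from croniter import croniter  # assumed third-party cron parser, for illustration

# A deadline in five-field cron syntax: minute hour day-of-month month day-of-week
deadline_expr = "30 8 * * *"  # 8:30 AM every day

# Resolve the next concrete deadline after the start of a delivery period.
period_start = datetime(2024, 6, 1, 8, 0)  # hypothetical 8:00 AM ingestion start
deadline = croniter(deadline_expr, period_start).get_next(datetime)
print(deadline)  # 2024-06-01 08:30:00
```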

Setting deadlines

During dataset onboarding, you can either accept Crux's recommended delivery deadlines or set your own. Crux generates recommendations by analyzing previous dataset deliveries and observing when the supplier publishes updates at the remote data source. You are free to override the recommended deadlines with your own values.

A few examples (a sketch of the resulting late/missing logic follows the list):

  • Monthly. Ingest data at midnight on the first day of every month (0 0 1 * *).

    • You can configure a deadline of (59 11 5 * *) for the dataset ingestion.

    • Data is considered late if ingestion starts on or before 11:59 AM on the 5th of each month but delivery to the destination does not finish by that deadline.

    • Data is deemed missing if ingestion does not begin by 11:59 AM on the 5th of each month.

  • Weekly. Ingest data on Mondays at 6:00 AM (0 6 * * 1).

    • You can configure a deadline of (0 7 * * 1-5) for the dataset ingestion.

    • Data is considered late if ingestion starts on or before 7:00 AM on Monday through Friday but delivery to the destination does not finish by 7:00 AM on the Friday of that week.

    • Data is deemed missing if ingestion does not begin by 7:00 AM on Friday.

  • Daily. Ingest data daily at 8:00 AM (0 8 * * *).

    • You can configure a deadline of (30 8 * * *) for the dataset ingestion.

    • Data is considered late if ingestion starts on or before 8:30 AM but delivery to the destination does not finish by 8:30 AM.

    • Data is deemed missing if ingestion does not begin by 8:30 AM.

  • Intraday. Ingest data hourly (0 * * * *).

    • You can configure a deadline of (59 23 * * *) for the dataset ingestion.

    • Data is considered late if ingestion starts on or before 11:59 PM but delivery to the destination does not finish that same day.

    • Data is deemed missing if ingestion does not begin by 11:59 PM that day and slips into subsequent days.
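Taken together, the examples above reduce to one rule: a delivery is missing if ingestion has not started by the deadline, late if ingestion started in time but delivery did not finish by the deadline, and on time otherwise. The sketch below encodes that rule; the function and status names are illustrative assumptions, not Crux's API:

```python
from datetime import datetime
from typing import Optional

def delivery_status(deadline: datetime,
                    ingest_start: Optional[datetime],
                    delivery_finish: Optional[datetime]) -> str:
    """Classify a delivery against its deadline, per the rules above."""
    if ingest_start is None or ingest_start > deadline:
        # Ingestion never began (or began after the deadline): data is missing.
        return "missing"
    if delivery_finish is None or delivery_finish > deadline:
        # Ingestion started in time, but delivery didn't finish by the deadline.
        return "late"
    return "on_time"

# Daily example from above: 8:00 AM ingestion, 8:30 AM deadline.
deadline = datetime(2024, 6, 1, 8, 30)
print(delivery_status(deadline, datetime(2024, 6, 1, 8, 0),
                      datetime(2024, 6, 1, 8, 10)))          # on_time
print(delivery_status(deadline, datetime(2024, 6, 1, 8, 0), None))  # late
print(delivery_status(deadline, None, None))                 # missing
```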

Updating deadlines

When you update dataset deadlines, the changes take effect in the following delivery period. For example, say your dataset is scheduled for daily delivery at 8:00 AM (i.e., the dataset schedule is configured as 0 8 * * *), and you change the delivery deadline from 8:30 AM (i.e., the original dataset deadline was 30 8 * * *) to noon (i.e., the updated dataset deadline is 0 12 * * *). Two cases apply, as shown in the sketch after this list:

  • If a delivery already succeeded that day (i.e., data was delivered before the original 8:30 AM deadline), Crux does not re-evaluate the delivery status against the new 12:00 PM deadline.

  • If no delivery has occurred by 8:30 AM that day, Crux monitors deliveries against the new 12:00 PM deadline. Data is considered healthy if it arrives between 8:00 AM and noon, and missing if it is not delivered to the destination by noon that day.
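The two cases above can be summarized in a small sketch; again, the names are illustrative assumptions rather than Crux's actual implementation:

```python
from datetime import datetime
from typing import Optional

def effective_deadline(old_deadline: datetime,
                       new_deadline: datetime,
                       delivered_at: Optional[datetime]) -> Optional[datetime]:
    """Decide which deadline applies to the current period after an update.

    Returns None when the period is already settled: a delivery that
    succeeded against the original deadline is not re-evaluated.
    """
    if delivered_at is not None and delivered_at <= old_deadline:
        return None  # delivery already succeeded; status is not reposted
    # No successful delivery yet: monitor against the new deadline.
    return new_deadline

# Example from above: deadline moved from 8:30 AM to noon.
old = datetime(2024, 6, 1, 8, 30)
new = datetime(2024, 6, 1, 12, 0)
print(effective_deadline(old, new, datetime(2024, 6, 1, 8, 10)))  # None (settled)
print(effective_deadline(old, new, None))  # 2024-06-01 12:00:00 (new deadline)
```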

Datasets delivered only once

If your dataset's schedule is configured as @once, you do not need to set deadlines for it. Datasets scheduled to deliver only @once are not included in the health dashboard. However, you can still view a dataset's delivery details by navigating to the "My Data" menu and selecting the relevant data product, then the dataset of interest.
