
Data Quality Assessment

The Data Quality Assessment capability automatically detects and surfaces common data quality issues and, in many cases, handles them with little or no action on your part. The assessment not only saves the time spent finding and addressing issues, but also provides transparency into the automated data processing that has been applied. Each detected issue includes a warning level to help you determine its severity.

See the associated considerations for important additional information.

As part of EDA1, DataRobot runs the checks that do not require date/time or target information. Once EDA2 starts, DataRobot runs the additional checks that do. Taken together, the checks cover issues such as outliers, inliers, disguised missing values, excess zeros, and multicategorical format errors (each discussed elsewhere on this page).
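As a rough illustration of what a baseline check can look like, the following pandas sketch flags candidate disguised missing values using common sentinel codes. The sentinel list and threshold here are assumptions for illustration only; they are not DataRobot's actual detection rules.

```python
import pandas as pd

# Sentinel codes often used as stand-ins for missing data. The codes and the
# 1% threshold below are assumptions for this sketch, not DataRobot's rules.
SENTINELS = {-9999, -999, 999, 9999}

def disguised_missing_candidates(df: pd.DataFrame, threshold: float = 0.01) -> dict:
    """Flag numeric columns where sentinel codes exceed `threshold` of rows."""
    flagged = {}
    for col in df.select_dtypes(include="number").columns:
        share = df[col].isin(SENTINELS).mean()  # fraction of sentinel rows
        if share >= threshold:
            flagged[col] = share
    return flagged

df = pd.DataFrame({"age": [34, 27, -9999, 45, -9999],
                   "income": [52000, 61000, 48000, 57000, 44000]})
print(disguised_missing_candidates(df))  # {'age': 0.4}
```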

Time series projects run all of the baseline data quality checks, plus additional time-series-specific checks (for example, the Inconsistent Gaps check mentioned below).

For Visual AI projects, the Data Quality Assessment runs the same baseline checks plus an additional check for missing images.

Once EDA1 completes, the Data Quality Assessment appears just above the feature listing on the Data page.

Beyond the baseline data quality assessment, DataRobot provides further detail for time series and Visual AI projects. Once model building completes, you can view the Data Quality Handling Report for additional imputation information.

Overview

The Data Quality Assessment provides information about data quality issues relevant to your stage of model building. It first runs as part of EDA1 (data ingest) and reports results for the All Features list. It runs again after EDA2, updating to display information for the selected feature list (by default, All Features). For checks that do not apply to individual features (for example, Inconsistent Gaps), the report provides a general summary. Click View Info to view the report, and Close Info to dismiss it.

Each data quality check provides issue status flags, a short description of the issue, and a recommendation message, if appropriate:

  • Warning: Attention or action required

  • Informational: No action required

  • No issue

Because results are feature-list based, changing the selected feature list on the Data page can cause new checks to appear in, or current checks to disappear from, the assessment. For example, if feature list List 1 contains a feature problem that has outliers, the outliers check shows in the assessment. If you switch to List 2, which does not include problem (or any other feature with outliers), the outliers check reports no issue.
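To build intuition for why the result depends on the feature list, here is a minimal sketch using a simple IQR fence rule as an illustrative stand-in (not DataRobot's actual outlier logic). The check is evaluated only over the features in the active list, so removing problem removes the only trigger:

```python
import pandas as pd

def has_outliers(series: pd.Series, k: float = 1.5) -> bool:
    """Illustrative IQR fence rule -- not DataRobot's actual outlier logic."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return bool(((series < q1 - k * iqr) | (series > q3 + k * iqr)).any())

df = pd.DataFrame({"problem": [1, 2, 2, 3, 500], "ok": [1, 2, 3, 4, 5]})
feature_lists = {"List 1": ["problem", "ok"], "List 2": ["ok"]}

# The check only sees features in the active list, so switching lists
# changes whether the outliers check fires at all.
for name, cols in feature_lists.items():
    flagged = [c for c in cols if has_outliers(df[c])]
    print(name, "->", flagged if flagged else "no issue")
```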

From within the assessment modal, you can filter by issue type to see which features triggered the checks. Toggle on Show only affected features and select the checkboxes next to the check names to choose which checks to display.

DataRobot then displays, on the Data page, only the features in the selected feature list that violate the selected data quality checks. Hover over an icon for more detail.

For multilabel and Visual AI projects, Preview Log displays at the top if the assessment detects multicategorical format errors or missing images in the dataset. Click Preview Log to open a window with a detailed view of each error, so you can more easily find and fix them in the dataset.

Explore the assessment

Once EDA1 completes (and after you have optionally filtered the display), view the list of features affected by the issues you want to investigate. To see the values that triggered a warning or informational notification, expand a feature and review the Histogram and Frequent Values visualizations.
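If you want a rough offline analogue of those visualizations, the following pandas snippet summarizes a feature's frequent values and a coarse histogram. It is purely illustrative; the in-app charts are computed by DataRobot:

```python
import pandas as pd

s = pd.Series([0, 0, 0, 1, 2, 2, 3, 500], name="problem")

print(s.value_counts().head(10))                      # frequent values
print(pd.cut(s, bins=5).value_counts().sort_index())  # coarse histogram bins
```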

To learn more about the topics discussed on this page, see the related documentation.

Feature considerations

Consider the following when working with the Data Quality Assessment capability:

  • For disguised missing values, inliers, and excess zeros, automated handling is enabled only for linear and Keras blueprints, where it has been shown to reduce model error. Detection, however, is applied to all blueprints.
  • You cannot disable automated imputation handling.
  • A public API is not yet available.
  • Automated feature engineering runs on raw data; for example, excess zeros and disguised missing values are not removed before rolling averages are calculated (see the sketch below).
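To see why the last consideration matters, here is a small sketch comparing a rolling mean computed on raw values against one computed after a hypothetical zero-imputation step. The interpolation here is purely illustrative; per the note above, DataRobot's automated feature engineering operates on the raw values:

```python
import pandas as pd

# A series where the zeros are suspected excess zeros (e.g., sensor dropouts).
s = pd.Series([10.0, 0.0, 12.0, 11.0, 0.0, 13.0])

# What automated feature engineering sees: the raw values, zeros included.
raw_rolling = s.rolling(window=3).mean()

# Hypothetical alternative for comparison only: treat zeros as missing and
# interpolate before rolling. DataRobot does NOT do this, per the note above.
cleaned_rolling = s.mask(s == 0).interpolate().rolling(window=3).mean()

print(pd.DataFrame({"raw": raw_rolling, "cleaned_first": cleaned_rolling}))
```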

Updated January 26, 2024