{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Predict CO₂ levels with out-of-time validation modeling\n",
"\n",
"This notebook demonstrates how to use [out-of-time validation (OTV)](https://docs.datarobot.com/en/docs/modeling/special-workflows/otv.html) modeling with DataRobot's Python client to predict monthly CO₂ levels for one of Hawaii's active volcanoes, Mauna Loa. The dataset used in this notebook can be accessed [here](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html) (select the first dataset listed to emulate results displayed below), but DataRobot provides a ready-to-use version of this dataset below. For this notebook, the target feature is `interpolated` because `average` has a some missing values that should be skipped.\n",
"\n",
"OTV is a useful modeling method when you know that your data changes in distribution over time. If this is true of your data, random sampling of training and testing datasets would not yield an outcome that would be representative of the model accuracy when it is making predictions in a production environment. Note that OTV can be applied to both classification and regression projects. It partitions your data using the [backtesting](https://docs.datarobot.com/en/docs/modeling/special-workflows/otv.html#backtests) method, also used in time series modeling. \n",
"\n",
"### Requirements\n",
"\n",
"- Python version 3.7.3.\n",
"- DataRobot API version 2.21.0.\n",
"\n",
"Small adjustments to the code below may be required depending on the Python version and DataRobot API version used.\n",
"\n",
"Reference documentation for DataRobot's Python client [here](https://datarobot-public-api-client.readthedocs-hosted.com)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import datarobot as dr\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import data\n",
"\n",
"You can download the sample training dataset [here](co2_mm_mlo.csv)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
year
\n",
"
month
\n",
"
decimal date
\n",
"
average
\n",
"
interpolated
\n",
"
trend
\n",
"
ndays
\n",
"
day
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1958
\n",
"
3
\n",
"
1958.208
\n",
"
315.71
\n",
"
315.71
\n",
"
314.62
\n",
"
-1
\n",
"
1
\n",
"
\n",
"
\n",
"
1
\n",
"
1958
\n",
"
4
\n",
"
1958.292
\n",
"
317.45
\n",
"
317.45
\n",
"
315.29
\n",
"
-1
\n",
"
1
\n",
"
\n",
"
\n",
"
2
\n",
"
1958
\n",
"
5
\n",
"
1958.375
\n",
"
317.50
\n",
"
317.50
\n",
"
314.71
\n",
"
-1
\n",
"
1
\n",
"
\n",
"
\n",
"
3
\n",
"
1958
\n",
"
6
\n",
"
1958.458
\n",
"
-99.99
\n",
"
317.10
\n",
"
314.85
\n",
"
-1
\n",
"
1
\n",
"
\n",
"
\n",
"
4
\n",
"
1958
\n",
"
7
\n",
"
1958.542
\n",
"
315.86
\n",
"
315.86
\n",
"
314.98
\n",
"
-1
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" year month decimal date average interpolated trend ndays day\n",
"0 1958 3 1958.208 315.71 315.71 314.62 -1 1\n",
"1 1958 4 1958.292 317.45 317.45 315.29 -1 1\n",
"2 1958 5 1958.375 317.50 317.50 314.71 -1 1\n",
"3 1958 6 1958.458 -99.99 317.10 314.85 -1 1\n",
"4 1958 7 1958.542 315.86 315.86 314.98 -1 1"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_path = \"https://docs.datarobot.com/en/docs/api/guide/common-case/co2_mm_mlo.csv\"\n",
"df = pd.read_csv(data_path) # Add your dataset here\n",
"df[\"day\"] = 1 # Displays an arbitrary \"day\" column to create an accurate \"date\" feature\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Connect to DataRobot\n",
"\n",
"Use the snippet below to authenticate and connect to DataRobot. You can read more about different options for [connecting to DataRobot from the client](https://docs.datarobot.com/en/docs/api/api-quickstart/index.html#configure-api-authentication)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call\n",
"# dr.Client(config_path='path-to-drconfig.yaml')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing\n",
"\n",
"Before you begin modeling, you must complete the following steps:\n",
"\n",
"- Create an accurate \"date\" feature.\n",
"- Remove all unnecessary features.\n",
"- Create two month lag features.\n",
"\n",
"You can create many more features (such as aggregates on a monthly level or percentages), but for the purposes of OTV, this is not required."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
interpolated
\n",
"
trend
\n",
"
date
\n",
"
lag_1
\n",
"
lag_2
\n",
"
lag_3
\n",
"
lag_4
\n",
"
\n",
" \n",
" \n",
"
\n",
"
8
\n",
"
313.33
\n",
"
315.31
\n",
"
1958-11-01
\n",
"
312.66
\n",
"
313.20
\n",
"
314.93
\n",
"
315.86
\n",
"
\n",
"
\n",
"
9
\n",
"
314.67
\n",
"
315.61
\n",
"
1958-12-01
\n",
"
313.33
\n",
"
312.66
\n",
"
313.20
\n",
"
314.93
\n",
"
\n",
"
\n",
"
10
\n",
"
315.62
\n",
"
315.70
\n",
"
1959-01-01
\n",
"
314.67
\n",
"
313.33
\n",
"
312.66
\n",
"
313.20
\n",
"
\n",
"
\n",
"
11
\n",
"
316.38
\n",
"
315.88
\n",
"
1959-02-01
\n",
"
315.62
\n",
"
314.67
\n",
"
313.33
\n",
"
312.66
\n",
"
\n",
"
\n",
"
12
\n",
"
316.71
\n",
"
315.62
\n",
"
1959-03-01
\n",
"
316.38
\n",
"
315.62
\n",
"
314.67
\n",
"
313.33
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" interpolated trend date lag_1 lag_2 lag_3 lag_4\n",
"8 313.33 315.31 1958-11-01 312.66 313.20 314.93 315.86\n",
"9 314.67 315.61 1958-12-01 313.33 312.66 313.20 314.93\n",
"10 315.62 315.70 1959-01-01 314.67 313.33 312.66 313.20\n",
"11 316.38 315.88 1959-02-01 315.62 314.67 313.33 312.66\n",
"12 316.71 315.62 1959-03-01 316.38 315.62 314.67 313.33"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"date\"] = pd.to_datetime(df[[\"year\", \"month\", \"day\"]])\n",
"df.drop([\"year\", \"month\", \"decimal date\", \"average\", \"ndays\", \"day\"], inplace=True, axis=1)\n",
"\n",
"# Create 2 month lag features\n",
"for i in range(1, 5):\n",
" df[\"lag_{}\".format(i)] = df[\"interpolated\"].shift(i)\n",
"\n",
"df = df.iloc[8:]\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plot the data\n",
"\n",
"By plotting the data (displayed below), you can observe that it follows an upwards trend. Note that randomly partitioning the data for testing purposes would not work and you would not get representative accuracy metrics."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.lineplot(x=\"date\", y=\"interpolated\", data=df)"
]
},
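{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of the trend on a random split can be sketched with pandas alone. The cell below is an illustrative comparison, not part of the DataRobot workflow; the 20% test fraction and the variable names are assumptions made for this sketch. It compares a random sample of rows with the most recent rows and counts how many training rows are dated after the earliest randomly selected test row."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: the 20% test fraction is an arbitrary assumption\n",
"n_test = int(len(df) * 0.2)\n",
"\n",
"# Random partition: test rows are scattered across the whole date range\n",
"random_test = df.sample(n=n_test, random_state=0)\n",
"random_train = df.drop(random_test.index)\n",
"\n",
"# Out-of-time partition: test rows are the most recent observations\n",
"recent_test = df.sort_values(\"date\").tail(n_test)\n",
"\n",
"print(\"Mean CO2, random test set:\", round(random_test[\"interpolated\"].mean(), 2))\n",
"print(\"Mean CO2, most recent test set:\", round(recent_test[\"interpolated\"].mean(), 2))\n",
"\n",
"# With a random split, many training rows are dated after the earliest test row\n",
"later_rows = (random_train[\"date\"] > random_test[\"date\"].min()).sum()\n",
"print(\"Training rows dated after the earliest random test row:\", later_rows)"
]
},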
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define datetime partitioning\n",
"\n",
"Use the snippet below to define datetime partitioning for the data. You can also reference [a more complete example of Datetime Partitioning](https://github.com/datarobot-community/examples-for-data-scientists/blob/master/Advanced%20Tuning%20and%20Partitioning/Python/Datetime%20Partitioning.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"spec = dr.DatetimePartitioningSpecification(\n",
" datetime_partition_column=\"date\", number_of_backtests=4, use_time_series=False\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start the project\n",
"The snippet below passes the `spec` object as an input to the `partitioning_method` variable in the `set_target` method. This starts the project with the designated settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"project = dr.Project.create(df, project_name=\"Predicting CO2 levels for Mauna Loa\")\n",
"\n",
"project.set_target(\"interpolated\", partitioning_method=spec, worker_count=-1)\n",
"project.wait_for_autopilot()"
]
},
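{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check how the backtests were laid out, you can retrieve the partitioning that DataRobot applied to the project. The cell below is a minimal sketch that assumes the installed client version provides `dr.DatetimePartitioning.get` and the backtest attributes used here; adjust it if your version differs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve the datetime partitioning applied to the project (assumes the call is available in your client version)\n",
"partitioning = dr.DatetimePartitioning.get(project.id)\n",
"\n",
"# Print the index and validation window of each backtest\n",
"for backtest in partitioning.backtests:\n",
"    print(backtest.index, backtest.validation_start_date, backtest.validation_end_date)"
]
},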
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Access insights\n",
"\n",
"All model insights are available via the API. The example below displays Feature Impact calculated for one of the trained models.\n",
"\n",
"Access more examples and sample code for extracting insights from [the DataRobot Community]( https://github.com/datarobot-community/examples-for-data-scientists/tree/master/Model%20Evaluation/Python)."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"model = project.get_models()[0]\n",
"\n",
"# Get Feature Impact\n",
"feature_impact = model.get_or_request_feature_impact()\n",
"\n",
"# Save feature impact in pandas dataframe\n",
"fi_df = pd.DataFrame(feature_impact)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots(figsize=(12, 5))\n",
"\n",
"# Plot feature impact\n",
"sns.barplot(x=\"featureName\", y=\"impactNormalized\", data=fi_df[0:5], color=\"g\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}