{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loan default notebook\n",
"\n",
"Many credit decisioning systems are driven by scorecards, which are very simplistic rules-based systems. These are built by end-user organizations through industry knowledge or through simple statistical systems. Some organizations go a step further and obtain scorecards from third parties which may not be customized for an individual organization’s book. An AI-based approach can help financial institutions learn signals from their own book and assess risk at a more granular level. Once the risk is calculated, a strategy may be implemented to use this information for interventions. If you can predict someone is going to default, this may lead to intervention steps such as sending earlier notices or rejecting loan applications.\n",
"\n",
"## Setup\n",
"\n",
"This example assumes that the DataRobot Python client package has been installed and configured with the credentials of a DataRobot user with API access permissions. Retrieve your DataRobot API Token by logging into DataRobot and navigating to the Developer Tools in your profile.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"!pip install datarobot umap-learn nbformat hdbscan\n",
"\n",
"import datarobot as dr\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.ticker as mtick\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"light_blue = \"#598fd6\"\n",
"grey_blue = \"#5f728b\"\n",
"orange = \"#dd6b3d\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Connect to DataRobot\n",
"\n",
"Read more about different options for [connecting to DataRobot from the client](https://docs.datarobot.com/en/docs/api/api-quickstart/api-qs.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call\n",
"# dr.Client(config_path='path-to-drconfig.yaml')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import data\n",
"\n",
"The data file is hosted by DataRobot using the URL in the following cell. Read in the data directly from the URL into a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and display the results to verify all of the data looks correct. If you have your own data files you can access that data in several ways."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"autoscroll": "auto"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
id
\n",
"
member_id
\n",
"
loan_amnt
\n",
"
funded_amnt
\n",
"
installment
\n",
"
grade
\n",
"
sub_grade
\n",
"
emp_title
\n",
"
emp_length
\n",
"
home_ownership
\n",
"
...
\n",
"
revol_util
\n",
"
total_acc
\n",
"
initial_list_status
\n",
"
collections_12_mths_ex_med
\n",
"
mths_since_last_major_derog
\n",
"
application_type
\n",
"
acc_now_delinq
\n",
"
tot_coll_amt
\n",
"
tot_cur_bal
\n",
"
is_bad
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
3296446
\n",
"
4068857
\n",
"
11200
\n",
"
11200
\n",
"
343.89
\n",
"
A
\n",
"
A2
\n",
"
Nokia Siemens Network
\n",
"
10.0
\n",
"
OWN
\n",
"
...
\n",
"
66.20%
\n",
"
21
\n",
"
f
\n",
"
0
\n",
"
NaN
\n",
"
INDIVIDUAL
\n",
"
0
\n",
"
0.0
\n",
"
187717.0
\n",
"
False
\n",
"
\n",
"
\n",
"
1
\n",
"
3286412
\n",
"
4058853
\n",
"
10000
\n",
"
10000
\n",
"
328.06
\n",
"
B
\n",
"
B2
\n",
"
creative financial group
\n",
"
2.0
\n",
"
MORTGAGE
\n",
"
...
\n",
"
74.20%
\n",
"
11
\n",
"
f
\n",
"
0
\n",
"
NaN
\n",
"
INDIVIDUAL
\n",
"
0
\n",
"
0.0
\n",
"
16623.0
\n",
"
True
\n",
"
\n",
"
\n",
"
2
\n",
"
3286406
\n",
"
4058848
\n",
"
8000
\n",
"
8000
\n",
"
282.41
\n",
"
C
\n",
"
C4
\n",
"
Techtron Systems
\n",
"
7.0
\n",
"
RENT
\n",
"
...
\n",
"
72%
\n",
"
17
\n",
"
w
\n",
"
0
\n",
"
NaN
\n",
"
INDIVIDUAL
\n",
"
0
\n",
"
0.0
\n",
"
17938.0
\n",
"
False
\n",
"
\n",
"
\n",
"
3
\n",
"
3296434
\n",
"
4068843
\n",
"
16000
\n",
"
16000
\n",
"
500.65
\n",
"
A
\n",
"
A4
\n",
"
Bristol Hospital
\n",
"
10.0
\n",
"
MORTGAGE
\n",
"
...
\n",
"
75.20%
\n",
"
56
\n",
"
f
\n",
"
0
\n",
"
NaN
\n",
"
INDIVIDUAL
\n",
"
0
\n",
"
0.0
\n",
"
372771.0
\n",
"
False
\n",
"
\n",
"
\n",
"
4
\n",
"
3286395
\n",
"
4058836
\n",
"
4000
\n",
"
4000
\n",
"
125.17
\n",
"
A
\n",
"
A4
\n",
"
Aspen Skiing Company
\n",
"
10.0
\n",
"
MORTGAGE
\n",
"
...
\n",
"
95.50%
\n",
"
21
\n",
"
w
\n",
"
0
\n",
"
NaN
\n",
"
INDIVIDUAL
\n",
"
0
\n",
"
0.0
\n",
"
331205.0
\n",
"
False
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
95
\n",
"
3286021
\n",
"
4058369
\n",
"
4000
\n",
"
4000
\n",
"
146.12
\n",
"
D
\n",
"
D3
\n",
"
Morton Plant Hospital
\n",
"
6.0
\n",
"
MORTGAGE
\n",
"
...
\n",
"
84.80%
\n",
"
21
\n",
"
f
\n",
"
0
\n",
"
NaN
\n",
"
INDIVIDUAL
\n",
"
0
\n",
"
0.0
\n",
"
112600.0
\n",
"
False
\n",
"
\n",
"
\n",
"
96
\n",
"
2835185
\n",
"
3417435
\n",
"
8500
\n",
"
8500
\n",
"
264.88
\n",
"
A
\n",
"
A3
\n",
"
The Adocate/Hearst News
\n",
"
5.0
\n",
"
MORTGAGE
\n",
"
...
\n",
"
28.70%
\n",
"
19
\n",
"
w
\n",
"
0
\n",
"
NaN
\n",
"
INDIVIDUAL
\n",
"
0
\n",
"
0.0
\n",
"
112238.0
\n",
"
False
\n",
"
\n",
"
\n",
"
97
\n",
"
3241124
\n",
"
3984059
\n",
"
13000
\n",
"
13000
\n",
"
432.54
\n",
"
B
\n",
"
B3
\n",
"
Southwest ISD
\n",
"
3.0
\n",
"
MORTGAGE
\n",
"
...
\n",
"
57.90%
\n",
"
18
\n",
"
f
\n",
"
0
\n",
"
NaN
\n",
"
INDIVIDUAL
\n",
"
0
\n",
"
202.0
\n",
"
120076.0
\n",
"
False
\n",
"
\n",
"
\n",
"
98
\n",
"
3198040
\n",
"
3930968
\n",
"
20000
\n",
"
20000
\n",
"
608.72
\n",
"
A
\n",
"
A1
\n",
"
west texas a&m university
\n",
"
5.0
\n",
"
MORTGAGE
\n",
"
...
\n",
"
39.90%
\n",
"
27
\n",
"
f
\n",
"
0
\n",
"
NaN
\n",
"
INDIVIDUAL
\n",
"
0
\n",
"
0.0
\n",
"
230748.0
\n",
"
False
\n",
"
\n",
"
\n",
"
99
\n",
"
3188257
\n",
"
3921251
\n",
"
16000
\n",
"
16000
\n",
"
387.40
\n",
"
C
\n",
"
C3
\n",
"
Antea Group
\n",
"
1.0
\n",
"
MORTGAGE
\n",
"
...
\n",
"
54.60%
\n",
"
28
\n",
"
f
\n",
"
0
\n",
"
NaN
\n",
"
INDIVIDUAL
\n",
"
0
\n",
"
0.0
\n",
"
194506.0
\n",
"
False
\n",
"
\n",
" \n",
"
\n",
"
100 rows × 34 columns
\n",
"
"
],
"text/plain": [
" id member_id loan_amnt funded_amnt installment grade sub_grade \\\n",
"0 3296446 4068857 11200 11200 343.89 A A2 \n",
"1 3286412 4058853 10000 10000 328.06 B B2 \n",
"2 3286406 4058848 8000 8000 282.41 C C4 \n",
"3 3296434 4068843 16000 16000 500.65 A A4 \n",
"4 3286395 4058836 4000 4000 125.17 A A4 \n",
".. ... ... ... ... ... ... ... \n",
"95 3286021 4058369 4000 4000 146.12 D D3 \n",
"96 2835185 3417435 8500 8500 264.88 A A3 \n",
"97 3241124 3984059 13000 13000 432.54 B B3 \n",
"98 3198040 3930968 20000 20000 608.72 A A1 \n",
"99 3188257 3921251 16000 16000 387.40 C C3 \n",
"\n",
" emp_title emp_length home_ownership ... revol_util \\\n",
"0 Nokia Siemens Network 10.0 OWN ... 66.20% \n",
"1 creative financial group 2.0 MORTGAGE ... 74.20% \n",
"2 Techtron Systems 7.0 RENT ... 72% \n",
"3 Bristol Hospital 10.0 MORTGAGE ... 75.20% \n",
"4 Aspen Skiing Company 10.0 MORTGAGE ... 95.50% \n",
".. ... ... ... ... ... \n",
"95 Morton Plant Hospital 6.0 MORTGAGE ... 84.80% \n",
"96 The Adocate/Hearst News 5.0 MORTGAGE ... 28.70% \n",
"97 Southwest ISD 3.0 MORTGAGE ... 57.90% \n",
"98 west texas a&m university 5.0 MORTGAGE ... 39.90% \n",
"99 Antea Group 1.0 MORTGAGE ... 54.60% \n",
"\n",
" total_acc initial_list_status collections_12_mths_ex_med \\\n",
"0 21 f 0 \n",
"1 11 f 0 \n",
"2 17 w 0 \n",
"3 56 f 0 \n",
"4 21 w 0 \n",
".. ... ... ... \n",
"95 21 f 0 \n",
"96 19 w 0 \n",
"97 18 f 0 \n",
"98 27 f 0 \n",
"99 28 f 0 \n",
"\n",
" mths_since_last_major_derog application_type acc_now_delinq tot_coll_amt \\\n",
"0 NaN INDIVIDUAL 0 0.0 \n",
"1 NaN INDIVIDUAL 0 0.0 \n",
"2 NaN INDIVIDUAL 0 0.0 \n",
"3 NaN INDIVIDUAL 0 0.0 \n",
"4 NaN INDIVIDUAL 0 0.0 \n",
".. ... ... ... ... \n",
"95 NaN INDIVIDUAL 0 0.0 \n",
"96 NaN INDIVIDUAL 0 0.0 \n",
"97 NaN INDIVIDUAL 0 202.0 \n",
"98 NaN INDIVIDUAL 0 0.0 \n",
"99 NaN INDIVIDUAL 0 0.0 \n",
"\n",
" tot_cur_bal is_bad \n",
"0 187717.0 False \n",
"1 16623.0 True \n",
"2 17938.0 False \n",
"3 372771.0 False \n",
"4 331205.0 False \n",
".. ... ... \n",
"95 112600.0 False \n",
"96 112238.0 False \n",
"97 120076.0 False \n",
"98 230748.0 False \n",
"99 194506.0 False \n",
"\n",
"[100 rows x 34 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_path = \"https://s3.amazonaws.com/datarobot-use-case-datasets/Lending+Club+Dataset+Train.csv\"\n",
"\n",
"pathfinder_df = pd.read_csv(data_path, encoding=\"ISO-8859-1\")\n",
"pathfinder_df.rename(columns={\"loan_is_bad\": \"is_bad\"}, inplace=True)\n",
"\n",
"pathfinder_df.head(100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualize data\n",
"\n",
"Use the following snippets to display unique aspects of the data. The first cell groups the dataframe by average annual income for loans that default and those that do not. The second cell shows how often a loan defaults based on the `emp_length value`. The third cells shows the average default rate for loans for each state."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"autoscroll": "auto"
},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" addr_state is_bad\n",
"0 AK 0.098837\n",
"1 AL 0.188563\n",
"2 AR 0.165803\n",
"3 AZ 0.168299\n",
"4 CA 0.151439\n",
"5 CO 0.138211\n",
"6 CT 0.155527\n",
"7 DC 0.091549\n",
"8 DE 0.157895\n",
"9 FL 0.180175\n",
"10 GA 0.153999\n",
"11 HI 0.192926\n",
"12 IL 0.129351\n",
"13 IN 0.166667\n",
"14 KS 0.127615\n",
"15 KY 0.140515\n",
"16 LA 0.161716\n",
"17 MA 0.150280\n",
"18 MD 0.165319\n",
"19 MI 0.173405\n",
"20 MN 0.138249\n",
"21 MO 0.153558\n",
"22 MT 0.131579\n",
"23 NC 0.147039\n",
"24 NE 1.000000\n",
"25 NH 0.101695\n",
"26 NJ 0.181864\n",
"27 NM 0.160000\n",
"28 NV 0.175192\n",
"29 NY 0.173567\n",
"30 OH 0.150431\n",
"31 OK 0.163121\n",
"32 OR 0.136858\n",
"33 PA 0.159287\n",
"34 RI 0.200855\n",
"35 SC 0.145000\n",
"36 SD 0.104839\n",
"37 TN 0.000000\n",
"38 TX 0.133150\n",
"39 UT 0.161290\n",
"40 VA 0.164849\n",
"41 VT 0.208696\n",
"42 WA 0.159657\n",
"43 WI 0.149385\n",
"44 WV 0.110092\n",
"45 WY 0.103448"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"avg_value_df = pathfinder_df.groupby(\"addr_state\").agg({\"is_bad\": \"mean\"}).reset_index()\n",
"\n",
"avg_value_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initiate modeling\n",
" \n",
"Create a DataRobot project to train models against the assembled dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"EXISTING_PROJECT_ID = None # If you've already created a project, replace None with the ID here\n",
"if EXISTING_PROJECT_ID is None:\n",
" # Create project and pass in data\n",
" project = dr.Project.create(sourcedata=pathfinder_df, project_name=\"Predict loan defaults\")\n",
"\n",
" # Set the project target to the appropriate feature. Use the LogLoss metric to measure performance\n",
" project.set_target(target=\"is_bad\", mode=dr.AUTOPILOT_MODE.QUICK, worker_count=\"-1\")\n",
"else:\n",
" # Fetch the existing project\n",
" project = dr.Project.get(EXISTING_PROJECT_ID)\n",
"\n",
"project.wait_for_autopilot(check_interval=30)\n",
"# Uncomment and replace the project ID if the project already exists\n",
"# project = dr.Project.get(\"612cb904ce5d5617d67af394\")\n",
"\n",
"# Get the project metric (i.e LogLoss, RMSE, etc...)\n",
"metric = project.metric\n",
"\n",
"# Get project URL\n",
"project_url = project.get_leaderboard_ui_permalink()\n",
"\n",
"# Get project ID\n",
"project_id = project.id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### View project in UI\n",
"\n",
"If you want to view any aspects of the project in the DataRobot UI, you can retrieve the URL for the project with the snippet below and use it to navigate to the DataRobot application in your browser."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'https://app.datarobot.com/projects/62cda041ab0bc3275f7a4a86/models'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display project URL\n",
"project_url"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initiate modeling"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"autoscroll": "auto"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In progress: 0, queued: 0 (waited: 0s)\n"
]
}
],
"source": [
"project.wait_for_autopilot(check_interval=30)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate model performance \n",
"\n",
"In order to measure model performance, first select the top model based on a specific performance metric (i.e., `LogLoss`) and then evaluate several different types of charts, such as Lift Chart, ROC Curve, and Feature Importance. There are two helper functions (detailed below) that assist in producing these charts.\n",
"\n",
"You can reference more information about how to evaluate model performance in the [DataRobot platform documentation](https://docs.datarobot.com/en/docs/modeling/analyze-models/index.html).\n",
"\n",
"In the snippet below, use models built during Autopilot to create a list of the top-performing models based on their accuracy.\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"def sorted_by_metric(models, test_set, metric):\n",
" models_with_score = [model for model in models if model.metrics[metric][test_set] is not None]\n",
"\n",
" return sorted(models_with_score, key=lambda model: model.metrics[metric][test_set])"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"autoscroll": "auto"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The top performing model is Model('Elastic-Net Classifier (L2 / Binomial Deviance)') using metric, LogLoss\n"
]
}
],
"source": [
"models = project.get_models()\n",
"\n",
"# Uncomment if this is not set above in the create project cell\n",
"metric = project.metric\n",
"\n",
"# Get the top-performing model\n",
"model_top = sorted_by_metric(models, \"crossValidation\", metric)[0]\n",
"\n",
"print(\n",
" \"\"\"The top performing model is {model} using metric, {metric}\"\"\".format(\n",
" model=str(model_top), metric=metric\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": [
"# Set styling\n",
"dr_dark_blue = \"#08233F\"\n",
"dr_blue = \"#1F77B4\"\n",
"dr_orange = \"#FF7F0E\"\n",
"dr_red = \"#BE3C28\"\n",
"\n",
"# Function to build histograms\n",
"\n",
"\n",
"def rebin_df(raw_df, number_of_bins):\n",
" cols = [\"bin\", \"actual_mean\", \"predicted_mean\", \"bin_weight\"]\n",
" new_df = pd.DataFrame(columns=cols)\n",
" current_prediction_total = 0\n",
" current_actual_total = 0\n",
" current_row_total = 0\n",
" x_index = 1\n",
" bin_size = 60 / number_of_bins\n",
" for rowId, data in raw_df.iterrows():\n",
" current_prediction_total += data[\"predicted\"] * data[\"bin_weight\"]\n",
" current_actual_total += data[\"actual\"] * data[\"bin_weight\"]\n",
" current_row_total += data[\"bin_weight\"]\n",
"\n",
" if (rowId + 1) % bin_size == 0:\n",
" x_index += 1\n",
" bin_properties = {\n",
" \"bin\": ((round(rowId + 1) / 60) * number_of_bins),\n",
" \"actual_mean\": current_actual_total / current_row_total,\n",
" \"predicted_mean\": current_prediction_total / current_row_total,\n",
" \"bin_weight\": current_row_total,\n",
" }\n",
"\n",
" new_df = new_df.append(bin_properties, ignore_index=True)\n",
" current_prediction_total = 0\n",
" current_actual_total = 0\n",
" current_row_total = 0\n",
" return new_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lift chart\n",
"\n",
"A [lift chart](https://docs.datarobot.com/en/docs/modeling/analyze-models/evaluate/lift-chart.html#lift-chart) shows you how close model predictions are to the actual values of the target in the training data. The lift chart data includes the average predicted value and the average actual values of the target, sorted by the prediction values in ascending order and split into up to 60 bins.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Function to build lift charts\n",
"\n",
"\n",
"def matplotlib_lift(bins_df, bin_count, ax):\n",
" grouped = rebin_df(bins_df, bin_count)\n",
" ax.plot(\n",
" range(1, len(grouped) + 1),\n",
" grouped[\"predicted_mean\"],\n",
" marker=\"+\",\n",
" lw=1,\n",
" color=dr_blue,\n",
" label=\"predicted\",\n",
" )\n",
" ax.plot(\n",
" range(1, len(grouped) + 1),\n",
" grouped[\"actual_mean\"],\n",
" marker=\"*\",\n",
" lw=1,\n",
" color=dr_orange,\n",
" label=\"actual\",\n",
" )\n",
" ax.set_xlim([0, len(grouped) + 1])\n",
" ax.set_facecolor(dr_dark_blue)\n",
" ax.legend(loc=\"best\")\n",
" ax.set_title(\"Lift chart {} bins\".format(bin_count))\n",
" ax.set_xlabel(\"Sorted Prediction\")\n",
" ax.set_ylabel(\"Value\")\n",
" return grouped"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"autoscroll": "auto"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"No handles with labels found to put in legend.\n",
"No handles with labels found to put in legend.\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"lift_chart = model_top.get_lift_chart(\"validation\")\n",
"\n",
"# Save the result into a Pandas dataframe\n",
"lift_df = pd.DataFrame(lift_chart.bins)\n",
"\n",
"bin_counts = [10, 15]\n",
"f, axarr = plt.subplots(len(bin_counts))\n",
"f.set_size_inches((8, 4 * len(bin_counts)))\n",
"\n",
"rebinned_dfs = []\n",
"for i in range(len(bin_counts)):\n",
" rebinned_dfs.append(matplotlib_lift(lift_df, bin_counts[i], axarr[i]))\n",
"\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ROC Curve\n",
"\n",
"The receiver operating characteristic curve, or [ROC curve](https://docs.datarobot.com/en/docs/modeling/analyze-models/evaluate/roc-curve-tab/roc-curve.html#roc-curve), is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"autoscroll": "auto"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
accuracy
\n",
"
f1_score
\n",
"
false_negative_score
\n",
"
true_negative_score
\n",
"
true_positive_score
\n",
"
false_positive_score
\n",
"
true_negative_rate
\n",
"
false_positive_rate
\n",
"
true_positive_rate
\n",
"
matthews_correlation_coefficient
\n",
"
positive_predictive_value
\n",
"
negative_predictive_value
\n",
"
threshold
\n",
"
fraction_predicted_as_positive
\n",
"
fraction_predicted_as_negative
\n",
"
lift_positive
\n",
"
lift_negative
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
0.843687
\n",
"
0.000000
\n",
"
1238
\n",
"
6682
\n",
"
0
\n",
"
0
\n",
"
1.000000
\n",
"
0.000000
\n",
"
0.000000
\n",
"
0.000000
\n",
"
0.000000
\n",
"
0.843687
\n",
"
1.000000
\n",
"
0.000000
\n",
"
1.000000
\n",
"
0.000000
\n",
"
1.000000
\n",
"
\n",
"
\n",
"
1
\n",
"
0.844066
\n",
"
0.004835
\n",
"
1235
\n",
"
6682
\n",
"
3
\n",
"
0
\n",
"
1.000000
\n",
"
0.000000
\n",
"
0.002423
\n",
"
0.045224
\n",
"
1.000000
\n",
"
0.844007
\n",
"
0.514413
\n",
"
0.000379
\n",
"
0.999621
\n",
"
6.397415
\n",
"
1.000379
\n",
"
\n",
"
\n",
"
2
\n",
"
0.843939
\n",
"
0.006431
\n",
"
1234
\n",
"
6680
\n",
"
4
\n",
"
2
\n",
"
0.999701
\n",
"
0.000299
\n",
"
0.003231
\n",
"
0.038695
\n",
"
0.666667
\n",
"
0.844074
\n",
"
0.477247
\n",
"
0.000758
\n",
"
0.999242
\n",
"
4.264943
\n",
"
1.000459
\n",
"
\n",
"
\n",
"
3
\n",
"
0.844318
\n",
"
0.014388
\n",
"
1229
\n",
"
6678
\n",
"
9
\n",
"
4
\n",
"
0.999401
\n",
"
0.000599
\n",
"
0.007270
\n",
"
0.059846
\n",
"
0.692308
\n",
"
0.844568
\n",
"
0.449702
\n",
"
0.001641
\n",
"
0.998359
\n",
"
4.428980
\n",
"
1.001045
\n",
"
\n",
"
\n",
"
4
\n",
"
0.844444
\n",
"
0.025316
\n",
"
1222
\n",
"
6672
\n",
"
16
\n",
"
10
\n",
"
0.998503
\n",
"
0.001497
\n",
"
0.012924
\n",
"
0.072549
\n",
"
0.615385
\n",
"
0.845199
\n",
"
0.423901
\n",
"
0.003283
\n",
"
0.996717
\n",
"
3.936871
\n",
"
1.001792
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
110
\n",
"
0.184848
\n",
"
0.276071
\n",
"
7
\n",
"
233
\n",
"
1231
\n",
"
6449
\n",
"
0.034870
\n",
"
0.965130
\n",
"
0.994346
\n",
"
0.061893
\n",
"
0.160286
\n",
"
0.970833
\n",
"
0.043229
\n",
"
0.969697
\n",
"
0.030303
\n",
"
1.025419
\n",
"
1.150703
\n",
"
\n",
"
\n",
"
111
\n",
"
0.175253
\n",
"
0.274061
\n",
"
5
\n",
"
155
\n",
"
1233
\n",
"
6527
\n",
"
0.023197
\n",
"
0.976803
\n",
"
0.995961
\n",
"
0.049450
\n",
"
0.158892
\n",
"
0.968750
\n",
"
0.039698
\n",
"
0.979798
\n",
"
0.020202
\n",
"
1.016497
\n",
"
1.148234
\n",
"
\n",
"
\n",
"
112
\n",
"
0.165657
\n",
"
0.272086
\n",
"
3
\n",
"
77
\n",
"
1235
\n",
"
6605
\n",
"
0.011523
\n",
"
0.988477
\n",
"
0.997577
\n",
"
0.033049
\n",
"
0.157526
\n",
"
0.962500
\n",
"
0.033700
\n",
"
0.989899
\n",
"
0.010101
\n",
"
1.007756
\n",
"
1.140826
\n",
"
\n",
"
\n",
"
113
\n",
"
0.156439
\n",
"
0.270394
\n",
"
0
\n",
"
1
\n",
"
1238
\n",
"
6681
\n",
"
0.000150
\n",
"
0.999850
\n",
"
1.000000
\n",
"
0.004837
\n",
"
0.156333
\n",
"
1.000000
\n",
"
0.019673
\n",
"
0.999874
\n",
"
0.000126
\n",
"
1.000126
\n",
"
1.185274
\n",
"
\n",
"
\n",
"
114
\n",
"
0.156313
\n",
"
0.270365
\n",
"
0
\n",
"
0
\n",
"
1238
\n",
"
6682
\n",
"
0.000000
\n",
"
1.000000
\n",
"
1.000000
\n",
"
0.000000
\n",
"
0.156313
\n",
"
0.000000
\n",
"
0.000044
\n",
"
1.000000
\n",
"
0.000000
\n",
"
1.000000
\n",
"
0.000000
\n",
"
\n",
" \n",
"
\n",
"
115 rows × 17 columns
\n",
"
"
],
"text/plain": [
" accuracy f1_score false_negative_score true_negative_score \\\n",
"0 0.843687 0.000000 1238 6682 \n",
"1 0.844066 0.004835 1235 6682 \n",
"2 0.843939 0.006431 1234 6680 \n",
"3 0.844318 0.014388 1229 6678 \n",
"4 0.844444 0.025316 1222 6672 \n",
".. ... ... ... ... \n",
"110 0.184848 0.276071 7 233 \n",
"111 0.175253 0.274061 5 155 \n",
"112 0.165657 0.272086 3 77 \n",
"113 0.156439 0.270394 0 1 \n",
"114 0.156313 0.270365 0 0 \n",
"\n",
" true_positive_score false_positive_score true_negative_rate \\\n",
"0 0 0 1.000000 \n",
"1 3 0 1.000000 \n",
"2 4 2 0.999701 \n",
"3 9 4 0.999401 \n",
"4 16 10 0.998503 \n",
".. ... ... ... \n",
"110 1231 6449 0.034870 \n",
"111 1233 6527 0.023197 \n",
"112 1235 6605 0.011523 \n",
"113 1238 6681 0.000150 \n",
"114 1238 6682 0.000000 \n",
"\n",
" false_positive_rate true_positive_rate \\\n",
"0 0.000000 0.000000 \n",
"1 0.000000 0.002423 \n",
"2 0.000299 0.003231 \n",
"3 0.000599 0.007270 \n",
"4 0.001497 0.012924 \n",
".. ... ... \n",
"110 0.965130 0.994346 \n",
"111 0.976803 0.995961 \n",
"112 0.988477 0.997577 \n",
"113 0.999850 1.000000 \n",
"114 1.000000 1.000000 \n",
"\n",
" matthews_correlation_coefficient positive_predictive_value \\\n",
"0 0.000000 0.000000 \n",
"1 0.045224 1.000000 \n",
"2 0.038695 0.666667 \n",
"3 0.059846 0.692308 \n",
"4 0.072549 0.615385 \n",
".. ... ... \n",
"110 0.061893 0.160286 \n",
"111 0.049450 0.158892 \n",
"112 0.033049 0.157526 \n",
"113 0.004837 0.156333 \n",
"114 0.000000 0.156313 \n",
"\n",
" negative_predictive_value threshold fraction_predicted_as_positive \\\n",
"0 0.843687 1.000000 0.000000 \n",
"1 0.844007 0.514413 0.000379 \n",
"2 0.844074 0.477247 0.000758 \n",
"3 0.844568 0.449702 0.001641 \n",
"4 0.845199 0.423901 0.003283 \n",
".. ... ... ... \n",
"110 0.970833 0.043229 0.969697 \n",
"111 0.968750 0.039698 0.979798 \n",
"112 0.962500 0.033700 0.989899 \n",
"113 1.000000 0.019673 0.999874 \n",
"114 0.000000 0.000044 1.000000 \n",
"\n",
" fraction_predicted_as_negative lift_positive lift_negative \n",
"0 1.000000 0.000000 1.000000 \n",
"1 0.999621 6.397415 1.000379 \n",
"2 0.999242 4.264943 1.000459 \n",
"3 0.998359 4.428980 1.001045 \n",
"4 0.996717 3.936871 1.001792 \n",
".. ... ... ... \n",
"110 0.030303 1.025419 1.150703 \n",
"111 0.020202 1.016497 1.148234 \n",
"112 0.010101 1.007756 1.140826 \n",
"113 0.000126 1.000126 1.185274 \n",
"114 0.000000 1.000000 0.000000 \n",
"\n",
"[115 rows x 17 columns]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"roc = model_top.get_roc_curve(\"validation\")\n",
"\n",
"# Save the result into a pandas dataframe\n",
"roc_df = pd.DataFrame(roc.roc_points)\n",
"\n",
"roc_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dr_roc_green = \"#03c75f\"\n",
"white = \"#ffffff\"\n",
"dr_purple = \"#65147D\"\n",
"dr_dense_green = \"#018f4f\"\n",
"\n",
"threshold = roc.get_best_f1_threshold()\n",
"fig = plt.figure(figsize=(8, 8))\n",
"axes = fig.add_subplot(1, 1, 1, facecolor=dr_dark_blue)\n",
"\n",
"plt.scatter(roc_df.false_positive_rate, roc_df.true_positive_rate, color=dr_roc_green)\n",
"plt.plot(roc_df.false_positive_rate, roc_df.true_positive_rate, color=dr_roc_green)\n",
"plt.plot([0, 1], [0, 1], color=white, alpha=0.25)\n",
"plt.title(\"ROC curve\")\n",
"plt.xlabel(\"False Positive Rate\")\n",
"plt.xlim([0, 1])\n",
"plt.ylabel(\"True Positive Rate\")\n",
"plt.ylim([0, 1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature Impact\n",
"\n",
"[Feature Impact](https://docs.datarobot.com/en/docs/modeling/analyze-models/understand/feature-impact.html) measures how important a feature is in the context of a model. It measures how much the accuracy of a model would decrease if that feature was removed.\n",
"\n",
"Feature Impact is available for all model types and works by altering input data and observing the effect on a model’s score. It is an on-demand feature, meaning that you must initiate a calculation to see the results. Once DataRobot computes the feature impact for a model, that information is saved with the project.\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"autoscroll": "auto"
},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 1.04, 'Feature Impact')"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"feature_impacts = model_top.get_or_request_feature_impact()\n",
"\n",
"# Limit size to make chart look good. Display top 25 values\n",
"if len(feature_impacts) > 25:\n",
" feature_impacts = feature_impacts[0:24]\n",
"\n",
"# Formats the ticks from a float into a percent\n",
"percent_tick_fmt = mtick.PercentFormatter(xmax=1.0)\n",
"\n",
"impact_df = pd.DataFrame(feature_impacts)\n",
"impact_df.sort_values(by=\"impactNormalized\", ascending=True, inplace=True)\n",
"\n",
"# Positive values are blue, negative are red\n",
"bar_colors = impact_df.impactNormalized.apply(lambda x: dr_red if x < 0 else dr_blue)\n",
"\n",
"ax = impact_df.plot.barh(\n",
" x=\"featureName\", y=\"impactNormalized\", legend=False, color=bar_colors, figsize=(10, 8)\n",
")\n",
"ax.xaxis.set_major_formatter(percent_tick_fmt)\n",
"ax.xaxis.set_tick_params(labeltop=True)\n",
"ax.xaxis.grid(True, alpha=0.2)\n",
"ax.set_facecolor(dr_dark_blue)\n",
"\n",
"plt.ylabel(\"\")\n",
"plt.xlabel(\"Effect\")\n",
"plt.xlim((None, 1)) # Allow for negative impact\n",
"plt.title(\"Feature Impact\", y=1.04);"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Make predictions\n",
"\n",
"### Test predictions\n",
"\n",
"After determining the top-performing model from the Leaderboard, upload the prediction test dataset to verify that the model generates predictions successfully before deploying the model to a production environment. The predictions are returned as a Pandas dataframe. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_path_scoring = (\n",
" \"https://s3.amazonaws.com/datarobot-use-case-datasets/Lending+Club+Dataset+Pred.csv\"\n",
")\n",
"scoring_df = pd.read_csv(data_path_scoring, encoding=\"ISO-8859-1\")\n",
"pathfinder_df.rename(columns={\"loan_is_bad\": \"is_bad\"}, inplace=True)\n",
"\n",
"prediction_dataset = project.upload_dataset(scoring_df)\n",
"predict_job = model_top.request_predictions(prediction_dataset.id)\n",
"prediction_dataset.id\n",
"\n",
"predictions = predict_job.get_result_when_complete()\n",
"pd.concat([scoring_df, predictions], axis=1)\n",
"predictions.positive_probability.plot(kind=\"hist\", title=\"Predicted Probabilities\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deploy a model to production\n",
"\n",
"\n",
"If you are happy with the model's performance, you can deploy it to a production environment with [MLOps](https://docs.datarobot.com/en/mlops/index.html). Deploying the model will free up workers, as data scored through the deployment doesn't use any modeling workers. Furthermore, you are no longer restricted on the amount of data to score; score over 100GB with the deployment. Deployments also offer many model management benefits: monitoring service, data drift, model comparison, retraining, and more."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"autoscroll": "auto"
},
"outputs": [
{
"data": {
"text/plain": [
"Deployment(Late Shipment Predictions)"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Retrieve a prediction server\n",
"prediction_server = dr.PredictionServer.list()[0]\n",
"\n",
"# Get the top performing model. Uncomment if this did not execute in the previous section\n",
"# model_top = sorted_by_metric(models, 'crossValidation', metric)[0]\n",
"deployment = dr.Deployment.create_from_learning_model(\n",
" model_top.id,\n",
" label=\"Predicting Loan Defaults\",\n",
" description=\"Predicting Loan Defaults\",\n",
" default_prediction_server_id=prediction_server.id,\n",
")\n",
"deployment.id"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure batch predictions\n",
"\n",
"After the model has been deployed, DataRobot creates an endpoint for real-time scoring. The deployment allows you to use DataRobot's batch prediction API to score large datasets with a deployed DataRobot model. \n",
"\n",
"The batch prediction API provides flexible intake and output options when scoring large datasets using prediction servers. The API is exposed through the DataRobot Public API and can be consumed using a REST-enabled client or Public API bindings for DataRobot's Python client."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Set the deployment ID"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before proceeding, provide the deployed model's deployment ID (retrieved from the deployment's [Overview tab](https://docs.datarobot.com/en/docs/mlops/monitor/dep-overview.html) or from the Deployment object in the Python client: `deployment.id`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"deployment_id = \"YOUR_DEPLOYMENT_ID\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Determine input and output options\n",
"\n",
"DataRobot's batch prediction API allows you to score data from and to multiple sources. You can take advantage of the credentials and data sources you have already established previously through the UI for easy scoring. Credentials are usernames and passwords, while data sources are any databases with which you have previously established a connection (e.g., Snowflake). View the example code below outlining how to query credentials and data sources.\n",
"\n",
"You can reference the full list of DataRobot's supported [input](https://docs.datarobot.com/en/docs/predictions/batch/batch-prediction-api/intake-options.html) and [output options](https://docs.datarobot.com/en/docs/predictions/batch/batch-prediction-api/output-options.html).\n",
"\n",
"Reference the DataRobot documentation for more information about [data connections](https://docs.datarobot.com/en/docs/data/connect-data/data-conn.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The snippet below shows how you can query all credentials tied to a DataRobot account."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dr.Credential.list()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output above lists the credentials stored for your account. The alphanumeric string included in each item of the list is the credential ID. You can use that ID to reference the credentials through the API."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The snippet below shows how you can query all data sources tied to a DataRobot account. The first line retrieves the list of data stores; the second line prints the alphanumeric ID of the first data store in that list."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5e6696ff820e737a5bd78430\n"
]
}
],
"source": [
"dr.DataStore.list()\n",
"print(dr.DataStore.list()[0].id)"
]
},
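{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both lists contain client objects, so one way to avoid copying IDs by hand is to look an item up by name. The sketch below is illustrative: `find_by_attribute` is a hypothetical helper, not part of the DataRobot client, and `YOUR_CREDENTIAL_NAME` and `YOUR_DATA_STORE_NAME` are placeholders. It assumes that credentials expose `name` and `credential_id` and that data stores expose `canonical_name` and `id`, which may vary with your client version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def find_by_attribute(items, attribute, value):\n",
"    # Hypothetical helper: return the first item whose attribute equals value, or None\n",
"    for item in items:\n",
"        if getattr(item, attribute, None) == value:\n",
"            return item\n",
"    return None\n",
"\n",
"\n",
"if False:  # Change to True after replacing the placeholder names\n",
"    # Assumes Credential objects expose `name` and `credential_id`\n",
"    cred = find_by_attribute(dr.Credential.list(), 'name', 'YOUR_CREDENTIAL_NAME')\n",
"    # Assumes DataStore objects expose `canonical_name` and `id`\n",
"    data_store = find_by_attribute(dr.DataStore.list(), 'canonical_name', 'YOUR_DATA_STORE_NAME')\n",
"    print(cred.credential_id if cred else 'No credential with that name')\n",
"    print(data_store.id if data_store else 'No data store with that name')"
]
},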
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scoring examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The snippets below demonstrate how to score data with the Batch Prediction API. Edit the `intake_settings` and `output_settings` to suit your needs. You can mix and match until you get the outcome you prefer."
]
},
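{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the intake and output settings are plain dictionaries, one way to mix and match them is to build each side with a small helper function. The sketch below is illustrative only: `local_file_intake`, `s3_output`, and the bucket URL are hypothetical names that are not part of the DataRobot client, and the dictionary keys simply mirror the examples in this notebook. Whether a particular intake and output pairing is accepted depends on your DataRobot installation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def local_file_intake(path):\n",
"    # Hypothetical helper: intake settings that read a CSV file from local disk\n",
"    return {'type': 'localFile', 'file': path}\n",
"\n",
"\n",
"def s3_output(url, credential_id):\n",
"    # Hypothetical helper: output settings that write scored rows to an S3 object\n",
"    return {'type': 's3', 'url': url, 'credential_id': credential_id}\n",
"\n",
"\n",
"# Mix and match: local CSV in, S3 out (all values are placeholders)\n",
"intake_settings = local_file_intake('inputfile.csv')\n",
"output_settings = s3_output('s3://your-bucket/scored.csv', 'YOUR_CREDENTIAL_ID_FROM_ABOVE')\n",
"\n",
"if False:  # Change to True once deployment_id and the placeholders are filled in\n",
"    dr.BatchPredictionJob.score(\n",
"        deployment_id,\n",
"        intake_settings=intake_settings,\n",
"        output_settings=output_settings,\n",
"    )"
]
},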
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Score from CSV to CSV"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Scoring without Prediction Explanations\n",
"if False:\n",
" dr.BatchPredictionJob.score(\n",
" deployment_id,\n",
" intake_settings={\n",
" 'type': 'localFile',\n",
" 'file': 'inputfile.csv' # Provide the filepath, Pandas dataframe, or file-like object here\n",
" },\n",
" output_settings={\n",
" 'type': 'localFile',\n",
" 'path': 'outputfile.csv'\n",
" }\n",
" )\n",
"\n",
"# Scoring with Prediction Explanations\n",
"if False:\n",
" dr.BatchPredictionJob.score(\n",
" deployment_id,\n",
" intake_settings={\n",
" 'type': 'localFile',\n",
" 'file': 'inputfile.csv' # Provide the filepath, Pandas dataframe, or file-like object here\n",
" },\n",
" output_settings={\n",
" 'type': 'localFile',\n",
" 'path': 'outputfile.csv'\n",
" },\n",
" max_explanations=3, # Compute Prediction Explanations for at most this many features per row\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Score from S3 to S3"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if False:\n",
" dr.BatchPredictionJob.score(\n",
" deployment_id,\n",
" intake_settings={\n",
" \"type\": \"s3\",\n",
" \"url\": \"s3://theos-test-bucket/lending_club_scoring.csv\", # Provide the S3 URL of your input file here\n",
" \"credential_id\": \"YOUR_CREDENTIAL_ID_FROM_ABOVE\", # Provide the ID of your stored credentials here\n",
" },\n",
" output_settings={\n",
" \"type\": \"s3\",\n",
" \"url\": \"s3://theos-test-bucket/lending_club_scored2.csv\",\n",
" \"credential_id\": \"YOUR_CREDENTIAL_ID_FROM_ABOVE\",\n",
" },\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Score from JDBC to JDBC"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
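"# `data_store` and `cred` are assumed to be objects you retrieved earlier, e.g. with dr.DataStore.get() and dr.Credential.get()\n",
"if False:\n",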
" dr.BatchPredictionJob.score(\n",
" deployment_id,\n",
" intake_settings={\n",
" \"type\": \"jdbc\",\n",
" \"table\": \"table_name\",\n",
" \"schema\": \"public\",\n",
" \"dataStoreId\": data_store.id, # Provide the ID of your datastore here\n",
" \"credentialId\": cred.credential_id, # Provide your credentials here\n",
" },\n",
" output_settings={\n",
" \"type\": \"jdbc\",\n",
" \"table\": \"table_name\",\n",
" \"schema\": \"public\",\n",
" \"statementType\": \"insert\",\n",
" \"dataStoreId\": data_store.id,\n",
" \"credentialId\": cred.credential_id,\n",
" },\n",
" )"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}