{
"cells": [
{
"cell_type": "markdown",
"id": "47f1d548",
"metadata": {},
"source": [
"# Python Programming Language for Data Analysis with Pandas Library\n",
"\n",
"\n",
"**In this workshop we will analyze the data set with information on return and Morningstar Sustainalytics rating of 2000+ ETF from 2015 to 2021 using the `pandas` library.** \n",
"\n",
"Python programming language has numerous functions. These functions are organized into modules. Modules are python scripts that contain Python objects. In order to use the objects inside these modules (or commonly known as libraries), we must first import either the entire module or specific objects in it. \n",
"\n",
"**`pandas` (Python for Data Analysis)** is a popular python library for data analysis. We will use this library extensively in this workshop. This library is automatically installed during Anaconda installation. It is also installed and available on Google Colab or UofT Jupyter Hub. \n",
"\n",
"Other popular python libraries are **`NumPy` (Numerical Python)** and **`matplotlib`** used for scientific computing and visualization, respectively. We will use these libraries too. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "bc3397b2",
"metadata": {},
"outputs": [],
"source": [
"# import required libraries\n",
"import numpy as np\n",
"import pandas as pd\n",
"from matplotlib import pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "9fd7b134",
"metadata": {},
"outputs": [],
"source": [
"# read select columns only\n",
"col_list=['Name', 'Ticker', 'Base Currency', 'Global Broad Category Group',\n",
" 'Fund Size Base Currency', 'Domicile', 'month',\n",
" 'Monthly Return Base Currency', 'Total Market Value(Net) Portfolio Currency', \n",
" 'asset_class', 'Morningstar Rating Overall', 'Carbon Risk Classification',\n",
" 'Portfolio Environmental Risk Score', 'Portfolio Governance Risk Score',\n",
" 'Portfolio Social Risk Score', 'Currency Code', 'Exchange Rate USD']\n",
"etf = pd.read_csv('ETF_Data_final.csv', sep=\",\", header=0, index_col=None, usecols=col_list)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e35fc87b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name Ticker Base_Currency \\\n",
"0 AAF First Priority CLO Bond ETF AAA US Dollar \n",
"1 AAF First Priority CLO Bond ETF AAA US Dollar \n",
"2 AAF First Priority CLO Bond ETF AAA US Dollar \n",
"3 AAF First Priority CLO Bond ETF AAA US Dollar \n",
"4 AAF First Priority CLO Bond ETF AAA US Dollar \n",
"\n",
" Global_Broad_Category_Group Fund_Size_Base_Currency Domicile \\\n",
"0 Fixed Income 10002701.0 United States \n",
"1 Fixed Income 10002701.0 United States \n",
"2 Fixed Income 10002701.0 United States \n",
"3 Fixed Income 10002701.0 United States \n",
"4 Fixed Income 10002701.0 United States \n",
"\n",
" Month Monthly_Return_Base_Currency \\\n",
"0 2020-10-01 0.00281 \n",
"1 2020-11-01 0.61405 \n",
"2 2020-12-01 0.27885 \n",
"3 2021-01-01 0.43888 \n",
"4 2021-02-01 -0.13515 \n",
"\n",
" Total_Market_Value_Portfolio_Currency Asset_Class \\\n",
"0 9563230.0 bond \n",
"1 9615667.0 bond \n",
"2 9633997.0 bond \n",
"3 9173441.0 bond \n",
"4 9609404.0 bond \n",
"\n",
" Morningstar_Rating_Overall Carbon_Risk_Classification \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Portfolio_Environmental_Risk_Score Portfolio_Governance_Risk_Score \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Portfolio_Social_Risk_Score Currency_Code Exchange_Rate_USD \n",
"0 NaN USD 1.0 \n",
"1 NaN USD 1.0 \n",
"2 NaN USD 1.0 \n",
"3 NaN USD 1.0 \n",
"4 NaN USD 1.0 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# change column names\n",
"etf.columns = ['Name', 'Ticker', 'Base_Currency', 'Global_Broad_Category_Group',\n",
" 'Fund_Size_Base_Currency', 'Domicile', 'Month',\n",
" 'Monthly_Return_Base_Currency', 'Total_Market_Value_Portfolio_Currency', \n",
" 'Asset_Class', 'Morningstar_Rating_Overall', 'Carbon_Risk_Classification',\n",
" 'Portfolio_Environmental_Risk_Score', 'Portfolio_Governance_Risk_Score',\n",
" 'Portfolio_Social_Risk_Score', 'Currency_Code', 'Exchange_Rate_USD']\n",
"etf.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c3990717",
"metadata": {},
"outputs": [],
"source": [
"# convert to datetime object and create month and year columns\n",
"etf['date'] = pd.to_datetime(etf['Month'],format='%Y-%m-%d', errors='coerce')\n",
"etf['month'] = etf['date'].dt.month\n",
"etf['year'] = etf['date'].dt.year\n",
"\n",
"# convert to USD currency\n",
"etf['monthly_return_USD'] = etf['Monthly_Return_Base_Currency']*etf['Exchange_Rate_USD']\n",
"etf['fund_size_USD'] = etf['Fund_Size_Base_Currency']*etf['Exchange_Rate_USD']\n",
"\n",
"# convert to USD currency and to millions\n",
"etf['total_market_value_USD'] = etf['Total_Market_Value_Portfolio_Currency']*etf['Exchange_Rate_USD']\n",
"etf['total_mkt_val_net_USD_millions'] = etf['total_market_value_USD']/1000000\n",
"\n",
"# create primary key based on ticker and currency code\n",
"etf['primary_key'] = etf['Ticker'] + etf['Currency_Code']"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a38b1cfc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(133999, 25)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# drop any rows that have missing value in the returns or market value columns\n",
"etf.dropna(subset=['fund_size_USD', 'monthly_return_USD','total_mkt_val_net_USD_millions'], inplace=True)\n",
"etf.shape"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "d9676ea6",
"metadata": {},
"outputs": [],
"source": [
"# drop any columns with duplicate data\n",
"dup_col = ['Name', 'Ticker', 'Base_Currency', 'Fund_Size_Base_Currency', 'Month', 'Monthly_Return_Base_Currency',\n",
" 'Total_Market_Value_Portfolio_Currency', 'Currency_Code', 'Exchange_Rate_USD', 'date', 'total_market_value_USD']\n",
"other_col = ['fund_size_USD', 'total_mkt_val_net_USD_millions']\n",
"etf.drop(dup_col+other_col, axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ecd0d9e6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 133999 entries, 0 to 140228\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Global_Broad_Category_Group 133999 non-null object \n",
" 1 Domicile 133999 non-null object \n",
" 2 Asset_Class 133999 non-null object \n",
" 3 Morningstar_Rating_Overall 55007 non-null float64\n",
" 4 Carbon_Risk_Classification 45990 non-null object \n",
" 5 Portfolio_Environmental_Risk_Score 61783 non-null float64\n",
" 6 Portfolio_Governance_Risk_Score 61783 non-null float64\n",
" 7 Portfolio_Social_Risk_Score 61783 non-null float64\n",
" 8 month 133999 non-null int64 \n",
" 9 year 133999 non-null int64 \n",
" 10 monthly_return_USD 133999 non-null float64\n",
" 11 primary_key 133999 non-null object \n",
"dtypes: float64(5), int64(2), object(5)\n",
"memory usage: 13.3+ MB\n"
]
}
],
"source": [
"etf.info()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0e3bd735",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([nan, 2., 3., 4., 1., 5.])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"etf['Morningstar_Rating_Overall'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "2b4d2fb0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([nan, 'Medium Risk', 'Low Risk', 'Negligible Risk', 'High Risk'],\n",
" dtype=object)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"etf['Carbon_Risk_Classification'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "51a298e9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Portfolio_Environmental_Risk_Score\n",
"min 0.000000\n",
"mean 16.984573\n",
"max 66.590000\n",
"median 5.370000\n",
"Name: Portfolio_Environmental_Risk_Score, dtype: float64\n",
"\n",
"Portfolio_Governance_Risk_Score\n",
"min 0.000000\n",
"mean 18.545082\n",
"max 67.910000\n",
"median 7.840000\n",
"Name: Portfolio_Governance_Risk_Score, dtype: float64\n",
"\n",
"Portfolio_Social_Risk_Score\n",
"min 0.000000\n",
"mean 19.931538\n",
"max 65.550000\n",
"median 10.240000\n",
"Name: Portfolio_Social_Risk_Score, dtype: float64\n"
]
}
],
"source": [
"risk_scores = ['Portfolio_Environmental_Risk_Score', 'Portfolio_Governance_Risk_Score', 'Portfolio_Social_Risk_Score']\n",
"for i in risk_scores:\n",
" print(f'\\n{i}')\n",
" print(etf[i].agg(['min', 'mean', 'max', 'median']))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "042a069f",
"metadata": {},
"outputs": [],
"source": [
"# convert missing rating to its own category\n",
"etf['Morningstar_Rating_Overall'].fillna(6, inplace=True)\n",
"etf['Carbon_Risk_Classification'].fillna('Unknown Risk', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "c1226db4",
"metadata": {},
"outputs": [],
"source": [
"# replace risk_score with median value\n",
"risk_scores = ['Portfolio_Environmental_Risk_Score', 'Portfolio_Governance_Risk_Score', 'Portfolio_Social_Risk_Score']\n",
"for i in risk_scores:\n",
" etf[i].fillna(etf[i].median(), inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "847d430a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 133999 entries, 0 to 140228\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Global_Broad_Category_Group 133999 non-null object \n",
" 1 Domicile 133999 non-null object \n",
" 2 Asset_Class 133999 non-null object \n",
" 3 Morningstar_Rating_Overall 133999 non-null float64\n",
" 4 Carbon_Risk_Classification 133999 non-null object \n",
" 5 Portfolio_Environmental_Risk_Score 133999 non-null float64\n",
" 6 Portfolio_Governance_Risk_Score 133999 non-null float64\n",
" 7 Portfolio_Social_Risk_Score 133999 non-null float64\n",
" 8 month 133999 non-null int64 \n",
" 9 year 133999 non-null int64 \n",
" 10 monthly_return_USD 133999 non-null float64\n",
" 11 primary_key 133999 non-null object \n",
"dtypes: float64(5), int64(2), object(5)\n",
"memory usage: 13.3+ MB\n"
]
}
],
"source": [
"etf.info()"
]
},
{
"cell_type": "markdown",
"id": "530851a5",
"metadata": {},
"source": [
"## 5. Dealing with Categorical Data"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "78ce57a5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" month year\n",
"0 10 2020\n",
"1 11 2020\n",
"2 12 2020\n",
"3 1 2021\n",
"4 2 2021"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_date = etf[['month','year']].apply(lambda x : pd.Categorical(x))\n",
"cat_date.head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "c8d55913",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 133999 entries, 0 to 140228\n",
"Data columns (total 2 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 month 133999 non-null category\n",
" 1 year 133999 non-null category\n",
"dtypes: category(2)\n",
"memory usage: 1.3 MB\n"
]
}
],
"source": [
"cat_date.info()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "ca0c276c",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 133999 entries, 0 to 140228\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Global_Broad_Category_Group 133999 non-null object \n",
" 1 Domicile 133999 non-null object \n",
" 2 Asset_Class 133999 non-null object \n",
" 3 Morningstar_Rating_Overall 133999 non-null float64 \n",
" 4 Carbon_Risk_Classification 133999 non-null object \n",
" 5 Portfolio_Environmental_Risk_Score 133999 non-null float64 \n",
" 6 Portfolio_Governance_Risk_Score 133999 non-null float64 \n",
" 7 Portfolio_Social_Risk_Score 133999 non-null float64 \n",
" 8 month 133999 non-null category\n",
" 9 year 133999 non-null category\n",
" 10 monthly_return_USD 133999 non-null float64 \n",
" 11 primary_key 133999 non-null object \n",
"dtypes: category(2), float64(5), object(5)\n",
"memory usage: 11.5+ MB\n"
]
}
],
"source": [
"etf[['month','year']] = etf[['month','year']].apply(lambda x : pd.Categorical(x))\n",
"etf.info()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "065b2ef2",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Global_Broad_Category_Group', 'Domicile', 'Asset_Class',\n",
" 'Carbon_Risk_Classification', 'month', 'year', 'primary_key'],\n",
" dtype='object')"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_cols = etf.select_dtypes(include=['object', 'category']).columns\n",
"cat_cols"
]
},
{
"cell_type": "markdown",
"id": "21e626d4",
"metadata": {},
"source": [
"### 5a. Creating Dummy Variables\n",
"\n",
"What are the ways to convert categorical feature into numeric values? There are many. One popular method is to create a new column for each categorical variable and fill in value 1 for the specified routename column if the row belongs to that routename. This process is also called creating dummy variable.\n",
"\n",
"Pandas has an easy way to create dummy variables using the `get_dummies` method. Let's apply that on a toy dataset to see what transformation is taking place."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "5832461c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" fruits quantity\n",
"0 apple 2\n",
"1 orange 5"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create toy data set with categorica and numeric features\n",
"test_data = pd.DataFrame( [['apple', 2], ['orange', 5]], columns=['fruits', 'quantity'])\n",
"test_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "52fb45bc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" apple orange\n",
"0 1 0\n",
"1 0 1"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create dummy variables for a categorical column\n",
"pd.get_dummies(test_data['fruits'])"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "3a12f438",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" apple orange quantity\n",
"0 1 0 2\n",
"1 0 1 5"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# combine the dummy variables with remaining numeric column of the original data\n",
"pd.concat([pd.get_dummies(test_data['fruits']), test_data['quantity']], axis=1)"
]
},
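{
"cell_type": "markdown",
"id": "a1f2b3c4",
"metadata": {},
"source": [
"As a side note, `get_dummies` can also be applied to the whole `DataFrame` in one step: object and category columns are encoded (with the original column name as a prefix, e.g. `fruits_apple`), while numeric columns such as `quantity` pass through unchanged. The sketch below is equivalent to the manual `get_dummies` plus `concat` above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2c3d4e5",
"metadata": {},
"outputs": [],
"source": [
"# one-step equivalent of get_dummies + concat:\n",
"# the 'fruits' column is encoded and 'quantity' is kept as-is\n",
"pd.get_dummies(test_data)"
]
},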
{
"cell_type": "markdown",
"id": "53155b53",
"metadata": {},
"source": [
"### 5b. One Hot Encoding\n",
"\n",
"The above transformation can also be done via OneHotEncoder function available through the preprocessing submodule of sklearn. This method is easier to use if you have many categorical columns. Here, we will see an example on our toy data set."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "b2f27a25",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import OneHotEncoder"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "087a178e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<2x2 sparse matrix of type ''\n",
"\twith 2 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 1. create an instance of the one hot encoder object\n",
"ohe = OneHotEncoder()\n",
"\n",
"# 2. fit the data to the one hot encoder instance\n",
"ohe.fit_transform(test_data['fruits'].values.reshape(-1,1)) # require the input data to be 2-dimensional"
]
},
{
"cell_type": "markdown",
"id": "76bfcc6a",
"metadata": {},
"source": [
"The resulting object is a sparse matrix, which cannot be combined to a `dataframe`. Therefore, we must first convert it to a numpy `array` for which there is a `toarray` method provided."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "4cffd572",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1., 0.],\n",
" [0., 1.]])"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 3. convert the transformed data to an numpy array \n",
"dummy_fruits = ohe.fit_transform(test_data['fruits'].values.reshape(-1,1)).toarray()\n",
"dummy_fruits"
]
},
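{
"cell_type": "markdown",
"id": "c3d4e5f6",
"metadata": {},
"source": [
"As a side note, the encoder can also be asked to return a dense array directly, skipping the `toarray` step. The parameter name depends on your `sklearn` version (`sparse` in releases around 1.0, `sparse_output` in newer ones), so treat the cell below as a version-dependent sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4e5f6a7",
"metadata": {},
"outputs": [],
"source": [
"# ask OneHotEncoder for a dense array up front\n",
"# (on sklearn >= 1.2 use sparse_output=False instead)\n",
"ohe_dense = OneHotEncoder(sparse=False)\n",
"ohe_dense.fit_transform(test_data['fruits'].values.reshape(-1,1))"
]
},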
{
"cell_type": "markdown",
"id": "0866fae3",
"metadata": {},
"source": [
"Numpy `array` can be easily converted to a pandas `dataframe`."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "b50b7a63",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0 1\n",
"0 1.0 0.0\n",
"1 0.0 1.0"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 4. transform the numpy array to dataframe\n",
"pd.DataFrame(dummy_fruits)"
]
},
{
"cell_type": "markdown",
"id": "ea82d7e9",
"metadata": {},
"source": [
"Note that we are missing the column names, which can make it difficult to know which category the column belongs to. This can be easily retrieved using the `categories_` method of the `OneHotEncoder` object. "
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "c1013666",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" apple orange\n",
"0 1.0 0.0\n",
"1 0.0 1.0"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# while converting to dataframe add column names as well\n",
"pd.DataFrame(dummy_fruits, columns=ohe.categories_)"
]
},
{
"cell_type": "markdown",
"id": "81162e1c",
"metadata": {},
"source": [
"You could also use the `get_feature_names_out` method, which can take a string to add to each column name."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "97892906",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['fruits_apple', 'fruits_orange'], dtype=object)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ohe.get_feature_names_out(['fruits'])"
]
},
{
"cell_type": "markdown",
"id": "3cf0374f",
"metadata": {},
"source": [
"Now, let's try this on our date that have categorical values."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "0b27c7a5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Global_Broad_Category_Group\n",
"7\n",
"Equity 101599\n",
"Fixed Income 14674\n",
"Miscellaneous 13054\n",
"Alternative 2369\n",
"Commodities 1670\n",
"Allocation 445\n",
"Convertibles 188\n",
"Name: Global_Broad_Category_Group, dtype: int64\n",
"\n",
"Domicile\n",
"3\n",
"United States 120118\n",
"Canada 9029\n",
"United Kingdom 4852\n",
"Name: Domicile, dtype: int64\n",
"\n",
"Asset_Class\n",
"4\n",
"equity 113018\n",
"bond 14973\n",
"real_estate 3882\n",
"commodity 2126\n",
"Name: Asset_Class, dtype: int64\n",
"\n",
"Carbon_Risk_Classification\n",
"5\n",
"Unknown Risk 88009\n",
"Low Risk 24391\n",
"Medium Risk 20501\n",
"High Risk 879\n",
"Negligible Risk 219\n",
"Name: Carbon_Risk_Classification, dtype: int64\n",
"\n",
"month\n",
"12\n",
"\n",
"year\n",
"7\n",
"2021 29384\n",
"2020 25313\n",
"2019 22748\n",
"2018 16647\n",
"2017 14821\n",
"2016 13402\n",
"2015 11684\n",
"Name: year, dtype: int64\n",
"\n",
"primary_key\n",
"2572\n"
]
}
],
"source": [
"for i in cat_cols:\n",
" print(f'\\n{i}')\n",
" n_uniq = etf[i].nunique()\n",
" print(n_uniq)\n",
" if n_uniq < 10:\n",
" print(etf[i].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "754bfa59",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(133999, 38)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# transform all categorical columns using one hot encoder except primary key\n",
"ohe = OneHotEncoder()\n",
"cat_data = ohe.fit_transform(etf[cat_cols[:-1]]).toarray() # do not include primary key\n",
"cat_data = pd.DataFrame(cat_data, columns=ohe.get_feature_names_out(cat_cols[:-1]))\n",
"cat_data.index = etf.index\n",
"cat_data.shape"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "b1f8e98a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(133999, 5)"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# drop original categorical columns and add the transformed ones instead\n",
"etf_transformed = etf.drop(cat_cols, axis=1)\n",
"etf_transformed.shape"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "46c9289d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(133999, 43)"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# combine numeric features with transformed categorical features\n",
"etf_transformed = pd.concat([etf_transformed,cat_data], axis=1)\n",
"etf_transformed.shape"
]
},
{
"cell_type": "markdown",
"id": "c961f3ce",
"metadata": {},
"source": [
"### 5c. Encoding Labels"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "1cfcf7c8",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import LabelEncoder"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "1b487895",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([5, 5, 5, ..., 5, 5, 5])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"le = LabelEncoder()\n",
"le.fit_transform(etf['Global_Broad_Category_Group'])"
]
},
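{
"cell_type": "markdown",
"id": "e5f6a7b8",
"metadata": {},
"source": [
"Each category has been replaced by an integer. To check which integer corresponds to which category, we can inspect the encoder's `classes_` attribute: the category at position `k` is the one encoded as the integer `k`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6a7b8c9",
"metadata": {},
"outputs": [],
"source": [
"# the fitted encoder stores the label mapping: classes_[k] is the category encoded as k\n",
"le.classes_"
]
},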
{
"cell_type": "markdown",
"id": "a34101a7",
"metadata": {},
"source": [
"## 6. Data Modelling with Scikit-Learn\n",
"\n",
"There are many ways to model the data and the best way really depends on the problem you are trying to solve. Let's say for example, we want to know if it is possible to use the information available in our transformed data to predict the monthly return.\n",
"\n",
"Prior to modelling the data, the data must be clean and any item in the data must be numeric. This is because machine learning models or statistical models do not take data that have textual values or missing values. So such data must be processed and transformed to some reasonable numeric representation before they are used in modelling. \n",
"\n",
"Since we have already done the data cleaning and processing part, we will begin by separating the variable that will be predicted from the variables that will be use to predict. The former is commonly refered to as target (y) and the latter as features (X)."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "85a0467c",
"metadata": {},
"outputs": [],
"source": [
"X = etf_transformed.drop(['monthly_return_USD'], axis=1)\n",
"y = etf_transformed['monthly_return_USD']"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "fd645cb8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Morningstar_Rating_Overall Portfolio_Environmental_Risk_Score \\\n",
"0 6.0 5.37 \n",
"1 6.0 5.37 \n",
"2 6.0 5.37 \n",
"3 6.0 5.37 \n",
"4 6.0 5.37 \n",
"\n",
" Portfolio_Governance_Risk_Score Portfolio_Social_Risk_Score \\\n",
"0 7.84 10.24 \n",
"1 7.84 10.24 \n",
"2 7.84 10.24 \n",
"3 7.84 10.24 \n",
"4 7.84 10.24 \n",
"\n",
" Global_Broad_Category_Group_Allocation \\\n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"\n",
" Global_Broad_Category_Group_Alternative \\\n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"\n",
" Global_Broad_Category_Group_Commodities \\\n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"\n",
" Global_Broad_Category_Group_Convertibles \\\n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"\n",
" Global_Broad_Category_Group_Equity \\\n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"\n",
" Global_Broad_Category_Group_Fixed Income ... month_10 month_11 \\\n",
"0 1.0 ... 1.0 0.0 \n",
"1 1.0 ... 0.0 1.0 \n",
"2 1.0 ... 0.0 0.0 \n",
"3 1.0 ... 0.0 0.0 \n",
"4 1.0 ... 0.0 0.0 \n",
"\n",
" month_12 year_2015 year_2016 year_2017 year_2018 year_2019 year_2020 \\\n",
"0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 \n",
"1 0.0 0.0 0.0 0.0 0.0 0.0 1.0 \n",
"2 1.0 0.0 0.0 0.0 0.0 0.0 1.0 \n",
"3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
" year_2021 \n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 1.0 \n",
"4 1.0 \n",
"\n",
"[5 rows x 42 columns]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "cae87550",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.00281\n",
"1 0.61405\n",
"2 0.27885\n",
"3 0.43888\n",
"4 -0.13515\n",
"Name: monthly_return_USD, dtype: float64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y[:5]"
]
},
{
"cell_type": "markdown",
"id": "df9a1125",
"metadata": {},
"source": [
"Next, we will split the feature data into training and test set. This is while we use data to train a machine learning model, its performance should be reported on a data that the model has never seen before. This ensures that the model is able to generalize i.e. it has not just learned the training data very well but also some patterns that can help predict future data points. This is very important if we want to use the model in the real world.\n",
"\n",
"Usually a 70-30 or 80-20 split is recommended. In our case, we will keep 30% of the data for testing and train the model with remaining 70% of the data. Let's import the `train_test_split` function available via the `model_selection` submodule of `sklearn` library. This function takes the feature and target data and give us the desired splits of the data. "
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "58902b5c",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split as tts"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "c9911c40",
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = tts(X, # feature data\n",
" y, # target data \n",
" test_size=.30, # size of the test set\n",
" random_state=42)# set a random number to get the exact split next time "
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "32290c1d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(93799, 42) (40200, 42) (93799,) (40200,)\n"
]
}
],
"source": [
"print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)"
]
},
{
"cell_type": "markdown",
"id": "830d38bc",
"metadata": {},
"source": [
"You can see that 20% of the data for both features and target are now held-out as the test set. \n",
"\n",
"### 6a. Linear Regression\n",
"Linear Regression is the most used statistical algorithm. It examines a linear relationship between two or more variables. Linear Regression provides a good background in understanding a whole class of linear models known as Generalized Linear Models (GLM). It also helps understand statistical terms such as cost functions, coefficients and optimization. \n",
"\n",
"The overall objective of Linear Regression is to predict a straight line through the data, such that the vertical distance of each data point is minimal from that line.\n",
"\n",
"Let's now import the `Linear_Regression` model from the `linear_model` submodule of `sklearn`, which we will use to fit our data."
]
},
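{
"cell_type": "markdown",
"id": "a7b8c9d0",
"metadata": {},
"source": [
"As a reminder of the standard formulation (written in general notation, not tied to our data set), for features $x_1, \\dots, x_p$ the model predicts\n",
"\n",
"$$\\hat{y} = \\beta_0 + \\beta_1 x_1 + \\dots + \\beta_p x_p,$$\n",
"\n",
"and ordinary least squares chooses the intercept $\\beta_0$ and coefficients $\\beta_1, \\dots, \\beta_p$ that minimize the sum of squared residuals $\\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2$."
]
},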
{
"cell_type": "code",
"execution_count": 40,
"id": "68cb22b9",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "markdown",
"id": "8a24cb8d",
"metadata": {},
"source": [
"Most of the functions in sklearn can be used in the same way:\n",
"\n",
"1. Create an instance of the object in use.\n",
"2. Use `fit` or `fit_transform` method to fit or transform the data as needed. "
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "78d1e137",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = LinearRegression()\n",
"model.fit(X_train,y_train)"
]
},
{
"cell_type": "markdown",
"id": "a658b154",
"metadata": {},
"source": [
"Now that the data has been fit to the linear model, some model property information will now be available in the `model` object. One of these properties is the `score` method, which returns the coefficient of determination of the prediction, also known as R squared. It is the proportion of the variation in the dependent variable that is predictable form the independent variable. We can input the training data to get this score.\n",
"\n",
"Other properties of interest are the intercept and the coefficients of the linear regression model. We can get these information with the `intercept_` and `coef_` methods respectively."
]
},
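{
"cell_type": "markdown",
"id": "b8c9d0e1",
"metadata": {},
"source": [
"For reference, the standard definition of R squared is\n",
"\n",
"$$R^2 = 1 - \\frac{\\sum_{i}(y_i - \\hat{y}_i)^2}{\\sum_{i}(y_i - \\bar{y})^2},$$\n",
"\n",
"where $\\hat{y}_i$ are the model predictions and $\\bar{y}$ is the mean of the observed targets. A value of 1 means the model explains all of the variation in the target; a value near 0 means it explains almost none."
]
},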
{
"cell_type": "code",
"execution_count": 42,
"id": "0fdbfd3e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.002265433163255448"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# return the coefficient of determination of the prediction\n",
"model.score(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "19960f60",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.034094003604393"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.intercept_"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "23ec669a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-0.02835847, 0.04473452, 0.01876206, -0.0681953 , 0.21158945,\n",
" 1.36047047, -2.92853603, 0.11323771, 1.8329775 , -0.91875368,\n",
" 0.32901458, 3.75975934, -1.32739484, -2.43236449, 0.41342607,\n",
" 2.25771068, -1.28035563, -1.39078112, -0.84891828, 0.48299489,\n",
" -0.45748469, 0.36232662, 0.46108145, 0.75855728, -0.92645283,\n",
" -2.49362126, 2.82413288, -0.47930915, 0.31343101, 0.44857785,\n",
" -0.86410168, -2.1507675 , -0.42698793, 2.73204843, 0.26449291,\n",
" -1.59002016, 1.59605138, 0.51663461, -1.96972057, 0.50276752,\n",
" 0.90567084, 0.03861637])"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.coef_"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "b9461b74",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" feature coef\n",
"10 Global_Broad_Category_Group_Miscellaneous 3.7598\n",
"5 Global_Broad_Category_Group_Alternative -2.9285\n",
"25 month_3 2.8241\n",
"32 month_10 2.7320\n",
"24 month_2 -2.4936\n",
"12 Domicile_United Kingdom -2.4324\n",
"14 Asset_Class_bond 2.2577\n",
"30 month_8 -2.1508\n",
"37 year_2017 -1.9697\n",
"7 Global_Broad_Category_Group_Convertibles 1.8330\n",
"35 year_2015 1.5961\n",
"34 month_12 -1.5900\n",
"16 Asset_Class_equity -1.3908\n",
"4 Global_Broad_Category_Group_Allocation 1.3605\n",
"11 Domicile_Canada -1.3274\n",
"15 Asset_Class_commodity -1.2804\n",
"23 month_1 -0.9265\n",
"8 Global_Broad_Category_Group_Equity -0.9188\n",
"39 year_2019 0.9057\n",
"29 month_7 -0.8641\n",
"17 Asset_Class_real_estate -0.8489\n",
"22 Carbon_Risk_Classification_Unknown Risk 0.7586\n",
"36 year_2016 0.5166\n",
"38 year_2018 0.5028\n",
"18 Carbon_Risk_Classification_High Risk 0.4830\n",
"26 month_4 -0.4793\n",
"21 Carbon_Risk_Classification_Negligible Risk 0.4611\n",
"19 Carbon_Risk_Classification_Low Risk -0.4575\n",
"28 month_6 0.4486\n",
"31 month_9 -0.4270\n",
"13 Domicile_United States 0.4134\n",
"20 Carbon_Risk_Classification_Medium Risk 0.3623\n",
"9 Global_Broad_Category_Group_Fixed Income 0.3290\n",
"27 month_5 0.3134\n",
"33 month_11 0.2645\n",
"3 Portfolio_Social_Risk_Score 0.2116\n",
"6 Global_Broad_Category_Group_Commodities 0.1132\n",
"2 Portfolio_Governance_Risk_Score -0.0682\n",
"0 Morningstar_Rating_Overall 0.0447\n",
"40 year_2020 0.0386\n",
"1 Portfolio_Environmental_Risk_Score 0.0188"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"coefs = {}\n",
"for i,j in zip(X_train.columns, model.coef_[1:]):\n",
" coefs[i] = j\n",
"pd.DataFrame(coefs.items(), columns=['feature','coef']).sort_values(by='coef', key=abs, ascending=False).round(4)"
]
},
{
"cell_type": "markdown",
"id": "6ccef2fb",
"metadata": {},
"source": [
"R Squared value only measures the fit of training data to the model. How well will this model perform on an unseen test data is the next step of evaluation. Regression models often use the mean squared error metric to evaluate the performance on an unseen data. To calculate this we can use `mean_squared_error` function available throuhg the `metrics` submodule of `sklearn`. The function takes the model prediction on a given data and the actual target value for that dataset. Therefore, we first need to generate predictions from our model on the test set using the `predict` method."
]
},
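{
"cell_type": "markdown",
"id": "c9d0e1f2",
"metadata": {},
"source": [
"For reference, the mean squared error over $n$ test points is\n",
"\n",
"$$\\mathrm{MSE} = \\frac{1}{n}\\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2,$$\n",
"\n",
"so lower values indicate predictions that are closer to the observed targets."
]
},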
{
"cell_type": "code",
"execution_count": 46,
"id": "d4e3aac8",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import mean_squared_error as mse"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "5d73d3af",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"863.5539077021615\n"
]
}
],
"source": [
"y_pred_test = model.predict(X_test)\n",
"mse_lrmodel = mse(y_pred_test, y_test)\n",
"print(mse_lrmodel)"
]
},
{
"cell_type": "markdown",
"id": "086bd82e",
"metadata": {},
"source": [
"How do we know this is a good enough value?\n",
"\n",
"In machine learning, we usually have benchmark model against which we can test the performance. In this case we only have one model, so we can create another model and see which one performs better. \n",
"\n",
"We can repeat what we did earlier on another model or we can create a for-loop such that the exact same operation goes through all the models in a list. This latter is obviously better as we do not have to write the same code over and over again. It is also easier from readability perspective."
]
},
{
"cell_type": "markdown",
"id": "fabf0e6f",
"metadata": {},
"source": [
"### 6b. Decision Trees"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "6c124ad3",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.tree import DecisionTreeRegressor"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "cb25e82e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('LR', LinearRegression()), ('DT', DecisionTreeRegressor())]"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# initialize an empty list to add sklearn model objects\n",
"models = []\n",
"\n",
"# add the sklearn model objects to the list one by one\n",
"# while adding the model also give it a name so put the name and model in a tuple\n",
"models.append(('LR', LinearRegression())) \n",
"models.append(('DT', DecisionTreeRegressor())) # Ensemble method - collection of many decision trees\n",
"models"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "b09799aa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LinearRegression()\n",
"DecisionTreeRegressor()\n"
]
},
{
"data": {
"text/plain": [
"{'LR': 863.5539077021615, 'DT': 942.1607929570563}"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scores = {}\n",
"for name, model in models:\n",
" print(model)\n",
" model.fit(X_train, y_train)\n",
" y_pred_test = model.predict(X_test)\n",
" mse_score = mse(y_pred_test, y_test)\n",
" scores[name] = mse_score\n",
"scores"
]
},
{
"cell_type": "markdown",
"id": "b070c863",
"metadata": {},
"source": [
"The Decision Tree Regression model has higher mean squared error than the Linear Regression model and therefore, Linear Regression model is better than the Decision Tree model.\n",
"\n",
"__Note:__ It might also be a good idea to store the fitted model, so that once you can explore more details of the model rather than just the scores. See [here](https://scikit-learn.org/stable/modules/tree.html#tree-regression) to learn more about decision tree model properties and sklearn features available to explore the model details.\n",
"\n",
"Recall earlier we did not use the primary_key column to in our model because there were many categories. If we repeat the above modelling process including the information in this column, are we likely to make a better prediction? Let's check that now."
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "26bb157d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(133999, 2615)"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# transform all categorical columns using one hot encoder including primary key\n",
"ohe = OneHotEncoder()\n",
"cat_data = ohe.fit_transform(etf[cat_cols]).toarray() # do not include primary key\n",
"cat_data = pd.DataFrame(cat_data, columns=ohe.get_feature_names_out(cat_cols))\n",
"cat_data.index = etf.index\n",
"\n",
"# drop original categorical columns and add the transformed ones instead\n",
"etf_transformed = etf.drop(cat_cols, axis=1)\n",
"\n",
"# combine numeric features with transformed categorical features\n",
"etf_transformed = pd.concat([etf_transformed,cat_data], axis=1)\n",
"etf_transformed.shape"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "0b5d782d",
"metadata": {},
"outputs": [],
"source": [
"# create feature and target variables\n",
"X = etf_transformed.drop(['monthly_return_USD'], axis=1)\n",
"y = etf_transformed['monthly_return_USD']\n",
"\n",
"# split data into training and test set\n",
"X_train, X_test, y_train, y_test = tts(X, y, test_size=.30, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "9eb102ba",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LinearRegression()\n",
"DecisionTreeRegressor()\n"
]
},
{
"data": {
"text/plain": [
"{'LR': 553594206705754.5, 'DT': 897.1922295122931}"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# initialize an empty list to add sklearn model objects\n",
"models = []\n",
"\n",
"# add the sklearn model objects to the list one by one\n",
"# while adding the model also give it a name so put the name and model in a tuple\n",
"models.append(('LR', LinearRegression())) \n",
"models.append(('DT', DecisionTreeRegressor())) # Ensemble method - collection of many decision trees\n",
"models\n",
"\n",
"scores = {}\n",
"for name, model in models:\n",
" print(model)\n",
" model.fit(X_train, y_train)\n",
" y_pred_test = model.predict(X_test)\n",
" mse_score = mse(y_pred_test, y_test)\n",
" scores[name] = mse_score\n",
"scores"
]
},
{
"cell_type": "markdown",
"id": "04f6daaa",
"metadata": {},
"source": [
"Adding information in the primary_key column leads to increase in the error of the linear regression model but decreases the same for decision tree model. This is common behavior when comparing a linear and non-linear models with large number of variables. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}