{
"cells": [
{
"cell_type": "markdown",
"id": "47f1d548",
"metadata": {},
"source": [
"# Python Programming Language for Data Analysis with Pandas Library\n",
"\n",
"\n",
"**In this workshop we will analyze the data set with information on return and Morningstar Sustainalytics rating of 2000+ ETF from 2015 to 2021 using the `pandas` library.** \n",
"\n",
"Python programming language has numerous functions. These functions are organized into modules. Modules are python scripts that contain Python objects. In order to use the objects inside these modules (or commonly known as libraries), we must first import either the entire module or specific objects in it. \n",
"\n",
"**`pandas` (Python for Data Analysis)** is a popular python library for data analysis. We will use this library extensively in this workshop. This library is automatically installed during Anaconda installation. It is also installed and available on Google Colab or UofT Jupyter Hub. \n",
"\n",
"Other popular python libraries are **`NumPy` (Numerical Python)** and **`matplotlib`** used for scientific computing and visualization, respectively. We will use these libraries too. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "bc3397b2",
"metadata": {},
"outputs": [],
"source": [
"# import required libraries\n",
"import numpy as np\n",
"import pandas as pd\n",
"from matplotlib import pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "9fd7b134",
"metadata": {},
"outputs": [],
"source": [
"# read select columns only\n",
"col_list=['Name', 'Ticker', 'Base Currency', 'Global Broad Category Group',\n",
" 'Fund Size Base Currency', 'Domicile', 'month',\n",
" 'Monthly Return Base Currency', 'Total Market Value(Net) Portfolio Currency', \n",
" 'asset_class', 'Morningstar Rating Overall', 'Carbon Risk Classification',\n",
" 'Portfolio Environmental Risk Score', 'Portfolio Governance Risk Score',\n",
" 'Portfolio Social Risk Score', 'Currency Code', 'Exchange Rate USD']\n",
"etf = pd.read_csv('ETF_Data_final.csv', sep=\",\", header=0, index_col=None, usecols=col_list)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e35fc87b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name Ticker Base_Currency \\\n",
"0 AAF First Priority CLO Bond ETF AAA US Dollar \n",
"1 AAF First Priority CLO Bond ETF AAA US Dollar \n",
"2 AAF First Priority CLO Bond ETF AAA US Dollar \n",
"3 AAF First Priority CLO Bond ETF AAA US Dollar \n",
"4 AAF First Priority CLO Bond ETF AAA US Dollar \n",
"\n",
" Global_Broad_Category_Group Fund_Size_Base_Currency Domicile \\\n",
"0 Fixed Income 10002701.0 United States \n",
"1 Fixed Income 10002701.0 United States \n",
"2 Fixed Income 10002701.0 United States \n",
"3 Fixed Income 10002701.0 United States \n",
"4 Fixed Income 10002701.0 United States \n",
"\n",
" Month Monthly_Return_Base_Currency \\\n",
"0 2020-10-01 0.00281 \n",
"1 2020-11-01 0.61405 \n",
"2 2020-12-01 0.27885 \n",
"3 2021-01-01 0.43888 \n",
"4 2021-02-01 -0.13515 \n",
"\n",
" Total_Market_Value_Portfolio_Currency Asset_Class \\\n",
"0 9563230.0 bond \n",
"1 9615667.0 bond \n",
"2 9633997.0 bond \n",
"3 9173441.0 bond \n",
"4 9609404.0 bond \n",
"\n",
" Morningstar_Rating_Overall Carbon_Risk_Classification \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Portfolio_Environmental_Risk_Score Portfolio_Governance_Risk_Score \\\n",
"0 NaN NaN \n",
"1 NaN NaN \n",
"2 NaN NaN \n",
"3 NaN NaN \n",
"4 NaN NaN \n",
"\n",
" Portfolio_Social_Risk_Score Currency_Code Exchange_Rate_USD \n",
"0 NaN USD 1.0 \n",
"1 NaN USD 1.0 \n",
"2 NaN USD 1.0 \n",
"3 NaN USD 1.0 \n",
"4 NaN USD 1.0 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# change column names\n",
"etf.columns = ['Name', 'Ticker', 'Base_Currency', 'Global_Broad_Category_Group',\n",
" 'Fund_Size_Base_Currency', 'Domicile', 'Month',\n",
" 'Monthly_Return_Base_Currency', 'Total_Market_Value_Portfolio_Currency', \n",
" 'Asset_Class', 'Morningstar_Rating_Overall', 'Carbon_Risk_Classification',\n",
" 'Portfolio_Environmental_Risk_Score', 'Portfolio_Governance_Risk_Score',\n",
" 'Portfolio_Social_Risk_Score', 'Currency_Code', 'Exchange_Rate_USD']\n",
"etf.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c3990717",
"metadata": {},
"outputs": [],
"source": [
"# convert to datetime object and create month and year columns\n",
"etf['date'] = pd.to_datetime(etf['Month'],format='%Y-%m-%d', errors='coerce')\n",
"etf['month'] = etf['date'].dt.month\n",
"etf['year'] = etf['date'].dt.year\n",
"\n",
"# convert to USD currency\n",
"etf['monthly_return_USD'] = etf['Monthly_Return_Base_Currency']*etf['Exchange_Rate_USD']\n",
"etf['fund_size_USD'] = etf['Fund_Size_Base_Currency']*etf['Exchange_Rate_USD']\n",
"\n",
"# convert to USD currency and to millions\n",
"etf['total_market_value_USD'] = etf['Total_Market_Value_Portfolio_Currency']*etf['Exchange_Rate_USD']\n",
"etf['total_mkt_val_net_USD_millions'] = etf['total_market_value_USD']/1000000\n",
"\n",
"# create primary key based on ticker and currency code\n",
"etf['primary_key'] = etf['Ticker'] + etf['Currency_Code']"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a38b1cfc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(133999, 25)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# drop any rows that have missing value in the returns or market value columns\n",
"etf.dropna(subset=['fund_size_USD', 'monthly_return_USD','total_mkt_val_net_USD_millions'], inplace=True)\n",
"etf.shape"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "d9676ea6",
"metadata": {},
"outputs": [],
"source": [
"# drop any columns with duplicate data\n",
"dup_col = ['Name', 'Ticker', 'Base_Currency', 'Fund_Size_Base_Currency', 'Month', 'Monthly_Return_Base_Currency',\n",
" 'Total_Market_Value_Portfolio_Currency', 'Currency_Code', 'Exchange_Rate_USD', 'date', 'total_market_value_USD']\n",
"other_col = ['fund_size_USD', 'total_mkt_val_net_USD_millions']\n",
"etf.drop(dup_col+other_col, axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ecd0d9e6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 133999 entries, 0 to 140228\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Global_Broad_Category_Group 133999 non-null object \n",
" 1 Domicile 133999 non-null object \n",
" 2 Asset_Class 133999 non-null object \n",
" 3 Morningstar_Rating_Overall 55007 non-null float64\n",
" 4 Carbon_Risk_Classification 45990 non-null object \n",
" 5 Portfolio_Environmental_Risk_Score 61783 non-null float64\n",
" 6 Portfolio_Governance_Risk_Score 61783 non-null float64\n",
" 7 Portfolio_Social_Risk_Score 61783 non-null float64\n",
" 8 month 133999 non-null int64 \n",
" 9 year 133999 non-null int64 \n",
" 10 monthly_return_USD 133999 non-null float64\n",
" 11 primary_key 133999 non-null object \n",
"dtypes: float64(5), int64(2), object(5)\n",
"memory usage: 13.3+ MB\n"
]
}
],
"source": [
"etf.info()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0e3bd735",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([nan, 2., 3., 4., 1., 5.])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"etf['Morningstar_Rating_Overall'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "2b4d2fb0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([nan, 'Medium Risk', 'Low Risk', 'Negligible Risk', 'High Risk'],\n",
" dtype=object)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"etf['Carbon_Risk_Classification'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "51a298e9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Portfolio_Environmental_Risk_Score\n",
"min 0.000000\n",
"mean 16.984573\n",
"max 66.590000\n",
"median 5.370000\n",
"Name: Portfolio_Environmental_Risk_Score, dtype: float64\n",
"\n",
"Portfolio_Governance_Risk_Score\n",
"min 0.000000\n",
"mean 18.545082\n",
"max 67.910000\n",
"median 7.840000\n",
"Name: Portfolio_Governance_Risk_Score, dtype: float64\n",
"\n",
"Portfolio_Social_Risk_Score\n",
"min 0.000000\n",
"mean 19.931538\n",
"max 65.550000\n",
"median 10.240000\n",
"Name: Portfolio_Social_Risk_Score, dtype: float64\n"
]
}
],
"source": [
"risk_scores = ['Portfolio_Environmental_Risk_Score', 'Portfolio_Governance_Risk_Score', 'Portfolio_Social_Risk_Score']\n",
"for i in risk_scores:\n",
" print(f'\\n{i}')\n",
" print(etf[i].agg(['min', 'mean', 'max', 'median']))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "042a069f",
"metadata": {},
"outputs": [],
"source": [
"# convert missing rating to its own category\n",
"etf['Morningstar_Rating_Overall'].fillna(6, inplace=True)\n",
"etf['Carbon_Risk_Classification'].fillna('Unknown Risk', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "c1226db4",
"metadata": {},
"outputs": [],
"source": [
"# replace risk_score with median value\n",
"risk_scores = ['Portfolio_Environmental_Risk_Score', 'Portfolio_Governance_Risk_Score', 'Portfolio_Social_Risk_Score']\n",
"for i in risk_scores:\n",
" etf[i].fillna(etf[i].median(), inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "847d430a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 133999 entries, 0 to 140228\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Global_Broad_Category_Group 133999 non-null object \n",
" 1 Domicile 133999 non-null object \n",
" 2 Asset_Class 133999 non-null object \n",
" 3 Morningstar_Rating_Overall 133999 non-null float64\n",
" 4 Carbon_Risk_Classification 133999 non-null object \n",
" 5 Portfolio_Environmental_Risk_Score 133999 non-null float64\n",
" 6 Portfolio_Governance_Risk_Score 133999 non-null float64\n",
" 7 Portfolio_Social_Risk_Score 133999 non-null float64\n",
" 8 month 133999 non-null int64 \n",
" 9 year 133999 non-null int64 \n",
" 10 monthly_return_USD 133999 non-null float64\n",
" 11 primary_key 133999 non-null object \n",
"dtypes: float64(5), int64(2), object(5)\n",
"memory usage: 13.3+ MB\n"
]
}
],
"source": [
"etf.info()"
]
},
{
"cell_type": "markdown",
"id": "530851a5",
"metadata": {},
"source": [
"## 5. Dealing with Categorical Data"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "78ce57a5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" month year\n",
"0 10 2020\n",
"1 11 2020\n",
"2 12 2020\n",
"3 1 2021\n",
"4 2 2021"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_date = etf[['month','year']].apply(lambda x : pd.Categorical(x))\n",
"cat_date.head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "c8d55913",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 133999 entries, 0 to 140228\n",
"Data columns (total 2 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 month 133999 non-null category\n",
" 1 year 133999 non-null category\n",
"dtypes: category(2)\n",
"memory usage: 1.3 MB\n"
]
}
],
"source": [
"cat_date.info()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "ca0c276c",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 133999 entries, 0 to 140228\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Global_Broad_Category_Group 133999 non-null object \n",
" 1 Domicile 133999 non-null object \n",
" 2 Asset_Class 133999 non-null object \n",
" 3 Morningstar_Rating_Overall 133999 non-null float64 \n",
" 4 Carbon_Risk_Classification 133999 non-null object \n",
" 5 Portfolio_Environmental_Risk_Score 133999 non-null float64 \n",
" 6 Portfolio_Governance_Risk_Score 133999 non-null float64 \n",
" 7 Portfolio_Social_Risk_Score 133999 non-null float64 \n",
" 8 month 133999 non-null category\n",
" 9 year 133999 non-null category\n",
" 10 monthly_return_USD 133999 non-null float64 \n",
" 11 primary_key 133999 non-null object \n",
"dtypes: category(2), float64(5), object(5)\n",
"memory usage: 11.5+ MB\n"
]
}
],
"source": [
"etf[['month','year']] = etf[['month','year']].apply(lambda x : pd.Categorical(x))\n",
"etf.info()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "065b2ef2",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Global_Broad_Category_Group', 'Domicile', 'Asset_Class',\n",
" 'Carbon_Risk_Classification', 'month', 'year', 'primary_key'],\n",
" dtype='object')"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_cols = etf.select_dtypes(include=['object', 'category']).columns\n",
"cat_cols"
]
},
{
"cell_type": "markdown",
"id": "21e626d4",
"metadata": {},
"source": [
"### 5a. Creating Dummy Variables\n",
"\n",
"What are the ways to convert categorical feature into numeric values? There are many. One popular method is to create a new column for each categorical variable and fill in value 1 for the specified routename column if the row belongs to that routename. This process is also called creating dummy variable.\n",
"\n",
"Pandas has an easy way to create dummy variables using the `get_dummies` method. Let's apply that on a toy dataset to see what transformation is taking place."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "5832461c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" fruits quantity\n",
"0 apple 2\n",
"1 orange 5"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create toy data set with categorica and numeric features\n",
"test_data = pd.DataFrame( [['apple', 2], ['orange', 5]], columns=['fruits', 'quantity'])\n",
"test_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "52fb45bc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" apple orange\n",
"0 1 0\n",
"1 0 1"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create dummy variables for a categorical column\n",
"pd.get_dummies(test_data['fruits'])"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "3a12f438",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" apple orange quantity\n",
"0 1 0 2\n",
"1 0 1 5"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# combine the dummy variables with remaining numeric column of the original data\n",
"pd.concat([pd.get_dummies(test_data['fruits']), test_data['quantity']], axis=1)"
]
},
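{
"cell_type": "markdown",
"id": "a1f2b3c4",
"metadata": {},
"source": [
"As a side note, `get_dummies` can also be applied to the whole `DataFrame` in one step: object and category columns are encoded (with the original column name as a prefix, e.g. `fruits_apple`), while numeric columns such as `quantity` pass through unchanged. The sketch below is equivalent to the manual `get_dummies` plus `concat` above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2c3d4e5",
"metadata": {},
"outputs": [],
"source": [
"# one-step equivalent of get_dummies + concat:\n",
"# the 'fruits' column is encoded and 'quantity' is kept as-is\n",
"pd.get_dummies(test_data)"
]
},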
{
"cell_type": "markdown",
"id": "53155b53",
"metadata": {},
"source": [
"### 5b. One Hot Encoding\n",
"\n",
"The above transformation can also be done via OneHotEncoder function available through the preprocessing submodule of sklearn. This method is easier to use if you have many categorical columns. Here, we will see an example on our toy data set."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "b2f27a25",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import OneHotEncoder"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "087a178e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<2x2 sparse matrix of type ''\n",
"\twith 2 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 1. create an instance of the one hot encoder object\n",
"ohe = OneHotEncoder()\n",
"\n",
"# 2. fit the data to the one hot encoder instance\n",
"ohe.fit_transform(test_data['fruits'].values.reshape(-1,1)) # require the input data to be 2-dimensional"
]
},
{
"cell_type": "markdown",
"id": "76bfcc6a",
"metadata": {},
"source": [
"The resulting object is a sparse matrix, which cannot be combined to a `dataframe`. Therefore, we must first convert it to a numpy `array` for which there is a `toarray` method provided."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "4cffd572",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1., 0.],\n",
" [0., 1.]])"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 3. convert the transformed data to an numpy array \n",
"dummy_fruits = ohe.fit_transform(test_data['fruits'].values.reshape(-1,1)).toarray()\n",
"dummy_fruits"
]
},
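{
"cell_type": "markdown",
"id": "c3d4e5f6",
"metadata": {},
"source": [
"As a side note, the encoder can also be asked to return a dense array directly, skipping the `toarray` step. The parameter name depends on your `sklearn` version (`sparse` in releases around 1.0, `sparse_output` in newer ones), so treat the cell below as a version-dependent sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4e5f6a7",
"metadata": {},
"outputs": [],
"source": [
"# ask OneHotEncoder for a dense array up front\n",
"# (on sklearn >= 1.2 use sparse_output=False instead)\n",
"ohe_dense = OneHotEncoder(sparse=False)\n",
"ohe_dense.fit_transform(test_data['fruits'].values.reshape(-1,1))"
]
},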
{
"cell_type": "markdown",
"id": "0866fae3",
"metadata": {},
"source": [
"Numpy `array` can be easily converted to a pandas `dataframe`."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "b50b7a63",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" 0 1\n",
"0 1.0 0.0\n",
"1 0.0 1.0"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 4. transform the numpy array to dataframe\n",
"pd.DataFrame(dummy_fruits)"
]
},
{
"cell_type": "markdown",
"id": "ea82d7e9",
"metadata": {},
"source": [
"Note that we are missing the column names, which can make it difficult to know which category the column belongs to. This can be easily retrieved using the `categories_` method of the `OneHotEncoder` object. "
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "c1013666",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" apple orange\n",
"0 1.0 0.0\n",
"1 0.0 1.0"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# while converting to dataframe add column names as well\n",
"pd.DataFrame(dummy_fruits, columns=ohe.categories_)"
]
},
{
"cell_type": "markdown",
"id": "81162e1c",
"metadata": {},
"source": [
"You could also use the `get_feature_names_out` method, which can take a string to add to each column name."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "97892906",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['fruits_apple', 'fruits_orange'], dtype=object)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ohe.get_feature_names_out(['fruits'])"
]
},
{
"cell_type": "markdown",
"id": "3cf0374f",
"metadata": {},
"source": [
"Now, let's try this on our date that have categorical values."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "0b27c7a5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Global_Broad_Category_Group\n",
"7\n",
"Equity 101599\n",
"Fixed Income 14674\n",
"Miscellaneous 13054\n",
"Alternative 2369\n",
"Commodities 1670\n",
"Allocation 445\n",
"Convertibles 188\n",
"Name: Global_Broad_Category_Group, dtype: int64\n",
"\n",
"Domicile\n",
"3\n",
"United States 120118\n",
"Canada 9029\n",
"United Kingdom 4852\n",
"Name: Domicile, dtype: int64\n",
"\n",
"Asset_Class\n",
"4\n",
"equity 113018\n",
"bond 14973\n",
"real_estate 3882\n",
"commodity 2126\n",
"Name: Asset_Class, dtype: int64\n",
"\n",
"Carbon_Risk_Classification\n",
"5\n",
"Unknown Risk 88009\n",
"Low Risk 24391\n",
"Medium Risk 20501\n",
"High Risk 879\n",
"Negligible Risk 219\n",
"Name: Carbon_Risk_Classification, dtype: int64\n",
"\n",
"month\n",
"12\n",
"\n",
"year\n",
"7\n",
"2021 29384\n",
"2020 25313\n",
"2019 22748\n",
"2018 16647\n",
"2017 14821\n",
"2016 13402\n",
"2015 11684\n",
"Name: year, dtype: int64\n",
"\n",
"primary_key\n",
"2572\n"
]
}
],
"source": [
"for i in cat_cols:\n",
" print(f'\\n{i}')\n",
" n_uniq = etf[i].nunique()\n",
" print(n_uniq)\n",
" if n_uniq < 10:\n",
" print(etf[i].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "754bfa59",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(133999, 38)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# transform all categorical columns using one hot encoder except primary key\n",
"ohe = OneHotEncoder()\n",
"cat_data = ohe.fit_transform(etf[cat_cols[:-1]]).toarray() # do not include primary key\n",
"cat_data = pd.DataFrame(cat_data, columns=ohe.get_feature_names_out(cat_cols[:-1]))\n",
"cat_data.index = etf.index\n",
"cat_data.shape"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "b1f8e98a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(133999, 5)"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# drop original categorical columns and add the transformed ones instead\n",
"etf_transformed = etf.drop(cat_cols, axis=1)\n",
"etf_transformed.shape"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "46c9289d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(133999, 43)"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# combine numeric features with transformed categorical features\n",
"etf_transformed = pd.concat([etf_transformed,cat_data], axis=1)\n",
"etf_transformed.shape"
]
},
{
"cell_type": "markdown",
"id": "c961f3ce",
"metadata": {},
"source": [
"### 5c. Encoding Labels"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "1cfcf7c8",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import LabelEncoder"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "1b487895",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([5, 5, 5, ..., 5, 5, 5])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"le = LabelEncoder()\n",
"le.fit_transform(etf['Global_Broad_Category_Group'])"
]
},
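{
"cell_type": "markdown",
"id": "e5f6a7b8",
"metadata": {},
"source": [
"Each category has been replaced by an integer. To check which integer corresponds to which category, we can inspect the encoder's `classes_` attribute: the category at position `k` is the one encoded as the integer `k`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6a7b8c9",
"metadata": {},
"outputs": [],
"source": [
"# the fitted encoder stores the label mapping: classes_[k] is the category encoded as k\n",
"le.classes_"
]
},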
{
"cell_type": "markdown",
"id": "a34101a7",
"metadata": {},
"source": [
"## 6. Data Modelling with Scikit-Learn\n",
"\n",
"There are many ways to model the data and the best way really depends on the problem you are trying to solve. Let's say for example, we want to know if it is possible to use the information available in our transformed data to predict the monthly return.\n",
"\n",
"Prior to modelling the data, the data must be clean and any item in the data must be numeric. This is because machine learning models or statistical models do not take data that have textual values or missing values. So such data must be processed and transformed to some reasonable numeric representation before they are used in modelling. \n",
"\n",
"Since we have already done the data cleaning and processing part, we will begin by separating the variable that will be predicted from the variables that will be use to predict. The former is commonly refered to as target (y) and the latter as features (X)."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "85a0467c",
"metadata": {},
"outputs": [],
"source": [
"X = etf_transformed.drop(['monthly_return_USD'], axis=1)\n",
"y = etf_transformed['monthly_return_USD']"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "fd645cb8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Morningstar_Rating_Overall Portfolio_Environmental_Risk_Score \\\n",
"0 6.0 5.37 \n",
"1 6.0 5.37 \n",
"2 6.0 5.37 \n",
"3 6.0 5.37 \n",
"4 6.0 5.37 \n",
"\n",
" Portfolio_Governance_Risk_Score Portfolio_Social_Risk_Score \\\n",
"0 7.84 10.24 \n",
"1 7.84 10.24 \n",
"2 7.84 10.24 \n",
"3 7.84 10.24 \n",
"4 7.84 10.24 \n",
"\n",
" Global_Broad_Category_Group_Allocation \\\n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"\n",
" Global_Broad_Category_Group_Alternative \\\n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"\n",
" Global_Broad_Category_Group_Commodities \\\n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"\n",
" Global_Broad_Category_Group_Convertibles \\\n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"\n",
" Global_Broad_Category_Group_Equity \\\n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 0.0 \n",
"4 0.0 \n",
"\n",
" Global_Broad_Category_Group_Fixed Income ... month_10 month_11 \\\n",
"0 1.0 ... 1.0 0.0 \n",
"1 1.0 ... 0.0 1.0 \n",
"2 1.0 ... 0.0 0.0 \n",
"3 1.0 ... 0.0 0.0 \n",
"4 1.0 ... 0.0 0.0 \n",
"\n",
" month_12 year_2015 year_2016 year_2017 year_2018 year_2019 year_2020 \\\n",
"0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 \n",
"1 0.0 0.0 0.0 0.0 0.0 0.0 1.0 \n",
"2 1.0 0.0 0.0 0.0 0.0 0.0 1.0 \n",
"3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
" year_2021 \n",
"0 0.0 \n",
"1 0.0 \n",
"2 0.0 \n",
"3 1.0 \n",
"4 1.0 \n",
"\n",
"[5 rows x 42 columns]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "cae87550",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.00281\n",
"1 0.61405\n",
"2 0.27885\n",
"3 0.43888\n",
"4 -0.13515\n",
"Name: monthly_return_USD, dtype: float64"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y[:5]"
]
},
{
"cell_type": "markdown",
"id": "df9a1125",
"metadata": {},
"source": [
"Next, we will split the feature data into training and test set. This is while we use data to train a machine learning model, its performance should be reported on a data that the model has never seen before. This ensures that the model is able to generalize i.e. it has not just learned the training data very well but also some patterns that can help predict future data points. This is very important if we want to use the model in the real world.\n",
"\n",
"Usually a 70-30 or 80-20 split is recommended. In our case, we will keep 30% of the data for testing and train the model with remaining 70% of the data. Let's import the `train_test_split` function available via the `model_selection` submodule of `sklearn` library. This function takes the feature and target data and give us the desired splits of the data. "
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "58902b5c",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split as tts"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "c9911c40",
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = tts(X, # feature data\n",
" y, # target data \n",
" test_size=.30, # size of the test set\n",
" random_state=42)# set a random number to get the exact split next time "
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "32290c1d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(93799, 42) (40200, 42) (93799,) (40200,)\n"
]
}
],
"source": [
"print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)"
]
},
{
"cell_type": "markdown",
"id": "830d38bc",
"metadata": {},
"source": [
"You can see that 20% of the data for both features and target are now held-out as the test set. \n",
"\n",
"### 6a. Linear Regression\n",
"Linear Regression is the most used statistical algorithm. It examines a linear relationship between two or more variables. Linear Regression provides a good background in understanding a whole class of linear models known as Generalized Linear Models (GLM). It also helps understand statistical terms such as cost functions, coefficients and optimization. \n",
"\n",
"The overall objective of Linear Regression is to predict a straight line through the data, such that the vertical distance of each data point is minimal from that line.\n",
"\n",
"Let's now import the `Linear_Regression` model from the `linear_model` submodule of `sklearn`, which we will use to fit our data."
]
},
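{
"cell_type": "markdown",
"id": "a7b8c9d0",
"metadata": {},
"source": [
"As a reminder of the standard formulation (written in general notation, not tied to our data set), for features $x_1, \\dots, x_p$ the model predicts\n",
"\n",
"$$\\hat{y} = \\beta_0 + \\beta_1 x_1 + \\dots + \\beta_p x_p,$$\n",
"\n",
"and ordinary least squares chooses the intercept $\\beta_0$ and coefficients $\\beta_1, \\dots, \\beta_p$ that minimize the sum of squared residuals $\\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2$."
]
},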
{
"cell_type": "code",
"execution_count": 40,
"id": "68cb22b9",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "markdown",
"id": "8a24cb8d",
"metadata": {},
"source": [
"Most of the functions in sklearn can be used in the same way:\n",
"\n",
"1. Create an instance of the object in use.\n",
"2. Use `fit` or `fit_transform` method to fit or transform the data as needed. "
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "78d1e137",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = LinearRegression()\n",
"model.fit(X_train,y_train)"
]
},
{
"cell_type": "markdown",
"id": "a658b154",
"metadata": {},
"source": [
"Now that the data has been fit to the linear model, some model property information will now be available in the `model` object. One of these properties is the `score` method, which returns the coefficient of determination of the prediction, also known as R squared. It is the proportion of the variation in the dependent variable that is predictable form the independent variable. We can input the training data to get this score.\n",
"\n",
"Other properties of interest are the intercept and the coefficients of the linear regression model. We can get these information with the `intercept_` and `coef_` methods respectively."
]
},
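{
"cell_type": "markdown",
"id": "b8c9d0e1",
"metadata": {},
"source": [
"For reference, the standard definition of R squared is\n",
"\n",
"$$R^2 = 1 - \\frac{\\sum_{i}(y_i - \\hat{y}_i)^2}{\\sum_{i}(y_i - \\bar{y})^2},$$\n",
"\n",
"where $\\hat{y}_i$ are the model predictions and $\\bar{y}$ is the mean of the observed targets. A value of 1 means the model explains all of the variation in the target; a value near 0 means it explains almost none."
]
},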
{
"cell_type": "code",
"execution_count": 42,
"id": "0fdbfd3e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.002265433163255448"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# return the coefficient of determination of the prediction\n",
"model.score(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "19960f60",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3.034094003604393"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.intercept_"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "23ec669a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-0.02835847, 0.04473452, 0.01876206, -0.0681953 , 0.21158945,\n",
" 1.36047047, -2.92853603, 0.11323771, 1.8329775 , -0.91875368,\n",
" 0.32901458, 3.75975934, -1.32739484, -2.43236449, 0.41342607,\n",
" 2.25771068, -1.28035563, -1.39078112, -0.84891828, 0.48299489,\n",
" -0.45748469, 0.36232662, 0.46108145, 0.75855728, -0.92645283,\n",
" -2.49362126, 2.82413288, -0.47930915, 0.31343101, 0.44857785,\n",
" -0.86410168, -2.1507675 , -0.42698793, 2.73204843, 0.26449291,\n",
" -1.59002016, 1.59605138, 0.51663461, -1.96972057, 0.50276752,\n",
" 0.90567084, 0.03861637])"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.coef_"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "b9461b74",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" feature coef\n",
"10 Global_Broad_Category_Group_Miscellaneous 3.7598\n",
"5 Global_Broad_Category_Group_Alternative -2.9285\n",
"25 month_3 2.8241\n",
"32 month_10 2.7320\n",
"24 month_2 -2.4936\n",
"12 Domicile_United Kingdom -2.4324\n",
"14 Asset_Class_bond 2.2577\n",
"30 month_8 -2.1508\n",
"37 year_2017 -1.9697\n",
"7 Global_Broad_Category_Group_Convertibles 1.8330\n",
"35 year_2015 1.5961\n",
"34 month_12 -1.5900\n",
"16 Asset_Class_equity -1.3908\n",
"4 Global_Broad_Category_Group_Allocation 1.3605\n",
"11 Domicile_Canada -1.3274\n",
"15 Asset_Class_commodity -1.2804\n",
"23 month_1 -0.9265\n",
"8 Global_Broad_Category_Group_Equity -0.9188\n",
"39 year_2019 0.9057\n",
"29 month_7 -0.8641\n",
"17 Asset_Class_real_estate -0.8489\n",
"22 Carbon_Risk_Classification_Unknown Risk 0.7586\n",
"36 year_2016 0.5166\n",
"38 year_2018 0.5028\n",
"18 Carbon_Risk_Classification_High Risk 0.4830\n",
"26 month_4 -0.4793\n",
"21 Carbon_Risk_Classification_Negligible Risk 0.4611\n",
"19 Carbon_Risk_Classification_Low Risk -0.4575\n",
"28 month_6 0.4486\n",
"31 month_9 -0.4270\n",
"13 Domicile_United States 0.4134\n",
"20 Carbon_Risk_Classification_Medium Risk 0.3623\n",
"9 Global_Broad_Category_Group_Fixed Income 0.3290\n",
"27 month_5 0.3134\n",
"33 month_11 0.2645\n",
"3 Portfolio_Social_Risk_Score 0.2116\n",
"6 Global_Broad_Category_Group_Commodities 0.1132\n",
"2 Portfolio_Governance_Risk_Score -0.0682\n",
"0 Morningstar_Rating_Overall 0.0447\n",
"40 year_2020 0.0386\n",
"1 Portfolio_Environmental_Risk_Score 0.0188"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"coefs = {}\n",
"for i,j in zip(X_train.columns, model.coef_[1:]):\n",
" coefs[i] = j\n",
"pd.DataFrame(coefs.items(), columns=['feature','coef']).sort_values(by='coef', key=abs, ascending=False).round(4)"
]
},
{
"cell_type": "markdown",
"id": "6ccef2fb",
"metadata": {},
"source": [
"R Squared value only measures the fit of training data to the model. How well will this model perform on an unseen test data is the next step of evaluation. Regression models often use the mean squared error metric to evaluate the performance on an unseen data. To calculate this we can use `mean_squared_error` function available throuhg the `metrics` submodule of `sklearn`. The function takes the model prediction on a given data and the actual target value for that dataset. Therefore, we first need to generate predictions from our model on the test set using the `predict` method."
]
},
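{
"cell_type": "markdown",
"id": "c9d0e1f2",
"metadata": {},
"source": [
"For reference, the mean squared error over $n$ test points is\n",
"\n",
"$$\\mathrm{MSE} = \\frac{1}{n}\\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2,$$\n",
"\n",
"so lower values indicate predictions that are closer to the observed targets."
]
},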
{
"cell_type": "code",
"execution_count": 46,
"id": "d4e3aac8",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import mean_squared_error as mse"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "5d73d3af",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"863.5539077021615\n"
]
}
],
"source": [
"y_pred_test = model.predict(X_test)\n",
"mse_lrmodel = mse(y_pred_test, y_test)\n",
"print(mse_lrmodel)"
]
},
{
"cell_type": "markdown",
"id": "086bd82e",
"metadata": {},
"source": [
"How do we know this is a good enough value?\n",
"\n",
"In machine learning, we usually have benchmark model against which we can test the performance. In this case we only have one model, so we can create another model and see which one performs better. \n",
"\n",
"We can repeat what we did earlier on another model or we can create a for-loop such that the exact same operation goes through all the models in a list. This latter is obviously better as we do not have to write the same code over and over again. It is also easier from readability perspective."
]
},
{
"cell_type": "markdown",
"id": "fabf0e6f",
"metadata": {},
"source": [
"### 6b. Decision Trees"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "6c124ad3",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.tree import DecisionTreeRegressor"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "cb25e82e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('LR', LinearRegression()), ('DT', DecisionTreeRegressor())]"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# initialize an empty list to add sklearn model objects\n",
"models = []\n",
"\n",
"# add the sklearn model objects to the list one by one\n",
"# while adding the model also give it a name so put the name and model in a tuple\n",
"models.append(('LR', LinearRegression())) \n",
"models.append(('DT', DecisionTreeRegressor())) # Ensemble method - collection of many decision trees\n",
"models"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "b09799aa",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LinearRegression()\n",
"DecisionTreeRegressor()\n"
]
},
{
"data": {
"text/plain": [
"{'LR': 863.5539077021615, 'DT': 942.1607929570563}"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scores = {}\n",
"for name, model in models:\n",
" print(model)\n",
" model.fit(X_train, y_train)\n",
" y_pred_test = model.predict(X_test)\n",
" mse_score = mse(y_pred_test, y_test)\n",
" scores[name] = mse_score\n",
"scores"
]
},
{
"cell_type": "markdown",
"id": "b070c863",
"metadata": {},
"source": [
"The Decision Tree Regression model has higher mean squared error than the Linear Regression model and therefore, Linear Regression model is better than the Decision Tree model.\n",
"\n",
"__Note:__ It might also be a good idea to store the fitted model, so that once you can explore more details of the model rather than just the scores. See [here](https://scikit-learn.org/stable/modules/tree.html#tree-regression) to learn more about decision tree model properties and sklearn features available to explore the model details.\n",
"\n",
"Recall earlier we did not use the primary_key column to in our model because there were many categories. If we repeat the above modelling process including the information in this column, are we likely to make a better prediction? Let's check that now."
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "26bb157d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(133999, 2615)"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# transform all categorical columns using one hot encoder including primary key\n",
"ohe = OneHotEncoder()\n",
"cat_data = ohe.fit_transform(etf[cat_cols]).toarray() # do not include primary key\n",
"cat_data = pd.DataFrame(cat_data, columns=ohe.get_feature_names_out(cat_cols))\n",
"cat_data.index = etf.index\n",
"\n",
"# drop original categorical columns and add the transformed ones instead\n",
"etf_transformed = etf.drop(cat_cols, axis=1)\n",
"\n",
"# combine numeric features with transformed categorical features\n",
"etf_transformed = pd.concat([etf_transformed,cat_data], axis=1)\n",
"etf_transformed.shape"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "0b5d782d",
"metadata": {},
"outputs": [],
"source": [
"# create feature and target variables\n",
"X = etf_transformed.drop(['monthly_return_USD'], axis=1)\n",
"y = etf_transformed['monthly_return_USD']\n",
"\n",
"# split data into training and test set\n",
"X_train, X_test, y_train, y_test = tts(X, y, test_size=.30, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "9eb102ba",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"LinearRegression()\n",
"DecisionTreeRegressor()\n"
]
},
{
"data": {
"text/plain": [
"{'LR': 553594206705754.5, 'DT': 897.1922295122931}"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# initialize an empty list to add sklearn model objects\n",
"models = []\n",
"\n",
"# add the sklearn model objects to the list one by one\n",
"# while adding the model also give it a name so put the name and model in a tuple\n",
"models.append(('LR', LinearRegression())) \n",
"models.append(('DT', DecisionTreeRegressor())) # Ensemble method - collection of many decision trees\n",
"models\n",
"\n",
"scores = {}\n",
"for name, model in models:\n",
" print(model)\n",
" model.fit(X_train, y_train)\n",
" y_pred_test = model.predict(X_test)\n",
" mse_score = mse(y_pred_test, y_test)\n",
" scores[name] = mse_score\n",
"scores"
]
},
{
"cell_type": "markdown",
"id": "04f6daaa",
"metadata": {},
"source": [
"Adding information in the primary_key column leads to increase in the error of the linear regression model but decreases the same for decision tree model. This is common behavior when comparing a linear and non-linear models with large number of variables. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}