{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Natural Language Processing\n", "\n", "We will be using the venturebeat data that we have scrapped and stored. We will begin with loading the data, inspecting it and then convert text into numeric features. Our task/problem here is to build a natural language processing model that can take the information of the article and determine the category it belongs to. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlcategorytitletextdatemonthdaylengthnwordslex_div
0https://venturebeat.com/2020/03/20/despite-set...AIDespite setbacks, coronavirus could hasten the...This week, nearly every major company developi...2020-03-20320646610110.070227
1https://venturebeat.com/2020/03/19/sensor-towe...GamesSensor Tower: U.S. iPhone users spent about $5...U.S. iPhone users spent an average of about $5...2020-03-1931911362000.290000
2https://venturebeat.com/2020/03/19/microsoft-u...GamesMicrosoft unveils DirectX 12 Ultimate with imp...Microsoft is moving on to the next generation ...2020-03-1931947317830.067688
3https://venturebeat.com/2020/03/19/sea-of-star...GamesSea of Stars is a gorgeous retro-RPG from The ...Sabotage Studios announced Sea of Stars today,...2020-03-193198981560.352564
4https://venturebeat.com/2020/03/19/htc-holds-v...AR/VRHTC holds virtual media event, sends coronavir...HTC’s just-concluded Virtual Vive Ecosystem Co...2020-03-1931940306490.090909
\n", "
" ], "text/plain": [ " url category \\\n", "0 https://venturebeat.com/2020/03/20/despite-set... AI \n", "1 https://venturebeat.com/2020/03/19/sensor-towe... Games \n", "2 https://venturebeat.com/2020/03/19/microsoft-u... Games \n", "3 https://venturebeat.com/2020/03/19/sea-of-star... Games \n", "4 https://venturebeat.com/2020/03/19/htc-holds-v... AR/VR \n", "\n", " title \\\n", "0 Despite setbacks, coronavirus could hasten the... \n", "1 Sensor Tower: U.S. iPhone users spent about $5... \n", "2 Microsoft unveils DirectX 12 Ultimate with imp... \n", "3 Sea of Stars is a gorgeous retro-RPG from The ... \n", "4 HTC holds virtual media event, sends coronavir... \n", "\n", " text date month day \\\n", "0 This week, nearly every major company developi... 2020-03-20 3 20 \n", "1 U.S. iPhone users spent an average of about $5... 2020-03-19 3 19 \n", "2 Microsoft is moving on to the next generation ... 2020-03-19 3 19 \n", "3 Sabotage Studios announced Sea of Stars today,... 2020-03-19 3 19 \n", "4 HTC’s just-concluded Virtual Vive Ecosystem Co... 2020-03-19 3 19 \n", "\n", " length nwords lex_div \n", "0 6466 1011 0.070227 \n", "1 1136 200 0.290000 \n", "2 4731 783 0.067688 \n", "3 898 156 0.352564 \n", "4 4030 649 0.090909 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('venturebeat2020.csv')\n", "data = df.copy()\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Data Preprocessing\n", "\n", "- Tokenization\n", "- Remove stop words\n", "- Remove punctuations\n", "- Stemming\n", "- Lemmatization" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "nltk.download('popular', quiet=True)\n", "from nltk import word_tokenize, wordpunct_tokenize\n", "from nltk.corpus import stopwords\n", "import unicodedata\n", "\n", "from nltk.stem import SnowballStemmer\n", "from nltk import pos_tag\n", "from nltk.stem.wordnet import WordNetLemmatizer\n", "from nltk.corpus import wordnet as wn" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def is_stopword(token):\n", " stops = set(stopwords.words('english'))\n", " return token.lower() in stops\n", "\n", "def is_punct(token):\n", " return all(unicodedata.category(char).startswith('P') for char in token)\n", "\n", "def normalizer(text):\n", " stem = nltk.stem.SnowballStemmer('english')\n", " text = text.lower()\n", " \n", " tokenized = []\n", " for token in nltk.word_tokenize(text):\n", " tokenized.append(stem.stem(token))\n", " \n", " tokenized = [token for token in tokenized \n", " if not is_punct(token) # remove tokens that are punctuations\n", " and not is_stopword(token) # remove stopwords\n", " and token.isascii() # remove non-english characters\n", " ]\n", " \n", " return ' '.join(tokenized) # join b/c we are inputting a list\n", "\n", "def lemmatizer(token, postag):\n", " lemm = WordNetLemmatizer()\n", " tag= {\n", " 'N':wn.NOUN,\n", " 'V':wn.VERB,\n", " 'R':wn.ADV,\n", " 'J':wn.ADJ\n", " }.get(postag[0], wn.NOUN)\n", " \n", " return lemm.lemmatize(token, tag)\n", "\n", "def normalizer_lemm(text):\n", " \n", " tagged_tokenized = pos_tag(wordpunct_tokenize(text))\n", " \n", " tokenized = [ lemmatizer(token, tag).lower() \n", " for (token, tag) in tagged_tokenized\n", " if not is_punct(token) \n", " and token.isascii()\n", " ]\n", " \n", " # remove extended stopwords\n", " stop_words = stopwords.words('english')\n", " stop_words.extend(['game', 'compani'])\n", " 
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "corpus = data['text'].values.tolist()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This week, nearly every major company developing autonomous vehicles in the U.S. halted testing in an effort to stem the spread of COVID-19, which has sickened more than 250,000 people and killed over 10,000 around the world. Still some experts argue pandemics like COVID-19 should hasten the adoption of driverless vehicles for passenger pickup, transportation of goods, and more. Autonomous vehicles still require disinfection — which companies like Alphabet’s Waymo and KiwiBot are conducting manually with sanitation teams — but in some cases, self-driving cars and delivery robots might minimize the risk of spreading disease. In a climate of social distancing, when on-demand services from Instacart to GrubHub have taken steps to minimize human contact, one factor in driverless cars’ favor is that they don’t require a potentially sick person behind the wheel. Tellingly, on Monday, when Waymo grounded its commercial robotaxis with human safety drivers, it initially said it would continue \n" ] }, { "data": { "text/plain": [ "'week near everi major compani develop autonom vehicl u.s. halt test effort stem spread covid-19 sicken 250,000 peopl kill 10,000 around world still expert argu pandem like covid-19 hasten adopt driverless vehicl passeng pickup transport good autonom vehicl still requir disinfect compani like alphabet waymo kiwibot conduct manual sanit team case self-driv car deliveri robot might minim risk spread diseas climat social distanc on-demand servic instacart grubhub taken step minim human contact one factor driverless car favor requir potenti sick person behind wheel tell monday waymo ground commerci robotaxi human safeti driver initi said would continu oper driverless autonom car fleet peopl understand theori autonom vehicl reduc spread infect allow social distanc said amit nisenbaum ceo tactil mobil provid tactil data sens technolog allow autonom vehicl detect road bump curvatur hazard compani build fleet autonom vehicl develop solut guidelin general mainten clean steril keep strict clean '" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "norm_corpus = [normalizer(doc) for doc in corpus]\n", "print(corpus[0][:999])\n", "norm_corpus[0][:999]" ] },
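{ "cell_type": "markdown", "metadata": {}, "source": [ "Before lemmatizing, a quick look at which stems dominate the normalized corpus can be useful. The cell below is a minimal sketch using yellowbrick's `FreqDistVisualizer` (the same class imported in the next section); it assumes an older scikit-learn where `get_feature_names()` exists (newer versions use `get_feature_names_out()`) and a yellowbrick release where `show()` is available (older releases call it `poof()`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "from yellowbrick.text.freqdist import FreqDistVisualizer\n", "\n", "# count token frequencies in the stemmed corpus\n", "freq_vect = CountVectorizer()\n", "docs = freq_vect.fit_transform(norm_corpus)\n", "\n", "# plot the 25 most frequent stems\n", "visualizer = FreqDistVisualizer(features=freq_vect.get_feature_names(), n=25)\n", "visualizer.fit(docs)\n", "visualizer.show()" ] },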
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'week nearly every major company develop autonomous vehicle u halt test effort stem spread covid 19 sicken 250 000 people kill 10 000 around world still expert argue pandemic like covid 19 hasten adoption driverless vehicle passenger pickup transportation good autonomous vehicle still require disinfection company like alphabet waymo kiwibot conduct manually sanitation team case self drive car delivery robot might minimize risk spread disease climate social distancing demand service instacart grubhub take step minimize human contact one factor driverless car favor require potentially sick person behind wheel tellingly monday waymo ground commercial robotaxis human safety driver initially say would continue operate driverless autonomous car fleet people understand theory autonomous vehicle reduce spread infection allow social distancing say amit nisenbaum ceo tactile mobility provider tactile data sense technology allow autonomous vehicle detect road bump curvature hazard companies build'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normlemm_corpus = [normalizer_lemm(doc) for doc in corpus]\n", "normlemm_corpus[0][:999]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Feature Extraction: Vectorization + 3. Data Modelling using Pipelines\n", "\n", "- Count Vectorizer\n", "- TFIDF Vectorizer\n", "- Classification using ML models\n", "- NLP Pipelines" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "from yellowbrick.text.freqdist import FreqDistVisualizer\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.pipeline import Pipeline\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.metrics import classification_report" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def model_pipeline(vectorizer, classifier):\n", "    # chain a text vectorizer and a classifier into a single sklearn Pipeline\n", "    steps = [('vectorization', vectorizer),\n", "             ('classification', classifier)\n", "            ]\n", "    pipe = Pipeline(steps)\n", "\n", "    return pipe" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pipeline(memory=None,\n", " steps=[('vectorization',\n", " CountVectorizer(analyzer='word', binary=False,\n", " decode_error='strict',\n", " dtype=<class 'numpy.int64'>, encoding='utf-8',\n", " input='content', lowercase=True, max_df=1.0,\n", " max_features=None, min_df=1,\n", " ngram_range=(1, 1), preprocessor=None,\n", " stop_words=None, strip_accents=None,\n", " token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)),\n", " ('classification',\n", " LogisticRegression(C=10, class_weight=None, dual=False,\n", " fit_intercept=True, intercept_scaling=1,\n", " l1_ratio=None, max_iter=1000,\n", " multi_class='multinomial', n_jobs=None,\n", " penalty='l2', random_state=None,\n", " solver='newton-cg', tol=0.0001, verbose=0,\n", " warm_start=False))],\n", " verbose=False)\n", "-----------------------------------------------------------------------------\n", "Pipeline(memory=None,\n", " steps=[('vectorization',\n", " CountVectorizer(analyzer='word', binary=False,\n", " decode_error='strict',\n", " dtype=<class 'numpy.int64'>, encoding='utf-8',\n", " input='content', lowercase=True, max_df=1.0,\n", " max_features=None, min_df=1,\n", " ngram_range=(1, 1), preprocessor=None,\n", " stop_words=None, strip_accents=None,\n", " token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocab...\n", " RandomForestClassifier(bootstrap=True, class_weight=None,\n", " criterion='gini', max_depth=None,\n", " max_features='auto',\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " n_estimators='warn', n_jobs=None,\n", " oob_score=False, random_state=None,\n", " verbose=0, warm_start=False))],\n", " verbose=False)\n", "-----------------------------------------------------------------------------\n", "Pipeline(memory=None,\n", " steps=[('vectorization',\n", " TfidfVectorizer(analyzer='word', binary=False,\n", " decode_error='strict',\n", " dtype=<class 'numpy.float64'>,\n", " encoding='utf-8', input='content',\n", " lowercase=True, max_df=0.95, max_features=5000,\n", " min_df=20, ngram_range=(1, 2), norm='l2',\n", " preprocessor=None, smooth_idf=True,\n", " stop_words='english', strip_accents=None,\n", " sublinear_tf=False,\n", " token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, use_idf=True,\n", " vocabulary=None)),\n", " ('classification',\n", " LogisticRegression(C=10, class_weight=None, dual=False,\n", " fit_intercept=True, intercept_scaling=1,\n", " l1_ratio=None, max_iter=1000,\n", " multi_class='multinomial', n_jobs=None,\n", " penalty='l2', random_state=None,\n", " solver='newton-cg', tol=0.0001, verbose=0,\n", " warm_start=False))],\n", " verbose=False)\n", "-----------------------------------------------------------------------------\n", "Pipeline(memory=None,\n", " steps=[('vectorization',\n", " TfidfVectorizer(analyzer='word', binary=False,\n", " decode_error='strict',\n", " dtype=<class 'numpy.float64'>,\n", " encoding='utf-8', input='content',\n", " lowercase=True, max_df=0.95, max_features=5000,\n", " min_df=20, ngram_range=(1, 2), norm='l2',\n", " preprocessor=None, smooth_idf=True,\n", " stop_words='english', strip_accents=None,\n", " sublinear_tf=False,...\n", " RandomForestClassifier(bootstrap=True, class_weight=None,\n", " criterion='gini', max_depth=None,\n", " max_features='auto',\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " n_estimators='warn', n_jobs=None,\n", " oob_score=False, random_state=None,\n", " verbose=0, warm_start=False))],\n", " verbose=False)\n", "-----------------------------------------------------------------------------\n" ] } ], "source": [ "X = normlemm_corpus\n", "y = data['category'].values\n", "n = 1560\n", "\n", "# simple hold-out split: the first 1560 articles for training, the rest for testing\n", "X_train, X_test = X[:n], X[n:]\n", "y_train, y_test = y[:n], y[n:]\n", "\n", "vect = [CountVectorizer(),\n", "        TfidfVectorizer(max_df=0.95, min_df=20, max_features=5000, stop_words='english', ngram_range=(1, 2))]\n", "\n", "models = [LogisticRegression(C=10, solver='newton-cg', multi_class='multinomial', max_iter=1000),\n", "          RandomForestClassifier()\n", "         ]\n", "\n", "for vec in vect:\n", "    for clf in models:\n", "        nlp_model = model_pipeline(vec, clf)\n", "        print(nlp_model)\n", "        print('-----------------------------------------------------------------------------')\n", "#         nlp_model.fit(X_train, y_train)\n", "#         y_pred = nlp_model.predict(X_test)\n", "#         print('accuracy: ', accuracy_score(y_test, y_pred))\n", "#         print(classification_report(y_test, y_pred))\n", "#         print('----------------------')\n", "#         print('----------------------')" ] },
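{ "cell_type": "markdown", "metadata": {}, "source": [ "The `fit`/`predict` lines are commented out in the loop above. The cell below is a sketch of the evaluation they suggest: each vectorizer/classifier pair is trained on the hold-out split and scored with `accuracy_score` and `classification_report` (both already imported). The header line naming each pair is an addition for readability." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch of the evaluation loop suggested by the commented-out lines above\n", "for vec in vect:\n", "    for clf in models:\n", "        nlp_model = model_pipeline(vec, clf)\n", "        nlp_model.fit(X_train, y_train)\n", "        y_pred = nlp_model.predict(X_test)\n", "        # name the vectorizer/classifier pair being scored\n", "        print(type(vec).__name__, '+', type(clf).__name__)\n", "        print('accuracy: ', accuracy_score(y_test, y_pred))\n", "        print(classification_report(y_test, y_pred))\n", "        print('----------------------')" ] },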
"-----------------------------------------------------------------------------\n", "Pipeline(memory=None,\n", " steps=[('vectorization',\n", " TfidfVectorizer(analyzer='word', binary=False,\n", " decode_error='strict',\n", " dtype=,\n", " encoding='utf-8', input='content',\n", " lowercase=True, max_df=0.95, max_features=5000,\n", " min_df=20, ngram_range=(1, 2), norm='l2',\n", " preprocessor=None, smooth_idf=True,\n", " stop_words='english', strip_accents=None,\n", " sublinear_tf=False,\n", " token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, use_idf=True,\n", " vocabulary=None)),\n", " ('classification',\n", " LogisticRegression(C=10, class_weight=None, dual=False,\n", " fit_intercept=True, intercept_scaling=1,\n", " l1_ratio=None, max_iter=1000,\n", " multi_class='multinomial', n_jobs=None,\n", " penalty='l2', random_state=None,\n", " solver='newton-cg', tol=0.0001, verbose=0,\n", " warm_start=False))],\n", " verbose=False)\n", "-----------------------------------------------------------------------------\n", "Pipeline(memory=None,\n", " steps=[('vectorization',\n", " TfidfVectorizer(analyzer='word', binary=False,\n", " decode_error='strict',\n", " dtype=,\n", " encoding='utf-8', input='content',\n", " lowercase=True, max_df=0.95, max_features=5000,\n", " min_df=20, ngram_range=(1, 2), norm='l2',\n", " preprocessor=None, smooth_idf=True,\n", " stop_words='english', strip_accents=None,\n", " sublinear_tf=False,...\n", " RandomForestClassifier(bootstrap=True, class_weight=None,\n", " criterion='gini', max_depth=None,\n", " max_features='auto',\n", " max_leaf_nodes=None,\n", " min_impurity_decrease=0.0,\n", " min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0,\n", " n_estimators='warn', n_jobs=None,\n", " oob_score=False, random_state=None,\n", " verbose=0, warm_start=False))],\n", " verbose=False)\n", "-----------------------------------------------------------------------------\n" ] } ], "source": [ "X = normlemm_corpus\n", "y = data['category'].values\n", "n = 1560\n", "\n", "X_train, X_test = X[:n], X[n:]\n", "y_train, y_test = y[:n], y[n:]\n", "\n", "vect = [CountVectorizer(), \n", " TfidfVectorizer(max_df=0.95, min_df=20, max_features=5000, stop_words='english', ngram_range=(1,2))]\n", "\n", "models = [LogisticRegression(C = 10, solver='newton-cg', multi_class='multinomial', max_iter=1000),\n", " RandomForestClassifier()\n", " ]\n", "\n", "for i in vect:\n", " for j in models:\n", " nlp_model = model_pipeline(i,j)\n", " print(nlp_model)\n", " print('-----------------------------------------------------------------------------')\n", "# nlp_model.fit(X_train, y_train)\n", "# y_pred = nlp_model.predict(X_test)\n", "# print('accuracy: ', accuracy_score(y_test, y_pred))\n", "# print(classification_report(y_test, y_pred))\n", "# print('----------------------')\n", "# print('----------------------')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }