## Natural Language Processing

We will use the VentureBeat data that we scraped and stored earlier. We begin by loading and inspecting the data, then convert the text into numeric features. Our task is to build a natural language processing model that takes the text of an article and predicts the category it belongs to.

In [1]:
import pandas as pd
import re

In [3]:
df = pd.read_csv('venturebeat2020.csv')
data = df.copy()
data.head()

| Unnamed: 0 | url | category | title | text | date | month | day | length | nwords | lex_div |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://venturebeat.com/2020/03/20/despite-set... | AI | Despite setbacks, coronavirus could hasten the... | This week, nearly every major company developi... | 2020-03-20 | 3 | 20 | 6466 | 1011 | 0.070227 |
| 1 | https://venturebeat.com/2020/03/19/sensor-towe... | Games | Sensor Tower: U.S. iPhone users spent about $5... | U.S. iPhone users spent an average of about $5... | 2020-03-19 | 3 | 19 | 1136 | 200 | 0.29 |
| 2 | https://venturebeat.com/2020/03/19/microsoft-u... | Games | Microsoft unveils DirectX 12 Ultimate with imp... | Microsoft is moving on to the next generation ... | 2020-03-19 | 3 | 19 | 4731 | 783 | 0.067688 |
| 3 | https://venturebeat.com/2020/03/19/sea-of-star... | Games | Sea of Stars is a gorgeous retro-RPG from The ... | Sabotage Studios announced Sea of Stars today,... | 2020-03-19 | 3 | 19 | 898 | 156 | 0.352564 |
| 4 | https://venturebeat.com/2020/03/19/htc-holds-v... | AR/VR | HTC holds virtual media event, sends coronavir... | HTC’s just-concluded Virtual Vive Ecosystem Co... | 2020-03-19 | 3 | 19 | 4030 | 649 | 0.090909 |


### 1. Data Preprocessing

- Tokenization
- Remove stop words
- Remove punctuation
- Stemming
- Lemmatization
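
Before bringing in NLTK, the first three steps can be sketched with the standard library alone. This is only a toy illustration: `toy_tokenize`, `toy_normalize` and the tiny `TOY_STOPWORDS` set are hypothetical names invented for this sketch, and the regular expression is a rough stand-in for a real tokenizer.

```python
import re
import unicodedata

# Hypothetical, tiny stop-word list for illustration only;
# NLTK's English list is much larger.
TOY_STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are"}

def toy_tokenize(text):
    # Split into runs of word characters or single non-space symbols.
    return re.findall(r"\w+|[^\w\s]", text)

def is_punct(token):
    # True when every character is in a Unicode punctuation category (P*).
    return all(unicodedata.category(ch).startswith("P") for ch in token)

def toy_normalize(text):
    # Lowercase, tokenize, then drop punctuation tokens and stop words.
    tokens = toy_tokenize(text.lower())
    return [t for t in tokens if not is_punct(t) and t not in TOY_STOPWORDS]

sample = "The robots are cleaning the cars, and the drivers rest."
print(toy_tokenize(sample))
print(toy_normalize(sample))  # ['robots', 'cleaning', 'cars', 'drivers', 'rest']
```

Stemming and lemmatization would then reduce tokens such as `robots` and `cleaning` to a base form. Those two steps need a language resource, which is why the notebook relies on NLTK for them.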

In [4]:
import nltk
nltk.download('popular', quiet=True)
from nltk import word_tokenize, wordpunct_tokenize
from nltk.corpus import stopwords
import unicodedata

from nltk.stem import SnowballStemmer
from nltk import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn

In [5]:
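# Build the stop-word set once; rebuilding it for every token is very slow.
STOPWORDS = set(stopwords.words('english'))

def is_stopword(token):
    return token.lower() in STOPWORDS

def is_punct(token):
    # True when every character belongs to a Unicode punctuation category (P*)
    return all(unicodedata.category(char).startswith('P') for char in token)

def normalizer(text):
    stem = SnowballStemmer('english')
    
    # lowercase, tokenize and stem every token
    tokenized = [stem.stem(token) for token in word_tokenize(text.lower())]
    
    # Note: the filters run on the stemmed tokens, so a stop word whose stem
    # differs from its original form (e.g. 'very' -> 'veri') may not be removed.
    tokenized = [token for token in tokenized 
                 if not is_punct(token)            # remove punctuation tokens
                 and not is_stopword(token)        # remove stop words
                 and token.isascii()               # remove tokens with non-ASCII characters
               ]
            
    return ' '.join(tokenized)                     # join the token list back into one string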

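def lemmatizer(token, postag):
    lemm = WordNetLemmatizer()
    # map the first letter of the POS tag to a WordNet POS; default to noun
    tag = {
        'N': wn.NOUN,
        'V': wn.VERB,
        'R': wn.ADV,
        'J': wn.ADJ
    }.get(postag[0], wn.NOUN)
    
    return lemm.lemmatize(token, tag)

def normalizer_lemm(text):
    
    # POS-tag the raw tokens so each word is lemmatized with its part of speech
    tagged_tokenized = pos_tag(wordpunct_tokenize(text))
    
    tokenized = [lemmatizer(token, tag).lower()
                 for (token, tag) in tagged_tokenized
                 if not is_punct(token)            # remove punctuation tokens
                 and token.isascii()               # remove tokens with non-ASCII characters
                ]
    
    # remove extended stopwords
    # Note: 'compani' is a stemmed form, so it never matches the lemmatized
    # tokens here; 'company' would be needed to actually drop that word.
    stop_words = stopwords.words('english')
    stop_words.extend(['game', 'compani'])
    stops = set(stop_words)
    tokenized = [token for token in tokenized if token not in stops]
    
    return ' '.join(tokenized)                     # join the token list back into one string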

In [6]:
corpus = data['text'].values.tolist()

In [7]:
norm_corpus = [ normalizer(i) for i in corpus ]
print(corpus[0][:999])
norm_corpus[0][:999]

This week, nearly every major company developing autonomous vehicles in the U.S. halted testing in an effort to stem the spread of COVID-19, which has sickened more than 250,000 people and killed over 10,000 around the world. Still some experts argue pandemics like COVID-19 should hasten the adoption of driverless vehicles for passenger pickup, transportation of goods, and more. Autonomous vehicles still require disinfection — which companies like Alphabet’s Waymo and KiwiBot are conducting manually with sanitation teams — but in some cases, self-driving cars and delivery robots might minimize the risk of spreading disease. In a climate of social distancing, when on-demand services from Instacart to GrubHub have taken steps to minimize human contact, one factor in driverless cars’ favor is that they don’t require a potentially sick person behind the wheel. Tellingly, on Monday, when Waymo grounded its commercial robotaxis with human safety drivers, it initially said it would continue 


'week near everi major compani develop autonom vehicl u.s. halt test effort stem spread covid-19 sicken 250,000 peopl kill 10,000 around world still expert argu pandem like covid-19 hasten adopt driverless vehicl passeng pickup transport good autonom vehicl still requir disinfect compani like alphabet waymo kiwibot conduct manual sanit team case self-driv car deliveri robot might minim risk spread diseas climat social distanc on-demand servic instacart grubhub taken step minim human contact one factor driverless car favor requir potenti sick person behind wheel tell monday waymo ground commerci robotaxi human safeti driver initi said would continu oper driverless autonom car fleet peopl understand theori autonom vehicl reduc spread infect allow social distanc said amit nisenbaum ceo tactil mobil provid tactil data sens technolog allow autonom vehicl detect road bump curvatur hazard compani build fleet autonom vehicl develop solut guidelin general mainten clean steril keep strict clean 

In [8]:
normlemm_corpus = [ normalizer_lemm(i) for i in corpus ]
normlemm_corpus[0][:999]

'week nearly every major company develop autonomous vehicle u halt test effort stem spread covid 19 sicken 250 000 people kill 10 000 around world still expert argue pandemic like covid 19 hasten adoption driverless vehicle passenger pickup transportation good autonomous vehicle still require disinfection company like alphabet waymo kiwibot conduct manually sanitation team case self drive car delivery robot might minimize risk spread disease climate social distancing demand service instacart grubhub take step minimize human contact one factor driverless car favor require potentially sick person behind wheel tellingly monday waymo ground commercial robotaxis human safety driver initially say would continue operate driverless autonomous car fleet people understand theory autonomous vehicle reduce spread infection allow social distancing say amit nisenbaum ceo tactile mobility provider tactile data sense technology allow autonomous vehicle detect road bump curvature hazard companies build

### 2. Feature Extraction: Vectorization + 3. Data Modelling using Pipelines

- Count Vectorizer
- TFIDF Vectorizer
- Classification using ML models
- NLP Pipelines
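
To make the difference between the two vectorizers concrete, here is a minimal hand-rolled sketch in plain Python. The three short documents are made up for illustration. The IDF formula, `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, is meant to mirror the smoothed default of scikit-learn's `TfidfVectorizer`; treat it as an approximation of the idea and not a replacement for the library.

```python
import math
from collections import Counter

# Made-up, already-normalized documents (illustration only).
docs = [
    "game studio launch game",
    "ai model train data",
    "game ai beat player",
]

tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

def count_vector(tokens):
    # Count vectorization: one column per vocabulary term, raw term counts.
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

count_matrix = [count_vector(doc) for doc in tokenized]

# Document frequency: in how many documents each term appears.
n_docs = len(docs)
doc_freq = [sum(1 for row in count_matrix if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency: rarer terms receive larger weights.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_vector(row):
    # Weight the counts by IDF, then scale the row to unit L2 length.
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

tfidf_matrix = [tfidf_vector(row) for row in count_matrix]

print(vocab)
print(count_matrix[0])  # [0, 0, 0, 2, 1, 0, 0, 1, 0]
print([round(v, 3) for v in tfidf_matrix[0]])
```

In the count matrix, `game` dominates the first document simply because it occurs twice. TF-IDF keeps that count but scales it down relative to `launch` and `studio`, since `game` also appears in another document.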

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from yellowbrick.text.freqdist import FreqDistVisualizer
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [10]:
def model_pipeline(vectorizer, classifier):
    
    steps = [('vectorization',  vectorizer),
             ('classification', classifier)
        ]
    pipe = Pipeline(steps)
    
    return pipe
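
What the pipeline buys us is that the vectorizer learns its vocabulary from the training data only, and new data is merely transformed before prediction. The sketch below shows that chaining with toy classes in plain Python. `ToyCountVectorizer`, `ToyOverlapClassifier` and `ToyPipeline` are hypothetical stand-ins written for this illustration; they are not scikit-learn's implementation.

```python
from collections import Counter

class ToyCountVectorizer:
    """Hypothetical stand-in for a vectorizer: learns a vocabulary, then counts."""
    def fit_transform(self, docs):
        self.vocab_ = sorted({tok for d in docs for tok in d.split()})
        return self.transform(docs)

    def transform(self, docs):
        # Words that were not seen during fitting are ignored.
        rows = []
        for d in docs:
            counts = Counter(d.split())
            rows.append([counts[t] for t in self.vocab_])
        return rows

class ToyOverlapClassifier:
    """Hypothetical stand-in for a classifier: scores classes by word overlap."""
    def fit(self, X, y):
        # Sum the count vectors of each class into one profile per label.
        self.profiles_ = {}
        for row, label in zip(X, y):
            profile = self.profiles_.setdefault(label, [0] * len(row))
            for j, v in enumerate(row):
                profile[j] += v
        return self

    def predict(self, X):
        def score(row, profile):
            return sum(a * b for a, b in zip(row, profile))
        return [max(self.profiles_, key=lambda c: score(row, self.profiles_[c]))
                for row in X]

class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Fit and apply every transformer, then fit the final estimator.
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Only transform here: nothing is re-learned from unseen data.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)

pipe = ToyPipeline([("vectorization", ToyCountVectorizer()),
                    ("classification", ToyOverlapClassifier())])
pipe.fit(["game studio launch game", "ai model train data"], ["Games", "AI"])
print(pipe.predict(["new game launch", "train ai model"]))  # ['Games', 'AI']
```

The word `new` never appears in the training documents, so it is dropped at transform time. The same holds in the real pipeline: test articles cannot leak vocabulary into the fitted model.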

In [28]:
X = normlemm_corpus
y = data['category'].values
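n = 1560                                   # number of articles used for training

# positional split without shuffling: the first n rows train, the rest are held out
X_train, X_test = X[:n], X[n:]
y_train, y_test = y[:n], y[n:]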

vect = [CountVectorizer(), 
        TfidfVectorizer(max_df=0.95, min_df=20, max_features=5000, stop_words='english', ngram_range=(1,2))]

models = [LogisticRegression(C = 10, solver='newton-cg', multi_class='multinomial', max_iter=1000),
          RandomForestClassifier()
         ]

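# build one pipeline for every vectorizer/classifier combination
for vectorizer in vect:
    for classifier in models:
        nlp_model = model_pipeline(vectorizer, classifier)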
        print(nlp_model)
        print('-----------------------------------------------------------------------------')
#         nlp_model.fit(X_train, y_train)
#         y_pred = nlp_model.predict(X_test)
#         print('accuracy: ', accuracy_score(y_test, y_pred))
#         print(classification_report(y_test, y_pred))
#         print('----------------------')
#         print('----------------------')

Pipeline(memory=None,
         steps=[('vectorization',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('classification',
                 LogisticRegression(C=10, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=1000,
                                    multi_class='multinomial', n_jobs=None,