{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Natural Language Processing\n", "\n", "We will be using the venturebeat data that we have scrapped and stored. We will begin with loading the data, inspecting it and then convert text into numeric features. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1961, 4)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlcategorytitletext
0https://venturebeat.com/2020/03/20/despite-set...AIDespite setbacks, coronavirus could hasten the...This week, nearly every major company developi...
1https://venturebeat.com/2020/03/19/sensor-towe...GamesSensor Tower: U.S. iPhone users spent about $5...U.S. iPhone users spent an average of about $5...
2https://venturebeat.com/2020/03/19/microsoft-u...GamesMicrosoft unveils DirectX 12 Ultimate with imp...Microsoft is moving on to the next generation ...
3https://venturebeat.com/2020/03/19/sea-of-star...GamesSea of Stars is a gorgeous retro-RPG from The ...Sabotage Studios announced Sea of Stars today,...
4https://venturebeat.com/2020/03/19/htc-holds-v...AR/VRHTC holds virtual media event, sends coronavir...HTC’s just-concluded Virtual Vive Ecosystem Co...
\n", "
" ], "text/plain": [ " url category \\\n", "0 https://venturebeat.com/2020/03/20/despite-set... AI \n", "1 https://venturebeat.com/2020/03/19/sensor-towe... Games \n", "2 https://venturebeat.com/2020/03/19/microsoft-u... Games \n", "3 https://venturebeat.com/2020/03/19/sea-of-star... Games \n", "4 https://venturebeat.com/2020/03/19/htc-holds-v... AR/VR \n", "\n", " title \\\n", "0 Despite setbacks, coronavirus could hasten the... \n", "1 Sensor Tower: U.S. iPhone users spent about $5... \n", "2 Microsoft unveils DirectX 12 Ultimate with imp... \n", "3 Sea of Stars is a gorgeous retro-RPG from The ... \n", "4 HTC holds virtual media event, sends coronavir... \n", "\n", " text \n", "0 This week, nearly every major company developi... \n", "1 U.S. iPhone users spent an average of about $5... \n", "2 Microsoft is moving on to the next generation ... \n", "3 Sabotage Studios announced Sea of Stars today,... \n", "4 HTC’s just-concluded Virtual Vive Ecosystem Co... " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('venturebeat2020.csv')\n", "print(df.shape)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 1961 entries, 0 to 1960\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 url 1961 non-null object\n", " 1 category 1961 non-null object\n", " 2 title 1961 non-null object\n", " 3 text 1961 non-null object\n", "dtypes: object(4)\n", "memory usage: 61.4+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our task/problem here is to build a natural language processing model that can take the information of the article and determine the topic it belongs to. 
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data = df.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Data Preprocessing\n", "\n", "We can extract date, month and day from the url using regular expression and datatime functionalities. We can also add a length and nwords column that represent the number of characters and the number of words in the article text, respectively. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def extract_date(string):\n", " match = re.search(r'\\d{4}/\\d{1,2}/\\d{1,2}', str(string))\n", " return match.group() " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlcategorytitletextdatemonthdaylengthnwords
0https://venturebeat.com/2020/03/20/despite-set...AIDespite setbacks, coronavirus could hasten the...This week, nearly every major company developi...2020-03-2032064661011
1https://venturebeat.com/2020/03/19/sensor-towe...GamesSensor Tower: U.S. iPhone users spent about $5...U.S. iPhone users spent an average of about $5...2020-03-193191136200
2https://venturebeat.com/2020/03/19/microsoft-u...GamesMicrosoft unveils DirectX 12 Ultimate with imp...Microsoft is moving on to the next generation ...2020-03-193194731783
3https://venturebeat.com/2020/03/19/sea-of-star...GamesSea of Stars is a gorgeous retro-RPG from The ...Sabotage Studios announced Sea of Stars today,...2020-03-19319898156
4https://venturebeat.com/2020/03/19/htc-holds-v...AR/VRHTC holds virtual media event, sends coronavir...HTC’s just-concluded Virtual Vive Ecosystem Co...2020-03-193194030649
\n", "
" ], "text/plain": [ " url category \\\n", "0 https://venturebeat.com/2020/03/20/despite-set... AI \n", "1 https://venturebeat.com/2020/03/19/sensor-towe... Games \n", "2 https://venturebeat.com/2020/03/19/microsoft-u... Games \n", "3 https://venturebeat.com/2020/03/19/sea-of-star... Games \n", "4 https://venturebeat.com/2020/03/19/htc-holds-v... AR/VR \n", "\n", " title \\\n", "0 Despite setbacks, coronavirus could hasten the... \n", "1 Sensor Tower: U.S. iPhone users spent about $5... \n", "2 Microsoft unveils DirectX 12 Ultimate with imp... \n", "3 Sea of Stars is a gorgeous retro-RPG from The ... \n", "4 HTC holds virtual media event, sends coronavir... \n", "\n", " text date month day \\\n", "0 This week, nearly every major company developi... 2020-03-20 3 20 \n", "1 U.S. iPhone users spent an average of about $5... 2020-03-19 3 19 \n", "2 Microsoft is moving on to the next generation ... 2020-03-19 3 19 \n", "3 Sabotage Studios announced Sea of Stars today,... 2020-03-19 3 19 \n", "4 HTC’s just-concluded Virtual Vive Ecosystem Co... 2020-03-19 3 19 \n", "\n", " length nwords \n", "0 6466 1011 \n", "1 1136 200 \n", "2 4731 783 \n", "3 898 156 \n", "4 4030 649 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['date'] = pd.to_datetime(data['url'].apply(extract_date))\n", "data['month'] = data['date'].dt.month\n", "data['day'] = data['date'].dt.day\n", "\n", "data['length'] = data['text'].str.len()\n", "data['nwords'] = data['text'].str.split().str.len()\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Lexical diversity__ is one aspect of 'lexical richness' and refers to the ratio of different unique words to the total number of words. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
urlcategorytitletextdatemonthdaylengthnwordslex_div
0https://venturebeat.com/2020/03/20/despite-set...AIDespite setbacks, coronavirus could hasten the...This week, nearly every major company developi...2020-03-20320646610110.070227
1https://venturebeat.com/2020/03/19/sensor-towe...GamesSensor Tower: U.S. iPhone users spent about $5...U.S. iPhone users spent an average of about $5...2020-03-1931911362000.290000
2https://venturebeat.com/2020/03/19/microsoft-u...GamesMicrosoft unveils DirectX 12 Ultimate with imp...Microsoft is moving on to the next generation ...2020-03-1931947317830.067688
3https://venturebeat.com/2020/03/19/sea-of-star...GamesSea of Stars is a gorgeous retro-RPG from The ...Sabotage Studios announced Sea of Stars today,...2020-03-193198981560.352564
4https://venturebeat.com/2020/03/19/htc-holds-v...AR/VRHTC holds virtual media event, sends coronavir...HTC’s just-concluded Virtual Vive Ecosystem Co...2020-03-1931940306490.090909
\n", "
" ], "text/plain": [ " url category \\\n", "0 https://venturebeat.com/2020/03/20/despite-set... AI \n", "1 https://venturebeat.com/2020/03/19/sensor-towe... Games \n", "2 https://venturebeat.com/2020/03/19/microsoft-u... Games \n", "3 https://venturebeat.com/2020/03/19/sea-of-star... Games \n", "4 https://venturebeat.com/2020/03/19/htc-holds-v... AR/VR \n", "\n", " title \\\n", "0 Despite setbacks, coronavirus could hasten the... \n", "1 Sensor Tower: U.S. iPhone users spent about $5... \n", "2 Microsoft unveils DirectX 12 Ultimate with imp... \n", "3 Sea of Stars is a gorgeous retro-RPG from The ... \n", "4 HTC holds virtual media event, sends coronavir... \n", "\n", " text date month day \\\n", "0 This week, nearly every major company developi... 2020-03-20 3 20 \n", "1 U.S. iPhone users spent an average of about $5... 2020-03-19 3 19 \n", "2 Microsoft is moving on to the next generation ... 2020-03-19 3 19 \n", "3 Sabotage Studios announced Sea of Stars today,... 2020-03-19 3 19 \n", "4 HTC’s just-concluded Virtual Vive Ecosystem Co... 2020-03-19 3 19 \n", "\n", " length nwords lex_div \n", "0 6466 1011 0.070227 \n", "1 1136 200 0.290000 \n", "2 4731 783 0.067688 \n", "3 898 156 0.352564 \n", "4 4030 649 0.090909 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def lexical_diversity(text):\n", " return len( set(text) ) / len( text.split() )\n", "\n", "data['text'] = data['text'].astype(str)\n", "data['lex_div'] = data['text'].apply(lexical_diversity)\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1961\n" ] } ], "source": [ "corpus = data['text'].values.tolist()\n", "print(len(corpus))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1a. Tokenization\n", "\n", "Tokenization is the process of splitting text into meaningul elements called tokens." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "nltk.download('popular', quiet=True)\n", "from nltk import word_tokenize, wordpunct_tokenize" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['I', 'have', \"n't\", 'watched', 'the', 'show', 'at', 'the', 'theatre', '.']\n" ] } ], "source": [ "example = \"I haven't watched the show at the theatre.\"\n", "tokenized = nltk.word_tokenize(example)\n", "print(tokenized)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['I', \"haven't\", 'watched', 'the', 'show', 'at', 'the', 'theatre.']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example.split()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['I', 'haven', \"'\", 't', 'watched', 'the', 'show', 'at', 'the', 'theatre', '.']\n" ] } ], "source": [ "print( wordpunct_tokenize(example) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The simplest vector encoding model is to simply fill in the vector with the frequency of each word as it appears in the document. \n", "\n", "### 1b. 
Stopwords" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import stopwords" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['I', 'have', \"n't\", 'watched', 'the', 'show', 'at', 'the', 'theatre', '.']\n", "[True, True, False, False, True, False, True, True, False, False]\n" ] } ], "source": [ "# Build the stopword set once; rebuilding it for every token is slow on a full corpus\n", "STOPWORDS = set(stopwords.words('english'))\n", "\n", "def is_stopword(token):\n", "    # Case-insensitive membership test against NLTK's English stopword list\n", "    return token.lower() in STOPWORDS\n", "\n", "print(tokenized)\n", "print([is_stopword(i) for i in tokenized])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['I', \"haven't\", 'watched', 'the', 'show', 'at', 'the', 'theatre.']\n", "[True, True, False, True, False, True, True, False]\n" ] } ], "source": [ "print(example.split())\n", "print([is_stopword(i) for i in example.split()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1c. 
Punctuation" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['I', 'have', \"n't\", 'watched', 'the', 'show', 'at', 'the', 'theatre', '.']\n", "[False, False, False, False, False, False, False, False, False, True]\n" ] } ], "source": [ "import unicodedata\n", "\n", "def is_punct(token):\n", "    # True when every character belongs to a Unicode punctuation category (P*)\n", "    return all(unicodedata.category(char).startswith('P') for char in token)\n", "\n", "print(tokenized)\n", "print([is_punct(i) for i in tokenized])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['I', 'haven', \"'\", 't', 'watched', 'the', 'show', 'at', 'the', 'theatre', '.']\n", "[False, False, True, False, False, False, False, False, False, False, True]\n" ] } ], "source": [ "print(wordpunct_tokenize(example))\n", "print([is_punct(i) for i in wordpunct_tokenize(example)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1d. 
Stemming" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from nltk.stem import SnowballStemmer" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"I haven't watched the show at the theatre.\"]\n", "['I', 'have', \"n't\", 'watched', 'the', 'show', 'at', 'the', 'theatre', '.']\n", "['i', 'have', \"n't\", 'watch', 'the', 'show', 'at', 'the', 'theatr', '.']\n" ] } ], "source": [ "stemmer = SnowballStemmer('english')\n", "stemmed = [ stemmer.stem(token) for token in tokenized ]\n", "print( [example] )\n", "print(tokenized)\n", "print(stemmed)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def normalizer(text):\n", "    # Lowercase, tokenize, and stem every token\n", "    stem = nltk.stem.SnowballStemmer('english')\n", "    tokenized = [stem.stem(token) for token in nltk.word_tokenize(text.lower())]\n", "\n", "    # Filtering runs on the stems, so a stopword whose stem differs from its\n", "    # listed form (for example 'only' -> 'onli') is not removed here\n", "    tokenized = [token for token in tokenized\n", "                 if not is_punct(token)      # drop punctuation-only tokens\n", "                 and not is_stopword(token)  # drop stopwords\n", "                 and token.isascii()         # drop tokens with non-ASCII characters\n", "                 ]\n", "\n", "    # Join the tokens back into one string, since the vectorizers take raw text\n", "    return ' '.join(tokenized)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I haven't watched the show at the theatre.\n", "---> n't watch show theatr\n" ] } ], "source": [ "print( example )\n", "print( '---> ' + normalizer(example) )" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This week, nearly every major company developing autonomous vehicles in the U.S. halted testing in an effort to stem the spread of COVID-19, which has sickened more than 250,000 people and killed over 10,000 around the world. 
Still some experts argue pandemics like COVID-19 should hasten the adoption of driverless vehicles for passenger pickup, transportation of goods, and more. Autonomous vehicles still require disinfection — which companies like Alphabet’s Waymo and KiwiBot are conducting manually with sanitation teams — but in some cases, self-driving cars and delivery robots might minimize the risk of spreading disease. In a climate of social distancing, when on-demand services from Instacart to GrubHub have taken steps to minimize human contact, one factor in driverless cars’ favor is that they don’t require a potentially sick person behind the wheel. Tellingly, on Monday, when Waymo grounded its commercial robotaxis with human safety drivers, it initially said it would continue \n" ] }, { "data": { "text/plain": [ "'week near everi major compani develop autonom vehicl u.s. halt test effort stem spread covid-19 sicken 250,000 peopl kill 10,000 around world still expert argu pandem like covid-19 hasten adopt driverless vehicl passeng pickup transport good autonom vehicl still requir disinfect compani like alphabet waymo kiwibot conduct manual sanit team case self-driv car deliveri robot might minim risk spread diseas climat social distanc on-demand servic instacart grubhub taken step minim human contact one factor driverless car favor requir potenti sick person behind wheel tell monday waymo ground commerci robotaxi human safeti driver initi said would continu oper driverless autonom car fleet peopl understand theori autonom vehicl reduc spread infect allow social distanc said amit nisenbaum ceo tactil mobil provid tactil data sens technolog allow autonom vehicl detect road bump curvatur hazard compani build fleet autonom vehicl develop solut guidelin general mainten clean steril keep strict clean '" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "norm_corpus = [ normalizer(i) for i in corpus ]\n", "print(corpus[0][:999])\n", 
"norm_corpus[0][:999]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1e. Lemmatization" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from nltk import pos_tag\n", "from nltk.stem.wordnet import WordNetLemmatizer\n", "from nltk.corpus import wordnet as wn" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def is_punct(token):\n", " return all(unicodedata.category(char).startswith('P') for char in token)\n", "\n", "def lemmatizer(token, postag):\n", " lemm = WordNetLemmatizer()\n", " tag= {\n", " 'N':wn.NOUN,\n", " 'V':wn.VERB,\n", " 'R':wn.ADV,\n", " 'J':wn.ADJ\n", " }.get(postag[0], wn.NOUN)\n", " \n", " return lemm.lemmatize(token, tag)\n", "\n", "def normalizer_lemm(text):\n", " \n", " tagged_tokenized = pos_tag(wordpunct_tokenize(text))\n", " \n", " tokenized = [ lemmatizer(token, tag).lower() \n", " for (token, tag) in tagged_tokenized\n", " if not is_punct(token) \n", " and token.isascii()\n", " ]\n", " \n", " # remove extended stopwords\n", " stop_words = stopwords.words('english')\n", " stop_words.extend(['game', 'compani'])\n", " stops = set(stop_words)\n", " tokenized = [token for token in tokenized if not token in stops]\n", " \n", " return ' '.join(tokenized) # join b/c we are inputting a list" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'week nearly every major company develop autonomous vehicle u halt test effort stem spread covid 19 sicken 250 000 people kill 10 000 around world still expert argue pandemic like covid 19 hasten adoption driverless vehicle passenger pickup transportation good autonomous vehicle still require disinfection company like alphabet waymo kiwibot conduct manually sanitation team case self drive car delivery robot might minimize risk spread disease climate social distancing demand service instacart grubhub take step minimize human contact one factor 
driverless car favor require potentially sick person behind wheel tellingly monday waymo ground commercial robotaxis human safety driver initially say would continue operate driverless autonomous car fleet people understand theory autonomous vehicle reduce spread infection allow social distancing say amit nisenbaum ceo tactile mobility provider tactile data sense technology allow autonomous vehicle detect road bump curvature hazard companies build'" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normlemm_corpus = [ normalizer_lemm(i) for i in corpus ]\n", "normlemm_corpus[0][:999]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Feature Extraction: Vectorization\n", "\n", "The simplest vector encoding is the bag-of-words model: each position in the vector corresponds to one word in the vocabulary and holds the number of times that word appears in the document." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "defaultdict(int,\n", " {'I': 1,\n", " 'have': 1,\n", " \"n't\": 1,\n", " 'watched': 1,\n", " 'the': 2,\n", " 'show': 1,\n", " 'at': 1,\n", " 'theatre': 1,\n", " '.': 1})" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count how often each token occurs in the example sentence\n", "words = defaultdict(int)\n", "for token in word_tokenize(example):\n", "    words[token] += 1\n", "words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2a. Count Vectorizer\n", "\n", "scikit-learn provides a `CountVectorizer` transformer that builds this word-count representation for us.
" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<1961x27030 sparse matrix of type ''\n", "\twith 485489 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer = CountVectorizer()\n", "vector = vectorizer.fit_transform(norm_corpus)\n", "vector" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 5, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " ...,\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 1, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0]], dtype=int64)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.toarray()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "27030\n" ] } ], "source": [ "features = vectorizer.get_feature_names()\n", "nfeatures = len(features)\n", "print(nfeatures)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'week': 26168,\n", " 'near': 16403,\n", " 'everi': 8893,\n", " 'major': 14831,\n", " 'compani': 5902,\n", " 'develop': 7347,\n", " 'autonom': 3174,\n", " 'vehicl': 25535,\n", " 'halt': 11092,\n", " 'test': 23835,\n", " 'effort': 8285,\n", " 'stem': 22765,\n", " 'spread': 22559,\n", " 'covid': 6368,\n", " '19': 323,\n", " 'sicken': 21748,\n", " '250': 707,\n", " '000': 1,\n", " 'peopl': 18033,\n", " 'kill': 13483,\n", " '10': 72,\n", " 'around': 2866,\n", " 'world': 26507,\n", " 'still': 22810,\n", " 'expert': 9009,\n", " 'argu': 2820,\n", " 'pandem': 17726,\n", " 'like': 14279,\n", " 'hasten': 11235,\n", " 'adopt': 
1956,\n", " 'driverless': 7962,\n", " 'passeng': 17854,\n", " 'pickup': 18260,\n", " 'transport': 24421,\n", " 'good': 10645,\n", " 'requir': 20150,\n", " 'disinfect': 7596,\n", " 'alphabet': 2320,\n", " 'waymo': 26106,\n", " 'kiwibot': 13558,\n", " 'conduct': 6006,\n", " 'manual': 14936,\n", " 'sanit': 20913,\n", " 'team': 23629,\n", " 'case': 4900,\n", " 'self': 21319,\n", " 'driv': 7955,\n", " 'car': 4799,\n", " 'deliveri': 7172,\n", " 'robot': 20505,\n", " 'might': 15582,\n", " 'minim': 15673,\n", " 'risk': 20430,\n", " 'diseas': 7574,\n", " 'climat': 5570,\n", " 'social': 22203,\n", " 'distanc': 7654,\n", " 'on': 17174,\n", " 'demand': 7190,\n", " 'servic': 21430,\n", " 'instacart': 12552,\n", " 'grubhub': 10902,\n", " 'taken': 23489,\n", " 'step': 22770,\n", " 'human': 11856,\n", " 'contact': 6117,\n", " 'one': 17187,\n", " 'factor': 9137,\n", " 'favor': 9272,\n", " 'potenti': 18694,\n", " 'sick': 21747,\n", " 'person': 18105,\n", " 'behind': 3654,\n", " 'wheel': 26247,\n", " 'tell': 23731,\n", " 'monday': 15915,\n", " 'ground': 10881,\n", " 'commerci': 5856,\n", " 'robotaxi': 20506,\n", " 'safeti': 20810,\n", " 'driver': 7958,\n", " 'initi': 12469,\n", " 'said': 20824,\n", " 'would': 26534,\n", " 'continu': 6139,\n", " 'oper': 17269,\n", " 'fleet': 9577,\n", " 'understand': 24964,\n", " 'theori': 23918,\n", " 'reduc': 19868,\n", " 'infect': 12399,\n", " 'allow': 2292,\n", " 'amit': 2417,\n", " 'nisenbaum': 16690,\n", " 'ceo': 5078,\n", " 'tactil': 23457,\n", " 'mobil': 15841,\n", " 'provid': 19143,\n", " 'data': 6897,\n", " 'sens': 21361,\n", " 'technolog': 23660,\n", " 'detect': 7326,\n", " 'road': 20471,\n", " 'bump': 4505,\n", " 'curvatur': 6674,\n", " 'hazard': 11264,\n", " 'build': 4479,\n", " 'solut': 22272,\n", " 'guidelin': 10954,\n", " 'general': 10321,\n", " 'mainten': 14822,\n", " 'clean': 5521,\n", " 'steril': 22783,\n", " 'keep': 13363,\n", " 'strict': 22935,\n", " 'schedul': 21084,\n", " 'check': 5237,\n", " 'along': 2310,\n", " 'alreadi': 
2328,\n", " 'exist': 8977,\n", " 'in': 12248,\n", " 'cabin': 4620,\n", " 'monitor': 15930,\n", " 'abl': 1738,\n", " 'handl': 11127,\n", " 'it': 12887,\n", " 'dmitri': 7715,\n", " 'polishchuk': 18567,\n", " 'head': 11292,\n", " 'yandex': 26736,\n", " 'believ': 3673,\n", " 'abil': 1737,\n", " 'appeal': 2689,\n", " 'well': 26195,\n", " 'rider': 20374,\n", " 'someth': 22289,\n", " 'point': 18538,\n", " 'declin': 7049,\n", " 'pick': 18254,\n", " 'intel': 12594,\n", " 'campus': 4712,\n", " 'chandler': 5159,\n", " 'arizona': 2838,\n", " 'hear': 11330,\n", " 'report': 20123,\n", " 'employe': 8506,\n", " 'posit': 18665,\n", " 'strong': 22953,\n", " 'motiv': 16034,\n", " 'us': 25327,\n", " 'told': 24198,\n", " 'venturebeat': 25565,\n", " 'via': 25635,\n", " 'email': 8440,\n", " 'take': 23486,\n", " 'precautionari': 18769,\n", " 'measur': 15232,\n", " 'make': 14838,\n", " 'ride': 20373,\n", " 'safe': 20803,\n", " 'possibl': 18672,\n", " 'cleanli': 5523,\n", " 'use': 25342,\n", " 'best': 3752,\n", " 'practic': 18733,\n", " 'appli': 2704,\n", " 'taxi': 23601,\n", " 'shar': 21542,\n", " 'services': 21435,\n", " 'cours': 6343,\n", " 'deploy': 7260,\n", " 'unlik': 25107,\n", " 'move': 16059,\n", " 'forward': 9791,\n", " 'short': 21670,\n", " 'term': 23807,\n", " 'paus': 17905,\n", " 'govern': 10695,\n", " 'focus': 9669,\n", " 'realloc': 19714,\n", " 'resourc': 20193,\n", " 'freez': 9901,\n", " 'budget': 4456,\n", " 'cope': 6204,\n", " 'fallout': 9179,\n", " 'time': 24095,\n", " 'ramp': 19571,\n", " 'necessari': 16413,\n", " 'legisl': 14079,\n", " 'get': 10398,\n", " 'even': 8879,\n", " 'vast': 25501,\n", " 'lack': 13814,\n", " 'access': 1799,\n", " 'instanc': 12557,\n", " 'public': 19207,\n", " 'live': 14389,\n", " 'onli': 17206,\n", " 'phoenix': 18214,\n", " 'limit': 14297,\n", " 'number': 16923,\n", " 'custom': 6684,\n", " 'moment': 15907,\n", " 'nuro': 16932,\n", " 'r2': 19471,\n", " 'exclus': 8956,\n", " 'carri': 4866,\n", " 'groceri': 10869,\n", " 'essenti': 8795,\n", " 
'rather': 19631,\n", " 'occup': 17032,\n", " 'regul': 19949,\n", " 'shown': 21700,\n", " 'willing': 26364,\n", " 'cut': 6690,\n", " 'red': 19839,\n", " 'tape': 23559,\n", " 'rover': 20637,\n", " 'februari': 9303,\n", " 'receiv': 19776,\n", " 'first': 9508,\n", " 'exempt': 8964,\n", " 'depart': 7252,\n", " 'add': 1899,\n", " 'conveni': 6166,\n", " 'perceiv': 18041,\n", " 'without': 26424,\n", " 'trust': 24594,\n", " 'life': 14228,\n", " 'therein': 23932,\n", " 'lie': 14220,\n", " 'differ': 7456,\n", " 'accept': 1797,\n", " 'societi': 22207,\n", " 'much': 16104,\n", " 'faster': 9250,\n", " 'starship': 22694,\n", " 'sever': 21465,\n", " 'deliv': 7169,\n", " 'item': 12894,\n", " 'local': 14439,\n", " 'busi': 4553,\n", " 'observ': 17005,\n", " 'increas': 12312,\n", " 'order': 17332,\n", " 'volum': 25851,\n", " 'recent': 19780,\n", " 'earli': 8154,\n", " 'conclud': 5985,\n", " 'whether': 26262,\n", " 'relat': 19997,\n", " 'restaur': 20209,\n", " 'side': 21751,\n", " 'say': 21008,\n", " 'uptick': 25302,\n", " 'interest': 12624,\n", " 'citi': 5460,\n", " 'san': 20893,\n", " 'francisco': 9854,\n", " 'new': 16536,\n", " 'york': 26811,\n", " 'enact': 8523,\n", " 'mandatori': 14904,\n", " 'closur': 5612,\n", " 'shelter': 21591,\n", " 'plac': 18376,\n", " 'abov': 1751,\n", " 'fulli': 10018,\n", " 'jaguar': 12953,\n", " 'pac': 17652,\n", " 'electr': 8361,\n", " 'suv': 23300,\n", " 'nichola': 16617,\n", " 'farhi': 9226,\n", " 'partner': 17840,\n", " 'oc': 17020,\n", " 'strategi': 22899,\n", " 'consult': 6106,\n", " 'work': 26482,\n", " 'client': 5562,\n", " 'automot': 3169,\n", " 'think': 23964,\n", " 'chief': 5299,\n", " 'challeng': 5141,\n", " 'scale': 21027,\n", " 'meet': 15295,\n", " 'easier': 8176,\n", " 'hire': 11559,\n", " '100': 73,\n", " 'amazon': 2378,\n", " 'announc': 2562,\n", " 'notic': 16838,\n", " 'neolix': 16456,\n", " 'claim': 5481,\n", " 'mid': 15556,\n", " 'march': 14959,\n", " 'began': 3636,\n", " 'sanitari': 20914,\n", " 'suppli': 23228,\n", " 'mask': 
15051,\n", " 'antibacteri': 2606,\n", " 'gel': 10305,\n", " 'hygien': 11926,\n", " 'product': 19002,\n", " 'communiti': 5893,\n", " 'berkeley': 3726,\n", " 'denver': 7245,\n", " 'alibaba': 2256,\n", " 'jd': 13016,\n", " 'com': 5818,\n", " 'ecommerc': 8218,\n", " 'book': 4129,\n", " '200': 422,\n", " 'last': 13923,\n", " 'two': 24734,\n", " 'month': 15958,\n", " '125': 174,\n", " 'may': 15136,\n", " '2019': 454,\n", " 'purchas': 19255,\n", " 'spur': 22579,\n", " 'chines': 5318,\n", " 'offer': 17071,\n", " 'subsid': 23056,\n", " '60': 1262,\n", " 'cost': 6293,\n", " 'anticip': 2609,\n", " 'bring': 4356,\n", " 'sale': 20846,\n", " 'van': 25463,\n", " 'end': 8546,\n", " 'year': 26760,\n", " 'china': 5315,\n", " 'medic': 15262,\n", " 'supplement': 23226,\n", " 'labor': 13801,\n", " 'shortag': 21671,\n", " 'area': 2812,\n", " 'hit': 11572,\n", " 'hardest': 11186,\n", " 'partnership': 17843,\n", " 'apollo': 2678,\n", " 'baidu': 3349,\n", " 'platform': 18415,\n", " 'also': 2330,\n", " 'food': 9703,\n", " 'health': 11317,\n", " 'worker': 26487,\n", " 'beij': 3660,\n", " 'care': 4820,\n", " 'fallen': 9175,\n", " 'ill': 12117,\n", " 'despit': 7311,\n", " 'obvious': 17019,\n", " 'advantag': 1978,\n", " 'dure': 8087,\n", " 'crisi': 6479,\n", " 'face': 9117,\n", " 'percept': 18045,\n", " 'battl': 3546,\n", " 'studi': 22975,\n", " 'publish': 19211,\n", " 'brook': 4397,\n", " 'institut': 12566,\n", " 'anoth': 2578,\n", " 'advoc': 1998,\n", " 'highway': 11529,\n", " 'auto': 3141,\n", " 'aha': 2114,\n", " 'found': 9801,\n", " 'convinc': 6176,\n", " 'respond': 20200,\n", " 'poll': 18572,\n", " 'inclin': 12288,\n", " 'almost': 2306,\n", " '70': 1373,\n", " 'survey': 23273,\n", " 'express': 9035,\n", " 'concern': 5978,\n", " 'share': 21546,\n", " 'reason': 19738,\n", " 'predict': 18797,\n", " 'happen': 11163,\n", " 'slowli': 22068,\n", " 'extrem': 9065,\n", " 'caution': 4976,\n", " 'thing': 23959,\n", " 'eventu': 8885,\n", " 'return': 20266,\n", " 'normal': 16798,\n", " 'mani': 
14917,\n", " 'consum': 6108,\n", " 'wari': 26027,\n", " 'long': 14492,\n", " 'befor': 3631,\n", " 'beyond': 3776,\n", " 'recuper': 19833,\n", " 'frontier': 9969,\n", " 'tech': 23645,\n", " 'benefit': 3699,\n", " 'soon': 22314,\n", " 'hope': 11719,\n", " 'iphon': 12774,\n", " 'user': 25348,\n", " 'spent': 22471,\n", " 'averag': 3206,\n", " '53': 1170,\n", " '80': 1479,\n", " 'game': 10149,\n", " 'accord': 1814,\n", " 'sensor': 21373,\n", " 'tower': 24309,\n", " '22': 653,\n", " '2018': 453,\n", " 'app': 2682,\n", " 'spend': 22469,\n", " 'reach': 19682,\n", " 'largest': 13905,\n", " 'segment': 21301,\n", " 'repres': 20127,\n", " '54': 1180,\n", " 'growth': 10899,\n", " '2017': 452,\n", " 'saw': 21001,\n", " 'steadi': 22730,\n", " 'awhil': 3245,\n", " '2015': 450,\n", " 'sit': 21896,\n", " '23': 671,\n", " 'free': 9885,\n", " 'to': 24168,\n", " 'play': 18427,\n", " 'earn': 8160,\n", " '99': 1664,\n", " 'candi': 4727,\n", " 'crush': 6544,\n", " 'saga': 20815,\n", " 'top': 24243,\n", " 'among': 2428,\n", " 'minecraft': 15660,\n", " 'premium': 18839,\n", " 'titl': 24147,\n", " 'healthi': 11322,\n", " 'photo': 18223,\n", " 'video': 25662,\n", " 'categori': 4953,\n", " 'includ': 12289,\n", " 'youtub': 26836,\n", " 'editor': 8249,\n", " 'picsart': 18263,\n", " 'actual': 1879,\n", " 'eclips': 8210,\n", " 'less': 14133,\n", " '30': 812,\n", " 'massiv': 15064,\n", " '75': 1417,\n", " 'doe': 7754,\n", " 'money': 15921,\n", " 'rideshar': 20377,\n", " 'uber': 24771,\n", " 'commerc': 5855,\n", " 'store': 22857,\n", " 'microsoft': 15549,\n", " 'next': 16574,\n", " 'generat': 10325,\n", " 'directx': 7517,\n", " 'api': 2670,\n", " 'window': 26376,\n", " 'xbox': 26632,\n", " '12': 158,\n", " 'ultim': 24819,\n", " 'enabl': 8521,\n", " 'unlock': 25112,\n", " 'hardwar': 11189,\n", " 'featur': 9300,\n", " 'pc': 17947,\n", " 'seri': 21413,\n", " 'method': 15457,\n", " 'improv': 12241,\n", " 'visual': 25763,\n", " 'perform': 18063,\n", " 'name': 16274,\n", " 'built': 4485,\n", " 'attempt': 
3064,\n", " 'togeth': 24185,\n", " 'cutting': 6697,\n", " 'edg': 8238,\n", " 'multipl': 16143,\n", " 'becaus': 3604,\n", " 'maintain': 14820,\n", " 'compat': 5912,\n", " 'older': 17126,\n", " 'gpus': 10711,\n", " 'consol': 6086,\n", " 'support': 23231,\n", " 'optim': 17302,\n", " 'proper': 19079,\n", " 'made': 14757,\n", " 'lead': 14008,\n", " 'virtuous': 25739,\n", " 'cycl': 6749,\n", " 'longer': 14495,\n", " 'hold': 11630,\n", " 'back': 3295,\n", " 'want': 26006,\n", " 'break': 4289,\n", " 'unifi': 25057,\n", " 'graphic': 10759,\n", " 'releas': 20007,\n", " 'million': 15633,\n", " 'dx12': 8116,\n", " 'card': 4808,\n", " 'set': 21442,\n", " 'catalyz': 4945,\n", " 'rapid': 19606,\n", " 'read': 19690,\n", " 'blog': 4019,\n", " 'post': 18674,\n", " 'wave': 26090,\n", " 'gamer': 10165,\n", " 'likewis': 14284,\n", " 'surg': 23255,\n", " 'capabl': 4764,\n", " 'hardware': 11190,\n", " 'let': 14139,\n", " 'dive': 7677,\n", " 'bake': 3364,\n", " 'ray': 19654,\n", " 'trace': 24328,\n", " 'gpu': 10710,\n", " 'simul': 21849,\n", " 'behavior': 3650,\n", " 'light': 14255,\n", " 'way': 26100,\n", " 'look': 14508,\n", " 'realist': 19706,\n", " 'give': 10499,\n", " 'control': 6160,\n", " 'creator': 6441,\n", " 'raytrac': 19658,\n", " 'call': 4675,\n", " 'ping': 18310,\n", " 'cpu': 6395,\n", " 'effici': 8279,\n", " 'engin': 8587,\n", " 'spool': 22542,\n", " 'shader': 21491,\n", " 'player': 18434,\n", " 'environ': 8660,\n", " 'choos': 5352,\n", " 'inlin': 12481,\n", " 'act': 1864,\n", " 'altern': 2336,\n", " 'dynam': 8126,\n", " 'model': 15860,\n", " 'system': 23424,\n", " 'calcul': 4657,\n", " 'depend': 7257,\n", " 'materi': 15085,\n", " 'sourc': 22357,\n", " 'artist': 2905,\n", " 'behav': 3649,\n", " 'note': 16833,\n", " 'confin': 6023,\n", " 'shadow': 21493,\n", " 'scenario': 21071,\n", " 'complex': 5934,\n", " 'run': 20709,\n", " 'better': 3765,\n", " 'dynamic': 8128,\n", " 'bas': 3502,\n", " 'oppos': 17290,\n", " 'shad': 21489,\n", " 'write': 26563,\n", " 'meanwhil': 15230,\n", 
" 'shade': 21490,\n", " 'and': 2481,\n", " 'or': 17317,\n", " 'veri': 25585,\n", " 'tracing': 24331,\n", " 'onc': 17178,\n", " 'nvidia': 16950,\n", " 'ensur': 8620,\n", " 'full': 10014,\n", " 'rtx': 20676,\n", " 'pile': 18288,\n", " 'lot': 14546,\n", " 'variabl': 25486,\n", " 'rate': 19628,\n", " 'process': 18985,\n", " 'tune': 24658,\n", " 'detail': 7324,\n", " 'certain': 5097,\n", " 'part': 17825,\n", " 'singl': 21871,\n", " 'frame': 9841,\n", " 'render': 20068,\n", " 'idea': 12036,\n", " 'power': 18707,\n", " 'draw': 7921,\n", " 'shadowi': 21497,\n", " 'room': 20585,\n", " 'realli': 19712,\n", " 'see': 21285,\n", " 'anyway': 2651,\n", " 'signific': 21800,\n", " 'mesh': 15423,\n", " 'big': 3821,\n", " 'geometri': 10366,\n", " 'pipelin': 18330,\n", " 'manag': 14887,\n", " 'level': 14149,\n", " 'of': 17063,\n", " 'put': 19284,\n", " 'simpli': 21842,\n", " 'across': 1862,\n", " 'group': 10888,\n", " 'flexibl': 9592,\n", " 'simpler': 21840,\n", " 'previous': 18908,\n", " 'tool': 24228,\n", " 'sampler': 20884,\n", " 'feedback': 9314,\n", " 'load': 14427,\n", " 'stutter': 22993,\n", " 'asset': 2977,\n", " 'determin': 7333,\n", " 'textur': 23864,\n", " 'need': 16422,\n", " 'ani': 2530,\n", " 'given': 10502,\n", " 'situat': 21904,\n", " 'sampl': 20882,\n", " 'inform': 12439,\n", " 'fed': 9306,\n", " 'stream': 22915,\n", " 'intellig': 12597,\n", " 'precis': 18774,\n", " 'decis': 7042,\n", " 'conjunct': 6046,\n", " 'd3d12': 6778,\n", " 'tile': 24088,\n", " 'larger': 13904,\n", " 'memory': 15364,\n", " 'final': 9449,\n", " 'texture': 23865,\n", " 'spac': 22378,\n", " 'complic': 5939,\n", " 'quicker': 19428,\n", " 'correct': 6265,\n", " 'due': 8037,\n", " 'expect': 8993,\n", " 'begin': 3639,\n", " 'roll': 20554,\n", " 'holiday': 11638,\n", " 'sabotag': 20785,\n", " 'studio': 22977,\n", " 'sea': 21228,\n", " 'star': 22682,\n", " 'today': 24180,\n", " 'role': 20551,\n", " 'inspir': 12547,\n", " '16': 256,\n", " 'bit': 3921,\n", " 'era': 8720,\n", " 'excel': 8944,\n", " 
'action': 1865,\n", " 'sidescrol': 21759,\n", " 'messeng': 15431,\n", " 'favorit': 9273,\n", " 'fact': 9134,\n", " 'place': 18378,\n", " 'univers': 25094,\n", " 'stori': 22861,\n", " 'serv': 21425,\n", " 'prequel': 18857,\n", " 'turn': 24679,\n", " 'kickstart': 13469,\n", " 'crowdfund': 6528,\n", " 'goal': 10583,\n", " '90': 1597,\n", " '760': 1430,\n", " 'aim': 2135,\n", " '2022': 593,\n", " 'trailer': 24364,\n", " 'biggest': 3827,\n", " 'japanes': 12991,\n", " 'rpgs': 20654,\n", " '90s': 1606,\n", " 'chrono': 5388,\n", " 'trigger': 24507,\n", " 'super': 23170,\n", " 'mario': 14979,\n", " 'rpg': 20653,\n", " 'influenc': 12426,\n", " 'coupl': 6340,\n", " 'beauti': 3602,\n", " 'pixel': 18366,\n", " 'art': 2885,\n", " '2d': 789,\n", " 'often': 17097,\n", " 'retro': 20258,\n", " 'indi': 12338,\n", " 'rare': 19618,\n", " 'tackl': 23453,\n", " 'market': 14988,\n", " 'although': 2339,\n", " 'undertal': 24972,\n", " 'indivis': 12361,\n", " 'htc': 11810,\n", " 'just': 13210,\n", " 'virtual': 25736,\n", " 'vive': 25776,\n", " 'ecosystem': 8225,\n", " 'confer': 6011,\n", " 'event': 8880,\n", " 'industri': 12369,\n", " 'vr': 25893,\n", " 'replac': 20111,\n", " 'held': 11395,\n", " 'physic': 18239,\n", " 'instead': 12562,\n", " 'escap': 8761,\n", " 'global': 10541,\n", " 'coronavirus': 6253,\n", " 'attende': 3066,\n", " 'themselv': 23911,\n", " 'closer': 5609,\n", " 'thank': 23888,\n", " 'bizarr': 3936,\n", " 'four': 9811,\n", " 'hour': 11775,\n", " 'prior': 18941,\n", " 'shenzhen': 21598,\n", " 'venu': 25571,\n", " 'insid': 12531,\n", " 'engag': 8584,\n", " 'collabor': 5774,\n", " 'applic': 2707,\n", " 'oculus': 17048,\n", " 'valv': 25458,\n", " 'mix': 15793,\n", " 'realiti': 19708,\n", " 'headset': 11307,\n", " 'present': 18879,\n", " 'somewhat': 22292,\n", " 'spars': 22403,\n", " 'audienc': 3095,\n", " 'individu': 12358,\n", " '3d': 962,\n", " 'avatar': 3202,\n", " 'appear': 2690,\n", " 'outdoor': 17452,\n", " 'amphitheat': 2437,\n", " 'concret': 5992,\n", " 'bench': 
3686,\n", " 'seat': 21249,\n", " 'follow': 9693,\n", " '15': 232,\n", " 'minut': 15696,\n", " 'sober': 22196,\n", " 'calm': 4684,\n", " 'speech': 22448,\n", " 'chairperson': 5133,\n", " 'cher': 5267,\n", " 'wang': 26004,\n", " 'yves': 26870,\n", " 'maitr': 14827,\n", " 'presid': 18884,\n", " 'alvin': 2357,\n", " 'graylin': 10786,\n", " 'took': 24227,\n", " 'stage': 22641,\n", " 'chide': 5298,\n", " 'rival': 20447,\n", " 'magic': 14779,\n", " 'leap': 14029,\n", " 'fail': 9151,\n", " 'fli': 9597,\n", " 'whale': 26237,\n", " 'tout': 24305,\n", " 'ar': 2755,\n", " 'promot': 19060,\n", " 'quit': 19449,\n", " 'either': 8329,\n", " 'opt': 17297,\n", " 'demonstr': 7215,\n", " 'high': 11520,\n", " 'unusu': 25234,\n", " 'discuss': 7572,\n", " 'impact': 12187,\n", " 'staff': 22638,\n", " 'particl': 17834,\n", " 'ask': 2955,\n", " 'pose': 18664,\n", " 'quick': 19424,\n", " 'selfi': 21320,\n", " 'float': 9613,\n", " 'pictur': 18266,\n", " 'guy': 11000,\n", " 'probabl': 18972,\n", " 'seen': 21293,\n", " 'imag': 12137,\n", " 'virus': 25742,\n", " 'larg': 13901,\n", " 'screen': 21181,\n", " 'never': 16531,\n", " 'could': 6311,\n", " 'balloon': 3395,\n", " 'lik': 14277,\n", " 'crowd': 6526,\n", " 'worri': 26518,\n", " 'go': 10580,\n", " 'hurt': 11891,\n", " 'prepar': 18849,\n", " 'issu': 12878,\n", " 'special': 22425,\n", " 'protect': 19114,\n", " 'gear': 10281,\n", " 'spoke': 22526,\n", " 'member': 15354,\n", " 'cover': 6360,\n", " 'outfit': 17458,\n", " 'came': 4698,\n", " 'hover': 11787,\n", " 'direct': 7510,\n", " 'front': 9963,\n", " 'show': 21694,\n", " 'wide': 26320,\n", " 'angl': 2522,\n", " 'shot': 21683,\n", " 'imagin': 12142,\n", " 'experi': 9001,\n", " 'viewer': 25678,\n", " 'perspect': 18110,\n", " 'type': 24749,\n", " 'futur': 10068,\n", " 'done': 7790,\n", " 'prospect': 19101,\n", " 'shock': 21643,\n", " 'random': 19585,\n", " 'unsettl': 25188,\n", " 'enough': 8608,\n", " 'went': 26204,\n", " 'memori': 15363,\n", " 'whi': 26263,\n", " 'littl': 14379,\n", " 'fun': 
10022,\n", " 'right': 20393,\n", " 'now': 16872,\n", " 'standard': 22670,\n", " 'past': 17864,\n", " 'keynot': 13434,\n", " 'common': 5867,\n", " 'odd': 17050,\n", " 'glib': 10532,\n", " 'treatment': 24452,\n", " 'outbreak': 17442,\n", " 'particular': 17836,\n", " 'tone': 24220,\n", " 'deaf': 6987,\n", " 'mend': 15372,\n", " 'thousand': 23996,\n", " 'death': 7004,\n", " 'elsewher': 8428,\n", " 'job': 13092,\n", " 'loss': 14542,\n", " 'socioeconom': 22210,\n", " 'disrupt': 7637,\n", " 'histor': 11568,\n", " 'question': 19417,\n", " 'greater': 10794,\n", " 'import': 12223,\n", " 'wake': 25959,\n", " 'modest': 15868,\n", " 'grow': 10895,\n", " 'becom': 3609,\n", " 'comfort': 5839,\n", " 'softwar': 22233,\n", " 'moreov': 15993,\n", " 'suppos': 23232,\n", " 'particip': 17831,\n", " 'mass': 15057,\n", " 'gather': 10239,\n", " 'locat': 14443,\n", " 'captur': 4797,\n", " 'attent': 3068,\n", " 'conscious': 6070,\n", " 'digit': 7470,\n", " 'perfect': 18059,\n", " 'content': 6130,\n", " 'feel': 9316,\n", " 'uncomfort': 24894,\n", " 'abandon': 1714,\n", " 'real': 19701,\n", " 'research': 20160,\n", " 'describ': 7294,\n", " 'techniqu': 23658,\n", " 'knowledg': 13614,\n", " 'graph': 10756,\n", " 'entiti': 8637,\n", " 'align': 2260,\n", " 'entail': 8622,\n", " 'element': 8384,\n", " 'refer': 19891,\n", " 'anyth': 2647,\n", " 'song': 22301,\n", " 'comput': 5957,\n", " 'speed': 22451,\n", " 'rel': 19995,\n", " 'task': 23583,\n", " 'search': 21241,\n", " 'answer': 2586,\n", " 'alexa': 2236,\n", " '2020': 484,\n", " 'web': 26137,\n", " 'underpin': 24950,\n", " 'network': 16496,\n", " 'facebook': 9118,\n", " 'twitter': 24733,\n", " 'enterpris': 8625,\n", " 'organ': 17343,\n", " 'various': 25494,\n", " 'catalog': 4940,\n", " 'scientist': 21132,\n", " 'hao': 11159,\n", " 'wei': 26179,\n", " 'explain': 9013,\n", " 'mathemat': 15089,\n", " 'object': 16990,\n", " 'consist': 6084,\n", " 'node': 16733,\n", " 'relationship': 20001,\n", " 'easili': 8178,\n", " 'convent': 6167,\n", " 'databas': 
6899,\n", " 'exampl': 8938,\n", " 'movi': 16062,\n", " 'actor': 1874,\n", " 'director': 7514,\n", " 'film': 9438,\n", " 'genr': 10343,\n", " 'expand': 8990,\n", " 'involv': 12741,\n", " 'integr': 12591,\n", " 'error': 8750,\n", " 'propos': 19088,\n", " 'neural': 16501,\n", " 'convert': 6171,\n", " 'fixed': 9542,\n", " 'length': 14115,\n", " 'vector': 25523,\n", " 'represent': 20128,\n", " 'attribut': 3080,\n", " 'consid': 6080,\n", " 'central': 5072,\n", " 'nearbi': 16404,\n", " 'produc': 18999,\n", " 'embed': 8450,\n", " 'concaten': 5965,\n", " 'sum': 23124,\n", " 'immedi': 12161,\n", " 'neighbor': 16440,\n", " 'addit': 1905,\n", " 'summat': 23134,\n", " 'secondari': 21261,\n", " 'upon': 25280,\n", " 'baselin': 3509,\n", " 'metric': 15463,\n", " 'precision': 18775,\n", " 'recal': 19764,\n", " 'curv': 6673,\n", " 'prauc': 18753,\n", " 'evalu': 8868,\n", " 'trade': 24346,\n", " 'off': 17067,\n", " 'true': 24579,\n", " 'neg': 16431,\n", " 'furthermor': 10058,\n", " 'compar': 5907,\n", " 'deepmatch': 7091,\n", " 'specif': 22428,\n", " 'design': 7300,\n", " 'scalabl': 21024,\n", " 'mind': 15654,\n", " 'train': 24366,\n", " '95': 1638,\n", " 'mainfram': 14813,\n", " 'rais': 19550,\n", " 'cloud': 5615,\n", " 'nat': 16331,\n", " 'financ': 9451,\n", " 'led': 14051,\n", " 'andreessen': 2495,\n", " 'horowitz': 11737,\n", " 'invest': 12721,\n", " 'riot': 20417,\n", " 'finnish': 9478,\n", " 'iceland': 12015,\n", " 'former': 9772,\n", " ...}" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab = vectorizer.vocabulary_\n", "vocab" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.text.freqdist import FreqDistVisualizer" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(15,10))\n", "visualizer = FreqDistVisualizer(features=features, n=30, ax=ax )\n", "visualizer.fit(vector)\n", "visualizer.show()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(15,5))\n", "# vocabulary_ maps each term to its column index, not to a count,\n", "# so sum the count matrix column-wise to get corpus-wide frequencies\n", "counts = np.asarray(vector.sum(axis=0)).ravel()\n", "\n", "n=30\n", "top = counts.argsort()[::-1][:n]  # column indices of the n most frequent terms\n", "plt.bar([features[i] for i in top], counts[top])\n", "plt.xticks(rotation=45)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"n't watch show theatr\"" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def normalizer(text):\n", "    stem = nltk.stem.SnowballStemmer('english')\n", "    text = text.lower()\n", "    \n", "    # tokenize, then stem each token\n", "    tokenized = []\n", "    for token in nltk.word_tokenize(text):\n", "        tokenized.append(stem.stem(token))\n", "    \n", "    tokenized = [token for token in tokenized \n", "                 if not is_punct(token) # remove tokens that are punctuation\n", "                 and token.isascii()    # remove tokens containing non-ASCII characters\n", "                ]\n", "    \n", "    # remove extended stopwords\n", "    stop_words = stopwords.words('english')\n", "    stop_words.extend(['data','compani'])\n", "    stops = set(stop_words)\n", "    tokenized = [token for token in tokenized if token not in stops]\n", "    \n", "    return ' '.join(tokenized) # join the tokens back into a single string\n", "\n", "normalizer(example)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "norm_corpus = [ normalizer(i) for i in corpus ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2b. TFIDF Vectorizer\n", "\n", "Again, scikit-learn provides an easy-to-use class for this, `TfidfVectorizer`. Its `ngram_range` parameter controls whether the vocabulary is built from single words, from two-word phrases, or from both. 
" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<1961x27030 sparse matrix of type ''\n", "\twith 485489 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf = TfidfVectorizer(analyzer='word')\n", "tfidf_vector = tfidf.fit_transform(norm_corpus)\n", "tfidf_vector" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([[0. , 0.07271827, 0. , ..., 0. , 0. ,\n", " 0. ],\n", " [0. , 0. , 0. , ..., 0. , 0. ,\n", " 0. ],\n", " [0. , 0. , 0. , ..., 0. , 0. ,\n", " 0. ],\n", " ...,\n", " [0. , 0. , 0. , ..., 0. , 0. ,\n", " 0. ],\n", " [0. , 0.02555779, 0. , ..., 0. , 0. ,\n", " 0. ],\n", " [0. , 0. , 0. , ..., 0. , 0. ,\n", " 0. ]])" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf_vector.toarray()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fig, ax = plt.subplots(1, 1, figsize=(15,8))\n", "visualizer = FreqDistVisualizer(features=tfidf.get_feature_names(), n=30, ax=ax )\n", "visualizer.fit(tfidf_vector)\n", "visualizer.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. MODELLING" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 00 000 0000 00004 000lbs 000mah 000mbps 000mg 000th 000x ... \\\n", "0 0 5 0 0 0 0 0 0 0 0 ... \n", "1 0 0 0 0 0 0 0 0 0 0 ... \n", "2 0 0 0 0 0 0 0 0 0 0 ... \n", "3 0 0 0 0 0 0 0 0 0 0 ... \n", "4 0 0 0 0 0 0 0 0 0 0 ... \n", "... .. ... ... ... ... ... ... ... ... ... ... \n", "1956 0 0 0 0 0 0 0 0 0 0 ... \n", "1957 0 3 0 0 0 0 0 0 0 0 ... \n", "1958 0 0 0 0 0 0 0 0 0 0 ... \n", "1959 0 1 0 0 0 0 0 0 0 0 ... \n", "1960 0 0 0 0 0 0 0 0 0 0 ... \n", "\n", " zuo zuora zurich zvi zvonimir zvox zweig zx zych zynga \n", "0 0 0 0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 0 0 0 \n", "... ... ... ... ... ... ... ... .. ... ... \n", "1956 0 0 0 0 0 0 0 0 0 0 \n", "1957 0 0 0 0 0 0 0 0 0 0 \n", "1958 0 0 0 0 0 0 0 0 0 0 \n", "1959 0 0 0 0 0 0 0 0 0 0 \n", "1960 0 0 0 0 0 0 0 0 0 0 \n", "\n", "[1961 rows x 27030 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = pd.DataFrame(vector.toarray(), columns=features)\n", "X" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1961, 27035)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " month day length nwords lex_div 00 000 0000 00004 000lbs ... \\\n", "0 3 20 6466 1011 0.070227 0 5 0 0 0 ... \n", "1 3 19 1136 200 0.290000 0 0 0 0 0 ... \n", "2 3 19 4731 783 0.067688 0 0 0 0 0 ... \n", "3 3 19 898 156 0.352564 0 0 0 0 0 ... \n", "4 3 19 4030 649 0.090909 0 0 0 0 0 ... \n", "\n", " zuo zuora zurich zvi zvonimir zvox zweig zx zych zynga \n", "0 0 0 0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 0 0 0 \n", "\n", "[5 rows x 27035 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = ['month','day','length','nwords','lex_div']\n", "X = pd.concat([data[cols], X], axis=1)\n", "print(X.shape)\n", "X.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }