{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Natural Language Processing\n", "\n", "We will use the VentureBeat data that we scraped and stored earlier. We will begin by loading the data and inspecting it, and then convert the text into numeric features. Our task is to build a natural language processing model that takes the information in an article and determines the category it belongs to." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import re" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | url | \n", "category | \n", "title | \n", "text | \n", "date | \n", "month | \n", "day | \n", "length | \n", "nwords | \n", "lex_div | \n", "
---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "https://venturebeat.com/2020/03/20/despite-set... | \n", "AI | \n", "Despite setbacks, coronavirus could hasten the... | \n", "This week, nearly every major company developi... | \n", "2020-03-20 | \n", "3 | \n", "20 | \n", "6466 | \n", "1011 | \n", "0.070227 | \n", "
1 | \n", "https://venturebeat.com/2020/03/19/sensor-towe... | \n", "Games | \n", "Sensor Tower: U.S. iPhone users spent about $5... | \n", "U.S. iPhone users spent an average of about $5... | \n", "2020-03-19 | \n", "3 | \n", "19 | \n", "1136 | \n", "200 | \n", "0.290000 | \n", "
2 | \n", "https://venturebeat.com/2020/03/19/microsoft-u... | \n", "Games | \n", "Microsoft unveils DirectX 12 Ultimate with imp... | \n", "Microsoft is moving on to the next generation ... | \n", "2020-03-19 | \n", "3 | \n", "19 | \n", "4731 | \n", "783 | \n", "0.067688 | \n", "
3 | \n", "https://venturebeat.com/2020/03/19/sea-of-star... | \n", "Games | \n", "Sea of Stars is a gorgeous retro-RPG from The ... | \n", "Sabotage Studios announced Sea of Stars today,... | \n", "2020-03-19 | \n", "3 | \n", "19 | \n", "898 | \n", "156 | \n", "0.352564 | \n", "
4 | \n", "https://venturebeat.com/2020/03/19/htc-holds-v... | \n", "AR/VR | \n", "HTC holds virtual media event, sends coronavir... | \n", "HTC’s just-concluded Virtual Vive Ecosystem Co... | \n", "2020-03-19 | \n", "3 | \n", "19 | \n", "4030 | \n", "649 | \n", "0.090909 | \n", "