{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(gc_analysis)=\n",
"# Datenaufbereitung"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In diesem Notebook werden die Daten aus Webscrapping und der Sentimentanalyse für die Analyse vorbereitetet und angereichert. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import pandas as pd\n",
"import numpy as np \n",
"from pymongo import MongoClient\n",
"from scipy.stats import ttest_ind\n",
"from difflib import SequenceMatcher\n",
"\n",
"#import plotly.express as px\n",
"#import plotly.figure_factory as ff\n",
"#from plotly.offline import init_notebook_mode\n",
"#init_notebook_mode() # To show plotly plots when notebook is exported to html"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"mongodb_pass = json.load(open('API_Data.json'))['mongoDB_pass'] # password mongodb user\n",
"client = MongoClient(mongodb_pass)\n",
"db = client.gc_nfl\n",
"mycoll = db.gc_games"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"````{margin}\n",
"```{note} Da die Youtube-Kommentare sehr lang sein können, wird die maximale Zeichenlänge der Zellen im Dataframe erhöht.\n",
"```\n",
"````"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"pd.options.display.max_colwidth = 1000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hole Daten von MongoDB"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Dafür werden zu nächste die Daten aus mongodb abgerufen und in einen Dataframe umgewandelt. Die explorative Datenanalyse wird am Beispiel der Season 2021/2022 gezeigt, da diese Season bereits abgeschlossen ist."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"pipeline = [\n",
" {'$match':{'year':'2021'}},\n",
" {'$project':{\n",
" '_id':0\n",
" }}\n",
" ]\n",
"\n",
"x = mycoll.aggregate(pipeline)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df0 = pd.DataFrame.from_dict(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vorbereitung Entitäten"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Für den Erkenntnisgewinn sind die Enitäten der Kommentare interessant, welche Mannschaftsnamen enthalten. Die Mannschaftsnamen der NFL-Teams bestehen aus zwei Bestandteilen: einen Namen des Heimatsort (z.B. Tampa Bay) und einem Teamnamen (z.B. Buccaneers). Ziel der Data Preperation ist es so viele Enitäten, wie möglich als Mannschaftsnamen zu identifizieren. Dafür müssen Synonyme wie z.B. Spitznamen oder nur der Heimatort als Mannschaftsnamen und Rechtschreibfehler erkannt werden. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" team1 | \n",
" score1 | \n",
" team2 | \n",
" score2 | \n",
" year | \n",
" week | \n",
" videoID | \n",
" comment | \n",
" entity | \n",
" salience | \n",
" score | \n",
" magnitude | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated | \n",
" players | \n",
" 0.800480 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated | \n",
" NFL | \n",
" 0.136442 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated | \n",
" Tampa Bay Buccaneers | \n",
" 0.063078 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" #🐰💀 vs #🤠🦆 | \n",
" #🐰 | \n",
" 1.000000 | \n",
" 0.1 | \n",
" 0.1 | \n",
"
\n",
" \n",
" 4 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" OMG!! I really thought we were gonna beat the goat!! Such an amazing game!! | \n",
" game | \n",
" 0.528493 | \n",
" 0.9 | \n",
" 0.9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" team1 score1 team2 score2 year week videoID \\\n",
"0 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"1 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"2 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"3 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"4 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"\n",
" comment \\\n",
"0 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n",
"1 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n",
"2 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n",
"3 #🐰💀 vs #🤠🦆 \n",
"4 OMG!! I really thought we were gonna beat the goat!! Such an amazing game!! \n",
"\n",
" entity salience score magnitude \n",
"0 players 0.800480 0.0 0.0 \n",
"1 NFL 0.136442 0.0 0.0 \n",
"2 Tampa Bay Buccaneers 0.063078 0.0 0.0 \n",
"3 #🐰 1.000000 0.1 0.1 \n",
"4 game 0.528493 0.9 0.9 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df0.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Berücksichtigung von Rechtschreibungsfehler"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Insofern die Entitäten mit Rechtschreibfehler von Google NLP erkannt wurden, sollen diese mit Hilfe des SequenceMatchers und Casefold zu einem \"team_by_entity\"-Wert zugeordnet werden."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"df0['entity'] = df0['entity'].apply(lambda x: x.casefold())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Der SequenceMatcher berechent einen Ähnlichkeitswert von zwei Strings. Für das Projekt wird der Grenzwert auf 0.8 festgelegt. Mit der Funktion `match_sequence()` können die Teamnamen mit den Enität aus den Kommentaren abgeglichen werden. Liegt Ähnlichkeit vor, werden die Teamnamen der Enitäten korrigiert. "
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9230769230769231"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"myStr1 = \"Titans\"\n",
"myStr2 = \"tittans\"\n",
"\n",
"SequenceMatcher(a=myStr1.casefold(),b=myStr2.casefold()).ratio()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def match_sequence(string1, string2):\n",
" '''\n",
" Input: string1 -> Entität\n",
" string2 -> Textstelle im Kommentar\n",
" '''\n",
" value = string2\n",
" if (SequenceMatcher(a = string1.casefold(), b = string2.casefold()).ratio() > 0.8):\n",
" value = string1\n",
" return value"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"df0['entity']= df0.apply(lambda x: match_sequence(x['team1'], x['entity']), axis=1)\n",
"df0['entity']= df0.apply(lambda x: match_sequence(x['team2'], x['entity']), axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mit der Anwendung des SequenceMatcher konnten 965 weitere Enitäten zugeordnet werden. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Erstellungen eines Synonym-Wörterbuchs "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Die Datenbasis des Synonym-Wörterbuchs ist eine Tabelle aus Github, welche die vollständigen Mannschaftsnamen enthält. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"nfl_teams = pd.read_csv('https://gist.githubusercontent.com/cnizzardini/13d0a072adb35a0d5817/raw/f315c97c7677845668a9c26e9093d0d550533b00/nfl_teams.csv')\n",
"nfl_teams['Name_1'] = nfl_teams['Name'].apply(lambda x : x.split()[-1])\n",
"nfl_teams['Name_2'] = nfl_teams['Name'].apply(lambda x : \" \".join(x.split()[0:-1]))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ID | \n",
" Name | \n",
" Abbreviation | \n",
" Conference | \n",
" Division | \n",
" Name_1 | \n",
" Name_2 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Arizona Cardinals | \n",
" ARI | \n",
" NFC | \n",
" West | \n",
" Cardinals | \n",
" Arizona | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Atlanta Falcons | \n",
" ATL | \n",
" NFC | \n",
" South | \n",
" Falcons | \n",
" Atlanta | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Baltimore Ravens | \n",
" BAL | \n",
" AFC | \n",
" North | \n",
" Ravens | \n",
" Baltimore | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" Buffalo Bills | \n",
" BUF | \n",
" AFC | \n",
" East | \n",
" Bills | \n",
" Buffalo | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Carolina Panthers | \n",
" CAR | \n",
" NFC | \n",
" South | \n",
" Panthers | \n",
" Carolina | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ID Name Abbreviation Conference Division Name_1 \\\n",
"0 1 Arizona Cardinals ARI NFC West Cardinals \n",
"1 2 Atlanta Falcons ATL NFC South Falcons \n",
"2 3 Baltimore Ravens BAL AFC North Ravens \n",
"3 4 Buffalo Bills BUF AFC East Bills \n",
"4 5 Carolina Panthers CAR NFC South Panthers \n",
"\n",
" Name_2 \n",
"0 Arizona \n",
"1 Atlanta \n",
"2 Baltimore \n",
"3 Buffalo \n",
"4 Carolina "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nfl_teams.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mit den Spalten 'Name_1' und 'Name_2' wird der vollständige Mannschaftsname in den Teamname und den Heimatort getrennt, um die ersten Synonyme zu erhalten. Der Teamname dient in den nachfolgenden Betrachtungen als zentrale Entität (\"team_by_entity\") für den Mannschaftsnamen. Alle anderen Synonyme werden dem Teamnamen zugeordnetet. \n",
"\n",
"Nachfolgend wird die Tabelle mit den gebildeten Entitäten verschlankt. Die Werte für \"team_by_entity\" werden mit dem zugehörigen Synonymen kombiniert und unterhalb der bestehenden Tabelle angefügt. Schließlich werden die Spaltennamen umbenannt."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"df_name2 = nfl_teams[['Name_1','Name_2']].copy(deep=True)\n",
"df_name1 = nfl_teams[['Name_1']].copy(deep=True)\n",
"df_name1['Name'] = df_name1['Name_1']"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"nfl_teams = nfl_teams.merge(df_name2.rename(columns={'Name_2':'Name'}),how='outer')\n",
"nfl_teams = nfl_teams.merge(df_name1,how='outer')\n",
"nfl_teams.drop(['ID','Abbreviation', 'Conference','Division','Name_2'], axis=1, inplace=True)\n",
"nfl_teams.rename(columns={'Name':'entity','Name_1':'team_by_entity'}, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" entity | \n",
" team_by_entity | \n",
"
\n",
" \n",
" \n",
" \n",
" 11 | \n",
" Green Bay Packers | \n",
" Packers | \n",
"
\n",
" \n",
" 43 | \n",
" Green Bay | \n",
" Packers | \n",
"
\n",
" \n",
" 75 | \n",
" Packers | \n",
" Packers | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" entity team_by_entity\n",
"11 Green Bay Packers Packers\n",
"43 Green Bay Packers\n",
"75 Packers Packers"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vikings = nfl_teams[nfl_teams['team_by_entity'] == \"Packers\"] \n",
"vikings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Für die Mannschaft \"Packers\" ist \"Cheeseheads\" ein sehr gängiger Spitzname. Solche Entitäten sollen ebenfalls berücksichtigt werden. Mit der nachfolgenden Funktion können derartige Synonyme hinzugeügt werden. Exemplarisch wird dies für drei Teams durchgeführt. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def append_pairs(df, entity, team_by_entity, single=True):\n",
" '''\n",
" takes: pandas dataframe, entity and team_by_entity, trigger for multiple or single values\n",
" if single=False the input for entity and team_by_entity has to be a list.\n",
" returns a pandas dataframe object that includes old and new data.\n",
" '''\n",
" if single:\n",
" new_pair = {\n",
" 'entity':[entity],\n",
" 'team_by_entity':[team_by_entity]\n",
" }\n",
" else:\n",
" new_pair = {\n",
" 'entity':entity,\n",
" 'team_by_entity':team_by_entity\n",
" } \n",
" return pd.concat([df,pd.DataFrame(new_pair)])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"nfl_teams = append_pairs(nfl_teams, \"Cheeseheads\", \"Packers\", single=True)\n",
"nfl_teams = append_pairs(nfl_teams, \"Redskins\", \"Commanders\", single=True)\n",
"nfl_teams = append_pairs(nfl_teams, \"Bucs\", \"Buccaneers\", single=True)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Es ist denkbar, die Synonyme noch deutlich weiter zu ergänzen, z.B. mit den Namen der Quaterbacks. Dem Team \"Buccaneers\" könnte so weitere 1700 Kommentare mit der Entität \"Brady\" zugeordnet werden. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Anwendung des Synonym-Wörterbuchs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Das Synonym-Wörterbuch wird im folgenden auf den gesamten Dataframe angewendet."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"df0['entity'] = df0['entity'].apply(lambda x: x.casefold())"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"df0['team_by_entity'] = None\n",
"for index, row in nfl_teams.iterrows():\n",
" df0['team_by_entity'] = np.where(df0['entity'] == row['entity'].casefold(),row['team_by_entity'], df0['team_by_entity'])"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([None, 'Buccaneers', 'Cowboys', 'Patriots', 'Jets', 'Rams',\n",
" 'Broncos', '49ers', 'Eagles', 'Chiefs', 'Saints', 'Falcons',\n",
" 'Steelers', 'Ravens', 'Browns', 'Dolphins', 'Texans', 'Lions',\n",
" 'Commanders', 'Jaguars', 'Raiders', 'Giants', 'Bills', 'Packers',\n",
" 'Titans', 'Colts', 'Vikings', 'Panthers', 'Cardinals', 'Bears',\n",
" 'Chargers', 'Bengals', 'Seahawks'], dtype=object)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df0['team_by_entity'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"330066"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df0['team_by_entity'].isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Die Dokumente, welche keiner \"team_by_enity\" zugeordnet werden konnten, werden entfernt. "
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" team1 | \n",
" score1 | \n",
" team2 | \n",
" score2 | \n",
" year | \n",
" week | \n",
" videoID | \n",
" comment | \n",
" entity | \n",
" salience | \n",
" score | \n",
" magnitude | \n",
" team_by_entity | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated | \n",
" tampa bay buccaneers | \n",
" 0.063078 | \n",
" 0.0 | \n",
" 0.0 | \n",
" Buccaneers | \n",
"
\n",
" \n",
" 12 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" Though Dallas has nothing to be ashamed of, this game was stripped from them. That was definitely a push-off by Godwin. Outside of that, it was an awesome game to watch. | \n",
" dallas | \n",
" 0.295658 | \n",
" 0.0 | \n",
" 0.0 | \n",
" Cowboys | \n",
"
\n",
" \n",
" 30 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" I’m very lost Tom brad 50mil contract best in the league barely get hurts. But go to Dallas 160mil for dak couldn’t clutch a last year game and get hurt a lot df🤔going on here the goat get paid less but a person who get hurt a lot get paid more with no rings🤨 | \n",
" dallas | \n",
" 0.021111 | \n",
" -0.1 | \n",
" 0.1 | \n",
" Cowboys | \n",
"
\n",
" \n",
" 36 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" Brady the \"goat\" barely very barely beating the cowboys.... I'd even go as far as to say struggled to beat them, barely beat them.... the goat.... ok sure... 2 points... :/ | \n",
" cowboys | \n",
" 0.089547 | \n",
" 0.1 | \n",
" 0.1 | \n",
" Cowboys | \n",
"
\n",
" \n",
" 50 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" Can someone tell me how to get the Patriots game and watch it on my phone thank you | \n",
" patriots | \n",
" 0.073656 | \n",
" 0.2 | \n",
" 0.2 | \n",
" Patriots | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" team1 score1 team2 score2 year week videoID \\\n",
"2 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"12 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"30 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"36 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"50 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"\n",
" comment \\\n",
"2 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n",
"12 Though Dallas has nothing to be ashamed of, this game was stripped from them. That was definitely a push-off by Godwin. Outside of that, it was an awesome game to watch. \n",
"30 I’m very lost Tom brad 50mil contract best in the league barely get hurts. But go to Dallas 160mil for dak couldn’t clutch a last year game and get hurt a lot df🤔going on here the goat get paid less but a person who get hurt a lot get paid more with no rings🤨 \n",
"36 Brady the \"goat\" barely very barely beating the cowboys.... I'd even go as far as to say struggled to beat them, barely beat them.... the goat.... ok sure... 2 points... :/ \n",
"50 Can someone tell me how to get the Patriots game and watch it on my phone thank you \n",
"\n",
" entity salience score magnitude team_by_entity \n",
"2 tampa bay buccaneers 0.063078 0.0 0.0 Buccaneers \n",
"12 dallas 0.295658 0.0 0.0 Cowboys \n",
"30 dallas 0.021111 -0.1 0.1 Cowboys \n",
"36 cowboys 0.089547 0.1 0.1 Cowboys \n",
"50 patriots 0.073656 0.2 0.2 Patriots "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df0.dropna(subset='team_by_entity').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ebenfalls werden die Dokumente entfernt, welche ein Teamnamen enthalten, welcher nicht für das betrachtete Spiel relevant ist und damit nicht die Stimmung der spielenden Mannschaften repräsentieren, z.B. ein Kommentar zu den Cowboys im Spiel der Packers gegen Buccaners."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Anreichung des DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"df0['team_by_entity'] = np.where((df0['team_by_entity'] == df0['team1']) | (df0['team_by_entity'] == df0['team2']), df0['team_by_entity'], None )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Im nächsten Schritt werden den Größen 'Score' und 'Magnitude' ein \"Sentiment\" zugeordnet. Dafür werden Schwellenwerte für negativ, neutral und positiv festgelegt. \n",
"\n",
"|senitment|score|magnitude|\n",
"|---------|-----|------|\n",
"|positive| > 0.1| > 0.1|\n",
"|negative| > -0.1| >0.1|\n",
"|neutral| <= abs(0.1) | <= 0.1|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Um an dieser Stelle den \"Floating-point error\" zu umgehen, werden die Fließkommazahlen mit 10 multipliziert und in einen Integer umgewandelt, sodass sie ohne Fehler verglichen werden können. "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" team1 | \n",
" score1 | \n",
" team2 | \n",
" score2 | \n",
" year | \n",
" week | \n",
" videoID | \n",
" comment | \n",
" entity | \n",
" salience | \n",
" score | \n",
" magnitude | \n",
" team_by_entity | \n",
" sentiment | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated | \n",
" players | \n",
" 0.800480 | \n",
" 0.0 | \n",
" 0.0 | \n",
" None | \n",
" neutral | \n",
"
\n",
" \n",
" 1 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated | \n",
" nfl | \n",
" 0.136442 | \n",
" 0.0 | \n",
" 0.0 | \n",
" None | \n",
" neutral | \n",
"
\n",
" \n",
" 2 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated | \n",
" tampa bay buccaneers | \n",
" 0.063078 | \n",
" 0.0 | \n",
" 0.0 | \n",
" Buccaneers | \n",
" neutral | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" team1 score1 team2 score2 year week videoID \\\n",
"0 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"1 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"2 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"\n",
" comment \\\n",
"0 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n",
"1 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n",
"2 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n",
"\n",
" entity salience score magnitude team_by_entity sentiment \n",
"0 players 0.800480 0.0 0.0 None neutral \n",
"1 nfl 0.136442 0.0 0.0 None neutral \n",
"2 tampa bay buccaneers 0.063078 0.0 0.0 Buccaneers neutral "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df0['sentiment'] = \"\" \n",
"series_score = (df0['score']*10).astype(int)\n",
"series_mag = (df0['magnitude']*10).astype(int)\n",
"df0['sentiment'] = np.where((abs(series_score) <= 1) & (series_mag <= 1), 'neutral', df0['sentiment'] )\n",
"df0['sentiment'] = np.where((series_score <= -1) & (series_mag > 1), 'negative', df0['sentiment'] )\n",
"df0['sentiment'] = np.where((series_score >= 1) & (series_mag > 1), 'positive', df0['sentiment'] )\n",
"df0.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Für die Analyse werden zwei weitere Spalten erzeugt. Die Spalte \"win_for_entity\" enthält die Attribute \"win\", \"draw\" oder \"defeat\", je nachdem ob die Entität aus der Zeile gewonnen oder verloren hat. \n",
"Die Spalte \"winner\" enthält jeweils die Mannschaft, welche gewonnen hat. "
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" team1 | \n",
" score1 | \n",
" team2 | \n",
" score2 | \n",
" year | \n",
" week | \n",
" videoID | \n",
" comment | \n",
" entity | \n",
" salience | \n",
" score | \n",
" magnitude | \n",
" team_by_entity | \n",
" sentiment | \n",
" win_for_entity | \n",
" winner | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated | \n",
" players | \n",
" 0.800480 | \n",
" 0.0 | \n",
" 0.0 | \n",
" None | \n",
" neutral | \n",
" defeat | \n",
" Buccaneers | \n",
"
\n",
" \n",
" 1 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated | \n",
" nfl | \n",
" 0.136442 | \n",
" 0.0 | \n",
" 0.0 | \n",
" None | \n",
" neutral | \n",
" defeat | \n",
" Buccaneers | \n",
"
\n",
" \n",
" 2 | \n",
" Buccaneers | \n",
" 31 | \n",
" Cowboys | \n",
" 29 | \n",
" 2021 | \n",
" Week 1 | \n",
" HzkUcSd3Utc | \n",
" They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated | \n",
" tampa bay buccaneers | \n",
" 0.063078 | \n",
" 0.0 | \n",
" 0.0 | \n",
" Buccaneers | \n",
" neutral | \n",
" win | \n",
" Buccaneers | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" team1 score1 team2 score2 year week videoID \\\n",
"0 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"1 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"2 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n",
"\n",
" comment \\\n",
"0 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n",
"1 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n",
"2 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n",
"\n",
" entity salience score magnitude team_by_entity sentiment \\\n",
"0 players 0.800480 0.0 0.0 None neutral \n",
"1 nfl 0.136442 0.0 0.0 None neutral \n",
"2 tampa bay buccaneers 0.063078 0.0 0.0 Buccaneers neutral \n",
"\n",
" win_for_entity winner \n",
"0 defeat Buccaneers \n",
"1 defeat Buccaneers \n",
"2 win Buccaneers "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df0['score1'] = df0['score1'].astype('int')\n",
"df0['score2'] = df0['score2'].astype('int')\n",
"df0['win_for_entity'] = 'draw'\n",
"df0['winner'] = np.where(df0['score1']>df0['score2'],df0['team1'],df0['team2'])\n",
"df0['win_for_entity'] = np.where(df0['winner']==df0['team_by_entity'],'win','defeat')\n",
"df0.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Aggregation "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Die bisherige Bearbeitung dient als Grundlage, um für die Analyse ein Tabelle mit Kennzahlen pro Mannschaft und Spiel zu bilden. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Die aggregierte Analyse-Tabelle soll folgenden Aufbau haben: \n",
"\n",
"* Mannschaft\n",
"* Spielwoche\n",
"* Ausgang: Sieg/Niederlage/Unentschieden \n",
"* Durchschnittlicher Sentiment Score\n",
"* Anzahl Kommentare mit positiven Sentiment\n",
"* Anzahl Kommentare mit negativen Sentiment\n",
"* Anzahl Kommentare mit neutralen Sentiment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zum Zählen der Sentimentwerte wurde einen Funktion geschrieben, um Fehler zu vermeiden falls eine Kategorie (positive, negative, neutral) nicht vorkommt. "
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"def count_sentiment(inp_series, sentiment_value): \n",
" try: \n",
" count = inp_series.value_counts()[sentiment_value]\n",
" except: \n",
" count = 0\n",
" return count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Die Aggragtion erfolgt über die Spalten \"team_by_entity\" und \"week\" mit den Funktionen: \n",
"* first, zum Bestimmen des Ausgangs \n",
"* mean, zur Bildung des durchschnittlichen Sentiment Scores\n",
"* count_sentiment, zum Zählen der positiven, negativen und neutralen Kommentare"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"summary= df0.groupby(['team_by_entity','week']).agg(\n",
" outcome=('win_for_entity', 'first'),\n",
" avg_score=('score', 'mean'),\n",
" count_neg=('sentiment', lambda x: count_sentiment(x, 'negative')),\n",
" count_pos =('sentiment', lambda x: count_sentiment(x, 'positive')),\n",
" count_neutral=('sentiment', lambda x: count_sentiment(x, 'neutral')))"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"summary = summary.reset_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Abschließen wird die Spielwoche in einen numerischen Wert umgewandelt. "
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"summary['week'] = summary['week'].apply(lambda x: int(x.split()[1]))"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"summary = summary.sort_values('week')"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" team_by_entity | \n",
" week | \n",
" outcome | \n",
" avg_score | \n",
" count_neg | \n",
" count_pos | \n",
" count_neutral | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 49ers | \n",
" 1 | \n",
" win | \n",
" 0.046951 | \n",
" 29 | \n",
" 47 | \n",
" 85 | \n",
"
\n",
" \n",
" 68 | \n",
" Broncos | \n",
" 1 | \n",
" win | \n",
" 0.232381 | \n",
" 14 | \n",
" 117 | \n",
" 77 | \n",
"
\n",
" \n",
" 85 | \n",
" Browns | \n",
" 1 | \n",
" defeat | \n",
" 0.128121 | \n",
" 125 | \n",
" 359 | \n",
" 386 | \n",
"
\n",
" \n",
" 102 | \n",
" Buccaneers | \n",
" 1 | \n",
" win | \n",
" 0.084448 | \n",
" 87 | \n",
" 178 | \n",
" 281 | \n",
"
\n",
" \n",
" 119 | \n",
" Cardinals | \n",
" 1 | \n",
" win | \n",
" 0.193962 | \n",
" 20 | \n",
" 119 | \n",
" 122 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 42 | \n",
" Bengals | \n",
" 18 | \n",
" defeat | \n",
" 0.017647 | \n",
" 3 | \n",
" 4 | \n",
" 10 | \n",
"
\n",
" \n",
" 364 | \n",
" Panthers | \n",
" 18 | \n",
" defeat | \n",
" 0.325000 | \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
" 25 | \n",
" Bears | \n",
" 18 | \n",
" defeat | \n",
" -0.125000 | \n",
" 2 | \n",
" 1 | \n",
" 5 | \n",
"
\n",
" \n",
" 467 | \n",
" Seahawks | \n",
" 18 | \n",
" win | \n",
" 0.079167 | \n",
" 1 | \n",
" 7 | \n",
" 15 | \n",
"
\n",
" \n",
" 177 | \n",
" Colts | \n",
" 18 | \n",
" defeat | \n",
" -0.042105 | \n",
" 7 | \n",
" 7 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
543 rows × 7 columns
\n",
"
"
],
"text/plain": [
" team_by_entity week outcome avg_score count_neg count_pos \\\n",
"0 49ers 1 win 0.046951 29 47 \n",
"68 Broncos 1 win 0.232381 14 117 \n",
"85 Browns 1 defeat 0.128121 125 359 \n",
"102 Buccaneers 1 win 0.084448 87 178 \n",
"119 Cardinals 1 win 0.193962 20 119 \n",
".. ... ... ... ... ... ... \n",
"42 Bengals 18 defeat 0.017647 3 4 \n",
"364 Panthers 18 defeat 0.325000 0 2 \n",
"25 Bears 18 defeat -0.125000 2 1 \n",
"467 Seahawks 18 win 0.079167 1 7 \n",
"177 Colts 18 defeat -0.042105 7 7 \n",
"\n",
" count_neutral \n",
"0 85 \n",
"68 77 \n",
"85 386 \n",
"102 281 \n",
"119 122 \n",
".. ... \n",
"42 10 \n",
"364 2 \n",
"25 5 \n",
"467 15 \n",
"177 5 \n",
"\n",
"[543 rows x 7 columns]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Das Ergebniss wird in einer CSV-Datei gespeichert und im anschließenden Notebook zur Datenanalyse genutzt. "
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"summary.to_csv(\"summary_2021.csv\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "condatascience",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
},
"vscode": {
"interpreter": {
"hash": "fd0a90840973ec53c960faac2ec5a80b0cd969d8e1e064cbc11596d09821ee2d"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}