{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "(gc_analysis)=\n", "# Datenaufbereitung" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In diesem Notebook werden die Daten aus Webscrapping und der Sentimentanalyse für die Analyse vorbereitetet und angereichert. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import json\n", "import pandas as pd\n", "import numpy as np \n", "from pymongo import MongoClient\n", "from scipy.stats import ttest_ind\n", "from difflib import SequenceMatcher\n", "\n", "#import plotly.express as px\n", "#import plotly.figure_factory as ff\n", "#from plotly.offline import init_notebook_mode\n", "#init_notebook_mode() # To show plotly plots when notebook is exported to html" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "mongodb_pass = json.load(open('API_Data.json'))['mongoDB_pass'] # password mongodb user\n", "client = MongoClient(mongodb_pass)\n", "db = client.gc_nfl\n", "mycoll = db.gc_games" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "````{margin}\n", "```{note} Da die Youtube-Kommentare sehr lang sein können, wird die maximale Zeichenlänge der Zellen im Dataframe erhöht.\n", "```\n", "````" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "pd.options.display.max_colwidth = 1000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hole Daten von MongoDB" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Dafür werden zu nächste die Daten aus mongodb abgerufen und in einen Dataframe umgewandelt. Die explorative Datenanalyse wird am Beispiel der Season 2021/2022 gezeigt, da diese Season bereits abgeschlossen ist." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "pipeline = [\n", " {'$match':{'year':'2021'}},\n", " {'$project':{\n", " '_id':0\n", " }}\n", " ]\n", "\n", "x = mycoll.aggregate(pipeline)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "df0 = pd.DataFrame.from_dict(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vorbereitung Entitäten" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Für den Erkenntnisgewinn sind die Enitäten der Kommentare interessant, welche Mannschaftsnamen enthalten. Die Mannschaftsnamen der NFL-Teams bestehen aus zwei Bestandteilen: einen Namen des Heimatsort (z.B. Tampa Bay) und einem Teamnamen (z.B. Buccaneers). Ziel der Data Preperation ist es so viele Enitäten, wie möglich als Mannschaftsnamen zu identifizieren. Dafür müssen Synonyme wie z.B. Spitznamen oder nur der Heimatort als Mannschaftsnamen und Rechtschreibfehler erkannt werden. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
team1score1team2score2yearweekvideoIDcommententitysaliencescoremagnitude
0Buccaneers31Cowboys292021Week 1HzkUcSd3UtcThey made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heatedplayers0.8004800.00.0
1Buccaneers31Cowboys292021Week 1HzkUcSd3UtcThey made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heatedNFL0.1364420.00.0
2Buccaneers31Cowboys292021Week 1HzkUcSd3UtcThey made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heatedTampa Bay Buccaneers0.0630780.00.0
3Buccaneers31Cowboys292021Week 1HzkUcSd3Utc#🐰💀 vs #🤠🦆#🐰1.0000000.10.1
4Buccaneers31Cowboys292021Week 1HzkUcSd3UtcOMG!! I really thought we were gonna beat the goat!! Such an amazing game!!game0.5284930.90.9
\n", "
" ], "text/plain": [ " team1 score1 team2 score2 year week videoID \\\n", "0 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "1 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "2 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "3 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "4 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "\n", " comment \\\n", "0 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n", "1 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n", "2 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n", "3 #🐰💀 vs #🤠🦆 \n", "4 OMG!! I really thought we were gonna beat the goat!! Such an amazing game!! \n", "\n", " entity salience score magnitude \n", "0 players 0.800480 0.0 0.0 \n", "1 NFL 0.136442 0.0 0.0 \n", "2 Tampa Bay Buccaneers 0.063078 0.0 0.0 \n", "3 #🐰 1.000000 0.1 0.1 \n", "4 game 0.528493 0.9 0.9 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df0.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Berücksichtigung von Rechtschreibungsfehler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Insofern die Entitäten mit Rechtschreibfehler von Google NLP erkannt wurden, sollen diese mit Hilfe des SequenceMatchers und Casefold zu einem \"team_by_entity\"-Wert zugeordnet werden." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df0['entity'] = df0['entity'].apply(lambda x: x.casefold())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Der SequenceMatcher berechent einen Ähnlichkeitswert von zwei Strings. Für das Projekt wird der Grenzwert auf 0.8 festgelegt. Mit der Funktion `match_sequence()` können die Teamnamen mit den Enität aus den Kommentaren abgeglichen werden. Liegt Ähnlichkeit vor, werden die Teamnamen der Enitäten korrigiert. " ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9230769230769231" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myStr1 = \"Titans\"\n", "myStr2 = \"tittans\"\n", "\n", "SequenceMatcher(a=myStr1.casefold(),b=myStr2.casefold()).ratio()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def match_sequence(string1, string2):\n", " '''\n", " Input: string1 -> Entität\n", " string2 -> Textstelle im Kommentar\n", " '''\n", " value = string2\n", " if (SequenceMatcher(a = string1.casefold(), b = string2.casefold()).ratio() > 0.8):\n", " value = string1\n", " return value" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "df0['entity']= df0.apply(lambda x: match_sequence(x['team1'], x['entity']), axis=1)\n", "df0['entity']= df0.apply(lambda x: match_sequence(x['team2'], x['entity']), axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mit der Anwendung des SequenceMatcher konnten 965 weitere Enitäten zugeordnet werden. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Erstellungen eines Synonym-Wörterbuchs " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Die Datenbasis des Synonym-Wörterbuchs ist eine Tabelle aus Github, welche die vollständigen Mannschaftsnamen enthält. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "nfl_teams = pd.read_csv('https://gist.githubusercontent.com/cnizzardini/13d0a072adb35a0d5817/raw/f315c97c7677845668a9c26e9093d0d550533b00/nfl_teams.csv')\n", "nfl_teams['Name_1'] = nfl_teams['Name'].apply(lambda x : x.split()[-1])\n", "nfl_teams['Name_2'] = nfl_teams['Name'].apply(lambda x : \" \".join(x.split()[0:-1]))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IDNameAbbreviationConferenceDivisionName_1Name_2
01Arizona CardinalsARINFCWestCardinalsArizona
12Atlanta FalconsATLNFCSouthFalconsAtlanta
23Baltimore RavensBALAFCNorthRavensBaltimore
34Buffalo BillsBUFAFCEastBillsBuffalo
45Carolina PanthersCARNFCSouthPanthersCarolina
\n", "
" ], "text/plain": [ " ID Name Abbreviation Conference Division Name_1 \\\n", "0 1 Arizona Cardinals ARI NFC West Cardinals \n", "1 2 Atlanta Falcons ATL NFC South Falcons \n", "2 3 Baltimore Ravens BAL AFC North Ravens \n", "3 4 Buffalo Bills BUF AFC East Bills \n", "4 5 Carolina Panthers CAR NFC South Panthers \n", "\n", " Name_2 \n", "0 Arizona \n", "1 Atlanta \n", "2 Baltimore \n", "3 Buffalo \n", "4 Carolina " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nfl_teams.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mit den Spalten 'Name_1' und 'Name_2' wird der vollständige Mannschaftsname in den Teamname und den Heimatort getrennt, um die ersten Synonyme zu erhalten. Der Teamname dient in den nachfolgenden Betrachtungen als zentrale Entität (\"team_by_entity\") für den Mannschaftsnamen. Alle anderen Synonyme werden dem Teamnamen zugeordnetet. \n", "\n", "Nachfolgend wird die Tabelle mit den gebildeten Entitäten verschlankt. Die Werte für \"team_by_entity\" werden mit dem zugehörigen Synonymen kombiniert und unterhalb der bestehenden Tabelle angefügt. Schließlich werden die Spaltennamen umbenannt." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "df_name2 = nfl_teams[['Name_1','Name_2']].copy(deep=True)\n", "df_name1 = nfl_teams[['Name_1']].copy(deep=True)\n", "df_name1['Name'] = df_name1['Name_1']" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "nfl_teams = nfl_teams.merge(df_name2.rename(columns={'Name_2':'Name'}),how='outer')\n", "nfl_teams = nfl_teams.merge(df_name1,how='outer')\n", "nfl_teams.drop(['ID','Abbreviation', 'Conference','Division','Name_2'], axis=1, inplace=True)\n", "nfl_teams.rename(columns={'Name':'entity','Name_1':'team_by_entity'}, inplace=True)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
entityteam_by_entity
11Green Bay PackersPackers
43Green BayPackers
75PackersPackers
\n", "
" ], "text/plain": [ " entity team_by_entity\n", "11 Green Bay Packers Packers\n", "43 Green Bay Packers\n", "75 Packers Packers" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vikings = nfl_teams[nfl_teams['team_by_entity'] == \"Packers\"] \n", "vikings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Für die Mannschaft \"Packers\" ist \"Cheeseheads\" ein sehr gängiger Spitzname. Solche Entitäten sollen ebenfalls berücksichtigt werden. Mit der nachfolgenden Funktion können derartige Synonyme hinzugeügt werden. Exemplarisch wird dies für drei Teams durchgeführt. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def append_pairs(df, entity, team_by_entity, single=True):\n", " '''\n", " takes: pandas dataframe, entity and team_by_entity, trigger for multiple or single values\n", " if single=False the input for entity and team_by_entity has to be a list.\n", " returns a pandas dataframe object that includes old and new data.\n", " '''\n", " if single:\n", " new_pair = {\n", " 'entity':[entity],\n", " 'team_by_entity':[team_by_entity]\n", " }\n", " else:\n", " new_pair = {\n", " 'entity':entity,\n", " 'team_by_entity':team_by_entity\n", " } \n", " return pd.concat([df,pd.DataFrame(new_pair)])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "nfl_teams = append_pairs(nfl_teams, \"Cheeseheads\", \"Packers\", single=True)\n", "nfl_teams = append_pairs(nfl_teams, \"Redskins\", \"Commanders\", single=True)\n", "nfl_teams = append_pairs(nfl_teams, \"Bucs\", \"Buccaneers\", single=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Es ist denkbar, die Synonyme noch deutlich weiter zu ergänzen, z.B. mit den Namen der Quaterbacks. Dem Team \"Buccaneers\" könnte so weitere 1700 Kommentare mit der Entität \"Brady\" zugeordnet werden. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Anwendung des Synonym-Wörterbuchs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Das Synonym-Wörterbuch wird im folgenden auf den gesamten Dataframe angewendet." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "df0['entity'] = df0['entity'].apply(lambda x: x.casefold())" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "df0['team_by_entity'] = None\n", "for index, row in nfl_teams.iterrows():\n", " df0['team_by_entity'] = np.where(df0['entity'] == row['entity'].casefold(),row['team_by_entity'], df0['team_by_entity'])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([None, 'Buccaneers', 'Cowboys', 'Patriots', 'Jets', 'Rams',\n", " 'Broncos', '49ers', 'Eagles', 'Chiefs', 'Saints', 'Falcons',\n", " 'Steelers', 'Ravens', 'Browns', 'Dolphins', 'Texans', 'Lions',\n", " 'Commanders', 'Jaguars', 'Raiders', 'Giants', 'Bills', 'Packers',\n", " 'Titans', 'Colts', 'Vikings', 'Panthers', 'Cardinals', 'Bears',\n", " 'Chargers', 'Bengals', 'Seahawks'], dtype=object)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df0['team_by_entity'].unique()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "330066" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df0['team_by_entity'].isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Die Dokumente, welche keiner \"team_by_enity\" zugeordnet werden konnten, werden entfernt. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
team1score1team2score2yearweekvideoIDcommententitysaliencescoremagnitudeteam_by_entity
2Buccaneers31Cowboys292021Week 1HzkUcSd3UtcThey made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heatedtampa bay buccaneers0.0630780.00.0Buccaneers
12Buccaneers31Cowboys292021Week 1HzkUcSd3UtcThough Dallas has nothing to be ashamed of, this game was stripped from them. That was definitely a push-off by Godwin. Outside of that, it was an awesome game to watch.dallas0.2956580.00.0Cowboys
30Buccaneers31Cowboys292021Week 1HzkUcSd3UtcI’m very lost Tom brad 50mil contract best in the league barely get hurts. But go to Dallas 160mil for dak couldn’t clutch a last year game and get hurt a lot df🤔going on here the goat get paid less but a person who get hurt a lot get paid more with no rings🤨dallas0.021111-0.10.1Cowboys
36Buccaneers31Cowboys292021Week 1HzkUcSd3UtcBrady the \"goat\" barely very barely beating the cowboys.... I'd even go as far as to say struggled to beat them, barely beat them.... the goat.... ok sure... 2 points... :/cowboys0.0895470.10.1Cowboys
50Buccaneers31Cowboys292021Week 1HzkUcSd3UtcCan someone tell me how to get the Patriots game and watch it on my phone thank youpatriots0.0736560.20.2Patriots
\n", "
" ], "text/plain": [ " team1 score1 team2 score2 year week videoID \\\n", "2 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "12 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "30 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "36 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "50 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "\n", " comment \\\n", "2 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n", "12 Though Dallas has nothing to be ashamed of, this game was stripped from them. That was definitely a push-off by Godwin. Outside of that, it was an awesome game to watch. \n", "30 I’m very lost Tom brad 50mil contract best in the league barely get hurts. But go to Dallas 160mil for dak couldn’t clutch a last year game and get hurt a lot df🤔going on here the goat get paid less but a person who get hurt a lot get paid more with no rings🤨 \n", "36 Brady the \"goat\" barely very barely beating the cowboys.... I'd even go as far as to say struggled to beat them, barely beat them.... the goat.... ok sure... 2 points... :/ \n", "50 Can someone tell me how to get the Patriots game and watch it on my phone thank you \n", "\n", " entity salience score magnitude team_by_entity \n", "2 tampa bay buccaneers 0.063078 0.0 0.0 Buccaneers \n", "12 dallas 0.295658 0.0 0.0 Cowboys \n", "30 dallas 0.021111 -0.1 0.1 Cowboys \n", "36 cowboys 0.089547 0.1 0.1 Cowboys \n", "50 patriots 0.073656 0.2 0.2 Patriots " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df0.dropna(subset='team_by_entity').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ebenfalls werden die Dokumente entfernt, welche ein Teamnamen enthalten, welcher nicht für das betrachtete Spiel relevant ist und damit nicht die Stimmung der spielenden Mannschaften repräsentieren, z.B. ein Kommentar zu den Cowboys im Spiel der Packers gegen Buccaners." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Anreichung des DataFrames" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "df0['team_by_entity'] = np.where((df0['team_by_entity'] == df0['team1']) | (df0['team_by_entity'] == df0['team2']), df0['team_by_entity'], None )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Im nächsten Schritt werden den Größen 'Score' und 'Magnitude' ein \"Sentiment\" zugeordnet. Dafür werden Schwellenwerte für negativ, neutral und positiv festgelegt. \n", "\n", "|senitment|score|magnitude|\n", "|---------|-----|------|\n", "|positive| > 0.1| > 0.1|\n", "|negative| > -0.1| >0.1|\n", "|neutral| <= abs(0.1) | <= 0.1|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Um an dieser Stelle den \"Floating-point error\" zu umgehen, werden die Fließkommazahlen mit 10 multipliziert und in einen Integer umgewandelt, sodass sie ohne Fehler verglichen werden können. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
team1score1team2score2yearweekvideoIDcommententitysaliencescoremagnitudeteam_by_entitysentiment
0Buccaneers31Cowboys292021Week 1HzkUcSd3UtcThey made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heatedplayers0.8004800.00.0Noneneutral
1Buccaneers31Cowboys292021Week 1HzkUcSd3UtcThey made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heatednfl0.1364420.00.0Noneneutral
2Buccaneers31Cowboys292021Week 1HzkUcSd3UtcThey made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heatedtampa bay buccaneers0.0630780.00.0Buccaneersneutral
\n", "
" ], "text/plain": [ " team1 score1 team2 score2 year week videoID \\\n", "0 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "1 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "2 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "\n", " comment \\\n", "0 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n", "1 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n", "2 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n", "\n", " entity salience score magnitude team_by_entity sentiment \n", "0 players 0.800480 0.0 0.0 None neutral \n", "1 nfl 0.136442 0.0 0.0 None neutral \n", "2 tampa bay buccaneers 0.063078 0.0 0.0 Buccaneers neutral " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df0['sentiment'] = \"\" \n", "series_score = (df0['score']*10).astype(int)\n", "series_mag = (df0['magnitude']*10).astype(int)\n", "df0['sentiment'] = np.where((abs(series_score) <= 1) & (series_mag <= 1), 'neutral', df0['sentiment'] )\n", "df0['sentiment'] = np.where((series_score <= -1) & (series_mag > 1), 'negative', df0['sentiment'] )\n", "df0['sentiment'] = np.where((series_score >= 1) & (series_mag > 1), 'positive', df0['sentiment'] )\n", "df0.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Für die Analyse werden zwei weitere Spalten erzeugt. Die Spalte \"win_for_entity\" enthält die Attribute \"win\", \"draw\" oder \"defeat\", je nachdem ob die Entität aus der Zeile gewonnen oder verloren hat. \n", "Die Spalte \"winner\" enthält jeweils die Mannschaft, welche gewonnen hat. " ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
team1score1team2score2yearweekvideoIDcommententitysaliencescoremagnitudeteam_by_entitysentimentwin_for_entitywinner
0Buccaneers31Cowboys292021Week 1HzkUcSd3UtcThey made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heatedplayers0.8004800.00.0NoneneutraldefeatBuccaneers
1Buccaneers31Cowboys292021Week 1HzkUcSd3UtcThey made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heatednfl0.1364420.00.0NoneneutraldefeatBuccaneers
2Buccaneers31Cowboys292021Week 1HzkUcSd3UtcThey made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heatedtampa bay buccaneers0.0630780.00.0BuccaneersneutralwinBuccaneers
\n", "
" ], "text/plain": [ " team1 score1 team2 score2 year week videoID \\\n", "0 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "1 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "2 Buccaneers 31 Cowboys 29 2021 Week 1 HzkUcSd3Utc \n", "\n", " comment \\\n", "0 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n", "1 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n", "2 They made all the bad NFL players sign as Tampa Bay Buccaneers sometimes they get heated \n", "\n", " entity salience score magnitude team_by_entity sentiment \\\n", "0 players 0.800480 0.0 0.0 None neutral \n", "1 nfl 0.136442 0.0 0.0 None neutral \n", "2 tampa bay buccaneers 0.063078 0.0 0.0 Buccaneers neutral \n", "\n", " win_for_entity winner \n", "0 defeat Buccaneers \n", "1 defeat Buccaneers \n", "2 win Buccaneers " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df0['score1'] = df0['score1'].astype('int')\n", "df0['score2'] = df0['score2'].astype('int')\n", "df0['win_for_entity'] = 'draw'\n", "df0['winner'] = np.where(df0['score1']>df0['score2'],df0['team1'],df0['team2'])\n", "df0['win_for_entity'] = np.where(df0['winner']==df0['team_by_entity'],'win','defeat')\n", "df0.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Die bisherige Bearbeitung dient als Grundlage, um für die Analyse ein Tabelle mit Kennzahlen pro Mannschaft und Spiel zu bilden. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Die aggregierte Analyse-Tabelle soll folgenden Aufbau haben: \n", "\n", "* Mannschaft\n", "* Spielwoche\n", "* Ausgang: Sieg/Niederlage/Unentschieden \n", "* Durchschnittlicher Sentiment Score\n", "* Anzahl Kommentare mit positiven Sentiment\n", "* Anzahl Kommentare mit negativen Sentiment\n", "* Anzahl Kommentare mit neutralen Sentiment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Zum Zählen der Sentimentwerte wurde einen Funktion geschrieben, um Fehler zu vermeiden falls eine Kategorie (positive, negative, neutral) nicht vorkommt. " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def count_sentiment(inp_series, sentiment_value): \n", " try: \n", " count = inp_series.value_counts()[sentiment_value]\n", " except: \n", " count = 0\n", " return count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Die Aggragtion erfolgt über die Spalten \"team_by_entity\" und \"week\" mit den Funktionen: \n", "* first, zum Bestimmen des Ausgangs \n", "* mean, zur Bildung des durchschnittlichen Sentiment Scores\n", "* count_sentiment, zum Zählen der positiven, negativen und neutralen Kommentare" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "summary= df0.groupby(['team_by_entity','week']).agg(\n", " outcome=('win_for_entity', 'first'),\n", " avg_score=('score', 'mean'),\n", " count_neg=('sentiment', lambda x: count_sentiment(x, 'negative')),\n", " count_pos =('sentiment', lambda x: count_sentiment(x, 'positive')),\n", " count_neutral=('sentiment', lambda x: count_sentiment(x, 'neutral')))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "summary = summary.reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Abschließen wird die Spielwoche in einen numerischen Wert umgewandelt. " ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "summary['week'] = summary['week'].apply(lambda x: int(x.split()[1]))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "summary = summary.sort_values('week')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
team_by_entityweekoutcomeavg_scorecount_negcount_poscount_neutral
049ers1win0.046951294785
68Broncos1win0.2323811411777
85Browns1defeat0.128121125359386
102Buccaneers1win0.08444887178281
119Cardinals1win0.19396220119122
........................
42Bengals18defeat0.0176473410
364Panthers18defeat0.325000022
25Bears18defeat-0.125000215
467Seahawks18win0.0791671715
177Colts18defeat-0.042105775
\n", "

543 rows × 7 columns

\n", "
" ], "text/plain": [ " team_by_entity week outcome avg_score count_neg count_pos \\\n", "0 49ers 1 win 0.046951 29 47 \n", "68 Broncos 1 win 0.232381 14 117 \n", "85 Browns 1 defeat 0.128121 125 359 \n", "102 Buccaneers 1 win 0.084448 87 178 \n", "119 Cardinals 1 win 0.193962 20 119 \n", ".. ... ... ... ... ... ... \n", "42 Bengals 18 defeat 0.017647 3 4 \n", "364 Panthers 18 defeat 0.325000 0 2 \n", "25 Bears 18 defeat -0.125000 2 1 \n", "467 Seahawks 18 win 0.079167 1 7 \n", "177 Colts 18 defeat -0.042105 7 7 \n", "\n", " count_neutral \n", "0 85 \n", "68 77 \n", "85 386 \n", "102 281 \n", "119 122 \n", ".. ... \n", "42 10 \n", "364 2 \n", "25 5 \n", "467 15 \n", "177 5 \n", "\n", "[543 rows x 7 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Das Ergebniss wird in einer CSV-Datei gespeichert und im anschließenden Notebook zur Datenanalyse genutzt. " ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "summary.to_csv(\"summary_2021.csv\")" ] } ], "metadata": { "kernelspec": { "display_name": "condatascience", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "vscode": { "interpreter": { "hash": "fd0a90840973ec53c960faac2ec5a80b0cd969d8e1e064cbc11596d09821ee2d" } } }, "nbformat": 4, "nbformat_minor": 2 }