Capstone Part 1 - Classifying Political News Media Text with Natural Language Processing

For my final capstone project as part of the Data Science Immersive program at General Assembly NYC, I decided to create a political text classification model using Natural Language Processing. As with all data science projects, this was a non-linear, iterative process that required extensive data cleaning. I learned a great deal about NLP in the process, and I look forward to further projects on this topic. The full repo for this project can be found here.

Purpose

In 2018, I don’t think I need to remind anyone of the current news media landscape in the United States. The media culture is extremely polarized, and the societal effects of how people source their information are becoming increasingly evident. It hasn’t always been this way, and it is important to know what kind of source a piece of political news text comes from before deciding what to do with it. Political Science is often defined as the study of power. In the era of Trump, the power of language is increasingly relevant and important. In this project, I aim to build a model that predicts whether political media text is right wing or not right wing. I could have called these labels “right” and “left”, but I chose “one vs. rest” terminology to reflect the fact that the political spectrum isn’t as simple as Right and Left. This website has a great “media bias chart” for visualization. There is more to say about far right news media entering the mainstream, but we can save that for another day. This project is also a stepping stone to future projects that use multi-class targets to reflect the kinds of political news media out there more accurately, and to more intimidating and complex projects that tackle social media language.

Gather and Load Data

I collected data for 36,854 unique articles from a variety of right wing and not right wing sources. The features collected were title, description, date posted, url, author, and source. I chose obviously partisan sources and labeled each article as right wing or not based on its source. Since I would only be starting with title and description as features, one challenge I faced was making sure that any giveaway of the source was removed from an article’s text features. All of my data was scraped from News API except for Infowars and Democracy Now, for which I used BeautifulSoup to scrape article information from their respective RSS feeds.

My right wing sources included:

  • Fox News
  • Breitbart News
  • National Review
  • Infowars

Not right wing sources:

  • MSNBC
  • Huffington Post
  • Vice News
  • CNN
  • Democracy Now!

I scraped the data in separate notebooks and then loaded in all of the individual .csv files into my main notebook. The notebooks for this data collection process can be found in the main folder of the GitHub repo.
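
As a rough illustration of the RSS side of this step, here is a minimal sketch of turning feed items into labeled rows. It is not the code from my notebooks: it swaps in the standard library's ElementTree for BeautifulSoup so it runs without extra packages, the feed text is made up, and the function name `parse_rss_items` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up feed content for illustration only; not real scraped data
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First headline</title>
      <description>Summary of the first story.</description>
      <link>https://example.com/1</link>
      <pubDate>Wed, 20 Jun 2018 18:51:36 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <description>Summary of the second story.</description>
      <link>https://example.com/2</link>
      <pubDate>Wed, 20 Jun 2018 18:50:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text, source, yes_right):
    """Turn each <item> of an RSS 2.0 feed into a row dict labeled by its source."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter('item'):
        rows.append({
            'title': item.findtext('title', default=''),
            'description': item.findtext('description', default=''),
            'publishedAt': item.findtext('pubDate', default=''),
            'url': item.findtext('link', default=''),
            'source': source,        # the label comes from the source, not the text
            'yes_right': yes_right,  # 1 = right wing, 0 = not right wing
        })
    return rows


rows = parse_rss_items(SAMPLE_RSS, source='example feed', yes_right=0)
print(len(rows), rows[0]['title'])
```

A list of row dicts like this can be passed straight to `pd.DataFrame(rows)` and saved as a .csv for the main notebook.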

Clean Data

The initial data cleaning was quite simple and straightforward compared to the iterative process of removing noisy n-grams that would come once I began exploring and modeling. I dropped all of the rows with missing titles or descriptions, as those are the only features I would start with in my model.

# Check for missing values
text.isna().sum()
author         9466
description      69
publishedAt       0
source            0
title             1
url               0
yes_right         0
dtype: int64
# Dropped all rows with missing text, as that is all I will be using as features
text.dropna(subset=['description','title'], inplace=True)
# It's okay if there is no author
text.fillna('no_author', inplace=True)
text.shape
(36854, 7)
text.yes_right.value_counts()[0]/len(text.yes_right) 
# Baseline accuracy is 53.5%, almost a 50/50 split in target
# There are a bit more not right sources than right
0.53470450968687255

Infowars and Democracy Now are barely worth including in the model, because scraping an RSS feed would have taken a very long time to yield as many articles as I got for the other sources from News API. Still, I gathered as many articles as I could over a week’s time.

text.source.value_counts()
# Democracy Now and Infowars don't produce quite as much content...
fox news           6381
national review    5348
huffington post    4950
msnbc              4950
vice news          4949
breitbart          4948
cnn                4776
infowars            471
democracy now        81
Name: source, dtype: int64
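
The 53.5% baseline can be cross-checked by hand from these per-source counts. This is just arithmetic on the numbers printed above, using the right wing / not right wing source lists from earlier in the post; the variable names are mine.

```python
# Article counts per source, copied from the value_counts() output above
source_counts = {
    'fox news': 6381,
    'national review': 5348,
    'huffington post': 4950,
    'msnbc': 4950,
    'vice news': 4949,
    'breitbart': 4948,
    'cnn': 4776,
    'infowars': 471,
    'democracy now': 81,
}
right_sources = {'fox news', 'national review', 'breitbart', 'infowars'}

total = sum(source_counts.values())
right_total = sum(n for s, n in source_counts.items() if s in right_sources)
not_right_total = total - right_total

# Baseline accuracy = share of the majority class (always guessing "not right")
baseline = max(right_total, not_right_total) / total
print(total, right_total, not_right_total, round(baseline, 4))
# 36854 17148 19706 0.5347
```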

Removing Source Giveaways from the Title and Description

Removing source giveaways from the articles’ text features was a multi-step process. First, I wrote a function to pass to .apply() that removes any mention of any source from the title and description text, and also removes that specific article’s author from the text. This cleaned up a lot of source giveaways, but plenty still slipped through the cracks.

# Many CNN articles had "CNN" after the author's name
text['author'] = [a.replace(', CNN', '') for a in text['author']]
def remove_source_info(row):
    sources = ['Breitbart', 'CNN', 'Fox News', 'National Review', 'Vice News',
               'Democracy Now', 'Infowars', 'MSNBC', 'Huffington Post']
    for source in sources:
        row['description'] = row['description'].replace(source, '')
        row['title'] = row['title'].replace(source, '')
    row['title'] = row['title'].replace(row['author'], '')
    row['description'] = row['description'].replace(row['author'], '')
    return row
text = text.apply(remove_source_info, axis=1)
text.query("source=='Fox News'").head()
author description publishedAt source title url yes_right
9724 no_author Judge Napolitano and Marie Harf discuss the Tr... 2018-06-20T18:51:36Z Fox News Freedom Watch: Napolitano and Harf dig in on i... http://video.foxnews.com/v/5799861947001/ 1
9725 no_author Find out where and who caught a rare cotton ca... 2018-06-20T18:50:00Z Fox News See it: Cotton candy-colored lobster caught http://video.foxnews.com/v/5799865520001/ 1
9726 no_author Steve Harrigan reports from outside a detentio... 2018-06-20T18:47:58Z Fox News Media not given access to 'tender age' shelters http://video.foxnews.com/v/5799862990001/ 1
9727 no_author Reports: More than 2,000 children have been se... 2018-06-20T18:47:54Z Fox News Critics denounce 'tender age' shelters http://video.foxnews.com/v/5799860019001/ 1
9728 no_author Nearly a year after the body of little boy was... 2018-06-20T18:38:34Z Fox News ‘Little Jacob’ has been identified http://video.foxnews.com/v/5799856889001/ 1
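
One reason giveaways slip through is that `str.replace` is case-sensitive, so "FOX NEWS" or "cnn" would survive `remove_source_info`. A possible variation (not what I ran for this project) is a single case-insensitive regex with word boundaries. The function name `strip_sources` and the example strings are hypothetical.

```python
import re

SOURCES = ['Breitbart', 'CNN', 'Fox News', 'National Review', 'Vice News',
           'Democracy Now', 'Infowars', 'MSNBC', 'Huffington Post']

# Longest names first; \b keeps a short name like "CNN" from matching inside a longer word
SOURCE_PATTERN = re.compile(
    r'\b(?:' + '|'.join(re.escape(s) for s in sorted(SOURCES, key=len, reverse=True)) + r')\b',
    flags=re.IGNORECASE,
)


def strip_sources(text_value, author=None):
    """Remove source names (any casing) and the article's author from a string."""
    cleaned = SOURCE_PATTERN.sub('', text_value)
    if author and author != 'no_author':
        cleaned = cleaned.replace(author, '')
    # Collapse the whitespace left behind by the removals
    return ' '.join(cleaned.split())


print(strip_sources('FOX NEWS host speaks to cnn'))
print(strip_sources('Jane Doe reports from Breitbart', author='Jane Doe'))
```

Because the pattern matches whole names only, a phrase like "Vice President" is left alone even though "Vice News" is in the list.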

NLP Feature Engineering

Using NLP tools from the TextBlob library, I tagged each word of the title and description with a part of speech and added the normalized value counts for each part of speech as a new feature. The meaning of those parts of speech tags can be found here. I also used TextBlob’s sentiment analysis scoring to create features for polarity (positive or negative) and subjectivity. I then took the difference between title polarity and description polarity as well as the difference between title subjectivity and description subjectivity to measure any possible discrepancy in sentiment between title and description. Additionally, I created an average title word length feature. I could have spent weeks just engineering features using NLP tools, but this seemed like a good base to start with for this project.

text['combined'] = text.title + ' ' + text.description # all text together

For creation of features, I treated title and description separately, but for modeling with TfIdf I combined the two into one document for each row.

I created a feature of average word length in the title using a RegEx tokenizer.

import nltk
import numpy as np

tokenizer = nltk.RegexpTokenizer(r'\w+')
title_tokens = [tokenizer.tokenize(w) for w in text.title]

# Mean token length per title; a title with no tokens gets 0.0 so the list
# always has one value per row
avg_word_length = [np.mean([len(word) for word in title]) if title else 0.0
                   for title in title_tokens]

text['avg_word_len_title'] = avg_word_length

Sentiment Analysis

I used TextBlob to create polarity and subjectivity features for both the title and description.

text['title_polarity'] = [TextBlob(w).sentiment.polarity for w in text.title]

text['title_subjectivity'] = [TextBlob(w).sentiment.subjectivity for w in text.title]

text['desc_polarity'] = [TextBlob(w).sentiment.polarity for w in text.description]

text['desc_subjectivity'] = [TextBlob(w).sentiment.subjectivity for w in text.description]

text['subj_difference'] = text['title_subjectivity'] - text['desc_subjectivity']
text['polarity_difference']  = text['title_polarity'] - text['desc_polarity']

Parts of Speech Tagging

Also using TextBlob, I tagged each word with a part of speech and then took the normalized value counts for each part of speech in the document.

title_tags = [TextBlob(w.lower(), tokenizer=tokenizer).tags for w in text.title]
title_tags[0]
[('peter', 'NN'),
 ('fonda', 'NN'),
 ('lying', 'VBG'),
 ('gash', 'JJ'),
 ('kirstjen', 'NNS'),
 ('nielsen', 'VBN'),
 ('should', 'MD'),
 ('be', 'VB'),
 ('whipped', 'VBN'),
 ('naked', 'JJ'),
 ('in', 'IN'),
 ('public', 'JJ')]
tags_counts = []
for row in title_tags:
    tags = [n[1] for n in row]
    tags_counts.append(tags)
title_parts_of_speech = []
for n in tags_counts:
    foo = dict(pd.Series(n).value_counts(normalize=True))
    title_parts_of_speech.append(foo)
title_parts_of_speech = pd.DataFrame(title_parts_of_speech).fillna(0)
# A NaN value means that part of speech did not appear in the text
# Label as title parts of speech
title_parts_of_speech.columns = [str(n) + '_title' for n in title_parts_of_speech.columns]

I then followed the same process for description.
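
Since the description code is not shown, here is a minimal sketch of a helper that could serve both columns. It assumes the tokens are already tagged (as in the `title_tags[0]` output above) and uses `collections.Counter` in place of `pd.Series.value_counts(normalize=True)`; the name `pos_proportions` is hypothetical.

```python
from collections import Counter


def pos_proportions(tagged_tokens, suffix):
    """Return {TAG_suffix: share of tokens carrying that tag} for one document."""
    tags = [tag for _, tag in tagged_tokens]
    total = len(tags)
    if total == 0:
        return {}  # empty document: every tag column ends up as 0 after fillna(0)
    return {f'{tag}_{suffix}': n / total for tag, n in Counter(tags).items()}


# The tagged title shown above
example_tags = [('peter', 'NN'), ('fonda', 'NN'), ('lying', 'VBG'), ('gash', 'JJ'),
                ('kirstjen', 'NNS'), ('nielsen', 'VBN'), ('should', 'MD'), ('be', 'VB'),
                ('whipped', 'VBN'), ('naked', 'JJ'), ('in', 'IN'), ('public', 'JJ')]

title_row = pos_proportions(example_tags, 'title')
print(title_row['JJ_title'])  # 3 of 12 tokens are adjectives -> 0.25
```

A list of these dicts, one per row, can go through the same `pd.DataFrame(...).fillna(0)` step used for the titles.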

# Checking for correct length
desc_parts_of_speech.shape[0] == title_parts_of_speech.shape[0]
True
pos_tags = pd.concat([title_parts_of_speech, desc_parts_of_speech], axis=1)
# Concatenate all created features together with target
df = pd.concat([text[['title_polarity', 'title_subjectivity', 'desc_polarity',
                      'desc_subjectivity', 'subj_difference', 'polarity_difference',
                      'avg_word_len_title']], pos_tags, text.yes_right], axis=1)
df.head(3)
title_polarity title_subjectivity desc_polarity desc_subjectivity subj_difference polarity_difference avg_word_len_title CC_title CD_title DT_title ... VBD_desc VBG_desc VBN_desc VBP_desc VBZ_desc WDT_desc WP_desc WP$_desc WRB_desc yes_right
0 0.50 0.233333 0.500000 0.000000 0.233333 0.000000 5.166667 0.0 0.0 0.0 ... 0.068182 0.045455 0.022727 0.0 0.00 0.0 0.022727 0.0 0.0 1
1 0.25 0.900000 0.250000 0.650000 0.250000 0.000000 5.750000 0.0 0.0 0.0 ... 0.000000 0.160000 0.040000 0.0 0.04 0.0 0.000000 0.0 0.0 1
2 0.40 0.300000 0.555417 0.666667 -0.366667 -0.155417 6.545455 0.0 0.0 0.0 ... 0.020000 0.060000 0.020000 0.1 0.04 0.0 0.000000 0.0 0.0 1

3 rows × 79 columns

All of the engineered features are already on a 0-to-1 scale except for a few, which I min-max scaled manually.

def scaled_checker(df):
    # Print every column that has values outside the [0, 1] range
    for col in df.columns:
        if df[col].max() > 1 or df[col].min() < 0:
            print(col)
        
scaled_checker(df)
subj_difference
polarity_difference
avg_word_len_title
# Min-max scale the three out-of-range columns to [0, 1]
for col in ['subj_difference', 'polarity_difference', 'avg_word_len_title']:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

scaled_checker(df) # numerical data is scaled
# Save my mostly cleaned datasets
text.to_csv('./datasets/text2.csv')
df.to_csv('./datasets/df2.csv')

Explore Data

Exploring the sentiment analysis features, there is not much of significance to note. Correlations between the features and the target are extremely weak. The distributions had the same shape for both classes: the polarity distributions were roughly normal, and the subjectivity distributions were skewed right. In other words, most text was neutral in terms of polarity and most text was more objective than subjective. One difference is that there was more subjective language in the descriptions than in the titles.

num_corrs = df[['title_polarity','title_subjectivity','desc_polarity',
                'desc_subjectivity', 'subj_difference','polarity_difference',
                'avg_word_len_title','yes_right']].corr()
mask = np.zeros_like(num_corrs, dtype=bool)  # np.bool was removed in NumPy 1.24
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(15,8))
plt.title('Correlations Between Sentiment Analysis Features and Target', fontsize=20)
sns.heatmap(num_corrs, linewidths=0.5, mask=mask, annot=True);
# Very insignificant 

[Figure: heatmap of correlations between the sentiment analysis features and the target]

# Looking at features based on target
rightwing = df[df.yes_right==1]
notrightwing = df[df.yes_right==0]
figure, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

rightwing.title_polarity.plot(kind='hist', ax=ax[0,1],
                              title='Title Polarity: Right')
notrightwing.title_polarity.plot(kind='hist', ax=ax[0,0],
                                 title='Title Polarity: Not Right')
notrightwing.desc_polarity.plot(kind='hist',
                                ax=ax[1,0], title='Description Polarity: Not Right')
rightwing.desc_polarity.plot(kind='hist',
                             ax=ax[1,1], title='Description Polarity: Right');

[Figure: histograms of title and description polarity, right vs. not right]

figure, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

rightwing.title_subjectivity.plot(kind='hist', ax=ax[0,1],
                                  title='Title Subjectivity: Right')
notrightwing.title_subjectivity.plot(kind='hist', ax=ax[0,0],
                                     title='Title Subjectivity: Not Right')
notrightwing.desc_subjectivity.plot(kind='hist', ax=ax[1,0],
                                    title='Description Subjectivity: Not Right')
rightwing.desc_subjectivity.plot(kind='hist', ax=ax[1,1],
                                 title='Description Subjectivity: Right');

[Figure: histograms of title and description subjectivity, right vs. not right]

Exploring Word Counts

In the following several sections, I used Count Vectorizer to find the highest-count single-grams, bi-grams, tri-grams, and quad-grams in the titles and descriptions, split by target class. After initially looking at the counts and seeing a lot of source information and noise, I created a custom preprocessor that removes stop grams of noise and source information and lemmatizes the text. I also created a custom group of stop words that contain source information and appended it to scikit-learn’s list of English stop words, which I used in all of my Count Vectorizer and TfIdf models. The initial word counts demonstrated that there is a lot of noise in the data, and that TfIdf might be the best choice of vectorizer because it down-weights terms that appear in many documents. However, when a term appeared frequently in only one source, I removed it manually.

There were a couple of interesting patterns, especially as the n-grams got longer, but the most common n-grams were mostly similar between right and not right until the tri-gram level, where noticeable differences began to form.
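
To make the TfIdf point concrete, here is a small worked example of inverse document frequency on made-up documents. The formula is the smoothed variant, ln((1 + n) / (1 + df)) + 1, which to my understanding is what scikit-learn uses by default; treat that as an assumption rather than a statement about my exact model settings.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# Toy documents, invented for illustration
docs = [
    'trump north korea summit',
    'trump white house',
    'trump family separation policy',
    'trump world cup',
]
tokenized = [set(d.split()) for d in docs]
vocab = sorted(set().union(*tokenized))

# Document frequency = number of documents that contain the term
idf = {term: smoothed_idf(len(docs), sum(term in d for d in tokenized))
       for term in vocab}

print(round(idf['trump'], 3))   # appears in every document -> 1.0, the minimum weight
print(round(idf['summit'], 3))  # appears in one document -> 1.916
```

A term like "trump" that shows up everywhere gets the smallest possible weight, while a term that is frequent within a single source still scores high, which is why those had to be removed by hand.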

Set Up Vectorizer

lemmatizer = WordNetLemmatizer()

def my_preprocessor(doc):
    stop_grams = ['national review','National Review','Fox News','Content Uploads',
                  'content uploads','fox news', 'Associated Press','associated press',
                  'Fox amp Friends','Rachel Maddow','Morning Joe','morning joe',
                  'Breitbart News', 'fast facts', 'Fast Facts','Fox &', 'Fox & Friends',
                  'Ali Velshi','Stephanie Ruhle','Raw video', '& Friends', 'Ari Melber',
                  'amp Friends', 'Content uploads', 'Geraldo Rivera']
    for stop in stop_grams:
        doc = doc.replace(stop,'')
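    # Note: WordNetLemmatizer.lemmatize() expects a single word, so for a
    # multi-word document this most likely returns the lowercased text unchanged
    return lemmatizer.lemmatize(doc.lower())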
cust_stop_words = ['CNN','cnn','amp','huffpost','fox','reuters','ap','vice','breitbart',
                   'nationalreview', 'www','msnbc','infowars', 'foxnews','Vice','Breitbart',
                   'National Review','Fox News','Reuters','Fast Facts','Infowars','Vice',
                   'MSNBC','www','AP','Huffpost','HuffPost','Maddow', '&', 'Ingraham', 'ingraham']
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
original_stopwords = list(ENGLISH_STOP_WORDS)
cust_stop_words += original_stopwords
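
One caveat about this list: a custom preprocessor replaces the vectorizer's own lowercasing step, and stop words are compared against single tokens, so entries with capital letters or spaces (such as 'Fox News') probably never match anything once `my_preprocessor` has lowercased the text. A minimal sketch of tidying such a list follows; it uses only a subset of my entries, and the helper name `normalize_stop_words` is hypothetical.

```python
# A subset of the custom entries above, for illustration
raw_stop_words = ['CNN', 'cnn', 'amp', 'huffpost', 'National Review', 'Fox News',
                  'Fast Facts', 'HuffPost', '&', 'Ingraham']


def normalize_stop_words(entries):
    """Lowercase and de-duplicate entries, separating multi-word phrases."""
    single_tokens, phrases = set(), set()
    for entry in entries:
        lowered = entry.lower().strip()
        if ' ' in lowered:
            phrases.add(lowered)        # belongs with the stop grams, not the stop words
        elif lowered:
            single_tokens.add(lowered)  # usable as a token-level stop word
    return sorted(single_tokens), sorted(phrases)


tokens, phrases = normalize_stop_words(raw_stop_words)
print(tokens)
print(phrases)
```

The single tokens are what the `stop_words` argument can act on; the phrases would be better handled in the preprocessor's `stop_grams` list.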

Single Grams for Title

right = text['yes_right'] == 1
notright = text['yes_right'] == 0
# Separate title and description based on target class
right_title = text[right].title
right_desc = text[right].description
notright_title = text[notright].title
notright_desc = text[notright].description
stem = PorterStemmer()
lemmatizer = WordNetLemmatizer()
def get_top_grams(df_r, df_nr, n):
    # Return the 15 most frequent n-grams for each target class
    def top_counts(docs, label):
        cvec = CountVectorizer(preprocessor=my_preprocessor,
                               ngram_range=(n,n), stop_words=cust_stop_words, min_df=5)
        # Sum the sparse count matrix directly instead of densifying it
        totals = np.asarray(cvec.fit_transform(docs).sum(axis=0)).ravel()
        # get_feature_names() was removed in scikit-learn 1.2
        counts = pd.Series(totals, index=cvec.get_feature_names_out())
        return counts.sort_values(ascending=False).head(15).to_frame(label)
    return top_counts(df_r, 'right'), top_counts(df_nr, 'not right')
r, nr = get_top_grams(right_title, notright_title, 1)
r
right
trump 2981
report 569
north 523
korea 502
says 481
new 453
house 419
border 419
kim 397
summit 390
president 386
day 375
police 366
immigration 340
donald 304
nr
not right
trump 3872
new 1043
says 585
house 471
2018 430
north 414
korea 401
cohen 391
mueller 377
white 350
kim 347
18 339
world 338
like 333
gop 328

Single Grams for Description

r, nr = get_top_grams(right_desc, notright_desc, 1)
r
right
trump 3830
president 3216
new 1479
house 1214
said 1205
donald 1111
north 938
state 909
says 846
news 760
year 735
white 734
police 703
tuesday 682
people 632
nr
not right
trump 4756
president 2799
new 2052
donald 1374
said 1093
house 1039
says 840
people 831
discuss 777
white 768
north 721
michael 710
news 687
day 623
year 584

BiGrams for Title

r, nr = get_top_grams(right_title, notright_title, 2)
r
right
north korea 417
donald trump 303
white house 230
kim jong 182
president trump 181
ig report 141
supreme court 129
trump kim 121
judicial activism 104
day liberal 104
liberal judicial 104
new york 100
year old 91
kim summit 82
cartoons day 80
nr
not right
north korea 321
white house 247
michael cohen 196
kim jong 165
donald trump 160
stormy daniels 113
world cup 111
royal wedding 99
scott pruitt 98
president trump 96
new york 87
korea summit 86
family separation 85
trump kim 83
kanye west 77

BiGrams for Description

r, nr = get_top_grams(right_desc, notright_desc, 2)
r # Seeing some noise here 
right
president trump 1187
donald trump 1088
president donald 932
white house 621
new york 481
kim jong 474
north korea 473
year old 339
united states 314
north korean 313
trump administration 257
wp com 238
content uploads 238
com com 238
wp content 238
nr
not right
donald trump 1352
president trump 904
president donald 673
white house 640
new york 420
kim jong 414
michael cohen 405
trump administration 401
north korea 386
north korean 254
special counsel 236
robert mueller 233
rudy giuliani 224
united states 223
stormy daniels 213

TriGrams for Title

r, nr = get_top_grams(right_title, notright_title, 3)
r # Mentions of MS 13 Gang and "judicial activism"
right
liberal judicial activism 104
day liberal judicial 104
north korea summit 75
trump kim summit 66
new york times 27
border patrol agents 26
south china sea 26
trump kim jong 23
judicial activism june 22
cartoons day march 21
judicial activism march 20
ms 13 gang 20
judicial activism april 19
cartoons day april 19
judicial activism february 19
nr
# Interesting: NYT is not one of my sources
# Mentions of Stormy Daniels and immigration policy
not right
north korea summit 73
trump kim summit 37
family separation policy 29
trump tower meeting 28
quickly catch day 21
catch day news 21
trump legal team 21
trump north korea 20
iran nuclear deal 20
stormy daniels lawyer 19
daily horoscope june 19
trump kim jong 18
mtp daily june 17
santa fe high 17
sarah huckabee sanders 17

TriGrams for Description

r, nr = get_top_grams(right_desc, notright_desc, 3)
r # For some reason "content uploads" is still appearing. Source giveaway. 
# Also image noise data.
right
president donald trump 926
wp com com 238
wp content uploads 238
com wp content 238
com com wp 238
content uploads 2018 233
jpg fit 1024 218
1024 2c597 ssl 216
fit 1024 2c597 216
dictator kim jong 125
new york times 118
special counsel robert 110
leader kim jong 108
counsel robert mueller 108
new york city 102
nr
not right
president donald trump 672
north korean leader 158
new york times 155
leader kim jong 149
korean leader kim 133
special counsel robert 125
counsel robert mueller 123
fbi director james 81
director james comey 81
joy reid panel 81
reid panel discuss 74
lawyer michael cohen 69
attorney general jeff 64
general jeff sessions 64
new york city 60

QuadGrams for Title

r, nr = get_top_grams(right_title, notright_title, 4)
r
# Again, lots of "judicial activism", and MS 13 appears
right
day liberal judicial activism 104
liberal judicial activism june 22
liberal judicial activism march 20
liberal judicial activism february 19
liberal judicial activism april 19
things caught eye today 11
north korea kim jong 9
fashion notes melania trump 9
santa fe high school 7
new york attorney general 7
public sector union dues 7
ms 13 gang members 7
fashion designer kate spade 7
southern poverty law center 7
trump says north korea 7
nr
not right
quickly catch day news 21
santa fe high school 15
trump family separation policy 14
prince harry meghan markle 13
new albums heavy rotation 11
new shows netflix stream 10
watch hulu new week 10
watch amazon prime new 10
trump north korea summit 10
amazon prime new week 10
best new shows netflix 10
shows netflix stream right 10
ranking best new shows 10
funniest tweets women week 9
20 funniest tweets women 9

QuadGrams for Description

r, nr = get_top_grams(right_desc, notright_desc, 4)
r
# Very noisy
right
com wp content uploads 238
com com wp content 238
wp com com wp 238
wp content uploads 2018 233
fit 1024 2c597 ssl 216
jpg fit 1024 2c597 209
special counsel robert mueller 108
north korean dictator kim 94
korean dictator kim jong 94
north korean leader kim 91
korean leader kim jong 90
https i0 wp com 83
i0 wp com com 83
i2 wp com com 78
https i2 wp com 78
nr
# Less noise, but still source giveaways like "Chris Matthews". 
not right
north korean leader kim 133
korean leader kim jong 132
special counsel robert mueller 123
fbi director james comey 81
joy reid panel discuss 74
attorney general jeff sessions 63
discuss political news day 56
matthews panel guests discuss 56
panel guests discuss political 56
guests discuss political news 56
chris matthews panel guests 56
mika discuss big news 55
joe mika discuss big 55
discuss big news day 55
hayes discusses day political 55

Wordclouds

These provide a nice, fun visualization of the words found in right wing and not right wing title text. The larger the word, the more often it appears.

plt.figure(figsize=(15,8))
wordcloud = WordCloud(font_path='/Library/Fonts/Verdana.ttf',
                      max_words=(100),
                      width=2000, height=1000,
                      relative_scaling = 0.5,
                      background_color='white',
                      colormap='Dark2'
                  
).generate(' '.join(text[right].title))
plt.imshow(wordcloud)
plt.title("Right Wing Title Text", fontsize=24)
plt.axis("off")
plt.show()

[Figure: word cloud of right wing title text]

plt.figure(figsize=(15,8))
wordcloud = WordCloud(font_path='/Library/Fonts/Verdana.ttf',
                      max_words=(100),
                      width=2000, height=1000,
                      relative_scaling = 0.5,
                      background_color='white',
                      colormap='Dark2'
                  
).generate(' '.join(text[notright].title))
plt.imshow(wordcloud)
plt.title("Not Right Wing Title Text", fontsize=24)
plt.axis("off")
plt.show()

[Figure: word cloud of not right wing title text]

In the next post I will go over the modeling process and evaluation (the fun part!) of this project.

To be continued…

Written on July 14, 2018