Capstone Part 1 - Classifying Political News Media Text with Natural Language Processing

For my final capstone project as part of the Data Science Immersive program at General Assembly NYC, I decided to create a political text classification model using Natural Language Processing. As with all data science projects, this was a non-linear, iterative process that required extensive data cleaning. I learned a great deal about NLP in the process, and I look forward to further projects on this topic. The full repo for this project can be found here.

Purpose

In 2018, I don’t think I need to remind anyone of the current news media landscape in the United States. Needless to say, the media culture is extremely polarized and the societal effects of how people source their information is becoming increasingly evident. It hasn’t always been this way, and it is important to be able to know what kind of source a piece of political news text is coming from before deciding what to do with it. Political Science is often defined as the study of power. In the era of Trump, the power of language is increasingly relevant and important. In this project, I aim to build a model that predicts whether or not political media text is right wing or not right wing. I could have chosen to call these labels “right” and “left”, but I chose more of a “one vs. rest” terminology to reflect the fact that the political spectrum isn’t as simple as Right and Left. This website has a great “media bias chart” for visualization. There is more to say about far right news media entering the mainstream, but we can save that for another day. Furthermore, this project is a stepping stone to future projects using multi class targets that more accurately reflect the kinds of political news media that are out there. Likewise, it is a stepping stone to more intimidating and complex projects that tackle social media language as well.

Gathering and Load Data

I collected data for 36,854 unique articles from a variety of right wing and not right wing sources. The features collected were title, description, date posted, url, author, and source. Choosing obviously partisan sources, I labeled whether they are right wing or not based on the source. Therefore, since I would only be starting with title and description as features, a challenge I faced is making sure that any giveaway of the source is removed from an article’s text features. All of my data was scraped from News API except for Infowars and Democracy Now, for which I used BeautifulSoup to scrape article information from their respective RSS feeds.

My right wing sources included:

Fox News
Breitbart News
National Review
Infowars

Not right wing sources:

MSNBC
Huffington Post
Vice News
CNN
Democracy Now!

I scraped the data in separate notebooks and then loaded in all of the individual .csv files into my main notebook. The notebooks for this data collection process can be found in the main folder of the GitHub repo.

Clean Data

The initial data cleaning was quite simple and straightforward compared to the iterative process of removing noisy n-grams that would come once I began exploring and modeling. I dropped all of the rows with missing titles or descriptions, as those are the only features I would start with in my model.

# Check for missing values
text.isna().sum()

author         9466
description      69
publishedAt       0
source            0
title             1
url               0
yes_right         0
dtype: int64

# Dropped all rows with missing text, as that is all I will be using as features
text.dropna(subset=['description','title'], inplace=True)

# It's okay if there is no author
text.fillna('no_author', inplace=True)

text.shape

(36854, 7)

text.yes_right.value_counts()[0]/len(text.yes_right) 
# Baseline accuracy is 53.5%, almost a 50/50 split in target
# There are a bit more not right sources than right

0.53470450968687255

It’s barely worth including Infowars and Democracy Now in the model as it would have taken quite a long time to gather a comparable amount of articles as those from News API by scraping from an RSS feed. However, I got as many articles as I could over a week’s time.

text.source.value_counts()
# Democracy Now and Infowars don't produce quite as much content...

fox news           6381
national review    5348
huffington post    4950
msnbc              4950
vice news          4949
breitbart          4948
cnn                4776
infowars            471
democracy now        81
Name: source, dtype: int64

Removing Source Giveaways from the Title and Description

There was a multi-step process to remove source giveaways from the articles’ text features. Here I create a function to use in the .apply() function to remove any mention of any source from all of the title and description text, as well as remove that specific article’s author from the text. This cleaned up a lot of source giveaways, but there was still much that slipped through the cracks.

# Many CNN articles had "CNN" after the author's name
text['author'] = [a.replace(', CNN', '') for a in text['author']]

def remove_source_info(row):
    sources = ['Breitbart', 'CNN', 'Fox News', 'National Review', 'Vice News',
               'Democracy Now', 'Infowars', 'MSNBC', 'Huffington Post']
    for source in sources:
        row['description'] = row['description'].replace(source, '')
        row['title'] = row['title'].replace(source, '')
    row['title'] = row['title'].replace(row['author'], '')
    row['description'] = row['description'].replace(row['author'], '')
    return row

text = text.apply(remove_source_info, axis=1)

text.query("source=='Fox News'").head()

	author	description	publishedAt	source	title	url	yes_right
9724	no_author	Judge Napolitano and Marie Harf discuss the Tr...	2018-06-20T18:51:36Z	Fox News	Freedom Watch: Napolitano and Harf dig in on i...	http://video.foxnews.com/v/5799861947001/	1
9725	no_author	Find out where and who caught a rare cotton ca...	2018-06-20T18:50:00Z	Fox News	See it: Cotton candy-colored lobster caught	http://video.foxnews.com/v/5799865520001/	1
9726	no_author	Steve Harrigan reports from outside a detentio...	2018-06-20T18:47:58Z	Fox News	Media not given access to 'tender age' shelters	http://video.foxnews.com/v/5799862990001/	1
9727	no_author	Reports: More than 2,000 children have been se...	2018-06-20T18:47:54Z	Fox News	Critics denounce 'tender age' shelters	http://video.foxnews.com/v/5799860019001/	1
9728	no_author	Nearly a year after the body of little boy was...	2018-06-20T18:38:34Z	Fox News	‘Little Jacob’ has been identified	http://video.foxnews.com/v/5799856889001/	1

NLP Feature Engineering

Using NLP tools from the TextBlob library, I tagged each word of the title and description with a part of speech and added the normalized value counts for each part of speech as a new feature. The meaning of those parts of speech tags can be found here. I also used TextBlob’s sentiment analysis scoring to create features for polarity (positive or negative) and subjectivity. I then took the difference between title polarity and description polarity as well as the difference between title subjectivity and description subjectivity to measure any possible discrepancy in sentiment between title and description. Additionally, I created an average title word length feature. I could have spent weeks just engineering features using NLP tools, but this seemed like a good base to start with for this project.

text['combined'] = text.title + ' ' + text.description # all text together

For creation of features, I treated title and description separately, but for modeling with TfIdf I combined the two into one document for each row.

I created a feature of average word length in the title using a RegEx tokenizer.

tokenizer = nltk.RegexpTokenizer(r'\w+')
title_tokens = [tokenizer.tokenize(w) for w in text.title]

avg_word_length = []
for title in title_tokens:
    wordlen = []
    for word in title:
        wordlen.append(len(word))
        if len(wordlen)==len(title):
            avg_word_length.append(np.sum(wordlen)/len(wordlen))

text['avg_word_len_title'] = avg_word_length

Sentiment Analysis

I used TextBlob to create polarity and subjectivity features for both the title and description.

text['title_polarity'] = [TextBlob(w).sentiment.polarity for w in text.title]

text['title_subjectivity'] = [TextBlob(w).sentiment.subjectivity for w in text.title]

text['desc_polarity'] = [TextBlob(w).sentiment.polarity for w in text.description]

text['desc_subjectivity'] = [TextBlob(w).sentiment.subjectivity for w in text.description]

text['subj_difference'] = text['title_subjectivity'] - text['desc_subjectivity']
text['polarity_difference']  = text['title_polarity'] - text['desc_polarity']

Parts of Speech Tagging

Also using TextBlob, I tagged each word with a part of speech and then took the normalized value counts for each part of speech in the document.

title_tags = [TextBlob(w.lower(), tokenizer=tokenizer).tags for w in text.title]

title_tags[0]

[('peter', 'NN'),
 ('fonda', 'NN'),
 ('lying', 'VBG'),
 ('gash', 'JJ'),
 ('kirstjen', 'NNS'),
 ('nielsen', 'VBN'),
 ('should', 'MD'),
 ('be', 'VB'),
 ('whipped', 'VBN'),
 ('naked', 'JJ'),
 ('in', 'IN'),
 ('public', 'JJ')]

tags_counts = []
for row in title_tags:
    tags = [n[1] for n in row]
    tags_counts.append(tags)

title_parts_of_speech = []
for n in tags_counts:
    foo = dict(pd.Series(n).value_counts(normalize=True))
    title_parts_of_speech.append(foo)

title_parts_of_speech = pd.DataFrame(title_parts_of_speech).fillna(0)
# A NaN value means that part of speech did not appear in the text

# Label as title parts of speech
title_parts_of_speech.columns = [str(n) + '_title' for n in title_parts_of_speech.columns]

I then followed the same process for description.

# Checking for correct length
desc_parts_of_speech.shape[0] == title_parts_of_speech.shape[0]

True

pos_tags = pd.concat([title_parts_of_speech, desc_parts_of_speech], axis=1)

# Concatenate all created features together with target
df = pd.concat([text[['title_polarity', 'title_subjectivity', 'desc_polarity',
                      'desc_subjectivity', 'subj_difference', 'polarity_difference',
                      'avg_word_len_title']], pos_tags, text.yes_right], axis=1)
df.head(3)

	title_polarity	title_subjectivity	desc_polarity	desc_subjectivity	subj_difference	polarity_difference	avg_word_len_title	...	VBD_desc	VBG_desc	VBN_desc	VBP_desc	VBZ_desc	WP_desc	yes_right
0	0.50	0.233333	0.500000	0.000000	0.233333	0.000000	5.166667	...	0.068182	0.045455	0.022727	0.0	0.00	0.022727	1
1	0.25	0.900000	0.250000	0.650000	0.250000	0.000000	5.750000	...	0.000000	0.160000	0.040000	0.0	0.04	0.000000	1
2	0.40	0.300000	0.555417	0.666667	-0.366667	-0.155417	6.545455	...	0.020000	0.060000	0.020000	0.1	0.04	0.000000	1

3 rows × 79 columns

Every created features is already scaled except for a few, which I just simply manually scaled.

def scaled_checker(df):
    for col in df.columns:
        if max(df[col]) > 1:
            print(col)
        if min(df[col]) < 0:
            print(col)
        else:
            pass
        
scaled_checker(df)

subj_difference
polarity_difference
avg_word_len_title

df['subj_difference'] = (df['subj_difference'] - min(df['subj_difference'])) \n
    /(max(df['subj_difference'])-min(df['subj_difference']))
df['polarity_difference'] = (df['polarity_difference'] - min(df['polarity_difference']))/ \n
    (max(df['polarity_difference'])-min(df['polarity_difference']))
df['avg_word_len_title'] = (df['avg_word_len_title'] - min(df['avg_word_len_title']))/ \n
    (max(df['avg_word_len_title'])-min(df['avg_word_len_title']))

scaled_checker(df) # numerical data is scaled

# Save my mostly cleaned datasets
text.to_csv('./datasets/text2.csv')
df.to_csv('./datasets/df2.csv')

Explore Data

Exploring the sentiment analysis features, there is not much significant to note. Correlations between features and the target are extremely weak. All distributions had the same shape, with all of polarity having a normal distribution and all description distributions being skewed right. In other words, most text was neutral in terms of polarity and most text was more objective than subjective. One difference is that there was more subjective language in the descriptions than in the titles.

num_corrs = df[['title_polarity','title_subjectivity','desc_polarity',
                'desc_subjectivity', 'subj_difference','polarity_difference',
                'avg_word_len_title','yes_right']].corr()
mask = np.zeros_like(num_corrs, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(15,8))
plt.title('Correlations Between Sentiment Analysis Features and Target', fontsize=20)
sns.heatmap(num_corrs, linewidths=0.5, mask=mask, annot=True);
# Very insignificant 

png

# Looking at features based on target
rightwing = df[df.yes_right==1]
notrightwing = df[df.yes_right==0]

figure, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

rightwing.title_polarity.plot(kind='hist', ax=ax[0,1],
                              title='Title Polarity: Right')
notrightwing.title_polarity.plot(kind='hist', ax=ax[0,0],
                                 title='Title Polarity: Not Right')
notrightwing.desc_polarity.plot(kind='hist',
                                ax=ax[1,0], title='Description Polarity: Not Right')
rightwing.desc_polarity.plot(kind='hist',
                             ax=ax[1,1], title='Description Polarity: Right');

png

figure, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

rightwing.title_subjectivity.plot(kind='hist', ax=ax[0,1],
                                  title='Title Subjectivity: Right')
notrightwing.title_subjectivity.plot(kind='hist', ax=ax[0,0],
                                     title='Title Subjectivity: Not Right')
notrightwing.desc_subjectivity.plot(kind='hist', ax=ax[1,0],
                                    title='Description Subjectivity: Not Right')
rightwing.desc_subjectivity.plot(kind='hist', ax=ax[1,1],
                                 title='Description Subjectivity: Right');

png

Exploring Word Counts

In the following several sections, I used Count Vectorizer to find the highest count single-grams, bi-grams, tri-grams, and quad-grams in the titles and descriptions, sorted by both target classes. After initially looking at the counts and seeing a lot of source information and noise, I created a custom preprocessor that removes stop grams of noise and source information as well as lemmatize the text. I also created a custom group of stop words that contain source information and appended it to SciKitLearn’s list of English stop words, which I used in all of my Count Vectorizer and TfIdf models. The initial words counts demonstrated that there is much noise in the data, and TfIdf might be the best choice for a vectorizer as it will penalize terms that appear very frequently across documents. However, if some of these terms only appear frequently in only one unique source, I decided to remove them manually.

There were a couple interesting interpretations, especially as the n-grams increased, but the most common n-grams in the documents were mostly similar between left and right until the tri-gram level, when noticeable differences began to form.

Set Up Vectorizer

lemmatizer = WordNetLemmatizer()

def my_preprocessor(doc):
    stop_grams = ['national review','National Review','Fox News','Content Uploads',
                  'content uploads','fox news', 'Associated Press','associated press',
                  'Fox amp Friends','Rachel Maddow','Morning Joe','morning joe',
                  'Breitbart News', 'fast facts', 'Fast Facts','Fox &', 'Fox & Friends',
                  'Ali Velshi','Stephanie Ruhle','Raw video', '& Friends', 'Ari Melber',
                  'amp Friends', 'Content uploads', 'Geraldo Rivera']
    for stop in stop_grams:
        doc = doc.replace(stop,'')
    return lemmatizer.lemmatize(doc.lower())

cust_stop_words = ['CNN','cnn','amp','huffpost','fox','reuters','ap','vice','breitbart',
                   'nationalreview', 'www','msnbc','infowars', 'foxnews','Vice','Breitbart',
                   'National Review','Fox News','Reuters','Fast Facts','Infowars','Vice',
                   'MSNBC','www','AP','Huffpost','HuffPost','Maddow', '&', 'Ingraham', 'ingraham']
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
original_stopwords = list(ENGLISH_STOP_WORDS)
cust_stop_words += original_stopwords

Single Grams for Title

right = text['yes_right'] == 1
notright = text['yes_right'] == 0

# Separate title and description based on target class
right_title = text[right].title
right_desc = text[right].description
notright_title = text[notright].title
notright_desc = text[notright].description

stem = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def get_top_grams(df_r, df_nr, n):
    cvec = CountVectorizer(preprocessor=my_preprocessor,
                           ngram_range=(n,n), stop_words=cust_stop_words, min_df=5)
    cvec.fit(df_r)
    word_counts = pd.DataFrame(cvec.transform(df_r).todense(),
                       columns=cvec.get_feature_names())
    counts = word_counts.sum(axis=0)
    counts = pd.DataFrame(counts.sort_values(ascending = False),columns=['right']).head(15)
    
    cvec = CountVectorizer(preprocessor=my_preprocessor,
                           ngram_range=(n,n), stop_words=cust_stop_words, min_df=5)
    cvec.fit(df_nr)
    word_counts2 = pd.DataFrame(cvec.transform(df_nr).todense(),
                         columns=cvec.get_feature_names())
    counts2 = word_counts2.sum(axis=0)
    counts2 = pd.DataFrame(counts2.sort_values(ascending = False),columns=['not right']).head(15)
    return counts, counts2

	right
trump	2981
report	569
north	523
korea	502
says	481
new	453
house	419
border	419
kim	397
summit	390
president	386
day	375
police	366
immigration	340
donald	304

nr

	not right
trump	3872
new	1043
says	585
house	471
2018	430
north	414
korea	401
cohen	391
mueller	377
white	350
kim	347
18	339
world	338
like	333
gop	328

Single Grams for Description

r, nr = get_top_grams(right_desc, notright_desc, 1)

	right
trump	3830
president	3216
new	1479
house	1214
said	1205
donald	1111
north	938
state	909
says	846
news	760
year	735
white	734
police	703
tuesday	682
people	632

nr

	not right
trump	4756
president	2799
new	2052
donald	1374
said	1093
house	1039
says	840
people	831
discuss	777
white	768
north	721
michael	710
news	687
day	623
year	584

BiGrams for Title

r, nr = get_top_grams(right_title, notright_title, 2)

	right
north korea	417
donald trump	303
white house	230
kim jong	182
president trump	181
ig report	141
supreme court	129
trump kim	121
judicial activism	104
day liberal	104
liberal judicial	104
new york	100
year old	91
kim summit	82
cartoons day	80

nr

	not right
north korea	321
white house	247
michael cohen	196
kim jong	165
donald trump	160
stormy daniels	113
world cup	111
royal wedding	99
scott pruitt	98
president trump	96
new york	87
korea summit	86
family separation	85
trump kim	83
kanye west	77

BiGrams for Description

r, nr = get_top_grams(right_desc, notright_desc, 2)

r # Seeing some noise here

	right
president trump	1187
donald trump	1088
president donald	932
white house	621
new york	481
kim jong	474
north korea	473
year old	339
united states	314
north korean	313
trump administration	257
wp com	238
content uploads	238
com com	238
wp content	238

nr

	not right
donald trump	1352
president trump	904
president donald	673
white house	640
new york	420
kim jong	414
michael cohen	405
trump administration	401
north korea	386
north korean	254
special counsel	236
robert mueller	233
rudy giuliani	224
united states	223
stormy daniels	213

TriGrams for Title

r, nr = get_top_grams(right_title, notright_title, 3)

r # Mentions of MS 13 Gang and "judicial activism"

	right
liberal judicial activism	104
day liberal judicial	104
north korea summit	75
trump kim summit	66
new york times	27
border patrol agents	26
south china sea	26
trump kim jong	23
judicial activism june	22
cartoons day march	21
judicial activism march	20
ms 13 gang	20
judicial activism april	19
cartoons day april	19
judicial activism february	19

nr
# Interesting: NYT is not one of my sources
# Mentions of Stormy Daniels and immigration policy

	not right
north korea summit	73
trump kim summit	37
family separation policy	29
trump tower meeting	28
quickly catch day	21
catch day news	21
trump legal team	21
trump north korea	20
iran nuclear deal	20
stormy daniels lawyer	19
daily horoscope june	19
trump kim jong	18
mtp daily june	17
santa fe high	17
sarah huckabee sanders	17

TriGrams for Description

r, nr = get_top_grams(right_desc, notright_desc, 3)

r # For some reason "content uploads" is still appearing. Source giveaway. 
# Also image noise data.

	right
president donald trump	926
wp com com	238
wp content uploads	238
com wp content	238
com com wp	238
content uploads 2018	233
jpg fit 1024	218
1024 2c597 ssl	216
fit 1024 2c597	216
dictator kim jong	125
new york times	118
special counsel robert	110
leader kim jong	108
counsel robert mueller	108
new york city	102

nr

	not right
president donald trump	672
north korean leader	158
new york times	155
leader kim jong	149
korean leader kim	133
special counsel robert	125
counsel robert mueller	123
fbi director james	81
director james comey	81
joy reid panel	81
reid panel discuss	74
lawyer michael cohen	69
attorney general jeff	64
general jeff sessions	64
new york city	60

QuadGrams for Title

r, nr = get_top_grams(right_title, notright_title, 4)

r
# Again, lost of "judicial activism" and MS 13 appears 

	right
day liberal judicial activism	104
liberal judicial activism june	22
liberal judicial activism march	20
liberal judicial activism february	19
liberal judicial activism april	19
things caught eye today	11
north korea kim jong	9
fashion notes melania trump	9
santa fe high school	7
new york attorney general	7
public sector union dues	7
ms 13 gang members	7
fashion designer kate spade	7
southern poverty law center	7
trump says north korea	7

nr

	not right
quickly catch day news	21
santa fe high school	15
trump family separation policy	14
prince harry meghan markle	13
new albums heavy rotation	11
new shows netflix stream	10
watch hulu new week	10
watch amazon prime new	10
trump north korea summit	10
amazon prime new week	10
best new shows netflix	10
shows netflix stream right	10
ranking best new shows	10
funniest tweets women week	9
20 funniest tweets women	9

QuadGrams for Description

r, nr = get_top_grams(right_desc, notright_desc, 4)

r
# Very noisy

	right
com wp content uploads	238
com com wp content	238
wp com com wp	238
wp content uploads 2018	233
fit 1024 2c597 ssl	216
jpg fit 1024 2c597	209
special counsel robert mueller	108
north korean dictator kim	94
korean dictator kim jong	94
north korean leader kim	91
korean leader kim jong	90
https i0 wp com	83
i0 wp com com	83
i2 wp com com	78
https i2 wp com	78

nr
# Less noise, but still source giveaways like "Chris Matthews". 

	not right
north korean leader kim	133
korean leader kim jong	132
special counsel robert mueller	123
fbi director james comey	81
joy reid panel discuss	74
attorney general jeff sessions	63
discuss political news day	56
matthews panel guests discuss	56
panel guests discuss political	56
guests discuss political news	56
chris matthews panel guests	56
mika discuss big news	55
joe mika discuss big	55
discuss big news day	55
hayes discusses day political	55

Wordclouds

These provide a nice, fun visualization of words found in right wing and not right wing title text. The larger the word the more it appears.

plt.figure(figsize=(15,8))
wordcloud = WordCloud(font_path='/Library/Fonts/Verdana.ttf',
                      max_words=(100),
                      width=2000, height=1000,
                      relative_scaling = 0.5,
                      background_color='white',
                      colormap='Dark2'
                  
).generate(' '.join(text[right].title))
plt.imshow(wordcloud)
plt.title("Right Wing Title Text", fontsize=24)
plt.axis("off")
plt.show()

png

plt.figure(figsize=(15,8))
wordcloud = WordCloud(font_path='/Library/Fonts/Verdana.ttf',
                      max_words=(100),
                      width=2000, height=1000,
                      relative_scaling = 0.5,
                      background_color='white',
                      colormap='Dark2'
                  
).generate(' '.join(text[notright].title))
plt.imshow(wordcloud)
plt.title("Not Right Wing Title Text", fontsize=24)
plt.axis("off")
plt.show()

png

In the next post I will go over the modeling process and evaluation (the fun part!) of this project.

To be continued…

Written on July 14, 2018