Exploring MTA Tweets

A colleague recently gave me a dataset of every tweet from the MTA's official Twitter account for all of 2017 and through July 2018, which he gathered from Twitter's API. There wasn't really a target to predict, so this was mostly an exercise in exploring data. There were no massive revelations, but a few interesting bits came out of the process. The repo for this project can be found on GitHub here.

After cleaning up the tweets, I used an unsupervised learning technique, Latent Dirichlet Allocation (LDA), to identify topics in them. After looking at the results with two, three, and four topics, two made the most sense. Essentially, the tweets fall into a service update category, with words like delay, service, incident, resumed, problem, running, allow, additional, signal, and mechanical, and an apology category, with words and n-grams like regret, supervision, thank, inconvenience, matter, report, change, and thanks.

# pyLDAvis prepares an interactive visualization of the gensim LDA model
# two general categories: service updates and apologies/explanations for those updates
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)

Next, we decided to create a target to predict. Using regex, we identified the MTA employee who wrote each tweet, signified by a caret (^) and the author's initials. We then used NLP tools to predict the author from the tweet text, with the signature removed. A simple multiclass Logistic Regression model reached 84% accuracy on test data, which was surprisingly high; it was interesting that the model could pick up on subtle differences in language between authors. There's not a whole lot of practical value in these findings, but it was good practice and an interesting look at how these models work.
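
The exact regex isn't shown here, so the following is only a sketch of the signature step. It assumes the signature is a caret followed by two or three capital letters at the very end of the tweet; `SIG_PATTERN`, `split_signature`, and the example tweets are hypothetical.

```python
import re

# Assumed signature format: a caret followed by 2-3 capital letters at the
# very end of the tweet, e.g. "... ^JG". The real pattern used may differ.
SIG_PATTERN = re.compile(r"\^([A-Z]{2,3})\s*$")

def split_signature(tweet):
    """Return (text_without_signature, signature), or (tweet, None) if unsigned."""
    match = SIG_PATTERN.search(tweet)
    if match is None:
        return tweet, None
    text = tweet[:match.start()].rstrip()
    return text, "^" + match.group(1)

# Made-up example tweets, not taken from the dataset
examples = [
    "We regret the inconvenience. Service has resumed. ^JG",
    "Good evening. Which station are you at? ^HKD",
    "Service change notice with no signature",
]
results = [split_signature(t) for t in examples]
for text, sig in results:
    print(sig, "|", text)
```

Removing the signature before vectorizing matters: if the initials stayed in the text, the model could simply read the answer off the tweet.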

tweeters[:10]
# Top ten authors we were predicting

['^JG', '^JP', '^BD', '^KF', '^GES', '^DG', '^JZ', '^HKD', '^RT', '^EE']

y.value_counts()
# Baseline accuracy (always guessing the most frequent author, ^JG) is about 20.3%

^JG     15382
^JP     13759
^BD     11511
^KF      6821
^GES     5762
^DG      5206
^JZ      4714
^HKD     4527
^RT      4442
^EE      3693
Name: sig, dtype: int64
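
The baseline can be checked directly from the counts above: it is the share of tweets written by the most frequent author, which is the accuracy you would get by always guessing that author. A small sketch using only the numbers printed above:

```python
from collections import Counter

# Tweet counts per author, copied from the value_counts() output above
counts = Counter({
    "^JG": 15382, "^JP": 13759, "^BD": 11511, "^KF": 6821, "^GES": 5762,
    "^DG": 5206, "^JZ": 4714, "^HKD": 4527, "^RT": 4442, "^EE": 3693,
})

total = sum(counts.values())
top_author, top_count = counts.most_common(1)[0]

# Accuracy of a model that always guesses the most frequent author
baseline = top_count / total
print(f"{top_author}: {top_count} of {total} tweets, baseline {baseline:.2%}")
```

Any model worth keeping needs to beat this figure comfortably, which both models in this post do.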

These were the features with the highest importances from a Random Forest Classifier model, which reached 80% accuracy on test data.

# get_feature_names_out() requires scikit-learn 1.0+; older versions used get_feature_names()
top_feat_importances = pd.DataFrame(list(zip(rf.feature_importances_, cv.get_feature_names_out())),
             columns=['f_importance','feature']).sort_values('f_importance', ascending=False).head(20)
top_feat_importances

	f_importance	feature
25933	0.006591	good evening
64959	0.006346	time
61677	0.005757	supervision
63703	0.005720	thank
28848	0.005119	hi
37829	0.005099	location
63199	0.004789	tell
50852	0.004534	regrets
40539	0.004356	morning
50183	0.004141	reference
25472	0.004011	good
59799	0.003989	station
21404	0.003985	en
48461	0.003833	proceeding
50472	0.003588	referring
26551	0.003408	good morning
64733	0.003396	thanks
10138	0.003297	bound
42163	0.003222	mta nyc custhelp com
49950	0.003149	ref

Looking at the highest predicted probability for each tweet gives a sense of how sure the model is of its predictions. The Logistic Regression model, which was also more accurate, has a distribution of probabilities that sits higher, indicating it is more confident in its predictions than the Random Forest model.
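
The `max_probs` and `max_lr_probs` arrays plotted here are presumably the row-wise maximum of each model's `predict_proba` output, although that step isn't shown. A small sketch with a made-up probability matrix standing in for `predict_proba`:

```python
import numpy as np

# Stand-in for model.predict_proba(X_test): one row per tweet, one column
# per author, each row summing to 1. These numbers are made up.
probs = np.array([
    [0.90, 0.05, 0.05],   # confident prediction
    [0.40, 0.35, 0.25],   # unsure prediction
    [0.10, 0.85, 0.05],
])

# Highest class probability per tweet: the model's confidence in its top guess
max_probs = probs.max(axis=1)

# The column index of that maximum is the predicted class
predicted = probs.argmax(axis=1)

print(max_probs)
print(predicted)
```

With ten authors, a completely unsure model would put about 0.1 on every class, so the closer the histogram sits to 1, the more confident the model is.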

# How sure is the model about its predictions?
plt.figure(figsize=(12,7))
plt.title("Distribution of Highest Probability Values, Random Forest Model", fontsize=18)
plt.xlabel("Probability")
plt.ylabel("Number of Tweets")
plt.hist(max_probs, color='navy', bins=20);

[Figure: histogram of highest predicted probability per tweet, Random Forest model]

# Most probabilities are much higher, model is more sure of its predictions
plt.figure(figsize=(12,7))
plt.title("Distribution of Highest Probability Values, LogReg Model", fontsize=18)
plt.xlabel("Probability")
plt.ylabel("Number of Tweets")
plt.hist(max_lr_probs, color='navy', bins=30);

[Figure: histogram of highest predicted probability per tweet, Logistic Regression model]

To find which features were most important overall, I created a dataframe of each coefficient by author (from the Logistic Regression model), with the sum and mean of the absolute values of the coefficients. The following plots show which features were most important overall and by author.

abs_coef_df = abs(coef_df)
abs_coef_df.head()

	jg	jp	bd	kf	ges	dg	jz	hkd	rt	ee	abs_sum	abs_mean
feature
00	0.172850	0.287960	0.084425	0.170298	0.156555	0.305762	0.049209	0.008900	0.729685	0.101591	2.067236	0.375861
00 info	0.119540	0.000602	0.005627	0.000786	0.035574	0.005080	0.003147	0.000155	0.172403	0.001891	0.344806	0.062692
00 info web	0.119540	0.000602	0.005627	0.000786	0.035574	0.005080	0.003147	0.000155	0.172403	0.001891	0.344806	0.062692
00 info web mta	0.119540	0.000602	0.005627	0.000786	0.035574	0.005080	0.003147	0.000155	0.172403	0.001891	0.344806	0.062692
00 pm	0.035549	0.198963	0.054368	0.495864	0.029549	0.461121	0.302342	0.006940	0.004700	0.030295	1.619691	0.294489
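
The construction of `coef_df` isn't shown, so this is a sketch of how the summary columns might be derived from a fitted model's `coef_` matrix, using made-up values and names. In the table above, each `abs_mean` equals 2 × `abs_sum` / 11 (for example, 2 × 2.067236 / 11 ≈ 0.375861), which suggests the mean was taken over eleven columns, including `abs_sum` itself. The sketch takes the mean over the author columns only; since the two versions differ by a constant factor, the ranking of features is the same either way.

```python
import numpy as np
import pandas as pd

# Stand-in for lr.coef_ from a fitted multiclass LogisticRegression:
# shape (n_authors, n_features). Values and names here are made up.
authors = ["jg", "jp", "bd"]
features = ["good evening", "regrets", "thanks"]
coefs = np.array([
    [ 0.8, -0.1,  0.3],
    [-0.5,  0.9, -0.2],
    [ 0.1, -0.4,  0.6],
])

# One row per feature, one column per author
coef_df = pd.DataFrame(coefs.T, index=features, columns=authors)
coef_df.index.name = "feature"

# Summarise over the author columns only, so the new summary columns
# never feed into each other
abs_coefs = coef_df[authors].abs()
coef_df["abs_sum"] = abs_coefs.sum(axis=1)
coef_df["abs_mean"] = abs_coefs.mean(axis=1)

print(coef_df.sort_values("abs_mean", ascending=False))
```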

# Exploring which features were most important across all authors

plt.figure(figsize=(12,8))
plt.title("Features with Largest Mean Absolute Coefficients")
abs_coef_df['abs_mean'].sort_values().tail(20).plot(kind='bar')

[Figure: bar chart of the 20 features with the largest abs_mean values]

# Coefficients of the top features for each author, with abs_mean shown as an indication of overall importance
# For example, JP doesn't use the word 'apologies' very often...
top_features = coef_df.drop(columns='abs_sum').sort_values('abs_mean', ascending=False).head(8)
top_features.plot(kind='barh', figsize=(18, 14))

[Figure: horizontal bar chart of per-author coefficients for the 8 features with the largest abs_mean]

Written on August 29, 2018