Exploring MTA Tweets

I recently got my hands on a dataset from a colleague containing every tweet from the MTA's official Twitter account from the start of 2017 through July 2018, gathered via Twitter's API. There wasn't really a target to predict, so this was mostly an exercise in exploring data. No massive revelations came out of it, but there were some interesting bits to pull from the process. The repo for this project can be found on GitHub here.

After cleaning the tweets up, I decided to incorporate some unsupervised learning and use Latent Dirichlet Allocation (LDA) to identify topics in the tweets. After comparing results with two, three, and four topics, two made the most sense. Essentially, the tweets fall into a service-update category, built around words like delay, service, incident, resumed, problem, running, allow, additional, signal, mechanical, and an apology category, built around words/n-grams like regret, supervision, thank, inconvenience, matter, report, change, thanks.
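The visualization below assumes a fitted model. As a rough sketch (not the original code), the gensim objects it takes might be built something like this, where tweet_tokens, a list of token lists from the cleaned tweets, is an assumed variable:

# Sketch only: tweet_tokens (a list of token lists from the cleaned tweets) is assumed
from gensim import corpora, models

dictionary = corpora.Dictionary(tweet_tokens)                     # token -> id mapping
corpus = [dictionary.doc2bow(tokens) for tokens in tweet_tokens]  # bag-of-words per tweet
ldamodel = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)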

import pyLDAvis
import pyLDAvis.gensim

# gensim prepares interactive LDA model
# two general categories: service updates and apologies/explanations for those updates
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)

Next we decided to create a target to predict. Using regex, we identified the MTA employee who authored each tweet, indicated by a caret and the author's initials (e.g. ^JG). We then used some NLP tools to predict the author from the tweet text (with the signature removed). A simple multiclass Logistic Regression model reached a surprisingly high 84% accuracy on test data, which is interesting given how subtle the differences in language between authors can be. There isn't a whole lot of practical value in these findings, but it's good practice and interesting to see how these models work.
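As a small illustration (not the original code, and the sample tweet here is made up), the signature can be pulled out and stripped with a pattern like this:

import re

# A caret followed by capital letters marks the author, e.g. '^JG'
sig_pattern = re.compile(r'\^[A-Z]+')
tweet = "Good morning, the earlier delay at 59 St has cleared. Thanks for your patience. ^JG"
sig = sig_pattern.search(tweet).group()     # '^JG' -> becomes the prediction target
clean = sig_pattern.sub('', tweet).strip()  # tweet text with the signature removed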

# The ten authors we're predicting
tweeters[:10]
['^JG', '^JP', '^BD', '^KF', '^GES', '^DG', '^JZ', '^HKD', '^RT', '^EE']

# Baseline accuracy (always guessing the most common author, ^JG) is 20.28%
y.value_counts()
^JG     15382
^JP     13759
^BD     11511
^KF      6821
^GES     5762
^DG      5206
^JZ      4714
^HKD     4527
^RT      4442
^EE      3693
Name: sig, dtype: int64
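The modeling setup itself isn't shown above. A hedged sketch of the Logistic Regression fit, assuming a CountVectorizer (the cv used later) over the cleaned tweet text, with clean_tweets as an assumed variable name and the n-gram range guessed from the features shown further down:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# clean_tweets: tweet text with signatures removed (assumed); y: the '^XX' labels above
cv = CountVectorizer(ngram_range=(1, 4))   # n-gram range guessed from the feature names below
X = cv.fit_transform(clean_tweets)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)                   # ~0.84 on test data in this write-up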

The features below had the highest feature importances in a Random Forest Classifier model, which reached 80% accuracy on test data.
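The Random Forest fit isn't shown either; a minimal sketch, reusing the assumed cv / X_train / y_train from the sketch above:

from sklearn.ensemble import RandomForestClassifier

# Same assumed features as the Logistic Regression sketch above
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)     # ~0.80 on test data in this write-up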

top_feat_importances = pd.DataFrame(list(zip(rf.feature_importances_, cv.get_feature_names())),
             columns=['f_importance','feature']).sort_values('f_importance', ascending=False).head(20)
top_feat_importances
f_importance feature
25933 0.006591 good evening
64959 0.006346 time
61677 0.005757 supervision
63703 0.005720 thank
28848 0.005119 hi
37829 0.005099 location
63199 0.004789 tell
50852 0.004534 regrets
40539 0.004356 morning
50183 0.004141 reference
25472 0.004011 good
59799 0.003989 station
21404 0.003985 en
48461 0.003833 proceeding
50472 0.003588 referring
26551 0.003408 good morning
64733 0.003396 thanks
10138 0.003297 bound
42163 0.003222 mta nyc custhelp com
49950 0.003149 ref

Looking at the highest value among the predicted class probabilities for each tweet, we can get a sense of how sure the model is of its predictions. The Logistic Regression model, which was also more accurate, has a distribution of maximum probabilities that sits noticeably higher than the Random Forest model's, indicating it is more confident in its predictions.
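The max_probs and max_lr_probs values plotted below were presumably taken straight from predict_proba, something like this sketch:

# Largest predicted class probability per tweet = a rough confidence score
max_probs = rf.predict_proba(X_test).max(axis=1)       # Random Forest
max_lr_probs = lr.predict_proba(X_test).max(axis=1)    # Logistic Regression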

import matplotlib.pyplot as plt

# How sure is the model about its predictions?
plt.figure(figsize=(12,7))
plt.title("Distribution of Highest Probability Values, Random Forest Model", fontsize=18)
plt.xlabel("Probability")
plt.ylabel("Number of Tweets")
plt.hist(max_probs, color='navy', bins=20);

[Figure: histogram of highest probability values, Random Forest model]

# Most probabilities are much higher, model is more sure of its predictions
plt.figure(figsize=(12,7))
plt.title("Distribution of Highest Probability Values, LogReg Model", fontsize=18)
plt.xlabel("Probability")
plt.ylabel("Number of Tweets")
plt.hist(max_lr_probs, color='navy', bins=30);

[Figure: histogram of highest probability values, Logistic Regression model]

To find which features mattered most overall, I built a dataframe of the Logistic Regression coefficients, one row per feature and one column per author, along with the sum and mean of the absolute values of each feature's coefficients. The following plots show which features were most important overall and by author.
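The construction of coef_df isn't shown; a rough sketch using the assumed lr and cv from earlier (the column ordering and the exact abs_mean values may differ from what's displayed below):

import pandas as pd

# One row per feature, one (lower-cased) column per author, from the LogReg coefficients
coef_df = pd.DataFrame(lr.coef_.T,
                       index=cv.get_feature_names(),
                       columns=[c.strip('^').lower() for c in lr.classes_])
coef_df.index.name = 'feature'
abs_vals = coef_df.abs()
coef_df['abs_sum'] = abs_vals.sum(axis=1)
coef_df['abs_mean'] = abs_vals.mean(axis=1)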

abs_coef_df = abs(coef_df)
abs_coef_df.head()
jg jp bd kf ges dg jz hkd rt ee abs_sum abs_mean
feature
00 0.172850 0.287960 0.084425 0.170298 0.156555 0.305762 0.049209 0.008900 0.729685 0.101591 2.067236 0.375861
00 info 0.119540 0.000602 0.005627 0.000786 0.035574 0.005080 0.003147 0.000155 0.172403 0.001891 0.344806 0.062692
00 info web 0.119540 0.000602 0.005627 0.000786 0.035574 0.005080 0.003147 0.000155 0.172403 0.001891 0.344806 0.062692
00 info web mta 0.119540 0.000602 0.005627 0.000786 0.035574 0.005080 0.003147 0.000155 0.172403 0.001891 0.344806 0.062692
00 pm 0.035549 0.198963 0.054368 0.495864 0.029549 0.461121 0.302342 0.006940 0.004700 0.030295 1.619691 0.294489
# Exploring which features were most important across all authors

plt.figure(figsize=(12,8))
plt.title("Features with Largest Coefficient Abs Sums")
abs_coef_df['abs_mean'].sort_values().tail(20).plot(kind='bar')

[Figure: bar chart of features with the largest mean absolute coefficients]

# Coefficients for the top features, by author, with abs_mean kept as an indication
# of overall importance -- for example, JP doesn't use the word 'apologies' very often...
(coef_df[[col for col in coef_df.columns if col != 'abs_sum']]
    .sort_values('abs_mean', ascending=False)
    .head(8)
    .plot(kind='barh', figsize=(18, 14)))

[Figure: horizontal bar chart of top feature coefficients by author]

Written on August 29, 2018