Exploring MTA Tweets

A colleague recently gave me a dataset of every tweet from the MTA’s official Twitter account for all of 2017 and up to July of 2018, which he gathered from Twitter’s API. There wasn’t really a target to predict, so this was mostly an exercise in exploring data. I didn’t uncover any massive revelations, but the process turned up some interesting bits. The repo for this project can be found on GitHub here.

Capstone - Using an RNN Model with Keras for Text Classification

I was curious to see how a Recurrent Neural Network model would perform on the text data from my Classifying Political News Media Text Capstone project, since RNNs are known to work very well for text classification. I used this LSTM model on Kaggle as a template. After preprocessing, the text is converted to a sequence of integers, one per word, which is somewhat like how a Hashing Vectorizer works. The sequences are then padded and converted into a NumPy array so that every row has the same length and the input has the right shape for the RNN.
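To make the sequence-and-padding step concrete, here is a minimal plain-Python sketch of the idea. It is not the Keras API or the code from the project: the helper names (`build_word_index`, `to_sequences`, `pad_left`), the whitespace tokenization, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer rank (1 = most frequent); 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by count, with ties kept in first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Replace each known word with its integer index; unknown words are skipped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]


def pad_left(sequences, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items, so every row has length maxlen."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


# Made-up sample sentences, standing in for preprocessed article text
texts = ["the senate passed the bill", "the house voted"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)
matrix = pad_left(sequences, maxlen=6)

for row in matrix:
    print(row)
```

Every row of `matrix` has the same length, so it could be wrapped in a NumPy array and fed to an embedding layer. Reserving 0 for padding keeps the filler distinguishable from real words.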

Capstone Part 2 - Classifying Political News Media Text with Natural Language Processing

Now for the modeling and evaluation part of the project. Again, this was an iterative process where, using various models, I found the strongest features, checked for source giveaways or noise, and then went back and adjusted the custom list of stop words and stop-grams.

Capstone Part 1 - Classifying Political News Media Text with Natural Language Processing

For my final capstone project as part of the Data Science Immersive program at General Assembly NYC, I decided to create a political text classification model using Natural Language Processing. As with all data science projects, this was a non-linear, iterative process that required extensive data cleaning. I learned a great deal about NLP in the process, and I look forward to further projects on this topic. The full repo for this project can be found here.

Project 3 - Predicting Number of Comments with Reddit's API

In this project, I aimed to identify which features predict whether or not a post on Reddit makes it to the “hot” subreddit, which is a page for posts with high user interaction, as measured by the number of comments on the post. To gather the data, I scraped JSON post data from Reddit’s API and saved it to a .csv file. I set the target variable to a binarized measure of the number of comments: above the mean number of comments or below it.
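The binarized target can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `binarize_by_mean` and the comment counts are made up, and a post exactly at the mean is treated as "below" here, which the original write-up does not specify.

```python
from statistics import mean


def binarize_by_mean(comment_counts):
    """Return (labels, threshold): label 1 if a count is above the mean, else 0."""
    threshold = mean(comment_counts)
    labels = [1 if count > threshold else 0 for count in comment_counts]
    return labels, threshold


# Made-up comment counts for five posts
num_comments = [3, 10, 250, 0, 42]
labels, threshold = binarize_by_mean(num_comments)

print(threshold)
print(labels)
```

In this toy data a single heavily commented post pulls the mean well above the other counts, so only one of the five posts lands in the positive class. When the counts are skewed like this, a mean threshold yields imbalanced classes.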

Project 2 - Ames Housing Data

For this project (full repo here) I created a model to predict price based on a dataset on the housing stock of Ames, Iowa, a small Midwestern town north of Des Moines that is home to Iowa State University. There were 81 columns, split roughly evenly between categorical and numerical data. Some of the columns, such as “Overall Quality,” could be initially assumed to have a strong correlation with price, while other more obscure qualities such as “Garage Type” and “Masonry Veneer Type” occupied most of the columns. I only had a limited amount of time to complete this project, so efficiency was critical. My goal was to create a model that accurately predicted price without combing over every minute detail.

I have a blog now.

Welcome to my new blog about data science. I will be posting about projects I have worked on and the experience of taking the Data Science Immersive course at General Assembly in NYC. The sequence of my work should demonstrate the extremely fast-paced, immersive nature of this course and the wide breadth of topics it covered.