Case Study 6.1 - NYC Taxi Trips

Helpful Links

Introduction
- Logistics
- Setup
During
Conclusion

Introduction

The purpose of this document is to show you how to complete the second and final graded case study of the course. The goal of this case study is to experiment with various features and models for predicting the duration of taxi trips in New York City.

You will gain practical experience in:

loading a large dataset
creating various feature objects and injecting them into different models
evaluating the predictions of those models
utilizing tools like Deep Feature Synthesis and feature transformations
understanding metrics like feature importances, which let you inspect a model after training

Logistics

Here is how this case study will work.

You can follow along with the instructions in this document. They correspond to the provided code and the activity template you will fill out to complete the case study. We have blank sections in the code where you can write your responses to our questions. These sections will be marked RED so you won’t miss them. Some responses will include code, while others will not. We do our best to make this clear where needed.

Once you complete the case study, you will submit your work to the online MITx platform. From there, the system will assign you 2 peers to review. You will grade their submissions according to the provided evaluation rubric. We recommend you check out this rubric now, so you know how you will be graded.

Here are the key dates for this case study:

By 23:59 UTC, Sunday, June 24, 2018. Submit your case study to MITx. Be sure to convert to your local time zone.
By 23:59 UTC, Monday, June 25, 2018. Complete the peer reviews for 2 of your peers. (If the system does not give you 2 reviews, see the Next Steps section below for what to do. However, it is critical that you complete any reviews you are assigned.)

Please note that this second date is the end of the course. See below for important notes about deadline extensions.

Setup

Follow these steps to get set up with the code and submission template for the case study.

If you have not already done so, follow the instructions for getting started with Microsoft Azure Notebooks. You will need to create and confirm your account there before proceeding to the next step.
Clone the case study library, which can be found here. Go to that page and click the Clone button to copy the code to your personal library.
Select your experience level. We have support for both beginner and advanced users of Python in this case study. You should select either beginner_python.ipynb or advanced_python.ipynb, depending on your experience level.
Open the notebook file you selected in the previous step. This will start up the Jupyter notebook. If you need a refresher on how to work with a notebook, we have information for you here.
As described above, this notebook file will be the template you use to record your answers. So, before you do anything else, be sure to fill in the information in the Identification Information section at the top of the notebook.
Next, go to the Setup section and follow the instructions. This will install all the packages you need to complete the remainder of the case study.

During

Great! Now, you should be all set up and ready to complete this case study. As stated above, we are going to be creating custom features to understand NYC taxi trip patterns, based on an existing dataset. This document will walk you through how to work with this data and framework to arrive at some meaningful results!

Remember our tip from our first Microsoft Azure Notebooks tutorial: save your notebook often to avoid losing your work!

Import

One of the first steps in any data science task is importing the necessary tools you will use.

See the Import section of your notebook and follow the instructions to import the required tools. We will be using featuretools in this case study as our main feature extraction and synthesis engine.

Data

The trips table has the following fields, which will be important for you to understand moving forward.

id - uniquely identifies the trip
vendor_id - the taxi cab company - in our case study we have data from three different cab companies
pickup_datetime - the time stamp for pickup
dropoff_datetime - the time stamp for drop-off
passenger_count - the number of passengers for the trip
trip_distance - total distance of the trip in miles
pickup_longitude - the longitude for pickup
pickup_latitude - the latitude for pickup
dropoff_longitude - the longitude of dropoff
dropoff_latitude - the latitude of dropoff
payment_type - a numeric code signifying how the passenger paid for the trip
- 1=Credit card
- 2=Cash
- 3=No charge
- 4=Dispute
- 5=Unknown
- 6=Voided
trip_duration - this is the duration we would like to predict using other fields
pickup_neighborhood - a one- or two-letter ID of the neighborhood where the trip started
dropoff_neighborhood - a one- or two-letter ID of the neighborhood where the trip ended

See the Data section of your notebook and follow the instructions to load the NYC taxi trip data.
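To make the payment_type codes above concrete, here is a minimal plain-Python sketch of a lookup table for them. The helper name describe_payment and the example record are our own, made up for illustration; they are not part of the notebook or the dataset.

```python
# Payment codes exactly as listed in the field descriptions above.
PAYMENT_TYPES = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided",
}


def describe_payment(code):
    """Return the label for a payment_type code, or 'Unrecognized' if it is not listed."""
    return PAYMENT_TYPES.get(code, "Unrecognized")


# A made-up trip record for illustration only (not taken from the dataset).
example_trip = {"id": 0, "passenger_count": 2, "payment_type": 2}
payment_label = describe_payment(example_trip["payment_type"])
print(payment_label)
```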

Entities and Relationships

An entity is a component of our dataset. A relationship is some connection between 2 or more entities.

We have the following 3 entities:

trips
pickup_neighborhoods
dropoff_neighborhoods

And we also have the following 2 relationships:

pickup_neighborhoods -> trips: One neighborhood can have multiple trips that start in it. This means pickup_neighborhoods is the parent_entity and trips is the child_entity.
dropoff_neighborhoods -> trips: One neighborhood can have multiple trips that end in it. This means dropoff_neighborhoods is the parent_entity and trips is the child_entity.

See the Entities and Relationships section of your notebook and follow the instructions to create these objects, which we will use later in our feature analysis.
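If the parent/child idea feels abstract, the following plain-Python sketch may help. It does not use featuretools; it only shows, on three made-up trips, how one neighborhood (the parent) can own several trips (the children). The function name children_by_parent is hypothetical.

```python
from collections import defaultdict

# Made-up rows for illustration; the real tables are loaded in the notebook.
trips = [
    {"id": 1, "pickup_neighborhood": "A", "dropoff_neighborhood": "B"},
    {"id": 2, "pickup_neighborhood": "A", "dropoff_neighborhood": "C"},
    {"id": 3, "pickup_neighborhood": "B", "dropoff_neighborhood": "A"},
]


def children_by_parent(rows, parent_key):
    """Group child rows (trips) under the parent id stored in parent_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[parent_key]].append(row["id"])
    return dict(groups)


# pickup_neighborhoods -> trips
trips_by_pickup = children_by_parent(trips, "pickup_neighborhood")
# dropoff_neighborhoods -> trips
trips_by_dropoff = children_by_parent(trips, "dropoff_neighborhood")

print(trips_by_pickup)
print(trips_by_dropoff)
```

Each trip appears under exactly one pickup neighborhood and exactly one dropoff neighborhood, which is why trips is the child in both relationships.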

Transform Primitives

Instead of manually creating features, such as “month of pickup datetime”, we can let Deep Feature Synthesis (DFS) come up with them automatically. It does this by:

interpreting the variable types of the columns, e.g., categorical, numeric, and others
matching the columns to the primitives that can be applied to their variable types
creating features based on these matches

Features fall into two major categories: transform and aggregate. In featuretools, we can create transform features by specifying transform primitives. Below we specify a transform primitive called weekend. This is how it works:

It can be applied to any datetime column in the data.
For each entry in the column, it determines if it is a weekend and returns a boolean.

In this specific data, there are two datetime columns: pickup_datetime and dropoff_datetime. The tool automatically creates features using the primitive and these two columns.

See the Transform Primitives section of your notebook and follow the instructions to create this new weekend feature.
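As a rough illustration of what a weekend-style transform computes, here is a plain-Python sketch applied to both datetime columns of one made-up trip. This is not the featuretools implementation, and the feature names it produces are our own; featuretools generates and names these features for you.

```python
from datetime import datetime


def is_weekend(timestamp):
    """Return True if the timestamp falls on a Saturday or Sunday."""
    # datetime.weekday() numbers Monday as 0, so Saturday is 5 and Sunday is 6.
    return timestamp.weekday() >= 5


# A made-up trip that starts late on a Sunday and ends early on a Monday.
trip = {
    "pickup_datetime": datetime(2016, 1, 3, 23, 50),
    "dropoff_datetime": datetime(2016, 1, 4, 0, 10),
}

# One new boolean feature per datetime column.
weekend_features = {
    f"weekend({column})": is_weekend(value) for column, value in trip.items()
}
print(weekend_features)
```

Because the primitive is applied to each datetime column separately, the pickup and dropoff versions of the feature can disagree, as they do for this trip.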

First Model

Now, we can build a model to predict our target variable given our new feature matrix. To build a model, we:

Separate the data into a portion for training (75% in this case) and a portion for testing
Get the log of the trip duration so that a more linear relationship can be found
Train a model using a GradientBoostingRegressor

See the First Model section of your notebook and follow the instructions to create, train and evaluate our first model of the case study.
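The first two steps above can be sketched in plain Python as follows. This is only an illustration on made-up durations: the helper train_test_split_rows is hypothetical, the notebook's own splitting code may differ, and training the GradientBoostingRegressor itself is not shown here.

```python
import math
import random


def train_test_split_rows(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of rows and split it into (train, test) portions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up trip durations (assumed to be positive), for illustration only.
durations = [300, 420, 600, 900, 1200, 1500, 1800, 3600]
train, test = train_test_split_rows(durations)

# The model is trained on the log of the target ...
log_train = [math.log(d) for d in train]
# ... so exp() is needed to read a value back in the original units.
recovered = [math.exp(v) for v in log_train]

print(len(train), len(test))
```

Taking the log compresses the long right tail of the durations, so a few very long trips do not dominate the error the model tries to minimize.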

More Transform Primitives

Now, we can add Minute, Hour, Week, Month and Weekday primitives. All of these transform primitives apply to datetime columns.

See the More Transform Primitives section of your notebook and follow the instructions to create these new features and evaluate the new model performance.
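To give a feel for the values these primitives extract, here is a plain-Python sketch that pulls the same five parts out of a single made-up timestamp. It is not the featuretools code, and the exact conventions (for example, using the ISO week number for Week, or counting weekdays from Monday as 0) are assumptions on our part.

```python
from datetime import datetime


def datetime_parts(timestamp):
    """Extract the parts of a timestamp that the five primitives describe."""
    return {
        "minute": timestamp.minute,
        "hour": timestamp.hour,
        # ISO week number; an assumption about how "week" is defined.
        "week": timestamp.isocalendar()[1],
        "month": timestamp.month,
        # Monday is 0 and Sunday is 6.
        "weekday": timestamp.weekday(),
    }


# A made-up pickup time: Monday, January 4, 2016 at 08:30.
pickup = datetime(2016, 1, 4, 8, 30)
parts = datetime_parts(pickup)
print(parts)
```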

Aggregation Primitives

As described above, the second type of feature we consider here is an aggregate feature, where we compute some function (such as a mean or a count) over the values of a column across many rows and include the result in our feature matrix.

Let’s add some of these aggregation primitives. These primitives will generate features for the parent entities pickup_neighborhoods and dropoff_neighborhoods, and then add them to the trips entity, which is the entity for which we are trying to make predictions.

See the Aggregation Primitives section of your notebook and follow the instructions to create these new features and evaluate the new model performance.
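The two-step idea described above (aggregate per parent, then attach the result to each child) can be sketched in plain Python. The choice of count and mean, the use of trip_distance, and the feature names are all assumptions made for illustration; the notebook specifies the actual aggregation primitives, and featuretools does this work for you.

```python
from collections import defaultdict
from statistics import mean

# Made-up trips for illustration only.
trips = [
    {"id": 1, "pickup_neighborhood": "A", "trip_distance": 1.5},
    {"id": 2, "pickup_neighborhood": "A", "trip_distance": 2.5},
    {"id": 3, "pickup_neighborhood": "B", "trip_distance": 4.0},
]

# Step 1: aggregate the child rows (trips) for each parent (pickup neighborhood).
distances_by_pickup = defaultdict(list)
for trip in trips:
    distances_by_pickup[trip["pickup_neighborhood"]].append(trip["trip_distance"])

neighborhood_features = {
    hood: {"count_trips": len(values), "mean_trip_distance": mean(values)}
    for hood, values in distances_by_pickup.items()
}

# Step 2: copy each parent's aggregates back onto its trips.
feature_rows = [
    {"id": trip["id"], **neighborhood_features[trip["pickup_neighborhood"]]}
    for trip in trips
]
print(feature_rows)
```

Trips 1 and 2 end up with identical neighborhood features because they share a pickup neighborhood; that shared context is what an aggregate feature adds to each trip.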

Evaluate on Test Data

Now, we want to actually predict the trip durations of our test set. See the Evaluate on Test Data section of your notebook and follow the instructions to make these predictions and plot the resulting distribution.

Feature Importance

Finally, let’s look at how important each feature was for the model. If you want to learn more about how these “importance values” are computed, you can read more here.

See the Feature Importance section of your notebook and follow the instructions to get these importance values and analyze them a bit further.

Conclusion

We really hope you enjoyed completing this case study and feel more comfortable about some of the learning objectives listed above.

There are just a few more things you need to know before you are done.

Submitting

Follow these instructions to submit your responses to the MITx platform.

Export your work. In your notebook file on Microsoft Azure Notebooks, go to File > Download As then select HTML.
Convert your file to a PDF. You can use this tool to do this. The MITx platform doesn’t allow HTML uploads, so we must take this in-between step.
Name your file correctly. We ask that you name your file in the following format:

61_LastName_FirstName_mitxprousername.pdf

For example, if you are Bob Smith and your MITx username is bobs123, then you should save your file with this name:

61_Smith_Bob_bobs123.pdf

This makes it easier for other students and course TAs to identify your work and make sure you receive as much credit as possible.
Upload your file. Go to the Case Study page (see Helpful Links above), then click the right tab. On the page that opens:
- You can leave the Your Response field blank.
- Click the Choose Files button and select the file from your computer. This is the file you named in the previous step.
- In the “Describe” text box, enter a description that exactly matches the name of your file (e.g., 61_Smith_Bob_bobs123.pdf).
- Click the Save your progress button.
- Then, click the Upload Files button. If you have any issues with this, fill out the Issue Submission form at the bottom of the page.
- You can re-upload as many times as you wish before you submit. See the next bullet on how to submit.
- Once you are confident with the last file you have uploaded, press the blue button that says Submit your response and move on to the next step. If you have any issues with this, fill out the Issue Submission form at the bottom of the page.

Next Steps

Remember - after you submit your own work, be prepared for the system to assign you peers to review.

If the system does not give you any peers to review right away, we ask that you check back to the site again to see if you have any new reviews. It may take a few hours for a new review to come in.

FAQ

Here are some questions we see from students often and our answers to those questions. This list is subject to change.

Q: I don’t know anything about Python, and I would prefer to use R. Is there an R version of the code I can use?

A: Unfortunately, at this time, we have not found an R library that achieves the same sort of feature extraction and synthesis that featuretools does. Thus, we only have a Python version.

However, we are confident that students of all coding backgrounds should be able to complete this case study. If you need some more introductory Python resources, we have some for you!

Q: The code is taking a really long time to run (over 5 minutes for 1 cell). What should I do?

A: We would recommend restarting the notebook here and then clicking Kernel > Restart and Run All Cells.

If this still doesn’t fix things, see the troubleshooting section below for help.

Q: I re-open the notebook after not working on it for a while, try to run some code, then get Python errors. What should I do?

A: This is likely because the notebook has gone stale. The best solution would be to re-run all the cells in the notebook up to your current cell. You can easily do this by clicking Cell > Run All.

If this still doesn’t solve the error you are getting, see the troubleshooting section below for help.

Q: I am getting an error like this. What do I do?

~/anaconda3_501/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in calculate_feature_matrix(features, entityset, cutoff_time, instance_ids, entities, relationships, cutoff_time_in_index, training_window, approximate, save_progress, verbose, chunk_size, profile)
    101 entityset = EntitySet("entityset", entities, relationships)
    102
--> 103 target_entity = entityset[features[0].entity.id]
    104 pass_columns = []
    105

TypeError: 'NoneType' object is not subscriptable

A: This was an error on our end! Sorry about that. Please see this thread for a solution to your error.

Q: I can’t get my code to work. I keep getting a package error or some other error I don’t know how to solve. What do I do?

A: See the troubleshooting section below for help.

Q: The numbers I am getting on the case study vary slightly from the ones other people are getting. Is this normal?

A: Yes, this is completely normal! This has to do with the way the data is randomly split, or the way the model trains based on the data. As long as your numbers are reasonably close to the values of others, you should be all set.

Q: It’s getting really close to the submission deadline and I am struggling to make progress on the assignment. Can I get an extension? What should I do?

A: Unfortunately, due to the nature of the course, it is our official policy that no extensions will be granted on graded assignments. This is especially important as this case study is due at the same time as the course closure on June 25.

However, we encourage you to submit what you have so far. Then, if you are concerned about your grade, you can contact us privately.

Troubleshooting

Having trouble getting your code to run? Questions about a particular part of the case study? Not able to submit your assignment to MITx? Your course TAs are here to help!

Please follow these steps if you get stuck.

Make sure you followed all of the instructions exactly as specified, and checked the FAQ section of this page.
Search the Discussion Forum to see if your question has already been answered.
If your question is new and unanswered, post it in the Module 6: Case Studies section of the Discussion Forum, and it will be answered as soon as possible!