The purpose of this document is to show you how to complete the second and final graded case study of the course. The goal of this case study is to experiment with various features and models for predicting the duration of taxi trips in New York City.
You will gain practical experience in:
Here is how this case study will work.
You can follow along with the instructions in this document. They correspond to the code provided and activity template you will fill out to complete the case study. We have blank sections in the code where you can write your responses to our questions. These sections will be marked RED so you won’t miss them. Some responses will include code, while others will not. We do our best to make this clear where needed.
Once you complete the case study, you will submit your work to the online MITx platform. From there, the system will assign you 2 peers to review. You will grade their submissions according to the provided evaluation rubric. We recommend you check out this rubric now, so you know how you will be graded.
Here are the key dates for this case study:
By 23:59 UTC, Sunday, June 24, 2018. Submit your case study to MITx. Be sure to convert to your local time zone.
By 23:59 UTC, Monday, June 25, 2018. Complete the peer reviews for 2 of your peers. (If the system does not give you 2 reviews, see the Next Steps section below for what to do. But, it is critical you complete any reviews you are assigned.)
Please note that this second date is the end of the course. See below for important notes about deadline extensions.
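If you are unsure how to convert the UTC deadline to your own time zone, the arithmetic can be sketched as follows. This is a minimal illustration, assuming a fixed UTC-4 offset (US Eastern daylight time) as a hypothetical example zone; substitute your own offset.

```python
from datetime import datetime, timedelta, timezone

# Submission deadline from this document, expressed in UTC.
deadline_utc = datetime(2018, 6, 24, 23, 59, tzinfo=timezone.utc)

# Hypothetical example zone: a fixed UTC-4 offset (US Eastern daylight time).
# A fixed offset does not track daylight-saving changes, so check your own offset.
example_zone = timezone(timedelta(hours=-4), name="EDT")

# astimezone() shifts the clock time while keeping the same instant.
deadline_local = deadline_utc.astimezone(example_zone)
print(deadline_local.strftime("%Y-%m-%d %H:%M %Z"))
```

The same conversion applies to the peer review deadline; only the date changes.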
Follow these steps to get set up with the code and submission template for the case study.
If you have not already done so, follow the instructions for getting started with Microsoft Azure Notebooks. You will need to create and confirm your account there before proceeding to step 2.
Clone the case study library, which can be found here. Go to that page and click the Clone button to copy the code to your personal library. The dialog should look something like this:
Select your experience level. We have support for both beginner and advanced users of Python in this case study. You should select either beginner_python.ipynb or advanced_python.ipynb, depending on your experience level.
Open the notebook file you selected from the table above. This will start up the Jupyter notebook. If you need a refresher on how to work with a notebook, we have information for you here.
As described above, this notebook file will be the template you use to record your answers. So, before you do anything else, be sure to fill in the information in the Identification Information section at the top of the notebook.
Next, go to the Setup section and follow the instructions. This will install all the packages you need to complete the remainder of the case study.
Great! Now, you should be all set up and ready to complete this case study. As stated above, we are going to be creating custom features to understand NYC taxi trip patterns, based on an existing dataset. This document will walk you through how to work with this data and framework to arrive at some meaningful results!
Remember our tip from our first Microsoft Azure Notebooks tutorial: save your notebook often to avoid losing your work!
One of the first steps in any data science task is importing the necessary tools you will use.
See the Import section of your notebook and follow the instructions to import the required tools. We will be using featuretools in this case study as our main feature extraction and synthesis engine.
The trips table has the following fields, which will be important for you to understand moving forward.
id - uniquely identifies the trip
vendor_id - the taxi cab company - in our case study we have data from three different cab companies
pickup_datetime - the time stamp for pickup
dropoff_datetime - the time stamp for drop-off
passenger_count - the number of passengers for the trip
trip_distance - total distance of the trip in miles
pickup_longitude - the longitude for pickup
pickup_latitude - the latitude for pickup
dropoff_longitude - the longitude of dropoff
dropoff_latitude - the latitude of dropoff
payment_type - a numeric code signifying how the passenger paid for the trip
trip_duration - this is the duration we would like to predict using other fields
pickup_neighborhood - a one or two letter id of the neighborhood where the trip started
dropoff_neighborhood - a one or two letter id of the neighborhood where the trip ended
See the Data section of your notebook and follow the instructions to load the NYC taxi trip data.
An entity is a component of our dataset. A relationship is some connection between 2 or more entities.
We have the following 3 entities:
trips
pickup_neighborhoods
dropoff_neighborhoods
And we also have the following 2 relationships:
pickup_neighborhoods -> trips: One neighborhood can have multiple trips that start in it. This means pickup_neighborhoods is the parent_entity and trips is the child_entity.
dropoff_neighborhoods -> trips: One neighborhood can have multiple trips that end in it. This means dropoff_neighborhoods is the parent_entity and trips is the child_entity.
See the Entities and Relationships section of your notebook and follow the instructions to create these objects, which we will use later in our feature analysis.
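To make the parent and child idea concrete, here is a small sketch in plain Python. It does not use featuretools; the rows, the neighborhood ids, and the helper children_of are made up for illustration only.

```python
# Parent entities: one entry per neighborhood id (made-up values).
pickup_neighborhoods = {"A", "AB"}
dropoff_neighborhoods = {"A", "C"}

# Child entity: each trip points at exactly one pickup and one dropoff neighborhood.
trips = [
    {"id": 1, "pickup_neighborhood": "A", "dropoff_neighborhood": "C"},
    {"id": 2, "pickup_neighborhood": "A", "dropoff_neighborhood": "A"},
    {"id": 3, "pickup_neighborhood": "AB", "dropoff_neighborhood": "C"},
]


def children_of(parent_id, child_rows, link_column):
    """Return the ids of the child rows that reference the given parent id."""
    return [row["id"] for row in child_rows if row[link_column] == parent_id]


# A relationship is only valid if every child references an existing parent.
assert all(t["pickup_neighborhood"] in pickup_neighborhoods for t in trips)
assert all(t["dropoff_neighborhood"] in dropoff_neighborhoods for t in trips)

# One parent, many children: neighborhood "A" is the pickup point of two trips.
print(children_of("A", trips, "pickup_neighborhood"))
print(children_of("C", trips, "dropoff_neighborhood"))
```

In the notebook, the relationship objects play the role of the link_column argument here: they tell the tool which column in trips refers to which neighborhood table.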
Instead of manually creating features, such as “month of pickup datetime”, we can let Deep Feature Synthesis (DFS) come up with them automatically. It does this by:
Features fall into two major categories: transform and aggregate. In featuretools, we can create transform features by specifying transform primitives. Below we specify a transform primitive called weekend. This is how it works:
It applies to any datetime column in the data.
For each value in that column, it checks whether the date falls on a weekend and returns a boolean.
In this specific data, there are two datetime columns: pickup_datetime and dropoff_datetime. The tool automatically creates features using the primitive and these two columns.
See the Transform Primitives section of your notebook and follow the instructions to create this new weekend feature.
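As a rough picture of what the weekend primitive computes, here is a sketch using only the Python standard library. The function is_weekend, the sample trip, and the feature names are illustrative assumptions, not the output of featuretools itself.

```python
from datetime import datetime


def is_weekend(moment):
    """Return True when the datetime falls on a Saturday or Sunday."""
    return moment.weekday() >= 5  # Monday is 0, so 5 and 6 are the weekend


# One made-up trip that starts late on a Friday and ends early on a Saturday.
trip = {
    "pickup_datetime": datetime(2016, 1, 1, 23, 50),
    "dropoff_datetime": datetime(2016, 1, 2, 0, 10),
}

# Apply the same function to every datetime column, one new feature per column.
# The "WEEKEND(...)" names are hypothetical labels chosen for this sketch.
features = {f"WEEKEND({column})": is_weekend(value) for column, value in trip.items()}
print(features)
```

Note how one primitive yields two features here, because there are two datetime columns to apply it to.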
Now, we can build a model to predict our target variable given our new feature matrix. To build a model, we:
Split the data into a portion for training (75% in this case) and a portion for testing.
Take the log of the trip duration so that a more linear relationship can be found.
Train a model using a GradientBoostingRegressor.
See the First Model section of your notebook and follow the instructions to create, train and evaluate our first model of the case study.
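The first two steps above can be sketched with the standard library alone. The durations, the fixed seed, and the shuffling approach are assumptions made for this illustration; the notebook has its own helper code, and the model training step is left out here.

```python
import math
import random

# Made-up trip durations in seconds, standing in for the trip_duration column.
durations = [300, 420, 600, 900, 1200, 1800, 2400, 3600]

# Shuffle row indices with a fixed seed so the split is repeatable,
# then keep the first 75% for training and the rest for testing.
rng = random.Random(0)
indices = list(range(len(durations)))
rng.shuffle(indices)
cutoff = int(len(indices) * 0.75)
train_idx, test_idx = indices[:cutoff], indices[cutoff:]

# The model is trained on log(duration), so its predictions are in log space.
y_train = [math.log(durations[i]) for i in train_idx]
y_test = [math.log(durations[i]) for i in test_idx]

# exp() maps a log-space value back to seconds.
recovered = math.exp(y_train[0])
print(len(y_train), len(y_test), round(recovered))
```

Because the model works in log space, any prediction it makes needs the exp() step before you read it as a duration in seconds.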
Now, we can add Minute, Hour, Week, Month and Weekday primitives. All of these transform primitives apply to datetime columns.
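Here is a sketch of the values these primitives extract from a single datetime, written with the standard library. The helper datetime_parts is hypothetical, and the week numbering shown (ISO week of the year) is an assumption; featuretools may use a different convention.

```python
from datetime import datetime


def datetime_parts(moment):
    """Break a datetime into the parts named by the five primitives."""
    return {
        "MINUTE": moment.minute,
        "HOUR": moment.hour,
        "WEEK": moment.isocalendar()[1],  # ISO week number of the year
        "MONTH": moment.month,
        "WEEKDAY": moment.weekday(),      # Monday is 0, Sunday is 6
    }


# A made-up pickup time: Monday, March 14, 2016 at 17:45.
pickup = datetime(2016, 3, 14, 17, 45)
print(datetime_parts(pickup))
```

Each part becomes its own column in the feature matrix, once for pickup_datetime and once for dropoff_datetime.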
See the More Transform Primitives section of your notebook and follow the instructions to create these new features and evaluate the new model performance.
As described above, the second type of feature we consider here is an aggregate feature, where we compute some function across an entire feature column in our dataset and include that value in our feature matrix.
Let’s add some of these aggregation primitives. These primitives will generate features for the parent entities pickup_neighborhoods and dropoff_neighborhoods, and then add them to the trips entity, which is the entity for which we are trying to make predictions.
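The flow described above (aggregate over the trips in each neighborhood, then attach the result to every trip) can be sketched in plain Python. The sample rows, the choice of count and mean, and the feature names are assumptions for illustration, not the exact primitives or names used in the notebook.

```python
from collections import defaultdict
from statistics import mean

# Made-up child rows: three trips starting in two neighborhoods.
trips = [
    {"id": 1, "pickup_neighborhood": "A", "trip_distance": 1.0},
    {"id": 2, "pickup_neighborhood": "A", "trip_distance": 3.0},
    {"id": 3, "pickup_neighborhood": "AB", "trip_distance": 5.0},
]

# Step 1: group the child rows by their parent neighborhood.
distances_by_neighborhood = defaultdict(list)
for trip in trips:
    distances_by_neighborhood[trip["pickup_neighborhood"]].append(trip["trip_distance"])

# Step 2: compute one aggregate value per parent.
neighborhood_features = {
    hood: {"COUNT(trips)": len(values), "MEAN(trips.trip_distance)": mean(values)}
    for hood, values in distances_by_neighborhood.items()
}

# Step 3: copy each parent's aggregates onto every trip that belongs to it.
feature_matrix = []
for trip in trips:
    row = {"id": trip["id"]}
    for name, value in neighborhood_features[trip["pickup_neighborhood"]].items():
        row[f"pickup_neighborhoods.{name}"] = value
    feature_matrix.append(row)

print(feature_matrix[0])
```

Trips 1 and 2 end up with identical aggregate values because they share a pickup neighborhood; the aggregate describes the neighborhood, not the individual trip.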
See the Aggregation Primitives section of your notebook and follow the instructions to create these new features and evaluate the new model performance.
Now, we want to actually predict the trip durations of our test set. See the Evaluate on Test Data section of your notebook and follow the instructions to make these predictions and plot the resulting distribution.
Finally, let’s look at how important each feature was for the model. If you want to learn more about how these “importance values” are computed, you can read more here.
See the Feature Importance section of your notebook and follow the instructions to get these importance values and analyze them a bit further.
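Once you have the importance values, ranking them is a short sorting exercise. In the sketch below the feature names and numbers are made up; in your notebook they would come from the trained model (a fitted scikit-learn GradientBoostingRegressor exposes them as feature_importances_).

```python
# Made-up feature names and importance values for illustration only.
feature_names = [
    "trip_distance",
    "HOUR(pickup_datetime)",
    "WEEKEND(pickup_datetime)",
    "passenger_count",
]
importances = [0.70, 0.20, 0.04, 0.06]

# Pair each name with its value and sort from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for rank, (name, value) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {value:.2f}")
```

A ranking like this makes it easy to see whether the features you added are contributing, or whether the original columns still dominate.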
We really hope you enjoyed completing this case study and feel more comfortable about some of the learning objectives listed above.
There are just a few more things you need to know before you are done.
Follow these instructions to submit your responses to the MITx platform.
Export your work. In your notebook file on Microsoft Azure Notebooks, go to File > Download As then select HTML.
Convert your file to a PDF. You can use this tool to do this. The MITx platform doesn’t allow HTML uploads, so we must take this in-between step.
Name your file correctly. We ask that you name your file in the following format:
61_LastName_FirstName_mitxprousername.pdf
For example, if you are Bob Smith and your MITx username is bobs123, then you should save your file with this name:
61_Smith_Bob_bobs123.pdf
This makes it easier for other students and course TAs to identify your work and make sure you receive as much credit as possible.
Upload your file. Go to the Case Study page (see Helpful Links above), then click the right tab. You should see a page that looks like this:
Remember: after you submit your own work, be prepared for the system to assign you peers to review.
If the system does not give you any peers to review right away, we ask that you check back to the site again to see if you have any new reviews. It may take a few hours for a new review to come in.
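If you want to double-check the file name from the submission steps above before uploading, the naming format can be written as a one-line helper. The function submission_filename is a hypothetical convenience for this sketch, not part of the course code.

```python
def submission_filename(last_name, first_name, username):
    """Build the file name in the 61_LastName_FirstName_username.pdf format."""
    return f"61_{last_name}_{first_name}_{username}.pdf"


# The example from this document: Bob Smith with the username bobs123.
print(submission_filename("Smith", "Bob", "bobs123"))
```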
Here are some questions we see from students often and our answers to those questions. This list is subject to change.
Q: I don’t know anything about Python, and I would prefer to use R. Is there an R version of the code I can use?
A: Unfortunately, at this time, we have not found an R library that achieves the same sort of feature extraction and synthesis that featuretools does. Thus, we only have a Python version.
However, we are confident that students of all coding backgrounds should be able to complete this case study. If you need some more introductory Python resources, we have some for you!
Q: The code is taking a really long time to run (over 5 minutes for 1 cell). What should I do?
A: We recommend restarting the notebook here and then clicking Kernel > Restart and Run All Cells.
If this still doesn’t fix things, see the troubleshooting section below for help.
Q: I re-open the notebook after not working on it for a while, try to run some code, then get Python errors. What should I do?
A: This is likely because the notebook has gone stale. The best solution would be to re-run all the cells in the notebook up to your current cell. You can easily do this by clicking Cell > Run All.
If this still doesn’t solve the error you are getting, see the troubleshooting section below for help.
Q: I am getting an error like this. What do I do?
~/anaconda3_501/lib/python3.6/site-packages/featuretools/computational_backends/calculate_feature_matrix.py in
calculate_feature_matrix(features, entityset, cutoff_time, instance_ids, entities, relationships, cutoff_time_in_index, training_window, approximate, save_progress, verbose, chunk_size, profile)
101 entityset = EntitySet("entityset", entities, relationships)
102
--> 103 target_entity = entityset[features[0].entity.id]
104 pass_columns = []
105
TypeError: 'NoneType' object is not subscriptable
A: This was an error on our end! Sorry about that. Please see this thread for a solution to your error.
Q: I can’t get my code to work. I keep getting a package error or some other error I don’t know how to solve. What do I do?
A: See the troubleshooting section below for help.
Q: The numbers I am getting on the case study vary slightly from the ones other people are getting. Is this normal?
A: Yes, this is completely normal! This has to do with the way the data is randomly split, or the way the model trains on that data. As long as your numbers are reasonably close to the values of others, you should be all set.
Q: It’s getting really close to the submission deadline and I am struggling to make progress on the assignment. Can I get an extension? What should I do?
A: Unfortunately, due to the nature of the course, it is our official policy that no extensions will be granted on graded assignments. This is especially important as this case study is due at the same time as the course closure on June 25.
However, we encourage you to submit what you have so far. Then, if you are concerned about your grade, you can contact us privately.
Having trouble getting your code to run? Questions about a particular part of the case study? Not able to submit your assignment to MITx? Your course TAs are here to help!
Please follow these steps if you get stuck.
Make sure you followed all of the instructions exactly as specified, and checked the FAQ section of this page.
Search the Discussion Forum to see if your question has already been answered.
If your question is new and unanswered, please post it in the Module 6: Case Studies section of the Discussion Forum, and it will be answered as soon as possible!