Organizations use their data for decision support and to build data-intensive products and services, such as recommendation, prediction, and diagnostic systems. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then teach students how to use PySpark (part of Apache Spark) to deliver against these expectations. The course assignments include Prediction using Machine Learning algorithms, Collaborative Filtering, and Textual Entity Resolution exercises that teach students how to manipulate datasets using parallel processing with PySpark, Spark SQL, and Spark Machine Learning Pipelines.


This course covers advanced undergraduate-level material. It requires a programming background and experience with Python, PySpark (part of Apache Spark), and Spark SQL. The edX course CS 105x, Introduction to Apache Spark, is a mandatory prerequisite.

Piazza Discussion group

We are using Piazza for all course questions and discussions. Please sign up for the discussion forum (access key: cs110x).


Week 1: Big Data and Data Science - Launches Aug 15 at 17:00 UTC

    • Introduction to Big Data and Data Science - learn about big data, see examples of how data science can leverage big data, and learn about the risks of performing data science without statistics
    • Performing Data Science and Preparing Data - explore data science definitions and topics, and the process of acquiring and preparing data
    • Exploratory Data Analysis - understand the statistics of Exploratory Data Analysis
    • Machine Learning - learn about Spark's machine learning libraries, ML and MLlib
    • Lab 1: Power Plant Machine Learning Pipeline (Due Sept 12, 2016 at 23:00 UTC) - in the first course lab, explore and visualize data, learn about Spark's Machine Learning Pipeline, and apply and evaluate several Machine Learning algorithms to answer a business question

Week 2: Performing Data Science - Launches Aug 22 at 17:00 UTC

    • Data Science Roles - understand the different roles of a Data Scientist
    • Data Quality - understand the importance of data quality
    • Data Cleaning - learn about the challenges of data cleaning
    • Statistical Inference - learn about estimation, bias, variability, data distributions and the Central Limit Theorem 
    • Lab 2: Collaborative Filtering on a Movie Dataset (Due Sept 12, 2016 at 23:00 UTC) - use Apache Spark's Machine Learning Pipelines library to perform collaborative filtering on a movie dataset in the second course lab 

Week 3: Apache Spark's Resilient Distributed Datasets - Launches Aug 29 at 17:00 UTC

    • Spark Low-Level Primitives - learn about Spark's Resilient Distributed Datasets, transformations, and actions, and Spark's shared variables 
    • File Performance - understand the considerations for the performance of file read and write actions 
    • Lab 3: Text Analysis and Entity Resolution (Due Sept 12, 2016 at 23:00 UTC) - perform text analysis and entity resolution on Google and Amazon product listings using Spark in the third course lab 

Week 4: Statistics - Launches Sept 5 at 17:00 UTC

    • Statistics - learn about relations, associations, trends, patterns, correlation, and regression

Course Grading Policy

The components of the course grade are:

    • 16% Quizzes in the lectures (scroll down below some of the videos for multiple-choice and checkbox quizzes)
    • 84% Three lab exercises

The course is graded pass-fail with a passing grade of 40%.
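
As a rough illustration of how these weights combine, here is a minimal Python sketch. The function names are hypothetical, and it assumes each score is given as a fraction of the available points in that component; the policy above does not say how the three labs are weighted against each other, so the labs are treated here as a single combined score.

```python
# Weights and passing threshold taken from the grading policy above.
QUIZ_WEIGHT = 0.16
LAB_WEIGHT = 0.84
PASSING_GRADE = 0.40


def course_grade(quiz_fraction, lab_fraction):
    """Combine quiz and lab scores (each a fraction in [0, 1]) into a course grade.

    Assumption: lab_fraction is the overall fraction of lab points earned;
    the relative weight of each individual lab is not specified in the policy.
    """
    for value in (quiz_fraction, lab_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must be fractions between 0 and 1")
    return QUIZ_WEIGHT * quiz_fraction + LAB_WEIGHT * lab_fraction


def passes(quiz_fraction, lab_fraction):
    """Return True when the weighted grade reaches the passing threshold."""
    return course_grade(quiz_fraction, lab_fraction) >= PASSING_GRADE
```

Under these assumptions, a perfect quiz score alone (16%) is not enough to pass, while earning half of the lab points (42%) is.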

All course assignments (quizzes and the three lab exercises) are due by the end of the course: Sept 12, 2016 at 23:00 UTC. 


This course is sponsored in part by Databricks and UC Berkeley's AMPLab.

The content in this course includes notes and content created by Ani Adhikari, Michael Armbrust, Doug Bateman, Dan Bruckner, Brian Clapper, John Canny, John DeNero, Sameer Farooqui, Michael Franklin, Richard Garris, Paco Nathan, Kay Ousterhout, Evan Sparks, Shivaram Venkataraman, David Wagner, Patrick Wendell, and Matei Zaharia.

Erik Arvai edited the course videos.