COURSE OVERVIEW

Machine learning aims to extract knowledge from data and enables a wide range of applications. With datasets rapidly growing in size and complexity, learning techniques are fast becoming a core component of large-scale data processing pipelines. This course introduces the underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines. We present an integrated view of data processing by highlighting the various components of these pipelines, including feature extraction, supervised learning, model evaluation, and exploratory data analysis. Students will gain hands-on experience applying these principles by using Apache Spark to implement several scalable learning pipelines.

PREREQUISITES

Programming background; comfort with mathematical and algorithmic reasoning; familiarity with basic machine learning concepts; exposure to algorithms, probability, linear algebra, and calculus; experience with Python (or the ability to learn it quickly). All exercises use PySpark, and prior PySpark experience equivalent to CS105x: Introduction to Apache Spark is required. Before the course begins, take this Python quiz; if you need to learn Python or refresh your knowledge, work through this Python mini-course. This self-assessment document provides online resources that review additional relevant background material.

This course uses Databricks Community Edition, a free software environment. It requires the Google Chrome (preferred) or Firefox web browser; Safari, Internet Explorer, and Edge are not supported.

Note: If you already have a Databricks Community Edition account, e.g., from CS105x, you should use this same account for CS120x.

Piazza Discussion Group

We are using Piazza for all course questions and discussions. Please sign up for the discussion forum here: piazza.com/edx_berkeley/summer2016/cs120x (access key: cs120x).

COURSE CONTENT


WEEK 1: Intro to Machine Learning and Spark RDDs - Launches July 11 at 17:00 UTC

  • Topics: Course goals, Apache Spark overview, basic machine learning concepts, steps of typical supervised learning pipelines, linear algebra review, computational complexity / big O notation review, Spark's RDD data structure.
  • Setup/Lab 0: Register with the free course software environment, and run your first Apache Spark notebook. (Due August 12, 2016 at 23:00 UTC)
  • Labs 1a and 1b: Math Review and Introduction to RDDs. In Lab 1a, you will gain hands-on experience using Python's scientific computing library (NumPy) to manipulate matrices and vectors, and learn about lambda functions, which are used throughout the course. In Lab 1b, you will write a word-counting program to count the words in all of Shakespeare's plays; a minimal sketch of the RDD word-count pattern appears below. The word-counting exercise mirrors Lab 1b from CS105x, but uses RDDs instead of DataFrames. (Due August 12, 2016 at 23:00 UTC)
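
For orientation, here is a minimal sketch of the RDD word-count pattern that Lab 1b builds on. It is illustrative only: the SparkContext setup and the file path are assumptions, not the lab's actual environment (Databricks notebooks provide sc for you).

    from pyspark import SparkContext

    # Not needed in Databricks, where sc already exists.
    sc = SparkContext("local[*]", "WordCount")

    # "shakespeare.txt" is a placeholder path, not the lab's dataset location.
    lines = sc.textFile("shakespeare.txt")

    counts = (lines
              .flatMap(lambda line: line.lower().split())  # one record per word
              .map(lambda word: (word, 1))                 # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))            # sum the counts for each word

    print(counts.takeOrdered(10, key=lambda kv: -kv[1]))   # ten most frequent words
    sc.stop()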

WEEK 2: Linear Regression and Distributed Machine Learning Principles - Launches July 18 at 17:00 UTC

  • Topics: Linear regression formulation and closed-form solution, distributed machine learning principles (related to computation, storage, and communication), gradient descent, quadratic features, grid search, Spark's pipeline API.
  • Lab 2: Millionsong Regression Pipeline. Develop an end-to-end linear regression pipeline to predict the release year of a song given a set of audio features. You will implement a gradient descent solver for linear regression (sketched below), use Spark's machine learning library (MLlib) to train additional models, tune models via grid search, improve accuracy using quadratic features, and visualize various intermediate results to build intuition. Finally, you will write a concise version of this pipeline using Spark's pipeline API. (Due August 12, 2016 at 23:00 UTC)
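
To make the gradient descent step concrete, here is a hedged local NumPy sketch of batch gradient descent for least-squares linear regression. The function name, step-size decay, and synthetic check are illustrative assumptions, not the lab's reference implementation (which operates on Spark RDDs).

    import numpy as np

    def gradient_descent(X, y, num_iters=5000, alpha=0.25):
        """Minimize the mean squared error (1/n) * ||Xw - y||^2.
        The 1/sqrt(t) step-size decay is an illustrative choice."""
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, num_iters + 1):
            grad = (2.0 / n) * X.T.dot(X.dot(w) - y)  # gradient of the MSE
            w -= (alpha / np.sqrt(t)) * grad          # decaying step size
        return w

    # Tiny synthetic check: data generated with true weights [1.0, 2.0].
    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
    y = X.dot(np.array([1.0, 2.0]))
    print(gradient_descent(X, y))  # should approach [1.0, 2.0]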

WEEK 3: Logistic Regression and Click-through Rate Prediction - Launches July 25 at 17:00 UTC

  • Topics: Online advertising, linear classification, logistic regression, working with probabilistic predictions, categorical data and one-hot-encoding, feature hashing for dimensionality reduction.
  • Lab 3: Click-through Rate Prediction Pipeline. Construct a logistic regression pipeline to predict click-through rate using data from a recent Kaggle competition. You will extract numerical features from the raw categorical data using one-hot-encoding, reduce the dimensionality of these features via hashing (both sketched below), train logistic regression models using MLlib, tune hyperparameters via grid search, and interpret probabilistic predictions via a ROC plot. (Due August 12, 2016 at 23:00 UTC)
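
For intuition, here is a hedged plain-Python sketch contrasting the two featurization ideas: one-hot encoding assigns every distinct (feature, value) pair its own dimension via a learned vocabulary, while feature hashing maps pairs straight to a fixed number of buckets. The function names, data layout, and bucket count are illustrative assumptions, not the lab's specification.

    from collections import defaultdict

    def one_hot_encode(points):
        """Assign each distinct (feature, value) pair its own binary dimension.
        Returns the vocabulary plus each point as a list of active indices."""
        vocab = {}
        for point in points:
            for pair in point:
                if pair not in vocab:
                    vocab[pair] = len(vocab)
        return vocab, [[vocab[pair] for pair in point] for point in points]

    def hash_encode(point, num_buckets=2 ** 4):
        """Feature hashing: index = hash(pair) mod num_buckets. No vocabulary
        is stored, so the dimensionality is fixed in advance."""
        counts = defaultdict(int)
        for pair in point:
            counts[hash(pair) % num_buckets] += 1
        return dict(counts)

    # Toy categorical points as (feature index, category value) pairs.
    points = [[(0, 'mouse'), (1, 'black')], [(0, 'cat'), (1, 'tabby')]]
    vocab, encoded = one_hot_encode(points)
    print(vocab, encoded)          # 4 distinct pairs -> 4 one-hot dimensions
    print(hash_encode(points[0]))  # the same point in 16 hashed buckets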

WEEK 4: Principal Component Analysis and Neuroimaging - Launches August 1 at 17:00 UTC

  • Topics: Introduction to neuroscience and neuroimaging data, exploratory data analysis, principal component analysis (PCA) formulations and solution, distributed PCA.
  • Lab 4: Neuroimaging Analysis via PCA. Identify patterns of brain activity in larval zebrafish. You will work with time-varying images (generated using a technique called light-sheet microscopy) that capture a zebrafish's neural activity as it is presented with a moving visual pattern. After implementing distributed PCA from scratch (the core computation is sketched below) and gaining intuition by working with synthetic data, you will use PCA to identify distinct patterns across the zebrafish brain that are induced by different types of stimuli. (Due August 12, 2016 at 23:00 UTC)
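
As a reference point, here is a hedged local NumPy sketch of the PCA computation the lab distributes: center the data, form the d x d covariance matrix, and take its top eigenvectors. In the lab the sums over the n rows are computed with Spark; here everything is local, and all names are illustrative.

    import numpy as np

    def pca(data, k=2):
        """Top-k principal components via eigendecomposition of the sample
        covariance (feasible locally when the number of features d is small)."""
        centered = data - data.mean(axis=0)              # subtract per-feature means
        cov = centered.T.dot(centered) / data.shape[0]   # d x d covariance matrix
        eig_vals, eig_vecs = np.linalg.eigh(cov)         # eigh: for symmetric matrices
        top = np.argsort(eig_vals)[::-1][:k]             # indices of the k largest eigenvalues
        components = eig_vecs[:, top]                    # d x k projection matrix
        scores = centered.dot(components)                # n x k reduced representation
        return components, scores, eig_vals[top]

    # Synthetic check: most of the variance lies along the first axis.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 3)) * np.array([5.0, 1.0, 0.2])
    components, scores, variances = pca(data, k=2)
    print(variances)  # variance captured by each of the top-2 components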

Course Grading Policy

The components of the course grade are:

    • 15% Quizzes (associated with lectures and labs)
    • 5% Setup
    • 80% Four Spark coding labs

The course is graded on the following scale:

    • 45% or greater: passing
    • Below 45%: non-passing

We encourage you to start the software setup and the Spark coding labs as early as possible. All assignments are due at the end of the course (August 12, 2016 at 23:00 UTC); there is no grace period for late submissions.

Credits

This course is sponsored by Databricks.