
Apache Spark is rapidly becoming the compute engine of choice for big data. Spark programs are more concise and often run 10-100 times faster than Hadoop MapReduce jobs. As companies realize this, Spark developers are becoming increasingly valued.

This statistics and data analysis course will teach you the basics of working with Spark and will provide you with the necessary foundation for diving deeper into Spark. You’ll learn about Spark’s architecture and programming model, including commonly used APIs. After completing this course, you’ll be able to write and debug basic Spark applications. This course will also explain how to use Spark’s web user interface (UI), how to recognize common coding errors, and how to prevent errors proactively. The focus of this course will be Spark Core and Spark SQL. The course assignments include word counting and web server log mining on real-world datasets, using parallel processing with PySpark.


This course covers advanced undergraduate-level material. It requires a programming background and experience with Python (or the ability to learn it quickly). All exercises will use PySpark (the Python API for Apache Spark), but previous experience with Spark or distributed computing is NOT required. Students should take this Python mini-quiz before the course and take this Python mini-course if they need to learn Python or refresh their Python knowledge.

The free software environment we are using for this course is Databricks Community Edition. The environment requires the Google Chrome (preferred) or Firefox web browser. Safari, Internet Explorer, and Edge are not supported.

Piazza Discussion group

We are using Piazza for all course questions and discussions. Please sign up for the discussion forum here: (access key: cs1051x)

Course Content

Week 1: Apache Spark Architecture and Programming Model - Launches June 15 at 17:00 UTC

      • Introduction to Big Data - learn about where big data comes from, the challenges of working with big data, and real world Apache Spark applications for big data 
      • The Structure Spectrum - explore the concepts of data structure (unstructured, semi-structured, and structured data) and how tabular data is handled in Spark
      • Apache Spark - learn how Spark DataFrames, Transformations, and Actions enable analysis of Big Data
      • Best Programming Practices - walk through an example of how to apply best principles when writing a Spark application
      • Setting up the Course Software Environment (Due September 11, 2016 at 00:00 UTC) - register with the free course software environment, and run your first Apache Spark notebook.
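As a preview of the Transformations and Actions topic above: Spark transformations (such as map and filter) are lazy and only describe a computation, while actions (such as count and collect) trigger execution. A minimal plain-Python analogy using generators (this is an illustration, not the Spark API):

```python
# Illustrative analogy only -- Python generators are lazy in the same way
# Spark transformations are: nothing runs until a result is demanded.

data = range(1, 11)

# "Transformations": these lines build a pipeline but compute nothing yet.
squared = (x * x for x in data)
evens = (x for x in squared if x % 2 == 0)

# "Action": materializing the generator finally executes the whole pipeline.
result = list(evens)
print(result)  # [4, 16, 36, 64, 100]
```

In real Spark code the same shape appears as `df.select(...).filter(...)` (lazy) followed by `df.count()` or `df.collect()` (eager).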

Week 2: The Structured Query Language and Spark SQL  - Launches June 22 at 17:00 UTC

      • The Big Data Problem - learn about challenges of working with Big Data and the opportunities presented by Cloud Computing
      • Cluster Computing Challenges and the Map Reduce Programming Paradigm - explore the history of Big Data processing hardware, the switch to lower-cost hardware for cluster computing, and the Map Reduce programming paradigm
      • Apache Spark: Technology Trends, Opportunity, and Advantages - explore the organization of modern datacenters, the technology trends of storage and the opportunities they present, and Spark's differences and advantages relative to Map Reduce
      • Structured Data and the Structured Query Language - review key data management concepts, learn about the relational data model, relational databases, and the Structured Query Language, and understand how queries and joins are performed with SQL 
      • Spark SQL and Spark DataFrames - explore how Spark SQL and Spark DataFrames support SQL join operations
      • Spark Community and Resources - understand the available online Spark resources and community, including documentation, technical blogs, YouTube channel, community mailing lists, meetups, and forums, and Spark packages and source code repositories
      • Semi-Structured Web Server Logs - explore semi-structured data and the Apache Common Log Format used by Apache Web servers
      • Spark Research Papers - learn about the history of Apache Spark through three research papers
      • Lab 1A/1B: Learning Apache Spark (Due September 11, 2016 at 00:00 UTC) - perform your first course lab in two parts: in the first part, you will learn about the Spark data model, DataFrames, transformations, and actions; in the second part, you will write a word counting program to count the words in all of Shakespeare's plays
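The Map Reduce programming paradigm covered this week splits work into a map phase that emits key-value pairs and a reduce phase that combines pairs sharing a key. A toy single-machine sketch in plain Python (not Spark itself, and the sample lines are made up):

```python
from collections import Counter
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word (key)."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

lines = ["to be or not to be", "that is the question"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(pairs)
print(counts["to"], counts["be"])  # 2 2
```

Lab 1B's word count over Shakespeare's plays follows this same map/reduce shape, with Spark distributing both phases across a cluster.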
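For the SQL joins mentioned in the Structured Query Language bullet above, here is a minimal self-contained example using Python's built-in sqlite3 module (the table names and rows are invented for illustration; the lectures cover the same join concept in Spark SQL):

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grades (student_id INTEGER, course TEXT, grade TEXT);
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO grades VALUES (1, 'CS105x', 'A'), (2, 'CS105x', 'B');
""")

# An inner join matches rows across tables on the join condition.
rows = conn.execute("""
    SELECT s.name, g.course, g.grade
    FROM students s
    JOIN grades g ON g.student_id = s.id
    ORDER BY s.name
""").fetchall()
print(rows)  # [('Ada', 'CS105x', 'A'), ('Grace', 'CS105x', 'B')]
```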

Week 3: Web Server Log Analysis with Apache Spark  - Launches June 29 at 17:00 UTC

    • Lab 2: Web Server Log Analysis with Apache Spark (Due September 11, 2016 at 00:00 UTC) - use Spark to explore a NASA Apache web server log in the second course lab 
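Lab 2 works with the Apache Common Log Format introduced in Week 2 (host, identity, user, timestamp, request, status, size). A small sketch of parsing one such line with a regular expression in plain Python (the sample line is a standard made-up example, not taken from the NASA dataset; the lab's actual parsing pattern may differ):

```python
import re

# Common Log Format: host ident user [timestamp] "request" status size
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)'
)

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')

m = LOG_PATTERN.match(line)
host, ident, user, timestamp, request, status, size = m.groups()
print(host, status, size)  # 127.0.0.1 200 2326
```

In the lab, a parser like this is applied across the whole log in parallel, turning semi-structured text into structured rows that Spark can query.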

Course Grading Policy

The components of the course grade are:

    • 15% Quizzes in the lectures (scroll down below some of the videos for multiple choice and check box quizzes)
    • 5% Setup lab exercise
    • 80% Two lab exercises

The passing grade for this course is 40%.

We plan to leave the course open until the end of CS 110.1x, at which time (September 11 at 00:00 UTC) all lab exercises will be due, with no grace period for late submissions.


This course is sponsored in part by Databricks and UC Berkeley's AMPLab.

The content in this course includes notes and content created by Dan Bruckner, Brian Clapper, John Canny, Sameer Farooqui, Richard Garris, Vida Ha, Michael Franklin, Anthony D. Joseph, Paco Nathan, Kay Ousterhout, Evan Sparks, Shivaram Venkataraman, Patrick Wendell, and Matei Zaharia.

Erik Arvai edited the course videos.