PH125.3x: Data Science: Probability - Course Syllabus
Course Instructor
Rafael Irizarry
Course Description
In this third course of nine in the HarvardX Data Science Professional Certificate, we learn the basics of probability theory.
Probability theory is the mathematical foundation of statistical inference, which is indispensable for analyzing data affected by chance and is thus essential for data scientists.
In this course, you will learn important concepts in probability theory. The motivation for this course is the circumstances surrounding the financial crisis of 2007-2008. Part of what caused this financial crisis was that the risk of certain securities sold by financial institutions was underestimated. To begin to understand this very complicated event, we need to understand the basics of probability.
We will introduce important concepts such as random variables, independence, Monte Carlo simulations, expected values, standard errors, and the Central Limit Theorem. These statistical concepts are fundamental to conducting statistical tests on data and to understanding whether the data you are analyzing are likely the result of an experimental method or of chance.
Note that statistical inference, covered in the next course in this series, builds upon probability theory.
What you'll learn:
- Important concepts in probability theory including random variables and independence
- How to perform a Monte Carlo simulation
- The meaning of expected values and standard errors and how to compute them in R
- The importance of the Central Limit Theorem
Course Structure
When you join the course, we encourage you to meet your peers, get set up with R, and tell us about yourselves and what you hope to get out of the course! You can progress through the material at your own pace.
Grading
There are 10 graded assessments available to all learners, some on the DataCamp platform and some directly on the edX platform. These assessments are worth 85% of your grade. For Verified learners, there is an additional comprehensive assessment that is worth 15% of your grade.
All other components of the course, such as the discussion boards, are not for credit.
Certification
In order to receive a Verified Certificate, you must sign up and pay for a Verified Certificate by the deadline on the course page and earn a passing grade of at least 70%.
Installing R
To install R, you can download it for free from the Comprehensive R Archive Network (CRAN). If you need further help, you can check out chapter 1 of the textbook.
Research
HarvardX pursues the science of learning. When you participate in this course, you will also participate in research about learning. Read our research statement to learn more.
COURSE OUTLINE
Section 1: Discrete Probability
You will learn about basic principles of probability related to categorical data using card games as examples.
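To give a flavor of this section, here is a minimal sketch that computes the probability of drawing a king from a shuffled 52-card deck and then estimates the same probability by simulation. It is not taken from the course materials; the way the deck is built and all variable names are our own assumptions.

```r
# Build a 52-card deck as "Rank Suit" strings (illustrative construction).
suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
ranks <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven",
           "Eight", "Nine", "Ten", "Jack", "Queen", "King")
deck <- paste(rep(ranks, times = 4), rep(suits, each = 13))

# Exact probability that one randomly drawn card is a king (4 out of 52).
kings <- paste("King", suits)
exact <- mean(deck %in% kings)

# Monte Carlo estimate: draw one card B times and record how often it is a king.
set.seed(1)   # arbitrary seed, used only to make the run reproducible
B <- 10000
draws <- replicate(B, sample(deck, 1))
estimate <- mean(draws %in% kings)

exact
estimate
```

The two numbers should be close, and the estimate should get closer to the exact value as B grows.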
Section 2: Continuous Probability
You will learn about basic principles of probability related to numeric and continuous data.
Section 3: Random Variables, Sampling Models, and the Central Limit Theorem
You will learn about random variables (numeric outcomes resulting from random processes), how to model data generation procedures as draws from an urn, and the Central Limit Theorem, which applies to large sample sizes.
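As a rough illustration of these ideas, the sketch below models a random variable as the sum of draws from an urn, computes its expected value and standard error, and checks both against a Monte Carlo simulation. The urn contents, the values of p, n, and B, and the variable names are hypothetical choices made for this example, not part of the course materials.

```r
# Hypothetical urn: each draw is +1 with probability p and -1 otherwise.
p <- 0.5
n <- 1000    # number of draws in one experiment
B <- 10000   # number of Monte Carlo repetitions

# Theoretical expected value and standard error of the sum of n draws.
mu <- n * (1 * p + (-1) * (1 - p))
se <- sqrt(n) * abs(1 - (-1)) * sqrt(p * (1 - p))

# Monte Carlo: repeat the experiment B times and record each sum.
set.seed(1)   # arbitrary seed, used only to make the run reproducible
S <- replicate(B, sum(sample(c(1, -1), n, replace = TRUE, prob = c(p, 1 - p))))

c(theory = mu, simulated = mean(S))
c(theory = se, simulated = sd(S))

# Central Limit Theorem check: if S is approximately normal, roughly 95% of
# the simulated sums should fall within 1.96 standard errors of mu.
coverage <- mean(abs(S - mu) <= 1.96 * se)
coverage
```

Because n is large, the simulated sums should be approximately normally distributed, which is why the coverage should land near 0.95.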
Section 4: The Big Short
You will learn how interest rates are determined and how some bad assumptions led to the financial crisis of 2007-2008.
FAQS - ABOUT THE COURSE
What is this R language we will be using?
R is a programming language and environment that is used in many fields for statistical analysis. R is also completely free and open source.
Do I need to have a background in statistics or R to take the course?
This is the third course in the series. We assume you have either taken the first two courses or have at least some background in R. However, we do not assume any knowledge of statistics for this course, and none is needed. The statistics and programming aspects of the series increase in difficulty the farther you progress. The third course is an introduction to probability.
I do not have a background in programming. Can I still take the course?
Yes, we do not assume a background in programming. You will learn programming skills by completing the exercises. That said, unless you have some familiarity with R, we highly recommend starting with the first course, PH125.1x R Basics.
Is this class challenging?
These courses are taught at the college level. Some of the material, depending on your exposure, may be fairly challenging. However, the first few courses are meant to be a "gentle" introduction to statistics and R. Furthermore, there is substantial help from the community, including a lively discussion board.
Is there a textbook?
Yes, there is a free PDF textbook available here in English and here in Spanish. (Note: The book is "free" in that you can slide the "YOU PAY" scale to $0. You are welcome to pay what you can afford, and there is no advantage in the course to anyone that "purchases" the book for more money.)
There is also an HTML version of the textbook here.
How long do I have to complete the course?
In principle, if you spend 2-4 hours per week, you should be able to complete the course within around four weeks. That said, this course is self-paced, so you can take as long as you want, provided you complete the course before the deadline listed on your course homepage.
Do I have to take the courses in sequence?
The courses in the HarvardX Data Science Professional Certificate are designed to be taken in the following order:
Each subsequent course assumes familiarity with the content in the preceding courses. Depending on your experience with data science generally and R specifically, you may be able to take the courses out of sequence if you choose.
FAQS - SOFTWARE, ETC.
How do I get started?
First, you will need to install R onto your machine. You can do that from CRAN.
The latest version is 4.1.0. If you are using an older version of R, it will likely be fine so long as you are using version 3.2.0 or later. This course makes use of packages such as "dplyr" which only work if you are running version 3.1.2 or more recent. Depending on your machine you may have to resolve various dependencies, but in most instances it should be straightforward to install R.
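If you are not sure which version you have, you can check from the R console. This is a minimal sketch using base R functions; the threshold "3.2.0" is the minimum version mentioned above.

```r
# Print the version of R that is currently running.
R.version.string

# TRUE if this session meets the minimum version mentioned above.
meets_minimum <- getRversion() >= "3.2.0"
meets_minimum
```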
What do I do after installing R?
You can begin learning or optionally install RStudio.
RStudio is a graphical user interface for R. RStudio is NOT part of the R language nor is it required in order to complete the course. However, it does provide a nice interface (Professor Irizarry uses RStudio in the videos). Please note that RStudio comes in various commercial flavors. However, it does include a free Open Source Edition. If you are unable to install RStudio for whatever reason, we suggest you skip this step and just continue with the course so long as R is successfully installed on your computer.
How do I install packages?
We will be using and downloading various packages throughout this course and the subsequent courses. However, installing packages is straightforward. For example, if you want to install dslabs you just enter
install.packages("dslabs")
Please note that the most common error when trying to install a package is spelling. For example, if you had typed “dslab” instead of “dslabs”, you would get an error. The second most common error is forgetting the quotes.
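One way to confirm whether a package is already installed, and to catch a misspelled name before installing, is sketched below. The helper missing_packages() is a hypothetical name introduced for this example; it relies on the base R function requireNamespace(), which returns FALSE when a package cannot be found.

```r
# Hypothetical helper: returns the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" ships with R, so it is never reported as missing.
# "dslabs" is reported only if you have not installed it yet.
missing_packages(c("stats", "dslabs"))
```

Any names it returns can then be passed, in quotes, to install.packages().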
I installed the package but it is not working.
After installing a package, you must load it. For example, after installing dslabs, you load it by typing:
library(dslabs)
After you hit enter, if you do not get an error, then you are good to go. Note that after running some packages, you may get a message, but that does not imply there is an error. Also note that when starting a new R session you typically will need to load the package again.
Can I install packages using RStudio?
Yes, you can install most if not all packages via RStudio. From the menu bar, choose Tools > Install Packages. Then you can enter the name of the package you want. If the package is found, you should be able to download it directly.
I'm trying to install other packages such as rafalib and I get an error (similar to): 'lib = "C:/Program Files/R/R-3.2.2/library"' is not writable
This problem occurs when you don't have administrative privileges to write to that file location. We strongly recommend that you fix this by obtaining administrator privileges on the machine you are using. If you are using a public or shared machine, you may have difficulty installing the necessary packages.
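To see where R is trying to install packages and whether those folders are writable, you can run a rough check like the one below. This is a sketch using base R only; the variable names are our own, and file.access() can be unreliable on some systems, so treat the result as a hint and not a guarantee.

```r
# List the library folders R installs into and whether each looks writable.
# file.access() returns 0 when the requested access (mode = 2, write) is allowed.
libs <- .libPaths()
lib_status <- data.frame(library = libs,
                         writable = file.access(libs, mode = 2) == 0)
lib_status

# The location R uses for a per-user library, if one is configured.
Sys.getenv("R_LIBS_USER")
```

If none of the listed folders is writable, that confirms the cause of the error described above.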
I'm having problems installing devtools and other packages. Can you help?
devtools can be hard to install, which is why we are not using it in any of the exercises. You do not need it for this course. If you wish to install packages that are not used in the course, we may not be able to help you. We certainly do not want to discourage you from exploring, since there are numerous fabulous packages. However, we are going to focus on material necessary for this course.
When I try to open a file I am getting an error such as cannot open file 'femaleMiceWeights.csv': No such file or directory
This is probably because the file is not in your working directory. Either the file needs to be in that folder, or you need to change the working directory to the folder containing the file.
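The following sketch shows how you might diagnose this from the R console using base R functions. The path in the last comment is a hypothetical placeholder that you would replace with your own folder.

```r
# Show the folder R is currently looking in (the working directory).
getwd()

# TRUE if the file is in that folder; FALSE explains the error above.
file.exists("femaleMiceWeights.csv")

# List the CSV files R can see in the working directory.
list.files(pattern = "\\.csv$")

# To point R at a different folder, call setwd() with that folder's path:
# setwd("path/to/folder")   # hypothetical path; replace with your own
```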
I am getting some strange error message in R. Ideas?
If you cannot figure out what is wrong, a good idea, especially if your R session has been running for a long time, is to exit R and restart it. If you still see the error after restarting, let us know (see below).
FAQS - GETTING HELP
My code is not working! Can I ask you about it?
Yes, certainly! You need to post to the course discussion board. Please READ BELOW for how to post.
How do I post?
Okay, this is important! READ THIS… really, read it, we explain how to ask questions.
When you ask a question please make sure to:
- List it as a question.
- Choose the appropriate Topic Area.
- Have a clear, specific title.
For example, if you have a question on the first exercise in section 1, then say so (e.g., "help with question 1a in section 1") in your question.
Also, you can show us your code (or at least parts of it; we ask you not to ruin the course for others by carelessly posting solutions). Make sure your code is legible. Even simple code can be difficult and annoying to read if garbled. For example, do NOT present your code like this:
sum <- 100 for(i in 1:50) sum <- sum + i sum
Instead please make it neat and format it as code:
sum <- 100
for(i in 1:50)
  sum <- sum + i
sum
To do this, you can insert your code, then highlight it and press Ctrl+K and it should be nice and legible (if not, please fix it using the guidance in the pinned post in the “general” discussion forum).
This not only helps us, the staff, identify your problem, it also helps other students who may have similar questions.
For more information about how to use the discussion forum, check out the edX documentation.
FAQS - PROFESSIONAL CERTIFICATE
How often will the courses be offered?
Courses in the program are offered frequently, with overlap - so if now isn’t a good time for you to start one of the courses you need as a prerequisite or if you missed a deadline, there will be another offering of the course you need coming soon!
Does the order of courses in the Professional Certificate Program matter?
Yes, order does matter, particularly for the first four courses in the sequence. For the later courses, depending on your previous experience, you may be able to swap the sequence of some of the courses. The courses are designed to be taken in the following order:
Do I need to register for all of the courses at once in order to be eligible for the Professional Certificate?
No! You can take courses individually - once you have obtained an ID Verified Certificate in each course, you will be eligible for the Professional Certificate. If you choose to pre-pay for the entire program, you receive a discount on the total registration cost.