Glossary

A

Accuracy:
Accuracy is a metric used to examine how well a machine learning model performs. It is the ratio of correctly predicted observations to the total number of observations.
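A minimal sketch of the calculation, using made-up labels purely for illustration:

```python
# Hypothetical ground-truth labels and model predictions
actual    = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 0, 1]

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(accuracy)  # 5 of 6 predictions match, so accuracy is about 0.83
```
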
ANOVA:
ANOVA (Analysis of Variance) is a statistical method for comparing the means of two or more groups. It partitions the total variability in a set of data, measured by the sum of squared differences of the observations from their overall mean, into variability between groups and variability within groups.
Auto-regression:
A regression model that uses values of the same variable from previous time steps as inputs is referred to as auto-regression.

B

Bar Chart:
Bar charts are a type of graph used to display and compare the number, frequency or other measures (e.g. mean) for different discrete categories of data. They are used for categorical variables.
Bayes Theorem:
Bayes theorem is used to calculate conditional probability. Conditional probability is the probability of an event occurring given that another, related event has already occurred.
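In symbols, Bayes theorem is P(A|B) = P(B|A) * P(A) / P(B). A small sketch with made-up numbers (the disease prevalence, test sensitivity and overall positive rate below are all hypothetical):

```python
# Hypothetical example: probability of having a disease given a positive test
p_disease = 0.01            # P(A): prior probability of the disease
p_pos_given_disease = 0.95  # P(B|A): probability of a positive test if diseased
p_pos = 0.05                # P(B): overall probability of a positive test

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # 0.19
```
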
Bayesian Statistics:
Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It provides the tools to update beliefs in light of new data. It differs from the classical frequentist approach and is based on the use of Bayesian probabilities to summarize evidence.
Big Data:
Big data is a term that describes the large volume of data – both structured and unstructured. But it’s not the amount of data that’s important. It’s how organizations use this large amount of data to generate insights. Companies use various tools, techniques and resources to make sense of this data to derive effective business strategies.
Binomial Distribution:
Binomial distribution applies only to discrete random variables. It is a method of calculating probabilities for experiments with a fixed number of trials. A binomial distribution has the following properties: 1.The experiment has a finite number of trials, 2.Each trial has two outcomes: success and failure, 3.Trials are independent, 4.The probability of success (p) remains constant.
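A small sketch of a binomial probability calculation, assuming SciPy is available:

```python
from scipy.stats import binom

# Probability of exactly 3 successes in 10 independent trials,
# each with success probability p = 0.5
print(binom.pmf(k=3, n=10, p=0.5))  # ~0.117
```
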
Box Plot:
It displays the full range of variation (from min to max), the likely range of variation (the Interquartile range), and a typical value (the median).
Business Analytics:
Business analytics refers to the practical methodology an organization follows to explore data and gain insights. The methodology focuses on statistical analysis of the data.
Business Intelligence:
Business intelligence is a set of strategies, applications, data and technologies used by an organization for data collection, analysis and generating insights to derive strategic business opportunities.

C

Categorical Variable:
Categorical variables (or nominal variables) are those variables which have discrete qualitative values.
Classification:
It is a supervised learning method where the output variable is a category, such as “Male”/“Female” or “Yes”/“No”.
Classification Threshold:
Classification threshold is the value used to classify a new observation as 1 or 0. When a model outputs probabilities, we choose a threshold value; if the probability is above that threshold we classify the observation as 1, and 0 otherwise. To find a good threshold, one can examine the ROC curve at different threshold values and pick, for example, the threshold that gives the best trade-off between the true positive rate and the false positive rate (such as the point maximizing Youden's J statistic).
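A minimal sketch of applying a threshold to predicted probabilities (the probabilities and the 0.5 threshold are illustrative):

```python
# Hypothetical predicted probabilities from a classifier
probabilities = [0.91, 0.40, 0.65, 0.08, 0.77]
threshold = 0.5

# Classify as 1 if the probability exceeds the threshold, else 0
classes = [1 if p > threshold else 0 for p in probabilities]
print(classes)  # [1, 0, 1, 0, 1]
```
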
Clustering:
Clustering is an unsupervised learning method used to discover the inherent groupings in the data.
Confidence Interval:
A confidence interval is a range of values, computed from a sample, that is likely to contain the true value of a population parameter (for example, the population mean) with a stated level of confidence.
Confusion Matrix:
A confusion matrix is a table that is often used to describe the performance of a classification model. It is an N * N matrix, where N is the number of classes, formed by comparing the model's predicted classes against the actual classes. False negatives (actual positives predicted as negative) correspond to Type II error, whereas false positives (actual negatives predicted as positive) correspond to Type I error.
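A small sketch using scikit-learn (assuming it is installed); the labels are made up:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual vs. predicted labels for a binary classifier
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(actual, predicted))
# [[3 1]
#  [1 3]]
```
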
Continuous Variable:
Continuous variables are those variables which can take an infinite number of values within a given range.
Correlation:
Correlation is the covariance of two variables divided by the product of their standard deviations. It takes a value between +1 and -1. Values close to either extreme mean the variables are strongly correlated with each other. A value of zero indicates no linear correlation, but not necessarily independence.
Covariance:
Covariance is a measure of the joint variability of two random variables. It is similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.
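A small NumPy sketch covering both covariance and correlation (the two variables are made up):

```python
import numpy as np

# Two made-up variables that tend to increase together
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

covariance = np.cov(x, y)[0, 1]        # how x and y vary together
correlation = np.corrcoef(x, y)[0, 1]  # covariance rescaled to lie in [-1, +1]
print(covariance, correlation)
```
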

D

Data Mining:
Data mining is the process of extracting useful information from structured/unstructured data taken from various sources. It is usually done for: 1.Mining for frequent patterns 2.Mining for associations 3.Mining for correlations 4.Mining for clusters 5.Mining for predictive analysis.
Data Science:
Data science is a combination of data analysis, algorithmic development and technology in order to solve analytical problems. The main goal is the use of data to generate business value.
Data Transformation:
Data transformation is the process of converting data from one form to another. It is usually done as a preprocessing step.
Dataset:
A dataset (or data set) is a collection of data. A dataset is organized into some type of data structure. In a database, for example, a dataset might contain a collection of business data (names, salaries, contact information, sales figures, and so forth). Several characteristics define a dataset’s structure and properties. These include the number and types of the attributes or variables, and various statistical measures applicable to them, such as standard deviation and kurtosis.
Decision Tree:
Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input & output variables. In this technique, we split the population (or sample) into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator in the input variables.
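A minimal scikit-learn sketch (the tiny dataset and features below are made up):

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up data: [age, income] -> bought the product (1) or not (0)
X = [[25, 30000], [40, 60000], [35, 45000], [22, 20000], [50, 80000]]
y = [0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[30, 50000]]))  # predicted class for a new observation
```
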
Deep Learning:
Deep Learning is associated with machine learning algorithms (Artificial Neural Networks, ANN) that are loosely inspired by the structure of the human brain and can model arbitrary functions. ANNs require vast amounts of data, and the approach is highly flexible when it comes to modelling multiple outputs simultaneously.
Descriptive Statistics:
Descriptive statistics comprises the values that describe the spread and central tendency of data.
Dependent Variable:
A dependent variable is what you measure and which is affected by independent / input variable(s). It is called dependent because it “depends” on the independent variable.
Degree of Freedom:
It is the number of values in a calculation that are free to vary independently.
Dimensionality Reduction:
Dimensionality Reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables, i.e. converting data with a vast number of dimensions into data with fewer dimensions while ensuring that it conveys similar information concisely.
Dummy Variable:
Dummy Variable is another name for a Boolean (indicator) variable. A dummy variable takes the value 1 or 0: for example, 1 when a condition is true (e.g. age < 25) and 0 when it is false (e.g. age >= 25).

E

ETL:
ETL is the acronym for Extract, Transform and Load. An ETL system has the following properties: 1.It extracts data from the source systems 2.It enforces data quality and consistency standards 3.Delivers data in a presentation-ready format. This data can be used by application developers to build applications and end users for making decisions.
Evaluation Metrics:
The purpose of an evaluation metric is to measure the quality of a statistical / machine learning model. For example, a few common evaluation metrics are: 1.AUC-ROC, 2.F-Score, 3.Log-Loss.

F

Factor Analysis:
Factor analysis is a technique that is used to reduce a large number of variables into a smaller number of factors. Factor analysis aims to find independent latent variables. It makes several assumptions: 1.There is a linear relationship 2.There is no multicollinearity 3.Relevant variables are included in the analysis 4.There is true correlation between variables and factors.
Feature Reduction:
Feature reduction is the process of reducing the number of features in a computation-intensive task without losing much information.
F-Score:
The F-score evaluation metric combines precision and recall into a single measure of classification effectiveness. The relative weighting of recall versus precision is controlled by the β coefficient: F_β = (1 + β^2) * (precision * recall) / (β^2 * precision + recall).
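A minimal sketch of the formula above (the precision and recall values are made up):

```python
def f_score(precision, recall, beta=1.0):
    # Weighted harmonic mean of precision and recall
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_score(0.75, 0.60))  # F1-score of about 0.67
```
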

G

Goodness of Fit:
The goodness of fit of a model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model. With regard to a machine learning algorithm, a good fit is when the error for the model on the training data as well as the test data is minimal. Over time, as the algorithm learns, the error for the model on the training data goes down, and so does the error on the test dataset. If we train for too long, the error on the training dataset may continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the training dataset, while the error on the test set starts to rise again as the model’s ability to generalize decreases. The point just before the error on the test dataset starts to increase, where the model performs well on both the training dataset and the unseen test dataset, is known as the good fit of the model.
Gradient Descent:
Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. In machine learning algorithms, we use gradient descent to minimize the cost function; it finds the best set of parameters for our algorithm.
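A minimal sketch minimizing the toy cost function f(x) = (x - 3)^2; the learning rate and iteration count are arbitrary choices:

```python
# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x = 0.0             # initial guess
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (x - 3)
    x = x - learning_rate * gradient  # step opposite to the gradient

print(x)  # close to 3, the minimum of the cost function
```
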

H

Hadoop:
Hadoop is an open source distributed processing framework used when we have to deal with enormous data. It allows us to use parallel processing capability to handle big data.
Hidden Markov Model:
Hidden Markov Process is a Markov process in which the states are invisible or hidden, and the model developed to estimate these hidden states is known as the Hidden Markov Model (HMM). However, the output (data) dependent on the hidden states is visible. This output data generated by HMM gives some cue about the sequence of states.
Histogram:
Histogram is one of the methods for visualizing the data distribution of continuous variables.
Hypothesis:
A hypothesis is a possible view or assertion of an analyst about the problem he or she is working on. It may or may not be true.

I

Inferential Statistics:
In inferential statistics, we draw conclusions about a population by looking only at a sample of it.
IQR:
IQR (or interquartile range) is a measure of variability based on dividing the rank-ordered data set into four equal parts. It can be derived by Quartile3 – Quartile1.

J

Joint Probability:
The Joint probability of a set of events is the probability that all occur simultaneously.

K

K-Means:
It is a type of unsupervised algorithm which solves the clustering problem. It is a procedure which follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to points in other clusters.
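A small scikit-learn sketch with made-up 2-D points (assuming scikit-learn is installed):

```python
from sklearn.cluster import KMeans

# Made-up 2-D points forming two rough groups
X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # coordinates of the two cluster centers
```
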
kNN:
K nearest neighbors (kNN) is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors, measured by a distance function. The new case is assigned to the class most common among those neighbors.
Kurtosis:
Kurtosis is defined as the thickness (or heaviness) of the tails of a given distribution. Depending on the value of kurtosis, it can be classified into the below 3 categories: 1.Mesokurtic: The distribution with kurtosis value equal to 3. A random variable which follows a normal distribution has a kurtosis value of 3 2.Platykurtic: If the kurtosis is less than 3. In this, the given distribution has thinner tails and a lower peak than a normal distribution 3.Leptokurtic: When the kurtosis value is greater than 3. In this, the given distribution has fatter tails and a higher peak than a normal distribution

L

Line Chart:
Line charts are used to display information as a series of points connected by straight line segments. These charts communicate information visually, such as showing an increase or decrease in a data trend over intervals of time.
Linear regression:
In statistics, linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables.
Log Loss:
Log Loss, or logistic loss, is one of the evaluation metrics used to measure how good a model is; the lower the log loss, the better the model. It is the negative average of the logarithms of the probabilities the model assigns to the true classes.
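A small sketch computing binary log loss by hand (the labels and probabilities are made up):

```python
import math

# Hypothetical true labels and predicted probabilities of class 1
actual = [1, 0, 1, 1]
predicted_prob = [0.9, 0.2, 0.7, 0.6]

# Negative average log-probability assigned to the true class
log_loss = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for y, p in zip(actual, predicted_prob)
) / len(actual)
print(log_loss)  # ~0.30
```
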
Logistic Regression:
In simple words, it predicts the probability of occurrence of an event by fitting data to a logistic function; hence it is also known as logit regression. Since it predicts a probability, the output values lie between 0 and 1 (as expected).

M

Machine Learning:
Machine Learning refers to the techniques involved in dealing with vast data in the most intelligent fashion (by developing algorithms) to derive actionable insights. In these techniques, we expect the algorithms to learn by themselves without being explicitly programmed.
MapReduce:
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Market Basket Analysis:
Market Basket Analysis (MBA) is a widely used technique among marketers to identify the best possible combinations of products or services that are frequently bought together by customers. It is also called product association analysis. Association analysis is mostly done using an algorithm named the “Apriori Algorithm”. The outcome of this analysis is a set of association rules, which marketers use to strategize their recommendations.
Maximum Likelihood Estimation:
It is a method for finding the values of parameters which make the likelihood maximum. The resulting values are called maximum likelihood estimates (MLE).
Mean:
For a dataset, mean is said to be the average value of all the numbers. It can sometimes be used as a representation of the whole data.
Median:
The median of a set of numbers is the middle value when the numbers are sorted. When the total count of numbers in the set is even, the median is the average of the two middle values. The median is used to measure central tendency.
Mode:
Mode is the most frequent value occurring in the population. It is a metric to measure the central tendency, i.e. a way of expressing, in a (usually) single number, important information about a random variable or a population.
Model Selection:
Model selection is the task of selecting a statistical model from a set of known models. Various methods that can be used for choosing the model are: 1.Exploratory Data Analysis 2.Scientific Methods. Some of the criteria for selecting the model can be: 1.Akaike Information Criterion (AIC) 2.Adjusted R2 3.Bayesian Information Criterion (BIC) 4.Likelihood ratio test.
Multivariate Analysis:
Multivariate analysis is a process of comparing and analyzing the dependency of multiple variables over each other.
Multivariate Regression:
Multivariate, as the word suggests, refers to ‘multiple dependent variables’. A regression model designed to deal with multiple dependent variables is called a multivariate regression model.

N

Naive Bayes:
It is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
Natural Language Processing:
In simple words, Natural Language Processing is a field which aims to make computer systems understand human speech. NLP is comprised of techniques to process, structure, categorize raw text and extract information.
Nominal Variable:
Nominal variables are categorical variables having two or more categories without any kind of order to them.
Normal Distribution:
The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the bell curve because of its characteristic bell shape. For a large number of trials, a binomial distribution closely resembles a normal distribution; the key difference is that the normal distribution is continuous.
Normalization:
Normalization is the process of rescaling your data so that they have the same scale. Normalization is used when the attributes in our data have varying scales.
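A minimal sketch of min-max normalization, one common way to rescale a feature to the range [0, 1] (the values are made up):

```python
# Min-max normalization of a made-up feature to the range [0, 1]
values = [10, 20, 25, 40, 50]
lo, hi = min(values), max(values)

normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.25, 0.375, 0.75, 1.0]
```
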

O

One Shot Learning:
It is a machine learning approach where the model is trained using only one (or very few) examples per class. One-shot learning is generally used for object classification, where the aim is to design effective classifiers from a single training example.
Ordinal Variable:
Ordinal variables are those variables which have discrete values with some order involved.
Outlier:
Outlier is an observation that appears far away and diverges from an overall pattern in a sample.
Overfitting:
A model is said to overfit when it performs well on the training dataset but fails on the test set. This happens when the model is too sensitive and captures random patterns which are present only in the training dataset. There are two common ways to overcome overfitting: 1.Reduce the model complexity 2.Regularization

P

Parameters:
Parameters are a set of measurable factors that define a system. For machine learning models, model parameters are internal variables whose values can be determined from the data.
Pattern Recognition:
Pattern recognition is a branch of machine learning that focuses on the recognition of patterns and regularities in data. Classification is an example of pattern recognition wherein each input value is assigned one of a given set of classes.
Pie Chart:
A pie chart is a circular statistical graphic which is divided into slices to illustrate numerical proportion. The arc length of each slice is proportional to the quantity it represents.
Polynomial Regression:
In this technique, a curve fits into the data points. In a polynomial regression equation, the power of the independent variable is greater than 1. Although higher degree polynomials give lower error, they might also result in over-fitting.
Precision and Recall:
Precision measures, of all the cases predicted as positive, how many were actually positive. It can be represented as: Precision = TP / (TP + FP). Recall measures, of all the actual positive cases, how many were predicted correctly. It can be represented as: Recall = TP / (TP + FN).
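A minimal sketch computing both from hypothetical confusion-matrix counts:

```python
# Hypothetical counts: true positives, false positives, false negatives
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)  # of all predicted positives, how many were correct
recall = tp / (tp + fn)     # of all actual positives, how many were found
print(precision, recall)    # 0.8 and about 0.67
```
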
Predictor Variable:
A predictor variable (or independent variable) is used to make a prediction for the dependent variable.
Principal Component Analysis (PCA):
Principal component analysis (PCA) is an approach to factor analysis that considers the total variance in the data, and transforms the original variables into a smaller set of linear combinations. PCA is sensitive to outliers; they should be removed. It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It is often used to visualize genetic distance and relatedness between populations.
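A small scikit-learn sketch (random made-up data; NumPy and scikit-learn are assumed to be installed):

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: three features, the third nearly a copy of the first
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # project onto 2 principal components
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```
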
P-Value:
P-value is the probability of obtaining a result at least as extreme as the observed one, assuming the null hypothesis is true.
Python:
Python is an open source programming language, widely used for various applications, such as general purpose programming, data science and machine learning. Usually preferred by beginners in these fields because of the following major advantages- 1.Easy to learn 2.High-level language 3.Broadly used and supported.

Q

Quartile:
Quartiles divide a series into 4 equal parts. For any series, there are 3 quartiles, denoted by Q1, Q2 and Q3 and known as the first, second and third quartile respectively.

R

R:
R is an open-source programming language and a software environment for statistical computing, machine learning, and data visualization. Features of R: 1.It is platform independent, so it is compatible with multiple operating systems 2.R has a very strong and consistent online community support 3.The graphical capabilities of R are excellent 4.There is an abundance of literature to learn R
Range:
Range is the difference between the highest and the lowest value of the population. It is used to measure the spread of the data.
Recommendation Engine:
Generally people tend to buy products recommended to them by their friends or the people they trust. Nowadays in the digital age, any online shop you visit utilizes some sort of recommendation engine. Recommendation engines basically are data filtering tools that make use of algorithms and data to recommend the most relevant items to a particular user. If we can recommend items to a customer based on their needs and interests, it will create a positive effect on the user experience and they will visit more frequently. There are a few types of recommendation engines: 1.Content based filtering 2.Collaborative filtering 3.User-User collaborative filtering 4.Item-Item collaborative filtering 5.Hybrid recommendation systems
Regression:
It is a supervised learning method where the output variable is a real value, such as “amount” or “weight”. Examples of regression: Linear Regression, Ridge Regression, Lasso Regression
Regularization:
Regularization is a technique used to solve the overfitting problem in statistical models. In machine learning, regularization penalizes the coefficients so that the model generalizes better. There are different regression techniques that use regularization, such as Ridge regression and Lasso regression.
Reinforcement Learning:
It is an example of machine learning where the machine is trained to take specific decisions based on the business requirement with the sole aim of maximizing efficiency (performance). The idea involved in reinforcement learning is: the machine / software agent trains itself on a continual basis based on the environment it is exposed to, and applies its enriched knowledge to solve business problems. This continual learning process ensures less involvement of human expertise, which in turn saves a lot of time!
Residual:
Residual of a value is the difference between the observed value and the predicted value of the quantity of interest. Using the residual values, you can create residual plots which are useful for understanding the model.
Response Variable:
Response variable (or dependent variable) is that variable whose variation depends on other variables.
ROC:
The ROC curve is the plot between sensitivity and (1- specificity). (1- specificity) is also known as false positive rate and sensitivity is also known as True Positive rate.
Root Mean Squared Error (RMSE):
RMSE is a measure of the differences between values predicted by a model or an estimator and the values actually observed. It is the standard deviation of the residuals. Residuals are a measure of how far from the regression line data points are.
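A small NumPy sketch (the observed and predicted values are made up):

```python
import numpy as np

# Hypothetical observed values and model predictions
observed  = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(np.mean((observed - predicted) ** 2))
print(rmse)  # ~0.94
```
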

S

Semi-Supervised Learning:
Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems.
Skewness:
Skewness is a measure of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
Standard Deviation:
Standard deviation signifies how dispersed the data is. It is the square root of the variance of the underlying data. Standard deviation is calculated for a population.
Standardization:
Standardization (or Z-score normalization) is the process where the features are rescaled so that they’ll have the properties of a standard normal distribution with μ=0 and σ=1, where μ is the mean (average) and σ is the standard deviation from the mean.
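A minimal sketch of z-score standardization (the feature values are made up):

```python
import statistics

# Z-score standardization of a made-up feature
values = [10, 20, 25, 40, 50]
mu = statistics.mean(values)
sigma = statistics.pstdev(values)  # population standard deviation

standardized = [(v - mu) / sigma for v in values]
print(standardized)  # roughly zero mean and unit standard deviation
```
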
Standard error:
A standard error is the standard deviation of the sampling distribution of a statistic. It is a statistical term that measures the accuracy with which a sample represents a population. In statistics, the amount by which a sample mean deviates from the actual mean of the population is known as the standard error.
Statistics:
It is the study of the collection, analysis, interpretation, presentation, and organisation of data.
Supervised Learning:
A supervised learning algorithm uses a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of predictors, we generate a function that maps inputs to desired outputs, i.e. y = f(x). The goal is to approximate the mapping function so well that when you have new input data (x) you can predict the output variable (y) for that data.
SVM:
It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have) with the value of each feature being the value of a particular coordinate.

T

Tokenization:
Tokenization is the process of splitting a text string into units called tokens. The tokens may be words or a group of words. It is a crucial step in Natural Language Processing.
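A minimal sketch of naive whitespace tokenization; dedicated NLP libraries handle punctuation and edge cases far more carefully:

```python
text = "Natural Language Processing makes computers understand text."

# Naive tokenization: lower-case, strip the full stop, split on whitespace
tokens = text.lower().replace(".", "").split()
print(tokens)
# ['natural', 'language', 'processing', 'makes', 'computers', 'understand', 'text']
```
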
Transfer Learning:
Transfer learning refers to applying a pre-trained model on a new dataset. A pre-trained model is a model created by someone to solve a problem. This model can be applied to solve a similar problem with similar data.
Type I error:
A Type I error occurs when the null hypothesis is true but is incorrectly rejected (a false positive).
Type II error:
A Type II error occurs when the null hypothesis is false but is incorrectly retained (a false negative).
T-Test:
A t-test is used to compare two populations by testing the difference between their means.

U

Underfitting:
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. It refers to a model that can neither model the training data nor generalize to new data. An underfit model is not a suitable model, as it will have poor performance even on the training data.
Univariate Analysis:
Univariate analysis is the analysis of a single variable at a time, summarizing its distribution, central tendency and spread.
Unsupervised Learning:
In Unsupervised Learning algorithm, we do not have any target or outcome variable to predict/estimate. The goal of unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data or segment into different groups based on their attributes.

V

Variance:
Variance is used to measure the spread of a given set of numbers and is calculated as the average of the squared distances from the mean.

W

Wald statistic:
The Wald test is a parametric statistical test named after the statistician Abraham Wald. Whenever a relationship within or between data items can be expressed as a statistical model with parameters to be estimated from a sample, the Wald test can be used to test the true value of the parameter based on the sample estimate.
White Noise:
In discrete time, white noise is a discrete signal whose samples are regarded as a sequence of serially uncorrelated random variables with zero mean and finite variance; a single realization of white noise is a random shock.

X

X-bar and R chart:
X-bar and R charts are used to monitor the mean and variation of a process based on samples taken from the process at given times (hours, shifts, days, weeks, months, etc.).

Y

Youden Index:
Youden's J statistic (also called Youden's index) is a single statistic that captures the performance of a dichotomous diagnostic test.

Z

Z-test:
A z-test determines how far a data point or sample mean is from the mean of the data set, measured in standard deviations. It can be used to test a claim about a population mean: 1.State the null hypothesis and the alternate hypothesis. 2.State the alpha level; if you don’t have an alpha level, use 5% (0.05). 3.Find the rejection region (given by your alpha level) from the z-table. An area of 0.05 corresponds to a z-score of 1.645.
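A minimal sketch of a one-sample z-test statistic; the claimed mean, standard deviation and sample figures are made up:

```python
import math

# Hypothetical claim: population mean is 100 with known standard deviation 15.
# A sample of 36 observations has mean 105.
mu0, sigma, n, sample_mean = 100, 15, 36, 105

z = (sample_mean - mu0) / (sigma / math.sqrt(n))
print(z)  # 2.0, which exceeds 1.645, so reject the null at the 5% level (one-tailed)
```
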
