Statistics is important for Analytics, Data Science, and Machine Learning.

Subhashini Sharma Tripathi
5 min read · Jul 5, 2021

A data scientist is a professional who can analyse data sources, clean and process the data, understand why and how that data was generated, draw insights from it, and turn those insights into changes that benefit the organization. These days, everything revolves around data. Two activities recur in almost every data project:

1. Data Cleaning: This deals with gathering the data and structuring it so that it can easily be passed as input to a machine learning algorithm. Along the way, redundant and irrelevant data and noise are eliminated (a minimal sketch follows this list).

2. Data Analysis: This deals with understanding the data in more depth: why it has yielded certain results and what can be done to improve them. It also covers calculating numerical summaries such as the mean, the variance, the distributions, and the probability of a certain prediction.
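
As a rough illustration of both steps, here is a minimal pandas sketch; the file name sales.csv and the column names are hypothetical placeholders.

```python
# Minimal data cleaning and analysis sketch with pandas.
# "sales.csv" and its columns are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("sales.csv")

# 1. Data cleaning: drop exact duplicates and rows with missing values.
df = df.drop_duplicates()
df = df.dropna(subset=["revenue", "units_sold"])

# 2. Data analysis: basic numerical summaries.
print(df["revenue"].mean())      # mean
print(df["revenue"].var())       # variance
print(df["revenue"].describe())  # count, mean, std, quartiles
```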

One of the most comprehensive definitions of Data Science was recently given by Cao as the formula:

data science = (statistics + informatics + computing + communication + sociology + management) | (data + environment + thinking).

In this formula, sociology stands for the social aspects and | (data + environment + thinking) means that all the mentioned sciences act on the basis of data, the environment and the so-called data-to-knowledge-to-wisdom thinking.

Statistical methods are crucial in most fundamental steps of Data Science. Hence, the premise of our contribution is:

Statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty.

Data Science is a sequence of the following steps. Note that steps written in capital letters are those where statistics is less involved.

1. Data Acquisition and Enrichment,

2. DATA STORAGE AND ACCESS,

3. Data Exploration,

4. Data Analysis and Modeling,

5. OPTIMIZATION OF ALGORITHMS,

6. Model Validation and Selection,

7. Representation and Reporting of Results, and

8. BUSINESS DEPLOYMENT OF RESULTS.

Core statistics concepts

· Descriptive statistics: As the name suggests, it uses graphs, plots, or summary numbers to describe every aspect of the data. It organizes the data into a structure and draws attention to the attributes that highlight the most important parts of the data.
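
To make this concrete, here is a small sketch that computes a few numerical summaries and draws a histogram for a synthetic sample (the data are generated, not real).

```python
# Descriptive statistics on a synthetic sample: summary numbers plus a plot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=8, size=500)  # synthetic heights in cm

print("mean:  ", np.mean(heights))
print("median:", np.median(heights))
print("std:   ", np.std(heights, ddof=1))

# A histogram highlights the shape (centre, spread, skew) of the data.
plt.hist(heights, bins=30, edgecolor="black")
plt.xlabel("height (cm)")
plt.ylabel("count")
plt.title("Distribution of heights (synthetic)")
plt.show()
```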

· Inferential statistics: It deals with drawing inferences/conclusions about the population (the entire data set) from a sample obtained from it, based on relationships identified between data points in the sample. It helps in generalizing those relationships to the entire population. It is important that the sample drawn from the population is relevant and represents the population accurately.
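
A minimal sketch of this idea, using scipy on a synthetic sample: a 95% confidence interval for the population mean and a one-sample t-test against a made-up hypothesized mean of 50.

```python
# Inferring a population mean from a sample: a 95% confidence interval
# and a one-sample t-test with scipy. The data are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=52, scale=10, size=40)  # a sample drawn from the population

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")

# Test the hypothesis that the population mean equals 50.
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```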

· Regression: The term ‘regression’, used in both statistics and machine learning, is about how data can be fit to a line, and how every point on that line gives some insight. In machine learning terms, regression is a task learned from data rather than explicitly programmed: a line is fit to a given set of data points and then extrapolated to make predictions.
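
A minimal sketch of fitting and extrapolating a straight line with numpy, on a small made-up data set:

```python
# Fitting a straight line to data points and extrapolating, using numpy.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])  # roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)     # least-squares line fit
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")

# Extrapolate: predict y for an x outside the observed range.
x_new = 10.0
print("prediction at x = 10:", slope * x_new + intercept)
```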

· Maximum likelihood: It is a method for finding the values of the parameters of a specific model. The parameter values are chosen so that the likelihood of the data that was actually observed is as high as possible under the model. In effect, the difference between actual and predicted values shrinks, reducing the error and increasing the accuracy of the predictions.

Note: This concept is commonly used with logistic regression, where the output is 0 or 1 (yes or no); maximum likelihood is used to estimate how likely each data point is to belong to class 0 or class 1.
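
One way to see maximum likelihood in action is to estimate the parameters of a normal distribution by minimizing the negative log-likelihood directly; the sketch below uses scipy on synthetic data. (The same principle underlies fitting a logistic regression.)

```python
# Maximum likelihood by hand: estimate the mean and standard deviation of a
# normal distribution by minimizing the negative log-likelihood.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # true parameters: mu=5, sigma=2

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf  # sigma must be positive
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

result = optimize.minimize(neg_log_likelihood, x0=[1.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(f"MLE estimates: mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
```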

· Bayesian thinking: Bayesian thinking uses probability to model the sampling process and to quantify the uncertainty associated with the data to be collected.

The prior probability expresses the level of uncertainty associated with the data before it is collected and analyzed.

The posterior probability expresses the uncertainty that remains after the data has been collected.

Machine learning algorithms usually focus on producing the best predictions with minimal error, or exact probabilities of specific events occurring. Bayes’ theorem is a way of calculating the probability of a hypothesis (a situation that may not have been observed in reality) from our previous experience and the knowledge we have gained from it. It is a basic concept that anyone working with data should know.
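
A small worked example of Bayes’ theorem, with hypothetical numbers for a diagnostic test, shows how a prior probability is updated into a posterior probability:

```python
# Bayes' theorem on a diagnostic-test example. The numbers are hypothetical,
# chosen only to illustrate prior vs. posterior probability.
p_disease = 0.01            # prior: 1% of people have the disease
p_pos_given_disease = 0.95  # test sensitivity
p_pos_given_healthy = 0.05  # false-positive rate

# Total probability of a positive test result.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # about 0.161
```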

· Naïve Bayes classifier

It is an algorithm that can be used for binary or multi-class classification problems. It is a simple algorithm in which the probability calculation for every hypothesis is simplified by assuming that the input features are independent of one another (hence ‘naïve’).

Once a Naïve Bayes classifier has learnt from the data, it stores a list of probabilities in a data structure: the ‘class probabilities’ and the ‘conditional probabilities’. Training such a model is quick, since only the probability of each class and of each attribute value given a class needs to be determined; no iterative optimization or tuning of coefficients is involved.

Class probability: The probability of each class present in the training dataset. It is calculated as the number of instances belonging to that class divided by the total number of instances.

Class probability P(class = 0) = (number of instances in class 0) / (number of instances in class 0 + number of instances in class 1)

Conditional probability: The conditional probability of each input value given a class value. It is calculated as the frequency of each attribute value among the instances of a given class, divided by the number of instances that carry that class label.

Conditional probability P(condition | result) = (number of instances with that condition and that result) / (number of instances with that result)
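
A tiny hand-rolled sketch of these two kinds of probabilities, on a made-up weather/play data set, shows how they combine to score a new observation:

```python
# A tiny Naïve Bayes model "trained" by counting, using the class and
# conditional probabilities described above. The weather/play data are made up.
from collections import Counter, defaultdict

data = [  # (outlook, play)
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
    ("rain", "yes"), ("rain", "yes"), ("rain", "no"),
    ("overcast", "yes"), ("sunny", "yes"), ("sunny", "yes"),
    ("sunny", "yes"),
]

# Class probabilities: P(play = yes) and P(play = no).
class_counts = Counter(play for _, play in data)
total = len(data)
class_prob = {c: n / total for c, n in class_counts.items()}

# Conditional probabilities: P(outlook | play).
cond_counts = defaultdict(Counter)
for outlook, play in data:
    cond_counts[play][outlook] += 1
cond_prob = {c: {o: n / class_counts[c] for o, n in counts.items()}
             for c, counts in cond_counts.items()}

# Score a new day with outlook = "sunny": pick the class that maximizes
# P(class) * P(outlook | class).
scores = {c: class_prob[c] * cond_prob[c].get("sunny", 0.0) for c in class_prob}
print(scores)                       # unnormalized posterior scores
print(max(scores, key=scores.get))  # predicted class -> "yes"
```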

Do go through this interesting PDF: The Cartoon Guide To Statistics : MrKraken : Free Download, Borrow, and Streaming : Internet Archive

