Practical Statistics for Data Scientist Notes Chapter One Exploratory Data Analysis
A major challenge of data science is to harness this torrent of raw data into actionable information.
Numerical and categorical data. Numerical include continuous and discrete data. Categorical includes Binary, ordinal, and other data.
1.1 Rectangular data
It means the database table. It is a two-dimensional matrix with rows indicating records and columns indicating features. Data in a relational database must be extracted and put into a single table for most data analysis and modeling tasks. In pandas, it is also possible to set multilevel indexes to improve the efficiency of certain operations.
Terminology: predictor variables to predict response or dependent variable, features to predict the target.
1.2 Estimate of Location
Mean, trimmed mean, weighted mean, median, weighted median,
Trimmed mean to drop a fixed number of sorted values at each end and then take an average of the remaining values.
Weighted mean: Some values are intrinsically more variable than others, and highly variable observations are given a lower weight. The data collected does not equally represent the different groups that we are interested in measuring. So we can give a higher weight to the values from the groups that were underrepresented.
Median is a better metric for location. The weighted median is a value such that the sum of the weights is equal for the lower and upper halves of the sorted list. The weighted median is robust to outliers just as median.
Outlier:
An outlier is any value that is very distant from the other values in a data set. The definition of the outlier is subjective. When outliers are the result of bad data, the mean will be affected more than the median. They should be identified and further investigated.
Anomaly Detection:
In anomaly detection, the points of interest are the outliers. The greater mass of data serves primarily to define the “normal” against which anomalies are measured. trimmed mean is a compromise between the median and the mean: it is robust to extreme values in the data but uses more data to calculate the estimate for location.
1.3 Variability
Measuring, reducing, distinguishing random from variability, identifying the various sources of real variability, and making decisions in the presence of it.
Mean absolute deviation: the average of the absolute values of the deviations from the mean.
why the degree of freedom is n-1 because if it is n, the true value of the variance and the standard deviation in the population will be underestimated. There is only one constraint: the SD depends on calculating the sample mean.
Neither standard deviation nor the mean absolute deviation is robust to outliers and extreme values. The variance and standard deviation are sensitive to outliers since they are based on the squared deviations.
Only the MEDIAN ABSOLUTE DEVIATION is robust.
SD > Mean AD>MAD
MAD will multiple a constant scaling factor 1.4826, to put MAD on the same scale at the SD in the case of ND(Normal Distribution)
Estimate based on Percentiles
IQR = Q3 − Q1. In other words, the IQR is the first quantile subtracted from the third quantile; these quantiles can be clearly seen on a box plot on the data. It is a trimmed estimator, defined as the 25% trimmed range, and is a commonly used robust measure of scale.
1.4 Exploring the data distribution
The boxplot’s top and bottom of the box are 75th and 25th percentiles. The R function extends the whiskers to the furthest point beyond the box, except it will not go beyond 1.5 times the IQR. Any data outside of the whiskers are points.
The histogram is a way to visualize a frequency table.
A distribution has four steps: location, variability, skewness, and kurtosis.
Skewness refers to whether the data is skewed to larger or smaller values, and kurtosis indicates the propensity of the data to have extreme values.
Density Estimates can be considered as a smoothed histogram.
5 Categorical and binary data
The bar chart and pie chart are the charts for categorical data.
The mode is the value that appears most often in the categorical data.
EV(Expected Value) is of weighted mean in which the weights are probabilities. It is a fundamental concept in business valuation and capital budgeting.
6 Correlation
The correlation coefficient is sensitive to outliers in the data.
Explore two or more variables
Use Means and variance to look at variables one at a time(univariate analysis)
Correlation to look at two variables (bivariate analysis)
Two or more variables (multivariate analysis)
Hexagon Binning, contour plots are useful tools that permit the graphical examination of two numeric variables at a time, without being overwhelmed by huge amounts of data.
Contingency tables are the standard tool for looking at the counts of two categorical variables.
Boxplots and violin plots allow us to plot a numeric variable against a categorical variable.
Practice- Python for EDA