Practical Statistics for Data Scientists

50 Essential Concepts
by Peter Bruce, 2017, 318 pages

Key Takeaways

1. Exploratory Data Analysis: The Foundation of Data Science

The key idea of EDA: the first and most important step in any project based on data is to look at the data.

Origins and importance. Exploratory Data Analysis (EDA), pioneered by John Tukey in the 1960s, forms the cornerstone of data science. It involves summarizing and visualizing data to gain intuition and understanding before formal modeling or hypothesis testing.

Key components of EDA:

  • Examining data quality and completeness
  • Calculating summary statistics
  • Creating visualizations (histograms, scatterplots, etc.)
  • Identifying patterns, trends, and outliers
  • Formulating initial hypotheses

EDA is an iterative process that helps data scientists understand the context of their data, detect anomalies, and guide further analysis. It bridges the gap between raw data and actionable insights, setting the foundation for more advanced statistical techniques and machine learning models.
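
As an illustrative first pass (a minimal Python sketch; the book's own examples use R, and the file and column names here are hypothetical), the basics of EDA fit in a few lines of pandas:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset; substitute your own file and column names.
df = pd.read_csv("loans.csv")

# Data quality and completeness: dimensions, types, missing values.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Summary statistics for every numeric column.
print(df.describe())

# A first visualization: the distribution of one numeric variable.
df["loan_amount"].hist(bins=30)
plt.xlabel("loan_amount")
plt.show()
```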

2. Understanding Data Types and Structures

The basic data structure in data science is a rectangular matrix in which rows are records and columns are variables (features).

Data types. Understanding different data types is crucial for proper analysis and modeling:

  • Numeric: Continuous (e.g., temperature) or Discrete (e.g., count data)
  • Categorical: Nominal (unordered) or Ordinal (ordered)
  • Binary: Special case of categorical with two categories

Data structures. The most common data structure in data science is the rectangular format:

  • Rows represent individual observations or records
  • Columns represent variables or features
  • This structure is often called a data frame in R or a DataFrame in Python

Understanding data types and structures helps in choosing appropriate visualization techniques, statistical methods, and machine learning algorithms. It also aids in data preprocessing and feature engineering, which are critical steps in the data science workflow.
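
A small sketch of the rectangular structure and the data types above (illustrative values; pandas shown here, though the book uses R):

```python
import pandas as pd

# A tiny illustrative data frame: rows are records, columns are variables.
df = pd.DataFrame({
    "temperature": [21.3, 19.8, 25.1],     # numeric, continuous
    "visits": [3, 0, 7],                   # numeric, discrete
    "city": ["Oslo", "Lima", "Kyoto"],     # categorical, nominal
    "size": ["small", "large", "medium"],  # categorical, ordinal
    "returning": [True, False, True],      # binary
})

# Declaring the ordering makes the ordinal scale explicit to pandas.
df["size"] = pd.Categorical(df["size"],
                            categories=["small", "medium", "large"],
                            ordered=True)
print(df.dtypes)
```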

3. Measures of Central Tendency and Variability

While the mean is easy to compute and expedient to use, it may not always be the best measure for a central value.

Central tendency. Measures of central tendency describe the "typical" value in a dataset:

  • Mean: Average of all values (sensitive to outliers)
  • Median: Middle value when data is sorted (robust to outliers)
  • Mode: Most frequent value (useful for categorical data)

Variability. Measures of variability describe the spread of the data:

  • Range: Difference between maximum and minimum values
  • Variance: Average squared deviation from the mean
  • Standard deviation: Square root of variance (same units as data)
  • Interquartile range (IQR): Difference between 75th and 25th percentiles (robust)

Choosing appropriate measures depends on the data distribution and the presence of outliers. For skewed distributions or data with outliers, median and IQR are often preferred over mean and standard deviation. These measures provide a foundation for understanding data distributions and are essential for further statistical analysis.
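
A quick numeric illustration of that robustness contrast, using a made-up sample with one extreme value:

```python
import numpy as np

# A made-up sample with one extreme value.
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 100])

print("mean:  ", data.mean())        # pulled toward the outlier
print("median:", np.median(data))    # barely affected
print("std:   ", data.std(ddof=1))   # inflated by the outlier
q75, q25 = np.percentile(data, [75, 25])
print("IQR:   ", q75 - q25)          # robust measure of spread
```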

4. Visualizing Data Distributions

Neither the variance, the standard deviation, nor the mean absolute deviation is robust to outliers and extreme values.

Importance of visualization. Visual representations of data distributions provide insights that summary statistics alone may miss. They help identify patterns, outliers, and the overall shape of the data.

Key visualization techniques:

  • Histograms: Show frequency distribution of a single variable
  • Box plots: Display median, quartiles, and potential outliers
  • Density plots: Smooth version of histograms, useful for comparing distributions
  • Q-Q plots: Compare sample distribution to theoretical distribution (e.g., normal)

These visualizations complement numerical summaries and help data scientists make informed decisions about data transformations, outlier treatment, and appropriate statistical methods. They also aid in communicating findings to non-technical stakeholders, making complex data more accessible and interpretable.
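
A minimal sketch of three of these plots side by side, on a synthetic skewed sample (matplotlib and SciPy; the specifics are illustrative, not from the book):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.6, size=500)  # a skewed synthetic sample

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=30)                      # histogram: frequency distribution
axes[0].set_title("Histogram")
axes[1].boxplot(x)                            # box plot: median, quartiles, outliers
axes[1].set_title("Box plot")
stats.probplot(x, dist="norm", plot=axes[2])  # Q-Q plot against the normal
axes[2].set_title("Q-Q plot")
plt.tight_layout()
plt.show()
```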

5. Correlation and Relationships Between Variables

Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y.

Understanding correlation. Correlation measures the strength and direction of the relationship between two variables. It ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship.

Visualization and measurement:

  • Scatter plots: Visualize relationship between two continuous variables
  • Correlation matrix: Display correlations between multiple variables
  • Pearson correlation coefficient: Measure linear relationship
  • Spearman rank correlation: Measure monotonic relationship (robust to outliers)

While correlation doesn't imply causation, it's a valuable tool for exploring relationships in data. It helps identify potential predictors for modeling, detect multicollinearity, and guide feature selection. However, it's important to remember that correlation only captures linear relationships and may miss complex, non-linear patterns in the data.
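
A short sketch contrasting the two coefficients on synthetic data; it also shows how a strong but non-monotonic relationship can escape both:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)  # noisy linear relationship

r, _ = stats.pearsonr(x, y)       # strength of linear association
rho, _ = stats.spearmanr(x, y)    # rank-based, robust to outliers
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")

# A strong but non-monotonic relationship is invisible to both:
y2 = x ** 2
print(f"Pearson on y = x^2: {stats.pearsonr(x, y2)[0]:.2f}")  # near zero
```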

6. The Importance of Random Sampling and Bias Avoidance

Even in the era of Big Data, random sampling remains an important arrow in the data scientist's quiver.

Random sampling. Random sampling is crucial for obtaining representative data and avoiding bias. It ensures that each member of the population has an equal chance of being selected, allowing for valid statistical inferences.

Types of sampling:

  • Simple random sampling: Each unit has an equal probability of selection
  • Stratified sampling: Population divided into subgroups, then sampled randomly
  • Cluster sampling: Groups are randomly selected, then all units within each selected group are studied

Avoiding bias. Bias can lead to incorrect conclusions and poor decision-making. Common types of bias include:

  • Selection bias: Non-random sample selection
  • Survivorship bias: Focusing only on "survivors" of a process
  • Confirmation bias: Seeking information that confirms pre-existing beliefs

Proper sampling techniques and awareness of potential biases are essential for ensuring the validity and reliability of data science projects, even when working with large datasets.
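
A minimal pandas sketch of simple random versus stratified sampling, on a made-up population frame (the region split is hypothetical):

```python
import pandas as pd

# Made-up population frame with a grouping column for stratification.
pop = pd.DataFrame({
    "region": ["north"] * 700 + ["south"] * 300,
    "value": range(1000),
})

# Simple random sample: every record has the same selection probability.
srs = pop.sample(n=100, random_state=0)

# Stratified sample: 10% within each region, keeping subgroups represented.
strat = pop.groupby("region", group_keys=False).sample(frac=0.1, random_state=0)
print(strat["region"].value_counts())  # 70 north, 30 south
```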

7. Sampling Distributions and the Bootstrap Method

The bootstrap does not compensate for a small sample size - it does not create new data, nor does it fill in holes in an existing dataset.

Sampling distributions. A sampling distribution is the distribution of a statistic (e.g., mean, median) calculated from repeated samples of a population. Understanding sampling distributions is crucial for estimating uncertainty and making inferences.

The bootstrap method. Bootstrap is a powerful resampling technique that estimates the sampling distribution by repeatedly sampling (with replacement) from the original dataset:

  1. Draw a sample with replacement from the original data
  2. Calculate the statistic of interest
  3. Repeat steps 1-2 many times
  4. Use the distribution of calculated statistics to estimate standard errors and confidence intervals

Advantages of bootstrap:

  • Does not require assumptions about the underlying distribution
  • Applicable to a wide range of statistics
  • Particularly useful for complex estimators without known sampling distributions

The bootstrap provides a practical way to assess the variability of sample statistics and model parameters, especially when theoretical distributions are unknown or difficult to derive.
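
A compact sketch of the four-step loop above, bootstrapping the median of an arbitrary skewed sample with NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=60)  # an arbitrary skewed sample

# Steps 1-3: resample with replacement, recompute the statistic, repeat.
boot_medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(5000)
])

# Step 4: read off the standard error and a 95% percentile interval.
se = boot_medians.std(ddof=1)
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"median = {np.median(data):.2f}, SE = {se:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```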

8. Confidence Intervals: Quantifying Uncertainty

Presenting an estimate not as a single number but as a range is one way to counteract the tendency to place undue confidence in a single point estimate.

Purpose of confidence intervals. Confidence intervals provide a range of plausible values for a population parameter, given a sample statistic. They quantify the uncertainty associated with point estimates.

Interpretation. A 95% confidence interval means that if we were to repeat the sampling process many times, about 95% of the intervals would contain the true population parameter.

Calculation methods:

  • Traditional method: Uses theoretical distributions (e.g., t-distribution)
  • Bootstrap method: Uses resampling to estimate the sampling distribution

Factors affecting confidence interval width:

  • Sample size: Larger samples lead to narrower intervals
  • Variability in the data: More variability leads to wider intervals
  • Confidence level: Higher confidence levels (e.g., 99% vs. 95%) lead to wider intervals

Confidence intervals provide a more nuanced view of statistical estimates, helping data scientists and decision-makers understand the precision of their findings and make more informed decisions.
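
A minimal sketch of the traditional t-based interval for a mean, on synthetic data (SciPy shown; the bootstrap route would reuse the resampling loop from the previous section):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=50, scale=10, size=40)  # synthetic data

# Traditional t-based 95% confidence interval for the mean.
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
lo, hi = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```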

9. The Normal Distribution and Its Limitations

It is a common misconception that the normal distribution is called that because most data follow a normal distribution, i.e., that it is the "normal" thing.

The normal distribution. Also known as the Gaussian distribution, it's characterized by its symmetric, bell-shaped curve. Key properties:

  • Mean, median, and mode are all equal
  • 68% of data falls within one standard deviation of the mean
  • 95% within two standard deviations, 99.7% within three

Importance in statistics:

  • Central Limit Theorem: Sampling distributions of means tend toward normality as sample size grows
  • Many statistical tests assume normality
  • Basis for z-scores and standardization

Limitations:

  • Many real-world phenomena are not normally distributed
  • Assuming normality can lead to underestimation of extreme events
  • Can be inappropriate for inherently skewed data (e.g., income distributions)

While the normal distribution is theoretically important and useful in many contexts, data scientists must be cautious about assuming normality without checking their data. Techniques like Q-Q plots and normality tests can help assess whether the normal distribution is appropriate for a given dataset.
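
As one illustrative way to run such a check in code (the Shapiro-Wilk test is my choice here, one of several common normality tests):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
samples = {
    "normal": rng.normal(size=200),
    "lognormal": rng.lognormal(size=200),  # clearly non-normal
}

# Shapiro-Wilk test: a small p-value suggests departure from normality.
for name, x in samples.items():
    stat, p = stats.shapiro(x)
    print(f"{name}: W = {stat:.3f}, p = {p:.4f}")
```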

10. Long-Tailed Distributions in Real-World Data

Assuming a normal distribution can lead to underestimation of extreme events ("black swans").

Long-tailed distributions. Many real-world phenomena exhibit long-tailed distributions, where extreme events occur more frequently than predicted by normal distributions. Examples include:

  • Income distributions
  • City populations
  • Word frequencies in natural language
  • Stock market returns

Characteristics:

  • Higher kurtosis than the normal distribution
  • Heavier tails, indicating higher probability of extreme events
  • Often asymmetric (skewed)

Implications for data science:

  • Need for robust statistical methods that don't assume normality
  • Importance of considering extreme events in risk analysis and modeling
  • Potential for using alternative distributions (e.g., log-normal, power law)

Recognizing and properly handling long-tailed distributions is crucial for accurate modeling and decision-making in many domains. It requires a shift from traditional statistical thinking based on normal distributions to more flexible approaches that can capture the complexity of real-world data.
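
A small sketch of the "black swan" point: drawing from a heavy-tailed t distribution and comparing the observed frequency of 4-sigma events with what a normal model predicts (all numbers synthetic):

```python
import numpy as np
from scipy import stats

# Synthetic "returns" from a heavy-tailed t distribution (3 degrees of freedom),
# standardized to unit variance for a fair comparison.
returns = stats.t.rvs(df=3, size=100_000, random_state=5)
returns = returns / returns.std()

# How often do 4-sigma events actually occur vs. what a normal model predicts?
empirical = np.mean(np.abs(returns) > 4)
normal_pred = 2 * stats.norm.sf(4)
print(f"observed: {empirical:.5f}, normal model: {normal_pred:.7f}")
```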

Review Summary

4.02 out of 5
Average of 100+ ratings from Goodreads and Amazon.

Practical Statistics for Data Scientists receives mostly positive reviews, with readers praising its concise yet comprehensive overview of statistical concepts for data science. Many find it useful as a reference or refresher, particularly for those with some background in statistics. The book's practical approach and clear explanations are appreciated, though some readers note it may be too basic for experienced data scientists. The inclusion of R code examples is helpful for some but limiting for others. Overall, it's considered a valuable resource for understanding statistical methods in data science.

About the Author

Peter Bruce is the author of "Practical Statistics for Data Scientists." While the page offers few biographical details, the book's content and approach reflect practical experience in statistics and data science and an understanding of what working data scientists need. His writing is clear and concise, making complex statistical concepts accessible, and the focus on practical applications, including R code examples, bridges the gap between theoretical knowledge and real-world data science work.
