If You’re a Data Scientist, You Should Read This Book About Statistics

Juan Diego Raimondi is a Solutions Architect here at Making Sense. He’s been with our company for over a decade now, and maintains a long-running blog over at https://blog.alphasmanifesto.com/. It’s a great place to browse and he’s posted dozens of great articles on everything from AI to UX. It’s where this post was originally published and we’re excited to include it here on our own blog as a helpful resource for all our readers. Read his book review!

Practical Statistics for Data Scientists

I recently finished reading this amazing book by Peter Bruce and Andrew Bruce. It is definitely a must-have for beginner and mid-level data scientists, and even more so for beginner statisticians.

What makes this book so good is that it goes straight to the point instead of giving lengthy introductions to each subject. That lets it go deep into the subject itself without making readers feel like they’re wasting their time.

The first big section is all about Exploratory Data Analysis: which metrics can be used, and their strengths and weaknesses. It then covers variability, data distributions, correlations between variables, and how to work with multiple types of variables.
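To make the "strengths and weaknesses" point concrete, here is a minimal Python sketch (not taken from the book, whose examples are in R) comparing three estimates of location on made-up data with a single outlier. The `trimmed_mean` helper and the `incomes` values are hypothetical and exist only for illustration.

```python
import statistics


def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping the lowest and highest trim_fraction of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


# Made-up data: nine typical values and one extreme outlier.
incomes = [32, 35, 38, 40, 41, 43, 45, 47, 50, 900]

print("mean:        ", statistics.fmean(incomes))    # pulled up by the outlier
print("median:      ", statistics.median(incomes))   # robust to the outlier
print("trimmed mean:", trimmed_mean(incomes, 0.1))   # drops one value per end
```

The mean lands far above every typical value, while the median and the trimmed mean stay close to the bulk of the data. That is the kind of trade-off between metrics that this section is about.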

The second section digs a bit deeper into how to appropriately assess Data and Sampling Distributions, and you’ll see more than a few instances of the phrase “it depends”. The book makes it perfectly clear which techniques might work in which situations and what pitfalls they have. It also offers further reading and pointers on how to go even deeper into those strategies, in case you need to assess whether you’re a victim of one of those pitfalls.

The third section is all about Statistical Experiments and Significance Testing. Building on the theoretical knowledge from the first chapters, this one gets more practical, with decision making based on metrics and analysis of the data. This is where you’ll find things like A/B testing and p-values. Remember that this is not a narrative book: it keeps its focus on being a good reference, which means it moves quickly from one concept to the next.
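As a rough illustration of how A/B testing and p-values fit together, here is a minimal Python sketch of a two-sided permutation test on the difference in means. This is one common resampling approach, not necessarily how the book presents it. The function name, the default number of permutations, and the sample data are all assumptions made for this example.

```python
import random
from statistics import fmean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Approximate two-sided p-value for the difference in group means.

    Shuffles the pooled observations repeatedly and counts how often a
    random split produces a difference at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Made-up session times (seconds) for two page variants.
variant_a = [21, 25, 19, 30, 27, 24, 22, 26]
variant_b = [28, 31, 27, 35, 30, 29, 33, 32]
print("p-value:", permutation_p_value(variant_a, variant_b))
```

A small p-value means that a difference this large would rarely appear if the variant labels were assigned at random. Whether that justifies a decision still depends on the context, which is very much in the “it depends” spirit of the book.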

The fourth section tackles Regression and Prediction. At this point it’s all about numeric regression, starting with Linear Regression and wrapping up with Splines.

The next one is about Classification, including Naive Bayes, Logistic Regression and some of their relatives. Notice that the evaluation metrics, such as the ROC curve and AUC, are presented alongside the techniques themselves, so it’s very easy to come back to these sections to review those concepts.
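For readers who want to see what the area under the ROC curve (AUC) actually measures, here is a minimal Python sketch that computes it straight from its probabilistic interpretation: the chance that a randomly chosen positive example scores higher than a randomly chosen negative one. The `roc_auc` function and the sample labels and scores are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels (1 = positive class).
    scores: iterable of classifier scores, higher meaning "more positive".
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("AUC:", roc_auc(labels, scores))  # 3 of the 4 positive/negative pairs are ranked correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, which is why it is a handy single number for comparing classifiers.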

The last two sections are about Supervised Learning (Statistical Machine Learning) and Unsupervised Learning. They briefly cover variable encodings and standardization, model strategies, trees and clustering algorithms.

Note that the book lacks coverage of text input data and data manipulation. There’s also not much about data cleanup techniques, although some quick strategies for this are scattered around the bias detection, imbalance detection and variation sections in the initial chapters.

Throughout the book, the authors provide nice, concise examples in R, working with new datasets that are available to use. This makes it easy to follow along and experiment with variations of your own while learning the concepts the book presents.

All in all, it’s a very good introductory book to the complexities of statistical work. It does its job nicely without losing you in a sea of theory. At the same time, it’s written in a way that lets it also serve as a quick-reference book.