Books on Data Science | Next - Issue #42
Five Stories
Ani Adhikari's book covers topics ranging from the basics of probability to critical ones like the bias-variance tradeoff. It is aimed at people who are new to data science and want to build a strong foundation in probability and mathematical statistics.
On a quick read, my favourite chapters are Random Counts and The German Tank Problem.
Many of us aspire to machine learning jobs in the future.
This book is the result of the collective wisdom of many people who have sat on both sides of the table and who have spent a lot of time thinking about the hiring process. It was written with candidates in mind, but hiring managers who saw the early drafts told me that they found it helpful to learn how other companies are hiring, and to rethink their own process.
If you're short on time, here's a bullet list of 30 open-ended questions that test your understanding and pose practical challenges.
This book collects many of their discussions from the podcast Not So Standard Deviations and distils them into a readable format. The podcast discusses the backstory and day-to-day life of data scientists in academia and industry. There are discussions on the life of data scientists, seemingly "easy" analyses, team communication and more.
You can pay what you want for the book, as low as $0.
In this book, the authors cover topics about R that are largely ignored, such as .Renviron.
We focus on building holistic and project-oriented workflows that address the most common sources of friction in data analysis, outside of doing the statistical analysis itself.
The book is targeted at long-time R users who are largely self-taught and want to improve their programming efficiency.
Big Book of R describes it best:
If R’s behaviour has ever surprised you, then this book is a guide to many more surprises, written in the style of Dante. It’s a concise report on a number of common errors and unexpected behaviours in R. The book makes more sense if you have been programming in R and are already familiar with some of these behaviours (not all, though), as little time is spent explaining why each behaviour occurs. As mentioned, it’s a concise book: only 126 pages.
Four Packages
censored is a parsnip extension package that provides engines for various censored regression and survival analysis models within the tidymodels framework. Vignette.
infer is a tidymodels package for statistical inference. It has four important verbs for this purpose: specify(), hypothesize(), generate() and calculate(). Vignette.
corrr is a package for exploring correlations in R. Unlike base R's cor(), it produces a data frame instead of a matrix. Vignette.
tune is a tidymodels package for hyperparameter tuning of machine learning models. Vignette.
Three Jargons
The Breidbart Index, invented by long-time hacker Seth Breidbart, measures the severity of spam and is used for programming cancelbots. It takes into account the fact that excessive multi-posting is worse than excessive cross-posting. It is computed as follows: for each article in a spam, take the square root of the number of newsgroups to which the article is posted; the Breidbart Index is the sum of these square roots across all the posts in the spam. For example, one article posted to nine newsgroups and again to sixteen would have BI = sqrt(9) + sqrt(16) = 7. It is generally agreed that a spam is cancelable if the Breidbart Index exceeds 20.
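The formula is simple enough to sketch in a few lines of Python (the function name is mine, not part of any cancelbot's API):

```python
from math import sqrt

def breidbart_index(newsgroup_counts):
    """Sum of the square roots of the newsgroup count of each
    article in a spam (multi-posts weigh more than cross-posts)."""
    return sum(sqrt(n) for n in newsgroup_counts)

# One article posted to 9 newsgroups and again to 16:
bi = breidbart_index([9, 16])
print(bi)  # 7.0, i.e. sqrt(9) + sqrt(16)

# The conventional cancel threshold is BI > 20
print(bi > 20)  # False
```

Note how ten separate posts to one newsgroup each (BI = 10) score worse than a single cross-post to ten newsgroups (BI ≈ 3.16), which is exactly the multi-posting penalty described above.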
Brooks's Law states that the expected advantage from splitting development work among N programmers is O(N) (that is, proportional to N), but the complexity and communications cost associated with coordinating and merging their work is O(N^2) (that is, proportional to the square of N). It is frequently summarised as "Adding manpower to a late software project makes it later".
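The O(N^2) term comes from pairwise communication: N people form N(N-1)/2 channels. A small illustration (function name mine):

```python
def comm_channels(n):
    """Number of pairwise communication channels among n programmers:
    n choose 2, which grows quadratically in n."""
    return n * (n - 1) // 2

print(comm_channels(5))   # 10
print(comm_channels(10))  # 45
```

Doubling the team from 5 to 10 more than quadruples the channels (10 to 45), while the work capacity only doubles, which is why the late project gets later.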
Second system effect: When designing the successor to a relatively small, elegant, and successful system, one tends to become grandiose in one's success and create an elephantine feature-laden monstrosity.
Two Tweets


