Who are Twitter Blue users?
Next — Today I Learnt About Data Science | Issue #67
I have a puzzle for you:
How big would a wall clock have to be so that the tip of its minute hand moves a centimeter per second?
Give it a try. Answers at the end.
Who are Twitter Blue Users?
My recent blog post analyzes the trends and characteristics of Twitter Blue subscribers, based on data collected by Travis Brown. It shows that most subscribers are regular users with few followers, and that many of them have unsubscribed or been suspended. Elon Musk is the only one of the top-10 most popular accounts to have Blue.
In fact, Blue gained most of its subscribers within the first two weeks of launch. With the White House, The New York Times and many other important accounts refusing to pay for the badge, it remains to be seen how Elon Musk handles this looming chaos. You can find the R code on GitHub.
R for Applied Epidemiology and Public Health
Edited by Neale Batra
The book is a guide for epidemiologists who want to use R, a programming language and software environment for statistical computing and graphics. It shows how to perform common epidemiological tasks with R code and examples. It also helps epidemiologists learn R skills and apply them to their work.
This blog is about data visualization and data analysis. The author reviews and critiques various charts and graphs from different sources, and offers suggestions for improvement.
Some cool ones:
Lay off bubbles (How bubble charts scale)
Some chart designs bring out more information than others (Why bump charts are better than a two-column bar plot for showing differences)
An interactive map that mostly works, except for the color scale (How not to use colours)
Storytelling in ggplot using rounded rectangles
This blog post is about how to create rounded rectangles in ggplot2. Albert shows two ways to achieve this effect: one using the ggchicklet package, and another using grobs, which are graphical objects that can be manipulated. (TIL about ...) He also demonstrates how to improve a plot by highlighting words and adding text elements. Check it out!
National Geographic Society World Water Map
This story is about the global water crisis and how humans are using more water than the water cycle can provide. It is based on a model developed by Utrecht University that shows where and why water gaps arise, how climate change might worsen them, and how they might be managed.
India has had to pump more groundwater than any other country.
The bulk of it is for irrigation. In the arid northwestern states of Punjab and Haryana, thirsty rice and wheat are now the dominant crops, and wells the main source of irrigation water. The water table is sinking up to three meters a year.
Choosing between mass famine and groundwater depletion, the Indian government chose the latter.
Segment Anything is an open-source project for image segmentation by Meta AI. It includes a promptable model (SAM) and a huge dataset (SA-1B) with 1 billion masks. SAM can segment any object in any image using different types of prompts, such as clicks, boxes and text. Blog post. GitHub.
nanoGPT is billed as the simplest, fastest way to train your own GPT. You can train a small GPT-2 model from scratch on your own computer fairly quickly. It is great for experimenting with GPT models and learning how they are trained. GitHub.
openai is an R package for communicating with OpenAI's API. There are functions for GPT with chat capabilities, DALL-E for generating images and Whisper for converting speech to text. Vignette. GitHub.
pillar is a package for styling columns of data, artfully using colour and Unicode characters to guide the eye. It powers the Tidyverse and RStudio print outputs. Vignette. GitHub.
Endogeneity is a term used in econometrics to describe a situation where an independent variable (or explanatory variable) is correlated with the error term in a regression model. This correlation may arise from omitted variables, measurement error, or simultaneity, leading to biased and inconsistent estimates.
Instrumental Variable (IV)
Instrumental Variable is an econometric technique used to address endogeneity issues in regression models. The method involves using an external variable, known as an instrument, which is correlated with the endogenous explanatory variable but uncorrelated with the error term.
The IV estimator replaces the endogenous variable with its predicted values from the first-stage regression of the endogenous variable on the instrument. This method helps to obtain consistent estimates of the causal effect of the endogenous variable on the dependent variable, assuming that the instrument satisfies the required conditions of relevance and exogeneity.
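To make the two stages concrete, here is a minimal sketch in plain Python on simulated data. Everything in it is an assumption made for illustration: the variable names, the true effect of 2.0, and the data-generating process, in which an unobserved confounder u drives both x and y while the instrument z affects y only through x.

```python
import random

def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

random.seed(0)
n = 20000
true_effect = 2.0  # assumed causal effect of x on y (illustrative)

z = [random.gauss(0, 1) for _ in range(n)]  # instrument
u = [random.gauss(0, 1) for _ in range(n)]  # unobserved confounder
# x is endogenous: it depends on u, which also enters y's error term
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_effect * xi + 3.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

# Naive OLS of y on x: biased because x is correlated with the error
_, b_ols = fit_line(x, y)

# Stage 1: regress the endogenous variable on the instrument
a1, b1 = fit_line(z, x)
x_hat = [a1 + b1 * zi for zi in z]

# Stage 2: regress y on the predicted values from stage 1
_, b_iv = fit_line(x_hat, y)

print(f"OLS estimate: {b_ols:.2f}")
print(f"IV estimate:  {b_iv:.2f} (true effect is {true_effect})")
```

In this setup the OLS slope should land well above 2, while the IV slope should sit close to it. One caveat: running the second stage by hand like this gives a usable point estimate, but its naive standard errors are not valid, so dedicated IV routines are preferable for inference.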
Granger Causality is a statistical hypothesis test used to determine whether one time series can help forecast another time series. It is based on the idea that if a variable X Granger-causes variable Y, then past values of X should contain information that helps predict Y beyond what past values of Y already provide. The test involves estimating two autoregressive models of Y, one using only lagged values of Y and one that also includes lagged values of X, comparing the prediction errors, and using statistical tests such as the F-test or the likelihood ratio test to determine if including the lagged values of X significantly improves the prediction of Y.
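The comparison can be sketched in plain Python as an F-test between a restricted model (lags of Y only) and an unrestricted one (lags of Y and X). The helper names ols_rss and granger_f, the single lag, and the simulated series are all assumptions made for illustration, not a reference implementation.

```python
import random

def ols_rss(rows, targets):
    """Fit targets on rows by least squares; return the residual sum of squares."""
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y]
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution for the coefficients
    beta = [0.0] * k
    for i in reversed(range(k)):
        rest = sum(aug[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (aug[i][k] - rest) / aug[i][i]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))

def granger_f(x, y, lag=1):
    """F statistic for H0: lags of x do not help predict y given lags of y."""
    targets, restricted, unrestricted = [], [], []
    for t in range(lag, len(y)):
        y_lags = [y[t - j] for j in range(1, lag + 1)]
        x_lags = [x[t - j] for j in range(1, lag + 1)]
        targets.append(y[t])
        restricted.append([1.0] + y_lags)
        unrestricted.append([1.0] + y_lags + x_lags)
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    dof = len(targets) - (1 + 2 * lag)
    return ((rss_r - rss_u) / lag) / (rss_u / dof)

# Simulated series in which x drives y with a one-step delay, but not vice versa
random.seed(1)
x, y = [0.0], [0.0]
for _ in range(500):
    x_next = 0.5 * x[-1] + random.gauss(0, 1)
    y_next = 0.3 * y[-1] + 0.8 * x[-1] + random.gauss(0, 1)
    x.append(x_next)
    y.append(y_next)

f_xy = granger_f(x, y)  # does x help predict y? should be large here
f_yx = granger_f(y, x)  # does y help predict x? should be much smaller
print(f"F (x -> y): {f_xy:.1f}")
print(f"F (y -> x): {f_yx:.1f}")
```

To turn the F statistic into a decision, compare it against the F distribution with lag and dof degrees of freedom. A large value in one direction only is evidence of predictive usefulness, not of causation in the everyday sense.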
(Apparently, Twitter has blocked Substack from embedding tweets into newsletters. Please just click the links. 🤷‍♂️)
This is an album that I’m listening to these days. It is blissfully instrumental and helps me focus and unwind. I first came across it in an interview with Daniel Ek, Spotify’s CEO.
A wall clock would have to be approximately 1145.92 cm (or about 11.46 meters) in diameter for the tip of its minute hand to move at 1 centimeter per second. Solution in next issue!
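For anyone who wants to check that number, here is a quick sketch of the arithmetic. It assumes the minute hand reaches the rim of the clock, so the hand's length equals the clock's radius; the variable names are just illustrative.

```python
import math

seconds_per_revolution = 60 * 60  # the minute hand goes around once per hour
tip_speed_cm_per_s = 1.0          # target speed of the tip

# Distance the tip travels in one revolution is the circumference
circumference_cm = tip_speed_cm_per_s * seconds_per_revolution

# Assumption: hand length equals clock radius, so diameter = circumference / pi
diameter_cm = circumference_cm / math.pi

print(f"Diameter: {diameter_cm:.2f} cm ({diameter_cm / 100:.2f} m)")
```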