Hi there!
I missed Next’s delivery time by nine hours. With the semester picking up pace and several projects and homework assignments due, I keep mixing up deadlines. Suggestions on handling this unfathomable list of to-dos are welcome. Enjoy the newsletter!
Five Stories
posit::conf(2023) is open for registration!
Posit PBC.
Arguably the biggest conference in the R world is now open for registration. Held in Chicago, IL from September 17 to 20, the event will have three keynotes, around 30 workshops, and many, many talks.
The call for presentations and talks is also open. Talks are 20 minutes long and can be in-person or remote. For talks that don’t make the cut, there will also be lightning talks: byte-sized information served in five minutes. Finally, there are opportunity scholarships for those who need them.
Common statistical tests are linear models
Jonas Kristoffer Lindeløv
Most statistical tests can be reformulated as linear models. I “discovered” this sometime around the second year of my undergraduate degree and asked my professor why no one had told me I could fit a linear regression model and arrive at the same conclusion!
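To make the equivalence concrete, here is a minimal sketch in plain Python. The numbers are made up for illustration and the function names are my own, not from Jonas's post. It compares the equal-variance two-sample t-test with a simple regression on a 0/1 group indicator; under those assumptions the two t-statistics should coincide.

```python
import math
from statistics import mean


def pooled_t(a, b):
    """Student's two-sample t-statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((v - ma) ** 2 for v in a)
    ss_b = sum((v - mb) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mb - ma) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def slope_t(x, y):
    """t-statistic of the slope in the least-squares fit y = intercept + slope * x."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope / std_err


# Hypothetical measurements for two groups.
group_a = [4.1, 5.0, 4.7, 5.3, 4.4]
group_b = [5.9, 6.2, 5.1, 6.8, 6.0]

# Encode group membership as a dummy variable: 0 for A, 1 for B.
x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

t_test = pooled_t(group_a, group_b)
t_lm = slope_t(x, y)
print(t_test, t_lm)
```

The slope of the regression is just the difference in group means, and its standard error reduces to the pooled standard error of the t-test, which is why the two agree.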
Which generation controls the senate?
wcd.fyi
Which generation do the US senators belong to? Pretty interesting question, right? Especially since President Joe Biden, at 79, is the oldest president in U.S. history. (Donald Trump is the second oldest.)
A recent YouGov poll found that more than half of Americans support a maximum age limit for elected officials to hold office, and prominent public figures including Elon Musk and former President Jimmy Carter have expressed desires to see limitations put in place.
This article shows an interactive chart of US senators by their age and generation using data from ProPublica Congress and Pew Research Center. At a glance, you can see how the average age of senators has increased over the years.
What literature do we study from the 1990s?
Matt Daniels, pudding.cool
The great pudding.cool is back with one more interesting scroll: the most common literature we study from the 1990s. Looking at college syllabi data, they sought to understand which books have become more popular and which have lost popularity.
The most popular book is (and you didn’t guess it right): The Things They Carried (1990) by Tim O’Brien. The book wasn’t popular when published and didn’t make the New York Times Best Seller list (but was a Pulitzer finalist).
Going beyond this, they add New York Times Best Seller data, Goodreads rankings, and literary prize data to triangulate and understand other popular books.
What should you use ChatGPT for?
Vicki Boykis
I feel like I’ve written about ChatGPT every week for the last two months or so. In this blog post, Vicki looks at what ChatGPT can and cannot do for us (just yet). She is skeptical about using ChatGPT for generating creative work or writing code: she values her own ideas and expression, and finds that ChatGPT often fails to provide correct syntax or logic.
These AI tools work best when you iterate on the results and use them as a source of inspiration or direction, rather than as a search engine or a substitute for creative work. They’ll inspire you, but will not get you across the finish line.
Four Packages
streamlit lets you convert Python data scripts into shareable web apps. I’ve been desperately looking for Shiny alternatives in Python and maybe this is it! GitHub.
vroom is an R package for reading tabular data, known for its speed. Just look at these benchmarks! Vignette.
DataExplorer in R aims to automate most of the data handling and visualization so that users can focus on studying the data and extracting insights. plot_missing() is really cool for finding what data is missing. Vignette.
pyjanitor is a Python implementation of the R package janitor. It simplifies the use of pandas for data handling and can be used alongside it. Vignette.
Three Jargons
Elastic net: A regularization technique in machine learning that combines the penalties of lasso and ridge regression to promote sparsity and prevent overfitting.
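As a rough illustration of how the two penalties combine, here is a small sketch. It uses one common parameterization (a mixing weight between the L1 and L2 terms, with a factor of 1/2 on the squared term); other texts scale the terms differently, and the names `lam` and `l1_ratio` are my own choices for this example.

```python
def elastic_net_penalty(weights, lam=1.0, l1_ratio=0.5):
    """Elastic net penalty for a list of coefficients.

    l1_ratio = 1.0 gives the pure lasso (L1) penalty,
    l1_ratio = 0.0 gives the pure ridge (L2) penalty.
    """
    l1_term = sum(abs(w) for w in weights)   # lasso part: encourages exact zeros
    l2_term = sum(w * w for w in weights)    # ridge part: shrinks large weights
    return lam * (l1_ratio * l1_term + 0.5 * (1.0 - l1_ratio) * l2_term)


coefs = [1.0, -2.0, 0.0]  # hypothetical model coefficients
lasso_only = elastic_net_penalty(coefs, l1_ratio=1.0)
ridge_only = elastic_net_penalty(coefs, l1_ratio=0.0)
mixed = elastic_net_penalty(coefs, l1_ratio=0.5)
print(lasso_only, ridge_only, mixed)
```

This penalty is added to the usual loss (say, squared error) during fitting, so the mixing weight decides how lasso-like or ridge-like the model behaves.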
Empirical Bayes estimation: A method for estimating parameters of a statistical model by using the data to estimate the prior distribution of the parameters, which is then combined with the data to compute the posterior distribution.
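A toy sketch of the idea, under simplifying assumptions: the success counts below are invented, the prior is taken to be a Beta distribution, and it is fitted by matching the mean and variance of the raw rates (a rough method-of-moments shortcut that ignores sampling noise, not the only way to do it). Each raw rate is then pulled toward the prior mean.

```python
from statistics import mean, variance


def fit_beta_prior(rates):
    """Method-of-moments estimate of Beta(a, b) from observed rates."""
    m = mean(rates)
    v = variance(rates)
    common = m * (1 - m) / v - 1  # equals a + b for a Beta distribution
    return m * common, (1 - m) * common


def shrink(successes, trials, a, b):
    """Posterior mean of the rate under a Beta(a, b) prior."""
    return (a + successes) / (a + b + trials)


# Hypothetical (successes, trials) pairs, e.g. conversions per campaign.
observed = [(3, 10), (45, 150), (12, 60), (80, 200), (1, 5), (30, 120)]
rates = [s / n for s, n in observed]

a, b = fit_beta_prior(rates)       # the prior is estimated from the data itself
prior_mean = a / (a + b)
shrunk = [shrink(s, n, a, b) for s, n in observed]

for (s, n), raw, est in zip(observed, rates, shrunk):
    print(f"{s}/{n}: raw={raw:.3f} shrunk={est:.3f}")
```

Entries with few trials move the most toward the prior mean, while entries with many trials stay close to their raw rate.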
Bagging: A technique in machine learning that involves training multiple models on bootstrap samples of the training data and aggregating their predictions to reduce variance and improve performance.
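Here is a minimal sketch of bagging using only the standard library. The base model is a one-split regression stump that I wrote for this example, and the data is synthetic (a noisy step function); real projects would normally reach for an existing library instead.

```python
import random
from statistics import mean


def fit_stump(points):
    """Fit a one-split regression stump to (x, y) pairs by minimizing squared error."""
    xs = sorted({x for x, _ in points})
    overall = mean(y for _, y in points)
    best = (float("inf"), None, overall, overall)
    for lo, hi in zip(xs, xs[1:]):
        split = (lo + hi) / 2
        left = [y for x, y in points if x <= split]
        right = [y for x, y in points if x > split]
        left_mean, right_mean = mean(left), mean(right)
        err = sum((y - left_mean) ** 2 for y in left)
        err += sum((y - right_mean) ** 2 for y in right)
        if err < best[0]:
            best = (err, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    if split is None:  # sample had a single distinct x: predict a constant
        return lambda x: left_mean
    return lambda x: left_mean if x <= split else right_mean


def bagged_predictor(points, n_models=50, seed=0):
    """Train stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = rng.choices(points, k=len(points))  # draw with replacement
        models.append(fit_stump(sample))
    return lambda x: mean(model(x) for model in models)


# Synthetic data: a step from about 0 to about 10 at x = 5, plus noise.
noise = random.Random(42)
data = [(x, (0.0 if x < 5 else 10.0) + noise.gauss(0, 1)) for x in range(10)]

predict = bagged_predictor(data)
low, high = predict(1), predict(8)
print(low, high)
```

Each stump sees a slightly different resample of the data, so averaging them smooths out the quirks of any single fit.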