posit::conf(2023) is open for registration

Next | Issue #61

Mar 01, 2023

Hi there!

I missed Next’s delivery time by nine hours. With the semester picking up pace and several projects and homework assignments due, I keep mixing up deadlines. Suggestions for handling this unfathomable list of to-dos are welcome. Enjoy the newsletter!

Five Stories

posit::conf(2023) is open for registration!

Posit PBC.

Arguably the biggest conference in the R world is now open for registration. Held in Chicago, IL from September 17 to 20, the event will have three keynotes, around 30 workshops, and many, many talks.

The call for presentations and talks is also open. Talks are 20 minutes long and can be in-person or remote. For talks that don’t make the cut, there will also be lightning talks: byte-sized information served in five minutes. Finally, there are opportunity scholarships for those who need them.

Learn more

Common statistical tests are linear models

Jonas Kristoffer Lindeløv

Most common statistical tests can be reformulated as linear models. I “discovered” this sometime around the second year of my undergraduate degree and asked my professor why no one had told me I could fit a linear regression model and arrive at the same conclusion!
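To make the idea concrete, here is a minimal sketch (mine, not taken from the linked article) that uses only the Python standard library. It computes the classic equal-variance two-sample t statistic, then the t statistic of the slope in a regression on a 0/1 group indicator. The data and function names are made up for illustration.

```python
import math
from statistics import mean


def pooled_t_statistic(a, b):
    """Two-sample Student t statistic, assuming equal variances."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((v - ma) ** 2 for v in a)
    ss_b = sum((v - mb) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mb - ma) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def slope_t_statistic(x, y):
    """t statistic of the slope in the simple regression y = b0 + b1 * x."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope / std_err


# Hypothetical measurements for two groups (illustrative values only).
group_a = [5.1, 4.9, 6.2, 5.8, 5.5]
group_b = [6.8, 7.1, 6.4, 7.9, 7.3]

# Encode group membership as a dummy variable: 0 for group A, 1 for group B.
x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

t_test = pooled_t_statistic(group_a, group_b)
t_regression = slope_t_statistic(x, y)

print(f"t-test statistic:     {t_test:.6f}")
print(f"regression slope t:   {t_regression:.6f}")
```

The two printed values agree up to floating-point error, because the regression slope is the difference in group means and its standard error reduces to the pooled standard error of the t-test.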

Read on

Which generation controls the senate?

wcd.fyi

Which generation do the US senators belong to? Pretty interesting question, right? Especially since President Joe Biden, at 79, is the oldest president in U.S. history. (Donald Trump is the second oldest.)

A recent YouGov poll found that more than half of Americans support a maximum age limit for elected officials to hold office, and prominent public figures including Elon Musk and former President Jimmy Carter have expressed desires to see limitations put in place.

This article shows an interactive chart of US senators by their age and generation using data from ProPublica Congress and Pew Research Center. At a glance, you can see how the average age of senators has increased over the years.

Learn more

What literature do we study from the 1990s?

Matt Daniels, pudding.cool

The great pudding.cool is back with one more interesting scroll: the literature from the 1990s that we study most. Using college syllabi data, they sought to understand which books have become more popular and which have lost popularity.

The most popular book is (and you probably didn’t guess it): The Things They Carried (1990) by Tim O’Brien. The book wasn’t popular when published and didn’t make the New York Times Best Seller list (but was a Pulitzer finalist).

Going beyond this, they add New York Times Best Seller data, Goodreads rankings, and literary prize data to triangulate and identify other popular books.

Read on

What should you use ChatGPT for?

Vicki Boykis

I feel like I’ve written about ChatGPT every week for the last two months or so. In this blog post, Vicki looks at what ChatGPT can and cannot do for us (just yet). She is skeptical about using ChatGPT to generate creative work or write code: she values her own ideas and expression, and finds that ChatGPT often fails to produce correct syntax or logic.

These AI tools work best when you iterate on the results and use them as a source of inspiration or direction, rather than as a search engine or a substitute for creative work. They will inspire you, but they will not get you across the finish line.

Read on

Four Packages

streamlit lets you convert Python data scripts into shareable web apps. I’ve been desperately looking for Shiny alternatives in Python and maybe this is it! GitHub.

vroom is an R package for reading delimited (tabular) data, known for its speed. Just look at these benchmarks! Vignette.

DataExplorer in R aims to automate most data handling and visualization so that users can focus on studying the data and extracting insights. plot_missing() is really cool for finding what data is missing. Vignette.

pyjanitor is a Python implementation of the R package janitor. It simplifies data handling in pandas and can be used alongside it. Vignette.

Three Jargons

Elastic net: A regularization technique in machine learning that combines the penalties of lasso and ridge regression to promote sparsity and prevent overfitting.
Empirical Bayes estimation: A method for estimating parameters of a statistical model in which the prior distribution of the parameters is itself estimated from the data and then used to compute the posterior distribution.
Bagging: A technique in machine learning that involves training multiple models on bootstrap samples of the training data and aggregating their predictions to reduce variance and improve performance.
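
For the last term, bagging, here is a rough standard-library sketch of the idea. The base learner (a one-split regression "stump"), the toy step-function data, and all function names are my own assumptions for illustration; in practice you would likely reach for a library implementation.

```python
import random
from statistics import mean


def fit_stump(points):
    """Fit a one-split regression stump to (x, y) pairs by least squares."""
    xs = sorted({x for x, _ in points})
    overall = mean(y for _, y in points)
    best = (float("inf"), None, overall, overall)
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in points if x <= threshold]
        right = [y for x, y in points if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    if threshold is None:  # the sample contained a single distinct x value
        return lambda x: overall
    return lambda x: left_mean if x <= threshold else right_mean


def bagged_predictor(points, n_models=50, seed=0):
    """Train stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(points) items with replacement.
        sample = rng.choices(points, k=len(points))
        models.append(fit_stump(sample))
    return lambda x: mean(model(x) for model in models)


# Toy data (made up): a noisy step from about 0 to about 10 at x = 10.
noise = random.Random(42)
data = [(x, (0.0 if x < 10 else 10.0) + noise.gauss(0, 1)) for x in range(20)]

predict = bagged_predictor(data)
print(predict(2), predict(17))
```

Each stump sees a slightly different resample of the data, so individual fits wobble, while the averaged prediction is steadier. That averaging is where the variance reduction comes from.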

Two Tweets

Karandeep Singh @kdpsinghlab

A Visual Tour of the Meta-Tidyverse For years, I’ve been trying out different non-tidyverse implementations of tidyverse. It’s fun seeing folks mold languages to run analysis code inspired by it. If you like screenshots of code, come along for a visual tour. Let’s start w/ R.

Jasmine Hughes @Jas_Hughes

My two least favorite data rabbit holes to go down are: 1) Why are these two numbers that should be the same different? 2) Why are these two numbers that should be different the same? #DataMishapsNight

One Meme

r/ProgrammerHumor - i program in english — Reddit, u/Hobomojoe.

Next — Today I Learnt About Data Science
