What Is ChatGPT Doing… and Why Does It Work?
Next | Issue #60
Welcome to the latest edition of our data science newsletter! In this issue, I explore some exciting topics in data science, including how ChatGPT works, the importance of commenting code, what the future holds for rtweet, and more.
I hope you enjoy this issue of Next, and as always, I welcome your feedback and suggestions for future topics. Happy reading!
Five Stories
1. What Is ChatGPT Doing… and Why Does It Work?
Stephen Wolfram
How does ChatGPT work? While the exact methodology isn’t public, Stephen Wolfram wrote an article explaining the principles. ChatGPT generates text by repeatedly predicting the next word, using probabilities learned from billions of web pages and books. A parameter called temperature controls how creative and diverse the generated text is.
He illustrates the ideas using the Wolfram Language with GPT-2, a smaller model that can run on an ordinary computer. The article is long but informative. Here’s a video version on YouTube.
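The temperature idea can be sketched in a few lines. This is an illustrative toy in Python, not ChatGPT's actual implementation: real models sample over huge token vocabularies, and the tiny word list here is made up.

```python
import math
import random

def sample_with_temperature(word_probs, temperature=1.0, seed=0):
    """Sample the next word after re-weighting probabilities by a temperature.

    Illustrative sketch only: real models operate on tokens over huge
    vocabularies; this tiny vocabulary is invented for the example.
    """
    rng = random.Random(seed)
    # Divide log-probabilities by the temperature and re-normalise:
    # a low temperature sharpens the distribution (more predictable text),
    # a high temperature flattens it (more varied, "creative" text).
    scaled = {w: math.exp(math.log(p) / temperature)
              for w, p in word_probs.items()}
    total = sum(scaled.values())
    words = list(scaled)
    weights = [scaled[w] / total for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

probs = {"cat": 0.7, "dog": 0.2, "banana": 0.1}
# At a very low temperature, the most likely word dominates the sample.
print(sample_with_temperature(probs, temperature=0.1))
```

At temperature 0.1 the weight on "cat" is effectively 1, so sampling is near-deterministic; at high temperatures the three words are chosen almost uniformly.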
2. Why comment your code as little (and as well) as possible
Maëlle Salmon
The purpose of commenting is to give the reader the knowledge the writer had. Comments should explain the why, not the what or how, of the code. They should be used sparingly and strategically, not as a band-aid for bad code design.
Naming things well, using helper functions or explaining variables, wrapping external functions with a nicer interface, and having someone review your code are tips for improving code readability without over-commenting.
roxygen2 comments, by contrast, become the R package's documentation, so they should include all relevant details.
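The "why, not what" principle can be shown in a short snippet (in Python rather than R, with made-up names; the voting-age rule is purely illustrative):

```python
VOTING_AGE = 18  # assumed threshold for this toy example; real rules vary

def can_vote(age):
    # A "what" comment would merely restate the code: "check age >= 18".
    # A "why" comment records intent the code cannot show, and an
    # explaining variable like the one below often removes the need
    # for a comment altogether.
    is_old_enough = age >= VOTING_AGE
    return is_old_enough

print(can_vote(20))  # prints True
```

The explaining variable `is_old_enough` is one of the readability tips from the post: the name documents the rule, so no comment is required at the call site.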
3. rtweet future
Lluís Revilla Sancho
rtweet, the R package for accessing Twitter, is looking for a co-maintainer! In this blog post, Lluís recalls how he became the maintainer of a package downloaded over 12,000 times every month. Responsibilities include:
Supporting new endpoints, adopting httr2, and testing the API in CI,
Reviewing changes to avoid introducing new bugs,
Helping with issues and questions around the transition to API v2,
Handling consulting jobs related to rtweet.
4. Causal Inference from Panel Data using Dynamic Multivariate Panel Models
Jouni Helske
Jouni introduces dynamite, an R package for causal inference from panel data using Dynamic Multivariate Panel Models (DMPM). DMPMs can jointly model multiple response variables with different distributions and coefficients that vary over time. They can also estimate long-term causal effects using posterior predictive simulations.
The blog post applies dynamite to a synthetic-control problem with time series data and compares it with the CausalImpact package. It also provides references and links to the preprint paper, the vignette, and the GitHub repository of dynamite.
5. A Critical Field Guide For Working With Machine Learning Datasets
Sarah Ciston
Working with machine learning datasets can be challenging. They are often too big to manually inspect for errors, biases, or harmful content. These issues can affect the datasets' technical, legal, and ethical aspects. However, datasets can also be useful if we handle them carefully and critically.
This extensive guide provides tips, questions, methods, and resources for working with existing machine learning datasets throughout their lifecycle.
Four Packages
thesisdown is a bookdown-based package for writing a thesis using R Markdown. Github.
naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data, with minimal deviations from ggplot2 and tidy-data workflows. Github.
dynamite provides an easy-to-use interface for Bayesian inference on complex panel (time series) data with multiple measurements per individual over time. Github.
mikropml is an interface for building machine learning models for classification and regression problems. Vignette. Github.
Three Jargons
Adversarial Examples: Inputs to a machine learning model intentionally designed to cause errors or misclassifications.
Transfer Learning: A technique for reusing a pre-trained model on a new, related task to improve performance, reduce the need for labelled data, and accelerate training time.
Neural Architecture Search (NAS): A technique for automatically searching for the optimal neural network architecture for a given task.
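The adversarial-example idea above can be sketched with a toy linear classifier, where the worst-case bounded perturbation has a closed form (an FGSM-style construction; all names and numbers here are invented for illustration):

```python
def linear_score(x, w, b):
    # Score of a toy linear classifier: positive means one class,
    # negative means the other.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def adversarial_perturb(x, w, epsilon):
    # For a linear score w.x + b, the gradient with respect to x is just w,
    # so the most damaging perturbation of size epsilon (per coordinate)
    # moves each input against the sign of its weight.
    return [xi - epsilon * (1.0 if wi > 0 else -1.0) for xi, wi in zip(x, w)]

w, b = [2.0, -1.0], 0.0
x = [1.0, 1.0]
x_adv = adversarial_perturb(x, w, epsilon=0.6)
# A small, targeted nudge flips the sign of the score, i.e. the prediction.
print(linear_score(x, w, b), linear_score(x_adv, w, b))
```

Deep networks are not linear, but the same gradient-following trick produces the misclassifications the definition describes.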
Two Tweets
One Meme
That’s a wrap!
Hope you have a fun week. See you later.
Harsh