Mojo may be the biggest programming language advance in decades

Next — Today I Learnt About Data Science | Issue #71

May 10, 2023

Hi there!

Did you know that the yo-yo was originally a weapon used in the Philippine jungle? It was made out of wood and had a long cord that could be used to strike enemies or animals from a distance!

Before you get engrossed in the stories, a small announcement: I’m traveling next week, so we will talk again on May 24, 2023. Till then, sayonara!

Five Stories

How to learn R as a SAS user

Posit PBC

When I had to use SAS for my financial analytics class, my first thought was: why are people still using this? proc sql, data, etc. In this guide, Isabella Velasquez and Phil Bowsher explain how longtime SAS users can learn R. The best part is the cheatsheet mapping SAS commands to their R equivalents.

1 dataset. 100 visualizations.

Ferdio

One simple question: can you make 100 visualizations from the same dataset?

The dataset is simple: the number of UNESCO World Heritage Sites in three Nordic countries (Norway, Denmark and Sweden) for two years, 2004 and 2022.

Ferdio, the visualization company, did a curious job of compiling 100 different vizzes from these six data points. Check it out!

We Have No Moat, And Neither Does OpenAI

Some Googler, Dylan Patel and Afzal Ahmad

In a leaked document, a Google engineer shares their thoughts on the recent AI race. Some of my highlights:

We [Google] have no secret sauce. Our best hope is to learn from and collaborate with what others are doing outside Google.
Many of these projects are saving time by training on small, highly curated datasets. [Berkeley’s Koala is based on Meta’s Llama, trained on a curated dataset of the best examples from sharegpt.com]
Individuals are not constrained by licenses to the same degree as corporations… [In fact] these models are used and created by people who are deeply immersed in their particular subgenre, lending a depth of knowledge and empathy we cannot hope to match.

Mojo may be the biggest programming language advance in decades

Fast.AI

Python programmers often use wrappers over faster languages like C++, which leads to challenges in deploying and debugging AI models. Mojo, a new programming language, aims to solve this two-language problem by providing a simple and high-performance language for AI development.

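There have been many proposals to allow native parallel processing in Python. But because of the Global Interpreter Lock (GIL), only one thread can execute Python bytecode at a time, so multithreading does not speed up CPU-bound code. One proposal is to make the GIL optional. The usual workaround today is the standard library's multiprocessing module, which sidesteps the GIL by running separate processes.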

Mojo tackles this by acting like Python++. In some benchmarks, it reportedly runs 35,000x faster than Python.
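
To make the GIL point concrete, here is a minimal sketch in plain Python (not Mojo) that runs the same CPU-bound job in a thread pool and in a process pool. The function names count_primes and run_in_pool and the workload size are my own illustrative choices, not something from the Fast.AI article. On a standard CPython build, I would expect the process pool to finish noticeably faster on a multi-core machine, but the actual timings depend on your hardware.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-bound)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_in_pool(executor_cls, limits, workers=4):
    """Run count_primes over `limits` in the given pool; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(count_primes, limits))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    limits = [100_000] * 4  # hypothetical workload: four identical CPU-bound tasks
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        results, seconds = run_in_pool(cls, limits)
        print(f"{cls.__name__}: {seconds:.2f}s, results={results}")
```

Both pools return the same counts. The only difference is whether the work is spread across threads that share one GIL or across processes that each have their own interpreter.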

charlatan: make fake data in R

Kyle Voytovich

charlatan is an R package to generate fake data. It supports:

person names
jobs
phone numbers
colors: names, hex, rgb
credit cards
DOIs
numbers in range and from distributions
gene sequences
geographic coordinates
emails
URIs, URLs, and their parts
IP addresses
more coming…
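
If you are curious what a fake data generator does under the hood, here is a rough sketch of the idea using only Python's standard library. To be clear, this is not charlatan's API (charlatan is an R package) and the tiny name and job pools are hypothetical; real packages ship far larger, localized lists. It only covers a few of the categories above: names, jobs, emails, hex colors, IP addresses, and geographic coordinates.

```python
import random

# Hypothetical sample pools, kept tiny for illustration.
FIRST_NAMES = ["Ada", "Grace", "Alan", "Linus"]
LAST_NAMES = ["Lovelace", "Hopper", "Turing", "Torvalds"]
JOBS = ["Data Scientist", "Statistician", "Analyst"]


def fake_record(rng):
    """Build one fake record from a random.Random instance."""
    first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        "job": rng.choice(JOBS),
        "email": f"{first}.{last}@example.com".lower(),
        "hex_color": f"#{rng.randrange(0x1000000):06x}",
        "ipv4": ".".join(str(rng.randrange(256)) for _ in range(4)),
        "latitude": round(rng.uniform(-90, 90), 4),
        "longitude": round(rng.uniform(-180, 180), 4),
    }


def fake_records(n, seed=None):
    """Return n fake records; pass a seed for reproducible output."""
    rng = random.Random(seed)
    return [fake_record(rng) for _ in range(n)]


if __name__ == "__main__":
    for row in fake_records(3, seed=42):
        print(row)
```

Passing a seed matters more than it looks: it lets you regenerate exactly the same fake dataset later, which is handy for tests and reproducible examples.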

Four Packages

charlatan is an R package to generate fake data. Vignette. GitHub.

faker is a Python package that generates fake data for you. Vignette.

ivreg provides a comprehensive implementation of instrumental variables regression using two-stage least-squares (2SLS) estimation. Vignette. Another vignette.

RSelenium provides R bindings for the Selenium Webdriver API. Selenium is a project focused on automating web browsers. Vignette.

Three Jargons

Endogeneity: Endogeneity is like having an undercover mole in your regression model. It occurs when an explanatory variable is secretly plotting with the error term, causing all sorts of havoc in your model's results. This sneaky correlation leads to biased and inconsistent estimates, making it tough for researchers to uncover the true relationships between variables.
Heteroskedasticity: Heteroskedasticity is when the error terms in a regression model decide to throw a wild party, and the level of noise varies across different observations. This non-constant variance crashes the efficiency of the ordinary least squares (OLS) estimator, and the standard errors for the coefficients become unreliable, making hypothesis testing a bit of a guessing game.
Multicollinearity: Multicollinearity is like a soap opera where independent variables in a regression model are entangled in an overly dramatic love triangle, and their high correlation makes it hard to tell them apart. The resulting chaos can lead to unstable coefficient estimates, inflated standard errors, and hypothesis tests that are more unpredictable than the next episode's plot twist.
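
Endogeneity is easier to believe once you see it bias a number. Below is a small simulation sketch in plain Python; the setup is entirely my own assumption (a true slope of 2, one unobserved confounder u, one instrument z), and the function names are illustrative. It does not use ivreg (which is an R package). With a single regressor and a single instrument, the instrumental variables estimate reduces to cov(z, y) / cov(z, x), which is what two-stage least squares computes in that special case.

```python
import random


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def simulate(n=20_000, true_slope=2.0, seed=0):
    """Simulate y = true_slope * x + error, where x is correlated with the error."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]  # instrument: moves x, unrelated to u
    u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder (the "mole")
    x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    y = [true_slope * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
    return z, x, y


def ols_slope(x, y):
    """Ordinary least squares slope of y on x (with intercept)."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """IV estimate; equals 2SLS with one regressor and one instrument."""
    return covariance(z, y) / covariance(z, x)


if __name__ == "__main__":
    z, x, y = simulate()
    print(f"true slope: 2.0")
    print(f"OLS slope:  {ols_slope(x, y):.3f}")
    print(f"IV slope:   {iv_slope(z, x, y):.3f}")
```

In this setup the OLS slope is pulled upward because x and the error share u (in expectation the bias is cov(x, u) / var(x) = 1/3), while the IV estimate should land close to the true slope of 2.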