April 2024: Maps, LLMs and People

Map of languages South Asians speak; Ryan Gosling explains LLMs

Apr 24, 2024

Hi there!

Today, let’s take a look at some cartography-related resources, some language model things, and some interesting insights about us, Homo sapiens.

Let’s dive in. As usual, a bigger collection is available on Raindrop.

Maps

India in Pixels

India in Pixels posts some amazing maps of India on Twitter. Their sonification of the volume of cashless transactions in India is awesome. Their breakdown of which party received the most funding through electoral bonds, by month, is also good. (Here is a brief primer on electoral bonds that I wrote.) The following shows Indian states’ populations compared to different countries of the world.

They also have a (mostly) free tool to make mapping visualizations that requires no coding or GIS knowledge.

Newspaper Map

What are the local newspapers in the rest of the world? Or even where you live? This map lets you zoom in to any spot in the world and take a look at its local reportage.

https://newspapermap.com/

Languages in South Asia

South Asia is home to several hundred languages, including Hindustani (parent of Hindi and Urdu), the third most widely spoken language in the world after English and Chinese. However, this doesn’t begin to unravel the diversity in what people speak.

Aryaman Arora, a first-year PhD student at Stanford, created this interactive map of languages spoken in India, Pakistan and Nepal at the block (most granular) level. It makes it obvious why most of South Asia is so multilingual.

Rape Statistics by Country

Sexual violence, a widespread and critical social problem, demands continuous global attention and action. Rape, a grave violation of human rights, stands out as a significant global occurrence of sexual violence.

But where does it get reported more than elsewhere? DataPandas has a report and a visualization.

Rape statistics across the globe show significant disparities in reported rates, with Botswana, Lesotho, and South Africa reporting the most incidents per 100,000 people. Sweden ranks 5th, the United States 13th, and India 95th among 118 countries. However, do note that factors such as societal norms of reporting (not all rapes get reported), definitions of rape, and reporting structures heavily influence these statistics.

AI and Large Language Models

Where do images for training AI models come from?

Knowing Machines investigates LAION-5B, an open-source dataset of 5.8 billion images with their textual descriptions, collected from the internet. This data comes from Common Crawl, and is a subset of all available images, filtered for “good quality” image captions written as alt text.

However, the key problem is that, because the entire process is automated, no one truly knows what’s in the data. In December 2023, Stanford researchers found that the dataset included child sexual abuse imagery. The scroll-log investigates how the dataset itself can be biased in various ways, because its automated creation relies on many algorithms that could themselves be biased. Effectively, bias compounds while EVERYONE ignores the fine print of precautions before real-world applications.

Intelligence is…

ChatGPT and other language models work by predicting the next token. Cool, that much we know. But how different would the predictions be each time we ask?

Santiago Ortiz provided the two words “Intelligence is…” and asked the AI to finish the sentence. The results give us a close look at how LLMs think.

This is an interactive piece of art. Explore and enjoy!
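If you want a feel for why the completions differ from one run to the next, here is a minimal sketch of next-token sampling. Everything in it is an assumption made for illustration: the candidate tokens, their scores in `TOY_LOGITS`, and the helper names `softmax` and `sample_next_token` are made up and do not come from ChatGPT or from Santiago Ortiz’s piece. The idea it shows is that a model assigns a probability to each possible next token and then picks one at random in proportion to those probabilities.

```python
import math
import random

# Hypothetical next-token scores (logits) after the prompt "Intelligence is".
# These numbers are invented for illustration, not taken from any real model.
TOY_LOGITS = {"the": 2.0, "a": 1.5, "not": 1.0, "about": 0.5}


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Asking the same question five times can give different continuations.
    for _ in range(5):
        print("Intelligence is", sample_next_token(TOY_LOGITS))
```

With a low temperature the top-scoring token wins almost every time; with a higher one the less likely tokens show up more often, which is roughly the spread of answers the visualization lets you explore.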

How do LLMs like ChatGPT work? By Ryan Gosling (sort of…)

Here is a deepfake of Ryan Gosling, with a deepfake voice created using ElevenLabs, explaining to us how large language models work. I’m impressed by the easy-to-follow explanation and by how well the lip-syncing works!

If this is possible today, I wonder what will be possible tomorrow, when deepfake video and audio become mainstream.

People and Demographics

When Your Vision and Hearing Decline with Age

More than half of the US population has some trouble reading fine print starting around the age of 40. They soon require glasses or contact lenses. On the other hand, hearing aid use does not pick up until around the age of 65, when about 5% start using them.

Learn more in this Flowing Data report.

Notable People in History

This website gives us a glimpse of notable people in all of history. The data comes from the cross-verified dataset of notable people between 3500 BC and 2018 AD published in Nature. Again, worth exploring!

N.B. The paper has fine print saying that the entire dataset shouldn’t be used, and that only a subset should be used after proper verification. But just as Knowing Machines found, we humans routinely ignore the fine print.

This is a Teenager

In this data scroll story (also available as a YouTube video), The Pudding presents data on thousands of kids from the National Longitudinal Survey of Youth. The researchers have followed over a thousand teenagers to understand how they face life.

The story scrolls through their adverse experiences, who went to college, what they are doing right now, their annual income, and a lot more. I preferred watching the video to scrolling indefinitely.

Sleep Hours and Feeling Rested

Most people require at least seven hours of sleep. (I aim for more than eight, just to be on the safe side.) This visualization, again from Flowing Data, explores the relationship between hours of sleep, the time we went to sleep, and how well-rested we felt.

Although some people can sleep fewer than the recommended hours and still feel rested, probabilistically, you’re not one of them.
