Meta introduces speech technology tools in 1100+ languages
Next — Today I Learnt About Data Science | Issue #73
In today’s letter, I will cover some new developments in speech technology, a visual tool to trace your identity leaks, how to build a search-bot with GPT, and more. Keep reading for a bonus story on lions 🦁 from India’s Gir National Park. (Fun fact: India is the only country that is home to both lions and tigers.)
Five Stories
Introducing speech-to-text, text-to-speech, and more for 1,100+ languages
Text-to-speech and speech-to-text tools help many of us communicate more effectively. Some need them, some want them. Until now, the technology covered only around 100 of the 7,000+ known languages. Many of the rest are at risk of disappearing in the digital age, largely because so little data exists for them.
Meta’s newest open-source model and datasets aim to address this:
In the Massively Multilingual Speech (MMS) project, we overcome some of these challenges by combining wav2vec 2.0, our pioneering work in self-supervised learning, and a new dataset that provides labeled data for over 1,100 languages and unlabeled data for nearly 4,000 languages. Some of these, such as the Tatuyo language, have only a few hundred speakers, and for most of these languages, no prior speech technology exists.
It thrills me to know Maithili will be in the mix, a language so sonorous that it’s rumoured the Mughal Emperor Akbar started learning it after hearing a girl yelp in discomfort from stepping on a pebble.
Up Your Blogging Game: Three Enhancements to Hugo Apero
For my personal website, I have used Hugo Apero and Blogdown for over two years now. In this blog post, I cover three suggestions on how to improve your Apero site.
Change your theme, including some unpublished themes
Add a search bar with Google’s Custom Search Engine (CSE)
Change fonts including any Google Font
Bonus: How to create shortlinks on your domain within Blogdown.
See your identity pieced together from stolen data
You may be familiar with "Have I Been Pwned," a website designed to track where your personal information appears in various data leaks. That service is illuminating in its own right, but ABC News offers a more sophisticated visualization.
Enter your email address (rest assured, ABC News does not retain it) to embark on a captivating journey through past data breaches. Discover which application or website released specific pieces of your information and observe how these fragments can be assembled to create your comprehensive digital identity. It is scary.
How AI is helping astronomers study the universe
AI has been applied in many fields, from identifying protein and genome sequences to countering human trafficking. This article presents numerous examples from the field of astrophysics. Some interesting ones:
The first image of a black hole became twice as sharp, thanks to generative AI
Identifying galaxies, exoplanets and ancestral stars
Improving the search for aliens while reducing false positives (remember the Wow! signal?)
Quite an interesting read!
Question answering using embeddings-based search
GPT language models struggle with new, unfamiliar topics that fall outside their training data. To overcome this limitation, a Search-Ask method can ground GPT's responses in a library of reference texts. The two-step process starts with a search through your text library for relevant sections, which are then incorporated into a message to GPT, along with the question. It's a way to supplement the model's knowledge without relying on fine-tuning, which can be unreliable for factual recall.
The Search-Ask method works better for factual recall because it takes a ‘short-term memory’ approach: it equips GPT with ‘open notes’ for the ‘exam’, that is, the query at hand. Fine-tuning, by contrast, is like reading the textbook a week before the ‘exam’. The search step may use lexical, graph, or embedding-based methods, with embedding-based search being a reliable starting point.
See the example notebook by OpenAI on how to do this.
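The two steps can be sketched in a few lines of Python. This is a toy illustration, not the code from OpenAI's notebook: the `embed` function is a bag-of-words stand-in for a real embedding model, the three-sentence `library` is made up for the demo, and the final prompt is only printed rather than sent to any model.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(question, library, top_n=1):
    """Step 1 (Search): rank library sections by similarity to the question."""
    q = embed(question)
    ranked = sorted(library, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:top_n]

def build_prompt(question, library, top_n=1):
    """Step 2 (Ask): put the retrieved sections and the question in one message."""
    notes = "\n".join(f"- {s}" for s in search(question, library, top_n))
    return (
        "Use the notes below to answer the question.\n\n"
        f"Notes:\n{notes}\n\nQuestion: {question}"
    )

# Hypothetical reference library, made up for this demo.
library = [
    "The MMS project covers speech-to-text for over 1,100 languages.",
    "tidycensus returns tidyverse-ready data frames from US Census APIs.",
    "The Asiatic lion survives in Gir National Park in Gujarat.",
]
question = "Where does the Asiatic lion survive?"
prompt = build_prompt(question, library)
print(prompt)
```

In a real setup, you would swap `embed` for calls to an embedding model and send `prompt` to GPT; the overall shape of search-then-ask stays the same.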
Four Packages
tidycensus is an R package that allows users to interface with a select number of the US Census Bureau’s data APIs and return tidyverse-ready data frames, optionally with simple feature geometry included.
Shiny for Python makes it easy to build web apps in Python. Read this blog to find out how it differs from Streamlit and Dash.
tabulate is a Python package for pretty-printing tables. It accepts dictionaries, lists of lists, and pandas DataFrames, and prints Markdown, LaTeX, and other table formats. GitHub. PyPI.
retrying is a Python package for working with flaky, error-prone functions: it retries them for you. GitHub. Blog.
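To make the retry idea concrete, here is a minimal standard-library sketch. It is not the retrying package's API: the `retry` helper, its `max_attempts` and `wait_seconds` parameters, and the `flaky` function are all hypothetical names made up for this illustration.

```python
import time
from functools import wraps

def retry(max_attempts=3, wait_seconds=0.0):
    """Hand-rolled retry decorator (illustrative; not the retrying package)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Out of attempts: let the last error propagate.
                    if attempt == max_attempts:
                        raise
                    time.sleep(wait_seconds)
        return wrapper
    return decorator

calls = {"count": 0}

@retry(max_attempts=5)
def flaky():
    """Fails on the first two calls, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = flaky()
print(result, "after", calls["count"], "calls")
```

A dedicated package is still the better choice in practice, since it typically handles details such as backoff strategies and which exceptions to retry on.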
Three Jargon
Duck Typing: This is a programming concept that Python supports, named after the phrase "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck." In Python, it means that you don't care about the type of an object; you care about what it can do. If an object can perform the required operations, then it is considered suitable, regardless of its actual class or type.
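A small sketch of duck typing in action. The `Duck` and `Robot` classes and the `make_it_quack` function are made-up names for illustration; the point is that the function never checks the type of its argument.

```python
class Duck:
    def quack(self):
        return "Quack!"

class Robot:
    # Unrelated to Duck, but it offers the same method.
    def quack(self):
        return "Beep-quack."

def make_it_quack(thing):
    # No isinstance check: anything with a quack() method is accepted.
    return thing.quack()

sounds = [make_it_quack(t) for t in (Duck(), Robot())]
print(sounds)
```

Passing an object without a `quack` method, such as an integer, raises an `AttributeError` at the point of the call.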
Decorator: In Python, a decorator is a function that takes another function and returns a modified version of it. You can think of decorators as wrappers that add or change the functionality of the function they wrap, without editing the original function's code. A decorator is applied by writing "@" followed by its name on the line above the function definition.
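Here is a minimal decorator sketch. The names `shout` and `greet` are hypothetical, chosen only for this example; `functools.wraps` is the standard-library helper that keeps the wrapped function's name and docstring intact.

```python
from functools import wraps

def shout(func):
    """Decorator that upper-cases whatever string the wrapped function returns."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    return f"hello, {name}"

print(greet("Gir"))
```

Writing `@shout` above `greet` is shorthand for `greet = shout(greet)`. The original function is still reachable as `greet.__wrapped__`, which `functools.wraps` sets for you.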
Lambda Functions: These are small, anonymous functions that you can create with the lambda keyword. They're handy when you need a quick, small function for something like sorting or filtering data. They are called like normal functions, but their body is limited to a single expression, so they cannot contain statements such as assignments or loops. For example,
lambda x: x**2
is a lambda function that squares a number.
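A short sketch of the sorting and filtering uses mentioned above, with made-up sample data:

```python
# A lambda can be bound to a name, although def is usually clearer for that.
square = lambda x: x**2

# Sorting: the lambda supplies the sort key (here, word length).
words = ["lion", "tiger", "ox"]
by_length = sorted(words, key=lambda w: len(w))

# Filtering: the lambda decides which items to keep (here, even numbers).
evens = list(filter(lambda n: n % 2 == 0, range(10)))

print(square(4), by_length, evens)
```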
Two Tweets
https://twitter.com/harshbutjust/status/1662960075110244352?s=20
https://twitter.com/allison_horst/status/1662103406927056896?s=20
One Meme
Bonus
From Bengal in India to Mesopotamia (Middle East), the Asiatic lion once reigned supreme. Today, its world has contracted to Gujarat's Gir National Park in India. The lions would have been hunted to extinction if not for the efforts of the Nawabs of Junagadh.
Check out this excellent piece, which dives into royal hunting practices and the Nawabs’ love for animals. (The Nawab’s love for his dogs is legendary. During the partition of India and Pakistan, the Nawab famously ran away to Pakistan with his hundred dogs, leaving behind his wives.)