Happy new year 2024! ⭐
I admit I haven’t kept my promise to write once a month for the last two months. Things have been busy, and while I have collected a HUGE number of resources, I haven’t had the chance to share them yet. Fasten your seat belts as I take you on a tour of internet resources, including guides, datasets, and more.
Let’s dive in! 🎢
How Big is YouTube?
YouTube doesn’t publish any estimate of its size or variety. Ethan Zuckerman and Jason Baumgartner used an innovative statistical method to estimate both, and present their results at https://tubestats.org/
Some interesting insights:
There are approximately 13 billion videos, with 4 billion of them joining the club in 2023.
About 20% of videos get between 15 and 60 views, and fewer than 20% get more than 500 views.
English is the most popular language (of course) at 30%. Hindi is second with 10% of videos.
Their Reddit explorer (https://redditmap.social/) is also worth exploring.
CLI: Improved
Every developer at some point realizes the importance of the command line and starts relying on it more and more. Remy Sharp, a developer from the UK, collected a list of useful CLI tools that are better than the defaults. In fact, most of the “defaults” were news to me as well, and I bet many of them will be to you too. Here are some of my favourites:
You can use Ctrl + R to search your command line history. You can use the up arrow key to go to the previous command as well. fzf can be used for fuzzy search as well.
Knowing disk usage is handy with the du command. (I usually use du -sh [directory] to get summary usage in human-readable format.) Remy also recommends the enhanced alternatives ncdu and nnn.
grep is a powerful tool to search within files. It is super useful to find files that contain a specific keyword (e.g., Jupyter Notebooks that import a library or use a data source). ag and ack are two alternatives that are slightly better and provide a simpler interface and more functions.
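To make the du and grep tips concrete, here is a minimal sketch. It assumes a throwaway directory with two made-up files (plain-text stand-ins for notebooks, not real .ipynb files); the paths, file names, and the pandas keyword are only for illustration.

```shell
# Build a temporary directory so the commands have something to act on.
# All names below are hypothetical.
workdir="$(mktemp -d)"
mkdir -p "$workdir/notebooks"
printf 'import pandas as pd\n' > "$workdir/notebooks/analysis.ipynb"
printf 'import numpy as np\n' > "$workdir/notebooks/scratch.ipynb"

# -s prints one summary line, -h uses human-readable units.
du -sh "$workdir"

# -r searches the directory recursively, -l prints only the names
# of files that contain the keyword.
matches="$(grep -rl 'import pandas' "$workdir/notebooks")"
echo "$matches"

# Clean up the temporary directory.
rm -rf "$workdir"
```

On a real project, you would point both commands at your own folder instead of the temporary one.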
60 Days of Life in Gaza 🔒
What is life like in Gaza? This ten year old describes it objectively: like shit (despite the reporter asking him thrice).
The New York Times created a visual essay about life in the Gaza Strip, where people sleep amid constant aerial bombing. It is heart-wrenching to see 22,600 deaths in three months. (Not to “compare”, but contrast that with 9,700 deaths in Ukraine over the last two years, which illustrates the difference in approaches.) Hope things improve. Om shanti.
mice: Multivariate Imputation by Chained Equations
The mice package in R offers a sophisticated solution for handling missing data. It generates multiple substitute values for datasets with multivariate missing entries. This approach relies on Fully Conditional Specification, where distinct models are used to fill in each variable with missing data.
The MICE (Multiple Imputation by Chained Equations) algorithm is versatile, capable of imputing various data types including continuous, binary, unordered and ordered categorical data. It's also adept at handling continuous data across two levels and ensures consistency across imputations through passive imputation. The package also includes several diagnostic plots for assessing the imputation's quality.
Airbnb’s Data Quality Score
Airbnb has experienced rapid growth in recent years. Its Data Quality Score (DQ Score) initiative addresses the challenge of diminishing data quality due to rapid growth and vast data volumes, emphasizing the importance of data quality over quantity.
The DQ Score introduces a multi-dimensional, automated scoring system for data quality, balancing rigorous standards with scalability and incentivizing both data producers and consumers to maintain high-quality data.
This innovative approach to data quality management is integrated into Airbnb's existing data tools, offering actionable insights and improvements, thereby enhancing trust and usability of the data for both producers and consumers.
Is R on the decline?
R shot to popularity in the last decade [Stackoverflow Blog].
Despite R's strengths in data analysis, particularly with the tidyverse collection of packages, its ranking in programming language popularity has seen a notable decline, dropping from 12th to 19th in the TIOBE index from 2022 to 2023.
Berk Orbay, while appreciative of R's capabilities, acknowledges Python's growing dominance in areas beyond R's traditional stronghold, including deep learning, cloud services, and production environments.
The decision to use R or switch to another language, like Python, depends on specific needs; R remains robust for data analysis and reporting, but Python offers broader applications, especially in developing fields like deep learning.
My own use case has shifted more to Python due to better machine learning support, though my love for R remains as strong. Let’s see how this develops.
I’m Not Normal About Friends
I like analyses where people get obsessive about a specific detail, like Friends here. While I’ve only watched around two seasons of the show, the analysis by Jenny B feels right at home for me. (If you know a similar analysis for The Office, I’m interested.)
Rachel and Ross are the two characters who speak the most. Obvious, considering their long-running dynamic as friends and/or spouses.
TimeGPT-1
In this (in my opinion, crazy) paper, the authors present a pre-trained model for time-series forecasting.
In this paper, we introduce TimeGPT, the first foundation model for time series, capable of generating accurate predictions for diverse datasets not seen during training. We evaluate our pre-trained model against established statistical, machine learning, and deep learning methods, demonstrating that TimeGPT zero-shot inference excels in performance, efficiency, and simplicity.
Our study provides compelling evidence that insights from other domains of artificial intelligence can be effectively applied to time series analysis. We conclude that large-scale time series models offer an exciting opportunity to democratize access to precise predictions and reduce uncertainty by leveraging the capabilities of contemporary advancements in deep learning.
What Is: Important Concepts in Numerical Algebra
This collection comprises "What Is" articles offering concise explanations of key concepts in numerical analysis and related fields. The articles are designed with the following guidelines:
Each post is concise, spanning no more than two or three screens (approximately 500 words).
They use minimal mathematical symbols, equations, and citations.
The content is comprehensible to advanced undergraduates in mathematically-focused disciplines.
A brief list of references is included, focusing on comprehensive, readable sources with additional literature links.
The articles can be accessed in the blog and as PDFs in this repository. The PDFs feature clickable hyperlinks too.
Which Letters are Friends?
Did you know that letters in the English language have their own 'social circles', just like people at a party? Here is a really fun article with some interesting insights:
The letters 'Q' and 'U' are almost inseparable! They have the highest 'friendship score', making them the ultimate letter duo. (Qatar is one of the few words where they don’t work together.)
Surprisingly, some letters prefer their own company. Pairs like 'ZZ' and 'FF' rank high on the friendship scale, showing a preference for pairing up with themselves.
While 'Q' and 'U' might be expected, did you know 'JU', 'IZ', and 'GN' are also close pals? These combinations occur more often than you'd think!
Not all letter relationships are mutual. For instance, 'A' and 'V' have a one-sided relationship, with 'A' seemingly not too fond of 'V'.
In the world of letters, true friendship is rare. 'E and R' and 'I and N' are each other's best friends, showing mutual preferences.
Like in any social gathering, some letters are more popular. 'R', 'N', and 'G' are like the social butterflies of the alphabet, often found mingling with various other letters.
This linguistic exploration reveals the hidden social dynamics of the alphabet, where every letter has its place and preference!
Tipping Culture in America: Public Sees a Changed Landscape
Tipping has become far more frequent these days in the US. About 72% of Americans notice an increase in tipping expectations across various services compared to five years ago, a phenomenon known as “tipflation.”
Despite the increased frequency of tipping, only a third of Americans feel confident about when and how much to tip for different services. While 21% view tipping as more of a choice, 29% see it as an obligation, and 49% believe it depends on the situation, highlighting the ambiguity in tipping norms.
How much to tip? 57% of Americans say they would tip 15% or less for an average meal at a sit-down restaurant, with only a quarter tipping 20% or more.
Telehack
Telehack is an online simulation of a stylized interface for ARPANET and Usenet, created anonymously in 2010. It is a full multi-user simulation, including 26,600+ simulated hosts with files spanning the years 1985 to 1990.
Really fun things?
You can watch the full ASCII Star Wars in the terminal with telnet towel.blinkenlights.nl
You can view an ASCII aquarium by running telnet telehack.com and then entering aquarium
Dante’s Inferno: Learning CLI
This concept presents a creative way to learn bash commands, inspired by the circles of hell from Dante's "Inferno" in the Divine Comedy. It is from Adam Spannbauer, a lecturer at my alma mater, the University of Tennessee.
Each level represents a different set of bash tasks:
Limbo: Focuses on basic navigation, including moving in and out of directories and listing directory contents.
Lust: Teaches how to list hidden files in directories.
Gluttony: Concentrates on displaying only the beginning (head) or end (tail) of file contents.
Greed: Involves creating files, modifying file permissions, and executing scripts.
Anger: Centers on creating directories and moving files within the file system.
Heresy: Introduces the skill of unzipping zip files.
Violence: Emphasizes deleting files, a critical task in file management.
Fraud: Covers copying files and opening text editors for file editing.
Treachery: Concludes with deleting entire directories and their contents.
This unique approach intertwines learning bash commands with a thematic exploration based on Dante's famous work, offering an engaging way to grasp essential command-line skills. Pretty cool!
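For a taste of what each level might look like in practice, here is a rough sketch of one command per circle. It is not taken from Adam's exercises: the file and directory names are invented, everything runs inside a temporary directory, and the unzip step is skipped if zip or unzip is not installed.

```shell
# Work in a throwaway directory; every name below is made up.
sandbox="$(mktemp -d)"
cd "$sandbox"

# Limbo: move in and out of directories and list their contents.
mkdir limbo && cd limbo && cd .. && ls

# Lust: list hidden files as well.
touch .hidden
ls -a

# Gluttony: show only the beginning or the end of a file.
printf 'line %s\n' 1 2 3 4 5 > poem.txt
head -n 2 poem.txt
tail -n 2 poem.txt

# Greed: create a file, change its permissions, and execute it.
printf '#!/bin/sh\necho "paid"\n' > toll.sh
chmod +x toll.sh
toll_output="$(./toll.sh)"
echo "$toll_output"

# Anger: create a directory and move a file into it.
mkdir styx
mv poem.txt styx/

# Heresy: unzip an archive (only if zip and unzip are available).
if command -v zip >/dev/null && command -v unzip >/dev/null; then
  zip -q tomb.zip toll.sh
  unzip -q -o tomb.zip -d tomb
fi

# Violence: delete a file.
rm .hidden

# Fraud: copy a file (a text editor such as nano or vim would open it for editing).
cp styx/poem.txt copy.txt

# Treachery: delete a directory and everything inside it.
rm -r styx
```

The sandbox is left in place so you can poke around afterwards; remove it with rm -r once you are done.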
That’s a wrap!
There are a few more in my collection, but those are for the next cycle. See you all in February!
Best wishes, ✨
Harsh