

Discover more from Next — Today I Learnt About Data Science
Five Stories
Curious About ChatGPT? Get Your Answers Here
Some of us researchers at the University of Tennessee wrote an essay explaining the basics of GPT models. It is for people who are like, “what the heck is this thing and how is it so good?” Based on the article, Prof Charles Liu and I gave a short interview on the advent of AI.
One interesting tidbit:
In your paper, you note that ChatGPT can perform human-like undertakings, such as solving quadratic equations and giving definitions of morality plays. It appears this technology covers an incredible array of human knowledge. What should college faculty and students take away from this?
Harshvardhan: This technology is capable of many feats, including writing computer code and essays. However, it “hallucinates” various facts and gives wrong information while showing complete confidence in its accuracy. In such a situation, the user has an added responsibility to verify the information produced by GPT. ChatGPT also has a hard limit on its information: it has only been trained on data till September 2021. Given these limitations, the primary concern would be about the accuracy of its information.
Liu: ChatGPT seems good at generating ideas — it’s good at brainstorming because there are no obviously wrong ideas in brainstorming, right? However, since ChatGPT is ultimately an AI system trained on historical data, the information it generates can hinder creative progress. If users just rely on it for brainstorming, it would be as if AI is dictating the ideas, and we humans simply implement them.
The American Opportunity Index
The American Opportunity Index: A Corporate Scorecard of Worker Advancement is a new effort to give companies and other stakeholders a set of robust tools that measure how well major employers are doing in fostering economic mobility for workers and how they could do better.
Using data on the career histories and compensation of 3 million employees at the 250 largest U.S. public companies, along with job experiences and descriptions, they devised a ranking of which employers create better access, pay and mobility conditions. The visualisation is worth exploring!
Comparing Two Large Dataframes in Pandas
A few weeks ago, I had to compare two large data frames, with over five million rows and a hundred columns. With data frames that large, an element-by-element comparison wasn’t the best option. I asked ChatGPT for help.
The solution it suggested was genuinely new to me; I’m 70% sure I couldn’t have thought of it on my own. It suggested hashing the data frames with pd.util.hash_pandas_object, which returns one 64-bit hash per row (a SHA-256 digest of those row hashes can then stand in for the whole frame), and comparing the hashes instead of the data frames. All of this was new to me, and I learnt about hash collisions and much more about pandas.
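The hash comparison described above can be sketched roughly as follows. This is a minimal illustration, not the exact code ChatGPT produced: the helper name frame_digest and the toy data are assumptions, while pd.util.hash_pandas_object and hashlib.sha256 are the real pandas and standard-library calls.

```python
import hashlib

import pandas as pd


def frame_digest(df: pd.DataFrame) -> str:
    """Return a SHA-256 hex digest summarising every row of df (hypothetical helper)."""
    # One unsigned 64-bit hash per row; index=True folds the index in as well.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    # Digest the raw bytes of the row hashes into a single short string.
    return hashlib.sha256(row_hashes.to_numpy().tobytes()).hexdigest()


# Tiny stand-ins for the real five-million-row frames (illustrative data only).
left = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})
same = left.copy()
changed = left.copy()
changed.loc[1, "score"] = 0.8

frames_match = frame_digest(left) == frame_digest(same)
frames_differ = frame_digest(left) != frame_digest(changed)

# Comparing the per-row hashes also shows which rows changed.
row_mask = pd.util.hash_pandas_object(left) != pd.util.hash_pandas_object(changed)
changed_rows = row_mask[row_mask].index.tolist()
```

The digest depends on row order, so both frames should be sorted the same way before comparing. Equal digests make equality very likely rather than certain, which is where hash collisions come in.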
The most densely populated square km in the United States
Alasdair Rae studied the densest places in the US. The most densely populated square kilometer in the United States is located on the Upper East Side in New York City, which doesn’t come as a surprise to people aware of NYC. NYC also has the highest population density in the country, with 148 out of 161 grid squares having a population of over 20,000.
San Francisco, Chicago, Los Angeles, Miami, Philadelphia, Madison, Union City (NJ), and West New York (NJ) also have densely populated areas.
Type in your job to see how much AI will affect it
Washington Post and UPenn researchers considered two AI use cases: image generators and large language models. Then, they calculated how much each job would be affected by AI. Being affected doesn’t necessarily mean the job would be replaced by AI; it could simply mean a productivity boost.
“Management, business and finance” and “Sciences and computer science” are the two largest categories affected by the advent of AI.
Four Packages
Today, let’s learn about four packages in R about data summarisation and inspection.
summarytools is an R package for data cleaning, exploring, and simple reporting. Its dfSummary() is a killer. Vignette. GitHub.
Missing values are ubiquitous in real-world datasets. naniar provides methods to identify, visualise and handle missing values. Its vis_miss() is a killer. Vignette. GitHub.
skimr provides a frictionless approach to summary statistics. Its skim() is a killer. Vignette. GitHub.
inspectdf is a collection of utilities to summarise missingness, categorical levels, numeric distribution, correlation, column types and memory usage. Vignette. GitHub.
Three Jargons
For fun, I asked ChatGPT to write the definitions in “humorous but intelligent manner”. It slaps!
Hash: Ah, hash! No, we're not talking about the delicious dish of diced meat, potatoes, and onions. In the computer realm, a hash is like a magical cooking pot where you throw in any quantity of data, no matter how large, stir it a little (or a lot), and it spits out a fixed-size garble of characters that looks like your cat danced on the keyboard. You can use this to uniquely identify your data, just like you uniquely identify that oddball uncle in every family reunion photo.
Hash Collision: This is the unfortunate event that occurs when two distinct pieces of data go through the magic cooking pot (i.e., hashing) and end up with the same cat-dancing-on-the-keyboard outcome. This is akin to two totally different people ending up with the same fingerprints. In the world of hashing, this is like the universe saying, "Oops! My bad." Remember, this is extremely rare, like finding someone who enjoys fruitcake.
Salting the Hash: Now, just when you thought hashing couldn't get more culinary, here comes a pinch of salt. Salting the hash is the act of adding a little secret spice (i.e., a random string) to your data before you hash it. This is the digital equivalent of adding a mustache and glasses to your fingerprints to avoid hash collisions. Basically, it's like your data goes undercover to avoid being mistaken for someone else, ensuring that the outcome is as unique as a snowflake in the Sahara.
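For a less culinary view, here is a small sketch of the three terms using Python's standard hashlib and secrets modules. The helper names sha256_hex and salted_hash and the sample password are assumptions made up for illustration. One caveat on the joke above: in practice a salt is mainly used so that identical inputs (such as two users' passwords) produce different hashes, rather than to prevent collisions.

```python
import hashlib
import secrets


def sha256_hex(data: bytes) -> str:
    """Hash any amount of data down to a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


short_digest = sha256_hex(b"hi")
long_digest = sha256_hex(b"hi" * 1_000_000)

# Fixed-size output: both digests are 64 hex characters (256 bits),
# even though one input is a million times longer than the other.
same_length = len(short_digest) == len(long_digest) == 64


def salted_hash(password: str, salt: bytes) -> str:
    """Hash the salt followed by the password (illustration only)."""
    return sha256_hex(salt + password.encode("utf-8"))


# Two users pick the same password but get different random salts,
# so their stored hashes differ.
salt_a = secrets.token_bytes(16)
salt_b = secrets.token_bytes(16)
hash_a = salted_hash("hunter2", salt_a)
hash_b = salted_hash("hunter2", salt_b)
```

A collision would be two different inputs giving the same sha256_hex output; none is publicly known for SHA-256. For real password storage, a deliberately slow function such as hashlib.scrypt or hashlib.pbkdf2_hmac is the usual choice over a single SHA-256 pass.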
Two Tweets
https://twitter.com/rappa753/status/1659559565967622145?s=20
https://twitter.com/Claudia_Sahm/status/1658131328301166593
One Meme
https://twitter.com/PR0GRAMMERHUM0R/status/1660118242176774148?s=20
This is due to Node.js (like many programming languages) treating everything as a string.
Bonus
Last week, when we weren’t talking, I went on a camping trip through Oregon, Utah, and Colorado. You can read about my 10-day nature escape on my blog (and meet Frank, Richard’s dog)!