Hi there!
Today we will sail through the vast seas of data science, where we'll navigate through the intricacies of feature engineering, explore the treasure troves of Reddit data with Pushshift, and set sail with Hugging Face's Diffusers. We'll also discover the magic of Noteable, a new mate in our ChatGPT crew, and learn the language of Git, the map to our coding treasure.
Let’s dive in.
Five Stories
Feature Engineering and Selection: A Practical Approach for Predictive Models
This is an online book by Max Kuhn and Kjell Johnson that provides a comprehensive guide to feature engineering and selection for predictive models. The book aims to provide tools for representing predictors and to place these tools in the context of a good predictive modeling framework.
It covers a wide range of topics, including exploratory visualizations, encoding categorical predictors, engineering numeric predictors, detecting interaction effects, handling missing data, and various feature selection methods. The book is intended to help readers generate better models by focusing on the predictor representations. All the data sets and R code used in the book are freely available for further exploration and learning.
What is Pushshift?
With the Reddit protests ongoing, I checked Reddit for the first time in a long while. Pushshift is one of the services Reddit has allowed to collect its data for free, for research purposes.
Pushshift is a big-data storage and analytics project known for its copy of Reddit comments and submissions. The FAQ post explains when and why to use Pushshift data instead of solely using the Reddit API, such as for analyzing large quantities of Reddit data, grabbing data for a specific date range in the past, searching for comments, and aggregating data.
It also discusses the limitations of the data provided by the Pushshift API, such as the fact that scores and other metadata may not reflect what is displayed by Reddit. The post provides links to the raw data, user-friendly interfaces for querying Pushshift data, and ways to support the project.
Hugging Face Diffusers
Hugging Face's Diffusers is a library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. It provides a modular toolbox that supports both simple inference solutions and training of custom diffusion models.
The library offers three core components: state-of-the-art diffusion pipelines for inference, interchangeable noise schedulers for different diffusion speeds and output quality, and pretrained models that can be used as building blocks for creating custom end-to-end diffusion systems. The library is designed with a focus on usability, simplicity, and customizability. It can be installed from PyPI or Conda and supports both PyTorch and Flax.
Noteable: The ChatGPT Plugin That Automates Data Analysis
This article introduces Noteable, a new plugin for ChatGPT Plus users that enhances ChatGPT's data analysis capabilities. The author, The PyCoach, emphasizes that Noteable is not the same as the code interpreter, which is not yet available to all ChatGPT Plus subscribers. The plugin essentially connects ChatGPT to Noteable’s sandboxed Python machines, which host and execute your notebook on ChatGPT's instructions.
The article suggests that Noteable speeds up exploratory analysis and quick brainstorming by running fast data analysis directly within ChatGPT. The full article likely provides more details about Noteable's features and benefits, as well as how it integrates with ChatGPT.
Git for Humans – Alice Bartlett at UX Brighton
This is a video presentation by Alice Bartlett at the UX Brighton 2016 conference, titled "Git for Humans". The video, hosted by UX Brighton, aims to make Git, a distributed version control system for tracking changes in source code during software development, more accessible and understandable for people.
Key takeaways:
Collaboration between developers and designers can lead to more interesting and enjoyable project outcomes.
Git is a powerful tool for managing project work, but it can be unfriendly and intimidating to beginners.
Git allows users to take snapshots of files (commits) and track changes over time, enabling effective project storytelling and version control.
Time travel through commit history helps users explore past project states.
Git's branch feature supports experimentation and easy discarding of changes, providing a safe space for trying new ideas.
Four Packages
BeautifulSoup: A Python library for extracting and parsing data from HTML and XML files. Vignette.
Telethon: A Python library for interacting with Telegram’s API and handling its encryption scheme. Vignette.
Numerizer: A Python library that converts natural language numbers into integers or floats. GitHub.
PyAztro: A Python library that fetches your daily horoscope from the aztro API. GitHub.
Three Jargons
"Ahoy there, matey! Ever heard of Gradient Boosting? 'Tis not a magical potion, but a powerful technique in machine learning. It be like a crew of weak learners, say decision trees, coming together to form a mighty strong learner. Each new tree corrects the errors made by the previous one, just like a crew learning from their past mistakes to find the hidden treasure, the accurate prediction!"
"Arr, let's talk about Dimensionality Reduction, shall we? Imagine ye have a map to a treasure, but it's filled with unnecessary paths and landmarks. It'd be a nightmare to navigate, aye? Dimensionality Reduction be like simplifying that map, removing the irrelevant paths, and focusing on the ones leading straight to the gold. Techniques like Principal Component Analysis (PCA) be the compass guiding us through this process."
"Last but not least, let's sail towards the Neural Networks. Picture a ship's crew, where each sailor has a specific task, and they all work together to sail the ship smoothly. Neural Networks work in a similar fashion. They be composed of layers of nodes, or 'neurons', each processing a bit of information and passing it on, just like a well-coordinated crew. And when they work together, they can navigate through complex tasks, like image recognition or natural language processing, just like steering a ship through a stormy sea!"
Arr matey, I be indebted to ye, ChatGPT, for sailing these rough seas of knowledge with me. Savvy?
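For the curious, the Gradient Boosting idea above ("each new tree corrects the errors made by the previous one") can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not how any production library implements it: the weak learner is a one-split decision stump on a single numeric feature, the loss is squared error, and the function names (fit_stump, fit_boosting, predict_boosting, mean_squared_error) and the toy data are made up for this example.

```python
# Minimal gradient boosting sketch for regression with squared error.
# Assumptions: one numeric feature, at least two distinct feature values,
# and decision stumps (a single split) as the weak learners.


def fit_stump(xs, residuals):
    """Pick the threshold whose two-sided means best fit the residuals."""
    best = None
    # Dropping the largest value keeps the right-hand side non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the residuals are the negative gradient.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, stumps, learning_rate


def predict_boosting(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - predict_boosting(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    # Toy data (invented for illustration): low values, then a jump.
    xs = [1, 2, 3, 4, 5, 6, 7, 8]
    ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
    baseline = fit_boosting(xs, ys, n_trees=0)  # mean-only model
    model = fit_boosting(xs, ys)
    print("error before boosting:", mean_squared_error(baseline, xs, ys))
    print("error after boosting: ", mean_squared_error(model, xs, ys))
```

The small learning rate is a deliberate choice: each stump only nudges the prediction, so no single weak learner dominates and the crew improves gradually. Real libraries build deeper trees and support other loss functions.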
Two Tweets
https://twitter.com/mdancho84/status/1668598351871242242?s=20
https://twitter.com/simonpcouch/status/1668697665511600134?s=20
One Meme
That’s a wrap!
If you liked the newsletter, please share. If a friend shared it with you, subscribe to support.
— Harsh