Hi there!
Today we will talk about data scraping, reproducible pipelines, the gt and pins packages, and elevators to space. Keep an eye out for the meme! UT Austin students built a rating map of toilets on campus.
Five Stories
Scraping Data Without Programming
Samantha Sunne
In this slide deck, Samantha explains several neat tricks for scraping data from websites with no programming. Specifically, she shows how to use IMPORTHTML and its cousin IMPORTXML to import data from live pages into Google Sheets. She also shares some free tools like ParseHub and Web Scraper.
These free tools are not always reliable: when a website changes its structure or behaviour, a tool may stop working and might not be updated in time. That’s why we use coding tools like rvest or Beautiful Soup.
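For a taste of the coding route, here is a minimal rvest sketch. It is not from Samantha's deck. The HTML table, its id and its values are made up so that the example runs offline; for a live page, read_html() with the page URL would take the place of minimal_html().

```r
library(rvest)

# A tiny stand-in page (hypothetical data) so the sketch runs without
# network access. For a real site you would call read_html("<page URL>").
page <- minimal_html('
  <table id="ratings">
    <tr><th>building</th><th>rating</th></tr>
    <tr><td>Library</td><td>4.5</td></tr>
    <tr><td>Gym</td><td>3.0</td></tr>
  </table>
')

# Select the table with a CSS selector and parse it into a data frame.
ratings <- page |>
  html_element("#ratings") |>
  html_table()

print(ratings)
```

Because the selector lives in code, a change in the page layout shows up as a clear failure that you can fix yourself, instead of a silent gap in a third-party tool.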
Building reproducible analytical pipelines with R
Bruno Rodrigues
This book teaches some of the best practices from software engineering and DevOps to make our R projects more robust, reliable and reproducible.
The goal is to teach you a set of tools, practices and project management techniques that should make your projects easier to reproduce, replicate and retrace. These tools and techniques can be used right from the start of your project at a minimal cost, such that once you’re done with the analysis, you’re also done with making the project reproducible. Your projects are going to be reproducible simply because they were engineered, from the start, to be reproducible.
What's new and exciting in gt 0.8.0
Rich Iannone
In this YouTube video, Rich Iannone, who works at Posit, talks about version 0.8.0 of the gt package and the new functions and improvements it offers. The new functions include sub_values() for find-and-replace, tab_style_body() for styling body cells based on their values, extract_cells() for pulling out specific cells, cols_align_decimal() for decimal alignment, and tab_info() for revealing the ID values of columns and rows. There are also improvements to the date and time formatting functions, including fmt_datetime() and fmt_date().
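Here is a small sketch of how a few of these functions fit together. It is not taken from the video: the data frame, its column names and the highlight colour are made up for illustration.

```r
library(gt)

# Hypothetical data for illustration only.
scores <- data.frame(
  item   = c("a", "b", "c"),
  value  = c(1.5, 22.25, 333),
  status = c("ok", "n/a", "ok")
)

tbl <- gt(scores) |>
  # Find-and-replace: swap the literal "n/a" for a friendlier label.
  sub_values(columns = status, values = "n/a", replacement = "missing") |>
  # Style body cells whose value passes a test.
  tab_style_body(
    style   = cell_fill(color = "lightyellow"),
    columns = value,
    fn      = function(x) x > 100
  ) |>
  # Line the numbers up on the decimal mark.
  cols_align_decimal(columns = value)

# Pull a single formatted cell back out of the table.
replaced <- extract_cells(tbl, columns = status, rows = 2)

# tab_info(tbl) would show the column and row IDs used for targeting.
```

Printing tbl in RStudio or a Quarto document renders the table.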
Announcing pins 1.1.0
Julia Silge
Version 1.1.0 of the pins package for R is now available on CRAN. pins lets users share R objects and models across projects and with colleagues. The package now supports Google Cloud Storage, and versioning makes it easy to track changes and undo mistakes.
Other improvements include the ability to read from boards on the web, improved functionality for Posit Connect, and a change in caching to reduce the likelihood of broken states. pins for Python was released previously, so users can read and write on AWS, Microsoft, or Google platforms in either R or Python. Pins stored with Python can be read with R, and vice versa.
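A minimal sketch of the write, version and read cycle follows. It uses a temporary local board so that it runs anywhere. The pin name "cars" and the mtcars slices are placeholders. A cloud board such as board_gcs() or board_s3() would need a bucket and credentials, which are not shown here.

```r
library(pins)

# A throwaway local board with versioning switched on.
board <- board_temp(versioned = TRUE)

# Write a first version of the pin.
pin_write(board, head(mtcars), name = "cars", type = "rds")

# Version ids include a timestamp, so pause to keep the two writes ordered.
Sys.sleep(1)

# Writing different data under the same name creates a second version.
pin_write(board, tail(mtcars), name = "cars", type = "rds")

# List the stored versions and read the pin back.
versions <- pin_versions(board, "cars")
latest   <- pin_read(board, "cars")

# An older version can be read by id, which is how a mistake gets undone.
older <- pin_read(board, "cars", version = versions$version[1])
```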
Space Elevators
Neal.fun
Neal is back with yet another interesting webpage! Now, you can ride the space elevator to the end of the atmosphere. Price? Two minutes of scrolling. (Far cheaper than Blue Origin and Virgin Galactic, eh.)
The elevator’s last stop is the Kármán line, an imaginary line drawn 100 km above sea level to separate Earth's atmosphere from outer space. I liked the temperature bar on the right-hand side, which shows how the temperature decreases near the boundary but then starts increasing again.
Four Packages
pins is an R package that enables users to publish and share data, models, and other R objects across projects and with colleagues. It supports versioning and can pin objects to various boards, including Google Cloud Storage, Amazon S3, and Azure Blob Storage. Vignette. GitHub.
gt is an R package that facilitates the creation of publication-ready data tables using a flexible grammar of table construction. It enables users to customize the appearance and formatting of tables, including cell and column spans, grouping, merging, and pivoting. With a wide range of themes and styling options, gt produces high-quality and interactive HTML and LaTeX tables that can be easily exported to a variety of file formats. Vignette. Github.
The targets package in R is a pipeline tool, similar to Make, that coordinates and orchestrates computationally intensive analysis projects for data science and statistics. It improves efficiency by skipping costly runtime for tasks that are already up to date and by abstracting files as R objects. When all current output matches the current upstream code and data, the pipeline is up to date and the results are trustworthy. Vignette. GitHub.
renv is an R package that helps you manage your project dependencies. With renv, you can create an isolated environment for your R project that contains all the packages you need. This ensures that your code runs consistently across different systems and avoids conflicts between different package versions. Vignette. GitHub.
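To make the "skip what is already up to date" idea from the targets entry concrete, here is a small sketch. The target names (raw_data, model, slope) and the toy data are invented for illustration. The pipeline is written to a temporary directory so that it does not touch a real project.

```r
library(targets)

# Work in a throwaway directory (an assumption made for this sketch).
project_dir <- file.path(tempdir(), "targets-sketch")
dir.create(project_dir, showWarnings = FALSE)
old_wd <- setwd(project_dir)

# tar_script() writes the pipeline definition to _targets.R.
tar_script({
  list(
    tar_target(raw_data, data.frame(x = 1:10, y = (1:10)^2)),
    tar_target(model, lm(y ~ x, data = raw_data)),
    tar_target(slope, unname(coef(model)["x"]))
  )
}, ask = FALSE)

tar_make()  # first run: builds every target
tar_make()  # second run: targets are up to date, so they are skipped

# Stored results come back as ordinary R objects.
slope_value <- tar_read(slope)

setwd(old_wd)
```

In a real project the list of targets normally lives in a hand-written _targets.R file at the project root, and changing one target's code rebuilds only that target and whatever depends on it.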
Three Jargons
Related to R
Environment: An environment in R is a collection of objects (e.g., variables, functions) that have a specific context or scope. The global environment is the default workspace for R, but users can create new environments or work within the environments provided by packages.
Closure: A closure is a function in R that captures the environment in which it was created, allowing it to retain access to variables and other objects from that environment even when it is called in a different context.
Namespace: A namespace in R is a mechanism for managing and isolating the functions and variables defined within a package, preventing conflicts with other packages or user-defined objects. Each package in R has its own namespace.
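The three terms are easier to see in code. Below is a small base R sketch. The names e, make_counter and the user-defined sd are made up for illustration.

```r
# Environment: a container of named objects with its own scope.
e <- new.env()
assign("x", 10, envir = e)
x_in_e <- get("x", envir = e)

# `x` lives in `e` only, not in the calling environment.
x_is_local_to_e <- exists("x", envir = e, inherits = FALSE)

# Closure: the inner function remembers `count` from the environment
# in which it was created, even after make_counter() has returned.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
counter <- make_counter()
first_call  <- counter()
second_call <- counter()

# Namespace: a user-defined sd() masks the one from the stats package,
# but stats::sd() still reaches the original through the namespace.
sd <- function(x) "my own sd"
own_result <- sd(c(1, 2, 3))
pkg_result <- stats::sd(c(1, 2, 3))
rm(sd)  # remove the mask again
```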
Two Tweets
https://twitter.com/selcukorkmaz/status/1650938868382703624
https://twitter.com/selcukorkmaz/status/1650246911947931650
One Meme
https://twitter.com/donniejsackey/status/1650556142123200522
Link to the Poop Ratings Map: UT Austin Crap Map
Bonus
“I don’t believe anybody asked for this. Or needed this. Or even actually thought it.
But goddamn am I glad you've done this.” — u/Malaguena on Reddit.