What makes a city resilient?
Next — Today I Learnt About Data Science | Issue #68
In today’s letter we look at several interesting examples of machine learning in practice. Dea’s article examines the resilience of Tirana, the capital of Albania; Javen’s piece presents how embeddings are used at Lyft; Simon’s blog gives an example of how programmers are taking on more ambitious projects with GPT; Jay’s post helps us visualise sequence models; and Tony’s article compares UMAP + GMM against PCA + K-means.
Dive into this knowledge-packed edition! (Last week’s quiz solution is at the end.)
Urban Resilience: Tirana, a Case Study [Part 1]
In this intriguing exposition, Dea explores the resilience of Albania’s bustling capital, Tirana. She uses population count data and spatial analysis tools like PySAL to understand the city’s rapid urbanization and shifting demographics over three transformative decades.
To analyze population dynamics and spatial dependence, she employs spatial Markov models. Spatial autocorrelation measures the dependence between similarly and differently valued areas, with a Queen Contiguity criterion used to compute the spatial weight matrix.
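The Markov part of the analysis can be sketched without PySAL: discretize each area’s population into quantile classes per decade, then estimate a transition matrix by counting class-to-class moves. A minimal illustration with synthetic data (the counts below are made up, not from Dea’s analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population counts for 100 areas over 4 decades (made-up data).
pop = rng.lognormal(mean=10, sigma=1, size=(100, 4))

# Discretize each decade into terciles: 0 = low, 1 = medium, 2 = high.
classes = np.array(
    [np.digitize(col, np.quantile(col, [1 / 3, 2 / 3])) for col in pop.T]
).T

# Count transitions between consecutive decades, then row-normalize.
k = 3
counts = np.zeros((k, k))
for t in range(classes.shape[1] - 1):
    for a, b in zip(classes[:, t], classes[:, t + 1]):
        counts[a, b] += 1
P = counts / counts.sum(axis=1, keepdims=True)

print(P)  # each row sums to 1; a heavy diagonal means areas tend to stay put
```

A large diagonal in `P` is exactly the “areas retain their character” finding; the spatial Markov variant additionally conditions these transitions on the class of an area’s neighbors.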
The results reveal that areas are more likely to remain in their current state, suggesting that places tend to retain their character in the short term. Pretty cool! I’m looking forward to Part 2.
lyft2vec — Embeddings at Lyft
Embeddings represent high-dimensional data in a lower-dimensional space. Lyft employs graph learning methods to generate embeddings, offering compact representations of high-dimensional information about riders, drivers, locations, and time. By building ride graphs, Lyft uncovers intriguing rideshare insights about riders, drivers, and their ride preferences with respect to factors such as location and time.
The company uses embedding vectors to calculate similarity between entities in a graph, unveiling fascinating patterns. For instance, driver ride patterns around the Bay Area can be compared using similarity scores computed from embeddings alone. Embeddings can also expose hidden relationships between entities that would be overlooked if only traditional features were used.
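The post doesn’t publish Lyft’s pipeline, but the similarity step is standard cosine similarity between embedding vectors. A toy sketch (the driver names and 4-d vectors are invented for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1 = identical direction)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings for three drivers (not real Lyft data).
sf_driver_a = np.array([0.9, 0.1, 0.4, 0.0])
sf_driver_b = np.array([0.8, 0.2, 0.5, 0.1])
oakland_driver = np.array([0.1, 0.9, 0.0, 0.6])

print(cosine_similarity(sf_driver_a, sf_driver_b))    # high: similar ride patterns
print(cosine_similarity(sf_driver_a, oakland_driver))  # low: different patterns
```

Because the geometry encodes behaviour, two drivers can score as similar even when no hand-crafted feature (home zip code, shift hours) links them, which is the “hidden relationships” point above.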
AI-enhanced development makes me more ambitious with my projects
In this interesting blog, Simon Willison shows how ChatGPT and GitHub Copilot have made him more productive and thus more ambitious. He wants to build an archive of his ChatGPT conversations and has some ideas on how to do it. In the old days, the effort of doing it all himself would have been daunting enough that he’d have dropped the project for something more important. With ChatGPT, he asks it to build the app and, after a few exchanges, lo and behold, he’s done!
This reminded me of the Jevons paradox: when something becomes more efficient, people use it more rather than less. If coding becomes easier with Copilot, people will solve many more problems with code. Consequently, many more programmers would be needed, not fewer.
Visualizing A Neural Machine Translation Model
This article examines sequence-to-sequence models used in tasks like machine translation and text summarization. These deep learning models consist of an encoder and a decoder, both recurrent neural networks (RNNs). The encoder processes the input sequence into a context vector, which the decoder uses to generate the output sequence.
It also discusses attention mechanisms, in which the encoder passes all of its hidden states to the decoder, which then concentrates on the most pertinent parts of the input at each decoding step, improving translation quality. This lets the model align words in the input and output languages, with visualizations provided to illustrate the concept.
Tired: PCA + kmeans, Wired: UMAP + GMM
Tony suggests using UMAP + GMM over PCA + K-means for dimension reduction and clustering in data science, demonstrated on football player stats. The results indicate that K-means models improve with UMAP pre-processing, and that GMM models perform better with more UMAP components.
The analysis shows that UMAP combinations excel in accuracy, while GMM clustering methods outperform in log loss. The choice between PCA and UMAP, and between K-means and GMMs, should depend on the specific problem: PCA and K-means are preferable for speed, UMAP and GMMs for accuracy.
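Both pipelines are a few lines with scikit-learn. In this self-contained sketch PCA stands in for the reduction step (the real “wired” combo needs the separate umap-learn package, noted in a comment), and the data is synthetic, not the football stats from Tony’s post:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, y = make_blobs(n_samples=300, centers=4, n_features=10, random_state=42)

# "Tired": linear reduction + hard, spherical clusters.
Z = PCA(n_components=2, random_state=42).fit_transform(X)
km_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(Z)

# "Wired": swap in UMAP for the reduction step, e.g.
#   import umap; Z = umap.UMAP(n_components=2).fit_transform(X)
# then fit a Gaussian mixture, which allows elliptical clusters.
gmm = GaussianMixture(n_components=4, random_state=42).fit(Z)
gmm_labels = gmm.predict(Z)
probs = gmm.predict_proba(Z)  # soft memberships, usable for log loss
```

The `predict_proba` output is the practical difference: GMMs give calibrated soft assignments (hence the log-loss advantage), while K-means only gives hard labels.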
You can learn more about UMAP here and GMM here.
ggbump creates elegant bump charts in ggplot. Bump charts are good for plotting rankings over time, or other cases where the path between two nodes has no statistical significance. GitHub.
fastDummies is an R package that generates dummy columns from character, factor, and Date columns. It is faster than model.matrix(), another method for creating dummies. GitHub. Vignette.
mclust is an R package that offers model-based clustering, classification, and density estimation using finite normal mixture modeling, with functions for parameter estimation, simulation, and comprehensive strategies incorporating model-based hierarchical clustering, EM, and BIC. Vignette.
clue is an R package that provides an extensible computational environment for creating and analyzing cluster ensembles, which are collections of clusterings of the same objects. Vignette.
UMAP (Uniform Manifold Approximation and Projection) is a machine learning algorithm used for dimensionality reduction and data visualization by constructing a low-dimensional representation of high-dimensional data while preserving the structure.
In deep learning with attention, a context vector is a summary of the important parts of the input sequence that are relevant for predicting the current output. This summary is created by calculating a weighted average of the hidden states of the input sequence. The weights used for the average are calculated based on the similarity between the current decoder hidden state and the encoder hidden states. The context vector is then used as input to the decoder for generating the output.
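That weighted average is only a few lines of numpy. A minimal sketch using dot-product scoring (real models learn the scoring function, and the vectors here are random stand-ins for trained hidden states):

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, hidden = 5, 8
encoder_states = rng.normal(size=(seq_len, hidden))  # one hidden state per input token
decoder_state = rng.normal(size=hidden)              # current decoder hidden state

# Score each encoder state against the decoder state (dot-product attention).
scores = encoder_states @ decoder_state

# Softmax turns the scores into weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The context vector is the weighted average of the encoder hidden states.
context = weights @ encoder_states

print(weights)  # large weights mark the input positions the model attends to
```

Visualizing `weights` for each output token is exactly the alignment plot Jay’s article shows: rows are output words, columns are input words.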
Queen Contiguity is a spatial relationship concept used in geography and spatial analysis. It refers to the criterion that considers two areas or units as neighbors if they share a common boundary or a vertex, just like how a queen chess piece can move diagonally, horizontally, and vertically on a chessboard. This method of measuring adjacency is commonly used in spatial data analysis, such as in the analysis of regional development patterns and the identification of clusters or hotspots.
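On a regular grid the criterion is easy to compute by hand: each cell’s neighbors are the up-to-eight cells touching it by edge or vertex. A small sketch (PySAL’s libpysal provides this for real lattices and polygons, e.g. `libpysal.weights.lat2W` with `rook=False`):

```python
import numpy as np

def queen_weights(nrows, ncols):
    """Binary spatial weight matrix for a grid: cells are neighbors
    if they share an edge or a vertex (all 8 surrounding cells)."""
    n = nrows * ncols
    W = np.zeros((n, n), dtype=int)
    for r in range(nrows):
        for c in range(ncols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == dc == 0:
                        continue  # a cell is not its own neighbor
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < nrows and 0 <= cc < ncols:
                        W[r * ncols + c, rr * ncols + cc] = 1
    return W

W = queen_weights(3, 3)
print(W.sum(axis=1))  # corners have 3 neighbors, edges 5, the center 8
```

Dropping the diagonal moves (`dr` and `dc` both nonzero) would give the stricter Rook criterion, which only counts shared edges.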
Last week, I asked the question: “How big would a wall clock have to be so that the tip of its minute hand moves a centimeter per second?” My friend Pablo found this to be one of the simplest tasks ChatGPT failed at. Later, when Bing / GPT-4 launched, we found it could solve it. Here are the GPT-3.5 and GPT-4 solutions. GPT-4 gets it right (albeit on its second attempt), while GPT-3.5 doesn’t.
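For reference, the whole puzzle is one circumference calculation: the minute hand’s tip makes a full revolution in 3,600 seconds, so at 1 cm/s the circumference is 3,600 cm, and the hand’s length follows from r = C / 2π:

```python
import math

circumference_cm = 60 * 60 * 1.0  # 1 cm/s for one hour = one full revolution
radius_m = circumference_cm / (2 * math.pi) / 100  # convert cm to m

print(f"minute hand ≈ {radius_m:.2f} m, clock ≈ {2 * radius_m:.2f} m across")
# minute hand ≈ 5.73 m, clock ≈ 11.46 m across
```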
Thanks for reading Next — Today I Learnt About Data Science! Subscribe for free to receive new posts and support my work.