Scraping mobility trends data from Apple
· 4 minutes read
Yesterday I tweeted about Apple’s COVID-19 trends mobility data. It’s neat that you can just download the CSV file from their website, but what if you could automate that (yes, I am too lazy to click a button)?
Luckily, all we have to do is use a headless Chrome interface to render the page: here we’ll be using the
chromote R package to do it all from the R console.
chromotepackage assumes you have a compatible browser installed in your system, such as Chrome or Chromium.
Extracting the URL
We start by loading the
We open a new session and navigate to the page on Apple’s website:
b <- ChromoteSession$new() b$Page$navigate("https://www.apple.com/covid19/mobility")
## $frameId ##  "15E022CD8D92CED7BB1E5EFE2D526EA5" ## ## $loaderId ##  "5401F56760908B6561A00AB8419FC263"
Here we suspend execution (using
Sys.sleep() for 5 seconds to let the page load and render.
We need to find something to identify the download button, and we can do that by using developer tools (e.g. on Firefox):
We now know it’s in a
div container of class
.download-button-container: this is what we’re going to use to extract the URL automagically.
Let’s do that, shall we?
x <- b$DOM$getDocument() x <- b$DOM$querySelector(x$root$nodeId, ".download-button-container") x <- b$DOM$getOuterHTML(x$nodeId)$outerHTML
First we get the whole document, then we select the
div container that we want, and finally we extract the HTML source code that will be stored in
##  "<div class=\"download-button-container\"><a href=\"https://covid19-static.cdn-apple.com/covid19-mobility-data/2008HotfixDev37/v3/en-us/applemobilitytrends-2020-05-20.csv\"></a><ui-button class=\"bar download\" ontouchstart=\"void(0)\" role=\"button\" tabindex=\"0\"><button type=\"button\" tabindex=\"-1\"></button>All Data CSV</ui-button></div>"
The final step consists of extracting the URL from that blob of HTML, and of course we can do that by writing an appropriate regex. And of course Bob Rudis already solved this for us on Stack Overflow!
This works perfectly (e.g. using the
library(stringr) url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+" url <- str_extract(string = x, pattern = url_pattern) url
##  "https://covid19-static.cdn-apple.com/covid19-mobility-data/2008HotfixDev37/v3/en-us/applemobilitytrends-2020-05-20.csv"
Not too bad!
Load the data into R
Now it comes the easy part: we can import CSV directly from the web using the awesome
library(readr) dt <- readr::read_csv(file = url)
dt will be in long format, hence we need to do some data wrangling before we could have a tidy dataset, but that’s also a piece of cake using
library(dplyr) library(tidyr) dt <- dt %>% pivot_longer(cols = starts_with("2020-"), values_to = "value", names_to = "date") %>% mutate(date = as.Date(date)) %>% mutate(value = value / 100)
Here we divided
value by 100 to work on a percentage scale.
Use the data
All we have to do now is use the data for whatever we want to with it.
Disclaimer: I have no domain knowledge on the topic, and this particular dataset has massive limitations (see the About This Data section on Apple’s website). No inference should be made out of this: it’s all for illustration purposes. Don’t poorly use this data. Don’t be that guy.
ggplot2 to plot data trends for Italy by transportation type:
library(ggplot2) library(viridis) library(scales) dt %>% filter(region == "Italy") %>% ggplot(aes(x = date, y = value, color = transportation_type)) + geom_line(size = 1) + scale_y_continuous(limits = c(0, 2), labels = percent) + scale_x_date(date_breaks = "2 weeks", date_labels = "%b-%d") + scale_color_viridis(discrete = TRUE) + theme_minimal(base_size = 12) + theme( panel.grid.minor = element_blank(), legend.position = c(0, 0), legend.justification = c(0, 0) ) + labs( y = "Mobility Index", x = "", color = "", title = "Change in mobility trends relative to January 13, 2020" )
Good job Italy. And if you’re reading this, now more than ever: stay at home, wash your hands, and be safe!
Update: I did some tweaks to the blog (and recompiled all files) on 2020-05-22, that’s why you get up-to-date data in the latest plot. Glad to see the approach outlined above still works!