Scraping mobility trends data from Apple

· 4 minutes read

Yesterday I tweeted about Apple’s COVID-19 trends mobility data. It’s neat that you can just download the CSV file from their website, but what if you could automate that (yes, I am too lazy to click a button)?

It turns out, you can scrape that website and obtain a direct URL to the CSV file — directly from your R session! It’s quite simple, actually, with one caveat: the download button (and the URL) is rendered dynamically using JavaScript, hence you can’t just scrape the HTML source code for the URL. We’ll need to render the page first, and then extract the URL we are looking for.

Luckily, all we have to do is use a headless Chrome interface to render the page: here we’ll be using the chromote R package to do it all from the R console.

Note: the chromote package assumes you have a compatible browser installed in your system, such as Chrome or Chromium.

Extracting the URL

We start by loading the chromote package:

1
library(chromote)

We open a new session and navigate to the page on Apple’s website:

1
2
b <- ChromoteSession$new()
b$Page$navigate("https://www.apple.com/covid19/mobility")
## $frameId
## [1] "15E022CD8D92CED7BB1E5EFE2D526EA5"
## 
## $loaderId
## [1] "5401F56760908B6561A00AB8419FC263"
1
Sys.sleep(5)

Here we suspend execution (using Sys.sleep() for 5 seconds to let the page load and render.

We need to find something to identify the download button, and we can do that by using developer tools (e.g. on Firefox):

We now know it’s in a div container of class .download-button-container: this is what we’re going to use to extract the URL automagically. Let’s do that, shall we?

1
2
3
x <- b$DOM$getDocument()
x <- b$DOM$querySelector(x$root$nodeId, ".download-button-container")
x <- b$DOM$getOuterHTML(x$nodeId)$outerHTML

First we get the whole document, then we select the div container that we want, and finally we extract the HTML source code that will be stored in x:

1
x
## [1] "<div class=\"download-button-container\"><a href=\"https://covid19-static.cdn-apple.com/covid19-mobility-data/2008HotfixDev37/v3/en-us/applemobilitytrends-2020-05-20.csv\"></a><ui-button class=\"bar download\" ontouchstart=\"void(0)\" role=\"button\" tabindex=\"0\"><button type=\"button\" tabindex=\"-1\"></button>All Data CSV</ui-button></div>"

Cool right?

The final step consists of extracting the URL from that blob of HTML, and of course we can do that by writing an appropriate regex. And of course Bob Rudis already solved this for us on Stack Overflow!

This works perfectly (e.g. using the stringr package):

1
2
3
4
library(stringr)
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
url <- str_extract(string = x, pattern = url_pattern)
url
## [1] "https://covid19-static.cdn-apple.com/covid19-mobility-data/2008HotfixDev37/v3/en-us/applemobilitytrends-2020-05-20.csv"

Not too bad!

Load the data into R

Now it comes the easy part: we can import CSV directly from the web using the awesome readr package:

1
2
library(readr)
dt <- readr::read_csv(file = url)

Done! dt will be in long format, hence we need to do some data wrangling before we could have a tidy dataset, but that’s also a piece of cake using tidyr and dplyr:

1
2
3
4
5
6
library(dplyr)
library(tidyr)
dt <- dt %>%
  pivot_longer(cols = starts_with("2020-"), values_to = "value", names_to = "date") %>%
  mutate(date = as.Date(date)) %>%
  mutate(value = value / 100)

Here we divided value by 100 to work on a percentage scale.

Use the data

All we have to do now is use the data for whatever we want to with it.

Disclaimer: I have no domain knowledge on the topic, and this particular dataset has massive limitations (see the About This Data section on Apple’s website). No inference should be made out of this: it’s all for illustration purposes. Don’t poorly use this data. Don’t be that guy.

Anyway. Let’s use ggplot2 to plot data trends for Italy by transportation type:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
library(ggplot2)
library(viridis)
library(scales)

dt %>%
  filter(region == "Italy") %>%
  ggplot(aes(x = date, y = value, color = transportation_type)) +
  geom_line(size = 1) +
  scale_y_continuous(limits = c(0, 2), labels = percent) +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%b-%d") +
  scale_color_viridis(discrete = TRUE) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    legend.position = c(0, 0),
    legend.justification = c(0, 0)
  ) +
  labs(
    y = "Mobility Index",
    x = "",
    color = "",
    title = "Change in mobility trends relative to January 13, 2020"
  )

Good job Italy. And if you’re reading this, now more than ever: stay at home, wash your hands, and be safe!

Update: I did some tweaks to the blog (and recompiled all files) on 2020-05-22, that’s why you get up-to-date data in the latest plot. Glad to see the approach outlined above still works!