Download lots of data
library(magrittr) # for the %>% pipe used throughout
I have a situation where I need to download lots of files with a specific format.
The files are monthly measurements located at a URL that depends on the year and month:
http://some.site.com/data/{year}/{month}/interesting_file_{year}{month}.zip
I want to download the files to the subfolder data/zip of my current project, so I split the URL:
url_base_pattern <- "http://some.site.com/data/{year}/{month}/"
file_pattern <- "interesting_file_{year}{month}.zip"
To generate all combinations of year and month, I use the tidyr::crossing function.
If January through September should be prefixed with a “0”, the following code will do the trick:
tidyr::crossing(year = 2010:2011, month = stringr::str_pad(1:12, width = 2, pad = "0"))
## # A tibble: 24 x 2
##     year month
##    <int> <chr>
##  1  2010 01
##  2  2010 02
##  3  2010 03
##  4  2010 04
##  5  2010 05
##  6  2010 06
##  7  2010 07
##  8  2010 08
##  9  2010 09
## 10  2010 10
## # … with 14 more rows
From the year and month, the full URL and the desired local destination are created with the wonderful glue and here packages:
urls <- tidyr::crossing(
  year = 2010:2011,
  month = stringr::str_pad(1:12, width = 2, pad = "0")
) %>%
  dplyr::mutate(
    file_name = glue::glue_data(., file_pattern),
    destfile = here::here("data", "zip", file_name),
    url = paste0(glue::glue_data(., url_base_pattern), file_name)
  )
The glue package really shines here compared to an ordinary paste(), because year and month each appear more than once in the URL.
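Just to illustrate the difference, here is a quick sketch of building a single URL both ways (nothing new, only the pattern from above spelled out):
year <- "2010"; month <- "01"
# paste0(): every occurrence of year and month must be spliced in by hand
paste0("http://some.site.com/data/", year, "/", month,
       "/interesting_file_", year, month, ".zip")
# glue(): the template reads like the URL itself
glue::glue("http://some.site.com/data/{year}/{month}/interesting_file_{year}{month}.zip")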
To make things a little more automatic, I also create the download folder in my script:
download_path <- here::here("data", "zip")
if (isFALSE(fs::dir_exists(download_path)))
  fs::dir_create(download_path)
Finally, I iterate over the rows of the urls
tibble to download each of the files.
urls %>%
  dplyr::select(url, destfile) %>% # column names match download.file()'s arguments
  purrr::pwalk(download.file, quiet = TRUE)
Since I have no interest in the return value of download.file(), only in its side effect (downloading the file), I use the walk variant, purrr::pwalk(), rather than pmap().
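For comparison, a minimal sketch of the map variant, which would collect download.file()'s integer status codes in a list I would only throw away:
# pmap() returns a list of results (here download.file()'s status codes);
# pwalk() discards them and returns its input invisibly
statuses <- urls %>%
  dplyr::select(url, destfile) %>%
  purrr::pmap(download.file, quiet = TRUE)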
Multiple runs
The form above gets the job done.
But if I re-run the script, all the files will be downloaded again, which is a waste if all I want is to add a few months.
This can be remedied by writing a custom download function to replace download.file above:
download_once <- function(url, destfile, ...) {
  # skip the download if the file already exists locally
  if (isFALSE(fs::file_exists(destfile)))
    download.file(url = url, destfile = destfile, ...)
}
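Plugging it in is then just a matter of swapping the function passed to pwalk() in the pipeline above; the extra quiet argument is forwarded to download.file() through the dots:
urls %>%
  dplyr::select(url, destfile) %>%
  purrr::pwalk(download_once, quiet = TRUE)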
Read data
In my case each zip file contains a single csv file.
To read all the files I first need a vector of file names (or I could use the column destfile from the urls tibble).
zip_files <- fs::dir_ls(here::here("data", "zip"))
The old-school way of loading all of them into one big tibble would be with the purrr package:
tbl <- purrr::map_dfr(zip_files, readr::read_csv, col_types = readr::cols(...))
Since (potentially) a large number of files have been downloaded, I prefer to list the column types with cols() to be warned about any unexpected deviations.
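As an illustration only, with hypothetical column names (date and value stand in for whatever the real files contain), an explicit spec could look like this:
# hypothetical columns; replace with the ones in the actual csv files
col_spec <- readr::cols(
  date = readr::col_date(),
  value = readr::col_double()
)
tbl <- purrr::map_dfr(zip_files, readr::read_csv, col_types = col_spec)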
The new-school way would be with the vroom package:
tbl <- vroom::vroom(zip_files, col_types = vroom::cols(...))