Weather fetchR example

Author

Claire Punturieri

Published

August 18, 2025

Introduction

This Quarto document provides a walkthrough for using the associated weather fetchR functions. The goal is to take movement data (lat/lon + time), identify where the subject spent the most time each day, find the nearest NOAA weather stations, and pull daily weather data for those locations.

Set-Up

Load in dependencies and necessary functions.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(future)

source(here::here("functions/get_top_locations.R"))
source(here::here("functions/identify_stations.R"))
source(here::here("functions/pull_weather.R"))

We will be using example data adapted from a study on movement data collected from fishers (Pekania pennanti; original data can be accessed at: https://r-packages.io/datasets/leroy).

These data have been simplified and slightly altered for the purposes of demonstrating pulling weather data.

To set up, let’s load in and take a look at the data.

fisher <- read_csv(here::here("tutorial/data/pennanti_abridged.csv"),
                   show_col_types = FALSE) |> 
  glimpse()
Rows: 32,904
Columns: 7
$ subid     <dbl> 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, …
$ date      <date> 2011-02-11, 2011-02-11, 2011-02-11, 2011-02-11, 2011-02-11,…
$ day_label <dbl> 117, 117, 117, 117, 117, 117, 117, 118, 118, 119, 119, 119, …
$ lat       <dbl> 42.79528, 42.79534, 42.79528, 42.79483, 42.79509, 42.79515, …
$ lon       <dbl> -73.86015, -73.86001, -73.86013, -73.86012, -73.86037, -73.8…
$ time      <dttm> 2011-02-11 18:06:14, 2011-02-11 18:10:09, 2011-02-11 18:20:…
$ duration  <dbl> 3.916667, 10.833333, 20.533333, 59.666667, 9.783333, 10.2500…

Calculating locations where the most time was spent - get_top_locations()

To examine a broader area of space, you can round your latitude-longitude coordinates. One decimal place corresponds to roughly 11 km. Because this code makes external queries, rounding to a wider area can also afford greater anonymity/privacy when working with real participant data.

fisher <- fisher |>
  mutate(lat = round(lat, 1),
         lon = round(lon, 1))

fisher |> glimpse()
Rows: 32,904
Columns: 7
$ subid     <dbl> 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, …
$ date      <date> 2011-02-11, 2011-02-11, 2011-02-11, 2011-02-11, 2011-02-11,…
$ day_label <dbl> 117, 117, 117, 117, 117, 117, 117, 118, 118, 119, 119, 119, …
$ lat       <dbl> 42.8, 42.8, 42.8, 42.8, 42.8, 42.8, 42.8, 42.8, 42.8, 42.8, …
$ lon       <dbl> -73.9, -73.9, -73.9, -73.9, -73.9, -73.9, -73.9, -73.9, -73.…
$ time      <dttm> 2011-02-11 18:06:14, 2011-02-11 18:10:09, 2011-02-11 18:20:…
$ duration  <dbl> 3.916667, 10.833333, 20.533333, 59.666667, 9.783333, 10.2500…

Once you’ve completed that (optional) step, you can get your subject’s top locations.

fisher_longest <- fisher$day_label |>
  unique() |>
  furrr::future_map(\(day_label_value) get_top_locations(data = fisher, day_label_value,
                                                   day_col = "day_label",
                                                   lat_col = "lat", lon_col = "lon",
                                                   duration_col = "duration")) |>
  list_rbind() |> 
  distinct(day_label, .keep_all = TRUE) #|> 
  #select(-day_label)

fisher_longest |> glimpse()
Rows: 209
Columns: 5
$ subid     <dbl> 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, …
$ date      <date> 2011-02-11, 2011-02-12, 2011-02-13, 2011-02-14, 2011-02-15,…
$ day_label <dbl> 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, …
$ lat       <dbl> 42.8, 42.8, 42.8, 42.8, 42.8, 42.8, 42.8, 42.8, 42.8, 42.8, …
$ lon       <dbl> -73.9, -73.9, -73.9, -73.9, -73.9, -73.8, -73.9, -73.9, -73.…
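The sourced function itself isn’t shown in this document, but the per-day logic it encapsulates is roughly the following. This is a hypothetical sketch, not the actual implementation in functions/get_top_locations.R:

# Hypothetical sketch of the per-day logic: sum the time spent at each
# (lat, lon) pair within one day, then keep the pair with the largest total.
top_location_sketch <- function(data, day_value) {
  data |>
    filter(day_label == day_value) |>
    group_by(subid, date, day_label, lat, lon) |>
    summarise(total_duration = sum(duration), .groups = "drop") |>
    slice_max(total_duration, n = 1, with_ties = FALSE) |>
    select(-total_duration)
}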

Getting corresponding weather stations - identify_stations()

Next, you’ll need to pull a list of stations. There are different ways to do this (you might already host these files locally), but you can easily download codes for all NOAA stations by country using the NOAA_countryStations() function in the FluMoDL package. To pull stations active during a relevant time period, I recommend restricting the search to your study dates.

first_date <- min(fisher_longest$date)

last_date <- max(fisher_longest$date)

weather_stns <- FluMoDL::NOAA_countryStations(fips = "US",
                                              from = first_date - 1, to = last_date + 1)
Downloading list of weather stations from NOAA (waiting for server)... 
Download finished.
weather_stns |> head()
        usaf  wban                        station.name ctry state icao    lat
13625 690150 93121                   TWENTY NINE PALMS   US    CA KNXP 34.294
13867 700001 26492                     PORTAGE GLACIER   US    AK PATO 60.784
13869 700197 26558                             SELAWIK   US    AK PASK 66.600
13871 700260 27502 W POST-WILL ROGERS MEMORIAL AIRPORT   US    AK PABR 71.287
13875 700300 27503                  WAINWRIGHT AIRPORT   US    AK PAWI 70.637
13883 700631 25715                        ATKA AIRPORT   US    AK PAAK 52.220
           lon elev.m.      begin        end
13625 -116.147   610.5 1990-01-02 2025-08-17
13867 -148.848    32.5 2006-01-01 2025-08-17
13869 -159.986     7.6 2006-01-01 2025-08-17
13871 -156.739     8.1 1945-01-01 2025-08-17
13875 -160.013    12.5 1999-11-02 2024-09-09
13883 -174.206    17.1 2006-01-01 2013-04-30

As a side note, should you opt to download these data via FluMoDL, all WBANs should be five digits. If a WBAN is shorter, pad the beginning with one or two zeros until it reaches five digits. We also need to ensure that blank entries in the ICAO column are set to NA for later filtering purposes.

weather_stns <- weather_stns |> 
  mutate(wban = ifelse(nchar(wban) == 3, paste0("00", wban), wban),
         wban = ifelse(nchar(wban) == 4, paste0("0", wban), wban),
         icao = na_if(icao, ""))
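Equivalently, stringr (attached above with the tidyverse) can do the padding in a single step:

# Same padding in one step: left-pad every WBAN to five characters with "0".
weather_stns <- weather_stns |>
  mutate(wban = str_pad(wban, width = 5, pad = "0"),
         icao = na_if(icao, ""))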

Once you have a list of relevant stations, you can then apply the identify_stations() function. Distance is automatically calculated in meters. A benefit of pulling more than one station is that not every location registered as a station collects conventional weather data: sometimes you might be pinging a buoy, or a given station might simply have no weather records for a given day. Pulling several stations leaves you with less missing data later on (you could take the closest station with valid data, or average across multiple stations).

match_stns <- fisher_longest$day_label |>
  unique() |>
  furrr::future_map(\(day_label_value) identify_stations(data = fisher_longest,
                                                         stations = weather_stns,
                                                         day_label_value,
                                                         n_stn = 5),
                    .options = furrr::furrr_options(seed = TRUE)) |>
  purrr::map_dfr(~ tibble(subid = .x$subid, day_label = .x$day_label, wban = .x$wban,
                          icao = .x$icao, distance = .x$distance, date = .x$date)) |>
  mutate(obs_label = 1:n())
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.
match_stns |> head()
# A tibble: 6 × 7
  subid day_label wban  icao  distance date       obs_label
  <dbl>     <dbl> <chr> <chr>    <dbl> <date>         <int>
1    23       117 04741 KSCH     6897. 2011-02-11         1
2    23       117 14735 KALB    10148. 2011-02-11         2
3    23       117 54781 KDDH    54328. 2011-02-11         3
4    23       117 54768 KAQW    60852. 2011-02-11         4
5    23       117 14750 KGFL    64267. 2011-02-11         5
6    23       118 04741 KSCH     6897. 2011-02-12         6
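As with get_top_locations(), the sourced implementation isn’t shown here, but the core matching step is conceptually something like the sketch below. It is hypothetical and assumes the geosphere package, which is not loaded above:

# Hypothetical sketch: great-circle distance (in meters) from one day's top
# location to every station, keeping the n_stn closest.
nearest_stations_sketch <- function(point_lat, point_lon, stations, n_stn = 5) {
  stations |>
    mutate(distance = geosphere::distHaversine(cbind(lon, lat),
                                               c(point_lon, point_lat))) |>
    slice_min(distance, n = n_stn, with_ties = FALSE)
}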

Get weather data per day - pull_weather()

We have matched each location where our subject spent the most time on each study day to its closest weather stations. We can now make direct queries to get weather data! When we do, you will see that there are a lot of NAs; it is not unusual for some of the requested elements (like snow or precipitation) to be sparse.

weather <- match_stns$obs_label |>
  unique() |>
  furrr::future_map(\(obs_label_value) pull_weather(obs_label_value,
                                                    data = match_stns)) |> 
  bind_rows()

weather |> glimpse()
Rows: 1,045
Columns: 10
$ date  <date> 2011-02-11, 2011-02-11, 2011-02-11, 2011-02-11, 2011-02-11, 201…
$ avgt  <dbl> NA, 10.5, 6.0, 6.0, 5.0, NA, 22.0, 17.5, 19.0, 18.5, NA, 28.5, 2…
$ mint  <dbl> NA, -5, -12, -15, -15, NA, 10, 3, 5, 2, NA, 18, 15, 19, 7, NA, 2…
$ maxt  <dbl> NA, 26, 24, 27, 25, NA, 34, 32, 33, 35, NA, 39, 43, 38, 31, NA, …
$ snow  <dbl> NA, 0.000, NA, NA, NA, NA, 0.001, NA, NA, NA, NA, 0.001, NA, NA,…
$ snwd  <dbl> NA, 15, NA, NA, NA, NA, 15, NA, NA, NA, NA, 14, NA, NA, NA, NA, …
$ pcpn  <dbl> NA, 0.000, 0.001, 0.000, 0.000, NA, 0.001, 0.001, 0.001, 0.001, …
$ subid <chr> "23", "23", "23", "23", "23", "23", "23", "23", "23", "23", "23"…
$ wban  <chr> "04741", "14735", "54781", "54768", "14750", "04741", "14735", "…
$ icao  <chr> "KSCH", "KALB", "KDDH", "KAQW", "KGFL", "KSCH", "KALB", "KDDH", …
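As a side note, the element names returned here (avgt, mint, maxt, snow, snwd, pcpn) match those used by the RCC-ACIS web service, so pull_weather() is presumably wrapping a request along these lines. The endpoint and parameters below are assumptions for illustration, not taken from the actual function:

# Illustrative only: a hand-rolled single-station, single-day query against
# the (assumed) RCC-ACIS StnData endpoint; the real request logic lives in
# functions/pull_weather.R.
resp <- jsonlite::fromJSON(paste0(
  "https://data.rcc-acis.org/StnData?sid=KALB",
  "&sdate=2011-02-11&edate=2011-02-11",
  "&elems=avgt,mint,maxt,snow,snwd,pcpn&output=json"
))
resp$data  # one row per date: the date, then the requested elements in order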

Here’s a simple average to summarize across the five nearest stations for each day. You might consider doing something more sophisticated, like a weighted mean; see the sketch after the output below.

weather <- weather |> 
  group_by(subid, date) |> 
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)), .groups = "drop")

weather |> glimpse()
Rows: 209
Columns: 8
$ subid <chr> "23", "23", "23", "23", "23", "23", "23", "23", "23", "23", "23"…
$ date  <date> 2011-02-11, 2011-02-12, 2011-02-13, 2011-02-14, 2011-02-15, 201…
$ avgt  <dbl> 41.250, 41.000, 26.750, 24.500, 24.000, 23.750, 24.750, 24.625, …
$ mint  <dbl> 29.50, 34.50, 18.25, 14.00, 15.00, 13.75, 16.50, 12.75, 18.25, 1…
$ maxt  <dbl> 53.00, 47.50, 35.25, 35.00, 33.00, 33.75, 33.00, 36.50, 39.25, 2…
$ snow  <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 2.100, 0.001, 0…
$ snwd  <dbl> 4.000, 1.000, 0.001, 0.001, 0.000, 0.000, 0.000, 0.000, 2.000, 0…
$ pcpn  <dbl> 0.00025, 0.39000, 0.00550, 0.00000, 0.00025, 0.00000, 0.00000, 0…
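For instance, a minimal inverse-distance weighted version might look like the sketch below. It assumes you saved a copy of the station-level table (the weather object as it existed before the summarise() step above) as weather_stn, and that no distance is ever zero:

# Sketch: inverse-distance weighted mean per day. `weather_stn` is assumed
# to be the station-level weather saved before the averaging step; station
# distances are joined back in from match_stns.
weather_weighted <- weather_stn |>
  left_join(match_stns |> select(wban, date, distance),
            by = c("wban", "date")) |>
  group_by(subid, date) |>
  summarise(across(c(avgt, mint, maxt, snow, snwd, pcpn),
                   ~ weighted.mean(.x, w = 1 / distance, na.rm = TRUE)),
            .groups = "drop")

The same join also supports the closest-station-with-valid-data approach mentioned earlier: filter out rows with missing values, then slice_min(distance, n = 1) within each subid/date group.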