Using rerddapUtils

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(duckdb)
#> Loading required package: DBI
library(duckplyr)
#> The duckplyr package is configured to fall back to dplyr when it encounters an
#> incompatibility. Fallback events can be collected and uploaded for analysis to
#> guide future development. By default, no data will be collected or uploaded.
#> → Run `duckplyr::fallback_sitrep()` to review the current settings.
#> ✔ Overwriting dplyr methods with duckplyr methods.
#> ℹ Turn off with `duckplyr::methods_restore()`.
#> 
#> Attaching package: 'duckplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:arrow':
#> 
#>     duration
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(ncdf4)
library(rerddap)
library(rerddapUtils)

rerddapUtils is an R package is a set of four functions desigened to work with and extend the rerddap package to provide capabilities requested by users that meet specialized needs which are better not being included in the rerddao package.

The first function is:

griddap_season <- function(datasetx, ..., fields = 'all', stride = 1, season = NULL, fmt = "nc", url = eurl(), store = disk(), read = TRUE, callopts = list())

which will only extract data for the period during the year defined by ‘season’. The second function is:

griddap_split <- function(datasetx, ..., fields = 'all', stride = 1, request_split = NULL, fmt = "nc", url = eurl(), store = disk(), read = TRUE, callopts = list(), aggregate_file = NULL)

which is designed to split a very large request into pieces defined by ‘request_split’ and then aggregate the parts either in memory, in a duckdb database or in a netcdf4 file. This allows for requests larger than 2GB to be made in an appropriate manner.

Note that griddap_season() and griddap_split() are designed to have basically the same interface as rerddao::griddap(), except for one or two extra function arguments and some arguments that are interpted differently. In both functions the argument ‘store = disk()’ is ignored. In griddap_season the ‘fmt’ argument is ignored while ‘season’ defines the days during the year to make the extract.

In griddap_split the argument ‘fmt’ defines whether the aggregate download should be stored in memory as a dataframe, in a duckdb database file, or in a netcdf file. For the duckdb and nectdf options, ‘aggregate_file’ lets you define where to write the file, if not defined it will set up a temporary file following proper R guidelines and in both cases the path to the file will be returned.

The other two functions are designed to make it easier to work with projected datasets:

latlon_to_xy <- function (dataInfo, latitude, longitude, yName = 'latitude', xName = 'longitude', crs = NULL)

which for a given dataset will convert a latitude and longitude request into the projected coordiantes to be used in rerddao::griddap(), while the fourth function:

xy_to_latlon <- function (resp, yName = 'cols', xName = 'rows', crs = NULL)

does the reverse, given an rerddao::griddap() extract will convert the projected coordinates into latitiude and longitude values.

This vignette assumes familiarity with the various ‘rerddap’ functions and how to use them. More information about the various ‘rerddap’ functions can be found in the documentation for that package.

Extract during a season - `griddap_season()`

The function griddap_season() works exactly like the function rerddap::griddap except that there is one extra argument, “season”, which restricts the extract to the part of the year defined by season, and also ignores several of the rerddap::griddap options, as it will ignore what format to downaod the data, and will only save to a standard rerddap::griddap output structure.

The argument “season” is a character string such that for a given “year”, paste0(year, ‘-’, season) produces a valid ISO datetime string. To test “season” is properly formed, use lubridate::as_datetime():

year <- '2023'
season <- c('02-01', '06-01')
# test that "season" is defined properly
test_time <- paste0(year, '-', season)
lubridate::as_datetime(test_time)
#> [1] "2023-02-01 UTC" "2023-06-01 UTC"

Even though the example only includes month and day, hours etc can be included as long as the final result, when a year is appended, produces a valid datetime.

As an example call and result, suppose you have a working rerddap::griddap() extract given by:

wind_info <- rerddap::info('erdQMekm14day')
extract <- rerddap::griddap(wind_info,
                            time = c('2015-01-01','2019-01-01'),
                            latitude = c(20, 40),
                            longitude = c(220, 240),
                            fields = 'mod_current'
)

but it is desired to restrict extract to March 1 to May 1 inclusive, then the following modification of the above using griddap_season()produces the desired result:

wind_info <- rerddap::info('erdQMekm14day')
season <- c('03-01', '05-01')
season_extract <- griddap_season(wind_info,
 time = c('2015-01-01','2019-01-01'),
 latitude = c(20, 40),
 longitude = c(220, 240),
 fields = 'mod_current',
 season = season
)
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap

To see that the extract has been limited to the given season:

extract_times  <- lubridate::as_datetime(season_extract$data$time)
extract_months <- lubridate::month(extract_times)
unique(extract_months)
#> [1] 3 4 5

Only results for months 3, 4, and 5 are included, as per the values of “season”. May is included because the definition of “season” is inclusive, hence May 1 will be included.

Split an extract into parts - `griddap_split()`

The function griddap_split() allows to make rerddap::griddap() requests that are larger than the 2GB limit by splitting the request into parts and then combining the results either in memory in a dataframe, in a netcdf4 file, or into a duckdb database file. If the “in memory” option is chosen, there is no check that there is adequate memory, so it is possible to crash both R and your computer. The nectdf4 and duckdb options depend more on having adequate disk space, not memory. Also, there is no check that the split given will reduce the requests to a small enough size, that is up to the user to estimate.

griddap_split() uses the same arguments as griddap_split() execpt that:

The “store” and “read” arguments are ignored.
“fmt” is one of “memory”, “nc”, or “duckdb” - default “nc”
an added argument “split”
an added argument “aggregate_file” - default is to create an R temporary file

The argument “split” is a named list that defines the number of splits in each dimension, not the values where to make the splits, and a value must be given for each coordinate in the dataset. So for the dataset ‘erdQMekm14day’ above, there are four coordinates, “time”, “altitude”, “latitude” and “longitude”, to break the request into five “time” parts and two “latitude” and “longitude” parts (if you remember your combinatorics the original request will be broken into 20 separate requests) use:

request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)

Note that:

request_split <- list(time = 5, latitude = 2, longitude = 2)

will fail, even though “altitude” is dimension one. A value must be given for all of the dataset coordinates.

“fmt” defines how the split extracts will be combined to give one result. If fmt = “memory” then the results are combined into a usual rerddap::griddap() in memory, and that structure is returned by the function. If fmt = “nc” then the extracts wil be combined into a netcdf4 file, and a path to the netcdf file is returned by the function. If fmt = “duckdb” then the extracts wil be combined into a duckdb database file, and a path to the duckdb file is returned by the function.

As in the previous example suppose I had an rerddap::griddap() extract:

wind_info <- rerddap::info('erdQMekm14day')
extract <- rerddap::griddap(wind_info,
                            time = c('2015-01-01','2016-01-01'),
                            latitude = c(20, 40),
                            longitude = c(220, 240),
                            fields = 'mod_current'
)

which would request an extract too large for ERDDAP to fulfill (that is not the case here so a split is not needed, but this illustrates the usage). Then the following modifies the code to split the request and save into a structure in memory:


wind_info <- rerddap::info('erdQMekm14day')
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
 time = c('2015-01-01','2016-01-01'),
 latitude = c(20, 40),
 longitude = c(220, 240),
 fields = 'mod_current',
 request_split = request_split,
 fmt = "memory"
)
str(split_extract)

On return, “split_extract” will contain the usual rerddap::griddap() structure. Again, there is no check that there is enough memory to do this operation, so it is possible to crash R (or your computer) if not used carefully, but it is not unusual for computers to have 16GB of RAM or more, so extracts a good bit larger than 2GB can be stored in memory.

If instead it is desired to store the result in a netcdf4 file:


wind_info <- rerddap::info('erdQMekm14day')
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
 time = c('2015-01-01','2016-01-01'),
 latitude = c(20, 40),
 longitude = c(220, 240),
 fields = 'mod_current',
 request_split = request_split,
 fmt = "nc"
)
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap

Since “aggregate_file” is not given, the netcdf file will be written to a temporary space and the path to that file will be returned by the function. The reason for this is it is bad form for an R package to write to a user’s space without permission. To give permission to wrtie to a file of one’s choosing, either give the full path to the file (or if just the name is given it will be written to the present directory), such as for the example above:


wind_info <- rerddap::info('erdQMekm14day')
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
 time = c('2015-01-01','2019-01-01'),
 latitude = c(20, 40),
 longitude = c(220, 240),
 fields = 'mod_current',
 request_split = request_split,
 fmt = "nc",
 aggregate_file = 'wind_extract.nc'
)

To access the data in the file, any of the various R packages that can read netcdf4 files can be used, in particular ones that allow for only part of the dataset to be read into memory at a time. For example to open the result of the netcdf file created above using ‘ncdf4’ and viewing a summary:

wind_file <- ncdf4::nc_open(split_extract )
wind_file
ncdf4::nc_close(wind_file)

To make the same extract and save to a duckbb database file:


wind_info <- rerddap::info('erdQMekm14day')
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
 time = c('2015-01-01','2016-01-01'),
 latitude = c(20, 40),
 longitude = c(220, 240),
 fields = 'mod_current',
 request_split = request_split,
 fmt = "duckdb",
 aggregate_file = 'wind_extract.duckdb'
)

There are many tools that can used to access the duckdb database file, among others the packages ‘dplyr’ and ‘duckplyr’ can be used:

con_db <- dbConnect(duckdb::duckdb(), "wind_extract.duckdb")
tbl(con_db, "extract") |>
     head(5) |>
     collect()
dbDisconnect(con_db, shutdown=TRUE)

Note that the table will always be called “extract”.

The default duckdb format is used because it allows for incremental additions to the dataset on disk, so datasets can be larger than memory. A popular format for storing data is “parquet”, or if it is desired to encode geometry information into the data, then the “geoparquet” format. To copy the duckdb file to parquet:

con_db <- dbConnect(duckdb::duckdb(), "wind_extract.duckdb")
# Use DuckDB's COPY command to write directly to Parquet file without loading into R
query <- sprintf("COPY extract TO '%s' (FORMAT 'parquet')", "wind_extract.parquet")
DBI::dbExecute(con_db, query)
dbDisconnect(con_db, shutdown=TRUE)

To write to a geoparquet file the package geoarrow cna be used, which involves first adding the geocoding information - to do so requires reading the entire data into memory so this will not work for datasets larger than memory:

con_db <- dbConnect(duckdb::duckdb(), "wind_extract.duckdb")
wind_data <- DBI::dbGetQuery(con_db, "SELECT * FROM extract")
# Create a geometry column using geoarrow using latitude and longitude
wind_data <- sf::st_as_sf(wind_data, coords = c("longitude", "latitude"), crs = 4326)
arrow::write_parquet(wind_data,  "wind_extract_geo.parquet")
dbDisconnect(con_db, shutdown=TRUE)

The resulting parquet files are half the size of the duckdb file, or smaller. While these examples are for writing the duckdb database to the parquet format, similar steps can be used if the data are in tibbles, but it may require reading the entire dataset into memory.

Working with projected datasets - `latlon_to_xy()` and `xy_to_latlon()`

Extracting projected data in ERDDAP (that is not on a latitude-longitude) grid can be difficult because most people do not think in terms of the projection in defining the region to make an extract, and similary once an extract has been made, it is often desirable to know the coordinates in terms of latitude and longitude. The functions latlon_to_xy() and xy_to_latlon() are desigend to help with that process.

In both cases the functions try to determine the dataset projection either from the dataset summary given by rerddap::info() when projecting the values of a latitude-longitude bounding box, and when an extract has been made, from the metadata in the extract. This is done by searching for any of the terms:

   proj_strings <- c('proj4text', 'projection', 'proj4string', 'grid_mapping_epsg_code',
                    'WKT',  'proj_crs_code',
                    'grid_mapping_epsg_code', 'grid_mapping_proj4',
                    'grid_mapping_proj4_params', 'grid_mapping_proj4text')

in the metadata. If this is not found, then the user has to provide “CRS”, which can be in any of the forms above. For example the Ice dataset ‘nsidcG02202v4sh1day’ on the Polarwatch ERDDAP is projected. The bounding box of interest is:

latitude <- c( 80., 85.)
longitude <- c(-170., -165)

then the bounding box in projected coordinates is found from:

myURL <- 'https://polarwatch.noaa.gov/erddap/'
myInfo <- rerddap::info('noaacwVIIRSn20icethickNP06Daily', url = myURL)
coords <- latlon_to_xy(myInfo,  longitude, latitude)
coords

Similary, given an extract on a projected coordinate system, the values on terms of latitude and longitude can be found by:

rows <- c( -889533.8, -469356.9)
cols <- c(622858.3, 270983.4)
myURL <- 'https://polarwatch.noaa.gov/erddap/'
myInfo <- rerddap::info('noaacwVIIRSn20icethickNP06Daily', url = myURL)
proj_extract <- rerddap::griddap(myInfo,
                                  time = c('2023-01-30T00:00:00Z', '2023-01-30T00:00:00Z'),
                                  rows = rows,
                                  cols = cols,
                                  fields = 'IceThickness',
                                  url = myURL
  )
test <- xy_to_latlon(proj_extract)
head(test)

Note that xy_to_latlon() will also work with the output from the functions rerddapXtracto::rxtracto(), rerddapXtracto::rxtracto_3D() and `rerddapXtracto::rxtractogon().

Extract during a season - griddap_season()

Split an extract into parts - griddap_split()

Working with projected datasets - latlon_to_xy() and xy_to_latlon()

Extract during a season - `griddap_season()`

Split an extract into parts - `griddap_split()`

Working with projected datasets - `latlon_to_xy()` and `xy_to_latlon()`