using-rerddapUtils.Rmd
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(duckdb)
#> Loading required package: DBI
library(duckplyr)
#> The duckplyr package is configured to fall back to dplyr when it encounters an
#> incompatibility. Fallback events can be collected and uploaded for analysis to
#> guide future development. By default, no data will be collected or uploaded.
#> → Run `duckplyr::fallback_sitrep()` to review the current settings.
#> ✔ Overwriting dplyr methods with duckplyr methods.
#> ℹ Turn off with `duckplyr::methods_restore()`.
#>
#> Attaching package: 'duckplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:arrow':
#>
#> duration
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(ncdf4)
library(rerddap)
library(rerddapUtils)
rerddapUtils
is an R
package is a set of four functions desigened to work with and extend the
rerddap
package to provide capabilities requested by users
that meet specialized needs which are better not being included in the
rerddao
package.
The first function is:
griddap_season <- function(datasetx, ..., fields = 'all', stride = 1, season = NULL, fmt = "nc", url = eurl(), store = disk(), read = TRUE, callopts = list())
which will only extract data for the period during the year defined by ‘season’. The second function is:
griddap_split <- function(datasetx, ..., fields = 'all', stride = 1, request_split = NULL, fmt = "nc", url = eurl(), store = disk(), read = TRUE, callopts = list(), aggregate_file = NULL)
which is designed to split a very large request into pieces defined by ‘request_split’ and then aggregate the parts either in memory, in a duckdb database or in a netcdf4 file. This allows for requests larger than 2GB to be made in an appropriate manner.
Note that griddap_season()
and
griddap_split()
are designed to have basically the same
interface as rerddao::griddap()
, except for one or two
extra function arguments and some arguments that are interpted
differently. In both functions the argument ‘store = disk()’ is ignored.
In griddap_season
the ‘fmt’ argument is ignored while
‘season’ defines the days during the year to make the extract.
In griddap_split
the argument ‘fmt’ defines whether the
aggregate download should be stored in memory as a dataframe, in a
duckdb database file, or in a netcdf file. For the duckdb and nectdf
options, ‘aggregate_file’ lets you define where to write the file, if
not defined it will set up a temporary file following proper R guidelines and in both cases the path to the
file will be returned.
The other two functions are designed to make it easier to work with projected datasets:
latlon_to_xy <- function (dataInfo, latitude, longitude, yName = 'latitude', xName = 'longitude', crs = NULL)
which for a given dataset will convert a latitude and longitude
request into the projected coordiantes to be used in
rerddao::griddap()
, while the fourth function:
xy_to_latlon <- function (resp, yName = 'cols', xName = 'rows', crs = NULL)
does the reverse, given an rerddao::griddap()
extract
will convert the projected coordinates into latitiude and longitude
values.
This vignette assumes familiarity with the various ‘rerddap’ functions and how to use them. More information about the various ‘rerddap’ functions can be found in the documentation for that package.
griddap_season()
The function griddap_season()
works exactly like the
function rerddap::griddap
except that there is one extra
argument, “season”, which restricts the extract to the part of the year
defined by season, and also ignores several of the
rerddap::griddap
options, as it will ignore what format to
downaod the data, and will only save to a standard
rerddap::griddap
output structure.
The argument “season” is a character string such that for a given
“year”, paste0(year, ‘-’, season) produces a valid ISO datetime string.
To test “season” is properly formed, use
lubridate::as_datetime()
:
year <- '2023'
season <- c('02-01', '06-01')
# test that "season" is defined properly
test_time <- paste0(year, '-', season)
lubridate::as_datetime(test_time)
#> [1] "2023-02-01 UTC" "2023-06-01 UTC"
Even though the example only includes month and day, hours etc can be included as long as the final result, when a year is appended, produces a valid datetime.
As an example call and result, suppose you have a working
rerddap::griddap()
extract given by:
wind_info <- rerddap::info('erdQMekm14day')
extract <- rerddap::griddap(wind_info,
time = c('2015-01-01','2019-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current'
)
but it is desired to restrict extract to March 1 to May 1 inclusive,
then the following modification of the above using
griddap_season()
produces the desired result:
wind_info <- rerddap::info('erdQMekm14day')
season <- c('03-01', '05-01')
season_extract <- griddap_season(wind_info,
time = c('2015-01-01','2019-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current',
season = season
)
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
To see that the extract has been limited to the given season:
extract_times <- lubridate::as_datetime(season_extract$data$time)
extract_months <- lubridate::month(extract_times)
unique(extract_months)
#> [1] 3 4 5
Only results for months 3, 4, and 5 are included, as per the values of “season”. May is included because the definition of “season” is inclusive, hence May 1 will be included.
griddap_split()
The function griddap_split()
allows to make
rerddap::griddap()
requests that are larger than the 2GB
limit by splitting the request into parts and then combining the results
either in memory in a dataframe, in a netcdf4 file, or into a duckdb
database file. If the “in memory” option is chosen, there is no check
that there is adequate memory, so it is possible to crash both R and
your computer. The nectdf4 and duckdb options depend more on having
adequate disk space, not memory. Also, there is no check that the split
given will reduce the requests to a small enough size, that is up to the
user to estimate.
griddap_split()
uses the same arguments as
griddap_split()
execpt that:
The argument “split” is a named list that defines the number of splits in each dimension, not the values where to make the splits, and a value must be given for each coordinate in the dataset. So for the dataset ‘erdQMekm14day’ above, there are four coordinates, “time”, “altitude”, “latitude” and “longitude”, to break the request into five “time” parts and two “latitude” and “longitude” parts (if you remember your combinatorics the original request will be broken into 20 separate requests) use:
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
Note that:
request_split <- list(time = 5, latitude = 2, longitude = 2)
will fail, even though “altitude” is dimension one. A value must be given for all of the dataset coordinates.
“fmt” defines how the split extracts will be combined to give one
result. If fmt = “memory” then the results are combined into a usual
rerddap::griddap()
in memory, and that structure is
returned by the function. If fmt = “nc” then the extracts wil be
combined into a netcdf4 file, and a path to the netcdf file is returned
by the function. If fmt = “duckdb” then the extracts wil be combined
into a duckdb database file, and a path to the duckdb file is returned
by the function.
As in the previous example suppose I had an
rerddap::griddap()
extract:
wind_info <- rerddap::info('erdQMekm14day')
extract <- rerddap::griddap(wind_info,
time = c('2015-01-01','2016-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current'
)
which would request an extract too large for ERDDAP to fulfill (that is not the case here so a split is not needed, but this illustrates the usage). Then the following modifies the code to split the request and save into a structure in memory:
wind_info <- rerddap::info('erdQMekm14day')
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
time = c('2015-01-01','2016-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current',
request_split = request_split,
fmt = "memory"
)
str(split_extract)
On return, “split_extract” will contain the usual
rerddap::griddap()
structure. Again, there is no check that
there is enough memory to do this operation, so it is possible to crash
R (or your computer) if not used carefully, but it is not unusual for
computers to have 16GB of RAM or more, so extracts a good bit larger
than 2GB can be stored in memory.
If instead it is desired to store the result in a netcdf4 file:
wind_info <- rerddap::info('erdQMekm14day')
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
time = c('2015-01-01','2016-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current',
request_split = request_split,
fmt = "nc"
)
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
#> info() output passed to x; setting base url to: https://upwell.pfeg.noaa.gov/erddap
Since “aggregate_file” is not given, the netcdf file will be written to a temporary space and the path to that file will be returned by the function. The reason for this is it is bad form for an R package to write to a user’s space without permission. To give permission to wrtie to a file of one’s choosing, either give the full path to the file (or if just the name is given it will be written to the present directory), such as for the example above:
wind_info <- rerddap::info('erdQMekm14day')
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
time = c('2015-01-01','2019-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current',
request_split = request_split,
fmt = "nc",
aggregate_file = 'wind_extract.nc'
)
To access the data in the file, any of the various R packages that can read netcdf4 files can be used, in particular ones that allow for only part of the dataset to be read into memory at a time. For example to open the result of the netcdf file created above using ‘ncdf4’ and viewing a summary:
To make the same extract and save to a duckbb database file:
wind_info <- rerddap::info('erdQMekm14day')
request_split <- list(time = 5, altitude = 1, latitude = 2, longitude = 2)
split_extract <- griddap_split(wind_info,
time = c('2015-01-01','2016-01-01'),
latitude = c(20, 40),
longitude = c(220, 240),
fields = 'mod_current',
request_split = request_split,
fmt = "duckdb",
aggregate_file = 'wind_extract.duckdb'
)
There are many tools that can used to access the duckdb database file, among others the packages ‘dplyr’ and ‘duckplyr’ can be used:
con_db <- dbConnect(duckdb::duckdb(), "wind_extract.duckdb")
tbl(con_db, "extract") |>
head(5) |>
collect()
dbDisconnect(con_db, shutdown=TRUE)
Note that the table will always be called “extract”.
The default duckdb format is used because it allows for incremental additions to the dataset on disk, so datasets can be larger than memory. A popular format for storing data is “parquet”, or if it is desired to encode geometry information into the data, then the “geoparquet” format. To copy the duckdb file to parquet:
con_db <- dbConnect(duckdb::duckdb(), "wind_extract.duckdb")
# Use DuckDB's COPY command to write directly to Parquet file without loading into R
query <- sprintf("COPY extract TO '%s' (FORMAT 'parquet')", "wind_extract.parquet")
DBI::dbExecute(con_db, query)
dbDisconnect(con_db, shutdown=TRUE)
To write to a geoparquet file the package geoarrow
cna
be used, which involves first adding the geocoding information - to do
so requires reading the entire data into memory so this will not work
for datasets larger than memory:
con_db <- dbConnect(duckdb::duckdb(), "wind_extract.duckdb")
wind_data <- DBI::dbGetQuery(con_db, "SELECT * FROM extract")
# Create a geometry column using geoarrow using latitude and longitude
wind_data <- sf::st_as_sf(wind_data, coords = c("longitude", "latitude"), crs = 4326)
arrow::write_parquet(wind_data, "wind_extract_geo.parquet")
dbDisconnect(con_db, shutdown=TRUE)
The resulting parquet files are half the size of the duckdb file, or smaller. While these examples are for writing the duckdb database to the parquet format, similar steps can be used if the data are in tibbles, but it may require reading the entire dataset into memory.
latlon_to_xy()
and
xy_to_latlon()
Extracting projected data in ERDDAP (that is not on a
latitude-longitude) grid can be difficult because most people do not
think in terms of the projection in defining the region to make an
extract, and similary once an extract has been made, it is often
desirable to know the coordinates in terms of latitude and longitude.
The functions latlon_to_xy()
and
xy_to_latlon()
are desigend to help with that process.
In both cases the functions try to determine the dataset projection
either from the dataset summary given by rerddap::info()
when projecting the values of a latitude-longitude bounding box, and
when an extract has been made, from the metadata in the extract. This is
done by searching for any of the terms:
proj_strings <- c('proj4text', 'projection', 'proj4string', 'grid_mapping_epsg_code',
'WKT', 'proj_crs_code',
'grid_mapping_epsg_code', 'grid_mapping_proj4',
'grid_mapping_proj4_params', 'grid_mapping_proj4text')
in the metadata. If this is not found, then the user has to provide “CRS”, which can be in any of the forms above. For example the Ice dataset ‘nsidcG02202v4sh1day’ on the Polarwatch ERDDAP is projected. The bounding box of interest is:
then the bounding box in projected coordinates is found from:
myURL <- 'https://polarwatch.noaa.gov/erddap/'
myInfo <- rerddap::info('noaacwVIIRSn20icethickNP06Daily', url = myURL)
coords <- latlon_to_xy(myInfo, longitude, latitude)
coords
Similary, given an extract on a projected coordinate system, the values on terms of latitude and longitude can be found by:
rows <- c( -889533.8, -469356.9)
cols <- c(622858.3, 270983.4)
myURL <- 'https://polarwatch.noaa.gov/erddap/'
myInfo <- rerddap::info('noaacwVIIRSn20icethickNP06Daily', url = myURL)
proj_extract <- rerddap::griddap(myInfo,
time = c('2023-01-30T00:00:00Z', '2023-01-30T00:00:00Z'),
rows = rows,
cols = cols,
fields = 'IceThickness',
url = myURL
)
test <- xy_to_latlon(proj_extract)
head(test)
Note that xy_to_latlon()
will also work with the output
from the functions rerddapXtracto::rxtracto()
,
rerddapXtracto::rxtracto_3D()
and
`rerddapXtracto::rxtractogon()
.