Introduction

rerddap is a general purpose R client for working with ERDDAP servers. ERDDAP is a web service developed by Bob Simons of NOAA. At the time of this writing, there are over sixty ERDDAP servers (though not all are public facing) providing access to literally petabytes of data and model output relevant to oceanography, meteorology, fisheries, and marine mammals, among other areas. ERDDAP is a simple-to-use, RESTful web service that allows data to be subsetted and returned in a variety of formats.

This vignette goes over some of the nuts and bolts of using the rerddap package, and shows the power of the combination of the rerddap package with ERDDAP servers. Some of the examples are taken from the xtractomatic package (available from CRAN - https://cran.r-project.org/package=xtractomatic), and some from the rerddapXtracto package available on Github (https://github.com/rmendels/rerddapXtracto), but reworked to use rerddap directly. Other examples are new to this vignette, and include both gridded and non-gridded datasets from several ERDDAPs.

Installation

The first step is to install the rerddap package. The stable version is available from CRAN:

install.packages("rerddap")

or the development version can be installed from GitHub:

devtools::install_github("ropensci/rerddap")

and to load the library:

library("rerddap")

Besides rerddap, the following libraries are used in this vignette:

library("akima")
library("dplyr")
library("ggplot2")
library("mapdata")
library("ncdf4")
library("plot3D")

Code chunks are always given with the required libraries so that the chunks are more standalone in nature. Many of the plots use the cmocean colormaps designed by Kristen Thyng (see http://matplotlib.org/cmocean/ and https://github.com/matplotlib/cmocean). These colormaps were initially developed for Python, but a version of the colormaps is used in the oce package by Dan Kelley and Clark Richards and that is what is used here.

The main rerddap functions

The complete list of rerddap functions can be seen by looking at the rerddap package help:

?rerddap

and selecting the index of the package. The main functions used here are ed_search(), ed_datasets(), servers(), browse(), info(), griddap(), and tabledap().

Be careful when using the functions ed_search() and ed_datasets(). The default ERDDAP server has over 9,000 datasets, most of which are grids, so a list of all the gridded datasets can be quite long. A seemingly reasonable search:

whichSST <- ed_search(query = "SST")

returns about 1000 responses. The more focused query:

whichSST <- ed_search(query = "SST MODIS")

still returns 172 responses. If the simple search doesn’t narrow things enough, look at the advanced search function ed_search_adv().

Finding the Data You Want

The first way to find a dataset is to browse the built-in web page for a particular ERDDAP server. A list of some of the publicly available ERDDAP servers can be obtained from the rerddap command:

servers()

The second way to find and obtain the desired data is to use functions in rerddap. The basic steps are:

  1. Find the dataset on an ERDDAP server (rerddap::servers(), rerddap::ed_search(), rerddap::ed_datasets()).
  2. Get the needed information about the dataset (rerddap::info()).
  3. Think about what you are going to do.
  4. Make the request for the data (rerddap::griddap() or rerddap::tabledap()).

We discuss each of these steps in more detail, and then look at some realistic uses of the package.

Think about what you are going to do.

This may seem a strange step in the process, but it is important because many of the datasets are high-resolution, and data requests can get very large, very quickly. Consider an example based on a real use case. The MUR SST (Multi-scale Ultra-high Resolution (MUR) SST Analysis fv04.1, see https://coastwatch.pfeg.noaa.gov/erddap/griddap/jplMURSST41.html) is a daily, high-quality, high-resolution sea surface temperature product. The user wanted the MUR data for a 2x2-degree box, daily for a year. That seems innocuous enough, except that MUR SST is at one-hundredth of a degree resolution, so the box contains 200 x 200 = 100 * 100 * 4 grid points. Assuming a binary representation of the data at 8 bytes per value, the math is:

100*100*4*8*365
#> [1] 116800000

Yes, 116,800,000 bytes, or roughly 117 MB, for that request. Moreover, the user wanted the data as a .csv file, which usually makes the resulting file 8-10 times larger, so now we are over 1 GB for the request. Even more so, there are four parameters in that dataset, and if "fields" is not specified in rerddap::griddap(), all fields are downloaded, so the resulting files will be four times as large as given above.
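The arithmetic above can be wrapped in a small helper function for quick back-of-the-envelope estimates. Note that estimate_bytes() is purely illustrative and not part of rerddap:

```r
# Rough size of a binary gridded request: grid points times time steps
# times fields, at 8 bytes per value. Illustrative only, not part of rerddap.
estimate_bytes <- function(n_lon, n_lat, n_time, n_fields = 1,
                           bytes_per_value = 8) {
  n_lon * n_lat * n_time * n_fields * bytes_per_value
}
# the MUR example: a 2x2-degree box at 0.01-degree resolution
# (200 x 200 grid points), daily for a year, one field
estimate_bytes(200, 200, 365)
#> [1] 116800000
```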

So the gist of this is to think about your request before you make it. Do a little mental math to get a rough estimate of the size of the download. There are times when receiving the data as a .csv file is convenient, but make certain the request will not be too large. For larger requests, obtain the data as netCDF files. By default, rerddap::griddap() "melts" the data into a dataframe, so a .csv only provides a small convenience. But for really large downloads, you should select the option in rerddap::griddap() to not read in the data, and instead use the ncdf4 package to read it in, as this allows for reading in only parts of the data at a time. Below we provide a brief tutorial on reading in data using the ncdf4 package.
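As a preview of the ncdf4 approach, the sketch below writes a small netCDF file and then reads back only a 10 x 10 corner of the grid using the start and count arguments of ncvar_get(). The grid and variable here are made up for the example; with a real download you would pass the path of the saved netCDF file to nc_open():

```r
library("ncdf4")
# build a small example file so the partial read below is reproducible
lons <- seq(-140, -105, by = 0.5)
lats <- seq(22, 51, by = 0.5)
lon_dim <- ncdim_def("longitude", "degrees_east", lons)
lat_dim <- ncdim_def("latitude", "degrees_north", lats)
sst_def <- ncvar_def("analysed_sst", "degree_C", list(lon_dim, lat_dim), -999)
fname <- tempfile(fileext = ".nc")
nc_out <- nc_create(fname, sst_def)
ncvar_put(nc_out, sst_def, matrix(rnorm(length(lons) * length(lats)),
                                  length(lons), length(lats)))
nc_close(nc_out)
# read back only a 10 x 10 corner of the grid, not the whole variable
nc_in <- nc_open(fname)
chunk <- ncvar_get(nc_in, "analysed_sst", start = c(1, 1), count = c(10, 10))
nc_close(nc_in)
dim(chunk)
#> [1] 10 10
```

Only the requested 10 x 10 block is read into memory, which is what makes this approach suitable for very large files.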

Some ERDDAP Basics

One of the main advantages of a service such as ERDDAP is that you only need to download the subset of the data you desire, rather than the entire dataset, which is both convenient and essential for large datasets. The underlying data model in ERDDAP is quite simple - everything is either a (multi-dimensional) grid (think R array) or a table (think a simple spreadsheet or data table). Grids are subsetted using the function griddap() and tables are subsetted using the function tabledap().

If you know the datasetID of the data you are after, and are unsure if it is a grid or a table, there are several ways to find out. We will look at two datasets, ‘jplMURSST41’ and ‘siocalcofiHydroCasts’. The first method is to use the rerddap function browse()

browse('jplMURSST41')
browse('siocalcofiHydroCasts')

which brings up information on the datasets in the browser; in the first case the "data" link is under "griddap", in the second under "tabledap".

The other method is to use the rerddap function info():

info('jplMURSST41')
#> <ERDDAP info> jplMURSST41 
#>  Dimensions (range):  
#>      time: (2002-06-01T09:00:00Z, 2017-02-06T09:00:00Z) 
#>      latitude: (-89.99, 89.99) 
#>      longitude: (-179.99, 180.0) 
#>  Variables:  
#>      analysed_sst: 
#>          Units: degree_C 
#>      analysis_error: 
#>          Units: degree_C 
#>      mask: 
#>      sea_ice_fraction: 
#>          Units: fraction
info('siocalcofiHydroCasts')
#> <ERDDAP info> siocalcofiHydroCasts 
#>  Variables:  
#>      ac_line: 
#>      ac_sta: 
#>      barometer: 
#>          Units: millibars 
#>      bottom_d: 
#>          Units: meters 
#>      cast_id: 
#>      civil_t: 
#>          Units: seconds since 1970-01-01T00:00:00Z 
#>      cloud_amt: 
#>          Units: oktas 
#>      cloud_typ: 
#>      cruise_id: 
#>      cruz_leg: 
#>      cruz_num: 
#>      cruz_sta: 
#>      cst_cnt: 
#>      data_or: 
#>      data_type: 
#>      date: 
#>      dbsta_id: 
#>      distance: 
#>      dry_t: 
#>          Units: degree C 
#>      event_num: 
#>      forelu: 
#>          Units: Forel-Ule scale 
#>      inc_end: 
#>          Units: seconds since 1970-01-01T00:00:00Z 
#>      inc_str: 
#>          Units: seconds since 1970-01-01T00:00:00Z 
#>      intc14: 
#>          Units: milligrams Carbon per square meter 
#>      intchl: 
#>      julian_date: 
#>      julian_day: 
#>      latitude: 
#>          Range: 18.417, 47.917 
#>          Units: degrees_north 
#>      latitude_degrees: 
#>      latitude_hemisphere: 
#>      latitude_minutes: 
#>      longitude: 
#>          Range: -164.083, -105.967 
#>          Units: degrees_east 
#>      longitude_degrees: 
#>      longitude_hemisphere: 
#>      longitude_minutes: 
#>      month: 
#>          Range: 1, 12 
#>      order_occ: 
#>      orig_sta_id: 
#>      pst_lan: 
#>          Units: seconds since 1970-01-01T00:00:00Z 
#>      quarter: 
#>      rpt_line: 
#>      rpt_sta: 
#>      secchi: 
#>          Units: meters 
#>      ship_code: 
#>      ship_name: 
#>      st_line: 
#>      st_station: 
#>      sta_code: 
#>      sta_id: 
#>      time: 
#>          Range: -6.5759508E8, 1.423140778E9 
#>          Units: seconds since 1970-01-01T00:00:00Z 
#>      time_ascii: 
#>      timezone: 
#>      visibility: 
#>      wave_dir: 
#>          Units: degrees 
#>      wave_ht: 
#>          Units: feet 
#>      wave_prd: 
#>          Units: seconds 
#>      wea: 
#>      wet_t: 
#>          Units: degree C 
#>      wind_dir: 
#>          Units: degrees 
#>      wind_spd: 
#>          Units: knots 
#>      year: 
#>          Range: 1949, 2015

Notice that the information for 'jplMURSST41' lists dimensions (it is a grid) while that for 'siocalcofiHydroCasts' does not (it is a table).

Subsetting griddap()

Like an R array, ERDDAP grids are subsetted by setting limits on the dimension variables, the difference being that a subset is defined in coordinate space (latitude values, longitude values, time values) rather than array space as is done with R arrays. Thus for ‘jplMURSST41’ if the desired area of the data extract is latitude limits of (22N, 51N), longitude limits of (140W, 105W), and with time limits of (2017-01-01, 2017-01-02) the following would be passed to the function griddap():

latitude = c(22., 51.)
longitude = c(-140., -105)
time = c("2017-01-01", "2017-01-02")

A full griddap() request to retrieve “analysed_sst” with these constraints would be:

sstInfo <- info('jplMURSST41')
murSST <- griddap(sstInfo, latitude = c(22., 51.), longitude = c(-140., -105), time = c("2017-01-01", "2017-01-02"), fields = 'analysed_sst')

Strides

Strides allow you to retrieve data within a coordinate bound at only every "n"th value, where "n" is an integer - think of the "by" part of the R function seq(). This is useful, say, for a monthly dataset where only the December values are desired, or for subsampling a very high resolution dataset. The default is a stride of 1 for all dimensions. If you want to change the stride value in any dimension, then the value must be given for all dimensions. So in the previous example, if only every fifth longitude is desired, the call would be:

murSST <- griddap(sstInfo, latitude = c(22., 51.), longitude = c(-140., -105), time = c("2017-01-01", "2017-01-02"), stride = c(1,1,5), fields = 'analysed_sst')

Strides are done in array space, not in coordinate space - a stride does not skip, say, a number of degrees of longitude, but rather a number of values of the array index - if longitude is thought of as an array, then every fifth value is used. There are many cases where having strides work in coordinate space would be handy, but it can cause a lot of problems. Consider the case where neither the starting longitude nor the ending longitude in the request lies on the actual data grid, and the stride is in coordinate units such that no value requested lies on an actual grid value. This would be equivalent to the more complicated problems of data regridding and data interpolation.

When ERDDAP receives a request where the bounds are not on the actual dataset grid, ERDDAP finds the closest values on the grid to the requested bounds, and returns those points and all grid points in between. Because a stride greater than one is restricted to array space, every value returned is guaranteed to lie on the dataset grid.
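The array-space behavior can be seen with a small example in plain R; the 0.25-degree longitude grid below is hypothetical:

```r
# a hypothetical longitude grid at 0.25-degree resolution
lons <- seq(-140, -105, by = 0.25)
length(lons)
#> [1] 141
# a stride of 5 operates on the array index, keeping every fifth grid value
strided <- lons[seq(1, length(lons), by = 5)]
# the retained values therefore end up 5 * 0.25 = 1.25 degrees apart
diff(strided)[1]
#> [1] 1.25
```

Because the stride selects existing array indices, every retained value is an actual grid point; a coordinate-space skip of, say, 1.3 degrees could land between grid points and would require interpolation.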

Subsetting tabledap()

Tables in ERDDAP are subsetted by using "constraint expressions" on any variable in the table. The valid operators are =, != (not equals), =~ (a regular expression test), <, <=, >, and >=. The constraint is constructed as the parameter on the left, value on the right, and the operator in the middle, all within a set of quotes. For example, if in the SWFSC/FRD trawl catch data (datasetID 'FRDCPSTrawlLHHaulCatch'), only sardines for 2010 were desired, the following constraints would be set in the tabledap() call:

'time>=2010-01-01'
'time<=2010-12-31'
'scientific_name="Sardinops sagax"'

Note that in tabledap() character strings usually must be passed as “double-quoted”, as seen in the example with the scientific name. A full tabledap() request to retrieve ‘latitude’, ‘longitude’, ‘time’, ‘scientific_name’, and ‘subsample_count’ with these constraints would be:

CPSinfo <- info('FRDCPSTrawlLHHaulCatch')
sardines <- tabledap(CPSinfo, fields = c('latitude',  'longitude', 'time', 'scientific_name', 'subsample_count'), 'time>=2010-01-01', 'time<=2010-12-31', 'scientific_name="Sardinops sagax"' )

griddap

MUR SST

MUR (Multi-scale Ultra-high Resolution) is an analyzed SST product at 0.01-degree resolution going back to 2002, providing one of the longest satellite-based time series at such high resolution (see https://podaac.jpl.nasa.gov/dataset/MUR-JPL-L4-GLOB-v4.1). The latest data available for a region off the west coast can be extracted and plotted by:

require("ggplot2")
require("mapdata")
require("rerddap")
sstInfo <- info('jplMURSST41')
# get latest daily sst
murSST <- griddap(sstInfo, latitude = c(22., 51.), longitude = c(-140., -105), time = c('last','last'), fields = 'analysed_sst')
mycolor <- colors$temperature
w <- map_data("worldHires", ylim = c(22., 51.), xlim = c(-140, -105))
ggplot(data = murSST$data, aes(x = lon, y = lat, fill = analysed_sst)) + 
    geom_polygon(data = w, aes(x = long, y = lat, group = group), fill = "grey80") +
    geom_raster(interpolate = FALSE) +
    scale_fill_gradientn(colours = mycolor, na.value = NA) +
    theme_bw() + ylab("latitude") + xlab("longitude") +
    coord_fixed(1.3, xlim = c(-140, -105),  ylim = c(22., 51.)) + ggtitle("Latest MUR SST")

VIIRS SST and Chlorophyll

VIIRS (Visible Infrared Imaging Radiometer Suite) is a scanning radiometer that collects visible and infrared imagery and radiometric measurements of the land, atmosphere, cryosphere, and oceans. VIIRS data are used to measure cloud and aerosol properties, ocean color, sea and land surface temperature, ice motion and temperature, fires, and Earth's albedo. Both NASA and NOAA provide VIIRS-based high-resolution SST and chlorophyll products.

ERD provides a 3-day composite SST product at 750 meter resolution developed from a real-time NOAA product (see http://coastwatch.noaa.gov/cwn/cw_products_sst.html). The most recent values can be obtained by setting "time" to be "last". (Note that R sees the latitude-longitude grid as slightly uneven (even though it is in fact even), and that produces artificial lines in ggplot2::geom_raster(). In order to remove those lines, the latitude-longitude grid is remapped to an evenly spaced grid.)

require("ggplot2")
require("mapdata")
require("rerddap")
sstInfo <- info('erdVHsstaWS3day')
# get latest 3-day composite sst
viirsSST <- griddap(sstInfo, latitude = c(41., 31.), longitude = c(-128., -115), time = c('last','last'), fields = 'sst')
# remap latitudes and longitudes to an even grid
myLats <- unique(viirsSST$data$lat)
myLons <- unique(viirsSST$data$lon)
myLats <- seq(range(myLats)[1], range(myLats)[2], length.out = length(myLats))
myLons <- seq(range(myLons)[1], range(myLons)[2], length.out = length(myLons))
# melt these out to full grid
mapFrame <- expand.grid(x = myLons, y = myLats)
mapFrame$y <- rev(mapFrame$y)
# form a frame with the new values and the data
tempFrame <- data.frame(sst = viirsSST$data$sst, lat = mapFrame$y, lon = mapFrame$x)
mycolor <- colors$temperature
w <- map_data("worldHires", ylim = c(30., 42.), xlim = c(-128, -114))
ggplot(data = tempFrame, aes(x = lon, y = lat, fill = sst)) + 
    geom_polygon(data = w, aes(x = long, y = lat, group = group), fill = "grey80") +
    geom_raster(interpolate = FALSE) +
    scale_fill_gradientn(colours = mycolor, na.value = NA) +
    theme_bw() + ylab("latitude") + xlab("longitude") +
    coord_fixed(1.3, xlim = c(-128, -114),  ylim = c(30., 42.)) + ggtitle("Latest VIIRS 3-day SST")