metScanR

Summary

metScanR is an R package that enables users to quickly locate and work with freely available meteorological (MET) data across multiple networks. The package can currently find data from 107,000 stations in 18 different networks across the globe. The wide range of networks, each with its own documentation, meta-data, data formats, and even station identifiers, can pose a major roadblock to finding, wrangling, and synthesizing MET data.

metScanR currently allows a user to ‘bypass’ many of the steps involved in finding MET data. A user can specify:

  • a location
  • a search distance (radius)
  • a sampling date range
  • a MET network (e.g. COOP, USCRN, USRCRN, ASOS, AWOS, SNOTEL, SCAN, NEON)
  • meteorological variables

metScanR will return an R list object containing all weather stations that meet the criteria. metScanR also empowers users to explore data by providing an interactive map of all returned MET stations (powered by Leaflet).

Tutorial Outline

This brief tutorial is intended for all users, whether or not they are familiar with R. The R code highlighted below can be copied, pasted, and executed inside an R script.

A general workflow for locating meteorological data is outlined below.

There are two primary functions that a user can interact with: siteFinder, which searches for stations, and mapSiteFinder, which maps the results.

In this example we’ll do the following:

Version Updates: From 0.0.1 to 1.0.0

The release of version 1.0.0 brought not only more MET station data (from 13,000 stations to 107,000) but also enhanced functionality. The expansion in MET station data required a total re-working of how data are stored and accessed. The current data set occupies a 600 megabyte binary file; this large file is loaded from a remote GitHub repo into R’s ‘background’ when the package is loaded.

The roughly eightfold increase in data also necessitated a new data structure (nested lists) and new functions to access those data. A series of get... functions is now used to quickly pull data from the metScanR_DB object. Storing data as a list rather than a data.frame has two key benefits: (1) lists take up less disk space (and random access memory) and (2) lists are processed much more quickly.

Core Functions

  • siteFinder : an all-in-one “wrapper” that accesses the functionality of several lower-level get functions (listed below)
    • Note: using the get functions directly will return data more quickly
  • mapSiteFinder : will map the outputs returned from siteFinder and/or get functions
  • getCountry : query MET stations by country of origin
  • getDates : query by startDate, endDate, or date range (startDate AND endDate)
  • getElevation : query by station elevation
  • getId : find specific stations by unique identifier
  • getNearby : find stations within a given radius of a specific location
  • getNetwork : return list of stations from a given MET network
  • getVars : query stations by environmental variables measured
    • Note: this function employs a ‘fuzzy search’ to parse through the hundreds of variables measured across MET stations. As an example, a user can narrow the granularity of their search by increasing the number of key words entered e.g. one can search for “wind”, “wind speed”, or “5-second wind speed”.
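The narrowing effect of extra keywords can be illustrated with a small sketch. This is not metScanR's actual matching code: it assumes a hypothetical helper, matchAllWords(), and a made-up vector of variable descriptions, and it simply keeps the descriptions that contain every word of the query.

```r
# Hypothetical variable descriptions (not taken from metScanR_DB)
varDescriptions <- c("average wind speed",
                     "5-second wind speed",
                     "wind direction",
                     "air temperature")

# Hypothetical helper: keep descriptions that contain every word of the query
matchAllWords <- function(query, descriptions) {
  words <- strsplit(tolower(query), "\\s+")[[1]]
  hits <- vapply(descriptions, function(d) {
    all(vapply(words, function(w) grepl(w, tolower(d), fixed = TRUE), logical(1)))
  }, logical(1))
  unname(descriptions[hits])
}

matchAllWords("wind", varDescriptions)                # three descriptions match
matchAllWords("wind speed", varDescriptions)          # two descriptions match
matchAllWords("5-second wind speed", varDescriptions) # one description matches
```

Each added word can only remove candidates, which is why longer queries give more specific results.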

Meta-Data Structure

An object, metScanR_DB, is imported into the metScanR environment when the library is loaded in R. This metScanR_DB object contains all meta-data for the MET stations captured by this project. All metScanR functions return meta-data in a nested and named list that follows the same general structure for every station in the database.

If we keep the first station returned by a getVars call in an object named “data” (e.g. data <- getVars("air temperature")[1], where [1] selects the first station in the returned list), the data are formatted as follows (with a description in parentheses):

  • data$stationid (unique identifier) [example: data$USW00094893]
  • data$stationid$namez (station name) [example: data$USW00094893$namez]
  • data$stationid$identifiers (associated MET networks) [example: data$USW00094893$identifiers]
    • data$stationid$identifiers$idType [example: data$USW00094893$identifiers$idType]
    • data$stationid$identifiers$id
  • data$stationid$platform (primary MET network)
  • data$stationid$elements (environmental variables and start/end dates for monitoring)
    • data$stationid$elements$element
    • data$stationid$elements$date.begin
    • data$stationid$elements$date.end
  • data$stationid$location (geographic information)
    • data$stationid$location$latitude_dec
    • data$stationid$location$longitude_dec
    • data$stationid$location$elev
    • data$stationid$location$country
    • data$stationid$location$state
    • data$stationid$location$county
    • data$stationid$location$utcoffset
    • data$stationid$location$date.begin
    • data$stationid$location$date.end
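To make the nesting concrete, the sketch below builds a one-station mock list by hand and walks through it. It does not call metScanR; the values are copied from the NATCHEZ station printed later in this tutorial, and it assumes (as that printed output suggests) that identifiers, elements, and location are data frames.

```r
# Mock database with one station, mirroring the documented structure.
# Built by hand for illustration; this is not a call to metScanR.
mockDB <- list(
  USC00226177 = list(
    namez = "NATCHEZ",
    identifiers = data.frame(idType = c("GHCND", "COOP"),
                             id = c("USC00226177", "226177"),
                             stringsAsFactors = FALSE),
    platform = "COOP",
    elements = data.frame(element = c("PRCP", "TMAX"),
                          date.begin = c("1892", "1892"),
                          date.end = c("2017", "2017"),
                          stringsAsFactors = FALSE),
    location = data.frame(latitude_dec = 31.589,
                          longitude_dec = -91.3409,
                          elev = 59.4)
  )
)

station <- mockDB$USC00226177      # one station
station$platform                   # primary MET network
station$elements$element           # variables measured
station$location$latitude_dec      # latitude in decimal degrees
mockDB[["USC00226177"]]$namez      # [[ ]] is handy when the id is held in a string
```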

Future Directions

In the near future, metScanR will provide functionality for directly downloading MET data via existing APIs. We are also planning on including meta-data from Ameriflux and NADP stations.

Installation

  • Install official releases from CRAN with
install.packages("metScanR")

If you encounter a bug, please provide a reproducible example on this package’s GitHub issues page.

Getting Started

Search for meteorological (MET) data

Scenario 1: locate data via latitude and longitude

Find station meta-data near a given coordinate, assign the output to object scenario1:

  • NOTE: loading the metScanR library will take longer than normal (20-30 seconds) as it loads the large MET station data set into R
library(metScanR)
## Warning: package 'metScanR' was built under R version 3.3.3
## Welcome to metScanR! This package takes a few extra seconds to load because it checks for updates to an external database upon startup.  Thank you for your patience.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
scenario1 <- siteFinder(lat=40.05,lon=-105.27,startDate="2000-01-05",radius=45) # returns 40 stations

Scenario 2: locate data by location, MET network, and environmental variable of interest

Our search can be narrowed to more specific variables of interest via the vars argument. There are hundreds of weather/climatological variables included in the metScanR database. Rather than requiring the user to know the exact syntax, case, phrase, or network of variables before they search the database, we have implemented a ‘fuzzy search’ function that interprets a user’s text input for the vars argument. The ‘fuzzy search’ function attempts to match the user’s input against a large set of overlapping keywords for each variable listed in the database.

In this example we search for any variables associated with air temperature measurements:

scenario2 <- siteFinder(lat=40.05,lon=-105.27,startDate="2000-01-05",radius=45,network="COOP",vars="air temperature", includeUnk = TRUE) # returns 15 stations

Typing the object name scenario2 shows the resulting output, where values such as “TMAX”, “TMIN”, and “AVGT” are returned (i.e. variables associated with air temperature measurements). Note that an error message is the resulting output if none of the stations within the search have variables that match the vars argument.
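Printing a large result can be hard to scan, so it may help to collect just the variable names. The sketch below is one way to do that in base R; it uses a hand-built mock list with made-up station names rather than a real siteFinder result, and assumes each station has an elements table with an element column, as described in the Meta-Data Structure section.

```r
# Mock search result with two made-up stations (not real metScanR output)
mockResult <- list(
  stationA = list(elements = data.frame(element = c("TMAX", "TMIN"),
                                        stringsAsFactors = FALSE)),
  stationB = list(elements = data.frame(element = c("TMAX", "AVGT"),
                                        stringsAsFactors = FALSE))
)

# Variable names per station
varsByStation <- lapply(mockResult, function(s) s$elements$element)

# Unique variable names across all stations
allVars <- sort(unique(unlist(varsByStation, use.names = FALSE)))
allVars
```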

Visualize metScanR function output

metScanR was designed to support users who want to quickly and interactively locate meteorological data. The output of siteFinder() and the get... functions can therefore be visualized by passing it to the mapSiteFinder() function:

mapSiteFinder(scenario1)

The mapSiteFinder() function produces an interactive Leaflet map with every MET station represented by a colored circle. Colors denote which network a particular station belongs to. Users can click on circles to view key station meta-data such as name, platform, unique identifier, monitoring start date, monitoring end date, and station elevation. Users can also pan, zoom, and click on stations to better visualize the spatial arrangement of stations and/or networks.

mapSiteFinder(scenario1) returns 56 stations across 5 networks. Attempting to find, download, and organize data from this many stations and networks can be a time-consuming task. One strategy for reducing the complexity of this task is to ‘eyeball’ which MET networks have the most extensive spatial coverage or time series (i.e. broadest monitoring start and end dates).

A brief glance at the map shows that the “COOP” network fits our criteria:

scenario3 <- siteFinder(lat=40.05,lon=-105.27,startDate="2000-01-05",radius=45,network="COOP")
mapSiteFinder(scenario3)
## count the number of stations
length(names(scenario3)) 
## [1] 19

Here we see that the data set has been reduced to 19 stations from the same network, thus simplifying our original task.
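As a complement to eyeballing the map, stations can be tallied by network. The following is a minimal sketch on a hand-built mock list with made-up station names; on a real result you would pass the siteFinder output in place of mockResult, assuming each station carries a single platform string as described in the Meta-Data Structure section.

```r
# Mock search result: three made-up stations and their primary networks
mockResult <- list(
  stationA = list(platform = "COOP"),
  stationB = list(platform = "COOP"),
  stationC = list(platform = "USCRN")
)

# Pull the primary network out of each station, then count stations per network
platforms <- vapply(mockResult, function(s) s$platform, character(1))
networkCounts <- sort(table(platforms), decreasing = TRUE)
networkCounts
```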

Exploring metScanR functionality

The next section shows how to browse the many environmental variables logged by metScanR.

Explore environmental variables listed in metScanR_DB

There are 830 different environmental variables listed in the metScanR_DB. You can search through the list of terms associated with any station by combining two functions in RStudio, View() and plyr’s ldply, to neatly display, and then search through, the data:

View(plyr::ldply(metScanR:::metScanR_terms))

In the search bar, typing in a term like “temperature” will narrow the number of rows displayed.
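Outside RStudio, the same kind of search can be done programmatically. The sketch below uses base R only and a hypothetical stand-in for the terms list (mockTerms, with made-up term and description columns); the layout of the real metScanR_terms object may differ, so treat the column names as assumptions.

```r
# Hypothetical stand-in for the terms list; the real object's layout may differ
mockTerms <- list(
  TMAX = data.frame(term = "TMAX", description = "maximum air temperature",
                    stringsAsFactors = FALSE),
  TMIN = data.frame(term = "TMIN", description = "minimum air temperature",
                    stringsAsFactors = FALSE),
  PRCP = data.frame(term = "PRCP", description = "precipitation",
                    stringsAsFactors = FALSE)
)

# Stack the list into one data frame (a base R alternative to plyr::ldply)
termsTable <- do.call(rbind, mockTerms)

# Keep rows whose description mentions "temperature", ignoring case
hits <- termsTable[grepl("temperature", termsTable$description, ignore.case = TRUE), ]
hits
```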

Practical Use Cases with metScanR

Locate and map the occurrence of a ‘rarely’ monitored environmental variable

mapSiteFinder(getVars("conductivity"))

Map an entire global MET network

  • Shout out to the RNRCS package authors for adding NRCS station meta-data!
## Save huge map as a PDF or png to share with others, or render as HTML
maps <- mapSiteFinder(getNetwork("SCAN"))
maps

Find the oldest continuously monitoring weather station

oldest <- getDates(startDate = "1800-01-01", endDate="2015-01-01") # search for stations that begin at the turn of the 19th century and monitor to 2015
length(oldest) # count to see how many stations metScanR finds
## [1] 2
names(oldest) # see what the stationIDs are
## [1] "USC00226177" "USW00013782"
oldest$USC00226177
## $namez
## [1] "NATCHEZ"
## 
## $identifiers
##       idType          id
## 1      GHCND USC00226177
## 5    GHCNMLT USC00226177
## 9       COOP      226177
## 13     NWSLI       NATM6
## 14 NCDCSTNID    20011197
## 
## $platform
## [1] "COOP"
## 
## $elements
##      element date.begin   date.end
## 1     PRECIP 1948-01-01 2017-04-17
## 2       TEMP 1948-01-01    present
## DAPR    DAPR       1953       2016
## MDPR    MDPR       1953       2016
## PRCP    PRCP       1892       2017
## SNOW    SNOW       1894       2017
## SNWD    SNWD       1909       2017
## TMAX    TMAX       1892       2017
## TMIN    TMIN       1892       2017
## TOBS    TOBS       1901       2017
## WT01    WT01       1906       1962
## WT03    WT03       1915       1991
## WT04    WT04       1898       2011
## WT05    WT05       1915       1985
## WT06    WT06       1936       2011
## WT07    WT07       1935       1944
## WT08    WT08       1919       1949
## WT11    WT11       1934       1990
## WT14    WT14       1924       1978
## WT16    WT16       1895       1929
## WT18    WT18       1898       1924
## 
## $location
##   latitude_dec longitude_dec elev       country state county utcoffset
## 1       31.589      -91.3409 59.4 UNITED STATES    MS  ADAMS        -6
##   date.begin date.end
## 1 1799-01-01  present
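Note that the $elements table printed above mixes full dates (e.g. 1948-01-01) with bare years (e.g. 1892), so the start dates cannot be compared directly as dates. One possible workaround, sketched here on a few rows copied by hand from that output, is to compare only the year:

```r
# A few rows copied from the $elements table printed above
elements <- data.frame(
  element = c("PRECIP", "PRCP", "SNOW", "TMAX"),
  date.begin = c("1948-01-01", "1892", "1894", "1892"),
  stringsAsFactors = FALSE
)

# The first four characters hold the year in both formats
elements$year.begin <- as.integer(substr(elements$date.begin, 1, 4))

# Variables with the earliest recorded start year
earliest <- elements$element[elements$year.begin == min(elements$year.begin)]
earliest
```

This also shows that the station-level start date under $location (1799) can be earlier than the first year recorded for any individual variable.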

Find multiple variables of interest where sensors are co-located

## find stations with soil moisture
soilMoisture <- getVars("soil moisture")
## find stations with snow depth
snow <- getVars("snow depth")

## determine which stations have both variables by finding intersecting station names between the two lists
colocated <- intersect(names(soilMoisture), names(snow))
colocated %>% head
## [1] "SCAN:2221" "SCAN:2214" "SCAN:2216" "SCAN:2213" "SCAN:2210" "SCAN:2211"

Returning just the meta-data for the colocated stations is a little more involved, as it requires filtering the metScanR_DB list object. You can subset a list either by numeric index or the element name (if it exists). The nicely formatted metScanR_DB object is a named list, so we access the colocated data via the station’s identifier:

## this method takes about 1.7 seconds on my system
## for each stationid, subset the metScanR_DB object and return the meta-data as a smaller list
colocated_metadata <- lapply(colocated, function(x) metScanR:::metScanR_DB[x])
colocated_metadata %>% head(3)
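The difference between the two ways of subsetting a named list can be seen on a small mock list. This is a sketch only: mockDB is built by hand, and the third identifier and all station names are made up.

```r
# Mock named list standing in for metScanR_DB (made-up contents)
mockDB <- list("SCAN:2221" = list(namez = "station one"),
               "SCAN:2214" = list(namez = "station two"),
               "COOP:0001" = list(namez = "station three"))
ids <- c("SCAN:2221", "SCAN:2214")

# Single-bracket subsetting with a character vector keeps the station names
subsetByName <- mockDB[ids]
names(subsetByName)

# lapply() over the ids returns an unnamed list of one-element lists
subsetByLapply <- lapply(ids, function(x) mockDB[x])
names(subsetByLapply)        # NULL: the outer list has no names
names(subsetByLapply[[1]])   # the station id sits one level down
```

Which shape is more convenient depends on what the result is passed to next; the single-bracket form keeps the same flat, named layout as the database itself.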

Write station meta-data to an external file

Since metScanR only returns station meta-data at the moment (functionality for downloading station data directly will be implemented in the future), users may want a ‘hard copy’ .csv of station identifiers to plug into various APIs or websites supported by MET networks:

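## write the station identifiers (the names of the list) to a one-column .csv
write.csv(x = data.frame(stationid = names(scenario1)), file = "path/to/your/folder/metScanR_output.csv", row.names = FALSE, na = "")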

You must specify a local file path for the output via the file argument.
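If more than the identifiers is needed, one option is to flatten a few fields of each station into a single data frame before writing. The sketch below works on a hand-built mock list (the second station is entirely made up) and writes to a temporary folder; the choice of columns is an assumption, not something metScanR prescribes.

```r
# Mock search result with two stations (the second one is made up)
mockResult <- list(
  USC00226177 = list(namez = "NATCHEZ", platform = "COOP",
                     location = data.frame(latitude_dec = 31.589,
                                           longitude_dec = -91.3409,
                                           elev = 59.4)),
  STATION_B = list(namez = "EXAMPLE", platform = "SCAN",
                   location = data.frame(latitude_dec = 40.0,
                                         longitude_dec = -105.0,
                                         elev = 1600))
)

# One row per station: identifier, name, network, and coordinates
stationTable <- do.call(rbind, lapply(names(mockResult), function(id) {
  s <- mockResult[[id]]
  data.frame(stationid = id,
             name = s$namez,
             platform = s$platform,
             latitude = s$location$latitude_dec,
             longitude = s$location$longitude_dec,
             elevation = s$location$elev,
             stringsAsFactors = FALSE)
}))

# Write to a temporary folder and read the file back to check it
outFile <- file.path(tempdir(), "metScanR_output.csv")
write.csv(stationTable, file = outFile, row.names = FALSE, na = "")
read.csv(outFile, stringsAsFactors = FALSE)
```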