Wait for rgbif download to complete before proceeding - r

I am developing a small application in R Shiny. Part of the application will need to query GBIF to download species occurrence data. This is possible using rgbif. The function rgbif::occ_download() will download the data and rgbif::occ_download_meta() will check whether GBIF has fulfilled your request. For example:
geometry <- "POLYGON((30.1 10.1,40 40,20 40,10 20,30.1 10.1))"
res <- occ_download(paste0("geometry within ", geometry), type = "within", format = "SPECIES_LIST")
occ_download_meta(res)
<<gbif download metadata>>
Status: RUNNING
Format: SPECIES_LIST
Download key: 0004089-190415153152247
Created: 2019-04-25T09:18:20.952+0000
Modified: 2019-04-25T09:18:21.045+0000
Download link: http://api.gbif.org/v1/occurrence/download/request/0004089-190415153152247.zip
Total records: 0
So far, so good. However, the following function rgbif::occ_download_get() can't download the data for downstream analysis until occ_download_meta(res) has completed (when Status = SUCCEEDED).
How can I make the session wait until the download from GBIF has been completed? I cannot hard code a wait time into the script as different sized extents will take GBIF longer or shorter amounts of time to process. Also, the number of other active users querying the service could also alter wait times. I therefore need some sort of flag where Status == Succeeded before proceeding.
I have copied some skeleton code with comments below.
library(rgbif)
geometry <- "POLYGON((30.1 10.1,40 40,20 40,10 20,30.1 10.1))" # Define boundary
res <- occ_download(paste0("geometry within ", geometry), type = "within", format = "SPECIES_LIST")
# WAIT HERE UNTIL Status == SUCCEEDED
occ_download_meta(res)
x <- occ_download_get(res, overwrite = TRUE) # Download data
data<-occ_download_import(x) # Import into R

rgbif maintainer here. You could do something like we have within the occ_download_queue() function:
res <- occ_download(paste0("geometry within ", geometry), type = "within", format = "SPECIES_LIST")
still_running <- TRUE
status_ping <- 3
while (still_running) {
meta <- occ_download_meta(res)
status <- meta$status
still_running <- status %in% c("succeeded", "killed")
Sys.sleep(status_ping) # sleep between pings
}
you probably want to check for succeeded and killed, and do something different if killed

Related

how to write out multiple files in R?

I am a newbie R user. Now, I have a question related to write out multiple files with different names. Lets says that my data has the following structure:
IV_HAR_m1<-matrix(rnorm(1:100), ncol=30, nrow = 2000)
DV_HAR_m1<-matrix(rnorm(1:100), ncol=10, nrow = 2000)
I am trying to estimate multiple LASSO regressions. At the beginning, I was storing the iterations in one object called Dinamic_beta. This object was stored in only one file, and it saves the required information each time that my code iterate.
For doing this I was using stew which belongs to pomp package, but the total process takes 5 or 6 days and I am worried about a power outage or a fail in my computer.
Now, I want to save each environment (iterations) in a .Rnd file. I do not know how can I do that? but the code that I am using is the following:
library(glmnet)
library(Matrix)
library(pomp)
space <- 7 #THE NUMBER OF FILES THAT I would WANT TO CREATE
Dinamic_betas<-array(NA, c(10, 31, (nrow(IV_HAR_m1)-space)))
dimnames(Dinamic_betas) <- list(NULL, NULL)
set.seed(12345)
stew( #stew save the enviroment in a .Rnd file
file = "Dinamic_LASSO_RD",{ # The name required by stew for creating one file with all information
for (i in 1:dim(Dinamic_betas)[3]) {
tryCatch( #print messsages
expr = {
cv_dinamic <- cv.glmnet(IV_HAR_m1[i:(space+i-1),],
DV_HAR_m1[i:(space+i-1),], alpha = 1, family = "mgaussian", thresh=1e-08, maxit=10^9)
LASSO_estimation_dinamic<- glmnet(IV_HAR_m1[i:(space+i-1),], DV_HAR_m1[i:(space+i-1),],
alpha = 1, lambda = cv_dinamic$lambda.min, family = "mgaussian")
coefs <- as.matrix(do.call(cbind, coef(LASSO_estimation_dinamic)))
Dinamic_betas[,,i] <- t(coefs)
},
error = function(e){
message("Caught an error!")
print(e)
},
warning = function(w){
message("Caught an warning!")
print(w)
},
finally = {
message("All done, quitting.")
}
)
if (i%%400==0) {print(i)}
}
}
)
If someone can suggest another package that stores the outputs in different files I will grateful.
Try adding this just before the close of your loop
save.image(paste0("Results_iteration_",i,".RData"))
This should save your entire workspace to disk for every iteration. You can then use load() to load the workspace of every environment. Let me know if this works.

Nestled Loop not Working to gather data from NOAA

I'm using the R package rnoaa(along with it required other packages) to gather historical weather data. I wrote this nestled loop to gather all the data sets but I keep getting errors when I run it. It seems to run for a second fine
The loop:
require('triebeard')
require('bindr')
require('colorspace')
require('mime')
require('curl')
require('openssl')
require('R6')
require('urltools')
require('httpcode')
require('stringr')
require('assertthat')
require('bindrcpp')
require('glue')
require('magrittr')
require('pkgconfig')
require('rlang')
require('Rcpp')
require('BH')
require('plogr')
require('purrr')
require('stringi')
require('tidyselect')
require('digest')
require('gtable')
require('plyr')
require('reshape2')
require('lazyeval')
require('RColorBrewer')
require('dichromat')
require('munsell')
require('labeling')
require('viridisLite')
require('data.table')
require('rjson')
require('httr')
require('crul')
require('lubridate')
require('dplyr')
require('tidyr')
require('ggplot2')
require('scales')
require('XML')
require('xml2')
require('jsonlite')
require('rappdirs')
require('gridExtra')
require('tibble')
require('isdparser')
require('geonames')
require('hoardr')
require('rnoaa')
install.package('ncdf4')
install.packages("devtools")
library(devtools)
install_github("rnoaa", "ropensci")
library(rnoaa)
list <- buoys(dataset='wlevel')
lid <- data.frame(list$id)
foo <- for(range in 1990:2017){
for(bid in lid){
bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range)
bid.year.data <- data.frame(bid.year$data)
write.csv(bid.year.data, file='cwind/bid_range.csv')
}
}
The response:
Using c1990.nc
Using
Error: length(url) == 1 is not TRUE
It saves the first data-set but it does not apply the for in the file name it just names it bid_range.csv.
This error message shows that there are no any data of a given station id in 1990. Because you were using for loop, once it gots an error, it stops.
Here I introduce the use of tidyverse to download the NOAA buoy data. A lot of the following functions are from the purrr package, which is part of the tidyverse.
# Load packages
library(tidyverse)
library(rnoaa)
Step 1: Create a "Grid" containing all combination of id and year
The expand function from tidyr can create the combination of different values.
data_list <- buoys(dataset = 'wlevel')
data_list2 <- data_list %>%
select(id) %>%
expand(id, year = 1990:2017)
Step 2: Create a "safe" version that does not break when there is no data.
Also make this function suitable for the map2 function
Because we will use map2 to loop through all the combination of id and year using the map2 function by its .x and .y argument. We modified the sequence of argument to create buoy_modify. We also use the safely function to create a safe version of buoy_modify. Now when it meets error, it will store the error message and moves to the next one rather than breaks.
# Modify the buoy function
buoy_modify <- function(buoyid, year, dataset, ...){
buoy(dataset, buoyid = buoyid, year = year, ...)
}
# Creare a safe version of buoy_modify
buoy_safe <- safely(buoy_modify)
Step 3: Apply the buoy_safe function
wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel")
# Assign name for the element in the list based on id and year
names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_")
After this step, all the data were downloaded in wlevel_data. Each element in wlevel_data has two parts. $result shows the data if the download is successful, otherwise, it shows NULL. $error shows NULL if the download is successful, otherwise, it shows the error message.
Step 4: Access the data
transpose can turn a list "inside out". So now wlevel_data2 has two elements: result and error. We can store these two and access the data.
# Turn the list "inside out"
wlevel_data2 <- transpose(wlevel_data)
# Get the error message
wlevel_error <- wlevel_data2$error
# Get he result
wlevel_result <- wlevel_data2$result
# Remove NULL element in wlevel_result
wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]

Using R to access FTP Server and Download Files Results in Status "530 Not logged in"

What I'm Attempting to Do
I'm attempting to download several weather data files from the US National Climatic Data Centre's FTP server but am running into problems with an error message after successfully completing several file downloads.
After successfully downloading two station/year combinations I start getting an error "530 Not logged in" message. I've tried starting at the offending year and running from there and get roughly the same results. It downloads a year or two of data and then stops with the same error message about not being logged in.
Working Example
Following is a working example (or not) with the output truncated and pasted below.
options(timeout = 300)
ftp <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/"
td <- tempdir()
station <– c("983240-99999", "983250-99999", "983270-99999", "983280-99999", "984260-41231", "984290-99999", "984300-99999", "984320-99999", "984330-99999")
years <- 1960:2016
for (i in years) {
remote_file_list <- RCurl::getURL(
paste0(ftp, "/", i, "/"), ftp.use.epsv = FALSE, ftplistonly = TRUE,
crlf = TRUE, ssl.verifypeer = FALSE)
remote_file_list <- strsplit(remote_file_list, "\r*\n")[[1]]
file_list <- paste0(station, "-", i, ".op.gz")
file_list <- file_list[file_list %in% remote_file_list]
file_list <- paste0(ftp, i, "/", file_list)
Map(function(ftp, dest) utils::download.file(url = ftp,
destfile = dest, mode = "wb"),
file_list, file.path(td, basename(file_list)))
}
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1960/983250-99999-1960.op.gz'
Content type 'unknown' length 7135 bytes
==================================================
downloaded 7135 bytes
...
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1961/984290-99999-1961.op.gz'
Content type 'unknown' length 7649 bytes
==================================================
downloaded 7649 bytes
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz'
downloaded 0 bytes
Error in utils::download.file(url = ftp, destfile = dest, mode = "wb") :
cannot download all files In addition: Warning message:
In utils::download.file(url = ftp, destfile = dest, mode = "wb") :
URL ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz':
status was '530 Not logged in'
Different Methods and Ideas I've Tried but Haven't Yet Been Successful
So far I've tried to slow the requests down using Sys.sleep in a for loop and any other manner of retrieving the files more slowly by opening then closing connections, etc. It's puzzling because: i) it works for a bit then stops and it's not related to the particular year/station combination per se; ii) I can use nearly the exact same code and download much larger annual files of global weather data without any errors over a long period of years like this; and iii) it's not always stopping after 1961 going to 1962, sometimes it stops at 1960 when it starts on 1961, etc., but it does seem to be consistently between years, not within from what I've found.
The login is anonymous, but you can use userpwd "ftp:your#email.address". So far I've been unsuccessful in using that method to ensure that I was logged in to download the station files.
I think you're going to need a more defensive strategy when working with this FTP server:
library(curl) # ++gd > RCurl
library(purrr) # consistent "data first" functional & piping idioms FTW
library(dplyr) # progress bar
# We'll use this to fill in the years
ftp_base <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%s/"
dir_list_handle <- new_handle(ftp_use_epsv=FALSE, dirlistonly=TRUE, crlf=TRUE,
ssl_verifypeer=FALSE, ftp_response_timeout=30)
# Since you, yourself, noted the server was perhaps behaving strangely or under load
# it's prbly a much better idea (and a practice of good netizenship) to cache the
# results somewhere predictable rather than a temporary, ephemeral directory
cache_dir <- "./gsod_cache"
dir.create(cache_dir, showWarnings=FALSE)
# Given the sporadic efficacy of server connection, we'll wrap our calls
# in safe & retry functions. Change this variable if you want to have it retry
# more times.
MAX_RETRIES <- 6
# Wrapping the memory fetcher (for dir listings)
s_curl_fetch_memory <- safely(curl_fetch_memory)
retry_cfm <- function(url, handle) {
i <- 0
repeat {
i <- i + 1
res <- s_curl_fetch_memory(url, handle=handle)
if (!is.null(res$result)) return(res$result)
if (i==MAX_RETRIES) { stop("Too many retries...server may be under load") }
}
}
# Wrapping the disk writer (for the actual files)
# Note the use of the cache dir. It won't waste your bandwidth or the
# server's bandwidth or CPU if the file has already been retrieved.
s_curl_fetch_disk <- safely(curl_fetch_disk)
retry_cfd <- function(url, path) {
# you should prbly be a bit more thorough than `basename` since
# i think there are issues with the 1971 and 1972 filenames.
# Gotta leave some work up to the OP
cache_file <- sprintf("%s/%s", cache_dir, basename(url))
if (file.exists(cache_file)) return()
i <- 0
repeat {
i <- i + 1
if (i==6) { stop("Too many retries...server may be under load") }
res <- s_curl_fetch_disk(url, cache_file)
if (!is.null(res$result)) return()
}
}
# the stations and years
station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999",
"984260-41231", "984290-99999", "984300-99999", "984320-99999",
"984330-99999")
years <- 1960:2016
# progress indicators are like bowties: cool
pb <- progress_estimated(length(years))
walk(years, function(yr) {
# the year we're working on
year_url <- sprintf(ftp_base, yr)
# fetch the directory listing
tmp <- retry_cfm(year_url, handle=dir_list_handle)
con <- rawConnection(tmp$content)
fils <- readLines(con)
close(con)
# sift out only the target stations
map(station, ~grep(., fils, value=TRUE)) %>%
keep(~length(.)>0) %>%
flatten_chr() -> fils
# grab the stations files
walk(paste(year_url, fils, sep=""), retry_cfd)
# tick off progress
pb$tick()$print()
})
You may also want to set curl_interrupt to TRUE in the curl handle if you want to be able to stop/esc/interrupt the downloads.

"download.file" Incomplete and inconsistent downloads

Am trying to understand why I am having inconsistent results downloading CSV files from a website archive. Don't know if the problem is at my end, the other side or just failed communications in between. Any suggestions are welcomed.
Using a R script to automate the downloading of CSV files by month and year from the HYCOM archives for analysis. The script generated the following URL trying URL 'http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly?var=salinity&var=water_temp&var=water_u&var=water_v&latitude=13.875&longitude=-72.25&time_start=2012-05-01T00:00:00Z&time_end=2012-05-31T21:00:00Z&vertCoord=&accept=csv'
Running download.file successfully obtains the file about half the time, otherwise fails. Any suggestions are welcomed. The images below shows the failed run. Successful run is below.
Successful Log
#download one month of data
MM = '05'
LastDay = ndays(paste(year,MM,'01',sep="-"))
H1 = paste( as shown in image)
H2 = '-01T00:00:00Z&time_end='
#H3 = 'T21:00:00Z&timeStride=1&vertCoord=&accept=csv'
H3 = 'T21:00:00Z&vertCoord=&accept=csv'
HtmlLink <- paste(H1,year,"-",MM,H2,year,"-",MM,"-",LastDay,H3,sep="")
dest = paste("../data/",year,MM,".csv",sep="")
download.file(url =HtmlLink ,destfile=dest,cacheOK=FALSE, method="auto")
trying URL 'as shown in image'
Content type 'text/plain;charset=UTF-8' length unknown
..................................................
................downloaded 666 KB
user system elapsed
28.278 6.605 5201.421
LOG OF FAILED RUN
You can/should turn the following into a function accepting parameters and replace the hardcoded values with said params (I used httr:::parse_query() to make the list):
library(httr)
URL <- "http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly"
params <- list(var = "salinity",
var = "water_temp",
var = "water_u",
var = "water_v",
latitude = "13.875",
longitude = "-72.25",
time_start = "2012-05-01T00:00:00Z",
time_end = "2012-05-31T21:00:00Z",
vertCoord = "",
accept = "csv")
dest_file <- "filename"
res <- GET(url=URL,
query=params,
timeout(360),
write_disk(dest_file, overwrite=TRUE),
verbose())
warn_for_status(res)
You can (eventually) remove the verbose() from that GET call, but it's helpful during debugging.
The main issue is that this server is s l o w and times out before the transfer is complete. Even the value of 360 might not be enough (you'll need to experiment).
Many thanks to all for the help. The suggestion by hrbrmstr appears to be an elegant answer and I look forwards to testing it. However, I was unable to install a working copy using the program manager. Installation from a local download also failed since R complained that the OS X version that I downloaded from CRAN was a windows version, not OS X. Yes, I repeated the download several times to make sure I had the right package.
As suggested by Cyrus Mohammadian, I tried the procedures in the curl library.
Running the same URL, download.file transfers failed about 50% of the time. Using curl reduced the transfer times from 2000 seconds to 1000 seconds with no failures in 12 tries.
## calculate number of days in month
ndays <- function(d) {
last_days <- 28:31
rev(last_days[which(!is.na(
as.Date( paste( substr(d, 1, 8),
last_days, sep = ''),
'%Y-%m-%d')))])[1] }
nlat = 13.875
elon = -72.25
#download one month of data
year = 2008
MM = '01'
LastDay = ndays(paste(year,MM,'01',sep="-"))
H1 = paste('http://ncss.hycom.org/thredds/ncss/GLBu0.08/reanalysis/3hrly?
var=salinity&var=water_temp&var=water_u&var=water_v&latitude=',
nlat,'&longitude=', elon,'&time_start=',sep="")
H2 = '-01T00:00:00Z&time_end='
H3 = 'T21:00:00Z&timeStride=1&vertCoord=&accept=csv'
HtmlLink <- paste(H1,year,"-",MM,H2,year,"-",MM,"-",LastDay,H3,sep="")
dest = paste("../data/",year,MM,".csv",sep="")
curl_download(url =HtmlLink ,destfile=dest,quiet=FALSE, mode="wb")

Handling internet connection R

I`m trying to download several stocks from google, but every time the connection stops, R stops the loop. How can I handle this problem?
stocks <- c(
'MSFT',
'GOOG',
...
)
for (symbol in stocks)
{
stock_price <- getSymbols(symbol,src='google', from=startDate,to=endDate,auto.assign = FALSE)
prices[,j] <- stock_price[,1]
j <- j + 1
}
From the R manual "quantmod.pdf:
If auto.assign=FALSE or env=NULL (as of 0.4-0) the data will be returnedfrom the call, and will require the user to assign the results himself.Note that only one symbol at a time may be requested when auto assignment is disabled.
You are trying to request more than one ticket symbol at a time with the auto.assign parameter set to false and this is not allowed. However, you should be able to obtain all your symbols at once by adapting the following code:
data <- new.env()
getSymbols.extra(stocks, src = 'google', from = startDate, to = endDate, env = data, auto.assign = T)
plot(data$MSFT)
Pay careful attention to the R manual for getSymbols
"Data is fetched through one of the available getSymbols methods and saved in the env specified - the .GloblEnv by default.

Resources