R loop completes only 3 iterations out of 2504

I've written a function to download multiple files from NOAA's database. First, I have sites, a list of the site IDs I want to download from the website. It looks like this:
> head(sites)
[[1]]
[1] "9212"
[[2]]
[1] "10158"
[[3]]
[1] "11098"
> length(sites)
[1] 2504
My function is shown below.
library(httr)

tested <- lapply(seq_along(sites), function(x) {
  no <- sites[[x]]
  data <- GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no))
  v <- content(data)
  check <- GET(v$statusUrl)
  j <- content(check)
  URL <- j$archive
  download.file(URL, destfile = paste0('./tree_ring/', no, '.zip'))
})
The weird issue is that it works for the first three sites (downloads properly), but then it stops after the third site and throws the following error:
Error in charToRaw(URL) : argument must be a character vector of length 1
I've tried manually downloading the 4th and 5th sites (using the same code as above, but outside the function) and it works fine. What could be going on here?
EDIT 1: Showing more site IDs as requested
> dput(sites[1:6])
list("9212", "10158", "11098", "15757", "15777", "15781")

I converted your code to a for loop so I could see the most recent values of all your variables when things fail.
The failures aren't consistently on the 4th site. Running your code a few times, sometimes it fails on site 2, or 3, or 4. When it fails, if I look at j, I see this:
$message
[1] "finalizing archive"
$status
[1] "working"
If I re-run check=GET(v$statusUrl); j<-content(check) a few seconds later, then I see
$archive
[1] "https://www.ncdc.noaa.gov/web-content/paleo/bundle/1986420067_2020-04-23.zip"
$status
[1] "complete"
So, I think it takes the server a little bit of time to prepare the file for download, and sometimes R asks for the file before it's ready, which causes an error. A simple fix might look like this:
check_status <- function(v) {
  check <- GET(v$statusUrl)
  content(check)
}

for (x in seq_along(sites)) {
  no <- sites[[x]]
  data <- GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no))
  v <- content(data)
  try_counter <- 0
  j <- check_status(v)
  while (j$status != "complete" & try_counter < 100) {
    Sys.sleep(0.1)
    j <- check_status(v)
    try_counter <- try_counter + 1
  }
  URL <- j$archive
  download.file(URL, destfile = paste0(no, '.zip'))
}
If the status isn't ready, this version waits 0.1 seconds before checking again, up to a maximum of 100 tries (about 10 seconds).
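If you also want the loop to keep going when a single site still fails (for example, an archive that never becomes ready within the timeout), you could wrap each iteration in try(), the same pattern covered in the related question below. A rough sketch of my own, reusing check_status() and the objects defined above (not part of the original answer):
for (x in seq_along(sites)) {
  no <- sites[[x]]
  try({
    v <- content(GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no)))
    try_counter <- 0
    j <- check_status(v)
    while (j$status != "complete" & try_counter < 100) {
      Sys.sleep(0.1)
      j <- check_status(v)
      try_counter <- try_counter + 1
    }
    # if anything above errors, try() lets the loop move on to the next site
    download.file(j$archive, destfile = paste0(no, '.zip'))
  }, silent = TRUE)
}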

Related

Ignore error when importing JSON files in R

I have this for loop that downloads a JSON file from a Solr search server.
It loops over a vector that contains keywords (100, in this case):
library(jsonlite)
for (i in 1:100) {
  docs <- fromJSON(paste("http://myurl.com/solr/select?df=topic&fq=", keywords[i], "&indent=on&q=*:*&rows=1&wt=json", sep = ""))
  numFound <- docs$response$numFound
  print(numFound)
}
It works fine until it reaches a certain keyword that is not found on the Solr server, and returns this error:
Error in open.connection(con, "rb") : HTTP error 400.
And then the loop stops.
Is there a way to ignore the error and proceed with the loop?
I've read something using tryCatch but still couldn't figure it out.
Simpler than tryCatch, you can use the function try inside your keyword loop. This will attempt to load the URL; if an error is encountered, it will print the error but continue to the next keyword.
library(jsonlite)
for (i in 1:100) {
  try({
    docs <- fromJSON(paste("http://myurl.com/solr/select?df=topic&fq=", keywords[i], "&indent=on&q=*:*&rows=1&wt=json", sep = ""))
    numFound <- docs$response$numFound
    print(numFound)
  })
}
If you also don't want to have the errors printed, specify silent = TRUE:
library(jsonlite)
for (i in 1:100) {
  try({
    docs <- fromJSON(paste("http://myurl.com/solr/select?df=topic&fq=", keywords[i], "&indent=on&q=*:*&rows=1&wt=json", sep = ""))
    numFound <- docs$response$numFound
    print(numFound)
  }, silent = TRUE)
}
I'm partial to purrr's safely for this kind of task, which works well in purrr's map functions. You can test it by getting JSONs from GitHub's API:
keywords <- c("hadley", "gershomtripp", "lsjdflkaj")
url <- "https://api.github.com/users/{.}/repos"
Now we can get the JSONs and extract the repo IDs
library(jsonlite)
library(purrr)
library(glue)
json_list <- map(keywords, safely(~ fromJSON(glue(url)) %>% .$id))
This will return a list of elements, each containing result and error. If there was an error, it will be saved in error; otherwise the result will be saved in result.
[[1]]
[[1]]$result
[1] 40423928 40544418 14984909 12241750 5154874 9324319 20228011 82348 888200 3116998
[11] 8296284 137344416 133734429 2788278 28724058 9470424 116708612 34325557 41144 41157
[21] 78543290 66588778 35225488 14507273 15718805 18562209 12522 115742443 119107571 201908
[[1]]$error
NULL
[[2]]
[[2]]$result
[1] 150995700 141743224 127107806 130802586 185857872 131488780 148619375 165221804 135417803 127116088
[11] 181662388 173351888 127131146 136896011
[[2]]$error
NULL
[[3]]
[[3]]$result
NULL
[[3]]$error
<simpleError in open.connection(con, "rb"): HTTP error 404.>
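If you want to post-process that list, purrr::transpose() can split it into parallel result and error components. The following is a rough sketch of my own (the variable names are hypothetical, and purrr is assumed to still be loaded), not part of the original answer:
split_out <- transpose(json_list)                # $result and $error lists, in parallel
succeeded <- map_lgl(split_out$error, is.null)   # TRUE where no error occurred
repo_ids <- split_out$result[succeeded]          # keep only the successful results
failed_keywords <- keywords[!succeeded]          # e.g. "lsjdflkaj", which gave the 404 above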

Path assignment to setwd() is delayed in for/foreach loop

The objective is to change the current working directory within a for loop and do some other work in it, e.g. searching for files. The paths are stored in generic variables.
The R code I am running for this is the following:
require("foreach")
# The following lines are generated by an external tool and stored in filePath.info
# Loaded via source("filePaths.info")
result1 <- '/home/user/folder1'
result2 <- '/home/user/folder2'
result3 <- '/home/user/folder3'
number_results <- 3
# So I know that I have all in all 3 folders with results by number_results
# and that the variable name that contains the path to the results is generic:
# string "result" plus 1:number_results.
# Now I want to switch to each result path and do some computation within each folder
start_dir <- getwd()
print(paste0("start_dir: ",start_dir))
# For every result folder switch into the directory of the folder
foreach(i=1:number_results) %do% {
  # for (i in 1:number_results){ leads to the same output
  # Assign path in variable, not the variable name as string: current_variable <- result1 (not string "result1")
  current_variable <- eval(parse(text = paste0("result", i)))
  print(paste0(current_variable, " in interation_", i))
  # Set working directory to string in variable current_variable
  current_dir <- setwd(current_variable)
  print(paste0("current_dir: ", current_dir))
  # DO SOME OTHER STUFF WITH FILES IN THE CURRENT FOLDER
}
# Switch back into original directory
current_dir <- setwd(start_dir)
print(paste0("end_dir: ",current_dir))
The output is the following ...
[1] "start_dir: /home/user"
[1] "/home/user/folder1 in interation_1"
[1] "current_dir: /home/user"
[1] "/home/user/folder2 in interation_2"
[1] "current_dir: /home/user/folder1"
[1] "/home/user/folder3 in interation_3"
[1] "current_dir: /home/user/folder2"
[1] "end_dir: /home/user/folder3"
... while I would have expected this:
[1] "start_dir: /home/user"
[1] "/home/user/folder1 in interation_1"
[1] "current_dir: /home/user/folder1"
[1] "/home/user/folder2 in interation_2"
[1] "current_dir: /home/user/folder2"
[1] "/home/user/folder3 in interation_3"
[1] "current_dir: /home/user/folder3"
[1] "end_dir: /home/user/"
So it turns out that the path assigned to current_dir is somewhat "behind" ...
Why is this the case?
As I am far from being an R expert, I have no idea what is causing this behaviour and, most importantly, how to get the desired behaviour.
So any help, hint, code correction/optimization would be highly appreciated!
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Platform: x86_64-pc-linux-gnu (64-bit)
From the ?setwd help page...
setwd returns the current directory before the change, invisibly and with the same conventions as getwd. It will give an error if it does not succeed (including if it is not implemented).
So when you do
current_dir <- setwd(current_variable)
print(paste0("current_dir: ",current_dir))
you are not getting the "current" directory; you are getting the previous one. Use getwd() to get the current one:
setwd(current_variable)
current_dir <- getwd()
print(paste0("current_dir: ",current_dir))
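As an optional aside, not from the original answer: if the goal is just to work with files in each folder, you can skip setwd() entirely and build full paths instead, and get() is a simpler way than eval(parse()) to fetch the generically named variables. A rough sketch assuming result1..result3 and number_results as defined in the question:
for (i in 1:number_results) {
  current_dir <- get(paste0("result", i))               # value of result<i>
  print(paste0("current_dir: ", current_dir))
  files <- list.files(current_dir, full.names = TRUE)   # full paths, no setwd() needed
  # DO SOME OTHER STUFF WITH files IN THAT FOLDER
}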

Function decode_short_URL from twitteR package not working

I am using decode_short_url from the twitteR package to decode shortened URLs from Twitter posts, but I am not getting the desired results. It always gives back the same result, for example:
decode_short_url(decode_short_url("http://bit.ly/23226se656"))
## [1] "http://bit.ly/23226se656"
UPDATE: I wrapped this functionality in a package and managed to get it on CRAN the same day. Now you can just do:
library(longurl)
expand_urls("http://bit.ly/23226se656", check=TRUE, warn=TRUE)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
## Source: local data frame [1 x 2]
##
## orig_url expanded_url
## 1 http://bit.ly/23226se656 NA
##
## Warning message:
## In FUN(X[[i]], ...) : client error: (404) Not Found
You can pass in a vector of URLs and get a data_frame/data.frame back in that form.
That particular bit.ly URL gives a 404 error. Here's a version of decode_short_url that has an optional check parameter that will attempt a HEAD request and throw a warning message for any HTTP status other than 200.
You can further modify it to return NA in the event the "expanded" link 404's (I have no idea what you need this to really do in the event the link is bad).
NOTE that the added HEAD request will significantly slow the process down, so you may want to do a first pass with check=FALSE into a separate column, then compare which weren't "expanded", then check those with check=TRUE.
You might also want to rename this to avoid namespace conflicts with the one from twitteR.
decode_short_url <- function(url, check=FALSE, ...) {
  require(httr)
  request_url <- paste("http://api.longurl.org/v2/expand?url=",
                       url, "&format=json", sep="")
  response <- GET(request_url, query=list(useragent="twitteR"), ...)
  parsed <- content(response, as="parsed")
  ret <- NULL
  if (!("long-url" %in% names(parsed))) {
    ret <- url
  } else {
    ret <- parsed[["long-url"]]
  }
  if (check) warn_for_status(HEAD(url))
  return(ret)
}
decode_short_url("http://bit.ly/23226se656", check=TRUE)
## [1] "http://bit.ly/23226se656"
## Warning message:
## In decode_short_url("http://bit.ly/23226se656", check = TRUE) :
## client error: (404) Not Found
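Following the note above about returning NA when the link is bad, a possible variant (a sketch of my own; the helper name and behaviour are hypothetical, not from the original answer) could look like this:
decode_short_url_na <- function(url, ...) {
  require(httr)
  request_url <- paste0("http://api.longurl.org/v2/expand?url=", url, "&format=json")
  parsed <- content(GET(request_url, query = list(useragent = "twitteR"), ...), as = "parsed")
  # fall back to the original URL if the service returns no "long-url" field
  ret <- if ("long-url" %in% names(parsed)) parsed[["long-url"]] else url
  # return NA unless the expanded link answers a HEAD request with status 200
  status <- tryCatch(status_code(HEAD(ret)), error = function(e) NA_integer_)
  if (is.na(status) || status != 200) return(NA_character_)
  ret
}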

Importing data into R (rdata) from Github

I want to put some R code plus the associated data file (RData) on Github.
So far, everything works okay. But when people clone the repository, I want them to be able to run the code immediately. At the moment, this isn't possible because they would have to change their working directory (setwd) to the directory that the RData file was cloned (i.e. downloaded) to.
Therefore, I thought it might be easier if I changed the R code so that it links to the RData file on GitHub. But I cannot get this to work using the following snippet. I think perhaps there is some text/binary issue.
x <- RCurl::getURL("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData")
y <- load(x)
Any help would be appreciated.
Thanks
This works for me:
githubURL <- "https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData"
load(url(githubURL))
head(df)
# X Y Z
# 1 16602794 -4183983 94.92019
# 2 16602814 -4183983 91.15794
# 3 16602834 -4183983 87.44995
# 4 16602854 -4183983 83.79617
# 5 16602874 -4183983 80.19643
# 6 16602894 -4183983 76.65052
EDIT Response to OP comment.
From the documentation:
Note that the https:// URL scheme is not supported except on Windows.
So you could try this:
download.file(githubURL,"myfile")
load("myfile")
which works for me as well, but this will clutter your working directory. If that doesn't work, try setting method="curl" in the call to download.file(...).
I've had trouble with this before as well, and the solution I've found to be the most reliable is to use a tiny modification of source_url from the fantastic devtools package (https://github.com/hadley/devtools). This works for me (on a Mac).
load_url <- function (url, ..., sha1 = NULL) {
  # based very closely on code for devtools::source_url
  stopifnot(is.character(url), length(url) == 1)
  temp_file <- tempfile()
  on.exit(unlink(temp_file))
  request <- httr::GET(url)
  httr::stop_for_status(request)
  writeBin(httr::content(request, type = "raw"), temp_file)
  file_sha1 <- digest::digest(file = temp_file, algo = "sha1")
  if (is.null(sha1)) {
    message("SHA-1 hash of file is ", file_sha1)
  } else {
    if (nchar(sha1) < 6) {
      stop("Supplied SHA-1 hash is too short (must be at least 6 characters)")
    }
    file_sha1 <- substr(file_sha1, 1, nchar(sha1))
    if (!identical(file_sha1, sha1)) {
      stop("SHA-1 hash of downloaded file (", file_sha1,
           ")\n  does not match expected value (", sha1,
           ")", call. = FALSE)
    }
  }
  load(temp_file, envir = .GlobalEnv)
}
I use a very similar modification to get text files from GitHub using read.table, etc. Note that you need to use the "raw" version of the GitHub URL (which you included in your question).
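For instance, a usage sketch with the raw URL from the question (assuming, as in the first answer, that the file contains a data frame named df):
load_url("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData")
head(df)  # the loaded object is now in the global environment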
load takes a filename.
x <- RCurl::getURL("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData")
writeLines(x, tmp <- tempfile())
y <- load(tmp)
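Because an .RData file is binary, a small variation of this idea (my own suggestion, not part of the original answer) is to fetch the raw bytes with getBinaryURL and write them with writeBin:
x <- RCurl::getBinaryURL("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData")
writeBin(x, tmp <- tempfile())  # write the raw vector to a temporary file
y <- load(tmp)                  # load() then reads the local copy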

How to get the queue number from CONDOR into your R job

I think I have a simple problem, because I've searched up and down the internet and couldn't find anyone else asking this question:
My university has a Condor set-up. I want to run several repetitions of the same code (e.g. 100 times). My R code has a routine to store the results in a file, i.e.:
write.csv(res, file=paste(paste(paste(format(Sys.time(), '%y%m%d'),'res', queue, sep="_"), sep='/'),'.csv',sep='',collapse=''))
res holds my results (a data.frame); I indicate that the file contains results with 'res', and finally I want to add the queue number of this calculation (otherwise files would be overwritten, wouldn't they?). It should look like: 140109_res_1.csv, 140109_res_2.csv, ...
My submit file to condor looks like this:
universe = vanilla
executable = /usr/bin/R
arguments = --vanilla
log = testR.log
error = testR.err
input = run_condor.r
output = testR$(Process).txt
requirements = (opsys == "LINUX") && (arch == "X86_64") && (HAS_R_2_13 =?= True)
request_memory = 1000
should_transfer_files = YES
transfer_executable = FALSE
when_to_transfer_output = ON_EXIT
queue 3
I wonder how I can get the 'queue' number into my R code. I tried a simple example with
print(queue)
print(Queue)
But there is no object found called queue or Queue. Any suggestions?
Best wishes,
Marco
Okay, I solved the problem. This is how it goes:
I had to change my submit file. I changed the arguments line to:
arguments = --vanilla --args $(Process)
Now the process number is forwarded to the R code, where you retrieve it with the following line. The value will be stored as a character, so you should convert it to a numeric value (also check whether a number like 10 is passed on as '1' and '0', in which case you should also collapse the values).
run <- commandArgs(TRUE)
Here is an example of the code I let run.
> run <- commandArgs(TRUE)
> run
[1] "0"
> class(run)
[1] "character"
> try(as.numeric(run))
[1] 0
> try(run <- as.numeric(paste(run, collapse='')) )
> try(print(run))
[1] 0
> try(write(run, paste(run,'csv', sep='.')))
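Putting the pieces together with the filename scheme from the original question, a sketch of the end of run_condor.r (my own, assuming res already exists in the session) could look like:
run <- as.numeric(paste(commandArgs(TRUE), collapse = ''))              # Condor process number
out_file <- paste0(format(Sys.time(), '%y%m%d'), '_res_', run, '.csv')  # e.g. 140109_res_0.csv
write.csv(res, file = out_file)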
You can also find information on how to pass variables/arguments to your code here: http://research.cs.wisc.edu/htcondor/manual/v7.6/condor_submit.html
I hope this helps someone.
Cheers, and thanks to all the other commenters!
Marco
