R sQuote(vector) prints "'c("..."'" in non-interactive mode?

I have been stuck on this for quite a while. I am trying to build a sql query to write to file, but I keep writing the text 'c("...")' as part of the output file, as if the concatenate function in R was being interpreted very literally.
I have eliminated the write() function itself, toString(), and the paste0() used as part of building the final output string. The first occurrence of the 'c' appears in the output of sQuote. When I try doing a call to sQuote() in interactive mode, I don't get the same behaviour:
Browse[2]> sQuote(sqlTableColumnValues)
[1] "‘c(\"0\", \"XXX0\", \"XXX056\", \"XXX139\", \"XXX143\", \"XXX144\", \"XXX159\", \"XXX171\", \"XXX185\", \"XXX188\", \"XXX192\", \"XXX202\", \"XXX239\", \"XXX240\", \"XXX245\", \"XXX256\", \"XXX271\", \"XXX303\", \"XXX319\", \"XXX326\", \"XXX334\", \"XXX357\", \"XXX363\", \"XXX368\", \"XXX390\", \"XXX391\", \"XXX417\", \"XXX426\", \"XXX431\", \"XXX439\", \"XXX447\", \"XXX456\", \"XXX461\", \"XXX466\", \"XXX475\", \"XXX483\", \"XXX488\", \"XXX491\", \"XXX521\", \"XXX531\", \n\"XXX538\", \"XXX541\", \"XXX548\", \"XXX550\", \"XXX581\")’"
Browse[2]> str(sQuote(sqlTableColumnValues))
chr "‘c(\"0\", \"XXX0\", \"XXX056\", \"XXX139\", \"XXX143\", \"XXX144\", \"XXX159\", \"XXX171\", \"XXX185\","| __truncated__
Browse[2]> tst <- c("foo","bar") #my own interactive test
Browse[2]> tst
[1] "foo" "bar"
Browse[2]> sQuote(tst) #does not show the 'c' character in the result
[1] "‘foo’" "‘bar’"
Browse[2]>
What is causing this discrepancy and how can I stop the 'c(...)' being written to my output file?
Update: dput output as requested:
Browse[2]> dput(sqlTableColumnValues)
structure(list(`1` = c("0", "XXX0", "XXX056", "XXX139",
"XXX143", "XXX144", "XXX159", "XXX171", "XXX185", ... #etc, I've truncated.
I don't yet understand what that means / what to do with this info. :-/
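Judging from the dput output, a likely explanation is that sqlTableColumnValues is a list (note the structure(list(...))) rather than a plain character vector. sQuote pastes quotes around each element, and a list element that is itself a vector is deparsed to the text c("..."), which is where the stray 'c' comes from. A minimal sketch, using a shortened hypothetical stand-in for the real data and assuming R 3.6.0 or later for the second argument of sQuote:

```r
# Hypothetical stand-in for sqlTableColumnValues: a list holding one character vector
vals <- list(`1` = c("0", "XXX0", "XXX056"))

# On a list, the whole vector is deparsed into one string, hence the literal c(...)
quoted_list <- sQuote(vals, FALSE)

# Flattening to a character vector first quotes each value separately
quoted_vec <- sQuote(unlist(vals, use.names = FALSE), FALSE)

# Collapse into something usable inside an SQL IN (...) clause
in_clause <- paste(quoted_vec, collapse = ", ")
```

Passing FALSE as the second argument asks for plain ASCII quotes instead of the directional ones shown above. Note that sQuote does not escape quotes inside the values, so this is only safe for trusted input.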

Related

R loop completes only 3 iterations out of 2504

I've written a function to download multiple files from NOAA's database. Firstly, I've got sites which is a list of site ID's that I want to download off the website. It looks like this:
> head(sites)
[[1]]
[1] "9212"
[[2]]
[1] "10158"
[[3]]
[1] "11098"
> length(sites)
[1] 2504
My function is shown below.
tested<-lapply(seq_along(sites), function(x) {
no<-sites[[x]]
data=GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no))
v<-content(data)
check=GET(v$statusUrl)
j<-content(check)
URL<-j$archive
download.file(URL, destfile=paste0('./tree_ring/', no, '.zip'))
})
The weird issue is that it works for the first three sites (downloads properly), but then it stops after the three sites and throws the following error:
Error in charToRaw(URL) : argument must be a character vector of length 1
I've tried manually downloading the 4th and 5th site (using the same code as above, but not within function) and it works fine. What could be going on here?
EDIT 1: Showing more site ID's as requested
> dput(sites[1:6])
list("9212", "10158", "11098", "15757", "15777", "15781")
I converted your code to a for loop so I could see the most recent values of all your variables when things fail.
The failures don't consistently happen on the 4th site. Running your code a few times, sometimes it fails on site 2, or 3, or 4. When it fails, if I look at j, I see this:
$message
[1] "finalizing archive"
$status
[1] "working"
If I re-run check=GET(v$statusUrl); j<-content(check) a few seconds later, then I see
$archive
[1] "https://www.ncdc.noaa.gov/web-content/paleo/bundle/1986420067_2020-04-23.zip"
$status
[1] "complete"
So, I think it takes the server a little bit of time to prepare the file for download, and sometimes R asks for the file before it's ready, which causes an error. A simple fix might look like this:
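library(httr)

# Ask the server for the current status of the archive request
check_status <- function(v) {
  check <- GET(v$statusUrl)
  content(check)
}

for (x in seq_along(sites)) {
  no <- sites[[x]]
  data <- GET(paste0('https://www.ncdc.noaa.gov/paleo-search/data/search.json?xmlId=', no))
  v <- content(data)
  try_counter <- 0
  j <- check_status(v)
  # Poll until the archive is ready, giving up after 100 retries
  while (j$status != "complete" && try_counter < 100) {
    Sys.sleep(0.1)
    j <- check_status(v)
    try_counter <- try_counter + 1
  }
  URL <- j$archive
  download.file(URL, destfile = paste0(no, '.zip'))
}
If the status isn't ready, this version waits 0.1 seconds before checking again, for up to 100 retries (about 10 seconds of waiting, plus the time the requests themselves take).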
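The retry idea can also be pulled out into a small helper, which makes it possible to try the logic without touching the server. This is only a sketch: poll_until() and fake_status() are illustrative names that are not part of httr or the NOAA API, and the fake server simply reports "working" twice before "complete".

```r
# Generic polling helper (illustrative name): call check() until is_done() is TRUE
poll_until <- function(check, is_done, max_tries = 100, wait = 0.1) {
  for (attempt in seq_len(max_tries)) {
    result <- check()
    if (is_done(result)) return(result)
    Sys.sleep(wait)
  }
  stop("gave up after ", max_tries, " attempts")
}

# Stand-in for the server: "working" on the first two calls, then "complete"
calls <- 0
fake_status <- function() {
  calls <<- calls + 1
  if (calls < 3) {
    list(status = "working")
  } else {
    list(status = "complete", archive = "example.zip")
  }
}

j <- poll_until(fake_status, function(r) identical(r$status, "complete"), wait = 0)
```

In the real loop, check would be function() check_status(v). Stopping with an error when the limit is reached avoids passing a missing URL on to download.file.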

Ignore error when importing JSON files in R

I have this for loop that downloads a JSON file from a Solr search server.
It loops over a vector that contains keywords (100, in this case):
library(jsonlite)
for (i in 1:100) {
docs <- fromJSON(paste("http://myurl.com/solr/select?df=topic&fq=",keywords[i],"&indent=on&q=*:*&rows=1&wt=json",sep=""))
numFound <- docs$response$numFound
print(numFound)
}
It works fine until it reaches a keyword that is not found on the Solr server, at which point it returns this error:
Error in open.connection(con, "rb") : HTTP error 400.
And then the loop stops.
Is there a way to ignore the error and continue the loop?
I've read something using tryCatch but still couldn't figure it out.
Simpler than tryCatch, you can use the function try inside your keyword loop. This attempts to load the URL; if an error is encountered, it prints the error and continues with the next keyword.
library(jsonlite)
for (i in 1:100) {
try({
docs <- fromJSON(paste("http://myurl.com/solr/select?df=topic&fq=",keywords[i],"&indent=on&q=*:*&rows=1&wt=json",sep=""))
numFound <- docs$response$numFound
print(numFound)
})
}
If you also don't want to have the errors printed, specify silent = TRUE:
library(jsonlite)
for (i in 1:100) {
try({
docs <- fromJSON(paste("http://myurl.com/solr/select?df=topic&fq=",keywords[i],"&indent=on&q=*:*&rows=1&wt=json",sep=""))
numFound <- docs$response$numFound
print(numFound)
}, silent = TRUE)
}
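If you would rather record which keywords failed than just skip them, tryCatch can return a placeholder value from its error handler. A minimal sketch that runs without a server: fetch_count() is a hypothetical stand-in for the fromJSON() call, written to fail for one keyword, and the keywords here are made up.

```r
# Hypothetical stand-in for the fromJSON() request; fails for one keyword
fetch_count <- function(keyword) {
  if (keyword == "missing") stop("HTTP error 400.")
  nchar(keyword)
}

keywords <- c("alpha", "missing", "beta")

# The error handler returns NA, so one failure does not stop the remaining keywords
numFound <- vapply(keywords, function(k) {
  tryCatch(fetch_count(k), error = function(e) NA_integer_)
}, integer(1))
```

The NA entries in numFound then mark exactly the keywords whose request failed.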
I'm partial to purrr's safely for this kind of task, which works well in purrr's map functions. You can test it by getting JSONs from GitHub's API:
keywords <- c("hadley", "gershomtripp", "lsjdflkaj")
url <- "https://api.github.com/users/{.}/repos"
Now we can get the JSONs and extract the repo IDs:
library(jsonlite)
library(purrr)
library(glue)
json_list <- map(keywords, safely(~ fromJSON(glue(url)) %>% .$id))
This returns a list whose elements each contain result and error. If there was an error, it is saved in error; otherwise the results are saved in result.
[[1]]
[[1]]$result
[1] 40423928 40544418 14984909 12241750 5154874 9324319 20228011 82348 888200 3116998
[11] 8296284 137344416 133734429 2788278 28724058 9470424 116708612 34325557 41144 41157
[21] 78543290 66588778 35225488 14507273 15718805 18562209 12522 115742443 119107571 201908
[[1]]$error
NULL
[[2]]
[[2]]$result
[1] 150995700 141743224 127107806 130802586 185857872 131488780 148619375 165221804 135417803 127116088
[11] 181662388 173351888 127131146 136896011
[[2]]$error
NULL
[[3]]
[[3]]$result
NULL
[[3]]$error
<simpleError in open.connection(con, "rb"): HTTP error 404.>

Using 'ignore' argument in hunspell function

I'm attempting to exclude some words when running hunspell_check on a text block in RStudio.
ignore_me <- c("Daniel")
hunspell_check(unlist(some_text), ignore = ignore_me, dict = dictionary("en_GB"))
However, whenever I run it, I get the following error:
Error in hunspell_check(unlist(some_text, dict = dictionary("en_GB"), :
unused argument (ignore = ignore_me))
I've had a look around SO and trawled the documentation but am struggling to figure out what's gone wrong.
It looks like you've missed a closing bracket after some_text, so it's passing ignore as an argument to unlist() rather than hunspell_check().
UPDATE: Ok, I think you were looking at an old version of the documentation. At least that's what I did at first (https://www.rdocumentation.org/packages/hunspell/versions/1.1/topics/hunspell_check). In the current version, 2.9, ignore is no longer an argument for hunspell_check(). Instead, use add_words in the call to dictionary():
library(hunspell)
some_text <- list("hello", "there", "Daniell")
hunspell_check(unlist(some_text), dict = dictionary("en_GB"))
# [1] TRUE TRUE FALSE
ignore_me <- "Daniell"
hunspell_check(unlist(some_text), dict = dictionary("en_GB", add_words = ignore_me))
# [1] TRUE TRUE TRUE

Path assignment to setwd() is delayed in for/foreach loop

The objective is to change the current working directory within a for loop and do some other stuff in it, e.g. searching for files. The paths are stored in generic variables.
The R code I am running for this is the following:
require("foreach")
# The following lines are generated by an external tool and stored in filePath.info
# Loaded via source("filePaths.info")
result1 <- '/home/user/folder1'
result2 <- '/home/user/folder2'
result3 <- '/home/user/folder3'
number_results <- 3
# So I know that I have all in all 3 folders with results by number_results
# and that the variable name that contains the path to the results is generic:
# string "result" plus 1:number_results.
# Now I want to switch to each result path and do some computation within each folder
start_dir <- getwd()
print(paste0("start_dir: ",start_dir))
# For every result folder switch into the directory of the folder
foreach(i=1:number_results) %do% {
# for (i in 1:number_results){ leads to the same output
# Assign path in variable, not the variable name as string: current_variable <- result1 (not string "result1")
current_variable <- eval(parse(text = paste0("result", i)))
print(paste0(current_variable, " in interation_", i))
# Set working directory to string in variable current_variable
current_dir <- setwd(current_variable)
print(paste0("current_dir: ",current_dir))
# DO SOME OTHER STUFF WITH FILES IN THE CURRENT FOLDER
}
# Switch back into original directory
current_dir <- setwd(start_dir)
print(paste0("end_dir: ",current_dir))
The output is the following ...
[1] "start_dir: /home/user"
[1] "/home/user/folder1 in interation_1"
[1] "current_dir: /home/user"
[1] "/home/user/folder2 in interation_2"
[1] "current_dir: /home/user/folder1"
[1] "/home/user/folder3 in interation_3"
[1] "current_dir: /home/user/folder2"
[1] "end_dir: /home/user/folder3"
... while I would have expected this:
[1] "start_dir: /home/user"
[1] "/home/user/folder1 in interation_1"
[1] "current_dir: /home/user/folder1"
[1] "/home/user/folder2 in interation_2"
[1] "current_dir: /home/user/folder2"
[1] "/home/user/folder3 in interation_3"
[1] "current_dir: /home/user/folder3"
[1] "end_dir: /home/user/"
So it turns out that the path assigned to current_dir is somewhat "behind" ...
Why is this the case?
As I am far from being an R expert, I have no idea what is causing this behaviour or, more importantly, how to get the desired behaviour.
So any help, hint, code correction/optimization would be highly appreciated!
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Platform: x86_64-pc-linux-gnu (64-bit)
From the ?setwd help page...
setwd returns the current directory before the change, invisibly and with the same conventions as getwd. It will give an error if it does not succeed (including if it is not implemented).
So when you do
current_dir <- setwd(current_variable)
print(paste0("current_dir: ",current_dir))
you are not getting the "current" directory, you are getting the previous one. You should use getwd() to get the current one:
setwd(current_variable)
current_dir <- getwd()
print(paste0("current_dir: ",current_dir))
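The return value of setwd() is still useful, just for a different purpose: it is what you need to switch back afterwards. A small sketch of both calls side by side, using tempdir() as an arbitrary existing directory in place of the result folders:

```r
start_dir <- getwd()
target <- tempdir()          # stand-in for one of the result folders

old_dir <- setwd(target)     # setwd() returns the directory we just left
now_dir <- getwd()           # getwd() reports the directory we are in now

setwd(old_dir)               # the saved value switches back to where we started
```

Here old_dir equals start_dir, while now_dir points at the target folder, which is the value the loop in the question wanted to print.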

How to get the queue number from CONDOR into your R job

I think I have a simple problem because I was looking up and down the internet and couldn't find someone else asking this question:
My university has a Condor set-up. I want to run several repetitions of the same code (e.g. 100 times). My R code has a routine to store the results in a file, i.e.:
write.csv(res, file=paste(paste(paste(format(Sys.time(), '%y%m%d'),'res', queue, sep="_"), sep='/'),'.csv',sep='',collapse=''))
res are my results (a data.frame), I indicate that this file contains the results with 'res' and finally I want to add the queue number of this calculation (otherwise files would be replaced, wouldn't they?). It should look like: 140109_res_1.csv, 140109_res_2.csv, ...
My submit file to condor looks like this:
universe = vanilla
executable = /usr/bin/R
arguments = --vanilla
log = testR.log
error = testR.err
input = run_condor.r
output = testR$(Process).txt
requirements = (opsys == "LINUX") && (arch == "X86_64") && (HAS_R_2_13 =?= True)
request_memory = 1000
should_transfer_files = YES
transfer_executable = FALSE
when_to_transfer_output = ON_EXIT
queue 3
How do I get the 'queue' number into my R code? I tried a simple example with
print(queue)
print(Queue)
But there is no object found called queue or Queue. Any suggestions?
Best wishes,
Marco
Okay, I solved the problem. This is how it goes:
I had to change my submit file. I changed the slot arguments to:
arguments = --vanilla --args $(Process)
Now the process number is forwarded to the R code, where you retrieve it with the following line. The value is stored as a character string, so you should convert it to a numeric value (also check whether a number like 10 is passed on as '1' and '0', in which case you should collapse the values first).
run <- commandArgs(TRUE)
Here is an example of the code I ran.
> run <- commandArgs(TRUE)
> run
[1] "0"
> class(run)
[1] "character"
> try(as.numeric(run))
[1] 0
> try(run <- as.numeric(paste(run, collapse='')) )
> try(print(run))
[1] 0
> try(write(run, paste(run,'csv', sep='.')))
You can also find information on how to pass variables/arguments to your code here: http://research.cs.wisc.edu/htcondor/manual/v7.6/condor_submit.html
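Putting the pieces together, the process number can go straight into the file name asked about in the question. This is a rough sketch: the argument vector is hard-coded so the snippet runs on its own, whereas in the real job it would come from commandArgs(trailingOnly = TRUE), and out_file is an illustrative name.

```r
# In the Condor job this would be: args <- commandArgs(trailingOnly = TRUE)
args <- c("7")

# Convert the process number and fail early if it is missing or not a number
run <- suppressWarnings(as.integer(args[1]))
if (is.na(run)) stop("expected the process number as the first argument")

# Date stamp plus process number, e.g. 140109_res_7.csv
out_file <- sprintf("%s_res_%d.csv", format(Sys.Date(), "%y%m%d"), run)
```

The result can then be used as write.csv(res, file = out_file). As the transcript above shows, $(Process) starts counting at 0.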
I hope this helps anyone.
Cheers and thanks for all other commenters!
Marco
