How to process special characters in curl_fetch_memory requests in R?

I have code that uses curl::curl_fetch_memory to get data from a REST server. Everything works fine except for national (non-ASCII) characters. Here is the code that triggers the bug:
# 1. Works
str_json <- "https://api.company.com/request?name=SEROUJ%20DEBUI"
# 2. Does NOT work
str_json <- "https://api.company.com/request?name=SÉROUJ%20DEBUI"
# 3. Create a new handle
h <- curl::new_handle()
curl::handle_setopt(
  handle = h, httpauth = 1,
  userpwd = "password")
# 4. Perform the request and parse the response
resp <- curl::curl_fetch_memory(str_json, handle = h)
rawToChar(resp$content)
jsonlite::fromJSON(rawToChar(resp$content))
The only difference between the two requests is a single character: "E" (works) vs. "É" (fails). The failing request returns "502 Bad Gateway".
Any ideas on how to fix this are welcome!

If you pass the string with a literal (unencoded) space, URLencode seems to pick everything up:
str_json <- "https://api.company.com/request?name=SÉROUJ DEBUI"
URLencode(str_json)
# "https://api.company.com/request?name=S%C3%89ROUJ%20DEBUI"
For some reason it does not pick up the É if some values are already percent-encoded.
It works in the other direction, too:
URLdecode("https://api.company.com/request?name=S%C3%89ROUJ%20DEBUI")
# [1] "https://api.company.com/request?name=SÉROUJ DEBUI"

googledrive::drive_mv gives error "Parent specified via 'path' is invalid: x Does not exist"

This is a weird one and I am hoping someone can figure it out. I have written a function that uses googlesheets4 and googledrive. One thing I'm trying to do is move a Google Drive document (spreadsheet) from the base folder to a specified folder. I had this working perfectly yesterday, so I don't know what happened; it just didn't work when I came in this morning.
The weird thing is that if I step through the function, it works fine. It's just when I run the function all at once that I get the error.
I am using a folder ID instead of a name and using drive_find to get the correct folder ID. I am also using a sheet ID instead of a name. The folder already exists and like I said, it was working yesterday.
outFolder <- 'exact_outFolder_name_without_slashes'
createGoogleSheets <- function(outFolder) {
  folder_id <- googledrive::drive_find(n_max = 10, pattern = outFolder)$id
  data <- data.frame(Name = c("Sally", "Sue"), Data = c("data1", "data2"))
  sheet_id <- NA
  nameDate <- NA
  tempData <- data.frame()
  for (i in 1:nrow(data)) {
    nameDate <- data[i, "Name"]
    tempData <- data[i, ]
    googlesheets4::gs4_create(name = nameDate, sheets = list(sheet1 = tempData)
    sheet_id <- googledrive::drive_find(type = "spreadsheet", n_max = 10, pattern = nameDate)$id
    googledrive::drive_mv(file = as_id(sheet_id), path = as_id(folder_id))
  } # end 'for'
} # end 'function'
I don't think this will be a reproducible example. The offending code is within the for loop that is within the function, and it works fine when I run through it step by step. folder_id is defined within the function but outside of the for loop; sheet_id is within the for loop. When I move folder_id into the for loop, it still doesn't work, although I don't know why that would change anything. These are just the things I have tried. I do have the proper authorization for googledrive and googlesheets4 by using:
googledrive::drive_auth()
googlesheets4::gs4_auth(token = drive_token())
<error/rlang_error>
Error in as_parent():
! Parent specified via path is invalid:
x Does not exist.
Backtrace:
global createGoogleSheets(inputFile, outPath, addNames)
googledrive::drive_mv(file = as_id(sheet_id), path = as_id(folder_id))
googledrive:::as_parent(path)
Run rlang::last_trace() to see the full context.
Backtrace:
x
-global createGoogleSheets(inputFile, outPath, addNames)
-googledrive::drive_mv(file = as_id(sheet_id), path = as_id(folder_id))
\-googledrive:::as_parent(path)
\-googledrive:::drive_abort(c(invalid_parent, x = "Does not exist."))
\-cli::cli_abort(message = message, ..., .envir = .envir)
\-rlang::abort(message, ..., call = call, use_cli_format = TRUE)
I have tried changing the folder_id to the exact path of my google drive W:/My Drive... and got the same error. I should mention I have also tried deleting the folder and re-creating it fresh.
Anybody have any ideas?
Thank you in advance for your help!
I can't comment because I don't have the reputation yet, but I believe you're missing a parenthesis inside your for-loop: the gs4_create() call is never closed.
You need that SECOND closing parenthesis at the end:
googlesheets4::gs4_create(name = nameDate, sheets = list(sheet1 = tempData))
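As an aside (not part of the original answer): gs4_create() returns the new sheet's id, so a sketch of the loop body could pass that straight to drive_mv() and skip the second drive_find() lookup, which also sidesteps any delay before the freshly created sheet becomes findable:
# Sketch under the question's setup: gs4_create() returns a sheets_id,
# which googledrive functions accept directly.
ss <- googlesheets4::gs4_create(name = nameDate, sheets = list(sheet1 = tempData))
googledrive::drive_mv(file = ss, path = googledrive::as_id(folder_id))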

How to read unquoted extra \r with data.table::fread

The data I have to process has unquoted text containing stray \r characters. The files are big (500MB) and numerous (>600), and changing the export is not an option. The data might look like
A,B,C
blah,a,1
bloo,a\r,b
blee,c,d
How can this be handled with data.table's fread?
Is there a better R read CSV function for this, that's similarly performant?
Repro
library(data.table)
csv<-"A,B,C\r\n
blah,a,1\r\n
bloo,a\r,b\r\n
blee,c,d\r\n"
fread(csv)
Error in fread(csv) :
Expected sep (',') but new line, EOF (or other non printing character) ends field 1 when detecting types from point 0:
bloo,a
Advanced repro
The simple repro might be too trivial to give a sense of scale...
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Naive approach
fread("sample.csv")
# Akrun's approach with needing text read first
fread(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#>Error in file.info(input) : file name conversion problem -- name too long?
# Julia's approach with needing text read first
readr::read_csv(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#> Error: C stack usage 48029706 is too close to the limit
Further to @Dirk-Eddelbuettel's & @nrussell's suggestions, a way of solving this is to pre-process the file. The pre-processor could also be called from within fread(), but here it is performed in separate steps:
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Remove errant `\r`'s with tr - shown here is the Windows R solution
shell("C:/Rtools/bin/tr.exe -d '\\r' < sample.csv > sampleNEW.csv")
fread("sampleNEW.csv")
We can try with gsub:
fread(gsub("\r\n|\r", "", csv))
# A B C
#1: blah a 1
#2: bloo a b
#3: blee c d
You can also do this with tidyverse packages, if you'd like.
> library(readr)
> library(stringr)
> read_csv(str_replace_all(csv, "\r", ""))
# A tibble: 3 × 3
A B C
<chr> <chr> <chr>
1 blah a 1
2 bloo a b
3 blee c d
If you do want to do it purely in R, you could try working with connections. As long as a connection is kept open, it will start reading/writing from its previous position. Of course, this means the burden of opening and closing connections falls on you.
In the following code, the file is processed by chunks:
library(data.table)
input_csv <- "sample.csv"
in_conn <- file(input_csv)
output_csv <- "out.csv"
out_conn <- file(output_csv, "w+")
open(in_conn)
chunk_size <- 1E6
return_pattern <- "(?<=^|,|\n)([^,]*(?<!\n)\r(?!\n)[^,]*)(?=,|\n|$)"
buffer <- ""
repeat {
  new_chars <- readChar(in_conn, chunk_size)
  buffer <- paste0(buffer, new_chars)
  # If the buffer ends in \r or \n, keep reading one character at a time so
  # that a \r\n pair is never split across two chunks.
  while (grepl("[\r\n]$", buffer, perl = TRUE)) {
    next_char <- readChar(in_conn, 1)
    buffer <- paste0(buffer, next_char)
    if (!length(next_char))
      break
  }
  # Trim the buffer back to the last complete field; carry the rest forward.
  chunk <- gsub("(.*)[,\n][^,\n]*$", "\\1", buffer, perl = TRUE)
  buffer <- substr(buffer, nchar(chunk) + 1, nchar(buffer))
  # Quote any field containing a \r that is not part of a \r\n line ending.
  cleaned <- gsub(return_pattern, '"\\1"', chunk, perl = TRUE)
  writeChar(cleaned, out_conn, eos = NULL)
  if (!length(new_chars))
    break
}
writeChar('\n', out_conn, eos = NULL)
close(in_conn)
close(out_conn)
result <- fread(output_csv)
Process:
If a chunk ends with a \r or \n, another character is added until it doesn't.
Quotes are put around values containing a \r which isn't adjacent to a \n.
The cleaned chunk is added to the end of another file.
Rinse and repeat.
This code simplifies the problem by assuming no quoting is done for any field in sample.csv. It's not especially fast, but not terribly slow either. Larger values of chunk_size should reduce the amount of time spent in I/O operations. If used for anything beyond this toy example, I'd strongly suggest wrapping it in a tryCatch(...) call to make sure the files are closed afterwards.

Importing data into R (.RData) from GitHub

I want to put some R code plus the associated data file (RData) on Github.
So far, everything works okay. But when people clone the repository, I want them to be able to run the code immediately. At the moment this isn't possible, because they first have to change their working directory (setwd) to the directory that the RData file was cloned (i.e. downloaded) to.
Therefore, I thought it might be easier if I changed the R code so that it links to the RData file on GitHub. But I cannot get this to work using the following snippet. I think there is perhaps some text/binary issue.
x <- RCurl::getURL("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData")
y <- load(x)
Any help would be appreciated.
Thanks
This works for me:
githubURL <- "https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData"
load(url(githubURL))
head(df)
# X Y Z
# 1 16602794 -4183983 94.92019
# 2 16602814 -4183983 91.15794
# 3 16602834 -4183983 87.44995
# 4 16602854 -4183983 83.79617
# 5 16602874 -4183983 80.19643
# 6 16602894 -4183983 76.65052
EDIT: In response to the OP's comment.
From the documentation:
Note that the https:// URL scheme is not supported except on Windows.
So you could try this:
download.file(githubURL,"myfile")
load("myfile")
which works for me as well, but this will clutter your working directory. If that doesn't work, try setting method="curl" in the call to download.file(...).
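A sketch combining those two suggestions; mode = "wb" is an addition not in the original answer, but it keeps the binary .RData file intact on Windows:
download.file(githubURL, "myfile", method = "curl", mode = "wb")
load("myfile")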
I've had trouble with this before as well, and the solution I've found to be the most reliable is to use a tiny modification of source_url from the fantastic devtools package (https://github.com/hadley/devtools). This works for me (on a Mac).
load_url <- function(url, ..., sha1 = NULL) {
  # based very closely on code for devtools::source_url
  stopifnot(is.character(url), length(url) == 1)
  temp_file <- tempfile()
  on.exit(unlink(temp_file))
  request <- httr::GET(url)
  httr::stop_for_status(request)
  writeBin(httr::content(request, type = "raw"), temp_file)
  file_sha1 <- digest::digest(file = temp_file, algo = "sha1")
  if (is.null(sha1)) {
    message("SHA-1 hash of file is ", file_sha1)
  } else {
    if (nchar(sha1) < 6) {
      stop("Supplied SHA-1 hash is too short (must be at least 6 characters)")
    }
    file_sha1 <- substr(file_sha1, 1, nchar(sha1))
    if (!identical(file_sha1, sha1)) {
      stop("SHA-1 hash of downloaded file (", file_sha1,
           ")\n  does not match expected value (", sha1, ")", call. = FALSE)
    }
  }
  load(temp_file, envir = .GlobalEnv)
}
I use a very similar modification to get text files from GitHub using read.table, etc. Note that you need to use the "raw" version of the GitHub URL (which you included in your question).
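A hypothetical usage sketch, with the URL from the question (the object name df is assumed from the accepted answer's output):
githubURL <- "https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData"
load_url(githubURL)  # messages the SHA-1 hash, then loads into .GlobalEnv
head(df)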
load takes a filename, so write the downloaded content to a temporary file first. Note that an .RData file is binary: fetch it as raw bytes with getBinaryURL and write it with writeBin (getURL plus writeLines would corrupt it).
x <- RCurl::getBinaryURL("https://github.com/thefactmachine/hex-binning-gis-data/raw/master/popDensity.RData")
writeBin(x, tmp <- tempfile())
y <- load(tmp)

Issue with curl: brackets in URLs

I have a vector of URLs that I would like to download from R, using curl on Mac OS X:
## URLs
grab = c("http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1079_33994-C81-I620_5-ANI-L056-00001[006154]ready//DA_2011-06-03_STINGA SIMONA_30381371.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1486_67011-C27-I620_6-ANI-L141-00001[045849]ready//DA_2012-05-28_SORIN VASILE_1308151.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34934-C93-I620_6-ANI-L058-00001[005631]ready//DI_2011-05-25_CONSTANTIN CATALIN IONITA_50364334.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1486_66964-C65-I620_5-ANI-L141-00001[045952]ready//DA_2012-05-24_DORINA ORZAC_1312037.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1486_67290-C65-I620_5-ANI-L141-00001[045768]ready//DI_2012-06-01_JIPA CAMELIA_1304833.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34936-C74-I620_7-ANI-L058-00001[005633]ready//DA_2011-06-09_NICOLE MOT_50364493.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34937-C74-I620_7-ANI-L058-00001[005634]ready//DA_2011-06-14_PETRE ECATERINA_50364543.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1566_67978-C85-I780_2-ANI-L144-00001[046398]ready//DA_2012-05-25_RAMONA GHIERGA_1332323.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34936-C74-I620_7-ANI-L058-00001[005633]ready//DA_2011-06-05_LOVIN G. ADINA_50364475.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP2135_40131-C90-I780_3-ANI-L069-00001[009742]ready//DI_2011-05-25_VARTOLOMEI PAUL-CONSTANTIN_467652.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1086_34373-C11-I620_3-ANI-L057-00001[005657]ready//DI_2011-05-16_CAZACU LILIANA_40437536.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34935-C93-I620_6-ANI-L058-00001[005632]ready//DI_2011-06-07_ROSCA EUGEN-CONSTANTIN_50364400.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP181_27399-C11-I780_2-ANI-L051-00001[005421]ready//DI_2010-11-03_DIAMANDI SAVA-CONSTANTIN_40429564.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34936-C74-I620_7-ANI-L058-00001[005633]ready//DI_2011-06-07_ZAMFIRESCU I. IULIA_50364498.pdf",
"http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1563_67587-C71-I780_3-ANI-L143-00001[046079]ready//DI_2012-05-21_MAZURU C. EMILIA_1317509.pdf"
)
My first attempt returned HTTP Error 400:
## fails on Mac OSX 10.9 (HTTP 400)
## for(x in grab) download.file(x, destfile = gsub("(.*)//D", "D", x))
I learnt that this was due to the URLs containing brackets, so I applied the --globoff fix this way:
## also fails despite fixing HTTP Err 400 (files are zero-sized)
for(x in grab) download.file(x, destfile = gsub("(.*)//D", "D", x), method = "curl", extra = "--globoff")
… and the files now download, but are all empty (zero-sized).
What am I getting wrong?
P.S. I'm willing to switch to Python or shell to get the files, but would prefer to keep the code 100% R.
Have you tried URL encoding the brackets?
%5B = [
%5D = ]
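A sketch of this suggestion applied to the question's vector; the destfile pattern is taken from the question, and once the brackets are percent-encoded, --globoff is no longer needed:
# Percent-encode the brackets, then download as before.
grab_enc <- gsub("]", "%5D", gsub("[", "%5B", grab, fixed = TRUE), fixed = TRUE)
for (i in seq_along(grab)) {
  download.file(grab_enc[i], destfile = gsub("(.*)//D", "D", grab[i]),
                method = "curl")
}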
A bit late, but URLencode is what you use to ensure you have a well-formed URL.
> x <- "http://example.com/[xyz]//file with spaces.pdf"
> URLencode(x)
[1] "http://example.com/%5bxyz%5d//file%20with%20spaces.pdf"

Get function's title from documentation

I would like to get the title of a base function (e.g.: rnorm) in one of my scripts. That is included in the documentation, but I have no idea how to "grab" it.
I mean the line given in the RD files as \title{} or the top line in documentation.
Is there any simple way to do this without calling the Rd_db function from tools and parsing all Rd files, which has a very big overhead for such a simple task? I also tried parse_Rd, but:
I do not know which Rd file holds my function,
I have no Rd files on my system (just rdb, rdx and rds).
So a function to parse the (offline) documentation would be the best :)
POC demo:
> get.title("rnorm")
[1] "The Normal Distribution"
If you look at the code for help, you see that the function index.search seems to be what pulls in the location of the help files, and that the default for the associated find.package() function is NULL. It turns out that there is neither a help page for that function nor is it exported, so I tested the usual suspects for which package it is in (base, tools, utils) and ended up with utils:
utils:::index.search("+", find.package())
#[1] "/Library/Frameworks/R.framework/Resources/library/base/help/Arithmetic"
So:
ghelp <- utils:::index.search("+", find.package())
gsub("^.+/", "", ghelp)
#[1] "Arithmetic"
ghelp <- utils:::index.search("rnorm", find.package())
gsub("^.+/", "", ghelp)
#[1] "Normal"
What you are asking for is \title{Title}; here I have shown you how to find the specific Rd file to parse, and it sounds as though you already know how to do that.
EDIT: @Hadley has provided a method for getting all of the help text once you know the package name, so applying that to the index.search() value above:
target <- gsub("^.+/library/(.+)/help.+$", "\\1",
               utils:::index.search("rnorm", find.package()))
doc.txt <- pkg_topic(target, "rnorm") # assuming both of Hadley's functions are here
print(doc.txt[[1]][[1]][1])
#[1] "The Normal Distribution"
It's not completely obvious what you want, but the code below will get the Rd data structure corresponding to the topic you're interested in; you can then manipulate that to extract whatever you want.
There may be simpler ways, but unfortunately very little of the needed code is exported and documented. I really wish there was a base help package.
pkg_topic <- function(package, topic, file = NULL) {
  # Find "file" name given topic name/alias
  if (is.null(file)) {
    topics <- pkg_topics_index(package)
    topic_page <- subset(topics, alias == topic, select = file)$file
    if (length(topic_page) < 1)
      topic_page <- subset(topics, file == topic, select = file)$file
    stopifnot(length(topic_page) >= 1)
    file <- topic_page[1]
  }
  rdb_path <- file.path(system.file("help", package = package), package)
  tools:::fetchRdDB(rdb_path, file)
}
pkg_topics_index <- function(package) {
  help_path <- system.file("help", package = package)
  file_path <- file.path(help_path, "AnIndex")
  if (length(readLines(file_path, n = 1)) < 1) {
    return(NULL)
  }
  topics <- read.table(file_path, sep = "\t",
    stringsAsFactors = FALSE, comment.char = "", quote = "", header = FALSE)
  names(topics) <- c("alias", "file")
  topics[complete.cases(topics), ]
}
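Putting the two answers together, a sketch of the get.title() helper from the question's POC; it assumes, as in the EDIT above, that the title is the first element of the parsed Rd structure:
get.title <- function(topic) {
  # locate the installed help file for the topic, e.g. .../library/stats/help/Normal
  path <- utils:::index.search(topic, find.package())
  # extract the package name from that path
  target <- gsub("^.+/library/(.+)/help.+$", "\\1", path)
  doc <- pkg_topic(target, topic)
  # the \title block is the first element of the parsed Rd object
  as.character(doc[[1]][[1]][1])
}
get.title("rnorm")
# [1] "The Normal Distribution"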
