No encoding supplied: defaulting to UTF-8 - r

I just wrote a code in order to extract some data.
The script is working almost just fine. By almost, I mean that I can get the results that I want, but I have always the same issue in my console which is :
No encoding supplied: defaulting to UTF-8.
The problem is that I'm running a loop with 900 lines, so I have 900 times the "No encoding supplied: defaulting to UTF-8."
Is is possible to get rid of it ? Does it have an impact on the process speed ?
Here is my code :
mykeywords <- readLines("mykeywords.txt", encoding = "UTF-8")
mykeywords <- as.character(mykeywords)
my_user_agent <- "XXXXXXXXXXX"
PAA = list()
max_length = 0
for (i in 1: length(mykeywords)){
url_to_check <- paste0("https://www.google.com/search?q=",mykeywords[i],"&ie=utf-8&oe=utf-8&client=firefox-b")
result_get <- GET(url_to_check, user_agent(my_user_agent)) %>%
htmlParse(encoding = "UTF-8") %>%
xpathSApply('//div[/*]/g-accordion-expander/div/div', xmlValue)
if (is.null(result_get)) {
PAA[[i]]= NA
} else {
result_get = result_get[result_get != "" & !is.na(result_get)]
PAA[[i]] = result_get
if (length(result_get) > max_length) {
max_length = length(result_get)}}}

Related

Error when trying to download species models in R with ebirdst::ebirdst_download()

I've requested an access key and set it. Sys.getenv("EBIRDST_KEY") returns the correct key. ebirdst_download(species = "Sharp-tailed Grouse") returns an error
Error in ebirdst_download(species = "Sharp-tailed Grouse") : Cannot
access Status and Trends data URL. Ensure that you have a working
internet connection and a valid API key for the Status and Trends
data.
My internet is working. Looking at the source code, I believe the error is generated because the function reads a url, essentially read_json(stringr::str_glue("{api_url}list-obj/{species}?key={key}")) which for some reason is null.
https://rdrr.io/github/CornellLabofOrnithology/ebirdst/src/R/ebirdst-loading.R?fbclid=IwAR1JYbCoD_VGwtZ0e1tz7yEPIR1buwN3GUyraZqokeS8rFTox4g3ceWRnns
But I can get it to work if I use the following lines of the source code. I'm not sure why the ebirdst function is failing.
species<-"Sharp-tailed Grouse"
path = rappdirs::user_data_dir("ebirdst")
species <- get_species(species)
which_run <- which(ebirdst::ebirdst_runs$species_code == species)
run <- ebirdst::ebirdst_runs$run_name[which_run]
key<-Sys.getenv("EBIRDST_KEY")
api_url <- "https://st-download.ebird.org/v1/"
list_obj_url <- stringr::str_glue("{api_url}list-obj/{species}?key={key}")
files <-jsonlite::read_json(list_obj_url, simplifyVector = TRUE)
files <- data.frame(file = files)
files <- files[!stringr::str_detect(files$file, "\\.db$"), , drop = FALSE]
files$src_path <- stringr::str_glue("{api_url}fetch?objKey={files$file}",
"&key={key}")
files$dest_path <- file.path(path, files$file)
files$exists <- file.exists(files$dest_path)
dirs <- unique(dirname(files$dest_path))
for (d in dirs) {
dir.create(d, showWarnings = FALSE, recursive = TRUE)
}
old_timeout <- getOption("timeout")
options(timeout = max(3000, old_timeout))
for (i in seq_len(nrow(files))) {
dl_response <- utils::download.file(files$src_path[i],
files$dest_path[i],
mode = "wb")
if (dl_response != 0) {
stop("Error downloading file: ", files$file[i])
}
}

r - source script file that contains unicode (Farsi) character

write the text below in a buffer and save it as a .r script:
letters_fa <- c('الف','ب','پ','ت','ث','ج','چ','ح','خ','ر','ز','د')
then try these lines to source() it:
script <- "path/to/script.R"
file(script,
encoding = "UTF-8") %>%
readLines() # works fine
file(script,
encoding = "UTF-8") %>%
source() # works fine
source(script) # the Farsi letters in the environment are misrepresented
source(script,
encoding = "UTF-8") # gives error
The last line throws error. I tried to debug it and I believe there is a bug in the source function, in the following lines:
...
loc <- utils::localeToCharset()[1L]
...
The error occurs at .Internal(parse( line.
...
exprs <- if (!from_file) {
if (length(lines))
.Internal(parse(stdin(), n = -1, lines, "?",
srcfile, encoding))
else expression()
}
else .Internal(parse(file, n = -1, NULL, "?", srcfile,
encoding))
...
The exact error is:
Error in source(script, encoding = "UTF-8") :
script.R:2:17: unexpected INCOMPLETE_STRING
1: #' #export
2: letters_fa <- c('
^
The solution to this problem is to either change the OS Locale to a native Locale (e.g. Persian in this case) or use R built-in function Sys.setlocale(locale="Persian") to change an R session native Locale.
Use source without specifying the encoding, and then modify the vector's encoding with Encoding:
source(script)
letters_fa
# [1] "الÙ\u0081" "ب" "Ù¾" "ت" "Ø«"
# [6] "ج" "چ" "ح" "خ" "ر"
# [11] "ز" "د"
Encoding(letters_fa) <- "UTF-8"
letters_fa
# [1] "الف" "ب" "پ" "ت" "ث" "ج" "چ" "ح" "خ" "ر" "ز" "د"

Parsing Unicode string with nulls in R

I am having some trouble parsing a Unicode string a JSON object pulled from an API. As the string has Encoding() like "unknown", i need to parse it for the system to know what its dealing with. The string represents a decoded .png file in UTF-8 that I then need to decode back to latin1 before writing it to a .png file (I know, very backwards, and it would be much better if the API pushed a base64-string).
I get the string from the API as chr object, and try to let fromJSON do the job, but no dice. It cuts the string at the first null (\u0000).
> library(httr)
> library(jsonlite)
> library(tidyverse)
> m
Response [https://...]
Date: 2018-04-10 11:47
Status: 200
Content-Type: application/json
Size: 24.3 kB
{"artifact": "\u0089PNG\r\n\u001a\n\u0000\u0000\u0000\rIHDR\u0000\u0000\u0000\u0092\u0000\u0000\u0000\u00e3...
> x <- content(m, encoding = "UTF-8", as = "text")
> ## substing of the complete x:
> x <- "{\"artifact\": \"\\u0089PNG\\r\\n\\u001a\\n\\u0000\\u0000\\u0000\\rIHDR\\u0000\\u0000\\u0000\\u0092\\u0000\\u0000\\u0000\\u00e3\\b\\u0006\\u0000\\u0000\\u0000n\\u0005M\\u00ea\\u0000\\u0000\\u0000\\u0006bKGD\\u0000\\u00ff\\u0000\\u00ff\\u0000\\u00ff\\u00a0\\u00bd\\u00a7\\u0093\\u0000\\u0000\\u0016\\u00e7IDATx\\u009c\\u00ed\"}\n"
>
> ## the problem
> "\u0000"
Error: nul character not allowed (line 1)
> ## this works fine
> "\\u0000"
[1] "\\u0000"
>
> y <- fromJSON(txt = x)
> y # note how the string is cut!
$artifact
[1] "\u0089PNG\r\n\032\n"
When I replace the \\u0000 with char(0), everything works fine. The problem is that the nulls seems to play an important role in the binary representation of the file that I write to in the end, causing the resulting image to be corrupted in the viewer.
> x <- str_replace_all(string = x, pattern = "\\\\u0000", replacement = chr(0))
> y <- fromJSON(txt = x)
> y
$artifact
[1] "\u0089PNG\r\n\032\n\rIHDR\u0092ã\b\006n\005Mê\006bKGDÿÿÿ ½§\u0093\026çIDATx\u009cí"
> str(y$artifact)
chr "<U+0089>PNG\r\n\032\n\rIHDR<U+0092>ã\b\006n\005Mê\006bKGDÿÿÿ ½§<U+0093>\026çIDATx<U+009C>í"
> Encoding(y$artifact)
[1] "UTF-8"
> z <- iconv(y$artifact, from = "UTF-8", to = "latin1")
> writeBin(object = z, con = "test.png", useBytes = TRUE)
I have tried these commands with the original string, to no avail
> library(stringi)
> stri_unescape_unicode(str = x)
Error in stri_unescape_unicode(str = x) :
embedded nul in string: '{"artifact": "<U+0089>PNG\r\n\032\n'
> ## and
> parse(text = x)
Error in parse(text = x) : nul character not allowed (line 1)
Is there no way for R to handle this nul character?
Any idea on how I can get the complete encoded string and write it to a file?
The same story works just fine in Python, which uses a \x convention in stead of \u00
response = r.json()
artifact = response['artifact']
artifact
'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....'
artifact_encoded = artifact.encode('latin-1')
artifact_encoded # note the binary form!
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....'
fh = open("abc.png", "wb")
fh.write(artifact_encoded)
fh.close()
FYI: I have cut some most of the actual string out, but enough to use for testing purposes. The actual string contained other symbols, and it seemed impossible to copy-paste the string in a script and assign it to a new variable (e.g. y <- "{\"artifact\": \"\\u0089PNG\\..."). So, I don't know what I would do if I had to read the string from e.g. a .csv file..
Any pointers in any of my struggles would be appreciated :)

R/SublimeREPL R - code not working in sublime but working in RStudio

I am following the tutorials of Machine Learning for Hackers (https://github.com/johnmyleswhite/ML_for_Hackers) and I am using Sublime Text as a text editor. To run my code, I use SublimeREPL R.
I am using this code, taken directly from the book:
setwd("/path/to/folder")
# Load the text mining package
library(tm)
library(ggplot2)
# Loading all necessary paths
spam.path <- "data/spam/"
spam2.path <- "data/spam_2/"
easyham.path <- "data/easy_ham/"
easyham.path2 <- "data/easy_ham_2/"
hardham.path <- "data/hard_ham/"
hardham2.path <- "data/hard_ham_2/"
# Get the content of each email
get.msg <- function(path) {
con <- file(path, open = "rt", encoding = "latin1")
text <- readLines(con)
msg <- text[seq(which(text == "")[1] + 1, length(text),1)]
close(con)
return(paste(msg, collapse = "\n"))
}
# Create a vector where each element is an email
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path, p, sep = "")))
# Log the spam
head(all.spam)
This piece of code works fine in RStudio (with the data provided here: https://github.com/johnmyleswhite/ML_for_Hackers/tree/master/03-Classification) but when I run it in Sublime, Iget the following error message:
> all.spam <- sapply(spam.docs,
+ function(p) get.msg(file.path(spam.path, p)))
Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
'from' cannot be NA, NaN or infinite
In addition: Warning messages:
1: In readLines(con) :
invalid input found on input connection 'data/spam/00006.5ab5620d3d7c6c0db76234556a16f6c1'
2: In readLines(con) :
invalid input found on input connection 'data/spam/00009.027bf6e0b0c4ab34db3ce0ea4bf2edab'
3: In readLines(con) :
invalid input found on input connection 'data/spam/00031.a78bb452b3a7376202b5e62a81530449'
4: In readLines(con) :
incomplete final line found on 'data/spam/00031.a78bb452b3a7376202b5e62a81530449'
5: In readLines(con) :
invalid input found on input connection 'data/spam/00035.7ce3307b56dd90453027a6630179282e'
6: In readLines(con) :
incomplete final line found on 'data/spam/00035.7ce3307b56dd90453027a6630179282e'
>
I get the same results when I take the code from John Myles White's repo.
How can I fix this?
Thanks
I think the problem got is in using encoding=latin1, you can just remove this one, I test it in my environment, it ran well.
spam.docs <- paste(spam.path,spam.docs,sep="")
all.spam <- sapply(spam.docs,get.msg)
Warning message:
In readLines(con) :
incomplete final line found on 'XXXXXXXXXXXXXXXXX/ML_for_Hackers-master/03-Classification/data/spam/00136.faa39d8e816c70f23b4bb8758d8a74f0'
still some warnnings in it, but it can produce the results well.
Thanks.

Using the openpaths.cc API with R

I tried to pull my location data from openpaths.cc to use it with R.
The API uses OAuth and is documented here, however, it only provides an example in Python.
After looking around how to handle OAuth (which I am barely familiar with) in R, I found ROAuth, so I used the usage example provided as a basis.
According to the API-documentation, the endpoint for all requests is https://openpaths.cc/api/1, and I have my access key and access secret, so I naively plugged them in for cKey, cSecret, reqURL, accessURL, authURL, and testURL, but only got "bad request" as a result from the credentials$handshake() line.
reqURL <- "https://openpaths.cc/api/1"
accessURL <- "https://openpaths.cc/api/1"
authURL <- "https://openpaths.cc/api/1"
cKey <- "key"
cSecret <- "secret"
testURL <- "https://openpaths.cc/api/1"
credentials <- OAuthFactory$new(consumerKey=cKey,
consumerSecret=cSecret,
requestURL=reqURL,
accessURL=accessURL,
authURL=authURL,
needsVerifier=TRUE)
credentials$handshake()
## the GET isn’t strictly necessary as that’s the default
credentials$OAuthRequest(testURL, "GET")
While I feel like I have no idea what I'm doing, I at least verified that ROAuth is capable of using the HMAC-SHA1 method, wich is required by openpaths.
EDIT: I have ROAuth version 0.9.3 installed
EDIT2: After learning about httr, I thought this might be the appropriate library for the task, however I still could not produce any usable results, since the token creation via oauth1.0_token only lead to a Bad request again.
I think my primary problem is the lack of API documentation from openpaths.cc. With all these tools, I still have no idea how to properly use them.
Here is as far as I got. I receive a "400 Not Authorized", maybe this is due to the fact that my openpaths account is not connected to foursquare, maybe something is wrong with the code. Please try it out!
Required packages:
library(RCurl)
library(digest)
library(base64)
Some functions borrowed/adapted from ROAuth:
## Get a random sequence of characters.
## Nonce - number used only once.
genNonce <- function(len = 15L + sample(1:16, 1L)) {
els <- c(letters, LETTERS, 0:9, "_")
paste(sample(els, len, replace = TRUE), collapse = "")
}
## this function is derived from utils::URLencode
## Characters not in the unreserved character set ([RFC3986] section 2.3) MUST be encoded
## unreserved = ALPHA, DIGIT, '-', '.', '_', '~'
## cf. http://oauth.net/core/1.0/#encoding_parameters
encodeURI <- function(URI, ...) {
if (!is.character(URI)) {
URI
} else {
OK <- "[^-A-Za-z0-9_.~]"
x <- strsplit(URI, "")[[1L]]
z <- grep(OK, x)
if (length(z)) {
y <- sapply(x[z], function(x) paste("%", toupper(as.character(charToRaw(x))),
sep = "", collapse = ""))
x[z] <- y
}
paste(x, collapse = "")
}
}
## we escape the values of the parameters in a special way that escapes
## the resulting % prefix in the escaped characters, e.g. %20 becomes
## %2520 as %25 is the escape for %
## cf. http://tools.ietf.org/html/rfc5849#section-3.4.1.3.2
normalizeParams <- function(params, escapeFun) {
names(params) <- sapply(names(params), escapeFun, post.amp = TRUE)
params <- sapply(params, escapeFun, post.amp = TRUE)
## If two or more parameters share the same name, they are sorted by their value.
params <- params[order(names(params), params)]
return(paste(names(params), params, sep = "=", collapse = "&"))
}
## From Ozaki Toru's code at https://gist.github.com/586468
signWithHMAC <- function(key, data) {
blockSize <- 64
hashlength <- 20
innerpad <- rawToBits(as.raw(rep(0x36, blockSize)))
outerpad <- rawToBits(as.raw(rep(0x5C, blockSize)))
zero <- rep(0 ,64)
HexdigestToDigest <- function(digest) {
as.raw(strtoi(substring(digest, (1:hashlength)*2-1,
(1:hashlength)*2), base=16))
}
mac <- function(pad, text) {
HexdigestToDigest(digest(append(packBits(xor(key, pad)), text),
algo='sha1', serialize=FALSE))
}
if(nchar(key) >= 64) {
keyDigested <- digest(key, algo="sha1", serialize=FALSE)
key <- intToUtf8(strtoi(HexdigestToDigest(keyDigested), base=16))
}
key <- rawToBits(as.raw(append(utf8ToInt(key), zero)[1:blockSize]))
base64(mac(outerpad, mac(innerpad, charToRaw(data))))[1]
}
## Sign an request made up of the URL, the parameters as a named character
## vector the consumer key and secret and the token and token secret.
signRequest <- function(uri, consumerKey, consumerSecret, params=character(),
oauthKey = "", oauthSecret = "", httpMethod = "GET",
nonce = genNonce(),
timestamp = Sys.time()) {
httpMethod <- toupper(httpMethod)
params["oauth_nonce"] <- nonce
params["oauth_timestamp"] <- as.integer(timestamp)
params["oauth_consumer_key"] <- consumerKey
params["oauth_signature_method"] <- 'HMAC-SHA1'
params["oauth_version"] <- '1.0'
if(oauthKey != "") params["oauth_token"] <- oauthKey
odat <- paste(
encodeURI(httpMethod), encodeURI(uri),
encodeURI(normalizeParams(params, encodeURI), post.amp = TRUE),
sep = "&"
)
okey <- encodeURI(consumerSecret)
if(oauthSecret != "") okey <- paste(okey, encodeURI(oauthSecret), sep = "&")
params["oauth_signature"] <- signWithHMAC(okey, odat)
return(params)
}
Now this function tries to replicate the example at the openpaths website:
openpaths <- function(
access_key=getOption("openpaths.access_key"),
secret_key=getOption("openpaths.secret_key"),
curl=getCurlHandle()) {
uri <- 'https://openpaths.cc/api/1'
params <- signRequest(uri, consumerKey=access_key, consumerSecret=secret_key)
oa_header <- paste(names(params), params, sep="=", collapse=",")
ret <- getURL(
uri,
curl=curl,
.opts=list(
header=TRUE,
verbose=TRUE,
httpheader=c(Authorization=paste("OAuth ", oa_header, sep="")),
ssl.verifypeer = TRUE,
ssl.verifyhost = TRUE,
cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")
)
)
return(ret)
}
I've made some progress on this problem, although it's challenging due
to the flakiness of the site, and the custom OAuth process that they're
using. First you'll need to install development version of httr - this
exports some previously internal functions.
devtools::install_github("hadley/httr")
OpenPaths is unusual in that the app secret and key are the same as the
token and token secret. This means we need to write a custom auth
header:
library(httr)
app <- oauth_app("OpenPaths", "JSLEKAPZIMFVFROHBDT4KNBVSI")
#> Using secret stored in environment variable OPENPATHS_CONSUMER_SECRET
# Implement custom header for 2-leg authentication, and oauth_body_hash
auth_header <- function(url, method = "GET") {
oauth_signature(url, method, app, app$key, app$secret,
# Use sha1 of empty string since http request body is empty
body_hash = "da39a3ee5e6b4b0d3255bfef95601890afd80709")
}
Then you can use this to sign your request. This is currently failing
for me because the site seems to be down (again).
url <- "https://openpaths.cc/api/1"
r <- GET(url, oauth_header(auth_header(url)))
stop_for_status(r)
content(r)

Resources