R library "XML" doesn't recognize encoding - r

Problem
I have an XML file that I would like to parse in R. I know that this file is not corrupted because the following Python code seems to work:
>>> import xml.etree.ElementTree as ET
>>> xml_tree = ET.parse(PATH_TO_MY_XML_FILE)
>>> do_my_regular_xml_stuff_that_seems_to_work_no_problem(xml_tree)
Now, when I try to run the following code in R, I get an error message:
> library("XML")
> xml_tree <- XML::xmlParse(PATH_TO_MY_XML_FILE)
Error in nchar(text_repr): invalid multibyte string, element 1
Traceback:
Alright, maybe the parser doesn't recognize the encoding. Luckily this should be specified in a decent XML file. So, I go to my shell and check:
$ head -n1 PATH_TO_MY_XML_FILE
??<?xml version="1.0" encoding="utf-16"?>
Now, I can go back to R and explicitly pass on the encoding, only to face the next error message where I got stuck now:
> library("XML")
> xml_tree <- XML::xmlParse(PATH_TO_MY_XML_FILE, encoding='UTF-16')
Start tag expected, '<' not found
Error: 1: Start tag expected, '<' not found
Traceback:
1. XML::xmlParse(filePath, encoding = "UTF-16")
2. (function (msg, ...)
. {
. if (length(grep("\\\n$", msg)) == 0)
. paste(msg, "\n", sep = "")
. if (immediate)
. cat(msg)
. if (length(msg) == 0) {
. e = simpleError(paste(1:length(messages), messages, sep = ": ",
. collapse = ""))
. class(e) = c(class, class(e))
. stop(e)
. }
. messages <<- c(messages, msg)
. })(character(0))
A last attempt to check (in R) if the file is in fact "UTF-16" encoded yields:
> f <- file(filePath, 'r', encoding = "UTF-16")
> firstLine <- readLines(f, n=1)
> close(f)
> print(line)
[1] "<?xml version=\"1.0\" encoding=\"utf-16\"?>"
Which looks just about right to me.
Question(s)
Does anyone know what is happening here? Is this a bug from the XML library? Is the file maybe not 'UTF-16' encoded, even though it claims it is? What are the two question marks ?? that I see when I print the file into the shell? These question marks don't appear when reading in the file properly...

Is this a bug from the XML library?
I think there could be a bug here. If I generate a valid UTF-16 XML document, which will have an initial byte-order mark:
$ echo '<a>😊</a>' | iconv -t utf-16 >a-utf16.xml
$ xxd a-utf16.xml
00000000: fffe 3c00 6100 3e00 3dd8 0ade 3c00 2f00 ..<.a.>.=...<./.
00000010: 6100 3e00 0a00 a.>...
then I can parse it with:
> XML::xmlParse('a-utf16.xml')
<?xml version="1.0"?>
<a>😊</a>
but not if I specify the encoding:
> XML::xmlParse('a-utf16.xml', encoding='utf-16')
Start tag expected, '<' not found
Error: 1: Start tag expected, '<' not found
Your original problem was when you weren't specifying the encoding. However:
I know that this file is not corrupted because the following Python code seems to work
That's a good hint, but I think you'll find edge cases where that doesn't hold. Try iconv for a second opinion on whether the file is encoded correctly.
For a more specific response, you'll need to post a reproducible XML file,

Related

Error when trying to prase a HTTP-Request in R

im using R package httr to get a HTTP-Response for a specific link.
When trying to parse the content of the response im getting the Error:
Fehler in parse(text = script_content) : <text>:1:10: Unerwartete(s) '['
1: {"lines":[
Translated to enlgish it says something like this (sorry for my error messages being in German):
Error in parsing(text = script_content) : <text>1:10: Unexpected '['
1: {"lines":[
It seems as there is a problem with the format/encoding of the text. Here is my code:
script <-
GET(
url = "https://my_url.which_origin_is_not_important/my_script.R",
authenticate(username, pass)
)
script_content <- content(script, as = "text", encoding = "ISO-8859-1")
parsed_condent <- parse(text = script_content )
The value of script_content looks like this:
"{\"lines\":[{\"text\":\"################## FUNCTION ##################\"},{\"text\":\"\"},{\"text\":\"library(log4r)\"}],\"start\":0,\"size\":32,\"isLastPage\":true,\"limit\":500,\"nextPageStart\":null}"
Some more background to this operation: Im trying to source a code, which is currently inside of a private repository. I wrote the code myself i'm trying to source. I made sure, that the issue is not coming from within th code.
I got the solution from: Sourcing R files in a private github folder
Thanks for any advice!!

My paste function is not providing an accurate path to the file with csv data

In the code you'll see that it's an if/then function that will write a file directory depending on what operating system you have. This will run on Mac but when it sees I'm on windows I get this error: Error in system(paste("/RTools/bin/wc -l \"", filename, "\"", sep = ""), : '/RTools/bin/wc' not found
I've tried debugging the function and as I run it, filename is being passed the correct csv file in the correct format. I believe I have an issue with the two back slashes going the wrong way and perhaps an extra quotation mark. Line 3-5 is where I believe my problem is.
function( filename ){
if(.Platform$OS.type=="windows"){
system.time({
cmd<-system(paste("/RTools/bin/wc -l \"",filename, "\"",sep=""), intern=TRUE)
cmd<-strsplit(cmd, " ")[[1]][1]
})
return(as.numeric(cmd) + 1)
} else {
system.time({
cmd<-system(paste("wc -l \"",filename,"\" | awk \'{print $1}\'", sep=""), intern=TRUE)
cmd<-strsplit(cmd, " ")[[1]][1]
})
return(as.numeric(cmd) + 1)
}
}
I expect it to build the correct file path, it results in the error I listed.

gperftools Error: substr outside of string at /usr/local/bin/pprof line 3618

as suggest by Dirk Eddelbuettel in this talk and this answer I tried to profile compiled R code using gperftools. Here is what I did.
I used Dirks profilingSmall.R as script that I want to profile. I repeat it here:
## R Extensions manual, section 3.2 'Profiling R for speed'
## 'N' reduced to 99 here
suppressMessages(library(MASS))
suppressMessages(library(boot))
storm.fm <- nls(Time ~ b*Viscosity/(Wt - c), stormer, start = c(b=29.401, c=2.2183))
st <- cbind(stormer, fit=fitted(storm.fm))
storm.bf <- function(rs, i) {
st$Time <- st$fit + rs[i]
tmp <- nls(Time ~ (b * Viscosity)/(Wt - c), st, start = coef(storm.fm))
tmp$m$getAllPars()
}
rs <- scale(resid(storm.fm), scale = FALSE) # remove the mean
Rprof("boot.out")
storm.boot <- boot(rs, storm.bf, R = 99) # pretty slow
Rprof(NULL)
To profile it I run the following script
LD_PRELOAD="/usr/lib/libprofiler.so.0"
\CPUPROFILE=sample.log \
Rscript profilingSmall.R
Then I tried to parse the log file using
pprof /usr/bin/R sample.log
This returned the following error
Using local file /usr/bin/R.
Using local file sample.log.
substr outside of string at /usr/local/bin/pprof line 3618.
Use of uninitialized value in string eq at /usr/local/bin/pprof line 3618.
substr outside of string at /usr/local/bin/pprof line 3620.
Use of uninitialized value in string eq at /usr/local/bin/pprof line 3620.
sample.log: header size >= 2**16
sample.log is empty. However, a bunch of sample.log_digit were created that contain information that looks reasonable.
I had the same problem, but realized my problem. I'd done:
export CPUPROFILE=test.prof
export LD_PRELOAD="/usr/local/lib/libprofiler.so"
testprog ...
pprof --web `which testprog` test.prof
If I stopped after running testprog the prof files wasn't empty but after pprof it was. pprof crashed with the substr error.
What I realized later was that by setting and exporting LD_PRELOAD that libprofiler.so was also loaded for pprof, overwriting test.prof.
You just need to ensure LD_PRELOAD is not set when you run pprof.
I'm using gperftools-2.5, and I also encountered the same problem:
[root#localhost ivrserver]# pprof --text ./IvrServer ivr.prof
Using local file ./IvrServer.
Using local file ivr.prof.
substr outside of string at /usr/local/bin/pprof line 3695.
Use of uninitialized value in string eq at /usr/local/bin/pprof line 3695.
substr outside of string at /usr/local/bin/pprof line 3697.
Use of uninitialized value in string eq at /usr/local/bin/pprof line 3697.
ivr.prof: header size >= 2**16
I found this is because the prof file (ivr.prof in my example) is empty.
everytime the profiler start and end, it will create a new prof file, you should use xxx.prof.0 xxx.prof.1 ... to get the right result

rPython method python.get returns weird encoding

I m trying to use rPython package to pass some arguments into python code and get results back. But for some reason I m getting weird encoding from my python code. Maybe someone has some hints to point me out.
Here is my simple code to test:
require(rPython)
#pass the test word 'audiention' (in ukrainian)
word<-"аудієнція"
python.assign("input", word)
python.exec("input = input.encode('utf-8')")
python.exec("print input") #the output in console is correct at this step: аудієнція
x<-python.get("input")
cat(x) # the output is: 0C4VT=FVO
Does anybody have some suggestions why the output of python.get is encoded weird?
My Sys.getlocale() output is:
Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=uk_UA.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=uk_UA.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=uk_UA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=uk_UA.UTF-8;LC_IDENTIFICATION=C"
Thank you in advance for any hints!
I have recently built a new package based on the original rPython code called SnakeCharmR that addresses this and other problems rPython had.
A quick comparison:
> library(SnakeCharmR)
> py.assign("a", "'")
> py.get("a")
[1] "'"
> py.assign("a", "áéíóú")
> py.get("a")
[1] "áéíóú"
> library(rPython)
> python.assign("a", "'")
File "<string>", line 2
a =' [ "'" ] '
^
SyntaxError: EOL while scanning string literal
> python.assign("a", "áéíóú")
> python.get("a")
[1] "\xe1\xe9\xed\xf3\xfa"
You can install SnakeCharmR like this:
> library(devtools)
> install_github("asieira/SnakeCharmR")
Hope this helps.

Process substitution

I've given a look around about what puzzles me and I only found this:
Do some programs not accept process substitution for input files?
which is partially helping, but I really would like to understand the full story.
I noticed that some of my R scripts give different (ie. wrong) results when I use process substitution.
I tried to pinpoint the problem with a test case:
This script:
#!/usr/bin/Rscript
args <- commandArgs(TRUE)
file <-args[1]
cat(file)
cat("\n")
data <- read.table(file, header=F)
cat(mean(data$V1))
cat("\n")
with an input file generated in this way:
$ for i in `seq 1 10`; do echo $i >> p; done
$ for i in `seq 1 500`; do cat p >> test; done
leads me to this:
$ ./mean.R test
test
5.5
$ ./mean.R <(cat test)
/dev/fd/63
5.501476
Further tests reveal that some lines are lost...but I would like to understand why. Does read.table (scan gives the same results) uses seek?
Ps.
with a smaller test file (100) an error is reported:
$./mean.R <(cat test3)
/dev/fd/63
Error in read.table(file, header = F) : no lines available in input
Execution halted
Add #1: with a modified script that uses scan the results are the same.
I have written this general purpose function for opening a file connection in my own scripts:
OpenRead <- function(arg) {
if (arg %in% c("-", "/dev/stdin")) {
file("stdin", open = "r")
} else if (grepl("^/dev/fd/", arg)) {
fifo(arg, open = "r")
} else {
file(arg, open = "r")
}
}
In your code, replace file with file <- OpenRead(file) and it should handle all of the below:
./mean.R test
./mean.R <(cat test)
cat test | ./mean.R -
cat test | ./foo.R /dev/stdin

Resources