Problem
I have an XML file that I would like to parse in R. I know that this file is not corrupted because the following Python code seems to work:
>>> import xml.etree.ElementTree as ET
>>> xml_tree = ET.parse(PATH_TO_MY_XML_FILE)
>>> do_my_regular_xml_stuff_that_seems_to_work_no_problem(xml_tree)
Now, when I try to run the following code in R, I get an error message:
> library("XML")
> xml_tree <- XML::xmlParse(PATH_TO_MY_XML_FILE)
Error in nchar(text_repr): invalid multibyte string, element 1
Traceback:
Alright, maybe the parser doesn't recognize the encoding. Luckily this should be specified in a decent XML file. So, I go to my shell and check:
$ head -n1 PATH_TO_MY_XML_FILE
??<?xml version="1.0" encoding="utf-16"?>
Now, I can go back to R and explicitly pass on the encoding, only to face the next error message where I got stuck now:
> library("XML")
> xml_tree <- XML::xmlParse(PATH_TO_MY_XML_FILE, encoding='UTF-16')
Start tag expected, '<' not found
Error: 1: Start tag expected, '<' not found
Traceback:
1. XML::xmlParse(filePath, encoding = "UTF-16")
2. (function (msg, ...)
. {
. if (length(grep("\\\n$", msg)) == 0)
. paste(msg, "\n", sep = "")
. if (immediate)
. cat(msg)
. if (length(msg) == 0) {
. e = simpleError(paste(1:length(messages), messages, sep = ": ",
. collapse = ""))
. class(e) = c(class, class(e))
. stop(e)
. }
. messages <<- c(messages, msg)
. })(character(0))
A last attempt to check (in R) if the file is in fact "UTF-16" encoded yields:
> f <- file(filePath, 'r', encoding = "UTF-16")
> firstLine <- readLines(f, n=1)
> close(f)
> print(line)
[1] "<?xml version=\"1.0\" encoding=\"utf-16\"?>"
Which looks just about right to me.
Question(s)
Does anyone know what is happening here? Is this a bug from the XML library? Is the file maybe not 'UTF-16' encoded, even though it claims it is? What are the two question marks ?? that I see when I print the file into the shell? These question marks don't appear when reading in the file properly...
Is this a bug from the XML library?
I think there could be a bug here. If I generate a valid UTF-16 XML document, which will have an initial byte-order mark:
$ echo '<a>😊</a>' | iconv -t utf-16 >a-utf16.xml
$ xxd a-utf16.xml
00000000: fffe 3c00 6100 3e00 3dd8 0ade 3c00 2f00 ..<.a.>.=...<./.
00000010: 6100 3e00 0a00 a.>...
then I can parse it with:
> XML::xmlParse('a-utf16.xml')
<?xml version="1.0"?>
<a>😊</a>
but not if I specify the encoding:
> XML::xmlParse('a-utf16.xml', encoding='utf-16')
Start tag expected, '<' not found
Error: 1: Start tag expected, '<' not found
Your original problem was when you weren't specifying the encoding. However:
I know that this file is not corrupted because the following Python code seems to work
That's a good hint, but I think you'll find edge cases where that doesn't hold. Try iconv for a second opinion on whether the file is encoded correctly.
For a more specific response, you'll need to post a reproducible XML file,
Related
im using R package httr to get a HTTP-Response for a specific link.
When trying to parse the content of the response im getting the Error:
Fehler in parse(text = script_content) : <text>:1:10: Unerwartete(s) '['
1: {"lines":[
Translated to enlgish it says something like this (sorry for my error messages being in German):
Error in parsing(text = script_content) : <text>1:10: Unexpected '['
1: {"lines":[
It seems as there is a problem with the format/encoding of the text. Here is my code:
script <-
GET(
url = "https://my_url.which_origin_is_not_important/my_script.R",
authenticate(username, pass)
)
script_content <- content(script, as = "text", encoding = "ISO-8859-1")
parsed_condent <- parse(text = script_content )
The value of script_content looks like this:
"{\"lines\":[{\"text\":\"################## FUNCTION ##################\"},{\"text\":\"\"},{\"text\":\"library(log4r)\"}],\"start\":0,\"size\":32,\"isLastPage\":true,\"limit\":500,\"nextPageStart\":null}"
Some more background to this operation: Im trying to source a code, which is currently inside of a private repository. I wrote the code myself i'm trying to source. I made sure, that the issue is not coming from within th code.
I got the solution from: Sourcing R files in a private github folder
Thanks for any advice!!
In the code you'll see that it's an if/then function that will write a file directory depending on what operating system you have. This will run on Mac but when it sees I'm on windows I get this error: Error in system(paste("/RTools/bin/wc -l \"", filename, "\"", sep = ""), : '/RTools/bin/wc' not found
I've tried debugging the function and as I run it, filename is being passed the correct csv file in the correct format. I believe I have an issue with the two back slashes going the wrong way and perhaps an extra quotation mark. Line 3-5 is where I believe my problem is.
function( filename ){
if(.Platform$OS.type=="windows"){
system.time({
cmd<-system(paste("/RTools/bin/wc -l \"",filename, "\"",sep=""), intern=TRUE)
cmd<-strsplit(cmd, " ")[[1]][1]
})
return(as.numeric(cmd) + 1)
} else {
system.time({
cmd<-system(paste("wc -l \"",filename,"\" | awk \'{print $1}\'", sep=""), intern=TRUE)
cmd<-strsplit(cmd, " ")[[1]][1]
})
return(as.numeric(cmd) + 1)
}
}
I expect it to build the correct file path, it results in the error I listed.
as suggest by Dirk Eddelbuettel in this talk and this answer I tried to profile compiled R code using gperftools. Here is what I did.
I used Dirks profilingSmall.R as script that I want to profile. I repeat it here:
## R Extensions manual, section 3.2 'Profiling R for speed'
## 'N' reduced to 99 here
suppressMessages(library(MASS))
suppressMessages(library(boot))
storm.fm <- nls(Time ~ b*Viscosity/(Wt - c), stormer, start = c(b=29.401, c=2.2183))
st <- cbind(stormer, fit=fitted(storm.fm))
storm.bf <- function(rs, i) {
st$Time <- st$fit + rs[i]
tmp <- nls(Time ~ (b * Viscosity)/(Wt - c), st, start = coef(storm.fm))
tmp$m$getAllPars()
}
rs <- scale(resid(storm.fm), scale = FALSE) # remove the mean
Rprof("boot.out")
storm.boot <- boot(rs, storm.bf, R = 99) # pretty slow
Rprof(NULL)
To profile it I run the following script
LD_PRELOAD="/usr/lib/libprofiler.so.0"
\CPUPROFILE=sample.log \
Rscript profilingSmall.R
Then I tried to parse the log file using
pprof /usr/bin/R sample.log
This returned the following error
Using local file /usr/bin/R.
Using local file sample.log.
substr outside of string at /usr/local/bin/pprof line 3618.
Use of uninitialized value in string eq at /usr/local/bin/pprof line 3618.
substr outside of string at /usr/local/bin/pprof line 3620.
Use of uninitialized value in string eq at /usr/local/bin/pprof line 3620.
sample.log: header size >= 2**16
sample.log is empty. However, a bunch of sample.log_digit were created that contain information that looks reasonable.
I had the same problem, but realized my problem. I'd done:
export CPUPROFILE=test.prof
export LD_PRELOAD="/usr/local/lib/libprofiler.so"
testprog ...
pprof --web `which testprog` test.prof
If I stopped after running testprog the prof files wasn't empty but after pprof it was. pprof crashed with the substr error.
What I realized later was that by setting and exporting LD_PRELOAD that libprofiler.so was also loaded for pprof, overwriting test.prof.
You just need to ensure LD_PRELOAD is not set when you run pprof.
I'm using gperftools-2.5, and I also encountered the same problem:
[root#localhost ivrserver]# pprof --text ./IvrServer ivr.prof
Using local file ./IvrServer.
Using local file ivr.prof.
substr outside of string at /usr/local/bin/pprof line 3695.
Use of uninitialized value in string eq at /usr/local/bin/pprof line 3695.
substr outside of string at /usr/local/bin/pprof line 3697.
Use of uninitialized value in string eq at /usr/local/bin/pprof line 3697.
ivr.prof: header size >= 2**16
I found this is because the prof file (ivr.prof in my example) is empty.
everytime the profiler start and end, it will create a new prof file, you should use xxx.prof.0 xxx.prof.1 ... to get the right result
I m trying to use rPython package to pass some arguments into python code and get results back. But for some reason I m getting weird encoding from my python code. Maybe someone has some hints to point me out.
Here is my simple code to test:
require(rPython)
#pass the test word 'audiention' (in ukrainian)
word<-"аудієнція"
python.assign("input", word)
python.exec("input = input.encode('utf-8')")
python.exec("print input") #the output in console is correct at this step: аудієнція
x<-python.get("input")
cat(x) # the output is: 0C4VT=FVO
Does anybody have some suggestions why the output of python.get is encoded weird?
My Sys.getlocale() output is:
Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=uk_UA.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=uk_UA.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=uk_UA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=uk_UA.UTF-8;LC_IDENTIFICATION=C"
Thank you in advance for any hints!
I have recently built a new package based on the original rPython code called SnakeCharmR that addresses this and other problems rPython had.
A quick comparison:
> library(SnakeCharmR)
> py.assign("a", "'")
> py.get("a")
[1] "'"
> py.assign("a", "áéíóú")
> py.get("a")
[1] "áéíóú"
> library(rPython)
> python.assign("a", "'")
File "<string>", line 2
a =' [ "'" ] '
^
SyntaxError: EOL while scanning string literal
> python.assign("a", "áéíóú")
> python.get("a")
[1] "\xe1\xe9\xed\xf3\xfa"
You can install SnakeCharmR like this:
> library(devtools)
> install_github("asieira/SnakeCharmR")
Hope this helps.
I've given a look around about what puzzles me and I only found this:
Do some programs not accept process substitution for input files?
which is partially helping, but I really would like to understand the full story.
I noticed that some of my R scripts give different (ie. wrong) results when I use process substitution.
I tried to pinpoint the problem with a test case:
This script:
#!/usr/bin/Rscript
args <- commandArgs(TRUE)
file <-args[1]
cat(file)
cat("\n")
data <- read.table(file, header=F)
cat(mean(data$V1))
cat("\n")
with an input file generated in this way:
$ for i in `seq 1 10`; do echo $i >> p; done
$ for i in `seq 1 500`; do cat p >> test; done
leads me to this:
$ ./mean.R test
test
5.5
$ ./mean.R <(cat test)
/dev/fd/63
5.501476
Further tests reveal that some lines are lost...but I would like to understand why. Does read.table (scan gives the same results) uses seek?
Ps.
with a smaller test file (100) an error is reported:
$./mean.R <(cat test3)
/dev/fd/63
Error in read.table(file, header = F) : no lines available in input
Execution halted
Add #1: with a modified script that uses scan the results are the same.
I have written this general purpose function for opening a file connection in my own scripts:
OpenRead <- function(arg) {
if (arg %in% c("-", "/dev/stdin")) {
file("stdin", open = "r")
} else if (grepl("^/dev/fd/", arg)) {
fifo(arg, open = "r")
} else {
file(arg, open = "r")
}
}
In your code, replace file with file <- OpenRead(file) and it should handle all of the below:
./mean.R test
./mean.R <(cat test)
cat test | ./mean.R -
cat test | ./foo.R /dev/stdin