I am parsing XML files from a web service, and occasionally I encounter this error:
xml2:::read_xml.raw(rs$content) # where the object rs is the response from the webservice, obtained using the httr package
Error in read_xml.raw(x, encoding = encoding, ...) :
xmlParseCharRef: invalid xmlChar value 2 [9]
I downloaded thousands of XMLs and only a few are broken.
My questions are then:
How do I locate the characters in the response that cause the error?
And what is the general strategy for fixing an invalid XML caused by invalid xmlChars?
I have circumvented the problem by parsing the response as HTML, but I would rather fix the issue and parse it as XML.
Thanks!
I was able to figure it out by doing the following:
First, to peek inside the content of the httr response:
xml_broken <- readBin(rs$content, what = "character")
Then I was able to systematically delete data from the broken XML, until I finally found the string of text that caused the problem:
"&#2;" # from the context I could see that this should be parsed as the Danish character 'æ'
From https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references I could see that this should in fact be encoded as
"&aelig;"
So finally the httr content can be parsed by doing:
rs$content %>%
readBin(what = "character") %>%
gsub(pattern = "&#2;", replacement = "&aelig;") %>%
XML::xmlParse()
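For the first part of the question (locating the offending characters), a rough sketch of one way to find them automatically instead of by deleting data, assuming the same rs response object: scan the raw text for numeric character references that XML 1.0 forbids (control characters other than tab, newline and carriage return).
txt <- readBin(rs$content, what = "character")
# find all decimal numeric character references, e.g. "&#2;"
# (hexadecimal ones would use the &#x...; form instead)
refs <- regmatches(txt, gregexpr("&#[0-9]+;", txt))[[1]]
# keep only references whose code point is a control character XML 1.0 does not allow
codes <- as.integer(gsub("[^0-9]", "", refs))
unique(refs[codes < 32 & !codes %in% c(9, 10, 13)])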
library(readr)
Disease <- read_csv(file = "C:/Users/bryan/Documents/Data Science Project/CT5163_Project_2021_22_Disease_final.csv")
ForestFire <- read_csv(file = "C:/Users/bryan/Documents/Data Science Project/CT5163_Project_2021_22_ForestFire_final.csv")
Salary <- read_csv(file = "C:/Users/bryan/Documents/Data Science Project/CT5163_Project_2021_22_Salary_final.csv")
The first two files read perfectly with no issues. However, when the Salary file is read, an error comes back with the following: "Error in nchar(x, "width") : invalid multibyte string, element 1"
Does anyone know why this happens and what I can do to fix it?
Try
locale=locale(encoding="latin1")
in the call to read_csv
I had the same problem with a local CSV. I fixed it using locale=locale(encoding="latin1") in the read_csv call.
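For completeness, a minimal sketch of the suggested call, reusing the Salary file path from the question:
library(readr)
# tell readr the file is Latin-1 encoded rather than the default UTF-8
Salary <- read_csv(
  file = "C:/Users/bryan/Documents/Data Science Project/CT5163_Project_2021_22_Salary_final.csv",
  locale = locale(encoding = "latin1")
)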
I am trying to run a simple command using the gtrendsR package, but it gives me an error saying Error in make.names(col.names, unique = TRUE) :
invalid multibyte string 1
Here is the code:
res <- gtrends(c("nhl", "nba"), geo = c("CA", "US"))
Sys.setlocale("LC_CTYPE", "English")
If the error is due to a non-English language setting, then this should work for you.
However, if you want to search for keywords in languages other than English (e.g. Chinese in my case), another problem is that the keywords in the retrieved data may come back incorrectly encoded. My trick is simply to reset LC_CTYPE to the original setting afterwards:
Sys.setlocale("LC_CTYPE", "Chinese (Traditional)")
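Putting the two steps together, a minimal sketch with the keywords from the question; it restores whatever locale you started with rather than hard-coding one:
library(gtrendsR)
# remember the current locale so it can be restored afterwards
old_ctype <- Sys.getlocale("LC_CTYPE")
Sys.setlocale("LC_CTYPE", "English")  # avoids the invalid multibyte string error
res <- gtrends(c("nhl", "nba"), geo = c("CA", "US"))
Sys.setlocale("LC_CTYPE", old_ctype)  # restore the original setting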
For anyone who has the same difficulty as me:
Even though some keywords worked well, others did not, and I do not know what makes the difference.
Some keywords caused the [Error in make.names(col.names, unique = TRUE) : invalid multibyte string] problem.
Things I tried that did not work:
read.csv(~, fileEncoding = "UTF-8") and (~~ encoding = "UTF-8")
re-saving the file in Notepad
Encoding()
The solution
To begin with, I use the Korean language on Windows 10, and all of my CSV files are encoded as ASCII.
If I re-encode the original CSV files, the problems occur at the read-file step.
Conclusion
As shown above, Sys.setlocale() is the only solution in my case, with some limitations.
You can find your own locale with Sys.getlocale().
In my case,
["LC_COLLATE=Korean_Korea.949;LC_CTYPE=Korean_Korea.949;LC_MONETARY=Korean_Korea.949;LC_NUMERIC=C;LC_TIME=Korean_Korea.949"]
So I changed the locale setting with Sys.setlocale("LC_CTYPE", "English").
Limitations
Even when "geo" is correct, the "related_topics" result is questionable because the related topics are translated.
Below is my code
library(gtrendsR)
library(reshape2)  # for dcast()
# interest-over-time table for the keywords in key_final
google.trends = gtrends(keyword = key_final, geo = "KR", gprop = "web", time = "2018-01-01 2018-11-30")[[1]]
# one column of hits per keyword/geo combination
google.trends = dcast(google.trends, date ~ keyword + geo, value.var = "hits")
rownames(google.trends) = google.trends$date
google.trends$date = NULL
google.trends
plot(google.trends[[1]], type = 'l')
(Screenshots omitted: the query and plot work, but the result comes back translated.)
I am trying to analyze bill texts from LegiScan, but am running into problems decoding the text from the API pull response. It turns out LegiScan encodes the full text of all legislation in base 64 when pulled through their API, and I am having some trouble decoding it.
The downloaded file (the Dropbox link in the code below) is an example of the full-text portion of the JSON result that I downloaded through the API. However, the usual methods do not seem to work on it.
What I have tried:
LegiScan does not seem to support R directly, so I used the LegiscanR package. I used LegiscanR's BillText function to get the correct JSON link, then used parseBillText to try to decode the text from the link into UTF-8. However, it throws a fromJSON error even with the correct API key and document id stated in the link:
Error in fromJSON(content, handler, default.size, depth, allowComments, :
object 'Strict' not found
Using the base64decode (base64enc package) or base64Decode (RCurl package) function to convert the text from base 64 to raw, and then using the rawToChar function to convert it into characters.
My code:
library(base64enc)
text <- base64decode("https://www.dropbox.com/s/5ozd0a1zsb6y9pi/Legiscan_fulltext.txt?dl=0")
rawToChar(text)
# replace the embedded nul bytes so that rawToChar() can run
Nul <- text == as.raw(00)
text[Nul] <- as.raw(20)
text2 <- rawToChar(text)
However, trying to use the rawToChar alone gives me an "embedded nul in string" error
Error in rawToChar(test2) :
embedded nul in string: '%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<>>>\r\nendobj\r\n2 0 obj\r\n<>\r\nendobj\r\n3 0 obj\r\n<>/ExtGState<>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 612 792] /Contents 4 0 R/Group<>/Tabs/S/StructParents 0>>\r\nendobj\r\n4 0 obj\r\n<>\r\nstream\r\nx\x9c\xb5ZYs\xdb8\022~w\x95\xff\003*O\u0516M\021ཛJ\x95\xe3ę̵\x99\xb1\xa7f\xb7\x92y\xa0$\xca\xe2\x86"\025\036\xf6\xe6\xdfow\003\x94\bR0sh\x93*\x99G\xa3\001|\xdd\xfdu7\xa4\xf9U\xd5d\xebdٰ\xe7\xcf\xe7WM\x93,7銽\x9f\u07d5\xbb\xbf\xe6w\x9fw\xe9\xfc]r\x9f\025I\x93\x95\xc5\xfc\xb6]4\xf8\xe8\x874Y\xa5Ջ\027\xec\xe5\xabk\xf6\xf2\xee\xfcl~\xc3Yl\xc7\
Substituting these nulls out to represent spaces allows rawToChar to run, but the output is gibberish, or in another form of encoding that is not the expected English text characters.
[1] "\x86\xdbi\xb3\xff\xf0\xc3\ak\xa2\x96\xe8\xc5\xca&\xfe\xcf\xf9\xa37tk\\xeco\xac\xbd\xa6/\xcbz\b\xacq\xa9\u07faYm{\033m\xc6\xd7e"
Any other ideas on what else to try? Thanks.
I have been dealing with the same problem in Python, where the following code worked:
import base64

raw = base64.b64decode(bill_text['doc'])  # 'doc' holds the base-64 encoded PDF
# write the decoded bytes out as a PDF file
with open(output_file, "wb") as f:
    f.write(raw)
I think in your case you may be trying to convert the document directly into text, but that is not so easy, because the decoded payload is a PDF. In Python I handled it by parsing the saved PDF file with functions from the PyPDF2 library.
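A rough R equivalent of the same idea, assuming doc_base64 holds the base-64 string from the API's doc field (the variable and file names here are just placeholders): decode to raw bytes and write them straight to a .pdf file instead of calling rawToChar().
library(base64enc)
# doc_base64 is assumed to be the base-64 string from the 'doc' field of the API response
raw_pdf <- base64decode(doc_base64)
# the payload is a PDF, not plain text, so write the bytes to disk as-is
writeBin(raw_pdf, "bill_text.pdf")
# text can then be extracted from the saved PDF, e.g. with the pdftools package:
# txt <- pdftools::pdf_text("bill_text.pdf")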
For an XML document containing escape characters, I have seen several options to work around the problem. What is the fastest/simplest method to either ignore the invalid characters or replace them with the correct format?
The data is going into a database, and the column that the data with the potential for funny characters is going into (location address) is the least important.
I'm getting the entity_name parsing error at the dataset.ReadXml command
Here is my code:
' open an XmlReader over the file in the configured meter folder
FN = Path.GetFileName(file1).ToString()
xmlFile = XmlReader.Create(Path.Combine(My.Settings.Local_Meter_Path, FN), New XmlReaderSettings())
' the entity_name parse error is thrown here
ds.ReadXml(xmlFile)
This is the error that I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least, not manually - too large).
Error in tolower(m) : invalid multibyte string X
It seems to be French company names containing the É character that are the problem, although I have not investigated all of them (it is also not possible to do so manually).
It's strange, because my thought was that encoding issues would have been identified during read.csv(), rather than during operations after the fact.
Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?
Here's how I solved my problem:
First, I opened the raw data in a text editor (Geany, in this case), clicked Properties, and identified the encoding type.
After that, I used the iconv() function:
x <- iconv(x,"WINDOWS-1252","UTF-8")
To be more specific, I did this for every column of the data.frame from the imported CSV. It is important to note that I set stringsAsFactors = FALSE in my read.csv() call.
dat[, sapply(dat, is.character)] <- sapply(
  dat[, sapply(dat, is.character)],
  iconv, "WINDOWS-1252", "UTF-8")
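If opening the file in a text editor is not convenient, readr's guess_encoding() can also suggest a likely encoding; a quick sketch, with a placeholder file name:
library(readr)
# guess_encoding() lists candidate encodings with a confidence score for each
guess_encoding("my_data.csv")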
I was getting the same error. However, in my case it wasn't when I was reading the file but a bit later, when processing it. I realised I was getting the error because the file wasn't read with the correct encoding in the first place.
I found a much simpler solution (at least for my case) and wanted to share it. I simply added the encoding as below and it worked.
read.csv(<path>, encoding = "UTF-8")
library(tidyverse)
data_clean = data %>%
mutate(new_lowercase_col = tolower(enc2utf8(as.character(my_old_column))))
Where new_lowercase_col is the name of the new column I'm making out of the old uppercase one, which was called my_old_column.
I know this has been answered already but thought I'd share my solution to this as I experienced the same thing.
In my case, I used the function str_trim() from the stringr package to trim whitespace from the start and end of the string.
library(stringr)  # for str_trim()
com$uppervar <- toupper(str_trim(com$var))
# to avoid datatables warning: error in tolower(x) invalid multibyte string
# assuming all columns are char
new_data <- as.data.frame(
  lapply(old_data, enc2utf8),
  stringsAsFactors = FALSE
)
My solution to this issue
library(dplyr)    # pipes
library(stringi)  # for stri_enc_isutf8
# read in the CSV data
old_data <- read.csv("non_utf_data.csv", encoding = "UTF-8")
# despite specifying UTF-8, the columns below are not UTF-8:
all(stri_enc_isutf8(old_data$problem_column))
# the code below uses regular expressions to cleanse; you may need to tinker with the
# last portion, which selects the characters to retain
utf_eight_data <- old_data %>%
  mutate(problem_column = gsub("[^[:alnum:][:blank:]?&/\\-]", "", old_data$problem_column)) %>%
  rename(solved_problem = problem_column)
# this column is now UTF-8:
all(stri_enc_isutf8(utf_eight_data$solved_problem))