Remove special characters in xml file in R

I have a few thousand xml files that I would like to read into R. The problem is that some of these files have three special characters "ï»¿" at the beginning of the file that stop xmlTreeParse from reading the xml file. The error that I get is the following...
Error: 1: Start tag expected, '<' not found
This is due to the first line in the xml file, which is the following...
ï»¿<?xml version="1.0" encoding="utf-8"?>
If I manually remove the characters using Notepad, I have this at the beginning of the xml file and I am able to read it...
<?xml version="1.0" encoding="utf-8"?>
I'd like to be able to remove the characters automatically. The following is the code that I have written currently.
filenames <- list.files("...filepath...", pattern = "\\.xml$", full.names = TRUE)
files <- lapply(filenames, function(f) {
  xmlfile <- tryCatch(xmlTreeParse(file = f), error = function(e) print(f))
  xmltop <- xmlRoot(xmlfile)
  plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  name <- unname(plantcat$EntityNames)
  return(name)
})
I'm wondering how I can read the xml file in by removing the special characters in R. I have tried tryCatch, as you can see above, but I'm not sure how I can edit the xml file without actually reading it in first. Any help would be appreciated!
Edit: Using the following parsing code fixed the problem. I think when I opened the xml file in Notepad the characters were not visible, but in reality the file began with the string "ï»¿". It's possible that this was due to the encoding of the file, but I'm not sure of the specifics. Thank you @Prem.
xmlfile <- xmlTreeParse(gsub("ï»¿", "", readLines(f)), asText = TRUE)

The special chars at the beginning might come from a different encoding for the file, especially if your xml contains some special characters.
Try to specify the encoding. To identify which encoding is used, open the file in a hex editor and read the first bytes.
My hunch is that your special chars come from a BOM (byte order mark):
http://unicode.org/faq/utf_bom.html
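For instance, here is a minimal sketch (assuming a UTF-8 file; "myfile.xml" is a placeholder path): inspect the first bytes with readBin, then let R strip the marker by declaring the "UTF-8-BOM" encoding on the connection.
# The bytes ef bb bf at the start of the file indicate a UTF-8 BOM
readBin("myfile.xml", what = "raw", n = 3)
# Declaring "UTF-8-BOM" on the connection drops the marker before parsing
con <- file("myfile.xml", encoding = "UTF-8-BOM")
lines <- readLines(con)
close(con)
xmlfile <- XML::xmlTreeParse(lines, asText = TRUE)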

In your code, use readLines to read the file, and then gsub can be used to remove the junk value from the string.
xmlfile <- xmlTreeParse(gsub("ï»¿", "", readLines(f)), asText = TRUE)

Have you tried the gsub function? It is a very convenient function for character replacement (and deletion). This works for me:
gsub('ï»¿', '', string, fixed = TRUE)
on a variable string = 'ï»¿<?xml version="1.0" encoding="utf-8"?>'.
EDIT: I would also suggest the sed command if you're using a computer with GNU/Linux. It's a very powerful tool that deals perfectly with this task.
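For instance, a one-liner along these lines (a sketch assuming GNU sed, which understands \xHH escapes; the filename is a placeholder) strips the three BOM bytes in place, and can even be called from R:
# Remove a leading UTF-8 BOM (bytes ef bb bf) from the first line, in place
system("sed -i '1s/^\\xef\\xbb\\xbf//' myfile.xml")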

Related

Using readtext to extract text from XML

I am not used to working with XML files but need to extract text from various fields in XML files. Specifically, I've downloaded and saved XML files like the following: https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml. I'm interested in the text within the tag "regtext" in this and other similar XML files.
I've downloaded the XML files and stored them on my computer, but when I set the directory and attempt to use the readtext package to read from the XML files, I get the following error:
regtext <- readtext("/regdata/RegDataValidation", text_field = "regtext")
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Start tag expected, '<' not found [4]
I've tried to search the error, but nothing I've come across has helped me figure out what might be going on. This basic command works like a charm on any number of other document types, including .csv or .docx, but for some reason it just doesn't seem to recognize the files I'm trying to work with here. Any pointers would be much appreciated--I'm too much of a novice, and the readtext documentation does not give examples of how to work with XML.
Pursuant to comments below, I've also tried to specify a single saved XML file, as follows:
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "regtext")
Error in xml2_to_dataframe(xml) :
The xml format does not fit for the extraction without xPath
Use xPath method instead
In addition: There were 50 or more warnings (use warnings() to see the first 50)
I tried to specify an xPath expression on a single file. This did not return any errors, but it didn't actually extract any text, even though there should be plenty of text within the "regtext" node:
> regtext <- readtext("/regdata/RegDataValidation/0579- AC01.xml", text_field = "/regtext/*")
I end up with a dataframe with the correct doc_id, but no text.
From the error messages, the readtext function appears to be converting the xml file into a plain text document and the XML package is not accepting it as a valid document.
It is also likely that the XML parser is differentiating between "regtext" and "REGTEXT".
Here is a solution using the xml2 package. (I find this package provides a simpler interface and is easier to use)
library(xml2)
url <- "https://www.federalregister.gov/documents/full_text/xml/2007/09/18/07-4595.xml"
page <- read_xml(url)
#parse out the nodes within the "REGTEXT" sections
regtext <- xml_find_all(page, ".//REGTEXT")
#convert the regtext nodes into vector of strings
xml_text(regtext)
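The same approach should carry over to the locally saved files; a sketch, reusing the asker's directory path:
library(xml2)
filenames <- list.files("/regdata/RegDataValidation", pattern = "\\.xml$", full.names = TRUE)
# One character vector of REGTEXT contents per file
regtext <- lapply(filenames, function(f) xml_text(xml_find_all(read_xml(f), ".//REGTEXT")))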

Change Value in "broken" .xml File in R

I got a "broken" .xml file with a missing header and root element:
myBrokenXML.xml
<attribute1>false</attribute1>
<subjects>
  <subject>
    <population>Adult</population>
    <name>adult1</name>
  </subject>
</subjects>
This .xml file is the input for a program that I have to use, and the structure cannot be changed.
I would like to change the attribute "name" to adult5.
I tried using the xml2 package, but it requires a proper xml file for read_xml(), which returns this error message: "Extra content at the end of the document".
I tried reading the file line by line using readLines() and then writing a new line with writeLines(), but this again resulted in an error message: "cannot write to this connection".
Any suggestions are greatly appreciated. I am new to R and XML and have been at this for hours (and cursed the developers a few times in the process).
Thanks in advance!
Code using xml2:
XMLFile <- read_xml("myBrokenXML.xml")
Code using readLines/writeLines; this would still require deleting the original line:
conn <- file("myBrokenXML.xml", open = "r")
lines <- readLines(conn)
for (i in 1:length(lines)){
  print(lines[i])
  if (lines[i] == "\t\t<name>adult1</name>"){
    writeLines("\t\t<name>adult5</name>", conn)
  }
}
GOAL
I need to change the value of "name" from adult1 to adult5, and the file must keep the same structure (no header, no root element) at the end.
The easiest way to do this is to use read_html instead of read_xml, since read_html will attempt to parse even broken documents, whereas read_xml requires strict formatting. We can use this fact to create a repaired xml document by creating a new xml_document and writing the nodes obtained from read_html into it. The following function repairs fragments of xml into a proper xml document:
fix_xml <- function(xml_path, root_name = "root")
{
  my_xml <- xml2::xml_new_root("root")
  root <- xml2::xml_find_all(my_xml, "//root")
  my_html <- xml2::read_html(xml_path)
  fragment <- xml2::xml_find_first(my_html, xpath = "//body")
  new_root <- xml2::xml_set_name(fragment, root_name)
  new_root <- xml2::xml_replace(root, fragment)
  return(my_xml)
}
So we can do:
fix_xml("myBrokenXML.xml")
#> {xml_document}
#> <root>
#> [1] <attribute1>false</attribute1>
#> [2] <subjects>\n <subject>\n <population>Adult</population>\n <name>adult1...
The answer from Allan Cameron (#1) works fine, as long as your file does not include case-sensitive element names.
If someone ever runs into the same problem, here is what worked for me.
fix_xml <- function(xmlPath){
  con <- file(xmlPath)
  lines <- readLines(con)
  firstLine <- c("<root>")
  lastLine <- c("</root>")
  lines <- append(lines, firstLine, after = 0)
  lines <- append(lines, lastLine, after = length(lines))
  write(lines, xmlPath)
  close(con)
}
This function inserts a root element into a "broken" xml file.
The fixed xml file can then be read using read_xml() and edited as desired.
The only difference from Allan's answer is that read_html does not care about upper-case letters and reads the whole file as all-lower-case.
My solution is not as versatile, but it keeps upper-case letters.
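To sketch the full round trip, building on the second fix_xml above, which repairs the file in place (node names and file path taken from the question; hedged, not tested against the real program's input): repair the file, edit the value with xml2, then write back only the children of the wrapper root so the file stays headerless and rootless.
fix_xml("myBrokenXML.xml")
doc <- xml2::read_xml("myBrokenXML.xml")
xml2::xml_set_text(xml2::xml_find_first(doc, "//name"), "adult5")
# Serialize only the children of the wrapper root, keeping the original structure
children <- xml2::xml_children(xml2::xml_root(doc))
writeLines(sapply(children, as.character), "myBrokenXML.xml")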

Import Json file into R with unicode error

I am having issues importing a JSON file into R. Below is the code:
library(rjson)
json_file <- "file_path"
json_data <-fromJSON(file=json_file)
This returns: Error in fromJSON(file = json_file) : unexpected character '<ef>'
I suspect this is due to the unicode / hashtag, as the JSON file contains hex codes, e.g. #000000.
If I remove the # and replace it with %23, the file loads correctly.
How do I properly load the JSON file? I do not want to manually replace the "#", as I have thousands of these files.
An idea would be to use regex to replace # with %23, however I am not too sure how to do that.
Does anyone have any suggestions?
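For what it's worth, '<ef>' looks like the first byte of a UTF-8 byte order mark (ef bb bf) rather than the "#" itself. A hedged sketch: read the text through a connection declared as "UTF-8-BOM", which drops the marker, and hand the string to fromJSON:
library(rjson)
con <- file(json_file, encoding = "UTF-8-BOM")  # strips a leading BOM if present
json_text <- paste(readLines(con), collapse = "\n")
close(con)
json_data <- fromJSON(json_str = json_text)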

Loading csv into R with `sep=,` as the first line

The program I am exporting my data from (PowerBI) saves the data as a .csv file, but the first line of the file is sep=, and then the second line of the file has the header (column names).
Sample fake .csv file:
sep=,
Initiative,Actual to Estimate (revised),Hours Logged,Revised Estimate,InitiativeType,Client
FakeInitiative1 ,35 %,320.08,911,Platform,FakeClient1
FakeInitiative2,40 %,161.50,400,Platform,FakeClient2
I'm using this command to read the file:
initData <- read.csv("initData.csv",
                     row.names = NULL,
                     header = TRUE,
                     stringsAsFactors = FALSE)
but I keep getting an error that there are the wrong number of columns (because it thinks the first line tells it the number of columns).
If I do header=F instead then it loads, but when I then do names(initData) <- initData[2,] the names have spaces and illegal characters, and it breaks the rest of my program. Obnoxious.
Does anyone know how to tell R to ignore that first line? I can go into the .csv file in a text editor and just delete the first line manually before I load it each time (if I do that, everything works fine) but I have to export a bunch of files and this is a bit stupid and tedious.
Any help would be much appreciated.
There are many ways to do that. Here's one:
all_content <- readLines("initData.csv")
skip_first_line <- all_content[-1]
initData <- read.csv(textConnection(skip_first_line),
                     row.names = NULL,
                     header = TRUE,
                     stringsAsFactors = FALSE)
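Another option, assuming the sep=, line is always the very first line of the file: let read.csv skip it directly, so the header is read from the second line as usual.
initData <- read.csv("initData.csv",
                     skip = 1,        # jump over the sep=, line
                     header = TRUE,
                     stringsAsFactors = FALSE)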
Your file could also be in a UTF-16 encoding; see hrbrmstr's answer on how to read a UTF-16 file.

Problems with reading a txt file (EOF within quoted string)

I am trying to use read.table() to import this TXT file into R (it contains information about meteorological stations provided by the WMO):
However, when I try to use
tmp <- read.table(file=...,sep=";",header=FALSE)
I get this
EOF within quoted string
warning, and only 3514 of the 6702 lines appear in 'tmp'. From a quick look at the text file, I couldn't find any seemingly problematic characters.
As suggested in other threads, I also tried quote="". The EOF warning disappeared, but still only 3514 lines are imported.
Any advice on how I can get read.table() to work for this particular txt file?
It looks like your data actually has 11548 rows. This works:
read.table(url('http://weather.noaa.gov/data/nsd_bbsss.txt'),
sep=';', quote=NULL, comment='', header=FALSE)
Edit: updated according to @MrFlick's comments below.
The problem is the line endings: R will not recognize "^M" (the carriage return). To load the file, you only need to specify the encoding, like this:
read.table("nsd_bbsss.txt",sep=";",header=F,encoding="latin1",quote="",comment='',colClasses=rep("character",14)) -> data
But line 8638 has more than 14 columns, which is different from the other lines and may lead to an error message.
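To locate such rows before importing, a quick check with base R's count.fields (a sketch using the same read settings):
# Count fields per line; anything that is not 14 breaks the column pattern
n <- count.fields("nsd_bbsss.txt", sep = ";", quote = "", comment.char = "")
which(n != 14)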
