I got a "broken" .xml file with missing header and root element
myBrokenXML.xml
<attribute1>false</attribute1>
<subjects>
<subject>
<population>Adult</population>
<name>adult1</name>
</subject>
</subjects>
This .xml file is the input for a program that I have to use, and its structure cannot be changed.
I would like to change the value of "name" to adult5.
I tried using the xml2 package, but read_xml() requires a well-formed xml file and returns the error message "Extra content at the end of the document".
I tried reading the file line by line using readLines() and then writing a new line with writeLines(), but this resulted in another error message: "cannot write to this connection".
Any suggestions are greatly appreciated. I am new to R and XML and have been at this for hours (and cursed the developers a few times in the process).
Thanks in advance!
Code using xml2:
XMLFile <- read_xml("myBrokenXML.xml")
Code using readLines/writeLines (this would still require deleting the original line):
conn <- file("myBrokenXML.xml", open = "r")
lines <- readLines(conn)
for (i in 1:length(lines)){
  print(lines[i])
  if (lines[i] == "\t\t<name>adult1</name>"){
    writeLines("\t\t<name>adult5</name>", conn)
  }
}
GOAL
I need to change the value of "name" from adult1 to adult5, and at the end the file must have the same structure (no header, no root element).
The easiest way to do this is to use read_html instead of read_xml, since read_html will attempt to parse even broken documents, whereas read_xml requires strict formatting. We can use this fact to create a repaired xml document by creating a new xml_document and writing the nodes obtained from read_html into it. The following function repairs fragments of xml into a proper xml document:
fix_xml <- function(xml_path, root_name = "root")
{
  my_xml <- xml2::xml_new_root("root")                        # new document with a placeholder root
  root <- xml2::xml_find_all(my_xml, "//root")
  my_html <- xml2::read_html(xml_path)                        # lenient parse of the broken file
  fragment <- xml2::xml_find_first(my_html, xpath = "//body") # read_html wraps the fragment in <html><body>
  new_root <- xml2::xml_set_name(fragment, root_name)         # rename <body> to the desired root name
  new_root <- xml2::xml_replace(root, fragment)               # swap the placeholder root for the repaired fragment
}
So we can do:
fix_xml("myBrokenXML.xml")
#> {xml_document}
#> <root>
#> [1] <attribute1>false</attribute1>
#> [2] <subjects>\n <subject>\n <population>Adult</population>\n <name>adult1...
The answer from Allan Cameron (#1) works fine, as long as your file does not include case-sensitive element names.
If someone ever runs into the same problem, here is what worked for me.
fix_xml <- function(xmlPath){
  con <- file(xmlPath)
  lines <- readLines(con)
  firstLine <- c("<root>")
  lastLine <- c("</root>")
  lines <- append(lines, firstLine, after = 0)
  lines <- append(lines, lastLine, after = length(lines))
  write(lines, xmlPath)
  close(con)
}
This function inserts a root element into a "broken" xml file.
The fixed xml file can then be read using read_xml() and edited as desired.
The only difference from Allan's answer is that read_html does not preserve upper-case letters and reads the whole file as lower-case.
My solution is not as versatile, but it keeps upper-case letters; a sketch of the editing step follows below.
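For completeness, here is a rough sketch of that editing step using xml2 (untested against the exact file; the XPath and the whitespace handling are assumptions and may need tweaking). It repairs the file, changes the name, and then writes only the children of the root back out, so the saved file again has no header and no root element:
library(xml2)
fix_xml("myBrokenXML.xml")                 # wrap the fragment in <root>
doc <- read_xml("myBrokenXML.xml")         # now parses cleanly
xml_set_text(xml_find_first(doc, "//name"), "adult5")
# serialize only the children of <root>, dropping the added root element again
children <- sapply(xml_children(xml_root(doc)), as.character)
writeLines(trimws(children, which = "right"), "myBrokenXML.xml")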
I have an issue where I'm reading in big (500+ MB) CSV files and then want to verify that all data has been read in correctly. To do so, I have been comparing the length() of readLines() with the nrow() of read.csv2().
The following is my R code:
df <- readFileFromServer(HOST, KEY,
                         paste0(SERVER_PATH, SERVER_FOLDER),
                         FILENAME,
                         FUN = read.csv2,
                         sep = ";",
                         quote = "", encoding = "UTF-8", skipNul = TRUE)
df_check <- readFileFromServer(HOST, KEY,
                               paste0(SERVER_PATH, SERVER_FOLDER),
                               FILENAME,
                               FUN = readLines, skipNul = TRUE)
Then I verify that all data was loaded, by checking:
if(nrow(df) != (length(df_check) - dif)){
  stop("some error msg")
}
dif is set to 1 to account for the header in the CSV files.
This check is the part that fails for a given CSV file.
It has been working as intended up until this point, but now the check is causing issues, and I cannot fully understand why.
The one CSV file that fails the check has "NULL" in the data, which I believe readLines interprets as a delimiter, producing an extra line and making the check fail, but I'm really not sure.
I tried passing different parameters to my read functions, but the issue persists.
I expect readLines() and read.csv2() to yield matching values for length() - 1 and nrow() respectively, as shown in my code snippet.
This is not a proper answer, but it was too long for a comment. This would be my debug strategy here.
Pick a file that fails. Slurp it with readLines.
Save the file locally using writeLines.
Your first job is to make sure that the check also fails when the file is loaded from disk. My first thought would be that the two file transfers, the first and second time you ran readFileFromServer, were not precisely identical.
Now, if the problem persists for the given file when you read it locally with read.csv (a different number of rows than the number of lines in the readLines output), your job becomes much easier (and probably faster) to solve.
First, take a look at the beginning of the CSV file and at its end. Are they as they should be? Do they match the data in the head and tail of your data frame? If yes, then you need to find the missing lines systematically.
Since a CSV file is just comma-separated text, you can compare each line read from the CSV file with readLines against the line as it should be based on the table you have read using read.csv. How this should be done depends on what your original csv file looks like (whether you need to insert quotes etc.). Basically, you need to figure out a way of restoring the lines of the CSV file from the data in your data frame, and then look for the first line that is different.
Here is some code to give you an idea what I mean:
## first, prepare data – for this example only!
f <- file("test.csv", "w")
writeLines(c("a,b,c", "1,what ever,42", "12,89,one"), f)
close(f)
## actual test
## first, read the file with readlines
f <- file("test.csv", "r")
rl <- readLines(f)
close(f)
## then, read it with read.csv
csv <- read.csv("test.csv")
## third, prepare the lines as they should look based on the CSV
rl_sim <- do.call(paste, c(csv, sep=","))
## find the first mismatch
for(i in 1:length(rl_sim)) {
  if(rl_sim[i] != rl[i + 1]) {
    message("Problems start at line ", i, "\n", rl_sim[i], "\n", rl[i + 1])
    break
  }
}
Given a .tar.gz file on my hard disk, I would like to recreate that exact file, but with R code alone (e.g. with the help of serialization). The goal is not to refer to the file itself, but to generate a plain-text variable containing the content of the file and then write that content back to the file system. I thought about the following:
Take the base64 string of the file (base64 serialization).
Write it to the file system as a binary file.
But the following code generates an empty file:
zzfil <- tempfile("testfile")
zz <- file(zzfil, "wb")
file_content <- "H4sIAAAAAAAAA+1YbW/bNhD2Z/6KW/zBNpLIerHjQmvapo6HBWgyw3ZXDE1X0BJtEZFIgaTguEb/+06S7drJumJA5m6DHsAQxOM9PPF4uscyTJuUBnd0zmzbaV8Oxv3R1XBy9ctN7clgI846nfzq9Lr27rVAr2fXHM+zvV6303N6NdvxHDSDXTsAMm2owlDE/K/nfcv+H8WwzL0PZu8gkMkyxcG1lUy4ifH2XUQNmIhtxuFSMg3Nwgp9qlmL/MqU5lL4YFuOZZOLzERS5Z4SFkoaBtyQa8qFwR9DwwTZ1stCsh2H50uZKc3i2SstE7aImGKWYOYFuWQ6UDw1xRrXUjGgU5kZWOShcQNhEVFCl1Pky80mogKkYBAjcYsA4q1mMEN+0LgyTkd6AVyETBgu5hiOonNF0wgt3ERcFI+8s7BF3vCACb3Zkbi8A67zCDIkUi/JQAQyRDof3k5+On2GgadMhNqHETRfnINndSyvRa6SVCqDo/N5GkvjFjbXci2ndQKGT6e4sfmQg9PdFnlDPy0vqaGYLpUxcsNYqPsySXlMyx0RkqxzE/rg2s6zU7t76jngOL7T870uBtP/EcScbPK/n/b2zcX1YDy86A+e8ox9q/6x8Mv6d92eZztY/+6Z163q/xBg9/kBHFJjmBLNo9/fv/dpnEbU//Dh+KhFahX+33hQ/6P2P7BG0eO73a/Xv20/qH/nDGUAdKv6P3z+IxbH0hod8P2P2T5b59/zOo6d6z+706ve/4dAHX7OE34CC6ni8AdSJ3XUZLmS0YDCid3TJEUNMstEkCsMEDRhITSKU9LAuYuIBxGkCpWbhsYeV8Mq2H6TGQRIFTOqRKnJSsm2kX200Ii58srlFozGJgu5BGr8wh8gMib12211mt7NtRXR0AqkJT61C/MY9SFkms2yGO7YciqpCkEjoQkyDGkm1eOFNsSvMx6H+JghjFgsaQhbNQyNvlExHMM44jOD19eNwqMfseBuZ9oOHnoMSo8JFtifOzzymDQIqTfgVdmTSbHH8Px0u/nNFqxQwBab3Tza22ts1Z8fO3/Mq3ufISeI+VRRtWyuNWfrC+eV3gpRLrAw4piFL4+KCTlNaWuGqEDPExNQpU+AMt28P0/SeaucdgxzJpOPeISMRBWdYNBQh5DNaBYbmCItRlr13X/p+z+h4ukVwN/v/67Tc6r+/53yv1YA4cH+/7mP8u91u17V/w+B27yhr4qUfya3NOZUb+9M/l1nte4z74o+g6OZxrOyKhtMM287t+GXTyMrMvyKFMB5azGhd52rN3CFChUqfB/8AQr6tbUAGgAA"
writeBin(RCurl::base64Decode(file_content), zz)
close(zz)
file.rename(from = zzfil, to = paste0(zzfil,".tar.gz"))
How should I serialize the file instead? That is, how should I fill in the functions file_to_string and string_to_file?
file_to_string <- function(input_file){
  # Return a serialized string of input_file
}
string_to_file <- function(input_string){
  # Return content to write to a file
}
original_file <- "original.tar.gz"
zzfil <- tempfile("copy")
zz <- file(zzfil, "wb")
file_content <- file_to_string(original_file)
writeBin(string_to_file(file_content), zz)
close(zz)
file.rename(from = zzfil, to = paste0(zzfil,".tar.gz"))
For me, using R 3.4.4 on platform x86_64-pc-linux-gnu with RCurl version 1.95-4.10, the example code produces a non-empty file that can be read back in using readBin, so I can't reproduce your empty-file issue.
But that's not the main issue here.
Using writeBin does not achieve what you want to do: its use case is to store an R object (a vector) in a binary format on the filesystem and read it back in with readBin, not to read in a binary file, manipulate it and save the new version, or to generate a binary file that is meant to be understood by anything other than readBin.
In my humble opinion: R is probably not the right tool to do binary patches.
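That said, if the goal is simply to round-trip the raw bytes of an existing file through a text variable, here is a minimal sketch of how the two placeholder helpers from the question could be filled in. It assumes the base64enc package and has not been tested against the original archive:
library(base64enc)
file_to_string <- function(input_file){
  # read the file as raw bytes and encode them as one base64 string
  base64encode(readBin(input_file, what = "raw", n = file.size(input_file)))
}
string_to_file <- function(input_string){
  # decode the base64 string back into a raw vector for writeBin()
  base64decode(input_string)
}
With these definitions, the driver code from the question should write a byte-identical copy of original.tar.gz.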
I am trying to write R code where I input a URL and output (save on hard drive) a .txt file. I created a large list of URLs using the "edgarWebR" package. An example would be "https://www.sec.gov/Archives/edgar/data/1131013/000119312518074650/d442610dncsr.htm". Basically:
open the link
Copy everything (CTRL+A, CTRL+C)
open an empty text file and paste the content (CTRL+V)
save .txt file under specified name
(in a looped fashion, of course). I am inclined to "hard code it" (as in open the website in a browser using browseURL(...) and "send keys" commands), but I am afraid that it will not run very smoothly. However, other commands (such as readLines()) seem to copy the HTML structure (therefore returning not only the text).
In the end I am interested in a short paragraph of each of those shareholder letters (containing only text; tables/graphs are no concern in my particular setup).
Anyone aware of an R function that would help?
Thanks in advance!
Let me know in case the code below works for you. xpathSApply can be applied to other html components as well; in your case only paragraphs are required.
library(RCurl)
library(XML)
# Create character vector of urls
urls <- c("url1", "url2", "url3")
for (url in urls) {
  # download html
  html <- getURL(url, followlocation = TRUE)
  # parse html
  doc <- htmlParse(html, asText = TRUE)
  plain.text <- xpathSApply(doc, "//p", xmlValue)
  # write the extracted text to a .txt file
  # (depends on whether you need separate files for each url or one file;
  #  basename() is used because "/" and ":" are not allowed in file names)
  fileConn <- file(paste(basename(url), "txt", sep = "."))
  writeLines(paste(plain.text, collapse = "\n"), fileConn)
  close(fileConn)
}
Thanks everyone for your input. It turns out that any html conversion took too much time given the amount of websites I need to parse. The (working) solution probably violates some best-practice guidelines, but it does the job.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import clipboard  # gives access to the system clipboard

# path points to the local folder holding the geckodriver executable
driver = webdriver.Firefox(executable_path=path + '/codes_ml/geckodriver/geckodriver.exe')  # initialize driver
# it is fine to open the driver just once
# loop over the urls and pull the text of each page via the clipboard
driver.get(report_url)
element = driver.find_element_by_css_selector("body")
element.send_keys(Keys.CONTROL + 'a')  # select everything
element.send_keys(Keys.CONTROL + 'c')  # copy it to the clipboard
text = clipboard.paste()               # read the clipboard into a string
I have a few thousand xml files that I would like to read into R. The problem is that some of these files have three special characters "ï»¿" at the beginning of the file that stop xmlTreeParse from reading the xml file. The error that I get is the following...
Error: 1: Start tag expected, '<' not found
This is due to the first line in the xml file that is the following...
ï»¿<?xml version="1.0" encoding="utf-8"?>
If I manually remove the characters using notepad, I have this in the beginning of the xml file and I am able to read the xml file...
<?xml version="1.0" encoding="utf-8"?>
I'd like to be able to remove the characters automatically. The following is the code that I have written currently.
filenames <- list.files("...filepath...", pattern="*.xml", full.names=TRUE)
files <- lapply(filenames, function(f) {
  xmlfile <- tryCatch(xmlTreeParse(file = f), error=function(e) print(f))
  xmltop <- xmlRoot(xmlfile)
  plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
  name <- unname(plantcat$EntityNames)
  return(name)
})
I'm wondering how I can read the xml file in while removing the special characters in R. I have tried tryCatch, as you can see above, but I'm not sure how I can edit the xml file without actually reading it in first. Any help would be appreciated!
Edit: Using the following parsing code fixed the problem. I think when I opened the xml file in Notepad it was displayed as "ï»¿", but in reality it was a different character sequence (the UTF-8 byte order mark). It's possible that this was due to the encoding of the file, but I'm not sure of the specifics. Thank you @Prem.
xmlfile <- xmlTreeParse(gsub("ï»¿", "", readLines(f)), asText = TRUE)
The special chars at the beginning might come from a different encoding for the file, especially if your xml contains special characters.
Try to specify the encoding. To identify which encoding is used, open the file in a hex editor and read the first bytes.
My hunch is that your special chars come from a BOM:
http://unicode.org/faq/utf_bom.html
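If that is the case, one alternative (a minimal sketch, assuming the files really are UTF-8 with a BOM) is to declare the encoding on the connection so that R strips the BOM before the lines ever reach the parser:
library(XML)
# "UTF-8-BOM" tells R to drop a leading byte order mark, if present
con <- file(f, encoding = "UTF-8-BOM")
xmlfile <- xmlTreeParse(readLines(con), asText = TRUE)
close(con)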
In your code, use readLines to read the file, and then gsub can be used to remove the junk characters from the string.
xmlfile <- xmlTreeParse(gsub("ï»¿", "", readLines(f)), asText = TRUE)
Have you tried the gsub function? It is a very convenient function for character replacement (and deletion). This works for me:
gsub('ï»¿', '', string, fixed = TRUE)
on a variable string = 'ï»¿<?xml version="1.0" encoding="utf-8"?>'.
EDIT: I would also suggest using the sed command if you're on a GNU/Linux machine. It's a very powerful tool that deals perfectly with this kind of task.
I have a file that I open using wdGet(filename="exOut.doc",visible=FALSE). This file already has images in it that I've inserted using html and cat(img, file=outputDoc, sep="\n", append=TRUE).
I need to insert a table at the end of the document, but wdTable(format(head(testTable))) places the table at the very top of the word document. How can I fix this?
Also, a second problem: I have a lot of tables I need to insert into my document and hence make use of a loop. Below is sample code that demonstrates my problem. Here's the really weird part for me: when I step through the code and run each line one after another, it produces no error and I get an output document. If I run everything at once, I get a 'cannot open the connection' error. I don't understand how this can be. How is it possible that running each line one at a time produces a different result from running all of that exact same code at once?
rm(list=ls())
library(R2wd)
library(png)
outputForNow<-"C:\\Users\\dirkh_000\\Downloads\\"
outputDoc<-paste(outputForNow,"exOut.doc",sep="")
setwd(outputForNow)
# Some example plots
for(i in 1:3)
{
  dir.create(file.path(paste("folder",i,sep="")))
  setwd(paste("folder",i,sep="")) # Note that images are all in different folders
  png(paste0("ex", i, ".png"))
  plot(1:5)
  title(paste("plot", i))
  dev.off()
  setwd(outputForNow)
}
setwd(outputForNow)
# Start empty word doc
cat("<body>", file="exOut.doc", sep="\n")
# Retrieve a list of all folders
folders<-dir()[file.info(dir())$isdir]
folders<-folders[!is.na(folders)]
# Cycle through all folders in working directory
for(folder in folders){
  setwd(paste(outputForNow,folder,sep=""))
  # select all png files in working directory
  for(i in list.files(pattern="*.png"))
  {
    temp<-paste0('<img src=','\"',gsub("[\\]","/",folder),"/", i, '\">')
    cat(temp, file=outputDoc, sep="\n", append=TRUE)
    setwd(paste(outputForNow,folder,sep=""))
  }
  setwd(outputForNow)
  cat("</body>", file="exOut.doc", sep="\n", append=TRUE)
  testTable<-as.data.frame(cbind(1,2,3))
  wdGet(filename="exOut.doc",visible=FALSE)
  wdTable(format(head(testTable))) ## This produces a table at the top and not the bottom of the document
  wdSave(outputDoc)
  wdQuit() # NOTE that this means that the document is closed and opened over and over again in the loop otherwise cat() will throw an error
}
The above code produces:
Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
Can anyone tell me why this occurs and how to fix it? Please and thank you. Please do recommend a completely different approach if you know I'm going about this the wrong way, but please also explain what it is that I'm doing wrong.
To load the DescTools package and start a Word document, use something like this (obviously, modified for your path structure):
library(DescTools)
library(RDCOMClient)
report <- GetNewWrd(template = "C:/Users/Rees/Documents/R/win-library/3.0/R2DOCX/templates/TEMPLATE_03.docx")
ADDED BASED ON COMMENT
Create a template for your report in Word; perhaps you call it TEMPLATE.docx. Save it in your Documents directory (or whatever directory you keep Word documents in). Then
report <- GetNewWrd(template = "C:/Users/dirkh_000/Documents/TEMPLATE.docx")
Thereafter, each time you create a plot, add this line:
WrdPlot(wrd = report)
The plot is inserted into the Word document created from TEMPLATE.docx in the specified directory.
The same goes for WrdTable(wrd = report).
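Putting the pieces together, a minimal sketch might look like this (the template path is the placeholder from this thread and the plots are dummies; adjust both to your setup):
library(DescTools)
library(RDCOMClient)
report <- GetNewWrd(template = "C:/Users/dirkh_000/Documents/TEMPLATE.docx")
for (i in 1:3) {
  plot(1:5, main = paste("plot", i))
  WrdPlot(wrd = report)   # send the current plot to the Word document
}
Tables can be added the same way with WrdTable(wrd = report), as noted above.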