Most efficient way to extract data from large XML files in R

I have a few large XML files (~10 GB each, and growing every week) which I need to convert from XML to a data frame in R for analysis. The structure of the XML is as follows (with multiple records and a few more field elements per record):
<recordGroup>
  <records>
    <record>
      <recordId>123442</recordId>
      <reportingCountry>PT</reportingCountry>
      <date>2020-02-20</date>
      <field>
        <fieldName>Gender</fieldName>
        <fieldValue>F</fieldValue>
      </field>
      <field>
        <fieldName>Age</fieldName>
        <fieldValue>57</fieldValue>
      </field>
      <field>
        <fieldName>ClinicalSymptoms</fieldName>
        <fieldValue>COUGH</fieldValue>
        <fieldValue>FEVER</fieldValue>
        <fieldValue>O</fieldValue>
        <fieldValue>RUNOS</fieldValue>
        <fieldValue>SBREATH</fieldValue>
      </field>
    </record>
  </records>
</recordGroup>
I have been trying to find the most efficient way of extracting the data and converting it to a data.frame. One major challenge is that the files are quite large: both XML and xml2 run into memory problems, apart from the fact that processing takes hours. My current strategy is to use xmlEventParse with the code below, but this seems to be even more inefficient.
value_df <- data.frame(recordId = character(), vardf = character(), value = character())
nvar <- 0

xmlEventParse(xmlDoc_clean,
  list(
    startElement = function(name, attrs) {
      tagName <<- name
    },
    text = function(x) {
      if (nchar(x) > 0) {
        if (tagName == "recordId") {
          rec <<- x
        } else if (tagName == "fieldName") {
          var_f <<- x
        } else if (tagName == "fieldValue") {
          v <- x
          nvar <<- nvar + 1
          value_df[nvar, 1:3] <<- c(rec, var_f, v)
        }
      }
    },
    endElement = function(name) {
      if (name == "record") {
        print(nvar)
      }
    }
  ))
I have tried xml2 (memory issues), XML (memory issues as well with the standard DOM parsing), and was also going to try XMLSchema but didn't manage to get it to work. Both XML and xml2 work if the files are split up.
I would appreciate any guidance on improving efficiency, as the files I am working with are becoming larger every week. I am using R on a Linux machine.

When memory is a challenge, consider the hard disk. Specifically, consider building a large CSV version of the extracted, parsed XML data with iterative append calls via write.table inside an xmlEventParse run:
# INITIALIZE CSV WITH HEADER ROW
csv <- file.path("C:", "Path", "To", "Large.csv")
fileConn <- file(csv); writeLines("id,tag,text", fileConn); close(fileConn)

i <- 0
doc <- file.path("C:", "Path", "To", "Large.xml")

output <- xmlEventParse(doc,
  list(startElement = function(name, attrs) {
    if (name == "recordId") { i <<- i + 1 }
    tagName <<- name
  }, text = function(x) {
    if (nchar(trimws(x)) > 0) {
      write.table(data.frame(id = i, tag = tagName, text = x),
                  file = csv, append = TRUE, sep = ",",
                  row.names = FALSE, col.names = FALSE)
    }
  }),
  useTagName = FALSE, addContext = FALSE)
Obviously, further data wrangling will be needed for proper row/column migration, but you can now read the large CSV with any of the many tools out there, or in chunks. One possible approach to that migration is sketched after the sample output below.

Output
id,tag,text
1,"recordId","123442"
1,"reportingCountry","PT"
1,"date","2020-02-20"
1,"fieldName","Gender"
1,"fieldValue","F"
1,"fieldName","Age"
1,"fieldValue","57"
1,"fieldName","ClinicalSymptoms"
1,"fieldValue","COUGH"
1,"fieldValue","FEVER"
1,"fieldValue","O"
1,"fieldValue","RUNOS"
1,"fieldValue","SBREATH"

In the end the fastest approach I found was the following:

1. Split the XML files into smaller chunks using xml2. I have >100 GB of RAM on the server I am working on, so I could parallelize this step using foreach with 6 workers, but mileage varies depending on how much RAM is available. The function splitting the files returns a data.frame with the locations of the split files.
2. Process the smaller XML files in a foreach loop. At this stage it is possible to use all cores, so I have gone with 12 workers. The processing uses xml2, as I found that to be the fastest way. Initially the extracted data are in long format, but I then convert them to wide format within the loop.
3. The loop binds the different dataframes into one large dataframe. The final step is using fwrite to save the CSV file. This seems to be the most efficient way.

With this approach I can process a 2.6 GB XML file in 6.5 minutes.
I will add code eventually, but it is quite specific, so I need to generalize it a bit; a rough outline of steps 2-3 is sketched below.
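This is not the author's actual code, just a sketch of what steps 2-3 might look like, assuming the chunk file paths are already collected in a character vector chunk_files and using xml2, foreach/doParallel and data.table:

library(xml2)
library(foreach)
library(doParallel)
library(data.table)

registerDoParallel(12)

chunks <- foreach(f = chunk_files, .packages = c("xml2", "data.table")) %dopar% {
  doc     <- read_xml(f)
  records <- xml_find_all(doc, ".//record")

  # long format: one row per record/field/value
  # (record-level elements such as reportingCountry and date can be pulled the same way)
  long <- rbindlist(lapply(records, function(rec) {
    id     <- xml_text(xml_find_first(rec, "./recordId"))
    fields <- xml_find_all(rec, "./field")
    rbindlist(lapply(fields, function(fld) {
      vals <- xml_text(xml_find_all(fld, "./fieldValue"))
      if (length(vals) == 0) return(NULL)
      data.table(recordId = id,
                 field    = xml_text(xml_find_first(fld, "./fieldName")),
                 value    = vals)
    }))
  }))

  # wide format: one row per record, multi-valued fields collapsed
  dcast(long, recordId ~ field, value.var = "value",
        fun.aggregate = function(x) paste(x, collapse = ";"))
}

stopImplicitCluster()

result <- rbindlist(chunks, fill = TRUE)
fwrite(result, "records_wide.csv")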

Related

speeding up nested for loop in R

I have a nested for loop that I am using to parse a complex and large JSON file. It takes forever! I am wondering if there is a way to speed this up that I am missing. I can link to a smaller JSON file if that would be useful. Not sure if something from the apply family would be of use?
library(dplyr)   # bind_rows, %>%
library(purrr)   # set_names

result <- NULL   # accumulates the merged rows

for (row_n1 in 1:length(json_data$in_network)) {
  in_network1 <- json_data$in_network[[row_n1]]
  in_network1[["negotiated_rates"]] <- NULL
  in_network_df <- as.data.frame(in_network1)
  for (row_n2 in 1:length(json_data$in_network[[row_n1]]$negotiated_rates)) {
    reporting <- NULL
    for (row_n3 in 1:length(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$provider_groups)) {
      npi_df <- as.data.frame(toString(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$provider_groups[[row_n3]]$npi)) %>%
        set_names(nm = "npi")
      tin_df <- as.data.frame(t(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$provider_groups[[row_n3]]$tin))
      df <- merge(npi_df, tin_df)
      df <- merge(in_network1, df)
      reporting <- rbind(reporting, df)
    }
    negotiated_prices_df <- NULL
    for (row_n3 in 1:length(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$negotiated_prices)) {
      df <- as.data.frame(t(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$negotiated_prices[[row_n3]]))
      negotiated_prices_df <- bind_rows(negotiated_prices_df, df)
    }
    r <- merge(reporting, negotiated_prices_df)
    result <- bind_rows(result, r)
  }
  if (row_n1 %% 100 == 0)
    print(paste(row_n1, Sys.time(), sep = "====="))
}

l <- json_data
l$in_network <- NULL
result <- merge(as.data.frame(l), result)
This isn't a big deal on small files (e.g., < 1,000 rows in the result df), but when I am dealing with 1M+ rows it just takes forever!
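The biggest cost in loops like this is usually growing result with rbind/bind_rows on every iteration, which copies the accumulated data frame each time. A generic, self-contained illustration of the faster pattern (toy data, not the actual JSON structure): build the pieces in a list and bind them once at the end.

library(dplyr)

# toy stand-in for the real per-iteration work
make_piece <- function(i) data.frame(id = i, value = rnorm(3))
n <- 1000

# slow: `result` is copied and regrown on every iteration
result <- NULL
for (i in seq_len(n)) result <- bind_rows(result, make_piece(i))

# faster: collect the pieces in a list and bind once
pieces <- lapply(seq_len(n), make_piece)
result <- bind_rows(pieces)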

While loop for creating multiple resources with capacity

I need to create 52 resources with capacity 2 in the Simmer simulation package. I am trying to do this by using a while loop that creates these resources for me, instead of creating each resource myself.
The idea is that I have a while loop as given below. In each loop, a resource should be created called Transport_vehicle1, Transport_vehicle2, ..., Transport_vehicle52, with capacity 2.
Now I do not know how to insert the number i into the name of the resource that I am trying to create.
i <- 1
while (i <= 52) {
  env %>%
    add_resource("Transport_vehicle"[i], capacity = 2)
  i <- i + 1
}
Could someone please help me out? Thanks!
You can use the paste function to concatenate the string and the number:
i <- 1
while (i <= 52) {
  env %>%
    add_resource(paste("Transport_vehicle", i), capacity = 2)
  i <- i + 1
}
If you do not want a space between the string and the number, add the sep="" argument:
paste("Transport_vehicle", i, sep="")
or use
paste0("Transport_vehicle", i)

Unable to update data in dataframe

I tried updating data in a dataframe, but it is not getting updated.
# Initialize data and dataframe here
user_data <- read.csv("train_5.csv")

baskets.df <- data.frame(Sequence = character(),
                         Challenge = character(),
                         countno = integer(),
                         stringsAsFactors = FALSE)

# Updating data in dataframe here
for (i in 1:length(user_data)) {
  for (j in i:length(user_data)) {
    if (user_data$challenge_sequence[i] == user_data$challenge_sequence[j] &&
        user_data$challenge[i] == user_data$challenge[j]) {
      writedata(user_data$challenge_sequence[i], user_data$challenge[i])
    }
  }
}

writedata <- function(seqnn, challng) {
  #print(seqnn)
  #print(challng)
  newRow <- data.frame(Sequence = seqnn, Challenge = challng, countno = 1)
  baskets.df <- rbind(baskets.df, newRow)
}

# View data here
View(baskets.df)
I've modified your code to what I believe will work. You haven't provided sample data, so I can't verify that it works the way you want. I'm basing my attempt here on a couple of common novice mistakes that I'll do my best to explain.
Your writedata function was written to be a little loose with its scoping. When you create a new function, what happens in the function technically happens in its own environment. That is, it tries to look for things defined within the function, and any new objects it creates are created only within that environment. R also has this neat (and sometimes tricky) feature where, if it can't find an object in an environment, it will look up to the parent environment.
The impact this has on your writedata function is that when R looks for baskets.df in the function and can't find it, R turns to the Global Environment, finds baskets.df there, and uses it in rbind. However, the result of rbind gets saved to a baskets.df in the function environment, and does not update the object of the same name in the global environment.
To address this, I added an argument to writedata that is simply named data. We can then use this argument to pass a data frame to the function's environment and do everything locally. By not making any assignment at the end, we implicitly tell the function to return its result.
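A tiny standalone illustration of that scoping behaviour (a toy version of baskets.df, just for this demo):

baskets.df <- data.frame(countno = integer())   # global object, 0 rows

add_row_locally <- function() {
  baskets.df <- rbind(baskets.df, data.frame(countno = 1))  # creates a local copy
  nrow(baskets.df)   # 1 inside the function...
}
add_row_locally()
nrow(baskets.df)     # ...but still 0 in the global environment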
Then, in your loop, instead of simply calling writedata, we assign its result back to baskets.df to replace the previous result.
for (i in 1:length(user_data)) {
  for (j in i:length(user_data)) {
    if (user_data$challenge_sequence[i] == user_data$challenge_sequence[j] &&
        user_data$challenge[i] == user_data$challenge[j]) {
      baskets.df <- writedata(baskets.df,
                              user_data$challenge_sequence[i],
                              user_data$challenge[i])
    }
  }
}

writedata <- function(data, seqnn, challng) {
  #print(seqnn)
  #print(challng)
  newRow <- data.frame(Sequence = seqnn,
                       Challenge = challng,
                       countno = 1)
  rbind(data, newRow)
}
I'm not sure what your programming background is, but your loops will be very slow in R because it is an interpreted language. To get around this, many functions are vectorized (which simply means that you give them more than one data point, and they do the looping inside compiled code, where loops are fast).
With that in mind, here's what I believe will be a much faster implementation of your code:
user_data <- read.csv("train_5.csv")

# challenge_indices will be a matrix with TRUE wherever both "challenge"
# and "challenge_sequence" match
challenge_indices <- outer(user_data$challenge_sequence, user_data$challenge_sequence, "==") &
  outer(user_data$challenge, user_data$challenge, "==")

# since you don't want duplicates, get rid of them
challenge_indices[upper.tri(challenge_indices, diag = TRUE)] <- FALSE

# now let's get the indices of interest
index_list <- which(challenge_indices, arr.ind = TRUE)

# now we make the resulting data set all at once.
# this is much faster, because it does not require copying the data frame
# many times - which would be required if you created a new row every time.
baskets.df <- with(user_data, data.frame(
  Sequence = challenge_sequence[index_list[, "row"]],
  challenge = challenge[index_list[, "row"]]
))

dump() in R not source()able - output contains "..."

I'm trying to use dump() to save the settings of my analysis so I can examine them in a text editor or reload them at a later date.
In my code I'm using the command
dump(ls(), settingsOutput, append=TRUE)
The file defined by settingsOutput gets created, but the larger objects and locally defined functions are truncated. Here's an excerpt from such a file. Note these files are generally on the order of a few kB.
createFilePrefix <-
function (runDesc, runID, restartNumber)
{
...
createRunDesc <-
function (genomeName, nGenes, nMix, mixDef, phiFlag)
{
...
datasetID <-
"02"
descriptionPartsList <-
c("genomeNameTest", "nGenesTest", "numMixTest", "mixDefTest",
"phiFlagTest", "runDescTest", "runIDTest", "restartNumberTest"
...
diffTime <-
structure(0.531, units = "hours", class = "difftime")
dissectObjectFileName <-
function (objectFileName)
{
...
divergence <-
0
Just for reference, here's one of the functions defined above
createFilePrefix <- function(runDesc, runID, restartNumber){
paste(runDesc, "_run-", runID, "_restartNumber-", restartNumber, sep="")
}
Right now I'm going back and removing the problematic lines and then loading the files, but I'd prefer to actually have code that works as intended.
Can anyone explain to me why I'm getting this behavior and what to do to fix it?

Use of variable in Unix command line

I'm trying to make life a little bit easier for myself but it is not working yet. What I'm trying to do is the following:
NOTE: I'm running R on a Unix server, since the rest of my script is in R. That's why there is system(" ").
system("TRAIT=some_trait")
system("grep var.resid.anim rep_model_$TRAIT.out > res_var_anim_$TRAIT'.xout'",wait=T)
When I run the exact same thing in PuTTY (without system(" "), of course), the right file is read and the right output is created. The script also works when I just remove the variable that I created. However, I need to do this many times, so a variable is very convenient for me, but I can't get it to work.
This code prints nothing on the console.
system("xxx=foo")
system("echo $xxx")
But the following does.
system("xxx=foo; echo $xxx")
Each call to system() runs in its own shell, so your variable definition is forgotten as soon as that call finishes.
In your case, how about trying:
system("TRAIT=some_trait; grep var.resid.anim rep_model_$TRAIT.out > res_var_anim_$TRAIT'.xout'",wait=T)
You can keep this all in R:
grep_trait <- function(search_for, in_trait, out_trait = in_trait) {
  l <- readLines(sprintf("rep_model_%s.out", in_trait))
  l <- grep(search_for, l, value = TRUE)
  writeLines(l, sprintf("res_var_anim_%s.xout", out_trait))
}
grep_trait("var.resid.anim", "haptoglobin")
If there's a concern that the files are read into memory first (i.e. if they are huge files), then:
grep_trait <- function(search_for, in_trait, out_trait = in_trait) {
  fin  <- file(sprintf("rep_model_%s.out", in_trait), "r")
  fout <- file(sprintf("res_var_anim_%s.xout", out_trait), "w")
  repeat {
    l <- readLines(fin, 1)
    if (length(l) == 0) break
    if (grepl(search_for, l)[1]) writeLines(l, fout)
  }
  close(fin)
  close(fout)
}
