Extract all values automatically from multiple xml files into data frame - r

I'm new to R and trying to parse over 100k XML files into a single CSV file. I used an approach from a previous question and it works perfectly when I state each column name explicitly. My XML files have too many fields to write out by hand, so I want to add all the columns to the data frame without explicitly listing the column headings. I'm using the same code as below, except that in my case there are many more lines listing column names rather than just zip code and amount.
require(XML)
require(plyr)
setwd("LOCATION_OF_XML_FILES")
xmlfiles <- list.files(pattern = "*.xml")
dat <- ldply(seq(xmlfiles), function(i){
  doc <- xmlTreeParse(xmlfiles[i], useInternal = TRUE)
  zipcode <- xmlValue(doc[["//ZipCode"]])
  amount <- xmlValue(doc[["//AwardAmount"]])
  return(data.frame(zip = zipcode, amount = amount))
})
write.csv(dat, "zipamount.csv", row.names=FALSE)

Hopefully the xmlToDataFrame() function will do what you want.
It assumes the XML document has a root node whose child nodes are a sequence of records, and that each record contains simple elements. It then extracts them into a data.frame.
Consider a sample XML document
<doc>
<record><a>1</a><b>2</b></record>
<record><a>10</a><b>20</b><c>bob</c></record>
<record><a>20</a><b>30</b></record>
</doc>
xmlToDataFrame() returns
a b c
1 1 2 <NA>
2 10 20 bob
3 20 30 <NA>
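Applied to the original question, a minimal sketch might look like the following. It assumes each file contains record nodes that can be selected with a single XPath (//Award here is a hypothetical placeholder; substitute whatever element wraps the fields in your schema), and it relies on plyr combining the pieces with rbind.fill, so any field missing from a file simply becomes NA.
library(XML)
library(plyr)
setwd("LOCATION_OF_XML_FILES")
xmlfiles <- list.files(pattern = "\\.xml$")
dat <- ldply(xmlfiles, function(f) {
  doc <- xmlParse(f)
  # Every simple child element of the selected nodes becomes a column,
  # so no column names need to be written out by hand.
  xmlToDataFrame(doc, nodes = getNodeSet(doc, "//Award"),
                 stringsAsFactors = FALSE)
})
write.csv(dat, "allvalues.csv", row.names = FALSE)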

Related

Parsing large XML to dataframe in R

I have large XML files that I want to turn into data frames for further processing within R and other programs. This is all being done on macOS.
Each monthly XML file is around 1 GB, with 150k records and 191 different variables. In the end I might not need all 191 variables, but I'd like to keep them and decide later.
The XML files can be accessed here (scroll to the bottom for the monthly zips; once uncompressed, look at the "dming" XMLs).
I've made some progress, but processing the larger files takes too long (see below).
The XML looks like this:
<ROOT>
<ROWSET_DUASDIA>
<ROW_DUASDIA NUM="1">
<variable1>value</variable1>
...
<variable191>value</variable191>
</ROW_DUASDIA>
...
<ROW_DUASDIA NUM="150236">
<variable1>value</variable1>
...
<variable191>value</variable191>
</ROW_DUASDIA>
</ROWSET_DUASDIA>
</ROOT>
I hope that's clear enough. This is my first time working with XML.
I've looked at many answers here and in fact managed to get the data into a data frame using a smaller sample (a daily XML instead of the monthly ones) and xml2. Here's what I did:
library(xml2)
raw <- read_xml(filename)
# Find all records
dua <- xml_find_all(raw,"//ROW_DUASDIA")
# Create empty dataframe
dualen <- length(dua)
varlen <- length(xml_children(dua[[1]]))
df <- data.frame(matrix(NA,nrow=dualen,ncol=varlen))
# For loop to enter the data for each record in each row
for (j in 1:dualen) {
  df[j, ] <- xml_text(xml_children(dua[[j]]), trim = TRUE)
}
# Name columns
colnames(df) <- c(names(as_list(dua[[1]])))
I imagine that's fairly rudimentary but I'm also pretty new to R.
Anyway, this works fine with daily data (4-5k records), but it's probably too inefficient for 150k records; in fact, I waited a couple of hours and it hadn't finished. Granted, I would only need to run this code once a month, but I would like to improve it nonetheless.
I tried to turn the elements for all records into a list using the as_list function within xml2 so I could continue with plyr, but this also took too long.
Thanks in advance.
While there is no guarantee of better performance on larger XML files, the ("old school") XML package maintains a compact data frame handler, xmlToDataFrame, for flat XML files like yours. Any node missing from a record but present in other siblings results in NA for the corresponding field.
library(XML)
doc <- xmlParse("/path/to/file.xml")
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA"))
You can even conceivably download the daily zips, unzip the needed XML, and parse it into a data frame, should the large monthly XMLs pose memory challenges. As an example, the code below extracts the December 2018 daily data into a list of data frames to be row-bound at the end. The process even adds a DDate field. The method is wrapped in tryCatch to handle missing days in the sequence or other URL or zip issues.
dec_urls <- paste0(1201:1231)
temp_zip <- "/path/to/temp.zip"
xml_folder <- "/path/to/xml/folder"
xml_process <- function(dt) {
  tryCatch({
    # DOWNLOAD ZIP FROM URL
    url <- paste0("ftp://ftp.aduanas.gub.uy/DUA%20Diarios%20XML/2018/dd2018", dt, ".zip")
    file <- paste0(xml_folder, "/dding2018", dt, ".xml")
    download.file(url, temp_zip)
    unzip(temp_zip, files=paste0("dding2018", dt, ".xml"), exdir=xml_folder)
    unlink(temp_zip)      # DESTROY TEMP ZIP
    # PARSE XML TO DATA FRAME
    doc <- xmlParse(file)
    df <- transform(xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA")),
                    DDate = as.Date(paste0("2018", dt), format="%Y%m%d", origin="1970-01-01"))
    unlink(file)          # DESTROY TEMP XML
    # RETURN XML DF
    return(df)
  }, error = function(e) NA)
}
# BUILD LIST OF DATA FRAMES
dec_df_list <- lapply(dec_urls, xml_process)
# FILTER OUT "NAs" CAUGHT IN tryCatch
dec_df_list <- Filter(NROW, dec_df_list)
# ROW BIND TO FINAL SINGLE DATA FRAME
dec_final_df <- do.call(rbind, dec_df_list)
Here is a solution that processes the entire document at once, as opposed to reading each of the 150,000 records in a loop. This should provide a significant performance boost.
This version can also handle cases where the number of variables per record is different.
library(xml2)
doc<-read_xml('<ROOT>
<ROWSET_DUASDIA>
<ROW_DUASDIA NUM="1">
<variable1>value1</variable1>
<variable191>value2</variable191>
</ROW_DUASDIA>
<ROW_DUASDIA NUM="150236">
<variable1>value3</variable1>
<variable2>value_new</variable2>
<variable191>value4</variable191>
</ROW_DUASDIA>
</ROWSET_DUASDIA>
</ROOT>')
#find all of the nodes/records
nodes<-xml_find_all(doc, ".//ROW_DUASDIA")
#find the record NUM and the number of variables under each record
nodenum<-xml_attr(nodes, "NUM")
nodeslength<-xml_length(nodes)
#find the variable names and values
nodenames<-xml_name(xml_children(nodes))
nodevalues<-trimws(xml_text(xml_children(nodes)))
#create dataframe
df<-data.frame(NUM=rep(nodenum, times=nodeslength),
variable=nodenames, values=nodevalues, stringsAsFactors = FALSE)
#dataframe is in long format.
#Use dcast from data.table/reshape2, or spread from tidyr, to convert to wide format
# NUM variable values
# 1 1 variable1 value1
# 2 1 variable191 value2
# 3 150236 variable1 value3
# 4 150236 variable2 value_new
# 5 150236 variable191 value4
#Convert to wide format
library(tidyr)
spread(df, variable, values)
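Note that spread() has since been superseded in tidyr; if you are on tidyr 1.0 or later, a sketch of the equivalent call would be:
library(tidyr)
pivot_wider(df, names_from = variable, values_from = values)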

How to combine files and match them with their identifier from a separate file?

I have 500 txt files all under the same folder. Each text file represents a patient and has a list of genes (miRNA genes in this example) and their corresponding expression values. I am only interested in the reads_per_million_miRNA_mapped value for each corresponding miRNA_ID. Below is an example from three of the files:
File name: 0a4af8c8.mirnas.quantification.txt
miRNA_ID read_count reads_per_million_miRNA_mapped cross.mapped
1 hsa-let-7a-1 39039 5576.681 N
2 hsa-let-7a-2 38985 5568.967 Y
3 hsa-let-7a-3 38773 5538.684 N
File name: 0a867fd6.mirnas.quantification.txt
miRNA_ID read_count reads_per_million_miRNA_mapped cross.mapped
1 hsa-let-7a-1 36634 11413.6842 N
2 hsa-let-7a-2 36608 11405.5837 N
3 hsa-let-7a-3 36006 11218.0246 N
File name: 0ac65c4b.mirnas.quantification.txt
miRNA_ID read_count reads_per_million_miRNA_mapped cross.mapped
1 hsa-let-7a-1 68376 14254.3693 N
2 hsa-let-7a-2 67965 14168.6880 Y
3 hsa-let-7a-3 67881 14151.1765 N
While each file has a unique name, the name does not tell me the patient's ID, and nothing in the file directly identifies the patient either. To determine the patient's ID, I use a separate master CSV file which has a row for each patient ID and its corresponding txt file name. This CSV file has way too many columns for me to post an example row, so I have listed only the two columns of interest below.
file_name patient_id
0a4af8c8.mirnas.quantification.txt TCGA-G9-6373-01A
0a867fd6.mirnas.quantification.txt TCGA-XJ-A9DX-01A
0ac65c4b.mirnas.quantification.txt TCGA-V1-A9OF-01A
My goal is to create a data frame combining all the txt files, with the gene expression data for all patients and all genes:
miRNA_ID TCGA-G9-6373-01A TCGA-XJ-A9DX-01A TCGA-V1-A9OF-01A
hsa-let-7a-1 5576.681 11413.6842 14254.3693
hsa-let-7a-2 5568.967 11405.5837 14168.6880
hsa-let-7a-3 5538.684 11218.0246 14151.1765
I have figured out a way to do this by subsetting the file name and patient ID into a new data frame, then using a for loop to combine all the txt files while adding an extra column with the file name, and finally using left_join from the tidyverse to join the data frames.
While this works, it is not resource efficient, as I am creating extra data frames and columns which I do not need. I was wondering if anyone knows of a better approach that does the same thing in one go, for example a which() call inside the for loop that renames the expression-value column to the patient ID by matching the file currently being processed against the corresponding row of the master CSV. Thanks in advance.
Here is the link to the previous method I used.
How to create a data frame in R where I have to associate different txt files with a sample ID from a separate file?
Without your actual data it is very challenging to attempt to answer this, so hopefully this will be a useful design pattern. You will need two things:
1) An identifying pattern that you can construct from the file name and merge with the master
2) All of the files in the working directory
Here is what I would recommend:
library(data.table)
library(magrittr)
library(stringr)
setwd("path/to/directory")
# Probably implement some kind of regex on the file name
# to extract the patient name
read_file <- function(file_name){
  fread(file_name) %>%
    .[, patient_name := str_replace_all(file_name, "regex_string", "")]
}
all_files <- list.files(pattern = "file_pattern")
master <- fread("path/to/master")
combined_files <- lapply(all_files, read_file) %>%
  rbindlist %>%
  merge(master, by = "patient_name")
Essentially this sets the working directory to where your files are, implements a parser which grabs the patient name to match to the master, applies that parser to all the files, combines them into a single data frame with the identifying observation, and then merges with the master. Hopefully it helps!
This should work. You'll need to customize the input_folder (or set your working directory there and delete the references to it in my code). I'm calling the data frame with the patient IDs and file names filekey.
library(data.table)
input_folder = "path/to/folder/"
cols_to_keep = c("miRNA_ID", "reads_per_million_miRNA_mapped")
files = lapply(paste0(input_folder, filekey$file_name), fread, select = cols_to_keep)
names(files) = filekey$patient_id
long = rbindlist(files, idcol = TRUE)
result = dcast(long, miRNA_ID ~ .id, value.var = "reads_per_million_miRNA_mapped")
result
# miRNA_ID TCGA-G9-6373-01A TCGA-V1-A9OF-01A TCGA-XJ-A9DX-01A
# 1: hsa-let-7a-1 5576.681 14254.37 11413.68
# 2: hsa-let-7a-2 5568.967 14168.69 11405.58
# 3: hsa-let-7a-3 5538.684 14151.18 11218.02
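If filekey doesn't exist yet, a minimal sketch for building it from the master CSV (the master.csv file name here is hypothetical; it just needs the file_name and patient_id columns shown in the question) could be:
# Keep only the two columns that map each txt file to a patient ID
filekey = fread("master.csv", select = c("file_name", "patient_id"))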

Get function or Passing character variable names in R

I have a data frame named MC_df, and I need to get a list of some of its variables to access through my function. If I do it manually with vari_names_MCdf <- list(MC_df$Age, MC_df$Year, MC_df$Education, MC_df$City, MC_df$Country), it works, but because the data set is large I have written all of the variable names I want to include into a CSV file. Now that I read the CSV file to access all of those variables, I am stuck. Here is the data MC_df:
MC_df <- data.frame(ID=c(1:5),Age=c(18,25,30,22,19),
Year=c(2003,2010,2008,2010,2015),
Education = c("<12","13-15",">15","<12",">15"),
City = c("SD","MS","LA","CV","SD"),
Country=c("US","CA","CA","CA","US"),
Group=c(rep("group",5)))
If I do it manually and just list all of the variables I want, it works:
vari_names_MCdf <- list(MC_df$Age, MC_df$Year, MC_df$Education, MC_df$City, MC_df$Country)
vari_group_MCdf <- list(MC_df$Group,MC_df$Group,MC_df$Group,MC_df$Group,MC_df$Group)
Now I write the variables into a CSV file and read it back with read.csv. This helps me in the future: I only need to edit the CSV file, not access MC_df directly.
df <- data.frame(names=c("Age","Year","Education","City","Country"),
text=c("Age Continuos","Year Category","Eduaction Group","City Category","Country Group"))
write.csv(df,"vari.csv")
var <- read.csv("vari.csv", stringsAsFactors = FALSE)
var
X names text
1 1 Age Age Continuos
2 2 Year Year Category
3 3 Education Eduaction Group
4 4 City City Category
5 5 Country Country Group
I tried:
vari_names <- as.list(var$names)
vari_text <- as.list(var$text)
get("MC_df")$vari_names[[1]]
vari_names_new <- lapply(vari_names, function(i) paste0("MC_df$",i, sep=""))
get(vari_names_new[[1]])
Neither approach works, so how can I use the variable names stored in the CSV file, together with the data set name, to access the columns?
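A minimal sketch of the usual base-R approach, assuming var has been read from vari.csv as above (the *_new names are just illustrative): index MC_df with [[ or [ using the character names instead of building "MC_df$..." strings.
# Single column looked up by its character name
MC_df[[var$names[1]]]    # equivalent to MC_df$Age
# Rebuild the manual list of columns from the names in the CSV
vari_names_new <- lapply(var$names, function(nm) MC_df[[nm]])
# Or subset the data frame by name and convert to a list in one step
vari_names_new <- as.list(MC_df[var$names])
# The matching group list (one copy of Group per selected variable)
vari_group_new <- rep(list(MC_df$Group), length(var$names))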

Import multiple .txt files into R and skip to actual data rows

I have 537 .txt files which I need to import into either a list or separate data frames in R. I do not want to append the data, as it is crucial to keep everything separate.
I've renamed each file, so the file names are all uniform. In each file there is a header section with a lot of miscellaneous information; this header section is 12-16 rows depending on the file. The data are tab delimited, with between 5 and 9 columns depending on the file, and the columns are not always in the same order, so it is important that I can import the column names along with the data (the column names themselves are uniform across files). The format of each file is as follows:
Header
Header
Header
Header...up to 16 rows
((number of spaces between header and column names varies))
Date(\t)Time(\t)dataCol1(\t)dataCol2(\t)dataCol3(\t)dataCol4
((no empty row between column names and units))
mm/dd/yyyy(\t)hh:mm:ss(\t)units(\t)units(\t)units(\t)units
((1 empty row between units and data))
01/31/2016(\t)14:32:02(\t)14.9(\t)25.3(\t)15.8(\t)25.6
((data repeats for up to 4000 rows))
To recap what I need:
Import all of the files into individual data frames or a lists of data frames.
Skip past the header information to the row with "Date" (and possibly delete the two rows following with units and the empty row) leaving me with a row of column names and the data following.
Here's a crude copy of the code I have been working on. The idea is, after importing all of the files into R, to determine the max value for 1-2 columns in each file, then export a single file with one row per input file and 2 columns containing those two max values.
##list files and create list for data.frames
path <- list.files("Path", pattern = NULL, all.files = FALSE, full.names = TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (file name minus extension)
results$name <- substr(basename(path), 1, nchar(basename(path)) - 4)
##loop to read in data files and calculate max
for(i in 1:length(path)){
  ##read files
  files[[i]] <- read.delim(path[[i]], header = FALSE, sep = "\t", skip = 18)
  ##will have to add code:
  ##"if columnx exists do this; if columny exists do this"
  ##convert 2 columns for calculation to numeric
  x.x <- as.numeric(as.character(files[[i]]$columnx))
  x.y <- as.numeric(as.character(files[[i]]$columny))
  ##will have to add code:
  ##"if column x exists, do this....if not, NA"
  ##get max value for 2 specific columns
  results$max.x[i] <- max(x.x)
  results$max.y[i] <- max(x.y)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max, file = "PATH")
I know that right now my code just skips straight into the data (the max doesn't come until much later in each file, so skipping 1 or 2 extra lines won't hurt me), and it assumes the columns are in the same order in every file. This is bad practice and gives me bad results on about 5% of my data points, but I want to do this correctly. My main concern is to get the data into R in a usable format; then I can add the other calculations and conversions. I am new to R, and after 2 days of searching I have not found the help I need already posted to any forum.
Assuming that the structure of the header follows a Line \n Line \n Data pattern, we can use grep to find the line number where "mm/dd/yyyy" appears.
As such:
system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T)
# ran.txt is an arbitrary text file I created, we will substitute
# 'ran.txt' with path[[i]] later on.
#[1] "6:mm/dd/yyyy\thh:mm:ss\tunits\tunits\tunits\tunits"
From this we can then strsplit the output into the number before the : and use that argument as the necessary value for skip.
as.numeric(strsplit(system("grep -nr 'mm/dd/yyyy' ran.txt", intern=T),":")[[1]][1])
# [[1]][1] will specify the first element of the output of strsplit as
# in the output the hh:mm:ss also is split.
# [1] 6
As there is an empty row between the matched row and the actual data, we can add 1 to this number and then begin reading the data.
Thusly:
##list files and create list for data.frames
path <- list.files("Path", pattern = NULL, all.files = FALSE, full.names = TRUE)
files <- list()
##Null list for final data to be extracted to
results <- NULL
##add names to results list (file name minus extension)
results$name <- substr(basename(path), 1, nchar(basename(path)) - 4)
##loop to read in data files and calculate max
for(i in 1:length(path)){
  ##read files
  # Calculate the number of rows to skip.
  # Using Dave2e's suggestion:
  header <- readLines(path[[i]], n = 20)
  skip <- grep("^mm/dd/yy", header)
  # Add one to also skip the empty line after the units row
  skip <- skip + 1
  files[[i]] <- read.delim(path[[i]],
                           header = FALSE,
                           sep = "\t",
                           skip = skip)
  ##will have to add code:
  ##"if columnx exists do this; if columny exists do this"
  ##convert 2 columns for calculation to numeric
  x.x <- as.numeric(as.character(files[[i]]$columnx))
  x.y <- as.numeric(as.character(files[[i]]$columny))
  ##will have to add code:
  ##"if column x exists, do this....if not, NA"
  ##get max value for 2 specific columns
  results$max.x[i] <- max(x.x)
  results$max.y[i] <- max(x.y)
}
##add results to data frame
max <- data.frame(results)
##export to .csv
write.csv(max, file = "PATH")
I think that about covers everything.
Thought I would add this here in case it helps someone else with a similar issue. @TJGorrie's solution helped solve my slightly different challenge. I have several .rad files that I need to read in, tag, and merge. The .rad files have headers that start at random rows, so I needed a way to find the row with the header. I didn't need to do any additional calculations except create a tag column. Hope this helps someone in the future, and thanks @TJGorrie for the awesome answer!
##list files and create list for data.frames
path <- list.files(pattern = "*.rad")
files <- list()
##loop to read in data files
for(i in 1:length(path)){
  # Using Dave2e's suggestion:
  header <- readLines(path[[i]], n = 20)
  skip <- grep("Sample", header)
  # Subtract one row to keep the row with "Sample" in it as the header
  skip <- skip - 1
  files[[i]] <- read.table(path[[i]],
                           header = TRUE,
                           fill = TRUE,
                           skip = skip,
                           stringsAsFactors = FALSE)
  # Name the newly created file objects the same name as the original file.
  names(files)[i] = gsub(".rad", "", (path[i]))
  files[[i]] = na.omit(as.data.frame(files[[i]]))
  # Create new column that includes the file name to act as a tag
  # when the dfs get merged through rbind
  files[[i]]$Tag = names(files)[i]
}
# bind all the dfs in the list into a single df (once, after the loop)
df = do.call("rbind",
             c(files, make.row.names = FALSE))
##export to .csv
write.csv(df,file="PATH.csv", row.names = FALSE)

How to convert rows

I have uploaded a data set called "Obtained Dataset". It usually has 16 rows of numeric and character variable names; some other files of a similar nature have fewer than 16. Each of these entries is the header of a data column, and the data itself starts from the 17th row onwards in this specific file.
Obtained dataset & Required Dataset
In the data itself, the 1st column is the x-axis, the 2nd column is the y-axis and the 3rd column is depth (these are standard for all the files in the database); the 4th column is GR 1 LIN, the 5th column is CAL 1 LIN, and so on and so forth as given in the first 16 rows of the data.
Now I want R code which can convert it into the format shown in the required data set. Also, if a different data set has fewer than 16 lines of names, say GR 1 LIN and RHOB 1 LIN are missing, then I want it to still create those columns with NA entries for 1:nrow.
Currently I have managed to export each file to Excel, manually clean the data, rename the columns correspondingly, save it as CSV and then read.csv("filename"), etc., but it is simply not possible to do this for 400 files.
Any advice on how to proceed would be of great help.
I have noticed that you have probably posted this question again, in a different format. This is a public forum, and people are happy to help. However, it's your job to make life simpler for others, and you are requested to put in some effort. Here is some advice on that.
Having said that, here is some code I have written to help you out.
Step0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Lets start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy data in the first 6 lines, followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset. This would help other people help you.
Step1: Now we need to read the file, and we want to skip the first 6 rows with undesired info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step2: Data clean up:
We need a vector with names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
Step3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check whether the solution is working. If it is, move on to the next step; otherwise make the necessary changes.
Step4: Automating:
First we need to create a list of all 400 files.
The easiest way (and the easiest to explain) is to copy the 400 files into one directory and then set that as the working directory (using setwd).
Now we'll create a vector with all the file names:
fileNameList <- dir()
Once this is done, we'll need a function to repeat steps 1 through 3:
convertFiles <- function(fileName) {
  temp <- read.table(file = fileName, sep = "\t", skip = 6)
  names(temp) <- namesVec
  write.table(temp, file = paste("clean", fileName, sep = "-"),
              row.names = FALSE, sep = "\t", quote = FALSE)
}
Now we simply need to apply this function on all the files we have:
sapply(fileNameList,convertFiles)
Hope this helps!
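The question also asks for missing variables (say GR 1 LIN or RHOB 1 LIN) to show up as all-NA columns when a file has fewer than 16 names. A minimal sketch of one way to handle that, assuming temp has already been given the column names actually present in that particular file (for example, read from its header rows) and that namesVec holds the full expected set, is a small helper like this (pad_missing is just an illustrative name):
pad_missing <- function(temp, expected = namesVec) {
  missing_cols <- setdiff(expected, names(temp))
  for (col in missing_cols) temp[[col]] <- NA  # add an all-NA column for each absent variable
  temp[expected]                               # return the columns in a consistent order
}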
