Parsing large XML files efficiently using R

New here. I use R a lot, but I'm pretty unfamiliar with XML. I'm looking for advice on efficiently looping through and aggregating large XML files (~200MB) from here. I have XML files with elements that look like this:
<OpportunitySynopsisDetail_1_0>
<OpportunityID>134613</OpportunityID>
<OpportunityTitle>Research Dissemination and Implementation Grants (R18)</OpportunityTitle>
<OpportunityNumber>PAR-12-063</OpportunityNumber>
...
</OpportunitySynopsisDetail_1_0>
None of the sub-elements have children or attributes. The only complicating factor is that some elements can have multiple instances of a single child type.
I've already downloaded and parsed one file using the xml2 package, and I have my xml_nodeset. I've also played around successfully with extracting data from subsets of the nodeset (i.e. the first 100 nodes). Here's an example of what I did to extract elements without an "ArchivedDate" sub-element:
for (i in 1:100) {
  if (is.na(xml_child(nodeset[[i]], "d1:ArchiveDate", xml_ns(xmlfile)))) {
    print(paste0("Entry ", i, " is not archived."))
  }
}
Here's the problem: if I replace 100 with length(nodeset), which is 56k+, this thing is going to take forever to iterate through. Is there a better way to filter and analyze XML elements without iterating over each and every one? Or is this just a limitation of the file format? The long-term goal would be to get a very small subset of this file into a data frame.
Thanks!
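One way to avoid the R-level loop is to push the filtering down into libxml2 with a single XPath query via xml_find_all(), which runs in compiled code rather than iterating in R. A minimal sketch on a toy document (the real grants file uses a default namespace, so the node names there would need the d1: prefix, e.g. //d1:OpportunitySynopsisDetail_1_0[not(d1:ArchiveDate)]):

```r
library(xml2)

# Tiny stand-in for the real 200 MB file (no namespace, for simplicity)
doc <- read_xml('
<Grants>
  <OpportunitySynopsisDetail_1_0>
    <OpportunityID>134613</OpportunityID>
    <ArchiveDate>2015-09-08</ArchiveDate>
  </OpportunitySynopsisDetail_1_0>
  <OpportunitySynopsisDetail_1_0>
    <OpportunityID>134614</OpportunityID>
  </OpportunitySynopsisDetail_1_0>
</Grants>')

# One XPath query selects every entry lacking an ArchiveDate child --
# the filtering happens inside libxml2, not in an R loop
not_archived <- xml_find_all(doc, "//OpportunitySynopsisDetail_1_0[not(ArchiveDate)]")
ids <- xml_text(xml_find_all(not_archived, "./OpportunityID"))
ids  # "134614"
```

From there, a small set of xml_find_all()/xml_text() calls per field of interest can populate the columns of a data frame directly, without ever visiting the 56k archived nodes.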

Related

Is there a method in R of extracting nested relational data tables from a JSON file?

I am currently researching publicly available payer transparency files across multiple insurers and I am trying to parse and extract JSON files using R and output them into .CSV files to later use with SQL. The file I am currently working with contains nested tables within the highest table.
I have attached the specific file I am working with right now in a link below, along with the code to mount it into R's dataviewer. I have used R extensively in healthcare analytics classes for statistical analysis and machine learning; though, I have never used R for building out data tables.
My goal is to assign a primary key to the highest level of the table, apply foreign and primary keys to the lower tables, and extract the lower tables and join them onto each other later to build out a large CSV or TXT file to load into SQL.
So far, I have used the jsonlite and rjson packages to extract the JSON itself into R, but trying to delist and unnest the tables within the tables is an enigma to me even after extensive research. I also find myself running into "subscript out of bounds", "unimplemented list" errors, and other issues.
It could also very well be the case that the JSON is too large for R's packages, or that the JSON is structurally flawed (I wouldn't know if it is; I am not accustomed to JSON). It seems that this could be a problem better solved with Python, though I don't know Python too well, and I am optimistic about R given how powerful it is.
Any feedback or answers would be greatly appreciated.
JSON file link: https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json
Code to load JSON:
json2 <- fromJSON('https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json')
The JSON loads correctly, but there are tables embedded within tables. I would hope that these tables could be easily exported and given keys for joining, but I cannot figure out how to denest these tables from within the data.
Some nested tables are out of subscript bounds for the data array. I have never encountered this problem and am bewildered as to how to go about resolving the issue.
I cannot figure out how to 'extract' the lower-level tables, let alone open them, due to the subscript boundary error.
I can assign a row ID to the main/highest table in the file, but I cannot figure out how to add sub-row IDs to the lower tables for future joins.
Maybe the jsonStrings package can help. It lets you manipulate JSON without converting it to an R object. This is the first time I've tried it on such a big JSON string, and it works fine.
Here is how to get the table in the first element of the JSON array:
options(timeout = 300)
download.file(
"https://individual.carefirst.com/carefirst-resources/machine-readable/Provider_Med_5.json",
"jsonFile.json"
)
library(jsonStrings)
# load the JSON file
jstring <- jsonString$new("jsonFile.json")
# extract table "plans" of first element (indexed by 0)
jsonTable <- jstring$at(0, "plans")
# get a dataframe
library(jsonlite)
dat <- fromJSON(jsonTable$asString())
But the dataframe dat has a list column. I don't know how you want to make a CSV with this dataframe.
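For that list column, one option is tidyr::unnest(), which expands the nested tables into long format while repeating the parent's columns, giving you the foreign key for the child rows directly. A minimal sketch on toy data (the field names here are hypothetical, not taken from the CareFirst file):

```r
library(jsonlite)
library(tidyr)

# Hypothetical miniature of a payer file: each element carries a nested "plans" table
txt <- '[{"reporting_entity": "A", "plans": [{"plan_id": 1}, {"plan_id": 2}]},
         {"reporting_entity": "B", "plans": [{"plan_id": 3}]}]'
dat <- fromJSON(txt)   # data frame with a list column "plans"

# unnest() expands the nested tables; the parent column repeats for each
# child row, so it acts as the join key after export
flat <- unnest(dat, plans)
flat
```

The flattened table can then be written with write.csv() for loading into SQL; deeper nestings just need repeated unnest() calls, one level at a time.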

Import only specific cells from Excel to R by hard coding

I have around 100 identically structured .xls files containing 10 sheets each, with very messy data. Here is a hypothetical example of one sheet:
I want to add everything together in one R dataframe/tibble.
I don't know the right approach here, but I believe that I can hard-code this within readxl::read_excel. It should look like this
I would like if somebody could show a short code of how to pick a cell to be the column name by its position and the data belonging to that column, also by its position/range.
Afterwards, I will find a way to loop this over all sheets within all files, or better: if I can specify the needed code for a certain sheet name within the read_excel function, then I only have to loop over all the files.
Thanks and let me know if you need some more information on this.
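In readxl, the range argument of read_excel() selects a rectangular block of cells by position, and col_names lets you either take the first row of that range as the header or supply names yourself. A minimal sketch using readxl's bundled example workbook — swap in your own file path and sheet names:

```r
library(readxl)

# readxl ships a demo workbook; substitute your own .xls/.xlsx path here
path <- readxl_example("datasets.xlsx")

# Read only cells A1:C4 of one sheet; row 1 of the range becomes the header
blk <- read_excel(path, sheet = "iris", range = "A1:C4")

# The same data cells without a header row, naming the columns by hand
raw <- read_excel(path, sheet = "iris", range = "A2:C4",
                  col_names = c("sepal_len", "sepal_wid", "petal_len"))
```

Since sheet accepts a name, looping over files with lapply() and passing the same sheet/range to each call covers the "100 files, same layout" case.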

Is it possible to import a subset of big .rds or .feather files into R?

I've found good tips about fast ways to import files into R, but I'm wondering if it is possible to import only a subset of a given file into a variable.
In my case, I have a file with 16 million rows saved as .rds (and also as .feather, as I was playing with the speed of both formats) and I'd like to import a subset of it (say, a few rows or a few columns) for initial analysis.
Is it possible? The readRDS() does not seem to accept any subsetting, while read_feather() does not seem to allow row selection (although you can specify the columns). Should I consider another data format?
The short answer is 'no'. A nice alternative is the fst file format, which does allow the retrieval of a selection of columns and rows from a large dataset. More info here.
Using readr::read_csv you could use n_max parameter and read as many rows as you like.
With readRDS, I suppose you could read the whole file, take a subset with dplyr::sample_n, and then erase the full object from memory with rm(object).
If you cannot read the whole file into memory, you could use sqlite or another database (which is the preferred way), or you could try something along the lines of readr::read_delim_chunked, which allows you to read a file in chunks, do something with each chunk (like sample_n), delete the read chunk from memory, keep just the callback's result, and go on like that until the file is over.
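To illustrate the fst route mentioned above: the format is random-access, so read_fst() can pull just a row range and a column subset without loading the rest. A small sketch with toy data (the file name is made up for the example):

```r
library(fst)

# Write a toy table to stand in for the 16-million-row dataset
df <- data.frame(id = 1:10000, x = runif(10000), y = runif(10000))
write_fst(df, "toy.fst")

# Read back only rows 1-100 of two columns; the remaining rows and
# the y column are never loaded into memory
sub <- read_fst("toy.fst", columns = c("id", "x"), from = 1, to = 100)
dim(sub)  # 100 rows, 2 columns
```

For an existing .rds file, the one-time cost is converting it to .fst once, after which every exploratory read can be a cheap slice.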

Which functions should I use to work with an XDF file on HDFS?

I have an .xdf file on an HDFS cluster which is around 10 GB and has nearly 70 columns. I want to read it into an R object so that I can perform some transformation and manipulation. I tried to Google it and came across two functions:
rxReadXdf
rxXdfToDataFrame
Could anyone tell me the preferred function for this, as I want to read the data and perform the transformations in parallel on each node of the cluster?
Also, if I read and perform transformations in chunks, do I have to merge the output of each chunk?
Thanks for your help in advance.
Cheers,
Amit
Note that rxReadXdf and rxXdfToDataFrame have different arguments and do slightly different things:
rxReadXdf has a numRows argument, so use this if you want to read the top 1000 (say) rows of the dataset
rxXdfToDataFrame supports rxTransforms, so use this if you want to manipulate your data in addition to reading it
rxXdfToDataFrame also has the maxRowsByCols argument, which is another way of capping the size of the input
So in your case, you want to use rxXdfToDataFrame since you're transforming the data in addition to reading it. rxReadXdf is a bit faster in the local compute context if you just want to read the data (no transforms). This is probably also true for HDFS, but I haven’t checked this.
However, are you sure that you want to read the data into a data frame? You can use rxDataStep to run (almost) arbitrary R code on an xdf file, while still leaving your data in that format. See the linked documentation page for how to use the transforms arguments.

Multiple files in R

I am trying to manage multiple files in R but am having a difficult time of it. I want to take the data in each of these files and manipulate it through a series of steps (all files receiving the same treatment). I think that I am going about it in a very silly manner, though. Is there a way to manage many files (each treated the same as before) without using 900 apply statements? For example, when is it recommended to merge all the data frames rather than treat each separately? Is there a way to merge more than two, or an uncertain number, as with the way the files are input here? Or is there a better way to handle so many files?
I take files in a standard way:
chosen <- tk_choose.files(default="", caption="Files:", multi=TRUE, filters=NULL, index=1)
But after that I would like to do several things with the data. As of now I am just applying different things, but it is getting confusing. See:
ytrim<-lapply(chosen, function(x) strtrim(y, width=11))
chRead<-lapply(chosen,read.table,header=TRUE)
tmp<-lapply(inputFiles, function(x) stack(fnctn))
etc, etc. This surely can't be the recommended way to go about it. Is there a better way to handle a multitude of files?
You can write one function with all operations, and apply it to all your files like this:
doSomethingWithFile <- function(filename) {
  ytrim <- strtrim(filename, width = 11)
  chRead <- read.table(filename, header = TRUE)
  # Return some result
  chRead
}
result<-lapply(chosen, doSomethingWithFile)
You will only need to think about how to return the results, as lapply needs to return a list of the same length as its input (chosen, in this case). You could also look at one of the apply functions of the plyr package for more flexibility.
(BTW: this code is not without errors, but neither is your example... I'll update mine if you give a proper example)
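On the merging question: since lapply() already produces a list of data frames (one per file), do.call(rbind, ...) stacks an arbitrary number of them in one step, assuming the files share the same columns. A small sketch with stand-in files in place of the tk_choose.files() selection:

```r
# Create two tiny stand-in files (hypothetical, in place of the chosen files)
write.table(data.frame(x = 1:2), "a.txt", row.names = FALSE)
write.table(data.frame(x = 3:4), "b.txt", row.names = FALSE)
files <- c("a.txt", "b.txt")

# One data frame per file, then bind them all at once --
# do.call() passes the whole list to rbind, however long it is
dfs <- lapply(files, read.table, header = TRUE)
merged <- do.call(rbind, dfs)
nrow(merged)  # 4
```

This handles the "uncertain number of files" case directly; merging makes sense once every file has received the same treatment and you want a single table for analysis.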
