Speeding up nested for loop in R

I have a nested for loop that I am using to parse a large, complex JSON file, and it takes forever. Is there a way to speed this up that I am missing? I can link to a smaller JSON file if that would be useful. Would something from the apply family be of use?
for (row_n1 in 1:length(json_data$in_network)) {
  in_network1 = json_data$in_network[[row_n1]]
  in_network1[["negotiated_rates"]] = NULL
  in_network_df = as.data.frame(in_network1)
  for (row_n2 in 1:length(json_data$in_network[[row_n1]]$negotiated_rates)) {
    reporting = NULL
    for (row_n3 in 1:length(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$provider_groups)) {
      npi_df = as.data.frame(toString(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$provider_groups[[row_n3]]$npi)) %>% set_names(nm = "npi")
      tin_df = as.data.frame(t(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$provider_groups[[row_n3]]$tin))
      df = merge(npi_df, tin_df)
      df = merge(in_network1, df)
      reporting = rbind(reporting, df)
    }
    negotiated_prices_df = NULL
    for (row_n3 in 1:length(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$negotiated_prices)) {
      df = as.data.frame(t(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$negotiated_prices[[row_n3]]))
      negotiated_prices_df = bind_rows(negotiated_prices_df, df)
    }
    r = merge(reporting, negotiated_prices_df)
    result = bind_rows(result, r)
  }
  if (row_n1 %% 100 == 0)
    print(paste(row_n1, Sys.time(), sep = "====="))
}
l = json_data
l$in_network = NULL
result = merge(as.data.frame(l), result)
This isn't a big deal on small files (e.g., < 1,000 rows in the result df), but when I am dealing with 1m+ rows it just takes forever!
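For what it's worth, here is a hedged sketch of the apply-family restructuring hinted at above (untested without the actual file, and it assumes json_data has exactly the structure implied by the loops): build a small data frame per list element and bind everything once, instead of growing reporting and result with rbind()/bind_rows() inside the loops, and avoid re-indexing the full nested list on every iteration.
# Untested sketch: purrr::map_dfr() collects and binds per-element data frames in one pass
library(purrr)
library(dplyr)

result <- map_dfr(json_data$in_network, function(item) {
  rates <- item$negotiated_rates
  item$negotiated_rates <- NULL
  in_network_df <- as.data.frame(item)   # convert once per in_network element
  map_dfr(rates, function(rate) {
    reporting <- map_dfr(rate$provider_groups, function(pg) {
      npi_df <- data.frame(npi = toString(pg$npi))
      tin_df <- as.data.frame(t(pg$tin))
      merge(in_network_df, merge(npi_df, tin_df))
    })
    prices <- map_dfr(rate$negotiated_prices, function(p) as.data.frame(t(p)))
    merge(reporting, prices)
  })
})

# the final merge of the top-level fields stays the same as in the original code
l <- json_data
l$in_network <- NULL
result <- merge(as.data.frame(l), result)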

Related

Most efficient way to extract data from large XML files in R

I have a few large XML files (~10 GB, growing every week) which I need to convert to a dataframe in R for analysis. The structure of the XML is as follows (with multiple records and a few more field elements per record):
<recordGroup>
  <records>
    <record>
      <recordId>123442</recordId>
      <reportingCountry>PT</reportingCountry>
      <date>2020-02-20</date>
      <field>
        <fieldName>Gender</fieldName>
        <fieldValue>F</fieldValue>
      </field>
      <field>
        <fieldName>Age</fieldName>
        <fieldValue>57</fieldValue>
      </field>
      <field>
        <fieldName>ClinicalSymptoms</fieldName>
        <fieldValue>COUGH</fieldValue>
        <fieldValue>FEVER</fieldValue>
        <fieldValue>O</fieldValue>
        <fieldValue>RUNOS</fieldValue>
        <fieldValue>SBREATH</fieldValue>
      </field>
    </record>
  </records>
</recordGroup>
I have been trying to find the most efficient way of extracting the data and converting it to a data.frame; however, one major challenge is that the files are quite large, and both XML and xml2 run into problems, apart from the fact that processing takes hours. My current strategy is to use xmlEventParse with the code below, but this seems to be even more inefficient.
value_df <- data.frame(recordId = as.character(), vardf = as.character(), value = as.character())
nvar <- 0

xmlEventParse(xmlDoc_clean,
  list(
    startElement = function(name, attrs) {
      tagName <<- name
    },
    text = function(x) {
      if (nchar(x) > 0) {
        if (tagName == "recordId") {
          rec <<- x
        } else if (tagName == "fieldName") {
          var_f <<- x
        } else if (tagName == "fieldValue") {
          v <- x
          nvar <<- nvar + 1
          value_df[nvar, 1:3] <<- c(rec, var_f, v)
        }
      }
    },
    endElement = function(name) {
      if (name == "record") {
        print(nvar)
      }
    }
  ))
I have tried xml2 (memory issues), XML (memory issues as well with standard DOM parsing), and was also going to try XMLSchema, but didn't manage to get it to work. Both XML and xml2 work if the files are split up.
I would appreciate any guidance on improving efficiency, as the files I am working with are becoming larger every week. I am using R on a Linux machine.
When memory is a challenge, consider the hard disk. Specifically, consider building a large CSV version of the parsed XML data with iterative append calls via write.table inside an xmlEventParse run:
# INITIALIZE CSV WITH HEADER ROW
csv <- file.path("C:", "Path", "To", "Large.csv")
fileConn <- file(csv)
writeLines("id,tag,text", fileConn)
close(fileConn)

i <- 0
doc <- file.path("C:", "Path", "To", "Large.xml")

output <- xmlEventParse(doc,
  list(startElement = function(name, attrs) {
    if (name == "recordId") { i <<- i + 1 }
    tagName <<- name
  }, text = function(x) {
    if (nchar(trimws(x)) > 0) {
      # APPEND ONE ROW PER TEXT NODE TO THE CSV ON DISK
      write.table(data.frame(id = i, tag = tagName, text = x),
                  file = csv, append = TRUE, sep = ",",
                  row.names = FALSE, col.names = FALSE)
    }
  }),
  useTagName = FALSE, addContext = FALSE)
Output
Obviously, further data wrangling will be needed to migrate the rows into proper columns. But you can now read the large CSV with the many tools out there, or process it in chunks.
id,tag,text
1,"recordId","123442"
1,"reportingCountry","PT"
1,"date","2020-02-20"
1,"fieldName","Gender"
1,"fieldValue","F"
1,"fieldName","Age"
1,"fieldValue","57"
1,"fieldName","ClinicalSymptoms"
1,"fieldValue","COUGH"
1,"fieldValue","FEVER"
1,"fieldValue","O"
1,"fieldValue","RUNOS"
1,"fieldValue","SBREATH"
In the end the fastest approach I found was the following:
1. Split the XML files into smaller chunks using xml2. I have >100 GB of RAM on the server I am working on, so I could parallelize this step using foreach with 6 workers, but mileage varies depending on how much RAM is available.
2. The function splitting the files returns a data.frame with the locations of the split files.
3. Process the smaller XML files in a foreach loop - this time it is possible to use all cores, so I have gone with 12 workers. The processing uses xml2, as I found that to be the fastest way. Initially the extracted data is in a long format, but I then convert it to a wide format within the loop.
4. The loop binds the different dataframes into one large dataframe. The final step is using fwrite to save the csv file. This seems to be the most efficient way.
With this approach I can process a 2.6 GB XML file in 6.5 minutes.
I will add code eventually, but it is quite specific, so I need to generalize it a bit; a rough, generic sketch of the pattern follows.
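For illustration only (this is not the code referred to above), here is a minimal, untested sketch of steps 3-4 under these assumptions: the chunk files already exist in a chunks/ directory, the records look like the sample in the question, and all paths, worker counts, and node names are placeholders.
# Untested sketch: parse already-split XML chunks in parallel with xml2,
# reshape long -> wide per chunk, bind everything, and save with data.table::fwrite.
# The "chunks" directory, file pattern, and node names are assumptions.
library(xml2)
library(foreach)
library(doParallel)
library(data.table)

chunk_files <- list.files("chunks", pattern = "\\.xml$", full.names = TRUE)

registerDoParallel(12)
result <- foreach(f = chunk_files,
                  .combine = function(...) rbindlist(list(...), fill = TRUE),
                  .multicombine = TRUE,
                  .packages = c("xml2", "data.table")) %dopar% {
  doc     <- read_xml(f)
  records <- xml_find_all(doc, ".//record")

  # long format: one row per fieldValue, keyed by recordId and fieldName
  # (scalar nodes such as reportingCountry/date could be added with xml_find_first())
  long <- rbindlist(lapply(records, function(rec) {
    fields <- xml_find_all(rec, "./field")
    data.table(
      recordId = xml_text(xml_find_first(rec, "./recordId")),
      name     = rep(xml_text(xml_find_all(rec, "./field/fieldName")),
                     lengths(lapply(fields, xml_find_all, "./fieldValue"))),
      value    = xml_text(xml_find_all(rec, "./field/fieldValue"))
    )
  }))

  # long -> wide, collapsing repeated values (e.g. ClinicalSymptoms) with ";"
  dcast(long, recordId ~ name, value.var = "value",
        fun.aggregate = function(x) paste(x, collapse = ";"))
}
stopImplicitCluster()

fwrite(result, "records_wide.csv")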

Unable to update data in dataframe

I tried updating data in a dataframe, but it is not getting updated.
# Initialize data and dataframe here
user_data = read.csv("train_5.csv")
baskets.df = data.frame(Sequence = character(),
                        Challenge = character(),
                        countno = integer(),
                        stringsAsFactors = FALSE)

# Update data in dataframe here
for (i in 1:length(user_data))
{
  for (j in i:length(user_data))
  {
    if (user_data$challenge_sequence[i] == user_data$challenge_sequence[j] &&
        user_data$challenge[i] == user_data$challenge[j])
    {
      writedata(user_data$challenge_sequence[i], user_data$challenge[i])
    }
  }
}

writedata = function(seqnn, challng)
{
  #print(seqnn)
  #print(challng)
  newRow <- data.frame(Sequence = seqnn, Challenge = challng, countno = 1)
  baskets.df = rbind(baskets.df, newRow)
}

# View data here
View(baskets.df)
I've modified your code to what I believe will work. You haven't provided sample data, so I can't verify that it works the way you want. I'm basing my attempt here on a couple of common novice mistakes that I'll do my best to explain.
Your writedata function was written to be a little loose with its scope. When you create a new function, what happens in the function technically happens in its own environment. That is, it tries to look for things defined within the function, and then any new objects it creates are created only within that environment. R also has this neat (and sometimes tricky) feature where, if it can't find an object in an environment, it will try to look up to the parent environment.
The impact this has on your writedata function is that when R looks for baskets.df in the function and can't find it, R then turns to the Global Environment, finds baskets.df there, and then uses it in rbind. However, the result of rbind gets saved to a baskets.df in the function environment, and does not update the object of the same name in the global environment.
To address this, I added an argument to writedata that is simply named data. We can then use this argument to pass a data frame into the function's environment and do everything locally. By not making any assignment at the end, we implicitly tell the function to return its result.
Then, in your loop, instead of simply calling writedata, we assign its result back to baskets.df to replace the previous result.
for (i in 1:length(user_data))
{
  for (j in i:length(user_data))
  {
    if (user_data$challenge_sequence[i] == user_data$challenge_sequence[j] &&
        user_data$challenge[i] == user_data$challenge[j])
    {
      baskets.df <- writedata(baskets.df,
                              user_data$challenge_sequence[i],
                              user_data$challenge[i])
    }
  }
}

writedata = function(data, seqnn, challng)
{
  #print(seqnn)
  #print(challng)
  newRow <- data.frame(Sequence = seqnn,
                       Challenge = challng,
                       countno = 1)
  rbind(data, newRow)
}
I'm not sure what your programming background is, but your loops will be very slow in R because it is an interpreted language. To get around this, many functions are vectorized (which simply means that you give them more than one data point, and they do the looping inside compiled code, where loops are fast).
With that in mind, here's what I believe will be a much faster implementation of your code
user_data = read.csv("train_5.csv")

# challenge_indices will be a matrix with TRUE at every place where "challenge" and "challenge_sequence" are the same
challenge_indices <- outer(user_data$challenge_sequence, user_data$challenge_sequence, "==") &
  outer(user_data$challenge, user_data$challenge, "==")

# since you don't want duplicates, get rid of them
challenge_indices[upper.tri(challenge_indices, diag = TRUE)] <- FALSE

# now let's get the indices of interest
index_list <- which(challenge_indices, arr.ind = TRUE)

# now we make the resulting data set all at once
# this is much faster, because it does not require copying the data frame many times,
# which would be required if you created a new row every time
baskets.df <- with(user_data, data.frame(
  Sequence = challenge_sequence[index_list[, "row"]],
  challenge = challenge[index_list[, "row"]]
))

Writing the results of the for loop

We were trying to write the results from a for loop. We tried to use write.table, as.data.frame and other solutions, but with no success. We expect to have a data frame.
Currently we have only the loop, which shows the year and the values from a matrix that are bigger than 50. It looks like this:
for (i in 1:nrow(dobowe1)) {
  if (dobowe1[i, 4] > 50) {
    cat(dobowe1[i, "rok"], dobowe1[i, 4], "\n")
  }
}
Note: We don't do a lot of programming, so it's hard for us to use the solutions from questions that have already been asked.
Try saving each element to a vector, like here:
tabela <- numeric(nrow(dobowe1))

for (i in 1:nrow(dobowe1)) {
  if (dobowe1[i, 4] > 50) {
    tabela[i] <- paste(dobowe1[i, "rok"], dobowe1[i, 4])
  }
}

as.data.frame(tabela)
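One caveat with this approach: positions where the condition fails keep their initial value, which becomes the string "0" once the vector is coerced to character by paste(). A small, hedged addition to drop those before building the data frame:
# drop the leftover "0" entries from rows that failed the > 50 condition
tabela <- tabela[tabela != "0"]
as.data.frame(tabela)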
If you just want to visually inspect a subset of your matrix, you can just print out a filtered subset:
# create the filter:
f <- dobowe1[, 4] > 50
# use the filter to subset (index) your data.frame:
dobowe1[f, c("rok", whatever-4th-var-is-called)]
This will automatically print it out. Or you can write it to a file with ?write.table
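For example (the file name is made up, and "kolumna4" stands in for whatever the 4th column is actually called):
# hypothetical file and column names, adjust to your data
write.table(dobowe1[f, c("rok", "kolumna4")],
            file = "dobowe1_filtered.csv",
            sep = ",", row.names = FALSE)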

Merging a large number of csv datasets

Here are 2 sample datasets.
PRISM-APPT_1895.csv
https://copy.com/SOO2KbCHBX4MRQbn
PRISM-APPT_1896.csv
https://copy.com/JDytBqLgDvk6JzUe
I have 100 of these types of data sets that I'm trying to merge into one data frame, export that to csv, and then merge that into another very large dataset.
I need to merge everything by "gridNumber" and "Year", creating a time series dataset.
Originally, I imported all of the annual datasets and then tried to merge them with this:
df <- join_all(list(Year_1895, Year_1896, Year_1897, Year_1898, Year_1899, Year_1900, Year_1901, Year_1902,
Year_1903, Year_1904, Year_1905, Year_1906, Year_1907, Year_1908, Year_1909, Year_1910,
Year_1911, Year_1912, Year_1913, Year_1914, Year_1915, Year_1916, Year_1917, Year_1918,
Year_1919, Year_1920, Year_1921, Year_1922, Year_1923, Year_1924, Year_1925, Year_1926,
Year_1927, Year_1928, Year_1929, Year_1930, Year_1931, Year_1932, Year_1933, Year_1934,
Year_1935, Year_1936, Year_1937, Year_1938, Year_1939, Year_1940, Year_1941, Year_1942,
Year_1943, Year_1944, Year_1945, Year_1946, Year_1947, Year_1948, Year_1949, Year_1950,
Year_1951, Year_1952, Year_1953, Year_1954, Year_1955, Year_1956, Year_1957, Year_1958,
Year_1959, Year_1960, Year_1961, Year_1962, Year_1963, Year_1964, Year_1965, Year_1966,
Year_1967, Year_1968, Year_1969, Year_1970, Year_1971, Year_1972, Year_1973, Year_1974,
Year_1975, Year_1976, Year_1977, Year_1978, Year_1979, Year_1980, Year_1981, Year_1982,
Year_1983, Year_1984, Year_1985, Year_1986, Year_1987, Year_1988, Year_1989, Year_1990,
Year_1991, Year_1992, Year_1993, Year_1994, Year_1995, Year_1996, Year_1997, Year_1998,
Year_1999, Year_2000),
by = c("gridNumber","Year"),type="full")
But R keeps crashing, because I think the merge is a bit too large for it to handle, so I'm looking for something that would work better. Maybe data.table? Or another option.
Thanks for any help you can provide.
Almost nine months later and your question has no answer. I could not find your datasets; however, I will show one way to do the job. It is trivial in awk.
Here is a minimal awk script:
BEGIN {
    for (i = 0; i < 10; i++) {
        filename = "out" i ".csv";
        while (getline < filename) print $0;
        close(filename);
    }
}
The script is run as
awk -f s.awk
where s.awk is the above script in a text file.
This script creates ten filenames: out0.csv, out1.csv ... out9.csv. These are the already-existing files with the data. The first file is opened and all records are sent to the standard output. The file is then closed, and the next filename is created and opened. The above script has little to offer over a command-line read/redirect. You would typically use awk to process a long list of filenames read from another file, with statements to selectively ignore lines or columns depending on various criteria.
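Since the question also mentions data.table, here is a possible (untested) sketch in R, assuming the annual files sit in one directory (called annual_csvs here, a placeholder) with names PRISM-APPT_1895.csv through PRISM-APPT_2000.csv and share the same columns, including gridNumber and Year. In that case a full join on gridNumber and Year amounts to stacking the rows, which avoids the expensive pairwise merges:
# Untested sketch: stack the annual files with data.table instead of joining them pairwise
library(data.table)

files <- list.files("annual_csvs", pattern = "^PRISM-APPT_\\d{4}\\.csv$", full.names = TRUE)

all_years <- rbindlist(lapply(files, fread), use.names = TRUE, fill = TRUE)
setkey(all_years, gridNumber, Year)

fwrite(all_years, "PRISM-APPT_all_years.csv")

# the later merge with the other large dataset (other_large_dt is a placeholder name),
# also by gridNumber and Year:
# merged <- merge(all_years, other_large_dt, by = c("gridNumber", "Year"), all = TRUE)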

Warning message: number of items to replace is not a multiple of replacement length

I got warnings when running this code.
For example, when I put
tm1 <- summary(tmfit)[c(4, 8, 9)]
I can get the result, but I need to run this code for each i.
Why do I get this warning?
Is there any way to do this other than via a for loop?
Specifically, I have many regressands (y) with the same two regressors (x's).
How can I get these regression results (to make some comparisons)?
dreg = read.csv("dayreg.csv")
fundr = read.csv("fundreturnday.csv")
num = ncol(fundr)
exr = dreg[, 2]
tm = dreg[, 4]

for (i in 2:num)
{
  tmfit = lm(fundr[, i] ~ exr + tm)
  tm1[i] <- summary(tmfit)[c(4, 8, 9)]
}
Any help is highly appreciated
Try storing your result into a list instead of a vector.
dreg = read.csv("dayreg.csv")
fundr = read.csv("fundreturnday.csv")
num = ncol(fundr)
exr = dreg[, 2]
tm = dreg[, 4]   # keep the second regressor from the original code
tm1 = list()     # store the results in a list

for (i in 2:num)
{
  tmfit = lm(fundr[, i] ~ exr + tm)
  tm1[[i]] <- summary(tmfit)[c(4, 8, 9)]
}
You can look at an element in the list like so
tm1[[2]]
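As for avoiding the explicit for loop, a hedged alternative (same data and model as above) is to let lapply do the looping and return the list directly; note the index shift:
# loop-free equivalent: lapply builds the list in one go
tm1 <- lapply(2:num, function(i) summary(lm(fundr[, i] ~ exr + tm))[c(4, 8, 9)])
# tm1[[1]] now corresponds to column 2 of fundr, tm1[[2]] to column 3, and so on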
