Merge a large number of data frames in R

I have the output from a data submission in the form of multiple vector list objects stored in rda files.
Each list object is in a separate rda file, and I have nearly 2000 files.
I want to merge all the objects into a single object in a single rda file in the fastest way possible (partly because I may need to repeat this several times).
All the rda files are fairly small (~10 MB, though that is the compressed size), but it all adds up with the number of files.
Memory isn't a huge problem, as I am running this on a server with >700 GB of RAM.
My first approach, loading the files one by one, concatenating each onto the merged list object, and then removing the object that had just been appended, went badly due to the time it was going to take (something like 40 days at a best guess).
My revised approach is below, but I am wondering whether there is a quicker way to do this, given that I may need to repeat the process:
load("data_1.rda")
load("data_2.rda")
load("data_3.rda") ...
load("data_2000.rda")
my.list <- list()
my.list <- c(my.list, data.1, data.2, data.3, ... , data.2000)
save(my.list, file="my_list.rda")
And just to add to things, I'm getting an error when doing this:
Error: attempt to set index 18446744071562067968/2877912830 in SET_STRING_ELT
It's not a very helpful error message.
All the rda files load as objects into the environment fine, but the error appears when I try to concatenate them, and only once it reaches a particular point, since it doesn't fail immediately. I wasn't sure whether it was some sort of limit on the number of concatenations you can do or rogue data, but from troubleshooting it appears to be syntax rather than data related.
I have chunked it up into 5 batches and then do a final concatenation before saving the rda. I have seen other answers for this sort of thing suggesting rbind, mget, and do.call or the list function; would using any of these functions make it faster and achieve the same thing?
Something like this:
my.list <- do.call(rbind, mget(ls(pattern="^data_")))
Thanks
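One thing worth noting: since the objects here are lists rather than data frames, the concatenating function would be c rather than rbind. Below is a minimal sketch of the load-everything-then-concatenate-once idea, assuming the files are named data_1.rda through data_2000.rda and each contains a single object; it avoids the quadratic cost of growing my.list with repeated c() calls, which is likely what made the incremental approach so slow.
files <- sprintf("data_%d.rda", 1:2000)
pieces <- lapply(files, function(f) {
  e <- new.env()
  nm <- load(f, envir = e)   # load() returns the names of the restored objects
  e[[nm[1]]]
})
my.list <- do.call(c, pieces)   # a single concatenation instead of 2000
save(my.list, file = "my_list.rda")
Loading into a throwaway environment keeps the 2000 intermediate objects out of the global workspace, and the single do.call(c, pieces) copies the data once rather than re-copying the growing result on every iteration.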

Related

R needs several hours to save very small objects. Why?

I am running several calculations and ML algorithms in R and store their results in four distinctive tables.
For each calculation, I obtain four tables, which I store in a single list.
According to R, all of my lists are labelled as "Large List (4 elements, 971.2 kB)" in the Environment pane (upper right) in RStudio, where all my objects, functions, etc. are displayed.
I have five of these lists and save them for later use with the save() function.
I use the function:
save(list1, list2, list3, list4, list5, file="mypath/mylists.RData")
For some reason, which I do not understand, R takes more than 24 hours to save these five lists of only 971.2 kB each.
Maybe I should add that apparently more than 10 GB of my RAM is being used by R at the time. However, the lists are as small as I indicated above.
Does anyone have an idea why it takes so long to save the lists to my hard drive, and what I could do about it?
Thank you
This is just a guess, because we don't have your data.
Some objects in R contain references to environments. The most common examples are functions and formulas. If you save one of those, R may need to save the whole environment. This can drastically increase the size of what is being saved. If you are short of memory, that could take a very long time due to swapping.
Example:
F <- function() {
  X <- rnorm(1000000)  # large vector living in the function's environment
  Y ~ z                # the returned formula captures that environment
}
This function returns a small formula which references the environment holding X, so saving it will take a lot of space.
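If captured environments are indeed the culprit, one workaround is to repoint the formula's environment before saving. This is a minimal sketch building on the example above, not the poster's actual fix:
fm <- F()                       # the formula carrying the heavy environment
length(serialize(fm, NULL))     # about 8 MB: X rides along with the formula
environment(fm) <- emptyenv()   # detach the formula from that environment
length(serialize(fm, NULL))     # now only a few hundred bytes
save(fm, file = "formula_only.rda")
Note that stripping the environment breaks any later lookup of variables through the formula, so this is only safe when the formula is used purely symbolically.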
Thanks for your answers.
I solved my problem by writing a function which extracts the tables from the objects and saves them as .csv files in a folder. I cleaned the environment and shut down the computer. Afterwards, I restarted the computer, started R, and loaded all the .csv files again. I then saved the newly created objects with the familiar save() command.
It is probably not the most elegant way, but it worked and was quite quick.
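A rough sketch of that kind of CSV round-trip (the object and file names here are illustrative, not the poster's actual code):
# Session 1: dump every table of every list to .csv files.
for (i in seq_along(all_lists)) {          # all_lists: the five result lists
  for (j in seq_along(all_lists[[i]])) {
    write.csv(all_lists[[i]][[j]],
              sprintf("export/list%d_table%d.csv", i, j),
              row.names = FALSE)
  }
}
# Session 2, in a fresh R process: read the tables back and save cleanly.
files  <- list.files("export", pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv)
save(tables, file = "mypath/mylists.RData")
Because the CSVs contain only the data, anything the original objects were dragging along (such as captured environments) is left behind.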

R: Best Way to Create a Varying Number of Assignment Statements

I have a script I have to run periodically where I use a varying number of assignment statements in R, like this:
r5$NWord_1<-ifelse(r5$Match==1,NA,r5$NWord_1)
r5$NWord_2<-ifelse(r5$Match==2,NA,r5$NWord_2)
r5$NWord_3<-ifelse(r5$Match==3,NA,r5$NWord_3)
r5$NWord_4<-ifelse(r5$Match==4,NA,r5$NWord_4)
r5$NWord_5<-ifelse(r5$Match==5,NA,r5$NWord_5)
r5$NWord_6<-ifelse(r5$Match==6,NA,r5$NWord_6)
r5$NWord_7<-ifelse(r5$Match==7,NA,r5$NWord_7)
The problem is that the number of "NWord" variables changes from run to run (usually between 5 and 7). I have the number of "NWord" variables stored separately as Size.
Size<-5
I have tried the following, but get() only works on objects, not on columns of data frames.
for(i in 1:Size){
get(paste("r5$NWord_",i,sep=""))<-ifelse(r5$Match==i,NA,get(paste("r5$NWord_",i,sep="")))
}
I am curious: What is the best way to automate this process so I do not have to manually run a subset of these statements every time?
For those interested: 1) The import data are in wide format (that is what the system allows me to download). 2) The export data have to be in wide format in order to upload to the system. Since these are just a few of some ~200 variables in the data set, going back and forth between wide and long and then long back to wide (possibly multiple times) seems cumbersome and prone to error. Therefore, I came up with this:
idx1 <- which(colnames(r5) == "NWord_1")
idx2 <- which(colnames(r5) == paste("NWord_", Size, sep=""))
for (i in idx1:idx2) {
  # column i holds NWord_(i - idx1 + 1), so compare Match against that number
  r5[, i] <- ifelse(r5$Match == i - idx1 + 1, NA, r5[, i])
}
It seems to work just fine; however, I'm not sure that it is the most efficient way to code this.
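A name-based variant (a sketch using the r5 and Size objects from the question) avoids computing column indices altogether and keeps the match number explicit:
for (i in seq_len(Size)) {
  col <- paste("NWord_", i, sep="")   # column name for match number i
  r5[[col]] <- ifelse(r5$Match == i, NA, r5[[col]])
}
Indexing with [[ by column name is the usual way around the get() limitation: r5[["NWord_1"]] refers to the column itself, so it can appear on the left-hand side of an assignment.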

Call columns from inside a for loop in R

I basically want to be able to call columns from inside a for loop (in reality two nested for loops), using paste() and the loop's i (j, ...) value to access
my data frames column-wise in a flexible manner.
#for the showcase I use the standard cars example
r1 <- cars
r2 <- cars
# in case there are more data to consider I would want to add or remove further without changing the rest
# here I am entering the "dimension" of what I want to compare; for the showcase it's only one
num_r <- 2 #total number of reactors in the experiment
for( i in 1:num_r)
{
# should create proxy variable to be processed further
assign(paste("proxi_r",i,sep="", colapse="") , do.call("matrix",
list(get(paste("r",i,"$speed",sep="", colapse="" )))))
# further operations of gluing and arranging data follow so they fit the tests' formatting requirements
}
which gives me:
Error in get(paste("r", i, "$speed", sep = "", colapse = "")) :
object 'r1$speed' not found
but when I type r1$speed it obviously exists??
So far I have searched "R object dont exist inside loop", "using paste() to access variables inside loop", "for loops and objects", "do.call inside loops" ... and similar.
Is there any way to circumvent get(), so I don't have to look into the topic of environments, while keeping the flexibility of my loops? I don't want to re-edit my script every time I change the experimental configuration, which is really time consuming and lets a lot of errors sneak in.
The size of the data has crashed Excel, with its extensive use of macros that everyone in the lab here relies on, several times :), so there is no going back to the comfort zone.
I am now trying to dig into R programming with an R statistics book, and a lot of googling and reading of tutorials, so please forgive my naive approach and my lousy English.
I would be very thankful for any tips, as I feel sort of stuck right now.
This is a common confusion. You've created an object name "r1$speed", i.e. a complete character string. This is not the same as the object r1 subsetted by $speed.
Try using get(paste('r',i,collapse='',sep=''))$speed
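Beyond fixing the get() call, a way to sidestep get()/assign() entirely (a sketch based on the question's setup, not part of the original answer) is to keep the reactor data frames in a named list:
reactors <- mget(paste0("r", 1:num_r))            # list(r1 = ..., r2 = ...)
proxies  <- lapply(reactors, function(d) matrix(d$speed))
proxies[["r1"]]                                   # the proxy matrix for reactor 1
A list grows and shrinks with the experimental configuration, so no name pasting or environment tricks are needed.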

Multiple files in R

I am trying to manage multiple files in R but am having a difficult time of it. I want to take the data in each of these files and manipulate them through a series of steps (all files receiving the same treatment). I think that I am going about it in a very silly manner, though. Is there a way to manage many files (each treated the same as before) without using 900 apply statements? For example, when is it recommended to merge all the data frames rather than treat each separately? Is there a way to merge more than two, or an uncertain number, as with the way the files are input here? Or is there a better way to handle so many files?
I take files in a standard way:
library(tcltk)  # provides tk_choose.files()
chosen <- tk_choose.files(default="", caption="Files:", multi=TRUE, filters=NULL, index=1)
But after that I would like to do several things with the data. As of now I am just applying different things, but it is getting confusing. See:
ytrim<-lapply(chosen, function(x) strtrim(y, width=11))
chRead<-lapply(chosen,read.table,header=TRUE)
tmp<-lapply(inputFiles, function(x) stack(fnctn))
etc, etc. This surely can't be the recommended way to go about it. Is there a better way to handle a multitude of files?
You can write one function with all operations, and apply it to all your files like this:
doSomethingWithFile <- function(filename) {
  ytrim  <- strtrim(filename, width=11)       # e.g. derive a short label from the file name
  chRead <- read.table(filename, header=TRUE)
  # Return some result
  chRead
}
result<-lapply(chosen, doSomethingWithFile)
You will only need to think about how to return the results, as lapply returns a list with the same length as its input (chosen, in this case). You could also look at one of the apply functions of the plyr package for more flexibility.
(BTW: this code is not without errors, but neither is your example... I'll update mine if you give a proper example)
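If each per-file result is a data frame with the same columns, one common follow-up (a sketch, not part of the original answer) is to bind them into a single table:
combined <- do.call(rbind, result)   # stack all per-file tables into one data frame
This also answers the question's "merge an uncertain number of data frames": do.call() applies rbind to however many elements the list happens to contain.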

Undo command in R

I can't find anything to the effect of an undo command in R (neither in An Introduction to R nor in R in a Nutshell). I am particularly interested in undoing/deleting when dealing with interactive graphs.
What approaches do you suggest?
You should consider a different approach which leads to reproducible work:
Pick an editor you like and which has R support
Write your code in 'snippets', i.e. short files for functions, and then use the facilities of the editor / R integration to send the code to the R interpreter
If you make a mistake, re-edit your snippet and run it again
You will always have a log of what you did
All this works tremendously well in ESS which is why many experienced R users like this environment. But editors are a subjective and personal choice; other people like Eclipse with StatET better. There are other solutions for Mac OS X and Windows too, and all this has been discussed countless times before here on SO and on other places like the R lists.
In general I do adopt Dirk's strategy. You should aim for your code to be a completely reproducible record of how you have transformed your raw data into output.
However, if you have complex code it can take a long time to re-run it all. I've had code that takes over 30 minutes to process the data (i.e., import, transform, merge, etc.).
In these cases, a single data-destroying line of code would require me to wait 30 minutes to restore my workspace.
By data destroying code I mean things like:
x <- merge(x, y)
df$x <- df$x^2
e.g., merges, replacing an existing variable with a transformation, removing rows or columns, and so on. In these cases it's easy, especially when first learning R, to make a mistake.
To avoid having to wait this 30 minutes, I adopt several strategies:
If I'm about to do something where there's a risk of destroying my active objects, I'll first copy the result into a temporary object. I'll then check that it worked with the temporary object and then rerun replacing it with the proper object.
E.g., first run temp <- merge(x, y); check that it worked with str(temp), head(temp), and tail(temp); and if everything looks good, rerun it as x <- merge(x, y)
As is common in psychological research, I often have large data frames with hundreds of variables and different subsets of cases. For a given analysis (e.g., a table, a figure, some results text), I'll often extract just the subset of cases and variables that I need into a separate object for the analysis and work with that object when preparing and finalising my analysis code. That way, I'm less likely to accidentally damage my main data frame. This assumes that the results of the analysis do not need to be fed back into the main data frame.
If I have finished performing a large number of complex data transformations, I may save a copy of the core workspace objects, e.g., save(x, y, z, file = 'backup.Rdata'). That way, if I make a mistake, I only have to reload these objects (see the sketch below).
df$x <- NULL is a handy way of removing a variable in a data frame that you did not mean to create
However, in the end I still run all the code from scratch to check that the result is reproducible.
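Putting the backup strategy into concrete shape (a sketch; the object names are illustrative):
# After the expensive 30-minute pipeline has finished:
save(x, y, z, file = "backup.Rdata")
# ...later, after a destructive mistake, restore without rerunning the pipeline:
load("backup.Rdata")    # puts x, y and z back into the workspace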
