In R, when is it smart to formally declare a data.frame?

In other people's R code, it is common to see a data.frame declared before a loop is started.
Suppose I have a data frame data1 with 2000 rows.
In a loop, I am calling a web service for each row of data1 to build a new data.frame data2. (Please don't recommend not using a loop.)
In data2$result and data2$pubcount I need to store a different value for each of the 2000 data1 items.
Do I HAVE to declare this before the loop:
data2 <- data.frame()
and do I have to tell R how many rows and which columns I will later use? I know that columns can be added without declaring; what about rows? Is there an advantage in doing:
data2 <- data.frame(id = data1$id)
I would like to declare and do only what I absolutely HAVE to.
Why does the empty declaration give an error once inside the loop?
Later edit: speed and memory are not an issue. 10 s vs. 30 s makes no difference, the data is under 100 MB, and I have a big PC (8 GB of RAM). A matrix is not an option since the data is mixed numbers and text, so I have to use a non-matrix structure.

Something like this:
df <- data.frame(a = numeric(n), b = character(n))   # pre-allocates n rows up front
for (i in 1:n) {
  # <do stuff>
  df[i, 1] <- ...
  df[i, 2] <- ...
}
You should avoid manipulation of data.frames in a loop, since subsetting of data.frames is a slow operation:
a <- numeric(n)
b <- character(n)
for (i in 1:n) {
  # <do stuff>
  a[i] <- ...
  b[i] <- ...
}
df <- data.frame(a, b)
Of course, there are often better ways than a for loop. But whatever you do, avoid growing objects (and I won't show you how to do that); pre-allocate as shown here.
Why should you pre-allocate? Because growing objects in a loop is slow, and that's one of the main reasons people think loops in R are slow.
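To make the pre-allocation pattern concrete, here is a minimal runnable sketch for the question's mixed-type case; the web-service call is a hypothetical stand-in, and stringsAsFactors = FALSE is an assumption so the text column stays character:
n <- 10                                  # stands in for the 2000 rows of data1
fake_service <- function(i) {            # hypothetical placeholder for the real web-service call
  list(result = sqrt(i), pubcount = paste0("pub", i))
}
data2 <- data.frame(result = numeric(n),
                    pubcount = character(n),
                    stringsAsFactors = FALSE)   # keep text as character, not factor
for (i in 1:n) {
  resp <- fake_service(i)
  data2$result[i] <- resp$result
  data2$pubcount[i] <- resp$pubcount
}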

Related

Object selection in loop

I am currently experiencing perpetual issues with object selection within loops in R. I am fairly convinced that this is a common problem, but I cannot seem to find the answer, so here I am...
Here's a practical example of a problem I have:
I have a dataframe as my source, with a series of variables named sequentially (X1, X2, X3, X4, and so on). I am looking to write a function which takes the source data, matches it to another dataset, and creates a new, combined dataset.
The number of variables will vary. I want to pass my function a parameter which tells it how many variables I have, and the function needs to adjust how many times it runs the matching code accordingly. This seems like a task for a for loop, but again there doesn't appear to be an easy way to select and recreate variables within a loop.
Here's the code I need to repeat:
new1$X1 <- data$X1[match(new1$matf1, data$rowID)]
new1$X2 <- data$X2[match(new1$matf1, data$rowID)]
new1$X3 <- data$X3[match(new1$matf1, data$rowID)]
new1$X4 <- data$X4[match(new1$matf1, data$rowID)]
new1$X5 <- data$X5[match(new1$matf1, data$rowID)]
(...)
return(new1)
I've attempted something like this:
for (i in 1:5) {
  new1$Xi <- assign(paste0("X", i)), as.vector(paste0("data$X", i)[match(new1$matf1, data$rowID)])
}
without success.
Thank you for your help!
You can try this simple way; however, a join would be more efficient:
vals <- paste0('X', 1:5)
for (i in vals) {
  new1[[i]] <- data[[i]][match(new1$matf1, data$rowID)]
}
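For comparison, the join hinted at above could use base merge(); this is only a sketch, assuming data carries rowID and X1..X5 and new1 carries matf1:
# left-join the X columns of data onto new1, keyed on matf1 = rowID
new1 <- merge(new1, data[, c("rowID", paste0("X", 1:5))],
              by.x = "matf1", by.y = "rowID",
              all.x = TRUE, sort = FALSE)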

Expand dataframe in R with rbind (union)

I need to scale up a set of files for a proof of concept in my company. I essentially have several 1000-row files with around 200 columns each, and I want to rbind them until I reach the desired scale, which might be 1 million rows or more.
The output will essentially be a repetition of data (sounds a bit silly), and I'm aware of that, but I just need to prove something.
I used a while loop in R similar to this:
while (nrow(df) < 1000000) { df <- rbind(df, df) }
This seems to work, but it looks a bit computationally heavy and might take 10-15 minutes.
I thought of creating a function (below) and using an "apply"-family function on the df, but couldn't succeed:
scaleup_function <- function(x)
{
  while (nrow(df) < 1000)
  {
    x <- rbind(df, df)
  }
}
Is there a quicker and more efficient way of doing this (it doesn't need to be with rbind)?
Many thanks,
Joao
This should do the trick:
df <- matrix(0, nrow = 1000, ncol = 200)          # stand-in for one of your files
reps_needed <- ceiling(1000000 / nrow(df))        # copies needed to reach the target size
df_scaled <- df[rep(1:nrow(df), reps_needed), ]   # replicate the rows by indexing, in one shot
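The same rep-indexing works directly on a data.frame, so mixed column types are no obstacle; a sketch, assuming df is one of the original 1000-row files:
reps_needed <- ceiling(1000000 / nrow(df))
df_scaled <- df[rep(1:nrow(df), reps_needed), ]   # single allocation, no repeated rbind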

How Can I Avoid This For Loop? (R)

I currently have a for loop as below and it does not run as fast as I would like it to.
library(dplyr)
DF <- data.frame(Name = c('Bob', 'Joe', 'Sally'))   # etc
PrimaryResult <- Function1(DF)
ResultsDF <- Function2(PrimaryResult)
for (i in 1:9) {
  Filtered <- filter(DF, Name != PrimaryResult[i, 2])
  NextResult <- Function1(Filtered)
  ResultsDF <- rbind(ResultsDF, Function2(NextResult))
}
The code takes an initial result from Function1 (a list of names) and runs it again with each name from that initial result excluded in turn, to produce alternative results. Each alternative is returned as a one-row data frame via Function2 and appended to the results data frame.
How can I make this faster?
It seems like your main problem is appending the result of Function2 to ResultsDF with rbind on each iteration. This is classically slow because you are telling R to rewrite the whole accumulated object at every step, and R does not know in advance how large the final object will be.
Try collecting your results in a list instead. I don't really know what your functions do, so I can't really assist with that part.
results_list <- vector("list", 10)                 # pre-allocate one slot per result
results_list[[1]] <- Function2(PrimaryResult)
for (i in 1:9) {
  Filtered <- filter(DF, Name != PrimaryResult[i, 2])
  NextResult <- Function1(Filtered)
  results_list[[i + 1]] <- Function2(NextResult)   # each one-row result gets its own slot
}
ResultsDF <- do.call(rbind, results_list)          # bind everything once, at the end
This is not perfect, but it should speed things up a bit.
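As a side note, since dplyr is already loaded here, bind_rows() is a convenient alternative for the final combine, assuming Function2 returns data frames:
ResultsDF <- bind_rows(results_list)   # equivalent to do.call(rbind, results_list) for data frames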

How to efficiently iterate through a complicated function that outputs a dataframe?

I essentially need to iterate through a set of values for parameters A,B,C to generate a table of results that will help me analyze the importance of such parameters. This is for a program in R.
Let's say that:
A goes from rangeA = 1:10
B goes from rangeB = 11:20
C goes from rangeC = 21:30
The simplest (not most efficient) solution that I currently use goes something like this:
rangeA <- 1:10; rangeB <- 11:20; rangeC <- 21:30   # the parameter ranges from above
### here I create an empty dataframe because I append each tmp result to it later
res <- data.frame()
### here I just create a random dataframe for reproducibility
dataset <- data.frame(replicate(10, sample(0:1, 1000, rep = TRUE)))
ParameterAdjustment <- function() {
  for (a in rangeA) {
    for (b in rangeB) {
      for (c in rangeC) {
        ### this is a complicated calculation that is much more
        ### difficult than the replicable example below
        tmp <- CalculateSomething(dataset, a, b, c)
        ### an example calculation
        ### EDIT: NEW EXAMPLE CALCULATION
        tmp <- colMeans(dataset + a * b * c)
        tmp <- data.frame(t(tmp), sd(tmp))
        res <- rbind(res, tmp)
      }
    }
  }
  return(res)
}
My problem is that this works fine with my original dataset, which runs the calculation on a 7000x500 dataframe. However, my new datasets are much larger and performance has become a significant issue. Can anyone suggest or help with a more efficient solution? Thank you.
Not sure what language the above is, so not sure how relevant this is, but here goes: are you outputting/sending the data as you go, or collecting all the results in memory and then outputting them in one go at the end? When I've encountered similar problems with large datasets, this approach has helped me out a few times. For example, when sending tens of thousands of data points back to the client for a graph, rather than generating an array of all those points and sending that, I output each point to the screen as it is computed and then free up the memory. It still takes a while, but that's unavoidable; the important bit is that it doesn't crash.
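Within R itself, a common speed-up for this pattern is to pre-build the parameter grid with expand.grid and collect one result row per combination in a list, binding once at the end instead of growing res with rbind. A sketch, reusing the example calculation from the question:
dataset <- data.frame(replicate(10, sample(0:1, 1000, rep = TRUE)))
grid <- expand.grid(a = 1:10, b = 11:20, c = 21:30)   # all 1000 parameter combinations
rows <- lapply(seq_len(nrow(grid)), function(i) {
  tmp <- colMeans(dataset + grid$a[i] * grid$b[i] * grid$c[i])
  data.frame(t(tmp), sd = sd(tmp))
})
res <- do.call(rbind, rows)   # one bind instead of 1000 incremental rbinds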

Memory efficient alternative to rbind - in-place rbind?

I need to rbind two large data frames. Right now I use
df <- rbind(df, df.extension)
but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.
So my question is: is there a way to avoid data duplication in memory when using rbind?
I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.
data.table is your friend!
Cf. http://www.mail-archive.com/r-help@r-project.org/msg175877.html
Following up on nikola's comment, here is ?rbindlist's description (new in v1.8.2):
Same as do.call("rbind", l), but much faster.
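A minimal sketch of rbindlist in use (the two small frames are just illustrations):
library(data.table)
df1 <- data.frame(x = 1:3, y = letters[1:3])
df2 <- data.frame(x = 4:6, y = letters[4:6])
combined <- rbindlist(list(df1, df2))   # binds in one pass, with far less copying than rbind(df1, df2)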
First of all: use the solution from the other question you link to if you want to be safe. As R is call-by-value, forget about an "in-place" method that doesn't copy your dataframes in memory.
One method I do not advise, though it saves quite a bit of memory, is to pretend your dataframes are lists: build the combined columns in a plain list with a for loop (apply will eat memory like hell) and then make R believe the result actually is a dataframe.
I'll warn you again: using this on more complex dataframes is asking for trouble and hard-to-find bugs. So be sure you test well enough, and if possible, avoid this approach altogether.
You could try the following approach:
n1 <- 1000000
n2 <- 1000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1 * ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2 * ncols), n2, ncols))
dtf <- list()
for (i in names(dtf1)) {
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])   # concatenate the two columns into one
}
attr(dtf, "row.names") <- 1:(n1 + n2)   # fake the row names
attr(dtf, "class") <- "data.frame"      # make R treat the list as a data.frame
This erases the row names you originally had (you can reconstruct them, but check for duplicated row names!), and it skips all the other consistency checks that rbind carries out.
It saved about half the memory in my tests, and in my tests dtfcomb and dtf were equal. (The original answer included a memory-usage plot: the red box is rbind, the yellow one the list-based approach.)
Test script:
n1 <- 3000000
n2 <- 3000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1 * ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2 * ncols), n2, ncols))
gc()
Sys.sleep(10)
dtfcomb <- rbind(dtf1, dtf2)
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtfcomb)
gc()
Sys.sleep(10)
dtf <- list()
for (i in names(dtf1)) {
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}
attr(dtf, "row.names") <- 1:(n1 + n2)
attr(dtf, "class") <- "data.frame"
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtf)
gc()
For now I have worked out the following solution:
nextrow <- nrow(df) + 1
df[nextrow:(nextrow + nrow(df.extension) - 1), ] <- df.extension
# we need to ensure unique row names
row.names(df) <- 1:nrow(df)
Now I don't run out of memory. I think it's because I only need to store
object.size(df) + 2 * object.size(df.extension)
while with rbind R would need
object.size(rbind(df, df.extension)) + object.size(df) + object.size(df.extension).
After that I use
rm(df.extension)
gc(reset = TRUE)
to free the memory I no longer need.
This solves my problem for now, but I feel that there is a more advanced way to do a memory-efficient rbind. I appreciate any comments on this solution.
This is a perfect candidate for bigmemory. See the site for more information. Here are three usage aspects to consider:
It's OK to use the HD: memory-mapping to disk is much faster than practically any other access method, so you may not see any slowdowns. At times I rely upon more than 1 TB of memory-mapped matrices, though most are between 6 and 50 GB. Moreover, as the object is a matrix, using it requires no real overhead of rewriting code.
Whether you use a file-backed matrix or not, you can use separated = TRUE to make the columns separate. I haven't used this much, because of my third tip:
You can over-allocate the disk space to allow for a larger potential matrix size, but only load the submatrix of interest. This way there is no need to do rbind.
Note: although the original question addressed data frames and bigmemory is suitable for matrices, one can easily create different matrices for the different data types and then combine the objects in RAM to create a data frame, if it's really necessary.
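A minimal file-backed sketch, assuming the bigmemory package is installed (the file names here are placeholders):
library(bigmemory)
# over-allocate a file-backed matrix on disk; only the pages you touch occupy RAM
x <- filebacked.big.matrix(nrow = 6e6, ncol = 20, type = "double",
                           backingfile = "x.bin", descriptorfile = "x.desc")
x[1:3, 1] <- c(1, 2, 3)   # writes go straight to the memory-mapped file
x[1:3, 1:2]               # load only the submatrix of interest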
