Slow data.frame row assignment - r

I am working with RMongoDB and I need to fill an empty data.frame with the values of a query. The result set is quite large, about 2 million documents (rows).
While I was doing performance tests, I found out that the time needed to write values to a row increases with the size of the data frame. Maybe it is a well-known issue and I am the last one to notice it.
Some code example:
set.seed(20140430)
nreg <- 2e3
dfres <- as.data.frame(matrix(rep(NA,nreg*7),nrow=nreg,ncol=7))
system.time(dfres[1e3,] <- c(1:5,"a","b"))
summary(replicate(10,system.time(dfres[sample(1:nreg,1),] <- c(1:5,"a","b"))[3]))
nreg <- 2e6
dfres <- as.data.frame(matrix(rep(NA,nreg*7),nrow=nreg,ncol=7))
system.time(dfres[1e3,] <- c(1:5,"a","b"))
summary(replicate(10,system.time(dfres[sample(1:nreg,1),] <- c(1:5,"a","b"))[3]))
On my machine, the assignment into the 2-million-row data.frame takes about 0.4 seconds. That is a lot of time if I want to fill the whole dataset. Here is a second simulation to illustrate the issue.
nreg <- seq(2e1,2e7,length.out=10)
te <- NULL
for(i in nreg){
dfres <- as.data.frame(matrix(rep(NA,i*7),nrow=i,ncol=7))
te <- c(te,mean(replicate(10,{r <- sample(1:i,1); system.time(dfres[r,] <- c(1:5,"a","b"))[3]}) ) )
}
plot(nreg,te,xlab="Number of rows",ylab="Avg. time for 10 random assignments [sec]",type="o")
#rm(nreg,dfres,te)
Question: Why does this happen? Is there a quicker way to fill the data.frame in memory?

Let's start with "columns" first and see what goes on and then return to rows.
R versions < 3.1.0 (unnecessarily) copy the entire data.frame when you operate on it. For example:
## R v3.0.3
df <- data.frame(x=1:5, y=6:10)
dplyr:::changes(df, transform(df, z=11:15)) ## requires dplyr to be available
# Changed variables:
# old new
# x 0x7ff9343fb4d0 0x7ff9326dfba8
# y 0x7ff9343fb488 0x7ff9326dfbf0
# z <added> 0x7ff9326dfc38
# Changed attributes:
# old new
# names 0x7ff934170c28 0x7ff934308808
# row.names 0x7ff934551b18 0x7ff934308970
# class 0x7ff9346c5278 0x7ff935d1d1f8
You can see that adding the "new" column has resulted in a copy of the "old" columns (the addresses are different). The attributes are copied as well. What bites most is that these copies are deep copies, as opposed to shallow copies.
A shallow copy only copies the vector of column pointers, not the data itself, whereas a deep copy copies everything (which is unnecessary here).
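As an aside, base R's tracemem() also flags the duplication, but it can't tell a shallow copy from a deep one, which is why column addresses (via dplyr:::changes) are inspected here instead. A quick sketch, if you want to watch it fire:
df <- data.frame(x=1:5, y=6:10)
tracemem(df)   # e.g. "<0x7ff93...>"
df$z <- 11:15  # a tracemem[...] line prints on both R versions;
               # what differs under the hood is whether the columns were deep copied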
However, in R v3.1.0 there have been welcome changes, in that the "old" columns are no longer deep copied. All credit to the R core dev team.
## R v3.1.0
df <- data.frame(x=1:5, y=6:10)
dplyr:::changes(df, transform(df, z=11:15)) ## requires dplyr to be available
# Changed variables:
# old new
# z <added> 0x7f85d328dda8
# Changed attributes:
# old new
# names 0x7f85d1459548 0x7f85d297bec8
# row.names 0x7f85d2c66cd8 0x7f85d2bfa928
# class 0x7f85d345cab8 0x7f85d2d6afb8
You can see that the columns x and y aren't changed at all (and are therefore not present in the output of the changes function call). This is a huge (and welcome) improvement!
So far, we have looked at the issue of adding columns in R < 3.1.0 and v3.1.0.
Now, coming to your question: what about rows? Let's consider the older version of R first and then come back to R v3.1.0.
## R v3.0.3
df <- data.frame(x=1:5, y=6:10)
df.old <- df
df$y[1L] <- -6L
dplyr:::changes(df.old, df)
# Changed variables:
# old new
# x 0x7f968b423e50 0x7f968ac6ba40
# y 0x7f968b423e98 0x7f968ac6bad0
#
# Changed attributes:
# old new
# names 0x7f968ab88a28 0x7f968abca8e0
# row.names 0x7f968abb6438 0x7f968ab22bb0
# class 0x7f968ad73e08 0x7f968b580828
Once again we see that changing column y has resulted in copying column x as well in older versions of R.
## R v3.1.0
df <- data.frame(x=1:5, y=6:10)
df.old <- df
df$y[1L] <- -6L
dplyr:::changes(df.old, df)
# Changed variables:
# old new
# y 0x7f85d3544090 0x7f85d2c9bbb8
#
# Changed attributes:
# old new
# row.names 0x7f85d35a69a8 0x7f85d35a6690
We see the nice improvement in R v3.1.0, which results in a copy of just column y. Once again, R's copy-on-modify has gotten wiser.
But using data.table's assignment-by-reference semantics, we can do one step better still: not copy even the y column, as is the case in R v3.1.0.
The idea being: as long as the type of the value you assign to a column at certain indices doesn't change (here, column y is integer, so as long as you assign an integer back to y), we can avoid the copy altogether by modifying in place (by reference).
Why? Because we don't have to allocate or re-allocate anything. As an example, if you had assigned a double/numeric value, which requires 8 bytes of storage as opposed to the 4 bytes of the integer column y, then a new column y would have to be created and the values copied over.
That is, we can sub-assign by reference using data.table, with either := or set(). I'll demonstrate using set() here.
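Here's a minimal sketch of that idea on data shaped like yours (not the benchmarked script; the V1..V7 column names come from the matrix conversion, and all columns are character, since c(1:5,"a","b") is a character vector anyway):
library(data.table)
nreg <- 2e6
dt <- as.data.table(matrix(NA_character_, nrow=nreg, ncol=7))
# sub-assign one row by reference; no column is copied
system.time(set(dt, i=1000L, j=1:7, value=as.list(c(1:5,"a","b"))))
# the := equivalent: dt[1000L, (names(dt)) := as.list(c(1:5,"a","b"))]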
Now, here's a comparison of base R and data.table on your data, from 2,000 to 20,000,000 rows in multiples of 10, against R v3.0.3 and v3.1.0 separately. You can find the code here.
Plot for comparison against R v3.0.3:
Plot for comparison against R v3.1.0:
The min, median and max timings (in seconds) for R v3.0.3, R v3.1.0 and data.table on 20 million rows with 10 replications are:
type          min  median    max
base_3.0.3  10.05   10.70  18.51
base_3.1.0   1.67    1.97   5.20
data.table   0.04    0.04   0.05
Note: You can see the complete timings in this gist.
This clearly shows the improvement in R v3.1.0, but it also shows that the column being changed is still copied, and that copy still takes some time; this is overcome through sub-assignment by reference in data.table.
HTH

Related

"This R function is purely indicative and should never be called": Why am I getting this error?

I am returning to some old code. I am sure it worked in the past. Since I last used it, I've upgraded dplyr to version 1.0.0. Unfortunately, devtools::install_version("dplyr", version='0.8.5') gives me an error whilst compiling, so I can't perform a regression test.
I am trying to create a tidy version of the mcmc class from the runjags package. The mcmc class is essentially a (large) two-dimensional matrix of arbitrary size. It's likely to have several (tens of) thousands of rows and the column names are relevant and (as in my toy data below) potentially awkward. There is also useful information in the attributes of the mcmc object. Hence the somewhat convoluted approach I've taken. It needs to be completely generic.
Toy data:
# load("./data/oCRMPosteriorShort.rda")
# x <- head(oCRMPosteriorShort$mcmc[[1]])
# dput(x)
x <- structure(c(7.27091686833247, 5.72764789439587, 5.72103479848012,
7.43825337823404, 8.59970106873194, 8.03081445451, 9.16248677241767,
3.09793571064081, 4.66492638321819, 3.19480526258532, 5.1159808007229,
6.08361682213139, 5.05973067601884, 4.14556598358942, 0.95900563867179,
0.88584483221691, 0.950304627720881, 1.13467524314569, 1.44963882689823,
1.19907577185321, 1.15968445234753), .Dim = c(7L, 3L), .Dimnames = list(
c("5001", "5003", "5005", "5007", "5009", "5011", "5013"),
c("alpha[1]", "alpha[2]", "beta")), mcpar = c(5001, 5013,
2), class = "mcmc")
Stage 1: Code that works:
a <- attributes(x)
colNames <- a$dimnames[[2]]
#Get the sample IDs (from the attributes of x) and add the chain index
base <- tibble::enframe(a$dimnames[[1]], value="Sample") %>%
tibble::add_column(Chain=1, .before=1) %>%
dplyr::select(-.data$name)
# Create a list of tibbles, defining each as the contents of base plus the contents of the ith column of x,
# plus the name of the ith column in Temp.
t <- lapply(1:length(colNames), function(i) d <- cbind(base) %>% tibble(Temp=colNames[i], Value=x[,colNames[i]]))
At this point, I have a list of tibbles. Each tibble contains columns named Chain (with the value 1 for each observation in each tibble in this case), Sample (with values taken from the first dimension of the dimnames attribute of x), Temp (with values of beta, alpha[1] and alpha[2] in elements 3, 1 and 2 of the list) and Value (the value of the mcmc object in cell [Sample, Temp]).
Stage 2: Here's the problem...
It should be a simple matter to bind_rows() the list into a single tidy tibble containing (some of) the information I need. But:
# row_bind the list of tibbles into a single object
rv <- dplyr::bind_rows(t)
# Error: `vec_ptype2.double.double()` is implemented at C level.
# This R function is purely indicative and should never be called.
# Run `rlang::last_error()` to see where the error occurred.
Questions:
I can't see what I'm doing wrong here. (And even if I were doing something wrong, I'd expect a more user-friendly, higher-level sort of error message.) I can't find any references to this error anywhere on the web.
Does anyone have any idea what's going on?
Would someone run the code using dplyr v0.8.x and report what they see?
I'd appreciate your thoughts.
Update:
It looked as if the problem had been resolved by a reboot, but it has now returned. Even when these tibbles cause the error, a related example from the online docs works:
one <- starwars[1:4, ]
two <- starwars[9:12, ]
bind_rows(list(one, two))
runs without problems.
Context:
> # Context
> R.Version()$version.string
[1] "R version 3.6.3 (2020-02-29)"
> packageVersion("dplyr")
[1] ‘1.0.0’
> Sys.info()["version"]
version
"Darwin Kernel Version 18.7.0: Mon Apr 27 20:09:39 PDT 2020; root:xnu-4903.278.35~1/RELEASE_X86_64"

How to rename an entire dataframe in r? [duplicate]

I have a huge data frame, named df, loaded in the global environment in R. How can I rename the data frame, without copying it, by assigning it to another symbol and removing the original one?
R is smart enough not to make a copy when you simply rebind the same object to a new name, so just go ahead: reassign and rm() the original.
Example:
x <- 1:10
tracemem(x)
# [1] "<0000000017181EA8>"
y <- x
tracemem(y)
# [1] "<0000000017181EA8>"
As we can see, both objects point to the same address. R makes a new copy in memory only when one of them is modified, i.e. when the two objects stop being identical.
# Now change one of the vectors
y[2] <- 3
# tracemem[0x0000000017181ea8 -> 0x0000000017178c68]:
# tracemem[0x0000000017178c68 -> 0x0000000012ebe3b0]:
tracemem(x)
# [1] "<0000000017181EA8>"
tracemem(y)
# [1] "<0000000012EBE3B0>"
Related post: How do I rename an R object?
There is a function called mv in the gdata package.
library(gdata)
x <- data.frame(A = 1:100, B = 101:200, C = 201:300)
tracemem(x)
"<0000000024EA66F8>"
mv(from = "x", to = "y")
tracemem(y)
"<0000000024EA66F8>"
You will notice that the output from tracemem is identical for x and y. Looking at the code of mv, you will see that it assigns the object to the environment in scope and then removes the old object. This is quite similar to the approach C8H10N4O2 used (although mv is for a single object), but at least the function is convenient to use.
To apply the accepted answer to many objects, you could use a loop of assign(new_name, get(old_name)) followed by rm(list = old_names). For example, if you wanted to replace old_df, old_x, old_y, ... with new_df, new_x, ...
for (obj_old_name in ls(pattern='old_')){
assign(sub('old_','new_',obj_old_name), get(obj_old_name))
}
rm(list=ls(pattern='old_'))

Can I create new xts columns from a list of names?

My objective: read data files from yahoo then perform calculations on each xts using lists to create the names of xts and the names of columns to assign results to.
Why? I want to perform the same calculations for a large number of xts datasets without having to retype separate lines to perform the same calculations on each dataset.
First, get the datasets for 2 ETFs:
library(quantmod)
# get ETF data sets for example
startDate = as.Date("2013-12-15") #Specify period of time we are interested in
endDate = as.Date("2013-12-31")
etfList <- c("IEF","SPY")
getSymbols(etfList, src = "yahoo", from = startDate, to = endDate)
To simplify coding, remove the ticker prefix that yahoo adds to the column names:
colnames(SPY) <- gsub("SPY.","", colnames(SPY))
colnames(IEF) <- gsub("IEF.","", colnames(IEF))
head(IEF,2)
Open High Low Close Volume Adjusted
#2013-12-16 100.86 100.87 100.52 100.61 572400 98.36
#2013-12-17 100.60 100.93 100.60 100.93 694800 98.67
Creating new columns using the functions in quantmod is straightforward, e.g.,
SPY$logRtn <- periodReturn(Ad(SPY),period='daily',subset=NULL,type='log')
IEF$logRtn <- periodReturn(Ad(IEF),period='daily',subset=NULL,type='log')
head(IEF,2)
# Open High Low Close Volume Adjusted logRtn
#2013-12-16 100.86 100.87 100.52 100.61 572400 98.36 0.0000000
#2013-12-17 100.60 100.93 100.60 100.93 694800 98.67 0.0031467
but rather than creating a new statement to perform the calculation for each ETF, I want to use a list instead. Here's the general idea:
etfList
#[1] "IEF" "SPY"
etfColName = "logRtn"
for (etfName in etfList) {
newCol <- paste(etfName, etfColName, sep = "$")
newcol <- periodReturn(Ad(etfName),period='daily',subset=NULL,type='log')
}
Of course, using strings (obviously) doesn't work, because
typeof(newCol) # is [1] "character"
typeof(logRtn) # is [1] "double"
I've tried everything I can think of (at least twice) to coerce the character string etfName$etfColName into an object that I can assign calculations to.
I've looked at many variations that work with data.frames, e.g., mutate() from dplyr, but they don't work on xts objects. I could convert the datasets back and forth between xts and data.frame, but that's pretty kludgy (to say the least).
So, can anyone suggest an elegant and straightforward solution to this problem (i.e., in somewhat less than 25 lines of code)?
I shall be so grateful that, when I make enough to buy my own NFL team, you will always have a place of honor in the owner's box.
This type of task is a lot easier if you store your data in a new environment. Then you can use eapply to loop over all the objects in the environment and apply a function to them.
library(quantmod)
etfList <- c("IEF","SPY")
# new environment to store data
etfEnv <- new.env()
# use env arg to make getSymbols load the data to the new environment
getSymbols(etfList, from="2013-12-15", to="2013-12-31", env=etfEnv)
# function containing stuff you want to do to each instrument
etfTransform <- function(x, ...) {
# remove instrument name prefix from colnames
colnames(x) <- gsub(".*\\.", "", colnames(x))
# add return column
x$logRtn <- periodReturn(Ad(x), ...)
x
}
# use eapply to apply your function to each instrument
etfData <- eapply(etfEnv, etfTransform, period='daily', type='log')
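The result is a plain named list of xts objects, so (as a hypothetical follow-up) you can do, e.g.:
head(etfData$SPY, 2)   # transformed SPY, now with a logRtn column
# or, if you prefer individual objects back in the global environment:
# list2env(etfData, envir = globalenv())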
(I didn't realize that you had posted a reproducible example.)
See if this is helpful:
etfColName = "logRtn"
for ( etfName in etfList ) {
newCol <- get(etfName)[ , etfColName]
assign(etfName, cbind( get(etfName),
periodReturn( Ad(get(etfName)),
period='daily',
subset=NULL,type='log')))
}
> names(SPY)
[1] "SPY.Open" "SPY.High" "SPY.Low" "SPY.Close"
[5] "SPY.Volume" "SPY.Adjusted" "logRtn" "daily.returns"
I'm not a quantmod user, and it's only from the behavior I see that I believe the Ad function returns a named vector. (So I did not need to do any naming.)
R is not a macro language, which means you cannot just string together character values and expect them to be executed as though you had typed them at the command line. The get and assign functions allow you to 'pull' and 'push' items from the data object environment on the basis of character values, but you should not use the $-function in conjunction with them.
I still do not see a connection between the creation of newCol and the actual new column that your code was attempting to create. They have different spellings, so they would have been different columns... if I could have figured out what you were attempting.


R data.table efficient replication by group

I am running into some memory allocation problems trying to replicate some data by groups using data.table and rep.
Here is some sample data:
ob1 <- as.data.frame(cbind(c(1999),c("THE","BLACK","DOG","JUMPED","OVER","RED","FENCE"),c(4)),stringsAsFactors=FALSE)
ob2 <- as.data.frame(cbind(c(2000),c("I","WALKED","THE","BLACK","DOG"),c(3)),stringsAsFactors=FALSE)
ob3 <- as.data.frame(cbind(c(2001),c("SHE","PAINTED","THE","RED","FENCE"),c(1)),stringsAsFactors=FALSE)
ob4 <- as.data.frame(cbind(c(2002),c("THE","YELLOW","HOUSE","HAS","BLACK","DOG","AND","RED","FENCE"),c(2)),stringsAsFactors=FALSE)
sample_data <- rbind(ob1,ob2,ob3,ob4)
colnames(sample_data) <- c("yr","token","multiple")
What I am trying to do is replicate the tokens (in the present order) by the multiple for each year.
The following code works and gives me the answer I want (it uses plyr and data.table):
library(plyr)
library(data.table)
good_solution1 <- ddply(sample_data, "yr", function(x) data.frame(rep(x[,2],x[1,3])))
good_solution2 <- data.table(sample_data)[, rep(token,unique(multiple)),by = "yr"]
The issue is that when I scale this up to 40mm+ rows, I get into memory issues for both possible solutions.
If my understanding is correct, these solutions are essentially doing an rbind, which allocates memory every time.
Does anyone have a better solution?
I looked at set() for data.table but ran into issues because I wanted to keep the tokens in the same order for each replication.
One way is:
require(data.table)
dt <- data.table(sample_data)
# multiple seems to be a character, convert to numeric
dt[, multiple := as.numeric(multiple)]
setkey(dt, "multiple")
dt[J(rep(unique(multiple), unique(multiple))), allow.cartesian=TRUE]
Everything except the last line should be straightforward. The last line performs a subset using the key column with the help of J(.). For each value in J(.), the corresponding value is matched with the key column and the matched subset is returned.
That is, if you do dt[J(1)] you'll get the subset where multiple == 1. And if you note carefully, doing dt[J(rep(1,2))] gives you the same subset, but twice. Note that there's a difference between passing dt[J(1,1)] and dt[J(rep(1,2))]. The former matches the values (1,1) against the first two key columns of the data.table respectively, whereas the latter matches (1 and 1) against the first key column only, returning that subset twice.
So, if we pass the same key value n times in J(.), it gets duplicated n times. We use this trick to pass 1 one time, 2 two times, etc., and that's what the rep(.) part does: here, rep(.) gives 1,2,2,3,3,3,4,4,4,4.
And if the join results in more rows than max(nrow(dt), nrow(i)) (where i is the rep vector inside J(.)), you have to explicitly use allow.cartesian = TRUE to perform the join (I believe this is a new feature from data.table 1.8.8).
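A tiny toy table makes the duplication trick concrete (the names here are made up for illustration):
tmp <- data.table(multiple = 1:3, v = c("a","b","c"), key = "multiple")
tmp[J(2)]               # the row where multiple == 2
tmp[J(rep(2, 2))]       # that same row, twice
tmp[J(rep(1:3, 1:3))]   # row 1 once, row 2 twice, row 3 three times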
Edit: Here's some benchmarking I did on "relatively" big data. I don't see any spikes in memory allocation with either method, but I've yet to find a way to monitor peak memory usage within a function in R. I am sure I've seen such a post here on SO, but it slips my mind at the moment; I'll write back again. For now, here's the test data and some preliminary results in case anyone is interested or wants to run it themselves.
# dummy data
set.seed(45)
yr <- 1900:2013
sz <- sample(10:50, length(yr), replace = TRUE)
token <- unlist(sapply(sz, function(x) do.call(paste0, data.frame(matrix(sample(letters, x*4, replace=T), ncol=4)))))
multiple <- rep(sample(500:5000, length(yr), replace=TRUE), sz)
DF <- data.frame(yr = rep(yr, sz),
token = token,
multiple = multiple, stringsAsFactors=FALSE)
# Arun's solution
ARUN.DT <- function(dt) {
setkey(dt, "multiple")
idx <- unique(dt$multiple)
dt[J(rep(idx,idx)), allow.cartesian=TRUE]
}
# Ricardo's solution
RICARDO.DT <- function(dt) {
setkey(dt, key="yr")
newDT <- setkey(dt[, rep(NA, list(rows=length(token) * unique(multiple))), by=yr][, list(yr)], 'yr')
newDT[, tokenReps := as.character(NA)]
# Add the rep'd tokens into newDT, using recycling
newDT[, tokenReps := dt[.(y)][, token], by=list(y=yr)]
newDT
}
# create data.table
require(data.table)
DT <- data.table(DF)
# benchmark both versions
require(rbenchmark)
benchmark(res1 <- ARUN.DT(DT), res2 <- RICARDO.DT(DT), replications=10, order="elapsed")
# test replications elapsed relative user.self sys.self
# 1 res1 <- ARUN.DT(DT) 10 9.542 1.000 7.218 1.394
# 2 res2 <- RICARDO.DT(DT) 10 17.484 1.832 14.270 2.888
But as Ricardo says, it may not matter if you run out of memory, so in that case there has to be a trade-off between speed and memory. What I'd like to verify is the peak memory used in both methods, to say definitively whether using a join is better.
You can try allocating the memory for all the rows first, and then populating them iteratively.
eg:
# make sure `sample_data$multiple` is an integer
sample_data$multiple <- as.integer(sample_data$multiple)
# create data.table
S <- data.table(sample_data, key='yr')
# optionally, drop the original data.frame if not needed
rm(sample_data)
## Allocate the memory first
newDT <- data.table(yr = rep(S$yr, S$multiple), key="yr")
newDT[, tokenReps := as.character(NA)]
# Add the rep'd tokens into newDT, using recycling
newDT[, tokenReps := S[.(y)][, token], by=list(y=yr)]
Two notes:
(1) sample_data$multiple is currently a character, and it is thus getting coerced when passed to rep (in your original example). It might be worth double-checking whether that is also the case in your real data.
(2) I used the following to determine the number of rows needed per year
S[, list(rows=length(token) * unique(multiple)), by=yr]
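With the question's sample_data (after the as.integer conversion above), that should come out as:
#      yr rows
# 1: 1999   28
# 2: 2000   15
# 3: 2001    5
# 4: 2002   18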
