Efficiently compute proportions of one data frame from another - r

I have this data.frame:
set.seed(1)
df <- cbind(matrix(rnorm(26,100),26,100),data.frame(id=LETTERS,parent.id=sample(letters[1:5],26,replace = T),stringsAsFactors = F))
Each row is 100 measurements from a certain subject (designated by id), which is associated with a parent ID (designated by parent.id). The relationship between parent.id and id is one-to-many.
I'm looking for a fast way to get the fraction that each df$id contributes (for each of its 100 measurements) to the total of its parent.id. That is, for each id in df$id I want to divide each of its 100 measurements by the sum of that measurement across all df$id's that share its df$parent.id.
What I'm trying is:
sum.df <- dplyr::select(df,-id) %>% dplyr::group_by(parent.id) %>% dplyr::summarise_all(sum)
fraction.df <- do.call(rbind,lapply(df$id,function(i){
pid <- dplyr::filter(df,id == i)$parent.id
(dplyr::filter(df,id == i) %>% dplyr::select(-id,-parent.id))/
(dplyr::filter(sum.df,parent.id == pid) %>% dplyr::select(-parent.id))
}))
But for the real dimensions of my data: length(df$id) = 10,000 with 1,024 measurements, this is not fast enough.
Any idea how to improve this, ideally using dplyr functions?

Let's compare these options with microbenchmark, all using the revised dataset definition from @Sathish's answer:
OP method:
Units: seconds
min lq mean median uq max neval
1.423583 1.48449 1.602001 1.581978 1.670041 2.275105 100
@Sathish's method speeds it up by a factor of about 4 (median 363 milliseconds vs 1,582 milliseconds). This is valuable, to be sure:
Units: milliseconds
min lq mean median uq max neval
299.3581 334.787 388.5283 363.0363 398.6714 951.4654 100
One possible base R implementation below, using principles of efficient R code, improves things by a factor of about 65 (24 milliseconds, vs 1,582 milliseconds):
Units: milliseconds
min lq mean median uq max neval
21.49046 22.59205 24.97197 23.81264 26.36277 34.72929 100
Here's the base R implementation. As is the case for the OP's implementation, the parent.id and id columns are not included in the resulting structure (here fractions). fractions is a matrix with rows ordered according to sort(interaction(df$id, df$parent.id, drop = TRUE)).
values <- df[1:100]
parents <- split(values, df$parent.id)
# column sums per parent: a 100 x (number of parents) matrix
sums <- vapply(parents, colSums, numeric(100), USE.NAMES = FALSE)
fractions <- matrix(0, 26, 100)
f_count <- 0
for (p_count in seq_along(parents)) {
  parent <- as.matrix(parents[[p_count]])
  dimnames(parent) <- NULL
  n <- nrow(parent)
  for (p_row in seq_len(n)) {
    fractions[f_count + p_row, ] <- parent[p_row, ] / sums[, p_count]
  }
  f_count <- f_count + n
}
Note: there's still room for improvement. split() is not particularly efficient.
Note 2: What "principles of efficient R code" were used?
Get rid of names whenever you can
It's faster to find things in a matrix than a data frame
Don't be afraid of for loops for efficiency, provided you're not growing an object
Prefer vapply to the other apply family functions.
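As a further possibility, the explicit loop can be avoided altogether. The following is only a sketch, not taken from any of the answers here: it assumes the measurements are held in a plain numeric matrix (m) with a separate vector of parent ids (parent_id), both names made up for illustration, and it relies on base R's rowsum() to compute the per-parent column sums. Unlike the loop above, the rows of the result stay in the original order.

```r
set.seed(1)
# Hypothetical stand-in data: 26 subjects x 100 measurements, 5 possible parents
m <- matrix(rnorm(26 * 100, mean = 100), nrow = 26, ncol = 100)
parent_id <- sample(letters[1:5], 26, replace = TRUE)

# One row of column sums per parent; row names are the parent ids
sums <- rowsum(m, parent_id)

# Look up each subject's parent row by name and divide elementwise
fractions <- m / sums[parent_id, , drop = FALSE]
```

Within each parent, every column of fractions should then sum to 1, which is a cheap sanity check.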

The problem with your data is that all 100 columns are duplicates of each other (rnorm(26, 100) draws only 26 values, which matrix() recycles), so I changed it slightly to have different values throughout the dataset.
Data:
set.seed(1L)
df <- cbind(matrix(rnorm(2600), nrow = 26, ncol = 100),data.frame(id=LETTERS,parent.id=sample(letters[1:5],26,replace = T),stringsAsFactors = F))
Code:
library('data.table')
setDT(df) # assign data.table class by reference
# compute sum for each `parent.id` for each column (100 columns)
sum_df <- df[, .SD, .SDcols = which(colnames(df) != 'id' )][, lapply(.SD, sum ), by = .(parent.id ) ]
# get column names for sum_df and df which are sorted for consistency
no_pid_id_df <- gtools::mixedsort( colnames(df)[ ! ( colnames(df) %in% c( 'id', 'parent.id' ) ) ] )
no_pid_sum_df <- gtools::mixedsort( colnames(sum_df)[ colnames(sum_df) != 'parent.id' ] )
# match the `parent.id` for each `id` and then divide its value by the value of `sum_df`.
df[, .( props = {
pid <- parent.id
unlist( .SD[, .SD, .SDcols = no_pid_id_df ] ) /
unlist( sum_df[ parent.id == pid, ][, .SD, .SDcols = no_pid_sum_df ] )
}, parent.id ), by = .(id)]
Output:
# id props parent.id
# 1: A -0.95157186 e
# 2: A 0.06105359 e
# 3: A -0.42267771 e
# 4: A -0.03376174 e
# 5: A -0.16639600 e
# ---
# 2596: Z 2.34696158 e
# 2597: Z 0.23762369 e
# 2598: Z 0.60068440 e
# 2599: Z 0.14192337 e
# 2600: Z 0.01292592 e
Benchmark:
library('microbenchmark')
microbenchmark( sathish(), frank(), dan())
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# sathish() 404.450219 413.456675 433.656279 420.46044 429.876085 593.44202 100 c
# frank() 2.035302 2.304547 2.707019 2.47257 2.622025 18.31409 100 a
# dan() 17.396981 18.230982 19.316653 18.59737 19.700394 27.13146 100 b

Related

Compare Matrices in R efficiently

I have an array a with some matrices in it. Now I need to efficiently check how many different matrices I have and what indices (in ascending order) they have in the array. My approach is the following: paste the columns of the matrices together as character vectors and look at the frequency table, like this:
n <- 10 #observations
a <- array(round(rnorm(2*2*n),1),
c(2,2,n))
paste_a <- apply(a, c(3), paste, collapse=" ") #paste by column
names(paste_a) <- 1:n
freq <- as.numeric( table(paste_a) ) # frequencies of different matrices (in ascending order)
indizes <- as.numeric(names(sort(paste_a[!duplicated(paste_a)])))
nr <- length(freq) #number of different matrices
However, as you increase n to large numbers, this gets very inefficient (it's mainly paste() that's getting slower and slower). Does anyone have a better solution?
Here is a "real" dataset with 100 observations where some matrices are actual duplicates (as opposed to my example above): https://pastebin.com/aLKaSQyF
Thank you very much.
Since your actual data is made up of the integers 0,1,2,3, why not take advantage of base 4? Integers are much faster to compare than entire matrix objects. (All occurrences of a below are of the data found in the real data set from the link.)
Base4Approach <- function() {
toBase4 <- sapply(1:dim(a)[3], function(x) {
v <- as.vector(a[,,x])
pows <- which(v > 0)
coefs <- v[pows]
sum(coefs*(4^pows))
})
myDupes <- which(duplicated(toBase4))
a[,,-(myDupes)]
}
And since the question is about efficiency, let's benchmark:
MartinApproach <- function() {
### commented this out for comparison reasons
# dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
a <- a[,,!duplicated(a, MARGIN = 3)]
nr <- dim(a)[3]
a
}
identical(MartinApproach(), Base4Approach())
[1] TRUE
microbenchmark(Base4Approach(), MartinApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 291.658 303.525 339.2712 325.4475 352.981 636.361 100
MartinApproach() 983.855 1000.958 1160.4955 1071.9545 1187.321 3545.495 100
The approach by @d.b. doesn't really do the same thing as the previous two approaches (it only identifies duplicates and doesn't remove them).
DBApproach <- function() {
a[, , 9] = a[, , 1]
#Convert to list
mylist = lapply(1:dim(a)[3], function(i) a[1:dim(a)[1], 1:dim(a)[2], i])
temp = sapply(mylist, function(x) sapply(mylist, function(y) identical(x, y)))
temp2 = unique(apply(temp, 1, function(x) sort(which(x))))
#The indices in 'a' where the matrices are same
temp2[lengths(temp2) > 1]
}
However, Base4Approach still dominates:
microbenchmark(Base4Approach(), MartinApproach(), DBApproach())
Unit: microseconds
expr min lq mean median uq max neval
Base4Approach() 298.764 324.0555 348.8534 338.899 356.0985 476.475 100
MartinApproach() 1012.601 1087.9450 1204.1150 1110.662 1162.9985 3224.299 100
DBApproach() 9312.902 10339.4075 11616.1644 11438.967 12413.8915 17065.494 100
Update courtesy of @alexis_laz
As mentioned in the comments by @alexis_laz, we can do much better.
AlexisBase4Approach <- function() {
toBase4 <- colSums(a * (4 ^ (0:(prod(dim(a)[1:2]) - 1))), dims = 2)
myDupes <- which(duplicated(toBase4))
a[,,-(myDupes)]
}
microbenchmark(Base4Approach(), MartinApproach(), DBApproach(), AlexisBase4Approach(), unit = "relative")
Unit: relative
expr min lq mean median uq max neval
Base4Approach() 11.67992 10.55563 8.177654 8.537209 7.128652 5.288112 100
MartinApproach() 39.60408 34.60546 27.930725 27.870019 23.836163 22.488989 100
DBApproach() 378.91510 342.85570 262.396843 279.190793 231.647905 108.841199 100
AlexisBase4Approach() 1.00000 1.00000 1.000000 1.000000 1.000000 1.000000 100
## Still gives accurate results
identical(MartinApproach(), AlexisBase4Approach())
[1] TRUE
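To also recover the indices and frequencies the question asks for, the same base-4 keys can be reused. This is a rough sketch on made-up data rather than the linked data set, and it assumes every entry is an integer in 0..3 and that each matrix has at most 26 cells, so that the keys stay exactly representable as doubles. It indexes with !duplicated() instead of a negative index, which also behaves when there are no duplicates at all.

```r
set.seed(42)
# Hypothetical data: twenty 2x2 matrices with entries in 0..3
a <- array(sample(0:3, 2 * 2 * 20, replace = TRUE), c(2, 2, 20))
a[, , 20] <- a[, , 1]                          # force at least one duplicate

# One base-4 weight per cell (column-major order), recycled over the slices
weights <- 4 ^ (0:(prod(dim(a)[1:2]) - 1))
keys <- colSums(a * weights, dims = 2)         # one numeric key per matrix

keep <- !duplicated(keys)
unique_a <- a[, , keep, drop = FALSE]          # the distinct matrices
first_idx <- which(keep)                       # index of first occurrence, ascending
freq <- as.vector(table(match(keys, keys[keep])))  # counts, in the same order
```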
My first attempt was actually really slow. So here is a slightly changed version of yours:
dimnames(a) <- list(1:dim(a)[1], 1:dim(a)[2], 1:dim(a)[3])
a <- a[,,!duplicated(a, MARGIN = 3)]
nr <- dim(a)[3] #number of different matrices
idx <- dimnames(a)[[3]] # indices of left over matrices
I don't know if this is exactly what you want, but here is a way to extract the indices where the matrices are the same. More processing may be necessary to get what you want.
#DATA
n <- 10
a <- array(round(rnorm(2*2*n),1), c(2,2,n))
a[, , 9] = a[, , 1]
temp = unique(apply(X = sapply(1:dim(a)[3], function(i)
sapply(1:dim(a)[3], function(j) identical(a[, , i], a[, , j]))),
MARGIN = 1,
FUN = function(x) sort(which(x))))
temp[lengths(temp) > 1]
#[[1]]
#[1] 1 9

Efficient sparse linear interpolation of row by row data

What is the most efficient way to do linear interpolation when the desired interpolation points are sparse compared to the available data? I have a very long data frame containing many columns, one of which represents a timestamp and the rest are variables, for which I am interested in interpolating at a very small number of timestamps. For example, consider the two variable case:
microbenchmark::microbenchmark(approx(1:2, 1:2, 1.5)$y)
# Unit: microseconds
# expr min lq mean median uq max neval
# ... 39.629 41.3395 46.80514 42.195 52.8865 138.558 100
microbenchmark::microbenchmark(approx(seq_len(1e6), seq_len(1e6), 1.5)$y)
# Unit: milliseconds
# expr min lq mean median uq max neval
# ... 129.5733 231.0047 229.3459 236.3845 247.3096 369.4621 100
We see that although only one interpolated value (at t = 1.5) is desired, increasing the number of (x, y) pairs can change the running time by a few orders of magnitude.
Another example, this time with a data table.
library(data.table)
tmp_dt <- data.table(time = seq_len(1e7), a = seq_len(1e7), b = seq_len(1e7), c = seq_len(1e7))
Running tmp_dt[, lapply(.SD, function(col) {approx(time, col, 1.5)$y}), .SDcols = c("a", "b", "c")] produces a one row data table but it takes a while.
I am thinking there must be some efficiency to be gained by removing all rows in the data table that are not necessary for interpolation.
If your linear interpolation is weighted.mean(c(x0, x1), c(t1-t, t-t0)), where (t0, x0) is the nearest point below and (t1, x1) the nearest above...
# fix bad format
tmp_dt[, names(tmp_dt) := lapply(.SD, as.numeric)]
# enumerate target times
tDT = data.table(t = seq(1.5, 100.5, by=.5))
# handle perfect matches
tDT[, a := tmp_dt[.SD, on=.(time = t), x.a]]
# handle interpolation
tDT[is.na(a), a := {
w = findInterval(t, tmp_dt$time)
cbind(tmp_dt[w, .(t0 = time, a0 = a)], tmp_dt[w+1L, .(t1 = time, a1 = a)])[,
(a0*(t1-t) + a1*(t-t0))/(t1-t0)]
}]
The extension to more columns is a little messy, but can be shoehorned in here.
Some sort of rolling, like w = tmp_dt[t, on=.(time), roll=TRUE, which=TRUE], might be faster than findInterval, but I haven't looked into it.
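The idea from the question of dropping every row that is not needed for interpolation can also be sketched in base R. This is only an illustration on made-up vectors (x, y and xout are hypothetical names), and it assumes x is sorted in increasing order: findInterval() locates the left neighbour of each target, and approx() is then run on just the bracketing points.

```r
# Hypothetical data: a long sorted series and a few target points
x <- as.numeric(seq_len(1e6))
y <- 2 * x
xout <- c(1.5, 250000.25, 999999.5)

# Index of the left neighbour of each target; all.inside keeps lo + 1 in range
lo <- findInterval(xout, x, all.inside = TRUE)

# Keep only the rows that bracket a target, then interpolate on that subset
idx <- sort(unique(c(lo, lo + 1L)))
sparse <- approx(x[idx], y[idx], xout)$y
```

Because the two neighbours of each target are adjacent in the subset as well, the result should match approx() on the full vectors while touching only a handful of points.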

Is there a way to speed up subsetting of smaller data.frames

I have to subset a sequence of data.frames frequently (millions of times each run). The data.frames are of approximate size 200 rows x 30 columns. Depending on the state, the values in the data.frame change from one iteration to the next. Thus, doing one subset in the beginning is not working.
In contrast to the question of when a data.table starts to be faster than a data.frame, I am looking for a way to speed up subsetting at a given size of the data.frame/data.table.
The following minimal reproducible example shows that data.frame seems to be the fastest:
library(data.table)
nmax <- 1e2 # for 1e7 the results look as expected: data.table is really fast!
set.seed(1)
x<-runif(nmax,min=0,max=10)
y<-runif(nmax,min=0,max=10)
DF<-data.frame(x,y)
DT<-data.table(x,y)
summary(microbenchmark::microbenchmark(
setkey(DT,x,y),
times = 10L, unit = "us"))
# expr min lq mean median uq max neval
# 1 setkey(DT, x, y) 70.326 72.606 105.032 80.3985 126.586 212.877 10
summary(microbenchmark::microbenchmark(
DF[DF$x>5, ],
`[.data.frame`(DT,DT$x < 5,),
DT[x>5],
times = 100L, unit = "us"))
# expr min lq mean median uq max neval
# 1 DF[DF$x > 5, ] 41.815 45.426 52.40197 49.9885 57.4010 82.110 100
# 2 `[.data.frame`(DT, DT$x < 5, ) 43.716 47.707 58.06979 53.5995 61.2020 147.873 100
# 3 DT[x > 5] 205.273 214.777 233.09221 222.0000 231.6935 900.164 100
Is there anything I can do to improve performance?
Edit after input:
I am running a discrete event simulation and for each event I have to search in a list (I don't mind whether it is a data.frame or data.table). Most likely, I could implement a different approach, but then I have to re-write the code which was developed over more than 3 years. At the moment, this is not an option. But if there is no way to get it faster this might become an option in the future.
Technically, it is not a sequence of data.frames but just one data.frame, which changes with each iteration. However, this has no impact on "how to get the subset faster" and I hope that the question is now more comprehensive.
You will see a performance boost by converting to matrices. This is a viable alternative if the whole content of your data.frame is numerical (or can be converted without too much trouble).
Here we go. First I modified the data to have it with size 200x30:
library(data.table)
nmax = 200
cmax = 30
set.seed(1)
x<-runif(nmax,min=0,max=10)
DF = data.frame(x)
for (i in 2:cmax) {
DF = cbind(DF, runif(nmax,min=0,max=10))
colnames(DF)[ncol(DF)] = paste0('x',i)
}
DT = data.table(DF)
DM = as.matrix(DF) # # # or data.matrix(DF) if you have factors
And the comparison, ranked from quickest to slowest:
summary(microbenchmark::microbenchmark(
DM[DM[, 'x']>5, ], # # # # Quickest
as.matrix(DF)[DF$x>5, ], # # # # Still quicker with conversion
DF[DF$x>5, ],
`[.data.frame`(DT,DT$x < 5,),
DT[x>5],
times = 100L, unit = "us"))
# expr min lq mean median uq max neval
# 1 DM[DM[, "x"] > 5, ] 13.883 19.8700 22.65164 22.4600 24.9100 41.107 100
# 2 as.matrix(DF)[DF$x > 5, ] 141.100 181.9140 196.02329 195.7040 210.2795 304.989 100
# 3 DF[DF$x > 5, ] 198.846 238.8085 260.07793 255.6265 278.4080 377.982 100
# 4 `[.data.frame`(DT, DT$x < 5, ) 212.342 268.2945 346.87836 289.5885 304.2525 5894.712 100
# 5 DT[x > 5] 322.695 396.3675 465.19192 428.6370 457.9100 4186.487 100
If your use case involves querying the data multiple times, then you can do the conversion only once and gain about one order of magnitude in speed.

How to identify indexes of elements appearing only once in a vector in R?

I would like to have logical vector which identifies positions of elements only appearing once in a column of a data frame.
As far as I understand, the base R functions unique() and duplicated() cannot help directly, as they return either the de-duplicated list of values or the positions of duplicates.
Using table() may identify the values occurring once, but not their positions, which I need for further data manipulation. Any suggestions? Thanks a lot.
Let x be your vector, for example :
set.seed(1)
x <- sample(1:10, 10 ,replace = T)
You can do it in two steps :
var.names <- names(table(x))[table(x) == 1]
match(var.names, x)
To get var.names you can also do:
names(which(table(x) == 1))
There are many answers here. I thought I'd compare their computation time
x <- rbinom(500, 1000, 0.5)
microbenchmark::microbenchmark(
x[which(!(duplicated(x)|duplicated(x, fromLast=TRUE)))],
x[ave(x, x, FUN = length) == 1],
setdiff(unique(x),x[duplicated(x)]),
names(which(table(x) == 1))
)
The output is
Unit: microseconds
expr min lq mean median uq max neval cld
x[which(!(duplicated(x) | duplicated(x, fromLast = TRUE)))] 22.517 26.2880 28.75954 29.460 31.036 41.033 100 a
x[ave(x, x, FUN = length) == 1] 247.923 256.4725 265.80232 262.290 266.321 461.379 100 d
setdiff(unique(x), x[duplicated(x)]) 38.706 41.1915 45.58309 46.278 48.546 71.819 100 b
names(which(table(x) == 1)) 194.656 204.4935 213.87719 213.388 219.536 290.785 100 c
So the winner is x[which(!(duplicated(x)|duplicated(x, fromLast=TRUE)))]
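Since the question asks for a logical vector of positions rather than the values themselves, the same duplicated() idea can be used without the final subsetting step. A small sketch on a made-up vector:

```r
x <- c(3, 1, 3, 2, 5, 1)

# TRUE where the value occurs exactly once in x
once <- !(duplicated(x) | duplicated(x, fromLast = TRUE))
once         # FALSE FALSE FALSE TRUE TRUE FALSE
which(once)  # 4 5
```

The logical vector can be used directly to subset the rows of a data frame, for example dat[once, ] for a hypothetical data frame dat whose column is x.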

What is the right way to multiply data frame by vector?

I'm trying to multiply a data frame df by a vector v, so that the product is a data frame, where the i-th row is given by df[i,]*v. I can do this, for example, by
df <- data.frame(A=1:5, B=2:6); v <- c(0,2)
as.data.frame(t(t(df) * v))
A B
1 0 4
2 0 6
3 0 8
4 0 10
5 0 12
I am sure there has to be a more R-style approach (and a very simple one!), but nothing comes on my mind. I even tried something like
apply(df, MARGIN=1, function(x) x*v)
but still, non-readable constructions like as.data.frame(t(.)) are required.
How can I find an efficient and elegant workaround here?
This works too:
data.frame(mapply(`*`,df,v))
In that solution, you are taking advantage of the fact that data.frame is a type of list, so you can iterate over both the elements of df and v at the same time with mapply.
Unfortunately, you are limited in what mapply can return: a simple list or a matrix. If your data are huge, this would likely be more efficient:
data.frame(mapply(`*`,df,v,SIMPLIFY=FALSE))
This is because mapply then returns a list, which is cheaper to convert to a data.frame than a matrix is.
If you're looking for speed and memory efficiency - data.table to the rescue:
library(data.table)
dt = data.table(df)
for (i in seq_along(dt))
dt[, (i) := dt[[i]] * v[i]]
eddi = function(dt) { for (i in seq_along(dt)) dt[, (i) := dt[[i]] * v[i]] }
arun = function(df) { df * matrix(v, ncol=ncol(df), nrow=nrow(df), byrow=TRUE) }
nograpes = function(df) { data.frame(mapply(`*`,df,v,SIMPLIFY=FALSE)) }
N = 1e6
dt = data.table(A = rnorm(N), B = rnorm(N))
v = c(0,2)
microbenchmark(eddi(copy(dt)), arun(copy(dt)), nograpes(copy(dt)), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# eddi(copy(dt)) 23.01106 24.31192 26.47132 24.50675 28.87794 34.28403 10
# arun(copy(dt)) 337.79885 363.72081 450.93933 433.21176 516.56839 644.70103 10
# nograpes(copy(dt)) 19.44873 24.30791 36.53445 26.00760 38.09078 95.41124 10
As Arun points out in the comments, one can also use the set function from the data.table package to do this in-place modification on data.frames as well:
for (i in seq_along(df))
set(df, j = i, value = df[[i]] * v[i])
This of course also works for data.tables and could be significantly faster if the number of columns is large.
A language that lets you combine vectors with matrices has to make a decision at some point whether the matrices are row-major or column-major ordered. The reason:
> df * v
A B
1 0 4
2 4 0
3 0 8
4 8 0
5 0 12
is because R operates down the columns first. Doing the double-transpose trick subverts this. Sorry if this is just explaining what you know, but I don't know another way of doing it, except explicitly expanding v into a matrix of the same size.
Or write a nice function that wraps the not very R-style code into something that is R-stylish.
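One existing base R wrapper of that kind is sweep(), which expands v along the chosen margin before applying the operator. A minimal sketch using the data from the question (no claim is made here about how its speed compares with the benchmarks in the other answers):

```r
df <- data.frame(A = 1:5, B = 2:6)
v <- c(0, 2)

# MARGIN = 2 pairs each element of v with one column of df
res <- sweep(df, MARGIN = 2, STATS = v, FUN = "*")
res
```

Here column A is multiplied by 0 and column B by 2, which matches the double-transpose result shown in the question.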
What's wrong with
t(apply(df, 1, function(x)x*v))
?
library(purrr)
map2_dfc(df, v, `*`)
Benchmark
N = 1e6
dt = data.table(A = rnorm(N), B = rnorm(N))
v = c(0,2)
eddi = function(dt) { for (i in seq_along(dt)) dt[, (i) := dt[[i]] * v[i]]; dt }
arun = function(df) { df * matrix(v, ncol=ncol(df), nrow=nrow(df), byrow=TRUE) }
nograpes = function(df) { data.frame(mapply(`*`,df,v,SIMPLIFY=FALSE)) }
ryan = function(df) {map2_dfc(df, v, `*`) }
library(microbenchmark)
microbenchmark(
eddi(copy(dt))
, arun(copy(dt))
, nograpes(copy(dt))
, ryan(copy(dt))
, times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval
# eddi(copy(dt)) 8.367513 11.06719 24.26205 12.29132 19.35958 171.6212 100
# arun(copy(dt)) 94.031272 123.79999 186.42155 148.87042 251.56241 364.2193 100
# nograpes(copy(dt)) 7.910739 10.92815 27.68485 13.06058 21.39931 172.0798 100
# ryan(copy(dt)) 8.154395 11.02683 29.40024 13.73845 21.77236 181.0375 100
I think the fastest way (without testing data.table) is data.frame(t(t(df)*v)).
My tests:
testit <- function(nrow, ncol)
{
df <- as.data.frame(matrix(rnorm(nrow*ncol),nrow=nrow,ncol=ncol))
v <- runif(ncol)
r1 <- data.frame(t(t(df)*v))
r2 <- data.frame(mapply(`*`,df,v,SIMPLIFY=FALSE))
r3 <- df * rep(v, each=nrow(df))
stopifnot(identical(r1, r2) && identical(r1, r3))
microbenchmark(data.frame(t(t(df)*v)), data.frame(mapply(`*`,df,v,SIMPLIFY=FALSE)), df * rep(v, each=nrow(df)))
}
Result
> set.seed(1)
>
> testit(100,100)
Unit: milliseconds
expr min lq median uq max neval
data.frame(t(t(df) * v)) 2.297075 2.359541 2.455778 3.804836 33.05806 100
data.frame(mapply(`*`, df, v, SIMPLIFY = FALSE)) 9.977436 10.401576 10.658964 11.762009 15.09721 100
df * rep(v, each = nrow(df)) 14.309822 14.956705 16.092469 16.516609 45.13450 100
> testit(1000,10)
Unit: microseconds
expr min lq median uq max neval
data.frame(t(t(df) * v)) 754.844 805.062 844.431 1850.363 27955.79 100
data.frame(mapply(`*`, df, v, SIMPLIFY = FALSE)) 1457.895 1497.088 1567.604 2550.090 4732.03 100
df * rep(v, each = nrow(df)) 5383.288 5527.817 5875.143 6628.586 32392.81 100
> testit(10,1000)
Unit: milliseconds
expr min lq median uq max neval
data.frame(t(t(df) * v)) 17.07548 18.29418 19.91498 20.67944 57.62913 100
data.frame(mapply(`*`, df, v, SIMPLIFY = FALSE)) 99.90103 104.36028 108.28147 114.82012 150.05907 100
df * rep(v, each = nrow(df)) 112.21719 118.74359 122.51308 128.82863 164.57431 100
