Faster index finding of flanking non-NA values - r

This is a speed optimization question.
Here is my sample data. The real data has over 100k rows and >300 columns.
library(data.table)
dt <- data.table(ref=1:20, tgt1=11:30, tgt2=21:40)
dt[c(3,8,9,15,16,17), "tgt1"] = NA
dt[c(4,5,15,17), "tgt2"] = NA
dt
#> ref tgt1 tgt2
#> 1: 1 11 21
#> 2: 2 12 22
#> 3: 3 NA 23
#> 4: 4 14 NA
#> 5: 5 15 NA
#> 6: 6 16 26
#> 7: 7 17 27
#> 8: 8 NA 28
#> 9: 9 NA 29
#> 10: 10 20 30
#> 11: 11 21 31
#> 12: 12 22 32
#> 13: 13 23 33
#> 14: 14 24 34
#> 15: 15 NA NA
#> 16: 16 NA 36
#> 17: 17 NA NA
#> 18: 18 28 38
#> 19: 19 29 39
#> 20: 20 30 40
Some columns have NA at some positions, and my goal is to get the positions of the nearest non-NA flanking values. For instance, for the second column tgt1, I am using the following code
tgt = dt[, tgt1]
tgt.na = which(is.na(tgt))
tgt.non.na = which(!is.na(tgt))
start = sapply(tgt.na, function(x) max(tgt.non.na[tgt.non.na < x]))
stop = sapply(tgt.na, function(x) min(tgt.non.na[tgt.non.na > x]))
data.frame(start, stop)
#> start stop
#> 1 2 4
#> 2 7 10
#> 3 7 10
#> 4 14 18
#> 5 14 18
#> 6 14 18
For the tgt1 column this gives me what I want: for example, for the NA in the 3rd row, the closest flanking non-NA values are at positions 2 and 4, and so on for the others. My issue is that the sapply calls are very slow. Imagine running this over >300 columns and 100k rows; in its current form it takes a few hours to finish. Ultimately, once these positions are found, they are used to index values from the ref column to compute the missing values in tgt1 and the other columns, but that is a topic for another time.
Is there any way I can make it faster? Is there a data.table solution for it?
Edit: All great solutions. Here is my benchmark; you can see that all the proposed methods work lightning fast compared to my original sapply. I selected the lapply approach, not only because it is the fastest but also because it aligns well with my current code syntax.
Unit: milliseconds
           expr         min          lq        mean      median          uq         max neval
         sapply 3755.118949 3787.288609 3850.322669 3819.458269 3897.924530 3976.390790     3
 dt.thelatemail    9.145551    9.920238   10.242885   10.694925   10.791552   10.888180     3
  lapply.andrew    2.626525    3.038480    3.446682    3.450434    3.856760    4.263086     3
   zoo.chinsoon    6.457849    6.578099    6.629839    6.698349    6.715834    6.733318     3
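For anyone curious about the downstream step mentioned above, here is a rough sketch (an illustration only, not necessarily the exact computation I use) of how the flanking positions can drive a linear interpolation of the missing values against ref:
# assumes tgt, tgt.na, start and stop from the code above (tgt1 column)
filled <- dt$tgt1
filled[tgt.na] <- dt$tgt1[start] +
  (dt$tgt1[stop] - dt$tgt1[start]) *
  (dt$ref[tgt.na] - dt$ref[start]) / (dt$ref[stop] - dt$ref[start])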

Here is a base R alternative using rle. I used lapply because I was not sure how you wanted to save all the output dataframes. Hope this helps!
dt <- data.table(ref=1:20, tgt1=11:30, tgt2=21:40)
dt[c(3,8,9,15,16,17), "tgt1"] = NA
dt[c(4,5,15,17), "tgt2"] = NA
lapply(dt[,-1], function(x) {
  na_loc <- which(is.na(x))
  rle_x <- rle(is.na(x))
  reps <- rle_x$lengths[rle_x$values == T]
  start <- na_loc - 1
  start <- start[!start %in% na_loc]
  end <- na_loc + 1
  end <- end[!end %in% na_loc]
  data.frame(start = rep(start, reps),
             end = rep(end, reps))
})
$tgt1
start end
1: 2 4
2: 7 10
3: 7 10
4: 14 18
5: 14 18
6: 14 18
$tgt2
start end
1: 3 6
2: 3 6
3: 14 16
4: 16 18
It also scales fairly well on my laptop for a sample dataframe w/ 300 columns:
df1 <- data.frame(ref = 1:1e5)
df1[paste0("tgt", 1:300)] <- replicate(300, sample(c(1:50, rep(NA, 5)), 1e5, replace = T))
microbenchmark::microbenchmark(
  base = {
    lapply(df1[,-1], function(x) {
      na_loc <- which(is.na(x))
      rle <- rle(is.na(x))
      reps <- rle$lengths[rle$values == T]
      start <- na_loc - 1
      start <- start[!start %in% na_loc]
      end <- na_loc + 1
      end <- end[!end %in% na_loc]
      data.frame(start = rep(start, reps),
                 end = rep(end, reps))
    })
  },
  times = 5
)
Unit: seconds
expr min lq mean median uq max neval
base 1.863319 1.888617 1.897651 1.892166 1.898196 1.945954 5

You should be able to take advantage of rleid to calculate the prior value to a run of NAs and then match it up. E.g.:
dt[, a := rleid(is.na(tgt1))]
dt[, rev(ref)[match((a - 1)[is.na(tgt1)], rev(a))] ]
#[1] 2 7 7 14 14 14
dt[, ref[match((a + 1)[is.na(tgt1)], a)] ]
#[1] 4 10 10 18 18 18
Seems pretty quick to process 100k rows:
dt <- dt[rep(1:20,5e3),]
dt[, ref := 1:1e5]
system.time({
dt[, a := rleid(is.na(tgt1))]
dt[, rev(ref)[match((a-1)[is.na(tgt1)],rev(a))]]
dt[, ref[match((a+1)[is.na(tgt1)],a)]]
})
# user system elapsed
# 0.02 0.00 0.02
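If you need the flanking positions for every target column, one possible way (a sketch; the tgt* column-name pattern is assumed from the question) is to wrap the same rleid/match logic in an lapply:
flanks <- lapply(grep("^tgt", names(dt), value = TRUE), function(j) {
  k <- is.na(dt[[j]])   # NA positions in this column
  a <- rleid(k)         # run ids of NA / non-NA stretches
  data.table(start = rev(dt$ref)[match((a - 1)[k], rev(a))],
             end   = dt$ref[match((a + 1)[k], a)])
})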

Another possibility using zoo package:
library(zoo)
for (j in paste0("tgt", 1L:2L)) {
  print(dt[, {
    k <- is.na(get(j))
    x <- replace(ref, k, NA_integer_)
    .(start=na.locf0(x)[k],
      end=na.locf0(x, fromLast=TRUE)[k])
  }])
}
output:
start end
1: 2 4
2: 7 10
3: 7 10
4: 14 18
5: 14 18
6: 14 18
start end
1: 3 6
2: 3 6
3: 14 16
4: 16 18
timing code:
library(data.table)
library(zoo)
sz <- 100e3
nc <- 400
dt <- data.table(ref=1L:sz,
                 as.data.table(matrix(sample(c(NA_integer_, 1L), sz*nc, replace=TRUE), ncol=nc)))
library(microbenchmark)
microbenchmark(
  mtd0=for (j in paste0("V", 1L:nc)) {
    k <- dt[, is.na(get(j))]
    dt[, a := rleid(k)][,
      .(start=rev(ref)[match((a-1)[k],rev(a))], end=ref[match((a+1)[k],a)])]
  },
  mtd1=for (j in paste0("V", 1L:nc)) {
    dt[, {
      k <- is.na(get(j))
      x <- replace(ref, k, NA_integer_)
      .(start=na.locf0(x)[k], end=na.locf0(x, fromLast=TRUE)[k])
    }]
  },
  times=3L)
timings:
Unit: seconds
expr min lq mean median uq max neval cld
mtd0 6.638253 6.698023 6.730352 6.757794 6.776402 6.795010 3 b
mtd1 4.832264 4.835764 4.854799 4.839264 4.866066 4.892867 3 a
Not much diff in timings given the number of rows.

Related

Replacing Missing Value in R

I have to replace each missing Value with the maximum Value for that ID. How can I do this in R?
ID Value
1 NA
5 15
8 16
6 8
7 65
8 NA
5 25
1 62
6 14
7 NA
9 11
8 12
9 36
1 26
4 13
I would first precompute the max values using a call to aggregate(), and also precompute which rows of the data.frame have an NA value. Then you can match the IDs into the aggregation table to extract the corresponding max value.
maxes <- aggregate(Value~ID,df,max,na.rm=T);
nas <- which(is.na(df$Value));
df$Value[nas] <- maxes$Value[match(df$ID[nas],maxes$ID)];
df;
## ID Value
## 1 1 62
## 2 5 15
## 3 8 16
## 4 6 8
## 5 7 65
## 6 8 16
## 7 5 25
## 8 1 62
## 9 6 14
## 10 7 65
## 11 9 11
## 12 8 12
## 13 9 36
## 14 1 26
## 15 4 13
Alternative, using ave():
df$Value <- ave(df$Value,df$ID,FUN=function(x) { x[is.na(x)] <- max(x,na.rm=T); x; });
df;
## ID Value
## 1 1 62
## 2 5 15
## 3 8 16
## 4 6 8
## 5 7 65
## 6 8 16
## 7 5 25
## 8 1 62
## 9 6 14
## 10 7 65
## 11 9 11
## 12 8 12
## 13 9 36
## 14 1 26
## 15 4 13
Data
df <- data.frame(ID=c(1L,5L,8L,6L,7L,8L,5L,1L,6L,7L,9L,8L,9L,1L,4L),Value=c(NA,15L,16L,8L,
65L,NA,25L,62L,14L,NA,11L,12L,36L,26L,13L));
Benchmarking
Notes:
I had to modify bgoldst2(), rafa(), and akrun() to guard against the case of zero non-NAs in a group; otherwise, max(...,na.rm=T) returns -Inf, which can mess up subsequent operations (see the small illustration after these notes). I used the same algorithm for all three guards. thierry() and bgoldst1() did not have to be modified.
The large scale loop was fairly tricky to write, and I'm not going to attempt to explain all the details; feel free to ask follow-up questions if interested. Basically I ran 12 different benchmarks which varied by the number of groups and the frequency of NAs. The resulting table res shows the two parameters, the mean run-times for all solutions, and the unit chosen by the microbenchmark summarization algorithm.
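For concreteness, this is the failure mode the guards avoid (a quick illustration):
max(c(NA,NA),na.rm=T);
## [1] -Inf
## (with a warning: "no non-missing arguments to max; returning -Inf")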
library(microbenchmark);
library(dplyr);
library(data.table);
library(zoo);
thierry <- function(df) df %>% group_by(ID) %>% mutate(Value = ifelse(is.na(Value), max(Value, na.rm = TRUE), Value));
bgoldst1 <- function(df) { maxes <- aggregate(Value~ID,df,max,na.rm=T); nas <- which(is.na(df$Value)); df$Value[nas] <- maxes$Value[match(df$ID[nas],maxes$ID)]; df; };
bgoldst2 <- function(df) { df$Value <- ave(df$Value,df$ID,FUN=function(x) { nas <- is.na(x); if (any(!nas) && any(nas)) x[nas] <- max(x,na.rm=T); x; }); df; };
rafa <- function(dt) dt[ , Value := { nas <- is.na(Value); if (any(!nas) && any(nas)) ifelse( nas, max(Value, na.rm=T), Value) else Value; }, by = ID];
akrun <- function(dt) dt[, Value := { nas <- is.na(Value); if (any(!nas) && any(nas)) na.aggregate(Value, FUN = max) else Value; }, ID];
## small scale (OP's sample input)
df <- data.frame(ID=c(1L,5L,8L,6L,7L,8L,5L,1L,6L,7L,9L,8L,9L,1L,4L),Value=c(NA,15L,16L,8L,65L,NA,25L,62L,14L,NA,11L,12L,36L,26L,13L));
dt <- as.data.table(df);
ex <- as.data.frame(thierry(copy(df)));
identical(ex,bgoldst1(copy(df)));
identical(ex,bgoldst2(copy(df)));
identical(ex,as.data.frame(rafa(copy(dt))));
identical(ex,as.data.frame(akrun(copy(dt))));
microbenchmark(thierry(copy(df)),bgoldst1(copy(df)),bgoldst2(copy(df)),rafa(copy(dt)),akrun(copy(dt)));
## Unit: microseconds
## expr min lq mean median uq max neval
## thierry(copy(df)) 955.804 989.1610 1043.2847 1004.984 1044.542 2852.016 100
## bgoldst1(copy(df)) 953.238 1005.1985 1069.6281 1039.410 1075.760 2968.337 100
## bgoldst2(copy(df)) 160.798 181.9665 196.0281 192.872 207.412 246.329 100
## rafa(copy(dt)) 947.679 1006.6945 1056.9396 1033.637 1055.874 2943.105 100
## akrun(copy(dt)) 1327.862 1384.5255 1496.1259 1415.530 1445.894 3969.899 100
## large scale, 3 group sizes crossed with 4 NA densities
NV <- 1e5L;
NIs <- c(10L,1e3L,3e4L);
probNAs <- c(1e-3,0.05,0.4,0.95);
res <- expand.grid(NI=NIs,probNA=probNAs);
system.time({
  for (ri in seq_len(nrow(res))) {
    NI <- res$NI[ri];
    probNA <- res$probNA[ri];
    df <- data.frame(ID=sample(seq_len(NI),NV,T),Value=sample(c(NA,1:99),NV,T,c(probNA,rep((1-probNA)/99,99L))));
    dt <- as.data.table(df);
    ex <- as.data.frame(thierry(copy(df)));
    if (!all(c(
      identical(ex,bgoldst1(copy(df))),
      identical(ex,bgoldst2(copy(df))),
      identical(ex,as.data.frame(rafa(copy(dt)))),
      identical(ex,as.data.frame(akrun(copy(dt))))
    ))) stop('non-identical failure.');
    bm <- summary(microbenchmark(thierry(copy(df)),bgoldst1(copy(df)),bgoldst2(copy(df)),rafa(copy(dt)),akrun(copy(dt)),times=5L));
    nms <- sub('\\(.*','',as.character(bm$expr));
    for (nm in nms) if (!nm%in%names(res)) res[[nm]] <- NA_real_;
    if (!'unit'%in%names(res)) res$unit <- NA_character_;
    res[ri,nms] <- bm$mean;
    res$unit[ri] <- attr(bm,'unit');
  }; ## end for
});
## user system elapsed
## 73.18 0.00 73.37
res;
## NI probNA thierry bgoldst1 bgoldst2 rafa akrun unit
## 1 10 0.001 7.850589 138.77128 14.867427 7.071150 8.023874 milliseconds
## 2 1000 0.001 40.318311 177.26223 9.868853 6.389129 18.054122 milliseconds
## 3 30000 0.001 813.204627 619.16166 125.274735 57.301590 74.732023 milliseconds
## 4 10 0.050 9.387743 139.41686 15.032158 8.479837 6.933616 milliseconds
## 5 1000 0.050 43.223697 156.79871 23.377797 20.550586 145.632279 milliseconds
## 6 30000 0.050 822.338773 677.81813 129.268155 114.585475 656.468438 milliseconds
## 7 10 0.400 15.955374 110.20717 9.785802 11.832889 10.511871 milliseconds
## 8 1000 0.400 55.858348 115.93900 14.441228 22.525058 142.740834 milliseconds
## 9 30000 0.400 853.571520 521.19690 147.925864 208.278328 2518.672465 milliseconds
## 10 10 0.950 9.768268 43.98346 5.921021 9.895623 8.571868 milliseconds
## 11 1000 0.950 49.228024 63.72596 13.702929 22.152230 143.606916 milliseconds
## 12 30000 0.950 822.033257 103.91700 113.398739 86.240922 630.982913 milliseconds
library(dplyr)
dataset %>%
  group_by(ID) %>%
  mutate(
    Value = ifelse(
      is.na(Value),
      max(Value, na.rm = TRUE),
      Value
    )
  )
A simple and fast solution using data.table. Thanks @bgoldst for the tip about including na.rm=T.
library(data.table)
setDT(df)[ , Value := ifelse( is.na(Value), max(Value, na.rm=T), Value), by = ID]
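A variant of the same idea that avoids the ifelse() call (just a sketch; it gives the same result on this data):
setDT(df)[, Value := replace(Value, is.na(Value), max(Value, na.rm=T)), by = ID]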
We can use na.aggregate with data.table
library(data.table)
library(zoo)
setDT(df)[, Value := na.aggregate(Value, FUN = max) , by = ID]
df
# ID Value
# 1: 1 62
# 2: 5 15
# 3: 8 16
# 4: 6 8
# 5: 7 65
# 6: 8 16
# 7: 5 25
# 8: 1 62
# 9: 6 14
#10: 7 65
#11: 9 11
#12: 8 12
#13: 9 36
#14: 1 26
#15: 4 13

How can I add a row manually to a dataframe? [duplicate]

The following code combines a vector with a dataframe:
newrow = c(1:4)
existingDF = rbind(existingDF,newrow)
However this code always inserts the new row at the end of the dataframe.
How can I insert the row at a specified point within the dataframe? For example, let's say the dataframe has 20 rows; how can I insert the new row between rows 10 and 11?
Here's a solution that avoids the (often slow) rbind call:
existingDF <- as.data.frame(matrix(seq(20),nrow=5,ncol=4))
r <- 3
newrow <- seq(4)
insertRow <- function(existingDF, newrow, r) {
  existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
  existingDF[r,] <- newrow
  existingDF
}
> insertRow(existingDF, newrow, r)
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 1 2 3 4
4 3 8 13 18
5 4 9 14 19
6 5 10 15 20
If speed is less important than clarity, then @Simon's solution works well:
existingDF <- rbind(existingDF[1:r,],newrow,existingDF[-(1:r),])
> existingDF
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 1 2 3 4
41 4 9 14 19
5 5 10 15 20
(Note we index r differently).
And finally, benchmarks:
library(microbenchmark)
microbenchmark(
rbind(existingDF[1:r,],newrow,existingDF[-(1:r),]),
insertRow(existingDF,newrow,r)
)
Unit: microseconds
expr min lq median uq max
1 insertRow(existingDF, newrow, r) 660.131 678.3675 695.5515 725.2775 928.299
2 rbind(existingDF[1:r, ], newrow, existingDF[-(1:r), ]) 801.161 831.7730 854.6320 881.6560 10641.417
Benchmarks
As @MatthewDowle always points out to me, benchmarks need to be examined for how they scale as the size of the problem increases. Here we go then:
benchmarkInsertionSolutions <- function(nrow=5,ncol=4) {
  existingDF <- as.data.frame(matrix(seq(nrow*ncol),nrow=nrow,ncol=ncol))
  r <- 3 # Row to insert into
  newrow <- seq(ncol)
  m <- microbenchmark(
    rbind(existingDF[1:r,],newrow,existingDF[-(1:r),]),
    insertRow(existingDF,newrow,r),
    insertRow2(existingDF,newrow,r)
  )
  # Now return the median times
  mediansBy <- by(m$time,m$expr, FUN=median)
  res <- as.numeric(mediansBy)
  names(res) <- names(mediansBy)
  res
}
nrows <- 5*10^(0:5)
benchmarks <- sapply(nrows,benchmarkInsertionSolutions)
colnames(benchmarks) <- as.character(nrows)
library(reshape2)  # for melt()
library(ggplot2)
ggplot( melt(benchmarks), aes(x=Var2,y=value,colour=Var1) ) + geom_line() + scale_x_log10() + scale_y_log10()
@Roland's solution scales quite well, even with the call to rbind:
5 50 500 5000 50000 5e+05
insertRow2(existingDF, newrow, r) 549861.5 579579.0 789452 2512926 46994560 414790214
insertRow(existingDF, newrow, r) 895401.0 905318.5 1168201 2603926 39765358 392904851
rbind(existingDF[1:r, ], newrow, existingDF[-(1:r), ]) 787218.0 814979.0 1263886 5591880 63351247 829650894
Plotted on a linear scale and on a log-log scale (plots not reproduced here).
insertRow2 <- function(existingDF, newrow, r) {
  existingDF <- rbind(existingDF,newrow)
  # the appended row is given fractional position r - 0.5, so ordering moves it up to row r
  existingDF <- existingDF[order(c(1:(nrow(existingDF)-1),r-0.5)),]
  row.names(existingDF) <- 1:nrow(existingDF)
  return(existingDF)
}
insertRow2(existingDF,newrow,r)
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 1 2 3 4
4 3 8 13 18
5 4 9 14 19
6 5 10 15 20
microbenchmark(
  rbind(existingDF[1:r,],newrow,existingDF[-(1:r),]),
  insertRow(existingDF,newrow,r),
  insertRow2(existingDF,newrow,r)
)
Unit: microseconds
expr min lq median uq max
1 insertRow(existingDF, newrow, r) 513.157 525.6730 531.8715 544.4575 1409.553
2 insertRow2(existingDF, newrow, r) 430.664 443.9010 450.0570 461.3415 499.988
3 rbind(existingDF[1:r, ], newrow, existingDF[-(1:r), ]) 606.822 625.2485 633.3710 653.1500 1489.216
The .before argument in dplyr::add_row can be used to specify the row.
dplyr::add_row(
  cars,
  speed = 0,
  dist = 0,
  .before = 3
)
#> speed dist
#> 1 4 2
#> 2 4 10
#> 3 0 0
#> 4 7 4
#> 5 7 22
#> 6 8 16
#> ...
You should try the dplyr package:
library(dplyr)
a <- data.frame(A = c(1, 2, 3, 4),
                B = c(11, 12, 13, 14))
system.time({
  for (i in 50:1000) {
    b <- data.frame(A = i, B = i * i)
    a <- bind_rows(a, b)
  }
})
Output
user system elapsed
0.25 0.00 0.25
In contrast, using the rbind function:
a <- data.frame(A = c(1, 2, 3, 4),
                B = c(11, 12, 13, 14))
system.time({
  for (i in 50:1000) {
    b <- data.frame(A = i, B = i * i)
    a <- rbind(a, b)
  }
})
Output
user system elapsed
0.49 0.00 0.49
There is some performance gain.
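A further option worth knowing (a sketch, not benchmarked above): build the pieces in a list and bind them once at the end, which avoids growing the data frame inside the loop altogether.
a <- data.frame(A = c(1, 2, 3, 4),
                B = c(11, 12, 13, 14))
pieces <- lapply(50:1000, function(i) data.frame(A = i, B = i * i))
a <- dplyr::bind_rows(c(list(a), pieces))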
To insert a blank row after the fifth row of a data frame, you can also use the insertRows function from the berryFunctions package:
library(berryFunctions)
df <- insertRows(df, 5 , new = "")

Select one row from each group in a large data.table based on a condition [duplicate]

This question already has an answer here:
Subset rows corresponding to max value by group using data.table
I have a table where the key is repeated a number of times, and I want to select just one row for each key, using the largest value of another column.
This example demonstrates the solution I have at the moment:
N = 10
k = 2
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
X Y
1: 1 -1.37925206
2: 1 -0.53837461
3: 2 0.26516340
4: 2 -0.04643483
5: 3 0.40331424
6: 3 0.28667275
7: 4 -0.30342327
8: 4 -2.13143267
9: 5 2.11178673
10: 5 -0.98047230
11: 6 -0.27230783
12: 6 -0.79540934
13: 7 1.54264549
14: 7 0.40079650
15: 8 -0.98474297
16: 8 0.73179201
17: 9 -0.34590491
18: 9 -0.55897393
19: 10 0.97523187
20: 10 1.16924293
> DT[, .SD[Y == max(Y)], by = X]
X Y
1: 1 -0.5383746
2: 2 0.2651634
3: 3 0.4033142
4: 4 -0.3034233
5: 5 2.1117867
6: 6 -0.2723078
7: 7 1.5426455
8: 8 0.7317920
9: 9 -0.3459049
10: 10 1.1692429
The problem is that for larger data.tables this take a very long time:
N = 10000
k = 25
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
system.time(DT[, .SD[Y == max(Y)], by = X])
user system elapsed
9.69 0.00 9.69
My actual table has about 100 million rows...
Can anyone suggest a more efficient solution?
Edit - importance of setkey
The proposed solution works well, but you must use setkey, or have the DT ordered, for it to work.
See this example without "each" in rep:
N = 10
k = 2
DT = data.table(X = rep(1:N, k), Y = rnorm(k*N))
DT[DT[, Y == max(Y), by = X]$V1,]
X Y
1: 1 1.26925708
2: 4 -0.66625732
3: 5 0.41498548
4: 8 0.03531185
5: 9 0.30608380
6: 1 0.50308578
7: 4 0.19848227
8: 6 0.86458423
9: 8 0.69825500
10: 10 -0.38160503
This would be faster compared to .SD
system.time({setkey(DT, X)
DT[DT[,Y==max(Y), by=X]$V1,]})
# user system elapsed
#0.016 0.000 0.016
Or
system.time(DT[DT[, .I[Y==max(Y)], by=X]$V1])
# user system elapsed
# 0.023 0.000 0.023
If there are only two columns,
system.time(DT[,list(Y=max(Y)), by=X])
# user system elapsed
# 0.006 0.000 0.007
Compared to,
system.time(DT[, .SD[Y == max(Y)], by = X] )
# user system elapsed
# 2.946 0.006 2.962
Based on comments from @Khashaa and @AnandaMahto, the CRAN version (1.9.4) gives a different result for the .SD method compared to the devel version (1.9.5), which I used. You can get the same result as the CRAN version (per @Arun's comments) by setting the option
options(datatable.auto.index=FALSE)
NOTE: In case of "ties", the solutions described here will return multiple rows for each group (as mentioned by @docendo discimus). My solutions are based on the code posted by the OP.
If there are "ties", you could use unique with the by option (in case the number of columns is > 2):
setkey(DT,X)
unique(DT[DT[,Y==max(Y), by=X]$V1,], by=c("X", "Y"))
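Another option when exactly one row per group is wanted even in the presence of ties is to pick the first maximum with which.max (a sketch):
DT[DT[, .I[which.max(Y)], by=X]$V1]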
microbenchmarks
library(microbenchmark)
f1 <- function(){setkey(DT,X)[DT[, Y==max(Y), by=X]$V1,]}
f2 <- function(){DT[DT[, .I[Y==max(Y)], by=X]$V1]}
f3 <- function(){DT[, list(Y=max(Y)), by=X]}
f4 <- function(){DT[, .SD[Y==max(Y)], by=X]}
microbenchmark(f1(), f2(), f3(), f4(), unit='relative', times=20L)
#Unit: relative
# expr        min         lq       mean     median         uq        max neval cld
# f1()   2.794435   2.733706   3.024097   2.756398   2.832654   6.697893    20   a
# f2()   4.302534   4.291715   4.535051   4.271834   4.342437   8.114811    20   a
# f3()   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000    20   a
# f4() 533.119480 522.069189 504.739719 507.494095 493.641512 466.862691    20   b
data
N = 10000
k = 25
set.seed(25)
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))

Resampling from subject id's in R

Assume we have the following data
set.seed(123)
dat <- data.frame(var1=c(10,35,13,19,15,20,19), id=c(1,1,2,2,2,3,4))
(sampledIDs <- sample(min(dat$id):max(dat$id), size=3, replace=TRUE))
> [1] 2 4 2
sampledIDs is a vector of ids sampled (with replacement) from dat$id.
I need the code that results in (and works also for a large dataset with more variables):
var1 id
13 2
19 2
15 2
19 4
13 2
19 2
15 2
The code dat[which(dat$id%in%sampledIDs),] does not give me what I want, since the result of this code is
var1 id
13 2
19 2
15 2
19 4
where the subject with dat$id==2 appears only once in this data (I understand why this is the result, but don't know how to get what I want). Can someone please help?
EDIT: Thank you for the answers, here the runtime of all answers (for those who are interested):
test replications elapsed relative user.self
3 dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ] 1000 0.67 1.000 0.64
1 dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[, 1], ] 1000 0.67 1.000 0.67
2 do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)]) 1000 1.83 2.731 1.83
4 setkey(setDT(dat), id)[J(sampledIDs)] 1000 1.33 1.985 1.33
This would probably be the fastest approach for a big data set, using a data.table binary search:
library(data.table)
setkey(setDT(dat), id)[J(sampledIDs)]
# var1 id
# 1: 13 2
# 2: 19 2
# 3: 15 2
# 4: 19 4
# 5: 13 2
# 6: 19 2
# 7: 15 2
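As a side note (assuming data.table >= 1.9.6), the same binary-search join can be written without setting a key, using the on argument:
setDT(dat)[.(sampledIDs), on = "id"]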
Edit:
Here's a benchmark on a not-so-big data set (1e+05 rows) which shows the clear winner:
library(data.table)
library(microbenchmark)
set.seed(123)
n <- 1e5
dat <- data.frame(var1 = sample(seq_len(100), n, replace = TRUE), id = sample(seq_len(10), n, replace = TRUE))
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 3, replace = TRUE))
dat2 <- copy(dat)
Sven1 <- function(dat) dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
Sven2 <- function(dat) dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
flodel <- function(dat) do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
David <- function(dat2) setkey(setDT(dat2), id)[J(sampledIDs)]
Res <- microbenchmark(Sven1(dat),
                      Sven2(dat),
                      flodel(dat),
                      David(dat2))
Res
# Unit: milliseconds
# expr min lq median uq max neval
# Sven1(dat) 4.356151 4.817557 6.715533 7.313877 45.407768 100
# Sven2(dat) 9.750984 12.385677 14.324671 16.655005 54.797096 100
# flodel(dat) 36.097602 39.680006 42.236017 44.314981 82.261879 100
# David(dat2) 1.813387 2.068749 2.154774 2.335442 8.665379 100
boxplot(Res)
If, for example, we would like to sample more than just 3 ids, say 10, the gap in the benchmark becomes even more dramatic:
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 10, replace = TRUE))
[1] 7 6 10 9 5 9 5 3 7 3
# Unit: milliseconds
# expr min lq median uq max neval
# Sven1(dat) 80.124502 89.141162 97.908365 104.111738 175.40919 100
# Sven2(dat) 99.010410 127.797966 159.404395 170.751069 209.96887 100
# flodel(dat) 129.722435 144.847505 157.737362 178.242103 232.41293 100
# David(dat2) 2.431682 2.721038 2.855103 3.057796 19.60826 100
You can do:
do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
One approach:
dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
# var1 id
# 3 13 2
# 4 19 2
# 5 15 2
# 7 19 4
# 3.1 13 2
# 4.1 19 2
# 5.1 15 2
An alternative approach:
dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]

How do you multiply two unequal length vectors by a factor?

I have two data frames of differing lengths. There is a unique factor that links the two data frames together. I want to multiply the values in the larger data frame by the matching factor in the smaller data frame. Here is code to demonstrate:
d1 <- data.frame(u = factor(x = LETTERS[1:5]), n1 = 1:5)
d2 <- data.frame(u = factor(x = rep(x = LETTERS[1:5], each = 2)), n2 = 1:10)
I want d2[1:2, 2] both multiplied by d1[1, 2] because the factor "A" matches and so forth for the rest of the matching factors.
For this problem you can also use match, which should be somewhat more efficient than the merge/transform approach (particularly if you don't need the data.frame that the latter creates):
d2$n2 * d1[match(d2$u, d1$u), 'n1']
# [1] 1 2 6 8 15 18 28 32 45 50
Use merge to join the two data frames, then transform to add a column to it.
> transform(merge(d1, d2), n.total = n1*n2)
u n1 n2 n.total
1 A 1 1 1
2 A 1 2 2
3 B 2 3 6
4 B 2 4 8
5 C 3 5 15
6 C 3 6 18
7 D 4 7 28
8 D 4 8 32
9 E 5 9 45
10 E 5 10 50
If you don't need the data frame created by transform you can use with instead.
> with(merge(d1, d2), n1*n2)
[1] 1 2 6 8 15 18 28 32 45 50
If you have a lot of data and the above solutions are too slow or inefficient, I suggest you go for @jbaums's solution, but otherwise I find the increased readability of merge preferable.
> require(microbenchmark)
> microbenchmark(transform(merge(d1, d2), n.total = n1*n2),
+ with(merge(d1, d2), n1*n2),
+ d2$n2 * d1[match(d2$u, d1$u), 'n1'])
Unit: microseconds
                                        expr     min       lq       mean   median        uq      max neval cld
 transform(merge(d1, d2), n.total = n1 * n2) 826.897 904.2275 1126.41204 940.3890 1087.0350 2695.521   100   c
                with(merge(d1, d2), n1 * n2) 658.295 722.6715  907.34581 764.2965  934.5555 2463.300   100  b
         d2$n2 * d1[match(d2$u, d1$u), "n1"]  49.372  59.5830   78.42575  66.2475   86.1505  260.820   100 a
If we're into speed comparisons, you might just as well try the data.table package (although for such a small data set, @jbaums's approach is probably more efficient):
library(data.table)
setkey(setDT(d1), u); setDT(d2)
d1[d2][, n.total := n1*n2][]
# u n1 n2 n.total
# 1: A 1 1 1
# 2: A 1 2 2
# 3: B 2 3 6
# 4: B 2 4 8
# 5: C 3 5 15
# 6: C 3 6 18
# 7: D 4 7 28
# 8: D 4 8 32
# 9: E 5 9 45
# 10: E 5 10 50
Or, as suggested by @Arun:
d2[d1, n2 := n2*n1] # Update (by reference) `n2`
OR
d2[d1, new := n2*n1] # Add new column
Note: Although these would be faster, you won't see column n1 in the final result
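If n1 and the product are both wanted in d2, one possible variant (a sketch) pulls n1 in during the join via the i. prefix:
d2[d1, on = "u", `:=`(n1 = i.n1, n.total = n2 * i.n1)]  # copy n1 from d1 and add the product
d2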
