Assume we have the following data
set.seed(123)
dat <- data.frame(var1=c(10,35,13,19,15,20,19), id=c(1,1,2,2,2,3,4))
(sampledIDs <- sample(min(dat$id):max(dat$id), size=3, replace=TRUE))
[1] 2 4 2
sampledIDs is a vector of IDs sampled (with replacement) from dat$id.
I need code that produces the following result (and that also works for a large data set with more variables):
var1 id
13 2
19 2
15 2
19 4
13 2
19 2
15 2
The code dat[which(dat$id %in% sampledIDs), ] does not give me what I want, since the result of this code is
var1 id
13 2
19 2
15 2
19 4
where the subject with dat$id==2 appears only once in this data (I understand why this is the result, but don't know how to get what I want). Can someone please help?
EDIT: Thank you for the answers. Here are the runtimes of all the answers (for those who are interested):
test replications elapsed relative user.self
3 dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ] 1000 0.67 1.000 0.64
1 dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[, 1], ] 1000 0.67 1.000 0.67
2 do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)]) 1000 1.83 2.731 1.83
4 setkey(setDT(dat), id)[J(sampledIDs)] 1000 1.33 1.985 1.33
This would probably be the fastest approach for a big data set, using data.table binary search:
library(data.table)
setkey(setDT(dat), id)[J(sampledIDs)]
# var1 id
# 1: 13 2
# 2: 19 2
# 3: 15 2
# 4: 19 4
# 5: 13 2
# 6: 19 2
# 7: 15 2
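In more recent versions of data.table, the same rows can also be selected with an on= join, which avoids reordering dat by key; a minimal sketch, assuming dat and sampledIDs as defined above:
library(data.table)
dt <- as.data.table(dat)
dt[.(id = sampledIDs), on = "id"]  # one block of matching rows per sampled id, in sampledIDs order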
Edit:
Here's a benchmark on a not-so-big data set (1e+05 rows) which illustrates the clear winner:
library(data.table)
library(microbenchmark)
set.seed(123)
n <- 1e5
dat <- data.frame(var1 = sample(seq_len(100), n, replace = TRUE), id = sample(seq_len(10), n, replace = TRUE))
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 3, replace = TRUE))
dat2 <- copy(dat)
Sven1 <- function(dat) dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
Sven2 <- function(dat) dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
flodel <- function(dat) do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
David <- function(dat2) setkey(setDT(dat2), id)[J(sampledIDs)]
Res <- microbenchmark(Sven1(dat),
Sven2(dat),
flodel(dat),
David(dat2))
Res
# Unit: milliseconds
# expr min lq median uq max neval
# Sven1(dat) 4.356151 4.817557 6.715533 7.313877 45.407768 100
# Sven2(dat) 9.750984 12.385677 14.324671 16.655005 54.797096 100
# flodel(dat) 36.097602 39.680006 42.236017 44.314981 82.261879 100
# David(dat2) 1.813387 2.068749 2.154774 2.335442 8.665379 100
boxplot(Res)
If, for example, we would like to sample more than just 3 IDs, but, let's say, 10, the benchmark becomes ridiculous:
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 10, replace = TRUE))
[1] 7 6 10 9 5 9 5 3 7 3
# Unit: milliseconds
# expr min lq median uq max neval
# Sven1(dat) 80.124502 89.141162 97.908365 104.111738 175.40919 100
# Sven2(dat) 99.010410 127.797966 159.404395 170.751069 209.96887 100
# flodel(dat) 129.722435 144.847505 157.737362 178.242103 232.41293 100
# David(dat2) 2.431682 2.721038 2.855103 3.057796 19.60826 100
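(For reference, the table above comes from re-running the same microbenchmark call with the new sampledIDs; a sketch, assuming the four functions defined earlier are still in scope:)
Res10 <- microbenchmark(Sven1(dat), Sven2(dat), flodel(dat), David(dat2))
Res10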
You can do:
do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
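The as.character conversion is needed because split names its list elements after the grouping values, so indexing that list by name both repeats and reorders the pieces. A quick check on the example data:
names(split(dat, dat$id))
# [1] "1" "2" "3" "4"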
One approach:
dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
# var1 id
# 3 13 2
# 4 19 2
# 5 15 2
# 7 19 4
# 3.1 13 2
# 4.1 19 2
# 5.1 15 2
An alternative approach:
dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
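For clarity, sapply builds an nrow(dat)-by-length(sampledIDs) logical matrix of matches, and which(..., arr.ind = TRUE) returns the row/column positions of its TRUE entries; column 1 then holds the row numbers to keep, grouped by sampled ID:
which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[, 1]
# [1] 3 4 5 7 3 4 5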
Related
This is a speed optimization question.
Here is my sample data. The real data has over 100k rows and >300 columns.
library(data.table)
dt <- data.table(ref=1:20, tgt1=11:30, tgt2=21:40)
dt[c(3,8,9,15,16,17), "tgt1"] = NA
dt[c(4,5,15,17), "tgt2"] = NA
dt
#> ref tgt1 tgt2
#> 1: 1 11 21
#> 2: 2 12 22
#> 3: 3 NA 23
#> 4: 4 14 NA
#> 5: 5 15 NA
#> 6: 6 16 26
#> 7: 7 17 27
#> 8: 8 NA 28
#> 9: 9 NA 29
#> 10: 10 20 30
#> 11: 11 21 31
#> 12: 12 22 32
#> 13: 13 23 33
#> 14: 14 24 34
#> 15: 15 NA NA
#> 16: 16 NA 36
#> 17: 17 NA NA
#> 18: 18 28 38
#> 19: 19 29 39
#> 20: 20 30 40
Some columns have NA at some positions, and my goal is to get the positions of the nearest non-NA flanking values. For instance, for the second column, tgt1, I am using the following code:
tgt = dt[, tgt1]
tgt.na = which(is.na(tgt))
tgt.non.na = which(!is.na(tgt))
start = sapply(tgt.na, function(x) max(tgt.non.na[tgt.non.na < x]))
stop = sapply(tgt.na, function(x) min(tgt.non.na[tgt.non.na > x]))
data.frame(start, stop)
#> start stop
#> 1 2 4
#> 2 7 10
#> 3 7 10
#> 4 14 18
#> 5 14 18
#> 6 14 18
Here, for the tgt1 column, I get what I want. For example, for the NA at the 3rd row, the closest flanking non-NA values are at positions 2 and 4, and so on for the others. My issue is that the sapply calls are very slow. Imagine running this for >300 columns and 100k rows; in its current form it takes a few hours to finish. Ultimately, once these positions are found, they are used to index values from the ref column to compute the missing values in tgt1 and the other columns, but that is a topic for another time.
Is there any way I can make it faster? Is there a data.table solution for it?
Edit: All great solutions; here is my benchmark. You can see that all proposed methods work lightning fast compared to my original sapply. I selected the lapply solution, not only because it is the fastest but also because it aligns well with my current code syntax.
Unit: milliseconds
expr min lq mean median uq max neval
sapply 3755.118949 3787.288609 3850.322669 3819.458269 3897.924530 3976.390790 3
dt.thelatemail 9.145551 9.920238 10.242885 10.694925 10.791552 10.888180 3
lapply.andrew 2.626525 3.038480 3.446682 3.450434 3.856760 4.263086 3
zoo.chinsoon 6.457849 6.578099 6.629839 6.698349 6.715834 6.733318 3
Here is a base R alternative using rle. I used lapply because I was not sure how you wanted to save all the output dataframes. Hope this helps!
dt <- data.table(ref=1:20, tgt1=11:30, tgt2=21:40)
dt[c(3,8,9,15,16,17), "tgt1"] = NA
dt[c(4,5,15,17), "tgt2"] = NA
lapply(dt[,-1], function(x) {
na_loc <- which(is.na(x))
rle_x <- rle(is.na(x))
reps <- rle_x$lengths[rle_x$values == T]
start <- na_loc - 1
start <- start[!start %in% na_loc]
end <- na_loc + 1
end <- end[!end %in% na_loc]
data.frame(start = rep(start, reps),
end = rep(end, reps))
})
$tgt1
start end
1: 2 4
2: 7 10
3: 7 10
4: 14 18
5: 14 18
6: 14 18
$tgt2
start end
1: 3 6
2: 3 6
3: 14 16
4: 16 18
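For reference, rle(is.na(x)) encodes the vector as runs of NA/non-NA, so rle_x$lengths[rle_x$values == T] gives the length of each NA run; that is what rep(start, reps) and rep(end, reps) expand over. A small illustration on the tgt1 column:
rle(is.na(dt$tgt1))
# Run Length Encoding
#   lengths: int [1:7] 2 1 4 2 5 3 3
#   values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE FALSE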
It also scales fairly well on my laptop for a sample dataframe with 300 columns:
df1 <- data.frame(ref = 1:1e5)
df1[paste0("tgt", 1:300)] <- replicate(300, sample(c(1:50, rep(NA, 5)), 1e5, replace = T))
microbenchmark::microbenchmark(
base = {
lapply(df1[,-1], function(x) {
na_loc <- which(is.na(x))
rle <- rle(is.na(x))
reps <- rle$lengths[rle$values == T]
start <- na_loc - 1
start <- start[!start %in% na_loc]
end <- na_loc + 1
end <- end[!end %in% na_loc]
data.frame(start = rep(start, reps),
end = rep(end, reps))
}
)},
times = 5
)
Unit: seconds
expr min lq mean median uq max neval
base 1.863319 1.888617 1.897651 1.892166 1.898196 1.945954 5
You should be able to take advantage of rleid to calculate the prior value to a run of NAs and then match it up. E.g.:
dt[, a := rleid(is.na(tgt1))]
dt[, rev(ref)[match((a - 1)[is.na(tgt1)], rev(a))] ]
#[1] 2 7 7 14 14 14
dt[, ref[match((a + 1)[is.na(tgt1)], a)] ]
#[1] 4 10 10 18 18 18
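Put together into the start/stop layout from the question (a sketch, reusing the helper column a created above):
data.frame(start = dt[, rev(ref)[match((a - 1)[is.na(tgt1)], rev(a))]],
           stop  = dt[, ref[match((a + 1)[is.na(tgt1)], a)]])
#   start stop
# 1     2    4
# 2     7   10
# 3     7   10
# 4    14   18
# 5    14   18
# 6    14   18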
Seems pretty quick to process 100k rows:
dt <- dt[rep(1:20,5e3),]
dt[, ref := 1:1e5]
system.time({
dt[, a := rleid(is.na(tgt1))]
dt[, rev(ref)[match((a-1)[is.na(tgt1)],rev(a))]]
dt[, ref[match((a+1)[is.na(tgt1)],a)]]
})
# user system elapsed
# 0.02 0.00 0.02
Another possibility using zoo package:
library(zoo)
for (j in paste0("tgt", 1L:2L)) {
print(dt[, {
k <- is.na(get(j))
x <- replace(ref, k, NA_integer_)
.(start=na.locf0(x)[k],
end=na.locf0(x, fromLast=TRUE)[k])
}])
}
output:
start end
1: 2 4
2: 7 10
3: 7 10
4: 14 18
5: 14 18
6: 14 18
start end
1: 3 6
2: 3 6
3: 14 16
4: 16 18
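For reference, na.locf0 carries the last non-NA value forward (or backward with fromLast = TRUE) while keeping the original length, which is why replacing ref with NA at the NA positions and then filling recovers the flanking positions. A one-line illustration:
library(zoo)
na.locf0(c(NA, 1, NA, NA, 4))                   # NA 1 1 1 4
na.locf0(c(NA, 1, NA, NA, 4), fromLast = TRUE)  # 1 1 4 4 4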
timing code:
library(data.table)
library(zoo)
sz <- 100e3
nc <- 400
dt <- data.table(ref=1L:sz,
as.data.table(matrix(sample(c(NA_integer_, 1L), sz*nc, replace=TRUE), ncol=nc)))
library(microbenchmark)
microbenchmark(
mtd0=for (j in paste0("V", 1L:nc)) {
k <- dt[,is.na(get(j))]
dt[, a := rleid(k)][,
.(start=rev(ref)[match((a-1)[k],rev(a))], end=ref[match((a+1)[k],a)])]
},
mtd1=for (j in paste0("V", 1L:nc)) {
dt[, {
k <- is.na(get(j))
x <- replace(ref, k, NA_integer_)
.(start=na.locf0(x)[k], end=na.locf0(x, fromLast=TRUE)[k])
}]
},
times=3L)
timings:
Unit: seconds
expr min lq mean median uq max neval cld
mtd0 6.638253 6.698023 6.730352 6.757794 6.776402 6.795010 3 b
mtd1 4.832264 4.835764 4.854799 4.839264 4.866066 4.892867 3 a
Not much difference in timings, given the number of rows.
I have to replace the missing values with the maximum Value within each ID. How can I do this in R?
ID Value
1 NA
5 15
8 16
6 8
7 65
8 NA
5 25
1 62
6 14
7 NA
9 11
8 12
9 36
1 26
4 13
I would first precompute the max values using a call to aggregate(), and also precompute which rows of the data.frame have an NA value. Then you can match the IDs into the aggregation table to extract the corresponding max value.
maxes <- aggregate(Value~ID,df,max,na.rm=T);
nas <- which(is.na(df$Value));
df$Value[nas] <- maxes$Value[match(df$ID[nas],maxes$ID)];
df;
## ID Value
## 1 1 62
## 2 5 15
## 3 8 16
## 4 6 8
## 5 7 65
## 6 8 16
## 7 5 25
## 8 1 62
## 9 6 14
## 10 7 65
## 11 9 11
## 12 8 12
## 13 9 36
## 14 1 26
## 15 4 13
Alternatively, using ave():
df$Value <- ave(df$Value,df$ID,FUN=function(x) { x[is.na(x)] <- max(x,na.rm=T); x; });
df;
## ID Value
## 1 1 62
## 2 5 15
## 3 8 16
## 4 6 8
## 5 7 65
## 6 8 16
## 7 5 25
## 8 1 62
## 9 6 14
## 10 7 65
## 11 9 11
## 12 8 12
## 13 9 36
## 14 1 26
## 15 4 13
Data
df <- data.frame(ID=c(1L,5L,8L,6L,7L,8L,5L,1L,6L,7L,9L,8L,9L,1L,4L),Value=c(NA,15L,16L,8L,
65L,NA,25L,62L,14L,NA,11L,12L,36L,26L,13L));
Benchmarking
Notes:
I had to modify bgoldst2(), rafa(), and akrun() to guard against the case of zero non-NAs in a group; otherwise, max(...,na.rm=T) returns -Inf which can mess up subsequent operations. I used the same algorithm for all three guards. thierry() and bgoldst1() did not have to be modified.
The large scale loop was fairly tricky to write, and I'm not going to attempt to explain all the details; feel free to ask follow-up questions if interested. Basically I ran 12 different benchmarks which varied by the number of groups and the frequency of NAs. The resulting table res shows the two parameters, the mean run-times for all solutions, and the unit chosen by the microbenchmark summarization algorithm.
library(microbenchmark);
library(dplyr);
library(data.table);
library(zoo);
thierry <- function(df) df %>% group_by(ID) %>% mutate(Value = ifelse(is.na(Value), max(Value, na.rm = TRUE), Value));
bgoldst1 <- function(df) { maxes <- aggregate(Value~ID,df,max,na.rm=T); nas <- which(is.na(df$Value)); df$Value[nas] <- maxes$Value[match(df$ID[nas],maxes$ID)]; df; };
bgoldst2 <- function(df) { df$Value <- ave(df$Value,df$ID,FUN=function(x) { nas <- is.na(x); if (any(!nas) && any(nas)) x[nas] <- max(x,na.rm=T); x; }); df; };
rafa <- function(dt) dt[ , Value := { nas <- is.na(Value); if (any(!nas) && any(nas)) ifelse( nas, max(Value, na.rm=T), Value) else Value; }, by = ID];
akrun <- function(dt) dt[, Value := { nas <- is.na(Value); if (any(!nas) && any(nas)) na.aggregate(Value, FUN = max) else Value; }, ID];
## small scale (OP's sample input)
df <- data.frame(ID=c(1L,5L,8L,6L,7L,8L,5L,1L,6L,7L,9L,8L,9L,1L,4L),Value=c(NA,15L,16L,8L,65L,NA,25L,62L,14L,NA,11L,12L,36L,26L,13L));
dt <- as.data.table(df);
ex <- as.data.frame(thierry(copy(df)));
identical(ex,bgoldst1(copy(df)));
identical(ex,bgoldst2(copy(df)));
identical(ex,as.data.frame(rafa(copy(dt))));
identical(ex,as.data.frame(akrun(copy(dt))));
microbenchmark(thierry(copy(df)),bgoldst1(copy(df)),bgoldst2(copy(df)),rafa(copy(dt)),akrun(copy(dt)));
## Unit: microseconds
## expr min lq mean median uq max neval
## thierry(copy(df)) 955.804 989.1610 1043.2847 1004.984 1044.542 2852.016 100
## bgoldst1(copy(df)) 953.238 1005.1985 1069.6281 1039.410 1075.760 2968.337 100
## bgoldst2(copy(df)) 160.798 181.9665 196.0281 192.872 207.412 246.329 100
## rafa(copy(dt)) 947.679 1006.6945 1056.9396 1033.637 1055.874 2943.105 100
## akrun(copy(dt)) 1327.862 1384.5255 1496.1259 1415.530 1445.894 3969.899 100
## large scale, 3 group sizes crossed with 4 NA densities
NV <- 1e5L;
NIs <- c(10L,1e3L,3e4L);
probNAs <- c(1e-3,0.05,0.4,0.95);
res <- expand.grid(NI=NIs,probNA=probNAs);
system.time({
for (ri in seq_len(nrow(res))) {
NI <- res$NI[ri];
probNA <- res$probNA[ri];
df <- data.frame(ID=sample(seq_len(NI),NV,T),Value=sample(c(NA,1:99),NV,T,c(probNA,rep((1-probNA)/99,99L))));
dt <- as.data.table(df);
ex <- as.data.frame(thierry(copy(df)));
if (!all(c(
identical(ex,bgoldst1(copy(df))),
identical(ex,bgoldst2(copy(df))),
identical(ex,as.data.frame(rafa(copy(dt)))),
identical(ex,as.data.frame(akrun(copy(dt))))
))) stop('non-identical failure.');
bm <- summary(microbenchmark(thierry(copy(df)),bgoldst1(copy(df)),bgoldst2(copy(df)),rafa(copy(dt)),akrun(copy(dt)),times=5L));
nms <- sub('\\(.*','',as.character(bm$expr));
for (nm in nms) if (!nm%in%names(res)) res[[nm]] <- NA_real_;
if (!'unit'%in%names(res)) res$unit <- NA_character_;
res[ri,nms] <- bm$mean;
res$unit[ri] <- attr(bm,'unit');
}; ## end for
});
## user system elapsed
## 73.18 0.00 73.37
res;
## NI probNA thierry bgoldst1 bgoldst2 rafa akrun unit
## 1 10 0.001 7.850589 138.77128 14.867427 7.071150 8.023874 milliseconds
## 2 1000 0.001 40.318311 177.26223 9.868853 6.389129 18.054122 milliseconds
## 3 30000 0.001 813.204627 619.16166 125.274735 57.301590 74.732023 milliseconds
## 4 10 0.050 9.387743 139.41686 15.032158 8.479837 6.933616 milliseconds
## 5 1000 0.050 43.223697 156.79871 23.377797 20.550586 145.632279 milliseconds
## 6 30000 0.050 822.338773 677.81813 129.268155 114.585475 656.468438 milliseconds
## 7 10 0.400 15.955374 110.20717 9.785802 11.832889 10.511871 milliseconds
## 8 1000 0.400 55.858348 115.93900 14.441228 22.525058 142.740834 milliseconds
## 9 30000 0.400 853.571520 521.19690 147.925864 208.278328 2518.672465 milliseconds
## 10 10 0.950 9.768268 43.98346 5.921021 9.895623 8.571868 milliseconds
## 11 1000 0.950 49.228024 63.72596 13.702929 22.152230 143.606916 milliseconds
## 12 30000 0.950 822.033257 103.91700 113.398739 86.240922 630.982913 milliseconds
library(dplyr)
dataset %>%
group_by(ID) %>%
mutate(
Value = ifelse(
is.na(Value),
max(Value, na.rm = TRUE),
Value
)
)
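One caveat, noted in the benchmarking notes above: a group that contains only NAs makes max(Value, na.rm = TRUE) return -Inf with a warning. A hedged variant of the dplyr approach with a similar guard:
library(dplyr)
dataset %>%
  group_by(ID) %>%
  mutate(Value = if (all(is.na(Value))) Value
                 else ifelse(is.na(Value), max(Value, na.rm = TRUE), Value)) %>%
  ungroup()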
A simple and fast solution using data.table. Thanks @bgoldst for the tip about including na.rm=T.
library(data.table)
setDT(df)[ , Value := ifelse( is.na(Value), max(Value, na.rm=T), Value), by = ID]
We can use na.aggregate from zoo with data.table:
library(data.table)
library(zoo)
setDT(df)[, Value := na.aggregate(Value, FUN = max) , by = ID]
df
# ID Value
# 1: 1 62
# 2: 5 15
# 3: 8 16
# 4: 6 8
# 5: 7 65
# 6: 8 16
# 7: 5 25
# 8: 1 62
# 9: 6 14
#10: 7 65
#11: 9 11
#12: 8 12
#13: 9 36
#14: 1 26
#15: 4 13
The following code combines a vector with a dataframe:
newrow = c(1:4)
existingDF = rbind(existingDF,newrow)
However, this code always inserts the new row at the end of the data frame.
How can I insert the row at a specified point within the data frame? For example, let's say the data frame has 20 rows; how can I insert the new row between rows 10 and 11?
Here's a solution that avoids the (often slow) rbind call:
existingDF <- as.data.frame(matrix(seq(20),nrow=5,ncol=4))
r <- 3
newrow <- seq(4)
insertRow <- function(existingDF, newrow, r) {
existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
existingDF[r,] <- newrow
existingDF
}
> insertRow(existingDF, newrow, r)
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 1 2 3 4
4 3 8 13 18
5 4 9 14 19
6 5 10 15 20
If speed is less important than clarity, then @Simon's solution works well:
existingDF <- rbind(existingDF[1:r,],newrow,existingDF[-(1:r),])
> existingDF
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 1 2 3 4
41 4 9 14 19
5 5 10 15 20
(Note we index r differently).
And finally, benchmarks:
library(microbenchmark)
microbenchmark(
rbind(existingDF[1:r,],newrow,existingDF[-(1:r),]),
insertRow(existingDF,newrow,r)
)
Unit: microseconds
expr min lq median uq max
1 insertRow(existingDF, newrow, r) 660.131 678.3675 695.5515 725.2775 928.299
2 rbind(existingDF[1:r, ], newrow, existingDF[-(1:r), ]) 801.161 831.7730 854.6320 881.6560 10641.417
Benchmarks
As @MatthewDowle always points out to me, benchmarks need to be examined for how they scale as the size of the problem increases. Here we go then:
benchmarkInsertionSolutions <- function(nrow=5,ncol=4) {
existingDF <- as.data.frame(matrix(seq(nrow*ncol),nrow=nrow,ncol=ncol))
r <- 3 # Row to insert into
newrow <- seq(ncol)
m <- microbenchmark(
rbind(existingDF[1:r,],newrow,existingDF[-(1:r),]),
insertRow(existingDF,newrow,r),
insertRow2(existingDF,newrow,r)
)
# Now return the median times
mediansBy <- by(m$time,m$expr, FUN=median)
res <- as.numeric(mediansBy)
names(res) <- names(mediansBy)
res
}
nrows <- 5*10^(0:5)
benchmarks <- sapply(nrows,benchmarkInsertionSolutions)
colnames(benchmarks) <- as.character(nrows)
library(reshape2); library(ggplot2)
ggplot( melt(benchmarks), aes(x=Var2,y=value,colour=Var1) ) + geom_line() + scale_x_log10() + scale_y_log10()
@Roland's solution scales quite well, even with the call to rbind:
5 50 500 5000 50000 5e+05
insertRow2(existingDF, newrow, r) 549861.5 579579.0 789452 2512926 46994560 414790214
insertRow(existingDF, newrow, r) 895401.0 905318.5 1168201 2603926 39765358 392904851
rbind(existingDF[1:r, ], newrow, existingDF[-(1:r), ]) 787218.0 814979.0 1263886 5591880 63351247 829650894
(The corresponding plots on linear and log-log scales are omitted here.)
insertRow2 <- function(existingDF, newrow, r) {
existingDF <- rbind(existingDF,newrow)
existingDF <- existingDF[order(c(1:(nrow(existingDF)-1),r-0.5)),]
row.names(existingDF) <- 1:nrow(existingDF)
return(existingDF)
}
insertRow2(existingDF,newrow,r)
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 1 2 3 4
4 3 8 13 18
5 4 9 14 19
6 5 10 15 20
microbenchmark(
rbind(existingDF[1:r,],newrow,existingDF[-(1:r),]),
insertRow(existingDF,newrow,r),
insertRow2(existingDF,newrow,r)
)
Unit: microseconds
expr min lq median uq max
1 insertRow(existingDF, newrow, r) 513.157 525.6730 531.8715 544.4575 1409.553
2 insertRow2(existingDF, newrow, r) 430.664 443.9010 450.0570 461.3415 499.988
3 rbind(existingDF[1:r, ], newrow, existingDF[-(1:r), ]) 606.822 625.2485 633.3710 653.1500 1489.216
The .before argument in dplyr::add_row can be used to specify the row.
dplyr::add_row(
cars,
speed = 0,
dist = 0,
.before = 3
)
#> speed dist
#> 1 4 2
#> 2 4 10
#> 3 0 0
#> 4 7 4
#> 5 7 22
#> 6 8 16
#> ...
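Applied to the question's setting (a 20-row data frame with columns V1..V4, inserting a new row between rows 10 and 11), a sketch:
existingDF <- as.data.frame(matrix(seq(80), nrow = 20, ncol = 4))
dplyr::add_row(existingDF, V1 = 1, V2 = 2, V3 = 3, V4 = 4, .before = 11)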
You should try the dplyr package:
library(dplyr)
a <- data.frame(A = c(1, 2, 3, 4),
B = c(11, 12, 13, 14))
system.time({
for (i in 50:1000) {
b <- data.frame(A = i, B = i * i)
a <- bind_rows(a, b)
}
})
Output
user system elapsed
0.25 0.00 0.25
In contrast, using the rbind function:
a <- data.frame(A = c(1, 2, 3, 4),
B = c(11, 12, 13, 14))
system.time({
for (i in 50:1000) {
b <- data.frame(A = i, B = i * i)
a <- rbind(a, b)
}
})
Output
user system elapsed
0.49 0.00 0.49
There is some performance gain.
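Either way, growing a data frame one row at a time copies it on every iteration; building the pieces in a list and binding once is usually much faster. A sketch of the same loop restructured:
a <- data.frame(A = c(1, 2, 3, 4), B = c(11, 12, 13, 14))
pieces <- lapply(50:1000, function(i) data.frame(A = i, B = i * i))
a <- do.call(rbind, c(list(a), pieces))  # or dplyr::bind_rows(c(list(a), pieces))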
To insert a blank row after the fifth row of a data frame, you can use the berryFunctions package:
library(berryFunctions)
df <- insertRows(df, 5 , new = "")
Example Data
A<-c(1,4,5,6,2,3,4,5,6,7,8,7)
B<-c(4,6,7,8,2,2,2,3,8,8,7,8)
DF<-data.frame(A,B)
What I would like to do is apply a correction factor to column A, based on the values of column B. The rules would be something like this:
If B is less than 4, multiply A by 1
If B is at least 4 and less than 6, multiply A by 2
If B is 6 or greater, multiply A by 4
I suppose I could write an "if" statement (and I'd be glad to see a good example), but I'd also be interested in using square bracket indexing to speed things up.
The end result would look like this
A B
2 4
16 6
20 7
24 8
etc.
Use this:
within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
Or this (corrected by @agstudy):
within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
Benchmarking:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
identical(a(DF), b(DF))
#[1] TRUE
microbenchmark(a(DF), b(DF), times=1000)
#Unit: milliseconds
# expr min lq median uq max neval
# a(DF) 8.603778 10.253799 11.07999 11.923116 53.91140 1000
# b(DF) 3.763470 3.889065 5.34851 5.480294 39.72503 1000
Similar to @Ferdinand's solution, but using transform:
transform(DF, newcol = ifelse(B<4, A,
ifelse(B>=6,4*A,2*A)))
A B newcol
1 1 4 2
2 4 6 16
3 5 7 20
4 6 8 24
5 2 2 2
6 3 2 3
7 4 2 4
8 5 3 5
9 6 8 24
10 7 8 28
11 8 7 32
12 7 8 28
I prefer to use findInterval as an index into a set of factors for such operations. The proliferation of nested test-conditional and consequent vectors with multiple ifelse calls offends my efficiency sensibilities:
DF$A <- DF$A * c(1,2,4)[findInterval(DF$B, c(-Inf,4,6,Inf) ) ]
DF
A B
1 2 4
2 16 6
3 20 7
4 24 8
snipped ....
Benchmark:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
ccc <- function(DF) within(DF, A <- A * c(1,2,4)[findInterval(B, c(-Inf,4,6,Inf))])
microbenchmark(a(DF), b(DF), ccc(DF), times=1000)
#-----------
Unit: microseconds
expr min lq median uq max neval
a(DF) 7616.107 7843.6320 8105.0340 8322.5620 93549.85 1000
b(DF) 2638.507 2789.7330 2813.8540 3072.0785 92389.57 1000
ccc(DF) 604.555 662.5335 676.0645 698.8665 85375.14 1000
Note: I would not have done this using within if I were coding my own function, but thought for fairness to the earlier effort, I would make it apples <-> apples.
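A closely related alternative bins B with cut and indexes the multiplier vector by the resulting integer codes; a sketch on the original example DF, using the same breaks:
mult <- c(1, 2, 4)[as.integer(cut(DF$B, breaks = c(-Inf, 4, 6, Inf), right = FALSE))]
DF$A <- DF$A * mult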