I am using a data.table to store data. I am trying to figure out whether certain columns in each row contain duplicated values. I want to add a column to the data.table that holds the value "Has Dups" if there are duplicated values and NA if there are none. The names of the columns that I want to check for duplication are stored in a character vector. For example, I create my data.table:
tmpdt<-data.table(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
> tmpdt
a b c d
1: 1 2 4 3
2: 2 2 2 3
3: 3 3 2 1
4: 4 4 4 4
5: 5 5 4 5
I have another variable that indicates which columns I need to check for duplicates. It is important that I be able to store the column names in a character vector and not need to "know" them (because they will be passed as an argument to a function).
dupcheckcols<-c("a", "c", "d")
I want the output to be:
> tmpdt
a b c d Dups
1: 1 2 4 3 <NA>
2: 2 2 2 3 Has Dups
3: 3 3 2 1 <NA>
4: 4 4 4 4 Has Dups
5: 5 5 4 5 Has Dups
If I were using a data.frame, this is easy. I could simply use:
tmpdt<-data.frame(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
tmpdt$Dups[apply(tmpdt[,dupcheckcols], 1, function(x) {return(sum(duplicated(x))>0)})]<-"Has Dups"
> tmpdt
a b c d Dups
1 1 2 4 3 <NA>
2 2 2 2 3 Has Dups
3 3 3 2 1 <NA>
4 4 4 4 4 Has Dups
5 5 5 4 5 Has Dups
But I can't figure out how to accomplish the same task with a data.table. Any help is greatly appreciated.
I'm sure there are other ways, but here is one:
tmpdt[, dups := tmpdt[, dupcheckcols, with=FALSE][, apply(.SD, 1, function(x){sum(duplicated(x))>0})] ]
# a b c d dups
#1: 1 2 4 3 FALSE
#2: 2 2 2 3 TRUE
#3: 3 3 2 1 FALSE
#4: 4 4 4 4 TRUE
#5: 5 5 4 5 TRUE
A more convoluted, but slightly quicker (in computational terms) method would be to construct the filter condition in i, then update in j by reference
expr <- paste(apply(t(combn(dupcheckcols,2)), 1, FUN=function(x){ paste0(x, collapse="==") }), collapse = "|")
# [1] "a==c|a==d|c==d"
expr <- parse(text=expr)
tmpdt[ eval(expr), dups := TRUE ]
# a b c d dups
#1: 1 2 4 3 NA
#2: 2 2 2 3 TRUE
#3: 3 3 2 1 NA
#4: 4 4 4 4 TRUE
#5: 5 5 4 5 TRUE
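To see what the generated filter looks like and how it evaluates, here is a minimal base-R sketch on a plain data.frame. The names `df`, `pairs`, `expr_text` and `has_dups` are illustrative and not part of the answer above.

```r
# Same toy data as above, as a plain data.frame so no packages are needed
df <- data.frame(a = c(1, 2, 3, 4, 5), b = c(2, 2, 3, 4, 5),
                 c = c(4, 2, 2, 4, 4), d = c(3, 3, 1, 4, 5))
dupcheckcols <- c("a", "c", "d")

# Every unordered pair of column names, one pair per column of the matrix
pairs <- combn(dupcheckcols, 2)

# Turn each pair into "x==y" and join the pieces with "|"
expr_text <- paste(apply(pairs, 2, paste, collapse = "=="), collapse = "|")

# Evaluate the parsed expression with the data.frame columns in scope
has_dups <- eval(parse(text = expr_text), envir = df)
```

The number of pairs grows quadratically with the number of columns, so this style is probably best kept for a handful of columns.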
I was interested in speed benefits, so I've benchmarked these two plus Ananda's solution:
library(microbenchmark)
tmpdt<-data.table(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
# copy() gives each method its own table; plain assignment only creates another reference to tmpdt
t1 <- copy(tmpdt)
t2 <- copy(tmpdt)
t3 <- copy(tmpdt)
expr <- paste(apply(t(combn(dupcheckcols,2)), 1, FUN=function(x){ paste0(x, collapse="==") }), collapse = "|")
expr <- parse(text=expr)
microbenchmark(
#Ananda's solution
t1[, dups := any(duplicated(unlist(.SD))), by = 1:nrow(tmpdt), .SDcols = dupcheckcols],
t2[, dups := t2[, dupcheckcols, with=FALSE][, apply(.SD, 1, function(x){sum(duplicated(x))>0})] ],
t3[ eval(expr), dups := TRUE ]
)
# min lq mean median uq max neval cld
# 531.416 552.5760 577.0345 565.182 573.2015 1761.863 100 b
#1277.569 1333.2615 1389.5857 1358.021 1387.9860 2694.951 100 c
# 265.872 283.3525 293.9362 292.487 301.1640 520.436 100 a
You should be able to do something like this:
tmpdt[, dups := any(duplicated(unlist(.SD, use.names = FALSE))),
by = 1:nrow(tmpdt), .SDcols = dupcheckcols]
# a b c d dups
# 1: 1 2 4 3 FALSE
# 2: 2 2 2 3 TRUE
# 3: 3 3 2 1 FALSE
# 4: 4 4 4 4 TRUE
# 5: 5 5 4 5 TRUE
Adjust accordingly if you really want the words "Has Dups", but note that it would probably be easier to use logical values, as in my answer here.
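If the text label is really wanted, one possible adjustment is to map the logical flags afterwards. This is only a sketch on a bare vector; `dups` and `label` are illustrative names, and the same `ifelse()` call could presumably be used on the right-hand side of `:=`.

```r
# Logical flags as produced by the row-wise duplicate check
dups <- c(FALSE, TRUE, FALSE, TRUE, TRUE)

# TRUE becomes the label, FALSE becomes a character NA
label <- ifelse(dups, "Has Dups", NA_character_)
```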
I found a way to do this with Rcpp, following an example by hadley (under "Sets"):
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
#include <unordered_set>
using namespace Rcpp;
// [[Rcpp::export]]
LogicalVector anyDupCols(IntegerMatrix x) {
  int nr = x.nrow();
  int nc = x.ncol();
  LogicalVector out(nr, false);
  for (int i = 0; i < nr; i++) {
    // start each row with an empty set so values from earlier rows are not counted
    std::unordered_set<int> seen;
    for (int j = 0; j < nc; j++){
      int xij = x(i,j);
      if (seen.count(xij)){ out[i] = true; break; }
      else seen.insert(xij);
    }
  }
  return out;
}
To use it, put it in a .cpp file and compile it with Rcpp::sourceCpp(); anyDupCols() can then be called from R on an integer matrix.
It does pretty well in benchmarks:
nc = 30
nv = nc^2
n = 1e4
DT = setDT( replicate(nc, sample(nv, n, replace = TRUE), simplify=FALSE) )
microbenchmark(
  ananda = DT[, any(duplicated(unlist(.SD, use.names = FALSE))), by = 1:nrow(DT)]$V1,
  tospig = {
    expr = parse(text=paste(apply(t(combn(names(DT),2)),1,FUN =
      function(x){ paste0(x, collapse="==") }), collapse = "|"))
    DT[, eval(expr)]
  },
  cpp = anyDupCols(as.matrix(DT)),
  alex = ff(DT),
  tscharf = apply(DT,1,function(row) any(duplicated(row))),
  unit = "relative", times = 10
)
Unit: relative
expr min lq mean median uq max neval cld
ananda 2.462739 2.596990 2.774660 2.659898 2.869048 3.352547 10 c
tospig 3.118158 3.253102 3.606263 3.424598 3.885561 4.583268 10 d
cpp 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
alex 1.295415 1.927802 1.914883 1.982580 2.029868 2.538143 10 b
tscharf 2.112286 2.204654 2.385318 2.234963 2.322206 2.978047 10 bc
If I go to nc = 50, @tospig's expr becomes too long for R to handle and I get a node stack overflow, which is fun.
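One way to keep the pairwise idea while avoiding a single very long parsed expression is to accumulate the comparisons in a loop. The following is a base-R sketch under that assumption; `any_pair_equal` is a hypothetical helper, not something from the answers above, and it does not treat NA specially.

```r
# Hypothetical helper: TRUE for each row where any two of the given columns are equal
any_pair_equal <- function(df, cols) {
  pairs <- combn(cols, 2, simplify = FALSE)   # list of two-element name vectors
  out <- logical(nrow(df))                    # starts as all FALSE
  for (p in pairs) {
    # OR in the comparison for this pair of columns
    out <- out | (df[[p[1]]] == df[[p[2]]])
  }
  out
}

df <- data.frame(a = c(1, 2, 3, 4, 5), b = c(2, 2, 3, 4, 5),
                 c = c(4, 2, 2, 4, 4), d = c(3, 3, 1, 4, 5))
res <- any_pair_equal(df, c("a", "c", "d"))
```

The work is still quadratic in the number of columns, but no expression depth limit is involved.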
A one-liner with some elegance: define the columns, loop down the rows, and see if there are any dupes.
tmpdt[,dups:=apply(.SD,1,function(row) any(duplicated(row))),.SDcols = dupcheckcols]
> tmpdt
a b c d dups
1: 1 2 4 3 FALSE
2: 2 2 2 3 TRUE
3: 3 3 2 1 FALSE
4: 4 4 4 4 TRUE
5: 5 5 4 5 TRUE
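As a variation on the same row-wise loop, base R's `anyDuplicated()` returns the index of the first duplicate (or 0), so it can stand in for `any(duplicated(row))`. A small sketch on a plain data.frame, with illustrative names:

```r
df <- data.frame(a = c(1, 2, 3, 4, 5), b = c(2, 2, 3, 4, 5),
                 c = c(4, 2, 2, 4, 4), d = c(3, 3, 1, 4, 5))
dupcheckcols <- c("a", "c", "d")

# anyDuplicated() is 0 when a row has no repeated value
row_has_dups <- unname(apply(df[, dupcheckcols], 1,
                             function(row) anyDuplicated(row) > 0))
```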
Another way is to tabulate "tmpdt" along its rows and find which rows have more than one of an element:
tmpdt2 = tmpdt[, dupcheckcols, with = FALSE] # subset tmpdt
colSums(table(unlist(tmpdt2), row(tmpdt2)) > 1L) > 0L
#     1     2     3     4     5
# FALSE  TRUE FALSE  TRUE  TRUE
Peeking at table we could speed it up significantly with something like:
ff = function(x)
{
    # common set of values across all columns
    lvs = Reduce(union, lapply(x, function(X) if(is.factor(X)) levels(X) else unique(X)))
    x = lapply(x, function(X) match(X, lvs))
    nr = length(lvs); nc = length(x[[1L]])
    # one column of counts per row of the data
    tabs = "dim<-"(tabulate(unlist(x, use.names = FALSE) + (0:(nc - 1L)) * nr, nr * nc),
                   c(nr, nc))
    colSums(tabs > 1L) > 0L
}
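To make the tabulation idea concrete, here is a small base-R sketch that builds the value-by-row count table for the toy data and reads the duplicate flags off it. The variable names are illustrative.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5), b = c(2, 2, 3, 4, 5),
                 c = c(4, 2, 2, 4, 4), d = c(3, 3, 1, 4, 5))
m <- as.matrix(df[, c("a", "c", "d")])

# Rows of `counts` are distinct values, columns are rows of the data
counts <- table(as.vector(m), as.vector(row(m)))

# A data row has duplicates if any value occurs more than once in it
flags <- unname(colSums(counts > 1) > 0)
```

For example, `counts["2", "2"]` is 2 because the value 2 appears twice among a, c and d in the second row.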
I am trying to exclude rows that have missing values (NA) in all columns AND for which all subsequent rows have only missing values (or which are the last empty row themselves), i.e. I want to remove trailing "all-NA" rows.
I came up with the solution below, which works but is too slow (I am using this function on thousands of tables), probably because of the while loop.
## Aux function to remove NA rows below table
remove_empty_row_last <- function(dt){
  dt[ , row_empty := rowSums(is.na(dt)) == ncol(dt)]
  while (dt[.N, row_empty] == TRUE) {
    dt <- dt[1:(.N-1)]
  }
  dt %>% return()
}
d <- data.table(a = c(1,NA,3,NA,5,NA,NA), b = c(1,NA,3,4,5,NA,NA))
#EDIT2: adding more test cases
d2 <- data.table(A = c(1,NA,3,NA,5,1 ,NA), B = c(1,NA,3,4,5,NA,NA))
d3 <- data.table(A = c(1,NA,3,NA,5,NA,NA), B = c(1,NA,3,4,5,1,NA))
#Edit3:adding no NA rows test case
d4 <- data.table(A = c(1,2,3,NA,5,NA,NA), B = c(1,2,3,4,5,1,7))
d4 %>% remove_empty_row_last()
This seems to work with all test cases.
The idea is to use a reverse cumsum to filter out the NA rows at the end.
remove_empty_row_last_new <- function(d) {
d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
#> a b
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
d2 <- data.table(A=c(1,NA,3,NA,5,1 ,NA),B=c(1,NA,3,4,5,NA,NA))
#> A B
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: 1 NA
d3 <- data.table(A=c(1,NA,3,NA,5,NA,NA),B=c(1,NA,3,4,5,1,NA))
#> A B
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: NA 1
d4 <- data.table(A=c(1,2,3,NA,5,NA,NA),B=c(1,2,3,4,5,1,7))
#> A B
#> 1: 1 1
#> 2: 2 2
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: NA 1
#> 7: NA 7
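The reverse-cumsum idea can be sketched in base R as follows. This is an illustration written for plain data.frames, not the answer's own function; `drop_trailing_na_rows` is a hypothetical name.

```r
# Hypothetical helper: drop rows at the bottom that are entirely NA
drop_trailing_na_rows <- function(df) {
  non_empty <- rowSums(!is.na(df)) > 0          # TRUE where a row holds any value
  # For each row, how many non-empty rows exist from here to the bottom
  remaining <- rev(cumsum(rev(non_empty)))
  # Zero means only empty rows are left, so those rows are dropped
  df[remaining > 0, , drop = FALSE]
}

d  <- data.frame(a = c(1, NA, 3, NA, 5, NA, NA), b = c(1, NA, 3, 4, 5, NA, NA))
d4 <- data.frame(A = c(1, 2, 3, NA, 5, NA, NA), B = c(1, 2, 3, 4, 5, 1, 7))
res  <- drop_trailing_na_rows(d)
res4 <- drop_trailing_na_rows(d4)
```

Interior all-NA rows (row 2 of `d`) are kept because non-empty rows still follow them.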
You'll have to check performance on your real dataset, but it seems a bit faster :
> microbenchmark::microbenchmark(remove_empty_row_last(d),remove_empty_row_last_new(d))
Unit: microseconds
expr min lq mean median uq max neval cld
remove_empty_row_last(d) 384.701 411.800 468.5251 434.251 483.7515 1004.401 100 b
remove_empty_row_last_new(d) 345.201 359.301 416.1650 382.501 450.5010 1104.401 100 a
Maybe this will be fast enough?
d[!d[,any(rowSums(is.na(.SD)) == ncol(.SD)) & rleid(rowSums(is.na(.SD)) == ncol(.SD)) == max(rleid(rowSums(is.na(.SD)) == ncol(.SD))),]]
a b
1: 1 1
2: NA NA
3: 3 3
4: NA 4
5: 5 5
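The run-based idea can also be sketched with base R's `rle()`. Unlike the one-liner, this sketch checks explicitly that the last run consists of empty rows before trimming; `trailing_run_length` is a hypothetical helper name.

```r
# Hypothetical helper: length of the final run of TRUE values, or 0 if the vector ends in FALSE
trailing_run_length <- function(all_na) {
  runs <- rle(all_na)
  k <- length(runs$lengths)
  if (k > 0 && runs$values[k]) runs$lengths[k] else 0L
}

d <- data.frame(a = c(1, NA, 3, NA, 5, NA, NA), b = c(1, NA, 3, 4, 5, NA, NA))
all_na <- rowSums(is.na(d)) == ncol(d)

# Keep everything except the trailing run of empty rows
trimmed <- d[seq_len(nrow(d) - trailing_run_length(all_na)), , drop = FALSE]
```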
Here's another approach that relies on rcpp.
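Rcpp::cppFunction('
IntegerVector which_end_cont(LogicalVector x) {
  const int n = x.size();
  int consecutive = 0;
  // count the run of TRUE values at the end of x
  for (int i = n - 1; i >= 0; i--) {
    if (x[i]) consecutive++; else break;
  }
  IntegerVector out(consecutive);
  if (consecutive == 0)
    return(out);  // zero-length result: nothing to trim
  else
    return(seq(1, n - consecutive));  // 1-based indices of the rows to keep
}
')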
remove_empty_row_last3 <- function(dt) {
  lgl = rowSums(is.na(dt)) == length(dt)
  ind = which_end_cont(lgl)
  if (length(ind)) return(dt[ind]) else return(dt)
}
Basically, it:
1. Uses R to find out which rows are completely NA.
2. Uses Rcpp to loop through the logical vector from the end to determine how many consecutive empty rows there are. Using Rcpp allows us to minimize the memory allocated.
3. If there are no empty rows at the end, avoids allocating an index and simply returns the input. Otherwise, allocates the sequence in Rcpp and returns it to subset the data.table.
Using microbenchmark, this is about 3 times faster for cases in which there are empty rows at the end and about 15 times faster in which there are no empty rows.
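For readers without a compiler at hand, the steps above can be sketched in plain R. This only illustrates the logic and makes no claim about speed; `count_trailing_true` is a hypothetical name.

```r
# Count how many TRUE values sit at the very end of a logical vector
count_trailing_true <- function(x) {
  consecutive <- 0L
  i <- length(x)
  # walk backwards and stop at the first value that is not TRUE
  while (i >= 1L && isTRUE(x[i])) {
    consecutive <- consecutive + 1L
    i <- i - 1L
  }
  consecutive
}

all_na <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
n_drop <- count_trailing_true(all_na)
keep <- seq_len(length(all_na) - n_drop)   # row indices that survive
```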
If you have taken the time to add Rcpp, the nice thing is that data.table exports some of its internal functions so that they can be called directly from C. That can simplify things further and make it very quick, mainly because we can skip the non-standard evaluation performed by [.data.table, which is why all cases are now ~15 times faster than the OP's original function.
Rcpp::cppFunction('
SEXP mysub2(SEXP dt, LogicalVector x) {
  const int n = x.size();
  int consecutive = 0;
  for (int i = n - 1; i >= 0; i--) {
    if (x[i]) consecutive++; else break;
  }
  if (consecutive == 0)
    return(dt);  // no trailing empty rows: hand back the input unchanged
  else
    return(DT_subsetDT(dt, wrap(seq(1, n - consecutive)), wrap(seq_len(LENGTH(dt)))));
}',
  include="#include <datatableAPI.h>",
  depends="data.table")
remove_empty_row_last4 <- function(dt) {
  lgl = rowSums(is.na(dt)) == length(dt)
  return(mysub2(dt, lgl))
}
dt = copy(d)
dt2 = copy(d2)
dt3 = copy(d3)
dt4 = copy(d4)
microbenchmark::microbenchmark(original = remove_empty_row_last(d3),
rcpp_subset = remove_empty_row_last4(dt3),
rcpp_ind_only = remove_empty_row_last3(dt3),
waldi = remove_empty_row_last_new(dt3),
ian = dt3[!dt3[,any(rowSums(is.na(.SD)) == ncol(.SD)) & rleid(rowSums(is.na(.SD)) == ncol(.SD)) == max(rleid(rowSums(is.na(.SD)) == ncol(.SD))),]])
## Unit: microseconds
## expr min lq mean median uq max neval
## original 498.0 519.00 539.602 537.65 551.85 621.6 100
## rcpp_subset 34.0 39.95 43.422 43.30 46.70 59.0 100
## rcpp_ind_only 116.9 129.75 139.943 140.15 146.35 177.7 100
## waldi 370.9 387.70 408.910 400.55 417.90 683.4 100
## ian 432.0 445.30 461.310 456.25 473.35 554.1 100
## andrew 120.0 131.40 143.153 141.60 151.65 197.5 100
I am late to the party but here is another option that should be relatively memory efficient and only uses base R.
d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
#> a b
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
d2 <- data.table(A=c(1,NA,3,NA,5,1 ,NA),B=c(1,NA,3,4,5,NA,NA))
#> A B
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: 1 NA
d3 <- data.table(A=c(1,NA,3,NA,5,NA,NA),B=c(1,NA,3,4,5,1,NA))
#> A B
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: NA 1
d4 <- data.table(A=c(1,2,3,NA,5,NA,NA),B=c(1,2,3,4,5,1,7))
#> A B
#> 1: 1 1
#> 2: 2 2
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: NA 1
#> 7: NA 7
Created on 2021-02-01 by the reprex package (v0.3.0)
remove_empty_row_last_andrew = function(x) {
  # 1 where every column is NA in that row, 0 otherwise
  idx = do.call(pmin.int, lapply(x, is.na))
  length_idx = length(idx)
  # only trim when the last row is entirely NA
  if(idx[length_idx] == 1) {
    start_idx = length_idx - which.min(idx[length_idx:1L]) + 2
    x = x[-(start_idx:length_idx), ]
  }
  x
}
I want to determine the amount of duplicate records per row for certain columns in a big data table. Simple example with desired output:
test <- data.table(a=c(1,2,3),b=c(1,4,6),c=c(5,6,9),duplicatercds=c(1,0,0))
Is there a command for this?
You can melt into a long format before calculating the dupes
DT <- data.table(a=c(1,2,3),b=c(1,4,6),c=c(5,6,9))
stat <- melt(DT[, rn:=.I], id.vars="rn")[,
.(duplicatercds=.N - uniqueN(value)), by=.(rn)]
DT[stat, duplicatercds := duplicatercds, on=.(rn)]
Be careful when doing uniqueN on double values because of machine precision
There is a GitHub feature request for this: https://github.com/Rdatatable/data.table/issues/1063
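The precision caveat can be illustrated with base R's `unique()`; this sketch says nothing about how `uniqueN` itself treats doubles. Two numbers that print the same can still differ in their last bits and then count as distinct.

```r
# 0.1 + 0.2 is not stored as exactly 0.3
x <- c(0.1 + 0.2, 0.3, 1)

n_exact <- length(unique(x))             # exact comparison sees three values
n_rounded <- length(unique(round(x, 8))) # rounding first merges the two near-equal ones
```

Rounding to a sensible number of digits before counting is one possible workaround when the columns hold computed doubles.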
Another cleaner method
DT[, duplicatercds := apply(.SD, 1, function(x) length(x) - uniqueN(x))]
EDIT: Added some timings:
ncols <- 10L
nrows <- 1e4L
uniqVal <- seq_len(1000L)
test <- as.data.table(matrix(sample(uniqVal, nrows*ncols, replace=TRUE), nrow=nrows))
test[, duplicatercds := NA_real_]
f1 <- function() test[, apply(.SD, 1, function(x) { y <- table(x); sum(y) - length(y) }) ]
f2 <- function() test[, sum(table(unlist(.SD)) > 1), by=.(1:nrows)]$V1
f3 <- function() test[, apply(test, 1, function(x) sum(diff(sort(x))==0))]
f4 <- function() test[, apply(.SD, 1, function(x) length(x) - uniqueN(x))]
f5 <- function() test[, ncols - vapply(transpose(.SD), uniqueN, 1L) + 1L]
identical(f2(), f1())
# [1] FALSE
identical(f3(), f1())
# [1] TRUE
identical(f4(), f1())
# [1] TRUE
identical(unname(f5()), f1())
# [1] TRUE
microbenchmark(f1(), f2(), f3(), f4(), f5(), times=5L)
# Unit: milliseconds
# expr min lq mean median uq max neval
# f1() 1883.7817 1903.7626 1940.5378 1922.6539 1981.1139 2011.3771 5
# f2() 1821.0437 1901.1188 1933.8926 1908.4297 1999.6216 2039.2491 5
# f3() 657.4502 666.6721 679.5539 672.6617 686.4095 714.5760 5
# f4() 167.8048 168.5211 174.3660 169.9920 180.1304 185.3816 5
# f5() 146.0255 154.6341 159.4630 160.1968 164.3369 172.1219 5
Let's assume you don't have that last column then you can get your desired result (modulo your clarifying comment) with :
test[ ,duplicatercds := apply(.SD, 1, function(x) {sum(table(x))-length(table(x))}),
by=1:nrow(test) ]
> test
a b c duplicatercds
1: 1 1 5 1
2: 2 4 6 0
3: 3 6 9 0
And a 'test' with a more complex example:
> test <- data.table(a=c(1,2,3),b=c(1,4,6),c=c(5,6,9), d=c(1,2,3), c=c(5,6,9))
> test
a b c d c
1: 1 1 5 1 5
2: 2 4 6 2 6
3: 3 6 9 3 9
> test[ , duplicatercds := apply(.SD, 1, function(x) {sum(table(x))-length(table(x))}), by=1:nrow(test) ]
> test
a b c d c duplicatercds
1: 1 1 5 1 5 3
2: 2 4 6 2 6 2
3: 3 6 9 3 9 2
Or maybe:
test[ , duplicatercds := apply(.SD, 1,
function(x) {sum(table(x))-length(table(x))}) ]
You can do a table, and count how many have a frequency of more than 1
test=test[,1:3]#Remove your duplicatercds
test[, duplicatercds:=sum(table(unlist(.SD))>1),by=.(1:nrow(test))][]
a b c duplicatercds
1: 1 1 5 1
2: 2 4 6 0
3: 3 6 9 0
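Note that counting values with a frequency above 1 is not the same quantity as "number of records minus number of distinct values"; the two only agree when no value occurs more than twice and at most one value repeats in the way the toy data does. A small sketch with made-up numbers:

```r
# One row's values: 1 appears three times, 5 appears twice
x <- c(1, 1, 1, 5, 5)
tab <- table(x)

extra_records   <- sum(tab) - length(tab)  # 5 records - 2 distinct values
values_repeated <- sum(tab > 1)            # how many distinct values repeat
```

Which of the two is the right "amount of duplicates" depends on what the count is meant to express.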
One solution is to use diff as part of apply function.
test <- data.table(a=c(1,2,3),b=c(1,4,6),c=c(5,6,9))
test$dup <- apply(test,1,function(x)sum(diff(sort(x))==0))
# a b c dup
# 1: 1 1 5 1
# 2: 2 4 6 0
# 3: 3 6 9 0
I am trying to generate a vector containing decreasing sequences of increasing length, such as 1, 2,1, 3,2,1, 4,3,2,1, 5,4,3,2,1, i.e.
c(1, 2:1, 3:1, 4:1, 5:1)
I tried to use a loop for this, but I don't know how to stack or concatenate the results.
for (i in 1:11) {
  x = rev(seq(i:1))
  print(x)
}
[1] 1
[1] 2 1
[1] 3 2 1
[1] 4 3 2 1
[1] 5 4 3 2 1
[1] 6 5 4 3 2 1
[1] 7 6 5 4 3 2 1
[1] 8 7 6 5 4 3 2 1
[1] 9 8 7 6 5 4 3 2 1
[1] 10 9 8 7 6 5 4 3 2 1
[1] 11 10 9 8 7 6 5 4 3 2 1
I have also been experimenting with rep, rev and seq, which are my preferred option, but I did not get far.
With sequence:
rev(sequence(5:1))
# [1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1
From R 4.0.0 sequence takes arguments from and by:
sequence(1:5, from = 1:5, by = -1)
# [1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1
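On R versions before 4.0.0, where `sequence()` has no `from` and `by` arguments, a similar result can be obtained by subtracting an increasing within-group counter from the group maximum. This is a sketch of that arithmetic, not something taken from the answer above.

```r
n <- 5

# rep() gives the start of each block: 1, 2 2, 3 3 3, ...
# sequence() gives the position inside each block: 1, 1 2, 1 2 3, ...
out <- rep(seq_len(n), seq_len(n)) - sequence(seq_len(n)) + 1L
```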
Far from the golf minimalism of rev... However, if you wake up one morning and want to create such a sequence with n = 1000 (like in the answer below), the latter is in fact faster (but I can hear Brian Ripley in fortunes::fortune(98))
n = 1000
microbenchmark::microbenchmark(
  f_rev = rev(sequence(n:1)),
  f_seq4.0.0 = sequence(1:n, from = 1:n, by = -1))
# Unit: microseconds
# expr min lq mean median uq max neval
# f_rev 993.7 1040.3 1128.391 1076.95 1133.3 1904.7 100
# f_seq4.0.0 136.4 141.5 153.778 148.25 150.1 304.7 100
We can do this with lapply
unlist(lapply(1:11, function(x) rev(seq(x))))
Or, as @zx8754 mentioned in the comments, x:1 can be used in place of rev(seq(x))
unlist(lapply(1:11, function(x) x:1))
Or, as @BrodieG suggested, we can make this more compact by removing the anonymous function call
unlist(lapply(1:11, ":", 1))
And for fun, using matrices (and ignoring the warning ;) )
m <- matrix(c(1:5,0), ncol = 5, nrow = 5, byrow = T)
m[ upper.tri(m, diag = T) ]
# [1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1
And we can simplify the upper.tri into its component parts
m[ row(m) <= col(m)]
# [1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1
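The matrix trick works because recycling six values into rows of five shifts each row one step to the right, so column j of the upper triangle reads j, j-1, ..., 1. A warning-free way to build a matrix with the same upper triangle is sketched below; the use of `outer()` here is my own variation, not part of the answer.

```r
n <- 5

# Entry (i, j) is j - i + 1, so column j holds j, j-1, ..., 1 on and above the diagonal
m <- outer(seq_len(n), seq_len(n), function(i, j) j - i + 1L)

# Column-major extraction of the upper triangle, diagonal included
out <- m[row(m) <= col(m)]
```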
And if you can handle even more fun, then how about some benchmarking:
library(microbenchmark)
maxValue <- 1000
vec2 <- maxValue:1
m2 <- matrix(c(1:maxValue,0), ncol = maxValue, nrow = maxValue, byrow = T)
microbenchmark(
  henrik = {
    rev(sequence(maxValue:1))
  },
  henrik_4.0.0 = {
    sequence(1:maxValue, from = 1:maxValue, by = -1)
  },
  akrun = {
    unlist(lapply(1:maxValue, function(x) x:1))
  },
  symbolix1 = {
    m <- matrix(c(1:maxValue,0), ncol = maxValue, nrow = maxValue, byrow = T)
    m[ row(m) <= col(m) ]
  },
  symbolix2 = {
    m2[ row(m2) <= col(m2) ]
  },
  lmo1 = {
    unlist(lapply(1:maxValue, tail, x=maxValue:1))
  },
  lmo2 = {
    vec <- maxValue:1
    unlist(lapply(rev(vec), tail, x=vec))
  },
  lmo3 = {
    unlist(lapply(rev(vec2), tail, x=vec2))
  }
)
# Unit: microseconds
# expr min lq mean median uq max neval
# henrik 1018.7 1068.20 1176.430 1103.65 1223.20 2348.4 100
# henrik_4.0.0 139.9 147.90 166.092 151.40 162.70 379.0 100
# akrun 3420.1 3637.75 3825.336 3729.10 3897.00 4960.6 100
# symbolix1 6999.5 7483.20 7807.747 7618.30 7810.70 12138.7 100
# symbolix2 4791.2 5043.00 5677.742 5190.50 5393.65 29318.7 100
# lmo1 7530.1 7967.05 10918.201 8161.10 8566.45 132324.1 100
# lmo2 7385.7 8017.95 12271.158 8213.90 8500.70 143798.2 100
# lmo3 7539.5 7959.05 14355.810 8177.85 8500.85 131154.2 100
In this example, henrik_4.0.0 is the winner! (for bm with pre-R 4.0.0 sequence only, see previous edits)
But I know what you're thinking, 'why end all the fun there!'
Well, let's write our own C++ function and see how that performs
library(Rcpp)
cppFunction('NumericVector reverseSequence(int maxValue, int vectorLength){
  NumericVector out(vectorLength);
  int counter = 0;
  for(int i = 1; i <= maxValue; i++){
    for(int j = i; j > 0; j--){
      out[counter] = j;
      counter++;  // move to the next output slot
    }
  }
  return out;
}')
maxValue <- 5
reverseSequence(maxValue, sum(1:maxValue))
[1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1
maxValue <- 1000
microbenchmark(
  akrun = {
    unlist(sapply(1:maxValue, function(x) x:1))
  },
  symbolix3 = {
    reverseSequence(maxValue, sum(1:maxValue))
  }
)
# Unit: microseconds
# expr min lq mean median uq max neval
# akrun 1522.250 1631.6030 3148.922 1829.9370 3357.493 45576.148 100
# symbolix3 338.626 495.3825 1293.720 950.6635 2169.656 3816.091 100
Another alternative is to use tail within lapply, to successively select the number of elements to keep from the initial vector:
unlist(lapply(1:5, tail, x=5:1))
[1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1
Or, it may be faster to construct the base vector first and then call on it:
vec <- 5:1
unlist(lapply(rev(vec), tail, x=vec))
[1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1
I have a data.table like this:
dt<-data.table(v1=rep(c('a','b','c'),4), v2=rep(c(1,2),6))
v1 v2
1: a 1
2: b 2
3: c 1
4: a 2
5: b 1
6: c 2
7: a 1
8: b 2
9: c 1
10: a 2
11: b 1
12: c 2
I need to remove all rows that have the same v1 but different v2 (except the first row in each combination of v1 and v2). In this example, rows 4-6 and 10-12 should be removed. How can I do this?
This works I think:
dt[, v2[v2 == v2[1]], by = v1]
# v1 V1
#1: a 1
#2: a 1
#3: b 2
#4: b 2
#5: c 1
#6: c 1
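For comparison, the same filter (keep rows whose v2 equals the first v2 seen for that v1) can be sketched in base R with `ave()`. This is an illustrative equivalent on a plain data.frame, and it keeps the original row order instead of grouping by v1.

```r
df <- data.frame(v1 = rep(c("a", "b", "c"), 4), v2 = rep(c(1, 2), 6))

# For every row, the first v2 value that occurs within its v1 group
first_v2 <- ave(df$v2, df$v1, FUN = function(x) x[1])

# Keep only the rows that match their group's first v2
kept <- df[df$v2 == first_v2, ]
```

Rows 4-6 and 10-12 are dropped, as requested in the question.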
How about this?
tmp = dt[dt[, list(I=.I[1]), by=list(v1)]$I]
v1 v2
1: a 1
2: a 1
3: b 2
4: b 2
5: c 1
6: c 1
Bigger data and benchmarking:
# create some data
ff <- function() paste0(sample(letters, sample(5:8, 1), TRUE), collapse="")
ll <- unique(replicate(1e4, ff()))
DT <- data.table(v1=sample(ll, 1e6, TRUE), v2=sample(1:1e4, 1e6, TRUE))
# add functions
eddi <- function(dt=copy(DT)) {
  dt[, list(v2=v2[v2 == v2[1]]), by = v1]
}
andrey <- function(dt=copy(DT)) {
  dt[, .SD[v2 == v2[1],], by = v1]
}
arun <- function(dt=copy(DT)) {
  tmp = dt[dt[, list(I=.I[1]), by=list(v1)]$I]
}
# benchmark
microbenchmark(a1 <- eddi(), a2 <- andrey(), a3 <- arun(), times=2)
Unit: milliseconds
expr min lq median uq max neval
a1 <- eddi() 342.4429 342.4429 348.1604 353.8780 353.8780 2
a2 <- andrey() 5810.8947 5810.8947 5829.0742 5847.2537 5847.2537 2
a3 <- arun() 494.6861 494.6861 509.3022 523.9182 523.9182 2
setkey(a3, NULL)
> identical(a1, a2) # [1] TRUE
> identical(a1, a3) # [1] TRUE
You could try using the mult argument. I'm not sure if the setkeyv will affect the rows that you are selecting though, please check that before you use it -
firstocc <- dt[unique(dt),,mult="first"][,v2.1 := NULL]
For a given dataframe, I'd like to split it based on some boolean value, and then apply a label to that row and the previous rows up until that point.
Assuming the following dataframe:
test <- data.frame(x = 1:10, y = c(F, F, F, T, F, F, T, F, F, F))
I'd ultimately like to create a new column that would contain a label for that specific portion of the dataframe. Ideally, something like the following:
x y z
1 F 1
2 F 1
3 F 1
4 T 1
5 F 2
6 F 2
7 T 2
8 F 3
9 F 3
10 F 3
My current thought is that I need to loop through the dataframe with a function similar to the following (but not exactly):
label.portion <- function(test) {
  for (i in 1:nrow(test)) {
    z <- 1
    if(test$y[i]) { z <- z + 1 }
  }
}
What is the best/easiest way of doing this? Any help is much appreciated.
Your z column can be built as
z <- with(test, sum(y)-rev(cumsum(rev(y)))+1)
in order to make every new z value start at a FALSE y after a TRUE y, as per your example.
Then you can do cbind(test, z) to get what you want.
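Spelling out the intermediate vectors may make the expression easier to follow. A short sketch using the example's `y` (the variable names are mine):

```r
y <- c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

# Number of TRUE values at or after each position
trues_from_here <- rev(cumsum(rev(y)))

# Total TRUEs minus those still ahead gives the count of TRUEs strictly before each row
z <- sum(y) - trues_from_here + 1
```

Because a TRUE row still counts itself in `trues_from_here`, it stays in the same group as the FALSE rows leading up to it.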
One liner solution using transform
transform(test,z= cumsum(c(0,diff(y)) == -1)+1)
x y z
1 1 FALSE 1
2 2 FALSE 1
3 3 FALSE 1
4 4 TRUE 1
5 5 FALSE 2
6 6 FALSE 2
7 7 TRUE 2
8 8 FALSE 3
9 9 FALSE 3
10 10 FALSE 3
Another one liner solution which will be slightly faster than other solutions (except data.table)
test <- data.frame(x = 1:10, y = c(F, F, F, T, F, F, T, F, F, F))
test$z <- c(1, head(cumsum(test$y), -1) + 1)
## x y z
## 1 1 FALSE 1
## 2 2 FALSE 1
## 3 3 FALSE 1
## 4 4 TRUE 1
## 5 5 FALSE 2
## 6 6 FALSE 2
## 7 7 TRUE 2
## 8 8 FALSE 3
## 9 9 FALSE 3
## 10 10 FALSE 3
Benchmarks with other solutions provided (excluding data.table)
test <- data.frame(x = 1:1e+05, y = sample(c(T, F), size = 1e+05, replace = TRUE))
microbenchmark(c(1, head(cumsum(test$y), -1) + 1), cumsum(c(0, diff(test$y)) == -1) + 1, with(test, sum(y) - rev(cumsum(rev(y))) +
1), times = 100)
## Unit: milliseconds
## expr min lq median uq max neval
## c(1, head(cumsum(test$y), -1) + 1) 1.685473 1.758474 1.865409 4.647218 5.091512 100
## cumsum(c(0, diff(test$y)) == -1) + 1 4.064867 4.379714 6.936561 7.338810 7.657961 100
## with(test, sum(y) - rev(cumsum(rev(y))) + 1) 2.568766 2.720395 5.396096 5.701176 30.642436 100
Here is an approach using na.locf from xts and data.table for coding elegance (and efficiency)
library(xts) # for na.locf
test <- data.table(test)
test[(y), grp := seq_along(y)][, grp := na.locf(grp, fromLast = TRUE)]
test[is.na(grp), grp := max(test[, grp], na.rm =TRUE) + 1L]
And a far clearer and faster approach
test[, grp := {xx <- diff(c(0,.I[y], length(.I))); rep.int(seq_along(xx),xx)}]
Note that diff uses a for loop implemented in R, so an Rcpp sugar implementation would be faster (I'm sure that a cpp function would blow most of these out of the water).
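The run-length construction used in the last data.table line can be sketched in base R, with `which(y)` standing in for `.I[y]`. The names below are illustrative.

```r
y <- c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

# Positions where a group ends, bracketed by 0 and the vector length
breaks <- c(0, which(y), length(y))

# Distance between consecutive breaks is the size of each group
run_lengths <- diff(breaks)

# Repeat each group label as many times as its group is long
grp <- rep.int(seq_along(run_lengths), run_lengths)
```

If the last element of `y` is TRUE, the final run length is 0 and `rep.int()` simply emits nothing for it.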