A lot comes together in this question. First of all, I would like to segment the data by column c. The subsets are given by the factor c, whose levels are 1 to 4, so there are 4 distinct segments.
Next, I have two columns, a and b.
I would like to replace the NAs with the maximum value of the same column within each segment. For example, the NA at row 2 in column 'a' would become 30, (b,3) would be 80, (b,8) would be 50 and (a,5) would be 80.
I have created the code below that does the job, but now I need to make it automatic (like a for loop) for all segments and columns. How could I do this?
a <- c(10,NA,30,40,NA,60,70,80,90,90,80,90,10,40)
b <- c(80,70,NA,50,40,30,20,NA,0,0,10,69, 40, 90)
c <- c(1,1,1,2,2,2,2,2,3,3,3,4,4,4)
a b c
1: 10 80 1
2: NA 70 1
3: 30 NA 1
4: 40 50 2
5: NA 40 2
6: 60 30 2
7: 70 20 2
8: 80 NA 2
9: 90 0 3
10: 90 0 3
11: 80 10 3
12: 90 69 4
13: 10 40 4
14: 40 90 4
mytable <- data.table(a,b,c)
mytable[which(is.na(mytable[c == 1][,1, with = FALSE]) == TRUE),1] <- max(mytable[c==1,1], na.rm = TRUE)
Unfortunately, this try results in an error:
for(i in unique(mytable$c)){
  for(j in unique(c(1:2))){
    mytable[which(is.na(mytable[c == i][,j, with = FALSE]) == TRUE),j, with = FALSE] <- max(mytable[c==i][,j, with = FALSE], na.rm = TRUE)
  }
}
Error in `[<-.data.table`(`*tmp*`, which(is.na(mytable[c == i][, j, with = FALSE]) == :
unused argument (with = FALSE)
Surprisingly, this results in an error as well:
for(i in unique(mytable$c)){
  for(j in unique(c(1:2))){
    mytable[which(is.na(mytable[c == i][,j]) == TRUE),j] <- max(mytable[c==i,j], na.rm = TRUE)
  }
}
Error in `[.data.table`(mytable, c == i, j) :
j (the 2nd argument inside [...]) is a single symbol but column name 'j' is not found. Perhaps you intended DT[,..j] or DT[,j,with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1.
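mytable <- data.table(
  a=c(10,NA,30,40,NA,60,70,80,90,90,80,90,10,40),
  b=c(80,70,NA,50,40,30,20,NA,0,0,10,69, 40, 90),
  c=c(1,1,1,2,2,2,2,2,3,3,3,4,4,4))
# replace only the NA entries of x with the maximum of the non-missing values
foo <- function(x) { x[is.na(x)] <- max(x, na.rm=TRUE); x }
> mytable[, .(A=foo(a), B=foo(b)), by=c]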
# c A B
# 1: 1 10 80
# 2: 1 30 70
# 3: 1 30 80
# 4: 2 40 50
# 5: 2 80 40
# 6: 2 60 30
# 7: 2 70 20
# 8: 2 80 50
# 9: 3 90 0
#10: 3 90 0
#11: 3 80 10
#12: 4 90 69
#13: 4 10 40
#14: 4 40 90
or for direct substitution of a and b:
mytable[, `:=`(a=foo(a), b=foo(b)), by=c] # or
mytable[, c("a", "b") := (lapply(.SD, foo)), by = c] # from @Sotos
or the safer variant (thanks to @Frank for the remark):
cols <- c("a", "b")
mytable[, (cols) := lapply(.SD, foo), by=c, .SDcols=cols]
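For reference, here is a self-contained sketch of the grouped replacement on the question's data. The helper name fill_na_with_max is made up for this illustration; it does the same thing as a function that overwrites only the NA positions with the group maximum.

```r
library(data.table)

# Sample data from the question
mytable <- data.table(
  a = c(10, NA, 30, 40, NA, 60, 70, 80, 90, 90, 80, 90, 10, 40),
  b = c(80, 70, NA, 50, 40, 30, 20, NA, 0, 0, 10, 69, 40, 90),
  c = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4)
)

# Hypothetical helper: overwrite only the NA entries of x
# with the maximum of the non-missing entries
fill_na_with_max <- function(x) {
  x[is.na(x)] <- max(x, na.rm = TRUE)
  x
}

# Apply the helper to each listed column, separately within each level of c,
# and update the columns by reference
cols <- c("a", "b")
mytable[, (cols) := lapply(.SD, fill_na_with_max), by = c, .SDcols = cols]
```

Note that if a segment contains only NAs in a column, max(x, na.rm = TRUE) returns -Inf with a warning, so such segments may need separate handling.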
Using data.table:
mytable[, a := ifelse(is.na(a), max(a, na.rm = TRUE), a), by = c]
mytable[, b := ifelse(is.na(b), max(b, na.rm = TRUE), b), by = c]
Or in a single command:
mytable[, c("a", "b") := lapply(.SD, function(x) ifelse(is.na(x), max(x, na.rm = TRUE), x)), .SDcols = c("a", "b"), by = c]
Use ddply() from package plyr:
df2<-ddply(df, .(c), transform, a=ifelse(is.na(a), max(a, na.rm=T),a),
b=ifelse(is.na(b), max(b, na.rm=T),b))
I am new to programming and got stuck. I want to calculate the hourly temperature variation of an object throughout the year using some variables that change every hour. The original data contains 60 columns and 8760 rows for the calculation.
I got the desired output using a for loop, but the model takes a long time to run. I wonder if there is a way to replace the loop with functions, which I suspect would also speed up the calculation.
Here is a small reproducible example to show what I did.
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
A B C
1: 1 1 10
2: 1 2 10
3: 1 3 10
4: 1 4 10
5: 1 5 10
The for loop:
for (j in (2: nrow(table))) {
  table$A[j] = (table$A[j-1] + table$B[j-1]) * table$B[j]
  table$C[j] = table$B[j] * table$A[j]
}
I got the output as I desired:
A B C
1: 1 1 10
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
However, in my real case (not this toy example) the whole program took 15 minutes to run.
So I tried to use a function instead of the for loop.
I tried this:
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
library(dplyr)
myfun <- function(df){
  df = df %>% mutate(A = (lag(A) + lag(B)) * B,
                     C = B * A)
  df
}
myfun(table)
But the output was
A B C
1 NA 1 NA
2 4 2 8
3 9 3 27
4 16 4 64
5 25 5 125
It seems that the function refers to the rows of the original table, not to the rows updated during the calculation. Is there any way to obtain the desired output using functions? This is my first R project, so any help is very much appreciated. Thank you.
A much faster alternative using data.table. Note that the calculation of C can be separated from the calculation of A so we can do less within the loop:
for (i in 2:nrow(table)) {
  set(table, i = i, j = "A", value = with(table, (A[i-1] + B[i-1]) * B[i]))
}
table[-1, C := A * B]
# A B C
# <num> <int> <num>
# 1: 1 1 10
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200
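Another way to keep the loop cheap, sketched here under the assumption that the recurrence is exactly the one in the question, is to run it on plain vectors and write the result back once. The names dt, a_vec and b_vec are made up for this illustration.

```r
library(data.table)

dt <- data.table(A = 1, B = 1:5, C = 10)

# Pull the columns out as plain vectors so the loop never touches the table
a_vec <- dt$A
b_vec <- dt$B

# Each A depends on the previous (already updated) A, so this step stays sequential.
# Assumes the table has at least two rows.
for (j in 2:length(a_vec)) {
  a_vec[j] <- (a_vec[j - 1] + b_vec[j - 1]) * b_vec[j]
}

# Write A back once, then compute C for every row except the first in one vectorised step
dt[, A := a_vec]
dt[-1, C := A * B]
```

Only A is truly recursive; C depends on the finished A and B, so it does not need to be inside the loop at all.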
You can try Reduce like below
dt[, A := Reduce(function(x, Y) (x + Y[2]) * Y[1],
  asplit(embed(B, 2), 1),   # each element is c(B[j], B[j-1])
  init = A[1],
  accumulate = TRUE
)][, C := A * B]
which updates dt as
> dt
A B C
1: 1 1 1
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
Data
dt <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
Here's a solution using purrr::accumulate2 which lets you use the result of the previous computation as the input to the next one:
library(purrr)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table$A <- accumulate2(
  table$B[-1], seq_len(nrow(table) - 1),   # ..2 is unused; ..3 is the row index
  ~ (..1 + table$B[..3]) * table$B[..3 + 1],
  .init = table$A[1]
) %>%
  unlist()
table$C <- table$B * table$A
# A B C
# 1: 1 1 1
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200
data = data.table("cat" = c(0,5,NA,0,0,0),
"horse" = c(0,4,2,1,1,3),
"fox" = c(2,2,NA,NA,7,0))
I wish to replace values of 'cat' and 'fox' that are equal to 0 or 2 with -99.
I can do it one column at a time, but how can I do both at once?
data[fox == 0 | fox == 2, fox := -99]
Another approach with data.table is using a for(...) set(...)-approach, which is in this case both fast and memory efficient:
cols <- c('fox', 'cat')
# option 1
for (j in cols) d[get(j) %in% c(0, 2), (j) := -99]
# option 2 (thx to @Cole for highlighting)
for (j in cols) set(d, which(d[[j]] %in% c(0, 2)), j, value = -99)
# option 3 (thx to @Frank for highlighting)
for (j in cols) d[.(c(0,2)), on = j, (j) := -99]
which gives:
> d
cat horse fox
1: -99 0 -99
2: 5 4 -99
3: NA 2 NA
4: -99 1 NA
5: -99 1 7
6: -99 3 -99
d <- data.table("cat" = c(0,5,NA,0,0,0),
"horse" = c(0,4,2,1,1,3),
"fox" = c(2,2,NA,NA,7,0))
Here's a not-so-elegant way of doing this:
> data
cat horse fox
1: 0 0 2
2: 5 4 2
3: NA 2 NA
4: 0 1 NA
5: 0 1 7
6: 0 3 0
> data[, c('fox', 'cat') := list(ifelse(fox %in% c(0,2), -99, fox), ifelse(cat %in% c(0,2), -99, cat))]
> data
cat horse fox
1: -99 0 -99
2: 5 4 -99
3: NA 2 NA
4: -99 1 NA
5: -99 1 7
6: -99 3 -99
I'm naming the columns explicitly with c('fox', 'cat'), but you could save them in a variable such as mycols and assign with the := operator, wrapping the variable in parentheses: data[, (mycols) := ...]
Similarly, I'm passing a list explicitly based on the conditions; this could be better done using a function instead.
If I understand, this would work as well:
cols = c("cat", "fox")
data[, (cols) := lapply(.SD, function (x) fifelse(x %in% c(0, 2), -99, x)), .SDcols = cols]
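If this replacement is needed in several places, it can be wrapped in a small helper. The sketch below is only an illustration: the function name replace_values and its arguments are made up here, and it relies on data.table's set() to update the table by reference.

```r
library(data.table)

# Hypothetical helper: in each of `cols`, replace any value found in `targets`
# with `replacement`, modifying `dt` by reference
replace_values <- function(dt, cols, targets, replacement) {
  for (col in cols) {
    # %in% returns FALSE for NA, so missing values are left untouched
    rows <- which(dt[[col]] %in% targets)
    set(dt, i = rows, j = col, value = replacement)
  }
  invisible(dt)
}

d <- data.table(cat = c(0, 5, NA, 0, 0, 0),
                horse = c(0, 4, 2, 1, 1, 3),
                fox = c(2, 2, NA, NA, 7, 0))

replace_values(d, cols = c("cat", "fox"), targets = c(0, 2), replacement = -99)
```

Using %in% rather than == matters here: a comparison such as fox == 0 yields NA for missing values, whereas %in% yields FALSE, so which() never has to deal with NA conditions.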
There are a number of questions here about repeating rows a prespecified number of times in R, but I can't find one to address the specific question I'm asking.
I have a dataframe of responses from a survey in which each respondent answers somewhere between 5 and 10 questions. As a toy example:
df <- data.frame(ID = rep(1:2, each = 5),
Response = sample(LETTERS[1:4], 10, replace = TRUE),
Weight = rep(c(2,3), each = 5))
> df
ID Response Weight
1 1 D 2
2 1 C 2
3 1 D 2
4 1 D 2
5 1 B 2
6 2 D 3
7 2 C 3
8 2 B 3
9 2 D 3
10 2 B 3
I would like to repeat respondent 1's answers twice, as a block, and then respondent 2's answers 3 times, as a block, and I want each block of responses to have a unique ID. In other words, I want the end result to look like this:
ID Response Weight
1 11 D 2
2 11 C 2
3 11 D 2
4 11 D 2
5 11 B 2
6 12 D 2
7 12 C 2
8 12 D 2
9 12 D 2
10 12 B 2
11 21 D 3
12 21 C 3
13 21 B 3
14 21 D 3
15 21 B 3
16 22 D 3
17 22 C 3
18 22 B 3
19 22 D 3
20 22 B 3
21 23 D 3
22 23 C 3
23 23 B 3
24 23 D 3
25 23 B 3
The way I'm doing this is currently really clunky, and, given that I have >3000 respondents in my dataset, is unbearably slow.
Here's my code:
df.expanded <- NULL
for(i in unique(df$ID)) {
  x <- df[df$ID == i,]
  y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
  y$order <- rep(1:max(x$Weight), nrow(x))
  y <- y[with(y, order(order)),]
  y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
  df.expanded <- rbind(df.expanded, y)
}
Is there a faster way to do this?
There is an easier solution. I suppose you want to duplicate rows based on Weight as shown in your code.
df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
# sort the rows
df2 <- df2[order(df2$ID), ]
Is this method faster? Let's see:
library(microbenchmark)
microbenchmark(
  m1 = {
    df.expanded <- NULL
    for(i in unique(df$ID)) {
      x <- df[df$ID == i,]
      y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
      y$order <- rep(1:max(x$Weight), nrow(x))
      y <- y[with(y, order(order)),]
      y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
      df.expanded <- rbind(df.expanded, y)
    }
  },
  m2 = {
    df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
    df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
    # sort the rows
    df2 <- df2[order(df2$ID), ]
  }
)
# Unit: microseconds
# expr min lq mean median uq max neval
# m1 806.295 862.460 1101.6672 921.0690 1283.387 2588.730 100
# m2 171.731 194.199 245.7246 214.3725 283.145 506.184 100
There might be other more efficient ways.
Another approach would be to use data.table.
Assuming you're starting with "DT" as your data.table, try:
DT[, list(.id = rep(seq(Weight[1]), each = .N), Weight, Response), .(ID)]
I haven't pasted the ID columns together, but instead, created a secondary column. That seems a little bit more flexible to me.
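To make the block-wise repetition easy to check, here is a small deterministic sketch of the same idea. The data and the column name block are made up for this illustration, and every column is repeated explicitly with rep() so the result does not depend on recycling inside j.

```r
library(data.table)

# Toy data: respondent 1 has 2 answers and weight 2,
# respondent 2 has 3 answers and weight 3
DT <- data.table(ID = c(1, 1, 2, 2, 2),
                 Response = c("D", "C", "B", "A", "D"),
                 Weight = c(2, 2, 3, 3, 3))

# For each respondent, repeat the whole block of answers Weight[1] times;
# `block` numbers the copies 1, 2, ... within each respondent
out <- DT[, .(block = rep(seq_len(Weight[1]), each = .N),
              Response = rep(Response, times = Weight[1]),
              Weight = rep(Weight, times = Weight[1])),
          by = ID]
```

Keeping the copy number in its own column means a combined identifier can still be built afterwards, for example with paste0(ID, block), if a single key is really needed.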
Data for testing. Change n to create a larger dataset to play with.
n <- 5
weights <- sample(3:15, n, TRUE)
df <- data.frame(ID = rep(seq_along(weights), weights),
Response = sample(LETTERS[1:5], sum(weights), TRUE),
Weight = rep(weights, weights))
DT <- as.data.table(df)
I'd like to combine/pair multiple columns in a data frame as pairs of column cells in the same row. As an example, df1 should be transformed to df2.
df1:
col1 col2 col3
1 2 3
0 0 1
df2:
c1 c2
1 2
1 3
2 3
0 0
0 1
0 1
The solution should be scalable for df1s with (way) more than three columns.
I thought about melt/reshape/dcast but found no solution yet. There are no NAs in the data frame. Thank you!
EDIT: Reshape just produced errors, so I thought about
comb1 <- combn(df1[1,], 2)
comb2 <- t(comb1)
and looping over all rows, appending the results. This is inefficient, considering there are 2 million rows.
Here's the approach I would take.
Create a function that uses rbindlist from "data.table" and combn from base R. The function looks like this:
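lengthener <- function(indf) {
  # one two-column data.frame per pair of columns, stacked with an id column
  temp <- rbindlist(
    combn(names(indf), 2, FUN = function(x) indf[x], simplify = FALSE),
    use.names = FALSE, idcol = TRUE)
  # reorder so that all pairs from the same source row are adjacent
  setorder(temp[, .id := sequence(.N), by = .id], .id)[, .id := NULL][]
}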
Here's the sample data from the other answer, and the application of the function on it:
df1 = as.data.frame(matrix(c(1,2,3,4,0,0,1,1), byrow = TRUE, nrow = 2))
lengthener(df1)
# V1 V2
# 1: 1 2
# 2: 1 3
# 3: 1 4
# 4: 2 3
# 5: 2 4
# 6: 3 4
# 7: 0 0
# 8: 0 1
# 9: 0 1
# 10: 0 1
# 11: 0 1
# 12: 1 1
Test it out on some larger data too:
M <-, 100*100, TRUE), 100))
system.time(out <- lengthener(M))
# user system elapsed
# 0.19 0.00 0.19
# V1 V2
# 1: 27 66
# 2: 27 27
# 3: 27 68
# 4: 27 66
# 5: 27 56
# ---
# 494996: 33 13
# 494997: 33 66
# 494998: 80 13
# 494999: 80 66
# 495000: 13 66
System time for the other approach:
funAMK <- function(indf) {
  nrow_combn = nrow(t(combn(indf[1,], m = 2)))
  nrow_df = nrow(indf) * nrow_combn
  df2 = data.frame(V1 = rep(0, nrow_df), V2 = rep(0, nrow_df))
  for(i in 1:nrow(indf)){
    df2[(((i-1)*nrow_combn)+1):(i*(nrow_combn)), ] = data.frame(t(combn(indf[i,], m = 2)))
  }
  df2
}
> system.time(funAMK(M))
user system elapsed
16.03 0.16 16.37
Your edit is very similar to my answer below, you just need to rbind the result each iteration over the rows of df1. Using data.table is a good way to speed up rbind, see this answer for more.
EDIT: Unfortunately, when I switched to the data.table approach, it turned out that the rbindlist() led the answer to be wrong (as pointed out in the comment below). Therefore, although it may be slightly slower, I think that preallocating a data frame and using rbind may be the best option.
EDIT2: switched the preallocated df to a more general number of rows.
df1 = as.data.frame(matrix(c(1,2,3,4,0,0,1,1), byrow = TRUE, nrow = 2))
nrow_combn = nrow(t(combn(df1[1,], m = 2)))
nrow_df = nrow(df1) * nrow_combn
df2 = data.frame(V1 = rep(0, nrow_df), V2 = rep(0, nrow_df))
for(i in 1:nrow(df1)){
  df2[(((i-1)*nrow_combn)+1):(i*(nrow_combn)), ] = data.frame(t(combn(df1[i,], m = 2)))
}
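The row-by-row loop can also be avoided in base R by computing the column pairs once and indexing the whole matrix with them. This is only a sketch under the assumption that all columns share one type (so as.matrix() does not coerce anything unexpectedly); the names m and idx are made up for this illustration.

```r
# Sample data from the question
df1 <- data.frame(col1 = c(1, 0), col2 = c(2, 0), col3 = c(3, 1))

m <- as.matrix(df1)
# Each column of idx holds one pair of column positions: (1,2), (1,3), (2,3)
idx <- combn(ncol(m), 2)

# Pick the first and second member of every pair for all rows at once.
# Transposing before flattening puts all pairs of row 1 first, then row 2, ...
df2 <- data.frame(
  c1 = as.vector(t(m[, idx[1, ], drop = FALSE])),
  c2 = as.vector(t(m[, idx[2, ], drop = FALSE]))
)
```

Because combn() is called once on the column positions instead of once per row, the work per row reduces to matrix indexing, which should scale far better to millions of rows than a loop over rows.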
I am using a data.table to store data. I am trying to figure out whether certain columns in each row are unique. I want to add a column to the data.table that will hold a label such as "Has Dups" if there are duplicated values and be NA if there are no duplicated values. The names of the columns that I want to check for duplication are stored in a character vector. For example, I create my data.table:
tmpdt<-data.table(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
> tmpdt
a b c d
1: 1 2 4 3
2: 2 2 2 3
3: 3 3 2 1
4: 4 4 4 4
5: 5 5 4 5
I have another variable that indicates which columns I need to check for duplicates. It is important that I be able to store the column names in a character vector and not need to "know" them (because they will be passed as an argument to a function).
dupcheckcols<-c("a", "c", "d")
I want the output to be:
> tmpdt
a b c d Dups
1: 1 2 4 3 <NA>
2: 2 2 2 3 Has Dups
3: 3 3 2 1 <NA>
4: 4 4 4 4 Has Dups
5: 5 5 4 5 Has Dups
If I were using a data.frame, this would be easy. I could simply use:
tmpdt<-data.frame(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
tmpdt$Dups[apply(tmpdt[,dupcheckcols], 1, function(x) {return(sum(duplicated(x))>0)})]<-"Has Dups"
> tmpdt
a b c d Dups
1 1 2 4 3 <NA>
2 2 2 2 3 Has Dups
3 3 3 2 1 <NA>
4 4 4 4 4 Has Dups
5 5 5 4 5 Has Dups
But I can't figure out how to accomplish the same task with a data.table. Any help is greatly appreciated.
I'm sure there are other ways
tmpdt[, dups := tmpdt[, dupcheckcols, with=FALSE][, apply(.SD, 1, function(x){sum(duplicated(x))>0})] ]
# a b c d dups
#1: 1 2 4 3 FALSE
#2: 2 2 2 3 TRUE
#3: 3 3 2 1 FALSE
#4: 4 4 4 4 TRUE
#5: 5 5 4 5 TRUE
A more convoluted, but slightly quicker (in computational terms) method would be to construct the filter condition in i, then update in j by reference
expr <- paste(apply(t(combn(dupcheckcols,2)), 1, FUN=function(x){ paste0(x, collapse="==") }), collapse = "|")
# [1] "a==c|a==d|c==d"
expr <- parse(text=expr)
tmpdt[ eval(expr), dups := TRUE ]
# a b c d dups
#1: 1 2 4 3 NA
#2: 2 2 2 3 TRUE
#3: 3 3 2 1 NA
#4: 4 4 4 4 TRUE
#5: 5 5 4 5 TRUE
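The same filter condition can be assembled from language objects instead of pasted text, which avoids parse(). This is a sketch of that variation rather than the method above; the names pairs, comparisons, cond and hits are made up for this illustration.

```r
library(data.table)

tmpdt <- data.table(a = c(1, 2, 3, 4, 5), b = c(2, 2, 3, 4, 5),
                    c = c(4, 2, 2, 4, 4), d = c(3, 3, 1, 4, 5))
dupcheckcols <- c("a", "c", "d")

# All unordered pairs of the columns to check
pairs <- combn(dupcheckcols, 2, simplify = FALSE)

# One unevaluated `x == y` call per pair
comparisons <- lapply(pairs, function(p) call("==", as.name(p[1]), as.name(p[2])))

# Chain the comparisons together with `|`
cond <- Reduce(function(lhs, rhs) call("|", lhs, rhs), comparisons)

# Evaluate the condition with the columns of tmpdt as variables,
# then flag the matching rows by reference
hits <- eval(cond, tmpdt)
tmpdt[hits, dups := TRUE]
```

The number of comparisons still grows quadratically with the number of columns, so this only changes how the condition is built, not how much work it does.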
I was interested in speed benefits, so I've benchmarked these two plus Ananda's solution:
library(microbenchmark)
tmpdt<-data.table(a=c(1,2,3,4,5), b=c(2,2,3,4,5), c=c(4,2,2,4,4), d=c(3,3,1,4,5))
t1 <- tmpdt
t2 <- tmpdt
t3 <- tmpdt
expr <- paste(apply(t(combn(dupcheckcols,2)), 1, FUN=function(x){ paste0(x, collapse="==") }), collapse = "|")
expr <- parse(text=expr)
microbenchmark(
  #Ananda's solution
  t1[, dups := any(duplicated(unlist(.SD))), by = 1:nrow(tmpdt), .SDcols = dupcheckcols],
  t2[, dups := t2[, dupcheckcols, with=FALSE][, apply(.SD, 1, function(x){sum(duplicated(x))>0})] ],
  t3[ eval(expr), dups := TRUE ]
)
# min lq mean median uq max neval cld
# 531.416 552.5760 577.0345 565.182 573.2015 1761.863 100 b
#1277.569 1333.2615 1389.5857 1358.021 1387.9860 2694.951 100 c
# 265.872 283.3525 293.9362 292.487 301.1640 520.436 100 a
You should be able to do something like this:
tmpdt[, dups := any(duplicated(unlist(.SD, use.names = FALSE))),
by = 1:nrow(tmpdt), .SDcols = dupcheckcols]
# a b c d dups
# 1: 1 2 4 3 FALSE
# 2: 2 2 2 3 TRUE
# 3: 3 3 2 1 FALSE
# 4: 4 4 4 4 TRUE
# 5: 5 5 4 5 TRUE
Adjust accordingly if you really want the words "Has Dups", but note that it would probably be easier to use logical values, as in my answer here.
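If the text label from the question is really wanted, one possible adjustment is to compute the logical flag first and then assign the label only to the flagged rows. This is a sketch of that adjustment, not part of the answer above; the intermediate name has_dups is made up for this illustration.

```r
library(data.table)

tmpdt <- data.table(a = c(1, 2, 3, 4, 5), b = c(2, 2, 3, 4, 5),
                    c = c(4, 2, 2, 4, 4), d = c(3, 3, 1, 4, 5))
dupcheckcols <- c("a", "c", "d")

# anyDuplicated() returns 0 when a row has no repeated value,
# otherwise the position of the first repeat
has_dups <- tmpdt[, apply(.SD, 1, anyDuplicated) > 0, .SDcols = dupcheckcols]

# Assign the label only where a duplicate was found; other rows stay NA
tmpdt[has_dups, Dups := "Has Dups"]
```

Assigning through a row filter leaves the unflagged rows as NA, which matches the output requested in the question.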
I found a way to do this with Rcpp, following an example by hadley (under "Sets"):
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
#include <unordered_set>
using namespace Rcpp;
// [[Rcpp::export]]
LogicalVector anyDupCols(IntegerMatrix x) {
    int nr = x.nrow();
    int nc = x.ncol();
    LogicalVector out(nr, false);
    std::unordered_set<int> seen;
    for (int i = 0; i < nr; i++) {
        seen.clear();  // start each row with an empty set
        for (int j = 0; j < nc; j++){
            int xij = x(i,j);
            // stop scanning the row as soon as a value repeats
            if (seen.count(xij)){ out[i] = true; break; }
            else seen.insert(xij);
        }
    }
    return out;
}
To use it, put it in a cpp file and compile it by calling Rcpp::sourceCpp() on that file.
It does pretty well in benchmarks:
nc = 30
nv = nc^2
n = 1e4
DT = setDT( replicate(nc, sample(nv, n, replace = TRUE), simplify=FALSE) )
library(microbenchmark)
microbenchmark(
  ananda = DT[, any(duplicated(unlist(.SD, use.names = FALSE))), by = 1:nrow(DT)]$V1,
  tospig = {
    expr = parse(text=paste(apply(t(combn(names(DT),2)),1,FUN =
      function(x){ paste0(x, collapse="==") }), collapse = "|"))
    DT[, eval(expr)]
  },
  cpp = anyDupCols(as.matrix(DT)),
  alex = ff(DT),
  tscharf = apply(DT,1,function(row) any(duplicated(row))),
  unit = "relative", times = 10
)
Unit: relative
expr min lq mean median uq max neval cld
ananda 2.462739 2.596990 2.774660 2.659898 2.869048 3.352547 10 c
tospig 3.118158 3.253102 3.606263 3.424598 3.885561 4.583268 10 d
cpp 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
alex 1.295415 1.927802 1.914883 1.982580 2.029868 2.538143 10 b
tscharf 2.112286 2.204654 2.385318 2.234963 2.322206 2.978047 10 bc
If I go to nc = 50, @tospig's expr becomes too long for R to handle and I get node stack overflow, which is fun.
A one-liner with some elegance: define the columns, loop down the rows, and see if there are any dupes.
tmpdt[,dups:=apply(.SD,1,function(row) any(duplicated(row))),.SDcols = dupcheckcols]
> tmpdt
a b c d dups
1: 1 2 4 3 FALSE
2: 2 2 2 3 TRUE
3: 3 3 2 1 FALSE
4: 4 4 4 4 TRUE
5: 5 5 4 5 TRUE
Another way is to tabulate "tmpdt" along its rows and find which rows have more than one of an element:
tmpdt2 = tmpdt[, dupcheckcols, with = FALSE] # subset tmpdt
colSums(table(unlist(tmpdt2), row(tmpdt2)) > 1L) > 0L
#    1     2     3     4     5
#FALSE  TRUE FALSE  TRUE  TRUE
Peeking at table we could speed it up significantly with something like:
ff = function(x)
{
    # common set of values across all columns
    lvs = Reduce(union, lapply(x, function(X) if(is.factor(X)) levels(X) else unique(X)))
    x = lapply(x, function(X) match(X, lvs))
    nr = length(lvs); nc = length(x[[1L]])
    # count occurrences of each value within each row of the input
    tabs = "dim<-"(tabulate(unlist(x, use.names = FALSE) + (0:(nc - 1L)) * nr, nr * nc),
                   c(nr, nc))
    colSums(tabs > 1L) > 0L
}