As an example, I have the data.table shown below. I want to do a simple aggregation where b=sum(b). For c, however, I want the value of c from the record where b is maximum. The desired output is shown below (dt.aggr). This leads to a few questions:
1) Is there a way to do this in data.table?
2) Is there a simpler way to do this in plyr?
3) In plyr the output object got changed from a data.table to a data.frame. Can I avoid this behavior?
library(plyr)
library(data.table)
dt <- data.table(a=c('a', 'a', 'a', 'b', 'b'), b=c(1, 2, 3, 4, 5),
c=c('m', 'n', 'p', 'q', 'r'))
dt
# a b c
# 1: a 1 m
# 2: a 2 n
# 3: a 3 p
# 4: b 4 q
# 5: b 5 r
dt.split <- split(dt, dt$a)
dt.aggr <- ldply(lapply(dt.split,
FUN=function(dt){ dt[, .(b=sum(b), c=dt[b==max(b), c]),
by=.(a)] }), .id='a')
dt.aggr
# a b c
# 1 a 6 p
# 2 b 9 r
class(dt.aggr)
# [1] "data.frame"
This is a simple operation within the data.table scope
dt[, .(b = sum(b), c = c[which.max(b)]), by = a]
# a b c
# 1: a 6 p
# 2: b 9 r
A similar option would be
dt[order(b), .(b = sum(b), c = c[.N]), by = a]
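The two forms above agree on the question's data, but they can differ when a group has ties at the maximum of b. A minimal sketch, assuming a hypothetical table dt_ties (not from the question) in which group 'a' has two rows sharing the largest b: which.max() picks the first tied row, while the order(b) form picks the last one, because the sort is stable.

```r
library(data.table)

# Hypothetical data (not from the question): group 'a' has a tie at b = 3
dt_ties <- data.table(a = c('a', 'a', 'a', 'b'),
                      b = c(1, 3, 3, 4),
                      c = c('m', 'n', 'p', 'q'))

# which.max() returns the position of the first maximum within each group
first_max <- dt_ties[, .(b = sum(b), c = c[which.max(b)]), by = a]

# after a stable sort on b, .N indexes the last of the tied rows
last_max <- dt_ties[order(b), .(b = sum(b), c = c[.N]), by = a]

first_max$c  # "n" "q"
last_max$c   # "p" "q"
```

Both results stay data.tables, so no conversion back from a data.frame is needed with either form.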
Sorry, I could not find the right answer among the questions about the warning "longer object length is not a multiple of shorter object length".
I have a dataframe like this
dt = data.frame(id = c(1,2,3,4,5), A=c('a', 'a', 'c', 'b','b'), B= c('d', 'd','h', 'd', 'd'))
And I want to get
id A B final
1 1 a d <NA>
2 2 a d d
3 3 c h c
4 4 b d b
5 5 b d d
I do
dt$A = ifelse(dt$A[dt$id] == dt$A[dt$id-1], as.character(dt$B[dt$id-1]), as.character(dt$A))
Warning message:
In dt$A[dt$id] == dt$A[dt$id - 1] :
longer object length is not a multiple of shorter object length
I can do
shift <- function(x, n){
c(x[-(seq(n))], rep(NA, n))
}
dt$sht <- shift(as.character(dt$A), 1)
dt$new = ifelse(dt$sht == dt$A, as.character(dt$B), as.character(dt$A[dt$id+1]))
temp = dt$new
temp=append(NA, temp)
temp = temp[-6]
dt$final = temp
dt[, c(1,2,3,6)]
id A B final
1 1 a d <NA>
2 2 a d d
3 3 c h c
4 4 b d b
5 5 b d d
But that is a long way round. I think the error in this formula can be corrected:
dt$A = ifelse(dt$A[dt$id] == dt$A[dt$id-1], as.character(dt$B[dt$id-1]), as.character(dt$A))
Or I will be grateful for any more convenient and shorter way.
Indexing in R starts from 1. When we take dt$id - 1, the value for id = 1 becomes 0, and indexing with 0 returns an empty vector:
dt$A[0]
#character(0)
This makes dt$A[dt$id - 1] one element shorter than dt$A[dt$id], so the arguments of ifelse differ in length.
ifelse(test, yes, no)
If yes or no are too short, their elements are recycled. yes will be evaluated if and only if any element of test is true, and analogously for no.
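The length mismatch can be checked directly in base R. The sketch below uses plain vectors with the question's values; prev_A and prev_B are illustrative names for hand-shifted copies, which is one possible way to keep every argument of ifelse at the same length.

```r
A  <- c('a', 'a', 'c', 'b', 'b')
B  <- c('d', 'd', 'h', 'd', 'd')
id <- 1:5

length(A[id])      # 5
length(A[id - 1])  # 4: the index 0 is silently dropped

# Shift by hand so both sides keep length 5 (prev_A, prev_B are illustrative names)
prev_A <- c(NA, head(A, -1))
prev_B <- c(NA, head(B, -1))

# The NA comparison in row 1 propagates to an NA result
final <- ifelse(A == prev_A, prev_B, A)
final  # NA "d" "c" "b" "d"
```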
Instead, we can make use of lag
library(dplyr)
dt %>%
mutate(final = case_when(A == lag(A, default = A[1]) ~ lag(B), TRUE ~ A))
# id A B final
#1 1 a d <NA>
#2 2 a d d
#3 3 c h c
#4 4 b d b
#5 5 b d d
Here, case_when can be replaced with ifelse too. According to ?case_when:
This function allows you to vectorise multiple if_else() statements.
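For readers who prefer data.table, the same lag logic can be sketched with data.table's own shift() (not the hand-written shift helper from the question) and fifelse(). This assumes a data.table version that provides fifelse (1.12.4 or later) and character columns rather than factors.

```r
library(data.table)

DT <- data.table(id = 1:5,
                 A = c('a', 'a', 'c', 'b', 'b'),
                 B = c('d', 'd', 'h', 'd', 'd'))

# shift() lags by one row by default and pads the first row with NA;
# fifelse() returns NA wherever the test itself is NA
DT[, final := fifelse(A == shift(A), shift(B), A)]
DT$final  # NA "d" "c" "b" "d"
```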
data
dt = data.frame(id = c(1,2,3,4,5), A=c('a', 'a', 'c', 'b','b'),
B= c('d', 'd','h', 'd', 'd'), stringsAsFactors = FALSE)
NOTE: before R 4.0.0, stringsAsFactors = TRUE was the default for data.frame. Setting it to FALSE (the default since R 4.0.0) avoids the repeated as.character conversions after the dataset is created.
Let's say I have the following data.table:
DT <- setDT(data.frame(id = 1:10, LETTERS = LETTERS[1:10],
letters = letters[1:10]))
## > DT
## id LETTERS letters
## 1: 1 A a
## 2: 2 B b
## 3: 3 C c
## 4: 4 D d
## 5: 5 E e
## 6: 6 F f
## 7: 7 G g
## 8: 8 H h
## 9: 9 I i
## 10: 10 J j
and I want to find the row and column numbers of the letter 'h' (which are 8 and 3). How would I do that?
DT[, which(.SD == "h", arr.ind = TRUE)]
# row col
# [1,] 8 3
EDIT:
Trying to take into account Michael's points:
str_idx = which(sapply(DT, function(x) is.character(x) || is.factor(x)))
idx <- DT[, which(as.matrix(.SD) == "h", arr.ind = TRUE), .SDcols = str_idx]
idx[, "col"] <- chmatch(names(str_idx)[idx[, "col"]], names(DT))
idx
# row col
# [1,] 8 3
Depends on the exact format of your desired output.
# applying to non-string columns is inefficient
str_idx = which(sapply(DT, is.character))
# returns a list as long as str_idx with two elements appropriately named
lapply(str_idx, function(jj) list(row = which(DT[[jj]] == 'h'), col = jj))
It should also be possible to melt the string columns of your table to avoid looping.
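The melt idea could look roughly like the sketch below. It is only one way to do it: the helper names tmp, long and hits and the added row column are illustrative, and the table is rebuilt here so the block runs on its own.

```r
library(data.table)

DT <- data.table(id = 1:10, LETTERS = LETTERS[1:10], letters = letters[1:10])

# names of the character columns only
str_cols <- names(DT)[sapply(DT, is.character)]

# copy the string columns and record the original row number
tmp <- DT[, str_cols, with = FALSE]
tmp[, row := seq_len(.N)]

# long format: one line per (row, column) cell
long <- melt(tmp, id.vars = "row", measure.vars = str_cols,
             variable.name = "column", value.name = "value",
             variable.factor = FALSE)

hits <- long[value == "h"]
# map the column name back to its position in the original table
hits[, col := match(column, names(DT))]
hits[, .(row, col)]
```

For the example table this should report row 8 and column 3, matching the which(..., arr.ind = TRUE) answers.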
Suppose I have a data.table like this:
Table:
V1 V2
A B
C D
C A
B A
D C
I want each row to be regarded as a set, which means that B A and A B are the same. So after the process, I want to get:
V1 V2
A B
C D
C A
In order to do that, I have to first sort the table row-by-row and then use unique to remove the duplicates. The sorting process is quite slow if I have millions of rows. So is there an easy way to remove the duplicates without sorting?
For just two columns you can use the following trick:
dt = data.table(a = letters[1:5], b = letters[5:1])
# a b
#1: a e
#2: b d
#3: c c
#4: d b
#5: e a
dt[dt[, .I[1], by = list(pmin(a, b), pmax(a, b))]$V1]
# a b
#1: a e
#2: b d
#3: c c
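The trick works because pmin and pmax give every row an order-independent key, so "B A" and "A B" fall into the same group. A small sketch on the table from the question, using duplicated() on the key instead of grouping; row_key is an illustrative name, and this is a variant of the idea rather than the exact code above.

```r
library(data.table)

tab <- data.table(V1 = c("A", "C", "C", "B", "D"),
                  V2 = c("B", "D", "A", "A", "C"))

# order-independent key: the smaller and the larger value of each row
row_key <- data.table(lo = pmin(tab$V1, tab$V2), hi = pmax(tab$V1, tab$V2))

# keep the first row seen for each key
res <- tab[!duplicated(row_key)]
res
```

This keeps rows 1 to 3 (A B, C D, C A), the output asked for in the question.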
Borrowing (probably unrealistic) data from a dupe:
library(data.table)
size <- 118000000
key1 <- sample( LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0) )
key2 <- sample( LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0) )
val <- runif(size, 0.0, 5.0)
dt <- data.table(key1, key2, val, stringsAsFactors=FALSE)
Here's a fast way if your data looks like this:
# eddi's answer
system.time(res1 <- dt[dt[, .I[1], by=.(pmin(key1, key2), pmax(key1, key2))]$V1])
# user system elapsed
# 101.79 3.01 107.98
# optimized for this data
system.time({
dt2 <- unique(dt, by=c("key1", "key2"))[key1 > key2, c("key1", "key2") := .(key2, key1)]
res2 <- unique(dt2, by=c("key1", "key2"))
})
# user system elapsed
# 8.50 1.16 4.93
fsetequal(copy(res1)[key1 > key2, c("key1", "key2") := .(key2, key1)], res2)
# [1] TRUE
Data like this seems unlikely if it pertains to covariances, since you should have at most one duplicate (i.e., A-B with B-A).
You can try:
df[!duplicated(t(apply(df, 1, sort))), ]
where df is your dataframe
Here is the simple way of removing duplicate rows.
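delRows = NULL # the rows to be removed
for(i in seq_len(nrow(tab))){
  # later rows that hold the same pair in reversed order
  j = which(tab$V1 == tab$V2[i] & tab$V2 == tab$V1[i])
  j = j[j > i]
  if (length(j) > 0){
    delRows = c(delRows, j)
  }
}
# tab[-NULL, ] would drop every row, so only subset when something was found
if (length(delRows) > 0) tab = tab[-delRows,]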
The result is,
Before,
> tab
V1 V2
1 A B
2 C D
3 C A
4 B A
5 D C
After,
> tab
V1 V2
1 A B
2 C D
3 C A
I would like to create a data.table of the form
newdat
# A B
# 1: 1 1,2
# 2: 2 1,2,3
from a data.table of the form
dat <- data.table(A = c(1, 1, 2, 2, 2), B = c(1, 2, 1, 2, 3))
dat
# A B
# 1: 1 1
# 2: 1 2
# 3: 2 1
# 4: 2 2
# 5: 2 3
I can create newdat directly via
newdat <- data.table(A = 1:2, B = list(1:2, 1:3))
and I guess I could fill in the necessary arguments via something like
newdat <- data.table(A = unique(dat$A), B = split(dat$B, dat$A))
but I have a feeling there is a better way to do this using the data.table functionality that I can't find right now - any suggestions?
Here you go:
dat[,list(B=list(B)),by=A]
A B
1: 1 1,2
2: 2 1,2,3
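To confirm that B really is a list column, and to go back to the long layout if needed, something like the following sketch can be used. It uses .() as the alias for list(); long_again is an illustrative name.

```r
library(data.table)

dat <- data.table(A = c(1, 1, 2, 2, 2), B = c(1, 2, 1, 2, 3))
newdat <- dat[, .(B = list(B)), by = A]

# B is a list column: each element holds the vector for one group
newdat$B[[2]]      # 1 2 3
lengths(newdat$B)  # 2 3

# unlist() within each group restores the long layout
long_again <- newdat[, .(B = unlist(B)), by = A]
```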