I'm trying to replace values of duplicate rows in a data.table. Let's say you have
A <- c(1,2,3,4,4,6,4)
B <- c("a","b","c","d","e","f","g")
C <- c(10,11,23,8,8,1,3)
dt <- data.table(A,B,C)
I would like to do something like dt[duplicated(dt[, c(1, 3)]), ][, 2] <- 0 (i.e. blank out column B for rows that are duplicated on columns 1 and 3) to obtain
> dt
A B C
1: 1 a 10
2: 2 b 11
3: 3 c 23
4: 4 d 8
5: 4 0 8
6: 6 f 1
7: 4 g 3
You could do
> A <- c(1,2,3,4,4,6,4)
> B <- c("a","b","c","d","e","f","g")
> C <- c(10,11,23,8,8,1,3)
> dt <- data.table(A, B, C, stringsAsFactors = FALSE)
> dt[dt[, j = duplicated(.SD), .SDcols = c("A", "C")], B := "0"]
> dt
A B C
1: 1 a 10
2: 2 b 11
3: 3 c 23
4: 4 d 8
5: 4 0 8
6: 6 f 1
7: 4 g 3
... but now seeing David's solution is way more concise...
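David's solution isn't quoted in this thread, but a concise one-liner in the same spirit (an assumption about what it looked like, based on the comment above) would pass the key columns to duplicated() via its by argument:
dt[duplicated(dt, by = c("A", "C")), B := "0"]
Since B is a character column, the replacement value is the string "0" rather than numeric 0.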
Related
I have the following data.table:
library(data.table)
DT <- data.table(a = c(1,2,3,4,5,6,7,8,9,10), b = c('A','A','A','B','B', 'C', 'C', 'C', 'D', 'D'), c = c(1,1,1,1,1,2,2,2,2,2))
> DT
a b c
1: 1 A 1
2: 2 A 1
3: 3 A 1
4: 4 B 1
5: 5 B 1
6: 6 C 2
7: 7 C 2
8: 8 C 2
9: 9 D 2
10: 10 D 2
I want to add a column with an index grouped by c (restarting from 1 for each group in column c) that only changes when the value of b changes. The wanted result is shown in the col column of the answer outputs below.
Here are two ways to do this:
Using rleid:
library(data.table)
DT[, col := rleid(b), c]
With match + unique:
DT[, col := match(b, unique(b)), c]
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
We can use factor with levels specified and coerce it to integer:
library(data.table)
DT[, col := as.integer(factor(b, levels = unique(b))), c]
Output:
DT
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
Or, using base R with rle:
with(DT, as.integer(ave(b, c, FUN = function(x)
with(rle(x), rep(seq_along(values), lengths)))))
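The base-R line above only returns the index vector; a usage sketch (not part of the original answer, and the column name col2 is arbitrary) that attaches it to DT as a column:
DT$col2 <- with(DT, as.integer(ave(b, c, FUN = function(x)
  with(rle(x), rep(seq_along(values), lengths)))))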
This is related to multiple duplicates (1, 2, 3), but it's a slightly different problem that I'm stuck with. So far, I've only seen pandas solutions.
In this data table:
dt = data.table(gr = rep(letters[1:2], each = 6),
cl = rep(letters[1:4], each = 3))
gr cl
1: a a
2: a a
3: a a
4: a b
5: a b
6: a b
7: b c
8: b c
9: b c
10: b d
11: b d
12: b d
I'd like to enumerate unique classes per group to obtain this:
gr cl id
1: a a 1
2: a a 1
3: a a 1
4: a b 2
5: a b 2
6: a b 2
7: b c 1
8: b c 1
9: b c 1
10: b d 2
11: b d 2
12: b d 2
Try
library(data.table)
dt[, id := rleid(cl), by=gr]
dt
# gr cl id
# 1: a a 1
# 2: a a 1
# 3: a a 1
# 4: a b 2
# 5: a b 2
# 6: a b 2
# 7: b c 1
# 8: b c 1
# 9: b c 1
#10: b d 2
#11: b d 2
#12: b d 2
You can do this (it may require sorting the data first):
dt[, id := cumsum(!duplicated(cl)), by = gr]
gr cl id
1: a a 1
2: a a 1
3: a a 1
4: a b 2
5: a b 2
6: a b 2
7: b c 1
8: b c 1
9: b c 1
10: b d 2
11: b d 2
12: b d 2
The same with dplyr:
library(dplyr)
dt %>%
  group_by(gr) %>%
  mutate(id = cumsum(!duplicated(cl)))
Or a rleid()-like possibility:
dt %>%
group_by(gr) %>%
mutate(id = with(rle(cl), rep(seq_along(lengths), lengths)))
An alternative solution using factor, which does not require ordering first:
dt %>%
group_by(gr) %>%
mutate(id = as.numeric(factor(cl))) %>%
ungroup()
# # A tibble: 12 x 3
# gr cl id
# <chr> <chr> <dbl>
# 1 a a 1
# 2 a a 1
# 3 a a 1
# 4 a b 2
# 5 a b 2
# 6 a b 2
# 7 b c 1
# 8 b c 1
# 9 b c 1
#10 b d 2
#11 b d 2
#12 b d 2
Note that this will automatically assign a number / id based on the alphabetical order of the cl values, within each gr group.
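A minimal illustration of that difference (toy data, not from the thread): factor() numbers the classes alphabetically, while cumsum(!duplicated(cl)) numbers them in order of appearance:
library(dplyr)
dt2 <- data.frame(gr = "a", cl = c("b", "b", "a", "a"))
dt2 %>%
  group_by(gr) %>%
  mutate(id_factor = as.numeric(factor(cl)),   # 2 2 1 1 (alphabetical order)
         id_appear = cumsum(!duplicated(cl)))  # 1 1 2 2 (order of appearance)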
I would like to aggregate a data.table by a list of columns and keep all the columns at the end.
A <- c(1,2,3,4,4,6,4)
B <- c("a","b","c","d","e","f","g")
C <- c(10,11,23,8,8,1,3)
D <- c(2,3,5,9,7,8,4)
dt <- data.table(A,B,C,D)
Now I want to aggregate column B (collapsing it, e.g. paste(B, collapse = ";")) by A and C, and also keep column D at the end. Do you know a way to do this?
EDIT
This is what I obtained using dt[, newCol := toString(B), .(A, C)]:
A B C D newCol
1: 1 a 10 2 a
2: 2 b 11 3 b
3: 3 c 23 5 c
4: 4 d 8 9 d, e
5: 4 e 8 7 d, e
6: 6 f 1 8 f
7: 4 g 3 4 g
But I would like to obtain:
A B C D newCol
1: 1 a 10 2 a
2: 2 b 11 3 b
3: 3 c 23 5 c
4: 4 d 8 9 d, e
6: 6 f 1 8 f
7: 4 g 3 4 g
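One possible data.table approach (a sketch, not an answer quoted from the original thread): create newCol by reference as in the EDIT, then keep only the first row of each (A, C) group with unique():
library(data.table)
dt[, newCol := toString(B), by = .(A, C)]
unique(dt, by = c("A", "C"))
This reproduces the desired output above, keeping the first D value of each (A, C) group.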
I have the following data table:
require(data.table)
dt1 <- data.table(ind = 1:8, cat = c("A", "A", "A", "B", "B", "C", "C", "D"), counts = (10:3))
ind cat counts
1: 1 A 10
2: 2 A 9
3: 3 A 8
4: 4 B 7
5: 5 B 6
6: 6 C 5
7: 7 C 4
8: 8 D 3
What I would like to achieve is to add a row for each cat whose counts value is the sum(counts) of cat A minus the sum(counts) of that cat. For these rows, ind should be 0.
Essentially I would like to rbind the following information:
added_info <- cbind(ind =0, dt1[, .(counts = dt1[cat == "A", sum(counts)] - sum(counts)), by = cat])
> added_info
ind cat counts
1: 0 A 0
2: 0 B 14
3: 0 C 18
4: 0 D 24
And the end result would be:
dt1 <- rbind(dt1, added_info)[order(cat)]
> dt1
ind cat counts
1: 1 A 10
2: 2 A 9
3: 3 A 8
4: 0 A 0
5: 4 B 7
6: 5 B 6
7: 0 B 14
8: 6 C 5
9: 7 C 4
10: 0 C 18
11: 8 D 3
12: 0 D 24
My question is whether there is a better (shorter) way of achieving this using data.table (perhaps by using .I or .N?).
You could do
require(data.table)
dt1 <- data.table(ind = 1:8, cat = c("A", "A", "A", "B", "B", "C", "C", "D"), counts = (10:3))
dt1[, c := sum(counts[cat == "A"])][, .(ind = c(ind, 0), counts = c(counts, c[.N] - sum(counts))), cat][]
# cat ind counts
# 1: A 1 10
# 2: A 2 9
# 3: A 3 8
# 4: A 0 0
# 5: B 4 7
# 6: B 5 6
# 7: B 0 14
# 8: C 6 5
# 9: C 7 4
# 10: C 0 18
# 11: D 8 3
# 12: D 0 24
This may be a solution within one data.table call:
dt1[, rbind(.SD,
data.table(ind = 0,
counts = dt1[cat == 'A', sum(counts)] - sum(.SD$counts))),
by = cat]
Out:
cat ind counts
1: A 1 10
2: A 2 9
3: A 3 8
4: A 0 0
5: B 4 7
6: B 5 6
7: B 0 14
8: C 6 5
9: C 7 4
10: C 0 18
11: D 8 3
12: D 0 24
You said efficient, so... This has two by's; unique is vectorized and the grouped sum should be handled by data.table's optimized C grouping code.
> dt1[, .SD
][, ca := sum(.SD[cat == 'A', counts])
][, cc := sum(counts), cat
][, cd := ca - cc
][, rbind(.SD, unique(.SD, by=c('cat'))[, `:=`(ind=0)])
][ind == 0, counts := cd
][, .(cat, ind, counts)
][order(cat, ind)
]
cat ind counts
1: A 0 0
2: A 1 10
3: A 2 9
4: A 3 8
5: B 0 14
6: B 4 7
7: B 5 6
8: C 0 18
9: C 6 5
10: C 7 4
11: D 0 24
12: D 8 3
I have a data.frame df and I want every row in this df to be duplicated lengthTime times, with a new column added that counts from 1 to lengthTime for each row in df.
I know it sounds pretty complicated, but what I basically want is to apply expand.grid to df. Here is an ugly workaround, and I have the feeling that there must be an easier solution (maybe even a base-R function?):
df <- data.frame(ID = rep(letters[1:3], each=3),
CatA = rep(1:3, times = 3),
CatB = letters[1:9])
lengthTime <- 3
nrRow <- nrow(df)
intDF <- df
for (i in 1:(lengthTime - 1)) {
df <- rbind(df, intDF)
}
df$Time <- rep(1:lengthTime, each=nrRow)
I thought that I could just use expand.grid(df, 1:lengthTime), but that does not work. outer did not bring any luck either. So does anyone know a good solution?
It's been a while since this question was posted, but I recently came across it while looking for just the thing in the title, namely an expand.grid that works for data frames. The posted answers address the OP's more specific question, so in case anyone is looking for a more general solution for data frames, here's a slightly more general approach:
expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
# For the example in the OP
expand.grid.df(df, data.frame(1:lengthTime))
# More generally
df1 <- data.frame(A=1:3, B=11:13)
df2 <- data.frame(C=51:52, D=c("Y", "N"))
df3 <- data.frame(E=c("+", "-"))
expand.grid.df(df1, df2, df3)
You can also just do a simple merge by NULL (which makes merge perform a cross join, i.e. full combinatorial replication of the data):
merge(data.frame(time=1:lengthTime), iris, by=NULL)
Why not just something like df[rep(1:nrow(df),times = 3),] to extend the data frame, and then add the extra column just as you have above, with df$Time <- rep(1:lengthTime, each=nrRow)?
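Spelled out, that suggestion might look like this (a sketch reusing the df, lengthTime, and nrRow objects defined in the question):
df2 <- df[rep(1:nrow(df), times = lengthTime), ]  # replicate every row lengthTime times
df2$Time <- rep(1:lengthTime, each = nrRow)       # add the Time counter
rownames(df2) <- NULL
df2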
Quick update
There is now also the crossing() function in package tidyr which can be used instead of merge, is somewhat faster, and returns a tbl_df / tibble.
data.frame(time=1:10) %>% merge(iris, by=NULL)
data.frame(time=1:10) %>% tidyr::crossing(iris)
This works:
REP <- rep(1:nrow(df), 3)
df2 <- data.frame(df[REP, ], Time = rep(1:3, each = 9))
rownames(df2) <- NULL
df2
A data.table solution:
> library(data.table)
> ( df <- data.frame(ID = rep(letters[1:3], each=3),
+ CatA = rep(1:3, times = 3),
+ CatB = letters[1:9]) )
ID CatA CatB
1 a 1 a
2 a 2 b
3 a 3 c
4 b 1 d
5 b 2 e
6 b 3 f
7 c 1 g
8 c 2 h
9 c 3 i
> ( DT <- data.table(df)[, lapply(.SD, function(x) rep(x, 3))][, Time := rep(1:3, each = nrow(df))] )
ID CatA CatB Time
1: a 1 a 1
2: a 2 b 1
3: a 3 c 1
4: b 1 d 1
5: b 2 e 1
6: b 3 f 1
7: c 1 g 1
8: c 2 h 1
9: c 3 i 1
10: a 1 a 2
11: a 2 b 2
12: a 3 c 2
13: b 1 d 2
14: b 2 e 2
15: b 3 f 2
16: c 1 g 2
17: c 2 h 2
18: c 3 i 2
19: a 1 a 3
20: a 2 b 3
21: a 3 c 3
22: b 1 d 3
23: b 2 e 3
24: b 3 f 3
25: c 1 g 3
26: c 2 h 3
27: c 3 i 3
Another one:
> library(data.table)
> ( df <- data.frame(ID = rep(letters[1:3], each=3),
+ CatA = rep(1:3, times = 3),
+ CatB = letters[1:9]) )
> DT <- data.table(df)
> rbindlist(lapply(1:3, function(i) cbind(DT, Time=i)))
ID CatA CatB Time
1: a 1 a 1
2: a 2 b 1
3: a 3 c 1
4: b 1 d 1
5: b 2 e 1
6: b 3 f 1
7: c 1 g 1
8: c 2 h 1
9: c 3 i 1
10: a 1 a 2
11: a 2 b 2
12: a 3 c 2
13: b 1 d 2
14: b 2 e 2
15: b 3 f 2
16: c 1 g 2
17: c 2 h 2
18: c 3 i 2
19: a 1 a 3
20: a 2 b 3
21: a 3 c 3
22: b 1 d 3
23: b 2 e 3
24: b 3 f 3
25: c 1 g 3
26: c 2 h 3
27: c 3 i 3