Let's say I have the below df:
library(data.table)
df <- data.table(id = c(1, 2, 2, 3)
, datee = as.Date(c('2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03'))
); df
id datee
1: 1 2022-01-01
2: 2 2022-01-02
3: 2 2022-01-02
4: 3 2022-01-03
and I wanted to keep only the non-duplicated rows
df[!duplicated(id, datee)]
id datee
1: 1 2022-01-01
2: 2 2022-01-02
3: 3 2022-01-03
which worked.
However, with the below df_1
df_1 <- data.table(a = c(1,1,2)
, b = c(1,1,3)
); df_1
a b
1: 1 1
2: 1 1
3: 2 3
using the same method does not get rid of the duplicated rows
df_1[!duplicated(a, b)]
a b
1: 1 1
2: 1 1
3: 2 3
What am I doing wrong?
Let's dive into why your df_1[!duplicated(a, b)] doesn't work.
duplicated uses S3 method dispatch.
library(data.table)
.S3methods("duplicated")
# [1] duplicated.array duplicated.data.frame
# [3] duplicated.data.table* duplicated.default
# [5] duplicated.matrix duplicated.numeric_version
# [7] duplicated.POSIXlt duplicated.warnings
# see '?methods' for accessing help and source code
Looking at those, we aren't using duplicated.data.table, since we're calling duplicated with individual vectors (it has no idea it is being called from within a data.table context), so it makes sense to look into duplicated.default.
> debugonce(duplicated.default)
> df_1[!duplicated(a, b)]
debugging in: duplicated.default(a, b)
debug: .Internal(duplicated(x, incomparables, fromLast, if (is.factor(x)) min(length(x),
nlevels(x) + 1L) else nmax))
Browse[2]> match.call() # ~ "how this function was called"
duplicated.default(x = a, incomparables = b)
Confirming with ?duplicated:
x: a vector or a data frame or an array or 'NULL'.
incomparables: a vector of values that cannot be compared. 'FALSE' is
a special value, meaning that all values can be compared, and
may be the only value accepted for methods other than the
default. It will be coerced internally to the same type as
'x'.
From this we can see that a is being used for deduplication, and b is being used as incomparables. Because b contains the value 1, which is the duplicated value in a, the rows where a == 1 are never tested for duplication.
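In other words, inside the data.table frame the two bare symbols are matched positionally, as if we had written the following (a minimal check on the same df_1):
df_1[, duplicated(a, incomparables = b)]
# [1] FALSE FALSE FALSE
Since every row with a == 1 is "incomparable", nothing is ever flagged as a duplicate.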
To confirm: if we change b so that it does not share (duplicated) values with a, we see that the deduplication of a works as intended (though b's dupes are still silently ignored due to the argument mismatch):
df_1 <- data.table(a = c(1,1,2) , b = c(2,2,4))
df_1[!duplicated(a, b)] # accidentally correct, `b` is not used
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
unique(df_1, by = c("a", "b"))
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
df_2 <- data.table(a = c(1,1,2) , b = c(2,3,4))
df_2[!duplicated(a, b)] # wrong, `b` is not considered
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
unique(df_2, by = c("a", "b"))
# a b
# <num> <num>
# 1: 1 2
# 2: 1 3
# 3: 2 4
(Note that unique above is actually data.table:::unique.data.table, another S3 method dispatch provided by the data.table package.)
debug and debugonce are your friends :-)
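For completeness, going back to the question's original df_1, here is a sketch of calls that do deduplicate on both columns, because they dispatch to data.table's own methods:
df_1 <- data.table(a = c(1,1,2), b = c(1,1,3))
df_1[!duplicated(df_1, by = c("a", "b"))]
#    a b
# 1: 1 1
# 2: 2 3
unique(df_1) # same result; unique.data.table uses all columns by default
# and likewise for the first example: df[!duplicated(df, by = c("id", "datee"))]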
I want to divide the columns named in one character vector by the columns named in another character vector in a data.table. Easiest to explain with an example:
library(data.table)
dt <- data.table(g1 = c('a', 'b', 'a', 'b'),
x1 = rep(1:2, 2),
x2 = 10:13)
dt2 <- dt[dt[x1==1], on = c('g1')]
dt2
g1 x1 x2 i.x1 i.x2
1: a 1 10 1 10
2: a 1 12 1 10
3: a 1 10 1 12
4: a 1 12 1 12
Now I want to create two new columns: x1_n = x1 / i.x1 and x2_n = x2 / i.x2. My actual data has many more columns so it's too cumbersome to write out and may change. I'm trying:
cc <- c('x1', 'x2')
icc <- paste0('i.', cc)
new_c <- paste0(cc, '_n')
dt2[, (new_c) := mget(cc) / mget(icc)]
Error in mget(cc)/mget(icc) : non-numeric argument to binary operator
I'm not sure how that final step should be written. I feel like I'm missing something simple. Thanks friends.
You're close; I think you just need Map:
nms1 <- grep("^x[0-9]", names(dt2), value = TRUE)
nms2 <- paste0("i.", nms1)
nmsn <- paste0(nms1, "_n")
nms1
# [1] "x1" "x2"
nms2
# [1] "i.x1" "i.x2"
nmsn
# [1] "x1_n" "x2_n"
dt2[, c(nmsn) := Map(`/`, mget(nms1), mget(nms2))]
dt2
# g1 x1 x2 i.x1 i.x2 x1_n x2_n
# <char> <int> <int> <int> <int> <num> <num>
# 1: a 1 10 1 10 1 1.0000000
# 2: a 1 12 1 10 1 1.2000000
# 3: a 1 10 1 12 1 0.8333333
# 4: a 1 12 1 12 1 1.0000000
The use of `/` (backtick-escaped) is the way in R to call an infix operator as a function. An equivalent way to write that:
dt2[, c(nmsn) := Map(function(a, b) a / b, mget(nms1), mget(nms2))]
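As an aside, the reason the original mget(cc) / mget(icc) failed: mget returns a list, and `/` has no method for lists. A minimal illustration outside of data.table:
list(1, 2) / list(3, 4)
# Error in list(1, 2)/list(3, 4) : non-numeric argument to binary operator
Map(`/`, list(1, 2), list(3, 4)) # applies `/` pairwise instead
# [[1]]
# [1] 0.3333333
#
# [[2]]
# [1] 0.5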
Consider these two data.tables, foo and bar.
library(data.table)
foo <- data.table(id = c(1,2,3,4), f1 = c("a", "b", "c", "d"), f2 = c("a", "b", "c", "d"))
bar <- data.table(id = c(1,2,3,4), f1 = c("a", "a", "c", "d"), f2 = c("a", "b", "c", "e"))
foo
id f1 f2
1: 1 a a
2: 2 b b
3: 3 c c
4: 4 d d
bar
id f1 f2
1: 1 a a
2: 2 a b
3: 3 c c
4: 4 d e
I know that foo and bar have a 1-1 relationship.
I would like to select rows from bar such that the corresponding row in foo has different values. For example,
id 1: the values of f1 and f2 are the same in foo and bar, so exclude this one
id 2: the value of f1 has changed! include this in the result
id 3: the values of f1 and f2 are the same in foo and bar, so exclude this one
id 4: the value of f2 has changed! include this in the result
Expected Result
bar[c(2,4)]
id f1 f2
1: 2 a b
2: 4 d e
What I tried
I thought a non-equi join would work great here. Unfortunately, it seems the "not equals" operator isn't supported?
foo[!bar, on = c("id=id", "f1!=f1", "f2!=f2")]
# Invalid operators !=,!=. Only allowed operators are ==<=<>=>.
foo[!bar, on = c("id=id", "f1<>f1", "f2<>f2")]
# Found more than one operator in one 'on' statement: f1<>f1. Please specify a single operator.
With data.table:
bar[foo, .SD[i.f1 != x.f1 | i.f2 != x.f2], on = "id"]
id f1 f2
<num> <char> <char>
1: 2 a b
2: 4 d e
I think this is best (cleanest, but perhaps not fastest?): an anti-join that keeps only the rows of bar with no exact match in foo on all three columns:
bar[!foo, on=.(id,f1,f2)]
id f1 f2
<num> <char> <char>
1: 2 a b
2: 4 d e
Benchmarking on a larger dataset with a few different data.table options. The mapply option works only if all.equal(foo$id, bar$id) is TRUE (it depends on exactly what is meant by "1-1 relationship").
library(data.table)
set.seed(123)
foo <- data.table(id = 1:1e5, f1 = 1:1e5, f2 = 1:1e5)
mix <- sample(1e5, 5e4)
bar <- copy(foo)[mix[1:25e3], f1 := 0L][mix[25001:5e4], f2 := 0L]
head(fsetdiff(bar, foo))
#> id f1 f2
#> 1: 5 0 5
#> 2: 6 6 0
#> 3: 7 0 7
#> 4: 8 0 8
#> 5: 10 0 10
#> 6: 12 12 0
microbenchmark::microbenchmark(join = bar[foo,.SD[i.f1!=x.f1|i.f2!=x.f2],on="id"],
antijoin = bar[!foo, on=.(id,f1,f2)],
fsetdiff = fsetdiff(bar, foo),
duplicated = bar[(!duplicated(rbindlist(list(bar, foo)), fromLast = TRUE))[1:nrow(bar)]],
mapply = bar[rowSums(mapply(function(i) foo[[i]] != bar[[i]], 2:length(bar))) > 0,],
check = "equal")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> join 12.2306 14.07125 15.133795 14.76330 15.96645 22.4534 100
#> antijoin 16.2002 17.60420 19.234747 18.30230 19.44395 59.1581 100
#> fsetdiff 20.1408 21.76150 23.080961 23.03760 23.73860 30.9594 100
#> duplicated 17.8954 20.12690 21.673165 21.66795 22.79185 27.3250 100
#> mapply 3.2440 3.56480 4.346703 3.87415 4.63610 10.2100 100
Since solutions other than data.table are also welcome, here is a tidyverse one:
library(dplyr)
library(tidyr)
left_join(foo, bar, by = "id") %>%
  group_by(id) %>%
  mutate(identical = n_distinct(unlist(cur_data())) == 1) %>%
  filter(identical == FALSE) %>%
  select(id, f1 = f1.y, f2 = f2.y)
Groups: id [2]
id f1 f2
<dbl> <chr> <chr>
1 2 a b
2 4 d e
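A shorter tidyverse route, if it applies, is dplyr's anti_join, which keeps exactly the rows of bar that have no matching (id, f1, f2) combination in foo (a sketch on the question's original foo and bar):
library(dplyr)
anti_join(bar, foo, by = c("id", "f1", "f2"))
# returns the rows with id 2 and 4, as expected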
I have a data.table in which I'd like to complete a column to fill in some missing values; however, I'm having some trouble filling in the other columns.
library(data.table)
library(zoo) # for na.locf
dt = data.table(a = c(1, 3, 5), b = c('a', 'b', 'c'))
dt[, .(a = seq(min(a), max(a), 1), b = na.locf(b))]
# a b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 a
# 5: 5 b
However, I'm looking for something more like this:
# using tidyr::complete and dplyr::mutate
dt %>%
  complete(a = seq(min(a), max(a), 1)) %>%
  mutate(b = na.locf(b))
# # A tibble: 5 x 2
# a b
# <dbl> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 c
where the last value is carried forward
Another possible solution, using only the (rolling) join capabilities of data.table:
dt[.(min(a):max(a)), on = .(a), roll = Inf]
which gives:
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
On large datasets this will probably outperform every other solution.
Courtesy to @Mako212, who gave the hint of using seq in his answer.
First posted solution which works, but gives a warning:
dt[dt[, .(a = Reduce(":", a))], on = .(a), roll = Inf]
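Spelled out with an explicit lookup table, the shorthand dt[.(min(a):max(a)), on = .(a), roll = Inf] above is equivalent to this sketch:
lookup <- data.table(a = min(dt$a):max(dt$a)) # the complete set of key values
dt[lookup, on = .(a), roll = Inf] # keys missing from dt take the previous row's b
which gives the same five filled rows as above.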
data.table recycles observations by default when you try dt[, .(a = seq(min(a), max(a), 1))] so it never generates any NA values for na.locf to fill. Pretty sure you need to use a join here to "complete" the cases, and then you can use na.locf to fill.
dt[dt[, .(a = min(a):max(a))], on = 'a'][, .(a, b = na.locf(b))]
Not sure if there's a way to skip creating the separate lookup table, but this gives you the desired result.
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
And I'll borrow @Jaap's min/max line to avoid creating the second table. So basically you can either use his rolling join solution, or, if you want to use na.locf, this gets the same result.
I'm attempting to create a summarised data.table from an existing one; however, I want to do this in a function that allows me to pass in a column prefix, so I can prefix my columns as required.
I've seen the question/response here but am trying to work out how to do it when not using the := operator.
Reprex:
library(data.table)
tbl1 <- data.table(urn = c("a", "a", "a", "b", "b", "b"),
amount = c(1, 2, 1, 3, 3, 4))
# urn amount
# 1: a 1
# 2: a 2
# 3: a 1
# 4: b 3
# 5: b 3
# 6: b 4
tbl2 <- tbl1[, .(mean_amt = mean(amount),
rows = .N),
by = urn]
# urn mean_amt rows
# 1: a 1.333333 3
# 2: b 3.333333 3
This is using fixed names for the column names being created, however as mentioned I'd like to be able to include a prefix.
I've tried the following:
prefix <- "mypfx_"
tbl2 <- tbl1[, .(paste0(prefix, mean_amt) = mean(amount),
paste0(prefix, rows) = .N),
by = urn]
# Desired output
# urn mypfx_mean_amt mypfx_rows
# 1: a 1.333333 3
# 2: b 3.333333 3
Unfortunately that code gets an error saying: Error: unexpected '=' in " tbl2 <- tbl1[, .(paste0(prefix, mean_amt) ="
Any thoughts on how to make the above work would be appreciated.
You can use setNames to rename the columns dynamically:
prefix <- "mypfx_"
tbl2 <- tbl1[, setNames(list(mean(amount), .N), paste0(prefix, c("mean_amt", "rows"))),
by = urn]
tbl2
# urn mypfx_mean_amt mypfx_rows
#1: a 1.333333 3
#2: b 3.333333 3
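An alternative sketch: aggregate with fixed names first, then rename in place with data.table's setnames():
prefix <- "mypfx_"
tbl2 <- tbl1[, .(mean_amt = mean(amount), rows = .N), by = urn]
setnames(tbl2, c("mean_amt", "rows"), paste0(prefix, c("mean_amt", "rows")))
tbl2
# urn mypfx_mean_amt mypfx_rows
#1: a 1.333333 3
#2: b 3.333333 3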
I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
ID = rep(1:5, c(3:1, 2:3)),
Group = c("A", "B", "C", "B",
"C", "A", "A", "C",
"A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
I want to know, for each pair of groups, the number of individuals they have in common.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.
Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
# simplify = FALSE returns a list;
# transpose turns the 3-length list of 2-length vectors
# into a length-2 list of 3-length vectors (efficiently)
transpose(combn(Group, 2L, simplify = FALSE)), by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3
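To see what the inner step produces, here is the intermediate for a single ID (a sketch using ID 1's three groups):
g <- c("A", "B", "C") # ID 1's groups
combn(g, 2L, simplify = FALSE) # list of 3 pairs: AB, AC, BC
transpose(combn(g, 2L, simplify = FALSE)) # 2 vectors that become V1 and V2
# [[1]]
# [1] "A" "A" "B"
#
# [[2]]
# [1] "B" "C" "C"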
As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanation
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C
We self-join the dataset with itself on 'ID', subset the rows where the 'Group' columns differ, get the number of rows (.N) grouped by the two 'Group' columns, sort 'Group.1' and 'Group.2' within each row using pmin/pmax, and take the unique value of 'N'.
library(data.table) # v1.9.6+
DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][,
list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
Or, as mentioned in the comments by @MichaelChirico and @Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group' and 'i.Group', and get the number of rows (.N):
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]
Great answers above. Just an alternative using dplyr, in case you or someone else is interested.
library(dplyr)
cmb = combn(unique(DT$Group), 2)
data.frame(g1 = cmb[1, ],
           g2 = cmb[2, ]) %>%
  group_by(g1, g2) %>%
  summarise(l = length(intersect(DT[DT$Group == g1, ]$ID,
                                 DT[DT$Group == g2, ]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3
Yet another solution (base R):
tmp <- split(DT, DT$Group)
ans <- apply(combn(LETTERS[1:3], 2), 2, FUN = function(ind) {
  out <- length(intersect(tmp[[ind[1]]]$ID, tmp[[ind[2]]]$ID))
  c(group1 = ind[1], group2 = ind[2], sum_ = out)
})
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
First split the data into a list of groups; then, for each unique pairwise combination of two groups, see how many subjects they have in common, using length(intersect(...)).