get rows of unique values by group - r

I have a data.table and want to pick those lines of the data.table where some values of a variable x are unique relative to another variable y
It's possible to get the unique values of x, grouped by y in a separate dataset, like this
dt[,unique(x),by=y]
But I want to pick the rows in the original dataset where this is the case. I don't want a new data.table because I also need the other variables.
So, what do I have to add to my code to get the rows in dt for which the above is true?
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
y x z
1: a 1 1
2: a 2 2
3: a 2 3
4: b 3 4
5: b 2 5
6: b 1 6
What I want:
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6

The idiomatic data.table way is:
require(data.table)
unique(dt, by = c("y", "x"))
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6

data.table is a bit different in how to use duplicated. Here's the approach I've seen around here somewhere before:
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
setkey(dt, "y", "x")
key(dt)
# [1] "y" "x"
!duplicated(dt)
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
dt[!duplicated(dt)]
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 1 6
# 4: b 2 5
# 5: b 3 4

The simpler data.table solution is to grab the first element of each group
> dt[, head(.SD, 1), by=.(y, x)]
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6

Thanks to dplyR
library(dplyr)
col1 = c(1,1,3,3,5,6,7,8,9)
col2 = c("cust1", 'cust1', 'cust3', 'cust4', 'cust5', 'cust5', 'cust5', 'cust5', 'cust6')
df1 = data.frame(col1, col2)
df1
distinct(select(df1, col1, col2))

Related

`data.table` how to get `keyby` to include all combinations of factors? [duplicate]

This question already has answers here:
Keeping zero count combinations when aggregating with data.table
(2 answers)
Use a factor column in "by" and do not drop empty factors
(2 answers)
Closed 4 years ago.
I have a data.table and I would like to count the occurrence of each combination of a and b:
dt1 <- data.table(
a = c(1,1,1,1,2,2,2,2,3,3,3,3),
b = c(1,1,2,2,1,1,1,1,1,2,2,2) %>% letters[.]
)
# a b
# 1: 1 a
# 2: 1 a
# 3: 1 b
# 4: 1 b
# 5: 2 a
# 6: 2 a
# 7: 2 a
# 8: 2 a
# 9: 3 a
# 10: 3 b
# 11: 3 b
# 12: 3 b
dt1[, .N, keyby = .(a, b)]
# a b N
# 1: 1 a 2
# 2: 1 b 2
# 3: 2 a 4
# 4: 3 a 1
# 5: 3 b 3
It misses out the case of a==2 & b=="b", which has a zero count in dt1, but I want it to be included so the result would look like:
# a b c
# 1: 1 a 2
# 2: 1 b 2
# 3: 2 a 4
# 4: 2 b 0
# 5: 3 a 1
# 6: 3 b 3
The most intuitive way to use the loop or the apply family but it is just inefficient for my large datasets. Any idea?
That's a tidyr/dplyr approach:
dt1 %>%
group_by(a,b) %>%
summarise(c = length(.)) %>%
ungroup %>%
complete(a,b, fill = list(c = 0))

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow It displays just 2 rows and removes other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works like fluid.
Can anyone help with sorting using column name vector?
You need ?setorderv and its cols argument:
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.

R - unpivot list in data.table rows

I have a dataset that contains several columns, including 1 with list entries:
DT = data.table(
x = c(1:5),
y = seq(2, 10, 2),
z = list(list("a","b","a"), list("a","c"), list("b","c"), list("a","b","c"), list("b","c","b"))
)
Basically, I'm trying to unlist a, b, c from column z, and aggregate the data based on the x & y values.
Desired output:
z x sum(y)
1: a 1 4
2: b 1 2
3: a 2 4
4: c 2 4
5: b 3 6
6: c 3 6
7: a 4 8
8: b 4 8
9: c 4 8
10: b 5 20
11: c 5 10
My current method is rather round-about; I created 2 other columns with x and y values in lists of the same length as the list entry in z column, then unlisted all 3 columns simultaneously before aggregating - i.e. sum y values, grouped by z & x.
Code (before unlisting & aggregation):
DT[, listlen := sapply(z, function(x) length(x))]
for (a in c(1:nrow(DT))){
DT[a, x1:= list(list(rep(DT[a, x], DT[a, listlen])))]
DT[a, y1:= list(list(rep(DT[a, y], DT[a, listlen])))]}
DT_out = data.table(x = unlist(DT[,x1]), y = unlist(DT[,y1]), z = unlist(DT[,z]))
x y z listlen x1 y1
1: 1 2 <list> 3 1,1,1 2,2,2
2: 2 4 <list> 2 2,2 4,4
3: 3 6 <list> 2 3,3 6,6
4: 4 8 <list> 3 4,4,4 8,8,8
5: 5 10 <list> 3 5,5,5 10,10,10
Is there a method through data.table or reshape packages that can help me melt the dataset / do this much simpler? As I'm working with a lot more rows than this and this step seems to be very inefficient.
Any other help regarding the aggregation step would be much appreciated too!
unlist your z column first and then just aggregate as per normal via by=:
DT[, .(z=unlist(z)), by=.(x,y)][, .(sumy=sum(y)), by=.(x,z)]
# x z sumy
# 1: 1 a 4
# 2: 1 b 2
# 3: 2 a 4
# 4: 2 c 4
# 5: 3 b 6
# 6: 3 c 6
# 7: 4 a 8
# 8: 4 b 8
# 9: 4 c 8
#10: 5 b 20
#11: 5 c 10

R data.table filtering on group size

I am trying to find all the records in my data.table for which there is more than one row with value v in field f.
For instance, we can use this data:
dt <- data.table(f1=c(1,2,3,4,5), f2=c(1,1,2,3,3))
If looking for that property in field f2, we'd get (note the absence of the (3,2) tuple)
f1 f2
1: 1 1
2: 2 1
3: 4 3
4: 5 3
My first guess was dt[.N>2,list(.N),by=f2], but that actually keeps entries with .N==1.
dt[.N>2,list(.N),by=f2]
f2 N
1: 1 2
2: 2 1
3: 3 2
The other easy guess, dt[duplicated(dt$f2)], doesn't do the trick, as it keeps one of the 'duplicates' out of the results.
dt[duplicated(dt$f2)]
f1 f2
1: 2 1
2: 5 3
So how can I get this done?
Edited to add example
The question is not clear. Based on the title, it looks like we want to extract all groups with number of rows (.N) greater than 1.
DT[, if(.N>1) .SD, by=f]
But the value v in field f is making it confusing.
If I understand what you're after correctly, you'll need to do some compound queries:
library(data.table)
DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))
DT[, N := .N, f][N > 2][, N := NULL][]
# v1 f
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 1
# 5: 5 2
# 6: 6 3
# 7: 7 1
# 8: 8 2
# 9: 9 3

Add a countdown column to data.table containing rows until a special row encountered

I have a data.table with ordered data labled up, and I want to add a column that tells me how many records until I get to a "special" record that resets the countdown.
For example:
DT = data.table(idx = c(1,3,3,4,6,7,7,8,9),
name = c("a", "a", "a", "b", "a", "a", "b", "a", "b"))
setkey(DT, idx)
#manually add the answer
DT[, countdown := c(3,2,1,0,2,1,0,1,0)]
Gives
> DT
idx name countdown
1: 1 a 3
2: 3 a 2
3: 3 a 1
4: 4 b 0
5: 6 a 2
6: 7 a 1
7: 7 b 0
8: 8 a 1
9: 9 b 0
See how the countdown column tells me how many rows until a row called "b".
The question is how to create that column in code.
Note that the key is not evenly spaced and may contain duplicates (so is not very useful in solving the problem). In general the non-b names could be different, but I could add a dummy column that is just True/False if the solution requires this.
Here's another idea:
## Create groups that end at each occurrence of "b"
DT[, cd:=0L]
DT[name=="b", cd:=1L]
DT[, cd:=rev(cumsum(rev(cd)))]
## Count down within them
DT[, cd:=max(.I) - .I, by=cd]
# idx name cd
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
I'm sure (or at least hopeful) that a purely "data.table" solution would be generated, but in the meantime, you could make use of rle. In this case, you're interested in reversing the countdown, so we'll use rev to reverse the "name" values before proceeding.
output <- sequence(rle(rev(DT$name))$lengths)
makezero <- cumsum(rle(rev(DT$name))$lengths)[c(TRUE, FALSE)]
output[makezero] <- 0
DT[, countdown := rev(output)]
DT
# idx name countdown
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
Here's a mix of Josh's and Ananda's solution, in that, I use RLE to generate the way Josh has given the answer:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rep(t, t+1)]
DT[, cd:=max(.I) - .I, by=cd]
Even better: Taking use of the fact that there's only one b always (or assuming here), you could do this one better:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rev(sequence(rev(t+1)))-1]
Edit: From OP's comment, it seems clear that there is more than 1 b possible and in such cases, all b should be 0. The first step in doing this is to create groups where b ends after each consecutive a's.
DT <- data.table(idx=sample(10), name=c("a","a","a","b","b","a","a","b","a","b"))
t <- rle(DT$name)
val <- cumsum(t$lengths)[t$values == "b"]
DT[, grp := rep(seq(val), c(val[1], diff(val)))]
DT[, val := c(rev(seq_len(sum(name == "a"))),
rep(0, sum(name == "b"))), by = grp]
# idx name grp val
# 1: 1 a 1 3
# 2: 7 a 1 2
# 3: 9 a 1 1
# 4: 4 b 1 0
# 5: 2 b 1 0
# 6: 8 a 2 2
# 7: 6 a 2 1
# 8: 3 b 2 0
# 9: 10 a 3 1
# 10: 5 b 3 0

Resources