Data preparation part II - R

Here's another problem I encountered that is, I think, quite interesting:
dt <- data.table(K=c("A","A","A","B","B","B"),A=c(2,3,4,1,3,4),B=c(3,3,3,1,1,1))
dt
K A B
1: A 2 3
2: A 3 3
3: A 4 3
4: B 1 1
5: B 3 1
6: B 4 1
Now I want a somewhat "higher" level of the data: for each letter in K there should be only one row, and "A_sum" should hold the count of A values sharing the same B value. Here there are three values for B=3 and three for B=1.
Resulting data.table:
dt_new
K A_sum B
1: A 3 3
2: B 3 1

It's not clear how you want to treat K, but here's one option:
dt_new <- dt[, list(A_sum = length(A)), by = list(K, B)]
# K B A_sum
# 1: A 3 3
# 2: B 1 3
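Since A_sum is just a row count per group, data.table's built-in .N counter is the idiomatic shorthand for length(A):
dt_new <- dt[, .(A_sum = .N), by = .(K, B)]
# K B A_sum
# 1: A 3 3
# 2: B 1 3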

Count occurrences of value in multiple columns with duplicates

My problem is very similar to:
R: Count occurrences of value in multiple columns
However, the solution proposed there doesn't work for me, because the value may appear twice in the same row and I want to count only the rows in which it appears. I have worked out a solution, but it seems too long:
> toy_data = data.table(from=c("A","A","A","C","E","E"), to=c("B","C","A","D","F","E"))
> toy_data
from to
1: A B
2: A C
3: A A
4: C D
5: E F
6: E E
> #get a table with intra-link count
> A = data.table(table(unlist(toy_data[from==to,from ])))
> A
V1 N
1: A 1
2: E 1
> #get a table with total count
> B = data.table(table(unlist(toy_data[,c(from,to)])))
> B
V1 N
1: A 4
2: B 1
3: C 2
4: D 1
5: E 3
6: F 1
>
> # concatenate changing sign
> table = rbind(B,A[,.(V1,-N)],use.names=FALSE)
> # groupby and subtract
> table[,sum(N),by=V1]
V1 V1
1: A 3
2: B 1
3: C 2
4: D 1
5: E 2
6: F 1
Is there some function that would do the job in fewer lines? In Python I'd concatenate from and to and then use match(), but I can't find the right syntax here.
EDIT: I know this would work: A=length(toy_data[from=="A"|to=="A",from]), but I would like to avoid looping over the various values "A", "B", ... (and I don't know how to format the output this way anyway).
You can try the code below
> toy_data[, to := replace(to, from == to, NA)][, data.frame(table(unlist(.SD)))]
Var1 Freq
1 A 3
2 B 1
3 C 2
4 D 1
5 E 2
6 F 1
or
library(dplyr)
toy_data %>%
  mutate(to = replace(to, from == to, NA)) %>%
  unlist() %>%
  table() %>%
  as.data.frame()
which gives
. Freq
1 A 3
2 B 1
3 C 2
4 D 1
5 E 2
6 F 1
Using data.table
library(data.table)
toy_data[from == to, to := NA][, .(to = na.omit(c(from, to)))][, .N, to]
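For reference, that chain should print the counts in order of first appearance (no key is set, so nothing is sorted):
to N
1: A 3
2: C 2
3: E 2
4: B 1
5: D 1
6: F 1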
You could just subset the to vector:
data.table(table(unlist(toy_data[,c(from,to[to!=from])])))
V1 N
1: A 3
2: B 1
3: C 2
4: D 1
5: E 2
6: F 1
Using to := NA as suggested by akrun, one can wrap the result in table(unlist()) and convert it to a data.table:
data.table(table(unlist(toy_data[from==to, to:=NA, from])))
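All of these variants lean on the same fact: table() drops NA entries by default (its useNA argument defaults to "no"), so rewriting the intra-links as NA removes them from the count. A minimal illustration:
table(c("A", NA, "B"))
# A B
# 1 1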

I found a strange thing (bug?) about 'combn' function and 'data.table' package [all possible combinations by group]

I tried to find all possible combinations by group, using the combn function and the data.table package as shown in the post "Generate All ID Pairs, by group with data.table in R".
This gives me the expected result.
dat1 <- data.table(ids=1:4, groups=c("B","A","B","A"))
dat1
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 A
dat1[, as.data.table(t(combn(ids, 2))), .(groups)]
groups V1 V2
1: B 1 3
2: A 2 4
But this gives me a strange result. It's very weird; I tried to understand it for about 3 hours and can't. Is it a bug?
dat2 <- data.table(ids=1:4, groups=c("B","A","B","C"))
dat2
ids groups
1: 1 B
2: 2 A
3: 3 B
4: 4 C
dat2[, as.data.table(t(combn(ids, 2))), .(groups)]
groups V1 V2
1: B 1 3
2: A 1 2
3: C 1 2
4: C 1 3
5: C 1 4
6: C 2 3
7: C 2 4
8: C 3 4
I'd really appreciate an explanation.
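For what it's worth, this is documented combn() behaviour rather than a data.table bug. Per ?combn, when x is a single positive integer, combn(x, m) treats it as seq_len(x). In dat2 the groups A and C each contain a single id (2 and 4 respectively), so combn(2, 2) enumerates the pairs of 1:2 and combn(4, 2) the pairs of 1:4, which is exactly the output above. A sketch of a guard that sidesteps the integer shortcut (safe_pairs is a hypothetical helper name):
# groups with fewer than 2 ids yield no pairs instead of
# triggering combn's seq_len() fallback on a single integer
safe_pairs <- function(ids) {
  if (length(ids) < 2) return(NULL)
  as.data.table(t(combn(ids, 2)))
}
dat2[, safe_pairs(ids), by = groups]
#    groups V1 V2
# 1:      B  1  3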

R data.table filtering on group size

I am trying to find all the records in my data.table for which there is more than one row with value v in field f.
For instance, we can use this data:
dt <- data.table(f1=c(1,2,3,4,5), f2=c(1,1,2,3,3))
If looking for that property in field f2, we'd get (note the absence of the (3,2) tuple)
f1 f2
1: 1 1
2: 2 1
3: 4 3
4: 5 3
My first guess was dt[.N>2,list(.N),by=f2], but that actually keeps entries with .N==1.
dt[.N>2,list(.N),by=f2]
f2 N
1: 1 2
2: 2 1
3: 3 2
The other easy guess, dt[duplicated(dt$f2)], doesn't do the trick, as it keeps one of the 'duplicates' out of the results.
dt[duplicated(dt$f2)]
f1 f2
1: 2 1
2: 5 3
So how can I get this done?
Edited to add example
The question is not clear. Based on the title, it looks like we want to extract all groups with number of rows (.N) greater than 1.
DT[, if(.N>1) .SD, by=f]
But the mention of a value v in field f makes it confusing.
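Applied to the example data above, this returns exactly the requested rows, with the grouping column moved to the front (because .SD excludes the by column):
dt[, if (.N > 1) .SD, by = f2]
#    f2 f1
# 1:  1  1
# 2:  1  2
# 3:  3  4
# 4:  3  5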
If I understand what you're after correctly, you'll need to do some compound queries:
library(data.table)
DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))
DT[, N := .N, f][N > 2][, N := NULL][]
# v1 f
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 1
# 5: 5 2
# 6: 6 3
# 7: 7 1
# 8: 8 2
# 9: 9 3
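A variant of the same filter that avoids adding and then deleting the helper column by reference collects the qualifying row indices with .I (a sketch):
# .I holds the original row numbers; keep them only for groups
# with more than 2 rows, then subset once
idx <- DT[, .I[.N > 2], by = f]$V1
DT[idx]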

How to sort a data.table using a target vector

So, I have the following data.table
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
> DT
x y
1: b 1
2: b 2
3: b 3
4: a 1
5: a 2
6: a 3
7: c 1
8: c 2
9: c 3
And I have the following vector
k <- c("2","3","1")
I want to use k as a target vector to sort DT using y and get something like this.
> DT
x y
1: b 2
2: a 2
3: c 2
4: b 3
5: a 3
6: c 3
7: b 1
8: a 1
9: c 1
Any ideas? If I use DT[order(k)] I get a subset of the original data, and that isn't what I am looking for.
Throw a call to match() in there.
DT[order(match(y, as.numeric(k)))]
# x y
# 1: b 2
# 2: a 2
# 3: c 2
# 4: b 3
# 5: a 3
# 6: c 3
# 7: b 1
# 8: a 1
# 9: c 1
Actually DT[order(match(y, k))] would work as well, but it is probably safest to make the arguments to match() of the same class just in case.
Note: match() is known to be sub-optimal in some cases. If you have a large number of rows, you may want to switch to fastmatch::fmatch for faster matching.
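A sketch of that drop-in replacement, assuming the fastmatch package is installed:
library(fastmatch)
DT[order(fmatch(y, as.numeric(k)))]  # same result, hash-based matching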
You can do this:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
setkey(DT,y)
DT[data.table(as.numeric(k))]
or (from Richard's comment):
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
DT[data.table(y = as.numeric(k)), on = "y"]

get rows of unique values by group

I have a data.table and want to keep only those rows where the value of a variable x is unique within each group of another variable y (i.e. drop repeated x values per y).
It's possible to get the unique values of x, grouped by y in a separate dataset, like this
dt[,unique(x),by=y]
But I want to pick the rows in the original dataset where this is the case. I don't want a new data.table because I also need the other variables.
So, what do I have to add to my code to get the rows in dt for which the above is true?
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
y x z
1: a 1 1
2: a 2 2
3: a 2 3
4: b 3 4
5: b 2 5
6: b 1 6
What I want:
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
The idiomatic data.table way is:
require(data.table)
unique(dt, by = c("y", "x"))
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6
data.table handles duplicated() a bit differently. Here's an approach I've seen around here before:
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
setkey(dt, "y", "x")
key(dt)
# [1] "y" "x"
!duplicated(dt)
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
dt[!duplicated(dt)]
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 1 6
# 4: b 2 5
# 5: b 3 4
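One caveat: in current data.table versions duplicated() defaults to comparing all columns rather than the key, so it is safer to name the columns explicitly through its by argument; that also makes the setkey() call unnecessary (a hedged sketch):
dt[!duplicated(dt, by = c("y", "x"))]
#    y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6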
The simpler data.table solution is to grab the first element of each group:
> dt[, head(.SD, 1), by=.(y, x)]
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
With dplyr (shown on a separate toy dataset):
library(dplyr)
col1 = c(1,1,3,3,5,6,7,8,9)
col2 = c("cust1", 'cust1', 'cust3', 'cust4', 'cust5', 'cust5', 'cust5', 'cust5', 'cust6')
df1 = data.frame(col1, col2)
df1
distinct(select(df1, col1, col2))
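Applied back to the original dt, the dplyr equivalent needs .keep_all = TRUE to retain the other columns (a sketch, assuming a dplyr version that has the .keep_all argument):
library(dplyr)
dt %>% distinct(y, x, .keep_all = TRUE)  # keeps first row of each (y, x) pair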
