Duplicates after aggregating with data.table - r

I'm new to data.table and am trying to count how many rows in my table share the same values in two columns. The resulting table has multiple rows containing the same key combination. Can someone tell me what I'm doing wrong?
labs_raw_df <- data.table(labs_raw)
setkey(labs_raw_df, NAT, LAB_TST_AN_LAB_TST_CD)
lab_pt_count <- labs_raw_df[,
list(n=.N)
,by=list(NAT, LAB_TST_AN_LAB_TST_CD)]
Both columns are character.

Writing an answer since this is too long for a comment.
I assume that you use data.table 1.8.6.
Let's create some dummy data:
set.seed(42)
labs_raw_df <- data.frame(NAT=sample(c("A","B","C"),20,TRUE),
LAB_TST_AN_LAB_TST_CD=sample(c("A","B","C"),20,TRUE),
value=sample(0:1,20,TRUE))
Now your code (with some minor corrections of naming):
library(data.table)
labs_raw_dt <- data.table(labs_raw_df)
setkey(labs_raw_dt, NAT, LAB_TST_AN_LAB_TST_CD)
lab_pt_count <- labs_raw_dt[,
list(n=.N),
by=list(NAT, LAB_TST_AN_LAB_TST_CD)]
print(lab_pt_count)
NAT LAB_TST_AN_LAB_TST_CD n
1: A A 1
2: A C 3
3: B A 2
4: B B 3
5: B C 2
6: C A 2
7: C B 2
8: C C 5
This is the expected result. Can you elaborate on how that doesn't meet your expectation?
Of course we can simplify a bit:
lab_pt_count <- labs_raw_dt[,
.N,
by=key(labs_raw_dt)]
print(lab_pt_count)
NAT LAB_TST_AN_LAB_TST_CD N
1: A A 1
2: A C 3
3: B A 2
4: B B 3
5: B C 2
6: C A 2
7: C B 2
8: C C 5
But the result is the same.
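If the real data still seems to produce repeated key combinations, one way to check is anyDuplicated() with a by argument. This is a minimal sketch on the dummy data above; grp_cols is a helper name introduced here for illustration and is not part of the original code.

```r
library(data.table)

# Rebuild the dummy data from above
set.seed(42)
labs_raw_dt <- data.table(
  NAT = sample(c("A", "B", "C"), 20, TRUE),
  LAB_TST_AN_LAB_TST_CD = sample(c("A", "B", "C"), 20, TRUE),
  value = sample(0:1, 20, TRUE)
)

# grp_cols is a hypothetical helper holding the grouping columns
grp_cols <- c("NAT", "LAB_TST_AN_LAB_TST_CD")
lab_pt_count <- labs_raw_dt[, .N, by = grp_cols]

# 0 means no combination of the grouping columns occurs more than once
anyDuplicated(lab_pt_count, by = grp_cols)

# The group counts should add up to the number of input rows
sum(lab_pt_count$N) == nrow(labs_raw_dt)
```

A non-zero result from anyDuplicated() would give the index of the first repeated combination, which is a reasonable starting point for inspecting the offending rows.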

Related

Creating data.table from a list of unequal vector lengths

I am looking to create a data.table from a list of unequal vectors, but instead of repeating the values for the "shorter" vector, I want it to be filled with NAs. I have one possible solution, but it repeats values and does not retain the NA as needed.
Example:
library(data.table)
my_list <- list(A = 1:4, B = letters[1:5])
as.data.table(do.call(cbind, my_list))
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 1 e
But I want it to look like:
as.data.table(do.call(cbind, my_list))
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: NA e
Thank you!
We need to make the lengths equal by appending NA to the end of every list element that is shorter than the longest one:
mx <- max(lengths(my_list))
as.data.table(do.call(cbind, lapply(my_list, `length<-`, mx)))
Output:
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: <NA> e
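Instead of cbind/as.data.table, setDT is more compact. It also keeps A as an integer column, because the list is not passed through a character matrix first: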
setDT(lapply(my_list, `length<-`, mx))[]
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: NA e
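The padding relies on the replacement function `length<-`, which extends a vector with NA when the new length is larger than the old one. A small base R sketch of that behaviour on its own:

```r
x <- 1:4

# Equivalent to x <- `length<-`(x, 5); the new fifth element is NA
length(x) <- 5
x  # 1 2 3 4 NA

# Shortening works the same way and simply drops trailing elements
length(x) <- 2
x  # 1 2
```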
You may use stringi::stri_list2matrix to pad all list elements to the same length.
my_list |>
stringi::stri_list2matrix() |>
data.table::as.data.table() |>
type.convert(as.is = TRUE) |>
setNames(names(my_list))
# A B
#1: 1 a
#2: 2 b
#3: 3 c
#4: 4 d
#5: NA e

add rows to data frame for each value from a character column R [duplicate]

My problem is best explained with an example:
Role Skill ID
1: A a 1, 1/a, 1/b
2: A b 1/a, 2/a
3: B c 1/a, 2/c
4: B d 3
5: C e 4
Having the data, dt, above, I would like to create additional rows, for each value of the ID variable.
The end result should be:
Role Skill ID
1: A a 1
2: A a 1/a
3: A a 1/b
4: A b 1/a
5: A b 2/a
6: B c 1/a
7: B c 2/c
8: B d 3
9: C e 4
Below, the code to replicate the data:
library(data.table)
dt <- data.table(Role = c("A","A","B","B","C"),
Skill = c("a","b","c","d","e"),
ID = c("1, 1/a, 1/b", "1/a, 2/a", "1/a, 2/c", "3", "4"))
We can use separate_rows
library(dplyr)
library(tidyr)
dt %>%
separate_rows(ID, sep=",\\s*")
# Role Skill ID
#1: A a 1
#2: A a 1/a
#3: A a 1/b
#4: A b 1/a
#5: A b 2/a
#6: B c 1/a
#7: B c 2/c
#8: B d 3
#9: C e 4
Or with strsplit
dt[, .(ID = unlist(strsplit(ID, ",\\s*"))), .(Role, Skill)]
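To make the strsplit variant easier to follow, here is a short sketch of what each step returns on the example data. The name long is introduced here only for illustration.

```r
library(data.table)

dt <- data.table(Role = c("A", "A", "B", "B", "C"),
                 Skill = c("a", "b", "c", "d", "e"),
                 ID = c("1, 1/a, 1/b", "1/a, 2/a", "1/a, 2/c", "3", "4"))

# strsplit() returns a list; the pattern is a comma plus optional whitespace
strsplit(dt$ID[1], ",\\s*")[[1]]  # "1" "1/a" "1/b"

# Within each (Role, Skill) group, unlist() flattens the pieces into one
# column, so every piece becomes its own row
long <- dt[, .(ID = unlist(strsplit(ID, ",\\s*"))), by = .(Role, Skill)]
nrow(long)  # 9
```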

Subsetting data.table by values with repetition

I would like to subset a data.table in a way that returns some rows repeatedly. This works well with indices, but I am not sure how to do it in a simple way with values, especially when a value appears in more than one row.
E.g.:
library(data.table)
dt<-data.table(x1=c('a','a','b','b','c','c'),x2=c(1,2,3,4,5,6))
xsel<-c('a','b','a','a','c','b')
dt[x1%in%xsel,]
will provide this output:
x1 x2
1: a 1
2: a 2
3: b 3
4: b 4
5: c 5
6: c 6
I would like to get it in the original order and with repetition exactly as it is in the xsel vector. Is it possible to do it in a reasonably simple way without looping? Thanks.
Using:
setkey(dt, x1) # set the key
dt[J(xsel)] # or: dt[.(xsel)]
gives:
x1 x2
1: a 1
2: a 2
3: b 3
4: b 4
5: a 1
6: a 2
7: a 1
8: a 2
9: c 5
10: c 6
11: b 3
12: b 4
Without setting the key, you could use:
dt[.(xsel), on = .(x1)]
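A quick sketch to confirm that the join follows the order of xsel, and to show what happens by default when a looked-up value is missing from dt. The value "z" and the name res are made up for illustration.

```r
library(data.table)

dt <- data.table(x1 = c("a", "a", "b", "b", "c", "c"), x2 = c(1, 2, 3, 4, 5, 6))
xsel <- c("a", "b", "a", "a", "c", "b")

res <- dt[.(xsel), on = .(x1)]

# Every value in xsel matches two rows of dt, returned in xsel order
identical(res$x1, rep(xsel, each = 2))  # TRUE

# A value that does not occur in dt yields a single row with NA in x2
dt[.("z"), on = .(x1)]
```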

How to sort a data.table using a target vector

So, I have the following data.table
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
> DT
x y
1: b 1
2: b 2
3: b 3
4: a 1
5: a 2
6: a 3
7: c 1
8: c 2
9: c 3
And I have the following vector
k <- c("2","3","1")
I want to use k as a target vector to sort DT using y and get something like this.
> DT
x y
1: b 2
2: a 2
3: c 2
4: b 3
5: a 3
6: c 3
7: b 1
8: a 1
9: c 1
Any ideas? If I use DT[order(k)] I get a subset of the original data, and that isn't what I am looking for.
Throw a call to match() in there.
DT[order(match(y, as.numeric(k)))]
# x y
# 1: b 2
# 2: a 2
# 3: c 2
# 4: b 3
# 5: a 3
# 6: c 3
# 7: b 1
# 8: a 1
# 9: c 1
Actually DT[order(match(y, k))] would work as well, but it is probably safest to make the arguments to match() of the same class just in case.
Note: match() is known to be sub-optimal in some cases. If you have a large number of rows, you may want to switch to fastmatch::fmatch for faster matching.
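To see why order(match(...)) gives the target ordering, it can help to look at the intermediate vectors. This base R sketch uses a shortened y; pos is a name introduced here for illustration.

```r
y <- c(1, 2, 3, 1, 2, 3)
k <- c("2", "3", "1")

# match() gives, for each y, its position in the target vector
pos <- match(y, as.numeric(k))
pos            # 3 1 2 3 1 2

# order() on those positions puts rows with y == 2 first, then 3, then 1;
# ties keep their original relative order
order(pos)     # 2 5 3 6 1 4
y[order(pos)]  # 2 2 3 3 1 1
```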
You can do this:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
setkey(DT,y)
DT[data.table(as.numeric(k))]
or (from the comment of Richard)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
DT[data.table(y = as.numeric(k)), on = "y"]

data preparation part II

There's another problem I encountered which is (I think) quite interesting:
dt <- data.table(K=c("A","A","A","B","B","B"),A=c(2,3,4,1,3,4),B=c(3,3,3,1,1,1))
dt
K A B
1: A 2 3
2: A 3 3
3: A 4 3
4: B 1 1
5: B 3 1
6: B 4 1
Now I want a somewhat "higher" level view of the data. For each letter in K there should be only one row, and "A_sum" should hold the number of A values that share the same value of B. So there are three values for B=3 and three values for B=1.
Resulting data.table:
dt_new
K A_sum B
1: A 3 3
2: B 3 1
It's not clear how you want to treat K, but here's one option:
dt_new <- dt[, list(A_sum = length(A)), by = list(K, B)]
# K B A_sum
# 1: A 3 3
# 2: B 1 3
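As a variation, the same count can be written with data.table's .N symbol, and setcolorder() can rearrange the columns to match the layout asked for in the question. This is a sketch of one option, not the only way to do it.

```r
library(data.table)

dt <- data.table(K = c("A", "A", "A", "B", "B", "B"),
                 A = c(2, 3, 4, 1, 3, 4),
                 B = c(3, 3, 3, 1, 1, 1))

# .N is the number of rows in each (K, B) group, same as length(A) here
dt_new <- dt[, .(A_sum = .N), by = .(K, B)]

# Reorder columns by reference to K, A_sum, B
setcolorder(dt_new, c("K", "A_sum", "B"))
dt_new
```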
