I am searching a way to reorder each row of a data.table in alphatical order in an efficient way. So I assume that each column does give the same information and is comparable. When you see the example it will make more sense:
test <- data.table(A = c("A", "b", "c"),
B = c(1,"a","d"),
C = c("F", 0, 1))
Expected result:
result <- data.table(t(apply(test,1, sort)))
names(result) <- colnames(test)
In this solution I have to loop through all the rows, can this be prevented?
For 2 columns I found a efficient way to solve this problem:
result <- data.table(A = pmin(test$A, test$B), B = pmax(test$A, test$B) )
But this solution does not work well for more than 2 columns
EDIT:
Lets add a benchmark of the different solutions on two columns:
test <- data.table(A = sample(c("A","B", "C", "D"), 1000000, replace = T),
B = sample(c("A","B", "C", "D"), 1000000, replace = T))
OptionOne <- function(test){
result <- data.table(A = pmin(test$A, test$B), B = pmax(test$A, test$B) )
}
OptionTwo <- function(test){
test[, names(test) := as.list(sort(unlist(.SD))), 1:nrow(test)][]
}
OptionThree <- function(test){
test[, id := .I]
test <- melt(test, id.vars = "id")
setorder(test, id, value)
test[, variable1 := seq_len(.N), by = id]
dcast(test, id ~ variable1, value.var = "value")
}
system.time(OptionOne(test))
#user system elapsed
#0.13 0.00 0.12
system.time(OptionTwo(test))
# user system elapsed
# 17.58 0.00 18.27
system.time(OptionThree(test))
#user system elapsed
# 0.23 0.00 0.24
It seems like for two columns the pmin and pmax is the most efficient way but for more columns the reshape does a good job.
Your data.table is conceptionally in the wrong shape. Sorting over rows (i.e., over variables) does not make sense. Thus, to do this efficiently you need to reshape:
library(data.table)
test <- data.table(A = c("A", "b", "c"),
B = c(1,"a","d"),
C = c("F", 0, 1))
test[, id := .I]
test <- melt(test, id.vars = "id")
setorder(test, id, value)
# id variable value
#1: 1 B 1
#2: 1 A A
#3: 1 C F
#4: 2 C 0
#5: 2 B a
#6: 2 A b
#7: 3 C 1
#8: 3 A c
#9: 3 B d
If you must, you can then reshape again, though I would not recommend that.
test[, variable1 := seq_len(.N), by = id]
dcast(test, id ~ variable1, value.var = "value")
# id 1 2 3
#1: 1 1 A F
#2: 2 0 a b
#3: 3 1 c d
We can try
test[, names(test) := as.list(sort(unlist(.SD))), 1:nrow(test)][]
Related
I have a look-up table of "firsts" in column d. For example, the first time the patient was admitted because of a specific disease. I would like to join this back into the main data frame via data.table on multiple other conditions.
My problem is that, unfortunately, the main data.table could have multiple records with identical joining criteria that results in multiple "firsts" per patient after the join. Real world data is messy, people!
Is it possible to do a {data.table} join on only the first matching record?
This is similar to this question, but the multiple-matches are on the main data table. I think that mult only works on when there are several entries on the table being joined in.
repex:
library(data.table)
set.seed(1724)
d1 <- data.table(a = c(1, 1, 1),
b = c(1, 1, 2),
c = sample(1:10, 3))
d2 <- data.table(a = 1, b = 1, d = TRUE)
d2[d1, on = c("a", "b")]
a b d c
1: 1 1 TRUE 4
2: 1 1 TRUE 8
3: 1 2 NA 2
desired output
a b d c
1: 1 1 TRUE 4
2: 1 1 NA 8
3: 1 2 NA 2
library(data.table)
set.seed(1724)
d1 = data.table(a = c(1, 1, 1), b = c(1, 1, 2), c = sample(1:10, 3))
d2 = data.table(a = 1, b = 1, d = TRUE)
d1[, i1:=seq_len(.N), by=c("a","b")]
d2[, i2:=seq_len(.N), by=c("a","b")]
d2[d1, on = c("a","b","i2==i1")][, "i2":=NULL][]
# a b d c
# <num> <num> <lgcl> <int>
#1: 1 1 TRUE 4
#2: 1 1 NA 8
#3: 1 2 NA 2
One way would be to turn the values to NA after join.
library(data.table)
d3 <- d2[d1, on = c("a", "b")]
d3[, d:= replace(d, seq_len(.N) != 1, NA), .(a, b)]
d3
# a b d c
#1: 1 1 TRUE 4
#2: 1 1 NA 8
#3: 1 2 NA 2
The easy solution would be to index every row and join on this also (the d2 is a filtered version of d1):
library(data.table)
set.seed(1724)
d1 <- data.table(a = c(1, 1, 1),
b = c(1, 1, 2),
c = sample(1:10, 3))
d1[, rid := seq(to = .N)]
d2 <- d1[, .SD[1], by = c("a"), .SDcols = c("b", "rid")][, d := TRUE] # UPDATE
d2[d1, on = c("a", "b", "rid")]
Let's say I have two data.table, dt_a and dt_b defined as below.
library(data.table)
set.seed(20201111L)
dt_a <- data.table(
foo = c("a", "b", "c")
)
dt_b <- data.table(
bar = sample(c("a", "b", "c"), 10L, replace=TRUE),
value = runif(10L)
)
dt_b[]
## bar value
## 1: c 0.4904536
## 2: c 0.9067509
## 3: b 0.1831664
## 4: c 0.0203943
## 5: c 0.8707686
## 6: a 0.4224133
## 7: a 0.6025349
## 8: b 0.4916672
## 9: a 0.4566726
## 10: b 0.8841110
I want to left join dt_b on dt_a by reference, summing over the multiple match. A way to do so would be to first create a summary of dt_b (thus solving the multiple match issue) and merge if afterwards.
dt_b_summary <- dt_b[, .(value=sum(value)), bar]
dt_a[dt_b_summary, value_good:=value, on=c(foo="bar")]
dt_a[]
## foo value_good
## 1: a 1.481621
## 2: b 1.558945
## 3: c 2.288367
However, this will allow memory to the object dt_b_summary, which is inefficient.
I would like to have the same result by directly joining on dt_b and summing multiple match. I'm looking for something like below, but that won't work.
dt_a[dt_b, value_bad:=sum(value), on=c(foo="bar")]
dt_a[]
## foo value_good value_bad
## 1: a 1.481621 5.328933
## 2: b 1.558945 5.328933
## 3: c 2.288367 5.328933
Anyone knows if there is something possible?
We can use .EACHI with by
library(data.table)
dt_b[dt_a, .(value = sum(value)), on = .(bar = foo), by = .EACHI]
# bar value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
If we want to update the original object 'dt_a'
dt_a[, value := dt_b[.SD, sum(value), on = .(bar = foo), by = .EACHI]$V1]
dt_a
# foo value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
For multiple columns
dt_b$value1 <- dt_b$value
nm1 <- c('value', 'value1')
dt_a[, (nm1) := dt_b[.SD, lapply(.SD, sum),
on = .(bar = foo), by = .EACHI][, .SD, .SDcols = nm1]]
Suppose I have a large data.table that looks like dt below.
dt <- data.table(
player_1 = c("a", "b", "b", "c"),
player_1_age = c(10, 20, 20, 30),
player_2 = c("b", "a", "c", "a"),
player_2_age = c(20, 10, 30, 10)
)
# dt
# player_1 player_1_age player_2 player_2_age
# 1: a 10 b 20
# 2: b 20 a 10
# 3: b 20 c 30
# 4: c 30 a 10
From the dt above, I would like to create a data.table with unique players and their age like the following, player_dt:
# player_dt
# player age
# a 10
# b 20
# c 30
To do so, I've tried the code below, but it takes too long on my larger dataset, probably because I am creating a data.table for each iteration of sapply.
How would you get the player_dt above, while checking for each player that there is only one unique age value?
# get unique players
player <- sort(unique(c(dt$player_1, dt$player_2)))
# for each player, get their age, if there is only one age value
age <- sapply(player, function(x) {
unique_values <- unique(c(
dt[player_1 == x][["player_1_age"]],
dt[player_2 == x][["player_2_age"]]))
if(length(unique_values) > 1) stop() else return(unique_values)
})
# combine to create the player_dt
player_dt <- data.table(player, age)
I use the data from #DavidT as input.
dt
# player_1 player_1_age player_2 player_2_age
#1: a 10 b 20
#2: b 20 a 10
#3: b 20 c 30
#4: c 30 a 11 # <--
TL;DR
You can do
nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]
out <-
unique(melt(
dt,
measure.vars = list(colsAge, colsOther),
value.name = c("age", "player")
)[, .(age, player)])[, if (.N == 1) # credit: https://stackoverflow.com/a/34427944/8583393
.SD, by = player]
out
# player age
#1: b 20
#2: c 30
Step-by-step
What you can to do is to melt multiple columns simultaneously - those that end with "age" and those that don't.
nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]
dt1 <- melt(dt, measure.vars = list(colsAge, colsOther), value.name = c("age", "player"))
The result is
dt1
# variable age player
#1: 1 10 a
#2: 1 20 b
#3: 1 20 b
#4: 1 30 c
#5: 2 20 b
#6: 2 10 a
#7: 2 30 c
#8: 2 11 a
Now we call unique ...
out <- unique(dt1[, .(age, player)])
out
# age player
#1: 10 a
#2: 20 b
#3: 30 c
#4: 11 a
... and filter for groups of "player" with length equal to 1
out <- out[, if(.N == 1) .SD, by=player]
out
# player age
#1: b 20
#2: c 30
Given OP's input data, that last step is not needed.
data
library(data.table)
dt <- data.table(
player_1 = c("a", "b", "b", "c"),
player_1_age = c(10, 20, 20, 30),
player_2 = c("b", "a", "c", "a"),
player_2_age = c(20, 10, 30, 11)
)
Reference: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html
I've altered your data so that there's at least one error to catch:
library(tidyverse)
dt <- tibble(
player_1 = c("a", "b", "b", "c"),
player_1_age = c(10, 20, 20, 30),
player_2 = c("b", "a", "c", "a"),
player_2_age = c(20, 10, 30, 11)
)
# Get the Names columns and the Age columns
colName <- names(dt)
ageCol <- colName[str_detect(colName, "age$")]
playrCol <- colName[! str_detect(colName, "age$")]
# Gather the Ages
ages <- dt %>%
select(ageCol) %>%
gather(player_age, age)
# Gather the names
names <- dt %>%
select(playrCol ) %>%
gather(player_name, name)
# Bind the two together, and throw out the duplicates
# If there are no contradictions, this is what you want.
allNameAge <- cbind( names, ages) %>%
select(name, age) %>%
distinct() %>%
arrange(name)
# But check for inconsistencies. This should leave you with
# an empty tibble, but instead it shows the error.
inconsistencies <- allNameAge %>%
group_by(name) %>%
mutate(AGE.COUNT = n_distinct(age)) %>%
filter(AGE.COUNT > 1) %>%
ungroup()
This should extends to more name/age column pairs.
I am having some difficulty creating a function which would group by according to column P and count all the "YES" for the columns that have been saved in a list such as "list_col". I am pretty sure I would be using a function from the apply family but not sure how to group it by a certain column (Col P in this case)
P <- as.character(c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
a <- as.character(c(NA,"Yes","Yes",NA,NA,NA,"Yes","Yes","Yes",NA))
b <- as.character(c(NA,"Yes",NA,NA,"Yes",NA,"Yes","Yes",NA,NA))
c <- as.character(c(NA,NA,NA,NA,"Yes",NA,"Yes",NA,NA,NA))
df_sample <- as.data.frame(cbind(P, a, b, c))
df_sample
list_col <- c("a","b","c")
Ideally I would be looking for the following answer with the above sample dataframe and the answer below with the columns changing according to the "list_col"
P a b c
A 2 2 1
B 3 2 1
Any help would be appreciated
Here is an approach via dplyr,
library(dplyr)
df_sample %>%
group_by(P) %>%
select(list_col) %>%
summarise_all(funs(sum(. == 'Yes', na.rm = TRUE)))
#Adding missing grouping variables: `P`
# A tibble: 2 x 4
# P a b c
# <fctr> <int> <int> <int>
#1 A 2 2 1
#2 B 3 2 1
In data.table:
library(data.table)
list_col <- c("a","b","c")
setDT(df_sample)[, (lapply(.SD, function(x) sum(x=="Yes", na.rm = TRUE))), by = P, .SDcols = list_col]
# P a b c
#1: A 2 2 1
#2: B 3 2 1
Alternatively, a base R solution still using lapply:
res <-lapply(split(df_sample[,list_col], df_sample$P), function(x) colSums(x=="Yes", na.rm = TRUE))
do.call(rbind, res)
# a b c
#A 2 2 1
#B 3 2 1
For what it's worth, a microbenchmark on my machine:
microbenchmark::microbenchmark(splitlapply = do.call(rbind, lapply(split(df_sample[,list_col], df_sample$P), function(x) colSums(x=="Yes", na.rm = TRUE))),
+ dt = sampleDT[, (lapply(.SD, function(x) sum(x=="Yes", na.rm = TRUE))), by = P, .SDcols = list_col])
Unit: microseconds
expr min lq mean median uq max neval
splitlapply 455.841 505.0715 546.6699 529.3225 561.2315 889.436 100
dt 861.722 1052.9920 1114.2752 1111.7040 1166.7695 1707.761 100
Using melt from reshape
library(reshape)
df=na.omit(melt(df_sample,id.vars='P'))
table(df$P,df$variable)
a b c
A 2 2 1
B 3 2 1
df_sample <- as.data.frame(cbind(P, a, b, c), stringsAsFactors = FALSE)
for (i in list_col){
df_r <- df_sample[, i] == 'Yes' & !is.na(df_sample[, i])
df_sample[df_r, i] <- df_sample$P[df_r]
}
sapply(df_sample[, list_col], table)
I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
ID = rep(1:5, c(3:1, 2:3)),
Group = c("A", "B", "C", "B",
"C", "A", "A", "C",
"A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
I want to know the sum of identical individuals for 2 groups.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.
Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
# simplify = FALSE returns a list;
# transpose turns the 3-length list of 2-length vectors
# into a length-2 list of 3-length vectors (efficiently)
transpose(combn(Group, 2L, simplify = FALSE)), by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3
As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanantion
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C
We self-join with the same dataset on 'ID', subset the rows where the 'Group' columns are different, get the nrows (.N), grouped by the 'Group' columns, sort the 'Group.1' and 'Group.2' columns by row using pmin/pmax and get the unique value of 'N'.
library(data.table)#v1.9.6+
DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][,
list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
Or as mentioned in the comments by #MichaelChirico and #Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group', 'i.Group' and get the nrow (.N)
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]
Great answers above.
Just an alternative using dplyr in case you, or someone else, is interested.
library(dplyr)
cmb = combn(unique(dt$Group),2)
data.frame(g1 = cmb[1,],
g2 = cmb[2,]) %>%
group_by(g1,g2) %>%
summarise(l=length(intersect(DT[DT$Group==g1,]$ID,
DT[DT$Group==g2,]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3
yet another solution (base R):
tmp <- split(DT, DT[, 'Group'])
ans <- apply(combn(LETTERS[1 : 3], 2), 2, FUN = function(ind){
out <- length(intersect(tmp[[ind[1]]][, 1], tmp[[ind[2]]][, 1]))
c(group1 = ind[1], group2 = ind[2], sum_ = out)
}
)
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
first split data into list of groups, then for each unique pairwise combinations of two groups see how many subjects in common they have, using length(intersect(....