I have a data.table like this
dt1=data.table(id=c(001,001,002,002,003,003),
score=c(4,6,3,7,2,8))
where each individual has 2 scores on the variable "score".
I would like to assign each individual to a category in a new variable "category" based on their score: for their lower score they get an "A", and for their higher score they get a "B". So the final table looks like this:
dt2=data.table(id=c(001,001,002,002,003,003),
score=c(4,6,3,7,2,8),
category=c('A','B', 'A','B', 'A','B'))
Since the values in column "score" are random, the category should be assigned based on the magnitude of the numbers assigned to each person. Any help is much appreciated.
We can order by 'score' in i, grouped by 'id', and assign 'category' as c('A', 'B'):
library(data.table)
dt1[order(score), category := c('A', 'B') , by = id]
dt1
# id score category
#1: 001 4 A
#2: 001 6 B
#3: 002 3 A
#4: 002 7 B
#5: 003 2 A
#6: 003 8 B
Another option is to convert a logical vector to a numeric index and select the category values based on that:
dt1[, category := c('A', 'B')[(score != min(score)) + 1] ,by = id]
data (with 'id' stored as character so the leading zeros are preserved):
dt1 <- data.table(id=c('001','001','002','002','003','003'),
score=c(4,6,3,7,2,8))
We can use ifelse:
library(data.table)
dt1[, category := ifelse(score == min(score), 'A', 'B'), by = id]
Result:
id score category
1: 1 4 A
2: 1 6 B
3: 2 3 A
4: 2 7 B
5: 3 2 A
6: 3 8 B
I have a data frame like below
sample <- data.frame(ID = 1:9,
Group = c('AA','AA','AA','BB','BB','CC','CC','BB','CC'),
Value = c(1,1,1,2,2,2,3,2,3))
ID Group Value
1 AA 1
2 AA 1
3 AA 1
4 BB 2
5 BB 2
6 CC 2
7 CC 3
8 BB 2
9 CC 3
I want to select groups according to the number of distinct (unique) values within each group. For example, select groups where all values within the group are the same (one distinct value per group). If you look at the group CC, it has more than one distinct value (2 and 3) and should thus be removed. The other groups, with only one distinct value, should be kept. Desired output:
ID Group Value
1 AA 1
2 AA 1
3 AA 1
4 BB 2
5 BB 2
8 BB 2
Could you suggest simple and fast R code that solves this problem?
Here's a solution using dplyr:
library(dplyr)
sample <- data.frame(
ID = 1:9,
Group= c('AA', 'AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'BB', 'CC'),
Value = c(1, 1, 1, 2, 2, 2, 3, 2, 3)
)
sample %>%
group_by(Group) %>%
filter(n_distinct(Value) == 1)
We group the data by Group, and then only select groups where the number of distinct values of Value is 1.
data.table version:
library(data.table)
sample <- as.data.table(sample)
sample[ , if(uniqueN(Value) == 1) .SD, by = Group]
# Group ID Value
#1: AA 1 1
#2: AA 2 1
#3: AA 3 1
#4: BB 4 2
#5: BB 5 2
#6: BB 8 2
If the data is numeric, an alternative using ave is to check whether the variance within each group is 0:
sample[with(sample, ave(Value, Group, FUN=var ))==0,]
An alternative solution that could be faster on large data is:
setkey(sample, Group, Value)
ans <- sample[unique(sample)[, .N, by=Group][N==1, Group]]
The point is that computing the unique values within each group can be time consuming when there are many groups. Instead, we set the key on the data.table, take the unique rows by key (which is extremely fast), count those rows per group, and keep only the groups where that count is 1. We then join back to the original table (which is again very fast) to get the desired rows.
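To make the chained one-liner easier to follow, here is a step-by-step sketch of the intermediate results on the small sample data (an illustration only; note that unique(sample) in the answer relied on the older behaviour where unique() used the table's key by default, so with current data.table versions you may need unique(sample, by = key(sample)), which is what the sketch uses):
setDT(sample)                                        # in case sample is still a data.frame
setkey(sample, Group, Value)
unique(sample, by = key(sample))                     # one row per (Group, Value) pair
#    ID Group Value
# 1:  1    AA     1
# 2:  4    BB     2
# 3:  6    CC     2
# 4:  7    CC     3
unique(sample, by = key(sample))[, .N, by = Group]   # distinct Values per Group
#    Group N
# 1:    AA 1
# 2:    BB 1
# 3:    CC 2
unique(sample, by = key(sample))[, .N, by = Group][N == 1, Group]   # groups to keep: AA and BB
Here's a benchmark on large data: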
require(data.table)
set.seed(1L)
sample <- data.table(ID=1:1e7,
Group = sample(rep(paste0("id", 1:1e5), each=100)),
Value = sample(2, 1e7, replace=TRUE, prob=c(0.9, 0.1)))
system.time (
ans1 <- sample[,if(length(unique(Value))==1) .SD ,by=Group]
)
# minimum of three runs
# user system elapsed
# 14.328 0.066 14.382
system.time ({
setkey(sample, Group, Value)
ans2 <- sample[unique(sample)[, .N, by=Group][N==1, Group]]
})
# minimum of three runs
# user system elapsed
# 5.661 0.219 5.877
setkey(ans1, Group, ID)
setkey(ans2, Group, ID)
identical(ans1, ans2) # [1] TRUE
You can build a logical row selector for sample using ave in many different ways.
sample[ ave( sample$Value, sample$Group, FUN = function(x) length(unique(x)) ) == 1,]
or
sample[ ave( sample$Value, sample$Group, FUN = function(x) sum(abs(x - x[1])) ) == 0,]
(abs is needed here so that deviations above and below x[1] cannot cancel out)
or
sample[ ave( sample$Value, sample$Group, FUN = function(x) diff(range(x)) ) == 0,]
Here's an approach using aggregate:
ind <- aggregate(Value ~ Group, FUN = function(x) length(unique(x)) == 1, data = sample)
sample[sample$Group %in% ind$Group[ind$Value], ]
ID Group Value
1 1 AA 1
2 2 AA 1
3 3 AA 1
4 4 BB 2
5 5 BB 2
8 8 BB 2
I have a dataframe that looks like this:
data<-data.frame("ID" = c(rep("A", times = 13),
rep("B", times = 7)),
"Value" = c(112,130,67,120,117,45,56,90,140,210,30,45,65,220,145,34,45,89,120,180))
I want to add a column that counts each episode. An episode is from the first occurrence of a value <70 to the first occurrence of a value >=70. Sometimes, there is never a value >=70 after the initial value <70, but it is still considered an episode.
I want a resulting dataframe that looks like this:
data<-data.frame("ID" = c(rep("A", times = 13),
rep("B", times = 7)),
"Value" = c(112,130,67,120,117,45,56,90,140,210,30,45,65,220,145,34,45,89,120,180),
"Episode" = c(NA,NA,1,1,NA,2,2,2,NA,NA,3,3,3,NA,NA,1,1,1,NA,NA))
That way, I can summarize the number of episodes per ID:
final<-data.frame("ID" = c("A", "B"),
"Episodes" = c(3, 1))
Thank you in advance!
If your goal is to produce final, I think this works:
library(dplyr)
final <- data %>%
  group_by(ID) %>%
  # default = Inf so that a group whose very first value is below 70 still starts an episode
  mutate(is_new_episode = if_else(lag(Value, default = Inf) < 70, 'same', 'new'),
         is_episode = if_else(Value < 70, 'episode', 'no_episode'),
         episode_start = is_episode == 'episode' & is_new_episode == 'new') %>%
  summarize(Episodes = sum(episode_start))
Basically, you count which rows are the beginning of an episode.
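For a quick cross-check of the same idea in base R (a sketch on the question's data frame: within each ID, a row starts an episode when its Value is below 70 and the previous Value was not):
sapply(split(data$Value < 70, data$ID), function(x) sum(x & !c(FALSE, head(x, -1))))
# A B
# 3 1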
An option is to convert the 'data.frame' to a 'data.table' (setDT(data)) and create a logical column 'i1' that is TRUE when Value < 70 or when the previous Value within the 'ID' (obtained with shift) was below 70. Then use rleid (run-length id) to build a grouping column 'grp' from runs of identical adjacent 'i1' values within each 'ID'. Finally, subset the rows where 'i1' is TRUE and, within each 'ID', assign 'Episode' by matching 'grp' against its unique values; rows outside that subset are left as NA by default.
library(data.table)
setDT(data)[, i1 := Reduce(`|`, list(Value < 70,
shift(Value < 70, fill = FALSE))), ID]
data[, grp := rleid(i1), ID]
data[as.logical(i1), Episode := match(grp, unique(grp)), ID][,
c('grp', 'i1') := NULL][]
# ID Value Episode
# 1: A 112 NA
# 2: A 130 NA
# 3: A 67 1
# 4: A 120 1
# 5: A 117 NA
# 6: A 45 2
# 7: A 56 2
# 8: A 90 2
# 9: A 140 NA
#10: A 210 NA
#11: A 30 3
#12: A 45 3
#13: A 65 3
#14: B 220 NA
#15: B 145 NA
#16: B 34 1
#17: B 45 1
#18: B 89 1
#19: B 120 NA
#20: B 180 NA
From here, we can create the summarised output
data[, .(Episodes = uniqueN(Episode[!is.na(Episode)])), ID]
# ID Episodes
#1: A 3
#2: B 1
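If only the per-ID counts in final are needed, the episode starts can also be counted directly with shift, without the helper columns (a sketch, assuming a fresh copy of the question's data):
setDT(data)[, .(Episodes = sum(Value < 70 & !shift(Value < 70, fill = FALSE))), by = ID]
#    ID Episodes
# 1:  A        3
# 2:  B        1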
I have a data.table like the following:
x <- data.table(group = c('A', 'A', 'A', 'B', 'B'),
row_id = c(1, 2, 3, 1, 2),
value = c('a', 'b', 'c', 'd', 'e'))
I want to add a new column that cumulatively concatenates column 'value', ordered by 'row_id', within each group indicated by 'group'. So the output would look like:
group row_id value
1: A 1 a
2: A 2 a_b
3: A 3 a_b_c
4: B 1 d
5: B 2 d_e
Thank you for your help!
One option would be to group by 'group' and loop over the row indices with sapply; for each index i, take the first i elements of 'value', paste them together with the delimiter _, and assign (:=) the result back to 'value':
x[, value := sapply(seq_len(.N), function(i)
paste(value[seq(i)], collapse = "_")), by = group]
x
# group row_id value
#1: A 1 a
#2: A 2 a_b
#3: A 3 a_b_c
#4: B 1 d
#5: B 2 d_e
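An alternative sketch (not from the original answer) uses Reduce with accumulate = TRUE, which extends the previous cumulative string instead of re-pasting the full prefix at every row; it assumes x still holds the original single-letter values:
x[, value := Reduce(function(a, b) paste(a, b, sep = "_"), value, accumulate = TRUE), by = group]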
What is an efficient and elegant data.table syntax for finding the most common category for each id? I keep a boolean vector indicating NA positions (for other purposes).
dt = data.table(id=rep(1:2,7), category=c("x","y",NA))
print(dt)
In this toy example, ignoring NA, x is the most common category for id == 1 and y for id == 2.
If you want to ignore NA's, you have to exclude them first with !is.na(category), group by id and category (by = .(id, category)) and create a frequency variable with .N:
dt[!is.na(category), .N, by = .(id, category)]
which gives:
id category N
1: 1 x 3
2: 2 y 3
3: 2 x 2
4: 1 y 2
Ordering this by id will give you a clearer picture:
dt[!is.na(category), .N, by = .(id, category)][order(id)]
which results in:
id category N
1: 1 x 3
2: 1 y 2
3: 2 y 3
4: 2 x 2
If you just want the rows which indicate the top results:
dt[!is.na(category), .N, by = .(id, category)][order(id, -N), head(.SD,1), by = id]
or:
dt[!is.na(category), .N, by = .(id, category)][, .SD[which.max(N)], by = id]
which both give:
id category N
1: 1 x 3
2: 2 y 3
I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
ID = rep(1:5, c(3:1, 2:3)),
Group = c("A", "B", "C", "B",
"C", "A", "A", "C",
"A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
For every pair of groups, I want to know how many individuals they have in common.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.
Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
# simplify = FALSE returns a list;
# transpose turns the 3-length list of 2-length vectors
# into a length-2 list of 3-length vectors (efficiently)
transpose(combn(Group, 2L, simplify = FALSE)), by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3
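To see what the intermediate combn/transpose step produces for a single ID (ID 1 belongs to groups A, B and C), here is a quick illustration:
combn(c("A", "B", "C"), 2L, simplify = FALSE)
# a list of the 3 pairs: c("A","B"), c("A","C"), c("B","C")
transpose(combn(c("A", "B", "C"), 2L, simplify = FALSE))
# a list of 2 vectors, c("A","A","B") and c("B","C","C"), which become columns V1 and V2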
As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanation
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C
We self-join the dataset with itself on 'ID', subset the rows where the two 'Group' columns differ, get the row count (.N) grouped by the 'Group' columns, then order 'Group.1' and 'Group.2' within each row using pmin/pmax and take the unique value of 'N' for each pair.
library(data.table)#v1.9.6+
DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][,
list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
Or, as mentioned in the comments by @MichaelChirico and @Frank, we can convert 'Group' to a factor, subset the rows where as.integer(Group) < as.integer(i.Group), group by 'Group' and 'i.Group', and get the row count (.N):
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]
Great answers above.
Just an alternative using dplyr in case you, or someone else, is interested.
library(dplyr)
cmb = combn(unique(DT$Group), 2)
data.frame(g1 = cmb[1,],
g2 = cmb[2,]) %>%
group_by(g1,g2) %>%
summarise(l=length(intersect(DT[DT$Group==g1,]$ID,
DT[DT$Group==g2,]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3
Yet another solution (base R):
tmp <- split(DT$ID, DT$Group)
ans <- apply(combn(names(tmp), 2), 2, FUN = function(ind){
  out <- length(intersect(tmp[[ind[1]]], tmp[[ind[2]]]))
  c(group1 = ind[1], group2 = ind[2], sum_ = out)
})
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
First split the IDs into a list by group; then, for each unique pairwise combination of two groups, count how many IDs they have in common using length(intersect(...)).