Clustering rows by group based on column value with conditions - r

A few days ago I opened this thread:
Clustering rows by group based on column value
There we obtained this result:
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
                 Obs1 = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1),
                 Control = c(0, 3, 3, 1, 12, 1, 1, 1, 36, 13, 1, 1, 2, 24, 2, 2, 48),
                 ClusterObs1 = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5))
With:
df <- df %>%
  group_by(ID) %>%
  mutate_at(vars(Obs1),
            funs(ClusterObs1 = with(rle(.), rep(cumsum(values == 1), lengths))))
Now I have to make some modifications:
If the value of 'Control' is greater than 12 and the current 'Obs1' value is 1 and equal to the previous 'Obs1' value, then the 'DesiredResultClusterObs1' value should increase by 1:
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
                 Obs1 = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1),
                 Control = c(0, 3, 3, 1, 12, 1, 1, 1, 36, 13, 1, 1, 2, 24, 2, 2, 48),
                 ClusterObs1 = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5),
                 DesiredResultClusterObs1 = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 7))
I have tried adding an if_else condition with lag inside funs, but without success. Any ideas?
EDIT: How would this work for many columns?

This seems to work:
df %>%
  mutate(DesiredResultClusterObs1 = with(rle(Control > 12 & Obs1 == 1 & lag(Obs1) == 1),
                                         rep(cumsum(values == 1), lengths)) + ClusterObs1)
   ID Obs1 Control ClusterObs1 DesiredResultClusterObs1
1   1    1       0           1                        1
2   1    1       3           1                        1
3   1    0       3           1                        1
4   1    1       1           2                        2
5   1    0      12           2                        2
6   1    1       1           3                        3
7   1    1       1           3                        3
8   1    0       1           3                        3
9   1    1      36           4                        4
10  1    0      13           4                        4
11  1    0       1           4                        4
12  1    0       1           4                        4
13  1    1       2           5                        5
14  1    1      24           5                        6
15  1    1       2           5                        6
16  1    1       2           5                        6
17  1    1      48           5                        7
Basically, we use the rle + rep mechanic from your previous thread to build a cumulative counter from the TRUE/FALSE result of your conditions and add it to the existing ClusterObs1.
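To see the mechanic in isolation, here is a minimal sketch on a made-up logical vector:
x <- c(FALSE, TRUE, TRUE, FALSE, TRUE)
# each run of TRUEs bumps the counter by 1; the bump persists through the following FALSEs
with(rle(x), rep(cumsum(values == 1), lengths))
# [1] 0 1 1 1 2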
If you want to create multiple DesiredResultClusterObs columns, you can use mapply. Maybe there's a dplyr solution for this, but this is base R.
Data:
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
                 Obs1 = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1),
                 Obs2 = rbinom(17, 1, .5),  # random draw; set.seed() first to reproduce the output below
                 Control = c(0, 3, 3, 1, 12, 1, 1, 1, 36, 13, 1, 1, 2, 24, 2, 2, 48),
                 ClusterObs1 = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5))
df <- df %>%
  mutate_at(vars(Obs2),
            funs(ClusterObs2 = with(rle(.), rep(cumsum(values == 1), lengths))))
The loop:
newcols <- mapply(function(x, y) {
  # lag() here is dplyr::lag, so dplyr must be loaded
  with(rle(df$Control > 12 & x == 1 & lag(x) == 1),
       rep(cumsum(values == 1), lengths)) + y
}, df[2:3], df[5:6])
This produces a matrix with the new columns, which you can then rename and cbind to your data:
colnames(newcols) <- paste0("DesiredResultClusterObs", 1:2)
cbind.data.frame(df, newcols)
   ID Obs1 Obs2 Control ClusterObs1 ClusterObs2 DesiredResultClusterObs1 DesiredResultClusterObs2
1   1    1    1       0           1           1                        1                        1
2   1    1    1       3           1           1                        1                        1
3   1    0    0       3           1           1                        1                        1
4   1    1    0       1           2           1                        2                        1
5   1    0    0      12           2           1                        2                        1
6   1    1    0       1           3           1                        3                        1
7   1    1    1       1           3           2                        3                        2
8   1    0    0       1           3           2                        3                        2
9   1    1    1      36           4           3                        4                        3
10  1    0    1      13           4           3                        4                        4
11  1    0    0       1           4           3                        4                        4
12  1    0    1       1           4           4                        4                        5
13  1    1    1       2           5           4                        5                        5
14  1    1    0      24           5           4                        6                        5
15  1    1    1       2           5           5                        6                        6
16  1    1    1       2           5           5                        6                        6
17  1    1    1      48           5           5                        7                        7
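For reference, a possible dplyr variant of the same loop using across() (just a sketch, assuming dplyr >= 1.0; the Bump* helper columns are made-up names):
library(dplyr)

df %>%
  mutate(across(c(Obs1, Obs2),
                ~ with(rle(Control > 12 & .x == 1 & lag(.x, default = 0) == 1),
                       rep(cumsum(values == 1), lengths)),
                .names = "Bump{.col}")) %>%
  mutate(DesiredResultClusterObs1 = ClusterObs1 + BumpObs1,
         DesiredResultClusterObs2 = ClusterObs2 + BumpObs2) %>%
  select(-starts_with("Bump"))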

Related

Counter sequential of specific values in R

I have a column like this:
a = c(3, 1, 2, 3, 3, 3, 1, 3, 2, 3, 3, 1, 3, 2, 1, 3, 1)
I want a second column that counts occurrences of 1 and 2 sequentially, like this:
   a b
1  3 0
2  1 1
3  2 2
4  3 2
5  3 2
6  3 2
7  1 3
8  3 3
9  2 4
10 3 4
11 3 4
12 1 5
13 3 5
14 2 6
15 1 7
16 3 7
We can use cumsum on a logical vector:
df1$b <- cumsum(df1$a %in% c(1, 2))
Data:
df1 <- data.frame(a)
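To see why this works, a quick sketch of the two pieces: %in% returns a logical vector, and arithmetic on logicals treats TRUE as 1 and FALSE as 0:
df1$a %in% c(1, 2)          # TRUE wherever a is 1 or 2
cumsum(df1$a %in% c(1, 2))  # running count of the TRUEs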

R: How to start a new sub_id each time a new sequence begins

Suppose I have data as follows:
tibble(
  A = c(1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5),
  B = c(1, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 3, 4, 1, 1)
)
i.e.,
# A tibble: 16 x 2
       A     B
   <dbl> <dbl>
 1     1     1
 2     2     1
 3     2     2
 4     2     1
 5     2     2
 6     2     3
 7     3     1
 8     3     2
 9     3     1
10     3     1
11     4     1
12     4     2
13     4     3
14     4     4
15     4     1
16     5     1
How do I create a sub_id each time a new sequence begins within the group defined by variable A, i.e.,
tibble(
  A = c(1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5),
  B = c(1, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 3, 4, 1, 1),
  sub_id = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 3, 1, 1, 1, 1, 2, 1)
)
# A tibble: 16 x 3
       A     B sub_id
   <dbl> <dbl>  <dbl>
 1     1     1      1
 2     2     1      1
 3     2     2      1
 4     2     1      2
 5     2     2      2
 6     2     3      2
 7     3     1      1
 8     3     2      1
 9     3     1      2
10     3     1      3
11     4     1      1
12     4     2      1
13     4     3      1
14     4     4      1
15     4     1      2
16     5     1      1
Hopefully that's well defined. I suppose I'm after a kind of inverse of row_number().
Thanks in advance,
James.
Using base R:
df$sub_id <- with(df, ave(B == 1, A, FUN = cumsum))
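ave() applies FUN within each group of A and returns the result in the original row order; a minimal sketch of the pattern on a toy data frame:
with(data.frame(A = c(1, 1, 2, 2), B = c(1, 2, 1, 1)),
     ave(B == 1, A, FUN = cumsum))  # grouped cumulative count of sequence starts
# [1] 1 1 1 2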
You've got the "ingredients" already laid out:
(i) for each group of column A,
(ii) check whether a new sequence starts.
The following is based on {dplyr}. For demo purposes, I create an additional column/variable to show the "start condition"; you can combine this into one call.
I use the fact that summing over TRUE/FALSE counts TRUE as 1. If this is not evident to you, you can use as.numeric(B == 1) instead.
library(dplyr)
library(tibble)

# load example data
df <- tibble(
  A = c(1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5),
  B = c(1, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 3, 4, 1, 1),
  sub_id = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 3, 1, 1, 1, 1, 2, 1)
)

# perform group-wise operations
df %>%
  group_by(A) %>%
  mutate(
    # --------------- highlight start of new sequence --------------
    start = B == 1
    # --------------- create cumsum over TRUEs ---------------------
    , sub_id2 = cumsum(start)
  )
This yields what you're looking for:
# A tibble: 16 x 5
# Groups:   A [5]
       A     B sub_id start sub_id2
   <dbl> <dbl>  <dbl> <lgl>   <int>
 1     1     1      1 TRUE        1
 2     2     1      1 TRUE        1
 3     2     2      1 FALSE       1
 4     2     1      2 TRUE        2
 5     2     2      2 FALSE       2
 6     2     3      2 FALSE       2
 7     3     1      1 TRUE        1
 8     3     2      1 FALSE       1
 9     3     1      2 TRUE        2
10     3     1      3 TRUE        3
11     4     1      1 TRUE        1
12     4     2      1 FALSE       1
13     4     3      1 FALSE       1
14     4     4      1 FALSE       1
15     4     1      2 TRUE        2
16     5     1      1 TRUE        1
We could use group_by and cumsum:
library(dplyr)
df %>%
  group_by(A) %>%
  mutate(sub_id = cumsum(B == 1))
Output:
# A tibble: 16 x 3
# Groups:   A [5]
       A     B sub_id
   <dbl> <dbl>  <int>
 1     1     1      1
 2     2     1      1
 3     2     2      1
 4     2     1      2
 5     2     2      2
 6     2     3      2
 7     3     1      1
 8     3     2      1
 9     3     1      2
10     3     1      3
11     4     1      1
12     4     2      1
13     4     3      1
14     4     4      1
15     4     1      2
16     5     1      1
A data.table option:
> setDT(df)[, sub_id := cumsum(B == 1), A][]
     A B sub_id
 1:  1 1      1
 2:  2 1      1
 3:  2 2      1
 4:  2 1      2
 5:  2 2      2
 6:  2 3      2
 7:  3 1      1
 8:  3 2      1
 9:  3 1      2
10:  3 1      3
11:  4 1      1
12:  4 2      1
13:  4 3      1
14:  4 4      1
15:  4 1      2
16:  5 1      1

Filter rows according to their frequency

Let's say we have the following data:
library(tidyverse)
data <- tibble(
  V1 = c(1, 1, 1, 1, 2, 2, 1, 3),
  V2 = c(1, 1, 1, 2, 2, 2, 1, 3),
  V3 = c(1, 1, 1, 2, 2, 2, 3, 3),
  V4 = c(1, 1, 1, 2, 2, 2, 3, 3)
)
> data
# A tibble: 8 x 4
     V1    V2    V3    V4
  <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1  ## 1st occurrence
2     1     1     1     1  ## 2nd occurrence
3     1     1     1     1  ## 3rd occurrence
4     1     2     2     2  ## does not count, because it occurs only once in the data
5     2     2     2     2  ## 1st occurrence
6     2     2     2     2  ## 2nd occurrence
7     1     1     3     3  ## does not count, because it occurs only once in the data
8     3     3     3     3  ## does not count, because it occurs only once in the data
We want to keep only rows that occur at least as often as a given threshold; let's say the threshold is 2 in our example. Values of rows that don't reach the threshold are set to 0. The result table should therefore be:
> data_filtered
# A tibble: 8 x 4
     V1    V2    V3    V4
  <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1
2     1     1     1     1
3     1     1     1     1
4     0     0     0     0
5     2     2     2     2
6     2     2     2     2
7     0     0     0     0
8     0     0     0     0
Any suggestion is greatly appreciated.
An idea using dplyr:
library(dplyr)
data %>%
  group_by_all() %>%
  mutate(new = n()) %>%
  rowwise() %>%
  mutate_at(vars(-new), funs(replace(., new < 2, 0))) %>%
  select(-new) %>%
  ungroup()
which gives:
# A tibble: 8 x 4
     V1    V2    V3    V4
  <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1
2     1     1     1     1
3     1     1     1     1
4     0     0     0     0
5     2     2     2     2
6     2     2     2     2
7     0     0     0     0
8     0     0     0     0
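funs() is deprecated in current dplyr; a possible sketch of the same idea using add_count() and across() (assumes dplyr >= 1.0):
library(dplyr)

data %>%
  add_count(V1, V2, V3, V4, name = "n") %>%        # frequency of each distinct row
  mutate(across(-n, ~ if_else(n < 2, 0, .x))) %>%  # zero out rows below the threshold
  select(-n)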
I would go with data.table:
library(data.table)
data <- data.table(
  V1 = c(1, 1, 1, 1, 2, 2, 1, 3),
  V2 = c(1, 1, 1, 2, 2, 2, 1, 3),
  V3 = c(1, 1, 1, 2, 2, 2, 3, 3),
  V4 = c(1, 1, 1, 2, 2, 2, 3, 3)
)
# create a unique key per row (use a separator, e.g. collapse = "_", if values can have more than one digit)
data[, key := apply(data, 1, function(x) paste0(x, collapse = ""))]
# set the "key" (to be used later on)
setkey(data, key)
# create the frequency N and propagate the values to the initial table via merge
data <- merge(data, data[, .N, by = key])
So for the moment:
> data
    key V1 V2 V3 V4 N
1: 1111  1  1  1  1 3
2: 1111  1  1  1  1 3
3: 1111  1  1  1  1 3
4: 1133  1  1  3  3 1
5: 1222  1  2  2  2 1
6: 2222  2  2  2  2 2
7: 2222  2  2  2  2 2
8: 3333  3  3  3  3 1
data[, key := NULL]  # drop the key
You can now zero out entire rows based on N, via:
data[N < 2, c("V1", "V2", "V3", "V4") := 0]  # set all columns to 0 if N is below the threshold of 2
resulting in:
   V1 V2 V3 V4 N
1:  1  1  1  1 3
2:  1  1  1  1 3
3:  1  1  1  1 3
4:  0  0  0  0 1
5:  0  0  0  0 1
6:  2  2  2  2 2
7:  2  2  2  2 2
8:  0  0  0  0 1
Of course, you can now drop N via data[, N := NULL].
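A shorter data.table sketch that skips the pasted key by grouping on all value columns directly (this also preserves the original row order):
cols <- c("V1", "V2", "V3", "V4")
data[, N := .N, by = cols]  # frequency of each distinct row
data[N < 2, (cols) := 0]    # zero out rows below the threshold
data[, N := NULL]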

Create matrix of counts using two variables

I have two columns: a unique id column id and the day of travel day. My objective is to create a matrix of counts per id per day, including all days even if the count is zero.
> test
   id day
1   3   3
2   4   4
3   1   4
4   2   3
5   2   5
6   2   4
7   1   1
8   5   4
9   1   1
10  3   2
11  2   2
12  4   2
13  2   4
14  2   5
15  4   5
16  3   4
17  5   3
18  3   2
19  5   5
20  3   4
21  1   3
22  2   3
23  2   5
24  5   2
25  3   2
The output should be the following, where rows represent id and columns represent day:
> output
  1 2 3 4 5
1 2 0 1 1 0
2 0 1 2 2 3
3 0 3 1 2 0
4 0 1 0 1 1
5 0 1 1 1 1
I have tried the following with the reshape2 package:
output <- reshape2::dcast(test, day ~ id, sum)
but it throws the following error:
Error in unique.default(x) : unique() applies only to vectors
Why does this happen and what would the right solution be in dplyr or using base R? Any tips would be appreciated.
Here is the data:
> dput(test)
structure(list(id = c(3, 4, 1, 2, 2, 2, 1, 5, 1, 3, 2, 4, 2,
2, 4, 3, 5, 3, 5, 3, 1, 2, 2, 5, 3), day = c(3, 4, 4, 3, 5, 4,
1, 4, 1, 2, 2, 2, 4, 5, 5, 4, 3, 2, 5, 4, 3, 3, 5, 2, 2)), .Names = c("id",
"day"), row.names = c(NA, -25L), class = "data.frame")
It's easier to see what's going on with character variables:
id <- c('a', 'a', 'b', 'f', 'b', 'a')
day <- c('x', 'x', 'x', 'y', 'z', 'x')
test <- data.frame(id, day)
output <- as.data.frame.matrix(table(test))
This is the simplest way to do it: use the table() function, then convert the result to a data frame.
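Applied to the original numeric test data from the question, the same one-liner yields the desired matrix directly, zeros included:
as.data.frame.matrix(table(test$id, test$day))  # rows are id, columns are day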
ans <- tapply(test$id, test$day,
              function(x) {
                y <- table(x)   # counts of each id present on this day
                z <- rep(0, 5)  # slots for ids 1-5, zero-filled
                z[as.numeric(names(y))] <- y
                z
              })
do.call("cbind", ans)
     1 2 3 4 5
[1,] 2 0 1 1 0
[2,] 0 1 2 2 3
[3,] 0 3 1 2 0
[4,] 0 1 0 1 1
[5,] 0 1 1 1 1
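Since the question also asks about dplyr, here is a possible tidyverse sketch (assumes tidyr >= 1.1 for names_sort; if some day never occurred at all, you would need tidyr::complete() first):
library(dplyr)
library(tidyr)

test %>%
  count(id, day) %>%  # counts per id/day pair
  pivot_wider(names_from = day, values_from = n,
              values_fill = 0, names_sort = TRUE)  # spread days into columns, zero-fill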

how to remove incomplete data for longitudinal data in long format?

Here is a reproducible test dataset:
mydata <- structure(list(subject = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                         time = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
                         measure = c(10, 12, 8, 7, 0, 0, 5, 3, NA)),
                    .Names = c("subject", "time", "measure"),
                    row.names = 1:9, class = "data.frame")
mydata
  subject time measure
1       1    0      10
2       1    1      12
3       1    2       8
4       2    0       7
5       2    1       0
6       2    2       0
7       3    0       5
8       3    1       3
9       3    2      NA
I would like to remove all the rows where measure is NA and all the corresponding rows for the same subject. So in the example above that would yield:
  subject time measure
1       1    0      10
2       1    1      12
3       1    2       8
4       2    0       7
5       2    1       0
6       2    2       0
Is there an easy way to do this without reshaping to wide format first?
I don't think this needs reshaping or even ave. It is just a subsetting issue, if I understand your question right.
mydata[!with(mydata, subject %in% subject[is.na(measure)]), ]
#   subject time measure
# 1       1    0      10
# 2       1    1      12
# 3       1    2       8
# 4       2    0       7
# 5       2    1       0
# 6       2    2       0
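The condition is easier to read unpacked into two steps (a sketch with the same result):
bad <- mydata$subject[is.na(mydata$measure)]  # subjects with at least one NA measure
mydata[!(mydata$subject %in% bad), ]          # keep every row of the remaining subjects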
You could use:
mydata[with(mydata, as.logical(ave(measure, subject,
                                   FUN = function(x) ifelse(any(is.na(x)), 0, 1)))), ]
#   subject time measure
# 1       1    0      10
# 2       1    1      12
# 3       1    2       8
# 4       2    0       7
# 5       2    1       0
# 6       2    2       0
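For comparison, a dplyr sketch of the same subsetting (keep only subjects whose measure contains no NA):
library(dplyr)

mydata %>%
  group_by(subject) %>%
  filter(!any(is.na(measure))) %>%  # drop all rows of any subject with an NA
  ungroup()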