How to label clustered groups in a vector? - r

Data:
set.seed(112098)
op <- data.frame(id=1:100,cluster=rbinom(100,1,0.5))
id cluster
1 1
2 1
3 1
4 0
5 1
6 1
7 0
8 0
9 1
Intended:
id cluster groups
1 1 1
2 1 1
3 1 1
4 0 0
5 1 2
6 1 2
7 0 0
8 0 0
9 1 3
Essentially, each run of consecutive 1s forms a group, and rows with 0 get group 0. How could I add the groups column in R?

Here is one option using rleid from data.table
library(data.table)
setDT(op)[, groups := rleid(cluster)*(cluster)
][groups!=0, groups := as.integer(factor(groups))]
head(op, 9)
# id cluster groups
#1: 1 1 1
#2: 2 1 1
#3: 3 1 1
#4: 4 0 0
#5: 5 0 0
#6: 6 1 2
#7: 7 1 2
#8: 8 0 0
#9: 9 1 3
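For readers without data.table, the same labeling can be sketched in base R. This is a minimal illustration on the nine rows shown in the question, not the answer's method: it marks each 1 that starts a new run, keeps a running count of those starts, and zeroes out the rows where cluster is 0. The names cluster, starts, and groups are chosen here for illustration.

```r
# The nine example rows from the question
cluster <- c(1, 1, 1, 0, 1, 1, 0, 0, 1)

# TRUE where a 1 begins a new run (the previous value is not 1)
starts <- cluster == 1 & c(0, head(cluster, -1)) != 1

# Running count of run starts, set back to 0 on rows where cluster is 0
groups <- cumsum(starts) * cluster
groups
# [1] 1 1 1 0 2 2 0 0 3
```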

Related

Pasting a string of variables into a function is not working

I was looking at this question: Find how many times duplicated rows repeat in R data frame, which provides the following code:
library(plyr)
ddply(df,.(a,b),nrow)
However, I have a dataset with many variables, so I can't type them out like a,b in this case. I've tried using names(data) with the paste function, but it doesn't seem to work. I tried this:
var_names=paste(names(data),collapse=",")
ddply(data,.(paste(a)),nrow)
It instead gives this output:
However, if I manually type them out, I get the proper output:
What do I need to do differently here?
Instead of pasting the names together and evaluating the string, use count from dplyr, which accepts multiple columns through across and a select helper such as everything():
library(dplyr)
df %>%
count(across(everything()))
A reproducible example with mtcars dataset
data(mtcars)
df <- mtcars %>%
select(vs:carb)
count(df, across(everything()))
vs am gear carb n
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
Also, ddply accepts a character vector of column names directly, so there is no need to build a single string:
library(plyr)
ddply(df, names(df), nrow)
vs am gear carb V1
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
Or, if we are creating a single string from the names, paste the whole expression and then evaluate it (not recommended, as there are standard ways of dealing with this):
eval(parse(text = paste('ddply(df, .(', toString(names(df)), '), nrow)')))
vs am gear carb V1
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
You can use aggregate, grouping by all the columns and counting the length of each group.
aggregate(1:nrow(df)~., df, length)
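When only some columns should define the groups, their names can be held in a character vector and used to subset the data before counting. The sketch below is one base R way to do that, shown on mtcars; vars and the helper column n are names introduced here for illustration.

```r
data(mtcars)

# Column names supplied as a character vector rather than typed out
vars <- c("vs", "am")

# Add a column of ones and sum it within every combination of the chosen columns
counts <- aggregate(n ~ ., data = cbind(mtcars[vars], n = 1), FUN = sum)
counts
```

The result has one row per observed combination of vs and am, with n holding the number of matching rows.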

What is the R function for detecting successive differences in a data frame?

I use the following code in R and it works very well. More precisely, I compare each time cluster_id with the last cluster_ref to see when they differ 2 periods in a row (data is organized by fund_numbers). However, I would like to adapt it to 5 periods. But it is impossible to make it work. Do you have any idea how I can modify this code to solve my problem?
library(dplyr)
library(purrr)
get_output <- function(mon, ref){
# TRUE where the cluster is present and differs from the fund's last reference
exp <- !is.na(mon) & !map2_lgl(mon, last(ref), identical)
as.integer(exp & lag(exp, default = FALSE))
}
df %>%
arrange(Fund_number, rolling_window) %>%
group_by(Fund_number) %>%
mutate(Deviation = get_output(Cluster_id, Cluster_ref)) %>%
ungroup()
rolling_window Fund_number Cluster_id Cluster_ref Expected_output
1 1 10 10 0
2 1 10 10 0
3 1 8 9 0
4 1 8 8 0
5 1 7 7 0
6 1 8 8 0
7 1 8 NA 1
8 1 7 NA 1
9 1 7 10 1
10 1 10 10 0
1 2 NA NA 0
2 2 NA 3 0
3 2 3 3 0
4 2 2 5 0
5 2 2 NA 0
6 2 2 4 0
7 2 2 4 1
8 2 5 5 0
9 2 4 5 0
10 2 3 5 0
This is what I want.
So as you can see, the data is organized by fund_number. Then I look at the last cluster_ref for each fund (so every 8 rows) and compare it to each cluster_id for each fund. As soon as it is different at least 5 periods in a row I have 1 if not 0. So for each fund, I compare the 8th cluster_ref and the cluster_id of rows 1 to 8.
The code above makes this but with 2 time periods.
Thank you very much,
Vanie
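In data.table, we can apply rleid to the comparison between Cluster_id and the last Cluster_ref within each fund, then flag the rows from the fifth position onward within each run.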
library(data.table)
setDT(df)[, temp := rleid(last(Cluster_ref) != Cluster_id), Fund_number]
df[, output := +(seq_along(Cluster_ref) >= 5), .(Fund_number, temp)]
df[, temp := NULL]
df
# rolling_window Fund_number Cluster_id Cluster_ref Expected_output output
# 1: 1 1 10 10 0 0
# 2: 2 1 10 10 0 0
# 3: 3 1 8 9 0 0
# 4: 4 1 8 8 0 0
# 5: 5 1 7 7 0 0
# 6: 6 1 8 8 0 0
# 7: 7 1 8 NA 1 1
# 8: 8 1 7 NA 1 1
# 9: 9 1 7 10 1 1
#10: 10 1 10 10 0 0
#11: 1 2 NA NA 0 0
#12: 2 2 NA 3 0 0
#13: 3 2 3 3 0 0
#14: 4 2 2 5 0 0
#15: 5 2 2 NA 0 0
#16: 6 2 2 4 0 0
#17: 7 2 2 4 1 1
#18: 8 2 5 5 0 0
#19: 9 2 4 5 0 0
#20: 10 2 3 5 0 0
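Note that the run counter above flags the fifth and later rows of any run, including a run where Cluster_id matches the reference; in the sample data no matching run is that long, so the output agrees with the expected column. A base R sketch that makes the "differs" condition explicit and takes the number of periods as a parameter might look like the following. The helper name flag_runs and its argument n are hypothetical, and treating NA comparisons as "not different" is an assumption based on the expected output.

```r
# Flag positions that sit at or beyond the n-th row of a run of TRUE values
flag_runs <- function(differs, n = 5) {
  differs[is.na(differs)] <- FALSE   # assumption: NA counts as "not different"
  r <- rle(differs)
  pos <- sequence(r$lengths)         # position of each row within its run
  as.integer(differs & pos >= n)
}

# Fund 1 from the question; its last Cluster_ref is 10
id1 <- c(10, 10, 8, 8, 7, 8, 8, 7, 7, 10)
flag_runs(id1 != 10)
# [1] 0 0 0 0 0 0 1 1 1 0
```

Inside a grouped mutate or a data.table by clause, the same helper could be called on Cluster_id != last(Cluster_ref) for each fund.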

Ifelse statement across multiple rows

Looking to add a column based on the values of two columns, but over more than one row.
Example Dataset Code:
A = c(1,1,1,2,2,2,3,3,3,4,4)
B = c(1,2,3,1,2,3,1,2,3,1,2)
C = c(0,0,0,1,0,0,1,1,1,0,1)
data <- data.frame(A,B,C)
Dataset:
A B C
1 1 1 0
2 1 2 0
3 1 3 0
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 1
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 1
Ifelse statements:
What I am trying to achieve is "Create column D. If column C == 1 in any row where column A == x, column D = 1. Else column D = 0"
Desired Output:
A B C D
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 2 1 1 1
5 2 2 0 1
6 2 3 0 1
7 3 1 1 1
8 3 2 1 1
9 3 3 1 1
10 4 1 0 1
11 4 2 1 1
What I've done:
I've thought about it today but can't come up with a logical answer. I've tried looking at the data in long and wide formats, but nothing's jumped out.
Note:
In actual application the number of times x appears in column C is not equal (some contain one repeat in the dataset, others contain 20).
# check with any() whether any row in the group has C == 1
library(dplyr)
data %>% group_by(A) %>% mutate(D = as.numeric(any(C == 1)))
# the same with data.table; setDT() converts the data.frame by reference first
library(data.table)
setDT(data)[, D := as.numeric(any(C == 1)), by = .(A)]
# A B C D
#1 1 1 0 0
#2 1 2 0 0
#3 1 3 0 0
#4 2 1 1 1
#5 2 2 0 1
#6 2 3 0 1
#7 3 1 1 1
#8 3 2 1 1
#9 3 3 1 1
#10 4 1 0 1
#11 4 2 1 1
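With data.table and a single fixed x, the expression below sets D to C on rows where A differs from x and to 0 elsewhere. It does not propagate a 1 across the rows of a group, so it does not reproduce the desired output: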
library(data.table)
data <- data.table(data)
x=2
data[,D:=ifelse(!A==x,ifelse(C==1,1,0),0)]
data
We can use ave from base R
data$D <- with(data, as.integer(ave(C==1, A, FUN=any)))
data
# A B C D
#1 1 1 0 0
#2 1 2 0 0
#3 1 3 0 0
#4 2 1 1 1
#5 2 2 0 1
#6 2 3 0 1
#7 3 1 1 1
#8 3 2 1 1
#9 3 3 1 1
#10 4 1 0 1
#11 4 2 1 1
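Because the rule only asks whether a value of A has C == 1 anywhere, it can also be written without explicit grouping. The following is a minimal base R sketch using %in% on the question's vectors; it is offered as an alternative illustration rather than as any answerer's method.

```r
A <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4)
C <- c(0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1)

# A[C == 1] lists the A values that have at least one row with C == 1;
# every row whose A is in that set gets D = 1
D <- as.integer(A %in% A[C == 1])
D
# [1] 0 0 0 1 1 1 1 1 1 1 1
```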

Conditional Series Fill in R

Looking for a way to fill in a vector with new values conditional on values within that vector and another variable in the data frame. Pasted an example of what the data looks like below.
PrsVar= c(rep(1,10),rep(2,7),rep(3,11))
IndVar = c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0)
OutVar = c(1,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,1,1,1,1,2,2,2,2,3,3,3)
exampdata <- cbind(PrsVar,IndVar,OutVar)
exampdata <- as.data.frame(exampdata)
> exampdata
PrsVar IndVar OutVar
1 1 0 1
2 1 0 1
3 1 0 1
4 1 1 1
5 1 0 2
6 1 0 2
7 1 1 2
8 1 0 3
9 1 0 3
10 1 0 3
11 2 0 1
12 2 0 1
13 2 0 1
14 2 1 1
15 2 0 2
16 2 0 2
17 2 1 2
18 3 0 1
19 3 0 1
20 3 0 1
21 3 1 1
22 3 0 2
23 3 0 2
24 3 0 2
25 3 1 2
26 3 0 3
27 3 0 3
28 3 0 3
This is time-series data and each row represents a person-day. PrsVar is an ID for an individual in the study and IndVar is an indicator that an episode has ended on that person-day. The person-day after that represents a new episode.
I'd like to create a variable that looks like OutVar using just the values from PrsVar and IndVar. This new variable OutVar labels the episode each person-day is in, incrementing by 1, and starting over at 1 for each new individual.
I could run this through a loop, but I need more efficient code to work with 3,000,000+ rows of data. Was trying to use something in dplyr or maybe mapply, but I'm stumped. Thinking a solution to this would be helpful to others and would certainly be helpful to me again in the near future.
The data.table package offers a fast and concise way to do this. The new column is added by reference, so the table is not copied, and millions of rows should not be an issue (probably under a minute).
library(data.table)
patient.data <- data.table(PrsVar = c(rep(1,10), rep(2,7), rep(3,11)),
IndVar = c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0),
OutVar = c(1,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,1,1,1,1,2,2,2,2,3,3,3))
Build an episode counter EpVar as 1 plus the cumulative sum of IndVar. On its own this increases the counter on the record where IndVar is 1, which is one row too early, so shift it down one record with shift, filling the resulting missing first value with the starting counter (1). The by argument applies this within each PrsVar.
patient.data[ , EpVar:=shift(1+cumsum(IndVar), fill=1), by=PrsVar]
patient.data
PrsVar IndVar OutVar EpVar
1: 1 0 1 1
2: 1 0 1 1
3: 1 0 1 1
4: 1 1 1 1
5: 1 0 2 2
6: 1 0 2 2
7: 1 1 2 2
8: 1 0 3 3
9: 1 0 3 3
10: 1 0 3 3
11: 2 0 1 1
12: 2 0 1 1
13: 2 0 1 1
14: 2 1 1 1
15: 2 0 2 2
16: 2 0 2 2
17: 2 1 2 2
18: 3 0 1 1
19: 3 0 1 1
20: 3 0 1 1
21: 3 1 1 1
22: 3 0 2 2
23: 3 0 2 2
24: 3 0 2 2
25: 3 1 2 2
26: 3 0 3 3
27: 3 0 3 3
28: 3 0 3 3
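The shift-then-accumulate logic described above can also be sketched in base R with ave, which applies a function within each PrsVar and returns the result in the original row order. This is an illustrative equivalent under the same data, not the answer's code; the result name EpVar_base is introduced here.

```r
PrsVar <- c(rep(1, 10), rep(2, 7), rep(3, 11))
IndVar <- c(0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0)

# Within each person: shift IndVar down one row (padding with 0),
# accumulate it, and start the counter at 1
EpVar_base <- ave(IndVar, PrsVar,
                  FUN = function(x) cumsum(c(0, head(x, -1))) + 1)
EpVar_base
# [1] 1 1 1 1 2 2 2 3 3 3 1 1 1 1 2 2 2 1 1 1 1 2 2 2 2 3 3 3
```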
A bit ugly, but this logic should be easily adaptable to other methods:
with(exampdata,
ave(IndVar, PrsVar, FUN=function(x) {
out <- rev(cumsum(rev(x)))
max(out) - out + 1
})
)
# [1] 1 1 1 1 2 2 2 3 3 3 1 1 1 1 2 2 2 1 1 1 1 2 2 2 2 3 3 3

Sum rows in a group, starting when a specific value occurs

I want to accumulate the values of a column till the end of the group, though starting the addition when a specific value occurs in another column. I am only interested in the first instance of the specific value within a group. So if that value occurs again within the group, the addition column should continue to add the values. I know this sounds like a rather strange problem, so hopefully the example table makes sense.
The following data frame is what I have now:
> df = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0))
> df
group numToAdd occurs
1 1 1 0
2 1 1 0
3 1 3 1
4 1 2 0
5 2 4 0
6 2 2 1
7 2 1 0
8 2 3 0
9 2 2 0
10 3 1 0
11 3 2 1
12 3 1 1
13 4 2 0
14 4 3 0
15 4 2 0
Thus, whenever a 1 occurs within a group, I want a cumulative sum of the values from the column numToAdd, until a new group starts. This would look like the following:
> finalDF = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0),added = c(0,0,3,5,0,2,3,6,8,0,2,3,0,0,0))
> finalDF
group numToAdd occurs added
1 1 1 0 0
2 1 1 0 0
3 1 3 1 3
4 1 2 0 5
5 2 4 0 0
6 2 2 1 2
7 2 1 0 3
8 2 3 0 6
9 2 2 0 8
10 3 1 0 0
11 3 2 1 2
12 3 1 1 3
13 4 2 0 0
14 4 3 0 0
15 4 2 0 0
Thus, the added column is 0 until a 1 occurs within the group, then accumulates the values from numToAdd until it moves to a new group, turning the added column back to 0. In group three, a value of 1 is found a second time, yet the cumulated sum continues. Additionally, in group 4, a value of 1 is never found, thus the value within the added column remains 0.
I've played around with dplyr, but can't get it to work. The following solution only outputs the total sum, and not the increasing cumulated number at each row.
library(dplyr)
df =
df %>%
mutate(added=ifelse(occurs == 1,cumsum(numToAdd),0)) %>%
group_by(group)
Try
df %>%
group_by(group) %>%
mutate(added= cumsum(numToAdd*cummax(occurs)))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
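To see why the cummax trick works: within a group, cummax(occurs) stays 0 until the first 1 and is 1 from then on, so multiplying by it zeroes out numToAdd before the first occurrence. A base R sketch of the same idea with ave is shown below; the intermediate name seen is introduced here for illustration.

```r
df <- data.frame(
  group    = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),
  numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),
  occurs   = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0)
)

# 0 before the first occurrence in each group, 1 from that row onward
seen <- ave(df$occurs, df$group, FUN = cummax)

# Accumulate only the values at or after the first occurrence
df$added <- ave(df$numToAdd * seen, df$group, FUN = cumsum)
df$added
# [1] 0 0 3 5 0 2 3 6 8 0 2 3 0 0 0
```

Since ave returns values in the original row order, this version does not depend on the rows being sorted by group.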
Or using data.table
library(data.table) # v1.9.5+
i1 <- setDT(df)[, .I[(rleid(occurs) + (occurs > 0)) > 1], group]$V1
df[, added:=0][i1, added:=cumsum(numToAdd), by = group]
Or a similar option as in dplyr
setDT(df)[,added := cumsum(numToAdd * cummax(occurs)) , by = group]
You can use split-apply-combine in base R with something like:
df$added <- unlist(lapply(split(df, df$group), function(x) {
y <- rep(0, nrow(x))
pos <- cumsum(x$occurs) > 0
y[pos] <- cumsum(x$numToAdd[pos])
y
}))
df
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
To add another base R approach:
df$added <- unlist(lapply(split(df, df$group), function(x) {
c(x[,'occurs'][cumsum(x[,'occurs']) == 0L],
cumsum(x[,'numToAdd'][cumsum(x[,'occurs']) != 0L]))
}))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
Another base R:
df$added <- unlist(lapply(split(df,df$group),function(x){
cumsum((cumsum(x$occurs) > 0) * x$numToAdd)
}))
