How to group dataframe by Conditioning in R - r

I want to group a data frame by using a condition which when applied gives the ouptut "True False False False False True False False True False ..." and so on. I want group in a way which gives me the rows from "True" to the last "False" i.e. for the mentioned output I´d get 3 groups (the rows which give) "True False False False False", "True False False " and "True False"
I hope this makes sense lol

We could do this by counting the cumulative number of TRUE's and using that as our group.
a <- data.frame(log = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE))
a %>%
mutate(group = cumsum(log))
or in base R:
a$group = cumsum(a$log)
Result
log group
1 TRUE 1
2 FALSE 1
3 FALSE 1
4 FALSE 1
5 FALSE 1
6 TRUE 2
7 FALSE 2
8 FALSE 2
9 TRUE 3
10 FALSE 3

Related

How create new column an add column names by selected row in r

a<-c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE)
b<-c(TRUE,FALSE,TRUE,FALSE,FALSE,FALSE)
c<-c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)
costumer<-c("one","two","three","four","five","six")
df<-data.frame(costumer,a,b,c)
That's an example code. It looks like this printed:
costumer a b c
1 one TRUE TRUE TRUE
2 two FALSE FALSE TRUE
3 three TRUE TRUE TRUE
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE
6 six FALSE FALSE FALSE
I want to create a new column df$items that contains only the column names that are TRUE for each row in the data. Something like this:
costumer a b c items
1 one TRUE TRUE TRUE a,b,c
2 two FALSE FALSE TRUE c
3 three TRUE TRUE TRUE a,b,c
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE
6 six FALSE FALSE FALSE
I thought of using apply function or use which for selecting indexes, but couldn't figure it out. Can anyone help me?
df$items <- apply(df, 1, function(x) paste0(names(df)[x == TRUE], collapse = ","))
df
custumer a b c items
1 one TRUE TRUE TRUE a,b,c
2 two FALSE FALSE TRUE c
3 three TRUE TRUE TRUE a,b,c
4 four FALSE FALSE FALSE
5 five TRUE FALSE TRUE a,c
6 six FALSE FALSE FALSE
df$items = apply(df[2:4], 1, function(x) toString(names(df[2:4])[x]))
df
# custumer a b c items
# 1 one TRUE TRUE TRUE a, b, c
# 2 two FALSE FALSE TRUE c
# 3 three TRUE TRUE TRUE a, b, c
# 4 four FALSE FALSE FALSE
# 5 five TRUE FALSE TRUE a, c
# 6 six FALSE FALSE FALSE
You could use
df$items <- apply(df, 1, function(x) toString(names(df)[which(x == TRUE)]))
Output
# custumer a b c items
# 1 one TRUE TRUE TRUE a, b, c
# 2 two FALSE FALSE TRUE c
# 3 three TRUE TRUE TRUE a, b, c
# 4 four FALSE FALSE FALSE
# 5 five TRUE FALSE TRUE a, c
# 6 six FALSE FALSE FALSE
We can use pivot_longer to reshape to 'long' format and then do a group by paste
library(dplyr)
library(tidyr)
library(stringr)
df %>%
pivot_longer(cols = a:c) %>%
group_by(costumer) %>%
summarise(items = toString(name[value])) %>%
left_join(df)

finding all possible subsets of a dataframe

I am looking for a function that takes a column of a data.frame as the reference and finds all subsets with respect to the other variable levels. For example, let z be data frame with 4 columns a,b,c,d, each column has 2 levels for instance. let a be the reference. Then z would be like
z$a : TRUE FALSE
z$b : TRUE FALSE
z$c : TRUE FALSE
z$d : TRUE FALSE
Then what I need is a LIST that the elements are combination names such as
aTRUEbTRUEcTRUEdTR UE :subset of the dataframe
aTRUEbFALSEcTRUEdTRUE : subset
...
Here is an example,
set.seed(123)
z=matrix(sample(c(TRUE,FALSE),size = 100,replace = TRUE),ncol=4)
colnames(z) = letters[1:4]
z=as.data.frame(z)
output= list(
'bTUEcTRUEdFALSE' = subset(z,b==TRUE & c==TRUE & d==FALSE),
'bTRUEcTRUEdTRUE' = subset(z,b==TRUE & c==TRUE & d==TRUE),
'bTRUEcFALSEdFALSE' = subset(z,b==TRUE & c==FALSE & d==FALSE),
'bTRUEcFALSEdTRUE' = subset(z,b==TRUE & c==FALSE & d==TRUE)
# and so on ...
)
output
$bTUEcTRUEdFALSE
a b c d
13 FALSE TRUE TRUE FALSE
14 FALSE TRUE TRUE FALSE
$bTRUEcTRUEdTRUE
a b c d
4 FALSE TRUE TRUE TRUE
10 TRUE TRUE TRUE TRUE
16 FALSE TRUE TRUE TRUE
20 FALSE TRUE TRUE TRUE
24 FALSE TRUE TRUE TRUE
$bTRUEcFALSEdFALSE
a b c d
17 TRUE TRUE FALSE FALSE
19 TRUE TRUE FALSE FALSE
22 FALSE TRUE FALSE FALSE
$bTRUEcFALSEdTRUE
a b c d
5 FALSE TRUE FALSE TRUE
11 FALSE TRUE FALSE TRUE
15 TRUE TRUE FALSE TRUE
18 TRUE TRUE FALSE TRUE
21 FALSE TRUE FALSE TRUE
23 FALSE TRUE FALSE TRUE
However, there is an issue with the example. firstly, I do not know the number of variables (in this case 4 (a to d). Secondly, the name of the variables must be caught from the data (simple speaking, I cannot use subset since I do not know the variable name in the condition (a== can be anything==)
What is the most efficient way of doing this in R?
You can use split and paste like so:
split(z, paste(z$b, z$c, z$d))
But the tricky part of your question is how to programmatically combine the variables in columns 2:end without knowing beforehand the number of columns, their names or values. We can use a function like below to paste the values by row in columns 2:end
apply(df, 1, function(i) paste(i[-1], collapse=""))
Now combine with split
split(z, apply(z, 1, function(i) paste(i[-1], collapse="")))

More efficient ways to use R than 'for' loops

I'm a relative newcomer to R so I'm sorry if there's an obvious answer to this. I've looked at other questions and I think 'apply' is the answer but I can't work out how to use it in this case.
I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they die. I need to identify which participants have taken part for a consistent 'streak' since from the start of the survey (i.e. if they stop, they stop for good).
I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and the loop is very slow. Is there a faster approach I could use?
In the example, TRUE means they participated in that year. The loop creates two vectors - 'finalyear' for the last year they took part, and 'streak' to show if they completed all years before the finalyear (i.e. cases 1, 3 and 5).
dat <- data.frame(ids = 1:5, "1999" = c(T, T, T, F, T), "2000" = c(T, F, T, F, T), "2001" = c(T, T, T, T, T), "2002" = c(F, T, T, T, T), "2003" = c(F, T, T, T, F))
finalyear <- NULL
streak <- NULL
for (i in 1:nrow(dat)) {
x <- as.numeric(dat[i,2:6])
y <- max(grep(1, x))
finalyear[i] <- y
streak[i] <- sum(x) == y
}
dat$finalyear <- finalyear
dat$streak <- streak
Thanks!
We could use max.col and rowSums as a vectorized approach.
dat$finalyear <- max.col(dat[-1], 'last')
If there are rows without TRUE values, we can make sure to return 0 for that row by multiplying with the double negation of rowSums. The FALSE will be coerced to 0 and multiplying with 0 returns 0 for that row.
dat$finalyear <- max.col(dat[-1], 'last')*!!rowSums(dat[-1])
Then, we create the 'streak' column by comparing the rowSums of columns 2:6 with that of 'finalyear'
dat$streak <- rowSums(dat[,2:6])==dat$finalyear
dat
# ids X1999 X2000 X2001 X2002 X2003 finalyear streak
#1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
#2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
#3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
#4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
#5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
Or a one-line code (it could fit in one-line, but decided to make it obvious by 2-lines ) suggested by #ColonelBeauvel
library(dplyr)
mutate(dat, finalyear=max.col(dat[-1], 'last'),
streak=rowSums(dat[-1])==finalyear)
For-loops are not inherently bad in R, but they are slow if you grow vectors iteratively (like you are doing). There are often better ways to do things. Example of a solution with only apply-functions:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:7],MARGIN=1,function(x){sum(x[1:5])==x[6]})
Or option 2, based on comment by #Spacedman:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))==sum(x)})
> dat
ids X1999 X2000 X2001 X2002 X2003 finalyear streak
1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
Here is a solution with dplyr and tidyr.
gather(data = dat,year,value,-ids) %>%
mutate(year=as.integer(gsub("X","",year))) %>%
group_by(ids) %>%
summarize(finalyear=last(year[value]),
streak=!any(value[first(year):finalyear] == FALSE))
output
ids finalyear streak
1 1 2001 TRUE
2 2 2003 FALSE
3 3 2003 TRUE
4 4 2003 FALSE
5 5 2002 TRUE
Here's a base version using apply to loop over rows and rle to see how often the state changes. Your condition seems to be equivalent to the state starting as TRUE and only ever changing to FALSE at most once, so I test the rle as being shorter than 3 and the first value being TRUE:
> dat$streak = apply(dat[,2:6],1,function(r){r[1] & length(rle(r)$length)<=2})
>
> dat
ids X1999 X2000 X2001 X2002 X2003 streak
1 1 TRUE TRUE TRUE FALSE FALSE TRUE
2 2 TRUE FALSE TRUE TRUE TRUE FALSE
3 3 TRUE TRUE TRUE TRUE TRUE TRUE
4 4 FALSE FALSE TRUE TRUE TRUE FALSE
5 5 TRUE TRUE TRUE TRUE FALSE TRUE
There's probably loads of ways of working out finalyear, this just finds the last element of each row which is TRUE:
> dat$finalyear = apply(dat[,2:6], 1, function(r){max(which(r))})
> dat
ids X1999 X2000 X2001 X2002 X2003 streak finalyear
1 1 TRUE TRUE TRUE FALSE FALSE TRUE 3
2 2 TRUE FALSE TRUE TRUE TRUE FALSE 5
3 3 TRUE TRUE TRUE TRUE TRUE TRUE 5
4 4 FALSE FALSE TRUE TRUE TRUE FALSE 5
5 5 TRUE TRUE TRUE TRUE FALSE TRUE 4

Counting Falses before Trues in R

I'm trying to use R to find the average number of attempts before a success in a dataframe with 300,000+ rows. Data is structured as below.
EventID SubjectID ActionID Success DateUpdated
a b c TRUE 2014-06-21 20:20:08.575032+00
b a c FALSE 2014-06-20 02:58:40.70699+00
I'm still learning my way through R. It looks like I can use ddply to separate the frame out based on Subject and Action (I want to see how many times a given subject tries an action before achieving a success), but I can't figure out how to write the formula I need to apply.
library(data.table)
# example data
dt = data.table(group = c(1,1,1,1,1,2,2), success = c(F,F,T,F,T,F,T))
# group success
#1: 1 FALSE
#2: 1 FALSE
#3: 1 TRUE
#4: 1 FALSE
#5: 1 TRUE
#6: 2 FALSE
#7: 2 TRUE
dt[, which(success)[1] - 1, by = group]
# group V1
#1: 1 2
#2: 2 1
Replace group with list(subject, action) or whatever is appropriate for your data (after converting it to data.table from data.frame).
To follow up on Tarehman's suggestion, since I like rle,
foo <- rle(data$Success)
mean(foo$lengths[foo$values==FALSE])
This might be an answer to a totally different question, but does this get close to what you want?
tfs <- sample(c(FALSE,TRUE),size = 50, replace = TRUE, prob = c(0.8,0.2))
tfs_sums <- cumsum(!tfs)
repsums <- tfs_sums[duplicated(tfs_sums)]
mean(repsums - c(0,repsums[-length(repsums)]))
tfs
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[20] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
[39] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
repsums
1 6 8 9 20 20 20 20 24 26 31 36
repsums - c(0,repsums[-length(repsums)])
1 5 2 1 11 0 0 0 4 2 5 5
The last vector shown is the length of each continuous "run" of FALSE values in the vector tfs
you could use data.table work around to get what you need as follows:
library (data.table)
df=data.frame(EventID=c("a","b","c","d"),SubjectID=c("b","a","a","a"),ActionID=c("c","c","c","c"),Success=c(TRUE,FALSE,FALSE,TRUE))
dt=data.table(df)
dt[ , Index := 1:.N , by = c("SubjectID" , "ActionID","Success") ]
Now this Index column will hold the number that you need for each subject/action consecutive experiments. You need to aggregate to get that number (max number)
result=stats:::aggregate.formula(Index~(SubjectID+ActionID),data=dt,FUN= function(x) max(x))
so this will give you the max index and it is the number of the falses before you hit a true. Note that you might need to do further processing to filter out subjects that has never had a true

Meeting every logical condition in a dataframe

I have a data.frame with 6,000 obs
SubjectID : int 1,2,3,4...
Arthritis : logi FALSE FALSE TRUE FALSE FALSE
Stroke : logi TRUE FALSE FALSE FALSE FALSE
Diabetes : logi TRUE FALSE FALSE FALSE FALSE
Cancer : logi FALSE FALSE FALSE FALSE TRUE
I am trying to find rows where every disease is absent. I can do it for a single disease with this:
subset(RHV.FINAL, Arthritis=="FALSE")
I have tried this for all diseases, which works, but is cumbersome:
subset(RHV.FINAL, Arthritis=="FALSE" & Stroke=="FALSE" & Diabetes=="FALSE" & Cancer=="FALSE")
Is there a more eloquent solution?
Can you not use rowSums? It's abit hard to tell with the str of your data as you have posted it. A small example to recreate in an R session would be good (dput).
df [rowSums( df ) == 0 , ]
For example...
set.seed(123)
df <- data.frame( id = 1:5,
A = sample( c(T,F) , 5 , repl = T ),
B = sample( c(T,F) , 5 , repl = T ),
C = sample( c(T,F) , 5 , repl = T ))
id A B C
1 1 TRUE TRUE FALSE
2 2 FALSE FALSE TRUE
3 3 TRUE FALSE FALSE
4 4 FALSE FALSE FALSE
5 5 FALSE TRUE TRUE
# df[,-1] to exclude id variable in first column (thanks #DidzisElferts)
df[ rowSums(df[,-1]) == 0 , ]
id A B C
4 4 FALSE FALSE FALSE

Resources