How to delete certain condition from data frame - r

Let's say this is my df:
people <- c(1,1,1,2,2,3,3,4,4,5,5)
activity <- c(1,1,1,2,2,3,4,5,5,6,6)
completion <- c(0,0,1,0,1,1,1,0,0,0,1)
And I would like to remove all people that never completed any activity.
I have tried this code, but somehow it does not work. I have no idea what could be wrong here.
nevercompleted<- df %>%
filter(completion != 0) %>%
group_by(people) %>%
summarise("frequency activity" = n())
df<- -c (df$nevercompleted)
So, in this scenario person 4 should be removed from the df. Note that I am only interested in removing those who never completed anything, like person 4, not person 1, who at one point completes an activity.

1. Base R
In base R, the following can easily be rewritten as a one-liner.
i <- ave(as.logical(df$completion), df$people, FUN = function(x) any(x != 0, na.rm = TRUE))
df <- df[which(i), ]
df
# people activity completion
#1 1 1 0
#2 1 1 0
#3 1 1 1
#4 2 2 0
#5 2 2 1
#6 3 3 1
#7 3 4 1
#10 5 6 0
#11 5 6 1
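As a minimal sketch of the one-liner mentioned above (assuming the same example data, and that completion contains no missing values), the grouped test and the subsetting can be combined into a single indexing expression. The name kept is only illustrative.

```r
# Rebuild the example data from the question
people <- c(1,1,1,2,2,3,3,4,4,5,5)
activity <- c(1,1,1,2,2,3,4,5,5,6,6)
completion <- c(0,0,1,0,1,1,1,0,0,0,1)
df <- data.frame(people, activity, completion)

# ave() applies any() within each person and recycles the result
# to every row of that person, giving a logical row index
kept <- df[ave(df$completion != 0, df$people, FUN = any), ]

unique(kept$people)
#[1] 1 2 3 5
```

If completion could contain NA, the explicit function with na.rm = TRUE from the answer is the safer form.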
2. Package dplyr
And here is a dplyr way.
First filter only people that have completed an activity, then join with the original data set in order to get all columns.
df <- df %>%
group_by(people) %>%
summarise(completion = any(as.logical(completion))) %>%
filter(completion) %>%
select(-completion) %>%
left_join(df, by = 'people')
df
#`summarise()` ungrouping output (override with `.groups` argument)
## A tibble: 9 x 3
# people activity completion
# <dbl> <dbl> <dbl>
#1 1 1 0
#2 1 1 0
#3 1 1 1
#4 2 2 0
#5 2 2 1
#6 3 3 1
#7 3 4 1
#8 5 6 0
#9 5 6 1
Data
In the question there is no data.frame instruction, only the creation of the column vectors.
people <- c(1,1,1,2,2,3,3,4,4,5,5)
activity <- c(1,1,1,2,2,3,4,5,5,6,6)
completion <- c(0,0,1,0,1,1,1,0,0,0,1)
df <- data.frame(people, activity, completion)

In base R we could do this:
byGroup <- split(df,df$people)
do.call(rbind,byGroup[sapply(byGroup, function(x) !all(x$completion == 0))])
people activity completion
1.1 1 1 0
1.2 1 1 0
1.3 1 1 1
2.4 2 2 0
2.5 2 2 1
3.6 3 3 1
3.7 3 4 1
5.10 5 6 0
5.11 5 6 1

This can be done as follows:
library(tidyverse)
df <- tibble(people, activity, completion)
df %>%
group_by(people) %>%
filter(any(completion != 0))
# A tibble: 9 x 3
# Groups: people [4]
people activity completion
<dbl> <dbl> <dbl>
1 1 1 0
2 1 1 0
3 1 1 1
4 2 2 0
5 2 2 1
6 3 3 1
7 3 4 1
8 5 6 0
9 5 6 1

Here's the code that should work:
library(dplyr)
people <- c(1,1,1,2,2,3,3,4,4,5,5)
activity <- c(1,1,1,2,2,3,4,5,5,6,6)
completion <- c(0,0,1,0,1,1,1,0,0,0,1)
df <- data.frame(people, activity, completion)
df <- filter(df, completion != 0)
Result:
people activity completion
1 1 1 1
2 2 2 1
3 3 3 1
4 3 4 1
5 5 6 1
This will filter your dataframe to rows whose completion variable is not 0.
I'm not sure where you were going with the group_by and summarize. If you want to do more than remove the rows whose completion variable is 0, please clarify that in your question.
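If the goal is instead to drop whole people rather than individual rows, as the question describes, one possible sketch in base R is to collect the ids that survive the row filter and then subset the original data by those ids. The names keep_ids and kept are hypothetical and used only for illustration.

```r
# Rebuild the example data from the question
people <- c(1,1,1,2,2,3,3,4,4,5,5)
activity <- c(1,1,1,2,2,3,4,5,5,6,6)
completion <- c(0,0,1,0,1,1,1,0,0,0,1)
df <- data.frame(people, activity, completion)

# ids of people with at least one completed activity
keep_ids <- unique(df$people[df$completion != 0])

# keep every row (completed or not) that belongs to one of those people
kept <- df[df$people %in% keep_ids, ]

nrow(kept)
#[1] 9
```

This keeps the rows with completion equal to 0 for people 1, 2 and 5, and removes only person 4.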

Related

In R: Subset observations that have values, 0, 1, and 2 by group

I have the following data:
companyID status
1 1
1 1
1 0
1 2
2 1
2 1
2 1
3 1
3 0
3 2
3 2
3 2
And would like to subset those observations (by companyID) where status has 0, 1, and 2 across the group (companyID). My preferred outcome would look like the following:
companyID status
1 1
1 1
1 0
1 2
3 1
3 0
3 2
3 2
3 2
Thank you in advance for any help!!
You can select groups where all the values from 0-2 are present in the group.
library(dplyr)
df %>% group_by(companyID) %>% filter(all(0:2 %in% status))
# companyID status
# <int> <int>
#1 1 1
#2 1 1
#3 1 0
#4 1 2
#5 3 1
#6 3 0
#7 3 2
#8 3 2
#9 3 2
In base R and data.table :
#Base R :
subset(df, as.logical(ave(status, companyID, FUN = function(x) all(0:2 %in% x))))
#data.table
library(data.table)
setDT(df)[, .SD[all(0:2 %in% status)], companyID]
We can use
library(dplyr)
df %>%
group_by(companyID) %>%
filter(sum(0:2 %in% status) == 3)
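The question shows its data only as a printed table, so the answers above cannot be run as-is. The following is a minimal sketch that rebuilds that table as a data frame (assuming both columns are numeric) and applies the same all(0:2 %in% ...) test per group in base R. The names has_all and result are illustrative.

```r
# Data typed in from the table in the question
df <- data.frame(
  companyID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  status    = c(1, 1, 0, 2, 1, 1, 1, 1, 0, 2, 2, 2)
)

# For each company, test whether 0, 1 and 2 all occur in status;
# ave() recycles the per-group answer (stored as 1/0) to each row
has_all <- ave(df$status, df$companyID, FUN = function(x) all(0:2 %in% x))

result <- df[as.logical(has_all), ]
unique(result$companyID)
#[1] 1 3
```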

If 1 appears, all subsequent elements of the variable must be 1, grouped by subject

I want to go from:
test <- data.frame(subject=c(rep(1,10),rep(2,10)),x=1:10,y=0:1)
Something like that:
As I wrote in the title, when the first 1 appears all subsequent values of "y" for a given "subject" must change to 1, then the same for the next "subject"
I tried something like that:
test <- test%>%
group_nest(subject) %>%
mutate(XD = map(data,function(x){
ifelse(x$y[which(grepl(1, x$y))[1]:nrow(x)]==TRUE , 1,0)})) %>% unnest(cols = c(data,XD))
It didn't work :(
Try this:
library(dplyr)
#Code
new <- test %>%
group_by(subject) %>%
mutate(y=ifelse(row_number()<min(which(y==1)),y,1))
Output:
# A tibble: 20 x 3
# Groups: subject [2]
subject x y
<dbl> <int> <dbl>
1 1 1 0
2 1 2 1
3 1 3 1
4 1 4 1
5 1 5 1
6 1 6 1
7 1 7 1
8 1 8 1
9 1 9 1
10 1 10 1
11 2 1 0
12 2 2 1
13 2 3 1
14 2 4 1
15 2 5 1
16 2 6 1
17 2 7 1
18 2 8 1
19 2 9 1
20 2 10 1
Since you appear to just have 0's and 1's, a straightforward approach would be to take a cumulative maximum via the cummax function:
library(dplyr)
test %>%
group_by(subject) %>%
mutate(y = cummax(y))
Duck's answer above is considerably more robust if you have a range of values that may appear before or after the first 1.
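The cumulative-maximum idea does not depend on dplyr. As a rough sketch, assuming y only ever holds 0 and 1 as in the example, the same result can be obtained in base R with ave():

```r
# Example data from the question: y alternates 0, 1, 0, 1, ...
test <- data.frame(subject = c(rep(1, 10), rep(2, 10)), x = 1:10, y = 0:1)

# Within each subject, once a 1 has been seen the running maximum stays at 1
test$y <- ave(test$y, test$subject, FUN = cummax)

test$y[test$subject == 1]
#[1] 0 1 1 1 1 1 1 1 1 1
```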

Subsetting data based on a value within ids in r

I'm trying to subset a dataset based on two criteria. Here is a snapshot of my data:
ids <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3)
seq <- c(1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6)
type <- c(1,1,5,1,1,1, 1,1,1,8,1,1, 1,1,1,1,1,1)
data <- data.frame(ids, seq, type)
ids seq type
1 1 1 1
2 1 2 1
3 1 3 5
4 1 4 1
5 1 5 1
6 1 6 1
7 2 1 1
8 2 2 1
9 2 3 1
10 2 4 8
11 2 5 1
12 2 6 1
13 3 1 1
14 3 2 1
15 3 3 1
16 3 4 1
17 3 5 1
18 3 6 1
ids is the student id, seq is the sequence of the questions (items) students take, and type refers to the type of the question: 1 is a simple item, 5 or 8 is a complicated item. First, I would like to generate a variable (complex) indicating whether or not a student has a complicated item (type = 5 or 8). I would then like to get:
> data
ids seq type complex
1 1 1 1 1
2 1 2 1 1
3 1 3 5 1
4 1 4 1 1
5 1 5 1 1
6 1 6 1 1
7 2 1 1 1
8 2 2 1 1
9 2 3 1 1
10 2 4 8 1
11 2 5 1 1
12 2 6 1 1
13 3 1 1 0
14 3 2 1 0
15 3 3 1 0
16 3 4 1 0
17 3 5 1 0
18 3 6 1 0
The second step is to split data within students.
(a) For the student who has non-complex items (complex=0), I would like to split the dataset from half point and get this below:
>simple.split.1
ids seq type complex
13 3 1 1 0
14 3 2 1 0
15 3 3 1 0
>simple.split.2
ids seq type complex
16 3 4 1 0
17 3 5 1 0
18 3 6 1 0
(b) for the students who have complex items (complex=1), I would like to set the complex item as a cutting point and split the data from there. So the data should look like this (excluding complex item):
>complex.split.1
ids seq type complex
1 1 1 1 1
2 1 2 1 1
7 2 1 1 1
8 2 2 1 1
9 2 3 1 1
>complex.split.2
ids seq type complex
4 1 4 1 1
5 1 5 1 1
6 1 6 1 1
11 2 5 1 1
12 2 6 1 1
Any thoughts?
Thanks
Here's a way to do it using data.table, zoo packages and split function:
library(data.table)
library(zoo)
## set data to data.table & add a flag 1 where type is 5 or 8
## carry forward and backward of complex flag
## replace na values in complex column with 0
setDT(data)[, complex := ifelse(type == 5 | type == 8, 1, NA_integer_), by = ids][
  , complex := na.locf(na.locf(complex, na.rm = FALSE), na.rm = FALSE, fromLast = TRUE), by = ids][
  , complex := ifelse(is.na(complex), 0, complex)]
data <- data[!(type == 5 | type == 8), ] ## removing rows where type equals 5 or 8
complex <- split(data, data$complex) ## split data based on complex flag
complex_0 <- as.data.frame(complex$`0`) ## saving as data frame based on complex flag
complex_1 <- as.data.frame(complex$`1`)
split(complex_0, cut(complex_0$seq, 2)) ## split into equal parts
split(complex_1, cut(complex_1$seq, 2))
#$`(0.995,3.5]`
# ids seq type complex
#1 3 1 1 0
#2 3 2 1 0
#3 3 3 1 0
#$`(3.5,6]`
# ids seq type complex
#4 3 4 1 0
#5 3 5 1 0
#6 3 6 1 0
#$`(0.995,3.5]`
# ids seq type complex
#1 1 1 1 1
#2 1 2 1 1
#6 2 1 1 1
#7 2 2 1 1
#8 2 3 1 1
#$`(3.5,6]`
# ids seq type complex
#3 1 4 1 1
#4 1 5 1 1
#5 1 6 1 1
#9 2 5 1 1
#10 2 6 1 1
If you prefer using the tidyverse, here's an approach:
library(dplyr)
ids <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3)
seq <- c(1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6)
type <- c(1,1,5,1,1,1, 1,1,1,8,1,1, 1,1,1,1,1,1)
data <- data.frame(ids, seq, type)
step1.data <- data %>%
group_by(ids) %>%
mutate(complex = ifelse(any(type %in% c(5,8)), 1, 0)) %>%
ungroup()
simple.split.1 <- step1.data %>%
filter(complex == 0) %>%
group_by(ids) %>%
filter(seq <= mean(seq)) %>% #if you happen to have more than 6 questions in seq, this gives the midpoint
ungroup()
simple.split.2 <- step1.data %>%
filter(complex == 0) %>%
group_by(ids) %>%
filter(seq > mean(seq)) %>%
ungroup()
complex.split.1 <- step1.data %>%
filter(complex == 1) %>%
arrange(ids, seq) %>%
group_by(ids) %>%
filter(seq < min(seq[type %in% c(5,8)])) %>%
ungroup()
complex.split.2 <- step1.data %>%
filter(complex == 1) %>%
arrange(ids, seq) %>%
group_by(ids) %>%
filter(seq > min(seq[type %in% c(5,8)])) %>%
ungroup()
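For the first step alone (the complex flag), a package-free sketch is possible as well. This assumes, as in the question, that 5 and 8 are the only complicated item types:

```r
# Example data from the question
ids  <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3)
seq  <- c(1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6)
type <- c(1,1,5,1,1,1, 1,1,1,8,1,1, 1,1,1,1,1,1)
data <- data.frame(ids, seq, type)

# TRUE for every row of a student who has at least one item of type 5 or 8,
# then converted to the 1/0 coding used in the question
data$complex <- as.integer(ave(data$type %in% c(5, 8), data$ids, FUN = any))

# One flag value per student
tapply(data$complex, data$ids, unique)
```

Students 1 and 2 get complex = 1 and student 3 gets complex = 0, matching the table in the question.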

Code number of days elapsed since last activity

I want to code the number of days elapsed since the user's last activity for a churn analysis.
I have tried some code I found in a related topic, but it does not work:
da = da %>%
arrange(dayid) %>%
group_by(dayid) %>%
mutate(dayssincelastactivity = c(NA, diff(dayid))
Let's say this is the data. active indicates whether the user was active on this day. I want to add the variable dayssincelastactivity, which indicates the number of days elapsed since a user's last active day.
da <- data.frame(dayid = c(1,2,3,4,5,6,7,8), active = c(1,1,0,0,0,1,1,1), dayssincelastactivity = c(1,1,2,3,4,1,1,1))
da
dayid active dayssincelastactivity
1 1 1 1
2 2 1 1
3 3 0 2
4 4 0 3
5 5 0 4
6 6 1 1
7 7 1 1
8 8 1 1
Create a grouping variable using cumsum, then apply seq_along within each group.
with(da, ave(dayid, cumsum(active == 1), FUN = seq_along))
#[1] 1 1 2 3 4 1 1 1
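To make the grouping step visible, here is a small sketch that stores the intermediate group ids before counting within them. The name grp is illustrative, and the sketch assumes the rows are already sorted by dayid with one row per day.

```r
# Example data from the question, without the expected column
da <- data.frame(dayid = c(1,2,3,4,5,6,7,8), active = c(1,1,0,0,0,1,1,1))

# Every active day starts a new group; inactive days stay in the
# group of the most recent active day
grp <- cumsum(da$active == 1)
grp
#[1] 1 2 2 2 2 3 4 5

# Position within each group = value wanted in the question
da$dayssincelastactivity <- ave(da$dayid, grp, FUN = seq_along)
da$dayssincelastactivity
#[1] 1 1 2 3 4 1 1 1
```

Note that any inactive days before the first active day would share group 0 and be numbered from 1, which may or may not be what a churn analysis wants.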
You can also translate this to dplyr
library(dplyr)
da %>%
group_by(group = cumsum(active == 1)) %>%
mutate(new_val = row_number()) %>%
ungroup() %>%
select(-group)
# dayid active dayssincelastactivity new_val
# <dbl> <dbl> <dbl> <int>
#1 1 1 1 1
#2 2 1 1 1
#3 3 0 2 2
#4 4 0 3 3
#5 5 0 4 4
#6 6 1 1 1
#7 7 1 1 1
#8 8 1 1 1

Fill in rows based on condition for grouped data using tidyr

I have the following data frame, for which I am trying to create the 'index2' field conditional on the 'index1' field:
Basically, this data represents a succession of behaviours for different individual (ID) penguins, and I am trying to index groups of behaviour (index2) that incorporate all other behaviours in between (and including) dives (which have been indexed into dive bouts = index1). I would appreciate a tidyverse solution grouping by ID.
Reproducible:
df<-data.frame(ID=c(rep('A',9),rep('B',14)),behaviour=c('surface','dive','dive','dive','surface','commute','surface','dive', 'dive','dive','dive','surface','dive','dive','commute','commute','surface','dive','dive','surface','dive','dive','surface'),index1=c(0,1,1,1,0,0,0,1,1,2,2,0,3,3,0,0,0,3,3,0,3,3,0),index2=c(0,1,1,1,1,1,1,1,1,2,2,0,3,3,3,3,3,3,3,3,3,3,0))
We could create a function with rle
frle <- function(x) {
rl <- rle(x)
i1 <- cummax(rl$values)
i2 <- c(i1[-1] != i1[-length(i1)], FALSE)
i1[i2] <- 0
as.integer(inverse.rle(within.list(rl, values <- i1)))
}
After grouping by 'ID', mutate the 'Index1' to get the expected column
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(Index2New = frle(Index1))
# A tibble: 19 x 5
# Groups: ID [2]
# ID behaviour Index1 Index2 Index2New
# <chr> <chr> <int> <int> <int>
# 1 A surface 0 0 0
# 2 A dive 1 1 1
# 3 A dive 1 1 1
# 4 A dive 1 1 1
# 5 A surface 0 1 1
# 6 A commute 0 1 1
# 7 A surface 0 1 1
# 8 A dive 1 1 1
# 9 A dive 1 1 1
#10 B dive 2 2 2
#11 B dive 2 2 2
#12 B surface 0 0 0
#13 B dive 3 3 3
#14 B dive 3 3 3
#15 B commute 0 3 3
#16 B commute 0 3 3
#17 B surface 0 3 3
#18 B dive 3 3 3
#19 B dive 3 3 3
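To see what frle does without any grouping, it can be applied to a single plain vector. The sketch below copies the function from the answer and feeds it the Index1 values shown for ID 'A' in the output above; the name idx_a is illustrative.

```r
# Function copied from the answer above
frle <- function(x) {
  rl <- rle(x)                                   # runs of equal values
  i1 <- cummax(rl$values)                        # carry the bout id forward
  i2 <- c(i1[-1] != i1[-length(i1)], FALSE)      # runs just before the id changes
  i1[i2] <- 0                                    # reset those runs to 0
  as.integer(inverse.rle(within.list(rl, values <- i1)))
}

# Index1 values for ID 'A' as printed in the output
idx_a <- c(0, 1, 1, 1, 0, 0, 0, 1, 1)
frle(idx_a)
#[1] 0 1 1 1 1 1 1 1 1
```

The zeros between the two runs of 1 are filled in because the running maximum is already 1 there, while the leading 0 is kept.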
