Which IDs have only zero-counts in variable across all days? [duplicate] - r

This question already has answers here:
Select groups where all values are positive
(2 answers)
Select rows where all values are TRUE by group, using data.table
(2 answers)
Closed 7 months ago.
In my dataset there is the variable "cigarettes per day" (CPD) for 21 days and several subjects (ID). I want to know how many and which subjects never smoked (i.e. have only 0 in CPD) across the 21 days.
Here is an example dataset for 3 subjects and 5 days:
day <- c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
CPD <- c(3,4,0,2,0,0,0,0,0,0,4,0,0,0,1)
df <- data.frame(day, ID, CPD)
What I want would be something like this:
day ID CPD
1 1 2 0
2 2 2 0
3 3 2 0
4 4 2 0
5 5 2 0

We may use a group-by-all approach:
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(CPD %in% 0)) %>%
ungroup
Output:
# A tibble: 5 × 3
day ID CPD
<dbl> <dbl> <dbl>
1 1 2 0
2 2 2 0
3 3 2 0
4 4 2 0
5 5 2 0
Or without grouping
df %>%
filter(!ID %in% ID[CPD != 0])
day ID CPD
1 1 2 0
2 2 2 0
3 3 2 0
4 4 2 0
5 5 2 0
Or with base R
subset(df, !ID %in% ID[CPD != 0])
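The question also asks how many subjects never smoked, not just which ones. A small base R follow-up on the same idea (a sketch, using the df above) counts the qualifying IDs:

```r
# IDs that ever appear with a nonzero CPD are smokers;
# the remaining IDs have only zero-counts across all days
smokers <- unique(df$ID[df$CPD != 0])
never <- setdiff(unique(df$ID), smokers)
never          # which subjects: here ID 2
length(never)  # how many: here 1
```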

Here is a slightly modified dplyr (#akrun) approach:
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(CPD==0)==TRUE)
# Groups: ID [1]
day ID CPD
<dbl> <dbl> <dbl>
1 1 2 0
2 2 2 0
3 3 2 0
4 4 2 0
5 5 2 0
and here is a data.table approach:
library(data.table)
setDT(df)[,if(all(CPD == 0)) .SD , by = ID]
ID day CPD
1: 2 1 0
2: 2 2 0
3: 2 3 0
4: 2 4 0
5: 2 5 0

Related

Create variable that flags an ID if it has existed in any previous month

I am unsure of how to create a variable that flags an ID in the current month if the ID has existed in any previous month.
Example data:
ID<-c(1,2,3,2,3,4,1,5)
Month<-c(1,1,1,2,2,2,3,3)
Flag<-c(0,0,0,1,1,0,1,0)
have<-cbind(ID,Month)
> have
ID Month
1 1
2 1
3 1
2 2
3 2
4 2
1 3
5 3
want:
> want
ID Month Flag
1 1 0
2 1 0
3 1 0
2 2 1
3 2 1
4 2 0
1 3 1
5 3 0
a data.table approach
library(data.table)
# set to data.table format
DT <- as.data.table(have)
# initialise Signal column
DT[, Signal := 0]
# flag duplicates with a 1
DT[duplicated(ID), Signal := 1, by = Month][]
ID Month Signal
1: 1 1 0
2: 2 1 0
3: 3 1 0
4: 2 2 1
5: 3 2 1
6: 4 2 0
7: 1 3 1
8: 5 3 0
The idea was suggested by akrun in the comments. Here is the dplyr application:
First use as_tibble to bring the matrix into tibble format,
then use an ifelse statement with duplicated, as #akrun already suggests.
library(tibble)
library(dplyr)
have %>%
as_tibble() %>%
mutate(flag = ifelse(duplicated(ID),1,0))
ID Month flag
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 0
4 2 2 1
5 3 2 1
6 4 2 0
7 1 3 1
8 5 3 0
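For completeness, the same duplicated idea also works in base R without any packages (a sketch, assuming the `have` matrix from above):

```r
# duplicated(ID) is TRUE from the second occurrence of an ID onward,
# which is exactly the flag definition: 1 if the ID existed in any
# previous month (months are already in order here)
want <- transform(as.data.frame(have), Flag = as.integer(duplicated(ID)))
```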

Remove groups with only one individual in R without using dplyr package [duplicate]

This question already has answers here:
Select groups with more than one distinct value
(3 answers)
Closed 1 year ago.
Consider the following dataset. The data is grouped with either one or two people per group. However, an individual may have several entries.
group<-c(1,1,1,1,2,2,3,3,3,3,4,4)
individualID<-c(1,1,2,2,3,3,5,5,6,6,7,7)
X<-rbinom(12,1,0.5)
df1<-data.frame(group,individualID,X)
> df1
group individualID X
1 1 1 0
2 1 1 1
3 1 2 1
4 1 2 1
5 2 3 1
6 2 3 1
7 3 5 1
8 3 5 1
9 3 6 1
10 3 6 1
11 4 7 0
12 4 7 1
From the above, group 1 and group 3 have 2 individuals, whereas group 2 and group 4 have 1 individual each.
> aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
group individualID
1 1 2
2 2 1
3 3 2
4 4 1
How can I subset the data, without using the dplyr package, to keep only groups that have more than 1 individual, i.e. omit groups with 1 individual?
I should end up with only group 1 and group 3.
Or another option is with tidyverse - after grouping by 'group', filter the rows where the number of distinct (n_distinct) elements in 'individualID' is greater than 1
library(dplyr)
df1 %>%
group_by(group) %>%
filter(n_distinct(individualID) > 1) %>%
ungroup
# A tibble: 8 × 3
group individualID X
<dbl> <dbl> <int>
1 1 1 0
2 1 1 0
3 1 2 1
4 1 2 1
5 3 5 0
6 3 5 0
7 3 6 1
8 3 6 0
Or with subset and ave from base R
subset(df1, ave(individualID, group, FUN = function(x) length(unique(x))) > 1)
group individualID X
1 1 1 0
2 1 1 0
3 1 2 1
4 1 2 1
7 3 5 0
8 3 5 0
9 3 6 1
10 3 6 0
There are more concise ways for sure, but here is the general idea.
# use your code to get the counts by group
df1_counts <- aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
# create a vector of groups where the count is > 1
keep_groups <- df1_counts$group[df1_counts$individualID > 1]
# filter the rows to only groups you want to keep
df1[df1$group %in% keep_groups,]
# group individualID X
# 1 1 1 0
# 2 1 1 0
# 3 1 2 1
# 4 1 2 0
# 7 3 5 1
# 8 3 5 1
# 9 3 6 0
# 10 3 6 1

Remove groups with only one individual in R [duplicate]

This question already has an answer here:
Select groups with more than one distinct value per group [duplicate]
(1 answer)
Closed 1 year ago.
Consider the following dataset. The data is grouped with either one or two people per group. However, an individual may have several entries.
df1<-data.frame(group,individualID,X)
> df1
group individualID X
1 1 1 0
2 1 1 1
3 1 2 1
4 1 2 1
5 2 3 1
6 2 3 1
7 3 5 1
8 3 5 1
9 3 6 1
10 3 6 1
11 4 7 0
12 4 7 1
From the above, group 1 and group 3 have 2 individuals, whereas group 2 and group 4 have 1 individual each.
> aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
group individualID
1 1 2
2 2 1
3 3 2
4 4 1
How can I subset the data to keep only groups that have more than 1 individual, i.e. omit groups with 1 individual?
I should end up with only group 1 and group 3.
You could make a lookup table to identify the groups that have more than one unique individualID (similar to what you did with aggregate), then filter df1 based on that:
library(dplyr)
lookup <- df1 %>%
group_by(group) %>%
summarise(count = n_distinct(individualID)) %>%
filter(count > 1)
df1 %>% filter(group %in% unique(lookup$group))
group individualID X
1 1 1 0
2 1 1 1
3 1 2 1
4 1 2 1
5 3 5 1
6 3 5 1
7 3 6 1
8 3 6 1
Or, as #MrGumble suggests above, you could also merge df1 after creating lookup:
merge(df1, lookup)
group individualID X count
1 1 1 0 2
2 1 1 1 2
3 1 2 1 2
4 1 2 1 2
5 3 6 1 2
6 3 6 1 2
7 3 5 1 2
8 3 5 1 2
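If the extra count column introduced by the merge is unwanted, it can be dropped afterwards (a minor follow-up, assuming df1 and lookup from above):

```r
# merge keeps the lookup's count column; remove it to recover
# the original three columns
merged <- merge(df1, lookup)
merged$count <- NULL
```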

Subsetting data based on a value within ids in r

I'm trying to subset a dataset based on two criteria. Here is a snapshot of my data:
ids <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3)
seq <- c(1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6)
type <- c(1,1,5,1,1,1, 1,1,1,8,1,1, 1,1,1,1,1,1)
data <- data.frame(ids, seq, type)
ids seq type
1 1 1 1
2 1 2 1
3 1 3 5
4 1 4 1
5 1 5 1
6 1 6 1
7 2 1 1
8 2 2 1
9 2 3 1
10 2 4 8
11 2 5 1
12 2 6 1
13 3 1 1
14 3 2 1
15 3 3 1
16 3 4 1
17 3 5 1
18 3 6 1
ids is the student id, seq is the sequence of the questions (items) students take, and type refers to the type of the question: 1 is a simple item, while 5 or 8 are complicated items. What I would like to do is first generate a variable (complex) indicating whether or not a student has a complicated item (type = 5 | 8). Then I would like to get:
> data
ids seq type complex
1 1 1 1 1
2 1 2 1 1
3 1 3 5 1
4 1 4 1 1
5 1 5 1 1
6 1 6 1 1
7 2 1 1 1
8 2 2 1 1
9 2 3 1 1
10 2 4 8 1
11 2 5 1 1
12 2 6 1 1
13 3 1 1 0
14 3 2 1 0
15 3 3 1 0
16 3 4 1 0
17 3 5 1 0
18 3 6 1 0
The second step is to split data within students.
(a) For the student who has only non-complex items (complex=0), I would like to split the dataset at the halfway point and get this below:
>simple.split.1
ids seq type complex
13 3 1 1 0
14 3 2 1 0
15 3 3 1 0
>simple.split.2
ids seq type complex
16 3 4 1 0
17 3 5 1 0
18 3 6 1 0
(b) For the students who have complex items (complex=1), I would like to use the complex item as the cutting point and split the data there. So the data should look like this (excluding the complex item):
>complex.split.1
ids seq type complex
1 1 1 1 1
2 1 2 1 1
7 2 1 1 1
8 2 2 1 1
9 2 3 1 1
>complex.split.2
ids seq type complex
4 1 4 1 1
5 1 5 1 1
6 1 6 1 1
11 2 5 1 1
12 2 6 1 1
Any thoughts?
Thanks
Here's a way to do it using the data.table and zoo packages and the split function:
library(data.table)
library(zoo)
setDT(data)[, complex := ifelse(type == 5 | type == 8, 1, NA_integer_), by = ids] ## set data to data.table & add a flag 1 where type is 5 or 8
data[, complex := na.locf(na.locf(complex, na.rm = FALSE), na.rm = FALSE, fromLast = TRUE), by = ids] ## carry the complex flag forward and backward within each id
data[, complex := ifelse(is.na(complex), 0, complex)] ## replace remaining NA values in the complex column with 0
data <- data[!(type == 5 | type == 8), ] ## removing rows where type equals 5 or 8
complex <- split(data, data$complex) ## split data based on complex flag
complex_0 <- as.data.frame(complex$`0`) ## saving as data frame based on complex flag
complex_1 <- as.data.frame(complex$`1`)
split(complex_0, cut(complex_0$seq, 2)) ## split into equal parts
split(complex_1, cut(complex_1$seq, 2))
#$`(0.995,3.5]`
# ids seq type complex
#1 3 1 1 0
#2 3 2 1 0
#3 3 3 1 0
#$`(3.5,6]`
# ids seq type complex
#4 3 4 1 0
#5 3 5 1 0
#6 3 6 1 0
#$`(0.995,3.5]`
# ids seq type complex
#1 1 1 1 1
#2 1 2 1 1
#6 2 1 1 1
#7 2 2 1 1
#8 2 3 1 1
#$`(3.5,6]`
# ids seq type complex
#3 1 4 1 1
#4 1 5 1 1
#5 1 6 1 1
#9 2 5 1 1
#10 2 6 1 1
If you prefer using the tidyverse, here's an approach:
ids <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3)
seq <- c(1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6)
type <- c(1,1,5,1,1,1, 1,1,1,8,1,1, 1,1,1,1,1,1)
data <- data.frame(ids, seq, type)
library(dplyr)
step1.data <- data %>%
group_by(ids) %>%
mutate(complex = ifelse(any(type %in% c(5,8)), 1, 0)) %>%
ungroup()
simple.split.1 <- step1.data %>%
filter(complex == 0) %>%
group_by(ids) %>%
filter(seq <= mean(seq)) %>% #if you happen to have more than 6 questions in seq, this gives the midpoint
ungroup()
simple.split.2 <- step1.data %>%
filter(complex == 0) %>%
group_by(ids) %>%
filter(seq > mean(seq)) %>%
ungroup()
complex.split.1 <- step1.data %>%
filter(complex == 1) %>%
arrange(ids, seq) %>%
group_by(ids) %>%
filter(seq < min(seq[type %in% c(5,8)])) %>%
ungroup()
complex.split.2 <- step1.data %>%
filter(complex == 1) %>%
arrange(ids, seq) %>%
group_by(ids) %>%
filter(seq > min(seq[type %in% c(5,8)])) %>%
ungroup()
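A quick sanity check on the tidyverse splits above (a sketch, assuming the four data frames were created): the complex splits should contain no type 5 or 8 rows, and the two simple splits together should cover every complex=0 row:

```r
# stopifnot() errors if any condition is FALSE, so silence means pass
stopifnot(
  !any(complex.split.1$type %in% c(5, 8)),
  !any(complex.split.2$type %in% c(5, 8)),
  nrow(simple.split.1) + nrow(simple.split.2) ==
    sum(step1.data$complex == 0)
)
```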

Turning factor variable into a list of binary variable per row (trial) in R [duplicate]

This question already has answers here:
Transform one column from categoric to binary, keep the rest [duplicate]
(3 answers)
Closed 3 years ago.
A while ago I posted a question about how to convert a factor data.frame into a binary (one-hot encoded) data.frame here. Now I am trying to find the most efficient way to loop over trials (rows) and binarize a factor variable. A minimal example would look like this:
d = data.frame(
Trial = c(1,2,3,4,5,6,7,8,9,10),
Category = c('a','b','b','b','a','b','a','a','b','a')
)
d
Trial Category
1 1 a
2 2 b
3 3 b
4 4 b
5 5 a
6 6 b
7 7 a
8 8 a
9 9 b
10 10 a
While I would like to get this:
Trial a b
1 1 1 0
2 2 0 1
3 3 0 1
4 4 0 1
5 5 1 0
6 6 0 1
7 7 1 0
8 8 1 0
9 9 0 1
10 10 1 0
What would be the most efficient way of doing it?
Here is an option with pivot_wider: create a column of 1s and then apply pivot_wider with names_from the 'Category' column and values_from the newly created column.
library(dplyr)
library(tidyr)
d %>%
mutate(n = 1) %>%
pivot_wider(names_from = Category, values_from = n, values_fill = list(n = 0))
# A tibble: 10 x 3
# Trial a b
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 2 0 1
# 3 3 0 1
# 4 4 0 1
# 5 5 1 0
# 6 6 0 1
# 7 7 1 0
# 8 8 1 0
# 9 9 0 1
#10 10 1 0
An efficient option would be data.table:
library(data.table)
dcast(setDT(d), Trial ~ Category, length)
It can also be done with base R
table(d)
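Note that table(d) returns a contingency table rather than a data frame; if the exact desired shape (a Trial column plus a/b columns) is needed, it can be converted (a sketch):

```r
# as.data.frame.matrix keeps the Trial levels as row names,
# which we then promote to a proper column
tab <- as.data.frame.matrix(table(d))
out <- data.frame(Trial = as.numeric(rownames(tab)), tab)
```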
