Trying to find occurrences of an ID that meets sequential conditions in R

I'm trying to return a logical vector based on whether a person meets one set of conditions and ALSO meets another set of conditions later on. I'm using a data frame that looks like so:
Person.Id Year Term
250 1 3
250 1 1
250 2 3
300 1 3
511 2 1
300 1 5
700 2 3
What I want to return is a logical vector indicating TRUE/FALSE for whether a person (e.g. ID 250) has year 1 and term 3, AND later has year 2 and term 3. So a person that only has year 1 term 3 or year 1 term 5 should return FALSE. Solutions in dplyr preferred! I feel like this is simple and I'm just missing something. I initially tried the code below, but it returned an empty data frame, presumably because no single row can satisfy year == 1 and year == 2 at the same time:
df2 <- df1 %>%
  group_by(Person.Id) %>%
  filter((year == 1 & term == 3) & (year == 2 & term == 3))

Are you looking for something like this?
library(dplyr)
df %>%
  group_by(Person.Id) %>%
  mutate(count = sum((year == 1 & term == 3) | (year == 2 & term == 3))) %>%
  mutate(count2 = if_else(count == 2, TRUE, FALSE))
# A tibble: 7 x 5
# Groups: Person.Id [4]
Person.Id year term count count2
<int> <int> <int> <int> <lgl>
1 250 1 3 2 TRUE
2 250 1 1 2 TRUE
3 250 2 3 2 TRUE
4 300 1 3 1 FALSE
5 511 2 1 0 FALSE
6 300 1 5 1 FALSE
7 700 2 3 1 FALSE
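A slightly more robust way to get the per-person logical is to test each condition separately with any() within the group; a minimal sketch, assuming the lowercase year/term column names used above:
library(dplyr)
df %>%
  group_by(Person.Id) %>%
  mutate(meets_both = any(year == 1 & term == 3) & any(year == 2 & term == 3)) %>%
  ungroup()
Unlike the count == 2 check, this cannot be fooled by a person who matches the same condition twice (e.g. two rows with year 1, term 3).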

Maybe this can help:
# Data
Data <- structure(list(Person.Id = c(250L, 250L, 250L, 300L, 511L, 300L,
700L), Year = c(1L, 1L, 2L, 1L, 2L, 1L, 2L), Term = c(3L, 1L,
3L, 3L, 1L, 5L, 3L)), row.names = c(NA, -7L), class = "data.frame")
# Flags
cond1 <- Data$Year == 1 & Data$Term == 3
cond2 <- Data$Year == 2 & Data$Term == 3
# Replace
Data$Flag1 <- 0
Data$Flag1[cond1] <- 1
Data$Flag2 <- 0
Data$Flag2[cond2] <- 1
# Filter rows matching either condition
Data %>% group_by(Person.Id) %>% filter(Flag1 == 1 | Flag2 == 1)
# A tibble: 4 x 5
# Groups: Person.Id [3]
Person.Id Year Term Flag1 Flag2
<int> <int> <int> <dbl> <dbl>
1 250 1 3 1 0
2 250 2 3 0 1
3 300 1 3 1 0
4 700 2 3 0 1
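Note that this filter keeps every row matching either condition, so persons 300 and 700 still appear even though they meet only one of the two. To keep only persons that satisfy both conditions somewhere in their rows, one variation on the same flags:
Data %>%
  group_by(Person.Id) %>%
  filter(any(Flag1 == 1) & any(Flag2 == 1))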

Identify unique values within a multivariable subset

I have data that look like this:
Subject Site Date
1 2 '2020-01-01'
1 2 '2020-01-01'
1 2 '2020-01-02'
2 1 '2020-01-02'
2 1 '2020-01-03'
2 1 '2020-01-03'
And I'd like to create an order variable for unique dates by Subject and Site. i.e.
Want
1
1
2
1
2
2
I define a little wrapper (the name shadows base R's rle(), which is harmless here but worth knowing):
rle <- function(x) cumsum(!duplicated(x))
and I notice inconsistent behavior when I supply:
have1 <- unlist(tapply(val$Date, val[, c( 'Site', 'Subject')], rle))
versus
have2 <- unlist(tapply(val$Date, val[, c('Subject', 'Site')], rle))
> have1
[1] 1 1 2 1 2 2
> have2
[1] 1 2 2 1 1 2
Is there any way to ensure that the natural ordering of the dataset is followed regardless of the specific columns supplied to the INDEX argument?
library(dplyr)
val %>%
  group_by(Subject, Site) %>%
  mutate(Want = match(Date, unique(Date))) %>%
  ungroup()
Output:
# A tibble: 6 × 4
Subject Site Date Want
<int> <int> <chr> <int>
1 1 2 2020-01-01 1
2 1 2 2020-01-01 1
3 1 2 2020-01-02 2
4 2 1 2020-01-02 1
5 2 1 2020-01-03 2
6 2 1 2020-01-03 2
val$Want <- with(val, ave(as.integer(as.Date(Date)), Subject, Site,
                          FUN = \(x) match(x, unique(x))))
val$Want
[1] 1 1 2 1 2 2
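The inconsistency comes from how tapply() works: it splits the values by the interaction of the INDEX columns, and unlist() then concatenates the per-group results in group order, not in the original row order, so swapping the INDEX columns reorders the output. ave(), as used above, applies the function per group but returns values in the original row order, so it also works with the original wrapper regardless of how the grouping columns are listed; a minimal sketch:
have3 <- with(val, ave(as.integer(factor(Date)), Subject, Site,
                       FUN = function(x) cumsum(!duplicated(x))))
have3
# [1] 1 1 2 1 2 2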
data
val <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L), Site = c(2L,
2L, 2L, 1L, 1L, 1L), Date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03")),
class = "data.frame", row.names = c(NA,
-6L))

subsetting duplicates per individual

dfin <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 1 0 20 20
1 2 1 20 20
Per study and ID, for those who have duplicate CYCLE == 0 values, remove the row that had the higher TIME.
dfout <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 2 1 20 20
Using RStudio.
An option is to group by 'STUDY' and 'ID' and filter out the duplicated 0 values in 'CYCLE':
library(dplyr)
dfin %>%
  arrange(STUDY, ID, TIME) %>%
  group_by(STUDY, ID) %>%
  filter(!(duplicated(CYCLE) & CYCLE == 0))
# A tibble: 2 x 5
# Groups: STUDY, ID [2]
# STUDY ID CYCLE TIME VALUE
# <int> <int> <int> <int> <int>
#1 1 1 0 10 50
#2 1 2 1 20 20
Also, if there are many duplicates for 0 and you want to remove only the row where 'TIME' is the maximum:
dfin %>%
  group_by(STUDY, ID) %>%
  filter(!(TIME == max(TIME) & CYCLE == 0))
Or using base R (note that order() returns row indices, so we use them to reorder first, then drop the duplicated CYCLE == 0 rows per STUDY/ID):
dfin1 <- dfin[do.call(order, dfin[c("STUDY", "ID", "TIME")]), ]
dfin1[!(duplicated(dfin1[c("STUDY", "ID", "CYCLE")]) & dfin1$CYCLE == 0), ]
# STUDY ID CYCLE TIME VALUE
#1 1 1 0 10 50
#3 1 2 1 20 20
data
dfin <- structure(list(STUDY = c(1L, 1L, 1L), ID = c(1L, 1L, 2L), CYCLE = c(0L,
0L, 1L), TIME = c(10L, 20L, 20L), VALUE = c(50L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-3L))
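For completeness, a data.table sketch of the same idea (order by TIME within the grouping columns, then drop duplicated CYCLE == 0 rows per group):
library(data.table)
setDT(dfin)[order(STUDY, ID, TIME)][, .SD[!(duplicated(CYCLE) & CYCLE == 0)], by = .(STUDY, ID)]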

Counting number of non-zero observations by group

For the following data, I would like to count the number of students per class with a non-zero value in each year.
Class Students Gender Height Year_1999 Year_2000 Year_2001 Year_2002
1 Mark M 180 80 54 22 12
2 John M 234 0 59 32 62
1 Tom M 124 0 53 26 12
2 Jane F 180 80 54 22 0
3 Kim F 140 0 2 3 32
The output should be
Class Year_1999 Year_2000 Year_2001 Year_2002
1 1 2 2 2
2 1 2 2 1
3 0 1 1 1
I tried the following but didn't have much luck:
Number_obs <- df %>%
  group_by(Class) %>%
  summarise(count = n())
We can use summarise_at in dplyr. After grouping by 'Class', loop through the columns whose names match 'Year' in summarise_at and get the sum of the values that are not equal to 0:
library(dplyr)
df1 %>%
  group_by(Class) %>%
  summarise_at(vars(matches("Year")), list(~ sum(as.logical(.))))
# A tibble: 3 x 5
# Class Year_1999 Year_2000 Year_2001 Year_2002
# <int> <int> <int> <int> <int>
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1
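summarise_at has since been superseded; with dplyr 1.0.0 or later the same count can be written with across() (a sketch, assuming the same df1):
df1 %>%
  group_by(Class) %>%
  summarise(across(matches("Year"), ~ sum(.x != 0)))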
Or we can gather into 'long' format, do the group_by operation on a single column, and spread back to 'wide' format:
library(tidyr)
df1 %>%
  gather(key, val, matches("Year")) %>%
  group_by(Class, key) %>%
  summarise(val = sum(val != 0)) %>%
  spread(key, val)
Or using data.table
library(data.table)
setDT(df1)[, lapply(.SD, function(x) sum(as.logical(x))), .(Class), .SDcols = 5:8]
Or using base R with aggregate (this and the base R options below assume df1 is still a data.frame, i.e. they run before the setDT() call above):
aggregate(.~ Class, df1[-(2:4)], function(x) sum(x != 0))
# Class Year_1999 Year_2000 Year_2001 Year_2002
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1
Or using rowsum (the double negation !! coerces the counts to logical, TRUE for non-zero, and the unary + turns that back into 0/1):
rowsum(+(!!df1[5:8]), df1$Class)
# Year_1999 Year_2000 Year_2001 Year_2002
#1 1 2 2 2
#2 1 2 2 1
#3 0 1 1 1
Or using colSums
t(sapply(split(as.data.frame(df1[5:8] != 0), df1$Class), colSums))
data
df1 <- structure(list(Class = c(1L, 2L, 1L, 2L, 3L), Students = c("Mark",
"John", "Tom", "Jane", "Kim"), Gender = c("M", "M", "M", "F",
"F"), Height = c(180L, 234L, 124L, 180L, 140L), Year_1999 = c(80L,
0L, 0L, 80L, 0L), Year_2000 = c(54L, 59L, 53L, 54L, 2L), Year_2001 = c(22L,
32L, 26L, 22L, 3L),
Year_2002 = c(12L, 62L, 12L, 0L, 32L)), class = "data.frame",
row.names = c(NA,
-5L))
Similar to @akrun's colSums solution, using by:
do.call(rbind, by(df1[5:8] > 0, df1[1], colSums))
# Year_1999 Year_2000 Year_2001 Year_2002
# 1 1 2 2 2
# 2 1 2 2 1
# 3 0 1 1 1
or
Reduce(rbind, by(df1[5:8] > 0, df1[1], colSums))
# Year_1999 Year_2000 Year_2001 Year_2002
# init 1 2 2 2
# 1 2 2 1
# 0 1 1 1
do.call is faster.
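One way to check that claim (a hypothetical benchmark; absolute timings will vary with machine and data size):
library(microbenchmark)
microbenchmark(
  do.call = do.call(rbind, by(df1[5:8] > 0, df1[1], colSums)),
  Reduce  = Reduce(rbind, by(df1[5:8] > 0, df1[1], colSums))
)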
Using dplyr, we can again use summarise_at:
library(dplyr)
df1 %>%
  group_by(Class) %>%
  summarise_at(vars(starts_with("Year")), ~ sum(. != 0))
# Class Year_1999 Year_2000 Year_2001 Year_2002
# <int> <int> <int> <int> <int>
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1

Add column based on other columns' values

To be honest, I couldn't come up with a decent title for this.
Basically, I have a data frame:
ID Qty BasePrice Total
1 2 30 50
1 1 20 20
2 4 5 15
For each line I want to calculate the following:
Result = (Qty * BasePrice) - Total
This is supposedly easy to do in R. However, I want to group the results by ID (sum them within each ID).
Sample Output:
ID Qty BasePrice Total Results
1 2 30 50 10
1 1 20 20 10
2 4 5 15 5
For instance, for ID=1, the values represent ((2*30)-50)+((1*20)-20)
Any idea on how can I achieve this?
Thanks!
We can do a group_by sum of the difference between the product of 'Qty' and 'BasePrice', and 'Total':
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(Result = sum((Qty * BasePrice) - Total))
# A tibble: 3 x 5
# Groups: ID [2]
# ID Qty BasePrice Total Result
# <int> <int> <int> <int> <int>
#1 1 2 30 50 10
#2 1 1 20 20 10
#3 2 4 5 15 5
data
df1 <- structure(list(ID = c(1L, 1L, 2L), Qty = c(2L, 1L, 4L), BasePrice = c(30L,
20L, 5L), Total = c(50L, 20L, 15L)), class = "data.frame", row.names = c(NA,
-3L))
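The same grouped sum can be done in base R with ave(); a short sketch:
# ave() computes the per-ID sum and replicates it to every row of that ID
df1$Result <- with(df1, ave(Qty * BasePrice - Total, ID, FUN = sum))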

R: frequency with group by ID [duplicate]

I have a data frame like this:
ID Cont
1 a
1 a
1 b
2 a
2 c
2 d
I need to report the frequency of 'Cont' by ID. The output should be:
ID Cont Freq
1 a 2
1 b 1
2 a 1
2 c 1
2 d 1
Using dplyr, you can group_by both ID and Cont and summarise using n() to get Freq:
library(dplyr)
res <- df %>%
  group_by(ID, Cont) %>%
  summarise(Freq = n())
##Source: local data frame [5 x 3]
##Groups: ID [?]
##
## ID Cont Freq
## <int> <fctr> <int>
##1 1 a 2
##2 1 b 1
##3 2 a 1
##4 2 c 1
##5 2 d 1
Data:
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), Cont = structure(c(1L,
1L, 2L, 1L, 3L, 4L), .Label = c("a", "b", "c", "d"), class = "factor")), .Names = c("ID",
"Cont"), class = "data.frame", row.names = c(NA, -6L))
## ID Cont
##1 1 a
##2 1 a
##3 1 b
##4 2 a
##5 2 c
##6 2 d
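In recent dplyr versions, count() is a shorthand for this group_by() + summarise(n()) pattern (the name argument sets the output column):
library(dplyr)
df %>% count(ID, Cont, name = "Freq")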
library(data.table)
setDT(df)[, .(Freq = .N), by = .(ID, Cont)]
# ID Cont Freq
# 1: 1 a 2
# 2: 1 b 1
# 3: 2 a 1
# 4: 2 c 1
# 5: 2 d 1
With base R:
df1 <- subset(as.data.frame(table(df)), Freq != 0)
If you want to order by ID, add this line:
df1[order(df1$ID), ]
ID Cont Freq
1 1 a 2
3 1 b 1
2 2 a 1
6 2 c 1
8 2 d 1
