Counting the number of non-zero observations by group

For the following data, I would like to count the number of students per class that have a non-zero value in each year.
Class Students Gender Height Year_1999 Year_2000 Year_2001 Year_2002
1 Mark M 180 80 54 22 12
2 John M 234 0 59 32 62
1 Tom M 124 0 53 26 12
2 Jane F 180 80 54 22 0
3 Kim F 140 0 2 3 32
The output should be
Class Year_1999 Year_2000 Year_2001 Year_2002
1 1 2 2 2
2 1 2 2 1
3 0 1 1 1
I tried the following but didn't have much luck
Number_obs = df %>%
group_by(class) %>%
summarise(count=n())

We can use summarise_at in dplyr. After grouping by 'Class', it applies the function to every column whose name matches 'Year' and, for each one, sums the values that are not equal to 0 (as.logical turns 0 into FALSE and any other number into TRUE)
library(dplyr)
df1 %>%
group_by(Class) %>%
summarise_at(vars(matches("Year")), list(~ sum(as.logical(.))))
# A tibble: 3 x 5
# Class Year_1999 Year_2000 Year_2001 Year_2002
# <int> <int> <int> <int> <int>
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1
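As a small illustration of why `sum(as.logical(.))` counts the non-zero entries, here is a base R sketch on the Year_1999 values from the example data. The variable names are only for illustration, and the NA behaviour in the last step is a general property of `sum` rather than something the example data exercises.

```r
# Year_1999 values from the example data
x <- c(80L, 0L, 0L, 80L, 0L)

# 0 becomes FALSE, every other number becomes TRUE
flags <- as.logical(x)

# summing a logical vector counts the TRUE values
n_nonzero <- sum(flags)

# same count as the explicit comparison used in the other answers
same <- identical(n_nonzero, sum(x != 0))

# with missing values the sum is NA unless na.rm = TRUE is passed
n_with_na <- sum(as.logical(c(x, NA)), na.rm = TRUE)
```

If the real data can contain NA, adding `na.rm = TRUE` inside the summarising function is probably what is wanted.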
Or we can gather into 'long' format, do the group_by operation on a single column and spread it to 'wide' format
library(tidyr)
df1 %>%
gather(key, val, matches("Year")) %>%
group_by(Class, key) %>%
summarise(val = sum(val != 0)) %>%
spread(key, val)
Or using data.table
library(data.table)
setDT(df1)[, lapply(.SD, function(x) sum(as.logical(x))), .(Class), .SDcols = 5:8]
Or using base R with aggregate
aggregate(.~ Class, df1[-(2:4)], function(x) sum(x != 0))
# Class Year_1999 Year_2000 Year_2001 Year_2002
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1
Or using rowsum
rowsum(+(!!df1[5:8]), df1$Class)
# Year_1999 Year_2000 Year_2001 Year_2002
#1 1 2 2 2
#2 1 2 2 1
#3 0 1 1 1
Or using colSums
t(sapply(split(as.data.frame(df1[5:8] != 0), df1$Class), colSums))
data
df1 <- structure(list(Class = c(1L, 2L, 1L, 2L, 3L), Students = c("Mark",
"John", "Tom", "Jane", "Kim"), Gender = c("M", "M", "M", "F",
"F"), Height = c(180L, 234L, 124L, 180L, 140L), Year_1999 = c(80L,
0L, 0L, 80L, 0L), Year_2000 = c(54L, 59L, 53L, 54L, 2L), Year_2001 = c(22L,
32L, 26L, 22L, 3L),
Year_2002 = c(12L, 62L, 12L, 0L, 32L)), class = "data.frame",
row.names = c(NA,
-5L))

Similar to @akrun's colSums solution, using by.
do.call(rbind, by(df[5:8] > 0, df[1], colSums))
# Year_1999 Year_2000 Year_2001 Year_2002
# 1 1 2 2 2
# 2 1 2 2 1
# 3 0 1 1 1
or
Reduce(rbind, by(df[5:8] > 0, df[1], colSums))
# Year_1999 Year_2000 Year_2001 Year_2002
# init 1 2 2 2
# 1 2 2 1
# 0 1 1 1
do.call is faster.

Using dplyr, we can use summarise_at
library(dplyr)
df %>%
group_by(Class) %>%
summarise_at(vars(starts_with("Year")), ~sum(. != 0))
# Class Year_1999 Year_2000 Year_2001 Year_2002
# <int> <int> <int> <int> <int>
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1

Related

Identify unique values within a multivariable subset

I have data that look like these:
Subject Site Date
1 2 '2020-01-01'
1 2 '2020-01-01'
1 2 '2020-01-02'
2 1 '2020-01-02'
2 1 '2020-01-03'
2 1 '2020-01-03'
And I'd like to create an order variable for unique dates by Subject and Site. i.e.
Want
1
1
2
1
2
2
I define a little wrapper:
rle <- function(x) cumsum(!duplicated(x))
and I notice inconsistent behavior when I supply:
have1 <- unlist(tapply(val$Date, val[, c( 'Site', 'Subject')], rle))
versus
have2 <- unlist(tapply(val$Date, val[, c('Subject', 'Site')], rle))
> have1
[1] 1 1 2 1 2 2
> have2
[1] 1 2 2 1 1 2
Is there any way to ensure that the natural ordering of the dataset is followed regardless of the specific columns supplied to the INDEX argument?

library(dplyr)
val %>%
group_by(Subject, Site) %>%
mutate(Want = match(Date, unique(Date))) %>%
ungroup
-output
# A tibble: 6 × 4
Subject Site Date Want
<int> <int> <chr> <int>
1 1 2 2020-01-01 1
2 1 2 2020-01-01 1
3 1 2 2020-01-02 2
4 2 1 2020-01-02 1
5 2 1 2020-01-03 2
6 2 1 2020-01-03 2
Or in base R with ave, which returns the result in the original row order:
val$Want <- with(val, ave(as.integer(as.Date(Date)), Subject, Site,
FUN = \(x) match(x, unique(x))))
val$Want
[1] 1 1 2 1 2 2
data
val <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L), Site = c(2L,
2L, 2L, 1L, 1L, 1L), Date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03")),
class = "data.frame", row.names = c(NA,
-6L))
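On the ordering question itself: `tapply` returns an array with one cell per combination of the INDEX factors, and `unlist` walks that array column by column, so the order of the groups in the result depends on the order of the INDEX columns rather than on the rows. A minimal base R sketch of the difference, using the `val` data above. The helper names `dense_order` and `by_rows` are made up here, and grouping row indices through `ave` is one way (not the only one) to keep the original row order.

```r
val <- data.frame(
  Subject = c(1L, 1L, 1L, 2L, 2L, 2L),
  Site    = c(2L, 2L, 2L, 1L, 1L, 1L),
  Date    = c("2020-01-01", "2020-01-01", "2020-01-02",
              "2020-01-02", "2020-01-03", "2020-01-03")
)

# hypothetical helper: rank of each value by first appearance
dense_order <- function(x) match(x, unique(x))

# tapply builds a grid with one cell per group; unlist reads it column by
# column, so groups come out in grid order, not in row order
grid_a <- unname(unlist(tapply(val$Date, val[, c("Site", "Subject")], dense_order)))
grid_b <- unname(unlist(tapply(val$Date, val[, c("Subject", "Site")], dense_order)))

# ave writes each group's result back to the rows it came from,
# so the order of the grouping columns no longer matters
by_rows <- function(g1, g2) {
  ave(seq_along(val$Date), g1, g2,
      FUN = function(i) dense_order(val$Date[i]))
}
want_a <- by_rows(val$Site, val$Subject)
want_b <- by_rows(val$Subject, val$Site)
```

Here `grid_a` and `grid_b` differ in the same way as `have1` and `have2` in the question, while `want_a` and `want_b` agree.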

Extracting sequences from columns using R

I have a df which looks like this
ID X003-APP X005-APP X008-APP X003-COP X004-COP X008-PIN X009-PIN
363 NA NA 1 0 NA 4 5
364 0 2 NA 1 5 1 5
678 0 NA NA 5 NA NA NA
713 1 1 1 1 1 1 1
219 1 2 3 3 NA 4 5
234 NA NA NA 2 3 NA NA
321 2 3 1 NA NA 1 2
I am interested in minimum counts for non-null values across the column substrings APP, COP and PIN. My required output is:
ID APP COP PIN
363 1 1 1
364 1 1 1
678 1 1 0
713 1 1 1
219 1 1 1
234 0 1 0
321 1 0 1
For reference, I am sharing the dput():
structure(list(ID = c(363L, 364L, 678L, 713L, 219L, 234L, 321L),
X003.APP = c(NA, 0L, 0L, 1L, 1L, NA, 2L),
X005.APP = c(NA, 2L, NA, 1L, 2L, NA, 3L),
X008.APP = c(1L, NA, NA, 1L, 3L, NA, 1L),
X003.COP = c(0L, 1L, 5L, 1L, 3L, 2L, NA),
X004.COP = c(NA, 5L, NA, 1L, NA, 3L, NA),
X008.PIN = c(4L, 1L, NA, 1L, 4L, NA, 1L),
X009.PIN = c(5L, 5L, NA, 1L, 5L, NA, 2L)),
class = "data.frame", row.names = c(NA, -7L))
Edit:
Later on, I would like to analyse sequences of length 2 and 3 across IDs, that is, count the IDs that have non-null values for every substring in a given combination of APP, COP and PIN. My required output for a sequence of length 2 would be:
Spec_1 Spec_2 Counts
APP COP 5
APP PIN 5
COP PIN 4
Or correspondingly, my required output for a sequence of length 3 would be:
Spec_1 Spec_2 Spec_3 Counts
APP COP PIN 4
Is there an easy way to achieve this? It would be great to have a solution that could cater for longer sequences - even beyond 3. Thank you very much for your time.

You may try
library(reshape2)
library(tidyverse)
df %>%
reshape2::melt(id = "ID") %>%
separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
group_by(ID, Spec) %>%
summarize(value = as.numeric(any(!is.na(value)))) %>%
filter(value == 1) %>%
pivot_wider(names_from = "Spec", values_from = "value") %>%
replace(is.na(.), 0)
ID APP COP PIN
<int> <dbl> <dbl> <dbl>
1 219 1 1 1
2 234 0 1 0
3 321 1 0 1
4 363 1 1 1
5 364 1 1 1
6 678 1 1 0
7 713 1 1 1
This gives the per-ID table. For sequences of length 2:
df %>%
reshape2::melt(id = "ID") %>%
separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
group_by(ID, Spec) %>%
summarize(value = any(!is.na(value))) %>%
filter(value) %>%
group_by(ID) %>%
filter(n() > 1) %>%
summarise(Spec = combn(Spec, 2, simplify = FALSE)) %>%
unnest_wider(Spec, names_sep = "_") %>%
group_by(Spec_1, Spec_2) %>%
summarize(Counts = n())
Spec_1 Spec_2 Counts
<chr> <chr> <int>
1 APP COP 5
2 APP PIN 5
3 COP PIN 4
This gives the counts for each pair.
For sequences of length 3:
df %>%
reshape2::melt(id = "ID") %>%
separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
group_by(ID, Spec) %>%
summarize(value = any(!is.na(value))) %>%
filter(value) %>%
group_by(ID) %>%
filter(n() > 2) %>%
summarise(Spec = combn(Spec, 3, simplify = FALSE)) %>%
unnest_wider(Spec, names_sep = "_") %>%
group_by(Spec_1, Spec_2, Spec_3) %>%
summarize(Counts = n())
Spec_1 Spec_2 Spec_3 Counts
<chr> <chr> <chr> <int>
1 APP COP PIN 4

Try this using dplyr
library(dplyr)
df |> rowwise() |> transmute( ID,
APP = case_when(all(is.na(c_across(contains("APP")))) ~ 0 , TRUE ~ 1) ,
COP = case_when(all(is.na(c_across(contains("COP")))) ~ 0 , TRUE ~ 1) ,
PIN = case_when(all(is.na(c_across(contains("PIN")))) ~ 0 , TRUE ~ 1)) -> df1
output
# A tibble: 7 × 4
# Rowwise:
ID APP COP PIN
<int> <dbl> <dbl> <dbl>
1 363 1 1 1
2 364 1 1 1
3 678 1 1 0
4 713 1 1 1
5 219 1 1 1
6 234 0 1 0
7 321 1 0 1
For your second requirement, you can use
df1 |> transmute(AC = case_when(sum(c_across(c(APP,COP))) == 2 ~ 1 , TRUE ~ 0) ,
AP = case_when(sum(c_across(c(APP,PIN))) == 2 ~ 1 , TRUE ~ 0) ,
CP = case_when(sum(c_across(c(PIN,COP))) == 2 ~ 1 , TRUE ~ 0) ,
ACP = case_when(sum(c_across(c(APP,COP,PIN))) == 3 ~ 1 , TRUE ~ 0)) |> ungroup() |>
summarise(APP_COP = sum(AC) , APP_PIN = sum(AP) , COP_PIN = sum(CP) , APP_COP_PIN = sum(ACP))
output
# A tibble: 1 × 4
APP_COP APP_PIN COP_PIN APP_COP_PIN
<dbl> <dbl> <dbl> <dbl>
1 5 5 4 4
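For sequences longer than 3, one possible generalisation in base R is to build a logical ID-by-Spec presence matrix once and then count, for every combination of k Specs, the IDs where all k are present. This is a sketch under assumptions: the names `present` and `count_combos` are made up here, and the Spec is taken to be whatever follows the last dot in each column name, as in the dput() from the question.

```r
df <- structure(list(ID = c(363L, 364L, 678L, 713L, 219L, 234L, 321L),
  X003.APP = c(NA, 0L, 0L, 1L, 1L, NA, 2L),
  X005.APP = c(NA, 2L, NA, 1L, 2L, NA, 3L),
  X008.APP = c(1L, NA, NA, 1L, 3L, NA, 1L),
  X003.COP = c(0L, 1L, 5L, 1L, 3L, 2L, NA),
  X004.COP = c(NA, 5L, NA, 1L, NA, 3L, NA),
  X008.PIN = c(4L, 1L, NA, 1L, 4L, NA, 1L),
  X009.PIN = c(5L, 5L, NA, 1L, 5L, NA, 2L)),
  class = "data.frame", row.names = c(NA, -7L))

# Spec is the part after the last dot, e.g. "X003.APP" -> "APP"
specs <- sub(".*\\.", "", names(df)[-1])

# TRUE when an ID has at least one non-NA value for that Spec
present <- sapply(unique(specs), function(s) {
  rowSums(!is.na(df[-1][specs == s])) > 0
})

# hypothetical helper: count IDs that have all Specs of each k-combination
count_combos <- function(present, k) {
  combos <- combn(colnames(present), k, simplify = FALSE)
  counts <- vapply(combos, function(cols) {
    sum(rowSums(present[, cols, drop = FALSE]) == k)
  }, integer(1))
  out <- as.data.frame(do.call(rbind, combos), stringsAsFactors = FALSE)
  names(out) <- paste0("Spec_", seq_len(k))
  out$Counts <- counts
  out
}

pairs   <- count_combos(present, 2)
triples <- count_combos(present, 3)
```

Because `k` is an argument, the same call should work for longer sequences as soon as there are more than three Specs in the column names.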

Sequential Increase in Column value based on a condition R

I have an R data frame that has an ID column with multiple records per ID. When the flag is set to 1 for an ID, I want to create a new column, New_Timepoint, that starts at 1 on that row and then continues as 6, 12, 18, ... on the following rows. How can I achieve this in R using dplyr?
Below is a sample data frame
ID Timepoint Flag
A 0 0
A 6 0
A 12 0
A 18 1
A 24 0
A 30 0
A 36 0
Expected Dataframe
ID Timepoint Flag New_Timepoint
A 0 0
A 6 0
A 12 0
A 18 1 1
A 24 0 6
A 30 0 12
A 36 0 18

An option is to group by 'ID' and take the lag of 'Timepoint' (with the 0 replaced by 1), where n is one less than the position at which 'Flag' is 1
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(New_Timepoint = dplyr::lag(replace(Timepoint, !Timepoint, 1),
n = which(Flag == 1)-1)) %>%
ungroup
-output
# A tibble: 7 x 4
# ID Timepoint Flag New_Timepoint
# <chr> <int> <int> <dbl>
#1 A 0 0 NA
#2 A 6 0 NA
#3 A 12 0 NA
#4 A 18 1 1
#5 A 24 0 6
#6 A 30 0 12
#7 A 36 0 18
Or use a double cumsum to create the index
df1 %>%
group_by(ID) %>%
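# replace the 0 baseline with 1 so the flagged row starts at 1, as in the expected output
mutate(New_Timepoint = replace(Timepoint, !Timepoint, 1)[na_if(cumsum(cumsum(Flag)), 0)]) %>%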
ungroup
data
df1 <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A"),
Timepoint = c(0L,
6L, 12L, 18L, 24L, 30L, 36L),
Flag = c(0L, 0L, 0L, 1L, 0L, 0L,
0L)), class = "data.frame", row.names = c(NA, -7L))
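To make the double cumsum index easier to follow, here it is step by step in base R on the same Timepoint and Flag values. This is a sketch that assumes a single flagged row per ID. The intermediate names (`started`, `idx`, `shifted`) are illustrative only, and the 0 baseline is replaced by 1, as in the lag version, so that the flagged row starts at 1.

```r
Timepoint <- c(0L, 6L, 12L, 18L, 24L, 30L, 36L)
Flag      <- c(0L, 0L, 0L, 1L, 0L, 0L, 0L)

# 0 before the flagged row, 1 from the flagged row onwards
started <- cumsum(Flag)

# 0 before the flag, then 1, 2, 3, ... counting rows since the flag
idx <- cumsum(started)

# rows before the flag should stay empty, so index them with NA
idx[idx == 0] <- NA

# replace the 0 baseline with 1, then read values from the start of the series
shifted <- replace(Timepoint, Timepoint == 0, 1)
New_Timepoint <- shifted[idx]
```

Inside `group_by(ID)` the same steps run once per ID, which is what the one-line `mutate` version does.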

Another dplyr option
df %>%
group_by(ID) %>%
mutate(New_Timepoint = pmax(1, Timepoint - c(NA, Timepoint[Flag == 1])[cumsum(Flag) + 1])) %>%
ungroup()
gives
ID Timepoint Flag New_Timepoint
<chr> <int> <int> <dbl>
1 A 0 0 NA
2 A 6 0 NA
3 A 12 0 NA
4 A 18 1 1
5 A 24 0 6
6 A 30 0 12
7 A 36 0 18

Trying to find occurrences of ID that meets sequential conditions in R

I'm trying to return a logical vector based on whether a person meets one set of conditions and ALSO meets another set of conditions later on. I'm using a data frame that looks like so:
Person.Id Year Term
250 1 3
250 1 1
250 2 3
300 1 3
511 2 1
300 1 5
700 2 3
What I want to return is a logical vector that indicates true/false if person ID 250 has year 1 and term 3, AND later has year 2 term 3. So a person that only has year 1 term 3 or year 1 term 5 will return false. Solutions in dplyr preferred! I feel like this is simple and I'm just missing something. I initially tried this code but all it returned was a blank df:
df2 <- df1 %>%
group_by(Person.Id) %>%
filter((year==1 & term==3) & (year==2 & term==3))

Are you looking for something like this?
library(dplyr)
df %>%
group_by(Person.Id) %>%
mutate(count=sum((year==1 & term==3) | (year==2 & term==3))) %>%
mutate(count2 = count == 2)
# A tibble: 7 x 5
# Groups: Person.Id [4]
Person.Id year term count count2
<int> <int> <int> <int> <lgl>
1 250 1 3 2 TRUE
2 250 1 1 2 TRUE
3 250 2 3 2 TRUE
4 300 1 3 1 FALSE
5 511 2 1 0 FALSE
6 300 1 5 1 FALSE
7 700 2 3 1 FALSE

Maybe this can help:
library(dplyr)
#Data
Data <- structure(list(Person.Id = c(250L, 250L, 250L, 300L, 511L, 300L,
700L), Year = c(1L, 1L, 2L, 1L, 2L, 1L, 2L), Term = c(3L, 1L,
3L, 3L, 1L, 5L, 3L)), row.names = c(NA, -7L), class = "data.frame")
#Flags
cond1 <- Data$Year==1 & Data$Term==3
cond2 <- Data$Year==2 & Data$Term==3
#Replace
Data$Flag1 <- 0
Data$Flag1[cond1]<-1
Data$Flag2 <- 0
Data$Flag2[cond2]<-1
#Filter
Data %>% group_by(Person.Id) %>% filter(Flag1==1 | Flag2==1)
# A tibble: 4 x 5
# Groups: Person.Id [3]
Person.Id Year Term Flag1 Flag2
<int> <int> <int> <dbl> <dbl>
1 250 1 3 1 0
2 250 2 3 0 1
3 300 1 3 1 0
4 700 2 3 0 1
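Neither check above looks at the order of the rows. If "later" means further down in the data for the same person, one possible base R sketch is the following. The helper name `meets_sequence` is made up here, and treating row order as time order is an assumption.

```r
Data <- structure(list(Person.Id = c(250L, 250L, 250L, 300L, 511L, 300L, 700L),
  Year = c(1L, 1L, 2L, 1L, 2L, 1L, 2L),
  Term = c(3L, 1L, 3L, 3L, 1L, 5L, 3L)),
  row.names = c(NA, -7L), class = "data.frame")

# hypothetical helper: TRUE when a Year 1 / Term 3 row is followed,
# somewhere further down, by a Year 2 / Term 3 row
meets_sequence <- function(year, term) {
  first  <- which(year == 1 & term == 3)
  second <- which(year == 2 & term == 3)
  length(first) > 0 && length(second) > 0 && min(first) < max(second)
}

# one logical value per person, named by Person.Id
by_person <- split(Data, Data$Person.Id)
result <- vapply(by_person,
                 function(d) meets_sequence(d$Year, d$Term),
                 logical(1))

# expand back to one value per row if a row-level vector is needed
per_row <- unname(result[as.character(Data$Person.Id)])
```

Only person 250 has both conditions in the required order in the sample data, so it is the only TRUE in `result`.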

subsetting duplicates per individual

dfin <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 1 0 20 20
1 2 1 20 20
Per study and ID, for those who have duplicate CYCLE == 0 values, remove the row that had the higher TIME.
dfout <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 2 1 20 20
Using RStudio.

An option is to do a group by 'STUDY', 'ID' and filter out the duplicated 0 values in 'CYCLE'
library(dplyr)
dfin %>%
arrange(STUDY, ID, TIME) %>%
group_by(STUDY, ID) %>%
filter(!(duplicated(CYCLE) & CYCLE == 0))
# A tibble: 2 x 5
# Groups: STUDY, ID [2]
# STUDY ID CYCLE TIME VALUE
# <int> <int> <int> <int> <int>
#1 1 1 0 10 50
#2 1 2 1 20 20
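Also, if there are many duplicated 0 values and we want to remove only the row where 'TIME' is also the maximum (note that a CYCLE == 0 row holding the group's maximum TIME is dropped even if it is the only CYCLE == 0 row in that group)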
dfin %>%
group_by(STUDY, ID) %>%
filter(!(TIME == max(TIME) & CYCLE == 0))
Or using base R
# order() returns row indices, so use them to sort the data frame first
dfin1 <- dfin[do.call(order, dfin[c("STUDY", "ID", "TIME")]), ]
dfin1[!(duplicated(dfin1[1:3]) & dfin1$CYCLE == 0), ]
# STUDY ID CYCLE TIME VALUE
#1 1 1 0 10 50
#3 1 2 1 20 20
data
dfin <- structure(list(STUDY = c(1L, 1L, 1L), ID = c(1L, 1L, 2L), CYCLE = c(0L,
0L, 1L), TIME = c(10L, 20L, 20L), VALUE = c(50L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-3L))
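If a STUDY/ID group can have a single CYCLE == 0 row, or several of them, another way to phrase the rule is "among the CYCLE == 0 rows of a group, keep only the one with the smallest TIME". A base R sketch under that reading (the names `g`, `min_time` and `dfout` are illustrative, and rows tied on the smallest TIME would all be kept):

```r
dfin <- data.frame(
  STUDY = c(1L, 1L, 1L),
  ID    = c(1L, 1L, 2L),
  CYCLE = c(0L, 0L, 1L),
  TIME  = c(10L, 20L, 20L),
  VALUE = c(50L, 20L, 20L)
)

# one group per observed STUDY/ID/CYCLE combination
g <- interaction(dfin$STUDY, dfin$ID, dfin$CYCLE, drop = TRUE)

# smallest TIME within each group, repeated for every row of the group
min_time <- ave(dfin$TIME, g, FUN = min)

# keep all rows with CYCLE != 0, and only the earliest CYCLE == 0 row
dfout <- dfin[dfin$CYCLE != 0 | dfin$TIME == min_time, ]
```

Unlike the `TIME == max(TIME)` filter, this never removes a CYCLE == 0 row that is the only one in its group.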