Creating counts of subset with dplyr - r

I'm trying to summarize a data set with not only total counts per group, but also counts of subsets. So starting with something like this:
df <- data.frame(
Group=c('A','A','B','B','B'),
Size=c('Large','Large','Large','Small','Small')
)
df_summary <- df %>%
group_by(Group) %>%
summarize(group_n=n())
I can get a summary of the number of observations for each group:
> df_summary
# A tibble: 2 x 2
Group group_n
<chr> <int>
1 A 2
2 B 3
Is there any way I can add some sort of subsetting information to n() to get, say, a count of how many observations per group were Large in this example? In other words, ending up with something like:
Group group_n Large_n
1 A 2 2
2 B 3 1
Thank you!

We could use count:
count(xyz) is the same as group_by(xyz) %>% summarise(n = n())
library(dplyr)
df %>%
count(Group, Size)
Group Size n
1 A Large 2
2 B Large 1
3 B Small 2
OR
library(dplyr)
library(tidyr)
df %>%
count(Group, Size) %>%
pivot_wider(names_from = Size, values_from = n)
Group Large Small
<chr> <int> <int>
1 A 2 NA
2 B 1 2
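The NA for group A appears because there are no Small rows in that group. A minimal sketch of one way to get a zero instead, assuming a tidyr version whose pivot_wider() accepts a scalar values_fill (the name wide is only illustrative):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Group = c("A", "A", "B", "B", "B"),
  Size  = c("Large", "Large", "Large", "Small", "Small")
)

# Count each Group/Size pair, then spread the sizes into columns.
# values_fill = 0L replaces the missing combination (A, Small) with 0.
wide <- df %>%
  count(Group, Size) %>%
  pivot_wider(names_from = Size, values_from = n, values_fill = 0L)

wide
```

A zero is usually easier to work with than NA if the counts feed into later arithmetic.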

I approach this problem using an ifelse and a sum:
df_summary <- df %>%
group_by(Group) %>%
summarize(group_n=n(),
Large_n = sum(ifelse(Size == "Large", 1, 0)))
The last line turns Size into a binary indicator taking the value 1 if Size == "Large" and 0 otherwise. Summing this indicator is equivalent to counting the number of rows with "Large".
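Because R treats TRUE as 1 and FALSE as 0 when summing, the same count can be written without ifelse. A minimal sketch of that variant, using the data from the question:

```r
library(dplyr)

df <- data.frame(
  Group = c("A", "A", "B", "B", "B"),
  Size  = c("Large", "Large", "Large", "Small", "Small")
)

df_summary <- df %>%
  group_by(Group) %>%
  summarize(
    group_n = n(),                   # total rows per group
    Large_n = sum(Size == "Large")   # rows per group where Size is "Large"
  )

df_summary
```

The comparison Size == "Large" yields a logical vector, and sum() counts its TRUE values.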

df_summary <- df %>%
group_by(Group) %>%
mutate(group_n=n())%>%
ungroup() %>%
group_by(Group,Size) %>%
mutate(Large_n=n()) %>%
ungroup() %>%
distinct(Group, .keep_all = T)
# A tibble: 2 x 4
Group Size group_n Large_n
<chr> <chr> <int> <int>
1 A Large 2 2
2 B Large 3 1

Related

Select second largest row by group in r

I have this problem
library(dplyr)
problem = data.frame(id = c(1,1,1,2,2,2), var1 = c(5,4,3, 6,5,4), var2 = c(99,12,32,88,9,8))
For each id, I want to only keep row with second largest value of var1. I tried different ways (dplyr, base R):
problem %>%
group_by(id) %>%
slice_tail(2, -var1)
problem[with(problem, ave(var1, id, FUN = function(x) x == tail(sort(x), 2)[1])), ]
The first code doesn't work, and the second gives the wrong answer.
What am I doing wrong?
problem |> group_by(id) %>% arrange(var1) %>% slice(n()-1)
n() counts the number of rows in each group. slice(n()-1) takes the n-1th element. Note this will cause issues with groups with fewer than 2 members - you may wish to allow for that.
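One way to allow for small groups is to drop them explicitly before slicing. The sketch below is only an illustration, not the answerer's code: the extra row with id 3 is hypothetical data added to show a one-row group, and dropping such groups with filter(n() >= 2) is an assumption about what the desired behaviour is.

```r
library(dplyr)

# Question's data plus one hypothetical single-row group (id 3)
problem <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2, 3),
  var1 = c(5, 4, 3, 6, 5, 4, 7),
  var2 = c(99, 12, 32, 88, 9, 8, 1)
)

second_largest <- problem %>%
  group_by(id) %>%
  filter(n() >= 2) %>%                        # keep only groups with at least 2 rows
  arrange(desc(var1), .by_group = TRUE) %>%   # sort descending within each group
  slice(2) %>%                                # take the second row of each group
  ungroup()

second_largest
```

If single-row groups should instead be kept with a placeholder value, a different rule would be needed.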
If you wish to use slice, you can first use slice_max() to keep the two largest rows, then slice_tail() to drop the largest one.
library(dplyr)
problem %>%
group_by(id) %>%
slice_max(var1, n = 2) %>%
slice_tail(n = 1)
Or you can use a single filter:
problem %>% group_by(id) %>% filter(var1 == max(var1[var1 != max(var1)]))
Output
# A tibble: 2 × 3
# Groups: id [2]
id var1 var2
<dbl> <dbl> <dbl>
1 1 4 12
2 2 5 9
In case you have volume, here is a data.table approach.
library(data.table)
problem = data.frame(id = c(1,1,1,2,2,2), var1 = c(5,4,3, 6,5,4), var2 = c(99,12,32,88,9,8))
setDT(problem)
setorder(problem, id, -var1)
problem[, .SD[2], by=id]
As noted in @Paul Stafford Allen's comment, you will have an issue with groups that contain only one row.
After arranging by 'var1' in descending order, use slice with 2
library(dplyr)
problem %>%
arrange(id, desc(var1)) %>%
group_by(id) %>%
slice(2) %>%
ungroup
-output
# A tibble: 2 × 3
id var1 var2
<dbl> <dbl> <dbl>
1 1 4 12
2 2 5 9

How to count the number of times a specified variable appears in a dataframe column using dplyr?

Suppose we start with this very simple dataframe called myData:
> myData
Element Class
1 A 0
2 A 0
3 C 0
4 A 0
5 B 1
6 B 1
7 A 2
Generated by:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
How would I use dplyr to extract the number of times "A" appears in the Element column of the myData dataframe? I would simply like the number 4 returned, for further processing in dplyr. All I have so far is the dplyr code shown at the bottom, which seems clumsy because among other things it yields another dataframe with more information than just the number 4 that is needed:
# A tibble: 1 x 2
Element counted
<chr> <int>
1 A 4
The dplyr code that produces the above tibble:
library(dplyr)
myData %>% group_by(Element) %>% filter(Element == "A") %>% summarise(counted = n())
We can use count which simplifies the group_by + summarise step
library(dplyr)
myData %>%
filter(Element == 'A') %>%
count(Element, name = 'counted')
Or with just summarise and sum
myData %>%
summarise(counted = sum(Element == 'A'), Element = 'A') %>%
relocate(Element, .before = 1)
Element counted
1 A 4
Another option using tally like this:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
library(dplyr)
myData %>%
filter(Element == "A") %>%
group_by(Element) %>%
tally()
#> # A tibble: 1 × 2
#> Element n
#> <chr> <int>
#> 1 A 4
Created on 2022-07-28 by the reprex package (v2.0.1)
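The answers above all return a data frame. If only the bare number is wanted, as the question asks, one option is to finish the pipeline with pull(). A minimal sketch, where the names n_A and n_A_base are illustrative:

```r
library(dplyr)

myData <- data.frame(
  Element = c("A", "A", "C", "A", "B", "B", "A"),
  Class   = c(0, 0, 0, 0, 1, 1, 2)
)

# summarise() gives a one-row data frame; pull() extracts the column as a vector
n_A <- myData %>%
  summarise(counted = sum(Element == "A")) %>%
  pull(counted)

# Base R equivalent, without dplyr
n_A_base <- sum(myData$Element == "A")

n_A
```

Both n_A and n_A_base hold a single integer that can be used directly in later calculations.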

Organizing a data frame with multiple entries per sample

I have the following database with several entries per individual:
record_id<-c(21,21,21,15,15,15,2,2,2,2,3,3,3)
var<-c(0,0,0,1,0,0,1,1,0,0,1,1,0)
data<-data.frame(cbind(record_id,var))
I want to create a new data frame with just 1 row per record_id. The rule is that if any row for an individual (record_id) has data$var == 1, the outcome data frame must show 1 for that individual.
So, the outcome would be like this:
record_id<-c(21,15,2,3)
var<-c(0,1,1,1)
data_sol<-data.frame(cbind(record_id,var))
I have tried this:
DF1 <- data %>%
group_by(record_id) %>%
mutate(class = ifelse(var==1,1,0)) %>%
ungroup
I know it's not the best way; I was planning to obtain the unique values afterwards... But it did not do the trick.
If your 'var' is all zeroes or ones, you can also use max():
data%>%group_by(record_id)%>%
summarise(new_var=max(var))
# A tibble: 4 x 2
record_id new_var
<dbl> <dbl>
1 2 1
2 3 1
3 15 1
4 21 0
You can use mean() inside mutate() to detect whether any non-zero value exists within a group, like this:
data %>%
group_by(record_id) %>%
mutate(var = ifelse(mean(var)!=0,1,0)) %>%
distinct(record_id,var)
gives,
# A tibble: 4 x 2
# Groups: record_id [4]
# record_id var
# <dbl> <dbl>
# 1 21 0
# 2 15 1
# 3 2 1
# 4 3 1
We can do
library(dplyr)
data %>%
group_by(record_id) %>%
summarise(var = +(mean(var) != 0))
Or using slice
data %>%
group_by(record_id) %>%
slice_max(n = 1, order_by = var, with_ties = FALSE)
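The rule "1 if any row for the individual equals 1" can also be stated directly with any(). A minimal sketch using the question's data, with the result stored under the illustrative name data_sol:

```r
library(dplyr)

data <- data.frame(
  record_id = c(21, 21, 21, 15, 15, 15, 2, 2, 2, 2, 3, 3, 3),
  var       = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0)
)

data_sol <- data %>%
  group_by(record_id) %>%
  # any() is TRUE if at least one row in the group has var == 1;
  # as.integer() turns TRUE/FALSE into 1/0
  summarise(var = as.integer(any(var == 1)))

data_sol
```

Note that summarise() returns the groups sorted by record_id, not in their original order of appearance.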

Calculating percentage of increased and decreased values between factors

I'm looking for a way to calculate the change of scores between factors (for example, questionnaire scores between Pre and Post treatment). I want to figure out what percentage of participants improved and what percentage did not between Pre and Post.
I have looked at some dplyr solutions but I think I am missing a line of code from it but I am not sure.
ID<-c("aaa","bbb","ccc","ddd","eee","fff", "ggg","aaa","bbb","ccc","ddd","eee","fff", "ggg")
Score<-sample(40,14)
Pre_Post<-c(1,1,1,1,1,1,1,2,2,2,2,2,2,2)
df<-cbind(ID, Pre_Post, Score)
df$Score<-as.numeric(df$Score)
df<-as.data.frame(df)
#what I have tried
df2<-df%>%
group_by(ID, Pre_post)
mutate(Pct_change=mutate(Score/lead(Score)*100))
But I get error messages. As well, I wasn't confident that the code was right to begin with.
Expected outcome:-
What I want to achieve is getting the percentages of ID's that have improved. So in the case of the mock example that I have provided, only 42.86% of ID's have improved from Pre to Post, while 57.14% actually worsened between Pre and Post.
Any suggestions would be welcome :)
You have several typos, which is why you get an error.
You can do something like this to get old and new scores side by side:
library(tidyverse)
df %>%
spread(Pre_Post, Score) %>%
rename(Score_pre = `1`, Score_post = `2`)
ID Score_pre Score_post
1 aaa 19 24
2 bbb 39 35
3 ccc 2 29
4 ddd 38 15
5 eee 36 9
6 fff 23 10
7 ggg 21 27
To get the number of improvements you have to convert Score to numeric first:
df %>% as_tibble() %>%
mutate(Score = as.numeric(Score)) %>%
spread(Pre_Post, Score) %>%
rename(Score_pre = `1`, Score_post = `2`) %>%
mutate(improve = if_else(Score_pre > Score_post, "0", "1")) %>%
group_by(improve) %>%
summarise(n = n()) %>%
mutate(percentage = n / sum(n))
# A tibble: 2 x 3
improve n percentage
<chr> <int> <dbl>
1 0 3 0.429
2 1 4 0.571
Another option with dplyr, assuming each ID always has exactly two values with Pre coded as 1 and Post as 2: group by ID, subtract the first value from the second, and then calculate the share of positive and negative differences.
library(dplyr)
df %>%
arrange(ID, Pre_Post) %>%
group_by(ID) %>%
summarise(val = Score[2] - Score[1]) %>%
summarise(total_pos = sum(val > 0)/n(),
total_neg = sum(val < 0)/ n())
# A tibble: 1 x 2
# total_pos total_neg
# <dbl> <dbl>
#1 0.429 0.571
Data (Score is drawn with sample() and no seed, so the exact numbers differ between runs)
ID <- c("aaa","bbb","ccc","ddd","eee","fff", "ggg","aaa","bbb",
"ccc","ddd","eee","fff", "ggg")
Score <- sample(40,14)
Pre_Post <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2)
df <- data.frame(ID, Pre_Post, Score)
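Since sample() makes the results hard to check, here is a minimal reproducible sketch that fixes the scores to the values printed in the wide table of the first answer. It uses pivot_wider() in place of spread(); the column names pct_higher and pct_lower are illustrative, and treating a higher Post score as "improved" is an assumption about the questionnaire.

```r
library(dplyr)
library(tidyr)

ids <- c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg")
df <- data.frame(
  ID       = rep(ids, times = 2),
  Pre_Post = rep(c(1, 2), each = 7),
  # Pre scores first, then Post scores, in ID order
  Score    = c(19, 39, 2, 38, 36, 23, 21,
               24, 35, 29, 15, 9, 10, 27)
)

result <- df %>%
  # One row per ID with columns Score_1 (Pre) and Score_2 (Post)
  pivot_wider(names_from = Pre_Post, values_from = Score,
              names_prefix = "Score_") %>%
  summarise(
    pct_higher = mean(Score_2 > Score_1) * 100,  # share of IDs whose score went up
    pct_lower  = mean(Score_2 < Score_1) * 100   # share of IDs whose score went down
  )

result
```

Taking the mean of a logical vector gives the proportion of TRUE values, which avoids the separate count and division steps.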

rollsumr with window-length>1: filling missing values

My data frame looks something like the first two columns of the following
I want to add a third column, equal to the sum of the ID-group's last three observations for VAL.
Using the following command, I managed to get the output below:
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3)) %>%
ungroup()
ID VAL SUM
1 2 NA
1 1 NA
1 3 6
1 4 8
...
I am now hoping to be able to fill the NAs that result for the group's cells in the first two rows.
ID VAL SUM
1 2 2
1 1 3
1 3 6
1 4 8
...
How do I do that?
I have tried doing the following
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=min(3, row_number())) %>%
ungroup()
and
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3), fill = "extend") %>%
ungroup()
But both give me the same error, because I have groups of sizes <= 2.
Evaluation error: need at least two non-NA values to interpolate.
What do I do?
Alternatively, you can use rollapply() from the same package:
df %>%
group_by(ID) %>%
mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
ID VAL SUM
<int> <int> <int>
1 1 2 2
2 1 1 3
3 1 3 6
4 1 4 8
Because of partial = TRUE, the sum is also computed for the leading rows where fewer than three observations are available, using whatever values fall inside the window.
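The question's data frame is not shown in full, so here is a self-contained sketch with made-up data that includes a group of size 2. The values for ID 2 are hypothetical, and rollapplyr() is zoo's right-aligned wrapper around rollapply().

```r
library(dplyr)
library(zoo)

# ID 1 matches the rows shown in the question; ID 2 is a hypothetical short group
df <- data.frame(
  ID  = c(1, 1, 1, 1, 2, 2),
  VAL = c(2, 1, 3, 4, 5, 6)
)

out <- df %>%
  group_by(ID) %>%
  # partial = TRUE sums over whatever part of the 3-row window is available
  mutate(SUM = rollapplyr(VAL, width = 3, FUN = sum, partial = TRUE)) %>%
  ungroup()

out
```

With partial windows, a group shorter than the window width simply gets running totals over the rows it has.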
Not a direct answer but one way would be to replace the values which are NAs with cumsum of VAL
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
# ID VAL SUM
# <int> <int> <int>
#1 1 2 2
#2 1 1 3
#3 1 3 6
#4 1 4 8
Or, since you know the window size beforehand, you could check with row_number() as well
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))

Resources