How to get tally (rolling sum) by group in R?

I would like to create a column called "tally" in my dataset that sums the count within each type, following the rank order.
type <- c("A","A","A","B","B","C")
rank <- c("low", "med", "high","med", "high", "low")
count <- c(9,20,31,2,4,14)
df <- data.frame(type, rank, count)
My desired output would be:
type rank count tally
1 A low 9 9
2 A med 20 29
3 A high 31 60
4 B med 2 2
5 B high 4 6
6 C low 14 14
I guess another way to describe it would be a rolling sum (where it takes into account the low to high order)? I have looked around but I can't find any good functions to do this. Ideally, I could have a for loop that would allow me to get this "rolling sum" by type.

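We can use cumsum after grouping by 'type'. This assumes the rows are already in low-to-high rank order within each type, as they are in the example.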
library(dplyr)
df <- df %>%
group_by(type) %>%
mutate(tally = cumsum(count)) %>%
ungroup
-output
# A tibble: 6 x 4
type rank count tally
<chr> <chr> <dbl> <dbl>
1 A low 9 9
2 A med 20 29
3 A high 31 60
4 B med 2 2
5 B high 4 6
6 C low 14 14
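If the rows might not already be sorted, the same cumulative sum can be sketched in base R. This is only an illustration: it assumes the intended order is low < med < high, which is made explicit by turning rank into a factor before sorting.

```r
type <- c("A", "A", "A", "B", "B", "C")
rank <- c("low", "med", "high", "med", "high", "low")
count <- c(9, 20, 31, 2, 4, 14)
df <- data.frame(type, rank, count)

# Assumed order: low < med < high (without this, sorting would be alphabetical)
df$rank <- factor(df$rank, levels = c("low", "med", "high"))
df <- df[order(df$type, df$rank), ]

# ave() applies cumsum() separately within each type
df$tally <- ave(df$count, df$type, FUN = cumsum)
print(df)
```

Sorting first matters because cumsum simply accumulates in row order; it knows nothing about what "low" or "high" mean.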

Related

Creating a Subset from a Dataframe using a Group Based on summary values

count(df1,age,gender)
age gender n
25 M 4
32 F 3
full_df
patient_ID age gender
pt1 23 M
pt2 26 F
...
I would like to create a 4:1 age/sex-matched subset of full_df based on the counts in df1. For example, df1 has 4 male patients aged 25, so I would like to pull 16 random 25-year-old males from full_df, and likewise 12 random 32-year-old females.
I need to find a way to shuffle full_df, then add 1:len(group) to it as follows:
patient_ID age gender order
pt100 25 M 1
pt251 25 M 2
pt201 25 M 3
...
pt376 26 M 1
pt872 26 M 2
pt563 26 M 3
...
I have created a small example for you based only on age (since there was no example df available this saves a lot of typing) but you can easily add gender to the method.
First we join the dataframe with the count information to the full dataframe, and then sample the number of rows per age group (in this example 2 times n, you would want to do 4 times n but my df is too small).
Then we add a new column 'order' with numbers ranging from 1 to the number of samples and lastly drop the 'n' column.
library(dplyr)
df1 = data.frame(age = c(25,32),
n = c(1,2))
df = data.frame(patient_ID = 1:10,
age = c(rep(25,4),rep(32,6)))
df %>%
left_join(df1, by = 'age') %>%
group_by(age) %>%
sample_n(n*2) %>%
mutate(order = 1:n()) %>%
ungroup() %>%
select(-n)
This gives the output with the selected patients (in line with the numbers in df1):
# A tibble: 6 x 3
patient_ID age order
<int> <dbl> <int>
1 4 25 1
2 2 25 2
3 10 32 1
4 9 32 2
5 7 32 3
6 8 32 4
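Adding gender to the method could look something like the base R sketch below. The data, the seed, and the names counts, ratio and picked are made up for illustration; the ratio is 2 rather than 4 only because the toy data is small.

```r
set.seed(1)  # hypothetical seed, only so the sketch is repeatable

# Hypothetical stand-ins for df1 (counts per stratum) and full_df
counts <- data.frame(age = c(25, 32), gender = c("M", "F"), n = c(1, 2))
full_df <- data.frame(
  patient_ID = paste0("pt", 1:20),
  age = rep(c(25, 32), each = 10),
  gender = rep(c("M", "F"), times = 10)
)
ratio <- 2  # use 4 for a 4:1 match when enough patients are available

# Inner join: keeps only patients whose age/gender stratum appears in counts
merged <- merge(full_df, counts, by = c("age", "gender"))

# Sample n * ratio rows per stratum and number them 1..k
strata <- split(merged, list(merged$age, merged$gender), drop = TRUE)
picked <- do.call(rbind, lapply(strata, function(g) {
  g <- g[sample(nrow(g), g$n[1] * ratio), ]
  g$order <- seq_len(nrow(g))
  g
}))
picked$n <- NULL
rownames(picked) <- NULL
print(picked)
```

sample() will stop with an error if a stratum has fewer than n * ratio patients, which is probably the behaviour you want for a matched design.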

How to sum rows based on multiple conditions and replace it in the dataframe?

R beginner here in need of some help. I have this dataframe:
dat<-data.frame(Name=c("A","A","A","A","A","B","B","B","B","B","C","C","C","C","C"),
Score=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
Frequency=c(9,11,10,5,5,3,7,10,5,5,20,3,3,2,2))
And I want to sum the frequencies of rows with scores 2-3 and 4-5 by name, and rename the scores High (score 1), Medium (scores 2-3) or Low (scores 4-5). Basically my dataframe should look like this:
Is there a more straightforward way to do this? Thanks a lot!
Here is a base R approach.
First, create Category based on the Score using cut:
dat$Category <- cut(dat$Score,
breaks = c(1, 2, 4, 5),
labels = c("High", "Medium", "Low"),
include.lowest = TRUE,
right = FALSE)
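As a quick check of how these breaks behave, here is a small sketch that applies the same cut call to the five possible scores. With right = FALSE the intervals are closed on the left, and include.lowest = TRUE also closes the last interval on the right, so the bins are [1,2), [2,4) and [4,5]. The name scores is just for illustration.

```r
scores <- 1:5
bins <- cut(scores,
            breaks = c(1, 2, 4, 5),
            labels = c("High", "Medium", "Low"),
            include.lowest = TRUE,
            right = FALSE)

# One row per score with the category it falls into
print(data.frame(Score = scores, Category = bins))
```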
Then you can aggregate based on both Name and Category to get the final result:
aggregate(Frequency ~ Name + Category, data = dat, sum)
Output
Name Category Frequency
1 A High 9
2 B High 3
3 C High 20
4 A Medium 21
5 B Medium 17
6 C Medium 6
7 A Low 10
8 B Low 10
9 C Low 4
You could first use case_when to convert the score to the right class, and then group_by and summarise your data like this:
library(dplyr)
dat %>%
mutate(Score = case_when(Score == 1 ~ "High",
Score %in% c(2,3) ~ "Medium",
TRUE ~ "Low")) %>%
group_by(Name, Score) %>%
summarise(Frequency = sum(Frequency))
#> `summarise()` has grouped output by 'Name'. You can override using the
#> `.groups` argument.
#> # A tibble: 9 × 3
#> # Groups: Name [3]
#> Name Score Frequency
#> <chr> <chr> <dbl>
#> 1 A High 9
#> 2 A Low 10
#> 3 A Medium 21
#> 4 B High 3
#> 5 B Low 10
#> 6 B Medium 17
#> 7 C High 20
#> 8 C Low 4
#> 9 C Medium 6
Created on 2023-01-11 with reprex v2.0.2

R data imputation from group_by table [duplicate]

This question already has answers here:
How to replace NA with mean by group / subset?
(5 answers)
Closed 7 months ago.
group = c(1,1,4,4,4,5,5,6,1,4,6)
animal = c('a','b','c','c','d','a','b','c','b','d','c')
sleep = c(14,NA,22,15,NA,96,100,NA,50,2,1)
test = data.frame(group, animal, sleep)
print(test)
group_animal = test %>% group_by(`group`, `animal`) %>% summarise(mean_sleep = mean(sleep, na.rm = T))
I would like to replace the NA values in the sleep column with the mean sleep value for the matching group and animal.
Is there any way that I can perform some sort of lookup like Excel that matches group and animal from the test dataframe to the group_animal dataframe and replaces the NA value in the sleep column from the test df with the sleep value in the group_animal df?
We could use mutate instead of summarise as summarise returns a single row per group
library(dplyr)
library(tidyr)
test <- test %>%
group_by(group, animal) %>%
mutate(sleep = replace_na(sleep, mean(sleep, na.rm = TRUE))) %>%
ungroup
-output
test
# A tibble: 11 × 3
group animal sleep
<dbl> <chr> <dbl>
1 1 a 14
2 1 b 50
3 4 c 22
4 4 c 15
5 4 d 2
6 5 a 96
7 5 b 100
8 6 c 1
9 1 b 50
10 4 d 2
11 6 c 1
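For comparison, the same group-wise imputation can be sketched in base R with ave(), which also returns one value per row rather than one per group. The helper name fill_with_mean is made up for this example.

```r
group <- c(1, 1, 4, 4, 4, 5, 5, 6, 1, 4, 6)
animal <- c("a", "b", "c", "c", "d", "a", "b", "c", "b", "d", "c")
sleep <- c(14, NA, 22, 15, NA, 96, 100, NA, 50, 2, 1)
test <- data.frame(group, animal, sleep)

# Hypothetical helper: replace NAs in x with the mean of the non-missing values of x
fill_with_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() runs the helper within each group/animal combination
test$sleep <- ave(test$sleep, test$group, test$animal, FUN = fill_with_mean)
print(test)
```

If a group/animal combination has no observed values at all, the mean is NaN, so those rows would be filled with NaN rather than a number.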

Filling in non-existing rows in R + dplyr [duplicate]

This question already has answers here:
Proper idiom for adding zero count rows in tidyr/dplyr
(6 answers)
Closed 2 years ago.
Apologies if this is a duplicate question, I saw some questions which were similar to mine, but none exactly addressing my problem.
My data look basically like this:
FiscalWeek <- as.factor(c(45, 46, 48, 48, 48))
Group <- c("A", "A", "A", "B", "C")
Amount <- c(1, 1, 1, 5, 6)
df <- tibble(FiscalWeek, Group, Amount)
df
# A tibble: 5 x 3
FiscalWeek Group Amount
<fct> <chr> <dbl>
1 45 A 1
2 46 A 1
3 48 A 1
4 48 B 5
5 48 C 6
Note that FiscalWeek is a factor. So, when I take a weekly average by Group, I get this:
library(dplyr)
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount))
averages
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 1
2 B 5
3 C 6
But, this is actually a four-week period. Nothing at all happened in Week 47, and groups B and C didn't show data in weeks 45 and 46, but I still want averages that reflect the existence of those weeks. So I need to fill out my original data with zeroes such that this is my desired result:
DesiredGroup <- c("A", "B", "C")
DesiredAvgs <- c(0.75, 1.25, 1.5)
Desired <- tibble(DesiredGroup, DesiredAvgs)
Desired
# A tibble: 3 x 2
DesiredGroup DesiredAvgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
What is the best way to do this using dplyr?
Up front: missing data to me is very different from 0. I'm assuming that you "know" with certainty that missing data should bring all other values down.
The name FiscalWeek suggests integer-like data, but your use of factor suggests ordinal or categorical data. Because of that, you need to define authoritatively what the complete set of levels can be. And because your current factor does not contain all possible levels, I'll infer them (you need to adjust your all_groups_weeks accordingly):
all_groups_weeks <- tidyr::expand_grid(FiscalWeek = as.factor(45:48), Group = c("A", "B", "C"))
all_groups_weeks
# # A tibble: 12 x 2
# FiscalWeek Group
# <fct> <chr>
# 1 45 A
# 2 45 B
# 3 45 C
# 4 46 A
# 5 46 B
# 6 46 C
# 7 47 A
# 8 47 B
# 9 47 C
# 10 48 A
# 11 48 B
# 12 48 C
From here, join in the full data in order to "complete" it. Calling tidyr::complete on the columns as they are won't work, because week 47 appears neither in the data nor in the factor's levels.
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0))
# # A tibble: 12 x 3
# FiscalWeek Group Amount
# <fct> <chr> <dbl>
# 1 45 A 1
# 2 46 A 1
# 3 48 A 1
# 4 48 B 5
# 5 48 C 6
# 6 45 B 0
# 7 45 C 0
# 8 46 B 0
# 9 46 C 0
# 10 47 A 0
# 11 47 B 0
# 12 47 C 0
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
mutate(Amount = coalesce(Amount, 0)) %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount, na.rm = TRUE))
# # A tibble: 3 x 2
# Group Avgs
# <chr> <dbl>
# 1 A 0.75
# 2 B 1.25
# 3 C 1.5
You can try this. I hope this helps.
library(dplyr)
#Define range
df %>% mutate(FiscalWeek=as.numeric(as.character(FiscalWeek))) -> df
range <- length(seq(min(df$FiscalWeek),max(df$FiscalWeek),by=1))
#Aggregation
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = sum(Amount)/range)
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
You can do it without filling if you know number of weeks:
df %>%
group_by(Group) %>%
summarise(Avgs = sum(Amount) / length(45:48))
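Another way to look at it, sketched here in base R: if the factor is created with all four weeks as levels, a cross-tabulation fills the missing week/group cells with zeros on its own. This assumes, as above, that the full period is weeks 45 to 48; totals and avgs are illustrative names.

```r
df <- data.frame(
  # Declaring levels 45:48 up front makes week 47 a known (empty) level
  FiscalWeek = factor(c(45, 46, 48, 48, 48), levels = 45:48),
  Group = c("A", "A", "A", "B", "C"),
  Amount = c(1, 1, 1, 5, 6)
)

# Week-by-group totals; combinations with no rows get 0
totals <- xtabs(Amount ~ FiscalWeek + Group, data = df)
print(totals)

# Average over all four weeks, per group
avgs <- colMeans(totals)
print(avgs)
```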

Manually calculate variance from count data for categorical ratings

I am trying to manually calculate the variance (and mean) from categorical rating count data.
Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)
Data <- data.frame(Item=Item, Never=cat1,Rarely=cat2,Occasionally=cat3, Sometimes=cat4,Frequently=cat5,Usually=cat6,Always=cat7,stringsAsFactors=FALSE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always
1 A 4 NA 17 10 3 2 7
2 B 12 10 5 12 21 14 NA
3 C 17 20 12 17 NA 12 18
4 D NA 15 6 NA 16 20 23
Each categorical rating has an equivalent numeric value (1:7). I have calculated the average numerical rating for each Item as follows:
Rating_wt <- 1:7 # Vector of weights for each frequency rating
Rating.wt.mat <- rep(Rating_wt,each=dim(Data[,2:8])[1])
Data$Avg_rating <- rowSums(Data[,2:8]*Rating.wt.mat,na.rm=TRUE)/rowSums(Data[,2:8],na.rm=TRUE)
Data
Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg_rating
1 A 4 NA 17 10 3 2 7 3.976744
2 B 12 10 5 12 21 14 NA 3.837838
3 C 17 20 12 17 NA 12 18 3.739583
4 D NA 15 6 NA 16 20 23 5.112500
I would like to also calculate the variance for each Average and store that as a new variable in Data.
I believe I need to subtract the Average for each item from each numeric rating and multiply that value by the count in each respective cell, then sum those results across rows, then divide by the total counts in each row.
But, I can't figure out how to set up the element-wise calculations to accomplish that.
Conceptually, I think it should be something like this:
Data$Rating_var <- rowSums((Numeric_Rating - Avg_rating)*Value,na.rm=TRUE)/rowSums(Data[,2:8],na.rm=TRUE)
Where Numeric_Rating corresponds to Rating_wt:
Never = 1
Rarely = 2
Occasionally = 3
Sometimes = 4
Frequently = 5
Usually = 6
Always = 7
and Value is the corresponding cell for each Numeric_Rating by Item intersection.
I'd suggest you try to reshape your dataset before you apply your calculations, as it will be easier.
library(dplyr)
library(tidyr)
Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)
Data <- data.frame(Item=Item, Never=cat1,Rarely=cat2,Occasionally=cat3, Sometimes=cat4,Frequently=cat5,Usually=cat6,Always=cat7,stringsAsFactors=FALSE)
Data %>%
gather(category, value, -Item) %>% # reshape dataset
mutate(Rating = recode(category, "Never"=1,"Rarely" = 2,"Occasionally" = 3,
"Sometimes" = 4,"Frequently" = 5,
"Usually" = 6,"Always" = 7)) %>% # assign rating
group_by(Item) %>% # for each item
mutate(Avg = sum(Rating*value, na.rm=T) / sum(value, na.rm=T), # calculate Avg
variance = sum(abs(Rating - Avg)*value, na.rm=T) / sum(value, na.rm=T)) %>% # mean absolute deviation from Avg, not a squared-deviation variance despite the column name
ungroup() %>% # forget the grouping
select(-Rating) %>% # no need the rating any more
spread(category, value) %>% # reshape back to original form
select(all_of(names(Data)), Avg, variance) # get columns in the desired order
# # A tibble: 4 x 10
# Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg variance
# * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 4 NA 17 10 3 2 7 3.976744 1.326122
# 2 B 12 10 5 12 21 14 NA 3.837838 1.530314
# 3 C 17 20 12 17 NA 12 18 3.739583 1.879991
# 4 D NA 15 6 NA 16 20 23 5.112500 1.529062
Try to run the piped process step by step to see how it works, especially if you're not familiar with the dplyr and tidyr syntax.
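Note that a variance in the usual sense uses squared deviations from the mean rather than absolute ones. A minimal base R sketch of that calculation follows, assuming the population form (dividing by the total count, not the count minus one) and treating NA cells as zero responses; ratings, counts and sq_dev are names chosen for illustration.

```r
Data <- data.frame(
  Item = c("A", "B", "C", "D"),
  Never = c(4, 12, 17, NA), Rarely = c(NA, 10, 20, 15),
  Occasionally = c(17, 5, 12, 6), Sometimes = c(10, 12, 17, NA),
  Frequently = c(3, 21, NA, 16), Usually = c(2, 14, 12, 20),
  Always = c(7, NA, 18, 23), stringsAsFactors = FALSE
)

ratings <- 1:7                       # numeric value of each category
counts <- as.matrix(Data[, 2:8])
counts[is.na(counts)] <- 0           # assumption: NA means no responses

n <- rowSums(counts)                 # total responses per item
avg <- as.vector(counts %*% ratings) / n

# sq_dev[i, j] = (rating j - mean of item i)^2
sq_dev <- outer(avg, ratings, function(a, r) (r - a)^2)

Data$Avg_rating <- avg
Data$Rating_var <- rowSums(counts * sq_dev) / n
print(Data[, c("Item", "Avg_rating", "Rating_var")])
```

For a sample variance, divide by n - 1 instead of n in the last step.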
