Use `dplyr` to divide rows by group

In my attempt to learn dplyr, I want to divide each row by another row representing the corresponding group's total.
I generated test data with
library(dplyr)
# building test data
data("OrchardSprays")
totals <- OrchardSprays %>%
  group_by(treatment) %>%
  summarise(decrease = sum(decrease))
totals$decrease <- totals$decrease + seq(10, 80, 10)
totals$rowpos <- totals$colpos <- "total"
df <- rbind(OrchardSprays, totals)
Note the line totals$decrease <- totals$decrease + seq(10, 80, 10): for the sake of the question, I assumed there was an additional decrease for each treatment that is not observed in the individual rows of the data frame but only in the "total" row for each group.
What I now want to do is add another column, decrease_share, to the data frame, where each row's decrease value is divided by the corresponding treatment group's total decrease value.
So, for head(df) I would expect an output like this
> head(df)
decrease rowpos colpos treatment treatment_decrease
1 57 1 1 D 0.178125
2 95 2 1 E 0.1711712
3 8 3 1 B 0.09876543
4 69 4 1 H 0.08603491
5 92 5 1 G 0.1488673
6 90 6 1 F 0.1470588
My real-world example is a bit more complex (more grouping variables and more levels), so I am looking for a suitable solution in dplyr.

Here's a pure dplyr approach:
library(dplyr) # version >= 1.0.0
OrchardSprays %>%
  group_by(treatment) %>%
  summarise(decrease = sum(decrease)) %>%
  mutate(decrease = decrease + seq(10, 80, 10),
         rowpos = "total",
         colpos = "total") %>%
  bind_rows(mutate(OrchardSprays, across(rowpos:colpos, as.character))) %>%
  group_by(treatment) %>%
  mutate(treatment_decrease = decrease / decrease[rowpos == "total"])
# A tibble: 72 x 5
# Groups: treatment [8]
treatment decrease rowpos colpos treatment_decrease
<fct> <dbl> <chr> <chr> <dbl>
1 A 47 total total 1
2 B 81 total total 1
3 C 232 total total 1
4 D 320 total total 1
5 E 555 total total 1
6 F 612 total total 1
7 G 618 total total 1
8 H 802 total total 1
9 D 57 1 1 0.178
10 E 95 2 1 0.171
# … with 62 more rows
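The indexing trick decrease[rowpos == "total"] is the heart of this answer. A minimal sketch on made-up toy data (the names grp, label, and value are mine, not from the question) isolates the pattern: within each group, divide every row's value by that group's "total" row.

```r
library(dplyr)

# Toy data (invented for illustration): each group carries a "total" row.
toy <- tibble(
  grp   = c("a", "a", "a", "b", "b", "b"),
  label = c("x", "y", "total", "x", "y", "total"),
  value = c(2, 3, 5, 5, 15, 20)
)

toy %>%
  group_by(grp) %>%
  mutate(share = value / value[label == "total"]) %>%
  ungroup()
#> share: 0.40 0.60 1.00 0.25 0.75 1.00
```

With more grouping variables, the same pattern extends by listing them all in group_by().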

Related

Creating a Subset from a Dataframe using a Group Based on summary values

count(df1,age,gender)
age gender n
25 M 4
32 F 3
full_df
patient_ID age gender
pt1 23 M
pt2 26 F
...
I would like to create a 4:1 age/sex-matched subset of full_df based on the count stats in df1. For example, I have 4 male patients aged 25 in df1, so I would like to pull 16 random patients from full_df, and likewise 12 females aged 32.
I need to find a way to shuffle full_df and then add an index 1:n() within each group, as follows:
patient_ID age gender order
pt100 25 M 1
pt251 25 M 2
pt201 25 M 3
...
pt376 26 M 1
pt872 26 M 2
pt563 26 M 3
...
I have created a small example for you based only on age (since no example df was available, this saves a lot of typing), but you can easily add gender to the method.
First we join the data frame with the count information to the full data frame, then sample the number of rows per age group (in this example 2 times n; you would want 4 times n, but my df is too small).
Then we add a new column 'order' with numbers ranging from 1 to the number of samples, and lastly drop the 'n' column.
df1 <- data.frame(age = c(25, 32),
                  n = c(1, 2))
df <- data.frame(patient_ID = 1:10,
                 age = c(rep(25, 4), rep(32, 6)))
df %>%
  left_join(df1, by = 'age') %>%
  group_by(age) %>%
  sample_n(n * 2) %>%
  mutate(order = 1:n()) %>%
  ungroup() %>%
  select(-n)
This gives the output with the selected patients (in line with the numbers in df1):
# A tibble: 6 x 3
patient_ID age order
<int> <dbl> <int>
1 4 25 1
2 2 25 2
3 10 32 1
4 9 32 2
5 7 32 3
6 8 32 4
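As a hedged sketch of "easily add gender", the same method with both matching variables and the question's 4:1 ratio might look like this. The data here are invented for illustration; only the column names follow the question.

```r
library(dplyr)

# Invented example data: 10 patients per age/gender stratum.
set.seed(1)
df1 <- data.frame(age = c(25, 32), gender = c("M", "F"), n = c(1, 2))
full_df <- data.frame(patient_ID = paste0("pt", 1:40),
                      age    = rep(c(25, 32), each = 20),
                      gender = rep(c("M", "F"), times = 20))

full_df %>%
  inner_join(df1, by = c("age", "gender")) %>%  # drop unmatched strata
  group_by(age, gender) %>%
  sample_n(n * 4) %>%                           # 4 controls per case
  mutate(order = 1:n()) %>%
  ungroup() %>%
  select(-n)
```

inner_join (rather than left_join) also discards age/gender combinations that never occur in df1, so sample_n never sees a missing n.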

Get two or more rows within the same round that have value difference less than 20%

Suppose I have data like this:
# Data frame
df <- data.frame(round = factor(c(rep(1, 4), rep(2, 3), rep(3, 4), rep(4, 2))),
                 value = c(100, 150, 200, 250, 200, 160, 900, 500, 600, 900, 1200, 100, 120),
                 SE = c(1.3, 1.5, 0.7, 2, 1, 2, 1, 1, 1, 0.5, 0.75, 20, 3))
df
round value SE
1 1 100 1.30
2 1 150 1.50
3 1 200 0.70
4 1 250 2.00
5 2 200 1.00
6 2 160 2.00
7 2 900 1.00
8 3 500 1.00
9 3 600 1.00
10 3 900 0.50
11 3 1200 0.75
12 4 100 20.00
13 4 120 3.00
I want to get the 2 or more rows within the same round whose values differ by less than 20% (for example: in round 1, all rows will be excluded; in round 2, the row with value = 900 will be excluded; and in round 3, the rows with value = 900 and 1200 will be excluded).
What I have tried so far is:
library(dplyr)
df %>%
  group_by(round) %>%
  mutate(medians = median(value),
         deviation = abs(value - medians) * 100 / medians) %>%
  mutate(rowcounts = n()) %>%
  mutate(passORfailed = ifelse(
    rowcounts == 2,
    ifelse((max(value) - min(value)) * 100 / max(value) > 20, "failed", "pass"),
    ifelse(deviation > 20, "failed", "pass"))) %>%
  filter(passORfailed != "failed") %>%
  filter(sum(rowcounts) != 1)
Result:
# A tibble: 8 x 7
# Groups: round [4]
round value SE medians deviation rowcounts passORfailed
<fct> <dbl> <dbl> <dbl> <dbl> <int> <chr>
1 1 150 1.5 175 14.3 4 pass # -> not right
2 1 200 0.7 175 14.3 4 pass # -> not right
3 2 200 1 200 0 3 pass # -> ok
4 2 160 2 200 20 3 pass # -> ok
5 3 600 1 750 20 4 pass # -> not right (500 was excluded)
6 3 900 0.5 750 20 4 pass # -> not right
7 4 100 20 110 9.09 2 pass # -> ok
8 4 120 3 110 9.09 2 pass # -> ok
As you can see, when the row count is even and greater than 3, things go wrong. The problem is that with an even number of rows the median falls between the two center values (their mean), so the deviation is computed against a value that does not exist in the data. Is there any way to adjust the code so it works in all situations?
If possible, how can I adjust the code to calculate this data within range of value +- SE ?
Apologies if the question is not clear, but I've tried my best to explain.
Regards
Here's an approach where we generate every possible pair within a round and then filter for just the rows within 20% of each other. The logic is a little different from yours, so it results in fewer matches, but it may be useful as an alternative, e.g. with a different threshold such as +/- 35% instead of +/- 20%.
df <- df %>% mutate(row = row_number())
df %>%
  left_join(df, by = "round") %>%
  mutate(ratio = value.x / value.y) %>%
  filter(row.x != row.y,
         between(ratio, 1 / 1.2, 1.2))
Here's a variation that addresses the 2nd part of the question. I calculate the value +/- SE for each and find the row pairs within each round that overlap.
df <- df %>%
  mutate(row = row_number()) %>%
  mutate(low = value - SE,
         high = value + SE)
df %>%
  left_join(df, by = "round") %>%
  filter(row.x != row.y,
         (high.x >= low.y & high.x <= high.y) | (low.x >= low.y & low.x <= high.y))
round value.x SE.x row.x low.x high.x value.y SE.y row.y low.y high.y
1 4 100 20 12 80 120 120 3 13 117 123
2 4 120 3 13 117 123 100 20 12 80 120
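The self-join above yields row *pairs*, while the question asks for the qualifying rows themselves. A hedged follow-up sketch: collapse the pair table back to distinct rows with distinct(), which drops each qualifying row's duplicates.

```r
library(dplyr)

# Question's data, plus a row id for the self-join.
df <- data.frame(round = factor(c(rep(1, 4), rep(2, 3), rep(3, 4), rep(4, 2))),
                 value = c(100, 150, 200, 250, 200, 160, 900, 500, 600, 900, 1200, 100, 120),
                 SE = c(1.3, 1.5, 0.7, 2, 1, 2, 1, 1, 1, 0.5, 0.75, 20, 3)) %>%
  mutate(row = row_number())

df %>%
  left_join(df, by = "round") %>%
  filter(row.x != row.y,
         between(value.x / value.y, 1 / 1.2, 1.2)) %>%  # within 20% of a partner
  distinct(round, row = row.x, value = value.x, SE = SE.x)
```

As the answer notes, this stricter pairwise criterion keeps fewer rows than the median-based attempt (here only the 500/600 pair in round 3 and the 100/120 pair in round 4).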

Filling in non-existing rows in R + dplyr [duplicate]

This question already has answers here:
Proper idiom for adding zero count rows in tidyr/dplyr
(6 answers)
Closed 2 years ago.
Apologies if this is a duplicate question, I saw some questions which were similar to mine, but none exactly addressing my problem.
My data look basically like this:
library(tibble)
FiscalWeek <- as.factor(c(45, 46, 48, 48, 48))
Group <- c("A", "A", "A", "B", "C")
Amount <- c(1, 1, 1, 5, 6)
df <- tibble(FiscalWeek, Group, Amount)
df
# A tibble: 5 x 3
FiscalWeek Group Amount
<fct> <chr> <dbl>
1 45 A 1
2 46 A 1
3 48 A 1
4 48 B 5
5 48 C 6
Note that FiscalWeek is a factor. So, when I take a weekly average by Group, I get this:
library(dplyr)
averages <- df %>%
group_by(Group) %>%
summarize(Avgs = mean(Amount))
averages
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 1
2 B 5
3 C 6
But, this is actually a four-week period. Nothing at all happened in Week 47, and groups B and C didn't show data in weeks 45 and 46, but I still want averages that reflect the existence of those weeks. So I need to fill out my original data with zeroes such that this is my desired result:
DesiredGroup <- c("A", "B", "C")
DesiredAvgs <- c(0.75, 1.25, 1.5)
Desired <- tibble(DesiredGroup, DesiredAvgs)
Desired
# A tibble: 3 x 2
DesiredGroup DesiredAvgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
What is the best way to do this using dplyr?
Up front: missing data, to me, is very different from 0. I'm assuming that you "know" with certainty that missing data should bring the other values down.
The name FiscalWeek suggests integer-like data, but your use of a factor suggests ordinal or categorical data. Because of that, you need to define authoritatively what the complete set of levels is. And because your current factor does not contain all possible levels, I'll infer them (you will need to adjust all_groups_weeks accordingly):
all_groups_weeks <- tidyr::expand_grid(FiscalWeek = as.factor(45:48), Group = c("A", "B", "C"))
all_groups_weeks
# # A tibble: 12 x 2
# FiscalWeek Group
# <fct> <chr>
# 1 45 A
# 2 45 B
# 3 45 C
# 4 46 A
# 5 46 B
# 6 46 C
# 7 47 A
# 8 47 B
# 9 47 C
# 10 48 A
# 11 48 B
# 12 48 C
From here, join in the full data in order to "complete" it. Plain tidyr::complete won't work here because the data does not contain all possible values (week 47 is missing).
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
  mutate(Amount = coalesce(Amount, 0))
# # A tibble: 12 x 3
# FiscalWeek Group Amount
# <fct> <chr> <dbl>
# 1 45 A 1
# 2 46 A 1
# 3 48 A 1
# 4 48 B 5
# 5 48 C 6
# 6 45 B 0
# 7 45 C 0
# 8 46 B 0
# 9 46 C 0
# 10 47 A 0
# 11 47 B 0
# 12 47 C 0
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
  mutate(Amount = coalesce(Amount, 0)) %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount, na.rm = TRUE))
# # A tibble: 3 x 2
# Group Avgs
# <chr> <dbl>
# 1 A 0.75
# 2 B 1.25
# 3 C 1.5
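The note above that plain complete() cannot invent week 47 from the data alone can be worked around by supplying the full week range explicitly, which complete() accepts as a named argument. A hedged alternative sketch (FiscalWeek is converted to integer first to sidestep factor-level mismatches during the expansion):

```r
library(dplyr)
library(tidyr)

# Question's data.
FiscalWeek <- as.factor(c(45, 46, 48, 48, 48))
Group <- c("A", "A", "A", "B", "C")
Amount <- c(1, 1, 1, 5, 6)
df <- tibble(FiscalWeek, Group, Amount)

df %>%
  mutate(FiscalWeek = as.integer(as.character(FiscalWeek))) %>%
  complete(FiscalWeek = 45:48, Group, fill = list(Amount = 0)) %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount))
#> A 0.75, B 1.25, C 1.5
```

This reaches the same desired result while keeping the "complete set of levels" decision explicit in the 45:48 vector.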
You can try this. I hope this helps.
library(dplyr)
# Define range
df <- df %>% mutate(FiscalWeek = as.numeric(as.character(FiscalWeek)))
range <- length(seq(min(df$FiscalWeek), max(df$FiscalWeek), by = 1))
# Aggregation
averages <- df %>%
  group_by(Group) %>%
  summarize(Avgs = sum(Amount) / range)
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
You can do it without filling if you know the number of weeks:
df %>%
  group_by(Group) %>%
  summarise(Avgs = sum(Amount) / length(45:48))

Cumulative sum by ID with lag

I want to create a cumulative sum by id, but it should not include the value of the row where it is being calculated.
I've already tried cumsum. However, I do not know how to specify that the amount of the current row should not be added. The result column I am looking for is the third column, "sum".
For example, for id 1 the first row has sum = 0, because its own value should not be added. For id 1, row 2, sum = 100, because the amount for id 1 prior to row 2 was 100, and so on.
id amount sum
1: 1 100 0
2: 1 20 100
3: 1 150 120
4: 2 60 0
5: 2 100 60
6: 1 30 270
7: 2 40 160
This is what I've tried:
df[, sum := cumsum(amount), by = "id"]
data:
df <- data.table(id = c(1, 1, 1, 2, 2, 1, 2),
                 amount = c(100, 20, 150, 60, 100, 30, 40),
                 sum = c(0, 100, 120, 0, 60, 270, 160),
                 stringsAsFactors = FALSE)
You can do this without using lag:
df %>%
  group_by(id) %>%
  mutate(sum = cumsum(amount) - amount)
# A tibble: 7 x 3
# Groups: id [2]
id amount sum
<dbl> <dbl> <dbl>
#1 1 100 0
#2 1 20 100
#3 1 150 120
#4 2 60 0
#5 2 100 60
#6 1 30 270
#7 2 40 160
With dplyr -
df %>%
  group_by(id) %>%
  mutate(sum = lag(cumsum(amount), default = 0)) %>%
  ungroup()
# A tibble: 7 x 3
id amount sum
<dbl> <dbl> <dbl>
1 1 100 0
2 1 20 100
3 1 150 120
4 2 60 0
5 2 100 60
6 1 30 270
7 2 40 160
Thanks to @thelatemail, here's the data.table version:
df[, sum := cumsum(shift(amount, fill=0)), by=id]
Here is an option in base R
df$Sum <- with(df, ave(amount, id, FUN = cumsum) - amount)
df$Sum
#[1] 0 100 120 0 60 270 160
Or, dropping the last observation within each group, take the cumsum of the remaining values:
with(df, ave(amount, id, FUN = function(x) c(0, cumsum(x[-length(x)]))))
You can shift the values you're summing by using the lag function.
library(tidyverse)
df <- data.frame(id = c(1, 1, 1, 2, 2, 1, 2),
                 amount = c(100, 20, 150, 60, 100, 30, 40),
                 sum = c(0, 100, 120, 0, 60, 270, 160),
                 stringsAsFactors = FALSE)
df %>%
  group_by(id) %>%
  mutate(sum = cumsum(lag(amount, 1, default = 0)))
# A tibble: 7 x 3
# Groups: id [2]
id amount sum
<dbl> <dbl> <dbl>
1 1 100 0
2 1 20 100
3 1 150 120
4 2 60 0
5 2 100 60
6 1 30 270
7 2 40 160
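The answers above use two formulations of the same idea; a small sketch verifies on a toy vector (values are mine) that subtracting the current value from a running total equals a lagged cumulative sum:

```r
library(dplyr)

# Toy vector standing in for one id's "amount" column.
x <- c(100, 20, 150, 30)

cumsum(x) - x
#> 0 100 120 270

cumsum(lag(x, default = 0))
#> 0 100 120 270
```

The first form avoids lag entirely; the second makes the "exclude the current row" intent explicit.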

How to aggregate on different variables from two columns in r

I have a data frame that looks like this:
PatGroup Variable Value StudyQuarter
A Patientdays 100 1
B ExposedDays 80 1
A ExposedDays 40 1
A Patients 40 1
C ExposedDays 10 1
C PatientDays 90 1
A Patientdays 20 2
B ExposedDays 90 2
and many more such combinations of entries in the columns 'PatGroup' and 'Variable'.
I want a function that will let me select a combination of entries from column 'PatGroup' and a combination of entries from column 'Variable' to get the desired outputs.
For example, I want to calculate a proportion which calculates the sum of values for PatGroups A and B for variable ExposedDays as Numerator; and PatGroups A, B and C for variables ExposedDays and PatientDays as Denominator.
The output would look like-
Numerator Denominator Proportion StudyQuarter NewPatGroup Measure
120 320 0.37 1 A&B/A&B&C ExposedDays/PatientDays
Can anyone help me with this please?
To be honest, I'm not sure what the point is of aggregating the data the way you propose, but you can do something like this:
library(tidyverse)
df %>%
  group_by(StudyQuarter) %>%
  summarise(
    Numerator = sum(Value[Variable == "ExposedDays" & PatGroup %in% c("A", "B")]),
    Denominator = sum(Value[Variable %in% c("ExposedDays", "PatientDays") & PatGroup %in% c("A", "B", "C")]),
    Proportion = Numerator / Denominator,
    NewPatGroup = "A&B/A&B&C",
    Measure = "ExposedDays/PatientDays")
## A tibble: 2 x 6
# StudyQuarter Numerator Denominator Proportion NewPatGroup Measure
# <int> <int> <int> <dbl> <chr> <chr>
#1 1 120 320 0.375 A&B/A&B&C ExposedDays/Patien…
#2 2 90 110 0.818 A&B/A&B&C ExposedDays/Patien…
Sample data
df <- read.table(text =
"PatGroup Variable Value StudyQuarter
A PatientDays 100 1
B ExposedDays 80 1
A ExposedDays 40 1
A Patients 40 1
C ExposedDays 10 1
C PatientDays 90 1
A PatientDays 20 2
B ExposedDays 90 2", header = T)
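Since the question asks for "a function that will let me select a combination of entries", here is a hedged sketch wrapping the same summarise() call in a reusable helper. The function name and arguments are my own invention, not from the question or answer.

```r
library(dplyr)

# Sample data (as in the answer above).
df <- read.table(text = "PatGroup Variable Value StudyQuarter
A PatientDays 100 1
B ExposedDays 80 1
A ExposedDays 40 1
A Patients 40 1
C ExposedDays 10 1
C PatientDays 90 1
A PatientDays 20 2
B ExposedDays 90 2", header = TRUE)

# Hypothetical helper: pick numerator/denominator groups and variables.
prop_by_quarter <- function(data, num_groups, num_vars, den_groups, den_vars) {
  data %>%
    group_by(StudyQuarter) %>%
    summarise(
      Numerator   = sum(Value[Variable %in% num_vars & PatGroup %in% num_groups]),
      Denominator = sum(Value[Variable %in% den_vars & PatGroup %in% den_groups]),
      Proportion  = Numerator / Denominator,
      .groups = "drop")
}

prop_by_quarter(df,
                num_groups = c("A", "B"), num_vars = "ExposedDays",
                den_groups = c("A", "B", "C"),
                den_vars = c("ExposedDays", "PatientDays"))
# Q1: 120/320 = 0.375; Q2: 90/110 ≈ 0.818
```

Passing the groups and variables as arguments covers the "many further combinations" the question mentions without rewriting the pipeline each time.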
