How to aggregate on different variables from two columns in R

I have a data frame that looks like this-
PatGroup Variable Value StudyQuarter
A PatientDays 100 1
B ExposedDays 80 1
A ExposedDays 40 1
A Patients 40 1
C ExposedDays 10 1
C PatientDays 90 1
A PatientDays 20 2
B ExposedDays 90 2
and many such further combinations of variables in Columns 'PatGroup' and 'Variable'
I want a function that will let me select a combination of entries from column 'PatGroup' and a combination of entries from column 'Variable' to get the desired outputs.
For example, I want to calculate a proportion whose numerator is the sum of values for PatGroups A and B for the variable ExposedDays, and whose denominator is the sum of values for PatGroups A, B and C for the variables ExposedDays and PatientDays.
The output would look like-
Numerator Denominator Proportion StudyQuarter NewPatGroup Measure
120 320 0.375 1 A&B/A&B&C ExposedDays/PatientDays
Can anyone help me with this please?

To be honest, I'm not sure what the point is of aggregating data the way that you propose, but you can do something like this:
library(tidyverse)
df %>%
    group_by(StudyQuarter) %>%
    summarise(
        Numerator = sum(Value[Variable == "ExposedDays" & PatGroup %in% c("A", "B")]),
        Denominator = sum(Value[Variable %in% c("ExposedDays", "PatientDays") & PatGroup %in% c("A", "B", "C")]),
        Proportion = Numerator / Denominator,
        NewPatGroup = "A&B/A&B&C",
        Measure = "ExposedDays/PatientDays")
# A tibble: 2 x 6
# StudyQuarter Numerator Denominator Proportion NewPatGroup Measure
# <int> <int> <int> <dbl> <chr> <chr>
#1 1 120 320 0.375 A&B/A&B&C ExposedDays/Patien…
#2 2 90 110 0.818 A&B/A&B&C ExposedDays/Patien…
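Since the question asks for a function that works for any combination of PatGroup and Variable entries, the summarise() call above can be wrapped up with the selections as arguments. This is a sketch; calc_proportion is a name I made up, and the NewPatGroup/Measure labels are built mechanically from the inputs rather than hand-written:

```r
library(tidyverse)

df <- read.table(text =
"PatGroup Variable Value StudyQuarter
A PatientDays 100 1
B ExposedDays 80 1
A ExposedDays 40 1
A Patients 40 1
C ExposedDays 10 1
C PatientDays 90 1
A PatientDays 20 2
B ExposedDays 90 2", header = TRUE)

# Generalised version of the summarise() above: the numerator and
# denominator selections are passed in instead of hard-coded.
calc_proportion <- function(df, num_groups, num_vars, den_groups, den_vars) {
  df %>%
    group_by(StudyQuarter) %>%
    summarise(
      Numerator   = sum(Value[Variable %in% num_vars & PatGroup %in% num_groups]),
      Denominator = sum(Value[Variable %in% den_vars & PatGroup %in% den_groups]),
      Proportion  = Numerator / Denominator,
      NewPatGroup = paste(paste(num_groups, collapse = "&"),
                          paste(den_groups, collapse = "&"), sep = "/"),
      Measure     = paste(paste(num_vars, collapse = "&"),
                          paste(den_vars, collapse = "&"), sep = "/"))
}

calc_proportion(df,
                num_groups = c("A", "B"),      num_vars = "ExposedDays",
                den_groups = c("A", "B", "C"), den_vars = c("ExposedDays", "PatientDays"))
```

Other numerator/denominator combinations then only need different arguments, not new code.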
Sample data
df <- read.table(text =
"PatGroup Variable Value StudyQuarter
A PatientDays 100 1
B ExposedDays 80 1
A ExposedDays 40 1
A Patients 40 1
C ExposedDays 10 1
C PatientDays 90 1
A PatientDays 20 2
B ExposedDays 90 2", header = TRUE)

Related

Calculation inside a data frame based on two columns

Say, I have a data frame with 3 columns
ID Type Amount
1 4 100
1 4 50
1 1 20
2 4 30
2 1 10
I want to do some calculations in the data frame that are based on the groups of ID and Type. For example, for each ID I want to calculate (the sum of Amount for Type 4) - (the sum of Amount for Type 1) and append it to the end, so the final result would be something like
ID Type Amount Calculation
1 4 100 (100 + 50) - 20
1 4 50 (100 + 50) - 20
1 1 20 (100 + 50) - 20
2 4 30 30 - 10
2 1 10 30 - 10
Is there an easy way to implement this? Easy, because I want to do some more complex calculations, but want to get the basics right first.
I tried to work it out with dplyr
Something like
df %>%
group_by(ID) %>%
sum( Calculation = Amount[Type == 4] - Amount[Type == 1])
This gave me the same value for all the columns in my data frame, so it doesn't seem to work.. Any ideas?
This does what you need with dplyr
library(dplyr)
df <- data.frame(ID = c(1,1,1,2,2), Type = c(4,4,1,4,1), Amount = c(100,50,20,30,10))
df %>% group_by(ID) %>% mutate(Calculation = sum(Amount[Type == 4]) - sum(Amount[Type == 1]))
# A tibble: 5 x 4
# Groups: ID [2]
ID Type Amount Calculation
<dbl> <dbl> <dbl> <dbl>
1 1 4 100 130
2 1 4 50 130
3 1 1 20 130
4 2 4 30 20
5 2 1 10 20
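For comparison, the same per-ID calculation works in base R with ave(), since sum(Amount[Type == 4]) - sum(Amount[Type == 1]) is just the sum of Amount weighted +1 for type 4 and -1 for type 1 (a base-R sketch, not part of the original answer):

```r
df <- data.frame(ID = c(1, 1, 1, 2, 2),
                 Type = c(4, 4, 1, 4, 1),
                 Amount = c(100, 50, 20, 30, 10))

# Weight each row: +1 for Type 4, -1 for Type 1, 0 otherwise,
# then sum the weighted amounts within each ID.
w <- (df$Type == 4) - (df$Type == 1)
df$Calculation <- ave(df$Amount * w, df$ID, FUN = sum)
df$Calculation
# [1] 130 130 130  20  20
```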

Use `dplyr` to divide rows by group

In my attempt to learn dplyr, I want to divide each row by another row, representing the corresponding group's total.
I generated test data with
library(dplyr)
# building test data
data("OrchardSprays")
totals <- OrchardSprays %>% group_by(treatment) %>%
summarise(decrease = sum(decrease))
totals$decrease <- totals$decrease + seq(10, 80, 10)
totals$rowpos <- totals$colpos <- "total"
df <- rbind(OrchardSprays, totals)
Note the line totals$decrease <- totals$decrease + seq(10, 80, 10): for the sake of the question, I assumed there was an additional decrease for each treatment, which was not observed in the single lines of the data frame but only in the "total" lines for each group.
What I now want to do is add another column, treatment_decrease, where each line's decrease value is divided by the corresponding treatment group's total decrease value.
So, for head(df) I would expect an output like this
> head(df)
decrease rowpos colpos treatment treatment_decrease
1 57 1 1 D 0.178125
2 95 2 1 E 0.1711712
3 8 3 1 B 0.09876543
4 69 4 1 H 0.08603491
5 92 5 1 G 0.1488673
6 90 6 1 F 0.1470588
My real world example is a bit more complex (more group variables and also more levels), therefore I am looking for a suitable solution in dplyr.
Here's an all-dplyr approach:
library(dplyr) #version >= 1.0.0
OrchardSprays %>%
    group_by(treatment) %>%
    summarise(decrease = sum(decrease)) %>%
    mutate(decrease = decrease + seq(10, 80, 10),
           rowpos = "total",
           colpos = "total") %>%
    bind_rows(mutate(OrchardSprays, across(rowpos:colpos, as.character))) %>%
    group_by(treatment) %>%
    mutate(treatment_decrease = decrease / decrease[rowpos == "total"])
# A tibble: 72 x 5
# Groups: treatment [8]
treatment decrease rowpos colpos treatment_decrease
<fct> <dbl> <chr> <chr> <dbl>
1 A 47 total total 1
2 B 81 total total 1
3 C 232 total total 1
4 D 320 total total 1
5 E 555 total total 1
6 F 612 total total 1
7 G 618 total total 1
8 H 802 total total 1
9 D 57 1 1 0.178
10 E 95 2 1 0.171
# … with 62 more rows
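One caveat: decrease[rowpos == "total"] assumes exactly one "total" row per treatment group; with zero or multiple matches the division would recycle or error. A slightly more defensive variant of the last step (my own sketch, same result here) picks the first match explicitly:

```r
library(dplyr)

# df as built in the question: OrchardSprays plus one "total" row per treatment
data("OrchardSprays")
totals <- OrchardSprays %>%
  group_by(treatment) %>%
  summarise(decrease = sum(decrease))
totals$decrease <- totals$decrease + seq(10, 80, 10)
totals$rowpos <- totals$colpos <- "total"
df <- rbind(OrchardSprays, totals)

res <- df %>%
  group_by(treatment) %>%
  mutate(treatment_decrease = decrease / first(decrease[rowpos == "total"]))
```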

How do I combine row entries for the same patient ID# in R while keeping other columns and NA values?

I have multiple rows for the same patient ID and need to combine some of the columns, keeping the values from the first listing for the others. For example, here I want to sum the spending column, and collapse the heartattack column into whether the patient ever had a heart attack. I then want to delete the duplicate IDs and keep only the values from the first listing for the other columns:
df <- read.table(text =
"ID Age Gender heartattack spending
1 24 f 0 140
2 24 m na 123
2 24 m 1 58
2 24 m 0 na
3 85 f 1 170
4 45 m na 204", header=TRUE)
What I need:
df2 <- read.table(text =
"ID Age Gender ever_heartattack all_spending
1 24 f 0 140
2 24 m 1 181
3 85 f 1 170
4 45 m na 204", header=TRUE)
I tried group_by with transmute() and sum() as follows:
df$heartattack = as.numeric(as.character(df$heartattack))
df$spending = as.numeric(as.character(df$spending))
library(dplyr)
df = df %>% group_by(ID) %>% transmute(ever_heartattack = sum(heartattack, na.rm = T), all_spending = sum(spending, na.rm=T))
But this removes all the other columns! It also turns NA values into zeros. For example, I still want NA to be the value for patient ID 4; I don't want to change the data to say they never had a heart attack!
> print(df) # This doesn't at all match df2 :(
ID ever_heartattack all_spending
1 1 0 140
2 2 1 181
3 2 1 181
4 2 1 181
5 3 1 170
6 4 0 204
Could you do this?
aggregate(
spending ~ ID + Age + Gender,
data = transform(df, spending = as.numeric(as.character(spending))),
FUN = sum)
# ID Age Gender spending
#1 1 24 f 140
#2 3 85 f 170
#3 2 24 m 181
#4 4 45 m 204
Some comments:
The thing is that when aggregating you don't give clear rules how to deal with data in additional columns that differ (like heartattack in this case). For example, for ID = 2 why do you retain heartattack = 1 instead of heartattack = na or heartattack = 0?
Your "na"s are in fact not real NAs, which leads to spending being a factor column instead of a numeric one.
To exactly reproduce your expected output one can do
df %>%
mutate(
heartattack = as.numeric(as.character(heartattack)),
spending = as.numeric(as.character(spending))) %>%
group_by(ID, Age, Gender) %>%
summarise(
heartattack = ifelse(
any(heartattack %in% c(0, 1)),
max(heartattack, na.rm = T),
NA),
spending = sum(spending, na.rm = T))
## A tibble: 4 x 5
## Groups: ID, Age [?]
# ID Age Gender heartattack spending
# <int> <int> <fct> <dbl> <dbl>
#1 1 24 f 0 140
#2 2 24 m 1 181
#3 3 85 f 1 170
#4 4 45 m NA 204
This feels a bit "hacky" because the rules for which heartattack value to keep are not clear. In this case we
keep the maximum value of heartattack if heartattack contains either 0 or 1, and
return NA if heartattack does not contain 0 or 1.
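The ifelse()/any() construct can be pulled out into a small helper that makes the rule explicit: take the maximum while ignoring NAs, unless the group is all-NA. This is a sketch of my own; max_or_na is an invented name:

```r
library(dplyr)

# Maximum of x ignoring NAs; NA if x contains no non-NA value.
max_or_na <- function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)

# The question's data with "na" already converted to real NA
df <- data.frame(ID = c(1, 2, 2, 2, 3, 4),
                 Age = c(24, 24, 24, 24, 85, 45),
                 Gender = c("f", "m", "m", "m", "f", "m"),
                 heartattack = c(0, NA, 1, 0, 1, NA),
                 spending = c(140, 123, 58, NA, 170, 204))

df %>%
  group_by(ID, Age, Gender) %>%
  summarise(ever_heartattack = max_or_na(heartattack),
            all_spending = sum(spending, na.rm = TRUE))
```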

R dplyr sum based on conditions

I'm trying to use dplyr to multiply and sum one column, based on variables in other columns.
location = c("LBJ", "LBJ", "LBJ","LBJ")
sample = c("100", "100", "100","100")
sum = c(0,1,2,3)
n = c(200,100,20,24)
df = data.frame(location, sample, sum,n)
df
location sample sum n
1 LBJ 100 0 200
2 LBJ 100 1 100
3 LBJ 100 2 20
4 LBJ 100 3 24
I would like to calculate ( (n where sum == 0) + ((n where sum == 1) / 2 ) ) / (sum of all n).
I am going to have multiple locations and samples which should act independently, so I want to use the group_by commands in dplyr.
Thanks for any help.
Is this what you want? Your expression ((n where sum == 0) + ((n where sum == 1) / 2)) / (sum of all n) translates directly into a grouped mutate:
library(dplyr)
df %>%
    group_by(location) %>%
    mutate(Rate = (n[sum == 0] + n[sum == 1] / 2) / sum(n))
# A tibble: 4 x 5
# Groups: location [1]
  location sample   sum     n      Rate
  <fctr>   <fctr> <dbl> <dbl>     <dbl>
1 LBJ         100     0   200 0.7267442
2 LBJ         100     1   100 0.7267442
3 LBJ         100     2    20 0.7267442
4 LBJ         100     3    24 0.7267442
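Since multiple locations and samples should act independently, the calculation from the question extends by grouping on both columns. A sketch assuming each location/sample pair has exactly one row with sum == 0 and one with sum == 1, as in the example data:

```r
library(dplyr)

df <- data.frame(location = c("LBJ", "LBJ", "LBJ", "LBJ"),
                 sample = c("100", "100", "100", "100"),
                 sum = c(0, 1, 2, 3),
                 n = c(200, 100, 20, 24))

# (n at sum == 0, plus half of n at sum == 1), over total n, per group
res <- df %>%
  group_by(location, sample) %>%
  mutate(Rate = (n[sum == 0] + n[sum == 1] / 2) / sum(n))
```

Note that inside mutate(), sum in value position (n[sum == 0]) refers to the column, while sum(n) still calls the base function, because R looks up a function when a name is used in call position.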

Normalize all rows with first element within group

Is there an elegant method to normalize a column with a group-specific norm with dplyr?
Example:
I have a data frame:
df = data.frame(year=c(1:2, 1:2),
group=c("a", "a", "b", "b"),
val=c(100, 200, 300, 900))
i.e:
year group val
1 1 a 100
2 2 a 200
3 1 b 300
4 2 b 900
I want to normalize val by the value in year=1 of the given group. Desired output:
year group val val_norm
1 1 a 100 1
2 2 a 200 2
3 1 b 300 1
4 2 b 900 3
e.g. in row 4 the norm = 300 (year==1 & group=="b") hence val_norm = 900/300 = 3.
I can achieve this by extracting an ancillary data frame with just the norms and then doing a left join on the original data frame.
What is a more elegant way to achieve this without creating a temporary data frame?
We can group by 'group', then divide the 'val' by the 'val' where 'year' is 1 (year==1). Here, I am selecting the first observation (in case there are duplicate 'year' of 1 for each 'group').
library(dplyr)
df %>%
group_by(group) %>%
mutate(val_norm = val/val[year==1][1L])
# year group val val_norm
# <int> <fctr> <dbl> <dbl>
#1 1 a 100 1
#2 2 a 200 2
#3 1 b 300 1
#4 2 b 900 3
If we need elegance and efficiency, data.table can be tried
library(data.table)
setDT(df)[, val_norm := val/val[year==1][1L] , by = group]
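For completeness, the same normalisation is possible in base R: ave() can broadcast each group's year-1 value across the group's rows (my own sketch, not from the answer):

```r
df <- data.frame(year = c(1:2, 1:2),
                 group = c("a", "a", "b", "b"),
                 val = c(100, 200, 300, 900))

# Keep val only where year == 1, then let ave() recycle that single
# value over all rows of the same group; divide to normalize.
ref <- ave(ifelse(df$year == 1, df$val, NA), df$group,
           FUN = function(x) x[!is.na(x)][1])
df$val_norm <- df$val / ref
df$val_norm
# [1] 1 2 1 3
```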
