R dplyr sum based on conditions

I'm trying to use dplyr to multiply and sum one column, based on variables in other columns.
location = c("LBJ", "LBJ", "LBJ","LBJ")
sample = c("100", "100", "100","100")
sum = c(0,1,2,3)
n = c(200,100,20,24)
df = data.frame(location, sample, sum,n)
df
location sample sum n
1 LBJ 100 0 200
2 LBJ 100 1 100
3 LBJ 100 2 20
4 LBJ 100 3 24
I would like to calculate ( (n where sum == 0) + ((n where sum == 1) / 2 ) ) / (sum of all n).
I am going to have multiple locations and samples which should act independently, so I want to use the group_by commands in dplyr.
Thanks for any help.

Is this what you want?
library(dplyr)
df %>% group_by(location) %>% dplyr::mutate(Rate = mean(n[which(sum <= 1)]) / sum(n))
# A tibble: 4 x 5
# Groups: location [1]
location sample sum n Rate
<fctr> <fctr> <dbl> <dbl> <dbl>
1 LBJ 100 0 200 0.4360465
2 LBJ 100 1 100 0.4360465
3 LBJ 100 2 20 0.4360465
4 LBJ 100 3 24 0.4360465
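Note that mean(n[which(sum <= 1)]) halves both counts, giving (200 + 100) / 2 / 344 = 0.436, whereas the formula in the question halves only the sum == 1 count. A minimal sketch of the formula as literally stated, assuming the data frame from the question and grouping by both location and sample (the question asks for both to act independently):

```r
library(dplyr)

# Data from the question
df <- data.frame(
  location = c("LBJ", "LBJ", "LBJ", "LBJ"),
  sample   = c("100", "100", "100", "100"),
  sum      = c(0, 1, 2, 3),
  n        = c(200, 100, 20, 24)
)

result <- df %>%
  group_by(location, sample) %>%
  # sum(...) still calls base sum(); `sum` inside the brackets is the column
  mutate(Rate = (sum(n[sum == 0]) + sum(n[sum == 1]) / 2) / sum(n)) %>%
  ungroup()

result
```

For the single group shown, this evaluates (200 + 100 / 2) / 344, roughly 0.727, on every row.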

Related

Use replicate to create new variable

I have the following code:
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
This results in the following output:
ID observations
1 1
1 3
1 4
1 5
1 6
1 8
However, I also want a variable 'times' to indicate how many times of measurement there were for each individual. But since every ID has a different length, I am not sure how to implement this. Does anybody know how to include that? I want it to look like this:
ID observations times
1 1 1
1 3 2
1 4 3
1 5 4
1 6 5
1 8 6
Using dplyr you could group by ID and use the row number for times:
library(dplyr)
dat |>
group_by(ID) |>
mutate(times = row_number()) |>
ungroup()
With base R we could create the sequence from the run lengths of the ID variable:
dat$times <- sequence(rle(dat$ID)$lengths)
Output:
# A tibble: 734 × 3
ID observations times
<int> <dbl> <int>
1 1 1 1
2 1 3 2
3 1 9 3
4 2 1 1
5 2 5 2
6 2 6 3
7 2 8 4
8 3 1 1
9 3 2 2
10 3 5 3
# … with 724 more rows
Data (using a seed):
set.seed(1)
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
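The rle() approach counts within consecutive runs, so it relies on each ID's rows being contiguous, as they are in the simulated data. If that might not hold, a base R sketch using ave() counts within each ID regardless of row order. The small data frame below is hypothetical, made up only to show non-contiguous IDs:

```r
# Hypothetical toy data: rows for ID 1 and ID 2 are interleaved
dat <- data.frame(
  ID           = c(1, 1, 2, 1, 2),
  observations = c(1, 3, 1, 4, 5)
)

# Within each ID, number the rows 1, 2, 3, ... in order of appearance
dat$times <- ave(seq_along(dat$ID), dat$ID, FUN = seq_along)

dat
```

On contiguous data this should give the same result as sequence(rle(dat$ID)$lengths).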

Calculation inside a data frame based on two columns

Say, I have a data frame with 3 columns
ID Type Amount
1 4 100
1 4 50
1 1 20
2 4 30
2 1 10
I want to do some calculations in the data frame based on the groups of ID and Type. For example, for each ID I want to calculate the sum of Amount for Type 4 minus the sum of Amount for Type 1 and append it as a new column, so the final result would be something like
ID Type Amount Calculation
1 4 100 (100 + 50) - 20
1 4 50 (100 + 50) - 20
1 1 20 (100 + 50) - 20
2 4 30 30 - 10
2 1 10 30 - 10
Is there an easy way to implement this? Easy, because I want to do some more complex calculations, but want to get the basics right first.
I tried to work it out with dplyr
Something like
df %>%
group_by(ID) %>%
sum( Calculation = Amount[Type == 4] - Amount[Type == 1])
This gave me the same value for all the columns in my data frame, so it doesn't seem to work. Any ideas?
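This does what you need with dplyr. The calculation belongs inside mutate() rather than sum(), and each type's amounts are summed separately before subtracting: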
library(dplyr)
df <- data.frame(ID = c(1,1,1,2,2), Type = c(4,4,1,4,1), Amount = c(100,50,20,30,10))
df %>% group_by(ID) %>% mutate(Calculation = sum(Amount[Type == 4]) - sum(Amount[Type == 1]))
# A tibble: 5 x 4
# Groups: ID [2]
ID Type Amount Calculation
<dbl> <dbl> <dbl> <dbl>
1 1 4 100 130
2 1 4 50 130
3 1 1 20 130
4 2 4 30 20
5 2 1 10 20
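For the more complex calculations mentioned in the question, one possible variation is to express the calculation as signed weights per Type and sum once. This is only a sketch under the assumption that each Type contributes its total with a fixed sign; the weight column is a hypothetical helper and is dropped at the end:

```r
library(dplyr)

df <- data.frame(
  ID     = c(1, 1, 1, 2, 2),
  Type   = c(4, 4, 1, 4, 1),
  Amount = c(100, 50, 20, 30, 10)
)

result <- df %>%
  group_by(ID) %>%
  mutate(
    # Hypothetical helper column: +1 for Type 4, -1 for Type 1, 0 otherwise
    weight = case_when(Type == 4 ~ 1, Type == 1 ~ -1, TRUE ~ 0),
    Calculation = sum(Amount * weight)
  ) %>%
  ungroup() %>%
  select(-weight)

result
```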

Use `dplyr` to divide rows by group

In my attempt to learn dplyr, I want to divide each row by another row, representing the corresponding group's total.
I generated test data with
library(dplyr)
# building test data
data("OrchardSprays")
totals <- OrchardSprays %>% group_by(treatment) %>%
summarise(decrease = sum(decrease))
totals$decrease <- totals$decrease + seq(10, 80, 10)
totals$rowpos = totals$colpos <- "total"
df <- rbind(OrchardSprays, totals)
Note the line totals$decrease <- totals$decrease + seq(10, 80, 10): for the sake of the question, I assumed there was an additional decrease for each treatment, which was not observed in the single lines of the data frame but only in the "total" lines for each group.
What I now want to do is add another column, treatment_decrease, to the data frame, where each line's decrease value is divided by the corresponding treatment group's total decrease value.
So, for head(df) I would expect an output like this
> head(df)
decrease rowpos colpos treatment treatment_decrease
1 57 1 1 D 0.178125
2 95 2 1 E 0.1711712
3 8 3 1 B 0.09876543
4 69 4 1 H 0.08603491
5 92 5 1 G 0.1488673
6 90 6 1 F 0.1470588
My real world example is a bit more complex (more group variables and also more levels), therefore I am looking for a suitable solution in dplyr.
Here's a pure dplyr approach:
library(dplyr) #version >= 1.0.0
OrchardSprays %>%
group_by(treatment) %>%
summarise(decrease = sum(decrease)) %>%
mutate(decrease = decrease + seq(10, 80, 10),
rowpos = "total",
colpos = "total") %>%
bind_rows(mutate(OrchardSprays, across(rowpos:colpos, as.character))) %>%
group_by(treatment) %>%
mutate(treatment_decrease = decrease / decrease[rowpos == "total"])
# A tibble: 72 x 5
# Groups: treatment [8]
treatment decrease rowpos colpos treatment_decrease
<fct> <dbl> <chr> <chr> <dbl>
1 A 47 total total 1
2 B 81 total total 1
3 C 232 total total 1
4 D 320 total total 1
5 E 555 total total 1
6 F 612 total total 1
7 G 618 total total 1
8 H 802 total total 1
9 D 57 1 1 0.178
10 E 95 2 1 0.171
# … with 62 more rows
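If the "total" rows do not need to live in the same data frame, an alternative sketch keeps the totals in a separate lookup table and joins them on by treatment. The column name total_decrease is an assumption introduced here for illustration:

```r
library(dplyr)

data("OrchardSprays")

# One row per treatment: observed total plus the extra decrease from the question
totals <- OrchardSprays %>%
  group_by(treatment) %>%
  summarise(total_decrease = sum(decrease)) %>%
  mutate(total_decrease = total_decrease + seq(10, 80, 10))

# Attach each treatment's total to its rows, then compute the share
result <- OrchardSprays %>%
  left_join(totals, by = "treatment") %>%
  mutate(treatment_decrease = decrease / total_decrease)

head(result)
```

With more grouping variables, the same pattern should carry over by listing them all in group_by() and in the by argument of left_join().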

How to aggregate on different variables from two columns in r

I have a data frame that looks like this-
PatGroup Variable Value StudyQuarter
A PatientDays 100 1
B ExposedDays 80 1
A ExposedDays 40 1
A Patients 40 1
C ExposedDays 10 1
C PatientDays 90 1
A PatientDays 20 2
B ExposedDays 90 2
and many such further combinations of variables in Columns 'PatGroup' and 'Variable'
I want a function that will let me select a combination of entries from column 'PatGroup' and a combination of entries from column 'Variable' to get the desired outputs.
For example, I want to calculate a proportion which calculates the sum of values for PatGroups A and B for variable ExposedDays as Numerator; and PatGroups A, B and C for variables ExposedDays and PatientDays as Denominator.
The output would look like-
Numerator Denominator Proportion StudyQuarter NewPatGroup Measure
120 320 0.37 1 A&B/A&B&C ExposedDays/PatientDays
Can anyone help me with this please?
To be honest, I'm not sure what the point is of aggregating data the way that you propose, but you can do something like this:
library(tidyverse)
df %>%
group_by(StudyQuarter) %>%
summarise(
Numerator = sum(Value[Variable == "ExposedDays" & PatGroup %in% c("A", "B")]),
Denominator = sum(Value[Variable %in% c("ExposedDays", "PatientDays") & PatGroup %in% c("A", "B", "C")]),
Proportion = Numerator / Denominator,
NewPatGroup = "A&B/A&B&C",
Measure = "ExposedDays/PatientDays")
## A tibble: 2 x 6
# StudyQuarter Numerator Denominator Proportion NewPatGroup Measure
# <int> <int> <int> <dbl> <chr> <chr>
#1 1 120 320 0.375 A&B/A&B&C ExposedDays/Patien…
#2 2 90 110 0.818 A&B/A&B&C ExposedDays/Patien…
Sample data
df <- read.table(text =
"PatGroup Variable Value StudyQuarter
A PatientDays 100 1
B ExposedDays 80 1
A ExposedDays 40 1
A Patients 40 1
C ExposedDays 10 1
C PatientDays 90 1
A PatientDays 20 2
B ExposedDays 90 2", header = T)
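Since the question asks for a function that takes the group and variable selections as input, the summarise() call above could be wrapped as follows. This is a sketch only; the function name proportion_by_quarter and its argument names are hypothetical:

```r
library(dplyr)

df <- data.frame(
  PatGroup     = c("A", "B", "A", "A", "C", "C", "A", "B"),
  Variable     = c("PatientDays", "ExposedDays", "ExposedDays", "Patients",
                   "ExposedDays", "PatientDays", "PatientDays", "ExposedDays"),
  Value        = c(100, 80, 40, 40, 10, 90, 20, 90),
  StudyQuarter = c(1, 1, 1, 1, 1, 1, 2, 2)
)

# Hypothetical helper: each argument is a character vector of entries to include
proportion_by_quarter <- function(data, num_groups, num_vars, den_groups, den_vars) {
  data %>%
    group_by(StudyQuarter) %>%
    summarise(
      Numerator   = sum(Value[PatGroup %in% num_groups & Variable %in% num_vars]),
      Denominator = sum(Value[PatGroup %in% den_groups & Variable %in% den_vars]),
      Proportion  = Numerator / Denominator
    )
}

result <- proportion_by_quarter(
  df,
  num_groups = c("A", "B"),      num_vars = "ExposedDays",
  den_groups = c("A", "B", "C"), den_vars = c("ExposedDays", "PatientDays")
)

result
```

The label columns (NewPatGroup, Measure) are left out here; they could be built from the arguments with paste() if needed.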

How to create a variable based on lm in a regular mutate in dplyr?

Consider this simple example:
library(dplyr)
library(broom)
dataframe <- tibble(id = c(1,2,3,4,5,6),
group = c(1,1,1,2,2,2),
value = c(200,400,120,300,100,100))
# A tibble: 6 x 3
id group value
<dbl> <dbl> <dbl>
1 1 1 200
2 2 1 400
3 3 1 120
4 4 2 300
5 5 2 100
6 6 2 100
Here I want to group by group and create two columns.
One is the number of distinct values in value (I can use dplyr::n_distinct), the other is the constant term from a regression of value on the vector 1. That is, the output of
tidy(lm(data = dataframe, value ~ 1)) %>% select(estimate)
estimate
1 203.3333
The difficulty here is combining these two simple outputs into a single mutate statement that preserves the grouping.
I tried something like:
formula1 <- function(data, myvar){
tidy(lm(data = data, myvar ~ 1)) %>% select(estimate)
}
dataframe %>% group_by(group) %>%
mutate(distinct = n_distinct(value),
mean = formula1(., value))
but this does not work. What am I missing here?
Thanks!
This approach will work if you use pull in place of select. select() returns a one-column data frame, whereas pull() extracts the single estimate as a plain numeric value that mutate() can recycle across the group.
formula1 <- function(data, myvar){
tidy(lm(data = data, myvar ~ 1)) %>% pull(estimate)
}
dataframe %>%
group_by(group) %>%
mutate(distinct = n_distinct(value),
mean = formula1(., value))
# A tibble: 6 x 5
# Groups: group [2]
id group value distinct mean
<dbl> <dbl> <dbl> <int> <dbl>
1 1 1 200 3 240.0000
2 2 1 400 3 240.0000
3 3 1 120 3 240.0000
4 4 2 300 2 166.6667
5 5 2 100 2 166.6667
6 6 2 100 2 166.6667
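In the code above, the per-group result comes from myvar (the group's slice of value), while the data argument still receives the whole data frame through the dot. A sketch that makes this explicit takes only the vector and reads the intercept with coef(), without broom. The helper name intercept_only is hypothetical:

```r
library(dplyr)

dataframe <- data.frame(
  id    = c(1, 2, 3, 4, 5, 6),
  group = c(1, 1, 1, 2, 2, 2),
  value = c(200, 400, 120, 300, 100, 100)
)

# Hypothetical helper: intercept of a regression of x on a constant
intercept_only <- function(x) {
  unname(coef(lm(x ~ 1))[1])
}

result <- dataframe %>%
  group_by(group) %>%
  mutate(distinct = n_distinct(value),
         mean = intercept_only(value)) %>%
  ungroup()

result
```

The intercept of a constant-only regression is the group mean, so mean(value) would give the same numbers; the lm() form is kept here because the question asks for it.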
