How to calculate the ratio of a condition based on another condition - r

I have a simplified data frame like this:
date       state hour
2020-01-01 A     6
2020-01-01 B     3
2020-01-02 A     4
2020-01-02 B     3.5
2020-01-03 A     5
2020-01-03 B     2.5
For each date there are two states. For each day, I want to compute the ratio of the hour value for state A to that for state B.
For example:
date ratio
2020-01-01 2
2020-01-02 1.143
2020-01-03 2
How do I get this result? Thank you!

With the help of match() you can do:
library(dplyr)
df %>%
  group_by(date) %>%
  summarise(ratio = hour[match('A', state)] / hour[match('B', state)])
# date ratio
# <chr> <dbl>
#1 2020-01-01 2
#2 2020-01-02 1.14
#3 2020-01-03 2
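For readers unfamiliar with match(): it returns the index of the first match, so hour[match('A', state)] picks out the A row within each group even if the rows arrive unordered. A base R sketch of the same per-date computation, rebuilding the df from the question:

```r
# Rebuild the data frame from the question
df <- data.frame(
  date  = rep(c("2020-01-01", "2020-01-02", "2020-01-03"), each = 2),
  state = rep(c("A", "B"), 3),
  hour  = c(6, 3, 4, 3.5, 5, 2.5)
)
# match("A", g$state) gives the position of the first "A" within each group
ratios <- sapply(split(df, df$date), function(g) {
  g$hour[match("A", g$state)] / g$hour[match("B", g$state)]
})
ratios
# 2020-01-01 2020-01-02 2020-01-03
#   2.000000   1.142857   2.000000
```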

You can use xtabs:
tt <- xtabs(hour ~ date + state, x)
data.frame(dimnames(tt)[1], ratio = tt[,1] / tt[,2])
# date ratio
#2020-01-01 2020-01-01 2.000000
#2020-01-02 2020-01-02 1.142857
#2020-01-03 2020-01-03 2.000000
Data:
x <- data.frame(date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03"), state = c("A", "B",
"A", "B", "A", "B"), hour = c(6, 3, 4, 3.5, 5, 2.5))

A data.table option:
setDT(df)[, .(ratio = Reduce(`/`, hour[order(state)])), date]
date ratio
1: 2020-01-01 2.000000
2: 2020-01-02 1.142857
3: 2020-01-03 2.000000
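The Reduce(`/`, ...) idiom works because order(state) sorts "A" before "B", so Reduce folds the two values with division: the A hour divided by the B hour. A quick base R illustration:

```r
hour  <- c(3, 6)    # B first, A second: rows can arrive in any order
state <- c("B", "A")
# order(state) puts the A value first, then Reduce folds with division: 6 / 3
Reduce(`/`, hour[order(state)])
# [1] 2
```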

You can also use the following solution, although it is somewhat similar to the one posted by @Ronak Shah.
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = state, values_from = hour) %>%
  group_by(date) %>%
  summarise(ratio = A/B)
# A tibble: 3 x 2
date ratio
<chr> <dbl>
1 2020-01-01 2
2 2020-01-02 1.14
3 2020-01-03 2

Related

'Stretch' a grouped data frame using dplyr

I have a grouped dataframe with multiple IDs with a date and value column.
id <- c("a", "a", "a", "b", "b", "b", "c")
date <- c("2020-01-01", "2020-01-02", "2020-01-03",
"2020-01-01", "2020-01-02", "2020-01-03",
"2020-01-01")
value <- rnorm(n = length(id))
df <- cbind.data.frame(id, date, value)
However, some IDs have less than 3 dates. I want to "stretch" those IDs and add an NA for the value column for the new dates. In this dataframe, the "c" ID would have two new dates added ("2020-01-02" and "2020-01-03").
Perhaps this approach would suit?
library(tidyverse)
id <- c("a", "a", "a", "b", "b", "b", "c")
date <- c("2020-01-01", "2020-01-02", "2020-01-03",
"2020-01-01", "2020-01-02", "2020-01-03",
"2020-01-01")
value <- rnorm(n = length(id))
df <- cbind.data.frame(id, date, value)
df %>%
  right_join(df %>% expand(id, date))
#> Joining, by = c("id", "date")
#> id date value
#> 1 a 2020-01-01 -1.5371474
#> 2 a 2020-01-02 0.9001098
#> 3 a 2020-01-03 0.1523491
#> 4 b 2020-01-01 0.8194577
#> 5 b 2020-01-02 1.2005270
#> 6 b 2020-01-03 0.1158812
#> 7 c 2020-01-01 -0.8676445
#> 8 c 2020-01-02 NA
#> 9 c 2020-01-03 NA
Created on 2022-09-05 by the reprex package (v2.0.1)
In base R, you can merge, by id, with a data frame built from a daily sequence spanning the date range. First convert to a proper date class with df$date <- as.Date(df$date).
by(df, df$id, \(x)
   merge(x,
         data.frame(id = x$id[1],
                    date = do.call(seq.Date, c(as.list(range(df$date)), 'day'))),
         all = TRUE)) |>
  do.call(what = rbind)
# id date value
# a.1 a 2020-01-01 1.3709584
# a.2 a 2020-01-02 -0.5646982
# a.3 a 2020-01-03 0.3631284
# b.1 b 2020-01-01 0.6328626
# b.2 b 2020-01-02 0.4042683
# b.3 b 2020-01-03 -0.1061245
# c.1 c 2020-01-01 1.5115220
# c.2 c 2020-01-02 NA
# c.3 c 2020-01-03 NA
You could use complete() from tidyr.
library(tidyr)
df %>%
  complete(id, date)
# # A tibble: 9 × 3
# id date value
# <chr> <chr> <dbl>
# 1 a 2020-01-01 1.12
# 2 a 2020-01-02 1.58
# 3 a 2020-01-03 1.26
# 4 b 2020-01-01 -2.30
# 5 b 2020-01-02 -1.45
# 6 b 2020-01-03 -0.212
# 7 c 2020-01-01 0.344
# 8 c 2020-01-02 NA
# 9 c 2020-01-03 NA
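For contexts without tidyr, a base R equivalent of complete(id, date) is to left-join the data onto the full grid of id/date combinations (a sketch, with a deterministic value column standing in for rnorm()):

```r
id   <- c("a", "a", "a", "b", "b", "b", "c")
date <- c("2020-01-01", "2020-01-02", "2020-01-03",
          "2020-01-01", "2020-01-02", "2020-01-03",
          "2020-01-01")
df <- data.frame(id, date, value = seq_along(id))
# All id/date combinations, with NA in value where no row was observed
full <- merge(expand.grid(id = unique(df$id), date = unique(df$date)),
              df, all.x = TRUE)
full[order(full$id, full$date), ]
```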

Add/Merge/Melt only specific columns and give out one unique row

I am trying to transform a dataset that has multiple product sales on a date. At the end I want to keep only unique columns with the sum of the product sales per day.
My MRE:
df <- data.frame(created = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"), "%Y-%m-%d", tz = "GMT"),
soldUnits = c(1, 1, 1, 1, 1, 1),
Weekday = c("Mo","Mo","Tu","Tu","Th","Th"),
Sunshinehours = c(7.8,7.8,6.0,6.0,8.0,8.0))
Which looks like this:
created    soldUnits Weekday Sunshinehours
2020-01-01 1         Mo      7.8
2020-01-01 1         Mo      7.8
2020-01-02 1         Tu      6.0
2020-01-02 1         Tu      6.0
2020-01-03 1         Th      8.0
2020-01-03 1         Th      8.0
And should look like this after transforming:
created    soldUnits Weekday Sunshinehours
2020-01-01 2         Mo      7.8
2020-01-02 2         Tu      6.0
2020-01-03 2         Th      8.0
I tried aggregate() and group_by() but without success, because columns were dropped.
Does anyone have an idea how I can transform and clean up my dataset according to these specifications?
This can work:
library(tidyverse)
df %>%
  group_by(created) %>%
  count(Weekday, Sunshinehours, wt = soldUnits, name = "soldUnits")
#> # A tibble: 3 × 4
#> # Groups: created [3]
#> created Weekday Sunshinehours soldUnits
#> <date> <chr> <dbl> <dbl>
#> 1 2020-01-01 Mo 7.8 2
#> 2 2020-01-02 Tu 6 2
#> 3 2020-01-03 Th 8 2
Created on 2021-12-04 by the reprex package (v2.0.1)
Applying different functions to different columns (or sets of columns) can be done with collap() from the collapse package:
library(collapse)
collap(df, ~ created + Weekday,
       custom = list(fmean = "Sunshinehours", fsum = "soldUnits"))
created soldUnits Weekday Sunshinehours
1 2020-01-01 2 Mo 7.8
2 2020-01-02 2 Tu 6.0
3 2020-01-03 2 Th 8.0
Another dplyr approach:
df %>%
  group_by(created, Weekday, Sunshinehours) %>%
  summarise(soldUnits = sum(soldUnits))
created Weekday Sunshinehours soldUnits
<date> <chr> <dbl> <dbl>
1 2020-01-01 Mo 7.8 2
2 2020-01-02 Tu 6 2
3 2020-01-03 Th 8 2
Using base R and dplyr:
df1 <- aggregate(df["Sunshinehours"], by = df["created"], mean)
df2 <- aggregate(df["soldUnits"], by = df["created"], sum)
df3 <- inner_join(df1, df2)
# look up each date's Weekday so the labels stay aligned with their rows
df3$Weekday <- df$Weekday[match(df3$created, df$created)]
df3
     created Sunshinehours soldUnits Weekday
1 2020-01-01           7.8         2      Mo
2 2020-01-02           6.0         2      Tu
3 2020-01-03           8.0         2      Th

Merge rows and keep values based on another column

I've got data from a number of surveys. Each survey can be sent multiple times with updated values. Each row in the dataset has a date when the survey was submitted (created). I'd like to merge the rows for each survey, keeping the date from the first submission but the other data from the last.
A simple example:
#> survey created var1 var2
#> 1 s1 2020-01-01 10 30
#> 2 s2 2020-01-02 10 90
#> 3 s2 2020-01-03 20 20
#> 4 s3 2020-01-01 45 5
#> 5 s3 2020-01-02 50 50
#> 6 s3 2020-01-03 30 10
Desired result:
#> survey created var1 var2
#> 1 s1 2020-01-01 10 30
#> 2 s2 2020-01-02 20 20
#> 3 s3 2020-01-01 30 10
Example data:
df <- data.frame(survey = c("s1", "s2", "s2", "s3", "s3", "s3"),
created = as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02", "2020-01-03"), "%Y-%m-%d", tz = "GMT"),
var1 = c(10, 10, 20, 45, 50, 30),
var2 = c(30, 90, 20, 5, 50, 10),
stringsAsFactors=FALSE)
I've tried group_by with summarize in different ways but can't make it work, any help would be highly appreciated!
After grouping by 'survey', change 'created' to the first (i.e. minimum) value of 'created', then slice() the last row (n()):
library(dplyr)
df %>%
  group_by(survey) %>%
  mutate(created = as.Date(first(created))) %>%
  slice(n())
# A tibble: 3 x 4
# Groups: survey [3]
# survey created var1 var2
# <chr> <date> <dbl> <dbl>
#1 s1 2020-01-01 10 30
#2 s2 2020-01-02 20 20
#3 s3 2020-01-01 30 10
Or using base R:
transform(df, created = ave(created, survey, FUN = function(x) x[1])
          )[!duplicated(df$survey, fromLast = TRUE), ]
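The fromLast = TRUE part is what keeps the last row per survey: duplicated() then flags every occurrence except the final one, and the ! inverts that. A small illustration:

```r
survey <- c("s1", "s2", "s2", "s3", "s3", "s3")
# TRUE marks rows that have a later duplicate; negating keeps each last row
duplicated(survey, fromLast = TRUE)
# [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE
survey[!duplicated(survey, fromLast = TRUE)]
# [1] "s1" "s2" "s3"
```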
After selecting the first created date we can select the last values from all the columns.
library(dplyr)
df %>%
  group_by(survey) %>%
  mutate(created = as.Date(first(created))) %>%
  summarise(across(created:var2, last))
# In older versions use `summarise_at`:
# summarise_at(vars(created:var2), last)
# A tibble: 3 x 4
# survey created var1 var2
# <chr> <date> <dbl> <dbl>
#1 s1 2020-01-01 10 30
#2 s2 2020-01-02 20 20
#3 s3 2020-01-01 30 10

How do I insert dates based on conditions in R (example given)?

df is my current dataset, and I want to insert dates from 1 Jan 2020 to 4 Jan 2020 for all possible locations.
df<-data.frame(location=c("x","x","y"),date=c("2020-01-01","2020-01-04","2020-01-03"))
This is what my expected dataset looks like:
expected_df<-data.frame(location=c("x","x","x","x","y","y","y","y"),date=c("2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-01","2020-01-02","2020-01-03","2020-01-04"))
location date
1 x 2020-01-01
2 x 2020-01-02
3 x 2020-01-03
4 x 2020-01-04
5 y 2020-01-01
6 y 2020-01-02
7 y 2020-01-03
8 y 2020-01-04
We can use complete() from tidyr:
library(dplyr)
library(tidyr)
start <- as.Date('2020-01-01')
end <- as.Date('2020-01-04')
df %>%
  mutate(date = as.Date(date)) %>%
  complete(location, date = seq(start, end, by = "1 day"))
# location date
# <fct> <date>
#1 x 2020-01-01
#2 x 2020-01-02
#3 x 2020-01-03
#4 x 2020-01-04
#5 y 2020-01-01
#6 y 2020-01-02
#7 y 2020-01-03
#8 y 2020-01-04
It is essential that you set stringsAsFactors = FALSE in your data frame so those values do not get converted to factors.
df <- data.frame(location=c("x","x","y"), date=c("2020-01-01","2020-01-04","2020-01-03"), stringsAsFactors = F)
expand.grid(
  date = seq.Date(from = min(as.Date(df$date)),
                  to = max(as.Date(df$date)), by = "day"),
  location = unique(df$location)
)[, c("location", "date")]
Output
location date
1 x 2020-01-01
2 x 2020-01-02
3 x 2020-01-03
4 x 2020-01-04
5 y 2020-01-01
6 y 2020-01-02
7 y 2020-01-03
8 y 2020-01-04

Summarise values from different columns and rows

Is there a built-in way to calculate sums over different rows and columns? I know that I could form a new data frame from id, drug, day2, sum_2, rename the last two columns, delete these columns in the "old" data frame, rbind with the "old" data frame, and summarise by group. But this seems needlessly complicated and error-prone.
How do I calculate the sum of sum_1 and sum_2 for drug_a given on 2020-01-02, using id + drug name as grouping variables plus day1 + day2 (when these two are the same)?
The reason for this format is that I have to split the dosage of a continuous infusion at midnight...
Example data:
id <- c(rep(1,2))
drug <- c(rep("Drug_a",2))
day1 <- c(rep("2020-01-01",1),rep("2020-01-02",1))
sum_1 <- c(rep(250,1),rep(550,1))
day2 <- c(rep("2020-01-02",1),rep("2020-01-03",1))
sum_2 <- c(rep(100,1),rep(75,1))
example_data <- data.frame(id,drug,day1,sum_1,day2,sum_2)
id drug day1 sum_1 day2 sum_2
1 1 Drug_a 2020-01-01 250 2020-01-02 100
2 1 Drug_a 2020-01-02 550 2020-01-03 75
Expected output within these lines:
id drug day sum
1 1 Drug_a 2020-01-01 250
2 1 Drug_a 2020-01-02 650
3 1 Drug_a 2020-01-03 75
Perhaps something like this might work. You can use pivot_longer to put day and sum in single columns (i.e., combine day_1 and day_2 into day, sum_1 and sum_2 into sum).
library(tidyverse)
example_data %>%
  pivot_longer(cols = c(-id, -drug), names_to = c(".value", "group"),
               names_sep = "_") %>%
  group_by(id, drug, day) %>%
  summarise(total = sum(sum))
# A tibble: 3 x 4
# Groups: id, drug [1]
id drug day total
<dbl> <fct> <fct> <dbl>
1 1 Drug_a 2020-01-01 250
2 1 Drug_a 2020-01-02 650
3 1 Drug_a 2020-01-03 75
Data
id <- c(rep(1,2))
drug <- c(rep("Drug_a",2))
day_1 <- c(rep("2020-01-01",1),rep("2020-01-02",1))
sum_1 <- c(rep(250,1),rep(550,1))
day_2 <- c(rep("2020-01-02",1),rep("2020-01-03",1))
sum_2 <- c(rep(100,1),rep(75,1))
example_data <- data.frame(id,drug,day_1,sum_1,day_2,sum_2)
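The names_to = c(".value", "group") pattern splits day_1/sum_1 into a day column and a sum column. The same wide-to-long step can be sketched in base R with reshape(), assuming the underscored example_data from the Data block above:

```r
example_data <- data.frame(
  id = c(1, 1), drug = c("Drug_a", "Drug_a"),
  day_1 = c("2020-01-01", "2020-01-02"), sum_1 = c(250, 550),
  day_2 = c("2020-01-02", "2020-01-03"), sum_2 = c(100, 75)
)
# Stack day_1/day_2 into "day" and sum_1/sum_2 into "sum"
long <- reshape(example_data, direction = "long",
                varying = list(c("day_1", "day_2"), c("sum_1", "sum_2")),
                v.names = c("day", "sum"),
                idvar = "row", ids = seq_len(nrow(example_data)))
# Total per id/drug/day
aggregate(sum ~ id + drug + day, data = long, FUN = sum)
#   id   drug        day sum
# 1  1 Drug_a 2020-01-01 250
# 2  1 Drug_a 2020-01-02 650
# 3  1 Drug_a 2020-01-03  75
```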
We can use melt from data.table:
library(data.table)
melt(setDT(example_data), measure = patterns('^day', '^sum'),
     value.name = c('day', 'sum'))[, .(total = sum(sum)), .(id, drug, day)]
# id drug day total
#1: 1 Drug_a 2020-01-01 250
#2: 1 Drug_a 2020-01-02 650
#3: 1 Drug_a 2020-01-03 75
