Is there a built-in way to calculate sums over different rows and columns? I know that I could form a new data frame from id, drug, day2 and sum_2, rename the last two columns, delete those columns from the "old" data frame, rbind it with the "old" data frame and summarise by group. But this seems strangely complicated and possibly error-prone.
How do I calculate the sum of sum_1 and sum_2 for Drug_a given on 2020-01-02, using id and drug name as grouping variables together with day1 and day2 (when these two hold the same date)?
The reason for this format is that I have to split the dosage of a continuous infusion at midnight...
Example data:
id <- c(rep(1,2))
drug <- c(rep("Drug_a",2))
day1 <- c(rep("2020-01-01",1),rep("2020-01-02",1))
sum_1 <- c(rep(250,1),rep(550,1))
day2 <- c(rep("2020-01-02",1),rep("2020-01-03",1))
sum_2 <- c(rep(100,1),rep(75,1))
example_data <- data.frame(id,drug,day1,sum_1,day2,sum_2)
id drug day1 sum_1 day2 sum_2
1 1 Drug_a 2020-01-01 250 2020-01-02 100
2 1 Drug_a 2020-01-02 550 2020-01-03 75
Expected output within these lines:
id drug day sum
1 1 Drug_a 2020-01-01 250
2 1 Drug_a 2020-01-02 650
3 1 Drug_a 2020-01-03 75
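For reference, the stack-and-summarise workaround described above would look roughly like this (a sketch using dplyr and the column names from the example data; the answers below avoid the manual stacking):
library(dplyr)
# stack the two day/sum pairs, then sum per id, drug and day
bind_rows(
  example_data %>% select(id, drug, day = day1, sum = sum_1),
  example_data %>% select(id, drug, day = day2, sum = sum_2)
) %>%
  group_by(id, drug, day) %>%
  summarise(sum = sum(sum), .groups = "drop")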
Perhaps something like this might work. You can use pivot_longer to put day and sum in single columns (i.e., combine day_1 and day_2 into day, sum_1 and sum_2 into sum).
library(tidyverse)
example_data %>%
  pivot_longer(cols = c(-id, -drug), names_to = c(".value", "group"), names_sep = "_") %>%
  group_by(id, drug, day) %>%
  summarise(total = sum(sum))
# A tibble: 3 x 4
# Groups: id, drug [1]
id drug day total
<dbl> <fct> <fct> <dbl>
1 1 Drug_a 2020-01-01 250
2 1 Drug_a 2020-01-02 650
3 1 Drug_a 2020-01-03 75
Data
id <- c(rep(1,2))
drug <- c(rep("Drug_a",2))
day_1 <- c(rep("2020-01-01",1),rep("2020-01-02",1))
sum_1 <- c(rep(250,1),rep(550,1))
day_2 <- c(rep("2020-01-02",1),rep("2020-01-03",1))
sum_2 <- c(rep(100,1),rep(75,1))
example_data <- data.frame(id,drug,day_1,sum_1,day_2,sum_2)
We can use melt from data.table
library(data.table)
melt(setDT(example_data), measure = patterns('^day', '^sum'),
value.name = c('day', 'sum'))[, .(total = sum(sum)), .(id, drug, day)]
# id drug day total
#1: 1 Drug_a 2020-01-01 250
#2: 1 Drug_a 2020-01-02 650
#3: 1 Drug_a 2020-01-03 75
data
id <- c(rep(1,2))
drug <- c(rep("Drug_a",2))
day_1 <- c(rep("2020-01-01",1),rep("2020-01-02",1))
sum_1 <- c(rep(250,1),rep(550,1))
day_2 <- c(rep("2020-01-02",1),rep("2020-01-03",1))
sum_2 <- c(rep(100,1),rep(75,1))
example_data <- data.frame(id,drug,day_1,sum_1,day_2,sum_2)
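For completeness, a base R cross-check (a sketch, not part of the answers above; it assumes the day_1/day_2 column names defined in the data block):
# reshape the two day/sum pairs into long format, then aggregate per id, drug and day
long <- reshape(example_data, direction = "long",
                varying = list(c("day_1", "day_2"), c("sum_1", "sum_2")),
                v.names = c("day", "sum"), timevar = "slot", idvar = "rowid")
aggregate(sum ~ id + drug + day, long, sum)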
So here's the data:
DF1
ID DOW
1 Monday
1 Monday
1 Tuesday
2 Tuesday
2 Wednesday
3 Friday
3 Monday
3 Tuesday
I would like to join the following dictionary.
DF2
ID DOW Hours
1 Monday 20
1 Tuesday 21
2 Tuesday 30
2 Wednesday 25
3 Friday 24
3 Monday 42
3 Tuesday 54
My goal is to get the total count of entries for each day as well as the hours worked on that day, but if a value in the list exists twice, it should not be counted twice. (That's the hard part.)
Here's my attempt in R:
df3 <- df1 %>%
  left_join(df2, by = c("DOW", "ID"))

df3 %>%
  group_by(ID) %>%
  summarize(count = n(),
            sum = sum(Hours)) %>%
  mutate(injRate = count/sum)
This does not work: it correctly counts the total number of entries for each ID, but it sums Hours for every joined row, so hours that appear multiple times are counted multiple times...
End product should be:
ID count sum
1 3 41
2 2 55
3 3 120
Again: take the count, but sum the hours without double counting.
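One minimal way to express "don't double count" directly in the attempted pipeline is to drop duplicated DOW rows inside the sum (a sketch assuming dplyr and the df1/df2 objects from the attempt above; the answers below show other approaches):
library(dplyr)
df1 %>%
  left_join(df2, by = c("ID", "DOW")) %>%
  group_by(ID) %>%
  # count every row, but only sum Hours for the first occurrence of each DOW
  summarize(count = n(), sum = sum(Hours[!duplicated(DOW)]))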
Here is a base R option using merge + aggregate
u <- merge(df1, df2, by = c("ID", "DOW"))
res <- setNames(
merge(aggregate(DOW ~ ID, u, length),
aggregate(Hours ~ ID, unique(u), sum),
by = "ID"
),
c("ID", "Count", "Sum")
)
which gives
> res
ID Count Sum
1 1 3 41
2 2 2 55
3 3 3 120
An option with data.table
library(data.table)
setDT(df1)[df2, .(Count = .N, Hours), on = .(ID), by = .EACHI][,
.(Sum = sum(Hours)), .(ID, Count)]
# ID Count Sum
#1: 1 3 41
#2: 2 2 55
#3: 3 3 120
Another approach is to summarize the tables prior to joining them.
textFile1 <- "ID DOW
1 Monday
1 Monday
1 Tuesday
2 Tuesday
2 Wednesday
3 Friday
3 Monday
3 Tuesday"
textFile2 <- "ID DOW Hours
1 Monday 20
1 Tuesday 21
2 Tuesday 30
2 Wednesday 25
3 Friday 24
3 Monday 42
3 Tuesday 54"
library(dplyr)

df1 <- read.table(text = textFile1, header = TRUE)
df2 <- read.table(text = textFile2, header = TRUE)

df1 %>%
  group_by(ID) %>%
  summarise(count = n()) -> counts

df2 %>%
  group_by(ID) %>%
  summarize(sum = sum(Hours)) %>%
  left_join(counts) %>%
  mutate(injRate = count/sum)
...and the output:
# A tibble: 3 x 4
ID sum count injRate
<int> <int> <int> <dbl>
1 1 41 3 0.0732
2 2 55 2 0.0364
3 3 120 3 0.025
Try this solution, where you compute the count per group and then filter out duplicated rows to obtain the final summary:
library(tidyverse)
#Data
df3 <- df1 %>%
  left_join(df2, by = c("DOW", "ID"))
#Code
df3 %>%
  group_by(ID) %>%
  mutate(count = n()) %>%
  filter(!duplicated(DOW)) %>%
  summarise(count = unique(count), Sum = sum(Hours))
Output:
# A tibble: 3 x 3
ID count Sum
<int> <int> <int>
1 1 3 41
2 2 2 55
3 3 3 120
From a data frame with a month and a continuous id number:
data.frame( id = c(1,1,2,2,3), month = c("2011-02","2011-02","2011-02","2011-03","2011-01"))
How is it possible to extract the total number of months and get the frq for every id?
Expected output:
data.frame(id = c(1,2,2,3), month = c("2011-02","2011-02","2011-03","2011-01"), frq = c(1,1,2,1))
One option is:
df$freq <- 1
aggregate(freq ~ id + month, df, sum)
id month freq
1 3 2011-01 1
2 1 2011-02 2
3 2 2011-02 1
4 2 2011-03 1
An idea via dplyr can be,
library(dplyr)
distinct(df) %>%
  group_by(id) %>%
  mutate(freq = seq(n()))
# A tibble: 4 x 3
# Groups: id [3]
# id month freq
# <dbl> <fct> <int>
#1 1 2011-02 1
#2 2 2011-02 1
#3 2 2011-03 2
#4 3 2011-01 1
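For reference, a data.table sketch that reproduces the expected output above (a running month counter per id after dropping duplicated rows; not from the answers, and it assumes df is the data frame shown in the question):
library(data.table)
# drop duplicated rows, then number the remaining months within each id
unique(setDT(df))[, frq := seq_len(.N), by = id][]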
I've got data from a number of surveys. Each survey can be sent multiple times with updated values. For each survey/row in the dataset there's a date when the survey was submitted (created). I'd like to merge the rows for each survey, keeping the date from the first submission but the other data from the last submission.
A simple example:
#> survey created var1 var2
#> 1 s1 2020-01-01 10 30
#> 2 s2 2020-01-02 10 90
#> 3 s2 2020-01-03 20 20
#> 4 s3 2020-01-01 45 5
#> 5 s3 2020-01-02 50 50
#> 6 s3 2020-01-03 30 10
Desired result:
#> survey created var1 var2
#> 1 s1 2020-01-01 10 30
#> 2 s2 2020-01-02 20 20
#> 3 s3 2020-01-01 30 10
Example data:
df <- data.frame(survey = c("s1", "s2", "s2", "s3", "s3", "s3"),
created = as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02", "2020-01-03"), "%Y-%m-%d", tz = "GMT"),
var1 = c(10, 10, 20, 45, 50, 30),
var2 = c(30, 90, 20, 5, 50, 10),
stringsAsFactors=FALSE)
I've tried group_by with summarize in different ways but can't make it work, any help would be highly appreciated!
After grouping by 'survey', update 'created' to the first (or min) value of 'created' and then slice the last row (n()).
library(dplyr)
df %>%
  group_by(survey) %>%
  mutate(created = as.Date(first(created))) %>%
  slice(n())
# A tibble: 3 x 4
# Groups: survey [3]
# survey created var1 var2
# <chr> <date> <dbl> <dbl>
#1 s1 2020-01-01 10 30
#2 s2 2020-01-02 20 20
#3 s3 2020-01-01 30 10
Or using base R:
transform(df, created = ave(created, survey, FUN = function(x) x[1]))[
  !duplicated(df$survey, fromLast = TRUE), ]
After selecting the first created date we can select the last values from all the columns.
library(dplyr)
df %>%
  group_by(survey) %>%
  mutate(created = as.Date(first(created))) %>%
  summarise(across(created:var2, last))
# In older versions use `summarise_at`:
# summarise_at(vars(created:var2), last)
# A tibble: 3 x 4
# survey created var1 var2
# <chr> <date> <dbl> <dbl>
#1 s1 2020-01-01 10 30
#2 s2 2020-01-02 20 20
#3 s3 2020-01-01 30 10
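A data.table cross-check (a sketch, not from the answers above), keeping the first created date and the last row's values per survey, assuming the rows are ordered by created as in the example:
library(data.table)
setDT(df)[, .(created = as.Date(first(created)),
              var1 = last(var1), var2 = last(var2)), by = survey]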
The data frame df1 summarizes detections of individuals (ID) through time (Date). As a short example:
library(lubridate)
df1 <- data.frame(ID = c(1,2,1,2,1,2,1,2,1,2),
                  Date = ymd(c("2016-08-21","2016-08-24","2016-08-23","2016-08-29","2016-08-27",
                               "2016-09-02","2016-09-01","2016-09-09","2016-09-01","2016-09-10")))
df1
ID Date
1 1 2016-08-21
2 2 2016-08-24
3 1 2016-08-23
4 2 2016-08-29
5 1 2016-08-27
6 2 2016-09-02
7 1 2016-09-01
8 2 2016-09-09
9 1 2016-09-01
10 2 2016-09-10
I want to summarize both the number of days since the first detection of the individual (Ndays) and the number of distinct days on which the individual has been detected since it was first detected (Ndifdays).
Additionally, I would like to include in this summary table a variable called Prop that simply divides Ndifdays by Ndays.
The summary table that I would expect would be this:
> Result
ID Ndays Ndifdays Prop
1 1 11 4 0.364 # Between 21st Aug and 1st Sept there are 11 days.
2 2 17 5 0.294 # Between 24th Aug and 10th Sept there are 17 days.
Does anyone know how to do it?
You could achieve this using various summarising functions in dplyr:
library(dplyr)
df1 %>%
  group_by(ID) %>%
  summarise(Ndays = as.integer(max(Date) - min(Date)),
            Ndifdays = n_distinct(Date),
            Prop = Ndifdays/Ndays)
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
The data.table version of this would be
library(data.table)
df12 <- setDT(df1)[, .(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = uniqueN(Date)), by = ID]
df12$Prop <- df12$Ndifdays/df12$Ndays
and base R with aggregate:
df12 <- aggregate(Date ~ ID, df1, function(x) c(max(x) - min(x), length(unique(x))))
# Date is now a matrix column: Date[,1] is Ndays, Date[,2] is Ndifdays
df12$Prop <- df12$Date[, 2] / df12$Date[, 1]
After grouping by 'ID', take the diff of the range of 'Date' to create 'Ndays', then get the number of unique 'Date' values with n_distinct, and divide the number of distinct days by Ndays to get 'Prop'.
library(dplyr)
df1 %>%
  group_by(ID) %>%
  summarise(Ndays = as.integer(diff(range(Date))),
            Ndifdays = n_distinct(Date),
            Prop = Ndifdays/Ndays)
# A tibble: 2 x 4
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
I have a table sf with two fields: Customer (id) and Buy_date. Buy_date is unique for each customer, but a customer can have more than 3 different Buy_date values. I want to calculate the difference between consecutive Buy_dates for each customer and its mean value. How can I do this?
Example
Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13
I want the results for each customer in the format
Customer mean
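For orientation, a minimal base R sketch of the computation described above (self-contained; the answers below use dplyr and zoo):
buys <- data.frame(Customer = c(1, 1, 1, 1, 2, 2, 2),
                   Buy_date = as.Date(c("2018/03/01", "2018/03/19", "2018/04/3",
                                        "2018/05/10", "2018/01/02", "2018/02/10",
                                        "2018/04/13"), format = "%Y/%m/%d"))
# mean gap in days between consecutive purchases, per customer
tapply(buys$Buy_date, buys$Customer, function(d) mean(as.numeric(diff(sort(d)))))
For the example this gives roughly 23.3 days for customer 1 and 50.5 days for customer 2, matching the dplyr results below.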
Here's a dplyr solution.
Your data:
df <- data.frame(Customer = c(1,1,1,1,2,2,2), Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10", "2018/01/02", "2018/02/10", "2018/04/13"))
Grouping, mean Buy_date calculation and summarising:
library(dplyr)
df %>%
  group_by(Customer) %>%
  mutate(mean = mean(as.POSIXct(Buy_date))) %>%
  group_by(Customer, mean) %>%
  summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <dttm>
1 1 2018-03-31 06:30:00
2 2 2018-02-17 15:40:00
Or, as @r2evans points out in his comment, for the mean time between consecutive Buy_dates:
df %>%
  group_by(Customer) %>%
  mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>%
  group_by(Customer, mean) %>%
  summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <time>
1 1 23.3194444444444
2 2 50.4791666666667
I am not exactly sure of the desired output, but this is what I think you want.
library(dplyr)
library(zoo)
dat <- read.table(text =
"Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13", header = T, stringsAsFactors = F)
dat$Buy_date <- as.Date(dat$Buy_date)
dat %>%
  group_by(Customer) %>%
  mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad = TRUE)),
         mean_days = mean(diff_between, na.rm = TRUE))
This produces:
Customer Buy_date diff_between mean_days
<int> <date> <dbl> <dbl>
1 1 2018-03-01 NA 23.3
2 1 2018-03-19 18 23.3
3 1 2018-04-03 15 23.3
4 1 2018-05-10 37 23.3
5 2 2018-01-02 NA 50.5
6 2 2018-02-10 39 50.5
7 2 2018-04-13 62 50.5
EDITED BASED ON USER COMMENTS:
Because you said that you have factors and not characters, just convert them by doing the following:
dat$Buy_date <- as.Date(as.character(dat$Buy_date))
dat$Customer <- as.character(dat$Customer)