R: calculate rates within grouped values in dplyr [duplicate]

This question already has answers here:
Summarizing by subgroup percentage in R
(2 answers)
Closed 2 years ago.
Say we have a dataframe looking like this one below:
month issue amount
Jan withdrawal 250
Jan delay 120
Jan other 65
Feb withdrawal 189
Feb delay 122
Feb other 89
My goal is to reshape this into a dataframe giving me, within each month, the percentage of the total amount accounted for by each value of issue. Plainly speaking, my desired output should look as follows:
month issue rate
Jan withdrawal 57.47
Jan delay 27.59
Jan other 14.94
Feb withdrawal 47.25
Feb delay 30.50
Feb other 22.25
I've tried helping myself with dplyr, but my attempts have all been unsuccessful so far.

library(tidyverse)
df <- read.table(text = "month issue amount
Jan withdrawal 250
Jan delay 120
Jan other 65
Feb withdrawal 189
Feb delay 122
Feb other 89", header = T)
df
#> month issue amount
#> 1 Jan withdrawal 250
#> 2 Jan delay 120
#> 3 Jan other 65
#> 4 Feb withdrawal 189
#> 5 Feb delay 122
#> 6 Feb other 89
df %>%
group_by(month) %>%
mutate(rate = amount / sum(amount, na.rm = T) * 100)
#> # A tibble: 6 x 4
#> # Groups: month [2]
#> month issue amount rate
#> <chr> <chr> <int> <dbl>
#> 1 Jan withdrawal 250 57.5
#> 2 Jan delay 120 27.6
#> 3 Jan other 65 14.9
#> 4 Feb withdrawal 189 47.2
#> 5 Feb delay 122 30.5
#> 6 Feb other 89 22.2
df %>%
group_by(month) %>%
mutate(rate = prop.table(amount) * 100)
#> # A tibble: 6 x 4
#> # Groups: month [2]
#> month issue amount rate
#> <chr> <chr> <int> <dbl>
#> 1 Jan withdrawal 250 57.5
#> 2 Jan delay 120 27.6
#> 3 Jan other 65 14.9
#> 4 Feb withdrawal 189 47.2
#> 5 Feb delay 122 30.5
#> 6 Feb other 89 22.2
Created on 2021-01-26 by the reprex package (v0.3.0)
using data.table
library(data.table)
setDT(df)
df[, rate := prop.table(amount) * 100, by = list(month)]
df
#> month issue amount rate
#> 1: Jan withdrawal 250 57.47126
#> 2: Jan delay 120 27.58621
#> 3: Jan other 65 14.94253
#> 4: Feb withdrawal 189 47.25000
#> 5: Feb delay 122 30.50000
#> 6: Feb other 89 22.25000
Created on 2021-01-26 by the reprex package (v0.3.0)

Try the answer here. It might be what you're looking for. The code is pasted below for your convenience.
library(dplyr)
group_by(df, group) %>% mutate(percent = value/sum(value))
In your case it would probably be something like (multiplying by 100, since you want percentages rather than proportions):
library(dplyr)
group_by(df, month) %>% mutate(rate = amount/sum(amount) * 100)
Or alternatively:
group_by(df, month) %>% transmute(issue, rate = amount/sum(amount) * 100)
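If you'd rather avoid dplyr altogether, the same per-month percentages can be computed in base R with `ave()`. A minimal sketch, rebuilding the question's data frame:

```r
# the example data frame from the question
df <- data.frame(
  month  = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb"),
  issue  = c("withdrawal", "delay", "other", "withdrawal", "delay", "other"),
  amount = c(250, 120, 65, 189, 122, 89)
)

# ave() applies prop.table() to amount within each month, preserving row order
df$rate <- ave(df$amount, df$month, FUN = prop.table) * 100
round(df$rate, 2)
#> [1] 57.47 27.59 14.94 47.25 30.50 22.25
```

`ave()` accepts any function that returns a vector of the same length as its input, which is exactly what `prop.table()` does.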


Aggregate daily data into weeks

I have data resembling the following structure, where the when variable denotes the day of measurement:
## Generate data.
set.seed(1986)
n <- 1000
y <- rnorm(n)
when <- as.POSIXct(strftime(seq(as.POSIXct("2021-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"),
                                as.POSIXct("2022-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"),
                                length.out = n), format = "%Y-%m-%d"))
dta <- data.frame(y, when)
head(dta)
#> y when
#> 1 -0.04625141 2021-11-01
#> 2 0.28000082 2021-11-01
#> 3 0.25317063 2021-11-01
#> 4 -0.96411077 2021-11-02
#> 5 0.49222664 2021-11-02
#> 6 -0.69874551 2021-11-02
I need to compute averages of y over time. For instance, the following computes daily averages:
## Compute daily averages of y.
library(dplyr)
daily_avg <- dta %>%
group_by(when) %>%
summarise(daily_mean = mean(y)) %>%
ungroup()
daily_avg
#> # A tibble: 366 × 2
#> when daily_mean
#> <dttm> <dbl>
#> 1 2021-11-01 00:00:00 0.162
#> 2 2021-11-02 00:00:00 -0.390
#> 3 2021-11-03 00:00:00 -0.485
#> 4 2021-11-04 00:00:00 -0.152
#> 5 2021-11-05 00:00:00 0.425
#> 6 2021-11-06 00:00:00 0.726
#> 7 2021-11-07 00:00:00 0.855
#> 8 2021-11-08 00:00:00 0.0608
#> 9 2021-11-09 00:00:00 -0.995
#> 10 2021-11-10 00:00:00 0.395
#> # … with 356 more rows
I am having a hard time computing weekly averages. Here is what I have tried so far:
## Fail - compute weekly averages of y.
library(lubridate)
dta$week <- week(dta$when) # This is wrong.
dta[165: 171, ]
#> y when week
#> 165 0.9758333 2021-12-30 52
#> 166 -0.8630091 2021-12-31 53
#> 167 0.3054031 2021-12-31 53
#> 168 1.2814421 2022-01-01 1
#> 169 0.1025440 2022-01-01 1
#> 170 1.3665411 2022-01-01 1
#> 171 -0.5373058 2022-01-02 1
Using the week function from the lubridate package ignores the fact that my data span across years. So, if I were to use code similar to the one I used for the daily averages, I would aggregate observations belonging to different years (but sharing the same week number). How can I solve this?
You can use %V (from ?strptime) for weeks, combining it with the year.
dta %>%
group_by(week = format(when, format = "%Y-%V")) %>%
summarize(daily_mean = mean(y)) %>%
ungroup()
# # A tibble: 54 x 2
# week daily_mean
# <chr> <dbl>
# 1 2021-44 0.179
# 2 2021-45 0.0477
# 3 2021-46 0.0340
# 4 2021-47 0.356
# 5 2021-48 0.0544
# 6 2021-49 -0.0948
# 7 2021-50 -0.0419
# 8 2021-51 0.209
# 9 2021-52 0.251
# 10 2022-01 -0.197
# # ... with 44 more rows
There are different variants of "week", depending on your preference.
%V
Week of the year as decimal number (01–53) as defined in ISO 8601.
If the week (starting on Monday) containing 1 January has four or more
days in the new year, then it is considered week 1. Otherwise, it is
the last week of the previous year, and the next week is week 1.
(Accepted but ignored on input.)
%W
Week of the year as decimal number (00–53) using Monday as the first
day of week (and typically with the first Monday of the year as day 1
of week 1). The UK convention.
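One caveat with the %Y-%V combination above: %V is the ISO 8601 week number, and its matching year field is the ISO week-based year %G, not the calendar year %Y. Around 1 January the two can disagree, producing impossible labels like "2022-52". A small illustration (note that %V/%G support depends on the platform's strftime, so results may vary on older Windows builds of R):

```r
d <- as.Date(c("2021-12-31", "2022-01-01", "2022-01-03"))

# calendar year + ISO week: 2022-01-01 gets the impossible label "2022-52"
format(d, "%Y-%V")
#> [1] "2021-52" "2022-52" "2022-01"

# ISO week-based year + ISO week: 2022-01-01 correctly belongs to 2021-52
format(d, "%G-%V")
#> [1] "2021-52" "2021-52" "2022-01"
```

Grouping on `format(when, format = "%G-%V")` avoids splitting a boundary week into two groups.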
You can extract year and week from the dates and group by both:
dta %>%
mutate(year = year(when),
week = week(when)) %>%
group_by(year, week) %>%
summarise(y_mean = mean(y)) %>%
ungroup()
# # A tibble: 54 x 3
# # Groups: year, week [54]
# year week y_mean
# <dbl> <dbl> <dbl>
# 1 2021 44 -0.222
# 2 2021 45 0.234
# 3 2021 46 0.0953
# 4 2021 47 0.206
# 5 2021 48 0.192
# 6 2021 49 -0.0831
# 7 2021 50 0.0282
# 8 2021 51 0.196
# 9 2021 52 0.132
# 10 2021 53 -0.279
# # ... with 44 more rows
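Another option that sidesteps week numbering entirely is to bin dates by the Monday that starts each week; base R's `cut.Date` with breaks = "week" does this directly, and it never confuses weeks across year boundaries. A sketch on a toy data frame (column names assumed to match the question's):

```r
set.seed(1)
dta <- data.frame(
  when = as.Date("2021-12-27") + 0:13,  # two Monday-aligned weeks spanning new year
  y    = rnorm(14)
)

# label every observation with the Monday of its week, then average within weeks
dta$week_start <- cut(dta$when, breaks = "week")
aggregate(y ~ week_start, data = dta, FUN = mean)
```

Each level of `week_start` is the date of that week's Monday, so the grouping remains unambiguous across 31 December / 1 January.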

finding minimum for a column based on another column and keep result as a data frame

I have a data frame with five columns:
year<- c(2000,2000,2000,2001,2001,2001,2002,2002,2002)
k<- c(12.5,11.5,10.5,-8.5,-9.5,-10.5,13.9,14.9,15.9)
pop<- c(143,147,154,445,429,430,178,181,211)
pop_obs<- c(150,150,150,440,440,440,185,185,185)
df<- data_frame(year,k,pop,pop_obs)
df
year k pop pop_obs
<dbl> <dbl> <dbl> <dbl>
1 2000 12.5 143 150
2 2000 11.5 147 150
3 2000 10.5 154 150
4 2001 -8.5 445 440
5 2001 -9.5 429 440
6 2001 -10.5 430 440
7 2002 13.9 178 185
8 2002 14.9 181 185
9 2002 15.9 211 185
What I want is, for each year, the k whose pop value has the minimum absolute difference from pop_obs. Finally, I want to keep the result as a data frame with one row per year.
my expected output would be like this:
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2002 14.9
You could try with dplyr
df<- data.frame(year,k,pop,pop_obs)
library(dplyr)
df %>%
mutate(diff = abs(pop_obs - pop)) %>%
group_by(year) %>%
filter(diff == min(diff)) %>%
select(year, k)
#> # A tibble: 3 x 2
#> # Groups: year [3]
#> year k
#> <dbl> <dbl>
#> 1 2000 11.5
#> 2 2001 -8.5
#> 3 2002 14.9
Created on 2021-12-11 by the reprex package (v2.0.1)
Try the tidyverse way
library(tidyverse)
data_you_want <- df %>%
  mutate(dif = abs(pop - pop_obs)) %>%
  group_by(year) %>%
  slice_min(dif, n = 1) %>%
  ungroup() %>%
  select(year, k)
Using base R
subset(df, as.logical(ave(abs(pop_obs - pop), year,
FUN = function(x) x == min(x))), select = c('year', 'k'))
# A tibble: 3 × 2
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2002 14.9
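For completeness, the same per-year minimum can be picked out in base R without `ave`, by locating the row index of the smallest difference within each year. A sketch using the question's data:

```r
year    <- c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002)
k       <- c(12.5, 11.5, 10.5, -8.5, -9.5, -10.5, 13.9, 14.9, 15.9)
pop     <- c(143, 147, 154, 445, 429, 430, 178, 181, 211)
pop_obs <- c(150, 150, 150, 440, 440, 440, 185, 185, 185)
df <- data.frame(year, k, pop, pop_obs)

# for each year, keep the index of the row whose pop is closest to pop_obs
idx <- tapply(seq_len(nrow(df)), df$year,
              function(i) i[which.min(abs(df$pop[i] - df$pop_obs[i]))])
df[unlist(idx), c("year", "k")]
#>   year    k
#> 2 2000 11.5
#> 4 2001 -8.5
#> 8 2002 14.9
```

Like `which.min`, this keeps only the first row in case of ties.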

How to forecast multiple time series in R

I have a dataset that contains multiple series (50 products, one column per product). Each column holds the daily sales of one product.
I want to forecast these products using ets. I have written the code below, but when I run it I get only one time series and some information that I do not understand. Thanks in advance :)
y<- read.csv("QAO2.csv", header=FALSE, fileEncoding = "latin1")
y <- ts(y[,-1],f=12,s=c(2007, 1))
ns <- ncol(y)
for(i in 1:ns)
fit.ets <- ets(y[,i])
print(fit.ets)
f.ets <- forecast(fit.ets,h=12)
print(f.ets)
plot(f.ets)
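Before switching packages, note one likely reason the loop above prints only a single model: an R for loop without braces executes only the next single statement, so everything from print(fit.ets) onward runs once, after the loop has finished. A minimal illustration of the pitfall (plain R, no packages):

```r
vals <- c()
for (i in 1:3)
  vals <- c(vals, i)   # this line is the entire loop body
  vals <- c(vals, 99)  # NOT in the loop; runs once, after it
vals
#> [1]  1  2  3 99
```

Wrapping the original body, from fit.ets <- ets(y[, i]) through plot(f.ets), in { } would fit and print all 50 models, though the fable approach in the answer remains the tidier way to handle many series at once.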
This is what the fable package is designed to do. Here is an example using 50 series of monthly data from 2007. Although you say you have daily data, the code you provide assumes monthly data (frequency 12).
library(fable)
library(dplyr)
library(tidyr)
library(ggplot2)
y <- ts(matrix(rnorm(175*50), ncol=50), frequency=12, start=c(2007,1)) %>%
as_tsibble() %>%
rename(Month = index, Sales=value)
y
#> # A tsibble: 8,750 x 3 [1M]
#> # Key: key [50]
#> Month key Sales
#> <mth> <chr> <dbl>
#> 1 2007 Jan Series 1 1.06
#> 2 2007 Feb Series 1 0.495
#> 3 2007 Mar Series 1 0.332
#> 4 2007 Apr Series 1 0.157
#> 5 2007 May Series 1 -0.120
#> 6 2007 Jun Series 1 -0.0846
#> 7 2007 Jul Series 1 -0.743
#> 8 2007 Aug Series 1 0.714
#> 9 2007 Sep Series 1 1.73
#> 10 2007 Oct Series 1 -0.212
#> # … with 8,740 more rows
fit.ets <- y %>% model(ETS(Sales))
fit.ets
#> # A mable: 50 x 2
#> # Key: key [50]
#> key `ETS(Sales)`
#> <chr> <model>
#> 1 Series 1 <ETS(A,N,N)>
#> 2 Series 10 <ETS(A,N,N)>
#> 3 Series 11 <ETS(A,N,N)>
#> 4 Series 12 <ETS(A,N,N)>
#> 5 Series 13 <ETS(A,N,N)>
#> 6 Series 14 <ETS(A,N,N)>
#> 7 Series 15 <ETS(A,N,N)>
#> 8 Series 16 <ETS(A,N,N)>
#> 9 Series 17 <ETS(A,N,N)>
#> 10 Series 18 <ETS(A,N,N)>
#> # … with 40 more rows
f.ets <- forecast(fit.ets, h=12)
f.ets
#> # A fable: 600 x 5 [1M]
#> # Key: key, .model [50]
#> key .model Month Sales .mean
#> <chr> <chr> <mth> <dist> <dbl>
#> 1 Series 1 ETS(Sales) 2021 Aug N(-0.028, 1.1) -0.0279
#> 2 Series 1 ETS(Sales) 2021 Sep N(-0.028, 1.1) -0.0279
#> 3 Series 1 ETS(Sales) 2021 Oct N(-0.028, 1.1) -0.0279
#> 4 Series 1 ETS(Sales) 2021 Nov N(-0.028, 1.1) -0.0279
#> 5 Series 1 ETS(Sales) 2021 Dec N(-0.028, 1.1) -0.0279
#> 6 Series 1 ETS(Sales) 2022 Jan N(-0.028, 1.1) -0.0279
#> 7 Series 1 ETS(Sales) 2022 Feb N(-0.028, 1.1) -0.0279
#> 8 Series 1 ETS(Sales) 2022 Mar N(-0.028, 1.1) -0.0279
#> 9 Series 1 ETS(Sales) 2022 Apr N(-0.028, 1.1) -0.0279
#> 10 Series 1 ETS(Sales) 2022 May N(-0.028, 1.1) -0.0279
#> # … with 590 more rows
f.ets %>%
filter(key == "Series 1") %>%
autoplot(y) +
labs(title = "Series 1")
Created on 2021-08-05 by the reprex package (v2.0.0)

Finding the Weekday Average from numerical values in R

So I have values like
Mon 162 Tue 123 Wed 29
and so on. I need to find the average for all weekdays in R. I have tried filter and group_by but cannot get an answer.
Time Day Count Speed
1 00:00 Sun 169 60.2
2 00:00 Mon 71 58.5
3 00:00 Tue 70 57.2
4 00:00 Wed 68 58.5
5 00:00 Thu 91 58.8
6 00:00 Fri 94 58.7
7 00:00 Sat 135 58.5
8 01:00 Sun 111 60.0
9 01:00 Mon 45 59.2
10 01:00 Tue 50 57.6
I need the outcome to be Weekday Average = ####
Let's say your df is
> df
# A tibble: 14 x 2
Day Count
<chr> <dbl>
1 Sun 31
2 Mon 51
3 Tue 21
4 Wed 61
5 Thu 31
6 Fri 51
7 Sat 65
8 Sun 31
9 Mon 13
10 Tue 61
11 Wed 72
12 Thu 46
13 Fri 62
14 Sat 13
You can use
df %>%
filter(!Day %in% c('Sun', 'Sat')) %>%
group_by(Day) %>%
summarize(mean(Count))
To get
# A tibble: 5 x 2
Day `mean(Count)`
<chr> <dbl>
1 Fri 56.5
2 Mon 32
3 Thu 38.5
4 Tue 41
5 Wed 66.5
For the average of all filtered values
df %>%
filter(!Day %in% c("Sun", "Sat")) %>%
summarize("Average of all Weekday counts" = mean(Count))
Output
# A tibble: 1 x 1
`Average of all Weekday counts`
<dbl>
1 46.9
To get a numeric value instead of a tibble
df %>%
filter(!Day %in% c("Sun", "Sat")) %>%
summarize("Average of all Weekday counts" = mean(Count)) %>%
as.numeric()
Output
[1] 46.9
This might do the trick
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
d.f <- data.frame(Day = rep(days, 3), Speed = rnorm(21))
# split the data frame by day, then take the mean of Speed within each piece
lapply(split(d.f, f = d.f$Day), function(d) mean(d$Speed))
If you're looking for the single mean for just the weekdays, you could do something like this:
dat = data.frame(Time = rep(c("00:00","01:00"),c(7,3)),
Day = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat","Sun","Mon","Tue"),
Count = c(169,71,70,68,91,94,135,111,45,50),
Speed = c(60.2,58.5,57.2,58.5,58.8,58.7,58.5,60.0,59.2,57.6))
mean(dat$Count[dat$Day %in% c("Mon","Tue","Wed","Thu","Fri")])
# [1] 69.85714
If, on the other hand, you're looking for the mean across each individual day then you could do this using base R:
aggregate(dat$Count, by=list(dat$Day), FUN = mean)
# Group.1 x
# 1 Fri 94
# 2 Mon 58
# 3 Sat 135
# 4 Sun 140
# 5 Thu 91
# 6 Tue 60
# 7 Wed 68
It looks like you've tried dplyr, so the syntax for that same operation in dplyr would be:
library(dplyr)
dat %>% group_by(Day) %>% summarize(mean_count = mean(Count))
# Day mean_count
# <chr> <dbl>
# 1 Fri 94
# 2 Mon 58
# 3 Sat 135
# 4 Sun 140
# 5 Thu 91
# 6 Tue 60
# 7 Wed 68
And if you want to do the same thing in data.table you would do this:
library(data.table)
as.data.table(dat)[,.(mean_count = mean(Count)), by = Day]
# Day mean_count
# 1: Sun 140
# 2: Mon 58
# 3: Tue 60
# 4: Wed 68
# 5: Thu 91
# 6: Fri 94
# 7: Sat 135

find yearly cumsum of a variable in R?

In my code below, I would like to find the cumsum for each year. Right now, variable A is being summed over the entire duration. Any help would be appreciated.
library(dplyr)
library(lubridate)
set.seed(50)
DF <- data.frame(date = seq(as.Date("2001-01-01"), to= as.Date("2003-12-31"), by="day"),
A = runif(1095, 0,10))
DF1 <- DF %>%
mutate(Year = year(date), Month = month(date), JDay = yday(date)) %>%
filter(between(Month,5,10)) %>%
group_by(Year, JDay) %>%
mutate(Precipitation = cumsum(A))
Just remove JDay from grouping variables
DF1 <- DF %>%
mutate(Year = year(date), Month = month(date), JDay = yday(date)) %>%
filter(between(Month,5,10)) %>%
group_by(Year) %>%
mutate(Precipitation = cumsum(A)) %>%
ungroup()
It seems the issue here is with your grouping clause. Specifically, as there are as many distinct combinations of Year and JDay in your data as there are rows in DF, the subsequent cumsum operation inside mutate will simply return the same value as the input column, A. I believe the following should give you what you're after
library(dplyr)
library(lubridate)
set.seed(50)
DF <- data.frame(date = seq(as.Date("2001-01-01"), to= as.Date("2003-12-31"), by="day"),
A = runif(1095, 0,10))
DF1 <- DF %>%
mutate(Year = year(date), Month = month(date), JDay = yday(date)) %>%
filter(between(Month,5,10)) %>%
arrange(Year, JDay) %>%
group_by(Year) %>%
mutate(Precipitation = cumsum(A)) %>%
ungroup()
# illustrate that Precipitation does indeed give the cumulative value of A for
# each year by printing the first 5 observations for each year in DF1
DF1 %>%
group_by(Year) %>%
slice(1:5)
#> # A tibble: 15 x 6
#> # Groups: Year [3]
#> date A Year Month JDay Precipitation
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2001-05-01 6.25 2001 5 121 6.25
#> 2 2001-05-02 0.188 2001 5 122 6.43
#> 3 2001-05-03 5.37 2001 5 123 11.8
#> 4 2001-05-04 5.55 2001 5 124 17.4
#> 5 2001-05-05 5.15 2001 5 125 22.5
#> 6 2002-05-01 2.95 2002 5 121 2.95
#> 7 2002-05-02 6.75 2002 5 122 9.71
#> 8 2002-05-03 7.77 2002 5 123 17.5
#> 9 2002-05-04 8.13 2002 5 124 25.6
#> 10 2002-05-05 5.58 2002 5 125 31.2
#> 11 2003-05-01 9.98 2003 5 121 9.98
#> 12 2003-05-02 8.24 2003 5 122 18.2
#> 13 2003-05-03 6.13 2003 5 123 24.4
#> 14 2003-05-04 5.22 2003 5 124 29.6
#> 15 2003-05-05 9.81 2003 5 125 39.4
Here is a data.table solution for that. If you want the cumsum for each year, but show only the interval from month 5 to 10, this would be data.table code for that:
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:lubridate':
#>
#> hour, isoweek, mday, minute, month, quarter, second, wday, week,
#> yday, year
set.seed(50)
DF <- data.frame(date = seq(as.Date("2001-01-01"), to= as.Date("2003-12-31"), by="day"),
A = runif(1095, 0,10))
data.table(DF)[, `:=` (Year = year(date), Month = month(date), JDay = yday(date))
  ][, Precipitation := cumsum(A), by = Year
  ][between(Month, 5, 10)][]
#> date A Year Month JDay Precipitation
#> 1: 2001-05-01 6.2465000 2001 5 121 568.9538
#> 2: 2001-05-02 0.1877191 2001 5 122 569.1416
#> 3: 2001-05-03 5.3717570 2001 5 123 574.5133
#> 4: 2001-05-04 5.5457454 2001 5 124 580.0591
#> 5: 2001-05-05 5.1508288 2001 5 125 585.2099
#> ---
#> 548: 2003-10-27 0.1979292 2003 10 300 1479.8115
#> 549: 2003-10-28 6.7286553 2003 10 301 1486.5402
#> 550: 2003-10-29 8.7215420 2003 10 302 1495.2617
#> 551: 2003-10-30 8.2572257 2003 10 303 1503.5190
#> 552: 2003-10-31 9.6567923 2003 10 304 1513.1757
If you want the cumsum for the months 5-10 only, you would put the filter before calculating the cumsum:
data.table(DF)[, `:=` (Year = year(date), Month = month(date), JDay = yday(date))
  ][between(Month, 5, 10)
  ][, Precipitation := cumsum(A), by = Year][]
#> date A Year Month JDay Precipitation
#> 1: 2001-05-01 6.2465000 2001 5 121 6.246500
#> 2: 2001-05-02 0.1877191 2001 5 122 6.434219
#> 3: 2001-05-03 5.3717570 2001 5 123 11.805976
#> 4: 2001-05-04 5.5457454 2001 5 124 17.351722
#> 5: 2001-05-05 5.1508288 2001 5 125 22.502550
#> ---
#> 548: 2003-10-27 0.1979292 2003 10 300 916.597973
#> 549: 2003-10-28 6.7286553 2003 10 301 923.326629
#> 550: 2003-10-29 8.7215420 2003 10 302 932.048171
#> 551: 2003-10-30 8.2572257 2003 10 303 940.305396
#> 552: 2003-10-31 9.6567923 2003 10 304 949.962189
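The second variant (filter to May-October first, then accumulate within each year) also has a compact base R equivalent using `ave()`, if you want to avoid extra packages. A minimal sketch using the question's data:

```r
set.seed(50)
DF <- data.frame(date = seq(as.Date("2001-01-01"), to = as.Date("2003-12-31"), by = "day"),
                 A = runif(1095, 0, 10))

DF$Year  <- as.integer(format(DF$date, "%Y"))
DF$Month <- as.integer(format(DF$date, "%m"))

# keep May-October, then accumulate A within each year
DF2 <- DF[DF$Month >= 5 & DF$Month <= 10, ]
DF2$Precipitation <- ave(DF2$A, DF2$Year, FUN = cumsum)
head(DF2[, c("date", "A", "Year", "Precipitation")], 3)
```

Because `ave()` restarts `cumsum` within each level of Year, Precipitation resets to the first A value on 1 May of every year.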
