How to summarize two different groups after an initial group_by in R

I have the following, which I would like to do in one go instead of producing two separate results and taking their union:
delivery_stats= data.frame(service=c("UberEats", "Seamless","UberEats", "Seamless"),
status = c("OnTime", "OnTime", "Late", "Late"),
totals = c(235, 488, 32, 58))
ds1 = filter(delivery_stats, service =="UberEats") %>%
group_by(service, status) %>%
summarise(count_status = sum(totals)) %>%
mutate(avg_of_status = count_status/sum(count_status))
#now do the same for Seamless, then union...

Provided I have understood you correctly, do you mean this?
delivery_stats %>%
group_by(service) %>%
mutate(n = sum(totals)) %>%
transmute(
status,
count_status = totals,
avg_of_status = count_status/n)
# A tibble: 4 x 4
# Groups: service [2]
# service status count_status avg_of_status
# <fct> <fct> <dbl> <dbl>
#1 UberEats OnTime 235 0.880
#2 Seamless OnTime 488 0.894
#3 UberEats Late 32 0.120
#4 Seamless Late 58 0.106
Explanation: First group by service to calculate the sum of totals per service (n); then, within each service, divide each status's count_status = totals by n to get its share.

You can also try base R, using ave with the help of within.
res <- within(delivery_stats, {
count_status <- ave(totals, service, status, FUN=mean)
avg_of_status <- count_status / ave(totals, service, FUN=sum)
})
res
# service status totals avg_of_status count_status
# 1 UberEats OnTime 235 0.8801498 235
# 2 Seamless OnTime 488 0.8937729 488
# 3 UberEats Late 32 0.1198502 32
# 4 Seamless Late 58 0.1062271 58
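A cross-tabulation is another base R route to the same shares. This is only a sketch, assuming one totals value per service/status pair as in the example data; the names counts and shares are chosen here for illustration.

```r
delivery_stats <- data.frame(
  service = c("UberEats", "Seamless", "UberEats", "Seamless"),
  status  = c("OnTime", "OnTime", "Late", "Late"),
  totals  = c(235, 488, 32, 58)
)

# Sum totals into a service x status table
counts <- xtabs(totals ~ service + status, data = delivery_stats)

# margin = 1 divides each row (one service) by its own row total
shares <- prop.table(counts, margin = 1)
shares
```

The result is a wide table (one row per service, one column per status) rather than the long format of the dplyr answers, so it suits a quick look more than further piping.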

As said above, I didn't have to filter and it would have worked fine for both groups:
delivery_stats= data.frame(service=c("UberEats", "Seamless","UberEats", "Seamless"),
status = c("OnTime", "OnTime", "Late", "Late"),
totals = c(235, 488, 32, 58))
ds1 = group_by(delivery_stats, service, status) %>%
summarise(count_status = sum(totals)) %>%
mutate(avg_of_status = count_status/sum(count_status))
# A tibble: 4 x 4
# Groups: service [2]
service status count_status avg_of_status
<fct> <fct> <dbl> <dbl>
1 Seamless Late 58 0.106
2 Seamless OnTime 488 0.894
3 UberEats Late 32 0.120
4 UberEats OnTime 235 0.880

Related

Group and add variable of type stock and another type in a single step?

I want to group by district, summing the 'incoming' values over the quarters and taking the value of 'stock' in the last quarter (3), in just one step. 'stock' cannot be summed across quarters.
My example dataframe:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
df
district quarter incoming stock
1 ARA 1 4044 19547
2 ARA 2 2992 3160
3 ARA 3 2556 1533
4 BJI 1 1639 5355
5 BJI 2 9547 6146
6 BJI 3 1191 355
7 CMC 1 2038 5816
8 CMC 2 1942 1119
9 CMC 3 225 333
The actual dataframe has ~45.000 rows and 41 variables of which 8 are of type stock.
The result should be:
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
I know how to get the result, but it takes three steps, and I think that is both inefficient and error-prone given the size of the data.
My approach:
basea <- df %>%
group_by(district) %>%
filter(quarter==3) %>% #take only the last quarter
summarise(across(stock, sum))
baseb <- df %>%
group_by(district) %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
Does anyone have any suggestions to perform the procedure in one (or at least two) steps?
Grateful,
Modus
This assumes the dataset has only 3 quarters, not 4, and that the rows are in quarter order within each district. If there are more quarters, use nth(stock, 3) instead of last(stock).
library(tidyverse)
df %>%
group_by(district) %>%
summarise(stock = last(stock),
incoming = sum(incoming))
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
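If the row order cannot be trusted, the stock value can be picked by quarter rather than by position. The following is a sketch under that assumption, not part of the answer above; the reversed copy shuffled exists only to show that order no longer matters, and with several stock-type columns they could all be listed in the first across().

```r
library(dplyr)

df <- data.frame(
  district = rep(c("ARA", "BJI", "CMC"), each = 3),
  quarter  = rep(1:3, 3),
  incoming = c(4044, 2992, 2556, 1639, 9547, 1191, 2038, 1942, 225),
  stock    = c(19547, 3160, 1533, 5355, 6146, 355, 5816, 1119, 333)
)

# Reverse the rows to show that the result does not depend on row order
shuffled <- df[nrow(df):1, ]

result <- shuffled %>%
  group_by(district) %>%
  summarise(
    # stock-type columns: take the value from the latest quarter present
    across(c(stock), ~ .x[which.max(quarter)]),
    # flow-type columns: sum over all quarters
    across(c(incoming), sum)
  )
result
```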
Here is a data.table approach:
library(data.table)
setDT(df)[, .(incoming = sum(incoming), stock = stock[.N]), by = .(district)]
district incoming stock
1: ARA 9592 1533
2: BJI 12377 355
3: CMC 4205 333
Here's a refactor that removes some of the duplicated code. This also seems like a prime use case for a custom function, which can be QC'd and maintained more easily:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
aggregate_stocks <- function(df, n_quarter) {
base <- df %>%
group_by(district)
basea <- base %>%
filter(quarter == n_quarter) %>%
summarise(across(stock, sum))
baseb <- base %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
return(final)
}
aggregate_stocks(df, 3)
#> # A tibble: 3 × 3
#> district stock incoming
#> <chr> <dbl> <dbl>
#> 1 ARA 1533 9592
#> 2 BJI 355 12377
#> 3 CMC 333 4205
Here is the same solution as @Tom Hoel's, but instead of using a function to pick the value it just subsets with []:
library(dplyr)
df %>%
group_by(district) %>%
summarise(stock = stock[3],
incoming = sum(incoming))
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205

In R, there are `actual` and `budget` values; how to add a new variable and calculate its values

The variable type contains actual and budget values. How can I add a new variable and calculate its values? My current code works, but it is a little tedious. Can anyone help? Thanks!
ori_data <- data.frame(
category=c("A","A","A","B","B","B"),
year=c(2021,2022,2022,2021,2022,2022),
type=c("actual","actual","budget","actual","actual","budget"),
sales=c(100,120,130,70,80,90),
profit=c(3.7,5.52,5.33,2.73,3.92,3.69)
)
Add sales increase %:
ori_data$sales_inc_or_budget_achieved <- NA_real_ # initialise the new column
a21_actual <- with(ori_data, category=='A' & year==2021 & type=='actual')
a22_actual <- with(ori_data, category=='A' & year==2022 & type=='actual')
a22_budget <- with(ori_data, category=='A' & year==2022 & type=='budget')
ori_data$sales_inc_or_budget_achieved[a22_actual] <-
ori_data$sales[a22_actual] / ori_data$sales[a21_actual] - 1
Add budget achieved %:
ori_data$sales_inc_or_budget_achieved[a22_budget] <-
ori_data$sales[a22_actual] / ori_data$sales[a22_budget]
Using a group_by and an if_else you could do:
library(dplyr)
ori_data |>
group_by(category) |>
arrange(category, type, year) |>
mutate(sales_inc_or_budget_achieved = if_else(type == "actual",
sales / lag(sales) - 1,
lag(sales) / sales)) |>
ungroup()
#> # A tibble: 6 × 6
#> category year type sales profit sales_inc_or_budget_achieved
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 A 2021 actual 100 3.7 NA
#> 2 A 2022 actual 120 5.52 0.2
#> 3 A 2022 budget 130 5.33 0.923
#> 4 B 2021 actual 70 2.73 NA
#> 5 B 2022 actual 80 3.92 0.143
#> 6 B 2022 budget 90 3.69 0.889
And using across you could do the same for both sales and profit:
ori_data |>
group_by(category) |>
arrange(category, type, year) |>
mutate(across(c(sales, profit), ~ if_else(type == "actual",
.x / lag(.x) - 1,
lag(.x) / .x),
.names = "{.col}_inc_or_budget_achieved")) |>
ungroup()
#> # A tibble: 6 × 7
#> category year type sales profit sales_inc_or_budget_achie… profit_inc_or_b…
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2021 actual 100 3.7 NA NA
#> 2 A 2022 actual 120 5.52 0.2 0.492
#> 3 A 2022 budget 130 5.33 0.923 1.04
#> 4 B 2021 actual 70 2.73 NA NA
#> 5 B 2022 actual 80 3.92 0.143 0.436
#> 6 B 2022 budget 90 3.69 0.889 1.06
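The lag() calls above rely on the rows being arranged so that the relevant actual row sits directly before each row. As a hedged alternative, the sketch below looks the reference values up by year and type instead, so it does not depend on row order; the helper columns prev_actual and same_actual are names invented here for illustration.

```r
library(dplyr)

ori_data <- data.frame(
  category = c("A", "A", "A", "B", "B", "B"),
  year     = c(2021, 2022, 2022, 2021, 2022, 2022),
  type     = c("actual", "actual", "budget", "actual", "actual", "budget"),
  sales    = c(100, 120, 130, 70, 80, 90),
  profit   = c(3.7, 5.52, 5.33, 2.73, 3.92, 3.69)
)

res <- ori_data |>
  group_by(category) |>
  mutate(
    # actual sales of the previous year and of the same year, looked up by key
    prev_actual = sales[match(paste(year - 1, "actual"), paste(year, type))],
    same_actual = sales[match(paste(year, "actual"), paste(year, type))],
    sales_inc_or_budget_achieved = if_else(type == "actual",
                                           sales / prev_actual - 1,
                                           same_actual / sales)
  ) |>
  ungroup() |>
  select(-prev_actual, -same_actual)
res
```

Rows without a matching previous-year actual (here the 2021 rows) come out as NA, just as with lag().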
stefan's answer works perfectly well; however, I would suggest rearranging your data first.
In my opinion sales and profit are types of measures (aka observations) and actual and budget are the measurements here:
library(tidyr)
library(dplyr)
ori_data2 <-
ori_data %>%
pivot_longer(c(sales, profit)) %>%
pivot_wider(names_from = type, values_from = value) %>%
group_by(category, name) %>%
arrange(year, .by_group = TRUE)
Then your calculations become much easier:
ori_data2 %>%
mutate(increase = actual / lag(actual) - 1, # compare to the year before
budget_achieved = actual / budget) %>% # compare actual vs. budget
filter(year == 2022) %>% # you can filter for the year of interest
mutate(across(c(increase, budget_achieved), scales::percent)) # and format as percent

Group by a variable in dataframe R

I have a dataframe like below,
Date      cat  cam  reg  per
22-01-05  A     60  120   50
22-01-05  B     20  100   20
22-01-08  A     30  150   20
22-01-08  B     30  100   30
But I want something like below:
Date      cam  reg  per
22-01-05   80  220  14.5
22-01-08   60  250  24
How to get this using R?
I am not sure why your expected per values are like that, but maybe you want the following:
df <- data.frame(Date = c("22-01-05", "22-01-05", "22-01-08", "22-01-08"),
cat = c("A", "B", "A", "B"),
cam = c(60,20,30,30),
reg = c(120,100,150,100),
per = c(50,20,20,30))
library(dplyr)
df %>%
group_by(Date) %>%
summarise(cam = sum(cam),
reg = sum(reg),
per = cam/reg)
#> # A tibble: 2 × 4
#> Date cam reg per
#> <chr> <dbl> <dbl> <dbl>
#> 1 22-01-05 80 220 0.364
#> 2 22-01-08 60 250 0.24
Created on 2022-07-07 by the reprex package (v2.0.1)
Using only the package dplyr (which is part of package tidyverse) just do:
df %>% group_by(Date) %>% summarise(cam = sum(cam),
reg = sum(reg),
per = 100*(cam/reg))
Date cam reg per
<chr> <int> <int> <dbl>
1 22-01-05 80 220 36.4
2 22-01-08 60 250 24
The nice thing with this syntax is, you can modify and add additional variables like sum, but also like mean, median, etc. in a very clean and structured way.
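As a small illustration of that point, the sketch below mixes different summary functions in one summarise() call. The output column names (cam_sum, reg_mean, per_median) are made up here, and the choice of statistic per column is arbitrary rather than what the question asked for.

```r
library(dplyr)

df <- data.frame(
  Date = c("22-01-05", "22-01-05", "22-01-08", "22-01-08"),
  cat  = c("A", "B", "A", "B"),
  cam  = c(60, 20, 30, 30),
  reg  = c(120, 100, 150, 100),
  per  = c(50, 20, 20, 30)
)

stats <- df %>%
  group_by(Date) %>%
  summarise(
    cam_sum    = sum(cam),     # total per date
    reg_mean   = mean(reg),    # average per date
    per_median = median(per)   # median per date
  )
stats
```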
You can try this, but I don't know how to get the per values 14.5 and 24:
library(dplyr)
aggregate(cbind(cam, reg) ~ Date,df,sum) %>% mutate(per = 100*(cam/reg))
A data.frame: 2 × 4
Date cam reg per
<chr> <dbl> <dbl> <dbl>
22-01-05 80 220 36.36364
22-01-08 60 250 24.00000

Pivot_longer into an already existing column

I want to pivot multiple columns, two by two, into an already existing pair of columns, going from this:
have <- tribble(
~egtest,~egorres, ~egorresu, ~hrorres,~hrorresu,~prorres,~prorresu,~uninteresing,
"qt", 500,"msec",90,"bpm",100,"msec", "cat",
"qtc", 370,"msec",NA,"bpm",103,"msec","dog",
"pra",83,"msec",79,"bpm",97,"msec","cat"
)
To this :
want <- tribble(
~egtest,~egorres, ~egorresu,~uninteresting,
"qt", 500,"msec","cat",
"qtc", 370,"msec","dog",
"pra",83,"msec","cat",
"hr",90,"bpm","cat",
"pr",100,"msec","cat",
"hr",NA,"bpm","dog",
"pr",103,"msec","dog",
"hr",79,"bpm","cat",
"pr",97,"msec","dog"
)
For now my code is
colstopivotEG <- function(table){
out <- subset(colnames(table),grepl(pattern = "orres\\b",colnames(table)))
out <- out[out != "egorres"]
#print(out)
return(out)
}
pivot_eg <- function(ndf){
EG1 <- pivot_longer(ndf,
cols = colstopivotEG(ndf),
names_pattern = "(.*)orres",
names_to="egtest",
values_to="egorres")
EG2 <- pivot_longer(ndf,
cols=ends_with("orresu"),
names_pattern = "(.*)orresu",
values_to="egorresu")
ndf <- bind_cols(EG1,EG2 %>% select(EGORRESU_STD))
}
But I can't seem to pivot into an existing column. I'm out of ideas, and any help would be great, thanks!
PS: There are a lot of columns that should not be pivoted.
I would split the tibble into two by columns:
The columns starting with eg (keep them as they are)
The rest (pivot them).
Afterwards (after repairing the second tibble's names) we can bind the two tibbles together again.
library(dplyr)
library(tidyr)
eg <- have %>%
select(starts_with("eg"))
rest <- have %>%
select(matches("^(hr|pr)")) %>% # only the hr/pr columns; other columns are dropped
pivot_longer(everything(),
names_pattern = "(hr|pr)(.+)",
names_to = c("egtest", ".value")) %>%
rename(egorres = orres,
egorresu = orresu)
bind_rows(eg, rest)
which gives
egtest egorres egorresu
<chr> <dbl> <chr>
1 qt 500 msec
2 qtc 370 msec
3 pra 83 msec
4 hr 90 bpm
5 pr 100 msec
6 hr NA bpm
7 pr 103 msec
8 hr 79 bpm
9 pr 97 msec
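Since the question mentions many columns that should not be pivoted, here is a hedged variant of the same split-and-bind idea that carries such columns along. It is a sketch that assumes the columns to pivot can be recognised by their hr/pr prefix; keep and out are names chosen here, and uninteresing is spelled as in the question's have.

```r
library(dplyr)
library(tidyr)

have <- tibble::tribble(
  ~egtest, ~egorres, ~egorresu, ~hrorres, ~hrorresu, ~prorres, ~prorresu, ~uninteresing,
  "qt",    500,      "msec",    90,       "bpm",     100,      "msec",    "cat",
  "qtc",   370,      "msec",    NA,       "bpm",     103,      "msec",    "dog",
  "pra",   83,       "msec",    79,       "bpm",     97,       "msec",    "cat"
)

# Part 1: the existing eg columns plus everything that is not pivoted
keep <- have %>%
  select(-matches("^(hr|pr)"))

# Part 2: pivot only the hr/pr pairs; non-pivoted columns are repeated per row
rest <- have %>%
  select(-starts_with("eg")) %>%
  pivot_longer(matches("^(hr|pr)"),
               names_pattern = "(hr|pr)(.+)",
               names_to = c("egtest", ".value")) %>%
  rename(egorres = orres,
         egorresu = orresu)

# bind_rows() matches columns by name, so column order does not matter
out <- bind_rows(keep, rest)
out
```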
Another possible solution:
library(tidyverse)
bind_rows(have[c(1:3,8)],
map(list(c(4:5,8), 6:8),
~ bind_cols(egtest = str_sub(names(have[.x])[1], 1, 2), have[.x] %>%
set_names(names(have[c(2:3,8)])))))
#> # A tibble: 9 × 4
#> egtest egorres egorresu uninteresing
#> <chr> <dbl> <chr> <chr>
#> 1 qt 500 msec cat
#> 2 qtc 370 msec dog
#> 3 pra 83 msec cat
#> 4 hr 90 bpm cat
#> 5 hr NA bpm dog
#> 6 hr 79 bpm cat
#> 7 pr 100 msec cat
#> 8 pr 103 msec dog
#> 9 pr 97 msec cat

Grouping by sector then aggregating by fiscal year

I have a dataset with the fields isic (International Standard Industrial Classification), date, and cash. I would like to first group it by sector and then get the sum by fiscal year.
#Here's a look at the data(cpt1). All the dates follow the following format "%Y-%m-01"
Cash Date isic
1 373165 2014-06-01 K
2 373165 2014-12-01 K
3 373165 2017-09-01 K
4 NA <NA> K
5 4789 2015-05-01 K
6 982121 2013-07-01 K
.
.
.
#I was able to group them by sector and sum them
cpt_by_sector=cpt1 %>% mutate(sector=recode_factor(isic,
'A'='Agriculture','B'='Industry','C'='Industry','D'='Industry',
'E'='Industry','F'='Industry',.default = 'Services',
.missing = 'Services')) %>%
group_by(sector) %>% summarise_if(is.numeric, sum, na.rm=T)
#here's the result
sector `Cash`
<fct> <dbl>
1 Agriculture 2094393819.
2 Industry 53699068183.
3 Services 223995196357.
#Below is what I would like to get. I would like to take into account the fiscal year i.e. from july to june.
Sector `2009/10` `2010/11` `2011/12` `2012/13` `2013/14` `2014/15` `2015/16` `2016/17`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agriculture 2.02 3.62 3.65 6.26 7.04 8.36 11.7 11.6
2 Industry 87.8 117. 170. 163. 185. 211. 240. 252.
3 Services 271. 343. 479. 495. 584. 664. 738. 821.
4 Total 361. 464. 653. 664. 776. 883. 990. 1085.
PS:I changed the date column to date format
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
# FY is the year of the date, plus 1 if the month is July or later.
# FY_label makes the requested format, by combining the prior year,
# a slash, and digits 3&4 of the FY.
mutate(FY = year(Date) + if_else(month(Date) >= 7, 1, 0),
FY_label = paste0(FY-1, "/", substr(FY, 3, 4))) %>%
mutate(sector = recode_factor(isic,
'A'='Agriculture','B'='Industry','C'='Industry','D'='Industry',
'E'='Industry','F'='Industry', 'K'='Mystery Sector')) %>%
filter(!is.na(FY)) %>% # Exclude rows with missing FY
group_by(FY_label, sector) %>%
summarise(Cash = sum(Cash)) %>%
spread(FY_label, Cash)
# A tibble: 1 x 4
sector `2013/14` `2014/15` `2017/18`
<fct> <int> <int> <int>
1 Mystery Sector 1355286 377954 373165
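The desired output in the question also has a Total row, and spread() has since been superseded by pivot_wider(). The sketch below covers both, using only the six rows shown in the question and a case_when() that mirrors the question's recoding (everything outside A to F, including missing codes, becomes Services); by_fy and with_total are names chosen here.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# The six rows shown in the question
cpt1 <- data.frame(
  Cash = c(373165, 373165, 373165, NA, 4789, 982121),
  Date = as.Date(c("2014-06-01", "2014-12-01", "2017-09-01", NA,
                   "2015-05-01", "2013-07-01")),
  isic = "K"
)

by_fy <- cpt1 %>%
  filter(!is.na(Date)) %>%                      # rows without a date have no fiscal year
  mutate(FY = year(Date) + if_else(month(Date) >= 7, 1, 0),
         FY_label = paste0(FY - 1, "/", substr(FY, 3, 4)),
         sector = case_when(isic == "A" ~ "Agriculture",
                            isic %in% c("B", "C", "D", "E", "F") ~ "Industry",
                            TRUE ~ "Services")) %>%
  group_by(sector, FY_label) %>%
  summarise(Cash = sum(Cash, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = FY_label, values_from = Cash, values_fill = 0)

# Append a Total row that sums every fiscal-year column
with_total <- bind_rows(
  by_fy,
  by_fy %>% summarise(sector = "Total", across(where(is.numeric), sum))
)
with_total
```

With only sector K in the sample, the Total row simply repeats the Services row; on the full data it would add up all sectors.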

Resources