Grouping by sector then aggregating by fiscal year - r

I have a dataset with the fields isic (International Standard Industrial Classification), date, and cash. I would like to first group it by sector and then get the sum of cash by fiscal year.
#Here's a look at the data (cpt1). All the dates follow the format "%Y-%m-01"
Cash Date isic
1 373165 2014-06-01 K
2 373165 2014-12-01 K
3 373165 2017-09-01 K
4 NA <NA> K
5 4789 2015-05-01 K
6 982121 2013-07-01 K
.
.
.
#I was able to group them by sector and sum them
cpt_by_sector <- cpt1 %>%
  mutate(sector = recode_factor(isic,
    'A' = 'Agriculture', 'B' = 'Industry', 'C' = 'Industry', 'D' = 'Industry',
    'E' = 'Industry', 'F' = 'Industry',
    .default = 'Services', .missing = 'Services')) %>%
  group_by(sector) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE)
#here's the result
sector `Cash`
<fct> <dbl>
1 Agriculture 2094393819.
2 Industry 53699068183.
3 Services 223995196357.
#Below is what I would like to get. I would like to take the fiscal year into account, i.e. July to June.
Sector `2009/10` `2010/11` `2011/12` `2012/13` `2013/14` `2014/15` `2015/16` `2016/17`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agriculture 2.02 3.62 3.65 6.26 7.04 8.36 11.7 11.6
2 Industry 87.8 117. 170. 163. 185. 211. 240. 252.
3 Services 271. 343. 479. 495. 584. 664. 738. 821.
4 Total 361. 464. 653. 664. 776. 883. 990. 1085.
PS: I changed the date column to Date format.
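A minimal sketch of that conversion (assuming the column arrives as character and is named Date, as in the preview above):
cpt1$Date <- as.Date(cpt1$Date, format = "%Y-%m-%d")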

library(dplyr)
library(tidyr)
library(lubridate)
df %>%
  # FY is the year of the date, plus 1 if the month is July or later.
  # FY_label makes the requested format by combining the prior year,
  # a slash, and digits 3 and 4 of the FY.
  mutate(FY = year(Date) + if_else(month(Date) >= 7, 1, 0),
         FY_label = paste0(FY - 1, "/", substr(FY, 3, 4))) %>%
  mutate(sector = recode_factor(isic,
    'A' = 'Agriculture', 'B' = 'Industry', 'C' = 'Industry', 'D' = 'Industry',
    'E' = 'Industry', 'F' = 'Industry', 'K' = 'Mystery Sector')) %>%
  filter(!is.na(FY)) %>% # exclude rows with a missing date, and therefore a missing FY
  group_by(FY_label, sector) %>%
  summarise(Cash = sum(Cash)) %>%
  spread(FY_label, Cash)
# A tibble: 1 x 4
sector `2013/14` `2014/15` `2017/18`
<fct> <int> <int> <int>
1 Mystery Sector 1355286 377954 373165
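To also reproduce the Total row from the desired output, one option is to append a column-wise sum after spreading. A minimal sketch, assuming the wide table above has been saved as cpt_wide (janitor::adorn_totals("row") would achieve the same in one call):
cpt_wide %>%
  mutate(sector = as.character(sector)) %>% # factor -> character so "Total" binds cleanly
  bind_rows(summarise(cpt_wide, sector = "Total",
                      across(where(is.numeric), ~ sum(.x, na.rm = TRUE))))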

Related

In R, there are `actual` and `budget` values; how to add a new variable and calculate its values

In the variable type, there are actual and budget values. How can I add a new variable and calculate its value? My current code works, but it is a little tedious. Can anyone help? Thanks!
ori_data <- data.frame(
category=c("A","A","A","B","B","B"),
year=c(2021,2022,2022,2021,2022,2022),
type=c("actual","actual","budget","actual","actual","budget"),
sales=c(100,120,130,70,80,90),
profit=c(3.7,5.52,5.33,2.73,3.92,3.69)
)
Add sales inc%:
ori_data$sales_inc_or_budget_achieved[ori_data$category == 'A' & ori_data$year == 2022 & ori_data$type == 'actual'] <-
  ori_data$sales[ori_data$category == 'A' & ori_data$year == 2022 & ori_data$type == 'actual'] /
  ori_data$sales[ori_data$category == 'A' & ori_data$year == 2021 & ori_data$type == 'actual'] - 1
Add budget achieved%:
ori_data$sales_inc_or_budget_achieved[ori_data$category == 'A' & ori_data$year == 2022 & ori_data$type == 'budget'] <-
  ori_data$sales[ori_data$category == 'A' & ori_data$year == 2022 & ori_data$type == 'actual'] /
  ori_data$sales[ori_data$category == 'A' & ori_data$year == 2022 & ori_data$type == 'budget']
Using a group_by and an if_else you could do:
library(dplyr)
ori_data |>
  group_by(category) |>
  arrange(category, type, year) |>
  mutate(sales_inc_or_budget_achieved = if_else(type == "actual",
                                                sales / lag(sales) - 1,
                                                lag(sales) / sales)) |>
  ungroup()
#> # A tibble: 6 × 6
#> category year type sales profit sales_inc_or_budget_achieved
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 A 2021 actual 100 3.7 NA
#> 2 A 2022 actual 120 5.52 0.2
#> 3 A 2022 budget 130 5.33 0.923
#> 4 B 2021 actual 70 2.73 NA
#> 5 B 2022 actual 80 3.92 0.143
#> 6 B 2022 budget 90 3.69 0.889
And using across you could do the same for both sales and profit:
ori_data |>
  group_by(category) |>
  arrange(category, type, year) |>
  mutate(across(c(sales, profit), ~ if_else(type == "actual",
                                            .x / lag(.x) - 1,
                                            lag(.x) / .x),
                .names = "{.col}_inc_or_budget_achieved")) |>
  ungroup()
#> # A tibble: 6 × 7
#> category year type sales profit sales_inc_or_budget_achie… profit_inc_or_b…
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2021 actual 100 3.7 NA NA
#> 2 A 2022 actual 120 5.52 0.2 0.492
#> 3 A 2022 budget 130 5.33 0.923 1.04
#> 4 B 2021 actual 70 2.73 NA NA
#> 5 B 2022 actual 80 3.92 0.143 0.436
#> 6 B 2022 budget 90 3.69 0.889 1.06
stefan's answer works perfectly well; however, I would suggest you rearrange your data first.
In my opinion, sales and profit are the types of measures (a.k.a. observations), and actual and budget are the measurements here:
library(tidyr)
library(dplyr)
ori_data2 <-
  ori_data %>%
  pivot_longer(c(sales, profit)) %>%
  pivot_wider(names_from = type, values_from = value) %>%
  group_by(category, name) %>%
  arrange(year, .by_group = TRUE)
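Derived from the sample data, ori_data2 should then look like this:
#> # A tibble: 8 × 5
#> # Groups:   category, name [4]
#>   category  year name   actual budget
#>   <chr>    <dbl> <chr>   <dbl>  <dbl>
#> 1 A         2021 profit   3.7   NA
#> 2 A         2022 profit   5.52   5.33
#> 3 A         2021 sales  100     NA
#> 4 A         2022 sales  120    130
#> 5 B         2021 profit   2.73  NA
#> 6 B         2022 profit   3.92   3.69
#> 7 B         2021 sales   70     NA
#> 8 B         2022 sales   80     90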
Then your calculations become much easier:
ori_data2 %>%
  mutate(increase = actual / lag(actual) - 1,    # compare to the year before
         budget_achieved = actual / budget) %>%  # compare actual vs. budget
  filter(year == 2022) %>%                       # you can filter for the year of interest
  mutate(across(c(increase, budget_achieved), scales::percent)) # and format as percent
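For reference, before the percent formatting the computed columns work out as follows (rounded, from the sample data above):
#> category  year name   actual budget increase budget_achieved
#> A         2022 profit   5.52   5.33    0.492           1.04
#> A         2022 sales  120    130       0.2             0.923
#> B         2022 profit   3.92   3.69    0.436           1.06
#> B         2022 sales   80     90       0.143           0.889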

How best to calculate a year over year difference in R

Below is the sample code. The task at hand is to create a year-over-year difference (2021 Q4 value - 2020 Q4 value) for the fourth quarter only, along with the percentage difference. The desired result is below. Usually I would do a pivot_wider and such; however, how does one do this without taking all quarters into account?
year <- c(2020,2020,2020,2020,2021,2021,2021,2021,2020,2020,2020,2020,2021,2021,2021,2021)
qtr <- c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4)
area <- c(1012,1012,1012,1012,1012,1012,1012,1012,1402,1402,1402,1402,1402,1402,1402,1402)
employment <- c(100,102,104,106,108,110,114,111,52,54,56,59,61,66,65,49)
test1 <- data.frame(year, qtr, area, employment)
area  difference  percentage
1012           5        4.7%
1402         -10      -16.9%
You would use filter on quarter:
test1 |>
  filter(qtr == 4) |>
  group_by(area) |>
  mutate(employment_lag = lag(employment),
         diff = employment - employment_lag) |>
  na.omit() |>
  ungroup() |>
  mutate(percentage = diff / employment_lag)
Output:
# A tibble: 2 × 7
   year   qtr  area employment employment_lag  diff percentage
  <dbl> <dbl> <dbl>      <dbl>          <dbl> <dbl>      <dbl>
1  2021     4  1012        111            106     5     0.0472
2  2021     4  1402         49             59   -10    -0.169
Update: Adding correct percentage.
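If you also want percentage formatted as a string like the desired output ("4.7%"), you can replace the final mutate above with a scales::percent call; a sketch, assuming the scales package is available:
mutate(percentage = scales::percent(diff / employment_lag, accuracy = 0.1))
# e.g. "4.7%" for area 1012 and "-16.9%" for area 1402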

How could I get the mean and mode summary at the same time for a dataframe?

I have a dataframe with 10 numeric columns and 3 character columns; as a sample I prepared this dataframe:
df <- data.frame(
name = c("ANCON","ANCON","ANCON", "LUNA", "MAGOLLO", "MANCHAY", "MANCHAY","PATILLA","PATILLA"),
destiny = c("sea","reuse","sea","sea", "reuse","sea","sea","sea","sea"),
year = c("2022","2015","2022","2022", "2015","2016","2016","2018","2018"),
QQ = c(10,11,3,4,13,11,12,23,7),
Temp = c(14,16,16,15,16,20,19,14,18))
I need to group it by the column "name", get the mean for the columns "QQ" and "Temp", and the mode for the columns "destiny" and "year". I could get the mean summary, but I couldn't include the mode:
df_mean <- df %>%
  group_by(name) %>%
  summarise_all(mean, na.rm = TRUE)
name destiny year QQ Temp
<chr> <dbl> <dbl> <dbl> <dbl>
1 ANCON NA NA 8 15.3
2 LUNA NA NA 4 15
3 MAGOLLO NA NA 13 16
4 MANCHAY NA NA 11.5 19.5
5 PATILLA NA NA 15 16
The desired output with the modes is something like this:
name destiny year QQ Temp
1 ANCON sea 2022 8.0 15.3
2 LUNA sea 2022 4.0 15.0
3 MAGOLLO reuse 2015 13.0 16.0
4 MANCHAY sea 2016 11.5 19.5
5 PATILLA sea 2018 15.0 16.0
How could I do it? Please help
Use across and cur_column. The median would only work with ordinal data, though; for categorical data like the character columns you have, use the mode:
# Most frequent value; on ties, the value seen first wins.
# Note this masks base::mode(), which reports an object's storage mode.
mode <- function(x) {
  x_unique <- unique(x)
  x_unique[which.max(tabulate(match(x, x_unique)))]
}
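A quick sanity check on a character vector:
mode(c("sea", "reuse", "sea"))
#> [1] "sea"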
Then
mode_columns <- c('destiny', 'year')
df %>%
  group_by(name) %>%
  summarise(
    across(
      everything(),
      ~ if (cur_column() %in% mode_columns) mode(.x) else mean(.x)
    )
  )
# A tibble: 5 × 5
name destiny year QQ Temp
<chr> <chr> <chr> <dbl> <dbl>
1 ANCON sea 2022 8 15.3
2 LUNA sea 2022 4 15
3 MAGOLLO reuse 2015 13 16
4 MANCHAY sea 2016 11.5 19.5
5 PATILLA sea 2018 15 16
UPD: or you could summarise a bit differently. Note that all_of() is needed because mode_columns is a character vector of column names:
summarise(
  across(all_of(mode_columns), mode),
  across(!all_of(mode_columns), mean)
)

Convert data.frame to time series

I have a question on how to convert a data frame to a time series. I am new to R and I am struggling with this operation.
These are some rows of my df which is named "test":
> test
SM weekY week art cat flagP Woy year ispromo yval yqta price ln_yval ln_price
<chr> <chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11111 2016/01 2016-01-03 Z0005 C10 0 01 2016 0 59839. 4060 14.7 11.0 2.69
2 11111 2016/02 2016-01-10 Z0005 C10 0 02 2016 0 38186. 2640 14.5 10.6 2.67
3 11111 2016/03 2016-01-17 Z0005 C10 0 03 2016 0 38986. 2660 14.7 10.6 2.68
My date variable is "week", which doesn't necessarily have a frequency of 7 days because some dates are missing. I would like to convert this df to a time series where "week" is the date to consider. My aim is to use this df for forecasting purposes; in particular, I would like to apply multiple linear regression to the time series.
#### example where XXXXXXX is the converted df to time series and I am using some variables for lin. regr.
fit_test <- tslm(ln_yval ~ SM + cat + ispromo, data = XXXXXXX)
autoplot(XXXXXXX[, 'ln_yval'], series = "Data") +
  autolayer(fitted(fit_test), series = "Fitted")
Thank you for your help
From the back of my mind there are xts and zoo. What about this?
library(xts)
library(zoo)
library(forecast)
df <- data.frame(week = c("2020-01-01", "2020-01-08", "2020-01-18"),
                 SM = c(1, 2, 3),
                 ispromo = c(0, 0, 0),
                 cat = c("Z005", "Z005", "Z005"),
                 yval = c(1.0, 2.0, 3.0),
                 ln_yval = c(3.4, 4.5, 4.6))
time_series_xts <- xts(df[, -1], order.by = as.Date(df[, 1]))
time_series_zoo <- zoo(df[, -1], order.by = as.Date(df[, 1]))
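Because some weeks are missing, you may also want a regular weekly index before modelling. A minimal sketch with zoo (leaving NAs in the gaps is an assumption; zoo's na.locf() or na.approx() could fill them instead):
all_weeks <- seq(start(time_series_zoo), end(time_series_zoo), by = 7)
regular_zoo <- merge(time_series_zoo, zoo(, all_weeks)) # inserts NA rows for the missing weeks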

Assign day of the year to a month

Sample data
df <- data.frame(ID1 = rep(1:1000, each = 5*365),
                 year = rep(rep(2000:2004, each = 365), times = 1000),
                 day = rep(1:365, times = 1000*5),
                 x = runif(365*1000*5))
This data contains a column day which is the day of the year. I need to produce two columns:
Month column: a column of month (which month does the day belong)
Biweek column: which biweek does a day belong to. There are 24 biweeks in a year. All days <= 15 in a month form the first biweek of that month, and days > 15 the second.
For e.g.
1-15 Jan is biweek 1,
16-31 Jan is biweek 2,
1-15 Feb is biweek 3 and
16-28 Feb is biweek 4 and so on.
For the sake of simplicity, I am assuming all the years are non-leap years.
Here's the code I have (with help from RS as well) that creates the two columns.
# create a vector of days for each month
months <- list(1:31, 32:59, 60:90, 91:120, 121:151, 152:181, 182:212,
               213:243, 244:273, 274:304, 305:334, 335:365)
library(dplyr)
ptm <- proc.time()
df <- df %>%
  mutate(month = sapply(day, function(x) which(sapply(months, function(y) x %in% y))), # this assigns each day to a month
         date = as.Date(paste0(year, '-', format(strptime(paste0('1981-', day), '%Y-%j'), '%m-%d'))), # this creates a vector of dates for a non-leap year
         twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
  dplyr::select(-date)
proc.time() - ptm
user system elapsed
121.71 0.31 122.43
My issue is the time it takes to run this script, and I am looking for a relatively faster solution.
EDIT: To be clear, I have assumed all years must have 365 days. In one of the answers below, for the year 2000 (a leap year), Feb has 29 days (the last day of Feb is 60, but I want the last day to be 59) and therefore Dec has only 30 days (Dec starts with 336 though it should start with 335). I hope this is clear. My solution addresses this issue but takes a lot of time to run.
Here is a solution using lubridate extractors and replacement functions, as mentioned by Frank in a comment. The key ones are yday<-, mday() and month(), which respectively set the day of year of a date, get the day of month of a date, and get the month of a date. An 8-second running time seems pretty acceptable to me, though I'm sure some optimising could shave that down, possibly at the cost of some generality.
Note also the use of case_when to ensure the correct numbering of days after Feb 29 on a leap year.
EDIT: Here is a significantly faster solution. You can just get the mapping of DOYs to months and biweeks for a single year, and then left_join to the main table. 0.36s running time, since you no longer have to repetitively create the date. We also bypass having to use case_when, since the join will take care of the missing days. See that Day 59 of year 2000 is February and Day 60 is March, as requested.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
tbl <- tibble(
  ID1 = rep(1:1000, each = 5*365),
  year = rep(rep(2000:2004, each = 365), times = 1000),
  day = rep(1:365, times = 1000*5),
  x = runif(365*1000*5)
)
tictoc::tic("")
doys <- tibble(
  day = 1:365,
  date = seq.Date(ymd("2001-1-1"), ymd("2001-12-31"), by = 1),
  month = month(date),
  biweek = case_when(
    mday(date) <= 15 ~ (month * 2) - 1,
    mday(date) > 15 ~ month * 2
  )
)
tbl_out2 <- left_join(tbl, select(doys, -date), by = "day")
tictoc::toc()
#> : 0.36 sec elapsed
tbl_out2
#> # A tibble: 1,825,000 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 1 0.331 1. 1.
#> 2 1 2000 2 0.284 1. 1.
#> 3 1 2000 3 0.627 1. 1.
#> 4 1 2000 4 0.762 1. 1.
#> 5 1 2000 5 0.460 1. 1.
#> 6 1 2000 6 0.500 1. 1.
#> 7 1 2000 7 0.340 1. 1.
#> 8 1 2000 8 0.952 1. 1.
#> 9 1 2000 9 0.663 1. 1.
#> 10 1 2000 10 0.385 1. 1.
#> # ... with 1,824,990 more rows
tbl_out2[55:65, ]
#> # A tibble: 11 x 6
#> ID1 year day x month biweek
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 1 2000 55 0.127 2. 4.
#> 2 1 2000 56 0.779 2. 4.
#> 3 1 2000 57 0.625 2. 4.
#> 4 1 2000 58 0.245 2. 4.
#> 5 1 2000 59 0.640 2. 4.
#> 6 1 2000 60 0.423 3. 5.
#> 7 1 2000 61 0.439 3. 5.
#> 8 1 2000 62 0.105 3. 5.
#> 9 1 2000 63 0.218 3. 5.
#> 10 1 2000 64 0.668 3. 5.
#> 11 1 2000 65 0.589 3. 5.
Created on 2018-04-06 by the reprex package (v0.2.0).
You can speed this up almost an order of magnitude by defining date first, reducing redundancy in the date call, and then extracting month from date.
ptm <- proc.time()
df <- df %>% mutate(
  date = as.Date(paste0(year, "-", day), format = "%Y-%j"), # this creates a vector of dates
  month = as.numeric(format(date, "%m")), # extract month
  twowk = month*2 - (as.numeric(format(date, "%d")) <= 15)) %>% # this describes which biweek each day falls into
  dplyr::select(-date)
proc.time() - ptm
# user system elapsed
# 18.58 0.13 18.75
Versus original version in the question
# user system elapsed
# 117.67 0.15 118.45
I filtered for one year. I think it solves the leap issue you described, unless I'm unclear on what you're saying. The last day of Feb is 59 in the df in my result below, but only because day is 0-indexed.
df2000 <- filter(df, year == "2000")
ptm <- proc.time()
df2000 <- df2000 %>% mutate(
day = day - 1, # dates are 0 indexed
date = as.Date(day, origin = "2000-01-01"),
month = as.numeric(as.POSIXlt(date, format = "%Y-%m-%d")$mon + 1),
bis = month * 2 - (as.numeric(format(date, "%d")) <= 15)
)
proc.time() - ptm
user system elapsed
0.8 0.0 0.8
One year is 0.2 of the whole df, so times reflect that.
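If you want to avoid date handling entirely, a purely integer-based lookup is another option. A minimal sketch, assuming 365-day years as the question specifies (month_start holds each month's first day of the year):
library(dplyr)
month_start <- c(1, 32, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335)
df <- df %>%
  mutate(month = findInterval(day, month_start),
         dom = day - month_start[month] + 1, # day of the month
         twowk = month * 2 - (dom <= 15))    # same biweek rule as above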
