I am trying to write a function that converts month values to quarters using case_when().
Then I want to use mutate() to create a new variable Qtr and count how many observations fall in each quarter.
convert_to_qtr <- function(Month) {
case_when(
Month == "Jan" ~ "Q1",
Month == "Feb" ~ "Q1",
Month == "Mar" ~ "Q1",
Month == "Apr" ~ "Q2",
Month == "May" ~ "Q2",
Month == "Jun" ~ "Q2",
Month == "Jul" ~ "Q3",
Month == "Aug" ~ "Q3",
Month == "Sep" ~ "Q3",
Month == "Oct" ~ "Q4",
Month == "Nov" ~ "Q4",
Month == "Dec" ~ "Q4"
)
}
example_months <- c("Jan", "Mar", "May", "May", "Aug", "Nov", "Nov", "Dec")
convert_to_qtr(example_months)
df %>%
mutate(Qtr = convert_to_qtr(Month)) %>%
group_by(Qtr) %>%
count(Qtr)
However, I am not getting the same answer as my professor does in his drop-down, so I am not sure whether I am doing something wrong in my R code.
He sees the counts 161,071, 85,588, 100,227 and 142,651.
I see 152174, 165778, 205615 and 174592 instead.
You could write the function as below:
convert_to_qtr <- function(Month){
setNames(paste0("Q", rep(1:4,each=3)), month.abb)[Month]
}
df %>%
mutate(Qtr = convert_to_qtr(Month)) %>%
count(Qtr)
Qtr n
1 Q1 29
2 Q2 24
3 Q3 26
4 Q4 21
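Another way to write the same lookup, as a rough sketch: because month.abb is already in calendar order, the position of each month in it gives the quarter arithmetically. The name convert_to_qtr_idx is made up for this illustration, and it assumes Month holds three-letter English abbreviations exactly as they appear in month.abb. Anything else (for example "January") comes back as NA rather than being silently assigned to a quarter.

```r
# Hypothetical helper: map month abbreviations to quarter labels by position.
convert_to_qtr_idx <- function(Month) {
  idx <- match(Month, month.abb)          # 1..12, or NA if not in month.abb
  qtr <- (idx - 1) %/% 3 + 1              # 1..4
  ifelse(is.na(qtr), NA_character_, paste0("Q", qtr))
}

example_months <- c("Jan", "Mar", "May", "May", "Aug", "Nov", "Nov", "Dec")
convert_to_qtr_idx(example_months)

# Unmatched values show up as NA in the tally instead of disappearing
table(convert_to_qtr_idx(c(example_months, "January")), useNA = "ifany")
```

Unlike the setNames() version, this returns an unnamed character vector, which can be slightly tidier inside mutate().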
Full disclosure: I inherited this code and tried to Frankenstein it enough to make it work. It isn't perfect.
I have a series of Excel workbooks I'm iterating through to extract financial data for a group of medical practices. The workbooks have a tab for each month. I used lapply() to iterate over the sheets to pull only the months in each quarter. One of the practices only has data from January and February of 2022 so I wouldn't expect that to show up for the 4th quarter update we just ran. However, that data is there.
library(tidyverse)
library(readxl)
library(openxlsx)
df1 <- data.frame("Medication" = 1:50,
"Total WAC" = 51:100)
df2 <- data.frame("Medication" = 1:50,
"Total WAC" = 51:100)
list_of_datasets <- list("January" = df1, "February" = df2)
write.xlsx(list_of_datasets, file = "C:/MC_report.xlsx")
current_month <- lubridate::month(as.Date(Sys.Date(), format = "%Y/%m/%d"))
current_year <- lubridate::year(as.Date(Sys.Date(), format = "%Y/%m/%d"))
Q1 <- c("January", "February", "March")
Q2 <- c("April", "May", "June")
Q3 <- c("July", "August", "September")
Q4 <- c("October", "November", "December")
quarter <- switch(current_month,
"1" = Q4, "2" = Q4, "3" = Q4,
"4" = Q1, "5" = Q1, "6" = Q1,
"7" = Q2, "8" = Q2, "9" = Q2,
"10" = Q3, "11" = Q3, "12" = Q3)
year <- ifelse(current_month %in% c(1, 2, 3), current_year - 1, current_year)
names = c("Medication", "Total WAC")
MCPath22 = "C:/MC_report.xlsx"
MClist22 = lapply(quarter, function(x){ # this function is repeated for each practice. I won't paste it over and over
dat = read_excel(MCPath22, sheet = x, skip = 1)[c(1,2)] # 1 is 'Medication' 2 is "Total WAC'
names(dat) = names
dat$Month = x
dat$Year = year
dat$Location = "Medical Center"
return(dat)
})
MC_newdata = do.call(rbind, MClist22) %>%
select( Medication, `Total WAC`, Month, Year, Location) %>%
mutate(Date.Added = Sys.time())
data = rbind(MC_newdata, DHP_newdata, Lex_newdata, Derm_newdata, Onc_newdata, oldvalues) %>%
filter(!is.na(Medication)) #includes all the practices
write_csv(data,"PAP Data.csv")
I just ran this again, and every facility except the one with only January and February tabs runs correctly. For that one it throws an error that 'October' is not found, which is expected. I can stop that piece in RStudio and the script completes, but then the January and February data are in the output. Any idea why it's outputting the wrong data?
Is there a way to apply create_calendar directly to a tibble? I found a workaround, which is to create a "Calendar" df from the other data set with Calendar <- create_calendar(df$Date) and then join it back with df <- left_join(df, Calendar, by = "Date"), but I would like to understand how to apply it directly to a df.
Thanks!
Current approach:
#Create Calendar Function
create_calendar = function(dates) {
#Create a sequence of dates from the minimum to the maximum date in the input column
Dates = seq(min(dates), max(dates), by="days")
#Convert the dates to a tibble and add additional columns
Calendar = as_tibble(Dates) %>%
rename(Date = value) %>%
mutate(
DateNo = day(Date),
NameDay = wday(Date,label = TRUE, abbr=FALSE),
NameDayShort = wday(Date,label = TRUE),
DayinWeek = wday(Date),
MonthNo = month(Date),
NameMonth = month(Date, label = TRUE, abbr=FALSE),
NameMonthShort = month(Date, label = TRUE),
Week = week(Date),
Year = year(Date),
Quarter = quarter(Date, with_year = F, fiscal_start = 7))
#Month number for the Australian financial year
Calendar <- Calendar %>%
mutate(AFYMonth = case_when(
NameMonthShort == "Jul" ~ 1,
NameMonthShort == "Aug" ~ 2,
NameMonthShort == "Sep" ~ 3,
NameMonthShort == "Oct" ~ 4,
NameMonthShort == "Nov" ~ 5,
NameMonthShort == "Dec" ~ 6,
NameMonthShort == "Jan" ~ 7,
NameMonthShort == "Feb" ~ 8,
NameMonthShort == "Mar" ~ 9,
NameMonthShort == "Apr" ~ 10,
NameMonthShort == "May" ~ 11,
NameMonthShort == "Jun" ~ 12,
))
#Format the week column as %V
Calendar$Week <- format(Calendar$Date, format = "%V")
return(Calendar)
}
#Read function and apply to current df [Missing Step]
df <- df %>%
mutate(Calendar = create_calendar(Date))
Yes, you can use the tidyverse embrace operator {{ }}, which lets you pass a column name through to a tidyverse function from inside your own function.
Function
library(lubridate)
library(dplyr)
create_calendar <- function(df,dt_var) {
#Convert the dates to a tibble and add additional columns
Calendar <-
df %>%
mutate(
DateNo = day({{dt_var}}),
NameDay = wday({{dt_var}},label = TRUE, abbr=FALSE),
NameDayShort = wday({{dt_var}},label = TRUE),
DayinWeek = wday({{dt_var}}),
MonthNo = month({{dt_var}}),
NameMonth = month({{dt_var}}, label = TRUE, abbr=FALSE),
NameMonthShort = month({{dt_var}}, label = TRUE),
Week = week({{dt_var}}),
Year = year({{dt_var}}),
Quarter = quarter({{dt_var}}, with_year = F, fiscal_start = 7))
#Month number for the Australian financial year
Calendar <-
Calendar %>%
mutate(AFYMonth = case_when(
NameMonthShort == "Jul" ~ 1,
NameMonthShort == "Aug" ~ 2,
NameMonthShort == "Sep" ~ 3,
NameMonthShort == "Oct" ~ 4,
NameMonthShort == "Nov" ~ 5,
NameMonthShort == "Dec" ~ 6,
NameMonthShort == "Jan" ~ 7,
NameMonthShort == "Feb" ~ 8,
NameMonthShort == "Mar" ~ 9,
NameMonthShort == "Apr" ~ 10,
NameMonthShort == "May" ~ 11,
NameMonthShort == "Jun" ~ 12
))
#Format the week column as %V (ISO 8601 week number), using the date column that was passed in
Calendar <- Calendar %>% mutate(Week = format({{dt_var}}, format = "%V"))
return(Calendar)
}
Example
df <- data.frame(dt = lubridate::dmy("01/01/01"))
create_calendar(df,dt)
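The twelve-branch case_when() for the financial-year month can also be computed arithmetically. The sketch below is only an illustration: afy_month and its fiscal_start argument are made-up names, and it assumes the financial year starts in July, matching fiscal_start = 7 in the quarter() call.

```r
# Hypothetical helper: month number within a financial year starting in July.
afy_month <- function(dates, fiscal_start = 7L) {
  m <- as.integer(format(dates, "%m"))    # calendar month, 1..12
  (m - fiscal_start) %% 12L + 1L          # Jul -> 1, ..., Jun -> 12
}

d <- as.Date(c("2023-07-01", "2023-12-15", "2024-01-10", "2024-06-30"))
afy_month(d)
```

Inside create_calendar() this could be used as mutate(AFYMonth = afy_month({{dt_var}})), since it only needs the date column itself.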
Basically, I've got a function that I'll need to run close to 1M times, and it is taking a lot of time because (my guess) it is not vectorized.
The idea is that the parameter pf.d.day contains a date, and the output is a transformation of that date (adding/removing days).
pf.s.Freq offsets the date to the start of the next period:
08 Apr 2020 with Freq = "month" will become 01 May 2020
08 Apr 2020 with Freq = "week" will become 13 Apr 2020 #week starts on monday
08 Apr 2020 with Freq = "year" will become 01 Jan 2021
library(dplyr)
library(lubridate)
fn.Delay <- function(pf.d.day, pf.s.Freq){
d.DateWithouthDelay <- as.Date(
#note: using chained ifs instead of parsing pf.s.Freq into unit to avoid errors from misspells on excel file
ifelse(pf.s.Freq == "day", as.character(ceiling_date(pf.d.day + days(1), unit = "day" )),
ifelse(pf.s.Freq == "week", as.character(ceiling_date(pf.d.day + days(1), unit = "week", week_start = 1)),
ifelse(pf.s.Freq == "month", as.character(ceiling_date(pf.d.day + days(1), unit = "month" )),
ifelse(pf.s.Freq == "quarter", as.character(ceiling_date(pf.d.day + days(1), unit = "quarter")),
ifelse(pf.s.Freq == "year", as.character(ceiling_date(pf.d.day + days(1), unit = "year" )),
ifelse(pf.s.Freq != "BiWeek", "1900-1-2", #default date if pf.s.Freq is wrong
ifelse( day(pf.d.day) < 15,
as.character(pf.d.day - day(pf.d.day) +15),
as.character(ceiling_date(pf.d.day, unit = "month")))
)))))))
return(d.DateWithouthDelay)
}
For a small example:
data.frame(
Di = as.Date(c("2020-4-8", "2020-4-8", "2020-4-8", "2020-4-8", "2020-4-8", "2020-4-8", "2020-4-8")),
Fr = c("day", "week", "month", "quarter", "year", "BiWeek", "ups")) %>%
rowwise() %>%
mutate(Df = fn.Delay(Di, Fr)) %>%
data.frame()
The main problem with this code is its speed, mainly because it's not vectorized, but probably also because I have to keep converting between dates and characters simply because ifelse likes to mess up the dates.
Your function is vectorized. Remove the rowwise for a speed increase and the same result (dd here is the example data frame from the question, before the rowwise() step):
identical(
dd %>% mutate(Df = fn.Delay(Di, Fr)) %>% pull(Df),
dd %>%rowwise() %>% mutate(Df = fn.Delay(Di, Fr)) %>% pull(Df)
)
# TRUE
ifelse isn't actually that bad. Here's a simplified version using case_when, but the performance difference vs ifelse is negligible - a tiny bit slower actually. But the code is cleaner.
fn.Delay2 <- function(pf.d.day, pf.s.Freq){
case_when(
pf.s.Freq == "day" ~ ceiling_date(pf.d.day + days(1), unit = "day"),
pf.s.Freq == "week" ~ ceiling_date(pf.d.day + days(1), unit = "week", week_start = 1),
pf.s.Freq == "month" ~ ceiling_date(pf.d.day + days(1), unit = "month" ),
pf.s.Freq == "quarter" ~ ceiling_date(pf.d.day + days(1), unit = "quarter"),
pf.s.Freq == "year" ~ ceiling_date(pf.d.day + days(1), unit = "year" ),
pf.s.Freq != "BiWeek" ~ as.Date("1900-1-2"), #default date if pf.s.Freq is wrong
day(pf.d.day) < 15 ~ pf.d.day - day(pf.d.day) + 15,
TRUE ~ ceiling_date(pf.d.day, unit = "month")
)
}
microbenchmark::microbenchmark(
rowwise = dd %>%rowwise() %>% mutate(Df = fn.Delay(Di, Fr)),
vectorized = dd %>% mutate(Df = fn.Delay(Di, Fr)),
case_when = dd %>% mutate(Df = fn.Delay2(Di, Fr))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rowwise 10.0593 12.47230 13.59725 13.00590 14.1138 30.3810 100
# vectorized 7.5237 7.97235 10.21504 10.26205 10.7905 25.7858 100
# case_when 7.7331 8.43595 10.42024 10.54705 11.1035 21.4732 100
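The back-and-forth through as.character() in the question comes from base ifelse() dropping the Date class, which the case_when() version avoids. A small base-R sketch of that effect, using made-up dates and variable names:

```r
d <- as.Date(c("2020-04-08", "2020-04-20"))
first_half <- as.integer(format(d, "%d")) < 15

# ifelse() drops the Date class and returns the underlying day counts
stripped <- ifelse(first_half, d, d + 30)
class(stripped)

# Assigning into a copy by logical index keeps the Date class
kept <- d + 30
kept[first_half] <- d[first_half]
class(kept)
```

The indexed assignment is just one class-preserving option; it is shown here only to make the difference visible, not as a replacement for fn.Delay2.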
I have a data frame with a column of months and corresponding values. When I create a graph, the months are ordered alphabetically. I tried to order the months using the factor function, but now my graph only shows the month of May and NAs.
xnames<-c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
Data$Month<-factor(Data$Month, levels = xnames)
ggplot(DAtaTidy_MergeRWPeaks2, (aes(x=factor(Month, xnames), y=Volume)), na.rm=TRUE) +
geom_bar()
I tried embedding the factor in the ggplot function but it produced the same result. When I delete 'May' from 'xnames', the graph just shows NAs.
We can't see your data, but the behavior is indicative of Data$Month containing a value that is not included in your level term xnames. Is anything misspelled? I would suggest you compare levels(as.factor(Data$Month)) and xnames - it will certainly show you the issue.
Example dataset that shows the same problem you have:
yums <- c('soup', 'salad', 'bread')
nums <- c(10, 14, 5)
df1 <- data.frame(yums, nums)
yum.levels <- c('soup', 'salad', 'bread', 'pasta')
ggplot(df1, aes(x=factor(yums, yum.levels), y=nums)) + geom_col()
That gives you this:
...but if we misspell one of them (like capitalizing "Soup" in yums), you get this:
yums1 <- c('Soup', 'salad', 'bread')
nums <- c(10, 14, 5)
df2 <- data.frame(yums1, nums)
yum.levels <- c('soup', 'salad', 'bread', 'pasta')
ggplot(df2, aes(x=factor(yums1, yum.levels), y=nums)) + geom_col()
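The comparison suggested above can be run with setdiff(). This is a sketch on made-up values standing in for Data$Month (the real data isn't shown), and it uses month.abb, which holds the same twelve abbreviations as xnames:

```r
xnames <- month.abb
# Made-up column standing in for Data$Month
Month <- c("Jan", "feb", "Mar ", "May", "May")

# Values present in the data but absent from the levels become NA in factor()
setdiff(unique(Month), xnames)

# trimws() handles stray whitespace; case differences still need fixing by hand
setdiff(unique(trimws(Month)), xnames)
```

An empty result would mean every value matches a level, so the NAs would have to come from somewhere else.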
I used the dcast function to show the spendings per month of different companies. Of course I want January first, then February etc. and not the alphabetical order.
Spendings <- data %>%
filter(Familie == "Riegel" & Jahr == "2017") %>%
group_by(Firma, Produktmarke, `Name Kurz`) %>%
summarise(Spendingsges = sum(EUR, na.rm = TRUE))
Spendings <- dcast(data = Spendings, Firma + Produktmarke ~ `Name Kurz`, value.var="Spendingsges")
Spendings
Firma Produktmarke Apr Aug Dez Feb Jan Jul Jun Mai Mrz Nov Okt Sep
Company1 Product1 228582 1902138 725781 NA 709970 NA 265313 228177 NA NA 1463258 4031267
Is there a way to reorder the columns dynamically? For 2018, for example, the data frame is shorter, so I cannot use:
Spendings <- Spendings[,c("Firma", "Produktmarke", "Jan", "Feb", "Mrz", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")]
Spendings_raw <- data.frame(matrix(ncol = 14, nrow = 0))
colnames(Spendings_raw) <- c("Firma", "Produktmarke", "Jan", "Feb", "Mrz", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")
Spendings_raw
Spendings <- data %>%
filter(Familie == "Riegel" & Jahr == "2017") %>%
group_by(Firma, Produktmarke, `Name Kurz`) %>%
summarise(Spendingsges = sum(EUR, na.rm = TRUE))
Spendings <- dcast(data = Spendings, Firma + Produktmarke ~ `Name Kurz`, value.var="Spendingsges")
Spendings <- rbind.fill(Spendings_raw, Spendings)
This works perfectly ;-).
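If padding the missing months with NA columns is not wanted, another option is to reorder only the month columns that actually exist. The sketch below uses a made-up stand-in for the dcast() result; month_order and present are illustrative names.

```r
month_order <- c("Jan", "Feb", "Mrz", "Apr", "Mai", "Jun",
                 "Jul", "Aug", "Sep", "Okt", "Nov", "Dez")

# Made-up stand-in for a dcast() result with only some months present
Spendings <- data.frame(Firma = "Company1", Produktmarke = "Product1",
                        Apr = 1, Dez = 2, Jan = 3, Mrz = 4)

# intersect() keeps the order of its first argument and drops missing months
present <- intersect(month_order, names(Spendings))
Spendings <- Spendings[, c("Firma", "Produktmarke", present)]
names(Spendings)
```

The rbind.fill() approach gives every year the same fourteen columns, whereas this one keeps only the months that have data, so which is preferable depends on what the table is used for.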