R - working days in a given week of a month

I am trying to get the number of working days in a given week of a given month.
This is my current data, as an example:
year month week
2020 6 1
2020 6 2
2020 6 3
2020 6 4
2020 6 5
2020 7 1
2020 7 2
2020 7 3
2020 7 4
2020 7 5
The expected result:
year month week work_days
2020 6 1 5
2020 6 2 5
2020 6 3 5
2020 6 4 5
2020 6 5 2
2020 7 1 3
2020 7 2 5
2020 7 3 5
2020 7 4 5
2020 7 5 5
So, as you can see, I have the year, the month and the week of the month, but I can't get my head around how to get the number of working days in R for any week of any month.
Thanks in advance

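library(data.table)
library(lubridate) # ymd()
library(magrittr)  # %>%

cal <- data.table(dates = seq.Date(ymd(20200601), ymd(20200731), by = "day")) %>%
  .[, .(dates,
        year = year(dates),
        month = month(dates),
        week = isoweek(dates), # ISO week: a new week starts on Monday
        # TRUE for Monday to Friday; weekdays() is locale-dependent (English day names assumed)
        weekday = !(weekdays(dates) %in% c("Sunday", "Saturday"))
  )]
# count working days per ISO week, then renumber the weeks 1, 2, ... within each year and month
cal[, .(work_days = sum(weekday)), by = .(year, month, week)
    ][, week := rowid(year, month)][]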
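For comparison, here is a minimal base R sketch of the same idea that does not depend on the session locale. It assumes weeks start on Monday, as in the answer above. The names `week_id` and `is_workday` are only illustrative, and it uses `format(dates, "%u")` (1 = Monday, ..., 7 = Sunday) instead of day names.

```r
# All days in the period of interest
dates <- seq(as.Date("2020-06-01"), as.Date("2020-07-31"), by = "day")
dow   <- as.integer(format(dates, "%u"))  # 1 = Monday, ..., 7 = Sunday

cal <- data.frame(
  year       = as.integer(format(dates, "%Y")),
  month      = as.integer(format(dates, "%m")),
  week_id    = as.integer(dates) - (dow - 1L),  # day number of the Monday that starts the week
  is_workday = dow <= 5L                        # Monday to Friday
)

# Working days per (year, month, week), then number the weeks within each month
res <- aggregate(is_workday ~ year + month + week_id, data = cal, FUN = sum)
res <- res[order(res$year, res$month, res$week_id), ]
res$week <- ave(res$week_id, res$year, res$month, FUN = seq_along)
res <- data.frame(year = res$year, month = res$month,
                  week = res$week, work_days = res$is_workday)
res
```

Keying each week on its Monday rather than on the ISO week number should also avoid ordering problems around the turn of the year, where ISO week 1 can start in December.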

Related

Select first row for each id for each year

Say I have a dataset below where each id can have multiple records per year. I would like to keep only the id's most recent record per year.
id<-c(1,1,1,2,2,2)
year<-c(2020,2020,2019,2020,2018,2018)
month<-c(12,6,4,5,4,1)
have<-as.data.frame(cbind(id,year,month))
have
id year month
1 2020 12
1 2020 6
1 2019 4
2 2020 5
2 2018 4
2 2018 1
This is what I would like the dataset to look like:
want
id year month
1 2020 12
1 2019 4
2 2020 5
2 2018 4
I know that I can get the first instance of each id with this code; however, I want the latest record for each year.
want<-have[match(unique(have$id), have$id),]
id year month
1 2020 12
2 2020 5
I modified the code to add in year, but it outputs the same results as the code above:
want<-have[match(unique(have$id,have$year), have$id),]
id year month
1 2020 12
2 2020 5
How would I modify this so I can see one record displayed per year?
You can use dplyr::slice_max to keep the row with the latest month in each group, like this:
library(dplyr)
have %>%
  group_by(id, year) %>%
  slice_max(order_by = month)
Output:
id year month
<dbl> <dbl> <dbl>
1 1 2019 4
2 1 2020 12
3 2 2018 4
4 2 2020 5
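We could group and then summarise with first(). This relies on the rows already being sorted with the latest month first within each id and year, as they are in the example: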
library(dplyr)
have %>%
group_by(id, year) %>%
summarise(month = first(month))
id year month
<dbl> <dbl> <dbl>
1 1 2019 4
2 1 2020 12
3 2 2018 4
4 2 2020 5
You can use group_by in dplyr as follows, grouping by both id and year and taking the latest month:
have %>% group_by(id, year) %>% summarise(month = max(month), .groups = "drop")
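If the whole row is needed and the data may not be pre-sorted, a base R sketch along these lines could also work. It assumes that "latest" means the largest month within each id and year. The names `ord` and `want` are just illustrative.

```r
have <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  year  = c(2020, 2020, 2019, 2020, 2018, 2018),
  month = c(12, 6, 4, 5, 4, 1)
)

# Sort so that the latest month comes first within each id and year ...
ord <- have[order(have$id, -have$year, -have$month), ]
# ... then keep only the first row of every (id, year) combination
want <- ord[!duplicated(ord[c("id", "year")]), ]
rownames(want) <- NULL
want
```

Because rows are filtered rather than summarised, any extra columns in `have` would be carried along unchanged.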

How to clean messy date formats in a dataframe using R

What is a quick way to clean a column with multiple date formats and obtain only the year?
Suppose in R there is a dataframe (df) as below, which has a Date column of characters with different date formats.
df <- data.frame(z= paste("Date",seq(1:10)), Date=c("2000-10-22", "9/21/2001", "2003", "2017/2018", "9/28/2010",
"9/27/2011","2019/2020", "2017-10/2018-12", "NA", "" ))
df:
z Date
1 Date 1 2000-10-22
2 Date 2 9/21/2001
3 Date 3 2003
4 Date 4 2017/2018
5 Date 5 9/28/2010
6 Date 6 9/27/2011
7 Date 7 2019/2020
8 Date 8 2017-10/2018-12
9 Date 9 NA
10 Date 10
Using R, what is a quick way to extract the years (e.g. 2003, 2010) from the Date column? For cells that contain two years, the first year should be selected.
So that the expected output would be like below:
z Date year
1 Date 1 2000-10-22 2000
2 Date 2 9/21/2001 2001
3 Date 3 2003 2003
4 Date 4 2017/2018 2017
5 Date 5 9/28/2010 2010
6 Date 6 9/27/2011 2011
7 Date 7 2019/2020 2019
8 Date 8 2017-10/2018-12 2017
9 Date 9 NA NA
10 Date 10
Use extract from tidyr. If there are two years it will use the first.
library(dplyr)
library(tidyr)
df %>% extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)
giving:
z Date Year
1 Date 1 2000-10-22 2000
2 Date 2 9/21/2001 2001
3 Date 3 2003 2003
4 Date 4 2017/2018 2017
5 Date 5 9/28/2010 2010
6 Date 6 9/27/2011 2011
7 Date 7 2019/2020 2019
8 Date 8 2017-10/2018-12 2017
9 Date 9 NA NA
10 Date 10 NA
If the second year is needed as well then:
df %>%
extract(Date, "Year2", "\\d{4}.*(\\d{4})", remove = FALSE, convert = TRUE) %>%
extract(Date, "Year", "(\\d{4})", remove = FALSE, convert = TRUE)
giving:
z Date Year Year2
1 Date 1 2000-10-22 2000 NA
2 Date 2 9/21/2001 2001 NA
3 Date 3 2003 2003 NA
4 Date 4 2017/2018 2017 2018
5 Date 5 9/28/2010 2010 NA
6 Date 6 9/27/2011 2011 NA
7 Date 7 2019/2020 2019 2020
8 Date 8 2017-10/2018-12 2017 2018
9 Date 9 NA NA NA
10 Date 10 NA NA
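The same first-year extraction can be sketched in base R without tidyr, assuming that a year is always the first run of four consecutive digits in the string. Here `m` is just an illustrative name for the match positions.

```r
df <- data.frame(
  z    = paste("Date", 1:10),
  Date = c("2000-10-22", "9/21/2001", "2003", "2017/2018", "9/28/2010",
           "9/27/2011", "2019/2020", "2017-10/2018-12", "NA", ""),
  stringsAsFactors = FALSE
)

# Position of the first run of four digits in each string (-1 if there is none)
m <- regexpr("[0-9]{4}", df$Date)

# Fill in the year where a match exists; leave NA otherwise
df$year <- NA_integer_
df$year[m != -1] <- as.integer(regmatches(df$Date, m))
df
```

Note that the "NA" entry here is the literal string "NA", so it simply fails to match and stays NA in the new column.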

How to row bind all cases of a particular weekday in a given year-month into an R dataset

I have data that includes a date and day of week.
I would like to identify all instances of a particular weekday that match the given year/month/weekday
in the original data.
For instance, if the first record has the date "2010-07-15", which is a Thursday, I want to row-bind all Thursdays
that occur in July of 2010 to my original dataset.
While adding those new rows, I also want to fill in those new rows with values from the original data for all columns, except one. The exception is a variable which indicates whether or not that row
was in the original dataset or not.
Example data:
(1) alldays -- this data includes all dates and weekdays for the appropriate years.
(2) dt1 -- this is the example dataset that includes the date Adate and the day of week dow, which will be used to identify the year/month/weekday and then look for all dates within that same month for the given weekday. For example, all Thursdays in July of 2010 will need to be row-bound to the original data.
library(data.table)
library(tidyverse)
library(lubridate)
alldays <- data.table (date = seq(as.Date("2010-01-01"),
as.Date("2011-12-31"), by="days"))
alldays <- alldays %>%
dplyr::mutate(year = lubridate::year(date),
month = lubridate::month(date),
day = lubridate::day(date),
dow = weekdays(date))
setDT(alldays)
head(alldays)
date year month day dow
1 2010-01-01 2010 1 1 Friday
2 2010-01-02 2010 1 2 Saturday
3 2010-01-03 2010 1 3 Sunday
4 2010-01-04 2010 1 4 Monday
5 2010-01-05 2010 1 5 Tuesday
6 2010-01-06 2010 1 6 Wednesday
Here is an example of the primary dataset
id <- seq(1:2)
admit <- rep(1,2)
zip <- c(54123, 54789)
Adate <- as.Date(c("2010-07-15","2011-03-14"))
year <- c(2010, 2011)
month <- c(7,3)
day <- c(15,14)
dow <- c("Thursday","Monday")
dt1 <- data.table(id, admit, zip, Adate, year, month, day, dow)
dt1
#> id admit zip Adate year month day dow
#> 1: 1 1 54123 2010-07-15 2010 7 15 Thursday
#> 2: 2 1 54789 2011-03-14 2011 3 14 Monday
The resulting dataset should be:
id admit zip Adate year month day dow
1: 1 0 54123 2010-07-01 2010 7 1 Thursday
2: 1 0 54123 2010-07-08 2010 7 8 Thursday
3: 1 1 54123 2010-07-15 2010 7 15 Thursday
4: 1 0 54123 2010-07-22 2010 7 22 Thursday
5: 1 0 54123 2010-07-29 2010 7 29 Thursday
6: 2 0 54789 2011-03-07 2011 3 7 Monday
7: 2 1 54789 2011-03-14 2011 3 14 Monday
8: 2 0 54789 2011-03-21 2011 3 21 Monday
9: 2 0 54789 2011-03-28 2011 3 28 Monday
So we can see that the first date in dt1, 2010-07-15 (associated with id=1), was a Thursday and fell within a month with 4 additional Thursdays, which were added to the dataset. The variable admit indicates whether a row was in the original data (1) or was added subsequently by virtue of being matched (0).
I have tried first selecting the additional dates from alldays with matching weekdays but I am running into issues on how to rowbind those back into the original dataset while filling in the other values appropriately. Eventually I will be running this on a dataset with about 300,000 rows.
Here is an option:
alldays[dt1[, .(id, zip, admit=0L, year, month, dow)],
on=.(year, month, dow), allow.cartesian=TRUE][
dt1, on=.(id, date=Adate), admit := i.admit][]
output:
date year month day dow id zip admit
1: 2010-07-01 2010 7 1 Thursday 1 54123 0
2: 2010-07-08 2010 7 8 Thursday 1 54123 0
3: 2010-07-15 2010 7 15 Thursday 1 54123 1
4: 2010-07-22 2010 7 22 Thursday 1 54123 0
5: 2010-07-29 2010 7 29 Thursday 1 54123 0
6: 2011-03-07 2011 3 7 Monday 2 54789 0
7: 2011-03-14 2011 3 14 Monday 2 54789 1
8: 2011-03-21 2011 3 21 Monday 2 54789 0
9: 2011-03-28 2011 3 28 Monday 2 54789 0
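If building the full alldays lookup table is not convenient, the matching dates can also be generated directly from each date. The helper below is a hypothetical base R sketch (the name `same_weekday_in_month` is not from the question or the answer). It returns every date in the same month that falls on the same weekday as its single-date input.

```r
# Hypothetical helper: all dates in d's month that share d's weekday
same_weekday_in_month <- function(d) {
  first <- as.Date(format(d, "%Y-%m-01"))               # first day of d's month
  days  <- seq(first, by = "day", length.out = 31)      # at most 31 candidate days
  days  <- days[format(days, "%Y-%m") == format(d, "%Y-%m")]  # drop next-month spillover
  days[format(days, "%u") == format(d, "%u")]           # %u: 1 = Monday, ..., 7 = Sunday
}

same_weekday_in_month(as.Date("2010-07-15"))
same_weekday_in_month(as.Date("2011-03-14"))
```

For around 300,000 rows the join shown above is likely to be faster. This version is mainly useful for spot-checking individual dates.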

Use of EXCEL OFFSET IN R for a range of values and multiple times

This is the file I want to append my data to:
Collection A
Jan
Feb
March
April
Collection B
Jan
Feb
March
April
Revenue A
Jan
Feb
March
April
Revenue B
Jan
Feb
March
April
The file I want to pull my data from looks like this:
Collection Month Collection A Collection B Revenue Month Revenue A Revenue B
Collection January 1 5 Revenue January 4 8
Collection February 2 6 Revenue February 3 7
Collection March 3 7 Revenue March 2 6
Collection April 4 8 Revenue April 1 5
I want the final output to look like this:
Collection A
Jan 1
Feb 2
March 3
April 4
Collection B
Jan 5
Feb 6
March 7
April 8
Revenue A
Jan 4
Feb 3
March 2
April 1
Revenue B
Jan 8
Feb 7
March 6
April 5
I am able to do this in Excel using the OFFSET and INDIRECT functions, but I want to automate it better for future purposes, so I am trying it in R.
I am really stuck on how to combine the two datasets to get the desired output. It seems like an impossible task to me. I have played around with several functions like select, subset and arrange, but none of them have helped me progress.
I will be glad if someone can help me out with this.
Here's a way to achieve that output. Note that I removed spaces from the column names in the sample data in order to make it easier to read into R. You didn't specify what you wanted the column names of the output dataframe to be so as given they make little sense.
library(tidyverse)
tbl <- read_table2(
"Collection Month CollectionA CollectionB Revenue Month RevenueA RevenueB
Collection January 1 5 Revenue January 4 8
Collection February 2 6 Revenue February 3 7
Collection March 3 7 Revenue March 2 6
Collection April 4 8 Revenue April 1 5"
)
#> Warning: Duplicated column names deduplicated: 'Month' => 'Month_1' [6]
tbl %>%
select(-Collection, -Revenue, -Month_1) %>%
gather(variable, value, -Month) %>%
group_by(variable) %>%
group_modify(~ add_row(.x, Month = .y$variable, value = NA, .before = 1)) %>%
ungroup() %>%
select(-variable)
#> # A tibble: 20 x 2
#> Month value
#> <chr> <dbl>
#> 1 CollectionA NA
#> 2 January 1
#> 3 February 2
#> 4 March 3
#> 5 April 4
#> 6 CollectionB NA
#> 7 January 5
#> 8 February 6
#> 9 March 7
#> 10 April 8
#> 11 RevenueA NA
#> 12 January 4
#> 13 February 3
#> 14 March 2
#> 15 April 1
#> 16 RevenueB NA
#> 17 January 8
#> 18 February 7
#> 19 March 6
#> 20 April 5
Created on 2019-06-18 by the reprex package (v0.3.0)
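As a rough alternative without the tidyverse, the same stacked layout can be sketched in base R. It assumes the source data has already been read into a wide data frame with one Month column and one column per series. The names `wide`, `blocks` and `long` are illustrative only.

```r
wide <- data.frame(
  Month       = c("January", "February", "March", "April"),
  CollectionA = 1:4,
  CollectionB = 5:8,
  RevenueA    = 4:1,
  RevenueB    = 8:5,
  stringsAsFactors = FALSE
)

# One block per value column: a header row (series name, NA) followed by the monthly values
blocks <- lapply(names(wide)[-1], function(v) {
  data.frame(Month = c(v, wide$Month), value = c(NA, wide[[v]]),
             stringsAsFactors = FALSE)
})

# Stack the blocks on top of each other
long <- do.call(rbind, blocks)
long
```

Each header row holds the series name in the Month column and NA as its value, mirroring the output above.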

How to de-cumulate variable in dplyr?

I have an issue. I have a panel of quarterly individual data that is "annually cumulative", i.e. values for the 1st quarter are for the 1st quarter alone, values for the 2nd quarter are the sum of the 1st and 2nd, values for the 3rd quarter are the sum of the first 3 quarters of the year, and values for the 4th quarter are annual sums. How can I easily de-cumulate those in dplyr, grouping by id and year?
Assuming we have two years, and in year one sales are 2 per quarter, and in year 2 sales are 3 per quarter, the original is:
df = data.frame(quarter = c("Q1","Q2","Q3","Q4","Q1","Q2","Q3","Q4"), year=c(rep(2017,4),rep(2018,4)), cum_tot= c(2,4,6,8,3,6,9,12))
quarter year cum_tot
1 Q1 2017 2
2 Q2 2017 4
3 Q3 2017 6
4 Q4 2017 8
5 Q1 2018 3
6 Q2 2018 6
7 Q3 2018 9
8 Q4 2018 12
Then we can get the sales per quarter as:
library(dplyr)
df %>% group_by(year) %>% mutate(original = c(cum_tot[1], diff(cum_tot)))
Or, as per GGamba's comment below:
df %>% group_by(year) %>% mutate(original = cum_tot - lag(cum_tot, default = 0))
They both result in:
quarter year cum_tot original
1 Q1 2017 2 2
2 Q2 2017 4 2
3 Q3 2017 6 2
4 Q4 2017 8 2
5 Q1 2018 3 3
6 Q2 2018 6 3
7 Q3 2018 9 3
8 Q4 2018 12 3
Hope this helps!
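For reference, the same de-cumulation can be sketched in base R with ave(), assuming the rows are already ordered by quarter within each year. With a panel of several individuals, the id column would be added as a further grouping argument.

```r
df <- data.frame(
  quarter = rep(c("Q1", "Q2", "Q3", "Q4"), 2),
  year    = rep(c(2017, 2018), each = 4),
  cum_tot = c(2, 4, 6, 8, 3, 6, 9, 12)
)

# Within each year: keep the first value, then take successive differences
df$original <- ave(df$cum_tot, df$year, FUN = function(x) c(x[1], diff(x)))
df
```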
