Convert monthly pay data to weekly using complete and fill in dplyr

I have data on worker pay and some workers are paid monthly and others weekly. I would like to combine the data into a panel by worker and week (of year). To do that, I need to expand the monthly rows.
The data look like:
library(tidyverse)
library(lubridate)

pay_data <- tibble(worker = "Jim",
                   start  = ymd("2020-1-3"),
                   end    = ymd("2020-2-2"),
                   rate   = 10,
                   hours  = 50,
                   wages  = rate * hours) %>%
  mutate(f_week = week(start), l_week = week(end))
# A tibble: 1 x 8
worker start end rate hours wages f_week l_week
<chr> <date> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Jim 2020-01-03 2020-02-02 10 50 500 1 5
Is there a way to use complete, fill or any other dplyr function to get the data to look like the below?
# A tibble: 5 x 5
worker week rate hours wage
<chr> <int> <dbl> <dbl> <dbl>
1 Jim 1 10 50 500
2 Jim 2 10 50 500
3 Jim 3 10 50 500
4 Jim 4 10 50 500
5 Jim 5 10 50 500
(I would then of course divide the amounts to put them all in common units).
Thanks!
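Since the question asks specifically about complete and fill, here is a hedged sketch of that route; the period helper column is an assumption added so that workers with several monthly rows are expanded independently:
library(tidyverse)
library(lubridate)

pay_data %>%
  mutate(period = row_number(),   # hypothetical helper: one id per pay-period row
         week = f_week) %>%       # seed the column that complete() will fill out
  group_by(worker, period) %>%
  complete(week = full_seq(c(f_week, l_week), 1)) %>%   # add the missing weeks
  fill(rate, hours, wages, .direction = "downup") %>%   # carry the pay values onto the new rows
  ungroup() %>%
  select(worker, week, rate, hours, wages)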

Another tidyverse way would be:
library(tidyverse)

pay_data %>%
  mutate(week = map2(f_week, l_week, seq)) %>%
  unnest(week) %>%
  select(worker, rate:wages, week)
# worker rate hours wages week
# <chr> <dbl> <dbl> <dbl> <int>
#1 Jim 10 50 500 1
#2 Jim 10 50 500 2
#3 Jim 10 50 500 3
#4 Jim 10 50 500 4
#5 Jim 10 50 500 5
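Building on this, the split into common units mentioned in the question could be a simple division by the number of weeks each row covers; a sketch (n_weeks is a hypothetical helper name):
pay_data %>%
  mutate(week = map2(f_week, l_week, seq),
         n_weeks = l_week - f_week + 1) %>%   # weeks covered by the row
  unnest(week) %>%
  mutate(hours = hours / n_weeks,             # spread the period totals evenly
         wages = wages / n_weeks) %>%
  select(worker, week, rate, hours, wages)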

A tidyverse approach making use of tidyr::separate_rows may look like so. To make the data more interesting I added data for a second worker.
library(tidyverse)

tbl %>%
  rowwise() %>%
  mutate(weeks = paste(seq(f_week, l_week, by = 1), collapse = ", ")) %>%
  ungroup() %>%
  separate_rows(weeks) %>%
  select(-ends_with("_week"), -start, -end)
#> # A tibble: 13 x 5
#> worker rate hours wages weeks
#> <chr> <int> <int> <int> <chr>
#> 1 Jim 10 50 500 1
#> 2 Jim 10 50 500 2
#> 3 Jim 10 50 500 3
#> 4 Jim 10 50 500 4
#> 5 Jim 10 50 500 5
#> 6 John 20 100 1000 1
#> 7 John 20 100 1000 2
#> 8 John 20 100 1000 3
#> 9 John 20 100 1000 4
#> 10 John 20 100 1000 5
#> 11 John 20 100 1000 6
#> 12 John 20 100 1000 7
#> 13 John 20 100 1000 8
DATA
tbl <- read.table(text = "worker start end rate hours wages f_week l_week
1 Jim 2020-01-03 2020-02-02 10 50 500 1 5
2 John 2020-01-03 2020-02-02 20 100 1000 1 8", header = TRUE)
tbl
#> worker start end rate hours wages f_week l_week
#> 1 Jim 2020-01-03 2020-02-02 10 50 500 1 5
#> 2 John 2020-01-03 2020-02-02 20 100 1000 1 8

Try this:
# Code
pay_data <- pay_data[rep(seq_len(nrow(pay_data)), unique(pay_data$l_week)),
                     c("worker", "rate", "hours", "wages")]
pay_data$week <- 1:nrow(pay_data)
Output:
# A tibble: 5 x 5
worker rate hours wages week
<chr> <dbl> <dbl> <dbl> <int>
1 Jim 10 50 500 1
2 Jim 10 50 500 2
3 Jim 10 50 500 3
4 Jim 10 50 500 4
5 Jim 10 50 500 5
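Note that unique(pay_data$l_week) only happens to work here because there is a single row whose period starts in week 1. With several rows, or periods that do not start in week 1, repeating each row by the length of its own period is safer. A sketch, assuming pay_data still holds the original eight columns (idx and out are hypothetical names):
idx <- rep(seq_len(nrow(pay_data)), pay_data$l_week - pay_data$f_week + 1)
out <- pay_data[idx, c("worker", "rate", "hours", "wages")]
out$week <- unlist(Map(seq, pay_data$f_week, pay_data$l_week))  # week numbers per original row
out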

Related

Moving average by multiple groups

I have the following DF (demo). I would like to find the previous 3-month moving average of the AMOUNT column per ID, YEAR and MONTH.
ID YEAR MONTH AMOUNT
1 ABC 2020 09 100
2 ABC 2020 11 200
3 ABC 2020 12 300
4 ABC 2021 01 400
5 ABC 2021 04 500
6 PQR 2020 10 100
7 PQR 2020 11 200
8 PQR 2020 12 300
9 PQR 2021 01 400
10 PQR 2021 03 500
The following is an attempt:
library(TTR)
library(dplyr)
DF %>% group_by(ID, YEAR, MONTH) %>% mutate(`3MA` = runMean(AMOUNT, 3))
This results in the error n=3 is outside valid range: grouping by ID, YEAR and MONTH leaves a single row per group, so a 3-period mean cannot be computed within it (grouping by ID alone, as the answers below do, avoids this). Note also that a name starting with a digit, such as 3MA, must be backtick-quoted.
Desired Output:
ID YEAR MONTH AMOUNT 3MA
1 ABC 2020 09 100 NA
2 ABC 2020 11 200 NA
3 ABC 2020 12 300 NA
4 ABC 2021 01 400 200 (100+200+300)/3
5 ABC 2021 04 500 300 (400+300+200)/3
6 PQR 2020 10 100 NA
7 PQR 2020 11 200 NA
8 PQR 2020 12 300 NA
9 PQR 2021 01 400 200 (100+200+300)/3
10 PQR 2021 03 500 300 (400+300+200)/3
You can use the following code:
library(dplyr)

arrange(DF, ID, YEAR) %>%
  group_by(ID) %>%
  mutate(lag1 = lag(AMOUNT),
         lag2 = lag(AMOUNT, 2),
         lag3 = lag(AMOUNT, 3),
         movave = (lag1 + lag2 + lag3) / 3)
#> # A tibble: 10 × 8
#> # Groups: ID [2]
#> ID YEAR MONTH AMOUNT lag1 lag2 lag3 movave
#> <chr> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 ABC 2020 9 100 NA NA NA NA
#> 2 ABC 2020 11 200 100 NA NA NA
#> 3 ABC 2020 12 300 200 100 NA NA
#> 4 ABC 2021 1 400 300 200 100 200
#> 5 ABC 2021 4 500 400 300 200 300
#> 6 PQR 2020 10 100 NA NA NA NA
#> 7 PQR 2020 11 200 100 NA NA NA
#> 8 PQR 2020 12 300 200 100 NA NA
#> 9 PQR 2021 1 400 300 200 100 200
#> 10 PQR 2021 3 500 400 300 200 300
Created on 2022-07-02 by the reprex package (v2.0.1)
An option using a sliding window:
library(tidyverse)
library(slider)

df <- tribble(
  ~id,   ~year, ~month, ~amount,
  "ABC", 2020,   9,     100,
  "ABC", 2020,  11,     200,
  "ABC", 2020,  12,     300,
  "ABC", 2021,   1,     400,
  "ABC", 2021,   4,     500,
  "PQR", 2020,  10,     100,
  "PQR", 2020,  11,     200,
  "PQR", 2020,  12,     300,
  "PQR", 2021,   1,     400,
  "PQR", 2021,   3,     500
)

df |>
  arrange(id, year, month) |>
  group_by(id) |>
  mutate(ma3 = slide_dbl(lag(amount), mean, .before = 2, .complete = TRUE)) |>
  ungroup() # if needed
#> # A tibble: 10 × 5
#> id year month amount ma3
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 ABC 2020 9 100 NA
#> 2 ABC 2020 11 200 NA
#> 3 ABC 2020 12 300 NA
#> 4 ABC 2021 1 400 200
#> 5 ABC 2021 4 500 300
#> 6 PQR 2020 10 100 NA
#> 7 PQR 2020 11 200 NA
#> 8 PQR 2020 12 300 NA
#> 9 PQR 2021 1 400 200
#> 10 PQR 2021 3 500 300
Created on 2022-07-02 by the reprex package (v2.0.1)
Here is a way.
suppressPackageStartupMessages({
  library(dplyr)
  library(TTR)
})
x <- ' ID YEAR MONTH AMOUNT
1 ABC 2020 09 100
2 ABC 2020 11 200
3 ABC 2020 12 300
4 ABC 2021 01 400
5 ABC 2021 04 500
6 PQR 2020 10 100
7 PQR 2020 11 200
8 PQR 2020 12 300
9 PQR 2021 01 400
10 PQR 2021 03 500 '
DF <- read.table(textConnection(x), header = TRUE)
DF %>%
  arrange(ID, YEAR, MONTH) %>%
  group_by(ID) %>%
  mutate(`3MA` = lag(runMean(AMOUNT, 3)))
#> # A tibble: 10 × 5
#> # Groups: ID [2]
#> ID YEAR MONTH AMOUNT `3MA`
#> <chr> <int> <int> <int> <dbl>
#> 1 ABC 2020 9 100 NA
#> 2 ABC 2020 11 200 NA
#> 3 ABC 2020 12 300 NA
#> 4 ABC 2021 1 400 200
#> 5 ABC 2021 4 500 300
#> 6 PQR 2020 10 100 NA
#> 7 PQR 2020 11 200 NA
#> 8 PQR 2020 12 300 NA
#> 9 PQR 2021 1 400 200
#> 10 PQR 2021 3 500 300
Created on 2022-07-02 by the reprex package (v2.0.1)
Try this:
DF |>
  arrange(ID, YEAR, MONTH) |>
  group_by(ID) |>
  mutate(`3M` = (lag(AMOUNT) + lag(AMOUNT, 2) + lag(AMOUNT, 3)) / 3)
output
# A tibble: 10 × 5
# Groups: ID [2]
ID YEAR MONTH AMOUNT `3M`
<chr> <int> <int> <int> <dbl>
1 ABC 2020 9 100 NA
2 ABC 2020 11 200 NA
3 ABC 2020 12 300 NA
4 ABC 2021 1 400 200
5 ABC 2021 4 500 300
6 PQR 2020 10 100 NA
7 PQR 2020 11 200 NA
8 PQR 2020 12 300 NA
9 PQR 2021 1 400 200
10 PQR 2021 3 500 300
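If a rolling-mean helper is preferred over explicit lags, zoo::rollmeanr should give the same result; a sketch, assuming the zoo package is acceptable here:
library(dplyr)
library(zoo)

DF %>%
  arrange(ID, YEAR, MONTH) %>%
  group_by(ID) %>%
  mutate(`3MA` = lag(rollmeanr(AMOUNT, 3, fill = NA))) %>%  # mean of the previous three rows
  ungroup()
Like the answers above, this averages the previous three rows per ID rather than the previous three calendar months, which matches the desired output shown in the question.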

Dataframe with start & end date to daily data

I am trying to convert the data below to a daily series based on the range given in the start_date and end_date columns, to this output (sum):
Please use dput() when posting data frames next time!
Example data
# A tibble: 4 × 4
id start end inventory
<int> <chr> <chr> <dbl>
1 1 01/05/2022 02/05/2022 100
2 2 10/05/2022 15/05/2022 50
3 3 11/05/2022 21/05/2022 80
4 4 14/05/2022 17/05/2022 10
Transform the data (assigning the result back to df so the next step can use it):
library(tidyverse)

df <- df %>%
  mutate(across(2:3, ~ as.Date(.x, format = "%d/%m/%Y"))) %>%
  pivot_longer(cols = c(start, end), values_to = "date") %>%
  arrange(date) %>%
  select(date, inventory)
df
# A tibble: 8 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-10 50
4 2022-05-11 80
5 2022-05-14 10
6 2022-05-15 50
7 2022-05-17 10
8 2022-05-21 80
Expand the dates and left_join:
left_join(tibble(date = seq(first(df$date),
                            last(df$date),
                            by = "day")),
          df)
# A tibble: 21 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-03 NA
4 2022-05-04 NA
5 2022-05-05 NA
6 2022-05-06 NA
7 2022-05-07 NA
8 2022-05-08 NA
9 2022-05-09 NA
10 2022-05-10 50
# … with 11 more rows
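If the goal is instead the total inventory per covered day across overlapping ranges (the "sum" the question refers to), here is a hedged sketch working from the original four-column data, rebuilt here as df0 (a hypothetical name):
library(tidyverse)

df0 <- tibble(id = 1:4,
              start = c("01/05/2022", "10/05/2022", "11/05/2022", "14/05/2022"),
              end   = c("02/05/2022", "15/05/2022", "21/05/2022", "17/05/2022"),
              inventory = c(100, 50, 80, 10))

df0 %>%
  mutate(across(c(start, end), ~ as.Date(.x, format = "%d/%m/%Y")),
         date = map2(start, end, seq, by = "day")) %>%  # one date sequence per row
  unnest(date) %>%
  group_by(date) %>%
  summarise(inventory = sum(inventory), .groups = "drop")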

R - Number of days since last event in another dataframe

I have the following two data frames:
> head(Reaction_per_park_per_day_3)
Park Date Type_1_2 Number_AC_events
<chr> <date> <chr> <int>
1 Beaverdam Flats 2018-09-25 0 1
2 Nosehill 64 ave 2018-09-26 0 1
3 Nosehill 64 ave 2018-09-26 0 1
4 Nosehill Macewin 2018-09-26 0 1
5 Crestmont 2018-09-27 0 2
6 Country Hills G.C. - Nose Creek 2018-09-28 0 1
> head(All_reports_per_month2)
Month Park Code Reports_per_month
<date> <chr> <chr> <dbl>
1 2018-09-29 Beaverdam Flats 1 1
2 2018-10-12 Nosehill 64 ave 2 1
3 2018-10-25 Nosehill 64 ave 1 2
4 2018-09-21 Crestmont 1 1
5 2018-09-29 Crestmont 2 1
I would like to add a "days since last AC event" column to All_reports_per_month2 that would take into account the date and the park of the AC event as well as the date and park of the report. If the report date is prior to the first AC event in a certain park, NA would appear. See example below:
Month Park Code Reports_per_month Days_since_last_AC
<date> <chr> <chr> <dbl> <chr>
1 2018-09-29 Beaverdam Flats 1 1 4
2 2018-10-12 Nosehill 64 ave 2 1 16
3 2018-10-25 Nosehill 64 ave 1 2 29
4 2018-09-21 Crestmont 1 1 NA
5 2018-09-29 Crestmont 2 1 2
Any help would be appreciated!
This is a joining and filtering operation that will use the dplyr package.
# import the packages
library(dplyr)

# join the data tables and filter so that we are always looking back in time
All_reports_per_month2 %>%
  left_join(Reaction_per_park_per_day_3, by = "Park") %>%
  filter(Date <= Month) %>%
  group_by(Park, Month) %>%
  summarize(Days_since_last_AC = Month - max(Date))
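The summarize above drops reports that have no earlier AC event in their park; a hedged variant that keeps them as NA and carries the other report columns through the grouping might look like this, assuming the column names shown above:
All_reports_per_month2 %>%
  left_join(Reaction_per_park_per_day_3, by = "Park") %>%
  group_by(Park, Month, Code, Reports_per_month) %>%
  summarize(
    Days_since_last_AC = if (any(Date <= Month, na.rm = TRUE)) {
      as.numeric(first(Month) - max(Date[which(Date <= Month)]))  # most recent prior event
    } else {
      NA_real_                                                    # no AC event on or before the report
    },
    .groups = "drop"
  )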

Group records with time interval overlap

I have a data frame (with N = 16) that contains ID (character), w_from (date), and w_to (date). Each record represents a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
"2010-03-01","2010-03-15","2010-07-15","2010-09-10",
"2010-11-01","2010-11-30","2010-12-15","2010-12-31",
"2011-02-01","2012-04-01","2011-07-01","2011-07-01")
w_to <- c("2010-01-31","2010-01-15", "2010-02-13","2010-02-28",
"2010-03-16","2010-03-16","2010-08-14","2010-10-10",
"2010-12-01","2010-12-30","2010-12-20","2011-02-19",
"2011-03-23","2012-06-30","2011-07-31","2011-07-06")
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number by ID for the records that their time intervals overlap. As an example, and in general terms, if record#1 overlaps with record#2, and record#2 overlaps with record#3, then record#1, record#2, and record#3 overlap.
Also, if record#1 overlaps with record#2 and record#3, but record#2 doesn't overlap with record#3, then record#1, record#2, record#3 are all overlap.
In the example above and for ID=1, the first four records overlap.
The final output should assign one group number per set of overlapping records within each ID. Also, if this can be done using dplyr, that would be great!
Try this:
library(dplyr)

df %>%
  group_by(ID) %>%
  arrange(w_from) %>%
  # a new group starts whenever an interval begins after the latest end date
  # seen so far within the ID (the cummax of the lagged end dates)
  mutate(group = 1 + cumsum(
    cummax(lag(as.numeric(w_to),
               default = first(as.numeric(w_to)))) < as.numeric(w_from)))
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2
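As a quick sanity check of what the cummax/cumsum trick produced, the groups can be collapsed to their overall spans; within an ID, consecutive blocks should not overlap (block_start, block_end and n_tasks are hypothetical names):
df %>%
  group_by(ID) %>%
  arrange(w_from) %>%
  mutate(group = 1 + cumsum(
    cummax(lag(as.numeric(w_to),
               default = first(as.numeric(w_to)))) < as.numeric(w_from))) %>%
  group_by(ID, group) %>%
  summarise(block_start = min(w_from),
            block_end = max(w_to),
            n_tasks = n(),
            .groups = "drop")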

Summarise? Count occurrences in column based on another column

I believe this may have a simple solution but I'm having trouble describing what I need to do (and hence what to search for). I think I need the summarize function. My goal output is at the very bottom.
I'm trying to count, for each unique value in one column, the occurrences of each value in another column. Here is an example df that hopefully illustrates what I need to do.
library(dplyr)
set.seed(1)
df <- tibble(name = c(rep("dinah", 2), rep("lucy", 4), rep("sora", 9)),
             meal = rep(c("chicken", "beef", "fish"), 5),
             date = seq(as.Date("1999/1/1"), as.Date("2000/1/1"), 25),
             num.wins = sample(0:30)[1:15])
Among other things, I'm trying to summarize (sum) the types of meals each name had using this data.
df
# A tibble: 15 x 4
name meal date num.wins
<chr> <chr> <date> <int>
1 dinah chicken 1999-01-01 8
2 dinah beef 1999-01-26 11
3 lucy fish 1999-02-20 16
4 lucy chicken 1999-03-17 25
5 lucy beef 1999-04-11 5
6 lucy fish 1999-05-06 23
7 sora chicken 1999-05-31 27
8 sora beef 1999-06-25 15
9 sora fish 1999-07-20 14
10 sora chicken 1999-08-14 1
11 sora beef 1999-09-08 4
12 sora fish 1999-10-03 3
13 sora chicken 1999-10-28 13
14 sora beef 1999-11-22 6
15 sora fish 1999-12-17 18
I've made progress with other calculations I'm interested in, below:
df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins))
# A tibble: 3 x 5
name count medianDate life wins
<chr> <int> <date> <time> <int>
1 dinah 2 1999-01-13 25 days 19
2 lucy 4 1999-03-29 75 days 69
3 sora 9 1999-09-08 200 days 101
My goal is to add an additional column for each type of food, and have the sum of the occurrences of that food displayed in each row, like so:
name count medianDate life wins chicken beef fish
1 dinah 2 1999-01-13 25 days 19 1 1 0
2 lucy 4 1999-03-29 75 days 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3
Though older, and possibly on a deprecation path, reshape2::dcast does this nicely:
reshape2::dcast(df, name ~ meal)
# name beef chicken fish
# 1 dinah 1 1 0
# 2 lucy 1 1 2
# 3 sora 3 3 3
You can understand the formula as rows ~ columns. By default, it will aggregate the values in the columns using the length function---which gives exactly what you want, the count of each.
This can be easily joined to your summary data:
df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins)) %>%
  left_join(reshape2::dcast(df, name ~ meal))
# # A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <int> <int> <int>
# 1 dinah 2 1999-01-13 25 days 19 1 1 0
# 2 lucy 4 1999-03-29 75 days 69 1 1 2
# 3 sora 9 1999-09-08 200 days 101 3 3 3
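If staying inside the tidyverse is preferred (the answer itself notes reshape2 may be on a deprecation path), a count() plus pivot_wider() sketch gives the same counts and can be joined to the summary with left_join() in exactly the same way:
library(tidyverse)

df %>%
  count(name, meal) %>%                                            # one row per name/meal with its count
  pivot_wider(names_from = meal, values_from = n, values_fill = 0)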
One option is to use table inside summarise as a list column, then unnest and spread it to 'wide' format:
library(tidyverse)

df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins),
            n = list(enframe(table(meal)))) %>%
  unnest() %>%
  spread(name1, value, fill = 0)
# A tibble: 3 x 8
# name count medianDate life wins beef chicken fish
# <chr> <int> <date> <time> <int> <dbl> <dbl> <dbl>
#1 dinah 2 1999-01-13 25 days 19 1 1 0
#2 lucy 4 1999-03-29 75 days 69 1 1 2
#3 sora 9 1999-09-08 200 days 101 3 3 3
I'm not entirely sure why I'm getting the funky formatting for life, but I think this gets at your need for a count of the meal types.
df %>%
  group_by(name) %>%
  summarise(count = n(),
            medianDate = median(date),
            life = (max(date) - min(date)),
            wins = sum(num.wins),
            chicken = sum(meal == "chicken"),
            beef = sum(meal == "beef"),
            fish = sum(meal == "fish"))
# A tibble: 3 x 8
name count medianDate life wins chicken beef fish
<chr> <int> <date> <time> <int> <int> <int> <int>
1 dinah 2 1999-01-13 " 25 days" 19 1 1 0
2 lucy 4 1999-03-29 " 75 days" 69 1 1 2
3 sora 9 1999-09-08 200 days 101 3 3 3
