Fill columns with most recent value - r

I have a dataset like this in R:
Date | ID | Age |
2019-11-22 | 1 | 5 |
2018-12-21 | 1 | 4 |
2018-05-09 | 1 | 4 |
2018-05-01 | 2 | 5 |
2017-10-10 | 2 | 4 |
2017-07-21 | 1 | 3 |
How do I change the Age values within each ID group to the most recent Age record?
Results should look like this:
Date | ID | Age |
2019-11-22 | 1 | 5 |
2018-12-21 | 1 | 5 |
2018-05-09 | 1 | 5 |
2018-05-01 | 2 | 5 |
2017-10-10 | 2 | 5 |
2017-07-21 | 1 | 5 |
I tried group_by(ID) %>% mutate(Age = max(Date, Age))
but it gives strangely huge numbers in some cases when I run it on a very large dataset. What could be going wrong?

Try sorting first,
df %>%
  arrange(as.Date(Date)) %>%
  group_by(ID) %>%
  mutate(Age = last(Age))
which gives,
# A tibble: 6 x 3
# Groups:   ID [2]
  Date          ID   Age
  <fct>      <int> <int>
1 2017-07-21     1     5
2 2017-10-10     2     5
3 2018-05-01     2     5
4 2018-05-09     1     5
5 2018-12-21     1     5
6 2019-11-22     1     5
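As for the strange huge numbers: a Date is stored internally as the number of days since 1970-01-01, so max(Date, Age) falls back to that numeric representation and returns the date's day count rather than an age. A quick illustration:
as.numeric(as.Date("2019-11-22"))
#> [1] 18222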

I think the issue is in your mutate function: max(Date, Age) pulls the date's numeric value into the comparison. Try this:
df %>%
  group_by(ID) %>%
  arrange(as.Date(Date)) %>%
  mutate(Age = max(Age))
Note that max(Age) only matches the most recent record because Age never decreases within an ID; if that could happen, sort by date and take last(Age) as in the other answer.

Related

find average incidents per business day

I've a dataset as under:
+----+-------+---------------------+
| ID | SUBID | date |
+----+-------+---------------------+
| A | 1 | 2021-01-01 12:00:00 |
| A | 1 | 2021-01-02 01:00:00 |
| A | 1 | 2021-01-02 02:00:00 |
| A | 1 | 2021-01-03 03:00:00 |
| A | 2 | 2021-01-05 16:00:00 |
| A | 2 | 2021-01-06 13:00:00 |
| A | 2 | 2021-01-07 06:00:00 |
| A | 2 | 2021-01-08 08:00:00 |
| A | 2 | 2021-01-08 10:00:00 |
| A | 2 | 2021-01-08 11:00:00 |
| A | 3 | 2021-01-09 09:00:00 |
| A | 3 | 2021-01-10 19:00:00 |
| A | 3 | 2021-01-11 20:00:00 |
| A | 3 | 2021-01-12 22:00:00 |
| B | 1 | 2021-02-01 23:00:00 |
| B | 1 | 2021-02-02 15:00:00 |
| B | 1 | 2021-02-03 06:00:00 |
| B | 1 | 2021-02-04 08:00:00 |
| B | 2 | 2021-02-05 18:00:00 |
| B | 2 | 2021-02-05 19:00:00 |
| B | 2 | 2021-02-06 22:00:00 |
| B | 2 | 2021-02-06 23:00:00 |
| B | 2 | 2021-02-07 04:00:00 |
| B | 2 | 2021-02-08 02:00:00 |
| B | 3 | 2021-02-09 01:00:00 |
| B | 3 | 2021-02-10 03:00:00 |
| B | 3 | 2021-02-11 13:00:00 |
| B | 3 | 2021-02-12 14:00:00 |
+----+-------+---------------------+
I want to get the time difference in hours between consecutive rows within each (ID, SUBID) group, preferably in terms of business hours, where each date that falls on a weekend or a federal holiday is moved to the nearest weekday (preceding or succeeding) with a time of 23:59:59, as under:
+----+-------+---------------------+------------------------------------------------------------------+
| ID | SUBID | date | timediff (hours) with preceding date for each group (ID, SUBID) |
+----+-------+---------------------+------------------------------------------------------------------+
| A | 1 | 2021-01-01 12:00:00 | 0 |
| A | 1 | 2021-01-02 01:00:00 | 13 |
| A | 1 | 2021-01-02 02:00:00 | 1 |
| A | 1 | 2021-01-03 03:00:00 | 1 |
| A | 2 | 2021-01-05 16:00:00 | 0 |
| A | 2 | 2021-01-06 13:00:00 | 21 |
| A | 2 | 2021-01-07 06:00:00 | 17 |
| A | 2 | 2021-01-08 08:00:00 | 2 |
| A | 2 | 2021-01-08 10:00:00 | 2 |
| A | 2 | 2021-01-08 11:00:00 | 1 |
| A | 3 | 2021-01-09 09:00:00 | 0 |
| A | 3 | 2021-01-10 19:00:00 | 36 |
| A | 3 | 2021-01-11 20:00:00 | 1 |
| A | 3 | 2021-01-12 22:00:00 | 1 |
| B | 1 | 2021-02-01 23:00:00 | 0 |
| B | 1 | 2021-02-02 15:00:00 | 16 |
| B | 1 | 2021-02-03 06:00:00 | 15 |
| B | 1 | 2021-02-04 08:00:00 | 26 |
| B | 2 | 2021-02-05 18:00:00 | 0 |
| B | 2 | 2021-02-05 19:00:00 | 1 |
| B | 2 | 2021-02-06 22:00:00 | 27 |
| B | 2 | 2021-02-06 23:00:00 | 1 |
| B | 2 | 2021-02-07 04:00:00 | 5 |
| B | 2 | 2021-02-08 02:00:00 | 22 |
| B | 3 | 2021-02-09 01:00:00 | 0 |
| B | 3 | 2021-02-10 03:00:00 | 26 |
| B | 3 | 2021-02-11 13:00:00 | 11 |
| B | 3 | 2021-02-12 14:00:00 | 1 |
+----+-------+---------------------+------------------------------------------------------------------+
and lastly I want to calculate the average time, which would be the sum of time differences per group (ID, SUBID) divided by the total count per group, as under:
+----+-------+------------------------------------------------------------+
| ID | SUBID | Average time (total time diff of group / count per group) |
+----+-------+------------------------------------------------------------+
| A | 1 | 15/4 |
| A | 2 | 43/6 |
| A | 3 | 38/4 |
| B | 1 | 57/4 |
| B | 2 | 56/6 |
| B | 3 | 38/4 |
+----+-------+------------------------------------------------------------+
I'm fairly new to R. I came across lubridate to help me format the dates, and I was able to get the time diff using the code below:
df %>%
  group_by(ID, SUBID) %>%
  mutate(time_diff = difftime(date, lag(date), units = 'mins'))
However, I'm having trouble restricting the difference to business time only, and also getting the average time as in the last table.
Welcome to SO! Using dplyr and lubridate:
Data used:
library(tidyverse)
library(lubridate)
df <- data.frame(ID = c("A","A","A","A"),
                 SUBID = c(1,1,2,2),
                 Date = lubridate::as_datetime(c("2021-01-01 12:00:00","2021-01-02 1:00:00","2021-01-01 2:00:00","2021-01-01 13:00:00")))
  ID SUBID                Date
1  A     1 2021-01-01 12:00:00
2  A     1 2021-01-02 01:00:00
3  A     2 2021-01-01 02:00:00
4  A     2 2021-01-01 13:00:00
Code:
df %>%
  group_by(ID, SUBID) %>%
  mutate(diff = Date - lag(Date)) %>%
  mutate(diff = ifelse(is.na(diff), 0, diff)) %>%
  summarise(Average = sum(diff)/n())
Output:
  ID    SUBID Average
  <chr> <dbl>   <dbl>
1 A         1     6.5
2 A         2     5.5
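One caveat with Date - lag(Date): difftime picks its units automatically, so on other data the same code may silently return minutes or days instead of hours. A sketch of the same computation with the units pinned explicitly (still just dplyr plus base difftime):
df %>%
  group_by(ID, SUBID) %>%
  mutate(diff = as.numeric(difftime(Date, lag(Date), units = "hours")),
         diff = coalesce(diff, 0)) %>%  # first row of each group has no predecessor
  summarise(Average = sum(diff) / n())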
Edit: how to handle weekends
For weekends, the simplest solution is to shift the day to the following Monday:
df %>%
  mutate(week_day = wday(Date, label = TRUE, abbr = FALSE)) %>%
  mutate(Date = ifelse(week_day == "samedi", Date + days(2),
                ifelse(week_day == "dimanche", Date + days(1), Date))) %>%
  mutate(Date = as_datetime(Date))
This creates the column week_day with the name of the day. If the day is a "samedi" (Saturday) or a "dimanche" (Sunday), it adds 2 or 1 days to the Date so that it becomes a Monday. Then you just need to reorder the dates (df %>% arrange(ID, SUBID, Date)) and rerun the first code.
As my locale is French, you will have to change samedi and dimanche to the English day names (Saturday and Sunday).
For the holidays, you can apply the same idea: create a time-interval variable that represents the holidays, test whether each date falls within this interval, and if so, change the date to the last day of the interval.
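A rough sketch of that holiday idea, using a single made-up holiday window (the real federal holiday calendar is yours to supply):
library(dplyr)
library(lubridate)
# hypothetical holiday interval; replace with your actual holidays
holidays <- interval(ymd_hms("2021-01-18 00:00:00"), ymd_hms("2021-01-18 23:59:59"))
df %>%
  mutate(Date = if_else(Date %within% holidays,
                        int_end(holidays),  # move to the last day of the interval
                        Date))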

Weekly Weights Calculations in R Using dplyr

I have the following data and am looking to create the "Final Col" shown below using dplyr in R.
Please note, the "Trees" main category has a value of 0 for week 1, 2017. In this case, I would like the final values for both weeks to be 0.
I would appreciate your ideas.
| Year | Week | MainCat | Qty | Final Col  |
|:----:|:----:|:-------:|:---:|:----------:|
| 2017 | 1    | Edible  | 69  | 69/(69+12) |
| 2017 | 2    | Edible  | 12  | 12/(69+12) |
| 2017 | 1    | Trees   | 00  | 00         |
| 2017 | 2    | Trees   | 12  | 00         |
| 2017 | 1    | Flowers | 88  | 88/(88+47) |
| 2017 | 2    | Flowers | 47  | 47/(88+47) |
| 2018 | 1    | Edible  | 90  | 90/(90+35) |
| 2018 | 2    | Edible  | 35  | 35/(90+35) |
| 2018 | 1    | Trees   | 32  | 32/(32+12) |
| 2018 | 2    | Trees   | 12  | 12/(32+12) |
| 2018 | 1    | Flowers | 78  | 78/(78+85) |
| 2018 | 2    | Flowers | 85  | 85/(78+85) |
For this, you can use a combination of group_by and a regular if ... else clause:
library(dplyr)
set.seed(0)
# dummy data
data <- tidyr::expand_grid(year = 2017:2018,
                           quarter = 1:2,
                           MainCat = LETTERS[1:3]) %>%
  mutate(Qty = sample(0:15, 3*2*2)) %>%
  arrange(year, MainCat, quarter)
data %>%
  group_by(year, MainCat) %>%
  mutate(finalCol = if (any(Qty == 0)) { 0 } else { Qty / sum(Qty) })
#> # A tibble: 12 x 5
#> # Groups:   year, MainCat [6]
#>     year quarter MainCat   Qty finalCol
#>    <int>   <int> <chr>   <int>    <dbl>
#>  1  2017       1 A          13    0.684
#>  2  2017       2 A           6    0.316
#>  3  2017       1 B           8    0
#>  4  2017       2 B           0    0
#>  5  2017       1 C           3    0.75
#>  6  2017       2 C           1    0.25
#>  7  2018       1 A          12    0.632
#>  8  2018       2 A           7    0.368
#>  9  2018       1 B          10    0.476
#> 10  2018       2 B          11    0.524
#> 11  2018       1 C           2    0.333
#> 12  2018       2 C           4    0.667
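If you prefer a fully vectorised expression, a sketch of an equivalent alternative multiplies by the logical "no zero in this group", which collapses such groups to 0 (note that a group whose Qty sums to zero would yield NaN):
data %>%
  group_by(year, MainCat) %>%
  mutate(finalCol = (Qty / sum(Qty)) * !any(Qty == 0))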

Remove the first row from each group if the second row meets a condition

Here's a sample of my dataset:
df = data.frame(id = c("9","9","9","5","5","5","4","4","4","4","4","20","20"),
                Date = c("11/29/2018","11/29/2018","11/29/2018","5/25/2018","2/13/2019","2/13/2019","6/7/2018",
                         "6/15/2018","6/20/2018","8/17/2018","8/20/2018","12/25/2018","12/25/2018"),
                Buyer = c("John","John","John","Maria","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"))
I need to calculate the difference between dates, which I have already done; the dataset then looks like:
| id | Date | Buyer | diff |
|----|:----------:|------:|------|
| 9 | 11/29/2018 | John | NA |
| 9 | 11/29/2018 | John | 0 |
| 9 | 11/29/2018 | John | 0 |
| 5 | 5/25/2018 | Maria | -188 |
| 5 | 2/13/2019 | Maria | 264 |
| 5 | 2/13/2019 | Maria | 0 |
| 4 | 6/7/2018 | Sandy | -251 |
| 4 | 6/15/2018 | Sandy | 8 |
| 4 | 6/20/2018 | Sandy | 5 |
| 4 | 8/17/2018 | Sandy | 58 |
| 4 | 8/20/2018 | Sandy | 3 |
| 20 | 12/25/2018 | Paul | 127 |
| 20 | 12/25/2018 | Paul | 0 |
Now, if the 'diff' value in the second row of a group is greater than or equal to 5, I need to delete the first row of that group. For example, the diff value 264 is greater than 5 for buyer 'Maria' with id '5', so I want to delete the first row of that group: buyer 'Maria', id '5', Date '5/25/2018', diff '-188'.
Below is a sample of my code:
df1 = df %>%
  group_by(Buyer, id) %>%
  mutate(diff = c(NA, diff(Date))) %>%
  filter(!(diff >= 5 & row_number() == 1))
The problem is that the above code tests the first row instead of the second, and I don't know how to require that the diff value be greater than or equal to 5 in the 2nd row of each group.
My expected output should look like:
| id | Date | Buyer | diff |
|----|:----------:|------:|------|
| 9 | 11/29/2018 | John | NA |
| 9 | 11/29/2018 | John | 0 |
| 9 | 11/29/2018 | John | 0 |
| 5 | 2/13/2019 | Maria | 264 |
| 5 | 2/13/2019 | Maria | 0 |
| 4 | 6/15/2018 | Sandy | 8 |
| 4 | 6/20/2018 | Sandy | 5 |
| 4 | 8/17/2018 | Sandy | 58 |
| 4 | 8/20/2018 | Sandy | 3 |
| 20 | 12/25/2018 | Paul | 127 |
| 20 | 12/25/2018 | Paul | 0 |
I think you forgot to provide the diff column in df. I created one called diffs so that it doesn't conflict with the function diff().
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y")))) %>%
  filter(
    n() == 1 |          # always keep if only one row in group
    row_number() > 1 |  # always keep all row_number() > 1
    diffs[2] < 5        # keep 1st row only if 2nd row diffs < 5
  ) %>%
  ungroup()
# A tibble: 11 x 4
   id    Date       Buyer diffs
   <chr> <chr>      <chr> <dbl>
 1 9     11/29/2018 John     NA
 2 9     11/29/2018 John      0
 3 9     11/29/2018 John      0
 4 5     2/13/2019  Maria   264
 5 5     2/13/2019  Maria     0
 6 4     6/15/2018  Sandy     8
 7 4     6/20/2018  Sandy     5
 8 4     8/17/2018  Sandy    58
 9 4     8/20/2018  Sandy     3
10 20    12/25/2018 Paul     NA
11 20    12/25/2018 Paul      0
Data -
I added stringsAsFactors = FALSE
df1 <- data.frame(id = c("9","9","9","5","5","5","4","4","4","4","4","20","20"),
                  Date = c("11/29/2018","11/29/2018","11/29/2018","5/25/2018","2/13/2019","2/13/2019","6/7/2018",
                           "6/15/2018","6/20/2018","8/17/2018","8/20/2018","12/25/2018","12/25/2018"),
                  Buyer = c("John","John","John","Maria","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                  stringsAsFactors = FALSE)
Maybe I overthought it, but here is one idea,
df %>%
  mutate(Date = as.Date(Date, format = '%m/%d/%Y')) %>%
  mutate(diff = c(NA, diff(Date))) %>%                       # diff over the whole frame, as in the question
  group_by(id) %>%
  mutate(diff1 = as.integer(diff >= 5) + row_number()) %>%   # a 2nd row with diff >= 5 becomes 3
  filter(diff1 != 1 | lead(diff1) != 3) %>%                  # drop a 1st row whose 2nd row hit 3
  select(-diff1)
which gives,
# A tibble: 11 x 4
# Groups:   id [4]
   id    Date       Buyer  diff
   <fct> <date>     <fct> <dbl>
 1 9     2018-11-29 John     NA
 2 9     2018-11-29 John      0
 3 9     2018-11-29 John      0
 4 5     2019-02-13 Maria   264
 5 5     2019-02-13 Maria     0
 6 4     2018-06-15 Sandy     8
 7 4     2018-06-20 Sandy     5
 8 4     2018-08-17 Sandy    58
 9 4     2018-08-20 Sandy     3
10 20    2018-12-25 Paul    127
11 20    2018-12-25 Paul      0

Calculate mean value for ID keeping Panel Data Shape

Good afternoon,
I have the following problem and hope that somebody can help me find the right solution.
The situation is as follows:
Suppose one has an unbalanced panel dataset:
| ID | Value | Time |
| 1 | 12 | 2011 |
| 1 | 8 | 2012 |
| 1 | 10 | 2013 |
| 2 | 24 | 2011 |
| 2 | 10 | 2012 |
| 3 | 1 | 2011 |
| 3 | 8 | 2012 |
| 3 | 2 | 2013 |
What I am trying to do is calculate the mean Value for each ID and plug that single value into every year for that individual. The result should look like this:
| ID | Value | Time |
| 1 | 10 | 2011 |
| 1 | 10 | 2012 |
| 1 | 10 | 2013 |
| 2 | 17 | 2011 |
| 2 | 17 | 2012 |
| 3 | 4 | 2011 |
| 3 | 4 | 2012 |
| 3 | 4 | 2013 |
I've seen many questions of the same type, but none of the solutions kept the panel data shape. Does anyone have an idea how to solve this in R?
library(dplyr)
df <- data.frame(ID = c(1,1,1,2,2,3,3,3),
                 Value = c(12,8,10,24,10,1,8,2),
                 Time = c(2011,2012,2013,2011,2012,2011,2012,2013))
df %>%
  group_by(ID) %>%
  summarise(Value = round(mean(Value))) %>%
  right_join(df %>% select(-Value), by = "ID")
# A tibble: 8 x 3
     ID Value  Time
  <dbl> <dbl> <dbl>
1     1    10  2011
2     1    10  2012
3     1    10  2013
4     2    17  2011
5     2    17  2012
6     3     4  2011
7     3     4  2012
8     3     4  2013
EDIT
As Sotos points out below, this is a better solution:
df %>% group_by(ID) %>% mutate(Value = round(mean(Value)))
With data.table this becomes a "one-liner":
library(data.table)
setDT(df)[, Value := round(mean(Value)), by = ID][]
   ID Value Time
1:  1    10 2011
2:  1    10 2012
3:  1    10 2013
4:  2    17 2011
5:  2    17 2012
6:  3     4 2011
7:  3     4 2012
8:  3     4 2013
Data
df <- fread(
"| ID | Value | Time |
| 1 | 12 | 2011 |
| 1 | 8 | 2012 |
| 1 | 10 | 2013 |
| 2 | 24 | 2011 |
| 2 | 10 | 2012 |
| 3 | 1 | 2011 |
| 3 | 8 | 2012 |
| 3 | 2 | 2013 |",
sep = "|", drop = c(1L, 5L))
The base R solution via ave (whose default FUN is mean),
round(ave(df$Value, df$ID))
#[1] 10 10 10 17 17 4 4 4
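To keep the panel shape, just assign the result back:
df$Value <- round(ave(df$Value, df$ID))  # ave defaults to FUN = mean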

R plyr summarize chunks

I have a table of daily temperatures:
+-------+-----+-------------+
| City | Day | Temperature |
+-------+-----+-------------+
| Miami | 1 | 25 |
| Miami | 2 | 27 |
| Miami | 3 | 34 |
| Miami | 4 | 23 |
| Miami | 5 | 30 |
| Miami | 6 | 31 |
| Paris | 1 | 15 |
| Paris | 2 | 17 |
| Paris | 3 | 14 |
| Paris | 4 | 13 |
| Paris | 5 | 10 |
| Paris | 6 | 11 |
+-------+-----+-------------+
I would like to be able to summarize them by city in chunks of n days.
An example of the result with chunks of 3 days:
+-------+-----+---------------------+
| City | Day | AVGTemperature |
+-------+-----+---------------------+
| Miami | 1-3 | 28.66 |
| Miami | 4-6 | 29 |
| Paris | 1-3 | 15.33 |
| Paris | 4-6 | 14.5 |
+-------+-----+---------------------+
I could do
AVGTemp <- ddply(temp, .(Day, City), summarize, AVGTemperature=mean(Temperature))
But that gives me the average for every single day. How can I make it return chunks of n days?
Here's a dplyr solution. Change breaks from 3 to the number of chunks you'd like.
library(dplyr)
tab %>%
  mutate(day_group = cut(Day, 3, include.lowest = TRUE, labels = FALSE)) %>%
  group_by(City, day_group) %>%
  summarise(mean_temp = mean(Temperature), start_day = min(Day), end_day = max(Day))
# Source: local data frame [6 x 5]
# Groups: City [?]
#
#     City day_group mean_temp start_day end_day
#   (fctr)     (int)     (dbl)     (int)   (int)
# 1  Miami         1      26.0         1       2
# 2  Miami         2      28.5         3       4
# 3  Miami         3      30.5         5       6
# 4  Paris         1      16.0         1       2
# 5  Paris         2      13.5         3       4
# 6  Paris         3      10.5         5       6
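Note that cut(Day, 3) makes three equal-width chunks (two days each here) rather than chunks of three days. If you want fixed-width chunks of exactly n days, as in the question's example output, integer division is a simple sketch (tab is the same data as above):
n <- 3
tab %>%
  mutate(day_group = (Day - 1) %/% n) %>%  # days 1-3 -> 0, days 4-6 -> 1
  group_by(City, day_group) %>%
  summarise(mean_temp = mean(Temperature),
            start_day = min(Day), end_day = max(Day))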
