Spline interpolation by group in R

My data frame contains some options data with several variables. The relevant ones are date, days (days to expiration), and mid. Mid needs to be interpolated with a natural spline at a future time point 30 days out. This interpolation should be done for every date and every strike price.
The data table looks as follows:
| date       | days | mid | strike |
|------------|------|-----|--------|
| 2020-01-01 | 8    | 12  | 110    |
| 2020-01-01 | 28   | 14  | 110    |
| 2020-01-01 | 49   | 15  | 110    |
| 2020-01-01 | 80   | 17  | 110    |
| 2020-01-01 | 8    | 11  | 120    |
| 2020-01-01 | 28   | 12  | 120    |
| 2020-01-01 | 49   | 13  | 120    |
| 2020-01-01 | 80   | 14  | 120    |
| 2020-01-12 | 6    | 12  | 110    |
| 2020-01-12 | 26   | 14  | 110    |
| 2020-01-12 | 47   | 15  | 110    |
| 2020-01-12 | 82   | 17  | 110    |
| 2020-01-12 | 7    | 11  | 120    |
| 2020-01-12 | 27   | 12  | 120    |
| 2020-01-12 | 47   | 13  | 120    |
| 2020-01-12 | 85   | 14  | 120    |
This is just an example; the original data frame contains over a million entries, so I can't use a for loop and want to interpolate by group.
I found some approaches online, but unfortunately none of them really worked for me.
My last guess was:
df$id <- paste0(df$date, df$strike)
df %>%
  group_by(id) %>%
  mutate(mid_30 = splime(df$days, df$mid, xout = 30, method = "natural"))
Do you have any possible solution?
Thank you very much in advance!
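
For what it's worth, here is a minimal sketch of the grouped interpolation. Two things trip up the attempt above: the base function is spline(), not splime(), and inside mutate() you refer to the bare columns (days, mid), not df$days, otherwise the grouping is ignored. group_by() can also take date and strike directly, so the pasted id column isn't needed:

library(dplyr)

df %>%
  group_by(date, strike) %>%
  mutate(mid_30 = spline(days, mid, xout = 30, method = "natural")$y) %>%
  ungroup()

# or with data.table, which tends to scale better to a million rows:
library(data.table)
setDT(df)[, mid_30 := spline(days, mid, xout = 30, method = "natural")$y,
          by = .(date, strike)]

spline() returns a list, so $y extracts the interpolated value at xout = 30, and that single value is recycled to every row of the group. If you only want one row per date/strike combination, use summarise() instead of mutate(). This assumes every date/strike group has at least a couple of distinct days values to interpolate over.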

Related

Calculating weighted average buy and hold return per ID in R

Thanks to #langtang, I was able to calculate the Buy and Hold Return around the event date for each company (Calculating Buy and Hold return around event date per ID in R). But now I am facing a new problem.
Below is the data I currently have.
+----+------------+-------+------------+------------+----------------------+
| ID | Date | Price | EventDate | Market Cap | BuyAndHoldIndividual |
+----+------------+-------+------------+------------+----------------------+
| 1 | 2011-03-06 | 10 | NA | 109 | NA |
| 1 | 2011-03-07 | 9 | NA | 107 | -0.10000 |
| 1 | 2011-03-08 | 12 | NA | 109 | 0.20000 |
| 1 | 2011-03-09 | 14 | NA | 107 | 0.40000 |
| 1 | 2011-03-10 | 15 | NA | 101 | 0.50000 |
| 1 | 2011-03-11 | 17 | NA | 101 | 0.70000 |
| 1 | 2011-03-12 | 12 | 2011-03-12 | 110 | 0.20000 |
| 1 | 2011-03-13 | 14 | NA | 110 | 0.40000 |
| 1 | 2011-03-14 | 17 | NA | 100 | 0.70000 |
| 1 | 2011-03-15 | 14 | NA | 101 | 0.40000 |
| 1 | 2011-03-16 | 17 | NA | 107 | 0.70000 |
| 1 | 2011-03-17 | 16 | NA | 104 | 0.60000 |
| 1 | 2011-03-18 | 15 | NA | 104 | NA |
| 1 | 2011-03-19 | 16 | NA | 102 | 0.06667 |
| 1 | 2011-03-20 | 17 | NA | 107 | 0.13333 |
| 1 | 2011-03-21 | 18 | NA | 104 | 0.20000 |
| 1 | 2011-03-22 | 11 | NA | 105 | -0.26667 |
| 1 | 2011-03-23 | 15 | NA | 100 | 0.00000 |
| 1 | 2011-03-24 | 12 | 2011-03-24 | 110 | -0.20000 |
| 1 | 2011-03-25 | 13 | NA | 110 | -0.13333 |
| 1 | 2011-03-26 | 15 | NA | 107 | 0.00000 |
| 2 | 2011-03-12 | 48 | NA | 300 | NA |
| 2 | 2011-03-13 | 49 | NA | 300 | NA |
| 2 | 2011-03-14 | 50 | NA | 290 | NA |
| 2 | 2011-03-15 | 57 | NA | 296 | 0.14000 |
| 2 | 2011-03-16 | 60 | NA | 297 | 0.20000 |
| 2 | 2011-03-17 | 49 | NA | 296 | -0.02000 |
| 2 | 2011-03-18 | 64 | NA | 299 | 0.28000 |
| 2 | 2011-03-19 | 63 | NA | 292 | 0.26000 |
| 2 | 2011-03-20 | 67 | 2011-03-20 | 290 | 0.34000 |
| 2 | 2011-03-21 | 70 | NA | 299 | 0.40000 |
| 2 | 2011-03-22 | 58 | NA | 295 | 0.16000 |
| 2 | 2011-03-23 | 65 | NA | 290 | 0.30000 |
| 2 | 2011-03-24 | 57 | NA | 296 | 0.14000 |
| 2 | 2011-03-25 | 55 | NA | 299 | 0.10000 |
| 2 | 2011-03-26 | 57 | NA | 299 | NA |
| 2 | 2011-03-27 | 60 | NA | 300 | NA |
| 3 | 2011-03-18 | 5 | NA | 54 | NA |
| 3 | 2011-03-19 | 10 | NA | 50 | NA |
| 3 | 2011-03-20 | 7 | NA | 53 | NA |
| 3 | 2011-03-21 | 8 | NA | 53 | NA |
| 3 | 2011-03-22 | 7 | NA | 50 | NA |
| 3 | 2011-03-23 | 8 | NA | 51 | 0.14286 |
| 3 | 2011-03-24 | 7 | NA | 52 | 0.00000 |
| 3 | 2011-03-25 | 6 | NA | 55 | -0.14286 |
| 3 | 2011-03-26 | 9 | NA | 54 | 0.28571 |
| 3 | 2011-03-27 | 9 | NA | 55 | 0.28571 |
| 3 | 2011-03-28 | 9 | 2011-03-28 | 50 | 0.28571 |
| 3 | 2011-03-29 | 6 | NA | 52 | -0.14286 |
| 3 | 2011-03-30 | 6 | NA | 53 | -0.14286 |
| 3 | 2011-03-31 | 4 | NA | 50 | -0.42857 |
| 3 | 2011-04-01 | 5 | NA | 50 | -0.28571 |
| 3 | 2011-04-02 | 8 | NA | 55 | 0.00000 |
| 3 | 2011-04-03 | 9 | NA | 55 | NA |
+----+------------+-------+------------+------------+----------------------+
This time, I would like to add a new column called BuyAndHoldWeightedMarket: the market-cap-weighted average Buy and Hold return around each ID's event date, over the window from 5 days before to 5 days after it. For example, for ID = 1, starting from 2011-03-19, BuyAndHoldWeightedMarket is the sum over all IDs of (price of that ID on day t / price of that ID on the event date minus 6 days - 1) multiplied by that ID's market cap on day t, divided by the sum of the market caps of those IDs on day t.
Please check the picture below for the details; the equations are listed for each of the colored blocks.
Please note that for the uppermost BuyAndHoldWeightedMarket block, IDs 2 and 3 are not involved because they begin later than 2011-03-06. For the third block (the grey area), the weighted return only includes IDs 1 and 2 because ID 3 begins later than 2011-03-14. For the last block (mixed colors), the first four rows use all three IDs, the blue area uses only IDs 2 and 3 because ID 1 ends on 2011-03-26, and the yellow block uses only ID 3 because IDs 1 and 2 end before 2011-03-28.
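To make the definition concrete, here is the arithmetic behind the first weighted value for ID = 1 (0.11765 on 2011-03-19). The relevant event date is 2011-03-24, so the base date is 2011-03-18, where all three IDs have prices (15, 64 and 5). The snippet below is just that calculation written out in R, with the numbers taken from the table above:

# each ID's return on 2011-03-19 relative to its price on the base date 2011-03-18
r <- c(16 / 15, 63 / 64, 10 / 5) - 1
# market caps of the three IDs on 2011-03-19
caps <- c(102, 292, 50)
sum(r * caps) / sum(caps)
# [1] 0.117652   (matches the 0.11765 in the desired output)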
Eventually, I would like to get a nice data table that looks as below.
+----+------------+-------+------------+------------+----------------------+--------------------------+
| ID | Date | Price | EventDate | Market Cap | BuyAndHoldIndividual | BuyAndHoldWeightedMarket |
+----+------------+-------+------------+------------+----------------------+--------------------------+
| 1 | 2011-03-06 | 10 | NA | 109 | NA | NA |
| 1 | 2011-03-07 | 9 | NA | 107 | -0.10000 | -0.10000 |
| 1 | 2011-03-08 | 12 | NA | 109 | 0.20000 | 0.20000 |
| 1 | 2011-03-09 | 14 | NA | 107 | 0.40000 | 0.40000 |
| 1 | 2011-03-10 | 15 | NA | 101 | 0.50000 | 0.50000 |
| 1 | 2011-03-11 | 17 | NA | 101 | 0.70000 | 0.70000 |
| 1 | 2011-03-12 | 12 | 2011-03-12 | 110 | 0.20000 | 0.20000 |
| 1 | 2011-03-13 | 14 | NA | 110 | 0.40000 | 0.40000 |
| 1 | 2011-03-14 | 17 | NA | 100 | 0.70000 | 0.70000 |
| 1 | 2011-03-15 | 14 | NA | 101 | 0.40000 | 0.40000 |
| 1 | 2011-03-16 | 17 | NA | 107 | 0.70000 | 0.70000 |
| 1 | 2011-03-17 | 16 | NA | 104 | 0.60000 | 0.60000 |
| 1 | 2011-03-18 | 15 | NA | 104 | NA | NA |
| 1 | 2011-03-19 | 16 | NA | 102 | 0.06667 | 0.11765 |
| 1 | 2011-03-20 | 17 | NA | 107 | 0.13333 | 0.10902 |
| 1 | 2011-03-21 | 18 | NA | 104 | 0.20000 | 0.17682 |
| 1 | 2011-03-22 | 11 | NA | 105 | -0.26667 | -0.07924 |
| 1 | 2011-03-23 | 15 | NA | 100 | 0.00000 | 0.07966 |
| 1 | 2011-03-24 | 12 | 2011-03-24 | 110 | -0.20000 | -0.07331 |
| 1 | 2011-03-25 | 13 | NA | 110 | -0.13333 | -0.09852 |
| 1 | 2011-03-26 | 15 | NA | 107 | 0.00000 | 0.02282 |
| 2 | 2011-03-12 | 48 | NA | 300 | NA | NA |
| 2 | 2011-03-13 | 49 | NA | 300 | NA | NA |
| 2 | 2011-03-14 | 50 | NA | 290 | NA | NA |
| 2 | 2011-03-15 | 57 | NA | 296 | 0.14000 | 0.059487331 |
| 2 | 2011-03-16 | 60 | NA | 297 | 0.20000 | 0.147029703 |
| 2 | 2011-03-17 | 49 | NA | 296 | -0.02000 | -0.030094118 |
| 2 | 2011-03-18 | 64 | NA | 299 | 0.28000 | 0.177381404 |
| 2 | 2011-03-19 | 63 | NA | 292 | 0.26000 | 0.177461929 |
| 2 | 2011-03-20 | 67 | 2011-03-20 | 290 | 0.34000 | 0.24836272 |
| 2 | 2011-03-21 | 70 | NA | 299 | 0.40000 | 0.311954459 |
| 2 | 2011-03-22 | 58 | NA | 295 | 0.16000 | 0.025352941 |
| 2 | 2011-03-23 | 65 | NA | 290 | 0.30000 | 0.192911011 |
| 2 | 2011-03-24 | 57 | NA | 296 | 0.14000 | 0.022381918 |
| 2 | 2011-03-25 | 55 | NA | 299 | 0.10000 | 0.009823098 |
| 2 | 2011-03-26 | 57 | NA | 299 | NA | NA |
| 2 | 2011-03-27 | 60 | NA | 300 | NA | NA |
| 3 | 2011-03-18 | 5 | NA | 54 | NA | NA |
| 3 | 2011-03-19 | 10 | NA | 50 | NA | NA |
| 3 | 2011-03-20 | 7 | NA | 53 | NA | NA |
| 3 | 2011-03-21 | 8 | NA | 53 | NA | NA |
| 3 | 2011-03-22 | 7 | NA | 50 | NA | NA |
| 3 | 2011-03-23 | 8 | NA | 51 | 0.14286 | 0.178343199 |
| 3 | 2011-03-24 | 7 | NA | 52 | 0.00000 | 0.010691161 |
| 3 | 2011-03-25 | 6 | NA | 55 | -0.14286 | -0.007160905 |
| 3 | 2011-03-26 | 9 | NA | 54 | 0.28571 | 0.106918456 |
| 3 | 2011-03-27 | 9 | NA | 55 | 0.28571 | 0.073405953 |
| 3 | 2011-03-28 | 9 | 2011-03-28 | 50 | 0.28571 | 0.285714286 |
| 3 | 2011-03-29 | 6 | NA | 52 | -0.14286 | -0.142857143 |
| 3 | 2011-03-30 | 6 | NA | 53 | -0.14286 | -0.142857143 |
| 3 | 2011-03-31 | 4 | NA | 50 | -0.42857 | -0.428571429 |
| 3 | 2011-04-01 | 5 | NA | 50 | -0.28571 | -0.285714286 |
| 3 | 2011-04-02 | 8 | NA | 55 | 0.00000 | 0.142857143 |
| 3 | 2011-04-03 | 9 | NA | 55 | NA | NA |
+----+------------+-------+------------+------------+----------------------+--------------------------+
I tried the following code, with the help of the previous question, but I am having a hard time figuring out how to calculate the weighted Buy and Hold return, which starts from a different event date for each ID.
#choose rows with no NA in event date and only show ID and event date
events = unique(df[!is.na(EventDate),.(ID,EventDate)])
#helper column
#:= is defined for use in j only. It adds or updates or removes column(s) by reference.
#It makes no copies of any part of memory at all.
events[, eDate:=EventDate]
#makes new column(temporary) lower and upper boundary
df[, `:=`(s=Date-6, e=Date+6)]
#non-equi match
bhr = events[df, on=.(ID, EventDate>=s, EventDate<=e), nomatch=0]
#Generate the BuyHoldReturn column, by ID and EventDate
bhr2 = bhr[, .(Date, BuyHoldReturnM1=c(NA, (Price[-1]/Price[1] -1)*MarketCap[-1])), by = .(Date)]
#merge back to get the full data
bhr3 = bhr2[df,on=.(ID,Date),.(ID,Date,Price,EventDate=i.EventDate,BuyHoldReturn)]
I would be grateful if you could help.
Thank you very much in advance!
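
For what it's worth, here is one way the whole calculation could be wired up in data.table. It is only a sketch, not tested against the full data: it assumes Date and EventDate are stored as Date class, that the market cap column is called MarketCap (as in the code above) rather than Market Cap, and that the event windows of a given ID do not overlap.

library(data.table)
setDT(df)

# one row per event: the ID it belongs to, its date, and its base date (event date - 6)
events <- unique(df[!is.na(EventDate), .(evID = ID, EventDate)])
events[, Base := EventDate - 6]

# every ID can in principle contribute to every event window ...
all_ids <- unique(df$ID)
pairs <- events[, .(ID = all_ids), by = .(evID, EventDate, Base)]

# ... but only if it has a price on the event's base date (this is what excludes
# IDs 2 and 3 from ID 1's first window, as described above)
base_prices <- df[, .(ID, Base = Date, BasePrice = Price)]
pairs <- base_prices[pairs, on = .(ID, Base), nomatch = 0]

# expand each contributing ID over the -5 .. +5 day window of the event
win <- pairs[, .(Date = EventDate + (-5:5)), by = .(evID, EventDate, ID, BasePrice)]

# attach each contributor's price and market cap on the window day; window days with
# no observation for that ID simply drop out of the average
win <- df[, .(ID, Date, Price, MarketCap)][win, on = .(ID, Date), nomatch = 0]

# market-cap weighted average of (Price(t) / Price(base) - 1) across contributors
w <- win[, .(BuyAndHoldWeightedMarket =
               sum((Price / BasePrice - 1) * MarketCap) / sum(MarketCap)),
         by = .(evID, Date)]

# write the result back onto the rows of the ID that owns the event
df[w, on = .(ID = evID, Date), BuyAndHoldWeightedMarket := i.BuyAndHoldWeightedMarket]

Spot-checking by hand against the desired output above, this logic reproduces, for example, 0.11765 for ID 1 on 2011-03-19 and 0.07341 for ID 3 on 2011-03-27 (where ID 1 has no observation on that day and drops out of the average).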

R add rows to dataframe from other dataframe based on column value

For my thesis, I am trying to use several variables from two types of surveys (the British Election Studies (BES) and the British Social Attitudes Survey (BSA)) and combine them into one dataset.
Currently, I have two datasets, one with BES data, which looks like this (in simplified version):
| year | class | education | gender | age |
| ---- | ----- | --------- | ------ | --- |
| 1992 | working | A-levels | female | 32 |
| 1992 | middle | GCSE | male | 49 |
| 1997 | lower | Undergrad | female | 24 |
| 1997 | middle | GCSE | male | 29 |
The BSA data looks like this (again, simplified):
| year | class | education | gender | age |
| ---- | ----- | --------- | ------ | --- |
| 1992 | middle | A-levels | male | 22 |
| 1993 | working | GCSE | female | 45 |
| 1994 | upper | Postgrad | female | 38 |
| 1994 | middle | GCSE | male | 59 |
Basically, what I am trying to do is combine the two into one dataframe that looks like this:
| year | class | education | gender | age |
| ---- | ----- | --------- | ------ | --- |
| 1992 | working | A-levels | female | 32 |
| 1992 | middle | GCSE | male | 49 |
| 1992 | middle | A-levels | male | 22 |
| 1993 | working | GCSE | female | 45 |
| 1994 | upper | Postgrad | female | 38 |
| 1994 | middle | GCSE | male | 59 |
| 1997 | lower | Undergrad | female | 24 |
| 1997 | middle | GCSE | male | 29 |
I have googled a lot about joins and merging, but I can't figure it out in a way that works correctly. From what I understand, I believe I should join "by" the year variable, but is that correct? And how can I prevent the computation from taking up a lot of memory (the actual datasets are about 30k rows for the BES and 130k rows for the BSA)? Is there a solution using either dplyr or data.table in R?
Any help is much appreciated!!!
This is not a "merge" (or join) operation; it's just row concatenation. In R, that's done with rbind (which has methods for both matrix and data.frame). (For perspective, there's also cbind, which concatenates by columns; not applicable here.)
base R
rbind(BES, BSA)
# year class education gender age
# 1 1992 working A-levels female 32
# 2 1992 middle GCSE male 49
# 3 1997 lower Undergrad female 24
# 4 1997 middle GCSE male 29
# 5 1992 middle A-levels male 22
# 6 1993 working GCSE female 45
# 7 1994 upper Postgrad female 38
# 8 1994 middle GCSE male 59
other dialects
dplyr::bind_rows(BES, BSA)
data.table::rbindlist(list(BES, BSA))
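
Two small follow-ups. The desired output above is ordered by year, while rbind() simply stacks the BSA rows under the BES rows, so sort after binding if that ordering matters. As for memory: 30k + 130k rows of a handful of columns is not a problem for any of these approaches, and if the two surveys ever carry slightly different column sets, bind_rows() and rbindlist(fill = TRUE) fill the missing columns with NA instead of erroring. A sketch:

library(dplyr)
library(data.table)

# stack, then restore the year ordering shown in the desired output
combined <- bind_rows(BES, BSA) %>% arrange(year)

# data.table equivalent; fill = TRUE tolerates columns present in only one survey
combined_dt <- rbindlist(list(BES, BSA), use.names = TRUE, fill = TRUE)[order(year)]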

MariaDB: select average by hour and another column

I have a table in a MariaDB 10.3.27 database that looks like this:
+----+------------+---------------+-----------------+
| id | channel_id | timestamp | value |
+----+------------+---------------+-----------------+
| 1 | 2 | 1623669600000 | 2882.4449252449 |
| 2 | 1 | 1623669600000 | 295.46914369742 |
| 3 | 2 | 1623669630000 | 2874.46365243 |
| 4 | 1 | 1623669630000 | 295.68124546516 |
| 5 | 2 | 1623669660000 | 2874.9638893452 |
| 6 | 1 | 1623669660000 | 295.69561247521 |
| 7 | 2 | 1623669690000 | 2878.7120274678 |
and I want to have a result like this:
+------+-------+-------+
| hour | valhh | valwp |
+------+-------+-------+
| 0 | 419 | 115 |
| 1 | 419 | 115 |
| 2 | 419 | 115 |
| 3 | 419 | 115 |
| 4 | 419 | 115 |
| 5 | 419 | 115 |
| 6 | 419 | 115 |
| 7 | 419 | 115 |
| 8 | 419 | 115 |
| 9 | 419 | 115 |
| 10 | 419 | 115 |
| 11 | 419 | 115 |
| 12 | 419 | 115 |
| 13 | 419 | 115 |
| 14 | 419 | 115 |
| 15 | 419 | 115 |
| 16 | 419 | 115 |
| 17 | 419 | 115 |
| 18 | 419 | 115 |
| 19 | 419 | 115 |
| 20 | 419 | 115 |
| 21 | 419 | 115 |
| 22 | 419 | 115 |
| 23 | 419 | 115 |
+------+-------+-------+
but with valhh (valwp) being the average of value for each hour of the day across all days where channel_id is 1 (respectively 2), not the overall average. So far, I've tried:
select h.hour, hh.valhh, wp.valwp from
  (select hour(from_unixtime(timestamp/1000)) as hour from data) h,
  (select hour(from_unixtime(timestamp/1000)) as hour,
          cast(avg(value) as integer) as valhh
   from data where channel_id = 1) hh,
  (select hour(from_unixtime(timestamp/1000)) as hour,
          cast(avg(value) as integer) as valwp
   from data where channel_id = 2) wp
group by h.hour;
which gives the result above (the overall average, repeated for every hour).
I can get what I want by querying the channels separately, i.e.:
select hour(from_unixtime(timestamp/1000)) as hour, cast(avg(value) as integer) as value from data where channel_id = 1 group by hour;
gives
+------+-------+
| hour | value |
+------+-------+
| 0 | 326 |
| 1 | 145 |
| 2 | 411 |
| 3 | 142 |
| 4 | 143 |
| 5 | 171 |
| 6 | 160 |
| 7 | 487 |
| 8 | 408 |
| 9 | 186 |
| 10 | 214 |
| 11 | 199 |
| 12 | 942 |
| 13 | 521 |
| 14 | 196 |
| 15 | 247 |
| 16 | 364 |
| 17 | 252 |
| 18 | 392 |
| 19 | 916 |
| 20 | 1024 |
| 21 | 1524 |
| 22 | 561 |
| 23 | 249 |
+------+-------+
but I want to have both channels in one result set as separate columns.
How would I do that?
Thanks!
After a steep learning curve I think I figured it out:
select
hh.hour, hh.valuehh, wp.valuewp
from
(select
hour(from_unixtime(timestamp/1000)) as hour,
cast(avg(value) as integer) as valuehh
from data
where channel_id=1
group by hour) hh
inner join
(select
hour(from_unixtime(timestamp/1000)) as hour,
cast(avg(value) as integer) as valuewp
from data
where channel_id=2
group by hour) wp
on hh.hour = wp.hour;
gives
+------+---------+---------+
| hour | valuehh | valuewp |
+------+---------+---------+
| 0 | 300 | 38 |
| 1 | 162 | 275 |
| 2 | 338 | 668 |
| 3 | 166 | 38 |
| 4 | 152 | 38 |
| 5 | 176 | 37 |
| 6 | 174 | 38 |
| 7 | 488 | 36 |
| 8 | 553 | 37 |
| 9 | 198 | 36 |
| 10 | 214 | 38 |
| 11 | 199 | 612 |
| 12 | 942 | 40 |
| 13 | 521 | 99 |
| 14 | 187 | 38 |
| 15 | 209 | 38 |
| 16 | 287 | 39 |
| 17 | 667 | 37 |
| 18 | 615 | 39 |
| 19 | 854 | 199 |
| 20 | 1074 | 44 |
| 21 | 1470 | 178 |
| 22 | 665 | 37 |
| 23 | 235 | 38 |
+------+---------+---------+

left_join with individual lag to new column

I need to merge two data frames, probably with left_join(), offset the joined observation by a specific amount, and add it as a new column. The purpose is the preparation of a time-series analysis, hence the different shifts in calendar weeks. I would like to stay in the tidyverse.
I read a few posts that nest left_join() and lag(), but that's beyond my current capability.
MWE
library(tidyverse)
set.seed(1234)
df1 <- data.frame(
  Week1 = sample(paste("2015", 20:40, sep = "."), 10, replace = FALSE),
  Qty = as.numeric(sample(1:10)))
df2 <- data.frame(
  Week2 = paste0("2015.", 1:52),
  Value = as.numeric(sample(1:52)))
df1 %>%
  left_join(df2, by = c("Week1" = "Week2")) %>%
  rename(Lag_0 = Value)
Current output
+----+---------+-------+-------+
| | Week1 | Qty | Lag_0 |
+====+=========+=======+=======+
| 1 | 2015.35 | 6.00 | 50.00 |
+----+---------+-------+-------+
| 2 | 2015.24 | 10.00 | 26.00 |
+----+---------+-------+-------+
| 3 | 2015.31 | 7.00 | 43.00 |
+----+---------+-------+-------+
| 4 | 2015.34 | 9.00 | 42.00 |
+----+---------+-------+-------+
| 5 | 2015.28 | 4.00 | 10.00 |
+----+---------+-------+-------+
| 6 | 2015.39 | 8.00 | 24.00 |
+----+---------+-------+-------+
| 7 | 2015.25 | 5.00 | 33.00 |
+----+---------+-------+-------+
| 8 | 2015.23 | 1.00 | 39.00 |
+----+---------+-------+-------+
| 9 | 2015.21 | 2.00 | 17.00 |
+----+---------+-------+-------+
| 10 | 2015.26 | 3.00 | 27.00 |
+----+---------+-------+-------+
It might be worth pointing out that the target data frame does not hold the same number of week observations as the joining data frame.
Desired output
+----+---------+-------+-------+-------+-------+--------+
| | Week1 | Qty | Lag_0 | Lag_3 | Lag_6 | Lag_12 |
+====+=========+=======+=======+=======+=======+========+
| 1 | 2015.35 | 6.00 | 50.00 | 9.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 2 | 2015.24 | 10.00 | 26.00 | 17.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 3 | 2015.31 | 7.00 | 43.00 | 10.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 4 | 2015.34 | 9.00 | 42.00 | 43.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 5 | 2015.28 | 4.00 | 10.00 | 33.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 6 | 2015.39 | 8.00 | 24.00 | 13.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 7 | 2015.25 | 5.00 | 33.00 | 25.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 8 | 2015.23 | 1.00 | 39.00 | 38.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 9 | 2015.21 | 2.00 | 17.00 | 6.00 | | |
+----+---------+-------+-------+-------+-------+--------+
| 10 | 2015.26 | 3.00 | 27.00 | 39.00 | | |
+----+---------+-------+-------+-------+-------+--------+
Column Lag_3, which I added manually, contains the value from the matching df2 week, but offset by three rows. Lag_6 would be offset by six rows, etc.
I suppose the challenge is that the lag() would have to happen in the joining table after the matching but before the value is returned.
Hope this makes sense and thanks for the assistance.
We just need to create the lag columns in the second dataset first and then do the join:
library(dplyr)
df2 %>%
  mutate(Lag_3 = lag(Value, 3), Lag_6 = lag(Value, 6)) %>%
  left_join(df1, ., by = c("Week1" = "Week2")) %>%
  rename(Lag_0 = Value)
-output
# Week1 Qty Lag_0 Lag_3 Lag_6
#1 2015.35 6 50 9 46
#2 2015.24 10 26 17 6
#3 2015.31 7 43 10 33
#4 2015.34 9 42 43 10
#5 2015.28 4 10 33 25
#6 2015.39 8 24 13 16
#7 2015.25 5 33 25 49
#8 2015.23 1 39 38 15
#9 2015.21 2 17 6 32
#10 2015.26 3 27 39 38
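
If more offsets are needed (the desired output above also has a Lag_12 column), the same idea scales with across(), which builds all the lag columns in one call before the join. A sketch assuming dplyr >= 1.0:

library(dplyr)

df2 %>%
  mutate(across(Value,
                list(`3` = ~ lag(.x, 3), `6` = ~ lag(.x, 6), `12` = ~ lag(.x, 12)),
                .names = "Lag_{.fn}")) %>%
  left_join(df1, ., by = c("Week1" = "Week2")) %>%
  rename(Lag_0 = Value)

As before, the lags are taken over the rows of df2 (weeks 1 to 52 in order), so the offset is by calendar week; the join then reduces the result to the ten weeks present in df1.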

R - Grouped data with DoD change

Say I have a raw dataset, already in a data frame (I can convert it easily to xts with as.xts.data.table). The data frame looks like the following:
| Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature |
| ---- | ---- | ----- | ------- | ------------------- | ------------------- | ---------------------- |
| 2018-02-03 | New York City | NY | US | 18 | 22 | 19 |
| 2018-02-03 | London | LDN | UK | 10 | 25 | 15 |
| 2018-02-03 | Singapore | SG | SG | 28 | 32 | 29 |
| 2018-02-02 | New York City | NY | US | 12 | 30 | 18 |
| 2018-02-02 | London | LDN | UK | 12 | 15 | 14 |
| 2018-02-02 | Singapore | SG | SG | 27 | 31 | 30 |
and so on (many more cities and many more days).
And I would like it to show both the current day's temperatures and the day-over-day change from the previous day, together with the other info on the city (state, country). I.e., the new data frame should be something like this (continuing the example above):
| Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature | ChangeInDailyMin | ChangeInDailyMax | ChangeInDailyMedian |
| ---- | ---- | ----- | ------- | ------------------- | ------------------- | ---------------------- | ---------------- | ---------------- | ------------------- |
| 2018-02-03 | New York City | NY | US | 18 | 22 | 19 | 6 | -8 | 1 |
| 2018-02-03 | London | LDN | UK | 10 | 25 | 15 | -2 | 10 | 1 |
| 2018-02-03 | Singapore | SG | SG | 28 | 32 | 29 | 1 | 1 | -1 |
| 2018-02-03 | New York City | NY | US | ... |
and so on, i.e., add three more columns to show the day-over-day change.
Note that the data frame may not have data for every day; the change is defined as the difference between the temperature on day t and the temperature on the most recent earlier date for which I have temperature data.
I tried to use the shift function but R was complaining about the := sign.
Is there any way in R I could get this to work?
Thanks!
You can use dplyr::mutate_at and the lubridate package to transform the data into the desired format. The data needs to be arranged by Date, and the difference between the current record and the previous one can be taken with dplyr::lag.
library(dplyr)
library(lubridate)
df %>% mutate_if(is.character, funs(trimws)) %>% #Trim any blank spaces
mutate(Date = ymd(Date)) %>% #Convert to Date/Time
group_by(City, State, Country) %>%
arrange(City, State, Country, Date) %>% #Order data date
mutate_at(vars(starts_with("Daily")), funs(Change = . - lag(.))) %>%
filter(!is.na(DailyMinTemperature_Change))
Result:
# # A tibble: 3 x 10
# # Groups: City, State, Country [3]
# Date City State Country DailyMinTemperature DailyMaxTemperature DailyMedianTemperature DailyMinTemperature_Change DailyMaxT~ DailyMed~
# <date> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <int>
# 1 2018-02-03 London LDN UK 10.0 25.0 15 -2.00 10.0 1
# 2 2018-02-03 New York City NY US 18.0 22.0 19 6.00 - 8.00 1
# 3 2018-02-03 Singapore SG SG 28.0 32.0 29 1.00 1.00 -1
#
Data:
df <- read.table(text =
"Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature
2018-02-03 | New York City | NY | US | 18 | 22 | 19
2018-02-03 | London | LDN |UK | 10 | 25 | 15
2018-02-03 | Singapore | SG | SG | 28 | 32 | 29
2018-02-02 | New York City | NY | US | 12 | 30 | 18
2018-02-02 | London | LDN | UK | 12 | 15 | 14
2018-02-02 | Singapore | SG | SG | 27 | 31 | 30",
header = TRUE, stringsAsFactors = FALSE, sep = "|")
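
A side note for anyone reading this with a current version of dplyr: funs() is deprecated and the scoped verbs mutate_if()/mutate_at() are superseded by across(). A sketch of the same logic in that style, under the same assumptions about the data as above:

library(dplyr)
library(lubridate)

df %>%
  mutate(across(where(is.character), trimws),   # trim the blanks left by sep = "|"
         Date = ymd(Date)) %>%                  # convert to Date
  group_by(City, State, Country) %>%
  arrange(Date, .by_group = TRUE) %>%           # order each city's rows by date
  mutate(across(starts_with("Daily"),
                ~ .x - lag(.x),                 # change vs the previous available date
                .names = "{.col}_Change")) %>%
  filter(!is.na(DailyMinTemperature_Change)) %>%
  ungroup()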
