I have the following problem. I have two data frames. The second one contains conditions that define how a new column in the first data frame should be calculated. See the example below:
First df:
df1 <- data.frame(country = c("01", "01", "02", "03", "03", "03" , "04", "05"),
date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7)
)
Second df:
df2 <- data.frame( country = c("01", "02", "03", "04", "05" ),
match_country1 = c("02", "03", "01", "01", "01"),
match_country2 = c("03", "04", "02", "02", "03"),
match_country3 = c("05", "05", "04", "03", "04")
)
Now I need to compute a new_value that is the average of the three values defined in df2, while respecting the date in df1. For example, new_value for country "01" and date "2020-01-01" is the average of the values from country "02", country "03" and country "05", all from date "2020-01-01".
Desired output is below:
new_df <- data.frame(country = c("01", "01", "02", "03", "03", "03" , "04", "05"),
date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7),
new_value = c(NA, #because no data for 02, 03, 05 from 2020-01-01
(2-3+15)/3,
(-3+15)/2, #because no data for 04 from 2020-01-02
(3+2)/2, #because no data for 04 from 2020-01-02
NA, #because no data for 01, 02, 04 from 2020-01-03
NA, #because no data for 01, 02, 04 from 2020-01-04
4, #because no data for 02, 03 from 2020-01-01
(3-3)/2 #because no data for 04 from 2020-01-02
)
)
How can I do this, please?
This can be done with a triple SQL join. For each row in df1 (aliased a), get the matching country row in df2 (aliased c) via a left join, then get all rows in a second instance of df1 (aliased b) that have the same date and whose country is one of the match columns in c. Finally, take the average of the b values over the matching rows.
library(sqldf)
sqldf("select a.*, avg(b.value) new_value
from df1 a
left join df2 c on a.country = c.country
left join df1 b on a.date = b.date and
b.country in (c.match_country1, c.match_country2, c.match_country3)
group by a.rowid")
giving this data frame:
country date value blabla new_value
1 01 2020-01-01 4.0 23 NA
2 01 2020-01-02 3.0 41 4.666667
3 02 2020-01-02 2.0 32 6.000000
4 03 2020-01-02 -3.0 8 2.500000
5 03 2020-01-03 1.5 50 NA
6 03 2020-01-04 12.0 27 NA
7 04 2020-01-01 10.0 8 4.000000
8 05 2020-01-02 15.0 7 0.000000
Variations
Here are two variations. The first generates the in (...) column list as the string matches and substitutes it into the query; the second first converts df2 to long form, long, and joins on that.
matches <- toString(names(df2)[-1])
fn$sqldf("select a.*, avg(b.value) new_value
from df1 a
left join df2 c on a.country = c.country
left join df1 b on a.date = b.date and b.country in ($matches)
group by a.rowid")
varying <- list(match_country = names(df2)[-1])
long <- reshape(df2, dir = "long", varying = varying, v.names = names(varying))
sqldf("select a.*, avg(b.value) new_value
from df1 a
left join long c on a.country = c.country
left join df1 b on a.date = b.date and b.country = c.match_country
group by a.rowid")
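For readers who prefer the tidyverse, a roughly equivalent long-form reshape of df2 could look like the sketch below (my addition, not part of the original answer); the result plays the same role as long above.
library(tidyr)
# put each match country in its own row, keyed by the original country
long2 <- pivot_longer(df2, cols = starts_with("match_country"),
                      values_to = "match_country")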
This tidyverse approach may help
df1
#> country date value blabla
#> 1 01 2020-01-01 4.0 23
#> 2 01 2020-01-02 3.0 41
#> 3 02 2020-01-02 2.0 32
#> 4 03 2020-01-02 -3.0 8
#> 5 03 2020-01-03 1.5 50
#> 6 03 2020-01-04 12.0 27
#> 7 04 2020-01-01 10.0 8
#> 8 05 2020-01-02 15.0 7
df2
#> country match_country1 match_country2 match_country3
#> 1 01 02 03 05
#> 2 02 03 04 05
#> 3 03 01 02 04
#> 4 04 01 02 03
#> 5 05 01 03 04
suppressMessages(library(tidyverse))
df1 %>%
left_join(df2, by = 'country') %>%
nest(data = !date) %>%
mutate(data = map(data, ~.x %>%
mutate(across(contains('match'), ~value[match(., country)])) %>%
rowwise() %>%
mutate(avg = mean(c_across(contains('match')), na.rm = T)) %>%
select(!contains('match'))
)
) %>%
unnest(data)
#> # A tibble: 8 x 5
#> date country value blabla avg
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2020-01-01 01 4 23 NaN
#> 2 2020-01-01 04 10 8 4
#> 3 2020-01-02 01 3 41 4.67
#> 4 2020-01-02 02 2 32 6
#> 5 2020-01-02 03 -3 8 2.5
#> 6 2020-01-02 05 15 7 0
#> 7 2020-01-03 03 1.5 50 NaN
#> 8 2020-01-04 03 12 27 NaN
Created on 2021-05-02 by the reprex package (v2.0.0)
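Note that nesting by date reorders the rows relative to df1. If the original row order matters, a final arrange() restores it; result below is a hypothetical name for the tibble returned by the pipeline above, not an object defined in the answer.
# assuming the pipeline's output was assigned to `result`
result %>% arrange(country, date)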
Though there already is an accepted answer, here is a base R solution, since the two answers posted so far require external packages.
# for each date, compute the average of the matched countries' values among that date's rows
df1$new_value <- with(df1, ave(seq_len(nrow(df1)), date, FUN = function(i){
mrg <- merge(df1[i, ], df2)
j <- grep("^match", names(mrg))
ctry <- unique(df1[i, "country"])
apply(mrg[j], 1, function(row){
k <- match(row, ctry)
if(any(!is.na(k)))
mean(mrg[k, "value"], na.rm = TRUE)
else NA_real_
})
}))
identical(df1$new_value, new_df$new_value)
#[1] TRUE
Related
I have the following problem.
I computed the average temperature per country and also the difference between the actual daily temperature and that average. See the code below:
df1 <- data.frame(country = c("01", "01", "01","01", "01", "02", "02" , "03", "03","03"),
date = c("2020-01-01", "2020-01-02", "2020-01-03" , "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02", "2020-01-03", "2020-01-04"),
temperature = c(4, 3, -2, 0.1, -3, 1.5, 12, 10, 7, 5),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12)
)
library(dplyr)
df2 <- df1 %>%
group_by(country) %>%
mutate(mean_per_country = mean(temperature))
df2$difference <- df2$temperature - df2$mean_per_country
Now I need to create a new column that checks whether (an unlimited number of) consecutive days in the same country have a negative or positive difference between the actual daily temperature and the average temperature. Is there an elegant way to do this in R?
Desired output is here:
desired_df <- data.frame(country = c("01", "01", "01","01", "01", "02", "02" , "03", "03","03"),
date = c("2020-01-01", "2020-01-02", "2020-01-03" , "2020-01-05", "2020-01-07", "2020-01-01", "2020-01-03", "2020-01-02", "2020-01-03", "2020-01-04"),
temperature = c(4, 3, -2, 0.1, -3, 1.5, 12, 10, 7, 5),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7, 6, 12),
mean_per_country = c(0.42, 0.42, 0.42, 0.42, 0.42, 6.75, 6.75, 7.33, 7.33, 7.33),
difference = c(3.58, 2.58, -2.42 , -0.32, -3.42 , -5.25, 5.25, 2.67, -0.333, -2.33),
new_column = c("hot",
"hot",
"", #day interrupted, therefor not "cold"
"", #day interrupted, therefor not "cold"
"", #day interrupted, therefor not "cold"
"",
"",
"",
"cold",
"cold")
)
Thank you very much
Here's an approach with dplyr:
library(dplyr)
df2 %>%
group_by(country) %>%
mutate(date = as.Date(date),
consecutive = date - lag(date) == 1,
result = (sign(difference) == sign(lead(difference)) & lead(consecutive) |
(sign(difference) == sign(lag(difference)) & consecutive)),
new_column = c("cold",NA_character_,"hot")[result * sign(difference) + 2])
# A tibble: 10 x 9
# Groups: country [3]
country date temperature blabla mean_per_country difference consecutive result new_column
<chr> <date> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <chr>
1 01 2020-01-01 4 23 0.42 3.58 NA TRUE hot
2 01 2020-01-02 3 41 0.42 2.58 TRUE TRUE hot
3 01 2020-01-03 -2 32 0.42 -2.42 TRUE FALSE NA
4 01 2020-01-05 0.1 8 0.42 -0.32 FALSE FALSE NA
5 01 2020-01-07 -3 50 0.42 -3.42 FALSE NA NA
6 02 2020-01-01 1.5 27 6.75 -5.25 NA NA NA
7 02 2020-01-03 12 8 6.75 5.25 FALSE NA NA
8 03 2020-01-02 10 7 7.33 2.67 NA NA NA
9 03 2020-01-03 7 6 7.33 -0.333 TRUE TRUE cold
10 03 2020-01-04 5 12 7.33 -2.33 TRUE TRUE cold
To get rid of the intermediate columns that I left there for illustration purposes, just use select(-(consecutive:result)).
You need to convert the dates to Date class so that you can calculate the differences between consecutive dates. Then group by country and use ifelse() to set the values where the difference is 1:
require(plyr)
require(dplyr)
df2$date = as.Date(df2$date)
diffs <- c(0,diff(df2$date))
df2 %>% group_by(country) %>%
plyr::mutate(new_column = ifelse((difference > 0) & (diffs == 1), "hot", ifelse((difference < 0) & (diffs == 1), "cold", " ")))
> df2
country date temperature blabla mean_per_country difference new_column
1 01 2020-01-01 4.0 23 0.420000 3.5800000
2 01 2020-01-02 3.0 41 0.420000 2.5800000 hot
3 01 2020-01-03 -2.0 32 0.420000 -2.4200000 cold
4 01 2020-01-05 0.1 8 0.420000 -0.3200000
5 01 2020-01-07 -3.0 50 0.420000 -3.4200000
6 02 2020-01-01 1.5 27 6.750000 -5.2500000
7 02 2020-01-03 12.0 8 6.750000 5.2500000
8 03 2020-01-02 10.0 7 7.333333 2.6666667
9 03 2020-01-03 7.0 6 7.333333 -0.3333333 cold
10 03 2020-01-04 5.0 12 7.333333 -2.3333333 cold
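One caveat from me: diffs is computed on the ungrouped data frame and plyr::mutate() does not honour dplyr's group_by(), so the day gaps can leak across countries. A group-aware sketch of the same rule using only dplyr (my variation, not the answer above) would be:
library(dplyr)
df2 %>%
  group_by(country) %>%
  mutate(day_gap = c(0, diff(as.Date(date))),  # gap to the previous row within each country
         new_column = case_when(difference > 0 & day_gap == 1 ~ "hot",
                                difference < 0 & day_gap == 1 ~ "cold",
                                TRUE ~ " ")) %>%
  ungroup()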
Below is a sample data set
area periodyear period employment date
01 2020 08 100 2020-08-01
01 2020 09 105 2020-09-01
01 2020 10 110 2020-10-01
02 2020 08 101 2020-08-01
02 2020 09 102 2020-09-01
02 2020 10 103 2020-10-01
The question is how I get R to return the last TWO rows for each area. I created the date column with the following code so that there is a single value (instead of periodyear and period) from which a maximum can be found.
substate$date<- ymd(paste(substate$PERIODYEAR,substate$PERIOD,"1",sep="-"))
I know how to find the max value of a column (date, in this instance), but I am unclear how to create a data frame that looks like the one below:
area periodyear period employment date
01 2020 09 105 2020-09-01
01 2020 10 110 2020-10-01
02 2020 09 102 2020-09-01
02 2020 10 103 2020-10-01
The reason for wanting the last TWO is that one month is brand new data and the one before is revised. From here, I update a SQL database.
An option is slice_tail after arranging by 'area' and the Date-converted 'date' (in case they are not already in order):
library(dplyr)
df1 %>%
arrange(area, as.Date(date)) %>%
group_by(area) %>%
slice_tail(n = 2) %>%
ungroup
-output
# A tibble: 4 x 5
# area periodyear period employment date
# <chr> <int> <int> <int> <chr>
#1 01 2020 9 105 2020-09-01
#2 01 2020 10 110 2020-10-01
#3 02 2020 9 102 2020-09-01
#4 02 2020 10 103 2020-10-01
data
df1 <- structure(list(area = c("01", "01", "01", "02", "02", "02"),
periodyear = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L),
period = c(8L, 9L, 10L, 8L, 9L, 10L), employment = c(100L,
105L, 110L, 101L, 102L, 103L), date = c("2020-08-01", "2020-09-01",
"2020-10-01", "2020-08-01", "2020-09-01", "2020-10-01")),
row.names = c(NA,
-6L), class = "data.frame")
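If the two most recent dates per area are always what is wanted, slice_max() is an alternative to arrange() plus slice_tail(). This is my sketch using the same df1; note that the rows come back newest first.
library(dplyr)
df1 %>%
  group_by(area) %>%
  slice_max(order_by = as.Date(date), n = 2) %>%
  ungroup()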
Maybe this:
library(dplyr)
#Code
df %>% arrange(area, date) %>% group_by(area) %>% filter(row_number() %in% 2:n())
Output:
# A tibble: 4 x 5
# Groups: area [2]
area periodyear period employment date
<int> <int> <int> <int> <date>
1 1 2020 9 105 2020-09-01
2 1 2020 10 110 2020-10-01
3 2 2020 9 102 2020-09-01
4 2 2020 10 103 2020-10-01
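Note (my addition) that row_number() %in% 2:n() keeps every row except the first in each group, which coincides with "the last two" only because each area has exactly three rows here. A sketch that targets the last two rows directly:
library(dplyr)
df %>%
  arrange(area, date) %>%
  group_by(area) %>%
  filter(row_number() > n() - 2) %>%
  ungroup()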
I have a table with multiple rows per person, giving the date that each tax year ends:
df1 <- tibble::tribble(~ID, ~TAX_YEAR_END_DATE,
"01", "2009-04-06",
"01", "2010-04-06",
"01", "2011-04-06",
"02", "2010-04-06",
"02", "2011-04-06",
"02", "2012-04-06")
And another table, with multiple rows per person, giving the start date and end date for periods of work:
df2 <- tibble::tribble(~ID, ~START_DATE, ~END_DATE,
"01", "2007-09-11", "2010-04-06",
"02", "2008-06-06", "2010-04-06",
"02", "2011-09-09", "2014-04-06")
The END_DATE is always on 6th April, and everybody always has a START_DATE and END_DATE - there are no NULLs.
I want to add a new STATUS column to the first table, saying whether or not each person was EMPLOYED or NOT for each year. This is what it would look like for the above example:
ID TAX_YEAR_END_DATE STATUS
01 2009-04-06 EMPLOYED
01 2010-04-06 EMPLOYED
01 2011-04-06 NOT
02 2010-04-06 EMPLOYED
02 2011-04-06 NOT
02 2012-04-06 EMPLOYED
I've figured out that I can join the tables by ID and then apply some rules with mutate() to create a new column: if the TAX_YEAR_END_DATE is between the START_DATE and END_DATE then the STATUS is EMPLOYED, and if it isn't then the STATUS is NOT.
Where I get stuck is with people who have more than one period of employment in the second table. In these cases the rows in the first table get duplicated (or multiplied further) when I carry out the join, and I've not been able to figure out an alternative way of doing this.
I'm using R, would prefer data.table as it's normally quicker, but dplyr might be ok too.
A solution using a join to associate the tables, followed by a summarise:
df1 %>% left_join(df2, by = "ID") %>%
mutate(employed = between(TAX_YEAR_END_DATE, START_DATE, END_DATE)) %>%
group_by(ID, TAX_YEAR_END_DATE) %>%
summarise(employed = any(employed))
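A small caveat from me: the tribbles above store the dates as character strings, so it is safer to convert them to Date before the between() comparison. A sketch under that assumption:
library(dplyr)
df1 %>%
  left_join(df2, by = "ID") %>%
  mutate(across(c(TAX_YEAR_END_DATE, START_DATE, END_DATE), as.Date)) %>%
  group_by(ID, TAX_YEAR_END_DATE) %>%
  summarise(employed = any(between(TAX_YEAR_END_DATE, START_DATE, END_DATE)),
            .groups = "drop")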
An option using a non-equi join in data.table:
# For each DT1 row, flag via a non-equi join (by = .EACHI) whether any DT2
# employment period covers the tax year end date, then map FALSE/TRUE to "NOT"/"EMP".
DT1[, status := c("NOT","EMP")[
  DT2[.SD, on=.(ID, START_DATE<=TAX_YEAR_END_DATE, END_DATE>=TAX_YEAR_END_DATE),
      by=.EACHI, .N>0L]$V1 + 1L
]]
output:
ID TAX_YEAR_END_DATE status
1: 1 2009-04-06 EMP
2: 1 2010-04-06 EMP
3: 1 2011-04-06 NOT
4: 2 2010-04-06 EMP
5: 2 2011-04-06 NOT
6: 2 2012-04-06 EMP
data:
library(data.table)
DT1 <- fread("ID TAX_YEAR_END_DATE
01 2009-04-06
01 2010-04-06
01 2011-04-06
02 2010-04-06
02 2011-04-06
02 2012-04-06")[,
TAX_YEAR_END_DATE := as.IDate(TAX_YEAR_END_DATE)]
cols <- c("START_DATE", "END_DATE")
DT2 <- fread("ID START_DATE END_DATE
01 2007-09-11 2010-04-06
02 2008-06-06 2010-04-06
02 2011-09-09 2014-04-06")[,
(cols) := lapply(.SD, as.IDate), .SDcols=cols]
One dplyr and lubridate solution could be:
df1 %>%
left_join(df2) %>%
group_by(ID, TAX_YEAR_END_DATE) %>%
summarise(STATUS = any(int_overlaps(interval(TAX_YEAR_END_DATE, TAX_YEAR_END_DATE),
interval(START_DATE, END_DATE))))
ID TAX_YEAR_END_DATE STATUS
<int> <chr> <lgl>
1 1 2009-04-06 TRUE
2 1 2010-04-06 TRUE
3 1 2011-04-06 FALSE
4 2 2010-04-06 TRUE
5 2 2011-04-06 FALSE
6 2 2012-04-06 TRUE
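Since each tested interval is a single day, lubridate's %within% operator gives a slightly simpler equivalent. This is my variation and assumes the date columns are already of class Date:
library(dplyr)
library(lubridate)
df1 %>%
  left_join(df2, by = "ID") %>%
  group_by(ID, TAX_YEAR_END_DATE) %>%
  summarise(STATUS = any(TAX_YEAR_END_DATE %within% interval(START_DATE, END_DATE)),
            .groups = "drop")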
# Create a lookup data.frame for the durations in which ID was employed:
# dates_ro => data.frame
dates_ro <- data.frame(do.call("rbind", lapply(split(df2, rownames(df2)), function(x){
data.frame(id = x$ID,
emp_date = seq.Date(x$START_DATE, x$END_DATE, by = "days"))
}
)
),
row.names = NULL)
# Lookup whether or not the person is employed at end date
# STATUS => character vector
df1$STATUS <- ifelse(is.na(
  match(paste(df1$ID, df1$TAX_YEAR_END_DATE),     # match ID and date jointly so the
        paste(dates_ro$id, dates_ro$emp_date))),  # hit must come from the same person
  "UNEMPLOYED", "EMPLOYED")
Data:
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), TAX_YEAR_END_DATE = structure(c(14340,
14705, 15070, 14705, 15070, 15436), class = "Date")),
class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = c(1L, 2L, 2L), START_DATE = structure(c(13767,
14036, 15226), class = "Date"), END_DATE = structure(c(14705,
14705, 16166), class = "Date")), class = "data.frame", row.names = c(NA, -3L))
I have a problem in my dataset with missing values. For some reason, several IDs are missing a value in the column 'Names'. This is strange, because other IDs with the same CODE (there are more codes in my whole dataset, >10K) and the same year (6 possible years) do have a value in that column.
Can somebody help me figure out the code, so that IDs with a missing value in the 'Names' column get the same character value in 'Names' as other IDs with the same code and year that do have a value in that column?
For example, the NA at row 4 should change to 'Hospital', based on the same code and year of another ID (in my original data frame there is an ID with year 2013 and code 01 with the name 'Hospital'; if there were not, it should stay NA).
Side note: it is panel data, so each ID can be in the dataset for multiple years (and rows; each year is one row), and not everybody is present in every year. There are also more variables in my data frame.
> dput(Dataframe[1:7, ])
structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"), CODE = c("01", "01", "01","01", "01", "01", "01"), Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"), Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"), Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25), Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Hospital")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
A tibble: 7 x 8
ID Gender CODE Year Quarter Size Names
<dbl> <dbl> <dttm> <chr> <dbl> <dbl> <dbl> <chr>
1 1 2 01 2018 3 24.5 Hospital
2 2 1 01 2014 4 23.2 Hospital
3 2 1 01 2018 4 24.5 Hospital
4 2 1 01 2013 4 30 NA
5 2 1 01 2013 3 30 Hospital
6 2 1 01 2015 4 19.2 NA
7 2 1 01 2015 3 19.2 Hospital
Selecting and checking individual rows is too much work; I have over 1.1 million rows.
Edit: it is also possible to convert the 'Names' column to 1 if it has a (character) value and 0 if it is NA.
Thank you!
I'm not exactly sure, because in your example all the names are the same, but I think this might do what you are looking for.
I changed the example below so that the last Names value is "Not Hospital".
df <- structure(list(ID = structure(c(1, 2, 2, 2, 2, 2, 2), format.spss = "F9.3"), CODE = c("01", "01", "01","01", "01", "01", "01"), Year = structure(c(2018, 2014, 2018, 2013, 2013, 2015, 2015), format.spss = "F9.3"), Quarter = structure(c(3, 4, 4, 4, 3, 4, 3), format.spss = "F9.3"), Size = c(24.5, 23.25, 24.5, 30, 30, 19.25, 19.25), Names = c("Hospital", "Hospital", "Hospital", NA, "Hospital", NA, "Not Hospital")), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame") )
Original
# A tibble: 7 x 6
ID CODE Year Quarter Size Names
<dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 01 2018 3 24.5 Hospital
2 2 01 2014 4 23.2 Hospital
3 2 01 2018 4 24.5 Hospital
4 2 01 2013 4 30 NA
5 2 01 2013 3 30 Hospital
6 2 01 2015 4 19.2 NA
7 2 01 2015 3 19.2 Not Hospital
Here's the code to update the names.
df %>%
filter(!is.na(Names)) %>%
select(CODE, Year, Names) %>%
group_by_all() %>%
summarise() %>%
right_join(df, by = c("CODE", "Year")) %>%
rename(Names = Names.x) %>%
select(-Names.y)
Output:
# A tibble: 7 x 6
# Groups: CODE, Year [4]
CODE Year Names ID Quarter Size
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 01 2018 Hospital 1 3 24.5
2 01 2014 Hospital 2 4 23.2
3 01 2018 Hospital 2 4 24.5
4 01 2013 Hospital 2 4 30
5 01 2013 Hospital 2 3 30
6 01 2015 Not Hospital 2 4 19.2
7 01 2015 Not Hospital 2 3 19.2
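A slightly more compact variant of the same idea (my addition, not the answer above) builds the lookup with distinct() and avoids the Names.x / Names.y renaming:
library(dplyr)
df %>%
  filter(!is.na(Names)) %>%
  distinct(CODE, Year, Names) %>%
  right_join(df %>% select(-Names), by = c("CODE", "Year"))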
There are several ways to approach this problem, as far as I can see. However, I prefer the following solution.
The first step is to split the data frame into two: one data frame contains only rows without NAs in the Names column, while the other contains only rows with NAs in that column. Then you simply look up each CODE/Year combination from the NA rows in the former data frame and return the name of the corresponding row.
# Your data frame, assigned to df (e.g. the df object defined in the answer above)
# Split df
df.with.nas <- df[ is.na(df$Names) ,]
df.without.nas <- df[ !is.na(df$Names) ,]
# Define a function to separate the lookup logic
get.name <- function(row) {
# row is an atomic vector. Hence we have to use row["<SELECTOR>"]
result <- subset(df.without.nas, CODE == row["CODE"] & Year == row["Year"])
return(result["Names"])
}
# Finally, search and return.
row.axis <- 1
df.with.nas$Names <- apply(df.with.nas, row.axis, get.name)
# Combine the dfs
df <- rbind(
df.with.nas, df.without.nas)
This solution has a shortcoming: what should happen when we find duplicates?
I hope this is useful!
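One way to guard against such duplicates (my addition, assuming it is acceptable to take the first non-NA name per CODE/Year combination) would be to deduplicate the lookup table before searching in it, and have get.name() subset this lookup instead of df.without.nas:
# keep only one candidate row per CODE/Year combination (hypothetical helper)
lookup <- df.without.nas[!duplicated(df.without.nas[c("CODE", "Year")]),
                         c("CODE", "Year", "Names")]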
Here is the situation where I got kinda stuck with R. I have a data table with one row for each day, something like this:
Date = c(as.Date("2015-12-31"), as.Date("2016-01-01"));
Month1 = c("DEC", "JAN");
Year1 = c("15", "16");
Price1 = c(100, 110);
Month2 = c(NA_character_, NA_character_);
Year2 = c(NA_character_, NA_character_);
Price2 = c(NA_integer_, NA_integer_);
Month3 = c(NA_character_, NA_character_);
Year3 = c(NA_character_, NA_character_);
Price3 = c(NA_integer_, NA_integer_);
Month4 = c(NA_character_, NA_character_);
Year4 = c(NA_character_, NA_character_);
Price4 = c(NA_integer_, NA_integer_);
dataSample = data.frame(Date, Month1, Year1, Price1, Month2, Year2, Price2, Month3, Year3, Price3, Month4, Year4, Price4);
Which gives this table:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
2 2016-01-01 JAN 16 110 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
Now I need to fill in the remaining months and their prices for each row. For that I have 2 other data frames:
Date = c(as.Date("2015-12-31"), as.Date("2015-12-31"), as.Date("2015-12-31"), as.Date("2016-01-01"), as.Date("2016-01-01"), as.Date("2016-01-01"));
Month.Start = c("DEC", "JAN", "FEB", "JAN", "FEB", "MAR");
Year.Start = c("15", "16", "16", "16", "16", "16")
Month.End = c("JAN", "FEB", "MAR", "FEB", "MAR", "APR");
Year.End = c("16", "16", "16", "16", "16", "16")
Diff = c(10, 15, -15, 19, -20, -5);
diffsOneMonth = data.frame(Date, Month.Start, Year.Start, Month.End, Year.End, Diff)
Date = c(as.Date("2015-12-31"), as.Date("2016-01-01"));
Month.Start = c("DEC", "MAR");
Year.Start = c("15", "16")
Month.End = c("MAR", "JUN");
Year.End = c("16", "16")
Diff = c(11, 25);
diffsThreeMonth = data.frame(Date, Month.Start, Year.Start, Month.End, Year.End, Diff)
Which gives me these tables:
One month price differences
Date Month.Start Year.Start Month.End Year.End Diff
1 2015-12-31 DEC 15 JAN 16 10
2 2015-12-31 JAN 16 FEB 16 15
3 2015-12-31 FEB 16 MAR 16 -15
4 2016-01-01 JAN 16 FEB 16 19
5 2016-01-01 FEB 16 MAR 16 -20
6 2016-01-01 MAR 16 APR 16 -5
Three month price differences
Date Month.Start Year.Start Month.End Year.End Diff
1 2015-12-31 DEC 15 MAR 16 11
2 2016-01-01 MAR 16 JUN 16 25
Now I must fill the dataSample data frame using the data from the difference tables. I check which start/end months/years are available there, fill those months/years into dataSample, then apply the price difference and set the calculated price in dataSample. So, for example, in dataSample we start with DEC 15; in diffsOneMonth we have the entry DEC 15 - JAN 16 with difference 10, so we add it to the DEC 15 price and get a JAN 16 price of 110:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 <NA> <NA> NA <NA> <NA> NA
2 2016-01-01 JAN 16 110 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
Now it is possible to do the next month, and then the next, etc. If we used diffsOneMonth only, we would get a result like this:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 FEB 16 125 MAR 16 110
2 2016-01-01 JAN 16 110 FEB 16 129 MAR 16 109 APR 16 104
However, there is an additional requirement that I must use the wider month spread to calculate prices whenever possible. So for 2015-12-31 there exists a three-month spread from DEC 15 to MAR 16, which should override the price from the one-month differences. The DEC 15 price is 100 and the DEC 15 - MAR 16 difference is 11, which makes the MAR 16 price not 110 but 111:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 FEB 16 125 MAR 16 111
2 2016-01-01 JAN 16 110 FEB 16 129 MAR 16 109 APR 16 104
So for this sample this would be my final desired output.
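To make the rule concrete, here is the arithmetic for the 2015-12-31 row spelled out as a small sketch (values taken from the example tables above):
# chain of one-month differences starting from the DEC 15 price of 100
price_dec15 <- 100
price_jan16 <- price_dec15 + 10        # DEC 15 -> JAN 16 diff
price_feb16 <- price_jan16 + 15        # JAN 16 -> FEB 16 diff
price_mar16_chain <- price_feb16 - 15  # FEB 16 -> MAR 16 diff, gives 110
# the wider three-month spread overrides the chained MAR 16 value
price_mar16_final <- price_dec15 + 11  # DEC 15 -> MAR 16 diff, gives 111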
The real data is much more complex, with 6- and 12-month differences and 64 months forward for each date. Also, some months can be missing. I tried to do it with a loop, but it was very slow, and I am not sure how to approach such a problem without a loop. I have created a few helper functions to be able to calculate the next year/month:
nextContract = function(currentMonth, currentYear, length = 1,
years = c("10", "11", "12", "13", "14", "15", "16", "17", "18"),
months = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")) {
mIdx <- match(currentMonth, months)+length;
yDiff = ifelse(length(months) < mIdx, mIdx / length(months) - ifelse(mIdx %% length(months) == 0, 1, 0), 0);
return(data.frame(nextMonth(currentMonth, length, months), nextYear(currentYear, length = yDiff)))
}
nextYear = function(currentYear, length = 1, years = c("10", "11", "12", "13", "14", "15", "16", "17", "18")) {
return(years[match(currentYear, years)+length]);
}
nextMonth = function(currentMonth, length = 1, months = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")) {
mIdx <- match(currentMonth, months)+length;
return(months[ifelse(length(months) < mIdx, ifelse(mIdx %% length(months) != 0, mIdx %% length(months), length(months)), mIdx)]);
}
Example of usage could be:
> nextContract("DEC", "15")
nextMonth.currentMonth..length..months. nextYear.currentYear..length...yDiff.
1 JAN 16
or:
> nextContract("DEC", "15", length = 3)
nextMonth.currentMonth..length..months. nextYear.currentYear..length...yDiff.
1 MAR 16
This got to be a pretty long question, but I hope someone will take the time to review it :)
Thanks in advance!
EDIT
With a little improvement on the proposed solution below, I got what I needed:
outrightAndForwardRows <- list("1" = diffsOneMonth, "3" = diffsThreeMonth) %>%
bind_rows(.id = "time_step") %>%
left_join(dataSample %>%
select(Date, Price1, Month1, Year1) ) %>%
mutate(Day.Start = 1) %>%
mutate(Day.End = 1) %>%
mutate(Outright.Day = 1) %>%
unite("Contract.Start", Day.Start, Month.Start, Year.Start) %>%
unite("Contract.End", Day.End, Month.End, Year.End) %>%
unite("Contract.Outright", Outright.Day, Month1, Year1) %>%
mutate(time_step = as.numeric(time_step),
Contract.Start =
Contract.Start %>%
parse_date_time("%d_%b_%y")) %>%
mutate(Contract.End =
Contract.End %>%
parse_date_time("%d_%b_%y")) %>%
mutate(Contract.Outright =
Contract.Outright %>%
parse_date_time("%d_%b_%y")) %>%
group_by(time_step, Date) %>%
arrange(Contract.End) %>%
mutate(Price = cumsum(Diff) + Price1) %>%
group_by(Date, Contract.End) %>%
slice(time_step %>% which.max) %>%
ungroup() %>%
select(-time_step, -Diff, -Contract.Start)
#### add outright and forward months to the same columns
outright <- outrightAndForwardRows %>% select(Date, Price=Price1, Contract=Contract.Outright) %>% unique
forwardMonths <- outrightAndForwardRows %>% select(Date, Contract=Contract.End, Price)
# join and sort rows
joined <- rbind(outright, forwardMonths) %>% arrange(Date, Contract)
# add contract sequence (data.table is needed for this step and the dcast below)
library(data.table)
joined = data.table(joined)
joined = joined[, Contract.seq:=seq(.N), by=Date];
dcast(joined, Date ~ Contract.seq, value.var=c("Price", "Contract"))
Something like this:
library(dplyr)
library(tidyr)
library(lubridate)
list(`1` = diffsOneMonth,
     `3` = diffsThreeMonth) %>%
  # stack both difference tables, tagging each row with its spread length
  bind_rows(.id = "time_step") %>%
  # attach the starting price and starting month for each Date
  left_join(dataSample %>%
              select(Date, Price1, Month1, Year1) ) %>%
  # build a proper start date from the month/year pieces
  mutate(Day.Start = 1) %>%
  unite("Date.Start", Day.Start, Month.Start, Year.Start) %>%
  mutate(time_step = as.numeric(time_step),
         Date.Start =
           Date.Start %>%
           parse_date_time("%d_%b_%y")) %>%
  # chain the differences onto the starting price, per spread length and Date
  group_by(time_step, Date) %>%
  arrange(Date.Start) %>%
  mutate(Price = cumsum(Diff) + Price1) %>%
  # where a start month is covered by several spreads, keep the widest one
  group_by(Date, Date.Start) %>%
  slice(time_step %>% which.max)