a pair-case multiplication of variables/df (R) - r

I've a question about a pair-case multiplication of variables/df in R.
Consider the following problem:
having data in vector (or in dataframe) that have labels and values as follow:
alpha_lab <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
alpha_num <- c(15.28767, 44.38356, 73.47945, 103.56164, 133.64384, 163.72603, 193.80822, 224.38356, 254.46575, 284.54795, 314.63014, 344.71233)
the alpha_num is a product of other calculations (irrelevant), the following values correspond to their labels in alpha_lab (so January = 15.28767, April = 103.56164...).
I also have a dataframe with "case", "month" (as int), "year" and "value":
> df_values
# A tibble: 1,173 x 4
# Groups: case, month
case month year value
<chr> <int> <int> <dbl>
1 A1 1 2009 121.
2 A1 1 2010 177.
3 A1 1 2011 220.
4 A1 1 2012 196.
5 A1 1 2013 161.
6 A1 1 2014 142.
7 A1 2 2009 82.3
8 A1 2 2010 169.
9 A1 2 2011 194.
10 A1 2 2012 169.
# ... with 1,163 more rows
what I am looking for, is a way to compute for each case (20 different) in each month-year a product of
value * alpha_num
where alpha_num is taken only for a calculated month, so for example:
row 1 (A1, January 2009 case): 121 * 15.28767
row 5 (A1, January 2013 case): 161 * 15.28767
row 7 (A1, February 2011 case): 82.3 * 44.38356
and so on for each case in each month in each year...
Is there a way to compute this without adding corresponding alpha_num value to df_values table one-by-one month case?
Thanks!

This should be helpful:
library(dplyr)
# original vectors
alpha_lab <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
alpha_num <- c(15.28767, 44.38356, 73.47945, 103.56164, 133.64384, 163.72603, 193.80822, 224.38356, 254.46575, 284.54795, 314.63014, 344.71233)
# example of your dataframe
df_values = data.frame(case = c("A1", "A1"),
month = c(1, 2),
year = c(2009, 2009),
value = c(121, 82.3), stringsAsFactors = F)
df_values %>% mutate(new_col = value * alpha_num[month])
# case month year value new_col
# 1 A1 1 2009 121.0 1849.808
# 2 A1 2 2009 82.3 3652.767
Note that this works because your alpha_lab vector has the months in the right order. i.e. Jan, Feb, ..., Dec represent the positions 1, 2, ..., 12.

You can also try to work with an lookup table and dplyr::left_join.
library("magrittr")
sampleData <- tibble::tibble(
case = "A1",
month = rep(1:12, each = 6),
year = rep(2009:2014, 12),
value = runif(72, 10, 130)
)
lookup_table <- tibble::tibble(
alpha_lab = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
alpha_id = 1:12,
alpha_num = c(15.28767, 44.38356, 73.47945, 103.56164, 133.64384, 163.72603, 193.80822, 224.38356, 254.46575, 284.54795, 314.63014, 344.71233)
)
result <- dplyr::left_join(sampleData, lookup_table, by = c("month" = "alpha_id")) %>%
dplyr::mutate(new_col = alpha_num * value) %>%
dplyr::select(-alpha_num, -alpha_lab)

Related

How to add an increasing index based on multiple columns in R

I have a data frame that contains the columns "hour", "day","month" and "count".
library(tidyverse)
set.seed(0)
df <- expand_grid(expand_grid(
hour = seq(0:23),
day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun")) %>%
mutate(count = sample(0:100, n(), replace = TRUE))
head(df)
# A tibble: 6 × 4
hour day month count
<int> <chr> <chr> <int>
1 1 Mon Jan 13
2 1 Mon Feb 67
3 1 Mon Mar 38
4 1 Mon Apr 0
5 1 Mon May 33
6 1 Mon Jun 86
I would like to add a new column named "id" that contains an increasing index which can be used to sort the data in chronological order. The solution I found is not particularly concise and requires me to set factor levels before calling arrange(). Is there another way to solve this issue that capitalises on the fact that I am working with (unformatted) dates?
This is my solution with arrange():
df2 <- df %>%
mutate(day = factor(day, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"))) %>%
arrange(month, day, hour) %>%
mutate(id = row_number())
head(df2)
# A tibble: 6 × 5
hour day month count id
<int> <fct> <fct> <int> <int>
1 1 Mon Jan 13 1
2 2 Mon Jan 43 2
3 3 Mon Jan 82 3
4 4 Mon Jan 66 4
5 5 Mon Jan 49 5
6 6 Mon Jan 79 6
Any suggestions are much appreciated. Thank you!

Turning date into integers in R

I have a data frame in R and I have a month column and a day column containing characters like "jan", "feb", or "mar" for months or "mon", "tue" or "wed" for days. I would like to find a way to convert both columns into integers ranging from 1 to 12 for months and 1 to 7 for days. I have tried built-in functions like month.abb but when I try using match with the column for months it just returns a list of NA. Thank you very much for your help !
A general method would be to define a factor with the levels you want, and then turn it into an integer. See reprex underneath.
This would also work for weekdays.
months <- c(
"jan", "feb", "mar", "apr", "may", "jun",
"jul", "aug", "sep", "oct", "nov", "des"
)
x <- sample(months, 10, replace = TRUE)
x
#> [1] "sep" "oct" "mar" "jun" "oct" "mar" "apr" "aug" "jul" "sep"
as.integer(factor(x, levels = months))
#> [1] 9 10 3 6 10 3 4 8 7 9
Use match:
match(c("jan", "feb", "may"), tolower(month.abb))
match(c("mon", "tue", "thur"), c("mon", "tue", "wed", "thur", "fri", "sat", "sun"))

How to order my column names by month after I've used the function with() to pivot rows to columns in R?

I have this dataframe
df = structure(list(month = c("Jun", "Jun", "Jun", "Jun", "Jun", "Jun",
"Jun", "Jun", "Jun", "Jul", "Jul", "Jul", "Jul", "Jul", "Jul",
"Jul", "Jul", "Jul", "Aug", "Aug"), year = c(2020, 2020, 2020,
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020, 2020, 2020, 2020, 2020), name = c("X20700254_SiteCode_9RB",
"X20819833_SiteCode_7RB", "X20874286_SiteCode_5RB", "X20874298_SiteCode_.CB",
"X20874301_SiteCode_.RB", "X20874311_SiteCode_4RB", "X20874315_SiteCode_.HB",
"X20874322_SiteCode_3Fields_Brook_below_289_crossing.", "X20874323_SiteCode_6RB",
"X20700254_SiteCode_9RB", "X20819833_SiteCode_7RB", "X20874286_SiteCode_5RB",
"X20874298_SiteCode_.CB", "X20874301_SiteCode_.RB", "X20874311_SiteCode_4RB",
"X20874315_SiteCode_.HB", "X20874322_SiteCode_3Fields_Brook_below_289_crossing.",
"X20874323_SiteCode_6RB", "X20700254_SiteCode_9RB", "X20819833_SiteCode_7RB"
), avg = c(17.9671617647059, 17.96046875, 18.8039981617647, 12.7146985294118,
19.8493308823529, 20.840299047619, 14.0127959558824, 17.7243290441176,
20.8349797794118, 18.417934811828, 18.9261226478495, 18.6461377688172,
16.4656639784946, 19.8924495967742, 20.5506465053763, 15.236438172043,
17.2303807123656, 20.3705809811828, 17.2725913978495, 18.388158938172
), sd = c(1.01630124136236, 0.31920182877467, 1.44545541517925,
0.233082255964961, 1.96762945427252, 2.22557010024103, 0.824501588774415,
1.52965887463383, 2.21987506441605, 1.86382574377114, 2.00506644767057,
2.32206032778809, 1.90349256072417, 2.86383652545279, 3.02180888784572,
1.63882057453028, 2.20687928427123, 3.17513491088921, 1.76257380093557,
1.5778580871338), max = c(20.234, 19.092, 22.046, 13.173, 24.351,
25.319, 15.473, 20.71, 25.61, 23.484, 26.488, 25.125, 22.525,
28.555, 29.953, 19.948, 23.1, 29.053, 23.581, 20.615), greater20 = c(0.0183823529411765,
0, 0.242647058823529, 0, 0.431985294117647, 0.533333333333333,
0, 0.0845588235294118, 0.566176470588235, 0.177083333333333,
0.269153225806452, 0.237903225806452, 0.0443548387096774, 0.428091397849462,
0.539650537634409, 0, 0.116935483870968, 0.512432795698925, 0.0443548387096774,
0.0782930107526882), greater23 = c(0, 0, 0, 0, 0.0698529411764706,
0.253333333333333, 0, 0, 0.229779411764706, 0.00268817204301075,
0.0352822580645161, 0.0443548387096774, 0, 0.14885752688172,
0.191868279569892, 0, 0.0030241935483871, 0.193212365591398,
0.00235215053763441, 0)), row.names = c(NA, -20L), groups = structure(list(
month = c("Aug", "Jul", "Jun"), year = c(2020, 2020, 2020
), .rows = structure(list(19:20, 10:18, 1:9), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
I've pivoted this table using the following code
df_Greater20_ByMonth = with(df,
tapply(greater20, list(name, month) , I) )
and it has nearly given me what I wanted but the months were changed from characters to numeric, and they are now out of order. Does anyone know how to make them in proper order by month? Anytime I try to fix it before using the function with() the months just get changed back to numeric.
You can also create a factor:
df$month <- factor(df$month, levels = unique(df$month))
df_Greater20_ByMonth = with(df,
tapply(greater20, list(name, month) , I) )
You can use pivot_wider :
library(dplyr)
library(tidyr)
df %>%
ungroup %>%
select(name, month, greater20) %>%
pivot_wider(names_from = month, values_from = greater20)
# name Jun Jul Aug
# <chr> <dbl> <dbl> <dbl>
#1 X20700254_SiteCode_9RB 0.0184 0.177 0.0444
#2 X20819833_SiteCode_7RB 0 0.269 #0.0783
#3 X20874286_SiteCode_5RB 0.243 0.238 NA
#4 X20874298_SiteCode_.CB 0 0.0444 NA
#5 X20874301_SiteCode_.RB 0.432 0.428 NA
#6 X20874311_SiteCode_4RB 0.533 0.540 NA
#7 X20874315_SiteCode_.HB 0 0 NA
#8 X20874322_SiteCode_3Fields_Brook_below_289_crossing. 0.0846 0.117 NA
#9 X20874323_SiteCode_6RB 0.566 0.512 NA

Combining abbreviated months and year into one variable in R

I have a time series data with a column for a month and a column for a year. The months are JAN, FEB, etc.
I'm trying to combine them into one month year variable in order to run time series analysis on it. I'm very new to R and could use any guidance.
Perhaps something like this?
library(dplyr)
c("JAN", "FEB", "MAR", "APR",
"MAY", "JUN", "JUL", "AUG",
"SEP", "OCT", "NOV", "DEC") %>%
rep(., times = 3) %>%
as.factor() -> months
c("2018", "2019", "2020") %>%
rep(., each = 12) %>%
as.factor() -> years
df1 <- cbind.data.frame(months, years)
paste(df1$months, df1$years, sep = ".") %>%
as.factor() -> merged.years.months
Start with your month/year df.
library(tidyverse)
library(lubridate)
events <- tibble(month = c("JAN", "MAR", "FEB", "NOV", "AUG"),
year = c(2018, 2019, 2018, 2020, 2019))
Let's say that each of your time periods start on the first of the month.
series <- events %>%
mutate(mo1 = dmy(paste(1, month, year)))
This is what you want
R > series
# A tibble: 5 x 3
month year mo1
<chr> <dbl> <date>
1 JAN 2018 2018-01-01
2 MAR 2019 2019-03-01
3 FEB 2018 2018-02-01
4 NOV 2020 2020-11-01
5 AUG 2019 2019-08-01
These are now dates;you can use them in other analyses.
Base R solution:
events <- within(events,{
month_no <- as.integer(as.factor(sort(month)))
date <- as.Date(paste(year, ifelse(nchar(month_no) < 2, paste0("0", month_no),
month_no), "01", sep = "-"), "%Y-%m-%d")
rm(month_no, month, year)
}
)

Efficient algorithm to calculate values in data.frame without loop

Here is the situation where I got kinda stuck with R. I have data table with one row for each day, something like this:
Date = c(as.Date("2015-12-31"), as.Date("2016-01-01"));
Month1 = c("DEC", "JAN");
Year1 = c("15", "16");
Price1 = c(100, 110);
Month2 = c(NA_character_, NA_character_);
Year2 = c(NA_character_, NA_character_);
Price2 = c(NA_integer_, NA_integer_);
Month3 = c(NA_character_, NA_character_);
Year3 = c(NA_character_, NA_character_);
Price3 = c(NA_integer_, NA_integer_);
Month4 = c(NA_character_, NA_character_);
Year4 = c(NA_character_, NA_character_);
Price4 = c(NA_integer_, NA_integer_);
dataSample = data.frame(Date, Month1, Year1, Price1, Month2, Year2, Price2, Month3, Year3, Price3, Month4, Year4, Price4);
Which gives such a table:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
2 2016-01-01 JAN 16 110 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
Now I need to calculate all months and prices for each. For that I have 2 other data frames:
Date = c(as.Date("2015-12-31"), as.Date("2015-12-31"), as.Date("2015-12-31"), as.Date("2016-01-01"), as.Date("2016-01-01"), as.Date("2016-01-01"));
Month.Start = c("DEC", "JAN", "FEB", "JAN", "FEB", "MAR");
Year.Start = c("15", "16", "16", "16", "16", "16")
Month.End = c("JAN", "FEB", "MAR", "FEB", "MAR", "APR");
Year.End = c("16", "16", "16", "16", "16", "16")
Diff = c(10, 15, -15, 19, -20, -5);
diffsOneMonth = data.frame(Date, Month.Start, Year.Start, Month.End, Year.End, Diff)
Date = c(as.Date("2015-12-31"), as.Date("2016-01-01"));
Month.Start = c("DEC", "MAR");
Year.Start = c("15", "16")
Month.End = c("MAR", "JUN");
Year.End = c("16", "16")
Diff = c(11, 25);
diffsThreeMonth = data.frame(Date, Month.Start, Year.Start, Month.End, Year.End, Diff)
Which gives me these tables:
One month price differences
Date Month.Start Year.Start Month.End Year.End Diff
1 2015-12-31 DEC 15 JAN 16 10
2 2015-12-31 JAN 16 FEB 16 15
3 2015-12-31 FEB 16 MAR 16 -15
4 2016-01-01 JAN 16 FEB 16 19
5 2016-01-01 FEB 16 MAR 16 -20
6 2016-01-01 MAR 16 APR 16 -5
Three month price differences
Date Month.Start Year.Start Month.End Year.End Diff
1 2015-12-31 DEC 15 MAR 16 20
2 2016-01-01 MAR 16 JUN 16 25
Now I must fill dataSample data frame by using data from differences tables. I check what start/end months/years are available there and have to fill those months/years in dataSample. Then take difference of price and set calculated price in dataSample. So for example in dataSample we start with DEC 15, then in diffsOneMonth we have entry DEC 15 - JAN 16 with difference 10 so we add it to DEC 15 price and get JAN 16 price 110:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 <NA> <NA> NA <NA> <NA> NA
2 2016-01-01 JAN 16 110 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
Now its possible to do next month and then next etc. If we use diffsOneMonth only we would get desirable result like this:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 FEB 16 125 MAR 16 110
2 2016-01-01 JAN 16 110 FEB 16 129 MAR 16 109 APR 16 104
However there is additional requirement that I must use wider month spread to calculate prices if its possible. So for 2015-12-31 there exists three month spread from DEC 15 to MAR 16 which should override price from one month difference. So DEC 15 price is 110 and DEC 15 - MAR 16 difference is 11 which makes MAR 16 price not 110 but 111:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 FEB 16 125 MAR 16 111
2 2016-01-01 JAN 16 110 FEB 16 129 MAR 16 109 APR 16 104
So for this sample it would be my final desirable output.
Real data is much more complex, with 6 and 12 month differences and 64 months forward for each date. Also some months can be missing. I tried to do it with a loop but it was very slow, however I am not sure how to approach such a problem without a loop. I have created few helper methods to be able to calculate next year/month:
nextContract = function(currentMonth, currentYear, length = 1,
years = c("10", "11", "12", "13", "14", "15", "16", "17", "18"),
months = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")) {
mIdx <- match(currentMonth, months)+length;
yDiff = ifelse(length(months) < mIdx, mIdx / length(months) - ifelse(mIdx %% length(months) == 0, 1, 0), 0);
return(data.frame(nextMonth(currentMonth, length, months), nextYear(currentYear, length = yDiff)))
}
nextYear = function(currentYear, length = 1, years = c("10", "11", "12", "13", "14", "15", "16", "17", "18")) {
return(years[match(currentYear, years)+length]);
}
nextMonth = function(currentMonth, length = 1, months = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")) {
mIdx <- match(currentMonth, months)+length;
return(months[ifelse(length(months) < mIdx, ifelse(mIdx %% length(months) != 0, mIdx %% length(months), length(months)), mIdx)]);
}
Example of usage could be:
> nextContract("DEC", "15")
nextMonth.currentMonth..length..months. nextYear.currentYear..length...yDiff.
1 JAN 16
or:
> nextContract("DEC", "15", length = 3)
nextMonth.currentMonth..length..months. nextYear.currentYear..length...yDiff.
1 MAR 16
This got to be pretty long question but I hope someone will take time to review it :)
Thanks in advance!
EDIT
A little bit of improvement on proposed solution and I got what I needed:
outrightAndForwardRows <- list("1" = diffsOneMonth, "3" = diffsThreeMonth) %>%
bind_rows(.id = "time_step") %>%
left_join(dataSample %>%
select(Date, Price1, Month1, Year1) ) %>%
mutate(Day.Start = 1) %>%
mutate(Day.End = 1) %>%
mutate(Outright.Day = 1) %>%
unite("Contract.Start", Day.Start, Month.Start, Year.Start) %>%
unite("Contract.End", Day.End, Month.End, Year.End) %>%
unite("Contract.Outright", Outright.Day, Month1, Year1) %>%
mutate(time_step = as.numeric(time_step),
Contract.Start =
Contract.Start %>%
parse_date_time("%d_%b_%y")) %>%
mutate(Contract.End =
Contract.End %>%
parse_date_time("%d_%b_%y")) %>%
mutate(Contract.Outright =
Contract.Outright %>%
parse_date_time("%d_%b_%y")) %>%
group_by(time_step, Date) %>%
arrange(Contract.End) %>%
mutate(Price = cumsum(Diff) + Price1) %>%
group_by(Date, Contract.End) %>%
slice(time_step %>% which.max) %>%
ungroup() %>%
select(-time_step, -Diff, -Contract.Start)
#### add outright and forward months to the same columns
outright <- outrightAndForwardRows %>% select(Date, Price=Price1, Contract=Contract.Outright) %>% unique
forwardMonths <- outrightAndForwardRows %>% select(Date, Contract=Contract.End, Price)
# join and sort rows
joined <- rbind(outright, forwardMonths) %>% arrange(Date, Contract)
# add contract sequence
joined = data.table(joined)
joined = joined[, Contract.seq:=seq(.N), by=Date];
dcast(joined, Date ~ Contract.seq, value.var=c("Price", "Contract"))
Something like this:
library(dplyr)
library(tidyr)
library(lubridate)
list(`1` = diffsOneMonth,
`3` = diffsThreeMonth) %>%
bind_rows(.id = "time_step") %>%
left_join(dataSample %>%
select(Date, Price1, Month1, Year1) ) %>%
mutate(Day.Start = 1) %>%
unite("Date.Start", Day.Start, Month.Start, Year.Start) %>%
mutate(time_step = as.numeric(time_step),
Date.Start =
Date.Start %>%
parse_date_time("%d_%b_%y")) %>%
group_by(time_step, Date) %>%
arrange(Date.Start) %>%
mutate(Price = cumsum(Diff) + Price1) %>%
group_by(Date, Date.Start) %>%
slice(time_step %>% which.max)

Resources