Multiplication of several columns - r

I want to multiply some counts in specific locations (0-100, 0-12, etc., each in its own column) by the number of days a count is present (days).
Here is an example of my data:
df <- structure(list(month = c("Apr", "Apr", "Aug", "Aug", "Aug", "Sep"
), Year = c(2018, 2018, 2018, 2018, 2018, 2018), First =
structure(c(17995,
17998, 17750, 17758, 17770, 17778), class = "Date"), Last =
structure(c(17999,
17998, 17750, 17761, 17771, 17778), class = "Date"), days = c(5,
1, 1, 4, 2, 1), `0-100` = c(1, 0, 1, 1, 1, 1), `0-12` = c(0,
0, 1, 1, 1, 1), `0-25` = c(1, 1, 1, 1, 1, 1), `0-50` = c(1, 0,
1, 1, 1, 1)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
So I was thinking something along the lines of:
df2 <- df %>%
mutate("0-100b" = days * "0-100", "0-12b" = days * "0-12", "0-25b" = days * "0-25", "0-50b" = days * "0-25")
This, one, doesn't seem to work, and two, there must be a more concise way than writing out each multiplication. If I had many more columns, this would get tedious.
OK, edit for column names:
colnames(df) <- c("month", "Year", "First", "Last" , "days", "V", "I",
"II", "III")
df2 <- df %>%
mutate(Vb = days * V, Ib = days * I, IIb = days *
II, IIIb = days * III)

Like I said above, you can select improperly named columns by wrapping them in backticks. One of the places that naming rules are laid out is in the docs of the base function make.names.
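As a minimal sketch of the backtick approach (on a cut-down, made-up version of the question's data), the original mutate call works once the non-syntactic names are wrapped in backticks instead of quotes:

```r
library(dplyr)

# Hypothetical two-row subset of the question's data
df_small <- tibble(
  days    = c(5, 1),
  `0-100` = c(1, 0),
  `0-12`  = c(0, 1)
)

# Backticks refer to the columns; quotes would create character strings
df2 <- df_small %>%
  mutate(`0-100b` = days * `0-100`,
         `0-12b`  = days * `0-12`)
```

With quotes, days * "0-100" tries to multiply a number by a string, which is why the original attempt fails.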
The easiest solution to having improper names is to just create data with valid names to begin with...but in practice, that isn't always possible. There are several ways to change the names into valid ones. The aforementioned make.names does this from a character vector.
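To illustrate, this is what make.names returns for names like the ones in the question (the input vector here is only an example):

```r
old_names <- c("month", "0-100", "0-12")

# Invalid characters become ".", and an "X" is prepended
# when the name does not start with a letter or a dot
new_names <- make.names(old_names)
new_names
#> [1] "month"  "X0.100" "X0.12"
```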
If you're working in a larger piped workflow, you can use rename_all with a few string manipulation functions to 1) convert to lowercase, 2) replace - with _, and 3) prepend an x before any leading digits. You can also use janitor::clean_names, which cleans all the names in a data frame.
library(dplyr)
df %>%
rename_all(~tolower(.) %>%
stringr::str_replace_all(., "\\-", "_") %>%
stringr::str_replace("^\\b(?=\\d)", "x"))
# omitted: same names as below
With clean names, you can use mutate_at, select the columns, and pass it a function to multiply by days. If you use a named list, the name is appended to create new columns, instead of replacing them.
df %>%
janitor::clean_names() %>%
mutate_at(vars(x0_100:x0_50), list(b = ~. * days))
#> # A tibble: 6 x 13
#> month year first last days x0_100 x0_12 x0_25 x0_50 x0_100_b
#> <chr> <dbl> <date> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Apr 2018 2019-04-09 2019-04-13 5 1 0 1 1 5
#> 2 Apr 2018 2019-04-12 2019-04-12 1 0 0 1 0 0
#> 3 Aug 2018 2018-08-07 2018-08-07 1 1 1 1 1 1
#> 4 Aug 2018 2018-08-15 2018-08-18 4 1 1 1 1 4
#> 5 Aug 2018 2018-08-27 2018-08-28 2 1 1 1 1 2
#> 6 Sep 2018 2018-09-04 2018-09-04 1 1 1 1 1 1
#> # … with 3 more variables: x0_12_b <dbl>, x0_25_b <dbl>, x0_50_b <dbl>
In this case, it might also make sense to select columns by regex:
df %>%
janitor::clean_names() %>%
mutate_at(vars(matches("^x\\d")), list(b = ~. * days))
# same output as above
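In dplyr 1.0.0 and later, mutate_at is superseded by across. A rough equivalent, sketched on a small made-up data frame whose names are already clean, might look like this:

```r
library(dplyr)

# Hypothetical data with valid names (as clean_names would produce)
df_clean <- tibble(
  days   = c(5, 1, 4),
  x0_100 = c(1, 0, 1),
  x0_12  = c(0, 0, 1)
)

# .names controls the new column names; "{.col}_b" appends "_b"
res <- df_clean %>%
  mutate(across(matches("^x\\d"), ~ .x * days, .names = "{.col}_b"))
```

The original columns are kept because .names gives the results new names instead of overwriting the inputs.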

Related

Slider (slide_period) Moving Average Calculation Wrong

I'm trying to use Slider to compute moving averages over some time series data. The data has day resolution (one observation per day). For each observation I want to compute the average daily value over the last 7 days.
The problem is my code is ignoring the missing observations with implied values of zero. So if my period is 7 days, and during some 7 day window there are only 2 observations, it's summing them and dividing by 2, whereas I'm looking to sum and divide by 7 to get the average per day.
In the code below you'll see that the second row (2023-02-03) is computing the average by dividing by 2 (the number of observations), rather than by dividing by 4 (the number of days in the period 2023-01-31 to 2023-02-03).
Is there a good way to achieve the desired result, or do I just need to replace the mean calculation with sum() / 7?
I had originally backfilled the missing observations which worked, but the data is relatively large and quite sparse and doing so massively increased the runtime (from ~8 seconds to ~100).
library(tidyverse)
library(slider)
data <- data.frame(
date = Sys.Date() - c(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 13),
val = c(0, 0, 2, 1, 0, 10, 0, 1, 1, 6, 1)
)
print(as_tibble(data))
summary <- function(data) {
summarise(data,
moving_total = sum(val),
moving_avg = mean(val, na.rm = FALSE),
num_observations = n()
)
}
res <- data %>%
arrange(date) %>%
mutate(
weekly = slide_period_dfr(
.x = pick(everything()),
.i = date,
.period = "day",
.f = summary,
.before = 6,
.complete = FALSE
)
)
print(as_tibble(res))
# A tibble: 11 x 2
date val
<date> <dbl>
1 2023-02-13 0
2 2023-02-12 0
3 2023-02-11 2
4 2023-02-10 1
5 2023-02-09 0
6 2023-02-07 10
7 2023-02-06 0
8 2023-02-05 1
9 2023-02-04 1
10 2023-02-03 6
11 2023-01-31 1
# A tibble: 11 x 3
date val weekly$moving_total $moving_avg $num_observations
<date> <dbl> <dbl> <dbl> <int>
1 2023-01-31 1 1 1 1
2 2023-02-03 6 7 3.5 2
3 2023-02-04 1 8 2.67 3
4 2023-02-05 1 9 2.25 4
5 2023-02-06 0 9 1.8 5
6 2023-02-07 10 18 3.6 5
7 2023-02-09 0 18 3 6
8 2023-02-10 1 13 2.17 6
9 2023-02-11 2 14 2.33 6
10 2023-02-12 0 13 2.17 6
11 2023-02-13 0 13 2.17 6
Just a note on the implementation. In the real world the moving averages are being computed over groups, hence the use of pick(everything()) above. Don't think it's necessary for the toy example, but I leave it in just in case it influences the answer.
Thanks
A straightforward solution is to supply zero values for the absent dates.
library(tidyverse)
library(slider)
(data_ <- tibble(
date = Sys.Date() - c(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 13),
val = c(0, 0, 2, 1, 0, 10, 0, 1, 1, 6, 1)
))
summary <- function(.data) {
summarise(.data,
moving_total = sum(val),
moving_avg = mean(val, na.rm = FALSE),
num_observations = n()
)
}
(full_dates <- tibble(date=
seq(min(data_$date),
max(data_$date),
by = "1 day"
)))
(fdj <- left_join(
full_dates,
data_,
by = "date"
) |> mutate(val = if_else(
is.na(val), 0, val
)))
(res <-
mutate(fdj,
weekly = slide_period_dfr(
.x = pick(everything()),
.i = date,
.period = "day",
.f = summary,
.before = 6,
.complete = FALSE
)
))
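If backfilling is too slow on large, sparse data, one alternative is to let the index do the work: slide_index_dbl builds each window from the dates themselves, so absent days simply add nothing to the sum, and the divisor can be set by hand. The sketch below assumes a fixed reference date instead of Sys.Date() and a hypothetical window_days variable. The pmin() caps the divisor for rows near the start of the series, which is one possible reading of the "divide by 4" requirement in the question.

```r
library(dplyr)
library(slider)

data_ <- tibble(
  date = as.Date("2023-02-13") - c(0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 13),
  val  = c(0, 0, 2, 1, 0, 10, 0, 1, 1, 6, 1)
)

window_days <- 7  # assumed window length in days

res <- data_ %>%
  arrange(date) %>%
  mutate(
    # sum over the current date and the 6 calendar days before it
    moving_total = slide_index_dbl(val, date, sum, .before = window_days - 1),
    # divide by calendar days in the window, not by rows present
    divisor      = pmin(window_days, as.numeric(date - min(date)) + 1),
    moving_avg   = moving_total / divisor
  )
```

For grouped data, the same mutate could follow a group_by on the grouping columns, since min(date) is then evaluated per group.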

counting rows between two specific rows with a condition

df <- structure(list(inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"), ass = c("x", "x", "x", "x", "x"), datetime = c("2010-01-01",
"2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19"), portfolio = c(10,
0, 5, 2, 0), operation = c(10, -10, 5, -3, -2)), class = "data.frame", row.names = c(NA, -5L))
I have 4000 investors with 6000 different assets. For each investor I have his trading operations in two variables: operation tells me whether he is buying or selling; portfolio tells me how much he holds in the portfolio.
What I want to do is compute the number of days a position stays open in the portfolio, so I thought about computing the difference between the day on which the portfolio goes back to zero and the day on which it went positive (a negative portfolio is not possible).
So in the dataset above I would count row 2 - row 1 ==> 2010-01-02 - 2010-01-01
and row 5 - row 3 ==> 2010-01-19 - 2010-01-03, and so on...
I want to do this computation for every investor and asset in my dataset, for all the rows in which portfolio > 0.
So my dataset will have a further column called duration, which in this case would be equal to c(0,1,0,5,16) (so of course I also have to compute row 1 - row 1 and row 3 - row 3).
Hence my problem is to restart the count every time portfolio goes back to zero.
library(dplyr)
df %>%
mutate(datetime = as.Date(datetime, "%Y-%m-%d")) %>%
group_by(investor, asset) %>%
arrange(datetime) %>%
mutate(grp.pos = cumsum(lag(portfolio, default = 1) == 0)) %>%
group_by(investor, asset, grp.pos) %>%
mutate(`Open (#days)` = datetime - datetime[1])
#> # A tibble: 5 x 6
#> # Groups: investor, asset, grp.pos [2]
#> investor asset datetime portfolio grp.pos `Open (#days)`
#> <chr> <chr> <date> <dbl> <int> <drtn>
#> 1 INV_1 x 2010-01-01 10 0 0 days
#> 2 INV_1 x 2010-01-02 0 0 1 days
#> 3 INV_1 x 2010-01-03 5 1 0 days
#> 4 INV_1 x 2010-01-08 2 1 5 days
#> 5 INV_1 x 2010-01-19 0 1 16 days
Data:
df <- structure(list(investor = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1"),
asset = c("x", "x", "x", "x", "x"),
datetime = c("2010-01-01", "2010-01-02", "2010-01-03",
"2010-01-08", "2010-01-19"),
portfolio = c(10, 0, 5, 2, 0)),
operation = c(10, -10, 5, -3, -2),
class = "data.frame", row.names = c(NA, -5L))
Here is one way to do it, which can be extended to ass if necessary.
First we group by inv. Then we convert datetime to Date format so that date arithmetic is easy (here with the ymd() function).
The next step could be done in different ways.
The main idea is to group the rows so that each group ends at a row where portfolio is 0. To do this, we arrange datetime in descending order, which makes it easy to build the grouping id with cumsum(portfolio == 0).
After rearranging datetime in ascending order, we can subtract the first date from the last, as intended:
library(dplyr)
library(lubridate)
df %>%
group_by(inv) %>%
mutate(datetime = ymd(datetime)) %>%
arrange(desc(datetime)) %>%
group_by(position_Group = cumsum(portfolio==0)) %>%
arrange(datetime) %>%
mutate(position_open = last(datetime)-first(datetime)) %>%
ungroup()
inv ass datetime portfolio operation position_Group position_open
<chr> <chr> <date> <dbl> <dbl> <int> <drtn>
1 INV_1 x 2010-01-01 10 10 2 1 days
2 INV_1 x 2010-01-02 0 -10 2 1 days
3 INV_1 x 2010-01-03 5 5 1 16 days
4 INV_1 x 2010-01-08 2 -3 1 16 days
5 INV_1 x 2010-01-19 0 -2 1 16 days
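The grouping trick behind both answers can be seen in isolation on bare vectors. The following base R sketch uses the example's five rows; the variable names are made up for illustration:

```r
dates     <- as.Date(c("2010-01-01", "2010-01-02", "2010-01-03",
                       "2010-01-08", "2010-01-19"))
portfolio <- c(10, 0, 5, 2, 0)

# Forward version: a new position starts on the row after a zero.
# The leading 1 is a non-zero placeholder for the first row's "previous" value.
prev        <- c(1, head(portfolio, -1))
grp_forward <- cumsum(prev == 0)
grp_forward
#> [1] 0 0 1 1 1

# Backward version: count the zeros starting from the last row.
grp_backward <- rev(cumsum(rev(portfolio) == 0))
grp_backward
#> [1] 2 2 1 1 1

# Days since the first row of each position
duration <- ave(as.numeric(dates), grp_forward, FUN = function(d) d - d[1])
duration
#> [1]  0  1  0  5 16
```

Both ids split the rows into the same two positions; only the labels differ. The duration vector matches the c(0,1,0,5,16) asked for in the question.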

R: Loops, Dplyr and lubridate, how to combine them

I'm new to R and I'm facing a problem. I have a date vector and a data frame containing sales values and coverage start and end dates.
I need to defer the sale value at each analysis date. For the first analysis period, I can write code that gives me the desired answer. However, in my real data I am working with 200K+ rows and 50+ analysis periods.
I'm not able to build a loop, or find an alternative function in R, that creates the variables Aux[i] and Test[i] for each of the dates in the vec_date vector.
The following is an example of code that works for the first analysis period.
library(tidyverse)
library(lubridate)
df <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
Premium = c(11000, 5000, 24500))
date <- ymd("2021-12-31")
vec_date <- date %m+% months(seq(0, 12, by = 6))
df_new <- df |>
mutate(duration = as.numeric(DateFin - DateIn),
Pr_day = Premium/duration,
Aux1 = if_else(DateIn > vec_date[1] | DateFin < vec_date[1], "N", "Y"),
test1 = if_else(Aux1 == "Y" & DateFin > vec_date[1], as.numeric(DateFin - vec_date[1])*Pr_day,
if_else(DateIn > vec_date[1], Premium, 0)))
Does anyone have any idea how I could build this loop, or is there any R function/package that allows me to perform this interaction between my df dataframe and vec_date vector?
Edit: an outline of the format you would need as a result would be:
df_final <- tibble(DateIn = c(ymd("2021-10-21", "2021-12-25", "2022-05-11")),
DateFin = c(ymd("2022-03-10", "2022-07-12", "2023-02-15")),
Premium = c(11000, 5000, 24500),
Aux1 = c("Y", "Y", "N"),
test1 = c(5421.429, 4849.246, 24500.000),
Aux2 = c("N", "Y", "Y"),
test2 = c(0.0000, 301.5075, 20125.0000),
Aux3 = c("N", "N", "Y"),
test3 = c(0, 0, 4025))
Here, Aux1 and test1 are the results for vec_date[1], Aux2 and test2 for vec_date[2], and Aux3 and test3 for vec_date[3]. It is important for me to keep the resulting variables in the same data frame because later analysis will be done on it.
As @Jon Spring suggests in the comments, probably the preferred approach here
would be to use tidyr::complete() to extend your data frame, repeating each
row in it for each of your analysis dates. Then, you can stick to vectorized
calculations and get the analysis date column in the resulting data, too.
Below is how to do just that with the example data you provided. I took the
liberty of renaming some columns, and simplifying the control-flow based
calculation according to my understanding of the problem, based on what you
shared.
First, the example data slightly reframed:
library(tidyverse)
library(lubridate)
policies <- tibble(
policy_id = seq_len(3),
start = ymd("2021-10-21", "2021-12-25", "2022-05-11"),
end = ymd("2022-03-10", "2022-07-12", "2023-02-15"),
premium = c(11000, 5000, 24500)
)
policies
#> # A tibble: 3 x 4
#> policy_id start end premium
#> <int> <date> <date> <dbl>
#> 1 1 2021-10-21 2022-03-10 11000
#> 2 2 2021-12-25 2022-07-12 5000
#> 3 3 2022-05-11 2023-02-15 24500
Then, finding remaining prorated premiums for policies at given dates:
start_date <- ymd("2021-12-31")
dates <- start_date %m+% months(seq(0, 12, by = 6))
policies %>%
mutate(
days = as.numeric(end - start),
daily_premium = premium / days
) %>%
crossing(date = dates) %>%
mutate(
days_left = pmax(0, end - pmax(start, date)),
premium_left = days_left * daily_premium
) %>%
select(policy_id, date, days_left, premium_left)
#> # A tibble: 9 x 4
#> policy_id date days_left premium_left
#> <int> <date> <dbl> <dbl>
#> 1 1 2021-12-31 69 5421.
#> 2 1 2022-06-30 0 0
#> 3 1 2022-12-31 0 0
#> 4 2 2021-12-31 193 4849.
#> 5 2 2022-06-30 12 302.
#> 6 2 2022-12-31 0 0
#> 7 3 2021-12-31 280 24500
#> 8 3 2022-06-30 230 20125
#> 9 3 2022-12-31 46 4025
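Since the question asks for one column per analysis date in the same data frame, the long result can be reshaped with tidyr::pivot_wider. This is only a sketch under the same renamed columns as above; the period index and the test prefix are assumptions chosen to mimic the test1, test2, test3 layout from the question.

```r
library(dplyr)
library(tidyr)

policies <- tibble(
  policy_id = 1:3,
  start     = as.Date(c("2021-10-21", "2021-12-25", "2022-05-11")),
  end       = as.Date(c("2022-03-10", "2022-07-12", "2023-02-15")),
  premium   = c(11000, 5000, 24500)
)
dates <- as.Date(c("2021-12-31", "2022-06-30", "2022-12-31"))

wide <- policies %>%
  crossing(date = dates) %>%
  mutate(
    period       = match(date, dates),  # 1, 2, 3 for the analysis dates
    days_left    = pmax(0, as.numeric(end - pmax(start, date))),
    premium_left = days_left * premium / as.numeric(end - start)
  ) %>%
  select(policy_id, start, end, premium, period, premium_left) %>%
  # one row per policy, one test<i> column per analysis date
  pivot_wider(names_from = period, values_from = premium_left,
              names_prefix = "test")
```

The Y/N flags could be computed in the same mutate and widened in the same way if they are still needed.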

Multiple string replacement, decimals to quarters

I want to replace .00 with -Q1, .25 with -Q2, .50 with -Q3, and .75 with -Q4 as given below. However, my code is not working as expected. Any hints?
library(tidyverse)
dt1 <-
tibble(Date = c(2015.00, 2015.25, 2015.50, 2015.75))
dt1
# A tibble: 4 x 1
Date
<dbl>
1 2015
2 2015.
3 2016.
4 2016.
dt1 %>%
pull(Date)
[1] 2015.00 2015.25 2015.50 2015.75
dt1 %>%
mutate(Date1 = str_replace_all(string = Date, pattern = c(".00" = "-Q1", ".25" = "-Q2", ".50" = "-Q3", ".75" = "-Q4")))
# A tidytable: 4 × 2
Date Date1
<dbl> <chr>
1 2015 2015
2 2015. 2015-Q2
3 2016. 2015.5
4 2016. 2015-Q4
There also is a zoo-function for that:
library(tidyverse)
library(zoo)
dt1 <-
tibble(Date = c(2015.00, 2015.25, 2015.50, 2015.75))
dt1 %>%
mutate(Date1 = format.yearqtr(Date, format = "%Y.Q%q") )
# Date Date1
# <dbl> <chr>
# 1 2015 2015.Q1
# 2 2015. 2015.Q2
# 3 2016. 2015.Q3
# 4 2016. 2015.Q4
You may also use integer division %/% and modulo division %% simultaneously
paste0(dt1$Date %/% 1, '-Q',(dt1$Date %% 1)*4 +1)
[1] "2015-Q1" "2015-Q2" "2015-Q3" "2015-Q4"
Thus, using it in piped syntax as
dt1 %>%
mutate(date1 = paste0(Date %/% 1, '-Q',(Date %% 1)*4 +1))
# A tibble: 4 x 2
Date date1
<dbl> <chr>
1 2015 2015-Q1
2 2015. 2015-Q2
3 2016. 2015-Q3
4 2016. 2015-Q4
Here is a quick fix:
dt1 %>%
mutate(Date1 = str_replace_all(format(Date, nsmall = 2),
pattern = c(".00" = "-Q1", ".25" = "-Q2", ".50" = "-Q3", ".75" = "-Q4")))
The problem is that 2015.00 is first transformed to character at which point it becomes 2015. Therefore, the string replacement fails.
You can see this by trying as.character(2015.00).
However, this can easily be fixed by using format to format the number first.
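A short base R sketch of the difference, using the values from the question (the lookup vector quarters is made up for illustration):

```r
x <- c(2015.00, 2015.25, 2015.50, 2015.75)

# Implicit conversion drops trailing zeros, so ".00" and ".50" never appear
as.character(x)
#> [1] "2015"    "2015.25" "2015.5"  "2015.75"

# format() keeps two decimals, so every value has a two-digit suffix
s <- format(x, nsmall = 2)
s
#> [1] "2015.00" "2015.25" "2015.50" "2015.75"

# Look up the quarter label from the two digits after the dot
quarters <- c("00" = "Q1", "25" = "Q2", "50" = "Q3", "75" = "Q4")
out <- paste0(substr(s, 1, 4), "-", quarters[substr(s, 6, 7)])
out
#> [1] "2015-Q1" "2015-Q2" "2015-Q3" "2015-Q4"
```

Note also that an unescaped "." in a regular expression matches any character, so patterns such as "\\.00" or stringr::fixed(".00") are safer when a literal dot is meant.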
vec <- c("00" = "-Q1", "25" = "-Q2", "50" = "-Q3", "75" = "-Q4")
dt1 %>%
mutate(new = paste0(Date %/% 1, vec[sprintf("%02d", Date %% 1 * 100)]))
Date new
<dbl> <chr>
1 2015 2015-Q1
2 2015. 2015-Q2
3 2016. 2015-Q3
4 2016. 2015-Q4
library(tidyverse)
dt1 <-
as.character(c(2015.00, 2015.25, 2015.50, 2015.75))
dt1 <- if_else(str_detect(dt1, '\\.', negate = TRUE),
paste0(dt1, '.00'), #If condition TRUE
dt1) #if condition FALSE
value_before <- c("\\.00","\\.25","\\.5","\\.75" )
value_after <- c("-Q1", "-Q2","-Q3", "-Q4")
tibble(Date = str_replace(dt1, value_before, value_after))
#> # A tibble: 4 x 1
#> Date
#> <chr>
#> 1 2015-Q1
#> 2 2015-Q2
#> 3 2015-Q3
#> 4 2015-Q4
Created on 2021-06-01 by the reprex package (v2.0.0)
A solution with dplyr and tidyr: 1) prepare the decimals for further processing with format, 2) separate the string and recode the quarter part to -Q1 through -Q4, and 3) unite the parts again.
library(tidyr)
library(dplyr)
dt1 %>%
mutate(Date = format(round(Date, digits=2), nsmall = 2)) %>%
separate(Date, into = c("Year", "Quarter"), remove=FALSE) %>%
mutate(Quarter = recode(Quarter, "00" = "-Q1", "25" = "-Q2", "50" = "-Q3", "75" = "-Q4")) %>%
unite("new", Year:Quarter, sep = "")
Output:
Date new
<chr> <chr>
1 2015.00 2015-Q1
2 2015.25 2015-Q2
3 2015.50 2015-Q3
4 2015.75 2015-Q4

In R: Create a data frame which includes consecutive observations only and a variable indicating sequence number

I have a data frame with column names similar to below, in which each row is one observation:
user_id; date; Var_1; Var_2
Each user_id can have 0 or 1 observations for each date. Only dates with an observation are included in the data frame for each user.
From this data, I want to create a data frame containing these variables (and the ones described below) but it should only contain observations for 3 successive dates (for each user). Days in each succession should be numbered 1 to 3 and each succession should be numbered as well.
For instance, suppose the user with user_id == 1 has observations on the following dates: 2020-01-01, 2020-01-03, 2020-01-04, 2020-01-05, 2020-01-06, 2020-01-10, 2020-01-12, 2020-01-13, 2020-01-14.
And user with user_id == 2 has observations on the following dates: 2020-01-01, 2020-01-03, 2020-01-04, 2020-01-06, 2020-01-10, 2020-01-12, 2020-01-15, 2020-01-16, 2020-01-17
Then the new data frame should include:
user_id      date      Sequence      Day      Var_1      Var_2      
1            20-01-03      1                  1            value            value      
1            20-01-04      1                  2            value            value      
1            20-01-05      1                  3            value            value      
1            20-01-04      2                  1            value            value      
1            20-01-05      2                  2            value            value      
1            20-01-06      2                  3            value            value      
1            20-01-12      3                  1            value            value      
1            20-01-13      3                  2            value            value      
1            20-01-14      3                  3            value            value      
2            20-01-15      1                  1            value            value      
2            20-01-16      1                  2            value            value      
2            20-01-17      1                  3            value            value      
(where value is the value of the observation for the variable)
Thank you for your help with this tricky problem !
Best wishes,
Eric
Here is something you can try working with. I'm sure there are better ways than this, but it seems to work.
Filter by identifying rows that are the beginning of 3-day sequences. To do that, calculate difference diff between dates, and identify dates where the subsequent two rows have a diff of one day.
Knowing the start dates of sequences, you can enumerate these as sequence. Then, use map to expand to 3 day sequences based on these starting dates. After that, you can enumerate again for day after grouping by both user_id and sequence.
Finally, join the result back to your original data to get Var_1, Var_2, etc.
library(dplyr)
library(tidyr)
library(purrr) # provides map()
df %>%
select(user_id, date) %>%
group_by(user_id) %>%
mutate(diff = c(0, diff(date))) %>%
filter((lead(diff, 1L) == 1 & lead(diff, 2L) == 1)) %>%
mutate(sequence = row_number(),
date = map(date, seq.Date, length = 3, by = "1 day")) %>%
unnest(cols = date) %>%
group_by(user_id, sequence) %>%
mutate(day = row_number()) %>%
inner_join(df, by = c("user_id", "date")) %>%
select(-diff)
Output
user_id date sequence day Var_1 Var_2
<dbl> <date> <int> <int> <int> <int>
1 1 2020-01-03 1 1 2 17
2 1 2020-01-04 1 2 3 16
3 1 2020-01-05 1 3 4 15
4 1 2020-01-04 2 1 3 16
5 1 2020-01-05 2 2 4 15
6 1 2020-01-06 2 3 5 14
7 1 2020-01-12 3 1 7 12
8 1 2020-01-13 3 2 8 11
9 1 2020-01-14 3 3 9 10
10 2 2020-01-15 1 1 16 3
11 2 2020-01-16 1 2 17 2
12 2 2020-01-17 1 3 18 1
Data
df <- structure(list(user_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2), date = structure(c(18262, 18264, 18265, 18266,
18267, 18271, 18273, 18274, 18275, 18262, 18264, 18265, 18267,
18271, 18273, 18276, 18277, 18278), class = "Date"), Var_1 = 1:18,
Var_2 = 18:1), class = "data.frame", row.names = c(NA, -18L
))
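The start-detection step of the answer can be checked on its own with a few lines of base R. This sketch uses the dates of user 1 from the question; next1 and next2 are hypothetical names that play the role of lead(diff, 1L) and lead(diff, 2L):

```r
dates <- as.Date(c("2020-01-01", "2020-01-03", "2020-01-04", "2020-01-05",
                   "2020-01-06", "2020-01-10", "2020-01-12", "2020-01-13",
                   "2020-01-14"))

# Days since the previous observation (0 for the first row)
gap <- c(0, as.numeric(diff(dates)))

# Gaps of the next row and the row after that, padded with NA at the end
next1 <- c(gap[-1], NA)
next2 <- c(gap[-(1:2)], NA, NA)

# A row starts a 3-day run if both following gaps are exactly one day
is_start <- !is.na(next1) & !is.na(next2) & next1 == 1 & next2 == 1
starts   <- dates[is_start]
starts
#> [1] "2020-01-03" "2020-01-04" "2020-01-12"
```

These are the three sequence start dates for user 1 in the output above. Overlapping runs such as 01-03 and 01-04 are both kept, as the question requires.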

Resources