dplyr dynamically create lag and ma features

I am trying to create a process that takes in a dataframe and creates additional lagged and rolling window features (e.g. moving average). This is what I have so far.
library(dplyr)
library(zoo) # for rollmean()
# dummy dataframe
n <- 20
set.seed(123)
foo <- data.frame(
date = seq(as.Date('2020-01-01'),length.out = n, by = 'day'),
var1 = sample.int(n),
var2 = sample.int(n))
# creates lags and, based on (some of) them, rolling average features
# (list() of lambdas replaces funs(), which is deprecated since dplyr 0.8.0)
foo %>%
  mutate_at(vars(starts_with('var')),
            list(lag_1 = ~ lag(.), lag_2 = ~ lag(., 2))) %>%
  mutate_at(vars(contains('lag_1')),
            list(ra_3 = ~ rollmean(., k = 3, align = 'right', fill = NA)))
The above chunk:
creates lag_1 and lag_2 features for the selected variables
creates rolling average features based on a subset of the newly created columns
What I am now looking for is a way to create an arbitrary number of lagged features (e.g. lag3, lag6, lag9 and so on) as well as an arbitrary number of rolling average features with different window lengths (i.e. var1_lag_1_ra_3, var1_lag_1_ra_6, var2_lag_1_ra_3, var2_lag_1_ra_6). At the moment the settings that generate these features are hardcoded. Ideally I would have a couple of vectors to adjust the outcome, like so:
lag_features <- c(3,6,9)
ma_features <- c(12,15)
Lastly, it would be quite nice if there was a way to configure the names of the generated features dynamically. I've seen the {{}}, !! and := operators, but I am not really in a position to tell the difference or how to use them.
I have also implemented the above using some readily available functions from the timetk package, but since I am looking for some additional flexibility, I was wondering how I could replicate such behavior myself.
library(timetk)
foo %>%
select(date,starts_with('var')) %>%
tk_augment_lags(.value = starts_with("var"),
.lags = 1) %>%
tk_augment_slidify(.value = ends_with("lag1"),
.period = seq(0,24,3)[-1],
.f = mean,
.align = 'right',
.partial = TRUE
)
Any support would be really appreciated.

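You can use map_dfc() from purrr to build one set of lagged columns per entry in lag_features, and the .names argument of across() to name the new columns. The rolling averages go in a second step, because the lagged columns must already exist in foo before contains('lag3') can select them.
library(dplyr)
library(purrr)
library(zoo)
lag_features <- c(3, 6, 9)
ma_features <- c(12, 15)
# step 1: one lagged copy of every var* column per entry in lag_features
foo <- bind_cols(foo, map_dfc(lag_features, function(n) foo %>%
  transmute(across(starts_with('var'),
    function(v) dplyr::lag(v, n), .names = '{.col}_lag{n}'))))
# step 2: right-aligned rolling means of the lag3 columns, one set per window.
# rollapplyr(..., mean) is used because zoo documents that rollmean() does not
# handle inputs containing NA, and lagged columns start with NAs
foo <- bind_cols(foo, map_dfc(ma_features, function(k) foo %>%
  transmute(across(contains('lag3'),
    function(v) rollapplyr(v, k, mean, fill = NA), .names = '{.col}_{k}'))))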
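The question also asks how {{ }}, !! and := fit together for dynamic names. As a minimal sketch (add_lags and the toy data frame are hypothetical names, not part of dplyr or timetk): {{ col }} forwards an unquoted column argument into a dplyr verb, := is used in place of = when the name on the left-hand side is computed, and the name string can interpolate both the embraced column and an ordinary variable such as n. The older !! operator unquotes a single value and is not needed here.

```r
library(dplyr)

# Hypothetical helper (an assumption for illustration): add one lagged copy
# of a single column for every entry in `lags`.
add_lags <- function(df, col, lags) {
  for (n in lags) {
    df <- df %>%
      # "{{ col }}" inserts the column's name, "{n}" the current lag;
      # := is required because the new column name is built dynamically
      mutate("{{ col }}_lag_{n}" := dplyr::lag({{ col }}, n))
  }
  df
}

toy <- data.frame(var1 = 1:5)
out <- add_lags(toy, var1, c(1, 2))
names(out)
```

The same pattern could be extended with a second argument for the suffix, so that the naming scheme is a parameter rather than hardcoded.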

Related

R- How do I use a lookup table containing threshold values that vary for different variables (columns) to replace values below those thresholds?

I am trying to streamline the process of auditing chemistry laboratory data. When we encounter data where an analyte is not detected, I need to change the recorded result to a value equal to 1/2 of the level of detection (LOD) for the analytical method. I have the LODs contained within another dataframe to be used as a lookup table.
I have multiple columns representing data from different analytical tests, each with its own unique LOD. Here's an example of the type of data I am working with:
library(tidyverse)
dat <- tibble("Lab_ID" = as.character(seq(1,10,1)),
"Tributary" = c('sawmill','paint', 'herring', 'water',
'paint', 'sawmill', 'bolt', 'water',
'herring', 'sawmill'),
"date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
"TP" = c(1.5,15.7,-2.3,7.6,0.1,45.6,12.2,-0.1,22.2,0.6),
"TN" = c(100.3,56.2,-10.5,0.4,-0.3,11.0,45.8,256.0,12.2,144.0),
"DOC" = c(56.0,120.3,-10.5,0.2,14.6,489.3,0.3,14.4,54.6,88.8))
dat
detect_level <- tibble("Parameter" = c('TP', 'TN', 'DOC'),
'LOD' = c(0.6, 11, 0.3)) %>%
mutate(halfLOD=LOD/2)
detect_level
I have pored over multiple other questions with a similar theme:
Change values in multiple columns of a dataframe using a lookup table
R - Match values from multiple columns in a data.frame to a lookup table.
Replace values in multiple columns using different thresholds
and gotten to a point where I have pivoted the data and split it out into a list of dataframes that are specific analytes:
dat %>%
pivot_longer(cols = c('TP','TN','DOC')) %>%
arrange(name) %>%
split(.$name)
I have tried to apply a function using map(), however I cannot figure out how to integrate the values from the lookup table (detect_level) into my code. If someone could help me continue this pipe, or finish the process to achieve a final product dat2 that should look like this I would appreciate it:
dat2 <- tibble("Lab_ID" = as.character(seq(1,10,1)),
"Tributary" = c('sawmill','paint', 'herring', 'water',
'paint', 'sawmill', 'bolt', 'water',
'herring', 'sawmill'),
"date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
"TP" = c(1.5,15.7,0.3,7.6,0.3,45.6,12.2,0.3,22.2,0.6),
"TN" = c(100.3,56.2,5.5,5.5,5.5,11.0,45.8,256.0,12.2,144.0),
"DOC" = c(56.0,120.3,0.15,0.15,14.6,489.3,0.3,14.4,54.6,88.8))
dat2
Another possibility comes from the closest similar question I have found:
Lookup multiple column from a single table
Here's a snippet of code that I have adapted from this question, however, if you run it you will see that where values exist that are not found in detect_level an NA is returned. Additionally, it does not appear to have worked for $TN or $DOC, even in cases when the $LOD value from detect_level was present.
dat %>%
mutate(across(all_of(unique(detect_level$Parameter)),
~ {i1 <- detect_level$Parameter == cur_column()
detect_level$LOD[i1][match(., detect_level$LOD)]}))
I am not comfortable at all with the purrr language here and have only adapted this code from the question linked, so I would appreciate if this is the direction an answerer chooses, that they might comment code to explain briefly what is happening "under the hood".
Thank you in advance!

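Perhaps this helps. Inside across(), cur_column() returns the name of the column currently being processed, and match() finds the row of detect_level that holds that parameter. This version raises every value below the LOD to the LOD itself:
library(dplyr)
dat %>%
  mutate(across(all_of(detect_level$Parameter),
    # pmax() keeps the larger of each value and the column's LOD
    ~ pmax(., detect_level$LOD[match(cur_column(), detect_level$Parameter)])))
For the updated case, where values below the LOD become half the LOD:
dat %>%
  mutate(across(all_of(detect_level$Parameter), ~ {
    # row of the lookup table that belongs to the current column
    i <- match(cur_column(), detect_level$Parameter)
    # overwrite only the values below this column's LOD
    replace(., . < detect_level$LOD[i], detect_level$halfLOD[i])
  }))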
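Since the question started from a pivoted version of the data, here is a hedged sketch of how that route could be finished with a join instead of across(). It uses a cut-down copy of the data (dat_small is a made-up name, and only TP and TN are included) so that it runs on its own; the idea is to go long, attach the lookup row for each parameter, replace, and go wide again.

```r
library(dplyr)
library(tidyr)

# Reduced stand-ins for dat and detect_level from the question
dat_small <- tibble(Lab_ID = c("1", "2", "3"),
                    TP = c(1.5, -2.3, 0.6),
                    TN = c(100.3, 0.4, 11))
detect_level <- tibble(Parameter = c("TP", "TN"), LOD = c(0.6, 11)) %>%
  mutate(halfLOD = LOD / 2)

dat_fixed <- dat_small %>%
  # one row per sample and parameter
  pivot_longer(all_of(detect_level$Parameter),
               names_to = "Parameter", values_to = "value") %>%
  # bring LOD and halfLOD alongside each measurement
  left_join(detect_level, by = "Parameter") %>%
  # values strictly below the LOD become half the LOD
  mutate(value = if_else(value < LOD, halfLOD, value)) %>%
  select(-LOD, -halfLOD) %>%
  # back to one column per parameter
  pivot_wider(names_from = Parameter, values_from = value)

dat_fixed
```

The join makes the lookup explicit, which may be easier to audit than indexing detect_level inside a lambda, at the cost of reshaping the data twice.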

Dplyr solution for difference in row values based on two factor levels in separate columns

I am trying to use dplyr to calculate the difference between two row values based on factor levels in large data frame. In practical terms, I want the vote distance between two groups across each party within each country. For the data below, I would like to end up with a data frame with rows indicating the difference between the vote values for each group pair for each party level within each country level. The lag function does not seem to work with my data as the number of factor levels varies by country, meaning each country has a different total number of groups and parties. A small sample of the setup is below.
df1 <- data.frame(id = c(1:12),
country = c("a","a","a","a","a","a","b","b","b","b","b","b"),
group = c("x","y","z","x","y","z","x","y","z","x","y","z"),
party = c("d","d","d","e","e","e","d","d","d","e","e","e"),
vote = c(.15,.02,.7, .5, .6, .22,.47,.33,.09,.83,.77,.66))
This is how I would like the end product to look.
df2 <- data.frame(id= c(1:12),
country = c("a","a","a","b","b","b","a","a","a","b","b","b"),
group1 = c("x","x","y","x","x","y","x","x","y","x","x","y"),
group2 = c("y","z","z","y","z","z","y","z","z","y","z","z"),
party = c("d","d","d","d","d","d","e","e","e","e","e","e"),
dist = c(.13,-.5,-.68,.14,.38,.24,-.1,.28,.38,.06,.17,.11))
I have tried dcast previously, and if I fill with the column I want, it doesn't line up and produces NA or 0 where there should be values. The lag function doesn't work in my case because the number of parties and groups is unique for each country and not fixed. Whenever I have tried different intervals for the lag, the values are in some instances compared across countries or across parties rather than across groups.
I have found solutions outside of dplyr but for parsimony in presenting code I am wondering if there is a way in dplyr. Also, the code I have is incredibly long and clunky, and uses six or seven packages just for this problem.
Thanks

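We can use combn to create the differences. Note that this fits inside mutate only because every country/party group in the sample has exactly three rows, so combn returns three pairs per group: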
library(dplyr)
df1 %>%
group_by(country, party) %>%
mutate(dist = combn(vote, 2, FUN = function(x) x[1] - x[2]))
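When the number of groups differs between countries or parties, the number of pairs no longer equals the number of rows, so mutate cannot hold the result. A possible sketch for that case, assuming dplyr 1.1.0 or later for reframe(), and using a reduced version of df1 in which party e has only two groups:

```r
library(dplyr)

# Reduced example: party d has three groups, party e only two
df_small <- data.frame(
  country = "a",
  group   = c("x", "y", "z", "x", "y"),
  party   = c("d", "d", "d", "e", "e"),
  vote    = c(.15, .02, .7, .5, .6)
)

pairs <- df_small %>%
  group_by(country, party) %>%
  # reframe() may return any number of rows per group;
  # combn(group, 2) is a 2-row matrix with one column per pair
  reframe(group1 = combn(group, 2)[1, ],
          group2 = combn(group, 2)[2, ],
          dist   = combn(vote, 2, FUN = function(x) x[1] - x[2]))

pairs
```

This also yields the group1 and group2 columns of the desired output. A country/party combination with a single group would make combn fail, so such cases would need to be filtered out first.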

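Another way is a self-join on country and party, which pairs every group with every other group:
library(dplyr)
# recent dplyr versions warn about the many-to-many match, which is intended here
df1 %>%
  left_join(df1 %>% select(-id), by = c("country", "party"), suffix = c("1", "2")) %>%
  # group1 < group2 keeps each pair once (x-y, x-z, y-z); != would also keep the mirrored pairs
  filter(group1 < group2) %>%
  mutate(dist = vote1 - vote2)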

Rolling correlation with 'grouped by' - Error: incorrect number of dimensions

I'm trying to calculate rolling correlations with a five year window based on daily stock data. My dataframe test consists of 20 columns, with "logRet3" being located in column #17 and "logMarRet3" in #18. I want to calculate the correlation of these two return measures.
What makes it difficult is the fact that I want the rolling correlation to be grouped by my share indicator "PERMNO" in column #1. By that I mean that the rolling correlation "restarts" whenever the time-series data of a particular stock ends.
Through research I came up with the following code, using the dplyr, zoo and magrittr packages:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[,logRet3], x[,logMarRet3]), fill = NA, align = "right"))
However, when I run this code, I get the following error:
Error in x[,logMarRet3]: Incorrect number of dimensions
Me being a newbie, I tried adjusting the code by deleting the ,:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[logRet3], x[logMarRet3]), fill = NA, align = "right"))
resulting in the following error (translated to English):
Error in x[logMarRet3]: Only zeros are allowed to be mixed with negative indices
Any help on how to fix these errors or alternative ways of calculating the rolling correlation by group would be greatly appreciated.
EDIT: Thanks to G. Grothendieck for pointing out some flaws in my question. I'm referring to his answer for reproducible input and will keep that in mind for further posts.

There are several problems:
rollapply applies to each column separately unless by.column = FALSE is used.
using test within group_by will not cause test to be subsetted. It will refer to the entire dataset. Use individual column names instead.
the column names in the code in the question must have quotes around them; otherwise, it is saying there are variables of those names containing the column names.
when posting to SO you need to reduce your problem to a complete reproducible example and post that. I have done it this time for you in the Note at the end.
With reference to the Note, use this code:
library(dplyr)
library(zoo)
mycor <- function(x) cor(x[, 1], x[, 2])
DF %>%
  group_by(stock) %>%
  mutate(Cor = rollapplyr(cbind(a, b), 4, mycor, by.column = FALSE, fill = NA)) %>%
  ungroup()
or this code which only uses zoo. mycor is from above.
library(zoo)
n <- nrow(DF)
roll <- function(i) rollapplyr(DF[i, c("a", "b")], 4, mycor, by.column = FALSE, fill = NA)
transform(DF, Cor = ave(1:n, stock, FUN = roll))
Note
The input in reproducible form is:
DF <- data.frame(stock = rep(LETTERS[1:2], each = 6), a = 1:6, b = (1:6)^3)
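To illustrate the first point in the list of problems (by.column), here is a small sketch on a made-up two-column matrix m. With the default by.column = TRUE the function receives one column at a time as a plain vector, which is why x[, logMarRet3] fails with "incorrect number of dimensions"; with by.column = FALSE it receives the whole window as a matrix.

```r
library(zoo)

# Made-up input: two columns, four rows
m <- cbind(a = 1:4, b = c(2, 4, 6, 9))

# Default: FUN is applied to each column separately (windows are vectors)
by_col <- rollapplyr(m, 3, sum)

# by.column = FALSE: FUN sees each 3-row window as a matrix,
# so both columns are available at once
by_window <- rollapplyr(m, 3, function(w) cor(w[, 1], w[, 2]),
                        by.column = FALSE)

by_col
by_window
```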

How to return a value from a variable based on a condition in another variable within a grouped data frame?

I am calculating some metrics on each of a set of variables within a grouped dataframe using the basic group_by() + summarize_at approach. Each group represents a small timeseries. One metric I would like to calculate is the initial value (in this case, day == 1) of each variable within each group. Thus, the generalized problem is to return a value of a variable based on a criterion in another variable, within groups of a grouped dataframe. Within the group_by() + summarize_at approach, I believe I need a custom function that summarize_at can then apply to each variable. I can successfully deploy other custom functions that depend only on the data variable at hand. I seem to be hung up on getting the function to go look in other columns of the dataframe.
I am not married to this approach, and welcome alternate recommendations. However, I am most comfortable with dplyr.
# a dataset
df <- data.frame(day = rep(c(1:5),3),
group = c(rep(1,5),rep(2,5),rep(3,5)),
var_a = seq(1:15),
var_b = seq(2,30, length.out = 15),
var_c = seq(3,45, length.out = 15))
# the logic of what I am going for, on a manually extracted example group:
# initial value (day == 1) of var_a for group 2
df_subset <- df %>%
filter(group == 2)
df_subset$var_a[which(df_subset$day == 1)]
# [1] 6
# my laughable attempt at a function
initial <- function(x){
ini <- which(.$day == 1)
x[ini]
}
# custom function deployed in dplyr pipe (which of course doesn't work)
df %>%
group_by(group) %>%
summarize_at(c("var_a","var_b","var_c"),
list(max = max, ini = initial))
Many thanks.

After the group_by step, select the variables in summarise_at with one of the select helpers (starts_with works fine here). Within the list, apply a different function to each selected column; ~ is a shorthand for an anonymous function, so there is no need to write function(x) explicitly. For the second function, day is not one of the selected columns, but it can still be referred to by its unquoted column name.
library(dplyr)
df %>%
group_by(group) %>%
summarise_at(vars(starts_with('var')),
list(max = ~max(.), ini = ~ .[day == 1]))
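The same idea can be written with summarise() and across(), which dplyr introduced in version 1.0.0 as the successor to the _at variants. This is a sketch on a smaller, made-up version of the data (df_small), not a replacement the original answer proposed:

```r
library(dplyr)

# Smaller stand-in for the question's df: two groups, three days each
df_small <- data.frame(day   = rep(1:3, 2),
                       group = rep(1:2, each = 3),
                       var_a = 1:6,
                       var_b = seq(2, 12, by = 2))

out <- df_small %>%
  group_by(group) %>%
  summarise(across(starts_with("var"),
                   # `day` is looked up in the current group, just like .x
                   list(max = ~ max(.x), ini = ~ .x[day == 1])))

out
```

Each function name in the list becomes a suffix, giving columns such as var_a_max and var_a_ini. The ini lambda assumes exactly one row with day == 1 per group.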

Programmatically choosing which variables to put into dplyr pipe

I'm working with datasets (from smartphone experience sampling) where I very frequently have to perform grouped operations (such as finding the variability of a measure within each person, or within each day within each person, etc.). Typical code might look like the code below, which calculates within-day variability for some variables, then takes the mean of the within-day variability and joins it to the original data.
output <- group_by(mydata, id, day) %>%
mutate_at(vars(angr, sad, guil, anx, hap), funs(sd(., na.rm = TRUE))) %>%
ungroup() %>%
group_by(id) %>%
summarize_at(vars(angr, sad, guil, anx, hap), funs('var_day_mean' = mean(., na.rm = TRUE))) %>%
join(mydata, .)
What I want to do is be able to save this as a function so that instead of having to type out angr, sad, guil, anx, hap many times over, I can call this code (and slight variations on it saved as different functions) on a vector of variable names in a string. So the desired functionality is:
vars <- c('angr', 'sad', 'guil', 'anx', 'hap')
output <- myfunc(vars)
Where myfunc performs the piped operations above.
I'm aware that there is a vignette for non standard evaluation using dplyr but it's very limited and doesn't cover mutate or most of what I need to do with this use case, so would appreciate any insight.
Reproducible example - what I desire is essentially that the below code work, but currently the dplyr pipe cannot take vars as a character vector the way I have input it.
Edit: I was mistaken - the below code does work, and dplyr can function in this way (and can also take character vectors to group_by, making this easy to program with). I leave the code below as a (working) reference.
data <- data.frame('ID' = rep(1:10, each = 10),
'day' = rep(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), 10),
'anx' = rnorm(100), 'sad' = rnorm(100), 'hap' = rnorm(100))
vars = c('anx', 'sad', 'hap')
out <- group_by(data, ID, day) %>%
mutate_at(vars, funs(sd(., na.rm = TRUE)))

With mutate_at you can simply supply the column names as a character vector. funs() has been deprecated since dplyr 0.8.0, so pass the function directly:
library(dplyr)
mtcars %>% mutate_at(c("mpg", "hp"), mean)
This should do the trick.
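For the function the question asks for, one possible sketch uses across() with all_of(), which turns a character vector into a column selection. The function name day_sd_mean and the toy data are assumptions for illustration; the grouping columns id and day are hardcoded as in the question's pipeline.

```r
library(dplyr)

# Hypothetical wrapper around the question's pipeline:
# within-day sd per person, then the mean of those values per person
day_sd_mean <- function(data, vars) {
  data %>%
    group_by(id, day) %>%
    mutate(across(all_of(vars), ~ sd(.x, na.rm = TRUE))) %>%
    group_by(id) %>%
    summarise(across(all_of(vars), ~ mean(.x, na.rm = TRUE),
                     .names = "{.col}_var_day_mean"))
}

# Toy data: one person, two days, two observations per day
toy <- data.frame(id  = 1,
                  day = c(1, 1, 2, 2),
                  anx = c(1, 3, 2, 2))

res <- day_sd_mean(toy, c("anx"))
res
```

The result has one row per id and can be joined back to the original data with left_join(data, res, by = "id") if needed.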
