I've found the solution to this question for either grouped lags or lagging one group multiple times, but not both. For this example, I'd like to use the following dataframe:
df <- data.frame(doy = rep((1:2), each = 6), years = rep(2017:2022), var = rnorm(6))
How can I create multiple lags (lvar1, lvar2, lvar3) that exist for two different groups, so that the result is that my 2017 group 2 does not have the lagged variable for 2022 in group 1?
library(dplyr)
df <- data.frame(doy = rep((1:2), each = 6), years = rep(2017:2022), var = rnorm(6))
# Factory: returns a function that lags its input by i positions
make_lag <- function(i) {
force(i)
function(x) dplyr::lag(x, i)
}
lags <- 1:3
lag_fn <- lapply(lags, make_lag)
names(lag_fn) <- lags
# Group by doy so that a lag never reaches back into the previous group
df %>% group_by(doy) %>%
mutate(across(var, lag_fn, .names = "lvar{.fn}"))
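As an alternative sketch, assuming the data.table package is acceptable, shift() takes a vector of lags directly and by = restarts the lag inside each group. The deterministic var column and the names dt and cols are illustrative assumptions, not part of the question.

```r
library(data.table)

# Hypothetical deterministic data so the lags are easy to check by eye
dt <- data.table(
  doy   = rep(1:2, each = 6),
  years = rep(2017:2022, times = 2),
  var   = 1:12
)

lags <- 1:3
cols <- paste0("lvar", lags)

# shift() returns one lagged vector per element of n;
# by = doy restarts the lag within each group
dt[, (cols) := shift(var, n = lags, type = "lag"), by = doy]

print(dt)
```

With this layout the 2017 row of group 2 gets NA in every lag column instead of borrowing the 2022 value from group 1.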
My goal is to take a subset of columns (COL_NAMES) within a dataframe (LR_DATA) and apply a function (FUNCTION). The dataframe (LR_DATA) is mostly nested vectors except for one identifying column (var1). However, I cannot seem to correctly pass the override inputs which are nested under the current column name with the additional suffix "_OVERRIDE".
TRENDED_LR_DATA <- LR_DATA %>% mutate(across(all_of(COL_NAMES), ~list(FUNCTION(var1, unlist(var2), unlist(.x), unlist(!!sym(paste0(cur_column(), "_OVERRIDE")))))))
Specifically I get the error:
If I replace cur_column() with a hardcoded string, the code works (though obviously not as intended, since it would reference the same override column for every column in COL_NAMES). Any tips this group has would be greatly appreciated - I'm relatively new to R so please bear with me ^_^.
EDIT: Below is code to reproduce the error in full. Sorry for not including this on the original question submission.
library(dplyr)
LR_DATA <- data.frame(STATE = c(1,2,3),
YEAR = c(2000,2001,2002),
DEVT_A = c(2,4,6),
DEVT_B = c(3,6,9),
DEVT_C = c(4,8,12))
LOSS_COLS <- c("DEVT_A", "DEVT_B", "DEVT_C")
DATA_OVERRIDE <- data.frame(STATE = c(1,2,3),
DEVT_A_OVERRIDE = c(NaN,1,1),
DEVT_B_OVERRIDE = c(1,1,1),
DEVT_C_OVERRIDE = c(1.5,1.5,1.5))
LR_DATA <- LR_DATA %>% left_join(DATA_OVERRIDE, by = 'STATE')
TRENDED_LR_DATA <- LR_DATA %>% summarise(across(everything(), list), .groups = "keep") %>%
mutate(across(all_of(LOSS_COLS), ~list(TREND_LOSS(unlist(.x), unlist(YEAR), unlist( !!sym(paste0(cur_column(), "_OVERRIDE"))) ))))
TREND_LOSS <-
function(LOSSES,
YEARS,
OVERRIDES) {
x <- YEARS
y = log(LOSSES)
xy = x * y
x_sq = x * x
sum_x <- sum(x)
sum_y <- sum(y)
sum_xy <- sum(xy)
sum_x_sq <- sum(x_sq)
n <- length(YEARS)
Slope <- (n*sum_xy - sum_x*sum_y) / (n*sum_x_sq - sum_x*sum_x)
OVERRIDES[is.na(OVERRIDES)] <- Slope
TRENDED_LOSSES <- LOSSES*exp(OVERRIDES)
return(TRENDED_LOSSES)
}
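A likely cause is that `!!` is resolved when mutate() captures its arguments, which happens before across() has a current column, so cur_column() is called out of context. One way around it, sketched here under the assumption that the nested list columns are not needed, is to skip tidy evaluation and loop over the column names in base R. TREND_LOSS is condensed from the question; the loop itself is an illustration, not the asker's method.

```r
# Condensed version of the question's TREND_LOSS
TREND_LOSS <- function(LOSSES, YEARS, OVERRIDES) {
  x <- YEARS
  y <- log(LOSSES)
  n <- length(YEARS)
  # Least-squares slope of log(losses) on year
  Slope <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x * x) - sum(x)^2)
  # Missing overrides (NA or NaN) fall back to the fitted slope
  OVERRIDES[is.na(OVERRIDES)] <- Slope
  LOSSES * exp(OVERRIDES)
}

LR_DATA <- data.frame(STATE = c(1, 2, 3),
                      YEAR = c(2000, 2001, 2002),
                      DEVT_A = c(2, 4, 6),
                      DEVT_B = c(3, 6, 9),
                      DEVT_C = c(4, 8, 12))
DATA_OVERRIDE <- data.frame(STATE = c(1, 2, 3),
                            DEVT_A_OVERRIDE = c(NaN, 1, 1),
                            DEVT_B_OVERRIDE = c(1, 1, 1),
                            DEVT_C_OVERRIDE = c(1.5, 1.5, 1.5))
LOSS_COLS <- c("DEVT_A", "DEVT_B", "DEVT_C")

LR_DATA <- merge(LR_DATA, DATA_OVERRIDE, by = "STATE")

# Build each override column name from the loss column name
TRENDED_LR_DATA <- LR_DATA
for (col in LOSS_COLS) {
  TRENDED_LR_DATA[[col]] <- TREND_LOSS(
    LR_DATA[[col]],
    LR_DATA$YEAR,
    LR_DATA[[paste0(col, "_OVERRIDE")]]
  )
}
```

Each loss column is overwritten with its trended values, mirroring what the mutate(across(...)) call was meant to do.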
I have a simulation over time (dev_quarters) that looks like this, which is a data.table :
simulation <- data.table(`Scenario ID` = 1, dev_quarter = seq(1:80), brand = 1, proportion = runif(80))
For each scenario, we have n_brand, n_scenario and a proportion.
I am trying to compute the following: for each scenario and each brand, the difference in proportion between the beginning and the end of the year, for every year.
I wrote the following to recover the dev_quarters that correspond to each year:
x <- 2002:2021
lookup_T <- as.integer(format(Sys.Date(), "%Y"))
lookup_period <- data.table(years = lookup_T-x+1, quarters_t = (lookup_T-x+1)*4, quarters_t1 = (lookup_T-x+2)*4)
With a small example
n_scenario <- 1
n_brand <- 10
Some ugly code that uses for loops:
result <- data.table(`Scenario ID` = numeric(), years = numeric(), brand = numeric(), proportion = numeric())
for(i in 1:n_scenario){
for(j in 1:n_brand){
prop_per_year <- c()
# for each year
for(k in 1:length(x)){
year <- lookup_period[k, ]
quarter_start_year <- year[["quarters_t"]]
quarter_end_year <- year[["quarters_t1"]]
end_year_prop <- simulation[`Scenario ID`==i & brand==j & dev_quarter==quarter_end_year]
start_year_prop <- simulation[`Scenario ID`==i & brand==j & dev_quarter==quarter_start_year]
prop_this_year <- max(end_year_prop[["proportion"]] - start_year_prop[["proportion"]], 0)
prop_per_year <- append(prop_per_year, prop_this_year)
}
result_temp <- data.table(`Scenario ID` = i, years = x, brand = j, proportion = prop_per_year)
result <- rbind(result, result_temp)
}
}
I considered filtering my data.table to keep only the rows where dev_quarter is a multiple of 4, but the problem with the for loops would remain.
How can I avoid them using data.table?
Thanks.
The absolute change in proportion between the 4th and 1st quarter can be calculated much more easily.
simulation[, year := 2002 + (dev_quarter-1) %/% 4] # Easier way to calculate the year
simulation[, .(change = last(proportion) - first(proportion)), by = c("Scenario ID", "brand", "year")]
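To see what the grouping does, here is a small sketch on hypothetical deterministic data (eight quarters, proportions rising by 0.01 each quarter). The max(..., 0) floor is taken from the loop in the question. Whether the first and last quarter of a calendar year are the right endpoints is an assumption carried over from this answer.

```r
library(data.table)

# Hypothetical toy data: two years of quarters with a steadily rising proportion
simulation <- data.table(
  `Scenario ID` = 1,
  dev_quarter   = 1:8,
  brand         = 1,
  proportion    = (1:8) / 100
)

# Quarters 1-4 map to 2002, quarters 5-8 to 2003, and so on
simulation[, year := 2002 + (dev_quarter - 1) %/% 4]

# One row per scenario, brand and year; negative changes are floored at zero
result <- simulation[
  ,
  .(change = max(last(proportion) - first(proportion), 0)),
  by = c("Scenario ID", "brand", "year")
]

print(result)
```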
Good evening,
I asked a question earlier and found it hard to implement the solution, so I am going to ask it again more clearly.
My problem is that I want to add a column to a dataframe of daily stock returns. Let's say the returns are normally distributed, and I would like to add a column that contains the historical value at risk, computed by a function I wrote myself.
The restriction is that, for each observation, my function should use that observation together with the previous 249 observations.
So when the next observation is calculated, it should again use only the 249 observations of the days before it; the input window should move as time goes on. In other words, I want values from 251 days ago to be excluded. Hopefully I explained myself well enough. If not, maybe the code speaks for me:
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist<- function(x, n=250, hd=20, q=0.05){
width<-nrow(x)
NA.x<-na.omit(x)
quantil<-quantile(NA.x[(width-249):width],probs=q)
VaR<- quantil*sqrt(hd)%>%
return()
}
# Run the function on the dataframe
df$VaR<- df$Returns%>%VaR.hist()
Error in (width - 249):width : argument of length 0
This is the Error code that I get and not my new Variable...
Thanks !!
As wibom wrote in the comments, nrow(x) does not work for vectors; you need length() instead. You also do not need return() on the last line, because R automatically returns the value of the last evaluated expression when there is no earlier return().
library(dplyr)
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist <- function(x, n=250, hd=20, q=0.05){
width <- length(x) # x is a vector, so use length(); nrow() only works for data.frames/matrices
NA.x <- na.omit(x)
quantil <- quantile(NA.x[(width-249):width], probs = q)
quantil*sqrt(hd)
}
# Run the function on the dataframe
df$VaR <- df$Returns %>% VaR.hist()
It's a bit hard to understand what you want to do exactly.
My understanding is that you wish to compute a new variable VaR, calculated from the current and the previous 249 observations of df$Returns, right?
Is this about what you wish to do?:
library(tidyverse)
set.seed(42)
df <- tibble(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns=rnorm(500)
)
the_function <- function(i, mydata, hd = 20, q = .05) {
# Returns in the window ending at row i: the current row plus up to 249 earlier rows
r <-
mydata %>%
filter(ridx <= i, ridx > i - 250) %>%
pull(Returns)
quantil <- quantile(r, probs = q)
quantil * sqrt(hd)
}
df <-
df %>%
mutate(ridx = row_number()) %>%
mutate(VaR = map_dbl(ridx, the_function, mydata = .))
If you are looking for a base-R solution:
set.seed(42)
df <- data.frame(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns = rnorm(500)
)
a_function <- function(i, mydata, hd = 20, q = .05) {
# Current row plus up to 249 earlier rows
r <- mydata$Returns[mydata$ridx <= i & mydata$ridx > (i - 250)]
quantil <- quantile(r, probs = q)
quantil * sqrt(hd)
}
df$ridx <- seq_len(nrow(df)) # add index
df$VaR <- sapply(df$ridx, a_function, mydata = df)
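Both versions above compute a value for the early rows from a shorter, partial window. If a full 250-observation window is required instead, a minimal base-R sketch could return NA until enough history exists. The function name rolling_var and the NA convention are assumptions for illustration, not something the question specifies.

```r
set.seed(42)
returns <- rnorm(500)

# Rolling historical VaR over a fixed window of n observations
rolling_var <- function(x, n = 250, hd = 20, q = 0.05) {
  vapply(seq_along(x), function(i) {
    if (i < n) return(NA_real_)        # not enough history yet
    window <- x[(i - n + 1):i]         # current value plus the n - 1 before it
    unname(quantile(window, probs = q, na.rm = TRUE)) * sqrt(hd)
  }, numeric(1))
}

VaR <- rolling_var(returns)
```

The result has the same length as the input, so it can be assigned straight to a new column such as df$VaR.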
I have two lists of dataframes: the dataframes in the first list hold values that extend down a column, and those in the second list hold single values, like this:
dynamic_df_1 <- data.frame(x = 1:10)
dynamic_df_2 <- data.frame(y = 1:10)
df_list <- list(dynamic_df_1, dynamic_df_2)
df_list
static_df_1 <- data.frame(mu = 10,
stdev = 5)
static_df_2 <- data.frame(mu = 12,
stdev = 6)
static_df_list <- list(stat_df1 = static_df_1,
stat_df2 = static_df_2)
static_df_list
I would like to add a column to each dataframe (dynamic_df_1 and dynamic_df_2) using values from static_df_1 and static_df_2 to perform the calculation where the calculation for dynamic_df_1 computes with static_df_1 and the calculation for dynamic_df_2 computes with static_df_2.
The result I'm aiming for is this:
df_list[[1]] <- df_list[[1]] %>%
mutate(z = dnorm(x = df_list[[1]]$x, mean = static_df_list$stat_df1$mu, sd = static_df_list$stat_df1$stdev))
df_list
df_list[[2]] <- df_list[[2]] %>%
mutate(z = dnorm(x = df_list[[2]]$y, mean = static_df_list$stat_df2$mu, sd = static_df_list$stat_df2$stdev))
df_list
I can take a loop approach which gets messy with more complex functions in my real code:
for (i in 1:length(df_list)) {
df_list[[i]]$z <- dnorm(x = df_list[[i]][[1]], mean = static_df_list[[i]]$mu, sd = static_df_list[[i]]$stdev)
}
df_list
I'm trying to find an lapply / map / mutate type solution that calculates across dataframes - imagine a grid of dataframes where the objective is to calculate across rows. Also open to other solutions such as single df with nested values but haven't figured out how to do that yet.
Hope that is clear - I did my best!
Thanks!
This Map solution seems simpler, and its results are identical() to those of the dplyr pipe and the for loop. The code that creates df_list2 and df_list3 follows below.
df_list4 <- df_list
fun <- function(DF, Static_DF){
DF[["z"]] = dnorm(DF[[1]], mean = Static_DF[["mu"]], sd = Static_DF[["stdev"]])
DF
}
df_list4 <- Map(fun, df_list4, static_df_list)
identical(df_list2, df_list3)
#[1] TRUE
identical(df_list2, df_list4)
#[1] TRUE
Data.
After running the question's code that creates the initial df_list, run the dplyr pipe and for loop code:
df_list2 <- df_list
df_list2[[1]] <- df_list2[[1]] %>%
mutate(z = dnorm(x = df_list2[[1]]$x, mean = static_df_list$stat_df1$mu, sd = static_df_list$stat_df1$stdev))
df_list2[[2]] <- df_list2[[2]] %>%
mutate(z = dnorm(x = df_list2[[2]]$y, mean = static_df_list$stat_df2$mu, sd = static_df_list$stat_df2$stdev))
df_list3 <- df_list
for (i in 1:length(df_list3)) {
df_list3[[i]]$z <- dnorm(x = df_list3[[i]][[1]], mean = static_df_list[[i]]$mu, sd = static_df_list[[i]]$stdev)
}
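Since the question also asks for a map / mutate style solution, the same pairing can be sketched with purrr::map2(), assuming purrr and dplyr are available. The name df_list5 is hypothetical, and the first column of each dynamic dataframe is assumed to hold the values, as in the for loop.

```r
library(dplyr)
library(purrr)

df_list <- list(data.frame(x = 1:10), data.frame(y = 1:10))
static_df_list <- list(
  stat_df1 = data.frame(mu = 10, stdev = 5),
  stat_df2 = data.frame(mu = 12, stdev = 6)
)

# map2() walks both lists in parallel:
# d is a dynamic dataframe, s is its static partner
df_list5 <- map2(df_list, static_df_list, function(d, s) {
  mutate(d, z = dnorm(d[[1]], mean = s$mu, sd = s$stdev))
})
```

Keeping the per-pair logic in one function means more complex calculations only change the function body, not the iteration.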
I just discovered the power of plyr (see "frequency table with several variables in R"),
and I am still struggling to understand how it works, so I hope someone here can help me.
I would like to create a table (data frame) in which I can combine frequencies and summary stats but without hard-coding the values.
Here an example dataset
require(datasets)
d1 <- sleep
# I classify the variable extra to calculate the frequencies
extraClassified <- cut(d1$extra, breaks = 3, labels = c('low', 'medium', 'high') )
d1 <- data.frame(d1, extraClassified)
The results I am looking for should look like this:
require(plyr)
ddply(d1, "group", summarise,
All = length(ID),
nLow = sum(extraClassified == "low"),
nMedium = sum(extraClassified == "medium"),
nHigh = sum(extraClassified == "high"),
PctLow = round(sum(extraClassified == "low")/ length(ID), digits = 1),
PctMedium = round(sum(extraClassified == "medium")/ length(ID), digits = 1),
PctHigh = round(sum(extraClassified == "high")/ length(ID), digits = 1),
xmean = round(mean(extra), digits = 1),
xsd = round(sd(extra), digits = 1))
My question: how can I do this without hard-coding the values?
For the record:
I tried this code, but it does not work:
ddply (d1, "group",
function(i) c(table(i$extraClassified),
prop.table(as.character(i$extraClassified))),
)
Thanks in advance
Here's an example to get you started:
foo <- function(x,colfac,colval){
tbl <- table(x[,colfac])
res <- cbind(n = nrow(x),t(tbl),t(prop.table(tbl)))
colnames(res)[5:7] <- paste(colnames(res)[5:7],"Pct",sep = "")
res <- as.data.frame(res)
res$mn <- mean(x[,colval])
res$sd <- sd(x[,colval])
res
}
ddply(d1,.(group),foo,colfac = "extraClassified",colval = "extra")
Don't take anything in that function foo as gospel. I just wrote that off the top of my head. Surely improvements/modifications are possible, but at least it's something to start with.
Thanks to Joran.
I slightly modified your function to make it more generic (without reference to the position of the variables).
require(plyr)
foo <- function(x,colfac,colval)
{
# table with frequencies
tbl <- table(x[,colfac])
# table with percentages
tblpct <- t(prop.table(tbl))
colnames( tblpct) <- paste(colnames(t(tbl)), 'Pct', sep = '')
# put the first part together
res <- cbind(n = nrow(x), t(tbl), tblpct)
res <- as.data.frame(res)
# add summary statistics
res$mn <- mean(x[,colval])
res$sd <- sd(x[,colval])
res
}
ddply(d1,.(group),foo,colfac = "extraClassified",colval = "extra")
and it works !!!
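For readers without plyr, the same idea (counts and shares taken from table() rather than hard-coded levels) can be sketched in base R with split() and lapply(). The helper name summarise_group and the n_ / pct_ column prefixes are assumptions for illustration.

```r
d1 <- datasets::sleep
d1$extraClassified <- cut(d1$extra, breaks = 3,
                          labels = c("low", "medium", "high"))

# Summarise one group: counts and shares for every factor level, plus mean and sd
summarise_group <- function(x) {
  tbl <- table(x$extraClassified)            # counts per level, zeros included
  pct <- round(prop.table(tbl), digits = 1)  # shares per level
  counts <- setNames(as.list(as.vector(tbl)), paste0("n_", names(tbl)))
  shares <- setNames(as.list(as.vector(pct)), paste0("pct_", names(pct)))
  data.frame(
    group = x$group[1],
    All   = nrow(x),
    counts,
    shares,
    xmean = round(mean(x$extra), digits = 1),
    xsd   = round(sd(x$extra), digits = 1)
  )
}

# Apply the helper to each group and stack the one-row results
res <- do.call(rbind, lapply(split(d1, d1$group), summarise_group))
print(res)
```

Because the column names come from the factor levels, adding or renaming a class in cut() changes the output columns without touching the summary function.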
P.S : I still do not understand what (group) stands for but