My goal is to take a subset of columns (COL_NAMES) within a dataframe (LR_DATA) and apply a function (FUNCTION). The dataframe (LR_DATA) is mostly nested vectors except for one identifying column (var1). However, I cannot seem to correctly pass the override inputs which are nested under the current column name with the additional suffix "_OVERRIDE".
TRENDED_LR_DATA <- LR_DATA %>% mutate(across(all_of(COL_NAMES), ~list(FUNCTION(var1, unlist(var2), unlist(.x), unlist(!!sym(paste0(cur_column(), "_OVERRIDE")))))))
Specifically I get the error:
If I replace cur_column() with a hardcoded string the code works (though obviously not as intended since it would be referencing the same override column for each specified column in COL_NAMES. Any tips this group has would be greatly appreciated - I'm relatively new to R so please bear with me ^_^.
EDIT: Below is code to reproduce the error in full. Sorry for not including this on the original question submission.
library(dplyr)
LR_DATA <- data.frame(STATE = c(1,2,3),
YEAR = c(2000,2001,2002),
DEVT_A = c(2,4,6),
DEVT_B = c(3,6,9),
DEVT_C = c(4,8,12))
LOSS_COLS <- c("DEVT_A", "DEVT_B", "DEVT_C")
DATA_OVERRIDE <- data.frame(STATE = c(1,2,3),
DEVT_A_OVERRIDE = c(NaN,1,1),
DEVT_B_OVERRIDE = c(1,1,1),
DEVT_C_OVERRIDE = c(1.5,1.5,1.5))
LR_DATA <- LR_DATA %>% left_join(DATA_OVERRIDE, by = 'STATE')
TRENDED_LR_DATA <- LR_DATA %>% summarise(across(everything(), list), .groups = "keep") %>%
mutate(across(all_of(LOSS_COLS), ~list(TREND_LOSS(unlist(.x), unlist(YEAR), unlist( !!sym(paste0(cur_column(), "_OVERRIDE"))) ))))
TREND_LOSS <-
function(LOSSES,
YEARS,
OVERRIDES) {
x <- YEARS
y = log(LOSSES)
xy = x * y
x_sq = x * x
sum_x <- sum(x)
sum_y <- sum(y)
sum_xy <- sum(xy)
sum_x_sq <- sum(x_sq)
n <- length(YEARS)
Slope <- (n*sum_xy - sum_x*sum_y) / (n*sum_x_sq - sum_x*sum_x)
OVERRIDES[is.na(OVERRIDES)] <- Slope
TRENDED_LOSSES <- LOSSES*exp(OVERRIDES)
return(TRENDED_LOSSES)
}
}
Related
I am trying to apply fm.Choquet function (Rfmtool package) to my R data frame, but no success. The function works like this (ref. here):
# let x <- 0.6 (N = 1)
# and y <- c(0.3, 0.5). y elements are always 2 power N (here, 2)
# env<-fm.Init(1). env is propotional to N
# fm.Choquet(0.6, c(0.3, 0.5),env) gives a single value output
I have this sample data frame:
set.seed(123456)
a <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
b <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
c <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
df <- data.frame(a=a, b=b, c=c)
df$id <- seq_len(nrow(df))
I would like to apply fm.Choquet function to each row of my df such that, for each row (or ID), a is read as x, while b and c are read as y vector (N = 2), and add the function output as a new column for each row. However, I am getting the dimension error "The environment mismatches the dimension to the fuzzy measure.".
Here is my attempt.
df2 <- df %>% as_tibble() %>%
rowwise() %>%
mutate(ci = fm.Choquet(df$a,c(df[,2],df[,3]), env)) %>%
mutate(sum = rowSums(across(where(is.numeric)))) %>% # Also tried adding sum which works
as.matrix()
I am using dplyr::rowwise(), but I am open to looping or other suggestions. Can someone help me?
EDIT 1:
A relevant question is identified as a possible solution for the above question, but using one of the suggestions, by(), still throws the same error:
by(df, seq_len(nrow(df)), function(row) fm.Choquet(df$a,c(df$b,df$c), env))
set.seed(123456)
a <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
b <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
c <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
df <- data.frame(a = a, b = b, c = c)
df$id <- seq_len(nrow(df))
library(Rfmtool)
library(tidyverse)
env <- fm.Init(1)
map_dbl(
seq_len(nrow(df)),
~ {
row <- slice(df,.x)
fm.Choquet(
x = row$a,
v = c(row$b, row$c), env
)
}
)
I wrote some code to performed oversampling, meaning that I replicate my observations in a data.frame and add noise to the replicates, so they are not exactly the same anymore. I'm quite happy that it works now as intended, but...it is too slow. I'm just learning dplyr and have no clue about data.table, but I hope there is a way to improve my function. I'm running this code in a function for 100s of data.frames which may contain about 10,000 columns and 400 rows.
This is some toy data:
library(tidyverse)
train_set1 <- rep(0, 300)
train_set2 <- rep("Factor1", 300)
train_set3 <- data.frame(replicate(1000, sample(0:1, 300, rep = TRUE)))
train_set <- cbind(train_set1, train_set2, train_set3)
row.names(train_set) <- c(paste("Sample", c(1:nrow(train_set)), sep = "_"))
This is the code to replicate each row a given number of times and a function to determine whether the added noise later will be positive or negative:
# replicate each row twice, added row.names contain a "."
train_oversampled <- train_set[rep(seq_len(nrow(train_set)), each = 3), ]
# create a flip function
flip <- function() {
sample(c(-1,1), 1)
}
In the relevant "too slow" piece of code, I'm subsetting the row.names for the added "." to filter for the replicates. Than I select only the numeric columns. I go through those columns row by row and leave the values untouched if they are 0. If not, a certain amount is added (here +- 1 %). Later on, I combine this data set with the original data set and have my oversampled data.frame.
# add percentage of noise to non-zero values in numerical columns
noised_copies <- train_oversampled %>%
rownames_to_column(var = "rowname") %>%
filter(grepl("\\.", row.names(train_oversampled))) %>%
rowwise() %>%
mutate_if(~ is.numeric(.), ~ if_else(. == 0, 0,. + (. * flip() * 0.01 ))) %>%
ungroup() %>%
column_to_rownames(var = "rowname")
# combine original and oversampled, noised data set
train_noised <- rbind(noised_copies, train_set)
I assume there are faster ways using e.g. data.table, but it was already tough work to get this code running and I have no idea how to improve its performance.
EDIT:
The solution is working perfectly fine with fixed values, but called within a for loop I receive "Error in paste(Sample, n, sep = ".") : object 'Sample' not found"
Code to replicate:
library(data.table)
train_set <- data.frame(
x = c(rep(0, 10)),
y = c(0:9),
z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = train_set, cc = train_set)
for(current_table in train_list) {
setDT(current_table, keep.rownames="Sample")
cols <- names(current_table)[sapply(current_table, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(current_table)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
# As this is an example, I did not write anything to actually
# store the results, so I have to remove the object
rm(train_noised)
}
Any ideas why the column Sample can't be found now?
Here is a more vectorized approach using data.table:
library(data.table)
setDT(train_set, keep.rownames="Sample")
cols <- names(train_set)[sapply(train_set, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(train_set)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, list(train_set)), use.names=FALSE)
With data.table version >= 1.12.9, you can pass is.numeric directly to .SDcols argument and maybe a shorter way (e.g. (.SD) or names(.SD)) to pass to the left hand side of :=
address OP's updated post:
The issue is that although each data.frame within the list is converted to a data.table, the train_list is not updated. You can update the list with a left bind before the for loop:
library(data.table)
train_set <- data.frame(
x = c(rep(0, 10)),
y = c(0:9),
z = c(rep("Factor1", 10)))
# changing the row name to avoid confusion with "Sample"
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
train_list <- list(aa = train_set, bb = copy(train_set), cc = copy(train_set))
train_list <- lapply(train_list, setDT, keep.rownames="Sample")
for(current_table in train_list) {
cols <- names(current_table)[sapply(current_table, is.numeric)]
noised_copies <- lapply(c(1,2), function(n) {
copy(current_table)[,
c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
.SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
.SDcols=cols]
})
train_noised <- rbindlist(c(noised_copies, train_list), use.names=FALSE)
# As this is an example, I did not write anything to actually
# store the results, so I have to remove the object
rm(train_noised)
}
Good evening,
I asked a question earlier and found it hard to implement the solution so I am gonna reask it in a more clear way.
I have the problem, that I want to add a column to a dataframe of daily returns of a stock. Lets say its normally distributed and I would like to add a column that contains the value at risk (hist) whose function I wrote myself.
The restriction is that each observation should be assigned to my function and take the last 249 observations as well.
So when the next observation is calculated it should also take only the last 249 observations of the das before. So the input values should move as the time goes on. In other words I want values from 251 days ago to be excluded. Hopefully I explained myself well enough. If not maybe the code speaks for me:
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist<- function(x, n=250, hd=20, q=0.05){
width<-nrow(x)
NA.x<-na.omit(x)
quantil<-quantile(NA.x[(width-249):width],probs=q)
VaR<- quantil*sqrt(hd)%>%
return()
}
# Run the function on the dataframe
df$VaR<- df$Returns%>%VaR.hist()
Error in (width - 249):width : argument of length 0
This is the Error code that I get and not my new Variable...
Thanks !!
As wibom wrote in the comment nrow(x) does not work for vectors. What you need is length() instead. Also you do not need return() in the last line as R automatically returns the last line of a function if there is no early return() before.
library(dplyr)
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist <- function(x, n=250, hd=20, q=0.05){
width <- length(x) # here you need length as x is a vector, nrow only works for data.frames/matrixes
NA.x <- na.omit(x)
quantil <- quantile(NA.x[(width-249):width], probs = q)
quantil*sqrt(hd)
}
# Run the function on the dataframe
df$VaR <- df$Returns %>% VaR.hist()
It's a bit hard to understand what you want to do exactly.
My understanding is that you wish to compute a new variable VarR, calculated based on the current and previous 249 observations of df$Returns, right?
Is this about what you wish to do?:
library(tidyverse)
set.seed(42)
df <- tibble(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns=rnorm(500)
)
the_function <- function(i, mydata, hd = 20, q = .05) {
r <-
mydata %>%
filter(ridx <= i, ridx > i - 249) %>%
pull(Returns)
quantil <- quantile(r, probs = q)
VaR <- quantil*sqrt(hd)
}
df <-
df %>%
mutate(ridx = row_number()) %>%
mutate(VaR = map_dbl(ridx, the_function, mydata = .))
If you are looking for a base-R solution:
set.seed(42)
df <- data.frame(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns = rnorm(500)
)
a_function <- function(i, mydata, hd = 20, q = .05) {
r <- mydata$Returns[mydata$ridx <= i & mydata$ridx > (i - 249)]
quantil <- quantile(r, probs = q)
VaR <- quantil*sqrt(hd)
}
df$ridx <- 1:nrow(df) # add index
df$VaR <- sapply(df$ridx, a_function, mydata = df)
I have two lists of dataframes, the first list of dfs hold values that extend down the column and the second list of dfs holds single values like this:
dynamic_df_1 <- data.frame(x = 1:10)
dynamic_df_2 <- data.frame(y = 1:10)
df_list <- list(dynamic_df_1, dynamic_df_2)
df_list
static_df_1 <- data.frame(mu = 10,
stdev = 5)
static_df_2 <- data.frame(mu = 12,
stdev = 6)
static_df_list <- list(stat_df1 = static_df_1,
stat_df2 = static_df_2)
static_df_list
I would like to add a column to each dataframe (dynamic_df_1 and dynamic_df_2) using values from static_df_1 and static_df_2 to perform the calculation where the calculation for dynamic_df_1 computes with static_df_1 and the calculation for dynamic_df_2 computes with static_df_2.
The result I'm aiming for is this:
df_list[[1]] <- df_list[[1]] %>%
mutate(z = dnorm(x = df_list[[1]]$x, mean = static_df_list$stat_df1$mu, sd = static_df_list$stat_df1$stdev))
df_list
df_list[[2]] <- df_list[[2]] %>%
mutate(z = dnorm(x = df_list[[2]]$y, mean = static_df_list$stat_df2$mu, sd = static_df_list$stat_df2$stdev))
df_list
I can take a loop approach which gets messy with more complex functions in my real code:
for (i in 1:length(df_list)) {
df_list[[i]]$z <- dnorm(x = df_list[[i]][[1]], mean = static_df_list[[i]]$mu, sd = static_df_list[[i]]$stdev)
}
df_list
I'm trying to find an lapply / map / mutate type solution that calculates across dataframes - imagine a grid of dataframes where the objective is to calculate across rows. Also open to other solutions such as single df with nested values but haven't figured out how to do that yet.
Hope that is clear - I did my best!
Thanks!
This Map solution seems to be simpler. And the results are identical(). The code that creates df_list2 and df_list3 follows below.
df_list4 <- df_list
fun <- function(DF, Static_DF){
DF[["z"]] = dnorm(DF[[1]], mean = Static_DF[["mu"]], sd = Static_DF[["stdev"]])
DF
}
df_list4 <- Map(fun, df_list4, static_df_list)
identical(df_list2, df_list3)
#[1] TRUE
identical(df_list2, df_list4)
#[1] TRUE
Data.
After running the question's code that creates the initial df_list, run the dplyr pipe and for loop code:
df_list2 <- df_list
df_list2[[1]] <- df_list2[[1]] %>%
mutate(z = dnorm(x = df_list2[[1]]$x, mean = static_df_list$stat_df1$mu, sd = static_df_list$stat_df1$stdev))
df_list2[[2]] <- df_list2[[2]] %>%
mutate(z = dnorm(x = df_list2[[2]]$y, mean = static_df_list$stat_df2$mu, sd = static_df_list$stat_df2$stdev))
df_list3 <- df_list
for (i in 1:length(df_list3)) {
df_list3[[i]]$z <- dnorm(x = df_list3[[i]][[1]], mean = static_df_list[[i]]$mu, sd = static_df_list[[i]]$stdev)
}
Below is my data
set.seed(100)
toydata <- data.frame(A = sample(1:50,50,replace = T),
B = sample(1:50,50,replace = T),
C = sample(1:50,50,replace = T)
)
Below is my swapping function
derangement <- function(x){
if(max(table(x)) > length(x)/2) return(NA)
while(TRUE){
y <- sample(x)
if(all(y != x)) return(y)
}
}
swapFun <- function(x, n = 10){
inx <- which(x < n)
y <- derangement(x[inx])
if(length(y) == 1) return(NA)
x[inx] <- y
x
}
In the first case,I get the new data toy by swapping the entire dataframe. The code is below:
toydata<-as.matrix(toydata)
toy<-swapFun(toydata)
toy<-as.data.frame(toy)
In the second case, I get the new data toy by swapping each column respectively. Below is the code:
toydata<-as.data.frame(toydata)
toy2 <- toydata # Work with a copy
toy2[] <- lapply(toydata, swapFun)
toy<-toy2
Below is the function that can output the difference of contigency table after swapping.
# the function to compare contingency tables
f = function(x,y){
table1<-table(toydata[,x],toydata[,y])
table2<-table(toy[,x],toy[,y])
sum(abs(table1-table2))
}
# vectorise your function
f = Vectorize(f)
combn(x=names(toydata),
y=names(toydata), 2) %>%# create all combinations of your column names
t() %>% # transpose
data.frame(., stringsAsFactors = F) %>% # save as dataframe
filter(X1 != X2) %>% # exclude pairs of same
# column
mutate(SumAbs = f(X1,X2)) # apply function
In the second case, this mutate function works.
But in the first case, this mutatefunction does not work. It says:
+ filter(X1 != X2) %>% # exclude pairs of same column
+ mutate(SumAbs = f(X1,X2)) # apply function
Error in combn(x = names(toydata), y = names(toydata), 2) : n < m
However in the two cases, the toy data are all dataframes with the same dimension, the same row names and the same column names. I feel confused.
How can I fix it? Thanks.