Calculations across dataframes


I have two lists of dataframes. The dataframes in the first list hold values that extend down a column, and those in the second list hold single values, like this:
dynamic_df_1 <- data.frame(x = 1:10)
dynamic_df_2 <- data.frame(y = 1:10)
df_list <- list(dynamic_df_1, dynamic_df_2)
df_list
static_df_1 <- data.frame(mu = 10,
stdev = 5)
static_df_2 <- data.frame(mu = 12,
stdev = 6)
static_df_list <- list(stat_df1 = static_df_1,
stat_df2 = static_df_2)
static_df_list
I would like to add a column to each dataframe (dynamic_df_1 and dynamic_df_2), computed from the values in the matching static dataframe: the calculation for dynamic_df_1 uses static_df_1, and the calculation for dynamic_df_2 uses static_df_2.
The result I'm aiming for is this:
library(dplyr) # provides %>% and mutate()
df_list[[1]] <- df_list[[1]] %>%
mutate(z = dnorm(x = df_list[[1]]$x, mean = static_df_list$stat_df1$mu, sd = static_df_list$stat_df1$stdev))
df_list
df_list[[2]] <- df_list[[2]] %>%
mutate(z = dnorm(x = df_list[[2]]$y, mean = static_df_list$stat_df2$mu, sd = static_df_list$stat_df2$stdev))
df_list
I can take a loop approach which gets messy with more complex functions in my real code:
for (i in 1:length(df_list)) {
df_list[[i]]$z <- dnorm(x = df_list[[i]][[1]], mean = static_df_list[[i]]$mu, sd = static_df_list[[i]]$stdev)
}
df_list
I'm trying to find an lapply / map / mutate type solution that calculates across dataframes - imagine a grid of dataframes where the objective is to calculate across rows. Also open to other solutions such as single df with nested values but haven't figured out how to do that yet.
Hope that is clear - I did my best!
Thanks!

This Map() solution seems simpler, and its results are identical() to those of the dplyr pipe and the for loop. The code that creates df_list2 and df_list3 is in the Data section below.
df_list4 <- df_list
fun <- function(DF, Static_DF){
DF[["z"]] = dnorm(DF[[1]], mean = Static_DF[["mu"]], sd = Static_DF[["stdev"]])
DF
}
df_list4 <- Map(fun, df_list4, static_df_list)
identical(df_list2, df_list3)
#[1] TRUE
identical(df_list2, df_list4)
#[1] TRUE
Data.
After running the question's code that creates the initial df_list and static_df_list, run the dplyr pipe and for loop code:
library(dplyr)
df_list2 <- df_list
df_list2[[1]] <- df_list2[[1]] %>%
mutate(z = dnorm(x = df_list2[[1]]$x, mean = static_df_list$stat_df1$mu, sd = static_df_list$stat_df1$stdev))
df_list2[[2]] <- df_list2[[2]] %>%
mutate(z = dnorm(x = df_list2[[2]]$y, mean = static_df_list$stat_df2$mu, sd = static_df_list$stat_df2$stdev))
df_list3 <- df_list
for (i in 1:length(df_list3)) {
df_list3[[i]]$z <- dnorm(x = df_list3[[i]][[1]], mean = static_df_list[[i]]$mu, sd = static_df_list[[i]]$stdev)
}
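The question also asks about using a single data frame in place of two parallel lists. A minimal base R sketch of that idea follows. It assumes each dynamic data frame keeps its values in its first column (as in the example data). The names long, params and idx are illustrative and do not come from the original posts.

```r
# Example data from the question
df_list <- list(data.frame(x = 1:10), data.frame(y = 1:10))
static_df_list <- list(
  stat_df1 = data.frame(mu = 10, stdev = 5),
  stat_df2 = data.frame(mu = 12, stdev = 6)
)

# Stack the dynamic data frames into one long table, tagging each row with
# the position of its source data frame (assumes values are in column 1)
long <- do.call(rbind, Map(
  function(df, id) data.frame(id = id, value = df[[1]]),
  df_list, seq_along(df_list)
))

# One row of parameters per source data frame
params <- do.call(rbind, static_df_list)
params$id <- seq_len(nrow(params))

# Look up each row's parameters, then compute the density in one vectorised call
idx <- match(long$id, params$id)
long$z <- dnorm(long$value, mean = params$mu[idx], sd = params$stdev[idx])
```

If a list of per-source data frames is needed again, split(long, long$id) recovers one.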

Related

Error: Unable to apply package function to each row in R

I am trying to apply fm.Choquet function (Rfmtool package) to my R data frame, but no success. The function works like this (ref. here):
# let x <- 0.6 (N = 1)
# and y <- c(0.3, 0.5). y elements are always 2 power N (here, 2)
# env<-fm.Init(1). env is proportional to N
# fm.Choquet(0.6, c(0.3, 0.5),env) gives a single value output
I have this sample data frame:
set.seed(123456)
a <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
b <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
c <- qnorm(runif(30,min=pnorm(0),max=pnorm(1)))
df <- data.frame(a=a, b=b, c=c)
df$id <- seq_len(nrow(df))
I would like to apply fm.Choquet function to each row of my df such that, for each row (or ID), a is read as x, while b and c are read as y vector (N = 2), and add the function output as a new column for each row. However, I am getting the dimension error "The environment mismatches the dimension to the fuzzy measure.".
Here is my attempt.
df2 <- df %>% as_tibble() %>%
rowwise() %>%
mutate(ci = fm.Choquet(df$a,c(df[,2],df[,3]), env)) %>%
mutate(sum = rowSums(across(where(is.numeric)))) %>% # Also tried adding sum which works
as.matrix()
I am using dplyr::rowwise(), but I am open to looping or other suggestions. Can someone help me?
EDIT 1:
A relevant question is identified as a possible solution for the above question, but using one of the suggestions, by(), still throws the same error:
by(df, seq_len(nrow(df)), function(row) fm.Choquet(df$a,c(df$b,df$c), env))

set.seed(123456)
a <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
b <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
c <- qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
df <- data.frame(a = a, b = b, c = c)
df$id <- seq_len(nrow(df))
library(Rfmtool)
library(tidyverse)
env <- fm.Init(1)
map_dbl(
seq_len(nrow(df)),
~ {
row <- slice(df,.x)
fm.Choquet(
x = row$a,
v = c(row$b, row$c), env
)
}
)
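A likely cause of the dimension error in the question's attempts: inside rowwise() or by(), df$a still refers to the whole 30-element column of the outer data frame, not to the current row's value. The sketch below shows the per-row pattern in base R with mapply(). It does not call Rfmtool. toy_measure is a hypothetical stand-in that only enforces the input lengths an N = 1 call needs (one x, two y values), so the example runs without the package.

```r
set.seed(123456)
df <- data.frame(
  a = qnorm(runif(30, min = pnorm(0), max = pnorm(1))),
  b = qnorm(runif(30, min = pnorm(0), max = pnorm(1))),
  c = qnorm(runif(30, min = pnorm(0), max = pnorm(1)))
)

# Hypothetical stand-in for fm.Choquet(x, v, env): NOT the Choquet integral,
# just a function that insists on one x value and two v values
toy_measure <- function(x, v) {
  stopifnot(length(x) == 1, length(v) == 2)
  x * v[1] + (1 - x) * v[2]
}

# mapply() walks the three columns in parallel, so each call sees one row
df$ci <- mapply(
  function(x, y1, y2) toy_measure(x, c(y1, y2)),
  df$a, df$b, df$c
)

# Passing whole columns triggers the same kind of length mismatch
whole_column <- try(toy_measure(df$a, c(df$b, df$c)), silent = TRUE)
inherits(whole_column, "try-error")
```

With Rfmtool loaded and env <- fm.Init(1), the stand-in call would be replaced by fm.Choquet(x, c(y1, y2), env), matching the arguments used in the answer above.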

Using cur_column() with mutate(across(all_of())) - Recursive Error

My goal is to take a subset of columns (COL_NAMES) within a dataframe (LR_DATA) and apply a function (FUNCTION). The dataframe (LR_DATA) is mostly nested vectors except for one identifying column (var1). However, I cannot seem to correctly pass the override inputs which are nested under the current column name with the additional suffix "_OVERRIDE".
TRENDED_LR_DATA <- LR_DATA %>% mutate(across(all_of(COL_NAMES), ~list(FUNCTION(var1, unlist(var2), unlist(.x), unlist(!!sym(paste0(cur_column(), "_OVERRIDE")))))))
Specifically I get the error:
If I replace cur_column() with a hardcoded string, the code works (though obviously not as intended, since it would reference the same override column for every column in COL_NAMES). Any tips this group has would be greatly appreciated - I'm relatively new to R so please bear with me ^_^.
EDIT: Below is code to reproduce the error in full. Sorry for not including this on the original question submission.
library(dplyr)
LR_DATA <- data.frame(STATE = c(1,2,3),
YEAR = c(2000,2001,2002),
DEVT_A = c(2,4,6),
DEVT_B = c(3,6,9),
DEVT_C = c(4,8,12))
LOSS_COLS <- c("DEVT_A", "DEVT_B", "DEVT_C")
DATA_OVERRIDE <- data.frame(STATE = c(1,2,3),
DEVT_A_OVERRIDE = c(NaN,1,1),
DEVT_B_OVERRIDE = c(1,1,1),
DEVT_C_OVERRIDE = c(1.5,1.5,1.5))
LR_DATA <- LR_DATA %>% left_join(DATA_OVERRIDE, by = 'STATE')
TRENDED_LR_DATA <- LR_DATA %>% summarise(across(everything(), list), .groups = "keep") %>%
mutate(across(all_of(LOSS_COLS), ~list(TREND_LOSS(unlist(.x), unlist(YEAR), unlist( !!sym(paste0(cur_column(), "_OVERRIDE"))) ))))
TREND_LOSS <-
function(LOSSES,
YEARS,
OVERRIDES) {
x <- YEARS
y = log(LOSSES)
xy = x * y
x_sq = x * x
sum_x <- sum(x)
sum_y <- sum(y)
sum_xy <- sum(xy)
sum_x_sq <- sum(x_sq)
n <- length(YEARS)
Slope <- (n*sum_xy - sum_x*sum_y) / (n*sum_x_sq - sum_x*sum_x)
OVERRIDES[is.na(OVERRIDES)] <- Slope
TRENDED_LOSSES <- LOSSES*exp(OVERRIDES)
return(TRENDED_LOSSES)
}
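One plausible reading of the error: !! is evaluated as soon as the expression is captured, before across() has set up the column context that cur_column() needs. The base R sketch below sidesteps the issue by looping over the column names and building each override name with paste0(). It is only a sketch under assumptions: it works on the un-nested, already joined LR_DATA rather than on list columns, and the _TRENDED suffix is invented for illustration.

```r
# Joined data from the reproduction code, written out directly
LR_DATA <- data.frame(
  STATE = c(1, 2, 3),
  YEAR = c(2000, 2001, 2002),
  DEVT_A = c(2, 4, 6),
  DEVT_B = c(3, 6, 9),
  DEVT_C = c(4, 8, 12),
  DEVT_A_OVERRIDE = c(NaN, 1, 1),
  DEVT_B_OVERRIDE = c(1, 1, 1),
  DEVT_C_OVERRIDE = c(1.5, 1.5, 1.5)
)
LOSS_COLS <- c("DEVT_A", "DEVT_B", "DEVT_C")

# Same calculation as the question's TREND_LOSS, written compactly:
# least-squares slope of log(loss) on year, used where no override is given
TREND_LOSS <- function(LOSSES, YEARS, OVERRIDES) {
  x <- YEARS
  y <- log(LOSSES)
  n <- length(x)
  slope <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x * x) - sum(x)^2)
  OVERRIDES[is.na(OVERRIDES)] <- slope # is.na() is also TRUE for NaN
  LOSSES * exp(OVERRIDES)
}

# Pair each loss column with its "<name>_OVERRIDE" column by name
for (col in LOSS_COLS) {
  LR_DATA[[paste0(col, "_TRENDED")]] <- TREND_LOSS(
    LR_DATA[[col]],
    LR_DATA$YEAR,
    LR_DATA[[paste0(col, "_OVERRIDE")]]
  )
}
```

Because the override column is looked up by string inside an ordinary loop, no tidy-evaluation unquoting is involved.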

Add multiple RNG columns to a tibble

db %>%
add_column(X1 = rnorm(nrow(db),0,sample(SD,1)),
X2 = rnorm(nrow(db),0,sample(SD,1))
)
I would like to automatically generate 100 random columns. I want to keep each sampled sd a different draw.

1 - You can use the dplyover package.
An example with mtcars as db and 10 different SD values:
library(tidyverse)
db <- as_tibble(mtcars)
SD <- 1:10
n_col <- 100
sds <- sample(SD, n_col, replace = TRUE)
names(sds) <- paste0("X", seq_len(n_col))
mutate(db, dplyover::over(sds, ~ rnorm(n(), sd = .x)))
2 - Alternatively, you can create the randomly generated columns separately and bind them to db afterwards:
mat_rng <- as_tibble(lapply(sds, \(x) rnorm(nrow(db), sd = x)))
# or, as suggested by @Adam
# mat_rng <- map_dfc(sds, rnorm, n = nrow(db), mean = 0)
bind_cols(db, mat_rng)
(works only with R >= 4.1.0 because I used the new \(x) anonymous function syntax)
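For older R versions, the same idea can be sketched in base R with an ordinary function(s) in place of the \(x) shorthand. This assumes db is any data frame (mtcars here) and SD is a vector of candidate standard deviations, as in the answer. set.seed() is added only to make the run repeatable, and db2 is an illustrative name.

```r
set.seed(1)
db <- mtcars
SD <- 1:10
n_col <- 100

# One independent draw from SD per new column
sds <- sample(SD, n_col, replace = TRUE)
names(sds) <- paste0("X", seq_len(n_col))

# lapply() keeps the names, so the new columns are called X1 ... X100
new_cols <- as.data.frame(
  lapply(sds, function(s) rnorm(nrow(db), mean = 0, sd = s))
)
db2 <- cbind(db, new_cols)
```

Sampling all the standard deviations up front, rather than calling sample(SD, 1) inside each column expression, keeps the draws easy to inspect afterwards in sds.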

Adding a column to a data frame by calculating each value to be added

Good evening,
I asked a question earlier and found the solution hard to implement, so I am going to ask it again more clearly.
I want to add a column to a dataframe of daily stock returns. Let's say the returns are normally distributed, and I would like to add a column that contains the historical value at risk, computed by a function I wrote myself.
The restriction is that, for each observation, my function should use that observation together with the previous 249 observations.
So when the next observation is calculated, it should again use only the 249 days before it; the input window should move as time goes on. In other words, I want values from 251 days ago to be excluded. Hopefully I explained myself well enough. If not, maybe the code speaks for me:
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist<- function(x, n=250, hd=20, q=0.05){
width<-nrow(x)
NA.x<-na.omit(x)
quantil<-quantile(NA.x[(width-249):width],probs=q)
VaR<- quantil*sqrt(hd)%>%
return()
}
# Run the function on the dataframe
df$VaR<- df$Returns%>%VaR.hist()
Error in (width - 249):width : argument of length 0
This is the error I get instead of my new variable...
Thanks !!

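As wibom wrote in the comments, nrow(x) does not work for vectors; you need length() instead. You also do not need return() in the last line, because R automatically returns the value of the last expression in a function if there is no earlier return(). Note that, as written, the function returns a single value (computed from the last 250 observations), which is recycled down the whole VaR column.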
library(dplyr)
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist <- function(x, n=250, hd=20, q=0.05){
width <- length(x) # here you need length() as x is a vector; nrow() only works for data.frames/matrices
NA.x <- na.omit(x)
quantil <- quantile(NA.x[(width-249):width], probs = q)
quantil*sqrt(hd)
}
# Run the function on the dataframe
df$VaR <- df$Returns %>% VaR.hist()

It's a bit hard to understand exactly what you want to do.
My understanding is that you wish to compute a new variable VaR, calculated from the current and previous 249 observations of df$Returns, right?
Is this roughly what you wish to do?
library(tidyverse)
set.seed(42)
df <- tibble(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns=rnorm(500)
)
the_function <- function(i, mydata, hd = 20, q = .05) {
r <-
mydata %>%
filter(ridx <= i, ridx > i - 250) %>% # current row plus the previous 249
pull(Returns)
quantil <- quantile(r, probs = q)
VaR <- quantil*sqrt(hd)
}
df <-
df %>%
mutate(ridx = row_number()) %>%
mutate(VaR = map_dbl(ridx, the_function, mydata = .))
If you are looking for a base-R solution:
set.seed(42)
df <- data.frame(
Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
Returns = rnorm(500)
)
a_function <- function(i, mydata, hd = 20, q = .05) {
r <- mydata$Returns[mydata$ridx <= i & mydata$ridx > (i - 250)] # current row plus the previous 249
quantil <- quantile(r, probs = q)
VaR <- quantil*sqrt(hd)
}
df$ridx <- 1:nrow(df) # add index
df$VaR <- sapply(df$ridx, a_function, mydata = df)
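A compact base R variant of the same rolling idea, working directly on the Returns vector. One design choice here is an assumption rather than something the question specifies: rows with fewer than n observations available get NA instead of a quantile over a shorter window. rolling_var is an illustrative name.

```r
set.seed(42)
df <- data.frame(
  Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
  Returns = rnorm(500)
)

rolling_var <- function(returns, n = 250, hd = 20, q = 0.05) {
  vapply(seq_along(returns), function(i) {
    # Assumption: no value until a full window of n observations exists
    if (i < n) return(NA_real_)
    window <- returns[(i - n + 1):i] # current value plus the previous n - 1
    unname(quantile(window, probs = q, na.rm = TRUE)) * sqrt(hd)
  }, numeric(1))
}

df$VaR <- rolling_var(df$Returns)
```

vapply() with numeric(1) makes the function fail loudly if any window yields something other than a single number, which sapply() would silently accept.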

Combining frequencies and summary statistics in one table?

I just discovered the power of plyr (see the question "frequency table with several variables in R"),
and I am still struggling to understand how it works. I hope someone here can help me.
I would like to create a table (data frame) that combines frequencies and summary stats, but without hard-coding the values.
Here is an example dataset:
require(datasets)
d1 <- sleep
# I classify the variable extra to calculate the frequencies
extraClassified <- cut(d1$extra, breaks = 3, labels = c('low', 'medium', 'high') )
d1 <- data.frame(d1, extraClassified)
The result I am looking for should look like this:
require(plyr)
ddply(d1, "group", summarise,
All = length(ID),
nLow = sum(extraClassified == "low"),
nMedium = sum(extraClassified == "medium"),
nHigh = sum(extraClassified == "high"),
PctLow = round(sum(extraClassified == "low")/ length(ID), digits = 1),
PctMedium = round(sum(extraClassified == "medium")/ length(ID), digits = 1),
PctHigh = round(sum(extraClassified == "high")/ length(ID), digits = 1),
xmean = round(mean(extra), digits = 1),
xsd = round(sd(extra), digits = 1))
My question: how can I do this without hard-coding the values?
For the records:
I tried this code, but it does not work
ddply (d1, "group",
function(i) c(table(i$extraClassified),
prop.table(as.character(i$extraClassified))),
)
Thanks in advance

Here's an example to get you started:
foo <- function(x,colfac,colval){
tbl <- table(x[,colfac])
res <- cbind(n = nrow(x),t(tbl),t(prop.table(tbl)))
colnames(res)[5:7] <- paste(colnames(res)[5:7],"Pct",sep = "")
res <- as.data.frame(res)
res$mn <- mean(x[,colval])
res$sd <- sd(x[,colval])
res
}
ddply(d1,.(group),foo,colfac = "extraClassified",colval = "extra")
Don't take anything in that function foo as gospel. I just wrote that off the top of my head. Surely improvements/modifications are possible, but at least it's something to start with.

Thanks to Joran.
I slightly modified your function to make it more generic (without reference to the position of the variables).
require(plyr)
foo <- function(x,colfac,colval)
{
# table with frequencies
tbl <- table(x[,colfac])
# table with percentages
tblpct <- t(prop.table(tbl))
colnames( tblpct) <- paste(colnames(t(tbl)), 'Pct', sep = '')
# put the first part together
res <- cbind(n = nrow(x), t(tbl), tblpct)
res <- as.data.frame(res)
# add summary statistics
res$mn <- mean(x[,colval])
res$sd <- sd(x[,colval])
res
}
ddply(d1,.(group),foo,colfac = "extraClassified",colval = "extra")
and it works !!!
P.S : I still do not understand what (group) stands for but
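On the .(group) notation: in plyr, .() is a quoting helper, so .(group) names the grouping variable without quotes and is equivalent to passing "group" as in the question's own ddply() call. For comparison, here is a base R sketch of the same summary that builds every column name from the factor levels, so nothing is hard-coded. summarise_group is an illustrative name, and split() plus rbind() is a swapped-in alternative to ddply(), not the posters' method.

```r
d1 <- datasets::sleep
d1$extraClassified <- cut(d1$extra, breaks = 3,
                          labels = c("low", "medium", "high"))

summarise_group <- function(x, colfac, colval) {
  tbl <- table(x[[colfac]]) # counts per level (zero counts included)
  pct <- prop.table(tbl)    # shares per level
  c(
    n = nrow(x),
    setNames(as.vector(tbl), names(tbl)),
    setNames(as.vector(pct), paste0(names(tbl), "Pct")),
    mn = mean(x[[colval]]),
    sd = sd(x[[colval]])
  )
}

# One named numeric vector per group, stacked into a data frame
pieces <- lapply(split(d1, d1$group), summarise_group,
                 colfac = "extraClassified", colval = "extra")
res <- as.data.frame(do.call(rbind, pieces))
res$group <- names(pieces)
```

Because table() on a factor reports every level, each group yields the same set of columns even when a level has no observations in that group.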
