Loop over variable names in R

I am trying to build various regression models with different columns (independent variables in my dataset).
set.seed(0)
True = rnorm(20, 100, 10)
v = matrix(rnorm(120, 10, 3), nrow = 20)
dt = data.frame(cbind(True, v))
colnames(dt) = c('True', paste0('ABC', 1:6))
So the independent variable I want to use in each model is "ABCi", i.e. when i = 1, use ABC1, and so on. Each model is built on the first 80% of the observations, and then I predict on the remaining 20%.
I tried this:
reg.pred = rep(0, ncol(dt))
for (i in 1:nrow(dt)){
  reg = lm(True~paste0('ABC', i), data = dt[(1:(0.8*nrow(dt))),])
  reg.pred[i] = predict(reg, data = dt[(0.8*nrow(dt)):nrow(dt),])
}
It is not working, and gives errors like:
Error in model.frame.default(formula = True ~ paste0("ABC", i), data = dt[(1:(0.8 * :
variable lengths differ (found for 'paste0("ABC", i)')
I am not sure how I can retrieve the variable name in a loop. Any suggestion is appreciated!

You do not technically need to use as.formula() as @Sonny suggests, but you cannot mix a character representation of the formula with formula notation. So, you need to fix that. However, once you do, you'll notice that there are other issues with your code that @Sonny either did not notice or opted not to address.
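To see the mixing problem in isolation: inside a formula, paste0('ABC', i) is evaluated as a length-one character vector rather than as the column it names, which is exactly what triggers the "variable lengths differ" error. Building the whole formula from a string avoids this (a minimal sketch):
i <- 1
reg <- lm(as.formula(paste0("True ~ ABC", i)), data = dt)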
Most notably, the line
reg.pred = rep(0, ncol(dt))
implies you want a single prediction from each model, but
predict(reg, data = dt[(0.8*nrow(dt)):nrow(dt),])
implies you want a prediction for each of the observations not in the training set (you'll need a +1 after 0.8*nrow(dt) for that by the way).
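A quick check of the indexing with the 20-row data above shows the off-by-one:
n <- nrow(dt)    # 20
(0.8*n):n        # 16 17 18 19 20 (row 16 is still part of the training set)
(0.8*n + 1):n    # 17 18 19 20 (the actual 20% hold-out)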
I think the following should fix all your issues:
set.seed(0)
True = rnorm(20, 100, 10)
v = matrix(rnorm(120, 10, 3), nrow = 20)
dt = data.frame(cbind(True, v))
colnames(dt) = c('True', paste0('ABC', 1:6))
# Make a matrix for the predicted values; each column is for a model
reg.pred = matrix(0, nrow = 0.2*nrow(dt), ncol = ncol(dt)-1)
for (i in 1:(ncol(dt)-1)){
  # Get the name of the predictor we want here
  this_predictor <- paste0("ABC", i)
  # Make a character representation of the lm formula
  lm_formula <- paste("True", this_predictor, sep = "~")
  # Run the model
  reg = lm(lm_formula, data = dt[(1:(0.8*nrow(dt))),])
  # Get the appropriate test data
  newdata <- data.frame(dt[(0.8*nrow(dt)+1):nrow(dt), this_predictor])
  names(newdata) <- this_predictor
  # Store predictions
  reg.pred[ , i] = predict(reg, newdata = newdata)
}
reg.pred
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 100.2150 100.8394 100.7915 99.88836 97.89952 105.7201
# [2,] 101.2107 100.8937 100.9110 103.52487 102.13965 104.6283
# [3,] 100.0426 101.0345 101.2740 100.95785 102.60346 104.2823
# [4,] 101.1055 100.9686 101.5142 102.56364 101.56400 104.4447
In this matrix of predictions, each column is from a different model, and the rows correspond to the last four rows of your data (the rows not in your training set).
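If the aim is to compare the six models, a short follow-up (a sketch reusing the objects above) scores each column of predictions against the held-out True values:
test_rows <- (0.8*nrow(dt)+1):nrow(dt)
rmse <- sqrt(colMeans((reg.pred - dt$True[test_rows])^2))
rmse  # one RMSE per model; lower is better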

You can use as.formula:
f <- as.formula(paste("True", paste0('ABC', i), sep = " ~ "))
reg = lm(f, data = dt[(1:(0.8*nrow(dt))),])
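reformulate() builds the same formula without manual pasting; an equivalent sketch:
f <- reformulate(paste0('ABC', i), response = 'True')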

Related

How can I check for convergence in Sobol' sensitivity indices, using sensobol?

I would like to check the convergence of Sobol' sensitivity indices, using the sensobol library, by re-computing the sensitivity indices using sub-samples of decreasing size extracted from the original sample.
Here, I present an example using the Ishigami function as the model. Since computing the model output takes very long with the model I actually use, I want to avoid recomputing it for different sample sizes and instead use sub-samples of my overall sample for this check.
I have written code that runs through; however, the result seems to be incorrect as soon as the sample size is not equal to the initial sample size.
Initial set-up
library(sensobol)
# Define settings
matrices <- c("A", "B", "AB", "BA")
N <- 1000
params <- paste("X", 1:3, sep = "")
first <- total <- "azzini"
order <- "first"
R <- 10
type <- "percent"
conf <- 0.95
# Create sample matrix using Sobol' (1967) quasi-random numbers
mat <- sobol_matrices(matrices = matrices, N = N, params = params, order = order, type = "QRN")
# Compute model output using Ishigami function as model
Y <- ishigami_Fun(mat)
Correct Sobol' indices as benchmark result
# Compute and bootstrap Sobol' indices for entire sample N
ind <- sobol_indices(matrices = c("A", "B", "AB", "BA"),
                     Y = Y,
                     N = N,
                     params = params,
                     boot = TRUE,
                     first = "azzini",
                     total = "azzini",
                     order = "first",
                     R = R,
                     type = type,
                     conf = conf)
cols <- colnames(ind)[1:length(params)]
ind[ , (cols):= round(.SD, 3), .SDcols = (cols)]
Check for convergence
Now, to analyze whether convergence is reached, I want to re-compute the sensitivity indices using sub-samples of decreasing size extracted from the original sample:
# function to compute sensitivity indices, depending on the sample size and the model output vector
fct_conv <- function(N, Y) {
  # compute how many model runs are performed in the case of the Azzini estimator
  nr_model_runs <- 2*N*(length(params)+1) # length(params) = k
  # extract sub-sample of model output
  y_sub <- Y[1:nr_model_runs]
  # compute and bootstrap Sobol' indices
  ind_sub <- sobol_indices(matrices = c("A", "B", "AB", "BA"),
                           Y = y_sub,
                           N = N,
                           params = params,
                           boot = TRUE,
                           first = "azzini",
                           total = "azzini",
                           order = "first",
                           R = R,
                           type = type,
                           conf = conf)
  cols <- colnames(ind_sub)[1:length(params)]
  ind_sub[ , (cols) := round(.SD, 3), .SDcols = (cols)]
  return(ind_sub)
}
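As a quick arithmetic check of the run count used inside fct_conv (the four-matrix A, B, AB, BA design above, with N = 1000 and k = 3 parameters):
2 * 1000 * (3 + 1)  # 8000 model runs, which should equal length(Y) for this design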
Let's compare the benchmark result (ind) to two other outputs: running fct_conv with the full sample (ind_full_sample) and running it with a very slightly reduced sample (ind_red_sample).
ind_full_sample <- fct_conv(1000, Y)
ind_red_sample <- fct_conv(999, Y)
ind
ind_full_sample
ind_red_sample
It seems that as soon as the sample size is reduced, the result doesn't make sense. Why is that? I'd be glad for any hints or ideas!
The results do not make sense because you are sampling without considering the ordering of the sample matrix. Try the following:
# Load the required packages:
library(sensobol)
library(data.table)
library(ggplot2)
# Create function to swiftly check convergence (you do not need bootstrap)
sobol_convergence <- function(Y, N, sample.size, seed = 666) {
  dt <- data.table(matrix(Y, nrow = N))
  set.seed(seed) # to permit replication
  subsample <- unlist(dt[sample(.N, sample.size)], use.names = FALSE)
  ind <- sobol_indices(matrices = matrices,
                       Y = subsample,
                       N = sample.size,
                       params = params,
                       first = first,
                       total = total,
                       order = order)
  return(ind)
}
# Define sequence of sub-samples at which you want to check convergence
sample.size <- seq(100, 1000, 50) # every 50
# Run function
ind.list <- lapply(sample.size, function(n)
sobol_convergence(Y = Y, N = N, sample.size = n))
# Extract total number of model runs C and results in each run
Cost <- indices <- list()
for(i in 1:length(ind.list)) {
  Cost[[i]] <- ind.list[[i]]$C
  indices[[i]] <- ind.list[[i]]$results
}
names(indices) <- Cost
# Final dataset
final.dt <- rbindlist(indices, idcol = "Cost")[, Cost:= as.numeric(Cost)]
# Plot results
ggplot(final.dt, aes(Cost, original, color = sensitivity)) +
  geom_line() +
  labs(x = "Total number of model runs", y = "Sobol' indices") +
  facet_wrap(~parameters) +
  theme_bw()
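As a sanity check (a sketch reusing the objects above), the largest sub-sample (sample.size = 1000 = N) is just a row permutation of the full design, so its indices should agree with indices computed directly on the full Y:
ind.full <- sobol_indices(matrices = matrices, Y = Y, N = N, params = params,
                          first = first, total = total, order = order)
ind.list[[length(ind.list)]]$results
ind.full$results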

Accessing a variable in a data frame by column number in R?

I have a data frame "df" with 41 variables, var1 to var41. If I write this command
pcdtest(plm(var1~ 1 , data = df, model = "pooling"))[[1]]
I can see the test value. But I need to apply this test 41 times, so I want to access each variable by column number, which is "df[1]" for "var1" and "df[41]" for "var41":
pcdtest(plm(df[1]~ 1 , data = dfp, model = "pooling"))[[1]]
But it fails. Could you please help me do this? I will collect the results in a for loop and then calculate descriptive statistics for all of them; it is very tedious to run the test for each variable by hand.
I think you can easily adapt the following code to your data. Since you didn't provide any of your data, I used data that comes with the plm package.
library(plm) # for pcdtest
# example data from plm package
data("Cigar" , package = "plm")
Cigar[ , "fact1"] <- c(0,1)
Cigar[ , "fact2"] <- c(1,0)
Cigar.p <- pdata.frame(Cigar)
# example for one column
p_model <- plm(formula = pop~1, data = Cigar.p, model = "pooling")
pcdtest(p_model)[[1]]
# run through multiple models
l_plm_models <- list() # store plm models in this list
l_tests <- list()      # store test results in this list
for(i in 3:ncol(Cigar.p)){ # start at the third column, since the first two are state and year
  fmla <- as.formula(paste0(colnames(Cigar.p)[i], " ~ 1"))
  l_plm_models[[i]] <- plm(formula = fmla, data = Cigar.p, model = "pooling")
  l_tests[[i]] <- pcdtest(l_plm_models[[i]])[[1]]
}
testresult <- data.frame("z" = unlist(l_tests), row.names = (colnames(Cigar.p[3:11])))
> testresult
z
price 175.36476
pop 130.45774
pop16 155.29092
cpi 176.21010
ndi 175.51938
sales 99.02973
pimin 175.74600
fact1 176.21010
fact2 176.21010
# example for cipstest
matrix_results <- matrix(NA, nrow = 11, ncol = 2) # use 41 here for your df
l_ctest <- list()
for(i in 3:ncol(Cigar.p)){
  l_ctest[[i]] <- cipstest(Cigar.p[, i], lags = 4, type = 'none', model = 'cmg', truncated = F)
  matrix_results[i, 1] <- as.numeric(l_ctest[[i]][1]) # statistic
  matrix_results[i, 2] <- as.numeric(l_ctest[[i]][7]) # p-value
}
res <- data.frame(matrix_results)
names(res) <- c('cips-statistic', 'p-value')
print(res)
Try using as.formula(), for example:
results <- list()
for (i in 1:41){
  varName <- paste0('var', i)
  frml <- paste0(varName, ' ~ 1')
  results[[i]] <- pcdtest(plm(as.formula(frml), data = dfp, model = "pooling"))[[1]]
}
You can use reformulate to create the formula and run the code 41 times with lapply:
var <- paste0('var', 1:41)
result <- lapply(var, function(x) pcdtest(plm(reformulate('1', x),
data = df, model = "pooling"))[[1]])
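To collect the 41 statistics into a labelled data frame afterwards (a sketch following on from var and result above):
names(result) <- var
testresult <- data.frame(z = unlist(result))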

How to extract the coefficients of various n-th degree polynomial models and store them into a 1D array?

I have a collection of sixth-degree polynomial regression models from which I want to gather only the coefficients.
I have a large dataset that contains 3 columns: the first one is an arbitrary parameter that acts as a flag, the second is the input, and third is the output.
I subsetted my dataset according to my parameters, so I have 10 smaller datasets. My models arose from these subsets.
As an example:
#-----"Dummy" Dataset-----
a = seq(1:100) #act as input
b = a + rnorm(n = 100, mean = 0, sd = 20) #act as output
df = as.data.frame(cbind(a,b))
colnames(df) = c("input", "output")
#
#-----Subsets-----
df_1_XlessThen50 = subset(df, input < 50)  # example of subsetting; here I use the input
                                           # values themselves as the threshold just for
                                           # simplicity. In reality, I use the first
                                           # column of my data frame (the parameter).
df_2_XmoreThen50 = subset(df, input >= 50) # second subset. In other words, for every
                                           # parameter, I divide that subset into two
                                           # smaller ones.
#
#-----Models-----
model_3_ab.1 = lm(output ~ poly(input, 6, raw = T), data = df_1_XlessThen50)
model_3_ab.2 = lm(output ~ poly(input, 6, raw = T), data = df_2_XmoreThen50)
My models' names follow a pattern: "model" + parameter + "_ab." + id number.
I should clarify that the "id number" indicates which of the two models for every parameter I will consider. (These smaller datasets within every parameter are the results of subsetting according to a pre-determined threshold.)
What I have now is a collection of models like the two above for every parameter in my dataset. I have 10 parameters, hence 20 models.
I want to gather only the coefficients of every model and store them into a matrix or dataframe. To achieve that, I tried:
parameter = c(2,4,6,7,9,11,33,35,37,50)
myData = array()
for (i in parameter){  # loop over all parameters
  for (j in 1:2){      # loop over the pair of models for each parameter
    for (k in 1:6){    # loop over my model's coefficients
      aux = paste("model", i, "ab.", j, sep = "")
      aux = get(aux)
      myData[i,j,k] = aux$coefficients[k]
    }
  }
}
However, I keep getting the same error:
Error in myData[i, j, k] = aux$coefficients[k] :
incorrect number of subscripts
With this error, I can't advance toward my goal, which is to write a .txt file with a single column formatted like this:
A(2,1,1) = first order coefficient for the first model related to parameter 2
B(2,2,1) = second order coefficient for the first model related to parameter 2
C(2,3,1)
...
G(2,7,1)
A(2,1,2)
where in (M, N, O): M is the parameter, N is the coefficient of the N-th degree (N = 7 is the intercept), and O is either 1 or 2, respectively the first or second model in each pair of models for every parameter.
It'd be nice to get help/guidance for the whole problem, but I'll already be grateful if I can get past the part where I want to store my coefficients in a matrix using for-loops. Thanks
Here is what I mean:
set.seed(42)
a1 = seq(1:100) #act as input
a2 <- runif(100)
b = a1 + a2 + rnorm(n = 100, mean = 0, sd = 20) #act as output
df = data.frame(input1 = a1,
                input2 = a2,
                output = b)
df$flag <- a1 <= 50
library(reshape2)
df <- melt(df, id.vars = c("output", "flag"))
library(lme4)
df$flag_par <- interaction(df$flag, df$variable)
fits <- lmList(output ~ poly(value, 2, raw = TRUE) | flag_par, data = df)
coef(fits)
# (Intercept) poly(value, 2, raw = TRUE)1 poly(value, 2, raw = TRUE)2
#FALSE.input1 125.957730 -2.434849 0.022137337
#TRUE.input1 2.842223 1.216113 -0.006686362
#FALSE.input2 68.807752 -7.429319 26.486493218
#TRUE.input2 31.791633 -18.595105 16.608600876
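For completeness, the "incorrect number of subscripts" error in the original loop comes from myData = array(), which creates a one-dimensional array, so indexing with three subscripts fails. A minimal pre-allocation sketch (assuming 10 parameters, 2 models per parameter, and 7 coefficients per model):
myData <- array(NA, dim = c(length(parameter), 2, 7),
                dimnames = list(parameter, c("model1", "model2"), paste0("coef", 1:7)))
# index the first dimension by name, e.g. myData[as.character(i), j, k]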

Linear Regression Analysis - rolling across rows

I need a suggestion on how to get the results of my regression analysis into an object.
I want to perform the regression analysis row-wise, with a rolling window of 20 days.
The object Slope should store the slopes from each day's regression over the window.
#Loading Library
require(quantmod)
#Initiation of Example
mc_result <- matrix(sample(c(1:200)), ncol = 200, nrow =1)
mc_result1 <- matrix(sample(c(1:200)), ncol =200, nrow =1)
mc_result <- rbind(mc_result, mc_result1)
a <- c(1:200)
Slope <- matrix(ncol=2, nrow=181)
Caution: the loop below does not work.
It should apply rollapply row-wise and save the results for each day in the object Slope.
This is how the result should look, but with changing slope values. At the moment the slope value is constant and I don't know why.
for (i in 1:2) {
  Slope[,i] <- rollapply(data = mc_result[i,], width = 20,
                         FUN = function(z)
                           summary(lm(mc_result[i,] ~ a, data = as.data.frame(z)))$coefficients[2],
                         by.column = FALSE)
}
I think what you want is the following. In your code, neither mc_result[i,] nor a rolls with the window over the data, which is why the regression coefficients do not change: you keep fitting the same full-length regression while only z changes. Index both by z instead:
#Loading Library
require(quantmod)
#Initiation of Example
mc_result <- matrix(sample(c(1:200)), ncol = 200, nrow =1)
mc_result1 <- matrix(sample(c(1:200)), ncol =200, nrow =1)
mc_result <- rbind(mc_result, mc_result1)
a <- c(1:200)
Slope <- matrix(ncol=2, nrow=181)
for (i in 1:2) {
  Slope[,i] <- rollapply(data = 1:200, width = 20,
                         FUN = function(z) {
                           summary(lm(mc_result[i,z] ~ a[z]))$coefficients[2]
                         }, by.column = FALSE)
}
head(Slope)
[,1] [,2]
[1,] 1.3909774 2.0278195
[2,] 1.0315789 2.8421053
[3,] 1.5082707 2.8571429
[4,] 0.0481203 1.6917293
[5,] 0.2969925 0.2060150
[6,] 1.3526316 0.6842105
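The same result can also be had without the explicit outer loop (a sketch using the same objects):
Slope <- apply(mc_result, 1, function(row)
  rollapply(1:200, width = 20,
            FUN = function(z) coef(lm(row[z] ~ a[z]))[2],
            by.column = FALSE))  # apply() returns one 181-value column per row of mc_result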

p-value matrix of x and y variables from anova output

I have many X and Y variables (something like 500 x 500). The following is just a small example dataset:
yvars <- data.frame(Yv1 = rnorm(100, 5, 3),
                    Y2  = rnorm(100, 6, 4),
                    Yv3 = rnorm(100, 14, 3))
xvars <- data.frame(Xv1 = sample(c(1, 0, -1), 100, replace = T),
                    X2  = sample(c(1, 0, -1), 100, replace = T),
                    Xv3 = sample(c(1, 0, -1), 100, replace = T),
                    D   = sample(c(1, 0, -1), 100, replace = T))
I want to extract the p-values and make a matrix like this:
Yv1 Y2 Yv3
Xv1
X2
Xv3
D
Here is my attempt to loop the process:
prob = NULL
anova.pmat <- function (x) {
  mydata <- data.frame(yvar = yvars[, x], xvars)
  for (i in seq(length(xvars))) {
    prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1], data = mydata))$`Pr(>F)`[1]
  }
}
sapply(yvars, anova.pmat)
Error in .subset(x, j) : only 0's may be mixed with negative subscripts
What could be the solution ?
Edit:
For the first Y variable:
prob <- NULL
mydata <- data.frame(yvar = yvars[, 1], xvars)
for (i in seq(length(xvars))) {
  prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1], data = mydata))$`Pr(>F)`[1]
}
prob
[1] 0.4995179 0.4067040 0.4181571 0.6291167
Edit again:
for (j in seq(length(yvars))){
  prob <- NULL
  mydata <- data.frame(yvar = yvars[, j], xvars)
  for (i in seq(length(xvars))) {
    prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1], data = mydata))$`Pr(>F)`[1]
  }
}
Gives the same result as above! (The inner results are overwritten: prob is re-initialized on each pass of the outer loop and never stored per Y-variable, so only the last iteration survives.)
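A minimal fix is to store the inner results in a pre-allocated matrix instead (a sketch based on the loop above):
prob <- matrix(NA, nrow = length(xvars), ncol = length(yvars),
               dimnames = list(names(xvars), names(yvars)))
for (j in seq_along(yvars)) {
  mydata <- data.frame(yvar = yvars[, j], xvars)
  for (i in seq_along(xvars)) {
    prob[i, j] <- anova(lm(yvar ~ mydata[, i + 1], data = mydata))$`Pr(>F)`[1]
  }
}
prob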
Here is an approach that uses plyr to loop over the columns of a data frame (treating it as a list) for each of the xvars and yvars, returning the appropriate p-value and arranging the results into a matrix. Adding the row/column names is just extra.
library("plyr")
probs <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    anova(lm(y ~ x))$`Pr(>F)`[1]
  })
})
rownames(probs) <- names(xvars)
colnames(probs) <- names(yvars)
Here is one solution, which consists of generating all combinations of Y- and X-variables to test (we cannot use combn, since it draws pairs from a single set rather than crossing two sets) and running a linear model in each case:
dfrm <- data.frame(y = gl(ncol(yvars), ncol(xvars), labels = names(yvars)),
                   x = gl(ncol(xvars), 1, labels = names(xvars)), pval = NA)
## little helper function to create the formula on the fly
fm <- function(x) as.formula(paste(unlist(x), collapse = "~"))
## merge both datasets
full.df <- cbind.data.frame(yvars, xvars)
## apply our LM row-wise
dfrm$pval <- apply(dfrm[, 1:2], 1,
                   function(x) anova(lm(fm(x), full.df))$`Pr(>F)`[1])
## arrange everything in a rectangular matrix of p-values
res <- matrix(dfrm$pval, ncol = 3, dimnames = list(levels(dfrm$x), levels(dfrm$y)))
Sidenote: with high-dimensional datasets, relying on the QR decomposition to compute the p-value of a linear regression is time-consuming. It is faster to compute the matrix of Pearson correlations for all pairwise comparisons and transform each r statistic into a Fisher-Snedecor F using the relation F = ν·r²/(1 − r²), where the degrees of freedom are ν = (n − 2) − #{i : x_i or y_i is NA} (that is, (n − 2) minus the number of pairwise missing values; when there are no missing values, r² is the usual R² of the simple regression).
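A sketch of that shortcut for the complete-data case (so ν = n − 2), using the xvars/yvars from the question:
n <- nrow(yvars)
r <- cor(xvars, yvars)              # 4 x 3 matrix of Pearson correlations
Fstat <- (n - 2) * r^2 / (1 - r^2)  # F statistics with 1 and n-2 degrees of freedom
pvals <- pf(Fstat, df1 = 1, df2 = n - 2, lower.tail = FALSE)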
