Forecasting More than One Column

How do I run this for each of the 50 columns I have instead of one at a time?
#Chosen vector creation
IMBFM <- as.numeric(data$IMBFM)
#Hidden layers creation
alpha <- 1.5^(-10)
hn <- length(IMBFM)/(alpha*(length(IMBFM)+30))
#Fitting nnetar
lambda <- BoxCox.lambda(IMBFM)
dnn_pred <- nnetar(IMBFM, size= hn, lambda = lambda)
#Forecasting with the fitted model
dnn_forecast <- forecast(dnn_pred, h= 30, PI = TRUE)
dnn_forecast
plot(dnn_forecast)

Create a function that takes your column, and returns a list of the forecast and the plot
f <- function(x) {
x <- as.numeric(x)
alpha <- 1.5^(-10)
hn <- length(x)/(alpha*(length(x)+30))
lambda <- BoxCox.lambda(x)
dnn_pred <- nnetar(x, size= hn, lambda = lambda)
dnn_forecast <- forecast(dnn_pred, h= 30, PI = TRUE)
return(
list("forecast" = dnn_forecast, "plot" = plot(dnn_forecast))
)
}
Create a vector of your columns of interest. There are many ways to do this; this is the manual way, but you might be able to use a regex on colnames(data) to select the ones of interest, depending on the names.
mycols = c("IMBFM", "col2", "col3", ... "col50")
Use lapply to apply the function to each element of mycols:
result = lapply(mycols, function(col) f(data[[col]]))
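The regex route mentioned above could look roughly like the following sketch. The toy data frame, the pattern "^col", and the stand-in function f_demo are all hypothetical; in practice f_demo would be the nnetar-based f() defined above.

```r
# Toy data standing in for the real 50-column data frame (hypothetical).
set.seed(1)
data <- data.frame(id = 1:20, col1 = rnorm(20), col2 = rnorm(20), col3 = rnorm(20))

# Stand-in for f(): returns a small list per column.
# Replace with the forecasting function f() from the answer.
f_demo <- function(x) {
  x <- as.numeric(x)
  list(n = length(x), mean = mean(x))
}

# Select columns by name pattern instead of typing them out.
mycols <- grep("^col", colnames(data), value = TRUE)

# Name the vector so the result list is indexed by column name.
names(mycols) <- mycols
result <- lapply(mycols, function(col) f_demo(data[[col]]))

names(result)
result$col2$n
```

Naming mycols before calling lapply means each forecast can later be pulled out as result$col2 rather than by position.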

Related

How to create a function that calculates multiple statistical functions?

I want to run a statistical analysis on many vectors, using functions such as cor, MAE, bias, sd, t.test, chisq.test,... and I guess there is a way of creating a function that I only feed the data to analyze and that gives back a vector with the calculations.
Ideally I would like to feed vector1 and vector2, and get the calculations made.
Right now I am doing the following, but it is getting unsustainable pretty fast.
## R^2
rsq_15_18 <- round(cor(x = study_15_18$potential_15, y = study_15_18$overall_18 ,method = "pearson")^2,4)
rsq_16_19 <- round(cor(x = study_16_19$potential_16, y = study_16_19$overall_19 ,method = "pearson")^2,4)
rsq_17_20 <- round(cor(x = study_17_20$potential_17, y = study_17_20$overall_20 ,method = "pearson")^2,4)
rsq_18_21 <- round(cor(x = study_18_21$potential_18, y = study_18_21$overall_21 ,method = "pearson")^2,4)
rsq_19_22 <- round(cor(x = study_19_22$potential_19, y = study_19_22$overall_22 ,method = "pearson")^2,4)
## MAE
mae_15_18 <- round(mae(study_15_18$overall_18, study_15_18$potential_15),4)
mae_16_19 <- round(mae(study_16_19$overall_19, study_16_19$potential_16),4)
mae_17_20 <- round(mae(study_17_20$overall_20, study_17_20$potential_17),4)
mae_18_21 <- round(mae(study_18_21$overall_21, study_18_21$potential_18),4)
mae_19_22 <- round(mae(study_19_22$overall_22, study_19_22$potential_19),4)
## Bias
bias_15_18 <- round(bias(study_15_18$overall_18, study_15_18$potential_15),4)
bias_16_19 <- round(bias(study_16_19$overall_19, study_16_19$potential_16),4)
bias_17_20 <- round(bias(study_17_20$overall_20, study_17_20$potential_17),4)
bias_18_21 <- round(bias(study_18_21$overall_21, study_18_21$potential_18),4)
bias_19_22 <- round(bias(study_19_22$overall_22, study_19_22$potential_19),4)
comparison <- c("15_18", "16_19", "17_20", "18_21", "19_22")
R2 <- c(rsq_15_18, rsq_16_19, rsq_17_20, rsq_18_21, rsq_19_22)
MAE <- c(mae_15_18, mae_16_19, mae_17_20, mae_18_21, mae_19_22)
bias <- c(bias_15_18, bias_16_19, bias_17_20, bias_18_21, bias_19_22)
data.frame(comparison, R2, MAE, bias)
thank you,

So you have two lists of studies that you're comparing. Put them in lists:
study_overall <- list(study_15_18$overall_18, ...) # fill in ... as needed
study_potential <- list(study_15_18$potential_15, ...)
Now you can process those lists in parallel:
library(purrr)
cors <- map2_dbl(study_overall, study_potential,
\(x, y) round(cor(x, y, method = "pearson")^2, 4)
)
Now you can put the resulting vectors into your data frame.
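To get all the statistics from a single call per pair, one option is a small function that returns a one-row data frame, applied over both lists with Map. This is only a sketch in base R: mae_fun, bias_fun, and compare_pair are hypothetical helpers (the question's mae() and bias() presumably come from a package, and their definitions here are an assumption), and the toy vectors stand in for the study_* columns.

```r
# Hypothetical helpers; assumed definitions of MAE and bias,
# re-implemented here so the example is self-contained.
mae_fun  <- function(actual, predicted) mean(abs(actual - predicted))
bias_fun <- function(actual, predicted) mean(actual - predicted)

# All statistics for one pair of vectors, returned as a one-row data frame.
compare_pair <- function(actual, predicted, digits = 4) {
  data.frame(
    R2   = round(cor(actual, predicted, method = "pearson")^2, digits),
    MAE  = round(mae_fun(actual, predicted), digits),
    bias = round(bias_fun(actual, predicted), digits)
  )
}

# Toy vectors standing in for study_15_18$potential_15, study_15_18$overall_18, etc.
set.seed(42)
study_potential <- list("15_18" = rnorm(30, 70, 5), "16_19" = rnorm(30, 72, 5))
study_overall   <- lapply(study_potential, function(p) p + rnorm(30, 0, 2))

# Map() walks both lists in parallel and keeps the names of the first one.
rows  <- Map(compare_pair, study_overall, study_potential)
stats <- data.frame(comparison = names(rows), do.call(rbind, rows), row.names = NULL)
stats
```

Adding another statistic then only means adding one line inside compare_pair.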

Performing a large number of 2-sample t-tests in R

So I am creating a function which allows me to take a data.frame and get a dataframe of p.values for each variable tested.
# data and labels
my_data <- data.frame(matrix(data = rnorm(10000), nrow = 100, ncol = 10000))
labels <- sample(0:1, 100, replace = TRUE)
# append the labels to the data, then filter
my_data$labels <- labels
sample_1 <- dplyr::filter(.data = my_data, labels == 0)
sample_2 <- dplyr::filter(.data = my_data, labels == 1)
#perform a t-test on each column
p_vals <- data.frame()
for(i in c(1:10000)) {
p_vals <- rbind(p_vals, t.test(x = sample_1[,i], y = sample_2[,i])$p.value)
}
return(p_vals)
This is functional, but I think/hope there would be a more efficient way to do this without the for loop. The data should be in rows because for later functions it will be important to keep track of which variable had which p value.

Instead of splitting the samples, you can use the formula interface to t.test and sapply over the columns of my_data to conduct the tests (excluding the appended labels column, which cannot be tested against itself):
p_vals <- sapply(my_data[names(my_data) != "labels"], function(x) t.test(x ~ labels)$p.value)
This produces a named vector of p-values, in the same order as the columns of my_data.
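Since the question asks to keep track of which variable had which p-value, the named vector can be turned into a data frame with one row per variable. A minimal sketch, using a smaller toy data set (20 variables instead of 10000) and vapply instead of sapply so that exactly one numeric value per column is enforced:

```r
set.seed(123)
# Smaller toy version of the data: 100 observations, 20 variables.
my_data <- data.frame(matrix(rnorm(100 * 20), nrow = 100, ncol = 20))
labels  <- sample(0:1, 100, replace = TRUE)

# vapply() guarantees one numeric p-value per column and keeps column names.
p <- vapply(my_data, function(x) t.test(x ~ labels)$p.value, numeric(1))

# One row per variable, so each p-value stays attached to its name.
p_vals <- data.frame(variable = names(p), p.value = unname(p))
head(p_vals)
```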

You can also use the package genefilter:
library(genefilter)
colttests(as.matrix(my_data[,-ncol(my_data)]),factor(my_data$labels))

Regsubsets results differ from coef() for model with linear dependencies

While using regsubsets from the leaps package on data with linear dependencies, I found that the results given by coef() and by summary()$which differ. It seems that, when linear dependencies are found, reordering changes the position of coefficients and coef() returns wrong values.
I use mtcars just to "simulate" the problem I had with other data. In the first example there are no linear dependencies, the best model by BIC is mpg~wt+cyl, and both coef() and summary()$which give the same result. In the second example I add a dummy variable so that perfect multicollinearity is possible, but with the variables in this order (dummy in the last column) the problem does not occur. In the last example, after changing the order of variables in the dataset, the problem finally appears and coef() and summary()$which give different models. Is there anything incorrect in this approach? Is there any other way to get coefficients from regsubsets?
require("leaps") #install.packages("leaps")
###Example1
dta <- mtcars[,c("mpg","cyl","am","wt","hp") ]
bestSubset.cars <- regsubsets(mpg~., data=dta)
(best.sum <- summary(bestSubset.cars))
#
w <- which.min(best.sum$bic)
best.sum$which[w,]
#
best.sum$outmat
coef(bestSubset.cars, w)
#
###Example2
dta2 <- cbind(dta, manual=as.numeric(!dta$am))
bestSubset.cars2 <- regsubsets(mpg~., data=dta2)
(best.sum2 <- summary(bestSubset.cars2))
#
w <- which.min(best.sum2$bic)
best.sum2$which[w,]
#
coef(bestSubset.cars2, w)
#
###Example3
bestSubset.cars3 <- regsubsets(mpg~., data=dta2[,c("mpg","manual","am","cyl","wt","hp")])
(best.sum3 <- summary(bestSubset.cars3))
#
w <- which.min(best.sum3$bic)
best.sum3$which[w,]
#
coef(bestSubset.cars3, w)
#
best.sum2$which
coef(bestSubset.cars2,1:4)
best.sum3$which
coef(bestSubset.cars3,1:4)

The order of variables used by summary.regsubsets and by regsubsets is different. The coef() method for regsubsets combines output from both, and the results get mixed up if you use force.in or a formula with a fixed order. Changing some lines in the coef() function might help. Try the code below and see if it works!
coef.regsubsets <- function (object, id, vcov = FALSE, ...)
{
s <- summary(object)
invars <- s$which[id, , drop = FALSE]
betas <- vector("list", length(id))
for (i in 1:length(id)) {
# added
var.name <- names(which(invars[i, ]))
thismodel <- which(object$xnames %in% var.name)
names(thismodel) <- var.name
# deleted
#thismodel <- which(invars[i, ])
qr <- .Fortran("REORDR", np = as.integer(object$np),
nrbar = as.integer(object$nrbar), vorder = as.integer(object$vorder),
d = as.double(object$d), rbar = as.double(object$rbar),
thetab = as.double(object$thetab), rss = as.double(object$rss),
tol = as.double(object$tol), list = as.integer(thismodel),
n = as.integer(length(thismodel)), pos1 = 1L, ier = integer(1))
beta <- .Fortran("REGCF", np = as.integer(qr$np), nrbar = as.integer(qr$nrbar),
d = as.double(qr$d), rbar = as.double(qr$rbar), thetab = as.double(qr$thetab),
tol = as.double(qr$tol), beta = numeric(length(thismodel)),
nreq = as.integer(length(thismodel)), ier = numeric(1))$beta
names(beta) <- object$xnames[qr$vorder[1:qr$n]]
reorder <- order(qr$vorder[1:qr$n])
beta <- beta[reorder]
if (vcov) {
p <- length(thismodel)
R <- diag(qr$np)
R[row(R) > col(R)] <- qr$rbar
R <- t(R)
R <- sqrt(qr$d) * R
R <- R[1:p, 1:p, drop = FALSE]
R <- chol2inv(R)
dimnames(R) <- list(object$xnames[qr$vorder[1:p]],
object$xnames[qr$vorder[1:p]])
V <- R * s$rss[id[i]]/(object$nn - p)
V <- V[reorder, reorder]
attr(beta, "vcov") <- V
}
betas[[i]] <- beta
}
if (length(id) == 1)
beta
else betas
}

Another solution that works for me is to randomize the order of the columns (independent variables) in your dataset before running regsubsets. The idea is that after reordering, the highly correlated columns will hopefully be far apart from each other and will not trigger the reorder behavior in the regsubsets algorithm.
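A further way to get coefficients without relying on coef() at all is to read the selected variables from summary()$which and refit that model with lm. This is only a sketch, assuming the leaps package is installed and all predictors are numeric (so the column names of $which match the data column names); it uses Example 1 from the question.

```r
library(leaps)

dta <- mtcars[, c("mpg", "cyl", "am", "wt", "hp")]
fit <- regsubsets(mpg ~ ., data = dta)
s   <- summary(fit)

# Index of the model with the lowest BIC.
w <- which.min(s$bic)

# Names of the variables in that model, dropping the intercept column.
vars <- setdiff(colnames(s$which)[s$which[w, ]], "(Intercept)")

# Refit the chosen subset with lm() and read the coefficients from there.
refit <- lm(reformulate(vars, response = "mpg"), data = dta)
coef(refit)
```

Because the refit only depends on which variables were selected, it is not affected by the internal column reordering.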

How to extract the p.value and estimate from cor.test() in a data.frame?

In this example, I have temperature values from 50 different sites, and I would like to correlate Site1 with all 50 sites. I want to extract only the components "p.value" and "estimate" generated by cor.test() into two different columns of a data.frame.
My attempt works, but I don't know how!
For that reason I would like to know how I can simplify my code, because the problem is that I have to run the "for" loop twice to get my results.
Here is my example:
# Temperature data
data <- matrix(rnorm(500, 10:30, sd=5), nrow = 100, ncol = 50, byrow = TRUE,
dimnames = list(c(paste("Year", 1:100)),
c(paste("Site", 1:50))) )
# Empty data.frame
df <- data.frame(label=paste("Site", 1:50), Estimate="", P.value="")
# Extraction
for (i in 1:50) {
df1 <- cor.test(data[,1], data[,i] )
df[,2:3] <- df1[c("estimate", "p.value")]
}
for (i in 1:50) {
df1 <- cor.test(data[,1], data[,i] )
df[i,2:3] <- df1[c("estimate", "p.value")]
}
df
I will appreciate very much your help :)

I might offer up the following as well (masking the loops):
result <- do.call(rbind,lapply(2:50, function(x) {
cor.result<-cor.test(data[,1],data[,x])
pvalue <- cor.result$p.value
estimate <- cor.result$estimate
return(data.frame(pvalue = pvalue, estimate = estimate))
})
)

First of all, I'm guessing you had a typo in your code: you should have rnorm(5000, ...) if you want unique values. Otherwise you're going to cycle through those 500 numbers 10 times.
Anyway, a simple way of doing this would be:
data <- matrix(rnorm(5000, 10:30, sd=5), nrow = 100, ncol = 50, byrow = TRUE,
dimnames = list(c(paste("Year", 1:100)),
c(paste("Site", 1:50))) )
# Empty data.frame
df <- data.frame(label=paste("Site", 1:50), Estimate="", P.value="")
estimates = numeric(50)
pvalues = numeric(50)
for (i in 1:50){
test <- cor.test(data[,1], data[,i])
estimates[i] = test$estimate
pvalues[i] = test$p.value
}
df$Estimate <- estimates
df$P.value <- pvalues
df
Edit: I believe your issue is in the line df <- data.frame(label=paste("Site", 1:50), Estimate="", P.value=""). If you run typeof(df$Estimate), you see it's expecting an integer, while typeof(test$estimate) shows it returns a double, so R doesn't know what you're trying to do with those two values. You can redo your code like this:
df <- data.frame(label=paste("Site", 1:50), Estimate=numeric(50), P.value=numeric(50))
for (i in 1:50){
test <- cor.test(data[,1], data[,i])
df$Estimate[i] = test$estimate
df$P.value[i] = test$p.value
}
to make it a little more concise.

Similar to the answer of colemand77, create a cor function:
cor_fun <- function(x, y, method){
tmp <- cor.test(x, y, method= method)
cbind(r=tmp$estimate, p=tmp$p.value) }
Apply it over the columns of data, then transpose the result so that each row holds the r and p for one site:
t(apply(data, 2, cor_fun, data[, 1], "spearman"))
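The answers above can also be combined into a single pass that builds the labelled data frame directly. This is a sketch rather than a definitive version; it uses vapply so that each site must return exactly two numbers, and it regenerates the toy data with rnorm(5000, ...) as suggested earlier.

```r
set.seed(1)
data <- matrix(rnorm(5000, 10:30, sd = 5), nrow = 100, ncol = 50, byrow = TRUE,
               dimnames = list(paste("Year", 1:100), paste("Site", 1:50)))

# One cor.test() per site; each call returns a named vector of length 2.
res <- vapply(colnames(data), function(site) {
  ct <- cor.test(data[, 1], data[, site])
  c(Estimate = unname(ct$estimate), P.value = ct$p.value)
}, numeric(2))

# res is 2 x 50 (statistics in rows), so transpose before binding the labels.
df <- data.frame(label = colnames(data), t(res), row.names = NULL)
head(df)
```

The first row compares Site 1 with itself, so its estimate is 1 by construction.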

p-value matrix of x and y variables from anova output

I have many X and Y variables (something like 500 x 500). The following is just a small example:
yvars <- data.frame (Yv1 = rnorm(100, 5, 3), Y2 = rnorm (100, 6, 4),
Yv3 = rnorm (100, 14, 3))
xvars <- data.frame (Xv1 = sample (c(1,0, -1), 100, replace = T),
X2 = sample (c(1,0, -1), 100, replace = T),
Xv3 = sample (c(1,0, -1), 100, replace = T),
D = sample (c(1,0, -1), 100, replace = T))
I want to extract p-values and make a matrix like this:
Yv1 Y2 Yv3
Xv1
X2
Xv3
D
Here is my attempt to loop the process:
prob = NULL
anova.pmat <- function (x) {
mydata <- data.frame(yvar = yvars[, x], xvars)
for (i in seq(length(xvars))) {
prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
data = mydata))$`Pr(>F)`[1]
}
}
sapply (yvars,anova.pmat)
Error in .subset(x, j) : only 0's may be mixed with negative subscripts
What could be the solution ?
Edit:
For the first Y variable:
prob <- NULL
mydata <- data.frame(yvar = yvars[, 1], xvars)
for (i in seq(length(xvars))) {
prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
data = mydata))$`Pr(>F)`[1]
}
prob
[1] 0.4995179 0.4067040 0.4181571 0.6291167
Edit again:
for (j in seq(length (yvars))){
prob <- NULL
mydata <- data.frame(yvar = yvars[, j], xvars)
for (i in seq(length(xvars))) {
prob[[i]] <- anova(lm(yvar ~ mydata[, i + 1],
data = mydata))$`Pr(>F)`[1]
}
}
Gives the same result as above !!!

Here is an approach that uses plyr to loop over the columns of a dataframe (treating it as a list) for each of the xvars and yvars, returning the appropriate p-value, arranging it into a matrix. Adding the row/column names is just extra.
library("plyr")
probs <- laply(xvars, function(x) {
laply(yvars, function(y) {
anova(lm(y~x))$`Pr(>F)`[1]
})
})
rownames(probs) <- names(xvars)
colnames(probs) <- names(yvars)

Here is one solution, which consists in generating all combinations of Y- and X-variables to test (we cannot use combn) and run a linear model in each case:
dfrm <- data.frame(y=gl(ncol(yvars), ncol(xvars), labels=names(yvars)),
x=gl(ncol(xvars), 1, labels=names(xvars)), pval=NA)
## little helper function to create formula on the fly
fm <- function(x) as.formula(paste(unlist(x), collapse="~"))
## merge both datasets
full.df <- cbind.data.frame(yvars, xvars)
## apply our LM row-wise
dfrm$pval <- apply(dfrm[,1:2], 1,
function(x) anova(lm(fm(x), full.df))$`Pr(>F)`[1])
## arrange everything in a rectangular matrix of p-values
res <- matrix(dfrm$pval, nc=3, dimnames=list(levels(dfrm$x), levels(dfrm$y)))
Sidenote: With high-dimensional datasets, relying on the QR decomposition to compute the p-value of a linear regression is time-consuming. It is easier to compute the matrix of Pearson linear correlations for all pairwise comparisons and transform the r statistic into a Fisher-Snedecor F using the relation $F = \nu_a r^2 / (1 - r^2)$, where the degrees of freedom are defined as $\nu_a = (n-2) - \#\{(x_i = \mathrm{NA}), (y_i = \mathrm{NA})\}$ (that is, (n-2) minus the number of pairwise missing values; if there are no missing values, this formula is the usual coefficient $R^2$ in regression).
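The correlation shortcut from the sidenote can be sketched as follows. It assumes no missing values (so the degrees of freedom are simply n - 2) and that each X variable enters the model as a single numeric predictor, as in lm(y ~ x) above; the seed and the object names r, Fstat, and pmat are illustrative.

```r
set.seed(7)
yvars <- data.frame(Yv1 = rnorm(100, 5, 3), Y2 = rnorm(100, 6, 4),
                    Yv3 = rnorm(100, 14, 3))
xvars <- data.frame(Xv1 = sample(c(1, 0, -1), 100, replace = TRUE),
                    X2  = sample(c(1, 0, -1), 100, replace = TRUE),
                    Xv3 = sample(c(1, 0, -1), 100, replace = TRUE),
                    D   = sample(c(1, 0, -1), 100, replace = TRUE))

n <- nrow(yvars)

# All pairwise Pearson correlations: rows are X variables, columns are Y variables.
r <- cor(xvars, yvars)

# F statistic with 1 and n - 2 degrees of freedom, then upper-tail p-values.
Fstat <- (n - 2) * r^2 / (1 - r^2)
pmat  <- pf(Fstat, df1 = 1, df2 = n - 2, lower.tail = FALSE)
pmat
```

Each entry should agree with anova(lm(y ~ x))$`Pr(>F)`[1] for the corresponding pair, which is an easy spot check on a few cells before scaling up to 500 x 500.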