R Moving Window Model fit with variable window size

I am working with data frames in R whose first column is a numeric date, and the data are sorted in ascending order of date. I want to fit a model (the code I've provided is a simple OLS model) over a rolling 20-day window. For now I have had to assume that there are exactly 124 observations per day so that I can index the window with a for loop, but that assumption does not actually hold. Is there a way to define a 20-day window without making it? My current algorithm is below; the inputs are a data set and two integers, predict and predictor. Any help would be much appreciated.
rollerOLS <- function(data, predict, predictor){
  res   <- list()
  alpha <- c()
  beta  <- c()
  m     <- dim(data)[1]
  for(i in 1:(floor(m/124) - 10)){
    # training window: assumes exactly 124 observations per day
    data.new <- as.data.frame(data[c((1+(124*(i-1))):((i+9)*124)),])
    # following block of rows, used for out-of-sample prediction
    data.pred <- as.data.frame(data[c((1+(124*(i+9))):((i+10)*124)-1),])
    n <- dim(data.new)[1]
    k <- dim(data.pred)[1]
    x <- data.new[-1, predictor]
    y <- data.new[-n, predict]
    mod <- lm(y ~ x)
    ts <- mod$coefficients[1] + mod$coefficients[2]*data.pred[-1, predictor]
    actual <- data.pred[-k, predict]
    alpha[i] <- mod$coefficients[1]
    beta[i]  <- mod$coefficients[2]
  }
  coef <- as.data.frame(cbind(alpha, beta))
  res$coefs <- coef
  res <- as.data.frame(res)
  return(res)
}
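One way to drop the 124-observations-per-day assumption is to index the window by distinct dates rather than by row counts. The sketch below is not meant as a drop-in replacement for the function above (it keeps a 20-day training window followed by a one-day prediction block, but omits the one-row lag between x and y used above, and assumes the date is in the first column):

rollerOLS_byDate <- function(data, predict, predictor, window = 20) {
  dates <- sort(unique(data[[1]]))                 # distinct days, in ascending order
  alpha <- beta <- numeric(0)
  for (i in seq_len(length(dates) - window)) {
    train.days <- dates[i:(i + window - 1)]        # rolling 20-day training window
    pred.day   <- dates[i + window]                # the following day
    data.new   <- data[data[[1]] %in% train.days, ]
    data.pred  <- data[data[[1]] == pred.day, ]
    mod <- lm(data.new[[predict]] ~ data.new[[predictor]])
    ts  <- coef(mod)[1] + coef(mod)[2] * data.pred[[predictor]]
    alpha[i] <- coef(mod)[1]
    beta[i]  <- coef(mod)[2]
  }
  data.frame(alpha = alpha, beta = beta)
}

Because the window is defined by dates, the number of observations per day can vary freely.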

Related

fitting linear regression models with different predictors using loops

I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks one predictor variable at a time from the data frame and fits a model. I then want to extract the regression coefficient (not the intercept) and its sign, and store them in two vectors. Here is my code:
for (x in (1:7))
{
  fit <- lm(distance ~ FAA_unique_with_duration_filtered[x], data = FAA_unique_with_duration_filtered)
  coeff_values <- summary(fit)$coefficients[, 1]
  coeff_value  <- coeff_values[2]
  append(coeff_value_vector, coeff_value, after = length(coeff_value_vector))
  append(RCs_sign_vector, sign(coeff_values[2]), after = length(RCs_sign_vector))
}
Here, x will pick out the first column, then the second, and so on. However, I am getting the following error:
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use the following, where reg is either of the reg definitions above.
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
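If you would rather keep the loop structure from the question, the error can be avoided by building the formula from the column name instead of passing a one-column data frame into the formula. This is only a sketch using the names from the question, and it assumes the first 7 columns of FAA_unique_with_duration_filtered are the predictors:

predictors <- names(FAA_unique_with_duration_filtered)[1:7]   # assumed predictor columns
coeff_value_vector <- numeric(length(predictors))
RCs_sign_vector    <- numeric(length(predictors))
for (x in seq_along(predictors)) {
  f   <- reformulate(predictors[x], response = "distance")    # builds distance ~ <predictor>
  fit <- lm(f, data = FAA_unique_with_duration_filtered)
  coeff_value_vector[x] <- coef(fit)[2]
  RCs_sign_vector[x]    <- sign(coef(fit)[2])
}

Note that append() called on its own, as in the question, does not modify the vector in place; you need to assign its result (or index into a preallocated vector, as above).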

Different set.seed each run in R

I want to "measure" which regression method is more robust to outliers.
To do this, I sum the variances of the model coefficients. On each run I generate data from a t-distribution, and I use set.seed ten times to get ten specific data sets.
However, I also want ten different seeds on each run, so that in total I get 10 sums of variances. The code below only gives me the single sum for the first set of (ten different) seeds.
How can I do this?
#######################################
p <- 5
n <- 50
#######################################
FX <- function(seed, data) {
  # loop over the seeds #
  for (i in seed) {
    set.seed(seed)
    # generate data from a t-distribution #
    x <- matrix(rt(n*p, 1), ncol = p)
    y <- rt(n, 1)
    dat  <- cbind(x, y)
    data <- as.data.frame(dat)
    # fit the regression models on the data #
    lm1 <- lm(y ~ ., data = data)
    lm.coefs <- coef(lm1)
    lad1 <- lad(y ~ ., data = data, method = "BR")
    lad.coefs <- coef(lad1)
  }
  # calculate the variance of the coefficients #
  return(`attr<-`(cbind(lmm = var(lm.coefs), lad = var(lad.coefs)), "seed", seed))
}
#######################################
seeds <- 1:10                          ## 10 seeds to get different data sets from the t-distribution
res <- lapply(seeds, FX, data = data)  # 10 different variances from 10 data sets/models
sov <- t(sapply(res, colSums))         # put them in a matrix
colSums(sov)                           # sum of the 10 variances for each model
Here is something closer to your intended result. It was not entirely clear what data the function was meant to return, but the code below fixes the key issues in your original code:
1. It creates the vector of seed numbers inside the function.
2. It also creates a vector inside the function to store the variance of the coefficients from each iteration of the loop (not sure if this is what you want).
3. The lad() call is commented out, since I do not know which package it comes from; to add it back in, follow point 2 above.
4. The code is generally cleaned up.
p <- 5
n <- 50

FX <- function(seed, data) {
  # fix the starting-seed issue: each call gets its own block of 10 seeds
  startingSeed <- (seed - 1)*10 + 1
  seeds <- seq(startingSeed, startingSeed + 9)
  # create a vector to store the results from each loop iteration
  lm.coefs <- vector(mode = "numeric", length = 10)
  index <- 1
  for (i in seeds) {
    set.seed(i)
    # generate data from a t-distribution
    x <- matrix(rt(n*p, 1), ncol = p)
    y <- rt(n, 1)
    data <- data.frame(x, y)
    # fit a regression model and store the variance of its coefficients
    lm1 <- lm(y ~ ., data = data)
    lm.coefs[index] <- var(coef(lm1))
    # lad1 <- lad(y ~ ., data = data, method = "BR")
    # lad.coefs <- coef(lad1)
    index <- index + 1
  }
  return(`attr<-`(cbind(lmm = lm.coefs), "seed", seed))
}

seeds <- 1:10                          ## 10 seeds, each giving a different block of data sets
res <- lapply(seeds, FX, data = data)  # 10 different variances from 10 data sets/models
sov <- t(sapply(res, colSums))         # put them in a matrix
colSums(sov)                           # sum of the 10 variances for each model
Hope this provides the answer or at least guidance to solve your problem.
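If you want the LAD fit back in alongside lm, one option is to return one column per method so that colSums(sov) yields one total per model. The sketch below substitutes quantreg::rq() with tau = 0.5 for the lad() function from the question (whose package was not specified), and otherwise follows the same structure:

library(quantreg)  # rq(..., tau = 0.5) fits a least-absolute-deviations regression

FX2 <- function(seed, data) {
  seeds <- seq((seed - 1)*10 + 1, length.out = 10)
  lm.var <- lad.var <- numeric(length(seeds))
  for (k in seq_along(seeds)) {
    set.seed(seeds[k])
    x <- matrix(rt(n*p, 1), ncol = p)
    y <- rt(n, 1)
    dat <- data.frame(x, y)
    lm.var[k]  <- var(coef(lm(y ~ ., data = dat)))
    lad.var[k] <- var(coef(rq(y ~ ., tau = 0.5, data = dat)))
  }
  cbind(lmm = lm.var, lad = lad.var)
}

res <- lapply(1:10, FX2, data = NULL)
sov <- t(sapply(res, colSums))  # one row per outer seed, one column per method
colSums(sov)                    # total variance for each method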

Random forest variable importance AND direction of correlation for binomial response

I am using the randomForest package in R, but am not partial to solutions using other packages.
My RF model uses various continuous and categorical variables to predict extinction risk (Threatened, Non_Threatened). I would like to be able to show the direction of variable importance for the predictors used in my RF model. Other publications have done exactly this: see Figure 1 in https://www.pnas.org/content/pnas/109/9/3395.full.pdf
Any ideas on how to do something similar? One suggestion I read was to simply compare the difference between two partial dependence plots (example below), but I feel this may not be the best way.
Any help would be greatly appreciated.
partialPlot(final_rf, rf_train, size_mat,"Threatened")
partialPlot(final_rf, rf_train, size_mat,"Non_Threatened")
(Each call produces a partial dependence plot for size_mat, one for the "Threatened" response and one for "Non_Threatened".)
You could use something like an average marginal effect (or, as below, an average first difference) approach.
First, I'll make some data
set.seed(11)
n <- 200
p <- 5
X <- data.frame(matrix(runif(n * p), ncol = p))
yhat <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - .5)^2 + 10 * -X[, 4] + 5 * -X[, 5]
y <- as.numeric((yhat + rnorm(n)) > mean(yhat))
df <- as.data.frame(cbind(X, y))
Next, we'll estimate the RF model:
library(randomForest)
rf <- randomForest(as.factor(y) ~ ., data=df)
Next, we can loop through each variable; each time through the loop we add one standard deviation to a single x variable for all observations. In your case, you could also change categorical variables from one category to another. We then predict the probability of a positive response under both conditions, the original data and the data with a standard deviation added to the variable in question, take the difference, and summarize it.
nx <- names(df)
nx <- nx[-which(nx == "y")]
res <- NULL
for(i in 1:length(nx)){
  p1 <- predict(rf, newdata = df, type = "prob")
  df2 <- df
  df2[[nx[i]]] <- df2[[nx[i]]] + sd(df2[[nx[i]]])
  p2 <- predict(rf, newdata = df2, type = "prob")
  diff <- (p2 - p1)[, 2]
  res <- rbind(res, c(mean(diff), sd(diff)))
}
colnames(res) <- c("effect", "sd")
rownames(res) <- nx
res
# effect sd
# X1 0.11079 0.18491252
# X2 0.10265 0.16552070
# X3 0.02015 0.07951409
# X4 -0.11687 0.16671916
# X5 -0.04704 0.10274836
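If the end goal is a figure like the one referenced in the question (signed importance for each predictor), the summary above can be plotted directly. A minimal base-graphics sketch; using the per-observation standard deviation from res as the whisker width is an assumption about what you want to display:

# signed average first differences, one bar per predictor
bp <- barplot(res[, "effect"], names.arg = rownames(res),
              ylab = "Average change in P(positive class)",
              ylim = range(res[, "effect"] - res[, "sd"], res[, "effect"] + res[, "sd"]))
# rough +/- 1 sd whiskers around each bar
arrows(bp, res[, "effect"] - res[, "sd"], bp, res[, "effect"] + res[, "sd"],
       angle = 90, code = 3, length = 0.05)
abline(h = 0, lty = 2)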

using predict with averages of coefficients

I have a time-series dataset and am taking a rolling average of the past 2 years of coefficients and applying that to the current year variables. I created a method that applies the average, but I am wondering if I can do this with a function like predict so I don't have to individually write out each variable.
A simplified version of my code is like this:
formula <- as.formula(a ~ b + c)

subset <- which(data$year == 2009)
fm1 <- lm(formula, data[subset, ])

subset2 <- which(data$year == 2010)
fm2 <- lm(formula, data[subset2, ])

# take the average of the coefficients from these two regressions
avg_coeff <- (fm1$coeff + fm2$coeff)/2

# apply the average coefficients to the current year's data
subset3 <- which(data$year == 2011)
subset_data <- data[subset3, ]
a_hat <- avg_coeff[1] + avg_coeff[2] * subset_data$b + avg_coeff[3] * subset_data$c
This method works, but I want to keep the lm object so I can just use the predict method and do:
a_hat <- predict(fm, subset_data)
where fm$coefficients contains avg_coeff. I tried
fm2$coeff <- avg_coeff
This does change fm2$coeff, but when I run predict with fm2 it still uses the original coefficients.
If you want to do this for each year, maybe you can do something like:
require(plyr)
formula <- as.formula(a ~ b + c)
fn <- function(X) {
  fm <- lm(formula, X)
  fm$coefficients
}
# one coefficient vector per year
coeffs <- dlply(data, .(year), fn)
# element-wise average of the coefficient vectors across years
avg_coeffs <- rowMeans(do.call(cbind, coeffs))
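On the predict() part of the question: fm2$coeff <- avg_coeff does not touch the coefficients that predict() sees, because $<- does not partial-match names; it just adds a new component called coeff, while predict.lm() keeps reading the original coefficients component. Assigning to the full name works for point predictions (a sketch; note that the standard errors and other parts of the copied object are no longer consistent with the altered coefficients):

fm_avg <- fm2                                     # start from one of the fitted lm objects
fm_avg$coefficients <- avg_coeff                  # full name, so the real component is replaced
a_hat <- predict(fm_avg, newdata = subset_data)   # point predictions now use the averaged coefficients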

Running multiple GAMM models using for loop or lapply

Can someone please help me with running multiple GAMM models in a for loop or with lapply? I have a set of 10 response and 20 predictor variables in a large data frame, arranged in columns.
I'd like to fit a GAMM model for each predictor-response combination and summarize their coefficients and significance tests in a table.
models <- gamm(AnimalCount ~ s(temperature), data = dat, family = poisson(link = log), random = list(Province = ~1))
I think one way to do this is to create a "matrix" list where the number of rows and columns corresponds to the number of responses (i) and predictors (j), respectively. Then you can store each model result in the cell[i, j]. Let me illustrate:
## make up some data
library(mgcv)
set.seed(0)
dat <- gamSim(1, n = 200, scale = 2)
set.seed(1)
dat2 <- gamSim(1, n = 200, scale = 2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
Now the made-up data has 2 responses (y, y1) and 8 predictors (x0 ~ x7). I think you can simplify the process by storing the responses and predictors in separate data frames:
d_resp <- d[, c("y", "y1")]
d_pred <- d[, !(colnames(d) %in% c("y", "y1"))]

## create a "matrix" list of dimensions i x j
results_m <- vector("list", length = ncol(d_resp) * ncol(d_pred))
dim(results_m) <- c(ncol(d_resp), ncol(d_pred))

for(i in 1:ncol(d_resp)){
  for(j in 1:ncol(d_pred)){
    results_m[i, j][[1]] <- gamm(d_resp[, i] ~ s(d_pred[, j]))
  }
}
# flatten the "matrix" list
results_l <- do.call("list", results_m)
You can use sapply/lapply to create a data frame summarizing the coefficients, etc. Say you want to extract the fixed-effect intercepts and slopes and store them in a data frame:
data.frame(t(sapply(results_l, function(l) l$lme$coef$fixed)))
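If you also want the significance tests in the same table, the smooth-term statistics can be pulled from the $gam part of each fit: summary() of a gam object has an s.table component with the edf, test statistic, and p-value for each smooth. A sketch under the same setup as above:

# one row per response/predictor combination: fixed intercept plus the smooth-term test
summ <- lapply(results_l, function(l) {
  st <- summary(l$gam)$s.table                 # edf, Ref.df, statistic, p-value for the smooth
  c(intercept = unname(l$lme$coef$fixed[1]),
    edf       = st[1, "edf"],
    p.value   = st[1, ncol(st)])
})
do.call(rbind, summ)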
