Is it possible to make a rolling randomForest reproducible?

I have the following randomForest loop:
require(randomForest)
require(xts)
require(quantmod)
getSymbols("MSFT", from = '2017-01-01', to = '2017-09-13', src = "yahoo")
rets <- diff(log(Ad(MSFT))); rets[is.na(rets)]=0
rets1 <- ifelse(rets>0,"up","dn")
Lag1 <- lag(rets,1);Lag1[is.na(Lag1)]=0;Lag1 <- ifelse(Lag1>0,"up","dn")
Lag2 <- lag(rets,2);Lag2[is.na(Lag2)]=0;Lag2 <- ifelse(Lag2>0,"up","dn")
Lag3 <- lag(rets,3);Lag3[is.na(Lag3)]=0;Lag3 <- ifelse(Lag3>0,"up","dn")
preds <- rets* 0
win=20
for (i in win:(NROW(rets) - 1)) {
  RFDataf <- data.frame(factor(Lag1), factor(Lag2), factor(Lag3))
  fit <- try(randomForest(RFDataf[(i - win + 1):i, ], factor(as.character(rets1[(i - win + 1):i])), ntree = 1000, mtry = round(sqrt(NCOL(RFDataf)), 0)))
  # inherits() is the safe test here: class() can return more than one value
  if (!inherits(fit, "try-error")) {
    fit0 <- fit
    preds[(i + 1)] <- predict(fit, RFDataf[(i + 1), ])
  } else {
    # fall back to the last successful fit, if there is one
    if (exists("fit0")) {
      preds[(i + 1)] <- predict(fit0, RFDataf[(i + 1), ])
    } else {
      next
    }
  }
}
I read these carefully:
https://stats.stackexchange.com/questions/120446/different-results-from-several-passes-of-random-forest-on-same-dataset
https://github.com/mlr-org/mlr/issues/938
I understand that if an R function calls code written in another language, set.seed() may not control its random numbers.
I have the additional problem of a loop.
If I used a function written purely in R inside a loop, how could I keep the results reproducible?
Is there a way to make set.seed() work in a loop? Is there an alternative solution?
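One way to approach the loop part of the question is to reseed inside every iteration, so each step is reproducible regardless of what ran before it. The sketch below uses only base R and assumes the fitting function draws from R's own random number generator (worth verifying for any given package). The names noisy_fit, rolling_run and base_seed are hypothetical stand-ins, not part of randomForest.

```r
# Hypothetical stand-in for a stochastic fit: the mean of a bootstrap resample.
noisy_fit <- function(x) mean(sample(x, replace = TRUE))

# Rolling loop that reseeds at every step (base_seed is an arbitrary choice).
rolling_run <- function(x, win = 20, base_seed = 42) {
  out <- rep(NA_real_, length(x))
  for (i in win:(length(x) - 1)) {
    set.seed(base_seed + i)  # step i no longer depends on earlier draws
    out[i + 1] <- noisy_fit(x[(i - win + 1):i])
  }
  out
}

set.seed(1)
x <- rnorm(60)
run1 <- rolling_run(x)
run2 <- rolling_run(x)
identical(run1, run2)
```

If the whole loop is always rerun from the start, a single set.seed() call before the loop should be enough; reseeding per iteration mainly helps when individual windows are rerun in isolation.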

Related

For loop using decimals

I was wondering if I could create a for loop where i goes up by decimals. I have tried writing:
for (i in seq(2, 6, .1)) {
  data1 <- data[data$x1 > i, ]
  model <- lm(y ~ x1, data = data1)
  r <- summary(model)$r.squared
  result[[i]] <- r
}
but the result only contains 5 observations, one for each of the integers from 2 to 6.
Is there a way to get around this?

result[[i]] inside your loop will not work as intended with decimal values of i,
because non-integer indexes are truncated to integers (2.1, 2.2, ... all write to element 2).
Other than that, you can loop in increments of .1 if you think of them as integer steps divided by 10:
for (i in seq(20, 60)) {
  div <- i / 10
  data1 <- data[data$x1 > div, ]
  model <- lm(y ~ x1, data = data1)
  # shift the index so the results fill positions 1 to 41
  result[[i - 19]] <- summary(model)$r.squared
}
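The truncation of decimal indexes can be seen in a small base R sketch. The object names here (x, steps, res) are made up for illustration, and the second half shows the usual workaround of indexing by position rather than by value.

```r
# Decimal indexes are truncated toward zero, so 2 and 2.5 hit the same slot.
x <- list()
for (i in seq(2, 3, 0.5)) x[[i]] <- i
length(x)   # 3: slot 1 is NULL, slot 2 was overwritten by 2.5
x[[2]]      # 2.5

# Workaround: loop over positions and look the value up.
steps <- seq(2, 3, 0.5)
res <- numeric(length(steps))
for (k in seq_along(steps)) res[k] <- steps[k]^2
res
```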

Filling a result inside a for loop is not the most idiomatic approach in R. The version below is better (though still not the best) and stays close to your original code.
data = data.frame(y = runif(1000), x1 = runif(1000, 1, 7))
# fit y ~ x1 on the rows where x1 exceeds the cutoff x and return R-squared
f = function(x, data)
{
  data1 <- data[data$x1 > x, ]
  model <- lm(y ~ x1, data = data1)
  r <- summary(model)$r.squared
  return(r)
}
results = lapply(seq(2, 6, .1), f, data)
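As a possible variation on the same idea, vapply returns a plain numeric vector that can be named by cutoff, which makes the results easier to look up. This is only a sketch: r2_above and cutoffs are hypothetical names, and the data are simulated.

```r
set.seed(1)
data <- data.frame(y = runif(200), x1 = runif(200, 1, 7))

# Hypothetical helper: R-squared of y ~ x1 on rows with x1 above the cutoff.
r2_above <- function(cutoff, data) {
  sub <- data[data$x1 > cutoff, ]
  summary(lm(y ~ x1, data = sub))$r.squared
}

cutoffs <- seq(2, 6, 0.1)
results <- vapply(cutoffs, r2_above, numeric(1), data = data)
names(results) <- cutoffs   # label each value with its cutoff
head(results, 3)
```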

R issue : performing lm and then a boxcox to find a proper lambda value

I have a dataset of climate data in a data.frame (columns are measuring stations, and rows indicate time of measurement), and I'm trying to find the proper lambda values in a Yeo-Johnson transform to limit skewness impact on a principal component analysis.
The first step is to get the log-likelihoods needed to find the best lambda. I use the following, where i is the index of a column:
getYeoJohsnonLambda <- function(myClimateData,cols,lambda_min, lambda_max,eps)
...
lambda <- seq(lambda_min,lambda_max,eps)
for(i in cols)
{
formula <- as.formula(paste("myClimateData$",colnames(myClimateData)[i],"~1"))
currentModel <- lm(formula,myClimateData)
print(currentModel)
myboxCox <- boxCox(currentModel, lambda = lambda ,family="yjPower", plotit = FALSE)
...
}
When I call it on a climateData time series such as:
`climateData <-data.frame(c(8.2,6.83,5.46,4.1,3.73,3.36,3,3,3,3,3.7),c(0,0.66,1.33,2,2,2,2,2,2,2,1.6))`
I get this error: Error in is.data.frame(data) : object 'myClimateData' not found
This is odd, because lm seems to find it and returns a correct fit, and myClimateData should be visible since it is one of the function's arguments, right?

Sadly, the problem seems to come from the boxCox function rather than from your getYeoJohsnonLambda function. As BrodieG pointed out in a related question, boxCox passes parent.frame as an argument to eval, which the documentation considers bad practice.
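The same kind of scoping failure can be reproduced in base R. The sketch below does not replicate car's internals; it only assumes that a helper re-evaluates the stored model call from its own frame, which is what update() does. The names refit, wrapper and local_data are hypothetical.

```r
# Hypothetical helper: update() re-evaluates model$call from this frame.
refit <- function(model) update(model)

# The data frame exists only inside wrapper(), so refit() cannot see it.
wrapper <- function(local_data) {
  fit <- lm(y ~ 1, data = local_data)
  refit(fit)
}

msg <- tryCatch(
  {
    wrapper(data.frame(y = c(1, 2, 4)))
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# When the data are visible from where the call is re-evaluated, it works.
global_data <- data.frame(y = c(1, 2, 4))
ok <- refit(lm(y ~ 1, data = global_data))
coef(ok)
```

The first lm call succeeds because local_data is visible where it is written; only the later re-evaluation fails, which matches the symptom in the question.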
One way to solve this is to build the models before the call, as suggested in Adam Quek's answer:
library(car)
climateData <- data.frame(c(8.2,6.83,5.46,4.1,3.73,3.36,3,3,3,3,3.7),c(0,0.66,1.33,2,2,2,2,2,2,2,1.6))
names(climateData) <- c("a","b")
modelList <- list()
for(k in 1:ncol(climateData)) {
modelList[[k]] <- lm(as.formula(paste0(names(climateData)[k],"~1")),data=climateData)
}
getYeoJohnsonLambda <- function(myClimateData, cols, lambda_min, lambda_max, eps)
{
#Recommended values for lambda_min = -0.5 and lambda_max = 2.0, eps = 0.1
myboxCox <- list()
lmd <- seq(lambda_min,lambda_max,eps)
for(i in cols)
{
cat("Creating model for column # ",i,"\n")
currentModel <- modelList[[i]]
myboxCox[[i]] <- boxCox(currentModel, lambda = lmd ,family="yjPower", plotit = FALSE)
}
return(myboxCox)
}
test <- getYeoJohnsonLambda(climateData,c(1,2) ,-0.5,2,0.1)
Other solution (arguably cleaner): use yeo.johnson in VGAM
library(VGAM)
getYeoJohnsonLambda_VGAM <- function(myClimateData, cols, lambda_min, lambda_max, eps)
{
  #Recommended values for lambda_min = -0.5 and lambda_max = 2.0, eps = 0.1
  lmd <- seq(lambda_min, lambda_max, eps)
  # use the function's own arguments rather than the global climateData
  return(apply(myClimateData[, cols, drop = FALSE], 2, yeo.johnson, lambda = lmd))
}
test2 <- getYeoJohnsonLambda_VGAM(climateData, c(1, 2), -0.5, 2, 0.1)

Here's a solution without troubleshooting the function getYeoJohsnonLambda:
iris.dat <- iris[-5]
vars <- names(iris.dat)
lmd <- seq(.1, 1, .1) #lambda_min, lambda_max, eps
all.form <- lapply(vars, function(x) as.formula(paste0(x, "~ 1")))
all.lm <- lapply(all.form, lm, data=iris.dat)
library(MASS)
all.bcox <- lapply(all.form, boxcox, data=iris.dat,
lambda=lmd, family="yjPower", plotit=FALSE)

compute function in neuralnet R package not working when reproduced

I trained a model with neuralnet and I am trying to work out how to compute its results in Excel. Calling the package's compute function works fine. However, when I copy its source code (viewed with F2 in RStudio and on the GitHub page) and run it myself, the function stops at the relist() call with this error:
Error in relist(weights, nrow.weights, ncol.weights) :
unused argument (ncol.weights)
I think the problem is the relist() function, but I do not know how to transform the weights without it, and the neuralnet package does not seem to come with its own version of relist(). If I skip the relist line, I get a different error: Error in neurons[[i]] %*% weights[[i]] : non-conformable arguments, because the weights were not transformed correctly. I tried the same thing on my own data and got the same error.
library(neuralnet)
library(MASS)  # provides the Cars93 data set
normalize <-function(x) {
return((x - min(x))/(max(x) - min(x)))
}
newdf <- Cars93
newdf = na.omit(newdf)
newdf <- newdf[complete.cases(newdf),]
newdf$Cylinders <- as.numeric(levels(newdf$Cylinders))[newdf$Cylinders]
newdf$Horsepower <- normalize(newdf$Horsepower)
newdf$EngineSize <- normalize(newdf$EngineSize)
newdf$Cylinders <- normalize(newdf$Cylinders)
smp_size <- floor(0.75 * nrow(newdf))
set.seed(12)
train_ind <- sample(seq_len(nrow(newdf)), size = smp_size)
train <- newdf[train_ind, ]
test <- newdf[-train_ind, ]
carsNN <- neuralnet(Horsepower ~ Cylinders+EngineSize,
data = train,hidden = c(1))
cars_results = compute(carsNN,test[11:12])
#this is the source code using F2 in RStudio and on github
sourceCodeCompute = function (x, covariate, rep = 1)
{
nn <- x
linear.output <- nn$linear.output
weights <- nn$weights[[rep]]
nrow.weights <- sapply(weights, nrow)
ncol.weights <- sapply(weights, ncol)
weights <- unlist(weights)
if (any(is.na(weights)))
weights[is.na(weights)] <- 0
weights <- relist(weights, nrow.weights, ncol.weights)
length.weights <- length(weights)
covariate <- as.matrix(cbind(1, covariate))
act.fct <- nn$act.fct
neurons <- list(covariate)
if (length.weights > 1)
for (i in 1:(length.weights - 1)) {
temp <- neurons[[i]] %*% weights[[i]]
act.temp <- act.fct(temp)
neurons[[i + 1]] <- cbind(1, act.temp)
}
temp <- neurons[[length.weights]] %*% weights[[length.weights]]
if (linear.output)
net.result <- temp
else net.result <- act.fct(temp)
list(neurons = neurons, net.result = net.result)
}
sourceCodeCompute(carsNN,test[11:12])

You are using the wrong relist function: outside the package, a bare relist() call resolves to utils::relist, which takes different arguments. Try calling neuralnet:::relist explicitly; that is the unexported function used automatically inside the package namespace.
(I don't see how this question relates to Excel.)
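For porting the calculation elsewhere, it may help to see what such a step has to do. The sketch below is a base R stand-in under the assumption that the internal function simply cuts the flat weight vector back into matrices, column by column; relist_weights is a hypothetical name, not the package's code.

```r
# Hypothetical stand-in: rebuild a list of matrices from a flat vector,
# given the row and column counts of each matrix (column-major order).
relist_weights <- function(x, nrow, ncol) {
  out <- vector("list", length(nrow))
  start <- 1
  for (i in seq_along(nrow)) {
    n <- nrow[i] * ncol[i]
    out[[i]] <- matrix(x[start:(start + n - 1)], nrow = nrow[i], ncol = ncol[i])
    start <- start + n
  }
  out
}

# Round trip on toy weights: flatten, then rebuild.
w <- list(matrix(1:6, nrow = 3), matrix(7:8, nrow = 2))
flat <- unlist(w)
rebuilt <- relist_weights(flat, sapply(w, nrow), sapply(w, ncol))
identical(rebuilt, w)
```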

Viewing AIC for many variables with proper header

I'm pretty new to R and I'm stuck on one problem.
I've already found out how to create many linear models at once, and I wrote a loop that computes the AIC for each lm, but I cannot display the results with a header showing which lm each value belongs to. I want a data frame with a label such as lm(a~b+c, data=data) next to the AIC result for that lm.
Here's what I have written so far (with a lot of help from Stack Overflow, of course):
vars <- c("azot_stand", "przeplyw", "pH", "twardosc", "fosf_stand", "jon_stand", "tlen_stand", "BZO_stand", "spadek_stand")
N <- list(1,2,3,4,5,6,7,8)
COMB <- sapply(N, function(m) combn(x=vars[1:8], m))
COMB2 <- list()
k=0
for(i in seq(COMB)){
tmp <- COMB[[i]]
for(j in seq(ncol(tmp))){
k <- k + 1
COMB2[[k]] <- formula(paste("azot_stand", "~", paste(tmp[,j], collapse=" + ")))
}
}
res <- vector(mode="list", length(COMB2))
for(i in seq(COMB2)){
res[[i]] <- lm(COMB2[[i]], data=s)
}
aic <- vector(mode="list", length(COMB2))
d=0
for(i in seq(res)){
aic[[i]] <- AIC(res[[i]])
}
View(aic)
show(COMB2)
I guess that I am missing something in the aic step, but I don't know what...

With formula you can obtain the formula of a regression model. Since you want to store the formula with the AIC, I would create a data.frame containing both:
aic <- data.frame(model = character(length(res)), aic = numeric(length(res)),
stringsAsFactors = FALSE)
for(i in seq(res)){
aic$model[i] <- deparse(formula(res[[i]]), width.cutoff = 500)
aic$aic[i] <- AIC(res[[i]])
}
Normally you would use format to convert a formula to a character string. However, for long formulas this results in multiple lines. Therefore, I use deparse (which format also uses) and pass it the width.cutoff argument.
You cannot use res[[i]]$call as this is always equal to lm(formula = COMB2[[i]], data = s).
Other suggestions
The first part of your code can be simplified. I would write something like:
s <- attitude
vars <- names(attitude)[-1]
yvar <- names(attitude)[1]
models <- character(0)
for (i in seq_along(vars)) {
comb <- combn(vars, i)
models <- c(models,
paste(yvar, " ~ ", apply(comb, 2, paste, collapse=" + ")))
}
res <- lapply(models, function(m) lm(as.formula(m), data = s))
It is shorter and also has the advantage that magic constants such as the 8 and azot_stand are defined outside the main code and can easily be modified.
I also noticed that you use azot_stand both as the target variable and as a predictor (it is also part of vars). I don't think you want to do that.
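Putting the two parts of this answer together, a compact sketch on the built-in attitude data might look like the following. It is limited to three predictors to keep the number of models small, and sorting by AIC is an optional extra step that was not part of the original request.

```r
s <- attitude
yvar <- names(s)[1]    # response: rating
vars <- names(s)[2:4]  # three predictors, chosen only to keep the example small

# Build every formula string for 1, 2 and 3 predictors.
models <- character(0)
for (i in seq_along(vars)) {
  comb <- combn(vars, i)
  models <- c(models, paste(yvar, "~", apply(comb, 2, paste, collapse = " + ")))
}

res <- lapply(models, function(m) lm(as.formula(m), data = s))

# One row per model: its formula as text and its AIC, best (lowest) first.
aic <- data.frame(model = models,
                  aic = vapply(res, AIC, numeric(1)),
                  stringsAsFactors = FALSE)
aic <- aic[order(aic$aic), ]
aic
```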

Non-Deterministic behaviour of svm{e1071}

I noticed that SVM, when called with decision.values=T (plus a sigmoid to get probabilities), produces non-deterministic results when I permute the data frame under analysis. Does anyone have any idea why? Please try the code yourself:
install.packages("e1071")
library(e1071)
A <- cbind(rnorm(20,1,1),rnorm(20,1,1),rep(1,20))
B <- cbind(rnorm(20,9,1),rnorm(20,9,1),rep(0,20))
dataframe <- as.data.frame(rbind(A,B))
predc <- rep(0,length(dataframe[,1]))
K <- length(dataframe[1,])
permutator <- sample(nrow(dataframe))
dataframe$V3 <- factor(dataframe$V3)
dataframe <- dataframe[permutator, ]
for(i in 1:length(dataframe[,1])) {
frm <- as.formula(object=paste("V",as.character(K), " ~ .",sep=""))
r <- svm(formula=frm, data=(dataframe[-i,]))
predicted <- predict(r,newdata=dataframe[i,],decision.values=TRUE)
predc[i] <- sigmoid(attr(predicted,'decision.values')[1])
}
plot(sort(predc))