I get an error when i try to run this code. i followed a guide on youtube for building a neural network. Everything works except when i try to run this code to fit the model.
history <- modnn %>% fit(
train_X_matrix, train_Y, epochs = 50, batch_size = 600,
validation_data = list(validation_X_matrix,validation_Y))
```
the error i get when i try to run the code above, all the names you see are the names of the columns. So the features of the model:
[error in visual studio](https://i.stack.imgur.com/ZaPXw.png)
some extra info about the variables i use. Here i created a matrix of the input variables. They did this in the guide. I tried train_x_data as input then it gave the same error but immediately so not after 1 epoch
```
# dependent and independent variables in 1 dataframe
train_X_data <- data.frame(train_X,train_y)
validation_X_data <- data.frame(validation_X,validation_y)
train_X_matrix <- model.matrix(average_daily_rate ~. -1 , data = train_X_data)
train_Y <- train_X_data$average_daily_rate
validation_X_matrix <- model.matrix(average_daily_rate ~. -1, data = validation_X_data)
validation_Y <- validation_X_data$average_daily_rate
```
The model i use, it is just a simple single layer model for testing.
# 1) single layer model structure
# step 1 make architecture powerful enough
modnn <- keras_model_sequential() %>%
layer_dense(units = 500, activation = "relu",
input_shape = ncol(train_X)) %>%
layer_dense(units = 1)
summary(modnn)
modnn %>% compile(loss = "mse",
optimizer = optimizer_rmsprop(),
metrics = list("mean_absolute_error"))
The error occurs after running the first epoch. I tought it was because the model could not read the names of the columns, but i tried a lot of things and nothing seemed to work.
Does anyone have an idea on how to fix this?
I'm new in R and I'm currently working with RandomForest Analysis.
I need to create at least 100 replicates of a RF model, each one with different test/train data.
I would like to automate the task wrapping the code into a loop if that's possible, and save the results of every model.
Without a loop, I have to run the code every time and manually write the output.
This is my code:
#split data into 80 for training/20 for testing
obs_split <- obs_split %>%
split(if_else(runif(nrow(.)) <= 0.8, "train", "test"))
map_int(obs_split, nrow)
# grow random forest with ranger package
detection_freq <- mean(obs_split$train$species_observed)
# ranger requires a factor response to do classification
obs_split$train$species_observed <- factor(obs_split$train$species_observed)
rf <- ranger(formula = species_observed ~ .,
data = obs_split$train,
importance = "impurity",
probability = TRUE,
replace = TRUE,
sample.fraction = c(detection_freq, detection_freq))
I would appreciate any solution! Thank you
I am using keras package in R to train a deep learning model. My data set is highly imbalanced. Therefore, I want to set class_weight argument in the fit function. Here is the fit function and its arguments that I used for my model
history <- model %>% fit(
trainData, trainClass,
epochs = 5, batch_size = 1000,
class_weight = ????,
validation_split = 0.2
)
In python I can set class_weight as follow:
class_weight={0:1, 1:30}
But I am not sure how to do it in R. In the help menu of R it describes class_weight as follow:
Optional named list mapping indices (integers) to a weight (float) to
apply to the model's loss for the samples from this class during
training. This can be useful to tell the model to "pay more attention"
to samples from an under-represented class.
Any idea or suggestions?
Class_weight needs to be a list, so
history <- model %>% fit(
trainData, trainClass,
epochs = 5, batch_size = 1000,
class_weight = list("0"=1,"1"=30),
validation_split = 0.2
)
seems to work. Keras internally uses a function called as_class_weights to change the list to a python-dictionary (see https://rdrr.io/cran/keras/src/R/model.R).
class_weight <- dict(list('0'=1,'1'=10))
class_weight
>>> {0: 1.0, 1: 10.0}
Looks just like the python dictionary that you mentioned above.
I found a generic solution in Python solution, so I converted into R:
counter=funModeling::freq(Y_data_aux_tr, plot=F) %>% select(var, frequency)
majority=max(counter$frequency)
counter$weight=ceil(majority/counter$frequency)
l_weights=setNames(as.list(counter$weight), counter$var)
Using it:
fit(..., class_weight = l_weights)
An advice if you are using fit_generator: since the weights are based on frequency, having a different number of training-validation samples may bias the validation results. They should be equally-sized.
I understand why parallel processing can be used during training only for XGB and cannot be used for other models. However, surprisingly I noticed that predict with xgb uses parallel processing too.
I noticed this by accident when I split my large 10M + data frame into pieces to predict on using foreach %dopar%. This caused some errors so to try to get around them I switched to sequential looping with %do% but noticed in the terminal that all processors where being used.
After some trial and error I found that caret::train() appears to use parallel processing where the model is XGBtree only (possibly others) but not on other models.
Surely predict could be done on parallel with any model, not just xgb?
Is it the default or expected behaviour of caret::predict() to use all available processors and is there a way to control this by e.g. switching it on or off?
Reproducible example:
library(tidyverse)
library(caret)
library(foreach)
# expected to see parallel here because caret and xgb with train()
xgbFit <- train(Species ~ ., data = iris, method = "xgbTree",
trControl = trainControl(method = "cv", classProbs = TRUE))
iris_big <- do.call(rbind, replicate(1000, iris, simplify = F))
nr <- nrow(iris_big)
n <- 1000 # loop over in chunks of 20
pieces <- split(iris_big, rep(1:ceiling(nr/n), each=n, length.out=nr))
lenp <- length(pieces)
# did not expect to see parallel processing take place when running the block below
predictions <- foreach(i = seq_len(lenp)) %do% { # %do% is a sequential loop
# get prediction
preds <- pieces[[i]] %>%
mutate(xgb_prediction = predict(xgbFit, newdata = .))
return(preds)
}
If you change method = "xgbTree" to e.g. method = "knn" and then try to run the loop again, only one processor is used.
So predict seems to use parallel processing automatically depending on the type of model.
Is this correct?
Is it controllable?
In this issue you can find the information you need:
https://github.com/dmlc/xgboost/issues/1345
As a summary, if you trained your model with parallelism, the predict method will also run with parallel processing.
If you want to change the latter behaviour you must change a setting:
xgb.parameters(bst) <- list(nthread = 1)
An alternative, is to change an environment variable:
OMP_NUM_THREADS
And as you explain, this only happens for xgbTree
First some background info, which is probably more interesting on stats.stackexchange:
In my data analysis I try to compare the performance of different machine learning methods on time series data (regression, not classification). So for example I have trained a Boosting trained model and compare this with a Random Forest trained model (R package randomForest).
I use time series data where the explanatory variables are lagged values of other data and the dependent variable.
For some reason the Random Forest severely underperforms. One of the problems I could think of is that the Random Forest performs a sampling step of the training data for each tree. If it does this to time series data, the autoregressive nature of the series is completely removed.
To test this idea, I would like to replace the (bootstrap) sampling step in the randomForest() function with a so called block-wise bootstrap step. This basically means I cut the training set into k parts, where k<<N, where each k-th part is in the original order. If I sample these k parts, I could still benefit from the 'randomness' in the Random Forest, but with the time series nature left largely intact.
Now my problem is this:
To achieve this I would normally copy the existing function and edit the desired step/lines.
randomForest2 <- randomForest()
But the randomForest() function seems to be a wrapper for another wrapper for deeper underlying functions. So how can I edit the actual bootstrap step in the randomForest() function and still run the rest of the function regularly?
So for me the solution wasn't editing the existing randomForest function. Instead I coded the block-wise bootstrap myself, using the split2 function given by Soren H. Welling to create the blocks. Once I had my data block-wise bootstrapped, I looked for a package (rpart) that performed just a single Regression Tree and aggregated it myself (taking the means).
The result for my actual data is a slightly but consistently improved version over the normal random forest performance in terms of RMSPE.
For the code below the performance seems to be a coin-toss.
Taking Soren's code as an example it looks a bit like this:
library(randomForest)
library(doParallel) #parallel package and mclapply is better for linux
library(rpart)
#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)
#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2
#past to present orientation
y = sin((1:timepoints)*pi/30) * 1000 +
sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")
#convert to absolute change, with lag=1
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
#compute lag
dy = dy + rnorm(timepoints)*sd(dy)*noise.factor #add noise
dy = c(0,y[-1]-y[-length(y)]) #convert to absolute change, with lag=1
dX = sapply(1:40,function(i){
getTheseLags = (1:timepoints) - i
getTheseLags[getTheseLags<1] = NA #remove before start timePoints
dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)]=-100 #quick fix of when lag exceed timeseries
pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure
#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test = dy[-train]
dX.train = dX[ train,]
dX.test = dX[-train,]
#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)
#like function split for a vector without mixing
split2 = function(aVector,splits=31) {
lVector = length(aVector)
mod = lVector %% splits
lBlocks = rep(floor(lVector/splits),splits)
if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
lapply(1:splits,function(i) {
Stop = sum(lBlocks[1:i])
Start = Stop - lBlocks[i] + 1
aVector[Start:Stop]
})
}
#create a list of block-wise bootstrapped samples
aBlock <- list()
numTrees <- 500
splits <- 40
for (ttt in 1:numTrees){
aBlock[[ttt]] <- unlist(
sample(
split2(1:nrow(dX.train),splits=splits),
splits,
replace=T
)
)
}
#put data into a dataframe so rpart understands it
df1 <- data.frame(dy.train, dX.train)
#perform regression trees for Blocks
rfBlocks = foreach(aBlock = aBlock,
.packages=("rpart")) %dopar% {
dBlock = df1[aBlock,]
rf = predict( rpart( dy.train ~., data = dBlock, method ="anova" ), newdata=data.frame(dX.test) )
}
#predict test, make results table
#use rowMeans to aggregate the block-wise predictions
results = data.frame(predBlock = rowMeans(do.call(cbind.data.frame, rfBlocks)),
true=dy.test,
predBootstrap = predict(rf,newdata=dX.test)
)
plot(results[,1:2],xlab="OOB-CV predicted change",
ylab="trueChange",
main="black bootstrap and blue block train")
points(results[,3:2],xlab="OOB-CV predicted change",
ylab="trueChange",
col="blue")
#prediction results
print(cor(results)^2)
stopCluster(cl)#close cluster
To directly alter sampling of randomForest(type="reggression"): Learn basic C programming, download from cran source code randomForest.4.6-10.tar.gz, (if windows install Rtools), (if OSX install Xcode), install and open Rstudio, start new project, choose package, unpack ...tar.gz into folder, look into src folder, open regrf.c, checkout line 151 and 163. Write new sampling strategy, press occationally Ctrl+Shift+B package to rebuild/compile and overwrite randomForest library, correct stated compile errors, test occasionally if package still works, spend some hours figuring out the old uninformative code, perhaps change description file, namespace file, and some few other references so the package will change name to randomForestMod, rebuild, voilla.
A more easy way not changing the randomForest is described below. Any trees with the same feature inputs can be patched together with the function randomForest::combine, so you can design your sampling regime in pure R code. I thought it actually was a bad idea, but for this very naive simulation it actually works with similar/slightly better performance! Remember to not predict the absolute target value, but instead a stationary derivative such as relative change, absolute change etc. If predicting the absolute value, RF will fall back to predicting tomorrow is something pretty close of today. Which is a trivial useless information.
edited code [22:42 CEST]
library(randomForest)
library(doParallel) #parallel package and mclapply is better for linux
#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)
#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2
#past to present orientation
y = sin((1:timepoints)*pi/30) * 1000 +
sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")
#convert to absolute change, with lag=1
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
#compute lag
dy = dy + rnorm(timepoints)*sd(dy)*noise.factor #add noise
dy = c(0,y[-1]-y[-length(y)]) #convert to absolute change, with lag=1
dX = sapply(1:40,function(i){
getTheseLags = (1:timepoints) - i
getTheseLags[getTheseLags<1] = NA #remove before start timePoints
dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)]=-100 #quick fix of when lag exceed timeseries
pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure
#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test = dy[-train]
dX.train = dX[ train,]
dX.test = dX[-train,]
#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)
#like function split for a vector without mixing
split2 = function(aVector,splits=31) {
lVector = length(aVector)
mod = lVector %% splits
lBlocks = rep(floor(lVector/splits),splits)
if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
lapply(1:splits,function(i) {
Stop = sum(lBlocks[1:i])
Start = Stop - lBlocks[i] + 1
aVector[Start:Stop]
})
}
nBlocks=10 #combine do not support block of unequal size
rfBlocks = foreach(aBlock = split2(train,splits=nBlocks),
.combine=randomForest::combine,
.packages=("randomForest")) %dopar% {
dXblock = dX.train[aBlock,] ; dyblock = dy.train[aBlock]
rf = randomForest(x=dXblock,y=dyblock,sampsize=length(dyblock),
replace=T,ntree=50)
}
print(rfBlocks)
#predict test, make results table
results = data.frame(predBlock = predict(rfBlocks,newdata=dX.test),
true=dy.test,
predBootstrap = predict(rf,newdata=dX.test))
plot(results[,1:2],xlab="OOB-CV predicted change",
ylab="trueChange",
main="black bootstrap and blue block train")
points(results[,3:2],xlab="OOB-CV predicted change",
ylab="trueChange",
col="blue")
#prediction results
print(cor(results)^2)
stopCluster(cl)#close cluster