How to solve errors in the frbs package of R using the GFS.GCCL method?

I'm using the frbs package in R on my data set with 5-fold stratified cross-validation, which I've implemented myself. In each fold I use the GFS.GCCL method with the frbs.learn function and predict the result using the test data. I get this error, as well as 30 identical warning messages:
Error: object 'temp.rule.degree' not found
Warning: In max(MF.temp[m, ], na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
My code is below:
library(frbs)
data <- read.csv(file.address)
data[,30] <- unclass(data[,30]) # column 30 has the class of samples
data <- data[,c(1,14,20,26,27,30)] # I keep 5 attributes since my data is high dimensional
k <- 5 # 5-fold
seed <- 1
folds <- strf.cv(data, k, seed) # stratification function for CV
range.data.inp <- matrix(apply(data[,-ncol(data)], 2, range), nrow = 2)
data <- norm.data(as.matrix(data[,-ncol(data)]), range.data.inp,
                  min.scale = 0.1, max.scale = 1)
ctrl <- list(popu.size = 30, num.class = 2, num.labels = 3,
             persen_cross = 0.9, max.gen = 200, persen_mutant = 0.3,
             name = "sim-1")
for (i in 1:k) {
  str <- paste("fold", i)
  print(str)
  test.ind <- folds[[str]]
  test.data <- data[test.ind,]
  train.data <- data[-test.ind,]
  obj <- frbs.learn(train.data, range.data.inp,
                    method.type = "GFS.GCCL", control = ctrl)
  pred <- predict(obj, test.data)
  print("Predicted classes:")
  print(pred)
}
I have no idea what the error and warnings mean. Please let me know what I should do.

I've had a similar problem (and others) trying to reproduce the SLAVE learning starting from the iris example data. I had 2 format items to solve before being able to run this with my artificial data:
my data frame import was giving me integer columns, where the learner needs at least numeric (see the sketch below).
my class distribution was not flat. When I flattened the distribution (3 values, so n/3 samples per value) everything went fine.
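A minimal sketch of the first fix; df is a hypothetical data frame standing in for your imported data:
# Coerce every integer column to numeric before calling frbs.learn()
df[] <- lapply(df, as.numeric)
str(df) # all columns should now show as "num"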
That's all I know.
Hope it helps.

I encountered the same issue when running SLAVE and GFS.GCCL. Looking at the source code of the library, I found that in frbs.learn() each method has its own implementation for calculating the range of the input data, so I think the problem lies with that range. For example, in the GFS.GCCL source code, the parameter setup looks like this:
range.data.input <- range.data
data.train.ori <- data.train
popu.size <- control$popu.size
persen_cross <- control$persen_cross
persen_mutant <- control$persen_mutant
max.gen <- control$max.gen
name <- control$name
n.labels <- control$num.labels
n.class <- control$num.class
num.labels <- matrix(rep(n.labels, ncol(range.data)), nrow = 1)
num.labels <- cbind(num.labels, n.class)
## normalize range of data and data training
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
range.data.input.ori <- range.data.input
data.tra.norm <- norm.data(data.train[, 1 : ncol(data.train) - 1], range.data.input, min.scale = 0, max.scale = 1)
data.train <- cbind(data.tra.norm, matrix(data.train[, ncol(data.train)], ncol = 1))
In the first line, range.data comes either from your own specification or from the default setting of frbs.learn(). The default takes the minimum and maximum of each input column (minimums in the first row of the matrix, maximums in the second). In the source code:
range.data <- rbind(dt.min, dt.max)
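For illustration, a hedged reconstruction of how that default is computed (dt.min and dt.max are the names used in the frbs source; the apply calls are my sketch, assuming the class label sits in the last column):
inputs <- data.train[, -ncol(data.train)]
dt.min <- apply(inputs, 2, min) # per-column minima
dt.max <- apply(inputs, 2, max) # per-column maxima
range.data <- rbind(dt.min, dt.max)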
After that, the data range used by GFS.GCCL is
range.data.norm <- range.data.input
range.data.norm[1, ] <- 0
range.data.norm[2, ] <- 1
which is between 0 and 1. GFS.GCCL also takes range.data.input as a parameter, so it works with both range.data.norm and range.data.input.
Therefore, I think some internal calculations rely on range.data.input holding the true minimum and maximum of each input column; if the matrix you pass in is not set up that way, the error is generated.
But, in summary, after I removed "range.data" from the frbs.learn() call, both GFS.GCCL and SLAVE worked for me.
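A minimal sketch of that fix, reusing the variable names from the question; with range.data omitted, frbs.learn() derives the per-column minima and maxima itself:
# No range.data argument: the default range is computed internally
obj <- frbs.learn(train.data, method.type = "GFS.GCCL", control = ctrl)
pred <- predict(obj, test.data)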
You can download the source code from here:
https://cran.r-project.org/web/packages/frbs/index.html
You can find the code for GFS.GCCL and SLAVE in:
FRBS.MainFunction.R
GFS.Methods.R

In addition to @Pilip38's good advice, I have three other ideas that have fixed similar errors for me while working with the frbs package (all three are sketched in code below).
Most important: make sure your output variable is never equal to 0. It looks like you have a binary output variable, so I am hoping that simply adding 1 to it, so it is 1/2 instead of 0/1, will work.
Try setting your range.data.inp matrix to all 0s in the first row and all 1s in the second. Naturally it's better to have a tighter range, but it may be what is causing your bug.
Try decreasing the number of labels to 2.
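A minimal sketch of these three tweaks, assuming the data layout from the question (input attributes first, class in the last column):
# 1. Recode the binary class from 0/1 to 1/2
data[, ncol(data)] <- data[, ncol(data)] + 1
# 2. Loose [0, 1] range for every input attribute
range.data.inp <- rbind(rep(0, ncol(data) - 1),
                        rep(1, ncol(data) - 1))
# 3. Fewer linguistic labels
ctrl$num.labels <- 2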
It can be a brittle procedure.

Related

Finding optimal parameter for each input combination in the objective function in an optimization

I am calibrating a model, and for that I have to estimate a parameter for each input combination I give to the objective function. I have a bit more than 10k input combinations, and I want to find the minimizing parameter for each combination. All other variables in the model are known. I managed to estimate one minimal value for the whole set, but that doesn't help me, and when I tried my approach for each combination I got the error: Error in mP[, logik] <- mPv[, logik, drop = FALSE] : NAs are not allowed in subscripted assignments.
My objective function looks like this:
x_vol <- vector(mode = "double", length = 10776)
objective_function_vol <- function(x_vol) {
  S <- calibration_set$index_level
  K <- calibration_set$strike
  tau <- calibration_set$tau
  r <- calibration_set$riskfree_rate
  q <- calibration_set$q
  model_prices_vol <- vector(mode = "double", length = 10776)
  for (i in 1:10776) {
    model_prices_vol[i] <- hestonCallcf(S = S[i], K = K[i], t = tau[i],
                                        r = r[i], q = 0,
                                        v0 = x_vol[i],
                                        vbar = 0.1064688, rho = -0.9914710,
                                        a = 1.6240300, vvol = 0.98839192)
    print(i)
  }
  diff_sq <- (market_price - model_prices_vol)^2
  wdiff <- diff_sq / market_price
  error <- sum(wdiff) / 10776
  return(error)
}
I am using NMOF::DEopt for the optimization. Would it be possible to write a second loop that stores the optimal values of x_vol? I think using the subscript i for the known inputs as well as for the unknowns is somehow wrong.
The error means that some objective-function calls resulted in NA.
If you only wish to optimize over a single parameter (i.e. a scalar) at a time, Differential Evolution is probably not the method you want. A grid search along one dimension, possibly with refinements, would likely work better.
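A hedged sketch of that idea: calibrate each combination's v0 separately with a one-dimensional optimizer rather than one 10,776-dimensional DEopt run. It assumes the hestonCallcf pricing function, calibration_set, and market_price from the question; the search interval is a guess:
calib_one <- function(i) {
  err <- function(v0) {
    p <- hestonCallcf(S = calibration_set$index_level[i],
                      K = calibration_set$strike[i],
                      t = calibration_set$tau[i],
                      r = calibration_set$riskfree_rate[i], q = 0,
                      v0 = v0, vbar = 0.1064688, rho = -0.9914710,
                      a = 1.6240300, vvol = 0.98839192)
    (market_price[i] - p)^2 / market_price[i] # weighted squared error
  }
  optimize(err, interval = c(1e-6, 2))$minimum # interval is an assumption
}
x_vol <- vapply(seq_len(10776), calib_one, numeric(1))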

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda <- plsda(X, Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] # prediction based on 10 components
confusion.mat <- get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER functions. But I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x] # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1],
                                        predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in what the function returns. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This produces a list of lists, so that results[[1]][[1]] will be the prediction results for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
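Spelled out, that variant could look like this (a sketch reusing the names from above):
confmat_full <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1],
                                        predicted = prediction)
  return(list(prediction, confusion.mat, get.BER(confusion.mat)))
}
results <- lapply(10:2, confmat_full)
results[[1]][[3]] # BER for 10 components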

ChoiceModelR, none=TRUE, "Error in betadraw[good, ] = newbeta[good, ]"

I am using ChoiceModelR for hierarchical multinomial logit. I want to get estimates for the utility of the outside good (which follows a normal distribution). The outside good has no covariates like the inside goods - e.g. it cannot have a price or brand dummy - so I set list(none=TRUE) and do not add this no-choice option to the X data (as described in the ChoiceModelR documentation), only to the y (choice) data.
The iterations start normally, then at some point it stops and says
"Error in betadraw[good, ] = newbeta[good, ] : NAs are not allowed in subscripted assignments".
This likely happens because at line 388 of the choicemodelr function, the "good" subscript is NA.
I looked at some questions about choicemodelr (this, this, this), and also about NAs in subscripts (this, this), but my guess is that my problem is specific to this function, in the sense that some inputs in the iteration probably get so large or small that "good" turns out to be NA.
Below is a very simple example. I generate data with 3 products with varying attributes. In half of the periods product 3 is not offered. The 2000 consumers have preferences - distributed normally - over 3 attributes (and a preference for the outside good). A logit error is added to be consistent with the model. The outside good is indexed as product 4 (both when 3 and when 2 products were in the choice set).
How could I avoid the NA error? Am I doing something wrong, or is it a general bug in the function?
I also searched online for examples setting the option none=TRUE, but I did not find any reproducible one. Perhaps this option alone is the problematic thing, as there is no problem recovering the true parameters if I set none=FALSE and do not let customers choose the outside option.
So the code which results in the NA bug is the following:
library("ChoiceModelR")
library("MASS")
set.seed(36)
# Set demand pars
beta_mu = c(-3,4,1)
beta_sigma = diag(c(1,1,1))
alfa_mu = 5 #outside good mean utility
alfa_sigma = 2 #outside good sd
# Three/two products, 3 vars (2 continuous,1 dummy)
threeprod <- list()
twoprod <- list()
purchase <- list()
for (t in 1:1000){
threeprod[[t]] = cbind(rep(t,3),c(1,1,1),c(1,2,3),runif(3),runif(3),ceiling(runif(3,-0.5,0.5)))
purchase[[t]] = which.max(rbind(threeprod[[t]][,c(4,5,6)]%*%mvrnorm(1,beta_mu,beta_sigma) +
matrix( -log(-log(runif(3))), 3, 1),rnorm(1,alfa_mu,alfa_sigma)) )
threeprod[[t]] = cbind(threeprod[[t]],c(purchase[[t]],0,0))
}
for (t in 1001:2000){
twoprod[[t]] = cbind(rep(t,2),c(1,1),c(1,2),runif(2),runif(2),ceiling(runif(2,-0.5,0.5)))
purchase[[t]] = which.max(rbind(twoprod[[t]][,c(4,5,6)]%*%mvrnorm(1,beta_mu,beta_sigma) +
matrix( -log(-log(runif(2))), 2, 1),rnorm(1,alfa_mu,alfa_sigma)) )
if (purchase[[t]] == 3) {purchase[[t]] <- 4}
twoprod[[t]] = cbind(twoprod[[t]],c(purchase[[t]],0))
}
X <- rbind(do.call(rbind,threeprod),do.call(rbind,twoprod))
xcoding <- c(1,1,1)
mcmc = list(R = 5000, use = 2000)
options = list(none=TRUE, save=TRUE, keep=5)
out = choicemodelr(X, xcoding, mcmc = mcmc,options = options)
You have to sort the data by ID, Set, and Alt; that solved the error (the same one you got). The rows have to be sorted by respondent ID, then by set (question) number, and then by alternative within a given question.
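A minimal sketch of that sort, assuming the column layout of the X matrix built above (column 1 = respondent ID, column 2 = set, column 3 = alternative):
# Order rows by ID, then set, then alternative before estimating
X <- X[order(X[, 1], X[, 2], X[, 3]), ]
out <- choicemodelr(X, xcoding, mcmc = mcmc, options = options)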

Using a for loop for performing several regressions

I am currently performing a style analysis using the following method: http://www.r-bloggers.com/style-analysis/. It is a constrained regression of one asset on a number of benchmarks over a rolling 36-month window.
My problem is that I need to perform this regression for a fairly large number of assets, and doing it one by one would take a huge amount of time. To be more precise: is there a way to tell R to regress columns 1-100 one by one on columns 101-116? Of course this also means printing 100 different plots, one for each asset. I am new to R and have been stuck for several days now.
I hope it doesn't matter that the following excerpt isn't reproducible, since the code works as originally intended.
# Style Regression over Window, constrained
#--------------------------------------------------------------------------
# setup
load.packages('quadprog')
style.weights[] = NA
style.r.squared[] = NA
# Setup constraints
# 0 <= x.i <= 1
constraints = new.constraints(n, lb = 0, ub = 1)
# SUM x.i = 1
constraints = add.constraints(rep(1, n), 1, type = '=', constraints)
# main loop
for (i in window.len:ndates) {
  window.index = (i - window.len + 1) : i
  fit = lm.constraint(hist.returns[window.index, -1],
                      hist.returns[window.index, 1], constraints)
  style.weights[i,] = fit$coefficients
  style.r.squared[i,] = fit$r.squared
}
# plot
aa.style.summary.plot('Style Constrained', style.weights, style.r.squared, window.len)
Thank you very much for any tips!
"Is there a way to tell R to regress columns 1-100 one by one on colums 101-116."
Yes! You can use a for loop, but you there's also a whole family of 'apply' functions which are appropriate. Here's a generalized solution with a random / toy dataset and using lm(), but you can sub in whatever regression function you want
# data frame of 116 cols and 20 rows
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116*20), ncol = 116))
# with a for loop
models <- list() # empty list to store models
for (i in 1:100) {
  models[[i]] <-
    lm(formula = x ~ ., data = data.frame(x = dat[, i], dat[, 101:116]))
}
# with lapply
models2 <-
  lapply(1:100,
         function(i) lm(formula = x ~ .,
                        data = data.frame(x = dat[, i], dat[, 101:116])))
# compare. they give the same results!
all.equal(models, models2)
# to access a single model, use [[#]]
models2[[1]]

R: PDA done in two ways but with different results?

I'm working on a project where I have to classify data about breast cancer, and I want to use PDA. I'm trying to find the optimal value for lambda by 10-fold cross-validation.
I've started with:
breast[,1] <- as.numeric(breast[,1]) #Because this code only works when the classes are numeric (so not factors like they were)
I found that this could be done by:
library(penalizedLDA)
PenalizedLDA.cv(breast[,-1], breast[,1], lambdas = seq(0,0.5,by=.01), nfold = 10)
But I can also do this manually:
lam <- seq(0, 1, by = .01)
length(lam)
error.lam <- rep(0, length(lam))
for (la in 1:101) {
  error.pda <- rep(0, 10)
  for (fold in 1:k) {
    currentFold <- folds[fold][[1]]
    breast.pda = fda(Diagnosis ~ ., data = breast[-currentFold,],
                     method = gen.ridge, lambda = lam[la])
    pred.pda = predict(breast.pda, newdata = breast[currentFold, -1], type = "class")
    CV.table = table(breast[currentFold, 1], pred.pda)
    error.pda[fold] <- CV.table[1,2] + CV.table[2,1]
  }
  error.lam[la] <- sum(error.pda) / nrow(breast)
}
The strange thing now is that I get two different results. For the first method I get a CV error of 4.4% with lambda 0.055, while for the second I get a CV error of 6.32% with lambda 0.55. Could someone explain the difference between these two methods?
Silke
