I have a dataset to classify between won cases (14399) and lost cases (8677). The dataset has 912 predictor variables.
I am trying to oversample the lost cases in order to reach almost the same number as the won cases (i.e. having 14399 cases in each of the won and lost classes).
TARGET is the column with lost (0) and won (1) cases:
table(dat_train$TARGET)
    0     1 
 8677 14399 
Now I am trying to balance them using ROSE's ovun.sample:
dat_train_bal <- ovun.sample(dat_train$TARGET~., data = dat_train, p=0.5, seed = 1, method = "over")
I get this error:
Error in parse(text = x, keep.source = FALSE) :
<text>:1:17538: unexpected symbol
1: PPER_409030143+BP_RESPPER_9639064007+BP_RESPPER_7459058285+BP_RESPPER_9339059882+BP_RESPPER_9339058664+BP_RESPPER_5209073603+BP_RESPPER_5209061378+CRM_CURRPH_Initiation+Quotation+CRM_CURRPH_Ne
Can anyone help?
Thanks :-)
Reproducing your code on a mock example, I found the error in your formula: dat_train$TARGET ~ . needs to be corrected to TARGET ~ ., since the data argument already supplies the data frame.
dframe <- tibble::tibble(val = sample(c("a", "b"), size = 100, replace = TRUE, prob = c(.1, .9)),
                         xvar = rnorm(100))
# Use oversampling
dframe_os <- ROSE::ovun.sample(formula = val ~ ., data = dframe, p=0.5, seed = 1, method = "over")
table(dframe_os$data$val)
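Applied to your data, the corrected call would look like this (a sketch, since I cannot run it against your actual dat_train); note that ovun.sample returns a list, so the balanced data frame sits in its $data element:
# Corrected formula: reference the column by name only, not with dat_train$
dat_train_bal <- ROSE::ovun.sample(TARGET ~ ., data = dat_train,
                                   p = 0.5, seed = 1, method = "over")$data
table(dat_train_bal$TARGET)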
I am currently using the R package ParBayesianOptimization to tune parameters for ML methods. While searching for an optimal cost parameter for the svmLinear2 model (contained in caret), the optimization stopped with a sudden error after successfully completing 15 iterations.
Here is the error traceback:
Error in rbindlist(l, use.names, fill, idcol) :
Item 2 has 9 columns, inconsistent with item 1 which has 10 columns. To fill missing columns use fill=TRUE.
7. rbindlist(l, use.names, fill, idcol)
6. rbind(deparse.level, ...)
5. rbind(scoreSummary, data.table(Epoch = rep(Epoch, nrow(NewResults)),
       Iteration = 1:nrow(NewResults) + nrow(scoreSummary), inBounds = rep(TRUE,
       nrow(NewResults)), NewResults))
4. addIterations(optObj, otherHalting = otherHalting, iters.n = iters.n,
       iters.k = iters.k, parallel = parallel, plotProgress = plotProgress,
       errorHandling = errorHandling, saveFile = saveFile, verbose = verbose,
       ...)
3. ParBayesianOptimization::bayesOpt(FUN = ...
So somehow the data tables storing the summary information at each iteration suddenly differ in the number of columns. Is this a known bug in the ParBayesianOptimization package? Has anyone else encountered a similar problem? Did you find a fix, other than rewriting the addIterations function to fill the missing columns?
EDIT: I don't have an explanation for why the error suddenly occurs after a number of successful iterations. However, the issue has recurred when using svmLinear and svmRadial. I was able to reconstruct a similar case with the same error on the iris dataset:
library(data.table)
library(caret)
library(ParBayesianOptimization)
set.seed(1234)
bayes.opt.bounds = list()
bayes.opt.bounds[["svmRadial"]] = list(C = c(0, 1000),
                                       sigma = c(0, 500))
svmRadScore = function(...){
  grid = data.frame(...)
  mod = caret::train(Species ~ ., data = iris, method = "svmRadial",
                     trControl = trainControl(method = "repeatedcv",
                                              number = 7, repeats = 5),
                     tuneGrid = grid)
  return(list(Score = caret::getTrainPerf(mod)[, "TrainAccuracy"], Pred = 0))
}
bayes.create.grid.par = function(bounds, n = 10){
  grid = data.table()
  params = names(bounds)
  grid[, c(params) := lapply(bounds, FUN = function(minMax){
    return(runif(n, minMax[1], minMax[2]))
  })]
  return(grid)
}
prior.grid.rad = bayes.create.grid.par(bayes.opt.bounds[["svmRadial"]])
svmRadOpt = ParBayesianOptimization::bayesOpt(FUN = svmRadScore,
                                              bounds = bayes.opt.bounds[["svmRadial"]],
                                              initGrid = prior.grid.rad,
                                              iters.n = 100,
                                              acq = "ucb", kappa = 1,
                                              parallel = FALSE, plotProgress = TRUE)
Using this example, the error occurred on the 9th epoch.
Thanks!
It appears that the scoring function returned NAs in place of accuracy measures, leading to the error further downstream. This has been described by the library's creator at
https://github.com/AnotherSamWilson/ParBayesianOptimization/issues/33.
It looks like the SVM is trying a cost of 0 during the 9th iteration. Given the optimization problem the SVM is solving, the cost parameter must be strictly positive.
According to AnotherSamWilson, this error commonly occurs when the scoring function "returns something unexpected".
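One way to avoid this (a sketch, not tested on the asker's exact setup) is to bound both parameters strictly away from zero, so the optimizer can never propose a degenerate cost:
# Lower bounds strictly above 0 so C = 0 or sigma = 0 can never be proposed
bayes.opt.bounds[["svmRadial"]] = list(C = c(1e-3, 1000),
                                       sigma = c(1e-3, 500))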
I am trying to fit Polish local government election results from 2015, following the superb blog post by Gavin Simpson: https://www.fromthebottomoftheheap.net/2017/10/19/first-steps-with-mrf-smooths/ I joined my xls data with the shp data using a 6-digit identifier (there may be leading 0s), which I kept as a text variable. EDIT: I have simplified the identifier and am now using a sequence from 1 to nrow to simplify my question.
library(tidyverse)
library(sf)
library(mgcv)
# Read data
# From https://www.gis-support.pl/downloads/gminy.zip shp file
boroughs_shp <- st_read("../../_mapy/gminy.shp", options = "ENCODING=WINDOWS-1250",
                        stringsAsFactors = FALSE) %>%
  st_transform(crs = 4326) %>%
  janitor::clean_names() %>%
  # st_simplify(preserveTopology = T, dTolerance = 0.01) %>%
  mutate(teryt = str_sub(jpt_kod_je, 1, 6)) %>%
  select(teryt, nazwa = jpt_nazwa, geometry)
# From https://parlament2015.pkw.gov.pl/wyniki_zb/2015-gl-lis-gm.zip data file
elections_xls <-
readxl::read_excel("data/2015-gl-lis-gm.xls",
trim_ws = T, col_names = T) %>%
janitor::clean_names() %>%
select(teryt, liczba_wyborcow, glosy_niewazne)
elections <-
boroughs_shp %>% fortify() %>%
left_join(elections_xls, by = "teryt") %>%
arrange(teryt) %>%
mutate(idx = seq.int(nrow(.)) %>% as.factor(),
teryt = as.factor(teryt))
# Neighbors
boroughs_nb <- spdep::poly2nb(elections, snap = 0.01, queen = F, row.names = elections$idx)
names(boroughs_nb) <- attr(boroughs_nb, "region.id")
# Model
ctrl <- gam.control(nthreads = 4)
m1 <- gam(glosy_niewazne ~ s(idx, bs = 'mrf', xt = list(nb = boroughs_nb)),
data = elections,
offset = log(liczba_wyborcow), # number of votes
method = 'REML',
control = ctrl,
family = betar())
Here is the error message:
Error in smooth.construct.mrf.smooth.spec(object, dk$data, dk$knots) :
mismatch between nb/polys supplied area names and data area names
In addition: Warning message:
In if (all.equal(sort(a.name), sort(levels(k))) != TRUE) stop("mismatch between nb/polys supplied area names and data area names") :
the condition has length > 1 and only the first element will be used
elections$idx is a factor. I am using it to give names to boroughs_nb to be absolutely sure I have the same number of levels. What am I doing wrong?
EDIT: The condition mentioned in the error message is met:
> all(sort(names(boroughs_nb)) == sort(levels(elections$idx)))
[1] TRUE
It seems that I solved the issue, though as a statistics beginner I may not fully understand how.
First, not a single NA should be present in the modeled data, and there was one. After removing it, mgcv seemed to run, but it took a long time (a quarter of an hour) and, inexplicably to me, only when I limited the number of knots to k = 50 (with more or fewer knots it did not return any result at all), and then with poor results and a warning to be cautious about them.
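For reference, a quick way to count and locate the NAs (a sketch, assuming the joined elections object from the question):
# Count NAs per column, dropping the geometry column first
colSums(is.na(sf::st_drop_geometry(elections)))
# Show the offending rows
elections[!complete.cases(sf::st_drop_geometry(elections)), ]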
Then I removed offset = log(liczba_wyborcow), i.e. the offset for the number of voters, and made the number of void votes per 1000 voters my response variable.
elections <-
boroughs_shp %>%
left_join(elections_xls, by = "teryt") %>% na.omit() %>%
arrange(teryt) %>%
mutate(idx = row_number() %>% as.factor()) %>%
mutate(void_ratio=round(glosy_niewazne/liczba_wyborcow,3)*1000)
Now that the response is a count, I tried changing family = betar() in the gam call to poisson(), which still did not give a good result, and then to the negative binomial, family = nb().
Now my model looks like this:
m1 <-
gam(
void_ratio ~ s(
idx,
bs = 'mrf',
k =500,
xt = list(nb = boroughs_nb),
fx = TRUE),
data = elections,
method = 'REML',
control = gam.control(nthreads = 4),
family = nb()
)
It now runs blazingly fast and returns valid results with no warnings or errors. On a laptop with a 4-core Intel Core i7-6820HQ @ 2.70 GHz, 16 GB RAM and Windows 10, it now takes 1-2 minutes to build the model.
In brief, what I changed was: removed a single NA, removed the offset from the formula, and used the negative binomial distribution.
Here is the result of what I wanted to achieve: from left to right, the real rate of void votes, the rate smoothed by the model, and residuals indicating discrepancies. The mgcv code let me do that.
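For anyone wanting to reproduce a similar figure, here is a minimal sketch (assuming the fitted model m1 and the sf object elections from above; arranging the three panels is left out):
# Attach the smoothed ratio and the residuals, then map one of them
elections$fitted <- predict(m1, type = "response")
elections$resid <- residuals(m1)
ggplot(elections) +
  geom_sf(aes(fill = fitted), colour = NA) +
  scale_fill_viridis_c(name = "smoothed void ratio")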
I've gotten an error message when attempting to plot a neural network. I was able to run the code fine at first, then it stopped working. I do not get an error message when the neuralnet() function is run. Any help would be appreciated. I am predicting loan default.
library(neuralnet)
library(plyr)
CreditCardnn <- read.csv("https://raw.githubusercontent.com/621-Group2/Final-Project/master/UCI_Credit_Card.csv")
#Normalize dataset
maxValue <- apply(CreditCardnn, 2, max)
minValue <- apply(CreditCardnn, 2, min)
CreditCardnn <- as.data.frame(scale(CreditCardnn, center = minValue, scale = maxValue - minValue))
#Rename to target variable
colnames(CreditCardnn)[25] <- "target"
smp <- floor(0.70 * nrow(CreditCardnn))
set.seed(4784)
CreditCardnn$ID <- NULL
train_index <- sample(seq_len(nrow(CreditCardnn)), size = smp, replace = FALSE)
train_nn <- CreditCardnn[train_index, ]
test_nn <- CreditCardnn[-train_index, ]
allVars <- colnames(CreditCardnn)
predictorVars <- allVars[!allVars%in%'target']
predictorVars <- paste(predictorVars, collapse = "+")
f <- as.formula(paste("target ~", predictorVars))
nueralModel <- neuralnet(formula = f, hidden = c(4,2), linear.output = T, data = train_nn)
plot(nueralModel)
Which gives the following error:
Error in plot.nn(nueralModel) : weights were not calculated
Before the error you report, most probably you also got a warning:
# your data preparation code verbatim here
> nueralModel <- neuralnet(formula = f, hidden = c(4,2), linear.output = T, data = train_nn)
Warning message:
algorithm did not converge in 1 of 1 repetition(s) within the stepmax
This message is important, effectively warning you that your neural network did not converge. Given this message, the error further downstream, when you try to plot the network, is actually expected:
> plot(nueralModel)
Error in plot.nn(nueralModel) : weights were not calculated
Looking more closely into your code & data, it turns out that the problem lies in your choice for linear.output = T in fitting your neural network; from the docs:
linear.output logical. If act.fct should not be applied to the output neurons set linear output to TRUE, otherwise to FALSE.
Keeping a linear output in the final layer of a neural network is normally used in regression settings only; in classification settings, such as yours, the correct choice is to apply the activation function to the output neuron(s) as well. Hence, trying the same code as yours but with linear.output = F, we get:
> nueralModel <- neuralnet(formula = f, hidden = c(4,2), linear.output = F, data = train_nn) # no warning this time
> plot(nueralModel)
And here is the result of the plot:
Try increasing stepmax, e.g. set stepmax = 1e6 or higher. Higher values of stepmax take longer to run, but you can try:
nueralModel <- neuralnet(formula = f, hidden = c(4,2), linear.output = F, data = train_nn, stepmax = 1e6)
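Once the network converges, you can also sanity-check it on the held-out set, for example (a sketch, assuming test_nn from the question and the classic neuralnet::compute API):
# Feed the test-set predictors through the network; net.result holds the output activations
covariates <- test_nn[, setdiff(names(test_nn), "target")]
preds <- neuralnet::compute(nueralModel, covariates)$net.result
# Confusion table at a 0.5 threshold (target is already 0/1 after the min-max scaling)
table(actual = test_nn$target, predicted = as.integer(preds > 0.5))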
I'm working on building a predictive model for breast cancer data using R. After performing gcrma normalization, I generated the potential predictor variables. Now when I run the RF algorithm, I encounter the following error:
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories.
code:
library(randomForest)
library(ROCR)
library(Hmisc)
library(genefilter)
setwd("E:/kavya's project_work/final")
datafile<-"trainset_gcrma.txt"
clindatafile<-read.csv("mod clinical_details.csv")
outfile="trainset_RFoutput.txt"
varimp_pdffile="trainset_varImps.pdf"
MDS_pdffile="trainset_MDS.pdf"
ROC_pdffile="trainset_ROC.pdf"
case_pred_outfile="trainset_CasePredictions.txt"
vote_dist_pdffile="trainset_vote_dist.pdf"
data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=clindatafile
clincaldata_order=order(clin_data_import[,"GEO.asscession.number"])
clindata=clin_data_import[clincaldata_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)
X=rawdata[,4:length(header)]
ffun=filterfun(pOverA(p = 0.2, A = 100), cv(a = 0.7, b = 10))
filt=genefilter(2^X,ffun)
filt_Data=rawdata[filt,]
#Get potential predictor variables
predictor_data=t(filt_Data[,4:length(header)])
predictor_names=c(as.vector(filt_Data[,3])) #gene symbol
colnames(predictor_data)=predictor_names
target= clindata[,"relapse"]
target[target==0]="NoRelapse"
target[target==1]="Relapse"
target=as.factor(target)
tmp = as.vector(table(target))
num_classes = length(tmp)
min_size = tmp[order(tmp,decreasing=FALSE)[1]]
sampsizes = rep(min_size,num_classes)
rf_output=randomForest(x=pred.data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
error:"Error in randomForest.default(x = pred.data, y = target, importance = TRUE, : Can not handle categorical predictors with more than 53 categories."
As I'm new to machine learning, I'm unable to proceed. Kindly advise.
Thanks in advance.
It is hard to say without seeing the data. Run class() or summary() on all your predictor variables to ensure that they are not accidentally being interpreted as characters or factors. If you really do have a factor with more than 53 levels, you will have to convert it to binary (dummy) variables. Example:
mtcars$automatic <- mtcars$am == 0
mtcars$manual <- mtcars$am == 1
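In the asker's setting, a quick audit could look like this (a sketch, assuming the predictor_data matrix from the question; the factor f below is purely hypothetical):
# See how each predictor column is being interpreted
table(sapply(as.data.frame(predictor_data), class))
# Expand a hypothetical high-cardinality factor f into 0/1 dummy columns
df <- data.frame(f = factor(sample(letters, 100, replace = TRUE)))
dummies <- model.matrix(~ f - 1, data = df)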
I have a training set that looks like this:
Name Day Area X Y Month Night
ATTACK Monday LA -122.41 37.78 8 0
VEHICLE Saturday CHICAGO -1.67 3.15 2 0
MOUSE Monday TAIPEI -12.5 3.1 9 1
Name is the outcome/dependent variable. I converted Name, Area and Day into factors, but I wasn't sure if I was supposed to for Month and Night, which only take on integer values 1-12 and 0-1, respectively.
I then convert the data into model matrices:
ynn <- model.matrix(~Name , data = trainDF)
mnn <- model.matrix(~ Day+Area +X + Y + Month + Night, data = trainDF)
I then set up the parameter tuning:
nnTrControl = trainControl(method = "repeatedcv", number = 3, repeats = 5, verboseIter = TRUE,
                           returnData = FALSE, returnResamp = "all", classProbs = TRUE,
                           summaryFunction = multiClassSummary, allowParallel = TRUE)
nnGrid = expand.grid(.size = c(1, 4, 7), .decay = c(0, 0.001, 0.1))
model <- train(y = ynn, x = mnn, method = 'nnet', linout = TRUE, trace = FALSE,
               trControl = nnTrControl, metric = "logLoss", tuneGrid = nnGrid)
However, I get the error Error: nrow(x) == n is not TRUE for the model <- train(...) line.
I also get a similar error if I use xgboost instead of nnet.
Anyone know what's causing this?
y should be a numeric or factor vector containing the outcome for each sample, not a matrix. Using
train(y = make.names(trainDF$Name), ...)
helps, where make.names modifies the values so that they become syntactically valid variable names.
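Putting it together with the objects from the question, the call would look something like this (a sketch, untested on the actual data; linout = TRUE is dropped because the task is now classification):
model <- train(y = factor(make.names(trainDF$Name)), x = mnn,
               method = 'nnet', trace = FALSE,
               trControl = nnTrControl, metric = "logLoss", tuneGrid = nnGrid)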
Even though the help file for train says either a matrix or a data frame is expected, you can try converting the matrix to a data frame:
model <- train(y = ynn, x = as.data.frame(mnn), method = 'nnet', linout = TRUE, trace = FALSE,
               trControl = nnTrControl, metric = "logLoss", tuneGrid = nnGrid)