I have many more than three elements in every class, but I get this error: "class cannot be less than k=3" in scikit-learn - runtime error

This is my target (y):
target = [7,1,2,2,3,5,4,
1,3,1,4,4,6,6,
7,5,7,8,8,8,5,
3,3,6,2,7,7,1,
10,3,7,10,4,10,
2,2,2,7]
I do not know why I get it when I execute:
...
# Split the data set into two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(C=1), tuned_parameters)  # scoring does not exist
    # I get an error in the line below
    clf.fit(X_train, y_train, cv=5)
...
I get this error:
Traceback (most recent call last):
File "C:\Python27\SVMpredictCROSSeGRID.py", line 232, in <module>
clf.fit(X_train, y_train, cv=5) #The minimum number of labels for any class cannot be less than k=3.
File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 354, in fit
return self._fit(X, y)
File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 372, in _fit
cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1148, in check_cv
cv = StratifiedKFold(y, cv, indices=is_sparse)
File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 358, in __init__
" be less than k=%d." % (min_labels, k))
ValueError: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than k=3.

The algorithm requires at least 3 instances of each label in your training set. Although your target array contains at least 3 instances of each label, when you split the data between training and testing, not all of the training labels end up with 3 instances.
To solve the problem, either merge some class labels or increase the number of training samples.

If you can't split the data so that every class is populated well enough in each fold, try updating scikit-learn:
pip install -U scikit-learn
Recent versions report the same message as a warning rather than an error, so the code can run.
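As a quick diagnostic, here is a minimal sketch, assuming a recent scikit-learn (not the old grid_search module in the traceback): count how many members each class has after the split, and cap the number of folds accordingly. Note that modern versions take cv and scoring in the GridSearchCV constructor, not in fit, and multiclass precision needs an explicit average such as 'precision_macro'.
from collections import Counter
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# X, y and tuned_parameters as defined in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)  # keep class ratios in both halves

counts = Counter(y_train)
print(counts)  # shows which class fell below 3 members after the split
max_folds = min(counts.values())  # StratifiedKFold requires n_splits <= this

clf = GridSearchCV(SVC(), tuned_parameters,
                   scoring='precision_macro',
                   cv=max(2, min(5, max_folds)))  # cv must be at least 2
clf.fit(X_train, y_train)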

Related

XGBoost Custom Objective Function

This will be a long question. I'm trying to define my own custom objective function.
I want to use XGBClassifier, so I run
from xgboost import XGBClassifier
The documentation of xgboost says:
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:
y_true: array_like of shape [n_samples], the target values
y_pred: array_like of shape [n_samples], the predicted values
grad: array_like of shape [n_samples], the value of the gradient for each sample point
hess: array_like of shape [n_samples], the value of the second derivative for each sample point
Now, I've coded this custom objective:
def guess_averse_loss(y_true, y_pred):
    y_true = y_true.astype(int)
    y_pred = y_pred.astype(int)
    # ... stuff ...
    return grad, hess
Everything is consistent with the documentation above.
If I run:
classifier = XGBClassifier(eval_metric=custom_weighted_accuracy,
                           objective=guess_averse_loss,
                           **params_common_model)
classifier.fit(X_train, y_train)
(where custom_weighted_accuracy is a custom metric I defined following the scikit-learn documentation)
I get the error:
-> first_term = np.multiply(cost_matrix[y_true, y_pred], np.exp(y_pred - y_true))
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (4043,) (4043,5)
So y_pred enters the function as a matrix (n_samples x n_classes) where element ij is the probability that sample i belongs to class j.
Then I modify the line to
first_term = np.multiply(cost_matrix[y_true, np.argmax(y_pred, axis=1)],
                         np.exp(np.argmax(y_pred, axis=1) - y_true))
so that it goes from a matrix to an array.
This leads to the error:
unknown custom metric
so it seems that the problem now is the metric.
I tried removing the custom objective function and using the default one instead, and another error comes up:
XGBoostError: Check failed: in_gpair->Size() % ngroup == 0U (3 vs. 0) : must have exactly ngroup * nrow gpairs
WHAT CAN I DO???
You've read what I've tried; I'm expecting some suggestions to solve these problems.
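Not from the original post, but a minimal sketch of the usual shape contract may help frame an answer. Assuming a recent xgboost (1.6+), a multi-class custom objective should return grad and hess with the same (n_samples, n_classes) shape as the raw margin matrix y_pred, rather than collapsing y_pred with argmax (the custom metric is left out here to isolate the objective):
import numpy as np
from xgboost import XGBClassifier

def softprob_like_obj(y_true, y_pred):
    # y_pred holds raw margins with shape (n_samples, n_classes)
    y_true = y_true.astype(int)
    # row-wise softmax of the margins
    e = np.exp(y_pred - y_pred.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    # gradient of softmax cross-entropy: p - one_hot(y_true)
    grad = p.copy()
    grad[np.arange(y_true.shape[0]), y_true] -= 1.0
    # diagonal Hessian approximation, floored for numerical stability
    hess = np.maximum(2.0 * p * (1.0 - p), 1e-6)
    # older xgboost versions expected these flattened to (n_samples * n_classes,)
    return grad, hess  # both (n_samples, n_classes)

clf = XGBClassifier(objective=softprob_like_obj)
clf.fit(X_train, y_train)  # X_train, y_train as in the question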

R.keras error during call to fit function

I am new to R keras, so please bear with me. I am trying to build a simple model using variables that are categorical, but which I've recast as numeric.
I can get examples from various tutorials working in R/keras with my current installation, so I know the problem isn't in reticulate, tensorflow, or even R. However, when I try to use my own data to create the simple model, I get the following errors during the "fit" execution.
I'm pretty sure it's my training data format, but I cannot for the life of me figure out what is going wrong. Thank you kindly in advance.
# Fit
model_one <- model %>%
  fit(training,
      trainLabels,
      epochs = 100,
      batch_size = 32,
      validation_split = 0.2)
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: in user code:
C:\Users\JRM\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\tensorflow\python\keras\engine\training.py:571 train_function *
outputs = self.distribute_strategy.run(
C:\Users\JRM\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:951 run **
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
C:\Users\JRM\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2290 call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
C:\Users\JRM\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\tensorflow\python\distribute\distribute_lib.py:2649 _call_for_each_replica
return fn(*args, **kwargs)
C:\Users\JRM\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\tensorflow\python\keras\engine\training.py:533 train_step **
y, y_pred, sample_weight, regulari
I've uploaded my script and sample data file to GitHub:
Sample Data and Script to reproduce error
Actually, I found the error:
One-hot encoding generates a 2-column matrix:
# One hot encoding
trainLabels <- to_categorical(trainingtarget)
testLabels <- to_categorical(testtarget)
print(testLabels[1:10,])
but the model was expecting 3 columns.
I changed the model definition to pick up the correct number of columns from the data automatically instead:
model %>%
layer_dense(units = 8, activation = 'relu', input_shape = ncol(test)) %>%
layer_dense(units = ncol(trainLabels), activation = 'softmax')
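The same behaviour is easy to demonstrate in Python Keras (my addition, not part of the original answer): to_categorical sizes its output from the largest label it sees, so passing num_classes explicitly is another way to pin the width:
from tensorflow.keras.utils import to_categorical
import numpy as np

print(to_categorical(np.array([0, 1, 1])).shape)                 # (3, 2): only labels 0-1 seen
print(to_categorical(np.array([0, 1, 2])).shape)                 # (3, 3)
print(to_categorical(np.array([0, 1, 1]), num_classes=3).shape)  # (3, 3): width pinned explicitly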

How to fix 'RecursionError: maximum recursion depth exceeded while getting the str of an object' error in Python while plotting a dendrogram

I tried to cluster my data using hierarchical clustering and a dendrogram. My dataset has 400,000 rows and 90 columns. I also split the data, with test_size = 0.2, and feature-scaled it before drawing the dendrogram.
Can someone help me with the error? Thanks.
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = customer.iloc[:, [2, 3]].values
y = customer.iloc[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
# StandardScaler expects a 2-D array, so reshape the 1-D target first
y_train = sc_y.fit_transform(y_train.values.reshape(-1, 1))

dendrogram = sch.dendrogram(sch.linkage(X_test, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()
I got an error message:
File "C:\Users\anaconda3\lib\site-packages\scipy\cluster\hierarchy.py", line 3433, in _append_singleton_leaf_node
ivl.append(str(int(i)))
RecursionError: maximum recursion depth exceeded while getting the str of an object.
This error comes from one of several CPython implementation limits (similar to the way true parallel multithreading does not exist in CPython because of the GIL).
In short, to fix it you can do:
import sys
sys.setrecursionlimit(100000)
To get more information about this limitation: https://stackoverflow.com/a/13592002/427887
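An alternative worth noting (my addition, not from the original answer): with tens of thousands of leaves the plot is unreadable anyway, and scipy's dendrogram can truncate the tree, which also keeps the recursive rendering shallow:
import sys
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

sys.setrecursionlimit(100000)  # the fix suggested above

Z = sch.linkage(X_test, method='ward')  # X_test as prepared in the question
# show only the last 30 merges instead of every single leaf
sch.dendrogram(Z, truncate_mode='lastp', p=30)
plt.title('Dendrogram (truncated)')
plt.xlabel('Customers (cluster size in parentheses)')
plt.ylabel('Euclidean distances')
plt.show()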

Keras in R, LSTM with flexible inputlength

I'm using Keras with the tensorflow backend in R.
I want to create a model that can handle an arbitrary input sequence length.
When I try to define the following model:
model <- keras_model_sequential()
layer_lstm(model, 128, input_shape = c(NULL, 5))
I get the following error:
ValueError: Input 0 is incompatible with layer lstm_3: expected ndim=3, found ndim=2
I guess that it runs into difficulties since the batch size is already variable. So I could do the following:
model <- keras_model_sequential()
layer_lstm(model, 128, input_shape = c(20, NULL, 5))
This runs without any error. Does this indeed signify a fixed batch size of 20, a variable sequence length, and an input length of 5? Or is this just wishful thinking?
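A note that may explain the first error (my addition, not from the original post): in R, c(NULL, 5) silently drops the NULL and collapses to a length-1 vector, so input_shape = c(NULL, 5) declares a (batch, 5) input, i.e. ndim=2; list(NULL, 5) is one way to keep the placeholder. For comparison, a sketch of the variable-length equivalent in Python Keras (assuming tf.keras):
from tensorflow import keras
from tensorflow.keras import layers

# None in the timestep position = arbitrary sequence length, 5 features per step
model = keras.Sequential([
    layers.LSTM(128, input_shape=(None, 5)),
])
model.summary()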

(R) function: object not found: environment depth fine?

I'm puzzled by a function error and would appreciate some insight.
The function, very briefly, automates the multiple processes involved in Boosted Regression Trees, using gbm.step and other gbm functions.
"gbm.auto" <- function (grids, samples, 3 parameters) {
starts 2 counters, require(gbm), does various small processing jobs with grids & samples
for parameter 1{
for parameter 2{
for parameter 3{
Runs 2 BRTs per parameter-combination loop, generates & iteratively updates a 'best' BRT for each, adds to counters. Extensive use of samples.
}}}
closes the loops, function continues as the first } is still open.
The next BRT can't find samples, even though it's at the same environment depth (1?) as the pre-loop processing jobs, which used it successfully. Furthermore, adding globalsamples <<- samples after the }}} loops successfully saves the object, suggesting that samples is still available. Adding env1, env2 & env3 <<- environment() before the {{{ loops, within them & after them returns an environment for all three, also suggesting it's all the same function environment & samples should be available.
What am I missing here? Thanks in advance!
Edit: exact message:
Error in eval(expr, envir, enclos) : object 'samples' not found
Function (lots removed & compacted, but it still gives the same error message):
"gbm.auto" <-
function (samples, expvar, resvar, tc, lr, bf)
{ # open function
require(gbm)
require(dismo)
# create binary (0/1) response variable, for bernoulli BRTs
samples$brv <- ifelse(samples[resvar] > 0, 1, 0)
brvcol <- which(colnames(samples)=="brv") # brv column number for BRT
for(j in tc){ # permutations of tree complexity
for(k in lr){ # permutations of learning rate
for(l in bf){ # permutations of bag fraction
Bin_Best_Model<- gbm.step(data=samples,gbm.x = expvar, gbm.y = brvcol, family = "bernoulli", tree.complexity = j, learning.rate = k, bag.fraction = l)
}}} # close loops, producing all BRT/GBM objects & continue through model selection
Bin_Best_Simp_Check <- gbm.simplify(Bin_Best_Model) # simplify model
# if best number of variables to remove isn't 0 (i.e. it's worth simplifying), re-run the best model (Bin_Best_Model, using gbm.call to get its values)
# with just-calculated best number of variables to remove, removed. gbm.x asks which number of drops has the minimum mean (lowest point on the line)
# & that calls up the list of predictor variables with those removed, from $pred.list
if(min(Bin_Best_Simp_Check$deviance.summary$mean) < 0)
assign("Bin_Best_Simp", gbm.step(data = samples,
gbm.x = Bin_Best_Simp_Check$pred.list[[which.min(Bin_Best_Simp_Check$deviance.summary$mean)]],
gbm.y = brvcol, family = "bernoulli", tree.complexity = j, learning.rate = k, bag.fraction = l))
}
Read in data:
mysamples <- data.frame(response = round(sqrt(rnorm(5000, mean = 2.5, sd = 1.5)^2)),
                        depth = sqrt(rnorm(5000, mean = 35, sd = 24)^2),
                        temp = rnorm(5000, mean = 15, sd = 1.2),
                        sal = rnorm(5000, mean = 34, sd = 0.34))
Run this: gbm.auto(expvar = c(2,3,4), resvar = 1, samples = mysamples, tc = 2, lr = 0.00000000000000001, bf = 0.5)
Problem now: this causes a different error because my fake data are somehow wrong. ARGH!
Edit: I rounded the response data to integers and kept shrinking the learning rate until it ran. If it doesn't work for you, add zeroes until it does.
Edit: so this worked on my computer, but running it again in a clean session fails with a DIFFERENT error:
Error in var(cv.cor.stats, use = "complete.obs") :
no complete element pairs
In cor(y_i, u_i) : the standard deviation is zero
Is it allowed to attach or link to a CSV of a small clip of my data? I'm currently burrowing deeper & deeper into bug-fixing problems created by the fake data, which I'm only using for this question, & thus getting off-topic from the actual problem. Exasperation mode on!
Cheers
Edit 2: if this is allowed, a 1000-row, 4-column CSV is linked here: https://drive.google.com/file/d/0B6LsdZetdypkaC1WYXpKU3ZScjQ
