In my data there are about 70 classes, and I am using LightGBM to predict the correct class label.
In R, I would like to have a customised "metric" function with which I can evaluate whether LightGBM's top 3 predictions cover the true label.
The link here is inspiring:
import numpy as np
from sklearn.metrics import f1_score

def lgb_f1_score(y_hat, data):
    y_true = data.get_label()
    y_hat = np.round(y_hat)  # scikit-learn's f1_score doesn't like probabilities
    return 'f1', f1_score(y_true, y_hat), True
However, I don't know the dimensionality of the arguments passed to the function; the data seem to be shuffled for some reason.
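Since part of the question is exactly those shapes, one quick way to see them is a throwaway eval function that just prints them before returning a dummy metric. A minimal sketch of my own (the name shape_probe is made up), assuming it is passed as feval to a multiclass lgb.train run:

def shape_probe(preds, train_data):
    # preds: flat 1-D array of length num_data * num_class
    # train_data.get_label(): 1-D array of length num_data
    print(preds.shape, train_data.get_label().shape)
    return 'probe', 0.0, True  # dummy value; True = higher is better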
Scikit-learn implementation
import numpy as np
from sklearn.metrics import f1_score

def lgb_f1_score(y_true, y_pred):
    preds = y_pred.reshape(len(np.unique(y_true)), -1)
    preds = preds.argmax(axis=0)
    print(preds.shape)
    print(y_true.shape)
    return 'f1', f1_score(y_true, preds, average='weighted'), True
After reading through the docs for lgb.train and lgb.cv, I had to make a separate function get_ith_pred and then call that repeatedly within lgb_f1_score.
The function's docstring explains how it works. I have used the same argument names as in the LightGBM docs. This can work for any number of classes but does not work for binary classification. In the binary case, preds is a 1D array containing the probability of the positive class.
import numpy as np
from sklearn.metrics import f1_score

def get_ith_pred(preds, i, num_data, num_class):
    """
    preds: 1D NumPy array
        A 1D numpy array containing predicted probabilities. Has shape
        (num_data * num_class,). So, for binary classification with
        100 rows of data in your training set, preds is shape (200,),
        i.e. (100 * 2,).
    i: int
        The row/sample in your training data you wish to calculate
        the prediction for.
    num_data: int
        The number of rows/samples in your training data.
    num_class: int
        The number of classes in your classification task.
        Must be greater than 2.

    The LightGBM docs tell us that to get the probability of class 0 for
    the 5th row of the dataset we do preds[0 * num_data + 5]. For the
    class 1 prediction of the 7th row, do preds[1 * num_data + 7].

    sklearn's f1_score(y_true, y_pred) expects y_pred to be of the form
    [0, 1, 1, 1, 1, 0, ...] and not probabilities. This function
    translates preds into the form sklearn's f1_score understands.
    """
    # Only works for multiclass classification
    assert num_class > 2
    preds_for_ith_row = [preds[class_label * num_data + i]
                         for class_label in range(num_class)]
    # The class with the highest probability is the prediction
    return np.argmax(preds_for_ith_row)

def lgb_f1_score(preds, train_data):
    y_true = train_data.get_label()
    num_data = len(y_true)
    num_class = 70
    y_pred = []
    for i in range(num_data):
        ith_pred = get_ith_pred(preds, i, num_data, num_class)
        y_pred.append(ith_pred)
    return 'f1', f1_score(y_true, y_pred, average='weighted'), True
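To answer the actual question (top-3 coverage rather than F1), the same indexing can be done in one reshape. A minimal sketch of my own, assuming num_class = 70 and the same preds layout as above:

import numpy as np

def lgb_top3_score(preds, train_data):
    y_true = train_data.get_label().astype(int)
    num_data = len(y_true)
    num_class = 70  # assumed, as in lgb_f1_score above
    # probs[c, i] == preds[c * num_data + i]: probability of class c for row i
    probs = preds.reshape(num_class, num_data)
    # class indices of the 3 largest probabilities in each column (data row)
    top3 = np.argsort(probs, axis=0)[-3:, :]
    # does the true label appear among the top 3 for each row?
    hits = (top3 == y_true).any(axis=0)
    return 'top3', hits.mean(), True  # higher is better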
Are R's loess and PyQt-Fit's nonparametric regression more or less equivalent? For example, if I have an R call like:
loess(formula = myformula, data = mydata, span = myspan, degree = 2, normalize = TRUE, family = "gaussian")
How can I obtain the same or a similar result with PyQt-Fit? Should I simply call the smooth.NonParamRegression function (http://pythonhosted.org/PyQt-Fit/NonParam_tut.html) with method=npr_methods.LocalPolynomialKernel(q=2)? What about other parameters, such as span and family?
UPDATE
I do realize the two implementations are likely not equivalent (https://www.statsdirect.com/help/nonparametric_methods/loess.htm). But any comments regarding "approximating" their outcomes are appreciated.
Statsmodels has a LOWESS implementation
(http://www.statsmodels.org/devel/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html).
Check out this post on the difference between LOESS and LOWESS: https://stats.stackexchange.com/questions/161069/difference-between-loess-and-lowess
Quick example on how to use statsmodels' lowess function in Python
import numpy as np
import statsmodels.api as sm
lowess = sm.nonparametric.lowess
Generate two random 1-D arrays, x and y (lowess expects 1-D inputs; x is generated sorted so the output rows line up with the input order):
x = np.sort(np.random.rand(100))
y = np.random.rand(100)
Run the lowess function (frac refers to the bandwidth, i.e. the fraction of the data used for each local fit; it is the number of robustifying iterations. Note that frac and it are set arbitrarily here, and not all parameters are specified; some are left at their defaults. For more, see the official documentation):
results = lowess(y, x, frac=0.05, it=3)
The results are stored in a two-dimensional array. The first column contains the sorted x (exog) values and the second column the associated estimated y (endog) values.
If, for instance, you'd like to construct the residuals, you can proceed as follows:
res = y - results[:, 1]  # the rows of results are sorted by x, which matches our input because x was generated sorted
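To get closer to the original R call: frac plays the role of span, while degree = 2 has no counterpart here, because statsmodels' lowess fits local linear (degree-1) regressions; it = 0 roughly corresponds to family = "gaussian", i.e. no robustness reweighting. A sketch under these assumptions (myspan is a hypothetical stand-in for the value in the R call):

myspan = 0.75  # hypothetical: the span used in the R loess call
approx_loess = lowess(y, x, frac=myspan, it=0)  # frac ~ span, it=0 ~ family="gaussian"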
I'm running a nonlinear least squares using the minpack.lm package.
However, for each group in the data I would like to optimize (minimize) the fitting parameters, similar to Python's minimize function:
The minimize() function is a wrapper around Minimizer for running an
optimization problem. It takes an objective function (the function
that calculates the array to be minimized), a Parameters object, and
several optional arguments.
The reason I need this is that I want to optimize the fitting function, based on the obtained fitting parameters, to find global fitting parameters that can fit both groups in the data.
Here is my current approach for fitting in groups,
df <- data.frame(y=c(replicate(2,c(rnorm(10,0.18,0.01), rnorm(10,0.17,0.01))),
c(replicate(2,c(rnorm(10,0.27,0.01), rnorm(10,0.26,0.01))))),
DVD=c(replicate(4,c(rnorm(10,60,2),rnorm(10,80,2)))),
gr = rep(seq(1,2),each=40),logic=rep(c(1,0),each=40))
The fitting equation for these groups is
fitt <- function(data) {
    fit <- nlsLM(y ~ pi * label2 * (DVD/2 + U1)^2,
                 data = data, start = c(label2 = 1, U1 = 4),
                 trace = TRUE, control = nls.lm.control(maxiter = 130))
}

library(minpack.lm)
library(plyr)  # will help to fit in groups

fit <- dlply(df, c('gr'), .fun = fitt)  # fit separately within each group gr
> fit
$`1`
Nonlinear regression model
model: y ~ pi * label2 * (DVD/2 + U1)^2
data: data
label2 U1
2.005e-05 1.630e+03
$`2`
label2 U1
2.654 -35.104
I need to know whether there is any function that optimizes the sum of squares to get the best fit for both groups together.
You might say that I already have the best fitting parameters in terms of the per-group residual sum of squares, but I know that a minimizer can do this; I just haven't found any similar example of doing it in R.
P.S. I made up the numbers and fitting lines.
Not sure about R, but least squares with shared parameters is usually simple to implement.
A simple Python example looks like this:
from matplotlib import pyplot as plt
from random import random
from scipy import optimize
import numpy as np
# just for my normally distributed errors
def boxmuller(x0, sigma):
    u1 = random()
    u2 = random()
    ll = np.sqrt(-2 * np.log(u1))
    z0 = ll * np.cos(2 * np.pi * u2)
    z1 = ll * np.sin(2 * np.pi * u2)
    return sigma * z0 + x0, sigma * z1 + x0
# some non-linear function
def f0(x, a, b, c, s=0.05):
    return a * np.sqrt(x**2 + b**2) - np.log(c**2 + x) + boxmuller(0, s)[0]

# Residual function for least squares; takes two data sets,
# not necessarily of the same length.
# Two of the three parameters are common to both sets.
def residuals(parameters, l1, l2, dataPoints):
    a, b, c1, c2 = parameters
    set1 = dataPoints[:l1]
    set2 = dataPoints[-l2:]
    distance1 = [(a * np.sqrt(x**2 + b**2) - np.log(c1**2 + x)) - y for x, y in set1]
    distance2 = [(a * np.sqrt(x**2 + b**2) - np.log(c2**2 + x)) - y for x, y in set2]
    res = distance1 + distance2
    return res
xList0 = np.linspace(0, 8, 50)

# some xy data
xList1 = np.linspace(0, 7, 25)
data1 = np.array([f0(x, 1.2, 2.3, .33) for x in xList1])

# more xy data using a different third parameter
xList2 = np.linspace(0.1, 7.5, 28)
data2 = np.array([f0(x, 1.2, 2.3, .77) for x in xList2])

alldata = np.array(list(zip(xList1, data1)) + list(zip(xList2, data2)))

# rough estimates
estimate = [1, 1, 1, .1]

# fitting; providing the second length is actually redundant
bestFitValues, ier = optimize.leastsq(residuals, estimate,
                                      args=(len(data1), len(data2), alldata))
print(bestFitValues)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xList1, data1)
ax.scatter(xList2, data2)
ax.plot(xList0, [f0(x, bestFitValues[0], bestFitValues[1], bestFitValues[2], s=0) for x in xList0])
ax.plot(xList0, [f0(x, bestFitValues[0], bestFitValues[1], bestFitValues[3], s=0) for x in xList0])
plt.show()

# output
# [ 1.19841984  2.31591587  0.34936418  0.7998094 ]
If required, you can even write the minimization yourself. If your parameter space is reasonably well behaved, i.e. has an approximately parabolic minimum, a simple Nelder-Mead method works quite well.
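Picking up that last point, here is a minimal sketch of my own (reusing residuals, estimate, data1, data2, and alldata from the example above) that minimizes the summed squared residuals directly with scipy's Nelder-Mead:

# Minimize the sum of squared residuals with Nelder-Mead instead of leastsq
def sum_of_squares(parameters, l1, l2, dataPoints):
    r = residuals(parameters, l1, l2, dataPoints)
    return np.sum(np.array(r)**2)

result = optimize.minimize(sum_of_squares, estimate,
                           args=(len(data1), len(data2), alldata),
                           method='Nelder-Mead')
print(result.x)  # shared a, b and the per-group c1, c2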
I am using the Conv2D layer of Keras 2.0. However, I cannot fully understand what the function is doing mathematically, so I try to work through the math using randomly generated data and a very simple network:
import numpy as np
from keras.layers import Input, Conv2D
from keras.models import Model

# create the model
inputs = Input(shape=(10, 10, 1))  # 1 channel, 10x10 image
outputs = Conv2D(32, (3, 3), activation='relu', name='block1_conv1')(inputs)
model = Model(inputs=inputs, outputs=outputs)
# input
x = np.random.random(100).reshape((10,10))
# predicted output for x
y_pred = model.predict(x.reshape((1,10,10,1))) # y_pred.shape = (1,8,8,32)
I tried to calculate, for example, the value at the first row and first column of the first feature map, following the demo here.
w = model.layers[1].get_weights()[0] # w.shape = (3,3,1,32)
w0 = w[:,:,0,0]
b = model.layers[1].get_weights()[1] # b.shape = (32,)
b0 = b[0] # b0 = 0
y_pred_000 = np.sum(x[0:3,0:3] * w0) + b0
But relu(y_pred_000) is not equal to y_pred[0][0][0][0].
Could anyone point out what's wrong with my understanding? Thank you.
It's easy, and it comes from the Theano dim ordering. The result of applying a filter is stored in the so-called channel dimension. In the case of TensorFlow this is the last dimension, which is why the results would come out right there. In the case of Theano it's the second dimension (the convolution result has shape (cases, channels, width, height)), so to solve your problem you need to change the prediction line to:
y_pred = model.predict(x.reshape((1, 1, 10, 10)))
You also need to change the way you get the weights, since in Theano they have shape (output_channels, input_channels, width, height):
w = model.layers[1].get_weights()[0] # w.shape = (32,1,3,3)
w0 = w[0,0,:,:]
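For completeness: under the TensorFlow channels_last ordering the original indexing was already correct, so the manual computation can be checked directly. A minimal sketch of my own, assuming a channels_last backend and reusing model, x, and y_pred from the question:

w = model.layers[1].get_weights()[0]  # shape (3, 3, 1, 32) with channels_last
b = model.layers[1].get_weights()[1]  # shape (32,)
w0 = w[:, :, 0, 0]
# Conv2D computes a cross-correlation: the window times the kernel, summed,
# plus the bias, followed by the ReLU activation
manual = max(np.sum(x[0:3, 0:3] * w0) + b[0], 0.0)
print(np.isclose(manual, y_pred[0, 0, 0, 0]))  # expect True with channels_last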
I know that randomForest is supposed to be a black box, and that most people are interested in the ROC curve of the classifier as a whole, but I'm working on a problem in which I need to inspect individual trees of the RF. I'm not very experienced with R, so: what's an easy way to plot ROC curves for the individual trees generated by an RF?
I don't think you can generate a ROC curve from a single tree from a random forest generated by the randomForest package. You can access the output of each tree from a prediction, for example over the training set.
# caret for an example data set
library(caret)
library(randomForest)
data(GermanCredit)
# use only 50 rows for demonstration
nrows = 50
# extract the first 9 columns and 50 rows as training data (column 10 is "Class", the target)
x = GermanCredit[1:nrows, 1:9]
y = GermanCredit$Class[1:nrows]
# build the model
rf_model = randomForest(x = x, y = y, ntree = 11)
# Compute the prediction over the training data. Note predict.all = TRUE
rf_pred = predict(rf_model, newdata = x, predict.all = TRUE, type = "prob")
You can access the predictions of each tree with
rf_pred$individual
However, the prediction of a single tree is only the most likely label. For a ROC curve you need class probabilities, so that varying the decision threshold varies the true and false positive rates.
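(As a generic illustration of that point, in Python with scikit-learn, since it states the idea compactly: a ROC curve is traced by sweeping a threshold over scores, which is impossible if each tree only emits a hard label. The toy arrays below are made up.)

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])  # class probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)  # one curve point per distinct threshold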
As far as I can tell, at least in package randomForest there is no way to make the leaves output probabilities instead of labels. If you inspect a tree with getTree(), you will see that the prediction is binary; use getTree(rf_model, k = 1, labelVar = TRUE) and you'll see the labels in plain text.
What you can do, though, is to retrieve individual predictions via predict.all = TRUE and then manually compute class labels on subsets of the whole forest. This you can then input into a function to compute ROC curves like those from the ROCR package.
Edit: OK, from the link you provided in your comment I got the idea of how a ROC curve can be obtained. First, we extract one particular tree and run each data point through it, counting the occurrences of the success class at each terminal node as well as the total number of data points in each node. The ratio gives the node probability for the success class. Next, we do something similar: we run each data point through the tree again, but now record that probability. This way we can compare the class probs with the true labels.
Here is the code:
# libraries we need
library(caret)  # provides the GermanCredit example data
library(randomForest)
library(ROCR)

# Set fixed seed for reproducibility
set.seed(54321)
# Define function to read out the terminal node of a tree for a given data point.
# randomForest sends cases with split_value <= split point to the left daughter.
travelTree = function(tree, data_row) {
    node = 1
    while (tree[node, "status"] != -1) {
        split_value = data_row[, tree[node, "split var"]]
        if (split_value <= tree[node, "split point"]) {
            node = tree[node, "left daughter"]
        } else {
            node = tree[node, "right daughter"]
        }
    }
    return(node)
}
# define number of data rows and trees
nrows = 100
ntree = 11

# load example data
data(GermanCredit)

# Easier access of variables
x = GermanCredit[1:nrows, 1:9]
y = GermanCredit$Class[1:nrows]

# Build RF model
rf_model = randomForest(x = x, y = y, ntree = ntree, nodesize = 10)

# Extract a single tree and add the variables we need to compute class probs
single_tree = getTree(rf_model, k = 2, labelVar = TRUE)
single_tree$"split var" = as.character(single_tree$"split var")
single_tree$sum_good = 0
single_tree$sum = 0
single_tree$pred_prob = 0
# Count "Good" labels and total data points in each terminal node
for (zeile in 1:nrow(x)) {
    out_node = travelTree(single_tree, x[zeile, ])
    single_tree$sum_good[out_node] = single_tree$sum_good[out_node] + (y[zeile] == "Good")
    single_tree$sum[out_node] = single_tree$sum[out_node] + 1
}
# Compute class probabilities from the count of "Good" data points in each node.
# Make sure we do not divide by zero
idcs = single_tree$sum != 0
single_tree$pred_prob[idcs] = single_tree$sum_good[idcs] / single_tree$sum[idcs]

# Compute the prediction by running the data set through the tree again,
# but reading out the previously computed probs
single_tree_pred = rep(0, nrow(x))
for (zeile in 1:nrow(x)) {
    out_node = travelTree(single_tree, x[zeile, ])
    single_tree_pred[zeile] = single_tree$pred_prob[out_node]
}

# Et voila: the ROC curve for a single tree!
plot(performance(prediction(single_tree_pred, y), "tpr", "fpr"))
I use the ‘monmlp’ package in R (monotone multi-layer perceptron neural network) as follows:
model = monmlp.fit(trainData, trainLabs, hidden1 = 3, n.ensemble = 1, bag = FALSE, silent = TRUE)
pred = monmlp.predict(testData, model)
preds = as.numeric(pred)
labs = as.numeric(testLabs)
pr = prediction(preds, labs)  # prediction() and performance() are from ROCR
pf = performance(pr, "auc")
pf@y.values[[1]]
I want to predict some new data using the trained model and take the instances whose result is higher than a threshold value, like 0.9.
In brief, I want to take the instances that are more likely to be in class 1, using a threshold.
The classes are 0 and 1, and
pred = monmlp.predict(testData,model)
head(pred)
returns
[,1]
311694 0.005271582
129347 0.005271582
15637 0.005271582
125458 0.005271582
315130 0.010411831
272375 0.010411831
What are these values? Probability values?
If so, what do these values mean?
pred[which(pred>1)]
[1] 1023.839 1023.839 1023.839
Thanks.
Regarding the output: "a matrix with number of rows equal to the number of samples and number of columns equal to the number of predictand variables. If weights is from an ensemble of models, the matrix is the ensemble mean and the attribute ensemble contains a list with predictions for each ensemble member."
Source:
http://cran.r-project.org/web/packages/monmlp/monmlp.pdf
I've never used the package or the technique, but maybe the quoted passage will mean something to you.