How to represent hypothesis function in logistic regression cost function - math

Below is the logistic regression cost function, with features (x) and training examples (y).
How should the hypothesis function (circled in red) be represented?
I'm attempting to implement this function but am unsure what value (or function) the hypothesis should take.

The activation function in logistic regression is the sigmoid function (https://en.wikipedia.org/wiki/Sigmoid_function), defined as
h_theta(x) = 1 / (1 + exp(-theta' * x)),
which is also the probability of y taking the value 1 for a given x and the parameters theta to be determined (the sigmoid is always between 0 and 1).
The cost function you mentioned comes from maximum likelihood estimation (https://en.wikipedia.org/wiki/Maximum_likelihood) of the (X, y) training pairs. The log-likelihood of any single (x, y) pair is exactly
y * log(h_theta(x)) + (1 - y) * log(1 - h_theta(x)).
The final loss function is precisely the negative of the sum of these log-likelihoods over all (X, y) training pairs.
Thus, the "hypothesis" you are talking about is simply the sigmoid, 1/(1+exp(-theta * x)). (Actually I am not familiar with the term "hypothesis" used in this context, but the expression resembles any standard expression involving the sigmoid and MLE.)
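For illustration, here is a minimal R sketch of this (sigmoid and logistic_cost are my names; X is assumed to be an n x p design matrix, y a 0/1 response vector, and theta a length-p coefficient vector):
sigmoid <- function(z) 1 / (1 + exp(-z))
logistic_cost <- function(theta, X, y) {
    h <- sigmoid(X %*% theta)                  # hypothesis: P(y = 1 | x, theta)
    -mean(y * log(h) + (1 - y) * log(1 - h))   # negative average log-likelihood
}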

Related

High (or very high) order polynomial regression in R (or alternatives?)

I would like to fit a (very) high-order polynomial regression to a set of data in R, but the poly() function seems to have a limit of order 25.
For this application I need an order in the range of 100 to 120.
model <- lm(noisy.y ~ poly(q,50))
# Error in poly(q, 50) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,30))
# Error in poly(q, 30) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,25))
# OK
Polynomials and orthogonal polynomials
poly(x) has no hard-coded limit for degree. However, there are two numerical constraints in practice.
Basis functions are constructed at the unique locations of the x values. A polynomial of degree k has k + 1 basis functions and coefficients. poly generates the basis without the intercept term, so degree = k implies k basis functions and k coefficients. If there are n unique x values, we must have k < n, otherwise there is simply insufficient information to construct a polynomial. Inside poly(), the following line checks this condition:
if (degree >= length(unique(x)))
stop("'degree' must be less than number of unique points")
The correlation between x ^ k and x ^ (k+1) gets closer and closer to 1 as k increases; how fast this happens depends, of course, on the x values. poly first generates the ordinary polynomial basis, then performs a QR factorization to find an orthogonal span. If numerical rank-deficiency occurs between x ^ k and x ^ (k+1), poly will also stop and complain:
if (QR$rank < degree)
stop("'degree' must be less than number of unique points")
But the error message is not informative in this case. Furthermore, this does not have to be an error; it could be a warning, with poly resetting degree to the detected rank and proceeding. Maybe R core can improve on this bit?
Your trial and error shows that you can't construct a polynomial of degree higher than 25. You can first check length(unique(q)); if a degree smaller than this still triggers the error, you know for sure it is due to numerical rank-deficiency.
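For example (a quick sketch; q is your own vector from the question, while the x below is made up just to illustrate the near-collinearity of raw powers):
length(unique(q))    # the requested degree must be strictly smaller than this
x <- seq(0, 1, length.out = 100)
cor(x^10, x^11)      # very close to 1, and it gets closer as the power grows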
But what I want to say is that a polynomial of degree higher than 3-5 is never useful! The critical reason is Runge's phenomenon. In statistical terminology: a high-order polynomial always badly overfits the data! Don't naively think that because orthogonal polynomials are numerically more stable than raw polynomials, Runge's effect can be eliminated. No: the polynomials of degree at most k form a vector space, so whatever basis you use for the representation, the span is the same!
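To see the overfitting in action, here is a small self-contained sketch (the data are made up since we don't have yours): a degree-20 fit to 50 noisy observations of a smooth curve wiggles badly, especially near the boundary.
set.seed(1)
q <- seq(0, 1, length.out = 50)
noisy.y <- sin(2 * pi * q) + rnorm(50, sd = 0.2)
fit <- lm(noisy.y ~ poly(q, 20))
qq <- seq(0, 1, length.out = 500)
plot(q, noisy.y)
lines(qq, sin(2 * pi * qq))                                              # true curve
lines(qq, predict(fit, newdata = data.frame(q = qq)), lty = 2, col = 2)  # wiggly degree-20 fit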
Splines: piecewise cubic polynomials and their use in regression
Polynomial regression is indeed helpful, but we often want piecewise polynomials instead. The most popular choice is the cubic spline. Just as there are different representations for polynomials, there are plenty of representations for splines:
truncated power basis
natural cubic spline basis
B-spline basis
The B-spline basis is the most numerically stable, as it has compact support. As a result, the covariance matrix X'X is banded, so solving the normal equations (X'X) b = (X'y) is very stable.
In R, we can use the bs function from the splines package (one of the R base packages) to get a B-spline basis. For bs(x), the only numerical constraint on the degrees of freedom df is that we can't have more basis functions than length(unique(x)).
I am not sure what your data look like, but perhaps you can try
library(splines)
model <- lm(noisy.y ~ bs(q, df = 10))
Penalized smoothing / regression splines
A regression spline is still likely to overfit your data if you keep increasing the degrees of freedom. Finding the best model then comes down to choosing the best degrees of freedom.
A great approach is to use a penalized smoothing spline or penalized regression spline, so that model estimation and selection of the degrees of freedom (i.e., "smoothness") are integrated.
The smooth.spline function in the stats package can do both. Unlike what its name seems to suggest, most of the time it is fitting a penalized regression spline rather than a smoothing spline. Read ?smooth.spline for more. For your data, you may try
fit <- smooth.spline(q, noisy.y)
(Note, smooth.spline has no formula interface.)
Additive penalized splines and Generalized Additive Models (GAM)
Once we have more than one covariate, we need additive models to overcome the "curse of dimensionality" while remaining sensible. Depending on the representation of the smooth functions, a GAM can come in various forms. The most popular implementation, in my opinion, is the mgcv package, one of R's recommended packages.
You can still fit a univariate penalized regression spline with mgcv:
library(mgcv)
fit <- gam(noisy.y ~ s(q, bs = "cr", k = 10))

Negative binomial dispersion parameter in Matlab

The Matlab function nbinfit returns the values r and p for the negative binomial distribution. Is there an equivalent MLE function in Matlab that instead returns the values for mu (the mean) and theta (the dispersion parameter), i.e. the "ecological" or "Polya" parametrization of the negative binomial?
(Something like fitdistr in R.)
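For the R side mentioned in the question, a small sketch with made-up data showing that fitdistr already reports the (mu, size) parametrization, where size plays the role of the dispersion parameter theta (and, in terms of nbinfit's output, theta corresponds to r while mu = r * (1 - p) / p):
library(MASS)
set.seed(1)
x <- rnbinom(1000, size = 2, mu = 5)
fitdistr(x, "negative binomial")   # estimates of size (= theta) and mu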

Cost function in cv. glm for a fitted logistic model when cutoff value of the model is not 0.5

I have a logistic model fitted with the following R function:
glmfit<-glm(formula, data, family=binomial)
A reasonable cutoff value for getting a good classification (or confusion matrix) with the fitted model is 0.2 instead of the commonly used 0.5.
And I want to use the cv.glm function with the fitted model:
cv.glm(data, glmfit, cost, K)
Since the response in the fitted model is a binary variable an appropriate cost function is (obtained from "Examples" section of ?cv.glm):
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
As I have a cutoff value of 0.2, can I apply this standard cost function, or should I define a different one? And if so, how?
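For illustration, a sketch of what a 0.2-cutoff cost might look like, following the convention of the ?cv.glm example in which r is the observed 0/1 response and pi the predicted probability (cost02 is just a name I made up):
cost02 <- function(r, pi = 0) mean((pi > 0.2) != r)
cv.glm(data, glmfit, cost = cost02, K)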

corr.bias parameter in Random forest regression model in R

I'm using the regression variant of random forest in R and I found the parameter corr.bias, which according to the manual is "experimental". My data is nonlinear, and I wonder whether setting this parameter to TRUE can enhance the results. I also don't know exactly how it works for nonlinear data, so I would really appreciate it if someone could explain how this bias correction works in the randomForest package and whether it can improve my regression model or not.
The short answer is that it performs a simple correction based on a linear regression of the actual values on the fitted values.
From regrf.c:
/* Do simple linear regression of y on yhat for bias correction. */
if (*biasCorr) simpleLinReg(nsample, yptr, y, coef, &errb, nout);
and the first few lines of that function are simply:
void simpleLinReg(int nsample, double *x, double *y, double *coef,
                  double *mse, int *hasPred) {
    /* Compute simple linear regression of y on x, returning the coefficients,
       the average squared residual, and the predicted values (overwriting y). */
So when you fit a regression random forest with corr.bias = TRUE, the model object returned will contain a coef element, which is simply the two coefficients from that linear regression.
Then when you call predict.randomForest this happens:
## Apply bias correction if needed.
yhat <- rep(NA, length(rn))
names(yhat) <- rn
if (!is.null(object$coefs)) {
    yhat[keep] <- object$coefs[1] + object$coefs[2] * ans$ypred
}
The non-linear nature of your data isn't necessarily relevant, but the bias correction may be very poor if the relationship between the fitted and actual values is far from linear.
You can always fit the model and then plot the fitted vs actual values yourself and see whether a correction based on a linear regression would help or not.
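For example, a small sketch with made-up data (the names d and fit are mine):
library(randomForest)
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.3)
fit <- randomForest(y ~ x, data = d, corr.bias = TRUE)
plot(fit$predicted, d$y, xlab = "OOB fitted values", ylab = "actual y")  # roughly linear?
abline(0, 1, lty = 2)
fit$coef   # the coef element holding the two bias-correction coefficients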

How to fit frequency distributions in R?

Is there a function that could be used to fit a frequency distribution in R? I'm aware of fitdistr, but as far as I can tell it only works on data vectors (random samples). Also, I know that converting between the two formats is trivial, but the frequencies are so large that memory is a concern.
For example, fitdistr may be used the following way:
x<-rpois(100, lambda=10)
fitdistr(x,"poisson")
Is there a function that would do the same fitting on a frequency table? Something along the lines of:
freqt <- as.data.frame(table(x))
fitfreqtable(freqt$x, weights=freqt$Freq, "poisson")
Thanks!
There's no built-in function that I know of for fitting a distribution to a frequency table. Note that, in theory, a continuous distribution is inappropriate for a table, since the data is discrete. Of course, for large enough N and a fine enough grid, this can be ignored.
You can build your own model-fitting function using optim or any other optimizer, if you know the density that you're interested in. I did this here for a gamma distribution (which was a bad assumption for that particular dataset, but never mind that).
Code reproduced below.
## -2 * log-likelihood, treating each observed cell count as Poisson
## with mean proportional to the gamma density at that cell
negll <- function(par, x, y)
{
    shape <- par[1]
    rate <- par[2]
    mu <- dgamma(x, shape, rate) * sum(y)   # expected frequency in each cell
    -2 * sum(dpois(y, mu, log = TRUE))
}

## g$count holds the observed frequencies (from the linked example)
optim(c(1, 1), negll, x = seq_along(g$count), y = g$count,
      method = "L-BFGS-B", lower = c(.001, .001))
$par
[1] 0.73034879 0.00698288
$value
[1] 62983.18
$counts
function gradient
32 32
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
For fitting a Poisson distribution, you only need the mean of your sample: the mean equals lambda, which is the only parameter of the Poisson distribution. Example:
set.seed(1111)
sample <- rpois(n = 10000, lambda = 10)
mean(sample)
[1] 10.0191
which is almost equal to the lambda value used to create the sample (lambda = 10). The small difference (0.0191) is due to the randomness of the Poisson random number generator. As you increase n, the difference will get smaller.
Alternatively, you can fit the distribution using an optimization method:
library(fitdistrplus)
fitdist(sample, "pois")
Fitting of the distribution ' pois ' by maximum likelihood
Parameters:
       estimate Std. Error
lambda  10.0191 0.03165296
but it's only a waste of time.
For theoretical information on fitting frequency data, you can see my answer here.
The function fitmixturegrouped from the ForestFit package does the job for other distribution models using frequency-by-group data.
It can fit simple or mixture distribution models based on "gamma", "log-normal", "skew-normal", and "weibull".
For a Poisson distribution, the population mean is the only parameter that is needed. Applying a simple summary function to your data would suffice (as suggested by ntzortzis).
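And for the Poisson case there is no need to expand the table at all; a minimal sketch using the freqt from the question (vals and lambda.hat are my names):
freqt <- as.data.frame(table(x))
vals <- as.numeric(as.character(freqt$x))                # table() stores the values as a factor
lambda.hat <- sum(vals * freqt$Freq) / sum(freqt$Freq)   # frequency-weighted mean = MLE of lambda
lambda.hat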
