I read about logistic regression on Wikipedia, where the linear predictor z depends on the values of beta1, beta2, beta3, and so on, which are called the regression coefficients, together with beta0, the intercept.
How do you determine the values of these coefficients and the intercept, given a sample set where we know the observed outcome and the values of the input (independent) variables?
Use R. Here's a video: www.youtube.com/watch?v=Yv05RjKpEKY
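For reference, a minimal sketch of how the coefficients are estimated with R's built-in glm(); the data frame d and predictors x1, x2 are illustrative names, and y is assumed to be a 0/1 outcome.
## Minimal sketch: estimate the intercept and coefficients of a logistic
## regression by maximum likelihood with glm(). `d`, `x1`, `x2` are
## illustrative names; `y` is a 0/1 outcome.
fit <- glm(y ~ x1 + x2, data = d, family = binomial)
coef(fit)                              # beta0 (intercept), beta1, beta2
summary(fit)                           # standard errors, z-values, p-values
head(predict(fit, type = "response"))  # fitted probabilities for the sample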
Can someone help me find an R equivalent of MATLAB's mnrval function? I have not been able to find one that returns predicted probabilities given coefficient estimates and predictor values as arguments. I tried to rewrite the MATLAB function in R, but could not because one of the inner functions it uses is private. I would greatly appreciate your help.
The documentation page on mnrval() states
MNRVAL Predict values for a nominal or ordinal multinomial regression model.
PHAT = MNRVAL(B,X) computes predicted probabilities for the nominal
multinomial logistic regression model with predictor values X. B is the
intercept and coefficient estimates as returned by the MNRFIT function. X
is an N-by-P design matrix with N observations on P predictor variables.
MNRVAL automatically includes intercept (constant) terms in the model; do
not enter a column of ones directly into X. PHAT is an N-by-K matrix of
predicted probabilities for each multinomial category.
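There is no drop-in R equivalent of mnrval that I know of, but the same probabilities can usually be obtained by fitting the nominal model with nnet::multinom() and asking predict() for probabilities, or by rebuilding them from the coefficient matrix by hand. A rough sketch, with train, test, x1, x2 and category as illustrative names:
## Rough analogue of MNRFIT + MNRVAL using the nnet package
## (illustrative names; not a drop-in replacement for mnrval).
library(nnet)

fit  <- multinom(category ~ x1 + x2, data = train)    # roughly MNRFIT
phat <- predict(fit, newdata = test, type = "probs")  # roughly MNRVAL: N-by-K probabilities

## The same probabilities rebuilt by hand from the coefficient estimates:
## one linear predictor per non-reference category, then a softmax with the
## reference category's linear predictor fixed at 0.
B    <- coef(fit)                                   # (K-1) x (P+1) coefficient matrix
X    <- cbind(1, as.matrix(test[, c("x1", "x2")]))  # intercept column added by hand
eta  <- cbind(0, X %*% t(B))                        # reference category first
phat_manual <- exp(eta) / rowSums(exp(eta))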
I fitted a non-linear regression using the nls() function:
power <- nls(formula = agw ~ a * area^b, data = calibration_6, start = list(a = 1, b = 1))
summary(power)
I have heard that R-squared is not valid for non-linear models, and that instead of R-squared one usually reports the residual standard error, which R provides.
However, I would still like to know what the R-squared is for my model. Is it possible to get an R-squared for an nls() fit?
Many thanks!!!
I found a workaround. It may not be correct statistically (since R² is not valid for a non-linear model), but I just want to see the overall goodness of fit of my non-linear model.
Step 1> transform the data to logs (common logarithm)
With the non-linear model itself I cannot get an R²:
nls(formula = agw ~ a * area^b, data = calibration, start = list(a = 1, b = 1))
Therefore, I transform my data to the log scale:
x1 <- log10(calibration$area)
y1 <- log10(calibration$agw)
cal <- data.frame(x1, y1)
Step 2> fit a linear regression on the log scale
logdata <- lm(formula = y1 ~ x1, data = cal)
summary(logdata)
Call:
lm(formula = y1 ~ x1)
This model gives y = -0.122 + 1.42x.
But I want to force the intercept to zero, therefore:
Step 3> force the intercept to zero
logdata2 <- lm(formula = y1 ~ 0 + x1)
summary(logdata2)
Now the equation is y = 1.322x on the log scale, which means log(y) = 1.322 log(x),
so back-transformed it is y = x^1.322.
In this power-curve model with the intercept forced to zero, the reported R² is 0.9994.
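Alternatively, the same kind of number can be computed directly from the nls() fit above without the log transform; it is only a pseudo-R² and carries the same caveat that R² is not well defined for non-linear models:
## Pseudo-R^2 straight from the nls fit `power` above (not strictly valid
## for a non-linear model, just an overall goodness-of-fit indication).
rss <- sum(residuals(power)^2)                               # residual sum of squares
tss <- sum((calibration_6$agw - mean(calibration_6$agw))^2)  # total sum of squares
1 - rss / tss                                                # pseudo-R^2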
Consider the code below, which fits a generalized additive model with two terms: x1, which enters linearly, and x2, which enters non-linearly:
library(mgcv)
set.seed(2) ## simulate some data...
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
b <- gam(y ~ x1 + s(x2, k = 5), data = dat, method = "REML")
The model b estimates 3 parameters: an intercept, one parametric coefficient for x1, and one smoothing parameter for x2. How can I extract the estimated covariance matrix of these 3 parameters? I have used vcov(b) which gives the following results:
            (Intercept)           x1      s(x2).1      s(x2).2      s(x2).3      s(x2).4
(Intercept)  0.104672470 -0.155791753  0.002356237  0.001136459  0.001611635  0.001522158
x1          -0.155791753  0.322528093 -0.004878003 -0.002352757 -0.003336490 -0.003151250
s(x2).1      0.002356237 -0.004878003  0.178914602  0.047701707  0.078393786  0.165195739
s(x2).2      0.001136459 -0.002352757  0.047701707  0.479869768  0.606310668  0.010704075
s(x2).3      0.001611635 -0.003336490  0.078393786  0.606310668  0.933905535  0.025816649
s(x2).4      0.001522158 -0.003151250  0.165195739  0.010704075  0.025816649  0.184471259
It seems vcov(b) gives the covariances related to each basis function of the smooth term s(x2), since the results contain s(x2).1, s(x2).2, s(x2).3 and s(x2).4 (that is my guess). What I need is the covariance between the estimated smoothing parameter and the other parametric coefficients, which should be just one value for (Intercept) and just one for x1. Is that available at all?
Edit: I set the method of estimation to REML in the code. I agree that I may have used incorrect phrases to explain my idea, as Gavin Simpson says, and I understand all he said. Still, the idea of a covariance between the parametric coefficients (the intercept and the coefficient of x1) and the smoothing parameter comes from the method of estimation: if we set it to ML or REML, then such a covariance could exist, I think. In that case, the estimated covariance matrix for the log smoothing parameter estimates is provided by sp.vcov, so I think an analogous quantity could exist for the parametric coefficients and the smoothing parameter.
Your statement
The model b estimates 3 parameters: an intercept, one parametric coefficient for x1, and one smoothing parameter for x2.
is incorrect.
The model estimates many more coefficients than these three. Note also that it is confusing to speak of "a smoothing parameter for x2", as the model also estimates one of those, but I doubt that is what you mean by the phrase. The smoothing parameter estimated for x2 is the value that controls the wiggliness of the fitted spline. It is estimated alongside the other coefficients you see, although it isn't typically considered part of the main model parameters, because what you see in the VCOV are actually the variances and covariances of the model coefficients conditional upon this value of the smoothing parameter.
The GAM fitted here is one in which the effect of x2 is represented by a spline basis expansion of x2. Given the basis used and the identifiability constraints applied to it, the true effect of x2, f(x2), is estimated via k-1 basis functions. The estimate is hat(f)(x2) = \sum_i \beta_i b_i(x2): a weighted sum of the basis functions b_i evaluated at the observed values of x2, with weights given by the model coefficients \beta_i for the ith basis function.
Hence, once the basis is chosen and once we have a smoothness parameter (in my sense, the one controlling the wiggliness), this model is simply a GLM with x1 and the 4 basis functions evaluated at x2. It is therefore fully parametric, and there isn't a single element in the VCOV that relates to the smooth f(x2) as a whole - the model just doesn't work that way.
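If what you are after is the uncertainty in the smoothing parameter itself, mgcv does expose that separately, but not as entries of vcov(b). A small sketch with the model b above; these calls assume the model was fitted with method = "REML" or "ML", and the unconditional argument is available in recent mgcv versions:
## Covariances of all the model coefficients (intercept, x1, and the s(x2)
## basis coefficients), conditional on the estimated smoothing parameter:
vcov(b)

## The same matrix corrected for smoothing-parameter uncertainty
## (requires a REML or ML fit and a recent mgcv):
vcov(b, unconditional = TRUE)

## Covariance matrix of the log smoothing-parameter estimate(s) themselves:
sp.vcov(b)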
I'm using random forest regression in R and I found the parameter corr.bias, which the manual describes as "experimental". My data are non-linear, and I wonder whether setting this parameter to TRUE can improve the results. I also don't know exactly how it works for non-linear data, so I would really appreciate it if someone could explain how this bias correction works in the randomForest package and whether it can improve my regression model.
The short answer is that it performs a simple correction based on a linear regression of the actual values on the fitted values.
From regrf.c:
/* Do simple linear regression of y on yhat for bias correction. */
if (*biasCorr) simpleLinReg(nsample, yptr, y, coef, &errb, nout);
and the first few lines of that function are simply:
void simpleLinReg(int nsample, double *x, double *y, double *coef,
                  double *mse, int *hasPred) {
    /* Compute simple linear regression of y on x, returning the coefficients,
       the average squared residual, and the predicted values (overwriting y). */
So when you fit a regression random forest with corr.bias = TRUE, the returned model object will contain a coefs element, which is simply the two coefficients from that linear regression.
Then when you call predict.randomForest this happens:
## Apply bias correction if needed.
yhat <- rep(NA, length(rn))
names(yhat) <- rn
if (!is.null(object$coefs)) {
yhat[keep] <- object$coefs[1] + object$coefs[2] * ans$ypred
}
The non-linear nature of your data isn't necessarily relevant here, but the bias correction may be very poor if the relationship between the fitted and actual values is far from linear.
You can always fit the model and then plot the fitted vs actual values yourself and see whether a correction based on a linear regression would help or not.
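For instance, something along these lines would show whether a linear map from fitted to actual values is reasonable before turning the correction on; the data frame d and response y are illustrative names:
## Check whether a linear fitted-vs-actual correction is plausible
## (illustrative names: data frame `d`, numeric response `y`).
library(randomForest)

rf <- randomForest(y ~ ., data = d)              # plain regression forest
plot(rf$predicted, d$y,
     xlab = "Out-of-bag predictions", ylab = "Actual y")
abline(0, 1, lty = 2)                            # identity line
abline(lm(d$y ~ rf$predicted), col = "red")      # the kind of line corr.bias fits

## If the red line looks sensible, refit with the experimental correction:
rf_bc <- randomForest(y ~ ., data = d, corr.bias = TRUE)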
I need to fit Y_ij ~ NegBin(m_ij, k), i.e. a negative binomial distribution, to count data. However, the observed data are censored: I know the value y_ij, but the true count could be larger than that value. Writing down the log-likelihood for this problem gives:
ll = \sum_{i=1}^{n} w_i \left[ c_i \log P(Y_{ij} = y_{ij} \mid X_{ij}) + (1 - c_i) \log\left(1 - \sum_{s=1}^{32} P(Y_{ij} = s \mid X_{ij})\right) \right]
where X_ij is the design matrix (containing the covariates of interest), w_i is the weight for each observation, c_i indicates whether observation i is fully observed (c_i = 1) or censored (c_i = 0), y_ij is the response variable, and P(Y_ij = y_ij | X_ij) is the negative binomial probability with mean m_ij = exp(X_ij \beta) and overdispersion parameter k.
Does anyone know whether there is built-in code in R that could be used to fit this?
Check this paper out: Regression Models for Count Data in R
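That review covers the standard (uncensored) count models. As far as I know there is no built-in function for exactly the weighted, censored likelihood written above, but it is straightforward to maximise it yourself with optim(). A minimal sketch, assuming right-censoring ("the true count is at least y_ij") and illustrative object names: a model matrix X, counts y, weights w, and an indicator cens equal to 1 for censored rows.
## Minimal sketch: weighted, right-censored negative binomial log-likelihood
## maximised with optim(). X, y, w and cens are illustrative objects, and the
## censored term uses P(Y >= y) rather than the exact expression above.
negll <- function(par, y, X, w, cens) {
  beta <- par[-length(par)]
  k    <- exp(par[length(par)])       # overdispersion, kept positive
  mu   <- drop(exp(X %*% beta))       # m_ij = exp(X_ij beta)
  ll_obs  <- dnbinom(y, size = k, mu = mu, log = TRUE)
  ll_cens <- pnbinom(y - 1, size = k, mu = mu, lower.tail = FALSE, log.p = TRUE)
  -sum(w * ifelse(cens == 1, ll_cens, ll_obs))
}

## X   <- model.matrix(~ x1 + x2, data = dat)   # design matrix (illustrative)
## fit <- optim(c(rep(0, ncol(X)), 0), negll, y = y, X = X, w = w, cens = cens,
##              method = "BFGS", hessian = TRUE)
## fit$par                                      # beta estimates, then log(k)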