What to conclude from parameters of the simple linear regression model about data - r

I had a dataset for which I needed to provide a linear regression model that represents diameter as a function of length. The data, with length in the first column and diameter in the second, looked like:
0.455,0.365
0.44,0.365
I carried out the required operations on the dataset in R and plotted the regression line for the data.
I am just confused about what to conclude from the parameters (slope: 0.8154, y-intercept: -0.019413, correlation coefficient: 0.98). Can I conclude anything other than that the line is a good fit? I am new to statistics. Any help would be appreciated.

A slope of 0.8154 tells you that each one-unit increase in length is associated with an average increase of 0.8154 units in diameter. The intercept of -0.019413 is probably not statistically significant in this case. To verify that, look at its t-statistic (and the corresponding p-value), for example.
On this page you can find a nice course with visualizations about simple linear regression and other statistical methods that answers your questions.

From the slope and intercept alone, you cannot conclude whether the line is a good fit. The correlation coefficient says that the two variables are strongly linearly related and that a straight line could fit your data. The p-values for the slope and intercept tell you whether each coefficient differs significantly from zero. If they are small (say below 0.05), that supports the linear relationship, although you should still look at the residuals before calling the fit good.
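To make the points above concrete, here is a minimal sketch of where these numbers come from in R. The data are simulated stand-ins (the names `len` and `diameter` and the values used to generate them are assumptions, not the asker's file), so the output will not match the figures in the question.

```r
# Hypothetical data standing in for the length/diameter file
set.seed(1)
len      <- runif(200, 0.1, 0.8)
diameter <- -0.02 + 0.8 * len + rnorm(200, sd = 0.02)

fit <- lm(diameter ~ len)

# Estimate, standard error, t value and p-value for intercept and slope
coefs <- summary(fit)$coefficients
print(coefs)

# 95% confidence intervals for both coefficients
print(confint(fit))

# In simple linear regression the squared correlation equals R-squared
r <- cor(len, diameter)
print(c(r = r, r_squared = summary(fit)$r.squared))

# Residuals vs fitted values: look for curvature or a funnel shape
plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
```

The t value and p-value in the `(Intercept)` row are what you would check to decide whether the intercept is distinguishable from zero.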

Related

How to resolve heteroskedasticity in Multiple Linear Regression in R

I'm modelling multiple linear regression. I used the bptest function to test for heteroscedasticity. The result was significant at less than 0.05.
How can I resolve the issue of heteroscedasticity?
Try using a different type of linear regression:
Ordinary Least Squares (OLS) for homoscedasticity.
Weighted Least Squares (WLS) for heteroscedasticity without correlated errors.
Generalized Least Squares (GLS) for heteroscedasticity with correlated errors.
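As a rough illustration of the WLS option, `lm()` accepts a `weights` argument. The sketch below uses simulated data in which the error standard deviation is assumed to grow in proportion to `x`, so the weights are taken as `1 / x^2`. With real data the right weights are something you have to justify or estimate.

```r
# Simulated data: error sd grows with x (an assumption for this sketch)
set.seed(2)
x <- runif(300, 1, 10)
y <- 2 + 3 * x + rnorm(300, sd = 0.5 * x)

ols <- lm(y ~ x)                      # ignores the changing variance
wls <- lm(y ~ x, weights = 1 / x^2)   # weights proportional to 1 / variance

print(summary(ols)$coefficients)
print(summary(wls)$coefficients)

# Weighted residuals should show a roughly constant spread
plot(fitted(wls), resid(wls) * sqrt(weights(wls)),
     xlab = "Fitted", ylab = "Weighted residual")
abline(h = 0, lty = 2)
```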
Welcome to SO, Arun.
Personally, I don't think heteroskedasticity is something you "solve". Rather, it's something you need to allow for in your model.
You haven't given us any of your data, so let's assume that the variance of your residuals increases with the magnitude of your predictor. Typically a simplistic approach to handling it is to transform the data so that the variance is constant. One way of doing this might be to log-transform your data. That might give you a more constant variance. But it also transforms your model. Your errors are no longer IID.
Alternatively, you might have two groups of observations that you want to compare with a t-test, but the variance in one group is larger than in the other. That's a different sort of heteroskedasticity. There are variants of the standard "pooled variance" t-test that might handle that.
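For the two-group case, one such variant is Welch's t-test, which is what `t.test()` does by default (`var.equal = FALSE`). A small sketch with made-up groups `a` and `b`:

```r
# Two hypothetical groups with clearly different variances
set.seed(3)
a <- rnorm(30, mean = 10, sd = 1)
b <- rnorm(30, mean = 11, sd = 4)

welch  <- t.test(a, b)                    # Welch: does not assume equal variances
pooled <- t.test(a, b, var.equal = TRUE)  # classical pooled-variance test

print(welch)
print(pooled)
```

The Welch version reports a smaller, non-integer number of degrees of freedom, which is how it allows for the unequal variances.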
I realise this isn't an answer to your question in the conventional sense. I would have made it a comment, but I knew before I started that I'd need more words than a comment would let me have.

How to fit a curve to data with sd in R?

I'm completely new to R, so apologies for asking something I'm sure must be basic. I just wonder if I can use the nls() command in R to fit a non-linear curve to a data structure where I have means and sd's, but not the actual replicates. I understand how to fit a curve to single data points or to replicates, but I can't see how to proceed when I have a mean+sd for each data point and I want R to consider variation in my data when fitting.
One possible way to go would be to simulate data using your means and standard deviations and do the regression with the simulated data. Doing this a number of times could give you a good impression on the margin of plausible values for your regression coefficients.
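A minimal sketch of that simulation idea, under several assumptions: the summary values in `mu` and `sdv`, the number of replicates `n_rep`, the normal distribution used for simulating, and the exponential-decay model are all invented for illustration.

```r
set.seed(4)

# Hypothetical summary data: mean and sd of y at each x
x   <- c(1, 2, 4, 8, 16)
mu  <- c(8.2, 6.6, 4.5, 2.1, 0.4)
sdv <- c(0.5, 0.5, 0.4, 0.3, 0.1)

n_rep <- 5     # assumed number of replicates behind each mean
n_sim <- 200   # number of simulated datasets

one_fit <- function() {
  # Draw n_rep normal replicates around each mean with the reported sd
  sim <- data.frame(
    x = rep(x, each = n_rep),
    y = rnorm(length(x) * n_rep,
              mean = rep(mu, each = n_rep),
              sd   = rep(sdv, each = n_rep))
  )
  # Return NA for a run where nls() fails to converge
  tryCatch(
    coef(nls(y ~ a * exp(-k * x), data = sim, start = list(a = 10, k = 0.2))),
    error = function(e) c(a = NA_real_, k = NA_real_)
  )
}

est <- t(replicate(n_sim, one_fit()))

# Spread of plausible coefficient values across the simulations
print(apply(est, 2, quantile, probs = c(0.025, 0.5, 0.975), na.rm = TRUE))
```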

Can we get probabilities from random forest the same way that we get them from logistic regression?

I have a data structure with binary 0-1 variable (click & Purchase; click & not-purchase) against a vector of the attributes. I used logistic regression to get the probabilities of the purchase. How can I use Random Forest to get the same probabilities? Is it by using Random Forest regression? or is it Random Forest classification with type='prob' in R which gives the probability of categorical variable?
It won't give you the same result, since the two methods are structured differently. Logistic regression is given by an explicit linear specification, whereas RF is a collective vote from multiple independent/random trees. If the specification and input features are properly tuned for both, they can produce comparable results. Here is the major difference between the two:
RF will give a more robust fit against noise, outliers, overfitting, multicollinearity, etc., which are common pitfalls in regression-type solutions. Basically, if you don't know or don't want to know much about what's going on with the input data, RF is a good start.
Logistic regression will be good if you know the data well and know how to properly specify the equation, or if you want to engineer how the fit/prediction works. The explicit form of the GLM specification will allow you to do that.
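On the mechanics asked about in the question: a classification forest with `type = "prob"` in `predict()` returns class probabilities (vote fractions). The sketch below uses simulated data with invented predictors `x1` and `x2`, and only runs the forest part if the `randomForest` package is installed.

```r
set.seed(5)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Hypothetical purchase indicator; a factor so that RF does classification
d$purchase <- factor(rbinom(n, 1, plogis(-0.5 + 1.2 * d$x1 - 0.8 * d$x2)))

# Logistic regression probabilities of purchase
logit   <- glm(purchase ~ x1 + x2, data = d, family = binomial)
p_logit <- predict(logit, type = "response")

if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(purchase ~ x1 + x2, data = d)
  # Without newdata these are out-of-bag vote fractions; column "1" is purchase
  p_rf <- predict(rf, type = "prob")[, "1"]
  print(cor(p_logit, p_rf))
}
```

The two sets of probabilities will usually be correlated but not identical, for the reasons given above.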

Comparing nonlinear regression models

I want to compare the curve fits of three models by r-squared values. I ran models using the nls and drc packages. It appears, though, that neither of those packages calculates r-squared values; they give "residual std error" and "residual sum of squares" though.
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
RSQ is not really meaningful for non-linear regression. This is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be Poisson distributed and a generalized linear model (glm) in the Poisson family is indicated. If your data is binary (i.e. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. y, and by looking at a Normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y, then the model is not a good one. If the Normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and probably a different model is indicated.
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio of SSR/SSE corrected for the dof in the regression (R) and the residuals (E). A model with more parameters will generally have smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.
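A rough sketch of how some of these checks look for `nls()` fits. The data are simulated and the two candidate models are arbitrary choices for illustration. `deviance()` returns the residual sum of squares for an `nls` object and `summary()$sigma` the residual standard error.

```r
set.seed(6)
x <- seq(1, 20, length.out = 60)
y <- 10 * x / (3 + x) + rnorm(60, sd = 0.3)   # hypothetical saturating data
d <- data.frame(x, y)

# Two candidate models with the same number of parameters
m1 <- nls(y ~ a * x / (b + x),       data = d, start = list(a = 8, b = 2))
m2 <- nls(y ~ a * (1 - exp(-k * x)), data = d, start = list(a = 8, k = 0.3))

# Residual sum of squares and residual standard error for each model
fit_stats <- sapply(list(m1 = m1, m2 = m2),
                    function(m) c(RSS = deviance(m), RSE = summary(m)$sigma))
print(fit_stats)

# Coefficient tables, including p-values for each parameter
print(summary(m1)$coefficients)
print(summary(m2)$coefficients)

# Residual diagnostics for one model: scatter vs fitted, and Normal Q-Q
op <- par(mfrow = c(1, 2))
plot(fitted(m1), resid(m1), xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
qqnorm(resid(m1)); qqline(resid(m1))
par(op)
```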

Goodness of fit functions in R

What functions do you use in R to fit a curve to your data and test how well that curve fits? What results are considered good?
Just the first part of that question can fill entire books. Just some quick choices:
lm() for standard linear models
glm() for generalised linear models (e.g. for logistic regression)
rlm() from package MASS for robust linear models
lmrob() from package robustbase for robust linear models
loess() for non-linear / non-parametric models
Then there are domain-specific models, e.g. for time series, micro-econometrics, mixed effects and much more. Several of the Task Views, e.g. Econometrics, discuss this in more detail. As for goodness of fit, that is also something one could easily spend an entire book discussing.
The workhorses of canonical curve fitting in R are lm(), glm() and nls(). To me, goodness-of-fit is a subproblem in the larger problem of model selection. In fact, using goodness-of-fit incorrectly (e.g., via stepwise regression) can give rise to a seriously misspecified model (see Harrell's book on "Regression Modeling Strategies"). Rather than discussing the issue from scratch, I recommend Harrell's book for lm and glm. Venables and Ripley's bible is terse, but still worth reading. "Extending the Linear Model with R" by Faraway is comprehensive and readable. nls is not covered in these sources, but "Nonlinear Regression with R" by Ritz & Streibig fills the gap and is very hands-on.
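For orientation, here is a short sketch calling three of the base functions from the list on R's built-in `cars` and `mtcars` datasets. The choice of datasets and formulas is only for illustration.

```r
# Standard linear model: stopping distance vs speed
fit_lm <- lm(dist ~ speed, data = cars)

# Logistic regression: transmission type (0/1) vs car weight
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# Local regression (non-parametric smoother)
fit_loess <- loess(dist ~ speed, data = cars)

print(summary(fit_lm)$r.squared)   # R-squared of the linear fit
print(deviance(fit_glm))           # residual deviance of the glm
print(fit_loess$s)                 # residual standard error of the loess fit

# Overlay the linear and loess fits on the data
plot(cars)
abline(fit_lm, lty = 2)
lines(cars$speed, fitted(fit_loess))
```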
The nls() function (http://sekhon.berkeley.edu/stats/html/nls.html) is pretty standard for nonlinear least-squares curve fitting. Chi squared (the sum of the squared residuals) is the metric that is optimized in that case, but it is not normalized so you can't readily use it to determine how good the fit is. The main thing you should ensure is that your residuals are normally distributed. Unfortunately I'm not sure of an automated way to do that.
The Quick R site has a reasonably good summary of basic functions used for fitting models and testing the fits, along with sample R code:
http://www.statmethods.net/stats/regression.html
"The main thing you should ensure is that your residuals are normally distributed. Unfortunately I'm not sure of an automated way to do that."
qqnorm() could probably be modified to find the correlation between the sample quantiles and the theoretical quantiles. Essentially, this would just be a numerical interpretation of the normal quantile plot. Perhaps providing several values of the correlation coefficient for different ranges of quantiles could be useful. For example, if the correlation coefficient is close to 1 for the middle 97% of the data and much lower at the tails, this tells us the distribution of residuals is approximately normal, with some funniness going on in the tails.
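In fact `qqnorm()` already returns the two sets of quantiles when called with `plot.it = FALSE`, so no modification is needed. A minimal sketch of the idea, using heavy-tailed simulated values as stand-in residuals (the t distribution and the 97% cut-off are just choices for this example):

```r
set.seed(7)
res <- rt(500, df = 3)   # hypothetical residuals with heavy tails

qq <- qqnorm(res, plot.it = FALSE)   # qq$x theoretical, qq$y sample quantiles

# Correlation over the whole range
r_all <- cor(qq$x, qq$y)

# Correlation over the central 97% of the theoretical quantiles
mid   <- abs(qq$x) <= qnorm(0.985)
r_mid <- cor(qq$x[mid], qq$y[mid])

print(c(all = r_all, middle_97pct = r_mid))
```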
Best to keep it simple, and see if linear methods work well enough. You can generally judge your goodness of fit by looking at the R squared and the F statistic together, never separately. Adding variables to your model that have no bearing on your dependent variable can increase R2, so you must also consider the F statistic.
You should also compare your model to other nested, or simpler, models. Do this using a log-likelihood ratio test, so long as the dependent variables are the same.
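A small sketch of such a comparison on the built-in `mtcars` data (the choice of variables is arbitrary). The likelihood ratio statistic is computed by hand from `logLik()`, and `anova()` gives the corresponding F test for nested linear models.

```r
small <- lm(mpg ~ wt,      data = mtcars)
big   <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared never decreases when a variable is added
print(c(small = summary(small)$r.squared, big = summary(big)$r.squared))

# Likelihood ratio test: twice the gain in log-likelihood,
# compared to a chi-squared with df = number of extra parameters
lr <- as.numeric(2 * (logLik(big) - logLik(small)))
df <- attr(logLik(big), "df") - attr(logLik(small), "df")
p  <- pchisq(lr, df = df, lower.tail = FALSE)
print(c(LR = lr, df = df, p.value = p))

# F test for the same pair of nested models
print(anova(small, big))
```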
Jarque–Bera test is good for testing the normality of the residual distribution.
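The Jarque–Bera statistic is simple enough to compute by hand. The sketch below is one way to write it in base R, where `jarque_bera` is a helper defined here and not a built-in function. As far as I know the `tseries` package provides a ready-made `jarque.bera.test()`, and base R has `shapiro.test()` as an alternative.

```r
# Hand-rolled Jarque-Bera test (helper defined for this example)
jarque_bera <- function(x) {
  n    <- length(x)
  z    <- x - mean(x)
  s2   <- mean(z^2)             # variance with divisor n
  skew <- mean(z^3) / s2^1.5    # sample skewness
  kurt <- mean(z^4) / s2^2      # sample kurtosis (3 for a normal)
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(statistic = stat, p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}

fit <- lm(dist ~ speed, data = cars)
print(jarque_bera(resid(fit)))
print(shapiro.test(resid(fit)))   # base R alternative
```

A small p-value is evidence against normality of the residuals.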

Resources