Weighted linear regression in R [duplicate] - r

This question already has an answer here:
R: lm() result differs when using `weights` argument and when using manually reweighted data
(1 answer)
Closed 6 years ago.
I would like to do a linear regression with a weighting factor for an analytical chemistry calibration curve. The x values are concentrations and are assumed to have no error. The y values are instrument responses, and their variation is assumed to be proportional to concentration. So I would like to use a 1/x weighting factor for the linear regression. The data set is simply ten concentrations with a single measurement at each. Is there an easy way to do this in R?

The answer can be found in a somewhat older question on Cross Validated. The lm() function (the usual way to fit a linear regression) has an option to specify weights. As shown in the linked answer, you can use a formula in the weights argument. In your case, the weights will likely take the form 1/data$concentration.
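For example, a minimal sketch of that calibration fit, assuming the data live in a data frame called calib with columns concentration and response (those names are placeholders, not from the question):

# Hypothetical data frame 'calib' with columns 'concentration' (x) and 'response' (y)
fit <- lm(response ~ concentration, data = calib,
          weights = 1 / concentration)   # 1/x weighting
summary(fit)                             # weighted slope, intercept, and standard errors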
As suggested by hrbrmstr, I'm adding mpiktas's actual answer from Cross Validated:
I think the R help page for lm answers your question pretty well. The only
requirement for weights is that the vector supplied must be the same
length as the data. You can even supply only the name of a variable
in the data set; R will take care of the rest, NA management, etc. You
can also use formulas in the weights argument. Here is the example:
x <- c(rnorm(10), NA)
df <- data.frame(y = 1 + 2*x + rnorm(11)/2, x = x, wght1 = 1:11)
## Fancy weights as a numeric vector
summary(lm(y ~ x, data = df, weights = (df$wght1)^(3/4)))
## Fancy weights as a formula on a column of the data set
summary(lm(y ~ x, data = df, weights = I(wght1^(3/4))))
## Mundane weights as a column of the data set
summary(lm(y ~ x, data = df, weights = wght1))
Note that weights must not be negative, otherwise R will produce an error.

Related

How can I do ARIMA time series prediction for three constrained variables (x+y+z=1)?

How can I predict a time series for three variables that sum to 1? Say x + y + z = 1. I have historical data for x, y, z over time t. Based on the historical data, I can create an ARIMA model for each variable individually and make predictions for the future. How do I add the constraint?
If this were only one variable, applying an ARIMA model would be simple.
For the single variable x(t), I can get a fitted ARIMA_x(p, d, q), and those three numbers parametrize the model.
Here I could fit three models independently, but that does not respect the constraint.
With three variables that always sum to 1, how do I get three sets of constrained fit parameters?
https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMA.html
I got an answer on Math Stack Exchange; posting the link here.
https://math.stackexchange.com/questions/3322653/time-series-prediction-for-three-constrained-variables-xyz-1/3322665?noredirect=1#comment6836865_3322665
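One common way to honour the sum-to-one constraint (a compositional-data approach, not necessarily what the linked answer does) is to model log-ratios of the components and back-transform the forecasts. A minimal R sketch, assuming x, y, and z are numeric vectors of the historical shares and the ARIMA orders below are only placeholders:

# Additive log-ratio transform: assumes x + y + z == 1 at every time point
lr1 <- log(x / z)
lr2 <- log(y / z)

# Fit an ARIMA model to each log-ratio (orders chosen here only for illustration)
fit1 <- arima(lr1, order = c(1, 0, 0))
fit2 <- arima(lr2, order = c(1, 0, 0))

h  <- 12                                  # forecast horizon
p1 <- predict(fit1, n.ahead = h)$pred
p2 <- predict(fit2, n.ahead = h)$pred

# Back-transform: the forecasts are positive and sum to 1 by construction
denom <- 1 + exp(p1) + exp(p2)
x_hat <- exp(p1) / denom
y_hat <- exp(p2) / denom
z_hat <- 1 / denom

This treats the two log-ratios independently; a multivariate model on the log-ratios would additionally capture their cross-correlation.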

R coxph() warning: Loglik converged before variable - valid results for other variables?

I fit Cox regressions and I am interested in the effect of a predictor x, which is the last variable in the model (variable 7). I include some variables like sex and age in the model because I want to adjust for them.
Using the R function coxph() gives me the warning "Loglik converged before variable 3". In fact, I am not interested in variable 3, because it is just one of the variables I adjust for. But the word "before" makes me wonder whether this means that the results for all variables following variable 3 (which includes my predictor x) are not valid, or whether only the results for variable 3 are affected.
This is the output:
More information: Actually, I am running multiple Cox regressions, and the described warning occurs only in some of the models for variable 3. I do want to adjust for variable 3 and thus keep it in the model.
There is some discussion of this warning (1), but I have not found an answer to my question so far. Thank you.
(1) For example, here or here
Thanks for doing a search on prior answers. (And thanks for citing one of my answers :-) The warning message only applies to the control3 variable. The key to investigating the validity of the statistical inference about predictorx lies in the part of Therneau's answer that you cited. You are interested in one of the other variables, and fortunately only one of your variables "exploded". That means you can do a likelihood ratio test (LRT) comparing the models with and without the variable(s) of interest to get a proper statistical result. The results are essentially saying that there were no events in the subset of cases with a positive value of control3. I'm guessing that control3 is a 0/1 variable, and if you looked at the table:
with(your_data, table(control3, your_status_variable))
.... you would find zero events in cases with control3 == 1.
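A minimal sketch of that likelihood ratio test, assuming a data frame dat with a survival time, a status indicator, the predictor of interest, and the adjustment variables (all names below are placeholders, not from the question):

library(survival)

# Full model including the predictor of interest
fit_full <- coxph(Surv(time, status) ~ predictorx + age + sex + control3, data = dat)
# Reduced model without the predictor of interest
fit_red  <- coxph(Surv(time, status) ~ age + sex + control3, data = dat)

# Likelihood ratio test for the contribution of predictorx
anova(fit_red, fit_full)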

How to use the 'weights' in the nls (non-linear least squares) function in R?

My question is about how to correctly interpret (and use) the 'weights' argument of the nls function in R for non-linear weighted least squares regression.
In weighted least squares theory, the solution for the unknown parameters (in the linear case) is β̂ = (XᵀPX)⁻¹ XᵀPy, where X is the design matrix and P is the weight matrix of size N x N, N being the number of data observations.
However, when I look at the nls documentation in R (found here), it says that the 'weights' argument should be a vector.
This has me puzzled, since based on my understanding the weights should form a square matrix. Insights from those with a better understanding would be appreciated.
The weight variable in regression is a measure of how important an observation is to your model, for whatever reason (e.g. reliability of the measurement, or the inverse of a variance estimate). Some observations may therefore be more important, i.e. weigh more heavily, than others.
A weight vector, written in matrix notation, becomes a diagonal matrix whose i-th diagonal entry (i in {1, 2, ..., n}) is the weight of the i-th observation; the two represent the same thing. For nls in R you supply the weights in vector form, as sketched below.
It should also be noted that weighted least squares is a special case of generalized least squares in which the weights are used to counter heteroskedasticity. If the residuals are correlated across observations, a more general model might be suitable.
PS: Cross Validated would be the right place to get a more detailed answer. Also, it is more memory efficient to store a vector rather than a matrix as the number of observations grows.
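A minimal sketch of passing a weight vector to nls; the model, starting values, and variable names below are illustrative, not taken from the question:

# Toy data: exponential decay with noise whose variance grows with x
set.seed(1)
x <- 1:50
y <- 5 * exp(-0.1 * x) + rnorm(50, sd = 0.02 * x)
w <- 1 / x^2                      # e.g. inverse-variance weights

# nls takes the weights as a vector, one entry per observation
fit <- nls(y ~ a * exp(-b * x),
           start = list(a = 4, b = 0.05),
           weights = w)
summary(fit)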

R: glm (multiple linear regression) ignores/removes some predictor variables

I have posted this question before, but I believe that I had not explained the problem well and that it was over-complicated, so I deleted my previous post and I am posting this one instead. I am sorry if this caused any inconvenience.
I also apologize in advance for not being able to provide example data: I am working with very large tables, and what I am trying to do works fine on simpler examples, so example data would not help reproduce the problem. It has always worked for me until now. So I am just asking for your ideas on what might be the issue. If there is any way I can provide more information, do let me know.
So, I have a vector corresponding to a response variable and a table of predictor variables. The response vector is numeric, the predictor variables (columns of the table) are in the binary format (0s and 1s).
I am running the glm function (multiple linear regression) using the response vector and the table of predictors:
fit <- glm(response ~ as.matrix(predictors), na.action = na.exclude)
coeff <- as.vector(coef(summary(fit))[, 4])[-1]   # fourth column of the coefficient table, dropping the intercept row
When I have done this in the past, I would extract this vector from the fitted model to use for further analysis.
The problem is that the regression now returns a vector that is missing some values: some predictor variables are not assigned a coefficient at all by glm, yet there are no error messages.
The summary of the model looks normal, except that some predictor variables are missing as mentioned. Most other predictors have the usual output (coefficient, p-value, etc.).
About 30 of the roughly 200 predictors are missing from the model.
I have tried different response vectors, but I get the same issue, although the missing predictors vary depending on the response vector...
Any ideas on what might be going on? I know this can happen when a variable has zero variance, but I have checked for that. There are also no NA or missing values in the tables.
What could cause glm to ignore/remove some predictor variables?
Any suggestion is welcome!
EDIT: I found out that the predictors that were removed have values identical to another predictor. There should still be a way to keep them; they would simply get the same regression coefficient, for example.
Your edit explains why you are not getting those variables; that was going to be my first question. (This question would be better posed on Cross Validated, because it is not an R error, it is a problem with your model.)
They would not get the same coefficients. Say you have a 1:1 relationship, Y = X + e, and you then fit the model Y ~ X + X. Each X can be assigned any coefficient such that the two sum to 1; there is no unique solution. Y = 0.5X + 0.5X may be the most obvious to us, but Y = 100X - 99X fits just as well.
For the same reason, you also cannot have any predictor that is a linear combination of other predictors.
If you really want those values you can generate them from what you have, but I do not recommend it, because the assumptions would be on very thin ice.
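A small illustration of the behaviour (a made-up example, not the poster's data): when two predictors are identical, glm reports NA for the duplicate rather than raising an error.

set.seed(42)
x1 <- rnorm(100)
x2 <- x1                        # perfectly collinear duplicate
y  <- 1 + 2 * x1 + rnorm(100)

fit <- glm(y ~ x1 + x2)
coef(fit)                       # the coefficient for x2 is reported as NA
# summary(fit) notes that one coefficient is not defined because of singularities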

Inputting a whole data frame as independent variables in a logistic regression [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
short formula call for many variables when building a model
I have a biggish data frame (112 variables) that I'd like to do a stepwise logistic regression on using R. I know how to set up the glm model and the stepAIC model, but I'd rather not type in all the column headings to specify the independent variables. Is there a fast way to give the glm model an entire data frame as independent variables, so that it recognizes each column as an x variable to be included in the model? I tried:
ft<-glm(MFDUdep~MFDUind, family=binomial)
But it didn't work (wrong data types). MFDUdep and MFDUind are both data frames, with MFDUind containing 111 'x' variables and MFDUdep containing a single 'y'.
You want the . special symbol in formula notation. Also, it is probably better to have the response and predictors in a single data frame.
Try:
## combine the response and the predictors into one data frame
MFDU <- cbind(MFDUdep, MFDUind)
## "." means "all other columns"; this assumes the response column is named y
ft <- glm(y ~ ., data = MFDU, family = binomial)
Now that I have given you the rope, I am obliged to at least warn you about the potential for hanging...
The approach you are taking is usually not the recommended one, unless perhaps prediction is the purpose of the model. Regression coefficients for selected variables may be strongly biased, so if you are using this for enlightenment (inference), then rethink your approach.
You will also need a lot of observations to support 100+ terms in a model.
Better alternatives exist; see, e.g., the glmnet package for one approach that allows ridge, lasso, or both (elastic net) constraints on the set of coefficients, which lets you minimise model error at the expense of a small amount of additional bias.
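A minimal sketch of that glmnet alternative, reusing the MFDU data frame built above and assuming the response column is named y; cross-validated lambda selection is just one reasonable default:

library(glmnet)

# glmnet wants a numeric matrix of predictors and a response vector
X <- as.matrix(MFDU[, setdiff(names(MFDU), "y")])
y <- MFDU$y

# Cross-validated lasso-penalised logistic regression
# (alpha = 1 is lasso; 0 < alpha < 1 gives an elastic net, alpha = 0 ridge)
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)

coef(cvfit, s = "lambda.min")                                   # coefficients at the CV-optimal lambda
head(predict(cvfit, newx = X, s = "lambda.min", type = "response"))  # fitted probabilities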
