Fixing a coefficient on variable in MNL [duplicate] - r

This question already has an answer here:
Set one or more of coefficients to a specific integer
(1 answer)
Closed 6 years ago.
In R, how can I set weights for particular variables and not observations in lm() function?
Context is as follows. I'm trying to build personal ranking system for particular products, say, for phones. I can build linear model based on price as dependent variable and other features such as screen size, memory, OS and so on as independent variables. I can then use it to predict phone real cost (as opposed to declared price), thus finding best price/goodness coefficient. This is what I have already done.
Now I want to "highlight" some features that are important for me only. For example, I may need a phone with large memory, thus I want to give it higher weight so that linear model is optimized for memory variable.
lm() function in R has weights parameter, but these are weights for observations and not variables (correct me if this is wrong). I also tried to play around with formula, but got only interpreter errors. Is there a way to incorporate weights for variables in lm()?
Of course, lm() function is not the only option. If you know how to do it with other similar solutions (e.g. glm()), this is pretty fine too.
UPD. After few comments I understood that the way I was thinking about the problem is wrong. Linear model, obtained by call to lm(), gives optimal coefficients for training examples, and there's no way (and no need) to change weights of variables, sorry for confusion I made. What I'm actually looking for is the way to change coefficients in existing linear model to manually make some parameters more important than others. Continuing previous example, let's say we've got following formula for price:
price = 300 + 30 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
This formula describes best possible linear model for dependence between price and phone parameters. However, now I want to manually change number 30 in front of memory variable to, say, 60, so it becomes:
price = 300 + 60 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
Of course, this formula doesn't reflect optimal relationship between price and phone parameters any more. Also dependent variable doesn't show actual price, just some value of goodness, taking into account that memory is twice more important for me than for average person (based on coefficients from first formula). But this value of goodness (or, more precisely, value of fraction goodness/price) is just what I need - having this I can find best (in my opinion) phone with best price.
Hope all of this makes sense. Now I have one (probably very simple) question. How can I manually set coefficients in existing linear model, obtained with lm()? That is, I'm looking for something like:
coef(model)[2] <- 60
This code doesn't work of course, but you should get the idea. Note: it is obviously possible to just double values in memory column in data frame, but I'm looking for more elegant solution, affecting model, not data.

The following code is a bit complicated because lm() minimizes residual sum of squares and with a fixed, non optimal coefficient it is no longed minimal, so that would be against what lm() is trying to do and the only way is to fix all the rest coefficients too.
To do that, we have to know coefficients of the unrestricted model first. All the adjustments have to be done by changing formula of your model, e.g. we have
price ~ memory + screen_size, and of course there is a hidden intercept. Now neither changing the data directly nor using I(c*memory) is good idea. I(c*memory) is like temporary change of data too, but to change only one coefficient by transforming the variables would be much more difficult.
So first we change price ~ memory + screen_size to price ~ offset(c1*memory) + offset(c2*screen_size). But we haven't modified the intercept, which now would try to minimize residual sum of squares and possibly become different than in original model. The final step is to remove the intercept and to add a new, fake variable, i.e. which has the same number of observations as other variables:
price ~ offset(c1*memory) + offset(c2*screen_size) + rep(c0, length(memory)) - 1
# Function to fix coefficients
setCoeffs <- function(frml, weights, len){
el <- paste0("offset(", weights[-1], "*",
unlist(strsplit(as.character(frml)[-(1:2)], " +\\+ +")), ")")
el <- c(paste0("offset(rep(", weights[1], ",", len, "))"), el)
as.formula(paste(as.character(frml)[2], "~",
paste(el, collapse = " + "), " + -1"))
}
# Example data
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10, sd = 5),
y = rnorm(10, mean = 3, sd = 10))
# Writing formula explicitly
frml <- y ~ x1 + x2
# Basic model
mod <- lm(frml, data = df)
# Prime coefficients and any modifications. Note that "weights" contains
# intercept value too
weights <- mod$coef
# Setting coefficient of x1. All the rest remain the same
weights[2] <- 3
# Final model
mod2 <- update(mod, setCoeffs(frml, weights, nrow(df)))
# It is fine that mod2 returns "No coefficients"
Also, probably you are going to use mod2 only for forecasting (actually I don't know where else it could be used now) so that could be made in a simpler way, without setCoeffs:
# Data for forecasting with e.g. price unknown
df2 <- data.frame(x1 = rpois(10, 10), x2 = rpois(5, 5), y = NA)
mat <- model.matrix(frml, model.frame(frml, df2, na.action = NULL))
# Forecasts
rowSums(t(t(mat) * weights))

It looks like you are doing optimization, not model fitting (though there can be optimization within model fitting). You probably want something like the optim function or look into linear or quadratic programming (linprog and quadprog packages).
If you insist on using modeling tools like lm then use the offset argument in the formula to specify your own multiplyer rather than computing one.

Related

Min timepoints to model longitudinal data with natural quadratic splines?

I'm new to applying splines to longitudinal data, so here comes my question:
I've some longitudinal data on growing mice in 3 timepoints: at x, y and z months. It's known from the existent literature that the trajectories of growth in this type of data are usually better modeled in non-linear terms.
However, since I have only 3 timepoints, I wonder if this allows me to apply natural quadratic spline to age variable in my lmer model?
edit:I mean is
lmer<-mincLmer(File ~ ns(Age,2) * Genotype + Sex + (1|Subj_ID),data, mask=mask)
a legit way to go around?
I'm sorry if this is a stupid question - I'm just a lonely PhD student without supervision, and I would be super-grateful for any advice!!!
Marina
With the nls() function you can fit your data to whatever non-linear function you want. Then, from the biological point of view, probably your data is described by a Gompertz-like function (sigmoidal), but as you have only three time points, probably you can simplify these kind of functions into an exponential one. Try the following:
fit_formula <- independent_variable ~ a * exp(b * dependent_variable)
result <- nls(formula = fit_formula, data = your_Dataset)
It will probably give you an error the first times, something like singular matrix gradient at initial estimates ; if this happens, try adding the additional parameter start, where you provide different starting values for a and b more close to the true values. Remember that in your dataset, the column names must be equal to the names of the variables in the formula.

R: Sliding one-ahead forecasts from equation estimated on a fixed period

The toy model below stands in for one with a bunch more variables, transforms, lags, etc. Assume I got that stuff right.
My data is ordered in time, but is now formatted as an R time series, because I need to exclude certain periods, etc. I'd rather not make it a time series for this reason, because I think it would be easy to muck up, but if I need to, or it greatly simplifies the estimating process, I'd like to just use an integer sequence, such as index. below, to represent time if that is allowed.
My problem is a simple one (I hope). I would like to use the first part of my data to estimate the coefficients of the model. Then I want to use those estimates, and not estimates from a sliding window, to do one-ahead forecasts for each of the remaining values of that data. The idea is that the formula is applied with a sliding window even though it is not estimated with one. Obviously I could retype the model with coefficients included and then get what I want in multiple ways, with base R sapply, with tidyverse dplyr::mutate or purrr::map_dbl, etc. But I am morally certain there is some standard way of pulling the formula out of the lm object and then wielding it as one desires, that I just haven't been able to find. Example:
set.seed(1)
x1 <- 1:20
y1 <- 2 + x1 + lag(x1) + rnorm(20)
index. <- x1
data. <- tibble(index., x1, y1)
mod_eq <- y1 ~ x1 + lag(x1)
lm_obj <- lm(mod_eq, data.[1:15,])
and I want something along the lines of:
my_forecast_values <- apply_eq_to_data(eq = get_estimated_equation(lm_obj), my_data = data.[16:20])
and the lag shouldn't give me an error.
Also, this is not part of my question per se, but I could use a pointer to a nice tutorial on using R formulas and the standard estimation output objects produced by lm, glm, nls and the like. Not the statistics, just the programming.
The common way to use the coefficients is by calling the predict(), coefficients(), or summary() function on the model object for what it is worth. You might try the ?predict.lm() documentation for details on formula.
A simple example:
data.$lagx <- dplyr::lag(data.$x1, 1) #create lag variable
lm_obj1 <- lm(data=data.[2:15,], y1 ~ x1 + lagx) #create model object
data.$pred1 <- predict(lm_obj1, newdata=data.[16,20]) #predict new data; needs to have same column headings

How to run a multinomial logit regression with both individual and time fixed effects in R

Long story short:
I need to run a multinomial logit regression with both individual and time fixed effects in R.
I thought I could use the packages mlogit and survival to this purpose, but I am cannot find a way to include fixed effects.
Now the long story:
I have found many questions on this topic on various stack-related websites, none of them were able to provide an answer. Also, I have noticed a lot of confusion regarding what a multinomial logit regression with fixed effects is (people use different names) and about the R packages implementing this function.
So I think it would be beneficial to provide some background before getting to the point.
Consider the following.
In a multiple choice question, each respondent take one choice.
Respondents are asked the same question every year. There is no apriori on the extent to which choice at time t is affected by the choice at t-1.
Now imagine to have a panel data recording these choices. The data, would look like this:
set.seed(123)
# number of observations
n <- 100
# number of possible choice
possible_choice <- letters[1:4]
# number of years
years <- 3
# individual characteristics
x1 <- runif(n * 3, 5.0, 70.5)
x2 <- sample(1:n^2, n * 3, replace = F)
# actual choice at time 1
actual_choice_year_1 <- possible_choice[sample(1:4, n, replace = T, prob = rep(1/4, 4))]
actual_choice_year_2 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.4, 0.3, 0.2, 0.1))]
actual_choice_year_3 <- possible_choice[sample(1:4, n, replace = T, prob = c(0.2, 0.5, 0.2, 0.1))]
# create long dataset
df <- data.frame(choice = c(actual_choice_year_1, actual_choice_year_2, actual_choice_year_3),
x1 = x1, x2 = x2,
individual_fixed_effect = as.character(rep(1:n, years)),
time_fixed_effect = as.character(rep(1:years, each = n)),
stringsAsFactors = F)
I am new to this kind of analysis. But if I understand correctly, if I want to estimate the effects of respondents' characteristics on their choice, I may use a multinomial logit regression.
In order to take advantage of the longitudinal structure of the data, I want to include in my specification individual and time fixed effects.
To the best of my knowledge, the multinomial logit regression with fixed effects was first proposed by Chamberlain (1980, Review of Economic Studies 47: 225–238). Recently, Stata users have been provided with the routines to implement this model (femlogit).
In the vignette of the femlogit package, the author refers to the R function clogit, in the survival package.
According to the help page, clogit requires data to be rearranged in a different format:
library(mlogit)
# create wide dataset
data_mlogit <- mlogit.data(df, id.var = "individual_fixed_effect",
group.var = "time_fixed_effect",
choice = "choice",
shape = "wide")
Now, if I understand correctly how clogit works, fixed effects can be passed through the function strata (see for additional details this tutorial). However, I am afraid that it is not clear to me how to use this function, as no coefficient values are returned for the individual characteristic variables (i.e. I get only NAs).
library(survival)
fit <- clogit(formula("choice ~ alt + x1 + x2 + strata(individual_fixed_effect, time_fixed_effect)"), as.data.frame(data_mlogit))
summary(fit)
Since I was not able to find a reason for this (there must be something that I am missing on the way these functions are estimated), I have looked for a solution using other packages in R: e.g., glmnet, VGAM, nnet, globaltest, and mlogit.
Only the latter seems to be able to explicitly deal with panel structures using appropriate estimation strategy. For this reason, I have decided to give it a try. However, I was only able to run a multinomial logit regression without fixed effects.
# state formula
formula_mlogit <- formula("choice ~ 1| x1 + x2")
# run multinomial regression
fit <- mlogit(formula_mlogit, data_mlogit)
summary(fit)
If I understand correctly how mlogit works, here's what I have done.
By using the function mlogit.data, I have created a dataset compatible with the function mlogit. Here, I have also specified the id of each individual (id.var = individual_fixed_effect) and the group to which individuals belongs to (group.var = "time_fixed_effect"). In my case, the group represents the observations registered in the same year.
My formula specifies that there are no variables correlated with a specific choice, and which are randomly distributed among individuals (i.e., the variables before the |). By contrast, choices are only motivated by individual characteristics (i.e., x1 and x2).
In the help of the function mlogit, it is specified that one can use the argument panel to use panel techniques. To set panel = TRUE is what I am after here.
The problem is that panel can be set to TRUE only if another argument of mlogit, i.e. rpar, is not NULL.
The argument rpar is used to specify the distribution of the random variables: i.e. the variables before the |.
The problem is that, since these variables does not exist in my case, I can't use the argument rpar and then set panel = TRUE.
An interesting question related to this is here. A few suggestions were given, and one seems to go in my direction. Unfortunately, no examples that I can replicate are provided, and I do not understand how to follow this strategy to solve my problem.
Moreover, I am not particularly interested in using mlogit, any efficient way to perform this task would be fine for me (e.g., I am ok with survival or other packages).
Do you know any solution to this problem?
Two caveats for those interested in answering:
I am interested in fixed effects, not in random effects. However, if you believe there is no other way to take advantage of the longitudinal structure of my data in R (there is indeed in Stata but I don't want to use it), please feel free to share your code.
I am not interested in going Bayesian. So if possible, please do not suggest this approach.

Finding model predictor values that maximize the outcome

How do you find the set of values for model predictors (a mixture of linear and non-linear) that yield the highest value for the response.
Example Model:
library(lme4); library(splines)
summary(lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month), data = airquality, REML = F))
Here I am interested in what conditions (predictors) produce the highest solar radation (outcome).
This question seems simple, but I've failed to find a good answer using Google.
If the model was simple, I could take the derivatives to find the maximum or minimum. Someone has suggested that if the model function can be extracted, the stats::optim() function might be used. As a last resort, I could simulate all the reasonable variations of input values and plug it into the predict() function and look for the maximum value.
The last approach mentioned doesn't seem very efficient and I imagine that this is a common enough task (e.g., finding optimal customers for advertising) that someone has built some tools for handling it. Any help is appreciated.
There are some conceptual issues here.
for the simple terms (Wind and Temp), the response is a linear (and hence both monotonic and unbounded) function of the predictors. Thus if these terms have positive parameter estimates, increasing their values to infinity (Inf) will give you an infinite response (Solar.R); values should be as small as possible (negative infinite) if the coefficients are negative. Practically speaking, then, you want to set these predictors to the minimum or maximum reasonable value if the parameter estimates are respectively negative or positive.
for the bs term, I'm not sure what the properties of the B-spline are beyond the boundary knots, but I'm pretty sure that the curves go off to positive or negative infinity, so you've got the same issue. However, for the case of bs, it's also possible that there are one or more interior maxima. For this case I would probably try to extract the basis terms and evaluate the spline over the range of the data ...
Alternatively, your mentioning optim makes me think that this is a possibility:
data(airquality)
library(lme4)
library(splines)
m1 <- lmer(formula = Solar.R ~ 1 + bs(Ozone) + Wind + Temp + (1 | Month),
data = airquality, REML = FALSE)
predval <- function(x) {
newdata <- data.frame(Ozone=x[1],Wind=x[2],Temp=x[3])
## return population-averaged prediction (no Month effect)
return(predict(m1, newdata=newdata, re.form=~0))
}
aq <- na.omit(airquality)
sval <- with(aq,c(mean(Ozone),mean(Wind),mean(Temp)))
predval(sval)
opt1 <- optim(fn=predval,
par=sval,
lower=with(aq,c(min(Ozone),min(Wind),min(Temp))),
upper=with(aq,c(max(Ozone),max(Wind),max(Temp))),
method="L-BFGS-B", ## for constrained opt.
control=list(fnscale=-1)) ## for maximization
## opt1
## $par
## [1] 70.33851 20.70000 97.00000
##
## $value
## [1] 282.9784
As expected, this is intermediate in the range of Ozone(1-168), and min/max for Wind (2.3-20.7) and Temp (57-97).
This brute force solution could be made much more efficient by automatically selecting the min/max values for the simple terms and optimizing only over the complex (polynomial/spline/etc.) terms.

inverse of 'predict' function

Using predict() one can obtain the predicted value of the dependent variable (y) for a certain value of the independent variable (x) for a given model. Is there any function that predicts x for a given y?
For example:
kalythos <- data.frame(x = c(20,35,45,55,70),
n = rep(50,5), y = c(6,17,26,37,44))
kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
model <- glm(Ymat ~ x, family = binomial, data = kalythos)
If we want to know the predicted value of the model for x=50:
predict(model, data.frame(x=50), type = "response")
I want to know which x makes y=30, for example.
Saw the previous answer is deleted. In your case, given n=50 and the model is binomial, you would calculate x given y using:
f <- function (y,m) {
(logit(y/50) - coef(m)[["(Intercept)"]]) / coef(m)[["x"]]
}
> f(30,model)
[1] 48.59833
But when doing so, you better consult a statistician to show you how to calculate the inverse prediction interval. And please, take VitoshKa's considerations into account.
Came across this old thread but thought I would add some other info. Package MASS has function dose.p for logit/probit models. SE is via delta method.
> dose.p(model,p=.6)
Dose SE
p = 0.6: 48.59833 1.944772
Fitting the inverse model (x~y) would not makes sense here because, as #VitoshKa says, we assume x is fixed and y (the 0/1 response) is random. Besides, if the data weren’t grouped you’d have only 2 values of the explanatory variable: 0 and 1. But even though we assume x is fixed it still makes sense to calculate a confidence interval for the dose x for a given p, contrary to what #VitoshKa says. Just as we can reparameterize the model in terms of ED50, we can do so for ED60 or any other quantile. Parameters are fixed, but we still calculate CI's for them.
The chemcal package has an inverse.predict() function, which works for fits of the form y ~ x and y ~ x - 1
You just have to rearrange the regression equation, but as the comments above state this may prove tricky and not necessarily have a meaningful interpretation.
However, for the case you presented you can use:
(1/coef(model)[2])*(model$family$linkfun(30/50)-coef(model)[1])
Note I did the division by the x coefficient first to allow the name attribute to be correct.
For just a quick view (without intervals and considering additional issues) you could use the TkPredict function in the TeachingDemos package. It does not do this directly, but allows you to dynamically change the x value(s) and see what the predicted y-value is, so it would be fairly simple to move x until the desired Y is found (for given values of additional x's), this will also show possibly problems with multiple x's that would work for the same y.

Resources