We're trying to come up with a way for an R function to handle a model which has multiple responses, multiple explanatory variables, and possibly shared parameters between the responses. For example:
Y1 ~ X1 + X2 + X3
Y2 ~ X3 + X4
specifies two responses and four explanatory variables. X3 appears in both, and we want the user to control whether the associated parameter value is the same or different, i.e.:
Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b3 X3 + b4 X4
which is a model with four 'b' parameters, or
Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b4 X3 + b5 X4
a model with five parameters.
Two possibilities:
Specify all the explanatory variables in one formula and supply a matrix mapping responses to explanatory variables. In that case
Foo( Y1+Y2 ~ X1 + X2 + X3 + X4, map=cbind(c(1,1,1,0),c(0,0,1,1)))
would correspond to the first case, and
Foo( Y1+Y2 ~ X1 + X2 + X3 + X3 + X4, map=cbind(c(1,1,1,0,0),c(0,0,0,1,1)))
would be the second (X3 is listed twice so that the five map rows line up with the five parameters). Obviously some parsing of the LHS would be needed, or it could be cbind(Y1,Y2). The advantage of this notation is that other information might also be required for each parameter - starting values, priors, etc. - and its ordering is given by the ordering in the formula.
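To make the map idea concrete, here is a minimal sketch of how the five-parameter example might be interpreted internally (Foo itself is hypothetical; only the construction of a stacked design matrix from the map is shown, with made-up data):

```r
# Hypothetical sketch: interpret 'map' (rows = parameters, cols = responses)
# as a stacked design matrix for the five-parameter example.
set.seed(1)
dat  <- data.frame(X1 = rnorm(5), X2 = rnorm(5), X3 = rnorm(5), X4 = rnorm(5))
vars <- c("X1", "X2", "X3", "X3", "X4")    # formula order; X3 listed twice
map  <- cbind(c(1,1,1,0,0), c(0,0,0,1,1))  # parameter j applies to response r iff map[j, r] == 1
# One block of rows per response; parameter j's column is vars[j] where active, 0 elsewhere.
X <- do.call(rbind, lapply(seq_len(ncol(map)), function(r)
  sapply(seq_along(vars), function(j) dat[[vars[j]]] * map[j, r])))
dim(X)  # 10 rows (two stacked responses) by 5 parameter columns
```

A single least-squares fit of the stacked responses against X would then estimate all five parameters at once, with sharing controlled entirely by the map.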
Have multiple formulae and a grouping function that just adds an attribute so that shared parameters can be identified. The two examples then become:
Foo( Y1 ~ X1+X2+G(X3,1), Y2 ~ G(X3,1)+X4)
where the X3 parameter is shared between the formulae, and
Foo( Y1 ~ X1+X2+X3, Y2 ~ X3+X4)
which has independent parameters. The second parameter of G() is a grouping ID which gives the power to share model parameters flexibly.
A further explanation of the G function is shown by the following:
Foo( Y1 ~ X1+X2+G(X3,1), Y2~G(X3,1)+G(X4,2), Y3~G(X3,3)+G(X4,2), Y4~G(X3,3))
would be a model where:
Y1 = b1 X1 + b2 X2 + b3 X3
Y2 = b3 X3 + b4 X4
Y3 = b5 X3 + b4 X4
Y4 = b5 X3
where there are two independent parameters for X3 (G(X3,1) and G(X3,3)). How to handle a group that refers to a different explanatory variable is an open question: if the model instead had Y4~G(X3,2), that would seem to imply a parameter shared between different explanatory variables, since G(X4,2) is already present.
This notation seems easier for the user to comprehend, but if you also have to specify starting values then the mapping between a vector of starting values and the parameters they correspond to is no longer obvious. I suspect that internally we'd have to compute the mapping matrix from the G() notation.
There may be better ways of doing this, so my question is... does anyone know one?
Interesting question (I wish all package authors worried a lot more in advance about how they were going to create extensions to the basic Wilkinson-Rogers formula notation ...)
How about something like
formula=list(Y1~X1+X2+X3,Y2~X3+X4,Y3~X3+X4,Y4~X3),
shared=list(Y1+Y2~X3,Y2+Y3~X4,Y3+Y4~X3)
or something like that for your second example above?
The formula component gives the list of equations.
The shared component simply lists which response variables share the same parameter for specified predictor variables. It could obviously be mapped into a logical or binary table, but (for me at least -- this is certainly in the eye of the beholder) it's more straightforward. I think the map solution above is awkward when (as in this case) a variable (such as X3) is shared in two distinct sets of relationships.
I guess some straightforward rule like "starting values in the order in which the parameters appear in the list of formulas" -- in this case
X1, X2, X3(1), X4, X3(2)
would be OK, but it might be nice to provide a helper function that would tell the users the names of the coefficient vector (i.e. the order) given a formula/shared specification ...
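A sketch of such a helper (the name coef_names and the group-labeling scheme are made up for illustration): it walks the formula list, collapses (response, predictor) pairs that fall in the same shared group, and returns parameter names in order of first appearance.

```r
coef_names <- function(formulas, shared = list()) {
  # Label for each (response, predictor) pair; pairs in the same 'shared'
  # group get the same label and hence collapse to one parameter.
  share_key <- function(resp, pred) {
    for (k in seq_along(shared)) {
      s <- shared[[k]]
      if (resp %in% all.vars(s[[2]]) && pred %in% all.vars(s[[3]]))
        return(paste0(pred, ".g", k))
    }
    paste0(pred, ".", resp)  # unshared: one parameter per response
  }
  keys <- unlist(lapply(formulas, function(f) {
    resp <- all.vars(f[[2]])
    vapply(all.vars(f[[3]]), function(p) share_key(resp, p), character(1))
  }))
  unique(keys)
}
coef_names(list(Y1 ~ X1 + X2 + X3, Y2 ~ X3 + X4, Y3 ~ X3 + X4, Y4 ~ X3),
           shared = list(Y1 + Y2 ~ X3, Y2 + Y3 ~ X4, Y3 + Y4 ~ X3))
# "X1.Y1" "X2.Y1" "X3.g1" "X4.g2" "X3.g3"
```

The returned order matches the "X1, X2, X3(1), X4, X3(2)" ordering above, so a starting-value vector could be named and checked against it.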
From a bit of personal experience, I would say that embedding more fanciness in the formula itself leads to pain ... for example, the original nlme syntax with the random effects specified separately was easier to deal with than the new lme4-style syntax with random effects and fixed effects mixed in the same formula ...
An alternative (which I don't like nearly as well) would be
formula=list(Y1~X1+X2+X3,Y2~X3+X4,Y3~X3[2]+X4,Y4~X3[2])
where new parameters are indicated by some sort of tag (with [1] being implicit).
Also note the suggestion from the chat room by @Andrie that interfaces designed for structural equation modeling (the sem and lavaan packages) may be useful references.
Of the two methods you propose, the second one with the idea of several formulae looks more natural, but the G notation makes no sense to me.
The first one is much easier to understand, but I have two suggested tweaks to the map argument.
It should really take logical values rather than numbers.
Consider having a default of including all the independent variables for each response variable.
I have a data frame with 10 columns (features), COL0 to COL9, and a column RESP. How do I calculate a linear-regression model for each pair COL0 to COL9 ~ RESP?
I am expecting to get 10 graphs showing the Model and also a table with the coefficients of my model for each column.
What I tried so far:
model2 = fit(LinearModel, @formula(RESP ~ EXPL_0 + EXPL_1 + EXPL_2 +
EXPL_3 + EXPL_4 + EXPL_5 + EXPL_6 + EXPL_7 + EXPL_8 + EXPL_9), df)
And I get what I want.
I still need to know how to plot all these graphs,
and, if I had COL0 to COL1000, how can I avoid typing out all the columns from 0 to 1000?
I am new to Julia and I really don't have a clue how to get this done. Any help?
Thanks
As Bogumil says, it's not ideal to ask many questions in one post on Stack Overflow - your question should be well defined and targeted, ideally with a minimal working example to make it easiest for people to help you.
Let's therefore answer what, from the title of the post, I take as your main question: how can I fit a linear regression model with GLM that includes many predictor columns? That question is almost a duplicate of this one, so a very similar answer applies: broadcast the term function over the names of the DataFrame columns you want to include on the right-hand side, like this:
julia> using DataFrames, GLM
julia> df = hcat(DataFrame(RESP = rand(100)), DataFrame(rand(100, 10), :auto));
julia> mymodel = lm(term(:RESP) ~ sum(term.(names(df[!, Not(:RESP)]))), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
RESP ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
(...)
You also ask about plotting, but that's probably best dealt with in a separate question. You can access the estimated coefficients of your linear model using the coef function, e.g.
julia> coef(mymodel)
11-element Vector{Float64}:
0.504236533528822
0.11712812154185266
-0.0206810430546413
-0.15693089456050294
-0.011916514466331067
-0.1030171434361648
0.10378957999147352
-0.09447743618381275
-0.08860977078650123
0.0816071818033377
0.09939548661830626
and the full output with coeftable.
Finally, note that you won't necessarily see from that which columns "impact most" your model, as you say in the comment, unless you have standardized your regressors.
I'm using LASSO as a variable selection method for my analysis, but there's one particular variable that I wish to ensure is contained in the final formula. I have automated the entire process to return the variables that LASSO selects and spits them into a character string formula e.g. formula = y~x1+x2+x3+... However there is one variable in particular I would like to keep in the formula even if LASSO does not select it. Now I could easily manually add this variable to the formula after the fact, but in the interest of improving my R skills I'm trying to automate the entire process.
My thought for achieving this so far was nesting the grep() function inside an ifelse() statement, e.g. ifelse(grep("variable I'm concerned with", formula) != 1, formula = formula, formula = paste0(formula, "variable I'm concerned with", collapse = "+")), but this has not done the trick.
Am I on the right track or can anyone think of alternative routes to take?
According to the documentation:
penalty.factor: Separate penalty factors can be applied to each coefficient. This is a number that multiplies lambda to allow differential shrinkage. Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model. Default is 1 for all variables (and implicitly infinity for variables listed in exclude). Note: the penalty factors are internally rescaled to sum to nvars, and the lambda sequence will reflect this change.
So apply this as an argument to glmnet using a penalty factor of 0 for your "key coefficient" and 1 elsewhere.
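A minimal sketch of how that might look (the data and variable names are made up; the glmnet call is guarded so the snippet runs even without the package installed):

```r
# Sketch: build a penalty.factor vector with 0 for the must-keep variable.
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5, dimnames = list(NULL, paste0("x", 1:5)))
y <- rnorm(100)
pf <- rep(1, ncol(x))
pf[colnames(x) == "x1"] <- 0   # no shrinkage on x1, so it is always included
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit <- glmnet::glmnet(x, y, penalty.factor = pf)
}
pf
# 0 1 1 1 1
```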
A formula is not a character object, but you might want to explore terms.formula if your goal is to edit formulas directly based on character output. terms objects are a really powerful way of doing variable subsetting and selection. But you really need to explore it, because the formula language was not really meant to be automated easily; rather, it was meant to be a convenient and readable way to specify model fits (look at how difficult SAS is by comparison).
f <- y ~ x1 +x2
t <- terms(f)
## drop 'x2'
i.x2 <- match('x2', attr(t, 'term.labels'))
t <- drop.terms(t, i.x2, keep.response = TRUE) ## drop the variable
## t is still a "terms" object but `lm` and related functions have implicit methods for interpreting as a "formula" object.
lm(t)
Currently, you are attempting to combine a character value with a formula object, which will not work given the different types. Instead, consider stats::update, which will not duplicate a term that is already included:
lasso_formula <- as.formula("y ~ x1 + x2 + x3")
# EXISTING TERM
lasso_formula <- update(lasso_formula, ~ . + x3)
lasso_formula
# y ~ x1 + x2 + x3
# NEEDED VARIABLE
lasso_formula <- update(lasso_formula, ~ . + myTerm)
lasso_formula
# y ~ x1 + x2 + x3 + myTerm
Should formula be a character string, be sure to use grepl (not grep) in ifelse. And do not assign with = inside ifelse, since ifelse is a function that returns a value and is not to be confused with if...else:
lasso_formula <- "y ~ x1 + x2 + x3"
lasso_formula <- ifelse(grepl("myterm", lasso_formula),
lasso_formula,
paste(lasso_formula, "+ myterm"))
lasso_formula
# [1] "y ~ x1 + x2 + x3 + myterm"
I am trying to generate a formula using dataframe column names of the following format:
d ~ x1 + x2 + x3 + x4
From the following sample dataset:
a = c(1,2,3)
b = c(2,4,6)
c = c(1,3,5)
d = c(9,8,7)
x1 = c(1,2,3)
x2 = c(2,4,6)
x3 = c(1,3,5)
x4 = c(9,8,7)
df = data.frame(a,b,c,d,x1,x2,x3,x4)
As for what I have tried already:
I know that I can subset only the columns I need using the following approach
predictors = names(df[5:8])
response = names(df[4])
However, my efforts to include these in a formula have failed.
How can I assemble the predictors and the response variables into the following format:
d ~ x1 + x2 + x3 + x4
I ultimately want to input this formula into a randomForest function.
We can avoid the entire problem by using the default method of randomForest (rather than the formula method):
randomForest(df[5:8], df[[4]])
or in terms of predictors and response defined in the question:
randomForest(df[predictors], df[[response]])
As mentioned in the Note section of the randomForest help file the default method used here has the additional advantage of better performance than the formula method.
How about:
reformulate(predictors,response=response)
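With the data frame from the question this gives exactly the desired formula:

```r
# Using the question's predictors/response with reformulate:
df <- data.frame(a = c(1,2,3), b = c(2,4,6), c = c(1,3,5), d = c(9,8,7),
                 x1 = c(1,2,3), x2 = c(2,4,6), x3 = c(1,3,5), x4 = c(9,8,7))
predictors <- names(df[5:8])
response   <- names(df[4])
f <- reformulate(predictors, response = response)
f
# d ~ x1 + x2 + x3 + x4
```

The resulting formula can be passed straight to randomForest (or any other formula-method modeling function).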
I have this sample data table:
df <- data.table(indexer = 0:12,
                 x1 = c(0,1000,1500,1000,1000,2000,1000,1000,0,351.2,1000,1000,1851.2))
Now I need to create two additional columns x2 and x3 in this data table such that x2[i] = x1[i] - x3[i] and x3[i] = x2[i-1], with x3[1] = 0.
How can I do this without using a loop in an efficient way?
EDIT1: expected results are
x2 = c(0.0,1000.0,500.0,500.0,500.0,1500.0,-500.0,1500.0,-1500.0,1851.2,-851.2,1851.2,0.0)
and
x3 = c(0.0,0.0,1000.0,500.0,500.0,500.0,1500.0,-500.0,1500.0,-1500.0,1851.2,-851.2,1851.2)
EDIT2: First time here posting questions, hence all the confusion. Forget the example, guys; the formulas are:
x3[i] = c - x2[i-1]*(1+r/12); x2[i] = x1[i] - x3[i]; x3[1] = 0 # c is some constant.
The problem is that x2 and x3 depend on each other, so one first needs to express x2 in terms of x1 alone. Substituting x3[i] = x2[i-1] gives the recurrence x2[i] = x1[i] - x2[i-1], which unrolls to the alternating sum x2[i] = x1[i] - x1[i-1] + x1[i-2] - ...
Once we have the formula, programming is easy:
df$x2 <- (-1)^(df$indexer) * cumsum(df$x1*(-1)^(df$indexer))
And x3 can be obtained from x2:
df$x3 <- c(0,df$x2[-nrow(df)])
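A quick way to convince oneself: compare the closed form against a naive loop on the example data (a plain data.frame is used here so the check is self-contained):

```r
# Check the alternating-cumsum formula against a direct loop.
df <- data.frame(indexer = 0:12,
                 x1 = c(0,1000,1500,1000,1000,2000,1000,1000,0,351.2,1000,1000,1851.2))
df$x2 <- (-1)^df$indexer * cumsum(df$x1 * (-1)^df$indexer)
df$x3 <- c(0, df$x2[-nrow(df)])
# reference implementation with an explicit loop
x2 <- x3 <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  if (i > 1) x3[i] <- x2[i - 1]
  x2[i] <- df$x1[i] - x3[i]
}
isTRUE(all.equal(df$x2, x2)) && isTRUE(all.equal(df$x3, x3))
# TRUE
```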
[EDIT2] I guess that a solution to the modified question, if it exists at all, should be sought along the same lines. I don't think it should be considered a programming problem: the code is quite straightforward once the mathematical formula is known.
I'm new to optimization and I need to implement it in a simple scenario:
There exists a car manufacturer that can produce 5 models of cars/vans. Associated with each model that can be produced is a number of labor hours required and a number of tons of steel required, as well as a profit that is earned from selling one such car/van. The manufacturer currently has a fixed amount of steel and labor available, which should be used in such a way that it optimizes total profit.
Here's the part I'm hung up on: each car also has a minimum order quantity. The company must manufacture a certain number of each model before it becomes economically viable to produce/sell that model. This would be easy to send to optim() if it were not for that final condition, because the lower = ... argument can be given a vector of minimum order quantities, but then it does not consider 0 as an option. Could someone help me solve this, taking into account the minimum order but still allowing for an order of 0? Here's how I've organized the relevant information/constraints:
Dorian <- data.frame(Model = c('SmCar', 'MdCar', 'LgCar', 'MdVan', 'LgVan'),
SteelReq = c(1.5,3,5,6,8), LabReq=c(30,25,40,45,55),
MinProd = c(1000,1000,1000,200,200),
Profit = c(2000,2500,3000,5500,7000))
Materials <- data.frame(Steel=6500,Labor=65000)
NetProfit<-function(x) {
x[1]->SmCar
x[2]->MdCar
x[3]->LgCar
x[4]->MdVan
x[5]->LgVan
np<-sum(Dorian$Profit*c(SmCar,MdCar,LgCar,MdVan,LgVan))
np
}
LowerVec <- Dorian$MinProd #Or 0, how would I add this option?
UpperVec <- apply(rbind(Materials$Labor/Dorian$LabReq,
Materials$Steel/Dorian$SteelReq),2,min)
# Attempt at using optim()
optim(c(0,0,0,0,0),NetProfit,lower=LowerVec, upper=UpperVec)
Eventually I would like to substitute random variables with known distributions for parameters such as Profit and LabReq (labor required) and wrap this into a function that will take Steel and Labor available as inputs as well as parameters for the random variables. I will want to simulate many times and then find the average solution given specific parameters for the Profit and Labor Required, so ideally this optimization would also be fast so that I could perform the simulations. Thanks in advance for any help!
If you are not familiar with Linear Programming, start here: http://en.wikipedia.org/wiki/Linear_programming
Also have a look at the part about Mixed-Integer Programming http://en.wikipedia.org/wiki/Mixed_integer_programming#Integer_unknowns. That's when the variables you are trying to solve are not all continuous, but also include booleans or integers.
In all respects, your problem is a mixed-integer program (to be exact, a pure integer program), as you are solving for integers: the number of vehicles to produce for each model.
There are known algorithms for solving these and thankfully, they are already wrapped into R packages for you. Rglpk is one of them, and I'll show you how to formulate your problem so you can use its Rglpk_solve_LP function.
Let x1, x2, x3, x4, x5 be the variables you are solving for: the number of vehicles to produce for each model.
Your objective is:
Profit = 2000 x1 + 2500 x2 + 3000 x3 + 5500 x4 + 7000 x5.
Your steel constraint is:
1.5 x1 + 3 x2 + 5 x3 + 6 x4 + 8 x5 <= 6500
Your labor constraint is:
30 x1 + 25 x2 + 40 x3 + 45 x4 + 55 x5 <= 65000
Now comes the hard part: modeling the minimum production requirements. Let's take the first one as an example: the minimum production requirement on x1 requires that at least 1000 vehicles be produced (x1 >= 1000) or that no vehicle be produced at all (x1 = 0). To model that requirement, we are going to introduce a boolean variable z1. By boolean, I mean z1 can only take two values: 0 or 1. The requirement can be modeled as follows:
1000 z1 <= x1 <= 9999999 z1
Why does this work? Consider the two possible values for z1:
if z1 = 0, then x1 is forced to 0;
if z1 = 1, then x1 is forced to be at least 1000 (the minimum production requirement) and at most 9999999, which I picked as an arbitrarily big number.
Repeating this for each model, you will have to introduce similar boolean variables (z2, z3, z4, z5). In the end, the solver will not only be solving for x1, x2, x3, x4, x5 but also for z1, z2, z3, z4, z5.
Putting all this into practice, here is the code for solving your problem. We are going to solve for the vector x = (x1, x2, x3, x4, x5, z1, z2, z3, z4, z5)
library(Rglpk)
num.models <- nrow(Dorian)
# only x1, x2, x3, x4, x5 contribute to the total profit
objective <- c(Dorian$Profit, rep(0, num.models))
constraints.mat <- rbind(
c(Dorian$SteelReq, rep(0, num.models)), # total steel used
c(Dorian$LabReq, rep(0, num.models)), # total labor used
cbind(-diag(num.models), +diag(Dorian$MinProd)), # MinProd_i * z_i - x_i <= 0
cbind(+diag(num.models), -diag(rep(9999999, num.models)))) # x_i - 9999999 z_i <= 0
constraints.dir <- c("<=",
"<=",
rep("<=", num.models),
rep("<=", num.models))
constraints.rhs <- c(Materials$Steel,
Materials$Labor,
rep(0, num.models),
rep(0, num.models))
var.types <- c(rep("I", num.models), # x1, x2, x3, x4, x5 are integers
rep("B", num.models)) # z1, z2, z3, z4, z5 are booleans
Rglpk_solve_LP(obj = objective,
mat = constraints.mat,
dir = constraints.dir,
rhs = constraints.rhs,
types = var.types,
max = TRUE)
# $optimum
# [1] 6408000
#
# $solution
# [1] 1000 0 0 202 471 1 0 0 1 1
#
# $status
# [1] 0
So the optimal solution is to create (1000, 0, 0, 202, 471) vehicles of each respective model, for a total profit of 6,408,000.