How to add an intercept for linear regression when using a matrix as input in GLM in Julia

I am trying to use linear regression in GLM from Julia, with a matrix as inputs rather than a DataFrame.
The inputs are:
julia> x
4×2 Matrix{Int64}:
1 1
2 2
3 3
4 4
julia> y
4-element Vector{Int64}:
0
2
4
6
But when I tried to fit it using the lm function, I found that an intercept is not included by default:
julia> lr = lm(x, y)
LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}:
Coefficients:
───────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────
x1 0.666667 1.11848e7 0.00 1.0000 -4.81244e7 4.81244e7
x2 0.666667 1.11848e7 0.00 1.0000 -4.81244e7 4.81244e7
───────────────────────────────────────────────────────────────
I checked the official docs of GLM, but they only explain usage with DataFrames as input. Is there a way to add an intercept to the model when using matrices as inputs, without altering the input (such as adding a column of 1s to x)?

If you are using the X, y method, you are responsible for constructing the design matrix yourself. If you do not want to do that, use the formula method. This requires a bit of intermediate setup with your example, as the data needs to be in tabular form, but you can just create a named tuple:
data = @views (;y, x1 = x[:, 1], x2 = x[:, 2])
lm(@formula(y ~ 1 + x1 + x2), data)
If you have a DataFrame or similar at hand, you can (probably) use it directly.
(IIRC, you could also just write @formula(y ~ x1 + x2), and it will add the intercept automatically, as in R. But I prefer the explicit specification.)
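For completeness, "constructing the design matrix yourself" with the X, y method just means prepending a column of ones, which lm then estimates as the intercept. A minimal sketch, using made-up predictors instead of the collinear x above:
using GLM
X = [1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0]   # hypothetical, non-collinear predictors
y = [1.0, 3.0, 2.0, 5.0]
lm(hcat(ones(size(X, 1)), X), y)           # first coefficient is the intercept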

Related

How to calculate CAPM variables in Julia?

In Python, using stats of scipy package, variables beta, alpha, r, p, std_err of CAPM can be calculated as follows:
beta, alpha, r_value, pvalue, std_err = stats.linregress(stock_rtn_arr, mkt_rtn_arr)
Please guide me in calculating the above variables in Julia.
I'm assuming you are looking to run a simple OLS model, which in Julia can be fit using the GLM package:
julia> using GLM, DataFrames
julia> mkt_rtn_arr = randn(500); stock_rtn_arr = 0.5*mkt_rtn_arr .+ rand();
julia> df = DataFrame(mkt_rtn = mkt_rtn_arr, stock_rtn = stock_rtn_arr);
julia> linear_model = lm(@formula(stock_rtn ~ mkt_rtn), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
stock_rtn ~ 1 + mkt_rtn
Coefficients:
──────────────────────────────────────────────────────────────────────────────
Estimate Std. Error t value Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept) 0.616791 7.80308e-18 7.90446e16 <1e-99 0.616791 0.616791
mkt_rtn 0.5 7.78767e-18 6.42041e16 <1e-99 0.5 0.5
──────────────────────────────────────────────────────────────────────────────
You can then extract the parameters of interest from the linear_model:
julia> β = coef(linear_model)[2]
0.4999999999999999
julia> α = coef(linear_model)[1]
0.6167912017573035
julia> r_value = r2(linear_model)
1.0
julia> pvalues = coeftable(linear_model).cols[4]
2-element Array{Float64,1}:
0.0
0.0
julia> stderror(linear_model)
2-element Array{Float64,1}:
7.803081577574428e-18
7.787667394841443e-18
Note that I have used the @formula API to run the regression, which requires putting your data into a DataFrame and is, in my opinion, the preferred way of estimating a linear model in GLM, as it allows for much more flexibility in specifying the model. Alternatively, you could have called lm(X, y) directly on an array for your X variable and the y variable:
julia> lm([ones(length(mkt_rtn_arr)) mkt_rtn_arr], stock_rtn_arr)
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}:
Coefficients:
─────────────────────────────────────────────────────────────────────
Estimate Std. Error t value Pr(>|t|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────
x1 0.616791 7.80308e-18 7.90446e16 <1e-99 0.616791 0.616791
x2 0.5 7.78767e-18 6.42041e16 <1e-99 0.5 0.5
─────────────────────────────────────────────────────────────────────
Note that here I have appended a column of ones to the market return array to estimate the model with an intercept, which the @formula macro does automatically (similar to the way it's done in R).
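If you want a drop-in replacement for the tuple returned by scipy's stats.linregress, you could wrap the extraction steps above in a small helper. This is only a sketch built from the accessors already used above (coef, r2, coeftable, stderror); the function name capm_regress is made up:
using GLM, DataFrames

function capm_regress(stock_rtn, mkt_rtn)
    df = DataFrame(stock_rtn = stock_rtn, mkt_rtn = mkt_rtn)
    m = lm(@formula(stock_rtn ~ mkt_rtn), df)
    alpha, beta = coef(m)                    # intercept, slope
    r_value = sign(beta) * sqrt(r2(m))       # linregress reports the signed correlation r, not R^2
    pvalue = coeftable(m).cols[4][2]         # p-value of the slope
    std_err = stderror(m)[2]                 # standard error of the slope
    return beta, alpha, r_value, pvalue, std_err
end

beta, alpha, r_value, pvalue, std_err = capm_regress(stock_rtn_arr, mkt_rtn_arr)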

Linear optimization of difference subject to constraints

I have a vector of values (x1, x2, x3, x4, x5, x6, x7) and I want to create a new unknown vector (y1, y2, y3, y4, y5, y6, y7) such that ||x-y||^2 is minimized. I also want to create this new vector subject to the constraints that x1+x2+x3+x4+x5 = x6 and x1+x2+x3+x4 = x7. I tried to use constrOptim but I do not think I have the right inputs. Any help would be greatly appreciated!
Would it be best to come up with a set of values and then use an nls model to predict them? How would I do that?
Thank you!!
We assume that what the question actually intended was that y is known and we want to get x with the indicated constraints.
Note that nls does not work for zero-residual problems. Since no data was provided in the question, we don't know whether that is the case here, so we first present two solutions that can handle zero residuals and then show an nls fit for the non-zero-residual case. We use the y shown below in (1) as the test input for (1) and (2); it does have zero residuals. For (3), the nls solution, we use a different y which does not lead to zero residuals.
Here are some alternative solutions:
1) lm We define x5_to_x7, which maps the first 5 components of x to the entire 7-element vector. Because x5_to_x7 is a linear operator, it corresponds to a matrix X, which we form and then use in lm:
# test data
y <- c(1:5, sum(1:5), sum(1:4))
x5_to_x7 <- function(x5) c(x5, sum(x5), sum(x5[1:4]))
X <- apply(diag(5), 1, x5_to_x7)
fm <- lm(y ~ X + 0)
giving:
coef(fm)
## X1 X2 X3 X4 X5
## 1 2 3 4 5
all.equal(x5_to_x7(coef(fm)), y)
## [1] TRUE
2) optim Alternatively, we can define a residual-sum-of-squares function and minimize it with optim, where y and x5_to_x7 are as above:
rss <- function(x) sum((y - x5_to_x7(x))^2)
result <- optim(numeric(5), rss, method = "BFGS")
giving:
> result
$par
[1] 1 2 3 4 5
$value
[1] 5.685557e-20
$counts
function gradient
18 12
$convergence
[1] 0
$message
NULL
> all.equal(x5_to_x7(result$par), y)
[1] TRUE
3) nls If y were such that the residuals are not zero then it would be possible to use nls as suggested in the question.
y <- 1:7
fm1 <- lm(y ~ X + 0)
fm2 <- nls(y ~ x5_to_x7(x), start = list(x = numeric(5)))
all.equal(coef(fm1), coef(fm2), check.attributes = FALSE)
## [1] TRUE

optimal predictor value for multivariate regression in R

Suppose I have 1 response variable Y and 2 predictors X1 and X2, such as the following
Y X1 X2
2.3 1.1 1.2
2.5 1.24 1.17
......
Assuming I have a strong belief the following model works well
fit <- lm(Y ~ poly(X1,2) + X2)
in other words, there is a quadratic relation between Y and X1, a linear relationship between Y and X2.
Now here are my questions:
how do I find the optimal value of (x1, x2) such that the fitted model reaches its maximal value at this pair of values?
now, assuming X2 has to be fixed at some particular value, how do I find the optimal x1 such that the fitted value is maximized?
So here is an empirical way to do this:
# create some random data...
set.seed(1)
X1 <- 1:100
X2 <- sin(2*pi/100*(1:100))
df <- data.frame(Y=3 + 5*X1 -0.2 * X1^2 + 100*X2 + rnorm(100,0,5),X1,X2)
fit <- lm(Y ~ poly(X1,2,raw=T) + X2, data=df)
# X1 and X2 unconstrained
df$pred <- predict(fit)
result <- with(df,df[pred==max(pred),])
result
# Y X1 X2 pred
# 19 122.8838 19 0.9297765 119.2087
# max(Y|X2=0)
newdf <- data.frame(Y=df$Y, X1=df$X1, X2=0)
newdf$pred2 <- predict(fit,newdata=newdf)
result2 <- with(newdf,newdf[pred2==max(pred2),])
result2
# Y X1 X2 pred2
#12 104.6039 12 0 35.09141
So in this example, when X1 and X2 are unconstrained, the maximum predicted value of Y (= 119.2) occurs at (X1, X2) = (19, 0.930). When X2 is constrained to 0, the maximum predicted value of Y (= 35.1) occurs at (X1, X2) = (12, 0).
There are a couple of things to consider:
These are global maxima in the space of your data. In other words if your real data has a large number of variables there might be local maxima that you will not find this way.
This method has resolution only as great as your dataset. So if the true maximum occurs at a point between your data points, you will not find it this way.
This technique is restricted to the bounds of your dataset. So if the true maximum is outside those bounds, you will not find it. On the other hand, using a model outside the bounds of your data is, IMHO, the definition of reckless.
Finally, you should be aware that poly(...) produces orthogonal polynomials, which will generate a fit, but whose coefficients are very difficult to interpret. If you really want a quadratic fit, e.g. a + b*x + c*x^2, you are better off specifying it explicitly with Y ~ X1 + I(X1^2) + X2, or using raw=T in the call to poly(...).
Credit to @sashkello.
Basically, I have to extract the coefficients from the lm object and multiply them with the corresponding terms to form the formula and proceed.
I think this is not very efficient. What if this is a regression with hundreds of predictors?

Creating separate linear model for every combination(/interaction) of factors

I'm trying to do a simple linear regression on my data frame that looks something like what follows. The actual data set has more factors and more predictors (x's) all trying to predict y.
f1 f2 x y
x a 1 3.3
x a 2 3.2
x a 3 3.04
x b 1 4.5
x b 2 4.9
x b 3 8
y a 1 20.1
y a 2 20.3
y a 3 21.9
y b 1 101.2
y b 2 201.8
y b 3 332.8
Notice, for every combination of f1 & f2 the trends vary. What I want to do is build an lm model for each combination of f1 & f2, store it in some kind of list, and then, when I call predict, use the appropriate model to predict y based on x. I think I should use ldply to create a list of models, as shown below:
lm.model.list = ldply(x,.(f1,f2),function(x) {
fit = lm(x$y ~ x$x)
return(fit)
}
This gives an error,
Error: attempt to apply non-function
Also, assume I get it all into a list, how do I work with predict after that?
edit: I realize I could use indicator variables for the factors in the modelling itself, but I want to avoid this.
I think what you want is just:
fit <- lm(y ~ x + f1*f2, data=dfrm)
This will give a different prediction for each level of the interaction of f1 with f2. It's just one model but could be "queried" for predictions with the predict function using any desired combo of f1 and f2. You should look at ?formula and spend some time understanding how linear models get interpreted.

How is the intercept computed in the GLM fit?

I have been reading the code used by R to fit a generalized linear model (GLM), since the source code of R is freely available. The algorithm used is called iteratively reweighted least squares (IRLS), which is a fairly well-documented algorithm. For each iteration, there is a call to a Fortran function to solve the weighted least squares problem.
From the end-user's viewpoint, for a logistic regression for instance, a call in R looks just like this:
y <- rbinom(100, 1, 0.5)
x <- rnorm(100)
glm(y~x, family=binomial)$coefficients
And if you do not want to use an intercept, either of these calls is okay:
glm(y~x-1, family=binomial)$coefficients
glm(y~x+0, family=binomial)$coefficients
However, I cannot figure out how the formula, i.e. y~x or y~x-1, is interpreted in the code to decide whether or not to use an intercept. I was looking for a part of the code where a column of ones would be bound to x, but it seems there is none.
As far as I have read, the boolean intercept argument which appears in the function glm.fit is not the same as the intercept I am referring to, and the same goes for the offset.
The documentation about glm and glm.fit is here.
You are probably looking in the wrong place. Usually, model.matrix() is called first in the fitting functions:
> D <- data.frame(x1=1:4, x2=4:1)
> model.matrix(~ x1 + x2, D)
(Intercept) x1 x2
1 1 1 4
2 1 2 3
3 1 3 2
4 1 4 1
attr(,"assign")
[1] 0 1 2
> model.matrix(~ x1 + x2 -1 , D)
x1 x2
1 1 4
2 2 3
3 3 2
4 4 1
attr(,"assign")
[1] 1 2
>
and it is the output of model.matrix() which is passed down to Fortran. That is the case for lm() and other model fitters.
For glm() the chain is only slightly longer: model.frame() is evaluated first, and then model.matrix(mt, mf, contrasts) is applied to the resulting model frame inside glm() itself, which is where the column of ones for the intercept is added before the matrix is handed to glm.fit(). So there is no place in the code where a column of ones is explicitly bound to x; it comes out of model.matrix(), driven by the formula's terms.
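For comparison with the Julia question at the top of this page, GLM.jl behaves analogously: when you fit with @formula, the formula is expanded into a model matrix that already contains the intercept column, and you can inspect that matrix on the fitted model. A rough sketch with made-up data (the names y, x1, x2 are placeholders):
using GLM
data = (; y = [1.0, 2.0, 3.0, 5.0], x1 = [1.0, 2.0, 3.0, 4.0], x2 = [2.0, 1.0, 4.0, 3.0])
m = lm(@formula(y ~ x1 + x2), data)
modelmatrix(m)    # first column is the column of ones added for the intercept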
