R: Linear Regression with N Features - r

I saw quite a few examples of how to do regression (linear, multiple... etc.) but on every example I saw, you had to define every single feature in the formula...
linearMod <- lm(Y ~ x1 + x2 + x3 + ..., data=myData)
Well, we used TSFresh to generate more features. Around 100. So how am I supposed to do this now? I don't really want to type in x1 .. all the way to .. x100.
In Python scikit-learn I could just put in all the data:
lm = linear_model.LinearRegression()
model = lm.fit(X,y)
And then repeat this for each 'feature group' to create a multiple linear regression.
Is there a way to do this in R? Or am I doing it wrong? Maybe another approach?
Originally we had 8 features/properties per row. With TSFresh we generated more of those (mean, STD and so on).
And every one of those features has a pretty linear influence on the Y result. So how can I now define something like a multiple linear model that just uses all extended features? Ideally without me having to tell it by hand each time.
So, for example, one formula would probably use features 1-12 for Y, the next one features 13-24 for Y, and so on. Is there an easy way to do this?

If you want to regress on all variables except Y you can do
lm(Y ~ ., data = myData)
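For the per-group models the question asks about, one possible approach is to build each formula from a character vector of column names with reformulate(). The sketch below is only an illustration under assumptions: the data are simulated, and the column names x1..x24 and the group size of 12 are stand-ins for the real TSFresh features.

```r
# Simulated stand-in for the real data: a response Y plus 24 features x1..x24.
set.seed(42)
n <- 200
X <- as.data.frame(matrix(rnorm(n * 24), nrow = n))
names(X) <- paste0("x", 1:24)
myData <- cbind(Y = 1 + X$x1 - 2 * X$x13 + rnorm(n), X)

# All columns except Y as predictors.
full_mod <- lm(Y ~ ., data = myData)

# One model per group of 12 features; reformulate() builds Y ~ x1 + ... + x12 etc.
feature_names <- setdiff(names(myData), "Y")
groups <- split(feature_names, ceiling(seq_along(feature_names) / 12))
group_mods <- lapply(groups, function(vars) {
  lm(reformulate(vars, response = "Y"), data = myData)
})

length(coef(full_mod))                           # 25: intercept + 24 features
sapply(group_mods, function(m) length(coef(m)))  # 13 per group
```

Because the formulas are generated from names(myData), adding or removing feature columns does not require retyping anything.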


R: Sliding one-ahead forecasts from equation estimated on a fixed period

The toy model below stands in for one with a bunch more variables, transforms, lags, etc. Assume I got that stuff right.
My data is ordered in time, but is not formatted as an R time series, because I need to exclude certain periods, etc. I'd rather not make it a time series for this reason, because I think it would be easy to muck up, but if I need to, or if it greatly simplifies the estimation process, I'd like to just use an integer sequence, such as index. below, to represent time, if that is allowed.
My problem is a simple one (I hope). I would like to use the first part of my data to estimate the coefficients of the model. Then I want to use those estimates, and not estimates from a sliding window, to do one-ahead forecasts for each of the remaining values of that data. The idea is that the formula is applied with a sliding window even though it is not estimated with one. Obviously I could retype the model with coefficients included and then get what I want in multiple ways, with base R sapply, with tidyverse dplyr::mutate or purrr::map_dbl, etc. But I am morally certain there is some standard way of pulling the formula out of the lm object and then wielding it as one desires, that I just haven't been able to find. Example:
library(dplyr)  # provides tibble() and the vector version of lag()
set.seed(1)
x1 <- 1:20
y1 <- 2 + x1 + lag(x1) + rnorm(20)
index. <- x1
data. <- tibble(index., x1, y1)
mod_eq <- y1 ~ x1 + lag(x1)
lm_obj <- lm(mod_eq, data.[1:15,])
and I want something along the lines of:
my_forecast_values <- apply_eq_to_data(eq = get_estimated_equation(lm_obj), my_data = data.[16:20, ])
and the lag shouldn't give me an error.
Also, this is not part of my question per se, but I could use a pointer to a nice tutorial on using R formulas and the standard estimation output objects produced by lm, glm, nls and the like. Not the statistics, just the programming.
The common way to use the coefficients is to call predict(), coefficients(), or summary() on the model object. See ?predict.lm for details.
A simple example:
data.$lagx <- dplyr::lag(data.$x1, 1) # create the lag variable on the full data first
lm_obj1 <- lm(y1 ~ x1 + lagx, data = data.[2:15, ]) # fit on rows 2 to 15 (row 1 has no lag)
pred1 <- predict(lm_obj1, newdata = data.[16:20, ]) # predict rows 16 to 20; newdata needs the same column names
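To make the "estimate once, apply to later rows" idea concrete, here is a minimal base-R sketch. It is an illustration under assumptions, not the asker's real model: x1 is drawn at random (with x1 = 1:20 the lag is perfectly collinear with x1), the lag is computed with c(NA, head(x1, -1)) as a stand-in for dplyr::lag(x1, 1), and the names dat, fit and pred_manual are made up here. It also shows that predict() is just the new design matrix times the stored coefficients.

```r
set.seed(1)
n <- 20
x1 <- rnorm(n)                   # random regressor, to avoid collinearity with its lag
lagx <- c(NA, head(x1, -1))      # base-R equivalent of dplyr::lag(x1, 1)
y1 <- 2 + x1 + lagx + rnorm(n)
dat <- data.frame(index = seq_len(n), x1, lagx, y1)

# Estimate the coefficients once, on rows 2..15 (row 1 has no lag value).
fit <- lm(y1 ~ x1 + lagx, data = dat[2:15, ])

# Apply those fixed coefficients to the remaining rows.
pred <- predict(fit, newdata = dat[16:20, ])

# The same numbers by hand: design matrix for the new rows times coef(fit).
# delete.response() drops y1 from the terms, so y1 is not needed in the new data.
X_new <- model.matrix(delete.response(terms(fit)), dat[16:20, ])
pred_manual <- drop(X_new %*% coef(fit))

all.equal(unname(pred), unname(pred_manual))
```

Computing the lag column on the full data before subsetting is what keeps row 16 from losing its lagged value.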

Fixing a coefficient on variable in MNL [duplicate]

In R, how can I set weights for particular variables and not observations in lm() function?
Context is as follows. I'm trying to build personal ranking system for particular products, say, for phones. I can build linear model based on price as dependent variable and other features such as screen size, memory, OS and so on as independent variables. I can then use it to predict phone real cost (as opposed to declared price), thus finding best price/goodness coefficient. This is what I have already done.
Now I want to "highlight" some features that are important for me only. For example, I may need a phone with large memory, thus I want to give it higher weight so that linear model is optimized for memory variable.
lm() function in R has weights parameter, but these are weights for observations and not variables (correct me if this is wrong). I also tried to play around with formula, but got only interpreter errors. Is there a way to incorporate weights for variables in lm()?
Of course, lm() function is not the only option. If you know how to do it with other similar solutions (e.g. glm()), this is pretty fine too.
UPD. After a few comments I understood that the way I was thinking about the problem was wrong. The linear model obtained by a call to lm() gives optimal coefficients for the training examples, and there's no way (and no need) to change the weights of variables; sorry for the confusion. What I'm actually looking for is a way to change coefficients in an existing linear model, to manually make some parameters more important than others. Continuing the previous example, let's say we've got the following formula for price:
price = 300 + 30 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
This formula describes best possible linear model for dependence between price and phone parameters. However, now I want to manually change number 30 in front of memory variable to, say, 60, so it becomes:
price = 300 + 60 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
Of course, this formula doesn't reflect the optimal relationship between price and phone parameters any more. Also, the dependent variable no longer shows the actual price, just some value of goodness that takes into account that memory is twice as important for me as for the average person (based on the coefficients from the first formula). But this value of goodness (or, more precisely, the ratio goodness/price) is just what I need: having this, I can find the best (in my opinion) phone at the best price.
Hope all of this makes sense. Now I have one (probably very simple) question. How can I manually set coefficients in existing linear model, obtained with lm()? That is, I'm looking for something like:
coef(model)[2] <- 60
This code doesn't work of course, but you should get the idea. Note: it is obviously possible to just double values in memory column in data frame, but I'm looking for more elegant solution, affecting model, not data.
The following code is a bit complicated because lm() minimizes the residual sum of squares, and with a fixed, non-optimal coefficient that sum is no longer minimal. Fixing one coefficient goes against what lm() is trying to do, so the only way to keep the other coefficients at their original values is to fix all of them too.
To do that, we have to know coefficients of the unrestricted model first. All the adjustments have to be done by changing formula of your model, e.g. we have
price ~ memory + screen_size, and of course there is a hidden intercept. Now neither changing the data directly nor using I(c*memory) is a good idea. I(c*memory) is like a temporary change of the data too, and changing only one coefficient by transforming the variables would be much more difficult.
So first we change price ~ memory + screen_size to price ~ offset(c1*memory) + offset(c2*screen_size). But we haven't modified the intercept, which would now be re-estimated to minimize the residual sum of squares and could differ from the one in the original model. The final step is to remove the intercept and to add a constant offset in its place, i.e. a vector with the same number of observations as the other variables:
price ~ offset(c1*memory) + offset(c2*screen_size) + offset(rep(c0, length(memory))) - 1
# Function to fix coefficients
setCoeffs <- function(frml, weights, len) {
  # Wrap each right-hand-side term in offset(<weight>*<term>)
  el <- paste0("offset(", weights[-1], "*",
               unlist(strsplit(as.character(frml)[-(1:2)], " +\\+ +")), ")")
  # Replace the intercept with a constant offset of length len
  el <- c(paste0("offset(rep(", weights[1], ",", len, "))"), el)
  as.formula(paste(as.character(frml)[2], "~",
                   paste(el, collapse = " + "), " + -1"))
}
# Example data
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10, sd = 5),
                 y = rnorm(10, mean = 3, sd = 10))
# Writing formula explicitly
frml <- y ~ x1 + x2
# Basic model
mod <- lm(frml, data = df)
# Initial coefficients and any modifications. Note that "weights" contains
# the intercept value too
weights <- coef(mod)
# Setting coefficient of x1. All the rest remain the same
weights[2] <- 3
# Final model
mod2 <- update(mod, setCoeffs(frml, weights, nrow(df)))
# It is fine that mod2 returns "No coefficients"
Also, probably you are going to use mod2 only for forecasting (actually I don't know where else it could be used now) so that could be made in a simpler way, without setCoeffs:
# Data for forecasting with e.g. price unknown
df2 <- data.frame(x1 = rpois(10, 10), x2 = rpois(10, 5), y = NA)
mat <- model.matrix(frml, model.frame(frml, df2, na.action = NULL))
# Forecasts
rowSums(t(t(mat) * weights))
It looks like you are doing optimization, not model fitting (though there can be optimization within model fitting). You probably want something like the optim function, or to look into linear or quadratic programming (the linprog and quadprog packages).
If you insist on using modeling tools like lm, then use an offset() term in the formula to specify your own multiplier rather than having one estimated.
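As a rough sketch of the offset() suggestion: putting a term inside offset() fixes its coefficient at the value you supply, while lm() still estimates everything else. Note that this differs from the setCoeffs approach above, where all remaining coefficients are held at their original values; here the intercept and screen_size are re-estimated. The phones data frame is simulated purely for illustration, and 60 is the hand-picked memory coefficient from the question.

```r
set.seed(7)
phones <- data.frame(memory = runif(50, 1, 8), screen_size = runif(50, 4, 7))
phones$price <- 300 + 30 * phones$memory + 56 * phones$screen_size +
  rnorm(50, sd = 10)

# Unrestricted fit: every coefficient is estimated.
free_mod <- lm(price ~ memory + screen_size, data = phones)

# Memory coefficient fixed at 60; intercept and screen_size are re-estimated.
fixed_mod <- lm(price ~ offset(60 * memory) + screen_size, data = phones)

names(coef(fixed_mod))   # "(Intercept)" "screen_size": no free memory coefficient

# predict() adds the offset term back in when scoring data.
score <- predict(fixed_mod, newdata = phones)
b <- coef(fixed_mod)
score_manual <- b[["(Intercept)"]] + 60 * phones$memory +
  b[["screen_size"]] * phones$screen_size

all.equal(unname(score), score_manual)
```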

Deploying logistic regression where variable is 'cut' in R

I have a logistic regression model using glm that looks something like this:
glm(formula = output ~ cut(X1,c(1,2,3,4,5,6,7)) + X2 + X3 + X4 + X5 + X1:term + term:X5 - 1, family="binomial", data=mydata)
When I use summary(glm) I get parameter outputs for each cut of X1. Suppose I wanted to implement / deploy this model. How do I handle each of the 'cut' derived parameters? For example, if the value is between 1 and 2 do I simply use the parameter associated with 2 multiplied by the value and set all others (since the value is not in their range) to 0? Any insight is appreciated.
Categorical variables, such as those you produced with cut, become indicators (AKA dummy variables) in regression. If your value is somewhere between 1 and 2, its precise value doesn't matter: you have chosen to discard that information for your model. You simply add the parameter associated with the 1-to-2 range (times 1, if you want to think of it that way) and ignore all the others (or times 0, if you want to think of it that way).
This isn't really a programming or R-specific question - it's incidental that you're using R to bin your variable and fit your model. Any tutorial on regression with categorical variables should cover this. This one looks all right, or maybe this one (pdf link).
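The indicator logic can be checked directly in R. The sketch below uses a simplified, simulated version of the model (only the cut term and X2, no interaction terms), so the variable names and coefficients are assumptions for illustration. It scores one new case by hand, picking the single coefficient for the bin that X1 falls into, and compares the result with predict().

```r
set.seed(3)
n <- 500
X1 <- runif(n, 1, 7)
X2 <- rnorm(n)
bin <- cut(X1, 1:7)   # six bins: (1,2], (2,3], ..., (6,7]
output <- rbinom(n, 1, plogis(-1 + 0.4 * as.integer(bin) + 0.5 * X2))
mydata <- data.frame(output, X1, X2)

# With "- 1" and no intercept, every bin gets its own coefficient.
fit <- glm(output ~ cut(X1, 1:7) + X2 - 1, family = binomial, data = mydata)
b <- coef(fit)

# Score one new case by hand: only the coefficient of its bin is added.
new_x1 <- 1.5
new_x2 <- 0.2
bin_label <- as.character(cut(new_x1, 1:7))   # "(1,2]"
eta <- b[[paste0("cut(X1, 1:7)", bin_label)]] + b[["X2"]] * new_x2
p_manual <- plogis(eta)

# The same case scored by predict().
p_predict <- predict(fit, newdata = data.frame(X1 = new_x1, X2 = new_x2),
                     type = "response")

all.equal(p_manual, unname(p_predict))
```

When deploying outside R, the same steps apply: find the bin, look up its coefficient, add the remaining linear terms, and apply the inverse logit.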

extract residuals from aov()

I've run an anova using the following code:
aov2 <- aov(amt.eaten ~ salt + Error(bird / salt),data)
If I use view(aov2) I can see the residuals within the structure of aov2, but I would like to extract them in a way that doesn't involve cutting and pasting. Can someone help me out with the syntax?
Various versions of residuals(aov2) I have been using only produce NULL
I just learned that you can use proj:
x1 <- gl(8, 4)
block <- gl(2, 16)
y <- as.numeric(x1) + rnorm(length(x1))
d <- data.frame(block, x1, y)
m <- aov(y ~ x1 + Error(block), d)
m.pr <- proj(m)
m.pr[['Within']][,'Residuals']
The reason that you cannot extract residuals from this model is that you have specified a random effect due to the bird salt ratio (???). Here, each unique combination of bird and salt is treated like a random cluster having a unique intercept value but a common additive effect associated with a unit difference in salt and the amount eaten.
I can't conceive of why we would want to specify this value as a random effect in this model. But in order to sensibly analyze residuals, you may want to calculate fitted differences in each stratum according to the fitted model and optimal intercept. I think this is tedious work and not very informative, however.

How to get y_hat using predict() when both response variable and explanatory variables are log transformed?

I have a log-log linear function as:
lom1 = lm(log(y)~log(x1)+log(x2),data=mod_dt)
I want to get y_hat using the same data set and I did
yhat = exp(predict(lom1))
Result seems off a lot (from comparing with the y-hat I calculated manually in R).
Any reason?
The second related question is, I first added three more columns in the original data set mod_dt for the log transformations of y, x1 and x2. Say, they are named as logy, logx1 and logx2, and then I ran lm:
lom2 = lm(logy ~ logx1 + logx2, data=mod_dt)
This gives a different set of coefficients.
Can this give a correct y-hat by doing
exp(predict(lom2))
Many thanks in advance.
A model like yours, fitted on the log scale, corresponds to a multiplicative relationship on the untransformed scale, roughly y = exp(b0) * x1^b1 * x2^b2. You will need to provide data for examination if you want a more specific review of your results.
It's not an answer exactly. Just want to share some of my opinions. A linear regression model assumes E(y) = x * beta. If y is transformed by log, it becomes E(log(y)) = x * beta. However, when we try to predict y, exp(E(log(y))) is usually not equal to E(y), so exp(predict(...)) does not estimate the mean of y.
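Both points can be checked on simulated data. The sketch below is an illustration under assumptions (the data-generating coefficients and the error standard deviation of 0.7 are made up): it fits the model both ways, confirms that the coefficients agree when the log columns are computed from the same data, and shows that the back-transformed fitted values sit below the average level of y.

```r
set.seed(11)
n <- 1000
mod_dt <- data.frame(x1 = runif(n, 1, 10), x2 = runif(n, 1, 10))
mod_dt$y <- exp(0.5 + 0.8 * log(mod_dt$x1) + 0.3 * log(mod_dt$x2) +
                  rnorm(n, sd = 0.7))

# Version 1: transformations inside the formula.
lom1 <- lm(log(y) ~ log(x1) + log(x2), data = mod_dt)

# Version 2: pre-computed log columns.
mod_dt$logy  <- log(mod_dt$y)
mod_dt$logx1 <- log(mod_dt$x1)
mod_dt$logx2 <- log(mod_dt$x2)
lom2 <- lm(logy ~ logx1 + logx2, data = mod_dt)

# The two fits agree; only the coefficient names differ.
all.equal(unname(coef(lom1)), unname(coef(lom2)))

# Back-transforming predicts on the original scale, but not the mean of y:
# exp(E(log(y))) is below E(y) when the errors on the log scale have spread.
yhat <- exp(predict(lom1))
c(mean_yhat = mean(yhat), mean_y = mean(mod_dt$y))
```

If the two versions give different coefficients on real data, a likely cause is that the log columns were not computed from exactly the same rows or values, for example because of zeros, missing values, or a different log base.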
