lm(y~x*g) ignoring one value for g - r

I'm trying to use R for the first time.
In this case, y is oxygen consumption, x is time, and g is a status code of up to three letters (NYF, IR, F, M, or NF). The model runs regressions for each status except F.
[Side note: I've also tried accomplishing this with multiple regressions using the subset function. When I use
lm(O2~time,subset(data,Status=="NYF"))
it does not actually adhere to the subset and gives me a regression for the entire data set regardless of which status I enter.]
How do I get multiple simple linear regressions from a single data set based on the codes in the status column?

Your question isn't clear. Suppose you have a data frame, dd, with three columns: y, x, g. The variables y and x are numeric and g takes the values NYF, IR, F, M, or NF. To carry out a simple linear regression for a particular status:
lm(y ~ x, data=dd[dd$g=="NYF",])
#Or
lm(y ~ x, data=dd[dd$g=="IR",])
To perform multiple linear regression, try
lm(y ~ x + g, data=dd)
where the presence or absence of each factor level is indicated by a binary (dummy) variable.
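If you want one simple regression per status in a single step, here is a sketch using split() and lapply(), assuming the data frame dd described above:
# Fit a separate simple regression for each level of g (sketch, using dd from above)
fits <- lapply(split(dd, dd$g), function(d) lm(y ~ x, data = d))
lapply(fits, coef)  # slope and intercept for each status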

lm(y~x,subset(dd,g=='NYF'))
is appropriate syntax to fit the line for a single status (although others are giving you variants that will work). I would check to make sure your data frame is indeed named "data" and your status variable is named "Status".
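As an aside, lm() also has its own subset argument, which sidesteps the subset() call entirely; a sketch assuming the names from the question (data frame data with columns O2, time, Status):
# Sketch: lm()'s built-in subset argument, assuming the question's column names
lm(O2 ~ time, data = data, subset = Status == "NYF")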

Related

How to extract a single Y variable value using an X variable without using a plot

Me again!
In one of my assignments I have to create a plot with a regression line and simply read data off this plot.
Question: "at 80 degrees F what is the wind-speed?"
By simply looking at the plot you can state it's ~9 m/s at 80 °F. This would suffice, but knowing what you can do in R, I would like to know how to do it properly, both now and for future reference.
How does one, using only the data frame (in the picture), extract a Y value for a given X value using linear regression?
Linear regression, because the value itself isn't given, but it can be estimated if you assume a linear relationship.
So in essence, instead of reading the value off the plot (pic 2), I would like to use a function that, given an X (temp) value in the data frame, prints out a Y (wind) value using linear regression.
I tried other things I found on Stack Overflow, such as
lm(data~data, dataframe)
but that didn't give me the result I desired.
You might look at the predict function.
First fit a linear regression and then calculate the predicted value with predict. Just keep in mind that you pass your X value in a data.frame.
datasets::airquality
lm_air <- lm(Wind ~ Temp, airquality)
predict(lm_air, data.frame(Temp = 80))
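The same call works for several temperatures at once, since newdata just needs a Temp column (a small sketch):
# Predict wind speed for several temperatures at once (sketch)
predict(lm_air, data.frame(Temp = c(70, 80, 90)))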

How to plot a comparison of two fixed categorical values for linear regression of another continuous variable

So I want to plot this:
lmfit <- lm(y ~ a + b)
but, "b" only has the values of zero and one. So, I want to plot two separate regression lines, that are paralel to one one another to show the difference that b makes to the y-intercept. So after plotting this:
plot(b,y)
I want to then use abline(lmfit,col="red",lwd=2) twice, once with the value of b set to zero and once with it set to one. So once without the b term contributing, and once where the b coefficient is added in full.
To restate: b is categorical, 0 or 1. a is continuous with a slight linear trend.
Thank you.
You might want to consider using predict(...) with b=0 and b=1, as follows. Since you didn't provide any data, I'm using the built-in mtcars dataset.
lmfit <- lm(mpg~wt+cyl,mtcars)
plot(mpg~wt,mtcars,col=mtcars$cyl,pch=20)
curve(predict(lmfit,newdata=data.frame(wt=x,cyl=4)),col=4,add=TRUE)
curve(predict(lmfit,newdata=data.frame(wt=x,cyl=6)),col=6,add=TRUE)
curve(predict(lmfit,newdata=data.frame(wt=x,cyl=8)),col=8,add=TRUE)
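A legend makes it easier to tell the three lines apart (a small sketch, assuming the plot above is still the active device):
# Label the three cyl lines (sketch)
legend("topright", legend = paste("cyl =", c(4, 6, 8)), col = c(4, 6, 8), lty = 1)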
Given you have an additive lm model to begin with, drawing the lines is pretty straightforward, even though not completely intuitive. I tested it with the following simulated data:
y <- rnorm(30)
a <- rep(1:10,times=3)
b <- rep(c(1,0),each=15)
LM <- lm(y~a+b)
You have to access the coefficient values in the lm object. They are in:
LM$coefficients
Here comes the tricky part: you have to assign the coefficients for each line.
The first one is easy:
abline(LM$coef[1],LM$coef[2])
The other one is a bit more complicated: since R works with additive coefficients, for the second line you have:
abline(LM$coef[1]+LM$coef[3],LM$coef[2])
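Putting the pieces together with the simulated data above, a minimal sketch:
# Plot the simulated data, colouring points by b, then add both lines
plot(a, y, col = b + 1, pch = 20)
abline(LM$coef[1], LM$coef[2])                # line for b = 0
abline(LM$coef[1] + LM$coef[3], LM$coef[2])   # line for b = 1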
I hope this is what you were expecting.
Unless I've misunderstood the question, all you have to do is run abline again but on a model without the b term.
abline(lm(y~a),col="red",lwd=2)

Compare model fits of two GAMs

I have a matrix Expr with rows representing variables and columns samples.
I have a categorical vector called groups (containing either "A","B", or "C")
I want to test which of the variables in Expr can be explained by the fact that the samples belong to a group.
My strategy would be modelling the problem with a generalized additive model (with a negative binomial distribution).
And then I want to use a likelihood ratio test, variable-wise, to get a p-value for each variable.
I do:
require(VGAM)
m <- vgam(Expr ~ group, family=negbinomial)
m_alternative <- vgam(Expr ~ 1, family=negbinomial)
and then:
lr <- lrtest(m, m_alternative)
The last step is wrong because it tests the overall likelihood ratio of the two models, not the variable-wise ratios.
Instead of a single p value I would like to get a vector of the p-values for every variable.
How should I do it?
(I am very new to R, so forgive me my stupidity)
It sounds like you want to use Expr as your predictors. I think you may have your formula backwards: the response should be on the left, so I guess that's groups in your case.
If Expr is a data.frame, you can do regression on all variables with
m <- vgam(group ~ ., Expr, family=negbinomial)
If class(Expr)=="matrix", then
m <- vgam(group ~ Expr, family=negbinomial)
should probably work, but you may get slightly odd-looking coefficient labels.
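If the aim really is one p-value per variable, as in the original strategy, here is a sketch that loops over the rows of Expr and does the likelihood-ratio test by hand. It assumes rows are variables, columns are samples, and groups is a factor, and it uses vglm rather than vgam since no smooth terms are involved:
# Sketch: one negative-binomial fit and LR test per variable (row of Expr)
require(VGAM)
groups <- factor(groups)
pvals <- apply(Expr, 1, function(expr_row) {
  m_full <- vglm(expr_row ~ groups, family = negbinomial)
  m_null <- vglm(expr_row ~ 1, family = negbinomial)
  lr_stat <- 2 * as.numeric(logLik(m_full) - logLik(m_null))
  pchisq(lr_stat, df = nlevels(groups) - 1, lower.tail = FALSE)
})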

How to do statistics and save results in a loop in R

In modeling it is helpful to do univariate regressions of a dependent variable on an independent variable in linear, quadratic, cubic and quartic forms, to see which captures the basic shape of the data. I'm a fairly new R programmer and need some help.
Here's pseudocode:
for i in 1:ncol(data)
  data[, ncol(data) + 1] <- data[, i]^2   # create squared term
  data[, ncol(data) + 1] <- data[, i]^3   # create cubed term
  # ...and similarly for the fourth-power term
# now do four regressions, starting with linear and adding one higher-order term
# each time, and display for each i the form of regression with the highest adj R2
lm(y ~ data[, i], ...)
# retrieve adj R2 and save it, indexed by i, for the linear case
lm(y ~ data[, i] + data[, ncol(data) + i], ...)
# retrieve adj R2 and save...
The result is a data frame, indexed by i, with the column name of the original x variable in data and the results of each of the four regressions (all run with an intercept term).
Ordinarily we would do this by looking at plots, but with 800 variables that is not feasible.
If you really want to help out, write code to automatically insert the required number of exponentiated variables into data.
And this doesn't even take care of the kinky variables that come clumped up in a couple of clusters or are only relevant at one value, etc.
I'd say the best way to do this is by using the polynomial function in R, poly(). Imagine you have an independent numeric variable, x, and a numeric response variable, y.
models <- list()
for (i in 1:4) {
  models[[i]] <- lm(y ~ poly(x, i, raw = TRUE))
}
The raw = TRUE argument to poly() ensures that the model uses raw polynomials rather than orthogonal polynomials.
When you want to get one of the models, just type in models[[1]] or models[[2]], etc.
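To apply this across many candidate predictors and pick the degree with the best adjusted R², here is a sketch assuming data is a data frame of numeric predictor columns and y is the response vector:
# Sketch: for each predictor column, fit degrees 1-4 and keep the degree with
# the highest adjusted R^2 (assumes 'data' holds numeric predictors, 'y' the response)
best_degree <- sapply(names(data), function(v) {
  adj_r2 <- sapply(1:4, function(deg) {
    summary(lm(y ~ poly(data[[v]], deg, raw = TRUE)))$adj.r.squared
  })
  which.max(adj_r2)
})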

Feeding newdata to R predict function

R's predict function can take a newdata parameter, and its documentation reads:
newdata An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used.
But I found that this is not entirely true, depending on how the model was fit. For instance, the following code works as expected:
x <- rnorm(200, sd=10)
y <- x + rnorm(200, sd=1)
data <- data.frame(x, y)
train = sample(1:length(x), size=length(x)/2, replace=F)
dataTrain <- data[train,]
dataTest <- data[-train,]
m <- lm(y ~ x, data=dataTrain)
head(predict(m,type="response"))
head(predict(m,newdata=dataTest,type="response"))
But if the model is fit as such:
m2 <- lm(dataTrain$y ~ dataTrain$x)
head(predict(m2,type="response"))
head(predict(m2,newdata=dataTest,type="response"))
The last two lines produce exactly the same result. The predict function effectively ignores the newdata parameter, i.e. it can't compute predictions on new data at all.
The culprit, of course, is lm(y ~ x, data=dataTrain) versus lm(dataTrain$y ~ dataTrain$x). But I didn't find any documentation that mentions the difference between these two. Is it a known issue?
I'm using R 2.15.2.
See ?predict.lm and the Note section, which I quote below:
Note:
Variables are first looked for in ‘newdata’ and then searched for
in the usual way (which will include the environment of the
formula used in the fit). A warning will be given if the
variables found are not of the same length as those in ‘newdata’
if it was supplied.
Whilst it doesn't state the behaviour in terms of "same name" etc., as far as the formula is concerned the terms you passed to it were of the form foo$var, and there are no variables with such names either in newdata or along the search path that R will traverse to look for them.
In your second case, you are totally misusing the model formula notation; the idea is to describe the model succinctly and symbolically. Succinctness and repeating the data object ad nauseam are not compatible.
The behaviour you note is exactly consistent with the documented behaviour. In simple terms, you fitted the model with terms data$x and data$y then tried to predict for terms x and y. As far as R is concerned those are different names and hence different things and it did right to not match them.
