Consider the two data.frames below. In each case I want to extract the intercept and the slopes for the three variables from the associated model.
set.seed(911)
df1 <- data.frame(y=rnorm(10) + 1:10, x=1:10, x2=rnorm(10), x3 = rnorm(10))
model1 <- lm(y ~ x + x2 + x3, data = df1)
summary(model1)
summary(model1)$coefficients[1]
summary(model1)$coefficients[2]
summary(model1)$coefficients[3]
summary(model1)$coefficients[4]
set.seed(911)
df2 <- data.frame(y=rnorm(10) + 1:10, x=1:10, x2=1, x3 = rnorm(10))
model2 <- lm(y ~ x + x2 + x3, data = df2)
summary(model2)
summary(model2)$coefficients[1]
summary(model2)$coefficients[2]
summary(model2)$coefficients[3]
summary(model2)$coefficients[4]
However, in the second example there is no variation in x2, so its coefficient estimate is NA. Importantly, summary(model2) prints the NA, but summary(model2)$coefficients[3] does not return the NA; it skips it and moves on to the next parameter.
But instead I would want:
0.9309032
0.8736204
NA
0.5494
If I do not know in advance which coefficients will be NA, i.e. it could be x, x2, or x2 & x3, or even something like x & x2 & x3, how can I return the result I want?
Grab them directly from the model; there is no need for summary():
> model2$coefficients
(Intercept) x x2 x3
0.9309032 0.8736204 NA 0.5493671
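Equivalently, the coef() extractor returns the same named vector, so you can also pick coefficients by name and the NA stays where it belongs (a small illustration, using model2 from above):
# indexing by name avoids guessing positions and preserves the NA
coef(model2)["x2"]
# x2
# NA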
I have a dataset with about 75 rows and 25 columns; each row is one student and each column shows a score between 1 and 5.
S1 S2 ..... S24
x1 0 2 ..... 2
x2 1 3 ..... NA
x3 NA 4 ..... 4
x4 4 NA ..... 2
x5 4 3 ..... 2
I want to get the intercept and slope of each row, ignoring the NA values, and add them to the original dataset. I am using the code below, but it still includes the NA values. I am using R.
df = read.csv('exc.csv')
Slope = function(x) {
  TempDF = data.frame(x, survey = 1:ncol(df))
  lm(x ~ survey, data = TempDF, na.rm = TRUE)$coefficients[2]
}
Intercept = function(x) {
  TempDF = data.frame(x, survey = 1:ncol(df))
  lm(x ~ survey, data = TempDF, na.rm = TRUE)$coefficients[1]
}
TData = as.data.frame(t(df))
dataset$Intercept = sapply(TData, Intercept)
dataset$slope = sapply(TData, Slope)
The regression by itself already uses only the pairs of non-NA values (lm's default na.action drops incomplete rows), so the NA values will not affect the slope or intercept in your case:
set.seed(100)
y = rnorm(100)
x = rnorm(100)
y[1:10] = NA
x[91:100] = NA
df = data.frame(x,y)
lm(y ~x,data=df)
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
0.02871 -0.15929
And if we keep only the pairs of x and y with no NAs, we get exactly the same fit:
df = df[!is.na(df$x) & !is.na(df$y),]
lm(y ~x,data=df)
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
0.02871 -0.15929
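Incidentally, complete.cases() is the idiomatic shortcut for that row filter:
# equivalent filter: keep only rows with no NA in any column
df <- df[complete.cases(df), ]
lm(y ~ x, data = df)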
If you also need the NA-free data for something else, here's one way to do it; this version fits once per row and also records the number of non-NA observations used:
#simulate your data
df = data.frame(matrix(sample(1:5,25*5,replace=TRUE),ncol=25))
colnames(df) = paste("S",1:25,sep="")
#make some NAs
df[cbind(c(1,3,5),c(2,3,4))] <- NA
# fit once per row; return intercept, slope, and number of observations used
Coef = function(x) {
  TempDF = data.frame(x, survey = 1:ncol(df))
  TempDF = TempDF[!is.na(x), ]
  c(lm(x ~ survey, data = TempDF)$coefficients, n = nrow(TempDF))
}
TData = as.data.frame(t(df))
dataset = data.frame(t(sapply(TData, Coef)))
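If, as in the question, you want the intercept and slope added back to the original data, a minimal sketch (sapply preserves column order of TData, so the rows of dataset line up with the rows of df):
# attach intercept, slope and n back onto the original rows
df <- cbind(df, dataset)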
Excuse my naiveté. I'm not sure what this type of model is called -- perhaps panel regression.
Imagine I have the following data:
n <- 100
x1 <- rnorm(n)
y1 <- x1 * 0.5 + rnorm(n)/2
x2 <- rnorm(n)
y2 <- x2 * 0.5 + rnorm(n)/2
x3 <- rnorm(n)
y3 <- x3 * 0.25 + rnorm(n)/2
x4 <- rnorm(n)
y4 <- x4 * 0 + rnorm(n)/2
x5 <- rnorm(n)
y5 <- x5 * -0.25 + rnorm(n)/2
x6 <- rnorm(n)
y6 <- x6 * -0.5 + rnorm(n) + rnorm(n)/2
x7 <- rnorm(n)
y7 <- x7 * -0.75 + rnorm(n)/2
foo <- data.frame(s = rep(1:100, times = 7),
                  y = c(y1, y2, y3, y4, y5, y6, y7),
                  x = c(x1, x2, x3, x4, x5, x6, x7),
                  i = rep(1:7, each = n))
Here y and x are individual AR1 time series measured over 100 seconds (I use 's' instead of 't' for the time variable), divided equally into groups (i). I wish to model these as:
y_t = b_0 + b_1 * y_{t-1} + b_2 * x_t + e_t
but while taking the group (i) into account:
y_{it} = b_0 + b_1 * y_{i,t-1} + b_2 * x_{it} + e_{it}
I wish to know whether b_2 (the coefficient on x) is a good predictor of y and how that coefficient varies with group. I also want to know the R2 and RMSE by group, and to predict y_i given x_i and i. The grouping variable can be discrete or continuous.
I gather that this type of problem is called panel regression, but the term is not familiar to me. Is using plm in R a good approach to investigate this problem?
Based on a comment, I guess this is a simple start:
require(dplyr)
require(broom)
fitted_models <- foo %>% group_by(i) %>% do(model = lm(y ~ x, data = .))
fitted_models %>% tidy(model)
fitted_models %>% glance(model)
Since you don't include fixed or random effects in the model, we are dealing with pooled OLS (POLS), which can be estimated using lm or plm.
Let's construct example data of 100 groups and 100 observations for each:
df <- data.frame(x = rnorm(100 * 100), y = rnorm(100 * 100),
group = factor(rep(1:100, each = 100)))
df$ly <- unlist(tapply(df$y, df$group, function(x) c(NA, head(x, -1))))
head(df, 2)
# x y group ly
# 1 1.7893855 1.2694873 1 NA
# 2 0.8671304 -0.9538848 1 1.2694873
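The tapply line builds a within-group lag of y, with NA for the first observation of each group. If you prefer, ave() should give the same ly column without the unlist:
# within-group lag of y; each group's first value becomes NA
df$ly <- ave(df$y, df$group, FUN = function(x) c(NA, head(x, -1)))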
Then
m1 <- lm(y ~ ly + x:group, data = df)
is a model with a common autoregressive coefficient and a group-dependent effect of x:
head(coef(m1)[-1:-2], 5)
# x:group1 x:group2 x:group3 x:group4 x:group5
# -0.02057244 0.06779381 0.04628942 -0.11384630 0.06377069
This allows you to plot them, and so on. One thing you will probably want to do is test whether those coefficients are equal; that can be done as follows:
m2 <- lm(y ~ ly + x, data = df)
library(lmtest)
lrtest(m1, m2)
# Likelihood ratio test
#
# Model 1: y ~ ly + x:group
# Model 2: y ~ ly + x
# #Df LogLik Df Chisq Pr(>Chisq)
# 1 103 -14093
# 2 4 -14148 -99 110.48 0.2024
Hence, we cannot reject that the effects of x are the same, as expected.
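As for the per-group R2 and RMSE you mention, here is a minimal sketch using the same dplyr do() idiom as the update in your question; foo and the grouping column i are from the question above, and RMSE is computed by hand from the residuals:
library(dplyr)
# per-group fit statistics for the simple y ~ x model
group_stats <- foo %>%
  group_by(i) %>%
  do({
    m <- lm(y ~ x, data = .)
    data.frame(r.squared = summary(m)$r.squared,
               rmse = sqrt(mean(residuals(m)^2)))
  })
group_stats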
I have some data synthetically generated from a function which is shown below.
fn <- function(w1, w2) {
  f <- -(0.1 + 1.3*w1 + 0.4*w2 - 1.8*w1*w1 - 1.8*w2*w2)
  return(f)
}
Next I create a data frame with the values as shown below
x = data.frame(
  yval = fn(seq(0.1, 0.9, by = 0.01), seq(1.1, 0.3, by = -0.01)),
  x1 = seq(0.1, 0.9, by = 0.01),
  x2 = seq(1.1, 0.3, by = -0.01)
)
I want to see whether I can recover the coefficients of the polynomial in fn by using a polynomial fit, which I attempt as shown below:
fit = lm(yval ~ x1 + x2 + I(x1^2) + I(x2^2),data=x)
coef(fit)
However, when I run the above code, I get the following:
(Intercept) x1 x2 I(x1^2) I(x2^2)
2.012 -5.220 NA 3.600 NA
It appears that the term x2 was never "detected". Does anybody know what I could be doing wrong? I know that if I create synthetic linear data and try to recover the coefficients using lm, I get them back fairly accurately. Thanks in advance.
The problem is that along your one-dimensional sequence the predictors are perfectly collinear (x1 + x2 is always 1.2), so lm drops the aliased terms; see the check after the output below. If you're fitting a function of 2 predictors, you want to evaluate it on a grid with expand.grid:
x <- expand.grid(x1=seq(0.1, 0.9, by=0.01), x2=seq(1.1, 0.3, by=-0.01))
x$yval <- with(x, fn(x1, x2))
fit = lm(yval ~ x1 + x2 + I(x1^2) + I(x2^2),data=x)
coef(fit)
(Intercept) x1 x2 I(x1^2) I(x2^2)
-0.1 -1.3 -0.4 1.8 1.8
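To confirm the collinearity in the original data:
# in the original data x2 is exactly 1.2 - x1, so x2 and I(x2^2)
# are linear combinations of the intercept, x1 and I(x1^2)
x1 <- seq(0.1, 0.9, by = 0.01)
x2 <- seq(1.1, 0.3, by = -0.01)
all(abs(x1 + x2 - 1.2) < 1e-12)
# [1] TRUE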
Is it possible to replace the coefficients in an lm object? I thought the following would work:
# sample data
seed(2157010)
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x + rnorm(length(x1))
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
# replace coefficients with new values
fit$coef[2:3] <- c(5, 1)
# view new coefficients
coef(fit)
Any assistance would be greatly appreciated.
Your code is not reproducible, as there are a few errors in it. Here's a corrected version, which also shows your mistake:
set.seed(2157010) #forgot set.
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x2 + rnorm(length(x1)) #you had x, not x1 or x2
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
(Intercept) x1 x2
260.55645444 -0.04276353 2.91272272
# replace coefficients with new values; use the full component name, which is coefficients:
fit$coefficients[2:3] <- c(5, 1)
# view new coefficients
coef(fit)
(Intercept) x1 x2
260.5565 5.0000 1.0000
So the problem was that you were using fit$coef, although the name of the component in the lm output is really coefficients. The abbreviated version works for getting the values (thanks to partial matching), but not for setting: the assignment created a new component named coef, while the coef() function kept extracting the values of fit$coefficients.
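One caveat worth knowing (this is general predict.lm behavior, not specific to this answer): predictions on new data are built from the model matrix and object$coefficients, so they pick up the replaced values, while stored quantities such as fitted(fit) and residuals(fit) still reflect the original fit:
# uses the replaced coefficients (model matrix from newdata times coefficients)
predict(fit, newdata = data.frame(x1 = 2012, x2 = 2012))
# still the fitted values from the original least-squares fit
head(fitted(fit))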
I'm looking for suggestions on how to deal with NA's in linear regressions when all occurrences of an independent/explanatory variable are NA (i.e. x3 below).
I know the obvious solution would be to exclude the independent/explanatory variable in question from the model, but I am looping through multiple regions and would prefer not to have a different functional form for each region.
Below is some sample data:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
# Quick plot of data
library(ggplot2)
library(reshape2)
df.melt <- melt(df, id = c("time"))
p <- ggplot(df.melt, aes(x = time, y = value)) +
  geom_line() + facet_grid(variable ~ .)
p
I have read the documentation for lm and tried various na.action settings without success:
lm(y~x1+x2+x3, data=df, singular.ok=TRUE)
lm(y~x1+x2+x3, data=df, na.action=na.omit)
lm(y~x1+x2+x3, data=df, na.action=na.exclude)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.omit)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.exclude)
Is there a way to get lm to run without error and simply return a coefficient for the explanatory variable in question that reflects its lack of explanatory power (i.e. either zero or NA)?
Here's one idea:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
replaceNA <- function(x) {
  if (all(is.na(x))) {
    rep(0, length(x))
  } else x
}
lm(y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
Call:
lm(formula = y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
Coefficients:
(Intercept) x1 x2 x3
0.05467 1.01133 -0.10613 NA
lm(y~x1+x2, data=df)
Call:
lm(formula = y ~ x1 + x2, data = df)
Coefficients:
(Intercept) x1 x2
0.05467 1.01133 -0.10613
So you replace the variables that contain only NAs with a variable that contains only 0s. You get the coefficient value NA, but all the relevant parts of the model fit are the same (except the QR decomposition, but if information about that is needed, it can easily be modified). Note that the component summary(fit)$aliased (see ?alias) might be useful.
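For completeness, a quick look at that component on the zero-filled fit from above:
fit <- lm(y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
# named logical vector; TRUE marks aliased coefficients (here x3)
summary(fit)$aliased
# (Intercept)          x1          x2          x3
#       FALSE       FALSE       FALSE        TRUE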
This seems to relate to your other question: Replace lm coefficients in R.
You won't be able to include a column with all NA values; it does strange things to model.matrix:
x1 <- 1:5
x2 <- rep(NA,5)
model.matrix(~x1+x2)
(Intercept) x1 x2TRUE
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$x2
[1] "contr.treatment"
So your alternative is to programmatically create the model formula based on the data.
Something like...
make_formula <- function(variables, data, response = 'y') {
  if (missing(data)) stop('data not specified')
  # keep only the variables that are not entirely NA
  using <- Filter(function(i) !all(is.na(data[[i]])), variables)
  deparse(reformulate(using, response))
}
variables <- c('x1','x2','x3')
make_formula(variables, data =df)
[1] "y ~ x1 + x2"
I've used deparse to return a character string so that there are no environment issues from creating the formula within the function. lm can happily take a character string which is a valid formula.
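To close the loop, a minimal sketch of using it inside the per-region loop, relying on lm accepting the character string as noted:
# fit each region using only the variables that are not entirely NA there
f <- make_formula(variables, data = df)
fit <- lm(f, data = df)
coef(fit)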