Consider a two variable (Y1, Y2) problem, with each variable defined as follows:
Y1 = 1 + Z1, and Y1 is fully observed
Y2 = 5 + 2*(Z1) + Z2, and Y2 is missing if 2*(Y1 − 1) + Z3 < 0
Z1, Z2, and Z3 follow independent standard normal distributions.
How would we go about simulating a (complete) dataset of size 500 on (Y1, Y2)? This is what I have so far:
n <- 500
y <- rnorm(n)
How would we simulate the corresponding observed dataset (by imposing missingness
on Y2)? I'm not sure where to go with this question.
n <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)
y1 <- 1 + z1
y2 <- 5 + 2*z1 + z2
Display the marginal distribution of Y2 for the complete (as originally simulated) and observed (after imposing missingness) data.
Another way to display the distributions, in addition to the great explanation from @jay.sf, is to build the missing-data mechanism into a new variable and compare y2 and y2_missing:
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(123)
#Data
n <- 500
#Random vars
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)
#Design Y1 and Y2
y1 <- 1 + z1
y2 <- 5 + 2*z1 + z2
# Copy y2 and impose missingness according to the mechanism
y2_missing <- y2
index <- which(2*(y1 - 1) + z3 < 0)
y2_missing[index] <- NA
#Complete dataset
df <- data.frame(y1,y2,y2_missing)
#Plot distributions
df %>%
  select(-y1) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_density(alpha = 0.5) +
  ggtitle('Distribution for y2 and y2_missing') +
  labs(fill = 'Variable') +
  theme_bw()
Output: an overlaid density plot of y2 and y2_missing (figure not shown).
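Because missingness depends on y1 (cases with small y1 are dropped, and y2 is positively correlated with y1 through z1), the observed distribution shifts to the right. A quick numeric check on the df built above:
summary(df$y2)              ## complete data
summary(df$y2_missing)      ## observed data (with NAs)
mean(is.na(df$y2_missing))  ## proportion set to missing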
You probably want to include an error term in your data simulation, so another rnorm(n) vector with mean zero should be added to the equation.
set.seed(635) ## for the sake of reproducibility
n <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
To get the missings you may sample a percentage of the vector and set it to NA.
y2 <- 5 + 2*z1 + z2 + rnorm(n) ## add error term independent of the `z`s
pct.mis <- .1 ## percentage missings
y2[sample(length(y2), length(y2)*pct.mis)] <- NA
## check 1: resulting missings
prop.table(table(is.na(y2)))
# FALSE TRUE
# 0.9 0.1
summary(y2)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# -2.627 3.372 5.123 4.995 6.643 13.653 50
## check 2: rounded coefficients resemble equation
fit <- lm(y2 ~ z1 + z2)
round(fit$coe)
# (Intercept) z1 z2
# 5 2 1
## check 3: number of fitted values equals number of non-missing obs.
length(fit$fitted.values) / length(y2)
# [1] 0.9
I have shared the top 9 rows of the data I am working on in the image below (y0 to y6 are outputs, the rest are inputs):
My objective is to get fitted output data for y0 to y6.
I tried lm function in R using the commands:
lm1 <- lm(cbind(y0, y1, y2, y3, y4, y5, y6) ~ tt + tcb + s + l + b, data = table3)
summary(lm1)
And it has returned 7 sets of coefficients like "Response y0", "Response y1", etc.
What I really want is just 1 set of coefficients which can predict values for outputs y0 to y6.
Could you please help in this?
By cbind(y0, y1, y2, y3, y4, y5, y6) we fit 7 independent models (which is arguably the better idea).
For what you are looking for, stack your y* variables, replicate other independent variables and do a single regression.
Y <- c(y0, y1, y2, y3, y4, y5, y6)
tt. <- rep(tt, times = 7)
tcb. <- rep(tcb, times = 7)
s. <- rep(s, times = 7)
l. <- rep(l, times = 7)
b. <- rep(b, times = 7)
fit <- lm(Y ~ tt. + tcb. + s. + l. + b.)
Predicted values for y* are
matrix(fitted(fit), ncol = 7)
For readers other than the OP
Here is a tiny reproducible example (with only one covariate x and two replicates y1, y2) to help digest the issue.
set.seed(0)
dat_wide <- data.frame(x  = round(runif(4), 2),
                       y1 = round(runif(4), 2),
                       y2 = round(runif(4), 2))
# x y1 y2
#1 0.90 0.91 0.66
#2 0.27 0.20 0.63
#3 0.37 0.90 0.06
#4 0.57 0.94 0.21
## The original "mlm"
fit_mlm <- lm(cbind(y1, y2) ~ x, data = dat_wide)
Instead of doing c(y1, y2) and rep(x, times = 2), I would use the reshape function from R base package stats, as such operation is essentially a "wide" to "long" dataset reshaping.
dat_long <- stats::reshape(dat_wide,           ## wide dataset
                           varying = 2:3,      ## columns 2:3 are replicates
                           v.names = "y",      ## the stacked variable is called "y"
                           direction = "long") ## reshape to "long" format
# x time y id
#1.1 0.90 1 0.91 1
#2.1 0.27 1 0.20 2
#3.1 0.37 1 0.90 3
#4.1 0.57 1 0.94 4
#1.2 0.90 2 0.66 1
#2.2 0.27 2 0.63 2
#3.2 0.37 2 0.06 3
#4.2 0.57 2 0.21 4
Extra variables time and id are created. The former tells which replicate a case comes from; the latter tells which record that case is within a replicate.
To fit the same model for all replicates, we do
fit1 <- lm(y ~ x, data = dat_long)
#(Intercept) x
# 0.2578 0.5801
matrix(fitted(fit1), ncol = 2) ## there are two replicates
# [,1] [,2]
#[1,] 0.7798257 0.7798257
#[2,] 0.4143822 0.4143822
#[3,] 0.4723891 0.4723891
#[4,] 0.5884029 0.5884029
Don't be surprised that two columns are identical; there is only a single set of regression coefficients for both replicates after all.
If you think carefully, we can do the following instead:
dat_wide$ymean <- rowMeans(dat_wide[2:3]) ## average all replicates
fit2 <- lm(ymean ~ x, data = dat_wide)
#(Intercept) x
# 0.2578 0.5801
and we will get the same point estimates. Standard errors and other summary statistics will differ, though, as the two models have different sample sizes.
coef(summary(fit1))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2577636 0.2998382 0.8596755 0.4229808
#x 0.5800691 0.5171354 1.1216967 0.3048657
coef(summary(fit2))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2577636 0.01385864 18.59949 0.002878193
#x 0.5800691 0.02390220 24.26844 0.001693604
Excuse my naiveté. I'm not sure what this type of model is called -- perhaps panel regression.
Imagine I have the following data:
n <- 100
x1 <- rnorm(n)
y1 <- x1 * 0.5 + rnorm(n)/2
x2 <- rnorm(n)
y2 <- x2 * 0.5 + rnorm(n)/2
x3 <- rnorm(n)
y3 <- x3 * 0.25 + rnorm(n)/2
x4 <- rnorm(n)
y4 <- x4 * 0 + rnorm(n)/2
x5 <- rnorm(n)
y5 <- x5 * -0.25 + rnorm(n)/2
x6 <- rnorm(n)
y6 <- x6 * -0.5 + rnorm(n) + rnorm(n)/2
x7 <- rnorm(n)
y7 <- x7 * -0.75 + rnorm(n)/2
foo <- data.frame(s = rep(1:100, times = 7),
                  y = c(y1, y2, y3, y4, y5, y6, y7),
                  x = c(x1, x2, x3, x4, x5, x6, x7),
                  i = rep(1:7, each = n))
Here y and x are individual AR(1) time series measured over 100 seconds (I use 's' instead of 't' for the time variable), divided equally into groups (i). I wish to model these as:
y_t = b_0 + b_1*y_{t-1} + b_2*x_t + e_t
but while taking the group (i) into account:
y_{it} = b_0 + b_1*y_{i,t-1} + b_2*x_{it} + e_{it}
I wish to know if b_2 (the coef on x) is a good predictor of y and how that coef varies with group. I also want to know the R2 and RMSE by group and to predict y_i given x_i and i. The grouping variable can be discrete or continuous.
I gather that this type of problem is called panel regression but it is not a term that is familiar to me. Is using plm in R a good approach to investigate this problem?
Based on the comment below, I guess this is a simple start:
require(dplyr)
require(broom)
fitted_models <- foo %>% group_by(i) %>% do(model = lm(y ~ x, data = .))
fitted_models %>% tidy(model)
fitted_models %>% glance(model)
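To also get RMSE by group (one of the stated goals), here is a sketch in the same do() style, computing it from the per-group residuals with broom::augment():
library(broom)
foo %>%
  group_by(i) %>%
  do(augment(lm(y ~ x, data = .))) %>%
  summarise(rmse = sqrt(mean(.resid^2)))  ## root mean squared residual per group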
Since you don't include fixed or random effects in the model, we are dealing with pooled OLS (POLS), which can be estimated using lm or plm.
Let's construct example data of 100 groups and 100 observations for each:
df <- data.frame(x = rnorm(100 * 100), y = rnorm(100 * 100),
                 group = factor(rep(1:100, each = 100)))
df$ly <- unlist(tapply(df$y, df$group, function(x) c(NA, head(x, -1))))
head(df, 2)
# x y group ly
# 1 1.7893855 1.2694873 1 NA
# 2 0.8671304 -0.9538848 1 1.2694873
Then
m1 <- lm(y ~ ly + x:group, data = df)
is a model with a common autoregressive coefficient and a group-dependent effect of x:
head(coef(m1)[-1:-2], 5)
# x:group1 x:group2 x:group3 x:group4 x:group5
# -0.02057244 0.06779381 0.04628942 -0.11384630 0.06377069
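This allows you to plot them, etc.; for instance, a minimal sketch of the spread of the group-specific slopes:
xcoef <- coef(m1)[-1:-2]  ## drop the intercept and the ly coefficient
plot(xcoef, xlab = "group", ylab = "coefficient on x")
abline(h = 0, lty = 2)    ## reference line at zero, the true common effect here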
I suppose one thing that you will want to do is to test whether those coefficients are equal. That can be done as follows:
m2 <- lm(y ~ ly + x, data = df)
library(lmtest)
lrtest(m1, m2)
# Likelihood ratio test
#
# Model 1: y ~ ly + x:group
# Model 2: y ~ ly + x
# #Df LogLik Df Chisq Pr(>Chisq)
# 1 103 -14093
# 2 4 -14148 -99 110.48 0.2024
Hence, we cannot reject that the effects of x are the same, as expected.
I'm running a regression in the form
reg=lm(y ~ x1+x2+x3+z1,data=mydata)
In the place of the last term, z1, I want to loop through a set of different variables, z1 through z10, running a regression for each with it as the last term. E.g. in second run I want to use
reg=lm(y ~ x1+x2+x3+z2,data=mydata)
in 3rd run:
reg=lm(y ~ x1+x2+x3+z3,data=mydata)
How can I automate this by looping through the list of z-variables?
While what Sam has provided works and is a good solution, I would personally prefer to go about it slightly differently. His answer has already been accepted, so I'm just posting this for the sake of completeness.
dat1 <- data.frame(y = rpois(100, 5),
                   x1 = runif(100),
                   x2 = runif(100),
                   x3 = runif(100),
                   z1 = runif(100),
                   z2 = runif(100))
lapply(colnames(dat1)[5:6],
       function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = " + ")), data = d),
       d = dat1)
Rather than looping over the actual columns of the data frame, this loops only over the vector of column names. This provides some speed improvement, as fewer things are copied between iterations.
library(microbenchmark)
microbenchmark({ lapply(<what I wrote above>) })
# Unit: milliseconds
# expr
# {lapply(colnames(dat1)[5:6], function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = "+")), data = d), d = dat1)}
# min lq mean median uq max neval
# 4.014237 4.148117 4.323387 4.220189 4.281995 5.898811 100
microbenchmark({ lapply(<other answer>) })
# Unit: milliseconds
# expr
# {lapply(dat1[, 5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))}
# min lq mean median uq max neval
# 4.391494 4.505056 5.186972 4.598301 4.698818 51.573 100
The difference is fairly small for this toy example, but as the number of observations and predictors increases, the difference will likely become more pronounced.
Depending on what your final goal is, it can be much faster to fit a base model, evaluate all single-term additions with add1, and extract the F-test/AIC you want:
> basemodel <- lm(y~x1+x2+x3, dat1)
>
> add1(object=basemodel, grep("z\\d", names(dat1), value=TRUE), test="F")
Single term additions
Model:
y ~ x1 + x2 + x3
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 477.34 164.31
z1 1 0.0768 477.26 166.29 0.0153 0.9019
z2 1 5.1937 472.15 165.21 1.0450 0.3093
See also ?update for refitting the model.
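For example, once add1 flags a promising term, update() refits the base model with it, without retyping the whole formula (a small sketch reusing basemodel and dat1 from above):
model_z2 <- update(basemodel, . ~ . + z2)  ## add z2 to the base formula
coef(model_z2)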
With this dummy data:
dat1 <- data.frame(y = rpois(100, 5),
                   x1 = runif(100),
                   x2 = runif(100),
                   x3 = runif(100),
                   z1 = runif(100),
                   z2 = runif(100))
You could get your list of two lm objects this way:
lapply(dat1[5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))
This iterates through those two columns and substitutes each of them as an argument into the lm call.
As Alex notes below, it's preferable to pass the names through the formula, rather than the actual data columns as I have done here.
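For completeness, a sketch of that preferred variant, building each formula from names with reformulate():
models <- lapply(names(dat1)[5:6], function(v) {
  ## construct y ~ x1 + x2 + x3 + <v> from character names
  lm(reformulate(c("x1", "x2", "x3", v), response = "y"), data = dat1)
})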
Here's a different approach using packages from the dplyr / tidyr family. It restructures the data to a long form, then uses group_by() from the dplyr package instead of lapply():
library(dplyr)
library(tidyr)
library(magrittr) # for use_series()
dat1 %>%
  gather(varname, z, z1:z2) %>% # convert data to long form
  group_by(varname) %>%
  do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
  use_series(model)
This converts the data to long format using gather, so that the z-values occupy a single column. use_series() from the magrittr package returns the list of lm objects instead of a data.frame. If you load the broom package, you can extract the model summaries or coefficients within the same pipeline:
library(broom)
dat1 %>%
  gather(varname, z, z1:z2) %>%
  group_by(varname) %>%
  do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
  glance(model) # or tidy(model)
Source: local data frame [2 x 12]
Groups: varname
varname r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1 z1 0.06606736 0.02674388 2.075924 1.680099 0.1609905 5 -212.3698 436.7396 452.3707 409.3987 95
2 z2 0.06518852 0.02582804 2.076900 1.656192 0.1666479 5 -212.4168 436.8337 452.4647 409.7840 95
Data:
dat1 <- data.frame(y = rpois(100, 5), x1 = runif(100),
                   x2 = runif(100), x3 = runif(100),
                   z1 = runif(100), z2 = runif(100))
Is it possible to replace coefficients in an lm object?
I thought the following would work
# sample data
set.seed(2157010)
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x1 + rnorm(length(x1))
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
# replace coefficients with new values
fit$coef[2:3] <- c(5, 1)
# view new coefficients
coef(fit)
Any assistance would be greatly appreciated
Your code is not reproducible, as there are a few errors in it. Here's a corrected version, which also reveals your mistake:
set.seed(2157010) #forgot set.
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x2 + rnorm(length(x1)) #you had x, not x1 or x2
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
(Intercept) x1 x2
260.55645444 -0.04276353 2.91272272
# replace coefficients with new values; use the full component name, coefficients:
fit$coefficients[2:3] <- c(5, 1)
# view new coefficients
coef(fit)
(Intercept) x1 x2
260.5565 5.0000 1.0000
So the problem was that you were using fit$coef, although the name of the component in the lm output is really coefficients. The abbreviated version works for getting the values (thanks to partial matching), but not for setting them: the assignment created a new component named coef, while the coef() function continued to extract the values of fit$coefficients.
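A small illustration of the asymmetry (best run on a throwaway copy of the fit):
fit$coef      ## reading: $ partial-matches, so this returns fit$coefficients
fit$coef <- 0 ## writing: no partial matching, so this creates a new component "coef"
names(fit)    ## now contains both "coefficients" and "coef"
coef(fit)     ## unaffected by the assignment above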
I'm looking for suggestions on how to deal with NA's in linear regressions when all occurrences of an independent/explanatory variable are NA (i.e. x3 below).
I know the obvious solution would be to exclude the independent/explanatory variable in question from the model, but I am looping through multiple regions and would prefer not to have a different functional form for each region.
Below is some sample data:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
# Quick plot of data
library(ggplot2)
library(reshape2)
df.melt <- melt(df, id = c("time"))
p <- ggplot(df.melt, aes(x = time, y = value)) +
  geom_line() + facet_grid(variable ~ .)
p
I have read the documentation for lm and tried various na.action settings without success:
lm(y~x1+x2+x3, data=df, singular.ok=TRUE)
lm(y~x1+x2+x3, data=df, na.action=na.omit)
lm(y~x1+x2+x3, data=df, na.action=na.exclude)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.omit)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.exclude)
Is there a way to get lm to run without error and simply return a coefficient for the explanatory variable in question that reflects its lack of explanatory power (i.e. either zero or NA)?
Here's one idea:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
replaceNA <- function(x) {
  if (all(is.na(x))) {
    rep(0, length(x))
  } else x
}
lm(y~x1+x2+x3, data= data.frame(lapply(df,replaceNA)))
Call:
lm(formula = y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
Coefficients:
(Intercept) x1 x2 x3
0.05467 1.01133 -0.10613 NA
lm(y~x1+x2, data=df)
Call:
lm(formula = y ~ x1 + x2, data = df)
Coefficients:
(Intercept) x1 x2
0.05467 1.01133 -0.10613
So you replace the variables which contain only NAs with a variable which contains only 0s. You get the coefficient value NA, but all the relevant parts of the model fit are the same (except the QR decomposition; if information about that is needed, it can easily be modified). Note that the component summary(fit)$aliased (see ?alias) might be useful.
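For example, with the zero-filled fit from above:
fit <- lm(y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
summary(fit)$aliased ## x3 is flagged TRUE, since its coefficient is NA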
This seems to relate to your other question: Replace lm coefficients in [r]
You won't be able to include a column with all NA values. It does strange things to model.matrix
x1 <- 1:5
x2 <- rep(NA,5)
model.matrix(~x1+x2)
(Intercept) x1 x2TRUE
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$x2
[1] "contr.treatment"
So your alternative is to programmatically create the model formula based on the data.
Something like...
make_formula <- function(variables, data, response = 'y') {
  if (missing(data)) { stop('data not specified') }
  using <- Filter(variables, f = function(i) !all(is.na(data[[i]])))
  deparse(reformulate(using, response))
}
variables <- c('x1','x2','x3')
make_formula(variables, data =df)
[1] "y ~ x1 + x2"
I've used deparse to return a character string so that there are no environment issues from creating the formula within the function. lm can happily take a character string that is a valid formula.
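A sketch of how this slots into the per-region loop (same df and variables as above):
f <- make_formula(variables, data = df)
fit <- lm(f, data = df) ## lm coerces the character string to a formula
coef(fit)               ## x3 is simply absent rather than NA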