In a regression model is it possible to include an interaction with only one dummy variable of a factor? For example, suppose I have:
x: numerical vector with 3 distinct values (1, 2 and 3)
y: response variable
z: numerical vector
Is it possible to build a model like:
y ~ factor(x) + factor(x) : z
but only include the interaction with one level of x? I realize that I could create a separate dummy variable for each level of x, but I would like to simplify things if possible.
Really appreciate any input!!
One key point you're missing is that when you see a significant effect for something like x2:z, that doesn't mean that x interacts with z when x == 2, it means that the difference between x == 2 and x == 1 (or whatever your reference level is) interacts with z. It's not a level of x that is interacting with z, it's one of the contrasts that has been set for x.
So for a 3 level factor with default treatment contrasts:
df <- data.frame(x = sample(1:3, 10, TRUE), y = rnorm(10), z = rnorm(10))
df$x <- factor(df$x)
contrasts(df$x)
  2 3
1 0 0
2 1 0
3 0 1
If you really think that only the first contrast is important, you can create a new variable that compares x == 2 to x == 1 and ignores x == 3:
df$x_1vs2 <- NA
df$x_1vs2[df$x == 1] <- 0
df$x_1vs2[df$x == 2] <- 1
df$x_1vs2[df$x == 3] <- NA
And then run your regression using that:
lm(y ~ x_1vs2 + x_1vs2:z, data = df)
X <- data.frame(x = sample(1:3, 10, TRUE), y = rnorm(10), z = rnorm(10))
lm(y ~ factor(x) + factor(x):z, data=X)
Is this what you want?
Something like this may be what you need:
y~factor(x)+factor(x=='SomeLevel'):z
If x is already coded as a factor in your data, something like
y ~ x + I(x=='some_level'):z
Or if x is of numeric type in your data frame, then
y ~ as.factor(x) + I(as.factor(x)=='some_level'):z
Or to only model some subset of the data try:
# within a single level of x the factor terms cannot be estimated, which leaves y ~ z
lm(y ~ z, data = subset(df, x == 'some_level'))
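One caveat with the I(x == 'some_level'):z form: R treats the logical term as a two-level factor, and because z has no main effect in the formula, the interaction expands into two slope columns (one for FALSE, one for TRUE). If a single interaction column is wanted, a numeric dummy is one option. The following is a minimal sketch under stated assumptions: the data frame d is a toy example, and level "2" is picked arbitrarily.

```r
set.seed(1)
# toy data (hypothetical): x has levels 1, 2, 3
d <- data.frame(x = factor(rep(1:3, each = 4)), y = rnorm(12), z = rnorm(12))

# Logical dummy: handled as a two-level factor, so the interaction
# contributes one slope column per value (FALSE and TRUE)
mm_logical <- model.matrix(~ x + I(x == "2"):z, data = d)
colnames(mm_logical)

# Numeric dummy: exactly one extra column, equal to z where x == 2, else 0
mm_numeric <- model.matrix(~ x + I((x == "2") * z), data = d)
colnames(mm_numeric)

# the same formula can be passed to lm()
fit <- lm(y ~ x + I((x == "2") * z), data = d)
coef(fit)
```

Which form is appropriate depends on whether the other levels should have their own common slope on z (logical version) or no slope at all (numeric version).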
Based on the answer to this question and its script, how can I print the names of the collinear columns to the console?
Script:
library(corrplot)
library(caret)
x <- seq(0, 100, 1)
# collinear with x
y <- x + 2.3
# almost collinear with x (small Gaussian noise added)
z <- x + rnorm(mean = 0, sd = 5, n = 101)
# uncorrelated Gaussian
w <- rnorm(mean = 0, sd = 1, n = 101)
# every vector needs length 101 so that data.frame() below accepts them
a <- z + seq(101, 201, 1)/.33 + rnorm(mean = 0, sd = 5, n = 101)
b <- a -2.3
# this frame is made to exemplify the procedure
df <- data.frame(x = x, y = y, z = z, w = w, a=a, b=b)
corrplot(cor(df))
#drop perfectly multicollinear variables
constant<-rep(1,nrow(df))
tmp<-lm(constant ~ ., data=df)
to_keep<-tmp$coefficients[!is.na(tmp$coefficients)]
to_keep<-names(to_keep[-which(names(to_keep) == "(Intercept)")])
df_result<-df[to_keep]
corrplot(cor(df_result))
You want the variables not included in to_keep. Based off how to_keep is defined, you can write to_drop <- tmp$coefficients[is.na(tmp$coefficients)] to get the coefficients with NA values (meaning there are no estimates for the corresponding variables because they are collinear with others). Then, to print the names of those coefficients, you can simply do print(names(to_drop)).
However, keep in mind that (1) this only drops perfectly collinear variables, and in a hacky way, and (2) the choice of which variable to drop from a set of perfectly collinear variables is rather arbitrary (it depends on the order of the variables in your data).
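A reduced sketch of the approach described above, using base R only. The data frame dat is a cut-down, hypothetical version of the one in the question, kept small so the flagged column is easy to see.

```r
set.seed(42)
x <- seq(0, 100, 1)
dat <- data.frame(
  x = x,
  y = x + 2.3,     # perfectly collinear with x and the intercept
  w = rnorm(101)   # unrelated noise
)

# regress a constant on all columns; only the design matrix matters here
constant <- rep(1, nrow(dat))
tmp <- lm(constant ~ ., data = dat)

# coefficients that could not be estimated are reported as NA
to_drop <- tmp$coefficients[is.na(tmp$coefficients)]
print(names(to_drop))
```

With this column order y is the one flagged, because x comes first; putting y before x in the data frame would flag x instead, which illustrates the arbitrariness mentioned in point (2).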
I am working on a meta-analysis and a sensitivity analysis for missing data. I want to replace censored data with either 0 or 1 according to a predefined probability.
I have a dataset with columns x (time points) and y (events: 1 = event, 0 = censored). For the analysis I replaced some of the 0s with NAs. z is the indicator for the treatment arm. I want to replace the NAs with either 1 or 0 according to a predefined probability.
This is my code (just an example):
library(mice)
x <- c(1:10)
y <- c(1,1,1,NA,NA,NA,1,1,0,NA)
z <- rep(2,10)
data <- data.frame(x,y,z)
str(data)
md.pattern(data)
mice.impute.myfunct <- function(y, ry, x, ...)
{event <- sample(c(0:1), size = 1, replace=T, prob=c(0.5,0.5)); return(event)}
data.imp <- mice(data, me = c("","myfunct",""), m = 1)
data.comp <- complete(data.imp)
I would expect that NAs in y will be replaced with 0 (20% of cases) and 1 (80% of cases). But NAs are either replaced only with 0 or only with 1.
I have to admit that I am quite a beginner with R and have not had to write my own functions before.
Thank you very much for your help!
Here is a possible solution that simply replaces the missing values with 0 or 1, using a probability that varies between 0.1 and 0.9:
for( i in seq(0.1,0.9,0.1)){
data[[paste0("y_imp",i)]] <- data$y
N <- sum(is.na( data$y))
data[[paste0("y_imp",i)]][is.na(data[[paste0("y_imp",i)]])] <- sample(c(0,1), size = N, replace=T, prob=c(i,1-i))
}
The line data[[paste0("y_imp",i)]] <- data$y creates a copy of y; the missing values in that copy are then replaced by 0 with probability i (and by 1 with probability 1 - i).
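If the goal is to stay within mice, note that the function in the question draws a single value (size = 1), which would explain why every NA ends up with the same imputed value. A minimal sketch of a corrected method follows, assuming the usual mice.impute.* argument convention (y, ry, x, wy) and the 20%/80% split mentioned in the question. It is called directly here, outside mice, only to check the length of the result.

```r
# Custom method in the form mice looks up for method "myfunct".
# y:  variable to impute
# ry: TRUE where y is observed
# wy: TRUE where an imputation should be generated
mice.impute.myfunct <- function(y, ry, x, wy = NULL, ...) {
  if (is.null(wy)) wy <- !ry
  # one independent draw per missing entry: 0 with prob 0.2, 1 with prob 0.8
  sample(c(0, 1), size = sum(wy), replace = TRUE, prob = c(0.2, 0.8))
}

# direct call with the y from the question (x is not used by this method)
y <- c(1, 1, 1, NA, NA, NA, 1, 1, 0, NA)
draws <- mice.impute.myfunct(y, ry = !is.na(y), x = NULL)
length(draws)
```

The method can then be selected as in the question, for example with method = c("", "myfunct", "") in the mice() call.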
I have a problem using the mice function for multiple imputation. I want to impute only part of the missing data, which according to the help seems possible and straightforward, but I can't get it to work.
Here is the example:
I have some missing data on x and y:
library(mice)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10),c("x","y")] <- NA
I want only to impute missing data on y:
where <- data.frame(ID = rep(FALSE,100),x = rep(FALSE,100),y = is.na(plouf$y))
I do the imputation
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
I look at the imputed values:
test <- complete(plouf.imp)
Here I still have NAs on y:
> sum(is.na(test$y))
[1] 10
If I use where to request imputation of all missing values, it works:
where <- data.frame(ID = rep(FALSE,100),x = is.na(plouf$x),y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1,method="pmm",maxit=5,where = where)
test <- complete(plouf.imp)
> sum(is.na(test$y))
[1] 0
But it also imputes x, which I don't want in this specific case (for speed reasons in a statistical simulation study).
Does anyone have any idea?
This is happening because of the following line:
plouf[sample(100,10),c("x","y")] <- NA
Let's consider your first case, where you want to impute y only. Check its PredictorMatrix:
plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = where)
plouf.imp
#PredictorMatrix:
#   ID x y
#ID  0 0 0
#x   0 0 0
#y   1 1 0
This says that the missing values of y will be predicted from ID and x, since both have a 1 in row y.
Now look at your sample data, where you set the x and y columns to NA together: wherever y is NA, x is NA as well.
So when mice uses the PredictorMatrix to impute y, it encounters NA in x and skips those rows, because all predictors (here ID and x) must be non-missing in order to predict the missing values of y.
Try this -
library(mice)
#sample data
set.seed(123)
plouf <- data.frame(ID = rep(LETTERS[1:10],each = 10), x = sample(10,100,replace = T), y = sample(10,100,replace = T))
plouf[sample(100,10), "x"] <- NA
set.seed(999)
plouf[sample(100,10), "y"] <- NA
#missing value imputation
whr <- data.frame(ID = rep(FALSE,100), x = rep(FALSE,100), y = is.na(plouf$y))
plouf.imp <- mice(plouf, m = 1, method="pmm", maxit=5, where = whr)
test <- complete(plouf.imp)
sum(is.na(test$y))
#[1] 1
Here only one value of y is left unimputed: row number 39, where both x and y are NA (as in your first case).
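A quick way to see how many rows are affected is to count where x and y are missing together. This is a small base R sketch that rebuilds the data as in the question (the seed is an arbitrary choice):

```r
set.seed(123)
plouf <- data.frame(ID = rep(LETTERS[1:10], each = 10),
                    x = sample(10, 100, replace = TRUE),
                    y = sample(10, 100, replace = TRUE))
# same pattern as in the question: x and y set to NA in the same rows
plouf[sample(100, 10), c("x", "y")] <- NA

# rows where y needs imputing but the predictor x is missing as well
both_missing <- is.na(plouf$x) & is.na(plouf$y)
sum(both_missing)
which(both_missing)
```

With the question's data every missing y coincides with a missing x, which matches the 10 NAs that remained after imputation.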
It is always said that we need to create predictor variables for categorical values in order to run a regression. As a test, I first created a predictor column coded 1, 2, 3, ... for a five-level categorical variable. Then I ran the same model without the predictor column, using the categorical column itself.
In conclusion, the coefficients are different; however, their relative importance and effect on the y-value are the same. Moreover R-squared and p-value numbers are exactly the same in these two cases. So, do I have to create the predictor variable, or is R smart enough to do it automatically?
for(i in 1:74)
{
if(travel$accommodation[i] == "Hotel")
{
travel$pred_hotel[i] <- 1
}
if(travel$accommodation[i] == "Airbnb")
{
travel$pred_hotel[i] <- 2
}
if(travel$accommodation[i] == "Hostel")
{
travel$pred_hotel[i] <- 3
}
if(travel$accommodation[i] == "With friend/family")
{
travel$pred_hotel[i] <- 4
}
if(travel$accommodation[i] == "Other")
{
travel$pred_hotel[i] <- 5
}
}
travel$pred_hotel <- as.factor(travel$pred_hotel)
Then:
msf <- lm(ticket_events_money ~ museum_fee + nationality +
ticket_events_frequency + accommodation + line + activity_1 +
locals + vacation_days, data = travel[-1, ])
mm <- lm(ticket_events_money ~ museum_fee + nationality +
ticket_events_frequency + pred_hotel + line + activity_1 +
locals + vacation_days, data = travel[-1, ])
summary(msf)
summary(mm)
The point is that you originally have a character column, accommodation, and your new variable pred_hotel is a factor. lm automatically converts a character covariate into a factor. In your test, the only difference is in the factor levels; everything else is the same. If you want to see a difference, remove the as.factor line.
Another common pitfall is shown in the following minimal, reproducible example.
dat <- data.frame(y = rnorm(20), x = rep(letters[1:2], 10), stringsAsFactors = FALSE)
m1 <- lm(y ~ x, dat)
dat$x[dat$x == 'a'] <- 1
dat$x[dat$x == 'b'] <- 2
class(dat$x) # still a character column!!
m2 <- lm(y ~ x, dat)
But you will see a difference if you use a real numeric column:
dat$x <- as.numeric(dat$x)
m3 <- lm(y ~ x, dat)
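To check the claim that a character column and a relabelled factor give the same fit, here is a small sketch with simulated data. The names dat2, acc_chr and acc_fac are made up for the illustration, and only three accommodation types are used.

```r
set.seed(1)
acc <- rep(c("Hotel", "Airbnb", "Hostel"), times = 10)
dat2 <- data.frame(
  y = rnorm(30),
  acc_chr = acc,                                                 # character column
  acc_fac = factor(match(acc, c("Hotel", "Airbnb", "Hostel"))),  # factor coded 1, 2, 3
  stringsAsFactors = FALSE
)

m_chr <- lm(y ~ acc_chr, data = dat2)   # lm converts the character column itself
m_fac <- lm(y ~ acc_fac, data = dat2)

# coefficients differ because the reference level differs ...
coef(m_chr)
coef(m_fac)
# ... but fitted values and R-squared agree
same_fit <- isTRUE(all.equal(fitted(m_chr), fitted(m_fac)))
same_r2 <- isTRUE(all.equal(summary(m_chr)$r.squared, summary(m_fac)$r.squared))
```

The character column gets alphabetically sorted levels (so "Airbnb" becomes the reference), while the coded factor uses level 1 ("Hotel") as the reference. That is why the coefficients change but the model does not.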
I have a formula which I would like to use to create a model matrix, but for my use I need to stop the user from adding an intercept as this will be taken care of at a later stage in the regression. How can I remove the intercept from the formula and is there a better option than update?
You can do this a few ways. The first option specified below is probably the best way of going about this.
# Create dataset and form for example
dta <- data.frame(y = rnorm(3), x = rnorm(3), z = rnorm(3))
form <- y ~ x + z
# Default model matrix (intercept included)
(X <- model.matrix(form, dta))
# Option 1 (my default option)
tf <- terms(form)
attr(tf, "intercept") <- 0
model.matrix(tf, dta)
# Option 2
X[, !colnames(X) %in% "(Intercept)"]
# Option 3
form2 <- update(form, . ~ . - 1)
model.matrix(form2, dta)
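With numeric predictors only, the options above should produce the same columns. Once the formula contains a factor, dropping the intercept column afterwards and removing the intercept in the formula are no longer equivalent. A short sketch, where dta2 and the factor f are hypothetical additions to the example:

```r
set.seed(1)
dta2 <- data.frame(y = rnorm(6), x = rnorm(6),
                   f = factor(c("a", "b", "c", "a", "b", "c")))
form_f <- y ~ f + x

# Drop the intercept column after the fact: f keeps its treatment contrasts
X_full <- model.matrix(form_f, dta2)
X_drop <- X_full[, colnames(X_full) != "(Intercept)", drop = FALSE]

# Remove the intercept in the formula: f gets one indicator column per level
X_noint <- model.matrix(update(form_f, . ~ . - 1), dta2)

colnames(X_drop)
colnames(X_noint)
```

Which behaviour is wanted depends on how the intercept is handled at the later stage: if it is added back, the contrast-coded version (X_drop) avoids a rank-deficient design.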