Difference between categorical variables (factors) and dummy variables - r

I was running a regression using categorical variables and came across this question, where the user wanted to add a separate column for each dummy. This left me quite confused, because I thought that keeping the data in long format, with all the levels stored in a single column via as.factor(), was equivalent to having explicit dummy variables.
Could someone explain the difference between the following two linear regression models?
Linear Model 1, where Month is a factor:
dt_long
          Sales Period Month
   1: 0.4898943      1    M1
   2: 0.3097716      1    M1
   3: 1.0574771      1    M1
   4: 0.5121627      1    M1
   5: 0.6650744      1    M1
  ---
8108: 0.5175480     24   M12
8109: 1.2867316     24   M12
8110: 0.6283875     24   M12
8111: 0.6287151     24   M12
8112: 0.4347708     24   M12
M1 <- lm(data = dt_long,
         formula = Sales ~ Period + factor(Month))
Linear Model 2 where each month is an indicator variable:
dt_wide
          Sales Period M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12
   1: 0.4898943      1  1  0  0  0  0  0  0  0  0   0   0   0
   2: 0.3097716      1  1  0  0  0  0  0  0  0  0   0   0   0
   3: 1.0574771      1  1  0  0  0  0  0  0  0  0   0   0   0
   4: 0.5121627      1  1  0  0  0  0  0  0  0  0   0   0   0
   5: 0.6650744      1  1  0  0  0  0  0  0  0  0   0   0   0
  ---
8108: 0.5175480     24  0  0  0  0  0  0  0  0  0   0   0   1
8109: 1.2867316     24  0  0  0  0  0  0  0  0  0   0   0   1
8110: 0.6283875     24  0  0  0  0  0  0  0  0  0   0   0   1
8111: 0.6287151     24  0  0  0  0  0  0  0  0  0   0   0   1
8112: 0.4347708     24  0  0  0  0  0  0  0  0  0   0   0   1
M2 <- lm(data = dt_wide,
         formula = Sales ~ Period + M1 + M2 + M3 + ... + M11 + M12)
Judging by this previously asked question, both models seem to be exactly the same. However, after running both, I noticed that M1 returns 11 dummy estimates (because month M1 is used as the reference level), while M2 returns 12 dummies.
Is one model better than the other? Is M1 more efficient? Can I set the reference level in M1 so that both models are exactly equivalent?

Defining a model as in M1 is just a shortcut for including dummy variables: if you wanted to compute those regression coefficients by hand, the predictors would clearly have to be numeric.
Now, something that perhaps you didn't notice about M2 is that one of the dummies should have an NA coefficient. That is because you manually included all twelve of them and kept the intercept, so the dummy columns sum to the constant column and we have perfect collinearity. Leaving out one of the dummies, or adding -1 to eliminate the constant term, fixes everything.
Some examples. Let
y <- rnorm(100)
x0 <- rep(1:0, each = 50)
x1 <- rep(0:1, each = 50)
x <- factor(x1)
Constructed this way, x0 and x1 are a decomposition of x (they sum to a column of ones). Then
## Too much
lm(y ~ x0 + x1)
# Call:
# lm(formula = y ~ x0 + x1)
#
# Coefficients:
# (Intercept)           x0           x1
#    -0.15044      0.07561           NA

## One way to fix it
lm(y ~ x0 + x1 - 1)
# Call:
# lm(formula = y ~ x0 + x1 - 1)
#
# Coefficients:
#       x0        x1
# -0.07483  -0.15044

## Another one
lm(y ~ x1)
# Call:
# lm(formula = y ~ x1)
#
# Coefficients:
# (Intercept)           x1
#    -0.07483     -0.07561

## The same results
lm(y ~ x)
# Call:
# lm(formula = y ~ x)
#
# Coefficients:
# (Intercept)           x1
#    -0.07483     -0.07561
Ultimately all the models contain the same amount of information, but in the case of perfect multicollinearity we face an identification issue: there is no unique set of coefficients.
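To answer the last part of the question: yes, setting the reference level explicitly makes the two parameterizations line up. A minimal sketch, assuming the dt_long and dt_wide objects from the question (mod_long and mod_wide are my names, chosen to avoid reusing M1 and M2):
# Pick the reference month explicitly; any level can serve as the baseline
dt_long$Month <- relevel(factor(dt_long$Month), ref = "M12")
mod_long <- lm(Sales ~ Period + Month, data = dt_long)

# Wide version: drop the intercept so that all 12 indicators are estimable
mod_wide <- lm(Sales ~ Period + M1 + M2 + M3 + M4 + M5 + M6 +
                 M7 + M8 + M9 + M10 + M11 + M12 - 1, data = dt_wide)

# Both design matrices span the same column space, so the fits agree
all.equal(fitted(mod_long), fitted(mod_wide))  # should be TRUE
The coefficients are parameterized differently (contrasts against the reference month versus one coefficient per month), but the fitted values and residuals are identical.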

Improper dummy coding.
When you convert a categorical variable into dummy variables, you should have one fewer dummy variable than you had categories, because the last category is already indicated by a 0 on all the other dummies. Including a dummy for the last category adds only redundant information, resulting in perfect multicollinearity. So whenever a multicollinearity problem appears, check your dummy coding first.
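This is exactly what R's factor machinery does for you: with an intercept in the model, model.matrix() generates only k - 1 dummies for a k-level factor. A small illustration with a toy three-level factor:
f <- factor(c("a", "b", "c", "a"))

# Level "a" is the reference: it is absorbed by the intercept,
# so only dummies for "b" and "c" are generated
model.matrix(~ f)
#   (Intercept) fb fc
# 1           1  0  0
# 2           1  1  0
# 3           1  0  1
# 4           1  0  0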

Related

I can't see model results in R2jags - what is wrong?

I'm trying to run this script in R2jags, following the instructions provided in "Lahoz-Monfort JJ, Guillera-Arroita G, Tingley R (2015) Statistical approaches to account for false positive errors in environmental DNA samples. Molecular Ecology Resources, 16, 673–685." It seems to have worked OK, but I can't figure out the command to see the results... could anyone please help?
cat("model {
# Priors
psi ~ dunif(0,1)
p11 ~ dunif(0,1)
p10 ~ dunif(0,p10_max)
# Likelihood
for (i in 1:S){
z[i] ~ dbern(psi)
p[i] <- z[i]*p11 + (1-z[i])*p10
for (j in 1:K){
Y[i,j] ~ dbern(p[i])
}
}
} ",fill=TRUE)
sink()
Bayesian <- function(psi, p11, p10, S, K, nsims = 100, doprint = TRUE, p10_max = 0.05,
                     ni = 100000, nt = 2, nc = 1, nb = 50000, myparallel = TRUE) {
  psihat <- p11hat <- p10hat <- rep(nsims)
  modelSummaries <- list()
  for (ii in 1:nsims){
    if (doprint) cat("\r", ii, "of", nsims, " ")
    hh <- genSimData(psi, r11 = 0, p11, p10, S, K1 = 0, K2 = K)
    # fit the model
    jags.inits <- function()(list(psi = runif(1, 0.05, 0.95), p11 = runif(1, p10_max, 1), p10 = runif(1, 0, p10_max)))
    jags.data <- list(Y = hh, S = S, K = K, p10_max = p10_max)
    jags.params <- c("psi", "p11", "p10")
    Thoropa_model <- jags(data = jags.data, inits = jags.inits, parameters.to.save = jags.params,
                          model.file = "Thoropa.txt", n.thin = nt, n.chains = nc,
                          n.iter = ni, n.burnin = nb, parallel = myparallel) #, working.directory = getwd()
    # extract results (medians of the marginal posteriors)
    psihat[ii] <- model$summary["psi", "50%"]
    p11hat[ii] <- model$summary["p11", "50%"]
    p10hat[ii] <- model$summary["p10", "50%"]
    modelSummaries[[ii]] <- model$summary
  }
  if (doprint){
    printsummres(psihat, thename = "estimated psi")
    printsummres(p11hat, thename = "estimated p11")
    printsummres(p10hat, thename = "estimated p10")
  }
  return(list(psihat = psihat, p11hat = p11hat, p10hat = p10hat, modelSummaries = modelSummaries))
}
The file "Thoropa.txt" is a presence/absence matrix as follows:
PCR1 PCR2 PCR3 PCR4 PCR5 PCR6 PCR7 PCR8 PCR9 PCR10 PCR11 PCR12
1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 0 1 1 1 1 1 1 1 1
0 0 1 0 0 0 1 0 0 0 0 0
1 0 1 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 1 1 0 0 0 1 0 0
1 1 0 1 0 1 0 1 0 0 1 0
1 1 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 0 1 1 1 1 0 1 1
1 1 1 1 1 1 1 1 1 1 1 1
0 0 1 0 1 0 0 0 0 0 1 1
1 1 1 1 1 1 1 1 1 1 1 1
Following the comment from Limey (thank you!) I changed the script to:
sink("Thoropa2.txt")
cat("model {
# Priors
psi ~ dunif(0,1)
p11 ~ dunif(0,1)
p10 ~ dunif(0,p10_max)
# Likelihood
for (i in 1:S){
z[i] ~ dbern(psi)
p[i] <- z[i]*p11 + (1-z[i])*p10
for (j in 1:K){
Y[i,j] ~ dbern(p[i])
}
}
} ",fill=TRUE)
sink()
y = Thoropa  # the detection/non-detection table
S = nrow(y)
K = ncol(y)
psi ~ dunif(0, 1)
p11 ~ dunif(0, 1)
p10 ~ dunif(0, p10_max)
p10_max = 0.05
jags.data <- list(y = y, S = S, K = K, p10_max = p10_max)
jags.inits <- function()(list(psi = runif(0,1), p11 = runif(0,1), p10 = runif(0,p10_max)))
jags.params <- c("psi", "p11", "p10")
Thoropa_model <- jags.parallel(data = jags.data, inits = jags.inits,
                               parameters.to.save = jags.params, model.file = "Thoropa2.txt",
                               n.chains = 4, n.thin = 10, n.iter = 100000,
                               n.burnin = 50000, jags.seed = 333)
and the data file is as before.
Now I am getting the error message:
"Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: Indexing outside the bounds"
Could anyone help identify the error in my script? I'm no expert and I'm learning by myself, so sorry if it is a stupid question... (maybe there is something wrong with the format of the data...?)
Thank you all!
Your model is not working because of some syntax errors in your R script. Note that R syntax is different from JAGS syntax, even when you run JAGS from inside R.
These are the errors:
The symbol "~" is not used for sampling in R. Delete the lines:
psi ~ dunif(0, 1)
p11 ~ dunif(0, 1)
p10 ~ dunif(0, p10_max)
The Y variable in the JAGS model is capitalized, so you must match that in jags.data:
jags.data<-list(Y=y, S=S, K=K, p10_max=p10_max)
In jags.inits, (a) the body of the function must be wrapped in curly braces, not parentheses, and (b) the function runif takes three arguments: n (the number of values you want to sample), min and max. The correct syntax is the following:
jags.inits <-function(){list(psi=runif(1,0,1),p11=runif(1,0,1),p10=runif(1,0,p10_max))}
Fixing those errors, your model should run. After running it, you can extract the posterior median of the parameter "psi" using either of these two options:
Thoropa_model$BUGSoutput$median$psi
Thoropa_model$BUGSoutput$summary["psi","50%"]
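Putting the three fixes together, the calling part of the script would look something like this (a sketch; Thoropa is assumed to already hold the detection/non-detection table, and as.matrix() is added because JAGS expects a numeric matrix rather than a data frame):
y <- as.matrix(Thoropa)  # the detection/non-detection table, as a matrix
S <- nrow(y)
K <- ncol(y)
p10_max <- 0.05

# Capital Y, matching the name used inside the JAGS model
jags.data <- list(Y = y, S = S, K = K, p10_max = p10_max)

# Curly braces around the body; runif() gets n as its first argument
jags.inits <- function(){
  list(psi = runif(1, 0, 1), p11 = runif(1, 0, 1), p10 = runif(1, 0, p10_max))
}
jags.params <- c("psi", "p11", "p10")

Thoropa_model <- jags.parallel(data = jags.data, inits = jags.inits,
                               parameters.to.save = jags.params,
                               model.file = "Thoropa2.txt", n.chains = 4,
                               n.thin = 10, n.iter = 100000, n.burnin = 50000,
                               jags.seed = 333)

# Posterior median of psi
Thoropa_model$BUGSoutput$median$psi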

Predictions using neuralnet in R

I am using the 'neuralnet' package in R to train a model on the 'wine' dataset.
Below is the code that I have come up with so far:
library(neuralnet)
library(rattle)
library(rattle.data)
# load 'wine' dataset-
data(wine)
D <- as.data.frame(wine, stringsAsFactors=FALSE)
# replace 'Type' response variable (with values- 1, 2, 3) by 3 dummy variables-
D$wine1 <- 0
D$wine1[D$Type == 1] <- 1
D$wine2 <- 0
D$wine2[D$Type == 2] <- 1
D$wine3 <- 0
D$wine3[D$Type == 3] <- 1
# create formula to be used-
wine_formula <- as.formula(wine1 + wine2 + wine3 ~ Alcohol + Malic + Ash + Alcalinity + Magnesium + Phenols + Flavanoids + Nonflavanoids + Proanthocyanins + Color + Hue + Dilution + Proline)
# split dataset into training and testing datasets-
train_indices <- sample(1:nrow(wine), floor(0.7 * nrow(wine)), replace = F)
training <- D[train_indices, ]
testing <- D[-train_indices, ]
# train neural network model-
wine_nn <- neuralnet(wine_formula, data = training, hidden = c(5, 3), linear.output = FALSE, stepmax = 1e+07)
# make predictions using 'compute()'-
preds <- compute(wine_nn, testing[, 2:14])
# create a final data frame 'results' containing predicted & actual values-
results <- as.data.frame(preds$net.result)
results <- cbind(results, testing$wine1, testing$wine2, testing$wine3)
# rename the data frame-
names(results) <- c("Pred_Wine1", "Pred_Wine2", "Pred_Wine3", "Actual_Wine1", "Actual_Wine2", "Actual_Wine3")
The task that I now have is to convert the values in "Pred_Wine1", "Pred_Wine2" and "Pred_Wine3" to 1s and 0s so that I can create a confusion matrix and test the model's accuracy.
How should I go about this, given that these columns contain fitted values lying between 0 and 1?
Any suggestions?
Thanks!
I think you need label encoding here.
Let's say your data frame is called df. The loop below converts the values in your features to integer codes: if Pred_Wine1 contains levels a and b, they are converted to 1 and 2 (in order of first appearance).
Try this:
features <- c("Pred_Wine1", "Pred_Wine2","Pred_Wine3")
for(f in features)
{
levels <- unique(df[[f]])
df[[i]] <- as.integer(factor(df[[i]], levels=levels))
}
Something like:
> head(results)
Pred_Wine1
1 1.00000000000000000
14 1.00000000000000000
17 1.00000000000000000
21 0.00000001901851182
26 0.21287781596598065
27 1.00000000000000000
Pred_Wine2
1 0.00000000000000000000000000000000000000000000000000015327712484
14 0.00000000000000000000000000000000000000000000000000009310376079
17 0.00000000000000000000000000000000000000000000000000009435487922
21 0.99999999363562386278658777882810682058334350585937500000000000
26 0.78964805454441211463034733242238871753215789794921875000000000
27 0.00000000000000000000000000000000000000000000000000009310386461
Pred_Wine3 Actual_Wine1 Actual_Wine2 Actual_Wine3
1 5.291055036e-10 1 0 0
14 1.336129635e-09 1 0 0
17 1.303396935e-09 1 0 0
21 8.968513318e-122 1 0 0
26 1.623066411e-111 1 0 0
27 1.336126866e-09 1 0 0
> class <- apply(results[1:3], 1, which.max)
> results[1:3] <- 0
> head(results)
Pred_Wine1 Pred_Wine2 Pred_Wine3 Actual_Wine1 Actual_Wine2 Actual_Wine3
1 0 0 0 1 0 0
14 0 0 0 1 0 0
17 0 0 0 1 0 0
21 0 0 0 1 0 0
26 0 0 0 1 0 0
27 0 0 0 1 0 0
> for (r in names(class)) {results[r,class[r]] <- 1}
> head(results)
Pred_Wine1 Pred_Wine2 Pred_Wine3 Actual_Wine1 Actual_Wine2 Actual_Wine3
1 1 0 0 1 0 0
14 1 0 0 1 0 0
17 1 0 0 1 0 0
21 0 1 0 1 0 0
26 0 1 0 1 0 0
27 1 0 0 1 0 0
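From here, the confusion matrix the question asks about follows directly. A minimal sketch with base table(), assuming the results data frame built above:
# Collapse the one-hot columns back to single class labels (1, 2 or 3)
pred_class   <- apply(results[, 1:3], 1, which.max)
actual_class <- apply(results[, 4:6], 1, which.max)

# Cross-tabulate actual vs. predicted wine types
conf_mat <- table(Actual = actual_class, Predicted = pred_class)
conf_mat

# Overall accuracy: proportion of cases on the diagonal
# (this assumes the table is square, i.e. every class occurs in both vectors)
sum(diag(conf_mat)) / sum(conf_mat)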

How to use predict.stepplr() and confusionMatrix() correctly for the step.plr method?

#Here is my code:
library(MASS)     # Support Functions and Datasets for Venables and Ripley's MASS
library(caret)    # Classification and Regression Training
library(stepPlr)  # L2 penalized logistic regression with a stepwise variable selection
library(janitor)  # Simple Tools for Examining and Cleaning Dirty Data
#Howells is the main data frame; we will subset it.
HNORSE <- Howells[which(Howells$Pop == 'NORSE'), ]
#Remove NA cols using the janitor package
HNORSE <- remove_empty_cols(HNORSE)
#Assign 0s and 1s to females and males respectively
HNORSE$PopSex[HNORSE$PopSex == "NORSEF"] <- '0'
HNORSE$PopSex[HNORSE$PopSex == "NORSEM"] <- '1'
HNORSE$PopSex <- as.numeric(HNORSE$PopSex)
HNORSE$PopSex
#The resultant column looks like this:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [41] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [81] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I want to use step.plr from the stepPlr package:
a <- step.plr(HNORSE[, c(6:76)], HNORSE$PopSex, lambda = 1e-4, cp = "bic",
              max.terms = 1, trace = TRUE, type = "forward")
# HNORSE[, c(6:76)] --> features
# HNORSE$PopSex --> binary response
# lambda --> default value
# max.terms --> with values greater than 1, R goes into an infinite loop of
#   'Convergence Error' messages; that's why I am using max.terms = 1
Then I ran the summary command on "a":
summary(a)
Call: plr(x = ix0, y = y, weights = weights, offset.subset = offset.subset,
    offset.coefficients = offset.coefficients, lambda = lambda, cp = cp)

Coefficients:
           Estimate Std.Error z value Pr(>|z|)
Intercept -71.93470   13.3521  -5.388        0
ZYB         0.55594    0.1033   5.382        0

Null deviance: 152.49 on 109 degrees of freedom
Residual deviance: 57.29 on 108 degrees of freedom
Score: deviance + 4.7 * df = 66.69
Since I used step.plr, I should then use predict.stepplr, right, and not predict.plr? By this logic I wish to use predict.stepplr. The example from the documentation goes like this:
n <- 100
p <- 5
x0 <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
x0 <- cbind(rnorm(n),x0)
y <- sample(c(0,1),n,replace=TRUE)
level <- vector("list",length=6)
for (i in 2:6) level[[i]] <- seq(3)
fit <- step.plr(x0,y,level=level)
x1 <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
x1 <- cbind(rnorm(n),x1)
pred1 <- predict(fit,x0,x1,type="link")
pred2 <- predict(fit,x0,x1,type="response")
pred3 <- predict(fit,x0,x1,type="class")
object: stepplr object
x: matrix of features used for fitting object. If newx is provided, x must be provided as well.
newx: matrix of features at which the predictions are made. If newx=NULL, predictions for the training data are returned.
type: If type=link, the linear predictors are returned; if type=response, the probability estimates are returned; and if type=class, the class labels are returned. Default is type=link.
...: other options for prediction.
So, first of all, I did not do any sampling like the one shown there.
I want to predict HNORSE$PopSex, which is a binary variable.
My feature set, which does not include the binary response column, is HNORSE[,c(6:76)].
I want to know what to pass as the x0 and x1 arguments to predict.stepplr().
More generally, how do I correctly call predict.stepplr?
I then want to use the overall accuracy to plot(density(overall_accuracy)).
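For orientation, here is a minimal sketch of how the documented arguments map onto this data. It assumes the fitted object a and HNORSE from the code above, and it predicts on the training rows purely for illustration (a held-out test set would be the proper way to assess accuracy):
# x = the feature matrix the model was fitted on;
# newx = the feature matrix to predict for (here the same rows)
x_train <- as.matrix(HNORSE[, c(6:76)])

pred_class <- predict(a, x = x_train, newx = x_train, type = "class")

# Confusion matrix of actual vs. predicted sex (0 = female, 1 = male)
table(Actual = HNORSE$PopSex, Predicted = pred_class)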

model.matrix(): why do I lose control of contrast in this case

Suppose we have a toy data frame:
x <- data.frame(x1 = gl(3, 2, labels = letters[1:3]),
                x2 = gl(3, 2, labels = LETTERS[1:3]))
I would like to construct a model matrix
# x1b x1c x2B x2C
# 1 0 0 0 0
# 2 0 0 0 0
# 3 1 0 1 0
# 4 1 0 1 0
# 5 0 1 0 1
# 6 0 1 0 1
by:
model.matrix(~ x1 + x2 - 1, data = x,
             contrasts.arg = list(x1 = contr.treatment(letters[1:3]),
                                  x2 = contr.treatment(LETTERS[1:3])))
but actually I get:
# x1a x1b x1c x2B x2C
# 1 1 0 0 0 0
# 2 1 0 0 0 0
# 3 0 1 0 1 0
# 4 0 1 0 1 0
# 5 0 0 1 0 1
# 6 0 0 1 0 1
# attr(,"assign")
# [1] 1 1 1 2 2
# attr(,"contrasts")
# attr(,"contrasts")$x1
# b c
# a 0 0
# b 1 0
# c 0 1
# attr(,"contrasts")$x2
# B C
# A 0 0
# B 1 0
# C 0 1
I am sort of confused here:
I have passed in explicit contrast matrices to drop the first factor levels;
I have asked for the intercept to be dropped.
Then why am I getting a model matrix with 5 columns? How can I get the model matrix I want?
Whenever we lose control of something at the R level, there must be some default, unchangeable behaviour at the C level. The C code for model.matrix.default() can be found in the R source package at:
R-<release_number>/src/library/stats/src/model.c
We can find the explanation here:
/* If there is no intercept we look through the factor pattern */
/* matrix and adjust the code for the first factor found so that */
/* it will be coded by dummy variables rather than contrasts. */
Let's run a small test of this, with a data frame
x <- data.frame(x1 = gl(2, 2, labels = letters[1:2]), x2 = sin(1:4))
If we only have x2 on the RHS, we can drop the intercept successfully:
model.matrix(~ x2 - 1, data = x)
# x2
#1 0.8414710
#2 0.9092974
#3 0.1411200
#4 -0.7568025
If we have only x1 on the RHS, the contrast is not applied:
model.matrix(~ x1 - 1, data = x)
# x1a x1b
#1 1 0
#2 1 0
#3 0 1
#4 0 1
When we have both x1 and x2, the contrast is still not applied:
model.matrix(~ x1 + x2 - 1, data = x)
# x1a x1b x2
#1 1 0 0.8414710
#2 1 0 0.9092974
#3 0 1 0.1411200
#4 0 1 -0.7568025
This implies that while there is a difference between:
lm(y ~ x2, data = x)
lm(y ~ x2 - 1, data = x)
there is no difference between
lm(y ~ x1, data = x)
lm(y ~ x1 - 1, data = x)
or
lm(y ~ x1 + x2, data = x)
lm(y ~ x1 + x2 - 1, data = x)
The reason for such behaviour is not to ensure numerical stability, but to ensure sensible estimation / prediction. If we really dropped the intercept while applying the contrast to x1, we would end up with a model matrix:
# x1b
#1 0
#2 0
#3 1
#4 1
The effect is that the estimate for level a is constrained to be 0.
In this post: How can I force dropping intercept or equivalent in this linear model?, we have a dataset:
# Y X1 X2
#1 1.8376852 TRUE TRUE
#2 -2.1173739 TRUE FALSE
#3 1.3054450 FALSE TRUE
#4 -0.3476706 TRUE FALSE
#5 1.3219099 FALSE TRUE
#6 0.6781750 FALSE TRUE
There is no observation with (X1 = FALSE, X2 = FALSE) in this dataset. But in a broad sense, model.matrix() has to do something safe and sensible: it would be biased to assume that because two factor levels never occur jointly in the training data, they need not be predicted. If we really dropped the intercept while applying contrasts, such a joint occurrence would be constrained to 0. However, the OP of that post deliberately wanted such non-standard behaviour (for some reason), in which case a possible workaround was given in my answer there.
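For this question's toy example, a similar trick recovers the 4-column matrix asked for (sketched from the outputs shown earlier, and with the caveat just described: used in a regression, it constrains level a to 0). Build the no-intercept matrix with default contrasts, then drop the reference column of x1 by name:
mm <- model.matrix(~ x1 + x2 - 1, data = x)

# Only the first factor (x1) is expanded to a full set of dummies;
# x2's reference level "A" is already dropped. Removing x1a leaves
# exactly the 4 columns the question asked for.
mm[, colnames(mm) != "x1a"]
#   x1b x1c x2B x2C
# 1   0   0   0   0
# 2   0   0   0   0
# 3   1   0   1   0
# 4   1   0   1   0
# 5   0   1   0   1
# 6   0   1   0   1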

R formula() dummy coding model matrix factor interactions

I'm attempting to use formula to generate a model.matrix object to be used in a custom optimizer function.
It works great for the most part, but when it comes to factor-factor interactions, I'd like to specify the interaction as dummy coded rather than effects coded.
Take for example the following data set:
set.seed(1987)
myDF <- data.frame(Y = rnorm(100),
                   X1 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]),
                   X2 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]))
head(myDF)
Both the : and / operators create an effects coded model matrix (the latter being an additive effects structure, I think).
head(model.matrix(formula(Y ~ X1 : X2), data = myDF))
head(model.matrix(formula(Y ~ X1 / X2), data = myDF))
But I am looking to generate a dummy-coded model matrix, which would have the first level of X1 omitted for each level of X2, resulting in these terms (columns):
X1B:X2A
X1C:X2A
X1B:X2B
X1C:X2B
X1B:X2C
X1C:X2C
Is there a way to achieve this?
Is ~X1:X2-1 what you're looking for?
Make test data (as above):
set.seed(1987)
myDF <- data.frame(Y = rnorm(100),
                   X1 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]),
                   X2 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]))
Generate model matrix:
mm1 <- model.matrix(formula(Y ~ X1 : X2 - 1), data = myDF)
head(mm1)
## X1A:X2A X1B:X2A X1C:X2A X1A:X2B X1B:X2B X1C:X2B X1A:X2C X1B:X2C X1C:X2C
## 1 0 0 0 0 1 0 0 0 0
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 1 0
## 4 0 0 0 0 0 1 0 0 0
## 5 0 0 0 1 0 0 0 0 0
## 6 0 0 0 0 0 0 1 0 0
Or perhaps you really do just want some columns excluded:
mm0 <- model.matrix(formula(Y ~ X1 : X2), data = myDF)
mm0B <- mm0[, !grepl("(Intercept|^X1A:)", colnames(mm0))]
head(mm0B)
## X1B:X2A X1C:X2A X1B:X2B X1C:X2B X1B:X2C X1C:X2C
## 1 0 0 1 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 1 0
## 4 0 0 0 1 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
I thought you also might be interested in sum-to-zero contrasts:
mm2 <- model.matrix(formula(Y ~ X1 : X2 - 1), data = myDF,
                    contrasts.arg = list(X1 = contr.sum, X2 = contr.sum))
Below is another approach.
set.seed(1987)
myDF <- data.frame(Y = rnorm(100),
                   X1 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]),
                   X2 = factor(LETTERS[sample(1:3, 100, replace = TRUE)]))
# row subsetting to exclude rows where X1 == 'A'
modelMat <- model.matrix(formula(Y ~ X1 : X2), data = myDF[myDF$X1 != 'A', ])
# column subsetting to eliminate all columns involving X1A
modelMat <- modelMat[, substring(colnames(modelMat), 1, 3) != "X1A"]
head(modelMat)
(Intercept) X1B:X2A X1C:X2A X1B:X2B X1C:X2B X1B:X2C X1C:X2C
1 1 0 0 1 0 0 0
3 1 0 0 0 0 1 0
4 1 0 0 0 1 0 0
8 1 0 0 0 0 1 0
10 1 0 0 0 0 0 1
11 1 0 0 0 0 0 1
