How to drop some interaction terms from an R formula

I would like to drop some interaction terms from an R formula. My situation is that I have one factor variable with lots of levels (call this A; it takes values from 1-50) and another continuous variable (call this B) that I would like to interact with A.
A*B
creates terms A1:B, A2:B, A3:B,... I want a simple way to get rid of the first A1:B term.
Note: I saw some previous answers for the lm case that called update and then dropped some terms. This will not work for me as I am trying to estimate a multinomial logit model with the mlogit package, and I cannot make the first estimation without dropping some interactions.
Edit: Although I am not attempting to use lm, if I could get the following to work, then I think it would solve my problem.
dd <- data.frame(A = sample(letters[1:10], 100, replace = TRUE), B = runif(100), z = rexp(100))
# need to drop the B term below
reg1 <- lm(z ~ A*B, dd)
# or need to drop the Aa:B term here
reg2 <- lm(z ~ A*B - B, dd)
# but this doesn't work (I realize why, but this is an
# example of what I would like to have happen)
reg3 <- lm(z ~ A*B - B - Aa:B, dd)

I think you should be able to work with contrasts here to make this happen. Here we create our own contrast function that adjusts the default contr.treatment behavior to skip the first two levels.
contr.skip2 <- function(n, contrasts = TRUE, sparse = FALSE) {
  # start from the default treatment contrast (base level 1)
  contr <- contr.treatment(n, 1, contrasts, sparse)
  # zero out level 2's row and drop its (now all-zero) column,
  # so levels 1 and 2 both map to the baseline
  contr[2, ] <- 0
  contr[, -1]
}
and then we can fit the model and pass along our special contrast
lm(z~A*B, dd, contrasts=list(A="contr.skip2"))
# Call:
# lm(formula = z ~ A * B, data = dd, contrasts = list(A = "contr.skip2"))
#
# Coefficients:
# (Intercept)           Ac           Ad           Ae           Af           Ag           Ah
#     1.09981     -0.14541     -0.86334     -0.18478     -0.77302      0.19681      0.23845
#          Ai           Aj            B         Ac:B         Ad:B         Ae:B         Af:B
#    -0.74962     -0.49014      0.09729      0.14705      1.09606      0.14706      0.88919
#        Ag:B         Ah:B         Ai:B         Aj:B
#    -0.62796     -0.70155      1.60253     -0.20564
As you can see, there are no longer any Ab terms in the model.
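To double-check, you can inspect the design matrix directly. A minimal sketch, assuming the dd and contr.skip2 defined above:
# columns named "Ab" or "Ab:B" should be absent from the design matrix
mm <- model.matrix(z ~ A * B, dd, contrasts.arg = list(A = "contr.skip2"))
grep("Ab", colnames(mm), value = TRUE) # character(0)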


Model-based partitioning with "two-layer interaction" (segmented models)

I am trying to build a model-based tree with a type of "two-layer interaction", where the models in the nodes of the tree are segmented again.
I am using the mob() function to this end, but I could not manage to make its fit argument work with the lmtree() function.
In the following example, a is a function of b, and the relationship between a and b depends on d and on b | d.
library("partykit")
set.seed(321)
b <- runif(200)
d <- sample(1:2, 200, replace = TRUE)
a <- jitter(ifelse(d == 1, 2 * b - 1, 4 * b - 1.2), amount = .1)
a[b < .5 & d == 1] <- jitter(rep(0, length(a[b < .5 & d == 1])))
a[b < .3 & d == 2] <- jitter(rep(0, length(a[b < .3 & d == 2])))
fit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...,
                estfun = FALSE, object = FALSE) {
  x <- x[, 2]
  l <- lmtree(y ~ x | b)
  return(l)
}
m <- mob(a ~ b | d, fit = fit) # not working
Of course, with this simple example I could use lmtree(a ~ b | d + b) to find every interaction, but is there a way to use lmtree() as the fit function of mob()?
No but yes ;-)
No, lmtree() cannot be used easily as a fitter for a mob().
The dimension of the inner tree (lmtree()) is not fixed, i.e., you may get a tree without any partition or with many subgroups, and this would be confusing for the outer tree (mob()).
Even if one worked around the dimension issue or fixed it by always forcing one break, one would need more work to set up the right coefficient vector, matrix of estimating functions, etc. This is also not straightforward because the convergence rate (and hence the inference) is different depending on whether breakpoints are given (e.g., for a binary factor) or have to be estimated (as for your numeric variable b).
The way you set up your fit() function, the inner lmtree() does not know where to find b. All it has is a numeric vector y and a numeric matrix x but not the original data.
But yes, I think that all of these issues can be addressed if changing the view from fitting a "two-layer" tree to fitting a "segmented" model inside a tree. My impression is that you want to fit a model y ~ x (or a ~ b in your example) where a piecewise linear function is used with an additional breakpoint in x. If the piecewise linear function is supposed to be continuous in x, then the segmented package can be easily used. If not, then strucchange could be leveraged. Assuming you want the former (as you have simulated your data like this), I include a worked segmented example below (and also slightly modified your question to reflect this).
Changing the names and code a little bit, your data d has a segmented piecewise linear relationship of y ~ x with coefficients depending on a group variable g.
set.seed(321)
d <- data.frame(
  x = runif(200),
  g = factor(sample(1:2, 200, replace = TRUE))
)
d$y <- jitter(ifelse(d$g == "1",
  pmax(0, 2 * d$x - 1),
  pmax(0, 4 * d$x - 1.2)
), amount = 0.1)
Within every node of a tree I can then fit a model segmented(lm(y ~ x)) which comes with suitable extractors for coef(), logLik(), estfun() etc. Thus, the mobster function is simply:
segfit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  x <- as.numeric(x[, 2])
  segmented::segmented(lm(y ~ x))
}
(Note: I haven't tried whether segmented() would also support lm() objects with weights and offset.)
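As a quick sanity check that the needed extractors are there, a minimal sketch (assuming the simulated d from above and the segmented package installed):
sm <- segmented::segmented(lm(y ~ x, data = d))
coef(sm)   # linear coefficients plus breakpoint-related terms
logLik(sm) # log-likelihood used by the tree's objective function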
With this we can obtain the full tree which simply splits in g in this basic example:
library("partykit")
segtree <- mob(y ~ x | g, data = d, fit = segfit)
plot(segtree, terminal_panel = node_bivplot, tnex = 2)
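To inspect what was estimated in each terminal node, the usual extractors work on the tree; a brief sketch:
print(segtree) # tree structure with the split in g
coef(segtree)  # segmented coefficients per terminal node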
A hands-on introduction to segmented is available in: Muggeo VMR (2008). "segmented: An R Package to Fit Regression Models with Broken-Line Relationships." R News, 8(1), 20-25. https://CRAN.R-project.org/doc/Rnews/
For the underlying methodological background see: Muggeo VMR (2003). "Estimating Regression Models with Unknown Break-Points." Statistics in Medicine, 22(19), 3055-3071. doi:10.1002/sim.1545

Predict function returns fitted values even though I put newdata

I'm making a model-validation testing function of my own.
In doing so, I let
a = the entire set of predictor variables in the model-building set
b = the response variable in the model-building set
c = the entire set of predictor variables in the validation set
d = the response variable in the validation set
e = the numbers of the columns I am interested in
This is based on the book Applied Linear Regression Models (Kutner), so I used
library(ALSM).
In my case, model-building set is SurgicalUnit, and validation set is SurgicalUnitAdditional.
Both datasets consist of 10 columns: columns 1 to 8 are the full set of independent variables, column 9 is the response variable, and column 10 is the log of the response variable.
So,
a=SurgicalUnit[,1:8]; b=SurgicalUnit[,10];
c=SurgicalUnitAdditional[,1:8]; d=SurgicalUnitAdditional[,10]; e=c(1,2,3,8)
This is because I want to fit the logged response variable and regress on the variables x1, x2, x3 and x8.
(Please note that the reason I used the "entire" set of independent variables with specific column numbers, instead of passing the set of interesting independent variables directly, is that I need to obtain Mallow's Cp in my function at the same time.)
So my regression is asdf=lm(b~as.matrix(a[e])). The problem is that I want to predict the validation set with the model built on the model-building set. So I let preds=data.frame(c[e]) and finally call predict(asdf, newdata=preds), which is equal to predict(asdf), meaning it returns the fitted values of asdf.
Why doesn't predict work? Help will be appreciated.
Below is my function
mod.valid = function(a, b, c, d, e) {
  asdf = lm(b ~ as.matrix(a[e]))        # model what you want
  qwer = lm(b ~ as.matrix(a[1:max(e)])) # full model in order to get Cp
  mat = round(coef(summary(asdf))[, c(-3, -4)], 4); mat2 = matrix(0, 5, 2)
  mat = rbind(mat, mat2); mat # matrix for coefficients and others (model-building)
  n = nrow(anova(asdf)); m = nrow(anova(qwer))
  nn = length(b) # to get the sample size
  p = asdf$rank  # to get the number of parameters p
  cp = anova(asdf)$Sum[n] / (anova(qwer)$Mean[m]) - (nn - 2*p); cp = round(cp, 4)
  mat[p+1, 1] = p; mat[p+1, 2] = cp # adding p and Cp
  rp = summary(asdf)$r.squared; rap = summary(asdf)$adj.r.squared; rp = round(rp, 4); rap = round(rap, 4)
  mat[p+2, 1] = rp; mat[p+2, 2] = rap # adding Rp2 and Rap2
  sse = anova(asdf)$Sum[n]; pre = MPV::PRESS(asdf); sse = round(sse, 4); pre = round(pre, 4)
  mat[p+3, 1] = sse; mat[p+3, 2] = pre # adding SSE and PRESS
  preds = data.frame(c[e]); predd = predict(asdf, newdata = preds) # I got the problem here!
  mspr = sum((d - predd)^2) / length(d); mse = anova(asdf)$Mean[n]; mspr = round(mspr, 4); mse = round(mse, 4)
  mat[p+4, 1] = mse; mat[p+4, 2] = mspr # adding MSE and MSPR
  aic = nn*log(anova(asdf)$Sum[n]) - nn*log(nn) + 2*p; aic = round(aic, 4)
  bic = nn*log(anova(asdf)$Sum[n]) - nn*log(nn) + log(nn)*p; bic = round(bic, 4)
  mat[p+5, 1] = aic; mat[p+5, 2] = bic # adding AIC and BIC
  rownames(mat)[p+1] = "p&Cp"; rownames(mat)[p+2] = "Rp.sq&Rap.sq"
  rownames(mat)[p+3] = "SSE&PRESS"; rownames(mat)[p+4] = "MSE&MSPR"; rownames(mat)[p+5] = "AIC&BIC"
  asdf2 = lm(d ~ as.matrix(c[e]))
  qwer2 = lm(d ~ as.matrix(c[1:max(e)]))
  matt = round(coef(summary(asdf2))[, c(-3, -4)], 4); matt2 = matrix(0, 5, 2)
  matt = rbind(matt, matt2); matt # matrix for coefficients and others (validation)
  n2 = nrow(anova(asdf2)); m2 = nrow(anova(qwer2))
  nn2 = length(d)  # to get the sample size
  p2 = asdf2$rank  # to get the number of parameters p
  cp2 = anova(asdf2)$Sum[n2] / (anova(qwer2)$Mean[m2]) - (nn2 - 2*p2); cp2 = round(cp2, 4)
  matt[p2+1, 1] = p2; matt[p2+1, 2] = cp2 # adding p and Cp
  rp2 = summary(asdf2)$r.squared; rap2 = summary(asdf2)$adj.r.squared; rp2 = round(rp2, 4); rap2 = round(rap2, 4)
  matt[p2+2, 1] = rp2; matt[p2+2, 2] = rap2 # adding Rp2 and Rap2
  sse2 = anova(asdf2)$Sum[n2]; pre2 = MPV::PRESS(asdf2); sse2 = round(sse2, 4); pre2 = round(pre2, 4)
  matt[p2+3, 1] = sse2; matt[p2+3, 2] = pre2 # adding SSE and PRESS
  mse2 = anova(asdf2)$Mean[n2]; mse2 = round(mse2, 4)
  matt[p2+4, 1] = mse2; matt[p2+4, 2] = NA # adding MSE and MSPR, in this case MSPR = 0
  aic2 = nn2*log(anova(asdf2)$Sum[n2]) - nn2*log(nn2) + 2*p2; aic2 = round(aic2, 4)
  bic2 = nn2*log(anova(asdf2)$Sum[n2]) - nn2*log(nn2) + log(nn2)*p2; bic2 = round(bic2, 4)
  matt[p2+5, 1] = aic2; matt[p2+5, 2] = bic2 # adding AIC and BIC
  mat = cbind(mat, matt); colnames(mat) = c("Estimate", "Std.Error", "Val.Estimate", "Val.Std.Error")
  print(mat)
}
This function provides useful statistics for model validation.
It returns a matrix with the coefficients, p, Mallow's Cp, R.squared, R.adj.squared, SSE, PRESS, MSE, MSPR, AIC and BIC.
Everything works fine for the given data, except for MSPR, since the predict function doesn't work! It only returns the fitted values.
Can you try something like this? You have to make sure that both the training and the test data have the same column names.
x <- rnorm(100)
y <- x + rnorm(100)
df <- data.frame(x = x, y=y)
# model fitting
fit <- lm(y ~ x, data=df)
predict(fit)
# creating new data
newx <- rnorm(50)
newdf <- data.frame(x = newx)
# making predictions
predict(fit, newdata = newdf)
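For the function in the question, the underlying issue is most likely the inline as.matrix() call: the "variable" lm() records is the whole expression as.matrix(a[e]), so predict() cannot match it against any column of newdata and silently falls back to the training data. A small sketch of the difference, with made-up data:
a <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
b <- rnorm(20)
bad  <- lm(b ~ as.matrix(a))              # predictor is the expression "as.matrix(a)"
good <- lm(y ~ ., data = cbind(y = b, a)) # predictors are the columns x1 and x2
newa <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
predict(good, newdata = newa) # 5 new predictions, as expected
predict(bad, newdata = newa)  # 20 fitted values plus a warning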

mgcv: can I drop a level for factor 'by' variable?

the story:
I'm facing a problem in a GAM where only 2 input variables should be considered:
x = relative price (%) the customer paid for the product, given the entry price for the club
b = binary, whether the customer has to pay for the product (VIPs get it for free)
the output variable is
y = whether the customer took the product
and this simulates the data:
require(mgcv)
require(data.table)
set.seed(2017)
y <- sample(c(0, 1), 100, replace=T)
x <- rgamma(100, 3, 3)
b <- as.factor(ifelse(x<.5, 0, 1))
dat <- as.data.table(list(y=y, x=x, b=b))
dat[b=="0",x:=0]
plot(dat$x, dat$y, col=dat$b)
[plot: relative price]
As you can see in the plot, customers who didn't have to pay for the product have a relative price of 0%, while the others have relative prices between 0.5% and 3.5%.
here comes the problem:
I want to model a dummy effect for b and a smooth effect of x (only for those who have to pay, of course), so I also use b as a by-variable in s(x):
mod <- bam(y~b+s(x, by=b), data=dat, family=binomial(link="logit"))
summary(mod)
par(mfrow=c(1,2))
plot(mod)
[plot: smooth effects]
my question is:
a. Why can you still see a rug for s(x):b1 at 0%? Wouldn't it make more sense if mgcv only considered those who have to pay? Does this problem have something to do with the knots?
b. As you can see in the summary, the dummy effect is estimated as NA. This might be because the information in b was used up entirely as a by-variable in s(x), so the dummy b itself has no more information to give? How can I overcome this problem? In other words: is there an option to model a smooth term only for a subset of the data and make mgcv actually use only this subset for fitting?
Your question is conceptually the same as How can I force dropping intercept or equivalent in this linear model?. You want to contrast b rather than using all its levels.
In GAM setting, you want:
dat$B <- as.numeric(dat$b) - 1
y ~ b + s(x, by = B)
For a factor 'by' smooth, mgcv does not apply a contrast to the 'by' variable if this factor is unordered. This is generally appealing, as often we want one smooth per factor level. It is thus your responsibility to use some trick to get what you want. What I did above is coerce the two-level factor b to a numeric B, with the level you want to omit being numerically 0, and then use the numeric 'by' variable B. This idea cannot be extended to factors with more levels.
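A minimal sketch putting this together, assuming the simulated dat from the question:
dat$B <- as.numeric(dat$b) - 1
# the smooth is multiplied by B, so it contributes nothing when B == 0,
# and the dummy b becomes estimable again
mod2 <- bam(y ~ b + s(x, by = B), data = dat, family = binomial(link = "logit"))
summary(mod2)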
If your 'by' factor has more than 2 levels and you still want to enforce a contrast, you need to use an ordered factor. For example, you can do
dat$B <- ordered(dat$b)
y ~ b + s(x, by = B)
Read more on 'by' variables from ?gam.models.

R: interaction between continuous and categorical vars in 'isat' regression ('gets' package)

I want to calculate the differential response of y to x (continuous) depending on the categorical variable z.
In the standard lm setup:
lm(y~ x:z)
However, I want to do this while allowing for impulse indicator saturation (IIS) in the 'gets' package. The following syntax produces an error:
isat(y, mxreg=x:z, iis=TRUE)
The error message is of the form:
Error in solve.qr(out, tol = tol, LAPACK = LAPACK) :
  singular matrix 'a' in 'solve'
Warning messages:
1: In x:z : numerical expression has 96 elements: only the first used
2: In x:z : numerical expression has 96 elements: only the first used
How should I modify the syntax?
Thank you!
At the moment, alas, isat doesn't provide the same functionality as lm on categorical/character variables, nor on using * and :. We hope to address that in a future release.
In the meantime you'll have to create distinct variables in your dataset representing the interaction. I guess something like the following...
library(gets)
N <- 100
x <- rnorm(N)
z <- c(rep("A",N/4),rep("B",N/4),rep("C",N/4),rep("D",N/4))
e <- rnorm(N)
y <- 0.5*x*as.numeric(z=="A") + 1.5*x*as.numeric(z=="B") - 0.75*x*as.numeric(z=="C") + 5*x*as.numeric(z=="D") + e
lm.reg <- lm(y ~ x:z)
arx.reg.0 <- arx(y,mxreg=x:z)
data <- data.frame(y,x,z,stringsAsFactors=F)
for (i in z[duplicated(z) == F]) {
  data[[paste("Zx", i, sep = ".")]] <- data$x * as.numeric(data$z == i)
}
arx.reg.1 <- arx(data$y,mxreg=data[,c("x","Zx.A","Zx.B","Zx.C")])
isat.1 <- isat(data$y,mc=TRUE,mxreg=data[,c("x","Zx.A","Zx.B","Zx.C")],max.block.size=20)
Note that as you'll be creating dummies for each category, there's a chance those dummies will cause singularity of your matrix of explanatory variables (if, as in my example, isat automatically uses 4 blocks). Using the argument max.block.size enables you to avoid this problem.
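As an aside, model.matrix() could build the same interaction columns without the manual loop; a sketch under the same simulated data (this is base R, not gets functionality):
mm <- model.matrix(~ x:z - 1, data = data) # columns x:zA, x:zB, x:zC, x:zD
isat.2 <- isat(data$y, mc = TRUE, mxreg = mm, iis = TRUE, max.block.size = 20)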
Let me know if I haven't addressed your particular point.

How to run lm models using all possible combinations of several variables and a factor

This is not my subject, so I am sorry if my question is badly asked or if the data is incomplete. I am trying to run 31 linear models which have a single response variable (VELOC) and, as predictor variables, a factor (TRAT, with 2 levels, A and B) and five continuous variables.
I have a loop I used for gls, but only with continuous predictor variables, so I thought it could work. But it did not, and I believe the problem must be something silly.
I don't know how to include the data, but it looks something like:
TRAT VELOC l b h t m
1 A 0.02490 -0.05283 0.06752 0.03435 -0.03343 0.10088
2 A 0.01196 -0.01126 0.10604 -0.01440 -0.08675 0.18547
3 A -0.06381 0.00804 0.06248 -0.04467 -0.04058 -0.04890
4 A 0.07440 0.04800 0.05250 -0.01867 -0.08034 0.08049
5 A 0.07695 0.06373 -0.00365 -0.07319 -0.02579 0.06989
6 B -0.03860 -0.01909 0.04960 0.09184 -0.06948 0.17950
7 B 0.00187 -0.02076 -0.05899 -0.12245 0.12391 -0.25616
8 B -0.07032 -0.02354 -0.05741 0.03189 0.05967 -0.06380
9 B -0.09047 -0.06176 -0.17759 0.15136 0.13997 0.09663
10 B -0.01787 0.01665 -0.08228 -0.02875 0.07486 -0.14252
now, the script I used is:
library(gtools) # for permutations()
pred.vars = c("TRAT", "l", "b", "h", "t", "m") # define predictor variables
m.mat = permutations(n = 2, r = 6, v = c(F, T), repeats.allowed = T) # all possible combinations of pred.vars
models = apply(cbind(T, m.mat), 1, function(xrow) {
  paste(c("1", pred.vars)[xrow], collapse = "+")
}) # fill in the models
models = paste("VELOC", models, sep = "~") # fill in the left-hand side
all.aic = rep(NA, length(models)) # AIC of the models
m.mat = cbind(1, m.mat) # which predictors are estimated in each model besides the intercept
colnames(m.mat) = c("(Intercept)", pred.vars)
n.par = 2 + apply(m.mat, 1, sum) # number of parameters estimated in the models
coefs = m.mat # define an object to store the coefficients
for (k in 1:length(models)) {
  res = try(lm(as.formula(models[k]), data = xdata))
  if (class(res) != "try-error") {
    all.aic[k] = -2 * logLik(res)[1] + 2 * n.par[k]
    xx = coefficients(res)
    coefs[k, match(names(xx), colnames(m.mat))] = xx
  }
}
And I get this error:"Error in coefs[k, match(names(xx), colnames(m.mat))] = xx : NAs are not allowed in subscripted assignments"
Thanks in advance for your help. I'll appreciate any corrections about how to post questions properly.
Lina
I suspect the dredge function in the MuMIn package would help you. You specify a "full" model with all parameters you want to include and then run dredge(fullmodel) to get all combinations nested within the full model.
You should then be able to get the coefficients and AIC values from the results of this.
Something like:
require(MuMIn)
data(iris)
globalmodel <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)
combinations <- dredge(globalmodel)
print(combinations)
to get the parameter estimates for all models (a bit messy) you can then use
coefTable(combinations)
or, to get the coefficients for a particular model, you can index it using the row number in the dredge object, e.g.
coefTable(combinations)[1]
to get the coefficients of the model in row 1. This should also print coefficients for factor levels.
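If you then want the top-ranked model back as an ordinary lm object (for predict(), summary(), etc.), get.models() can refit it; a short sketch:
best <- get.models(combinations, subset = 1)[[1]] # refit the model in row 1
summary(best)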
See the MuMIn helpfile for more details and ways to extract information.
Hope that helps.
To deal with the message:
'global.model''s 'na.action' argument is not set and
options('na.action') is "na.omit"
require(MuMIn)
data(iris)
options(na.action = "na.fail") # change the default "na.omit" to prevent models
# from being fitted to different datasets in
# case of missing values.
globalmodel <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)
combinations <- dredge(globalmodel)
print(combinations)
