"Vectorizing" this for-loop in R? (suppressing interaction main effects in lm) - r

When interactions are specified in lm, R includes main effects by default, with no option to suppress them. This is usually appropriate and convenient, but there are certain instances (within estimators, ratio LHS variables, among others) where this isn't appropriate.
I've got this code that fits a log-transformed variable to a response variable, independently within subsets of the data.
Here is a silly yet reproducible example:
id = as.factor(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,6,7,7,8,8,8,9,9,9,9,10))
x = rexp(length(id))
y = rnorm(length(id))
logx = log(x)
data = data.frame(id,y,logx)
for (i in data$id){
  sub = subset(data, id==i)                 # This splits the data by id
  m = lm(y ~ logx - 1, data=sub)            # This gives me the linear (log) fit for one of my id's
  sub$x.tilde = log(1+3)*m$coef             # This linearizes it and gives me the expected value for x=3
  data$x.tilde[data$id==i] = sub$x.tilde    # This puts it back into the main dataset
  data$tildecoeff[data$id==i] = m$coef      # This saves the coefficient (I use it elsewhere for plotting)
}
I want to fit a model like the following:
Y = B(X*id) +e
with no intercept and no main effect of id. As you can see from the loop, I'm interested in the expectation of Y when X=3, constraining the fit through the origin (because Y is a (logged) ratio of Y[X=something]/Y[X=0]).
But if I specify
m = lm(Y~X*as.factor(id)-1)
there is no means of suppressing the main effects of id. I need to run this loop several hundred times in an iterative algorithm, and as a loop it is far too slow.
The other upside of de-looping this code is that it'll be much more convenient to get prediction intervals.
(Please, I don't need pious comments about how leaving out main effects and intercepts is improper -- it usually is, but I can promise that it isn't in this instance).
Thanks in advance for any ideas!

I think you want
m <- lm(y ~ 0 + logx : as.factor(id))
see the R-intro manual, section 11.1 'Defining statistical models; formulae'.
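A minimal sketch of how the rest of the loop could then be vectorized around that single fit, using the example data above; the lookup assumes lm's default "logx:id<level>" coefficient labelling for this formula, so check names(coef(m)) if your factor is coded differently:
m <- lm(y ~ 0 + logx:id, data = data)               # id is already a factor in the example
co <- coef(m)                                       # one slope per id, named like "logx:id1"
data$tildecoeff <- co[paste0("logx:id", data$id)]   # look up each row's slope by name
data$x.tilde <- log(1 + 3) * data$tildecoeff        # expected value at x = 3, as in the loop
A single predict(m, newdata, interval = "prediction") call on this fit then gives the prediction intervals mentioned in the question.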

Related

zfit straight-line fitting for a 2-dim dataset

I would like to fit a straight line (a*x + b) to a 2-dimensional dataset using zfit.
That is very easy with the probfit package, but it has been deprecated by scikit-hep: https://nbviewer.jupyter.org/github/scikit-hep/probfit/blob/master/tutorial/tutorial.ipynb
How can I fit such 2-dim data with an arbitrary function?
I've checked the zfit examples, but they seem to assume some distribution (a histogram), so zfit expects a dataset like a 1-d array, and I couldn't work out how to pass 2-d data to zfit.
There is currently no direct way in zfit to do this out of the box (in one line), since a corresponding loss simply hasn't been added yet.
However, SimpleLoss (zfit.loss.SimpleLoss) allows you to construct any loss you can think of (have a look at the example in its docstring as well). In your case, it would look something like this:
import tensorflow as tf
import zfit

x = your_data
y = your_targets  # y-values
obs = zfit.Space('x', (lower, upper))
param1 = zfit.Parameter(...)
param2 = zfit.Parameter(...)
...
model = Func(...)  # a function is the way to go here
data = zfit.Data.from_numpy(array=x, obs=obs)

def mse():
    prediction = model.func(data)
    value = tf.reduce_mean((prediction - y) ** 2)  # or whatever you want to have
    return value

loss = zfit.loss.SimpleLoss(mse, [param1, param2])
# etc.
On another note, it would be a good idea to add such a loss to zfit. If you're interested in contributing, I recommend getting in contact with the authors; they will gladly help you and guide you through it.
UPDATE
The loss function itself presumably consists of three or four things: x, y, a model, and maybe an uncertainty on y. A chi2 loss looks like this:
def chi2():
    y_pred = model.func(x)
    return tf.reduce_sum(((y_pred - y) / y_error) ** 2)

loss = zfit.loss.SimpleLoss(chi2, model.get_params())
That's all, four lines of code. Here x is a zfit.Data object and model is, in this case, a Func. Does that work?

R stats::step function with forward direction param is not optimizing the LR model(AIC)

I've used AIC and the step function for variable selection before, but for some reason I'm not able to get it to work here.
library(ISLR)
d = data("Caravan")
train_data = Caravan[-c(1:500),]
m0 <- glm(Purchase ~ 1, data = train_data, family = "binomial")
stats::step(m0, direction = "forward", trace = 1 )
Please note - I tried the stepAIC function and tried passing the scope as scope = Purchase ~ ., but neither of those changes solves the issue.
The output of the step function is a model that is the same as the base model (m0).
The step function uses update internally. The . has a different meaning in update than it does in lm: in update, the dot means keep that part of the formula the way it was originally, rather than include all the variables as it does in lm. So if your model is m <- lm(y ~ x), then update(m, log(.) ~ .) simply means change the left-hand side to its log, i.e. log(y), while keeping the right-hand side as it is, i.e. x. The dot does not pull in any variables other than the ones already in the model.
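A tiny illustration of that behaviour, using a hypothetical model m fitted on made-up variables y and x:
m <- lm(y ~ x)
update(m, log(.) ~ .)   # refits log(y) ~ x: each dot keeps that side of the original formula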
WHAT YOU SHOULD DO:
scopef <- reformulate(grep("Purchase",names(Caravan),value=T,invert = T),"Purchase")
step(m0,scopef,direction = "forward")
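For reference, reformulate just builds a formula from a character vector of right-hand-side names plus a response, so scopef above is Purchase regressed on every other column of Caravan. A toy illustration with made-up names:
reformulate(c("x1", "x2", "x3"), "y")
# y ~ x1 + x2 + x3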
This is how I solved the issue. As Onyambu mentioned in his reply, in step/stepAIC the dot doesn't work the way it does in lm. Instead of typing out all the predictors manually, I built the formula with paste and collapse="+":
glmnet( formula(paste0("Y~", paste(names(Caravan)[1:85], collapse="+"))),
....)

LMER Factor vs numeric Interaction

I am attempting to use lmer to model my data.
My data has 2 independent variables and a dependent variable.
The first is "Morph" and has values "Identical", "Near", "Far".
The second is "Response" which can be "Old" or "New".
The dependent variable is "Fix_Count".
So here is a sample dataframe and what I currently have for running the linear model.
Subject <- c(rep(1, times = 6), rep(2, times = 6))
q <- c("Identical", "Near", "Far")
Morph <- c(rep(q, times = 4))
t <- c(rep("old", times = 3),rep("new", times=3))
Response <- c(rep(t, times = 2))
Fix_Count <- sample(1:9, 12, replace = T)
df.main <- data.frame(Subject,Morph, Response, Fix_Count, stringsAsFactors = T)
df.main$Subject <- as.factor(df.main$Subject)
res = lmer(Fix_Count ~ (Morph * Response) + (1|Subject), data=df.main)
summary(res)
The summary output gives a separate coefficient for each Morph/Response combination.
The issue is that I do not want the individual combinations but an overall interaction effect of Morph:Response.
I can get it to do this by converting Morph to numeric instead of factor. However, I'm not sure that makes sense conceptually, as the values don't really represent 1, 2, 3 but rather low/mid/high (ordered but qualitative).
So: 1. Is it possible to run lmer to get interaction effects between 2 factor variables?
2. Or do you think numeric is a fine way to class "Identical", "Near", "Far"?
3. I have tried setting contrasts to see if that can help, but sometimes I get an error and other times it seems like nothing is changed. If contrasts would help, could you explain how I would implement this?
Thank you so much for any help you can offer. I have also posted this question to Stack Exchange, as I am unsure whether this is a coding issue or a stats issue; I can remove it from the less relevant forum once I know.
Best, Kirk
Two problems I see. First, you should be using a factor variable for Subject. It's clearly not a continuous or integer variable. And to (possibly) address part of your question, there is an interaction function designed to work with regression formulas. I'm pretty sure that the formula interface will interpret the "*" operator that you used as a call to interaction, but the labeling of the output may be different and perhaps more to your liking. I get the same number of coefficients with:
res = lmer(Fix_Count ~ interaction(Morph , Response) + (1|Subject), data=df.main)
But that's not an improvement.
However, the coefficients differ from those of the model created with Morph*Response; probably a different set of contrast options is in play.
The way to get an overall statistical test of the interaction is to compare nested models:
res_simple = lmer(Fix_Count ~ Morph + Response + (1|Subject), data=df.main)
And then do an anova for the model comparison:
anova(res,res_simple)
refitting model(s) with ML (instead of REML)
Data: df.main
Models:
res_simple: Fix_Count ~ Morph + Response + (1 | Subject)
res: Fix_Count ~ interaction(Morph, Response) + (1 | factor(Subject))
           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
res_simple  6 50.920 53.830 -19.460   38.920
res         8 54.582 58.461 -19.291   38.582 0.3381      2     0.8445
My opinion is that this sits close enough to the boundary between stats and coding that it could have been acceptable on either forum. (You are not supposed to cross-post, however.) If you are satisfied with a coding answer then we are done. If you need help with understanding the model comparison, then you may want to edit your CrossValidated question to request a more theory-based answer than mine. (I checked to make sure the anova results are the same regardless of whether you use the interaction function or the "*" operator.)
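To make that last parenthetical concrete, here is a short sketch of the same nested comparison with the "*" operator instead of interaction(); the likelihood-ratio test of the interaction should come out the same:
res_star <- lmer(Fix_Count ~ Morph * Response + (1|Subject), data = df.main)
res_simple <- lmer(Fix_Count ~ Morph + Response + (1|Subject), data = df.main)
anova(res_star, res_simple)   # chi-square test for the Morph:Response interaction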

Error: step factor reduced below 0.001 without reducing pwrss when using nlmer

I think this could be more of a stats question than an R question, but I get the error Error: step factor reduced below 0.001 without reducing pwrss when trying to fit an nlmer model to my data. My data is here: https://www.dropbox.com/s/cri5n7lewhc8j02/chweight.RData?dl=0
I'm trying to fit the model so that I can predict the weight of chicks based on time, for chicks on diet 1. I did the following:
cw1<-subset(ChickWeight, ChickWeight$Diet==1)
m1 <- nlmer(weight~ SSlogis(Time, Asym, xmid, scal) ~ Asym|Chick, cw1, start=c(Asym = 190, xmid = 730, scal = 350))
Could there be other ways to solve this error? I think the error has to do with the Asym values, but I don't understand well what it is doing, so any brief guidance would help.
I have been asked to improve my answer, so here is my attempt to do so.
This error is usually tripped because your start values aren't adequately close to the "true" values, so the optimizer fails to find any local improvements in fit by moving away from them. You need to try providing better starting guesses -- this can sometimes be accomplished by algebraically solving the equation at a few points, as described in many places such as this article. Other times, you can plot the data and try to make educated guesses as to what the parameters might be, if you know what the parameters "do" within the non-linear function (that is, maybe parameter a represents an asymptote, b is a scaling factor, c is the mean rate of change, etc.). That's hard for me personally because I have no math background, but I'm usually able to make a reasonable guess.
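For the ChickWeight example in the question specifically, SSlogis is a self-starting model, so one way to get reasonable starting values without guessing is to let R compute them; a sketch, assuming the cw1 subset from the question:
cw1 <- subset(ChickWeight, Diet == 1)
getInitial(weight ~ SSlogis(Time, Asym, xmid, scal), data = cw1)   # returns Asym, xmid, scal estimates to pass as start values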
To answer the question more directly, though, here is some reproducible code that should illustrate that the error in question comes from bad starting guesses.
#Create independent and dependent variables, X and Y, and a grouping variable Z.
xs = rep(1:10, times = 10)
ys = 3 + 2*exp(-0.5*xs)
zs = rep(1:10, each=10)
#Put random noise in X.
for (i in 1:100) {
  xs[i] = rnorm(1, xs[i], 2)
}
df1 = data.frame(xs, ys, zs) #Assemble data into data frame.
require(lme4) #Turn on our package.
#Define our custom function--in this case, a three-parameter exponential model.
funct1 = deriv(~beta0 + beta1*exp(beta2*xs), namevec=c('beta0', 'beta1', 'beta2'),
               function.arg=c('xs','beta0', 'beta1','beta2'))
#This will return the exact same error because our starting guesses are way off.
test1 = nlmer(ys ~ funct1(xs, beta0, beta1, beta2) ~ (beta0|zs), data = df1,
start=c(beta0=-50,beta1=200,beta2=3))
#Our starting guesses are much better now, and so nlmer is able to converge this time.
test1 = nlmer(ys ~ funct1(xs, beta0, beta1, beta2) ~ (beta0|zs), data = df1,
start=c(beta0=3.2,beta1=1.8,beta2=-0.3))

How to obtain Tukey compact letter display from a GLM with interactions

I have a set of data that I've analyzed with a generalized linear model that has three categorical factors in a 3-way interaction (factorA, factorB, factorC) and a fourth, continuous factor (factorD) that is simply added to the model. I am trying to obtain a set of Tukey letter groups (i.e., a compact letter display) from the model but haven't found a way to include the interaction successfully. I'm not interested in including factorD, just the three factors in the interaction.
I have gotten the Tukey-adjusted pairwise comparisons with this:
lsmeans(my.glm, factorA*factorB*factorC)
But I was not able to figure out how to produce a compact letter display from that. It can be done with the multcomp package, but with that package I could only find ways to do it for main effects, not interactions.
So then I tried the agricolae package, as this post (https://stats.stackexchange.com/questions/31547/how-to-obtain-the-results-of-a-tukey-hsd-post-hoc-test-in-a-table-showing-groupe) suggests it should work. However, following the instructions in that answer led to a non-functional response from HSD.test. Specifically, I could get the main-effects tests to work fine, e.g. HSD.test(my.glm, "factorA"), but I could not get the interactions to work. I tried this:
intxns<-with(my.data, interaction(factorA,factorB,factorC))
HSD.test(my.glm,"intxns",group=TRUE)
But I get an error indicating that the HSD.test function didn't recognize "intxns" as a valid object (I get the same error if I just put nonsense into the factor field of the HSD.test call; also, I checked the intxns object and it looks good, and the number of rows matches the number of residuals of my glm). The error looks like this:
Name: intxns
factorA factorB factorC factorD
The agricolae notes don't actually cover the use of interactions in HSD.test, but I assume it can work.
Does anyone know how to get HSD.test to work with interactions? Or is there any other function you've gotten to work to produce compact letter displays for a glm with interactions?
I've been working on this for a number of days now and haven't been able to find a solution, hopefully I'm not missing something obvious.
Thanks!
I don't know how you've specified your glm model, but HSD.test looks to match the particular treatment name with the same name specified in the glm formula as well as in the data frame. This is why your main effect, factorA, will work, but not the 3-way interaction. For multiple comparison tests on interactions, I find it easiest to generate the interactions separately and add them to the data frame as additional columns. The glm model can then be specified using the new variables which code for the interaction.
For example,
set.seed(42)
glm.dat <- data.frame(y = rnorm(1000),
                      factorA = sample(letters[1:2], size = 1000, replace = TRUE),
                      factorB = sample(letters[1:2], size = 1000, replace = TRUE),
                      factorC = sample(letters[1:2], size = 1000, replace = TRUE))
# Generate interactions explicitly and add them to the data.frame
glm.dat$factorAB <- with(glm.dat, interaction(factorA, factorB))
glm.dat$factorAC <- with(glm.dat, interaction(factorA, factorC))
glm.dat$factorBC <- with(glm.dat, interaction(factorB, factorC))
glm.dat$factorABC <- with(glm.dat, interaction(factorA, factorB, factorC))
# General linear model
glm.mod <- glm(y ~ factorA + factorB + factorC + factorAB + factorAC +
                 factorBC + factorABC, family = 'gaussian', data = glm.dat)
# Multiple comparison test
library(agricolae)
comp <- HSD.test(glm.mod, trt = "factorABC", group = TRUE)
giving
comp$groups
    trt        means M
1 a.a.a  0.070052189 a
2 a.b.b  0.035684571 a
3 b.a.a  0.020517535 a
4 b.b.b -0.008153257 a
5 a.b.a -0.036136140 a
6 a.a.b -0.078891136 a
7 b.a.b -0.080845419 a
8 b.b.a -0.115808772 a
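As a hedged aside on the other part of the question (letters straight from the lsmeans-style output): assuming the emmeans package (the successor to lsmeans) plus multcomp and multcompView are installed, and a model fitted with the ordinary interaction formula rather than the explicit interaction columns, something along these lines may also produce a compact letter display:
library(emmeans)
library(multcomp)
glm.mod2 <- glm(y ~ factorA * factorB * factorC, family = 'gaussian', data = glm.dat)
emm <- emmeans(glm.mod2, ~ factorA * factorB * factorC)   # cell means of the 3-way grid
cld(emm, Letters = letters)                               # compact letter display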
