mgcv: can I drop a level for factor 'by' variable? - r

the story:
I'm facing a problem in a GAM where only two input variables should be considered:
x = relative price (%) the customer paid for the product given the entry price for the club
b = binary, whether the customer has to pay for the product (VIPs get it for free)
the output variable is
y = whether the customer took the product
and this simulates the data:
require(mgcv)
require(data.table)
set.seed(2017)
y <- sample(c(0, 1), 100, replace=T)
x <- rgamma(100, 3, 3)
b <- as.factor(ifelse(x<.5, 0, 1))
dat <- as.data.table(list(y=y, x=x, b=b))
dat[b=="0",x:=0]
plot(dat$x, dat$y, col=dat$b)
[plot: y against relative price x, points colored by b]
as you can see in the plot, customers who don't have to pay for the product have a relative price of 0%; the others have relative prices between 0.5% and roughly 3.5%
here comes the problem:
I want to model a dummy effect for b and a smooth effect of x (of course only for those who have to pay), so I also use b as a by-variable in s(x):
mod <- bam(y~b+s(x, by=b), data=dat, family=binomial(link="logit"))
summary(mod)
par(mfrow=c(1,2))
plot(mod)
[plot: the two estimated smooth effects]
my questions are:
a. why can you still see a rug for the s(x):b=1 smooth at 0%? Wouldn't it make more sense if mgcv only considered those who have to pay? Does this have something to do with the knots?
b. as you can see in the summary, the dummy effect is estimated as NA. This might be because the information in b was already fully used as the by-variable in s(x), so the dummy b itself has no more information to contribute? How can I overcome this problem; in other words, is there an option to model a smooth term for only a subset of the data and make mgcv actually fit on only that subset?

Your question is conceptually the same as How can I force dropping intercept or equivalent in this linear model?. You want to contrast b rather than using all its levels.
In a GAM setting, you want:
dat$B <- as.numeric(dat$b) - 1
y ~ b + s(x, by = B)
For a factor by smooth, mgcv does not apply contrasts to by if the factor is unordered. This is generally appealing, as we often want a separate smooth for each factor level. It is thus your responsibility to use a trick to get what you want. What I did above is to coerce the two-level factor b to a numeric B, with the level you want to omit being numerically 0, then use the numeric by-variable B. This idea cannot be extended to factors with more levels.
If your factor by has more than two levels and you still want to enforce a contrast, you need to use an ordered factor. For example, you can do
dat$B <- ordered(dat$b)
y ~ b + s(x, by = B)
Read more on 'by' variables in ?gam.models.
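The coercion step is easy to check in isolation: for a two-level factor, as.numeric() returns 1 and 2, so subtracting 1 yields exactly the 0/1 indicator the numeric by-variable needs. A minimal base-R sketch:

```r
# two-level factor as in the question: "0" = gets it free, "1" = has to pay
b <- factor(c("0", "0", "1", "1"))
B <- as.numeric(b) - 1  # levels map to 1, 2; subtracting 1 gives 0, 1
B
```

With a numeric by, every basis function of s(x) is multiplied by B, so observations with B == 0 contribute nothing to the smooth, which is exactly the desired contrast.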

Related

I am working on an ordered logit model. I tried to obtain the estimates and test the proportional odds assumption using R. Is it correct?

Question: Have a look at data set Two.csv. It contains a potentially dependent binary variable Y , and
two potentially independent variables {X1, X2} for each unit of measurement.
(a) Read data set Two.csv into R and have a look at the format of the dependent variable.
Discuss three models which might be appropriate in this data situation. Discuss which
aspects speak in favor of each model, and which aspects against.
(b) Suppose variable Y measures financial ratings A : y = 1, B : y = 2, and C : y = 3, that
is, the creditworthiness A: high, B: intermediate, C: low for unit of measurement firm
i. Model Y by means of an ordered Logit model as a function of {X1,X2} and estimate
your model by means of a built-in command.
(c) Explain the proportional odds-assumption and test whether the assumption is critical
in the context of the data set at hand.
##a) Read data set Two.csv into R and have a look at the format of the dependent variable.
O <- read.table("C:/Users/DELL/Downloads/ExamQEIII2021/Two.csv",header=TRUE,sep=";")
str(O)
dim(O)
View(O)
##b)
library(oglmx)
ologit <- oglmx(y ~ x1 + x2, data = O, link = "logit", constantMEAN = FALSE,
                constantSD = FALSE, delta = 0, threshparam = NULL)
results.ologis <- ologit.reg(y ~ x1 + x2, data = O)
summary(results.ologis)
## x1 1.46251
## x2 -0.45391
margins.oglmx(results.ologis,ascontinuous = FALSE) #Build in command for AMElogit
##c) Explain the proportional odds-assumption and test whether the assumption is critical
#in the context of the data set at hand.
#ordinal Logit WITH proportional odds(PO)
library(VGAM)
a <- vglm(y~x1+x2,family=cumulative(parallel=TRUE),data=O)
summary(a)
#ordinal Logit WITHOUT proportional odds [a considers PO and c doesn't]
c <- vglm(y~x1+x2,family=cumulative(parallel=FALSE),data=O)
summary(c)
pchisq(deviance(a)-deviance(c),df.residual(a)-df.residual(c),lower.tail=FALSE)
## 0.4936413 -> no significant difference in the unexplained variance; we
## cannot confirm that the PO assumption is violated.
# small model
LLa <- logLik(a)
# large model
LLc <- logLik(c)
df.residual(a)
df.residual(c)
# or, equivalently, via a likelihood-ratio test
# (useful if you are unsure about the number of degrees of freedom):
LL <- 2*(LLc - LLa)
1 - pchisq(LL, df.residual(a) - df.residual(c))
## 0.4936413 [same p-value as the deviance-based test above]
## Conclusion: the log-likelihoods do not differ significantly, so relaxing
## the PO assumption does not improve the fit.
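The deviance comparison above is just the standard likelihood-ratio test written out. Its mechanics can be sketched with made-up numbers (the deviances and degrees of freedom below are illustrative, not from Two.csv):

```r
# hypothetical deviances and residual dfs for a restricted (PO) and an
# unrestricted (non-PO) model -- illustrative numbers only
dev_small <- 105.2; df_small <- 96   # restricted model
dev_large <- 101.8; df_large <- 94   # unrestricted model
# LR statistic = difference in deviance = 2*(logLik_large - logLik_small)
LR <- dev_small - dev_large
p  <- pchisq(LR, df = df_small - df_large, lower.tail = FALSE)
p  # a large p-value means no evidence against the restriction
```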

R function to find which of 3 variables correlates most with another value?

I am conducting a study that analyzes speakers' production and measures their average F2 values. What I need is an R function that allows me to find a relationship for these F2 values with 3 other variables, and if there is, which one is the most significant. These variables have been coded as 1, 2, or 3 for things like "yes" "no" answers or whether responses are positive, neutral or negative (1, 2, 3 respectively).
Is there a particular technique or R function/test that we can use to approach this problem? I've considered using ANOVA or a T-Test but am unsure if this will give me what I need.
A quick solution might look like this. Here, the cor function is used; read its help page (?cor) to understand what is calculated. By default, the Pearson correlation coefficient is used. The function below returns the variable most strongly correlated with the reference variable.
set.seed(111)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)
ref <- 0.5*x + 0.5*rnorm(100)
find_max_corr <- function(vars, ref){
  # correlation of each candidate variable with the reference variable
  val <- sapply(vars, cor, y = ref)
  # abs() so that strong negative correlations are not overlooked
  val[which.max(abs(val))]
}
find_max_corr(list('x' = x, 'y' = y, 'z' = z), ref)
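Since cor() also accepts matrices, the same comparison can be done in a single call, which makes it easy to inspect all three coefficients at once (same simulated data as above):

```r
set.seed(111)
x <- rnorm(100); y <- rnorm(100); z <- rnorm(100)
ref <- 0.5 * x + 0.5 * rnorm(100)
# correlate every column with the reference variable in one call
cors <- cor(cbind(x = x, y = y, z = z), ref)[, 1]
cors
names(which.max(abs(cors)))  # name of the most strongly correlated variable
```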

how to create manual contrasts with emmeans? - R

Suppose I have these data
library(MASS)  # for the oats data
library(lme4)  # for lmer()
m <- lmer(Y ~ N*V + (1|B), data = oats)
How can I create a manual contrast in emmeans? For example
Victoria_0.2cwt 1
Victoria_0.4cwt -1
Marvellous_0.2cwt -1
Marvellous_0.4cwt 1
library(emmeans)
emm <- emmeans(m, ~ V * N)
emm
contrast(emm, list(con = c(0,0,0,0,-1,1,0,0,-1,0,0,0)))
However, this is actually a linear function, not a contrast, because the coefficients do not sum to zero.
Note: I may have mis-remembered the factor levels, and if so, the coefficients may need to be rearranged. They should correspond to the combinations you see in the printout of emm.
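To sidestep the mis-remembering problem, the coefficient vector can be built by name rather than by position. The sketch below assumes the level order that emmeans(m, ~ V * N) prints for the MASS oats data (V varying fastest within N); check your own emm printout before relying on it:

```r
# combinations in the assumed printing order of emmeans(m, ~ V * N)
lev <- expand.grid(V = c("Golden.rain", "Marvellous", "Victoria"),
                   N = c("0.0cwt", "0.2cwt", "0.4cwt", "0.6cwt"))
con <- numeric(nrow(lev))
con[lev$V == "Victoria"   & lev$N == "0.2cwt"] <-  1
con[lev$V == "Victoria"   & lev$N == "0.4cwt"] <- -1
con[lev$V == "Marvellous" & lev$N == "0.2cwt"] <- -1
con[lev$V == "Marvellous" & lev$N == "0.4cwt"] <-  1
sum(con)  # 0, so this time the coefficients form a genuine contrast
```

The vector would then be passed as contrast(emm, list(con = con)).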

R: multicollinearity issues using glib(), Bayesian Model Averaging (BMA-package)

I am experiencing difficulties estimating a BMA-model via glib(), due to multicollinearity issues, even though I have clearly specified which columns to use. Please find the details below.
The data I'll be using for the estimation via Bayesian Model Averaging:
Cij <- c(357848,766940,610542,482940,527326,574398,146342,139950,227229,67948,
352118,884021,933894,1183289,445745,320996,527804,266172,425046,
290507,1001799,926219,1016654,750816,146923,495992,280405,
310608,1108250,776189,1562400,272482,352053,206286,
443160,693190,991983,769488,504851,470639,
396132,937085,847498,805037,705960,
440832,847631,1131398,1063269,
359480,1061648,1443370,
376686,986608,
344014)
n <- length(Cij)
TT <- trunc(sqrt(2*n))
i <- rep(1:TT,TT:1); #row numbers: year of origin
j <- sequence(TT:1) #col numbers: year of development
k <- i+j-1 #diagonal numbers: year of payment
#Since k=i+j-1, we have to leave out another dummy in order to avoid multicollinearity
k <- ifelse(k == 2, 1, k)
I want to evaluate the effect of i and j both as numeric levels and as factors, but of course not in the same model. Since I can include i and j as factors, as numeric levels, or not at all, and k either as a numeric level or not at all, there are a total of 18 (3x3x2) models. This brings us to the following data frame:
X <- data.frame(Cij,i.factor=as.factor(i),j.factor=as.factor(j),k,i,j)
X <- model.matrix(Cij ~ -1 + i.factor + j.factor + k + i + j,X)
X <- as.data.frame(X[,-1])
Next, via the following declaration I specify which variables to consider in each of the 18 models. As far as I can see, no linear dependence exists within any of these specifications.
model.set <- rbind(
c(rep(0,9),rep(0,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,0),
c(rep(0,9),rep(0,9),0,0,1),
c(rep(0,9),rep(0,9),1,0,0),
c(rep(1,9),rep(0,9),0,0,0),
c(rep(0,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,1),
c(rep(0,9),rep(0,9),1,1,0),
c(rep(0,9),rep(1,9),0,1,0),
c(rep(0,9),rep(0,9),1,0,1),
c(rep(1,9),rep(0,9),0,0,1),
c(rep(1,9),rep(0,9),1,0,0),
c(rep(0,9),rep(1,9),1,0,0),
c(rep(1,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),1,1,1),
c(rep(0,9),rep(1,9),1,1,0),
c(rep(1,9),rep(0,9),1,0,1),
c(rep(1,9),rep(1,9),1,0,0))
Then I call the glib() function, telling it to select the specified columns from X according to model.set.
library(BMA)
model.glib <- glib(X,Cij,error="poisson", link="log",models=model.set)
which results in the error
Error in glim(x, y, n, error = error, link = link, scale = scale) : X matrix is not full rank
The function first checks whether the matrix is of full column rank before it evaluates which columns to select from X via model.set. How do I circumvent this, or is there another way to include all 18 models in the glib() call?
Thank you in advance.
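The rank complaint itself is easy to reproduce on a toy triangle: with the intercept removed, the dummies for factor(i) already span both the numeric i and the constant, so adding numeric i and j leaves the full matrix rank-deficient, and this is what glim() checks before model.set is ever consulted. A minimal sketch (not using the Cij data):

```r
# toy 3x3 run-off triangle indices, built like i and j in the question
TT <- 3
i <- rep(1:TT, TT:1)  # year of origin
j <- sequence(TT:1)   # year of development
M <- model.matrix(~ -1 + factor(i) + factor(j) + i + j)
qr(M)$rank < ncol(M)  # TRUE: the matrix is not of full column rank
```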

How to drop some interaction terms from R formula

I would like to drop some interaction terms from an R formula. My situation is that I have one factor variable with many levels (call this A; it takes values from 1-50) and a continuous variable that I would like to interact with it (call this B).
A*B
creates terms A1:B, A2:B, A3:B, ... I want a simple way to get rid of just the first term, A1:B.
Note: I saw some previous answers for the lm case that called update and then dropped some terms. This will not work for me, as I am trying to estimate a multinomial logit model with the mlogit package, and I cannot even run the first estimation without dropping some interactions.
Edit: Although I am not attempting to use lm, if I could get the following to work, I think it would solve my problem.
dd<-data.frame(A=sample(letters[1:10], 100, replace=T),B = runif(100),z=rexp(100))
#need to drop B term below
reg1 <- lm(z~A*B, dd)
#or need to drop Aa:B term here
reg2 <- lm(z~A*B - B, dd)
#but this doesn't work (I realize why, but this is an
#example of what I would like to have happen)
reg3 <- lm(z~A*B - B - Aa:B, dd)
I think you should be able to work with contrasts here to make this happen. Below we create our own contrast function that adjusts the default contr.treatment behavior to absorb the first two levels into the baseline.
contr.skip2 <- function(n, contrasts = TRUE, sparse = FALSE) {
  contr <- contr.treatment(n, 1, contrasts, sparse)
  contr[2, ] <- 0
  contr[, -1]
}
and then we can fit the model and pass along our special contrast
lm(z~A*B, dd, contrasts=list(A="contr.skip2"))
# Call:
# lm(formula = z ~ A * B, data = dd, contrasts = list(A = "contr.skip2"))
#
# Coefficients:
# (Intercept) Ac Ad Ae Af Ag Ah
# 1.09981 -0.14541 -0.86334 -0.18478 -0.77302 0.19681 0.23845
# Ai Aj B Ac:B Ad:B Ae:B Af:B
# -0.74962 -0.49014 0.09729 0.14705 1.09606 0.14706 0.88919
# Ag:B Ah:B Ai:B Aj:B
# -0.62796 -0.70155 1.60253 -0.20564
and, as you can see, we no longer have Ab terms in the model.
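It may help to look at what contr.skip2 actually produces. For a four-level factor, rows 1 and 2 are both all-zero, so the second level is absorbed into the baseline along with the first. A quick base-R check:

```r
contr.skip2 <- function(n, contrasts = TRUE, sparse = FALSE) {
  contr <- contr.treatment(n, 1, contrasts, sparse)
  contr[2, ] <- 0  # zero out the second level's row
  contr[, -1]      # drop the now all-zero column for level 2
}
contr.skip2(4)
#   3 4
# 1 0 0
# 2 0 0
# 3 1 0
# 4 0 1
```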
