Modelling for zero using glm function in R - r

I am trying to build a logistic regression model using glm function in R. My dependent variable is binomial with 0 and 1 only. Here 0 - Non Return , 1- Return.
I want to model for Non-Return (0's),but glm function of R by default build for 1's. Like in SAS which by default build for lower value and we can use descending attribute in proc logistic to change the order, do we have something similar in glm too ?
I have one option to achieve this by changing 0 to 1 and vice-versa in my raw data but don't want to change my raw data.
Please help me or guide how can I do the similar thing in R.
Thanks in advance.

Just specify 1 - y as the DV:
set.seed(42)
y <- sample(c(0, 1), 10, TRUE)
#[1] 1 1 0 1 1 1 1 0 1 1
fit <- glm(y ~ 1, family = binomial)
coef(fit)
# (Intercept)
# 1.386294
log(mean(y) / (1 - mean(y)))
#[1] 1.386294
1 - y
#[1] 0 0 1 0 0 0 0 1 0 0
fit1 <- glm(1 - y ~ 1, family = binomial)
coef(fit1)
#(Intercept)
#-1.386294
log(mean(1 - y) / (1 - mean(1 - y)))
#[1] -1.386294

Alternatively, you can temporarily transform your data by using...transform:
glm( data = transform( data.frame(y=0), y=y+1 ), ... )

Related

Ranger Predicted Class Probability of each row in a data frame

With regard to this link Predicted probabilities in R ranger package, I have a question.
Imagine I have a mixed data frame, df (comprising of factor and numeric variables) and I want to do classification using ranger. I am splitting this data frame as test and train sets as Train_Set and Test_Set. BiClass is my prediction factor variable and comprises of 0 and 1 (2 levels)
I want to calculate and attach class probabilities to the data frame using ranger using the following commands:
Biclass.ranger <- ranger(BiClass ~ ., ,data=Train_Set, num.trees = 500, importance="impurity", save.memory = TRUE, probability=TRUE)
probabilities <- as.data.frame(predict(Biclass.ranger, data = Test_Set, num.trees = 200, type='response', verbose = TRUE)$predictions)
The data frame probabilities is a data frame consisting of 2 columns (0 and 1) with number of rows equal to the number of rows in Test_Set.
Does it mean, if I append or attach this data frame, namely, probabilities to the Test_Set as the last two columns, it shows the probability of each row being either 0 or 1? Is my understanding correct?
My second question, when I attempt to calcuate confusion matrix through
pred = predict(Biclass.ranger, data=Test_Set, num.trees = 500, type='response', verbose = TRUE)
table(Test_Set$BiClass, pred$predictions)
I get the following error:
Error in table(Test_Set$BiClass, pred$predictions) :
all arguments must have the same length
What am I doing wrong?
For your first question yes, it shows the probability of each row being 0 or 1. Using the example below:
library(ranger)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor",1,0))
Train_Set = data[idx,]
Test_Set = data[-idx,]
mdl <- ranger(Species ~ ., ,data=Train_Set,importance="impurity", save.memory = TRUE, probability=TRUE)
probabilities <- as.data.frame(predict(mdl, data = Test_Set,type='response', verbose = TRUE)$predictions)
We can always check whether they agree:
par(mfrow=c(1,2))
boxplot(probabilities[,"0"] ~ Test_Set$Species,ylab="Prob 0",xlab="Actual label")
boxplot(probabilities[,"1"] ~ Test_Set$Species,ylab="Prob 1",xlab="Actual label")
Not the best plot, but sometimes if the labels are flipped you will see something weird. We need to find the column that has the max probability and assign the label, for this we do:
max.col(probabilities) - 1
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0
This goes through each row of probabilities returns 1 or 2 depending on which column has maximum probability and we simply subtract 1 from it to get 0,1. For the confusion matrix:
caret::confusionMatrix(table(max.col(probabilities) - 1,Test_Set$Species))
Confusion Matrix and Statistics
0 1
0 31 2
1 0 17
Accuracy : 0.96
95% CI : (0.8629, 0.9951)
No Information Rate : 0.62
P-Value [Acc > NIR] : 2.048e-08
In your case, you can just do:
confusionMatrix(table(max.col(probabilities)-1,Test_Set$BiClass))

Custom function to compute contrasts in emmeans

I want to create a custom contrast function in emmeans which could remove a given list of levels from the input vector and apply the built-in contrast method ("trt.vs.ctrl") on the remaining levels. An example dataset is available here. I am using the following R code for computing ANOVA and post hoc comparisons:
options(contrasts=c("contr.sum", "contr.poly"))
my_lm <- lm(D1 ~ C*R, data=df)
Anova(my_lm, type = "III")
#show Interaction effects using emmeans
emmip(my_lm, C ~ R )
emm = emmeans(my_lm, ~ C * R)
emm
contrast(emmeans(my_lm, ~ C * R), "consec", by = "C")
#compare 1st with next 3 groups (how to remove other three levels?)
contrast(emmeans(my_lm, ~ C * R), "trt.vs.ctrl", by = "R")
The built-in contrast option ("trt.vs.ctrl") compares the first level with everything that follows it (there are 7 factor levels in C, and I want to remove last 3 of them and compute the contrasts for the remaining 4). An example is provided in the official documentation to write a custom contrast function.
skip_comp.emmc <- function(levels, skip = 1, reverse = FALSE) {
if((k <- length(levels)) < skip + 1)
stop("Need at least ", skip + 1, " levels")
coef <- data.frame()
coef <- as.data.frame(lapply(seq_len(k - skip - 1), function(i) {
sgn <- ifelse(reverse, -1, 1)
sgn * c(rep(0, i - 1), 1, rep(0, skip), -1, rep(0, k - i - skip - 1))
}))
names(coef) <- sapply(coef, function(x)
paste(which(x == 1), "-", which(x == -1)))
attr(coef, "adjust") = "fdr" # default adjustment method
coef
}
However due to my limited understanding I am not very sure where to apply the modifications that I need to to customise the example. Any ideas?
Is this something you are going to want to do lots of times in the future? My guess is not, that you only want to do this once, or a few times at most; in which case it is way too much trouble to write a custom contrast function. Just get the contrast coefficients you need, and use that as the second argument in contrast.
Now, consider these results:
> con <- emmeans:::trt.vs.ctrl.emmc(1:7)
> con
2 - 1 3 - 1 4 - 1 5 - 1 6 - 1 7 - 1
1 -1 -1 -1 -1 -1 -1
2 1 0 0 0 0 0
3 0 1 0 0 0 0
4 0 0 1 0 0 0
5 0 0 0 1 0 0
6 0 0 0 0 1 0
7 0 0 0 0 0 1
From the description, I think you just want the first 3 sets of contrast coefficients. So use those columns:
contrast(emm, con[, 1:3], by = "R")
Update
StackOverflow can occasionally inspire developers to add software features. In this case, I decided it could be useful to add an exclude argument to most built-in .emmc functions in emmeans (all except poly.emmc()). This was fairly straightforward to do, and those features are now incorporated in the latest push to github -- https://github.com/rvlenth/emmeans. These features will be included in the next CRAN update as well.

Unique intercepts approach for categorical variables in "rstanarm" package in R

Background:
McElearth (2016) in his rethinking book pages 158-159, uses an index variable instead of dummy coding for a 3-category variable called "clade" to predict "kcal.per.g" (linear regression).
Question: I was wondering if we could apply the same approach in "rstanarm"? I have provided data and R code for a possible demonstration below.
library("rethinking") # A github package not on CRAN
data(milk)
d <- milk
d$clade_id <- coerce_index(d$clade) # Index variable maker
#[1] 4 4 4 4 4 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 # index variable
# Model Specification:
fit1 <- map(
alist(
kcal.per.g ~ dnorm( mu , sigma ) ,
mu <- a[clade_id] ,
a[clade_id] ~ dnorm( 0.6 , 10 ) ,
sigma ~ dunif( 0 , 10 )
) ,
data = d )
The most analogous way to do this using the rstanarm package is with
library(rstanarm)
fit1 <- stan_glmer(kcal.per.g ~ 1 + (1 | clade_id), data = milk,
prior_intercept = normal(0.6, 1, autoscale = FALSE),
prior_aux = exponential(rate = 1/5, autoscale = FALSE),
prior_covariance = decov(shape = 10, scale = 1))
However, this is not exactly the same for the following reasons:
Bounded uniform priors on sigma are not implemented because they are not a good idea, so I have used an exponential distribution with an expectation of 5 instead
Fixing the standard deviation on a is not implemented either so I have used a gamma distribution with an expectation of 10
Hierarchical models in rstanarm (and lme4) are parameterized with deviations from common parameters, so rather than using an expectation of 0.6 for a, I have used an expectation of 0.6 for the global intercept and the prior on a is normal with an expectation of zero. This means you need to call coef(fit1) rather than ranef(fit1) to see the "intercepts" as they are parameterized in the original model.

JAGS Runtime Error: Cannot insert node into X[ ]. Dimension Mismatch

I'm trying to add a bit of code to a data-augmentation capture-recapture model and am coming up with some errors I haven't encountered before. In short, I want to estimate a series of survivorship phases that each last more than a single time interval. I want the model to estimate the length of each survivorship phase and use that to improve the capture recapture model. I tried and failed with a few different approaches, and am now trying to accomplish this using a switching state array for the survivorship phases:
for (t in 1:(n.occasions-1)){
phi1switch[t] ~ dunif(0,1)
phi2switch[t] ~ dunif(0,1)
phi3switch[t] ~ dunif(0,1)
phi4switch[t] ~ dunif(0,1)
psphi[1,t,1] <- 1-phi1switch[t]
psphi[1,t,2] <- phi1switch[t]
psphi[1,t,3] <- 0
psphi[1,t,4] <- 0
psphi[1,t,5] <- 0
psphi[2,t,1] <- 0
psphi[2,t,2] <- 1-phi2switch[t]
psphi[2,t,3] <- phi2switch[t]
psphi[2,t,4] <- 0
psphi[2,t,5] <- 0
psphi[3,t,1] <- 0
psphi[3,t,2] <- 0
psphi[3,t,3] <- 1-phi3switch[t]
psphi[3,t,4] <- phi3switch[t]
psphi[3,t,5] <- 0
psphi[4,t,1] <- 0
psphi[4,t,2] <- 0
psphi[4,t,3] <- 0
psphi[4,t,4] <- 1-phi4switch[t]
psphi[4,t,5] <- phi4switch[t]
psphi[5,t,1] <- 0
psphi[5,t,2] <- 0
psphi[5,t,3] <- 0
psphi[5,t,4] <- 0
psphi[5,t,5] <- 1
}
So this creates a [5,t,5] array where the survivorship state can only switch to the subsequent state and not backwards (e.g. 1 to 2, 4 to 5, but not 4 to 3). Now I create a vector where the survivorship state is defined:
PhiState[1] <- 1
for (t in 2:(n.occasions-1)){
# State process: draw PhiState(t) given PhiState(t-1)
PhiState[t] ~ dcat(psphi[PhiState[t-1], t-1,])
}
We start in state 1 always, and then take a categorical draw at each time step 't' for remaining in the current state or moving on to the next one given the probabilities within the array. I want a maximum of 5 states (assuming that the model will be able to functionally produce fewer by estimating the probability of moving from state 3 to 4 and onwards near 0, or making the survivorship value of subsequent states the same or similar if they belong to the same survivorship value in reality). So I create 5 hierarchical survival probabilities:
for (a in 1:5){
mean.phi[a] ~ dunif(0,1)
phi.tau[a] <- pow(phi_sigma[a],-2)
phi.sigma[a] ~ dunif(0,20)
}
Now this next step is where the errors start. Now that I've assigned values 1-5 to my PhiState vector it should look something like this:
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 5
or maybe
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
and I now want to assign a mean.phi[] to my actual phi[] term, which feeds into the model:
for(t in 1:(n.occasions-1)){
phi[t] ~ dnorm(mean.phi[PhiState[t]],phi.tau[PhiState[t]])
}
However, when I try to run this I get the following error:
Error in jags.model(model.file, data = data, inits = init.values, n.chains = n.chains, :
RUNTIME ERROR:
Cannot insert node into mean.phi[1:5]. Dimension mismatch
It's worth noting that the model works just fine when I use the following phi[] determinations:
phi[t] ~ dunif(0,1) #estimate independent annual phi's
or
phi[t] ~ dnorm(mean.phi,phi_tau) #estimate hierarchical phi's from a single mean.phi
or
#Set fixed survial periods (this works the best, but I don't want to have to tell it when
#the periods start/end and how many there are, hence the current exercise):
for (a in 1:21){
surv[a] ~ dnorm(mean.phi1,phi1_tau)
}
for (b in 22:30){
surv[b] ~ dnorm(mean.phi2,phi2_tau)
}
for (t in 1:(n.occasions-1)){
phi[t] <- surv[t]
}
I did read this post: https://sourceforge.net/p/mcmc-jags/discussion/610037/thread/36c48f25/
but I don't see where I'm redefining variables in this case... Any help fixing this or advice on a better approach would be most welcome!
Many thanks,
Josh
I'm a bit confused as to what are your actual data (the phi[t]?), but the following might give you a starting point:
nt <- 29
nstate <- 5
M <- function() {
phi_state[1] <- 1
for (t in 2:nt) {
up[t-1] ~ dbern(p[t-1])
p[t-1] <- ifelse(phi_state[t-1]==nstate, 0, p_[t-1])
p_[t-1] ~ dunif(0, 1)
phi_state[t] <- phi_state[t-1] + equals(up[t-1], 1)
}
for (k in 1:nstate) {
mean_phi[k] ~ dunif(0, 1)
phi_sigma[k] ~ dunif(0, 20)
}
for(t in 1:(nt-1)){
phi[t] ~ dnorm(mean_phi[phi_state[t]], phi_sigma[phi_state[t]]^-2)
}
}
library(R2jags)
fit <- jags(list(nt=nt, nstate=nstate), NULL,
c('phi_state', 'phi', 'mean_phi', 'phi_sigma', 'p'),
M, DIC=FALSE)
Note that above, p is a vector of probabilities of moving up to the next (adjacent) state.

by group analysis using svyglm in a data.table

I have the following data in a data.table:
h x1 y1 swNx11
1: 1 39.075565717 0 1.03317231703408
2: 1 40.445951251 0 7.14418755725832
3: 1 37.800722944 0 0.435946586361557
4: 1 41.085221504 0 0.381347141150498
5: 1 36.318077491 0 0.497077163135359
---
24996: 25 39.110138193 0 0.942922612158002
24997: 25 39.331940413 0 1.42227399208458
24998: 25 37.479473784 0 0.390657876415799
24999: 25 35.892044242 0 0.599937357458247
25000: 25 40.699588303 0 0.486486760245521
I've created a function to analyse them in svyglm:
msmMC <- function(y, x, sw, name){
msm <- svyglm(y ~ x,family=quasibinomial(link="logit"),design = svydesign(~ 1, weights = ~ sw))
out <- cbind("name",coef(summary(msm))[2,1],coef(summary(msm))[2,2])
return(out)
}
msmswNx1<-dt2[,list(dtmsm=list(msmMC(y1, x1, swNx1, Nx1))),by="h"]
outNx1 <- unlist(dt.lm[,msmswNx1])
When I run this function, I get the following error:
Error in [.data.table(dt2, , list(dtmsm = list(msmMC(y1, x1, swNx1, :
column or expression 1 of 'by' or 'keyby' is type list. Do not quote column names. Useage: DT[,sum(colC),by=list(colA,month(colB))]
Yet it works fine with a different model, such as glm or polr. So what is going on here? Why is svyglm so picky about by-group processing with a data.table?
I doubt that it has worked for lm glm or polr as the error is an argument matching one.
You will need to wrap the whole thing in list
dt2[,list(dtmsm=list(msmMC(y1, x1, swNx1, Nx1))),by="h"]
Or perhaps, you have just misplaced the list call given that msmMC appears to return an object that might be a data.frame, list or data.table
dt2[,list(dtmsm=msmMC(y1, x1, swNx1, Nx1)),by="h"]

Resources