Building a Generalized Linear Mixed Model in R

I am trying to build a GLMM in R but constantly get error messages (I am a complete beginner).
I conducted an experiment with camera traps in which I tested whether they react to a target that I pulled in front of them, so my response variable is binomial.
I am trying to build a GLMM in which all the variables are fixed factors and day (on which the experiment was conducted) is a random factor. Could anyone more experienced tell me what I am doing wrong (I first try with only one explanatory variable)?
I tried it with glmm() and with lmer():
library(glmm)
set.seed(1234)
ptm <- proc.time()
Detections <- glmm(Detection ~ 0 + Camera, random = list(~ 0 + Day),
                   varcomps.names = c("Day"), data = data1,
                   family.glmm = bernoulli.glmm, m = 10^4, debug = TRUE)
This one produces an obscenely large model object, even with the minimal dataset.
and with
library(lme4)
Detections_glmm <- lmer(Detection ~ Camera + (1|Day), family="binomial")
This one gives the following error message:
Error in lmer(Detection ~ Camera + (1 | Day), family = "binomial") :
  unused argument (family = "binomial")
Here is a minimal df:
data1 <- data.frame(
  Detection = c(1, 0, 0, 1, 1, 1, 1, 0, 0, 0),
  Temperature = as.factor(c("10", "10", "10", "10", "10", "20", "20", "0", "0", "0")),
  Distance = as.factor(c("75", "75", "75", "225", "225", "225", "75", "150", "150", "150")),
  Size = as.factor(c("0", "0", "0", "0", "1", "1", "1", "1", "2", "2")),
  Light = as.factor(c("1", "1", "1", "1", "1", "0", "0", "0", "0", "0")),
  Camera = as.factor(c("1", "1", "2", "2", "2", "3", "3", "3", "1", "1")),
  Day = as.factor(c("1", "1", "1", "2", "2", "2", "3", "3", "3", "2"))
)
And information about the variables:
Response variable:
Detection (binomial)
Explanatory variables:
Temperature of bottle (0, 10, 20)
Distance from camera (75, 150, 225)
Light (0/1)
Size of bottle (0, 1, 2)
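For reference, lmer() fits only Gaussian linear mixed models and does not take a family argument; in lme4, a binomial GLMM is fit with glmer() instead. A minimal sketch of the corresponding call, under the assumption that the minimal data frame above is assigned to data1:
library(lme4)
# sketch only: glmer() is lme4's GLMM function and accepts a family argument
Detections_glmm <- glmer(Detection ~ Camera + (1 | Day),
                         data = data1, family = binomial)
summary(Detections_glmm)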


Determine what is the break point for the slope change in R

I'm trying to implement a "change point" analysis, or a multiphase regression using nls() in R.
Here's some fake data I've made. The formula I want to use to fit the data is:
$y = \beta_0 + \beta_1x + \beta_2\max(0,x-\delta)$
What this is supposed to do is fit the data up to a certain point with a certain intercept and slope ($\beta_0$ and $\beta_1$), then, after a certain x value ($\delta$), augment the slope by $\beta_2$. That's what the whole max thing is about. Before the $\delta$ point, it'll equal 0, and $\beta_2$ will be zeroed out.
So, here's my function to do this:
changePoint <- function(x, b0, slope1, slope2, delta) {
  b0 + (x * slope1) + (max(0, x - delta) * slope2)
}
And I try to fit the model this way:
nls(y ~ changePoint(x, b0, slope1, slope2, delta),
    data = data,
    start = c(b0 = 50, slope1 = 0, slope2 = 2, delta = 48))
I chose those starting parameters because I know they are the true parameters: I made the data up.
However, I get this error:
Error in nlsModel(formula, mf, start, wts) :
  singular gradient matrix at initial parameter estimates
Have I just made unfortunate data? I tried fitting this on real data first, and was getting the same error, and I just figured that my initial starting parameters weren't good enough.
At first I thought it could be a problem resulting from the fact that max() is not vectorized, but that's not true. It does make changePoint a pain to work with, though, hence the following modification:
changePoint <- function(x, b0, slope1, slope2, delta) {
  b0 + (x * slope1) + (sapply(x - delta, function(t) max(0, t)) * slope2)
}
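For what it's worth, an equivalent and more idiomatic vectorization uses base R's element-wise pmax() instead of the sapply() call:
changePoint <- function(x, b0, slope1, slope2, delta) {
  # pmax() computes the element-wise maximum, so no explicit loop is needed
  b0 + (x * slope1) + (pmax(0, x - delta) * slope2)
}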
This R-help mailing list post describes one way in which this error may result: the rhs of the formula is overparameterized, such that changing two parameters in tandem gives the same fit to the data. I can't see how that is true of your model, but maybe it is.
In any case, you can write your own objective function and minimize it. The following function gives the squared error for data points (x,y) and a certain value of the parameters (the weird argument structure of the function is to account for how optim works):
sqerror <- function(par, x, y) {
  sum((y - changePoint(x, par[1], par[2], par[3], par[4]))^2)
}
Then we say:
optim(par = c(50, 0, 2, 48), fn = sqerror, x = x, y = data)
And see:
$par
[1] 54.53436800 -0.09283594 2.07356459 48.00000006
Note that for my fake data (x <- 40:60; data <- changePoint(x, 50, 0, 2, 48) + rnorm(21, 0, 0.5)) there are lots of local minima depending on the initial parameter values you give. I suppose if you wanted to take this seriously you'd call the optimizer many times with random initial parameters and examine the distribution of results, as sketched below.
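A minimal multi-start sketch along those lines (it assumes x, data, changePoint, and sqerror from above are in scope; the ranges for the random starts are an arbitrary choice):
set.seed(1)
fits <- replicate(100, {
  start <- c(b0 = runif(1, 40, 60), slope1 = runif(1, -1, 1),
             slope2 = runif(1, 0, 4), delta = runif(1, 41, 59))
  optim(par = start, fn = sqerror, x = x, y = data)
}, simplify = FALSE)
# keep the run with the smallest squared error
best <- fits[[which.min(sapply(fits, function(f) f$value))]]
best$par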
Just wanted to add that you can do this with many other packages. If you want to get an estimate of uncertainty around the change point (something nls cannot do), try the mcp package.
# Simulate the data
df = data.frame(x = 1:100)
df$y = c(rnorm(20, 50, 5), rnorm(80, 50 + 1.5*(df$x[21:100] - 20), 5))
# Fit the model
model = list(
  y ~ 1,    # Intercept
  ~ 0 + x   # Joined slope
)
library(mcp)
fit = mcp(model, df)
Let's plot it with a prediction interval (green line). The blue density is the posterior distribution for the change point location:
# Plot it
plot(fit, q_predict = TRUE)
You can inspect individual parameters in more detail using plot_pars(fit) and summary(fit).

Simulating an AR(2) model generates an error message

I am new to simulation, especially when it comes to time series, so I apologize if this question seems too naive. I am trying to understand why simulating this AR(2) model generates an error:
arima.sim(list(order = c(2, 0, 0), ar = c(0.7, 0.3)), n = time_n, sd = 0.2)
Error in arima.sim(list(order = c(2, 0, 0), ar = c(0.7, 0.3)), n = time_n, :
  'ar' part of model is not stationary
Any pointer will be appreciated!
According to theory (e.g. see here), in order for an autoregressive model to be stationary, if $r_j$ are the roots of the autoregressive polynomial
$1 - \phi_1 x - \phi_2 x^2 - \cdots - \phi_p x^p$
then
The linear AR(p) process is strictly stationary and ergodic if and only if $|r_j| > 1$ for all $j$, where $|r_j|$ is the modulus of the complex number $r_j$.
In your case
polyroot(c(1, -0.7, -0.3))
gives the roots (1, -3.333). One root has modulus exactly 1 (a unit root), so the condition $|r_j| > 1$ fails and the process is not stationary.
In fact, this is the actual code within arima.sim:
minroots <- min(Mod(polyroot(c(1, -model$ar))))
if (minroots <= 1)
  stop("'ar' part of model is not stationary")
Looking at the patterns and being lazy about the math, I suspect that the criterion for AR(2) translates to $\phi_1 + \phi_2 < 1$; that is indeed one of the standard AR(2) stationarity conditions (together with $\phi_2 - \phi_1 < 1$ and $|\phi_2| < 1$), and here $0.7 + 0.3 = 1$ violates it.
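A minimal sketch of the check; shrinking $\phi_2$ slightly (an arbitrary choice for illustration) restores stationarity:
Mod(polyroot(c(1, -0.7, -0.3)))    # minimum modulus is 1 -> not stationary
Mod(polyroot(c(1, -0.7, -0.25)))   # all moduli > 1 -> stationary
set.seed(42)
x_sim <- arima.sim(list(order = c(2, 0, 0), ar = c(0.7, 0.25)), n = 200, sd = 0.2)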

Logistic Regression in R: glm() vs rxGlm()

I fit a lot of GLMs in R. Usually I use revoScaleR::rxGlm() for this because I work with large data sets and use quite complex model formulae - and glm() just won't cope.
In the past these have all been based on Poisson or gamma error structures and log link functions. It all works well.
Today I'm trying to build a logistic regression model, which I haven't done before in R, and I have stumbled across a problem. I'm using revoScaleR::rxLogit() although revoScaleR::rxGlm() produces the same output - and has the same problem.
Consider this reprex:
df_reprex <- data.frame(x = c(1, 1, 2, 2),  # number of trials
                        y = c(0, 1, 0, 1))  # number of successes
df_reprex$p <- df_reprex$y / df_reprex$x    # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
             family = binomial,
             data = df_reprex,
             weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
                 data = df_reprex,
                 pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
The first call to glm() produces the correct answer. The second call to rxLogit() does not. Reading the docs for rxLogit(): https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxlogit it states that "Dependent variable must be binary".
So it looks like rxLogit() needs me to use y as the dependent variable rather than p. However if I run
glm_2 <- rxLogit(y ~ 1,
                 data = df_reprex,
                 pweights = "x")
I get an overall average
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))
of 0.5 instead, which also isn't the correct answer.
Does anyone know how I can fix this? Do I need to use an offset() term in the model formula, or change the weights, or...
(By using the revoScaleR package I occasionally paint myself into a corner like this, because not many others seem to use it.)
I'm flying blind here because I can't verify these in RevoScaleR myself -- but would you try running the code below and leave a comment as to what the results were? I can then edit/delete this post accordingly.
Two things to try:
1. Expand the data and get rid of the weights statement.
2. Use cbind(y, x - y) ~ 1 in either rxLogit or rxGlm, without weights and without expanding the data.
If the dependent variable is required to be binary, then the data has to be expanded so that each row corresponds to a single 1 or 0 response, and this expanded data is then run in a glm call without a weights argument.
I tried to demonstrate this with your example by applying labels to df_reprex and then making a corresponding df_reprex_expanded -- I know this is unfortunate, because you said the data you were working with was already large.
Does rxLogit allow a cbind representation, like glm() does (I put an example as glm_1b)? That would allow the data to stay the same size. From the rxLogit page I'm guessing not for rxLogit, but rxGlm might allow it, given the following note in the formula page:
A formula typically consists of a response, which in most RevoScaleR functions can be a single variable or multiple variables combined using cbind, the "~" operator, and one or more predictors, typically separated by the "+" operator. The rxSummary function typically requires a formula with no response.
Does glm_2b or glm_2c in the example below work?
df_reprex <- data.frame(x = c(1, 1, 2, 2),  # number of trials
                        y = c(0, 1, 0, 1),  # number of successes
                        trial = c("first", "second", "third", "fourth"))  # trial label
df_reprex$p <- df_reprex$y / df_reprex$x    # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
             family = binomial,
             data = df_reprex,
             weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
df_reprex_expanded <- data.frame(y = c(0, 1, 0, 0, 1, 0),
                                 trial = c("first", "second", "third", "third", "fourth", "fourth"))
## binary dependent variable
## expanded data
## no weights
glm_1a <- glm(y ~ 1,
              family = binomial,
              data = df_reprex_expanded)
exp(glm_1a$coefficients[1]) / (1 + exp(glm_1a$coefficients[1])) # overall fitted average 0.333 - correct
## cbind(success, failures) dependent variable
## compressed data
## no weights
glm_1b <- glm(cbind(y, x - y) ~ 1,
              family = binomial,
              data = df_reprex)
exp(glm_1b$coefficients[1]) / (1 + exp(glm_1b$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
                 data = df_reprex,
                 pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
glm_2a <- rxLogit(y ~ 1,
                  data = df_reprex_expanded)
exp(glm_2a$coefficients[1]) / (1 + exp(glm_2a$coefficients[1])) # overall fitted average ???
# try cbind() in rxLogit. If no, then try rxGlm below
glm_2b <- rxLogit(cbind(y, x - y) ~ 1,
                  data = df_reprex)
exp(glm_2b$coefficients[1]) / (1 + exp(glm_2b$coefficients[1])) # overall fitted average ???
# cbind() + rxGlm + family = binomial FTW(?)
glm_2c <- rxGlm(cbind(y, x - y) ~ 1,
                family = binomial,
                data = df_reprex)
exp(glm_2c$coefficients[1]) / (1 + exp(glm_2c$coefficients[1])) # overall fitted average ???

knn train and class have different lengths

I'm trying to use a kNN model to predict new data, but I get the error:
"'train' and 'class' have different lengths"
Can someone replicate and solve this error?
library(class)
weather <- c(1, 1, 1, 0, 0, 0)
temperature <- c(1, 0, 0, 1, 0, 0)
golf <- c(1, 0, 1, 0, 1, 0)
df <- data.frame(weather, temperature, golf)
df_new <- data.frame(weather = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
                     temp = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
                     sunnday = c(1, 1, 1, 0, 1, 1, 1, 0, 0))
pred_knn <- knn(train = df[, c(1, 2)], test = df_new, cl = df$golf, k = 1)
Thank you very much!
The knn function in R requires the training data to contain only the independent variables, because the dependent variable is passed separately through the cl parameter. A call of the following form fixes this particular error:
pred_knn <- knn(train = df[, c(1, 2)], test = df_new, cl = df$golf, k = 1)
However, note that running the above line will throw another error: the test set must have the same columns as the training set, and here df_new has three columns while the training data has two. Also, since knn calculates the Euclidean distance between observations, it requires all independent variables to be numeric. These pages have helpful related information. I would recommend using a different classifier for this particular data set.
https://towardsdatascience.com/k-nearest-neighbors-algorithm-with-examples-in-r-simply-explained-knn-1f2c88da405c
https://discuss.analyticsvidhya.com/t/how-to-resolve-error-na-nan-inf-in-foreign-function-call-arg-6-in-knn/7280/4
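For what it's worth, a minimal sketch of one way to clear that column mismatch, under the assumption that the first two columns of df_new (weather and temp) are the intended predictors:
library(class)
# train and test now both have two numeric predictor columns
pred_knn <- knn(train = df[, c(1, 2)],
                test = df_new[, c(1, 2)],
                cl = df$golf, k = 1)
pred_knn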
Hope this helps.
I had a similar issue with data from the ISLR Weekly data frame:
knn.pred = knn(train$Lag2, test$Lag2, train$Direction, k = 1)
Error in knn(train$Lag2, test$Lag2, train$Direction, k = 1) :
  dims of 'test' and 'train' differ
where
train = subset(Weekly, Year >= 1990 & Year <= 2008)
test = subset(Weekly, Year < 1990 | Year > 2008)
I finally solved it by wrapping the train and test sets in as.matrix(), since selecting a single column with $ returns a plain vector rather than the matrix-like object knn() expects:
knn.pred = knn(as.matrix(train$Lag2), as.matrix(test$Lag2), train$Direction, k = 1)

Can't work out what is missing from my R function code - stopping it from running well

I'm trying to do Bayesian occupancy analysis with site covariates. My first step is writing a function. I keep getting the + continuation prompt in my R console, indicating that R thinks my code is incomplete. Having run the lines individually, I am pretty certain the issue lies in the first line of code. However, I can't work out where exactly I've missed something out, and hence where the problem originally lies.
data.fn <- function(R = 39, T = 14, xmin = 0, xmax = 1, alpha.psi = 0.4567,
                    beta.psi = 0.0338, alpha.p = 0.4, beta.p = 0.4) {
  y <- array(dim = c(R, T)) # This creates an array for counts
  # Ecological process
  # Covariate values
  X <- sort(runif(n = R, min = xmin, max = xmax))
  # Expected occurrence-covariate relationship
  psi <- plogis(alpha.psi + beta.psi * X) # this applies the inverse logit
  # Add Bernoulli noise - drawing indicator of occurrence (z) from Bernoulli(psi)
  z <- rbinom(n = R, size = 1, prob = psi)
  occ.fs <- sum(z) # "Finite Sample Occupancy"
  # Make a census
  p.eff <- z * p
  for (i in 1:T) {
    y[, i] <- rbinom(n = R, size = 1, prob = p.eff)
  }
}
There's more code - i.e. the function's braces are balanced and complete - but the issue started before that point, and I keep having trouble uploading the rest of the code to Stack Overflow.
The error message is simply + all down the left-hand side of the R console.
EDIT
Could there be something wrong with how R is reading the input? For instance, with the following code
naive.pred <- plogis(predict(glm(apply(y, 1, max) ~ X + I(X^2),
                                 family = binomial)))
I got the error message "unexpected symbol" pointing into family = binomial, yet every bracket is paired correctly - there are no extra, unnecessary brackets.
While I did not see a + issue when I looked over your code, you are not simulating the observed data correctly, and there is a p object inside your function that never gets a value passed to it. You did create a logit-scale linear predictor for psi using alpha.psi and beta.psi; however, you are lacking a logit-scale linear predictor for the probability of detecting a species given that it is present, built from alpha.p and beta.p. Assuming that the covariate X is used for both the latent occupancy state and the observation model, the code should become:
data.fn <- function(R = 39, T = 14, xmin = 0, xmax = 1, alpha.psi = 0.4567,
                    beta.psi = 0.0338, alpha.p = 0.4, beta.p = 0.4) {
  y <- array(dim = c(R, T)) # This creates an array for counts
  # Ecological process
  # Covariate values
  X <- sort(runif(n = R, min = xmin, max = xmax))
  # Expected occurrence-covariate relationship
  psi <- plogis(alpha.psi + beta.psi * X) # this applies the inverse logit
  # Add Bernoulli noise - drawing indicator of occurrence (z) from Bernoulli(psi)
  z <- rbinom(n = R, size = 1, prob = psi)
  occ.fs <- sum(z) # "Finite Sample Occupancy"
  # Linear predictor for detection,
  # assuming the same covariate is used for detection
  p.eff <- plogis(alpha.p + beta.p * X)
  for (i in 1:T) {
    y[, i] <- rbinom(n = R, size = 1, prob = p.eff * z)
  }
  return(list(y = y, z = z, X = X, occ.fs = occ.fs))
}
This code assumes that you are passing logit-scale parameters, so if you are trying to simulate data in which X has a marginal, positive influence on occupancy, then you are off to the races, so to speak. If you are looking for a more pronounced effect, then you should increase the effect size. Finally, 39 sites is very few for an occupancy analysis, given that binary detection/non-detection data is quite information-poor. Don't be surprised if the posterior estimates you get from analyzing the dataset do not actually recover the parameters used to simulate the data.
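As a quick usage sketch (simply calling the corrected function with its defaults and inspecting what it returns):
set.seed(24)
sim <- data.fn()
str(sim$y)  # 39 x 14 matrix of simulated detection/non-detection data
sim$occ.fs  # realized number of occupied sites in this draw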
