Specifying truncation point in glmmTMB R package

I am working with a large dataset that contains longitudinal data on the gambling behavior of 184,113 participants. The data are based on complete tracking of electronic gambling behavior within a gambling operator. Gambling behavior is aggregated at the monthly level, for a total of 70 months. I have an ID variable separating participants, a time variable (month), as well as numerous gambling behavior variables such as days played in a given month, bets placed in a given month, total losses in a given month, etc. Participants vary in when they were actively gambling: one participant may have gambled in months 2, 3, 4, and 7, another in months 3, 5, and 7, and a third in months 23, 24, 48, and 65.
I am attempting to fit a truncated negative binomial 2 model in glmmTMB, and I am wondering how the package handles the absence of zeros. I have longitudinal data on gambling behavior: days played for each month (for a total of 70 months). The variable can take values between 1 and 31 (depending on the month); there are no zeros, because participants' months with 0 days played are absent from the dataset. Here is an example of how the data are structured, with just two participants:
# Example variables and data frame in long form
# Includes id variable, time variable and example variable
id <- c(1, 1, 1, 1, 2, 2, 2)
time <- c(2, 3, 4, 7, 3, 5, 7)
daysPlayed <- c(2, 2, 3, 3, 2, 2, 2)
dfLong <- data.frame(id = id, time = time, daysPlayed = daysPlayed)
My question: how do I specify where the truncation happens in glmmTMB? Does it default to 0? I want to truncate at 0 and have run the following code (I am going to compare models; the first one is a simple unconditional one):
DaysPlayedUnconditional <- glmmTMB(daysPlayed ~ 1 + (1 | id),
                                   data = dfLong, family = truncated_nbinom2)
Will it do the trick?

From Ben Bolker via r-sig-mixed-models@r-project.org:
"I'm not 100% clear on your question, but: glmmTMB only does zero-truncation, not k-truncation with k>0, i.e. you can only specify the model Prob(x==0) = 0 Prob(x>0) = Prob(NBinom(x))/Prob(NBinom(x>0)) (terrible notation, but hopefully you get the idea)"

Related

R: mixed models - how to predict a variable using previous values of this same variable

I struggle with multilevel models and prepared a reproducible example to be clear.
Let's say I would like to predict the height of children after 12 months of follow-up, i.e. their height at month == 12, using the height values obtained previously, but also their previous weight values, with a data frame such as this one:
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                 month = c(1, 3, 6, 12, 1, 6, 12, 1, 6, 8, 12),
                 weight = c(14, 15, 17, 18, 21, 21, 22, 8, 8, 9, 10),
                 height = c(100, 102, 103, 104, 122, 123, 125, 82, 86, 88, 90))
   ID month weight height
1   1     1     14    100
2   1     3     15    102
3   1     6     17    103
4   1    12     18    104
5   2     1     21    122
6   2     6     21    123
7   2    12     22    125
8   3     1      8     82
9   3     6      8     86
10  3     8      9     88
11  3    12     10     90
My plan was to use the following model (obviously I have much more data than 3 patients, and more lines per patient). Because the heights are correlated within each patient, I wanted to add a random intercept (1 | ID), but also a random slope, which is why I used (month | ID) (I saw in several examples of predicting students' scores that the "occasion" or "test day" was added as a random slope). So I used the following code.
library(tidymodels)
library(multilevelmod)
library(lme4)
# Specifications
mixed_model_spec <- linear_reg() %>%
  set_engine("lmer") %>%
  set_args(na.action = na.exclude, control = lmerControl(optimizer = "bobyqa"))
# Fitting the model
mixed_model_fit <- mixed_model_spec %>%
  fit(height ~ weight + month + (month | ID), data = df)
My first problem is that if I add weight (and its multiple values per ID) as a variable, I get the following message: "boundary (singular) fit: see help('isSingular')" (even on my large dataset), while if I keep only variables with one value per patient (e.g. sex) I do not have this problem. Can anyone explain why?
My second problem is that by training a similar model, I can predict the height of new children at nearly every month (I get a predicted value at month 1, month X, ..., month 12) that I can compare to the real values collected in my test set.
However, what I am interested in is predicting the value at month 12 while integrating the previous values of each patient in this test set. In other words, I do not want the model to predict the whole set of values from scratch (more precisely, only from the patient data used for training), but also from the previous values of the new patient at month 1, month 4, month 6, etc. that are already available. How can I write my code to obtain such a prediction?
Thanks a lot for your help!
My first problem is that if I add weight (and its multiple values per ID) as a variable, I get the following message: "boundary (singular) fit: see help('isSingular')" (even on my large dataset), while if I keep only variables with one value per patient (e.g. sex) I do not have this problem. Can anyone explain why?
This happens when the random-effects structure is too complex to be supported by the data. Beyond that, it is usually not possible to identify exactly why it happens in some situations and not others; basically, the model is overfitted. A few things you can try are:
centering the month variable
centering other numeric variables
fitting the model without the correlation between random slopes and intercepts, by using || instead of |
There are also some related questions and answers here:
https://stats.stackexchange.com/questions/378939/dealing-with-singular-fit-in-mixed-models/379068#379068
https://stats.stackexchange.com/questions/509892/why-is-this-linear-mixed-model-singular/509971#509971
As for the second question, it sounds like you want some kind of time series model. An autoregressive model such as AR(1) might be sufficient, but this is not supported by lme4. You could try nlme instead.
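As a sketch of that suggestion (assuming the nlme package), a random-intercept model with an AR(1) within-patient residual correlation might look like the following; with only the 3 toy patients from the question the fit may not converge, so this is shown for the syntax rather than as a definitive result:

```r
library(nlme)

df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                 month = c(1, 3, 6, 12, 1, 6, 12, 1, 6, 8, 12),
                 weight = c(14, 15, 17, 18, 21, 21, 22, 8, 8, 9, 10),
                 height = c(100, 102, 103, 104, 122, 123, 125, 82, 86, 88, 90))

# Random intercept per patient, plus AR(1) correlation of the residuals
# over month within each patient
fit <- lme(height ~ weight + month,
           random = ~ 1 | ID,
           correlation = corAR1(form = ~ month | ID),
           data = df)
summary(fit)
```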

Calculating the value to know the trend in a set of numeric values

I have a requirement where I have a set of numeric values, for example: 2, 4, 2, 5, 0.
As we can see, the trend in the above set of numbers is mixed, but since the latest number is 0, I would consider the value to be going DOWN. Is there any way to measure the trend (whether it is going up or down)?
Is there an R package available for that?
Thanks
Suppose your vector is c(2, 4, 2, 5, 0) and you want to classify the last change (increasing, constant, or decreasing); then you could use the diff function with a lag of 1. Note that the comparison has to be made on the last difference, not on the last value itself. Below is an example.
MyVec <- c(2, 4, 2, 5, 0)
Lagged_vec <- diff(MyVec, lag = 1)  # changes between consecutive values
last_change <- Lagged_vec[length(Lagged_vec)]
if (last_change < 0) {
  print("Decreasing")
} else if (last_change == 0) {
  print("Constant")
} else {
  print("Increasing")
}
Please let me know if this is what you wanted.
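The same idea can be wrapped in a small reusable helper (the function name trend_last is hypothetical, not from any package), using tail to grab the final first-difference:

```r
# Classify the most recent change in a numeric vector by the sign
# of its last first-difference
trend_last <- function(x) {
  d <- tail(diff(x), 1)  # the last change
  if (d < 0) "Decreasing" else if (d == 0) "Constant" else "Increasing"
}

trend_last(c(2, 4, 2, 5, 0))  # "Decreasing": the last step falls from 5 to 0
```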

Multivariate Granger's causality

I'm having issues doing a multivariate Granger causality test. I'd like to check whether conditioning on a third variable affects the results of a causality test.
Here is one sample for a single dependent and independent variable, based on an earlier question I asked that was answered by @Alex:
Granger's causality test by column
library(lmtest)
M1 <- matrix(c(2, 3, 1, 4, 3, 3, 1, 1, 5, 7), nrow = 5, ncol = 2)
M2 <- matrix(c(7, 3, 6, 9, 1, 2, 1, 2, 8, 1), nrow = 5, ncol = 2)
M3 <- matrix(c(1, 3, 1, 5, 7, 3, 1, 3, 3, 4), nrow = 5, ncol = 2)
For example, the equation for a conditioned linear regression will be
formula = y ~ w + x * z
How do I carry out this test as a function of a third or fourth variable please?
1. The solution for stationary variables is well-established: see the FIAR (v 0.3) package.
This is the paper related to the package, which includes a concrete example of multivariate Granger causality (for the case where all of the variables are stationary).
Page 12: theory; page 15: practice.
2. In the case of mixed (stationary and nonstationary) variables, make all the variables stationary first (via differencing, etc.). Leave the already-stationary ones untouched. Then finish with the procedure from case 1.
3. In the case of non-cointegrated nonstationary variables, there is no need for a VECM. Run a VAR on the variables (after making them stationary first, of course), then apply FIAR::condGranger etc.
4. In the case of cointegrated nonstationary variables, the answer is really very long:
Johansen procedure (detect the cointegration rank via urca::ca.jo)
Apply vec2var to convert the VECM to a VAR (since FIAR is based on VAR).
John Hunter's latest book nicely summarizes what can happen and what can be done in this last case.
You may want to read this as well.
To my knowledge, conditional/partial Granger causality supersedes GC via the "block exogeneity Wald test over VAR".

How to specify formula in linear model with 100 dependent variables without having to write them explicitly in R

The problem is to (a) model the intra-day demand in ATM withdrawals and (b) create prediction intervals for future demand. One day has 144 10-minute periods, and my dataset is the number of ATM withdrawals in each period. Here is a chart so you can get a glimpse of what I'm talking about.
My dataset also has other data (mainly dummies), such as WeekDay and Holiday. For the purpose of this post, I'll be using the following data.frame as a representation of my dataset (which has only 6 time periods, between 00:10 and 01:00, and not the full day):
df <- data.frame(H0010 = 1, H0020 = 2, H0030 = 3, H0040 = 4, H0050 = 5, H0100 = 6,
                 WeekDay = 7, Holiday = 8)
The first idea that crossed my mind was to fit a linear regression; more precisely, a multivariate multiple linear regression. But because I have 144 dependent variables (one for each 10-minute period) and not only 6, my code in R would be hugely long:
lm.fit <- lm(cbind(H0010, H0020, H0030, H0040, H0050, H0100,
                   H0200, H0210, H0220, H0230, H0240, H0250,
                   # ... and it goes on and on until midnight ...
                   H2310, H2320, H2330, H2340, H2350, H2359)
             ~ WeekDay + Holiday, data = df)
Is there a way I could write the model formula without having to specify all 144 dependent variables?
I would also appreciate any other thoughts on how to address this problem using other methods (although this post's question is the one mentioned above).
EDIT:
My dataset is composed of the dependent variables (numbers of transactions) and dummies, which are factors. As such, the solution lm(cbind(-WeekDay, -Holiday) ~ WeekDay + Holiday, data = df) does not work.
f <- sprintf("cbind(%s) ~ WeekDay + Holiday",
             paste(names(df)[1:6], collapse = ", "))
lm(as.formula(f), data = df)
Sure, you can select variables by specifying which you would like to exclude:
lm(cbind(-WeekDay, -Holiday) ~ WeekDay + Holiday, data=df)
EDIT:
How's this? I included a more realistic data frame too.
df <- data.frame(H0010 = rnorm(100, 1, 1), H0020 = rnorm(100, 2, 1),
                 H0030 = rnorm(100, 3, 1), H0040 = rnorm(100, 4, 1),
                 H0050 = rnorm(100, 5, 1), H0100 = rnorm(100, 6, 1),
                 WeekDay = factor(c(rep(seq(1, 7), 14), 1, 2)),
                 Holiday = factor(rbinom(100, 1, prob = .05)))
y <- as.matrix(df[, 1:6])
x <- model.matrix(~ df$WeekDay + df$Holiday)
lm(y ~ 0 + x)  # suppress the intercept, as it's already in the model.matrix

how to extract timestamps from ts object in r

Consider the treering dataset.
library(datasets)
tr <- treering
length(tr)
[1] 7980
class(tr)
[1] "ts"
From my understanding, it is a time series of length 7980.
How can I find out what the time stamps are for each value?
After plotting the time series and looking at the x-axis of the plot, the time stamps appear to range from about -6000 to 2000. But to me the time stamps appear to be "hidden".
plot(tr)
More generally, I'm trying to understand what exactly is a ts object and what are the benefits of using this type of object.
A univariate or multivariate time series can easily be displayed in a data frame with two or more columns: time and the variable(s).
univariatetimeseries <- data.frame(Time = c(0, 1, 2, 3, 4, 5, 6),
                                   y = c(1, 2, 3, 4, 5, 6, 7))
multivariatetimeseries <- data.frame(Time = c(0, 1, 2, 3, 4, 5, 6),
                                     y = c(1, 2, 3, 4, 5, 6, 7),
                                     z = c(7, 6, 5, 4, 3, 2, 1))
This seems simple and straightforward to me, and it is consistent with the basic science examples I learned in high school. Additionally, the time stamps are not "hidden", as they are in the treering example. So what are the benefits of using a ts object?
An object class comes with many generic functions for convenience. For the "ts" class there are ts.plot, plot.ts, etc. If you store your time series as a data frame, you have to do much of that work yourself when plotting it.
Perhaps for seasonal time series the advantage of using "ts" is more evident. For example, x <- ts(rnorm(36), start = c(2000, 1), frequency = 12) generates a monthly time series spanning 3 years. The print method will nicely arrange it like a matrix when you print x.
A "ts" object also has a number of attributes. Model-fitting routines like arima0 and arima can read those attributes, so you don't need to specify them manually.
For your question, there are a number of functions to extract or set the time attributes of a time series. Have a look at ?start, ?tsp, ?time, and ?window.
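To answer the concrete question ("what are the time stamps?"), those helpers recover them directly. A short sketch with the treering series (assuming the standard datasets::treering that ships with R):

```r
tr <- treering  # yearly "ts" object from the datasets package

start(tr)      # time stamp of the first observation
end(tr)        # time stamp of the last observation
frequency(tr)  # number of observations per unit of time (1 per year here)

# time() materialises the "hidden" stamps as a ts of the same length
stamps <- time(tr)
head(stamps)

# window() subsets by time stamp rather than by index
window(tr, start = 1900, end = 1910)
```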
