GLMM of proportions adjusting for group size in R

I'm trying to investigate whether the proportion of muzzle contact (mc) in primates tends to be directed more towards the mother than towards other group members (adults or juveniles). I have data over 5 years in 4 different groups. This is an example for 3 different initiators (those initiating the mc):
age1data

initiator receiver  count total_init prop_mc subgroupsize group
Aaa       Mother        1          3   0.333            1     1
Aaa       Adult         2          3   0.666           40     1
Aaa       Juvenile      0          3   0                20     1
Hee       Mother        0          2   0                 1     1
Hee       Adult         0          2   0                40     1
Hee       Juvenile      2          2   1                20     1
Awa       Mother        2         10   0.2               1     2
Awa       Adult         3         10   0.3               7     2
Awa       Juvenile      5         10   0.5              13     2
count: number of mc directed to an individual belonging to each receiver subgroup
total_init: total number of mc initiated by this individual
subgroupsize: number of individuals within the group that belong to the receiver subgroup (for example, each individual has 1 mother, but group 1 has 40 adults (other than the mother) and 20 juveniles)
This is the model I tried:
glmm_ages <- glmer((count_init/total_init)/subgroupsize ~ receiver + (1|group) + (1|initiator),
                   data = age1data,
                   family = binomial)
This gives me this error message:
Error in pwrssUpdate(pp, resp, tol = tolPwrss, GQmat = GQmat, compDev = compDev, :
Downdated VtV is not positive definite
In addition: Warning message:
In eval(family$initialize, rho) : non-integer #successes in a binomial glm!
The model works when I do a simple GLM without group and initiator as random effects, but I really think I need to include them.
From what I understand, the error message means that some categories are all 1 or all 0, which is the case when an individual is only recorded muzzle contacting its mother once, for example (the dependent variable becomes (1/1)/1 = 1).
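In case it matters, here is a minimal, untested sketch of what I understand to be the usual way of passing integer counts to a binomial GLMM (a two-column response of successes and failures instead of a pre-computed proportion), using the column names from my table above (glmm_ages2 is just a name I made up); I still don't know how subgroupsize should enter such a model:
library(lme4)

# untested sketch: two-column binomial response (successes, failures), which
# avoids the "non-integer #successes" warning; subgroupsize is not yet
# accounted for here
glmm_ages2 <- glmer(cbind(count, total_init - count) ~ receiver +
                      (1|group) + (1|initiator),
                    data = age1data,
                    family = binomial)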
I'm trying to understand what I should do based on this page I found: http://bbolker.github.io/mixedmodels-misc/ecostats_chap.html#digression-complete-separation
In that section, I'm not sure how to choose the number I should be putting in instead of "10":
newdat <- subset(age1data,
                 abs(resid(glmm_ages, "pearson")) < 10)
I'm also not sure what all of this means, or how I can figure out the appropriate variance and standard deviation for my own dataset:
impose zero-mean Normal priors on the fixed effects (a 4 × 4 diagonal matrix with diagonal elements equal to 9, for variances of 9 or standard deviations of 3)
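If I understand the linked chapter correctly, that sentence refers to refitting the model with the blme package, which places a weak zero-mean Normal prior on the fixed effects. Below is my untested guess at what this would look like for my model, which has 3 fixed-effect coefficients (intercept plus 2 receiver contrasts), so the diagonal matrix would be 3 x 3 rather than 4 x 4:
library(blme)

# untested sketch: same model as above, but with zero-mean Normal priors on the
# 3 fixed effects (variances of 9, i.e. standard deviations of 3)
glmm_ages_blme <- bglmer(cbind(count, total_init - count) ~ receiver +
                           (1|group) + (1|initiator),
                         data = age1data,
                         family = binomial,
                         fixef.prior = normal(cov = diag(9, 3)))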
Can anyone help me figure out whether I'm doing the right thing and whether this is the right solution for my problem?
I apologize for the length of this post, I wanted to make sure everything was there, hope it's clear!

Related

How can I specify the carryover in the first period of a two-treatment three-period crossover study (ABB/BAA)?

I have found a lot of information on how to analyze a 2x2 (AB/BA) crossover trial; however, there is much less material on how to disentangle the carryover effect when the study is designed with three periods and two sequences (ABB/BAA). It is worth mentioning that A and B are the treatments and there were wash-out phases between the three periods.
As sample data, I would like to use the bioequivalence data from "daewr" library.
library("daewr")
data(bioequiv)
head(bioequiv)
  Group Subject Period Treat Carry      y
1     1       2      1     A  none 112.25
2     1       2      2     B     A 106.36
3     1       2      3     B     B  88.59
4     1       3      1     A  none 153.71
5     1       3      2     B     A 150.13
6     1       3      3     B     B 151.31
The variable Carry contains lagged information from the previous period's Treatment.
The model below should be able to disentangle the effects, but I don't know how to handle the none entries in the Carry column. I am not sure how to specify this, or how to check whether the carryover effect is negligible.
If we don't replace the none entries in the Carry column, the model below suffers from multicollinearity.
library(lme4)
fit <- lmer(y ~ Period + Treat + Carry + (1|Subject), bioequiv)
anova(fit)
summary(fit)
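For what it's worth, the only idea I have so far is to keep none but make it the reference level of the Carry factor, as in the untested sketch below; I am not sure whether this actually removes the multicollinearity or is the right way to test whether the carryover is negligible:
# untested sketch: make "none" the reference level of Carry before refitting
bioequiv$Carry <- relevel(factor(bioequiv$Carry), ref = "none")
fit2 <- lmer(y ~ Period + Treat + Carry + (1|Subject), bioequiv)
anova(fit2)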

sandwich + mlogit: `Error in ef/X : non-conformable arrays` when using `vcovHC()` to compute robust/clustered standard errors

I am trying to compute robust/clustered standard errors after using mlogit() to fit a Multinomial Logit (MNL) model for a discrete choice problem. Unfortunately, I suspect I am having problems because I am using data in long format (a must in my case), and I am getting the error Error in ef/X : non-conformable arrays after calling sandwich::vcovHC( , "HC0").
The Data
For illustration, consider the following data. It represents 5 individuals (id_ind) who choose among 3 alternatives (altern). Each of the five individuals chose three times; hence we have 15 choice situations (id_choice). Each alternative is described by two generic attributes (x1 and x2), and the choices are recorded in y (1 if selected, 0 otherwise).
df <- read.table(header = TRUE, text = "
id_ind id_choice altern x1 x2 y
1 1 1 1 1.586788801 0.11887832 1
2 1 1 2 -0.937965347 1.15742493 0
3 1 1 3 -0.511504401 -1.90667519 0
4 1 2 1 1.079365680 -0.37267925 0
5 1 2 2 -0.009203032 1.65150370 1
6 1 2 3 0.870474033 -0.82558651 0
7 1 3 1 -0.638604013 -0.09459502 0
8 1 3 2 -0.071679538 1.56879334 0
9 1 3 3 0.398263302 1.45735788 1
10 2 4 1 0.291413453 -0.09107974 0
11 2 4 2 1.632831160 0.92925495 0
12 2 4 3 -1.193272276 0.77092623 1
13 2 5 1 1.967624379 -0.16373709 1
14 2 5 2 -0.479859282 -0.67042130 0
15 2 5 3 1.109780885 0.60348187 0
16 2 6 1 -0.025834772 -0.44004183 0
17 2 6 2 -1.255129594 1.10928280 0
18 2 6 3 1.309493274 1.84247199 1
19 3 7 1 1.593558740 -0.08952151 0
20 3 7 2 1.778701074 1.44483791 1
21 3 7 3 0.643191170 -0.24761157 0
22 3 8 1 1.738820924 -0.96793288 0
23 3 8 2 -1.151429915 -0.08581901 0
24 3 8 3 0.606695064 1.06524268 1
25 3 9 1 0.673866953 -0.26136206 0
26 3 9 2 1.176959443 0.85005871 1
27 3 9 3 -1.568225496 -0.40002252 0
28 4 10 1 0.516456176 -1.02081089 1
29 4 10 2 -1.752854918 -1.71728381 0
30 4 10 3 -1.176101700 -1.60213536 0
31 4 11 1 -1.497779616 -1.66301234 0
32 4 11 2 -0.931117325 1.50128532 1
33 4 11 3 -0.455543630 -0.64370825 0
34 4 12 1 0.894843784 -0.69859139 0
35 4 12 2 -0.354902281 1.02834859 0
36 4 12 3 1.283785176 -1.18923098 1
37 5 13 1 -1.293772990 -0.73491317 0
38 5 13 2 0.748091387 0.07453705 1
39 5 13 3 -0.463585127 0.64802031 0
40 5 14 1 -1.946438667 1.35776140 0
41 5 14 2 -0.470448172 -0.61326604 1
42 5 14 3 1.478763383 -0.66490028 0
43 5 15 1 0.588240775 0.84448489 1
44 5 15 2 1.131731049 -1.51323232 0
45 5 15 3 0.212145247 -1.01804594 0
")
The problem
We can then fit an MNL using mlogit() and try to extract its robust variance-covariance matrix as follows:
library(mlogit)
library(sandwich)
mo <- mlogit(formula = y ~ x1 + x2 | 0,
             method = "nr",
             data = df,
             idx = c("id_choice", "altern"))
sandwich::vcovHC(mo, "HC0")
# Error in ef/X : non-conformable arrays
As we can see, sandwich::vcovHC produces an error saying that ef/X is non-conformable, where X <- model.matrix(x) and ef <- estfun(x, ...). After looking through the source code on the GitHub mirror, I spotted the problem: because the data are in long format, ef has dimensions 15 x 2 while X has dimensions 45 x 2.
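A quick check of the dimension mismatch described above (just illustrative):
# estimating functions: one row per choice situation; model matrix: one row per
# alternative, because the data are in long format
dim(sandwich::estfun(mo))  # 15 x 2
dim(model.matrix(mo))      # 45 x 2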
My workaround
Since the show must go on, I am computing the robust and clustered standard errors manually, using code borrowed from sandwich that I adjusted to match Stata's output.
> Robust Standard Errors
These lines are inspired by the sandwich::meat() function.
psi <- estfun(mo)
k <- NCOL(psi)
n <- NROW(psi)
rval <- (n/(n - 1)) * crossprod(as.matrix(psi))
vcov(mo) %*% rval %*% vcov(mo)
# x1 x2
# x1 0.23050261 0.09840356
# x2 0.09840356 0.12765662
Stata Equivalent
qui clogit y x1 x2 ,group(id_choice) r
mat li e(V)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .23050262
y:x2 .09840356 .12765662
> Clustered Standard Errors
Here, given that each individual answers 3 questions, it is highly likely that there is some degree of correlation within individuals; hence clustered corrections should be preferred in such situations. Below I compute the clustered correction for this case and show its equivalence with the Stata output of clogit, cluster().
id_ind_collapsed <- df$id_ind[!duplicated(mo$model$idx$id_choice)]
psi_2 <- rowsum(psi, group = id_ind_collapsed)
k_cluster <- NCOL(psi_2)
n_cluster <- NROW(psi_2)
rval_cluster <- (n_cluster/(n_cluster - 1)) * crossprod(as.matrix(psi_2))
vcov(mo) %*% rval_cluster %*% vcov(mo)
# x1 x2
# x1 0.1766707 0.1007703
# x2 0.1007703 0.1180004
Stata equivalent
qui clogit y x1 x2 ,group(id_choice) cluster(id_ind)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .17667075
y:x2 .1007703 .11800038
The Question:
I would like to accommodate my computations within the sandwich ecosystem, meaning not computing the matrices manually but actually using the sandwich functions. Is it possible to make it work with models in long format like the one described here? For example, providing the meat and bread objects directly to perform the computations? Thanks in advance.
PS: I noticed that sandwich has a dedicated bread method for mlogit, but I could not spot a corresponding meat method for mlogit; anyway, I am probably missing something here...
Why vcovHC does not work for mlogit
The class of HC covariance estimators can only be applied to models with a single linear predictor where the score function, aka estimating function, is the product of so-called "working residuals" and a regressor matrix. This is explained in some detail in the Zeileis (2006) paper (see Equation 7), provided as vignette("sandwich-OOP", package = "sandwich") in the package. The ?vcovHC documentation also pointed to this but did not explain it very well. I have now improved this in the documentation at http://sandwich.R-Forge.R-project.org/reference/vcovHC.html:
The function meatHC is the real work horse for estimating the meat of HC sandwich estimators - the default vcovHC method is a wrapper calling sandwich and bread. See Zeileis (2006) for more implementation details. The theoretical background, exemplified for the linear regression model, is described below and in Zeileis (2004). Analogous formulas are employed for other types of models, provided that they depend on a single linear predictor and the estimating functions can be represented as a product of “working residual” and regressor vector (Zeileis 2006, Equation 7).
This means that vcovHC() is not applicable to multinomial logit models as they generally use separate linear predictors for the separate response categories. Similarly, two-part or hurdle models etc. are not supported.
Basic "robust" sandwich covariance
Generally, for computing the basic Eicker-Huber-White sandwich covariance matrix estimator, the best strategy is to use the sandwich() function and not the vcovHC() function. The former works for any model with estfun() and bread() methods.
For linear models sandwich(..., adjust = FALSE) (default) and sandwich(..., adjust = TRUE) correspond to HC0 and HC1, respectively. In a model with n observations and k regression coefficients the former standardizes with 1/n and the latter with 1/(n-k).
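For example, a quick sketch with the mlogit model from above (this relies on the estfun() and bread() methods being available for the fitted object, as discussed elsewhere in this post):
library(sandwich)
sandwich(mo)                 # basic sandwich estimator, scaled with 1/n (HC0-like)
sandwich(mo, adjust = TRUE)  # with the 1/(n - k) adjustment instead (HC1-like)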
Stata, however, scales with 1/(n-1) in logit models; see Different Robust Standard Errors of Logit Regression in Stata and R. To the best of my knowledge there is no clear theoretical reason for using one adjustment rather than the other, and in moderately large samples this makes no difference anyway.
Remark: The adjustment with 1/(n-1) is not directly available in sandwich() as an option. However, coincidentally, it is the default in vcovCL() without specifying a cluster variable (i.e., treating each observation as a separate cluster). So this is a convenient "trick" if you want to get exactly the same results as Stata.
Clustered covariance
This can be computed "as usual" via vcovCL(..., cluster = ...). For mlogit models you just have to keep in mind that the cluster variable needs to be provided only once per choice situation (as opposed to being stacked several times in long format).
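For example, one (equivalent) way to collapse the individual identifier to a single value per choice situation, sketched with the data from the question:
# one cluster id per choice situation (15 values), not per long-format row (45 values)
cl <- df$id_ind[!duplicated(df$id_choice)]
vcovCL(mo, cluster = cl)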
Replicating Stata results
With the data and model from your post:
vcovCL(mo)
## x1 x2
## x1 0.23050261 0.09840356
## x2 0.09840356 0.12765662
vcovCL(mo, cluster = df$id_choice[1:15])
## x1 x2
## x1 0.1766707 0.1007703
## x2 0.1007703 0.1180004

How can loading factors from PCA be used to calculate an index that can be applied for each individual in a data frame in R?

I am using principal component analysis (PCA) based on ~30 variables to compose an index that classifies individuals in 3 different categories (top, middle, bottom) in R.
I have a dataframe of ~2000 individuals with 28 binary and 2 continuous variables.
Now, I would like to use the loadings from PC1 to construct an index that classifies my 2000 individuals on these 30 variables into 3 different groups.
Problem: Despite extensive research, I could not find out how to use the loadings from PCA_loadings to give each individual a score (based on the loadings of the 30 variables), which would subsequently allow me to rank each individual (for further classification). Does it make sense to display the loadings in a graph?
I've performed the following steps:
a) Ran a PCA using PCA_outcome <- prcomp(na.omit(df1), scale = T)
b) Extracted the loadings using PCA_loadings <- PCA_outcome$rotation
c) Removed all the variables for which the loading factors were close to 0.
I have considered creating 30 new variables, one for each loading, which I would sum up whenever the corresponding binary variable == 1 (though I am not sure how to proceed with the continuous variables). Consequently, I would assign each individual a score. However, I do not know how to combine the 30 loading values into a single score for each individual.
R code
df1 <- read.table(text="
educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header=T)
## Run PCA
PCA_outcome <- prcomp(na.omit(df1), scale = T)
## Extract loadings
PCA_loadings <- PCA_outcome$rotation
## Explanation: A-E are 5 of the 2000 individuals and the variables (education, call, house, school, members) represent my 30 variables (binary and continuous).
Expected results:
- Get a rank score for each individual
- Subsequently, assign a category 1-3 to each individual.
I'm not 100% sure what you're asking, but here's an answer to the question I think you're asking.
First of all, PC1 of a PCA won't necessarily provide you with an index of socio-economic status. As explained here, PC1 simply "accounts for as much of the variability in the data as possible". PC1 may well work as a good metric for socio-economic status for your data set, but you'll have to critically examine the loadings and see if this makes sense. Depending on the signs of the loadings, it could be that a very negative PC1 corresponds to a very positive socio-economic status. As I say: look at the results with a critical eye. An explanation of how PC scores are calculated can be found here. Anyway, that's a discussion that belongs on Cross Validated, so let's get to the code.
It sounds like you want to perform the PCA, pull out PC1, and associate it with your original data frame (and merge_ids). If that's your goal, here's a solution.
# Create data frame
df <- read.table(text = "educ call house merge_id school members
A 1 0 1 12_3 0 0.9
B 0 0 0 13_3 1 0.8
C 1 1 1 14_3 0 1.1
D 0 0 0 15_3 1 0.8
E 1 1 1 16_3 3 3.2", header = TRUE)
# Perform PCA
PCA <- prcomp(df[, names(df) != "merge_id"], scale = TRUE, center = TRUE)
# Add PC1
df$PC1 <- PCA$x[, 1]
# Look at new data frame
print(df)
#> educ call house merge_id school members PC1
#> A 1 0 1 12_3 0 0.9 0.1000145
#> B 0 0 0 13_3 1 0.8 1.6610864
#> C 1 1 1 14_3 0 1.1 -0.8882381
#> D 0 0 0 15_3 1 0.8 1.6610864
#> E 1 1 1 16_3 3 3.2 -2.5339491
Created on 2019-05-30 by the reprex package (v0.2.1.9000)
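If you then want the rank and the three categories you mention, one possible (untested) follow-up is to cut the PC1 scores at their terciles; keep in mind that whether "high" corresponds to large positive or large negative PC1 values depends on the signs of the loadings, as discussed above:
# rank individuals by PC1 and split them into three equally sized groups (terciles)
df$rank <- rank(df$PC1)
df$category <- cut(df$PC1,
                   breaks = quantile(df$PC1, probs = c(0, 1/3, 2/3, 1)),
                   include.lowest = TRUE,
                   labels = c("low", "middle", "high"))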
As you say you have to use PCA, I'm assuming this is for a homework question, so I'd recommend reading up on PCA so that you get a feel of what it does and what it's useful for.

Using multiple weights for lm() or glm() in R

I want to fit a model in R in which I need to apply two weights at the same time. Let's say my model is glm(y ~ x1 + male + East_Germany), where male identifies a respondent's gender and East_Germany is a binary variable indicating whether someone lives in East Germany.
Now let's say both females and East Germans are dramatically misrepresented in my data. Assuming that this is not due to a flawed data collection process, I would have to apply two weights. But can I really specify two weights simultaneously like this...? glm(y ~ x1 + male + East_Germany, weight=c("male_wgt","east_wgt"))
My thinking was that applying one weight to the overall data or the overall model already changes the whole data structure, but I might be wrong. Here is an example:
y x1 male east male_weight east_weight
5  4    0    1           5           2
3  2    1    1           1           2
9  7    1    0           1           1
4  8    1    0           1           1
1  3    1    0           1           1
6  4    1    0           1           1
...where male==1 means "male" and male==0 means "female", and we assume that both genders should be represented equally (50%); and where east==1 means "East Germany" and east==0 means "West Germany", and here too, for simplicity, let's assume both should be represented equally. y and x1 are just random numbers.
I'm wondering how I would apply both weights simultaneously, because if I say "let's count row #1 five times so that females gain more weight", I'm at the same time giving East Germany too much weight (without even having applied east_weight). The reason is that if we count row #1 five times, we end up with a new East/West ratio of 6:4. Or am I mistaken?
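The only workaround I can think of is to combine the two into a single weight by multiplying them, as in the untested sketch below (assuming the example above is stored in a data frame df), but I am not sure this is statistically sound; perhaps proper raking (e.g., with the survey package) would be the more rigorous route:
# untested sketch: a single combined weight, assuming the two adjustments can
# simply be multiplied together
df$combined_wgt <- df$male_weight * df$east_weight
fit <- glm(y ~ x1 + male + east, data = df, weights = combined_wgt)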

Probability of account win/loss using Bayesian Statistics

I am trying to estimate the probability of winning or losing an account, and I'd like to do this using Bayesian Methods. I'm not really that familiar with these methods, but I think I understand the general idea.
I know some information about losses and wins. Wins are usually characterized by some combination of activities; losses are usually characterized by a different combination of activities. I'd like to be able to get a posterior probability of whether or not a new observation will be won or lost based on the current number of activities associated with that account.
Here is an example of my data: (This is just a sample for simplicity)
Email Call Callback Outcome
14 9 2 1
3 2 4 0
16 14 2 0
15 1 3 1
5 2 2 0
1 1 0 0
10 3 5 0
2 0 1 0
17 8 4 1
3 15 2 0
17 1 3 0
10 7 5 0
10 2 3 0
8 0 0 1
14 10 3 0
1 9 3 1
5 10 3 1
13 5 1 0
9 4 4 0
So from here I know that 30% of the observations have an outcome of 1 (win) and 70% have an outcome of 0 (loss). Let's say that I want to use the other columns to get a probability of win/loss for a new observation which may have a small number of events (emails, calls, and callbacks) associated with it.
Now let's say that I want to use the counts/proportions of the different events as priors for a new observation. This is where I start getting tripped up. My thinking is to create a Dirichlet distribution for wins and one for losses, so two separate distributions, using the counts/proportions of events for each outcome as the priors. I'm just not sure how to do this in R. I think my course of action would be to estimate a Dirichlet distribution (since I have 3 variables) for each outcome using maximum likelihood. I've been trying to use the dirichlet.simul and dirichlet.mle functions from the sirt package in R. I'm not sure if I need to simulate one first?
Another issue is that once I have this distribution, it's unclear to me how to get a posterior distribution for a new observation. I've read several papers and can't seem to find a straightforward description of how to do this. (Or maybe there are some holes in my understanding.) Any pushes in the right direction would be greatly appreciated.
This is the code I've tried so far:
library(sirt)

### FOR WON ACCOUNTS
set.seed(789)
N <- 6
probs <- c(0.535714286, 0.330357143, 0.133928571)
alpha <- probs
alpha <- matrix(alpha, nrow = N, ncol = length(alpha), byrow = TRUE)
x <- dirichlet.simul(alpha)
dirichlet.mle(x)
$alpha
[1] 0.3385607 0.2617939 0.1972898
$alpha0
[1] 0.7976444
$xsi
[1] 0.4244507 0.3282088 0.2473405
### FOR LOST ACCOUNTS
set.seed(789)
N2 <- 14
probs2 <- c(0.528037383, 0.308411215, 0.163551402)
alpha2 <- probs2
alpha2 <- matrix(alpha2, nrow = N, ncol = length(alpha2), byrow = TRUE)  # note: nrow = N (6 rows), not N2
x2 <- dirichlet.simul(alpha2)
dirichlet.mle(x2)
$alpha
[1] 0.3388486 0.2488771 0.2358043
$alpha0
[1] 0.8235301
$xsi
[1] 0.4114587 0.3022077 0.2863336
Not sure if this is a correct approach or how to get posteriors from here. I realize all the outputs look similar across won/lost accounts. I just used some simulated data to represent what I'm working with.
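To make the last part of my question more concrete, this is the kind of calculation I have in mind for a new observation, just plugging in point estimates of the event proportions rather than the full Dirichlet posteriors (untested, and possibly not the right way to think about it):
# sketch: posterior P(win) for a hypothetical new account via Bayes' rule, using
# multinomial likelihoods built from the class-wise event proportions
new_obs    <- c(Email = 4, Call = 1, Callback = 1)       # made-up new account
p_win      <- 0.30                                        # prior P(win) from the data
p_loss     <- 0.70
probs_win  <- c(0.535714286, 0.330357143, 0.133928571)    # event proportions among wins
probs_loss <- c(0.528037383, 0.308411215, 0.163551402)    # event proportions among losses
lik_win  <- dmultinom(new_obs, prob = probs_win)
lik_loss <- dmultinom(new_obs, prob = probs_loss)
posterior_win <- (lik_win * p_win) / (lik_win * p_win + lik_loss * p_loss)
posterior_win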
