Generate a crude incidence rate table (stratified by a factor variable) from a Lexis model in R

I am using the 'Epi' package in R to model follow-up data from a study.
I am having no issues with declaring the Lexis model or running Poisson and (combined with the survival package) Cox regressions.
As part of the initial data review, I want a simple way to make a table of crude unadjusted incidence/event rates from the data in a Lexis model in R (before fitting any Poisson/Cox models).
I have found a coded approach which allows me to do this and to stratify by a variable as part of exploratory data analysis:
# Generic syntax example
# Add up the number of events and the amount of follow-up time
# within each level of the stratifying variable
totals <- cbind(tapply(lexis_model$lex.Xst, lexis_model$stratifying_var, sum),
                tapply(lexis_model$lex.dur, lexis_model$stratifying_var, sum))
# Gives rates per 1,000 person-years
rates <- tapply(lexis_model$lex.Xst, lexis_model$stratifying_var, sum) /
  tapply(lexis_model$lex.dur, lexis_model$stratifying_var, sum) * 10^3
ratetable <- cbind(totals, rates)
# Specific example based on the dataset
totals <- cbind(tapply(lexis_model$lex.Xst, lexis_model$grade, sum),
                tapply(lexis_model$lex.dur, lexis_model$grade, sum))
rates <- tapply(lexis_model$lex.Xst, lexis_model$grade, sum) /
  tapply(lexis_model$lex.dur, lexis_model$grade, sum) * 10^3
ratetable <- cbind(totals, rates)
ratetable
                   rates
1 90 20338.234 4.4251630
2 64  7265.065 8.8092811
# Shows the number of events, years of follow-up, and events per 1,000 person-years,
# stratified by the stratifying variable
Note these are crude unadjusted/absolute rates, not the output of a Poisson model. While I appreciate that the code above does indeed produce the desired output (and is pretty straightforward), I wanted to see if people were aware of a command which can take a Lexis dataset and output this directly. I've had a look at the available commands in the Epi and epitools packages; I may have missed something, but I could not see an obvious way to do this.
As this is quite a common thing to want to do, I wondered if anyone was aware of a package/function that could do it by simply specifying the Lexis dataset and the stratification variable (or indeed a single function that performs the steps above in one go).
Ideally the output would look something like the below (which is taken from STATA which I am trying to move away from in favour of R!):
A copy of the first twenty rows or so of the actual data is here (the data has already been put in to a lexis model using Epi package so all relevant lexis variables are there):
https://www.dropbox.com/s/yjyz1kzusysz941/rate_table_data.xlsx?dl=0

I would do this simply using the tidyverse R package as such:
library(tidyverse)
lexis_model %>%
  group_by(grade) %>%
  summarise(sum_Xst = sum(lex.Xst), sum_dur = sum(lex.dur)) %>%
  mutate(rate = sum_Xst / sum_dur * 10^3) -> rateable
rateable
# A tibble: 2 x 4
#   grade sum_Xst   sum_dur     rate
#   <dbl>   <int>     <dbl>    <dbl>
# 1     1       2 375.24709 5.329821
# 2     2       0  92.44079 0.000000
And you could wrap this into a function yourself:
rateFunc <- function(data, strat_var) {
  data %>%
    group_by(across(all_of(strat_var))) %>%
    summarise(sum_Xst = sum(lex.Xst), sum_dur = sum(lex.dur)) %>%
    mutate(rate = sum_Xst / sum_dur * 10^3)
}
which you would then call:
rateFunc(lexis_model, "grade")
This is useful because, with the combination of summarise() and mutate(), it is very easy to add more summary statistics to the table.
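For example, here is a sketch that also counts the records per stratum and adds exact Poisson 95% limits for the rate per 1,000 person-years (the confidence-interval formula is my addition for illustration, not part of the original approach):
lexis_model %>%
  group_by(grade) %>%
  summarise(n = n(),
            sum_Xst = sum(lex.Xst),
            sum_dur = sum(lex.dur)) %>%
  mutate(rate = sum_Xst / sum_dur * 10^3,
         # exact (Garwood) Poisson limits on the event count, scaled to rates
         rate_lo = qchisq(0.025, 2 * sum_Xst) / (2 * sum_dur) * 10^3,
         rate_hi = qchisq(0.975, 2 * (sum_Xst + 1)) / (2 * sum_dur) * 10^3)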
EDIT:
After clarification on the question, this can be done using the popEpi package using the rate command:
popEpi::rate(lexis_model, obs = lex.Xst, pyrs = lex.dur, print = grade)
# Crude rates and 95% confidence intervals:
#    grade lex.Xst  lex.dur       rate     SE.rate     rate.lo   rate.hi
# 1:     1       2 375.2472 0.00532982 0.003768752 0.001332942 0.0213115
# 2:     2       0  92.4408 0.00000000 0.000000000 0.000000000       NaN
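Note these rates are per one person-year. If you want them per 1,000 person-years to match the earlier table, one simple option (a sketch; dur_1000 is just a helper column I am adding for illustration) is to rescale the person-time before calling rate():
lexis_model$dur_1000 <- lexis_model$lex.dur / 1000
popEpi::rate(lexis_model, obs = lex.Xst, pyrs = dur_1000, print = grade)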

Related

Unmarked colext function: detection probability = 1

I'm building a single-species dynamic occupancy model with the R package "unmarked", using an unmarkedMultFrame and the colext() function, for pika den occupancy across 4 years. However, I want to specify that detection probability (p) = 1 (perfect detection). I want to do this because the detection probability for known dens is near-perfect (many other studies make this assumption too). In theory this is just a simpler model and kind of defeats the purpose of an occupancy model, but I'm using it because I need to estimate colonization and extinction probabilities.
Another specification is that all dens we are monitoring were initially occupied the first year we have data for them (so occupancy = 1 for all dens the first year and I am monitoring extinction and re-colonization rates after the first year).
Does anyone know how to specify in Unmarked that p = 1 when using the colext function? This function estimates detection probability, but I'm not sure what it is basing it off of, so I don't know how to either eliminate it from the function entirely or force it to be 1. Here is an example of my data and code:
dets1 <- as.matrix(dets1)  # detections (179 dens sampled once per year for 4 years; lots of NAs)
year <- factor(rep(c(2018, 2019, 2020, 2021), 179))  # the 4 years we surveyed
UMFdets <- unmarkedMultFrame(y = dets1, numPrimary = 4)
m4 <- colext(psiformula = ~1,          # First-year occupancy
             gammaformula = ~ year,    # Colonization
             epsilonformula = ~ year,  # Extinction
             pformula = ~1,            # Detection
             data = UMFdets)
*simply removing "pformula" doesn't work.
Any ideas or knowledge about this would be much appreciated! Thank you!

Error in generalized linear mixed model cross-validation: The value in 'data[[cat_col]]' must be constant within each ID

I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:
data <- groupdata2::fold(detect, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)

cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  control = NULL,
  REML = FALSE)
This is the error I receive:
Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>% :
1 assertions failed:
* The value in 'data[[cat_col]]' must be constant within each ID.
In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?
Reproducible example:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
data <- groupdata2::fold(df, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)
The full documentation says:
cat_col: Name of categorical variable to balance between folds.
E.g. when predicting a binary variable (a or b), we usually
want both classes represented in every fold.
N.B. If also passing an ‘id_col’, ‘cat_col’ should be
constant within each ID.
So in this case, where outcome varies within individual birds (id_col), you simply can't specify that the folds be balanced with respect to the outcome. (I don't 100% understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder.)
In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance would mean that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, AIC) should be fine.
A potentially greater problem is that you appear to have (potentially) crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds and some birds were detected/failed by multiple observers, then there may be no way to define folds that are actually independent, or at least it may be very difficult.
You may be able to utilize the new collapse_groups() function in groupdata2 v2.0.0 instead of fold() for this. It allows you to take existing groups (e.g. bird) and collapse them to fewer groups (e.g. folds) with attempted balancing of multiple categorical columns, numeric columns, and factor columns (number of unique levels - though the same levels might be in multiple groups).
It does not have the constraint that fold() has with regard to changing outcomes, but on the other hand it does not come with the same "guarantees" in the "non-changing outcome" context. E.g. it does not guarantee at least one of each outcome level in every fold.
You need more birds than the number of folds though, so I've added a few to the test data:
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,
4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0,1,
0,1,1,0,1,1,0,0,1,1,0,0,1,0,0,1,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)
# Combine 'bird' groups into folds
data <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_col = "outcome",
  col_name = ".folds"
) %>%
  arrange(.folds)
# Check the balance of the relevant columns
groupdata2::summarize_balances(
  data = data,
  group_cols = ".folds",
  cat_cols = "outcome"
)$Groups
> # A tibble: 3 × 6
> .group_col .group `# rows` `# bird` `# outc_0` `# outc_1`
> <fct> <fct> <int> <int> <dbl> <dbl>
> 1 .folds 1 14 3 7 7
> 2 .folds 2 10 2 6 4
> 3 .folds 3 10 2 4 6
summarize_balances() shows us that we created 3 folds with 14 rows in the first fold and 10 in the other folds. There are 3 unique bird levels in the first fold and 2 in the others (normally only unique within the group, but here we know that birds are only in one group, as that is how collapse_groups() works with its group_cols argument).
The outcome variable (here # outc_0 and # outc_1) is reasonably well balanced across the folds.
With larger datasets, you might want to run multiple collapsings and choose the one with the best balance from the summary. That can be done by adding num_new_group_cols = 10 to collapse_groups() (and, for even better results, enabling the auto_tune setting) and then listing all the created group columns when running summarize_balances(), as sketched below.
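A rough sketch of what that could look like (the "_1", "_2", ... suffixes on the extra fold columns are my assumption from the collapse_groups() documentation):
data_multi <- groupdata2::collapse_groups(
  data = df,
  n = 3,
  group_cols = "bird",
  cat_cols = "outcome",
  num_new_group_cols = 10,  # create 10 candidate fold columns
  auto_tune = TRUE,         # search harder for well-balanced groupings
  col_name = ".folds"
)

# Compare the balance of all candidate fold columns and pick the best one
groupdata2::summarize_balances(
  data = data_multi,
  group_cols = paste0(".folds_", 1:10),
  cat_cols = "outcome"
)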
Hope this helps you or others in a similar position. The constraint in fold() is hard to get around with its current internal approach, but collapse_groups hopefully does the trick in those cases.
See more https://rdrr.io/cran/groupdata2/man/collapse_groups.html

different output for PR AUC for different R packages

I get different numeric values for the Area Under the Precision-Recall Curve (PRAUC) on the dataset I am working on, depending on which of two R packages I use: yardstick or caret.
I'm afraid I was not able to reproduce this mismatch with synthetic data, only with my dataset (which is strange as well).
In order to make this reproducible, I am sharing the prediction output of my model, you can download it here https://drive.google.com/open?id=1LuCcEw-RNRcdz6cg0X5bIEblatxH4Rdz (don't worry, it's a small csv).
The csv contains a dataframe with 4 columns:
- yes: probability estimate of being in class yes
- no: 1 - yes
- obs: actual class label
- pred: predicted class label (with 0.5 threshold)
Here is the code to produce the two values of PRAUC:
require(data.table)
require(yardstick)
require(caret)
pr <- fread('pred_sample.csv')
# transform to factors
# put the positive class in the first level
pr[, obs := factor(obs, levels = c('yes', 'no'))]
pr[, pred := factor(pred, levels = c('yes', 'no'))] # this is actually not needed
# compute yardstick PRAUC
pr_auc(pr, obs, yes) # 0.315
# compute caret PRAUC
prSummary(pr, lev = c('yes', 'no')) # 0.2373
I could understand a little difference, due to the approximation when computing the area (interpolating the curve), but this seems way too high.
I even tried a third package, PRROC, and the result is still different, namely around .26.
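For reference, a sketch of how the PRROC value can be computed on this data (pr.curve() takes the positive-class and negative-class scores separately; the subsetting below uses data.table syntax because pr was read with fread()):
library(PRROC)
pr.curve(scores.class0 = pr[obs == 'yes', yes],  # scores of the positive class
         scores.class1 = pr[obs == 'no', yes])$auc.integral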

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly to produce a multinomial logistic model with mlogit here (Formatting data for mlogit).
However, I'm now trying to analyze the effects of covariates in my model. I find the help file for effects.mlogit() to be not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see below, index(mod1)).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get effects.mlogit() to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use with multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social, Responsive and Officials). Would I just model whatever they picked as their first rank and ignore what they chose in the other ranks? I can get that information.
Reproducible code is below:
# Load packages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
  gather(Open, Rank, -c(1, 9:12)) %>%
  arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long',alt.var='Open',choice='Rank', ranked=TRUE,chid.var='X')
#Fit Model
mod1 <- mlogit(Rank~1|gender+age+economic+Job,data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesn't work. I confess that although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample (example data from the help file for effects)
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
                           catch = tapply(catch, index(m)$alt, mean),
                           income = mean(income)))
# compute the marginal effects (the second one is an elasticity)
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item as 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item ItemK, then we're modeling
log[ Pr(Item_k is 1st) / Pr(Item_K is 1st) ] = α_k + x'β_k
for k = 1, ..., K-1, where Item_k is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (Same story for reference levels of the categorical predictor variables.)
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
dat3$gender <- relevel(dat3$gender, ref="Female")
dat3$Job <- relevel(dat3$Job, ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with ranked data, not just multinomial choice data. The structure of the ranked data in mlogit is that the first set of records for a person contains all the options, the second set contains all options except the one ranked first, and so on. But the index assumes an equal number of options each time, hence the rows of NAs. We just need to get rid of them.
> with(mlogit.out,
+      data.frame(economic = tapply(economic,
+                                   index(mod1)$alt[complete.cases(index(mod1)$alt)],
+                                   mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91

Gompertz Aging analysis in R

I have survival data from an experiment in flies which examines rates of aging in various genotypes. The data is available to me in several layouts so the choice of which is up to you, whichever suits the answer best.
One dataframe (wide.df) looks like this, where each genotype (Exp, of which there are ~640) has a row, and the days run in sequence horizontally from day 4 to day 98, with counts of new deaths every two days.
Exp Day4 Day6 Day8 Day10 Day12 Day14 ...
A 0 0 0 2 3 1 ...
I make the example using this:
wide.df2<-data.frame("A",0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2)
colnames(wide.df2)<-c("Exp","Day4","Day6","Day8","Day10","Day12","Day14","Day16","Day18","Day20","Day22","Day24","Day26","Day28","Day30","Day32","Day34","Day36")
Another version is like this, where each day has a row for each 'Exp' and the number of deaths on that day are recorded.
Exp Deaths Day
A 0 4
A 0 6
A 0 8
A 2 10
A 3 12
.. .. ..
To make this example:
df2 <- data.frame(Exp    = rep("A", 17),
                  Deaths = c(0, 0, 0, 2, 3, 1, 3, 4, 5, 3, 4, 7, 8, 2, 10, 1, 2),
                  Day    = seq(4, 36, by = 2))
What I would like to do is perform a Gompertz Analysis (See second paragraph of "the life table" here). The equation is:
μ_x = α * e^(β*x)
where μ_x is the probability of death at a given time, α is the initial mortality rate, and β is the rate of aging.
I would like to be able to get a dataframe which has α and β estimates for each of my ~640 genotypes for further analysis later.
I need help going from the above dataframes to an output of these values for each of my genotypes in R.
I have looked through the package flexsurv which may house the answer but I have failed in attempts to find and implement it.
This should get you started...
Firstly, for the flexsurvreg function to work, you need to specify your input data as a Surv object (from package:survival). This means one row per observation.
The first thing is to re-create the 'raw' data from the summary tables you provide.
(I know rbind is not efficient, but you can always switch to data.table for large sets).
### get rows with >1 death
df3 <- df2[df2$Deaths>1, 2:3]
### expand to give one row per death per time
df3 <- sapply(df3, FUN=function(x) rep(df3[, 2], df3[, 1]))
### each death is 1 (occurs once)
df3[, 1] <- 1
### add this to the rows with <=1 death
df3 <- rbind(df3, df2[!df2$Deaths>1, 2:3])
### convert to Surv object
library(survival)
s1 <- with(df3, Surv(Day, Deaths))
### get parameters for Gompertz distribution
library(flexsurv)
f1 <- flexsurvreg(s1 ~ 1, dist="gompertz")
giving
> f1$res
est L95% U95%
shape 0.165351912 0.1281016481 0.202602176
rate 0.001767956 0.0006902161 0.004528537
Note that this is an intercept-only model as all your genotypes are A.
You can loop this over multiple survival objects once you have re-created the per-observation data as above.
From the flexsurv docs:
The Gompertz distribution with shape parameter a and rate parameter b has hazard function
h(x | a, b) = b * e^(a*x)
So it appears your alpha is b, the rate, and beta is a, the shape.
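As a minimal sketch of such a loop (df_all is a hypothetical long-format data frame with the Exp, Deaths and Day columns stacked for all genotypes; here each death is expanded to one observed event, and days with zero deaths simply contribute nothing):
library(survival)
library(flexsurv)

fit_gompertz <- function(d) {
  # one row per death
  expanded <- d[rep(seq_len(nrow(d)), d$Deaths), ]
  f <- flexsurvreg(Surv(expanded$Day, rep(1, nrow(expanded))) ~ 1,
                   dist = "gompertz")
  data.frame(Exp   = d$Exp[1],
             beta  = f$res["shape", "est"],  # rate of aging (a)
             alpha = f$res["rate", "est"])   # initial mortality rate (b)
}

# one row of alpha/beta estimates per genotype
results <- do.call(rbind, lapply(split(df_all, df_all$Exp), fit_gompertz))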
