Unmarked colext function: detection probability = 1

I'm building a single-species dynamic occupancy model of pika den occupancy across 4 years with the R package "unmarked", using an unmarkedMultFrame and the colext() function. However, I want to specify that detection probability (p) = 1 (perfect detection), because detection of known dens is near-perfect (many other studies make this assumption too). In theory this is just a simpler model and somewhat defeats the purpose of an occupancy model, but I'm using it because I need to estimate colonization and extinction probabilities.
Another specification is that all dens we are monitoring were initially occupied the first year we have data for them (so occupancy = 1 for all dens the first year and I am monitoring extinction and re-colonization rates after the first year).
Does anyone know how to specify in unmarked that p = 1 when using the colext function? This function estimates detection probability, but I'm not sure what it bases that estimate on, so I don't know how to either eliminate it from the function entirely or force it to be 1. Here is an example of my data and code:
dets1 <- as.matrix(dets1) #detections (179 total dens sampled once per year for 4 years (lots of NAs))
year <- factor(rep(c(2018, 2019, 2020, 2021), 179)) # the 4 years we surveyed, in site-major order
UMFdets <- unmarkedMultFrame(y = dets1, yearlySiteCovs = data.frame(year = year), numPrimary = 4)
m4 <- colext(psiformula = ~1, # First-year occupancy
gammaformula = ~ year, # Colonization
epsilonformula = ~ year, # Extinction
pformula = ~1, #Detection
data = UMFdets)
*simply removing "pformula" doesn't work.
Any ideas or knowledge about this would be much appreciated! Thank you!
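One workaround (a rough sketch, not an unmarked feature): if p really is 1, the observed detection histories are the true occupancy states, so the year-to-year extinction and re-colonization rates can be tabulated directly from dets1 without fitting a detection model. The layout below (179 x 4 matrix of 0/1 with NAs, years 2018-2021) is taken from the question.
# Sketch only: assumes p = 1, so observed states equal true states
trans <- sapply(1:(ncol(dets1) - 1), function(t) {
  occ   <- which(dets1[, t] == 1 & !is.na(dets1[, t + 1]))  # occupied in year t
  unocc <- which(dets1[, t] == 0 & !is.na(dets1[, t + 1]))  # unoccupied in year t
  c(extinction   = mean(dets1[occ,   t + 1] == 0),
    colonization = mean(dets1[unocc, t + 1] == 1))
})
colnames(trans) <- paste(2018:2020, 2019:2021, sep = "-")
trans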

Related

Using predict in metafor when each author has multiple rows in the data

I'm running a meta-analysis where I'm interested in the effect of X on the effect of age on habitat use (raw mean values and variances) using the metafor package.
An example of one of my models is:
mod6 <-
rma.mv(
yi = Used_value,
V = Used_variance,
slab = Citation,
mods = ~ Age + poly(Slope, degree = 2),
random = ~ 1 | Region,
data = vel.focal,
method = "ML"
)
My justification for not using Citation as a random effect is that using only Region accounts for more of the heterogeneity than when random = list( ~ 1 | Citation/ID, ~ 1 | Region) or when Citation/ID is used by itself.
What I need for output is the prediction for each age by region, but the predict() function for the model and the associated forest plot spit out a prediction for each row, since they assume each row in the data is a unique study. In my case it is not, as I have my input values separated by age and season.
predict(mod6)
pred se ci.lb ci.ub pi.lb pi.ub
Riehle and Griffith 1993.1 9.3437 2.3588 4.7205 13.9668 0.2362 18.4511
Riehle and Griffith 1993.2 9.3437 2.3588 4.7205 13.9668 0.2362 18.4511
Riehle and Griffith 1993.3 9.3437 2.3588 4.7205 13.9668 0.2362 18.4511
Spina 2000.1 8.7706 2.7386 3.4030 14.1382 -0.7364 18.2776
Spina 2000.2 8.5407 2.7339 3.1824 13.8991 -0.9611 18.0426
Spina 2000.3 8.5584 2.7406 3.1868 13.9299 -0.9509 18.0676
Vondracek and Longanecker 1993.1 12.6116 2.5138 7.6847 17.5385 3.3462 21.8769
Vondracek and Longanecker 1993.2 12.6116 2.5138 7.6847 17.5385 3.3462 21.8769
Vondracek and Longanecker 1993.3 12.3817 2.5327 7.4176 17.3458 3.0965 21.6669
Vondracek and Longanecker 1993.4 12.3817 2.5327 7.4176 17.3458 3.0965 21.6669
Does anybody know a way to modify the arguments inside predict() to tell it how you want your predictions output or to tell it that there are multiple rows per slab?
You need to use the newmods argument to specify the values for Age for which you want predicted values. You will have to plug in something for the linear and quadratic terms for the Slope variable as well (e.g., holding Slope constant at its mean and hence the quadratic term will just be the mean squared). Region is not a fixed effect, so it is not relevant if you want to compute predicted values based on the fixed effects. If you want to compute BLUPs for those random effects, you can do so with ranef(). One can then combine the predictions based on the fixed effects with the BLUPs. That would be the general idea, but implementing this will require a bit of programming.
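A rough sketch of that suggestion (the object names below, and the choice to hold Slope at its mean, are assumptions rather than part of the answer):
ages       <- sort(unique(vel.focal$Age))   # the Age values to predict for
slope_mean <- mean(vel.focal$Slope)         # hold Slope at its mean
# columns must follow the order of the fixed effects (minus the intercept):
# Age, then the linear and quadratic Slope terms (this assumes poly(..., raw = TRUE);
# with the default orthogonal polynomials the two basis values would need recomputing)
newmods_mat <- unname(cbind(ages, slope_mean, slope_mean^2))
predict(mod6, newmods = newmods_mat)
# BLUPs for the Region random effect, to combine with the fixed-effect predictions:
ranef(mod6)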

Creating Inverse Probability of Attrition Weights in R

Weuve et al. (2012) wrote a great paper about implementing Inverse Probability of Attrition Weighting (IPAW), a weighting method used to account for bias introduced by attrition during the course of a longitudinal study. Here is a link to said article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3237815/#R30
I am working on a project where I am trying to implement this IPAW method and there isn't much out there on how to implement and code this method, so I'm looking for some help just to make sure I'm doing everything correctly.
The data I am working with involves older individuals who may have dementia, so it makes sense to use IPAW because those with dementia are more likely to leave the study. Each individual has at least a baseline visit and then up to 12 follow-up visits (the average number of visits per person is around 3). My understanding is that I should create weights for each round of follow-up visits, so I start by subsetting the data to a given visit, creating a variable for whether or not somebody drops out immediately following that visit, and then proceed to building the models and weights.
Below is the r code I have been using to generate the weights:
Creating weights for the first follow up visit (visit == 1)
for-loop to create a variable for attrition
(For ease, I am just calling the last observation "x")
data$attrition <- NA
data$attrition[x] <- 1  # the final observation is always followed by attrition
for (i in 1:(x - 1)) {  # stop at x - 1 so data$visit[i + 1] never runs past the data
  if (data$visit[i + 1] == 0) {  # the next row is a new person's baseline, so person i attrited here
    data$attrition[i] <- 1
  } else {
    data$attrition[i] <- 0
  }
}
subsetting to only get the data for the first follow up visit
data_visit1 <- subset(data, visit == 1)
creating stepwise model for the likelihood of attriting
# specifying null model
null_visit1 <- glm(attrition ~ 1, family = binomial, data = data_visit1)
# specifying full model --
full_visit1 <- glm(attrition~
predictor1 +
predictor2 +
...,
family = binomial, data = data_visit1)
# running combined selection
stepmodel_visit1 <- step(null_visit1, scope=list(lower = null_visit1, upper = full_visit1), direction = "forward", k=2)
Creating weights
# re-naming model for denominator
denom.model <- stepmodel_visit1
# creating the predicted categorizations
pd_visit1 <- predict(denom.model, type = "response")
## estimation of numerator of ip weights using stabilizer instead of just 1
numer.model <- glm(attrition ~ 1, family = binomial(), data = data_visit1)
# predicting the numerator values
pn_visit1 <- predict(numer.model, type = "response")
# Putting together the actual weights
data_visit1$weight <- ifelse(data_visit1$attrition == 1, pn_visit1 / pd_visit1, (1 - pn_visit1) / (1 - pd_visit1))
Following this, I rejoin the weights back to the full dataset and then repeat the process for each round of follow up visits. So my question is, does this all look good? I would love any and all feedback on my approach. Thanks so much!
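For what it's worth, a rough sketch of that rejoining step (assuming columns named id, visit, and weight, which are not in the original post): the weight applied at a given visit is usually the cumulative product of the visit-specific weights up to that visit, with visits that have no estimated weight (e.g. baseline) counting as 1.
# Sketch only: cumulative product of per-visit weights within each person
data <- data[order(data$id, data$visit), ]        # sort within person by visit
w <- ifelse(is.na(data$weight), 1, data$weight)   # unweighted visits count as 1
data$cum_weight <- ave(w, data$id, FUN = cumprod) # running product within person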

R Studio - How to predict row-by-row and use previous prediction in next one - linear model

Sorry if this is unclear, had trouble titling this.
Basically I have a linear model that predicts sales and one of the factors is the previous 10 days of sales. So, when predicting for the next month, I need an estimated number for what the "previous 10 days of sales" is for each day in the month.
I want to use the model to generate these numbers - so, for the first day I'm trying to predict, I have the actual number for the last 10 days of sales. For the day after that, I have 9 days of real data, plus the one predicted number generated. For the day after that, 8 days of real data and two generated, etc.
Not quite sure how to implement this and would appreciate any help. Thanks so much.
The first thing that came to mind would be a moving average using the predicted data. This gets hard to defend once you're averaging only predicted data, but it's a place to start.
moving.average <- numeric(30)
test.dat <- rnorm(100, 10, 2)
for (i in 1:30) {
  moving.average[i] <- mean(test.dat[i:(i + 10)])  # note i:(i + 10); i:i+10 would index a single element
}
Hope this is helpful
Kathy, get your first 10 data points from... wherever. Seed your prediction with them.
initialization <- c(9.463, 9.704, 10.475, 8.076, 8.221, 8.509,
10.083, 9.572, 8.447, 10.081)
prediction = initialization
Here's a silly prediction function that uses the last 10 values:
predFn <- function(vec10){
stopifnot(length(vec10) == 10)
round(mean(vec10) + 1 , 3)
}
Although I usually like to use the map family, this one seems like it wants to be a loop:
for(i in 11:20){
lo = i - 10
hi = i - 1
prediction[i] <- predFn(prediction[lo:hi])
}
What did we get?
prediction
# [1] 9.463 9.704 10.475 8.076 8.221 8.509 10.083 9.572 8.447 10.081 10.263 10.343 10.407 10.400 10.633 10.874 11.110 11.213
# [19] 11.377 11.670
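To plug in the real sales model instead of the toy function, predFn could simply wrap predict() on the fitted lm. A rough sketch, assuming a fitted model sales_lm whose only predictor is the mean of the previous 10 days (both the model object and the column name prev10_mean are assumptions, not from the original post):
predFn <- function(vec10) {
  stopifnot(length(vec10) == 10)
  # sales_lm and prev10_mean are assumed names; replace with the real model and predictor
  predict(sales_lm, newdata = data.frame(prev10_mean = mean(vec10)))
}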

r qqp function - why is the 'perfect fit' a flat line on 0?

This may be more of a statistical question than a programming one. I just wanted to make sure I was getting the programming right first.
I have a large count dataset (108 sites with 31 species = 3348 observations), but a lot of these are 0 counts because not every species was present at every site. Log transformation has been suggested to me, but others have said that you shouldn't log transform count data. Here is my data for the first 8 species (this also contains the very abundant species with the highest counts):
example.abund <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,
0,0,1,0,8,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,2,0,3,1,0,0,0,0,0,0,0,0,0,
2,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,
0,1,0,0,0,28,1,0,1,0,0,1,0,2,0,0,2,0,0,0,1,0,0,0,1,0,0,0,2,0,0,1,0,0,
0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,2,0,1,0,0,8,7,7,1,1,13,0,8,0,3,0,1,1,
1,4,4,0,1,0,1,0,0,0,0,6,5,2,0,2,58,4,2,47,4,0,0,0,2,59,2,0,0,6,1,36,28,2,
1,1,0,6,0,0,2,5,0,0,0,0,87,7,0,1,1,1,0,0,1,1,0,6,11,0,0,0,3,0,4,0,7,2,
0,5,0,4,1,0,1,12,0,2,0,9,0,1,0,0,0,24,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,3,1,0,1,0,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,15,0,2,
81,0,1,32,26,13,2,61,0,66,2,2,0,17,43,43,0,25,19,2,25,26,91,61,0,13,0,62,186,1,4,22,1,50,3,67,86,11,56,26,74,0,6,8,7,0152,8,14,1,97,1,0,12,11,3,1,1,112,2,35,36,5,61,26,211,15,8,173,17,97,22,18,88,11,1,66,15,3,3,3,2,0,1,0,41,9,14,1,0,38,0,0,51,27,11,38,31,1,0,221,68,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,2,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,29,0,0,0,0,
0,82,12,0,0,3,0,9,0,0,164,0,0,0,0,1,0,15,0,0,0,6,56,0,0,0,6,0,0,1,0,5,5,8,
0,4,0,0,6,0,0,2,0,0,3,0,0,0,0,683,0,0,0,0,3,149,252,11,13,195,19,0,59,0,0,1,28,0,
0,0,0,0,0,0,0,0,0,0,31,55,85,0,142,0,44,52,0,0192,0,45,0,0,0,0,0,0,11,2,0,0,6,
0,0,0,0,0,0,0,0,0,0,0,0,0,19,3,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0)
I need to fit a mixed model to the data, but first I am trying to figure out the most appropriate distribution to use. I was following the steps in this blog, but all of the red lines (meant to represent the 'perfect fit' for that distribution) are coming up as 0 along the entire plot.
My question is: have I coded this correctly, and are there simply so many 0s in my data that the perfect fit is 0? Or is there something wrong with the way I have coded this?
Code example:
library(car)   # qqp()
library(MASS)  # fitdistr()

# add 1 so that the distributions that don't allow 0s can handle the data
example.abund.1 <- example.abund + 1
hist(example.abund)
qqp(example.abund, "norm")
qqp(example.abund.1, "lnorm") # lognormal

# these distributions need parameter estimates first:
nbinom <- fitdistr(example.abund.1, "Negative Binomial")
qqp(example.abund.1, "nbinom", size = nbinom$estimate[[1]], mu = nbinom$estimate[[2]])
poisson <- fitdistr(example.abund.1, "Poisson")
qqp(example.abund.1, "pois", lambda = poisson$estimate[[1]])
gamma <- fitdistr(example.abund.1, "gamma")
qqp(example.abund.1, "gamma", shape = gamma$estimate[[1]], rate = gamma$estimate[[2]])

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants treated with a standard fertilizer, while the second sample (y) comes from plants treated with the "improved" one.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT 2:
Hack deleted as it was a wrong solution. Instead, one has to use the strata argument of the boot function:
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware that you're not going to get even close to a correct estimate of your p.value:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine for getting a (biased) estimate of the confidence interval, but significance testing of the difference should be done by permutation over the complete dataset.
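A minimal sketch of that permutation test, reusing the x, y, and total objects defined above (one-sided, matching the original alternative that the improved fertilizer is better):
set.seed(1)
obs_diff <- mean(y) - mean(x)               # observed difference in means
perm_diff <- replicate(10000, {
  shuffled <- sample(total)                 # permute group labels over the complete dataset
  mean(shuffled[6:11]) - mean(shuffled[1:5])
})
p.value <- mean(perm_diff >= obs_diff)      # one-sided p-value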
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they come from the same population. I may have missed a degrees-of-freedom adjustment (I am not sure how bootstrapping works in that regard), but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratification variable in some instances, this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as a single set of 11 samples is the correct way to bootstrap in this case.
If you had two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion, then the plots would be stratified samples and you'd want to treat them as such. But that isn't the case here.
