Dealing with factor in detection covariates in unmarked package - r

I am trying to compare (with a statistical test) the probability of detection across a categorical covariate with 3 levels, say A, B and C, in a very simple occupancy model. I am interested in the 3 pairwise comparisons. Is there any way to get this result from the unmarked package?
I decided to work around this by changing the reference level so that I can use the tests reported in the summary output of the fitted occupancy model. However, even though my covariate is specified as a factor, the function unmarkedFrameOccu does not recognize it, and I get the following warning:
Warning message:
obsCovs contains characters. Converting them to factors.
Because the conversion is done by unmarked, I have no way to control the reference level.
Do you have any idea what the issue might be and why unmarkedFrameOccu does not recognize my factor? I have added a reproducible example below. I found a way to deal with my main issue by recoding the categorical variable as 1/0 indicators, but I am still curious about this warning.
Thanks for your help and time and here is the example
Marie
library(tidyverse)
library(unmarked)
n_occas <- 6
n_sites <- 10
p_detect <- c(0.3, 0.8, 0.5)
cov <- matrix(factor( sample(c("A", "B", "C"), size = 60, replace = TRUE)), ncol = n_occas, nrow= n_sites)
sites <- sample(c(0,1), size = n_sites, replace = TRUE)
## generate actual detection according to presence state
y <- cov %>% as_tibble %>%
  add_column(truth = sites) %>%
  mutate(id = 1:n()) %>%
  pivot_longer(cols = c(-truth, -id), names_to = "occas", values_to = "cov") %>%
  # draw one uniform per row (runif(n = 1) would reuse a single draw for every row)
  mutate(detection = case_when(cov == "A" ~ truth * (runif(n = n()) < p_detect[1]),
                               cov == "B" ~ truth * (runif(n = n()) < p_detect[2]),
                               cov == "C" ~ truth * (runif(n = n()) < p_detect[3])
                               )
  ) %>%
  select(-cov) %>%
  pivot_wider(names_from = occas, values_from = detection, values_fill = NA) %>%
  select(-truth, -id)
unmarkedFrameOccu(y = y,
obsCovs = list(cov1 = cov))
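A likely source of the warning is that matrix() drops the factor class, so cov is already a character matrix before unmarked sees it. The sketch below uses base R only to show this, and then builds a long-format data frame in which the factor and its reference level are kept under your control. The names cov_fac, cov_mat and obs_long are illustrative, not part of the original example.

```r
set.seed(42)
n_sites <- 10
n_occas <- 6

# matrix() strips the factor class, leaving a character matrix
cov_fac <- factor(sample(c("A", "B", "C"), size = n_sites * n_occas, replace = TRUE),
                  levels = c("A", "B", "C"))
cov_mat <- matrix(cov_fac, nrow = n_sites, ncol = n_occas)
is.factor(cov_mat)     # FALSE
is.character(cov_mat)  # TRUE

# Long-format alternative: one row per site-occasion in site-major order
# (all occasions of site 1, then site 2, ...), with the factor kept intact.
obs_long <- data.frame(
  cov1 = factor(as.vector(t(cov_mat)), levels = c("A", "B", "C"))
)

# Choose the reference level explicitly; refitting once per reference level
# lets each pairwise contrast be read from the model summary.
obs_long$cov1 <- relevel(obs_long$cov1, ref = "B")
levels(obs_long$cov1)  # "B" "A" "C"
```

As far as I understand the unmarked documentation, a data frame like obs_long with n_sites * n_occas rows can be passed as obsCovs to unmarkedFrameOccu (with y supplied as a matrix). That step is not run here, so treat it as an assumption to check.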

Related

Fixed within one class intercepts, fixed within another class regression coefficients, and errors; as columns in a tibble

library(tidyverse)
library(lme4)
library(broom.mixed)
Tibble = tibble(
Class1 = rep(c("TITUS","CAIUS"),27),
Class2 = rep(c("A","A","A",
"B","B","B",
"C","C","C"),6
),
Outcome = rnorm(54,5,2),
Predictor = Outcome + rnorm(54,0,2.5),
alpha = NA,
beta = NA)
lmer(data = Tibble,
Outcome ~ (0 + (Class1) + (0 + Predictor|(Class2)))) %>%
tidy(effects = c("fixed","ran_coefs")) -> model
for(i in 1:54) {
Tibble$alpha[i] <- model %>%
filter(effect == "fixed",
term == str_c("Class1",Tibble$Class1[i])) %>%
pull(estimate)
Tibble$beta[i] <- model %>%
filter(effect == "ran_coefs",
level == Tibble$Class2[i],
term == "Predictor") %>%
pull(estimate)
}
Tibble %>% mutate(
Predicted = (alpha + Predictor*beta),
epsilon = Outcome - Predicted) -> Tibble
Tibble %>% summarise(cor(Predicted,Outcome, method = "kendall"),
)
Key concepts: there is an alpha for each Class1 and a beta for each Class2.
Epsilon is the residual.
I want to make the code above faster without resorting to a for loop.
I am also very worried about the regression model, because I plan to run it on a tibble with more than 1 million observations, 600k levels of Class1 and 40k levels of Class2.
Note that this model formula and regression package are only one possible combination, and suggestions are welcome. I am most interested in estimating the alphas, not in minimizing the epsilons.
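One way to drop the for loop is to treat the coefficient table as a lookup table and index into it once per column. The sketch below uses base R match(). The model data frame is a small hand-made stand-in shaped like the tidy() output above (its estimates are invented for illustration), and dat is a toy version of Tibble.

```r
# Hypothetical stand-in for the tidy() output: one row per coefficient
model <- data.frame(
  effect   = c("fixed", "fixed", "ran_coefs", "ran_coefs", "ran_coefs"),
  level    = c(NA, NA, "A", "B", "C"),
  term     = c("Class1CAIUS", "Class1TITUS", "Predictor", "Predictor", "Predictor"),
  estimate = c(4.8, 5.1, 0.30, 0.35, 0.25),
  stringsAsFactors = FALSE
)

dat <- data.frame(
  Class1    = rep(c("TITUS", "CAIUS"), 3),
  Class2    = c("A", "A", "B", "B", "C", "C"),
  Predictor = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

fixed  <- model[model$effect == "fixed", ]
slopes <- model[model$effect == "ran_coefs" & model$term == "Predictor", ]

# match() performs all lookups in one vectorised step, replacing the loop
dat$alpha <- fixed$estimate[match(paste0("Class1", dat$Class1), fixed$term)]
dat$beta  <- slopes$estimate[match(dat$Class2, slopes$level)]
dat$Predicted <- dat$alpha + dat$Predictor * dat$beta
```

The same idea can be written with two left_join() calls in dplyr. Either way, the cost grows with the number of rows rather than with rows times the number of filter() calls.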

How to calculate running slope for rlm using runner?

I have a data frame "customers" made up of customer id, month and total purchases for that month.
I'm trying to calculate a running slope over a 12-month window using robust regression.
I have tried the following:
Coef <- function(x) {return(rlm(cbind(x)~cbind(1:length(x)))$coefficients[2])}
customer_slope = customers %>% mutate(slope = runner(x=total_purchases,k=12,f=Coef))
I get the following error:
x 'x' is singular: singular fits are not implemented in 'rlm'
If I run a single example, the function returns what I expected:
Coef(c(4,11,7,15,5,14,8,9,14,17,14,13))
cbind(1:length(x))
0.6888112
I ran into similar problems and eventually arrived at the solution below using slider. It provides a 3-day rolling estimate (you can change the window as you see fit). It doesn't quite get to your answer (which you could probably reach with loops), but it gets most of the way there.
library(MASS)
library(dplyr)
library(slider)
dat <- tibble::tibble(customers = c(4,11,7,15,5,14,8,9,14,17,14,13)) %>%
mutate(t = 1:n() %>% as.numeric())
dat %>%
mutate(results = slide_dbl(.x = .,
.f = ~rlm(customers ~ t, k = 12, data = .x)$coefficients[2],
.before = 2,
.complete = T))
It looks like that's the way to go, thanks!
It seems that what caused the singularity was that I didn't change the default .complete from F to T.
So, combined with your suggestion, this is how I made it work. It took about two hours for 3M rows; I did, however, have a more complex group_by involved, which is not shown below.
slope_rlm <- function(x) {
x=as.numeric(x)
prep = tibble(data=x)%>%mutate(t=1:n()%>%as.numeric())
return(rlm(data~t,data=prep)$coefficients[2])
}
customers_rlm = customers %>%
mutate(cust_rlm_12=slide_dbl(total_purchases,slope_rlm,.before=11,.complete=T))
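The singular-fit error reported in the question is consistent with partial windows at the start of each series: a window holding a single observation cannot support both an intercept and a slope. As a minimal sketch, the slope function can guard against short windows itself, so it does not depend on how the rolling helper treats incomplete windows. safe_slope and min_obs are hypothetical names introduced here.

```r
library(MASS)  # provides rlm()

# Return NA for windows that are too short, or for any fit that fails
safe_slope <- function(x, min_obs = 3) {
  x <- as.numeric(x)
  if (length(x) < min_obs) return(NA_real_)
  t <- seq_along(x)
  tryCatch(unname(coef(rlm(x ~ t))[2]), error = function(e) NA_real_)
}

safe_slope(4)                                             # too short: NA
safe_slope(c(4, 11, 7, 15, 5, 14, 8, 9, 14, 17, 14, 13))  # full 12-point window
```

A function like this can be passed as the f / .f argument of the rolling call in place of Coef or slope_rlm.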
Consider data for two customers spanning 1000 days. total_purchases is cumulated per customer, and each purchase size is ~Pois(5).
set.seed(1)
customers <- data.frame(
id = factor(rep(1:2, length.out = 100)),
date = seq(Sys.Date(), Sys.Date() + 1000, length.out = 100)
) %>%
group_by(id) %>%
mutate(
total_purchases = cumsum(rpois(n(), lambda = 5))
)
When calculating a regression in a rolling window, make sure that you handle errors coming from insufficient degrees of freedom, singularity, etc. That is why I've put tryCatch around the rlm call: if there is any error, the function returns NA for the failing window.
The data below is grouped by id, which means that the model is calculated per customer. The yearly rolling regression should converge to slope = 5 (+/- random error). The result is assigned back to customers so that the slope column is available for plotting.
customers <- customers %>%
group_by(id) %>%
mutate(
slope = runner(
x = .,
f = function(x) {
tryCatch(
rlm(x$total_purchases ~ seq_len(nrow(x)))$coefficients[2],
error = function(e) NA
)
},
idx = "date",
k = "year"
)
)
Plotting slope in time for customers
ggplot(customers, aes(x = date, y = slope, color = id, group = id)) +
geom_line() +
geom_hline(yintercept = 5, color = "red")

GAM with mrf smooth - errors (mismatch between nb/polys area names and data area names)

I am trying to model Polish local government election results from 2015, following the superb blog post by @GavinSimpson: https://www.fromthebottomoftheheap.net/2017/10/19/first-steps-with-mrf-smooths/ I joined my xls data with the shp data using a 6-digit identifier (there may be leading 0's), which I kept as a text variable. EDIT: I simplified the identifier and am now using a sequence from 1 to nrow to simplify my question.
library(tidyverse)
library(sf)
library(mgcv)
# Read data
# From https://www.gis-support.pl/downloads/gminy.zip shp file
boroughs_shp <- st_read("../../_mapy/gminy.shp",options = "ENCODING=WINDOWS-1250",
stringsAsFactors = FALSE ) %>%
st_transform(crs = 4326)%>%
janitor::clean_names() %>%
# st_simplify(preserveTopology = T, dTolerance = 0.01) %>%
mutate(teryt=str_sub(jpt_kod_je, 1, 6)) %>%
select(teryt, nazwa=jpt_nazwa, geometry)
# From https://parlament2015.pkw.gov.pl/wyniki_zb/2015-gl-lis-gm.zip data file
elections_xls <-
readxl::read_excel("data/2015-gl-lis-gm.xls",
trim_ws = T, col_names = T) %>%
janitor::clean_names() %>%
select(teryt, liczba_wyborcow, glosy_niewazne)
elections <-
boroughs_shp %>% fortify() %>%
left_join(elections_xls, by = "teryt") %>%
arrange(teryt) %>%
mutate(idx = seq.int(nrow(.)) %>% as.factor(),
teryt = as.factor(teryt))
# Neighbors
boroughs_nb <-spdep::poly2nb(elections, snap = 0.01, queen = F, row.names = elections$idx )
names(boroughs_nb) <- attr(boroughs_nb, "region.id")
# Model
ctrl <- gam.control(nthreads = 4)
m1 <- gam(glosy_niewazne ~ s(idx, bs = 'mrf', xt = list(nb = boroughs_nb)),
data = elections,
offset = log(liczba_wyborcow), # number of votes
method = 'REML',
control = ctrl,
family = betar())
Here is the error message:
Error in smooth.construct.mrf.smooth.spec(object, dk$data, dk$knots) :
mismatch between nb/polys supplied area names and data area names
In addition: Warning message:
In if (all.equal(sort(a.name), sort(levels(k))) != TRUE) stop("mismatch between nb/polys supplied area names and data area names") :
the condition has length > 1 and only the first element will be used
elections$idx is a factor. I am using it to give names to boroughs_nb to be absolutely sure I have the same number of levels. What am I doing wrong?
EDIT: The condition mentioned in error message is met:
> all(sort(names(boroughs_nb)) == sort(levels(elections$idx)))
[1] TRUE
It seems that I have solved the issue, though as a statistics beginner I do not fully understand how.
First, there must not be a single NA in the modeled data. There was one. After removing it, mgcv ran, but it took a long time (a quarter of an hour) and, inexplicably to me, only worked when I limited the number of knots to k=50, with poor results (with fewer or more knots it did not return any result) and with a warning to be cautious about the results.
Then I removed offset=log(liczba_wyborcow), i.e. the offset for the number of voters, and made the number of void votes per 1000 voters my response variable.
elections <-
boroughs_shp %>%
left_join(elections_xls, by = "teryt") %>% na.omit() %>%
arrange(teryt) %>%
mutate(idx = row_number() %>% as.factor()) %>%
mutate(void_ratio=round(glosy_niewazne/liczba_wyborcow,3)*1000)
Now that it is a count, why not try changing family = betar() in the gam call to poisson()? That still did not give a good result, so I then switched to the negative binomial, family = nb().
Now my call looks like
m1 <-
gam(
void_ratio ~ s(
idx,
bs = 'mrf',
k =500,
xt = list(nb = boroughs_nb),
fx = TRUE),
data = elections_df,
method = 'REML',
control = gam.control(nthreads = 4),
family = nb()
)
It now seems to be blazingly fast and returns valid results with no warnings or errors. On a laptop with a 4-core Intel Core i7 6820HQ @ 2.70GHz, 16GB RAM and Win10, it now takes 1-2 minutes to build a model.
In brief, what I changed was: remove a single NA, remove the offset from the formula and use the negative binomial distribution.
Here is the result of what I wanted to achieve, from left to right: the real rate of void votes, the rate smoothed by the model, and residuals indicating discrepancies. The mgcv code let me do that.
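A possible explanation for why a single NA triggers the name mismatch, offered as a hedged sketch rather than a confirmed diagnosis: if the incomplete row is dropped during fitting and unused factor levels are dropped with it (gam() has a drop.unused.levels argument that defaults to TRUE), the factor ends up with one level fewer than the neighbour list has names. The toy objects below (dat, nb_names, used) are hypothetical stand-ins for elections and names(boroughs_nb), and the check uses base R only.

```r
# Toy stand-ins: 5 areas, one with a missing response
dat <- data.frame(
  idx = factor(1:5),
  y   = c(10, 12, NA, 9, 11)
)
nb_names <- as.character(1:5)  # plays the role of names(boroughs_nb)

# On the full data the names agree, as in the check shown in the question
isTRUE(all.equal(sort(nb_names), sort(levels(dat$idx))))  # TRUE

# After dropping the incomplete row and its now-unused level they no longer do
used <- droplevels(dat[complete.cases(dat), ])
cmp <- all.equal(sort(nb_names), sort(levels(used$idx)))
isTRUE(cmp)                                               # FALSE

# setdiff() names the area that went missing
setdiff(nb_names, levels(used$idx))                       # "3"
```

Running the same setdiff() on the rows that actually enter the model is a quick way to find which area is causing the mismatch.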

Self-made function works in test but not for my actual data set

I am working with functions. I wrote a function for Basal Area
ba <- function(dbh,na.rm) {
stopifnot(is.numeric(dbh))
answer <- dbh^2*(0.005454)
return(answer)
}
The function works with a test vector. Now I am trying to do some summaries of a dataset I have.
(copy and pasted directly from R)
plot.summary <- trees %>% group_by(MU, Plot, Inv) %>% summarize(year = first(Year), arithemtic.mean = my.mean(dbh, na.rm = TRUE), quadratic.mean = my.q.mean(dbh, na.rm = TRUE), var = my.var(dbh, na.rm = TRUE), n.trees = n())
(Modified spacing to read easier)
plot.summary <- trees %>% group_by(MU, Plot, Inv) %>%
summarize(year = first(Year), arithemtic.mean = my.mean(dbh, na.rm = TRUE),
quadratic.mean = my.q.mean(dbh, na.rm = TRUE), var = my.var(dbh, na.rm = TRUE),
n.trees = n())
When I run it, it says
Error in summarise_impl(.data, dots) :
Column `basal.area` must be length 1 (a summary value), not 19
I am not sure why. The data set has only 18 columns.
My command works perfectly fine when I do not include the basal area part.
I am not sure what I might be missing
Thank you for any help!
The variables you refer to in the group_by function are not in the dataset trees, so I've taken some liberties to create a reproducible example that hopefully fits your needs.
Assuming you wanted to group by a variable like Height, here is a working example:
plot.summary <- trees %>%
group_by(Height) %>%
summarise(mean.basal.area = mean(ba(Girth)),
n.trees = n())
In the above, your function ba is wrapped in mean. This results in a mean basal area for the set of values of Girth that share the same Height.
Is that the kind of thing you want?
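To illustrate why the error mentions a length of 19: ba() is vectorised, so it returns one value per tree, and a group containing 19 trees yields 19 values where summarise expects one. The "19" is therefore probably a group size, not a column count. The sketch below uses base R and the built-in trees dataset, with Girth standing in for dbh as in the example above; per_group is an illustrative name.

```r
ba <- function(dbh) {
  stopifnot(is.numeric(dbh))
  dbh^2 * 0.005454
}

# One basal area per tree: the result is as long as the input
length(ba(trees$Girth))  # equals nrow(trees)

# Reducing with sum() (or mean()) gives the single value per group
per_group <- aggregate(Girth ~ Height, data = trees,
                       FUN = function(g) sum(ba(g)))
names(per_group)[2] <- "basal.area"
head(per_group)
```

Whether sum() or mean() is the right reduction depends on whether you want total or average basal area per plot.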

Apply grouped model group-wise

My question is very similar to this one, but the problem I am facing has a twist that those answers do not address. Specifically, I am estimating a spatial model, y=rho * lw * y + X *beta. Because the observations are related by the matrix lw, I must apply the model to the entire X matrix simultaneously. Because those answers operate row-wise, they do not apply.
Here is MWE data, consisting of twenty points across three groups and a spatial weights matrix:
library(spdep)
#Coordinates
pointcoords <- data.frame(x = runif(n=20, min =10, max = 100), y = runif(n=20, min = 10, max = 100), ID = as.character(1:20))
pointsSP <- SpatialPoints(pointcoords[,1:2])
# Weights matrix
lw <- nb2listw(knn2nb(knearneigh(pointsSP, k = 4, RANN = FALSE),
row.names = pointcoords$ID))
# Data
MyData <- data.frame(ID = rep(1:20, each = 3),
Group = rep(1:3, times = 20),
DV = rnorm(60),IV = rnorm(60))
I can estimate the models by Group with dplyr
library(dplyr)
models <- MyData %>% group_by(Group) %>%
do(lm = lm(DV ~ IV, data = .),
sar = lagsarlm(DV ~ IV, data = ., listw = lw))
Predicting to new data with this answer operates on a row-wise basis, working fine for the lm objects,
MyData2 <- data.frame(ID = rep(1:20, each = 3),
Group = rep(1:3, times = 20),
IV = rnorm(60))
MyData2 %>% left_join(models) %>% rowwise %>%
mutate(lmPred = predict(lm, newdata = list("IV" = IV))) %>% head()
#Joining by: "Group"
#Source: local data frame [6 x 6]
#Groups:
# ID Group IV lm sar lmPred
#1 1 1 -0.8930794 <S3:lm> <S3:sarlm> -0.21378814
#2 1 2 -1.6637963 <S3:lm> <S3:sarlm> 0.42547796
#3 1 3 0.5243841 <S3:lm> <S3:sarlm> -0.23372996
#4 2 1 -0.1956969 <S3:lm> <S3:sarlm> -0.20860280
#5 2 2 0.8149920 <S3:lm> <S3:sarlm> 0.14771431
#6 2 3 -0.3000439 <S3:lm> <S3:sarlm> 0.05082524
But not for the sar models:
MyData2 %>% left_join(models) %>% rowwise %>%
mutate(sarPred = predict(sar, newdata = list("IV" = IV), listw=lw)) %>% head()
#Joining by: "Group"
#Error in if (nrow(newdata) != length(listw$neighbours)) stop("mismatch between newdata and spatial weights") :
argument is of length zero
I think there should be a better way of doing this, without joining the model to every row. Also, creating a list object for newdata won't work if you have several or changing predictor variables. It seems that the dplyr way should be something like this:
MyData2 %>% group_by(Group) %>%
mutate(sarPred = predict(models$sar[[Group]], newdata = ., listw=lw))
But the [[Group]] index isn't quite right.
I ended up doing this with do in dplyr, going through the models data.frame rowwise. I believe it does what you want, although the output doesn't contain the new data used for predictions. I did add in Group to the output, though, as it seemed necessary to keep groups separated.
models %>%
do(data.frame(Group = .$Group,
predlm = predict(.$lm, newdata = filter(MyData2, Group == .$Group)),
predsar = predict(.$sar, newdata = filter(MyData2, Group == .$Group) , listw = lw)))
EDIT
Playing around with adding the explanatory variable into the output data.frame. The following works, although there is likely a better way to do this.
models %>%
do(data.frame(Group = .$Group, IV = select(filter(MyData2, Group == .$Group), IV),
predlm = predict(.$lm, newdata = filter(MyData2, Group == .$Group)),
predsar = predict(.$sar, newdata = filter(MyData2, Group == .$Group) , listw = lw)))
I'm putting this out there because it does do what I want it to, even if it needs to use a for loop (gasp)
predictobj <- list()
for(i in models$Group){
predictobj[[i]] <- predict.sarlm(models$sar[[i]],
newdata = filter(MyData2, Group == i),
listw = lw)
}
Anybody have a dplyr solution?
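A loop-free pattern that keeps each group's rows together is split() plus Map(): fit one model per group, then predict each group's whole block of new rows in a single call. The sketch below uses lm() as a stand-in so that it runs with base R alone. Swapping in lagsarlm() and passing listw = lw to predict() is an assumption that is not tested here. train, new, fits and preds are illustrative names.

```r
set.seed(1)
train <- data.frame(Group = rep(1:3, times = 20), DV = rnorm(60), IV = rnorm(60))
new   <- data.frame(Group = rep(1:3, times = 20), IV = rnorm(60))

# One model per group, stored in a list named by group
fits <- lapply(split(train, train$Group), function(d) lm(DV ~ IV, data = d))

# Predict each group's block of new rows in one call, never row by row
new_by_group <- split(new, new$Group)
preds <- Map(function(fit, nd) predict(fit, newdata = nd),
             fits[names(new_by_group)], new_by_group)

# Put the predictions back in the original row order
new$pred <- unsplit(preds, new$Group)
head(new)
```

Because predict() receives all rows of a group at once, a model whose predictions depend on the full set of observations (as with the spatial weights here) gets the complete newdata it needs.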
