How to fix error when using studpermu.test with large dataset - r

I am using studpermu.test on a large hyperframe of 250 ppp objects, each with thousands of points (~ 2000-5000), and a grouping factor with 5 levels, with equal group sizes of 50. The function runs successfully on a smaller subset of point patterns but returns an error when I try to run the function on the whole hyperframe.
Here is a smaller reproducible example which gives the same error.
X <- runifpoint(20, nsim = 250)
h <- hyperframe(ppp = X, group = rep(1:5, each=50))
studpermu.test(h, ppp ~ group)
Error in if (npossible < max(100, nperm)) warning("Don't expect exact results - group sizes are too small") : missing value where TRUE/FALSE needed
In addition: Warning message: In factorial(sum(m)) : value out of range in 'gammafn'

This is a bug in the current CRAN version of spatstat (1.61-0 at the time of writing). Here is a small reproducible example (it would have been great if you had provided one in your question):
library(spatstat)
#> Loading required package: spatstat.data
#> Loading required package: nlme
#> Loading required package: rpart
#>
#> spatstat 1.61-0 (nickname: 'Puppy zoomies')
#> For an introduction to spatstat, type 'beginner'
X <- runifpoint(20, nsim = 250)
h <- hyperframe(ppp = X, group = rep(1:5, each=50))
studpermu.test(h, ppp ~ group)
#> Error in if (npossible < max(100, nperm)) warning("Don't expect exact results - group sizes are too small"): missing value where TRUE/FALSE needed
This is now fixed in the development version of spatstat. For now you can install the development version from GitHub which should fix the issue:
library(remotes)
install_github("spatstat/spatstat")
With this version I get (didn't set a seed so results may vary, but no error should occur):
library(spatstat)
#> Loading required package: spatstat.data
#> Loading required package: nlme
#> Loading required package: rpart
#>
#> spatstat 1.61-0.021 (nickname: 'New improved formula')
#> For an introduction to spatstat, type 'beginner'
X <- runifpoint(20, nsim = 250)
h <- hyperframe(ppp = X, group = rep(1:5, each=50))
studpermu.test(h, ppp ~ group)
#>
#> Studentized permutation test for grouped point patterns
#> ppp ~ group
#> 5 groups: 1, 2, 3, 4, 5
#> summary function: Kest, evaluated on r in [0, 0.25]
#> test statistic: T, 999 random permutations
#>
#> data: h
#> T = 3.6133, p-value = 0.125
#> alternative hypothesis: not the same K-function
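
For context on the original error, the warning about factorial(sum(m)) suggests that the number of possible permutations was computed with plain factorials, which overflow for 250 patterns. The following base R sketch illustrates that mechanism only. The exact formula used inside spatstat is an assumption here, and m, npossible_naive and log_npossible are illustrative names.

```r
# Group sizes from the question: 5 groups of 50 patterns each
m <- rep(50, 5)
nperm <- 999

# Naive count of distinct group assignments (assumed form, not spatstat's code).
# factorial(250) overflows to Inf, and so does the product in the denominator,
# so the ratio is Inf / Inf = NaN.
npossible_naive <- suppressWarnings(factorial(sum(m)) / prod(factorial(m)))
is.nan(npossible_naive)

# Comparing NaN with a number gives NA, which `if` cannot handle
is.na(npossible_naive < max(100, nperm))

# Working on the log scale avoids the overflow
log_npossible <- lfactorial(sum(m)) - sum(lfactorial(m))
log_npossible < log(max(100, nperm))
```

The first two checks should return TRUE and the last FALSE: on the log scale the count stays finite and is far larger than the number of permutations requested.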

Related

Why is the quantile function not working for this dplyr function?

I'm working through Faraway's 2016 book Extending the Linear Model with R and have encountered an issue with the code that I don't know how to fix. Here is the relevant syntax leading up to the error:
#### Load Data & Libraries ####
library(faraway)
library(tidyverse)
data(wcgs)
#### Add Variables ####
wcgs$y <- ifelse(wcgs$chd == "no",0,1) # create binary response from chd
wcgs$bmi <- with(wcgs,
703*wcgs$weight/(wcgs$height^2)) # create BMI variable
#### Create GLM Model ####
lmod <- glm(chd ~ height + cigs,
family = binomial,
wcgs)
#### Mutate Data ####
wcgs <- mutate(wcgs,
residuals=residuals(lmod),
linpred=predict(lmod)) # create residuals/pred values
And this is the part where the error arises (in the third statement, which involves a mutate call):
#### Error Code (Last Line) ####
wcgsm <- na.omit(wcgs) # omit NA values
wcgsm <- mutate(wcgsm,
predprob=predict(lmod,
type="response")) # make pred data
gdf <- group_by(wcgsm,
cut(linpred,
breaks=unique(quantile(linpred,
(1:100)/101)))) # bin NA
Which gives me this error:
Error in `group_by()`:
! Problem adding computed columns.
Caused by error in `mutate()`:
! Problem while computing `..1 = cut(linpred, breaks = unique(quantile(linpred,
(1:100)/101)))`.
✖ `..1` must be size 3140 or 1, not 3154.
I don't understand what this error means. When I run dim(wcgs), I get 3154 rows, and when I run dim(na.omit(wcgs)) I get 3140 rows. The only thing I can think of is that the predicted model values don't line up with the new na.omit data, but I'm not sure how to work around that, given that the rest of this chapter uses this data manipulation.
When newdata is not supplied, the predict methods for R's modeling functions return predictions for the original data set the model was fitted to. To predict for a different data set, in this case a subset of the data wcgs, the argument newdata must be set explicitly.
The error in the predict line at the bottom is therefore expected behavior.
#### Load Data & Libraries ####
suppressPackageStartupMessages({
#library(faraway)
library(dplyr)
})
data(wcgs, package = "faraway")
#### Add Variables ####
wcgs$y <- as.integer(wcgs$chd == "yes") # create binary response from chd
wcgs$bmi <- with(wcgs, 703*weight/(height^2)) # create BMI variable
#### Create GLM Model ####
lmod <- glm(chd ~ height + cigs, family = binomial, data = wcgs)
#### Mutate Data ####
# create residuals/pred values
wcgs <- mutate(wcgs,
residuals = residuals(lmod),
linpred = predict(lmod))
wcgsm <- na.omit(wcgs) # omit NA values
wcgsm <- mutate(wcgsm,
predprob = predict(lmod, type="response")) # make pred data
#> Error in `mutate()`:
#> ! Problem while computing `predprob = predict(lmod, type = "response")`.
#> ✖ `predprob` must be size 3140 or 1, not 3154.
Created on 2022-07-16 by the reprex package (v2.0.1)
To see where the error comes from, compare the lengths of the two prediction vectors.
predprob_all <- predict(lmod, type = "response")
predprob_na.omit <- predict(lmod, newdata = wcgsm, type = "response")
length(predprob_all)
#> [1] 3154
length(predprob_na.omit)
#> [1] 3140
These lengths are the values in the error message, once again, as expected.
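
The same behaviour can be reproduced without the faraway data. This is a minimal base R sketch in which lm() on the built-in airquality data stands in for the glm() on wcgs. The names fit, sub, p_all and p_sub are illustrative.

```r
# airquality has missing values in Ozone and Solar.R, but not in Temp or Wind
fit <- lm(Temp ~ Wind, data = airquality)
sub <- na.omit(airquality)            # drops every row that contains an NA

p_all <- predict(fit)                 # one prediction per row used in the fit
p_sub <- predict(fit, newdata = sub)  # one prediction per row of the subset

c(all = length(p_all), subset = length(p_sub))

# Predictions are named by row, so subsetting the full vector by the rows
# that survived na.omit() gives the same values as using newdata
all.equal(p_sub, p_all[rownames(sub)])
```

Either route gives a vector whose length matches the reduced data frame, which is what mutate() requires.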
There is also the problem that the quantiles passed to cut(., breaks) do not span the entire range of linpred, so values outside the quantiles' range become NA. This is solved by adding -Inf and Inf as the two endpoints of the breaks vector.
And I have given a name to the binned vector.
The following code works and, I believe, does what is needed.
wcgsm <- na.omit(wcgs) # omit NA values
wcgsm <- mutate(wcgsm,
predprob = predict(lmod, newdata = wcgsm, type="response")) # make pred data
breaks <- c(-Inf,
unique(quantile(wcgsm$linpred, (1:100)/101)),
Inf)
gdf <- group_by(wcgsm,
bins = cut(linpred, breaks = breaks)) # bin NA
anyNA(gdf$bins)
#> [1] FALSE
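
The effect of the two extra endpoints is easy to see on a toy vector. A minimal sketch, with x and inner as made-up illustrative values:

```r
x <- 1:10

# Interior quantiles only: they do not cover the smallest and largest values
inner <- unique(quantile(x, (1:3) / 4))

bins_inner <- cut(x, breaks = inner)                # values outside become NA
bins_full  <- cut(x, breaks = c(-Inf, inner, Inf))  # every value gets a bin

anyNA(bins_inner)
anyNA(bins_full)
table(bins_full)
```

With the interior quantiles alone, everything at or below the first break or above the last break falls outside all intervals. Padding with -Inf and Inf adds one open-ended bin at each end.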

Loading {logistf} breaks MCMCglmm()

Loading the package logistf breaks MCMCglmm(). Unloading logistf before running the command doesn't remove the error.
Why is that? Is there a way to solve this?
Works
library(MCMCglmm)
#> Loading required package: Matrix
#> Loading required package: coda
#> Loading required package: ape
data(PlodiaPO)
MCMCglmm(PO ~ plate, data = PlodiaPO)
#>
#> MCMC iteration = 0
#>
#> MCMC iteration = 1000
#>
#> MCMC iteration = 2000
#>
#> MCMC iteration = 3000
#>
[...]
#> attr(,"class")
#> [1] "MCMCglmm"
Created on 2022-06-07 by the reprex package (v2.0.1)
Doesn't work
library(logistf)
library(MCMCglmm)
#> Loading required package: Matrix
#> Loading required package: coda
#> Loading required package: ape
data(PlodiaPO)
MCMCglmm(PO ~ plate, data = PlodiaPO)
#> Error in terms.formula(formula, data = data): invalid term in model formula
unloadNamespace("logistf")
MCMCglmm(PO ~ plate, data = PlodiaPO)
#> Error in terms.formula(formula, data = data): invalid term in model formula
After some research I found that the problem does not come from logistf itself but from the package formula.tools, which it imports. To reproduce the error, try:
library(formula.tools)
#>formula.tools-1.7.1 - Copyright © 2022 Decision Patterns
library(MCMCglmm)
#> Loading required package: Matrix
#> Loading required package: coda
#> Loading required package: ape
data(PlodiaPO)
MCMCglmm(PO ~ plate, data = PlodiaPO)
#> Error in terms.formula(formula, data = data) :
#>   invalid term in model formula
This is a known issue with formula.tools; see "Weird package dependency introduces error".
The solution detailed in that issue is either to:
fork the formula.tools repo,
remove this line: https://github.com/decisionpatterns/formula.tools/blob/45b6654e4d8570cbaf1e2fd527652471202d97ad/NAMESPACE#L3
and install_github from your fork
OR
run as.character.formula = function(x) as.character.default(x) right after loading formula.tools. That might break code that relies on as.character.formula, though (I am not sure).
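
To illustrate why an as.character method for formulas can interfere with other packages, here is a base R sketch that does not load formula.tools at all. The first as.character.formula below is a hypothetical stand-in that collapses a formula to a single string. Whether formula.tools does exactly this is an assumption. The second definition is the workaround quoted above.

```r
f <- PO ~ plate

# Base R behaviour: a formula converts to its three components
base_chr <- as.character(f)
base_chr

# Hypothetical stand-in for a package-supplied method (an assumption, not the
# formula.tools source): return the whole formula as one string
as.character.formula <- function(x, ...) paste(deparse(x), collapse = " ")
collapsed <- as.character(f)
collapsed

# The workaround: route formulas back to the default method
as.character.formula <- function(x, ...) as.character.default(x)
restored <- as.character(f)
identical(restored, base_chr)
```

Code that indexes into as.character(formula) expecting separate components would get something different once such a method is in place, which fits the "invalid term in model formula" symptom.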
Thanks for this question

ETS from fable package in R (can I do it without tsibble)

I am trying to use the ETS function from the fable package (following a tutorial). Ideally I would like to do it without using tsibble functionality. In particular, I am trying to generate a forecast:
library(tsibble)
library(fable)
library(tidyverse)
fit <- ETS(1:63)
forecast(fit, h =2)
returns error:
Error in UseMethod("forecast") :
no applicable method for 'forecast' applied to an object of class "c('mdl_defn', 'R6')"
another try
summary(fit)
also returns error
Error in object[[i]] : wrong arguments for subsetting an environment
So can I use it without the full tsibble functionality? It was so simple with ARIMA from the forecast package.
If it is not possible without tsibble what would be the quickest way to cast it as tsibble data?
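You need to use tsibbles, but it is very easy to do so. ETS() on its own only creates a model definition (the "mdl_defn" class in the error message); the model is fitted when that definition is passed to model() together with a tsibble.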
library(tsibble)
library(fable)
library(tidyverse)
ts(1:63) %>%
as_tsibble() %>%
model(ETS(value)) %>%
forecast(h=2)
#> # A fable: 2 x 4 [1]
#> # Key:     .model [1]
#>   .model     index value .distribution
#>   <chr>      <dbl> <dbl> <dist>
#> 1 ETS(value)    64    64 N(64, 0)
#> 2 ETS(value)    65    65 N(65, 0)
Created on 2020-02-19 by the reprex package (v0.3.0)

Handling alternative-specific NA values in mlogit

It is common in mode choice models to have variables that vary with alternatives ("generic variables") but that are undefined for certain modes. For example, transit fare is present for bus and light rail, but undefined for automobiles and biking. Note that the fare is not zero.
I'm trying to make this work with the mlogit package for R. In this MWE I've asserted that price is undefined for fishing from the beach. This results in a singularity error.
library(mlogit)
#> Warning: package 'mlogit' was built under R version 3.5.2
#> Loading required package: Formula
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
#> Loading required package: lmtest
data("Fishing", package = "mlogit")
Fishing$price.beach <- NA
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
head(Fish)
#>            mode   income     alt   price  catch chid
#> 1.beach   FALSE 7083.332   beach      NA 0.0678    1
#> 1.boat    FALSE 7083.332    boat 157.930 0.2601    1
#> 1.charter  TRUE 7083.332 charter 182.930 0.5391    1
#> 1.pier    FALSE 7083.332    pier 157.930 0.0503    1
#> 2.beach   FALSE 1250.000   beach      NA 0.1049    2
#> 2.boat    FALSE 1250.000    boat  10.534 0.1574    2
mlogit(mode ~ catch + price | income, data = Fish, na.action = na.omit)
#> Error in solve.default(H, g[!fixed]): system is computationally singular: reciprocal condition number = 3.92205e-24
Created on 2019-07-08 by the reprex package (v0.2.1)
This happens when price is moved to the alternative-specific variable position as well. I think the issue may lie in the na.action function argument, but I can't find any documentation on this argument beyond the basic documentation tag:
na.action: a function which indicates what should happen when the data contains NAs
There appear to be no examples showing how this term is used differently and what the results are. There's a related unanswered question here.
There appear to be a few things going on.
I am not quite sure how na.action = na.omit works under the hood, but it sounds to me like it will drop the entire row. I always find it better to do this explicitly.
When you drop the entire row, you are left with choice occasions where no choice was made: for every individual who chose the beach, none of the remaining alternatives is marked as chosen. This is not going to work. Remember, we are working with logit type probabilities. Furthermore, if no choice is made, no information is gained, so we need to drop these choice occasions entirely. Doing these two steps in combination, I am able to run the model you propose.
Here is a commented working example:
library(mlogit)
# Read in the data
data("Fishing", package = "mlogit")
# Set price for the beach option to NA
Fishing$price.beach <- NA
# Scale income
Fishing$income <- Fishing$income / 10000
# Turn into 'mlogit' data
fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
# Explicitly drop the alts with NA in price
fish <- fish[fish$alt != "beach", ]
# Dropping all NA also means that we now have choice occasions where no choice
# was made and we need to get rid of these as well
fish$choice_made <- rep(colSums(matrix(fish$mode, nrow = 3)), each = 3)
fish <- fish[fish$choice_made == 1, ]
fish <- mlogit.data(fish, shape = "long", alt.var = "alt", choice = "mode")
# Run an MNL model
mnl <- mlogit(mode ~ catch + price | income, data = fish)
summary(mnl)
In general, when working with these models, I find it very useful to always make all data transformations before running a model rather than rely on functions such as na.action.
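
The two dropping steps can also be sketched in base R on a toy long-format data frame, without relying on every choice occasion having the same number of rows. The data below are made up for illustration, and using ave() for the per-occasion count is a swapped-in alternative to the matrix/colSums approach, not the code from the example above.

```r
# Toy long-format choice data: 3 choice occasions, 3 alternatives each
long <- data.frame(
  chid  = rep(1:3, each = 3),
  alt   = rep(c("beach", "boat", "pier"), times = 3),
  mode  = c(TRUE, FALSE, FALSE,  FALSE, TRUE, FALSE,  FALSE, FALSE, TRUE),
  price = c(NA, 10, 12,  NA, 20, 22,  NA, 30, 32)
)

# Step 1: explicitly drop the alternatives whose price is undefined
long <- long[!is.na(long$price), ]

# Step 2: count the chosen alternatives left in each choice occasion and
# keep only the occasions that still contain exactly one choice
n_chosen <- ave(as.integer(long$mode), long$chid, FUN = sum)
long <- long[n_chosen == 1, ]

long
```

Here occasion 1 chose the beach, so it disappears entirely, while occasions 2 and 3 keep their two remaining alternatives.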

"Error in if (!all(pars < 1e-06)) pars[pars < 1e-06] <- 0" error in a model with depmixS4

Problem
I want to run a latent class analysis with the depmixS4 package in R. The problem appears when I try to fit a model with only one class (or state, in depmixS4 terms). When I try to fit the model to a dataset of 6000 cases I get the following error. However, with 5000 cases there is no problem.
Error in if (!all(pars < 1e-06)) pars[pars < 1e-06] <- 0 :
missing value where TRUE/FALSE needed
Where is the problem? Could someone help me understand why this error occurs?
A reproducible example
CASE A (n = 6000)
The same thing happens with random variables. To have a reproducible example, I first generate a dataset (n = 6000) with two random variables (a and b), each with two possible values (0 and 1).
library(depmixS4)
#> Loading required package: nnet
#> Loading required package: MASS
#> Loading required package: Rsolnp
a <- sample(0:1, size = 6000, replace = T)
b <- sample(0:1, size = 6000, replace = T)
foo_large <- data.frame(a,b)
set.seed(123)
mod1 <- mix(response = list(a~1, b~1),
data=foo_large, # the dataset to use
nstates=1, # the number of latent classes
family=list(multinomial("identity"),multinomial("identity")))
fmod1 <- fit(mod1, verbose=TRUE)
#> Error in if (!all(pars < 1e-06)) pars[pars < 1e-06] <- 0: missing value where TRUE/FALSE needed
CASE B (n = 5000)
However, with a dataset (n = 5000) of two random variables with the same characteristics as the previous ones, there is no error.
library(depmixS4)
#> Loading required package: nnet
#> Loading required package: MASS
#> Loading required package: Rsolnp
c <- sample(0:1, size = 5000, replace = T)
d <- sample(0:1, size = 5000, replace = T)
foo_short <- data.frame(c,d)
set.seed(123)
mod1 <- mix(response = list(c~1, d~1),
data=foo_short, # the dataset to use
nstates=1, # the number of latent classes
family=list(multinomial("identity"),multinomial("identity")))
fmod1 <- depmixS4::fit(mod1, verbose=TRUE)
#> iteration 0 logLik: -6928.943
#> converged at iteration 1 with logLik: -6928.943
I did a bit of digging and the error seems due to the way depmixS4 provides random starting values to initialize the EM algorithm (it generates random probabilities for class membership with a Dirichlet distribution and the code we use to draw from this distribution doesn't work well for a 1-dimensional Dirichlet). We'll fix this in an upcoming release. For now, you can run the EM without random starting values by using:
fmod1 <- fit(mod1, emcontrol=em.control(random.start=FALSE), verbose=TRUE)
This works in both your examples.
Note that the issue is not due to the difference in the number of observations (n=5000 versus n=6000). That the code converged for n=5000 was a lucky coincidence of using set.seed(123). After deleting this line you will most likely get the same error as for n=6000. Conversely, you can coincidentally get the n=6000 case working if you use set.seed(1234).
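
The message itself comes from `if` receiving NA instead of TRUE or FALSE. A minimal sketch, assuming (this is not confirmed from the depmixS4 source) that the random start produced NaN parameter values. The names pars and zero_small are hypothetical.

```r
pars <- c(NaN)  # assumed: a starting value that failed to initialise

# all(NaN < 1e-06) is NA, so the condition of `if` is NA and R stops
res <- tryCatch(
  {
    if (!all(pars < 1e-06)) pars[pars < 1e-06] <- 0
    "ok"
  },
  error = function(e) e
)
inherits(res, "error")
conditionMessage(res)

# A guarded version treats NA comparisons as "not small" and does not fail
zero_small <- function(p) {
  small <- !is.na(p) & p < 1e-06
  p[small] <- 0
  p
}
zero_small(c(NaN, 1e-09, 0.5))
```

This only explains the symptom. The practical fix remains the one given above: fitting with random.start = FALSE until the package is updated.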
