Error in generating a Poisson point pattern with calculated lambda - r

I have a two-dimensional point pattern (no marks) and I am trying to test for clustering in the presence of spatial inhomogeneity, using envelopes and the inhomogeneous pair correlation function. I am estimating an inhomogeneous intensity function for the data using the density.ppp function. Here is some sample data:
x y
1 533.03 411.58
2 468.39 622.92
3 402.86 530.94
4 427.13 616.81
5 495.20 680.62
6 566.61 598.99
7 799.03 585.16
8 1060.09 544.23
9 144.66 747.40
10 138.14 752.92
11 449.49 839.15
12 756.45 713.72
13 741.01 728.41
14 760.22 740.28
15 802.34 756.21
16 799.04 764.89
17 773.81 771.97
18 768.41 720.07
19 746.14 754.11
20 815.40 765.14
There are ~1700 data points overall.
Here is my code:
library("spatstat")
WT <- read.csv("Test.csv")
colnames(WT) <- c("x","y")
#determine bounding window
win <- ripras(WT)
unitname(win) <- c("micrometer")
#convert to ppp data class
WT.ppp <- as.ppp(WT, win)
plot(WT.ppp)
#estimate intensity function using cross validation
I <- density.ppp(WT.ppp,sigma=bw.diggle(WT.ppp),adjust=0.3,kernal="epanechnikov")
plot(I)
#predetermined r values for PCF
radius <- seq(from = 0, to = 50, by = 0.5)
#use envelopes to test the null hypothesis (i.e. an inhomogeneous Poisson process)
PCF_envelopes <- envelope(WT.ppp,divisor="d", pcfinhom,r = radius,nsim=10,simulate=expression(rpoispp(I)) )
When I run rpoispp(I), I get the following error:
Error in sample.int(npix, size = ni, replace = TRUE, prob = lpix) :
negative probability
I can't seem to figure out what the issue is. Any suggestions?
Thanks for your help!

This is happening because the image I contains some negative values, probably very small in magnitude but negative. You can check that by computing range(I), min(I), or any(I < 0).
The help for density.ppp says that the result may contain negative values (very small ones) due to numerical error. To remove these, you need to set positive=TRUE in the call to density.ppp.
By the way, the argument kernel has been mis-spelt in the code above (kernal). Also, the vector r is too coarsely spaced; you would be better off leaving this argument unspecified. And you don't need to type density.ppp, just density.
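A minimal sketch pulling those fixes together (my paraphrase of the advice, reusing WT.ppp from the question):
# clipped density estimate; positive=TRUE removes tiny negative values
I <- density(WT.ppp, sigma=bw.diggle(WT.ppp), adjust=0.3,
             kernel="epanechnikov", positive=TRUE)
range(I)  # minimum should now be >= 0
# envelopes against the inhomogeneous Poisson null, with default r spacing
PCF_envelopes <- envelope(WT.ppp, pcfinhom, divisor="d",
                          nsim=10, simulate=expression(rpoispp(I)))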

Related

Is there a correction I can apply to negative values within a probability matrix produced by matexpo {ape}?

I'm simulating discrete character data using the function rTraitDisc {ape} in R with a variety of model matrices. I've not encountered any issues with scaling when all state changes are possible. However, when I supply an ordered model with 8 or more possible states, the function breaks down and returns the error shown below:
## library
library(ape)
## read in tree
data("bird.orders")
## build model
model.matrix <- matrix(c(0,0.1,0,0,0,0,0,0,
                         0.1,0,0.1,0,0,0,0,0,
                         0,0.1,0,0.1,0,0,0,0,
                         0,0,0.1,0,0.1,0,0,0,
                         0,0,0,0.1,0,0.1,0,0,
                         0,0,0,0,0.1,0,0.1,0,
                         0,0,0,0,0,0.1,0,0.1,
                         0,0,0,0,0,0,0.1,0), 8)
## run function
rTraitDisc(phy = bird.orders, model = model.matrix)
Error message:
Error in sample.int(k, size = 1, FALSE, prob = p) : negative probability
Having dug a little deeper, it seems that when there are 8 or more states but only one possible transition (e.g. if the ancestral state is 0, only a transition to state 1 should be possible in an ordered matrix), the function matexpo produces a probability matrix with negative values for the shortest branch of the tree (0.5). As these probabilities are used by sample.int as the "prob" argument, the negative probabilities cause the function to break down.
## get number of states
k <- ncol(model.matrix)
## get equilibrium relative frequencies
freq <- rep(1/k, k)
## match number of elements in model
freq <- rep(freq, each = k)
## get Q matrix
Q <- model.matrix * freq
diag(Q) <- 0
diag(Q) <- -rowSums(Q)
## get minimum edge length
min.el <- min(bird.orders$edge.length)
## run matexpo
matexpo(Q*min.el)
How do I deal with these negative values in this context? Is there a correction I can/should apply?
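One common workaround (a hedged sketch of mine, not an official ape fix) is to clip the tiny negative entries to zero and renormalise each row before it is used as a probability vector:
P <- matexpo(Q * min.el)
P[P < 0] <- 0        # clamp small negative entries caused by numerical error
P <- P / rowSums(P)  # renormalise so each row sums to 1
Whether this is defensible depends on how large the negative values are; if they are not negligible, a more accurate matrix exponential (e.g. expm::expm) may be the safer route.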

Maximum pseudo-likelihood estimator for soft-core point process

I am trying to fit a soft-core point process model to a point pattern using maximum pseudo-likelihood. I followed the instructions given in this paper by Baddeley and Turner.
Here is the R code I came up with:
library(deldir)
library(tidyverse)
library(fields)
#MPLE
# irregular parameter k
k <- 0.4
## Generate a 50 x 50 grid of dummy points; "RA" and "DE" are the x and y coordinates
## (ramin/ramax/demin/demax are taken to be the bounds of the observation window)
dum.x <- seq(ramin, ramax, length = 50)
dum.y <- seq(demin, demax, length = 50)
dum <- expand.grid(dum.x, dum.y)
colnames(dum) <- c("RA", "DE")
## Combine with data and specify which is data point and which is dummy, X is the point pattern to be fitted
bind.x <- bind_rows(X, dum) %>%
  mutate(Ind = c(rep(1, nrow(X)), rep(0, nrow(dum))))
## Calculate Quadrature weights using Voronoi cell area
w <- deldir(bind.x$RA, bind.x$DE)$summary$dir.area
## Response
y <- bind.x$Ind/w
# the sum of distances between all pairs of points (the sufficient statistics)
tmp <- cbind(bind.x$RA, bind.x$DE)
t1 <- rdist(tmp)^(-2/k)
t1[t1 == Inf] <- 0
t1 <- rowSums(t1)
t <- -t1
# fit the model using quasipoisson regression
fit <- glm(y ~ t, family = quasipoisson, weights = w)
However, the fitted coefficient for t is negative, which is obviously not a correct value for a soft-core point process. Moreover, my point pattern was actually simulated from a soft-core process, so a negative fitted parameter makes no sense. I have tried my best to find a bug in the code but cannot seem to find one. The only potential issue I see is that my sufficient statistic is extremely large (on the order of 10^14), which I fear may cause numerical issues. But the statistic is large because my observation window spans a very small unit and the average distance between a pair of points is around 0.006, so a sufficient statistic based on inverse distances will certainly be very large, and my intuition tells me that this alone should not cause a numerical problem and flip the sign of the fitted parameter.
Can anybody help and check if my code is correct? Thanks very much!
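One way to sanity-check the hand-rolled fit (my suggestion, not something from the question) is to compare it against spatstat's built-in maximum pseudo-likelihood fit of the same model, since ppm supports the soft-core interaction directly:
library(spatstat)
# X.ppp is assumed to be the same pattern X as a ppp object;
# kappa is the soft-core irregularity parameter, fixed at 0.4 as above
fit.ppm <- ppm(X.ppp ~ 1, Softcore(kappa = 0.4))
coef(fit.ppm)
If ppm recovers a sensible interaction coefficient on the same data, the discrepancy likely lies in the quadrature weights or in the scaling of the hand-computed sufficient statistic.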

Removing Multivariate Outliers With mvoutlier

Problem
I have a dataframe composed of > 5 variables at any time and am trying to run K-Means on it. Because K-Means is greatly affected by outliers, I have spent a few hours looking for how to calculate and remove multivariate outliers. Most examples demonstrated use only 2 variables.
Possible Solutions Explored
mvoutlier - Kind user here noted that mvoutlier may be what I need.
Another Outlier Detection Method - Poster here commented with a mix of R functions to generate an ordered list of outliers.
Issues thus Far
Regarding mvoutlier, I was unable to generate a result because it reported that my dataset contained negative values and could not proceed. I'm not sure how to shift my data to be all positive, since the negatives are meaningful in the set I am working with.
Regarding Another Outlier Detection Method, I was able to come up with a list of outliers, but am unsure how to exclude them from the current data set. Also, I realize those calculations are done after K-Means, so I will probably need to apply the math before doing K-Means.
Minimal Verifiable Example
Unfortunately, the dataset I'm using is off-limits to be shown to anyone, so you'll need any random data set with more than 3 variables. The code below was converted from the Another Outlier Detection Method post to work with my data; it should run on any such data set, provided there are enough rows to support 5 cluster centers.
clusterAmount <- 5
cluster <- kmeans(dataFrame, centers = clusterAmount, nstart = 20)
centers <- cluster$centers[cluster$cluster, ]
# Euclidean distance of each point to its assigned cluster centre
distances <- sqrt(rowSums((dataFrame - centers)^2))
m <- tapply(distances, cluster$cluster, mean)
d <- distances/(m[cluster$cluster])
# 1% outliers
outliers <- d[order(d, decreasing = TRUE)][1:(nrow(dataFrame) * .01)]
Output: a list of outliers ordered by their distance from the centre of the cluster they belong to, I believe. The issue then is pairing these results with the corresponding rows in the data frame and removing them, so I can start my K-Means procedure. (Note: while in the example I ran K-Means before removing outliers, I will make sure to take the necessary steps and remove the outliers before K-Means in the final solution.)
Question
With Another Outlier Detection Method example in place, how do I pair the results with the information in my current data frame to exclude those rows before doing K-Means?
I don't know if this is exactly helpful, but if your data are multivariate normal you may want to try out a Wilks (1963) based method. Wilks showed that the Mahalanobis distances of multivariate normal data follow a Beta distribution. We can take advantage of this (the iris Sepal data are used as an example):
test.dat <- iris[, c(1,2)]  # Sepal.Length and Sepal.Width
Wilks.function <- function(dat){
  n <- nrow(dat)
  p <- ncol(dat)
  # beta distribution
  u <- n * mahalanobis(dat, center = colMeans(dat), cov = cov(dat))/(n-1)^2
  w <- 1 - u
  F.stat <- ((n-p-1)/p) * (1/w-1) # computing F statistic
  p <- 1 - round( pf(F.stat, p, n-p-1), 3) # p value for each row
  cbind(w, F.stat, p)
}
plot(test.dat,
     col = "blue",
     pch = c(15,16,17)[as.numeric(iris$Species)])
dat.rows <- Wilks.function(test.dat); head(dat.rows)
# w F.stat p
#[1,] 0.9888813 0.8264127 0.440
#[2,] 0.9907488 0.6863139 0.505
#[3,] 0.9869330 0.9731436 0.380
#[4,] 0.9847254 1.1400985 0.323
#[5,] 0.9843166 1.1710961 0.313
#[6,] 0.9740961 1.9545687 0.145
Then we can simply find which rows of our multivariate data are significantly different from the beta distribution.
outliers <- which(dat.rows[,"p"] < 0.05)
points(test.dat[outliers,],
       col = "red",
       pch = c(15,16,17)[as.numeric(iris$Species[outliers])])
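To connect this back to the question (a minimal sketch using the objects above): outliers holds row indices of the original data, so negative indexing drops those rows before clustering:
clean.dat <- test.dat[-outliers, ]  # assumes at least one outlier was flagged
cluster <- kmeans(clean.dat, centers = 5, nstart = 20)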

Error in NbClust: not enough objects to cluster

I am trying to use the NbClust method in R to determine the best number of clusters in a cluster analysis, following the approach in a book published by Manning.
However, I get an error message saying:
Error in hclust(md, method = "average"): must have n >= 2 objects to
cluster.
The hclust call itself appears to work, however. I therefore assume the problem (as the error message also suggests) is that NbClust ends up trying to cluster groups containing only one object.
Here is my code:
mydata = read.table("PLR_2016_WM_55_5_Familienstand_aufbereitet.csv", skip = 0, sep = ";", header = TRUE)
mydata <- mydata[-1] # Drop the first column (an integer index)
data.transformed <- t(mydata) # Transformation of matrix
data.scale <- scale(data.transformed) # Scaling of table
data.dist <- dist(data.scale) # Calculates distances between points
fit.average <- hclust(data.dist, method = "average")
plot(fit.average, hang = -1, cex = .8, main = "Average Linkage Clustering")
library(NbClust)
nc <- NbClust(data.scale, distance="euclidean",
              min.nc=2, max.nc=15, method="average")
I found a similar problem here, but I was not able to adapt the code.
There are some problems in your dataset.
The last 4 rows do not contain data and must be deleted.
mydata <- read.table("PLR_2016_WM_55_5_Familienstand_aufbereitet.csv", skip = 0, sep = ";", header = TRUE)
mydata <- mydata[1:(nrow(mydata)-4),]
mydata[,1] <- as.numeric(mydata[,1])
Now rescale the dataset:
data.transformed <- t(mydata) # Transformation of matrix
data.scale <- scale(data.transformed) # Scaling of table
For some reason data.scale is not a full-rank matrix:
dim(data.scale)
# [1] 72 447
qr(data.scale)$rank
# [1] 71
Hence, we delete a row from data.scale and transpose it:
data.scale <- t(data.scale[-72,])
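More generally, one could locate the dependent row(s) with a pivoted QR decomposition instead of hard-coding row 72 (a hedged sketch of mine, not part of the original answer; here data.scale is the 72 x 447 matrix before the transpose):
q <- qr(t(data.scale))                  # columns of t(data.scale) are the rows of data.scale
dependent <- q$pivot[-seq_len(q$rank)]  # indices of rows flagged as linearly dependent
data.scale <- t(data.scale[-dependent, ])  # assumes at least one dependent row, as here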
Now the dataset is ready for NbClust.
library(NbClust)
nc <- NbClust(data=data.scale, distance="euclidean",
              min.nc=2, max.nc=15, method="average")
The output is
[1] "Frey index : No clustering structure in this data set"
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 8 proposed 2 as the best number of clusters
* 4 proposed 3 as the best number of clusters
* 8 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 1 proposed 11 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************

Chi-squared goodness of fit for a geometric distribution

As an assignment I had to develop an algorithm and generate samples from a given geometric distribution with PMF

P(X = n) = p (1 - p)^(n-1), n = 1, 2, ..., with p = 0.3.

Using the inverse transform method, I came up with the following expression for generating the values:

X = ceiling( log(U) / log(1 - p) )

where U represents a value, or n values depending on the size of the sample, drawn from a Unif(0,1) distribution, and p is 0.3 as stated in the PMF above.
I have the algorithm, the implementation in R, and I have already generated QQ plots to visually assess the fit of the empirical values to the theoretical ones (generated with R), i.e., whether the generated sample indeed follows the geometric distribution.
Now I wanted to submit the generated sample to a goodness of fit test, namely the Chi-square, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge you to go back and ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were, say, 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning: dgeom and friends go from x=0, not x=1; while you can shift the inputs and outputs to the R functions, it's much easier if you subtract 1 from all your geometric values and test that. I will proceed as if your sample has had 1 subtracted so that it goes from 0.
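In code (a one-line illustration; z is a hypothetical name for your 1-based sample):
y <- z - 1  # support becomes 0, 1, 2, ..., matching dgeom and rgeom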
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
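In code (my one-liner; these are just the expected counts rescaled, matching the p argument passed to chisq.test below):
probs <- expec/1000  # cell probabilities, summing to 1 by construction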
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))  # values of 15+ fall outside the levels and are counted in <NA>
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
> (pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
Now the case for unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly; you have to do it the first way. You estimate the parameter from the data (by maximum likelihood or minimum chi-square), then test as above, but with one fewer degree of freedom because of the estimated parameter.
See the example of doing a chi-square for a Poisson with estimated parameter here; the geometric follows much the same approach as above, with the adjustments as at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
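A hedged sketch of that estimated-parameter version, reusing x and the binning from above (my illustration, not code from the linked example):
phat <- 1/(1 + mean(y))                  # ML estimate of p for the 0-based geometric
expec2 <- dgeom(0:14, phat) * 1000
expec2 <- c(expec2, 1000 - sum(expec2))  # lump 15+ into one bin, as before
chisqstat2 <- sum((x - expec2)^2/expec2)
pchisq(chisqstat2, df = 16 - 1 - 1, lower.tail = FALSE)  # one extra df lost to estimating p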
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$value, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
           simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
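To actually see the test, call summary() on the returned object (a brief usage note, assuming the G.fit object from the line above):
library(vcd)    # provides goodfit
summary(G.fit)  # prints a likelihood-ratio goodness-of-fit test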
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of questions with upvoted answers, so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question ---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
  p <- 0.3
  u <- runif(nvals)
  ceiling(log(u)/log(1-p))
}
for which I want to test the distribution, specifically whether it indeed follows a geometric distribution. I want to generate a QQ plot but have no idea how to.
--- reposted answer ---
A QQ-plot should be a straight line when compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the function, which essentially compares their inverse ECDFs at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100), prob=0.3), sim.res,
       main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through the 25th and 75th percentile points for each distribution. (I added a jittering feature to this to get a better idea where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500), prob=0.3)), jitter(sim.res),
       main = expression("Q-Q plot for" ~~ {G}[n == 500]),
       ylim = c(0, max(qgeom(ppoints(500), prob=0.3), sim.res)),
       xlim = c(0, max(qgeom(ppoints(500), prob=0.3), sim.res)))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
       prob = c(0.25, 0.75), col = "red")
