Obtain values from simulated mppm in spatstat - r

I have obtained an mppm object by fitting a model to several independent datasets with the mppm function from the R package spatstat. How can I generate simulated realisations of this model and obtain the x, y, and marks attributes of the simulations?
I fitted my model as follows:
data <- listof(NMJ1,NMJ2,NMJ3)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~marks*sqrt(x^2+y^2), data)
where NMJ1, NMJ2, and NMJ3 are marked ppp and are independent realisations of the same experiment.
sim <- simulate(model) generates simulated realisations of this model, and plot(sim, axes = TRUE) plots them. sim itself is a hyperframe object:
> sim
Hyperframe:
Sim1
1 (ppp)
2 (ppp)
3 (ppp)
How can I access the values (x, y, and marks) in this hyperframe? My goal is to generate a large number of independent realisations from my model and to use the simulated values for another task. Is there a practical way to retrieve and save these values?

Since you say you want to simulate this many times I have here shown the code
with two simulations (rather than one as you have in the question):
library(spatstat)
data <- list(amacrine, amacrine, amacrine)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~marks*sqrt(x^2+y^2), data)
sim <- simulate(model, nsim = 2)
#> Generating simulated realisations of 3 models..
#> 1, 2, 3.
Now sim is a hyperframe with 2 columns (one for each simulation). Each
column is a list of 3 point patterns. To get the three sets of coordinates
and marks for the first simulation use as.data.frame on each point pattern:
co1 <- lapply(sim$Sim1, as.data.frame)
Then co1 is a list of length 3, and we can print out the first few
coordinates with the head() command, e.g. the coordinates of the third
point pattern:
head(co1[[3]])
#> x y marks
#> 1 0.4942587 0.7889985 off
#> 2 0.6987270 0.7637359 on
#> 3 0.3926415 0.6819965 on
#> 4 0.7982686 0.9060733 off
#> 5 1.3507722 0.9731363 on
#> 6 0.6450985 0.6924126 on
We can extract the coordinates and marks for each simulation by another lapply that
runs over every simulation (in this case 2):
co <- lapply(sim, function(x) lapply(x, as.data.frame))
Now co is a list with 2 elements, and each element is a list of 3 sets of
coordinates:
length(co)
#> [1] 2
length(co[[2]])
#> [1] 3
head(co[[2]][[3]])
#> x y marks
#> 1 0.1660580 0.04180501 on
#> 2 0.7840025 0.71727782 on
#> 3 1.2011733 0.17109112 on
#> 4 1.0429867 0.49284639 on
#> 5 1.1411869 0.86711072 off
#> 6 1.0375942 0.06427601 on
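The question also asks how to save the values. One possible way, sketched here under assumptions, is to flatten the nested list into a single long data frame and write it to a CSV file. To keep the block runnable on its own, co is replaced by a small stand-in with the same shape (a list of simulations, each a list of data frames with x, y and marks); make_df, flat and out_file are hypothetical names used only for this illustration.

```r
# Stand-in for `co` from above (assumption): 2 simulations x 3 patterns,
# each a data frame with columns x, y, marks.
set.seed(42)
make_df <- function(n) {
  data.frame(x = runif(n), y = runif(n),
             marks = factor(sample(c("on", "off"), n, replace = TRUE)))
}
co <- list(Sim1 = list(make_df(5), make_df(6), make_df(4)),
           Sim2 = list(make_df(3), make_df(7), make_df(5)))

# Stack everything into one data frame, recording which simulation and
# which point pattern each row came from.
flat <- do.call(rbind, lapply(names(co), function(s) {
  do.call(rbind, lapply(seq_along(co[[s]]), function(j) {
    cbind(sim = s, pattern = j, co[[s]][[j]])
  }))
}))

# Write to a temporary CSV file; replace out_file with a real path.
out_file <- tempfile(fileext = ".csv")
write.csv(flat, out_file, row.names = FALSE)
```

With the real co, only the stacking and write.csv steps are needed. A single long table with sim and pattern columns tends to be easier to reload and filter than many separate files.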

Related

R - polynomial regression issue - model limited to finite number of output values

I'm trying to calculate point slopes from a series of x,y data. Because some of the x data repeats (...8, 12, 12, 16...) there will be a division by zero issue when using slope = (y2-y1)/(x2-x1).
My solution is to create a polynomial regression equation of the data, then plug a new set of x values (xx) into the equation that monotonically increase between the limits of x. This eliminates the problem of equal x data points. As a result, (x) and (xx) have the same limits, but (xx) is always longer in length.
The problem I am having is that the fitted values for xx are limited to the length of x. When I try to use the polynomial equation with (xx) that is 20 in length, the fitted yy results provide data for the first 10 points then gives NA for the next 10 points. What is wrong here?
x <- c(1,2,2,5,8,12,12,16,17,20)
y <- c(2,4,5,6,8,11,12,15,16,20)
df <- data.frame(x,y)
my_mod <- lm(y ~ poly(x,2,raw=T), data=df) # This creates the polynomial equation
xx <- x[1]:x[length(x)] # Creates monotonically increasing x using boundaries of original x
yy <- fitted(my_mod)[order(xx)]
plot(x,y)
lines(xx,yy)
If you look at
fitted(my_mod)
It outputs:
# 1 2 3 4 5 6 7 8 9 10
#3.241032 3.846112 3.846112 5.831986 8.073808 11.461047 11.461047 15.303305 16.334967 19.600584
Meaning the name of the output matches the position of x, not the value of x, so fitted(my_mod)[order(xx)] doesn't quite make sense.
You want to use predict here:
yy <- predict(my_mod, newdata = data.frame(x = xx))
plot(xx, yy)
# 1 2 3 4 5 6 7 8 9 10
# 3.241032 3.846112 4.479631 5.141589 5.831986 6.550821 7.298095 8.073808 8.877959 9.710550
# 11 12 13 14 15 16 17 18 19 20
# 10.571579 11.461047 12.378953 13.325299 14.300083 15.303305 16.334967 17.395067 18.483606 19.600584
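Since the original goal was point slopes, here is a minimal sketch of how they could be computed once predict is used. It shows two options: finite differences on the predicted curve, and the analytic derivative of the fitted quadratic. The names slope_fd and slope_exact are hypothetical, introduced only for this illustration.

```r
x <- c(1, 2, 2, 5, 8, 12, 12, 16, 17, 20)
y <- c(2, 4, 5, 6, 8, 11, 12, 15, 16, 20)
df <- data.frame(x, y)
my_mod <- lm(y ~ poly(x, 2, raw = TRUE), data = df)

# Strictly increasing x values spanning the range of the original x
xx <- seq(min(x), max(x))
yy <- predict(my_mod, newdata = data.frame(x = xx))

# Option 1: finite-difference slopes between consecutive grid points
slope_fd <- diff(yy) / diff(xx)

# Option 2: analytic derivative of b0 + b1*x + b2*x^2, i.e. b1 + 2*b2*x
b <- coef(my_mod)
slope_exact <- b[[2]] + 2 * b[[3]] * xx
```

Because xx has no repeated values, diff(xx) is never zero, which avoids the division-by-zero problem described in the question.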

Creating a for-loop to store LDA misclassification rates

I have a dataset of 104 samples (2 classes) and 182 variables. I am to carry out LDA on the dataset. My strategy involves first carrying out PCA in order to reduce dimensionality; this leaves me with 104 PCs. Now, what I want to do is carry out LDA on the PCs. I want to carry it out first with the number of PCs equal to 1, and store the misclassification rates in a data frame object. I will then do the same for 2, 3 and so on until ~50 PCs; the number is not important. I have created a for-loop to try to solve this, but I end up with a data frame whose only row is the final value I chose for the number of PCs. Here is the code I have so far:
# required packages
library(MASS)
library(class)
library(tidyverse)
# reading in and cleaning data
og_data <- read.csv("data.csv")
og_data <- og_data[, -1]
og_data$tumour <- unclass(as.factor(og_data$tumour))
# standardizing
st_data <- as.data.frame(cbind(og_data[, 1], scale(og_data[, -1])))
colnames(st_data)[1] <- "tumour"
# PCA for dimension reduction
k=10 # this is for the for-loop
grouping <- c(rep(1, 62), rep(2, 42)) # a vector denoting the true class of the samples
pca <- prcomp(st_data[, -1])
df_misclassification <- tibble(i=as.numeric(),
misclassification_rate_1=as.numeric(),
misclassification_rate_2=as.numeric())
for (i in k){
a <- as.data.frame(pca$x[, 1:i])
b <- lda(a, grouping=grouping, CV=TRUE)
c <- table(list(predicted=b$class, observed=grouping)) # confusion matrix
d <- t(as.data.frame(diag(c) / rowSums(c))) # misclassification rate for each class
df_misclassification <- df_misclassification %>%
add_row(i=i,
misclassification_rate_1=d[, 1],
misclassification_rate_2=d[, 2])
}
Running the above for k=10 leaves me with the following data frame:
# A tibble: 1 x 3
i misclassification_rate_1 misclassification_rate_2
<dbl> <dbl> <dbl>
1 10 0.952 0.951
I would like the table to have 10 rows, one for each number of PCs used. There is some overwriting in the for-loop but I have no idea how to fix this. Any help would be much appreciated. Thank you.
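My for-loop was wrong. Because k is a single number, for (i in k) runs exactly once with i = 10, so only one row is added. It should have been for (i in 1:k).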
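A minimal sketch of a loop that visits every number of PCs is shown below. It is not the original data: data.csv is not available here, so two iris species stand in for the two tumour classes (an assumption), and k is limited to the 4 available PCs. The per-class rate is computed as 1 minus the fraction correctly classified among the observed members of each class, which is one common definition and differs from the expression in the question.

```r
library(MASS)

# Stand-in data (assumption): two iris species replace the two classes
two_class <- droplevels(iris[iris$Species != "setosa", ])
grouping <- two_class$Species
pca <- prcomp(two_class[, 1:4], scale. = TRUE)

k <- 4  # maximum number of PCs to try

# seq_len(k) yields 1, 2, ..., k, so the body runs once per number of PCs
rows <- lapply(seq_len(k), function(i) {
  pcs <- pca$x[, 1:i, drop = FALSE]           # keep a matrix even when i = 1
  fit <- lda(pcs, grouping = grouping, CV = TRUE)
  confusion <- table(predicted = fit$class, observed = grouping)
  rate <- 1 - diag(confusion) / colSums(confusion)
  data.frame(i = i,
             misclassification_rate_1 = rate[[1]],
             misclassification_rate_2 = rate[[2]])
})
df_misclassification <- do.call(rbind, rows)
df_misclassification
```

Collecting the rows in a list and binding them once at the end avoids growing the data frame inside the loop; add_row inside for (i in 1:k) would work as well.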

Run DBSCAN against grouped coordinates

I'm attempting to run DBSCAN against some grouped coordinates in order to get sub-clusters. I've clustered some spatial data and I'd now like to further divide these regions according to the density of points within them. I think DBSCAN is probably the best way to do this.
My issue is that I can't figure out how to run DBSCAN against each cluster separately and then output the cluster assignment as a new column. Here's some sample data:
library(dplyr)
library(dbscan)
# Create sample data
df <- data.frame(
"ID"=1:200,
"X"=c(1.0083,1.3166,1.3072,1.1311,1.2984,1.2842,1.1856,1.3451,1.1932,1.0926,1.2464,1.3197,1.2331,1.2996,1.3482,
1.1944,1.2800,1.3051,1.4471,0.9068,1.3150,1.1846,1.0232,1.0005,1.0640,1.3177,1.1015,0.9598,1.0354,1.2203,
0.8388,0.8655,1.3387,1.0133,1.0106,1.1753,1.3200,1.0139,1.1511,1.3508,1.2747,1.3681,1.1074,1.2735,1.2245,
0.9695,1.3250,1.0537,1.2020,1.3093,0.9268,1.3244,1.2626,1.3123,1.2819,1.1063,0.8759,1.0063,1.0173,1.0187,
1.2396,1.0241,1.2619,1.2682,1.0008,1.0827,1.3639,1.3099,1.0004,0.8886,1.2359,1.1370,1.2783,1.0803,1.1918,
1.1156,1.3313,1.1205,1.0776,1.3895,1.3559,0.8518,1.1315,1.3521,1.2281,1.2589,0.9974,1.1487,1.4204,0.9998,
1.0154,1.0098,0.8851,1.0252,0.9331,1.2197,1.0084,1.2303,1.2808,1.3125,0.5500,0.6694,0.3301,0.3787,0.6492,
0.6568,0.6773,0.3769,0.6237,0.7265,0.5509,0.3579,0.7201,0.2631,0.3881,0.7596,0.3343,0.7049,0.3430,0.2951,
0.5483,0.7699,0.3806,0.6555,0.2524,0.4030,0.6329,0.5006,0.2701,0.0822,0.5442,0.5233,0.7105,0.5660,0.3962,
0.3187,0.3143,0.5673,0.3731,0.7310,0.6376,0.4864,0.8865,0.3352,0.7540,0.0690,0.7983,0.6990,0.4090,0.5658,
0.5636,0.5420,0.7223,0.6146,0.5648,0.2711,0.3422,0.7214,0.2196,0.2848,0.6496,0.7907,0.7418,0.7825,0.4550,
0.4361,0.7417,0.2661,0.8978,0.7875,0.2343,0.3853,0.6874,0.7761,0.2905,0.6092,0.5329,0.6189,0.0684,0.5726,
0.5740,0.7060,0.4609,0.3568,0.7037,0.2874,0.6200,0.7149,0.5100,0.7059,0.2520,0.3105,0.6870,0.7888,0.3674,
0.6514,0.7271,0.6679,0.3752,0.7067),
"Y"=c(-1.2547,-1.1499,-1.1803,-1.0626,-1.2877,-1.1151,-1.0958,-1.1339,-1.0808,-1.5461,-1.0775,-1.1431,-1.0499,
-1.1521,-1.1675,-1.0963,-1.1407,-1.1916,-1.1229,-1.2297,-1.1308,-1.0341,-1.3071,-1.2370,-1.5043,-1.1154,
-1.5452,-1.0349,-1.5412,-1.0348,-1.3620,-1.3776,-1.1830,-1.2552,-1.2354,-1.0838,-1.1214,-1.2396,-1.4937,
-1.0793,-1.1857,-1.0679,-1.5425,-1.1633,-1.1620,-1.0838,-1.0750,-1.3493,-1.4155,-1.1354,-1.0615,-1.1494,
-1.1620,-1.1582,-1.1800,-1.5230,-1.3019,-1.2484,-1.5490,-1.2435,-1.0487,-1.2330,-1.1234,-1.0924,-1.0702,
-1.0446,-1.1077,-1.1144,-1.2170,-1.2715,-1.1537,-1.5077,-1.1305,-1.3396,-1.2107,-1.5458,-1.1482,-1.1224,
-1.3690,-1.2058,-1.1685,-1.3400,-1.5033,-1.2152,-1.3805,-1.1439,-1.5183,-1.4288,-1.1252,-1.2330,-1.2511,
-1.5429,-1.3333,-1.1851,-1.1367,-1.3952,-1.1240,-1.2113,-1.1632,-1.1965,-0.9917,-0.7416,-0.7729,-1.1279,
-0.9323,-0.9372,-0.7013,-1.1746,-0.9191,-0.9356,-0.7873,-1.1957,-0.9838,-0.5825,-1.0738,-0.9302,-0.7713,
-0.9407,-0.7774,-0.8160,-0.9861,-1.0440,-0.9896,-0.6478,-0.8865,-1.0601,-1.0640,-0.9898,-0.5989,-0.7375,
-0.7689,-0.9799,-0.9147,-1.1048,-0.9735,-0.8591,-0.7913,-1.0085,-0.7231,-0.9688,-0.9272,-0.9395,-0.9494,
-0.7859,-1.0817,-0.7262,-0.9915,-0.9329,-1.0953,-1.0425,-1.0806,-1.0132,-0.8514,-1.0785,-1.1109,-0.8542,
-1.0849,-0.9665,-0.5940,-0.6145,-0.7830,-0.9601,-0.8996,-0.7717,-0.7447,-1.0406,-1.0067,-0.5710,-0.9839,
-1.0594,-0.7069,-1.1202,-0.9705,-1.0100,-0.6377,-1.0632,-0.9450,-0.9163,-0.7865,-1.0090,-1.1005,-1.0049,
-0.8042,-1.0781,-0.6829,-0.5962,-1.0759,-0.7918,-0.9732,-0.7353,-0.5615,-1.2002,-0.9295,-0.9944,-1.1570,
-0.9524,-0.9257,-0.9360,-1.1328,-0.7661),
"cluster"=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2))
# How do you run DBSCAN against the points within each cluster?
I first thought I'd try to use the group_by function in dplyr but DBSCAN requires a data matrix input and group_by doesn't work for matrices.
matrix <- as.matrix(df[, -1])
set.seed(1234)
db = matrix %>%
group_by(cluster) %>%
dbscan(matrix, 0.4, 4)
#Error in UseMethod("group_by_") :
# no applicable method for 'group_by_' applied to an object of class "c('matrix', 'double', 'numeric')"
I've also tried using by() but get duplicate results for each cluster grouping, which isn't right:
by(data = df, INDICES = df$cluster, FUN = function(x) {
out <- dbscan(as.matrix(df[, c(2:3)]),eps=.0215,minPts=4)
})
#df$cluster: 1
#DBSCAN clustering for 200 objects.
#Parameters: eps = 0.0215, minPts = 4
#The clustering contains 10 cluster(s) and 138 noise points.
#
# 0 1 2 3 4 5 6 7 8 9 10
#138 11 12 4 5 8 2 4 8 4 4
#
#Available fields: cluster, eps, minPts
#--------------------------------------------------------------------------
#df$cluster: 2
#DBSCAN clustering for 200 objects.
#Parameters: eps = 0.0215, minPts = 4
#The clustering contains 10 cluster(s) and 138 noise points.
#
# 0 1 2 3 4 5 6 7 8 9 10
#138 11 12 4 5 8 2 4 8 4 4
#
#Available fields: cluster, eps, minPts
Can anyone point me in the right direction?
To be clear, dbscan::dbscan works fine on data.frame objects. You do not need to convert to a matrix. It returns an object that includes a cluster vector with one entry per record in your input. The issue is that dplyr exposes variables to other functions as individual vectors, rather than as data.frame or matrix objects. You are free to do something like:
df %>%
group_by(cluster) %>%
mutate(
dbscan_cluster = dbscan::dbscan(
data.frame(X, Y),
eps = 0.0215,
minPts = 4
)[["cluster"]]
)
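dplyr is not necessary; by also works. You just need to supply a function that operates on its argument (the subset passed in by by) rather than one that references the source object directly, which is why your attempt clustered all 200 points in every group. Your data must already be ordered by cluster, because unlist returns the results group by group.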
df$dbscan_cluster <- unlist(
by(
df,
INDICES = df$cluster,
function(x) dbscan::dbscan(x[,c(2,3)], eps = 0.0215, minPts = 4)[["cluster"]]
)
)
However, you can still get garbage results if you don't have a good way to pick your epsilon. You might consider using dbscan::optics instead.
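If the rows are not sorted by cluster, one alternative is to write each group's labels back through a logical index, so the original row order never matters. The following is a minimal sketch on hypothetical toy data (toy, and the eps and minPts values, are illustrative assumptions, not tuned recommendations).

```r
library(dbscan)

# Hypothetical toy data: two coarse groups, deliberately shuffled
set.seed(1)
toy <- data.frame(
  X = c(rnorm(40, 0, 0.05), rnorm(40, 1, 0.05)),
  Y = c(rnorm(40, 0, 0.05), rnorm(40, 1, 0.05)),
  cluster = rep(c(1, 2), each = 40)
)
toy <- toy[sample(nrow(toy)), ]

toy$dbscan_cluster <- NA_integer_
for (g in unique(toy$cluster)) {
  idx <- toy$cluster == g                      # rows belonging to group g
  pts <- as.matrix(toy[idx, c("X", "Y")])
  # Labels come back in the row order of pts, so they line up with idx
  toy$dbscan_cluster[idx] <- dbscan(pts, eps = 0.05, minPts = 4)$cluster
}
```

Note that the sub-cluster numbers restart within each group, so a point is identified by the pair (cluster, dbscan_cluster), with 0 meaning noise.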

Why does gstat.predict() function often return NaN values (GSTAT Package)? (R version 3.3.2, Windows 10)

I am trying to simulate a combination of two different random fields (yy1 and yy2, with different means and correlation lengths) separated by an irregular boundary, using the gstat package in R. I have attached a picture of my expected outcome. The code does not produce this output consistently: frequently at least one of yy1 and yy2 comes back as NaN, which results in the undesired output shown in the second image.
The key steps I used are:
1) Created two gstat objects with different means and psill (rf1 and rf2)
2) Created two computational grids (one for each random field) in the form of data frame with two variables “x” and “y” coordinates.
3) Predicted two random fields using unconditional simulation.
Any help in this regard would be highly appreciated.
Attachments: 2 images (link provided) and 1 R code
1) Expected Outcome
2) Undesired Outcome
library(gstat)
xy <- expand.grid(1:150, 1:200) # 150 x 200 grid is created in the form of a dataframe with x and y vectors
names(xy)<-c('x','y') # giving names to the variables
# creating gstat objects
rf1<-gstat(formula=z~1,locations=~x+y,dummy = T,beta=c(1,0,0), model=vgm(psill=0.025, range=5, model='Exp'), nmax=20) # dummy=T treats this as an unconditional simulation
rf2<-gstat(formula=z~1,locations=~x+y,dummy = T,beta=c(4,0,0), model=vgm(psill=0.025, range=10, model='Exp'), nmax=20) # dummy=T treats this as an unconditional simulation
# creating two computational grid
rows<-nrow(xy)
xy_shift <- expand.grid(60:90, 75:100)
names(xy_shift)<-c('x','y')
library(dplyr) # for antijoin
xy1<-xy[1:(rows/2),]
xy1<-anti_join(xy1, xy_shift, by = c("x","y")) # creating the irregular boundary
xy2<-rbind(xy[(rows/2+1):rows,],xy_shift)
library(sp)
yy1<- predict(rf1, newdata=xy1, nsim=1) # random field 1
yy2<- predict(rf2, newdata=xy2, nsim=1) # random field 2
rf1_label<-gl(1,length(yy1[,1]),labels="field1")
rf2_label<-gl(1,length(yy2[,1]),labels="field2")
yy1<-cbind(yy1,field=rf1_label)
yy2<-cbind(yy2,field=rf2_label)
yy<-rbind(yy1,yy2)
yyplot<-yy[,c(1,2,3)]
# plotting the field
gridded(yyplot) = ~x+y
spplot(obj=yyplot[1],scales=list(draw = TRUE))

R - Using nested dataframe to run function with different sets of parameters

I would like to create a wrapper for the Levenberg-Marquardt Nonlinear Least-Squares function nls.lm (minpack.lm library) similar to nls2 (nls2 library) to give a brute force method for evaluating the fit of a model to observed data.
The idea is to create a range of starting value combinations and either:
pass these to a function, then compare the function output to the observed data to create an R^2 value for each of the starting value combinations and run the nls.lm fitting with the best one of them.
or
run nls.lm on all combinations and select the best returned fit.
I wanted to do this without looping and after inspiration from here am trying to use nested dataframes, with one column for the parameter input list, one for the values returned by my function, one for the R^2 values, and one for the best fit models, something like:
df
# start_val fun_out R^2
# 1 {a=2,b=2} {22,24,26...} 0.8
# 2 {a=3,b=5} {35,38,41...} 0.6
This is the code I have so far:
require(dplyr);require(tidyr)
foo <- function(x,a,b) a*x^2+b # function I am fitting
x <- 1:10 # independent variable
y_obs <- foo(x,1.5,2.5) + rnorm(length(x),0,10) # observed data (dependent variable)
start_range <- data.frame(a=c(1,2),b=c(2,3)) # range of allowed starting points for fitting
reps <- 2 # number of starting points to generate
# Create a data frame of starting points
df<-as.data.frame(sapply(start_range, function(x) runif(reps,min=x[[1]],max=x[[2]]))) %>%
mutate(id=seq_len(reps)) %>% # fudge to make nest behave as I want
nest(1:ncol(start_range)) %>%
mutate(data=as.list(data)) %>%
as.data.frame()
df
# id data
# 1 1 1.316356, 2.662923
# 2 2 1.059356, 2.723081
I get stuck now trying to pass the parameters in data into the function foo(). I've tried using do.call(), and even with using constant parameters the following error appears:
mutate(df,y=do.call(foo,list(x,1,2)))
# Error: wrong result size (5), expected 2 or 1
Is there a way to create columns of a dataframe which contain lists directly without using nest()?
Also, when trying to create the list to pass to do.call() using the dataframe columns, how do you create a list where the first element is the vector x, the second is the parameter a and the third is the parameter b? The following splits the list down the column:
mutate(df,my_list=list(x,data))
# id data my_list
# 1 1 1.316356, 2.662923 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
# 2 2 1.059356, 2.723081 1.316356, 2.662923, 1.059356, 2.723081
Running nls2 with algorithm = "random-search", all = TRUE and the specified maxiter evaluates foo at maxiter random points and returns starting_fits, a set of "nls" class objects, one per randomly chosen starting value. nls2 does not optimise from these starting values; it only returns the "nls" object evaluated at each one, so nls itself is not run. Next, run nlsLM from each starting fit to get fits, a list of nlsLM fits, then summarise them in data (a data frame with one row per run) and show the row with the smallest RSS.
If we only want to pick the best starting value and just run nlsLM once from that then use the alternate code near the end.
library(nls2)
library(minpack.lm) # provides nlsLM
fo <- y_obs ~ foo(x, a, b)
starting_fits <- nls2(fo, algorithm = "random-search",
start = start_range, control = nls.control(maxiter = reps), all = TRUE)
fits <- lapply(starting_fits, function(fit) nlsLM(fo, start = coef(fit)))
data <- data.frame(RSS = sapply(fits, deviance), t(sapply(fits, coef)),
start = t(sapply(starting_fits, coef)))
# data$fits <- fits # optional to store each row's fitted object in that row
subset(data, RSS == min(RSS)) # minimum(s)
giving:
RSS a b start.a start.b
2 706.3956 1.396616 7.226525 1.681819 2.768374
R squared is used for linear regression. It is not valid for nonlinear regression. Residual sum of squares (RSS) is shown above instead.
Alternately if you just want to pick out the best starting value and run nlsLM on that then just omit the all=TRUE argument from the nls2 call and do this. If you need the coefficients and RSS for later code then try coef(fit) and deviance(fit) .
starting_fit <- nls2(fo, algorithm = "random-search",
start = start_range, control = nls.control(maxiter = reps))
fit <- nlsLM(fo, start = coef(starting_fit))
Note 1: If you are getting errors from nlsLM try replacing nlsLM(...) with try(nlsLM(...)). This will issue error messages (use try(..., silent = TRUE) if you don't want them) but will not stop processing.
Note 2: I assume that the foo shown in the question is just an example and the real function is more complex. The foo shown is linear in the coefficients so one could use lm for it. No nonlinear optimization is needed.
An approach like this perhaps?
library(dplyr)
library(purrr)
foo2 <- function(x,data) data$a*x^2+data$b
r2 <- function(e, o) 1 - sum((o - e)^2) / sum((o - mean(o))^2) # e = model output, o = observed values
df <- as.data.frame(sapply(start_range, function(x) runif(reps,min=x[[1]],max=x[[2]]))) %>%
mutate(id=seq_len(reps)) %>% # fudge to make nest behave as I want
nest(1:ncol(start_range))
df %>%
mutate(fun_out = map(data, foo2, x = x),
R2 = map(fun_out, o = y_obs, r2))
Result:
# A tibble: 3 x 4
id data fun_out R2
<int> <list> <list> <list>
1 1 <tibble [1 x 2]> <dbl [10]> <dbl [1]>
2 2 <tibble [1 x 2]> <dbl [10]> <dbl [1]>
3 3 <tibble [1 x 2]> <dbl [10]> <dbl [1]>
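The question also asks whether list columns can be created without nest(). A minimal base R sketch, under assumptions, is given below: starts is a hypothetical grid of candidate starting values (the grid spacing is arbitrary), Map builds a list column holding the function output for each row, and the residual sum of squares is used to rank candidates, in line with the remark above about R squared.

```r
foo <- function(x, a, b) a * x^2 + b          # function from the question

set.seed(1)
x <- 1:10
y_obs <- foo(x, 1.5, 2.5) + rnorm(length(x), 0, 10)

# Hypothetical grid of candidate starting values (spacing is an assumption)
starts <- expand.grid(a = seq(1, 2, by = 0.25), b = seq(2, 3, by = 0.25))

# List column: one numeric vector of length(x) per row, no nest() needed
starts$fun_out <- Map(function(a, b) foo(x, a, b), starts$a, starts$b)

# Residual sum of squares of each candidate against the observed data
starts$RSS <- sapply(starts$fun_out, function(f) sum((y_obs - f)^2))

# Candidate with the smallest RSS, e.g. to pass on as start values
best <- starts[which.min(starts$RSS), c("a", "b", "RSS")]
```

Assigning a list whose length equals the number of rows with $<- stores it as a list column, which is often enough when only base R is available.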
