Point pattern classification with spatstat: what am I doing wrong? - r

I'm trying to classify bivariate point patterns into groups using spatstat. The patterns are derived from whole-slide images of lymph nodes with cancer. I’ve trained a neural network to recognize cells of three types (cancer “LP”, immune cells “bcell” and all other cells). I do not wish to analyse all other cells but use them to construct a polygonal window in the shape of the lymph node. Thus, the patterns to be analysed are immune cells and cancer cells in polygonal windows. Each pattern can have several tens of thousands of cancer cells and up to 2 million immune cells. The patterns are of the type “Small World Model” as there is no possibility of points lying outside the window.
My classification should be based on the position of the cancer cells in relation to the immune cells. E.g. most cancer cells lie on the “islands” of immune cells, but in some cases cancer cells are (seemingly) uniformly dispersed and there are only a few immune cells. In addition, the patterns are not always uniform across the node. As I’m rather new to spatial statistics I developed a simple and crude method to classify the patterns. Here it is in short:
I calculated a kernel density of the immune cells with sigma=80 because this looked “nice” to me.
Den<-density(split(cells)$"bcell",sigma=80,window= cells$window)
(Should I have used e.g. sigma=bw.scott instead?)
Then I created a tessellation image by dividing the density range into 3 parts (here again, I experimented with the breaks to get some “good looking results”).
rangesDenMax<-2*range(Den)[2]/3
rangesDenMin<-range(Den)[2]/3
map.breaks<-c(-Inf,rangesDenMin,rangesDenMax,Inf)
map.cuts <- cut(Den, breaks = map.breaks, labels = c("Low B-cell density","Medium B-cell density", "High B-cell density"))
map.quartile <- tess(image = map.cuts,window=cells$window)
tessImage<-map.quartile
Here are some examples of the plots of the tessellations with the cancer cell overlay (white dots). The lymph node on the left has a typical uniformly distributed “islands” of immune cells while the node on the right has only a few dense spots of immune cells and cancer cells not restricted to those spots:
heat map: immune cell kernel density, white dots: cancer cells
Then I measured a silly number of variables, which should give me a clue of how the cancer cells are distributed across the tessellation tiles (the calculation code is trivial so I post only the description of my variables):
LPlwB<-c() # proportion of cancer cells in low-b-cell-area
LPmdB<-c() # proportion of cancer cells in medium-b-cell-area
LPhiB<-c() # proportion of cancer cells in high-b-cell-area
AlwB<-c() # proportion of the low-b-cell area
AmdB<-c() # proportion of the medium-b-cell area
AhiB<-c() # proportion of the high-b-cell area
LPm1<-c() # mean distance to the 1st neighbour
LPm2<-c() # mean distance to the 2nd neighbour
LPm3<-c() # mean distance to the 3rd neighbour
LPsd1<-c() # standard deviation of the mean distance to the 1st neighbour
LPsd2<-c() # standard deviation of the mean distance to the 2nd neighbour
LPsd3<-c() # standard deviation of the mean distance to the 3rd neighbour
meanQ<-c() # mean quadratcount (I visually chose the quadrat size to be not too large and not too small)
sdevQ<-c() # standard deviation of the mean quadratcount
hiSAT<-c() # realised cancer cells saturation in high b-cell-area (number of cells observed divided by a number of cells, which could be fitted into the area considering the observed min distance between the cells)
mdSAT<-c() # realised cancer cells saturation in medium b-cell-area
lwSAT<-c() # realised cancer cells saturation in low b-cell-area
ll<-c() # Proportion LP neighbours of LP (contingency table count divided by total points)
lb<-c() # Proportion b-cell neighbours of LP
bl<-c() # Proportion LP neighbours of b-cells
bb<-c() # Proportion b-cell neighbours of b-cells
I z-scaled the variables, inspected them on a PCA plot (the vectors pointed in different directions like the needles of a sea urchin) and performed a hierarchical cluster analysis. I chose k by calculating fviz_nbclust(scaled_variables, hcut, method = "silhouette"). After dividing the dendrogram into k clusters and checking the cluster stability, I ended up with my groups, which seemed to make sense as cases with “islands” were separated from the "more dispersed" ones.
However, given the possibilities of the spatstat package, I strongly feel like I am hammering nails into the wall with a smartphone.

It seems you are trying to quantify the way in which the cancer cells are positioned relative to the immune cells. You could do this by something like
Cancer <- split(cells)[["LP"]]
Immune <- split(cells)[["bcell"]]
Dimmune <- density(Immune, sigma=80)
f <- rhohat(Cancer, Dimmune)
plot(f)
Then f is a function that indicates the intensity (number per unit area) of cancer cells as a function of the density of immune cells. The plot shows the density of cancer cells on the vertical axis, against the density of immune cells on the horizontal axis.
If the graph of this function is flat, it means that the cancer cells are not paying attention to the density of immune cells. If the graph is steeply declining it means that cancer cells tend to avoid immune cells.
I suggest you first look at the plot of f for some example datasets to decide whether f has any ability to discriminate between spatial arrangements that you think should be classified as different. If so then you can use as.data.frame to extract the values of f and then use classical discriminant analysis (etc) to classify the slide images into groups.
Instead of density(Immune) you could use any other summary of the immune cells.
For example D <- distfun(Immune) would give you the distance to the nearest immune cell, and then f would compute the density of cancer cells as a function of the distance to nearest immune cell. And so on.
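As a rough sketch of that workflow (not tested on real slide data): the simulated patterns, the window size and the common grid of covariate values below are illustrative assumptions, and approx() is only one possible way to put the curves from different slides on a shared axis before classification.

```r
library(spatstat)

set.seed(1)

# Hypothetical stand-in for one slide: a single "island" of immune cells
# and uniformly scattered cancer cells in a 1000 x 1000 window.
W <- owin(c(0, 1000), c(0, 1000))
island <- function(x, y) 0.002 * exp(-((x - 500)^2 + (y - 500)^2) / (2 * 150^2))
Immune <- rpoispp(island, win = W)
Cancer <- rpoispp(2e-4, win = W)

# Intensity of cancer cells as a function of immune-cell density
Dimmune <- density(Immune, sigma = 80)
f <- rhohat(Cancer, Dimmune)

# Extract the curve: the first column holds the covariate values,
# the column 'rho' holds the estimated intensity
fdf <- as.data.frame(f)

# Evaluate the curve on a fixed grid so that every slide yields a feature
# vector of the same length (grid chosen arbitrarily for illustration)
grid <- seq(0, 0.002, length.out = 10)
features <- approx(fdf[[1]], fdf$rho, xout = grid, rule = 2)$y

# Same idea with distance to the nearest immune cell as the covariate
g <- rhohat(Cancer, distfun(Immune))
gdf <- as.data.frame(g)
```

Stacking one such feature vector per slide into a matrix gives an input that discriminant analysis or clustering can work with. Whether a fixed grid is sensible depends on how much the range of immune-cell density varies between slides.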

Related

Obtain the differential spectrum of each marine floating target and its background/neighborhood water in Google Earth Engine

How to obtain the differential spectrum of each floating target (algal pattern here), that is, the band value of each algal pattern subtract the band value of the adjacent water around it (such as the median water spectrum)
I first extract floating algae from the sea. I can use NDVI, NDWI, etc. to extract the algae and its edges first (See the Fig.1, algae is in viridis palette). My goal is to get the difference between the spectra of the algae and the surrounding water. Therefore, I carried out buffer operation on the edge of algae patterns (See the Fig.2, yellow buffer). The buffer represents the water around algae. My goal was to calculate the difference between algae pattern and the surrounding background water body. I have considered the object-based approach, but this is very memory intensive and has limitations on spot size. Now I want to do it based on pixels and morphology. How to achieve this?
An alternative idea maybe:Fill nodata values (masked algae) using neighborhood water in an image, then using subtraction between the original image and the new one to obtain the difference between the spectra of the algae and the surrounding water.

K-means clustering interpretation

I have a pairs plot of 3 clusters for "Av. Mon. Hrs", "Sat. Lvl" and "Last Eval", produced as a matrix of scatterplots by the code below.
library("ggplot2") # Expanded plotting functionality over "lattice" package
x<-cbind(HR_left$average_montly_hours,HR_left$satisfaction_level,HR_left$last_evaluation)
kmfit<-kmeans(x,3,nstart=25)
# Find the best 3 clusters using 25 random sets of (distinct) rows in x as initial centres.
pairs(x,col=(kmfit$cluster), labels=c("Av. Mon. Hrs","Sat. Lvl","Last Eval."))
It says
Cluster 1: The pairs plot characterised this cluster as working low average monthly hours of employees, middle satisfaction range and a low last evaluation.
Cluster 2: From the pairs plot, this cluster is characterised by high monthly hours, very low satisfaction and high evaluation.
Cluster 3: From the pairs plot, this cluster is characterised by high monthly hours, high satisfaction and high evaluation.
But I don't understand how the pairs plot supports these three interpretations.
library(readr)
HR_comma_sep <- read_csv("https://stluc.manta.uqcloud.net/mdatascience/public/datasets/HumanResourceAnalytics/HR_comma_sep.csv")
HR_left<-HR_comma_sep[HR_comma_sep$left==1,]
library("ggplot2") # Expanded plotting functionality over "lattice" package
x<-cbind(HR_left$average_montly_hours,HR_left$satisfaction_level,HR_left$last_evaluation)
kmfit<-kmeans(x,3,nstart=25)
# Find the best 3 clusters using 25 random sets of (distinct) rows in x as initial centres.
pairs(x,col= (kmfit$cluster),labels=c("Av. Mon. Hrs","Sat. Lvl","Last Eval."))
The number of "monthly hours" is on a very different scale from the other two variables, and this skews the clustering: the differences in "hours worked" dominate the differences in the other two variables.
Normalize each column, for example by dividing by the mean or the range, or by converting to z-scores.
Original Code:
library(readr)
HR_comma_sep <- read_csv("https://stluc.manta.uqcloud.net/mdatascience/public/datasets/HumanResourceAnalytics/HR_comma_sep.csv")
HR_left<-HR_comma_sep[HR_comma_sep$left==1,]
library("ggplot2")
x_org<-cbind(HR_left$average_montly_hours,
HR_left$satisfaction_level,
HR_left$last_evaluation)
kmfit<-kmeans(x_org, 3, nstart = 25)
pairs(x_org,col= (kmfit$cluster),labels=c("Av. Mon. Hrs","Sat. Lvl","Last Eval."))
Repeating the calculation using scaled values:
x_scaled<-cbind(scale(HR_left$average_montly_hours),
scale(HR_left$satisfaction_level),
scale(HR_left$last_evaluation))
kmfit<-kmeans(x_scaled, 3)
pairs(x_org,col= (kmfit$cluster),labels=c("Av. Mon. Hrs","Sat. Lvl","Last Eval."))
Using just the raw values, the clustering is based almost entirely on differences in "monthly hours". The top plot shows 2 clusters (black and green) merged together and not clearly distinct.
After scaling the values and repeating the clustering, 3 clearly differentiated clusters appear (bottom image).
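The effect of scale can also be checked on synthetic data where the true grouping is known. This is only a minimal sketch: the variable names, group means and the agreement() helper are made up for illustration and have nothing to do with the HR dataset.

```r
set.seed(42)
n <- 200
group <- rep(1:2, each = n / 2)

# Large-scale variable that carries no group information
hours <- rnorm(n, mean = 200, sd = 50)
# Small-scale variable that actually separates the two groups
satisfaction <- ifelse(group == 1, 0.2, 0.8) + rnorm(n, sd = 0.05)
x <- cbind(hours, satisfaction)

km_raw <- kmeans(x, centers = 2, nstart = 25)
km_scaled <- kmeans(scale(x), centers = 2, nstart = 25)  # z-score every column

# Hypothetical helper: share of points whose cluster matches the
# majority cluster of their true group (1 = perfect recovery)
agreement <- function(cluster, truth) {
  tab <- table(cluster, truth)
  sum(apply(tab, 2, max)) / length(truth)
}

agree_raw <- agreement(km_raw$cluster, group)
agree_scaled <- agreement(km_scaled$cluster, group)
```

With the raw values the clusters follow the hours variable and agree with the true groups only at about chance level, while the scaled version should recover them almost perfectly. Note that scale() applied to a matrix standardizes all columns in one call.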

Correlating rasters with divisible resolution

I am using a multibeam echosounder to create a raster stack in R with layers all in the same resolution, which I then convert to a data frame so I can create additive models to describe the distribution of fish around bathymetry variables (depth, aspect, slope, roughness etc.).
The issue I have is that I would like to keep my response variable (fish school volume) fine and my predictor variables (bathymetry) coarse, such that I have, say, 1 x 1 m cells representing the distribution of fish schools and 10 x 10 m cells representing bathymetry (so the coarse cell is divisible by the fine cell with no remainder).
I can easily create these rasters individually but relating them is the problem. As each coarser cell would contain 10 x 10 = 100 finer cells, I am not sure how to program this into R so that the values are in the right location relative to an x and a y column (for cell addresses). But I realise in this case, I would need each coarse cell value to be repeated 100 times in the data frame.
Any advice would be greatly appreciated! Thanks!
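One possible way to picture the "repeat each coarse value 100 times" idea, sketched in base R on plain data frames rather than raster objects. The grid sizes, column names (cx, cy, depth, fish_volume) and values are all hypothetical; on real rasters, raster::disaggregate() with fact = 10 should give the same kind of alignment before converting to a data frame.

```r
set.seed(3)
fact <- 10  # coarse cell size divided by fine cell size

# Hypothetical fine grid: 20 x 20 cells of 1 m, coordinates are cell centres
fine <- expand.grid(x = seq(0.5, 19.5, by = 1), y = seq(0.5, 19.5, by = 1))
fine$fish_volume <- runif(nrow(fine))

# Hypothetical coarse grid: 2 x 2 cells of 10 m, indexed by column and row
coarse <- expand.grid(cx = 0:1, cy = 0:1)
coarse$depth <- c(12, 15, 20, 31)

# Integer division of the fine coordinates gives the enclosing coarse cell
fine$cx <- floor(fine$x / fact)
fine$cy <- floor(fine$y / fact)

# Each coarse value is now repeated once per fine cell it contains
merged <- merge(fine, coarse, by = c("cx", "cy"))
```

The merged data frame keeps one row per fine cell, with x, y, the fine response and the coarse predictor side by side.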

R understanding raster's corLocal neighborhood size parameter

I am calculating the Pearson correlation between two rasters (identical in dimensions and cell size) in a moving window with the corLocal from the raster package. It is not clear (to me) from the manual what the neighborhood size parameter (ngb) actually means. E.g., does a ngb = 5 mean that the correlation is calculated for the focal cell plus the top-bottom-right-left cells?
I looked at the code and corLocal calls getValuesFocal():
getValuesFocal(x, 1, nrow(x), ngb=ngb)
but I couldn't understand what getValuesFocal actually does.
Thanks,
Ilik
The ngb parameter defines the neighborhood size. For example, I believe ngb=5 defines a 5 x 5 neighborhood. This should be equivalent to ngb=c(5,5) which is a vector of two integers defining the numbers of rows and cols in the neighborhood or focal window. In this example, an individual cell in the output raster would represent the correlation calculated from a 5 x 5 cell neighborhood in the two input rasters.
The raster library documentation on p. 118 might help too.
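To make the 5 x 5 interpretation concrete, here is a base-R sketch of what a local correlation over such a window means for a single focal cell. It works on plain matrices and is not the raster implementation; local_cor() is a hypothetical helper, and truncating the window at the edges is an assumption that may differ from how the package treats border cells.

```r
set.seed(1)
a <- matrix(rnorm(100), nrow = 10)
b <- a + matrix(rnorm(100, sd = 0.5), nrow = 10)

# Pearson correlation of a and b within an ngb x ngb window
# centred on the cell at (row, col); ngb is assumed to be odd
local_cor <- function(a, b, row, col, ngb = 5) {
  h <- (ngb - 1) / 2
  rows <- max(1, row - h):min(nrow(a), row + h)
  cols <- max(1, col - h):min(ncol(a), col + h)
  cor(as.vector(a[rows, cols]), as.vector(b[rows, cols]))
}

# Focal cell (5, 5): the window covers rows 3 to 7 and columns 3 to 7,
# so the correlation is computed from 25 pairs of values
r <- local_cor(a, b, row = 5, col = 5, ngb = 5)
```

Under this reading, ngb = 5 uses the focal cell plus its 24 surrounding cells, not just the focal cell and its four direct neighbours.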

How does adaptive.density() (spatstat) handle duplicated points, and what is the default f value?

I cannot find this information in the reference literature [1]:
1) How does adaptive.density() (package spatstat) handle duplicated spatial points? I have duplicated points in exactly the same position because I am combining measurements from different years, and I expect the density estimate to be higher in those areas, but I am not sure about it.
2) Is the default value of f in adaptive.density() f=0 or f=1?
My guess is that it is f=0, so that the intensity estimate at every location equals the average intensity (number of points divided by window area).
Thank you for your time and input!
The default value of f is 0.1 as you can see from the "Usage" section in the help file.
The function subsamples the point pattern with this selection probability and uses the resulting pattern to generate a Dirichlet tessellation (if there are duplicated points here they are ignored). The other fraction of points (1-f) is used to estimate the intensity by the number of points in each tile of the tessellation divided by the corresponding area (here duplicated points count equally to the total count in the tile).
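The two steps can be imitated by hand to see where duplicates matter. This is a sketch under assumptions, not the package's internal code: the simulated pattern is made up, the random split is done with runif(), and dividing by (1 - f) to rescale from the counted fraction to the full pattern is my understanding of the help file.

```r
library(spatstat)

set.seed(2)
f <- 0.1

# Hypothetical pattern with exact duplicates, mimicking pooled years
X0 <- runifpoint(300, win = square(1))
X <- superimpose(X0, X0[1:100])

# Random split: a fraction f builds the tessellation, the rest is counted
inA <- runif(npoints(X)) < f
A <- X[inA]
B <- X[!inA]

# Duplicated seed points add no extra tiles
tes <- dirichlet(unique(A))

# Every point of B is counted, duplicates included
counts <- as.numeric(quadratcount(B, tess = tes))
areas <- tile.areas(tes)
lambda_tile <- counts / ((1 - f) * areas)

# The packaged estimator; nrep averages over repeated random splits
D <- adaptive.density(X, f = f, nrep = 5)
```

Because duplicates in B raise the count in their tile, the estimate should indeed be higher where measurements from several years coincide.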
