Spatstat Point Pattern Analysis for colocalization - r

I am trying to do some colocalization analysis, i.e. I want to show whether one cell type tends to appear significantly closer to another cell type in a microscopy image.
I tried to do this with the R spatstat package and was able to visualize my dataset:
mypattern is one kind of cell and mypattern2 is another kind of cell. When you look at the L-plots you can see that there is some kind of clustering, as the curve deviates from Poisson.
I thought about using a nearest-neighbour approach, which is the nncross function in spatstat.
But how can I now show whether this distance is random (two random point patterns) or significantly relevant? Does anyone have an idea? I have seen a lot about simulations like Monte Carlo, but I have no idea how to begin coding...
I would be glad for any help!
Kind regards,
Hashirama

The L function should not be used here because the data are highly inhomogeneous.
I suggest you combine the two point patterns into a single "marked" point pattern,
X <- superimpose(A = mypattern, B = mypattern2)
Then estimate the spatially-varying densities of points
D <- density(split(X))
plot(D)
or the spatially varying proportions of each type of cell
R <- relrisk(X)
plot(R)
You can also use segregation.test or a contingency table of nearest neighbours (dixon).
See Chapter 14 of the spatstat book and the help files for relrisk, density.splitppp and segregation.test.
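As a concrete starting point for the Monte Carlo idea raised in the question, here is a minimal sketch (an addition, not part of the answer above) using random labelling: the cell locations are kept fixed and the type labels are permuted, which gives a null distribution for the mean A-to-B nearest-neighbour distance from nncross. It assumes mypattern and mypattern2 are ppp objects on the same window.
library(spatstat)
X <- superimpose(A = mypattern, B = mypattern2)
# observed mean distance from each A cell to the nearest B cell
obs <- mean(nncross(split(X)$A, split(X)$B)$dist)
# null distribution: permute the type labels, keeping all locations fixed
nsim <- 999
sim <- replicate(nsim, {
  Xr <- rlabel(X)
  mean(nncross(split(Xr)$A, split(Xr)$B)$dist)
})
# one-sided Monte Carlo p-value: are A cells closer to B cells than expected under random labelling?
(sum(sim <= obs) + 1) / (nsim + 1)
The ready-made segregation.test mentioned above performs a related permutation test directly on the marked pattern X.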

Related

R: clustering with a similarity or dissimilarity matrix? And visualizing the results

I have a similarity matrix that I created using Harry—a tool for string similarity, and I wanted to plot some dendrograms out of it to see if I could find some clusters / groups in the data. I'm using the following similarity measures:
Normalized compression distance (NCD)
Damerau-Levenshtein distance
Jaro-Winkler distance
Levenshtein distance
Optimal string alignment distance (OSA)
("For comparison Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output")
At first, since it was my first time using R, I didn't pay too much attention to the documentation of hclust, so I used it with a similarity matrix. I know I should have used a dissimilarity matrix, and I know, since my similarity matrix is normalized to [0,1], that I could just do dissimilarity = 1 - similarity and then use hclust.
But the groups that I get using hclust with a similarity matrix are much better than the ones I get using hclust with the corresponding dissimilarity matrix.
I tried the proxy package as well, and the same problem happens: the groups that I get aren't what I expected.
To get the dendrograms using the similarity function I do:
plot(hclust(as.dist(similarityMATRIX), "average"))        # (1)
With the dissimilarity matrix I tried:
plot(hclust(as.dist(dissimilarityMATRIX), "average"))      # (2)
and
plot(hclust(as.sim(dissimilarityMATRIX), "average"))       # (3)
From (1) I get what I believe to be a very good dendrogram, and so I can get very good groups out of it. From (2) and (3) I get the same dendrogram, and the groups that I can get out of it aren't as good as the ones from (1).
I'm saying that the groups are good/bad because at the moment I have a fairly small volume of data to analyse, so I can check them very easily.
Does what I'm getting make any sense? Is there something that justifies this? Any suggestions on how to cluster with a similarity matrix? Is there a better way to visualize a similarity matrix than a dendrogram?
You can visualize a similarity matrix using a heatmap (for example, using the heatmaply R package).
You can check how well a dendrogram fits the data by using the dendextend R package function cor_cophenetic (use the most recent version from GitHub).
Clustering which is based on distance can be done using hclust, but also using cluster::pam (k-medoids).
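To make this concrete, here is a small sketch (an illustration, not part of the answer above) with a toy normalized similarity matrix S standing in for the Harry output; the 1 - similarity conversion is the one suggested in the question, and base R's cophenetic() is used as a stand-in for dendextend::cor_cophenetic.
library(heatmaply)   # heatmap of the similarity matrix
library(cluster)     # pam (k-medoids)
set.seed(1)
S <- matrix(runif(25), 5, 5)
S <- (S + t(S)) / 2
diag(S) <- 1
rownames(S) <- colnames(S) <- paste0("s", 1:5)
D  <- as.dist(1 - S)                 # similarity -> dissimilarity
hc <- hclust(D, method = "average")
plot(hc)                             # dendrogram on the dissimilarities
heatmaply(S)                         # heatmap view of the similarity matrix
cor(cophenetic(hc), D)               # cophenetic correlation: how well the tree fits D
pam(D, k = 2)                        # k-medoids clustering on the same dissimilarities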

Cross-correlation of 5 time series (distance) and interpretation

I would appreciate some input in this a lot!
I have data for 5 time series (an example of one step in the series is in the plot below), where each step in the series is a vertical profile of species sightings in the ocean, and the profiles were taken 6 h apart. All 5 steps are spaced vertically at 0.1 m intervals (and 6 h apart in time).
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example: (figure not shown)
I find the R documentation on this not so great, so what I have done so far is use the MTS package with the ccm function to create cross-correlation matrices. However, the interpretation of the figures is rather difficult with such sparse documentation. I would appreciate some help with that a lot.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save in file cross_correlation_stack.csv or change as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 <- read.csv(d1)
# USING package MTS
mod1 <- ccm(d2, lag = 1000, level = TRUE)
# USING base R
acf(d2, lag.max = 1000)
# MQ plot, also from the MTS package
mq(d2, lag = 1000)
Running this, the ccm command produces three figures and the acf command a fourth (figures not shown here).
My question now is whether I am going in the right direction, or whether there are better-suited packages and commands.
Since the default figures don't get any titles etc., what am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it "calculates autocovariance or autocorrelation", which I assume is not what I want. But then again it's the only command that seems to work on multivariate data. I am confused.
The plot with the significance values shows that after a lag of 150 (15 meters) the p-values increase. How would you interpret that with regard to my data? With 0.1 m intervals of species sightings, many lags up to 100-150 are significant? Would that mean that peaks in sightings are stable over the 5 time steps on a scale of 150 lags, i.e. 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at! Any input is highly appreciated!
You can use the base R function ccf(), which will estimate the cross-correlation function between any two variables x and y. However, it only works on vectors, so you'll have to loop over the columns of d2 (the data frame read in above). Something like:
cc <- vector("list", choose(dim(d2)[2], 2))
par(mfrow = c(ceiling(choose(dim(d2)[2], 2) / 2), 2))
cnt <- 1
for (i in 1:(dim(d2)[2] - 1)) {
  for (j in (i + 1):dim(d2)[2]) {
    # pairwise cross-correlation, one panel per pair of columns
    cc[[cnt]] <- ccf(d2[, i], d2[, j],
                     main = paste0("Cross-correlation of ", colnames(d2)[i], " with ", colnames(d2)[j]))
    cnt <- cnt + 1
  }
}
This will plot each of the estimated CCFs and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x,y) is an estimate of the correlation between x[t+k] and y[t].
All of that said, however, the ccf is only defined for data that are more-or-less normally distributed, but your data are clearly overdispersed with all of those zeroes. Therefore, lacking some adequate transformation, you should really look into other metrics of "association" such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.
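To illustrate that last suggestion, here is a minimal sketch (an addition, not part of the answer above) of pairwise mutual information with the infotheo package, applied to the data frame d2 that the question's code reads in:
library(infotheo)
d2d <- discretize(d2)        # infotheo's estimators expect discretized data
mi  <- mutinformation(d2d)   # matrix of pairwise mutual information (in nats)
mi
Larger values indicate stronger (not necessarily linear) association between the corresponding profiles.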

Finding the intersection of two curves in a scatterplot (here: pvalues vs test-statistics)

I do
library(Hmisc)
df <- as.matrix(replicate(20, rnorm(20)))
cor.df <- rcorr(df)
plot(cor.df$r,cor.df$P)
abline(h=0.05)
and I would like to know if R can compute the meeting point of the horizontal line and the bell curve. Since I have a scatterplot, do I need to model the x,y curve first and then equate the two functions? Or can R do that graphically?
I actually want to know what the threshold for (uncorrected) p-values indicating a significant test statistic for a given dataset would be. I am not a trained statistician, so excuse me if this is a basic question.
Thank you very much!
There is no function to graphically calculate an intersection. There are functions like uniroot that you can use in R to find intersections, but you need to have proper functions and have a good idea of the interval where the intersection occurs.
It would be best to properly model the curve in question, but a simple way to approximate a function when you have a bunch of points on the curve is just to use linear interpolation between the observed points. You can create a function for your points with approxfun:
f1 <- approxfun(cor.df$r,cor.df$P, rule=2)
(Again, a proper model would be better, but just for the sake of example, I'll continue with this function.)
Now we can find the place where this curve crosses 0.05 with:
uniroot(function(x) f1(x)-.05, c(-1,-.001))$root
# [1] -0.4437796
uniroot(function(x) f1(x)-.05, c(.001, 1))$root
# [1] 0.4440005
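As a sanity check on the interpolation approach (an addition, not part of the answer above): for Pearson correlations based on n = 20 observations, as in the rcorr example here, the |r| at which the two-sided p-value equals 0.05 can be computed exactly from the t distribution.
n <- 20
t_crit <- qt(0.975, df = n - 2)          # critical t for a two-sided test at alpha = 0.05
t_crit / sqrt(t_crit^2 + (n - 2))        # about 0.444, matching the uniroot results above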

Interpreting the phom R package - persistent homology - topological analysis of data - Cluster analysis

I am learning to analyze the topology of data with the phom package in R.
I would like to understand (characterize) a set of data (a matrix with 3500 rows and 10 columns). To achieve this, the R package phom computes persistent homology, which describes the data.
(Reference: The following video describes what we are seeking to do with homology in topology - reference video 4 min: http://www.youtube.com/embed/XfWibrh6stw?rel=0&autoplay=1).
Using the R-package "phom" (link: http://cran.r-project.org/web/packages/phom/phom.pdf) the following example can be run.
I need help in order to properly understand how the phom function works and how to interpret the data (plot).
Using Example #1 from the reference manual of the phom package, running it in R:
Load Packages
library(phom)
library(Rcpp)
Example 1
x <- runif(100)
y <- runif(100)
points <- t(as.matrix(rbind(x, y)))
max_dim <- 2
max_f <- 0.2
intervals <- pHom(points, max_dim, max_f, metric="manhattan")
plotPersistenceDiagram(intervals, max_dim, max_f,
title="Random Points in Cube with l_1 Norm")
I would kindly appreciate it if someone could help me with the following questions:
a.) What does the value max_f mean, and where does it come from? From my data? Do I set it myself?
b.) The plot produced by plotPersistenceDiagram (if you run the example in R you will see it): how do I interpret it?
Thank you.
Note: in order to run the phom package you need the Rcpp package and a recent version of R (3.0.3 at the time of writing).
The example above was run in R after loading the phom and Rcpp packages.
This is totally the wrong venue for this question, but just in case you're still struggling with it a year later I happen to know the answer.
Computing persistent homology has two steps:
1. Turn the point cloud into a filtration of simplicial complexes.
2. Compute the homology of the simplicial complexes.
The "filtration" part of step 1 means you have to compute a simplicial complex for a whole range of parameters. The parameter in this case is epsilon, the distance threshold within which points are connected. The max_f variable caps the sweep of epsilon, which runs from zero to max_f.
plotPersistenceDiagram displays the homological "persistence barcodes" as points instead of lines. The x-coordinate of a point is the birth time of that topological feature (the value of epsilon at which it first appears), and the y-coordinate is the death time (the value of epsilon at which it disappears). Features that persist over a wide range of epsilon, i.e. points lying well above the diagonal, are the ones usually considered topologically meaningful; points close to the diagonal are typically noise.

Fitting multiple peaks to a dataset and extracting individual peak information in R

I have a number of datasets composed of counts of features at various elevations. Currently there is data for each 1m interval from 1-30m. When plotted, many of my datasets exhibit 3-4 peaks, which are indicative of height layers.
Here is a sample dataset:
Height <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
Counts <-c(4000,2000,500,300,200,100,0,0,400,700,800,800,500,1000,1500,2000,2500,2200,1700,1100,500,0,0,1000,1500,2000,3000,4000,4000,2000)
I would like to fit some manner of curve function to these datasets in order to determine the total number of ‘peaks’, the peak center location (i.e. height) and peak width.
I was able to perform this kind of analysis some time ago by fitting multiple Gaussian functions manually using the fityk software; however, I would like to know whether it is possible to perform such a process automatically through R.
I've explored a number of other posts concerning fitting peaks to histograms, such as through the mixtools package; however, I do not know whether individual peak information can be extracted.
Any help you can supply would be greatly appreciated.
"How do I fit a curve to my data" is way too broad of a question, because there are countless ways to do this. It's also probably more suited for https://stats.stackexchange.com/ than here. However, ksmooth from base R is a pretty good starting point for a basic smoother:
plot(Height, Counts)
smoothCounts <- ksmooth(Height, Counts, kernel = "normal", bandwidth = 2)
dsmooth <- diff(smoothCounts$y)
# local maxima: where the smoothed first difference changes sign from positive to negative
locmax <- sign(c(0, dsmooth)) > 0 & sign(c(dsmooth, 0)) < 0
lines(smoothCounts)
points(smoothCounts$x[locmax], smoothCounts$y[locmax], cex = 3, col = 2)
A simple peak identification could be along the following lines. Does this look reasonable?
library(data.table)
dt <- data.table(
  Height = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30),
  Counts = c(4000,2000,500,300,200,100,0,0,400,700,800,800,500,1000,1500,2000,2500,2200,1700,1100,500,0,0,1000,1500,2000,3000,4000,4000,2000)
)
# crude dCounts/dHeight: difference with the previous height
dt[, d1 := c(NA, diff(Counts))]
# the same difference shifted forward (a second derivative would be even cruder, so compare the sign change of the first difference instead)
dt[, d2 := c(tail(d1, -1), NA)]
# local maxima: counts rising (or flat) into the point and falling (or flat) after it
dtpeaks <- dt[d1 >= 0 & d2 <= 0]
I'm not quite sure how you would calculate the FWHM for the peaks; if you can explain the process, then I should be able to help.
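Since the question mentions mixtools and wanting peak centres and widths, here is a hedged sketch (an addition, not part of the answers above) that expands the counts into pseudo-observations and fits a finite Gaussian mixture; the number of components (k = 3) and the 1/100 down-scaling of the counts are assumptions for illustration only, and k would normally be chosen by inspection or a criterion such as BIC.
library(mixtools)
# pseudo-observations: each height repeated in proportion to its count (scaled down for speed)
obs <- rep(Height, round(Counts / 100))
fit <- normalmixEM(obs, k = 3)               # fit a 3-component Gaussian mixture
fit$mu                                       # estimated peak centres (heights)
fit$sigma                                    # estimated peak widths (standard deviations)
2 * sqrt(2 * log(2)) * fit$sigma             # FWHM of each Gaussian component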
