I'm still trying to find the best way to classify bivariate point patterns:
Point pattern classification with spatstat: what am I doing wrong?
I have now analysed 110 samples of my dataset using @Adrian's suggestion with sigma=bw.diggle (I wanted automatic bandwidth selection). Here f is a "resource selection function" (RSF), which describes the relationship between the intensity of the Cancer point process and the covariate (here, the kernel density of Immune):
Cancer <- split(cells)[["tumor"]]
Immune <- split(cells)[["bcell"]]
Dimmune <- density(Immune,sigma=bw.diggle)
f <- rhohat(Cancer, Dimmune)
I have doubts about some of the results. About a dozen of the rho functions looked odd (broken up, or with a single peak). After changing to the default sigma=NULL or to sigma=bw.scott (both of which give smoother densities), the functions looked "better" (see examples below). I also experimented with the following manipulations:
cells # bivariate point pattern with marks "tumor" and "bcell"
o.marks<-cells$marks # original marks
#A) randomly re-assign original marks
a.marks <- sample(cells$marks)
#B) replace marks randomly with a 50/50 proportion
b.marks<-as.factor(sample(c("tumor","bcell"), replace=TRUE, size=length(o.marks)))
#C) random (homogeneous?) pattern with the original number of points
randt<-runifpoint(npoints(subset(cells,marks=="tumor")),win=cells$window)
randb<-runifpoint(npoints(subset(cells,marks=="bcell")),win=cells$window)
cells<-superimpose(tumor=randt,bcell=randb)
#D) tumor points are associated with bcell points (is "clustered" a right term?)
Cancer<-rpoint(npoints(subset(cells,marks=="tumor")),Dimmune,win=cells$window)
#E) tumor points are segregated from bcell points
reversedD<-Dimmune
density.scale.v<-sort(unique((as.vector(Dimmune$v)[!is.na(as.vector(Dimmune$v))]))) # density scale
density.scale.v.rev<-rev(density.scale.v)# reversed density scale
new.image.v<-Dimmune$v
# Loop over matrix
for(row in 1:nrow(Dimmune$v)) {
for(col in 1:ncol(Dimmune$v)) {
if (is.na(Dimmune$v[row, col])==TRUE){next}
number<-which(density.scale.v==Dimmune$v[row, col])
new.image.v[row, col]<-density.scale.v.rev[number]}
}
reversedD$v<-new.image.v # reversed density
Cancer<-rpoint(npoints(subset(cells,marks=="tumor")),reversedD,win=cells$window)
A better way to generate inverse density heatmaps is given by @Adrian in his post below.
I could not generate rpoint patterns from the bw.diggle density because it contained negative values. I therefore replaced the negatives with zero, Dimmune$v[which(Dimmune$v<0)]<-0, after which rpoint ran. As @Adrian explains in the post below, this is normal and is solved more easily with the density.ppp option positive=TRUE.
I first used bw.diggle because hopskel.test indicated "clustering" for all my patterns. I am now going to use bw.scott for my analysis, but can this decision be justified somehow? Is there a better criterion than "the RSF looks weird"?
Some examples (images not shown): sample10, sample20, sample110.
That is a lot of questions!
Please try to ask only one question per post.
But here are some answers to your technical questions about spatstat.
Negative values:
The help for density.ppp explains that small negative values can occur because of numerical effects. To force the density values to be non-negative, use the argument positive=TRUE in the call to density.ppp. For example density(Immune, bw.diggle, positive=TRUE).
Reversed image: to reverse the ordering of values in an image Z you can use the following code:
V <- Z
A <- order(Z[])
V[][A] <- Z[][rev(A)]
Then V is the order-reversed image.
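To see what the order/rev idiom does, here is a minimal base-R sketch on a plain matrix with a missing pixel. The matrix Z and the helper vector names are made up for illustration; with a spatstat image the three lines above do the same job directly.

```r
# Toy "image": a 2 x 2 matrix with one NA pixel (hypothetical values)
Z <- matrix(c(3, NA, 1, 2), nrow = 2, ncol = 2)

ok   <- !is.na(Z)        # pixels that carry a value
vals <- Z[ok]            # their values, in storage order: 3, 1, 2
A    <- order(vals)      # positions sorted from smallest to largest value

rev_vals    <- vals
rev_vals[A] <- vals[rev(A)]  # smallest position gets largest value, and so on

V     <- Z
V[ok] <- rev_vals        # NA pixels stay NA
print(V)
```

The pixel that held the largest value now holds the smallest and vice versa, while the set of values itself is unchanged.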
Tips for your code:
To generate a random point pattern with the same number of points and in the same window as an existing point pattern X, use Y <- runifpoint(ex=X).
To extract the marks of a point pattern X, use a <- marks(X). To assign new marks to a point pattern X, use marks(X) <- b.
To randomly permute the marks attached to the points in a point pattern X, use Y <- rlabel(X).
To assign new marks to a point pattern X where the new marks are drawn randomly-with-replacement from a given vector of values m, use Y <- rlabel(X, m, permute=FALSE).
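As a rough sketch, the tips above could replace manipulations A to C from the question like this. It assumes the spatstat package is installed and uses its built-in amacrine dataset (a two-type marked pattern) as a stand-in for the poster's cells data; the level names "tumor" and "bcell" are borrowed from the question.

```r
library(spatstat)

X <- amacrine                       # stand-in for the poster's bivariate pattern 'cells'

# A) randomly re-assign the original marks (a permutation)
Xperm <- rlabel(X)

# B) draw new marks with replacement from a vector of possible values
m    <- factor(c("tumor", "bcell"))
Xnew <- rlabel(X, m, permute = FALSE)

# C) uniform random pattern with the same number of points and the same window
Y <- runifpoint(ex = X)

# Extracting marks for inspection
table(marks(X))
table(marks(Xperm))                 # same counts as the original, different locations
table(marks(Xnew))
```

Note that runifpoint(ex=X) returns an unmarked pattern, so for a marked version of C the two types would still need to be generated separately and combined, as in the question.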
Using R, I want to estimate two curves using points from two vectors, and then find the x and y coordinates where those estimated curves intersect.
I am working on a game-theory problem with two players, "t" and "p", and simulating each player's best response to what the other would pick. The problem is that I don't have functions or lines; I have two sets of points from simulation, each set giving one player's best response to given actions by the other player. The actual math was too difficult for me (or MATLAB) to solve, which is why I'm using this simulated, visual approach. I want to estimate best response functions (i.e. fit non-linear curves) from the points, and then find where the two estimated curves intersect in order to identify the Nash equilibrium (the point where the best response curves cross).
As an example, here are two such vectors I am working with:
t=c(10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.1,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0)
p=c(12.3,12.3,12.3,12.3,12.3,12.3,12.4,12.4,12.4,12.5,12.5,12.5,12.6,12.6,12.7,12.7,12.8,12.8,12.9,12.9,13.0,13.1,13.1,13.2,13.3,13.4,13.5,13.4,13.5,13.6,13.6,13.7,13.8,13.8,13.9,13.9,13.9,14.0,14.0,14.0,14.0)
For the first line, the sample is made up of (t,a), and for the second line, the sample is made up of (a,p) where a is a third vector given by
a = seq(10, 14, by = 0.1)
For example, the first point for the sample corresponding to the first vector would be (10.0,10.0) and the second point would be (10.0,10.1). The first point for the sample corresponding to the second vector would be (10.0,12.3) and the second point would be (10.1,12.3).
What I originally tried to do is estimate the lines using polynomials produced by lm models, but those don't seem to always work:
plot(a,t, xlim=c(10,14), ylim=c(10,14), col="purple")
points(p,a, col="red")
fit4p <- lm(a~poly(p,3,raw=TRUE))
fit4t <- lm(t~poly(a,3,raw=TRUE))
lines(a, predict(fit4t, data.frame(x=a)), col="purple", xlim=c(10,14), ylim=c(10,14),type="l",xlab="p",ylab="t")
lines(p, predict(fit4p, data.frame(x=a)), col="green")
fit4pCurve <- function(x) coef(fit4p)[1] +x*coef(fit4p)[2]+x^2*coef(fit4p)[3]+x^3*coef(fit4p)[4]
fit4tCurve <- function(x) coef(fit4t)[1] +x*coef(fit4t)[2]+x^2*coef(fit4t)[3]+x^3*coef(fit4t)[4]
a_opt1 = optimise(f=function(x) abs(fit4pCurve(x)-fit4tCurve(x)), c(10,14))$minimum
b_opt1 = as.numeric(fit4pCurve(a_opt1))
EDIT:
After fixing the typo, I get the correct answer, but it doesn't always work when the samples don't come back as cleanly.
So my question can be broken down into two parts. First, is there a better way to accomplish what I'm trying to do? I know my approach isn't perfectly accurate, but it seems like a decent approximation for my purposes. Second, if there isn't a better way, how could I improve the methodology listed above?
Restart your R session, make sure all variables are cleared and copy/paste this code. I found a few mistakes in referenced variables. Also note that R is case sensitive. My suspicion is that you've been overwriting variables.
plot(a,t, xlim=c(10,14), ylim=c(10,14), col="purple")
points(p,a, col="red")
fit4p <- lm(a~poly(p,3,raw=TRUE))
fit4t <- lm(t~poly(a,3,raw=TRUE))
# newdata columns must carry the name of each model's predictor (a for fit4t, p for fit4p)
lines(a, predict(fit4t, data.frame(a=a)), col="purple")
lines(p, predict(fit4p, data.frame(p=p)), col="green")
fit4pCurve <- function(x) coef(fit4p)[1] +x*coef(fit4p)[2]+x^2*coef(fit4p)[3]+x^3*coef(fit4p)[4]
fit4tCurve <- function(x) coef(fit4t)[1] +x*coef(fit4t)[2]+x^2*coef(fit4t)[3]+x^3*coef(fit4t)[4]
# search the plotted interval [10, 14] for the point where the two curves are closest
a_opt = optimise(f=function(x) abs(fit4pCurve(x)-fit4tCurve(x)), c(10,14))$minimum
b_opt = as.numeric(fit4pCurve(a_opt))
As you will see:
> a_opt
[1] 12.24213
> b_opt
[1] 10.03581
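One possible refinement, offered as a sketch rather than a definitive method: minimising the absolute difference returns a point even when the curves never actually cross, whereas a root finder on the signed difference only reports true crossings. The code below reuses the vectors and cubic fits from the question; the helper names cubic, gap, grid and crossings are made up for this illustration.

```r
# Data copied from the question
a <- seq(10, 14, by = 0.1)
t <- c(10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.1,10.1,10.1,10.1,10.1,10.1,
       10.1,10.1,10.1,10.1,10.1,10.1,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,
       10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0)
p <- c(12.3,12.3,12.3,12.3,12.3,12.3,12.4,12.4,12.4,12.5,12.5,12.5,12.6,12.6,
       12.7,12.7,12.8,12.8,12.9,12.9,13.0,13.1,13.1,13.2,13.3,13.4,13.5,13.4,
       13.5,13.6,13.6,13.7,13.8,13.8,13.9,13.9,13.9,14.0,14.0,14.0,14.0)

fit4p <- lm(a ~ poly(p, 3, raw = TRUE))
fit4t <- lm(t ~ poly(a, 3, raw = TRUE))

# Turn a fitted raw cubic into a plain function of x
cubic <- function(fit) {
  b <- unname(coef(fit))
  function(x) b[1] + b[2] * x + b[3] * x^2 + b[4] * x^3
}
fit4pCurve <- cubic(fit4p)
fit4tCurve <- cubic(fit4t)

# Signed difference between the two fitted curves
gap <- function(x) fit4pCurve(x) - fit4tCurve(x)

# Bracket every sign change on a fine grid, then refine each one with uniroot()
grid  <- seq(10, 14, by = 0.01)
g     <- gap(grid)
idx   <- which(g[-length(g)] * g[-1] < 0)
roots <- vapply(idx, function(i) {
  uniroot(gap, c(grid[i], grid[i + 1]), tol = 1e-9)$root
}, numeric(1))

crossings <- data.frame(x = roots, y = fit4pCurve(roots))
print(crossings)
```

If crossings comes back empty, the fitted curves do not intersect on [10, 14], which is useful to know for the less clean samples. More than one row means the cubics cross several times, possibly only because of extrapolation outside the range of the simulated points.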
I found some posts and discussions about the above, but I'm not sure... could someone please check if I am doing anything wrong?
I have a set of N points of the form (x,y,z). The x and y coordinates are independent variables that I choose, and z is the output of a rather complicated (and of course non-analytical) function that uses x and y as input.
My aim is to find a set of values of (x,y) where z=z0.
I looked up this kind of problem in R-related forums, and it appears that I need to interpolate the points first, perhaps using a package like akima or fields.
However, it is less clear to me: 1) if that is necessary, or the basic R functions that do the same are sufficiently good; 2) how I should use the interpolated surface to generate a correct matrix of the desired (x,y,z=z0) points.
E.g. this post seems somewhat related to the problem I am describing, but it looks extremely complicated to me, so I am wondering whether my simpler approach is correct.
Please see below some example code (not the original one, as I said the generating function for z is very complicated).
I would appreciate if you could please comment / let me know if this approach is correct / suggest a better one if applicable.
df <- merge(data.frame(x=seq(0,50,by=5)),data.frame(y=seq(0,12,by=1)),all=TRUE)
df["z"] <- (df$y)*(df$x)^2
ta <- xtabs(z~x+y,df)
contour(ta,nlevels=20)
contour(ta,levels=c(1000))
#why are the x and y axes [0,1] instead of showing the original values?
#and how accurate is the algorithm that draws the contour?
li2 <- as.data.frame(contourLines(ta,levels=c(1000)))
#this extracts the contour data, but all (x,y) values are wrong
require(akima)
s <- interp(df$x,df$y,df$z)
contour(s,levels=c(1000))
li <- as.data.frame(contourLines(s,levels=c(1000)))
#at least now the axis values are in the right range; but are they correct?
require(fields)
image.plot(s)
#fancier, but same problem - are the values correct? better than the akima ones?
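A possible explanation for the [0,1] axes, sketched in base R: when contour() or contourLines() receive only a matrix, x and y default to equally spaced values from 0 to 1, so the original grid values have to be passed explicitly. The example below rebuilds the same toy surface with outer(); the names iso and z_true are made up for illustration.

```r
x <- seq(0, 50, by = 5)
y <- seq(0, 12, by = 1)
z <- outer(x, y, function(x, y) y * x^2)   # rows follow x, columns follow y

# Pass the grid coordinates so the contour comes back in original units
cl  <- contourLines(x = x, y = y, z = z, levels = 1000)
iso <- do.call(rbind, lapply(cl, function(l) data.frame(x = l$x, y = l$y)))

# Compare with the true function at the extracted points
iso$z_true <- iso$y * iso$x^2
summary(iso$z_true)

contour(x, y, z, levels = 1000)
points(iso$x, iso$y, pch = 20)
```

The contour is obtained by linear interpolation between grid nodes, so z_true is close to 1000 but not exactly equal to it wherever the surface is curved between nodes. A finer grid in x and y reduces that error; whether akima or fields adds anything depends on whether the original points already lie on a regular grid.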
Is there a simple way to plot the difference between two probability density functions?
I can plot the pdfs of my data sets (both are one-dimensional vectors with roughly 11000 values) on the same plot together to get an idea of the overlap/difference but it would be more useful to me if I could see a plot of the difference.
something along the lines of the following (though this obviously doesn't work):
> plot(density(data1)-density(data2))
I'm relatively new to R and have been unable to find what I'm looking for on any of the forums.
Thanks in advance
This should work:
plot(x =density(data1, from= range(c(data1, data2))[1],
to=range(c(data1, data2))[2] )$x,
y= density(data1, from= range(c(data1, data2))[1],
to=range(c(data1, data2))[2] )$y-
density(data2, from= range(c(data1, data2))[1],
to=range(c(data1, data2))[2] )$y )
The trick is to make sure the densities have the same limits. Then you can plot their difference at the same locations. My understanding of the need for identical limits comes from having made the error of skipping that step when answering a similar question on Rhelp several years ago. Too bad I couldn't remember the right arguments.
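To illustrate why the limits matter, here is a small sketch with simulated stand-ins for data1 and data2 (the real vectors are not shown in the question). By default each call to density() picks its own evaluation grid from its own data, so the y values of two default calls do not line up.

```r
set.seed(42)
data1 <- rnorm(11000, mean = 0)     # hypothetical stand-in
data2 <- rnorm(11000, mean = 0.5)   # hypothetical stand-in

# Default calls: each density gets its own x grid
d1_default <- density(data1)
d2_default <- density(data2)
identical(d1_default$x, d2_default$x)

# Shared limits: both densities are evaluated at the same x locations
rng <- range(c(data1, data2))
d1  <- density(data1, from = rng[1], to = rng[2])
d2  <- density(data2, from = rng[1], to = rng[2])
identical(d1$x, d2$x)

plot(d1$x, d1$y - d2$y, type = "l",
     xlab = "value", ylab = "density difference")
abline(h = 0, lty = 2)
```

Each density still uses its own automatically chosen bandwidth here; passing a common bw to both calls is an option if the two estimates should be equally smooth.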
It looks like you need to spend a little time learning how to use R (or any other language, for that matter). Help files are your friend.
From the output of ?density :
Value [i.e. the data returned by the function]
If give.Rkern is true, the number R(K), otherwise an object with class
"density" whose underlying structure is a list containing the
following components.
x the n coordinates of the points where the density is estimated.
y the estimated density values. These will be non-negative, but can
be zero [remainder of "value" deleted for brevity]
So, do:
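rng <- range(c(data1, data2))   # common limits, so both densities share one x grid
foo <- density(data1, from=rng[1], to=rng[2])
bar <- density(data2, from=rng[1], to=rng[2])
plot(foo$x, foo$y-bar$y)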
Your comments, suggestions, or solutions are/will be greatly appreciated, thank you.
I'm using the fpc package in R to do a dbscan analysis of some very dense data (3 sets of 40,000 points in the range -3 to 6).
I've found some clusters, and I need to graph just the significant ones. The problem is that I have a single cluster (the first) with about 39,000 points in it. I need to graph all other clusters but this one.
The dbscan() creates a special data type to store all of this cluster data in. It's not indexed like a data frame would be (but maybe there is a way to represent it as such?).
I can graph the dbscan type using a basic plot() call. But, like I said, this will graph the irrelevant 39,000 points.
tl;dr:
how do I graph only specific clusters of a dbscan data type?
If you look at the help page (?dbscan) it is organized like all others into sections labeled Description, Usage, Arguments, Details and Value. The Value section describes what the function dbscan returns. In this case it is simply a list (a standard R data type) with a few components.
The cluster component is simply an integer vector, with length equal to the number of rows in your data, that indicates which cluster each observation belongs to. So you can use this vector to subset your data, extract only the clusters you want, and plot just those data points.
For example, if we use the first example from the help page:
library(fpc)   # provides dbscan()
set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n,
sd=0.2))
ds <- dbscan(x, 0.2)
we can then use the result, ds to plot only the points in clusters 1-3:
#Plot only clusters 1, 2 and 3
plot(x[ds$cluster %in% 1:3,])
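For the case in the question, where one huge cluster should be left out, the same subsetting idea can be sketched without hard-coding cluster numbers. To keep the example self-contained it uses a simulated label vector cl in place of ds$cluster; the variable names sizes, biggest and keep are made up for illustration.

```r
set.seed(1)
x  <- cbind(rnorm(200), rnorm(200))
# Stand-in for ds$cluster: one dominant cluster (1) plus a few small ones
cl <- sample(c(0, 1, 2, 3), 200, replace = TRUE,
             prob = c(0.05, 0.80, 0.10, 0.05))

# Find the label of the largest cluster
sizes   <- table(cl)
biggest <- as.integer(names(sizes)[which.max(sizes)])

# Keep every point that is not in the largest cluster
keep <- cl != biggest
plot(x[keep, ], col = cl[keep] + 1, pch = 19,
     xlab = "x1", ylab = "x2")
```

With a real result the only change should be cl <- ds$cluster. As far as I understand the fpc documentation, label 0 marks noise points, so adding & cl != 0 to keep would drop those as well.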
Without knowing the specifics of dbscan, I can recommend that you look at the function smoothScatter. It is very useful for examining the main patterns in a scatterplot when you would otherwise have too many points to make sense of the data.
Probably the most sensible way of plotting DBSCAN results is to use alpha shapes, with the radius set to the epsilon value. Alpha shapes are closely related to convex hulls, but they are not necessarily convex. The alpha radius controls the amount of non-convexity allowed.
This is quite closely related to the DBSCAN cluster model of density connected objects, and as such will give you a useful interpretation of the set.
As I'm not using R, I don't know about the alpha shape capabilities of R. There supposedly is a package called alphahull, from a quick check on Google.