Difference between two density plots - r

Is there a simple way to plot the difference between two probability density functions?
I can plot the pdfs of my data sets (both are one-dimensional vectors with roughly 11000 values) on the same plot together to get an idea of the overlap/difference but it would be more useful to me if I could see a plot of the difference.
something along the lines of the following (though this obviously doesn't work):
> plot(density(data1)-density(data2))
I'm relatively new to R and have been unable to find what I'm looking for on any of the forums.
Thanks in advance

This should work:
plot(x =density(data1, from= range(c(data1, data2))[1],
to=range(c(data1, data2))[2] )$x,
y= density(data1, from= range(c(data1, data2))[1],
to=range(c(data1, data2))[2] )$y-
density(data2, from= range(c(data1, data2))[1],
to=range(c(data1, data2))[2] )$y )
The trick is to make sure the densities have the same limits. Then you can plot their differences at the same locations.My understanding of the need for the identical limits comes from having made the error of not taking that step in answering a similar question on Rhelp several years ago. Too bad I couldn't remember the right arguments.

It looks like you need to spend a little time learning how to use R (or any other language, for that matter). Help files are your friend.
From the output of ?density :
Value [i.e. the data returned by the function]
If give.Rkern is true, the number R(K), otherwise an object with class
"density" whose underlying structure is a list containing the
following components.
x the n coordinates of the points where the density is estimated.
y the estimated density values. These will be non-negative, but can
be zero [remainder of "value" deleted for brevity]
So, do:
foo<- density(data1)
bar<- density(data2)
plot(foo$y-bar$y)

Related

How do I produce a probability histogram?

I've just started learning R, and was wondering, say I have the dataset quake, and I want to generate the probability histogram of quakes near Fiji, would the code simply be hist(quakes$lat,freq=F)?
A histogram shows the frequency or proportion of a given value out of all the values in a data set. You need a numeric vector as the x argument for hist(). There is no flat variable in quakes, but there is a lat variable. hist(quakes$lat, freq = F) would show the following:
This shows the north/south geographical distribution of earthquakes, centering around -20, and, since it is approximately normal (with a left skew) suggests that there is a mechanism for earthquake generation that centers around a specific latitude.
The best way to learn is to try. If you wonder if that would be the way to do it, try it.
You might also want to look at this tutorial on creating kernel density plots with ggplot.

Converting "ppp" to multitype

I have been running two unmarked planar point pattern data sets through a series of spatstat functions. Now I would like to use the Kcross.inhom function to describe interaction between the two, but Kcross only works with marked data, so I have combined all x-y data into one csv file and added a column that distinguishes the two. I have established the following point pattern object, but do not understand how to edit the subsequent example of Kcross for my purposes. Or, perhaps there is a better way? Thanks for your help!
# read in data & create ppp
collisionspotholes<-read.csv("cpmulti.csv")
cp<-ppp(collisionspotholes[,3],collisionspotholes[,4],c(40.50390735,40.91115166),c(-74.25262139,-73.7078596))
# synthetic example
pp <- runifpoispp(50)
pp <- pp %mark% factor(sample(0:1, npoints(pp), replace=TRUE))
K <- Kcross(pp, "0", "1")
K <- Kcross(pp, 0, 1) # equivalent
I am not really clear as to what the problem is that you are having. You seem to me to "be there" essentially. However let me, for completeness, spell out the procedure that you should follow:
Let X and Y be your two point patterns (observed, presumably, in the same window).
Put these together into a single pattern:
XY <- superimpose(X=X,Y=Y)
Note that there is no need to dick around with your csv files; it is much more efficient to use the facilities provided by spatstat.
The foregoing syntax produces a multitype point pattern with marks being a factor with levels "X" and "Y". (If you want the levels to be denoted by other symbols you can easily arrange this.)
Then just calculate the inhomogeneous Kcross function:
Ki <- Kcross.inhom(XY,"X","Y")
That is all that there is to it.
Note that the foregoing uses the default method of estimating the intensities of the two patterns, explicitly leave-one-out kernel smoothing with bandwidth chosen by bw.diggle(). There may be better ways of estimating the intensities, perhaps by fitting a parametric model. This depends on the nature of the information available to you.
Interpreting the output of Kcross.inhom() is, IMHO, subtle and difficult.
Be cautious in any conclusions that you draw.
Rolf Turner's answer is correct. However, you say that
I have combined all x-y data into one csv file and added a column that distinguishes the two.
OK, suppose the data frame is called df and it has columns named x and y giving the spatial coordinates and h which is a character vector identifying whether the corresponding point is a pothole (h="p") or a collision (h="c"). Then you could do
X <- ppp(df$x, df$y, xlim, ylim, marks=factor(df$h))
where xlim, ylim are the limits for the spatial coordinates. Or more elegantly
X <- with(df, ppp(x, y, xlim, ylim, marks=factor(h))
Note the use of factor to ensure that the marks are categorical values. Then type
X
to check that you've got a 'multitype point pattern'.
Then you can do, e.g.
K <- Kcross(X)
Ki <- Kcross.inhom(X)
Please read the help files for Kcross, Kcross.inhom for advice about how to use these functions and how to interpret the results.
Incidentally, please do not send the same question to multiple forums at the same time. That is difficult for those who have to answer.

How to cleanly use interpolation between points to generate a mean in R

I am having issues trying to generate a code that will cleanly produce a mean (specifically a weighted average) based on a simple plot of points using interpolation.
For Example;
ex=c(1,2,3,4,5)
why=c(2,5,9,15,24)
This shows the kind of information I am working with.
plot(ex, why, type="o")
At this point, I want to actually have each point "binned" so the lines between them are straight. To do this, I have been adding points to the x values manually in excel as (x+0.01).
This is the new output:
why=c(2,2,5,5,9,9,15,15,24,24)
ex=c(1,2,2.01,3,3.01,4,4.01,5,5.01,6)
plot(ex, why, type="o")
So this is where my question comes in to play. I have to do this many times and do not want to generate a ton of new vectors and objects. To get a weighted average, I have been interpolating y values for increments of x at 0.01 using interpolation into a new object. I am then able to go into this new object and get a mean when a point falls between the actual ex values, i.e.
mean(newy[1:245])
Because I made new y values for 100 increments of x that (basically) follow a straight line, I am getting a weighted average here for x= 1 to 2.45.
Is there an easier and more elegant way to embed the interpolate code into the mean code so I could just say "average of interpolated y for nonreal x to nonreal x?"
It doesn't do exactly what you want, but you should consider the stepfun function -- this creates a step function out of two series.
plot(stepfun(ex[-1], why))
stepfun is handy because it gives you a function defined over that interval, so you can easily interpolate just by evaluating anywhere. The downside to it is that it is not strictly defined on the range given (which is why we have to cut off the first value in ex).
Based on your second plotting example, I think you are probably looking for this:
library(ggplot2)
qplot(ex, why, geom="step")
this gives:
Or if you want the line to go vertical first, you can use:
qplot(ex, why, geom="step", direction = "vh")
which gives:

How to generate medoid plots

Hi I am using partitioning around medoids algorithm for clustering using the pam function in clustering package. I have 4 attributes in the dataset that I clustered and they seem to give me around 6 clusters and I want to generate a a plot of these clusters across those 4 attributes like this 1: http://www.flickr.com/photos/52099123#N06/7036003411/in/photostream/lightbox/ "Centroid plot"
But the only way I can draw the clustering result is either using a dendrogram or using
plot (data, col = result$clustering) command which seems to generate a plot similar to this
[2] : http://www.flickr.com/photos/52099123#N06/7036003777/in/photostream "pam results".
Although the first image is a centroid plot I am wondering if there are any tools available in R to do the same with a medoid plot Note that it also prints the size of each cluster in the plot. It would be great to know if there are any packages/solutions available in R that facilitate to do this or if not what should be a good starting point in order to achieve plots similar to that in Image 1.
Thanks
Hi All,I was trying to work out the problem the way Joran told but I think I did not understand it correctly and have not done it the right way as it is supposed to be done. Anyway this is what I have done so far. Following is how the file looks like that I tried to cluster
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155
following is the Pam clustering output
GRMZM2G181227
1
GRMZM2G146885
2
GRMZM2G139463
2
GRMZM2G015295
2
GRMZM2G111909
2
GRMZM2G078097
3
GRMZM2G450498
3
GRMZM2G413652
2
GRMZM2G090087
2
AC217811.3_FG003
2
Using the above two files I generated a third file that somewhat looks like this and has cluster information in the form of cluster type K1,K2,etc
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip Cluster_type
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722 K1
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648 K2
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344 K2
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755 K2
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883 K2
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945 K3
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794 K3
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949 K2
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155 K2
I certainly don't think that this is the file that joran would have wanted me to create but I could not think of anything else thus I ran lattice on the above file using the following code.
clusres<- read.table("clusinput.txt",header=TRUE,sep="\t");
jpeg(filename = "clusplot.jpeg", width = 800, height = 1078,
pointsize = 12, quality = 100, bg = "white",res=100);
parallel(~clusres[2:5]|Cluster_type,clusres,horizontal.axis=FALSE);
dev.off();
and I get a picture like this
Since I want one single line as the representative of the whole cluster at four different points this output is wrong moreover I tried playing with lattice but I can not figure out how to make it accept the Rpkm values as the X coordinate It always seems to plot so many lines against a maximum or minimum value at the Y coordinate which I don't understand what it is.
It will be great if anybody can help me out. Sorry If my question still seems absurd to you.
I do not know of any pre-built functions that generate the plot you indicate, which looks to me like a sort of parallel coordinates plot.
But generating such a plot would be a fairly trivial exercise.
Add a column of cluster labels (K1,K2, etc.) to your original data set, based on your clustering algorithm's output.
Use one of the many, many tools in R for aggregating data (plyr, aggregate, etc.) to calculate the relevant summary statistics by cluster on each of the four variables. (You haven't said what the first graph is actually plotting. Mean and sd? Median and MAD?)
Since you want the plots split into six separate panels, or facets, you will probably want to plot the data using either ggplot or lattice, both of which provide excellent support for creating the same plot, split across a single grouping vector (i.e. the clusters in your case).
But that's about as specific as anyone can get, given that you've provided so little information (i.e. no minimal runnable example, as recommended here).
How about using clusplot from package cluster with partitioning around medoids? Here is a simple example (from the example section):
require(cluster)
#generate 25 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
cbind(rnorm(15,5,0.5), rnorm(15,5,0.5)))
clusplot(pam(x, 2)) #`pam` does you partitioning

Graphing results of dbscan in R

Your comments, suggestions, or solutions are/will be greatly appreciated, thank you.
I'm using the fpc package in R to do a dbscan analysis of some very dense data (3 sets of 40,000 points between the range -3, 6).
I've found some clusters, and I need to graph just the significant ones. The problem is that I have a single cluster (the first) with about 39,000 points in it. I need to graph all other clusters but this one.
The dbscan() creates a special data type to store all of this cluster data in. It's not indexed like a data frame would be (but maybe there is a way to represent it as such?).
I can graph the dbscan type using a basic plot() call. But, like I said, this will graph the irrelevant 39,000 points.
tl;dr:
how do I graph only specific clusters of a dbscan data type?
If you look at the help page (?dbscan) it is organized like all others into sections labeled Description, Usage, Arguments, Details and Value. The Value section describes what the function dbscan returns. In this case it is simply a list (a standard R data type) with a few components.
The cluster component is simply an integer vector whose length it equal to the number of rows in your data that indicates which cluster each observation is a member of. So you can use this vector to subset your data to extract only those clusters you'd like and then plot just those data points.
For example, if we use the first example from the help page:
set.seed(665544)
n <- 600
x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n,
sd=0.2))
ds <- dbscan(x, 0.2)
we can then use the result, ds to plot only the points in clusters 1-3:
#Plot only clusters 1, 2 and 3
plot(x[ds$cluster %in% 1:3,])
Without knowing the specifics of dbscan, I can recommend that you look at the function smoothScatter. It it very useful for examining the main patterns in a scatterplot when you otherwise would have too many points to make sense of the data.
The probably most sensible way of plotting DBSCAN results is using alpha shapes, with the radius set to the epsilon value. Alpha shapes are closely related to convex hulls, but they are not necessarily convex. The alpha radius controls the amount of non-convexity allowed.
This is quite closely related to the DBSCAN cluster model of density connected objects, and as such will give you a useful interpretation of the set.
As I'm not using R, I don't know about the alpha shape capabilities of R. There supposedly is a package called alphahull, from a quick check on Google.

Resources