I came across principal component analysis (PCA) and noticed the different values returned by different functions in R. The intention of this question is to disambiguate the output of each; I haven't found a satisfactory answer as to why these functions return different values. The functions compared are: stats::princomp(), stats::prcomp(), psych::principal(), and FactoMineR::PCA(). The data set was scaled and centered for the sake of comparison, and all functions were set to return 4 components; however, only the first two PCs are shown here for brevity.
Below is the code of an MWE to set up the case. Please feel free to report any other R function whose output would be helpful to compare here in one place.
princompPCA <- princomp(USArrests, cor = TRUE)
prcompPCA <- prcomp(USArrests,scale.=TRUE)
library(psych)
principalPCA <- principal(USArrests, nfactors = 4, scores = TRUE, rotate = "none", scale = TRUE)
library(FactoMineR)
fmrPCA <- PCA(USArrests, ncp = 4, graph = FALSE) # variables are scaled by default
# now the first two PCs from each package into one data frame
dfComp <- cbind.data.frame(princompPCA$scores[, 1:2], prcompPCA$x[, 1:2],
                           principalPCA$scores[, 1:2], fmrPCA$ind$coord[, 1:2])
names(dfComp) <- c("princompDim1","princompDim2","prcompDim1","prcompDim2","principalDim1","principalDim2","fmrDim1","fmrDim2")
head(dfComp)
Output:
princompDim1 princompDim2 prcompDim1 prcompDim2 principalDim1 principalDim2 fmrDim1 fmrDim2
Alabama -0.9855659 1.1333924 -0.9756604 1.1220012 0.61951483 -1.1277874 0.9855659 -1.1333924
Alaska -1.9501378 1.0732133 -1.9305379 1.0624269 1.22583308 -1.0679059 1.9501378 -1.0732133
Arizona -1.7631635 -0.7459568 -1.7454429 -0.7384595 1.10830334 0.7422678 1.7631635 0.7459568
Arkansas 0.1414203 1.1197968 0.1399989 1.1085423 -0.08889509 -1.1142591 -0.1414203 -1.1197968
California -2.5239801 -1.5429340 -2.4986128 -1.5274267 1.58654347 1.5353037 2.5239801 1.5429340
Colorado -1.5145629 -0.9875551 -1.4993407 -0.9776297 0.95203595 0.9826713 1.5145629 0.9875551
I noticed that the output of stats::princomp() is exactly the same as that of FactoMineR::PCA() except for the inverted signs. Any idea why the signs are mirrored? Both of these outputs are also close to stats::prcomp(), which I assumed was a minor floating-point issue. But psych::principal() is noticeably different from the others. Could it be due to rotation differences between the mentioned functions? Any explanation for these differences would be much appreciated.
The outcome of a PCA is a set of vectors along an axis. The numbers with the sign inverted are simply vectors pointing in the other direction along the same axis, so the results you get are the same.
Other differences could be due to a different way of calculating the principal components, i.e. using the eigenvectors of a correlation matrix versus using singular value decomposition. But I'm just speculating here.
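To make that speculation concrete, here is a small sketch of my own (not from the original question) showing that the eigen route and the SVD route give the same axes up to sign, and that the small princomp/prcomp magnitude gap is the n versus n-1 divisor rather than floating point:
Xs  <- scale(USArrests)                      # centered and scaled with divisor n-1
eig <- eigen(cor(USArrests))
scoresEigen <- Xs %*% eig$vectors            # eigen route (princomp, FactoMineR)
sv <- svd(Xs)
scoresSVD <- sv$u %*% diag(sv$d)             # SVD route (what prcomp does)
max(abs(abs(scoresEigen) - abs(scoresSVD)))  # ~0: identical up to sign
# princomp() standardises with divisor n, prcomp() with n-1, so for this data
# their scores differ by the constant factor sqrt(n/(n-1)), not rounding error
n <- nrow(USArrests)
head(princompPCA$scores[, 1] / prcompPCA$x[, 1])  # all ~ 1.0102
sqrt(n / (n - 1))                                 # 1.010153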
I was looking for the same info and found this link helpful:
https://groups.google.com/forum/#!topic/factominer-users/BRN8jRm-_EM
FactoMineR outputs PCA coordinates, not loadings, which confused me for a while.
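For what it's worth, a sketch of my own (assuming the objects from the question's code are still in the workspace): FactoMineR's $var$coord holds variable coordinates, i.e. the correlations between variables and components, which are the loadings rescaled by the square root of each eigenvalue.
ev <- fmrPCA$eig[1:4, "eigenvalue"]
rescaled <- unclass(princompPCA$loadings)[, 1:4] %*% diag(sqrt(ev))
round(abs(rescaled) - abs(fmrPCA$var$coord), 10)  # ~0: equal up to sign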
I wonder if it's possible to use the overlay function of the raster package with a function that looks values up in a list of vectors to perform calculations based on two rasters. So far I have only seen examples of functions performing raster algebra without calling external data.
Below I provide some toy code to illustrate what I'm trying to do, but I can also provide some context on my real problem. Specifically, I need to classify each pixel as either zero (absence) or one (presence) of housing. The likelihood of housing presence is related to the percentage of built-up area covering the pixel (raster 'r1' below) and the land cover type (raster 'r2' below). This likelihood is known from reference data, which is stored in a list like 'probs' below.
library(raster)
# continuous and categorical maps
r1<-r2<-raster()
r1[]<-round(runif(ncell(r1))*100)
r2[]<-1
r2[1:30000]<-2
# probability of housing presence in each stratum
prob1<-1:100/100
prob2<-log(1:100)/max(log(1:100))
# list of probabilities to be used in overlay
probs<-list(prob1,prob2)
# overlay - not working
o<-overlay(r1,r2,fun=function(x,y,...){return(rbinom(n=1, size=1, prob=probs[[y]][x]))})
The error is:
cannot use this formula, probably because it is not vectorized
As an alternative to the toy code above, I thought I could process each categorical class separately and use the function calc rather than overlay (see below). However, this is extremely slow (if not impossible) for large rasters, so I thought overlay would be better.
# alternative: loop across categorical classes (extremely slow for large rasters)
r<-list()
for(i in 1:2){
stratum<-r2
stratum[Which(stratum !=i)]<-NA
r[[i]]<-calc(r1, fun=function(x,...){return(rbinom(n=1, size=1, prob=probs[[i]][x]))})
r[[i]]<-mask(r[[i]],stratum)
}
r<-stack(r)
r<-sum(r,na.rm=T)
par(mfrow=c(1,3))
plot(r1)
plot(r2)
plot(r)
I ran into this same error recently. My solution was to Vectorize the function that is passed to overlay: Vectorize creates a wrapper of your function with mapply so that overlay can use it. I was able to get your code to run by using Vectorize (see below).
library(raster)
# continuous and categorical maps
r1<-r2<-raster()
r1[]<-round(runif(ncell(r1))*100)
r2[]<-1
r2[1:30000]<-2
# probability of housing presence in each stratum
prob1<-1:100/100
prob2<-log(1:100)/max(log(1:100))
# list of probabilities to be used in overlay
probs<-list(prob1,prob2)
## edits below
# define function
f <- function(x,y,...){return(rbinom(n=1, size=1, prob=probs[[y]][x]))}
# run function using overlay with Vectorize
o <- overlay(r1,r2,fun=Vectorize(f))
This produced a raster layer of 0/1 values. It also produced the following warning:
In rbinom(n = 1, size = 1, prob = probs[[y]][x]) : NAs produced
I am not sure whether this warning would be problematic with your real data.
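My guess (unverified against your real data) is that the NAs come from zeros in r1: probs[[y]][0] is a zero-length vector, and rbinom() then returns NA with that warning. A guard along these lines avoids it, here treating 0% built-up area as certain absence:
f2 <- function(x, y, ...) {
  if (is.na(x) || is.na(y) || x < 1) return(0)  # 0% built-up: treat as absence
  rbinom(n = 1, size = 1, prob = probs[[y]][x])
}
o2 <- overlay(r1, r2, fun = Vectorize(f2))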
You can also refer to the second answer here for another worked example.
I have been running two unmarked planar point pattern data sets through a series of spatstat functions. Now I would like to use the Kcross.inhom function to describe the interaction between the two, but Kcross only works with marked data, so I have combined all the x-y data into one csv file and added a column that distinguishes the two patterns. I have created the following point pattern object, but I do not understand how to adapt the subsequent example of Kcross for my purposes. Or perhaps there is a better way? Thanks for your help!
# read in data & create ppp
collisionspotholes<-read.csv("cpmulti.csv")
cp<-ppp(collisionspotholes[,3],collisionspotholes[,4],c(40.50390735,40.91115166),c(-74.25262139,-73.7078596))
# synthetic example
pp <- runifpoispp(50)
pp <- pp %mark% factor(sample(0:1, npoints(pp), replace=TRUE))
K <- Kcross(pp, "0", "1")
K <- Kcross(pp, 0, 1) # equivalent
I am not really clear on what the problem is that you are having; you seem to me to essentially "be there" already. However, let me, for completeness, spell out the procedure that you should follow:
Let X and Y be your two point patterns (observed, presumably, in the same window).
Put these together into a single pattern:
XY <- superimpose(X=X,Y=Y)
Note that there is no need to dick around with your csv files; it is much more efficient to use the facilities provided by spatstat.
The foregoing syntax produces a multitype point pattern with marks being a factor with levels "X" and "Y". (If you want the levels to be denoted by other symbols you can easily arrange this.)
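For instance (a sketch of mine, with made-up level names), naming the arguments to superimpose() sets the mark levels directly:
XY <- superimpose(collisions = X, potholes = Y)
levels(marks(XY))  # "collisions" "potholes"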
Then just calculate the inhomogeneous Kcross function:
Ki <- Kcross.inhom(XY,"X","Y")
That is all that there is to it.
Note that the foregoing uses the default method of estimating the intensities of the two patterns, namely leave-one-out kernel smoothing with bandwidth chosen by bw.diggle(). There may be better ways of estimating the intensities, perhaps by fitting a parametric model; this depends on the nature of the information available to you.
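For example, here is a sketch of mine of supplying the intensities explicitly rather than relying on the default leave-one-out estimate; each pattern is kernel-smoothed separately with a bw.diggle() bandwidth:
lamX <- density(split(XY)$X, sigma = bw.diggle(split(XY)$X))
lamY <- density(split(XY)$Y, sigma = bw.diggle(split(XY)$Y))
Ki2  <- Kcross.inhom(XY, "X", "Y", lambdaI = lamX, lambdaJ = lamY)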
Interpreting the output of Kcross.inhom() is, IMHO, subtle and difficult.
Be cautious in any conclusions that you draw.
Rolf Turner's answer is correct. However, you say that
I have combined all x-y data into one csv file and added a column that distinguishes the two.
OK, suppose the data frame is called df and has columns named x and y giving the spatial coordinates, and h, a character vector identifying whether the corresponding point is a pothole (h="p") or a collision (h="c"). Then you could do
X <- ppp(df$x, df$y, xlim, ylim, marks=factor(df$h))
where xlim, ylim are the limits for the spatial coordinates. Or more elegantly
X <- with(df, ppp(x, y, xlim, ylim, marks=factor(h)))
Note the use of factor to ensure that the marks are categorical values. Then type
X
to check that you've got a 'multitype point pattern'.
Then you can do, e.g.
K <- Kcross(X)
Ki <- Kcross.inhom(X)
Please read the help files for Kcross, Kcross.inhom for advice about how to use these functions and how to interpret the results.
Incidentally, please do not send the same question to multiple forums at the same time. That is difficult for those who have to answer.
What are some good kriging/interpolation ideas/options that will allow heavily-weighted points to bleed over lightly-weighted points on a plotted R map?
The state of Connecticut has eight counties. I found the centroid of each and want to plot the poverty rate of each of these eight counties. Three of the counties are very populated (about 1 million people each) and the other five are sparsely populated (about 100,000 people each). Since the three densely-populated counties have more than 90% of the total state population, I would like those three counties to completely "overwhelm" the map and impact other points across the county borders.
The Krig function in the R fields package has a lot of parameters and covariance functions that can be called, but I'm not sure where to start.
Here is reproducible code to quickly produce a hard-bordered map and then three differently-weighted maps. Hopefully I can just make changes to this code, but perhaps it requires something more complex, like the geoRglm package? Two of the three weighted maps look almost identical, despite one being weighted 10x as heavily as the other.
https://raw.githubusercontent.com/davidbrae/swmap/master/20141001%20how%20to%20modify%20the%20Krig%20function%20so%20a%20huge%20weight%20overwhelms%20nearby%20points.R
Thanks!
Edit: here's a picture example of the behavior I want: [image of the desired map]
Disclaimer: I am not an expert on kriging. Kriging is complex and requires a good understanding of the underlying data, the method, and the purpose in order to achieve a correct result. You may wish to get input from @whuber [on the GIS Stack Exchange, or contact him through his website (http://www.quantdec.com/quals/quals.htm)] or another expert you know.
That said, if you just want to achieve the visual effect you requested and are not using this for some sort of statistical analysis, I think there are some relatively simple solutions.
EDIT:
As you commented, though the suggestions below to use the theta and smoothness arguments do even out the prediction surface, they apply equally to all measurements and thus do not extend the "sphere of influence" of the more densely populated counties relative to the less densely populated ones. After further consideration, I think there are two ways to achieve this: by altering the covariance function to depend on population density, or by using weights, as you have. Your weighting approach, as I wrote below, alters the error term of the kriging function; that is, it inversely scales the nugget variance.
As you can see in a typical semivariogram, the nugget is essentially the y-intercept, i.e. the error between measurements at the same location. Weights affect the nugget variance (sigma2) as sigma2/weight, so greater weights mean less error at small distances. This does not, however, change the shape of the semivariance function or have much effect on the range or sill.
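Here is a toy illustration of mine (made-up data, not the poster's) of that weight effect: the up-weighted point's prediction hugs its observed value more tightly because its nugget variance is scaled down as sigma2/weight.
library(fields)
set.seed(42)
xy <- matrix(runif(40), ncol = 2)
z  <- sin(6 * xy[, 1]) + rnorm(20, sd = 0.3)
w  <- rep(1, 20); w[5] <- 100                 # up-weight one observation
fit.eq <- Krig(xy, z, Covariance = "Matern", theta = 0.3)
fit.wt <- Krig(xy, z, Covariance = "Matern", theta = 0.3, weights = w)
rbind(observed    = z[5],
      equal.wt    = predict(fit.eq, xy[5, , drop = FALSE]),
      up.weighted = predict(fit.wt, xy[5, , drop = FALSE]))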
I think the best solution would be to have your covariance function depend on population; however, I'm not sure how to accomplish that, and I don't see any arguments to Krig to do so. I tried defining my own covariance function as in the Krig examples, but only got errors.
Sorry I couldn't help more!
Another great resource for understanding kriging is: http://www.epa.gov/airtrends/specialstudies/dsisurfaces.pdf
As I said in my comment, the sill and nugget values, as well as the range of the semivariogram, are things you can alter to affect the smoothing. By specifying weights in the call to Krig, you are altering the variance of the measurement errors. That is, in normal use, weights are expected to be proportional to the accuracy of the measurements, so that higher weights represent more accurate measurements. This isn't actually true of your data, but it may be giving you the effect you desire.
To alter the way your data are interpolated, you can adjust two (of many) parameters in the simple Krig call you are using: theta and smoothness. theta adjusts the semivariance range, meaning that measured points farther away contribute more to the estimates as you increase theta. Your data range is
range <- data.frame(lon=range(ct.data$lon),lat=range(ct.data$lat))
range[2,]-range[1,]
lon lat
2 1.383717 0.6300484
so, your measurement points vary by ~1.4 degrees lon and ~0.6 degrees lat. Thus, you can play with specifying your theta value in that range to see how that affects your result. In general, a larger theta leads to more smoothing since you are drawing from more values for each prediction.
Krig.output.wt <- Krig( cbind(ct.data$lon,ct.data$lat) , ct.data$county.poverty.rate ,
weights=c( size , 1 , 1 , 1 , 1 , size , size , 1 ),Covariance="Matern", theta=.8)
r <- interpolate(ras, Krig.output.wt)
r <- mask(r, ct.map)
plot(r, col=colRamp(100) ,axes=FALSE,legend=FALSE)
title(main="Theta = 0.8", outer = FALSE)
points(cbind(ct.data$lon,ct.data$lat))
text(ct.data$lon, ct.data$lat-0.05, ct.data$NAME, cex=0.5)
Gives: [map titled "Theta = 0.8"]
Krig.output.wt <- Krig( cbind(ct.data$lon,ct.data$lat) , ct.data$county.poverty.rate ,
weights=c( size , 1 , 1 , 1 , 1 , size , size , 1 ),Covariance="Matern", theta=1.6)
r <- interpolate(ras, Krig.output.wt)
r <- mask(r, ct.map)
plot(r, col=colRamp(100) ,axes=FALSE,legend=FALSE)
title(main="Theta = 1.6", outer = FALSE)
points(cbind(ct.data$lon,ct.data$lat))
text(ct.data$lon, ct.data$lat-0.05, ct.data$NAME, cex=0.5)
Gives: [map titled "Theta = 1.6"]
Adding the smoothness argument will change the order of the Matern covariance function used to smooth your predictions. The default is 0.5, which corresponds to an exponential covariance.
Krig.output.wt <- Krig( cbind(ct.data$lon,ct.data$lat) , ct.data$county.poverty.rate ,
weights=c( size , 1 , 1 , 1 , 1 , size , size , 1 ),
Covariance="Matern", smoothness = 0.6)
r <- interpolate(ras, Krig.output.wt)
r <- mask(r, ct.map)
plot(r, col=colRamp(100) ,axes=FALSE,legend=FALSE)
title(main="Theta unspecified; Smoothness = 0.6", outer = FALSE)
points(cbind(ct.data$lon,ct.data$lat))
text(ct.data$lon, ct.data$lat-0.05, ct.data$NAME, cex=0.5)
Gives: [map titled "Theta unspecified; Smoothness = 0.6"]
This should give you a start and some options, but you should look at the manual for fields. It is pretty well-written and explains the arguments well.
Also, if this is in any way quantitative, I would highly recommend talking to someone with significant spatial-statistics know-how!
Kriging is not what you want. (It is a statistical method for accurate--not distorted!--interpolation of data. It requires preliminary analysis of the data--of which you do not have anywhere near enough for this purpose--and cannot accomplish the desired map distortion.)
The example and the references to "bleed over" suggest considering an anamorphic map, or area cartogram. This is a map that expands and shrinks the areas of the county polygons so that they reflect their relative populations while retaining their shapes. The link (to the GIS Stack Exchange site) explains and illustrates this idea. Although its answers are less than satisfying, a search of that site will reveal some effective solutions.
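As a starting point, here is a sketch of mine of the cartogram idea using the cartogram and sf packages (neither mentioned above), with sf's bundled North Carolina counties standing in for Connecticut; the BIR74 count plays the role of population:
library(sf)
library(cartogram)
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc <- st_transform(nc, 32119)                    # cartogram needs projected coordinates
nc_cart <- cartogram_cont(nc, weight = "BIR74")  # inflate/deflate county areas
par(mfrow = c(1, 2))
plot(st_geometry(nc), main = "original")
plot(st_geometry(nc_cart), main = "area cartogram")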
Lots of interesting comments and leads above.
I took a look at the Harvard dialect survey first, to get a sense of what you are trying to do. I must say, really cool maps. And before I start in on what I came up with: I've looked at your work on survey analysis before and have learned quite a few tricks. Thanks.
My first thought was that if you want to do spatial smoothing by way of kernel density estimation, then you need to be thinking in terms of point process models. I'm sure there are other ways, but that's where I went.
So what I do below is grab a very generic US map and convert it into something I can use as a sampling window. Then I create a random sample of points within that region; just pretend those are your centroids. After that I attach random values to those points and plot it up.
I just wanted to test this conceptually, which is why I didn't go through the extra steps to grab CBSAs, and sorry for not projecting, but I think these are the fundamentals. Oh, and the smoothing in the dialect study is being done over the whole country, I think; that is, the author is not stratifying his smoothing procedure within polygons, so I just added state boundaries at the end.
Code:
library(sp)
library(spatstat)
library(RColorBrewer)
library(maps)
library(maptools)
# grab us map from R maps package
usMap <- map("usa")
usIds <- usMap$names
# convert to spatial polygons so this can be used as a window below
usMapPoly <- map2SpatialPolygons(usMap,IDs=usIds)
# just select us with no islands
usMapPoly <- usMapPoly[names(usMapPoly)=="main",]
# create a random sample of points on which to smooth over within the map
pts <- spsample(usMapPoly, n=250, type='random')
# just for a quick check of the map and sampling locations
plot(usMapPoly)
points(pts)
# create values associated with the points; be sure to play around with
# these after you get the map, it's fun
vals <-rnorm(250,100,25)
valWeights <- vals/sum(vals)
ptsCords <- data.frame(pts@coords)
# create window for the point pattern object (ppp) created below
usWindow <- as.owin(usMapPoly)
# create spatial point pattern object
usPPP <- ppp(ptsCords$x,ptsCords$y,marks=vals,window=usWindow)
# create colour ramp
col <- colorRampPalette(brewer.pal(9,"Reds"))(20)
# the plots; here is where the gaussian kernel density estimation magic happens
# if you want a continuous legend on one of the sides get rid of ribbon=FALSE
# and be sure to play around with sigma
plot(Smooth(usPPP,sigma=3,weights=valWeights),col=col,main=NA,ribbon=FALSE)
map("state",add=TRUE,fill=FALSE)
Example with no weights: [map]
Example with my trivial weights: [map]
There is obviously a lot of work between this and your goal of making this type of map reproducible at various levels of spatial aggregation and sample data, but good luck, it seems like a cool project.
P.S. Initially I did not use any weighting, but I suppose you could provide weights directly to the Smooth function. Two example maps above.
Hi, I am using the partitioning around medoids algorithm for clustering, via the pam function in the cluster package. I have 4 attributes in the dataset that I clustered, and they seem to give me around 6 clusters. I want to generate a plot of these clusters across those 4 attributes, like this "centroid plot": http://www.flickr.com/photos/52099123@N06/7036003411/in/photostream/lightbox/
But the only way I can draw the clustering result is either using a dendrogram or using the command plot(data, col = result$clustering), which seems to generate a plot similar to these "pam results": http://www.flickr.com/photos/52099123@N06/7036003777/in/photostream
Although the first image is a centroid plot, I am wondering if there are any tools available in R to do the same with a medoid plot; note that the first plot also prints the size of each cluster. It would be great to know if there are any packages/solutions in R that facilitate this, or if not, what would be a good starting point for achieving plots similar to the one in image 1.
Thanks
Hi all, I was trying to work out the problem the way joran suggested, but I think I did not understand it correctly and have not done it the right way. Anyway, this is what I have done so far. The following is what the file I tried to cluster looks like:
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155
The following is the PAM clustering output:
GRMZM2G181227    1
GRMZM2G146885    2
GRMZM2G139463    2
GRMZM2G015295    2
GRMZM2G111909    2
GRMZM2G078097    3
GRMZM2G450498    3
GRMZM2G413652    2
GRMZM2G090087    2
AC217811.3_FG003 2
Using the above two files, I generated a third file that looks like this, with the cluster information added as a cluster label K1, K2, etc.:
geneID RPKM-base RPKM-1cm RPKM+4cm RPKMtip Cluster_type
GRMZM2G181227 3.412444267 3.16437442 1.287909035 0.037320722 K1
GRMZM2G146885 14.17287135 11.3577013 2.778514642 2.226818648 K2
GRMZM2G139463 6.866752401 5.373925806 1.388843962 1.062745344 K2
GRMZM2G015295 1349.446347 447.4635291 29.43627879 29.2643755 K2
GRMZM2G111909 47.95903081 27.5256729 1.656555758 0.949824883 K2
GRMZM2G078097 4.433627458 0.928492841 0.063329249 0.034255945 K3
GRMZM2G450498 36.15941083 9.45235616 0.700105077 0.194759794 K3
GRMZM2G413652 25.06985426 15.91342458 5.372151214 3.618914949 K2
GRMZM2G090087 21.00891969 18.02318412 17.49531186 10.74302155 K2
I don't think this is the file joran wanted me to create, but I could not think of anything else, so I ran lattice on the above file using the following code:
clusres<- read.table("clusinput.txt",header=TRUE,sep="\t");
jpeg(filename = "clusplot.jpeg", width = 800, height = 1078,
pointsize = 12, quality = 100, bg = "white",res=100);
parallel(~clusres[2:5]|Cluster_type,clusres,horizontal.axis=FALSE);
dev.off();
and I get a picture like this: [lattice parallel plot with many lines in each cluster panel]
Since I want one single line as the representative of the whole cluster at the four different points, this output is wrong. Moreover, I tried playing with lattice, but I cannot figure out how to make it accept the RPKM values as the x coordinate; it always seems to plot many lines against a maximum or minimum value on the y coordinate, which I don't understand.
It would be great if anybody could help me out. Sorry if my question still seems absurd to you.
I do not know of any pre-built functions that generate the plot you indicate, which looks to me like a sort of parallel coordinates plot.
But generating such a plot would be a fairly trivial exercise.
1. Add a column of cluster labels (K1, K2, etc.) to your original data set, based on your clustering algorithm's output.
2. Use one of the many, many tools in R for aggregating data (plyr, aggregate, etc.) to calculate the relevant summary statistics by cluster on each of the four variables. (You haven't said what the first graph is actually plotting. Mean and sd? Median and MAD?)
3. Since you want the plots split into six separate panels, or facets, you will probably want to plot the data using either ggplot or lattice, both of which provide excellent support for creating the same plot split across a single grouping vector (i.e. the clusters in your case); see the sketch below.
But that's about as specific as anyone can get, given that you've provided so little information (i.e. no minimal runnable example, as recommended here).
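That said, here is a hedged sketch of steps 1-3 (mine, untested against your data), assuming a data frame like the clusres built in the follow-up above, where columns 2:5 hold the four RPKM measurements and Cluster_type holds the cluster label:
library(lattice)
# cluster means on each of the four variables (step 2)
clusMeans <- aggregate(clusres[, 2:5],
                       by = list(Cluster_type = clusres$Cluster_type),
                       FUN = mean)
# one line per cluster across the four measurements (step 3)
parallel(~clusMeans[, 2:5], data = clusMeans,
         groups = Cluster_type, horizontal.axis = FALSE, auto.key = TRUE)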
How about using clusplot from the cluster package with partitioning around medoids? Here is a simple example (from the examples section of its help page):
require(cluster)
#generate 25 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
cbind(rnorm(15,5,0.5), rnorm(15,5,0.5)))
clusplot(pam(x, 2)) # pam does your partitioning
Is there a simple way to plot the difference between two probability density functions?
I can plot the pdfs of my data sets (both are one-dimensional vectors with roughly 11000 values) on the same plot together to get an idea of the overlap/difference but it would be more useful to me if I could see a plot of the difference.
Something along the lines of the following (though this obviously doesn't work):
> plot(density(data1)-density(data2))
I'm relatively new to R and have been unable to find what I'm looking for on any of the forums.
Thanks in advance
This should work:
rng <- range(c(data1, data2))
d1  <- density(data1, from = rng[1], to = rng[2])
d2  <- density(data2, from = rng[1], to = rng[2])
plot(x = d1$x, y = d1$y - d2$y)
The trick is to make sure the densities have the same limits; then you can plot their differences at the same locations. My understanding of the need for identical limits comes from having made the error of not taking that step when answering a similar question on R-help several years ago. Too bad I couldn't remember the right arguments back then.
It looks like you need to spend a little time learning how to use R (or any other language, for that matter). Help files are your friend.
From the output of ?density :
Value [i.e. the data returned by the function]
If give.Rkern is true, the number R(K), otherwise an object with class
"density" whose underlying structure is a list containing the
following components.
x the n coordinates of the points where the density is estimated.
y the estimated density values. These will be non-negative, but can
be zero [remainder of "value" deleted for brevity]
So, do:
foo <- density(data1)
bar <- density(data2)
# caution: this assumes both densities are evaluated on the same x grid;
# to guarantee that, pass the same 'from' and 'to' as in the answer above
plot(foo$y - bar$y)