Inner Products in Principal Component Analysis in R

For this, I am using the banknote data in R given by data(banknote), which shows measurements of 200 Swiss banknotes. My data matrix is called X, and I have performed PCA by pca.banknote<-prcomp(X).
I am trying to show that the inner product between each observation X[i,] and Principal Component Loading 3 given by pca.banknote$rot[,3] is the same as the 3rd PC scores given by pca.banknote$x[,3].
I have attempted:
all.equal(as.matrix(X) %*% pca.banknote$rotation[,3], as.matrix(pca.banknote$x[,3]), check.attributes=FALSE)
but this simply gives a mean difference of 1, i.e. they are not equal.
Do I need to change the format of one of these to a vector/data frame etc for this to work? Or any ideas at all as to where the issue is?
Any feedback would be much appreciated. Thanks.
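A likely cause, assuming prcomp() was called with its defaults: prcomp() centers each column before rotating, so the scores in $x are inner products with the centered observations, not with the raw rows of X. A minimal, untested sketch of the check under that assumption:

# Center X using the column means stored by prcomp(), then compare.
Xc <- scale(X, center = pca.banknote$center, scale = FALSE)
all.equal(as.vector(Xc %*% pca.banknote$rotation[, 3]),
          as.vector(pca.banknote$x[, 3]))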

Related

Extracting data from lower layers in a Rasterbrick

So I'm extracting data from a rasterbrick I made using the method from this question: How to extract data from a RasterBrick?
In addition to obtaining the data from the layer given by the date, I want to extract the data from the months prior. My best guess is to do something like this:
sapply(1:nrow(pts), function(i){extract(b, cbind(pts$x[i],pts$y[i]), layer=pts$layerindex[i-1], nl=1)})
The extraction should then look at layerindex i-1, which should give the data for one month earlier. So a point with layerindex = 5 should look at layer 5 - 1 = 4.
However, it doesn't do this and seems to give either some random number or a duplicate from months prior. What would be the correct way to go about this?
Your code is taking the value from the layer of the previous point, not the previous layer.
To see this, imagine we are looking at the point in row 2 (i = 2). The code that selects the layer is pts$layerindex[i-1], which is pts$layerindex[1], i.e. the layer of the point in row 1.
The fix is easy enough. For clarity I will write the function separately:
foo = function(i) extract(b, cbind(pts$x[i],pts$y[i]), layer=pts$layerindex[i]-1, nl=1)
sapply(1:nrow(pts), foo)
I have not tested it, but this should be all.
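If pts is large, an untested, vectorised alternative (assuming every layerindex is at least 2, so that layerindex - 1 is a valid layer) is to extract all layers once and then pick the previous layer per point by matrix indexing:

library(raster)
# One row per point, one column per layer of the brick.
vals <- extract(b, cbind(pts$x, pts$y))
# For each point, take the value from the layer before its own layerindex.
prev_month <- vals[cbind(seq_len(nrow(pts)), pts$layerindex - 1)]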

prcomp( .. ,retx=TRUE), do I get the new data to train over?

I am having some issues in interpreting the results from prcomp().
Say I have a centered and scaled data.table called dat, with N columns and M rows. Every column represents a feature and every row a record. I also have an M-dimensional vector of outcomes Y.
I wanted to know what the PCA of this system says. So I just executed:
dat.pca=prcomp(dat,retx=TRUE)
By the elbow method I decided to retain 5 PCA modes, accounting for 90% of the variance. Then, I got the following data.table:
dat.pcadata=as.data.table(dat.pca$x)
dat.pcadata has M rows and N columns, and each column corresponds to a PCA mode.
My question is: do I understand correctly if I say that now my system should be trained to forecast the outcomes Y using the first 5 columns of dat.pcadata as features?
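If that is the intent, a minimal sketch of the workflow (illustrative only; lm() stands in for whatever model is actually used, and PC1..PC5 are the default prcomp column names):

library(data.table)
# Scores on the first 5 principal components become the new feature matrix.
features <- as.data.table(dat.pca$x[, 1:5])
features[, outcome := Y]
fit <- lm(outcome ~ ., data = features)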

Time Series Clustering in R

I have two time series- a baseline (x) and one with an event (y). I'd like to cluster based on dissimilarity of these two time series. Specifically, I'm hoping to create new features to predict the event. I'm much more familiar with clustering, but fairly new to time series.
I've tried a few different things with a limited understanding...
Simulating data...
x<-rnorm(100000,mean=1,sd=10)
y<-rnorm(100000,mean=1,sd=10)
This package seems awesome but there is limited information available on SO or Google.
library(TSclust)
d<-diss.ACF(x, y)
The value of d is:
[,1]
[1,] 0.07173596
I then move on to clustering...
hc <- hclust(d)
but I get the following error:
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
My assumption is this error is because I only have one value in d.
Alternatively, I've tried the following on a single time series (the event).
library(dtw)
distMatrix <- dist(y, method="DTW")
hc <- hclust(distMatrix, method="complete")
but it takes FOREVER to compute the distance matrix.
I have a couple of guesses at what is going wrong, but could use some guidance.
My questions...
Do I need a set of baseline time series and a set of event time series? Or is one pair OK to start?
My time series are quite large (100000 rows). I'm guessing this is causing the SLOW distMatrix calculation. Thoughts on this?
Any resources on applied clustering on large time series are welcome. I've done a pretty thorough search, but I'm sure there are things I haven't found.
Is this the code you would use to accomplish these goals?
Thanks!
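On the hclust() error: the assumption above looks right, since hclust() needs a dist object over several objects, and a single pairwise dissimilarity is not enough. One untested way to get there is to cut each long series into shorter subsequences and cluster those (the window length of 1000 is arbitrary):

library(TSclust)
win <- 1000
# Non-overlapping windows: one row per subsequence, drawn from both series.
segments <- rbind(matrix(x, ncol = win, byrow = TRUE),
                  matrix(y, ncol = win, byrow = TRUE))
D  <- diss(segments, "ACF")          # pairwise ACF-based dissimilarities
hc <- hclust(D, method = "complete")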

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable names and the third column contains the score between them. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0, 1]. The file is approximately 50 GB. Pairs whose score is 1 have been removed, since more than half of the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with a data set of this size. Plus, implementations usually need more than one copy of the matrix, so you may need around 1 TB of RAM: 2 * 8 * 250,000 * 250,000 bytes is a lot.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.

image comparison in R

I am looking for the best way to compare 2 or more images.
The images I have are now in matrix format, so basically I am comparing matrices.
They aren't square (but this isn't a problem).
This is an example of what I have with only two matrices:
#Original data
M1<-cbind(c(0,0,20,40,50,35),c(0,0,5,20,90,80),c(0,0,10,25,85,0),c(58,70,20,50,0,5))
#Data to be compared with M1
M2<-cbind(c(0,5,25,25,60,15),c(0,30,15,10,116,67),c(0,2,9,20,90,1),c(69,50,22,30,0,2))
I can check for the differences and the correlation, but I also want to be able to say for example, if:
high values in M2 occur in the same positions that M1
high values in M2 occur close to the positions in M1
high values in M2 occur far away
Same thing for low values.
By high values I mean maximum values; for example, if the maximum value of M1 is at some position (x, y), then the maximum value of M2 should be of a similar magnitude and located at the same, or a nearby, position.
I can extract the positions of the maximum values and how they vary, but I am looking for existing methods on which I can base my comparisons.
What type of calculations can I use to do such type of analysis?
I can use both image processing packages as well as matrices algorithms.
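One simple, concrete starting point for the position-based part, using only base R (a sketch, not an established image-comparison method):

# Row/column position of the maximum value in each matrix.
pos1 <- arrayInd(which.max(M1), dim(M1))
pos2 <- arrayInd(which.max(M2), dim(M2))
# How far apart the maxima are (in cells) and how similar their values are.
max_dist  <- sqrt(sum((pos1 - pos2)^2))
max_ratio <- max(M2) / max(M1)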
Sounds like a job better handled with ImageJ or SAODS9 (http://hea-www.harvard.edu/RD/ds9/).
IIRC those apps have built-in tools for spot and blob-finding, which may save you a lot of time and pain.
