Interpreting princomp results - R

I am currently trying to do PCA in R. This is my first project in data mining.
I have around 200 features and around 3000 rows of data.
The data is not normalized and I need to do dimensionality reduction, so I am using PCA for that. This is what I have done so far:
x <- princomp(data, scores = TRUE, cor = TRUE)
I assume that for dimension reduction I am supposed to look at the score values, so I ran the following to see the top few rows:
head(x$scores)
This was the output
Comp.1 Comp.2 Comp.3 Comp.4 ...
[1,] 6.831452 -4.4316218 -1.9226226 -0.8344245
[2,] -1.808007 -4.2743390 1.0173944 0.4527465
[3,] -7.750329 -4.9523056 -1.6750438 1.6247354
...
Now I am not sure how to interpret this matrix and get the best attributes (i.e., do the dimension reduction). It would be great if someone could help me out with this.
P.S. I searched a lot but did not find an answer for this.

scores is just one piece of the puzzle. The general formula is:
original_data =~ approximation = (scores * loadings) * scale + center
where:
1. `scores` are the coordinates in your new orthogonal basis
1. `loadings` are the directions of the new axes in the old basis
1. `scale` is the scaling applied to each dimension
1. `center` is the position of the new basis origin in the old basis
Using the R objects, the formula above is
data =~ t(t(x$scores %*% t(x$loadings)) * x$scale + x$center)
You'll want to reduce dimensions by only taking the first i loadings:
data =~ t(t(x$scores[, 1:i] %*% t(x$loadings[, 1:i])) * x$scale + x$center)
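For a concrete feel, here is a minimal sketch on the built-in USArrests dataset; the choice of i = 2 is arbitrary and only for illustration, and summary() is one common way to see how much variance the first few components explain:
x <- princomp(USArrests, scores = TRUE, cor = TRUE)
summary(x)                     # cumulative proportion of variance per component
i <- 2                         # keep the first two components (arbitrary here)
recon <- t(t(x$scores[, 1:i] %*% t(x$loadings[, 1:i])) * x$scale + x$center)
head(recon)                    # approximate reconstruction of the original data
head(USArrests)                # compare with the original values
reduced <- x$scores[, 1:i]     # the lower-dimensional data to carry forward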

Related

Sampling custom probability density function in R

Through the dwp package, I got the probability density function which describes my original data. The function itself is given within the list, so I can see the fitted parameters by checking that particular list element:
> Kbatmod$xep02
Distribution: xep02
Formula: ncarc ~ log(r) + I(r^2) + offset(log(exposure))
Parameters:
b0 b2
-0.8396640654 -0.0004653923
Coefficients:
(Intercept) log(r) I(r^2)
-2.1154557406 -0.8396640654 -0.0004653923
Variance:
(Intercept) log(r) I(r^2)
(Intercept) 3.034712e-01 -1.111308e-01 4.550014e-05
log(r) -1.111308e-01 4.608998e-02 -2.451855e-05
I(r^2) 4.550014e-05 -2.451855e-05 2.531856e-08
Now I want to sample that function to plot the point cloud (locations around (0,0)). The direction, or bearing, will be sampled as random numbers between 0 and 360, but the distances need to follow this fitted xep02 function. I can't find a function that would do that in the dwp package, even though its manual uses these kinds of density point clouds to explain how it works.
I tried the RVCompare package, and the function sampleFromDensity.
But I keep getting errors, and I believe it is because of how I'm giving it the xep02 function.
> dwpPDF <- Kbatmod$xep02
>
> PDFsamples <- sampleFromDensity(dwpPDF, 100, c(0,100))
Error in density.default(X[[i]], ...) :
need at least 2 points to select a bandwidth automatically
Can someone help me "translate" what the dwp package gave me into input for RVCompare?
The aim is to obtain a vector of 100 distances that fit the xep02 PDF, add a vector of 100 randomly selected bearings, and then plot these in ArcGIS to overlay them with predefined polygons and see how many points fall within those polygons.
sampleFromDensity is expecting a function as its first argument. You can accomplish this with the ddd function:
sampleFromDensity(function(x) ddd(x, dwpPDF), 100, c(0,100))
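Assuming that call works on your fitted object, a sketch of the rest of the workflow you describe (random bearings, conversion to x/y coordinates around (0,0)) could look like the following; the intermediate object and file names are only illustrative:
library(dwp)
library(RVCompare)
dwpPDF    <- Kbatmod$xep02                    # your fitted xep02 model
distances <- sampleFromDensity(function(x) ddd(x, dwpPDF), 100, c(0, 100))
bearings  <- runif(100, min = 0, max = 360)   # degrees
theta     <- bearings * pi / 180              # radians for sin/cos
pts <- data.frame(x = distances * sin(theta), # east-west offset from (0,0)
                  y = distances * cos(theta)) # north-south offset from (0,0)
# write.csv(pts, "points_xy.csv", row.names = FALSE)  # e.g. for import into ArcGIS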

Merging covariance from two sets to create new covariance

Is there any way to combine the covariance of two data sets instead of calculating the new covariance from the merged data? Suppose I have already calculated the covariance of 1 million data points, and I then get another 2 million data points whose covariance has also already been calculated; can I combine the two pre-computed covariances to produce the new covariance? I am mostly interested in reducing the computation required compared with calculating the covariance of the combined 3 million data points from scratch.
This can easily be done for the mean:
new_mean = (data_size_1 * mean_1 + data_size_2 * mean_2) / (data_size_1 + data_size_2)
Is there any similar way to calculate the covariance so that I can take advantage of the pre-computed values? I can also store some extra information while calculating the covariance of the first and second data sets if that helps to find the merged covariance more easily.
The complete derivation is given in this pdf http://prod.sandia.gov/techlib/access-control.cgi/2008/086212.pdf
I found a formula for combining the variances of two sets here:
https://www.emathzone.com/tutorials/basic-statistics/combined-variance.html
Replacing (X1 – Xc)² with (X1 – Xc)(Y1 – Yc), and (X2 – Xc)² with (X2 – Xc)(Y2 – Yc), gives the correct result for covariances, unlike the formula from the first answer, which is only approximately correct.
Here is a code fragment that combines covariances a and b
into resulting covariance r.
r.n = a.n + b.n;
r.mean_x = (a.n * a.mean_x + b.n * b.mean_x) / r.n;
r.mean_y = (a.n * a.mean_y + b.n * b.mean_y) / r.n;
r.sum = a.sum + a.n * (a.mean_x - r.mean_x) * (a.mean_y - r.mean_y)
+ b.sum + b.n * (b.mean_x - r.mean_x) * (b.mean_y - r.mean_y);
a, b and r are structs that contain:
n – number of elements,
mean_x – mean of X,
mean_y – mean of Y,
sum – the covariance multiplied by n.
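For reference, here is a minimal R sketch of the same merge, checked against cov() on the pooled data; it uses the population form (sum divided by n), and the helper names are only illustrative:
merge_cov <- function(a, b) {
  n  <- a$n + b$n
  mx <- (a$n * a$mean_x + b$n * b$mean_x) / n
  my <- (a$n * a$mean_y + b$n * b$mean_y) / n
  s  <- a$sum + a$n * (a$mean_x - mx) * (a$mean_y - my) +
        b$sum + b$n * (b$mean_x - mx) * (b$mean_y - my)
  list(n = n, mean_x = mx, mean_y = my, sum = s)   # sum = covariance * n
}
summarise_xy <- function(x, y) {
  list(n = length(x), mean_x = mean(x), mean_y = mean(y),
       sum = sum((x - mean(x)) * (y - mean(y))))
}
set.seed(1)
x1 <- rnorm(1000); y1 <- rnorm(1000)
x2 <- rnorm(2000); y2 <- rnorm(2000)
r <- merge_cov(summarise_xy(x1, y1), summarise_xy(x2, y2))
r$sum / r$n                                   # merged (population) covariance
cov(c(x1, x2), c(y1, y2)) * (r$n - 1) / r$n   # same value from the pooled data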

Removing Multivariate Outliers With mvoutlier

Problem
I have a data frame that is composed of more than 5 variables at any time and am trying to run K-means on it. Because K-means is greatly affected by outliers, I've spent a few hours trying to find out how to calculate and remove multivariate outliers. Most examples demonstrated use only 2 variables.
Possible Solutions Explored
mvoutlier - Kind user here noted that mvoutlier may be what I need.
Another Outlier Detection Method - Poster here commented with a mix of R functions to generate an ordered list of outliers.
Issues thus Far
Regarding mvoutlier, I was unable to generate a result because it reported that my dataset contained negatives and could not work because of that. I'm not sure how to make my data all positive, since I need the negative values in the set I am working with.
Regarding Another Outlier Detection Method, I was able to come up with a list of outliers, but I am unsure how to exclude them from the current data set. Also, I know these calculations were done after K-means, so I will probably need to apply the math before doing K-means.
Minimal Verifiable Example
Unfortunately, the dataset I'm using cannot be shared, so any random data set with more than 3 variables will do. The code below is converted from the Another Outlier Detection Method post to work with my data; it should work dynamically with a random data set as well, provided there are enough rows that 5 cluster centers is reasonable.
clusterAmount <- 5
cluster <- kmeans(dataFrame, centers = clusterAmount, nstart = 20)
centers <- cluster$centers[cluster$cluster, ]
distances <- sqrt(rowSums((dataFrame - centers)^2))   # Euclidean distance of each row to its assigned center
m <- tapply(distances, cluster$cluster, mean)
d <- distances/(m[cluster$cluster])
# 1% outliers
outliers <- d[order(d, decreasing = TRUE)][1:(nrow(dataFrame) * .01)]
Output: a list of outliers ordered by their distance from the center of the cluster they reside in, I believe. The issue then is pairing these results up with the respective rows in the data frame and removing them so I can start my K-means procedure. (Note: while in the example I ran K-means before removing outliers, I'll make sure to remove the outliers before K-means in the actual solution.)
Question
With the Another Outlier Detection Method example in place, how do I pair the results with the information in my current data frame to exclude those rows before doing K-means?
I don't know if this is exactly helpful, but if your data is multivariate normal you may want to try a Wilks (1963)-based method. Wilks showed that the Mahalanobis distances of multivariate normal data follow a Beta distribution. We can take advantage of this (the iris Sepal data is used as an example):
test.dat <- iris[, -c(3, 4, 5)]   # keep the two Sepal columns
Wilks.function <- function(dat){
  n <- nrow(dat)
  p <- ncol(dat)
  # beta distribution
  u <- n * mahalanobis(dat, center = colMeans(dat), cov = cov(dat))/(n-1)^2
  w <- 1 - u
  F.stat <- ((n-p-1)/p) * (1/w-1) # computing F statistic
  p <- 1 - round( pf(F.stat, p, n-p-1), 3) # p value for each row
  cbind(w, F.stat, p)
}
plot(test.dat,
     col = "blue",
     pch = c(15,16,17)[as.numeric(iris$Species)])
dat.rows <- Wilks.function(test.dat); head(dat.rows)
# w F.stat p
#[1,] 0.9888813 0.8264127 0.440
#[2,] 0.9907488 0.6863139 0.505
#[3,] 0.9869330 0.9731436 0.380
#[4,] 0.9847254 1.1400985 0.323
#[5,] 0.9843166 1.1710961 0.313
#[6,] 0.9740961 1.9545687 0.145
Then we can simply find which rows of our multivariate data are significantly different from the beta distribution.
outliers <- which(dat.rows[,"p"] < 0.05)
points(test.dat[outliers,],
       col = "red",
       pch = c(15,16,17)[as.numeric(iris$Species[outliers])])
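To get at the original question (excluding those rows before K-means), the flagged row indices can simply be dropped; here is a sketch reusing the objects above, with an arbitrary number of centers:
# guard the no-outlier case: dat[-integer(0), ] would drop every row
clean.dat <- if (length(outliers) > 0) test.dat[-outliers, ] else test.dat
set.seed(42)
cl <- kmeans(clean.dat, centers = 3, nstart = 20)   # 3 centers chosen arbitrarily for iris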

Extracting the Model Object in R from str()

I have a logit model object fit using glm2. The predictors are continuous and time-varying, so I am using basis splines. When I call predict(FHlogit, foo, ...) on the model object it returns a prediction. All is well.
Now, what I would like to do is extract the part of FHlogit and the basis matrix that produces the prediction. I do not want to extract information about the model from str(FHlogit); I am trying to extract the part that says Beta * Predictor = 2, so that I can manipulate the basis matrix for each predictor.
I don't think using basis splines will affect this. If so, please provide a reproducible example.
Here's a simple case:
df1 <- data.frame(y = c(0,1,0,1),
                  x1 = seq(4),
                  x2 = c(1,3,2,6))
library(glm2)
g1 <- glm2(y ~ x1 + x2, data=df1)
### default for type is "link"
> stats::predict.glm(g1, type="link")
1 2 3 4
0.23809524 0.66666667 -0.04761905 1.14285714
Now, being unsure how these numbers were arrived at, we can look at the source for the above with predict.glm. We can see that type="link" is the simplest case, returning
pred <- object$fitted.values # object is g1 in this case
These values are the predictions resulting from the original data * the coefficients, which we can verify with e.g.
all.equal(unname(predict.glm(g1, type="link")[1]),
unname(coef(g1)[1] + coef(g1)[2]*df1[1, 2] + coef(g1)[3]*df1[1, 3]))
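More generally, the design (basis) matrix the fit actually used can be pulled out with model.matrix(), and multiplying it by the coefficients reproduces the link-scale predictions; with basis splines, the spline basis columns simply appear as extra columns of this matrix. A sketch with the toy model above:
X   <- model.matrix(g1)        # one column per coefficient, including the intercept
eta <- drop(X %*% coef(g1))    # linear predictor for every row
all.equal(unname(eta), unname(predict.glm(g1, type = "link")))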

Root mean square deviation on binned GAM results using R

Background
A PostgreSQL database uses PL/R to call R functions. An R call to calculate Spearman's correlation looks as follows:
cor( rank(x), rank(y) )
Also in R, a naïve calculation of a fitted generalized additive model (GAM):
data.frame( x, fitted( gam( y ~ s(x) ) ) )
Here x represents the years from 1900 to 2009 and y is the average measurement (e.g., minimum temperature) for that year.
Problem
The fitted trend line (using GAM) is reasonably accurate, as you can see in the following picture:
The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.
Possible Solution
One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.
Questions
Q.1. How would you implement the RMSE calculation on the binned data to get a correlation (between 0 and 1) of GAM's fit to the measurements, in the R language?
Q.2. Is there a better way to find the accuracy of GAM's fit to the data, and if so, what is it (e.g., root mean square deviation)?
Attempted Solution 1
Call the PL/R function using the observed amounts and the model (GAM) amounts:
correlation_rmse := climate.plr_corr_rmse( v_amount, v_model );
Define plr_corr_rmse as follows (where o and m represent the observed and modelled data):
CREATE OR REPLACE FUNCTION climate.plr_corr_rmse(
  o double precision[], m double precision[])
RETURNS double precision AS
$BODY$
  sqrt( mean( o - m ) ^ 2 )
$BODY$
LANGUAGE 'plr' VOLATILE STRICT
COST 100;
The o - m is wrong. I'd like to bin both data sets by calculating the mean of every 5 data points (there will be at most 110 data points). For example:
omean <- c( mean(o[1:5]), mean(o[6:10]), ... )
mmean <- c( mean(m[1:5]), mean(m[6:10]), ... )
Then correct the RMSE calculation as:
sqrt( mean( (omean - mmean)^2 ) )
How do you calculate c( mean(o[1:5]), mean(o[6:10]), ... ) for an arbitrary length vector in an appropriate number of bins (5, for example, might not be ideal for only 67 measurements)?
I don't think hist is suitable here, is it?
Attempted Solution 2
The following code will solve the problem; however, it drops data points from the end of the list (to make the list length divisible by 5). The solution isn't ideal, as the number 5 is rather magical.
while( length(o) %% 5 != 0 ) {
  o <- o[-length(o)]
}
omean <- apply( matrix(o, 5), 2, mean )
What other options are available?
Thanks in advance.
You say that:
The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.
You could calculate the correlation between the fitted values and the measured values:
cor(y,fitted(gam(y ~ s(x))))
I don't see why you want to bin your data, but you could do it as follows:
mean.binned <- function(y, n = 5){
  apply(matrix(c(y, rep(NA, (n - (length(y) %% n)) %% n)), n),
        2,
        function(x) mean(x, na.rm = TRUE))
}
It looks a bit ugly, but it should handle vectors whose length is not a multiple of the binning length (i.e. 5 in your example).
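For example:
mean.binned(1:12)         # 3.0  8.0 11.5  (the last bin is the mean of 11 and 12)
mean.binned(1:12, n = 4)  # 2.5  6.5 10.5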
You also say that:
One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.
I don't understand what you mean by this. The correlation is a factor in determining the mean squared error - for example, see equation 10 of Murphy (1988, Monthly Weather Review, v. 116, pp. 2417-2424). But please explain what you mean.
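If you do want the binned RMSE from Q.1, one way to put the pieces together is the sketch below; the last line is just one common normalisation (dividing by the range of the observed bins) to get a roughly 0-1 score, not something prescribed by the question:
obin <- mean.binned(o, 5)            # observed values, binned
mbin <- mean.binned(m, 5)            # modelled (GAM) values, binned
rmse <- sqrt(mean((obin - mbin)^2))  # RMSE on the binned series
rmse / diff(range(obin))             # normalised RMSE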
