Weighted correlation in R

I am trying to output a correlation matrix for various locations. The row names 'PC1', 'PC2', etc. represent principal components. Since the percentage of variance explained (and thus the weight) of the principal components decreases from PC1 to PC4, I need to run a Pearson correlation that takes the weights of the PCs into account.
In other words, row 1 is more important in determining the correlation among locations than row 2, and row 2 is more important than row 3, and so on...
A simple weight vector for the 4 rows can be as follows:
w <- c(1.00, 0.75, 0.50, 0.25)
I did go through this, but I am not fully clear on the solution, and unlike that question, I need to find the correlation between the columns of a SINGLE matrix while weighting its rows.

OK, this is very easy to do in R using cov.wt() (available in the base stats package):
weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
corr_matrix <- weighted_corr$cor
That's it!
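For concreteness, here is a minimal reproducible sketch of that call; the data frame, its size and the location names are made up for illustration:
# Hypothetical data: 4 principal-component rows, 3 location columns
set.seed(42)
DF <- data.frame(loc_A = rnorm(4), loc_B = rnorm(4), loc_C = rnorm(4))
rownames(DF) <- paste0("PC", 1:4)
# Row weights, decreasing in importance from PC1 to PC4
w <- c(1.00, 0.75, 0.50, 0.25)
# cov.wt() treats rows as observations and weights each one by wt
# (the weights are rescaled internally to sum to 1)
weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
corr_matrix   <- weighted_corr$cor   # weighted Pearson correlations between locations
corr_matrix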

Related

Generate random data based on correlation matrix for multiple timesteps in R

I would like to simulate data for a number of cases (e.g. nPerson = 1000 observations) at
some consecutive timesteps (e.g. ts = 3) for N intercorrelated variables (e.g. N = 5).
The simulation should be based on a correlation matrix (corrMat, nrow = N, ncol = N).
corrMat should be identical for all timesteps.
I already found out that the MASS package has a function to create
random data fitting the constraints given by corrMat.
library(MASS)
t1 <- mvrnorm(nPerson, mu = rep(0, N), Sigma = corrMat, empirical = TRUE)
Now I would like to simulate t2 as a function of t1 and corrMat.
The data of t2 should therefore correlate according to corrMat,
and they should also have the same variance as the variables of t1.
One important constraint: for the initial values corrMat[i,i] = 1;
for consecutive timesteps it should be possible that corrMat[i,i] < 1,
because each variable depends on itself one timestep before,
but a perfect correlation is not intended.
Maybe there is a variance decomposition of the correlation matrix
that calculates an error variance for each of the N variables at the
next timestep, so that one could calculate the
values at timestep t+1 as the sum of the weighted correlations of the
variables at timestep t and then add a random error, distributed
according to the error variance (with mean 0), that replicates
the correlation matrix again at t+1.
Assuming normal errors:
getRand <- function(range) {
  rnorm(1, mean = 0, sd = range)
}
This is the (very simplified) code for the i-th variable, with the data stored as an N x timesteps matrix x:
x[i, t + 1] <- 0
for (j in 1:N) {
  x[i, t + 1] <- x[i, t + 1] + corrMat[i, j] * x[j, t]
}
x[i, t + 1] <- x[i, t + 1] + getRand(sdErr)
So, more specifically, the question is: how to calculate sdErr?
For simplification I assume that the variance of all variables
should be 1.
Thank you for any hint on how to get one step further!
I will post a mathematical formulation of the problem to stats.stackexchange.com,
as mikeck suggested, to discuss the details of the correlation problem
in more depth.
I am still interested in finding a general formula to calculate sdErr
for use in the calculation of x_i[t+1].
But meanwhile I found a useful practical solution to the specific question "how to calculate sdErr?" that does not need a formula for sdErr:
(1) simply calculate all variables WITHOUT errors (according to the equation above);
(2) calculate the variances of the new variables;
(3) calculate (for each i) the difference var(x_i[t]) - var(x_i[t+1]) = sdErr^2.
An error with this sdErr can then be added to each variable for each new observation.
This should lead to observations at t+1 which at least have the same variances as the observations at t.
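Below is a minimal, self-contained sketch of this three-step recipe. The sizes, the placeholder correlation matrix corrMat0 (unit diagonal, used to simulate t1) and the transition weights corrMat <- 0.5 * corrMat0 are illustrative assumptions, not part of the original question:
library(MASS)  # for mvrnorm
N <- 5; nPerson <- 1000
corrMat0 <- matrix(0.3, N, N); diag(corrMat0) <- 1
t1 <- mvrnorm(nPerson, mu = rep(0, N), Sigma = corrMat0, empirical = TRUE)
corrMat <- 0.5 * corrMat0                 # illustrative transition weights, diagonal < 1
# (1) calculate the new variables WITHOUT errors (row i uses the weights corrMat[i, ])
t2_det <- t1 %*% t(corrMat)
# (2) + (3) the variance deficit per variable gives sdErr^2; pmax() guards against a
# negative deficit, which occurs when the weighted sum already inflates the variance
sdErr <- sqrt(pmax(apply(t1, 2, var) - apply(t2_det, 2, var), 0))
# add independent normal errors so each column of t2 regains the variance of t1
t2 <- t2_det + sweep(matrix(rnorm(nPerson * N), nPerson, N), 2, sdErr, `*`)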
Details concerning the question of whether the model definition is adequate
will be part of another post.

How to remove outliers from a distance matrix or hierarchical clustering in R?

I have two questions.
First, I don't know how to find and remove outliers in a distance matrix (a symmetric matrix).
Second, I used hierarchical clustering with average linkage.
My data is engmale161 (already a symmetric distance matrix computed with DTW):
engmale161 <- na.omit(engmale161)
engmale161 <- scale(engmale161)
d <- dist(engmale161, method = "euclidean")
hc1_engmale161 <- hclust(d, method = "average")
I found the optimal number of clusters, k = 4, using silhouette, WSS and gap statistics.
> sub_grp <- cutree(hc1_engmale161, h = 60, k = 4)
> table(sub_grp)
sub_grp
  1   2   3   4
741  16   7   1
> subset(sub_grp, sub_grp == 4)
4165634865
         4
> fviz_cluster(list(data = engmale161, cluster = sub_grp), geom = "point")
So I think the upper-right point (4165634865) is an outlier, and its cluster contains only that single point.
How can I remove this outlier in the hierarchical clustering workflow?
Just some ideas. In a nutshell:
(1) don't run na.omit on engmale161;
(2) find the outlier(s) using quantiles and box-and-whiskers;
(3) set the outliers to NA in the dist matrix;
(4) proceed with your processing.
Long version:
"dist" behaves nicely with NAs (from the R documentation, "Missing
values are allowed, and are excluded from all computations involving
the rows within which they occur. Further, when Inf values are
involved, all pairs of values are excluded when their contribution to
the distance gave NaN or NA)"
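A quick illustration of that documented behaviour, on made-up data:
m <- matrix(rnorm(20), nrow = 5)   # 5 observations, 4 variables
m[2, 3] <- NA                      # one missing value in row 2
dist(m)                            # row 2 still gets distances, computed from its non-missing columns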
To find an outlier I would use concepts from exploratory statistics.
Use quantile with the default probs and na.rm = TRUE (because your dist
matrix still contains NAs) -- you'll get the values of the quartiles
(the dataset split in four: 0-25%, 25-50%, and so on); 25-75% is the "box".
How to find the "whiskers" is a debated topic. The standard approach is
to compute the interquartile range (IQR), which is the third quartile minus the first quartile;
then the first quartile - 1.5*IQR is the lower whisker, and the third
quartile + 1.5*IQR is the upper whisker. Any value outside the
whiskers is to be considered an outlier. Mark them as NA, and proceed.
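A minimal sketch of that recipe, applied to the values of the dist object d from the question (the variable names are illustrative):
dvals <- as.vector(d)                                 # the distances as a plain numeric vector
qs    <- quantile(dvals, probs = c(0.25, 0.75), na.rm = TRUE)
iqr   <- qs[2] - qs[1]                                # interquartile range
lower <- qs[1] - 1.5 * iqr                            # lower whisker
upper <- qs[2] + 1.5 * iqr                            # upper whisker
d[dvals < lower | dvals > upper] <- NA                # mark extreme distances as outliers
# then proceed with your processing; note that hclust() itself will not accept NA
# distances, so you may prefer to drop the observations behind those distances instead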
Best of luck, and my compliments for being someone who actually looks at the data!

Summing 2 distance matrices to get a third 'overall' distance matrix (ecological context)

I am an ecologist, using mainly the vegan R package.
I have 2 matrices (samples x abundances) (see data below):
matrix 1: nrow = 6 replicates x 24 sites = 144, ncol = 15 species abundances (fish)
matrix 2: nrow = 3 replicates x 24 sites = 72, ncol = 10 species abundances (invertebrates)
The sites are the same in both matrices. I want to get the overall Bray-Curtis dissimilarity (considering both matrices) among pairs of sites. I see 2 options:
option 1: average fish and macro-invertebrate abundances over replicates (at the site scale), cbind the two mean-abundance matrices (nrow = 24 sites, ncol = 15 + 10 mean abundances) and calculate Bray-Curtis;
option 2: for each assemblage, compute the Bray-Curtis dissimilarity among pairs of sites, compute the distances among site centroids, and then sum up the 2 distance matrices.
In case I am not clear, I did these 2 operations in the R code below.
Please, could you tell me whether option 2 is correct and more appropriate than option 1?
Thank you in advance.
Pierre
Here are the R code examples:
# generating data
library(plyr);library(vegan)
#assemblage 1: 15 fish species, 6 replicates per site
a1.env=data.frame(
Habitat=paste("H",gl(2,12*6),sep=""),
Site=paste("S",gl(24,6),sep=""),
Replicate=rep(paste("R",1:6,sep=""),24))
summary(a1.env)
a1.bio=as.data.frame(replicate(15,rpois(144,sample(1:10,1))))
names(a1.bio)=paste("F",1:15,sep="")
a1.bio[1:72,]=2*a1.bio[1:72,]
#assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site
a2.env=a1.env[a1.env$Replicate%in%c("R1","R2","R3"),]
summary(a2.env)
a2.bio=as.data.frame(replicate(10,rpois(72,sample(10:100,1))))
names(a2.bio)=paste("I",1:10,sep="")
a2.bio[1:36,]=0.5*a2.bio[1:36,]
# environmental data at the site scale
env=unique(a1.env[,c("Habitat","Site")])
env=env[order(env$Site),]
# OPTION 1: averaging abundances and cbind
a1.bio.mean=ddply(cbind(a1.bio,a1.env),.(Habitat,Site),numcolwise(mean))
a1.bio.mean=a1.bio.mean[order(a1.bio.mean$Site),]
a2.bio.mean=ddply(cbind(a2.bio,a2.env),.(Habitat,Site),numcolwise(mean))
a2.bio.mean=a2.bio.mean[order(a2.bio.mean$Site),]
bio.mean=cbind(a1.bio.mean[,-c(1:2)],a2.bio.mean[,-c(1:2)])
dist.mean=vegdist(sqrt(bio.mean),"bray")
# OPTION 2: computing, for each assemblage, the distances among centroids and summing the 2 distance matrices
a1.dist=vegdist(sqrt(a1.bio),"bray")
a1.coord.centroid=betadisper(a1.dist,a1.env$Site)$centroids
a1.dist.centroid=vegdist(a1.coord.centroid,"eucl")
a2.dist=vegdist(sqrt(a2.bio),"bray")
a2.coord.centroid=betadisper(a2.dist,a2.env$Site)$centroids
a2.dist.centroid=vegdist(a2.coord.centroid,"eucl")
# summing up the two distance matrices using Gavin Simpson's fuse()
dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(15/25,10/25))
# summing up the two Euclidean distance matrices (thanks to Jari Oksanen's correction)
dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2)
# the 'coord.centroid' below is for further distance-based analysis (is it correct?)
coord.centroid=cmdscale(dist.centroid,k=23,add=TRUE)
# COMPARING OPTIONS 1 AND 2
pco.mean=cmdscale(vegdist(sqrt(bio.mean),"bray"))
pco.centroid=cmdscale(dist.centroid)
comparison=procrustes(pco.centroid,pco.mean)
protest(pco.centroid,pco.mean)
An easier solution is just to flexibly combine the two dissimilarity matrices, by weighting each matrix. The weights need to sum to 1. For two dissimilarity matrices the fused dissimilarity matrix is
d.fused = (w * d.x) + ((1 - w) * d.y)
where w is a numeric scalar (length 1 vector) weight. If you have no reason to weight one of the sets of dissimilarities more than the other, just use w = 0.5.
I have a function to do this for you in my analogue package: fuse(). The example from ?fuse is:
train1 <- data.frame(matrix(abs(runif(100)), ncol = 10))
train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE),
ncol = 10))
rownames(train1) <- rownames(train2) <- LETTERS[1:10]
colnames(train1) <- colnames(train2) <- as.character(1:10)
d1 <- vegdist(train1, method = "bray")
d2 <- vegdist(train2, method = "jaccard")
dd <- fuse(d1, d2, weights = c(0.6, 0.4))
dd
str(dd)
This idea is used in supervised Kohonen networks (supervised SOMs) to bring multiple layers of data into a single analysis.
analogue works closely with vegan so there won't be any issues running the two packages side by side.
The correctness of averaging distances depends on what you are doing with those distances. In some applications you may expect that they really are distances, that is, that they satisfy certain metric properties and have a defined relation to the original data. Combined dissimilarities may not satisfy these requirements.
This issue is related to the controversy of partial Mantel type analysis of dissimilarities vs. analysis of rectangular data that is really hot (and I mean red hot) in studies of beta diversities. We in vegan provide tools for both, but I think that in most cases the analysis of rectangular data is more robust and more powerful. With rectangular data I mean the normal sampling units times species matrix. The preferred dissimilarity-based methods in vegan map dissimilarities onto rectangular form. These methods in vegan include db-RDA (capscale), permutational MANOVA (adonis) and analysis of within-group dispersion (betadisper). Methods working with dissimilarities as such include mantel, anosim, mrpp and meandist.
The mean of dissimilarities or distances usually has no clear correspondence to the original rectangular data. That is: mean of the dissimilarities does not correspond to the mean of the data. I think that in general it is better to average or handle data and then get dissimilarities from transformed data.
If you want to combine dissimilarities, the analogue::fuse() style approach is most practical. However, you should understand that fuse() also scales the dissimilarity matrices to equal maxima. If you have dissimilarity measures on a 0..1 scale, this is usually a minor issue, unless one of the data sets is more homogeneous and has a lower maximum dissimilarity than the others. In fuse() they are all equalized, so it is not a simple averaging but an averaging after range equalization. Moreover, you must remember that averaging dissimilarities usually destroys the geometry, and this will matter if you use analysis methods for rectangularized data (adonis, betadisper, capscale in vegan).
Finally about geometry of combining dissimilarities. Dissimilarity indices in scale 0..1 are fractions of type A/B. Two fractions can be added (and then divided to get the average) directly only if the denominators are equal. If you ignore this and directly average the fractions, then the result will not be equal to the same fraction from averaged data. This is what I mean with destroying geometry. Some open-scaled indices are not fractions and may be additive. Manhattan distances are additive. Euclidean distances are square roots of squared differences, and their squares are additive but not the distances directly.
I demonstrate these things by showing the effect of adding together two dissimilarities (and averaging would mean dividing the result by two, or by suitable weights). I take the Barro Colorado Island data of vegan and divide it into two subsets of slightly unequal sizes. A geometry preserving addition of distances of subsets of the data will give the same result as the analysis of the complete data:
library(vegan) ## data and vegdist
library(analogue) ## fuse
data(BCI)
dim(BCI) ## [1] 50 225
x1 <- BCI[, 1:100]
x2 <- BCI[, 101:225]
## Bray-Curtis and fuse: not additive
plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
## summing distances is straightforward (they are vectors), but preserving
## their attributes and keeping the dissimilarities needs fuse or some trick
## like below, where we make a dist structure dtmp to be replaced with the result
dtmp <- dist(BCI) ## dist skeleton with attributes
dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
## manhattans are additive and can be averaged
plot(dist(BCI, "manhattan"), dtmp)
## Fuse rescales dissimilarities and they are no more additive
dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights=c(100/225, 125/225))
plot(dist(BCI, "manhattan"), dfuse)
## Euclidean distances are not additive
dtmp[] <- dist(x1) + dist(x2)
plot(dist(BCI), dtmp)
## ... but squared Euclidean distances are additive
dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
plot(dist(BCI), dtmp)
## dfuse would rescale squared Euclidean distances like Manhattan (not shown)
I only considered addition above, but if you cannot add, you cannot average. It is a matter of taste whether this is important. Brave people will average things that cannot be averaged, but some people are more timid and want to follow the rules. I rather go with the second group.
I like the simplicity of this answer, but it only applies to adding 2 distance matrices:
d.fused = (w * d.x) + ((1 - w) * d.y)
so I wrote my own snippet to combine a list of multiple distance matrices (not just 2), using standard R packages:
# generate a list of distance matrices
x <- matrix(rnorm(100), nrow = 5)
y <- matrix(rnorm(100), nrow = 5)
z <- matrix(rnorm(100), nrow = 5)
dst_array <- list(dist(x), dist(y), dist(z))
# create a new distance matrix from the first element of the list
dst <- dst_array[[1]]
# loop over the remaining elements, adding each to the distance matrix
for (jj in 2:length(dst_array)) {
  dst <- dst + dst_array[[jj]]
}
You could also use a vector of the same length as dst_array to define scaling factors:
dst <- dst + my_scale[[jj]] * dst_array[[jj]]
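For what it's worth, the same weighted combination can be written more compactly with Map() and Reduce(); my_scale here is a hypothetical vector of weights, one per matrix:
my_scale <- c(0.5, 0.3, 0.2)                          # one weight per distance matrix
dst_weighted <- Reduce(`+`, Map(`*`, my_scale, dst_array))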

Is it possible to specify a range for numbers randomly generated by mvrnorm() in R?

I am trying to generate a random set of numbers that exactly mirrors a dataset that I have (in order to test it). The dataset consists of 5 variables that are all correlated with each other, with different means, standard deviations and ranges (they are Likert scales added together to form one variable each). I have been able to get mvrnorm from the MASS package to create a dataset that replicates the correlation matrix with the observed number of observations (after 500,000+ iterations), and I can easily reassign means and standard deviations through z-score transformation, but I still have specific values within each variable vector that are far above or below the possible range of the scale whose scores I wish to replicate.
Any suggestions how to fix the range appropriately?
Thank you for sharing your knowledge!
To generate a sample that does "exactly mirror" the original dataset, you need to make sure that the marginal distributions and the dependence structure of the sample match those of the original dataset.
A simple way to achieve this is with resampling:
my.data <- matrix(runif(1000, -1, 2), nrow = 200, ncol = 5) # Some dummy data
my.ind <- sample(1:nrow(my.data), nrow(my.data), replace = TRUE)
my.sample <- my.data[my.ind, ]
This will ensure that the margins and the dependence structure of the sample (closely) matches those of the original data.
An alternative is to use a parametric model for the margins and/or the dependence structure (a copula). But as stated by @dickoa, this will require a serious modeling effort.
Note that by using a multivariate normal distribution, you are (implicitly) assuming that the dependence structure of the original data is the Gaussian copula. This is a strong assumption, and it would need to be validated beforehand.
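If you do go the parametric route, here is a minimal sketch of the Gaussian-copula idea: draw correlated normals with mvrnorm, push them through pnorm to get uniform margins, then map each margin onto the desired bounded range. The correlation matrix Sigma, the range lo..hi and the uniform margins below are illustrative assumptions (the resulting Pearson correlations will be close to, but not exactly, Sigma):
library(MASS)
Sigma <- matrix(0.5, 5, 5); diag(Sigma) <- 1        # illustrative target correlation matrix
lo <- 5; hi <- 35                                   # illustrative range of each summed scale
z <- mvrnorm(1000, mu = rep(0, 5), Sigma = Sigma)   # correlated normal draws
u <- pnorm(z)                                       # margins transformed to uniform(0, 1)
x <- matrix(qunif(c(u), min = lo, max = hi), nrow = nrow(z))  # bounded margins, same copula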

R: Statistics of distribution

I have the number of samples per unit and need to calculate statistics with R.
The table is like this (all rows and columns are actually filled with values, I only write a few here for easier visibility, and there are many more columns):
Hour        1    2    3    4
H1         72   11   98   65
H2         19   27
H3
H4
H5
:
H200000
I.e. in the first hour (H1) there were 72 samples of value 1, 11 samples of value 2, etc. In the second hour (H2) there were 19 samples of value 1, 27 samples of value 2, etc.
I need to calculate the mean and standard deviation per hour (i.e. per row). As there are many thousands of rows I need a fast method.
Example: the manual mean calculation for hour 1 (H1) would be:
(72*1 + 11*2 + 98*3 + 65*4) / (72 + 11 + 98 + 65) ≈ 2.6
I suppose there are R methods or packages that can do this, but I failed to find them. Your support is highly appreciated.
Thanks,
Chris
You want to calculate a weighted mean, so you need weighted.mean. For the first row:
values <- c(1, 2, 3, 4)
weights <- c(72, 11, 98, 65)
weighted.mean(values, weights)
The weighted standard deviation is not well-defined. You could use a hand-rolled weighted RMS as an estimator (but this assumes that your input sample is really from a single Gaussian, i.e. there are no outliers -- not sure if that's the case for your example).
# same values and weights as above
sqrt(sum(weights * values^2) / sum(weights))
You should read your data into a table and iterate over every row. Also, "many thousands of rows" is not necessarily a large number for such a simple calculation. This is very basic stuff, maybe checking out a tutorial would also be beneficial.
You are much better off (i.e. faster calculations) using matrix operations instead of applying something by row. For example, assuming X is the matrix containing your counts (one row per hour, one column per value), you can get the weighted means the following way:
v <- 1:ncol(X)                              # the sample values 1, 2, 3, ...
wmeans <- as.vector(X %*% v) / rowSums(X)   # row-wise weighted means
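As a small worked check on the H1 row from the question (so the result should match the 2.6 computed by hand), the same matrix products also give a weighted spread; note that this is a population-style weighted variance, not the hand-rolled RMS above:
X <- rbind(H1 = c(72, 11, 98, 65))                   # counts of values 1..4 in hour H1
v <- 1:ncol(X)                                       # the sample values
wmean <- as.vector(X %*% v) / rowSums(X)             # 2.634...
wvar  <- as.vector(X %*% v^2) / rowSums(X) - wmean^2 # weighted variance
wsd   <- sqrt(wvar)                                  # about 1.16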
Assuming your table is a matrix called dataset with one row per hour, and that the values 1, 2, 3, ... are stored in a values vector, you just need to do:
# the 1 as the 2nd argument applies the function over the rows
w.means <- apply(dataset, 1, function(counts) weighted.mean(values, counts))
