I have several vectors:
a <- c(1.1, 2.9, 3.9, 5.2)
b <- c(1.0, 1.9, 4.0, 5.1)
c <- c(0.9, 2.1, 3.1, 4.1, 5.0, 11.13)
They can have different lengths.
I want to combine them into one general vector, averaging values that are similar across all of the vectors (or across any pair of them), and keeping a value as-is if it appears in only one vector. For the averaging I would like to use a threshold of 0.2.
My explanation could be a bit confusing, but here is the general vector I want to obtain:
d <- c(1, 2, 3, 4, 5.1, 11.13)
I have around 12 vectors and about 2000 values in each vector.
I would be glad for any help.
This seems like a clustering problem (clustering by distance). You can try the code below:
library(igraph)

v <- sort(c(a, b, c))
# link values whose distance is within the threshold (plus a small floating-point
# tolerance), treat connected components as clusters, and average each cluster
tapply(
  v,
  membership(components(graph_from_adjacency_matrix(
    as.matrix(dist(v)) <= 0.2 + sqrt(.Machine$double.eps)
  ))),
  mean
)
which gives
    1     2     3     4     5     6
 1.00  2.00  3.00  4.00  5.10 11.13
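If you would rather avoid the igraph dependency, here is a minimal base-R sketch of the same idea: once the values are sorted, the connected components are simply the runs you get by cutting wherever the gap to the previous value exceeds the threshold.

a <- c(1.1, 2.9, 3.9, 5.2)
b <- c(1.0, 1.9, 4.0, 5.1)
c <- c(0.9, 2.1, 3.1, 4.1, 5.0, 11.13)

v <- sort(c(a, b, c))
# start a new group wherever the gap exceeds the threshold (plus a small
# floating-point tolerance), then average within each group
grp <- cumsum(c(TRUE, diff(v) > 0.2 + sqrt(.Machine$double.eps)))
tapply(v, grp, mean)
#     1     2     3     4     5     6
#  1.00  2.00  3.00  4.00  5.10 11.13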
I have a set of columns with numerical values that describe a given object (row) in some 5 dimensional space. I want to compute the distance for each object from a fixed object at various times. I can group_by time and perform the desired computation. The issue is that I'm not sure how to do the computation. I want to use the Euclidean distance (squared) to measure the distance between these objects in 5 dimensional space. So clearly at each time, the reference object should be 0 distance from itself.
The metric should be: the distance from object x to the Reference object is
(x1 - Reference1)^2 + (x2 - Reference2)^2 + ... + (x5 - Reference5)^2
I'm VERY new to working in R (and programming in general), so I was hoping this exercise would help me learn; I apologize if my question is not appropriate.
My data looks like
Distances from each row to all other rows can be computed in base R like this:
mtx <- structure(c(2.8, 6.4, 1.7, 3.2, 24.2, 25.5, 5.4, 16.2, 15.6, 25.1, 8.6, 15.4, 0.7, 0.8, 0.1, 0.5, 0.1, 0.4, 0.04, 0.2), .Dim = 4:5)
# all pairwise squared Euclidean distances between rows
outer(seq_len(nrow(mtx)), seq_len(nrow(mtx)),
      function(a, b) rowSums((mtx[a, ] - mtx[b, ])^2))
#          [,1]     [,2]     [,3]     [,4]
# [1,]   0.0000 105.0000 404.0136  64.2500
# [2,] 105.0000   0.0000 698.9696 190.9500
# [3,] 404.0136 698.9696   0.0000 165.3156
# [4,]  64.2500 190.9500 165.3156   0.0000
Granted, you only need to calculate (less-than) half of that matrix, since the diagonal is always zero and the upper/lower triangles of it are mirrors, but this gives you what you need. For instance, the distances from the third row to all other rows are in the third row (and third column).
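For what it's worth, base R's dist() already exploits that symmetry, since it only computes each pair of rows once; squaring its Euclidean distances gives the same matrix as above:

# same squared-distance matrix; dist() computes only the lower triangle internally
as.matrix(dist(mtx))^2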
If all you need is one row compared to all others, then
rowSums((mtx[rep(3,nrow(mtx)),] - mtx)^2)
# [1] 404.0136 698.9696 0.0000 165.3156
The mtx[rep(3,nrow(mtx)),] creates a same-size matrix so that subtraction works seamlessly.
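An equivalent, perhaps more explicit, alternative is sweep(), which subtracts the reference row from every row before squaring and summing:

# subtract row 3 from every row, square, and sum across the columns
rowSums(sweep(mtx, 2, mtx[3, ], FUN = "-")^2)
# [1] 404.0136 698.9696   0.0000 165.3156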
Am I missing something obvious, or is Matlab's kstest2 giving very poor p-values?
By "very poor" I mean that I suspect it is even wrongly implemented.
The help page of kstest2 states that the function calculates an asymptotic p-value, though I did not find any reference to exactly which method is used. Anyway, the description further states:
asymptotic p-value becomes very accurate for large sample sizes, and is believed to be reasonably accurate for sample sizes n1 and n2, such that (n1*n2)/(n1 + n2) ≥ 4
Example 1
Let's take Example 6 from Lehmann and D'Abrera (1975):
sampleA = [6.8, 3.1, 5.8, 4.5, 3.3, 4.7, 4.2, 4.9];
sampleB = [4.4, 2.5, 2.8, 2.1, 6.6, 0.0, 4.8, 2.3];
[h,p,ks2stat] = kstest2(sampleA, sampleB, 'Tail', 'unequal');
(n1*n2)/(n1 + n2) = 4 in this case, so the p-value should be reasonably accurate.
Matlab yields p = 0.0497, while the solution given in the book is 0.0870.
To validate the solution I used R, which I trust more than Matlab, especially in statistics.
Using ks.test from stats package and ks.boot from Matching package:
ks.test(sampleA, sampleB, alternative = "two.sided")
ks.boot(sampleA, sampleB, alternative = "two.sided")
Both give p = 0.0870.
Example 2
Let's use kstest2's own example to compare the Matlab and R results for a larger sample size:
rng(1); % For reproducibility
x1 = wblrnd(1,1,1,50);
x2 = wblrnd(1.2,2,1,50);
[h,p,ks2stat] = kstest2(x1,x2);
This yields p = 0.0317. Now, using the same x1 and x2 vectors, R gives p = 0.03968.
That is about a 20% difference, in a case where a very accurate result is expected since (n1*n2)/(n1 + n2) = 25.
Am I missing or messing up something?
Is it possible that Matlab's kstest2 performs as poorly as these examples indicate? Which approximation or algorithm is kstest2 using? (I can see the implemented code for kstest2, but a reference to a book or paper would be much better for understanding what is going on.)
I am using Matlab 2016a.
Lehmann and D'Abrera (1975). Nonparametrics: Statistical Methods Based on Ranks. 1st edition. Springer.
I think that the correct test to compare with R's ks.test would be Octave's kolmogorov_smirnov_test_2:
sampleA = [6.8, 3.1, 5.8, 4.5, 3.3, 4.7, 4.2, 4.9];
sampleB = [4.4, 2.5, 2.8, 2.1, 6.6, 0.0, 4.8, 2.3];
kolmogorov_smirnov_test_2(sampleA, sampleB)
pval: 0.0878664
The difference appears to be the use of ks versus lambda, i.e.
ks = sqrt (n) * d;
pval = 1 - kolmogorov_smirnov_cdf (ks);
versus
lambda = max((sqrt(n) + 0.12 + 0.11/sqrt(n)) * d , 0);
pval = 1 - kolmogorov_smirnov_cdf (lambda);
I presume the different test statistics arise from differences in the research papers cited by these two functions. If you want a deeper dive into the statistical theory, you may want to ask on Cross Validated.
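To see how much that correction matters at this sample size, you can evaluate both statistics against the asymptotic Kolmogorov distribution yourself. Here is a small R sketch (ks_pval is my own helper implementing the usual series expansion, and D = 0.625 is the two-sample KS statistic for the 8-vs-8 example):

# asymptotic Kolmogorov p-value: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * x^2)
ks_pval <- function(x, kmax = 100) {
  k <- seq_len(kmax)
  2 * sum((-1)^(k - 1) * exp(-2 * k^2 * x^2))
}

D <- 0.625             # ks2stat for sampleA vs sampleB
n <- 8 * 8 / (8 + 8)   # n1*n2/(n1 + n2) = 4

ks_pval(sqrt(n) * D)                             # ~0.088, the Octave/R-style value
ks_pval((sqrt(n) + 0.12 + 0.11 / sqrt(n)) * D)   # ~0.050, the MATLAB-style value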
Suppose I have the following ordered data sets:
X <- c(12, 15, 23, 4, 9, 36, 10, 16, 67, 45, 58, 32, 40, 58, 33)
# and
Y <- c(1.5, 3.3, 10, 2.1, 8.3, 6.3, 4, 5.1, 1.4, 1.6, 1.8, 3.1, 2.2, 4, 3)
What is meant by "the correlation of their ordered pairs after standardization"?
How do I find (code for) it in R?
To standardize the given sets X and Y, we first calculate the population mean, variance and standard deviation of each set. Next, we subtract each set's mean from each of its values, and finally we divide those differences by that set's standard deviation; the results are the Z-scores of the set (one per value, say Xi). Doing so, we get a mean of 0 and a standard deviation of 1 for both the X and Y sets.
This is the standardized condition: every standardized set has mean zero and standard deviation one (in your case, X and Y).
We will also look into the relationships between the ordered pairs. If we look at standard relationships such as the covariance, the correlation, the slope of the best-fit line of Y against X, and its Y intercept: will they be the same for the original values and the standardized values, or will they be different? And if they differ, by how much and why? This was the context of the question.
What I tried in R is as follows:
Your Data Set is:
X <- c(12, 15, 23, 4, 9, 36, 10, 16, 67, 45, 58, 32, 40, 58, 33)
# and
Y <- c(1.5, 3.3, 10, 2.1, 8.3, 6.3, 4, 5.1, 1.4, 1.6, 1.8, 3.1, 2.2, 4, 3)
Statistics for Original Data, where n = 15 observations for X and Y each
# Variance
VarX <- sum((X - mean(X))^2)/15 ## Which gives us Variance of X set as 374.5156
VarY <- sum((Y - mean(Y))^2)/15 ## Which gives us Variance of Y set as 6.226489
# Standard Deviation
sdX <- sqrt(VarX) ## Which gives us Std. Dev. of X set as 19.3524
sdY <- sqrt(VarY) ## Which gives us Std. Dev. of Y set as 2.495293
# Z-scores
Z_Score_X <- (X - mean(X))/sdX
Z_Score_Y <- (Y - mean(Y))/sdY
# A Check, mean of ZScores should be close or equal to 0
# and Std. Dev. must be close or equal to 1
round(mean(Z_Score_X), 0) # Yes, it is 0
round(sd(Z_Score_X), 0) # Yes, it is 1
round(mean(Z_Score_Y), 0) # Yes, it is 0
round(sd(Z_Score_Y), 0) # Yes, it is 1
This is the standardized condition: both X and Y now have the same mean (0) and standard deviation (1), as the Z-score checks above show.
Now let's look at the relationships between the ordered pairs: the covariance, the correlation, the slope of the best-fit line of Y against X, and its intercept. Will they be the same for the original values and the standardized values, or will they differ, and if so by how much and why?
Let's calculate the rest...
First we look at the covariance of X and Y:
Cov(X, Y) = (1/n) * sum over i = 1..n of (Xi - mean(X)) * (Yi - mean(Y))
where each (Xi, Yi) together forms an ordered pair (remember the Z-scores computed above).
# Covariance for older sets (X, Y)
covXY <- (1/15) * sum((X - mean(X))*(Y - mean(Y)))
# Covariance for New sets (Z_Score_X, Z_Score_Y)
covXYZ <- (1/15) * sum((Z_Score_X - mean(Z_Score_X))*(Z_Score_Y - mean(Z_Score_Y)))
Next we look at the slope (Beta) of the best-fit line of Y against X.
Recall that Beta = slope = delta_Y / delta_X.
# Slope for old set (X, Y)
Beta_X_Y <- round(lm(Y ~ X)$coeff[[2]], 2)
# Slope for the standardized values in the new set (Z_Score_X, Z_Score_Y)
Beta_ZScoreXY <- round(lm(Z_Score_Y ~ Z_Score_X)$coeff[[2]], 2)
Please note that the intercept for the standardized values will always be ZERO.
The reason is that the best-fit line always passes through the point of means, and for standardized data both means are zero (as in our case, the means of Z_Score_X and Z_Score_Y are 0 and 0).
In other words, for standardized data the best-fit line must go through the origin (exactly, up to floating-point rounding).
# Intercept for old set
Intercept_X_Y <- round(lm(Y ~ X)$coeff[[1]], 2)
# 5.17
# Intercept for standardized set, should be zero
Intercept_ZScore_X_Y <- round(lm(Z_Score_Y ~ Z_Score_X)$coeff[[1]], 2)
# Yes, it is 0
Finally, we look at the correlation, which is equal to the covariance of X and Y divided by the standard deviation of X times the standard deviation of Y.
# Correlation of old set
CorrelationXY <- round(covXY / (sdX * sdY), 2)
# Variance for new set
VarZScoreX <- sum((Z_Score_X - mean(Z_Score_X))^2)/15
VarZScoreY <- sum((Z_Score_Y - mean(Z_Score_Y))^2)/15
sdZScoreX <- sqrt(VarZScoreX)
sdZScoreY <- sqrt(VarZScoreY)
# Correlation of new set
correlation_ZScore_X_Y <- round(covXYZ / (sdZScoreX * sdZScoreY), 2)
Therefore, what we see here is that the one quantity that stays the same between the old data and the new standardized (Z-score) data is the correlation (in our case -0.34). The correlation is UNCHANGED.
Another point to note: for any standardized set, the slope and the covariance are EQUAL to the correlation (all -0.34 in our case), and the intercept of the standardized set is equal to zero.
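As a compact cross-check of all of the above, base R's scale(), cor() and lm() reproduce the same conclusions (note that scale() uses the sample standard deviation with n - 1, which does not change any of these results):

zx <- as.numeric(scale(X))   # standardized X
zy <- as.numeric(scale(Y))   # standardized Y

cor(X, Y)            # ~ -0.34
cor(zx, zy)          # identical: correlation is unchanged by standardization
cov(zx, zy)          # equals the correlation for standardized data
coef(lm(zy ~ zx))    # intercept ~ 0, slope equals the correlation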
As we know, the quantile function is the inverse of the cumulative distribution function.
Then, for an existing distribution (a vector), how can I exactly match the results of the cumulative distribution function and the quantile function?
Here is an example given in MATLAB.
a = [150 154 151 153 124];
[x_count, x_val] = hist(a, unique(a));
% compute the cumulative probabilities
p = cumsum(x_count) / sum(x_count);
x_out = quantile(a, p)
In the cumulative distribution function, the corresponding relation between cumulative probability and x value should be:
x =      124      150      151      153      154
p =   0.2000   0.4000   0.6000   0.8000   1.0000
But using p with quantile to compute x_out, the result is different from x:
x_out =
137.0000 150.5000 152.0000 153.5000 154.0000
Reference
quantile function
matlab quantile function
From the docs:
For a data vector of five elements such as {6, 3, 2, 10, 1}, the sorted elements {1, 2, 3, 6, 10} respectively correspond to the 0.1, 0.3, 0.5, 0.7, 0.9 quantiles.
So if you wanted to get the exact numbers out that you put in for x, and your x has 5 elements, then your p needs to be p = [0.1, 0.3, 0.5, 0.7, 0.9]. The complete algorithm is explicitly defined in the documentation.
You have assumed that to get x back, p should have been [0.2, 0.4, 0.6, 0.8, 1]. But then why not p = [0, 0.2, 0.4, 0.6, 0.8]? Matlab's algorithm effectively takes the midpoint of those two conventions.
Note that R defines nine different algorithms for quantiles, so your assumptions need to be stated clearly.
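If it helps to see the correspondence from the R side, type = 5 in R's quantile() uses the same (k - 0.5)/n positions described above, so (assuming your MATLAB version follows the documented default) it reproduces both behaviours:

a <- c(150, 154, 151, 153, 124)
p <- c(0.2, 0.4, 0.6, 0.8, 1.0)

quantile(a, p, type = 5)
#   20%   40%   60%   80%  100%
# 137.0 150.5 152.0 153.5 154.0      # same as MATLAB's x_out

quantile(a, (seq_along(a) - 0.5) / length(a), type = 5)
# 10% 30% 50% 70% 90%
# 124 150 151 153 154                # the sorted data come back exactly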
I have a square matrix M in R whose entries are all real numbers between 0.5 and 1.9. I want to build an adjacency matrix by imposing the condition that whenever an element is less than a threshold (e.g. 0.6), it is replaced by zero; otherwise it is replaced by 1.
I want to do this for all 141 thresholds in seq(0.5, 1.9, 0.01), so that I get 141 adjacency matrices. How can I do this, and how can I save or print all of those matrices in R? Any help will be appreciated. Kindly bear with my poor knowledge in R :-)
You could use lapply to loop over the values of "Seq1", create the binary matrix ((M >= x) + 0L) for each threshold x, and store the results in a list ("lst"):
lst <- lapply(Seq1, function(x) (M >=x)+0L)
length(lst)
#[1] 141
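To cover the save/print part of the question, one option (the file names here are just placeholders) is to name the list elements by threshold and then either serialize the whole list at once or write one CSV per threshold:

# label each matrix with its threshold
names(lst) <- sprintf("threshold_%.2f", Seq1)

# one file holding all 141 matrices
saveRDS(lst, "adjacency_matrices.rds")

# ...or one CSV per threshold
for (nm in names(lst)) write.csv(lst[[nm]], paste0(nm, ".csv"), row.names = FALSE)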
data
Seq1 <- seq(0.5, 1.9, 0.01)
set.seed(24)
M <- matrix(sample(Seq1, 10*10, replace=TRUE), ncol=10)