Perform pairwise comparison of matrix - r

I have a matrix of n variables and I want to make a new matrix containing the pairwise difference of every pair of columns, but not of a column with itself. Here is an example of the data.
Transportation.services Recreational.goods.and.vehicles Recreation.services Other.services
2.958003 -0.25983789 5.526694 2.8912009
2.857370 -0.03425164 5.312857 2.9698044
2.352275 0.30536569 4.596742 2.9190123
2.093233 0.65920773 4.192716 3.2567390
1.991406 0.92246531 3.963058 3.6298314
2.065791 1.06120930 3.692287 3.4422340
I tried the for loop below, but I'm aware that R is very slow with loops.
Difference.Matrix <- function(data){
  n <- 2
  new.cols <- "New Columns"
  list <- list()
  for (i in 1:ncol(data)){
    for (j in n:ncol(data)){
      name <- paste("diff", i, j, data[,i], data[,j], sep=".")
      new <- data[,i] - data[,j]
      list[[new.cols]] <- c(name)
      data <- merge(data, new)
    }
    n <- n + 1
  }
  results <- list(data=data)
  return(results)
}
As I said, the code runs very slowly and has not even finished a single pass yet. I apologize for the beginner-level coding. I am also aware this code keeps the original data in the matrix, but I can delete that later.
Is it possible for me to use an apply function or foreach on this data?

You can find the pairs with combn and use apply to create the result:
apply(combn(ncol(d), 2), 2, function(x) d[,x[1]] - d[,x[2]])
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 3.217841 -2.568691 0.0668021 -5.786532 -3.151039 2.6354931
## [2,] 2.891622 -2.455487 -0.1124344 -5.347109 -3.004056 2.3430526
## [3,] 2.046909 -2.244467 -0.5667373 -4.291376 -2.613647 1.6777297
## [4,] 1.434025 -2.099483 -1.1635060 -3.533508 -2.597531 0.9359770
## [5,] 1.068941 -1.971652 -1.6384254 -3.040593 -2.707366 0.3332266
## [6,] 1.004582 -1.626496 -1.3764430 -2.631078 -2.381025 0.2500530
You can add appropriate names with another apply. Here the column names are very long, which hurts the formatting, but the labels show which difference is in each column:
x <- apply(combn(ncol(d), 2), 2, function(x) d[,x[1]] - d[,x[2]])
colnames(x) <- apply(combn(ncol(d), 2), 2, function(x) paste(names(d)[x], collapse=' - '))
> x
Transportation.services - Recreational.goods.and.vehicles Transportation.services - Recreation.services
[1,] 3.217841 -2.568691
[2,] 2.891622 -2.455487
[3,] 2.046909 -2.244467
[4,] 1.434025 -2.099483
[5,] 1.068941 -1.971652
[6,] 1.004582 -1.626496
Transportation.services - Other.services Recreational.goods.and.vehicles - Recreation.services
[1,] 0.0668021 -5.786532
[2,] -0.1124344 -5.347109
[3,] -0.5667373 -4.291376
[4,] -1.1635060 -3.533508
[5,] -1.6384254 -3.040593
[6,] -1.3764430 -2.631078
Recreational.goods.and.vehicles - Other.services Recreation.services - Other.services
[1,] -3.151039 2.6354931
[2,] -3.004056 2.3430526
[3,] -2.613647 1.6777297
[4,] -2.597531 0.9359770
[5,] -2.707366 0.3332266
[6,] -2.381025 0.2500530
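As a side note, combn can also build the labels by applying a function to each pair of names, which keeps the differences and their labels in the same order. A compact variant of the answer above (assuming d is the example data frame):
idx <- combn(ncol(d), 2)                                    # 2 x choose(n, 2) matrix of column indices
x <- apply(idx, 2, function(i) d[, i[1]] - d[, i[2]])       # pairwise differences, one per column
colnames(x) <- combn(names(d), 2, paste, collapse = " - ")  # "colA - colB" labels in the same pair order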


How to fix? 'It is recommended to have a covariance matrix with a determinant bigger than 1/ ((2*PI)^k) .' when using RcppHMM package in R

I am using the RcppHMM package to build a GHMM (multivariate Gaussian mixture HMM) with continuous observations.
I want to run the EM algorithm on continuous observations with different sequence lengths; specifically, each observation has a sequence length between 3 and 6.
I tried fitting the model on the whole dataset at once (I built the dataset with ncol = 6, the maximum sequence length, and filled the empty part with zeros), but that did not work, so I split the observations into groups of the same length [O3, O4, O5, O6] and updated the model with each group in turn.
Each observation group looks like this:
O3
[,1] [,2] [,3]
[1,] 0.8550940 0.3231340 0.8639223
[2,] 0.4453262 0.5840305 0.4356958
[3,] 0.4344789 -1.2234760 0.4344789
[4,] -0.5003085 3.0322560 -0.5003085
[5,] -0.1459598 -0.4661041 -0.1459598
[6,] -0.1977263 -0.6352724 -0.1977263
O4
[,1] [,2] [,3] [,4]
[1,] 0.8965332 0.3338220 0.7270241 0.8824540
[2,] 0.4033438 0.4131293 0.1593136 0.4187023
[3,] -0.7329015 -1.6828296 -0.1550487 -0.1550487
[4,] -0.3213490 7.3449076 -0.2787857 -0.2787857
[5,] -0.2868067 -0.3743332 -0.1340566 -0.1340566
[6,] 2.6832742 -0.5844305 0.2320774 0.2320774
O5
[,1] [,2] [,3] [,4] [,5]
[1,] 0.83401341 0.2492370 0.47493190 0.6440035 0.84985396
[2,] 0.37988234 0.2335883 0.17043570 0.2116066 0.36260248
[3,] -0.05240445 -0.3034002 -0.05240445 -0.3034002 -0.05240445
[4,] -0.37240867 1.1500528 -0.37240867 1.1500528 -0.37240867
[5,] -0.02056839 0.9343497 -0.02056839 0.9343497 -0.02056839
[6,] -0.27586584 -0.4406833 -0.27586584 -0.4406833 -0.27586584
O6
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.9287066 0.35065802 0.4493442 0.6142040 0.7423286 0.9217381
[2,] 0.3852644 0.09612516 0.1623447 0.1320334 0.1875127 0.3928661
[3,] 0.1436024 -0.08326038 0.7800491 0.1436024 0.1926751 0.1436024
[4,] -0.4284304 -0.27916609 -0.5224586 -0.4284304 0.1267840 -0.4284304
[5,] -0.8846364 -0.81131525 -0.1781479 -0.8846364 -0.1266250 -0.8846364
[6,] -0.2141231 -0.78377461 -0.4440142 -0.2141231 -0.7888260 -0.2141231
Here nrow is the dimensionality of the observations and ncol is the sequence length.
Updating the model with the first group (sequence length 3) worked. But when I tried to re-update the model with the second group (sequence length 4), I got the warning below:
In learnEM(newModel, O4[, 1:4, ], iter = 20, delta = 1e-05, print = TRUE) :
It is recommended to have a covariance matrix with a determinant bigger than 1/ ((2*PI)^k) .
Does anyone know how to fix this warning message?
And is there a proper way to run the EM algorithm with observations of different sequence lengths using this package?
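For context only, the sequential-update workflow described above could be sketched roughly as follows; this illustrates the setup rather than fixing the warning. The number of hidden states is an assumption, and O3..O6 are taken to be the 3-D observation arrays (dimension x sequence length x number of sequences) that learnEM expects for a GHMM:
library(RcppHMM)

# Assumption: 2 hidden states; the printed data suggest 6-dimensional observations
model <- initGHMM(2, 6)

# O3, O4, O5, O6: groups of equal-length sequences as described in the question
for (O in list(O3, O4, O5, O6)) {
  model <- learnEM(model, O, iter = 20, delta = 1e-05, print = FALSE)
}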

Nested apply function

I have two matrices and want to run linear regressions. Briefly, I want the p-value for the regression of each column of a on each column of b, respectively.
So I want to calculate the association between (the number indicates the column):
a1 ~ b1
a1 ~ b2
...
a2 ~ b1
a2 ~ b2
...
a3 ~ b1
...
The both datasets:
set.seed(1232)
a <- matrix(runif(100,min=6,max=14),10)
b <- matrix(sample(c(0,1),100,replace = T),10)
I tried a loop; this works but it is too slow:
res <- NULL
for (i in 1:ncol(a)) {
  tmp <- apply(b, 2, function(y, x) summary(lm(x ~ y))$coefficients[2, 4], a[, i])
  res <- cbind(res, tmp)
  print(i)
}
So I tried a nested apply like this, but it does not work. Do you have an idea?
apply(b, 2, function(y,x) apply(x,2,function(x,y) summary(lm(x~y))$coefficients[2,4]), a)
Try this:
apply(b, 2, function(x) apply(a, 2, function(y) summary(lm(x~y))$coefficients[2,4]))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.8625602 0.2411563 0.7612476 0.509436082 0.3577235 0.45874974 0.360808572 0.05617883
[2,] 0.4136582 0.1186081 0.5161454 0.819513642 0.7813467 0.75912163 0.908950370 0.80584144
[3,] 0.5388209 0.8055687 0.7549796 0.935750996 0.0646300 0.79523596 0.973429634 0.55095667
[4,] 0.9930913 0.9622925 0.3753466 0.552830253 0.9636374 0.75312925 0.997724288 0.39246169
[5,] 0.2238149 0.4628420 0.5969530 0.007438859 0.4620672 0.42912431 0.007249279 0.50000827
[6,] 0.8022590 0.2352531 0.8990588 0.760952083 0.5794459 0.06170874 0.459247551 0.94625005
[7,] 0.4928149 0.1462937 0.5620584 0.554991195 0.6128055 0.55182670 0.874316882 0.55169689
[8,] 0.6631461 0.6260493 0.7679088 0.714076186 0.1254954 0.14316276 0.961166356 0.55342849
[9,] 0.9449110 0.2703502 0.5328246 0.533630873 0.2036671 0.87532137 0.402796595 0.24040106
[10,] 0.4151634 0.3584605 0.6923008 0.599701142 0.4649529 0.98238156 0.628130071 0.17310254
[,9] [,10]
[1,] 0.38552290 0.03078476
[2,] 0.03466566 0.64135540
[3,] 0.44603945 0.57578621
[4,] 0.47220820 0.45735156
[5,] 0.33202974 0.63330763
[6,] 0.09964719 0.19571414
[7,] 0.72649867 0.61591287
[8,] 0.22911914 0.84239810
[9,] 0.30766378 0.12782897
[10,] 0.01139275 0.46489123
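To make the orientation of this result explicit (rows correspond to the columns of a, columns to the columns of b), you can attach dimension names; a small sketch using hypothetical labels a1..a10 and b1..b10:
res <- apply(b, 2, function(x) apply(a, 2, function(y) summary(lm(x ~ y))$coefficients[2, 4]))
dimnames(res) <- list(paste0("a", seq_len(ncol(a))),   # rows: columns of a
                      paste0("b", seq_len(ncol(b))))   # columns: columns of b
res["a4", "b3"]   # p-value for a[, 4] ~ b[, 3]; 0.3753466 with the seed above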
You don't need regression:
res <- outer(seq_len(ncol(a)),
             seq_len(ncol(b)),
             FUN = Vectorize(function(k, l) cor.test(a[, k], b[, l])$p.value))
res[4, 3]
#[1] 0.3753466
summary(lm(a[,4] ~ b[, 3]))$coefficients[2,4]
#[1] 0.3753466
You can of course use lm instead of cor.test in the same way.
Alternatively, you could use package psych:
library(psych)
corr.test(as.data.frame(a), as.data.frame(b), adjust = "none")$p
Of course, its default of adjusting for multiple testing reminds us that you really should do that.

Replicate each time with different standard deviation

I have a vector of standard deviations:
sd_vec<-runif(10,0,20) with 10 values between 0 and 20.
[1] 11.658106 9.693493 12.695608 4.091922 5.761061 18.410951 14.710990 12.095944 18.023123
[10] 13.294963
I would like to replicate the following process:
a<-rnorm(10,0,30)
[1] -21.265083 85.557147 23.958170 -32.843328 6.629831 -23.745339 46.094324 51.020059
[9] 1.041724 13.757235
n_columns=50
replicate(n_columns, a+rnorm(length(a), mean=0,sd=sd_vec))
The result should be 10 columns, each of which is:
column 1: a + rnorm(length(a),0,11.658106)
column 2: a + rnorm(length(a),0,9.693493)
column 3: a + rnorm(length(a),0,12.695608)
.
.
.
column 10:a + rnorm(length(a),0,13.294963)
Will this use a different value of sd_vec for each replication, or will it recycle sd_vec across the individual random draws?
According to your edit, you may want to try
a+sapply(sd_vec, rnorm, n=100, mean=0)
# example
> set.seed(1)
> sd_vec <-runif(10,0,20)
> set.seed(1)
> a<-rnorm(100,0,30)
> n_columns=10
> head(a+sapply(sd_vec, rnorm, n=100, mean=0))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] -22.087869 -15.746650 -8.554735 0.7226986 -18.481801 -24.921835 -32.16206 -33.158153 -38.187974
[2,] 5.732942 18.078702 -6.489666 39.9422684 4.311839 32.504554 42.75921 -18.624133 7.954302
[3,] -29.906010 -13.260709 -2.483113 -36.0217953 -29.841630 -15.576334 -26.76925 -11.915258 -21.741820
[4,] 48.697584 45.395650 43.463125 40.7586401 47.903975 57.600406 47.59359 47.701659 33.782184
[5,] 6.409275 -7.122582 28.836887 2.3249113 13.884993 7.429514 -11.34081 1.960571 18.075706
[6,] -15.229450 -6.025260 -7.288529 -31.4375515 -18.184563 -45.038651 -50.00938 -26.965804 -37.610292
[,10]
[1,] -17.391109
[2,] 6.883342
[3,] -26.144900
[4,] 48.118830
[5,] 9.970987
[6,] -26.668629
Your current solution recycles sd_vec across the elements within each replication, rather than using one sd per replication.
If you want one column per sd, you can work with matrices. Create the matrix of normal draws with the desired sds:
X <- rnorm(length(a)*n_columns, mean=0, sd=sd_vec)
X <- matrix(X, nrow=length(a), ncol=n_columns, byrow=TRUE)
Then add it to a, converted to a matrix:
matrix(a, nrow=length(a), ncol=n_columns) + X
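As a quick check that the sapply approach maps one sd to each column, you can compare the empirical column standard deviations with sd_vec (a small sketch; a larger n is used so the agreement is visible):
set.seed(2)
noise <- sapply(sd_vec, rnorm, n = 1000, mean = 0)   # column j is drawn with sd = sd_vec[j]
round(apply(noise, 2, sd), 1)   # empirical sd of each column
round(sd_vec, 1)                # should closely match the line above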

R hdf5 dataset written incorrectly?

When I execute the following, my "predictors" dataset is populated correctly:
library(rhdf5)
library(forecast)
library(sltl)
library(tseries)
fid <- H5Fcreate(output_file)
## TODO: compute the order p
p <- 4
# write predictors
h5createDataset(output_file, dataset="predictors", c(p, length(tsstl.remainder) - (p - 1)), storage.mode='double')
predictors <- as.matrix(tsstl.remainder)
for (i in 1:(p - 1)) {
  predictors <- as.matrix(cbind(predictors, Lag(as.matrix(tsstl.remainder), i)))
}
predictors <- as.matrix(predictors[-1:-(p-1),])
head(predictors)
h5write(predictors, output_file, name="predictors")
H5Fclose(fid)
The generated (correct) output for head(predictors) is:
[,1] [,2] [,3] [,4]
[1,] 0.3089645 6.7722063 5.1895389 5.2323261
[2,] 8.7607228 0.3089645 6.7722063 5.1895389
[3,] -0.9411553 8.7607228 0.3089645 6.7722063
[4,] -14.1390243 -0.9411553 8.7607228 0.3089645
[5,] -26.6605296 -14.1390243 -0.9411553 8.7607228
[6,] -8.1293076 -26.6605296 -14.1390243 -0.9411553
However, when I read it the results are not correct:
tsmatrix <- t(as.matrix(h5read(output_file, "predictors")))
head(tsmatrix)
Incorrectly outputs:
[,1] [,2] [,3] [,4]
[1,] 0.3089645 8.760723 -0.9411553 -14.13902
[2,] -26.6605296 -8.129308 -9.8687675 31.52086
[3,] 54.2703126 43.902489 31.8164836 43.87957
[4,] 22.1260636 36.733055 54.7064107 56.35158
[5,] 36.3919851 25.193068 48.2244464 57.12196
[6,] 48.0585673 72.402673 68.3265518 80.18960
How come what I write does not correspond to what I get back? I double-checked, and the HDFView HDF5 viewer also shows these incorrect values for the "predictors" dataset.
What is wrong here?
From the rhdf5 docs:
Please note, that arrays appear as transposed matrices when opening it
with a C-program (h5dump or HDFView). This is due to the fact the
fastest changing dimension on C is the last one, but on R it is the
first one (as in Fortran).
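To illustrate that note with a self-contained round trip (hypothetical file name): a matrix written by rhdf5 is reported with reversed dimensions by C-based tools such as h5dump or HDFView, while reading it back with h5read inside R restores the original orientation:
library(rhdf5)

m <- matrix(1:6, nrow = 2)            # 2 x 3 in R
h5createFile("example.h5")
h5write(m, "example.h5", name = "m")
str(h5read("example.h5", "m"))        # still 2 x 3 when read back in R; HDFView reports 3 x 2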

Weighted variance-covariance matrices and lapply

I have a list prob with 50 elements. Each element is a 601x3 matrix of probabilities, each row of which represents a complete sample space (i.e., each row of each matrix sums to 1). For instance, here are the first five rows of the first element of prob:
> prob[[1]][1:5,]
[,1] [,2] [,3]
[1,] 0.6027004 0.3655563 0.03174335
[2,] 0.6013667 0.3665756 0.03205767
[3,] 0.6000306 0.3675946 0.03237481
[4,] 0.5986921 0.3686131 0.03269480
[5,] 0.5973513 0.3696311 0.03301765
Now, what I want to do is to create the following matrix for each row of each matrix/element in the list prob. Taking the first row, let a = .603, b = .366, and c = .032 (rounding to three decimal places). Then,
> w
[,1] [,2] [,3]
[1,] a*(1-a) -a*b -a*c
[2,] -b*a b*(1-b) -b*c
[3,] -c*a -c*b c*(1-c)
Such that:
> w
[,1] [,2] [,3]
[1,] 0.239391 -0.220698 -0.019296
[2,] -0.220698 0.232044 -0.011712
[3,] -0.019296 -0.011712 0.030976
I want to obtain a similar 3x3 matrix 600 more times (for the rest of the rows of this matrix) and then to repeat this entire process 49 more times for the rest of the elements of prob. The only thing I can think of is to call apply within lapply so that I am accessing each row of each matrix one-at-a-time. I'm sure that is not an elegant way to do this (not to mention I can't get it to work), but I can't think of anything else. Can anyone help me out with this? I'd also love to hear suggestions for using a different structure (e.g., is it bad to use matrices within lists?).
Running this process with lapply on a list of similarly dimensioned matrices should be very simple. If it presents a challenge, then you should post the dput(.) output for a two-element list with similar matrices. The real challenge is doing the processing row by row, which is illustrated here with the output being a 3 x 3 x N array:
w <- apply(M, 1, function(rw) diag(rw * (1 - rw)) +
             rbind(rw * c(0,      -rw[1], -rw[1]),
                   rw * c(-rw[2],  0,     -rw[2]),
                   rw * c(-rw[3], -rw[3],  0)))
w
[,1] [,2] [,3] [,4] [,5]
[1,] 0.23945263 0.23972479 0.23999388 0.24025987 0.24052272
[2,] -0.22032093 -0.22044636 -0.22056801 -0.22068575 -0.22079962
[3,] -0.01913173 -0.01927842 -0.01942588 -0.01957412 -0.01972314
[4,] -0.22032093 -0.22044636 -0.22056801 -0.22068575 -0.22079962
[5,] 0.23192489 0.23219793 0.23246881 0.23273748 0.23300395
[6,] -0.01160398 -0.01175156 -0.01190081 -0.01205173 -0.01220435
[7,] -0.01913173 -0.01927842 -0.01942588 -0.01957412 -0.01972314
[8,] -0.01160398 -0.01175156 -0.01190081 -0.01205173 -0.01220435
[9,] 0.03073571 0.03102998 0.03132668 0.03162585 0.03192748
w <- array(w, c(3,3,5) )
w
, , 1
[,1] [,2] [,3]
[1,] 0.23945263 -0.22032093 -0.01913173
[2,] -0.22032093 0.23192489 -0.01160398
[3,] -0.01913173 -0.01160398 0.03073571
, , 2
[,1] [,2] [,3]
[1,] 0.23972479 -0.22044636 -0.01927842
[2,] -0.22044636 0.23219793 -0.01175156
[3,] -0.01927842 -0.01175156 0.03102998
.... snipped remaining output
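Wrapping that row-wise construction in lapply then covers the whole list; a minimal sketch, assuming every element of prob is a 601 x 3 matrix of probabilities as described above:
w_list <- lapply(prob, function(M) {
  w <- apply(M, 1, function(rw) diag(rw * (1 - rw)) +
               rbind(rw * c(0,      -rw[1], -rw[1]),
                     rw * c(-rw[2],  0,     -rw[2]),
                     rw * c(-rw[3], -rw[3],  0)))
  array(w, c(3, 3, nrow(M)))     # one 3 x 3 matrix per row of M
})
# w_list[[1]][, , 1] is the matrix for the first row of prob[[1]]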
