Obtaining the observations within each cluster - r

Is it possible to obtain the actual observations within each cluster after performing k-means in R?
For example, if my analysis yields 2 clusters, can I find the exact observations that belong to each one?

# random samples
x <- matrix(c(rnorm(30,10,2), rnorm(30,0,1)), nrow=12, byrow=TRUE)
# clustering
clusters <- kmeans(x, 2)
# accessing cluster membership
clusters$cluster
[1] 1 1 1 1 1 1 2 2 2 2 2 2
# samples within cluster 1
c1 <- x[which(clusters$cluster == 1),]
# samples within cluster 2
c2 <- x[which(clusters$cluster == 2),]
# printing variables
x
[,1] [,2] [,3] [,4] [,5]
[1,] 10.8415151 9.3075438 9.443433171 13.5402818 7.0574904
[2,] 6.0721775 7.4570368 9.999411972 12.8186182 6.1697638
[3,] 11.3170525 10.9458832 7.576416396 12.7177707 6.7104535
[4,] 8.1377999 8.0558304 9.925363089 11.6547736 9.4911071
[5,] 11.6078294 8.7782984 8.619840508 12.2816048 9.4460169
[6,] 10.2972477 9.1498916 11.769122361 7.6224395 12.0658246
[7,] -0.9373027 -0.5051318 -0.530429758 -0.8200562 -0.0623147
[8,] -0.7257655 -1.1469400 -0.297539831 -0.0477345 -1.0278240
[9,] 0.7285393 -0.6621878 2.914976054 0.6390049 -0.5032553
[10,] 0.2672737 -0.6393167 -0.198287317 0.1430110 -2.2213365
[11,] -0.8679649 0.3354149 -0.003510304 0.6665495 0.6664689
[12,] 0.1731384 -1.8827645 0.270357961 0.3944154 1.3564678
c1
[,1] [,2] [,3] [,4] [,5]
[1,] 10.841515 9.307544 9.443433 13.540282 7.057490
[2,] 6.072177 7.457037 9.999412 12.818618 6.169764
[3,] 11.317053 10.945883 7.576416 12.717771 6.710454
[4,] 8.137800 8.055830 9.925363 11.654774 9.491107
[5,] 11.607829 8.778298 8.619841 12.281605 9.446017
[6,] 10.297248 9.149892 11.769122 7.622439 12.065825
c2
[,1] [,2] [,3] [,4] [,5]
[1,] -0.9373027 -0.5051318 -0.530429758 -0.8200562 -0.0623147
[2,] -0.7257655 -1.1469400 -0.297539831 -0.0477345 -1.0278240
[3,] 0.7285393 -0.6621878 2.914976054 0.6390049 -0.5032553
[4,] 0.2672737 -0.6393167 -0.198287317 0.1430110 -2.2213365
[5,] -0.8679649 0.3354149 -0.003510304 0.6665495 0.6664689
[6,] 0.1731384 -1.8827645 0.270357961 0.3944154 1.3564678
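With more than two clusters, repeating the subsetting line for each label gets tedious. A minimal sketch of one alternative, assuming the same kind of simulated data: split() groups the row indices by cluster label, and lapply() pulls out one sub-matrix per cluster. The names rows_by_cluster and obs_by_cluster and the seed are illustrative only.

```r
set.seed(42)  # arbitrary seed so the example is reproducible
x <- matrix(c(rnorm(30, 10, 2), rnorm(30, 0, 1)), nrow = 12, byrow = TRUE)
clusters <- kmeans(x, centers = 2)

# Row indices of x grouped by cluster label
rows_by_cluster <- split(seq_len(nrow(x)), clusters$cluster)

# One sub-matrix per cluster; drop = FALSE keeps a matrix even for a single row
obs_by_cluster <- lapply(rows_by_cluster, function(i) x[i, , drop = FALSE])

obs_by_cluster[["1"]]  # observations assigned to cluster 1
```

The list is named by cluster label, so obs_by_cluster[["2"]] gives the observations in cluster 2, and the same code works unchanged for any number of centers.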

Related

multiple matrix generation based on vectors in R

I have a (5x4) matrix in R, namely data, defined as follows:
data <- matrix(rnorm(5*4,mean=0,sd=1), 5, 4)
and I want to create 4 different matrices that follow this formula. Assume that data[,1] = [A1,A2,A3,A4,A5]. I want to create the following matrix:
G1 =
A1*A1 A1*A2 A1*A3 A1*A4 A1*A5
A2*A1 A2*A2 A2*A3 A2*A4 A2*A5
A3*A1 A3*A2 A3*A3 A3*A4 A3*A5
A4*A1 A4*A2 A4*A3 A4*A4 A4*A5
A5*A1 A5*A2 A5*A3 A5*A4 A5*A5
Similarly for the other columns, I want to calculate all the G matrices (G1, G2, G3, G4) at once. How can I achieve that?
results <- lapply(1:ncol(data), function(i) outer(data[, i], data[, i]))
results
[[1]]
[,1] [,2] [,3] [,4] [,5]
[1,] 0.37164 0.37582 0.33424 -0.105387 0.120936
[2,] 0.37582 0.38006 0.33800 -0.106574 0.122298
[3,] 0.33424 0.33800 0.30060 -0.094780 0.108765
[4,] -0.10539 -0.10657 -0.09478 0.029885 -0.034294
[5,] 0.12094 0.12230 0.10876 -0.034294 0.039354
[[2]]
[,1] [,2] [,3] [,4] [,5]
[1,] 0.94684 0.117862 -1.01368 2.01456 0.719629
[2,] 0.11786 0.014671 -0.12618 0.25077 0.089579
[3,] -1.01368 -0.126183 1.08525 -2.15679 -0.770432
[4,] 2.01456 0.250772 -2.15679 4.28633 1.531132
[5,] 0.71963 0.089579 -0.77043 1.53113 0.546941
[[3]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1.61048 0.344159 -0.453466 2.68019 -0.57121
[2,] 0.34416 0.073547 -0.096906 0.57276 -0.12207
[3,] -0.45347 -0.096906 0.127684 -0.75467 0.16084
[4,] 2.68019 0.572758 -0.754669 4.46044 -0.95062
[5,] -0.57121 -0.122068 0.160837 -0.95062 0.20260
[[4]]
[,1] [,2] [,3] [,4] [,5]
[1,] 0.559341 0.859297 0.451096 -0.0522063 -1.027929
[2,] 0.859297 1.320109 0.693004 -0.0802028 -1.579172
[3,] 0.451096 0.693004 0.363799 -0.0421032 -0.829002
[4,] -0.052206 -0.080203 -0.042103 0.0048727 0.095942
[5,] -1.027929 -1.579172 -0.829002 0.0959421 1.889075
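Each G matrix is the outer product of a column with itself, so base R's tcrossprod() should give the same result as outer() here. A small sketch, assuming freshly simulated data; the list names G1 to G4 are added only for convenience.

```r
set.seed(1)  # arbitrary seed for reproducibility
data <- matrix(rnorm(5 * 4, mean = 0, sd = 1), 5, 4)

# tcrossprod(v) computes v %*% t(v), i.e. the 5x5 outer product of v with itself
results <- lapply(seq_len(ncol(data)), function(i) tcrossprod(data[, i]))
names(results) <- paste0("G", seq_along(results))

# Compare with the outer() version for the first column
all.equal(results$G1, outer(data[, 1], data[, 1]))
```

Naming the list elements lets you write results$G3 instead of results[[3]].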

Matrix into another matrix with specified dimensions

I have a matrix with 2 columns, and I'd like to turn it into a matrix with specified dimensions.
> t <- matrix(rnorm(20), ncol=2, nrow=10)
[,1] [,2]
[1,] 1.4938530 1.2493088
[2,] -0.8079445 1.8715868
[3,] 0.5775695 -0.9277420
[4,] 0.4415969 2.6357908
[5,] 0.3209226 -1.1306049
[6,] 0.5109251 -0.8661100
[7,] 1.9495571 0.2092941
[8,] 0.7816373 1.1517466
[9,] 0.0300595 -0.1351532
[10,] 0.7550894 0.7778869
What I'd like to do is something like:
> tt <- matrix(t, ncol=4, nrow=5)
[,1] [,2] [,3] [,4]
[1,] 1.4938530 1.2493088 -0.8079445 1.8715868
[2,] 0.5775695 -0.9277420 0.4415969 2.6357908
[3,] etc.
I tried to do things with modulo but my head hurts too much for me to try even one more minute.
You can transpose your first matrix, so that data is stored in the order you want, and then fill the second matrix by row:
tt <- matrix(t(t), ncol=4, nrow=5, byrow = T)
t
# [,1] [,2]
# [1,] -1.4162465950 0.01532476
# [2,] -0.2366332875 -0.04024386
# [3,] 0.5146631983 -0.34720239
# [4,] 1.9243922633 -0.24016160
# [5,] 1.6161165230 0.63187438
# [6,] -0.3558181508 -0.73199138
# [7,] 0.7459405376 0.01934826
# [8,] -1.0428581093 -2.04422042
# [9,] 0.0003166344 0.98973993
#[10,] 0.6390745275 -0.65584930
tt
# [,1] [,2] [,3] [,4]
# [1,] -1.4162465950 0.01532476 -0.2366333 -0.04024386
# [2,] 0.5146631983 -0.34720239 1.9243923 -0.24016160
# [3,] 1.6161165230 0.63187438 -0.3558182 -0.73199138
# [4,] 0.7459405376 0.01934826 -1.0428581 -2.04422042
# [5,] 0.0003166344 0.98973993 0.6390745 -0.65584930
A matrix in R is essentially a vector with its data stored column by column. Extracting data by row is therefore less straightforward than extracting by column, which follows the storage order. Transposing the first matrix puts the data in the order you want to read it, and filling the second matrix by row is then straightforward.
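The storage order is easier to see with integers than with random numbers. A minimal sketch, using a hypothetical matrix m in place of t:

```r
# 10x2 matrix: column 1 holds 1..10, column 2 holds 11..20
m <- matrix(1:20, nrow = 10, ncol = 2)

as.vector(m)     # column-major storage: 1 2 3 ... 10 11 12 ... 20
as.vector(t(m))  # after transposing, rows of m come first: 1 11 2 12 3 13 ...

# Refill by row so that each new row holds two consecutive rows of m
tt <- matrix(t(m), nrow = 5, ncol = 4, byrow = TRUE)
tt[1, ]  # 1 11 2 12
tt[2, ]  # 3 13 4 14
```

Using a name other than t for the data also avoids confusion with the t() function.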

R Correlation significance matrix

I have a large correlation matrix (something like 50*50).
I calculated the matrix using cor(mydata) function.
Now I would like to have a matching significance matrix.
Using cor.test() I can get one significance level at a time, but is there an easy way to get all 1200?
The function cor_pmat from the ggcorrplot package gives you the p-values of correlations.
library(ggcorrplot)
set.seed(123)
xmat <- matrix(rnorm(50), ncol = 5)
cor_pmat(xmat)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00000000 0.08034470 0.24441138 0.03293644 0.3234899
[2,] 0.08034470 0.00000000 0.08716815 0.44828479 0.4824117
[3,] 0.24441138 0.08716815 0.00000000 0.20634394 0.9504582
[4,] 0.03293644 0.44828479 0.20634394 0.00000000 0.8378530
[5,] 0.32348990 0.48241166 0.95045815 0.83785303 0.0000000
I think this should do what you want. Since you didn't provide your data, I created my own set.
We use expand.grid in conjunction with the apply function:
set.seed(123)
xmat <- matrix(rnorm(50), ncol = 5)
matrix(apply(expand.grid(1:ncol(xmat), 1:ncol(xmat)),
1,
function(x) cor.test(xmat[,x[1]], xmat[,x[2]])$`p.value`),
ncol = ncol(xmat), byrow = T)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00000000 0.08034470 0.24441138 3.293644e-02 0.3234899
[2,] 0.08034470 0.00000000 0.08716815 4.482848e-01 0.4824117
[3,] 0.24441138 0.08716815 0.00000000 2.063439e-01 0.9504582
[4,] 0.03293644 0.44828479 0.20634394 1.063504e-62 0.8378530
[5,] 0.32348990 0.48241166 0.95045815 8.378530e-01 0.0000000
Note that if you don't need a square matrix and are comfortable with a long table that has one row per pair of columns, we could use combn, which visits each pair only once and so involves much less iteration.
cbind(t(combn(1:ncol(xmat), 2)),
combn(1:ncol(xmat), 2, function(x) cor.test(xmat[,x[1]], xmat[,x[2]])$`p.value`)
)
[,1] [,2] [,3]
[1,] 1 2 0.08034470
[2,] 1 3 0.24441138
[3,] 1 4 0.03293644
[4,] 1 5 0.32348990
[5,] 2 3 0.08716815
[6,] 2 4 0.44828479
[7,] 2 5 0.48241166
[8,] 3 4 0.20634394
[9,] 3 5 0.95045815
[10,] 4 5 0.83785303
Alternatively, we can perform the same operation, but use the pipe operator %>% to make it a bit more concise:
library(magrittr)
combn(1:ncol(xmat), 2) %>%
apply(., 2, function(x) cor.test(xmat[,x[1]], xmat[,x[2]])$`p.value`) %>%
cbind(t(combn(1:ncol(xmat), 2)), .)
Here is one solution:
data <- swiss
#cor(data)
n <- ncol(data)
p.value.vec <- apply(combn(1:ncol(data), 2), 2, function(x) cor.test(data[,x[1]], data[,x[2]])$p.value)
p.value.matrix <- matrix(0, n, n)
# combn() orders the pairs (1,2), (1,3), ..., (2,3), ..., which matches the
# column-major order of the lower triangle, so fill that and then mirror it
p.value.matrix[lower.tri(p.value.matrix, diag=FALSE)] <- p.value.vec
p.value.matrix <- p.value.matrix + t(p.value.matrix)
p.value.matrix
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.000000e+00 1.491720e-02 9.450437e-07 3.658617e-07 1.028523e-03 3.585238e-03
[2,] 1.491720e-02 0.000000e+00 9.951515e-08 1.304590e-06 5.204434e-03 6.844724e-01
[3,] 9.450437e-07 9.951515e-08 0.000000e+00 4.811397e-08 2.588308e-05 4.453814e-01
[4,] 3.658617e-07 1.304590e-06 4.811397e-08 0.000000e+00 3.018078e-01 5.065456e-01
[5,] 1.028523e-03 5.204434e-03 2.588308e-05 3.018078e-01 0.000000e+00 2.380297e-01
[6,] 3.585238e-03 6.844724e-01 4.453814e-01 5.065456e-01 2.380297e-01 0.000000e+00
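An explicit double loop is a more verbose alternative, but it writes each p-value to both [i, j] and [j, i], so the pairing is visible and the symmetry is easy to check. A sketch on the same swiss data; the name p_mat and the use of column names as dimnames are my own choices.

```r
data <- swiss
n <- ncol(data)

# Zero diagonal, labelled by variable name
p_mat <- matrix(0, n, n, dimnames = list(names(data), names(data)))

for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    p <- cor.test(data[, i], data[, j])$p.value
    p_mat[i, j] <- p  # upper triangle
    p_mat[j, i] <- p  # mirrored entry in the lower triangle
  }
}

isSymmetric(p_mat)  # a p-value matrix for pairwise correlations should be symmetric
```

With the dimnames in place, a single p-value can be looked up by name, for example p_mat["Fertility", "Education"].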

Replacing Zero and Infinite Values with the Value of the First Column

In the matrix example below (stock returns):
IBOV PETR4 VALE5 ITUB4 BBDC4 PETR3
[1,] -0.03981646 -0.027412907 -0.051282051 -0.05208333 -0.047300526 -0.059805285
[2,] -0.03000415 -0.030534351 -0.046332046 -0.03943116 -0.030090271 -0.010355030
[3,] -0.02241318 -0.026650515 0.000000000 -0.04912517 -0.077559462 0.005231689
[4,] -0.05584830 -0.072184194 -0.066126856 -0.04317056 -0.066704036 0.000000000
[5,] 0.01196833 -0.004694836 0.036127168 -0.00591716 -0.006006006 Inf
[6,] 0.02039587 0.039083558 0.009762901 0.01488095 0.024169184 0.011783189
I would like to replace the 0 and Inf values with the value from the first column of the same row.
Here's a sample matrix
set.seed(15)
stocks<-matrix(rnorm(3*5), nrow=3)
stocks[cbind(c(2,3,1),c(4,4,2))] <- 0
stocks[2,2] <- Inf
stocks
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.2588229 0.000000 0.0227882 -1.075001 0.1655543
# [2,] 1.8311207 Inf 1.0907732 0.000000 -1.2427850
# [3,] -0.3396186 -1.255386 -0.1321224 0.000000 1.4592877
Now we can find the bad values, and then replace them with the values in the first column of the same row by using matrix indexing and the row() function to find the correct row.
bad <- stocks==0 | is.infinite(stocks)
stocks[bad] <- stocks[row(bad)[bad], 1]
stocks
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.2588229 0.2588229 0.0227882 -1.0750013 0.1655543
# [2,] 1.8311207 1.8311207 1.0907732 1.8311207 -1.2427850
# [3,] -0.3396186 -1.2553858 -0.1321224 -0.3396186 1.4592877
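The same replacement can be written with which(..., arr.ind = TRUE), which returns the row and column of every bad cell explicitly. A sketch on the same sample matrix; idx is an illustrative name.

```r
set.seed(15)
stocks <- matrix(rnorm(3 * 5), nrow = 3)
stocks[cbind(c(2, 3, 1), c(4, 4, 2))] <- 0
stocks[2, 2] <- Inf

# Two-column matrix (row, col) with one line per zero or infinite cell
idx <- which(stocks == 0 | is.infinite(stocks), arr.ind = TRUE)

# Overwrite each bad cell with the first-column value from its own row
stocks[idx] <- stocks[idx[, "row"], 1]
stocks
```

One caveat for either approach: if the first column itself holds a 0 or Inf, that value is simply copied, so it may be worth checking column 1 beforehand.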

Method to copy down rows R

Suppose we have a dataframe or matrix with one column specifying an integer value N as below (col 5).
Is there a vector approach to repopulate the object such that each row gets copied N times?
> y
[,1] [,2] [,3] [,4] [,5]
[1,] -0.02738267 0.5170621 -0.01644855 0.48830663 1
[2,] -0.30076544 1.8136359 0.02319640 -1.59649330 2
[3,] 1.73447245 0.4043638 -0.29112385 -0.25102988 3
[4,] 0.01025271 -0.4908636 0.80857300 0.08137033 4
The result would be as follows.
[1,] -0.02738267 0.5170621 -0.01644855 0.48830663 1
[2,] -0.30076544 1.8136359 0.02319640 -1.59649330 2
[2,] -0.30076544 1.8136359 0.02319640 -1.59649330 2
[3,] 1.73447245 0.4043638 -0.29112385 -0.25102988 3
[3,] 1.73447245 0.4043638 -0.29112385 -0.25102988 3
[3,] 1.73447245 0.4043638 -0.29112385 -0.25102988 3
[4,] 0.01025271 -0.4908636 0.80857300 0.08137033 4
[4,] 0.01025271 -0.4908636 0.80857300 0.08137033 4
[4,] 0.01025271 -0.4908636 0.80857300 0.08137033 4
[4,] 0.01025271 -0.4908636 0.80857300 0.08137033 4
Another question would be how to jitter the newly populated rows, so that the copied rows do not overlap completely.
Some made-up data:
y <- cbind(matrix(runif(16), 4, 4), 1:4)
Just do:
z <- y[rep(seq_len(nrow(y)), y[,5]), ]
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.5256007 0.07467979 0.95189484 0.2887943 1
# [2,] 0.3083967 0.03518523 0.08380005 0.9168161 2
# [3,] 0.3083967 0.03518523 0.08380005 0.9168161 2
# [4,] 0.8549639 0.79452728 0.22483537 0.4452553 3
# [5,] 0.8549639 0.79452728 0.22483537 0.4452553 3
# [6,] 0.8549639 0.79452728 0.22483537 0.4452553 3
# [7,] 0.5453508 0.47633523 0.51522514 0.3936340 4
# [8,] 0.5453508 0.47633523 0.51522514 0.3936340 4
# [9,] 0.5453508 0.47633523 0.51522514 0.3936340 4
# [10,] 0.5453508 0.47633523 0.51522514 0.3936340 4
I am not sure what you mean by "jitter", but maybe something like this?
z <- z + runif(length(z)) / 1000
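Adding noise to the whole matrix also perturbs the count column. If that is not wanted, one option is to jitter only the data columns. A sketch, assuming "jitter" means small uniform noise; the magnitude 1e-3 and the name noise are arbitrary choices.

```r
set.seed(1)  # arbitrary seed for reproducibility
y <- cbind(matrix(runif(16), 4, 4), 1:4)

# Repeat row i of y exactly y[i, 5] times
z <- y[rep(seq_len(nrow(y)), times = y[, 5]), ]

# Small uniform noise for the four data columns only; column 5 stays intact
noise <- matrix(runif(nrow(z) * 4, min = -1e-3, max = 1e-3), nrow = nrow(z))
z[, 1:4] <- z[, 1:4] + noise
z
```

Base R also has a jitter() function that adds a small amount of noise to a numeric vector, which may be closer to what the question has in mind.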
