I know this has been answered before, but,
given a correlation matrix which looks like this:
V A B C D
A 1 0.3 0.1 0.4
B 0.2 1 0.4 0.3
C 0.1 0 1 0.9
D 0.3 0.3 0.1 1
which can be loaded in R as follows:
corr.matrix <- read.table("path/to/file", sep = '\t', header = T)
rownames(corr.matrix) <- corr.matrix$V
corr.matrix <- corr.matrix[, 2:ncol(corr.matrix)]
Two other files dictate which rows and columns should be plotted (because some are of no interest to me), and I want to rearrange the rows and columns into the order those two files specify.
For example:
cols_order.txt
C
D
E
B
A
...
rows_order.txt
D
E
Z
B
T
A
...
I read those other 2 files like this:
rows.order <- read.table("rows_order.txt", sep = '\n', header = FALSE)
colnames(rows.order) <- "Variant"
cols.order <- read.table("cols_order.txt", sep = '\n', header = FALSE)
colnames(cols.order) <- "Variant"
And after this step I do this:
corr.matrix <- corr.matrix[rows.order$Variant, cols.order$Variant]
The values that I don't want to be plotted are successfully removed, but the order gets scrambled. How can I fix this?
The .order datasets are read correctly (I checked 3 times).
Here is a potential solution to your question. I tried to re-create a small-sized data.frame based on your question. The key here is the match function as well as some basic subsetting/filtering techniques in R:
## Re-create your example:
V <- data.frame(
A = c(1 , 0.3, 0.1 , 0.4),
B = c(0.2, 1 , 0.4 , 0.3),
C = c(0.1, 0 , 1 , 0.9),
D = c(0.3, 0.3, 0.1 , 1)
) #matrix() also ok
rownames(V) <- LETTERS[1:4]
## Reorder using `match` function
## Needs to be in data.frame form
## So use as.data.frame() if needed
## Here, I don't have the text file
## So if you want to load in txt files specifying rows columns
## Use `read.csv` or `read.table` to load
## And then store the relevant info into a vector as you did
col_order <- c("C","D","E","B","A")
col_order_filtered <- col_order[which(col_order %in% colnames(V))]
rows <- c("D","E","Z","B","T","A")
## Filter rows IDs, since not all are present in your data
row_filtered <- rows[rows %in% rownames(V)]
## match(wanted, available) returns positions in V, in the wanted order
V1 <- V[match(row_filtered, rownames(V)), match(col_order_filtered, colnames(V))]
V1
## C D B A
## D 0.9 1.0 0.3 0.4
## B 0.0 0.3 1.0 0.3
## A 0.1 0.3 0.2 1.0
Alternatively, if you are comfortable with the dplyr package and its syntax, it is often handy:
## Continued from previous code
library(dplyr)
V2 <- V %>%
select(C, D, B, A, everything()) %>%
slice(match(row_filtered, rownames(V)))
rownames(V2) <- row_filtered
V2
## C D B A
## D 0.9 1.0 0.3 0.4
## B 0.0 0.3 1.0 0.3
## A 0.1 0.3 0.2 1.0
Hope that helps.
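One possible cause of the scrambling, which is an assumption since the question does not show the column types, is that the order files were read as factors: indexing with a factor uses its integer codes rather than its labels. A minimal base R sketch, with a hypothetical matrix m standing in for corr.matrix and a hypothetical helper reorder_by_names, shows the pitfall and a name-based reorder:

```r
# Hypothetical 4 x 4 matrix standing in for corr.matrix
m <- matrix(1:16, nrow = 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# Indexing with a factor uses the integer codes, not the labels:
# levels are A, B, D, so the codes for D, B, A are 3, 2, 1
f <- factor(c("D", "B", "A"))
scrambled <- rownames(m[f, ])

# Reorder by name: coerce to character and keep only names present in x
reorder_by_names <- function(x, rows, cols) {
  rows <- as.character(rows)
  cols <- as.character(cols)
  x[rows[rows %in% rownames(x)], cols[cols %in% colnames(x)], drop = FALSE]
}

res <- reorder_by_names(m,
                        rows = c("D", "E", "Z", "B", "T", "A"),
                        cols = c("C", "D", "E", "B", "A"))
rownames(res)  # "D" "B" "A"
colnames(res)  # "C" "D" "B" "A"
```

If the Variant columns are factors, wrapping them in as.character() before subsetting should be enough.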
Here is my data table:
A B C
A 1 0.8 0.2
B 0.8 1 0.3
C 0.2 0.3 1
I am trying to get the unique pairs of row names and column names based on the entries. For example, if I am looking at > 0.5, my output would be:
A B 0.8
If I am looking at < 0.5, my output would be:
B C 0.3
A C 0.2
This is a classical melt situation (though it needs some seasoning with upper or lower.tri)
library(data.table)  # provides setDT() and the data.table melt() method
dat <- read.table(text=
" A B C
A 1 0.8 0.2
B 0.8 1 0.3
C 0.2 0.3 1
", header=TRUE )
dat[ !upper.tri(dat) ] <- NA
dat <- as.data.frame( dat )
dat <- tibble::rownames_to_column( dat, "V1" )
setDT(dat)
use.this <- melt( dat, id.vars="V1", variable.name="V2" )[ !is.na(value) ]
use.this[ value < .5 ]
use.this[ value > 0.5 ]
It looks like this:
> use.this[ value < .5 ]
V1 V2 value
1: A C 0.2
2: B C 0.3
> use.this[ value > .5 ]
V1 V2 value
1: A B 0.8
In base R, you can use which with arr.ind = TRUE to get the row and column numbers that meet the condition.
df[lower.tri(df, diag = TRUE)] <- NA
mat <- which(df < 0.5, arr.ind = TRUE)
data.frame(rowname = rownames(df)[mat[, 1]],
colname = colnames(df)[mat[, 2]],
value = df[mat])
# rowname colname value
#1 A C 0.2
#2 B C 0.3
data
df <- structure(list(A = c(1, 0.8, 0.2), B = c(0.8, 1, 0.3), C = c(0.2,
0.3, 1)), class = "data.frame", row.names = c("A", "B", "C"))
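The steps above can be wrapped into a small reusable function. This is only a sketch: the name cor_pairs and its keep argument are hypothetical, and it assumes a square, symmetric input with row and column names.

```r
# List each unordered pair once (upper triangle) and keep those
# whose value satisfies the predicate `keep`
cor_pairs <- function(m, keep) {
  m <- as.matrix(m)
  idx <- which(upper.tri(m), arr.ind = TRUE)  # columns "row" and "col"
  out <- data.frame(V1 = rownames(m)[idx[, "row"]],
                    V2 = colnames(m)[idx[, "col"]],
                    value = m[idx],
                    stringsAsFactors = FALSE)
  out[keep(out$value), , drop = FALSE]
}

M <- matrix(c(1, 0.8, 0.2, 0.8, 1, 0.3, 0.2, 0.3, 1), nrow = 3,
            dimnames = list(LETTERS[1:3], LETTERS[1:3]))
low  <- cor_pairs(M, function(v) v < 0.5)  # pairs A-C (0.2) and B-C (0.3)
high <- cor_pairs(M, function(v) v > 0.5)  # pair A-B (0.8)
```

Passing the condition as a function avoids modifying the input in place, unlike the NA assignment above.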
Here is another option:
library(data.table)
M <- matrix(c(1,0.8,0.2,0.8,1,0.3,0.2,0.3,1), nrow=3L,
dimnames=list(LETTERS[1:3], LETTERS[1:3]))
allDT <- data.table(rn=rep(rownames(M), nrow(M)),
cn=rep(colnames(M), each=ncol(M)),
val=as.vector(M))
DT <- unique(allDT[, .(val=val), .(rn=pmin(rn, cn), cn=pmax(rn, cn))])
DT[val<0.5]
The question has been tagged data.table. So, here is a simple solution which uses only data.table syntax. In addition, it suggests lookup functions (EDIT: in 3 different flavours) which require fewer keystrokes.
library(data.table)
wide <- fread(" A B C
A 1 0.8 0.2
B 0.8 1 0.3
C 0.2 0.3 1")
long <- melt(wide, id.var = "V1", variable.name = "V2",
variable.factor = FALSE)[V1 < V2]
long
V1 V2 value
1: A B 0.8
2: A C 0.2
3: B C 0.3
Note that the upper triangular part of wide is picked after reshaping to long format by subsetting [V1 < V2] which ensures that only unique pairs are considered.
long can be queried by subsetting, e.g.,
long[value < 0.5]
V1 V2 value
1: A C 0.2
2: B C 0.3
long[value > 0.5]
V1 V2 value
1: A B 0.8
lookup function
long can be queried by defining a lookup function:
l <- function(cond) eval(parse(text = paste0("long[value", cond, "]")))
which can be called, e.g.,
l("< .5")
V1 V2 value
1: A C 0.2
2: B C 0.3
l("> .5")
V1 V2 value
1: A B 0.8
l("== .3")
V1 V2 value
1: B C 0.3
EDIT: lookup function with 2 arguments
Alternatively, the lookup function can be defined to take 2 arguments, one for the comparison operator and one for the numerical values:
l2 <- function(op, v) long[do.call(op, list(value, v))]
l2("%between%", c(0.25, 0.95))
V1 V2 value
1: A B 0.8
2: B C 0.3
Or, with the new interface for programming on data.table (available with data.table development version 1.14.1):
l3 <- function(op, v) long[op(value, v), env = list(op = as.name(op), v = v)]
l3("%in%", c(0.2, 0.3))
V1 V2 value
1: A C 0.2
2: B C 0.3
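For comparison, here is a base R sketch of the two-argument lookup that avoids eval(parse()). The names long_df and lookup are hypothetical, and the data frame simply re-creates the long table shown above.

```r
# Plain data.frame copy of the long-format pairs
long_df <- data.frame(V1 = c("A", "A", "B"),
                      V2 = c("B", "C", "C"),
                      value = c(0.8, 0.2, 0.3),
                      stringsAsFactors = FALSE)

# `op` is the name of a comparison function, e.g. "<", ">" or "%in%";
# match.fun() resolves it to the function itself
lookup <- function(op, v, data = long_df) {
  keep <- match.fun(op)(data$value, v)
  data[keep, , drop = FALSE]
}

below <- lookup("<", 0.5)               # pairs A-C and B-C
above <- lookup(">", 0.5)               # pair A-B
picked <- lookup("%in%", c(0.2, 0.3))   # pairs A-C and B-C
```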
I would like to convert this dataframe
tmp <- data.frame(V1=c("A","A","B"),V2=c("B","C","C"),V3=c(0.2,0.4,0.1))
tmp
V1 V2 V3
1 A B 0.2
2 A C 0.4
3 B C 0.1
into a square matrix like this (which should ultimately be a dist object):
A B C
A 0
B 0.2 0
C 0.4 0.1 0
I tried different approaches based on functions reshape, spread or xtabs but I cannot get the right dimension. Thanks for your help.
Maybe you can try the code below
d <- sort(unique(unlist(tmp[1:2])))
m <- `dimnames<-`(matrix(0,length(d),length(d)),list(d,d))
m[as.matrix(tmp[1:2])] <- tmp$V3
res <- t(m) + m
such that
> res
A B C
A 0.0 0.2 0.4
B 0.2 0.0 0.1
C 0.4 0.1 0.0
You can also create your own dist object this way using structure:
tmp_lab <- unique(c(as.character(tmp$V1), as.character(tmp$V2)))
structure(tmp$V3,
Size = length(tmp_lab),
Labels = tmp_lab,
Diag = TRUE,
Upper = FALSE,
method = "user",
class = "dist")
Output
A B C
A 0.0
B 0.2 0.0
C 0.4 0.1 0.0
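Note that a dist object stores the lower triangle column by column, so the structure() call above relies on the rows of tmp already being in that order (which they are in this example). A minimal sketch that does not depend on row order fills a named matrix first and then converts it; labs, m and d are names chosen here for illustration.

```r
tmp <- data.frame(V1 = c("A", "A", "B"),
                  V2 = c("B", "C", "C"),
                  V3 = c(0.2, 0.4, 0.1),
                  stringsAsFactors = FALSE)

# All labels that appear in either column, in sorted order
labs <- sort(unique(c(tmp$V1, tmp$V2)))

# Start from an all-zero square matrix and fill both triangles by name
m <- matrix(0, length(labs), length(labs), dimnames = list(labs, labs))
m[cbind(tmp$V1, tmp$V2)] <- tmp$V3
m[cbind(tmp$V2, tmp$V1)] <- tmp$V3

d <- as.dist(m)  # keeps the labels and the lower triangle
```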
Here is an option with xtabs, after converting the columns 'V1' and 'V2' to factors that share the same levels:
tmp[1:2] <- lapply(tmp[1:2], factor, levels = c('A', 'B', 'C'))
as.dist(xtabs(V3 ~ V2 + V1, tmp), diag = TRUE)
# A B C
#A 0.0
#B 0.2 0.0
#C 0.4 0.1 0.0
I have a data frame of correlations which looks something like this (although there are ~15,000 rows in my real data)
phen1<-c("A","B","C")
phen2<-c("B","C","A")
cors<-c(0.3,0.7,0.8)
data<-as.data.frame(cbind(phen1, phen2, cors))
phen1 phen2 cors
1 A B 0.3
2 B C 0.7
3 C A 0.8
This was created externally and read into R, and I want to convert this data frame into a correlation matrix with phen1 and phen2 as the labels for the rows and columns. I have only calculated this for either the lower or the upper triangle, and I don't have the 1's for the diagonal. So I would like the end result to be a full correlation matrix, but a first step would probably be to create the lower/upper triangle and then convert it to a full matrix. I'm unsure how to do either step.
Also, the results may not be in an intuitive order. I'm not sure if this matters, but ideally I would like a way to do this which uses the labels in phen1 and phen2 to make sure the matrix has the correct values in the correct place, if that makes sense.
Essentially for this, I would want something like this as an end result:
A B C
A 1 0.3 0.8
B 0.3 1 0.7
C 0.8 0.7 1
Here is another one in base R where we create a symmetrical dataframe same as data but with columns inverted for phen1 and phen2. Then we use xtabs to get a correlation matrix and set diagonal to 1.
data1 <- data.frame(phen1 = data$phen2, phen2 = data$phen1, cors = data$cors)
df <- rbind(data, data1)
df1 <- as.data.frame.matrix(xtabs(cors ~ ., df))
diag(df1) <- 1
df1
# A B C
#A 1.0 0.3 0.8
#B 0.3 1.0 0.7
#C 0.8 0.7 1.0
data
phen1<-c("A","B","C")
phen2<-c("B","C","A")
cors<-c(0.3,0.7,0.8)
data<- data.frame(phen1, phen2, cors)
I think there must be an elegant way to do it, however, here is a dplyr and tidyr possibility:
library(dplyr)
library(tidyr)
data %>%
spread(phen1, cors) %>%
rename(phen = "phen2") %>%
bind_rows(data %>%
spread(phen2, cors) %>%
rename(phen = "phen1")) %>%
group_by(phen) %>%
summarise_all(~ ifelse(all(is.na(.)), 1, first(na.omit(.))))
phen A B C
<chr> <dbl> <dbl> <dbl>
1 A 1 0.3 0.8
2 B 0.3 1 0.7
3 C 0.8 0.7 1
You can use the Matrix package for this. What you have is a sparse representation of the data and you want to turn this into a dense (redundant) matrix.
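data <- data.frame(phen1, phen2, cors)
## Map labels to integer positions via a shared, sorted label set
## (works whether phen1/phen2 are character or factor columns)
lv <- sort(unique(c(as.character(data$phen1), as.character(data$phen2))))
inds <- cbind(match(data$phen1, lv), match(data$phen2, lv))
## Sort each pair so every entry lands in the upper triangle
inds <- t(apply(inds, 1, sort))
library(Matrix)
res <- sparseMatrix(i = inds[,1],
j = inds[,2],
x = data$cors,
symmetric = TRUE)
#3 x 3 sparse Matrix of class "dsCMatrix"
#
#[1,] . 0.3 0.8
#[2,] 0.3 . 0.7
#[3,] 0.8 0.7 .
res <- as.matrix(res)
diag(res) <- 1
dimnames(res) <- list(lv, lv)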
res
# A B C
#A 1.0 0.3 0.8
#B 0.3 1.0 0.7
#C 0.8 0.7 1.0
Here's another option.
First reshape data from long to wide and convert to a matrix. You have different options to do that (reshape2, tidyr, etc.); here I use tidyr::spread.
library(tidyverse)
mat <- data %>% spread(phen2, cors) %>% column_to_rownames("phen1") %>% as.matrix()
We then fill the missing NA values from the upper and lower triangular matrix respectively, and fill the diagonal with 1.
## Fill each NA from its mirror cell; going through t(mat) keeps the pairs aligned for any matrix size
mat[is.na(mat)] <- t(mat)[is.na(mat)]
diag(mat) <- 1
mat
# A B C
#A 1.0 0.3 0.8
#B 0.3 1.0 0.7
#C 0.8 0.7 1.0
You can use reshape library.
library(reshape)
data <- melt(data)
your_mat <- cast(data, phen1 ~ phen2 )
Output:
phen1 A B C
1 A <NA> 0.3 <NA>
2 B <NA> <NA> 0.7
3 C 0.8 <NA> <NA>
You get NAs because many combinations are missing from your input table. To avoid this, you need an input table like this:
phen1 phen2 cors
1 A B 0.3
2 B C 0.7
3 C A 0.8
4 A C 0.8
5 B A 0.3
6 C B 0.7
7 A A 1.0
8 B B 1.0
9 C C 1.0
Plenty of solutions already, but I'll throw in another way. Note: I'm setting up the data so that cors is numeric rather than a factor in your original data frame.
data <- data.frame(phen1, phen2, cors)
Then we can expand the data frame with the missing combinations and use reshape2::acast() to convert the data to wide format.
library(tidyverse)
library(reshape2)
data %>%
select(phen1 = phen2, phen2 = phen1, cors) %>%
bind_rows(data) %>%
acast(phen1 ~ phen2, fill = 1)
acast handily lets you fill in the missing values with some other specified value, in this case 1.
Also, check out the corrr package, which may be able to do this more neatly.
Here is a function that I wrote:
long2cormat <- function(xlong, x = "x", y = "y", r = "r") {
# Takes some inspiration from https://stackoverflow.com/a/57904948/180892
xlong <- xlong[,c(x, y, r)]
names(xlong) <- c("x", "y", "r")
data1 <- data.frame(x = xlong$x, y = xlong$y, r = xlong$r)
data2 <- data.frame(x = xlong$y, y = xlong$x, r = xlong$r)
df <- rbind(data1, data2)
uv <- unique(c(df$x, df$y))
df1 <- matrix(NA, nrow = length(uv), ncol = length(uv), dimnames = list(uv, uv))
for (i in seq(nrow(df))) df1[df$x[i], df$y[i]] <- df$r[i]
diag(df1) <- 1
df1
}
To run it do the following:
xlong <- data.frame(phen1 = c("A","B","C"),
phen2 = c("B","C","A"),
cors = c(0.3,0.7,0.8))
long2cormat(xlong, "phen1", "phen2", "cors")
Importantly, for my own use cases, it leaves missing correlations as NA.
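The same idea can be written without the loop by indexing the matrix with two-column character matrices. This is a sketch rather than a drop-in replacement; the name long_to_cormat is hypothetical, and it also leaves pairs absent from the input as NA.

```r
long_to_cormat <- function(x, y, r) {
  x <- as.character(x)
  y <- as.character(y)
  labs <- sort(unique(c(x, y)))
  m <- matrix(NA_real_, length(labs), length(labs),
              dimnames = list(labs, labs))
  m[cbind(x, y)] <- r  # fill (x, y) cells by name
  m[cbind(y, x)] <- r  # and their mirror cells
  diag(m) <- 1
  m
}

cm <- long_to_cormat(c("A", "B", "C"), c("B", "C", "A"), c(0.3, 0.7, 0.8))

# A pair that never appears in the input stays NA
partial <- long_to_cormat(c("A", "C"), c("B", "D"), c(0.5, 0.6))
```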
I have a vector of values whose distribution is unknown, and I want to create another vector with the probabilities of the values I have.
eg.
I have
v <- c(e1, e2, ... , ei)
and I want to create
p <- c(P(e1), P(e2), ... , P(ei))
How can I do this in R?
As you want to create a vector the same length as the vector of values, you could do something like:
p <- sapply(v, function(x) length(which(x == v))/length(v))
Example using letters as values
set.seed(123)
v = sample(letters[1:4], 10, replace = TRUE)
p <- sapply(v, function(x) length(which(x == v))/length(v))
p
#> b d b d d a c d c b
#> 0.3 0.4 0.3 0.4 0.4 0.1 0.2 0.4 0.2 0.3
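The sapply() call compares every element with the whole vector. A sketch of an equivalent that counts each distinct value once with table() and then looks the frequencies up by name is shown below; it assumes the values convert cleanly to character labels, which may not hold for arbitrary floating-point numbers.

```r
# Fixed example vector (written out so the result does not depend on the RNG)
v <- c("b", "d", "b", "d", "d", "a", "c", "d", "c", "b")

# Relative frequency of each distinct value
freq <- table(v) / length(v)

# Look up the frequency of every element by its label
p <- as.vector(freq[as.character(v)])
p
# 0.3 0.4 0.3 0.4 0.4 0.1 0.2 0.4 0.2 0.3
```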
library(magrittr)
library(dplyr)
V1 <- c("A","A","A","A","A","A","B","B","B","B", "B","B","C","C","C","C","C","C","D","D","D","D","D","D","E","E","E","E","E","E")
V2 <- c("A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F")
cor <- c(1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.9)
df <- data.frame(V1,V2,cor)
# exclude rows where cor=NA
df <- df[complete.cases(df)==TRUE,]
This is the full data frame, cor=NA represents a correlation smaller than 0.8
df
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
30 E F 0.9
In the above df, F is not in V1, meaning that F is not of interest
so here I remove rows where V2=F (more generally, where V2 equals a value that is not in V1)
V1.LIST <- unique(df$V1)
df.gp <- df[which(df$V2 %in% V1.LIST),]
df.gp
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
So now, df.gp is the dataset I need to work on
I drop the unused level in V2 (which is F in the example)
df.gp$V2 <- droplevels(df.gp$V2)
I do not want to exclude the autocorrelated variables: in case some of the V1 values are not correlated with any others, I would like to put each of them in a separate group.
By looking at cor, A and B are correlated, C and D are correlated, and E belongs to a group by itself.
Therefore, the example here should have three groups.
The way I see this, you may have complicated things by working your data straight into a data.frame. I took the liberty of transforming it back to a matrix.
library(reshape2)
cormat <- as.matrix(dcast(data = df,formula = V1~V2))[,-1]
row.names(cormat) <- colnames(cormat)[-length(colnames(cormat))]
cormat
Once you have the correlation matrix, it is easy to see which indices of non-NA values are shared with other variables.
a <- apply(cormat, 1, function(x) which(!is.na(x)))
a <- data.frame(t(a))
a$var <- row.names(a)
row.names(a) <- NULL
a
X1 X2 var
1 1 2 A
2 1 2 B
3 3 4 C
4 3 4 D
5 5 6 E
Now either X1 or X2 determines your unique groupings.
Edited by cyrusjan:
The above script is a possible solution, assuming we have already selected the rows with cor >= a, where a is a threshold (0.8 in the question above).
Contributed by alexis_laz:
By using cutree and hclust, we can set the threshold in the script (i.e. h=0.8) as below.
cor.gp <- data.frame(cor.gp =
cutree(hclust(1 - as.dist(xtabs(cor ~ V1 + V2, df.gp))), h = 0.8))
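A self-contained base R sketch of the same clustering idea follows. It re-creates df.gp from the question, builds the similarity matrix by name so that it is square regardless of column types, and treats pairs missing from the data as correlation 0 (an assumption made here). Note that cutree's h is a height on the dissimilarity scale, so with 1 - cor as the distance a correlation threshold of 0.8 corresponds to a height of 0.2; that mapping is my reading rather than something the answer states, and both heights give the same three groups for this example.

```r
# Re-create df.gp from the question
df.gp <- data.frame(
  V1  = c("A", "A", "B", "B", "C", "C", "D", "D", "E"),
  V2  = c("A", "B", "A", "B", "C", "D", "C", "D", "E"),
  cor = c(1, 0.8, 0.8, 1, 1, 0.8, 0.8, 1, 1),
  stringsAsFactors = FALSE
)

# Square similarity matrix; pairs missing from the data default to 0
lv <- sort(unique(c(df.gp$V1, df.gp$V2)))
sim <- matrix(0, length(lv), length(lv), dimnames = list(lv, lv))
sim[cbind(df.gp$V1, df.gp$V2)] <- df.gp$cor

# Complete-linkage clustering on 1 - correlation, cut at 1 - threshold
hc <- hclust(as.dist(1 - sim))
grp <- cutree(hc, h = 1 - 0.8)
grp
# A B C D E
# 1 1 2 2 3
```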