I have a data frame in R (df) which looks like this:
colA colB
A,B 0.5
A,C 8
B,A 0.5
B,C 9
C,A 8
C,B 9
It represents correlation values produced by a certain piece of software.
Now, I would like to convert this data frame to a correlation matrix to be plotted with the Corr() function:
DESIRED OUTPUT:
A B C
A 1 0.5 8
B 0.5 1 9
C 8 9 1
Any suggestions for code I could use?
Data:
input <- structure(list(colA = c("A,B", "A,C", "B,A", "B,C", "C,A", "C,B"
), colB = c(0.5, 8, 0.5, 9, 8, 9)), class = "data.frame", row.names = c(NA, -6L))
Solution:
## separate the "colA" column into two
rc <- read.csv(text = input$colA, header = FALSE)
# V1 V2
#1 A B
#2 A C
#3 B A
#4 B C
#5 C A
#6 C B
## tapply with a two-column index builds a matrix; default = 1 fills the absent (diagonal) cells
tapply(input$colB, unname(rc), FUN = identity, default = 1)
# A B C
#A 1.0 0.5 8
#B 0.5 1.0 9
#C 8.0 9.0 1
Note 1: The OP's made-up data are not real correlations; a correlation is never greater than 1.
Note 2: Thanks to thelatemail for suggesting read.csv instead of the scan + matrix + asplit combination in my initial answer.
Remark 1: If using xtabs, we have to set the diagonal elements to 1 afterwards (see the sketch after the reshaping example below).
Remark 2: Matrix indexing is also a good approach, but takes a few more lines of code (also sketched below).
Remark 3: A "reshaping" solution is also a good idea:
rc$value <- input$colB
reshape2::acast(rc, V1 ~ V2, fill = 1)
# A B C
#A 1.0 0.5 8
#B 0.5 1.0 9
#C 8.0 9.0 1
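For reference, minimal sketches of the Remark 1 and Remark 2 approaches (nm, mat, v1, v2 are my own names; this reuses rc and input from above):
## Remark 1: xtabs fills absent cells with 0, so the diagonal must be fixed afterwards
xt <- xtabs(input$colB ~ V1 + V2, data = rc)
diag(xt) <- 1
## Remark 2: matrix indexing; start from a matrix of 1s and fill by [row, col] name pairs
v1 <- as.character(rc$V1); v2 <- as.character(rc$V2)  # as.character guards against factor columns
nm <- sort(unique(c(v1, v2)))
mat <- matrix(1, length(nm), length(nm), dimnames = list(nm, nm))
mat[cbind(v1, v2)] <- input$colB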
Something like this?
# create your input df:
df <- data.frame(colA = c("A,B","A,C","B,A","B,C","C,A","C,B"), value = c(0.5,8,0.5,9,8,9))
# split the ID column into two
df[, c("col.A","col.B")] <- matrix(unlist(strsplit(df$colA, ",")), ncol = 2, byrow = TRUE)
# reshape to wide, filling the missing diagonal with 1
library(reshape2)
dcast(df, col.A ~ col.B, fill = 1)
Some example data:
library(data.table)
mydat <- data.table(id1=rep(c("A","B","C"),each=3),
id2=c("D","E","G", "D","E","F","G","E","D"),
val=c(1,2,4,1,2,3, 4,2,1))
Which gives
id1 id2 val
1: A D 1
2: A E 2
3: A G 4
4: B D 1
5: B E 2
6: B F 3
7: C G 4
8: C E 2
9: C D 1
My goal is to get the unique values of (id2, val), generate a variable that depends on those unique values (e.g. the sum across unique observations, as below), and then put that variable into a column of the original data.table. I often find myself writing code like the following:
## This is the most obvious way
tmp <- unique(mydat[,.(id2,val)])
tmp[,weight:=val/sum(val)]
tmp[,val:=NULL]
mydat <- merge(mydat,tmp,by="id2",all.x=TRUE)
## A second option which doesn't require merging
mydat[,first:=FALSE]
mydat[mydat[,.I[1],by=.(id2)]$V1,first:=TRUE]
mydat[first==TRUE,weight2:=val/sum(val)]
mydat[,weight2:=max(weight2,na.rm = TRUE),by=.(id2)]
mydat[,first:=NULL]
This gives
id2 id1 val weight weight2
1: D A 1 0.1 0.1
2: D B 1 0.1 0.1
3: D C 1 0.1 0.1
4: E A 2 0.2 0.2
5: E B 2 0.2 0.2
6: E C 2 0.2 0.2
7: F B 3 0.3 0.3
8: G A 4 0.4 0.4
9: G C 4 0.4 0.4
Entirely out of curiosity, is there a cleaner (more data.table) way to do this? Perhaps with self joins? Performance is important because the actual data I'm working with tends to be quite large.
I agree with @thelatemail that the approaches in the OP are already pretty clean.
"Performance is important because the actual data I'm working with tends to be quite large."
If you must use this structure, there's:
setorder(mydat, id2)
mydat[unique(id2), on=.(id2), mult="first", w2 := val/sum(val)]
mydat[, w2 := nafill(w2, type="locf")]
I'm just sorting because that's shown in the desired output. To keep the original sorting, drop setorder and change the last line to mydat[order(id2), w2 := nafill(w2, type="locf")].
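Putting that variant together, a minimal sketch reusing the lines above on the unsorted table:
mydat[unique(id2), on = .(id2), mult = "first", w2 := val/sum(val)]
mydat[order(id2), w2 := nafill(w2, type = "locf")]  # fill forward within id2 order; rows updated in place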
The nafill function is available in data.table 1.12.3+ (not yet on CRAN at the time of writing).
I would suggest instead using a set of normalized/"tidy" tables: val is an attribute of id2, so you could have an id2 table containing such things.
# same as OP's tmp
id2DT = unique(mydat[, .(id2, val)])
setkey(id2DT, id2)
id2DT[, w := val/sum(val)]
# drop redundant repeating val unless you really need it there
# to save on space and improve readability
mydat[, val := NULL]
# merge w in if/when needed; x.w refers to the w column of id2DT (the x side of the join)
mydat[, w := id2DT[.SD, on=.(id2), x.w]]
Here is a merge-free option:
total_val <- mydat[!duplicated(id2), sum(val)] # just the scalar we are after; val is constant within each id2
mydat[, `:=`(val = val[1], weight = val[1] / total_val), by = id2]
# id1 id2 val weight
# 1: A D 1 0.1
# 2: B D 1 0.1
# 3: C D 1 0.1
# 4: A E 2 0.2
# 5: B E 2 0.2
# 6: C E 2 0.2
# 7: B F 3 0.3
# 8: A G 4 0.4
# 9: C G 4 0.4
Assume I have a matrix that looks like the one below; the values above and below the diagonal are the same. In other words, entries [1,2] and [2,1] are both 2 in the matrix.
> m = cbind(c(1,2,3),c(2,4,5),c(3,5,6))
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 4 5
[3,] 3 5 6
I also have real names for 1, 2, and 3:
>Real_name
A B C # A represents 1, B represents 2, and C represents 3.
I would like to convert the matrix to three columns containing the corresponding real names for each pair, keeping each pair only once (A x B is the same as B x A, so we keep A x B only). How can I achieve this in R?
A A 1
A B 2
A C 3
B B 4
B C 5
C C 6
The following is straightforward:
m <- cbind(c(1,2,3), c(2,4,5), c(3,5,6))
## read `?lower.tri` and try `v <- lower.tri(m, diag = TRUE)` to see what `v` is
## read `?which` and try `which(v, arr.ind = TRUE)` to see what it gives
ij <- which(lower.tri(m, diag = TRUE), arr.ind = TRUE)
Real_name <- LETTERS[1:3]
data.frame(row = Real_name[ij[, 1]], col = Real_name[ij[, 2]], val = c(m[ij]))
# row col val
#1 A A 1
#2 B A 2
#3 C A 3
#4 B B 4
#5 C B 5
#6 C C 6
colnames(m) <- c("A", "B", "C")
rownames(m) <- c("A", "B", "C")
m[lower.tri(m)] <- NA # replace lower-triangular elements with NA
data.table::melt(m, na.rm = TRUE) # melt and remove NA
# Var1 Var2 value
#1 A A 1
#4 A B 2
#5 B B 4
#7 A C 3
#8 B C 5
#9 C C 6
Or you can do it in a single line: melt(replace(m, lower.tri(m), NA), na.rm = TRUE)
This will also work:
g <- expand.grid(1:ncol(m), 1:ncol(m))
g <- g[g[,2]>=g[,1],]
cbind.data.frame(sapply(g, function(x) Real_name[x]), Val=m[as.matrix(g)])
Var1 Var2 Val
1 A A 1
2 A B 2
3 B B 4
4 A C 3
5 B C 5
6 C C 6
library(magrittr)
library(dplyr)
V1 <- c("A","A","A","A","A","A","B","B","B","B", "B","B","C","C","C","C","C","C","D","D","D","D","D","D","E","E","E","E","E","E")
V2 <- c("A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F")
cor <- c(1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.9)
df <- data.frame(V1,V2,cor)
# exclude rows where cor is NA
df <- df[complete.cases(df), ]
This is the resulting data frame; cor = NA represents a correlation smaller than 0.8:
df
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
30 E F 0.9
In the above df, F does not appear in V1, meaning that F is not of interest,
so here I remove the rows where V2 = F (more generally, where V2 equals a value that is not in V1):
V1.LIST <- unique(df$V1)
df.gp <- df[which(df$V2 %in% V1.LIST),]
df.gp
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
So now df.gp is the dataset I need to work on.
I drop the unused level in V2 (which is F in this example):
df.gp$V2 <- droplevels(df.gp$V2)
I do not want to exclude the autocorrelated variables, in case some of the V1 values are not correlated with the others, in which case I would like to put each of them in a separate group.
Looking at cor, A and B are correlated, C and D are correlated, and E belongs to a group by itself.
Therefore, the example here should yield three groups.
The way I see this, you may have complicated things by working your data straight into a data.frame. I took the liberty of transforming it back to a matrix.
library(reshape2)
cormat <- as.matrix(dcast(data = df,formula = V1~V2))[,-1]
row.names(cormat) <- colnames(cormat)[-length(colnames(cormat))]
cormat
After I had your correlation matrix, it was easy to see which indices of non-NA values are shared between variables.
a <- apply(cormat, 1, function(x) which(!is.na(x)))
a <- data.frame(t(a))
a$var <- row.names(a)
row.names(a) <- NULL
a
X1 X2 var
1 1 2 A
2 1 2 B
3 3 4 C
4 3 4 D
5 5 6 E
Now either X1 or X2 determines your unique groupings.
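For instance, a small sketch using the a built above: splitting var on X1 recovers the groups.
split(a$var, a$X1)
# $`1`
# [1] "A" "B"
#
# $`3`
# [1] "C" "D"
#
# $`5`
# [1] "E"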
Edited by cyrusjan:
The above script is a possible solution assuming we have already selected the rows with cor >= a, where a is a threshold (taken as 0.8 in the question above).
Contributed by alexis_laz:
Using cutree and hclust, we can set the threshold inside the script (i.e. h = 0.8), as below.
cor.gp <- data.frame(cor.gp =
cutree(hclust(1 - as.dist(xtabs(cor ~ V1 + V2, df.gp))), h = 0.8))
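For the example data this should give the three expected groups (sketched output):
cor.gp
#   cor.gp
# A      1
# B      1
# C      2
# D      2
# E      3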
Recently, I have found that I am using the following pattern over and over again. The process is:
cross-tabulate numeric variable by factor using table
create data frame from created table
add original numeric values to data frame (from row names (!))
remove row names
reorder columns of aggregated data frame
In R, it looks like this:
# Sample data
df <- data.frame(x = round(runif(100), 1),
y = factor(ifelse(runif(100) > .5, 1, 0),
labels = c('failure', 'success'))
)
# Get frequencies
dfSummary <- as.data.frame.matrix(table(df$x, df$y))
# Add column of original values from rownames
dfSummary$x <- as.numeric(rownames(dfSummary))
# Remove rownames
rownames(dfSummary) <- NULL
# Reorder columns
dfSummary <- dfSummary[, c(3, 1, 2)]
Is there anything more elegant in R, preferably using base functions? I know I can do this in SQL with a single command, so I think it has to be possible to achieve similar behavior in R.
sqldf solution:
library(sqldf)
dfSummary <- sqldf("select
x,
sum(y = 'failure') as failure,
sum(y = 'success') as success
from df group by x")
An alternative with base R could be:
aggregate(. ~ x, transform(df, success = y == "success",
failure = y == "failure", y = NULL), sum)
# x success failure
#1 0.0 2 4
#2 0.1 6 8
#3 0.2 1 7
#4 0.3 5 4
#5 0.4 6 6
#6 0.5 3 3
#7 0.6 4 6
#8 0.7 6 6
#9 0.8 4 5
#10 0.9 6 7
#11 1.0 1 0
Your code, wrapped in a function, would be efficient compared with the other base R solutions (so far). If you want the code in one line, a "reshape/table" combo from base R can be used.
reshape(as.data.frame(table(df)), idvar='x', timevar='y',
direction='wide')
# x Freq.failure Freq.success
#1 0 3 2
#2 0.1 3 9
#3 0.2 5 5
#4 0.3 8 7
#5 0.4 5 3
#6 0.5 9 4
#7 0.6 3 6
#8 0.7 7 6
#9 0.8 3 1
#10 0.9 4 3
#11 1 0 4
In case you want to try data.table
library(data.table)
dcast.data.table(setDT(df), x~y)
# x failure success
# 1: 0.0 3 2
# 2: 0.1 3 9
# 3: 0.2 5 5
# 4: 0.3 8 7
# 5: 0.4 5 3
# 6: 0.5 9 4
# 7: 0.6 3 6
# 8: 0.7 7 6
# 9: 0.8 3 1
#10: 0.9 4 3
#11: 1.0 0 4
Update
I didn't notice that as.data.frame(table(...)) converts the columns to "factor" (thanks to @Hadley's comment). A workaround is:
res <- transform(reshape(as.data.frame(table(df), stringsAsFactors=FALSE),
idvar='x', timevar='y', direction='wide'), x= as.numeric(x))
data
set.seed(24)
df <- data.frame(x = round(runif(100), 1),
y = factor(ifelse(runif(100) > .5, 1, 0),
labels = c('failure', 'success'))
)
Benchmarks
set.seed(24)
df <- data.frame(x = round(runif(1e6), 1),
y = factor(ifelse(runif(1e6) > .5, 1, 0),
labels = c('failure', 'success'))
)
tomas <- function(){
dfSummary <- as.data.frame.matrix(table(df$x, df$y))
dfSummary$x <- as.numeric(rownames(dfSummary))
dfSummary <- dfSummary[, c(3, 1, 2)]}
doc <- function(){aggregate(. ~ x, transform(df,
success = y == "success", failure = y == "failure",
y = NULL), sum)}
akrun <- function(){reshape(as.data.frame(table(df)),
idvar='x', timevar='y', direction='wide')}
library(microbenchmark)
microbenchmark(tomas(), doc(), akrun(), unit='relative', times=20L)
Unit: relative
#expr min lq mean median uq max neval cld
#tomas() 1.000000 1.0000000 1.000000 1.000000 1.0000000 1.000000 20 a
#doc() 13.451037 11.5050997 13.082074 13.043584 12.8048306 19.715535 20 b
#akrun() 1.019977 0.9522809 1.012332 1.007569 0.9993835 1.533191 20 a
Updated with dcast.data.table
df1 <- copy(df)
akrun2 <- function() {dcast.data.table(setDT(df1), x~y)}
microbenchmark(tomas(), akrun2(), unit='relative', times=20L)
# Unit: relative
# expr min lq mean median uq max neval cld
# tomas() 6.493231 6.345752 6.410853 6.51594 6.502044 5.591753 20 b
# akrun2() 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 20 a
This should be relatively efficient. You cannot really suppress row names in a data frame, since they are a requirement of a valid data frame:
X <- table(df$x,df$y)
cbind( data.frame(x=rownames(X)), unclass(X) )
x failure success
0 0 5 3
0.1 0.1 6 1
0.2 0.2 7 8
0.3 0.3 7 3
0.4 0.4 6 6
0.5 0.5 6 4
0.6 0.6 2 5
0.7 0.7 2 7
0.8 0.8 3 7
0.9 0.9 4 6
1 1 2 0
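A small sketch illustrating the row-names remark (out is my own name; as.numeric is my addition so x comes out numeric): assigning NULL does not remove row names, it merely resets them to the default "1", "2", ... sequence.
out <- cbind(data.frame(x = as.numeric(rownames(X))), unclass(X))
rownames(out) <- NULL  # resets to the default sequence rather than removing them
head(out, 2)
#     x failure success
# 1 0.0       5       3
# 2 0.1       6       1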