Flatten matrix in R to four columns (indexes and upper/lower triangles)

I'm using the cor.prob() function that's been posted several times on the mailing list to get a matrix of correlations (lower triangle) and p-values (upper triangle):
cor.prob <- function (X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr/(1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}
d <- data.frame(x=1:5, y=c(10,16,8,60,80), z=c(10,9,12,2,1))
> cor.prob(d)
           x           y           z
x         NA  0.04856042 0.107654038
y  0.8807155          NA 0.003523594
z -0.7953560 -0.97945703          NA
How would I collapse the above correlation matrix (with the correlations in the lower half, p-values in the upper half) into a four-column matrix: two indexes, the correlation, and the p-value? E.g.:
i j cor pval
x y .88 .048
x z -.79 .107
y z -.97 0.0035
I've seen the answer to a previous question like this, but it only gives me a three-column matrix, not a four-column matrix with separate columns for the p-value and the correlation.
Any help is appreciated!

Well, it's not a matrix, because you can't mix characters and numerics; a data frame can hold both. But:
This is my first attempt (before your label swap):
m <- cor.prob(d)
ut <- upper.tri(m)
lt <- lower.tri(m)
d <- data.frame(i=rep(row.names(m),ncol(m))[as.vector(ut)],
j=rep(colnames(m),each=nrow(m))[as.vector(ut)],
cor=m[ut],
p=m[lt])
now apply the correction I suggested below and you get
d <- data.frame(i=rep(row.names(m),ncol(m))[as.vector(ut)],
j=rep(colnames(m),each=nrow(m))[as.vector(ut)],
cor=m[ut],
p=t(m)[ut])
finally your label swap, use row()/col(), and write it as a function:
f1 <- function(m) {
  ut <- upper.tri(m)
  data.frame(i = rownames(m)[row(m)[ut]],
             j = colnames(m)[col(m)[ut]],
             cor = t(m)[ut],  # lower triangle, read in upper-triangle order
             p = m[ut])       # upper triangle
}
then
m <- matrix(1:25, 5, dimnames = list(letters[1:5], letters[1:5]))
> m
  a  b  c  d  e
a 1  6 11 16 21
b 2  7 12 17 22
c 3  8 13 18 23
d 4  9 14 19 24
e 5 10 15 20 25
> f1(m)
   i j cor  p
1  a b   2  6
2  a c   3 11
3  b c   8 12
4  a d   4 16
5  b d   9 17
6  c d  14 18
7  a e   5 21
8  b e  10 22
9  c e  15 23
10 d e  20 24
Can you explain what you expected if it wasn't this?

cd <- cor.prob(d)
dcd <- as.data.frame( which( row(cd) < col(cd), arr.ind=TRUE) )
dcd$pval <- cd[row(cd) < col(cd)]
dcd$cor <- cd[row(cd) > col(cd)]
dcd[[2]] <-dimnames(cd)[[2]][dcd$col]
dcd[[1]] <-dimnames(cd)[[2]][dcd$row]
dcd
#--------------------
row col pval cor
1 x y 0.048560420 0.8807155
2 x z 0.107654038 -0.7953560
3 y z 0.003523594 -0.9794570
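As a cross-check on the answers above, the same four columns can also be built without going through the mixed matrix at all. The following is only a sketch, assuming Pearson correlations and using base R's combn() and cor.test(); the names pairs, tests and res are illustrative and do not come from the original posts. For a Pearson test the two-sided p-value from cor.test() agrees with the 1 - pf(Fstat, 1, dfr) computed in cor.prob().

```r
# Example data from the question
d <- data.frame(x = 1:5, y = c(10, 16, 8, 60, 80), z = c(10, 9, 12, 2, 1))

# All unordered pairs of column names, one pair per column of `pairs`
pairs <- combn(names(d), 2)

# Run one Pearson correlation test per pair
tests <- lapply(seq_len(ncol(pairs)), function(k) {
  cor.test(d[[pairs[1, k]]], d[[pairs[2, k]]])
})

# Assemble the long-format result: two indexes, correlation, p-value
res <- data.frame(i = pairs[1, ], j = pairs[2, ], stringsAsFactors = FALSE)
res$cor  <- vapply(tests, function(ct) unname(ct$estimate), numeric(1))
res$pval <- vapply(tests, function(ct) ct$p.value, numeric(1))
res
```

This loops over pairs rather than indexing the triangles, so it is slower on wide data, but it cannot mix up which triangle holds which quantity.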

Related

How to remove near zero variance without employing the caret package?

I am new to programming. But here is the piece of code that I have tried to remove the nearZeroVar function of the caret package from:
N <- 200 # number of points per class
D <- 2 # dimensionality
K <- 4 # number of classes
X <- data.frame() # data matrix (each row = single example)
y <- data.frame() # class labels
...(some lines are omitted)...
X <- as.matrix(X)
Y <- matrix(0, N * K, K)
for (i in 1:(N * K)) { Y[i, y[i,]] <- 1}
...(some lines are omitted)...
nzv <- nearZeroVar(train)
nzv.nolabel <- nzv-1
inTrain <- createDataPartition(y=train$label, p=0.7, list=F)
training <- train[inTrain, ]
CV <- train[-inTrain, ]
X <- as.matrix(training[, -1])
N <- nrow(X)
y <- training[, 1]
K <- length(unique(y))
X.proc <- X[, -nzv.nolabel]/max(X)
D <- ncol(X.proc)
Xcv <- as.matrix(CV[, -1])
ycv <- CV[, 1]
Xcv.proc <- Xcv[, -nzv.nolabel]/max(X)
Y <- matrix(0, N, K)
So, to get rid of the nearZeroVar function, I have tried to use the Filter function and the following foo function:
foo <- function(data) {
  out <- lapply(data, function(x) length(unique(x)))
  want <- which(!out > 1)
  unlist(want)
}
nzv <- foo(trainingSet)
nzv.nolabel <- nzv - 1
But I get error messages: "Error in X[, training.nolabel]: incorrect number of dimensions. Execution halted" or something like "Non-conformable arrays". Any ideas on how to work around nearZeroVar are strongly appreciated. Please let me know if I should share some more details.
It's not evident from the posted code how Filter() was being used.
Try the following:
# create sample data
R> df <- data.frame(a=1:10, b=sample(10:19), c=rep(5,10))
R> df
a b c
1 1 16 5
2 2 17 5
3 3 18 5
4 4 13 5
5 5 15 5
6 6 14 5
7 7 11 5
8 8 12 5
9 9 19 5
10 10 10 5
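Create a custom function like:
R> zeroVarianceCol <- function(df) {
  # Filter() coerces each column's variance to logical: 0 becomes FALSE, so the column is dropped
  Filter(var, df)
}
Passing the data frame to this custom function, x <- zeroVarianceCol(df), removes every column whose variance is exactly zero, in this case column c. Columns with a small but non-zero variance are kept.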
R> x
a b
1 1 16
2 2 17
3 3 18
4 4 13
5 5 15
6 6 14
7 7 11
8 8 12
9 9 19
10 10 10
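Filter(var, df) only catches columns that are exactly constant. A closer stand-in for nearZeroVar can be sketched in base R using two criteria: the ratio of the most common value's count to the second most common, and the percentage of distinct values. The helper name near_zero_var_cols and its arguments are hypothetical, and the default cut-offs (95/5 and 10) are assumed here to resemble caret's defaults, so check them against the caret documentation before relying on them.

```r
# Hypothetical helper: returns the positions of columns that are constant
# or dominated by a single value.
near_zero_var_cols <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- vapply(df, function(x) {
    counts <- sort(table(x), decreasing = TRUE)
    if (length(counts) <= 1) return(TRUE)        # constant (or all-NA) column
    freq_ratio <- counts[[1]] / counts[[2]]      # most common vs. second most common
    pct_unique <- 100 * length(counts) / length(x)
    freq_ratio > freq_cut && pct_unique < unique_cut
  }, logical(1))
  which(flagged)
}

df <- data.frame(a = 1:100, b = rep(5, 100), c = c(rep(0, 99), 1))
nzv <- near_zero_var_cols(df)

# Drop by position with setdiff() so that an empty `nzv` keeps every column,
# and use drop = FALSE so the result stays two-dimensional.
cleaned <- df[, setdiff(seq_along(df), nzv), drop = FALSE]
names(cleaned)
```

Indexing with setdiff() rather than a negative index matters because x[, -integer(0)] selects no columns at all, and drop = FALSE avoids the collapse to a plain vector that typically leads to "incorrect number of dimensions" errors later on.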

Calculate and place output in lower triangle matrix

I have a dataset that looks like this:
A:B A:C A:D B:C B:D C:D
2 5 12 21 12 2
4 6 25 2 1 5
10 21 89 3 3 8
I would like to calculate the mean of each column and place the output in a lower triangular matrix such as:
A B C D
A NA
B 5.33 NA
C 11 8.6 NA
D 42 4.3 15 NA
I am trying the following code, but I'm not there yet.
result.matrix <- matrix(nrow = 4, ncol = 4)
for (i in 1:length(test)) {
for(j in 1:length(test)) {
result.matrix[i,j] <- cor(mean(as.numeric(test[,i])), mean(as.numeric(test[,j])), use = "na.or.complete") }}
Any help would be appreciated!
Answer by @user20650:
m[lower.tri(m)] <- colMeans(d), where m is a matrix of zeros of the correct dimensions and d is the data frame (called test in the question's code).
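Spelled out on the question's data, that one-liner might look like the sketch below. It assumes the columns are ordered A:B, A:C, A:D, B:C, B:D, C:D, which is exactly the column-major order in which lower.tri() visits the cells; the matrix is filled with NA instead of zeros so the diagonal and upper triangle match the requested layout. The name labels is illustrative.

```r
# Data from the question; check.names = FALSE keeps the "A:B" style names
test <- data.frame(`A:B` = c(2, 4, 10), `A:C` = c(5, 6, 21), `A:D` = c(12, 25, 89),
                   `B:C` = c(21, 2, 3), `B:D` = c(12, 1, 3),  `C:D` = c(2, 5, 8),
                   check.names = FALSE)

labels <- LETTERS[1:4]
m <- matrix(NA_real_, 4, 4, dimnames = list(labels, labels))

# lower.tri() selects cells column by column: (B,A), (C,A), (D,A), (C,B), (D,B), (D,C)
m[lower.tri(m)] <- colMeans(test)
round(m, 2)
```

If the pairs arrive in a different order, the column names would need to be parsed and matched to row and column labels instead of relying on position.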

Split dataframe into bins based on another vector

Suppose I have the following data frame:
x <- c(12,30,45,100,150,305,2,46,10,221)
x2 <- letters[1:10]
df <- data.frame(x,x2)
df <- df[with(df, order(x)), ]
x x2
7 2 g
9 10 i
1 12 a
2 30 b
3 45 c
8 46 h
4 100 d
5 150 e
10 221 j
6 305 f
And I would like to split these into groups based on another vector,
v <- seq(0, 500, 50)
Basically, I would like to partition the rows based on column x and how it compares to v (for example, x <= an element in v); the location/index of that element in v is then used to assign a group to that row. The resulting table should look something like the following:
x x2 group
7 2 g g1
9 10 i g1
1 12 a g1
2 30 b g1
3 45 c g1
8 46 h g2
4 100 d g3
5 150 e g4
10 221 j g4
6 305 f g6
I could loop through each row and try to match it to v, but I'm still confused as to how I could easily detect where the match x <= element of v occurs so that I can assign a group id to it. Thanks.
You can use cut to break up df$x by the values of v:
df$group <- as.numeric(cut(df$x, breaks = v))
df$group <- paste0('g', df$group)
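cut returns a factor, so as.numeric extracts the index of the interval that each value of df$x falls into. By default the intervals are right-closed, i.e. (v[k], v[k+1]], so a value equal to the lowest break becomes NA unless include.lowest = TRUE is set.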
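The two steps can also be folded into a single call by passing labels to cut(). The sketch below uses the question's x and v; the names grp and idx are illustrative. With the default right-closed intervals, 100 lands in (50, 100] and 150 in (100, 150], which differs slightly from the hand-written table in the question. findInterval() with left.open = TRUE is shown as an alternative that returns the interval index directly.

```r
x <- c(12, 30, 45, 100, 150, 305, 2, 46, 10, 221)
v <- seq(0, 500, 50)

# One label per interval: length(v) breaks define length(v) - 1 intervals
grp <- cut(x, breaks = v, labels = paste0("g", seq_len(length(v) - 1)))
as.character(grp)

# Same interval index without building a factor; left.open = TRUE gives (lo, hi] intervals
idx <- findInterval(x, v, left.open = TRUE)
idx
```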

Getting the maximum common words in R

I have data of the form:
ID A1 A2 A3 ... A100
1 john max karl ... kevin
2 kevin bosy lary ... rosy
3 karl lary bosy ... hale
.
.
.
10000 isha john lewis ... dave
For each ID, I want to find one other ID such that the two share the maximum number of common attributes (A1, A2, ..., A100).
How can I do this in R?
Edit: Let's call the output a MatchId:
ID MatchId
1 70
2 4000
.
.
10000 3000
I think this gets what you're looking for:
library(dplyr)
# make up some data
set.seed(1492)
# bind_rows() replaces the older rbind_all(), which is no longer available in current dplyr
bind_rows(lapply(1:15, function(i) {
x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
colnames(x) <- c("ID", sprintf("A%d", 1:10))
x
})) -> dat
print(dat)
## Source: local data frame [15 x 11]
##
## ID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
## 1 1 H F E C B A R J Z N
## 2 2 Q P E M L Z C G V Y
## 3 3 Q J D N B T L K G Z
## 4 4 D Y U F V O I C A W
## 5 5 T Z D I J F R C B S
## 6 6 Q D H U P V O E R N
## 7 7 C L I M E K N S X Z
## 8 8 M J S E N O F Y X I
## 9 9 R H V N M T Q X L S
## 10 10 Q H L Y B W S M P X
## 11 11 M N J K B G S X V R
## 12 12 W X A H Y D N T Q I
## 13 13 K H V J D X Q W A U
## 14 14 M U F H S T W Z O N
## 15 15 G B U Y E L A Q W O
# get commons
# bind_rows() replaces the older rbind_all(), which is no longer available in current dplyr
bind_rows(lapply(1:15, function(i) {
bind_rows(lapply(setdiff(1:15, i), function(j) {
data.frame(id1=i,
id2=j,
common=length(intersect(c(t(dat[i, 2:11])),
c(t(dat[j, 2:11])))))
}))
})) -> commons
commons %>%
group_by(id1) %>%
top_n(1, common) %>%
filter(row_number()==1) %>%
select(ID=id1, MatchId=id2)
## Source: local data frame [15 x 2]
## Groups: ID
##
## ID MatchId
## 1 1 5
## 2 2 7
## 3 3 5
## 4 4 12
## 5 5 1
## 6 6 9
## 7 7 8
## 8 8 7
## 9 9 10
## 10 10 9
## 11 11 9
## 12 12 13
## 13 13 12
## 14 14 8
## 15 15 2
Using similar data as provided by @hrbrmstr:
set.seed(1492)
dat <- do.call(rbind, lapply(1:15, function(i) {
x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
colnames(x) <- c("ID", sprintf("A%d", 1:10))
x
}))
You could achieve the same using base R only
Res <- sapply(seq_len(nrow(dat)),
function(x) apply(dat[-1], 1,
function(y) length(intersect(dat[x, -1], y))))
diag(Res) <- -1
cbind(dat[1], MatchId = max.col(Res, ties.method = "first"))
# ID MatchId
# 1 1 5
# 2 2 7
# 3 3 5
# 4 4 12
# 5 5 1
# 6 6 9
# 7 7 8
# 8 8 7
# 9 9 10
# 10 10 9
# 11 11 9
# 12 12 13
# 13 13 12
# 14 14 8
# 15 15 2
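Because the random data above depends on the R version's sampling algorithm, a small hand-made table makes the logic easier to verify by eye. This is a sketch of the same idea (count shared values for every pair of rows, mask the diagonal, take the best column per row); the data and the names attrs, common and match_id are made up for illustration.

```r
# Four IDs with three attributes each (illustrative data)
dat <- data.frame(ID = 1:4,
                  A1 = c("john", "kevin", "karl", "max"),
                  A2 = c("max",  "bosy",  "lary", "john"),
                  A3 = c("karl", "lary",  "bosy", "karl"),
                  stringsAsFactors = FALSE)

attrs <- as.matrix(dat[-1])   # character matrix, one row per ID
n <- nrow(attrs)

# common[i, j] = number of distinct values shared by rows i and j
common <- matrix(0L, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    common[i, j] <- length(intersect(attrs[i, ], attrs[j, ]))
  }
}

diag(common) <- -1L           # never match an ID with itself
match_id <- dat$ID[max.col(common, ties.method = "first")]
cbind(dat["ID"], MatchId = match_id)
```

The pairwise matrix grows with the square of the number of rows, so for the 10000 IDs mentioned in the question this approach would need on the order of 10^8 comparisons.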
If I understand correctly, the requirement is to obtain the maximum number of common attributes for each ID.
A frequency table can be built with table() for each ID inside lapply(), assuming that the ID column is unique; if it is not, use unique(df$ID) rather than df$ID in lapply(). The maximum frequency is then taken and, if there is a tie, only the first one is chosen. Finally, the results are combined by do.call().
df <- read.table(header = T, text = "
ID A1 A2 A3 A100
1 john max karl kevin
2 kevin bosy lary rosy
3 karl lary bosy hale
10000 isha john lewis dave")
do.call(rbind, lapply(df$ID, function(x) {
tbl <- table(unlist(df[df$ID == x, 2:ncol(df)]))
data.frame(ID = x, MatchId = tbl[tbl == max(tbl)][1])
}))
# ID MatchId
#john 1 1
#kevin 2 1
#karl 3 1
#isha 10000 1

r "slot" two columns into one (like a zip)

Given two columns (perhaps from a data frame) of equal length N, how can I produce a column of length 2N with the odd entries from the first column and the even entries from the second column?
Suppose I have the following data frame
df.1 <- data.frame(X = LETTERS[1:10], Y = 2*(1:10)-1, Z = 2*(1:10))
How can I produce this data frame df.2?
i <- 1
j <- 0
XX <- NA
while (i <= 10){
XX[i+j] <- LETTERS[i]
XX[i+j+1]<- LETTERS[i]
i <- i+1
j <- i-1
}
df.2 <- data.frame(X.X = XX, Y.Z = c(1:20))
ggplot2 has an unexported function interleave which does this.
Whilst unexported it does have a help page (?ggplot2:::interleave)
with(df.1, ggplot2:::interleave(Y,Z))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
If I understand you right, you want to create a new vector twice the length of the vectors X, Y and Z in your data frame and then want all the elements of X to occupy the odd indices of this new vector and all the elements of Y the even indices. If so, then the code below should do the trick:
foo<-vector(length=2*nrow(df.1), mode='character')
foo[seq(from = 1, to = 2*length(df.1$X), by=2)]<-as.character(df.1$X)
foo[seq(from = 2, to = 2*length(df.1$X), by=2)]<-df.1$Y
Note, I first create an empty vector foo of length 20, then fill it in with elements of df.1$X and df.1$Y.
Cheers,
Danny
You can use melt from reshape2:
library(reshape2)
foo <- melt(df.1, id.vars='X')
> foo
X variable value
1 A Y 1
2 B Y 3
3 C Y 5
4 D Y 7
5 E Y 9
6 F Y 11
7 G Y 13
8 H Y 15
9 I Y 17
10 J Y 19
11 A Z 2
12 B Z 4
13 C Z 6
14 D Z 8
15 E Z 10
16 F Z 12
17 G Z 14
18 H Z 16
19 I Z 18
20 J Z 20
Then you can sort and pick the columns you want:
foo[order(foo$X), c('X', 'value')]
Another solution using base R.
First index the character vector of the data.frame using the vector [1,1,2,2 ... 10,10] and store it as X.X. Next, rbind the data.frame vectors Y and Z, effectively "zipping" them, and store the result in Y.Z.
> res <- data.frame(
+ X.X = df.1$X[c(rbind(1:10, 1:10))],
+ Y.Z = c(rbind(df.1$Y, df.1$Z))
+ )
> head(res)
X.X Y.Z
1 A 1
2 A 2
3 B 3
4 B 4
5 C 5
6 C 6
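The c(rbind(...)) idiom used above generalises to any two equal-length vectors. A minimal sketch follows; the function name interleave_vec is hypothetical and chosen to avoid confusion with ggplot2's internal interleave.

```r
# Hypothetical helper: zip two equal-length vectors element by element
interleave_vec <- function(a, b) {
  stopifnot(length(a) == length(b))
  # rbind() stacks a over b as a 2-row matrix; c() then reads it column by column,
  # giving a[1], b[1], a[2], b[2], ...
  c(rbind(a, b))
}

odd  <- c(1, 3, 5)
even <- c(2, 4, 6)
zipped <- interleave_vec(odd, even)
zipped
```

If the two inputs have different types, rbind() coerces them to a common type first (for example numbers become character strings when mixed with letters).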
A two-liner in base R:
test <- data.frame(X.X=df.1$X,Y.Z=unlist(df.1[c("Y","Z")]))
test[order(test$X.X),]
Assuming that you want what you asked for in the first paragraph, and that the rest of what you posted is your attempt at solving it:
a=df.1[df.1$Y%%2>0,1:2]
b=df.1[df.1$Z%%2==0,c(1,3)]
names(a)=c("X.X","Y.Z")
names(b)=names(a)
df.2=rbind(a, b)
If you want to group them by X.X as shown in your example, you can do:
library(plyr)
arrange(df.2, X.X)
