Accessing a matrix in R

I have a matrix in R as follows:
YITEMREVENUE XCARTADD XCARTUNIQADD XCARTADDTOTALRS
YITEMREVENUE 1.0000000000 -0.02630016 -0.01811156 0.0008988723
XCARTADD -0.0263001551 1.00000000 0.02955307 -0.0438881639
XCARTUNIQADD -0.0181115638 0.02955307 1.00000000 0.0917359285
XCARTADDTOTALRS 0.0008988723 -0.04388816 0.09173593 1.0000000000
I want to list, for each row, the names of the columns that hold negative values only. My output should look like:
YITEMREVENUE - XCARTADD XCARTUNIQADD
XCARTADD - YITEMREVENUE XCARTADDTOTALRS
XCARTUNIQADD - YITEMREVENUE
XCARTADDTOTALRS - XCARTADD
Is this possible in R?

I would first cast the matrix to a data.frame. In code this would be:
# Some example data
dat = matrix(runif(9) - 0.5, 3, 3)
dimnames(dat) = list(LETTERS[1:3], LETTERS[1:3])
> dat
A B C
A 0.1216529 0.3501861 0.47473598
B -0.4720577 0.4887181 -0.41118597
C 0.4406510 -0.2516563 0.02344829
# Cast to data.frame
library(reshape)
df = melt(dat)
df
X1 X2 value
1 A A 0.12165293
2 B A -0.47205771
3 C A 0.44065104
4 A B 0.35018605
5 B B 0.48871810
6 C B -0.25165634
7 A C 0.47473598
8 B C -0.41118597
9 C C 0.02344829
# And find the combinations of row-columns which have < 0
df[df$value < 0, c("X1","X2")]
X1 X2
2 B A
6 C B
8 B C
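A base-R alternative that avoids the reshape dependency is which() with arr.ind = TRUE. The sketch below hard-codes the example values printed above so that it is reproducible; idx and neg are only illustrative names.

```r
# Example matrix copied from the printed output above (filled column by column)
dat <- matrix(c( 0.1216529, -0.4720577,  0.4406510,
                 0.3501861,  0.4887181, -0.2516563,
                 0.47473598, -0.41118597, 0.02344829),
              nrow = 3, dimnames = list(LETTERS[1:3], LETTERS[1:3]))

# Row/column positions of every negative entry, in column-major order
idx <- which(dat < 0, arr.ind = TRUE)

# Translate the positions into the matrix dimnames
neg <- data.frame(row = rownames(dat)[idx[, "row"]],
                  col = colnames(dat)[idx[, "col"]],
                  stringsAsFactors = FALSE)
neg
```

This should list the same three row/column pairs as the melt approach, without converting the whole matrix to long format first.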

If your data are in a data frame called m, you can use the following:
lapply(m, function(v) {rownames(m)[v<0]})
If your data are in a matrix called m, you can use:
apply(m, 2,function(v) {rownames(m)[v<0]})
In both cases, for this data you will get a list like this (apply returns a list here because the columns have different numbers of negative values):
$YITEMREVENUE
[1] "XCARTADD" "XCARTUNIQADD"
$XCARTADD
[1] "YITEMREVENUE" "XCARTADDTOTALRS"
$XCARTUNIQADD
[1] "YITEMREVENUE"
$XCARTADDTOTALRS
[1] "XCARTADD"
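To print the result in the "name - columns" layout asked for in the question, the per-row vectors can be pasted together. This is a minimal sketch, assuming the matrix is called m and has matching row and column names; vars, neg and out are illustrative names.

```r
vars <- c("YITEMREVENUE", "XCARTADD", "XCARTUNIQADD", "XCARTADDTOTALRS")
m <- matrix(c( 1.0000000000, -0.0263001551, -0.0181115638,  0.0008988723,
              -0.0263001551,  1.0000000000,  0.0295530700, -0.0438881639,
              -0.0181115638,  0.0295530700,  1.0000000000,  0.0917359285,
               0.0008988723, -0.0438881639,  0.0917359285,  1.0000000000),
            nrow = 4, dimnames = list(vars, vars))

# lapply always returns a list, even if every row had the same number of hits
neg <- lapply(vars, function(r) colnames(m)[m[r, ] < 0])
names(neg) <- vars

# One line per variable: "<name> - <columns with negative values>"
out <- vapply(vars,
              function(r) paste(r, "-", paste(neg[[r]], collapse = " ")),
              character(1))
writeLines(out)
```

Using lapply instead of apply sidesteps the case where apply simplifies its result to a matrix because every column happens to have the same number of negative values.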

Related

how to generate grouping variable based on correlation?

library(magrittr)
library(dplyr)
V1 <- c("A","A","A","A","A","A","B","B","B","B", "B","B","C","C","C","C","C","C","D","D","D","D","D","D","E","E","E","E","E","E")
V2 <- c("A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F")
cor <- c(1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.9)
df <- data.frame(V1,V2,cor)
# exclude rows where cor=NA
df <- df[complete.cases(df)==TRUE,]
This is the full data frame, cor=NA represents a correlation smaller than 0.8
df
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
30 E F 0.9
In the above df, F does not appear in V1, meaning that F is not of interest,
so here I remove the rows where V2 = F (more generally, where V2 equals a value that is not in V1):
V1.LIST <- unique(df$V1)
df.gp <- df[which(df$V2 %in% V1.LIST),]
df.gp
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
So now df.gp is the dataset I need to work on.
I drop the unused level in V2 (which is F in the example):
df.gp$V2 <- droplevels(factor(df.gp$V2))  # factor() keeps this working when V2 is a character column
I do not want to exclude the self-correlation rows (V1 equal to V2), because some of the V1 values may not be correlated with any other variable, and I would like to put each of those in a separate group.
Looking at cor, A and B are correlated, C and D are correlated, and E belongs to a group by itself.
Therefore, the example here should have three groups.
The way I see this, you may have complicated things by working your data straight into a data.frame. I took the liberty of transforming it back to a matrix.
library(reshape2)
cormat <- as.matrix(dcast(data = df,formula = V1~V2))[,-1]
row.names(cormat) <- colnames(cormat)[-length(colnames(cormat))]
cormat
Once you have your correlation matrix, it is easy to see which indices of non-NA values are shared with other variables.
a <- apply(cormat, 1, function(x) which(!is.na(x)))
a <- data.frame(t(a))
a$var <- row.names(a)
row.names(a) <- NULL
a
X1 X2 var
1 1 2 A
2 1 2 B
3 3 4 C
4 3 4 D
5 5 6 E
Now either X1 or X2 determines your unique groupings.
Edited by cyrusjan:
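The above script is a possible solution, assuming we have already selected the rows with cor >= a, where a is a threshold (0.8 in the above question).
Contributed by alexis_laz:
By using cutree and hclust, we can set the threshold in the script (i.e. h = 0.8) as below. Note that h is a cut height on the distance scale 1 - cor, not on the correlation itself; with the data here (distances of 0.2 or 1) any h from 0.2 up to, but not including, 1 gives the same three groups.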
cor.gp <- data.frame(cor.gp =
cutree(hclust(1 - as.dist(xtabs(cor ~ V1 + V2, df.gp))), h = 0.8))
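To see what the hclust/cutree idea does step by step, here is a self-contained sketch. It rebuilds the filtered pairs from the question, fills a plain similarity matrix instead of calling xtabs, and cuts at an illustrative height of 0.5. The names vars, sim, hc and groups are assumptions made for this example.

```r
# Pairs that survived the cor >= 0.8 filter (same rows as df.gp above)
df.gp <- data.frame(
  V1  = c("A", "A", "B", "B", "C", "C", "D", "D", "E"),
  V2  = c("A", "B", "A", "B", "C", "D", "C", "D", "E"),
  cor = c(1, 0.8, 0.8, 1, 1, 0.8, 0.8, 1, 1),
  stringsAsFactors = FALSE
)

# Square similarity matrix; pairs missing from df.gp are treated as 0
vars <- sort(unique(df.gp$V1))
sim <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
sim[cbind(df.gp$V1, df.gp$V2)] <- df.gp$cor

# Cluster on the distance 1 - cor and cut the tree at an illustrative height
hc <- hclust(as.dist(1 - sim))
groups <- cutree(hc, h = 0.5)

# Members of each group
split(names(groups), groups)
```

Here A and B should share a group, C and D should share another, and E should end up alone, which is the three-group result the question expects.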

Return Max Correlation and Row Name From Corr Matrix

I am trying to find the maximum correlation in each column of a data.frame object by using the cor function. Let's say this object looks like
A <- rnorm(100,5,1)
B <- rnorm(100,6,1)
C <- rnorm(100,7,4)
D <- rnorm(100,4,2)
E <- rnorm(100,4,3)
M <- data.frame(A,B,C,D,E)
N <- cor(M)
And the correlation matrix looks like
>N
A B C D E
A 1.000000000 0.02676645 0.000462529 0.026875495 -0.054506842
B 0.026766455 1.00000000 -0.150622473 0.037911600 -0.071794930
C 0.000462529 -0.15062247 1.000000000 0.015170017 0.026090225
D 0.026875495 0.03791160 0.015170017 1.000000000 -0.001968634
E -0.054506842 -0.07179493 0.026090225 -0.001968634 1.000000000
In the case of the first column (A) I'd like R to return the value "D", since it's the maximum non-negative, non-"1" value in column A, along with its associated correlation.
Any ideas?
Another option:
library(data.table)
setDT(melt(N))[Var1 != Var2, .SD[which.max(value)], keyby=Var1]
Result with @cory's data (using set.seed(9)):
Var1 Var2 value
1: A D 0.28933634
2: B C 0.13483843
3: C B 0.13483843
4: D A 0.28933634
5: E C 0.02588474
To understand how it works, first try running melt(N), which puts the data in long format.
The column numbers are
(n <- max.col(`diag<-`(N,0)))
# [1] 4 4 5 2 3
The names are
colnames(N)[n]
# [1] "D" "D" "E" "B" "C"
The values are
N[cbind(seq_len(nrow(N)),n)]
# [1] 0.02687549 0.03791160 0.02609023 0.03791160 0.02609023
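Note that max.col breaks ties at random by default (ties.method = "random"). If a deterministic result in a single data frame is preferred, a variation with which.max could look like the sketch below. The small matrix and the names N0, idx and res are made up for illustration.

```r
# Small made-up symmetric correlation matrix
N <- matrix(c( 1.0, 0.2, -0.5,
               0.2, 1.0,  0.1,
              -0.5, 0.1,  1.0),
            nrow = 3, dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Mask the diagonal so a variable can never be its own best match
N0 <- N
diag(N0) <- -Inf

# Row index of the largest off-diagonal value in each column (first on ties)
idx <- apply(N0, 2, which.max)

res <- data.frame(var  = colnames(N),
                  best = rownames(N)[idx],
                  r    = N[cbind(idx, seq_len(ncol(N)))],
                  stringsAsFactors = FALSE)
res
```

Setting the diagonal to -Inf rather than 0 also covers columns where every off-diagonal correlation is negative.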
Use apply on the rows to get the maximum of each row among the values less than one. Then use which to get the column index, and colnames to get the actual letters:
set.seed(9)
A <- rnorm(100,5,1)
B <- rnorm(100,6,1)
C <- rnorm(100,7,4)
D <- rnorm(100,4,2)
E <- rnorm(100,4,3)
M <- data.frame(A,B,C,D,E)
N <- cor(M)
N
A B C D E
A 1.000000000 0.005865532 0.03595202 0.28933634 0.00795076
B 0.005865532 1.000000000 0.13483843 0.04252079 -0.09567275
C 0.035952017 0.134838434 1.00000000 -0.01160411 0.02588474
D 0.289336335 0.042520787 -0.01160411 1.00000000 -0.12054680
E 0.007950760 -0.095672747 0.02588474 -0.12054680 1.00000000
colnames(N)[apply(N, 1, function (x) which(x==max(x[x<1])))]
[1] "D" "C" "B" "A" "C"
The corrr package gives a simple way to do it.
library(corrr)
library(dplyr)
set.seed(9)
A <- rnorm(100, 5, 1)
B <- rnorm(100, 6, 1)
C <- rnorm(100, 7, 4)
D <- rnorm(100, 4, 2)
E <- rnorm(100, 4, 3)
M <- data.frame(A, B, C, D, E)
N <- corrr::correlate(M)
print(N)
# # A tibble: 5 x 6
# term A B C D E
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A NA 0.00587 0.0360 0.289 0.00795
# 2 B 0.00587 NA 0.135 0.0425 -0.0957
# 3 C 0.0360 0.135 NA -0.0116 0.0259
# 4 D 0.289 0.0425 -0.0116 NA -0.121
# 5 E 0.00795 -0.0957 0.0259 -0.121 NA
head(dplyr::arrange(corrr::stretch(N, remove.dups = TRUE), desc(r)), 3)
# # A tibble: 3 x 3
# x y r
# <chr> <chr> <dbl>
# 1 A D 0.289
# 2 B C 0.135
# 3 B D 0.0425

R: Reshape count matrix to long format with multiple entries

I have a matrix. The entries of the matrix are counts for the combination of the dimension levels. For example:
(m0 <- matrix(1:4, nrow=2, dimnames=list(c("A","B"),c("A","B"))))
A B
A 1 3
B 2 4
I can change it to a long format:
library("reshape")
(m1 <- melt(m0))
X1 X2 value
1 A A 1
2 B A 2
3 A B 3
4 B B 4
But I would like to have multiple entries according to value:
m2 <- m1
for (i in 1:nrow(m1)) {
j <- m1[i,"value"]
k <- 2
while ( k <= j) {
m2 <- rbind(m2,m1[i,])
k = k+1
}
}
m2 <- subset(m2,select = - value)
m2[order(m2$X1),]
X1 X2
1 A A
3 A B
31 A B
32 A B
2 B A
4 B B
21 B A
41 B B
42 B B
43 B B
Is there a parameter in melt that repeats the entries according to value? Or is there another library that can do this?
We could do this with base R. We convert the dimnames of 'm0' to a 'data.frame' with two columns using expand.grid, then replicate the rows of the dataset with the values in 'm0', order the rows and change the row names to NULL (if necessary).
d1 <- expand.grid(dimnames(m0))
d2 <- d1[rep(1:nrow(d1), c(m0)),]
res <- d2[order(d2$Var1),]
row.names(res) <- NULL
res
# Var1 Var2
#1 A A
#2 A B
#3 A B
#4 A B
#5 B A
#6 B A
#7 B B
#8 B B
#9 B B
#10 B B
Or with melt, we convert the 'm0' to 'long' format and then replicate the rows as before.
library(reshape2)
dM <- melt(m0)
dM[rep(1:nrow(dM), dM$value),1:2]
As @Frank mentioned, we can also use table with as.data.frame to create 'dM':
dM <- as.data.frame(as.table(m0))
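Putting the as.table variant together end to end, a minimal sketch could look like this. With as.table the count column is named Freq rather than value; long is an illustrative name.

```r
m0 <- matrix(1:4, nrow = 2, dimnames = list(c("A", "B"), c("A", "B")))

# Long format with columns Var1, Var2 and Freq
dM <- as.data.frame(as.table(m0))

# Repeat each row index as many times as its count, then drop the count column
long <- dM[rep(seq_len(nrow(dM)), dM$Freq), c("Var1", "Var2")]
rownames(long) <- NULL

# Sanity check: tabulating the long data should give back the original counts
table(long$Var1, long$Var2)
```

The final table() call is a cheap way to confirm that the expansion did not lose or duplicate any entries.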

usage of subscript within For loop in R

I have a very basic question about R.
I have a table like this:
A B C D E
7 1 6 8 7
9 3 9 5 9
4 6 2 1 10
10 5 3 4 1
1 3 5 9 3
6 4 8 7 6
I am in the process of finding correlation of each variable with every other variable in the table. The final report should be something like this:
Var_1 Var_2 Correlation
A A 1
A B -0.022991544
A C 0.231553
A D -0.28037
A E -0.00523
B A -0.022999
B B 1
…
…
E D -0.39223
E E 1
Below is the R code I am using to achieve this:
rm(list=ls())
test <- read.csv("D:/AB/test.csv")
iterations <- ncol(test)
correlation <- matrix(ncol = 3 , nrow = iterations)
for (k in 1:iterations) {
for (l in 1:iterations){
corr <- cor(test[,k], test[,l])
corr_string_A <- names(test[k])
corr_string_B <- names(test[l])
correlation[l,] <- rbind(corr_string_A, corr_string_B, corr)
}
}
But I end up getting only the output for the E variable:
> correlation
[,1] [,2] [,3]
[1,] "E" "A" "-0.0523026032815805"
[2,] "E" "B" "0"
[3,] "E" "C" "0.231900361745681"
[4,] "E" "D" "-0.392232270276368"
[5,] "E" "E" "1"
I understand that the nested for loops in the above code have an indexing issue, and hence only the "E" series is printed, but I am not able to figure it out.
If anyone can kindly help me, it would be really great.
EDIT*
Changing the input data a bit
A B C D E
0 0 6 8 7
0 0 9 5 9
0 0 2 1 10
0 0 3 4 1
0 0 5 9 3
0 0 8 7 6
If one of the columns contains only 0, the correlation value we get is 'NaN'. I want to handle 'NaN' by replacing it with some other value, according to the business specification. Sorry for the late addition. Thank you for your understanding.
To answer your question without altering your code too much, there are two main issues. First, you are not allocating a matrix of the correct size. There are five iterations over five variables, or 25 combinations (some of them duplicated, i.e. A/C = C/A) in this example, so you need to fix your matrix declaration to account for that:
correlation <- matrix(ncol = 3 , nrow = iterations * iterations)
Second, you are only assigning values to the first five rows of this matrix within your nested for loop, so each pass of the outer loop overwrites the previous one. This line:
correlation[l,] <- rbind(corr_string_A, corr_string_B, corr)
needs a row index greater than l (which can only reach 5 in the example) after the first pass through the inner loop, like this:
correlation[l + ((k-1) * iterations),] <- rbind(corr_string_A, corr_string_B, corr)
This code should fix those problems:
iterations <- ncol(test)
correlation <- matrix(ncol = 3 , nrow = iterations * iterations)
for (k in 1:iterations) {
for (l in 1:iterations){
corr <- cor(test[,k], test[,l])
corr_string_A <- names(test[k])
corr_string_B <- names(test[l])
correlation[l + ((k-1) * iterations),] <- rbind(corr_string_A, corr_string_B, corr)
}
}
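The loop can also be avoided, since cor() accepts a whole data frame. The sketch below also covers the NaN requirement from the edit. cor_report is a hypothetical helper written for this example, and the placeholder value 0 is an assumption that should be replaced by whatever the business specification requires.

```r
# Data from the question
test <- data.frame(A = c(7, 9, 4, 10, 1, 6),
                   B = c(1, 3, 6, 5, 3, 4),
                   C = c(6, 9, 2, 3, 5, 8),
                   D = c(8, 5, 1, 4, 9, 7),
                   E = c(7, 9, 10, 1, 3, 6))

# cor_report() is a hypothetical helper, not part of any package
cor_report <- function(df, na_value = 0) {
  # cor() warns and gives an undefined result for constant columns
  cm <- suppressWarnings(cor(df))
  out <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
  names(out) <- c("Var_1", "Var_2", "Correlation")
  # Replace undefined correlations (NA/NaN) with the chosen placeholder
  out$Correlation[is.na(out$Correlation)] <- na_value
  out[order(out$Var_1, out$Var_2), ]
}

report <- cor_report(test)
head(report)

# With a constant column, the undefined entries are replaced by na_value
report0 <- cor_report(transform(test, A = 0))
```

Note that for a constant column even its correlation with itself is undefined, so that diagonal entry is replaced by na_value as well.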
The Hmisc package has an rcorr function that will return a list whose first item is the correlation matrix. It requires a matrix as input, which the function data.matrix is designed to deliver. The transformation to a three-column format is accomplished by the as.data.frame.table function (here the input data frame is called dat):
library(Hmisc)
as.data.frame.table( rcorr(data.matrix(dat))[[1]] )
#-------
Var1 Var2 Freq
1 A A 1.00000000
2 B A -0.02299154
3 C A 0.23155349
4 D A -0.28036851
5 E A -0.05230260
6 A B -0.02299154
7 B B 1.00000000
8 C B -0.58384037
9 D B -0.80175394
10 E B 0.00000000
11 A C 0.23155349
12 B C -0.58384037
13 C C 1.00000000
14 D C 0.52094591
15 E C 0.23190036
16 A D -0.28036851
17 B D -0.80175394
18 C D 0.52094591
19 D D 1.00000000
20 E D -0.39223227
21 A E -0.05230260
22 B E 0.00000000
23 C E 0.23190036
24 D E -0.39223227
25 E E 1.00000000
The names<- function can be used to dress up column names to your specification.

Replacing header in data frame based on values in second data frame

Say I have a data frame which looks like this:
df.A
A B C
x 1 3 4
y 5 4 6
z 8 9 1
And I want to replace the column names in the first based on column values in a second:
df.B
Low High
A D
B F
C G
Such that I get:
df.A
D F G
x 1 3 4
y 5 4 6
z 8 9 1
How would I do it?
I have tried extracting the vector df.B$High from df.B and using this in names(df.A), but everything is in alphabetical order and shifted over one. Furthermore, this only works if the order of columns in df.A is conserved with respect to the elements in df.B$High, which is not always the case (and in my real example there is no numeric or alphabetical way to sort the two to the same order). So I think I need an rbind-type argument for matching elements, but I'm not sure.
Thanks!
You can use rename from plyr:
library(plyr)
dat <- read.table(text = " A B C
x 1 3 4
y 5 4 6
z 8 9 1",header = TRUE,sep = "")
new <- read.table(text = "Low High
A D
B F
C G",header = TRUE,sep = "")
rename(dat,replace = setNames(new$High,new$Low))
D F G
x 1 3 4
y 5 4 6
z 8 9 1
using match:
df.A <- read.table(sep=" ", header=T, text="
A B C
x 1 3 4
y 5 4 6
z 8 9 1")
df.B <- read.table(sep=" ", header=T, text="
Low High
A D
B F
C G")
df.C <- df.A
names(df.C) <- df.B$High[match(names(df.A), df.B$Low)]
df.C
# D F G
# x 1 3 4
# y 5 4 6
# z 8 9 1
You can play games with the row names of df.B to make a lookup more convenient:
rownames(df.B) <- df.B$Low
names(df.A) <- df.B[names(df.A),"High"]
df.A
## D F G
## x 1 3 4
## y 5 4 6
## z 8 9 1
Here's an approach abusing factor:
f <- factor(names(df.A), levels=df.B$Low)
levels(f) <- df.B$High
f
## [1] D F G
## Levels: D F G
names(df.A) <- f
## Desired results
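Since the question mentions that the column order of df.A does not always follow df.B, here is a sketch of the match approach on shuffled columns. It also leaves alone any column that has no entry in the lookup table. The extra column Q and the names idx and new_names are made up for this example.

```r
# Columns deliberately out of order, plus one column (Q) with no mapping
df.A <- data.frame(C = c(4, 6, 1), A = c(1, 5, 8), B = c(3, 4, 9),
                   Q = c(0, 0, 0), row.names = c("x", "y", "z"))
df.B <- data.frame(Low = c("A", "B", "C"), High = c("D", "F", "G"),
                   stringsAsFactors = FALSE)

# Position of each current column name in the lookup table (NA if absent)
idx <- match(names(df.A), df.B$Low)

# Use the mapped name where one exists, otherwise keep the old name
new_names <- ifelse(is.na(idx), names(df.A), df.B$High[idx])
names(df.A) <- new_names
df.A
```

Because match looks each name up individually, the result does not depend on the two tables being sorted the same way.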
