I have a very basic question about R.
I have a table like this:
A B C D E
7 1 6 8 7
9 3 9 5 9
4 6 2 1 10
10 5 3 4 1
1 3 5 9 3
6 4 8 7 6
I am trying to find the correlation of each variable with every other variable in the table. The final report should look something like this:
Var_1 Var_2 Correlation
A A 1
A B -0.022991544
A C 0.231553
A D -0.28037
A E -0.0523
B A -0.022991544
B B 1
…
…
E D -0.39223
E E 1
Below is the R code I am using to achieve this:
rm(list=ls())
test <- read.csv("D:/AB/test.csv")
iterations <- ncol(test)
correlation <- matrix(ncol = 3 , nrow = iterations)
for (k in 1:iterations) {
    for (l in 1:iterations) {
        corr <- cor(test[,k], test[,l])
        corr_string_A <- names(test[k])
        corr_string_B <- names(test[l])
        correlation[l,] <- rbind(corr_string_A, corr_string_B, corr)
    }
}
But I end up with output for the E variable only:
> correlation
[,1] [,2] [,3]
[1,] "E" "A" "-0.0523026032815805"
[2,] "E" "B" "0"
[3,] "E" "C" "0.231900361745681"
[4,] "E" "D" "-0.392232270276368"
[5,] "E" "E" "1"
I understand that there is a looping issue somewhere in the nested for loops above, and that this is why only the "E" series is printed, but I am not able to figure it out.
If anyone can help me, it would be really great.
EDIT*
Changing the input data a bit
A B C D E
0 0 6 8 7
0 0 9 5 9
0 0 2 1 10
0 0 3 4 1
0 0 5 9 3
0 0 8 7 6
If one of the columns contains only 0s, the correlation value we get is 'NaN'. I want to handle 'NaN' by replacing it with some other value according to the business specification. Sorry for the late addition. Thank you for your understanding.
To answer your question without altering your code too much, there are two main issues. First, you are not allocating a matrix of the correct size. There are five iterations over five variables, or 25 combinations (some of them duplicated, i.e. A/C = C/A) in this example, so you need to fix your matrix declaration to account for that:
correlation <- matrix(ncol = 3 , nrow = iterations * iterations)
Second, you are only assigning values to the first five rows of this matrix within your nested for loop. This line:
correlation[l,] <- rbind(corr_string_A, corr_string_B, corr)
needs a row index that goes beyond l (which can only reach 5 in the example) after the first pass through the inner loop, like this:
correlation[l + ((k-1) * iterations),] <- rbind(corr_string_A, corr_string_B, corr)
This code should fix those problems:
iterations <- ncol(test)
correlation <- matrix(ncol = 3 , nrow = iterations * iterations)
for (k in 1:iterations) {
    for (l in 1:iterations) {
        corr <- cor(test[,k], test[,l])
        corr_string_A <- names(test[k])
        corr_string_B <- names(test[l])
        correlation[l + ((k-1) * iterations),] <- rbind(corr_string_A, corr_string_B, corr)
    }
}
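The same long table can also be built without an explicit loop. What follows is a minimal sketch in base R, assuming the data frame is called test as in the question; the values are re-entered by hand so the block runs on its own, and the names cm and long are only illustrative.

```r
# Sample data from the question, entered by hand so the block is self-contained
test <- data.frame(A = c(7, 9, 4, 10, 1, 6),
                   B = c(1, 3, 6, 5, 3, 4),
                   C = c(6, 9, 2, 3, 5, 8),
                   D = c(8, 5, 1, 4, 9, 7),
                   E = c(7, 9, 10, 1, 3, 6))

# cor() on a data frame returns the full 5 x 5 correlation matrix
cm <- cor(test)

# as.table() followed by as.data.frame() gives one row per pair of variables
long <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(long) <- c("Var_1", "Var_2", "Correlation")
head(long)
```

Because the correlation matrix is symmetric, it does not matter here that the first column varies fastest in the flattened result.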
The Hmisc package has an rcorr function that returns a list whose first item is the correlation matrix. It requires a matrix as input, which the function data.matrix is designed to deliver (dat below stands for your data frame). The transformation to a three-column format is accomplished by the as.data.frame.table function:
library(Hmisc)
as.data.frame.table( rcorr(data.matrix(dat))[[1]] )
#-------
Var1 Var2 Freq
1 A A 1.00000000
2 B A -0.02299154
3 C A 0.23155349
4 D A -0.28036851
5 E A -0.05230260
6 A B -0.02299154
7 B B 1.00000000
8 C B -0.58384037
9 D B -0.80175394
10 E B 0.00000000
11 A C 0.23155349
12 B C -0.58384037
13 C C 1.00000000
14 D C 0.52094591
15 E C 0.23190036
16 A D -0.28036851
17 B D -0.80175394
18 C D 0.52094591
19 D D 1.00000000
20 E D -0.39223227
21 A E -0.05230260
22 B E 0.00000000
23 C E 0.23190036
24 D E -0.39223227
25 E E 1.00000000
The names<- function can be used to dress up column names to your specification.
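For the edit to the question (columns that are all zeros), one possible approach is to replace the missing correlations in the matrix before flattening it. This is only a sketch in base R: fill_value is a hypothetical placeholder for whatever the business specification requires, and the data are the edited values from the question entered by hand.

```r
# Edited data from the question: columns A and B are constant (all zeros)
test <- data.frame(A = rep(0, 6),
                   B = rep(0, 6),
                   C = c(6, 9, 2, 3, 5, 8),
                   D = c(8, 5, 1, 4, 9, 7),
                   E = c(7, 9, 10, 1, 3, 6))

# cor() warns about a zero standard deviation and returns missing values
cm <- suppressWarnings(cor(test))

# Hypothetical placeholder: substitute the value your business rule calls for
fill_value <- 0

# is.na() is TRUE for both NA and NaN, so one test covers either case
cm[is.na(cm)] <- fill_value

long <- as.data.frame(as.table(cm))
names(long) <- c("Var_1", "Var_2", "Correlation")
long
```

Note that the diagonal entries for the constant columns are replaced as well, so if A/A should still read 1 that would need a separate rule.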
Consider the following named vector x.
( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
# a b c d e f g h
# 1 2 0 NA 4 NA NA 6
I'd like to calculate the cumulative sum of x while ignoring the NA values. Many R functions have an argument na.rm which removes NA elements prior to calculations. cumsum() is not one of them, which makes this operation a bit tricky.
I can do it this way.
y <- setNames(numeric(length(x)), names(x))
z <- cumsum(na.omit(x))
y[names(y) %in% names(z)] <- z
y[!names(y) %in% names(z)] <- x[is.na(x)]
y
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
But this seems excessive, and makes a lot of new assignments/copies. I'm sure there's a better way.
What better methods are there to return the cumulative sum while effectively ignoring NA values?
You can do this in one line with:
cumsum(ifelse(is.na(x), 0, x)) + x*0
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
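To see why the + x*0 term restores the missing values, the one-liner can be split into steps. This is just an unpacked sketch of the same expression; the names running, mask and result are illustrative, and it assumes x has no infinite values (Inf * 0 is NaN).

```r
x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8])

# Step 1: treat NA as 0 so cumsum() never sees a missing value
running <- cumsum(ifelse(is.na(x), 0, x))

# Step 2: x * 0 is 0 where x is known and NA where x is missing
mask <- x * 0

# Step 3: adding the mask puts NA back at the original positions
result <- running + mask
result
```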
Or, similarly:
library(dplyr)
cumsum(coalesce(x, 0)) + x*0
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
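It's an old question, but tidyr offers another option, based on the idea of replacing NA with zero. Note that, unlike the answers above, the result carries the running total through the NA positions instead of leaving them as NA.
library(tidyr)
cumsum(replace_na(x, 0))
# a b c d e f g h
# 1 3 3 3 7 7 7 13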
Do you want something like this:
x2 <- x
x2[!is.na(x)] <- cumsum(x2[!is.na(x)])
x2
[edit] Alternatively, as suggested in a comment above, you can change the NAs to 0s:
miss <- is.na(x)
x[miss] <- 0
cs <- cumsum(x)
cs[miss] <- NA
# cs is the requested cumsum
Here's a function I came up with from the answers to this question. Thought I'd share it, since it seems to work well so far. It calculates the cumulative FUNC of x while ignoring NA. FUNC can be any one of sum(), prod(), min(), or max(), and x is a numeric vector.
cumSkipNA <- function(x, FUNC)
{
    d <- deparse(substitute(FUNC))
    funs <- c("max", "min", "prod", "sum")
    stopifnot(is.vector(x), is.numeric(x), d %in% funs)
    FUNC <- match.fun(paste0("cum", d))
    x[!is.na(x)] <- FUNC(x[!is.na(x)])
    x
}
set.seed(1)
x <- sample(15, 10, TRUE)
x[c(2,7,5)] <- NA
x
# [1] 4 NA 9 14 NA 14 NA 10 10 1
cumSkipNA(x, sum)
# [1] 4 NA 13 27 NA 41 NA 51 61 62
cumSkipNA(x, prod)
# [1] 4 NA 36 504 NA 7056 NA
# [8] 70560 705600 705600
cumSkipNA(x, min)
# [1] 4 NA 4 4 NA 4 NA 4 4 1
cumSkipNA(x, max)
# [1] 4 NA 9 14 NA 14 NA 14 14 14
Definitely nothing new, but maybe useful to someone.
Another option is using the collapse package with fcumsum function like this:
( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
#> a b c d e f g h
#> 1 2 0 NA 4 NA NA 6
library(collapse)
fcumsum(x)
#> a b c d e f g h
#> 1 3 3 NA 7 NA NA 13
Created on 2022-08-24 with reprex v2.0.2
Let me try to make this question as general as possible.
Let's say I have two variables a and b.
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
So b has 17 observations and is a subset of a which has 20 observations.
My question is the following: how would I use these two variables to generate a third variable c which, like a, has 20 observations, but for which observations 7, 11 and 15 are missing and the other observations are identical to b, in the order of a?
Or to put it somewhat differently: how could I squeeze these missing observations into variable b at locations 7, 11 and 15?
It seems pretty straightforward (and it probably is), but I have been unable to get this to work for a bit too long now.
1) loop Try this loop:
# test data
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
# lets work with vectors
A <- a[[1]]
B <- b[[1]]
j <- 1
C <- A
for(i in seq_along(A)) if (A[i] == B[j]) j <- j+1 else C[i] <- NA
which gives:
> C
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
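One caveat with the loop: if the missing observations are at the very end of A, j runs past the end of B, B[j] is NA, and the if() fails. A hedged variant that guards against this is sketched below; the choice of dropping the last three values is a hypothetical variation on the test data, not part of the question.

```r
set.seed(123)  # for reproducibility
A <- as.integer(runif(20, min = 0, max = 10))
B <- A[-c(18, 19, 20)]  # hypothetical variation: drop the last three values

j <- 1
C <- A
for (i in seq_along(A)) {
  # Only compare while there are unmatched elements of B left
  if (j <= length(B) && A[i] == B[j]) {
    j <- j + 1
  } else {
    C[i] <- NA
  }
}
C
```

The && operator short-circuits, so B[j] is never evaluated once B is exhausted.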
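2) Reduce Here is a loop-free version. Reduce is given an explicit starting index of 1; without init it would start from A[1], which only happens to work for this data because A[1] is 2:
f <- function(j, a) j + (a == B[j])
r <- Reduce(f, A, accumulate = TRUE, init = 1)
ifelse(diff(r) == 0, NA, A)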
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
3) dtw. Using dtw in the package of the same name we can get a compact loop-free one-liner:
library(dtw)
ifelse(duplicated(dtw(A, B)$index2), NA, A)
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
REVISED Added additional solutions.
Here's a more complicated way of doing it, using the Levenshtein distance algorithm, that does a better job on more complicated examples (it also seemed faster in a couple of larger tests I tried):
# using same data as G. Grothendieck:
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
A = a[[1]]
B = b[[1]]
# compute the transformation between the two, assigning infinite weight to
# insertion and substitution
# using +1 here because the integers fed to intToUtf8 have to be larger than 0
# could also adjust the range more dynamically based on A and B
transf = attr(adist(intToUtf8(A+1), intToUtf8(B+1),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
More complex matching example (where the greedy algorithm would perform poorly):
A = c(1,1,2,2,1,1,1,2,2,2)
B = c(1,1,1,2,2,2)
transf = attr(adist(intToUtf8(A), intToUtf8(B),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] NA NA NA NA 1 1 1 2 2 2
# the greedy algorithm would return this instead:
#[1] 1 1 NA NA 1 NA NA 2 2 2
The data frame version, which isn't terribly different from G.'s above.
(Assumes a,b setup as above).
j <- 1
c <- a
for (i in seq_along(a[,1])) {
    if (a[i,1] == b[j,1]) {
        j <- j + 1
    } else {
        c[i,1] <- NA
    }
}
I'm using the cor.prob() function that's been posted several times around the mailing list to get a matrix of correlations (lower diagonal) and p-values (upper diagonals):
cor.prob <- function (X, dfr = nrow(X) - 2) {
    R <- cor(X)
    above <- row(R) < col(R)
    r2 <- R[above]^2
    Fstat <- r2 * dfr/(1 - r2)
    R[above] <- 1 - pf(Fstat, 1, dfr)
    R[row(R) == col(R)] <- NA
    R
}
d <- data.frame(x=1:5, y=c(10,16,8,60,80), z=c(10,9,12,2,1))
cor.prob(d)
> cor.prob(d)
x y z
x NA 0.04856042 0.107654038
y 0.8807155 NA 0.003523594
z -0.7953560 -0.97945703 NA
How would I collapse the above correlation matrix (with the correlations in the lower half, p-values in the upper half) into a four-column matrix: two indexes, the correlation, and the p-value? E.g.:
i j cor pval
x y .88 .048
x z -.79 .107
y z -.97 0.0035
I've seen the answer to a previous question like this, but it only gives me a three-column matrix, not a four-column one with separate columns for the p-value and the correlation.
Any help is appreciated!
Well, the result can't be a matrix, because you can't mix characters and numerics in one. But:
This is my first attempt (before your label swap):
m <- cor.prob(d)
ut <- upper.tri(m)
lt <- lower.tri(m)
d <- data.frame(i=rep(row.names(m),ncol(m))[as.vector(ut)],
                j=rep(colnames(m),each=nrow(m))[as.vector(ut)],
                cor=m[ut],
                p=m[lt])
Now apply the correction I suggested below (transposing, so that the lower-triangle values come out in the same order as the upper-triangle ones) and you get
d <- data.frame(i=rep(row.names(m),ncol(m))[as.vector(ut)],
                j=rep(colnames(m),each=nrow(m))[as.vector(ut)],
                cor=m[ut],
                p=t(m)[ut])
Finally, with your label swap applied (correlations come from the lower triangle, p-values from the upper), use row()/col() and write it as a function:
f1 <- function(m) {
    ut <- upper.tri(m)
    data.frame(i = rownames(m)[row(m)[ut]],
               j = rownames(m)[col(m)[ut]],
               cor=t(m)[ut],
               p=m[ut])
}
then
m<-matrix(1:25,5,dimnames=list(letters[1:5],letters[1:5]))
> m
  a  b  c  d  e
a 1  6 11 16 21
b 2  7 12 17 22
c 3  8 13 18 23
d 4  9 14 19 24
e 5 10 15 20 25
> f1(m)
   i j cor  p
1  a b   2  6
2  a c   3 11
3  b c   8 12
4  a d   4 16
5  b d   9 17
6  c d  14 18
7  a e   5 21
8  b e  10 22
9  c e  15 23
10 d e  20 24
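For completeness, here is a sketch of the same idea applied to the cor.prob() output from the question rather than to a toy matrix. It assumes the layout described in the question (correlations below the diagonal, p-values above); flatten_cor is a hypothetical name used only for this example.

```r
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}

# Lower triangle holds correlations, upper triangle holds p-values
flatten_cor <- function(m) {  # hypothetical name for this sketch
  ut <- upper.tri(m)
  data.frame(i   = rownames(m)[row(m)[ut]],
             j   = colnames(m)[col(m)[ut]],
             cor = t(m)[ut],  # transpose so the lower triangle lines up with ut
             p   = m[ut])
}

d <- data.frame(x = 1:5, y = c(10, 16, 8, 60, 80), z = c(10, 9, 12, 2, 1))
res <- flatten_cor(cor.prob(d))
res
```

The result should have one row per pair (x/y, x/z, y/z), with the correlation and its p-value side by side.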
Can you explain what you expected if it wasn't this?
cd <- cor.prob(d)
dcd <- as.data.frame( which( row(cd) < col(cd), arr.ind=TRUE) )
dcd$pval <- cd[row(cd) < col(cd)]
dcd$cor <- t(cd)[row(cd) < col(cd)]  # transpose so each correlation lines up with its p-value
dcd[[2]] <-dimnames(cd)[[2]][dcd$col]
dcd[[1]] <-dimnames(cd)[[2]][dcd$row]
dcd
#--------------------
row col pval cor
1 x y 0.048560420 0.8807155
2 x z 0.107654038 -0.7953560
3 y z 0.003523594 -0.9794570
I have a matrix in R. Each entry i,j is a score and the rownames and colnames are ids.
Instead of the matrix I just want a 3 column matrix that has: i,j,score
Right now I'm using nested for loops. Like:
for(i in rownames(g))
{
    print(which(rownames(g)==i))
    for(j in colnames(g))
    {
        cur.vector<-c(cur.ref, i, j, g[rownames(g) %in% i,colnames(g) %in% j])
        rbind(new.file,cur.vector)->new.file
    }
}
But that's very inefficient, I think. I'm sure there's a better way; I'm just not good enough with R yet.
Thoughts?
If I understand you correctly, you need to flatten the matrix.
You can use as.vector to flatten the values and rep to add the id columns, e.g.:
m = cbind(c(1,2,3),c(4,5,6),c(7,8,9))
row.names(m) = c('R1','R2','R3')
colnames(m) = c('C1','C2','C3')
d <- data.frame(i=rep(row.names(m),ncol(m)),
                j=rep(colnames(m),each=nrow(m)),
                score=as.vector(m))
Result:
> m
C1 C2 C3
R1 1 4 7
R2 2 5 8
R3 3 6 9
> d
i j score
1 R1 C1 1
2 R2 C1 2
3 R3 C1 3
4 R1 C2 4
5 R2 C2 5
6 R3 C2 6
7 R1 C3 7
8 R2 C3 8
9 R3 C3 9
Please note that this code converts the matrix into a data.frame, since the row and column names can be strings and a matrix cannot hold columns of different types.
If you are sure that all row and column names are numbers, you can coerce the result to a matrix.
If you convert your matrix first to a table (with as.table) then to a data frame (as.data.frame) then it will accomplish what you are asking for. A simple example:
> tmp <- matrix( 1:12, 3 )
> dimnames(tmp) <- list( letters[1:3], LETTERS[4:7] )
> as.data.frame( as.table( tmp ) )
Var1 Var2 Freq
1 a D 1
2 b D 2
3 c D 3
4 a E 4
5 b E 5
6 c E 6
7 a F 7
8 b F 8
9 c F 9
10 a G 10
11 b G 11
12 c G 12
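If the i, j, score column names from the question are wanted directly, the value column can be named through the responseName argument of as.data.frame and the first two columns renamed afterwards. A small sketch along those lines, reusing the tmp example above (long is just an illustrative name):

```r
tmp <- matrix(1:12, 3)
dimnames(tmp) <- list(letters[1:3], LETTERS[4:7])

# responseName sets the name of the value column (the default is "Freq")
long <- as.data.frame(as.table(tmp), responseName = "score",
                      stringsAsFactors = FALSE)

# Rename the two id columns, which default to Var1 and Var2
names(long)[1:2] <- c("i", "j")
long
```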