Is there a way to extract the correlation coefficients from a correlation matrix?
Let's say I have a dataset with 3 variables (a, b, c) and I want to calculate the correlations among themselves.
with
df <- data.frame(a = c(2, 3, 3, 5, 6, 9, 14, 15, 19, 21, 22, 23),
                 b = c(23, 24, 24, 23, 17, 28, 38, 34, 35, 39, 41, 43),
                 c = c(13, 14, 14, 14, 15, 17, 18, 19, 22, 20, 24, 26),
                 d = c(6, 6, 7, 8, 8, 8, 7, 6, 5, 3, 3, 2))
and
cor(df[, c('a', 'b', 'c')])
I'll get a correlation matrix:
a b c
a 1.0000000 0.9279869 0.9604329
b 0.9279869 1.0000000 0.8942139
c 0.9604329 0.8942139 1.0000000
Is there a way to show the results in a manner like this:
Correlation between a and b is: 0.9279869.
Correlation between a and c is: 0.9604329.
Correlation between b and c is: 0.8942139.
?
My real correlation matrix is obviously bigger (~300 entries), and I need a way to extract only the values that are important to me.
Thanks.
Using reshape2 and melt
df <- data.frame("a" = c(2, 3, 3, 5, 6, 9, 14, 15, 19, 21, 22, 23),
"b" = c(23, 24, 24, 23, 17, 28, 38, 34, 35, 39, 41, 43),
"c" = c(13, 14, 14, 14, 15, 17, 18, 19, 22, 20, 24, 26),
"d" = c(6, 6, 7, 8, 8, 8, 7, 6, 5, 3, 3, 2))
tmp=cor(df[, c('a', 'b', 'c')])
tmp[lower.tri(tmp)]=NA
diag(tmp)=NA
library(reshape2)
na.omit(melt(tmp))
resulting in
Var1 Var2 value
4 a b 0.9279869
7 a c 0.9604329
8 b c 0.8942139
You can do,
df1 = cor(df[, c('a', 'b', 'c')])
df1 = as.data.frame(as.table(df1))
df1$Freq = round(df1$Freq,2)
df2 = subset(df1, (as.character(df1$Var1) != as.character(df1$Var2)))
df2$res = paste('Correlation between', df2$Var1, 'and', df2$Var2, 'is', df2$Freq)
Var1 Var2 Freq res
2 b a 0.93 Correlation between b and a is 0.93
3 c a 0.96 Correlation between c and a is 0.96
4 a b 0.93 Correlation between a and b is 0.93
6 c b 0.89 Correlation between c and b is 0.89
7 a c 0.96 Correlation between a and c is 0.96
8 b c 0.89 Correlation between b and c is 0.89
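To list each pair only once and print the sentences in the form the question asks for, one option is to index the upper triangle of the matrix directly. The following is a minimal base R sketch, not a definitive solution; the names `cm`, `idx`, `res`, `msg` and `strong`, as well as the 0.9 cutoff used to pick the "important" values, are assumptions made for illustration.

```r
df <- data.frame(a = c(2, 3, 3, 5, 6, 9, 14, 15, 19, 21, 22, 23),
                 b = c(23, 24, 24, 23, 17, 28, 38, 34, 35, 39, 41, 43),
                 c = c(13, 14, 14, 14, 15, 17, 18, 19, 22, 20, 24, 26))

cm <- cor(df[, c("a", "b", "c")])

# Row/column positions of the cells above the diagonal (each pair once)
idx <- which(upper.tri(cm), arr.ind = TRUE)

res <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                  var2 = colnames(cm)[idx[, "col"]],
                  r    = cm[idx],
                  stringsAsFactors = FALSE)

# One sentence per pair
msg <- sprintf("Correlation between %s and %s is: %.7f.",
               res$var1, res$var2, res$r)
cat(msg, sep = "\n")

# Keep only the pairs above a cutoff (0.9 is an arbitrary, hypothetical choice)
strong <- res[abs(res$r) > 0.9, ]
```

With a larger matrix the same code applies unchanged; only the cutoff needs to be adapted to whatever counts as important.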
Here is another idea with reshaping to long format, i.e.
tidyr::pivot_longer(tibble::rownames_to_column(as.data.frame(cor(df[, c('a', 'b', 'c')])), var = 'rn'), -1)
# A tibble: 9 x 3
rn name value
<chr> <chr> <dbl>
1 a a 1
2 a b 0.928
3 a c 0.960
4 b a 0.928
5 b b 1
6 b c 0.894
7 c a 0.960
8 c b 0.894
9 c c 1
Maybe you can try as.table + as.data.frame
> as.data.frame(as.table(cor(df[, c("a", "b", "c")])))
Var1 Var2 Freq
1 a a 1.0000000
2 b a 0.9279869
3 c a 0.9604329
4 a b 0.9279869
5 b b 1.0000000
6 c b 0.8942139
7 a c 0.9604329
8 b c 0.8942139
9 c c 1.0000000
Related
Imagine I have a tidy dataset with 1 variable and 10 observations. The values of the variable are e.g. 3, 5, 7, 9, 13, 17, 29, 33, 34, 67. How do I recode it so that the 3 will be 1, the 5 will be 2 (...) and the 67 will be 10?
One possibility is to use rank; in a `dplyr` setting it could look like this:
library(dplyr)
tibble(x = c(3, 5, 7, 9, 13, 17, 29, 33, 34, 67)) %>%
mutate(y = rank(x))
Here is one way -
x <- c(3, 5, 7, 9, 13, 17, 29, 33, 67, 34)
x1 <- sort(x)
y <- match(x1, unique(x1))
y
#[1] 1 2 3 4 5 6 7 8 9 10
Changed the order of last 2 values so that it also works when the data is not in order.
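Note that matching the sorted vector against itself returns the codes in sorted order, not in the order of the original observations. If the codes should line up with the original positions, a possible variant is to match the original vector against its sorted unique values. This is a sketch under that assumption; `x2` is a made-up vector added only to show how ties behave compared with `rank`.

```r
x <- c(3, 5, 7, 9, 13, 17, 29, 33, 67, 34)

# Code each value by its position among the sorted unique values,
# keeping the original order of x
y <- match(x, sort(unique(x)))
y

# With ties, match() gives the same code to equal values,
# whereas rank() averages the tied ranks by default
x2 <- c(5, 3, 5)
dense   <- match(x2, sort(unique(x2)))
average <- rank(x2)
```

Here 67 is coded as 10 and 34 as 9, so the codes follow the values rather than their position in the sorted copy.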
Another way:
x <- c(3, 5, 7, 9, 13, 17, 29, 33, 67, 34)
x <- sort(x)
seq_along(x)
# 1 2 3 4 5 6 7 8 9 10
I have a CSV file that has the following structure (minimal example):
ID Variable Vector
1 a [0,0,0]
2 a [1,2,3]
1 a [1,1,2]
2 a [1,2,3]
1 b [0,0,0]
2 b [1,1,1]
1 b [0,0,1]
2 b [3,5,7]
I would like to calculate the mean vector for each combination of parameters (in this case, ID and Variable). That is, I want to obtain a dataframe like the following one:
ID Variable Vector
1 a [0.5,0.5,1]
2 a [1,2,3]
1 b [0,0,0.5]
2 b [2,3,4]
I have generated this csv file with Python, that's why I have that structure with brackets. But I do not know how to start to do this using R. It doesn't seem to be a common data structure.
Update:
Vector variable structure (obtained from dput(head(data, 8))):
Vector = c("[3, 16, 14, 5, 6, 13, 17, 7, 13, 6]",
"[7, 12, 6, 10, 6, 5, 16, 9, 19, 10]", "[4, 13, 4, 11, 6, 15, 17, 10, 12, 8]",
"[18, 11, 16, 8, 10, 10, 7, 4, 9, 7]", "[9, 9, 10, 17, 8, 13, 3, 13, 8, 10]",
"[17, 12, 7, 13, 6, 13, 8, 9, 5, 10]", "[9, 6, 14, 10, 8, 4, 8, 14, 15, 12]",
"[7, 13, 8, 10, 16, 8, 13, 13, 8, 4]")), row.names = c(NA, 8L
), class = "data.frame")
Assuming the 'Vector' column is a list, after grouping by 'ID', 'Variable', we reduce the 'Vector' by adding (+) the corresponding elements together and then divide by the total number of elements (n()) in that group
library(dplyr)
library(purrr)
out <- df1 %>%
group_by(ID, Variable) %>%
summarise(Vector = list(reduce(Vector, `+`)/n()), .groups = 'drop')
-output
out
# A tibble: 4 x 3
# ID Variable Vector
# <dbl> <chr> <list>
#1 1 a <dbl [3]>
#2 1 b <dbl [3]>
#3 2 a <dbl [3]>
#4 2 b <dbl [3]>
out$Vector
#[[1]]
#[1] 0.5 0.5 1.0
#[[2]]
#[1] 0.0 0.0 0.5
#[[3]]
#[1] 1 2 3
#[[4]]
#[1] 2 3 4
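If the column 'Vector' is a character string, an option is to extract the numeric part into a list. Note that the pattern "\\d+" used here matches only non-negative integers, which is what the sample data contains: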
library(stringr)
out <- df1 %>%
group_by(ID, Variable) %>%
summarise(Vector = list((str_extract_all(Vector, "\\d+") %>%
map(as.numeric) %>% reduce(`+`))/n()), .groups = 'drop')
data
df1 <- structure(list(ID = c(1, 2, 1, 2, 1, 2, 1, 2), Variable = c("a",
"a", "a", "a", "b", "b", "b", "b"), Vector = structure(list(c(0,
0, 0), c(1, 2, 3), c(1, 1, 2), c(1, 2, 3), c(0, 0, 0), c(1, 1,
1), c(0, 0, 1), c(3, 5, 7)), class = "AsIs")), class = "data.frame",
row.names = c(NA,
-8L))
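If the strings can contain decimals or negative numbers, stripping the brackets and splitting on commas is a possible alternative to the digit pattern. The following base R sketch rebuilds the small example from the question as strings; `parse_vec`, `mat` and `means` are hypothetical names, and the code assumes every vector in the file has the same length.

```r
df1 <- data.frame(
  ID       = c(1, 2, 1, 2, 1, 2, 1, 2),
  Variable = rep(c("a", "b"), each = 4),
  Vector   = c("[0,0,0]", "[1,2,3]", "[1,1,2]", "[1,2,3]",
               "[0,0,0]", "[1,1,1]", "[0,0,1]", "[3,5,7]"),
  stringsAsFactors = FALSE
)

# Turn "[1, 2, 3]" into c(1, 2, 3): drop the brackets, split on commas
parse_vec <- function(s) {
  parts <- strsplit(gsub("\\[|\\]", "", s), ",", fixed = TRUE)[[1]]
  as.numeric(trimws(parts))
}

# One row per observation, one column per vector element
mat <- do.call(rbind, lapply(df1$Vector, parse_vec))

# Element-wise mean for each ID/Variable combination (columns V1, V2, V3)
means <- aggregate(mat,
                   by = list(ID = df1$ID, Variable = df1$Variable),
                   FUN = mean)
means
```

The result keeps one numeric column per vector element instead of a list column, which may be easier to write back to CSV.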
Here is a simplified version of a problem that involves processing a large, complex table. Here is the input table:
library(tidyverse)
input <- tribble(
~group, ~score, ~label,
1, 10, 'A',
1, 20, 'B',
1, 30, 'C',
1, 40, 'D',
2, 11, 'A',
2, 21, 'B',
2, 31, 'C',
2, 41, 'D',
3, 12, 'A',
3, 22, 'B',
4, 13, 'A',
4, 23, 'B',
4, 33, 'C',
4, 43, 'D'
)
The table has 14 rows. The data are organized in numbered groups (1:4), and each group is supposed to have four scores labeled A, B, C, D.
The problem is group 3, which is missing the C and D rows.
I want R to do the following:
Find group 3 based on its lack of C and D rows.
Insert C and D rows for group 3, in proper alphabetical sequence.
Populate score in the new C and D rows with the value of score (22) from group 3 row B.
Another way of describing the transformation is that I want to insert two copies of row 3B, changing the label of those copied rows from B to C and D, respectively.
The desired output table has 16 rows and looks like this:
output <- tribble(
~group, ~score, ~label,
1, 10, 'A',
1, 20, 'B',
1, 30, 'C',
1, 40, 'D',
2, 11, 'A',
2, 21, 'B',
2, 31, 'C',
2, 41, 'D',
3, 12, 'A',
3, 22, 'B',
3, 22, 'C',
3, 22, 'D',
4, 13, 'A',
4, 23, 'B',
4, 33, 'C',
4, 43, 'D'
)
Thanks in advance for any help!
complete(input, group, label) %>%
fill(score)
# A tibble: 16 x 3
group label score
<dbl> <chr> <dbl>
1 1 A 10
2 1 B 20
3 1 C 30
4 1 D 40
5 2 A 11
6 2 B 21
7 2 C 31
8 2 D 41
9 3 A 12
10 3 B 22
11 3 C 22
12 3 D 22
13 4 A 13
14 4 B 23
15 4 C 33
16 4 D 43
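The ungrouped fill works for this input because every group still has its A row. If a group were missing its first label, an ungrouped fill would carry the score over from the previous group. A possible safeguard, sketched here under the assumption that values should only ever be copied within a group, is to group before filling. The input is rebuilt with data.frame() to keep the example self-contained.

```r
library(dplyr)
library(tidyr)

input <- data.frame(
  group = rep(c(1, 2, 3, 4), times = c(4, 4, 2, 4)),
  score = c(10, 20, 30, 40, 11, 21, 31, 41, 12, 22, 13, 23, 33, 43),
  label = c("A", "B", "C", "D", "A", "B", "C", "D",
            "A", "B", "A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

out <- input %>%
  complete(group, label) %>%         # add the missing group/label rows (score = NA)
  group_by(group) %>%                # confine the fill to each group
  fill(score, .direction = "down") %>%
  ungroup()
```

On the data above this gives the same 16 rows as the ungrouped version.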
I'd like to use dplyr to calculate differences in value between people nested in pair by session.
dat <- data.frame(person=c(rep(1, 10),
rep(2, 10),
rep(3, 10),
rep(4, 10),
rep(5, 10),
rep(6, 10),
rep(7, 10),
rep(8, 10)),
pair=c(rep(1, 20),
rep(2, 20),
rep(3, 20),
rep(4, 20)),
condition=c(rep("NEW", 10),
rep("OLD", 10),
rep("NEW", 10),
rep("OLD", 10),
rep("NEW", 10),
rep("OLD", 10),
rep("NEW", 10),
rep("OLD", 10)),
session=rep(seq(from=1, to=10, by=1), 8),
value=c(0, 2, 4, 8, 16, 16, 18, 20, 20, 20,
0, 1, 1, 2, 4, 5, 8, 12, 15, 15,
0, 2, 8, 10, 15, 16, 18, 20, 20, 20,
0, 4, 4, 6, 6, 8, 10, 12, 12, 18,
0, 6, 8, 10, 16, 16, 18, 20, 20, 20,
0, 2, 2, 3, 4, 8, 8, 8, 10, 12,
0, 10, 12, 16, 18, 18, 18, 20, 20, 20,
0, 2, 2, 8, 10, 10, 11, 12, 15, 20)
)
For instance, person 1 and 2 make a pair (pair==1):
person==1 & session==2: 2
person==2 & session==2: 1
Difference (NEW-OLD) is 2-1=1.
Here's what I have tried so far. I think I need to group_by() first and then summarise(), but I have not cracked this nut.
dat %>%
mutate(session = factor(session)) %>%
group_by(condition, pair, session) %>%
summarise(pairDiff = value-first(value))
Desired output:
Your output can be obtained by:
dat %>% group_by(pair,session) %>% arrange(condition) %>% summarise(diff = -diff(value))
Source: local data frame [40 x 3]
Groups: pair [?]
# A tibble: 40 x 3
pair session diff
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 1
3 1 3 3
4 1 4 6
5 1 5 12
6 1 6 11
7 1 7 10
8 1 8 8
9 1 9 5
10 1 10 5
# ... with 30 more rows
The arrange ensures that NEW and OLD are in the correct positions, but the solution does depend on there being exactly 2 values for each combination of pair and session.
You can spread condition to headers and then do the subtraction NEW - OLD:
library(dplyr); library(tidyr)
dat %>%
select(-person) %>%
spread(condition, value) %>%
mutate(diff = NEW - OLD) %>%
select(session, pair, diff)
# A tibble: 40 x 3
# session pair diff
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 2 1 1
# 3 3 1 3
# 4 4 1 6
# 5 5 1 12
# 6 6 1 11
# 7 7 1 10
# 8 8 1 8
# 9 9 1 5
#10 10 1 5
# ... with 30 more rows
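In current versions of tidyr, spread() is superseded by pivot_wider(). The same reshaping can be sketched as follows; the data frame is rebuilt in a condensed form with rep(), and `out` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(
  person    = rep(1:8, each = 10),
  pair      = rep(1:4, each = 20),
  condition = rep(rep(c("NEW", "OLD"), each = 10), times = 4),
  session   = rep(1:10, times = 8),
  value     = c(0, 2, 4, 8, 16, 16, 18, 20, 20, 20,
                0, 1, 1, 2, 4, 5, 8, 12, 15, 15,
                0, 2, 8, 10, 15, 16, 18, 20, 20, 20,
                0, 4, 4, 6, 6, 8, 10, 12, 12, 18,
                0, 6, 8, 10, 16, 16, 18, 20, 20, 20,
                0, 2, 2, 3, 4, 8, 8, 8, 10, 12,
                0, 10, 12, 16, 18, 18, 18, 20, 20, 20,
                0, 2, 2, 8, 10, 10, 11, 12, 15, 20)
)

out <- dat %>%
  select(-person) %>%                                  # person would otherwise split each pair into two rows
  pivot_wider(names_from = condition, values_from = value) %>%
  mutate(diff = NEW - OLD) %>%
  select(session, pair, diff)
```

Unlike the diff()-based approach, this does not depend on the row order of NEW and OLD within a pair.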
I have the following data and nested for loop:
x <- c(12, 27, 21, 16, 12, 21, 18, 16, 20, 23, 21, 10, 15, 26, 21, 22, 22, 19, 26, 26)
y <- c(8, 10, 7, 7, 9, 5, 7, 7, 10, 4, 10, 3, 9, 6, 4, 2, 4, 2, 3, 6)
a <- c(20,25)
a.sub <- c()
df <- c()
for(j in 1:length(a)){
a.sub <- which(x >= a[j])
for(i in 1:length(a.sub)){
df[i] <- y[a.sub[i]]
}
print(df)
}
I'd like the loop to return values for df as:
[1] 10 7 5 10 4 10 6 4 2 4 3 6
[1] 10 6 3 6
As I have it, however, the output for a = 25 still contains leftover values from a = 20:
[1] 10 7 5 10 4 10 6 4 2 4 3 6
[1] 10 6 3 6 4 10 6 4 2 4 3 6
for(i in 1:length(a.sub)){
df[i] <- y[a.sub[i]]
}
can become
df <- y[a.sub]
Neither a.sub nor df needs to be predefined then, and thus:
x <- c(12, 27, 21, 16, 12, 21, 18, 16, 20, 23, 21, 10, 15, 26, 21, 22, 22, 19, 26, 26)
y <- c(8, 10, 7, 7, 9, 5, 7, 7, 10, 4, 10, 3, 9, 6, 4, 2, 4, 2, 3, 6)
a <- c(20,25)
for(j in 1:length(a)){
a.sub <- which(x >= a[j])
df <- y[a.sub]
print(df)
}
It could be made shorter. df is unnecessary if you're just printing the subset of y anyway, so print it directly; the selector is short enough that putting it on a single line isn't confusing. There is also no need to loop over an index into a when you can loop through a directly. So, it could be:
a <- c(20,25)
for(ax in a){
print( y[ which(x >= ax) ] )
}
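The original loop printed leftover values because df is assigned element by element and never reset: the pass for 25 overwrites only the first four elements, and elements 5 to 12 from the pass for 20 remain. The sketch below first shows the loop with an explicit reset and then a variant that stores the results in a named list instead of printing them; `res` is a hypothetical name, and seq_along() is used in place of 1:length() so that an empty selection does not produce the index sequence 1, 0.

```r
x <- c(12, 27, 21, 16, 12, 21, 18, 16, 20, 23, 21, 10, 15, 26, 21, 22, 22, 19, 26, 26)
y <- c(8, 10, 7, 7, 9, 5, 7, 7, 10, 4, 10, 3, 9, 6, 4, 2, 4, 2, 3, 6)
a <- c(20, 25)

# Original structure, with df reset at the start of every pass
for (j in seq_along(a)) {
  a.sub <- which(x >= a[j])
  df <- numeric(0)                 # without this, old elements of df survive
  for (i in seq_along(a.sub)) {
    df[i] <- y[a.sub[i]]
  }
  print(df)
}

# Keeping the results: one list element per threshold, named "20" and "25"
res <- lapply(setNames(a, a), function(ax) y[x >= ax])
```

The list form is convenient when the subsets are needed later rather than only printed.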
Not sure if this is a simplified version of a more complex problem, but I'd probably solve this using some direct indexing and an apply function. Something like this:
z <- cbind(x,y)
sapply(c(20,25), function(x) z[z[, 1] >= x, 2])
[[1]]
[1] 10 7 5 10 4 10 6 4 2 4 3 6
[[2]]
[1] 10 6 3 6