Computing correlation of vectors by factor label in R

I have two data frames. The first one, df1, is a data frame of numeric vectors with labeled columns, like the following:
df1 <- data.frame(A=rnorm(10), B=rnorm(10), C=rnorm(10), D=rnorm(10), E=rnorm(10))
> df1
A B C D E
-0.3200306 0.4370963 -0.9146660 1.03219577 0.5215359
-0.3193144 0.8900656 -1.1720264 -0.42591761 0.1936993
0.4897262 -1.3970806 0.6054637 0.12487936 1.0149530
0.3772420 0.8726322 0.3250020 -0.36952560 -0.5447512
-0.6921561 -0.6734468 0.3500812 -0.53373720 -0.6129472
0.2540649 -1.1911106 -0.3266428 0.14013437 1.0830148
0.6606825 -0.8942715 1.1099637 -1.52416540 -0.2383048
1.4767074 -2.1492360 0.2441242 -0.36136344 0.5589114
-0.5338117 -0.2357821 0.7694879 -0.21652356 0.3185631
3.4215916 -0.3157938 0.8895597 0.09946069 -1.0961730
The second data frame, df2, contains items that match the colnames of df1. Example:
group <- c("1", "1", "2", "2", "3", "3")
S1 <- c("A", "D", "E", "C", "B", "D")
S2 <- c("D", "B", "A", "C", "B", "A")
S3 <- c("B", "C", "A", "E", "E", "A")
df2 <- data.frame(group,S1, S2, S3)
> df2
group S1 S2 S3
1 A D B
1 D B C
2 E A A
2 C C E
3 B B E
3 D A A
I would like to compute the correlations between the column vectors in df1 whose names appear in df2. Specifically, for each row of df2 I want the correlation of the df1 columns named in S1 and S2, and of the df1 columns named in S1 and S3.
The output should be something like this:
group S1 S2 S3 cor.S1.S2 cor.S1.S3
1 A D B 0.003825055 -0.2817946
1 D B C -0.2817946 -0.4928023
2 E A A -0.3856809 -0.3856809
2 C C E 1 -0.3862433
3 B B E 1 -0.3888541
3 D A A 0.003825055 0.003825055
I've been trying to solve this with cbind() but keep running into problems such as the "'x' must be numeric" error from cor(). Thanks in advance for any help!

You can do this with mapply().
my.cor <- function(x, y) {
  cor(df1[, x], df1[, y])
}
df2$cor.S1.S2 <- mapply(my.cor, df2$S1, df2$S2)
df2$cor.S1.S3 <- mapply(my.cor, df2$S1, df2$S3)
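One caveat, assuming an older R version: before R 4.0, data.frame() turns S1, S2 and S3 into factors by default, and indexing df1 with a factor selects columns by the factor's integer codes rather than by name. Converting the label columns to character first avoids that:
# Only needed if df2$S1 etc. are factors (pre-R 4.0 default); indexing by a
# factor would pick df1 columns by level code instead of by name.
df2[c("S1", "S2", "S3")] <- lapply(df2[c("S1", "S2", "S3")], as.character)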

Another approach would be to get the correlation matrix after subsetting the columns of 'df1' with the label columns of 'df2', take its diag(), and assign the output as new columns in 'df2'. Here, I am using lapply() as we have to do both 'S1 vs S2' and 'S1 vs S3'.
df2[c('cor.S1.S2', 'cor.S1.S3')] <- lapply(c('S2', 'S3'), function(x)
  diag(cor(df1[, df2[, x]], df1[, df2$S1])))
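As a quick sanity check (a sketch against the example data above), the first row of df2 pairs the df1 columns A and D, so its cor.S1.S2 value should equal cor(df1$A, df1$D):
# should print TRUE (group 1: S1 = "A", S2 = "D")
all.equal(df2$cor.S1.S2[1], cor(df1$A, df1$D))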

Related

Compare each row in two dataframes in R

I have 2 data frames with account numbers and amounts, plus some other irrelevant columns. I would like to flag each row with a Y or N depending on whether the account numbers match.
I need to compare the account number in row 1 of data frame A to the account number in row 1 of data frame B, and put a Y in a column if they match or an N if they don't. I've managed to get code that checks whether there is a match anywhere in the entire data frame, but I need to check each row individually.
E.g.
df1
|account.num|x1|x2|x3|
|100|a|b|c|
|101|a|b|c|
|102|a|b|c|
|103|a|b|c|
df2
|account.num|x1|x2|x3|
|100|a|b|c|
|102|a|b|c|
|101|a|b|c|
|103|a|b|c|
output
|account.num|x1|x2|x3|match|
|100|a|b|c|Y|
|101|a|b|c|N|
|102|a|b|c|N|
|103|a|b|c|Y|
So row 1 matches because the account numbers are the same, but row 2 doesn't because they differ. The other data in the data frames doesn't matter, just that column. Can I do this without merging the data frames? (I did have tables, but they wouldn't render and I don't know why, so sorry if that's hard to follow.)
You can use == to compare the account.num columns element-wise, and use the resulting logical vector to index into c("N", "Y"):
df1$match <- c("N", "Y")[1 + (df1[[1]] == df2[[1]])]
df1
# account.num x1 x2 x3 match
#1 100 a b c Y
#2 101 a b c N
#3 102 a b c N
#4 103 a b c Y
Data:
df1 <- data.frame(account.num=100:103, x1="a", x2="b", x3="c")
df2 <- data.frame(account.num=c(100,102,101,103), x1="a", x2="b", x3="c")
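An equivalent, arguably more readable spelling of the same row-wise check uses ifelse() instead of the indexing trick (same df1/df2 as in the Data block above):
# Y where the account numbers agree position by position, N otherwise
df1$match <- ifelse(df1$account.num == df2$account.num, "Y", "N")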
If you want a base R solution, here is a quick sketch. Assuming both data frames have the same number of rows, it should work with your data.
# example data frames
a <- data.frame(A = c(1, 2, 3), B = c("one", "two", "three"))
b <- data.frame(A = c(3, 2, 1), B = c("three", "two", "one"))

res <- c() # initialise empty result vector
for (rownum in 1:nrow(a)) {
  # iterate over all row numbers
  res[rownum] <- all(a[rownum, ] == b[rownum, ])
}
res # result vector
# [1] FALSE  TRUE FALSE

# you can put it in frame a like this; example column name is "equalB"
a$equalB <- res
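If the loop feels heavy, the same all-columns comparison can be vectorised; a sketch assuming the a/b frames above:
# TRUE where every column of the row agrees between a and b
res <- rowSums(a == b) == ncol(a)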
If you want a tidyverse solution, you can use left_join().
The principle is to try to match the rows of df2 to the rows of df1. Where a row matches, the join adds TRUE to a match column; the code then replaces the resulting NA values with FALSE.
I'm also adding code to create the data frames from the example.
library(tidyverse)

df1 <-
  tribble(~account_num, ~x1, ~x2, ~x3,
          100, "a", "b", "c",
          101, "a", "b", "c",
          102, "a", "b", "c",
          103, "a", "b", "c") %>%
  rowid_to_column() # because position in the df is an important piece of information,
                    # I need to hardcode it in the df

df2 <-
  tribble(~account_num, ~x1, ~x2, ~x3,
          100, "a", "b", "c",
          102, "a", "b", "c",
          101, "a", "b", "c",
          103, "a", "b", "c") %>%
  rowid_to_column()

# take df1
df1 %>%
  # try to match df1 with a version of df2 that carries a new column `match` = TRUE,
  # joining on `rowid`, `account_num`, `x1`, `x2`, and `x3`
  left_join(df2 %>%
              tibble::add_column(match = TRUE),
            by = c("rowid", "account_num", "x1", "x2", "x3")) %>%
  # replace the NA in `match` with FALSE
  replace_na(list(match = FALSE))
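Since the question only cares about account_num position by position, a shorter tidyverse sketch (assuming equal row counts and the df1/df2 defined above) would be:
# Row-by-row comparison without a join; relies on df1 and df2 having the same number of rows.
df1 %>% mutate(match = account_num == df2$account_num)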

R - new variables from two subsets in data frame, random order in rows

I have a data frame containing two sets of variables: first, 30 columns containing 30 stimulus IDs, in a random order for each row; then, the 30 response values corresponding to those stimuli. The first column of each block forms a stimulus-response pair, the second column of each block forms the second stimulus-response pair, and so on, but the stimulus ID itself varies from row to row.
I want to create new variables for each stimulus ID with the corresponding response.
I believe what I have is similar to the end-result of this question: Shuffle a data frame while maintaining order with another data frame
Example:
set.seed(3)
d <- data.frame(a = c("L", "G", "E", "E"),
                b = c("G", "E", "L", "G"),
                c = c("E", "L", "G", "L"),
                e = rnorm(4), f = rnorm(4), g = rnorm(4))
d
# a b c e f g
# 1 L G E -1.1312186 -0.3076564 0.1998116
# 2 G E L -0.7163585 -0.9530173 -0.5784837
# 3 E L G 0.2526524 -0.6482428 -0.9423007
# 4 E G L 0.1520457 1.2243136 -0.2037282
Output I want:
d$L <- c(d[1, 4], d[2, 6], d[3, 5], d[4, 6])
d$E <- c(d[1, 6], d[2, 5], d[3, 4], d[4, 4])
d$G <- c(d[1, 5], d[2, 4], d[3, 6], d[4, 5])
d
#   a b c          e          f          g          L          E          G
# 1 L G E -1.1312186 -0.3076564  0.1998116 -1.1312186  0.1998116 -0.3076564
# 2 G E L -0.7163585 -0.9530173 -0.5784837 -0.5784837 -0.9530173 -0.7163585
# 3 E L G  0.2526524 -0.6482428 -0.9423007 -0.6482428  0.2526524 -0.9423007
# 4 E G L  0.1520457  1.2243136 -0.2037282 -0.2037282  0.1520457  1.2243136
I have two problems:
1. populating the new stimulus variable
2. repeating this for each stimulus
For 1., I tried nested ifelse statements:
d$L <- ifelse(d$a == "L", d$e,
       ifelse(d$b == "L", d$f,
       ifelse(d$c == "L", d$g, NA)))
but the last ifelse overrides the first two. I tried dplyr::mutate but can't figure out how to express this in a single ifelse statement, and with case_when I got stuck on how to reference the correct response column from the second set instead of defaulting to the first response column.
For 2., I think I am supposed to use mapply with the two subsets split into two separate matrices, but as far as I know I would then need a function-based solution to my first problem.
One option is to create a row/column index matrix to extract the values from columns 4:6 and assign them as new columns in the dataset:
un1 <- unique(unlist(d[1:3]))
d[un1] <- lapply(un1, function(x)
  d[4:6][cbind(seq_len(nrow(d)), max.col(d[1:3] == x, "first"))])
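To see what drives this, here is the intermediate step for a single stimulus (using the example d from the question): max.col() finds, for each row, which of the label columns a/b/c holds that stimulus, and the two-column matrix passed to [ then pulls the matching response from e/f/g.
d[1:3] == "L"                    # logical matrix: where does "L" sit in each row?
max.col(d[1:3] == "L", "first")  # 1 3 2 3 -> columns a, c, b, c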
data
d <- structure(list(a = c("L", "G", "E", "E"), b = c("G", "E", "L",
"G"), c = c("E", "L", "G", "L"), e = c(-1.1312186, -0.7163585,
0.2526524, 0.1520457), f = c(-0.3076564, -0.9530173, -0.6482428,
1.2243136), g = c(0.1998116, -0.5784837, -0.9423007, -0.2037282
)), class = "data.frame", row.names = c("1", "2", "3", "4"))

Pasting several values from a vector into a dataframe column

My dataframe "test" is like this:
a b c
d e f
I want to add strings to the 1st col so as to get this
a__3 b c
a__23 b c
a__45 b c
...
sb <- c(3, 23, 45)
datalist <- ""
for (i in 1:length(sb)) {
new <- apply(test[,1],1,paste0,collapse=("__" sb[i]))
datalist[i] <- new
}
I want to add rows to the test data frame covering all values of sb.
I have tried rbind, but it does not give the correct result.
An idea is to replicate the rows based on the length of your sb vector, do the paste, and then filter to keep only the rows you are interested in, i.e.
d3 <- d2[rep(rownames(d2), length(sb)),]
d3$V1[d3$V1 == 'a'] <- paste0(d3$V1[d3$V1 == 'a'], '__', sb)
d3[grepl('a', d3$V1),]
# V1 V2 V3
#1 a__3 b c
#1.1 a__23 b c
#1.2 a__45 b c
DATA
dput(d2)
structure(list(V1 = c("a", "d"), V2 = c("b", "e"), V3 = c("c",
"f")), row.names = c(NA, -2L), class = "data.frame")

Turn ordered pairs into unordered pairs in a data frame with dplyr

I have a data frame that looks like this:
library(dplyr)
df <- data_frame(doc.x = c("a", "b", "c", "d"),
                 doc.y = c("b", "a", "d", "c"))
So that df is:
Source: local data frame [4 x 2]
doc.x doc.y
(chr) (chr)
1 a b
2 b a
3 c d
4 d c
This is a list of ordered pairs, a to d but also d to a, and so on. What is a dplyr-like way to return only a list of unordered pairs in this data frame? I.e.
doc.x doc.y
(chr) (chr)
1 a b
2 c d
Use pmin and pmax to sort the pairs alphabetically, i.e. turn (b,a) into (a,b) and then filter away all the duplicates.
df %>%
  mutate(dx = pmin(doc.x, doc.y), dy = pmax(doc.x, doc.y)) %>%
  distinct(dx, dy, .keep_all = TRUE) %>% # .keep_all keeps doc.x/doc.y (needed since dplyr 0.5)
  select(-dx, -dy)
doc.x doc.y
(chr) (chr)
1 a b
2 c d
Alternate way using data.table:
df <- data.frame(doc.x = c("a", "b", "c", "d"),
doc.y = c("b", "a", "d", "c"), stringsAsFactors = F)
library(data.table)
setDT(df)
df[, row := 1:nrow(df)]
df <- df[, list(Left = max(doc.x,doc.y),Right = min(doc.x,doc.y)), by = row]
df <- df[, list(Left,Right)]
unique(df)
Left Right
1: b a
2: d c
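For what it's worth, the same idea fits in a single data.table line (a sketch, assuming df still holds the original doc.x/doc.y columns with setDT() applied) by using the vectorised pmin()/pmax() instead of grouping row by row:
# Sort each pair alphabetically, then drop duplicated pairs.
unique(df[, .(Left = pmax(doc.x, doc.y), Right = pmin(doc.x, doc.y))])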
Using dplyr
# make character columns into factors
df <- as.data.frame(unclass(df))
df$x.lvl <- levels(df$doc.x)
df$y.lvl <- levels(df$doc.y)
# find unique pairs
res <- df %>%
  group_by(doc.x) %>%
  transform(x.lvl = order(doc.x),
            y.lvl = order(doc.y)) %>%
  transform(pair = ifelse(x.lvl < y.lvl,
                          paste(doc.x, doc.y, sep = ","),
                          paste(doc.y, doc.x, sep = ","))) %>%
  .$pair %>%
  unique
Unique pairs
res
[1] a,b c,d
Levels: a,b c,d
Edit
Inspired by Backlin's solution, in base R
unique(with(df, paste(pmin(doc.x, doc.y), pmax(doc.x, doc.y), sep=",")))
[1] "a,b" "c,d"
Or to store in a data.frame
unique(with(df, data.frame(lvl1=pmin(doc.x, doc.y), lvl2=pmax(doc.x, doc.y))))
lvl1 lvl2
1 a b
3 c d

Create an Index of a combination of data.frame columns in R

This question is kind of related to this one; however, I want to create an index using a unique combination of two data.frame columns.
So my data structure looks for example like this (dput):
structure(list(avg = c(0.246985988921473, 0.481522354272779,
0.575400762275067, 0.14651009243539, 0.489308880181752, 0.523678968337178
), i_ID = c("H", "H", "C", "C", "H", "S"), j_ID = c("P", "P",
"P", "P", "P", "P")), .Names = c("avg", "i_ID", "j_ID"), row.names = 7:12, class = "data.frame")
The created Index for the above structure should therefore look like this
1
1
2
2
1
3
In the example data the column j_ID always has the value P, but this isn't always the case. Furthermore, reversed combinations (S-P or P-S) should result in the same index.
Does someone know a nice way to accomplish that? I can do it with a lot of for-loops and if-else commands, but that's not really elegant.
The interaction function will work well:
foo = structure(list(avg = c(0.246985988921473, 0.481522354272779, 0.575400762275067, 0.14651009243539, 0.489308880181752, 0.523678968337178), i_ID = c("H", "H", "C", "C", "H", "S"), j_ID = c("P", "P", "P", "P", "P", "P")), .Names = c("avg", "i_ID", "j_ID"), row.names = 7:12, class = "data.frame")
foo$idx <- as.integer(interaction(foo$i_ID, foo$j_ID))
> foo
avg i_ID j_ID idx
7 0.2469860 H P 2
8 0.4815224 H P 2
9 0.5754008 C P 1
10 0.1465101 C P 1
11 0.4893089 H P 2
12 0.5236790 S P 3
Ah, I didn't read carefully enough. There is probably a more elegant solution, but you can use the outer() function and the upper and lower triangles of the resulting matrix:
# let's assign some test values
x <- c('a', 'b', 'c')
foo$idx <- c('a b', 'b a', 'b c', 'c b', 'a a', 'b a')
mat <- outer(x, x, FUN = 'paste')           # gives all possible ordered combinations
uppr_ok <- mat[upper.tri(mat, diag = TRUE)] # the canonical (unordered) pairs
mat_ok <- mat
# mirror the upper triangle into the lower one, so that e.g. "b a" maps to "a b"
mat_ok[lower.tri(mat)] <- t(mat)[lower.tri(mat)]
Then you can match each pair in foo$idx against mat and replace it with the corresponding canonical value from mat_ok:
foo$idx <- mat_ok[match(foo$idx, mat)]
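A simpler sketch for the same symmetric-pair requirement sorts each pair with pmin()/pmax() and feeds the result to interaction() directly (using the foo from above):
# Reversed pairs such as ("S","P") and ("P","S") now share one level, hence one index.
foo$idx <- as.integer(interaction(pmin(foo$i_ID, foo$j_ID),
                                  pmax(foo$i_ID, foo$j_ID),
                                  drop = TRUE))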
To add to Justin's answer: if you would like the indexes to preserve the original order of i_ID, you can assign the interaction() result to a variable and then reorder its levels.
x <- interaction(foo$i_ID, foo$j_ID)
x <- factor(x, levels=levels(x)[order(unique(foo$i_ID))])
foo$idx <- as.integer(x)
which gives:
> foo
avg i_ID j_ID idx
7 0.2469860 H P 1
8 0.4815224 H P 1
9 0.5754008 C P 2
10 0.1465101 C P 2
11 0.4893089 H P 1
12 0.5236790 S P 3
