Create an index of a combination of data.frame columns in R

This question is somewhat related to this one; however, I want to create an index from the unique combinations of two data.frame columns.
My data structure looks, for example, like this (dput):
structure(list(avg = c(0.246985988921473, 0.481522354272779,
0.575400762275067, 0.14651009243539, 0.489308880181752, 0.523678968337178
), i_ID = c("H", "H", "C", "C", "H", "S"), j_ID = c("P", "P",
"P", "P", "P", "P")), .Names = c("avg", "i_ID", "j_ID"), row.names = 7:12, class = "data.frame")
The created Index for the above structure should therefore look like this
1
1
2
2
1
3
In the example data the column j_ID always has the value P, but this isn't always the case. Furthermore, reversed combinations (S-P and P-S) should result in the same index.
Does anyone know a nice way to accomplish that? I can do it with a lot of for-loops and if-else commands, but that's not really elegant.

The interaction function will work well:
foo = structure(list(avg = c(0.246985988921473, 0.481522354272779, 0.575400762275067, 0.14651009243539, 0.489308880181752, 0.523678968337178), i_ID = c("H", "H", "C", "C", "H", "S"), j_ID = c("P", "P", "P", "P", "P", "P")), .Names = c("avg", "i_ID", "j_ID"), row.names = 7:12, class = "data.frame")
foo$idx <- as.integer(interaction(foo$i_ID, foo$j_ID))
> foo
avg i_ID j_ID idx
7 0.2469860 H P 2
8 0.4815224 H P 2
9 0.5754008 C P 1
10 0.1465101 C P 1
11 0.4893089 H P 2
12 0.5236790 S P 3
Ah, I didn't read carefully enough: interaction() treats S-P and P-S as different combinations. There is probably a more elegant solution, but you can use the outer function together with the upper and lower triangles of its result:
# lets assign some test values
x <- c('a', 'b', 'c')
foo$idx <- c('a b', 'b a', 'b c', 'c b', 'a a', 'b a')
mat <- outer(x, x, FUN = 'paste') # gives all possible combinations
uppr_ok <- mat[upper.tri(mat, diag=TRUE)]
mat_ok <- mat
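# mirror the upper triangle into the lower one; t() keeps each (i, j) aligned with (j, i) for any size of x
mat_ok[lower.tri(mat_ok)] <- t(mat)[lower.tri(mat)]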
Then you can match indexes found in mat with those found in mat_ok:
foo$idx <- mat_ok[match(foo$idx, mat)]

To add to Justin's answer: if you would like the indexes to follow the order in which the i_ID values first appear, you can assign the interaction() result to a variable and then reorder its levels.
x <- interaction(foo$i_ID, foo$j_ID)
x <- factor(x, levels=levels(x)[order(unique(foo$i_ID))])
foo$idx <- as.integer(x)
which gives:
> foo
avg i_ID j_ID idx
7 0.2469860 H P 1
8 0.4815224 H P 1
9 0.5754008 C P 2
10 0.1465101 C P 2
11 0.4893089 H P 1
12 0.5236790 S P 3
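If the index also has to treat reversed pairs (S-P and P-S) as the same combination and number the groups in order of first appearance, as the question asks, one possible sketch is to sort each pair with pmin()/pmax() before matching. The "_" separator and the extra P-S row are choices made here for illustration, and the code assumes the IDs are character values that do not themselves contain "_":

```r
# Example IDs from the question, plus one reversed pair (P-S) to show the symmetry
foo <- data.frame(
  i_ID = c("H", "H", "C", "C", "H", "S", "P"),
  j_ID = c("P", "P", "P", "P", "P", "P", "S"),
  stringsAsFactors = FALSE
)

# Order each pair alphabetically so that (S, P) and (P, S) give the same key
key <- paste(pmin(foo$i_ID, foo$j_ID), pmax(foo$i_ID, foo$j_ID), sep = "_")

# Number the keys in order of first appearance
foo$idx <- match(key, unique(key))
foo$idx
# 1 1 2 2 1 3 3
```

Because match() looks each key up in unique(key), the first combination seen gets 1, the next new one gets 2, and so on, without any reordering of levels.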

Related

Non-normal sampling for card shuffling

I have a script (shown below) that uses sample() to shuffle a pack of cards into four hands.
This script produces hand splits that follow the expected probabilities. That is, the distribution of the 13 cards in a suit across the four hands will be "4432" 21.6% of the time, "5442" 15.5% of the time, and so on. See here: https://en.wikipedia.org/wiki/Contract_bridge_probabilities#Hand%20pattern%20probabilities.
Is there a way to make sample() give different distributions, e.g. to flatten the profile, so that the distributions across the hands deviate from the expected probabilities? sample() can take a weighting parameter, but I can't see how to use this to achieve my goal. Alternatively, is there another sampling function that could provide this?
Thank you,
TC
# Set up
library(tidyverse)
set.seed(123)
# Build pack
pack <- expand.grid(rank = c("A", 2:9, "T", "J", "Q", "K"), suit = c("S", "H", "D", "C")) %>%
as_tibble(.name_repair = "minimal") %>%
mutate(card = paste(suit, rank, sep = "-"))
# Divide cards into hands
for (i in 1:4) {
temp <- sample(pack$card, 13, replace = FALSE) %>%
as_tibble(.name_repair = "minimal") %>%
separate(value, sep = "-", into = c("suit", "rank")) %>%
mutate(
suit = factor(suit, levels = c("S", "H", "D", "C")),
rank = factor(rank, levels = c("A", "K", "Q", "J", "T", 9:2, " "))
) %>%
arrange(suit, rank) %>%
unite("card", sep = "-")
assign(glue::glue("hand{i}"), temp)
pack <- pack %>%
filter(!card %in% unname(unlist(temp)))
}
# Reassemble pack
pack <- hand1 %>%
cbind(hand2) %>%
cbind(hand3) %>%
cbind(hand4) %>%
rename(N = 1, E = 2, S = 3, W = 4)
You can simplify the process of creating the deck, shuffling, and dealing as follows:
# Create the deck
Suit <- c("S", "H", "D", "C")
Rank <- c("A", 2:9, "T", "J", "Q", "K")
Deck <- data.frame(Rank=rep(Rank, 4), Suit=rep(Suit, each=13))
# Shuffle and deal
Shuffle <- Deck[sample(nrow(Deck)), ] # Shuffle the deck
Hand <- factor(rep(c("N", "E", "S", "W"), 13), levels=c("N", "E", "S", "W"))
Deal <- data.frame(Hand, Shuffle)
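Now Deal is a data frame showing the cards in each hand. To get the distribution of cards by suit:
xtabs(~Suit+Hand, Deal) # counts vary from shuffle to shuffle
The perfectly even table below comes from the same call on the Even data frame built at the end of this answer, xtabs(~Suit+Hand, Even):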
# Hand
# Suit N E S W
# C 3 3 3 4
# D 3 3 4 3
# H 3 4 3 3
# S 4 3 3 3
To get the hands:
split(Deal, Deal$Hand)
# $N
# Hand Rank Suit
# 12 N Q S
# 2 N 2 S
# 37 N J D
# 49 N T C
# 43 N 4 C
# 17 N 4 H
# 33 N 7 D
# 26 N K H
# 13 N K S
# 44 N 5 C
# 18 N 5 H
# 46 N 7 C
# 11 N J S
. . . . .
Since you mention Hand Pattern Probabilities, it may be helpful to show how you can generate them empirically:
HPP <- function() {
Shuffle <- Deck[sample(nrow(Deck)), ] # Shuffle the deck
Hand <- factor(rep(c("N", "E", "S", "W"), 13), levels=c("N", "E", "S", "W"))
Deal <- data.frame(Hand, Shuffle)
tbl <- xtabs(~Suit+Hand, Deal)
return(unname(apply(tbl, 1, function(x) paste(sort(x, decreasing=TRUE), collapse="-"))))
}
X <- replicate(1000, HPP())
HP <- prop.table(sort(table(X), decreasing=TRUE))
as.matrix(HP)
# [,1]
# 4-4-3-2 0.22200
# 5-3-3-2 0.14925
# 5-4-3-1 0.12075
# 4-3-3-3 0.11225
# 5-4-2-2 0.10775
# 6-3-2-2 0.05175
# 6-4-2-1 0.04850
# . . . .
To get more "even" distributions of cards by suit, you have to break the random distribution of the shuffle. For example, we can shuffle the rank values within each suit but preserve the order of the suits in the deck:
Even <- Deck
Even$Rank <- c(replicate(4, sample(Rank)))
Even <- data.frame(Hand, Even)
Now the card values will differ, but the distribution by suit will always be the same.
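As a quick check of that claim, the sketch below wraps the suit-preserving deal in a function (deal_even is a name made up here, not part of the answer) and tabulates several deals. Because only the ranks are shuffled within each suit, every suit should split 4-3-3-3 each time:

```r
Suit <- c("S", "H", "D", "C")
Rank <- c("A", 2:9, "T", "J", "Q", "K")

# Hypothetical helper: shuffle ranks within each suit, keep the suit order fixed
deal_even <- function() {
  Hand <- factor(rep(c("N", "E", "S", "W"), 13), levels = c("N", "E", "S", "W"))
  data.frame(Hand,
             Rank = c(replicate(4, sample(Rank))),
             Suit = rep(Suit, each = 13))
}

# Suit pattern (sorted counts across hands) for each suit in 20 deals
patterns <- replicate(20, {
  tbl <- xtabs(~ Suit + Hand, deal_even())
  apply(tbl, 1, function(x) paste(sort(x, decreasing = TRUE), collapse = "-"))
})
unique(c(patterns))
# "4-3-3-3"
```

This is the extreme case of flattening: the suit split is fixed completely, so only the ranks carry any randomness.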

R - new variables from two subsets in data frame, random order in rows

I have a data frame containing two sets of variables: first, 30 columns containing 30 stimulus IDs, in a random order for each row; then, 30 columns containing the response to each stimulus. The first column of each block forms a stimulus-response pair, the second column of each block forms the second stimulus-response pair, and so on, but the stimulus ID in a given position varies from row to row.
I want to create new variables for each stimulus ID with the corresponding response.
I believe what I have is similar to the end-result of this question: Shuffle a data frame while maintaining order with another data frame
Example:
set.seed(3)
d <- data.frame( a = c("L", "G", "E", "E"),
b = c("G", "E", "L", "G"),
c = c("E", "L", "G", "L"),
e = rnorm(4), f = rnorm(4), g = rnorm(4))
d
# a b c e f g
# 1 L G E -1.1312186 -0.3076564 0.1998116
# 2 G E L -0.7163585 -0.9530173 -0.5784837
# 3 E L G 0.2526524 -0.6482428 -0.9423007
# 4 E G L 0.1520457 1.2243136 -0.2037282
Output I want:
d$L <- c(d[1, 4], d[2, 6], d[3, 5], d[4, 6])
d$E <- c(d[1, 6], d[2, 5], d[3, 4], d[4, 4])
d$G <- c(d[1, 5], d[2, 4], d[3, 6], d[4, 5])
d
# a b c e f g L E G
# 1 L G E -1.1312186 -0.3076564 0.1998116 -1.1312186 0.1998116 -0.3076564
# 2 G E L -0.7163585 -0.9530173 -0.5784837 -0.5784837 -0.9530173 -0.7163585
# 3 E L G 0.2526524 -0.6482428 -0.9423007 -0.6482428 0.2526524 -0.9423007
# 4 E G L 0.1520457 1.2243136 -0.2037282 -0.2037282 0.1520457 1.2243136
I have two problems:
1. populating the new stimulus variable
2. repeating this for each stimulus
For 1., I tried nested ifelse statements:
d$L <- ifelse(d$a == "L", d$e,
ifelse(d$b=="L", d$f,
ifelse(d$c=="L", d$g, NA)))
but the last ifelse overrides the first two. I tried a dplyr::mutate but can't figure out how to have one single ifelse statement, and with case_when got stuck on how to reference the correct column in the second set containing the response, and not defaulting to the first response column.
For 2.: I think I am supposed to use mapply with the two subsets divided in two separate matrices, but as far as I know, I then need a function-based solution for my first problem.
One option is to create a row/column index to extract the values from columns 4:6 and assign it to three new columns in the dataset
un1 <- unique(unlist(d[1:3]))
d[un1] <- lapply(un1, function(x)
d[4:6][cbind(seq_len(nrow(d)), max.col(d[1:3] == x, "first"))])
data
d <- structure(list(a = c("L", "G", "E", "E"), b = c("G", "E", "L",
"G"), c = c("E", "L", "G", "L"), e = c(-1.1312186, -0.7163585,
0.2526524, 0.1520457), f = c(-0.3076564, -0.9530173, -0.6482428,
1.2243136), g = c(0.1998116, -0.5784837, -0.9423007, -0.2037282
)), class = "data.frame", row.names = c("1", "2", "3", "4"))
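The key step in the answer is indexing a matrix with a two-column (row, column) matrix built by cbind(). A minimal sketch of that idea for a single stimulus, using made-up response values (stim and resp are illustrative names) and assuming each stimulus ID occurs exactly once per row:

```r
# Stimulus IDs and made-up responses; column k of stim pairs with column k of resp
stim <- matrix(c("L", "G", "E",
                 "G", "E", "L"), nrow = 2, byrow = TRUE)
resp <- matrix(c(10, 20, 30,
                 40, 50, 60), nrow = 2, byrow = TRUE)

# Column in which "L" occurs, row by row
col_L <- max.col(stim == "L", ties.method = "first")

# One (row, column) pair per row picks the matching response
resp[cbind(seq_len(nrow(resp)), col_L)]
# 10 60
```

The lapply() call in the answer repeats this lookup once per unique stimulus ID and assigns each result to a new column named after that ID.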

Pasting several values from a vector into a dataframe column

My dataframe "test" is like this:
a b c
d e f
I want to add strings to the first column so as to get this:
a__3 b c
a__23 b c
a__45 b c
...
sb <- c(3, 23, 45)
datalist <- ""
for (i in 1:length(sb)) {
new <- apply(test[,1],1,paste0,collapse=("__" sb[i]))
datalist[i] <- new
}
I want to add rows to the test data frame, one for each sb[i].
I have tried rbind, but it does not give the correct result.
An idea is to replicate the rows based on the length of your sb vector, do the paste and filter to keep only the ones you are interested in, i.e.
d3 <- d2[rep(rownames(d2), length(sb)),]
d3$V1[d3$V1 == 'a'] <- paste0(d3$V1[d3$V1 == 'a'], '__', sb)
d3[grepl('a', d3$V1),]
# V1 V2 V3
#1 a__3 b c
#1.1 a__23 b c
#1.2 a__45 b c
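If only one row needs expanding, a shorter variant is to repeat just that row and paste the suffixes on. This is a sketch that assumes the row to expand is row 1 and rebuilds the example data inline; out is an illustrative name:

```r
d2 <- data.frame(V1 = c("a", "d"), V2 = c("b", "e"), V3 = c("c", "f"),
                 stringsAsFactors = FALSE)
sb <- c(3, 23, 45)

# Repeat row 1 once per suffix, then append "__<suffix>" to the first column
out <- d2[rep(1, length(sb)), ]
out$V1 <- paste0(out$V1, "__", sb)
rownames(out) <- NULL
out
```

Resetting the row names avoids the 1, 1.1, 1.2 labels that appear when rows are duplicated by index.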
DATA
dput(d2)
structure(list(V1 = c("a", "d"), V2 = c("b", "e"), V3 = c("c",
"f")), row.names = c(NA, -2L), class = "data.frame")

From long to wide formats just based on two columns Rstudio

This is my data frame:
I have a data frame of six columns; the last column contains the values. The column 'code' contains s and d, and the column 'Sex' contains M and f. The column 'offspring' contains two thousand offspring.
seq parent code Sex offspring Value
1 49032 s M J44010_CCG7YANXX_2_661_X4 -0.38455056
2 48741 s M J44010_CCG7YANXX_2_661_X4 0.10574340
3 48757 s M J44010_CCG7YANXX_2_661_X4 0.39572906
4 48465 d f J44010_CCG7YANXX_2_661_X4 0.43409006
5 48521 d f J44010_CCG7YANXX_2_661_X4 0.40337447
6 48703 d f J44010_CCG7YANXX_2_661_X4 -0.38148980
The column parent includes ids for both males and females.
I want to put the female/dam id, code and sex in columns next to the corresponding male/sire columns, and also keep the sire value and the dam value separately, so that 'Value' is split into two columns.
The data frame will look like the below:
seq parent1 sirecode Sex parent2 damcode Sex offspring sireValue damvalue
1 49032 s M 48465 d f J44010 -0.38455056 0.43409006
2 48741 s M 48521 d f J44010 0.10574340 0.40337447
3 48757 s M 48703 d f J44010 0.39572906 -0.38148980
So, each offspring will have 3 or 4 pairs of parents.
I tried to use dcast function on it.
We could use dcast after creating a sequence column
library(data.table)
setDT(df1)[, n := seq_len(.N), .(code, Sex)]
dcast(df1, n + offspring ~ rowid(n), value.var = c('parent', 'code', 'Sex', 'Value'), sep = "")
# n offspring parent1 parent2 code1 code2 Sex1 Sex2 Value1 Value2
#1: 1 J44010_CCG7YANXX_2_661_X4 49032 48465 s d M f -0.3845506 0.4340901
#2: 2 J44010_CCG7YANXX_2_661_X4 48741 48521 s d M f 0.1057434 0.4033745
#3: 3 J44010_CCG7YANXX_2_661_X4 48757 48703 s d M f 0.3957291 -0.3814898
In base R, we can use reshape
df1$n <- with(df1, ave(seq_along(Sex), Sex, FUN = seq_along))
df1$n1 <- with(df1, ave(n, n, FUN = seq_along))
reshape(df1[-1], idvar = c('n', 'offspring'), timevar = 'n1', direction = 'wide' )
data
df1 <- structure(list(seq = 1:6, parent = c(49032L, 48741L, 48757L,
48465L, 48521L, 48703L), code = c("s", "s", "s", "d", "d", "d"
), Sex = c("M", "M", "M", "f", "f", "f"),
offspring = c("J44010_CCG7YANXX_2_661_X4",
"J44010_CCG7YANXX_2_661_X4", "J44010_CCG7YANXX_2_661_X4",
"J44010_CCG7YANXX_2_661_X4",
"J44010_CCG7YANXX_2_661_X4", "J44010_CCG7YANXX_2_661_X4"),
Value = c(-0.38455056,
0.1057434, 0.39572906, 0.43409006, 0.40337447, -0.3814898)),
class = "data.frame", row.names = c(NA, -6L))
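Another way to look at the same reshaping is as a join of the sire rows to the dam rows. The sketch below uses base R merge() and, like the answers above, assumes that sires and dams are paired by their order of appearance within each offspring; the pair column and the .sire/.dam suffixes are names chosen here for illustration:

```r
df1 <- data.frame(
  seq = 1:6,
  parent = c(49032L, 48741L, 48757L, 48465L, 48521L, 48703L),
  code = c("s", "s", "s", "d", "d", "d"),
  Sex = c("M", "M", "M", "f", "f", "f"),
  offspring = "J44010_CCG7YANXX_2_661_X4",
  Value = c(-0.38455056, 0.1057434, 0.39572906, 0.43409006, 0.40337447, -0.3814898),
  stringsAsFactors = FALSE
)

# Pair number within each offspring and parent type (1st sire with 1st dam, ...)
df1$pair <- ave(seq_along(df1$code), df1$offspring, df1$code, FUN = seq_along)

keep <- c("offspring", "pair", "parent", "code", "Sex", "Value")
sire <- df1[df1$code == "s", keep]
dam  <- df1[df1$code == "d", keep]

# Join sires and dams on offspring and pair number
wide <- merge(sire, dam, by = c("offspring", "pair"), suffixes = c(".sire", ".dam"))
wide
```

Grouping by offspring when numbering the pairs keeps the pairing correct when the data contain more than one offspring.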

Computing correlation of vectors by factor label

I have two data frames. The first one, df1, is a matrix of vectors with labeled columns, like the following:
df1 <- data.frame(A=rnorm(10), B=rnorm(10), C=rnorm(10), D=rnorm(10), E=rnorm(10))
> df1
A B C D E
-0.3200306 0.4370963 -0.9146660 1.03219577 0.5215359
-0.3193144 0.8900656 -1.1720264 -0.42591761 0.1936993
0.4897262 -1.3970806 0.6054637 0.12487936 1.0149530
0.3772420 0.8726322 0.3250020 -0.36952560 -0.5447512
-0.6921561 -0.6734468 0.3500812 -0.53373720 -0.6129472
0.2540649 -1.1911106 -0.3266428 0.14013437 1.0830148
0.6606825 -0.8942715 1.1099637 -1.52416540 -0.2383048
1.4767074 -2.1492360 0.2441242 -0.36136344 0.5589114
-0.5338117 -0.2357821 0.7694879 -0.21652356 0.3185631
3.4215916 -0.3157938 0.8895597 0.09946069 -1.0961730
The second data frame, df2, contains items that match the colnames of df1. Example:
group <- c("1", "1", "2", "2", "3", "3")
S1 <- c("A", "D", "E", "C", "B", "D")
S2 <- c("D", "B", "A", "C", "B", "A")
S3 <- c("B", "C", "A", "E", "E", "A")
df2 <- data.frame(group,S1, S2, S3)
> df2
group S1 S2 S3
1 A D B
1 D B C
2 E A A
2 C C E
3 B B E
3 D A A
I would like to compute the correlations between the column vectors in df1 that correspond to the labeled items in df2. Specifically, the vectors that match cor(df2$S1, df2$S2) and cor(df2$S1, df2$S3).
The output should be something like this:
group S1 S2 S3 cor.S1.S2 cor.S1.S3
1 A D B 0.003825055 -0.2817946
1 D B C -0.2817946 -0.4928023
2 E A A -0.3856809 -0.3856809
2 C C E 1 -0.3862433
3 B B E 1 -0.3888541
3 D A A 0.003825055 0.003825055
I've been trying to resolve this with cbind[] but keep running into problems such as the 'x' must be numeric error with cor. Thanks in advance for any help!
You can do this with mapply().
my.cor <- function(x,y) {
cor(df1[,x],df1[,y])
}
df2$cor.S1.S2 <- mapply(my.cor,df2$S1,df2$S2)
df2$cor.S1.S3 <- mapply(my.cor,df2$S1,df2$S3) # S1 vs S3, as asked in the question
Another approach would be to compute the correlation between the two matrices/data frames obtained by subsetting the columns of 'df1' with the columns of 'df2', take the diag, and assign the output as a new column in 'df2'. Here, I am using lapply as we have to do both 'S1 vs S2' and 'S1 vs S3'.
df2[c('cor.S1.S2', 'cor.S1.S3')] <- lapply(c('S2', 'S3'),
function(x) diag(cor(df1[, df2[,x]], df1[,df2$S1])))
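Since every requested value is a correlation between two columns of df1, a third option is to compute the full correlation matrix once and look the pairs up by name. This is only a sketch: the set.seed(1) call and the cm name are additions made here so the example is reproducible, and the label columns are kept as character (not factor) so that they index by name:

```r
set.seed(1)  # assumption: the question used no seed, so its numbers will differ
df1 <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10),
                  D = rnorm(10), E = rnorm(10))
df2 <- data.frame(group = c("1", "1", "2", "2", "3", "3"),
                  S1 = c("A", "D", "E", "C", "B", "D"),
                  S2 = c("D", "B", "A", "C", "B", "A"),
                  S3 = c("B", "C", "A", "E", "E", "A"),
                  stringsAsFactors = FALSE)

# All pairwise correlations between the columns of df1, computed once
cm <- cor(df1)

# A two-column character matrix indexes cm by (row name, column name)
df2$cor.S1.S2 <- cm[cbind(df2$S1, df2$S2)]
df2$cor.S1.S3 <- cm[cbind(df2$S1, df2$S3)]
df2
```

With only five columns in df1 the matrix is tiny, and rows where both labels are the same (such as C and C) get a correlation of 1 from the diagonal.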
