Correlation with multiple variables and their multiple combinations - R

Here is an example data set for which I want to calculate the correlation between O_data and multiple combinations of the M_data columns.
O_data=runif(10)
M_a=runif(10)
M_b=runif(10)
M_c=runif(10)
M_d=runif(10)
M_e=runif(10)
M_data=data.frame(M_a,M_b,M_c,M_d,M_e)
I can calculate the correlation between O_data and each individual column of M_data.
correlation= matrix(NA,ncol = length(M_data[1,]))
for (i in 1:length(correlation))
{
correlation[,i]=cor(O_data,M_data[,i])
}
In addition to this, how can I get the correlation between O_data and the possible combinations of the M_data columns?
To clarify what I mean by a combination:
cor_M_ab=cor((M_a+M_b),O_data)
cor_M_abc=cor((M_a+M_b+M_c),O_data)
cor_M_abcd=...
cor_M_abcde=...
...
....
cor_M_bcd=..
..
cor_M_eab=...
....
...
I don't want combinations such as M_a and M_c. I want only combinations of consecutive columns, such as ab, bc, bcd, abcde, ea, eab, and so on.

Generate the data using set.seed so the results are reproducible:
set.seed(42)
O_data=runif(10)
M_a=runif(10)
M_b=runif(10)
M_c=runif(10)
M_d=runif(10)
M_e=runif(10)
M_data=data.frame(M_a,M_b,M_c,M_d,M_e)
The tricky part is just keeping things organized. Since you didn't specify, I made a matrix with 5 rows and 31 columns. The rows get the names of the variables in your M_data. Here's the matrix (motivated by: All N Combinations of All Subsets)
M_grid <- t(do.call(expand.grid, replicate(5, 0:1, simplify = FALSE))[-1,])
rownames(M_grid) <- names(M_data)
M_grid
#> 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
#> M_a 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
#> M_b 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> M_c 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0
#> M_d 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1
#> M_e 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
#> 28 29 30 31 32
#> M_a 1 0 1 0 1
#> M_b 1 0 0 1 1
#> M_c 0 1 1 1 1
#> M_d 1 1 1 1 1
#> M_e 1 1 1 1 1
Now when I do a matrix multiplication of M_data and any column of my M_grid I get a sum of the columns in M_data corresponding to which rows of M_grid have 1's. For example:
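as.matrix(M_data) %*% M_grid[, "4"]
gives me the sum of M_a and M_b, because the column named "4" has 1's in the M_a and M_b rows (indexing by position with M_grid[,4] would pick the column named "5", which is M_c alone). I can calculate the correlation between O_data and any of these sums. Putting it all together in one line: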
(final <- cbind(t(M_grid), apply(as.matrix(M_data) %*% M_grid, 2, function(x) cor(O_data, x))))
#> M_a M_b M_c M_d M_e
#> 2 1 0 0 0 0 0.066499681
#> 3 0 1 0 0 0 -0.343839423
#> 4 1 1 0 0 0 -0.255957896
#> 5 0 0 1 0 0 0.381614222
#> 6 1 0 1 0 0 0.334916617
#> 7 0 1 1 0 0 0.024198743
#> 8 1 1 1 0 0 0.059297654
#> 9 0 0 0 1 0 0.180676146
#> 10 1 0 0 1 0 0.190656099
#> 11 0 1 0 1 0 -0.140666930
#> 12 1 1 0 1 0 -0.094245439
#> 13 0 0 1 1 0 0.363591787
#> 14 1 0 1 1 0 0.363546012
#> 15 0 1 1 1 0 0.111435827
#> 16 1 1 1 1 0 0.142772457
#> 17 0 0 0 0 1 0.248640472
#> 18 1 0 0 0 1 0.178471959
#> 19 0 1 0 0 1 -0.117930168
#> 20 1 1 0 0 1 -0.064838097
#> 21 0 0 1 0 1 0.404258155
#> 22 1 0 1 0 1 0.348609692
#> 23 0 1 1 0 1 0.114267433
#> 24 1 1 1 0 1 0.131731971
#> 25 0 0 0 1 1 0.241561478
#> 26 1 0 0 1 1 0.229693510
#> 27 0 1 0 1 1 0.001390233
#> 28 1 1 0 1 1 0.030884234
#> 29 0 0 1 1 1 0.369212761
#> 30 1 0 1 1 1 0.354971839
#> 31 0 1 1 1 1 0.166132390
#> 32 1 1 1 1 1 0.182368955
The final column is the correlation of O_data with all 31 possible sums of columns in M_data. You can tell which column is included by seeing which has a 1 under it for that row.
I try not to resort to matrices too much but this was the first thing I thought of.
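The grid above covers all 31 subsets, while the question asks only for runs of consecutive columns, wrapping around from e back to a (ab, bc, ..., ea, eab, ...). A minimal sketch of that restriction follows; it is not part of the original answer, the helper name `cyclic_run` is made up for illustration, and it assumes the column order in M_data defines what "consecutive" means.

```r
set.seed(42)
O_data <- runif(10)
M_data <- data.frame(M_a = runif(10), M_b = runif(10), M_c = runif(10),
                     M_d = runif(10), M_e = runif(10))

p <- ncol(M_data)

# Hypothetical helper: column indices of a run of `len` consecutive columns
# starting at `start`, wrapping around after the last column
cyclic_run <- function(start, len, p) ((start - 1 + seq_len(len) - 1) %% p) + 1

runs <- list()
for (len in 2:p) {
  # The full set is the same run whatever the start, so keep it only once
  starts <- if (len == p) 1 else seq_len(p)
  for (s in starts) {
    idx <- cyclic_run(s, len, p)
    # Label such as "ab" or "eab", built from the column names
    label <- paste(sub("^M_", "", names(M_data)[idx]), collapse = "")
    runs[[label]] <- cor(rowSums(M_data[, idx, drop = FALSE]), O_data)
  }
}
cyclic_cor <- unlist(runs)
names(cyclic_cor)
```

With five columns this gives 16 correlations: five runs each of length 2, 3 and 4, plus the single run of all five columns.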

Related

create a loop to get samples in grouped data which meet a condition

I have a data frame where the data are grouped by ID. For each group I need to know how many cells make up 10% of the group, and then draw a sample of that size from the cells where EP is 1.
I've tried a nested for loop: the inner loop finds how many cells make up 10% of each group, and the outer loop samples that many cells meeting the condition EP==1.
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
x
ID EP
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 1 1
7 1 0
8 1 1
9 1 0
10 1 1
11 2 0
12 2 1
13 2 0
14 2 1
15 2 0
16 2 1
17 2 0
18 2 1
19 2 0
20 2 1
for(j in 1:1000){
for (i in 1:nrow(x)){
d <- x[x$ID==i,]
npix <- 10*nrow(d)/100
}
r <- sample(d[d$EP==1,],npix)
print(r)
}
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
.
.
.
until 1000
I would like to get this data frame, where each sample is a new column in x and each sampled cell is marked with 1:
ID EP s1 s2....s1000
1 1 0 0 0 ....
2 1 1 0 1
3 1 0 0 0
4 1 1 0 0
5 1 0 0 0
6 1 1 0 0
7 1 0 0 0
8 1 1 0 0
9 1 0 0 0
10 1 1 1 0
11 2 0 0 0
12 2 1 0 0
13 2 0 0 0
14 2 1 0 1
15 2 0 0 0
16 2 1 0 0
17 2 0 0 0
18 2 1 1 0
19 2 0 0 0
20 2 1 0 0
Note that the 1's in s1 and s2 mark the sampled cells; they correspond to 10% of the cells in each group (1, 2), and all of them meet the condition EP==1.
You can try:
set.seed(1231)
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
library(tidyverse)
x %>%
group_by(ID) %>%
mutate(index= ifelse(EP==1, 1:n(),0)) %>%
mutate(s1 = ifelse(index %in% sample(index[index!=0], n()*0.1), 1, 0)) %>%
mutate(s2 = ifelse(index %in% sample(index[index!=0], n()*0.1), 1, 0))
# A tibble: 20 x 5
# Groups: ID [2]
ID EP index s1 s2
<int> <int> <dbl> <dbl> <dbl>
1 1 0 0 0 0
2 1 1 2 0 0
3 1 0 0 0 0
4 1 1 4 0 0
5 1 0 0 0 0
6 1 1 6 1 1
7 1 0 0 0 0
8 1 1 8 0 0
9 1 0 0 0 0
10 1 1 10 0 0
11 2 0 0 0 0
12 2 1 2 0 0
13 2 0 0 0 0
14 2 1 4 0 1
15 2 0 0 0 0
16 2 1 6 0 0
17 2 0 0 0 0
18 2 1 8 0 0
19 2 0 0 0 0
20 2 1 10 1 0
We can write a function that, for each ID, marks 10% of the group's rows with 1, drawing only from rows where EP == 1.
library(dplyr)
rep_func <- function() {
x %>%
group_by(ID) %>%
mutate(s1 = 0,
s1 = replace(s1, sample(which(EP == 1), floor(0.1 * n())), 1)) %>%
pull(s1)
}
Then use replicate to repeat it n times:
n <- 5
x[paste0("s", seq_len(n))] <- replicate(n, rep_func())
x
# ID EP s1 s2 s3 s4 s5
#1 1 0 0 0 0 0 0
#2 1 1 0 0 0 0 0
#3 1 0 0 0 0 0 0
#4 1 1 0 0 0 0 0
#5 1 0 0 0 0 0 0
#6 1 1 1 0 0 1 0
#7 1 0 0 0 0 0 0
#8 1 1 0 1 0 0 0
#9 1 0 0 0 0 0 0
#10 1 1 0 0 1 0 1
#11 2 0 0 0 0 0 0
#12 2 1 0 0 1 0 0
#13 2 0 0 0 0 0 0
#14 2 1 1 1 0 0 0
#15 2 0 0 0 0 0 0
#16 2 1 0 0 0 0 1
#17 2 0 0 0 0 0 0
#18 2 1 0 0 0 1 0
#19 2 0 0 0 0 0 0
#20 2 1 0 0 0 0 0
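For comparison, here is a base R sketch of the same idea; it is not taken from the answers above, and the helper name `sample_flags` is made up for illustration. It also avoids the two problems that made the loop in the question print empty data frames: the inner loop ran `i` over `1:nrow(x)` rather than over the IDs, so `d` was empty after the last iteration, and `sample()` was applied to a data frame, which samples columns rather than rows.

```r
x <- data.frame(ID = rep(1:2, each = 10), EP = rep(0:1, times = 10))

# Hypothetical helper: returns one 0/1 vector (one sample) per call
sample_flags <- function(x, frac = 0.1) {
  flags <- integer(nrow(x))
  for (id in unique(x$ID)) {
    in_group <- which(x$ID == id)
    eligible <- in_group[x$EP[in_group] == 1]
    # Sample size is 10% of the whole group, as in the question
    k <- floor(frac * length(in_group))
    # Sample positions within `eligible`, so a single eligible row
    # is not silently expanded to 1:n by sample()
    picked <- eligible[sample.int(length(eligible), k)]
    flags[picked] <- 1L
  }
  flags
}

set.seed(1)
n_rep <- 5  # the question asks for 1000; 5 keeps the output readable
x[paste0("s", seq_len(n_rep))] <- replicate(n_rep, sample_flags(x))
x
```

This sketch assumes every group has at least `k` rows with EP == 1; otherwise `sample.int()` stops with an error.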

R - Controlling interaction order in model matrix

I would like to control the order of interaction dummy codes in a design matrix, separately from the order of main effects dummy codes. Specifically, I want to control the order in which the terms that make up the interaction are cycled through.
For example:
df <- expand.grid(X1 = letters[1:3],
X2 = LETTERS[24:26])
When writing the formula as ~X1+X2+X1:X2, the interaction design cycles through X2 and then through X1.
model.matrix(~X1+X2+X1:X2, df)
#> (Intercept) X1b X1c X2Y X2Z X1b:X2Y X1c:X2Y X1b:X2Z X1c:X2Z
#> 1 1 0 0 0 0 0 0 0 0
#> 2 1 1 0 0 0 0 0 0 0
#> 3 1 0 1 0 0 0 0 0 0
#> 4 1 0 0 1 0 0 0 0 0
#> 5 1 1 0 1 0 1 0 0 0
#> 6 1 0 1 1 0 0 1 0 0
#> 7 1 0 0 0 1 0 0 0 0
#> 8 1 1 0 0 1 0 0 1 0
#> 9 1 0 1 0 1 0 0 0 1
#> attr(,"assign")
#> [1] 0 1 1 2 2 3 3 3 3
#> attr(,"contrasts")
#> attr(,"contrasts")$X1
#> [1] "contr.treatment"
#>
#> attr(,"contrasts")$X2
#> [1] "contr.treatment"
When I flip the interaction term in the formula to ~X1+X2+X2:X1, the interaction design still cycles first through X2 and then through X1.
model.matrix(~X1+X2+X2:X1, df)
#> (Intercept) X1b X1c X2Y X2Z X1b:X2Y X1c:X2Y X1b:X2Z X1c:X2Z
#> 1 1 0 0 0 0 0 0 0 0
#> 2 1 1 0 0 0 0 0 0 0
#> 3 1 0 1 0 0 0 0 0 0
#> 4 1 0 0 1 0 0 0 0 0
#> 5 1 1 0 1 0 1 0 0 0
#> 6 1 0 1 1 0 0 1 0 0
#> 7 1 0 0 0 1 0 0 0 0
#> 8 1 1 0 0 1 0 0 1 0
#> 9 1 0 1 0 1 0 0 0 1
#> attr(,"assign")
#> [1] 0 1 1 2 2 3 3 3 3
#> attr(,"contrasts")
#> attr(,"contrasts")$X1
#> [1] "contr.treatment"
#>
#> attr(,"contrasts")$X2
#> [1] "contr.treatment"
What I would like to end up with is the following design matrix:
#> (Intercept) X1b X1c X2Y X2Z X1b:X2Y X1b:X2Z X1c:X2Y X1c:X2Z
#> 1 1 0 0 0 0 0 0 0 0
#> 2 1 1 0 0 0 0 0 0 0
#> 3 1 0 1 0 0 0 0 0 0
#> 4 1 0 0 1 0 0 0 0 0
#> 5 1 1 0 1 0 1 0 0 0
#> 6 1 0 1 1 0 0 0 1 0
#> 7 1 0 0 0 1 0 0 0 0
#> 8 1 1 0 0 1 0 1 0 0
#> 9 1 0 1 0 1 0 0 0 1
Thanks!
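One possible workaround, offered only as a sketch since the thread contains no answer, is to build the design matrix as usual and reorder the interaction columns afterwards. The names `is_int`, `parts` and `mm2` are made up for illustration. The sort is alphabetical on the column-name parts, which matches the factor level order in this example but is an assumption in general.

```r
df <- expand.grid(X1 = letters[1:3],
                  X2 = LETTERS[24:26])
mm <- model.matrix(~ X1 + X2 + X1:X2, df)

# Interaction columns are the ones whose names contain ":"
is_int <- grepl(":", colnames(mm), fixed = TRUE)
int_names <- colnames(mm)[is_int]

# Split "X1b:X2Y" into its X1 part and X2 part, one row per column
parts <- do.call(rbind, strsplit(int_names, ":", fixed = TRUE))

# Keep the main effects first, then sort interactions by X1 and, within X1, by X2
new_order <- c(colnames(mm)[!is_int],
               int_names[order(parts[, 1], parts[, 2])])
mm2 <- mm[, new_order]
colnames(mm2)
```

Subsetting the matrix this way drops the "assign" and "contrasts" attributes, so they would need to be restored by hand if later code relies on them.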

Calculate mean values of multiple measurements in a table with two categorical variables and a single continuous variable

I have this puzzle to solve.
This is the given data:
# A tibble: 351 x 3
# Groups: expcode [?]
expcode rank distributpermm.3
<chr> <int> <dbl>
1 ER02 1 892.325
2 ER02 2 694.030
3 ER02 3 917.110
4 ER02 4 991.475
5 ER02 5 1487.210
6 ER02 6 892.325
7 ER02 7 694.030
8 ER02 8 1710.290
9 ER02 9 1090.620
10 ER02 10 1288.915
# ... with 341 more rows
When I call table on this data like this:
table(ranktab$expcode, ranktab$rank)
I get an ordinary table:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
ER02 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER03 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER04 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER05 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER07 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ER11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER12 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER14 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
ER16 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
ER18 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
ER19 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER22 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER23 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER26 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Now I would like to get a matrix that looks like the table above, but instead of the count of cases I would like the values of the third variable in the data frame; if there are two observations, then their mean.
Let's assume your initial data is in a data frame called df:
df1 <- with(df, aggregate(distributpermm.3, by = list(expcode, rank), mean))
colnames(df1) <- colnames(df)
#this will give you final output in the desired format
xtabs(distributpermm.3 ~ expcode + rank, df1)
Hope this helps!
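One thing to keep in mind with xtabs is that combinations with no observations show up as 0, which can be mistaken for a measured value. As a hedged alternative, not from the answer above, tapply builds the same expcode-by-rank matrix of means and leaves empty cells as NA. The small data frame below is made up purely for illustration.

```r
# Made-up toy data in the same three-column layout as the question
df <- data.frame(
  expcode = c("ER02", "ER02", "ER02", "ER03", "ER03"),
  rank = c(1, 1, 2, 1, 3),
  distributpermm.3 = c(10, 20, 30, 40, 50)
)

# Rows are expcode, columns are rank, cells are means of the third variable
mean_tab <- with(df, tapply(distributpermm.3, list(expcode, rank), mean))
mean_tab
```

Here ER02 has two observations at rank 1, so that cell holds their mean, while the combinations that never occur (ER02 at rank 3, ER03 at rank 2) are NA.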
If you just want to obtain the means of distributpermm.3 for each expcode, you can use the aggregate function.
Try this:
expcode = c (rep ("ER02", 3), rep ("ER03", 4), "ER04", rep ("ER05", 2))
rank = c (1, 2, 3, 1, 2, 3, 4, 1, 1, 2)
distributpermm.3 = c (892.325, 694.030, 917.110, 991.475, 1487.210, 892.325, 694.030, 1710.290, 1090.620, 1288.915)
data = data.frame (expcode, rank, distributpermm.3)
res = aggregate (data [, 3], list (data$expcode), mean)
colnames (res) = c ("expcode", "mean (distributpermm.3)")
res
# > res
# expcode mean (distributpermm.3)
# 1 ER02 834.4883
# 2 ER03 1016.2600
# 3 ER04 1710.2900
# 4 ER05 1189.7675
If you want to keep rank in some way, please clarify what you want to obtain.

Calculating means in R, via case/row from multiple variables; count and exclude NA values

I'm trying to calculate participant average scores on the following scheme:
1. Take a series of values from multiple variables (test items),
2. Calculate an average score only for items answered Yes or No,
3. Exclude NA values from the mean, but count them and record the coordinates of every NA value,
4. Store the resulting mean in a new variable.
I need to do this with binary questions (1 = Yes, 0 = No, -99 = Missing / NA), such as below:
id var1 var2 var3 var4 var5
1 1 0 0 0 0
2 1 1 0 1 1
3 1 0 0 1 0
4 1 0 0 1 0
5 1 0 0 0 0
6 1 1 0 0 1
7 1 1 0 0 1
8 1 1 0 0 0
9 1 0 1 0 1
10 1 0 0 -99 1
11 1 1 0 1 0
12 1 0 0 1 0
13 1 0 0 -99 0
14 1 -99 0 1 1
15 1 0 0 1 0
16 1 0 0 0 1
17 1 0 0 1 0
18 1 0 -99 0 1
19 1 0 0 1 0
20 1 0 0 1 1
21 1 0 0 1 0
22 1 0 0 1 1
23 1 0 0 1 0
24 1 0 0 0 1
25 1 0 0 0 0
26 1 0 0 1 0
27 1 0 0 0 0
28 1 1 0 1 1
And with Likert scale questions (0 = Strongly Disagree, 6 = Strongly Agree, -99 = Missing / NA):
var10 var11 var12 var13 var14
1 1 1 1 0
4 1 1 1 1
1 1 1 1 1
2 1 1 1 1
4 1 1 1 1
2 1 1 1 0
1 1 1 1 0
1 1 1 1 1
2 1 1 1 1
1 1 1 1 0
4 1 1 1 1
4 1 1 1 1
-99 1 1 1 1
1 1 2 1 1
1 4 2 2 0
4 1 1 1 1
4 1 1 1 1
1 1 1 1 1
2 1 1 1 1
4 1 1 1 0
1 1 1 1 1
4 1 1 1 1
1 1 1 1 1
4 1 1 1 1
1 1 1 1 1
Any ideas of how to go about this? I'm sure it can be done by selecting individual columns or by indicating a range of columns from which to draw data. However, I'm inexperienced in writing such a complex, multi-stepped function in R so I'm hoping to get a veteran's advice.
Thanks in advance.
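The four steps in the question can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `dat` is a small made-up example in the question's layout, and the column names `mean_score` and `n_missing` are invented here.

```r
# Made-up example in the question's layout (-99 = missing)
dat <- data.frame(id = 1:4,
                  var1 = c(1, 1, 1, 1),
                  var2 = c(0, 1, -99, 0),
                  var3 = c(0, 0, 0, -99),
                  var4 = c(0, 1, -99, 1))

# Step 1: choose the item columns (a range such as 2:5 would also work)
items <- c("var1", "var2", "var3", "var4")

# Recode the -99 missing code to a real NA
dat[items] <- lapply(dat[items], function(v) replace(v, v == -99, NA))

# Steps 2 and 4: per-participant mean over answered items only
dat$mean_score <- rowMeans(dat[items], na.rm = TRUE)

# Step 3: number of missing items per participant ...
dat$n_missing <- rowSums(is.na(dat[items]))

# ... and the (row, col) coordinates of every NA within the item columns
na_coords <- which(is.na(dat[items]), arr.ind = TRUE)
na_coords
```

The same code applies unchanged to the Likert items, since only the missing code -99 is treated specially. A participant with every item missing would get NaN as the mean.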

Convert binary string to decimal

I have a question on data conversion from binary to decimal. Suppose I have a binary pattern like this:
pattern<-do.call(expand.grid, replicate(5, 0:1, simplify=FALSE))
pattern
Var1 Var2 Var3 Var4 Var5
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 1 0 0 0
5 0 0 1 0 0
6 1 0 1 0 0
7 0 1 1 0 0
8 1 1 1 0 0
9 0 0 0 1 0
10 1 0 0 1 0
11 0 1 0 1 0
12 1 1 0 1 0
13 0 0 1 1 0
14 1 0 1 1 0
15 0 1 1 1 0
16 1 1 1 1 0
17 0 0 0 0 1
18 1 0 0 0 1
19 0 1 0 0 1
20 1 1 0 0 1
21 0 0 1 0 1
22 1 0 1 0 1
23 0 1 1 0 1
24 1 1 1 0 1
25 0 0 0 1 1
26 1 0 0 1 1
27 0 1 0 1 1
28 1 1 0 1 1
29 0 0 1 1 1
30 1 0 1 1 1
31 0 1 1 1 1
32 1 1 1 1 1
What is the easiest way in R to convert each row to a decimal value, and vice versa? For example:
00000->0
10000->16
...
01111->15
Try:
res <- strtoi(apply(pattern,1, paste, collapse=""), base=2)
res
#[1] 0 16 8 24 4 20 12 28 2 18 10 26 6 22 14 30 1 17 9 25 5 21 13 29 3
#[26] 19 11 27 7 23 15 31
You could try intToBits to convert back to binary:
pat2 <- t(sapply(res, function(x) as.integer(rev(intToBits(x)))))[,28:32]
pat1 <- as.matrix(pattern)
dimnames(pat1) <- NULL
identical(pat1, pat2)
#[1] TRUE
You can also use a matrix product that weights each column by its power of two:
as.matrix(pattern) %*% 2^((ncol(pattern)-1):0)
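The matrix product works because each column is multiplied by its place value (16, 8, 4, 2, 1 for five bits, leftmost bit most significant) and the products are summed across the row. A short sketch checking that it agrees with the strtoi approach above; the names `weights`, `dec_mat` and `dec_str` are introduced here for illustration.

```r
pattern <- do.call(expand.grid, replicate(5, 0:1, simplify = FALSE))

# Place values for a 5-bit number, most significant bit first
weights <- 2^((ncol(pattern) - 1):0)

# Weighted row sums via a matrix product; drop() turns the 32 x 1 matrix into a vector
dec_mat <- drop(as.matrix(pattern) %*% weights)

# String-based conversion for comparison
dec_str <- strtoi(apply(pattern, 1, paste, collapse = ""), base = 2)

all(dec_mat == dec_str)
```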

Resources