I have a question on data conversion from binary to decimal. Suppose I have a binary pattern like this:
pattern<-do.call(expand.grid, replicate(5, 0:1, simplify=FALSE))
pattern
Var1 Var2 Var3 Var4 Var5
1 0 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 1 1 0 0 0
5 0 0 1 0 0
6 1 0 1 0 0
7 0 1 1 0 0
8 1 1 1 0 0
9 0 0 0 1 0
10 1 0 0 1 0
11 0 1 0 1 0
12 1 1 0 1 0
13 0 0 1 1 0
14 1 0 1 1 0
15 0 1 1 1 0
16 1 1 1 1 0
17 0 0 0 0 1
18 1 0 0 0 1
19 0 1 0 0 1
20 1 1 0 0 1
21 0 0 1 0 1
22 1 0 1 0 1
23 0 1 1 0 1
24 1 1 1 0 1
25 0 0 0 1 1
26 1 0 0 1 1
27 0 1 0 1 1
28 1 1 0 1 1
29 0 0 1 1 1
30 1 0 1 1 1
31 0 1 1 1 1
32 1 1 1 1 1
What is the easiest way in R to convert each row to a decimal value, and vice versa? For example:
00000->0
10000->16
...
01111->15
Try:
res <- strtoi(apply(pattern,1, paste, collapse=""), base=2)
res
#[1] 0 16 8 24 4 20 12 28 2 18 10 26 6 22 14 30 1 17 9 25 5 21 13 29 3
#[26] 19 11 27 7 23 15 31
You could try intToBits to convert back to the binary:
pat2 <- t(sapply(res, function(x) as.integer(rev(intToBits(x)))))[,28:32]
pat1 <- as.matrix(pattern)
dimnames(pat1) <- NULL
identical(pat1, pat2)
#[1] TRUE
You can try:
as.matrix(pattern) %*% 2^((ncol(pattern)-1):0)
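The matrix product works because each column is weighted by its power of two (16, 8, 4, 2, 1 for five columns) and the weighted bits are summed per row. Here is a minimal base-R sketch of both directions, assuming the leftmost column is the most significant bit as in the question's examples. Note that to_bits is a hypothetical helper name, not a built-in:

```r
# Rebuild the example pattern so the snippet runs on its own
pattern <- do.call(expand.grid, replicate(5, 0:1, simplify = FALSE))

# Binary -> decimal: weight each column by its power of two and sum per row
weights <- 2^((ncol(pattern) - 1):0)        # 16 8 4 2 1
dec <- drop(as.matrix(pattern) %*% weights)

# Decimal -> binary: integer-divide by each weight, then keep the parity
# (to_bits is an illustrative helper, not part of base R)
to_bits <- function(x, n) (x %/% 2^((n - 1):0)) %% 2

to_bits(16, 5)                              # 1 0 0 0 0
back <- t(sapply(dec, to_bits, n = ncol(pattern)))
all(back == as.matrix(pattern))             # round trip check
```

Unlike the intToBits version, this does not depend on slicing columns 28:32, so the number of bits is easy to change.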
Here is my data set. I would like to add 5 new columns to mydata with 5 different conditions.
mydata=data.frame(sub=rep(c(1:4),c(3,4,5,5)),t=c(1:3,1:4,1:5,1:5),
y.val=c(10,20,13,
5,7,8,0,
45,17,25,12,10,
40,0,0,5,8))
mydata
sub t y.val
1 1 1 10
2 1 2 20
3 1 3 13
4 2 1 5
5 2 2 7
6 2 3 8
7 2 4 0
8 3 1 45
9 3 2 17
10 3 3 25
11 3 4 12
12 3 5 10
13 4 1 40
14 4 2 0
15 4 3 0
16 4 4 5
17 4 5 8
I would like to add the following 5 columns (5 being the maximum of the 't' column):
mydata$It1=ifelse(mydata$t==1 & mydata$y.val>0,1,0)
mydata$It2=ifelse(mydata$t==2 & mydata$y.val>0,1,0)
mydata$It3=ifelse(mydata$t==3 & mydata$y.val>0,1,0)
mydata$It4=ifelse(mydata$t==4 & mydata$y.val>0,1,0)
mydata$It5=ifelse(mydata$t==5 & mydata$y.val>0,1,0)
Here is the expected outcome.
> mydata
sub t y.val It1 It2 It3 It4 It5
1 1 1 10 1 0 0 0 0
2 1 2 20 0 1 0 0 0
3 1 3 13 0 0 1 0 0
4 2 1 5 1 0 0 0 0
5 2 2 7 0 1 0 0 0
6 2 3 8 0 0 1 0 0
7 2 4 0 0 0 0 0 0
8 3 1 45 1 0 0 0 0
9 3 2 17 0 1 0 0 0
10 3 3 25 0 0 1 0 0
11 3 4 12 0 0 0 1 0
12 3 5 10 0 0 0 0 1
13 4 1 40 1 0 0 0 0
14 4 2 0 0 0 0 0 0
15 4 3 0 0 0 0 0 0
16 4 4 5 0 0 0 1 0
17 4 5 8 0 0 0 0 1
I would appreciate help writing this as a function, using a for loop or any other technique.
You could use sapply/lapply
n <- seq_len(5)
mydata[paste0("It", n)] <- +(sapply(n, function(x) mydata$t==x & mydata$y.val>0))
mydata
# sub t y.val It1 It2 It3 It4 It5
#1 1 1 10 1 0 0 0 0
#2 1 2 20 0 1 0 0 0
#3 1 3 13 0 0 1 0 0
#4 2 1 5 1 0 0 0 0
#5 2 2 7 0 1 0 0 0
#6 2 3 8 0 0 1 0 0
#7 2 4 0 0 0 0 0 0
#8 3 1 45 1 0 0 0 0
#9 3 2 17 0 1 0 0 0
#10 3 3 25 0 0 1 0 0
#11 3 4 12 0 0 0 1 0
#12 3 5 10 0 0 0 0 1
#13 4 1 40 1 0 0 0 0
#14 4 2 0 0 0 0 0 0
#15 4 3 0 0 0 0 0 0
#16 4 4 5 0 0 0 1 0
#17 4 5 8 0 0 0 0 1
mydata$t==x & mydata$y.val>0 returns a logical vector of TRUE/FALSE values based on the condition. The unary + converts those logical values to 1/0 respectively (try +c(FALSE, TRUE)). This avoids using ifelse, i.e. ifelse(condition, 1, 0).
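Since the question asked for a function with a for loop, here is a rough sketch of the same idea wrapped up that way. The function name add_indicators and its arguments (time, value, prefix) are made up for illustration, and the number of columns is taken from the maximum of the time column rather than hard-coded to 5:

```r
mydata <- data.frame(sub = rep(1:4, c(3, 4, 5, 5)),
                     t = c(1:3, 1:4, 1:5, 1:5),
                     y.val = c(10, 20, 13,
                               5, 7, 8, 0,
                               45, 17, 25, 12, 10,
                               40, 0, 0, 5, 8))

# Hypothetical helper: adds one 0/1 column per time point, named prefix1, prefix2, ...
add_indicators <- function(data, time = "t", value = "y.val", prefix = "It") {
  for (k in seq_len(max(data[[time]]))) {
    # unary + turns the logical vector into 0/1 integers
    data[[paste0(prefix, k)]] <- +(data[[time]] == k & data[[value]] > 0)
  }
  data
}

out <- add_indicators(mydata)
head(out)
```

The loop assumes the time column holds positive integers starting at 1; other codings would need a different way of building the column names.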
Here's another approach based on multiplying a model matrix by the logical y.val > 0.
df <- cbind(mydata[1:3], model.matrix(~ factor(t) + 0, mydata)*(mydata$y.val>0))
Which gives:
sub t y.val factor.t.1 factor.t.2 factor.t.3 factor.t.4 factor.t.5
1 1 1 10 1 0 0 0 0
2 1 2 20 0 1 0 0 0
3 1 3 13 0 0 1 0 0
4 2 1 5 1 0 0 0 0
5 2 2 7 0 1 0 0 0
6 2 3 8 0 0 1 0 0
7 2 4 0 0 0 0 0 0
8 3 1 45 1 0 0 0 0
9 3 2 17 0 1 0 0 0
10 3 3 25 0 0 1 0 0
11 3 4 12 0 0 0 1 0
12 3 5 10 0 0 0 0 1
13 4 1 40 1 0 0 0 0
14 4 2 0 0 0 0 0 0
15 4 3 0 0 0 0 0 0
16 4 4 5 0 0 0 1 0
17 4 5 8 0 0 0 0 1
To clean up the names you can do:
names(df) <- sub("factor.t.", "It", names(df), fixed = TRUE)
You can use sapply to compare each t for equality against 1:5 and combine this with an & of y.val>0.
within(mydata, It <- +(sapply(1:5, `==`, t) & y.val>0))
# sub t y.val It.1 It.2 It.3 It.4 It.5
#1 1 1 10 1 0 0 0 0
#2 1 2 20 0 1 0 0 0
#3 1 3 13 0 0 1 0 0
#4 2 1 5 1 0 0 0 0
#5 2 2 7 0 1 0 0 0
#6 2 3 8 0 0 1 0 0
#7 2 4 0 0 0 0 0 0
#8 3 1 45 1 0 0 0 0
#9 3 2 17 0 1 0 0 0
#10 3 3 25 0 0 1 0 0
#11 3 4 12 0 0 0 1 0
#12 3 5 10 0 0 0 0 1
#13 4 1 40 1 0 0 0 0
#14 4 2 0 0 0 0 0 0
#15 4 3 0 0 0 0 0 0
#16 4 4 5 0 0 0 1 0
#17 4 5 8 0 0 0 0 1
Here's a tidyverse solution, using pivot_wider:
library(tidyverse)
mydata %>%
mutate(new_col = paste0("It", t),
y_test = as.integer(y.val > 0)) %>%
pivot_wider(id_cols = c(sub, t, y.val),
names_from = new_col,
values_from = y_test,
values_fill = list(y_test = 0))
sub t y.val It1 It2 It3 It4 It5
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 10 1 0 0 0 0
2 1 2 20 0 1 0 0 0
3 1 3 13 0 0 1 0 0
4 2 1 5 1 0 0 0 0
5 2 2 7 0 1 0 0 0
6 2 3 8 0 0 1 0 0
7 2 4 0 0 0 0 0 0
8 3 1 45 1 0 0 0 0
9 3 2 17 0 1 0 0 0
10 3 3 25 0 0 1 0 0
11 3 4 12 0 0 0 1 0
12 3 5 10 0 0 0 0 1
13 4 1 40 1 0 0 0 0
14 4 2 0 0 0 0 0 0
15 4 3 0 0 0 0 0 0
16 4 4 5 0 0 0 1 0
17 4 5 8 0 0 0 0 1
Explanation:
Make two columns, new_col (new column names with "It") and y_test (y.val > 0).
Pivot new_col values into column names.
Fill in the NA values with zeros.
One purrr and dplyr option could be:
map_dfc(.x = 1:5,
~ mydata %>%
mutate(!!paste0("It", .x) := as.integer(t == .x & y.val > 0)) %>%
select(starts_with("It"))) %>%
bind_cols(mydata)
It1 It2 It3 It4 It5 sub t y.val
1 1 0 0 0 0 1 1 10
2 0 1 0 0 0 1 2 20
3 0 0 1 0 0 1 3 13
4 1 0 0 0 0 2 1 5
5 0 1 0 0 0 2 2 7
6 0 0 1 0 0 2 3 8
7 0 0 0 0 0 2 4 0
8 1 0 0 0 0 3 1 45
9 0 1 0 0 0 3 2 17
10 0 0 1 0 0 3 3 25
11 0 0 0 1 0 3 4 12
12 0 0 0 0 1 3 5 10
13 1 0 0 0 0 4 1 40
14 0 0 0 0 0 4 2 0
15 0 0 0 0 0 4 3 0
16 0 0 0 1 0 4 4 5
17 0 0 0 0 1 4 5 8
Or, if you want to do it dynamically according to the range of the t column:
map_dfc(.x = reduce(as.list(range(mydata$t)), `:`),
~ mydata %>%
mutate(!!paste0("It", .x) := as.integer(t == .x & y.val > 0)) %>%
select(starts_with("It"))) %>%
bind_cols(mydata)
I have a data frame where the data are grouped by ID. I need to know how many cells make up 10% of each group so that I can sample that many cells, but the sample should only select cells whose EP is 1.
I've tried a nested for loop: one loop to find how many cells are 10% of each group, and the outer one to sample that number of cells meeting the condition EP==1.
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
x
ID EP
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 1 1
7 1 0
8 1 1
9 1 0
10 1 1
11 2 0
12 2 1
13 2 0
14 2 1
15 2 0
16 2 1
17 2 0
18 2 1
19 2 0
20 2 1
for(j in 1:1000){
for (i in 1:nrow(x)){
d <- x[x$ID==i,]
npix <- 10*nrow(d)/100
}
r <- sample(d[d$EP==1,],npix)
print(r)
}
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
.
.
.
until 1000
I would like to get this data frame, where each sample is a new column in x and each sampled cell is marked with a 1:
ID EP s1 s2....s1000
1 1 0 0 0 ....
2 1 1 0 1
3 1 0 0 0
4 1 1 0 0
5 1 0 0 0
6 1 1 0 0
7 1 0 0 0
8 1 1 0 0
9 1 0 0 0
10 1 1 1 0
11 2 0 0 0
12 2 1 0 0
13 2 0 0 0
14 2 1 0 1
15 2 0 0 0
16 2 1 0 0
17 2 0 0 0
18 2 1 1 0
19 2 0 0 0
20 2 1 0 0
Note that each 1 in s1 and s2 marks a sampled cell. The sampled cells amount to 10% of the cells in each group (1, 2), and all of them meet the condition EP==1.
You can try:
set.seed(1231)
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
library(tidyverse)
x %>%
group_by(ID) %>%
mutate(index= ifelse(EP==1, 1:n(),0)) %>%
mutate(s1 = ifelse(index %in% sample(index[index!=0], n()*0.1), 1, 0)) %>%
mutate(s2 = ifelse(index %in% sample(index[index!=0], n()*0.1), 1, 0))
# A tibble: 20 x 5
# Groups: ID [2]
ID EP index s1 s2
<int> <int> <dbl> <dbl> <dbl>
1 1 0 0 0 0
2 1 1 2 0 0
3 1 0 0 0 0
4 1 1 4 0 0
5 1 0 0 0 0
6 1 1 6 1 1
7 1 0 0 0 0
8 1 1 8 0 0
9 1 0 0 0 0
10 1 1 10 0 0
11 2 0 0 0 0
12 2 1 2 0 0
13 2 0 0 0 0
14 2 1 4 0 1
15 2 0 0 0 0
16 2 1 6 0 0
17 2 0 0 0 0
18 2 1 8 0 0
19 2 0 0 0 0
20 2 1 10 1 0
We can write a function that, for each ID, places 1's in a random 10% of the group's rows, choosing only among rows where EP = 1.
library(dplyr)
rep_func <- function() {
x %>%
group_by(ID) %>%
mutate(s1 = 0,
s1 = replace(s1, sample(which(EP == 1), floor(0.1 * n())), 1)) %>%
pull(s1)
}
Then use replicate to repeat it n times:
n <- 5
x[paste0("s", seq_len(n))] <- replicate(n, rep_func())
x
# ID EP s1 s2 s3 s4 s5
#1 1 0 0 0 0 0 0
#2 1 1 0 0 0 0 0
#3 1 0 0 0 0 0 0
#4 1 1 0 0 0 0 0
#5 1 0 0 0 0 0 0
#6 1 1 1 0 0 1 0
#7 1 0 0 0 0 0 0
#8 1 1 0 1 0 0 0
#9 1 0 0 0 0 0 0
#10 1 1 0 0 1 0 1
#11 2 0 0 0 0 0 0
#12 2 1 0 0 1 0 0
#13 2 0 0 0 0 0 0
#14 2 1 1 1 0 0 0
#15 2 0 0 0 0 0 0
#16 2 1 0 0 0 0 1
#17 2 0 0 0 0 0 0
#18 2 1 0 0 0 1 0
#19 2 0 0 0 0 0 0
#20 2 1 0 0 0 0 0
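For comparison, here is a base-R sketch of the same sampling without dplyr. The helper name sample_flags and the frac argument are invented for this example. It indexes into the eligible positions with sample.int, because sample(x, k) treats a single number x as 1:x, which would pick the wrong rows if a group had only one row with EP == 1:

```r
set.seed(1)
x <- data.frame(ID = rep(1:2, each = 10), EP = rep(0:1, times = 10))

# Hypothetical helper: flag floor(frac * group size) rows, chosen among EP == 1
sample_flags <- function(ep, frac = 0.1) {
  flags <- integer(length(ep))
  eligible <- which(ep == 1)
  k <- min(floor(frac * length(ep)), length(eligible))
  if (k > 0) {
    # pick positions within `eligible` rather than sampling `eligible` directly
    flags[eligible[sample.int(length(eligible), k)]] <- 1L
  }
  flags
}

n <- 5
for (i in seq_len(n)) {
  # ave() applies the helper within each ID and returns a full-length vector
  x[[paste0("s", i)]] <- ave(x$EP, x$ID, FUN = sample_flags)
}
x
```

Setting n to 1000 gives the s1 to s1000 columns from the question.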
I am trying to recode a data frame with four columns. Across all of the columns, I want to recode all the numeric values into these ordinal numeric values:
0 stays as is
1:3 <- 1
4:10 <- 2
11:22 <- 3
23:max <- 4
This is the data frame:
> df
T4.1 T4.2 T4.3 T4.4
1 0 54 0 5
2 0 5 0 0
3 0 3 0 0
4 0 2 0 0
5 0 3 0 0
6 0 2 0 0
7 0 4 0 0
8 1 20 0 0
9 1 7 0 2
10 0 14 0 0
11 0 3 0 0
12 0 202 0 41
13 2 12 0 0
14 3 6 0 0
15 3 21 0 3
16 0 143 0 0
17 0 0 0 0
18 4 9 0 0
19 3 15 0 0
20 0 58 0 6
21 2 0 0 0
22 0 52 0 0
23 0 3 0 0
24 0 1 0 0
25 4 6 0 1
26 1 4 0 0
27 0 38 0 1
28 0 6 0 0
29 0 8 0 0
30 0 29 0 4
31 1 14 0 0
32 0 12 0 10
33 4 1 0 3
I'm trying to use the recode function, but I can't seem to figure out how to input a range of numeric values into it. I get the following errors with my attempts:
> recode(df, 11:22=3)
Error: unexpected '=' in "recode(df, 11:22="
> recode(df, c(11:22)=3)
Error: unexpected '=' in "recode(df, c(11:22)="
I would greatly appreciate any advice. Thanks for your time!
Edit: Thanks all for the help!!
You can use cut with a range of values, like this:
df_res <- as.data.frame(sapply(df, function(x)cut(x,
breaks = c(-0.5, 0.5, 3.5, 10.5, 22.5, Inf),
labels = c(0, 1, 2, 3, 4)))
)
str(df_res)
#'data.frame': 33 obs. of 4 variables:
# $ T4.1: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 2 2 1 ...
# $ T4.2: Factor w/ 5 levels "0","1","2","3",..: 5 3 2 2 2 2 3 4 3 4 ...
# $ T4.3: Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
# $ T4.4: Factor w/ 4 levels "0","1","2","4": 3 1 1 1 1 1 1 1 2 1 ...
df_res
# T4.1 T4.2 T4.3 T4.4
# 1 0 4 0 2
# 2 0 2 0 0
# 3 0 1 0 0
# 4 0 1 0 0
# 5 0 1 0 0
# 6 0 1 0 0
# 7 0 2 0 0
# 8 1 3 0 0
# 9 1 2 0 1
# 10 0 3 0 0
# 11 0 1 0 0
# 12 0 4 0 4
# 13 1 3 0 0
# 14 1 2 0 0
# 15 1 3 0 1
# 16 0 4 0 0
# 17 0 0 0 0
# 18 2 2 0 0
# 19 1 3 0 0
# 20 0 4 0 2
# 21 1 0 0 0
# 22 0 4 0 0
# 23 0 1 0 0
# 24 0 1 0 0
# 25 2 2 0 1
# 26 1 2 0 0
# 27 0 4 0 1
# 28 0 2 0 0
# 29 0 2 0 0
# 30 0 4 0 2
# 31 1 3 0 0
# 32 0 3 0 2
# 33 2 1 0 1
I find named vectors are a nice pattern for re-coding variables, especially for irregular patterns. You could use one like this here:
decoder <- c(0, rep(1,3), rep(2,7), rep(3, 12))
names(decoder) <- 0:22
sapply(df, function(x) ifelse(x <= 22, decoder[as.character(x)], 4))
If the re-coding was more of a pattern, cut is a useful function.
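If you would rather get plain integers than factors, findInterval is another option: it counts how many break points are less than or equal to each value. This is only a sketch under one assumption, namely that 22 belongs to category 3 (the ranges in the question overlap at 22, and both answers above treat it that way). The name recode_counts and the small data frame are made up for the example, using a few rows from the question:

```r
# Lower bounds of categories 1, 2, 3 and 4; anything below 1 maps to 0
recode_counts <- function(x) findInterval(x, c(1, 4, 11, 23))

recode_counts(c(0, 3, 4, 10, 11, 22, 23, 202))   # 0 1 2 2 3 3 4 4

# A few rows from the question's data frame (rows 1, 8, 12, 18, 32)
small <- data.frame(T4.1 = c(0, 1, 0, 4, 0),
                    T4.2 = c(54, 20, 202, 9, 12),
                    T4.3 = 0,
                    T4.4 = c(5, 0, 41, 0, 10))

recoded <- small
recoded[] <- lapply(small, recode_counts)   # `[]` keeps the data frame shape
recoded
```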
I have a text file that I want to use for survival data analysis:
1 0 0 0 15 0 0 1 1 0 0 2 12 0 12 0 12 0
2 0 0 1 20 0 0 1 0 0 0 4 9 0 9 0 9 0
3 0 0 1 15 0 0 0 1 1 0 2 13 0 13 0 7 1
4 0 0 0 20 1 0 1 0 0 0 2 11 1 29 0 29 0
5 0 0 1 70 1 1 1 1 0 0 2 28 1 31 0 4 1
6 0 0 1 20 1 0 1 0 0 0 4 11 0 11 0 8 1
7 0 0 1 5 0 0 0 0 0 1 4 12 0 12 0 11 1
8 0 0 1 30 1 0 1 1 0 0 4 8 1 34 0 4 1
9 0 0 1 25 0 1 0 1 1 0 4 10 1 53 0 4 1
10 0 0 1 20 0 1 0 1 0 0 4 7 0 1 1 7 0
11 0 0 1 30 1 0 1 0 0 1 4 7 1 21 1 44 1
12 0 0 0 20 0 0 1 0 0 1 4 20 0 1 1 20 0
13 0 0 1 25 0 0 1 1 1 0 4 12 1 32 0 32 0
14 0 0 1 70 0 0 0 0 0 1 4 16 0 16 0 16 0
15 0 0 1 20 1 0 1 0 0 0 4 39 0 39 0 39 0
16 0 0 0 10 1 0 1 0 0 1 4 23 1 34 0 34 0
17 0 0 1 10 1 0 0 0 0 0 4 8 0 8 0 8 0
18 0 0 1 15 0 0 0 0 0 0 4 15 0 15 0 6 1
19 0 0 1 10 0 0 0 0 0 1 4 8 0 8 0 8 0
20 0 0 1 15 0 0 0 0 1 0 4 24 1 32 0 32 0
21 0 0 1 16 0 0 1 0 0 0 4 25 1 22 1 43 0
22 0 1 1 55 1 0 1 1 0 0 4 14 1 3 1 56 0
23 0 0 1 20 1 0 1 1 0 0 4 24 1 47 0 11 1
24 0 0 0 30 0 0 0 1 1 0 4 6 1 43 0 43 0
25 0 0 1 40 0 1 0 1 1 0 1 25 0 3 1 25 0
26 0 0 1 15 1 0 1 1 0 0 4 12 0 12 0 12 0
27 0 1 1 50 0 0 1 0 0 1 4 15 1 53 0 32 1
28 0 0 1 40 1 0 1 1 0 0 4 18 1 52 0 51 1
29 0 1 1 45 0 1 1 1 1 0 4 13 1 11 1 21 0
30 0 1 0 40 0 1 1 1 1 0 2 29 0 2 1 29 0
31 0 0 1 28 0 0 1 0 0 0 2 7 0 7 0 3 1
32 0 0 1 19 1 0 1 0 0 0 3 16 0 16 0 16 0
33 0 0 1 15 0 0 1 0 0 0 2 10 0 10 0 3 1
34 0 0 1 5 0 0 1 0 1 0 3 6 0 6 0 4 1
35 0 1 1 35 0 0 1 0 0 0 4 8 1 43 0 7 1
36 0 0 1 2 1 0 1 0 0 0 1 1 1 27 0 27 0
37 0 1 1 5 0 0 1 0 0 0 2 18 0 18 0 18 0
38 0 0 1 55 1 0 1 0 0 1 4 6 1 5 1 47 1
39 0 0 0 10 0 0 0 1 0 0 2 19 1 29 0 29 0
40 0 0 1 15 0 0 1 0 0 0 4 5 0 5 0 5 0
41 0 1 1 20 1 0 1 0 0 1 4 1 1 4 1 97 0
42 0 1 0 30 1 0 1 1 0 1 4 15 1 28 0 28 0
43 0 0 1 25 1 1 1 1 0 1 4 14 1 4 1 7 1
44 0 0 1 95 1 1 1 1 1 1 4 9 0 9 0 3 1
45 0 1 1 30 0 0 0 0 1 0 4 1 1 39 0 39 0
46 0 0 1 15 1 0 1 0 0 0 4 10 0 10 0 10 0
47 0 0 1 20 0 1 1 1 0 0 4 6 1 5 1 46 0
48 0 1 1 6 0 0 1 0 0 0 2 13 1 28 0 28 0
49 0 0 1 15 0 0 1 0 0 1 4 11 1 21 0 21 0
50 0 0 1 7 0 0 1 1 0 0 1 8 1 17 1 38 0
51 0 0 1 13 0 0 1 1 1 0 4 10 0 10 0 10 0
52 0 0 1 25 1 0 1 0 0 1 4 6 1 40 0 5 1
53 0 0 1 25 1 0 1 0 1 1 4 18 1 22 0 9 1
54 0 1 1 20 1 0 1 0 0 1 4 16 1 16 1 21 1
55 0 1 1 25 0 0 1 1 0 0 4 7 1 26 0 26 0
56 0 0 1 95 1 0 1 1 1 1 4 14 0 14 0 14 0
57 0 0 1 17 1 0 1 0 0 0 4 16 0 16 0 16 0
58 0 0 1 3 0 0 1 0 1 0 3 4 0 4 0 1 1
59 0 0 1 15 1 0 1 0 0 0 4 19 0 6 1 19 0
60 0 0 1 65 1 1 1 1 1 1 4 21 1 8 1 10 1
61 0 1 1 15 1 0 1 1 1 1 4 18 0 18 0 18 0
62 0 0 1 40 1 0 1 0 0 0 3 31 0 31 0 13 1
63 0 0 1 45 1 0 1 1 0 1 4 11 1 24 1 40 0
64 0 1 0 35 0 0 1 1 0 0 4 4 1 5 1 47 0
65 0 0 1 85 1 1 1 1 0 1 4 12 1 8 1 9 1
66 0 1 1 15 0 1 0 1 0 1 4 11 1 35 0 19 1
67 0 0 1 70 0 1 1 1 1 0 2 23 1 8 1 60 0
68 0 0 1 6 1 0 0 0 0 1 4 7 0 7 0 7 0
69 0 0 1 20 0 0 1 0 0 0 4 19 1 26 0 6 1
70 0 1 1 36 1 0 1 0 1 1 4 16 1 20 1 23 1
71 1 1 1 50 1 1 1 0 1 0 4 15 0 1 1 15 0
72 1 0 1 21 1 0 1 0 0 0 4 6 1 13 1 23 0
73 1 0 1 16 1 0 1 0 0 0 4 2 1 9 0 9 0
74 1 1 1 3 0 0 1 0 0 0 4 6 1 14 0 14 0
75 1 0 1 5 1 0 1 0 0 0 3 8 0 8 0 2 1
76 1 0 1 32 0 1 1 1 0 1 4 18 1 51 0 18 1
77 1 0 1 38 0 1 1 1 0 0 4 12 1 22 0 22 0
78 1 0 1 16 1 0 1 0 0 0 4 7 1 16 0 16 0
79 1 1 1 9 0 1 0 1 0 0 4 6 1 2 1 2 1
80 1 0 1 17 0 1 1 0 0 0 2 10 1 10 1 22 0
81 1 0 1 22 1 0 1 0 0 0 4 12 1 20 0 5 1
82 1 0 1 10 0 0 1 0 0 0 4 5 1 5 1 14 0
83 1 0 1 12 1 0 1 0 0 0 4 12 0 12 0 12 0
84 1 0 1 80 1 1 1 1 1 1 4 6 1 4 1 41 0
85 1 1 1 15 0 0 1 1 0 0 4 9 1 9 1 21 0
86 1 0 1 50 1 0 1 0 0 1 4 18 1 7 1 56 0
87 1 0 1 50 1 1 1 1 1 1 4 7 1 42 1 67 0
88 1 0 1 15 1 0 1 0 0 0 3 11 0 11 0 11 0
89 1 0 1 8 1 0 1 0 0 0 4 9 1 17 0 17 0
90 1 1 1 45 1 1 1 1 0 0 1 11 1 11 1 18 1
91 1 0 1 20 0 1 1 1 0 1 4 6 1 6 1 14 1
92 1 0 1 5 0 0 1 0 1 0 3 4 1 8 0 5 1
93 1 0 1 25 0 0 1 0 0 0 2 5 1 10 0 5 1
94 1 0 1 40 0 1 1 1 0 0 4 11 1 8 1 31 0
95 1 0 1 4 0 0 1 0 1 0 3 9 1 7 1 23 0
96 1 0 1 25 0 0 1 1 0 1 4 4 1 14 1 46 0
97 1 1 1 20 0 0 1 0 1 0 4 5 1 1 1 38 0
98 1 1 1 26 0 0 1 0 0 1 4 8 1 3 1 35 0
99 1 0 1 10 0 1 1 1 0 0 4 13 1 21 0 21 0
100 1 1 1 85 1 1 1 1 0 1 4 11 0 3 1 11 0
101 1 0 1 75 1 0 1 1 1 0 4 29 1 49 0 16 1
102 1 0 0 5 0 0 1 0 1 0 1 13 0 13 0 13 0
103 1 0 1 20 1 0 1 0 0 0 4 1 1 12 0 12 0
104 1 1 1 8 0 1 0 1 1 0 4 6 1 6 1 13 0
105 1 1 1 10 0 0 1 0 0 1 4 6 1 23 0 23 0
106 1 0 1 10 0 0 0 0 1 1 4 3 1 31 0 31 0
107 1 1 0 2 0 0 1 0 0 0 1 2 1 2 1 10 0
108 1 0 0 5 0 0 0 0 1 0 2 4 1 4 1 17 0
109 1 0 1 10 1 0 0 0 1 0 4 5 1 18 0 18 0
110 1 0 1 18 0 0 1 1 1 0 4 6 1 5 1 33 0
111 1 0 1 20 1 0 1 1 0 0 4 9 1 8 1 17 0
112 1 0 1 80 1 1 1 1 1 1 4 4 1 11 1 13 0
113 1 0 0 17 1 0 1 1 1 1 4 5 1 4 1 35 0
114 1 0 0 35 1 0 1 0 0 0 4 7 1 7 1 71 0
115 1 0 1 50 1 0 1 0 1 1 4 11 0 11 0 3 1
116 1 0 0 20 0 0 1 0 0 0 4 6 1 31 1 42 1
117 1 0 1 25 0 1 1 1 0 0 3 8 0 8 0 5 1
118 1 0 1 20 0 0 0 1 0 1 1 3 1 2 1 30 0
119 1 0 1 20 0 0 1 1 0 0 4 6 1 38 0 38 0
120 1 0 1 10 1 0 1 0 0 0 4 16 0 16 0 16 0
121 1 0 0 15 1 0 1 0 0 0 2 20 0 20 0 20 0
122 1 0 1 15 0 0 1 0 1 0 4 30 0 2 1 30 0
123 1 0 1 15 0 0 1 0 0 0 4 2 1 7 0 7 0
124 1 0 1 20 0 0 1 1 0 0 2 8 1 6 1 22 0
125 1 0 1 13 1 0 1 0 0 0 4 13 0 4 1 5 1
126 1 0 1 25 1 0 1 0 0 1 4 13 1 1 1 31 0
127 1 0 1 25 0 0 1 1 0 1 4 17 0 17 0 10 1
128 1 0 1 8 1 0 1 0 0 0 4 14 0 14 0 14 0
129 1 1 1 30 1 0 1 0 0 1 4 13 0 5 1 13 0
130 1 0 1 40 0 1 1 1 1 0 4 24 0 7 1 17 1
131 1 1 1 12 0 1 1 1 1 0 1 14 1 21 0 21 0
132 1 0 1 15 0 0 1 0 0 0 4 8 1 19 1 25 0
133 1 0 1 25 1 0 1 0 0 0 4 23 0 23 0 8 1
134 1 0 1 15 0 0 1 0 0 0 4 17 1 17 0 11 1
135 1 0 0 20 0 0 1 1 1 0 4 19 1 31 0 31 0
136 1 0 1 22 0 1 1 0 0 0 4 14 1 20 0 20 0
137 1 0 1 15 1 0 1 0 1 0 4 15 1 22 0 22 0
138 1 0 1 7 1 0 1 0 0 0 3 13 0 3 1 13 0
139 1 0 1 30 0 1 1 1 1 0 2 49 0 49 0 4 1
140 1 0 1 20 1 0 1 0 0 1 4 14 0 10 1 14 0
141 1 1 1 35 1 0 1 0 0 1 4 6 1 5 1 49 0
142 1 0 0 10 0 0 1 0 0 0 4 12 0 12 0 12 0
143 1 0 1 8 0 0 1 0 1 0 3 14 0 1 1 14 0
144 1 0 1 13 0 0 0 0 1 0 4 32 1 38 0 38 0
145 1 1 0 10 0 1 1 1 0 0 2 12 1 13 1 41 0
146 1 0 1 8 0 0 0 1 1 0 4 10 1 18 0 18 0
147 1 0 1 7 1 0 1 0 0 0 4 8 0 8 0 8 0
148 1 0 1 52 1 0 1 1 1 1 4 15 1 39 1 76 0
149 1 1 1 14 0 1 1 1 1 0 4 8 1 62 0 62 0
150 1 1 1 7 0 0 1 0 0 0 1 5 1 17 0 17 0
151 1 1 1 20 1 0 1 0 0 0 4 7 1 6 1 17 1
152 1 0 1 15 0 0 0 1 1 1 4 19 1 3 1 42 0
153 1 0 1 10 0 0 1 0 0 0 4 10 0 10 0 2 1
154 1 0 1 35 1 1 1 0 0 0 4 10 1 27 0 27 0
I have used the Import Dataset tool within R, but I cannot seem to find the right settings to import the dataset. The columns are either merged together, or there are additional columns with many NAs.
I have looked around for similar questions, however I cannot find a solution that suits my problem.
How can I import this dataset?
Ensure it is saved as a text file (for example text.txt) then apply the following: read.table("text.txt").
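If the import still looks wrong, it can help to check that every line has the same number of whitespace-separated fields before reading it. A small sketch, using the first two lines of the data written to a temporary file (the temporary path stands in for your own file):

```r
# First two lines of the data, written to a temporary file for illustration
sample_lines <- c("1 0 0 0 15 0 0 1 1 0 0 2 12 0 12 0 12 0",
                  "2 0 0 1 20 0 0 1 0 0 0 4 9 0 9 0 9 0")
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Number of fields per line; a single distinct value means the rows are regular
table(count.fields(path))

# The default sep = "" splits on any run of spaces or tabs; there is no header row
surv <- read.table(path, header = FALSE)
dim(surv)   # 2 rows, 18 columns
```

Columns come back as V1, V2, ... because the file has no header; meaningful names would have to come from the data's documentation.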
Here is an example data set for which I want to calculate the correlation between O_data and multiple possible combinations of the M_data columns.
O_data=runif(10)
M_a=runif(10)
M_b=runif(10)
M_c=runif(10)
M_d=runif(10)
M_e=runif(10)
M_data=data.frame(M_a,M_b,M_c,M_d,M_e)
I can calculate the correlation between O_data and each individual column of M_data.
correlation= matrix(NA,ncol = length(M_data[1,]))
for (i in 1:length(correlation))
{
correlation[,i]=cor(O_data,M_data[,i])
}
In addition to this, how can I get the correlation between O_data and the possible combinations of the M_data columns?
Let me clarify what I mean by a combination.
cor_M_ab=cor((M_a+M_b),O_data)
cor_M_abc=cor((M_a+M_b+M_c),O_data)
cor_M_abcd=...
cor_M_abcde=...
...
....
cor_M_bcd=..
..
cor_M_eab=...
....
...
I don't want combinations such as M_a with M_c. I want only combinations of consecutive columns, such as M_ab, bc, bcd, abcde, ea, eab, and so on.
Generate the data using set.seed so you can reproduce:
set.seed(42)
O_data=runif(10)
M_a=runif(10)
M_b=runif(10)
M_c=runif(10)
M_d=runif(10)
M_e=runif(10)
M_data=data.frame(M_a,M_b,M_c,M_d,M_e)
The tricky part is just keeping things organized. Since you didn't specify, I made a matrix with 5 rows and 31 columns. The rows get the names of the variables in your M_data. Here's the matrix (motivated by: All N Combinations of All Subsets)
M_grid <- t(do.call(expand.grid, replicate(5, 0:1, simplify = FALSE))[-1,])
rownames(M_grid) <- names(M_data)
M_grid
#> 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
#> M_a 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
#> M_b 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1
#> M_c 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0
#> M_d 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1
#> M_e 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
#> 28 29 30 31 32
#> M_a 1 0 1 0 1
#> M_b 1 0 0 1 1
#> M_c 0 1 1 1 1
#> M_d 1 1 1 1 1
#> M_e 1 1 1 1 1
Now when I do a matrix multiplication of M_data and any column of my M_grid I get a sum of the columns in M_data corresponding to which rows of M_grid have 1's. For example:
# select the column named "4" (third by position), which has 1's for M_a and M_b
as.matrix(M_data) %*% M_grid[, "4"]
gives me the sum of M_a and M_b. I can calculate the correlation between O_data and any of these sums. Putting it all together in one line:
(final <- cbind(t(M_grid), apply(as.matrix(M_data) %*% M_grid, 2, function(x) cor(O_data, x))))
#> M_a M_b M_c M_d M_e
#> 2 1 0 0 0 0 0.066499681
#> 3 0 1 0 0 0 -0.343839423
#> 4 1 1 0 0 0 -0.255957896
#> 5 0 0 1 0 0 0.381614222
#> 6 1 0 1 0 0 0.334916617
#> 7 0 1 1 0 0 0.024198743
#> 8 1 1 1 0 0 0.059297654
#> 9 0 0 0 1 0 0.180676146
#> 10 1 0 0 1 0 0.190656099
#> 11 0 1 0 1 0 -0.140666930
#> 12 1 1 0 1 0 -0.094245439
#> 13 0 0 1 1 0 0.363591787
#> 14 1 0 1 1 0 0.363546012
#> 15 0 1 1 1 0 0.111435827
#> 16 1 1 1 1 0 0.142772457
#> 17 0 0 0 0 1 0.248640472
#> 18 1 0 0 0 1 0.178471959
#> 19 0 1 0 0 1 -0.117930168
#> 20 1 1 0 0 1 -0.064838097
#> 21 0 0 1 0 1 0.404258155
#> 22 1 0 1 0 1 0.348609692
#> 23 0 1 1 0 1 0.114267433
#> 24 1 1 1 0 1 0.131731971
#> 25 0 0 0 1 1 0.241561478
#> 26 1 0 0 1 1 0.229693510
#> 27 0 1 0 1 1 0.001390233
#> 28 1 1 0 1 1 0.030884234
#> 29 0 0 1 1 1 0.369212761
#> 30 1 0 1 1 1 0.354971839
#> 31 0 1 1 1 1 0.166132390
#> 32 1 1 1 1 1 0.182368955
The final column is the correlation of O_data with all 31 possible sums of columns in M_data. You can tell which column is included by seeing which has a 1 under it for that row.
I try not to resort to matrices too much but this was the first thing I thought of.
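The question also asked for only consecutive combinations (ab, bc, bcd, abcde, ea, eab, ...) rather than all 31 subsets. Below is a rough sketch of one way to read that: runs of neighbouring columns that wrap around from M_e back to M_a. The wrap-around rule is my interpretation of the "ea" and "eab" examples, and the names runs, idx and label are just for illustration:

```r
set.seed(42)
O_data <- runif(10)
M_data <- data.frame(M_a = runif(10), M_b = runif(10), M_c = runif(10),
                     M_d = runif(10), M_e = runif(10))

p <- ncol(M_data)
runs <- list()
for (len in 2:p) {
  # a run covering every column is the same set whatever the start, so keep one
  starts <- if (len == p) 1 else seq_len(p)
  for (s in starts) {
    # consecutive column positions starting at s, wrapping past the last column
    idx <- ((s - 1 + 0:(len - 1)) %% p) + 1
    label <- paste(sub("^M_", "", names(M_data)[idx]), collapse = "")
    runs[[label]] <- cor(rowSums(M_data[idx]), O_data)
  }
}
unlist(runs)   # named vector: ab, bc, cd, de, ea, abc, ...
```

This gives 16 correlations for five columns (five runs each of length 2, 3 and 4, plus the full set), and skips non-adjacent pairs such as M_a with M_c.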