Permutations and Decision Trees with R

I was wondering if there is a way to produce a decision tree that enumerates the permutations of selecting n objects from k classes. We have the set A={1,2,...,10} and the subsets B={1,2,...,5}, C={6,7} and D={8,9,10}. The total number of permutations can be calculated by
x <- factorial(10)/(factorial(5)*factorial(2)*factorial(3))
I would like to produce the decision tree as an edge list, like the following:
1 2 5 B
1 3 2 C
1 4 3 D
2 5 4 B
2 6 2 C
2 7 3 D
3 8 5 B
3 9 1 C
3 10 3 D
4 11 5 B
4 12 2 C
4 13 2 D
5 14 3 B
5 15 2 C
5 16 3 D
6 17 4 B
6 18 1 C
6 19 3 D
7 20 4 B
7 21 2 C
7 22 2 D
8 23 4 B
8 24 1 C
8 25 3 D
9 26 5 B
9 27 3 D
10 28 5 B
10 29 1 C
10 30 2 D
11 31 4 B
11 32 2 C
11 33 2 D
12 34 5 B
12 35 1 C
12 36 2 D
13 37 5 B
13 38 2 C
13 39 1 D
14 40 2 B
14 41 2 C
14 42 3 D
15 43 3 B
15 44 1 C
15 45 3 D
16 46 3 B
16 47 2 C
16 48 2 D
17 49 3 B
17 50 1 C
17 51 3 D
18 52 4 B
18 53 3 D
19 54 4 B
19 55 1 C
19 56 2 D
20 57 3 B
20 58 2 C
20 59 2 D
21 60 4 B
21 61 1 C
21 62 2 D
22 63 4 B
22 64 2 C
22 65 1 D
23 66 3 B
23 67 1 C
23 68 3 D
24 69 4 B
24 70 3 D
25 71 4 B
25 72 1 C
25 73 2 D
26 74 4 B
26 75 3 D
27 76 5 B
27 77 2 D
28 78 4 B
28 79 1 C
28 80 2 D
29 81 5 B
29 82 2 D
30 83 5 B
30 84 1 C
30 85 1 D
31 86 3 B
31 87 2 C
31 88 2 D
32 89 4 B
32 90 1 C
32 91 2 D
33 92 4 B
33 93 2 C
33 94 1 D
34 95 4 B
34 96 1 C
34 97 2 D
. . . .
. . . .
. . . .
The first two columns are the edge list (parent and child node IDs), the third column is the number of elements still available in the chosen subset, which decreases with each branching, and the fourth column is the subset name.
Once the edge list is computed, I'm thinking of plotting the graph with this command:
plot(g, layout = layout.reingold.tilford(g, root = 1))
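For reference, a minimal sketch of one way to build such an edge list recursively (build_edges() is an illustrative name, not code from the question; it assumes the counts B = 5, C = 2, D = 3 and uses igraph's graph_from_data_frame() and layout_as_tree(), the current name for layout.reingold.tilford()):
library(igraph)

# Recursively branch on every subset that still has elements left, recording
# one edge per branch together with the count available at the parent node
# and the subset's name.
build_edges <- function(counts = c(B = 5, C = 2, D = 3)) {
  edges   <- list()
  next_id <- 1L
  recurse <- function(parent, counts) {
    for (s in names(counts)) {
      if (counts[[s]] > 0) {
        next_id <<- next_id + 1L
        child   <- next_id
        edges[[length(edges) + 1L]] <<- data.frame(
          from = parent, to = child, remaining = counts[[s]], subset = s)
        left      <- counts
        left[[s]] <- left[[s]] - 1
        recurse(child, left)
      }
    }
  }
  recurse(1L, counts)
  do.call(rbind, edges)
}

el <- build_edges()
g  <- graph_from_data_frame(el[, c("from", "to")])
# The full tree has thousands of nodes, so labels are suppressed here.
plot(g, layout = layout_as_tree(g, root = 1),
     vertex.size = 2, vertex.label = NA, edge.arrow.size = 0.1)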

Related

Recode column every nth element in R

I'm looking to recode a column, say the following:
df <- data.frame(col1 = rep(3, 100),
col2 = rep(NA, 100))
I want to recode col2 as 1 for rows 1:5, 2 for rows 6:10, 3 for 11:15, etc. So, every five rows I would add +1 to the assigned value. Any way to automate this process to avoid manually recoding 100 rows?
There are a lot of ways to do that. Here are a couple of them.
Using rep():
df$col2 <- rep(1:nrow(df), each = 5, length.out = nrow(df))
Using ceiling():
df$col2 <- ceiling(seq(nrow(df))/5)
The dplyr way (with library(dplyr) loaded):
df %>% mutate(col2 = ((row_number()-1) %/% 5)+1)
Or a simple for loop:
for (i in 0:((nrow(df) / 5) - 1)) {
  df[(seq_len(nrow(df)) - 1) %/% 5 == i, 2] <- i + 1
}
> df
col1 col2
1 3 1
2 3 1
3 3 1
4 3 1
5 3 1
6 3 2
7 3 2
8 3 2
9 3 2
10 3 2
11 3 3
12 3 3
13 3 3
14 3 3
15 3 3
16 3 4
17 3 4
18 3 4
19 3 4
20 3 4
21 3 5
22 3 5
23 3 5
24 3 5
25 3 5
26 3 6
27 3 6
28 3 6
29 3 6
30 3 6
31 3 7
32 3 7
33 3 7
34 3 7
35 3 7
36 3 8
37 3 8
38 3 8
39 3 8
40 3 8
41 3 9
42 3 9
43 3 9
44 3 9
45 3 9
46 3 10
47 3 10
48 3 10
49 3 10
50 3 10
51 3 11
52 3 11
53 3 11
54 3 11
55 3 11
56 3 12
57 3 12
58 3 12
59 3 12
60 3 12
61 3 13
62 3 13
63 3 13
64 3 13
65 3 13
66 3 14
67 3 14
68 3 14
69 3 14
70 3 14
71 3 15
72 3 15
73 3 15
74 3 15
75 3 15
76 3 16
77 3 16
78 3 16
79 3 16
80 3 16
81 3 17
82 3 17
83 3 17
84 3 17
85 3 17
86 3 18
87 3 18
88 3 18
89 3 18
90 3 18
91 3 19
92 3 19
93 3 19
94 3 19
95 3 19
96 3 20
97 3 20
98 3 20
99 3 20
100 3 20
As there is a pattern (every 5th row), you can use rep() with row_number(); length.out = n() takes the length of the column into account.
Learned from Ronak's answer to "dplyr: Mutate a new column with sequential repeated integers of n time in a dataframe". Thanks to Ronak!
df %>% mutate(col2 = rep(row_number(), each=5, length.out = n()))
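Another base R option, shown here only as a sketch, is gl(), which generates factor levels in regular blocks:
# gl(n, k) builds a factor with n levels, each repeated k times;
# as.integer() turns it into the 1,1,1,1,1,2,2,... sequence wanted here.
df$col2 <- as.integer(gl(nrow(df) / 5, 5))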

Generate a variable conditional on other variables from a different dataframe

I have DF1:
c01 p01 c02 p02 c03 p03 c04 p04
1 0 1 20 1 33 1 49
2 3 2 21 2 34 2 50
3 4 3 21 3 38 3 50
4 6 4 23 4 40 4 51
5 7 5 24 5 41 5 53
6 9 6 27 6 41 6 54
7 11 7 29 7 41 7 55
8 15 8 31 8 43 8 57
9 15 9 33 9 47 9 57
10 16 10 33 10 49 10 60
And I have DF2:
type round
A 1
B 1
A 2
B 2
A 3
B 3
A 4
B 4
What I want is to generate a new variable in DF2 that works like this (pseudocode):
DF2$g1<- if(DF2$round==1, 0)
DF2$g2<- if(c01==4 & round==1,DF2$p01)
DF2$g3<- if(c01==4 & round==2,DF2$p02)
DF2$g4<- if(c01==4 & round==3,DF2$p03)
DF2$g5<- if(c01==4 & round==4,DF2$p04)
DF2$g6<- if(c01==4 & round==5,DF2$p05)
So DF2 becomes:
type round g
A 1 6
B 1 6
A 2 23
B 2 23
A 3 40
B 3 40
A 4 50
B 4 50
Is there a way that I can loop this? In the original dataframe, I have 40 rounds, i.e. c01 to c40 and p01 to p40.
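A minimal sketch of one way to loop it (the helper name lookup_g is illustrative, not from the question; it assumes the columns follow the c01/p01 naming pattern and that each cXX column contains the target value 4 exactly once):
lookup_g <- function(DF1, DF2, target = 4) {
  # For each round r, find the row of DF1 where c<r> equals the target
  # and copy the value of the matching p<r> column into DF2$g.
  DF2$g <- vapply(DF2$round, function(r) {
    c_col <- sprintf("c%02d", r)
    p_col <- sprintf("p%02d", r)
    DF1[[p_col]][DF1[[c_col]] == target]
  }, numeric(1))
  DF2
}
DF2 <- lookup_g(DF1, DF2)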

merge data tables in R

My apologies for this simple question. Basically, I want to make three separate cumsum() tables and merge them together by the first table. For example:
a <- cumsum(table(df$variable))
b <- cumsum(table(df$variable[c(TRUE, FALSE)]))
c <- cumsum(table(df$variable[c(FALSE, TRUE)]))
Where a is the cumsum of the entire vector of df$variable, b is the cumsum of the odd-numbered values of df$variable, c is the cumsum of the even-numbered values of df$variable. Another way of interpreting this is that combining b and c produces a.
This is the entire vector of numbers.
[1] 18 17 15 10 5 0 10 10 0 10 15 5 5 5 25 15 13 0 0 0 25 18 15 15 1 4 5
[28] 5 5 15 5 12 15 0 3 12 20 0 5 5 13 10 10 10 3 15 13 20 12 60 10 10 2 0
[55] 5 10 8 4 0 15 5 5 15 5 0 5 2 8 5 5 5 5 9 9 3 7 20 25 5 4 10
[82] 10 2 4 5 5 18 8 0 10 5 5 7 12 5 13 26 20 13 21 5 15 10 10 5 15 5 15
[109] 0 1 13 21 25 25 5 14 5 15 10 0 5 15 3 4 5 15 15 5 25 25 5 15 0 2 13
[136] 22 2 10 3 3 15 11 0 2 40 35 24 24 5 5 10 5 16 0 17 19 20 5 5 5 0 15
[163] 3 13 20 4 5 5 3 19 25 25 0 15 5 3 22 22 25 5 15 15 5 15 17 9 5 5 15
[190] 10
For a, I used cbind(cumsum(table(df$variable)))
0 18
1 20
2 26
3 35
4 41
5 88
7 90
8 93
9 96
10 115
11 116
12 120
13 128
14 129
15 154
16 155
17 158
18 161
19 163
20 169
21 171
22 174
24 176
25 186
26 187
35 188
40 189
60 190
For b, I used cbind(cumsum(table(df$variable[c(TRUE, FALSE)])))
0 10
1 11
2 15
3 22
5 50
7 51
8 52
9 53
10 60
12 61
13 67
15 76
16 77
17 79
18 81
20 85
22 86
24 87
25 93
26 94
40 95
For c, I used cbind(cumsum(table(df$variable[c(FALSE, TRUE)])))
0 8
1 9
2 11
3 13
4 19
5 38
7 39
8 41
9 43
10 55
11 56
12 59
13 61
14 62
15 78
17 79
18 80
19 82
20 84
21 86
22 88
24 89
25 93
35 94
60 95
In frequency form, the distributions should look something like this.
a b c
0 18 10 8
1 2 1 1
2 6 4 2
3 9 7 2
4 6 0 6
5 47 28 19
7 2 1 1
8 3 1 2
9 3 1 2
10 19 7 12
11 1 0 1
12 4 1 3
13 8 6 2
14 1 0 1
15 25 9 16
16 1 1 0
17 3 2 1
18 3 2 1
19 2 0 2
20 6 4 2
21 2 0 2
22 3 1 2
24 2 1 1
25 10 6 4
26 1 1 0
35 1 0 1
40 1 1 0
60 1 0 1
190 95 95
But I want it in cumsum() form, so it should look something like this (the first few rows written out as an illustration).
a b c
0 18 10 8
1 20 11 9
2 26 15 11
3 35 22 13
4 41 22 19
5 88 50 38
7 90 51 39
The problem I've been having is that the subsets b and c don't have all the values (i.e. some values have zero frequency), which shortens those vectors; as a result, I'm unable to properly merge or cbind() them.
Any suggestion is greatly appreciated.
You could probably get there using match quite easily. Assuming your data is:
set.seed(1)
df <- data.frame(variable=rbinom(10,prob=0.5,size=3))
Something like this seems to work
out <- data.frame(a,b=b[match(names(a),names(b))],c=c[match(names(a),names(c))])
replace(out,is.na(out),0)
# a b c
#0 1 0 1
#1 4 2 2
#2 7 4 3
#3 10 5 5
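Note that filling the gaps with 0 loses the running total shown in the desired output (e.g. b should stay at 22 for value 4). A hedged variant, assuming the zoo package is acceptable, carries the last cumulative count forward instead:
library(zoo)
out <- data.frame(a,
                  b = b[match(names(a), names(b))],
                  c = c[match(names(a), names(c))])
# Carry the previous cumulative count forward; values never observed yet stay
# NA after na.locf(), so they are set to 0 afterwards.
out[] <- lapply(out, na.locf, na.rm = FALSE)
replace(out, is.na(out), 0)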

Removing duplicates for each ID

Suppose that there are three variables in my data frame (mydata): 1) id, 2) case, and 3) value.
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), case=c("a","b","c","c","b","a","b","c","c","a","b","c","c","a","b","c","a"), value=c(1,34,56,23,34,546,34,67,23,65,23,65,23,87,34,321,87))
mydata
id case value
1 1 a 1
2 1 b 34
3 1 c 56
4 1 c 23
5 1 b 34
6 2 a 546
7 2 b 34
8 2 c 67
9 2 c 23
10 3 a 65
11 3 b 23
12 3 c 65
13 3 c 23
14 4 a 87
15 4 b 34
16 4 c 321
17 4 a 87
For each id, the same 'case' character can appear more than once, and the values can be the same or different. So basically, if the values are the same, I only need to keep one row and remove the duplicate.
My final data then would be
id case value
1 1 a 1
2 1 b 34
3 1 c 56
4 1 c 23
5 2 a 546
6 2 b 34
7 2 c 67
8 2 c 23
9 3 a 65
10 3 b 23
11 3 c 65
12 3 c 23
13 4 a 87
14 4 b 34
15 4 c 321
To add to the other answers, here's a dplyr approach:
library(dplyr)
mydata %>% group_by(id, case, value) %>% distinct()
Or
mydata %>% distinct(id, case, value)
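If the data frame had more columns than these three, a variant with .keep_all = TRUE would retain them while still de-duplicating on id, case and value:
# .keep_all = TRUE keeps every other column from the first occurrence of each combination
mydata %>% distinct(id, case, value, .keep_all = TRUE)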
You could try duplicated
mydata[!duplicated(mydata[,c('id', 'case', 'value')]),]
# id case value
#1 1 a 1
#2 1 b 34
#3 1 c 56
#4 1 c 23
#6 2 a 546
#7 2 b 34
#8 2 c 67
#9 2 c 23
#10 3 a 65
#11 3 b 23
#12 3 c 65
#13 3 c 23
#14 4 a 87
#15 4 b 34
#16 4 c 321
Or use unique with by option from data.table
library(data.table)
set.seed(25)
mydata1 <- cbind(mydata, value1=rnorm(17))
DT <- as.data.table(mydata1)
unique(DT, by=c('id', 'case', 'value'))
# id case value value1
#1: 1 a 1 -0.21183360
#2: 1 b 34 -1.04159113
#3: 1 c 56 -1.15330756
#4: 1 c 23 0.32153150
#5: 2 a 546 -0.44553326
#6: 2 b 34 1.73404543
#7: 2 c 67 0.51129562
#8: 2 c 23 0.09964504
#9: 3 a 65 -0.05789111
#10: 3 b 23 -1.74278763
#11: 3 c 65 -1.32495298
#12: 3 c 23 -0.54793388
#13: 4 a 87 -1.45638428
#14: 4 b 34 0.08268682
#15: 4 c 321 0.92757895
Deduplicating on id, case and value only? Easy:
> mydata[!duplicated(mydata[,c("id","case","value")]),]
Even if you have a ton more variables in the dataset, they won't be considered by the duplicated() call.

How do I use plyr to number rows?

Basically, I want an auto-incremented id column based on my cohorts, in this case .(kmer, cvCut).
> myDataFrame
size kmer cvCut cumsum
1 8132 23 10 8132
10000 778 23 10 13789274
30000 324 23 10 23658740
50000 182 23 10 28534840
100000 65 23 10 33943283
200000 25 23 10 37954383
250000 584 23 12 16546507
300000 110 23 12 29435303
400000 28 23 12 34697860
600000 127 23 2 47124443
600001 127 23 2 47124570
I want a column added that has new row names based on the kmer/cvCut group
> myDataFrame
size kmer cvCut cumsum newID
1 8132 23 10 8132 1
10000 778 23 10 13789274 2
30000 324 23 10 23658740 3
50000 182 23 10 28534840 4
100000 65 23 10 33943283 5
200000 25 23 10 37954383 6
250000 584 23 12 16546507 1
300000 110 23 12 29435303 2
400000 28 23 12 34697860 3
600000 127 23 2 47124443 1
600001 127 23 2 47124570 2
I'd do it like this:
library(plyr)
ddply(df, c("kmer", "cvCut"), transform, newID = seq_along(kmer))
Just add a new column each time plyr calls your function:
R> DF <- data.frame(kmer=sample(1:3, 50, replace=TRUE),
+                   cvCut=sample(LETTERS[1:3], 50, replace=TRUE))
R> library(plyr)
R> ddply(DF, .(kmer, cvCut), function(X) data.frame(X, newId=1:nrow(X)))
kmer cvCut newId
1 1 A 1
2 1 A 2
3 1 A 3
4 1 A 4
5 1 A 5
6 1 A 6
7 1 A 7
8 1 A 8
9 1 A 9
10 1 A 10
11 1 A 11
12 1 B 1
13 1 B 2
14 1 B 3
15 1 B 4
16 1 B 5
17 1 B 6
18 1 C 1
19 1 C 2
20 1 C 3
21 2 A 1
22 2 A 2
23 2 A 3
24 2 A 4
25 2 A 5
26 2 B 1
27 2 B 2
28 2 B 3
29 2 B 4
30 2 B 5
31 2 B 6
32 2 B 7
33 2 C 1
34 2 C 2
35 2 C 3
36 2 C 4
37 3 A 1
38 3 A 2
39 3 A 3
40 3 A 4
41 3 B 1
42 3 B 2
43 3 B 3
44 3 B 4
45 3 C 1
46 3 C 2
47 3 C 3
48 3 C 4
49 3 C 5
50 3 C 6
R>
I think that this is what you want:
Load the data:
x <- read.table(textConnection(
"id size kmer cvCut cumsum
1 8132 23 10 8132
10000 778 23 10 13789274
30000 324 23 10 23658740
50000 182 23 10 28534840
100000 65 23 10 33943283
200000 25 23 10 37954383
250000 584 23 12 16546507
300000 110 23 12 29435303
400000 28 23 12 34697860
600000 127 23 2 47124443
600001 127 23 2 47124570"), header=TRUE)
Use ddply:
library(plyr)
ddply(x, .(kmer, cvCut), function(x) cbind(x, newID = 1:nrow(x)))
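For reference (plyr has since been superseded by dplyr), the equivalent grouped row number with dplyr would be:
library(dplyr)
x %>%
  group_by(kmer, cvCut) %>%
  mutate(newID = row_number()) %>%
  ungroup()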
