Recode column every nth element in R

I'm looking to recode a column, say the following:
df <- data.frame(col1 = rep(3, 100),
col2 = rep(NA, 100))
I want to recode col2 as 1 for rows 1:5, 2 for rows 6:10, 3 for 11:15, etc. So, every five rows I would add +1 to the assigned value. Any way to automate this process to avoid manually recoding 100 rows?

There are a lot of ways to do that. Here are a couple of them.
Using rep:
df$col2 <- rep(1:nrow(df), each = 5, length.out = nrow(df))
Using ceiling:
df$col2 <- ceiling(seq(nrow(df)) / 5)
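Another base-R option is gl(), the factor-level generator (a sketch; converting the factor to integer is my own choice, not from the answers above):

```r
df <- data.frame(col1 = rep(3, 100), col2 = NA)
# gl(n, k, length) builds a factor with n levels, each repeated k times,
# truncated to `length`; as.integer() turns the levels into 1, 2, 3, ...
df$col2 <- as.integer(gl(ceiling(nrow(df) / 5), 5, nrow(df)))
head(df$col2, 12)  # 1 1 1 1 1 2 2 2 2 2 3 3
```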

The dplyr way:
library(dplyr)
df %>% mutate(col2 = ((row_number() - 1) %/% 5) + 1)
Or a simple for loop (the logical index should have exactly one entry per row, hence seq_len()):
for (i in 0:((nrow(df) / 5) - 1)) {
  df[(seq_len(nrow(df)) - 1) %/% 5 == i, 2] <- i + 1
}
> df
col1 col2
1 3 1
2 3 1
3 3 1
4 3 1
5 3 1
6 3 2
7 3 2
8 3 2
9 3 2
10 3 2
11 3 3
12 3 3
13 3 3
14 3 3
15 3 3
16 3 4
17 3 4
18 3 4
19 3 4
20 3 4
21 3 5
22 3 5
23 3 5
24 3 5
25 3 5
26 3 6
27 3 6
28 3 6
29 3 6
30 3 6
31 3 7
32 3 7
33 3 7
34 3 7
35 3 7
36 3 8
37 3 8
38 3 8
39 3 8
40 3 8
41 3 9
42 3 9
43 3 9
44 3 9
45 3 9
46 3 10
47 3 10
48 3 10
49 3 10
50 3 10
51 3 11
52 3 11
53 3 11
54 3 11
55 3 11
56 3 12
57 3 12
58 3 12
59 3 12
60 3 12
61 3 13
62 3 13
63 3 13
64 3 13
65 3 13
66 3 14
67 3 14
68 3 14
69 3 14
70 3 14
71 3 15
72 3 15
73 3 15
74 3 15
75 3 15
76 3 16
77 3 16
78 3 16
79 3 16
80 3 16
81 3 17
82 3 17
83 3 17
84 3 17
85 3 17
86 3 18
87 3 18
88 3 18
89 3 18
90 3 18
91 3 19
92 3 19
93 3 19
94 3 19
95 3 19
96 3 20
97 3 20
98 3 20
99 3 20
100 3 20

As there is a pattern (every 5th row), you can use rep() with row_number(); length.out = n() takes the column length into account.
Learned from Ronak's answer to "dplyr: Mutate a new column with sequential repeated integers of n time in a dataframe". Thanks, Ronak!
df %>% mutate(col2 = rep(row_number(), each = 5, length.out = n()))
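All of the vectorised approaches above produce the same grouping column; a quick sanity check (a sketch):

```r
n  <- 100
v1 <- rep(1:ceiling(n / 5), each = 5, length.out = n)  # rep() approach
v2 <- ceiling(seq_len(n) / 5)                          # ceiling() approach
v3 <- ((seq_len(n) - 1) %/% 5) + 1                     # integer-division approach
all(v1 == v2) && all(v2 == v3)  # TRUE
```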

Related

Create edgelist that contains mutual dyads

I have an edgelist where I want to keep dyads that mutually selected each other (e.g., 1 -> 4 and 4 -> 1). However, in the final edgelist I only want to keep one row instead of both rows of the mutual dyads (e.g., only row 1 -> 4 not both rows 1 -> 4 and 4 -> 1). How do I achieve that?
Here is the dataset:
library(igraph)
g <- sample_gnm(10, 50, directed = TRUE)  # keep the graph so which_mutual(g) works below
ff <- as_data_frame(g)
ff
from to
1 1 10
2 1 3
3 1 4
4 1 5
5 1 6
6 1 7
7 1 8
8 2 1
9 2 3
10 2 8
11 2 9
12 3 1
13 3 2
14 3 10
15 3 4
16 3 5
17 3 6
18 3 8
19 3 9
20 4 3
21 4 10
22 5 1
23 5 2
24 5 3
25 5 4
26 6 2
27 6 3
28 6 4
29 6 5
30 7 3
31 7 5
32 7 6
33 7 10
34 7 8
35 8 1
36 8 2
37 8 4
38 8 5
39 8 10
40 9 1
41 9 2
42 9 3
43 9 4
44 9 5
45 9 7
46 10 1
47 10 3
48 10 4
49 10 8
50 10 9
cd <- which_mutual(g) #I know I can use `which_mutual` to identify the mutual dyads
ff[which(cd==1),] #but in the end this keeps both rows of the mutual dyads (e.g., 1 -> 4 and 4 -> 1)
from to
4 1 4
6 1 6
7 1 7
9 2 10
10 2 3
14 3 2
18 3 6
21 4 1
25 5 10
28 6 1
30 6 3
32 6 10
33 6 7
34 7 1
37 7 6
39 7 8
42 8 7
45 9 10
46 10 2
47 10 5
48 10 6
50 10 9
We can use duplicated() to create a logical vector after sorting the pair within each row with pmin()/pmax():
ff1 <- ff[which(cd==1),]
subset(ff1, !duplicated(cbind(pmin(from, to), pmax(from, to))))
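Putting the pieces together (a reproducible sketch; the seed is my own addition, and the key idea is that (pmin, pmax) yields the same key for 1 -> 4 and 4 -> 1):

```r
library(igraph)
set.seed(42)                           # any seed, just for reproducibility
g   <- sample_gnm(10, 50, directed = TRUE)
ff  <- as_data_frame(g)
ff1 <- ff[which_mutual(g), ]           # rows that belong to a mutual dyad
# keep one row per unordered pair: sort (from, to) within each row, drop duplicates
res <- subset(ff1, !duplicated(cbind(pmin(from, to), pmax(from, to))))
nrow(res) == sum(which_mutual(g)) / 2  # one row per mutual pair
```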

Gen variable conditional on other variables from a different dataframe

I have DF1:
c01 p01 c02 p02 c03 p03 c04 p04
1 0 1 20 1 33 1 49
2 3 2 21 2 34 2 50
3 4 3 21 3 38 3 50
4 6 4 23 4 40 4 51
5 7 5 24 5 41 5 53
6 9 6 27 6 41 6 54
7 11 7 29 7 41 7 55
8 15 8 31 8 43 8 57
9 15 9 33 9 47 9 57
10 16 10 33 10 49 10 60
And I have DF2:
type round
A 1
B 1
A 2
B 2
A 3
B 3
A 4
B 4
What I want is to generate a new variable in DF2 that goes like:
DF2$g1<- if(DF2$round==1, 0)
DF2$g2<- if(c01==4 & round==1,DF2$p01)
DF2$g3<- if(c01==4 & round==2,DF2$p02)
DF2$g4<- if(c01==4 & round==3,DF2$p03)
DF2$g5<- if(c01==4 & round==4,DF2$p04)
DF2$g6<- if(c01==4 & round==5,DF2$p05)
So DF2 becomes:
type round g
A 1 6
B 1 6
A 2 23
B 2 23
A 3 40
B 3 40
A 4 50
B 4 50
Is there a way that I can loop this? In the original data frame, I have 40 rounds, i.e. c01 to c40 and p01 to p40.
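No answer is recorded here, but one hedged sketch: assuming every cXX column shares the same row order, the row where c01 == 4 holds the p-value for every round, so a single lookup replaces the chain of ifs. The names row4 and pcols are my own; scale seq_len(4) up to 40 for the full data. (The question's expected output shows 50 for round 4 while p04 in that row is 51, which looks like a typo in the post.)

```r
# toy data mirroring the question (4 rounds only)
DF1 <- data.frame(c01 = 1:10, p01 = c(0, 3, 4, 6, 7, 9, 11, 15, 15, 16),
                  c02 = 1:10, p02 = c(20, 21, 21, 23, 24, 27, 29, 31, 33, 33),
                  c03 = 1:10, p03 = c(33, 34, 38, 40, 41, 41, 41, 43, 47, 49),
                  c04 = 1:10, p04 = c(49, 50, 50, 51, 53, 54, 55, 57, 57, 60))
DF2 <- data.frame(type = rep(c("A", "B"), 4), round = rep(1:4, each = 2))

row4  <- which(DF1$c01 == 4)            # the row the condition picks out
pcols <- sprintf("p%02d", seq_len(4))   # "p01" .. "p04"
DF2$g <- unname(unlist(DF1[row4, pcols]))[DF2$round]
```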

(Update) Add index column to data.frame based on two columns

Example data.frame:
df = read.table(text = 'colA colB
2 7
2 7
2 7
2 7
1 7
1 7
1 7
89 5
89 5
89 5
88 5
88 5
70 5
70 5
70 5
69 5
69 5
44 4
44 4
44 4
43 4
42 4
42 4
41 4
41 4
120 1
100 1', header = TRUE)
I need to add an index column based on colA and colB, where colB gives the exact number of rows in each group (though the same colB value can occur in several groups). colB groups rows whose colA values are equal or differ by one (colA and colA - 1).
Expected output:
colA colB index_col
2 7 1
2 7 1
2 7 1
2 7 1
1 7 1
1 7 1
1 7 1
89 5 2
89 5 2
89 5 2
88 5 2
88 5 2
70 5 3
70 5 3
70 5 3
69 5 3
69 5 3
44 4 4
44 4 4
44 4 4
43 4 4
42 4 5
42 4 5
41 4 5
41 4 5
120 1 6
100 1 7
UPDATE
How can I adapt the code that works for the above df to do the same, but grouping colB values based on colA, colA - 1 and colA - 2 (i.e. considering 3 days instead of 2)?
new_df = read.table(text = 'colA colB
3 10
3 10
3 10
2 10
2 10
2 10
2 10
1 10
1 10
1 10
90 7
90 7
89 7
89 7
89 7
88 7
88 7
71 7
71 7
70 7
70 7
70 7
69 7
69 7
44 5
44 5
44 5
43 5
42 5
41 5
41 5
41 5
40 5
40 5
120 1
100 1', header = TRUE)
Expected output:
colA colB index_col
3 10 1
3 10 1
3 10 1
2 10 1
2 10 1
2 10 1
2 10 1
1 10 1
1 10 1
1 10 1
90 7 2
90 7 2
89 7 2
89 7 2
89 7 2
88 7 2
88 7 2
71 7 3
71 7 3
70 7 3
70 7 3
70 7 3
69 7 3
69 7 3
44 5 4
44 5 4
44 5 4
43 5 4
42 5 4
41 5 5
41 5 5
41 5 5
40 5 5
40 5 5
120 1 6
100 1 7
Thanks
We can use rleid from data.table:
library(data.table)
index_col <- setDT(df)[, if (colB[1L] < .N) ((seq_len(.N) - 1) %/% colB[1L]) + 1
                       else as.numeric(colB), rleid(colB)][, rleid(V1)]
df[, index_col := index_col]
df
# colA colB index_col
# 1: 2 7 1
# 2: 2 7 1
# 3: 2 7 1
# 4: 2 7 1
# 5: 1 7 1
# 6: 1 7 1
# 7: 1 7 1
# 8: 70 5 2
# 9: 70 5 2
#10: 70 5 2
#11: 69 5 2
#12: 69 5 2
#13: 89 5 3
#14: 89 5 3
#15: 89 5 3
#16: 88 5 3
#17: 88 5 3
#18: 120 1 4
#19: 100 1 5
Or a one-liner would be
setDT(df)[, index_col := df[, ((seq_len(.N)-1) %/% colB[1L])+1, rleid(colB)][, as.integer(interaction(.SD, drop = TRUE, lex.order = TRUE))]]
Update
Based on the new update in the OP's post
setDT(new_df)[, index_col := cumsum(c(TRUE, abs(diff(colA))> 1))
][, colB := .N , index_col]
new_df
# colA colB index_col
# 1: 3 10 1
# 2: 3 10 1
# 3: 3 10 1
# 4: 2 10 1
# 5: 2 10 1
# 6: 2 10 1
# 7: 2 10 1
# 8: 1 10 1
# 9: 1 10 1
#10: 1 10 1
#11: 71 7 2
#12: 71 7 2
#13: 70 7 2
#14: 70 7 2
#15: 70 7 2
#16: 69 7 2
#17: 69 7 2
#18: 90 7 3
#19: 90 7 3
#20: 89 7 3
#21: 89 7 3
#22: 89 7 3
#23: 88 7 3
#24: 88 7 3
#25: 44 2 4
#26: 43 2 4
#27: 120 1 5
#28: 100 1 6
An approach in base R:
df$idxcol <- cumsum(c(1,abs(diff(df$colA)) > 1) + c(0,diff(df$colB) != 0) > 0)
which gives:
> df
colA colB idxcol
1 2 7 1
2 2 7 1
3 2 7 1
4 2 7 1
5 1 7 1
6 1 7 1
7 1 7 1
8 70 5 2
9 70 5 2
10 70 5 2
11 69 5 2
12 69 5 2
13 89 5 3
14 89 5 3
15 89 5 3
16 88 5 3
17 88 5 3
18 120 1 4
19 100 1 5
On the updated example data, you need to adapt the approach to:
n <- 1
idx1 <- cumsum(c(1, diff(df$colA) < -n) + c(0, diff(df$colB) != 0) > 0)
idx2 <- ave(df$colA, cumsum(c(1, diff(df$colA) < -n)), FUN = function(x) c(0, cumsum(diff(x)) < -n ))
idx2[idx2==1 & c(0,diff(idx2))==0] <- 0
df$idxcol <- idx1 + cumsum(idx2)
which gives:
> df
colA colB idxcol
1 2 7 1
2 2 7 1
3 2 7 1
4 2 7 1
5 1 7 1
6 1 7 1
7 1 7 1
8 89 5 2
9 89 5 2
10 89 5 2
11 88 5 2
12 88 5 2
13 70 5 3
14 70 5 3
15 70 5 3
16 69 5 3
17 69 5 3
18 44 4 4
19 44 4 4
20 44 4 4
21 43 4 4
22 42 4 5
23 42 4 5
24 41 4 5
25 41 4 5
26 120 1 6
27 100 1 7
For new_df, just change n to 2 and you will get the desired output for that as well.

Quartiles by group saved as new variable in data frame

I have data that look something like this:
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9)
yr <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
gr <- c(3,4,5,3,4,5,3,4,5,4,5,6,4,5,6,4,5,6,5,6,7,5,6,7,5,6,7)
x <- c(33,48,31,41,31,36,25,38,28,17,39,53,60,60,19,39,34,47,20,28,38,15,17,49,48,45,39)
df <- data.frame(id,yr,gr,x)
id yr gr x
1 1 1 3 33
2 1 2 4 48
3 1 3 5 31
4 2 1 3 41
5 2 2 4 31
6 2 3 5 36
7 3 1 3 25
8 3 2 4 38
9 3 3 5 28
10 4 1 4 17
11 4 2 5 39
12 4 3 6 53
13 5 1 4 60
14 5 2 5 60
15 5 3 6 19
16 6 1 4 39
17 6 2 5 34
18 6 3 6 47
19 7 1 5 20
20 7 2 6 28
21 7 3 7 38
22 8 1 5 15
23 8 2 6 17
24 8 3 7 49
25 9 1 5 48
26 9 2 6 45
27 9 3 7 39
I would like to create a new variable in the data frame that contains the quantiles of "x" computed within each unique combination of "yr" and "gr". That is, rather than finding the quantiles of "x" based on all 27 rows of data in the example, I would like to compute the quantiles by two grouping variables: yr and gr. For instance, the quantiles of "x" when yr = 1 and gr = 3, yr = 1 and gr = 4, etc.
Once these values are computed, I would like them to be appended to the data frame as a single column, say "x_quant".
I am able to split the data into the separate groups I need, and I know how to calculate quantiles, but I am having trouble combining the two steps in a way that is amenable to creating a new column in the existing data frame.
Any help y'all can provide would be greatly appreciated! Thank you much!
~kj
# turn "yr" and "gr" into sortable column
df$y <- paste(df$yr,"",df$gr)
df.ordered <- df[order(df$y),] #sort df based on group
grp <- split(df.ordered,df.ordered$y);grp
# get quantiles and turn results into string
q <- vector('list')
for (i in 1:length(grp)) {
a <- quantile(grp[[i]]$x)
q[i] <- paste(a[1],"",a[2],"",a[3],"",a[4],"",a[5])
}
x_quant <- unlist(sapply(q, `[`, 1))
x_quant <- rep(x_quant,each=3)
# append quantile back to data frame. Gave new column a more descriptive name
df.ordered$xq_0_25_50_75_100 <- x_quant
df.ordered$y <- NULL
df <- df.ordered; df
Output:
> # turn "yr" and "gr" into sortable column
> df$y <- paste(df$yr,"",df$gr)
> df.ordered <- df[order(df$y),] #sort df based on group
> grp <- split(df.ordered,df.ordered$y);grp
$`1 3`
id yr gr x y
1 1 1 3 33 1 3
4 2 1 3 41 1 3
7 3 1 3 25 1 3
$`1 4`
id yr gr x y
10 4 1 4 17 1 4
13 5 1 4 60 1 4
16 6 1 4 39 1 4
$`1 5`
id yr gr x y
19 7 1 5 20 1 5
22 8 1 5 15 1 5
25 9 1 5 48 1 5
$`2 4`
id yr gr x y
2 1 2 4 48 2 4
5 2 2 4 31 2 4
8 3 2 4 38 2 4
$`2 5`
id yr gr x y
11 4 2 5 39 2 5
14 5 2 5 60 2 5
17 6 2 5 34 2 5
$`2 6`
id yr gr x y
20 7 2 6 28 2 6
23 8 2 6 17 2 6
26 9 2 6 45 2 6
$`3 5`
id yr gr x y
3 1 3 5 31 3 5
6 2 3 5 36 3 5
9 3 3 5 28 3 5
$`3 6`
id yr gr x y
12 4 3 6 53 3 6
15 5 3 6 19 3 6
18 6 3 6 47 3 6
$`3 7`
id yr gr x y
21 7 3 7 38 3 7
24 8 3 7 49 3 7
27 9 3 7 39 3 7
> # get quantiles and turn results into string
> q <- vector('list')
> for (i in 1:length(grp)) {
+ a <- quantile(grp[[i]]$x)
+ q[i] <- paste(a[1],"",a[2],"",a[3],"",a[4],"",a[5])
+ }
> x_quant <- unlist(sapply(q, `[`, 1))
> x_quant <- rep(x_quant,each=3)
> # append quantile back to data frame
> df.ordered$xq_0_25_50_75_100 <- x_quant
> df.ordered$y <- NULL
> df <- df.ordered
> df
id yr gr x xq_0_25_50_75_100
1 1 1 3 33 25 29 33 37 41
4 2 1 3 41 25 29 33 37 41
7 3 1 3 25 25 29 33 37 41
10 4 1 4 17 17 28 39 49.5 60
13 5 1 4 60 17 28 39 49.5 60
16 6 1 4 39 17 28 39 49.5 60
19 7 1 5 20 15 17.5 20 34 48
22 8 1 5 15 15 17.5 20 34 48
25 9 1 5 48 15 17.5 20 34 48
2 1 2 4 48 31 34.5 38 43 48
5 2 2 4 31 31 34.5 38 43 48
8 3 2 4 38 31 34.5 38 43 48
11 4 2 5 39 34 36.5 39 49.5 60
14 5 2 5 60 34 36.5 39 49.5 60
17 6 2 5 34 34 36.5 39 49.5 60
20 7 2 6 28 17 22.5 28 36.5 45
23 8 2 6 17 17 22.5 28 36.5 45
26 9 2 6 45 17 22.5 28 36.5 45
3 1 3 5 31 28 29.5 31 33.5 36
6 2 3 5 36 28 29.5 31 33.5 36
9 3 3 5 28 28 29.5 31 33.5 36
12 4 3 6 53 19 33 47 50 53
15 5 3 6 19 19 33 47 50 53
18 6 3 6 47 19 33 47 50 53
21 7 3 7 38 38 38.5 39 44 49
24 8 3 7 49 38 38.5 39 44 49
27 9 3 7 39 38 38.5 39 44 49
>
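A shorter route (a sketch; the per-quantile column names and the probe loop are my own choices, not from the post) uses base ave() to compute each quantile within every (yr, gr) cell, giving numeric columns instead of one pasted string:

```r
df <- data.frame(
  id = rep(1:9, each = 3),
  yr = rep(1:3, 9),
  gr = c(3, 4, 5, 3, 4, 5, 3, 4, 5, 4, 5, 6, 4, 5, 6, 4, 5, 6,
         5, 6, 7, 5, 6, 7, 5, 6, 7),
  x  = c(33, 48, 31, 41, 31, 36, 25, 38, 28, 17, 39, 53, 60, 60, 19,
         39, 34, 47, 20, 28, 38, 15, 17, 49, 48, 45, 39)
)
# one numeric column per quantile, computed within each (yr, gr) cell
for (p in c(0, 0.25, 0.5, 0.75, 1)) {
  df[[paste0("x_q", p * 100)]] <-
    ave(df$x, df$yr, df$gr, FUN = function(v) quantile(v, p))
}
df[1, c("x_q0", "x_q25", "x_q50", "x_q75", "x_q100")]  # 25 29 33 37 41
```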

merge data tables in R

My apologies for this simple question. Basically, I want to make three separate cumsum() tables and merge them together by the first table. For example:
a <- cumsum(table(df$variable))
b <- cumsum(table(df$variable[c(TRUE, FALSE)]))
c <- cumsum(table(df$variable[c(FALSE, TRUE)]))
Where a is the cumsum of the entire vector of df$variable, b is the cumsum of the odd-numbered values of df$variable, c is the cumsum of the even-numbered values of df$variable. Another way of interpreting this is that combining b and c produces a.
This is the entire vector of numbers.
[1] 18 17 15 10 5 0 10 10 0 10 15 5 5 5 25 15 13 0 0 0 25 18 15 15 1 4 5
[28] 5 5 15 5 12 15 0 3 12 20 0 5 5 13 10 10 10 3 15 13 20 12 60 10 10 2 0
[55] 5 10 8 4 0 15 5 5 15 5 0 5 2 8 5 5 5 5 9 9 3 7 20 25 5 4 10
[82] 10 2 4 5 5 18 8 0 10 5 5 7 12 5 13 26 20 13 21 5 15 10 10 5 15 5 15
[109] 0 1 13 21 25 25 5 14 5 15 10 0 5 15 3 4 5 15 15 5 25 25 5 15 0 2 13
[136] 22 2 10 3 3 15 11 0 2 40 35 24 24 5 5 10 5 16 0 17 19 20 5 5 5 0 15
[163] 3 13 20 4 5 5 3 19 25 25 0 15 5 3 22 22 25 5 15 15 5 15 17 9 5 5 15
[190] 10
For a, I used cbind(cumsum(table(df$variable)))
0 18
1 20
2 26
3 35
4 41
5 88
7 90
8 93
9 96
10 115
11 116
12 120
13 128
14 129
15 154
16 155
17 158
18 161
19 163
20 169
21 171
22 174
24 176
25 186
26 187
35 188
40 189
60 190
For b, I used cbind(cumsum(table(df$variable[c(TRUE, FALSE)])))
0 10
1 11
2 15
3 22
5 50
7 51
8 52
9 53
10 60
12 61
13 67
15 76
16 77
17 79
18 81
20 85
22 86
24 87
25 93
26 94
40 95
For c, I used cbind(cumsum(table(df$variable[c(FALSE, TRUE)])))
0 8
1 9
2 11
3 13
4 19
5 38
7 39
8 41
9 43
10 55
11 56
12 59
13 61
14 62
15 78
17 79
18 80
19 82
20 84
21 86
22 88
24 89
25 93
35 94
60 95
In frequency form, the distributions should look something like this.
a b c
0 18 10 8
1 2 1 1
2 6 4 2
3 9 7 2
4 6 0 6
5 47 28 19
7 2 1 1
8 3 1 2
9 3 1 2
10 19 7 12
11 1 0 1
12 4 1 3
13 8 6 2
14 1 0 1
15 25 9 16
16 1 1 0
17 3 2 1
18 3 2 1
19 2 0 2
20 6 4 2
21 2 0 2
22 3 1 2
24 2 1 1
25 10 6 4
26 1 1 0
35 1 0 1
40 1 1 0
60 1 0 1
190 95 95
But I want it in cumsum() form, such that it should look something like this. I wrote out the first 6 rows as illustration.
a b c
0 18 10 8
1 20 11 9
2 26 15 11
3 35 22 13
4 41 22 19
5 88 50 38
7 90 51 39
The problem I've been having is that the subsets b and c don't have all the values (i.e., some values have zero frequency), which shortens those vectors; as a result, I'm unable to properly merge or cbind() these values.
Any suggestion is greatly appreciated.
You could probably get there using match quite easily. Assuming your data is:
set.seed(1)
df <- data.frame(variable=rbinom(10,prob=0.5,size=3))
Something like this seems to work
out <- data.frame(a,b=b[match(names(a),names(b))],c=c[match(names(a),names(c))])
replace(out,is.na(out),0)
# a b c
#0 1 0 1
#1 4 2 2
#2 7 4 3
#3 10 5 5
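End to end, with the answer's toy data (a sketch; one caveat worth hedging: replacing the NAs with 0 is only correct when the missing value precedes everything in that subset, since a gap in the middle of a cumulative table would need the previous cumulative count carried forward instead):

```r
set.seed(1)
df <- data.frame(variable = rbinom(10, prob = 0.5, size = 3))
a <- cumsum(table(df$variable))                  # all positions
b <- cumsum(table(df$variable[c(TRUE, FALSE)]))  # odd positions
c <- cumsum(table(df$variable[c(FALSE, TRUE)]))  # even positions
# align b and c to a's value labels; unmatched values come back as NA
out <- data.frame(a,
                  b = b[match(names(a), names(b))],
                  c = c[match(names(a), names(c))])
out <- replace(out, is.na(out), 0)
```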
