add values of one group into another group in R

I have a question about how to add the value from one row of a group to the rest of the rows in the group, and then delete that row. For example:
df <- data.frame(Year    = rep(1:2, each = 21),
                 Cluster = c(rep("a", 20), "c", rep("b", 20), "d"),
                 Seed    = rep(c(rep(1:3, each = 5), rep(99, 6)), 2),
                 Day     = rep(c(rep(1:5, 4), 1), 2),
                 value   = rep(c(5,2,1,2,8, 6,7,9,3,5, 2,1,2,8,6, 55,66,77,88,99, 10), 2))
In the above example, my data is grouped by Year, Cluster, Seed and Day, and each Seed = 99 value needs to be added to the other rows of the same (Year, Cluster, Day) group, after which the Seed = 99 row is deleted. For example, row 16 belongs to the (Year = 1, Cluster = a, Day = 1, Seed = 99) group, so its value of 55 should be added to row 1 (5 + 55), row 6 (6 + 55) and row 11 (2 + 55), and row 16 should then be deleted. Row 21, however, which is in Cluster = c with Seed = 99, should remain in the data as is because it has no matching (Year, Cluster, Day) combination.
My actual data has 1 million records spanning 10 years, 80 clusters, 500 days and 10+1 seeds (1 to 10, plus 99), so I am looking for an efficient solution. Expected output:
Year Cluster Seed Day value
1 1 a 1 1 60
2 1 a 1 2 68
3 1 a 1 3 78
4 1 a 1 4 90
5 1 a 1 5 107
6 1 a 2 1 61
7 1 a 2 2 73
8 1 a 2 3 86
9 1 a 2 4 91
10 1 a 2 5 104
11 1 a 3 1 57
12 1 a 3 2 67
13 1 a 3 3 79
14 1 a 3 4 96
15 1 a 3 5 105
16 1 c 99 1 10
17 2 b 1 1 60
18 2 b 1 2 68
19 2 b 1 3 78
20 2 b 1 4 90
21 2 b 1 5 107
22 2 b 2 1 61
23 2 b 2 2 73
24 2 b 2 3 86
25 2 b 2 4 91
26 2 b 2 5 104
27 2 b 3 1 57
28 2 b 3 2 67
29 2 b 3 3 79
30 2 b 3 4 96
31 2 b 3 5 105
32 2 d 99 1 10

A data.table approach:
library(data.table)
df <- setDT(df)[, `:=` (value = ifelse(Seed != 99, value + value[Seed == 99], value),
                        flag  = Seed == 99 & .N == 1),
                by = .(Year, Cluster, Day)][!(Seed == 99 & flag == FALSE), ][, "flag" := NULL]
Output:
df[]
Year Cluster Seed Day value
1: 1 a 1 1 60
2: 1 a 1 2 68
3: 1 a 1 3 78
4: 1 a 1 4 90
5: 1 a 1 5 107
6: 1 a 2 1 61
7: 1 a 2 2 73
8: 1 a 2 3 86
9: 1 a 2 4 91
10: 1 a 2 5 104
11: 1 a 3 1 57
12: 1 a 3 2 67
13: 1 a 3 3 79
14: 1 a 3 4 96
15: 1 a 3 5 105
16: 1 c 99 1 10
17: 2 b 1 1 60
18: 2 b 1 2 68
19: 2 b 1 3 78
20: 2 b 1 4 90
21: 2 b 1 5 107
22: 2 b 2 1 61
23: 2 b 2 2 73
24: 2 b 2 3 86
25: 2 b 2 4 91
26: 2 b 2 5 104
27: 2 b 3 1 57
28: 2 b 3 2 67
29: 2 b 3 3 79
30: 2 b 3 4 96
31: 2 b 3 5 105
32: 2 d 99 1 10
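For data this size, an update join may be cheaper than the per-group ifelse() above, since it avoids materializing value[Seed == 99] inside every group. This is only a sketch of an alternative, rebuilding the question's data inline; rest, adj, orphans and result are names introduced here for illustration:

```r
library(data.table)

# the question's data (Seed/Day/value repeated for both years)
df <- data.frame(Year    = rep(1:2, each = 21),
                 Cluster = c(rep("a", 20), "c", rep("b", 20), "d"),
                 Seed    = rep(c(rep(1:3, each = 5), rep(99, 6)), 2),
                 Day     = rep(c(rep(1:5, 4), 1), 2),
                 value   = rep(c(5,2,1,2,8, 6,7,9,3,5, 2,1,2,8,6,
                                 55,66,77,88,99, 10), 2))
setDT(df)

adj  <- df[Seed == 99, .(Year, Cluster, Day, add = value)]   # the adjustment rows
rest <- df[Seed != 99]

# update join: add each seed-99 value to its matching (Year, Cluster, Day) rows
rest[adj, on = .(Year, Cluster, Day), value := value + i.add]

# seed-99 rows with no non-99 partner (clusters c and d) survive unchanged
orphans <- df[Seed == 99][!rest, on = .(Year, Cluster, Day)]
result  <- rbind(rest, orphans)
```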

Here's an approach using the tidyverse. If you're looking for speed with a million rows, a data.table solution will probably perform better.
library(tidyverse)
df <- data.frame(Year    = rep(1:2, each = 21),
                 Cluster = c(rep("a", 20), "c", rep("b", 20), "d"),
                 Seed    = rep(c(rep(1:3, each = 5), rep(99, 6)), 2),
                 Day     = rep(c(rep(1:5, 4), 1), 2),
                 value   = rep(c(5,2,1,2,8, 6,7,9,3,5, 2,1,2,8,6, 55,66,77,88,99, 10), 2))
seeds <- df %>%
  filter(Seed == 99)

matches <- df %>%
  filter(Seed != 99) %>%
  inner_join(select(seeds, -Seed), by = c("Year", "Cluster", "Day")) %>%
  mutate(value = value.x + value.y) %>%
  select(Year, Cluster, Seed, Day, value)

no_matches <- anti_join(seeds, matches, by = c("Year", "Cluster", "Day"))

bind_rows(matches, no_matches) %>%
  arrange(Year, Cluster, Seed, Day)
#> Year Cluster Seed Day value
#> 1 1 a 1 1 60
#> 2 1 a 1 2 68
#> 3 1 a 1 3 78
#> 4 1 a 1 4 90
#> 5 1 a 1 5 107
#> 6 1 a 2 1 61
#> 7 1 a 2 2 73
#> 8 1 a 2 3 86
#> 9 1 a 2 4 91
#> 10 1 a 2 5 104
#> 11 1 a 3 1 57
#> 12 1 a 3 2 67
#> 13 1 a 3 3 79
#> 14 1 a 3 4 96
#> 15 1 a 3 5 105
#> 16 1 c 99 1 10
#> 17 2 b 1 1 60
#> 18 2 b 1 2 68
#> 19 2 b 1 3 78
#> 20 2 b 1 4 90
#> 21 2 b 1 5 107
#> 22 2 b 2 1 61
#> 23 2 b 2 2 73
#> 24 2 b 2 3 86
#> 25 2 b 2 4 91
#> 26 2 b 2 5 104
#> 27 2 b 3 1 57
#> 28 2 b 3 2 67
#> 29 2 b 3 3 79
#> 30 2 b 3 4 96
#> 31 2 b 3 5 105
#> 32 2 d 99 1 10
Created on 2018-11-23 by the reprex package (v0.2.1)

Related

How to use dplyr & casewhen, across groups and rows, with three outcomes?

This seems like a simple question to me, but I'm super stuck on it! My data looks like this:
Name round MatchNumber Score
<chr> <int> <int> <dbl>
1 A 1 1 48
2 B 1 1 66
3 C 1 2 74
4 D 1 2 62
5 E 1 3 61
6 F 1 3 63
7 G 1 4 63
8 H 1 4 63
9 E 2 1 51
10 D 2 1 59
11 A 2 2 50
12 H 2 2 78
13 B 2 3 51
14 G 2 3 47
15 C 2 4 72
16 F 2 4 73
All I want to do is create a new column, Outcome, from Score, designating a Win, Loss or Draw for every name, round and match. Ideally this would be done via dplyr, likely with case_when, but I just can't get my head around the row-wise calculation and grouping. I've tried (but am stuck at) the following:
MatchOutcome <- ExampleData %>%
  arrange(round, MatchNumber) %>%
  group_by(Name, round, MatchNumber) %>%
  mutate(Outcome = Score)
My ideal output would look like:
Name round MatchNumber Score Outcome
<chr> <int> <int> <dbl> <chr>
1 A 1 1 48 Loss
2 B 1 1 66 Win
3 C 1 2 74 Win
4 D 1 2 62 Loss
5 E 1 3 61 Loss
6 F 1 3 63 Win
7 G 1 4 63 Draw
8 H 1 4 63 Draw
9 E 2 1 51 Loss
10 D 2 1 59 Win
11 A 2 2 50 Loss
12 H 2 2 78 Win
13 B 2 3 51 Win
14 G 2 3 47 Loss
15 C 2 4 72 Loss
16 F 2 4 73 Win
Maybe something like this?
ExampleData %>%
  group_by(round, MatchNumber) %>%
  mutate(Outcome = case_when(Score == mean(Score) ~ "Draw",
                             Score == max(Score) ~ "Win",
                             TRUE ~ "Loss")) %>%
  ungroup()
# A tibble: 16 x 5
Name round MatchNumber Score Outcome
<chr> <int> <int> <int> <chr>
1 A 1 1 48 Loss
2 B 1 1 66 Win
3 C 1 2 74 Win
4 D 1 2 62 Loss
5 E 1 3 61 Loss
6 F 1 3 63 Win
7 G 1 4 63 Draw
8 H 1 4 63 Draw
9 E 2 1 51 Loss
10 D 2 1 59 Win
11 A 2 2 50 Loss
12 H 2 2 78 Win
13 B 2 3 51 Win
14 G 2 3 47 Loss
15 C 2 4 72 Loss
16 F 2 4 73 Win
Data:
ExampleData <- read.table(text = "Name round MatchNumber Score
1 A 1 1 48
2 B 1 1 66
3 C 1 2 74
4 D 1 2 62
5 E 1 3 61
6 F 1 3 63
7 G 1 4 63
8 H 1 4 63
9 E 2 1 51
10 D 2 1 59
11 A 2 2 50
12 H 2 2 78
13 B 2 3 51
14 G 2 3 47
15 C 2 4 72
16 F 2 4 73")
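The mean()-based draw check above works because each match has exactly two players. A variant using n_distinct() makes the draw condition explicit and would also generalize to matches with more rows; this is just a sketch, rebuilding the example data compactly (out is a name introduced here):

```r
library(dplyr)

# the example data, rebuilt compactly
ExampleData <- data.frame(
  Name  = c("A","B","C","D","E","F","G","H", "E","D","A","H","B","G","C","F"),
  round = rep(1:2, each = 8),
  MatchNumber = rep(rep(1:4, each = 2), 2),
  Score = c(48,66,74,62,61,63,63,63, 51,59,50,78,51,47,72,73)
)

out <- ExampleData %>%
  group_by(round, MatchNumber) %>%
  mutate(Outcome = case_when(
    n_distinct(Score) == 1 ~ "Draw",  # every score in the match is the same
    Score == max(Score)    ~ "Win",
    TRUE                   ~ "Loss"
  )) %>%
  ungroup()
```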

count row number first and then insert new row by condition [duplicate]

This question already has answers here:
How to create missing value for repeated measurement data?
(2 answers)
Closed 4 years ago.
I need to count the number of rows in each group after a group_by and, if a group has fewer than 6 rows, pad it with new row(s) up to 6 rows.
My df has three variables (v1, v2, v3): v1 = group name, v2 = row number (i.e., 1,2,3,4,5,6). In the new row(s), I want to repeat the v1 value, have v2 continue the row numbering, and set v3 = NA.
sample df
v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32
expected output
v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
2 5 NA #insert
2 6 NA #insert
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32
3 6 NA #insert
I tried to count the row numbers first with dplyr, but I don't know whether or how I can add this if-else condition using the pipe. Or is there another, easier function?
My code
df %>%
  group_by(v1) %>%
  dplyr::summarise(N = n()) %>%
  if (N < 6) {
    # sth like that?
  }
Thanks!
We can use complete
library(tidyverse)
complete(df, v1, v2)
# A tibble: 18 x 3
# v1 v2 v3
# <int> <int> <int>
# 1 1 1 79
# 2 1 2 32
# 3 1 3 53
# 4 1 4 33
# 5 1 5 76
# 6 1 6 11
# 7 2 1 32
# 8 2 2 42
# 9 2 3 44
#10 2 4 12
#11 2 5 NA
#12 2 6 NA
#13 3 1 22
#14 3 2 12
#15 3 3 12
#16 3 4 67
#17 3 5 32
#18 3 6 NA
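Note that complete(df, v1, v2) fills with the v2 values observed anywhere in the data, so it only reaches 6 rows per group because group 1 already contains v2 = 6. Pinning the range explicitly is safer if no group is guaranteed to be full; a sketch, rebuilding the sample data inline (out is a name introduced here):

```r
library(tidyr)

df <- data.frame(v1 = rep(1:3, times = c(6, 4, 5)),
                 v2 = c(1:6, 1:4, 1:5),
                 v3 = c(79,32,53,33,76,11, 32,42,44,12, 22,12,12,67,32))

# spell out the target row numbers instead of relying on observed v2 values
out <- complete(df, v1, v2 = 1:6)
```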
Here is a way to do it using merge.
df <- read.table(text =
"v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32", header = T)
toMerge <- data.frame(v1 = rep(1:3, each = 6), v2 = rep(1:6, times = 3))
m <- merge(toMerge, df, by = c("v1", "v2"), all.x = T)
m
v1 v2 v3
1 1 1 79
2 1 2 32
3 1 3 53
4 1 4 33
5 1 5 76
6 1 6 11
7 2 1 32
8 2 2 42
9 2 3 44
10 2 4 12
11 2 5 NA
12 2 6 NA
13 3 1 22
14 3 2 12
15 3 3 12
16 3 4 67
17 3 5 32
18 3 6 NA
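The hard-coded rep(1:3, each = 6) in toMerge breaks if the group labels change; the same grid can be derived from the data itself. A sketch, with the sample data rebuilt inline:

```r
df <- data.frame(v1 = rep(1:3, times = c(6, 4, 5)),
                 v2 = c(1:6, 1:4, 1:5),
                 v3 = c(79,32,53,33,76,11, 32,42,44,12, 22,12,12,67,32))

# build the full (v1, v2) grid from the data instead of hard-coding group counts
toMerge <- expand.grid(v1 = unique(df$v1), v2 = 1:6)
m <- merge(toMerge, df, by = c("v1", "v2"), all.x = TRUE)
```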

(Update) Add index column to data.frame based on two columns

Example data.frame:
df = read.table(text = 'colA colB
2 7
2 7
2 7
2 7
1 7
1 7
1 7
89 5
89 5
89 5
88 5
88 5
70 5
70 5
70 5
69 5
69 5
44 4
44 4
44 4
43 4
42 4
42 4
41 4
41 4
120 1
100 1', header = TRUE)
I need to add an index column based on colA and colB, where colB gives the exact number of rows in a group (though the same colB value can occur in more than one group). colB groups rows based on colA and colA - 1.
Expected output:
colA colB index_col
2 7 1
2 7 1
2 7 1
2 7 1
1 7 1
1 7 1
1 7 1
89 5 2
89 5 2
89 5 2
88 5 2
88 5 2
70 5 3
70 5 3
70 5 3
69 5 3
69 5 3
44 4 4
44 4 4
44 4 4
43 4 4
42 4 5
42 4 5
41 4 5
41 4 5
120 1 6
100 1 7
UPDATE
How can I adapt the code that works for the above df for the same purpose, but looking at colB values grouped based on colA, colA - 1 and colA - 2? i.e., considering 3 days instead of 2:
new_df = read.table(text = 'colA colB
3 10
3 10
3 10
2 10
2 10
2 10
2 10
1 10
1 10
1 10
90 7
90 7
89 7
89 7
89 7
88 7
88 7
71 7
71 7
70 7
70 7
70 7
69 7
69 7
44 5
44 5
44 5
43 5
42 5
41 5
41 5
41 5
40 5
40 5
120 1
100 1', header = TRUE)
Expected output:
colA colB index_col
3 10 1
3 10 1
3 10 1
2 10 1
2 10 1
2 10 1
2 10 1
1 10 1
1 10 1
1 10 1
90 7 2
90 7 2
89 7 2
89 7 2
89 7 2
88 7 2
88 7 2
71 7 3
71 7 3
70 7 3
70 7 3
70 7 3
69 7 3
69 7 3
44 5 4
44 5 4
44 5 4
43 5 4
42 5 4
41 5 5
41 5 5
41 5 5
40 5 5
40 5 5
120 1 6
100 1 7
Thanks
We can use rleid
library(data.table)
index_col <- setDT(df)[, if (colB[1L] < .N) ((seq_len(.N) - 1) %/% colB[1L]) + 1
                       else as.numeric(colB), rleid(colB)][, rleid(V1)]
df[, index_col := index_col]
df
# colA colB index_col
# 1: 2 7 1
# 2: 2 7 1
# 3: 2 7 1
# 4: 2 7 1
# 5: 1 7 1
# 6: 1 7 1
# 7: 1 7 1
# 8: 70 5 2
# 9: 70 5 2
#10: 70 5 2
#11: 69 5 2
#12: 69 5 2
#13: 89 5 3
#14: 89 5 3
#15: 89 5 3
#16: 88 5 3
#17: 88 5 3
#18: 120 1 4
#19: 100 1 5
Or a one-liner would be
setDT(df)[, index_col := df[, ((seq_len(.N)-1) %/% colB[1L])+1, rleid(colB)][, as.integer(interaction(.SD, drop = TRUE, lex.order = TRUE))]]
Update
Based on the new update in the OP's post
setDT(new_df)[, index_col := cumsum(c(TRUE, abs(diff(colA)) > 1))
              ][, colB := .N, index_col]
new_df
# colA colB index_col
# 1: 3 10 1
# 2: 3 10 1
# 3: 3 10 1
# 4: 2 10 1
# 5: 2 10 1
# 6: 2 10 1
# 7: 2 10 1
# 8: 1 10 1
# 9: 1 10 1
#10: 1 10 1
#11: 71 7 2
#12: 71 7 2
#13: 70 7 2
#14: 70 7 2
#15: 70 7 2
#16: 69 7 2
#17: 69 7 2
#18: 90 7 3
#19: 90 7 3
#20: 89 7 3
#21: 89 7 3
#22: 89 7 3
#23: 88 7 3
#24: 88 7 3
#25: 44 2 4
#26: 43 2 4
#27: 120 1 5
#28: 100 1 6
An approach in base R:
df$idxcol <- cumsum(c(1,abs(diff(df$colA)) > 1) + c(0,diff(df$colB) != 0) > 0)
which gives:
> df
colA colB idxcol
1 2 7 1
2 2 7 1
3 2 7 1
4 2 7 1
5 1 7 1
6 1 7 1
7 1 7 1
8 70 5 2
9 70 5 2
10 70 5 2
11 69 5 2
12 69 5 2
13 89 5 3
14 89 5 3
15 89 5 3
16 88 5 3
17 88 5 3
18 120 1 4
19 100 1 5
On the updated example data, you need to adapt the approach to:
n <- 1
idx1 <- cumsum(c(1, diff(df$colA) < -n) + c(0, diff(df$colB) != 0) > 0)
idx2 <- ave(df$colA, cumsum(c(1, diff(df$colA) < -n)), FUN = function(x) c(0, cumsum(diff(x)) < -n ))
idx2[idx2==1 & c(0,diff(idx2))==0] <- 0
df$idxcol <- idx1 + cumsum(idx2)
which gives:
> df
colA colB idxcol
1 2 7 1
2 2 7 1
3 2 7 1
4 2 7 1
5 1 7 1
6 1 7 1
7 1 7 1
8 89 5 2
9 89 5 2
10 89 5 2
11 88 5 2
12 88 5 2
13 70 5 3
14 70 5 3
15 70 5 3
16 69 5 3
17 69 5 3
18 44 4 4
19 44 4 4
20 44 4 4
21 43 4 4
22 42 4 5
23 42 4 5
24 41 4 5
25 41 4 5
26 120 1 6
27 100 1 7
For new_df, just change n to 2 and you will get the desired output for that as well.

How to create a matrix in simple correspondence analysis?

I am trying to create a matrix in order to apply a simple correspondence analysis on it; I have 2 categorical variables: exp and conexinternet with 3 levels each.
obs conexinternet exp
1 1 2
2 1 1
3 2 2
4 1 1
5 1 1
6 2 1
7 1 2
8 1 2
9 1 2
10 2 1
11 1 1
12 2 1
13 2 2
14 2 1
15 1 1
16 2 2
17 1 1
18 2 2
19 2 2
20 2 2
21 2 2
22 1 1
23 2 3
24 1 1
25 2 1
26 2 1
27 1 1
28 2 2
29 2 1
30 1 2
31 1 2
32 2 3
33 2 1
34 2 1
35 2 1
36 3 2
37 2 1
38 3 2
39 2 3
40 2 3
41 2 2
42 2 3
43 2 2
44 2 2
45 2 1
46 2 2
47 2 3
48 1 3
49 2 3
50 3 2
51 2 2
52 2 2
53 2 1
54 1 2
55 1 1
56 2 3
57 3 2
58 3 1
59 3 1
60 1 2
61 2 3
62 2 2
63 3 1
64 3 2
65 3 2
66 1 2
67 3 2
68 3 2
69 3 3
70 2 1
71 3 3
72 3 2
73 3 2
74 3 2
75 3 1
76 3 2
77 3 1
I want to make a vector that categorizes the observations as 11, 12, 13, 21, 22, 23, 31, 32, 33. How can I do it?
Is this what you want?
d <- read.table(text="obs conexinternet exp
1 1 2
...
77 3 1", header=T)
(tab <- xtabs(~conexinternet+exp, d))
# exp
# conexinternet 1 2 3
# 1 10 9 1
# 2 14 15 9
# 3 5 12 2
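The question also asks for a per-observation vector of the 11...33 codes. With the data frame d read in above, a two-digit code can be built directly; this sketch uses only the first few rows for illustration, and cat is a column name introduced here:

```r
d <- data.frame(conexinternet = c(1, 1, 2, 3),
                exp           = c(2, 1, 2, 3))  # first rows only, for illustration

# numeric code: tens digit = conexinternet, units digit = exp
d$cat <- 10 * d$conexinternet + d$exp
# or as a character label: paste0(d$conexinternet, d$exp)
```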

dplyr append group id sequence?

I have a dataset like below. It's created by dplyr and currently grouped by 'Stage'. How do I generate a sequence based on the unique, incremental values of Stage, starting from 1? (E.g. row #4 should be 1, and rows #1 and #8 should both be 4.)
X Y Stage Count
1 61 74 1 2
2 58 56 2 1
3 78 76 0 1
4 100 100 -2 1
5 89 88 -1 1
6 47 44 3 1
7 36 32 4 1
8 75 58 1 2
9 24 21 5 1
10 12 11 6 1
11 0 0 10 1
I tried the approach in the post below, but it didn't work.
how to mutate a column with ID in group
Thanks.
Here is another dplyr solution:
> df
# A tibble: 11 × 4
X Y Stage Count
<dbl> <dbl> <dbl> <dbl>
1 61 74 1 2
2 58 56 2 1
3 78 76 0 1
4 100 100 -2 1
5 89 88 -1 1
6 47 44 3 1
7 36 32 4 1
8 75 58 1 2
9 24 21 5 1
10 12 11 6 1
11 0 0 10 1
To create the group ids, use dplyr's group_indices():
i <- df %>% group_indices(Stage)
df %>% mutate(group = i)
# A tibble: 11 × 5
X Y Stage Count group
<dbl> <dbl> <dbl> <dbl> <int>
1 61 74 1 2 4
2 58 56 2 1 5
3 78 76 0 1 3
4 100 100 -2 1 1
5 89 88 -1 1 2
6 47 44 3 1 6
7 36 32 4 1 7
8 75 58 1 2 4
9 24 21 5 1 8
10 12 11 6 1 9
11 0 0 10 1 10
It would be great if you could pipe both commands together. But, as of this writing, it doesn't appear to be possible.
After some experimenting, I did %>% ungroup() %>% mutate(test = rank(Stage)), which yields the following result.
X Y Stage Count test
1 100 100 -2 1 1.0
2 89 88 -1 1 2.0
3 78 76 0 1 3.0
4 61 74 1 2 4.5
5 75 58 1 2 4.5
6 58 56 2 1 6.0
7 47 44 3 1 7.0
8 36 32 4 1 8.0
9 24 21 5 1 9.0
10 12 11 6 1 10.0
11 0 0 10 1 11.0
I don't know whether this is the best approach, feel free to comment....
update
Another approach, assuming the data is called Node:
lvs <- levels(as.factor(Node$Stage))
Node %>% mutate(Rank = match(Stage,lvs))
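In more recent dplyr versions group_indices() is deprecated; dense_rank() produces the same 1-based index of the distinct Stage values in a single step. A sketch, rebuilding the question's data inline (out is a name introduced here):

```r
library(dplyr)

df <- data.frame(X = c(61,58,78,100,89,47,36,75,24,12,0),
                 Y = c(74,56,76,100,88,44,32,58,21,11,0),
                 Stage = c(1,2,0,-2,-1,3,4,1,5,6,10),
                 Count = c(2,1,1,1,1,1,1,2,1,1,1))

# dense_rank() numbers the distinct Stage values 1, 2, ... in ascending order
out <- df %>% mutate(group = dense_rank(Stage))
```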