How to use dplyr & case_when, across groups and rows, with three outcomes? - r
This seems a simple question to me but I'm super stuck on it! My data looks like this:
Name round MatchNumber Score
<chr> <int> <int> <dbl>
1 A 1 1 48
2 B 1 1 66
3 C 1 2 74
4 D 1 2 62
5 E 1 3 61
6 F 1 3 63
7 G 1 4 63
8 H 1 4 63
9 E 2 1 51
10 D 2 1 59
11 A 2 2 50
12 H 2 2 78
13 B 2 3 51
14 G 2 3 47
15 C 2 4 72
16 F 2 4 73
All I want to do is create a new column Outcome from Score that designates, for every name, round and match, a Win, Loss or Draw. Ideally, this would be done with dplyr, probably via case_when, but I just can't get my head around the row-wise calculation and grouping. I've tried (but am stuck at) the following:
MatchOutcome <- ExampleData %>%
  arrange(round, MatchNumber) %>%
  group_by(Name, round, MatchNumber) %>%
  mutate(Outcome = Score)
My ideal output would look like:
Name round MatchNumber Score Outcome
<chr> <int> <int> <dbl> <chr>
1 A 1 1 48 Loss
2 B 1 1 66 Win
3 C 1 2 74 Win
4 D 1 2 62 Loss
5 E 1 3 61 Loss
6 F 1 3 63 Win
7 G 1 4 63 Draw
8 H 1 4 63 Draw
9 E 2 1 51 Loss
10 D 2 1 59 Win
11 A 2 2 50 Loss
12 H 2 2 78 Win
13 B 2 3 51 Win
14 G 2 3 47 Loss
15 C 2 4 72 Loss
16 F 2 4 73 Win
Maybe something like this?
library(dplyr)

ExampleData %>%
  group_by(round, MatchNumber) %>%  # one group per match
  mutate(Outcome = case_when(Score == mean(Score) ~ "Draw",  # in a two-player match, only tied scores equal the mean
                             Score == max(Score) ~ "Win",
                             TRUE ~ "Loss")) %>%
  ungroup()
# A tibble: 16 x 5
Name round MatchNumber Score Outcome
<chr> <int> <int> <int> <chr>
1 A 1 1 48 Loss
2 B 1 1 66 Win
3 C 1 2 74 Win
4 D 1 2 62 Loss
5 E 1 3 61 Loss
6 F 1 3 63 Win
7 G 1 4 63 Draw
8 H 1 4 63 Draw
9 E 2 1 51 Loss
10 D 2 1 59 Win
11 A 2 2 50 Loss
12 H 2 2 78 Win
13 B 2 3 51 Win
14 G 2 3 47 Loss
15 C 2 4 72 Loss
16 F 2 4 73 Win
Data:
ExampleData <- read.table(text = "Name round MatchNumber Score
1 A 1 1 48
2 B 1 1 66
3 C 1 2 74
4 D 1 2 62
5 E 1 3 61
6 F 1 3 63
7 G 1 4 63
8 H 1 4 63
9 E 2 1 51
10 D 2 1 59
11 A 2 2 50
12 H 2 2 78
13 B 2 3 51
14 G 2 3 47
15 C 2 4 72
16 F 2 4 73")
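The `Score == mean(Score)` test in the answer identifies draws only because every match has exactly two players. As a hedged sketch (not part of the original answer), the variant below labels a player "Draw" when they share the top score of the match with at least one other player, which gives the same result for two-player matches. The data frame `matches` and the name `result` are hypothetical, introduced only for illustration.

```r
library(dplyr)

# Hypothetical sample: two matches from round 1 of the question's data
matches <- data.frame(
  Name        = c("A", "B", "G", "H"),
  round       = c(1L, 1L, 1L, 1L),
  MatchNumber = c(1L, 1L, 4L, 4L),
  Score       = c(48, 66, 63, 63)
)

result <- matches %>%
  group_by(round, MatchNumber) %>%
  mutate(Outcome = case_when(
    # top score shared by more than one player in the match
    Score == max(Score) & sum(Score == max(Score)) > 1 ~ "Draw",
    Score == max(Score)                                ~ "Win",
    TRUE                                               ~ "Loss"
  )) %>%
  ungroup()

print(result)
```

How ties should be treated in matches with three or more players is an assumption here; adjust the first condition if a different rule applies.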
Related
dplyr creating new column based on some condition [duplicate]
I have the following df:
df <- data.frame(geo_num = c(11, 12, 22, 41, 42, 43, 77, 71),
                 cust_id = c("A", "A", "B", "C", "C", "C", "D", "D"),
                 sales = c(2, 3, 2, 1, 2, 4, 6, 3))
> df
  geo_num cust_id sales
1      11       A     2
2      12       A     3
3      22       B     2
4      41       C     1
5      42       C     2
6      43       C     4
7      77       D     6
8      71       D     3
I need to create a new column 'geo_num_new' that holds, for every 'cust_id' group, the first value of 'geo_num', as shown below:
> df_new
  geo_num cust_id sales geo_num_new
1      11       A     2          11
2      12       A     3          11
3      22       B     2          22
4      41       C     1          41
5      42       C     2          41
6      43       C     4          41
7      77       D     6          77
8      71       D     3          77
Thanks.
We could use first after grouping by 'cust_id'. The single value will be recycled for the entire group.
library(dplyr)
df <- df %>%
  group_by(cust_id) %>%
  mutate(geo_num_new = first(geo_num)) %>%
  ungroup()
Output:
df
# A tibble: 8 x 4
  geo_num cust_id sales geo_num_new
    <dbl> <chr>   <dbl>       <dbl>
1      11 A           2          11
2      12 A           3          11
3      22 B           2          22
4      41 C           1          41
5      42 C           2          41
6      43 C           4          41
7      77 D           6          77
8      71 D           3          77
Or use data.table:
library(data.table)
setDT(df)[, geo_num_new := first(geo_num), by = cust_id]
Or with base R:
df$geo_num_new <- with(df, ave(geo_num, cust_id, FUN = function(x) x[1]))
Or an option with collapse:
library(collapse)
tfm(df, geo_num_new = ffirst(geo_num, g = cust_id, TRA = "replace"))
  geo_num cust_id sales geo_num_new
1      11       A     2          11
2      12       A     3          11
3      22       B     2          22
4      41       C     1          41
5      42       C     2          41
6      43       C     4          41
7      77       D     6          77
8      71       D     3          77
add values of one group into another group in R
I have a question on how to add the value from one row of a group to the rest of the rows in that group and then delete that row. For example:
df <- data.frame(
  Year = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  Cluster = c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","c","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","d"),
  Seed = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,99,99,99,99,99,99),
  Day = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1),
  value = c(5,2,1,2,8,6,7,9,3,5,2,1,2,8,6,55,66,77,88,99,10))
In the above example, my data is grouped by Year, Cluster, Seed and Day. The Seed = 99 values need to be added to the other rows of the same (Year, Cluster, Day) group, and the Seed = 99 row then deleted. For example, row #16 belongs to the (Year = 1, Cluster = a, Day = 1, Seed = 99) group, so its value, 55, should be added to row #1 (5 + 55), row #6 (6 + 55) and row #11 (2 + 55), and row #16 should be deleted. Row #21, however, which is in Cluster = c with Seed = 99, should remain in the data as is, because no other row matches its Year + Cluster + Day combination. My actual data has 1 million records with 10 years, 80 clusters, 500 days and 10 + 1 seeds (1 to 10, plus 99), so I am looking for an efficient solution.
Expected output:
   Year Cluster Seed Day value
1     1       a    1   1    60
2     1       a    1   2    68
3     1       a    1   3    78
4     1       a    1   4    90
5     1       a    1   5   107
6     1       a    2   1    61
7     1       a    2   2    73
8     1       a    2   3    86
9     1       a    2   4    91
10    1       a    2   5   104
11    1       a    3   1    57
12    1       a    3   2    67
13    1       a    3   3    79
14    1       a    3   4    96
15    1       a    3   5   105
16    1       c   99   1    10
17    2       b    1   1    60
18    2       b    1   2    68
19    2       b    1   3    78
20    2       b    1   4    90
21    2       b    1   5   107
22    2       b    2   1    61
23    2       b    2   2    73
24    2       b    2   3    86
25    2       b    2   4    91
26    2       b    2   5   104
27    2       b    3   1    57
28    2       b    3   2    67
29    2       b    3   3    79
30    2       b    3   4    96
31    2       b    3   5   105
32    2       d   99   1    10
A data.table approach:
library(data.table)
df <- setDT(df)[, `:=`(value = ifelse(Seed != 99, value + value[Seed == 99], value),
                       flag = Seed == 99 & .N == 1),
                by = .(Year, Cluster, Day)][!(Seed == 99 & flag == FALSE), ][, "flag" := NULL]
Output:
df[]
    Year Cluster Seed Day value
 1:    1       a    1   1    60
 2:    1       a    1   2    68
 3:    1       a    1   3    78
 4:    1       a    1   4    90
 5:    1       a    1   5   107
 6:    1       a    2   1    61
 7:    1       a    2   2    73
 8:    1       a    2   3    86
 9:    1       a    2   4    91
10:    1       a    2   5   104
11:    1       a    3   1    57
12:    1       a    3   2    67
13:    1       a    3   3    79
14:    1       a    3   4    96
15:    1       a    3   5   105
16:    1       c   99   1    10
17:    2       b    1   1    60
18:    2       b    1   2    68
19:    2       b    1   3    78
20:    2       b    1   4    90
21:    2       b    1   5   107
22:    2       b    2   1    61
23:    2       b    2   2    73
24:    2       b    2   3    86
25:    2       b    2   4    91
26:    2       b    2   5   104
27:    2       b    3   1    57
28:    2       b    3   2    67
29:    2       b    3   3    79
30:    2       b    3   4    96
31:    2       b    3   5   105
32:    2       d   99   1    10
Here's an approach using the tidyverse. If you're looking for speed with a million rows, a data.table solution will probably perform better.
library(tidyverse)
df <- data.frame(
  Year = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  Cluster = c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","c","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","d"),
  Seed = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,99,99,99,99,99,99),
  Day = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1),
  value = c(5,2,1,2,8,6,7,9,3,5,2,1,2,8,6,55,66,77,88,99,10))
seeds <- df %>%
  filter(Seed == 99)
matches <- df %>%
  filter(Seed != 99) %>%
  inner_join(select(seeds, -Seed), by = c("Year", "Cluster", "Day")) %>%
  mutate(value = value.x + value.y) %>%
  select(Year, Cluster, Seed, Day, value)
no_matches <- anti_join(seeds, matches, by = c("Year", "Cluster", "Day"))
bind_rows(matches, no_matches) %>%
  arrange(Year, Cluster, Seed, Day)
#>    Year Cluster Seed Day value
#> 1     1       a    1   1    60
#> 2     1       a    1   2    68
#> 3     1       a    1   3    78
#> 4     1       a    1   4    90
#> 5     1       a    1   5   107
#> 6     1       a    2   1    61
#> 7     1       a    2   2    73
#> 8     1       a    2   3    86
#> 9     1       a    2   4    91
#> 10    1       a    2   5   104
#> 11    1       a    3   1    57
#> 12    1       a    3   2    67
#> 13    1       a    3   3    79
#> 14    1       a    3   4    96
#> 15    1       a    3   5   105
#> 16    1       c   99   1    10
#> 17    2       b    1   1    60
#> 18    2       b    1   2    68
#> 19    2       b    1   3    78
#> 20    2       b    1   4    90
#> 21    2       b    1   5   107
#> 22    2       b    2   1    61
#> 23    2       b    2   2    73
#> 24    2       b    2   3    86
#> 25    2       b    2   4    91
#> 26    2       b    2   5   104
#> 27    2       b    3   1    57
#> 28    2       b    3   2    67
#> 29    2       b    3   3    79
#> 30    2       b    3   4    96
#> 31    2       b    3   5   105
#> 32    2       d   99   1    10
Created on 2018-11-23 by the reprex package (v0.2.1)
count row number first and then insert new row by condition [duplicate]
I need to count the number of rows in each group after group_by and, if a group has fewer than 6 rows, add new rows to bring it up to 6. My df has three variables (v1, v2, v3): v1 = group name, v2 = row number (i.e., 1,2,3,4,5,6). In the new row(s), I want to repeat the v1 value, have v2 continue the row numbering, and set v3 = NA.
Sample df:
v1 v2 v3
 1  1 79
 1  2 32
 1  3 53
 1  4 33
 1  5 76
 1  6 11
 2  1 32
 2  2 42
 2  3 44
 2  4 12
 3  1 22
 3  2 12
 3  3 12
 3  4 67
 3  5 32
Expected output:
v1 v2 v3
 1  1 79
 1  2 32
 1  3 53
 1  4 33
 1  5 76
 1  6 11
 2  1 32
 2  2 42
 2  3 44
 2  4 12
 2  5 NA #insert
 2  6 NA #insert
 3  1 22
 3  2 12
 3  3 12
 3  4 67
 3  5 32
 3  6 NA #insert
I tried to count the rows first with dplyr, but I don't know whether, or how, I can add this if/else condition using the pipe. Or is there another, easier function? My code:
df %>%
  group_by(v1) %>%
  dplyr::summarise(N = n()) %>%
  if (N < 6) {
    # sth like that?
  }
Thanks!
We can use complete:
library(tidyverse)
complete(df1, v1, v2)
# A tibble: 18 x 3
#      v1    v2    v3
#   <int> <int> <int>
# 1     1     1    79
# 2     1     2    32
# 3     1     3    53
# 4     1     4    33
# 5     1     5    76
# 6     1     6    11
# 7     2     1    32
# 8     2     2    42
# 9     2     3    44
#10     2     4    12
#11     2     5    NA
#12     2     6    NA
#13     3     1    22
#14     3     2    12
#15     3     3    12
#16     3     4    67
#17     3     5    32
#18     3     6    NA
Here is a way to do it using merge.
df <- read.table(text = "v1 v2 v3
1 1 79
1 2 32
1 3 53
1 4 33
1 5 76
1 6 11
2 1 32
2 2 42
2 3 44
2 4 12
3 1 22
3 2 12
3 3 12
3 4 67
3 5 32", header = TRUE)
toMerge <- data.frame(v1 = rep(1:3, each = 6), v2 = rep(1:6, times = 3))
m <- merge(toMerge, df, by = c("v1", "v2"), all.x = TRUE)
m
   v1 v2 v3
1   1  1 79
2   1  2 32
3   1  3 53
4   1  4 33
5   1  5 76
6   1  6 11
7   2  1 32
8   2  2 42
9   2  3 44
10  2  4 12
11  2  5 NA
12  2  6 NA
13  3  1 22
14  3  2 12
15  3  3 12
16  3  4 67
17  3  5 32
18  3  6 NA
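The merge answer hard-codes the group ids and the row count. A minimal base R sketch of the same idea, building the grid from the data instead, might look like the following. The sample data, `target_rows`, `grid` and `padded` are hypothetical names chosen for illustration, and a fixed target of 3 rows per group stands in for the question's 6.

```r
# Hypothetical sample in the same shape as the question's data
df <- data.frame(
  v1 = c(1, 1, 1, 2, 2, 3),
  v2 = c(1, 2, 3, 1, 2, 1),
  v3 = c(79, 32, 53, 32, 42, 22)
)

target_rows <- 3  # assumed fixed number of rows per group (6 in the question)

# Every (group, row number) combination that should exist
grid <- expand.grid(v2 = seq_len(target_rows), v1 = unique(df$v1))

# Left join: combinations missing from df get NA in v3
padded <- merge(grid, df, by = c("v1", "v2"), all.x = TRUE)
padded <- padded[order(padded$v1, padded$v2), ]
rownames(padded) <- NULL

print(padded)
```

Because the grid lists row numbers explicitly, this pads every group to `target_rows` even when no group in the data happens to be complete.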
Unnest (separate) multiple column values into new rows using Sparklyr
I am trying to split comma-separated column values into new rows, based on ids. I know how to do this in R using dplyr and tidyr, but I am looking to solve the same problem in sparklyr.
id <- c(1,1,1,1,1,2,2,2,3,3,3)
name <- c("A,B,C","B,F","C","D,R,P","E","A,Q,W","B,J","C","D,M","E,X","F,E")
value <- c("1,2,3","2,4,43,2","3,1,2,3","1","1,2","26,6,7","3,3,4","1","1,12","2,3,3","3")
dt <- data.frame(id,name,value)
R solution:
separate_rows(dt, name, sep=",") %>% separate_rows(value, sep=",")
Desired output from the Spark data frame (sparklyr package):
> final_result
   id name value
1   1    A     1
2   1    A     2
3   1    A     3
4   1    B     1
5   1    B     2
6   1    B     3
7   1    C     1
8   1    C     2
9   1    C     3
10  1    B     2
11  1    B     4
12  1    B    43
13  1    B     2
14  1    F     2
15  1    F     4
16  1    F    43
17  1    F     2
18  1    C     3
19  1    C     1
20  1    C     2
21  1    C     3
22  1    D     1
23  1    R     1
24  1    P     1
25  1    E     1
26  1    E     2
27  2    A    26
28  2    A     6
29  2    A     7
30  2    Q    26
31  2    Q     6
32  2    Q     7
33  2    W    26
34  2    W     6
35  2    W     7
36  2    B     3
37  2    B     3
38  2    B     4
39  2    J     3
40  2    J     3
41  2    J     4
42  2    C     1
43  3    D     1
44  3    D    12
45  3    M     1
46  3    M    12
47  3    E     2
48  3    E     3
49  3    E     3
50  3    X     2
51  3    X     3
52  3    X     3
53  3    F     3
54  3    E     3
Note: I have approximately 1000 columns with nested values, so I need a function that can loop over each column. I know there is an sdf_unnest() function in the sparklyr.nested package, but I am not sure how to split the strings of multiple columns and apply this function. I am quite new to sparklyr. Any help would be much appreciated.
You have to combine explode and split:
sdt %>%
  mutate(name = explode(split(name, ","))) %>%
  mutate(value = explode(split(value, ",")))
# Source: lazy query [?? x 3]
# Database: spark_connection
      id name  value
   <dbl> <chr> <chr>
 1  1.00 A     1
 2  1.00 A     2
 3  1.00 A     3
 4  1.00 B     1
 5  1.00 B     2
 6  1.00 B     3
 7  1.00 C     1
 8  1.00 C     2
 9  1.00 C     3
10  1.00 B     2
# ... with more rows
Please note that lateral views have to be expressed as separate subqueries, so this:
sdt %>%
  mutate(
    name = explode(split(name, ",")),
    value = explode(split(value, ",")))
won't work.
dplyr append group id sequence?
I have a dataset like the one below. It was created by dplyr and is currently grouped by 'Stage'. How do I generate a sequence based on the unique, increasing values of Stage, starting from 1? (For example, row #4 should be 1, and rows #1 and #8 should be 4.)
     X   Y Stage Count
1   61  74     1     2
2   58  56     2     1
3   78  76     0     1
4  100 100    -2     1
5   89  88    -1     1
6   47  44     3     1
7   36  32     4     1
8   75  58     1     2
9   24  21     5     1
10  12  11     6     1
11   0   0    10     1
I tried the approach in the post below, but it didn't work: how to mutate a column with ID in group
Thanks.
Here is another dplyr solution:
> df
# A tibble: 11 × 4
       X     Y Stage Count
   <dbl> <dbl> <dbl> <dbl>
1     61    74     1     2
2     58    56     2     1
3     78    76     0     1
4    100   100    -2     1
5     89    88    -1     1
6     47    44     3     1
7     36    32     4     1
8     75    58     1     2
9     24    21     5     1
10    12    11     6     1
11     0     0    10     1
To create the group ids, use dplyr's group_indices:
i <- df %>% group_indices(Stage)
df %>% mutate(group = i)
# A tibble: 11 × 5
       X     Y Stage Count group
   <dbl> <dbl> <dbl> <dbl> <int>
1     61    74     1     2     4
2     58    56     2     1     5
3     78    76     0     1     3
4    100   100    -2     1     1
5     89    88    -1     1     2
6     47    44     3     1     6
7     36    32     4     1     7
8     75    58     1     2     4
9     24    21     5     1     8
10    12    11     6     1     9
11     0     0    10     1    10
It would be great if you could pipe both commands together. But, as of this writing, it doesn't appear to be possible.
After some experimenting, I used %>% ungroup() %>% mutate(test = rank(Stage)), which yields the following result. Note that rank() averages tied values, so the two rows with Stage 1 both get 4.5 rather than a shared integer id.
     X   Y Stage Count test
1  100 100    -2     1  1.0
2   89  88    -1     1  2.0
3   78  76     0     1  3.0
4   61  74     1     2  4.5
5   75  58     1     2  4.5
6   58  56     2     1  6.0
7   47  44     3     1  7.0
8   36  32     4     1  8.0
9   24  21     5     1  9.0
10  12  11     6     1 10.0
11   0   0    10     1 11.0
I don't know whether this is the best approach; feel free to comment.
Update: another approach, assuming the data is called Node:
lvs <- levels(as.factor(Node$Stage))
Node %>% mutate(Rank = match(Stage, lvs))
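For comparison with the answers above, a minimal base R sketch that assigns a dense, 1-based id to each distinct Stage value follows. It matches against the sorted unique values directly, so no factor or string conversion is involved. The data frame below holds only the Stage column from the question and is purely illustrative.

```r
# Stage values from the question, in their original row order
df <- data.frame(Stage = c(1, 2, 0, -2, -1, 3, 4, 1, 5, 6, 10))

# Position of each Stage among the sorted distinct Stage values
df$group <- match(df$Stage, sort(unique(df$Stage)))

print(df)
```

Within a dplyr pipeline, `mutate(group = dense_rank(Stage))` on ungrouped data should give the same ids in one step, since `dense_rank()` ranks without gaps and gives tied values the same rank.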