Add original values for columns after group by - r

For the dataframe below I want to add the original values for VAR_X after a group_by on ID and event and a max() on quest, but I cannot get my code right. Any suggestions? By the way, in my original dataframe more than one column needs to be added.
df <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,3,3,3),
quest = c(1,1,2,2,3,3,1,2,3,1,2,3),
event = c("A","B","A","B","A",NA,"C","D","C","D","D",NA),
VAR_X = c(2,4,3,6,3,NA,6,4,5,7,5,NA))
Code:
df %>%
group_by(ID,event) %>%
summarise(quest = max(quest))
Desired output:
ID quest event VAR_X
1 1 2 B 6
2 1 3 A 3
3 2 2 D 4
4 2 3 C 5
5 3 2 D 5

Start by omitting the NA values, and at the end do an inner_join with the original data set.
df %>%
na.omit() %>%
group_by(ID, event) %>%
summarise(quest = max(quest)) %>%
inner_join(df, by = c("ID", "event", "quest"))
## A tibble: 5 x 4
## Groups: ID [3]
# ID event quest VAR_X
# <dbl> <fct> <dbl> <dbl>
#1 1 A 3 3
#2 1 B 2 6
#3 2 C 3 5
#4 2 D 2 4
#5 3 D 2 5

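Alternatively, filter within each group, which keeps all the original columns:
library(dplyr)
library(tidyr) # for drop_na()
df %>%
drop_na() %>% # drops rows containing NA; remove this step if those rows should be kept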
group_by(ID, event) %>%
filter(quest == max(quest)) %>%
ungroup()
# A tibble: 5 x 4
# ID quest event VAR_X
#<dbl> <dbl> <chr> <dbl>
# 1 1 2 B 6
# 2 1 3 A 3
# 3 2 2 D 4
# 4 2 3 C 5
# 5 3 2 D 5
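Since the original data has more than one column to carry along, any approach that keeps whole rows (rather than summarising) brings every column with it. As a minimal sketch, assuming dplyr 1.0.0 or later, the same result can be had with slice_max(). The name res and the choice with_ties = FALSE (keep a single row per group if the maximum is tied) are assumptions for illustration, not part of the answers above.

```r
library(dplyr)

# Data from the question
df <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,3,3,3),
                 quest = c(1,1,2,2,3,3,1,2,3,1,2,3),
                 event = c("A","B","A","B","A",NA,"C","D","C","D","D",NA),
                 VAR_X = c(2,4,3,6,3,NA,6,4,5,7,5,NA))

res <- df %>%
  filter(!is.na(event)) %>%                        # drop rows without an event
  group_by(ID, event) %>%
  slice_max(quest, n = 1, with_ties = FALSE) %>%   # keep the row with the largest quest
  ungroup() %>%
  arrange(ID, event)                               # deterministic row order

res
```

All columns of df (here ID, quest, event and VAR_X) are retained, so no join back to the original data is needed.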

Related

Flash fill possible in R?

How would flash filling based on observation data that is tied to another column look for R?
Example
df <- data.frame(A = c(1,1,1,1,2,2,2,2),
B = c('my initials1', NA, NA, NA,NA,'my initials2',NA,NA))
Is there a way to have my initials (which are tied to 1) fill down?
I've tried
df |> fill(B)
But it fills down until the next observation in B, whereas I'd like the fill to stop at the end of group 1, based on column A.
I was just thinking of copying the data to a separate data frame and joining it on A to achieve that.
We may use
library(dplyr)
df %>%
group_by(A) %>%
mutate(B = first(B[!is.na(B)])) %>%
ungroup()
# A tibble: 8 × 2
A B
<dbl> <chr>
1 1 my initials1
2 1 my initials1
3 1 my initials1
4 1 my initials1
5 2 my initials2
6 2 my initials2
7 2 my initials2
8 2 my initials2
You can group by the first column, then it will only fill down within the group:
library(tidyverse)
df %>%
group_by(A) %>%
fill(B, .direction = "down")
Output
A B
<dbl> <chr>
1 1 my initials1
2 1 my initials1
3 1 my initials1
4 1 my initials1
5 2 NA
6 2 my initials2
7 2 my initials2
8 2 my initials2
Or, if you want every row in a group filled, including rows above the first non-missing value, change the .direction argument to "updown":
df %>%
group_by(A) %>%
fill(B, .direction = "updown")
Output
A B
<dbl> <chr>
1 1 my initials1
2 1 my initials1
3 1 my initials1
4 1 my initials1
5 2 my initials2
6 2 my initials2
7 2 my initials2
8 2 my initials2

dplyr mutate: create column using first occurrence of another column

I was wondering if there's a more elegant way of taking a dataframe, grouping by x to see how many x's occur in the dataset, then mutating to find the first occurrence of every x (y)
test <- data.frame(x = c("a", "b", "c", "d",
"c", "b", "e", "f", "g"),
y = c(1,1,1,1,2,2,2,2,2))
x y
1 a 1
2 b 1
3 c 1
4 d 1
5 c 2
6 b 2
7 e 2
8 f 2
9 g 2
Current Output
output <- test %>%
group_by(x) %>%
summarise(count = n())
x count
<fct> <int>
1 a 1
2 b 2
3 c 2
4 d 1
5 e 1
6 f 1
7 g 1
Desired Output
x count first_seen
<fct> <int> <dbl>
1 a 1 1
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 2
6 f 1 2
7 g 1 2
I can filter the test dataframe for the first occurrences then use a left_join but was hoping there's a more elegant solution using mutate?
# filter for first occurrences of y
right <- test %>%
group_by(x) %>%
filter(y == min(y)) %>%
slice(1) %>%
ungroup()
# bind to the output dataframe
left_join(output, right, by = "x")
We can use first after grouping by 'x' to create a new column, use that also in group_by and get the count with n()
library(dplyr)
test %>%
group_by(x) %>%
group_by(first_seen = first(y), .add = TRUE) %>% # `add` was renamed to `.add` in dplyr 1.0.0
summarise(count = n())
# A tibble: 7 x 3
# Groups: x [7]
# x first_seen count
# <fct> <dbl> <int>
#1 a 1 1
#2 b 1 2
#3 c 1 2
#4 d 1 1
#5 e 2 1
#6 f 2 1
#7 g 2 1
I have a question: why not keep it simple? For example:
test %>%
group_by(x) %>%
summarise(
count = n(),
first_seen = first(y)
)
#> # A tibble: 7 x 3
#> x count first_seen
#> <chr> <int> <dbl>
#> 1 a 1 1
#> 2 b 2 1
#> 3 c 2 1
#> 4 d 1 1
#> 5 e 1 2
#> 6 f 1 2
#> 7 g 1 2
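One caveat: first(y) returns the first value in row order, so it matches the earliest y only when the rows are already sorted by y within each x. A minimal sketch, assuming "first seen" means the smallest y, uses min(y) instead. The reversed copy named shuffled is a hypothetical input used only to show that row order no longer matters.

```r
library(dplyr)

# Data from the question
test <- data.frame(x = c("a", "b", "c", "d", "c", "b", "e", "f", "g"),
                   y = c(1, 1, 1, 1, 2, 2, 2, 2, 2))

# Hypothetical unsorted input: the same rows in reverse order
shuffled <- test[rev(seq_len(nrow(test))), ]

res <- shuffled %>%
  group_by(x) %>%
  summarise(count = n(),
            first_seen = min(y))   # smallest y per x, independent of row order

res
```

On the sorted test data both first(y) and min(y) give the same first_seen column.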

R add rows to grouped df using dplyr

I have a grouped df and I would like to add additional rows to the top of the groups that match with a variable (item_code) from the df.
The additional rows do not have an id column. The additional rows should not be duplicated within the groups of df.
Example data:
df <- as_tibble(data.frame(id=rep(1:3,each=2),
item_code=c("A","A","B","B","B","Z"),
score=rep(1,6)))
additional_rows <- as_tibble(data.frame(item_code=c("A","Z"),
score=c(6,6)))
What I tried
I found this post and tried to apply it:
Add row in each group using dplyr and add_row()
df %>% group_by(id) %>% do(add_row(additional_rows %>%
filter(item_code %in% .$item_code)))
What I get:
# A tibble: 9 x 3
# Groups: id [3]
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 Z 6
3 1 NA NA
4 2 A 6
5 2 Z 6
6 2 NA NA
7 3 A 6
8 3 Z 6
9 3 NA NA
What I am looking for:
# A tibble: 8 x 3
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 A 1
3 1 A 1
4 2 B 1
5 2 B 1
6 3 B 1
7 3 Z 6
8 3 Z 1
This should do the trick:
library(plyr)
df %>%
join(subset(df, item_code %in% additional_rows$item_code, select = c(id, item_code)) %>%
join(additional_rows) %>%
subset(!duplicated(.)), type = "full") %>%
arrange(id, item_code, -score)
Not sure if it's the best way, but it works.
Edit: added the other arrange terms so the scores come out in the same order.
Edit 2: there should now be no duplicated rows added from additional_rows, as per your comment.
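For comparison, here is a minimal sketch of the same idea using only dplyr, assuming the goal is one extra row per (id, item_code) pair that appears in additional_rows. The intermediate names new_rows and result are hypothetical, and the data is rebuilt with tibble() so that item_code is a character column.

```r
library(dplyr)

# Example data from the question
df <- tibble(id = rep(1:3, each = 2),
             item_code = c("A", "A", "B", "B", "B", "Z"),
             score = rep(1, 6))
additional_rows <- tibble(item_code = c("A", "Z"),
                          score = c(6, 6))

# One candidate row per (id, item_code) pair that has a match in additional_rows
new_rows <- df %>%
  distinct(id, item_code) %>%
  inner_join(additional_rows, by = "item_code")

# Stack the new rows on the original data and put them at the top of each group
result <- bind_rows(new_rows, df) %>%
  arrange(id, item_code, desc(score))

result
```

Taking distinct(id, item_code) before the join is what prevents the additional row from being repeated within a group.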

summarise and group_by using two different columns consecutively

I have a dataframe df with three columns a,b,c.
df <- data.frame(a = c('a','b','c','d','e','f','g','e','f','g'),
b = c('X','Y','Z','X','Y','Z','X','X','Y','Z'),
c = c('cat','dog','cat','dog','cat','cat','dog','cat','cat','dog'))
df
# output
a b c
1 a X cat
2 b Y dog
3 c Z cat
4 d X dog
5 e Y cat
6 f Z cat
7 g X dog
8 e X cat
9 f Y cat
10 g Z dog
I have to group_by using the column b followed by summarise using the column c with counts of available values in it.
df %>% group_by(b) %>%
summarise(nCat = sum(c == 'cat'),
nDog = sum(c == 'dog'))
#output
# A tibble: 3 × 3
b nCat nDog
<fctr> <int> <int>
1 X 2 2
2 Y 2 1
3 Z 2 1
However, before doing the above task, I should remove the rows belonging to a value in a which has more than one value in b.
df %>% group_by(a) %>% summarise(count = n())
#output
# A tibble: 7 × 2
a count
<fctr> <int>
1 a 1
2 b 1
3 c 1
4 d 1
5 e 2
6 f 2
7 g 2
For example, in this dataframe, all the rows having the value e (values: Y, X), f (values: Z, Y) or g (values: X, Z) in column a should be removed.
# Expected output
# A tibble: 3 × 3
b nCat nDog
<fctr> <int> <int>
1 X 1 1
2 Y 0 1
3 Z 1 0
We can use filter with n_distinct to keep only the 'a' groups that have a single unique value in 'b', then group by 'b' and do the summarise:
df %>%
group_by(a) %>%
filter(n_distinct(b)==1) %>%
group_by(b) %>%
summarise(nCat =sum(c=='cat'), nDog = sum(c=='dog'), Total = n())
# A tibble: 3 × 4
# b nCat nDog Total
# <fctr> <int> <int> <int>
#1 X 1 1 2
#2 Y 0 1 1
#3 Z 1 0 1

Add missing subtotals to each group using dplyr

I need to add a new row to each id group where key = "n" and value is total - (a + b)
x <- tibble( id = c(1,1,1,2,2,2,2),
key = c("a","b","total","a","x","b","total"),
value = c(1,2,10,4,1,3,12) )
# A tibble: 7 × 3
id key value
<dbl> <chr> <dbl>
1 1 a 1
2 1 b 2
3 1 total 10
4 2 a 4
5 2 x 1
6 2 b 3
7 2 total 12
In this example, the new rows should be
1 n 7
2 n 5
I tried getting the a+b subtotal and joining that to the total count to get the difference, but after using nine dplyr verbs I seem to be going in the wrong direction. Thanks.
This isn't a join, it's just binding new rows on:
x %>% group_by(id) %>%
summarize(
value = sum(value[key == 'total']) - sum(value[key %in% c('a', 'b')]),
key = 'n'
) %>%
bind_rows(x) %>%
select(id, key, value) %>% # back to original column order
arrange(id, key) # and a sensible row order
# # A tibble: 9 × 3
# id key value
# <dbl> <chr> <dbl>
# 1 1 a 1
# 2 1 b 2
# 3 1 n 7
# 4 1 total 10
# 5 2 a 4
# 6 2 b 3
# 7 2 n 5
# 8 2 total 12
# 9 2 x 1
Here's a way using data.table, binding rows as in Gregor's answer:
library(data.table)
setDT(x)
dcast(x, id ~ key)[, .(id, key = "n", value = total - a - b)][, rbind(.SD, x)][order(id)]
id key value
1: 1 n 7
2: 1 a 1
3: 1 b 2
4: 1 total 10
5: 2 n 5
6: 2 a 4
7: 2 x 1
8: 2 b 3
9: 2 total 12

Resources