Dplyr filtering based on two variables

I want to use dplyr to determine which observations in a data frame meet the following condition:
Within each Group, the combined total of Var2 for observations where Var1 == good is greater than the combined total of Var2 for observations where Var1 == bad.
Here's the toy dataframe:
library(dplyr)
set.seed(seed = 10)
df <- data.frame("Id" = 1:12,
"Group" = paste(sapply(toupper(letters[1:3]), rep, times = 4,simplify = T)),
"Var1" = sample(rep(c("good","bad"),times = 1000),size = 12),
"Var2" = sample(rep(1:10, times = 1000),size = 12))
print(df)
Id Group Var1 Var2
1 1 A good 6
2 2 A bad 9
3 3 A good 10
4 4 A good 7
5 5 B bad 9
6 6 B bad 1
7 7 B bad 6
8 8 B good 6
9 9 C good 1
10 10 C bad 8
11 11 C good 4
12 12 C bad 2
So far I've determined that I should be using some combination of group_by(), summarise(), and filter(), but I can't seem to wrap my head around a good way to do it. Here's what I've come up with so far:
keepers <- df %>%
  group_by(Group, Var1) %>%
  summarise(Total = sum(Var2)) %>%
  print()
Source: local data frame [6 x 3]
Groups: Group [?]
Group Var1 Total
(chr) (chr) (int)
1 A bad 9
2 A good 23
3 B bad 16
4 B good 6
5 C bad 10
6 C good 5
What next steps should I take? Ultimately the analysis should return "A", because it's the only Group where Total is greater for the good observations than for the bad observations.

How about using spread, then filter:
> library(tidyr)
> df %>% group_by(Group, Var1) %>%
+   summarise(Total = sum(Var2)) %>%
+   spread(Var1, Total) %>%
+   filter(good > bad)
Source: local data frame [1 x 3]
Group bad good
1 A 9 23
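Note that spread() has since been superseded by pivot_wider() in tidyr 1.0+; a minimal equivalent sketch:
library(dplyr)
library(tidyr)

# Same reshape-then-filter idea with pivot_wider(), spread()'s successor
df %>%
  group_by(Group, Var1) %>%
  summarise(Total = sum(Var2), .groups = "drop") %>%
  pivot_wider(names_from = Var1, values_from = Total) %>%
  filter(good > bad)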

A similar option with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'Group' and 'Var1', get the sum of 'Var2', reshape from 'long' to 'wide', and filter the rows where 'good' is greater than 'bad'.
library(data.table)
dcast(setDT(df)[, sum(Var2), by = .(Group, Var1)],
      Group ~ Var1, value.var = 'V1')[good > bad]
# Group bad good
#1: A 9 23
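For what it's worth, the reshape can be skipped entirely; a dplyr-only sketch (my own variant, using the same df) compares the two sums directly inside a grouped summarise:
library(dplyr)

# Compute both per-group totals at once and keep groups where good wins
df %>%
  group_by(Group) %>%
  summarise(good_total = sum(Var2[Var1 == "good"]),
            bad_total  = sum(Var2[Var1 == "bad"])) %>%
  filter(good_total > bad_total)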

Related

count number of combinations by group

I am struggling to count the number of unique combinations in my data. I would like to first group the rows by id and then count how many times each combination of values occurs. Here it does not matter whether the elements are combined as d-f or f-d; they still belong to the same category, as they contain the same elements:
combinations:
n
c-f: 2 # also f-c
c-d-f: 1 # also c-f-d or f-d-c
d-f: 2 # also f-d. The dash is only for visualization purposes
Dummy example:
# my data
dd <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5),
                 cat = c('c', 'f', 'c', 'd', 'f', 'c', 'f', 'd', 'f', 'f', 'd'))
> dd
id cat
1 1 c
2 1 f
3 2 c
4 2 d
5 2 f
6 3 c
7 3 f
8 4 d
9 4 f
10 5 f
11 5 d
Using paste is a great solution provided by @benson23, but it treats f-d and d-f as distinct categories. I would like the order not to matter. Thank you!
Create a "combination" column in summarise, we can count this column afterwards.
An easy way to count the category is to order them at the beginning, then in this case they will all be in the same order.
library(dplyr)
dd %>%
  group_by(id) %>%
  arrange(id, cat) %>%
  summarize(combination = paste0(cat, collapse = "-"), .groups = "drop") %>%
  count(combination)
# A tibble: 3 x 2
combination n
<chr> <int>
1 c-d-f 1
2 c-f 2
3 d-f 2
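For comparison, a base R sketch of the same sort-then-paste idea, using the dd from above:
# Sort each id's categories, paste them into a key, then tabulate the keys
combos <- tapply(dd$cat, dd$id, function(x) paste(sort(x), collapse = "-"))
as.data.frame(table(combination = combos))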

R dplyr: filter common values by group

I need to find the values that are common to all groups, ideally using dplyr in R.
From my dataset here:
group val
<fct> <dbl>
1 a 1
2 a 2
3 a 3
4 b 3
5 b 4
6 b 5
7 c 1
8 c 3
the expected output is
group val
<fct> <dbl>
1 a 3
2 b 3
3 c 3
as only number 3 occurs in all groups.
This code does not work:
# Filter the data
dd %>%
  group_by(group) %>%
  filter(all(val)) # does not work
The example here solves a similar issue but has a predefined vector of shared values. What if I do not know in advance which values are shared?
Dummy example:
# Reproducible example: filter all id by group
group = c("a", "a", "a",
"b", "b", "b",
"c", "c")
val = c(1,2,3,
3,4,5,
1,3)
dd <- data.frame(group,
val)
group_by isolates each group, so we can't very well group_by(group) and compare between groups. Instead, we can group_by(val) and see which values have all the groups:
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_distinct(dd$group))
# # A tibble: 3 x 2
# # Groups: val [1]
# group val
# <chr> <dbl>
# 1 a 3
# 2 b 3
# 3 c 3
This is one of the rare cases where we want to use data$column in a dplyr verb: n_distinct(dd$group) refers explicitly to the ungrouped original data to get the total number of groups. (It could also be pre-computed.) By contrast, n_distinct(group) uses the grouped data piped into filter, so it gives the number of distinct groups for each value (because we group_by(val)).
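A sketch of the pre-computed variant just mentioned:
# Compute the total number of groups once, then filter against it
n_groups <- n_distinct(dd$group)
dd %>%
  group_by(val) %>%
  filter(n_distinct(group) == n_groups) %>%
  ungroup()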
A base R approach can be:
#Code
newd <- dd[dd$val %in% Reduce(intersect, split(dd$val, dd$group)),]
Output:
group val
3 a 3
4 b 3
8 c 3
A similar option in data.table to @GregorThomas's solution is
library(data.table)
setDT(dd)[dd[, .I[uniqueN(group) == uniqueN(dd$group)], val]$V1]

Rolling sum of one variable in a data.frame over a number of steps defined by another variable

I'm trying to sum up the values in a data.frame in a cumulative way.
I have this:
df <- data.frame(
  a = rep(1:2, each = 5),
  b = 1:10,
  step_window = c(2, 3, 1, 2, 4, 1, 2, 3, 2, 1)
)
I'm trying to sum up the values of b within the groups a. The trick is that for each row I want the sum of b over step_window rows, starting at the current row.
This is the output I'm looking for:
data.frame(
  a = rep(1:2, each = 5),
  step_window = c(2, 3, 1, 2, 4,
                  1, 2, 3, 2, 1),
  b = 1:10,
  sum_b_step_window = c(3, 9, 3, 9, 5,
                        6, 15, 27, 19, 10)
)
I tried to do this using RcppRoll, but I get the error Expecting a single value:
df %>%
  group_by(a) %>%
  mutate(sum_b_step_window = RcppRoll::roll_sum(x = b, n = step_window))
I'm not sure whether a variable window size is possible in any of the rolling functions. Here is one way to do this using map2_dbl:
library(dplyr)
df %>%
  group_by(a) %>%
  mutate(sum_b_step_window = purrr::map2_dbl(row_number(), step_window,
                                             ~ sum(b[.x:(.x + .y - 1)], na.rm = TRUE)))
# a b step_window sum_b_step_window
# <int> <int> <dbl> <dbl>
# 1 1 1 2 3
# 2 1 2 3 9
# 3 1 3 1 3
# 4 1 4 2 9
# 5 1 5 4 5
# 6 2 6 1 6
# 7 2 7 2 15
# 8 2 8 3 27
# 9 2 9 2 19
#10 2 10 1 10
1) rollapply
rollapply in zoo supports vector widths. partial = TRUE says that if the window goes past the end of the data, use just the values that are available. (Another possibility would be to use fill = NA instead, in which case it would fill with NAs if there were not enough data left.) align = "left" specifies that the current value at each step is the left end of the range to sum.
library(dplyr)
library(zoo)
df %>%
  group_by(a) %>%
  mutate(sum = rollapply(b, step_window, sum, partial = TRUE, align = "left")) %>%
  ungroup()
2) SQL
This can also be done in SQL by left joining df to itself on the indicated condition and then for each row summing over all rows for which the condition matches.
library(sqldf)
sqldf("select A.*, sum(B.b) as sum
from df A
left join df B on B.rowid between A.rowid and A.rowid + A.step_window - 1
and A.a = B.a
group by A.rowid")
Here is a solution with the package slider.
library(dplyr)
library(slider)
df %>%
  group_by(a) %>%
  mutate(sum_b_step_window = hop_vec(b, row_number(), step_window + row_number() - 1, sum)) %>%
  ungroup()
It is flexible on different window sizes.
Output:
# A tibble: 10 x 4
a b step_window sum_b_step_window
<int> <int> <dbl> <int>
1 1 1 2 3
2 1 2 3 9
3 1 3 1 3
4 1 4 2 9
5 1 5 4 5
6 2 6 1 6
7 2 7 2 15
8 2 8 3 27
9 2 9 2 19
10 2 10 1 10
slider is a recently released tidyverse package dedicated to sliding-window functions. Have a look here for more info: page, vignette.
hop is the engine of slider. With this solution we are supplying different .start and .stop positions to sum the values of b within each a group.
With _vec you're asking hop to return a vector (a double, in this case).
row_number() is a dplyr function that returns the row number within each group, thus allowing you to slide along the rows.
A data.table solution using cumulative sums:
setDT(df)
df[, sum_b_step_window := {
  cs <- c(0, cumsum(b))  # prefix sums of b, padded with a leading 0
  # window sum for row i is cs[i + step_window] - cs[i], capped at the group end
  cs[pmin(.N + 1, 1:.N + step_window)] - cs[pmax(1, 1:.N)]
}, by = a]
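And a plain base R cross-check, as a sketch (assuming df is the original data.frame, with rows already ordered by a):
# Per group: sum b over the current row plus the next step_window - 1 rows
df$check <- unlist(lapply(split(seq_len(nrow(df)), df$a), function(idx) {
  b <- df$b[idx]
  w <- df$step_window[idx]
  sapply(seq_along(b), function(i) sum(b[i:min(i + w[i] - 1, length(b))]))
}))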

group by in R dplyr for more than one variable on unique value of other variable

I have a dataset with three columns as below:
data <- data.frame(
  grpA = c(1, 1, 1, 1, 1, 2, 2, 2),
  idB = c(1, 1, 2, 2, 3, 4, 5, 6),
  valueC = c(10, 10, 20, 20, 10, 30, 40, 50),
  otherD = c(1, 2, 3, 4, 5, 6, 7, 8)
)
valueC is unique to each unique value of idB.
I want to use dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with sum of valueC values for each group.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>% mutate(newCol = sum(valueC)), I get newCol <- c(70,70,70,70,70,120,120,120).
How do I include only the unique values of idB? Is there anything else I can use instead of group_by in the dplyr %>% pipe?
I can't use summarise, as I need to keep the values in otherD intact for later use.
Another option would be to create newCol separately through SQL and then merge it with a left join, but I am looking for a better inline solution.
If it has been answered before, please refer me to the link as I could not find any relevant answer to this issue.
We need unique with match
data %>%
  group_by(grpA) %>%
  mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
Another option is to get the distinct rows by 'grpA' and 'idB', group by 'grpA', get the sum of 'valueC', and left_join with the original data:
data %>%
  distinct(grpA, idB, .keep_all = TRUE) %>%
  group_by(grpA) %>%
  summarise(newCol = sum(valueC)) %>%
  left_join(data, ., by = 'grpA')
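In the same spirit, a short sketch of my own: drop duplicated idB values within each group before summing, which also keeps all rows and otherD intact:
library(dplyr)

# Sum valueC over the first occurrence of each idB within the group
data %>%
  group_by(grpA) %>%
  mutate(newCol = sum(valueC[!duplicated(idB)]))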

I would like to create a new variable with the unique occurrences and their frequency in R [duplicate]

This question already has answers here:
Find how many times duplicated rows repeat in R data frame [duplicate]
(4 answers)
Closed 6 years ago.
I'm a beginner, and while I have attempted to search for an answer to this problem, none of the solutions seem to apply. It might be a simple one, but I can't seem to crack it. I have this data frame:
df <- data.frame(FROM = c("A", "A", "A", "B", "D", "C", "A", "D"),
                 TO = c("B", "C", "D", "A", "C", "A", "B", "C"))
I would like to create a new data frame with an extra variable, call it "FREQ", holding the frequency of each unique combination of "FROM" and "TO", such that the new data set looks like this. I would appreciate some assistance.
df2 <- data.frame(FROM = c("A", "A", "A", "B", "D", "C"),
                  TO = c("B", "C", "D", "A", "C", "A"),
                  FREQ = c(2, 1, 1, 1, 2, 1))
If you are using the dplyr package, you can use count, which is a shortcut for group_by(FROM, TO) %>% summarise(n = n()) and counts the number of rows in each group:
library(dplyr)
df %>% count(FROM, TO)
#Source: local data frame [6 x 3]
#Groups: FROM [?]
# FROM TO n
# <fctr> <fctr> <int>
#1 A B 2
#2 A C 1
#3 A D 1
#4 B A 1
#5 C A 1
#6 D C 2
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'FROM', 'TO', we get the number of elements in each group (.N)
library(data.table)
setDT(df)[, .(FREQ = .N), .(FROM, TO)]
# FROM TO FREQ
#1: A B 2
#2: A C 1
#3: A D 1
#4: B A 1
#5: D C 2
#6: C A 1
Another option is tally() from dplyr
library(dplyr)
df %>%
  group_by(FROM, TO) %>%
  tally()
# FROM TO n
# <fctr> <fctr> <int>
#1 A B 2
#2 A C 1
#3 A D 1
#4 B A 1
#5 C A 1
#6 D C 2
Or, using table from base R, we just get the frequencies for the dataset, convert to a data.frame, and remove the rows where 'Freq' is 0 with subset.
subset(as.data.frame(table(df)), Freq != 0)
