dplyr conditional count function in R

In the following code, the 2nd payment for item b is zero. Using the pipe %>%, is it possible to show the count for item b as 2 rather than 3, since there is no payment for b at the 2nd payment?
library(dplyr)
df <- data.frame(item = c("a", "b", "b", "a", "b"), payment = c(10, 20, 0, 40, 30))
df_sum <-
  df %>%
  group_by(item) %>%
  summarise(total = sum(payment),
            totalcount = n())

You can filter out the rows you don't want to count. For example, to exclude rows where payment is 0, add a filter step before summarise:
df %>%
  group_by(item) %>%
  filter(payment > 0) %>%
  summarise(total = sum(payment),
            totalcount = n())
# A tibble: 2 x 3
item total totalcount
<fct> <dbl> <int>
1 a 50.0 2
2 b 50.0 2
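If an item whose payments are all zero should still appear in the result (with a count of 0), the condition can go inside summarise instead of filter. A minimal sketch, reusing the df from the question: summing the logical vector payment > 0 counts its TRUE values, and no rows are dropped.

```r
library(dplyr)

df <- data.frame(item = c("a", "b", "b", "a", "b"),
                 payment = c(10, 20, 0, 40, 30))

# Count only rows with a positive payment, without removing any rows first
df_sum <- df %>%
  group_by(item) %>%
  summarise(total = sum(payment),
            totalcount = sum(payment > 0))
```

For this data the result matches the filter version, since every item has at least one non-zero payment.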

Related

How to count the number of times a specified variable appears in a dataframe column using dplyr?

Suppose we start with this very simple dataframe called myData:
> myData
Element Class
1 A 0
2 A 0
3 C 0
4 A 0
5 B 1
6 B 1
7 A 2
Generated by:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
How would I use dplyr to extract the number of times "A" appears in the Element column of the myData dataframe? I would simply like the number 4 returned, for further processing in dplyr. All I have so far is the dplyr code shown at the bottom, which seems clumsy because among other things it yields another dataframe with more information than just the number 4 that is needed:
# A tibble: 1 x 2
Element counted
<chr> <int>
1 A 4
The dplyr code that produces the above tibble:
library(dplyr)
myData %>% group_by(Element) %>% filter(Element == "A") %>% summarise(counted = n())
We can use count, which simplifies the group_by + summarise step:
library(dplyr)
myData %>%
  filter(Element == 'A') %>%
  count(Element, name = 'counted')
Or with just summarise and sum:
myData %>%
  summarise(counted = sum(Element == 'A'), Element = 'A') %>%
  relocate(Element, .before = 1)
Element counted
1 A 4
Another option is to use tally:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
library(dplyr)
myData %>%
  filter(Element == "A") %>%
  group_by(Element) %>%
  tally()
#> # A tibble: 1 × 2
#> Element n
#> <chr> <int>
#> 1 A 4
Created on 2022-07-28 by the reprex package (v2.0.1)
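The answers above all return a one-row data frame, while the question asks for just the number. One way to get a bare value out of a pipeline is pull, which extracts a single column as a vector. A small sketch on the same myData (the name n_A is only illustrative):

```r
library(dplyr)

myData <- data.frame(Element = c("A", "A", "C", "A", "B", "B", "A"),
                     Class = c(0, 0, 0, 0, 1, 1, 2))

# summarise() gives a one-row data frame; pull() turns its column into a vector
n_A <- myData %>%
  summarise(counted = sum(Element == "A")) %>%
  pull(counted)
```

Here n_A is a length-one vector rather than a tibble, so it can be used directly in later arithmetic.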

Creating counts of subset with dplyr

I'm trying to summarize a data set with not only total counts per group, but also counts of subsets. So starting with something like this:
df <- data.frame(
Group=c('A','A','B','B','B'),
Size=c('Large','Large','Large','Small','Small')
)
df_summary <- df %>%
group_by(Group) %>%
summarize(group_n=n())
I can get a summary of the number of observations for each group:
> df_summary
# A tibble: 2 x 2
Group group_n
<chr> <int>
1 A 2
2 B 3
Is there any way I can add some sort of subsetting information to n() to get, say, a count of how many observations per group were Large in this example? In other words, ending up with something like:
Group group_n Large_n
1 A 2 2
2 B 3 1
Thank you!
We could use count:
count(xyz) is the same as group_by(xyz) %>% summarise(n = n())
library(dplyr)
df %>%
count(Group, Size)
Group Size n
1 A Large 2
2 B Large 1
3 B Small 2
OR
library(dplyr)
library(tidyr)
df %>%
count(Group, Size) %>%
pivot_wider(names_from = Size, values_from = n)
Group Large Small
<chr> <int> <int>
1 A 2 NA
2 B 1 2
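The NA for group A under Small means there were no such rows. If a 0 is preferred, pivot_wider accepts a values_fill argument. A sketch, assuming the same df as in the question; the named-list form is used here because it targets the n column explicitly:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Group = c("A", "A", "B", "B", "B"),
  Size = c("Large", "Large", "Large", "Small", "Small")
)

# Missing Group/Size combinations are filled with 0 instead of NA
wide <- df %>%
  count(Group, Size) %>%
  pivot_wider(names_from = Size, values_from = n,
              values_fill = list(n = 0L))
```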
I approach this problem using an ifelse and a sum:
df_summary <- df %>%
  group_by(Group) %>%
  summarize(group_n = n(),
            Large_n = sum(ifelse(Size == "Large", 1, 0)))
The last line turns Size into a binary indicator taking the value 1 if Size == "Large" and 0 otherwise. Summing this indicator is equivalent to counting the number of rows with "Large".
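Since R treats TRUE as 1 and FALSE as 0 when summing, the ifelse step can be dropped and the comparison summed directly. A minimal sketch with the data from the question:

```r
library(dplyr)

df <- data.frame(
  Group = c("A", "A", "B", "B", "B"),
  Size = c("Large", "Large", "Large", "Small", "Small")
)

# sum() over a logical vector counts the TRUE values in each group
df_summary <- df %>%
  group_by(Group) %>%
  summarize(group_n = n(),
            Large_n = sum(Size == "Large"))
```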
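Another option uses mutate and distinct. Note that distinct keeps the first row of each Group, so Large_n is only the Large count when that first row is a Large row, as it is in this data:
df_summary <- df %>%
  group_by(Group) %>%
  mutate(group_n = n()) %>%
  ungroup() %>%
  group_by(Group, Size) %>%
  mutate(Large_n = n()) %>%
  ungroup() %>%
  distinct(Group, .keep_all = TRUE)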
# A tibble: 2 x 4
Group Size group_n Large_n
<chr> <chr> <int> <int>
1 A Large 2 2
2 B Large 3 1

Performing operations on dplyr summaries

Assume we have some random data:
data <- data.frame(ID = rep(1:3, 3),
                   Var = sample(1:9, 9))
We can compute summarizing operations using dplyr, like this:
library(dplyr)
data%>%
group_by(ID)%>%
summarize(count = n_distinct(Var))
which gives output that looks like this below an R Markdown chunk:
ID count
1 3
2 3
3 3
I would like to know how we can perform operations on individual data points in this dplyr output without saving the output in a separate object.
For example, in the output of summarise, let's say we wanted to subtract the output value for ID == 3 from the sum of the output values for ID == 1 and ID == 2, and leave the output values for ID == 1 and ID == 2 as they are. The only way I know to do this is to save the summary output in another object and perform the operation on that object, like this:
a <-
  data %>%
  group_by(ID) %>%
  summarize(count = n_distinct(Var))
a
# now perform the operation on a
a[3,2] <- a[2,1]+a[2,2]-1
a
a now looks like this:
ID count
1 3
2 3
3 4
Is there a way to do this in dplyr output without making new objects? Can we somehow use mutate directly on output like this?
We can add a mutate after the summarise and use replace to modify the value at a given position (here n(), the last row):
library(dplyr)
data %>%
  group_by(ID) %>%
  summarize(count = n_distinct(Var)) %>%
  mutate(count = replace(count, n(), count[2] + ID[2] - 1))
-output
# A tibble: 3 x 2
ID count
<int> <dbl>
1 1 3
2 2 3
3 3 4
Or if there are more than two columns, use sum on the sliced row:
data %>%
  group_by(ID) %>%
  summarize(count = n_distinct(Var)) %>%
  # cur_data() is deprecated as of dplyr 1.1.0; pick(everything()) is its replacement there
  mutate(count = replace(count, n(), sum(cur_data() %>%
    slice(2)) - 1))
An alternative that does what you describe ("sum the others") rather than what your code demonstrates:
data %>%
  group_by(ID) %>%
  summarize(count = n_distinct(Var)) %>%
  mutate(count = if_else(ID == 3L, sum(count) - count, count))
# # A tibble: 3 x 2
# ID count
# <int> <int>
# 1 1 3
# 2 2 3
# 3 3 6
Or, if there are other IDs that should not be included in the sum, then:
data %>%
  group_by(ID) %>%
  summarize(count = n_distinct(Var)) %>%
  mutate(count = if_else(ID == 3L, sum(count[ID %in% 1:2]), count))
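The same if_else pattern can also reproduce the operation the question's code actually performs (row 2's ID plus row 2's count, minus 1), while looking rows up by ID value instead of by position. This is only a sketch; set.seed is added for reproducibility, although each ID always has 3 distinct values here because Var is a permutation of 1:9.

```r
library(dplyr)

set.seed(1)
data <- data.frame(ID = rep(1:3, 3),
                   Var = sample(1:9, 9))

result <- data %>%
  group_by(ID) %>%
  summarize(count = n_distinct(Var)) %>%
  # Select the source row by ID == 2 rather than by row index 2
  mutate(count = if_else(ID == 3L,
                         count[ID == 2L] + ID[ID == 2L] - 1L,
                         count))
```

Selecting by ID keeps the result stable if the summary rows ever come back in a different order.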

Dplyr pipe groupby top_n does not get top_n in group

I'm trying to obtain the top 2 names, sorted alphabetically, per group. I would think that top_n() would select this after I perform a group_by. However, this does not seem to be the case. This code shows the problem.
df <- data.frame(Group = c(0, 0, 0, 1, 1, 1),
Name = c("a", "c", "b", "e", "d", "f"))
df <- df %>%
arrange(Name, Group) %>%
group_by(Group) %>%
top_n(2)
df
# A tibble: 2 x 2
# Groups: Group [1]
Group Name
<dbl> <chr>
1 1 e
2 1 f
Expected output would be:
df <- df %>%
arrange(Name, Group) %>%
group_by(Group) %>%
top_n(2)
df
Group Name
1 0 a
2 0 b
3 1 d
4 1 e
Or something similar. Thanks.
top_n selects the top n max values. You seem to need the top n min values. You can pass a negative n to get that. Additionally, you don't need to arrange the data when using top_n.
library(dplyr)
df %>% group_by(Group) %>% top_n(-2, Name)
# Group Name
# <dbl> <chr>
#1 0 a
#2 0 b
#3 1 e
#4 1 d
Another way is to arrange the data and select first two rows in each group.
df %>% arrange(Group, Name) %>% group_by(Group) %>% slice(1:2)
We can also use filter with row_number:
library(dplyr)
df %>%
  arrange(Group, Name) %>%
  group_by(Group) %>%
  filter(row_number() < 3)
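In dplyr 1.0.0 and later, top_n is superseded by slice_min and slice_max, which state the direction explicitly. A sketch of the same task using the data from the question (top2 is just an illustrative name):

```r
library(dplyr)

df <- data.frame(Group = c(0, 0, 0, 1, 1, 1),
                 Name = c("a", "c", "b", "e", "d", "f"))

# Keep the 2 alphabetically smallest names within each group
top2 <- df %>%
  group_by(Group) %>%
  slice_min(Name, n = 2) %>%
  ungroup()
```

By default slice_min keeps ties (with_ties = TRUE), so a group can return more than n rows if names repeat; there are no ties in this data.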

Filter data by group & preserve empty groups

How can I filter my data by group and preserve the groups that end up empty?
Example:
year = c(1,2,3,1,2,3,1,2,3)
site = rep(c("a", "b", "d"), each = 3)
value = c(3,3,0,1,8,5,10,18,27)
df <- data.frame(year, site, value)
I want to subset the rows where the value is more than 5. For some groups, this is never true, and filter simply drops those groups.
How can I keep my empty groups and have NA instead? Ideally, I would like to use dplyr functions instead of base R.
My filtering approach, where .preserve does not preserve empty groups:
df %>%
group_by(site) %>%
filter(value > 5, .preserve = TRUE)
Expected output:
year site value
<dbl> <fct> <dbl>
1 NA a NA
2 2 b 8
3 1 d 10
4 2 d 18
5 3 d 27
With the addition of tidyr, you can do:
df %>%
group_by(site) %>%
filter(value > 5) %>%
ungroup() %>%
complete(site = df$site)
site year value
<fct> <dbl> <dbl>
1 a NA NA
2 b 2 8
3 d 1 10
4 d 2 18
5 d 3 27
Or if you want to keep it in dplyr:
df %>%
group_by(site) %>%
filter(value > 5) %>%
bind_rows(df %>%
group_by(site) %>%
filter(all(value <= 5)) %>%
summarise_all(~ NA))
Using the nesting functionality of tidyr and applying purrr::map
df %>%
group_by(site) %>%
tidyr::nest() %>%
mutate(data = purrr::map(data, . %>% filter(value > 5))) %>%
tidyr::unnest(cols=c(data), keep_empty = TRUE)
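Another dplyr-only possibility is a join: start from one row per site and attach the rows that pass the filter. Sites with no surviving rows then get NA in the other columns. A sketch using the data from the question (kept is an illustrative name):

```r
library(dplyr)

year <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
site <- rep(c("a", "b", "d"), each = 3)
value <- c(3, 3, 0, 1, 8, 5, 10, 18, 27)
df <- data.frame(year, site, value)

# One row per site on the left; unmatched sites keep a row filled with NA
kept <- df %>%
  distinct(site) %>%
  left_join(filter(df, value > 5), by = "site")
```

As with the complete approach, the site column comes first in the result.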
