What is the best way to handle potentially missing columns when summarizing? - r

A financial statement is a good illustration of this issue. Here is an example dataframe:
df <- data.frame( date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 10),
category = sample(c('a','b', 'c'), 10, replace=TRUE),
direction = sample(c('credit', 'debit'), 10, replace=TRUE),
value = sample(0:25, 10, replace = TRUE) )
I want to produce a summary table with incoming, outgoing and total columns for each category.
df %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE), outgoing=sum(debit,na.rm=TRUE) ) %>%
mutate(total= incoming-outgoing)
In most cases this works perfectly with the example dataframe above.
But there are cases where df$direction could contain a single value e.g., credit, resulting in an error.
Error: Problem with `summarise()` column `outgoing`.
object 'debit' not found
Given that I have no control over the dataframe, what is the best way to handle this?
I've been playing around with a conditional statement in the summarize method to check that the column exists, but have not managed to get this working.
...
summarize( outgoing = case_when(
"debit" %in% colnames(.) ~ sum(debit,na.rm=TRUE),
TRUE ~ 0 ) )
...
Have I made a syntax error, or am I going in completely the wrong direction with this?

The issue happens only when one of the elements is presents i.e. 'credit' and no 'debit' or viceversa. Then, the pivot_wider doesn't create the column missing. Instead of pivoting and then summarising, do this directly with summarise and == i.e. if the 'debit' is absent, sum will take care of it by returning 0
library(dplyr)
df %>%
slice(-c(9:10)) %>% # just removed the 'debit' rows completely
group_by(category) %>%
summarise(total = sum(value[direction == 'credit']) -
sum(value[direction == "debit"]))
-output
# A tibble: 3 × 2
category total
<chr> <int>
1 a 15
2 b 30
3 c 63
With pivot_wider, it is not the case
df %>%
slice(-c(9:10)) %>%
pivot_wider(names_from = direction, values_from = value)
# A tibble: 8 × 3
date category credit
<date> <chr> <int>
1 2020-07-25 c 19
2 2020-05-09 b 15
3 2020-08-27 a 15
4 2020-03-27 b 15
5 2020-04-06 c 6
6 2020-07-06 c 11
7 2020-09-22 c 25
8 2020-10-06 c 2
it creates only the 'credit' column, thus when we call a column 'debit' that is not created, it throws error
df %>%
slice(-c(9:10)) %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE),
outgoing=sum(debit,na.rm=TRUE) )
Error: Problem with summarise() column outgoing.
ℹ outgoing = sum(debit, na.rm = TRUE).
✖ object 'debit' not found
ℹ The error occurred in group 1: category = "a".
Run rlang::last_error() to see where the error occurred.
In this case, we can do a complete to create some rows with debit as well which will have NA for other columns
library(tidyr)
df %>%
slice(-c(9:10)) %>%
complete(category, direction = c("credit", "debit")) %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE),
outgoing=sum(debit,na.rm=TRUE) ) %>%
mutate(total= incoming-outgoing)
# A tibble: 3 × 4
category incoming outgoing total
<chr> <int> <int> <int>
1 a 15 0 15
2 b 30 0 30
3 c 63 0 63

Related

Select second largest row by group in r

I have this problem
library(dplyr)
problem = data.frame(id = c(1,1,1,2,2,2), var1 = c(5,4,3, 6,5,4), var2 = c(99,12,32,88,9,8))
For each id, I want to only keep row with second largest value of var1. I tried different ways (dplyr, base R):
problem %>%
group_by(id) %>%
slice_tail(2, -var1)
problem[with(problem, ave(var1, id, FUN = function(x) x == tail(sort(x), 2)[1])), ]
First code doesn;t work, second code gives wrong answer.
What am I doing wrong?
problem |> group_by(id) %>% arrange(var1) %>% slice(n()-1)
n() counts the number of rows in each group. slice(n()-1) takes the n-1th element. Note this will cause issues with groups with fewer than 2 members - you may wish to allow for that.
If you wish to use slice, I guess you can first slice_max() the largest two rows, than slice_tail to remove the largest row.
library(dplyr)
problem %>%
group_by(id) %>%
slice_max(var1, n = 2) %>%
slice_tail(n = 1)
Or you can use a single filter:
problem %>% group_by(id) %>% filter(var1 == max(var1[var1 != max(var1)]))
Output
# A tibble: 2 × 3
# Groups: id [2]
id var1 var2
<dbl> <dbl> <dbl>
1 1 4 12
2 2 5 9
In case you have volume, here is a data.tableapproach.
problem = data.frame(id = c(1,1,1,2,2,2), var1 = c(5,4,3, 6,5,4), var2 = c(99,12,32,88,9,8))
setDT(problem)
setorder(problem, id, - var1)
problem[, .SD[2], by=id]
As for #paul Stafford Allen comment, you will have issue for groups of size only 1.
After arrangeing the 'var1' on descending use slice with 2
library(dplyr)
problem %>%
arrange(id, desc(var1)) %>%
group_by(id) %>%
slice(2) %>%
ungroup
-output
# A tibble: 2 × 3
id var1 var2
<dbl> <dbl> <dbl>
1 1 4 12
2 2 5 9

Pivot wider dataframe with difficult structure dplyr

I was working on something I thought would be simple, but maybe today my brain isn't working. My data is like this:
tibble(metric = c('income', 'income_upp', 'income_low', 'n_house', 'n_house_upp', 'n_house_low'),
value = c(120, 140, 100, 10, 8, 12))
metric value
income 120
income_low 100
income_upp 140
n 10
n_low 8
n_upp 12
And I want to pivot_wider so it looks like this:
metric value value_low value_upp
income 120 100 140
n 10 8 12
I'm having trouble separating metrics, because pivot_wider as is, brings a dataframe that's too wide:
df %>% pivot_wider(names_from = 'metric', values_from = value)
How can I achieve this or should I pivot longer after the pivot wider?
Thanks!
I think if you convert metric into a column with "value", "value_upp" and "value_low" values, you can pivot_wider:
df %>%
mutate(param = case_when(str_detect(metric, "upp") ~ "value_upp",
str_detect(metric, "low") ~ "value_low",
TRUE ~ "value"),
metric = str_remove(metric, "_low|_upp")) %>%
pivot_wider(names_from = param, values_from = value)
I like to use separate() when I have text in a column like this. This function allows you to separate a column into multiple columns if there is a separator in the function.
In particular in this example we would want to use the arguments sep="_" and into = c("metric", "state") to convert into columns with those names.
Then mutate() and pivot_wider() can be used as you had previously specified.
library(tidyverse)
df <- tribble(~metric, ~value,
"income", 120,
"income_low", 100,
"income_upp", 140,
"n", 10,
"n_low", 8,
"n_upp", 12)
df |>
separate(metric, sep = "_", into = c("metric", "state")) |>
mutate(state = ifelse(is.na(state), "value", state)) |>
pivot_wider(id_cols = metric, names_from = state, values_from = value, names_sep = "_")
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 4].
#> # A tibble: 2 × 4
#> metric value low upp
#> <chr> <dbl> <dbl> <dbl>
#> 1 income 120 100 140
#> 2 n 10 8 12
Created on 2022-12-21 with reprex v2.0.2
Note you can use the argument names_glue or names_prefix in pivot_wider() to add the "value" as a prefix to the column names.
a data.table approach (if you can live wit the trailing underacore achter value_
library(data.table)
setDT(df)
# create some new columns based on metric
df[, c("first", "second") := tstrsplit(metric, "_")]
# metric value first second
# 1: income 120 income <NA>
# 2: income_low 100 income low
# 3: income_upp 140 income upp
# 4: n 10 n <NA>
# 5: n_low 8 n low
# 6: n_upp 12 n upp
# replace NA with ""
df[is.na(df)] <- ""
# now cast to wide, createing colnames on the fly
dcast(df, first ~ paste0("value_", second), value.var = "value")
# first value_ value_low value_upp
# 1: income 120 100 140
# 2: n 10 8 12

How to count the number of times a specified variable appears in a dataframe column using dplyr?

Suppose we start with this very simple dataframe called myData:
> myData
Element Class
1 A 0
2 A 0
3 C 0
4 A 0
5 B 1
6 B 1
7 A 2
Generated by:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
How would I use dplyr to extract the number of times "A" appears in the Element column of the myData dataframe? I would simply like the number 4 returned, for further processing in dplyr. All I have so far is the dplyr code shown at the bottom, which seems clumsy because among other things it yields another dataframe with more information than just the number 4 that is needed:
# A tibble: 1 x 2
Element counted
<chr> <int>
1 A 4
The dplyr code that produces the above tibble:
library(dplyr)
myData %>% group_by(Element) %>% filter(Element == "A") %>% summarise(counted = n())
We can use count which simplifies the group_by + summarise step
library(dplyr)
myData %>%
filter(Element == 'A') %>%
count(Element, name = 'counted')
Or with just summarise and sum
myData %>%
summarise(counted = sum(Element == 'A'), Element = 'A') %>%
relocate(Element, .before = 1)
Element counted
1 A 4
Another option using tally like this:
myData = data.frame(Element = c("A","A","C","A","B","B","A"),Class = c(0,0,0,0,1,1,2))
library(dplyr)
myData %>%
filter(Element == "A") %>%
group_by(Element) %>%
tally()
#> # A tibble: 1 × 2
#> Element n
#> <chr> <int>
#> 1 A 4
Created on 2022-07-28 by the reprex package (v2.0.1)

Filter data by group & preserve empty groups

I wonder how can I filter my data by group, and preserve the groups that are empty?
Example:
year = c(1,2,3,1,2,3,1,2,3)
site = rep(c("a", "b", "d"), each = 3)
value = c(3,3,0,1,8,5,10,18,27)
df <- data.frame(year, site, value)
I want to subset the rows where the value is more than 5. For some groups, this is never true. Filter function simply skips empty groups.
How can I keep my empty groups and have NA instead? Ideally, I would like to use dplyr funtions instead of base R.
My filtering approach, where .preserve does not preserve empty groups:
df %>%
group_by(site) %>%
filter(value > 5, .preserve = TRUE)
Expected output:
year site value
<dbl> <fct> <dbl>
1 NA a NA
2 2 b 8
3 1 d 10
4 2 d 18
5 3 d 27
With the addition of tidyr, you can do:
df %>%
group_by(site) %>%
filter(value > 5) %>%
ungroup() %>%
complete(site = df$site)
site year value
<fct> <dbl> <dbl>
1 a NA NA
2 b 2 8
3 d 1 10
4 d 2 18
5 d 3 27
Or if you want to keep it in dplyr:
df %>%
group_by(site) %>%
filter(value > 5) %>%
bind_rows(df %>%
group_by(site) %>%
filter(all(value <= 5)) %>%
summarise_all(~ NA))
Using the nesting functionality of tidyr and applying purrr::map
df %>%
group_by(site) %>%
tidyr::nest() %>%
mutate(data = purrr::map(data, . %>% filter(value > 5))) %>%
tidyr::unnest(cols=c(data), keep_empty = TRUE)

rollsumr with window-length>1: filling missing values

My data frame looks something like the first two columns of the following
I want to add a third column, equal to the sum of the ID-group's last three observations for VAL.
Using the following command, I managed to get the output below:
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3)) %>%
ungroup()
ID VAL SUM
1 2 NA
1 1 NA
1 3 6
1 4 8
...
I am now hoping to be able to fill the NAs that result for the group's cells in the first two rows.
ID VAL SUM
1 2 2
1 1 3
1 3 6
1 4 8
...
How do I do that?
I have tried doing the following
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=min(3, row_number())) %>%
ungroup()
and
df %>%
group_by(ID) %>%
mutate(SUM=rollsumr(VAL, k=3), fill = "extend") %>%
ungroup()
But both give me the same error, because I have groups of sizes <= 2.
Evaluation error: need at least two non-NA values to interpolate.
What do I do?
Alternatively, you can use rollapply() from the same package:
df %>%
group_by(ID) %>%
mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
ID VAL SUM
<int> <int> <int>
1 1 2 2
2 1 1 3
3 1 3 6
4 1 4 8
Due to argument partial = TRUE, also the rows that are below the desired window of length three are summed.
Not a direct answer but one way would be to replace the values which are NAs with cumsum of VAL
library(dplyr)
library(zoo)
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
# ID VAL SUM
# <int> <int> <int>
#1 1 2 2
#2 1 1 3
#3 1 3 6
#4 1 4 8
Or since you know the window size before hand, you could check with row_number() as well
df %>%
group_by(ID) %>%
mutate(SUM = rollsumr(VAL, k=3, fill = NA),
SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))

Resources