How to calculate a new column after grouping with dplyr - r

I need to calculate revenue per load after grouping by "Team" for my Shiny dashboard, but I am being told I have an invalid 'type' (character) of argument.
I have tried changing how the summarise function is formatted. It does not work in the console either, so I have removed the Shiny portions of the code.
August <- data.frame("Revenue" = c(10,20,30,40), "Volume" = c(2,4,5,7),
"Team" = c("Blue","Green","Gold","Purple"))
x <- August %>% group_by(Team) %>% summarise(Revenue = sum(Revenue)) /
August %>% group_by(Team) %>% summarise(Volume = sum(volume)) %>%
"Error: invalid 'type' (character) of argument"
This error shows up instead of the bar graph.

Summarize Revenue and Volume, then take their ratio. Note that summarise evaluates its arguments from left to right, so once Revenue and Volume have been defined in the summarise call, the references to them in the RevByVol definition refer to the new summarized values and not to the original unsummarized columns.
August %>%
group_by(Team) %>%
summarise(Revenue = sum(Revenue),
Volume = sum(Volume),
RevByVol = Revenue / Volume) %>%
ungroup
giving:
# A tibble: 4 x 4
Team Revenue Volume RevByVol
<fct> <dbl> <dbl> <dbl>
1 Blue 10 2 5
2 Gold 30 5 6
3 Green 20 4 5
4 Purple 40 7 5.71
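As a minimal sketch of the left-to-right behaviour described above, here is the same pattern on a small made-up data frame with more than one row per team. The name toy and its values are assumptions for illustration, not the asker's data.

```r
library(dplyr)

# Hypothetical data: two rows for team "a", one row for team "b"
toy <- data.frame(Team    = c("a", "a", "b"),
                  Revenue = c(10, 20, 30),
                  Volume  = c(2, 3, 6))

res <- toy %>%
  group_by(Team) %>%
  summarise(Revenue  = sum(Revenue),      # Revenue now means the group total
            Volume   = sum(Volume),       # Volume now means the group total
            RevByVol = Revenue / Volume)  # uses the totals defined just above

res$RevByVol
```

For team "a" the ratio is (10 + 20) / (2 + 3), i.e. the ratio of the totals, not a ratio of individual rows.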

Related

What is the best way to handle potentially missing columns when summarizing?

A financial statement is a good illustration of this issue. Here is an example dataframe:
df <- data.frame( date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 10),
category = sample(c('a','b', 'c'), 10, replace=TRUE),
direction = sample(c('credit', 'debit'), 10, replace=TRUE),
value = sample(0:25, 10, replace = TRUE) )
I want to produce a summary table with incoming, outgoing and total columns for each category.
df %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE), outgoing=sum(debit,na.rm=TRUE) ) %>%
mutate(total= incoming-outgoing)
In most cases this works perfectly with the example dataframe above.
But there are cases where df$direction could contain a single value e.g., credit, resulting in an error.
Error: Problem with `summarise()` column `outgoing`.
object 'debit' not found
Given that I have no control over the dataframe, what is the best way to handle this?
I've been playing around with a conditional statement in the summarize method to check that the column exists, but have not managed to get this working.
...
summarize( outgoing = case_when(
"debit" %in% colnames(.) ~ sum(debit,na.rm=TRUE),
TRUE ~ 0 ) )
...
Have I made a syntax error, or am I going in completely the wrong direction with this?
The issue occurs only when one of the values is present, i.e. 'credit' without 'debit' or vice versa. In that case pivot_wider does not create the missing column. Instead of pivoting and then summarising, compute the totals directly in summarise with ==: if 'debit' is absent, sum over the empty selection returns 0.
library(dplyr)
df %>%
slice(-c(9:10)) %>% # just removed the 'debit' rows completely
group_by(category) %>%
summarise(total = sum(value[direction == 'credit']) -
sum(value[direction == "debit"]))
-output
# A tibble: 3 × 2
category total
<chr> <int>
1 a 15
2 b 30
3 c 63
With pivot_wider, this is not the case:
df %>%
slice(-c(9:10)) %>%
pivot_wider(names_from = direction, values_from = value)
# A tibble: 8 × 3
date category credit
<date> <chr> <int>
1 2020-07-25 c 19
2 2020-05-09 b 15
3 2020-08-27 a 15
4 2020-03-27 b 15
5 2020-04-06 c 6
6 2020-07-06 c 11
7 2020-09-22 c 25
8 2020-10-06 c 2
It creates only the 'credit' column, so referring to a 'debit' column that was never created throws an error:
df %>%
slice(-c(9:10)) %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE),
outgoing=sum(debit,na.rm=TRUE) )
Error: Problem with summarise() column outgoing.
ℹ outgoing = sum(debit, na.rm = TRUE).
✖ object 'debit' not found
ℹ The error occurred in group 1: category = "a".
Run rlang::last_error() to see where the error occurred.
In this case, we can use complete to add rows for 'debit' as well; these rows will have NA in the other columns.
library(tidyr)
df %>%
slice(-c(9:10)) %>%
complete(category, direction = c("credit", "debit")) %>%
pivot_wider(names_from = direction, values_from = value) %>%
group_by(category) %>%
summarize(incoming = sum(credit, na.rm=TRUE),
outgoing=sum(debit,na.rm=TRUE) ) %>%
mutate(total= incoming-outgoing)
# A tibble: 3 × 4
category incoming outgoing total
<chr> <int> <int> <int>
1 a 15 0 15
2 b 30 0 30
3 c 63 0 63
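The direct summarise approach from the start of this answer can also produce the incoming, outgoing and total columns without any pivoting. The following is only a sketch on made-up data: credits_only and its values are assumptions chosen so that no 'debit' row exists.

```r
library(dplyr)

# Hypothetical data in which every row is a credit (no 'debit' at all)
credits_only <- data.frame(
  category  = c("a", "a", "b", "c"),
  direction = c("credit", "credit", "credit", "credit"),
  value     = c(5L, 10L, 30L, 63L)
)

summary_tbl <- credits_only %>%
  group_by(category) %>%
  summarise(incoming = sum(value[direction == "credit"]),
            outgoing = sum(value[direction == "debit"])) %>%  # empty selection sums to 0
  mutate(total = incoming - outgoing)

summary_tbl
```

Because the subsetting happens on rows rather than on column names, the code does not depend on which values of direction happen to be present.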

Trying to calculate share - summarize function not working

I'm trying to calculate the share of a certain variable cost for each country, related to the total. However, when I try to create the "share" column through mutate, it yields all answers as 1.
The code I'm using is as follows:
db %>%
group_by(country,group) %>%
summarize(cost=sum(cost)) %>%
mutate(share=cost/sum(cost))
This is the table it is generating:
# Groups: cluster [18]
cluster group cost share
<chr> <chr> <dbl> <dbl>
1 AT A 7810. 1
2 AU C 7786. 1
3 CA C 5920. 1
4 KO B 172702. 1
5 DE A 40894. 1
6 ES A 26357. 1
7 FR A 65735. 1
8 GB C 11240. 1
9 IT A 85045. 1
10 JP B 10069. 1
I've tried inverting the positions of group and country on the group_by(), but the share column is still returning the shares as a % of the group, instead of the total sum. Why is this happening and how can I fix it?
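This happens because, when grouping by more than one variable, summarise by default returns a data frame that is still grouped: it drops the last grouping variable and keeps the rest. Here the result is still grouped by country, so sum(cost) in the mutate is computed within each country rather than over the whole table.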
To solve it you can add an ungroup:
db %>%
group_by(country,group) %>%
summarize(cost=sum(cost)) %>%
ungroup() %>%
mutate(share=cost/sum(cost))
Or, from dplyr version 1.0.0 onwards:
db %>%
group_by(country,group) %>%
summarize(cost=sum(cost), .groups = "drop") %>%
mutate(share=cost/sum(cost))
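To see the leftover grouping for yourself, here is a small sketch on made-up data. The name db_toy and its values are assumptions; .groups = "drop_last" is written out only to make the default behaviour explicit.

```r
library(dplyr)

# Hypothetical data: each country belongs to exactly one group
db_toy <- data.frame(country = c("AT", "AT", "DE", "DE", "JP", "JP"),
                     group   = c("A", "A", "A", "A", "B", "B"),
                     cost    = c(1, 3, 2, 4, 5, 5))

# Default behaviour, spelled out: the last grouping variable is dropped
still_grouped <- db_toy %>%
  group_by(country, group) %>%
  summarise(cost = sum(cost), .groups = "drop_last")

group_vars(still_grouped)  # the result is still grouped by country

# Shares computed within each country: every share is 1
share_within <- still_grouped %>% mutate(share = cost / sum(cost))

# Shares computed over the whole table after dropping all groups
share_total <- db_toy %>%
  group_by(country, group) %>%
  summarise(cost = sum(cost), .groups = "drop") %>%
  mutate(share = cost / sum(cost))
```

In this toy data the total cost is 20, so the ungrouped shares are 0.2, 0.3 and 0.5.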

Number of days spent in each STATE in r

I'm trying to calculate the number of days that a patient spent during a given state in R.
The image of an example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. I am thinking if I am able to create a date column in column 4 which is the first recorded date for each state, then I can subtract that from column 2 and get the days I am looking for.
I tried a group_by(MRN, STATE) but the problem is, it groups the second set of 1's as part of the first set of 1's, so does the 2's which is not what I want.
Use mdy_hm to convert OBS_DTM to POSIXct, then group by MRN and the rleid of STATE so that the first set of 1's is handled separately from the second set. Use difftime to calculate the difference in days between each OBS_DTM and the minimum OBS_DTM in its group.
If your data is called data:
library(dplyr)
data %>%
mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
group_by(MRN, grp = data.table::rleid(STATE)) %>%
mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM),units = 'days'))) %>%
ungroup %>%
select(-grp) -> result
result
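If you would rather not depend on data.table, the run identifier can be sketched with dplyr alone by counting the changes in STATE. This is an alternative to rleid, not the code from the answer above; the data frame visits and its values are made up for illustration, and the timestamps are already POSIXct so mdy_hm is not needed.

```r
library(dplyr)

# Hypothetical data: state 1 occurs in two separate runs
visits <- data.frame(
  MRN     = 1,
  OBS_DTM = as.POSIXct(c("2020-01-01 00:00", "2020-01-02 00:00",
                         "2020-01-03 00:00", "2020-01-05 00:00",
                         "2020-01-06 12:00"), tz = "UTC"),
  STATE   = c(1, 1, 2, 1, 1)
)

days_in_state <- visits %>%
  group_by(MRN) %>%
  # grp increases by 1 every time STATE differs from the previous row
  mutate(grp = cumsum(STATE != lag(STATE, default = first(STATE)))) %>%
  group_by(MRN, grp) %>%
  # days elapsed since the first observation of the current run
  mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM), units = "days"))) %>%
  ungroup() %>%
  select(-grp)

days_in_state$Answer
```

The second run of state 1 starts again from 0 instead of being merged with the first run.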
You could try the following:
library(dplyr)
df %>%
group_by(ID, State) %>%
mutate(priorObsDTM = lag(OBS_DTM)) %>%
filter(!is.na(priorObsDTM)) %>%
ungroup() %>%
mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
ID = 1,
OBS_DTM = as.POSIXlt(
c('2020-07-27 8:44', '2020-7-27 8:56', '2020-8-8 20:12',
'2020-8-14 10:13', '2020-8-15 13:32')
),
State = c(1, 1, 2, 2, 2),
stringsAsFactors = FALSE
)
df
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14

Weighted mean of a group, where weight is from another group

Suppose you have a long data.frame of the following form:
ID Group Year Field VALUE
1 1 2016 AA 10
2 1 2016 AA 16
1 1 2016 TOTAL 100
2 1 2016 TOTAL 120
etc..
and you want to create a grouped output of weighted.mean(VALUE,??) for each group_by(Group, Year, Field), using the Field == TOTAL rows as the weights, for years > 2013.
So far I am using dplyr:
dat %>%
filter(Year>2013) %>%
group_by(Group, Year, Field) %>%
summarize(m = weighted.mean(VALUE,VALUE[Field == 'TOTAL'])) %>%
ungroup()
Now the problem (to my understanding) is that after group_by I can no longer refer to other "Field" values, because within each group only the rows of that field (e.g. Field == AA) are visible.
Transforming the data from long to wide is not a solution, as I have >1000 different field values, which may increase over time, and this code will be run daily at some point.
First of all, this is a hacky solution, and I am sure there is a better approach. The goal is to add a new column containing the weights. This approach does so with left_join(), which repeats each TOTAL value on every row with matching keys, but you could probably do the same with fill() or across().
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3
# Example data from OP
dat <- data.frame(ID = c(1,2,1,2), Group = rep(1,4), Year = rep(2016,4),Field = c("AA","AA","TOTAL","TOTAL"), VALUE = c(10,16,100,120))
# Make a new dataframe containing the TOTAL values
weights <- dat %>% filter(Field == "TOTAL") %>% mutate(w = VALUE) %>% select(-Field,-VALUE)
weights
#> ID Group Year w
#> 1 1 1 2016 100
#> 2 2 1 2016 120
# Make a new frame containing the original values and the weights
new_dat <- left_join(dat,weights, by = c("Group","Year","ID"))
# Add a column for weight
new_dat %>%
filter(Year>2013) %>%
group_by(Group, Year, Field) %>%
summarize(m = weighted.mean(VALUE,w)) %>%
ungroup()
#> `summarise()` regrouping output by 'Group', 'Year' (override with `.groups` argument)
#> # A tibble: 2 x 4
#> Group Year Field m
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2016 AA 13.3
#> 2 1 2016 TOTAL 111.
Created on 2020-11-03 by the reprex package (v0.3.0)
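The two results above can be checked by hand in base R. The sketch below just recomputes the weighted means from the example values; the variable names are chosen here for illustration.

```r
# Weights are the TOTAL values for ID 1 and ID 2
w <- c(100, 120)

# Field AA: values 10 and 16 weighted by the totals
aa_by_hand     <- sum(c(10, 16) * w) / sum(w)
aa_by_function <- weighted.mean(c(10, 16), w)

# Field TOTAL: the totals weighted by themselves
total_by_function <- weighted.mean(c(100, 120), w)

round(c(aa_by_hand, aa_by_function, total_by_function), 1)
```

This gives about 13.3 for AA and 110.9 for TOTAL, matching the table once the display rounding is taken into account.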

Calculating percentage of increased and decreased values between factors

I'm looking for a way to calculate the change of scores between factors (for example, questionnaire scores between Pre and Post treatment). I want to figure out what percentage of participants improved and what percentage did not between Pre and Post.
I have looked at some dplyr solutions but I think I am missing a line of code from it but I am not sure.
ID<-c("aaa","bbb","ccc","ddd","eee","fff", "ggg","aaa","bbb","ccc","ddd","eee","fff", "ggg")
Score<-sample(40,14)
Pre_Post<-c(1,1,1,1,1,1,1,2,2,2,2,2,2,2)
df<-cbind(ID, Pre_Post, Score)
df$Score<-as.numeric(df$Score)
df<-as.data.frame(df)
#what I have tried
df2<-df%>%
group_by(ID, Pre_post)
mutate(Pct_change=mutate(Score/lead(Score)*100))
But I get error messages. As well, I wasn't confident that the code was right to begin with.
Expected outcome:-
What I want to achieve is getting the percentages of ID's that have improved. So in the case of the mock example that I have provided, only 42.86% of ID's have improved from Pre to Post, while 57.14% actually worsened between Pre and Post.
Any suggestions would be welcome :)
You have several typos, which is why you get an error.
You can do something like this to get old and new scores side by side:
library(tidyverse)
df %>%
spread(Pre_Post, Score) %>%
rename(Score_pre = `1`, Score_post = `2`)
ID Score_pre Score_post
1 aaa 19 24
2 bbb 39 35
3 ccc 2 29
4 ddd 38 15
5 eee 36 9
6 fff 23 10
7 ggg 21 27
To get the number of improvements you have to convert Score to numeric first:
df %>% as_tibble() %>%
mutate(Score = as.numeric(Score)) %>%
spread(Pre_Post, Score) %>%
rename(Score_pre = `1`, Score_post = `2`) %>%
mutate(improve = if_else(Score_pre > Score_post, "0", "1")) %>%
group_by(improve) %>%
summarise(n = n()) %>%
mutate(percentage = n / sum(n))
# A tibble: 2 x 3
improve n percentage
<chr> <int> <dbl>
1 0 3 0.429
2 1 4 0.571
Another option with dplyr, assuming you always have exactly two values per ID with Pre coded as 1 and Post as 2, is to group by ID, subtract the first value from the second, and then calculate the proportions of positive and negative differences.
library(dplyr)
df %>%
arrange(ID, Pre_Post) %>%
group_by(ID) %>%
summarise(val = Score[2] - Score[1]) %>%
summarise(total_pos = sum(val > 0)/n(),
total_neg = sum(val < 0)/ n())
# A tibble: 1 x 2
# total_pos total_neg
# <dbl> <dbl>
#1 0.429 0.571
data
ID <- c("aaa","bbb","ccc","ddd","eee","fff", "ggg","aaa","bbb",
"ccc","ddd","eee","fff", "ggg")
Score <- sample(40,14)
Pre_Post <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2)
df <- data.frame(ID, Pre_Post, Score)
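Because sample() gives different scores on every run, a fixed data set makes the percentages easier to check. The sketch below reuses the scores printed in the first answer's table and uses pivot_wider instead of spread; it assumes that a higher Post score counts as an improvement, and the names scores and pct are made up for illustration.

```r
library(dplyr)
library(tidyr)

ids <- c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg")

# Fixed scores copied from the Score_pre / Score_post table above
scores <- data.frame(ID       = rep(ids, 2),
                     Pre_Post = rep(c(1, 2), each = 7),
                     Score    = c(19, 39, 2, 38, 36, 23, 21,   # Pre
                                  24, 35, 29, 15, 9, 10, 27))  # Post

pct <- scores %>%
  # one row per ID with columns Score_1 (Pre) and Score_2 (Post)
  pivot_wider(names_from = Pre_Post, values_from = Score,
              names_prefix = "Score_") %>%
  # assumption: a higher Post score means the participant improved
  summarise(pct_improved = 100 * mean(Score_2 > Score_1),
            pct_worsened = 100 * mean(Score_2 < Score_1))

pct
```

With these scores three of the seven IDs go up, so pct_improved is 3/7 (about 42.86%) and pct_worsened is 4/7 (about 57.14%).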
