Rank df by values and sum by unique variables [duplicate] - r

This question already has answers here:
Can dplyr summarise over several variables without listing each one? [duplicate]
(2 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 6 years ago.
I have a large dataset containing the names of hospitals, the hospital groups and then the number of presenting patients by month. I'm trying to use dplyr to create a summary that contains the total number of presenting patients each month, aggregated by hospital group. The data frame looks like this:
Hospital | Hospital_group | Jan 03 | Feb 03 | Mar 03 | Apr 03 | .....
---------------------------------------------------------------
Hosp 1 | Group A | 5 | 5 | 6 | 4 | .....
---------------------------------------------------------------
Hosp 2 | Group A | 6 | 3 | 8 | 2 | .....
---------------------------------------------------------------
Hosp 3 | Group B | 5 | 5 | 6 | 4 | .....
---------------------------------------------------------------
Hosp 4 | Group B | 3 | 7 | 2 | 1 | .....
---------------------------------------------------------------
I'm trying to create a new dataframe that looks like this:
Hospital_group |Jan 03 | Feb 03 | Mar 03 | Apr 03 | .....
----------------------------------------------------------
Group A | 11 | 8 | 14 | 6 | .....
----------------------------------------------------------
Group B | 8 | 12 | 8 | 5 | .....
----------------------------------------------------------
I'm trying to use dplyr to summarise the data but am a little stuck (am very new at this as you might have guessed). I've managed to filter out the first column (hospital name) and group_by the hospital group but am not sure how to get a cumulative sum total for each month and year (there is a large number of date columns so I'm hoping there is a quick and easy way to do this).
Sorry about posting such a basic question - any help or advice would be greatly appreciated.
Greg

Use summarize_all:
Example:
df <- tibble(name=c("a","b", "a","b"), colA = c(1,2,3,4), colB=c(5,6,7,8))
df
# A tibble: 4 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 1 5
2 b 2 6
3 a 3 7
4 b 4 8
df %>% group_by(name) %>% summarize_all(sum)
Result:
# A tibble: 2 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 4 12
2 b 6 14
Edit: In your case, your data frame contains one column that you do not want to aggregate (the Hospital name.) You might have to either deselect the hospital name column first, or use summarize_at(vars(-Hospital), funs(sum)) instead of summarize_all.

We can do this using base R
We split the dataframe by Hospital_group and then sum it column-wise.
do.call(rbind, lapply(split(df[-c(1, 2)], df$Hospital_group), colSums))
# Jan_03 Feb_03 Mar_03 Apr_03
#Group_A 11 8 14 6
#Group_B 8 12 8 5

Related

Weekly Weight Based on a category using dplyr in R

I have the following data and looking to create the "Final Col" shown below using dplyr in R. I would appreciate your ideas.
| Year | Week | MainCat|Qty |Final Col |
|:----: |:------: |:-----: |:-----:|:------------:|
| 2017 | 1 | Edible |69 |69/(69+12) |
| 2017 | 2 | Edible |12 |12/(69+12) |
| 2017 | 1 | Flowers|88 |88/(88+47) |
| 2017 | 2 | Flowers|47 |47/(88+47) |
| 2018 | 1 | Edible |90 |90/(90+35) |
| 2018 | 2 | Edible |35 |35/(90+35) |
| 2018 | 1 | Flowers|78 |78/(78+85) |
| 2018 | 2 | Flowers|85 |85/(78+85) |
It can be done with a group_by operation i.e. grouped by 'Year', 'MainCat', divide the 'Qty' by the sum of 'Qty' to create the 'Final' column
library(dplyr)
df1 <- df1 %>%
group_by(Year, MainCat) %>%
mutate(Final = Qty/sum(Qty))
You can use prop.table :
library(dplyr)
df %>% group_by(Year, MainCat) %>% mutate(Final = prop.table(Qty))
# Year Week MainCat Qty Final
# <int> <int> <chr> <int> <dbl>
#1 2017 1 Edible 69 0.852
#2 2017 2 Edible 12 0.148
#3 2017 1 Flowers 88 0.652
#4 2017 2 Flowers 47 0.348
#5 2018 1 Edible 90 0.72
#6 2018 2 Edible 35 0.28
#7 2018 1 Flowers 78 0.479
#8 2018 2 Flowers 85 0.521
You can also do this in base R :
df$Final <- with(df, ave(Qty, Year, MainCat, FUN = prop.table))

Group by and Pivot table in R

I'm just starting to learn R and transition a project from Jupyter Notebook to an R Markdown document. I have a data set that looks like this:
DATE | ROUTE | STOP_NAME | BOARDING
-----------------------------------------------
2020-03-09 | 1 | STOP A | 2
2020-03-09 | 1 | STOP B | 3
2020-03-09 | 2 | STOP C | 1
There are 20,xxx records over several days and 16 routes. I am trying to group by DATE and ROUTE and sum the BOARDING column. I was able to do this in Python using
df.groupby(['DATE','ROUTE'],as_index = False)['BOARDING'].sum().pivot('DATE','ROUTE').fillna(0)
I've been able to create a table in R close to what I want using:
groupcol1 <- c("DATE","ROUTE")
datacol1 <- ("BOARDING")
route_totals_table <- ddply(df,groupcol1,function(x) colSums(x[datacol1]))
But this gives me a table with a row for each date and route. I am wanting a table like this.
DATE | ROUTE 1 | Route 2 | Route 3
-----------------------------------------------
2020-03-09 | 25 | 45 | 10
2020-03-10 | 36 | 69 | 22
2020-03-11 | 95 | 100 | 29
I would suggest using the tidyverse package to do this work, and the spread or pivot_wider functions from the tidyr package. Suppose your data is in a data.frame called "dat":
library(tidyverse)
# using spread
dat %>%
mutate(ROUTE = paste0("Route ", ROUTE)) %>%
group_by(DATE, ROUTE)%>%
summarise(BOARDING = sum(BOARDING)) %>%
spread(ROUTE, BOARDING)
# using pivot_wider
dat %>%
mutate(ROUTE = paste0("Route ", ROUTE)) %>%
group_by(DATE, ROUTE)%>%
summarise(BOARDING = sum(BOARDING)) %>%
pivot_wider(names_from = ROUTE, values_from = BOARDING)
Both return:
DATE `Route 1` `Route 2`
<chr> <int> <int>
1 "2020-03-09" 5 1

Swap results of a value from last years value in same month if the month-year combination is not equal to month-year of system

I've a table as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 10 | 2019-09 |
| 1 | 12 | 2018-09 |
| 2 | 13 | 2019-10 |
| 2 | 14 | 2018-10 |
| 3 | 67 | 2019-01 |
| 3 | 78 | 2018-01 |
+----+-------+---------+
I want to be able to swap the VALUE column for all ID's where the DATE != year-month of system date
and If the DATE == year-month of system date then just keep this years value
the resulting table I need is as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 12 | 2019-09 |
| 2 | 13 | 2019-10 |
| 3 | 78 | 2019-01 |
+----+-------+---------+
As Jon and Maurits noticed, your example is unclear: you give no line with what is a wrong format to you, and you mention "current year" but do not describe the expected output for the next year for instance.
Here is an attempt of code to actually answer your question:
library(dplyr)
x = read.table(text = "
ID VALUE DATE
1 10 2019-09
1 12 2018-09
1 12 2018-09-04
1 12 2018-99
2 13 2019-10
2 14 2018-10
3 67 2019-01
3 78 2018-01
", header=T)
x %>%
mutate(DATE = paste0(DATE, "-01") %>% as.Date("%Y-%m-%d")) %>%
group_by(ID) %>%
filter(DATE==max(DATE, na.rm=T))
I inserted two lines with a "wrong" format (according to me) and treated "current year" as the maximum year you could find in the column for each ID.
This may be wrong assertions, but I'd need more information to better answer this.

Calculate sum of a column if the difference between consecutive rows meets a condition

This is a continued question from the post Remove the first row from each group if the second row meets a condition
Below is a sample dataset:
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date | Buyer | diff | Amount |
|----|:----------:|------:|------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | 76 | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | 124 | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
I need to retain those records where based on each buyer and id, the sum of amount between consecutive rows >5000 if the difference between two consecutive rows <=5. So, for example, Buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018' within a gap of 5 days, and since the sum of these two amounts>5000, the output would have these records. Whereas, for the same Buyer 'Sandy' with id '4' has another transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018' within a gap of 3 days each, but the output will not have these records as the sum of this amount <5000.
The final output would look like:
| id | Date | Buyer | diff | Amount |
|----|:---------:|------:|------|--------|
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
Changing Date from character to Date and Amount from character to numeric:
df$Date<-as.Date(df$Date, '%m/%d/%y')
df$Amount<-as.numeric(df$Amount)
Now here I group the dataset by id, arrange it with Date, and create a rank within each id (so for example Sandy is going to have rank from 1 through 5 for 5 different days in which she has shopped), then I define a new variable called ConsecutiveSum which adds the Value of each row to it's previous row's Value (lag gives you the previous row). The ifelse statement forces consecutive sum to output a 0 if the previous row's Value doesn't exists. The next step is just enforcing your conditions:
df %>%
group_by(id) %>%
arrange(Date) %>%
mutate(rank=dense_rank(Date)) %>%
mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)),0,Amount + lag(Amount , default = 0)))%>%
filter(diffs<=5 & ConsecutiveSum>=5000 | ConsecutiveSum==0 & lead(ConsecutiveSum)>=5000)
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 6/15/2018 Sandy 1849 NA 1 0
# 2 4 6/20/2018 Sandy 4193 5 2 6042
I would use a combination of techniques available in tidyverse:
First create a grouping variable (new_id) and use the original id and new_id in combination to add together based on a grouping. Then we can filter by the criteria of the sum of the Amount > 5000. We can take this and filter then join or semi_join to filter based on the criteria.
ids is a dataset that finds the total Amount based on id and new_id and filters for when Dollars > 5000. This gives you the id and new_id that meets your criteria
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
library(tidyverse)
df1 <- df %>% mutate(Date = as.Date(Date , format = "%m/%d/%Y"),
tf1 = (id != lag(id, default = 0)),
tf2 = (is.na(diffs) | diffs > 5))
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)
>df1
id Date Buyer Amount diffs days_post tf1 tf2 new_id
<chr> <date> <chr> <dbl> <dbl> <date> <lgl> <lgl> <int>
1 9 2018-11-29 John 959 NA 2018-12-04 TRUE TRUE 1
2 9 2018-11-29 John 1158 0 2018-12-04 FALSE FALSE 1
3 9 2018-11-29 John 596 0 2018-12-04 FALSE FALSE 1
4 5 2019-02-13 Maria 922 NA 2019-02-18 TRUE TRUE 2
5 5 2019-02-13 Maria 922 0 2019-02-18 FALSE FALSE 2
6 4 2018-06-15 Sandy 1849 NA 2018-06-20 TRUE TRUE 3
7 4 2018-06-20 Sandy 4193 5 2018-06-25 FALSE FALSE 3
8 4 2018-08-17 Sandy 4256 58 2018-08-22 FALSE TRUE 4
9 4 2018-08-20 Sandy 65 3 2018-08-25 FALSE FALSE 4
10 4 2018-08-23 Sandy 100 3 2018-08-28 FALSE FALSE 4
11 20 2018-12-25 Paul 313 NA 2018-12-30 TRUE TRUE 5
12 20 2018-12-25 Paul 99 0 2018-12-30 FALSE FALSE 5
ids <- df1 %>%
group_by(id, new_id) %>%
summarise(dollar = sum(Amount)) %>%
ungroup() %>% filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids)

Making Pivot table with Multiple Columns and Aggregating by Unique Occurences

I'm having a tough time wrapping my head around this or finding a guideline online.
I have membership data. I want to be to see how many members last in a particular month before dropping their membership. I can see which month they have joined and I can see how long they've been active by looking at their transaction no (it increases by 1 each month). So if I track transaction no's for each month, I can get a waterfall of how many people joined that month and what the drop off was.
The kicker is that sometimes there are multiple transactions within a month by the same member, but I would only like to count that member once, so I would need to count that member only once.
Name | Joined Month | Transaction no
Adam | Jan | 1
Adam | Jan | 2
Adam | Jan | 2
Ben | Jan | 1
Ben | Jan | 2
Ben | Jan | 3
Ben | Jan | 4
Cathy| Jan | 1
Donna| Feb | 1
Donna| Feb | 2
Donna| Feb | 3
Evan | Mar | 1
Evan | Mar | 1
Frank | Mar | 1
Frank | Mar | 2
Aggregating for distinct members with months as columns, the result would look something like this:
Transaction# | Jan | Feb | March
1 | 3 | 1 | 2
2 | 2 | 1 | 1
3 | 1 | 1 | 0
4 | 1 | 0 | 0
Any tips or pointers in the correct direction would be very helpful. Should I be using reshape2 or a similar package? Hopefully I did not butcher the explanation or the formatting, please feel free to ask any questions.
Thank you!
Below is a reproducible example that uses the tidyverse functions dplyr::n_distinct and tidyr::spread.
I have first represented your data as a tibble (or you could use a data frame equally well).
Next we group by Transactionno and JoinedMonth before counting distinct Names. To get it in table format you request we use tidyr::spread. If you want the resulting columns in month order, ensuring your data frame has them as ordered factors would be important.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
library(tidyr)
x <- tribble(
~Name , ~JoinedMonth, ~Transactionno,
"Adam" , "Jan" , 1,
"Adam" , "Jan" , 2,
"Adam" , "Jan" , 2,
"Ben" , "Jan" , 1,
"Ben" , "Jan" , 2,
"Ben" , "Jan" , 3,
"Ben" , "Jan" , 4,
"Cathy", "Jan" , 1,
"Donna", "Feb" , 1,
"Donna", "Feb" , 2,
"Donna", "Feb" , 3,
"Evan" , "Mar" , 1,
"Evan" , "Mar" , 1,
"Frank" , "Mar" , 1,
"Frank" , "Mar" , 2
)
x %>%
group_by(Transactionno, JoinedMonth) %>%
summarise(ct = n_distinct(Name)) %>%
tidyr::spread(JoinedMonth, ct, fill = 0)
#> # A tibble: 4 x 4
#> # Groups: Transactionno [4]
#> Transactionno Feb Jan Mar
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1. 1. 3. 2.
#> 2 2. 1. 2. 1.
#> 3 3. 1. 1. 0.
#> 4 4. 0. 1. 0.
1) xtabs This one-liner uses base R and the input DF shown reproducibly in the Note below. Note that we assume that Joined.Month is a factor with levels Jan, Feb, Mar to ensure that the output is sorted in that order (rather than alphabetically).
xtabs(~ Transaction.no + Joined.Month, unique(DF))
giving:
Joined.Month
Transaction.no Jan Feb Mar
1 1 3 2
2 1 2 1
3 1 1 0
4 0 1 0
2) table Another base R approach.
with(unique(DF), table(Transaction.no, Joined.Month))
giving:
Joined.Month
Transaction.no Jan Feb Mar
1 3 1 2
2 2 1 1
3 1 1 0
4 1 0 0
2a) This would also work and is shorter but not quite as clear:
table(unique(DF)[3:2])
3) tapply This also uses only base R:
u <- unique(DF)
tapply(u[[1]], u[3:2], length, default = 0)
giving:
Joined.Month
Transaction.no Jan Feb Mar
1 3 1 2
2 2 1 1
3 1 1 0
4 1 0 0
Note
DF in reproducible form is assumed to be:
Lines <- "Name | Joined Month | Transaction no
Adam | Jan | 1
Adam | Jan | 2
Adam | Jan | 2
Ben | Jan | 1
Ben | Jan | 2
Ben | Jan | 3
Ben | Jan | 4
Cathy| Jan | 1
Donna| Feb | 1
Donna| Feb | 2
Donna| Feb | 3
Evan | Mar | 1
Evan | Mar | 1
Frank | Mar | 1
Frank | Mar | 2"
DF <- read.table(text = Lines, header = TRUE, sep = "|",
strip.white = TRUE, as.is = TRUE)
DF$Joined.Month <- factor(DF$Joined.Month, lev = month.abb[1:3])

Resources