Group by and Pivot table in R - r

I'm just starting to learn R and transition a project from Jupyter Notebook to an R Markdown document. I have a data set that looks like this:
DATE | ROUTE | STOP_NAME | BOARDING
-----------------------------------------------
2020-03-09 | 1 | STOP A | 2
2020-03-09 | 1 | STOP B | 3
2020-03-09 | 2 | STOP C | 1
There are 20,xxx records over several days and 16 routes. I am trying to group by DATE and ROUTE and sum the BOARDING column. I was able to do this in Python using
df.groupby(['DATE','ROUTE'],as_index = False)['BOARDING'].sum().pivot('DATE','ROUTE').fillna(0)
I've been able to create a table in R close to what I want using:
groupcol1 <- c("DATE","ROUTE")
datacol1 <- ("BOARDING")
route_totals_table <- ddply(df,groupcol1,function(x) colSums(x[datacol1]))
But this gives me a table with a row for each date and route. I am wanting a table like this.
DATE | ROUTE 1 | Route 2 | Route 3
-----------------------------------------------
2020-03-09 | 25 | 45 | 10
2020-03-10 | 36 | 69 | 22
2020-03-11 | 95 | 100 | 29

I would suggest using the tidyverse package to do this work, and the spread or pivot_wider functions from the tidyr package. Suppose your data is in a data.frame called "dat":
library(tidyverse)
# using spread
dat %>%
mutate(ROUTE = paste0("Route ", ROUTE)) %>%
group_by(DATE, ROUTE)%>%
summarise(BOARDING = sum(BOARDING)) %>%
spread(ROUTE, BOARDING)
# using pivot_wider
dat %>%
mutate(ROUTE = paste0("Route ", ROUTE)) %>%
group_by(DATE, ROUTE)%>%
summarise(BOARDING = sum(BOARDING)) %>%
pivot_wider(names_from = ROUTE, values_from = BOARDING)
Both return:
DATE `Route 1` `Route 2`
<chr> <int> <int>
1 "2020-03-09" 5 1

Related

How to group data by time in R and count frequency

I want to group the hour to time of the day:
i.e., Morning - 00:00:00 - 09:59:59
Afternoon - 10:00:00 - 17:59:59
Evening - 18:00:00 - 23:59:59
This is the input data:
| Date | Time |
| 21/10/20 | 03:49:19 |
| 21/10/20 | 05:39:23 |
| 21/10/20 | 09:23:10 |
| 21/10/20 | 14:38:50 |
| 21/10/20 | 17:17:48 |
| 21/10/20 | 21:23:45 |
| 21/10/20 | 21:49:32 |
The output data should be:
| Period | Count |
| Morning | 3 |
| Afternoon | 2 |
| Evening | 2 |
You could use hms and case_when:
data <- read.table(text ='
Date Time
"21/10/20" "03:49:19"
"21/10/20" "05:39:23"
"21/10/20" "09:23:10"
"21/10/20" "14:38:50"
"21/10/20" "17:17:48"
"21/10/20" "21:23:45"
"21/10/20" "21:49:32"',header = T)
library(hms)
library(dplyr)
data %>% mutate(period = case_when(as_hms(Time)<as_hms('10:00:00') ~ 'Morning',
as_hms(Time)<as_hms('18:00:00') ~ 'Afternoon',
T ~ 'Evening')) %>%
group_by(Date,period) %>%
summarize(count=n()) %>%
ungroup()
#> # A tibble: 3 x 3
#> # Groups: Date [1]
#> Date period count
#> <chr> <chr> <int>
#> 1 21/10/20 Afternoon 2
#> 2 21/10/20 Evening 2
#> 3 21/10/20 Morning 3
library(lubridate)
data %>% group_by(period = case_when(hms(Time) < hours(10) ~ 'morning',
hms(Time) < hours(18) ~ 'Afternoon',
TRUE ~ 'Evening')) %>%
summarise(Count = n())
# A tibble: 3 x 2
period Count
<chr> <int>
1 Afternoon 2
2 Evening 2
3 morning 3
Base R :
data$Hour <- as.integer(substr(data$Time, 1, 2))
result <- stack(with(data, table(ifelse(Hour < 10, 'Morning',
ifelse(Hour < 18, 'Afternoon', 'Evening')))))
result
# values ind
#1 2 Afternoon
#2 2 Evening
#3 3 Morning

Can R do the equivalent of an HLOOKUP nested within a VLOOKUP?

I am trying (unsuccessfully) to do the equivalent of an HLOOKUP nested within a VLOOKUP in Excel using R Studio.
Here is the situation.
I have two tables. Table 1 has historical stock prices, where each column represents a ticker name and each row represents a particular date. Table 1 contains the closing stock price for each ticker on each date.
Assume Table 1 looks like this:
|----------------------------|
| Date |MSFT | AMZN |EPD |
|----------------------------|
| 6/1/2020 | 196 | 2600 | 19 |
| 5/1/2020 | 186 | 2200 | 20 |
| 4/1/2020 | 176 | 2000 | 15 |
| 3/1/2020 | 166 | 1800 | 14 |
| 2/1/2020 | 170 | 2200 | 18 |
| 1/1/2020 | 180 | 2300 | 17 |
|----------------------------|
Table 2 has a list of ticker symbols, as well as two dates and placeholders for the stock price on each date. Date1 is always an earlier date than Date2, and each of Date1 and Date2 corresponds with a date in Table 1. Note that Date1 and Date2 are different for each row of Table 2.
My objective is to pull the applicable PriceOnDate1 and PriceOnDate2 into Table 2 similar to VLOOKUP / HLOOKUP functions in Excel. (I can't use Excel going forward on this, as the file is too big for Excel to handle). Then I can calculate the return for each row by a formula like this: (Date2 - Date1) / Date1
Assume I want Table 2 to look like this, but I am unable to pull in the pricing data for PriceOnDate1 and PriceOnDate2:
|-----------------------------------------------------------|
| Ticker | Date1 | Date2 |PriceOnDate1 |PriceOnDate2 |
|-----------------------------------------------------------|
| MSFT | 1/1/2020 | 4/1/2020 | _________ | ________ |
| MSFT | 2/1/2020 | 6/1/2020 | _________ | ________ |
| AMZN | 5/1/2020 | 6/1/2020 | _________ | ________ |
| EPD | 1/1/2020 | 3/1/2020 | _________ | ________ |
| EPD | 1/1/2020 | 4/1/2020 | _________ | ________ |
|-----------------------------------------------------------|
My question is whether there is a way to use R to pull into Table 2 the closing price data from Table 1 for each Date1 and Date2 in each row of Table 2. For instance, in the first row of Table 2, ideally the R code would pull in 180 for PriceOnDate1 and 176 for PriceOnDate2.
I've tried searching for answers, but I am unable to craft a solution that would allow me to do this in R Studio. Can anyone please help me with a solution? I greatly appreciate your time. THANK YOU!!
Working in something like R requires you to think of the data a bit differently. Your Table 1 is probably easiest to work with pivoted into a long format. You can then just join together on the Ticker and Date to pull the values you want.
Data:
table_1 <- data.frame(Date = c("6/1/2020", "5/1/2020", "4/1/2020", "3/1/2020",
"2/1/2020", "1/1/2020"),
MSFT = c(196, 186, 176, 166, 170, 180),
AMZN = c(2600, 2200, 2000, 1800, 2200, 2300),
EPD = c(19, 20, 15, 14, 18, 17))
# only created part of Table 2
table_2 <- data.frame(Ticker = c("MSFT", "AMZN"),
Date1 = c("1/1/2020", "5/1/2020"),
Date2 = c("4/1/2020", "6/1/2020"))
Solution:
The tidyverse approach is pretty easy here.
library(dplyr)
library(tidyr)
First, pivot Table 1 to be longer.
table_1_long <- table_1 %>%
pivot_longer(-Date, names_to = "Ticker", values_to = "Price")
Then join in the prices that you want by matching the Date and Ticker.
table_2 %>%
left_join(table_1_long, by = c(Date1 = "Date", "Ticker")) %>%
left_join(table_1_long, by = c(Date2 = "Date", "Ticker")) %>%
rename(PriceOnDate1 = Price.x,
PriceOnDate2 = Price.y)
# Ticker Date1 Date2 PriceOnDate1 PriceOnDate2
# 1 MSFT 1/1/2020 4/1/2020 180 176
# 2 AMZN 5/1/2020 6/1/2020 2200 2600
The mapply function would do it here:
Let's say your first table is stored in a data.frame called df and the second in a data.frame called df2
df2$PriceOnDate1 <- mapply(function(ticker, date){temp[[ticker]][df$Date == date]}, df2$Ticker, df2$Date1)
df2$PriceOnDate2 <- mapply(function(ticker, date){temp[[ticker]][df$Date == date]}, df2$Ticker, df2$Date2)
In this code, the Hlookup is the double brackets ([[), which returns the column with that name. The VLOOKUP is the single brackets ([) which returns the value in a certain position.
This can be done with a single join if both data frames are in long format, followed by a pivot_wider to get the desired final shape.
The code below uses #Adam's sample data. Note that in the sample data, the dates are coded as factors. You'll probably want your dates coded as R's Date class in your real data.
library(tidyverse)
table_2 %>%
pivot_longer(-Ticker, values_to="Date") %>%
left_join(
table_1 %>%
pivot_longer(-Date, names_to="Ticker", values_to="Price")
) %>%
pivot_wider(names_from=name, values_from=c(Date, Price)) %>%
rename_all(~gsub("Date_", "", .))
Ticker Date1 Date2 Price_Date1 Price_Date2
1 MSFT 1/1/2020 4/1/2020 180 176
2 AMZN 5/1/2020 6/1/2020 2200 2600

Swap results of a value from last years value in same month if the month-year combination is not equal to month-year of system

I've a table as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 10 | 2019-09 |
| 1 | 12 | 2018-09 |
| 2 | 13 | 2019-10 |
| 2 | 14 | 2018-10 |
| 3 | 67 | 2019-01 |
| 3 | 78 | 2018-01 |
+----+-------+---------+
I want to be able to swap the VALUE column for all ID's where the DATE != year-month of system date
and If the DATE == year-month of system date then just keep this years value
the resulting table I need is as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 12 | 2019-09 |
| 2 | 13 | 2019-10 |
| 3 | 78 | 2019-01 |
+----+-------+---------+
As Jon and Maurits noticed, your example is unclear: you give no line with what is a wrong format to you, and you mention "current year" but do not describe the expected output for the next year for instance.
Here is an attempt of code to actually answer your question:
library(dplyr)
x = read.table(text = "
ID VALUE DATE
1 10 2019-09
1 12 2018-09
1 12 2018-09-04
1 12 2018-99
2 13 2019-10
2 14 2018-10
3 67 2019-01
3 78 2018-01
", header=T)
x %>%
mutate(DATE = paste0(DATE, "-01") %>% as.Date("%Y-%m-%d")) %>%
group_by(ID) %>%
filter(DATE==max(DATE, na.rm=T))
I inserted two lines with a "wrong" format (according to me) and treated "current year" as the maximum year you could find in the column for each ID.
This may be wrong assertions, but I'd need more information to better answer this.

Calculate sum of a column if the difference between consecutive rows meets a condition

This is a continued question from the post Remove the first row from each group if the second row meets a condition
Below is a sample dataset:
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
which would look like:
| id | Date | Buyer | diff | Amount |
|----|:----------:|------:|------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | 76 | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | 124 | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
I need to retain those records where based on each buyer and id, the sum of amount between consecutive rows >5000 if the difference between two consecutive rows <=5. So, for example, Buyer 'Sandy' with id '4' has two transactions of 1849 and 4193 on '6/15/2018' and '6/20/2018' within a gap of 5 days, and since the sum of these two amounts>5000, the output would have these records. Whereas, for the same Buyer 'Sandy' with id '4' has another transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018' within a gap of 3 days each, but the output will not have these records as the sum of this amount <5000.
The final output would look like:
| id | Date | Buyer | diff | Amount |
|----|:---------:|------:|------|--------|
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
Changing Date from character to Date and Amount from character to numeric:
df$Date<-as.Date(df$Date, '%m/%d/%y')
df$Amount<-as.numeric(df$Amount)
Now here I group the dataset by id, arrange it with Date, and create a rank within each id (so for example Sandy is going to have rank from 1 through 5 for 5 different days in which she has shopped), then I define a new variable called ConsecutiveSum which adds the Value of each row to it's previous row's Value (lag gives you the previous row). The ifelse statement forces consecutive sum to output a 0 if the previous row's Value doesn't exists. The next step is just enforcing your conditions:
df %>%
group_by(id) %>%
arrange(Date) %>%
mutate(rank=dense_rank(Date)) %>%
mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)),0,Amount + lag(Amount , default = 0)))%>%
filter(diffs<=5 & ConsecutiveSum>=5000 | ConsecutiveSum==0 & lead(ConsecutiveSum)>=5000)
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 6/15/2018 Sandy 1849 NA 1 0
# 2 4 6/20/2018 Sandy 4193 5 2 6042
I would use a combination of techniques available in tidyverse:
First create a grouping variable (new_id) and use the original id and new_id in combination to add together based on a grouping. Then we can filter by the criteria of the sum of the Amount > 5000. We can take this and filter then join or semi_join to filter based on the criteria.
ids is a dataset that finds the total Amount based on id and new_id and filters for when Dollars > 5000. This gives you the id and new_id that meets your criteria
df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
"6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
Amount= c(959,1158,596,922,922,1849,4193,4256,65,100,313,99), stringsAsFactors = F) %>%
group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
library(tidyverse)
df1 <- df %>% mutate(Date = as.Date(Date , format = "%m/%d/%Y"),
tf1 = (id != lag(id, default = 0)),
tf2 = (is.na(diffs) | diffs > 5))
df1$new_id <- cumsum(df1$tf1 + df1$tf2 > 0)
>df1
id Date Buyer Amount diffs days_post tf1 tf2 new_id
<chr> <date> <chr> <dbl> <dbl> <date> <lgl> <lgl> <int>
1 9 2018-11-29 John 959 NA 2018-12-04 TRUE TRUE 1
2 9 2018-11-29 John 1158 0 2018-12-04 FALSE FALSE 1
3 9 2018-11-29 John 596 0 2018-12-04 FALSE FALSE 1
4 5 2019-02-13 Maria 922 NA 2019-02-18 TRUE TRUE 2
5 5 2019-02-13 Maria 922 0 2019-02-18 FALSE FALSE 2
6 4 2018-06-15 Sandy 1849 NA 2018-06-20 TRUE TRUE 3
7 4 2018-06-20 Sandy 4193 5 2018-06-25 FALSE FALSE 3
8 4 2018-08-17 Sandy 4256 58 2018-08-22 FALSE TRUE 4
9 4 2018-08-20 Sandy 65 3 2018-08-25 FALSE FALSE 4
10 4 2018-08-23 Sandy 100 3 2018-08-28 FALSE FALSE 4
11 20 2018-12-25 Paul 313 NA 2018-12-30 TRUE TRUE 5
12 20 2018-12-25 Paul 99 0 2018-12-30 FALSE FALSE 5
ids <- df1 %>%
group_by(id, new_id) %>%
summarise(dollar = sum(Amount)) %>%
ungroup() %>% filter(dollar > 5000)
id new_id dollar
<chr> <int> <dbl>
1 4 3 6042
df1 %>% semi_join(ids)

Rank df by values and sum by unique variables [duplicate]

This question already has answers here:
Can dplyr summarise over several variables without listing each one? [duplicate]
(2 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 6 years ago.
I have a large dataset containing the names of hospitals, the hospital groups and then the number of presenting patients by month. I'm trying to use dplyr to create a summary that contains the total number of presenting patients each month, aggregated by hospital group. The data frame looks like this:
Hospital | Hospital_group | Jan 03 | Feb 03 | Mar 03 | Apr 03 | .....
---------------------------------------------------------------
Hosp 1 | Group A | 5 | 5 | 6 | 4 | .....
---------------------------------------------------------------
Hosp 2 | Group A | 6 | 3 | 8 | 2 | .....
---------------------------------------------------------------
Hosp 3 | Group B | 5 | 5 | 6 | 4 | .....
---------------------------------------------------------------
Hosp 4 | Group B | 3 | 7 | 2 | 1 | .....
---------------------------------------------------------------
I'm trying to create a new dataframe that looks like this:
Hospital_group |Jan 03 | Feb 03 | Mar 03 | Apr 03 | .....
----------------------------------------------------------
Group A | 11 | 8 | 14 | 6 | .....
----------------------------------------------------------
Group B | 8 | 12 | 8 | 5 | .....
----------------------------------------------------------
I'm trying to use dplyr to summarise the data but am a little stuck (am very new at this as you might have guessed). I've managed to filter out the first column (hospital name) and group_by the hospital group but am not sure how to get a cumulative sum total for each month and year (there is a large number of date columns so I'm hoping there is a quick and easy way to do this).
Sorry about posting such a basic question - any help or advice would be greatly appreciated.
Greg
Use summarize_all:
Example:
df <- tibble(name=c("a","b", "a","b"), colA = c(1,2,3,4), colB=c(5,6,7,8))
df
# A tibble: 4 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 1 5
2 b 2 6
3 a 3 7
4 b 4 8
df %>% group_by(name) %>% summarize_all(sum)
Result:
# A tibble: 2 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 4 12
2 b 6 14
Edit: In your case, your data frame contains one column that you do not want to aggregate (the Hospital name.) You might have to either deselect the hospital name column first, or use summarize_at(vars(-Hospital), funs(sum)) instead of summarize_all.
We can do this using base R
We split the dataframe by Hospital_group and then sum it column-wise.
do.call(rbind, lapply(split(df[-c(1, 2)], df$Hospital_group), colSums))
# Jan_03 Feb_03 Mar_03 Apr_03
#Group_A 11 8 14 6
#Group_B 8 12 8 5

Resources