How to sum values for each unique group in R

In the dataset below, I want to identify the top 3 most time-consuming projects.
library(dplyr)
TransID <-c(1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1014,1018,1022,1023,1024)
EmpID<-c('M001','M001','M001','M001','B005','B005','B005','B005','X101','X101','X101','Z101','K501','K501','K501','K501')
ProjectID <- c(200,200,200,200,500,500,500,500,950,950,950,950,1050,1050,1050,1050)
Site<-c('X','X','X','Y','Y','Y','Z','Z','Z','G','G','G','G','K','K','K')
Region <-c('NE','NW','SE','SW','MW','NW','SW','NE','NC','MW','NE','SE','SW','NC','SW','SE')
hour_difference<-c(1.45,2.14,2.53,3.69,1.73,2.47,3.63,1.59,0.75,1.18,2.78,9.55,1.85,2.39,5.52,0.23)
df = data.frame(TransID,EmpID,ProjectID,Site,Region,hour_difference)
df
Simply,
for each unique ProjectID, I want to sum the hour_difference and sort in descending order
My attempt:
df %>%
group_by(ProjectID,hour_difference) %>%
summarize(sum().sort_values())
Desired output:
for example, ProjectID = 950 will have a sum of 14.26

It's unclear whether you want descending order by ProjectID or by the sum of hour_difference, so here are both.
Sorted by sum(hour_difference):
df %>%
group_by(ProjectID) %>%
summarise(res = sum(hour_difference)) %>%
arrange(desc(res))
ProjectID res
<dbl> <dbl>
1 950 14.3
2 1050 9.99
3 200 9.81
4 500 9.42
Sorted by ProjectID:
df %>%
group_by(ProjectID) %>%
summarise(res = sum(hour_difference)) %>%
arrange(desc(ProjectID))
ProjectID res
<dbl> <dbl>
1 1050 9.99
2 950 14.3
3 500 9.42
4 200 9.81
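Since the question asks for the top 3 projects, one possible follow-up is to keep only the first three rows of the sorted summary. This is a minimal sketch on a cut-down copy of the question's data; the names total_hours and top3 are illustrative choices, not part of the original answer.

```r
library(dplyr)

# Same ProjectID / hour_difference values as in the question
df <- data.frame(
  ProjectID = rep(c(200, 500, 950, 1050), each = 4),
  hour_difference = c(1.45, 2.14, 2.53, 3.69, 1.73, 2.47, 3.63, 1.59,
                      0.75, 1.18, 2.78, 9.55, 1.85, 2.39, 5.52, 0.23)
)

top3 <- df %>%
  group_by(ProjectID) %>%
  summarise(total_hours = sum(hour_difference)) %>%  # one row per project
  arrange(desc(total_hours)) %>%                     # largest sum first
  head(3)                                            # keep the top 3

top3
```

In dplyr 1.0.0 or later, slice_max(total_hours, n = 3) should be usable in place of the arrange() and head() steps.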

Related

Group and add variable of type stock and another type in a single step?

I want to group by district summing 'incoming' values at quarter and get the value of the 'stock' in the last quarter (3) in just one step. 'stock' can not summed through quarters.
My example dataframe:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
df
district quarter incoming stock
1 ARA 1 4044 19547
2 ARA 2 2992 3160
3 ARA 3 2556 1533
4 BJI 1 1639 5355
5 BJI 2 9547 6146
6 BJI 3 1191 355
7 CMC 1 2038 5816
8 CMC 2 1942 1119
9 CMC 3 225 333
The actual dataframe has ~45.000 rows and 41 variables of which 8 are of type stock.
The result should be:
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
I know how to get to the result, but only in three steps, which I don't think is efficient and which is error-prone given the data.
My approach:
basea <- df %>%
group_by(district) %>%
filter(quarter==3) %>% #take only the last quarter
summarise(across(stock, sum))
baseb <- df %>%
group_by(district) %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
Does anyone have any suggestions to perform the procedure in one (or at least two) steps?
Grateful,
Modus
This assumes the dataset has only 3 quarters, not 4. If that's not the case, use nth(stock, 3) instead of last(stock).
library(tidyverse)
df %>%
group_by(district) %>%
summarise(stock = last(stock),
incoming = sum(incoming))
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
Here is a data.table approach:
library(data.table)
setDT(df)[, .(incoming = sum(incoming), stock = stock[.N]), by = .(district)]
district incoming stock
1: ARA 9592 1533
2: BJI 12377 355
3: CMC 4205 333
Here's a refactor that removes some of the duplicated code. This also seems like a prime use case for a custom function, which can be QC'd and maintained more easily:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
aggregate_stocks <- function(df, n_quarter) {
base <- df %>%
group_by(district)
basea <- base %>%
filter(quarter == n_quarter) %>%
summarise(across(stock, sum))
baseb <- base %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
return(final)
}
aggregate_stocks(df, 3)
#> # A tibble: 3 × 3
#> district stock incoming
#> <chr> <dbl> <dbl>
#> 1 ARA 1533 9592
#> 2 BJI 355 12377
#> 3 CMC 333 4205
Here is the same solution as @Tom Hoel's, but instead of using a function to subset it just uses []:
library(dplyr)
df %>%
group_by(district) %>%
summarise(stock = stock[3],
incoming = sum(incoming))
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
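The question mentions that the real data has 8 stock-type columns among 41 variables. One way the single-step idea might scale is to pass column-name vectors to across(). This is only a sketch: stock_cols and flow_cols are hypothetical helper vectors that would need to list the real column names, and the rows are sorted first so that last() really picks the final quarter.

```r
library(dplyr)

df <- data.frame(
  district = rep(c("ARA", "BJI", "CMC"), each = 3),
  quarter  = rep(1:3, 3),
  incoming = c(4044, 2992, 2556, 1639, 9547, 1191, 2038, 1942, 225),
  stock    = c(19547, 3160, 1533, 5355, 6146, 355, 5816, 1119, 333)
)

# Hypothetical helper vectors: list all stock-type and flow-type columns here
stock_cols <- c("stock")
flow_cols  <- c("incoming")

result <- df %>%
  arrange(district, quarter) %>%        # ensure the last row is the last quarter
  group_by(district) %>%
  summarise(
    across(all_of(stock_cols), last),   # stock columns: value in the final quarter
    across(all_of(flow_cols), sum)      # flow columns: total over all quarters
  )

result
```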

Web Scraping Using Multiple Variables in Link

I am trying to efficiently scrape weekly tournament data from pgatour.com, and place the results in one encompassing table. Below, is an example link that I will use:
https://www.pgatour.com/stats/stat.02568.y2019.eon.t041.html
In the example link - 02568 is one of many stat_id's and t041 is one of many tournament_id's. I want the scrape to get every combo of stat_id and tournament_id in the following manner:
Currently, my lapply is cycling through both id's at the same time and I am only getting 3 of the possible 9 combinations. Is there a way to change my lapply call to cycle through both id's in the desired manner?
library(rvest)
library(dplyr)
library(stringr)
tournament_id <- c("t041", "t054", "t464")
stat_id <- c("02568", "02567", "02564")
url_g <- c(paste('https://www.pgatour.com/stats/stat.', stat_id, '.y2019.eon.', tournament_id,'.html', sep =""))
test_table_pga4 <- lapply(url_g, function(i){
page2 <- read_html(i)
test_table_pga5 <- page2 %>% html_nodes("#statsTable") %>% html_table() %>% .[[1]] %>%
mutate(tournament = i)
})
test_golf7 <- as_tibble(plyr::rbind.fill(test_table_pga4)) # rbind.fill() comes from the plyr package
Use expand.grid() to create unique combinations of stat_id and tournament_id and then mutate a new column with those links.
library(tidyverse)
library(janitor)
library(rvest)
library(magrittr) # provides the %$% pipe used below; library(tidyverse) does not attach it
df <- expand.grid(
tournament_id = c("t041", "t054", "t464"),
stat_id = c("02568", "02567", "02564")
) %>%
mutate(
links = paste0(
'https://www.pgatour.com/stats/stat.',
stat_id,
'.y2019.eon.',
tournament_id,
'.html'
)
) %>%
as_tibble()
# Function to get the table
get_info <- function(link, tournament) {
link %>%
read_html() %>%
html_table() %>%
.[[2]] %>%
clean_names() %>%
select(-rank_last_week ) %>%
mutate(rank_this_week = rank_this_week %>%
as.character,
tournament = tournament) %>%
relocate(tournament)
}
# Retrieve the tables and bind them
df %$%
map2_dfr(links, tournament_id, get_info)
# A tibble: 648 × 9
tournament rank_this_week player_name rounds average total_sg_app
<fct> <chr> <chr> <int> <dbl> <dbl>
1 t041 1 Corey Conners 4 2.89 11.6
2 t041 2 Matt Kuchar 4 2.16 8.62
3 t041 3 Byeong Hun An 4 1.90 7.60
4 t041 4 Charley Hoffman 4 1.72 6.88
5 t041 5 Ryan Moore 4 1.43 5.73
6 t041 6 Brian Stuard 4 1.42 5.69
7 t041 7 Danny Lee 4 1.30 5.18
8 t041 8 Cameron Tringale 4 1.22 4.88
9 t041 9 Si Woo Kim 4 1.22 4.87
10 t041 10 Scottie Scheffler 4 1.16 4.62
# … with 638 more rows, and 3 more variables: measured_rounds <int>,
# total_sg_ott <dbl>, total_sg_putting <dbl>
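To see why the original paste() call produced only 3 links, it may help to compare it with expand.grid() without doing any scraping. The sketch below uses base R only and never contacts the site; pairwise, combos and all_urls are illustrative names.

```r
tournament_id <- c("t041", "t054", "t464")
stat_id <- c("02568", "02567", "02564")

# paste0() walks the two vectors in parallel, so it pairs element 1 with
# element 1, element 2 with element 2, and so on: only 3 URLs
pairwise <- paste0("https://www.pgatour.com/stats/stat.", stat_id,
                   ".y2019.eon.", tournament_id, ".html")
length(pairwise)

# expand.grid() builds every stat_id x tournament_id pair first: 9 URLs
combos <- expand.grid(stat_id = stat_id, tournament_id = tournament_id,
                      stringsAsFactors = FALSE)
all_urls <- paste0("https://www.pgatour.com/stats/stat.", combos$stat_id,
                   ".y2019.eon.", combos$tournament_id, ".html")
length(all_urls)
```

Setting stringsAsFactors = FALSE keeps the ids as character vectors, which avoids the factor-typed tournament column seen in the output above.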

In R, there are `actual` and `budget` values; how to add a new variable and calculate its values

In the variable type there are actual and budget values. How can I add a new variable and calculate its values? My current code works, but it is a little boring to write. Can anyone help? Thanks!
ori_data <- data.frame(
category=c("A","A","A","B","B","B"),
year=c(2021,2022,2022,2021,2022,2022),
type=c("actual","actual","budget","actual","actual","budget"),
sales=c(100,120,130,70,80,90),
profit=c(3.7,5.52,5.33,2.73,3.92,3.69)
)
Add sales increase %:
ori_data$sales_inc_or_budget_acheved[with(ori_data, category=='A'&year=='2022'&type=='actual')] <-
with(ori_data, sales[category=='A'&year=='2022'&type=='actual']/
sales[category=='A'&year=='2021'&type=='actual']-1)
Add budget achieved %:
ori_data$sales_inc_or_budget_acheved[with(ori_data, category=='A'&year=='2022'&type=='budget')] <-
with(ori_data, sales[category=='A'&year=='2022'&type=='actual']/
sales[category=='A'&year=='2022'&type=='budget'])
Using a group_by and an if_else you could do:
library(dplyr)
ori_data |>
group_by(category) |>
arrange(category, type, year) |>
mutate(sales_inc_or_budget_achieved = if_else(type == "actual",
sales / lag(sales) - 1,
lag(sales) / sales)) |>
ungroup()
#> # A tibble: 6 × 6
#> category year type sales profit sales_inc_or_budget_achieved
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 A 2021 actual 100 3.7 NA
#> 2 A 2022 actual 120 5.52 0.2
#> 3 A 2022 budget 130 5.33 0.923
#> 4 B 2021 actual 70 2.73 NA
#> 5 B 2022 actual 80 3.92 0.143
#> 6 B 2022 budget 90 3.69 0.889
And using across you could do the same for both sales and profit:
ori_data |>
group_by(category) |>
arrange(category, type, year) |>
mutate(across(c(sales, profit), ~ if_else(type == "actual",
.x / lag(.x) - 1,
lag(.x) / .x),
.names = "{.col}_inc_or_budget_achieved")) |>
ungroup()
#> # A tibble: 6 × 7
#> category year type sales profit sales_inc_or_budget_achie… profit_inc_or_b…
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2021 actual 100 3.7 NA NA
#> 2 A 2022 actual 120 5.52 0.2 0.492
#> 3 A 2022 budget 130 5.33 0.923 1.04
#> 4 B 2021 actual 70 2.73 NA NA
#> 5 B 2022 actual 80 3.92 0.143 0.436
#> 6 B 2022 budget 90 3.69 0.889 1.06
Answer from stefan suits perfectly well, however, I would suggest you rearrange your data first.
In my opinion sales and profit are types of measures (aka observations) and actual and budget are the measurements here:
library(tidyr)
library(dplyr)
ori_data2 <-
ori_data %>%
pivot_longer(c(sales, profit)) %>%
pivot_wider(names_from = type, values_from = value) %>%
group_by(category, name) %>%
arrange(year, .by_group = TRUE)
Then your calculations become much easier:
ori_data2 %>%
mutate(increase = actual / lag(actual) - 1, # compare to the year before
budget_acheved = actual / budget) %>% # compare actual vs. budget
filter(year == 2022) %>% # you can filter for year of interest
mutate(across(c(increase, budget_acheved), scales::percent)) # and format as percent

Is there any way to join two data frames by date ranges?

I have two data frames, the first dataset is the record for forecasted demand in the following 27 days for each item of the company, shown as below:
library(tidyverse)
library(lubridate)
daily_forecast <- data.frame(
item=c("A","B","A","B"),
date_fcsted=c("2020-8-1","2020-8-1","2020-8-15","2020-8-15"),
fcsted_qty=c(100,200,200,100)
) %>%
mutate(date_fcsted=ymd(date_fcsted)) %>%
mutate(extended_date=date_fcsted+days(27))
and the other dataset is the actual daily demand for each item:
actual_orders <- data.frame(
order_date=rep(seq(ymd("2020-8-3"),ymd("2020-9-15"),by = "1 week"),2),
item=rep(c("A","B"),7),
order_qty=round(rnorm(n=14,mean=50,sd=10),0)
)
What I am trying to accomplish is to get the actual total demand for each item between date_fcsted and extended_date in the first dataset, and then join the two so I can calculate the forecast accuracy.
Solutions with tidyverse would be highly appreciated.
You can try the following:
library(dplyr)
daily_forecast %>%
left_join(actual_orders, by = 'item') %>%
filter(order_date >= date_fcsted & order_date <= extended_date) %>%
group_by(item, date_fcsted, extended_date, fcsted_qty) %>%
summarise(value = sum(order_qty))
# item date_fcsted extended_date fcsted_qty value
# <chr> <date> <date> <dbl> <dbl>
#1 A 2020-08-01 2020-08-28 100 179
#2 A 2020-08-15 2020-09-11 200 148
#3 B 2020-08-01 2020-08-28 200 190
#4 B 2020-08-15 2020-09-11 100 197
You could also try fuzzy_join as suggested by @Gregor Thomas. I added a row number column to make sure you have unique rows independent of item and date ranges (but this may not be needed).
library(fuzzyjoin)
library(dplyr)
daily_forecast %>%
mutate(rn = row_number()) %>%
fuzzy_left_join(actual_orders,
by = c("item" = "item",
"date_fcsted" = "order_date",
"extended_date" = "order_date"),
match_fun = list(`==`, `<=`, `>=`)) %>%
group_by(rn, item.x, date_fcsted, extended_date, fcsted_qty) %>%
summarise(actual_total_demand = sum(order_qty))
Output
rn item.x date_fcsted extended_date fcsted_qty actual_total_demand
<int> <chr> <date> <date> <dbl> <dbl>
1 1 A 2020-08-01 2020-08-28 100 221
2 2 B 2020-08-01 2020-08-28 200 219
3 3 A 2020-08-15 2020-09-11 200 212
4 4 B 2020-08-15 2020-09-11 100 216
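The two answers show different totals because the question generates order_qty with rnorm() and no seed. For a version that can be checked by hand, here is a sketch that swaps in a constant order_qty of 50 (an assumption, not the original data) and does the range match directly in the join with join_by(), which requires dplyr 1.1.0 or later. Each forecast window then covers four weekly orders per item.

```r
library(dplyr)

daily_forecast <- data.frame(
  item = c("A", "B", "A", "B"),
  date_fcsted = as.Date(c("2020-08-01", "2020-08-01", "2020-08-15", "2020-08-15")),
  fcsted_qty = c(100, 200, 200, 100)
)
daily_forecast$extended_date <- daily_forecast$date_fcsted + 27

# Assumption: constant quantities replace rnorm() so the totals are reproducible
actual_orders <- data.frame(
  order_date = rep(seq(as.Date("2020-08-03"), as.Date("2020-09-15"), by = "1 week"), 2),
  item = rep(c("A", "B"), 7),
  order_qty = 50
)

result <- daily_forecast %>%
  left_join(actual_orders,
            by = join_by(item,                          # same item
                         date_fcsted <= order_date,     # order on or after the forecast date
                         extended_date >= order_date)   # and on or before the window end
            ) %>%
  group_by(item, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_total_demand = sum(order_qty, na.rm = TRUE), .groups = "drop")

result
```

Because this is a left join, a forecast with no matching orders is kept and gets a total of 0 rather than being dropped.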

Using filter in dplyr to generate values for all rows

library(tidyverse)
library(nycflights13)
nycflights13::flights
If the following expression gives flights per day from the dataset:
daily <- dplyr::group_by( flights, year, month, day)
(per_day <- dplyr::summarize( daily, flights = n()))
I wanted something similar for cancelled flights:
canx <- dplyr::filter( flights, is.na(dep_time) & is.na(arr_time))
canx2 <- canx %>% dplyr::group_by( year, month, day)
My goal was to have the same length of data frame as for all summarised flights.
I can get number of flights cancelled per day:
(canx_day <- dplyr::summarize( canx2, flights = n()))
but obviously this is a slightly shorter data frame, so I cannot run e.g.:
canx_day$propcanx <- per_day$flights/canx_day$flights
Even if I introduce NAs I can replace them.
So my question is, should I not be using filter, or are there arguments to filter I should be applying?
Many thanks
You should not be using filter. As others suggest, this is easy with a canceled column, so our first step will be to create that column. Then you can easily get whatever you want with a single summarize. For example:
flights %>%
mutate(canceled = as.integer(is.na(dep_time) & is.na(arr_time))) %>%
group_by(year, month, day) %>%
summarize(n_scheduled = n(),
n_not_canceled = sum(!canceled),
n_canceled = sum(canceled),
prop_canceled = mean(canceled))
# # A tibble: 365 x 7
# # Groups: year, month [?]
# year month day n_scheduled n_not_canceled n_canceled prop_canceled
# <int> <int> <int> <int> <int> <int> <dbl>
# 1 2013 1 1 842 838 4 0.004750594
# 2 2013 1 2 943 935 8 0.008483563
# 3 2013 1 3 914 904 10 0.010940919
# 4 2013 1 4 915 909 6 0.006557377
# 5 2013 1 5 720 717 3 0.004166667
# 6 2013 1 6 832 831 1 0.001201923
# 7 2013 1 7 933 930 3 0.003215434
# 8 2013 1 8 899 895 4 0.004449388
# ...
This gives you the number of flights and canceled flights per day, grouped by flight, year, month and day:
nycflights13::flights %>%
group_by(flight, year, month, day) %>%
summarize(per_day = n(),
canx = sum(ifelse(is.na(arr_time), 1, 0)))
There is a simple way to calculate the number of flights canceled per day. Let's assume that a Cancelled column is TRUE for each cancelled flight. If so, the daily count of canceled flights can be calculated as:
flights %>%
group_by(year, month, day) %>%
summarize( canx_day = sum(Cancelled))
canx_day will contain canceled flights for a day.
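The difference between the two approaches is easiest to see on a tiny example. The sketch below uses a hypothetical toy data frame (toy_flights, not the nycflights13 data) in which day 2 has no cancellations: summarising an indicator column keeps that day with a count of 0, while filtering first drops it.

```r
library(dplyr)

# Hypothetical toy data: day 2 has no canceled flights
toy_flights <- data.frame(
  day      = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  dep_time = c(900, NA, 1200, 800, 930, NA, NA, 1500, 1700),
  arr_time = c(1100, NA, 1400, 1000, 1130, NA, NA, 1700, 1900)
)

# Indicator approach: every day is kept, including days with 0 cancellations
per_day <- toy_flights %>%
  mutate(canceled = is.na(dep_time) & is.na(arr_time)) %>%
  group_by(day) %>%
  summarise(n_scheduled   = n(),
            n_canceled    = sum(canceled),
            prop_canceled = mean(canceled))

# Filter-first approach: day 2 disappears because it has no matching rows
canx_day <- toy_flights %>%
  filter(is.na(dep_time) & is.na(arr_time)) %>%
  count(day, name = "n_canceled")

nrow(per_day)   # 3 days
nrow(canx_day)  # 2 days
```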

Resources