I have a number of excel spreadsheets I'm iterating through with payments on a given date and an associated number of months of service for the payment.
e.g.
Product    Cost  License Date Start  License length in months  Monthly cost
Product A  3000  January 2022        3                         1000
Product B  2400  March 2022          4                         600
Product B  2400  Feb 2022            3                         800
Product A  2000  March 2022          2                         1000
What I would like to do is create a new dataframe, shaped around the months, with the broken down individual and total monthly cost of each product, based on the length of the license.
For example, in the table above, the cost of the first instance of Product A is 3000 and runs for 3 months, making it 1000/month and running through January, February, and March. The second instance of Product A is again 1000/month but runs through March and April, so there is overlap, with March having a total Product A cost of 2000.
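The month expansion described above can be sketched with lubridate (my own illustration, not the full solution; the answers below build on the same idea):

```r
library(lubridate)

# Expand the second instance of Product A: starts March 2022, runs 2 months
start <- my("March 2022")                 # parse "month year" into a Date (1st of month)
months_covered <- start %m+% months(0:1)  # the two months the license covers
format(months_covered, "%B %Y")           # March 2022 and April 2022
```

`%m+%` is lubridate's rollback-safe month addition, which matters once licenses cross a year boundary.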
In the end, my outcome should look like this:
Date           Product A cost  Product B cost  Product C cost  Total cost
January 2022   1000            0               0               1000
February 2022  1000            800             0               1800
March 2022     2000            1400            0               3400
April 2022     1000            1400            0               2400
May 2022       0               600             0               600
June 2022      0               600             0               600
I am struggling to find the best way to iterate through the original data and generate the end result. My general approach is to use apply to iterate through the original dataframe, generating rows based on the number of months, the start date, and the monthly cost, and then reshape the result into the relevant columns. However, I am having trouble getting apply to return what I need, and I am concerned that this isn't the most efficient way to do this.
Any help much appreciated.
I think you have to be a little bit careful with your calculations regarding your dates. In your example the start and end dates are all in the same year, but if your starting month is December and the license lasts more than a month, then you have to pay attention to the calculation of the month and year. For this you can use the lubridate package. I added one row to your example for December 2021 to demonstrate it:
library(tidyverse)
library(lubridate)
df <- read.table(text = "Product Cost License Date Start License length in months Monthly cost
Product A 3000 January 2022 3 1000
Product B 2400 March 2022 4 600
Product B 2400 Feb 2022 3 800
Product A 2000 March 2022 2 1000
Product C 2000 December 2021 2 1000", sep = "\t", header = TRUE)
df.result <- df %>%
  mutate(id = row_number(), Date = my(License.Date.Start)) %>%
  group_by(id, Product, Monthly.cost) %>%
  summarise(Date = Date %m+% months((1:License.length.in.months) - 1)) %>%
  pivot_wider(id_cols = Date, names_from = Product, values_from = Monthly.cost,
              values_fn = sum, values_fill = 0) %>%
  arrange(Date) %>%
  mutate(Total = rowSums(select(., contains("Product"))),
         Date = format(Date, "%B %Y"))
df.result
#> # A tibble: 7 x 5
#> Date `Product A` `Product B` `Product C` Total
#> <chr> <int> <int> <int> <dbl>
#> 1 December 2021 0 0 1000 1000
#> 2 January 2022 1000 0 1000 2000
#> 3 February 2022 1000 800 0 1800
#> 4 March 2022 2000 1400 0 3400
#> 5 April 2022 1000 1400 0 2400
#> 6 May 2022 0 600 0 600
#> 7 June 2022 0 600 0 600
Created on 2022-10-17 by the reprex package (v2.0.1)
Using your input df as a starting point, where I changed the License Date Start into the corresponding month number, you can uncount the occurrences by License length in months.
input_df <- data.frame(
  Product = c("Product A", "Product B", "Product B", "Product A"),
  month_start = c(1, 3, 2, 3),
  License_lenght = c(3, 4, 3, 2),
  Monthly = c(1000, 600, 800, 1000)
)
You then want to keep track of every row, as one product can have multiple starting months. In this example I used row_number()
output_df <- input_df %>%
  mutate(rn = row_number()) %>%
  group_by(Product, rn) %>%
  uncount(License_lenght) %>%
  mutate(month_active = row_number() + month_start - 1) %>%
  group_by(Product, month_active) %>%
  summarize(Product_monthly_cost = sum(Monthly)) %>%
  group_by(month_active) %>%
  mutate(Total_cost = sum(Product_monthly_cost)) %>%
  pivot_wider(names_from = Product, values_from = Product_monthly_cost) %>%
  replace(is.na(.), 0)
I uncount per product type and row number rn. Then I define every month in which the license is active and sum the monthly cost per product and active month. Then I group per active month to determine the total monthly cost. Finally, I pivot_wider per product and month_active, just like the desired output dataframe you posted, and replace the NAs with 0.
The result is
> output_df
# A tibble: 6 × 4
month_active Total_cost `Product A` `Product B`
<dbl> <dbl> <dbl> <dbl>
1 1 1000 1000 0
2 2 1800 1000 800
3 3 3400 2000 1400
4 4 2400 1000 1400
5 5 600 0 600
6 6 600 0 600
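If you want the month numbers above as readable dates, one optional finishing step (my own addition, which assumes every active month falls within 2022) would be:

```r
# Assumes output_df from the code above; month_active = 1 means January 2022, etc.
output_df %>%
  ungroup() %>%
  mutate(Date = format(as.Date(paste(2022, month_active, 1, sep = "-")), "%B %Y"))
```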
Say, I have two datasets:
First - Revenue Dataset
Year Month Sales Company
1988 5 100 A
1999 2 50 B
Second - Stock Price Data Set
Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990229 B 506
I need to merge these two datasets into one in such a way that the stock price on the month-end date (from the second dataset) is attached to the corresponding month in the revenue dataset.
So the output would be:
Year Month Sales Company Stock
1988 5 100 A 201
1999 2 50 B 506
You can ignore the problem with leap year
You could extract the month and date from the Date column and for each Company and each Month select the row with max date. Then join this data to revenue data and select required columns.
library(dplyr)
stock %>%
  mutate(date = as.integer(substring(Date, 7)),
         Month = as.integer(substring(Date, 5, 6))) %>%
  group_by(Company, Month) %>%
  slice(which.max(date)) %>%
  inner_join(revenue, by = c('Company', 'Month')) %>%
  ungroup %>%
  select(Year, Month, Sales, Company, Stock)
# Year Month Sales Company Stock
# <int> <int> <int> <chr> <int>
#1 1988 5 100 A 201
#2 1999 2 50 B 506
First, notice that there is no 1999-02-29 (1999 was not a leap year)!
To get the month ends, use ISOdate on the first of the following month and subtract one day. Then just merge them.
merge(transform(fi, Date=as.Date(ISOdate(fi$Year, fi$Month + 1, 1)) - 1),
transform(se, Date=as.Date(as.character(Date), format="%Y%m%d")))[-2]
# Company Year Month Sales Stock
# 1 A 1988 5 100 201
# 2 B 1999 2 50 506
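One caveat I would add (my observation, not part of the original answer): ISOdate() returns NA when given month 13, so the Month + 1 trick fails for December rows. A lubridate-based month-end helper avoids this by letting %m+% handle the year rollover:

```r
library(lubridate)

# Last day of a given year/month: first of the month, plus one month, minus one day
month_end <- function(year, month) {
  make_date(year, month, 1) %m+% months(1) - 1
}

month_end(1999, 12)  # handles the December -> January rollover
month_end(1999, 2)   # 1999 is not a leap year, so this is Feb 28
```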
Data:
fi <- read.table(header=T, text="Year Month Sales Company
1988 5 100 A
1999 2 50 B")
se <- read.table(header=T, text="Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990228 B 506") ## note: date corrected!
I am trying to create a function in R that will allow me to determine the date at which a product will be out of stock. I would like this function to be able to account for scheduled incoming orders and show a "running total" of units in stock. Below is a reproducible idea of what I have been able to do thus far.
library(tidyverse)
library(lubridate)
runrate <- 25
onHand <- tibble(date = Sys.Date(), OnHand = 2000)
ord_tbl <- tibble(date = c(ymd("2020-04-09"), ymd("2020-04-12"), ymd("2020-04-17")), onOrder = c(200, 500, 100))
date_tbl <- tibble(date = seq.Date(from = Sys.Date(), to = Sys.Date() + 180, by = "day")) %>%
  mutate(Month = month(date, label = TRUE))

joined_tbl <- date_tbl %>%
  left_join(onHand) %>%
  left_join(ord_tbl)

joined_tbl <- joined_tbl %>%
  mutate(OnHand = coalesce(joined_tbl$OnHand, 0),
         onOrder = coalesce(joined_tbl$onOrder, 0),
         id = row_number()) %>%
  mutate(usage = id * runrate) %>%
  select(id, everything())

start_inv_value <- joined_tbl %>%
  filter(date == Sys.Date()) %>%
  select(OnHand)

joined_tbl %>%
  mutate(projected_On_Hand = start_inv_value$OnHand - (id * usage) + onOrder)
Ideally, I would like to take the starting inventory value on hand and then subtract the daily usage and add in units that are expected to be received; however, I am unable to bring down the previous day's projected_On_Hand value.
The anticipated results would look like this:
Thank you for your help!
I think you might want to include a cumulative sum of onOrder (use cumsum). In addition, you can just subtract usage for each row.
joined_tbl %>%
  mutate(projected_On_Hand = start_inv_value$OnHand - usage + cumsum(onOrder))
Output
# A tibble: 181 x 7
id date Month OnHand onOrder usage projected_On_Hand
<int> <date> <ord> <dbl> <dbl> <dbl> <dbl>
1 1 2020-04-08 Apr 2000 0 25 1975
2 2 2020-04-09 Apr 0 200 50 2150
3 3 2020-04-10 Apr 0 0 75 2125
4 4 2020-04-11 Apr 0 0 100 2100
5 5 2020-04-12 Apr 0 500 125 2575
6 6 2020-04-13 Apr 0 0 150 2550
7 7 2020-04-14 Apr 0 0 175 2525
8 8 2020-04-15 Apr 0 0 200 2500
9 9 2020-04-16 Apr 0 0 225 2475
10 10 2020-04-17 Apr 0 100 250 2550
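The original question also asked for the date at which the product runs out of stock; extending this answer (my own addition, assuming joined_tbl and start_inv_value as defined in the question), that date can be extracted as:

```r
# First date on which the projected balance reaches zero or below
joined_tbl %>%
  mutate(projected_On_Hand = start_inv_value$OnHand - usage + cumsum(onOrder)) %>%
  filter(projected_On_Hand <= 0) %>%
  slice(1) %>%
  pull(date)
```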
I have a data table with several columns.
Let's say:
Location, which may include Los Angeles, etc.
age_Group, say (young, child, teenager), etc.
year = (2000, 2001, ..., 2015)
month = c(jan, ..., dec)
I would like to group_by them and see how many people have spent money in certain intervals, say interval_1 = (1, 100), (100, 1000), ..., interval_20 = (1000, infinity).
How shall I proceed? What should I do after the following?
data %>% group_by(location, age_Group, year, month)
sample:
location age_gp year month spending
LA child 2000 1 102
LA teen 2000 1 15
LA teen 2000 10 9
NY old 2000 11 1000
NY old 2010 2 1000000
NY teen 2020 3 10
desired output
LA, child, 2000, jan interval_1
LA, child, 2000, feb interval_20
...
NY OLD 2015 Dec interval_1
The last column has to be determined by adding the spending of all people belonging to the same location, age_group, year, and month.
You can first create a new column (spending_cat) using, for example, the cut function. Afterwards, you can add the new variable as a grouping variable, and then you just need to count:
df <- data.frame(group = sample(letters[1:4], size = 1000, replace = T),
                 spending = rnorm(1000))

df %>%
  mutate(spending_cat = cut(spending, breaks = c(-5:5))) %>%
  group_by(group, spending_cat) %>%
  summarise(n_people = n())
# A tibble: 26 x 3
# Groups: group [?]
group spending_cat n_people
<fct> <fct> <int>
1 a (-3,-2] 6
2 a (-2,-1] 36
3 a (-1,0] 83
4 a (0,1] 78
5 a (1,2] 23
6 a (2,3] 10
7 b (-4,-3] 1
8 b (-3,-2] 4
9 b (-2,-1] 40
10 b (-1,0] 78
# … with 16 more rows
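Applied to the question's own columns, summing spending per group before cutting might look like this (the breaks and interval labels are my assumptions based on the question's examples):

```r
# Assumes the sample data frame from the question, with columns
# location, age_gp, year, month, spending
data %>%
  group_by(location, age_gp, year, month) %>%
  summarise(total_spending = sum(spending)) %>%
  mutate(interval = cut(total_spending,
                        breaks = c(1, 100, 1000, Inf),
                        labels = c("interval_1", "interval_2", "interval_20"),
                        right = FALSE))
```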
library(tidyverse)
library(nycflights13)
nycflights13::flights
If the following expression gives flights per day from the dataset:
daily <- dplyr::group_by( flights, year, month, day)
(per_day <- dplyr::summarize( daily, flights = n()))
I wanted something similar for cancelled flights:
canx <- dplyr::filter( flights, is.na(dep_time) & is.na(arr_time))
canx2 <- canx %>% dplyr::group_by( year, month, day)
My goal was to end up with a data frame of the same length as the summary of all flights.
I can get number of flights cancelled per day:
(canx_day <- dplyr::summarize( canx2, flights = n()))
but obviously this is a slightly shorter data frame, so I cannot run e.g.:
canx_day$propcanx <- per_day$flights/canx_day$flights
Even if I introduce NAs I can replace them.
So my question is, should I not be using filter, or are there arguments to filter I should be applying?
Many thanks
You should not be using filter. As others suggest, this is easy with a canceled column, so our first step will be to create that column. Then you can easily get whatever you want with a single summarize. For example:
flights %>%
  mutate(canceled = as.integer(is.na(dep_time) & is.na(arr_time))) %>%
  group_by(year, month, day) %>%
  summarize(n_scheduled = n(),
            n_not_canceled = sum(!canceled),
            n_canceled = sum(canceled),
            prop_canceled = mean(canceled))
# # A tibble: 365 x 7
# # Groups: year, month [?]
# year month day n_scheduled n_not_canceled n_canceled prop_canceled
# <int> <int> <int> <int> <int> <int> <dbl>
# 1 2013 1 1 842 838 4 0.004750594
# 2 2013 1 2 943 935 8 0.008483563
# 3 2013 1 3 914 904 10 0.010940919
# 4 2013 1 4 915 909 6 0.006557377
# 5 2013 1 5 720 717 3 0.004166667
# 6 2013 1 6 832 831 1 0.001201923
# 7 2013 1 7 933 930 3 0.003215434
# 8 2013 1 8 899 895 4 0.004449388
# ...
This gives you the number of flights and cancelled flights per flight number, year, month, and day:
nycflights13::flights %>%
  group_by(flight, year, month, day) %>%
  summarize(per_day = n(),
            canx = sum(ifelse(is.na(arr_time), 1, 0)))
There is a simple way to calculate the number of flights cancelled per day. Let's assume that the Cancelled column is TRUE for a cancelled flight. If so, the way to calculate daily cancelled flights is:
flights %>%
  group_by(year, month, day) %>%
  summarize(canx_day = sum(Cancelled))
canx_day will contain the number of cancelled flights for each day.
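Since nycflights13::flights has no Cancelled column, it would first need to be derived; a sketch using the cancellation definition from the question (is.na(dep_time) & is.na(arr_time)):

```r
library(dplyr)
library(nycflights13)

# Derive a Cancelled flag, then count cancellations per day
flights %>%
  mutate(Cancelled = is.na(dep_time) & is.na(arr_time)) %>%
  group_by(year, month, day) %>%
  summarize(canx_day = sum(Cancelled))
```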
I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
  spread(Year, Orders) %>%
  group_by(CountryName) %>%
  summarise_all(.funs = c(Sum = 'sum'), na.rm = TRUE) %>%
  mutate(percent_inc = 100 * ((`2014_Sum` - `2015_Sum`) / `2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
set.seed(2)
dat = data.frame(Country = sample(LETTERS[1:5], 500, replace = TRUE),
                 Year = sample(2014:2015, 500, replace = TRUE),
                 Orders = sample(-1:20, 500, replace = TRUE))

dat %>%
  group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE)) %>%
  spread(Year, sum_orders) %>%
  mutate(Pct = (`2014` - `2015`) / `2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country = sample(LETTERS[1:5], 500, replace = TRUE),
                 Year = sample(2010:2015, 500, replace = TRUE),
                 Orders = sample(-1:20, 500, replace = TRUE))

dat %>%
  group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE)) %>%
  group_by(Country) %>%
  arrange(Country, Year) %>%
  mutate(Pct = c(NA, -diff(sum_orders)) / lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1 You're getting the error duplicate identifiers for rows most likely because of spread. spread wants to make N columns out of your N unique values, but it needs to know in which unique row to place those values. If you have duplicate value combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread gets confused about which row it should place the data in. The quick fix is to add mutate(row = row_number()) to the pipeline before the spread step: data %>% mutate(row = row_number()) %>% spread...
Error 2 You're getting the error sum not meaningful for factors most likely because of summarise_all. summarise_all operates on all columns, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`), `2015_Sum` = sum(`2015`)) (the backticks are required because the column names start with a digit).
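Putting both observations together, a sketch of a corrected pipeline might look like this (untested against the full data; it sums per group first rather than spreading raw rows, in line with the accepted approach above):

```r
# Aggregate orders per country and year, then spread the years into columns
data %>%
  group_by(CountryName, Year) %>%
  summarise(Orders = sum(Orders, na.rm = TRUE)) %>%
  spread(Year, Orders) %>%
  mutate(percent_inc = 100 * ((`2014` - `2015`) / `2014`))
```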