How to merge two datasets with conditions? - r

Say, I have two datasets:
First - Revenue Dataset
Year Month Sales Company
1988 5 100 A
1999 2 50 B
Second - Stock Price Data Set
Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990229 B 506
I need to merge these two datasets into one such that the stock price on the month-end date (from the second data set) is attached to the corresponding Year/Month row in the revenue dataset.
So the output would be:
Year Month Sales Company Stock
1988 5 100 A 201
1999 2 50 B 506
You can ignore the problem with leap year

You could extract the month and date from the Date column and for each Company and each Month select the row with max date. Then join this data to revenue data and select required columns.
library(dplyr)

stock %>%
  mutate(date  = as.integer(substring(Date, 7)),
         Month = as.integer(substring(Date, 5, 6))) %>%
  group_by(Company, Month) %>%
  slice(which.max(date)) %>%
  inner_join(revenue, by = c('Company', 'Month')) %>%
  ungroup() %>%
  select(Year, Month, Sales, Company, Stock)
# Year Month Sales Company Stock
# <int> <int> <int> <chr> <int>
#1 1988 5 100 A 201
#2 1999 2 50 B 506

First, notice that there is no 1999-02-29!
To get the month ends, use ISOdate on first of following month and subtract one day. Then just merge them.
merge(transform(fi, Date = as.Date(ISOdate(fi$Year, fi$Month + 1, 1)) - 1),
      transform(se, Date = as.Date(as.character(Date), format = "%Y%m%d")))[-2]
# Company Year Month Sales Stock
# 1 A 1988 5 100 201
# 2 B 1999 2 50 506
Data:
fi <- read.table(header=T, text="Year Month Sales Company
1988 5 100 A
1999 2 50 B")
se <- read.table(header=T, text="Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990228 B 506") ## note: date corrected!
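The month-end trick from the merge() line can also be checked in isolation. A minimal base R sketch (the month_end helper is mine, not from the answer); note that December needs a manual rollover, since ISOdate(year, 13, 1) returns NA:

```r
# Last day of a month = first day of the following month minus one day.
month_end <- function(year, month) {
  nxt_year  <- year + (month == 12)               # roll December into January
  nxt_month <- ifelse(month == 12, 1, month + 1)  # of the next year
  as.Date(ISOdate(nxt_year, nxt_month, 1)) - 1
}

month_end(1988, 5)   # 1988-05-31
month_end(1999, 2)   # 1999-02-28
month_end(2000, 2)   # 2000-02-29, leap years come for free
month_end(2021, 12)  # 2021-12-31
```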

Related

Best way to iterate through dataframe, calculate monthly value, generate new dataframe?

I have a number of excel spreadsheets I'm iterating through with payments on a given date and an associated number of months of service for the payment.
e.g.
Product    Cost   License Date Start   License length in months   Monthly cost
Product A  3000   January 2022         3                          1000
Product B  2400   March 2022           4                          600
Product B  2400   Feb 2022             3                          800
Product A  2000   March 2022           2                          1000
What I would like to do is create a new dataframe, shaped around the months, with the broken down individual and total monthly cost of each product, based on the length of the license.
For example, in the table above, the cost of the first instance of Product A is 3000 and runs for 3 months, making it 1000/month and running through January, February and March. For the second instance of Product A, it is again 1000/month but runs through March and April, so there is overlap, with March having a total cost of 2000 for Product A.
In the end, my outcome should look like this:
Date            Product A cost   Product B cost   Product C cost   Total cost
January 2022    1000             0                0                1000
February 2022   1000             800              0                1800
March 2022      2000             2400             0                4400
April 2022      1000             2400             0                3400
May 2022        1000             600              0                600
June 2022       1000             600              0                600
I am struggling to find the best way to iterate through the original data and generate the end result. My general approach is to use apply to iterate through the original dataframe, generating rows based on the number of months, start date, and monthly cost, before attempting to reshape into the relevant columns. However, I am having trouble getting apply to return what I expect, and I am concerned that this isn't the most efficient way to do it.
Any help much appreciated.
I think you have to be a little bit careful with your calculations regarding your dates. In your example the start and end dates are all in the same year, but if your starting month is December and the license lasts more than a month, then you have to pay attention to the calculation of the month and year. For this you can use the lubridate package. I added one row to your example for December 2021 to demonstrate it:
library(tidyverse)
library(lubridate)
df <- read.table(text = "Product Cost License Date Start License length in months Monthly cost
Product A 3000 January 2022 3 1000
Product B 2400 March 2022 4 600
Product B 2400 Feb 2022 3 800
Product A 2000 March 2022 2 1000
Product C 2000 December 2021 2 1000", sep = "\t", header = TRUE)
df.result <- df %>%
  mutate(id = row_number(), Date = my(License.Date.Start)) %>%
  group_by(id, Product, Monthly.cost) %>%
  summarise(Date = Date %m+% months((1:License.length.in.months) - 1)) %>%
  pivot_wider(id_cols = Date, names_from = Product, values_from = Monthly.cost,
              values_fn = sum, values_fill = 0) %>%
  arrange(Date) %>%
  mutate(Total = rowSums(select(., contains("Product"))),
         Date = format(Date, "%B %Y"))
df.result
#> # A tibble: 7 x 5
#> Date `Product A` `Product B` `Product C` Total
#> <chr> <int> <int> <int> <dbl>
#> 1 December 2021 0 0 1000 1000
#> 2 January 2022 1000 0 1000 2000
#> 3 February 2022 1000 800 0 1800
#> 4 March 2022 2000 1400 0 3400
#> 5 April 2022 1000 1400 0 2400
#> 6 May 2022 0 600 0 600
#> 7 June 2022 0 600 0 600
Created on 2022-10-17 by the reprex package (v2.0.1)
Using your input df as a starting point, where I changed the License Date Start into the corresponding month number, you can uncount the occurrences by License length in months.
input_df <- data.frame(
  Product = c("Product A", "Product B", "Product B", "Product A"),
  month_start = c(1, 3, 2, 3),
  License_length = c(3, 4, 3, 2),
  Monthly = c(1000, 600, 800, 1000)
)
You then want to keep track of every row, as one product can have multiple starting months. In this example I used row_number().
output_df <- input_df %>%
  mutate(rn = row_number()) %>%
  group_by(Product, rn) %>%
  uncount(License_length) %>%
  mutate(month_active = row_number() + month_start - 1) %>%
  group_by(Product, month_active) %>%
  summarize(Product_monthly_cost = sum(Monthly)) %>%
  group_by(month_active) %>%
  mutate(Total_cost = sum(Product_monthly_cost)) %>%
  pivot_wider(names_from = Product, values_from = Product_monthly_cost) %>%
  replace(is.na(.), 0)
I uncount per product type and row number rn. Then I define every month in which the license is active, and sum the monthly cost per product and active month. Then I group per active month to determine the total monthly cost. Finally I pivot_wider per product and month_active, just like the desired output dataframe you posted, and replace the NAs with 0.
The result is
> output_df
# A tibble: 6 × 4
month_active Total_cost `Product A` `Product B`
<dbl> <dbl> <dbl> <dbl>
1 1 1000 1000 0
2 2 1800 1000 800
3 3 3400 2000 1400
4 4 2400 1000 1400
5 5 600 0 600
6 6 600 0 600
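If you want the month numbers in month_active turned back into labels like the ones in the desired output, base R's built-in month.name vector can do it (assuming, as in the question, that all months fall in 2022):

```r
# month.name is a built-in constant: "January", "February", ...
month_active <- 1:6                       # stand-in for output_df$month_active
paste(month.name[month_active], 2022)
# "January 2022" "February 2022" "March 2022" "April 2022" "May 2022" "June 2022"
```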

How to do a frequency table where column values are variables?

I have a DF named JOB. In that DF I have 4 columns: PERSON_ID, JOB, FT (full time or part time, with values of 1 for full time and 2 for part time) and YEAR. Every person can have only 1 full-time job per year in this DF. This is the full-time job they got most of their income from during the year.
DF
PERSON_ID JOB FT YEAR
1 Analyst 1 2018
1 Analyst 1 2019
1 Analyst 1 2020
2 Coach 1 2018
2 Coach 1 2019
2 Analyst 1 2020
3 Gardener 1 2020
4 Coach 1 2018
4 Coach 1 2019
4 Analyst 1 2020
4 Coach 2 2019
4 Gardener 2 2019
I want to get different frequencies along the lines of the following question:
What full-time job changes occurred from 2019 to 2020?
I want to look only at changes where FT = 1.
I want my end table to look like this
2019 2020 frequency
Analyst Analyst 1
Coach Analyst 2
NA Gardener 1
I want to look at the data so that I can say 2 people moved from their coaching job to an analyst job, 1 analyst did not change their job, and one person entered the labour market as a gardener.
I tried to fiddle around with the table function but did not even get close to what I wanted. I could not get the YEARs to go to separate variables.
10 Bonus points if i can do it in base R :)
Thank you for your help
Not pretty but worked:
# split df by year
df_2019 <- df[df$YEAR %in% c(2019) & df$FT == 1, ]
df_2020 <- df[df$YEAR %in% c(2020) & df$FT == 1, ]
# rename Job columns
df_2019$JOB_2019 <- df_2019$JOB
df_2020$JOB_2020 <- df_2020$JOB
# select needed columns
df_2019 <- df_2019[, c("PERSON_ID", "JOB_2019")]
df_2020 <- df_2020[, c("PERSON_ID", "JOB_2020")]
# merge dfs
df2 <- merge(df_2019, df_2020, by = "PERSON_ID", all = TRUE)
df2$frequency <- 1
df2$JOB_2019 <- addNA(df2$JOB_2019)
df2$JOB_2020 <- addNA(df2$JOB_2020)
# aggregate frequency
aggregate(frequency ~ JOB_2019 + JOB_2020, data = df2, FUN = sum, na.action=na.pass)
JOB_2019 JOB_2020 frequency
1 Analyst Analyst 1
2 Coach Analyst 2
3 <NA> Gardener 1
Not base R but worked:
library(dplyr)
library(tidyr)
data %>%
  filter(FT == 1, YEAR %in% c(2019, 2020)) %>%
  group_by(YEAR, JOB, PERSON_ID) %>%
  tally() %>%
  pivot_wider(names_from = YEAR, values_from = JOB) %>%
  select(-PERSON_ID) %>%
  group_by(`2019`, `2020`) %>%
  summarise(n = n())
`2019` `2020` n
<chr> <chr> <int>
1 Analyst Analyst 1
2 Coach Analyst 2
3 NA Gardener 1

split-apply-combine R

I have a data table with several columns.
Let's say:
Location, which may include Los Angeles, etc.
age_Group, let's say (young, child, teenager), etc.
year = (2000, 2001, ..., 2015)
month = c(jan, ..., dec)
I would like to group_by them and see how many people have spent money in some intervals, let's say I have intervals of interval_1 = (1, 100), (100, 1000), ..., interval_20 = (1000, infinity).
How shall I proceed? What should I do after the following?
data %>% group_by(location, age_Group, year, month)
sample:
location age_gp year month spending
LA child 2000 1 102
LA teen 2000 1 15
LA teen 2000 10 9
NY old 2000 11 1000
NY old 2010 2 1000000
NY teen 2020 3 10
desired output
LA, child, 2000, jan interval_1
LA, child, 2000, feb interval_20
...
NY OLD 2015 Dec interval_1
The last column has to be determined by adding up the spending of all people belonging to the same city, age_group, year and month.
You can first create a new column (spending_cat) using, for example, the cut function. Afterwards you can add the new variable as a grouping variable, and then you just need to count:
df <- data.frame(group = sample(letters[1:4], size = 1000, replace = T),
                 spending = rnorm(1000))

df %>%
  mutate(spending_cat = cut(spending, breaks = c(-5:5))) %>%
  group_by(group, spending_cat) %>%
  summarise(n_people = n())
# A tibble: 26 x 3
# Groups: group [?]
group spending_cat n_people
<fct> <fct> <int>
1 a (-3,-2] 6
2 a (-2,-1] 36
3 a (-1,0] 83
4 a (0,1] 78
5 a (1,2] 23
6 a (2,3] 10
7 b (-4,-3] 1
8 b (-3,-2] 4
9 b (-2,-1] 40
10 b (-1,0] 78
# … with 16 more rows
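Applied to the sample data from the question, the same pattern works after first summing spending per group; the break points and interval labels below are my guesses at the intervals the question sketches:

```r
library(dplyr)

spend <- data.frame(
  location = c("LA", "LA", "LA", "NY", "NY", "NY"),
  age_gp   = c("child", "teen", "teen", "old", "old", "teen"),
  year     = c(2000, 2000, 2000, 2000, 2010, 2020),
  month    = c(1, 1, 10, 11, 2, 3),
  spending = c(102, 15, 9, 1000, 1000000, 10)
)

spend %>%
  group_by(location, age_gp, year, month) %>%
  summarise(total = sum(spending), .groups = "drop") %>%   # total per group first
  mutate(interval = cut(total,
                        breaks = c(0, 100, 1000, Inf),     # assumed break points
                        labels = c("interval_1", "interval_2", "interval_20")))
```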

Using filter in dplyr to generate values for all rows

library(tidyverse)
library(nycflights13)
nycflights13::flights
If the following expression gives flights per day from the dataset:
daily <- dplyr::group_by( flights, year, month, day)
(per_day <- dplyr::summarize( daily, flights = n()))
I wanted something similar for cancelled flights:
canx <- dplyr::filter( flights, is.na(dep_time) & is.na(arr_time))
canx2 <- canx %>% dplyr::group_by( year, month, day)
My goal was to have a data frame of the same length as the one for all summarised flights.
I can get number of flights cancelled per day:
(canx_day <- dplyr::summarize( canx2, flights = n()))
but obviously this is a slightly shorter data frame, so I cannot run e.g.:
canx_day$propcanx <- per_day$flights/canx_day$flights
Even if I introduce NAs I can replace them.
So my question is, should I not be using filter, or are there arguments to filter I should be applying?
Many thanks
You should not be using filter. As others suggest, this is easy with a canceled column, so our first step will be to create that column. Then you can easily get whatever you want with a single summarize. For example:
flights %>%
  mutate(canceled = as.integer(is.na(dep_time) & is.na(arr_time))) %>%
  group_by(year, month, day) %>%
  summarize(n_scheduled = n(),
            n_not_canceled = sum(!canceled),
            n_canceled = sum(canceled),
            prop_canceled = mean(canceled))
# # A tibble: 365 x 7
# # Groups: year, month [?]
# year month day n_scheduled n_not_canceled n_canceled prop_canceled
# <int> <int> <int> <int> <int> <int> <dbl>
# 1 2013 1 1 842 838 4 0.004750594
# 2 2013 1 2 943 935 8 0.008483563
# 3 2013 1 3 914 904 10 0.010940919
# 4 2013 1 4 915 909 6 0.006557377
# 5 2013 1 5 720 717 3 0.004166667
# 6 2013 1 6 832 831 1 0.001201923
# 7 2013 1 7 933 930 3 0.003215434
# 8 2013 1 8 899 895 4 0.004449388
# ...
This gives you scheduled and canceled flights per flight number per day (grouped by flight, year, month, day):
nycflights13::flights %>%
  group_by(flight, year, month, day) %>%
  summarize(per_day = n(),
            canx = sum(ifelse(is.na(arr_time), 1, 0)))
There is a simple way to calculate the number of flights canceled per day. Let's assume that the Cancelled column is TRUE for a cancelled flight. If so, the way to calculate daily cancelled flights is:
flights %>%
  group_by(year, month, day) %>%
  summarize(canx_day = sum(Cancelled))
canx_day will contain the number of canceled flights for each day.
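Since nycflights13::flights has no Cancelled column out of the box, that column has to be derived first; a minimal sketch, reusing the no-departure-and-no-arrival definition from the first answer above:

```r
library(dplyr)
library(nycflights13)

flights %>%
  mutate(Cancelled = is.na(dep_time) & is.na(arr_time)) %>%  # derive the flag
  group_by(year, month, day) %>%
  summarize(canx_day = sum(Cancelled), .groups = "drop")
```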

R: Calculating year to date sum

I would like to calculate the sum of sales from the beginning of the year to the newest date.
My data:
ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300
MY YTD should be 200+300
This will sum all values for the current calendar year: sum(df$Sales[format(df$Date, "%Y") == format(Sys.Date(), "%Y")]). You might need to make sure your df$Date variable is of class Date.
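Since the sample Date column holds strings like "02-2017", it has to be converted to class Date before format() can extract the year. A small sketch, pasting on a dummy day of month so as.Date can parse it, and hard-coding "2017" instead of Sys.Date() so the result is reproducible:

```r
df <- data.frame(ID = 1,
                 Date = c("11-2016", "12-2016", "01-2017", "02-2017"),
                 Sales = c(100, 100, 200, 300))

# "02-2017" -> "01-02-2017" -> Date 2017-02-01
df$Date <- as.Date(paste0("01-", df$Date), format = "%d-%m-%Y")

sum(df$Sales[format(df$Date, "%Y") == "2017"])
# [1] 500
```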
I assume your Date field is character and the last four digits represent the year.
Then you can filter where it equals the current year as below:
df<-read.table(text="ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300",header=T)
sum(df[substr(df$Date,4,7)==format(Sys.Date(),"%Y"),]$Sales)
[1] 500
You could use dplyr to summarise by year. lubridate is also useful to group_by year:
library(dplyr)
library(lubridate)
library(zoo)  # provides as.yearmon()

df1 <- read.table(text="ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300", header=TRUE, stringsAsFactors=FALSE)
df1$Date <- as.yearmon(df1$Date, format="%m-%Y")

df1 %>%
  group_by(Year = year(Date)) %>%
  summarise(Sales = sum(Sales))
Year Sales
<dbl> <int>
1 2016 200
2 2017 500
