R: Calculating year to date sum

I would like to calculate the sum of sales from the beginning of the year to the newest date.
My data:
ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300
My YTD should be 200 + 300 = 500.

This sums all values for the current calendar year: sum(df$Sales[format(df$Date, "%Y") == format(Sys.Date(), "%Y")]). It requires df$Date to be of class Date, so convert the column first if it is not.

I assume your Date field is a character string whose last four digits represent the year.
You can then keep only the rows where the year equals the current year:
df<-read.table(text="ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300", header = TRUE)
# characters 4-7 of "mm-yyyy" hold the year; the output below is from a run in 2017
sum(df[substr(df$Date, 4, 7) == format(Sys.Date(), "%Y"), ]$Sales)
[1] 500

You could use dplyr to summarise by year. lubridate's year() is useful inside group_by(), and as.yearmon() comes from the zoo package:
df1 <- read.table(text="ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300", header = TRUE, stringsAsFactors = FALSE)
library(zoo); library(dplyr); library(lubridate)
df1$Date <- as.yearmon(df1$Date, format = "%m-%Y") # as.yearmon() is provided by zoo
df1 %>%
  group_by(Year = year(Date)) %>%
  summarise(Sales = sum(Sales))
Year Sales
<dbl> <int>
1 2016 200
2 2017 500
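The answers above compare against Sys.Date(), so they only give the expected 500 when run in 2017. A minimal base R sketch, assuming that "year to date" should mean the year of the newest date in the data (the name latest_year is just illustrative):

```r
# Sample data from the question; Date is stored as "mm-yyyy" text
df <- data.frame(ID = 1,
                 Date = c("11-2016", "12-2016", "01-2017", "02-2017"),
                 Sales = c(100, 100, 200, 300))

# Prepend a day so that as.Date() can parse the month-year strings
df$Date <- as.Date(paste0("01-", df$Date), format = "%d-%m-%Y")

# Year of the newest date in the data (assumption: this defines "year to date")
latest_year <- format(max(df$Date), "%Y")

# Sum the sales that fall in that year
ytd <- sum(df$Sales[format(df$Date, "%Y") == latest_year])
ytd
```

With the sample data this gives 200 + 300, independent of the date on which the code is run.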

Related

How to merge two datasets with conditions?

Say, I have two datasets:
First - Revenue Dataset
Year Month Sales Company
1988 5 100 A
1999 2 50 B
Second - Stock Price Data Set
Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990229 B 506
I need to merge these two datasets into one, so that the stock price on the month-end date (from the second dataset) is combined with the corresponding month in the revenue dataset (the first dataset).
So the output would be:
Year Month Sales Company Stock
1988 5 100 A 201
1999 2 50 B 506
You can ignore the problem with leap years.
You could extract the year, month and day from the Date column and, for each Company, Year and Month, select the row with the latest day. Then join this to the revenue data and select the required columns.
library(dplyr)
stock %>%
  mutate(Year = as.integer(substring(Date, 1, 4)),
         Month = as.integer(substring(Date, 5, 6)),
         day = as.integer(substring(Date, 7, 8))) %>%
  group_by(Company, Year, Month) %>% # include Year so the same month in different years stays separate
  slice(which.max(day)) %>% # keep the last available day of each month
  ungroup() %>%
  inner_join(revenue, by = c('Company', 'Year', 'Month')) %>%
  select(Year, Month, Sales, Company, Stock)
# Year Month Sales Company Stock
# <int> <int> <int> <chr> <int>
#1 1988 5 100 A 201
#2 1999 2 50 B 506
First notice that there is no 1999-02-29!
To get the month ends, use ISOdate on the first day of the following month and subtract one day. Then just merge them.
merge(transform(fi, Date=as.Date(ISOdate(Year + (Month == 12), Month %% 12 + 1, 1)) - 1), # December wraps to January of the next year
transform(se, Date=as.Date(as.character(Date), format="%Y%m%d")))[-2] # [-2] drops the Date column
# Company Year Month Sales Stock
# 1 A 1988 5 100 201
# 2 B 1999 2 50 506
Data:
fi <- read.table(header=TRUE, text="Year Month Sales Company
1988 5 100 A
1999 2 50 B")
se <- read.table(header=TRUE, text="Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990228 B 506") ## note: date corrected!

Format Dates in R to show only the year

I have a date column in R of class "factor" and I want to extract the year from it. Please advise.
Example data:
S No. Customer Month amount
1 A 01-01-2020 1500
2 B 23-02-2020 2000
3 C 15-03-2020 2500
library(lubridate) # provides year()
data$Month <- as.character(data$Month)
data$Month <- as.Date(data$Month, "%d-%m-%Y")
data$Year <- year(data$Month)
Try this:
df <- read.table(text="S 'No. Customer' Month amount
1 A 01-01-2020 1500
2 B 23-02-2020 2000
3 C 15-03-2020 2500", header = TRUE)
df$year <- format(as.Date(df$Month, "%d-%m-%Y"), "%Y")
df
#> S No..Customer Month amount year
#> 1 1 A 01-01-2020 1500 2020
#> 2 2 B 23-02-2020 2000 2020
#> 3 3 C 15-03-2020 2500 2020
Created on 2020-04-10 by the reprex package (v0.3.0)
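Since the question mentions a column of class "factor", here is a minimal base R sketch that makes the factor handling explicit and returns the year as an integer rather than as text. The vector x is a hypothetical stand-in for data$Month:

```r
# Hypothetical stand-in for the factor column from the question
x <- factor(c("01-01-2020", "23-02-2020", "15-03-2020"))

# Convert the factor labels (not the underlying integer codes) to Date
d <- as.Date(as.character(x), format = "%d-%m-%Y")

# format() returns the year as a string; as.integer() makes it numeric
yr <- as.integer(format(d, "%Y"))
yr
```

Going through as.character() first avoids any ambiguity between a factor's labels and its internal codes.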

Sales of same date last year

I have a data frame DF with columns ID, Year and Sales, and I want to create a fourth column, LastYearSales, that holds the Sales value from one year earlier. For example, if I have:
ID Year Sales
1 01/01/2015 50000
2 01/01/2014 20000
I want output like this:
ID Year Sales LastYearSales
1 01/01/2015 50000 20000
2 01/01/2014 20000
If your Year column is not of class Date, convert it first with
df$Year <- as.Date(df$Year, "%d/%m/%Y")
You can then use the lubridate package:
library(lubridate)
df$LastYearSales <- c(df[df$Year %in% (as.Date(ymd(df$Year) - years(1))), ]$Sales, NA)
df
# ID Year Sales LastYearSales
# 1 1 2015-01-01 50000 20000
# 2 2 2014-01-01 20000 NA
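The one-liner above relies on the rows being ordered so that the matches line up, with a single NA appended at the end. A more general sketch in base R, assuming each date appears at most once, looks up the previous year's row with match(). The third row and the helper name prev are made up for illustration:

```r
# Example data; the 2012 row is an added, hypothetical record with no neighbour
df <- data.frame(ID = 1:3,
                 Year = c("01/01/2015", "01/01/2014", "01/01/2012"),
                 Sales = c(50000, 20000, 7000))
df$Year <- as.Date(df$Year, "%d/%m/%Y")

# Shift every date back by one calendar year
prev <- as.POSIXlt(df$Year)
prev$year <- prev$year - 1
prev <- as.Date(prev)

# match() returns the row holding the shifted date, or NA if there is none
df$LastYearSales <- df$Sales[match(prev, df$Year)]
df
```

Rows without a matching date one year earlier get NA, regardless of row order.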

Sum daily values into monthly values

I am trying to sum daily rainfall values into monthly totals for a record over 100 years in length. My data takes the form:
Year Month Day Rain
1890 1 1 0
1890 1 2 3.1
1890 1 3 2.5
1890 1 4 15.2
In the example above I want R to sum all the days of rainfall in January 1890, then February 1890, March 1890.... through to December 2010. I guess what I'm trying to do is create a loop to sum values. My output file should look like:
Year Month Rain
1890 1 80.5
1890 2 72.4
1890 3 66.8
1890 4 77.2
Any easy way to do this?
Many thanks.
You can use dplyr for some pleasing syntax
library(dplyr)
df %>%
group_by(Year, Month) %>%
summarise(Rain = sum(Rain))
In some cases it can be beneficial to convert it to a time-series class like xts, then you can use functions like apply.monthly().
Data:
df <- data.frame(
  Year = rep(1890, 5),
  Month = c(1, 1, 1, 2, 2),
  Day = 1:5,
  Rain = rexp(5) # random values, so your numbers will differ
)
> head(df)
Year Month Day Rain
1 1890 1 1 0.1528641
2 1890 1 2 0.1603080
3 1890 1 3 0.5363315
4 1890 2 4 0.6368029
5 1890 2 5 0.5632891
Convert it to xts and use apply.monthly():
library(xts)
dates <- with(df, as.Date(paste(Year, Month, Day), format = "%Y %m %d"))
myXts <- xts(df$Rain, dates)
> head(apply.monthly(myXts, sum))
[,1]
1890-01-03 0.8495036
1890-02-05 1.2000919
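If you would rather avoid extra packages, the same grouping can be sketched in base R with aggregate(). The rainfall values below are made up for illustration:

```r
# Small made-up sample in the same layout as the question
df <- data.frame(Year = 1890,
                 Month = c(1, 1, 1, 2, 2),
                 Day = c(1, 2, 3, 1, 2),
                 Rain = c(0, 3.1, 2.5, 1.0, 4.0))

# Sum Rain within every Year/Month combination; no explicit loop is needed
monthly <- aggregate(Rain ~ Year + Month, data = df, FUN = sum)
monthly
```

The result has one row per Year/Month pair with the columns Year, Month and Rain, which matches the requested output layout.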

Creating a vector containing total quantities sold per delivery term

Have a look at the simplified table below. For each product, I want a vector containing the quantities sold within each delivery term. A delivery term is defined as 4 days. So if we look at product A, we see that it starts at 03/12/15, and within the first delivery term (until 07/12/15) a quantity of 4 was sold. The second delivery term starts at 08/12/15 and ends at 12/12/15, and in this period a quantity of 1 was sold. The following delivery term starts at 13/12/15 and ends at 17/12/15. During this period nothing was sold, so the vector must have a value of 0 for it. In the last period, finally, 2 products were sold. So basically the problem is that information about the periods where no products were sold is missing.
Any ideas on how the vector I want can be created using R? I've been thinking of for or while loops, but these do not seem to give the requested results. Note that the code must be applicable to a real dataset containing over 1000 product categories, so it has to be automated in some way.
I would be very grateful if somebody could point me in the right direction.
Product Quantity Date
A 1 03/12/15
A 2 04/12/15
A 1 05/12/15
A 1 08/12/15
A 1 17/12/16
A 1 18/12/16
B 1 19/12/15
B 2 10/05/15
B 2 11/05/15
C 1 01/06/15
C 1 02/06/15
C 1 12/06/15
Assume that dt is the dataset you provided. You'll get a better understanding of the process if you run it step by step (and maybe with an even simpler dataset).
library(lubridate)
library(dplyr)
# convert Date from character to Date class
dt$Date = dmy(dt$Date)
dt %>%
group_by(Product) %>%
do(data.frame(days = seq(min(.$Date), max(.$Date), by="1 day"))) %>% # one row per product and calendar day
mutate(dist = as.numeric(difftime(days,min(days), units="days"))) %>% # days elapsed since the product's first date
ungroup() %>%
left_join(dt, by=c("Product"="Product","days"="Date")) %>% # attach the quantities sold on each day
mutate(Quantity = ifelse(is.na(Quantity), 0, Quantity), # days without sales count as 0
id = floor(dist/5 + 1)) %>% # period id: each period spans 5 calendar days (e.g. 03/12 to 07/12)
group_by(Product, id) %>%
summarise(Sum = sum(Quantity),
min_date = min(days),
max_date = max(days)) %>%
ungroup()
# Product id Sum min_date max_date
# 1 A 1 4 2015-12-03 2015-12-07
# 2 A 2 1 2015-12-08 2015-12-12
# 3 A 3 0 2015-12-13 2015-12-17
# 4 A 4 0 2015-12-18 2015-12-22
# 5 A 5 0 2015-12-23 2015-12-27
# 6 A 6 0 2015-12-28 2016-01-01
# 7 A 7 0 2016-01-02 2016-01-06
# 8 A 8 0 2016-01-07 2016-01-11
# 9 A 9 0 2016-01-12 2016-01-16
# 10 A 10 0 2016-01-17 2016-01-21
# .. ... .. ... ... ...
The first row of the output tells you that for product A, in the first delivery period (id = 1), 4 units were sold in total, and that the period runs from 3/12 to 7/12.
I would suggest {dplyr}'s summarise(), mutate() and group_by() functions. group_by() groups your data by the desired variables (in your case, product and delivery term), mutate() allows operations on grouped columns, and summarise() applies a summarising function over these groups (in your case sum(Quantity)).
So this is how it will look.
Convert the date into a proper format:
library(dplyr)
df <- as_tibble(df) # tbl_df() is deprecated in current dplyr
df$Date <- as.Date(df$Date, format = "%d/%m/%y")
Calculate the delivery terms:
df <- group_by(df, Product) %>% arrange(Date)
df <- mutate(df, term = 1 + as.numeric(Date - min(Date)) %/% 4) # a new term starts every 4 days after each product's first date
Group by product and term and calculate the sum of quantity:
df <- group_by(df, Product, term)
summarise(df, sum = sum(Quantity))
Here's a base R way:
df$groups <- ave(as.numeric(df$Date), df$Product, FUN=function(x) {
intrvl <- findInterval(x, seq(min(x), max(x),4))
as.numeric(factor(intrvl))
})
df
# Product Quantity Date groups
# 1 A 1 2015-12-03 1
# 2 A 2 2015-12-04 1
# 3 A 1 2015-12-05 1
# 4 A 1 2015-12-08 2
# 5 A 1 2016-12-17 3
# 6 A 1 2016-12-18 3
# 7 B 1 2015-12-19 2
# 8 B 2 2015-05-10 1
# 9 B 2 2015-05-11 1
# 10 C 1 2015-06-01 1
# 11 C 1 2015-06-02 1
# 12 C 1 2015-06-12 2
The dates should be converted to one of the date classes. I chose as.Date. When a Date is converted to numeric, the result is the number of days since 1970-01-01. From there, we are able to group by 4-day increments.
Data
df$Date <- as.Date(df$Date, format="%d/%m/%y")
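The question asks for a plain vector per product that includes the empty terms. A minimal base R sketch of that idea follows, assuming a term covers 5 calendar days as in the 03/12 to 07/12 example. The helper term_totals() and the sample data are hypothetical, with dates chosen so that the third term is empty:

```r
# Hypothetical helper: total quantity per delivery term, with 0 for empty terms
term_totals <- function(dates, qty, term_days = 5) {
  # Term index of each sale, counted from the product's first sale date
  term <- as.integer(dates - min(dates)) %/% term_days + 1L
  totals <- numeric(max(term))          # starts as all zeros
  sums <- tapply(qty, term, sum)        # sums only for terms that have sales
  totals[as.integer(names(sums))] <- sums
  totals
}

# Made-up data for two products
df <- data.frame(
  Product = c("A", "A", "A", "A", "A", "A", "B", "B"),
  Quantity = c(1, 2, 1, 1, 1, 1, 2, 2),
  Date = as.Date(c("03/12/15", "04/12/15", "05/12/15", "08/12/15",
                   "18/12/15", "19/12/15", "10/05/15", "11/05/15"),
                 format = "%d/%m/%y")
)

# One vector per product, returned as a named list
result <- lapply(split(df, df$Product),
                 function(d) term_totals(d$Date, d$Quantity))
result
```

Because split() and lapply() handle every product in one pass, the same call should work unchanged for a dataset with many product categories.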
