I have a data frame DF with columns (ID, Year, Sales) and I want to create a fourth column, LastYearSales, that holds the Sales value from the previous year. For example, if I have:
ID Year Sales
1 01/01/2015 50000
2 01/01/2014 20000
I want output like this:
ID Year Sales LastYearSales
1 01/01/2015 50000 20000
2 01/01/2014 20000
If your Year column is not of class Date, convert it to Date first with
df$Year <- as.Date(df$Year, "%d/%m/%Y")
You can then try the lubridate package:
library(lubridate)
df$LastYearSales <- c(df[df$Year %in% (as.Date(ymd(df$Year) - years(1))), ]$Sales, NA)
df
# ID Year Sales LastYearSales
# 1 1 2015-01-01 50000 20000
# 2 2 2014-01-01 20000 NA
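The c(..., NA) construction above appears to rely on the rows being ordered from newest to oldest with exactly one row lacking a predecessor. A minimal base R sketch that looks up each row's previous year explicitly with match() is shown here; the data frame is rebuilt from the question, and the helper variable yr is only an illustrative name.

```r
# Example data rebuilt from the question
df <- data.frame(
  ID    = 1:2,
  Year  = as.Date(c("01/01/2015", "01/01/2014"), format = "%d/%m/%Y"),
  Sales = c(50000, 20000)
)

# Calendar year of each row as an integer
yr <- as.integer(format(df$Year, "%Y"))

# For each row, find the position of the row holding the previous year;
# match() returns NA when there is none, which yields NA in the result
df$LastYearSales <- df$Sales[match(yr - 1L, yr)]
df
```

Because the lookup is by value rather than by position, the result does not depend on how the rows are sorted. With several IDs per year, the lookup would need to be done per ID.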
Related
Say, I have two datasets:
First - Revenue Dataset
Year Month Sales Company
1988 5 100 A
1999 2 50 B
Second - Stock Price Data Set
Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990229 B 506
I need to merge these two datasets into one so that the stock price on the month-end date (from the second dataset) is attached to the corresponding month in the revenue dataset (the first dataset).
So the output would be:
Year Month Sales Company Stock
1988 5 100 A 201
1999 2 50 B 506
You can ignore the leap-year problem.
You could extract the month and day from the Date column and, for each Company and each Month, select the row with the max day. Then join this data to the revenue data and select the required columns.
library(dplyr)
stock %>%
  mutate(date = as.integer(substring(Date, 7)),
         Month = as.integer(substring(Date, 5, 6))) %>%
  group_by(Company, Month) %>%
  slice(which.max(date)) %>%
  inner_join(revenue, by = c('Company', 'Month')) %>%
  ungroup() %>%
  select(Year, Month, Sales, Company, Stock)
# Year Month Sales Company Stock
# <int> <int> <int> <chr> <int>
#1 1988 5 100 A 201
#2 1999 2 50 B 506
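The pipeline above groups and joins on Company and Month only, so the same month in two different years would be mixed together. A base R sketch that also keys on the year follows; it assumes Date is stored as a yyyymmdd integer, and the names key, latest and month_end are illustrative only.

```r
# Example data rebuilt from the question (including the invalid 19990229)
revenue <- data.frame(Year = c(1988L, 1999L), Month = c(5L, 2L),
                      Sales = c(100L, 50L), Company = c("A", "B"))
stock <- data.frame(Date = c(19880530L, 19880531L, 19990225L, 19990229L),
                    Company = c("A", "A", "B", "B"),
                    Stock = c(200L, 201L, 500L, 506L))

# Split the yyyymmdd integer into year and month with integer arithmetic
stock$Year  <- stock$Date %/% 10000L
stock$Month <- (stock$Date %/% 100L) %% 100L

# Within each Company/Year/Month keep the row with the largest Date
key       <- paste(stock$Company, stock$Year, stock$Month)
latest    <- ave(stock$Date, key, FUN = max)
month_end <- stock[stock$Date == latest, c("Company", "Year", "Month", "Stock")]

# Attach the month-end stock price to the revenue rows
result <- merge(revenue, month_end, by = c("Company", "Year", "Month"))
result <- result[, c("Year", "Month", "Sales", "Company", "Stock")]
result
```

Since the date is never parsed as a calendar date, the non-existent 19990229 is simply treated as the latest entry for that month.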
First, notice that there is no 1999-02-29!
To get the month ends, use ISOdate on the first of the following month and subtract one day. Then just merge them.
merge(transform(fi, Date=as.Date(ISOdate(fi$Year, fi$Month + 1, 1)) - 1),
transform(se, Date=as.Date(as.character(Date), format="%Y%m%d")))[-2]
# Company Year Month Sales Stock
# 1 A 1988 5 100 201
# 2 B 1999 2 50 506
Data:
fi <- read.table(header=T, text="Year Month Sales Company
1988 5 100 A
1999 2 50 B")
se <- read.table(header=T, text="Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990228 B 506") ## note: date corrected!
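One caveat with the month-end trick: for December, Month + 1 is 13, and ISOdate returns NA for a month that does not exist. A small sketch of a helper that rolls December over into the next year is given below; month_end is a hypothetical name, not part of the answer above.

```r
# Last day of a given month: first day of the following month minus one day
month_end <- function(year, month) {
  next_year  <- year + (month == 12)  # December rolls over to the next year
  next_month <- month %% 12 + 1       # 12 -> 1, otherwise month + 1
  as.Date(ISOdate(next_year, next_month, 1)) - 1
}

month_end(c(1999, 1988, 2000), c(2, 12, 2))
```

The helper is vectorised, so it can be used directly inside transform() in place of the ISOdate expression.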
I have a date column in R of class "factor". I want to extract the year from it. Please advise.
Example data:
S No. Customer Month amount
1 A 01-01-2020 1500
2 B 23-02-2020 2000
3 C 15-03-2020 2500
library(lubridate)  # provides year()
data$Month <- as.character(data$Month)
data$Month <- as.Date(data$Month, "%d-%m-%Y")
data$Year <- year(data$Month)
Try this:
df <- read.table(text="S 'No. Customer' Month amount
1 A 01-01-2020 1500
2 B 23-02-2020 2000
3 C 15-03-2020 2500", header = TRUE)
df$year <- format(as.Date(df$Month, "%d-%m-%Y"), "%Y")
df
#> S No..Customer Month amount year
#> 1 1 A 01-01-2020 1500 2020
#> 2 2 B 23-02-2020 2000 2020
#> 3 3 C 15-03-2020 2500 2020
Created on 2020-04-10 by the reprex package (v0.3.0)
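For completeness, a base R sketch that starts from an explicit factor column and produces an integer year. The data frame is rebuilt from the question; the intermediate variable dates is just an illustrative name.

```r
# Example data with Month stored as a factor, as described in the question
data <- data.frame(
  Customer = c("A", "B", "C"),
  Month    = factor(c("01-01-2020", "23-02-2020", "15-03-2020")),
  amount   = c(1500, 2000, 2500)
)

# Go through character so the labels, not the factor codes, are parsed
dates <- as.Date(as.character(data$Month), format = "%d-%m-%Y")

# format() returns a string; convert it to get a numeric year
data$Year <- as.integer(format(dates, "%Y"))
data
```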
I would like to calculate the sum of sales from the beginning of the year to the newest date.
My data:
ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300
My YTD should be 200 + 300 = 500.
This will sum all values for the current calendar year:
sum(df$Sales[format(df$Date, "%Y") == format(Sys.Date(), "%Y")])
You might need to make sure your df$Date variable is of class Date.
I assume your Date field is character and the last four digits represent the year.
Then you can filter for rows where it equals the current year (the result shown is from a run during 2017):
df<-read.table(text="ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300",header=T)
sum(df[substr(df$Date,4,7)==format(Sys.Date(),"%Y"),]$Sales)
[1] 500
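Both snippets above compare against Sys.Date(), so they only return 500 while the current year is 2017. Since the question asks for the total up to the newest date in the data, a sketch that takes the year from the data itself may be closer to the intent. It assumes a day can be prepended to the month-year strings to make them parseable; latest_year and ytd are illustrative names.

```r
# Example data rebuilt from the question
df <- data.frame(ID = 1,
                 Date = c("11-2016", "12-2016", "01-2017", "02-2017"),
                 Sales = c(100, 100, 200, 300),
                 stringsAsFactors = FALSE)

# "%m-%Y" has no day component, so prepend one to get a complete date
df$Date <- as.Date(paste0("01-", df$Date), format = "%d-%m-%Y")

# Year of the newest date in the data, then the sum of that year's sales
latest_year <- format(max(df$Date), "%Y")
ytd <- sum(df$Sales[format(df$Date, "%Y") == latest_year])
ytd
```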
You could use dplyr to summarise by year. lubridate is also useful for grouping by year, and as.yearmon() from zoo parses the month-year strings:
df1<-read.table(text="ID Date Sales
1 11-2016 100
1 12-2016 100
1 01-2017 200
1 02-2017 300",header=TRUE, stringsAsFactors=FALSE)
library(zoo)  # provides as.yearmon()
df1$Date <- as.yearmon(df1$Date, format = "%m-%Y")
library(dplyr); library(lubridate)
df1 %>%
  group_by(Year = year(Date)) %>%
  summarise(Sales = sum(Sales))
Year Sales
<dbl> <int>
1 2016 200
2 2017 500
I am trying to sum daily rainfall values into monthly totals for a record over 100 years in length. My data takes the form:
Year Month Day Rain
1890 1 1 0
1890 1 2 3.1
1890 1 3 2.5
1890 1 4 15.2
In the example above I want R to sum all the days of rainfall in January 1890, then February 1890, March 1890.... through to December 2010. I guess what I'm trying to do is create a loop to sum values. My output file should look like:
Year Month Rain
1890 1 80.5
1890 2 72.4
1890 3 66.8
1890 4 77.2
Any easy way to do this?
Many thanks.
You can use dplyr for some pleasing syntax:
library(dplyr)
df %>%
  group_by(Year, Month) %>%
  summarise(Rain = sum(Rain))
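The same grouping can be sketched in base R with aggregate(), with no loop and no extra packages. The first four rows below come from the question; the two February rows are made up purely for illustration.

```r
# First four rows from the question plus two invented February rows
df <- data.frame(
  Year  = rep(1890, 6),
  Month = c(1, 1, 1, 1, 2, 2),
  Day   = c(1, 2, 3, 4, 1, 2),
  Rain  = c(0, 3.1, 2.5, 15.2, 1.0, 2.0)
)

# Sum Rain within every Year/Month combination
monthly <- aggregate(Rain ~ Year + Month, data = df, FUN = sum)
monthly
```

The result has one row per Year/Month pair with the columns Year, Month and Rain, which matches the output layout asked for in the question.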
In some cases it can be beneficial to convert it to a time-series class like xts, then you can use functions like apply.monthly().
Data:
df <- data.frame(
  Year = rep(1890, 5),
  Month = c(1, 1, 1, 2, 2),
  Day = 1:5,
  rain = rexp(5)
)
> head(df)
Year Month Day rain
1 1890 1 1 0.1528641
2 1890 1 2 0.1603080
3 1890 1 3 0.5363315
4 1890 2 4 0.6368029
5 1890 2 5 0.5632891
Convert it to xts and use apply.monthly():
library(xts)
dates <- with(df, as.Date(paste(Year, Month, Day), format = "%Y %m %d"))
myXts <- xts(df$rain, dates)
> head(apply.monthly(myXts, sum))
[,1]
1890-01-03 0.8495036
1890-02-05 1.2000919
I have the following table, whose key columns are: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All of this data is in LONG format, so there are multiple line items for a single Customer ID.
I can get the first and last date using a date difference in R, but after converting the file to WIDE format with plyr I still end up with the same problem of multiple orders per customer, just with fewer rows and more columns.
Is there an R function that works out the time interval between purchases by Customer ID? That is, the time between order 1 and 2, order 2 and 3, and so on, assuming these orders exist.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what you want to do, since you don't give the expected result. But I guess you want the intervals between consecutive orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
          diff = c(0, diff(sort(as.Date(Order.Date, '%d/%m/%y'))))), CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID = as.factor(c(rep("A", 3), rep("B", 4))),
                 OrderDate = as.Date(c("2013-07-01", "2013-07-02", "2013-07-03", "2013-06-01", "2013-06-02",
                                       "2013-06-03", "2013-07-01")))
dfs <- split(df, df$customerID)
lapply(dfs, function(x) {
  tmp <- diff(x$OrderDate)
  tmp
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
mutate(SinceLast=(interval(ymd(lag(OrderDate)),ymd(OrderDate)))/86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.
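A base R sketch that produces the same SinceLast column without extra packages is given below. It assumes whole-day differences are wanted and sorts the rows first so that diff() always compares consecutive orders of the same customer.

```r
# Same example data as in the split/lapply answer
df <- data.frame(
  customerID = factor(c(rep("A", 3), rep("B", 4))),
  OrderDate  = as.Date(c("2013-07-01", "2013-07-02", "2013-07-03",
                         "2013-06-01", "2013-06-02", "2013-06-03",
                         "2013-07-01"))
)

# Order by customer and date so consecutive rows are consecutive orders
df <- df[order(df$customerID, df$OrderDate), ]

# Days since the previous order within each customer; NA for the first order
df$SinceLast <- ave(as.numeric(df$OrderDate), df$customerID,
                    FUN = function(d) c(NA, diff(d)))
df
```

as.numeric() on a Date gives the number of days since 1970-01-01, so the differences come out in days directly.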