As a novice, I was hoping to understand how to aggregate data over an arbitrary look-back window (e.g. the 30 days before a given date). See my data below as an example. I want to group by name and sum sales for the 30 days leading up to, say, 02-15-2019. So it should look back 30 calendar days from 02-15-2019 and give me the total sales by Name (e.g. Person1 = $60; Person2 = $30).
Name Date Sales
Person1 01-31-2019 $10
Person1 02-15-2019 $50
Person1 06-18-2019 $100
Person2 01-31-2019 $25
Person2 02-15-2019 $5
Person2 06-18-2019 $200
Simple example (if I understood your question correctly):
library(dplyr)
set.seed(123)
df <- data.frame(Name = sample(c("Person1", "Person2"), 6, replace = TRUE),
Date = c("01-31-2019", "02-15-2019", "06-18-2019", "01-31-2019", "02-15-2019", "06-18-2019"),
Sales = runif(6, 10, 100), stringsAsFactors = FALSE)
df$Date <- lubridate::mdy(df$Date)
target <- lubridate::mdy("02-15-2019")
sales <- df %>% filter(between(Date, target - 30, target)) %>%
group_by(Name) %>% summarise(Sales = sum(Sales))
select Name ,sum(sales) from orders
where
DATEDIFF(day,OrderDate,GETDATE()) between 0 and 30
group by Name
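As a quick check against the figures in the question, here is a minimal base R sketch that uses the question's six rows directly rather than random data. The data frame name `orders` and the helper `sales_lookback()` are made up for illustration, and the dollar signs are dropped so that Sales is numeric.

```r
# Data copied from the question; Sales stored as plain numbers
orders <- data.frame(
  Name  = rep(c("Person1", "Person2"), each = 3),
  Date  = as.Date(rep(c("2019-01-31", "2019-02-15", "2019-06-18"), times = 2)),
  Sales = c(10, 50, 100, 25, 5, 200)
)

# Hypothetical helper: total Sales per Name over the `days` calendar days
# ending at `target` (both ends included)
sales_lookback <- function(data, target, days = 30) {
  in_window <- data$Date >= target - days & data$Date <= target
  aggregate(Sales ~ Name, data = data[in_window, ], FUN = sum)
}

result <- sales_lookback(orders, as.Date("2019-02-15"))
print(result)
```

With this data the result should be 60 for Person1 and 30 for Person2, matching the totals expected in the question.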
Consider a data frame that has 3 columns: A - a name; B - the yearly food intake (one name can eat different foods); C - the year in which the person stops eating that food
Such as:
A B C
Peter 400 2035
Peter 500 2050
Peter 350 2024
John 700 2050
I need to create a time series that sums the food intake for each person, from today (2022) to 2050. John's case is easy: 700 * (2050-2022). But for Peter, I need to add some restrictions: sum the 3 lines until 2024, then one of them goes to zero, but the time series keeps summing the other two lines, until eventually there is only one line left to sum.
So year 2022 would be (400+500+350), and the same for 2023 and 2024. Then it would be (400+500) until 2035, and so on.
This allows me to have a time-series, per person, which contains the yearly intake of food, taking into consideration that the yearly food intake will decrease throughout the years.
Are you after the total intake over the period? Then this will calculate it:
library(tidyverse)
data <- tribble(~"A", ~"B", ~"C",
"Peter", 400, 2035,
"Peter", 500, 2050,
"Peter", 350, 2024,
"John", 700, 2050)
data %>%
mutate(line_total = B*(C - 2022)) %>% # 2022 being the start year
group_by(A) %>%
summarise(person_total = sum(line_total))
If you actually want a time-series, with a column for each year and the row total at the end, then try this:
years <- 2022:max(data$C)
mat <- matrix(nrow = nrow(data), ncol = length(years))
colnames(mat) <- c(years)
timeseries <- cbind(data, mat) %>%
as_tibble() %>%
pivot_longer(-c(A, B, C)) %>%
mutate(value = ifelse(name <= C, B, 0)) %>%
pivot_wider() %>%
select(-c(B, C)) %>%
mutate(rowsum = rowSums(across(where(is.numeric))))
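If the goal is one series per person rather than one row per food line, a base R sketch along these lines may help. It assumes, as in the time-series code above, that a line still counts in its stop year; `food`, `grid` and `per_person` are illustrative names.

```r
food <- data.frame(
  A = c("Peter", "Peter", "Peter", "John"),
  B = c(400, 500, 350, 700),
  C = c(2035, 2050, 2024, 2050)
)

years <- 2022:max(food$C)

# merge() with no shared column names gives every (food line, year) pair
grid <- merge(food, data.frame(year = years))

# Assumption: a line contributes B up to and including its stop year C, then 0
grid$yearly <- ifelse(grid$year <= grid$C, grid$B, 0)

# One total per person per year
per_person <- aggregate(yearly ~ A + year, data = grid, FUN = sum)
per_person <- per_person[order(per_person$A, per_person$year), ]
print(head(per_person))
```

For Peter this gives 1250 for 2022 through 2024, 900 for 2025 through 2035, and 500 afterwards; John stays at 700 throughout.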
I wish to generate some Tidy data.
26 companies are observed everyday for 10 days.
Each day a value is recorded.
The first day is: 2020/1/1
How do I create a list of dates so that the first 26 rows of the date column of the data frame are "2020/1/1" (Year, Month, Day), the next 26 rows are "2020/1/2", etc.?
Here is the data frame without the date column:
library(tidyverse)
set.seed(33)
date_chunk <- rep(as.Date("2020/1/1"), 26)
# Tidy data. 10 sequential days starting 2020/1/1/
df <- tibble(
company = rep(letters, 10),
value = sample(0:5, 260, replace = TRUE),
color = "grey"
)
You can try this
rep(seq(as.Date("2020-01-01"),as.Date("2020-01-10"),1),each=26)
This will return a vector of dates from 2020-01-01 to 2020-01-10 in which each date is repeated 26 times.
For each company we can add row_number() to first date_chunk to get an incremental sequence of dates.
library(dplyr)
df %>%
group_by(company) %>%
mutate(date = first(date_chunk) + row_number() - 1)
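To see the rep()/seq() suggestion in context, here is a small sketch that builds the date vector and attaches it to a data frame. It uses base R only; `n_companies` and `n_days` are illustrative names, and the `value` and `color` columns from the question are left out for brevity.

```r
n_companies <- 26
n_days <- 10

# Ten consecutive days starting 2020-01-01, each repeated once per company
dates <- rep(seq(as.Date("2020-01-01"), by = "day", length.out = n_days),
             each = n_companies)

df <- data.frame(
  company = rep(letters, n_days),  # a..z cycles once per day
  date = dates
)

print(head(df$date, 3))  # first rows all fall on the first day
print(table(df$date))    # number of rows per day
```

This lines up because the company column cycles through a to z within each day, so each block of 26 rows shares one date.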
I need to aggregate data over multi-month periods in an R data frame, e.g. a data frame whose dates span 2017 and 2018.
date category amt
1 2017-08-05 A 0.1900707
2 2017-08-06 B 0.2661277
3 2017-08-07 c 0.4763196
4 2017-08-08 A 0.5183718
5 2017-08-09 B 0.3021019
6 2017-08-10 c 0.3393616
What I want is to sum based on 6 month period and category:
period category sum
1 2017_secondPeriod A 25.00972
2 2018_firstPeriod A 25.59850
3 2017_secondPeriod B 24.96924
4 2018_firstPeriod B 24.79649
5 2017_secondPeriod c 20.17096
6 2018_firstPeriod c 27.01794
What I did:
1. Select the last 6 months of 2017, and likewise the first 6 months of 2018
2. Add a new column to each subset to indicate the period
3. Combine the 2 subsets again
4. Aggregate
as follows:
library(lubridate)
df <- data.frame(
date = as.Date("2017-08-04") + days(1:300), # fixed start so the data spans 2017 and 2018 like the sample above
category = c("A","B","c"),
amt = runif(300)
)
df2017_secondHalf <- subset(df, month(df$date) %in% c(7,8,9,10,11,12) & year(df$date) == 2017)
df2018_firstHalf <- subset(df, month(df$date) %in% c(1,2,3,4,5,6) & year(df$date) == 2018)
sum1 <- aggregate(df2017_secondHalf$amt, by=list(category=df2017_secondHalf$category), FUN=sum)
sum2 <- aggregate(df2018_firstHalf$amt, by=list(category=df2018_firstHalf$category), FUN=sum)
df2017_secondHalf$period <- '2017_secondPeriod'
df2018_firstHalf$period <- '2018_firstPeriod'
# combine the two subsets again before aggregating
combined <- rbind(df2017_secondHalf, df2018_firstHalf)
aggregate(x = combined$amt, by = combined[c("period", "category")], FUN = sum)
I tried to figure it out but did not know how to aggregate over multiple months, e.g. 3 months or 6 months.
Any suggestions?
Thanks in advance
With lubridate and tidyverse (dplyr & magrittr)
First, let's create groups with Semesters, Quarter, and "Trimonthly".
library(tidyverse)
library(lubridate)
df <- df %>% mutate(Semester = semester(date, with_year = TRUE),
Quarter = quarter(date, with_year = TRUE),
Trimonthly = round_date(date, unit = "3 months" ))
Lubridate's semester() breaks dates into semesters and gives you a 1 (Jan-Jun) or 2 (Jul-Dec); quarter() does a similar thing with quarters.
I add a third, the more basic round_date function, where you can specify your time frame in the form of size and time units. It rounds each date to the nearest boundary of such a time frame (floor_date() would instead yield the first date of the time frame the date falls in). I deliberately name it "Trimonthly" so you can see how it compares to quarter().
Pivot.Semester <- df %>%
group_by(Semester, category) %>%
summarise(Semester.sum = sum(amt))
Pivot.Quarter <- df %>%
group_by(Quarter, category) %>%
summarise(Quarter.sum = sum(amt))
Pivot.Trimonthly <- df %>%
group_by(Trimonthly, category) %>%
summarise(Trimonthly.sum = sum(amt))
Pivot.Semester
Pivot.Quarter
Pivot.Trimonthly
Optional: If you want to join the summarised data to the original DF.
df <- df %>% left_join(Pivot.Semester, by = c("category", "Semester")) %>%
left_join(Pivot.Quarter, by = c("category", "Quarter")) %>%
left_join(Pivot.Trimonthly, by = c("category", "Trimonthly"))
df
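One thing worth checking before relying on the Trimonthly grouping: as far as I can tell, round_date() rounds to the nearest period boundary, while floor_date() returns the start of the period a date falls in. The sketch below compares the two on a single date chosen for illustration; it assumes lubridate is installed.

```r
library(lubridate)

x <- as.Date("2017-08-20")   # illustrative date in the third quarter

rounded <- round_date(x, unit = "3 months")   # nearest 3-month boundary
floored <- floor_date(x, unit = "3 months")   # start of the 3-month period containing x
qtr <- quarter(x)                             # quarter number, 1 to 4

print(rounded)
print(floored)
print(qtr)
```

Here the floored date stays in the same quarter as x, while the rounded date moves to the start of the next one, so grouping by round_date() can split a calendar quarter across two groups. If the groups should line up with quarter(), floor_date() is probably the better fit.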
Here is a 3-line solution that uses no packages. Let k be the number of months in a period. For half-year periods k is 6. For quarterly periods k would be 3, etc. Replace 02 in the sprintf format with 1 if you want one-digit suffixes (but not for monthly periods, since those must be two digits). Further modify the sprintf format if you want it to exactly match the question.
k <- 6
period <- with(as.POSIXlt(DF$date), sprintf("%d-%02d", year + 1900, (mon %/% k) + 1))
aggregate(amt ~ category + period, DF, sum)
giving:
category period amt
1 A 2017-02 0.7084425
2 B 2017-02 0.5682296
3 c 2017-02 0.8156812
At the expense of using one package we can simplify the quarterly and monthly calculations by replacing the formula for period with one of these:
library(zoo)
# quarterly
period <- as.yearqtr(DF$date)
# monthly
period <- as.yearmon(DF$date)
Note: The input in reproducible form is:
Lines <- "date category amt
1 2017-08-05 A 0.1900707
2 2017-08-06 B 0.2661277
3 2017-08-07 c 0.4763196
4 2017-08-08 A 0.5183718
5 2017-08-09 B 0.3021019
6 2017-08-10 c 0.3393616"
DF <- read.table(text = Lines)
DF$date <- as.Date(DF$date)
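To get labels in the style the question asks for (2017_secondPeriod, 2018_firstPeriod), one option is to build the label with paste() instead of sprintf(). This is only a sketch for half-year periods; the variable names `lt` and `half` are made up here, and the six rows are the sample from the question.

```r
DF <- data.frame(
  date = as.Date(c("2017-08-05", "2017-08-06", "2017-08-07",
                   "2017-08-08", "2017-08-09", "2017-08-10")),
  category = c("A", "B", "c", "A", "B", "c"),
  amt = c(0.1900707, 0.2661277, 0.4763196, 0.5183718, 0.3021019, 0.3393616)
)

lt <- as.POSIXlt(DF$date)

# mon is 0-based: 0 to 5 is January to June, 6 to 11 is July to December
half <- ifelse(lt$mon < 6, "firstPeriod", "secondPeriod")
DF$period <- paste(lt$year + 1900, half, sep = "_")

result <- aggregate(amt ~ period + category, DF, sum)
print(result)
```

All six sample rows fall in the second half of 2017, so the result has one 2017_secondPeriod row per category with the same sums as shown above.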
I have a df with variables named as below
id indexDate eventDate1 eventDate2 V1 V2 V3 ....... V365
For the date range (eventDate1 - indexDate) to (eventDate2 - indexDate), I want to tag the days of occurrence in the V1 to V365 columns.
Each V represents the number of days post-indexDate.
For example:
If:
indexDate is 1/1/2017
eventDate1 is 1/3/2017 (= Day 2)
eventDate2 is 1/5/2017 (= Day 4),
then:
V2-V4 would be assigned a value of 1 and the rest of the V~ are 0.
If there is a better way to do this, feel free to let me know!
Thanks.
This works-
library(dplyr)
library(tidyr)
# Make fake data
dates <- data.frame(id = 1:10,
indexDate = rep(as.Date("2017/01/01"), 10),
eventDate1 = as.Date(paste0("2017/01/", 1:10)),
eventDate2 = as.Date(paste0("2017/01/", 16:25)))
# Step through this to understand what's going on
dates[rep(row.names(dates), 365), ] %>%
arrange(id) %>%
mutate(Day = rep(1:365, nrow(dates)),
Flag = ifelse(Day <= as.numeric(eventDate2 - indexDate) &
Day >= as.numeric(eventDate1 - indexDate), 1, 0)) %>%
# move to wide format: one column per day
spread(Day, Flag)
I played with adding a paste0("V", Day) but the spread came out unordered. With this column convention you can refer to the individual columns with back-ticks `.
dates %>% select(`1`, `2`, `3`)
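An alternative sketch in base R builds the flags as a matrix with outer(), which keeps the V1 to V365 columns in order. It treats both eventDate1 and eventDate2 as inclusive, following the example in the question; the two rows of data and the names `first_day`, `last_day` and `flags` are made up for illustration.

```r
dates <- data.frame(
  id = 1:2,
  indexDate  = as.Date(c("2017-01-01", "2017-01-01")),
  eventDate1 = as.Date(c("2017-01-03", "2017-01-10")),
  eventDate2 = as.Date(c("2017-01-05", "2017-01-12"))
)

days <- 1:365
first_day <- as.numeric(dates$eventDate1 - dates$indexDate)  # first flagged day
last_day  <- as.numeric(dates$eventDate2 - dates$indexDate)  # last flagged day

# flags[i, j] is 1 when day j lies within [first_day[i], last_day[i]], else 0
flags <- (outer(first_day, days, "<=") & outer(last_day, days, ">=")) * 1L
colnames(flags) <- paste0("V", days)

result <- cbind(dates, flags)
print(result[, c("id", "V1", "V2", "V3", "V4", "V5")])
```

For the first row (the question's example) V2 to V4 are 1 and every other V column is 0.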
I need to calculate the so-called MAT (Moving Annual Total), meaning the % change in sales value between the same day in two different years:
ID Sales Day Month Year
A 500 31 12 2015
A 100 1 1 2016
A 200 2 1 2016
...
A 200 1 1 2017
Does anybody have an idea about how to deal with it?
I want to get this:
ID Sales Day Month Year **MAT**
With the way your data is set up, you're actually quite close. What you want to do now is group your data by month and day, order each group by year, and then take the successive differences (assuming you want the MAT for sequential years)
library(lubridate)
library(dplyr)
X <-
data.frame(date = seq(as.Date("2014-01-01"),
as.Date("2017-12-31"),
by = 1)) %>%
mutate(day = day(date),
month = month(date),
year = year(date),
sales = rnorm(nrow(.), mean = 100, sd = 5))
X %>%
group_by(month, day) %>%
arrange(month, day, year) %>%
mutate(mat = c(NA, diff(sales))) %>%
ungroup()
If you want to take a difference between any two arbitrary years, this will need some refinement.
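One possible refinement, sketched in base R on the sample rows from the question: match the two years of interest on ID, Month and Day and compute the percent change directly. The helper name `pct_change_between()` and its arguments are hypothetical.

```r
d <- read.table(header = TRUE, text =
"ID Sales Day Month Year
A 500 31 12 2015
A 100 1 1 2016
A 200 2 1 2016
A 200 1 1 2017")

# Hypothetical helper: percent change in Sales from `base_year` to `cmp_year`,
# matched on ID, Month and Day; days missing in either year are dropped
pct_change_between <- function(d, base_year, cmp_year) {
  cols <- c("ID", "Month", "Day", "Sales")
  base <- d[d$Year == base_year, cols]
  cmp  <- d[d$Year == cmp_year, cols]
  out <- merge(cmp, base, by = c("ID", "Month", "Day"),
               suffixes = c(".cmp", ".base"))
  out$MAT <- 100 * (out$Sales.cmp - out$Sales.base) / out$Sales.base
  out
}

res <- pct_change_between(d, base_year = 2016, cmp_year = 2017)
print(res)
```

On the sample data only 1 January appears in both 2016 and 2017, so the result is a single row with a change of 100 percent (from 100 to 200).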
Here is a solution with base R. Mainly it is a self-join:
d$prev.Year <- d$Year-1
dd <- merge(d,d, by.x=c("prev.Year", "Month", "Day"), by.y=c("Year", "Month", "Day"))
dd$MAT <- with(dd, (Sales.x-Sales.y)/Sales.y)
If ID takes more than one value, you probably want to join on it as well:
dd <- merge(d,d, by.x=c("ID", "prev.Year", "Month", "Day"), by.y=c("ID", "Year", "Month", "Day"))
data:
d <- read.table(header=TRUE, text=
"ID Sales Day Month Year
A 500 31 12 2015
A 100 1 1 2016
A 200 2 1 2016
A 200 1 1 2017")