Calculating change of values between same day in different years - r

I need to calculate the so-called MAT (Moving Annual Total), meaning the % change in sales value between the same day in two different years:
ID Sales Day Month Year
A 500 31 12 2015
A 100 1 1 2016
A 200 2 1 2016
...
A 200 1 1 2017
Does anybody have an idea about how to deal with it?
I want to get this:
ID Sales Day Month Year **MAT**

With the way your data is set up, you're actually quite close. Group your data by month and day, order each group by year, and then take the successive differences (assuming you want the MAT for sequential years):
library(lubridate)
library(dplyr)
X <-
data.frame(date = seq(as.Date("2014-01-01"),
as.Date("2017-12-31"),
by = 1)) %>%
mutate(day = day(date),
month = month(date),
year = year(date),
sales = rnorm(nrow(.), mean = 100, sd = 5))
X %>%
group_by(month, day) %>%
arrange(month, day, year) %>%
mutate(mat = c(NA, diff(sales))) %>%
ungroup()
If you want to take a difference between any two arbitrary years, this will need some refinement.
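The question asks for a percentage change, while diff() returns a raw difference. A minimal base R sketch of the same group-then-difference idea, expressed as a percentage, might look like this. The toy data frame and the names `pct_change` and `mat_pct` are assumptions made for illustration, and each value is compared with the previous observed year for that day, which may not be the calendar year before.

```r
# Hypothetical toy data: two calendar days observed in two years
sales <- data.frame(
  day   = c(1, 2, 1, 2),
  month = c(1, 1, 1, 1),
  year  = c(2016, 2016, 2017, 2017),
  sales = c(100, 200, 200, 150)
)

# Sort so that, within each (month, day), rows run from oldest to newest year
sales <- sales[order(sales$month, sales$day, sales$year), ]

# Percentage change relative to the previous value; the first value has none
pct_change <- function(s) c(NA, 100 * diff(s) / head(s, -1))

# Apply the helper within each (month, day) group
sales$mat_pct <- ave(sales$sales, sales$month, sales$day, FUN = pct_change)
sales
```

Here 1 January goes from 100 to 200 (+100%) and 2 January from 200 to 150 (-25%).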

Here is a solution with base R. Mainly it is a self-join:
d$prev.Year <- d$Year-1
dd <- merge(d,d, by.x=c("prev.Year", "Month", "Day"), by.y=c("Year", "Month", "Day"))
dd$MAT <- with(dd, (Sales.x-Sales.y)/Sales.y)
If you have different values in ID, you will probably want:
dd <- merge(d,d, by.x=c("ID", "prev.Year", "Month", "Day"), by.y=c("ID", "Year", "Month", "Day"))
data:
d <- read.table(header=TRUE, text=
"ID Sales Day Month Year
A 500 31 12 2015
A 100 1 1 2016
A 200 2 1 2016
A 200 1 1 2017")

Related

How to check number of instances where x threshold is exceeded for n consecutive days in a series of values in r?

I would like to know if there is a way in r to return a value for the number of times where a series of data exceeds a certain value for a number of consecutive days.
e.g. How many times in a year was x greater than 10 for at least 30 consecutive days?
I know that you can find how many instances x was greater than a certain value over the whole year, but I'm not sure how to test for instances that are consecutive.
Where Data is a data.frame with Date, Year, and Value columns with daily data from 2010-2020:
Data %>%
group_by(Year) %>%
filter(Value >= 10) %>%
summarize(exceedances = n())
Here is an example of daily data from 2018-2021 with random values from 0-25:
library(tidyverse)
library(dplyr)
library(lubridate)
value = sample(0:25, 1461, replace=T)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
year = year(date),
value = value)
dat %>%
group_by(year) %>%
filter(value >= 10) %>%
summarize(exceedances = n())
The output:
# A tibble: 4 x 2
year exceedances
<dbl> <int>
1 2018 216
2 2019 247
3 2020 229
4 2021 217
Desired output (n of >= 30 consecutive exceedances is a guess):
# A tibble: 4 x 2
year n_exceedances_30_consec
<dbl> <int>
1 2018 1
2 2019 0
3 2020 2
4 2021 0
The trick with this is that if there are 40 consecutive exceedances, I need that to show as 1 instance only, not 10 instances where the previous 30 days were >= 10.
You could use slider::slide_dbl. It is maybe a complex way of doing it, but you could try:
library(tidyverse)
library(lubridate)
library(slider)
value = sample(0:25, 1461, replace=T)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
year = year(date),
value = value)
consecutive_days <- 10
dat |>
mutate(greater = value >= 10) |>
mutate(consecutive = slide_dbl(greater, sum, .before = consecutive_days-1)) |>
filter(consecutive >= consecutive_days) |>
filter(greater) |>
group_by(year) |>
summarize(exceedances = n())
I'm obviously freestyling here as there's no data provided. Hopefully this fits your needs!
Edit - I'm editing in light of the data you sent. This should now work correctly.
This is if you're more of a base R user:
date_index <- sapply(dat$date, \(x) {
d <- dat[dat$date >= x-(consecutive_days-1) &
dat$date <= x, ]
d <- d[d$value >= 10, ]
nrow(d) >= consecutive_days
})
aggregate(x = dat[date_index, ]$value,
by = list(year = dat[date_index, ]$year),
FUN = length)
I'm changing your data slightly as I wasn't getting any runs long enough.
set.seed(6)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
year = format(date, "%Y"),
value = sample(5:50, length(date), replace=TRUE)
)
year_runs <- sapply(unique(dat$year), function(y) {
runs <- rle(dat$value[dat$year == y] >= 10)
sum(runs$lengths >= 30 & runs$values)
})
year_runs
2018 2019 2020 2021
1 2 2 2
The secret here is the base function rle (run length encoding). This runs separately by year; if a run crosses the new year, it is cut at the year boundary.
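To make the role of rle concrete, here is a small sketch on a made-up vector. The values and the minimum run length of 3 are assumptions chosen so the result is easy to check by hand.

```r
# Made-up daily values; the threshold is 10 as in the question
x <- c(12, 15, 3, 11, 11, 11, 2)

runs <- rle(x >= 10)
runs$lengths  # 2 1 3 1
runs$values   # TRUE FALSE TRUE FALSE

# Count runs of TRUE that last at least 3 days; a longer run still counts once
n_long <- sum(runs$lengths >= 3 & runs$values)
n_long        # 1
```

Because each run is a single element of the encoding, a 40-day exceedance counts as one instance, not eleven overlapping 30-day windows.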
In response to your edit that you want to count the run in the year it ends: you can skip the sapply and use rle on the whole dataset, remove the runs that are too short, then invert the encoding back and find where the remaining runs end:
runs <- rle(dat$value >= 10)
runs$values[runs$lengths < 30] <- FALSE
run_ends <- which(diff(inverse.rle(runs)) == -1) + 1
table(dat$year[run_ends])
2018 2019 2020 2021
1 2 2 2
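The following sketch walks through the same rle / inverse.rle steps on a toy vector. It is a variation, not the code above: it appends a FALSE before calling diff(), on the assumption that a run reaching the final observation should also be counted, and it returns the index of the last day of each run rather than the day after it. The vector and `min_run` are made up for illustration.

```r
# Toy logical series: TRUE marks a day at or above the threshold
above <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
min_run <- 3  # hypothetical minimum run length for this toy example

runs <- rle(above)
runs$values[runs$lengths < min_run] <- FALSE  # discard runs that are too short
kept <- inverse.rle(runs)                     # back to one value per day

# Appending FALSE lets diff() see a TRUE -> FALSE drop after a final run
last_days <- which(diff(c(kept, FALSE)) == -1)
last_days  # 3 10
```

Indexing a year column with `last_days` and tabulating it would then count each run in the year of its last day.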

How to assign values to a new column based on a range of dates from that overlap years in R?

I have a growth rate, calculated from individual measurements 4 times a year, that I am trying to assign to a different time frame called Year2 (August 1st of year 1 to July 31st of year 2, see attached photo).
My Dataframe:
ID  Date        Year  Year2  Lag         Lapse    Growth       Daily_growth
1   2009-07-30  2009  2009   NA          NA       35.004       NA
1   2009-10-29  2009  2010   2009-07-30  91 days  31.585       0.347
1   2010-01-27  2010  2010   2009-10-29  90 days  63.769       0.709
1   2010-04-27  2010  2010   2010-01-27  90 days  28.329       0.315
1   2010-07-29  2010  2010   2010-04-27  93 days  32.068       0.345
1   2010-11-02  2010  2011   2010-07-29  96 days  128.1617320  1.335
I took the growth rate as follows:
Growth_df <- Growth_df%>%
group_by(ID) %>% # Individuals we measured
mutate(Lag = lag(Date), #Last date measured
Lapse = round(difftime(Date, Lag, units = "days")), #days between Dates monitored
Daily_growth = as.numeric(Growth) / as.numeric(Lapse))
What I am trying to do is assign the daily growth rate between each measurement, matching to the Year2 timeframe:
Growth_df <- Growth_df %>%
mutate(Year = as.numeric(Year),
Year2_growth = ifelse(Year == Year2, Daily_growth*Lapse, 0)) %>%
group_by(Year2) %>%
mutate(Year2_growth = sum(Year2_growth, na.rm = TRUE))
My problem is that I do not know how to get the dates in between the years (something in place of the 0 in the ifelse statement). I need some sort of way that would calculate how many days would be left from the new start date (August 1st) to the most recent measurement, then multiply it by the growth rate, as well as cut the end early (July 31st)
I tried making a second dataframe with nothing but years and days, then assigning the growth rate when comparing the two dataframes, but I have been stuck on the same issue: partitioning the time frame.
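One way to think about the partitioning is to split a single measurement interval at the August 1st boundary and multiply each part by the daily rate. The sketch below uses a hypothetical helper, `split_days`, and assumes that August 1st itself belongs to the new period; the dates and the rate are taken from the last row of the table above.

```r
# Days of the interval from `start` to `end` that fall before the boundary
# date and from the boundary date onwards (hypothetical helper)
split_days <- function(start, end, boundary) {
  before <- pmax(as.numeric(pmin(end, boundary) - start), 0)
  after  <- pmax(as.numeric(end - pmax(start, boundary)), 0)
  data.frame(before = before, after = after)
}

parts <- split_days(
  start    = as.Date("2010-07-29"),
  end      = as.Date("2010-11-02"),
  boundary = as.Date("2010-08-01")
)
parts  # before = 3, after = 93 (together the 96-day lapse)

daily_growth <- 1.335
parts$before * daily_growth  # growth credited to the period ending July 31st
parts$after * daily_growth   # growth credited to the period starting August 1st
```

The function is vectorised, so it could be applied to whole Lag, Date and boundary columns at once.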
I am sure there's a much, much, muuuuch more efficient way to deal with this, but this is the way I sorted it out:
1. Make my timeframes
2. Create a function for the ranges I wanted
3. Make a dataframe for both the start and the end ranges
4. Join them together
5. Marvel at my lack of R skills.
Start_dates <- seq(ymd('2008-08-01'),ymd('2021-08-1'), by = '12 months')
End_dates <- seq(ymd('2009-07-31'),ymd('2022-07-31'), by = '12 months')
Year2_dates <- data.frame(Start_dates, End_dates)
Year2_dates <- Year2_dates %>%
mutate(Year2 = format(as.Date(Start_dates, format="%d/%m/%Y"),"%Y"),
Year2 = as.numeric(Year2) + 1)
Vegetation <- Vegetation %>%
left_join(Year2_dates)
Range_finder <- function(x,y){
as.numeric(difftime(x, y, units = "days"))
}
Range_start <- Vegetation %>%
group_by(Year2, ID) %>%
filter(row_number()==1) %>%
filter(Year != Year2) #had to get rid of first year samples as they were the top row but didn't have a change in year
Range_start <- Range_start %>%
mutate(Number_days_start = Range_finder(Date, Start_dates),
Border_range = Number_days_start * Daily_veg) %>%
ungroup() %>%
select(ID, Year2, Date, Border_range)
Range_end <- Vegetation %>%
group_by(Year2, ID) %>%
filter(row_number()==n(),
Year2 != 2022)
Range_end <- Range_end %>%
mutate(Number_days_end = Range_finder(End_dates, Date),
Border_range = Number_days_end * Daily_veg) %>%
ungroup() %>%
select(ID, Year2, Date, Border_range)
Ranges <- full_join(Range_start, Range_end)
Vegetation <- Vegetation %>%
left_join(Ranges)
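As a possible alternative to building the Start_dates / End_dates lookup table, the Year2 label can be derived directly from a date. This is only a sketch: `year2_of` is a hypothetical helper, and it assumes the August 1st to July 31st convention described in the question.

```r
# Hypothetical helper: label a date with the August-July period it falls in
year2_of <- function(d) {
  lt <- as.POSIXlt(d)
  lt$year + 1900 + (lt$mon >= 7)  # mon is 0-based, so 7 means August
}

dates <- as.Date(c("2009-07-30", "2009-10-29", "2010-07-29", "2010-11-02"))
year2_of(dates)  # 2009 2010 2010 2011
```

These values agree with the Year2 column shown in the question's table.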

Is it possible in R to split my date-time values into 5 different columns (Year, month, date, hour, minute)?

I am really new at R and this is probably a really basic question: Let's say I have a dataset with a column that includes date values of the format ("y-m-d H:M:S") as a Factor value.
How do I split the one column into 5?
Given example:
x <- as.factor(c("2018-01-03 12:34:32.92382", "2018-01-03 12:50:40.00040"))
x <- as_datetime(x) # convert to a date-time (POSIXct)
x <- x %>%
dplyr::mutate(year = lubridate::year(x),
month = lubridate::month(x),
day = lubridate::day(x),
hour = lubridate::hour(x),
minute = lubridate::minute(x),
second = lubridate::second(x))
I get an error saying that mutate cannot be used on objects of class c('POSIXct', 'POSIXt').
Change it into a data frame first; then the mutate part will work:
x %>%
as.data.frame() %>%
rename(x = '.') %>%
dplyr::mutate(year = lubridate::year(x),
month = lubridate::month(x),
day = lubridate::day(x),
hour = lubridate::hour(x),
minute = lubridate::minute(x),
second = lubridate::second(x))
x year month day hour minute second
1 2018-01-03 12:34:32 2018 1 3 12 34 32.92382
2 2018-01-03 12:50:40 2018 1 3 12 50 40.00040
You could also make your mutate a little bit cleaner utilizing the power of across:
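library(dplyr)
library(lubridate)
x %>%
  data.frame(date = .) %>%
  # funs() is deprecated; across() takes a named list of functions
  mutate(across(date,
                list(year = year, month = month, day = day,
                     hour = hour, minute = minute, second = second),
                .names = "{.fn}"))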
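For comparison, the same split can be sketched in base R without any packages, since a POSIXlt object already stores the components as list elements. The time zone "UTC" and the data frame name `parts` are assumptions made for this example.

```r
# The factor from the question
x <- factor(c("2018-01-03 12:34:32.92382", "2018-01-03 12:50:40.00040"))

# Convert factor -> character -> POSIXlt; %OS accepts fractional seconds
lt <- as.POSIXlt(as.character(x), tz = "UTC", format = "%Y-%m-%d %H:%M:%OS")

# POSIXlt stores years since 1900 and 0-based months
parts <- data.frame(
  year   = lt$year + 1900,
  month  = lt$mon + 1,
  day    = lt$mday,
  hour   = lt$hour,
  minute = lt$min
)
parts
```

Converting the factor with as.character() first avoids any ambiguity about how factor levels are parsed.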

Summarize daily data which lacks explicit grouping variable (month)

I have dataframe that has 6000 locations. For each location, I have 36 years daily data of rainfall in wide format.
A sample data:
set.seed(123)
mat <- matrix(round(rnorm(6000*36*365), digits = 2),nrow = 6000*36, ncol = 365)
dat <- data.table(mat)
names(dat) <- rep(paste0("d_",1:365))
dat$loc.id <- rep(1:6000, each = 36)
dat$year <- rep(1980:2015, times = 6000)
What I want to do is, for each location, generate the long-term average rainfall for each month, e.g. for loc.id = 1, the mean rainfall in Jan, Feb, March ... Dec.
Let's say this data is called dat, which is a data.table.
library(dplyr)
Here's what I did:
loc.list <- unique(dat$loc.id)
my.list <- list() # a list to store results
ptm <- proc.time()
for(i in seq_along(loc.list)){
n <- loc.list[i]
df1 <- dat[dat$loc.id == n,]
df2 <- gather(df1, day, rain, -year) # this melts the data in long format
df3 <- df2 %>% mutate(day = gsub("d_","", day)) %>% # the day column was in "d_1" format, so strip the prefix to get 1,2,3..365
mutate(day = as.numeric(as.character(day))) %>% # ensure that the day column is numeric. For some reason, some NAs appear.
arrange(year,day) %>% # ensure that they are arranged in order
mutate(month = strptime(paste(year, day), format = "%Y %j")$mon + 1) %>% # assign each day to a month
group_by(year,month) %>% # group by year and month
summarise(month.rain = sum(rain)) %>% # calculate for each location, year and month, total rainfall
group_by(month) %>% # group by month
summarise(month.mean = round(mean(month.rain), digits = 2)) # calculate for each month, the long term mean
my.list[[i]] <- df3
}
proc.time() - ptm
user system elapsed
1036.17 0.20 1040.68
I wanted to ask if there is a more efficient and faster way to achieve this task.
Another data.table alternative:
# change column names to month, grabbed from 365 dates of a non-leap year
setnames(dat, c(format(as.Date("2017-01-01") + 0:364, "%b"),
"loc.id", "year"))
# melt to long format
d <- melt(dat, id.vars = c("loc.id", "year"),
variable.name = "month", value.name = "rain")
# calculate mean rain by location and month
d2 <- d[ , .(mean_rain = mean(rain)), by = .(loc.id, month)]
This seems ~7 times faster than the answer by caw5cs. The result by Martin Morgan is in a different format though, which prevents a direct comparison of timings.
If you would rather have unique column names in 'dat', you may use %b_%d (month-day) instead of %b only. Then use substr in by to grab the month part:
# change column names to month_day, using 365 dates of a non-leap year
setnames(dat, c(format(as.Date("2017-01-01") + 0:364, "%b_%d"),
"loc.id", "year"))
# melt to long format
d <- melt(dat, id.vars = c("loc.id", "year"),
variable.name = "month_day", value.name = "rain")
# calculate mean rain by location and month
d2 <- d[ , .(mean_rain = mean(rain)), by = .(loc.id, month = substr(month_day, 1, 3))]
Use the cryptically named rowsum() to sum daily rainfall at each site, over all years
loc.id = rep(1:6000, each = 36)
daily.by.loc = rowsum(mat, loc.id)
and use the same trick on the transposed matrix to sum by month (since there are 365 columns leap years must be ignored).
month = factor(
months(as.Date(0:364, origin="1970-01-01")),
levels = month.name
)
loc.by.month = rowsum(t(daily.by.loc), month)
Calculate the average by dividing by number of observations; R's column-major matrix representation and recycling rules apply. Transpose so the orientation is the same as the data.
days.per.month = tabulate(month)
ans = t(loc.by.month / (36 * days.per.month))
The result is a 6000 x 12 matrix
> dim(ans)
[1] 6000 12
> head(ans, 3)
January February March April May June
1 0.01554659 0.002043651 -0.02950717 -0.02700926 0.003521505 -0.011268519
2 0.04953405 0.032926587 -0.04959677 0.02808333 0.022051971 0.009768519
3 -0.01125448 -0.023343254 -0.02672939 0.04012963 0.018530466 0.035583333
July August September October November December
1 0.009874552 -0.030824373 -0.04958333 -0.03366487 -0.07390741 -0.07899642
2 -0.011630824 -0.003369176 -0.00100000 -0.00594086 -0.02817593 -0.01161290
3 0.031810036 0.059641577 -0.01109259 0.04646953 -0.01601852 0.03103943
in less than a second.
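The two ideas this answer relies on, rowsum() and recycling a vector down the rows of a matrix, can be seen on a tiny example. The matrix `m`, the grouping vector `g` and the group sizes are made up for illustration.

```r
# 3 x 2 toy matrix: rows are observations, columns are "days"
m <- matrix(1:6, nrow = 3)   # column 1 is 1,2,3 and column 2 is 4,5,6
g <- c("a", "b", "a")        # group label for each row

# rowsum() adds up the rows that share a group label
s <- rowsum(m, g)
s
#   [,1] [,2]
# a    4   10
# b    2    5

# Dividing by a vector with one entry per row recycles it down each column,
# so every row is divided by its own group size
n_per_group <- c(2, 1)       # "a" has two rows, "b" has one
s / n_per_group
#   [,1] [,2]
# a    2    5
# b    2    5
```

This is why the answer transposes first: the vector of divisors must line up with the rows of the matrix being divided.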
Grossly misread the question the first time, oops! Seems to be working as intended this time.
library(data.table)
set.seed(123)
mat <- matrix(round(rnorm(6000*36*365), digits = 2),nrow = 6000*36, ncol = 365)
dat <- data.table(mat)
names(dat) <- rep(paste0("d_",1:365))
dat$loc.id <- rep(1:6000, each = 36)
dat$year <- rep(1980:2015, times = 6000)
system.time({
# convert to long format with month # as column name
date_cols <- colnames(dat)[1:365]
setnames(dat, date_cols, as.character(1:365))
dat.long <- melt(dat, measure.vars=as.character(1:365), variable="day", value="rainfall")
# R date starts at 0 for Jan 1, so we offset the day by 1
dat.long[, day := as.numeric(day) - 1]
setkey(dat.long, year, day)
# Make table for merging year/day/month
months <- CJ(year=1980:2015, day=0:365)
months[, date := as.Date(day, origin=paste(year, "-01-01", sep=""))]
months[, month := tstrsplit(date, "-")[2]]
setkey(months, year, day)
# Merge tables to get month column
dat.merge <- merge(dat.long, months)
# aggregate by location and month
dat.ag <- dat.merge[, list(mean_rainfall = mean(rainfall)), by=list(loc.id, month)]
})
Yielding
user system elapsed
14.420 4.205 18.626
> dat.ag
loc.id month mean_rainfall
1: 1 01 0.015546595
2: 2 01 0.049534050
3: 3 01 -0.011254480
4: 4 01 -0.019453405
5: 5 01 0.005860215
---
71996: 5996 12 0.027407407
71997: 5997 12 0.020334237
71998: 5998 12 0.043360434
71999: 5999 12 -0.006856369
72000: 6000 12 0.040542005

How to aggregate with multiple months with data frame R?

I need to aggregate over multiple months from the original data in a data frame in R, e.g. a data frame whose dates span 2017 and 2018:
date category amt
1 2017-08-05 A 0.1900707
2 2017-08-06 B 0.2661277
3 2017-08-07 c 0.4763196
4 2017-08-08 A 0.5183718
5 2017-08-09 B 0.3021019
6 2017-08-10 c 0.3393616
What I want is to sum based on 6 month period and category:
period category sum
1 2017_secondPeriod A 25.00972
2 2018_firstPeriod A 25.59850
3 2017_secondPeriod B 24.96924
4 2018_firstPeriod B 24.79649
5 2017_secondPeriod c 20.17096
6 2018_firstPeriod c 27.01794
What I did:
1. select the last 6 months of 2017, and likewise the first 6 months of 2018
2. add a new column to each subset to indicate the period
3. combine the 2 subsets again
4. aggregate
as follows:
library(lubridate)
df <- data.frame(
date = today() + days(1:300),
category = c("A","B","c"),
amt = runif(300)
)
df2017_secondHalf <- subset(df, month(df$date) %in% c(7,8,9,10,11,12) & year(df$date) == 2017)
df2018_firstHalf <- subset(df, month(df$date) %in% c(1,2,3,4,5,6) & year(df$date) == 2018)
sum1 <- aggregate(df2017_secondHalf$amt, by=list(Category=df2017_secondHalf$category), FUN=sum)
sum2 <- aggregate(df2018_firstHalf$amt, by=list(Category=df2018_firstHalf$category), FUN=sum)
# label each subset with its period
df2017_secondHalf$period <- '2017_secondPeriod'
df2018_firstHalf$period <- '2018_firstPeriod'
# combine the two subsets again, then aggregate by period and category
df_both <- rbind(df2017_secondHalf, df2018_firstHalf)
aggregate(x = df_both$amt, by = df_both[c("period", "category")], FUN = sum)
I tried to figure it out but did not know how to aggregate over multiple months, e.g. 3 months or 6 months.
Thanks in advance
Any suggestions?
With lubridate and tidyverse (dplyr & magrittr)
First, let's create groups with Semesters, Quarter, and "Trimonthly".
library(tidyverse)
library(lubridate)
df <- df %>% mutate(Semester = semester(date, with_year = TRUE),
Quarter = quarter(date, with_year = TRUE),
Trimonthly = round_date(date, unit = "3 months" ))
lubridate's semester() splits the year into semesters and gives you a 1 (Jan-Jun) or 2 (Jul-Dec); quarter() does a similar thing with quarters.
I add a third, the more basic round_date function, where you can specify your time frame as a size and a time unit. It rounds each date to the nearest boundary of that time frame; use floor_date() instead if you want the first date of the period a date falls in. I deliberately named it "Trimonthly" so you can see how it compares to quarter().
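A short sketch on two made-up dates shows how these lubridate functions behave, including floor_date(), which is not used in the answer but is shown here for comparison. The dates are assumptions chosen so that one lies in the first half of its quarter and one in the second half.

```r
library(lubridate)

# Both dates are in Q3 2017; the second is past the middle of the quarter
d <- as.Date(c("2017-08-05", "2017-09-01"))

semester(d)                       # 2 2
quarter(d)                        # 3 3
floor_date(d, unit = "3 months")  # "2017-07-01" "2017-07-01"
round_date(d, unit = "3 months")  # "2017-07-01" "2017-10-01"
```

With round_date() the two dates land in different groups even though they share a quarter, which matters if the column is used as a grouping variable.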
Pivot.Semester <- df %>%
group_by(Semester, category) %>%
summarise(Semester.sum = sum(amt))
Pivot.Quarter <- df %>%
group_by(Quarter, category) %>%
summarise(Quarter.sum = sum(amt))
Pivot.Trimonthly <- df %>%
group_by(Trimonthly, category) %>%
summarise(Trimonthly.sum = sum(amt))
Pivot.Semester
Pivot.Quarter
Pivot.Trimonthly
Optional: If you want to join the summarised data to the original DF.
df <- df %>% left_join(Pivot.Semester, by = c("category", "Semester")) %>%
left_join(Pivot.Quarter, by = c("category", "Quarter")) %>%
left_join(Pivot.Trimonthly, by = c("category", "Trimonthly"))
df
Here is a 3 line solution that uses no packages. Let k be the number of months in a period. For half year periods k is 6. For quarter year periods k would be 3, etc. Replace 02 in the sprintf format with 1 if you want one-digit suffixes (but not for monthly periods, since those must be two digits). Further modify the sprintf format if you want it to exactly match the question.
k <- 6
period <- with(as.POSIXlt(DF$date), sprintf("%d-%02d", year + 1900, (mon %/% k) + 1))
aggregate(amt ~ category + period, DF, sum)
giving:
category period amt
1 A 2017-02 0.7084425
2 B 2017-02 0.5682296
3 c 2017-02 0.8156812
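The key step is the integer division `mon %/% k`, which maps the 0-based month to a period number. A small sketch with a hypothetical helper, `period_index`, and made-up dates shows the mapping for two values of k.

```r
# Hypothetical helper: 1-based index of the k-month period a date falls in
period_index <- function(date, k) as.POSIXlt(date)$mon %/% k + 1

d <- as.Date(c("2017-01-15", "2017-06-30", "2017-07-01", "2017-12-31"))

period_index(d, 6)  # 1 1 2 2  (half years)
period_index(d, 3)  # 1 2 3 4  (quarters)
```

Since POSIXlt months run from 0 to 11, any k that divides 12 produces equal-sized periods aligned with January.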
At the expense of using one package we can simplify the quarterly and monthly calculations by replacing the formula for period with one of these:
library(zoo)
# quarterly
period <- as.yearqtr(DF$date)
# monthly
period <- as.yearmon(DF$date)
Note: The input in reproducible form is:
Lines <- "date category amt
1 2017-08-05 A 0.1900707
2 2017-08-06 B 0.2661277
3 2017-08-07 c 0.4763196
4 2017-08-08 A 0.5183718
5 2017-08-09 B 0.3021019
6 2017-08-10 c 0.3393616"
DF <- read.table(text = Lines)
DF$date <- as.Date(DF$date)

Resources