How to aggregate over multiple months with a data frame in R?

I need to aggregate over multi-month periods in a data frame in R. For example, a data frame whose dates span 2017 and 2018:
date category amt
1 2017-08-05 A 0.1900707
2 2017-08-06 B 0.2661277
3 2017-08-07 c 0.4763196
4 2017-08-08 A 0.5183718
5 2017-08-09 B 0.3021019
6 2017-08-10 c 0.3393616
What I want is to sum based on 6 month period and category:
period category sum
1 2017_secondPeriod A 25.00972
2 2018_firstPeriod A 25.59850
3 2017_secondPeriod B 24.96924
4 2018_firstPeriod B 24.79649
5 2017_secondPeriod c 20.17096
6 2018_firstPeriod c 27.01794
What I did:
1. Select the last 6 months of 2017, and likewise the first 6 months of 2018
2. Add a new column to each subset to indicate the period
3. Combine the 2 subsets again
4. Aggregate
as follows:
library(lubridate)
df <- data.frame(
date = as.Date("2017-08-04") + days(1:300), # fixed start date instead of today(), so the sample is reproducible
category = c("A","B","c"),
amt = runif(300)
)
df2017_secondHalf <- subset(df, month(df$date) %in% c(7,8,9,10,11,12) & year(df$date) == 2017)
df2018_firstHalf <- subset(df, month(df$date) %in% c(1,2,3,4,5,6) & year(df$date) == 2018)
sum1 <- aggregate(df2017_secondHalf$amt, by=list(Category=df2017_secondHalf$category), FUN=sum)
sum2 <- aggregate(df2018_firstHalf$amt, by=list(Category=df2018_firstHalf$category), FUN=sum)
df2017_secondHalf$period <- '2017_secondPeriod'
df2018_firstHalf$period <- '2018_firstPeriod'
# combine the two subsets again (step 3) before aggregating
df <- rbind(df2017_secondHalf, df2018_firstHalf)
aggregate(x = df$amt, by = df[c("period", "category")], FUN = sum)
I tried to figure it out but do not know how to aggregate over multiple months, e.g. 3 months or 6 months.
Any suggestions?
Thanks in advance.

With lubridate and tidyverse (dplyr & magrittr)
First, let's create groups with Semesters, Quarter, and "Trimonthly".
library(tidyverse)
library(lubridate)
df <- df %>% mutate(Semester = semester(date, with_year = TRUE),
Quarter = quarter(date, with_year = TRUE),
Trimonthly = floor_date(date, unit = "3 months"))
Lubridate's semester() breaks dates into semesters and gives you a 1 (Jan-Jun) or 2 (Jul-Dec); quarter() does a similar thing with quarters. With with_year = TRUE the year is prepended (e.g. 2017.2).
I add a third, the more basic floor_date() function, where you can specify your time frame in the form of size and time units. It yields the first date of that time frame (round_date() would instead snap to the nearest boundary, which can be the start of the next period). I deliberately name it "Trimonthly" so you can see how it compares to quarter().
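To make the difference between flooring and rounding concrete, here is a small sketch on a single hand-picked date (the date is an assumption chosen for illustration, not taken from the question's data):

```r
library(lubridate)

d <- as.Date("2017-02-20")

# floor_date() snaps back to the start of the 3-month block that contains d
floored <- floor_date(d, unit = "3 months")   # 2017-01-01

# round_date() snaps to the nearest block boundary, which here is the next block
rounded <- round_date(d, unit = "3 months")   # 2017-04-01

floored
rounded
```

For grouping, the floored value is usually the safer key, since every date in a block maps to the same label.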
Pivot.Semester <- df %>%
group_by(Semester, category) %>%
summarise(Semester.sum = sum(amt))
Pivot.Quarter <- df %>%
group_by(Quarter, category) %>%
summarise(Quarter.sum = sum(amt))
Pivot.Trimonthly <- df %>%
group_by(Trimonthly, category) %>%
summarise(Trimonthly.sum = sum(amt))
Pivot.Semester
Pivot.Quarter
Pivot.Trimonthly
Optional: If you want to join the summarised data to the original DF.
df <- df %>% left_join(Pivot.Semester, by = c("category", "Semester")) %>%
left_join(Pivot.Quarter, by = c("category", "Quarter")) %>%
left_join(Pivot.Trimonthly, by = c("category", "Trimonthly"))
df

Here is a 3-line solution that uses no packages. Let k be the number of months in a period. For half-year periods k is 6; for quarter-year periods k would be 3, etc. Replace 02 in the sprintf format with 1 if you want one-digit suffixes (but not for monthly periods, since those must be two digits). Further modify the sprintf format if you want it to exactly match the question.
k <- 6
period <- with(as.POSIXlt(DF$date), sprintf("%d-%02d", year + 1900, (mon %/% k) + 1))
aggregate(amt ~ category + period, DF, sum)
giving:
category period amt
1 A 2017-02 0.7084425
2 B 2017-02 0.5682296
3 c 2017-02 0.8156812
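As a quick check of how the period formula behaves for different k, here is a minimal sketch that wraps it in a helper. The name period_label and the test dates are hypothetical and only serve the illustration:

```r
# Hypothetical helper: label each date with "<year>-<index of its k-month period>"
period_label <- function(date, k) {
  lt <- as.POSIXlt(date)
  # lt$mon is 0-based (Jan = 0), so integer division by k gives a 0-based period index
  sprintf("%d-%02d", lt$year + 1900, lt$mon %/% k + 1)
}

dates <- as.Date(c("2017-03-31", "2017-08-05", "2018-01-15", "2018-06-30"))

half_years <- period_label(dates, 6)  # "2017-01" "2017-02" "2018-01" "2018-01"
quarters   <- period_label(dates, 3)  # "2017-01" "2017-03" "2018-01" "2018-02"
```

The resulting vector can be passed to aggregate() in the same way as period above.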
At the expense of using one package we can simplify the quarterly and monthly calculations by replacing the formula for period with one of these:
library(zoo)
# quarterly
period <- as.yearqtr(DF$date)
# monthly
period <- as.yearmon(DF$date)
Note: The input in reproducible form is:
Lines <- "date category amt
1 2017-08-05 A 0.1900707
2 2017-08-06 B 0.2661277
3 2017-08-07 c 0.4763196
4 2017-08-08 A 0.5183718
5 2017-08-09 B 0.3021019
6 2017-08-10 c 0.3393616"
DF <- read.table(text = Lines)
DF$date <- as.Date(DF$date)

Related

How to check number of instances where x threshold is exceeded for n consecutive days in a series of values in r?

I would like to know if there is a way in r to return a value for the number of times where a series of data exceeds a certain value for a number of consecutive days.
e.g. How many times in a year was x greater than 10 for at least 30 consecutive days?
I know that you can find how many instances x was greater than a certain value over the whole year, but I'm not sure how to test for instances that are consecutive.
Where Data is a data.frame with Date, Year, and Value columns with daily data from 2010-2020:
Data %>%
group_by(Year) %>%
filter(Value >= 10) %>%
summarize(exceedances = n())
Here is an example of daily data from 2018-2021 with random values from 0-25:
library(tidyverse)
library(dplyr)
library(lubridate)
value = sample(0:25, 1461, replace=T)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
year = year(date),
value = value)
dat %>%
group_by(year) %>%
filter(value >= 10) %>%
summarize(exceedances = n())
The output:
# A tibble: 4 x 2
year exceedances
<dbl> <int>
1 2018 216
2 2019 247
3 2020 229
4 2021 217
Desired output (n of >= 30 consecutive exceedances is a guess):
# A tibble: 4 x 2
year n_exceedances_30_consec
<dbl> <int>
1 2018 1
2 2019 0
3 2020 2
4 2021 0
The trick with this is that if there are 40 consecutive exceedances, I need that to show as 1 instance only, not 10 instances where the previous 30 days were >= 10.
You could use slider::slide_dbl. It may be a complex way of doing it, but it works:
library(tidyverse)
library(lubridate)
library(slider)
value = sample(0:25, 1461, replace=T)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
year = year(date),
value = value)
consecutive_days <- 10
dat |>
mutate(greater = value >= 10) |>
mutate(consecutive = slide_dbl(greater, sum, .before = consecutive_days-1)) |>
filter(consecutive >= consecutive_days) |>
filter(greater) |>
group_by(year) |>
summarize(exceedances = n())
I'm obviously freestyling here as there's no data provided. Hopefully this fits your needs!
Edit - I'm editing in light of the data you sent. This should now work correctly.
This is if you're more of a base R user:
date_index <- sapply(dat$date, \(x) {
d <- dat[dat$date >= x-(consecutive_days-1) &
dat$date <= x, ]
d <- d[d$value >= 10, ]
nrow(d) >= consecutive_days
})
aggregate(x = dat[date_index, ]$value,
by = list(year = dat[date_index, ]$year),
FUN = length)
I'm changing your data slightly as I wasn't getting any runs long enough.
set.seed(6)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
year = format(date, "%Y"),
value = sample(5:50, length(date), replace=TRUE)
)
year_runs <- sapply(unique(dat$year), function(y) {
runs <- rle(dat$value[dat$year==y] >= 10)
sum(runs$lengths >= 30 & runs$values)
})
year_runs
2018 2019 2020 2021
1 2 2 2
The secret here is the base function rle (run length encoding). This runs separately for each year; a run that spans the new year is split at the year boundary, so each part is measured on its own.
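For readers who have not used rle before, a minimal sketch on a toy vector may help. The values, the threshold of 10 and the minimum run length of 3 are made up for illustration:

```r
x <- c(12, 15, 11, 3, 20, 21, 22, 23, 4, 30)

# Encode the logical series as consecutive runs of TRUE / FALSE
runs <- rle(x >= 10)
runs$lengths  # 3 1 4 1 1
runs$values   # TRUE FALSE TRUE FALSE TRUE

# Count runs of exceedances that last at least min_len observations
min_len <- 3
n_long_runs <- sum(runs$values & runs$lengths >= min_len)
n_long_runs   # 2
```

A run of 4 counts once, which is the behaviour the question asks for.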
In response to your edit that you want to count the run in the year it ends, you can skip the sapply and use rle on the whole dataset, remove the runs that are too short, then invert it back and find the ends:
runs <- rle(dat$value >= 10)
runs$values[runs$lengths < 30] <- FALSE
# diff() == -1 marks a TRUE -> FALSE transition; + 1 gives the index of the first day after each run
run_ends <- which(diff(inverse.rle(runs)) == -1) + 1
table(dat$year[run_ends])
2018 2019 2020 2021
1 2 2 2
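An alternative way to locate run ends, sketched here on toy data, is to take the cumulative sum of the run lengths. This is a swapped-in technique rather than the answer's own code; the vectors above and year and the minimum length of 3 are assumptions for illustration. It points at the last day of each run and also catches a run that reaches the final observation:

```r
# Toy series: TRUE marks a day at or above the threshold
above   <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)
year    <- c(rep("2018", 5), rep("2019", 5))
min_len <- 3

runs <- rle(above)
ends <- cumsum(runs$lengths)                 # index of the last element of every run
keep <- runs$values & runs$lengths >= min_len
run_end_idx <- ends[keep]                    # 3 8

# Count qualifying runs in the year in which they end
runs_per_year <- table(factor(year[run_end_idx], levels = unique(year)))
runs_per_year
```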

How to reorder file in chronological order

I have a dataset with multiple columns but I'd like to change the order in chronological order by date!
Station year ID
1 2020 D
2 2019 C
3 2017 A
4 2018 B
This is a really bad example but would there be a code to reorder by date oldest to newest?
Station year ID
3 2017 A
4 2018 B
2 2019 C
1 2020 D
To look something like this!
Any help would be amazing! :)
Thank you
Well... "2020" is not a date, and you can order the column as regular integer.
But, if you had dates like "2020-01-25"... transforming strings to dates is easy as...
library(dplyr)
df <- tibble(n = c(1,2,3,4),
dt = c("2020-01-01","2019-01-01","2017-01-01", "2018-01-01"),
l = c("D","C","A","B"))
df <- df %>%
mutate(
dt = as.Date(dt)
) %>%
arrange(
dt
)
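For the plain year column shown in the question, no date conversion is needed. A minimal base-R sketch, where the data frame name stations is an assumption and the values are copied from the question's example:

```r
stations <- data.frame(
  Station = 1:4,
  year    = c(2020, 2019, 2017, 2018),
  ID      = c("D", "C", "A", "B")
)

# order() returns the row permutation that sorts year ascending (oldest first)
sorted <- stations[order(stations$year), ]
sorted$Station  # 3 4 2 1
```

Passing decreasing = TRUE to order() would give newest first instead.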
Use the ymd() function from the lubridate package to convert dt to date format and year() to extract the year. With this format you can sort your dates with arrange():
library(dplyr)
library(lubridate)
# data borrowed from abreums
df <- tibble(n = c(1,2,3,4),
dt = c("2020-01-01","2019-01-01","2017-01-01", "2018-01-01"),
l = c("D","C","A","B"))
df1 <- df %>%
mutate(dt = ymd(dt), # "2020-01-01"
dt = year(dt)) %>% # "2020"
arrange(dt)

I want to return a season and year value from a continuous list of dates

I have a continuous list of dates (yyyy-mm-dd) from 1985 to 2018 in one column (Colname = date). What I wish to do is generate another column which outputs a water season and year given the date.
To make it clearer, I have two water seasons:
Summer = yyyy-04-01 to yyyy-09-30;
Winter = yyyy-10-01 to yyyy(+1)-03-31.
So for 2018 - Summer = 2018-04-01 to 2018-09-30; Winter = 2018-10-01 to 2019-03-31.
What I would like to output is something like the following:
Many thanks.
A tidyverse approach
library(tidyverse)
df <-tibble(date = seq(from = as.Date('2000-01-01'), to = as.Date('2001-12-31'), by = '1 month'))
df
df %>%
mutate(water_season_year = case_when(
lubridate::month(date) %in% c(4:9) ~str_c('Su_', lubridate::year(date)),
lubridate::month(date) %in% c(10:12) ~str_c('Wi_', lubridate::year(date)),
lubridate::month(date) %in% c(1:3)~str_c('Wi_', lubridate::year(date) -1),
TRUE ~ 'Error'))
You can compare just the month part of the data to get the season, in base R consider doing
month <- as.integer(format(df$date, "%m"))
year <- format(df$date, "%Y")
inds <- month >= 4 & month <= 9
df$water_season_year <- NA
df$water_season_year[inds] <- paste("Su", year[inds], sep = "_")
df$water_season_year[!inds] <- paste("Wi", year[!inds], sep = "_")
#To add previous year for month <= 3 do
df$water_season_year[month <= 3] <- paste("Wi",
as.integer(year[month <= 3]) - 1, sep = "_")
df
# date water_season_year
#1 2019-01-03 Wi_2018
#2 2000-06-01 Su_2000
Make sure that date variable is of "Date" class.
data
df <-data.frame(date = as.Date(c("2019-01-03", "2000-06-01")))
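The same month-based logic can be wrapped in a small function, which makes the boundary dates easy to check. This is only a sketch; the function name water_season is hypothetical:

```r
# Hypothetical helper: label a Date vector with its water season and season year
water_season <- function(date) {
  m <- as.integer(format(date, "%m"))
  y <- as.integer(format(date, "%Y"))
  ifelse(m >= 4 & m <= 9,
         paste0("Su_", y),                           # April to September
         paste0("Wi_", ifelse(m <= 3, y - 1L, y)))   # Jan to Mar belong to the previous year's winter
}

boundaries <- as.Date(c("2018-04-01", "2018-09-30", "2018-10-01", "2019-03-31"))
labels <- water_season(boundaries)
labels  # "Su_2018" "Su_2018" "Wi_2018" "Wi_2018"
```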

Summarize daily data which lacks explicit grouping variable (month)

I have a data frame that has 6000 locations. For each location, I have 36 years of daily rainfall data in wide format.
A sample data:
library(data.table)
set.seed(123)
mat <- matrix(round(rnorm(6000*36*365), digits = 2),nrow = 6000*36, ncol = 365)
dat <- data.table(mat)
names(dat) <- rep(paste0("d_",1:365))
dat$loc.id <- rep(1:6000, each = 36)
dat$year <- rep(1980:2015, times = 6000)
What I want to do is, for each location, generate the long-term average rainfall for each month. E.g. for loc.id = 1, mean rainfall in Jan, Feb, March... Dec.
Let's say this data is called dat, which is a data.table.
library(dplyr)
library(tidyr) # gather() comes from tidyr
Here's what I did:
loc.list <- unique(dat$loc.id)
my.list <- list() # a list to store results
ptm <- proc.time()
for(i in seq_along(loc.list)){
n <- loc.list[i]
df1 <- dat[dat$loc.id == n,]
df2 <- gather(df1, day, rain, -year) # this melts the data in long format
df3 <- df2 %>% mutate(day = gsub("d_","", day)) %>% # the day column was in "d_1" format, so strip the prefix to leave the day number (1,2,3..365)
mutate(day = as.numeric(as.character(day))) %>% # ensure that the day column is numeric. For some reason, some NAs appear.
arrange(year,day) %>% # ensure that they are arranged in order
mutate(month = strptime(paste(year, day), format = "%Y %j")$mon + 1) %>% # assign each day to a month
group_by(year,month) %>% # group by year and month
summarise(month.rain = sum(rain)) %>% # calculate for each location, year and month, total rainfall
group_by(month) %>% # group by month
summarise(month.mean = round(mean(month.rain), digits = 2)) # calculate for each month, the long term mean
my.list[[i]] <- df3
}
proc.time() - ptm
user system elapsed
1036.17 0.20 1040.68
I wanted to ask if there is a more efficient and faster way to achieve this task.
Another data.table alternative:
# change column names to month, grabbed from 365 dates of a non-leap year
setnames(dat, c(format(as.Date("2017-01-01") + 0:364, "%b"),
"loc.id", "year"))
# melt to long format
d <- melt(dat, id.vars = c("loc.id", "year"),
variable.name = "month", value.name = "rain")
# calculate mean rain by location and month
d2 <- d[ , .(mean_rain = mean(rain)), by = .(loc.id, month)]
This seems ~7 times faster than the answer by caw5cs. The result by Martin Morgan is in a different format though, which prevents a direct comparison of timings.
If you would rather have unique column names in 'dat', you may use %b_%d (month-day) instead of %b only. Then use substr in by to grab the month part:
# change column names to month_day, using 365 dates of a non-leap year
setnames(dat, c(format(as.Date("2017-01-01") + 0:364, "%b_%d"),
"loc.id", "year"))
# melt to long format
d <- melt(dat, id.vars = c("loc.id", "year"),
variable.name = "month_day", value.name = "rain")
# calculate mean rain by location and month
d2 <- d[ , .(mean_rain = mean(rain)), by = .(loc.id, month = substr(month_day, 1, 3))]
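The renaming trick relies on mapping column positions 1 to 365 onto the months of a non-leap year. A minimal base-R sketch of that mapping follows; it uses month numbers rather than abbreviations to avoid locale-dependent labels (that choice, and the toy rain vector, are assumptions for illustration):

```r
# Month number (1-12) for each day-of-year position 1..365 in a non-leap year
month_of_day <- as.integer(format(as.Date("2017-01-01") + 0:364, "%m"))
days_per_month <- tabulate(month_of_day, nbins = 12)
days_per_month  # 31 28 31 30 31 30 31 31 30 31 30 31

# Toy check: one year of daily values where every day in month m has value m
rain <- rep(1:12, times = days_per_month)
monthly_mean <- tapply(rain, month_of_day, mean)  # month m averages to m
monthly_mean
```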
Use the cryptically named rowsum() to sum daily rainfall at each site, over all years
loc.id = rep(1:6000, each = 36)
daily.by.loc = rowsum(mat, loc.id)
and use the same trick on the transposed matrix to sum by month (since there are 365 columns leap years must be ignored).
month = factor(
months(as.Date(0:364, origin="1970-01-01")),
levels = month.name
)
loc.by.month = rowsum(t(daily.by.loc), month)
Calculate the average by dividing by number of observations; R's column-major matrix representation and recycling rules apply. Transpose so the orientation is the same as the data.
days.per.month = tabulate(month)
ans = t(loc.by.month / (36 * days.per.month))
The result is a 6000 x 12 matrix
> dim(ans)
[1] 6000 12
> head(ans, 3)
January February March April May June
1 0.01554659 0.002043651 -0.02950717 -0.02700926 0.003521505 -0.011268519
2 0.04953405 0.032926587 -0.04959677 0.02808333 0.022051971 0.009768519
3 -0.01125448 -0.023343254 -0.02672939 0.04012963 0.018530466 0.035583333
July August September October November December
1 0.009874552 -0.030824373 -0.04958333 -0.03366487 -0.07390741 -0.07899642
2 -0.011630824 -0.003369176 -0.00100000 -0.00594086 -0.02817593 -0.01161290
3 0.031810036 0.059641577 -0.01109259 0.04646953 -0.01601852 0.03103943
in less than a second.
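Since rowsum() is easy to overlook, here is a minimal sketch of what it does on a tiny matrix. The matrix and the group labels are made up for illustration:

```r
m <- matrix(1:12, nrow = 4)      # columns are (1,2,3,4), (5,6,7,8), (9,10,11,12)
grp <- c("a", "a", "b", "b")     # one group label per row

# Sum the rows of m within each group; the result has one row per group
by_row <- rowsum(m, grp)
by_row
#   [,1] [,2] [,3]
# a    3   11   19
# b    7   15   23

# To group columns instead, transpose first (one label per column of m)
col_grp <- c("x", "x", "y")
by_col <- rowsum(t(m), col_grp)
by_col
#   [,1] [,2] [,3] [,4]
# x    6    8   10   12
# y    9   10   11   12
```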
Grossly misread the question the first time, oops! Seems to be working as intended this time.
library(data.table)
set.seed(123)
mat <- matrix(round(rnorm(6000*36*365), digits = 2),nrow = 6000*36, ncol = 365)
dat <- data.table(mat)
names(dat) <- rep(paste0("d_",1:365))
dat$loc.id <- rep(1:6000, each = 36)
dat$year <- rep(1980:2015, times = 6000)
system.time({
# convert to long format with the day number as column name
date_cols <- colnames(dat)[1:365]
setnames(dat, date_cols, as.character(1:365))
dat.long <- melt(dat, measure.vars=as.character(1:365), variable.name="day", value.name="rainfall")
# R date starts at 0 for Jan 1, so we offset the day by 1
dat.long[, day := as.numeric(day) - 1]
setkey(dat.long, year, day)
# Make table for merging year/day/month
months <- CJ(year=1980:2015, day=0:365)
months[, date := as.Date(day, origin=paste(year, "-01-01", sep=""))]
months[, month := tstrsplit(date, "-")[2]]
setkey(months, year, day)
# Merge tables to get month column
dat.merge <- merge(dat.long, months)
# aggregate by location and month
dat.ag <- dat.merge[, list(mean_rainfall = mean(rainfall)), by=list(loc.id, month)]
})
Yielding
user system elapsed
14.420 4.205 18.626
> dat.ag
loc.id month mean_rainfall
1: 1 01 0.015546595
2: 2 01 0.049534050
3: 3 01 -0.011254480
4: 4 01 -0.019453405
5: 5 01 0.005860215
---
71996: 5996 12 0.027407407
71997: 5997 12 0.020334237
71998: 5998 12 0.043360434
71999: 5999 12 -0.006856369
72000: 6000 12 0.040542005

Remove incomplete months from a data frame even when part of the month contains data

I would like to remove incomplete months from my data frame even if some of the month has data.
Example data frame:
date <- seq.Date(as.Date("2016-01-15"),as.Date("2016-09-19"),by="day")
data <- seq(1:249)
df <- data.frame(date,data)
What I would like:
date2 <- seq.Date(as.Date("2016-02-01"),as.Date("2016-08-31"),by="day")
data2 <- seq(from = 18, to = 230)
df2 <- data.frame(date2,data2)
If I interpreted your question correctly, you want to be able to select the months that have a complete number of days, removing those that don't.
The following uses dplyr v0.7.0:
library(dplyr)
df <- df %>%
mutate(mo = months(date)) # add month (mo)
complete_mo <- df %>%
count(mo) %>% #count number of days in month (n)
filter(n >= 28) %>% #rule of thumb definition of a `complete month`
pull(mo)
df_complete_mo <- df %>%
filter(mo %in% complete_mo) %>% # here is where you select the complete months
select(-mo) #remove mo, to keep your original df
Then df_complete_mo yields your dataset with just complete months.
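The n >= 28 rule of thumb would also keep a 31-day month that is missing up to three days. A stricter sketch in base R compares the number of observed days with the actual length of each month. It assumes one row per date with no duplicates; the names obs, n_obs and days_in_month are hypothetical:

```r
date <- seq(as.Date("2016-01-15"), as.Date("2016-09-19"), by = "day")
obs  <- data.frame(date = date, data = seq_along(date))

ym    <- format(obs$date, "%Y-%m")       # year-month key, so years are not mixed
first <- as.Date(paste0(ym, "-01"))      # first day of each row's month

# Jumping 31 days from the 1st always lands in the following month
next_first    <- as.Date(format(first + 31, "%Y-%m-01"))
days_in_month <- as.integer(next_first - first)

# Number of rows observed in each row's month
n_obs <- ave(seq_along(ym), ym, FUN = length)

complete <- obs[n_obs == days_in_month, ]
range(complete$date)  # "2016-02-01" "2016-08-31"
nrow(complete)        # 213
```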
You could join a complete set of dates for each month to your data frame and then filter out months with any missing values.
library(tidyverse)
library(lubridate)
df.filtered = data.frame(date=seq(min(df$date)-31,max(df$date)+31,by="day")) %>%
left_join(df) %>%
group_by(month=month(date)) %>% # Add a month column and group by it
filter(!any(is.na(data))) %>% # Remove months with any missing data
ungroup %>%
select(-month) # Remove the month column
# A tibble: 213 x 2
date data
<date> <int>
1 2016-02-01 18
2 2016-02-02 19
3 2016-02-03 20
4 2016-02-04 21
5 2016-02-05 22
6 2016-02-06 23
7 2016-02-07 24
8 2016-02-08 25
9 2016-02-09 26
10 2016-02-10 27
# ... with 203 more rows
In base R, you could do the following.
# get start and end dates of months that are beyond the sample
dateRange <- as.Date(format(range(df$date) + c(-32, 32), c("%Y-%m-2", "%Y-%m-1"))) - 1
The second argument of format is a vector that separately formats the min and the max dates. We subtract 1 from these dates to get the first day of a month and the last day of a month. This returns
dateRange
[1] "2015-12-01" "2016-09-30"
Now, use which.max to select the first date that matches and which with tail to select the last day that matches monthly sequences in order to figure out the starting and stopping rows of your data.frame.
startRow <- which.max(df$date %in% seq(dateRange[1], dateRange[2], by="month"))
stopRow <- tail(which(df$date %in% (seq(dateRange[1], dateRange[2], by="month")-1)), 1)
Now, subset your data.frame
dfNew <- df[startRow:stopRow,]
range(dfNew$date)
[1] "2016-02-01" "2016-08-31"
nrow(dfNew)
[1] 213
