Binning examples for R I've found seem to assume the source data has a single date (or date/time). I've discrete start and stop dates for user accounts ranging over years 2002-2017. I want to output counts of the number of accounts active, using monthly bins, during the overall range 2002-17.
The data is currently in dd/mm/yyyy strings,though I could easily change that format if needed; rows are sorted by ascending start date. for example
Start Stop
04/09/2006 23/01/2014
...
06/07/2008 11/03/2017
...
30/09/2010 22/04/2016
The resulting counts would be, for example:
Mar 2006 0
Jan 2007 1
Mar 2011 3
Jun 2015 2
Sep 2016 1
...etc.
The aim of generating the counts is to the plot the total active accounts over time. I'm open to getting daily counts and then aggregating by month if easier. I'm stuck at the start though: bin where the source is a date range and not a single date.
Convert the columns to "yearmon" class and use mapply to generate the year/months ym covered. Then count how many of each year/month occur and merge that with a data frame having all the year/months between Jan 2002 and Dec 2017 giving M_na and replace NAs with 0 giving M.
library(zoo)
DF2 <- transform(DF, Start = as.yearmon(Start), Stop = as.yearmon(Stop))
ym <- unlist(mapply(seq, DF2$Start, DF2$Stop, MoreArgs = list(by = 1/12)))
Ag <- aggregate(ym^0, list(ym = as.yearmon(ym)), sum)
M_na <- merge(Ag, data.frame(ym = as.yearmon(seq(2002, 2017+11/12, 1/12))), all.y = TRUE)
M <- transform(M_na, x = replace(x, is.na(x), 0))
plot(x ~ ym, M, type = "h", xlab = "", ylab = "Count", xaxt = "n")
axis(1, 2002:2017)
(continued after image)
magrittr
This could also be expressed as a magrittr pipeline like this:
library(magrittr)
library(zoo)
M <- DF %>%
transform(Start = as.yearmon(Start), Stop = as.yearmon(Stop)) %$%
unlist(mapply(seq, Start, Stop, MoreArgs = list(by = 1/12))) %>%
{ aggregate(.^0, list(ym = as.yearmon(.)), sum) } %>%
merge(data.frame(ym = as.yearmon(seq(2002, 2017+11/12, 1/12))), all.y = TRUE) %>%
transform(x = replace(x, is.na(x), 0))
Note: We assume the following input with Date class columns:
Lines <- "
Start Stop
04/09/2006 23/01/2014
06/07/2008 11/03/2017
30/09/2010 22/04/2016"
DF <- read.table(text = Lines, header = TRUE)
fmt <- "%d/%m/%Y"
DF <- transform(DF, Start = as.Date(Start, fmt), Stop = as.Date(Stop, fmt))
If you format your dates as month-year and then apply a factor with all the month-years values you should get what you want
# creating data for example
dates <- sample(seq(as.Date('01/01/2002', format='%m/%d/%Y'),
as.Date('12/31/2017', format='%m/%d/%Y'),
by="day"), 30)
# use the cut function to round up to the first of each month if you like
months <- format(as.Date(cut(dates, breaks= 'month')), '%b %Y')
# cut function is not necessary if you prefer to skip to the format
months <- format(dates, '%b %Y')
# Created an ordered vector of months and years
ord_months <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
ord_year <- as.character(2002:2017)
# create an ordered vector of month years
months_ordered <- apply(expand.grid(ord_months, ord_year), 1, paste, collapse = ' ')
head(months_ordered)
# factor the format and apply the factored vector as the levels
monthsF <- factor(months, levels=months_ordered)
table(monthsF)
Related
I have a continuous list of dates (yyyy-mm-dd) from 1985 to 2018 in one column (Colname = date). What I wish to do is generate another column which outputs a water season and year given the date.
To make it clearer I have two water season:
Summer = yyyy-04-01 to yyyy-09-31;
Winter = yyyy-10-01 to yyyy(+1)-03-31.
So for 2018 - Summer = 2018-04-01 to 2018-09-31; Winter 2018-10-01 to 2019-03-31.
What I would like to output is something like the following:
Many thanks.
A tidy verse approach
library(tidyverse)
df <-tibble(date = seq(from = as.Date('2000-01-01'), to = as.Date('2001-12-31'), by = '1 month'))
df
df %>%
mutate(water_season_year = case_when(
lubridate::month(date) %in% c(4:9) ~str_c('Su_', lubridate::year(date)),
lubridate::month(date) %in% c(10:12) ~str_c('Wi_', lubridate::year(date)),
lubridate::month(date) %in% c(1:3)~str_c('Wi_', lubridate::year(date) -1),
TRUE ~ 'Error'))
You can compare just the month part of the data to get the season, in base R consider doing
month <- as.integer(format(df$date, "%m"))
year <- format(df$date, "%Y")
inds <- month >= 4 & month <= 9
df$water_season_year <- NA
df$water_season_year[inds] <- paste("Su", year[inds], sep = "_")
df$water_season_year[!inds] <- paste("Wi", year[!inds], sep = "_")
#To add previous year for month <= 3 do
df$water_season_year[month <= 3] <- paste("Wi",
as.integer(year[month <= 3]) - 1, sep = "_")
df
# date water_season_year
#1 2019-01-03 Wi_2019
#2 2000-06-01 Su_2000
Make sure that date variable is of "Date" class.
data
df <-data.frame(date = as.Date(c("2019-01-03", "2000-06-01")))
I have dataframe that has 6000 locations. For each location, I have 36 years daily data of rainfall in wide format.
A sample data:
set.seed(123)
mat <- matrix(round(rnorm(6000*36*365), digits = 2),nrow = 6000*36, ncol = 365)
dat <- data.table(mat)
names(dat) <- rep(paste0("d_",1:365))
dat$loc.id <- rep(1:6000, each = 36)
dat$year <- rep(1980:2015, times = 6000)
What I want to do is for each location, generate the long term average rainfall for each month. For e.g. for loc.id = 1, mean rainfall in Jan, Feb, March... Dec.
Let' say this data is called df which is a data table
library(dplyr)
Here's what I did:
loc.list <- unique(dat$loc.id)
my.list <- list() # a list to store results
ptm <- proc.time()
for(i in seq_along(loc.list)){
n <- loc.list[i]
df1 <- dat[dat$loc.id == n,]
df2 <- gather(df1, day, rain, -year) # this melts the data in long format
df3 <- df2 %>% mutate(day = gsub("d_","", day)) %>% # since the day column was in "d_1" format, I converted into integer (1,2,3..365)
mutate(day = as.numeric(as.character(day))) %>% # ensure that day column is numeric. For some reasonson, some NA.s appear.
arrange(year,day) %>% # ensure that they are arranged in order
mutate(month = strptime(paste(year, day), format = "%Y %j")$mon + 1) %>% # assing each day to a month
group_by(year,month) %>% # group by year and month
summarise(month.rain = sum(rain)) %>% # calculate for each location, year and month, total rainfall
group_by(month) %>% # group by month
summarise(month.mean = round(mean(month.rain), digits = 2)) # calculate for each month, the long term mean
my.list[[i]] <- df3
}
proc.time() - ptm
user system elapsed
1036.17 0.20 1040.68
I wanted to ask if there are more efficient and faster way to achieve this task
Another data.table alternative:
# change column names to month, grabbed from 365 dates of a non-leap year
setnames(dat, c(format(as.Date("2017-01-01") + 0:364, "%b"),
"loc.id", "year"))
# melt to long format
d <- melt(dat, id.vars = c("loc.id", "year"),
variable.name = "month", value.name = "rain")
# calculate mean rain by location and month
d2 <- d[ , .(mean_rain = mean(rain)), by = .(loc, month)]
This seems ~7 times faster than the answer by caw5cs. The result by Martin Morgan is in a different format though, which prevents a direct comparison of timings.
If you rather have unique column names in 'dat', you may use %b_%d (month-day) instead of %b only. Then use substr in by to grab the month part:
# change column names to month_day, using 365 dates of a non-leap year
setnames(dat, c(format(as.Date("2017-01-01") + 0:364, "%b_%d"),
"loc.id", "year"))
# melt to long format
d <- melt(dat, id.vars = c("loc.id", "year"),
variable.name = "month_day", value.name = "rain")
# calculate mean rain by location and month
d2 <- d[ , .(mean_rain = mean(rain)), by = .(loc.id, month = substr(month_day, 1, 3))]
Use the cryptically named rowsum() to sum daily rainfall at each site, over all years
loc.id = rep(1:6000, each = 36)
daily.by.loc = rowsum(mat, loc.id)
and use the same trick on the transposed matrix to sum by month (since there are 365 columns leap years must be ignored).
month = factor(
months(as.Date(0:364, origin="1970-01-01")),
levels = month.name
)
loc.by.month = rowsum(t(daily.by.loc), month)
Calculate the average by dividing by number of observations; R's column-major matrix representation and recycling rules apply. Transpose so the orientation is the same as the data.
days.per.month = tabulate(month)
ans = t(loc.by.month / (36 * days.per.month))
The result is a 6000 x 12 matrix
> dim(ans)
[1] 6000 12
> head(ans, 3)
January February March April May June
1 0.01554659 0.002043651 -0.02950717 -0.02700926 0.003521505 -0.011268519
2 0.04953405 0.032926587 -0.04959677 0.02808333 0.022051971 0.009768519
3 -0.01125448 -0.023343254 -0.02672939 0.04012963 0.018530466 0.035583333
July August September October November December
1 0.009874552 -0.030824373 -0.04958333 -0.03366487 -0.07390741 -0.07899642
2 -0.011630824 -0.003369176 -0.00100000 -0.00594086 -0.02817593 -0.01161290
3 0.031810036 0.059641577 -0.01109259 0.04646953 -0.01601852 0.03103943
in less than a second.
Grossly misread the question the first time, oops! Seems to be working as intended this time.
library(data.table)
set.seed(123)
mat <- matrix(round(rnorm(6000*36*365), digits = 2),nrow = 6000*36, ncol = 365)
dat <- data.table(mat)
names(dat) <- rep(paste0("d_",1:365))
dat$loc.id <- rep(1:6000, each = 36)
dat$year <- rep(1980:2015, times = 6000)
system.time({
# convert to long format with month # as column name
date_cols <- colnames(dat)[1:365]
setnames(dat, date_cols, as.character(1:365))
dat.long <- melt(dat, measure.vars=as.character(1:365), variable="day", value="rainfall")
# R date starts at 0 for Jan 1, so we offset the day by 1
dat.long[, day := as.numeric(day) - 1]
setkey(dat.long, year, day)
# Make table for merging year/day/month
months <- CJ(year=1980:2015, day=0:365)
months[, date := as.Date(day, origin=paste(year, "-01-01", sep=""))]
months[, month := tstrsplit(date, "-")[2]]
setkey(months, year, day)
# Merge tables to get month column
dat.merge <- merge(dat.long, months)
# aggregate by location an dmonth
dat.ag <- dat.merge[, list(mean_rainfall = mean(rainfall)), by=list(loc.id, month)]
})
Yielding
user system elapsed
14.420 4.205 18.626
> dat.ag
loc.id month mean_rainfall
1: 1 01 0.015546595
2: 2 01 0.049534050
3: 3 01 -0.011254480
4: 4 01 -0.019453405
5: 5 01 0.005860215
---
71996: 5996 12 0.027407407
71997: 5997 12 0.020334237
71998: 5998 12 0.043360434
71999: 5999 12 -0.006856369
72000: 6000 12 0.040542005
I think this problem may be of interest to others who deal with data smoothing of long-term environmental variables.
I have a dataset structured as below:
Columns:
Date Hour_Min Y(response variable)
These data are hourly, and I need to create a moving average of the diel cycle, but categorized by the Hour_Min. In other words, if I were to use a 31 day window, for a given day the running average data point for Hour_Min 00:00 would take the average of the day in question with the data points from Hour_Min 00:00 for the previous and the following 15 days. This would then repeat for that day's hour 1:00, etc. through the dataframe.
Unfortunately the data also have many NAs, which is problematic for moving window averages, although I think that can be solved using rollapply from the zoo package.
One approach I tried was to use tidyr's spread function to switch from long to wide format, to create a dataframe like this:
Date Y_Hour_Min_0000 Y_Hour_Min_0100 Y_Hour_Min_0200 etc...
If I could change the format in this way, I could then create new columns of running averages of each Y_Hour_Min_.... column. I would then need to gather everything together back to long format (another task I'm not sure how to approach).
However, I wasn't able to get the spread function to work so that it kept Date as a grouping variable associated with each Y_Hour_Min_.... column.
Another, possibly more elegant solution would be if there is a way to create a single new column in one step, using some combination of rollapply and custom function.
Any thoughts on how to implement code for this task will be greatly appreciated. Below I have a simple code to simulate my dataset:
Simulated data:
### Create vector of hours/dates:
date <- seq(as.POSIXct("2016-01-01 00:00"), as.POSIXct("2016-12-30
23:00"), by="hour")
### Create vector of noisy sine function:
d <- 365
n <- 24*d # number of data points
t <- seq(from = 0, to = 2*d*pi, length.out=24*d)
a <- 6
b <- 1
c.norm <- rnorm(n)
amp <- 3
y <- a*sin(b*t)+c.norm*amp+15
### Randomly insert NAs into data:
ind <- which(y %in% sample(y, 1000))
y[ind]<-NA
### Create test dataframe:
df <- data.frame(dt = date, y = y) %>%
separate(dt, c("date", "hour_min"), sep=" ") %>%
mutate(date = as.Date(date))
I think this could work:
EDIT: Simplified code by adding fill = NA parameter to rollapply() function as suggested in the comments.
# add a complete date + time stamp
df$date_time <- paste(df$date, df$hour_min)
# make new column to store median data
df$median_y <- NA
# set rolling median width
width_roll <- 31
# do a rolling median for each hour, one at a time
# add NAs where no median can be calculated
for (i in levels(factor(df$hour_min))) {
df[df$hour_min == i, "median_y"] <- rollapply(df[df$hour_min == i, "y"],
width = width_roll,
median,
na.rm = TRUE,
fill = NA))
}
The approach is just to use the rollapply() function as you suggested, but only on one particular hour at a time. Then each of these is placed back into a new column in turn.
Here's an example for just one hour over the whole year, which makes it easier to visualize the median smoothing.
# Examples:
# plot one hour plus rolling median over time
# here i = "23:00:00"
plot(x = as.POSIXct(df[df$hour_min == i, "date_time"]),
y = df[df$hour_min == i, "y"],
type = "l",
col = "blue",
ylab = "y values",
xlab = i)
lines(x = as.POSIXct(df[df$hour_min == i, "date_time"]),
y = df[df$hour_min == i, "median_y"],
lwd = 3)
legend("topleft",
legend = c("raw", "median"),
col = c("blue", "black"),
lwd = 3)
Plot for a single hour
This is for everything (lots of data so not so easy to see but looks like it worked).
# plot all the data
plot(x = as.POSIXct(df$date_time),
y = df$y,
type = "l",
col = "blue",
ylab = "y values",
xlab = "Date")
lines(x = as.POSIXct(df$date_time),
y = df$median_y,
lwd = 3)
legend("topleft",
legend = c("raw", "median"),
col = c("blue", "black"),
lwd = 3)
Plot for all data
I'll take a crack at it but its not perfect. Hoping someone can come in and top me off.
TL:DR;
df2 <- df %>% slice(-7441) %>% spread(hour_min, y)
mov_avg <- function(x) {c(rep(NA, 15), rollapply(x, width = list(-15:15), FUN = mean, align="center", na.rm=T), rep(NA, 15))}
avgs <- as.data.frame(matrix(unlist(lapply(df2[,2:ncol(df2)], mov_avg)), nrow = nrow(df2), byrow = FALSE))
colnames(avgs) <- paste0("avg_", colnames(df2[,2:ncol(df2)]))
final_df <- cbind(df2, avgs) %>%
gather(2:ncol(.), key = "hour_min", value = "value") %>%
arrange(date, hour_min)
In Depth:
Starting at your starting point.. I added set.seed(1) so we can all follow along in tandem.
Your Initial Starting Point:
### Create vector of hours/dates:
set.seed(1)
date <- seq(as.POSIXct("2016-01-01 00:00"), as.POSIXct("2016-12-30
23:00"), by="hour")
### Create vector of noisy sine function:
d <- 365
n <- 24*d # number of data points
t <- seq(from = 0, to = 2*d*pi, length.out=24*d)
a <- 6
b <- 1
c.norm <- rnorm(n)
amp <- 3
y <- a*sin(b*t)+c.norm*amp+15
### Randomly insert NAs into data:
ind <- which(y %in% sample(y, 1000))
y[ind]<-NA
### Create test dataframe:
df <- data.frame(dt = date, y = y) %>%
separate(dt, c("date", "hour_min"), sep=" ") %>%
mutate(date = as.Date(date))
First thing was to do what you said and try the long format. Normally I think this problem would be best by using dplyr's group_by on the hour_min column and doing the rolling average there, but I'm not sure how to do that.
First thing I noticed is that there is a duplicate value for one row on a given day. There are two observations for 1am, which breaks our spread, so I removed that observation using slice(-7441)
So let's spread your df.
df2 <- df %>% slice(-7441) %>% spread(hour_min, y)
As we can see, the dataframe is now 365 observations long(dates), and 25 columns wide (date + 24 hours)
dim(df2)
[1] 365 25
Next thing I did which is where this isn't perfect, is using rollapply. When using rollapply we can give it a width = list(-15:15). This will look 15 days into the past and 15 into the future and average all 31 days together. The problem is the first 15 days don't have a past 15, and the last 15 days don't have a future 15. So I padded these with NAs. I'm hoping someone can fix this part of my answer.
I created a custom function to do this:
mov_avg <- function(x) {c(rep(NA, 15), rollapply(x, width = list(-15:15), FUN = mean, align="center", na.rm=T), rep(NA, 15))}
If we just do the rollapply we will get a vector of length 335. I padded 15 in front and back to get us to our needed 365.
Next we want to lapply that function across our entire dataframe. That will give us a list of 24 vectors of length 365. We then want to turn that into a dataframe and bind it to our current dataframe.
Lastly we gather all of the columns back into the long format and arrange
avgs <- as.data.frame(matrix(unlist(lapply(df2[,2:ncol(df2)], mov_avg)), nrow = nrow(df2), byrow = FALSE))
colnames(avgs) <- paste0("avg_", colnames(df2[,2:ncol(df2)]))
final_df <- cbind(df2, avgs) %>%
gather(2:ncol(.), key = "hour_min", value = "value") %>%
arrange(date, hour_min)
I hope this helps.
I have a data frame which consists of date and temperature of 34 different systems each system in different column. I need to calculate every systems average hourly temperature. I use this code to calculate average for 1 system. But if I want to calculate average for other 33 systems, I have to repeat code again, and again. Is there a better way to find hourly average in all columns at once ?
dat$ut_ms <- dat$ut_ms/1000
dat[ ,1]<- as.POSIXct(dat[,1], origin="1970-01-01")
dat$ut_ms <- strptime(dat$ut_ms, "%Y-%m-%d %H:%M")
dat$ut_ms <- cut(dat[enter image description here][1]$ut_ms, breaks = 'hour')
meanNPWD2401<- aggregate(NPWD2401 ~ ut_ms, dat, mean)
I added a picture of the data. For better understing of what I want.
You can split your data per hour and itterate,
list1 <- split(dat, cut(strptime(dat$ut_ms, format = '%Y-%m-%d %H:%M'), 'hour'))
lapply(list1, colMeans)
When you rearrange the data into a long format, things get much easier
n.system <- 34
n.time <- 100
temp <- rnorm(n.time * n.system)
temp <- matrix(temp, ncol = n.system)
seconds <- runif(n.time, max = 3 * 3600)
time <- as.POSIXct(seconds, origin = "1970-01-01")
dataset <- data.frame(time, temp)
library(dplyr)
library(tidyr)
dataset %>%
gather(key = "system", value = "temperature", -time) %>%
mutate(hour = cut(time, "hour")) %>%
group_by(system, hour) %>%
summarise(average = mean(temperature))
I don't often have to work with dates in R, but I imagine this is fairly easy. I have a column that represents a date in a dataframe. I simply want to create a new dataframe that summarizes a 2nd column by Month/Year using the date. What is the best approach?
I want a second dataframe so I can feed it to a plot.
Any help you can provide will be greatly appreciated!
EDIT: For reference:
> str(temp)
'data.frame': 215746 obs. of 2 variables:
$ date : POSIXct, format: "2011-02-01" "2011-02-01" "2011-02-01" ...
$ amount: num 1.67 83.55 24.4 21.99 98.88 ...
> head(temp)
date amount
1 2011-02-01 1.670
2 2011-02-01 83.550
3 2011-02-01 24.400
4 2011-02-01 21.990
5 2011-02-03 98.882
6 2011-02-03 24.900
I'd do it with lubridate and plyr, rounding dates down to the nearest month to make them easier to plot:
library(lubridate)
df <- data.frame(
date = today() + days(1:300),
x = runif(300)
)
df$my <- floor_date(df$date, "month")
library(plyr)
ddply(df, "my", summarise, x = mean(x))
There is probably a more elegant solution, but splitting into months and years with strftime() and then aggregate()ing should do it. Then reassemble the date for plotting.
x <- as.POSIXct(c("2011-02-01", "2011-02-01", "2011-02-01"))
mo <- strftime(x, "%m")
yr <- strftime(x, "%Y")
amt <- runif(3)
dd <- data.frame(mo, yr, amt)
dd.agg <- aggregate(amt ~ mo + yr, dd, FUN = sum)
dd.agg$date <- as.POSIXct(paste(dd.agg$yr, dd.agg$mo, "01", sep = "-"))
A bit late to the game, but another option would be using data.table:
library(data.table)
setDT(temp)[, .(mn_amt = mean(amount)), by = .(yr = year(date), mon = months(date))]
# or if you want to apply the 'mean' function to several columns:
# setDT(temp)[, lapply(.SD, mean), by=.(year(date), month(date))]
this gives:
yr mon mn_amt
1: 2011 februari 42.610
2: 2011 maart 23.195
3: 2011 april 61.891
If you want names instead of numbers for the months, you can use:
setDT(temp)[, date := as.IDate(date)
][, .(mn_amt = mean(amount)), by = .(yr = year(date), mon = months(date))]
this gives:
yr mon mn_amt
1: 2011 februari 42.610
2: 2011 maart 23.195
3: 2011 april 61.891
As you see this will give the month names in your system language (which is Dutch in my case).
Or using a combination of lubridate and dplyr:
temp %>%
group_by(yr = year(date), mon = month(date)) %>%
summarise(mn_amt = mean(amount))
Used data:
# example data (modified the OP's data a bit)
temp <- structure(list(date = structure(1:6, .Label = c("2011-02-01", "2011-02-02", "2011-03-03", "2011-03-04", "2011-04-05", "2011-04-06"), class = "factor"),
amount = c(1.67, 83.55, 24.4, 21.99, 98.882, 24.9)),
.Names = c("date", "amount"), class = c("data.frame"), row.names = c(NA, -6L))
You can do it as:
short.date = strftime(temp$date, "%Y/%m")
aggr.stat = aggregate(temp$amount ~ short.date, FUN = sum)
Just use xts package for this.
library(xts)
ts <- xts(temp$amount, as.Date(temp$date, "%Y-%m-%d"))
# convert daily data
ts_m = apply.monthly(ts, FUN)
ts_y = apply.yearly(ts, FUN)
ts_q = apply.quarterly(ts, FUN)
where FUN is a function which you aggregate data with (for example sum)
Here's a dplyr option:
library(dplyr)
df %>%
mutate(date = as.Date(date)) %>%
mutate(ym = format(date, '%Y-%m')) %>%
group_by(ym) %>%
summarize(ym_mean = mean(x))
I have a function monyr that I use for this kind of stuff:
monyr <- function(x)
{
x <- as.POSIXlt(x)
x$mday <- 1
as.Date(x)
}
n <- as.Date(1:500, "1970-01-01")
nn <- monyr(n)
You can change the as.Date at the end to as.POSIXct to match the date format in your data. Summarising by month is then just a matter of using aggregate/by/etc.
One more solution:
rowsum(temp$amount, format(temp$date,"%Y-%m"))
For plot you could use barplot:
barplot(t(rowsum(temp$amount, format(temp$date,"%Y-%m"))), las=2)
Also, given that your time series seem to be in xts format, you can aggregate your daily time series to a monthly time series using the mean function like this:
d2m <- function(x) {
aggregate(x, format(as.Date(zoo::index(x)), "%Y-%m"), FUN=mean)
}