I have some data that I need to analyse. I want to create a graph of the average usage per day of the week. The data is in a data.table with the following structure:
time value
2014-10-22 23:59:54 7433033.0
2014-10-23 00:00:12 7433034.0
2014-10-23 00:00:31 7433035.0
2014-10-23 00:00:49 7433036.0
...
2014-10-23 23:59:21 7443032.0
2014-10-23 23:59:40 7443033.0
2014-10-23 23:59:59 7443034.0
2014-10-24 00:00:19 7443035.0
Since the value is cumulative, I need the maximum value of each day minus the minimum value of that day, and then the average of those daily differences across days that fall on the same weekday.
I already know how to get the day of the week (using as.POSIXlt and $wday). So how can I get the daily difference? Once I have the data in a structure like:
dayOfWeek value
0 10
1 20
2 50
I should be able to find the mean myself using some functions.
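For that last step, I assume a base one-liner like the following would do (with df being the summarised table sketched above):
aggregate(value ~ dayOfWeek, data = df, FUN = mean)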
Here is a sample:
library(data.table)
data <- fread("http://pastebin.com/raw.php?i=GXGiCAiu", header=T)
#get the difference per day
#create average per day of week
There are many ways to do this in R. You can use ave() from base R, or the data.table or dplyr packages. These solutions all add the summaries as columns of your data.
Data
df <- data.frame(dayOfWeek = c(0L, 0L, 1L, 1L, 2L),
value = c(10L, 5L, 20L, 60L, 50L))
base R
df$min <- ave(df$value, df$dayOfWeek, FUN = min)
df$max <- ave(df$value, df$dayOfWeek, FUN = max)
data.table
require(data.table)
setDT(df)[, ":="(min = min(value), max = max(value)), by = dayOfWeek][]
dplyr
require(dplyr)
df %>% group_by(dayOfWeek) %>% mutate(min = min(value), max = max(value))
If you just want the summaries, you can also use the following:
# base
aggregate(value~dayOfWeek, df, FUN = min)
aggregate(value~dayOfWeek, df, FUN = max)
# data.table
setDT(df)[, list(min = min(value), max = max(value)), by = dayOfWeek]
# dplyr
df %>% group_by(dayOfWeek) %>% summarise(min(value), max(value))
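This can also be chained into the two-step logic the question needs (per-calendar-day difference, then mean per weekday). A data.table sketch, assuming the time column is in "YYYY-MM-DD HH:MM:SS" form so as.Date() can extract the day:
library(data.table)
# usage per calendar day: max minus min of the cumulative counter
daily <- setDT(data)[, .(usage = max(value) - min(value)), by = .(day = as.Date(time))]
# average the daily usage across days sharing the same weekday
daily[, .(avgUsage = mean(usage)), by = .(dayOfWeek = weekdays(day))]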
This is actually a trickier problem than it seems at first glance. I think you need two separate aggregations: one to aggregate the cumulative usage values within each calendar day by taking the difference of the range, and a second to average the per-calendar-day usage values by weekday. You can extract the weekday with weekdays(), compute the daily difference with diff() on range(), take the mean with mean(), and aggregate with aggregate():
set.seed(1);
N <- as.integer(60*60*24/19*14);
df <- data.frame(time=seq(as.POSIXct('2014-10-23 00:00:12',tz='UTC'),by=19,length.out=N)+rnorm(N,0,0.5), value=seq(7433034,by=1,length.out=N)+rnorm(N,0,0.5) );
head(df);
## time value
## 1 2014-10-23 00:00:11 7433034
## 2 2014-10-23 00:00:31 7433035
## 3 2014-10-23 00:00:49 7433036
## 4 2014-10-23 00:01:09 7433037
## 5 2014-10-23 00:01:28 7433039
## 6 2014-10-23 00:01:46 7433039
tail(df);
## time value
## 63658 2014-11-05 23:58:14 7496691
## 63659 2014-11-05 23:58:33 7496692
## 63660 2014-11-05 23:58:51 7496693
## 63661 2014-11-05 23:59:11 7496694
## 63662 2014-11-05 23:59:31 7496695
## 63663 2014-11-05 23:59:49 7496697
df2 <- aggregate(value~date,cbind(df,date=as.Date(df$time)),function(x) diff(range(x)));
df2;
## date value
## 1 2014-10-23 4547.581
## 2 2014-10-24 4546.679
## 3 2014-10-25 4546.410
## 4 2014-10-26 4545.726
## 5 2014-10-27 4546.602
## 6 2014-10-28 4545.194
## 7 2014-10-29 4546.136
## 8 2014-10-30 4546.454
## 9 2014-10-31 4545.712
## 10 2014-11-01 4546.901
## 11 2014-11-02 4544.684
## 12 2014-11-03 4546.378
## 13 2014-11-04 4547.061
## 14 2014-11-05 4547.082
df3 <- aggregate(value~dayOfWeek,cbind(df2,dayOfWeek=weekdays(df2$date)),mean);
df3;
## dayOfWeek value
## 1 Friday 4546.196
## 2 Monday 4546.490
## 3 Saturday 4546.656
## 4 Sunday 4545.205
## 5 Thursday 4547.018
## 6 Tuesday 4546.128
## 7 Wednesday 4546.609
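If you want the weekdays ordered Monday through Sunday rather than alphabetically, dayOfWeek can be turned into a factor before sorting, e.g.:
df3$dayOfWeek <- factor(df3$dayOfWeek, levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"));
df3[order(df3$dayOfWeek),];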
Came across this while looking for something else. I think you were looking for the difference and the mean per Monday, Tuesday, etc. Sticking with data.table allows a quick all-in-one call to get the mean and the difference per day of the week.
library(data.table)
data <- fread("http://pastebin.com/raw.php?i=GXGiCAiu", header=T)
data_summary <- data[, list(mean = mean(value),
                            diff = max(value) - min(value)),
                     by = list(date = format(as.POSIXct(time), format = "%A"))]
This gives an output of 7 rows and three columns.
date mean diff
1: Thursday 7470107 166966
2: Friday 7445945 6119
3: Saturday 7550000 100000
4: Sunday 7550000 100000
5: Monday 7550000 100000
6: Tuesday 7550000 100000
7: Wednesday 7550000 100000
I am trying to aggregate my existing observations into 10-minute intervals in R.
I did this:
data3$date= ceiling_date(as.POSIXct(data3$betdate), unit = "10 minutes")
data3 %>% group_by(date, prov) %>%
summarise(cant=n())
The problem with this code is that if there are no observations in an interval, the interval will not appear in the output data, which makes sense because there are no observations with a date in that interval. So I need to capture the intervals that have no observations registered. Any ideas? Thanks in advance to all of you.
See this simplified example of @Limey's comment, using just months and data.table:
# set up fake data
set.seed(1000)
library(lubridate)
# create a sequence of months
months <- seq(ymd("2022-01-01"), ymd("2022-06-01"), by = "month")
# create fake data, and remove some rows
dat <- data.frame(month = months, values = sample(100:200, length(months)))
dat <- dat[-sample(1:length(months),3),]
dat
# month values
#1 2022-01-01 167
#4 2022-04-01 150
#6 2022-06-01 128
Here we perform the merge and see the NAs representing missing observations:
library(data.table)
setDT(dat)
months_listed <- data.frame(month = seq(min(dat$month), max(dat$month), by = "month"))
setDT(months_listed)
merge.data.table(months_listed, dat, by = "month", all.x = T)
# month values
#1: 2022-01-01 167
#2: 2022-02-01 NA
#3: 2022-03-01 NA
#4: 2022-04-01 150
#5: 2022-05-01 NA
#6: 2022-06-01 128
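Applied to the original 10-minute problem, the same idea might look like the sketch below (object and column names follow the question; the prov grouping is left out for brevity, and missing intervals get a count of 0):
library(dplyr)
library(lubridate)
library(data.table)
counts <- data3 %>%
  mutate(date = ceiling_date(as.POSIXct(betdate), unit = "10 minutes")) %>%
  group_by(date) %>%
  summarise(cant = n())
# full grid of 10-minute intervals, then left join and treat NA as zero observations
all_intervals <- data.frame(date = seq(min(counts$date), max(counts$date), by = "10 min"))
setDT(counts); setDT(all_intervals)
full <- merge.data.table(all_intervals, counts, by = "date", all.x = TRUE)
full[is.na(cant), cant := 0]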
I'm trying to get the standard deviation of a stock price by year, but I'm getting the same value for every year.
I tried with dplyr (group_by, summarise) and also with a function, but had no luck with either; both return the same value of 67.0.
It is probably passing the whole data frame without subsetting it. How can this issue be fixed?
library(quantmod)
library(tidyr)
library(dplyr)
#initial parameters
initialDate = as.Date('2010-01-01')
finalDate = Sys.Date()
ybeg = format(initialDate,"%Y")
yend = format(finalDate,"%Y")
ticker = "AAPL"
#getting stock prices
stock = getSymbols.yahoo(ticker, from=initialDate, auto.assign = FALSE)
stock = stock[,4] #working only with closing prices
With dplyr:
#Attempt 1 with dplyr - not working, all values by year return the same
stock = stock %>% zoo::fortify.zoo()
stock$Date = stock$Index
separate(stock, Date, c("year","month","day"), sep="-") %>%
group_by(year) %>%
summarise(stdev= sd(stock[,2]))
# A tibble: 11 x 2
# year stdev
# <chr> <dbl>
# 1 2010 67.0
# 2 2011 67.0
#....
#10 2019 67.0
#11 2020 67.0
And with a function:
#Attempt 2 with function - not working - returns only one value instead of multiple
#getting stock prices
stock = getSymbols.yahoo(ticker, from=initialDate, auto.assign = FALSE)
stock = stock[,4] #working only with closing prices
#subsetting
years = as.character(seq(ybeg,yend,by=1))
years
calculate_stdev = function(series, years) {
  series[years]          # subset by years, e.g. equivalent to stock["2010"], stock["2011"]
  sd(series[years][,1])  # calculate stdev on closing prices of the current subset
}
yearly.stdev = calculate_stdev(stock,years)
> yearly.stdev
[1] 67.04185
Use apply.yearly() (a convenience wrapper around the more general period.apply()) to call a function on yearly subsets of the xts object returned by getSymbols().
You can use the Cl() function to extract the close column from objects returned by getSymbols().
stock = getSymbols("AAPL", from = "2010-01-01", auto.assign = FALSE)
apply.yearly(Cl(stock), sd)
## AAPL.Close
## 2010-12-31 5.365208
## 2011-12-30 3.703407
## 2012-12-31 9.568127
## 2013-12-31 6.412542
## 2014-12-31 13.371293
## 2015-12-31 7.683550
## 2016-12-30 7.640743
## 2017-12-29 14.621191
## 2018-12-31 20.593861
## 2019-12-31 34.538978
## 2020-06-19 29.577157
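If you prefer plain year labels rather than the period-end dates that xts reports, the result can be reshaped into a small data frame, for example:
yearly_sd <- apply.yearly(Cl(stock), sd)
data.frame(year = format(index(yearly_sd), "%Y"),
           stdev = as.numeric(yearly_sd))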
I don't know dplyr, but here's how to do it with data.table:
library(data.table)
# convert data.frame to data.table
setDT(stock)
# convert your Date column with content like "2020-06-17" from character to Date type
stock[,Date:=as.Date(Date)]
# calculate sd(price) grouped by year, assuming here your price column is named "price"
stock[,sd(price),year(Date)]
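Note that stock as returned by getSymbols() is an xts object rather than a data.frame with Date and price columns, so it needs converting first. A minimal sketch, assuming the close column is named AAPL.Close (as.data.table() keeps the dates in an index column):
library(data.table)
dt <- as.data.table(stock)   # the xts time index becomes an 'index' column
dt[, sd(AAPL.Close), by = year(index)]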
Don't pass the name of the data frame again inside your summarise() call; use the column name instead.
separate(stock, Date, c("year","month","day"), sep="-") %>%
group_by(year) %>%
summarise(stdev = sd(AAPL.Close)) # <-- here
# A tibble: 11 x 2
# year stdev
# <chr> <dbl>
# 1 2010 5.37
# 2 2011 3.70
# 3 2012 9.57
# 4 2013 6.41
# 5 2014 13.4
# 6 2015 7.68
# 7 2016 7.64
# 8 2017 14.6
# 9 2018 20.6
#10 2019 34.5
#11 2020 28.7
I have a data set with values every minute and I want to calculate the average value for every hour. I have tried using group_by(), filter() and summarise() from the dplyr package to reduce the data to hourly values. When I use only these functions I get the mean value for each hour, but only per month, and I want it for each day.
> head(DF)
datetime pw cu year m d hr min
1 2017-08-18 14:56:00 0.0630341 1.94065 2017 8 18 14 53
2 2017-08-18 14:57:00 0.0604653 1.86771 2017 8 18 14 57
3 2017-08-18 14:58:00 0.0601318 1.86596 2017 8 18 14 58
4 2017-08-18 14:59:00 0.0599276 1.83761 2017 8 18 14 59
5 2017-08-18 15:00:00 0.0598998 1.84177 2017 8 18 15 0
I had to use a for loop to reduce my table; I wrote the following to do it:
datetime <- c()
eg_bf <- c()
start = as.POSIXct("2018-01-01 00:00:00")  # first hour covered by the data (see output below)
for(i in 1:8760){
  hour = start + 3600
  hourly = DF %>%                          # keep DF intact; store the hourly summary separately
    filter(datetime >= start & datetime < hour) %>%
    summarise(eg = mean(pw))
  datetime = append(datetime, start)
  eg_bf = append(eg_bf, hourly$eg)
  start = hour
}
new_DF = data.frame(datetime, eg_bf)
So I was able to get my new data set with the mean value for every hour of the year:
datetime eg_bf
1 2018-01-01 00:00:00 0.025
2 2018-01-01 01:00:00 0.003
3 2018-01-01 02:00:00 0.002
4 2018-01-01 03:00:00 0.010
5 2018-01-01 04:00:00 0.015
The problem I'm facing is that it takes a lot of time. The idea is to add this calculation to a Shiny UI, so every time I make a change it must recalculate quickly. Any idea how to improve this calculation?
You can try this: use make_datetime() from the lubridate package to build a new date_time column from the year, month, day and hour columns of your dataset, then group and summarise on the new column.
library(dplyr)
library(lubridate)
df %>%
mutate(date_time = make_datetime(year, m, d, hr)) %>%
group_by(date_time) %>%
summarise(eg_bf = mean(pw))
@Adam Gruer's answer provides a nice solution for the date variable that should solve your question. The calculation of the mean per hour does work with just dplyr, though:
df %>%
group_by(year, m, d, hr) %>%
summarise(test = mean(pw))
# A tibble: 2 x 5
# Groups: year, m, d [?]
year m d hr test
<int> <int> <int> <int> <dbl>
1 2017 8 18 14 0.0609
2 2017 8 18 15 0.0599
You said in your question:
When I use only these functions I get the mean value for each hour, but only per month, and I want it for each day.
What did you do differently?
Even if you've found your answer, I believe this is worth mentioning:
If you're working with a lot of data and speed is an issue, then you might want to see if you can use data.table instead of dplyr.
You can see with a simple benchmark how much faster data.table is:
library(dplyr)
library(lubridate)
library(data.table)
library(microbenchmark)
set.seed(123)
# dummy data, one year, one entry per minute
# first as data frame
DF <- data.frame(datetime = seq(as.POSIXct("2018-01-01 00:00:00"),
                                as.POSIXct("2019-01-02 00:00:00"), 60),
                 pw = runif(527041)) %>%
  mutate(year = year(datetime), m = month(datetime),
         d = day(datetime), hour = hour(datetime))
# save it as a data.table
dt <- as.data.table(DF)
# transformation with dplyr
f_dplyr <- function(){
  DF %>%
    group_by(year, m, d, hour) %>%
    summarize(eg_bf = mean(pw))
}
# transformation with data.table
f_datatable <- function() {
  dt[, mean(pw), by = .(year, m, d, hour)]
}
# benchmarking
microbenchmark(f_dplyr(), f_datatable())
#
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# f_dplyr() 41.240235 44.075019 46.85497 45.64998 47.95968 76.73714 100 b
# f_datatable() 9.081295 9.712694 12.53998 10.55697 11.33933 41.85217 100 a
Check out this post, which covers a lot of ground: data.table vs dplyr: can one do something well the other can't or does poorly?
As I understand it, you have a data frame of 365 * 24 * 60 rows. The code below returns the result instantly. The outcome is mean(pw) grouped by every hour of the year.
remove(list = ls())
library(dplyr)
library(lubridate)
library(purrr)
library(tibble)
date_time <- seq.POSIXt(
as.POSIXct("2018-01-01"),
as.POSIXct("2019-01-01"),
by = "1 min"
)
n <- length(date_time)
data <- tibble(
date_time = date_time,
pw = runif(n),
cu = runif(n),
ye = year(date_time),
mo = month(date_time),
da = day(date_time),
hr = hour(date_time)
)
grouped <- data %>%
group_by(
ye, mo, da, hr
) %>%
summarise(
mean_pw = mean(pw)
)
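If you want a single date_time column in the result, like new_DF in the question, the grouping keys can be recombined with lubridate's make_datetime(), e.g.:
grouped %>%
  ungroup() %>%
  mutate(date_time = make_datetime(ye, mo, da, hr)) %>%
  select(date_time, mean_pw)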
I have a financial time series data.frame with microsecond precision:
timestamp price volume
2017-08-29 08:00:00.345678 99.1 10
2017-08-29 08:00:00.674566 98.2 5
....
2017-08-29 16:00:00.111234 97.0 3
2017-08-29 16:00:01.445678 96.5 5
In total: around 100k records per day.
I saw a couple of functions where I can specify the width of the rolling windows, e.g. k = 10. But the k is expressed as a number of observations and not minutes.
I need to calculate a running/rolling max and min of the price series and a running/rolling sum of the volume series, like this:
starting with a timestamp exactly 5 minutes after the beginning of the time series,
for every following timestamp: look back over a 5-minute interval and
calculate the rolling statistics.
How can I calculate this efficiently?
Your data
I wasn't able to capture milliseconds (but the solution should still work)
library(lubridate)
df <- data.frame(timestamp = ymd_hms("2017-08-29 08:00:00.345678", "2017-08-29 08:00:00.674566", "2017-08-29 16:00:00.111234", "2017-08-29 16:00:01.445678"),
price=c(99.1, 98.2, 97.0, 96.5),
volume=c(10,5,3,5))
purrr and dplyr solution
library(purrr)
library(dplyr)
timeinterval <- 5*60 # 5 minute
Filter df for observations within time interval, save as list
mdf <- map(1:nrow(df), ~df[df$timestamp >= df[.x,]$timestamp & df$timestamp < df[.x,]$timestamp+timeinterval,])
Summarise for each data.frame in list
statdf <- map_df(mdf, ~.x %>%
                   summarise(timestamp = head(timestamp, 1),
                             max.price = max(price),
                             max.volume = max(volume),
                             sum.price = sum(price),
                             sum.volume = sum(volume),
                             min.price = min(price),
                             min.volume = min(volume)))
Output
timestamp max.price max.volume sum.price sum.volume
1 2017-08-29 08:00:00 99.1 10 197.3 15
2 2017-08-29 08:00:00 98.2 5 98.2 5
3 2017-08-29 16:00:00 97.0 5 193.5 8
4 2017-08-29 16:00:01 96.5 5 96.5 5
min.price min.volume
1 98.2 5
2 98.2 5
3 96.5 3
4 96.5 5
As I was looking for a backward calculation (start with a timestamp and look 5 minutes back), I slightly modified the great solution by @CPak as follows:
mdf <- map(1:nrow(df), ~df[df$timestamp <= df[.x,]$timestamp & df$timestamp > df[.x,]$timestamp - timeinterval,])
statdf <- map_df(mdf, ~.x %>%
                   summarise(timestamp_to = tail(timestamp, 1),
                             timestamp_from = head(timestamp, 1),
                             max.price = max(price),
                             min.price = min(price),
                             sum.volume = sum(volume),
                             records = n()))
In addition, I added records = n() to see how many records have been used in the intervals.
One caveat: the code takes 10 mins on mdf and another 6 mins for statdf on a dataset with 100K+ records.
Any ideas how to optimize it? Thank you!
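One direction that might be worth trying is the slider package, which accepts a time index directly, so no per-row list of filtered data frames is needed. A rough sketch (not benchmarked here; note that slide_index_dbl's window is inclusive at both ends, i.e. [t - 5 min, t]):
library(slider)
library(lubridate)
df$max.price  <- slide_index_dbl(df$price,  df$timestamp, max, .before = minutes(5))
df$min.price  <- slide_index_dbl(df$price,  df$timestamp, min, .before = minutes(5))
df$sum.volume <- slide_index_dbl(df$volume, df$timestamp, sum, .before = minutes(5))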
I am trying to use dplyr in R to calculate rolling stats (mean, sd, etc.) over a dynamic window based on dates and for specific models. For instance, within groupings of items, I would like to calculate the rolling mean of all data from the 10 days prior. The dates in the data are not sequential and not complete, so I can't use a fixed window.
One way to do this is to use rollapply with the window width given explicitly, as shown below. However, I'm having trouble calculating the dynamic width. I'd prefer a method that omits the intermediate step of calculating the window and simply calculates based on date_lookback. Here's a toy example.
I've used for loops to do this, but they are very slow.
library(dplyr)
library(zoo)
date_lookback <- 10 #days to look back for rolling calcs
df <- data.frame(label = c(rep("a",5),rep("b",5)),
date = as.Date(c("2017-01-02","2017-01-20",
"2017-01-21","2017-01-30","2017-01-31","2017-01-05",
"2017-01-08","2017-01-09","2017-01-10","2017-01-11")),
data = c(790,493,718,483,825,186,599,408,108,666),stringsAsFactors = FALSE) %>%
mutate(.,
cut_date = date - date_lookback, #calcs based on sample since this date
dyn_win = c(1,1,2,3,3,1,2,3,4,5), ##!! need to calculate this vector??
roll_mean = rollapply(data, align = "right", width = dyn_win, mean),
roll_sd = rollapply(data, align = "right", width = dyn_win, sd))
These are the roll_mean and roll_sd results I'm looking for:
> df
label date data cut_date dyn_win roll_mean roll_sd
1 a 2017-01-02 790 2016-12-23 1 790.0000 NA
2 a 2017-01-20 493 2017-01-10 1 493.0000 NA
3 a 2017-01-21 718 2017-01-11 2 605.5000 159.0990
4 a 2017-01-30 483 2017-01-20 3 564.6667 132.8847
5 a 2017-01-31 825 2017-01-21 3 675.3333 174.9467
6 b 2017-01-05 186 2016-12-26 1 186.0000 NA
7 b 2017-01-08 599 2016-12-29 2 392.5000 292.0351
8 b 2017-01-09 408 2016-12-30 3 397.6667 206.6938
9 b 2017-01-10 108 2016-12-31 4 325.2500 222.3921
10 b 2017-01-11 666 2017-01-01 5 393.4000 245.5928
Thanks in advance.
You could try explicitly referencing your dataset inside the dplyr call:
date_lookback <- 10 #days to look back for rolling calcs
df <- data.frame(label = c(rep("a",5),rep("b",5)),
date = as.Date(c("2017-01-02","2017-01-20",
"2017-01-21","2017-01-30","2017-01-31","2017-01-05",
"2017-01-08","2017-01-09","2017-01-10","2017-01-11")),
data = c(790,493,718,483,825,186,599,408,108,666),stringsAsFactors = FALSE)
df %>%
  group_by(date, label) %>%
  mutate(.,
         roll_mean = mean(ifelse(df$date >= date - date_lookback & df$date <= date & df$label == label,
                                 df$data, NA), na.rm = TRUE),
         roll_sd = sd(ifelse(df$date >= date - date_lookback & df$date <= date & df$label == label,
                             df$data, NA), na.rm = TRUE))
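A per-row alternative that avoids indexing the full data frame inside the grouped mutate; a sketch using the toy df and date_lookback from the question (purrr's map_dbl loops over the rows within each label group):
library(dplyr)
library(purrr)
df %>%
  group_by(label) %>%
  mutate(roll_mean = map_dbl(row_number(),
                             ~ mean(data[date >= date[.x] - date_lookback & date <= date[.x]])),
         roll_sd = map_dbl(row_number(),
                           ~ sd(data[date >= date[.x] - date_lookback & date <= date[.x]]))) %>%
  ungroup()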