I am looking to create a simple custom function I can integrate into a dplyr pipe workflow.
I want to calculate the mean of times, ignoring the date component. So, for instance, given a sequence of POSIXct values, I want to extract the time and calculate the average. However, one added complication is that time is circular, meaning that 00:00:00 and 23:00:00 are very close to each other in time terms, but not so close arithmetically. So I can't just use something like mean(time_vector) to calculate the average.
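To illustrate the idea, a minimal hand-rolled sketch (the helper name is hypothetical) maps the times to angles on a 24-hour circle, averages the sine and cosine components, and maps back:

circular_mean_hours <- function(dec_hours) {
  angles <- dec_hours / 24 * 2 * pi                      # decimal hours -> radians
  mean_angle <- atan2(mean(sin(angles)), mean(cos(angles)))
  (mean_angle / (2 * pi) * 24) %% 24                     # radians -> hours, wrapped to [0, 24)
}

circular_mean_hours(c(23.0, 23.67, 0.21, 0.29))  # ~23.8, i.e. just before midnight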
I have seen the package psych which has a function called circadian.mean to calculate a circular average. However, it only takes decimal hours, so there's some juggling I have had to do before getting a valid output. For instance:
library(tidyverse)
library(lubridate)
library(psych)
df = data.frame(datetime = as.POSIXct(c("2019-07-14 23:00:17",
                                        "2019-07-14 23:40:20",
                                        "2019-07-14 00:12:45",
                                        "2019-07-14 00:17:19"), tz = "UTC"))

decimal_hours_vector = df %>%
  mutate(hours = hour(datetime)) %>%             # extracting hours
  mutate(minutes = minute(datetime)) %>%         # extracting minutes
  mutate(seconds = second(datetime)) %>%         # extracting seconds
  mutate(dec_min = (minutes/60*100)/100) %>%     # converting minutes to decimal hours
  mutate(dec_sec = (seconds/60/60*100)/100) %>%  # converting seconds to decimal hours
  rowwise() %>%
  mutate(dec_hour = sum(hours, dec_min, dec_sec)) %>%  # summing all three time columns rowwise
  ungroup() %>%                                  # ungrouping
  pull(dec_hour)                                 # extracting dec_hour as a vector to use in circadian.mean

# calculating average time with psych
average_time = circadian.mean(decimal_hours_vector)
So, I think the above works, but it is very cumbersome. Also, I still haven't worked out whether it makes sense to convert the final output back into hh:mm:ss or whether it should stay in decimal format, and whether everything is working as it should.
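On the hh:mm:ss point, the decimal-hour result can be converted back if needed; a hedged sketch (assuming average_time is the numeric decimal-hour value from above, and that the hms package is available):

hms::hms(round(average_time * 3600))   # decimal hours -> seconds -> hh:mm:ss
# or with base R
format(as.POSIXct(average_time * 3600, origin = "1970-01-01", tz = "UTC"), "%H:%M:%S")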
As opposed to using the laborious process above, is there a better or more sensible way to do this?
I am currently working to clean up a subset of data that's close to 15M rows. Eventually I will be working with the full data set, which is closer to 120M rows.
Part of my data is dates in hourly increments, split between two columns. One column has the date in (1/1/2020) format; another column has the hour corresponding to that date as an integer.
I have successfully accomplished my goal with the following code:
library(tibble)
library(lubridate)
df <- tibble(date = rep(c(mdy("1/1/2020")), each = 5), hour = 1:5)
hour(df$date) <- df$hour
Running this on the full 15M rows takes 120 s on my (quite powerful) machine. I don't usually work with datasets this large, and that seems too slow to me, but I am an armchair coder at best.
Is that a reasonable time frame to accomplish my goal? If not, is there another function or more efficient way to accomplish the same result?
It may be easier to paste the 'hour' into the 'date' column and convert to a datetime class with ymd_h:
library(dplyr)
library(lubridate)
library(stringr)  # for str_c()

df %>%
  mutate(date = ymd_h(str_c(date, hour, sep = ' ')))
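A hedged alternative (my assumption, not part of the answer above): since hour is a whole number, the string building can be skipped by treating each date as midnight and adding the hour as a period, which may be faster at this scale:

library(lubridate)
# treat the Date as 00:00:00 UTC and add the hour column as a period of whole hours
df$datetime <- as_datetime(df$date) + hours(df$hour)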
I have imported daily return data for ADSK via a downloaded Yahoo finance .csv file.
ADSKcsv <- read.csv("ADSK.csv", header = TRUE)
read.csv already returns a data frame, which I have confirmed:
class(ADSKcsv)
I have selected the two relevant columns that I want to work with and am trying to take the mean of all daily returns for each year, but I do not know how to do this.
aggregate(Close ~ Date, ADSKcsv, mean)
The above code yields a mean calculation for each date. My objective is to calculate YoY return from this data, first converting daily returns to yearly returns, then using yearly returns to calculate year-over-year returns. I'd appreciate any help.
May I suggest an easier approach?
library(tidyquant)
ADSK_yearly_returns_tbl <- tq_get("ADSK") %>%
  tq_transmute(select     = close,
               mutate_fun = periodReturn,
               period     = "yearly")

ADSK_yearly_returns_tbl
If you run the above code, it will download the historical returns for a symbol of interest (ADSK in this case) and then calculate the yearly return. An added bonus to using this workflow is that you can swap out any symbols of interest without manually downloading and reading them in. Plus, it saves you the extra step of calculating the average daily return.
You can extract the year value from Date and then aggregate.
This can be done in base R:
aggregate(Close ~ year, transform(ADSKcsv, year = format(Date, '%Y')), mean)
(This assumes Date is of class Date; if it was read in as character, convert it first with as.Date.)
Or with dplyr:
library(dplyr)

ADSKcsv %>%
  group_by(year = format(Date, '%Y')) %>%
  # or using lubridate's year function:
  # group_by(year = lubridate::year(Date)) %>%
  summarise(Close = mean(Close))
Or with data.table:
library(data.table)
setDT(ADSKcsv)[, .(Close = mean(Close)), format(Date, '%Y')]
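If the eventual goal is the year-over-year change mentioned in the question, a hedged sketch building on the dplyr version above (the yoy column name is mine) could be:

library(dplyr)

ADSKcsv %>%
  group_by(year = format(Date, '%Y')) %>%
  summarise(Close = mean(Close)) %>%
  arrange(year) %>%
  mutate(yoy = Close / lag(Close) - 1)  # proportional change versus the previous year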
I'm trying to collapse all of the similar Date/Time rows into one row plus a "count" column. Therefore I'll get two columns: one for the Date/Time and one for the count.
I used this to round my observations into 15-minute time periods:
dat$by15 <- cut(dat$Date_Time, breaks = "15 min")
I tried to use the following, but it's "jumping" to a previous dataset and giving me the wrong observations for some reason:
dat <- aggregate(dat, by = list(dat$by15), length )
Thank you guys!
I'm not sure if I understood the question, but if you are trying to group by date and count the observations for each date, it's really simple:
library(dplyr)

grouped_dates <- dat %>%
  group_by(Date_Time) %>%
  summarise(Count = n())
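If the counts should instead be per 15-minute bin (the by15 column created with cut() in the question), a hedged variant would be:

library(dplyr)

grouped_bins <- dat %>%
  group_by(by15) %>%
  summarise(Count = n())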
Hi, I have a data frame (~4 million rows) with time series data for different sites and events.
Here is a rough idea of my data (obviously on a different scale). I have several similar time series, so I've kept it general, as I want to be able to apply this in different cases:
Data1 <- data.frame(DateTimes = as.POSIXct("1988-04-30 13:20:00") + c(1:10, 12:15, 20:30, 5:13, 16:20, 22:35) * 300,
                    Site = c(rep("SiteA", 25), rep("SiteB", 28)),
                    Quality = rep(25, 53),
                    Value = round(runif(53, 0, 5), 2),
                    Othermetadata = c(rep("E1", 10), rep("E2", 15), rep("E1", 10), rep("E2", 18)))
What I'm looking for is a simple way to group and aggregate this data to different timesteps while keeping the metadata which doesn't vary within the group.
I have tried using the zoo library and zoo::aggregate, i.e.:
library(zoo)
library(dplyr)  # for select()

zooData <- read.zoo(select(Data1, DateTimes, Value))
zooagg <- aggregate(zooData, time(zooData) - as.numeric(time(zooData)) %% 3600, FUN = sum, reg = T)
However, when I do this I lose all my metadata and merge the data from different sites.
I wondered about using plyr or dplyr to split up the data and then applying the aggregation, but I'm still going to lose my other columns.
Is there a better way to do this? I had a brief look at the docs for the xts library but couldn't see an intuitive solution in there either.
Note: as I want this to work for a few different things, both the starting time step and the final time step might change, possibly with an irregular time step or a somewhat regular time step with missing points. The FUN applied may also vary (mostly sum or mean), as well as the fields I want to split by.
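A hedged sketch of what a parameterised helper along these lines might look like (the helper name, the use of lubridate::floor_date, and treating the metadata columns as constant within each group are all assumptions of mine):

library(dplyr)
library(lubridate)

# hypothetical helper: aggregate Value to an arbitrary time step while carrying
# along metadata columns assumed constant within each group
aggregate_series <- function(data, unit = "hour", fun = sum) {
  data %>%
    mutate(timeagg = floor_date(DateTimes, unit)) %>%
    group_by(timeagg, Site, Othermetadata) %>%
    summarise(Quality = first(Quality),
              Value = fun(Value),
              .groups = "drop")
}

aggregate_series(Data1, unit = "hour", fun = mean)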
Edit: I found the solution after Hercules Apergis pushed me in the right direction.
newData <- Data1 %>%
  mutate(timeagg = lubridate::floor_date(DateTimes, "hour")) %>%  # one way to create the aggregated-time grouping variable
  group_by(timeagg, Site) %>%
  summarise(Total = sum(Value))

finaldata <- inner_join(Data1, newData) %>% select(-DateTimes, -Value) %>% distinct()
The original DateTimes column wasn't a grouping variable; it was the time series itself. So I added a grouping variable for my aggregated time (here: the hour) and summarised on that. The problem was that if I joined on this new column, I missed any points that fell within an hour but not exactly on the hour. Hence the inner_join %>% select %>% distinct approach.
Now hopefully it works with my real data, not just the example data!
Given the aggregation call that you have:
aggregate(zooData, time(zooData) - as.numeric(time(zooData))%%3600, FUN = sum, reg = T)
You want to sum the values by group of times AND NOT lose other columns. You can simply do this with the dplyr package:
library(dplyr)
newdata <- Data1 %>% group_by(DateTimes) %>% summarise(sum(Value))
finaldata <- inner_join(Data1, newdata, by = "DateTimes")
newdata is a data.frame in which the Values are summed for each group of DateTimes. inner_join then merges the rows that the two datasets have in common, by the DateTimes variable. Since I am not entirely sure what your desired output is, this should be a good starting point.
Goal: Plot a time series.
Problem: The x-axis data is, of course, treated as a character, and I'm having trouble converting the character into a date.
new.df <- df %>%
  group_by(Month, Year) %>%
  summarise(n = n())

new.df <- new.df %>%
  unite(Date, Month, Year, sep = "/") %>%
  mutate(Total = cumsum(n))
So, I end up with a data frame looking like this:
Date    n    Group  Total
8/2010  1    1      1
9/2010  414  1      415
etc.
I'm trying to convert the Date column into Date format. The column is of class character. So I tried:
new.df$Date <- as.Date(new.df$Date, format = "%m/%Y")
However, when I do that, the entire Date column is replaced with NAs.
I'm not sure whether this is because my single-digit months don't have 0's in front of them. I used unite() just because I thought it might make things easier, but it might not.
I originally created the Year/Month variables with the lubridate package, but I wasn't sure I could incorporate that here. Bonus points if someone can show me how.
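A hedged sketch of incorporating lubridate here (make_date() builds a proper Date from year and month, taking day 1; it assumes Month and Year are numeric, and the column names follow the question):

library(dplyr)
library(lubridate)

new.df <- df %>%
  group_by(Year, Month) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  mutate(Date = make_date(Year, Month, 1)) %>%   # first day of each month as a real Date
  arrange(Date) %>%
  mutate(Total = cumsum(n))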
I would appreciate any help or guidance. I'm sure it's not that hard; I'm just having a major brain fart at the moment.
You can try it like this:
library(zoo)  # for yearmon

new.df$Date <- as.yearmon(new.df$Date, format = "%m/%Y")
But if you really need it to be of class Date, then I guess you have to define a day (e.g. 01), as @lukeA suggested in the comments.
My issue, as pointed out by lukeA in the comments, is that the as.Date function requires a day to be present somewhere within the character string.
Therefore, pasting "01" (or, I think, virtually any other valid two-digit day) onto the front of each date fixed the issue.
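For reference, a hedged sketch of that fix (assuming the Date column still holds strings like "8/2010"):

new.df$Date <- as.Date(paste0("01/", new.df$Date), format = "%d/%m/%Y")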