I find the monthly maximum value from daily data for many companies with this R code:
DF$date <- as.Date(DF$date,format="%Y-%m-%d")
Output <- aggregate(DF[,-1],
by=list(Month=format(DF$date,"%y-%m")),
FUN=max)
However, I could not figure out how to change this code to find the average of the two largest values in a month, or the average of the three largest values in a month. As a new learner of R, I would be grateful for any help.
We can sort the values in descending order (or sort ascending and use tail instead), take the first n values with head, and get their mean by passing an anonymous function as FUN:
aggregate(DF[,-1], by=list(Year_Month=format(DF$date,"%y-%m")),
FUN = function(x) mean(head(sort(x, decreasing = TRUE), 2)))
Or this can be done with the tidyverse:
library(dplyr)
DF %>%
group_by(Year_Month = format(date, "%y-%m")) %>%
summarise(across(-date, ~ mean(head(sort(.x, decreasing = TRUE), 2))))  # funs() is deprecated in current dplyr
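The question also asks about the three largest values. Here is a minimal sketch, assuming a hypothetical helper top_n_mean() and made-up toy data (neither comes from the original post), that turns the number of values into a parameter and passes it through the ... argument of aggregate():

```r
# Hypothetical helper: mean of the n largest values in x.
# sort() drops NAs by default; if a group has fewer than n values, all are used.
top_n_mean <- function(x, n = 2) mean(head(sort(x, decreasing = TRUE), n))

# Toy data, made up for illustration only.
DF <- data.frame(
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03",
                   "2020-02-01", "2020-02-02")),
  A = c(1, 5, 3, 10, 20),
  B = c(2, 4, 6, 8, 1)
)

month <- list(Year_Month = format(DF$date, "%y-%m"))

# Extra arguments after FUN are forwarded to FUN for every column and group.
top2 <- aggregate(DF[, -1], by = month, FUN = top_n_mean, n = 2)
top3 <- aggregate(DF[, -1], by = month, FUN = top_n_mean, n = 3)
```

Changing n is then the only edit needed to switch between the two-value and three-value averages.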
I'm trying to calculate rolling correlations with a five year window based on daily stock data. My dataframe test consists of 20 columns, with "logRet3" being located in column #17 and "logMarRet3" in #18. I want to calculate the correlation of these two return measures.
What makes it difficult is the fact that I want the rolling correlation to be grouped by my share indicator "PERMNO" in column #1. By that I mean that the rolling correlation "restarts" whenever the time-series data of a particular stock ends.
Through research I came up with the following code, using the dplyr, zoo and magrittr packages:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[,logRet3], x[,logMarRet3]), fill = NA, align = "right"))
However, when I run this code, I get the following error:
Error in x[,logMarRet3]: Incorrect number of dimensions
Being a newbie, I tried adjusting the code by deleting the commas:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[logRet3], x[logMarRet3]), fill = NA, align = "right"))
resulting in the following error (translated to English):
Error in x[logMarRet3]: Only zeros are allowed to be mixed with negative indices
Any help on how to fix these errors or alternative ways of calculating the rolling correlation by group would be greatly appreciated.
EDIT: Thanks to G. Grothendieck for pointing out some flaws in my question. I'm referring to his answer for reproducible input and will keep that in mind for further posts.
There are several problems:
rollapply applies the function to each column separately unless by.column = FALSE is used.
Using test within group_by will not cause test to be subsetted; it will refer to the entire dataset. Use individual column names instead.
The column names in the code in the question must have quotes around them; otherwise R looks for variables with those names that contain the column names.
When posting to SO you need to reduce your problem to a complete reproducible example and post that. I have done it for you this time in the Note at the end.
With reference to the Note, use this code:
library(dplyr)
library(zoo)
mycor <- function(x) cor(x[, 1], x[, 2])
DF %>%
group_by(stock) %>%
mutate(Cor = rollapplyr(cbind(a, b), 4, mycor, by.column = FALSE, fill = NA)) %>%
ungroup()
Or use this code, which uses only zoo; mycor is defined above.
library(zoo)
n <- nrow(DF)
roll <- function(i) rollapplyr(DF[i, c("a", "b")], 4, mycor, by.column = FALSE, fill = NA)
transform(DF, Cor = ave(1:n, stock, FUN = roll))
Note
The input in reproducible form is:
DF <- data.frame(stock = rep(LETTERS[1:2], each = 6), a = 1:6, b = (1:6)^3)
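As a cross-check on the Note's input, the same windowed calculation can be sketched in base R without zoo. This is only an illustration: roll_cor() is a hypothetical helper, not part of any package, and it assumes a right-aligned window of 4 rows as in the code above.

```r
# Hypothetical helper: right-aligned rolling correlation of x and y.
# Positions with fewer than `width` preceding rows are left as NA.
roll_cor <- function(x, y, width) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= width) {
    for (i in width:n) {
      idx <- (i - width + 1):i
      out[i] <- cor(x[idx], y[idx])
    }
  }
  out
}

DF <- data.frame(stock = rep(LETTERS[1:2], each = 6), a = 1:6, b = (1:6)^3)

# Compute within each stock so the window restarts at every group boundary.
pieces <- lapply(split(DF, DF$stock), function(d) {
  d$Cor <- roll_cor(d$a, d$b, width = 4)
  d
})
res <- do.call(rbind, pieces)
```

The first three rows of each stock are NA because no full window is available there, which is the "restart" behaviour the question asks for.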
I'm trying to collapse all rows with a similar date/time into one row with a "count" column. That way I'll get two columns: one for the date/time and one for the count.
I used this statement to round my observations into 15-minute periods:
dat$by15 <- cut(dat$Date_Time, breaks = "15 min")
I tried to use this statement, but it seems to "jump" to a previous dataset and gives me the wrong observations for some reason:
dat <- aggregate(dat, by = list(dat$by15), length )
Thank you guys !
I'm not sure I understood the question, but if you are trying to group by date and count the observations for each date, it's really simple:
library(dplyr)
grouped_dates <- dat %>%
group_by(Date_Time) %>%
summarise(Count = n())
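If the goal is one count per 15-minute bin rather than per exact timestamp, the grouping has to be done on the binned column. A minimal base R sketch, using made-up timestamps and the by15 column name from the question:

```r
# Toy timestamps, made up for illustration only.
dat <- data.frame(
  Date_Time = as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 10:07:00",
                           "2020-01-01 10:16:00", "2020-01-01 10:44:00"),
                         tz = "UTC")
)

# Bin into 15-minute intervals; bins start at the minute of the earliest observation.
dat$by15 <- cut(dat$Date_Time, breaks = "15 min")

# Count the rows in each bin: one column for the bin, one for the count.
counts <- as.data.frame(table(by15 = dat$by15), responseName = "Count")
```

Note that aggregate(dat, by = list(dat$by15), length) applies length to every column of dat, which is why a dedicated count such as table() gives a cleaner two-column result.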
tl;dr
How do I make "partition" from multidplyr split on multiple columns?
Motivation:
I was unhappy with using 1 of 32 cores for a hard-working summarize, so I am trying to use multidplyr. I am grouping on multiple columns.
Example:
The vignette shows grouping by a single column, but when I do that, my other grouping column is not considered.
Code:
library(dplyr)
library(multidplyr)
library(nycflights13)
flights1 <- partition(flights, flight)
flights2 <- summarise(flights1, dep_delay = mean(dep_delay, na.rm = TRUE))
flights3 <- collect(flights2)
So how about splitting on year, month, and day?
This doesn't work for me:
flights1 <- partition(flights, list(year, month, day))
flights2 <- summarise(flights1, dep_delay = mean(dep_delay, na.rm = TRUE))
flights3 <- collect(flights2)
I can't seem to make this work. Can you point to a proper or at least effective way to do this?
According to ?partition, the usage for partition is
partition(.data, ..., cluster = get_default_cluster())
where ... are variables to partition by. Instead of passing in a list of variables, pass in each variable separately, i.e.
partition(flights, year, month, day)
I have a data frame of integer-count observations listed by date and time interval. I want to find the median of these observations by date using the dplyr package. I've already formatted the date column correctly, and used group_by like so:
data.bydate <- group_by(data.raw, date)
When I use summarise() to find the median of each date group, all I'm getting are a bunch of zeroes. There are NA's in the data, so I've been stripping them with na.rm = TRUE.
data.median <- summarise(data.bydate, median = median(count, na.rm = TRUE))
Is there another way I should be doing this?
You can do something like this:
data.raw %>% group_by(date) %>% summarise(median = median(count, na.rm = TRUE))
It's possible that more than half of the values in each group are zero, in which case the median is zero. To check, look at the number of unique values in each group. The code below shows the number of unique values and the total number of rows for the count variable in each group.
summarise(data.bydate, unique_code = n_distinct(count), total_count = n())
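To see how a group median can legitimately come out as zero, here is a small base R sketch on made-up data. The column names date and count follow the question; the values are invented, and base R is used here in place of dplyr only to keep the example dependency-free.

```r
# Toy data, made up for illustration: day 1 is mostly zeros, day 2 is not.
data.raw <- data.frame(
  date  = rep(as.Date(c("2020-01-01", "2020-01-02")), each = 5),
  count = c(0, 0, 0, 4, NA,
            1, 2, 3, 0, 0)
)

# Median per date; the formula interface drops rows with NA by default.
med <- aggregate(count ~ date, data = data.raw, FUN = median)

# Share of non-missing observations equal to zero in each group.
zero_share <- tapply(data.raw$count == 0, data.raw$date, mean, na.rm = TRUE)
```

Whenever zero_share is above 0.5 for a date, the median for that date is zero, so a column of zero medians is not necessarily a bug.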
Here is an example of how I did this using dplyr:
data.median <- data.bydate %>% summarise(median = median(count, na.rm = TRUE))
I consistently need to take transaction data and aggregate it by Day, Week, Month, Quarter, Year - essentially, it's time-series data. I started to apply zoo/xts to my data in hopes I could aggregate the data faster, but I either don't fully understand the packages' purpose or I'm trying to apply it incorrectly.
In general, I would like to calculate the number of orders and the number of products ordered by category, by time period (day, week, month, etc).
#Create the data
clients <- 1:10
dates <- seq(as.Date("2012/1/1"), as.Date("2012/9/1"), "days")
categories <- LETTERS[1:5]
products <- data.frame(numProducts = 1:10,
category = sample(categories, 1000, replace = TRUE),
clientID = sample(clients, 1000, replace = TRUE),
OrderDate = sample(dates, 1000, replace = TRUE))
I could do this with plyr and reshape, but I think this is a round-about way to do so.
library(plyr)     # ddply
library(reshape)  # cast
#Aggregate by date and category
products.day <- ddply(products, .(OrderDate, category), summarize, numOrders = length(numProducts), numProducts = sum(numProducts))
#Aggregate by Month and category
products.month <- ddply(products, .(Month = months(OrderDate), Category = category), summarize, numOrders = length(numProducts), numProducts = sum(numProducts))
#Make a wide-version of the data frame
products.month.wide <- cast(products.month, Month~Category, sum)
I tried to apply zoo to the data like so:
products.TS <- aggregate(products$numProducts, yearmon, mean)
It returned this error:
Error in aggregate.data.frame(as.data.frame(x), ...) :
'by' must be a list
I've read the zoo vignettes and documentation, but every example that I've found only shows 1 record/row/entry per time entry.
Do I have to pre-aggregate the data I want to time-series on? I was hoping that I could simply group by the fields I want, then have the months or quarters get added to the data frame incrementally to the X-axis.
Is there a better approach to aggregating this or a more appropriate package?
products$numProducts is a vector, not a zoo object. You'd need to create a zoo object before you can use method dispatch to call aggregate.zoo.
library(zoo)
pz <- with(products, zoo(numProducts, OrderDate))
products.TS <- aggregate(pz, as.yearmon, mean)
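For the broader goal in the question (number of orders and number of products by category and month), pre-aggregating with base R is one option that needs no time-series class. The sketch below is an alternative to the zoo approach, not a replacement for it; the Month and numOrders columns are helper columns introduced here for illustration, and set.seed is added only to make the sample data repeatable.

```r
set.seed(1)  # only so the toy data is reproducible

# Same sample data as in the question.
clients <- 1:10
dates <- seq(as.Date("2012/1/1"), as.Date("2012/9/1"), "days")
categories <- LETTERS[1:5]
products <- data.frame(numProducts = 1:10,
                       category = sample(categories, 1000, replace = TRUE),
                       clientID = sample(clients, 1000, replace = TRUE),
                       OrderDate = sample(dates, 1000, replace = TRUE))

# Helper columns (assumptions, not in the original data).
products$Month <- format(products$OrderDate, "%Y-%m")
products$numOrders <- 1L  # each row is one order

# Long format: one row per month and category.
by_month <- aggregate(cbind(numOrders, numProducts) ~ Month + category,
                      data = products, FUN = sum)

# Wide format: months as rows, categories as columns.
wide <- xtabs(numProducts ~ Month + category, data = by_month)
```

Swapping the format string (for example "%Y" for years, or "%Y-%U" for weeks) changes the period without touching the rest of the code.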