I'm trying to sum a set of POSIXct objects by a factor variable, but am getting an error that sum is not defined for POSIXt objects. However, it works fine if I just calculate the mean. But how can I get the summed times by group using tapply?
Example:
data <- data.frame(time = c("2:50:04", "1:24:10", "3:10:43", "1:44:26", "2:10:19", "3:01:04"),
group = c("A","A","A","B","B","B"))
data$group <- as.factor(data$group)
data$time <- as.POSIXct(paste("1970-01-01", data$time), format="%Y-%m-%d %H:%M:%S", tz="GMT")
# works
tapply(data$time, data$group, mean)
# doesn't work
tapply(data$time, data$group, sum)
Date-time objects cannot be summed; that does not make sense semantically, and the + operator is not defined between two POSIXct objects. You probably want to model time differences instead and sum those.
Try:
times <- as.difftime(c("2:50:04", "1:24:10", "3:10:43",
"1:44:26", "2:10:19", "3:01:04"), "%H:%M:%S")
sum(times)
A difftime object is also what you get when you subtract two date-time objects (which is semantically reasonable).
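For illustration, subtracting one POSIXct from another yields exactly such a difftime (using the first two times from the question):
t1 <- as.POSIXct("1970-01-01 02:50:04", tz = "GMT")
t2 <- as.POSIXct("1970-01-01 01:24:10", tz = "GMT")
t1 - t2
# Time difference of 1.431667 hours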
EDIT:
An entire solution to the OP's problem in a semantically cleaner way (tapply seems to destroy the structure of the difftime class, so use group_by from the dplyr package instead):
library(dplyr)
times <- as.difftime(c("2:50:04", "1:24:10", "3:10:43",
"1:44:26", "2:10:19", "3:01:04"), format="%H:%M:%S")
data <- data.frame(time = times, group = c("A","A","A","B","B","B"))
summarise(group_by(data, group), sum(time))
This gives the following output:
Source: local data frame [2 x 2]
group sum(time)
1 A 7.415833 hours
2 B 6.930278 hours
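If you still want tapply, one workaround (a sketch based on the difftime data frame built above) is to convert the difftimes to a common numeric unit inside the function, since tapply drops the difftime class anyway:
tapply(data$time, data$group, function(x) sum(as.numeric(x, units = "hours")))
#        A        B
# 7.415833 6.930278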
I know this question has probably been answered in different ways, but I am still struggling with this. I am working with a dataset where the date format for date1 is '2/1/2000', '5/12/2000', '6/30/2015' and the class() is character, and a second column of dates, date2, is in the format '2015-07-06', '2015-08-01', '2017-10-09' where the class() is "POSIXct" "POSIXt".
I am attempting to standardize both columns so I can compute the difference in days between them using something like this
abs(difftime(date1 ,date2 , units = c("days")))
I have tried numerous ways of converting date1 into the same class, using strptime, lubridate, etc. What's the best way to standardize both columns so I can compute the difference in days?
sample data
x <- c('2/1/2000', '5/12/2000', '6/30/2015')
y <- as.POSIXct(c('2015-07-06', '2015-08-01', '2017-10-09'))
code
#make both posixct
x2 <- as.POSIXct(x, format = "%m/%d/%Y")
abs(x2 - y)
# Time differences in days
# [1] 5633.958 5559.000 832.000
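The same differences can be requested explicitly in days with difftime(), matching the form asked about in the question:
abs(difftime(x2, y, units = "days"))
# Time differences in days
# [1] 5633.958 5559.000 832.000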
Similar to the do.call/lapply approach here and the data.table approach here, but both have the setup of:
MainDF with data and startdate/enddate ranges
SubDF with a vector of single dates
Where the users are looking for summaries of all the MainDF ranges that overlap each SubDF date. I have
MainDF with data and a vector of single dates
SubDF with startdate/enddate ranges
And I am looking to append summaries to SubDF for the multiple rows of MainDF data which fall within each SubDF range. Example:
library(lubridate)
MainDF <- data.frame(Dates = seq.Date(from = as.Date("2020-02-12"),
by = "days",
length.out = 10),
DataA = 1:10)
SubDF <- data.frame(DateFrom = as.Date(c("2020-02-13", "2020-02-16", "2020-02-19")),
DateTo = as.Date(c("2020-02-14", "2020-02-17", "2020-02-21")))
SubDF$interval <- interval(SubDF$DateFrom, SubDF$DateTo)
Trying the data.table approach from the second link I figure it should be something like:
MainDF[SubDF, on = .(Dates >= DateFrom, Dates <= DateTo), allow = TRUE][
, .(SummaryStat = max(DataA)), by = .(Dates)]
But it errors with unused arguments for on. On my actual data I got a result by using (the equivalent of) max(MainDF$DataA), but it was 3 repeats of the second value (in my actual data the final row won't run as it doesn't have a value for DateTo). I suspect that using MainDF$ means I'm subverting the grouping.
I suspect I'm close but I'm really struggling to get my head around the data.table mindset for complex use cases. The summary stats I'm looking to do are (for example data):
Mean & Max of DataA
length(which(DataA > 3))
difftime(last(Dates), first(Dates), units = "mins")
Dates[which.max(DataA)]
I added the interval line above because data.table's %between% help suggests one might be able to use a Dates %between% interval construct. However, it doesn't mention intervals/difftimes specifically in the text or examples, and my attempts are already failing elsewhere, so I'm loath to concentrate on improving my running while I can't walk!
I've focused on the data.table approach since it's used for a similar problem, but I've been wondering whether dplyr's group_by/group_by_if could be used instead? group_by_if's .predicate seems to be constrained to tests on the columns (e.g. are they factors) rather than relating to data in the columns' rows, but I could be wrong.
Thanks in advance for any help!
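Not an authoritative answer, but for reference, a minimal sketch of how the non-equi join could look once both data frames are converted to data.tables (names follow the example above; the summary statistics are the ones listed; the lubridate interval column isn't needed for the join):
library(data.table)
SubDF$interval <- NULL                      # drop the lubridate interval; not needed here
setDT(MainDF); setDT(SubDF)
# Non-equi join: for each SubDF range, summarise the MainDF rows falling inside it.
# by = .EACHI groups by each row of SubDF; x.Dates refers to MainDF's Dates column
# (the join renames the key columns to the SubDF bounds).
MainDF[SubDF,
       on = .(Dates >= DateFrom, Dates <= DateTo),
       .(MeanA     = mean(DataA),
         MaxA      = max(DataA),
         NOver3    = sum(DataA > 3),
         Span      = difftime(max(x.Dates), min(x.Dates), units = "mins"),
         DateOfMax = x.Dates[which.max(DataA)]),
       by = .EACHI]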
I've gotten fairly good with the *apply family of functions, and I've recently learned to use the do.call("rbind", by(... as a wrapper for tapply. I'm working with a large data set (Compustat) and I have a function (see below) that generates a new column of lagged variables which I later attach to the main data frame df.
My problem is that it is extremely slow. I create about two dozen lagged variables, and the processing in this function takes approximately 1.5 hours because there are 350,000+ firm-year observations in the data set.
Can anyone help improve the speed of this function without losing the aspects that I find desirable:
#' lag vector of unknown size (for do.call-rbind-by: using datadate to track)
lag.vec <- function(x){
x <- x[order(x$datadate), ] # sort data into ascending by date
var <- x[,2] # the specific variable name in data.frame x hereby ignored
var.name <- paste(names(x)[2], "lag", sep = '.') # keep variable name
if(length(var)>1){ # no lagging if single observation
lagged <- c(NA, var[1:(length(var)-1)])
datelag <- c(x$datadate[1], x$datadate[1:(length(x$datadate) - 1)])
datediff <- x$datadate - datelag
y <- data.frame(x$datadate, datediff, lagged) # join lagged variable and difference in YYYYMMDD data
y$lagged[y$datediff >= 20000 & !is.na(y$datediff)] <- NA # 2 or more full years difference
y <- y[, c('x.datadate', 'lagged')]
names(y) <- c("datadate", var.name)
} else { y <- c(x$datadate[1], NA); names(y) <- c("datadate", var.name) }
return(y)
}
I then call this function in a command separately for each variable that I want to generate a lagged series for (here I use the ni variable as an example):
ni_lag <- do.call('rbind', by(df[ , c('datadate', 'ni')], df$gvkey, lag.vec))
where gvkey is the ID number for the particular firm and datadate is an 8-digit integer of the form YYYYMMDD.
The approach was much faster when I used a simpler function:
lag.vec.seq <- function(x){#' lag vector when all data points are present, in order
if(length(x)>1){
y <- c(NA, x[1:(length(x)-1)])
} else {y <- NA}
return(y)
}
along with the tapply command in something like
ni_lag <- as.vector(unlist(tapply(df$ni, df$gvkey, lag.vec.seq)))
As you can see the main difference is that the tapply approach doesn't include any datadate information and so the function assumes that all data are sequential (i.e., there are no missing years in the dataframe). Since I know there are missing years, I built the do.call-by function to account for that.
Some notes:
1) The first order command in the function is probably unnecessary since my data is ordered by gvkey and datadate in advance (e.g. df <- df[order(df$gvkey, df$datadate), ]). However, I'm always a bit afraid that R will mess up my row ordering when I use functional programming like this. Is that an unfounded fear?
2) Identifying what is slowing down the processing would be very helpful. Is it the renaming of variables? The creation of a new data frame in the function? Or is the do.call with by just typically (much) slower than tapply?
Thank you!
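For comparison only (not from the original post): a common way to avoid building a data.frame per group is a grouped shift(), for example with data.table. A minimal sketch, assuming the column names from the post (gvkey, datadate, ni) and datadate stored as a YYYYMMDD integer:
library(data.table)
dt <- as.data.table(df)
setorder(dt, gvkey, datadate)                       # sort once, by firm and date
dt[, ni.lag   := shift(ni), by = gvkey]             # previous observation within each firm
dt[, datediff := datadate - shift(datadate), by = gvkey]
dt[datediff >= 20000, ni.lag := NA]                 # 2+ full years gap in YYYYMMDD terms: blank the lag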
I am trying to understand why R behaves differently with the "aggregate" function. I wanted to average 15-minute data to hourly data. For this, I passed the 15-minute data together with a pre-built "hour" array (the same date repeated four times per hour, derived from the original POSIXct array) to the aggregate function.
After some time, I realized that the function was behaving oddly (well, probably the data was odd, but why?) when handing over the date array with
strftime(data.15min$posix, format="%Y-%m-%d %H")
However, if I handed over the data with
cut(data.15min$posix, "1 hour")
the data was averaged correctly.
Below, a minimal example is embedded, including a sample of the data.
I would be happy to understand what I did wrong.
Thanks in advance!
d <- 3
bla <- read.table("test_daten.dat",header=TRUE,sep=",")
data.15min <- NULL
data.15min$posix <- as.POSIXct(bla$dates,tz="UTC")
data.15min$o3 <- bla$o3
hourtimes <- unique(as.POSIXct(paste(strftime(data.15min$posix, format="%Y-%m-%d %H"),":00:00",sep=""),tz="Universal"))
agg.mean <- function (xx, yy, rm.na = T)
# xx: parameter that determines the aggregation: list(xx), e.g. hour etc.
# yy: parameter that will be aggregated
{
aa <- yy
out.mean <- aggregate(aa, list(xx), FUN = mean, na.rm=rm.na)
out.mean <- out.mean[,2]
}
#############
data.o3.hour.mean <- round(agg.mean(strftime(data.15min$posix, format="%m/%d/%y %H"), data.15min$o3), d); data.o3.hour.mean[1:100]
win.graph(10,5)
par(mar=c(5,15,4,2), new =T)
plot(data.15min$posix,data.15min$o3,col=3,type="l",ylim=c(10,60)) # original data
par(mar=c(5,15,4,2), new =T)
plot(data.date.hour_mean,data.o3.hour.mean,col=5,type="l",ylim=c(10,60)) # Wrong
##############
data.o3.hour.mean <- round(agg.mean(cut(data.15min$posix, "1 hour"), data.15min$o3), d); data.o3.hour.mean[1:100]
win.graph(10,5)
par(mar=c(5,15,4,2), new =T)
plot(data.15min$posix,data.15min$o3,col=3,type="l",ylim=c(10,60)) # original data
par(mar=c(5,15,4,2), new =T)
plot(data.date.hour_mean,data.o3.hour.mean,col=5,type="l",ylim=c(10,60)) # Correct
Too long for a comment.
The reason your results look different is that aggregate(...) sorts the results by your grouping variable(s). In the first case,
strftime(data.15min$posix, format="%m/%d/%y %H")
is a character vector with poorly formatted dates (they do not sort properly). So the first row corresponds to the "date" "01/01/96 00".
In your second case,
cut(data.15min$posix, "1 hour")
generates actual POSIXct dates, which sort properly. So the first row corresponds to the date: 1995-11-04 13:00:00.
If you had used
strftime(data.15min$posix, format="%Y-%m-%d %H")
in your first case, you would have gotten the same result as with cut(...).
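A minimal illustration of the sorting issue (timestamps chosen to match the dates mentioned above): character dates in %m/%d/%y form do not sort chronologically, whereas %Y-%m-%d dates do:
d <- as.POSIXct(c("1995-11-04 13:00", "1996-01-01 00:00"), tz = "UTC")
sort(strftime(d, format = "%m/%d/%y %H", tz = "UTC"))  # "01/01/96 00" "11/04/95 13"   -- wrong order
sort(strftime(d, format = "%Y-%m-%d %H", tz = "UTC"))  # "1995-11-04 13" "1996-01-01 00" -- chronological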
I have some data in the following format:
date x
2001/06 9949
2001/07 8554
2001/08 6954
2001/09 7568
2001/10 11238
2001/11 11969
... more rows
I want to compute the mean of x for each month. I tried some code with aggregate, but failed. Thanks for any help on doing this.
Here I simulate a data frame called df with more data:
df <- data.frame(
date = apply(expand.grid(2001:2012,1:12),1,paste,collapse="/"),
x = rnorm(12^2,1000,1000),
stringsAsFactors=FALSE)
Given the way your date vector is constructed, you can obtain the month by removing the first four digits followed by a forward slash. Here I use this as the indexing variable in tapply to compute the means:
with(df, tapply(x, gsub("\\d{4}/","",date), mean))
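Since the question mentions a failed attempt with aggregate, note that the same grouping works there too (a sketch using the simulated df above, with the month string added as a helper column):
df$month <- gsub("\\d{4}/", "", df$date)   # strip the leading "YYYY/"
aggregate(x ~ month, data = df, FUN = mean)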
Sorry... I just created a month-sequence vector and then used tapply. It was very easy:
m.seq = rep(c(6:12, 1:5), length = nrow(data))
m.means = tapply(data$x, m.seq, mean)
But thanks for the comments anyway!