Say I have the following XTS object containing Bid and Ask data:
Time Bid Ask
00:01 10 11
00:02 10 11
00:03 11 12
00:04 12 13
00:05 10 11
00:06 10 11
00:07 10 11
00:08 9 12
00:09 2 10
00:10 4 5
I would like to get the following output:
Time Bid Ask
00:01 10 11
00:03 11 12
00:04 12 13
00:05 10 11
00:08 9 12
00:09 2 10
00:10 4 5
Rows should only be removed if they are the same as the previous entry. If there is a Bid or Ask change, the row is not removed, so simply taking out duplicates will not work.
This should be fairly simple as I have done this before, but I just cannot remember how and cannot find it.
Update:
I added some extra entries in my initial data and expected output.
Joshua's answer is brilliant, but it depends on a function such as rowSums giving different results for different rows, and it breaks with 9 12.
I used the rowProds function from the matrixStats package, which works for that case but obviously fails for my last 2 lines. Also, in my sample the values in column one are smaller than in column two. While that makes sense, it is not a must, so the function should also work if the Ask was less than the Bid, where both rowSums and rowProds would fail.
Is there a better row function that would always give a different result if anything is different, such as maybe a row hash?
You can do this by using rle() on the sum of each row.
x <- structure(
c(10L, 10L, 11L, 12L, 10L, 10L, 10L, 11L, 11L, 12L, 13L, 11L, 11L, 11L),
.Dim = c(7L, 2L), .Dimnames = list(NULL, c("Bid", "Ask")),
index = structure(1:7, tzone = "", tclass = c("POSIXct", "POSIXt")),
.indexCLASS = c("POSIXct", "POSIXt"), .indexTZ = "",
tclass = c("POSIXct", "POSIXt"), tzone = "", class = c("xts", "zoo"))
r <- rle(rowSums(x))
If you want the last observation in each group, you can just use cumsum(r$lengths) as the row index when subsetting.
R> x[cumsum(r$lengths),]
Bid Ask
1969-12-31 18:00:02 10 11
1969-12-31 18:00:03 11 12
1969-12-31 18:00:04 12 13
1969-12-31 18:00:07 10 11
Since you want the first observation of each group, you need to prepend the r$lengths vector with a 1 (you always want the first observation) and then remove the last element of r$lengths. Then call cumsum() on the result.
R> x[cumsum(c(1, head(r$lengths, -1))),]
Bid Ask
1969-12-31 18:00:01 10 11
1969-12-31 18:00:03 11 12
1969-12-31 18:00:04 12 13
1969-12-31 18:00:05 10 11
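To see how the two index vectors come out of rle(), here is a small base R sketch. It assumes the seven row sums of the sample object above, typed in by hand, and the names first_idx and last_idx are only for illustration.

```r
# Row sums of the 7-row sample (Bid + Ask), typed in by hand
s <- c(21, 21, 23, 25, 21, 21, 21)

r <- rle(s)          # run lengths are 2, 1, 1, 3

# Last row of each run
last_idx <- cumsum(r$lengths)

# First row of each run: shift the lengths by one and start at 1
first_idx <- cumsum(c(1, head(r$lengths, -1)))

last_idx
first_idx
```

Here last_idx picks rows 2, 3, 4, 7 and first_idx picks rows 1, 3, 4, 5, which matches the two outputs shown above.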
Good catch on the limitation of rowSums(). A robust solution is to diff() the bids and asks and select the rows where either is not zero.
d <- diff(x) != 0 # rows with price changes
d[1,] <- TRUE # always select first observation
g <- cumsum(d$Bid | d$Ask) # groups of repeats
r <- rle(as.numeric(g)) # run length encoding on groups
# now use the solution above
x[cumsum(c(1, head(r$lengths, -1))),]
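The same "compare with the previous row" idea can also be sketched on a plain matrix, without rle() at all. This is only an illustration in base R, assuming the updated ten-row data from the question; for an xts object you would presumably apply it to coredata(x) and subset x with the resulting logical vector.

```r
# Bid/Ask values from the updated question, as a plain matrix
m <- cbind(Bid = c(10, 10, 11, 12, 10, 10, 10, 9, 2, 4),
           Ask = c(11, 11, 12, 13, 11, 11, 11, 12, 10, 5))

# Compare each row (from the second on) with the row before it;
# a row has changed if at least one column differs
changed <- rowSums(m[-1, , drop = FALSE] != m[-nrow(m), , drop = FALSE]) > 0

# Always keep the first row
keep <- c(TRUE, changed)

m[keep, ]
```

Because every column is compared directly, this does not depend on sums or products being unique, and it keeps rows 1, 3, 4, 5, 8, 9 and 10.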
I have data that looks like
ID CLM_ID Date1 Date2
1 718182 1/1/2014 1/17/2014
1 718184 1/2/2014 1/8/2014
1 885236 1/15/2014 1/17/2014
1 885362 3/20/2014 3/21/2014
2 589963 3/18/2015 3/22/2015
2 589999 2/27/2015 5/9/2015
2 594226 4/11/2015 4/17/2015
2 689959 5/10/2015 6/10/2015
3 656696 5/1/2016 5/5/2016
3 669625 5/6/2016 5/22/2016
4 777777 2/21/2015 3/4/2015
4 778952 2/1/2015 2/28/2015
4 778965 3/1/2015 3/22/2015
I am working on two different problems with this. The first one was answered in a previous post about how to roll dates up (Date roll-up in R) and the second now is that I have intervals that are within intervals and I am trying to get rid of them. So the final outcome should look like
ID CLM_ID Date1 Date2
1 718182 1/1/2014 1/17/2014
1 885362 3/20/2014 3/21/2014
2 589999 2/27/2015 5/9/2015
3 656696 5/1/2016 5/22/2016
4 778952 2/1/2015 3/22/2015
Now I know I will have to create the extended intervals via the date rollup first, but then how do I get rid of these sub-intervals (a term I am making up for intervals within intervals)? I am also looking for a solution that is efficient since I actually have 75,000 records to go through (i.e. I am trying to avoid iterative solutions).
Using non-equi joins from the current development version of data.table, v1.9.7,
require(data.table) # v1.9.7+
dt[dt, .(CLM_IDs = CLM_IDs[.N==1L]), on=.(ID, Date1<=Date1, Date2>=Date2), by=.EACHI]
# ID Date1 Date2 CLM_ID
# 1: 1 2014-01-01 2014-01-17 718182
# 2: 1 2014-03-20 2014-03-21 885362
# 3: 2 2015-02-27 2015-05-09 589999
# 4: 2 2015-05-10 2015-06-10 689959
# 5: 3 2016-05-01 2016-05-05 656696
# 6: 3 2016-05-06 2016-05-22 669625
# 7: 4 2015-02-21 2015-03-04 777777
# 8: 4 2015-02-01 2015-02-28 778952
# 9: 4 2015-03-01 2015-03-22 778965
What this does is, for each row in dt (the one inside of square bracket), it looks up which rows match in dt (on the outside) based on the condition provided to the on argument.
The matching row indices are returned iff the only match is a self-match (since the condition includes equality as well). This is done by CLM_IDs[.N == 1L], where .N holds the number of observations for each group.
"I am also looking for a solution that is efficient ... (i.e. I am trying to avoid iterative solutions)."
"Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in." - Isaac Asimov
Below is a super fast base R iterative solution. It returns the correct results for very large data frames virtually instantly. It also "rolls up" the data, so there is no need to carry out two algorithms:
MakeDFSubInt <- function(df, includeCost = FALSE, printTime = FALSE) {
## Sorting the data frame to allow for fast
## creation of the "Contained" logical vector below
tempDF <- df[order(df$ID, df$Date1, df$Date2), ]
UniIDs <- unique(tempDF$ID)
Len <- length(UniIDs)
## Determine starting (i.e. "s") and ending (i.e. "e")
## points of the respective groups of IDs
e <- which(diff(tempDF$ID)==1)
s <- c(1L, e + 1L)
dfLen <- nrow(tempDF)
e <- c(e, dfLen)
## Converting dates to integers so that comparison
## will be faster. Internally dates are stored as
## integers, so this isn't a problem
dte1 <- as.integer(tempDF$Date1)
dte2 <- as.integer(tempDF$Date2)
## Building logical vector in order to quickly create sub-intervals
Contained <- rep(FALSE, dfLen)
BegTime <- Sys.time() ## Included to measure time of for loop execution
for (j in 1:Len) {
Compare <- ifelse(dte2[s[j]] >= (dte1[s[j]+1L]+1L), max(dte2[s[j]], dte2[s[j]+1L]), dte2[s[j]+1L])
for (x in (s[j]+1L):e[j]) {
if (!Contained[x-1L]) {
Contained[x] <- dte2[x-1L] >= (dte1[x]-1L)
} else {
Contained[x] <- Compare >= (dte1[x]-1L)
}
## could use ifelse, but this construct is faster
if (Contained[x]) {
Compare <- max(Compare, dte2[x])
} else {
Compare <- dte2[x]
}
}
}
EndTime <- Sys.time()
TotTime <- EndTime - BegTime
if (printTime) {print(paste(c("for loop execution time was: ", format(TotTime)), collapse = ""))}
## identify sub-intervals
nGrps <- which(!Contained)
## Create New fields for our new DF
ID <- tempDF$ID[nGrps]
CLM_ID <- tempDF$CLM_ID[nGrps]
Date1 <- tempDF$Date1[nGrps]
nGrps <- c(nGrps, dfLen+1L)
## as.Date is converting numbers to dates.
## N.B. This only works if origin is supplied
Date2 <- as.Date(vapply(1L:(length(nGrps) - 1L), function(x) {
max(dte2[nGrps[x]:(nGrps[x+1L]-1L)])}, 1L), origin = "1970-01-01")
## in a related question the OP had, "Cost" was
## included to show how the algorithm would handle
## generic summary information
if (includeCost) {
myCost <- tempDF$Cost
Cost <- vapply(1L:(length(nGrps) - 1L), function(x) sum(myCost[nGrps[x]:(nGrps[x+1L]-1L)]), 100.01)
NewDf <- data.frame(ID,CLM_ID,Date1,Date2,Cost)
} else {
NewDf <- data.frame(ID,CLM_ID,Date1,Date2)
}
NewDf
}
For the example given in the question, we have:
ID <- c(rep(1,4),rep(2,4),rep(3,2),rep(4,3))
CLM_ID <- c(718182, 718184, 885236, 885362, 589963, 589999, 594226, 689959, 656696, 669625, 777777, 778952, 778965)
Date1 <- c("1/1/2014","1/2/2014","1/15/2014","3/20/2014","3/18/2015","2/27/2015","4/11/2015","5/10/2015","5/1/2016","5/6/2016","2/21/2015","2/1/2015","3/1/2015")
Date2 <- c("1/17/2014","1/8/2014","1/17/2014","3/21/2014","3/22/2015","5/9/2015","4/17/2015","6/10/2015","5/5/2016","5/22/2016","3/4/2015","2/28/2015","3/22/2015")
myDF <- data.frame(ID, CLM_ID, Date1, Date2)
myDF$Date1 <- as.Date(myDF$Date1, format = "%m/%d/%Y")
myDF$Date2 <- as.Date(myDF$Date2, format = "%m/%d/%Y")
MakeDFSubInt(myDF)
ID CLM_ID Date1 Date2
1 1 718182 2014-01-01 2014-01-17
2 1 885362 2014-03-20 2014-03-21
3 2 589999 2015-02-27 2015-06-10
4 3 656696 2016-05-01 2016-05-22
5 4 778952 2015-02-01 2015-03-22
From a similar question the OP posted, we can add a Cost field, to show how we would proceed with calculations for this setup.
set.seed(7777)
myDF$Cost <- round(rnorm(13, 450, sd = 100),2)
MakeDFSubInt(myDF, includeCost = TRUE)
ID CLM_ID Date1 Date2 Cost
1 1 718182 2014-01-01 2014-01-17 1164.66
2 1 885362 2014-03-20 2014-03-21 568.16
3 2 589999 2015-02-27 2015-06-10 2019.16
4 3 656696 2016-05-01 2016-05-22 990.14
5 4 778952 2015-02-01 2015-03-22 1578.68
This algorithm scales very well. For data frames of the size the OP is working with, the requested DF is returned almost instantaneously, and for very large data frames it is returned in just seconds.
First we build a function that will generate a random data frame with n rows.
MakeRandomDF <- function(n) {
set.seed(109)
CLM_Size <- ifelse(n < 10^6, 10^6, 10^(ceiling(log10(n))))
numYears <- trunc((6/425000)*n + 5)
StrtYear <- ifelse(numYears > 16, 2000, 2016 - numYears)
numYears <- ifelse(numYears > 16, 16, numYears)
IDs <- sort(sample(trunc(n/100), n, replace = TRUE))
CLM_IDs <- sample(CLM_Size, n)
StrtDate <- as.Date(paste(c(as.character(StrtYear),"-01-01"), collapse = ""))
myPossibleDates <- StrtDate+(0:(numYears*365)) ## "numYears" years of data
Date1 <- sample(myPossibleDates, n, replace = TRUE)
Date2 <- Date1 + sample(1:100, n, replace = TRUE)
Cost <- round(rnorm(n, 850, 100), 2)
tempDF <- data.frame(IDs,CLM_IDs,Date1,Date2,Cost)
tempDF$Date1 <- as.Date(tempDF$Date1, format = "%m/%d/%Y")
tempDF$Date2 <- as.Date(tempDF$Date2, format = "%m/%d/%Y")
tempDF
}
For moderate size DFs (i.e. 75,000 rows)
TestDF <- MakeRandomDF(75000)
system.time(test1 <- MakeDFSubInt(TestDF, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.06500006 secs"
user system elapsed
0.14 0.00 0.14
nrow(test1)
[1] 7618
head(test1)
ID CLM_ID Date1 Date2 Cost
1 1 116944 2010-01-29 2010-01-30 799.90 ## The range of dates for
2 1 515993 2010-02-15 2011-10-12 20836.83 ## each row are disjoint
3 1 408037 2011-12-13 2013-07-21 28149.26 ## as requested by the OP
4 1 20591 2013-07-25 2014-03-11 10449.51
5 1 338609 2014-04-24 2014-07-31 4219.48
6 1 628983 2014-08-03 2014-09-11 2170.93
For very large DFs (i.e. > 500,000 rows)
TestDF2 <- MakeRandomDF(500000)
system.time(test2 <- MakeDFSubInt(TestDF2, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.3679998 secs"
user system elapsed
1.19 0.03 1.21
nrow(test2)
[1] 154839
head(test2)
ID CLM_ID Date1 Date2 Cost
1 1 71251 2004-04-19 2004-06-29 2715.69 ## The range of dates for
2 1 601676 2004-07-05 2004-09-23 2675.04 ## each row are disjoint
3 1 794409 2004-12-28 2005-04-05 1760.63 ## as requested by the OP
4 1 424671 2005-06-03 2005-08-20 1973.67
5 1 390353 2005-09-16 2005-11-06 785.81
6 1 496611 2005-11-21 2005-11-24 904.09
TestDF3 <- MakeRandomDF(1000000)
system.time(test3 <- MakeDFSubInt(TestDF3, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.6930001 secs"
user system elapsed
2.68 0.08 2.79 ## 1 million rows in under 3 seconds!!!
nrow(test3)
[1] 413668
Explanation
The main part of the algorithm is generating the Contained logical vector that is used to determine the sub-intervals of continuous dates. Generation of this vector relies on the fact that the data frame is sorted, first by ID, second by Date1, and finally by Date2. We begin by locating the starting and ending rows of each group of IDs. For example, with the example provided by the OP we have:
myDF
ID CLM_ID Date1 Date2
1 1 718182 2014-01-01 2014-01-17 ## <- 1 s[1]
2 1 718184 2014-01-02 2014-01-08
3 1 885236 2014-01-15 2014-01-17
4 1 885362 2014-03-20 2014-03-21 ## <- 4 e[1]
5 2 589963 2015-03-18 2015-03-22 ## <- 5 s[2]
6 2 589999 2015-02-27 2015-05-09
7 2 594226 2015-04-11 2015-04-17
8 2 689959 2015-05-10 2015-06-10 ## <- 8 e[2]
9 3 656696 2016-05-01 2016-05-05 ## <- 9 s[3]
10 3 669625 2016-05-06 2016-05-22 ## <- 10 e[3]
11 4 777777 2015-02-21 2015-03-04 ## <- 11 s[4]
12 4 778952 2015-02-01 2015-02-28
13 4 778965 2015-03-01 2015-03-22 ## <- 13 e[4]
Below is the code that generates s and e.
## Determine starting (i.e. "s") and ending (i.e. "e")
## points of the respective groups of IDs
e <- which(diff(tempDF$ID)==1)
s <- c(1L, e + 1L)
dfLen <- nrow(tempDF)
e <- c(e, dfLen)
s
1 5 9 11
e
4 8 10 13
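Note that the test diff(tempDF$ID)==1 only finds group boundaries when consecutive IDs differ by exactly one. A minimal sketch, assuming the IDs are sorted but not necessarily consecutive (the ids vector below is made up for illustration), is to test for any change instead:

```r
# Hypothetical sorted IDs with gaps (1, 2, 5, 9)
ids <- c(1, 1, 1, 1, 2, 2, 2, 2, 5, 5, 9, 9, 9)

# A group ends wherever the ID changes at all
e <- which(diff(ids) != 0)
s <- c(1L, e + 1L)        # each group starts right after the previous end
e <- c(e, length(ids))    # the last group ends at the last row

s
e
```

This gives the same s and e as above (1 5 9 11 and 4 8 10 13) even though the IDs jump from 2 to 5 to 9.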
Now, we loop over each group and begin populating the logical vector Contained. If the date range for a particular row overlaps (or is a continuance of) the date range above it, we set that particular index of Contained to TRUE. This is why the first row in each group is set to FALSE, since there is nothing above to compare it to. As we are doing this, we are updating the largest date to compare against moving forward, hence the Compare variable. It should be noted that it isn't necessarily true that Date2[n] < Date2[n+1L], which is why we use Compare <- max(Compare, dte2[x]) for a succession of TRUEs. The result for our example is given below.
ID CLM_ID Date1 Date2 Contained
1 1 718182 2014-01-01 2014-01-17 FALSE
2 1 718184 2014-01-02 2014-01-08 TRUE ## These two rows are contained
3 1 885236 2014-01-15 2014-01-17 TRUE ## in the date range 1/1 - 1/17
4 1 885362 2014-03-20 2014-03-21 FALSE ## This row isn't
6 2 589999 2015-02-27 2015-05-09 FALSE
5 2 589963 2015-03-18 2015-03-22 TRUE
7 2 594226 2015-04-11 2015-04-17 TRUE
8 2 689959 2015-05-10 2015-06-10 TRUE ## N.B. 5/10 is a continuance of 5/09
9 3 656696 2016-05-01 2016-05-05 FALSE
10 3 669625 2016-05-06 2016-05-22 TRUE
12 4 778952 2015-02-01 2015-02-28 FALSE
11 4 777777 2015-02-21 2015-03-04 TRUE
13 4 778965 2015-03-01 2015-03-22 TRUE
Now we can easily identify the "starting" rows by identifying all rows with a corresponding FALSE. After this, finding summary information is a breeze by simply calculating whatever you are interested in (e.g. max(Date2), sum(Cost)) over each succession of TRUEs and Voila!!
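For a single ID, the Contained logic can be sketched in a vectorised way with cummax(). This is not the function above, just a small illustration of the same idea, assuming the rows are already sorted by Date1 and using the four rows of ID 1 from the example.

```r
# Intervals for ID 1, already sorted by start date
start <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-15", "2014-03-20"))
end   <- as.Date(c("2014-01-17", "2014-01-08", "2014-01-17", "2014-03-21"))

s <- as.integer(start)
e <- as.integer(end)

# Largest end date seen so far (plays the role of Compare)
run_max <- cummax(e)

# A row is contained if it starts no later than one day after
# the largest end date of the rows above it
contained <- c(FALSE, s[-1] - 1L <= run_max[-length(e)])

# Each FALSE opens a new merged interval
grp <- cumsum(!contained)
merged <- data.frame(
  Date1 = start[!contained],
  Date2 = as.Date(as.vector(tapply(e, grp, max)), origin = "1970-01-01")
)
merged
```

The contained vector comes out as FALSE TRUE TRUE FALSE, and merged holds the two intervals 2014-01-01 to 2014-01-17 and 2014-03-20 to 2014-03-21, in line with the table above.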
Here is a not-so-pretty solution comparing each row with the dates of all other rows. I corrected the one year 3015 to 2015. The results are different from what you are expecting, though. Either I misunderstood your question, or you misread the data.
Data:
dta <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
CLM_ID = c(718182L, 718184L, 885236L, 885362L, 589963L, 589999L, 594226L, 689959L, 656696L, 669625L, 777777L, 778952L, 778965L),
Date1 = structure(c(1L, 3L, 2L, 9L, 8L, 6L, 10L, 12L, 11L, 13L, 5L, 4L, 7L), .Label = c("1/1/2014", "1/15/2014", "1/2/2014", "2/1/2015", "2/21/2015", "2/27/2015", "3/1/2015", "3/18/2015", "3/20/2014", "4/11/2015", "5/1/2016", "5/10/2015", "5/6/2016"), class = "factor"),
Date2 = structure(c(1L, 2L, 1L, 4L, 5L, 10L, 7L, 11L, 9L, 8L, 6L, 3L, 5L), .Label = c("1/17/2014", "1/8/2014", "2/28/2015", "3/21/2014", "3/22/2015", "3/4/2015", "4/17/2015", "5/22/2016", "5/5/2016", "5/9/2015", "6/10/2015"), class = "factor")),
.Names = c("ID", "CLM_ID", "Date1", "Date2"), class = "data.frame",
row.names = c(NA, -13L))
Code:
dta$Date1 <- as.Date(dta$Date1, format = "%m/%d/%Y")
dta$Date2 <- as.Date(dta$Date2, format = "%m/%d/%Y")
# Boolean vector to memorize results
keep <- logical(length = nrow(dta))
for(i in 1:nrow(dta)) {
match <- dta[dta$Date1 <= dta$Date1[i] & dta$Date2 >= dta$Date2[i], ]
if(nrow(match) == 1) keep[i] <- TRUE
}
# Result
dta[keep, ]
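A variant of the same comparison that also requires the ID to match might look like the sketch below. It assumes intervals should only be compared within the same ID, as the grouping in the question suggests, and rebuilds the data with as.Date directly; the names n_cover and kept are only for illustration.

```r
dta <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4),
  CLM_ID = c(718182, 718184, 885236, 885362, 589963, 589999, 594226,
             689959, 656696, 669625, 777777, 778952, 778965),
  Date1 = as.Date(c("1/1/2014", "1/2/2014", "1/15/2014", "3/20/2014",
                    "3/18/2015", "2/27/2015", "4/11/2015", "5/10/2015",
                    "5/1/2016", "5/6/2016", "2/21/2015", "2/1/2015",
                    "3/1/2015"), format = "%m/%d/%Y"),
  Date2 = as.Date(c("1/17/2014", "1/8/2014", "1/17/2014", "3/21/2014",
                    "3/22/2015", "5/9/2015", "4/17/2015", "6/10/2015",
                    "5/5/2016", "5/22/2016", "3/4/2015", "2/28/2015",
                    "3/22/2015"), format = "%m/%d/%Y")
)

# For each row, count the rows with the same ID whose interval covers it
# (every row covers itself, so the minimum is 1)
n_cover <- vapply(seq_len(nrow(dta)), function(i) {
  sum(dta$ID == dta$ID[i] &
      dta$Date1 <= dta$Date1[i] &
      dta$Date2 >= dta$Date2[i])
}, integer(1))

kept <- dta[n_cover == 1L, ]
kept
```

This only drops intervals that lie completely inside another interval of the same ID; it does not merge overlapping or adjacent intervals, so a roll-up step is still needed to reach the output in the question.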
I wanted to plot a graph of n (y-axis) versus date (x-axis) in R, but because of the format of the dates in my data, they are not in correct ascending order. How can I solve this? I appreciate the help.
hybrid <- readWorksheetFromFile(excel.file, sheet="ResultSet", header=TRUE)
wb <- loadWorkbook(excel.file)
setMissingValue(wb,value=c("NA"))
hybrid1 <- readWorksheet(wb, sheet="ResultSet", header=TRUE)
I used dplyr functions. Suppose each Pub.Number has a unique code and I replaced it with one. Then I count how many there are for each date.
hybrid <- mutate(hybrid1, n=sum(Publication.Number=1))
p1 <- select(hybrid1, Publication.Date, n)
pt <- count(p1, Publication.Date, wt=n)
The output look like this:
pt
Source: local data frame [627 x 2]
Publication.Date n
(chr) (dbl)
1 01.01.2013 1
2 01.01.2014 8
3 01.01.2015 10
4 01.02.2012 3
5 01.03.2012 16
6 01.04.2015 2
7 01.05.2012 1
8 01.05.2013 7
9 01.05.2014 23
10 01.06.2011 1
.. ... ...
Then, I plotted it but R recognized Pub.Date as character
qplot(x=Publication.Date, y=n, data=pt, geom="point")
x <- hybrid1[,2]
class(x)
[1] "character"
The graph I've plotted is a mess because of the wrong order of the date
I tried using the as.Date function but it seems that it's not complete (I'm using R version 3.2.2)
> pt[,1] <- as.Date(pt[,1], format='%d.%m.%Y’)
+
First convert 'Publication.Date' to Date format, then order:
using your data:
data <- read.table(pipe('pbpaste'),sep='',header=T,stringsAsFactors = F)
data <- data[,-1]
names(data) <- c('Pub.Date', 'n')
Pub.Date n
1 01.01.2014 8
2 01.01.2015 10
3 01.02.2012 3
4 01.03.2012 16
5 01.04.2015 2
6 01.05.2012 1
7 01.05.2013 7
8 01.05.2014 23
9 01.06.2011 1
convert 'Pub.Date' to date format:
data[,1] <- as.Date(data[,1],format='%d.%m.%Y')
and order:
data[order(data$"Pub.Date",data$n), ]
Pub.Date n
9 2011-06-01 1
3 2012-02-01 3
4 2012-03-01 16
6 2012-05-01 1
7 2013-05-01 7
1 2014-01-01 8
8 2014-05-01 23
2 2015-01-01 10
5 2015-04-01 2
In the usual course of data input with R, values like " 01.01.2013" will become factor variables. Since they are not in one of the two "standard" Date formats, YYYY/MM/DD or YYYY-MM-DD, they cannot be input directly as "Date"s with "colClasses" unless you build an "as.DT" method. You will need to make sure they are character vectors, either by using stringsAsFactors=FALSE in a read function or by coercing to character with as.character after they are input. That header you have displayed makes me think this data has been manipulated somehow, perhaps with functions in the dplyr package?
res <- structure(list(Publication.Date = structure(1:10, .Label = c("01.01.2013",
"01.01.2014", "01.01.2015", "01.02.2012", "01.03.2012", "01.04.2015",
"01.05.2012", "01.05.2013", "01.05.2014", "01.06.2011"), class = "factor"),
n = c(1L, 8L, 10L, 3L, 16L, 2L, 1L, 7L, 23L, 1L)), .Names = c("Publication.Date",
"n"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10"))
> res
Publication.Date n
1 01.01.2013 1
2 01.01.2014 8
3 01.01.2015 10
4 01.02.2012 3
5 01.03.2012 16
6 01.04.2015 2
7 01.05.2012 1
8 01.05.2013 7
9 01.05.2014 23
10 01.06.2011 1
> res$Publication.Date <- as.Date( as.character(res$Publication.Date), format="%d.%m.%Y")
Then you can plot:
png(); qplot(x=Publication.Date, y=n, data=res, geom="point"); dev.off()
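The format string matters here. Below is a small check, assuming the strings are day.month.year as in the question's own as.Date call; the three sample values are made up so that the day goes above 12.

```r
# Made-up values in day.month.year form, stored as a factor
x <- factor(c("01.01.2013", "15.03.2012", "31.01.2013"))

# Day first: every value parses
d <- as.Date(as.character(x), format = "%d.%m.%Y")

# Month first: "15" and "31" are not valid months, so these become NA
wrong <- as.Date(as.character(x), format = "%m.%d.%Y")

d
wrong

# Sorting the Date values gives chronological order,
# which differs from the alphabetical order of the strings
x[order(d)]
```

Checking for NA values right after the conversion is a cheap way to catch a day/month mix-up before plotting.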
I have Date (column B) and Total (column A) variables - how can I create a new variable in R that sums the previous seven days' worth of Totals?
In Excel, I have the following formula:
=SUMIFS($A:$A,$B:$B, ">="&$B20-7,$B:$B,"<"&$B20)
and I just don't know how to convert this to work in R. Suggestions?
This will do it too, advanced, but short - essentially a one-liner.
# Initialize some data
date <- seq(as.Date("2001-01-01"),as.Date("2001-01-31"),"days")
tot <- trunc(rnorm(31,100,20))
df <- data.frame(date,tot)
# Now compute week sum by summing a subsetted df for each date
df$wktot <- sapply(df$date,function(x)sum(df[difftime(df$date,x,,"days") %in% 0:-6,]$tot))
Changed the variable names to match the posed problem.
It also handles the data in any order and multiple entries per day.
Edited to add comments and make it fit in a window.
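Note that the one-liner above sums the current day together with the six days before it. If you want exactly what the Excel formula does, i.e. dates >= (date - 7) and < date, a sketch along the same lines could look like this. The data and the column name prev7 are made up for illustration.

```r
# Made-up data: ten consecutive days, one total per day
df <- data.frame(date = as.Date("2001-01-01") + 0:9,
                 tot  = c(5, 3, 8, 1, 4, 7, 2, 6, 9, 10))

# For each row, sum the totals of the seven days before its date
# (the current day itself is excluded, as in the SUMIFS formula)
df$prev7 <- vapply(seq_len(nrow(df)), function(i) {
  d <- df$date[i]
  sum(df$tot[df$date >= d - 7 & df$date < d])
}, numeric(1))

df
```

The first row has no earlier dates, so its sum is 0. Like the one-liner, this handles rows in any order and multiple entries per day, since it filters on the dates rather than on row positions.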
If there is one total per day, this function may help:
rollSums <- function(totals, roll) {
res <- c()
for(i in 1:(length(totals)-roll)) {
res <- c(res, sum(totals[0:(roll-1)+i]))
}
res
}
df1
Total Date
1 3 2015-01-01
2 8 2015-01-01
3 4 2015-01-02
4 7 2015-01-03
5 6 2015-01-04
6 1 2015-01-04
7 10 2015-01-05
8 9 2015-01-06
9 2 2015-01-07
10 5 2015-01-08
rollSums(df1$Total, 3)
[1] 15 19 17 14 17 20 21
rollSums(df1$Total, 4)
[1] 22 25 18 24 26 22
It will take two arguments, the vector with the totals and how many days you'd like in each sum.
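For longer vectors, growing res inside the loop gets slow. A possible alternative, sketched here under the name rollSumsCS (a name made up for this example), computes all windows from a single cumulative sum:

```r
rollSumsCS <- function(totals, roll) {
  # cs[k + 1] holds the sum of the first k totals
  cs <- cumsum(c(0, totals))
  n <- length(cs)
  # Each window sum is the difference of two cumulative sums
  cs[(roll + 1):n] - cs[1:(n - roll)]
}

totals <- c(3, 8, 4, 7, 6, 1, 10, 9, 2, 5)
rollSumsCS(totals, 3)
```

This returns every complete window, so for the totals above it gives 15 19 17 14 17 20 21 16, which is one more value (the final window) than rollSums returns.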
Data
dput(df1)
structure(list(Total = c(3L, 8L, 4L, 7L, 6L, 1L, 10L, 9L, 2L,
5L), Date = structure(c(16436, 16436, 16437, 16438, 16439, 16439,
16440, 16441, 16442, 16443), class = "Date")), .Names = c("Total",
"Date"), row.names = c(NA, -10L), class = "data.frame")
Update
In case you run into a situation with multiple values on the same day, here's a solution. Surprisingly, @MikeWise has a one-liner that can do all of this. See the other answer.
grouped.roll <- function(DF, Values, Group, roll) {
totals <- eval(substitute(with(DF, tapply(Values, Group, sum))))
newsums <- rollSums(totals, roll)
data.frame(Group=names(totals), Sums=c(rep(NA, roll), newsums))
}
It uses the rollSums that I used earlier. It will spit out NAs until the desired day grouping begins. That may be the only advantage over the other answer. But they could easily edit that in, I'm sure. Just providing more options for reference.
grouped.roll(df1, Total, Date, 3)
Group Sums
1 2015-01-01 NA
2 2015-01-02 NA
3 2015-01-03 NA
4 2015-01-04 22
5 2015-01-05 18
6 2015-01-06 24
7 2015-01-07 26
8 2015-01-08 21
I have a CSV file with timestamps and certain event-types which happened at this time.
What I want is to count the number of occurrences of certain event-types in 6-minute intervals.
The input-data looks like:
date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"
I load and cure the data with this piece of code:
> raw_data <- read.csv('input.csv')
> cured_dates <- c(strptime(raw_data$date, '%b %d, %Y %H:%M:%S', tz="CEST"))
> cured_data <- data.frame(cured_dates, c(raw_data$type))
> colnames(cured_data) <- c('date', 'type')
After curing the data looks like this:
> head(cured_data)
date type
1 2011-09-22 14:54:53 2
2 2011-09-22 14:54:53 2
3 2011-09-22 14:54:53 2
4 2011-09-22 14:54:53 2
5 2011-09-22 14:54:53 1
6 2011-09-22 14:54:53 1
I read a lot of samples for xts and zoo, but somehow I can't get the hang of it.
The output data should look something like:
date type count
2011-09-22 14:54:00 CEST 1 11
2011-09-22 14:54:00 CEST 2 19
2011-09-22 15:00:00 CEST 1 9
2011-09-22 15:00:00 CEST 2 12
2011-09-22 15:06:00 CEST 1 23
2011-09-22 15:06:00 CEST 2 18
Zoo's aggregate function looks promising, I found this code-snippet:
# aggregate POSIXct seconds data every 10 minutes
tt <- seq(10, 2000, 10)
x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
aggregate(x, time(x) - as.numeric(time(x)) %% 600, mean)
Now I'm just wondering how I could apply this on my use case.
Naive as I am I tried:
> zoo_data <- zoo(cured_data$type, structure(cured_data$time, class = c("POSIXt", "POSIXct")))
> aggr_data = aggregate(zoo_data$type, time(zoo_data$time), - as.numeric(time(zoo_data$time)) %% 360, count)
Error in `$.zoo`(zoo_data, type) : not possible for univariate zoo series
I must admit that I'm not really confident in R, but I try. :-)
I'm kinda lost. Could anyone point me in the right direction?
Thanks a lot!
Cheers, Alex.
Here the output of dput for a small subset of my data. The data itself is something around 80 million rows.
structure(list(date = structure(c(1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885), class = c("POSIXct", "POSIXt"), tzone = ""),
type = c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L)), .Names = c("date",
"type"), row.names = c(NA, -23L), class = "data.frame")
We can read it using read.csv, convert the first column to a date time binned into 6 minute intervals and add a dummy column of 1's. Then re-read it using read.zoo splitting on the type and aggregating on the dummy column:
# test data
Lines <- 'date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"
"Sep 22, 2011 12:54:53.081240000","3"
"Sep 22, 2011 12:54:53.083493000","3"
"Sep 22, 2011 12:54:53.084025000","3"
"Sep 22, 2011 12:54:53.086493000","4"'
library(zoo)
library(chron)
# convert to chron and bin into 6 minute bins using trunc
# Also add a dummy column of 1's
# and remove any leading space (removing space not needed if there is none)
DF <- read.csv(textConnection(Lines), as.is = TRUE)
fmt <- '%b %d, %Y %H:%M:%S'
DF <- transform(DF, dummy = 1,
date = trunc(as.chron(sub("^ *", "", date), format = fmt), "00:06:00"))
# split and aggregate
z <- read.zoo(DF, split = 2, aggregate = length)
With the above test data the solution looks like this:
> z
2 3 4
(09/22/11 12:54:00) 4 3 1
Note that the above has been done in wide form since that form constitutes a time series whereas the long form does not. There is one column for each type. In our test data we had types 2, 3 and 4 so there are three columns.
(We have used chron here since its trunc method fits well with binning into 6 minute groups. chron does not support time zones which can be an advantage since you can't make one of the many possible time zone errors but if you want POSIXct anyways convert it at the end, e.g. time(z) <- as.POSIXct(paste(as.Date.dates(time(z)), times(time(z)) %% 1)) . This expression is shown in a table in one of the R News 4/1 articles except we used as.Date.dates instead of just as.Date to work around a bug that seems to have been introduced since then. We could also use time(z) <- as.POSIXct(time(z)) but that would result in a different time zone.)
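If you would rather stay in base R with POSIXct, the binning step can also be sketched by flooring the numeric time to multiples of 360 seconds. This is only an illustration: the four timestamps and types are made up, and UTC is assumed so that the bins line up with the clock.

```r
# Made-up timestamps and event types
ts <- as.POSIXct(c("2011-09-22 12:54:53", "2011-09-22 12:55:10",
                   "2011-09-22 13:01:00", "2011-09-22 13:01:30"),
                 tz = "UTC")
type <- c(2, 2, 1, 2)

# Floor each time to the start of its 6-minute (360 second) bin
bin <- as.POSIXct(floor(as.numeric(ts) / 360) * 360,
                  origin = "1970-01-01", tz = "UTC")

# Count rows per bin and type using a dummy column of 1's
long <- data.frame(bin = format(bin, "%Y-%m-%d %H:%M:%S"),
                   type = type, n = 1)
counts <- aggregate(n ~ bin + type, data = long, FUN = sum)
counts
```

This produces the long form (one row per bin and type) that the question asks for, rather than the wide zoo series above.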
EDIT:
The original solution binned into dates but I noticed afterwards that you wish to bin into 6 minute periods so the solution was revised.
EDIT:
Revised based on comment.
You are almost all the way there. All you need to do now is create a zoo-ish version of that data and map it to the aggregate.zoo code. Since you want to categorize by both time and type, your second argument to aggregate.zoo must be a bit more complex, and since you want counts rather than means you should use length(). I do not think that count is a base R or zoo function, and the only count function I see in my workspace comes from pkg:plyr, so I don't know how well it would play with aggregate.zoo. length works as most people expect for vectors but it often surprises people when working with data.frames. If you do not get what you want with length, then you should see if NROW works instead (with your data layout they both succeed). With the new data object it is necessary to put the type argument first. And it turns out that aggregate/zoo only handles single category classifiers, so you need to put in the as.vector to remove its zoo-ness:
with(cured_data,
aggregate(as.vector(x), list(type = type,
interval=as.factor(time(x) - as.numeric(time(x)) %% 360)),
FUN=NROW)
)
# interval x
#1 2011-09-22 09:24:00 12
#2 2011-09-22 09:24:00 11
This is an example modified from where you got the code (an example on SO by WizaRd Dirk):
Aggregate (count) occurences of values over arbitrary timeframe
tt <- seq(10, 2000, 10)
x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
aggregate(as.vector(x), by=list(cat=as.factor(x),
tms = as.factor(index(x) - as.numeric(index(x)) %% 600)), length)
cat tms x
1 1 1969-12-31 19:00:00 26
2 2 1969-12-31 19:00:00 22
3 3 1969-12-31 19:00:00 11
4 1 1969-12-31 19:10:00 17
5 2 1969-12-31 19:10:00 28
6 3 1969-12-31 19:10:00 15
7 1 1969-12-31 19:20:00 17
8 2 1969-12-31 19:20:00 16
9 3 1969-12-31 19:20:00 27
10 1 1969-12-31 19:30:00 8
11 2 1969-12-31 19:30:00 4
12 3 1969-12-31 19:30:00 9