I want to reshape a wide format dataset that has multiple tests which are measured at 3 time points:
ID Test Year Fall Spring Winter
1 1 2008 15 16 19
1 1 2009 12 13 27
1 2 2008 22 22 24
1 2 2009 10 14 20
2 1 2008 12 13 25
2 1 2009 16 14 21
2 2 2008 13 11 29
2 2 2009 23 20 26
3 1 2008 11 12 22
3 1 2009 13 11 27
3 2 2008 17 12 23
3 2 2009 14 9 31
into a data set that separates the tests into columns but converts the measurement times into long format, like this:
ID Year Time Test1 Test2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20
2 2008 Fall 12 13
2 2008 Spring 13 11
2 2008 Winter 25 29
2 2009 Fall 16 23
2 2009 Spring 14 20
2 2009 Winter 21 26
3 2008 Fall 11 17
3 2008 Spring 12 12
3 2008 Winter 22 23
3 2009 Fall 13 14
3 2009 Spring 11 9
3 2009 Winter 27 31
I have tried reshape and melt without success. Existing posts address transforming to a single-column outcome.
Using reshape2:
library(reshape2)
# Thanks to Ista for helping with direct naming using "variable.name"
df.m <- melt(df, id.var = c("ID", "Test", "Year"), variable.name = "Time")
df.m <- transform(df.m, Test = paste0("Test", Test))
dcast(df.m, ID + Year + Time ~ Test, value.var = "value")
Update: using data.table's melt/dcast from versions >= 1.9.0:
From version 1.9.0, data.table imports the reshape2 package and implements fast melt and dcast methods in C for data.tables. A comparison of speed on bigger data is shown below.
For more info regarding changes, see the package NEWS.
require(data.table) ## ver. >=1.9.0
require(reshape2)
dt <- as.data.table(df, key=c("ID", "Test", "Year"))
dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
dt.m[, Test := paste0("Test", Test)]
dcast.data.table(dt.m, ID + Year + Time ~ Test, value.var = "value")
At the moment, you'll have to write dcast.data.table explicitly as it's not an S3 generic in reshape2 yet.
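Note: in later data.table releases (around 1.9.6 onwards, if I remember the version correctly), melt and dcast became generics in data.table itself, so a plain dcast() call on a data.table should dispatch to the fast method:
# with a recent data.table, this should be equivalent to the line above:
dcast(dt.m, ID + Year + Time ~ Test, value.var = "value")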
Benchmarking on bigger data:
# generate data:
set.seed(45L)
DT <- data.table(ID     = sample(1e2, 1e7, TRUE),
                 Test   = sample(1e3, 1e7, TRUE),
                 Year   = sample(2008:2014, 1e7, TRUE),
                 Fall   = sample(50, 1e7, TRUE),
                 Spring = sample(50, 1e7, TRUE),
                 Winter = sample(50, 1e7, TRUE))
DF <- as.data.frame(DT)
reshape2 timings:
reshape2_melt <- function(df) {
df.m <- melt(df, id.var = c("ID", "Test", "Year"), variable.name = "Time")
}
# min. of three consecutive runs
system.time(df.m <- reshape2_melt(DF))
# user system elapsed
# 43.319 4.909 48.932
df.m <- transform(df.m, Test = paste0("Test", Test))
reshape2_cast <- function(df) {
dcast(df.m, ID + Year + Time ~ Test, value.var = "value")
}
# min. of three consecutive runs
system.time(reshape2_cast(df.m))
# user system elapsed
# 57.728 9.712 69.573
data.table timings:
DT_melt <- function(dt) {
dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
}
# min. of three consecutive runs
system.time(dt.m <- DT_melt(DT))
# user system elapsed
# 0.276 0.001 0.279
dt.m[, Test := paste0("Test", Test)]
DT_cast <- function(dt) {
dcast.data.table(dt.m, ID + Year + Time ~ Test, value.var = "value")
}
# min. of three consecutive runs
system.time(DT_cast(dt.m))
# user system elapsed
# 12.732 0.825 14.006
melt.data.table is ~175x faster than reshape2's melt, and dcast.data.table is ~5x faster than reshape2's dcast.
Sticking with base R, this is another good candidate for the "stack + reshape" routine. Assuming our dataset is called "mydf":
mydf.temp <- data.frame(mydf[1:3], stack(mydf[4:6]))
mydf2 <- reshape(mydf.temp, direction = "wide",
idvar=c("ID", "Year", "ind"),
timevar="Test")
names(mydf2) <- c("ID", "Year", "Time", "Test1", "Test2")
mydf2
# ID Year Time Test1 Test2
# 1 1 2008 Fall 15 22
# 2 1 2009 Fall 12 10
# 5 2 2008 Fall 12 13
# 6 2 2009 Fall 16 23
# 9 3 2008 Fall 11 17
# 10 3 2009 Fall 13 14
# 13 1 2008 Spring 16 22
# 14 1 2009 Spring 13 14
# 17 2 2008 Spring 13 11
# 18 2 2009 Spring 14 20
# 21 3 2008 Spring 12 12
# 22 3 2009 Spring 11 9
# 25 1 2008 Winter 19 24
# 26 1 2009 Winter 27 20
# 29 2 2008 Winter 25 29
# 30 2 2009 Winter 21 26
# 33 3 2008 Winter 22 23
# 34 3 2009 Winter 27 31
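If you want the rows ordered by ID and Year (as in the desired output) instead of by Time, a small optional clean-up, assuming the mydf2 object built above:
mydf2 <- mydf2[order(mydf2$ID, mydf2$Year, mydf2$Time), ]  # reorder rows
rownames(mydf2) <- NULL                                    # drop the leftover row names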
An alternative method using the base reshape function is below. Though this requires calling reshape twice, there might be a simpler way.
Assuming your dataset is called df1:
tmp <- reshape(df1,idvar=c("ID","Year"),timevar="Test",direction="wide")
result <- reshape(
tmp,
idvar=c("ID","Year"),
varying=list(3:5,6:8),
v.names=c("Test1","Test2"),
times=c("Fall","Spring","Winter"),
direction="long"
)
Which gives:
> result
ID Year time Test1 Test2
1.2008.Fall 1 2008 Fall 15 22
1.2009.Fall 1 2009 Fall 12 10
2.2008.Fall 2 2008 Fall 12 13
2.2009.Fall 2 2009 Fall 16 23
3.2008.Fall 3 2008 Fall 11 17
3.2009.Fall 3 2009 Fall 13 14
1.2008.Spring 1 2008 Spring 16 22
1.2009.Spring 1 2009 Spring 13 14
2.2008.Spring 2 2008 Spring 13 11
2.2009.Spring 2 2009 Spring 14 20
3.2008.Spring 3 2008 Spring 12 12
3.2009.Spring 3 2009 Spring 11 9
1.2008.Winter 1 2008 Winter 19 24
1.2009.Winter 1 2009 Winter 27 20
2.2008.Winter 2 2008 Winter 25 29
2.2009.Winter 2 2009 Winter 21 26
3.2008.Winter 3 2008 Winter 22 23
3.2009.Winter 3 2009 Winter 27 31
tidyverse/tidyr solution:
library(dplyr)
library(tidyr)
df %>%
gather("Time", "Value", Fall, Spring, Winter) %>%
spread(Test, Value, sep = "")
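With tidyr >= 1.0.0, gather/spread are superseded; a hedged equivalent using pivot_longer/pivot_wider would be:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(c(Fall, Spring, Winter), names_to = "Time", values_to = "Value") %>%
  pivot_wider(names_from = Test, values_from = Value, names_prefix = "Test")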
Related
Say I have the following data frame:
ID<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3, 4,4,4,4,4,5,5,5,5,5)
Score<- sample(1:20, 25, replace=TRUE)
days<-rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), times=5)
t<-cbind(ID, Score, days)
I would like to reshape it so that the new columns are ID and the actual weekday names (meaning 6 columns), and the Score values are distributed according to their ID and day name, i.e. one row per ID with one column per weekday.
I thought the reshape package might do it. I tried melt and cast, but that did not produce the result I wanted; instead I got something like in this post: Melt data for one column
A base R solution that uses the built-in reshape command.
set.seed(12345)
t <- data.frame(id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5),
score = sample(x = 1:20,size = 25,replace = TRUE),
days = rep(x = c("Mon","Tue","Wed","Thu","Fri"),times = 5))
t.wide <- reshape(data = t,
v.names = "score",
timevar = "days",
idvar = "id",
direction = "wide")
names(t.wide) <- gsub(pattern = "score.",replacement = "",x = names(t.wide),fixed = TRUE)
t.wide
id Mon Tue Wed Thu Fri
1 1 15 18 16 18 10
6 2 4 7 11 15 20
11 3 1 4 15 1 8
16 4 10 8 9 4 20
21 5 10 7 20 15 13
You can use reshape2 to do this, but you need a data.frame. Using cbind produces a matrix (and, in this case, converts all your numeric variables to character, as a matrix can only hold one data type).
I've changed your code to produce a data frame, which is already in long format (one row per observation).
set.seed(123)
ID<-c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3, 4,4,4,4,4,5,5,5,5,5)
Score<- sample(1:20, 25, replace=TRUE)
days<-rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), times=5)
dat<-data.frame(ID, Score, days)
Changing it to wide using reshape2 is then quite straightforward:
library(reshape2)
res <- dcast(ID~days,value.var="Score",data=dat)
> res
ID Fri Mon Thu Tue Wed
1 1 16 3 2 12 6
2 2 19 13 12 7 19
3 3 19 19 17 8 15
4 4 15 3 8 1 20
5 5 3 11 18 8 15
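Note that dcast orders the new columns alphabetically (Fri, Mon, ...). If you want them in weekday order, one option (a sketch, reusing dat from above) is to make days a factor with the levels in the desired order before casting:
dat$days <- factor(dat$days, levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))
res <- dcast(ID ~ days, value.var = "Score", data = dat)  # columns now follow the factor levels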
You could also use unstack if your data are complete (same number of days per id).
Here's an example (using the data from TARehman's answer):
unstack(t, score ~ days)
# Fri Mon Thu Tue Wed
# 1 10 15 18 18 16
# 2 20 4 15 7 11
# 3 8 1 1 4 15
# 4 20 10 4 8 9
# 5 13 10 15 7 20
Here's the clean-up for the column ordering, and for adding in the ID column:
cbind(ID = unique(t$id), unstack(t, score ~ days)[c("Mon", "Tue", "Wed", "Thu", "Fri")])
## ID Mon Tue Wed Thu Fri
## 1 1 15 18 16 18 10
## 2 2 4 7 11 15 20
## 3 3 1 4 15 1 8
## 4 4 10 8 9 4 20
## 5 5 10 7 20 15 13
Rather than reshape I'd move to the newer tidyr package and also make use of dplyr like so:
library(dplyr)
library(tidyr)
tdf<-as.data.frame(t) %>%
mutate(Score = as.numeric(as.character(Score))) %>%  # as.character() first: Score was coerced by cbind()
spread(days,Score, fill=NA)
glimpse(tdf)
HTH
Just another option using splitstackshape
library(splitstackshape)
data = data.frame(t)
out = setnames(cSplit(setDT(data)[, .(x = toString(Score)), by = ID],
'x', ','), c('ID', unique(days)))
#> out
# ID Mon Tue Wed Thu Fri
#1: 1 8 14 11 5 10
#2: 2 16 1 4 14 8
#3: 3 8 18 19 13 3
#4: 4 16 9 19 16 6
#5: 5 7 2 1 2 13
Using the dplyr and tidyr packages, spread achieves the following:
library(dplyr)
library(tidyr)
t <- tbl_df(as.data.frame(t))
t %>% spread(days, Score)  # key = days, value = Score; ID is carried along automatically
and you get the following output:
ID Fri Mon Thu Tue Wed
(fctr) (fctr) (fctr) (fctr) (fctr) (fctr)
1 1 10 10 18 17 10
2 2 18 11 14 3 16
3 3 11 13 9 15 17
4 4 13 13 16 17 11
5 5 7 14 9 15 20
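The (fctr) columns are a side effect of building t with cbind(), which coerces everything to character. A hedged fix is to convert Score back to numeric before spreading:
t %>%
  mutate(Score = as.numeric(as.character(Score))) %>%  # undo the coercion from cbind()
  spread(days, Score)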
I'm trying to get the mean of some variables inside a dataframe for different factors. Say I have:
time geo var1 var2 var3 var4
1 1990 AT 1 7 13 19
2 1991 AT 2 8 14 20
3 1992 AT 3 9 15 21
4 1990 DE 4 10 16 22
5 1991 DE 5 11 17 23
6 1992 DE 6 12 18 24
And I want:
time geo var1 var2 var3 var4 m_var2 m_var3
1 1990 AT 1 7 13 19 8 14
2 1991 AT 2 8 14 20 8 14
3 1992 AT 3 9 15 21 8 14
4 1990 DE 4 10 16 22 11 17
5 1991 DE 5 11 17 23 11 17
6 1992 DE 6 12 18 24 11 17
I've tried a few things with by() and lapply(), but I think this goes in the direction of ddply:
require(plyr)
Dataset <- data.frame(time = rep(1990:1992, 2),
                      geo  = c(rep("AT", 3), rep("DE", 3)),
                      var1 = as.numeric(1:6),
                      var2 = as.numeric(7:12),
                      var3 = as.numeric(13:18),
                      var4 = as.numeric(19:24))
newvars <- c("var2","var3")
newData <- Dataset[,c("geo",newvars)]
Currently, I can choose between two errors:
ddply(newData,newData[,"geo"],colMeans)
#where R apparently thinks AT is the variable?
ddply(newData,"geo",colMeans)
#where R worries about the factor variable not being numeric?
My lapply attempts got me quite far but then left me with a list I could not get back into the dataframe:
lapply(newvars,function(x){
by(Dataset[x],Dataset[,"geo"],function(x)
rep(colMeans(x,na.rm=T),length(unique(Dataset[,"time"]))))
})
I think this must even be possible with merge and filters as here:
Lapply in a dataframe over different variables using filters, but I can't get it together. Any help would be appreciated!
Another method, with dplyr:
library(dplyr)
df1 %>% group_by(geo) %>% mutate(m_var2=mean(var2), m_var3=mean(var3))
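If there are many such columns, a hedged generalisation with dplyr >= 1.0.0 uses across(), so the means don't have to be written out one by one:
library(dplyr)
df1 %>%
  group_by(geo) %>%
  mutate(across(c(var2, var3), mean, .names = "m_{.col}")) %>%  # adds m_var2, m_var3
  ungroup()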
Another simple base R solution is just
transform(df, m_var2 = ave(var2, geo), m_var3 = ave(var3, geo))
# time geo var1 var2 var3 var4 m_var2 m_var3
# 1 1990 AT 1 7 13 19 8 14
# 2 1991 AT 2 8 14 20 8 14
# 3 1992 AT 3 9 15 21 8 14
# 4 1990 DE 4 10 16 22 11 17
# 5 1991 DE 5 11 17 23 11 17
# 6 1992 DE 6 12 18 24 11 17
A couple of years later: I think a more concise approach would be to both update the actual data set (instead of creating a new one) and operate on a vector of columns (instead of writing them out manually):
vars <- paste0("var", 2:3) # Select desired cols
df[paste0("m_", vars)] <- lapply(df[vars], ave, df[["geo"]]) # Loop and update
One option would be to use data.table. We can convert the data.frame to data.table (setDT(df1)), get the mean (lapply(.SD, mean)) for the selected columns ('var2' and 'var3') by specifying the column index in .SDcols, grouped by 'geo'. Create new columns by assigning the output (:=) to the new column names (paste('m', names(df1)[4:5]))
library(data.table)
setDT(df1)[, paste('m', names(df1)[4:5], sep="_") :=lapply(.SD, mean)
,by = geo, .SDcols=4:5]
# time geo var1 var2 var3 var4 m_var2 m_var3
#1: 1990 AT 1 7 13 19 8 14
#2: 1991 AT 2 8 14 20 8 14
#3: 1992 AT 3 9 15 21 8 14
#4: 1990 DE 4 10 16 22 11 17
#5: 1991 DE 5 11 17 23 11 17
#6: 1992 DE 6 12 18 24 11 17
NOTE: This method is more general. We can create the mean columns even for hundreds of variables without any major change to the code, i.e. if we need the means of columns 4:100, change .SDcols = 4:100 and use names(df1)[4:100] inside the paste() call.
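The same idea works with column names instead of positions, which is a bit more robust if the column order changes; a sketch:
cols <- c("var2", "var3")   # or e.g. grep("^var", names(df1), value = TRUE)
setDT(df1)[, paste0("m_", cols) := lapply(.SD, mean), by = geo, .SDcols = cols]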
data
df1 <- structure(list(time = c(1990L, 1991L, 1992L, 1990L, 1991L, 1992L
), geo = c("AT", "AT", "AT", "DE", "DE", "DE"), var1 = 1:6, var2 = 7:12,
var3 = 13:18, var4 = 19:24), .Names = c("time", "geo", "var1",
"var2", "var3", "var4"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
In base R:
cbind(df1,m_var2=ave(df1$var2,df1$geo),m_var3=ave(df1$var3,df1$geo))
I want to calculate the time from an incident to the corresponding death. The first five columns give the date-time of the incident; the remaining five columns give the date-time of death.
dat <- read.table(header=TRUE, text="
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32")
dat
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32
I want to calculate the difference of time (in minutes). The following code is not going anywhere. The timestamp will look like 2013-04-06 04:08.
library(lubridate)
dat$tstamp1 <- mdy(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE,sep = "-"))
dat$tstamp2 <- mdy(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"))
dat$diff <- dat$tstamp2 - dat$tstamp1 ### want the difference in minutes
In order to parse a date/time string of the "-"-separated format you're creating, you'll need to give a custom format, and pass it to parse_date_time. For example:
parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
Your new code would therefore look like:
library(lubridate)
dat$tstamp1 <- parse_date_time(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
dat$tstamp2 <- parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
"%Y-%m-%d-%H-%M")
Then the following will get you the time difference in minutes:
dat$diff <- as.numeric(difftime(dat$tstamp2, dat$tstamp1, units = "mins"))  # units = "mins" guarantees minutes
You can try this, using only base R (strptime and as.POSIXct, so no extra packages are needed):
dat$tstamp1 <- strptime(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE,sep = "-"),"%Y-%m-%d-%H-%M")
dat$tstamp2 <- strptime(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),"%Y-%m-%d-%H-%M")
dat$diff <- difftime(as.POSIXct(dat$tstamp2), as.POSIXct(dat$tstamp1), units = "mins")
Using strptime is faster and a bit safer against unexpected data. You can read more about it here.
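Since the components are already numeric, another base R option (a sketch, no string pasting needed) is ISOdatetime():
dat$tstamp1 <- ISOdatetime(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, 0)
dat$tstamp2 <- ISOdatetime(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, 0)
dat$diff <- difftime(dat$tstamp2, dat$tstamp1, units = "mins")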
What finally worked was:
a <- cast(we, year ~ region, mean, value='response')
Although I only have 1 observation per region and site, so mean is just a workaround; I couldn't get c to work as a function.
Output for suggested answer (by Justin)
> DT
> response year
> 1: 15 2000
> 2: 6 2000
> 3: 23 2000
> 4: 23 2000
---
> 794: 3 2010
> 795: 5 2010
> 796: 1 2010
Update: desired output should look like:
> Year x1 x2 x3 x4
> 2000 4 5 16 22
> 2001 6 11 2 18
> 2002 1 0 21 10
> ...
I am struggling to find a way to transpose my data based on factor levels. I have data with 2 columns, a factor and a response. I have many rows for each factor, so I want to transpose the table such that each factor is on one row, with the different responses as a column in that row. I cannot seem to subset within a loop based on levels of that factor. I would appreciate any insight.
example of data:
> response year
> 5 2001
> 10 2001
> 8 2001
> 1 2002
> 7 2010
> levels(data$year)
[1] "2000" "2001" "2002" "2003" "2004" "2005" ...
w <- matrix(0,54,15)
for(i in 1:levels(data$year)){
w[i] <- levels(data$year)==i
}
This syntax is obviously not correct, but it is the idea of what I'm trying to accomplish.
Thank you.
Using the data.table package this is trivial:
library(data.table)
DT <- data.table(data)
DT[, as.list(value), by=year]
However, this will fall apart if you have different numbers of observations per year. Instead:
DT[, list(values = list(value)), by=year]
Or using base R:
tapply(data$value, data$year, c)
Here's another way, using aggregate:
> set.seed(1)
> data <- data.frame(year = rep(2000:2010, each=10), value = sample(3:30, 110, TRUE))
> aggregate(value~year, data=data, FUN=c)
year value.1 value.2 value.3 value.4 value.5 value.6 value.7 value.8 value.9 value.10
1 2000 10 13 19 28 8 28 29 21 20 4
2 2001 8 7 22 13 24 16 23 30 13 24
3 2002 29 8 21 6 10 13 3 13 27 12
4 2003 16 19 16 8 26 21 25 6 23 14
5 2004 25 21 24 18 17 25 3 16 23 22
6 2005 16 27 15 9 4 5 11 17 21 14
7 2006 28 11 15 12 21 10 16 24 5 27
8 2007 12 26 12 12 16 27 27 13 24 29
9 2008 15 22 14 12 24 8 22 6 9 7
10 2009 9 4 20 27 24 25 15 14 25 19
11 2010 21 12 10 30 20 8 6 16 28 19
If I had a different number of responses per year, I would probably come at this problem by first making a new variable to represent the response in each year and then casting that dataset out using dcast. By default dcast fills in missing values with NA, although you can change that if needed.
set.seed(1)
data = data.frame(year = c(rep(2000:2010, each=10), 2011), value = sample(3:30, 111, TRUE))
require(reshape2)
require(plyr)
# Create a new variable representing the number of responses per year and add to dataset
dat2 = ddply(data, .(year), transform,
response = interaction("x", 1:length(value), sep = ""))
dcast(dat2, year ~ response, value.var = "value")
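For example, to fill the missing combinations with 0 instead of NA, pass the fill argument mentioned above:
dcast(dat2, year ~ response, value.var = "value", fill = 0)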
I have a dataframe with counts of geese at several different sites. The aim was to make monthly counts of geese in all 8 months from September to April at each site, in consecutive winter periods. A winter period is defined as the 8 months from September to April.
If the method had been carried out as planned, this is what the data would look like:
library(lubridate)  # for dmy()
df <- data.frame(site=c(rep('site 1', 16), rep('site 2', 16), rep('site 3', 16)),
date=dmy(rep(c('01/09/2007', '02/10/2007', '02/11/2007',
'02/12/2007', '02/01/2008', '02/02/2008', '02/03/2008',
'02/04/2008', '01/09/2008', '02/10/2008', '02/11/2008',
'02/12/2008', '02/01/2009', '02/02/2009', '02/03/2009',
'02/04/2009'),3)),
count=sample(1:100, 48))
It ended up in a situation where some sites have all 8 counts in some September-April periods, but not in others. In addition, some sites never achieved 8 counts in any September-April period. These toy data look like my actual data:
df <- df[-c(11:16, 36:48),]
I need to remove rows from the dataframe which do not form part of 8 consecutive counts in a September-April period. Using the toy data, this is the dataframe I need:
df <- df[-c(9:10, 27:29), ]
I've tried various commands using ddply() from the plyr package, but without success. Is there a solution to this problem?
One way I can think of is to subtract four months from your date, so that you can then group by year. To get the corresponding date after subtracting 4 months, I suggest you use the mondate package. See here for an excellent answer on what problems you'd face when subtracting months and how to overcome them.
require(mondate)
require(lubridate) # dmy(), year(), month()
require(plyr)      # ddply()
df$grp <- mondate(df$date) - 4
df$year <- year(df$grp)
df$month <- month(df$date)
ddply(df, .(site, year), function(x) {
if (all(c(1:4, 9:12) %in% x$month)) {
return(x)
} else {
return(NULL)
}
})
# site date count grp year month
# 1 site 1 2007-09-01 87 2007-05-02 2007 9
# 2 site 1 2007-10-02 44 2007-06-02 2007 10
# 3 site 1 2007-11-02 50 2007-07-03 2007 11
# 4 site 1 2007-12-02 65 2007-08-02 2007 12
# 5 site 1 2008-01-02 12 2007-09-02 2007 1
# 6 site 1 2008-02-02 2 2007-10-03 2007 2
# 7 site 1 2008-03-02 100 2007-11-02 2007 3
# 8 site 1 2008-04-02 29 2007-12-03 2007 4
# 9 site 2 2007-09-01 3 2007-05-02 2007 9
# 10 site 2 2007-10-02 22 2007-06-02 2007 10
# 11 site 2 2007-11-02 56 2007-07-03 2007 11
# 12 site 2 2007-12-02 5 2007-08-02 2007 12
# 13 site 2 2008-01-02 40 2007-09-02 2007 1
# 14 site 2 2008-02-02 15 2007-10-03 2007 2
# 15 site 2 2008-03-02 10 2007-11-02 2007 3
# 16 site 2 2008-04-02 20 2007-12-03 2007 4
# 17 site 2 2008-09-01 93 2008-05-02 2008 9
# 18 site 2 2008-10-02 13 2008-06-02 2008 10
# 19 site 2 2008-11-02 58 2008-07-03 2008 11
# 20 site 2 2008-12-02 64 2008-08-02 2008 12
# 21 site 2 2009-01-02 92 2008-09-02 2008 1
# 22 site 2 2009-02-02 69 2008-10-03 2008 2
# 23 site 2 2009-03-02 89 2008-11-02 2008 3
# 24 site 2 2009-04-02 27 2008-12-03 2008 4
An alternative solution using data.table:
require(data.table)
require(mondate)
dt <- data.table(df)
dt[, `:=`(year=year(mondate(date)-4), month=month(date))]
dt.out <- dt[, .SD[rep(all(c(1:4,9:12) %in% month), .N)],
by=list(site,year)][, c("year", "month") := NULL]
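If you'd rather avoid mondate, a hedged sketch that computes the "winter year" directly (September-December keep their calendar year, January-April belong to the previous one), assuming lubridate or data.table for year()/month():
require(data.table)
require(lubridate)
dt <- data.table(df)  # starting from the df built in the question
dt[, `:=`(wyear = year(date) - (month(date) <= 4), month = month(date))]
dt.out <- dt[, .SD[rep(all(c(1:4, 9:12) %in% month), .N)],
             by = .(site, wyear)][, c("wyear", "month") := NULL]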