Aggregating price data to different time horizons in R data.table

Hi, I'm looking to roll up minutely data in a data.table to a 5-minute (or 10-minute) horizon. I know this is easily done with xts and the to.minutes5 function, but I'd prefer not to use xts in this instance as the data set is rather large. Is there an easy way to do this in data.table?
Data example: in this example, the period from 21:30 to 21:34 (both inclusive) would collapse to a single row with t = 21:30, open = 0.88703, high = 0.88799, low = 0.88702, close = 0.88798, volume = 43 (note that the data from 21:35 itself is ignored).
t open high low close volume
1: 2010-01-03 21:27:00 0.88685 0.88688 0.88685 0.88688 2
2: 2010-01-03 21:28:00 0.88688 0.88688 0.88686 0.88688 5
3: 2010-01-03 21:29:00 0.88688 0.88704 0.88687 0.88703 7
4: 2010-01-03 21:30:00 0.88703 0.88795 0.88702 0.88795 10
5: 2010-01-03 21:31:00 0.88795 0.88795 0.88774 0.88778 7
6: 2010-01-03 21:32:00 0.88778 0.88778 0.88753 0.88760 8
7: 2010-01-03 21:33:00 0.88760 0.88781 0.88760 0.88775 11
8: 2010-01-03 21:34:00 0.88775 0.88799 0.88775 0.88798 7
9: 2010-01-03 21:35:00 0.88798 0.88803 0.88743 0.88782 8
10: 2010-01-03 21:36:00 0.88782 0.88782 0.88770 0.88778 6
Output from dput(head(myData)), as requested by GSee. I want to use the data.table to store some additional derived fields based on this original data, so even if I did use xts to roll up these price bars, I'd have to get them into a data.table somehow; I'd therefore appreciate any tips on the right way to hold xts results in a data.table.
structure(list(t = structure(c(1241136000, 1241136060, 1241136120,
1241136180, 1241136240, 1241136300), class = c("POSIXct", "POSIXt"
), tzone = "Europe/London"), open = c(0.89467, 0.89467, 0.89472,
0.89473, 0.89504, 0.895), high = c(0.89481, 0.89475, 0.89473,
0.89506, 0.8951, 0.895), low = c(0.89457, 0.89465, 0.89462, 0.89473,
0.89486, 0.89486), close = c(0.89467, 0.89472, 0.89473, 0.89504,
0.895, 0.89488), volume = c(96L, 14L, 123L, 49L, 121L, 36L)), .Names = c("t",
"open", "high", "low", "close", "volume"), class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000000100788>)

You can use the endpoints function (which is written in C) from xts on POSIXt vectors. endpoints finds the position of the last element of each time period. By convention, 1:05 would not be included in the same bar as 1:00, so the data you provided dput output for (which is different from the printed data above it) will have 2 bars.
Assuming dt is your data.table:
library(data.table)
library(xts)
setkey(dt, t) # make sure the data.table is sorted by time.
ep <- endpoints(dt$t, "minutes", 5)[-1] # remove the first value, which is 0
dt[ep, grp:=seq_along(ep)] # create a column to group by
dt[, grp:=na.locf(grp, fromLast=TRUE)] # fill NAs backwards so each row gets its bar's group (na.locf is from zoo, loaded with xts)
dt[, list(t = last(t), open = open[1], high = max(high), low = min(low),
          close = last(close), volume = sum(volume)), by = grp]
grp t open high low close volume
1: 1 2009-05-01 01:04:00 0.89467 0.8951 0.89457 0.89500 403
2: 2 2009-05-01 01:05:00 0.89500 0.8950 0.89486 0.89488 36
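If you want to avoid xts entirely, here is a minimal pure-data.table sketch (assuming, as in your example, that each bar is labelled by its opening minute): floor each timestamp to its 5-minute bucket and aggregate by that.
library(data.table)
# floor each timestamp to the start of its 5-minute (300-second) bucket
dt[, t5 := as.POSIXct(floor(as.numeric(t) / 300) * 300,
                      origin = "1970-01-01", tz = attr(t, "tzone"))]
dt[, .(open = first(open), high = max(high), low = min(low),
       close = last(close), volume = sum(volume)), by = t5]
Swap 300 for 600 to get 10-minute bars.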

Related

R: How to work with time series of sub-hour data?

I just started with R and finished some tutorials. However, I am trying to get into time series analysis and am having big trouble with it. I made a data frame that looks like this:
Date Time T1
1 2014-05-22 15:15:00 21.6
2 2014-05-22 15:20:00 21.2
3 2014-05-22 15:25:00 21.3
4 2014-05-22 15:30:00 21.5
5 2014-05-22 15:35:00 21.1
6 2014-05-22 15:40:00 21.5
Since I didn't want to work with half days, I removed the first and last day from the data frame. Since R recognized the date and time not as such but as "factor", I used the lubridate library to convert them properly. Now it looks like this:
Date Time T1
1 2014-05-23 0S 14.2
2 2014-05-23 5M 0S 14.1
3 2014-05-23 10M 0S 14.6
4 2014-05-23 15M 0S 14.3
5 2014-05-23 20M 0S 14.4
6 2014-05-23 25M 0S 14.5
Now the trouble really starts. Using the ts function changes the date to 16944 and the time to 0. How do I set up a data frame with the correct start date and frequency? A new set of data comes in every 5 minutes, so the frequency should be 288. I also tried to set the start date as a vector. Since the 22nd of May was the 142nd day of the year, I tried this:
ts_df <- ts(df, start=c(2014, 142/365), frequency=288)
No error there, but start(ts_df) and end(ts_df) give:
[1] 2013.998
[1] 2058.994
Can anyone give me a hint on how to work with this kind of data?
"ts" class is typically not a good fit for that type of data. Assuming DF is the data frame shown reproducibly in the Note at the end of this answer we convert it to a "zoo" class object and then perform some manipulations. The related xts package could also be used.
library(zoo)
z <- read.zoo(DF, index = 1:2, tz = "")
window(z, start = "2014-05-22 15:25:00") # subset from a start time onwards
head(z, 3) # first 3
head(z, -3) # all but last 3
tail(z, 3) # last 3
tail(z, -3) # all but first 3
z[2:4] # 2nd, 3rd and 4th element of z
coredata(z) # numeric vector of data values
time(z) # vector of datetimes
fortify.zoo(z) # data frame whose 2 cols are (1) datetimes and (2) data values
aggregate(z, as.Date, mean) # aggregate to daily means
ym <- aggregate(z, as.yearmon, mean) # aggregate to monthly means
frequency(ym) <- 12 # only needed since ym only has length 1
as.ts(ym) # year/month series can be reasonably converted to ts
plot(z)
library(ggplot2)
autoplot(z)
read.zoo could also have been used to read the data in from a file.
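For example, a sketch (the file name is hypothetical; this assumes a CSV with the same Date/Time/T1 layout as DF):
z <- read.zoo("readings.csv", header = TRUE, sep = ",", index = 1:2, tz = "")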
Note: DF used above in reproducible form:
DF <- structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2014-05-22",
class = "factor"),
Time = structure(1:6, .Label = c("15:15:00", "15:20:00",
"15:25:00", "15:30:00", "15:35:00", "15:40:00"), class = "factor"),
T1 = c(21.6, 21.2, 21.3, 21.5, 21.1, 21.5)), .Names = c("Date",
"Time", "T1"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))

R - Relational Operators and Vectorization

I have a vector of times when people scanned a badge. I have another set of times that are 'measurement points'.
scans = structure(c(1388570120, 1388572119, 1388575229, 1388577402, 1388580457, 1388583364, 1388586817, 1388589929, 1388593054, 1388599025), class = c("POSIXct", "POSIXt"), tzone = "UTC")
points = as.POSIXct(9*3600,"UTC",origin="2014-01-01")+seq(0,10*3600,3600)
What I want to do is count how many scans are greater than (or equal to) each element of points:
sum(scans >= points[1])
#> [1] 10
This works one at a time and can easily be converted to a for loop or an lapply
lapply(points,function(x){sum(scans >= x)})
However, I cannot simply use scans >= points and get back a result in which all of scans is compared to each element of points.
Is there a way in R to compare one entire vector to each element of another vector without an explicit looping construct (so the result is identical to the lapply example above, except possibly in structure)? What I actually have is a list of vectors of scans which I'm already going to be lapplying through, and I'm hoping there's a way to avoid nested looping in R.
You can do
colSums(outer(scans,points,'>='))
I can't guarantee that the intermediate matrix would fit into memory though.
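A memory-friendlier sketch, if sorting the scans first is acceptable: findInterval with left.open = TRUE returns, for each point, how many scans lie strictly below it, so subtracting from the total count gives the number of scans >= each point.
sorted_scans <- sort(scans)
length(scans) - findInterval(points, sorted_scans, left.open = TRUE)
#> [1] 10  9  8  6  5  4  3  2  1  0  0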
You can do the following with data.table, using non-equi joins (at the time of writing these required the development version; they shipped in version 1.9.8):
library(data.table)
dt1 = data.table(scans)
dt2 = data.table(points)
dt1[dt2, on = .(scans >= points), .N, by = .EACHI]
# scans N
# 1: 2014-01-01 09:00:00 10
# 2: 2014-01-01 10:00:00 9
# 3: 2014-01-01 11:00:00 8
# 4: 2014-01-01 12:00:00 6
# 5: 2014-01-01 13:00:00 5
# 6: 2014-01-01 14:00:00 4
# 7: 2014-01-01 15:00:00 3
# 8: 2014-01-01 16:00:00 2
# 9: 2014-01-01 17:00:00 1
#10: 2014-01-01 18:00:00 0
#11: 2014-01-01 19:00:00 0
This should be much more memory-efficient than building the full outer product.
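For the list-of-vectors case mentioned in the question, a sketch (scan_list is a hypothetical named list of POSIXct vectors) that handles all groups in one non-equi join by stacking the list with an id column:
scan_list <- list(door_a = scans, door_b = scans)  # hypothetical input
dt1 <- rbindlist(lapply(scan_list, function(s) data.table(scans = s)),
                 idcol = "grp")
dt2 <- CJ(grp = names(scan_list), points = points)  # every group x point pair
# one row per (grp, point); the 'scans' column of the result holds the point
dt1[dt2, on = .(grp, scans >= points), .N, by = .EACHI]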

R: sequence of days between dates

I have the following dataframes:
AllDays
2012-01-01
2012-01-02
2012-01-03
...
2015-08-18
Leases
StartDate EndDate
2012-01-01 2013-01-01
2012-05-07 2013-05-06
2013-09-05 2013-12-01
What I want to do is, for each date in the AllDays data frame, calculate the number of leases in effect on that date. E.g., if there are 4 leases with start date <= 2015-01-01 and end date >= 2015-01-01, then I would like to place a 4 in the row for that date.
I have the following code
for (i in 1:nrow(leases))
{
  occupied = seq(leases$StartDate[i], leases$EndDate[i], by = "days")
  occupied = occupied[occupied < dateOfInt]
  matching = match(occupied, allDays$Date)
  allDays$Occupancy[matching] = allDays$Occupancy[matching] + 1
}
which works, but as I have about 5000 leases, it takes about 1.1 seconds. Does anyone have a more efficient method that would require less computation time?
The date of interest (dateOfInt) is just the current date and is used simply to ensure that lease dates in the future aren't counted.
Using seq is almost surely inefficient: imagine you had a lease in your data that's 10000 years long. seq would take forever and return 10000*365-1 days that don't matter to us. We then have to use match(), which makes the same number of unnecessary comparisons.
I'm not sure the following is the best approach (I'm convinced there's a fully vectorized solution) but it gets closer to the heart of the problem.
Data
set.seed(102349)
days <- data.frame(AllDays = seq(as.Date("2012-01-01"),
                                 as.Date("2015-08-18"), "day"))
leases <- data.frame(StartDate = sample(days$AllDays, 5000L, TRUE))
leases$EndDate <- leases$StartDate + round(rnorm(5000, mean = 365, sd = 100))
Approach
Use data.table and sapply:
library(data.table)
setDT(leases); setDT(days)
days[, lease_count :=
       sapply(AllDays, function(x)
         leases[StartDate <= x & EndDate >= x, .N])][]
AllDays lease_count
1: 2012-01-01 5
2: 2012-01-02 8
3: 2012-01-03 11
4: 2012-01-04 16
5: 2012-01-05 18
---
1322: 2015-08-14 1358
1323: 2015-08-15 1358
1324: 2015-08-16 1360
1325: 2015-08-17 1363
1326: 2015-08-18 1359
This is exactly the kind of problem where foverlaps shines: subsetting one data.table based on overlaps with another (foverlaps seems to be tailored for that purpose).
Based on #MichaelChirico's data.
# foverlaps needs an interval (start, end) in both tables, so duplicate the
# date column to turn each day into a zero-length interval
setkey(days[, AllDays1 := AllDays], AllDays, AllDays1)
setkey(leases, StartDate, EndDate)
foverlaps(leases, days)[, .(lease_count = .N), by = AllDays]
# user system elapsed
# 0.114 0.018 0.136
# #MichaelChirico's approach
# user system elapsed
# 0.909 0.000 0.907
Here is a brief explanation by #Arun of how it works, which got me started with data.table.
Without your data, I can't test whether or not this is faster, but it gets the job done with less code:
for (i in 1:nrow(AllDays))
  AllDays$tally[i] = sum(AllDays$AllDays[i] >= Leases$Start.Date &
                         AllDays$AllDays[i] <= Leases$End.Date)
I used the following to test it; note that the relevant columns in both data frames are formatted as dates:
AllDays = data.frame(AllDays = seq(from=as.Date("2012-01-01"), to=as.Date("2015-08-18"), by=1))
Leases = data.frame(Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")), End.Date = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31")))
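For what it's worth, a fully vectorized sketch along the same lines, using the test data above: count the leases that started on or before each day and subtract those that ended strictly before it.
starts <- findInterval(AllDays$AllDays, sort(Leases$Start.Date))  # Start.Date <= day
ends <- findInterval(AllDays$AllDays, sort(Leases$End.Date),
                     left.open = TRUE)                            # End.Date < day
AllDays$tally <- starts - ends  # leases in effect on each day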
An alternative approach, but I'm not sure it's faster.
library(lubridate)
library(dplyr)
AllDays = data.frame(dates = c("2012-02-01","2012-03-02","2012-04-03"))
Lease = data.frame(start = c("2012-01-03","2012-03-01","2012-04-02"),
end = c("2012-02-05","2012-04-15","2012-07-11"))
# transform to dates
AllDays$dates = ymd(AllDays$dates)
Lease$start = ymd(Lease$start)
Lease$end = ymd(Lease$end)
# create the range id
Lease$id = 1:nrow(Lease)
AllDays
# dates
# 1 2012-02-01
# 2 2012-03-02
# 3 2012-04-03
Lease
# start end id
# 1 2012-01-03 2012-02-05 1
# 2 2012-03-01 2012-04-15 2
# 3 2012-04-02 2012-07-11 3
data.frame(expand.grid(AllDays$dates, Lease$id)) %>% # create combinations of dates and ranges
  select(dates = Var1, id = Var2) %>%
  inner_join(Lease, by = "id") %>% # join lease information
  rowwise %>%
  do(data.frame(dates = .$dates,
                flag = ifelse(.$dates %in% seq(.$start, .$end, by = "1 day"), 1, 0))) %>% # build each range and check whether the date is in it
  ungroup %>%
  group_by(dates) %>%
  summarise(N = sum(flag))
# dates N
# 1 2012-02-01 1
# 2 2012-03-02 1
# 3 2012-04-03 2
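A sketch of the same idea without the rowwise/seq step, comparing the dates directly inside summarise (this avoids materializing every day of every range):
expand.grid(dates = AllDays$dates, id = Lease$id) %>%
  inner_join(Lease, by = "id") %>%
  group_by(dates) %>%
  summarise(N = sum(dates >= start & dates <= end))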
Try the lubridate package. Create an interval for each lease, then count the lease intervals in which each date falls.
# make some data
AllDays <- data.frame("Days" = seq.Date(as.Date("2012-01-01"), as.Date("2012-02-01"), by = 1))
Leases <- data.frame("StartDate" = as.Date(c("2012-01-01", "2012-01-08")),
"EndDate" = as.Date(c("2012-01-10", "2012-01-21")))
library(lubridate)
x <- new_interval(Leases$StartDate, Leases$EndDate, tzone = "UTC")
AllDays$NumberInEffect <- sapply(AllDays$Days, function(a){sum(a %within% x)})
The Output
head(AllDays)
Days NumberInEffect
1 2012-01-01 1
2 2012-01-02 1
3 2012-01-03 1
4 2012-01-04 1
5 2012-01-05 1
6 2012-01-06 1
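A small note for current lubridate versions: new_interval() has since been deprecated in favour of interval(), which is a drop-in replacement here:
x <- interval(Leases$StartDate, Leases$EndDate, tzone = "UTC")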

R Optimal way to create time series from start and end dates for groups

I have a data set where for each group I have a start and an end date. I want to turn this data into one where for each time period (month) I have one row of observation for each group.
Here is a sample of input data, groups are identified by id:
structure(list(id = c(723654, 885618, 269861, 1383642, 250276,
815511, 1506680, 1567855, 667345, 795731), startdate = c("2008-06-29",
"2008-12-01", "2006-09-27", "2010-02-03", "2006-08-31", "2008-09-10",
"2010-04-11", "2010-05-15", "2008-04-12", "2008-08-28"), enddate = c("2008-08-13",
"2009-02-08", "2007-10-12", "2010-09-09", "2007-06-30", "2010-04-27",
"2010-04-13", "2010-05-16", "2010-04-20", "2010-03-09")), .Names = c("id",
"startdate", "enddate"), class = "data.frame", row.names = c("1",
"2", "3", "4", "6", "7", "8", "9", "10", "11"))
I wrote a function and vectorized it. The function takes the three parameters stored in each row and generates the time series with group identifiers.
genDateRange <- function(start, end, id) {
  dates <- seq(as.Date(start), as.Date(end), by = "month")
  return(cbind(month = as.character(dates), id = rep(id, length(dates))))
}
genDataRange<-Vectorize(genDateRange)
I run the function as follows to get a data frame. I have over 6M lines in the output, so it takes forever. I need a faster way.
range<-do.call(rbind,genDataRange(dat$startdate, dat$enddate, dat$id))
First ten lines of output looks like this:
structure(c("2008-06-29", "2008-07-29", "2008-12-01", "2009-01-01",
"2009-02-01", "2006-09-27", "2006-10-27", "2006-11-27", "2006-12-27",
"2007-01-27", "723654", "723654", "885618", "885618", "885618",
"269861", "269861", "269861", "269861", "269861"), .Dim = c(10L,
2L), .Dimnames = list(NULL, c("month", "id")))
I would appreciate a faster way to do this. I think I have focused too much on one approach and am missing a much simpler solution.
No need to use the generator function or rbindlist, because data.table can easily handle this without them.
# start with a data.table and date columns
library(data.table)
dat <- data.table(dat)
dat[, `:=`(startdate = as.Date(startdate), enddate = as.Date(enddate))]
dat[, num_mons := length(seq(from = startdate, to = enddate, by = "month")),
    by = 1:nrow(dat)]
dat # now your data.table looks like this
# id startdate enddate num_mons
# 1: 723654 2008-06-29 2008-08-13 2
# 2: 885618 2008-12-01 2009-02-08 3
# 3: 269861 2006-09-27 2007-10-12 13
# 4: 1383642 2010-02-03 2010-09-09 8
# 5: 250276 2006-08-31 2007-06-30 10
# 6: 815511 2008-09-10 2010-04-27 20
# 7: 1506680 2010-04-11 2010-04-13 1
# 8: 1567855 2010-05-15 2010-05-16 1
# 9: 667345 2008-04-12 2010-04-20 25
# 10: 795731 2008-08-28 2010-03-09 19
out <- dat[, list(month=seq.Date(startdate, by="month",length.out=num_mons)), by=id]
out
# id month
# 1: 723654 2008-06-29
# 2: 723654 2008-07-29
# 3: 885618 2008-12-01
# 4: 885618 2009-01-01
# 5: 885618 2009-02-01
# ---
# 98: 795731 2009-10-28
# 99: 795731 2009-11-28
# 100: 795731 2009-12-28
# 101: 795731 2010-01-28
# 102: 795731 2010-02-28
This question is related, but the difference is that in your question we're iterating, not duplicating rows in a data.table.
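As a follow-up sketch: since num_mons is only used to feed length.out, you can also let seq compute the endpoint itself and skip the helper column (this assumes ids are unique, as in the sample; otherwise group by row instead):
out <- dat[, .(month = seq(startdate, enddate, by = "month")), by = id]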
For large datasets this
library(data.table)
range <- rbindlist(lapply(genDataRange(dat$startdate, dat$enddate, dat$id),as.data.frame))
should be faster than
range<-do.call(rbind,genDataRange(dat$startdate, dat$enddate, dat$id))

How to reshape a dataframe with "reoccurring" columns?

I am new to data analysis with R. I recently got a pre-formatted environmental observation-model dataset, an example subset of which is shown below:
date site obs mod site obs mod
2000-09-01 00:00:00 campus NA 61.63 city centre 66 56.69
2000-09-01 01:00:00 campus 52 62.55 city centre NA 54.75
2000-09-01 02:00:00 campus 52 63.52 city centre 56 54.65
Basically, the data include the time series of hourly observed and modelled concentrations of a pollutant at various sites in "reoccurring columns", i.e., site - obs - mod (in the example I only showed 2 out of the total 75 sites). I read this "wide" dataset in as a data frame, and wanted to reshape it into the "narrower" format as:
date site obs mod
2000-09-01 00:00:00 campus NA 61.63
2000-09-01 01:00:00 campus 52 62.55
2000-09-01 02:00:00 campus 52 63.52
2000-09-01 00:00:00 city centre 66 56.69
2000-09-01 01:00:00 city centre NA 54.75
2000-09-01 02:00:00 city centre 56 54.65
I believed that I should use the package "reshape2" to do this. Firstly I tried to melt and then dcast the dataset:
test.melt <- melt(test.data, id.vars = "date", measure.vars = c("site", "obs", "mod"))
However, it only returned half of the data, i.e., records of the site(s) ("city centre") following the first one ("campus") were all cut off:
date variable value
2001-01-01 00:00:00 site campus
2001-01-01 01:00:00 site campus
2001-01-01 02:00:00 site campus
2001-01-01 00:00:00 obs NA
2001-01-01 01:00:00 obs 52
2001-01-01 02:00:00 obs 52
2001-01-01 00:00:00 mod 61.63
2001-01-01 01:00:00 mod 62.55
2001-01-01 02:00:00 mod 63.52
I then tried recast:
test.recast <- recast(test.data, date ~ site + obs + mod)
However, it returned with error message:
Error in eval(expr, envir, enclos) : object 'site' not found
I have tried to search for previous questions but have not found similar scenarios (correct me if I am wrong). Could someone please help me with this?
Many thanks in advance!
You might be better off using base R reshape after doing some variable name cleanup.
Here's your data.
test <- read.table(header = TRUE, stringsAsFactors=FALSE,
text = "date site obs mod site obs mod
'2000-09-01 00:00:00' campus NA 61.63 'city centre' 66 56.69
'2000-09-01 01:00:00' campus 52 62.55 'city centre' NA 54.75
'2000-09-01 02:00:00' campus 52 63.52 'city centre' 56 54.65")
test
# date site obs mod site.1 obs.1 mod.1
# 1 2000-09-01 00:00:00 campus NA 61.63 city centre 66 56.69
# 2 2000-09-01 01:00:00 campus 52 62.55 city centre NA 54.75
# 3 2000-09-01 02:00:00 campus 52 63.52 city centre 56 54.65
If you did this correctly, you should get names like I got: as #chase mentions in his answer, "recurring column names is a bit of an oddity and is not normal R behaviour", so we've got to fix that.
Note: Both of these options generate a "time" variable which you can go ahead and drop. You might want to keep it just in case you wanted to reshape back into a wide format.
Option 1: If you got names like I did (which you should have), the solution is simple. For the first site, just append "0" to the site name and use base R reshape:
names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".")
test <- reshape(test, direction = "long",
idvar = "date", varying = 2:ncol(test))
rownames(test) <- NULL # reshape makes UGLY rownames
test
# date time site obs mod
# 1 2000-09-01 00:00:00 0 campus NA 61.63
# 2 2000-09-01 01:00:00 0 campus 52 62.55
# 3 2000-09-01 02:00:00 0 campus 52 63.52
# 4 2000-09-01 00:00:00 1 city centre 66 56.69
# 5 2000-09-01 01:00:00 1 city centre NA 54.75
# 6 2000-09-01 02:00:00 1 city centre 56 54.65
Option 2: If you really do have duplicated column names, the fix is still easy, and follows the same logic. First, create nicer column names (easy to do using rep()), and then use reshape() as described above.
names(test)[-1] <- paste(names(test)[-1],
rep(1:((ncol(test)-1)/3), each = 3), sep = ".")
test <- reshape(test, direction = "long",
idvar = "date", varying = 2:ncol(test))
rownames(test) <- NULL
### Or, more convenient:
# names(test) <- make.unique(names(test))
# names(test)[2:4] <- paste(names(test)[2:4], "0", sep=".")
# test <- reshape(test, direction = "long",
# idvar = "date", varying = 2:ncol(test))
# rownames(test) <- NULL
Optional step: The data in this form are still not totally "long". If that is required, all that is required is one more step:
require(reshape2)
melt(test, id.vars = c("date", "site", "time"))
# date site time variable value
# 1 2000-09-01 00:00:00 campus 0 obs NA
# 2 2000-09-01 01:00:00 campus 0 obs 52.00
# 3 2000-09-01 02:00:00 campus 0 obs 52.00
# 4 2000-09-01 00:00:00 city centre 1 obs 66.00
# 5 2000-09-01 01:00:00 city centre 1 obs NA
# 6 2000-09-01 02:00:00 city centre 1 obs 56.00
# 7 2000-09-01 00:00:00 campus 0 mod 61.63
# 8 2000-09-01 01:00:00 campus 0 mod 62.55
# 9 2000-09-01 02:00:00 campus 0 mod 63.52
# 10 2000-09-01 00:00:00 city centre 1 mod 56.69
# 11 2000-09-01 01:00:00 city centre 1 mod 54.75
# 12 2000-09-01 02:00:00 city centre 1 mod 54.65
Update (to try to address some questions from the comments)
The reshape() documentation is quite confusing. It's best to work through a few examples to get an understanding of how it works. Specifically, "time" does not have to refer to time ("date" in your problem), but is more for, say, panel data, where records are collected at different times for the same ID. In your case, the only "id" in the original data is the "date" column. The other potential "id" is the site, but not in the way the data are organized.
Imagine, for a moment, if your data looked like this:
test1 <- structure(list(date = structure(1:3,
.Label = c("2000-09-01 00:00:00",
"2000-09-01 01:00:00", "2000-09-01 02:00:00"), class = "factor"),
obs.campus = c(NA, 52L, 52L), mod.campus = c(61.63, 62.55,
63.52), obs.cityCentre = c(66L, NA, 56L), mod.cityCentre = c(56.69,
54.75, 54.65)), .Names = c("date", "obs.campus", "mod.campus",
"obs.cityCentre", "mod.cityCentre"), class = "data.frame", row.names = c(NA,
-3L))
test1
# date obs.campus mod.campus obs.cityCentre mod.cityCentre
# 1 2000-09-01 00:00:00 NA 61.63 66 56.69
# 2 2000-09-01 01:00:00 52 62.55 NA 54.75
# 3 2000-09-01 02:00:00 52 63.52 56 54.65
Now try reshape(test1, direction = "long", idvar = "date", varying = 2:ncol(test1)). You'll see that reshape() treats the site names as "time" (that can be overridden by adding timevar = "site" to your reshape command).
When direction = "long", you must specify which columns vary with "time". In your case, that is all the columns except for the first, hence my use of 2:ncol(test) for "varying".
Question under #Chase's answer: I think you misunderstand how melt() is supposed to work. Basically, it tries to get you the "skinniest" form of your data. In this case, the skinniest form would be the "optional step" described above since date + site would be the minimum required to comprise a unique ID variable. (I would say that "time" can safely be dropped.)
Once your data are in the format described in the "optional step" (we'll assume that the output has been stored as "test.melt"), you can always easily pivot the table around in different ways. As a demonstration of what I mean by that, try the following and see what they do:
dcast(test.melt, date + site ~ variable)
dcast(test.melt, date ~ variable + site)
dcast(test.melt, variable + site ~ date)
dcast(test.melt, variable + date ~ site)
It is not easy to have that flexibility if you stop at "Option 1" or "Option 2".
Update (a few years later)
melt from "data.table" can now "melt" multiple columns in a similar way that reshape does. It should work whether or not the column names are duplicated.
You can try the following:
measure <- c("site", "obs", "mod")
melt(as.data.table(test), measure.vars = patterns(measure), value.name = measure)
# date variable site obs mod
# 1: 2000-09-01 00:00:00 1 campus NA 61.63
# 2: 2000-09-01 01:00:00 1 campus 52 62.55
# 3: 2000-09-01 02:00:00 1 campus 52 63.52
# 4: 2000-09-01 00:00:00 2 city centre 66 56.69
# 5: 2000-09-01 01:00:00 2 city centre NA 54.75
# 6: 2000-09-01 02:00:00 2 city centre 56 54.65
The fact that you have recurring column names is a bit of an oddity and is not normal R behaviour. Most of the time R forces you to have valid names via the make.names() function. Regardless, I'm able to duplicate your problem. Note that I made my own example since yours isn't reproducible, but the logic is the same.
#Do not force unique names
s <- data.frame(id = 1:3, x = runif(3), x = runif(3), check.names = FALSE)
#-----
id x x
1 1 0.6845270 0.5218344
2 2 0.7662200 0.6179444
3 3 0.4110043 0.1104774
#Now try to melt, note that 1/2 of your x-values are missing!
melt(s, id.vars = 1)
#-----
id variable value
1 1 x 0.6845270
2 2 x 0.7662200
3 3 x 0.4110043
The solution is to make your column names unique. As I said before, R does this by default in most cases. However, you can do it after the fact via make.unique()
names(s) <- make.unique(names(s))
#-----
[1] "id" "x" "x.1"
Note that the second column of x now has a 1 appended to it. Now melt() works as you'd expect:
melt(s, id.vars = 1)
#-----
id variable value
1 1 x 0.6845270
2 2 x 0.7662200
3 3 x 0.4110043
4 1 x.1 0.5218344
5 2 x.1 0.6179444
6 3 x.1 0.1104774
At this point, if you want to treat x and x.1 as the same variable, a little gsub() or other regex function will get rid of the offending characters. This is a workflow I use quite often.
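For instance, a small sketch of that cleanup, stripping the numeric suffix that make.unique() added so that x and x.1 collapse back into one variable:
m <- melt(s, id.vars = 1)
m$variable <- gsub("\\.\\d+$", "", m$variable) # "x.1" -> "x"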
