Prepare Time Series for Machine Learning - Long to Wide Format - r

I have a data frame of time series data in a 'long' format where there is 1 row/observation per day. I would like to transform this data into a 'wide' format. Each row/observation should have the time series value for the current date and the previous 2 days.
To provide a concrete example, I will use the Air Quality data available in R. This is what my input data frame looks like.
> input <- airquality[1:4,c("Month", "Day", "Ozone")]
> input
Month Day Ozone
1 5 1 41
2 5 2 36
3 5 3 12
4 5 4 18
I would like to transform this input so that it looks like the following.
output <- data.frame(Month = 5, Day = 1:4, Ozone=c(41,36,12,18), Ozone.Prev.1=c(NA,41,36,12), Ozone.Prev.2=c(NA,NA,41,36))
> output
Month Day Ozone Ozone.Prev.1 Ozone.Prev.2
1 5 1 41 NA NA
2 5 2 36 41 NA
3 5 3 12 36 41
4 5 4 18 12 36
Any suggestions on a nice, clean way to do this? Many thanks in advance.

You can use the lag function from zoo, but the following small function does the trick without any additional packages:
shift_vector = function(vec, n) c(rep(NA, n), head(vec, -n))
output = transform(input, prev_1 = shift_vector(Ozone, 1),
prev_2 = shift_vector(Ozone, 2))
output
Month Day Ozone prev_1 prev_2
1 5 1 41 NA NA
2 5 2 36 41 NA
3 5 3 12 36 41
4 5 4 18 12 36
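If more than a couple of lags are needed, the same idea can be wrapped in a loop. The following is only a sketch: add_lags is a hypothetical helper name, and it assumes the rows are already sorted by date with no missing days, since it shifts by row position rather than by calendar date.

```r
# Hypothetical helper (not from any package): add lag columns 1..k for one column.
# Assumes rows are sorted by date and that there are no gaps between days.
add_lags <- function(df, col, k) {
  x <- df[[col]]
  n <- length(x)
  for (i in seq_len(k)) {
    # Pad the front with i NAs and drop the last i values (capped at n).
    df[[paste0(col, ".Prev.", i)]] <- c(rep(NA, min(i, n)), head(x, max(n - i, 0)))
  }
  df
}

input <- airquality[1:4, c("Month", "Day", "Ozone")]
output <- add_lags(input, "Ozone", 2)
output
```

The column names follow the Ozone.Prev.1, Ozone.Prev.2 pattern requested in the question.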

Related

Transpose column and group dataframe [duplicate]

This question already has answers here:
How to reshape data from long to wide format
I'm trying to change a dataframe in R to group multiple rows by a measurement. The table has a location (km), a size (mm), a count of things in that size bin, a site and a year. I want to take the sizes, make a column from each one (2, 4 and 6 in this example), and place the corresponding count into the row for that location, site and year.
It seems like a combination of transposing and grouping, but I can't figure out a way to accomplish this in R. I've looked at t(), dcast() and aggregate(), but those aren't really close at all.
So I would go from something like this:
df <- data.frame(km=c(rep(32,3),rep(50,3)), mm=rep(c(2,4,6),2), count=sample(1:25,6), site=rep("A", 6), year=rep(2013, 6))
km mm count site year
1 32 2 18 A 2013
2 32 4 2 A 2013
3 32 6 12 A 2013
4 50 2 3 A 2013
5 50 4 17 A 2013
6 50 6 21 A 2013
To this:
km site year mm_2 mm_4 mm_6
1 32 A 2013 18 2 12
2 50 A 2013 3 17 21
Edit: I tried the solution in a suggested duplicate, but it did not work for me; I'm not really sure why. The answer below worked better.
As suggested in the comment above, we can use the sep argument in spread:
library(tidyr)
spread(df, mm, count, sep = "_")
km site year mm_2 mm_4 mm_6
1 32 A 2013 4 20 1
2 50 A 2013 15 14 22
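spread() has since been superseded in tidyr by pivot_wider(). Here is a minimal sketch of the same reshape, assuming tidyr 1.0.0 or later is installed. The counts are hard-coded from the table in the question rather than sampled, so the result is reproducible.

```r
library(tidyr)

# Counts copied from the example table in the question (not re-sampled).
df <- data.frame(km = rep(c(32, 50), each = 3),
                 mm = rep(c(2, 4, 6), 2),
                 count = c(18, 2, 12, 3, 17, 21),
                 site = "A",
                 year = 2013)

# One column per mm value, prefixed with "mm_", filled with the counts.
wide <- pivot_wider(df, names_from = mm, values_from = count,
                    names_prefix = "mm_")
wide
```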
As you mentioned dcast(), here is a method using it.
set.seed(1)
df <- data.frame(km=c(rep(32,3),rep(50,3)),
mm=rep(c(2,4,6),2),
count=sample(1:25,6),
site=rep("A", 6),
year=rep(2013, 6))
library(reshape2)
dcast(df, ... ~ mm, value.var="count")
# km site year 2 4 6
# 1 32 A 2013 13 10 20
# 2 50 A 2013 3 17 1
And if you want a bit of a challenge you can try the base function reshape().
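# site and year are part of the row identifier, so list them in idvar
df2 <- reshape(df, v.names="count", idvar=c("km", "site", "year"), timevar="mm", direction="wide")
# fixed = TRUE so that the "." is matched literally rather than as a regex wildcard
colnames(df2) <- sub("count.", "mm_", colnames(df2), fixed = TRUE)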
df2
# km site year mm_2 mm_4 mm_6
# 1 32 A 2013 13 10 20
# 4 50 A 2013 3 17 1

(In)correct use of a linear time trend variable, and most efficient fix?

I have 3133 rows representing payments made on some of the 5296 days between 7/1/2000 and 12/31/2014; that is, the "Date" feature is non-continuous:
> head(d_exp_0014)
Year Month Day Amount Count myDate
1 2000 7 6 792078.6 9 2000-07-06
2 2000 7 7 140065.5 9 2000-07-07
3 2000 7 11 190553.2 9 2000-07-11
4 2000 7 12 119208.6 9 2000-07-12
5 2000 7 16 1068156.3 9 2000-07-16
6 2000 7 17 0.0 9 2000-07-17
I would like to fit a linear time trend variable,
t <- 1:3133
to a linear model explaining the variation in the Amount of the expenditure.
fit_t <- lm(Amount ~ t + Count, d_exp_0014)
However, this is obviously wrong, because t increases by 1 per row while the gaps between consecutive dates vary:
> head(exp)
Year Month Day Amount Count Date t
1 2000 7 6 792078.6 9 2000-07-06 1
2 2000 7 7 140065.5 9 2000-07-07 2
3 2000 7 11 190553.2 9 2000-07-11 3
4 2000 7 12 119208.6 9 2000-07-12 4
5 2000 7 16 1068156.3 9 2000-07-16 5
6 2000 7 17 0.0 9 2000-07-17 6
Which to me is the exact opposite of a linear trend.
What is the most efficient way to get this data.frame merged to a continuous date-index? Will a date vector like
CTS_date_V <- data.frame(Date = seq(as.Date("2000/07/01"), as.Date("2014/12/31"), by = "day"))
yield different results?
I'm open to any packages (using fpp, forecast, timeSeries, xts, ts, as of right now); just looking for a good answer to deploy in functional form, since these payments are going to be updated every week and I'd like to automate the append to this data.frame.
I think some kind of transformation to regular (continuous) time series is a good idea.
You can use xts to transform the time series data; it is handy because xts objects can be used in other packages like regular ts objects.
Filling the gaps
library(xts)
# convert myDate to Date or POSIXct first if necessary
# create an xts object from data frame x
ts1 <- xts(data.frame(a = x$Amount, c = x$Count), order.by = x$myDate)
ts1
# create a regular daily index spanning the observed range
ts_empty <- seq(from = start(ts1), to = end(ts1), by = "DSTday")
# merge the regular index into the data and fill the gaps with 0
ts2 <- merge(ts1, ts_empty, fill = 0)
# or fill the gaps with NA and then impute them, for example:
ts2 <- merge(ts1, ts_empty, fill = NA)
ts2 <- na.locf(ts2)
# zoo-xts ready functions are:
# na.locf - constant previous value
# na.approx - linear approximation
# na.spline - cubic spline interpolation
Deduplicate dates
In your sample there is no sign of duplicated dates, but based on a new question they are very likely. I think you want to aggregate the values per day with a sum:
# colSums keeps the two columns separate; plain sum would add Amount and Count together
ts1 <- period.apply( ts1, endpoints(ts1,'days'), colSums)
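For comparison, here is a base R sketch of the two fixes discussed in this thread, with no extra packages. The toy data frame d stands in for d_exp_0014 and copies the six rows shown in the question. The names d, calendar and full are made up for illustration, and treating a missing day as a payment of 0 is an assumption that may not suit your data.

```r
# Toy stand-in for d_exp_0014 (values copied from the head() shown in the question).
d <- data.frame(
  myDate = as.Date(c("2000-07-06", "2000-07-07", "2000-07-11",
                     "2000-07-12", "2000-07-16", "2000-07-17")),
  Amount = c(792078.6, 140065.5, 190553.2, 119208.6, 1068156.3, 0),
  Count  = 9
)

# Option 2 first: left-join onto a complete daily calendar; missing days become NA.
calendar <- data.frame(myDate = seq(min(d$myDate), max(d$myDate), by = "day"))
full <- merge(calendar, d, by = "myDate", all.x = TRUE)
full$Amount[is.na(full$Amount)] <- 0   # assumption: no recorded payment means 0
full$t <- seq_len(nrow(full))          # row number is now a true daily trend

# Option 1: keep the irregular rows, but measure the trend in elapsed days.
d$t <- as.numeric(d$myDate - min(d$myDate)) + 1
d$t
```

With option 1, lm(Amount ~ t + Count, d) uses a trend that advances by the real number of days between payments, so no gap filling is needed just to get a linear trend.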

Extract intervals from time data in R

My problem is simple. I have a table where each row is an event (month, day, hour and minute are given). However, the machine was set to record 24/7, so I have more events (rows) than I need. How can I remove the surplus daytime rows and keep only the rows from the night (from sunset to sunrise)?
The dreadful thing is that the timing of sunrise/sunset is slightly different each day.
In this example I provide two tables. The first is the table with all events; the second contains the timings of sunset/sunrise for each day.
If the rows can be extracted, please note that EACH night spans two dates. Could an additional column containing the ID of the night be inserted in the table? (see scheme below)
# table with all events
my.table <- data.frame(event = 1:34,
day = rep(c(30,31,1,2,3), times = c(8,9,7,8,2)),
month = rep(c(3,4), each = 17),
hour = c(13,13,13,13,22,
22,23,23,2,2,2,
14,14,14,19,22,22,
2,2,2,14,15,22,22,
3,3,3,14,14,14,
23,23,2,14),
minute = c(11,13,44,55,27,
32,54,57,10,14,
26,12,16,46,30,
12,13,14,16,45,
12,15,12,15,24,
26,28,12,16,23,12,13,11,11))
# timings of sunset/sunrise for each day
sun.table <- data.frame(day = c(30,31,31,1,1,2,2,3),
month = rep(c(3,4), times = c(3,5)),
hour = rep(c(19,6), times = 4),
minute = c(30,30,31,29,32,
28,33,27),
type = rep(c("sunset","sunrise"), times = 4))
# the correct reduced table would contain only rows:
# 5,6,7,8,9,10,11,16,17,18,19,20,23,24,25,26,27,31,32,33.
# nrow("reduced table") == 20
Here's one possible strategy:
# convert sunset/sunrise times to proper date-times
ss <- with(sun.table, ISOdate(2000,month,day,hour,minute))
sunset <- ss[seq(1,length(ss),by=2)]
sunrise <- ss[seq(2,length(ss),by=2)]
Here I assume the table is ordered, starts with a sunset, alternates back and forth, and ends with a sunrise. Date values also need a year; here I just hard-coded 2000. As long as your data doesn't span years (or leap days) that should be fine, but you'll probably want to pop in the actual year of your observations.
Now do the same for events
tt <- with(my.table, ISOdate(2000,month,day,hour,minute))
Find rows during the night
night <- sapply(tt, function(x) any(sunset<x & x<sunrise))
and extract those rows
my.table[night, ]
# event day month hour minute
# 5 5 30 3 22 27
# 6 6 30 3 22 32
# 7 7 30 3 23 54
# 8 8 30 3 23 57
# 9 9 31 3 2 10
# 10 10 31 3 2 14
# 11 11 31 3 2 26
# 16 16 31 3 22 12
# 17 17 31 3 22 13
# 18 18 1 4 2 14
# 19 19 1 4 2 16
# 20 20 1 4 2 45
# 23 23 1 4 22 12
# 24 24 1 4 22 15
# 25 25 2 4 3 24
# 26 26 2 4 3 26
# 27 27 2 4 3 28
# 31 31 2 4 23 12
# 32 32 2 4 23 13
# 33 33 3 4 2 11
Here we only keep events that fall after a sunset and before the following sunrise. Row 34 occurs after the last sunrise listed in sun.table, so it does not fall inside any night interval and is not returned.
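The question also asks for a night ID column. One way to sketch that in base R is findInterval(), which returns the index of the most recent sunset for each event. The toy timestamps and the names sunset, sunrise, events and night_id below are made up for illustration. The year 2000 is hard-coded as in the code above, and the sunset and sunrise vectors are assumed to be sorted and paired.

```r
# Toy data: two nights (sunset i is paired with sunrise i).
sunset  <- ISOdate(2000, 3, c(30, 31), 19, c(30, 31))      # Mar 30 19:30, Mar 31 19:31
sunrise <- ISOdate(2000, c(3, 4), c(31, 1), 6, c(30, 29))  # Mar 31 06:30, Apr 1 06:29

# Six events, alternating between daytime and night-time.
events <- ISOdate(2000, c(3, 3, 3, 3, 4, 4), c(30, 30, 31, 31, 1, 1),
                  c(13, 22, 2, 14, 2, 14), c(11, 27, 10, 12, 14, 12))

# Index of the most recent sunset at or before each event (0 if there is none).
idx <- findInterval(as.numeric(events), as.numeric(sunset))

# An event belongs to night idx only if it also precedes that night's sunrise.
night_id <- ifelse(idx > 0 & events < sunrise[pmax(idx, 1)], idx, NA)
night_id
```

Rows where night_id is NA are daytime events, so filtering on !is.na(night_id) gives the reduced table and the ID in one pass.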

Reshape in R when "time" values differ between subjects

I've looked all over the web including StackOverflow, and tested various things before asking this question, but pardon me if I missed an excellent answer.
I see lots of help for the reshape function (and the package too, but I can't get either to do what I need). I have a "time" variable that differs by subject, i.e., it is not time1, time2, time3. I would like to make a wide data set that treats each unique time value by subject ID as just "time1", "time2", "time3", but I need to keep the dates. To make this concrete, here is some sample data:
Id<-c(1, 1,1, 2,2,2, 3)
date<-c("Jan10", "Jun11", "Dec11", "Feb10", "May10", "Dec10", "Jan11")
Score<-c(52, 43, 67, 56, 33, 21, 20)
format2<-data.frame(Id, date, Score)
format2
Id date Score
1 1 Jan10 52
2 1 Jun11 43
3 1 Dec11 67
4 2 Feb10 56
5 2 May10 33
6 2 Dec10 21
7 3 Jan11 20
I would like it to look like this:
Id date1 Score1 date2 Score2 date3 Score3
1 Jan10 52 Jun11 43 Dec11 67
2 Feb10 56 Dec10 21 May10 33
3 Jan11 20 NA NA NA NA
Thank you for any help and my apologies if I have missed an obvious answer.
You need to generate a time variable, which can be done quickly using ave():
format2$time <- ave(format2$Id, format2$Id, FUN=seq_along)
reshape(format2, direction = "wide", idvar="Id", timevar="time")
# Id date.1 Score.1 date.2 Score.2 date.3 Score.3
# 1 1 Jan10 52 Jun11 43 Dec11 67
# 4 2 Feb10 56 May10 33 Dec10 21
# 7 3 Jan11 20 <NA> NA <NA> NA
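To see what the ave() step contributes, here is the same data again as a self-contained sketch, with the generated counter held in a separate vector (time_index is just an illustrative name). The counter follows row order within each Id, so this assumes the rows are already in the order you want the visits numbered.

```r
Id    <- c(1, 1, 1, 2, 2, 2, 3)
date  <- c("Jan10", "Jun11", "Dec11", "Feb10", "May10", "Dec10", "Jan11")
Score <- c(52, 43, 67, 56, 33, 21, 20)
format2 <- data.frame(Id, date, Score)

# seq_along() is applied within each Id group, giving a per-subject visit counter.
time_index <- ave(format2$Id, format2$Id, FUN = seq_along)
time_index

# The counter then serves as the timevar for the wide reshape.
format2$time <- time_index
wide <- reshape(format2, direction = "wide", idvar = "Id", timevar = "time")
wide
```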
Some people prefer the reshape2 package because of its syntax, but even there, you need to have a time variable before you can do anything interesting.
Continuing from above (where the time variable was created):
library(reshape2)
format2m <- melt(format2, id.vars=c("Id", "time"))
dcast(format2m, Id ~ variable + time)
# Id date_1 date_2 date_3 Score_1 Score_2 Score_3
# 1 1 Jan10 Jun11 Dec11 52 43 67
# 2 2 Feb10 May10 Dec10 56 33 21
# 3 3 Jan11 <NA> <NA> 20 <NA> <NA>

cross sectional sub-sets in data.table

I have a data.table which contains multiple columns, which is well represented by the following:
DT <- data.table(date = as.IDate(rep(c("2012-10-17", "2012-10-18", "2012-10-19"), each=10)),
session = c(1,2,3), price = c(10, 11, 12,13,14),
volume = runif(30, min=10, max=1000))
I would like to extract a multiple column table which shows the volume traded at each price in a particular type of session -- with each column representing a date.
At present, I extract this data one date at a time using the following:
DT[session==1,][date=="2012-10-17", sum(volume), by=price]
and then bind the columns.
Is there a way of obtaining the end product (a table with each column referring to a particular date) without sticking all the single queries together, as I'm currently doing?
thanks
Does the following do what you want?
It uses a combination of reshape2 and data.table:
library(reshape2)
.DT <- DT[,sum(volume),by = list(price,date,session)][, DATE := as.character(date)]
# reshape2 for casting to wide -- it doesn't seem to like IDate columns, hence
# the character DATE column
dcast(.DT, session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
6 2 10 NA 755.2650 998.7646
7 2 11 251.3691 695.0153 NA
8 2 12 791.6882 NA 275.4777
9 2 13 NA 111.7700 240.3329
10 2 14 230.6461 817.9438 NA
11 3 10 902.9220 NA 870.3641
12 3 11 NA 719.8441 963.1768
13 3 12 361.8612 563.9518 NA
14 3 13 393.6963 NA 718.7878
15 3 14 NA 871.4986 582.6158
If you just wanted session 1
dcast(.DT[session == 1L], session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
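As a dependency-free cross-check, the same price-by-date table of summed volumes can be sketched in base R with tapply(). The small data frame below is made-up toy data (not the DT from the question), chosen so the sums are easy to verify by hand. Combinations with no trades come out as NA, as in the dcast() output above.

```r
# Toy data: made-up volumes, two dates, two sessions, two prices.
df <- data.frame(
  date    = rep(c("2012-10-17", "2012-10-18"), each = 4),
  session = rep(c(1, 2), times = 4),
  price   = c(10, 10, 11, 11, 10, 11, 10, 11),
  volume  = c(100, 200, 300, 400, 500, 600, 700, 800)
)

# Keep one session, then sum volume for every price x date combination.
s1 <- df[df$session == 1, ]
wide <- with(s1, tapply(volume, list(price = price, date = date), sum))
wide
```

The result is a matrix with prices as row names and dates as column names. Wrap it in as.data.frame() if a data frame is needed.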
