Generate entries in time series data - r

I want to generate a row with a zero amount for each missing month (up to the current month) in the following data frame. Can you please give me a hand with this? Thanks!
trans_date ammount
1 2004-12-01 2968.91
2 2005-04-01 500.62
3 2005-05-01 434.30
4 2005-06-01 549.15
5 2005-07-01 276.77
6 2005-09-01 548.64
7 2005-10-01 761.69
8 2005-11-01 636.77
9 2005-12-01 1517.58
10 2006-03-01 719.09
11 2006-04-01 1231.88
12 2006-05-01 580.46
13 2006-07-01 1468.43
14 2006-10-01 692.22
15 2006-11-01 505.81
16 2006-12-01 1589.70
17 2007-03-01 1559.82
18 2007-06-01 764.98
19 2007-07-01 964.77
20 2007-09-01 405.18
21 2007-11-01 112.42
22 2007-12-01 1134.08
23 2008-02-01 269.72
24 2008-03-01 208.96
25 2008-04-01 353.58
26 2008-05-01 756.00
27 2008-06-01 747.85
28 2008-07-01 781.62
29 2008-09-01 195.36
30 2008-10-01 424.24
31 2008-12-01 166.23
32 2009-02-01 237.11
33 2009-04-01 110.94
34 2009-07-01 191.29
35 2009-11-01 153.42
36 2009-12-01 222.87
37 2010-09-01 1259.97
38 2010-11-01 375.61
39 2010-12-01 496.48
40 2011-02-01 360.07
41 2011-03-01 324.95
42 2011-04-01 566.93
43 2011-06-01 281.19
44 2011-08-01 428.04
'data.frame': 44 obs. of 2 variables:
$ trans_date : Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount: num 2969 501 434 549 277 ...

You can use seq.Date and merge:
> str(df)
'data.frame': 44 obs. of 2 variables:
$ trans_date: Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount : num 2969 501 434 549 277 ...
> mns <- data.frame(trans_date = seq.Date(min(df$trans_date), max(df$trans_date), by = "month"))
> df2 <- merge(mns, df, all = TRUE)
> df2$ammount <- ifelse(is.na(df2$ammount), 0, df2$ammount)
> head(df2)
trans_date ammount
1 2004-12-01 2968.91
2 2005-01-01 0.00
3 2005-02-01 0.00
4 2005-03-01 0.00
5 2005-04-01 500.62
6 2005-05-01 434.30
If you need months up to the current one, use this instead:
mns <- data.frame(trans_date = seq.Date(min(df$trans_date), Sys.Date(), by = "month"))
Note that calling plain seq instead of seq.Date is sufficient when the arguments are of class Date.

If you're using xts objects, you can use timeBasedSeq and merge.xts. Assuming your original data is in an object Data:
library(xts)
# create xts object:
# no comma on the first subset (Data['ammount']) keeps column name;
# as.Date needs a vector, so use comma (Data[,'trans_date'])
x <- xts(Data['ammount'],as.Date(Data[,'trans_date']))
# create a time-based vector from 2004-12-01 to 2011-08-01. The "m" denotes
# monthly time-steps. By default this returns a yearmon class. Use
# retclass="Date" to return a Date vector.
d <- timeBasedSeq(paste(start(x),end(x),"m",sep="/"), retclass="Date")
# merge x with an "empty" xts object, xts(,d), filling with zeros
y <- merge(x,xts(,d),fill=0)

Related

Rblpapi, unlist data.frame zoo xts

Using the Rblpapi, I get stockdata of 3 indices in a data.frame with several lists.
Then, I want to get it in either a zoo or preferably xts format. However, I first have to unlist properly.
Since not everyone has access to Rblpapi and therefore cannot replicate this, please look at the str output and suggest how to unlist it.
Any leads or help appreciated!
library(Rblpapi)
library(zoo)
library(xts)
str(res)
List of 3
$ :'data.frame': 9 obs. of 2 variables:
..$ date : Date[1:9], format: ...
..$ PX_LAST: num [1:9] 201 194 188 190 190 ...
$ :'data.frame': 9 obs. of 2 variables:
..$ date : Date[1:9], format: ...
..$ PX_LAST: num [1:9] 4891 4686 4477 4568 4517 ...
$ :'data.frame': 9 obs. of 2 variables:
..$ date : Date[1:9], format: ...
..$ PX_LAST: num [1:9] 19.3 22.5 26.1 22.5 22 ...
head(res)
[[1]]
date PX_LAST
1 2016-01-05 201.3600
2 2016-01-12 193.6608
3 2016-01-19 188.0600
4 2016-01-26 190.2000
5 2016-02-02 190.1600
6 2016-02-09 185.4300
7 2016-02-16 189.7800
8 2016-02-23 192.3200
9 2016-03-01 197.9700
[[2]]
date PX_LAST
1 2016-01-05 4891.430
2 2016-01-12 4685.919
3 2016-01-19 4476.950
4 2016-01-26 4567.673
5 2016-02-02 4516.946
6 2016-02-09 4268.763
7 2016-02-16 4435.956
8 2016-02-23 4503.583
9 2016-03-01 4680.479
[[3]]
date PX_LAST
1 2016-01-05 19.34
2 2016-01-12 22.47
3 2016-01-19 26.05
4 2016-01-26 22.50
5 2016-02-02 21.98
6 2016-02-09 26.54
7 2016-02-16 24.11
8 2016-02-23 20.98
9 2016-03-01 17.85
Unlist to get one data.frame / zoo / xts object with (date, pricedata1, pricedata2, pricedata3)
df <- data.frame(matrix(unlist(res), nrow=9))
head(df)
X1 X2 X3 X4 X5 X6
1 16805 201.36 16805 4891.43 16805 19.34
2 16812 193.6608 16812 4685.919 16812 22.47
3 16819 188.06 16819 4476.95 16819 26.05
4 16826 190.2 16826 4567.673 16826 22.5
5 16833 190.16 16833 4516.946 16833 21.98
6 16840 185.43 16840 4268.763 16840 26.54
However, this is not what I want. Columns X3 and X5 should not be there, and the dates have been turned into plain numbers. Therefore converting it to zoo or xts doesn't work:
price<-read.zoo(df, format="%Y%m%d")
df$date <-as.Date(as.character(df$date),format="%Y%m%d")
x<-xts(df$date, df$px_last)
Error in read.zoo(df, format = "%Y%m%d") : index has bad entries at data rows: 1 2 3 4 5 6 7 8 9
Error in xts(df$date, df$px_last) : order.by requires an
appropriate time-based object
I believe all you need is to join each list element by the date.
However, for that, first you need to rename all those variables PX_LAST to something unique. For example:
library(data.table)
for (i in seq_along(res)) {
setnames(res[[i]],"PX_LAST",paste("PX_LAST",i,sep="_"))
}
Then you can join, either by pairwise merging or by using the plyr::join_all function:
require(plyr)
df <- join_all(res, by="date", type="full")
# date PX_LAST_1 PX_LAST_2 PX_LAST_3
# 1 2016-01-05 201.3600 4891.430 19.34
# 2 2016-01-12 193.6608 4685.919 22.47
# 3 2016-01-19 188.0600 4476.950 26.05
# 4 2016-01-26 190.2000 4567.673 22.50
# 5 2016-02-02 190.1600 4516.946 21.98
# 6 2016-02-09 185.4300 4268.763 26.54
# 7 2016-02-16 189.7800 4435.956 24.11
# 8 2016-02-23 192.3200 4503.583 20.98
# 9 2016-03-01 197.9700 4680.479 17.85
Then finally you can use
library(zoo)
price <- read.zoo(df)
library(xts)
# pass only the price columns as data and the dates as the index
x <- xts(df[, -1], order.by = df$date)
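The same rename-then-join idea can also be sketched in base R alone, without data.table or plyr. This is a minimal sketch, assuming every list element has a date column and a PX_LAST column; the small res list below is a hypothetical stand-in for the Rblpapi result, built from the first three rows shown in the question.

```r
# Hypothetical stand-in for the Rblpapi result: a list of data frames
dates <- as.Date(c("2016-01-05", "2016-01-12", "2016-01-19"))
res <- list(
  data.frame(date = dates, PX_LAST = c(201.36, 193.6608, 188.06)),
  data.frame(date = dates, PX_LAST = c(4891.430, 4685.919, 4476.950)),
  data.frame(date = dates, PX_LAST = c(19.34, 22.47, 26.05))
)

# Give each price column a unique name: PX_LAST_1, PX_LAST_2, ...
for (i in seq_along(res)) {
  names(res[[i]])[names(res[[i]]) == "PX_LAST"] <- paste0("PX_LAST_", i)
}

# Full outer join of all list elements on the date column
wide <- Reduce(function(a, b) merge(a, b, by = "date", all = TRUE), res)
wide
```

The resulting wide data frame has one date column followed by one price column per index, so it can be handed to read.zoo or xts in the same way as the join_all result.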

insert new rows to the time series data, with date added automatically

I have a time-series data frame that looks like this:
TS.1
2015-09-01 361656.7
2015-09-02 370086.4
2015-09-03 346571.2
2015-09-04 316616.9
2015-09-05 342271.8
2015-09-06 361548.2
2015-09-07 342609.2
2015-09-08 281868.8
2015-09-09 297011.1
2015-09-10 295160.5
2015-09-11 287926.9
2015-09-12 323365.8
Now, what I want to do is add some new data points (rows) to the existing data frame, say:
320123.5
323521.7
How can I add the corresponding date to each new row? The dates simply continue sequentially from the last row.
Is there any package that can do this automatically, so that the only thing I have to do is insert the new data points?
Here's some play data:
df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "days"), x = seq(31))
new.x <- c(32, 33)
This adds the extra observations along with the proper sequence of dates:
new.df <- data.frame(date=seq(max(df$date) + 1, max(df$date) + length(new.x), "days"), x=new.x)
Then just rbind them to get your expanded data frame:
rbind(df, new.df)
date x
1 2015-01-01 1
2 2015-01-02 2
3 2015-01-03 3
4 2015-01-04 4
5 2015-01-05 5
6 2015-01-06 6
7 2015-01-07 7
8 2015-01-08 8
9 2015-01-09 9
10 2015-01-10 10
11 2015-01-11 11
12 2015-01-12 12
13 2015-01-13 13
14 2015-01-14 14
15 2015-01-15 15
16 2015-01-16 16
17 2015-01-17 17
18 2015-01-18 18
19 2015-01-19 19
20 2015-01-20 20
21 2015-01-21 21
22 2015-01-22 22
23 2015-01-23 23
24 2015-01-24 24
25 2015-01-25 25
26 2015-01-26 26
27 2015-01-27 27
28 2015-01-28 28
29 2015-01-29 29
30 2015-01-30 30
31 2015-01-31 31
32 2015-02-01 32
33 2015-02-02 33
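If this needs to happen repeatedly, the two steps above can be wrapped in a small helper. This is only a sketch: append_daily is a hypothetical name, and it assumes daily data in columns named date and x, as in the play data.

```r
# Hypothetical helper: append new values, dating them on the days
# that follow the last date already in the data frame
append_daily <- function(df, new_x) {
  next_dates <- seq(max(df$date) + 1, by = "day", length.out = length(new_x))
  rbind(df, data.frame(date = next_dates, x = new_x))
}

df <- data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), "days"),
                 x = seq(31))
out <- append_daily(df, c(32, 33))
tail(out, 3)
```

For other frequencies, the by argument of seq could be changed to "week" or "month".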

Get data between certain timestamps in R

I have the following data frame:
>head(d)
Smed Time TTFB
1 186 2014-03-19 20:13:00 6
2 196 2014-03-19 21:23:00 0
3 186 2014-03-19 19:33:00 22
4 186 2014-03-19 14:43:00 134
5 186 2014-03-19 23:53:00 36
> str(d)
$ Smed : Factor w/ 71 levels "1033","1038",..: 14 16 14 14 14 14 47 14 14 68 ...
$ Time : POSIXct, format: "2014-03-19 20:13:00" "2014-03-19 20:13:00" "2014-03-19 20:13:00" "2014-03-19 20:13:00" ...
$ TTFB : int 6 0 22 134 36 96 61 37 28 19 ...
How do I do a time comparison with the POSIXct format to get the data between, say, 18:13 and 23:20?
POSIXct understands < and >, so you simply do this:
set.seed(1)
d <- data.frame(Time=as.POSIXct("2014-03-19")+24*60*60*runif(100))
index <- as.POSIXct("2014-03-19 18:13:00")<=d$Time &
d$Time<=as.POSIXct("2014-03-19 23:20:00")
d[index,]
Try something like this:
start <- as.POSIXct("2014-03-19 18:13:00")
end <- as.POSIXct("2014-03-19 23:20:00")
d[start < d$Time & d$Time < end, ]
Hope it helps.
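Both answers compare against full timestamps on a single date. If the window should apply to the time of day on every date, one possible sketch is to compare zero-padded "HH:MM" strings instead. The small d below is made-up sample data, and the UTC time zone is an assumption chosen so the example is reproducible.

```r
# Made-up sample data spanning two dates (time zone fixed to UTC for reproducibility)
d <- data.frame(
  Time = as.POSIXct(c("2014-03-19 20:13:00", "2014-03-19 14:43:00",
                      "2014-03-20 18:13:00", "2014-03-20 23:53:00"), tz = "UTC"),
  TTFB = c(6L, 134L, 22L, 36L)
)

# Zero-padded "HH:MM" strings sort in the same order as the clock times
hm <- format(d$Time, "%H:%M")
in_window <- hm >= "18:13" & hm <= "23:20"
res <- d[in_window, ]
res
```

This keeps the 20:13 row from the first day and the 18:13 row from the second day, regardless of the date part.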

Transforming long format data to short format by segmenting dates that include redundant observations

I have a data set that is long format and includes exact date/time measurements of 3 scores on a single test administered between 3 and 5 times per year.
ID Date Fl Er Cmp
1 9/24/2010 11:38 15 2 17
1 1/11/2011 11:53 39 11 25
1 1/15/2011 11:36 39 11 39
1 3/7/2011 11:28 95 58 2
2 10/4/2010 14:35 35 9 6
2 1/7/2011 13:11 32 7 8
2 3/7/2011 13:11 79 42 30
3 10/12/2011 13:22 17 3 18
3 1/19/2012 14:14 45 15 36
3 5/8/2012 11:55 29 6 11
3 6/8/2012 11:55 74 37 7
4 9/14/2012 9:15 62 28 18
4 1/24/2013 9:51 82 45 9
4 5/21/2013 14:04 135 87 17
5 9/12/2011 11:30 98 61 18
5 9/15/2011 13:23 55 22 9
5 11/15/2011 11:34 98 61 17
5 1/9/2012 11:32 55 22 17
5 4/20/2012 11:30 23 4 17
I need to transform this data to short format with time bands based on month (i.e. Fall=August-October; Winter=January-February; Spring=March-May). Some bands will include more than one observation per participant, and as such, will need a "spill over" band. An example transformation for the Fl scores below.
ID Fall1Fl Fall2Fl Winter1Fl Winter2Fl Spring1Fl Spring2Fl
1 15 NA 39 39 95 NA
2 35 NA 32 NA 79 NA
3 17 NA 45 NA 28 74
4 62 NA 82 NA 135 NA
5 98 55 55 NA 23 NA
Notice that dates which are "redundant" (i.e. more than 1 Aug-Oct observation) spill over into Fall2fl column. Dates that occur outside of the desired bands (i.e. November, December, June, July) should be deleted. The final data set should have additional columns that include Fl Er and Cmp.
Any help would be appreciated!
(Link to .csv file with long data http://mentor.coe.uh.edu/Data_Example_Long.csv )
This seems to do what you are looking for, but doesn't exactly match your desired output. I haven't looked at your sample data to see whether the problem lies with your sample desired output or the transformations I've done, but you should be able to follow along with the code to see how the transformations were made.
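## Load the packages used below: lubridate for month()/year(), reshape2 for dcast()
library(lubridate)
library(reshape2)
## Convert dates to an actual date-time class (POSIXct stores cleanly in a data frame column)
mydf$Date <- as.POSIXct(strptime(gsub("/", "-", mydf$Date), format="%m-%d-%Y %H:%M"))
## Factor the months so we can get the "seasons" that you want
Months <- factor(month(mydf$Date), levels=1:12)
levels(Months) <- list(Fall = c(8:10),
Winter = c(1:2),
Spring = c(3:5),
Other = c(6, 7, 11, 12))
mydf$Seasons <- Months
## Drop the "Other" seasons
mydf <- mydf[mydf$Seasons != "Other", ]
## Add a "Year" column
mydf$Year <- year(mydf$Date)
## Add a "Times" column: a running counter within each ID and Year.
## It does not restart per season, which is why columns such as
## Spring_2 and Spring_3 appear in the output below.
mydf$Times <- as.numeric(ave(as.character(mydf$Seasons),
mydf$ID, mydf$Year, FUN = seq_along))
## Use `dcast` on just one variable.
## Repeat for other variables by changing the "value.var",
## which must match the column name in your data.
dcast(mydf, ID ~ Seasons + Times, value.var="Fluency")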
# ID Fall_1 Fall_2 Winter_1 Winter_2 Spring_2 Spring_3
# 1 1 15 NA 39 39 NA 95
# 2 2 35 NA 32 NA 79 NA
# 3 3 17 NA 45 NA 29 NA
# 4 4 62 NA 82 NA 135 NA
# 5 5 98 55 55 NA 23 NA

Function: calculating seconds between data points

I have the following column in my data frame:
DateTime
1 2011-10-03 08:00:04
2 2011-10-03 08:00:05
3 2011-10-03 08:00:06
4 2011-10-03 08:00:09
5 2011-10-03 08:00:15
6 2011-10-03 08:00:24
7 2011-10-03 08:00:30
8 2011-10-03 08:00:42
9 2011-10-03 08:01:01
10 2011-10-03 08:01:24
11 2011-10-03 08:01:58
12 2011-10-03 08:02:34
13 2011-10-03 08:03:25
14 2011-10-03 08:04:26
15 2011-10-03 08:06:00
With dput:
> dput(smallDF)
structure(list(DateTime = structure(c(1317621604, 1317621605,
1317621606, 1317621609, 1317621615, 1317621624, 1317621630, 1317621642,
1317621661, 1317621684, 1317621718, 1317621754, 1317621805, 1317621866,
1317621960, 1317622103, 1317622197, 1317622356, 1317622387, 1317622463,
1317622681, 1317622851, 1317623061, 1317623285, 1317623404, 1317623498,
1317623612, 1317623849, 1317623916, 1317623994, 1317624174, 1317624414,
1317624484, 1317624607, 1317624848, 1317625023, 1317625103, 1317625179,
1317625200, 1317625209, 1317625229, 1317625238, 1317625249, 1317625264,
1317625282, 1317625300, 1317625315, 1317625339, 1317625353, 1317625365,
1317625371, 1317625381, 1317625395, 1317625415, 1317625423, 1317625438,
1317625458, 1317625469, 1317625487, 1317625500, 1317625513, 1317625533,
1317625548, 1317625565, 1317625581, 1317625598, 1317625613, 1317625640,
1317625661, 1317625674, 1317625702, 1317625715, 1317625737, 1317625758,
1317625784, 1317625811, 1317625826, 1317625841, 1317625862, 1317625895,
1317625909, 1317625935, 1317625956, 1317625973, 1317626001, 1317626043,
1317626062, 1317626100, 1317626113, 1317626132, 1317626153, 1317626179,
1317626212, 1317626239, 1317626271, 1317626296, 1317626323, 1317626361,
1317626384, 1317626407), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = "DateTime", row.names = c(NA,
-100L), class = "data.frame")
My goal: I want to calculate the time difference, in seconds, between each measurement.
Edit:
I'm looking to get the following result, where the time difference (in seconds) between each data point is calculated, except for the first value of the day (line 3), where the time is calculated relative to 8 am:
DateTime Seconds
1 2011-09-30 21:59:02 6
2 2011-09-30 21:59:04 2
3 2011-10-03 08:00:04 4
4 2011-10-03 08:00:05 1
5 2011-10-03 08:00:06 1
6 2011-10-03 08:00:09 3
7 2011-10-03 08:00:15 6
8 2011-10-03 08:00:24 9
9 2011-10-03 08:00:30 6
10 2011-10-03 08:00:42 12
11 2011-10-03 08:01:01 19
12 2011-10-03 08:01:24 23
13 2011-10-03 08:01:58 34
14 2011-10-03 08:02:34 36
15 2011-10-03 08:03:25 51
16 2011-10-03 08:04:26 61
17 2011-10-03 08:06:00 94
However, the measurements start at 8:00 am, so if the value is the first of the day, the number of seconds relative to 8:00 am needs to be calculated. In the example above, the first measurement ends at 8:00:04, so using the $sec attribute of POSIX could work here, but on other days the first value may occur a few minutes after 8:00.
I've tried to achieve that goal with the following function:
SecondsInBar <- function(x, startTime){
# First data point or first of day
if (x == 1 || x > 1 && x$wkday != x[-1]$wkday){
seconds <- as.numeric(difftime(x,
as.POSIXlt(startTime, format = "%H:%M:%S"),
units = "secs"))
# else calculate time difference
} else {
seconds <- as.numeric(difftime(x, x[-1], units = "secs"))
}
return (seconds)
}
Which then could be called with SecondsInBar(smallDF$DateTime, "08:00:00").
There are at least two problems with this function, but I don't know how to solve them:
The code segment x$wkday != x[-1]$wkday returns a "$ operator is invalid for atomic vectors" error,
and as.POSIXlt(startTime, format = "%H:%M:%S") uses the current date, which makes the difftime calculation erroneous.
My question:
Where am I going wrong with this function?
And: is this approach a viable way or should I approach it from a different angle?
How about something along these lines:
smallDF$DateTime - as.POSIXct(paste(strftime(smallDF$DateTime,"%Y-%m-%d"),"07:00:00"))
Time differences in secs
[1] 4 5 6 9 15 24 30 42 61 84 118 154 205 266 360
[16] 503 597 756 787 863 1081 1251 1461 1685 1804 1898 2012 2249 2316 2394
[31] 2574 2814 2884 3007 3248 3423 3503 3579 3600 3609 3629 3638 3649 3664 3682
[46] 3700 3715 3739 3753 3765 3771 3781 3795 3815 3823 3838 3858 3869 3887 3900
[61] 3913 3933 3948 3965 3981 3998 4013 4040 4061 4074 4102 4115 4137 4158 4184
[76] 4211 4226 4241 4262 4295 4309 4335 4356 4373 4401 4443 4462 4500 4513 4532
[91] 4553 4579 4612 4639 4671 4696 4723 4761 4784 4807
attr(,"tzone")
[1] ""
Note that I used 7am because, when I copied your data, my system interpreted it as BST.
As for your errors, you can't use $ to get elements of a date with POSIXct (which is how smallDF$DateTime is defined), only with POSIXlt. And for the second error, if you don't supply a date, it has to assume the current date, as there is no other information to draw upon.
Edit
Now that it's been clarified, I would propose a different approach: split your data frame by day, then prepend the reference time to each day's times and call diff on the result, using lapply to loop over the days:
#modify dataframe to add extra day to second half
smallDF[51:100,1] <- smallDF[51:100,1]+86400
smallDF2 <- split(smallDF,strftime(smallDF$DateTime,"%Y-%m-%d"))
lapply(smallDF2,function(x) diff(c(as.POSIXct(paste(strftime(x$DateTime[1],"%Y-%m-%d"),"07:00:00")),x$DateTime)))
$`2011-10-03`
Time differences in secs
[1] 4 1 1 3 6 9 6 12 19 23 34 36 51 61 94 143 94 159 31
[20] 76 218 170 210 224 119 94 114 237 67 78 180 240 70 123 241 175 80 76
[39] 21 9 20 9 11 15 18 18 15 24 14 12
$`2011-10-04`
Time differences in secs
[1] 3771 10 14 20 8 15 20 11 18 13 13 20 15 17 16
[16] 17 15 27 21 13 28 13 22 21 26 27 15 15 21 33
[31] 14 26 21 17 28 42 19 38 13 19 21 26 33 27 32
[46] 25 27 38 23 23
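The same per-day logic can also be sketched as a single vectorised function that returns one Seconds value per row, which is closer to the output requested in the question. This is a sketch under stated assumptions: seconds_in_bar is a hypothetical name, the times are assumed to be sorted, and the time zone is passed explicitly (UTC here) to avoid the local-time-zone ambiguity noted above.

```r
# Hypothetical helper: seconds since the previous observation, or since
# start_time on the same date for the first observation of each day.
# Assumes `times` is a sorted POSIXct vector.
seconds_in_bar <- function(times, start_time = "08:00:00", tz = "UTC") {
  day <- format(times, "%Y-%m-%d", tz = tz)
  day_start <- as.POSIXct(paste(day, start_time), tz = tz)
  n <- length(times)
  # Previous observation for every row (placeholder for the first row)
  prev <- c(day_start[1], times[-n])
  # Rows whose date differs from the row before them start a new day
  first_of_day <- c(TRUE, day[-1] != day[-n])
  prev[first_of_day] <- day_start[first_of_day]
  as.numeric(difftime(times, prev, units = "secs"))
}

times <- as.POSIXct(c("2011-10-03 08:00:04", "2011-10-03 08:00:05",
                      "2011-10-03 08:00:09", "2011-10-04 08:02:00",
                      "2011-10-04 08:02:30"), tz = "UTC")
secs <- seconds_in_bar(times)
secs
```

The result could be stored directly as a new column, for example smallDF$Seconds <- seconds_in_bar(smallDF$DateTime, tz = ...), with tz set to whatever zone the data was recorded in.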
