R plot density smoothed time series

I wish to make a probability distribution of some time series data. My data is in the following format:
00:00, 3
01:00, 50
05:00, 13
10:00, 34
17:00, 80
21:00, 100
The time column has some missing values that R will have to interpolate. I want to get a nice smooth curve to highlight the busy periods. I have tried with ts, density and plot but these don't produce what I'm after. For example,
data1 <- read.csv(file="c:\\abc\\ts.csv", header=FALSE, sep=",")
data1$V1 <- strptime(data1$V1, format="%H:%M")
plot(data1$V2, density(data1$V1), type="l")
But this draws the lines in a jumbled order and doesn't give me the probability distribution I'm after.

I think you are definitely after the zoo package, which has several functions for dealing with NAs. See also na.aggregate, na.approx, and na.locf.
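As a toy sketch of how those differ (the values here are made up), on a small zoo series with gaps:
library(zoo)
z <- zoo(c(1, NA, NA, 4, NA, 6), order.by = 1:6)
na.approx(z)    # linear interpolation: 1 2 3 4 5 6
na.locf(z)      # last observation carried forward: 1 1 1 4 4 6
na.aggregate(z) # NAs replaced by the overall mean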

You made it a little harder than you might realize. I'll make it easier for now by adding a date in front of your times.
Also, I added a variable texinp and a textConnection() call so you can cut/paste the following code and run it directly. The data in texinp is read by read.zoo much as a .csv file would be, so this both lets you plot things and shows how to read .csv files with read.zoo.
library(zoo)
library(chron)
texinp <- "
Time, Mydata
2011-02-06 00:00, 3
2011-02-06 01:00, 50
2011-02-06 05:00, 13
2011-02-06 10:00, 34
2011-02-06 17:00, 80
2011-02-06 21:00, 100"
myd.zoo <- read.zoo(textConnection(texinp), header=TRUE, FUN = as.chron, sep=",")
myd.zoo
plot(myd.zoo)
From your question, you talked about "busy periods". I may be wrong, but I'm assuming that the value of 100 at time 21:00 is the "busiest period". If that's true, then you don't need a density plot, and the above plot is what you're after.
Let me know if I'm wrong.
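If you do want the missing hours filled in before plotting, here is a hedged sketch building on the myd.zoo object above; the hourly grid itself is my assumption, not something from the question:
# merge with an empty hourly grid, then interpolate the gaps linearly
rng <- as.numeric(range(index(myd.zoo)))
grid <- zoo(, as.chron(seq(rng[1], rng[2], by = 1/24))) # chron counts days, so 1/24 = 1 hour
myd.filled <- na.approx(merge(myd.zoo, grid))
plot(myd.filled)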

Related

Correct imputation for a zooreg object?

My objective is to impute NAs in a zooreg time series object. The pattern of the time series is cyclic. My code is:
#load libraries required
library("zoo")
# create a sequence every 15 minutes from 1st Jan to 20th Jan, 2018
timeStamp <- seq.POSIXt(from=as.POSIXct('2018-01-01 00:00:00', tz="UTC"), to=as.POSIXct('2018-01-20 23:45:00', tz="UTC"), by = "15 min")
# data which increases from 12am to 12pm, then decreases till 12 am of next day, for 20 days
readings <- rep(c(seq(1,48,1), seq(48,1,-1)), 20)
dF <- data.frame(timeStamp=timeStamp, readings=readings)
# create a regular zooreg object; the frequency is one day (4 readings/hour * 24 hours = 96)
readingsZooReg <- zooreg(dF$readings, order.by = dF$timeStamp, frequency = 4*24)
plot(readingsZooReg)
# force some data to be NAs
window(readingsZooReg, start = as.POSIXct("2018-01-14 00:00:00", tz="UTC"), end = as.POSIXct("2018-01-16 23:45:00", tz="UTC")) <- NA
plot(readingsZooReg)
# plot imputed values
plot(na.approx(readingsZooReg))
The plots are: the full time series, the series with NAs added, and the imputed time series.
I'm purposely using zoo here, since the time series I work on are irregular (e.g. solar, oil wells, etc.).
1) Is my usage of "zooreg" correct? Or would a "zoo" object suffice?
2) Is my frequency variable right?
3) Why won't na.approx work? I've also tried na.StructTS, but the R script hangs.
4) Is there a solution using any other package? xts, ts, etc.?
Your current example time series is a regular time series.
(An irregular time series would have varying time distances between observations.)
E.g.:
10:00:10, 10:00:20, 10:00:30, 10:00:40, 10:00:50 (regularly spaced)
10:00:10, 10:00:17, 10:00:33, 10:00:37, 10:00:50 (irregularly spaced)
If you really need to handle irregularly spaced time series, zoo is your go-to package. Otherwise you can also use other time series classes such as xts and ts.
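A quick way to check which case you have is to look at the differences between consecutive timestamps; a sketch using the dF from the question:
# a single unique difference means the series is regularly spaced
diffs <- diff(as.numeric(dF$timeStamp))
unique(diffs) # here: 900 seconds, i.e. a regular 15-minute grid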
About the frequency:
You usually set the frequency of a time series to the interval at which you expect patterns to repeat (in your example this would be 96). In real life this is often 1 day, 1 week, 1 month, ... but it can also be something else, like 1.5 days. (E.g. if you have daily recurring patterns and 1-minute observations, you would set the frequency to 1440.)
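Restated as code with the question's 15-minute data (just a sketch of the rule above):
# daily pattern at 15-minute resolution: 4 obs/hour * 24 hours = 96 per cycle
tsDaily <- ts(dF$readings, frequency = 96)
# with 1-minute observations and a daily pattern it would be 60 * 24 = 1440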
na.approx from zoo works perfectly; it is doing exactly what it is supposed to. A linear interpolation between the last point before the gap and the first point after it gives a straight line (here a flat line at the bottom of the cycle). Of course that is probably not the result you expected, because it does not account for seasonality. That is why G. Grothendieck suggested na.StructTS as a method to choose (it is usually better at accounting for seasonality).
The best choice in this specific case, if you are not bound to zoo, would be na_seadec from the imputeTS package (a package solely dedicated to time series imputation).
Here is an example, with plots, using the imputeTS package:
library(imputeTS)
yourTS <- ts(coredata(readingsZooReg), frequency = 96)
ggplot_na_distribution(yourTS)
imputedTS <- na_seadec(yourTS)
ggplot_na_imputations(yourTS, imputedTS)
Usually imputeTS also works perfectly with zoo time series as input. I only changed it to ts here because something about your zoo object seems odd... that is also why na.StructTS from zoo itself breaks on it. Maybe somebody with better knowledge can help out here.
Beware: if you really do have an irregular time series, do not use imputation functions from packages other than zoo, because they all assume regularly spaced data and will give results accordingly.

r time interval plot

time count
2017-03-08 19:33 1
2017-03-23 22:11 1
2017-03-30 3:30 10
2017-03-09 19:33 13
2017-03-23 22:11 1
2017-03-31 3:30 1
.....
This data is about how fast consumers write comments, so I want to make a plot from which I can easily see how quickly the comments come in.
For example:
on the x-axis, the time series starts from 2017-03-08,
and there is a bar plot at a fixed interval (seconds or minutes),
so when comments are written quickly, the bars are dense,
and as time goes on and the speed drops, the bars thin out.
How can I make this?
cc5<-dt[, tdiff := difftime(cc, shift(cc, fill=cc[1L]), units="secs"),
by=title]
Using this code I can create the difftime column.
I have one more problem: the time column is of character type. I tried to change it to a date type using as.Date, but that didn't work, so I changed it to POSIXct. I think that to put a time series on the x-axis I need some date type.
I'm not 100% sure that I'm really understanding the result you want, but generally when I want to put dates on the x-axis, I go to "Understanding dates and plotting a histogram with ggplot2 in R" and use Gauden's Code v1. If you have successfully changed the character column into a POSIXct time, as.Date() should also work fine.
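As a hedged sketch of that approach, reusing the dt and cc names from your snippet (the parse format is a guess):
library(ggplot2)
# assumes the character column was converted first, e.g.
# dt$cc <- as.POSIXct(dt$cc, format = "%Y-%m-%d %H:%M")
ggplot(dt, aes(x = cc)) +
  geom_histogram(binwidth = 3600) +       # 3600-second bins: one bar per hour
  scale_x_datetime(date_labels = "%m-%d")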

Time series analysis applicability?

I have a sample data frame like this (date column format is mm-dd-YYYY):
date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6
I want to convert this data frame into a time series using ts(), but the problem is that the current data frame has multiple values for the same date. Can we apply time series analysis in this case?
Can I convert the data frame into a time series and build a model (ARIMA) that can forecast the count value on a daily basis?
Or should I forecast the count value based on grp? In that case I would have to select only the grp and count columns of the data frame, skipping the date column, so a daily forecast for count would not be possible.
Suppose I want to aggregate the count value on a per-day basis. I tried the aggregate function, but there I have to specify the date values, and I have a very large data set. Is any other option available in R?
Can somebody please suggest whether there is a better approach to follow? My assumption is that time series forecasting works only for bivariate data. Is this assumption right?
It seems like there are two aspects to your problem:
I want to convert this data frame into a time series using ts(), but the problem is that the current data frame has multiple values for the same date. Can we apply time series analysis in this case?
If you are happy making use of the xts package you could attempt:
dta2$date <- as.Date(dta2$date, "%m-%d-%Y")
dtaXTS <- xts::as.xts(dta2[,2:3], dta2$date)
which would result in:
> head(dtaXTS)
count grp
2009-01-09 54 1
2009-01-09 100 2
2009-01-09 546 3
2009-01-10 67 4
2009-01-11 80 5
2009-01-11 45 6
of the following classes:
> class(dtaXTS)
[1] "xts" "zoo"
You could then use your time series object as a univariate series, referring to the selected variable, or as a multivariate series. An example using the PerformanceAnalytics package:
PerformanceAnalytics::chart.TimeSeries(dtaXTS)
Side points
Concerning your second question:
Can somebody please suggest whether there is a better approach to follow? My assumption is that time series forecasting works only for bivariate data. Is this assumption right?
IMHO, this is rather broad. I would suggest that you take the created xts object and elaborate on the model you want to use and why; if it's a conceptual question about the nature of time series analysis, you may prefer to post your follow-up question on CrossValidated.
Data sourced via: dta2 <- read.delim(pipe("pbpaste"), sep = "") using the provided example.
Since daily forecasts are wanted we need to aggregate to daily. Using DF from the Note at the end, read the first two columns of data into a zoo series z using read.zoo and argument aggregate=sum. We could optionally convert that to a "ts" series (tser <- as.ts(z)) although this is unnecessary for many forecasting functions. In particular, checking out the source code of auto.arima we see that it runs x <- as.ts(x) on its input before further processing. Finally run auto.arima, forecast or other forecasting function.
library(forecast)
library(zoo)
z <- read.zoo(DF[1:2], format = "%m-%d-%Y", aggregate = sum)
auto.arima(z)
forecast(z)
Note: DF is given reproducibly here:
Lines <- "date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6"
DF <- read.table(text = Lines, header = TRUE)
Updated: Revised after re-reading question.

plot data of time interval type in R

These data are exported from PostgreSQL's interval type, for example:
1 00:01:30
2 00:07:00
3 00:07:00
4 00:03:00
5 00:02:00
6 00:03:30
7 -00:02:00
...
What I want
I want to see the distribution of these data and, moreover, to get the deciles of the distribution, even though the values are time intervals.
What I did
I used:
COPY (SELECT the_interval from the_table) TO '/some/file/path.txt';
to get the file path.txt.
then I used
Tools -> Import Dataset -> From Local File
to get the data imported into my R workspace in RStudio.
What I am asking
I'm new to R, and I want to know: do I need to convert the data into a time type in R, and is there a function I could use to plot these data? Or, going further, can you propose a better way to achieve the goal I described?
Thanks a lot!
Assuming you can read your data into R as character strings, the easiest option is to convert them into "times" objects with the times() function from the chron package. From there, R makes it easy to plot a histogram.
#Sample data
t<-c("00:01:30", "00:07:00", "00:07:00", "00:03:00", "00:02:00", "00:03:30", "00:06:00")
#load library and convert to a times object
library(chron)
tt<-times(t)
#Plot the histogram
h<-hist(as.numeric(tt), main="sample data", col="blue")
#For data summaries
summary(tt)
quantile(tt, 0.90)
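Since you specifically wanted deciles, note that quantile() also accepts a vector of probabilities, so a one-line sketch gives all nine at once:
quantile(tt, probs = seq(0.1, 0.9, by = 0.1))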
Hope this gives you a head start on solving your problem; if not, please ask a more detailed question providing some sample data and the expected output.

Data aggregation loop in R

I am facing a problem concerning aggregating my data to daily data.
I have a data frame from which NAs have been removed (a link to a picture of the data is given below). Data has been collected 3 times a day, but sometimes, due to NAs, there are just 1 or 2 entries per day; on some days the data is missing completely.
I am now interested in calculating the daily mean of "dist": that means summing up the "dist" values of one day and dividing by the number of entries that day (3 if no data is missing that day). I would like to do this via a loop.
How can I do this with a loop? The problem is that sometimes I have 3 entries per day and sometimes just 2 or even 1. I would like to tell R that, for every day, it should sum up "dist" and divide it by the number of entries available for that day.
I just have no idea how to formulate a for loop for this purpose. I would really appreciate if you could give me any advice on that problem. Thanks for your efforts and kind regards,
Jan
Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html
Edit: I used aggregate and tapply as suggested; however, the mean of the data was not really calculated:
Group.1 x
1 2006-10-06 12:00:00 636.5395
2 2006-10-06 20:00:00 859.0109
3 2006-10-07 04:00:00 301.8548
4 2006-10-07 12:00:00 649.3357
5 2006-10-07 20:00:00 944.8272
6 2006-10-08 04:00:00 136.7393
7 2006-10-08 12:00:00 360.9560
8 2006-10-08 20:00:00 NaN
The code used was:
dates<-Dis_sub$date
distance<-Dis_sub$dist
aggregate(distance,list(dates),mean,na.rm=TRUE)
tapply(distance,dates,mean,na.rm=TRUE)
Don't use a loop. Use R. Some example data:
dates <- rep(seq(as.Date("2001-01-05"),
as.Date("2001-01-20"),
by="day"),
each=3)
values <- rep(1:16,each=3)
values[c(4,5,6,10,14,15,30)] <- NA
and any of:
aggregate(values,list(dates),mean,na.rm=TRUE)
tapply(values,dates,mean,na.rm=TRUE)
gives you what you want. See also ?aggregate and ?tapply.
If you want a data frame back, you can look at the plyr package:
Data <- data.frame(dates = dates, values = values)
require(plyr)
ddply(Data, "dates", summarise, mean_value = mean(values, na.rm = TRUE))
Keep in mind that ddply does not fully support the date format (yet).
Look at the data.table package, especially if your data is huge. Here is some code that calculates the mean of dist by day.
library(data.table)
dt <- data.table(Data)
dt[, list(avg_dist = mean(dist, na.rm = TRUE)), by = 'date']
It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date using something like
Dis_sub$date_only <- as.Date(Dis_sub$date)
Then using Joris Meys' solution (which is the right way to do it) should work.
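Putting those two steps together, a short sketch using the question's column names:
# strip the time component, then average dist per calendar day
Dis_sub$date_only <- as.Date(Dis_sub$date)
aggregate(dist ~ date_only, data = Dis_sub, FUN = mean, na.rm = TRUE)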
However if for some reason you really want to use a loop you could try something like
newFrame <- data.frame()
for (d in unique(as.character(Dis_sub$date_only))) {
meanDist <- mean(Dis_sub$dist[Dis_sub$date_only == d], na.rm = TRUE)
newFrame <- rbind(newFrame, data.frame(date = as.Date(d), meanDist = meanDist))
}
But keep in mind that this will be slow and memory-inefficient.
