Build a graph (plot) in R: median prices across time intervals

I have a data frame with prices and the ending dates of some auctions. I want to see when sales with minimal and maximal prices (and also the median) occur, depending on the time of day.
More precisely, I have the data frame mtest:
> str(mtest)
'data.frame': 9144 obs. of 2 variables:
$ Price : num 178 188 228 305 202 ...
$ EndDateTime: POSIXct, format: "2015-05-25 05:00:59" "2015-05-23 00:06:01" ...
I want to build a graph (plot) with 30-minute time intervals (00:00-00:30, 00:31-01:00, etc.) on the X axis and the median (or maximal/minimal) prices on the Y axis.
Another idea is to draw a simple histogram for each time interval, like hist(mtest$Price, breaks=10, col="red")
How can I do this in the best way?

Try this:
cutt <- seq(from = min(mtest$EndDateTime), to = max(mtest$EndDateTime), by = 30*60)
if (max(mtest$EndDateTime) > max(cutt)) cutt[length(cutt) + 1] <- max(cutt) + 30*60
mtest$tint <- cut(mtest$EndDateTime, cutt)
stats <- do.call(rbind, tapply(mtest[, "Price"], mtest[, "tint"],
    function(p) c(min = min(p), median = median(p), max = max(p))))
bp <- boxplot(mtest[, "Price"] ~ mtest[, "tint"], xaxt = "n",
    col = 1:length(levels(mtest$tint)))
axis(1, at = 1:length(levels(mtest$tint)),
    labels = format.Date(levels(mtest$tint), "%Y-%m-%d %H:%M"),
    las = 2, cex.axis = .5)
stats
Or with plot:
lint <- length(levels(mtest$tint))  # number of time intervals
plot(NA, ylim = range(stats), xlim = c(1, lint), type = "n", xaxt = "n", xlab = "", ylab = "")
sapply(1:3, function(z) points(stats[, z] ~ c(1:lint), col = z))
axis(1, at = 1:lint, labels = format.Date(levels(mtest$tint), "%Y-%m-%d %H:%M"),
    las = 2, cex.axis = .5)
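For reference, here is the same binning-and-summary idea as a fully self-contained sketch. The mtest frame below is simulated, since the original data isn't available; note that cut.POSIXt also accepts a "30 mins" break specification directly, which avoids building the breaks by hand:

```r
# Simulated stand-in for mtest (made-up prices and times, for illustration only)
set.seed(1)
mtest <- data.frame(
  Price = runif(200, 100, 400),
  EndDateTime = as.POSIXct("2015-05-23 00:00:00", tz = "UTC") +
    sort(sample(0:(6 * 3600), 200))
)

# 30-minute bins: cut.POSIXt understands "30 mins" directly
mtest$tint <- cut(mtest$EndDateTime, breaks = "30 mins")

# Min, median, and max price per interval
stats <- do.call(rbind, tapply(mtest$Price, mtest$tint,
  function(p) c(min = min(p), median = median(p), max = max(p))))
head(stats)
```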
You will have something like this:

Related

Converting a date to numeric in R

I have data where I have the dates in YYYY-MM-DD format in one column and another column is num.
packages:
library(forecast)
library(ggplot2)
library(readr)
Running str(my_data) produces the following:
spec_tbl_df [261 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:261], format: "2017-01-01" "2017-01-08" ...
$ popularity: num [1:261] 100 81 79 75 80 80 71 85 79 81 ...
- attr(*, "spec")=
.. cols(
.. date = col_date(format = ""),
.. popularity = col_double()
.. )
- attr(*, "problems")=<externalptr>
I would like to do some time series analysis on this. When running the first line of code for this, decomp <- stl(log(my_data), s.window="periodic"),
I keep running into the following error:
Error in Math.data.frame(my_data) :
non-numeric-alike variable(s) in data frame: date
Originally my date format was MM/DD/YYYY, so I feel like I'm at least a little closer. I'm learning R again, but it's been a while since I took a formal course in it. I did a cursory search here but could not find anything I could identify as helpful (I'm just an amateur).
You currently have a data.frame (or tibble variant thereof). That is not yet time aware. You can do things like
library(ggplot2)
ggplot(data=df) + aes(x=date, y=popularity) + geom_line()
to get a basic line plot properly indexed by date.
You will have to look more closely at the forecast package and at examples of the functions you want to use to predict or model. Packages like xts can help you, e.g.
library(xts)
x <- xts(df$popularity, order.by=df$date)
plot(x) # plot xts object
Besides plotting, you get time- and date-aware lags, leads, and subsetting. The rest depends on what you want to do, which you have not told us much about.
Lastly, if you want to convert your dates to numbers (days since Jan 1, 1970), a quick as.numeric(df$date) will do it; but using time-aware operations is often better (with the learning curve you are seeing now...)
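To make these points concrete, a minimal sketch with made-up data standing in for my_data: the stl() error in the question comes from calling log() on the whole data frame rather than the numeric column, and a Date is just a count of days since 1970-01-01 underneath:

```r
# Made-up stand-in for my_data (values are illustrative)
df <- data.frame(
  date = seq(as.Date("2017-01-01"), by = "week", length.out = 5),
  popularity = c(100, 81, 79, 75, 80)
)

# A Date is numeric underneath: days since 1970-01-01
as.numeric(as.Date("1970-01-02"))  # 1

# log() the measurement column, not the whole data frame --
# logging the frame is what triggers "non-numeric-alike variable(s)"
log_pop <- ts(log(df$popularity), frequency = 52)
```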

Time Series application - Guidance Needed

I am relatively new to R and am currently trying to implement a time series model on a data set to predict product volume for the next six months. My data set has two columns, Date (timestamp) and volume of product in inventory on that particular day, for example like this:
Date Volume
24-06-2013 16986
25-06-2013 11438
26-06-2013 3378
27-06-2013 27392
28-06-2013 24666
01-07-2013 52368
02-07-2013 4468
03-07-2013 34744
04-07-2013 19806
05-07-2013 69230
08-07-2013 4618
09-07-2013 7140
10-07-2013 5792
11-07-2013 60130
12-07-2013 10444
15-07-2013 36198
16-07-2013 11268
I need to predict six months of product volume required in inventory after the end date (the last row in my data set is "14-06-2019" "3131076"). I have approximately six years of data, with start date 24-06-2013 and end date 14-06-2019.
I tried using auto.arima (from the forecast package) on my data set and got many errors. I started researching ways to make my data suitable for ts analysis and came across the imputeTS and zoo packages.
I guess the date has high relevance for setting the frequency value in the model, so I did this: I created a new column and counted the occurrences of each weekday, which are not the same:
data1 <- mutate(data, day = weekdays(as.Date(Date)))
> View(data1)
> table(data1$day)
Friday Monday Saturday Sunday Thursday Tuesday Wednesday
213 214 208 207 206 211 212
There are no missing values against the dates, but we can see above that the counts of the weekdays differ, so some dates are missing. How do I proceed with that?
I have hit a kind of dead end; I tried going through various posts here on imputeTS and the zoo package but didn't get much success.
Can someone please guide me on how to proceed? Pardon me, admins and users, if you think this is spamming, but it is really important for me at the moment. I tried going through various time series tutorials elsewhere, but almost all of them use the AirPassengers data set, which I think has no flaws.
Regards
RD
library(imputeTS)
library(dplyr)
library(forecast)
setwd("C:/Users/sittu/Downloads")
data <- read.csv("ts.csv")
str(data)
$ Date : Factor w/ 1471 levels "01-01-2014","01-01-2015",..: 1132 1181 1221 1272 1324 22 71 115 163 213 ...
$ Volume: Factor w/ 1468 levels "0","1002551",..: 379 116 840 706 643 1095 1006 864 501 1254 ...
data$Volume <- as.numeric(data$Volume)
data$Date <- as.Date(data$Date, format = "%d/%m/%Y")
str(data)
'data.frame': 1471 obs. of 2 variables:
$ Date : Date, format: NA NA NA ... ## 1st Error now showing NA instead of dates
$ Volume: num 379 116 840 706 643 ...
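Before anything else, two likely culprits in the snippet above are worth fixing (a guess based on the str() output): as.numeric() on a factor returns the internal level codes, not the values, and the sample dates use dashes while the format string uses slashes:

```r
# A factor must go through as.character() before as.numeric(),
# otherwise you get the level codes (compare the two results)
v <- factor(c("16986", "11438", "3378"))
as.numeric(v)                 # 2 1 3  -- level codes, not values
as.numeric(as.character(v))   # 16986 11438  3378

# The sample dates are day-month-year with dashes, so "%d-%m-%Y"
as.Date("24-06-2013", format = "%d-%m-%Y")  # "2013-06-24"
```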
Let's try to reproduce that dataset first, including the missing data:
library(dplyr)  # for %>%, sample_frac(), left_join()

dates <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), 1)
volume <- floor(runif(365, min = 2500, max = 50000))
dummy_df <- data.frame(date = dates, Volume = volume)
df <- dummy_df %>% sample_frac(0.8)
Here we generated a data frame with a date and a volume for each day of 2018, with 20% of the rows dropped (sample_frac(0.8)). This should correctly mimic your dataset, which is missing data for some days.
What we want from there is to find the days with no volume data :
Df_full_dates <- as.data.frame(dates) %>%
left_join(df,by=c('dates'='date'))
Now you want to replace the NA values (which correspond to days with no data) with a volume. I used 0 here, but for missing data you might instead want the month average or some other specific value; I don't know what suits your data best from your sample:
Df_full_dates[is.na(Df_full_dates)] <- 0
From there, you have a dataset with data for each day, you should be able to find a model to predict the volume in future months.
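If you prefer the month average over 0, one base-R way (a sketch on freshly simulated data, since the column names above are only illustrative) is ave() over a year-month key:

```r
# Fill missing days with the month's average volume instead of 0 (base R only)
set.seed(42)
dates <- seq(as.Date("2018-01-01"), as.Date("2018-03-31"), 1)
volume <- floor(runif(length(dates), min = 2500, max = 50000))
df_full <- data.frame(dates = dates, Volume = volume)
df_full$Volume[sample(nrow(df_full), 18)] <- NA  # knock out 18 days

mo <- format(df_full$dates, "%Y-%m")             # year-month grouping key
df_full$Volume <- ave(df_full$Volume, mo, FUN = function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)           # replace NAs with the month mean
  v
})
```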
Tell me if you have any questions.

How to extract time from POSIXct and plot?

Here I have a data frame which looks like the following, with the first column "sample_time" (POSIXct) and the second "latitude":
> head(b)
sample_time latitude
3813442 2015-05-21 19:02:41 39.92770
3813483 2015-05-21 19:03:16 39.92770
3813485 2015-05-21 19:14:30 39.92433
3813515 2015-05-21 19:14:59 39.92469
3813550 2015-05-21 19:15:30 39.92520
3813585 2015-05-21 19:16:00 39.92585
Now I want to plot latitude vs. sample_time, with the x axis representing the 24-hour timestamps within a single day, and with latitude grouped by day.
Any help will be appreciated! Many thanks.
First, you need to define "day", as opposed to the full timestamp. Then you need to figure out what you mean by "group"; let's just say you want to aggregate and take the daily mean. Third, you need to make the plot.
b$day <- as.Date(round(b[, "sample_time"], units = "days"))
b_agg <- aggregate(list(latitude = b[, "latitude"]), by = list(day = b[, "day"]), FUN = mean)
plot(b_agg)
Edit:
Just an additional thought: if you didn't want to aggregate, you could skip the second step and change the third to plot(b[,"day"], b[,"latitude"]). Alternatively, you may even want something like boxplot(latitude~day, data=b).
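If the goal is the x axis the question describes (one 24-hour axis with days overlaid), extract the clock time from the POSIXct values. A sketch with made-up coordinates:

```r
# Put every observation on a common 24-hour axis by extracting
# the clock time as "hours since midnight" (illustrative data)
b <- data.frame(
  sample_time = as.POSIXct("2015-05-21 19:02:41", tz = "UTC") +
    c(0, 35, 709, 738, 3600 * 24 + 10),
  latitude = c(39.9277, 39.9277, 39.9243, 39.9247, 39.9252)
)

midnight <- trunc(b$sample_time, units = "days")
b$hour_of_day <- as.numeric(difftime(b$sample_time, midnight, units = "hours"))
b$day <- as.Date(b$sample_time)

# One colour per day, all days on the same 0-24h axis
plot(latitude ~ hour_of_day, data = b, col = as.integer(factor(b$day)),
     xlim = c(0, 24), xlab = "Hour of day")
```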

Exclude rows with certain time of day

I have a time series of continuous data measured at 10 minute intervals for a period of five months. For simplicity's sake, the data is available in two columns as follows:
Timestamp Temp.Diff
2/14/2011 19:00 -0.385
2/14/2011 19:10 -0.535
2/14/2011 19:20 -0.484
2/14/2011 19:30 -0.409
2/14/2011 19:40 -0.385
2/14/2011 19:50 -0.215
... And it goes on for the next five months. I have parsed the Timestamp column using as.POSIXct.
I want to select rows with certain times of the day (e.g. from 12 noon to 3 PM). I would either like to exclude the other hours of the day, or just extract those 3 hours while keeping the data sequential (i.e. as a time series).
You seem to know the basic idea, but are just missing the details. As you mentioned, we just transform the Timestamps into POSIX objects then subset.
lubridate Solution
The easiest way is probably with lubridate. First load the package:
library(lubridate)
Next convert the timestamp:
## mdy_hm: month, day, year _ hour, minute
d = mdy_hm(dd$Timestamp)
Then we select what we want. In this case, I want any times after 7:30 PM (regardless of the day):
dd[hour(d) == 19 & minute(d) > 30 | hour(d) >= 20,]
Base R solution
First create a lower time limit:
lower = strptime("2/14/2011 19:30","%m/%d/%Y %H:%M")
Next transform the Timestamps in POSIX objects:
d = strptime(dd$Timestamp, "%m/%d/%Y %H:%M")
Finally, a bit of dataframe subsetting:
dd[format(d,"%H:%M") > format(lower,"%H:%M"),]
Thanks to plannapus for this last part
Data for the above example:
dd = read.table(textConnection('Timestamp Temp.Diff
"2/14/2011 19:00" -0.385
"2/14/2011 19:10" -0.535
"2/14/2011 19:20" -0.484
"2/14/2011 19:30" -0.409
"2/14/2011 19:40" -0.385
"2/14/2011 19:50" -0.215'), header=TRUE)
You can do this easily with the time-based subsetting in the xts package. Assuming your data.frame is named Data:
library(xts)
x <- xts(Data$Temp.Diff, Data$Timestamp)
y <- x["T12:00/T15:00"]
# you need the leading zero if the hour is a single digit
z <- x["T09:00/T12:00"]
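And for the exact noon-to-3 PM window the question asks about, a base-R version that compares sortable "HH:MM" strings (re-creating a small dd frame here so the sketch runs standalone):

```r
# Base-R filter for the noon-to-3pm window, on a small sample frame
dd <- read.table(textConnection('Timestamp Temp.Diff
"2/14/2011 11:50" -0.385
"2/14/2011 12:10" -0.535
"2/14/2011 14:59" -0.484
"2/14/2011 15:10" -0.409'), header = TRUE)

d <- strptime(dd$Timestamp, "%m/%d/%Y %H:%M")
hm <- format(d, "%H:%M")               # clock time as a sortable string
noon_to_3 <- dd[hm >= "12:00" & hm <= "15:00", ]
```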

Aggregating, restructuring hourly time series data in R

I have a year's worth of hourly data in a data frame in R:
> str(df.MHwind_load) # compactly displays structure of data frame
'data.frame': 8760 obs. of 6 variables:
$ Date : Factor w/ 365 levels "2010-04-01","2010-04-02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Time..HRs. : int 1 2 3 4 5 6 7 8 9 10 ...
$ Hour.of.Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Wind.MW : int 375 492 483 476 486 512 421 396 456 453 ...
$ MSEDCL.Demand: int 13293 13140 12806 12891 13113 13802 14186 14104 14117 14462 ...
$ Net.Load : int 12918 12648 12323 12415 12627 13290 13765 13708 13661 14009 ...
While preserving the hourly structure, I would like to know how to extract
a particular month/group of months
the first day/first week etc of each month
all mondays, all tuesdays etc of the year
I have tried using "cut" without result, and after looking online I think "lubridate" might be able to do this, but I haven't found suitable examples. I'd greatly appreciate help on this issue.
Edit: a sample of data in the data frame is below:
Date Hour.of.Year Wind.MW datetime
1 2010-04-01 1 375 2010-04-01 00:00:00
2 2010-04-01 2 492 2010-04-01 01:00:00
3 2010-04-01 3 483 2010-04-01 02:00:00
4 2010-04-01 4 476 2010-04-01 03:00:00
5 2010-04-01 5 486 2010-04-01 04:00:00
6 2010-04-01 6 512 2010-04-01 05:00:00
7 2010-04-01 7 421 2010-04-01 06:00:00
8 2010-04-01 8 396 2010-04-01 07:00:00
9 2010-04-01 9 456 2010-04-01 08:00:00
10 2010-04-01 10 453 2010-04-01 09:00:00
.. .. ... .......... ........
8758 2011-03-31 8758 302 2011-03-31 21:00:00
8759 2011-03-31 8759 378 2011-03-31 22:00:00
8760 2011-03-31 8760 356 2011-03-31 23:00:00
EDIT: Additional time-based operations I would like to perform on the same dataset:
1. Perform hour-by-hour averaging for all data points, i.e. the average of all values in the first hour of each day in the year, and so on. The output will be an "hourly profile" of the entire year (24 time points).
2. Perform the same for each week and each month, i.e. obtain 52 and 12 hourly profiles respectively.
3. Compute seasonal averages, for example for June to September.
Convert the date to a format which lubridate understands, and then use the functions month, mday, and wday respectively.
Suppose you have a data.frame with the time stored in column Date, then the answer for your questions would be:
library(lubridate)

### dummy data.frame
df <- data.frame(Date = c("2012-01-01", "2012-02-15", "2012-03-01", "2012-04-01"), a = 1:4)

## 1. Select rows for a particular month
subset(df, month(Date) == 1)

## 2a. Select the first day of each month
subset(df, mday(Date) == 1)

## 2b. Select the first week of each month
## get the numbers of the weeks containing the first day of a month
wkd <- subset(week(df$Date), mday(df$Date) == 1)
## select the weeks with those numbers
subset(df, week(Date) %in% wkd)

## 3. Select all Mondays (wday() numbers Sunday as 1 by default, so Monday is 2)
subset(df, wday(Date) == 2)
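Weekday numbering is an easy place for an off-by-one: lubridate's wday() labels Sunday as 1 by default, while base R's as.POSIXlt()$wday runs 0 (Sunday) to 6 (Saturday). A quick base-R check of the convention, with no extra packages:

```r
# 2012-01-02 is a known Monday; POSIXlt numbers it 1 (0 = Sunday)
d <- as.Date("2012-01-02")
as.POSIXlt(d)$wday   # 1

# All Mondays in January 2012, without lubridate
dates <- seq(as.Date("2012-01-01"), as.Date("2012-01-31"), 1)
mondays <- dates[as.POSIXlt(dates)$wday == 1]
```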
First switch to a Date representation: as.Date(df.MHwind_load$Date)
Then call weekdays on the date vector to get a new factor labelled with day of week
Then call months on the date vector to get a new factor labelled with name of month
Optionally create a years variable (see below).
Now subset the data frame using the relevant combination of these.
Step 2 answers your task 3. Steps 3 and 4 get you to task 1. Task 2 might require a line or two of R; or just select rows corresponding to, say, all the Mondays in a month and call unique, or its alter ego duplicated, on the results.
To get you going...
newdf <- df.MHwind_load ## build an augmented data set
newdf$d <- as.Date(newdf$Date)
newdf$month <- months(newdf$d)
newdf$day <- weekdays(newdf$d)
## for some reason R has no years function. Here's one
years <- function(x){ format(as.Date(x), format = "%Y") }
newdf$year <- years(newdf$d)
# get observations from January to March of every year
subset(newdf, month %in% c('January', 'February', 'March'))
# get all Monday observations
subset(newdf, day == 'Monday')
# get all Mondays in 1999
subset(newdf, day == 'Monday' & year == '1999')
# slightly fancier: _first_ Monday of each month
# get the first weeks
first.week.of.month <- !duplicated(cbind(newdf$month, newdf$day))
# now pull out the Mondays
subset(newdf, first.week.of.month & day == 'Monday')
Since you're not asking about the time (hourly) part of your data, it is best to then store your data as a Date object. Otherwise, you might be interested in chron, which also has some convenience functions like you'll see below.
With respect to Conjugate Prior's answer, you should store your date data as a Date object. Since your data already follows the default format ('yyyy-mm-dd'), you can just call as.Date on it; otherwise, you would have to specify your string format. I would also use as.character on your factor to make sure you don't get errors inline. I know I've run into problems with factors-into-Dates for that reason (possibly corrected in the current version).
df.MHwind_load <- transform(df.MHwind_load, Date = as.Date(as.character(Date)))
Now you would do well to create wrapper functions that extract the information you desire. You could use transform like I did above to simply add those columns that represent months, days, years, etc, and then subset on them logically. Alternatively, you might do something like this:
getMonth <- function(x, mo) {   # This function assumes a within-single-year vector
isMonth <- month(x) %in% mo     # Boolean of matching months
return(x[which(isMonth)])       # Return vector of matching months
}                               # end function
Or, in short form
getMonth <- function(x, mo) x[month(x) %in% mo]
This is just a tradeoff between storing that information (transform frame) or having it processed when desired (use accessor methods).
A more complicated need is, say, the first day of a month. This is not entirely difficult, though. Below is a function that will return all of the values from that day, but it is rather simple to just subset a sorted vector of values for a given month and take the first one.
getFirstDay <- function(x, mo) {
isMonth <- months(x) %in% mo
x <- sort(x[isMonth]) # Look at only those in the desired month.
# Sort them by date. We only want the first day.
nFirsts <- rle(as.numeric(x))$len[1] # Returns length of 1st days
return(x[seq(nFirsts)])
} # end function
The easier alternative would be
getFirstDayOnly <- function(x, mo) {sort(x[months(x) %in% mo])[1]}
I haven't prototyped these, as you didn't provide any data samples, but this is the sort of approach that can help you get the information you desire. It is up to you to figure out how to fit these into your workflow. For instance, say you want to get the first day of each month in a given year (assuming we're only looking at one year; you can create wrappers or pre-filter your vector to a single year beforehand).
# Return a vector of first days for each month
df <- transform(df, date = as.Date(as.character(date)))
sapply(unique(months(df$date)), # Iterate through months in Dates
function(month) {getFirstDayOnly(df$date, month)})
The above could also be designed as a separate convenience function that uses the other accessor function. In this way, you create a series of direct but concise methods for getting the pieces of information you want. Then you simply pull them together into very simple, easy-to-interpret functions that you can use in your scripts to get precisely what you desire in the most efficient manner.
You should be able to use the above examples to figure out how to prototype other wrappers for accessing the date information you require. If you need help on those, feel free to ask in a comment.
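As a quick, runnable smoke test of the getFirstDayOnly() idea (base R only; note that months() returns locale-dependent month names, "January" etc. in an English locale):

```r
# Short-form accessor from above: first date falling in the given month
getFirstDayOnly <- function(x, mo) { sort(x[months(x) %in% mo])[1] }

x <- as.Date(c("2012-01-15", "2012-01-01", "2012-02-03", "2012-01-09"))
getFirstDayOnly(x, "January")   # expected: "2012-01-01"
```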
