I have intra-day trade data which I'm trying to plot using ggplot.
For a given day, the data looks as such (for example)…
head(q1)
time last bid ask volume center
1 2014-03-19 09:30:00.480 63.74 63.39 63.74 200 11
2 2014-03-19 09:30:00.645 63.41 63.41 63.60 100 11
3 2014-03-19 09:30:00.645 63.48 63.41 63.60 100 11
4 2014-03-19 09:30:02.792 63.59 63.44 63.60 100 11
5 2014-03-19 09:30:03.023 63.74 63.44 63.75 100 12
6 2014-03-19 09:30:12.987 63.72 63.44 63.76 100 11
tail(q1)
time last bid ask volume center
2116 2014-03-19 15:59:56.266 61.68 61.67 61.74 168 12
2117 2014-03-19 15:59:58.515 61.68 61.68 61.73 100 28
2118 2014-03-19 15:59:59.109 61.69 61.68 61.73 500 11
2119 2014-03-19 16:00:00.411 61.72 61.69 61.73 100 11
2120 2014-03-19 16:00:00.411 61.72 61.69 61.73 200 11
2121 2014-03-19 16:00:00.411 61.72 61.69 61.73 351 11
It's easy to use ggplot to visualize a single day of data; where I'm having trouble is linking multiple days together on the same plot. If I have two consecutive days in the data frames q1 and q2, how can I plot these on a single plot without the time gap while the market is closed and without lines linking the end of one day to the start of the next?
You could try creating a new transformation that maps clock time onto a seamless trading-time scale:
9:30 -> 0
12:00 -> approx. 0.5
16:00 -> 1
9:30 next day -> 1
Something along the following lines could work (but I haven't tried it myself):
library(scales)
library(plyr)  # for rbind.fill()

trading_day_trans <- function() {
  trans_new("trading_day", trans, inv,
            pretty_breaks(), domain = domain)
}

ggplot(rbind.fill(q1, q2)) + ... + coord_trans(xtrans = "trading_day")
You need to provide trans (the transformation function, time -> linear), inv (the inverse transform, linear -> time) and domain (a time vector of length 2, min-max).
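For concreteness, here is one rough, untested way those three pieces might be filled in (assuming the time column is POSIXct; time zones, weekends and holidays are ignored). The idea is to glue consecutive trading days together so that 16:00 of one day maps to the same point as 09:30 of the next:

open_h  <- 9.5   # 09:30 as hours since midnight
close_h <- 16.0  # 16:00

trans <- function(t) {
  # t arrives as numeric seconds since the epoch inside coord_trans
  t    <- as.POSIXct(t, origin = "1970-01-01", tz = "UTC")
  day  <- as.numeric(as.Date(t))                       # integer day index
  frac <- (as.numeric(format(t, "%H")) +
           as.numeric(format(t, "%M")) / 60 - open_h) / (close_h - open_h)
  day + pmin(pmax(frac, 0), 1)                         # clamp to the trading session
}

inv <- function(x) {
  day   <- floor(x)
  hours <- open_h + (x - day) * (close_h - open_h)
  as.POSIXct(as.Date(day, origin = "1970-01-01")) + hours * 3600
}

domain <- range(c(q1$time, q2$time))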
Adapted from http://blog.ggplot2.org/post/25938265813/defining-a-new-transformation-for-ggplot2-scales .
Related
I have a data frame with 18 columns and I want to see the seasonally adjusted state of each variable on a single chart.
Here is the head of my data frame:
head(cityFootfall)
Istanbul Eskisehir Mersin
1 44280 12452 11024
2 58713 13032 12773
3 21235 5629 5749
4 20934 5968 5764
5 21667 6022 5752
6 21386 6281 5920
Ankara Bursa Adana Izmir
1 19073 5098 8256 15623
2 22812 7551 10631 18511
3 8777 2260 3733 8625
4 8798 2252 3536 8573
5 8893 2398 3641 9713
6 8765 2391 3618 10542
Kayseri Antalya Konya
1 8450 2969 4492
2 8378 4421 0
3 3491 1744 0
4 3414 1833 0
5 3596 1733 0
6 3481 1785 1154
Samsun Kahramanmaras Aydin
1 4472 4382 4376
2 4996 4773 5561
3 1662 1865 2012
4 1775 1710 1957
5 1700 1704 1940
6 1876 1848 1437
Gaziantep Sanliurfa Izmit
1 3951 3752 3825
2 5412 4707 4125
3 2021 1326 1890
4 1960 1411 1918
5 1737 1204 1960
6 1833 1143 2047
Denizli Malatya
1 2742 3809
2 3658 4346
3 1227 1975
4 1172 1884
5 1102 2073
6 1171 2060
Here is my function for this:
plot_seasonality <- function(x) {
  par(mfrow = c(6, 3))
  plot_draw <- lapply(x, function(col) {
    dec <- decompose(ts(col, freq = 7), type = "additive")
    plot(dec$x - dec$seasonal)
  })
}
plot_seasonality(cityFootfall)
When I run this function I get the error Error in plot.new() : figure margins too large, but when I change my code from par(mfrow=c(6,3)) to par(mfrow=c(3,3)) it works and gives me the plots for the last 9 columns, like this image. However, I want to see all variables in a single chart.
Could anyone help me solve this problem?
Fundamentally your window is not big enough to plot that:
1) open a bigger window with dev.new() (or, from RStudio, X11() under Linux or quartz() under macOS)
2) simplify your ylab, which will free up space
# made-up data
x <- seq(0, 14, length.out = 14 * 10)
y <- matrix(rnorm(14 * 10 * 6 * 3), nrow = 3 * 6)
# large window (you may need `X11()` on Linux/RStudio to force a new window)
dev.new(width = 20, height = 15)
par(mfrow = c(6, 3))
# You could do this with `lapply`, but don't be put off `for` loops:
# they often do exactly the job you need in R
for (i in 1:nrow(y)) {
  plot(x, y[i, ], xlab = "Time", ylab = paste("variable", i), type = "l")
}
You should also consider plotting several variables in the same graph (using lines after an initial plot).
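For instance, with the made-up x and y from above, you could overlay all the series on one set of axes:

# overlay all rows of y in a single plot; colours distinguish the variables
plot(x, y[1, ], type = "l", xlab = "Time", ylab = "value", ylim = range(y))
for (i in 2:nrow(y)) {
  lines(x, y[i, ], col = i)
}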
As suggested: transform the data to long format with the tidyr package (see the gather function).
I added a time variable since it was missing.
library(dplyr)  # for the pipe
library(tidyr)  # for gather()
temp <- cityFootfall %>%
  transform(time = 1:nrow(cityFootfall)) %>%
  gather(variable, key, -time)
Now plot it with ggplot2 (default settings; you can adjust this as you want):
library(ggplot2)
ggplot(temp, aes(x = time, y = key, group = variable, color = variable)) +
  geom_point() + geom_line()
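One common adjustment (not in the original answer) is to facet instead of overlaying all 18 lines, which keeps each city readable:

ggplot(temp, aes(x = time, y = key)) +
  geom_line() +
  facet_wrap(~ variable, scales = "free_y")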
Date Sales
3/11/2017 1
3/12/2017 0
3/13/2017 40
3/14/2017 47
3/15/2017 83
3/16/2017 62
3/17/2017 13
3/18/2017 58
3/19/2017 27
3/20/2017 17
3/21/2017 71
3/22/2017 76
3/23/2017 8
3/24/2017 13
3/25/2017 97
3/26/2017 58
3/27/2017 80
3/28/2017 77
3/29/2017 31
3/30/2017 78
3/31/2017 0
4/1/2017 40
4/2/2017 58
4/3/2017 32
4/4/2017 31
4/5/2017 90
4/6/2017 35
4/7/2017 88
4/8/2017 16
4/9/2017 72
4/10/2017 39
4/11/2017 8
4/12/2017 88
4/13/2017 93
4/14/2017 57
4/15/2017 23
4/16/2017 15
4/17/2017 6
4/18/2017 91
4/19/2017 87
4/20/2017 44
The current date is 20/04/2017. My question: how can I group the data from 11/03/2017 through 19/04/2017 into 4 equal parts and sum the sales within each part in R?
E.g.:
library("xts")
ep <- endpoints(data, on = 'days', k = 4)
period.apply(data,ep,sum)
It's not working: it takes the data from the start date up to the current date, but I need to gather the data from yesterday (19/4/2017) back to the start date and split it into 4 equal parts.
Could anyone kindly guide me?
Thank you
Base R has a function cut.Date() which is built for the purpose.
However, the question is not fully clear on what the OP intends. My understanding of the requirements supplied in the question and an additional comment is:
Take the sales data per day in Book1 but leave out the current day, i.e., use only completed days.
Group the data into four equal parts, i.e., four periods containing an equal number of days. (Note that the title of the question and the attempt to use xts::endpoints() with k = 4 indicate that the OP might instead intend to group the data into periods of four days each.)
Summarize the sales figures by period.
For the sake of brevity, data.table is used here for data manipulation and aggregation, and lubridate for date manipulation.
library(data.table)
library(lubridate)
# coerce to data.table, convert Date column from character to class Date,
# exclude the actual date
temp <- setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()]
# cut the date range in four parts
temp[, start_date_of_period := cut.Date(Date, 4)]
temp
# Date Sales start_date_of_period
# 1: 2017-03-11 1 2017-03-11
# 2: 2017-03-12 0 2017-03-11
# 3: 2017-03-13 40 2017-03-11
# ...
#38: 2017-04-17 6 2017-04-10
#39: 2017-04-18 91 2017-04-10
#40: 2017-04-19 87 2017-04-10
# Date Sales start_date_of_period
# aggregate sales by period
temp[, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
# start_date_of_period n_days total_sales
#1: 2017-03-11 10 348
#2: 2017-03-21 10 589
#3: 2017-03-31 10 462
#4: 2017-04-10 10 507
Thanks to chaining, this can be put together in one statement without using a temporary variable:
setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()][
, start_date_of_period := cut.Date(Date, 4)][
, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
Note: if you want to reproduce the result in the future, you will have to replace the call to today(), which excludes the current day, with mdy("4/20/2017"), the last day in the sample data set supplied by the OP.
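For comparison, here is the same idea in base R only (a sketch, assuming Book1$Date is character in month/day/year format):

b <- Book1
b$Date <- as.Date(b$Date, format = "%m/%d/%Y")
b <- b[b$Date != as.Date("2017-04-20"), ]   # drop the current (incomplete) day
b$period <- cut(b$Date, breaks = 4)         # cut.Date(): four equal-length periods
aggregate(Sales ~ period, data = b, FUN = sum)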
I am quite new to R and am trying to perform an ARIMA time series forecast. The data I am looking at is electricity load per 15 minutes. My data looks as follows:
day month year PTE periode_van periode_tm gemeten_uitwisseling
1 1 01 2010 1 0 secs 900 secs 2636
2 1 01 2010 2 900 secs 1800 secs 2621
3 1 01 2010 3 1800 secs 2700 secs 2617
4 1 01 2010 4 2700 secs 3600 secs 2600
5 1 01 2010 5 3600 secs 4500 secs 2582
geplande_import geplande_export date weekend
1 719 -284 2010-01-01 00:00:00 0
2 719 -284 2010-01-01 00:15:00 0
3 719 -284 2010-01-01 00:30:00 0
4 719 -284 2010-01-01 00:45:00 0
5 650 -253 2010-01-01 01:00:00 0
weekday Month gu_ma
1 5 01 NA
2 5 01 NA
3 5 01 NA
4 5 01 NA
5 5 01 NA
To create a time series I have used the following code:
library("zoo")
ZOO <- zoo(NLData$gemeten_uitwisseling,
order.by=as.POSIXct(NLData$date, format="%Y-%m-%d %H:%M:%S"))
ZOO <- na.approx(ZOO)
tsNLData <- ts(ZOO)
plot(tsNLData)
I have also tried the following:
NLDatats <- ts(NLData$gemeten_uitwisseling, frequency = 96)
However, when I plot the data I get the following:
How can I solve this?
There doesn't seem to be any problem with your graph, but your data come in 15-minute intervals and you are plotting 4 years' worth of data. So naturally it will look like a dark shaded region, because there is no way to show the thousands of data points in your series in a single plot.
If you are struggling to handle this much data, you can consider sampling from your data frame before plotting, although this will remove seasonality and autocorrelation from the outcome. That can be helpful if you want to know average values of your outcome over time, but not as helpful to see the seasonal and autocorrelative structure in the data.
See the code below that uses dplyr and ggplot2 to plot some simulated time series that illustrates these issues. It's always best to start with simulated data and then work with your own data.
require(ggplot2)
require(dplyr)
sim_data <- arima.sim(model=list(ar=.88,order=c(1,0,0)),n=10000,sd=.3)
#Too many points
data_frame(y=as.numeric(sim_data),x=1:10000) %>% ggplot(aes(y=y,x=x)) + geom_line() +
theme_minimal() + xlab('Time') + ylab('Y_t')
#Sample from data (random sample)
#However, this will remove autocorrelation/seasonality
data_frame(y=as.numeric(sim_data),x=1:10000) %>% sample_n(500) %>%
ggplot(aes(y=y,x=x)) + geom_line() + theme_minimal() + xlab('Time') + ylab('Y_t')
# Plot a subset, which preserves autocorrelation and seasonality
data_frame(y=as.numeric(sim_data),x=1:10000) %>% slice(1:300) %>%
ggplot(aes(y=y,x=x)) + geom_line() + theme_minimal() + xlab('Time') + ylab('Y_t')
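As an alternative that is not part of the answer above: instead of sampling, you could aggregate the 15-minute series to a coarser interval (e.g. daily means) before plotting, which reduces the number of points while keeping the broad shape of the series. A sketch using the ZOO object built in the question:

library(zoo)
# daily mean of the 15-minute series; assumes ZOO is the zoo series built above
daily_mean <- aggregate(ZOO, as.Date(index(ZOO)), mean)
plot(daily_mean, xlab = "Date", ylab = "gemeten_uitwisseling (daily mean)")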
Here, I have a data set with a start date, an end date and the usage. I have calculated the number of days between these two dates and got the daily usage. (I am okay with one flat usage for each day for now.)
Now, what I want to achieve is the sum of the usage for each day in those time frames for the month of June. For example, in the first case it will be just the DAILY_USAGE:
START_DATE END_DATE x DAYS DAILY_USAGE
1 2015-05-01 2015-06-01 261605.00 32 8175.156250
And for the 2nd, I want to add the usage 3905 to June 1st and also to June 2nd, because the row spans both June 1st and June 2nd.
2015-05-04 2015-06-02 117159.00 30 3905.3000000
I want to continue doing this for all 387 rows and at the end get the sum of usage for each day. I do not know how to do this for hundreds of records.
This is what my data set looks like right now:
str(YYY)
'data.frame': 387 obs. of 5 variables:
$ START_DATE : Date, format: "2015-05-01" "2015-05-04" "2015-05-11" "2015-05-13" ...
$ END_DATE : Date, format: "2015-06-01" "2015-06-01" "2015-06-01" "2015-06-01" ...
$ x : num 261605 1380796 183 103 489 ...
$ DAYS : num 32 29 22 20 19 12 1 34 30 29 ...
$ DAILY_USAGE: num 8175.16 47613.66 8.32 5.13 25.74 ...
Also, here are the first rows including the header:
START_DATE END_DATE x DAYS DAILY_USAGE
1 2015-05-01 2015-06-01 261605.00 32 8175.1562500
2 2015-05-04 2015-06-01 1380796.00 29 47613.6551724
6 2015-05-21 2015-06-01 1392.00 12 116.0000000
7 2015-06-01 2015-06-01 2503.00 1 2503.0000000
8 2015-04-30 2015-06-02 0.00 34 0.0000000
9 2015-05-04 2015-06-02 117159.00 30 3905.3000000
10 2015-05-05 2015-06-02 193334.00 29 6666.6896552
13 2015-05-04 2015-06-03 630.00 31 20.3225806
and so on........
Example data set and results
I will call this data set EXAMPLE1 (3 days of mocked-up data):
START_DATE END_DATE x DAYS DAILY_USAGE
5/1/2015 6/1/2015 261605 32 8175.15625
5/4/2015 6/1/2015 1380796 29 47613.65517
5/11/2015 6/1/2015 183 22 8.318181818
4/30/2015 6/2/2015 0 34 0
5/20/2015 6/2/2015 70 14 5
6/1/2015 6/2/2015 569 2 284.5
6/1/2015 6/3/2015 582 3 194
6/2/2015 6/3/2015 6 2 3
For the above example, the answer should look like this:
DAY USAGE
6/1/2015 56280.6296
6/2/2015 486.5
6/3/2015 197
HOW?
In EXAMPLE1, for June 1st, I have added the usage from all rows except the last one, because the last row doesn't include the date 06/01 in its time frame: it starts on 06/02 and ends on 06/03.
To get June 2nd, I have added the usage from rows 4 to 8, because June 2nd falls between all of those start and end dates.
For June 3rd, I have only added the last two rows to get 197.
So where to sum depends on the time frame given by START_DATE and END_DATE.
Hope this helps!
There might be an easier trick to do this than writing 400 lines of if-else statements.
Thank you again for your time!!
-Gyve
library(lubridate)
indx <- lapply(unique(mdy(df[,2])), '%within%', interval(mdy(df[,1]), mdy(df[,2])))
cbind.data.frame(DAY=unique(df$END_DATE),
USAGE=unlist(lapply(indx, function(x) sum(df$DAILY_USAGE[x]))))
# DAY USAGE
# 1 6/1/2015 56280.63
# 2 6/2/2015 486.50
# 3 6/3/2015 197.00
Explanation
We can expand it to explain what is happening:
indx <- lapply(unique(mdy(df[,2])), '%within%', interval(mdy(df[,1]), mdy(df[,2])))
The unique end dates are tested for falling within the intervals defined by the first and second columns. mdy is a quick way to convert character dates to Date objects with lubridate. The operator %within% tests a date against an interval, and we created the intervals with interval(mdy(df[,1]), mdy(df[,2])). This gives a logical index that we can use to subset the data.
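As a quick illustration of %within% on made-up dates (not from the original data):

library(lubridate)
mdy("6/1/2015") %within% interval(mdy("5/1/2015"), mdy("6/1/2015"))  # TRUE
mdy("6/3/2015") %within% interval(mdy("5/1/2015"), mdy("6/1/2015"))  # FALSE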
In our final data frame,
cbind.data.frame(DAY=unique(df$END_DATE),
creates the first column of dates.
And,
USAGE=unlist(lapply(indx, function(x) sum(df$DAILY_USAGE[x])))
takes the sum of df$DAILY_USAGE by the index that we created.
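If you need a total for every calendar day rather than only for the distinct end dates, one possible approach, sketched here and not part of the answer above, is to expand each row into one row per day it covers and then sum by day:

library(lubridate)

# expand every row of df to one row per covered day, then sum usage per day
daily <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  days <- seq(mdy(df$START_DATE[i]), mdy(df$END_DATE[i]), by = "day")
  data.frame(DAY = days, USAGE = df$DAILY_USAGE[i])
}))
totals <- aggregate(USAGE ~ DAY, data = daily, FUN = sum)
# keep only June, if that is the month of interest
totals[months(totals$DAY) == "June", ]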
This is probably a very simple question that has been asked already, but...
I have a data frame that I have constructed from a CSV file generated in Excel. The observations are not homogeneously sampled, i.e. they are for "On Peak" times of electricity usage. That means they exclude different days each year. I have 20 years of data (1993-2012) and am running both non-robust and robust LOESS to extract seasonal and linear trends.
After the decomposition has been done, I want to focus only on the observations from June through September.
How can I create a new data frame of just those results?
Sorry about the formatting, too.
Date MaxLoad TMAX
1 1993-01-02 2321 118.6667
2 1993-01-04 2692 148.0000
3 1993-01-05 2539 176.0000
4 1993-01-06 2545 172.3333
5 1993-01-07 2517 177.6667
6 1993-01-08 2438 157.3333
7 1993-01-09 2302 152.0000
8 1993-01-11 2553 144.3333
9 1993-01-12 2666 146.3333
10 1993-01-13 2472 177.6667
As Joran notes, you don't need anything other than base R:
## Reproducible data
df <-
data.frame(Date = seq(as.Date("2009-03-15"), as.Date("2011-03-15"), by="month"),
MaxLoad = floor(runif(25,2000,3000)), TMAX=runif(25,100,200))
## One option
df[months(df$Date) %in% month.name[6:9],]
# Date MaxLoad TMAX
# 4 2009-06-15 2160 188.4607
# 5 2009-07-15 2151 164.3946
# 6 2009-08-15 2694 110.4399
# 7 2009-09-15 2460 150.4076
# 16 2010-06-15 2638 178.8341
# 17 2010-07-15 2246 131.3283
# 18 2010-08-15 2483 112.2635
# 19 2010-09-15 2174 160.9724
## Another option: strftime() will be more _generally_ useful than months()
df[as.numeric(strftime(df$Date, "%m")) %in% 6:9,]