plot a categorical variable based on two other variables - r

So I have a data.frame which contains the columns date, price and a categorical variable.
> head(join)
date e5 near_motorway
1 2019-01-01 05:00:12 1.449 1
2 2019-01-01 05:00:12 1.439 1
3 2019-01-01 05:03:06 1.439 0
4 2019-01-01 05:03:06 1.439 1
5 2019-01-01 05:03:06 1.449 0
6 2019-01-01 05:03:06 1.449 1
I want to draw two lines in one plot based on the categorical variable, with the hour of the date on the x axis and the price on the y axis.
Does anybody have a solution?

This should work:
library(ggplot2)
ggplot(data = join,
aes(x = date, y = e5, col = near_motorway, group = near_motorway)) +
geom_line()
I am assuming that date is in a date-time format, e5 is numeric, and near_motorway is a factor. I am also assuming that e5 is the price.
To customize the graph, you can play with scale_x_datetime and scale_colour_manual, and with your preferred theme.
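The question asks for the hour of the day on the x axis, which the plot above does not do by itself. A minimal sketch of one way to get there, assuming the price should be averaged per hour and per category (the sample rows and the `hour` and `hourly` names are made up for illustration):

```r
library(ggplot2)

# Hypothetical sample shaped like the asker's `join` data frame
join <- data.frame(
  date = as.POSIXct(c("2019-01-01 05:00:12", "2019-01-01 05:03:06",
                      "2019-01-01 06:10:00", "2019-01-01 06:15:00"),
                    tz = "UTC"),
  e5 = c(1.449, 1.439, 1.459, 1.429),
  near_motorway = c(1, 0, 1, 0)
)

# Hour of day as an integer; the 0/1 flag as a factor so it maps to discrete colours
join$hour <- as.integer(format(join$date, "%H"))
join$near_motorway <- factor(join$near_motorway)

# Mean price per hour and category: one point per hour on each line
hourly <- aggregate(e5 ~ hour + near_motorway, data = join, FUN = mean)

p <- ggplot(hourly, aes(x = hour, y = e5,
                        colour = near_motorway, group = near_motorway)) +
  geom_line() +
  labs(x = "Hour of day", y = "Price (e5)")
```

Printing `p` draws the two lines. Averaging is only one possible choice; a median or the raw points could be used instead.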

Related

How to apply several time series models like ets, auto.arima etc. to groups in the data in R using purrr/tidyverse?

My dataset looks like the following. I am trying to predict the 'amount' for the next 2 months using either ets, auto.arima, Prophet or any other model. My issue is that I would like to predict the amount for each group (A, B, C) separately for the next 2 months. I am not sure how to do that in R.
data = data.frame(Date=c('2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01','2017-05-01','2017-06-01','2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01','2017-05-01','2017-06-01','2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01','2017-05-01','2017-06-01'),
Group=c('A','A','A','A','A','A','B','B','B','B','B','B','C','C','C','C','C','C'),
Amount=c('12.1','13','15','10','12','9.0','12.5','13.3','14.8','11','10','12.1','13','12.2','11','10.9','13.4','11.1'))
data
Date Group Amount
1 2017-01-01 A 12.1
2 2017-02-01 A 13
3 2017-03-01 A 15
4 2017-04-01 A 10
5 2017-05-01 A 12
6 2017-06-01 A 9.0
7 2017-01-01 B 12.5
8 2017-02-01 B 13.3
9 2017-03-01 B 14.8
10 2017-04-01 B 11
11 2017-05-01 B 10
12 2017-06-01 B 12.1
13 2017-01-01 C 13
14 2017-02-01 C 12.2
15 2017-03-01 C 11
16 2017-04-01 C 10.9
17 2017-05-01 C 13.4
18 2017-06-01 C 11.1
I need to forecast multiple univariate time series models (ets, auto.arima and prophet) by group (A, B, C). Assume the groups are independent of each other. Also, how can we extract error metrics and point forecasts, say 2 periods ahead (in a data frame), and plot the forecasts, again grouped by group?
Could iterative methods using packages such as tidyverse/purrr or sweep be a solution here?
First convert the dates to the yearmon class so that the months are regularly spaced; Date values are not, because months differ in their number of days. yearmon represents dates internally as year + 0 for Jan, year + 1/12 for Feb, ..., year + 11/12 for Dec. If desired, the yearmon value can subsequently be converted to numeric using as.numeric to get the internal representation.
calc represents the function that performs the calculation on a single group. Replace it with your function. Its first argument should be a data frame with Date and Amount columns. Additional arguments are optional and only needed if it is desired to pass fixed parameters that do not vary across groups. In the example below we pass a string, "Hello" to the msg argument. The function can return any sort of object such as a plain vector, list or other object.
In the last line by will call calc, once per group, returning a list of the return values from calc, one component per group.
library(zoo)
data2 <- transform(data,
Date = as.yearmon(Date),
Amount = as.numeric(Amount)
)
calc <- function(dat, msg) {
print(msg)
fm <- lm(Amount ~ Date, dat)
predict(fm, list(Date = tail(dat$Date, 1) + 2/12))
}
by(data2[-2], data2[[2]], calc, msg = "Hello")
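The question also asks for the point forecasts in a data frame. A minimal base R sketch of the same per-group idea, using split and lapply and binding the results together. The `forecast_group` helper, the `idx` month index and the `h` argument are hypothetical names, and `lm` is only a stand-in model, as in the answer above:

```r
# Same data as in the question, with Amount already numeric
data <- data.frame(
  Date = rep(seq(as.Date("2017-01-01"), by = "month", length.out = 6), times = 3),
  Group = rep(c("A", "B", "C"), each = 6),
  Amount = c(12.1, 13, 15, 10, 12, 9.0, 12.5, 13.3, 14.8, 11, 10, 12.1,
             13, 12.2, 11, 10.9, 13.4, 11.1)
)

# Forecast h periods ahead for one group.
# A linear trend on a month index is a placeholder for the real model.
forecast_group <- function(dat, h = 2) {
  dat <- dat[order(dat$Date), ]
  dat$idx <- seq_len(nrow(dat))
  fm <- lm(Amount ~ idx, data = dat)
  new_idx <- nrow(dat) + seq_len(h)
  data.frame(
    Group = dat$Group[1],
    Date = seq(max(dat$Date), by = "month", length.out = h + 1)[-1],
    Forecast = unname(predict(fm, data.frame(idx = new_idx)))
  )
}

# One call per group, then stack the per-group results into one data frame
forecasts <- do.call(rbind, lapply(split(data, data$Group), forecast_group))
rownames(forecasts) <- NULL
```

`forecasts` then holds two rows per group. Swapping the body of the helper for another model should leave the split, lapply and rbind steps unchanged.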

ggplot time series: Plotting a full month on the x-axis

Ola, I have a question concerning ggplot.
In this code I have 31 days and each day has around 200 measured times with a precipitation sum.
Delay is measured in seconds, Precip in mm (times 100, for plotting measures).
The code below is just so you can get a good view on this.
I'm currently struggling to form this into a ggplot graph.
X ARRIVAL DELAY PRECIP DATDEP
1 08:12 -10 0 01AUG2019
2 11:22 120 19.2222 01AUG2019
3 09:22 22 0.4444 01AUG2019
4 21:22 0 33.2222 01AUG2019
5 08:22 2 744.4444 02AUG2019
etc. etc.
How do I manage to plot this month into a plot with 2 lines (one for DELAY and one for PRECIP)? I can't manage to transform the DATDEP into a nice time frame on the x-axis.
Q2: How would you do the same for, say, 4 months, and still get a nice time frame on the x-axis?
Use package lubridate for your dates. And make your data long. Many many threads here on SO on this topic.
library(tidyverse)
library(lubridate)
#devtools::install_github('alistaire47/read.so')
mydf <- read.so::read.so(' X ARRIVAL DELAY PRECIP DATDEP
1 08:12 -10 0 01AUG2019
2 11:22 120 19.2222 01AUG2019
3 09:22 22 0.4444 01AUG2019
4 21:22 0 33.2222 01AUG2019
5 08:22 2 744.4444 02AUG2019')
mydf <- mydf %>% pivot_longer(names_to = 'key', values_to = 'value', cols = DELAY:PRECIP)
ggplot(mydf, aes(dmy(DATDEP), value)) + geom_line(aes(color = key))
Created on 2020-03-20 by the reprex package (v0.3.0)
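For the second part of the question (several months on one axis), the tick spacing and label format can be set once the dates are parsed. A small sketch, assuming made-up rows spread over four months and showing only DELAY:

```r
library(ggplot2)
library(lubridate)

# Hypothetical rows in the same DATDEP format, spread over four months
mydf <- data.frame(
  DATDEP = c("01AUG2019", "15SEP2019", "01OCT2019", "20NOV2019"),
  DELAY = c(-10, 120, 22, 0)
)

# Parse the day-month-year strings into Date values
mydf$date <- dmy(mydf$DATDEP)

# One tick per month, labelled like "Aug 2019"
p <- ggplot(mydf, aes(date, DELAY)) +
  geom_line() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y")
```

For a single month, something like `date_breaks = "1 week"` with `date_labels = "%d %b"` may read better.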

How do I only have x-axis labels that specify when the year is changed in R?

I've downloaded a couple of .csv files, and they look something like this, just a lot longer and the date continues until 2020-03-13.
Date Open High Low Close Adj.Close Volume
1 2015-03-13 2064.56 2064.56 2041.17 2053.40 2053.40 3498560000
2 2015-03-16 2055.35 2081.41 2055.35 2081.19 2081.19 3295600000
3 2015-03-17 2080.59 2080.59 2065.08 2074.28 2074.28 3221840000
4 2015-03-18 2072.84 2106.85 2061.23 2099.50 2099.50 4128210000
5 2015-03-19 2098.69 2098.69 2085.56 2089.27 2089.27 3305220000
6 2015-03-20 2090.32 2113.92 2090.32 2108.10 2108.10 5554120000
I've created a data frame that looks like this based on the data
Date t SandP AMD
1 0 1 0.000000000 0.000000000
2 2015-03-16 2 0.013442909 0.003629768
3 2015-03-17 3 -0.003325698 0.003616640
4 2015-03-18 4 0.012085102 -0.007246409
5 2015-03-19 5 -0.004884489 -0.003642991
6 2015-03-20 6 0.008972382 0.021661497
I am trying to graph the SandP and AMD columns on the same axis, however I only want the axis labels to show each year (when each year changes). Therefore I would only want the 6 ticks on the axis (2015,2016,2017,2018,2019,2020).
If it helps, the .csv files were downloaded from Yahoo Finance data for S&P500.
This is my code up to now:
SPdata <- read.csv("^GSPC.csv")
AMDdata <- read.csv("AMD.csv")
head(SPdata)
R_t <- function(t){
S=log(SPdata[t,6])-log(SPdata[t-1,6])
return(S)
}
S_t <- function(t){
S=log(AMDdata[t,6])-log(AMDdata[t-1,6])
return(S)
}
comparedata <- data.frame(0,1,0,0)
names(comparedata)[1]<-"Date"
names(comparedata)[2]<-"t"
names(comparedata)[3]<-"SandP"
names(comparedata)[4]<-"AMD"
t<-2
while(t<1260){
comparedata <-rbind(comparedata, list(AMDdata[t,1],t,R_t(t),S_t(t)))
t=t+1
}
# install.packages("ggplot2")
library("ggplot2")
ggplot() +
geom_line(data=comparedata, aes(x=Date,y=SandP),color="red",group=1)+
geom_line(data=comparedata, aes(x=Date,y=AMD), color="blue",group=1)+
labs(x="Date",y="Returns")
I think you need to use scale_x_date and set the arguments date_breaks and date_labels (see the official documentation: https://ggplot2.tidyverse.org/reference/scale_date.html)
Here, I recreate an example using the small portion of the data you provided:
library(lubridate)
date <- seq(ymd("2015-03-16"), ymd("2020-03-13"), by = "day")
df <- data.frame(date = date,
t = 1:1825,
SandP = rnorm(1825),
AMD = rnorm(1825))
Starting from this example, I reshape the dataframe into a longer format using pivot_longer function from tidyr:
library(tidyr)
DF <- df %>% pivot_longer(cols = c(SandP, AMD), names_to = "indices", values_to = "values")
# A tibble: 3,650 x 4
date t indices values
<date> <int> <chr> <dbl>
1 2015-03-16 1 SandP 0.566
2 2015-03-16 1 AMD -0.185
3 2015-03-17 2 SandP -1.59
4 2015-03-17 2 AMD 0.236
5 2015-03-18 3 SandP 1.11
6 2015-03-18 3 AMD -1.52
7 2015-03-19 4 SandP -1.02
8 2015-03-19 4 AMD 0.0833
9 2015-03-20 5 SandP 2.78
10 2015-03-20 5 AMD -0.173
# … with 3,640 more rows
Then, I plot both indices according to the date using ggplot2:
library(ggplot2)
ggplot(DF, aes(x = date, y = values, color = indices))+
geom_line()+
labs(x="Date",y="Returns")+
scale_x_date(date_breaks = "year", date_labels = "%Y")
Does it look like what you are trying to achieve?
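Note that scale_x_date needs a column of class Date. The loop in the question starts the data frame with 0 in the Date column, so that column may not end up as a Date. A minimal sketch of building the returns table in vectorized form instead; the AMD prices here are invented, and only the S&P values come from the question:

```r
# Dates and S&P adjusted closes from the question; AMD prices are hypothetical
dates <- as.Date(c("2015-03-13", "2015-03-16", "2015-03-17", "2015-03-18"))
sp_close <- c(2053.40, 2081.19, 2074.28, 2099.50)
amd_close <- c(2.75, 2.76, 2.77, 2.75)

# diff(log(x)) gives log(x[t]) - log(x[t-1]) for every t at once,
# so the first date has no return and is dropped
comparedata <- data.frame(
  Date = dates[-1],
  SandP = diff(log(sp_close)),
  AMD = diff(log(amd_close))
)
```

With real data, the two price vectors would come from the Adj.Close columns of the CSV files, and the date column would be passed through as.Date first.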

Plotting daily summed values of data against months [duplicate]

This question already has answers here:
How to change x axis from years to months with ggplot2
(2 answers)
Closed 5 years ago.
I am trying to make a ggplot of solar irradiance (from a weather file) on y-axis and time in months on x-axis.
My data consists of values collected on hour basis for 12 months so overall there are 8760 rows filled with data values.
Now I want to make the plot in such a way that for a single day I only get one point, obtained by adding up the values for the complete day (not by taking all the values and plotting them). I believe geom_freqpoly() can plot this type of data. I have looked for this but have not found enough examples of what I want. Perhaps there is some approach that can help me achieve the plot I want, as I am not sure exactly what I have to do to add up the points for a day; writing separate code for 365 days would be crazy.
I want the following kind of plot
My plot is showing all the reading for a year and looks like this
My code for this plotting is :
library(ggplot2)
library(lubridate) # month(), mday()
library(tidyr) # unite()
cmsaf_data <- read.csv("C://Users//MEJA03514//Desktop//main folder//Irradiation data//tmy_era_25.796_45.547_2005_2014.csv",skip=16, header=T)
time<- strptime(cmsaf_data[,2], format = "%m/%d/%Y %H:%M")
data <- cbind(time,cmsaf_data[5])
#data %>% select(time)
data <- data.frame(data, months = month(time),days = mday(time))
data <- unite(data, date_month, c(months, days), remove=FALSE, sep="-")
data <- subset(data, data[,2]>0)
GHI <- data[,2]
date_month <- data[,3]
ggplot(data, aes(date_month, GHI))+geom_line()
whereas my data looks like this :
head(data)
time Global.horizontal.irradiance..W.m2.
1 2007-01-01 00:00:00 0
2 2007-01-01 01:00:00 0
3 2007-01-01 02:00:00 0
4 2007-01-01 03:00:00 0
5 2007-01-01 04:00:00 0
6 2007-01-01 05:00:00 159
As I want 1 point per day, how can I apply a sum function so that I get the output I require and show month names on the x-axis (maybe using some date-time function that can do this addition per day and give 365 values for a year in the output)?
I have no idea at all of any such function or approach.
Your help will be appreciated!
Here is a solution using the tidyverse and lubridate packages. As you haven't provided complete sample data, I've generated some random data.
library(tidyverse)
library(lubridate)
data <- tibble(
time = seq(ymd_hms('2007-01-01 00:00:00'),
ymd_hms('2007-12-31 23:00:00'),
by='hour'),
variable = sample(0:400, 8760, replace = TRUE)
)
head(data)
#> # A tibble: 6 x 2
#> time variable
#> <dttm> <int>
#> 1 2007-01-01 00:00:00 220
#> 2 2007-01-01 01:00:00 348
#> 3 2007-01-01 02:00:00 360
#> 4 2007-01-01 03:00:00 10
#> 5 2007-01-01 04:00:00 18
#> 6 2007-01-01 05:00:00 227
summarised <- data %>%
mutate(date = date(time)) %>%
group_by(date) %>%
summarise(total = sum(variable))
head(summarised)
#> # A tibble: 6 x 2
#> date total
#> <date> <int>
#> 1 2007-01-01 5205
#> 2 2007-01-02 3938
#> 3 2007-01-03 5865
#> 4 2007-01-04 5157
#> 5 2007-01-05 4702
#> 6 2007-01-06 4625
summarised %>%
ggplot(aes(date, total)) +
geom_line()
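The question also asks for month names on the x-axis. A sketch of the same daily sum in base R, with one labelled tick per month. The random values and the `hourly`, `daily` and `ghi` names are placeholders:

```r
library(ggplot2)

# Random hourly values for 2007, standing in for the irradiance column
set.seed(1)
time <- seq(as.POSIXct("2007-01-01 00:00:00", tz = "UTC"),
            as.POSIXct("2007-12-31 23:00:00", tz = "UTC"),
            by = "hour")
hourly <- data.frame(time = time,
                     ghi = sample(0:400, length(time), replace = TRUE))

# One row per calendar day: sum the 24 hourly values
hourly$date <- as.Date(hourly$time, tz = "UTC")
daily <- aggregate(ghi ~ date, data = hourly, FUN = sum)

# One tick per month, labelled with the abbreviated month name
p <- ggplot(daily, aes(date, ghi)) +
  geom_line() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b")
```

The abbreviated month names produced by "%b" depend on the locale of the R session.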
In order to get a sum for every month of every year, you need to create a Column which describes a specific month of a specific year (Yearmon).
Then you can group over that Column and sum over that group giving you one sum for every month of every year.
Then you just plot it and set the labels of the x-axis to your liking.
library(ggplot2)
library(dplyr)
library(zoo)
library(scales)
# Create dummy data for time column
time <- seq.POSIXt(from = as.POSIXct("2007-01-01 00:00:00"),
to = as.POSIXct("2017-01-01 23:00:00"),
by = "hour")
# Create dummy data.frame
data <- data.frame(Time = time,
GHI = rnorm(length(time)))
############################
# Add column Yearmon to the data.frame
# Group by Yearmon and summarise with sum
# This creates one sum per Yearmon
# ungroup is often not necessary, however
# not doing this caused problems for me in the past
# Change type of Yearmon to Date for ggplot
#
df <- mutate(data,
Yearmon = as.yearmon(Time)) %>%
group_by(Yearmon) %>%
summarise(GHI_sum = sum(GHI)) %>%
ungroup() %>%
mutate(Yearmon = as.Date(Yearmon))
# Plot the chart with special scale labels
ggplot(df, aes(Yearmon, GHI_sum))+
geom_line()+
scale_x_date(labels = date_format("%m/%y"))
I hope this helps.

Adding the values of second column based on date and time of first column

I have a data frame with 2 variables. The first column "X" represents date and time in the format dd/mm/yyyy hh:mm; the values in the second column "Y" are electricity meter readings, which are taken every 5 minutes. Now I want to add up the values for each half hour. For instance:
X Y
13/12/2014 12:00 1
13/12/2014 12:05 2
13/12/2014 12:10 1
13/12/2014 12:15 2
13/12/2014 12:20 2
13/12/2014 12:25 1
In the end I want to present a result such as:
13/12/2014 12:00 9
13/12/2014 12:30 12
and so on...
Here's an alternative approach which actually takes X into account (as per OP comment).
First, we will make sure X is of proper POSIXct format so we could manipulate it correctly (I'm using the data.table package here for convenience)
library(data.table)
setDT(df)[, X := as.POSIXct(X, format = "%d/%m/%Y %R")]
Then we will aggregate by a running count of the rows whose minutes are 00 or 30, summing Y and extracting the first value of X in each group. I've made a more complicated data set in order to illustrate more complicated scenarios (see below).
df[order(X), .(X = X[1L], Y = sum(Y)), by = cumsum(format(X, "%M") %in% c("00", "30"))]
# cumsum X Y
# 1: 0 2014-12-13 12:10:00 6
# 2: 1 2014-12-13 12:30:00 6
# 3: 2 2014-12-13 13:00:00 3
Data
df <- read.table(text = "X Y
'13/12/2014 12:10' 1
'13/12/2014 12:15' 2
'13/12/2014 12:20' 2
'13/12/2014 12:25' 1
'13/12/2014 12:30' 1
'13/12/2014 12:35' 1
'13/12/2014 12:40' 1
'13/12/2014 12:45' 1
'13/12/2014 12:50' 1
'13/12/2014 12:55' 1
'13/12/2014 13:00' 1
'13/12/2014 13:05' 1
'13/12/2014 13:10' 1", header = TRUE)
Some explanations
The by expression:
format(X, "%M") gets the minutes out of X (see ?strptime)
Next step is check if they match 00 or 30 (using %in%)
cumsum separates these matched values into separate groups which we aggregate by by putting this expression into the by statement (see ?data.table)
The jth expression
(X = X[1L], Y = sum(Y)) is simply getting the first value of X per each group and the sum of Y per each group.
The ith expression
I've added order(X) in order to make sure the data set is properly ordered by date (one of the main reasons I've converted X to proper POSIXct format)
For a better understanding on how data.table works, see some tutorials here
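To see what the by expression does, the grouping key can be reproduced in base R. A small sketch, using the minutes and Y values of the data set above:

```r
# Minutes of the 13 rows in the data set above, in order
minutes <- c("10", "15", "20", "25", "30", "35", "40",
             "45", "50", "55", "00", "05", "10")
Y <- c(1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

# TRUE wherever a new half hour starts; cumsum turns the TRUEs into a
# running group number, so every row up to the next TRUE shares one group
grp <- cumsum(minutes %in% c("00", "30"))

# Sum of Y per group
sums <- tapply(Y, grp, sum)
```

Here `grp` is 0 for the first four rows, 1 for the next six and 2 for the last three, and `sums` gives 6, 6 and 3, which matches the Y column of the output shown above.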
t1 <- tapply(df$Y, as.numeric(as.POSIXct(df$X, format = '%d/%m/%Y %H:%M')) %/% 1800, sum)
data.frame(time = as.POSIXct(as.numeric(names(t1))*1800 + 1800, origin = '1970-01-01'), t1)
t1 groups the values using integer division by 1800 (30 minutes)
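The code above adds 1800 to each bin, so every sum is labelled with the end of its half hour. The question labels each sum with the start instead (12:00, 12:30, ...). A sketch of that variant, assuming UTC timestamps and one invented reading at 12:30:

```r
# Readings from the question, plus one hypothetical reading at 12:30
df <- data.frame(
  X = c("13/12/2014 12:00", "13/12/2014 12:05", "13/12/2014 12:10",
        "13/12/2014 12:15", "13/12/2014 12:20", "13/12/2014 12:25",
        "13/12/2014 12:30"),
  Y = c(1, 2, 1, 2, 2, 1, 4)
)

# Parse the timestamps, then round each one down to the start of its half hour
ts <- as.POSIXct(df$X, format = "%d/%m/%Y %H:%M", tz = "UTC")
bin_start <- as.POSIXct(floor(as.numeric(ts) / 1800) * 1800,
                        origin = "1970-01-01", tz = "UTC")

# Sum Y within each half-hour bin
res <- aggregate(Y ~ bin_start, data = data.frame(bin_start, Y = df$Y), FUN = sum)
```

`res` has one row per half hour, with 9 for the bin starting at 12:00, as in the question.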
Considering your data frame as df. You can try -
unname(tapply(df$Y, (seq_along(df$Y)-1) %/% 6, sum))
