How do I plot multiple columns of a dataframe in ggplot? - r

I'm trying to plot a data frame that has "Date" as the x-axis, and stock price as the y-axis, and I have four different stocks to be plotted. I'm very confused by the ggplot documentation, and haven't found an easy solution to this. Here is the data frame:
appleData <- read.csv("AAPL.csv", header = TRUE)
microsoftData <- read.csv("MSFT.csv", header = TRUE)
googleData <- read.csv("GOOG.csv", header = TRUE)
amazonData <- read.csv("AMZN.csv", header = TRUE)
names(appleData) <- c("Date", "AAPL")
names(microsoftData) <- c("Date", "MSFT")
names(googleData) <- c("Date", "GOOG")
names(amazonData) <- c("Date", "AMZN")
mergedData1 <- merge(appleData, microsoftData, by = "Date")
mergedData2 <- merge(googleData, amazonData, by = "Date")
totalData <- merge(mergedData1, mergedData2, by = "Date")
totalData
The dataframe is called "totalData", and when I use ggplot(totalData) I get a blank plot. What I need help with specifically is plotting all four stocks onto the same plot, and also rescaling the prices so that they all begin at $100 (so they are on the same scale). Thank you in advance.

Your question was a little difficult to answer because you didn't provide the data you are using. For tips on asking questions that get answered quickly, see "How to make a great R reproducible example?"
I hope the code below helps you get started.
The main thing I did was convert your data from an untidy "wide" data frame to a "tidy" long data frame using the gather function from tidyr. I highly recommend this tutorial, http://garrettgman.github.io/tidying/, which covers the basics of tidying. Once your data is tidy, many tools become much easier to use.
Good luck!
library(dplyr)
library(tidyr)
library(ggplot2)
# create sample data frame with random numbers
set.seed(123)
total_data <- data.frame(date = seq.Date(from = as.Date("2018-01-01"),
                                         to = as.Date("2018-01-31"), by = "day"),
                         AAPL = sample(100:1000, 31),
                         MSFT = sample(100:1000, 31),
                         GOOG = sample(100:1000, 31),
                         AMZN = sample(100:1000, 31))
head(total_data)
#>         date AAPL MSFT GOOG AMZN
#> 1 2018-01-01  359  912  445  691
#> 2 2018-01-02  809  721  346  388
#> 3 2018-01-03  467  815  832  268
#> 4 2018-01-04  892  122  502  802
#> 5 2018-01-05  943  528  826  183
#> 6 2018-01-06  140  779  827  518
# convert your wide data frame to a tidy long data frame
total_data <- gather(total_data, company, value, -date)
# plot using ggplot2
total_data %>%
  ggplot(aes(x = date, y = value, color = company)) +
  geom_line()
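The question also asks for the prices to be rescaled so that every stock starts at 100, which the code above does not cover. One possible way, sketched here in base R on the wide data frame (before the gather step), is to divide each price column by its first value and multiply by 100. The sample data is the same made-up data as above, and price_cols is a name introduced only for this illustration.

```r
# Same made-up sample data as in the answer (wide format, one column per stock)
set.seed(123)
total_data <- data.frame(
  date = seq.Date(from = as.Date("2018-01-01"),
                  to = as.Date("2018-01-31"), by = "day"),
  AAPL = sample(100:1000, 31),
  MSFT = sample(100:1000, 31),
  GOOG = sample(100:1000, 31),
  AMZN = sample(100:1000, 31)
)

# Make sure rows are in date order so "first value" means the earliest price
total_data <- total_data[order(total_data$date), ]

# price_cols is an illustrative name: every column except the date
price_cols <- setdiff(names(total_data), "date")

# Index each series to 100 at the first date
total_data[price_cols] <- lapply(total_data[price_cols],
                                 function(p) 100 * p / p[1])

head(total_data, 3)
```

After this step the gather and ggplot calls from the answer can be run unchanged, and all four lines start at 100.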

Related

Problems with anomalize function

I need to check my data for anomalies with the anomalize function.
First I loaded some libraries:
library(tidyverse)
library(anomalize)
library(dplyr)
library(zoo)
library(ggplot2)
library(forecast)
library(anytime)
Then I deleted all the columns that I do not need for this task:
trash1 <- ASD[, -2]
trash2 <- trash1[,-2]
trash3 <- trash2[,-2]
trash4 <- trash3[,-2]
trash5 <- trash4[,-2]
trash6 <- trash5[,-2]
trash7 <- trash6[,-4]
trash8 <- trash7[,-4]
view(trash8)
Change class from Factor to Date:
trash8$DMY <- as.Date(trash8$DMY, format="%d.%m.%y")
Then I tried to anomalize it:
trash_tbl <- as_tibble(trash8)
trash_tbl %>%
  time_decompose(Qp) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)
As a result I get this error:
Converting from tbl_df to tbl_time.
Auto-index message: index = DMY
Note: Index not ordered. tibbletime assumes index is in ascending order. Results may not be as desired.
Error: Only year, quarter, month, week, and day periods are allowed for an index of class Date
Please help me with this, or tell me what I can read to solve the problem.
This is my data. DMY - Date, MCC - Factor, Art - Numeric, Qp - Numeric, Ql - Factor
         DMY       MCC  Art   Qp     Ql
1 2016-01-01 UA0000468 1801 3520    440
2 2016-01-01 UA0000468 3102 3024  604,8
3 2016-01-01 UA0000468 4419  270  521,1
4 2016-01-01 UA0000468 5537 1080 2084,4
5 2016-01-03 UA0010557 3528  180     36
6 2016-01-03 UA0010557 3529  198   39,6
...
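The data shown has several rows per date and the message says the index is not ordered, so one thing worth trying (this is an assumption, not a confirmed diagnosis) is to collapse the data to one row per day and sort it before calling time_decompose. The sketch below uses base R and a stand-in for trash8 built from the six rows above. Summing Qp per day is also an assumption; another aggregate, or filtering to a single MCC or Art, may fit the data better.

```r
# Stand-in for the asker's trash8, using only the rows shown in the question
trash8 <- data.frame(
  DMY = as.Date(c("2016-01-01", "2016-01-01", "2016-01-01",
                  "2016-01-01", "2016-01-03", "2016-01-03")),
  Qp  = c(3520, 3024, 270, 1080, 180, 198)
)

# One row per date: total Qp per day (the choice of sum is an assumption)
daily <- aggregate(Qp ~ DMY, data = trash8, FUN = sum)

# Ascending date order, as the tibbletime note asks for
daily <- daily[order(daily$DMY), ]

daily
```

The resulting daily data frame, converted with as_tibble, could then be passed to the same time_decompose pipeline.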

Fill in missing date and fill with the data above

I researched this for a while before asking here. Can you please help me with some ideas for this issue?
My data table (df) looks like this:
client id value repmonth
123 100 2012-01-31
123 200 2012-02-31
123 300 2012-05-31
So I have two missing months, and I want my data table to look like this:
client id value repmonth
123 100 2012-01-31
123 200 2012-02-31
123 200 2012-03-31
123 200 2012-04-31
123 300 2012-05-31
The code should fill in the missing repmonth values and fill those rows with the last value (in this case 200) and the same client id.
I have tried the following:
zoo library
tidyr library
dplyr library
posixct
As for code, I have had plenty of failures:
library(tidyr)
df %>%
  mutate(repmonth = as.Date(repmonth)) %>%
  complete(repmonth = seq.Date(min(repmonth), max(repmonth), by = "month"))
or
library(dplyr)
df$reportingDate.end.month <- as.POSIXct(df$datetime, tz = "GMT")
df <- tbl_df(df)
list_df <- list(df, df) # fake list of data.frames
seq_df <- data_frame(datetime = seq.POSIXt(as.POSIXct("2012-01-31"),
as.POSIXct("2018-12-31"),
by="month"))
lapply(list_df, function(x){full_join(total_loan_portfolios_3$reportingDate.end.month, seq_df, by=reportingDate.end.month)})
total_loan_portfolios_3$reportingmonth_notmissing <- full_join(seq_df,total_loan_portfolios_3$reportingDate.end.month)
or
library(dplyr)
ts <- seq.POSIXt(as.POSIXct("2012-01-01",'%d/%m/%Y'), as.POSIXct("2018/12/01",'%d/%m/%Y'), by="month")
ts <- seq.POSIXt(as.POSIXlt("2012-01-01"), as.POSIXlt("2018-12-01"), by="month")
ts <- format.POSIXct(ts,'%d/%m/%Y')
df <- data.frame(timestamp=ts)
total_loan_portfolios_3 <- full_join(df,total_loan_portfolios_3$Reporting_date)
In the end, I get plenty of errors like
the format is not date
or
Error in seq.int(r1$mon, 12 * (to0$year - r1$year) + to0$mon, by) :
'from' must be a finite number
and others.
The following solution uses the lubridate and tidyr packages. Note that the dates in the OP's example are malformed (February has no 31st), but they suggest the data uses last-day-of-month dates, so I tried to replicate that here. The solution creates a sequence of dates from the minimum input date to the maximum input date to get all months of interest. The start date is first normalized to the first day of its month to ensure proper sequence generation. With the sequence created, a left join merges it with the existing data, which exposes the missing months as NAs. Then fill() is applied to the columns to fill in those NAs.
library(lubridate)
library(tidyr)
#Note: the OP has a February with 31 days, which fails to parse as a date; corrected to 2012-02-29 (2012 is a leap year)
df <- data.frame(client_id=c(123,123,123),value=c(100,200,300),repmonth=c("2012-01-31","2012-02-29","2012-05-31"),stringsAsFactors = F)
df$repmonth <- ymd(df$repmonth) #convert character dates to Dates
start_month <- min(df$repmonth)
start_month <- start_month - days(day(start_month)-1) #first day of month so that seq.Date sequences properly
all_dates <- seq.Date(from=start_month,to=max(df$repmonth),by="1 month")
all_dates <- (all_dates %m+% months(1)) - days(1) #convert to last day of each month, since the OP's input appears to use month-end dates
all_dates <- data.frame(repmonth=all_dates)
df<-merge(x=all_dates,y=df,by="repmonth",all.x=T)
df <- fill(df,c("client_id","value"))
Solution yields:
> df
repmonth client_id value
1 2012-01-31 123 100
2 2012-02-29 123 200
3 2012-03-31 123 200
4 2012-04-30 123 200
5 2012-05-31 123 300
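The reason for normalizing to the first day of the month can be seen with a small base R experiment. seq with by = "month" keeps the day of the month and counts an invalid day forward into the next month, so starting from a 31st can skip short months. The sketch below is only an illustration of that behavior and of a base R way to build month-end dates. The variable names are made up for this example.

```r
# Naive approach: step by month starting from a month-end date
naive <- seq(as.Date("2012-01-31"), by = "month", length.out = 3)
print(naive)  # the second element rolls over into March, so February is missing

# Safer: sequence over first-of-month dates, then step back one day
# to land on the last day of the previous month
month_starts <- seq(as.Date("2012-01-01"), by = "month", length.out = 6)
month_ends <- month_starts[-1] - 1
print(month_ends)
```

month_ends holds the same five month-end dates that the lubridate code above produces for January to May 2012.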

Prophet Forecasting using R for multiple items

I am very new to time series forecasting using Prophet in R. I am able to predict values for a single product using Prophet. Is there a way to use a loop to generate forecasts with Prophet for multiple products? The code below works fine for a single product, but I am trying to generate forecasts for multiple products.
library(prophet)
df <- read.csv("Prophet.csv")
df$Date<-as.Date(as.character(df$Date), format = "%d-%m-%Y")
colnames(df) <- c("ds", "y")
m <- prophet(df)
future <- make_future_dataframe(m, periods = 40)
tail(future)
forecast <- predict(m, future)
write.csv(forecast[c('ds','yhat')],"Output_Prophet.csv")
tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
Sample Dataset:
This can be done by using lists and the map functions from the purrr package.
Let's build some data:
library(tidyverse) # contains also the purrr package
set.seed(123)
tb1 <- tibble(
  ds = seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day"),
  y = sample(365)
)
tb2 <- tibble(
  ds = seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day"),
  y = sample(365)
)
ts_list <- list(tb1, tb2) # two separate time series
# using this construct you could add more of course
Build and prediction:
library(prophet)
m_list <- map(ts_list, prophet) # prophet call
future_list <- map(m_list, make_future_dataframe, periods = 40) # makes future obs
forecast_list <- map2(m_list, future_list, predict) # map2 because we have two inputs
# we can access everything we need like with any list object
head(forecast_list[[1]]$yhat) # forecasts for time series 1
[1] 179.5214 198.2375 182.7478 173.5096 163.1173 214.7773
head(forecast_list[[2]]$yhat) # forecast for time series 2
[1] 172.5096 155.8796 184.4423 133.0349 169.7688 135.2990
Update (only the input part changes; the build and prediction part is the same):
I created a new example based on the OP's request. Basically, you need to put everything into a list object again:
# suppose you have a data frame like this:
set.seed(123)
tb1 <- tibble(
  ds = seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day"),
  productA = sample(365),
  productB = sample(365)
)
head(tb1)
# A tibble: 6 x 3
  ds         productA productB
  <date>        <int>    <int>
1 2018-01-01      105      287
2 2018-01-02      287       71
3 2018-01-03      149        7
4 2018-01-04      320      148
5 2018-01-05      340      175
6 2018-01-06       17      152
# with some tidyr and base R you can transform each time series into a data frame within a list
ts_list <- tb1 %>%
  gather("type", "y", -ds) %>%
  split(.$type)
# this just removes the type column that we don't need anymore
ts_list <- lapply(ts_list, function(x) { x["type"] <- NULL; x })
# now you can continue just like above..
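Once there is one forecast per product, it is often convenient to stack them into a single data frame before writing a CSV, as the question does for one product. The sketch below is one possible way in base R. It assumes forecast_list is a named list, which is what you get when ts_list comes from split as above. The two small data frames are made-up stand-ins for real Prophet output so the example runs without fitting any models, and combined is a name chosen for this illustration.

```r
# Made-up stand-ins for the elements of forecast_list (real ones come from predict())
forecast_list <- list(
  productA = data.frame(ds = as.Date("2019-01-01") + 0:1, yhat = c(10, 11)),
  productB = data.frame(ds = as.Date("2019-01-01") + 0:1, yhat = c(20, 21))
)

# Tag each forecast with its product name and keep only ds and yhat
tagged <- Map(function(fc, product) {
  data.frame(product = product, fc[c("ds", "yhat")])
}, forecast_list, names(forecast_list))

# Stack everything into one data frame
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL

combined
# write.csv(combined, "Output_Prophet.csv", row.names = FALSE)
```

If the list is unnamed, as with list(tb1, tb2) in the first example, names would have to be assigned first, for example with names(forecast_list) <- c("productA", "productB").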

How to use apply.daily/period.apply for calculating maximum per column in XTS time series?

I have a problem using the period.apply function for my case of a high resolution time series analysis.
I want to calculate statistics (mean for different periods, standard deviation, etc.) for my data, which is in 10-minute intervals. Calculating hourly means worked fine, as described in this answer.
It creates a new xts object with means calculated for each column. How do I calculate maximum values for each column?
This reproducible example describes the structure of my data:
library(xts)
start <- as.POSIXct("2018-05-18 00:00")
tseq <- seq(from = start, length.out = 1440, by = "10 mins")
Measurings <- data.frame(
  Time = tseq,
  Temp = sample(10:37,1440, replace = TRUE, set.seed(seed = 10)),
  Variable1 = sample(1:200,1440, replace = TRUE, set.seed(seed = 187)),
  Variable2 = sample(300:800,1440, replace = TRUE, set.seed(seed = 333))
)
Measurings_xts <- xts(Measurings[,-1], Measurings$Time)
HourEnds <- endpoints(Measurings_xts, "hours")
Measurings_mean <- period.apply(Measurings_xts, HourEnds, mean)
I thought it would be easy to just change the function argument from mean to max, like this:
Measurings_max <- period.apply(Measurings_xts, HourEnds, max)
It delivers output, but only one column with the overall maximum values. I need the hourly maximums of each column. A simple solution would be much appreciated.
The mean example works by column because there's a zoo method that calls mean on each column (this method is used because xts extends zoo).
The max example returns one number because there is no max.xts or max.zoo method, so it returns the maximum of the entire xts/zoo object.
A simple solution is to define a helper function:
colMax <- function(x, na.rm = FALSE) {
  apply(x, 2, max, na.rm = na.rm)
}
Then use that in your period.apply call:
epHours <- endpoints(Measurings_xts, "hours")
Measurings_max <- period.apply(Measurings_xts, epHours, colMax)
head(Measurings_max)
# Temp Variable1 Variable2
# 2018-05-18 00:50:00 29 194 787
# 2018-05-18 01:50:00 28 178 605
# 2018-05-18 02:50:00 26 188 756
# 2018-05-18 03:50:00 34 152 444
# 2018-05-18 04:50:00 33 145 724
# 2018-05-18 05:50:00 35 187 621
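The difference between max and a column-wise helper can be seen on a plain matrix, without xts. The sketch below also shows one way to generalize the helper to other statistics such as the standard deviation mentioned in the question. colApply is a hypothetical name introduced here, not a function from xts or zoo.

```r
colMax <- function(x, na.rm = FALSE) {
  apply(x, 2, max, na.rm = na.rm)
}

# Small made-up matrix standing in for one hourly chunk of the xts object
m <- matrix(c(1, 5, 3, 9, 2, 4), ncol = 2,
            dimnames = list(NULL, c("Temp", "Variable1")))

max(m)     # one number for the whole matrix: 9
colMax(m)  # one number per column: Temp = 5, Variable1 = 9

# Hypothetical factory that turns any summary function into a column-wise one
colApply <- function(f) {
  function(x, ...) apply(x, 2, f, ...)
}
colMin <- colApply(min)
colSd  <- colApply(sd)

colMin(m)
colSd(m)
```

Functions built this way could be passed to period.apply in the same position as colMax above.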

Combining two time series with different ranges, when column headings are the dates

I am stuck trying to combine two time series datasets that have different ranges and both are stored with item# in column1 and date as column headings. For example:
df1
#ITEM 1/1/16 1/2/16 1/3/16 ... 3/24/17
#1 350 365 370 ... 400
#2 100 95 101 ... 95
#3 5 8 9 ... 15
The other dataset covers a smaller range, is in the same format, and both have daily frequency.
How can I append the rows of df2 to df1 despite the different ranges, while making sure the dates are aligned when merged? I'm happy with NA in the new data frame where df2 doesn't have values for dates in df1.
Should I create these as xts objects so that once they are merged I can easily pull data for item 1 on date X? Or is there an easy way to do that with this format as well?
Thanks in advance for your help.
One option is to use data.table::rbindlist(list(df1, df2), fill = TRUE),
which fills missing columns with NAs.
Example:
library(data.table)
dt1 <- data.table(item=c(1,2,3),"d1/1/16" = c(350,100,5) ,"d1/2/16" = c(360,120,7))
dt2 <- data.table(item=c(3,4,5),"d1/2/16" = c(50,50,2) ,"d1/3/16" = c(460,150,9))
l = list(dt1,dt2)
data.table::rbindlist(l, use.names= TRUE, fill=TRUE, idcol=TRUE )
Normally in R, time series are represented in columns, not rows. Assuming we have DF1 and DF2 as shown reproducibly in the Note at the end, here are some alternatives.
1) zoo. We can create a zoo series from each by transposing, then merge them:
library(zoo)
fmt <- "%m/%d/%y"
z1 <- setNames(zoo(t(DF1[-1]), as.Date(names(DF1[-1]), fmt)), DF1[[1]])
z2 <- setNames(zoo(t(DF2[-1]), as.Date(names(DF2[-1]), fmt)), DF2[[1]])
z <- merge(z1, z2)
It is probably best to leave this as the zoo series z, but if you want to convert it to a data frame, use fortify.zoo(z).
2) base. Alternatively, without zoo, using fmt from above:
d1 <- data.frame(as.Date(names(DF1[-1]), fmt), t(DF1[-1]))
names(d1) <- c("Index", DF1[[1]])
d2 <- data.frame(as.Date(names(DF2[-1]), fmt), t(DF2[-1]))
names(d2) <- c("Index", DF2[[1]])
merge(d1, d2, by = "Index", all = TRUE)
Note: The input in reproducible form is assumed to be:
Lines <- "ITEM 1/1/16 1/2/16 1/3/16 3/24/17
1 350 365 370 400
2 100 95 101 95
3 5 8 9 15"
DF <- read.table(text = Lines, header = TRUE, check.names = FALSE)
DF1 <- DF[1:2, 1:3]
DF2 <- DF[3, -3]
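The question also asks how to pull the value for a given item on a given date after merging. With the base R result from 2), that is a row and column lookup. The sketch below rebuilds the Note's input and the merge, then adds two helpers. to_long and value_on are hypothetical names introduced only for this illustration.

```r
Lines <- "ITEM 1/1/16 1/2/16 1/3/16 3/24/17
1 350 365 370 400
2 100 95 101 95
3 5 8 9 15"
DF <- read.table(text = Lines, header = TRUE, check.names = FALSE)
DF1 <- DF[1:2, 1:3]
DF2 <- DF[3, -3]

fmt <- "%m/%d/%y"

# Hypothetical helper: dates become rows, items become columns (as in approach 2)
to_long <- function(d) {
  out <- data.frame(as.Date(names(d)[-1], fmt), t(d[-1]))
  names(out) <- c("Index", d[[1]])
  out
}

merged <- merge(to_long(DF1), to_long(DF2), by = "Index", all = TRUE)

# Hypothetical helper: value for one item on one date (NA if not observed)
value_on <- function(data, item, date) {
  data[data$Index == as.Date(date), as.character(item)]
}

value_on(merged, 1, "2016-01-02")  # 365
value_on(merged, 3, "2016-01-02")  # NA, item 3 has no value for that date
```

With the zoo series z from 1), a similar lookup should be possible by indexing with a Date and a column name.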
