R: forecasting multiple time series from a single data frame - r

I am trying to forecast multiple time series data that are present in a single data frame.
The dataframe df looks like below. The dput(df) is given below as well to reproduce quickly.
Date Group Value
01-04-2019 Saffron 62.78
01-04-2019 Green 75.65
01-05-2019 Saffron 67.89
01-06-2019 Saffron 54.56
01-06-2019 Green 77.00
01-07-2019 Green 71.22
structure(list(Date = structure(c(1L, 1L, 2L, 3L, 3L, 4L), .Label = c("01-04-2019", "01-05-2019", "01-06-2019", "01-07-2019"), class = "factor"),
Group = structure(c(2L, 1L, 2L, 2L, 1L, 1L), .Label = c("Green",
"Saffron"), class = "factor"), Value = c(62.78, 75.65, 67.89,
54.56, 77, 71.22)), .Names = c("Date", "Group", "Value"), class = "data.frame", row.names = c(NA, -6L))
Objective: I want to forecast for each Group using forecast package.
So my approach was as follows:
col_name_date <- "Date"
col_name_measure <- "Value"
col_name_sku_depo <- "Group"
dates_to_forecast <- 3
for (v in unique(as.character(df$Group))) {
temp <-subset(data,Group == v)
assign(paste0("df_",tolower(v)),temp)
temp <- temp [order(temp[, col_name_date]), ]
start_date <- as.Date(temp[1, col_name_date], date_format) #< ---library(lubridate)
ts_historic <- ts(temp[, col_name_measure],
start = c(year(start_date), month(start_date)),
frequency = 12)
----Forecasting process using forecast package, omitting as it is out of scope-----
forecast_mean <- rep(NA, dates_to_forecast)
forecast_mean <- ts_forecast$mean
forecast_upper <- ts_forecast$upper
forecast_lower <- ts_forecast$lower
dates_all_mean <- as.numeric(c(as.numeric(ts_historic), as.numeric(forecast_mean)))
dates_all_lower <- as.numeric(c(rep(NA, length(ts_historic)), as.numeric(forecast_lower)))
dates_all_upper <- as.numeric(c(rep(NA, length(ts_historic)), as.numeric(forecast_upper)))
result <- data.frame(
MONTH = dates_all,
MEASURETYPE = date_types,
GROUP = v
MEASURE = dates_all_mean,
MEASURELOWER = dates_all_lower,
MEASUREUPPER = dates_all_upper,
MODEL = model_descr)
}
The above code works fine for a single Group i.e. Saffron. But this doesn't produce the result for Green group.
I am looking for the following output:
MONTH MEASURETYPE GROUP MEASURE MEASUREUPPER MEASURELOWER MODEL
01-04-2019 Actual Saffron 62.78 NA NA Test
01-05-2019 Actual Saffron 67.89 NA NA Test
01-06-2019 Actual Saffron 54.56 NA NA Test
01-07-2019 Forecast Saffron 55.35 56.15 54.23 Test
01-08-2019 Forecast Saffron 57.29 58.15 56.39 Test
01-04-2019 Actual Green 75.65 NA NA Test
01-05-2019 Actual Green 77.00 NA NA Test
01-06-2019 Actual Green 71.22 NA NA Test
01-07-2019 Forecast Green 76.35 77.15 75.23 Test
01-08-2019 Forecast Green 73.29 74.29 72.30 Test
As you can see from the code, I am able to generate the above output only for Saffron.
How can I also add Green as shown in the above output?
Where I am missing out in for loop?

Related

random.forest.importance and XTS package

I am quite new to R coding, the TTR/XTS package and random.forest.importance functions.
I am extracting trading data using the xts function, calculating whether the difference between Close and Open is positive/negative/flat, applying a handful of technical indicators using the TTR function , and then combining the indicators to calculate the random.forest.importance function.
When I run the code, I get the
Error in model.frame.default(formula, data, na.action = NULL) : variable lengths differ (found for 'Close').
Data:
Date Time Open High Low Close TVolume
2017-10-12 14:00:00 1.18462 1.18487 1.18334 1.18347 1165
2017-10-12 15:00:00 1.18351 1.18377 1.18295 1.18347 884
2017-10-12 16:00:00 1.18348 1.18348 1.18265 1.18276 1000
2017-10-12 17:00:00 1.18245 1.18329 1.18242 1.18303 1184
2017-10-12 18:00:00 1.18305 1.18373 1.18284 1.18343 469
2017-10-12 19:00:00 1.18343 1.18343 1.18247 1.18303 886
Code as follows:
pkgs <- c('class', 'gmodels', 'quantmod', 'TTR','xts','corrplot','caret','FSelector')
z <- head(tail(hist_r, samples+retro), samples)
z <- as.xts(z[,2:6], order.by=as.POSIXct(z$Timestamp, origin='1970-01-01 00:00', tz='UTC'))
hist <- getHist(z)
h <- as.xts(hist)
price <- z$Close-z$Open
class = ifelse(price > 0,""'UP'"",ifelse(price <0,""'DOWN'"",'""FLAT'""))
forceindex <- (z$Close-z$Open) * z$TVolume
WillR5 <- WPR(z[,c(""'High'"",""'Low'"",""'Close'"")], n = 5)
dataset = data.frame(class,forceindex,WillR5)
dataset = na.omit(dataset)
dput(head(dataset, 10))
set.seed(5)
weights <- random.forest.importance(class~., dataset, importance.type = 1)
print(weights)
When I run dput, i get the following:
tructure(list(Close = structure(c(1L, 3L, 1L, 1L, 1L, 3L, 1L, 3L, 1L, 3L), .Label = c("DOWN", "FLAT", "UP"), class = "factor"), Close.1 = c(-12.9382400000007, 0.107400000000227, -1.66915000000001, -0.290530000000006, -1.18667999999979, 0.0752800000000753, -0.244080000000094, 0.0653999999999928, -0.395999999999996, 0.372089999999928)
Would sincerely appreciate any help that anyone can give me
Many thanks in advance
Kkel

Plotting a multidimensional Data Set

I have a 2 dimensional data set (matrix/data frame) that looks like this
779 482 859 1156
maxs 56916.00 78968.00 51156.00 44827.01
Means+Stdv 41784.70 64440.83 38319.10 42767.14
Mean_Cost 31863.18 44407.40 29365.78 38711.29
Means_Stdv 21941.66 24373.97 20412.45 34655.43
mins 21088.00 13768.00 24132.00 31452.00
The 779, 489,859, 1156 are values that I want to draw on the x-axis
The rest of the values on the column are values that correpond to each x
Now I want to plot the entire data set, so that I have a graph with the the following points
(779,56916) , (779, 41784)......
(482,78968) , (482, 64440)..... and so on
The way I did it so far is like this (it gives me the plot I am looking for)
plot(colnames(resultsSummary),resultsSummary[1,],ylim=c(0,80000),pch=6)
points(colnames(resultsSummary),resultsSummary[2,],pch=3)
points(colnames(resultsSummary),resultsSummary[3,])
and so on..... plotting row by row
I am sure there is a better way to do it, but I dont know how, any suggestions?
DF <- read.table(text=" 779 482 859 1156
maxs 56916.00 78968.00 51156.00 44827.01
Means+Stdv 41784.70 64440.83 38319.10 42767.14
Mean_Cost 31863.18 44407.40 29365.78 38711.29
Means_Stdv 21941.66 24373.97 20412.45 34655.43
mins 21088.00 13768.00 24132.00 31452.00",
header=TRUE, check.names=FALSE)
m <- as.matrix(DF)
matplot(as.integer(colnames(m)),
t(m), pch=seq_len(ncol(m)))
Following also works:
ddf = structure(list(var = structure(c(1L, 4L, 2L, 3L, 5L), .Label = c("maxs",
"Mean_Cost", "Means_Stdv", "Means+Stdv", "mins"), class = "factor"),
X779 = c(56916, 41784.7, 31863.18, 21941.66, 21088), X482 = c(78968,
64440.83, 44407.4, 24373.97, 13768), X859 = c(51156, 38319.1,
29365.78, 20412.45, 24132), X1156 = c(44827.01, 42767.14,
38711.29, 34655.43, 31452)), .Names = c("var", "X779", "X482",
"X859", "X1156"), class = "data.frame", row.names = c(NA, -5L
))
ddf
var X779 X482 X859 X1156
1 maxs 56916.00 78968.00 51156.00 44827.01
2 Means+Stdv 41784.70 64440.83 38319.10 42767.14
3 Mean_Cost 31863.18 44407.40 29365.78 38711.29
4 Means_Stdv 21941.66 24373.97 20412.45 34655.43
5 mins 21088.00 13768.00 24132.00 31452.00
ddf[6,2:5]=as.numeric(substr(names(ddf)[2:5],2,4))
ddf2 = data.frame(t(ddf))
ddf2 = ddf2[-1,]
mm = melt(ddf2, id='X6')
ggplot(mm)+geom_point(aes(x=X6, y=value, color=variable))

How to plot events on time scale using R?

I have a csv file like
id,date,event
1,01-01-2014,E1
1,01-02-2014,E2
2,01-03-2014,E1
2,01-04-2014,E1
2,01-05-2014,E2
I would like to plot events using R on time scale. For example x axis would be date and y axis would indicate event happened on a particular date. This would be one graph for one set of id's. In the above data set it would create 2 graphs.
This is little different from time series (i think). Anyway to accomplish this in R?
Thanks
Try:
ddf = structure(list(id = c(1L, 1L, 2L, 2L, 2L), date = structure(1:5, .Label = c("01-01-2014",
"01-02-2014", "01-03-2014", "01-04-2014", "01-05-2014"), class = "factor"),
event = structure(c(1L, 2L, 1L, 1L, 2L), .Label = c("E1",
"E2"), class = "factor")), .Names = c("id", "date", "event"
), class = "data.frame", row.names = c(NA, -5L))
>
ddf$date2 = as.Date(ddf$date, format="%m-%d-%Y")
ddf
id date event date2
1 1 01-01-2014 E1 2014-01-01
2 1 01-02-2014 E2 2014-01-02
3 2 01-03-2014 E1 2014-01-03
4 2 01-04-2014 E1 2014-01-04
5 2 01-05-2014 E2 2014-01-05
>
ggplot(data=ddf, aes(x=date2, y=event, group=factor(id), color=factor(id)))+
geom_line()+
geom_point()+
facet_grid(id~.)
Edit: The code is simple and self-explanatory. Basically the date is kept in x-axis and events in y-axis. For clarity, the graphs are plotted for different ID separately (using facet_grid command), although they can be kept in same graph also, as seen in graph below generated by excluding the facet_grid command in above code:
Here there may be some ambiguity when the lines get overlapping.

Frequency of items in a list in R

I have a very large csv file. I want to calculate the frequency of the items in the second column in order to graph histogram. An example of my data:
0010,10.1.1.16
0011,10.2.2.10
0012,192.168.2.61
0013,192.168.173.19
0014,10.2.2.10
0015,10.2.2.10
0016,192.168.2.61
I have used the below:
inFile <- read.csv("file.csv")
summary(inFile)
hist(inFile$secondCol)
output of summary:
X0010 X10.1.1.16
Min. :11.00 10.2.2.10 :3
1st Qu.:12.25 192.168.173.19:1
Median :13.50 192.168.2.61 :2
Mean :13.50
3rd Qu.:14.75
Max. :16.00
Because the file is very large, I'm not getting the right histogram. Any suggestions?
Just use table.
DF <- structure(list(V1 = 10:16, V2 = structure(c(1L, 2L, 4L, 3L, 2L,
2L, 4L), .Label = c("10.1.1.16", "10.2.2.10",
"192.168.173.19", "192.168.2.61"), class = "factor")),
.Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -7L))
table(DF$V2)
# 10.1.1.16 10.2.2.10 192.168.173.19 192.168.2.61
# 1 3 1 2
If you want a data.frame out of this, you can use as.data.frame:
as.data.frame(table(DF$V2))
# Var1 Freq
# 1 10.1.1.16 1
# 2 10.2.2.10 3
# 3 192.168.173.19 1
# 4 192.168.2.61 2
Since you say you want a histogram, this can be done directly using ggplot2 without having to get the counts first as follows:
require(ggplot2)
ggplot(data = DF, aes(x = V2)) + geom_histogram(aes(y = ..count..))
We could have also done a as.numeric() on the column.
typeof(data$hourofcrime)
# gives me a list
#> typeof(data$hourofcrime)
#[1] "list"
hour_crime_rate <- as.numeric(data$hourofcrime)
hist(hour_crime_rate)

Extracting values from R table within grouped values

I have the following table ordered group by first, second and name.
myData <- structure(list(first = c(120L, 120L, 126L, 126L, 126L, 132L, 132L), second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
Name = structure(c(5L, 5L, 3L, 3L, 4L, 1L, 2L), .Label = c("Benzene",
"Ethene._trichloro-", "Heptene", "Methylamine", "Pentanone"
), class = "factor"), Area = c(699468L, 153744L, 32913L,
4948619L, 83528L, 536339L, 105598L), Sample = structure(c(3L,
2L, 3L, 3L, 3L, 1L, 1L), .Label = c("PO1:1", "PO2:1", "PO4:1"
), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -7L))
Within each group I want to extract the area that correspond to the specific sample. Several groups don´t have areas from the samples, so if the sample is´nt detected it should return "NA".Ideally, the final output should be a column for each sample.
I have tried the ifelse function to create one column to each sample:
PO1<-ifelse(myData$Sample=="PO1:1",myData$Area, "NA")
However this doesn´t takes into account the group distribution. I want to do this, but within the group. Within each group (a group as equal value for first, second and Name columns) if sample=PO1:1, Area, else NA.
For the first group:
structure(list(first = c(120L, 120L), second = c(1.33, 1.33),
Name = structure(c(1L, 1L), .Label = "Pentanone", class = "factor"),
Area = c(699468L, 153744L), Sample = structure(c(2L, 1L), .Label = c("PO2:1",
"PO4:1"), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -2L))
The output should be:
structure(list(PO1.1 = NA, PO2.1 = 153744L, PO3.1 = NA, PO4.1 = 699468L), .Names =c("PO1.1", "PO2.1", "PO3.1", "PO4.1"), class = "data.frame", row.names = c(NA, -1L))
Any suggestion?
As in the example in the quesiton, I am assuming Sample is a factor. If this is not the case, consider making it such.
First, lets clean up the column Sample to make it a legal name, or else it might cause errors
levels(myData$Sample) <- make.names(levels(myData$Sample))
## DEFINE THE CUTS##
# Adjust these as necessary
#--------------------------
max.second <- 3 # max & nin range of myData$second
min.second <- 0 #
sprd <- 0.15 # with spread for each group
#--------------------------
# we will cut the myData$second according to intervals, cut(myData$second, intervals)
intervals <- seq(min.second, max.second, sprd*2)
# Next, lets create a group column to split our data frame by
myData$group <- paste(myData$first, cut(myData$second, intervals), myData$Name, sep='-')
groups <- split(myData, myData$group)
samples <- levels(myData$Sample) ## I'm assuming not all samples are present in the example. Manually adjusting with: samples <- sort(c(samples, "PO3.1"))
# Apply over each group, then apply over each sample
myOutput <-
t(sapply(groups, function(g) {
#-------------------------------
# NOTE: If it's possible that within a group there is more than one Area per Sample, then we have to somehow allow for thi. Hence the "paste(...)"
res <- sapply(samples, function(s) paste0(g$Area[g$Sample==s], collapse=" - ")) # allowing for multiple values
unlist(ifelse(res=="", NA, res))
## If there is (or should be) only one Area per Sample, then remove the two lines aboce and uncomment the two below:
# res <- sapply(samples, function(s) g$Area[g$Sample==s]) # <~~ This line will work when only one value per sample
# unlist(ifelse(res==0, NA, res))
#-------------------------------
}))
# Cleanup names
rownames(myOutput) <- paste("Group", 1:nrow(myOutput), sep="-") ## or whichever proper group name
# remove dummy column
myData$group <- NULL
Results
myOutput
PO1.1 PO2.1 PO3.1 PO4.1
Group-1 NA "153744" NA "699468"
Group-2 NA NA NA "32913 - 4948619"
Group-3 NA NA NA "83528"
Group-4 "536339" NA NA NA
Group-5 "105598" NA NA NA
You cannot really expect R to intuit that there is a fourth factor level between PO2 and PO4 , now can you.
> reshape(inp, direction="wide", idvar=c('first','second','Name'), timevar="Sample")
first second Name Area.PO4:1 Area.PO2:1 Area.PO1:1
1 120 1.3 Pentanone 699468 153744 NA
3 126 0.4 Heptene 32913 NA NA
4 126 0.4 Heptene 4948619 NA NA
5 126 0.3 Methylamine 83528 NA NA
6 132 0.5 Benzene NA NA 536339
7 132 0.5 Ethene._trichloro- NA NA 105598

Resources