Predict future values using ARMA - R

I have sales data with 51 points and want to predict, say, 10 more future values. Being sales data it is presumably seasonal, but there are too few data points to model seasonality. When I fit a time series model it gave "103" as the result for every future prediction. I thought using ARMA would help, but after fitting an ARMA model and calling forecast() I still got the same output. I am new to trend analysis and forecasting and do not know if there are methods other than regression for predicting future values. Kindly help.
Data:
Product 23 22 21 31 29 13 15 20 15 26 11 24 14 18 15 21 25 23 27 30 19 18 20 13 23 40 14 15 20 14 9 22 14 24 26 22 23 16 24 19 14 10 17 12 11 15 9 24 17 22 28
The code I used:
library("tseries")
arma<-arma(Product)
final<-forecast(arma,10)

Your question is missing some details, like the order of the ARMA model you want to fit. Considering that 103 is far larger than any of the values you have given us, I'd suspect there is an error in your code.
Here's an implementation for ARMA(1,1) that should work:
# split the space-separated string and convert to a numeric vector
data <- strsplit("23 22 21 31 29 13 15 20 15 26 11 24 14 18 15 21 25 23 27 30 19 18 20 13 23 40 14 15 20 14 9 22 14 24 26 22 23 16 24 19 14 10 17 12 11 15 9 24 17 22 28", split = " ")
data <- as.numeric(data[[1]])
# ARMA(1,1) is ARIMA(1,0,1): AR order 1, no differencing, MA order 1
mod <- arima(data, order = c(1, 0, 1))
pred <- predict(mod, n.ahead = 10)
pred
plot(c(data, pred$pred), type = "l")
EDIT: Ken is right, by the way: an ARMA model doesn't seem that useful for your data, and the predictions will mostly be given by the intercept term.
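To see this concretely, here is a small check, reusing mod and pred from the block above: the successive forecasts decay toward the estimated intercept, which is the estimated process mean.
# the forecasts shrink toward the fitted intercept as the horizon grows
round(pred$pred, 2)
# compare with the estimated process mean
coef(mod)["intercept"]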

Fitting an ARIMA model to your data results in an ARIMA(0,0,0), which means that the fitted values depend on zero previous observations and zero previous forecast errors. This in turn means that the best prediction this model can make (based on this data) is a constant: for every future point, regardless of previous observations, it predicts the same value.
library(forecast)
df <- c(23, 22, 21, 31, 29, 13, 15, 20, 15, 26, 11, 24, 14, 18, 15, 21,
        25, 23, 27, 30, 19, 18, 20, 13, 23, 40, 14, 15, 20, 14, 9, 22,
        14, 24, 26, 22, 23, 16, 24, 19, 14, 10, 17, 12, 11, 15, 9, 24,
        17, 22, 28)
# auto.arima() selects the ARIMA(p, d, q) model with the lowest AICc score:
(auto.arima(df))
# Series: df
# ARIMA(0,0,0) with non-zero mean
#
# Coefficients:
#       intercept
#         19.0000
# s.e.     0.9642
#
# sigma^2 estimated as 49.29:  log likelihood=-174.62
# AIC=353.25   AICc=353.49   BIC=357.15
forecast(auto.arima(df), h = 10)  # forecast.Arima() in older versions of the forecast package
# Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
# 53 19 10.00226 27.99774 5.239138 32.76086
# 54 19 10.00226 27.99774 5.239138 32.76086
# 55 19 10.00226 27.99774 5.239138 32.76086
# 56 19 10.00226 27.99774 5.239138 32.76086
# 57 19 10.00226 27.99774 5.239138 32.76086
# 58 19 10.00226 27.99774 5.239138 32.76086
# 59 19 10.00226 27.99774 5.239138 32.76086
# 60 19 10.00226 27.99774 5.239138 32.76086
# 61 19 10.00226 27.99774 5.239138 32.76086
# 62 19 10.00226 27.99774 5.239138 32.76086

Related

Convert entire data set into numeric form during retrieving the file

I am working on a Shiny app and want to convert the entire data set into numeric form. I have used the code below for retrieving a file from my local PC. What changes can be made so that, while retrieving it, the entire data set is converted into numeric form?
datami <- reactive({
  file1 <- input$file
  if (is.null(file1)) { return() }
  read.csv(file = file1$datapath, sep = input$sep, header = input$header,
           stringsAsFactors = input$stringAsFactors)
})
output$table <- renderPrint({
  if (is.null(datami())) { return() }
  str(datami())
})
tabsetPanel(tabPanel("Data", div(h5("Data", style = "color:red")), verbatimTextOutput("table")))
Depending on how you want to deal with lower-/uppercase letters (if you have them in your data), we could do one of the following:
MRE:
letter_variable <- c(letters, LETTERS)
Same numeric value for upper and lower case letters:
letter_variable_as_numeric1 <- as.numeric(factor(toupper(letter_variable), levels = LETTERS))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
[22] 22 23 24 25 26 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
[43] 17 18 19 20 21 22 23 24 25 26
Different numeric value for upper and lower case letters:
letter_variable_as_numeric2 <- as.numeric(factor(letter_variable, levels = c(letters, LETTERS)))
[1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
[22] 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
[43] 43 44 45 46 47 48 49 50 51 52
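Tying this back to the file-reading code: a minimal sketch of applying such a conversion to every column right after read.csv() inside the reactive (this assumes every column can be meaningfully coerced; the factor-based rule above is just one possible choice):
datami <- reactive({
  file1 <- input$file
  if (is.null(file1)) return(NULL)
  df <- read.csv(file = file1$datapath, sep = input$sep, header = input$header,
                 stringsAsFactors = input$stringAsFactors)
  # coerce every column to numeric codes via factor levels
  df[] <- lapply(df, function(col) as.numeric(factor(col)))
  df
})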

How can I create a letter summary for these Least Significant Difference (LSD) results in R?

I am trying to compare the tensile strength of different material weights in R. The tensile data is as follows:
tensile <- read.table(text=" Weight Strength Replicate
1 15 7 1
2 15 7 2
3 15 15 3
4 15 11 4
5 15 9 5
6 20 12 1
7 20 17 2
8 20 12 3
9 20 18 4
10 20 18 5
11 25 14 1
12 25 18 2
13 25 18 3
14 25 19 4
15 25 19 5
16 30 19 1
17 30 25 2
18 30 22 3
19 30 19 4
20 30 23 5
21 35 7 1
22 35 10 2
23 35 11 3
24 35 15 4
25 35 11 5", header=TRUE)
The variable Weight should be regarded as a factor (explanatory/independent variable) for the purpose of this analysis:
tensile$Weight <- factor(tensile$Weight)
I first fitted a one-way ANOVA model to my data:
tensile.aov <- aov(Strength ~ Weight, data = tensile)
According to the ANOVA, there appears to be a difference among the weights with respect to the response (strength). So I then decided to do pairwise comparisons using the LSD (Least Significant Difference):
LSD.aov(tensile.aov)
However, this LSD function was provided through a separate file, so I'm unfortunately unable to share the code here.
I calculated the LSD for my data and got the following table:
Note that, according to the raw p-values, the pairwise comparisons between the 35 and 15 weights and between the 25 and 20 weights are the only ones that are not significantly different at the alpha = 0.05 significance level; all other pairwise comparisons are significantly different. I want to create a letter summary to illustrate this, where groups share a letter only if they are not significantly different from each other, and groups that do not share a letter are significantly different:
How can I go about creating such a table in R?
I'm also totally open to a 'manual' solution. By this, I mean manually creating a table using vectors and such. I'm new to R, so I don't have a good grasp on even the most basic aspects.
The multcompView package can turn p-values into letters, but in this case, the emmeans package can do both the comparison and the letters.
library(emmeans)
em <- emmeans(tensile.aov, ~Weight)
summary(pairs(em, adjust="none"), infer=TRUE)
#> contrast estimate SE df lower.CL upper.CL t.ratio p.value
#> 15 - 20 -5.6 1.79555 20 -9.3454518 -1.8545482 -3.119 0.0054
#> 15 - 25 -7.8 1.79555 20 -11.5454518 -4.0545482 -4.344 0.0003
#> 15 - 30 -11.8 1.79555 20 -15.5454518 -8.0545482 -6.572 <.0001
#> 15 - 35 -1.0 1.79555 20 -4.7454518 2.7454518 -0.557 0.5838
#> 20 - 25 -2.2 1.79555 20 -5.9454518 1.5454518 -1.225 0.2347
#> 20 - 30 -6.2 1.79555 20 -9.9454518 -2.4545482 -3.453 0.0025
#> 20 - 35 4.6 1.79555 20 0.8545482 8.3454518 2.562 0.0186
#> 25 - 30 -4.0 1.79555 20 -7.7454518 -0.2545482 -2.228 0.0375
#> 25 - 35 6.8 1.79555 20 3.0545482 10.5454518 3.787 0.0012
#> 30 - 35 10.8 1.79555 20 7.0545482 14.5454518 6.015 <.0001
#>
#> Confidence level used: 0.95
# cld() is the compact-letter-display generic from the multcomp package
# (the multcompView package must also be installed)
library(multcomp)
cld(em, adjust="none")
#> Weight emmean SE df lower.CL upper.CL .group
#> 15 9.8 1.269646 20 7.151566 12.44843 1
#> 35 10.8 1.269646 20 8.151566 13.44843 1
#> 20 15.4 1.269646 20 12.751566 18.04843 2
#> 25 17.6 1.269646 20 14.951566 20.24843 2
#> 30 21.6 1.269646 20 18.951566 24.24843 3
#>
#> Confidence level used: 0.95
#> significance level used: alpha = 0.05
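For reference, here is a sketch of the multcompView route mentioned at the start, feeding multcompLetters() the raw p-values from the pairwise table above (names follow its "group1-group2" convention; the two <.0001 entries are rounded up to 0.0001):
library(multcompView)
pvals <- c("15-20" = 0.0054, "15-25" = 0.0003, "15-30" = 0.0001,
           "15-35" = 0.5838, "20-25" = 0.2347, "20-30" = 0.0025,
           "20-35" = 0.0186, "25-30" = 0.0375, "25-35" = 0.0012,
           "30-35" = 0.0001)
multcompLetters(pvals)  # one letter code per weight level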
I managed to do it as follows:
Weight <- c(15, 20, 25, 30, 35)
mean_strength <- c(9.8, 15.4, 17.6, 21.6, 10.8)
group <- c("a", "b", "b", "c", "a")  # weight 30 differs from all others, so it gets its own letter
LSDletterSummary <- data.frame(Weight, mean_strength, group)
LSDletterSummary
If anyone has a better way to go about it, feel free to share.

Conditional vectorization

Given a data frame like below:
set.seed(123)
df1 <- data.frame(V1 = sample(c(0, 1, 2), 100, replace = TRUE),
                  V2 = sample(c(2, 3, 4), 100, replace = TRUE),
                  V3 = sample(c(4, 5, 6), 100, replace = TRUE),
                  V4 = sample(c(6, 7, 8), 100, replace = TRUE),
                  V5 = sample(c(6, 7, 8), 100, replace = TRUE))
I want to sum each row, starting from the first column with a value >= 2 and ending with the first column with a value > 6, or summing to the end of the row if no value exceeds 6.
How would I do this in a vectorized fashion?
Update: This is not for any homework assignment. I just want more examples of vectorization code that I can study and learn from. I had to do something like the above before, but couldn't figure out the apply syntax for this particular task and resorted to for loops.
This is what appears to be the most R-like approach, though I don't consider it "vectorized" in the R sense of the term:
apply( df1, 1, function(x) sum( x[which(x>=2)[1]: min(which(x>6)[1], 5, na.rm=TRUE)] ) )
#---------
[1] 15 22 16 19 17 17 23 21 14 13 18 13 16 23 15 18 16 21 16 19 17 23 21 18
[25] 21 24 15 20 15 18 17 24 19 18 19 15 18 17 15 17 14 21 13 19 15 15 15 15
[49] 21 19 21 15 17 18 14 17 15 16 22 16 23 22 17 21 17 16 23 23 16 14 18 13
[73] 18 15 17 17 17 20 20 16 17 16 16 16 14 16 20 23 23 24 14 18 16 17 22 23
[97] 23 19 20 17
Due to your sampling structure, we can vectorize quite easily.
Only the first column, V1, can be below 2 and thus excluded from the start. Columns V2, V3 and V4 are always included: V2 and V3 can never exceed 6, and V4, even when it is above 6, is the first such column and is therefore still counted. Column V5 is excluded only if V4 was above 6.
So:
(df1$V1 == 2) * df1$V1 + df1$V2 + df1$V3 + df1$V4 + df1$V5 * !(df1$V4 > 6)
[1] 15 22 16 19 17 17 23 21 14 13 18 13 16 23 15 18 16 21 16 19 17 23 21 18 21 24 15 20 15 18 17 24 19 18
[35] 19 15 18 17 15 17 14 21 13 19 15 15 15 15 21 19 21 15 17 18 14 17 15 16 22 16 23 22 17 21 17 16 23 23
[69] 16 14 18 13 18 15 17 17 17 20 20 16 17 16 16 16 14 16 20 23 23 24 14 18 16 17 22 23 23 19 20 17
is your vectorized calculation. This is obviously much less general than the other answers here, but fits your question.
Using apply would be the most sensible solution. However, since we seem to be competing on who can answer this without using R-based loops, I humbly offer this
m <- as.matrix(df1)
# index of the first column per row with a value >= 2
start <- max.col(m >= 2, ties.method = "first")
# index of the first column per row with a value > 6;
# the last column is forced to TRUE so every row has an end point
end <- max.col(`[<-`(m > 6, , ncol(m), TRUE), ties.method = "first")
# a matrix of column indices, one row per row of m
i <- t(matrix(1:ncol(m), nrow = ncol(m), ncol = nrow(m)))
rowSums(m * (i >= start & i <= end))
The output is the same as in the other answers.
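As a quick sanity check, comparing against the apply() version from the first answer on the same df1:
a1 <- apply(df1, 1, function(x) sum(x[which(x >= 2)[1]:min(which(x > 6)[1], 5, na.rm = TRUE)]))
a2 <- rowSums(m * (i >= start & i <= end))
all.equal(a1, a2)  # TRUE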
I am sure that there is a more elegant way, but for a brute-force approach you can write a function and pass it to apply.
First, define your example data
df <- data.frame(V1 = sample(c(0, 1, 2), 100, replace = TRUE),
                 V2 = sample(c(2, 3, 4), 100, replace = TRUE),
                 V3 = sample(c(4, 5, 6), 100, replace = TRUE),
                 V4 = sample(c(6, 7, 8), 100, replace = TRUE),
                 V5 = sample(c(6, 7, 8), 100, replace = TRUE))
Write a function that implements the conditional logic. which() returns the positions in the vector where the condition holds. For "start" we take the position of the first occurrence, hence the [1]. The end position has two possible outcomes, handled with an if statement: if no value meets the condition > 6, "end" is assigned the last position of the vector; otherwise it is assigned the position of the first value that meets the condition. Then it is just a matter of subsetting the vector from start to end and evaluating it with sum.
sum.col <- function(x) {
  start <- which(x >= 2)[1]
  end <- which(x > 6)
  if (length(end) == 0) {
    end <- length(x)  # no value above 6: sum to the end of the row
  } else {
    end <- end[1]     # stop at the first value above 6
  }
  return(sum(x[start:end]))
}
Now we can pass the function to apply, which runs it on each row for us.
apply(df, FUN=sum.col, MARGIN = 1)
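Note that this answer regenerates the data without set.seed(), so its results will not match the other answers row for row; applying the same function to the original df1 does:
apply(df1, FUN = sum.col, MARGIN = 1)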

How to calculate the mean for a column subsetted from the data

This shouldn't be too hard, but I always have issues when trying to run calculations on a column in a data frame that rely on the value of another column in the data frame. Here is my data.frame:
stream reach length.km length.m total.sa pools.sa
1 Stream Reach_Code 109 109 1 1
2 Brooks BRK_001 17 14 108 13
3 Brooks BRK_002 15 12 99 9
4 Brooks BRK_003 24 21 94 95
5 Brooks BRK_004 32 29 97 33
6 Brooks BRK_005 27 24 92 79
7 Brooks BRK_006 26 23 95 6
8 Brooks BRK_007 16 13 77 15
9 Brooks BRK_008 29 26 84 26
10 Brooks BRK_009 18 15 87 46
11 Brooks BRK_010 23 20 88 47
12 Brooks BRK_011 22 19 91 40
13 Brooks BRK_012 30 27 98 37
14 Brooks BRK_013 25 22 93 29
19 Buncombe_Hollow BNH_0001 7 4 75 65
20 Buncombe_Hollow BNH_0002 8 5 66 21
21 Buncombe_Hollow BNH_0003 9 6 68 53
22 Buncombe_Hollow BNH_0004 19 16 81 11
23 Buncombe_Hollow BNH_0005 6 3 65 27
24 Buncombe_Hollow BNH_0006 13 10 63 23
25 Buncombe_Hollow BNH_0007 12 9 71 57
I would like to calculate the mean of a column (let's say length.m) where stream = Brooks, and then do the same thing for stream = Buncombe_Hollow. I actually have 17 different stream names and plan on calculating the mean of some column for each stream. I will then store these means as a vector and bind them to another vector of the stream names, so the end result is something like this:
stream truevalue
1 Brooks 0.9440620
2 Siouxon 0.5858527
3 Speelyai 0.5839844
Thanks!
Try using aggregate():
# Generate some data to use
someDf <- data.frame(stream = rep(c("Brooks", "Buncombe_Hollow"), each = 10),
                     length.m = rpois(20, 4))
# Calculate the means with aggregate
with(someDf, aggregate(list(truevalue = length.m), list(stream = stream), mean))
The reason for the list() bits is to explicitly name the columns in the (data frame) output.
Start using the dplyr package. It makes such calculations quick as well as very easy to write:
library(dplyr)
result <- data %>% group_by(stream) %>% summarize(truevalue = mean(length.m))
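For reference, a base-R alternative with tapply() (a sketch assuming your data frame is named data and the stray header row shown above has been removed, so that length.m is numeric):
means <- tapply(data$length.m, data$stream, mean)
result <- data.frame(stream = names(means), truevalue = unname(means))
result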

Mean and SD in R

maybe it is a very easy question. This is my data.frame:
> read.table("text.txt")
V1 V2
1 26 22516
2 28 17129
3 30 38470
4 32 12920
5 34 30835
6 36 36244
7 38 24482
8 40 67482
9 42 23121
10 44 51643
11 46 61064
12 48 37678
13 50 98817
14 52 31741
15 54 74672
16 56 85648
17 58 53813
18 60 135534
19 62 46621
20 64 89266
21 66 99818
22 68 60071
23 70 168558
24 72 67059
25 74 194730
26 76 278473
27 78 217860
It means that I have 22516 sequences of length 26, 17129 sequences of length 28, etc. I would like to know the mean sequence length and its standard deviation. I know how to do it by creating a vector with 26 repeated 22516 times and so on, and then computing the mean and SD. However, I think there is an easier method. Any idea?
Thanks.
For the mean: sum(V1 * V2) / sum(V2)
For the SD: sqrt(sum((V1 - sum(V1 * V2) / sum(V2))^2 * V2) / sum(V2))
(This divides by sum(V2), giving the population SD; use sum(V2) - 1 in the denominator for the sample SD.)
I do not find mean(rep(V1,V2)) # 61.902 and sd(rep(V1,V2)) # 14.23891 that complex, but alternatively you might try:
weighted.mean(V1,V2) # 61.902
# recipe from http://www.ltcconline.net/greenl/courses/201/descstat/meansdgrouped.htm
sqrt((sum((V1^2)*V2)-(sum(V1*V2)^2)/sum(V2))/(sum(V2)-1)) # 14.23891
Step 1: Set up the data:
dat.df <- read.table(text="id V1 V2
1 26 22516
2 28 17129
3 30 38470
4 32 12920
5 34 30835
6 36 36244
7 38 24482
8 40 67482
9 42 23121
10 44 51643
11 46 61064
12 48 37678
13 50 98817
14 52 31741
15 54 74672
16 56 85648
17 58 53813
18 60 135534
19 62 46621
20 64 89266
21 66 99818
22 68 60071
23 70 168558
24 72 67059
25 74 194730
26 76 278473
27 78 217860",header=T)
Step 2: Convert to a data.table (only for simplicity and laziness in typing):
library(data.table)
dat <- data.table(dat.df)
Step 3: Set up new columns with the products, and use them to find the mean and SD:
dat[,pr:=V1*V2]
dat[,v1sq:=as.numeric(V1*V1*V2)]
dat.Mean <- sum(dat$pr)/sum(dat$V2)
dat.SD <- sqrt( (sum(dat$v1sq)/sum(dat$V2)) - dat.Mean^2)
Hope this helps!!
MEAN = sum(V1 * V2) / sum(V2)
SD = sqrt(sum(V1 * V1 * V2) / sum(V2) - MEAN^2)  # population SD, as in the data.table answer above
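As a sanity check, these closed-form results agree with the expanded-data approach mentioned above (up to the population-vs-sample denominator):
x <- rep(V1, V2)  # expand: 26 repeated 22516 times, and so on
mean(x)           # 61.902
sd(x)             # 14.23891 (sample SD, denominator sum(V2) - 1)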
