Holt Winters predict results differ from fitted data dramatically - r

I am seeing a huge difference between my fitted data from HoltWinters and the predicted data. I can understand there being a large difference after several predictions, but shouldn't the first prediction be the same number the fitted data would produce if the series had one more value?
Please correct me if I'm wrong, and explain why that wouldn't be the case.
Here is an example of the actual data.
1
1
1
2
1
1
-1
1
2
2
2
1
2
1
2
1
1
2
1
2
2
1
1
2
2
2
2
2
1
2
2
2
-1
1
Here is an example of the fitted data.
1.84401709401709
0.760477897417666
1.76593566042741
0.85435674207981
0.978449891674328
2.01079668445307
-0.709049507055536
1.39603638693742
2.42620183925688
2.42819282543689
2.40391946256294
1.29795840410863
2.39684770489517
1.35370435531208
2.38165200319969
1.34590347535205
1.38878761417551
2.36316132796798
1.2226736501825
2.2344269563083
2.24742853293732
1.12409156568888
Here is my R code.
randVal <- read.table("~/Documents/workspace/Roulette/play/randVal.txt", sep = "")
test<-ts(randVal$V1, start=c(1,133), freq=12)
test <- fitted(HoltWinters(test))
test.predict<-predict(HoltWinters(test), n.ahead=1*1)
Here is the predicted data after I expand it to n.ahead=1*12. Keep in mind that I only really want the first value. I don't understand why all the predicted values are so low, close to 0 and -1, while the fitted data tracks the actual data far more closely. Thank you.
0.16860570380268
-0.624454483845195
0.388808753990824
-0.614404235175936
0.285645402877705
-0.746997659036848
-0.736666618626855
0.174830187188718
-1.30499945596422
-0.320145850774167
-0.0917166719596059
-0.63970713627854

Sounds like you need a statistical consultation, since the code is not throwing any errors, and you don't explain why you are dissatisfied with the results given that the first value from those two calls is the same. With that in mind, you should realize that most time-series methods assume input from de-trended and de-meaned data, so they will return estimated parameters and values that in many cases need to be offset by the global mean to predict on the original scale. (It's also really bad practice to overwrite intermediate values as you are doing with 'test': by the time you call predict, you are fitting HoltWinters to the fitted values of the first model, not to your data.) Nonetheless, if you look at the test object you will see a column of xhat values that are on the scale of the input data.
Your question "I don't understand why all the predict data is so low and close to 0 and -1 while the fitted data is far more accurate to the actual data" doesn't say in what sense you think the "predict data [sic]" is "more accurate" than the actual data. The predict results are estimates, and they are split into components, as you would see if you ran the code on the help page for predict:
plot(test, test.predict)
It is not "data", either. It's not at all clear how it could be "more accurate" unless you have some sort of gold standard that you are not telling us about.
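For comparison, here is a minimal sketch (using the series posted above) of keeping one fit object and predicting from it, rather than refitting HoltWinters() on its own fitted values:

```r
# Sketch: fit HoltWinters once on the ORIGINAL series and predict from
# that single fit object, instead of refitting on fitted() values.
x <- c(1, 1, 1, 2, 1, 1, -1, 1, 2, 2, 2, 1, 2, 1, 2, 1, 1,
       2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, -1, 1)
test <- ts(x, frequency = 12)
fit  <- HoltWinters(test)              # keep the model object intact
fc   <- predict(fit, n.ahead = 12)     # forecasts on the data's own scale
head(fitted(fit)[, "xhat"])            # fitted values, same scale
```

With this approach the forecasts continue the same model that produced the fitted values, instead of modeling the fitted values as if they were a new series.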

Related

How do you organize data for and run multinomial probit in R?

I apologize for the "how do I run this model in R" question. I will be the first to admit that I am a newbie when it comes to statistical models. Hopefully I have enough substantive questions surrounding it to be interesting, and the question will come out more like, "Does this command in R correspond to this statistical model?"
I am trying to estimate a model of the probability that a given Twitter user "follows" a political user from a given political party. My dataframe is at the level of individual users, where each user can choose to follow or not follow a party on Twitter. As alternative-specific variables I have measures of ideological distance between the Twitter user and the political party, and an interaction term that specifies whether the distance is positive or negative. Thus, the decision to follow a politician on Twitter is a function of ideological distance.
Initially I tried to estimate a conditional logit model, but I quickly moved away from that idea since the choices are not mutually exclusive, i.e. a user can choose to follow more than one party. Now I am in doubt whether I should employ a multinomial probit or a multivariate probit, since I want my model to allow individuals to choose more than one alternative. However, when I try to estimate a multinomial probit, my code doesn't work. My code is:
mprobit <- mlogit(Follow ~ F1_Distance + F2_Distance +
                    F1_Distance * F1_interaction +
                    F2_Distance * F2_interaction + strata(id),
                  long, probit = TRUE, seed = 123)
And i get the following error message:
Error in dfidx::dfidx(data = data, dfa$idx, drop.index = dfa$drop.index, :
the two indexes don't define unique observations
I've tried looking the error up, but I can't seem to find anything that relates to probit models. Can you tell me what I'm doing wrong? Once again, sorry for my ignorance. Thank you for your help.
Also, I've copied my dataframe in the code below. The data shows the first 6 observations for the first Twitter user, but I have a dataset of 5,181 users, which corresponds to 51,810 observations, since there are 10 parties in Denmark.
  id Alternative     Follow F1_Distance F2_Distance F1_interaction F2_interaction
1  1 alternativet         1  -0.9672566  -1.3101138              0              0
2  1 danskfolkeparti      0   0.6038972   1.3799961              1              1
3  1 konservative         1   1.0759252   0.8665096              1              1
4  1 enhedslisten         0  -1.0831657  -1.0815424              0              0
5  1 liberalalliance      0   1.5389934   0.8470291              1              1
6  1 nyeborgerlige        1   1.4139934   0.9898862              1              1
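The dfidx error usually means that some (id, alternative) pair occurs more than once in the long-format data, so the two indexes cannot uniquely identify rows. A quick base-R check, sketched on a toy frame with a deliberate duplicate (hypothetical data, not the questioner's):

```r
# Toy long-format frame where the (id = 1, Alternative = "b") pair repeats:
long <- data.frame(
  id          = c(1, 1, 1, 2, 2, 2),
  Alternative = c("a", "b", "b", "a", "b", "c"),
  Follow      = c(1, 0, 0, 0, 1, 0)
)
dup <- duplicated(long[, c("id", "Alternative")])
sum(dup)      # should be 0 for a valid index; here it is 1
long[dup, ]   # shows the offending row(s)
```

Running the same check on the real 51,810-row frame would show whether duplicated id/party pairs are what triggers the error.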

Correlation of categorical data to binomial response in R

I'm looking to analyze the correlation between a categorical input variable and a binomial response variable, but I'm not sure how to organize my data or if I'm planning the right analysis.
Here's my data table (variables explained below):
species<-c("Aaeg","Mcin","Ctri","Crip","Calb","Tole","Cfus","Mdes","Hill","Cpat","Mabd","Edim","Tdal","Tmin","Edia","Asus","Ltri","Gmor","Sbul","Cvic","Egra","Pvar")
scavenge<-c(1,1,0,1,1,1,1,0,1,0,1,1,1,0,0,1,0,0,0,0,1,1)
dung<-c(0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0)
pred<-c(0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0)
nectar<-c(1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0)
plant<-c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
blood<-c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0)
mushroom<-c(0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0)
loss<-c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0) #1 means yes, 0 means no
data<-cbind(species,scavenge,dung,pred,nectar,plant,blood,mushroom,loss)
data #check data table
data table explanation
I have individual species listed, and the next columns are their annotated feeding types. A 1 in a given column means yes, and a 0 means no. Some species have multiple feeding types, while some have only one feeding type. The response variable I am interested in is "loss," indicating loss of a trait. I'm curious to know if any of the feeding types predict or are correlated with the status of "loss."
thoughts
I wasn't sure if there was a good way to include feeding types as one categorical variable with multiple categories. I don't think I can organize it as a single variable with the types c("scavenge","dung","pred", etc...) since some species have multiple feeding types, so I split them up into separate columns and indicated their status as 1 (yes) or 0 (no). At the moment I was thinking of trying to use a log-linear analysis, but examples I find don't quite have comparable data... and I'm happy for suggestions.
Any help or pointing in the right direction is much appreciated!
There are too few samples: you have 18 with loss == 0 and only 4 with loss == 1. You will run into problems fitting a full logistic regression (i.e. including all variables). I suggest testing for association for each feeding habit using a Fisher test:
library(dplyr)
library(purrr)
library(tibble)  # for add_column

# cbind() above produced a character matrix; rebuild as a data.frame
data <- data.frame(species, scavenge, dung, pred, nectar,
                   plant, blood, mushroom, loss)

# function for the fisher test
FISHER <- function(x, y) {
  FT <- fisher.test(table(x, y))
  data.frame(
    pvalue         = FT$p.value,
    oddsratio      = as.numeric(FT$estimate),
    lower_limit_OR = FT$conf.int[1],
    upper_limit_OR = FT$conf.int[2]
  )
}

# define variables to test
FEEDING <- c("scavenge", "dung", "pred", "nectar", "plant", "blood", "mushroom")

# loop through and test association between each variable and "loss"
results <- data[, FEEDING] %>%
  map_dfr(FISHER, y = data$loss) %>%
  add_column(var = FEEDING, .before = 1)
You get the results for each feeding habit:
> results
var pvalue oddsratio lower_limit_OR upper_limit_OR
1 scavenge 0.264251538 0.1817465 0.002943469 2.817560
2 dung 1.000000000 1.1582683 0.017827686 20.132849
3 pred 0.263157895 0.0000000 0.000000000 3.189217
4 nectar 0.535201640 0.0000000 0.000000000 5.503659
5 plant 0.002597403 Inf 2.780171314 Inf
6 blood 1.000000000 0.0000000 0.000000000 26.102285
7 mushroom 0.337662338 5.0498688 0.054241930 467.892765
The pvalue column is the p-value from fisher.test; with an odds ratio > 1 the variable is positively associated with loss. Of all your variables, plant is the strongest, and you can check:
> table(loss,plant)
plant
loss 0 1
0 18 0
1 1 3
Almost all species with plant = 1 also have loss = 1. So with your current dataset, I think this is the best you can do. You should get a larger sample size to see if this still holds.
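Since seven separate tests are being run here, it may also be worth adjusting the p-values for multiple comparisons. A sketch with base R's p.adjust, using the p-values from the results table above:

```r
# p-values from the seven Fisher tests above, in FEEDING order
pvals <- c(0.264251538, 1.000000000, 0.263157895, 0.535201640,
           0.002597403, 1.000000000, 0.337662338)
p.adjust(pvals, method = "holm")  # Holm correction; plant stays below 0.05
```

Even after correction the plant association survives, which supports the conclusion above, though the tiny sample size remains the main caveat.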

Regarding usage of prediction in RandomForest implementation using Ranger

Overview
I am classifying documents using the random forest implementation in the ranger R package.
Now I am facing an issue:
the system expects all the features that are in the train set to be present in the real-time data set, which is not possible to achieve,
hence I am not able to predict for real-time text data.
Procedure followed
Aim: to predict which class a description belongs to (i.e., OutputClass)
Each piece of information, such as the description and features, is converted into a document-term matrix.
Document term matrix of Train Set
rpm Velocity Speed OutputClass
doc1 1 0 1 fan
doc2 1 1 1 fan
doc3 1 0 1 referigirator
doc4 1 1 1 washing machine
doc5 1 1 1 washing machine
Now train the model using the above matrix:
fit <- ranger(trainingColumnNames, data = trainset)
save(fit, file = "C:/TrainedObject.rda")
Now I am using the above stored object to predict the class type of real-time descriptions.
load("C:/TrainedObject.rda")
Again construct the Document matrix for the RealTimeData.
Velocity Speed OutputClass
doc5 0 1 fan
doc6 1 1 fan
doc7 0 1 referigirator
doc8 1 1 washing machine
doc9 1 1 washing machine
In the real-time data there is no term or feature named "rpm".
So the moment I call the predict function
predict(fit, RealTimeData)
it shows an error saying rpm is missing,
but it is practically impossible for the real-time data to contain every term or feature of the train set.
I tried both random forest implementations in R (ranger, randomForest) with predict-function parameters like
newdata
predict.all
treetype
None of these parameters helped with predicting when features are missing from the real-time data.
Can someone please help me solve the above issue?
Thanks in advance
predict expects all the features you provided to ranger. Hence, if you have missing features in the test set, you either remove the problematic feature from the train set and run ranger again, or fill in the missing values. For the latter solution you may want to have a look at the mice package.
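For a document-term matrix specifically, a common alternative is to add any training-set terms that are absent from the real-time DTM as all-zero columns, so predict() sees the same feature set. A base-R sketch using the column names from the example above (whether an absent term should really count as frequency 0 depends on how your features were extracted):

```r
# Training features, as in the question's train-set DTM
train_features <- c("rpm", "Velocity", "Speed")

# Real-time DTM is missing the "rpm" column
realtime <- data.frame(Velocity = c(0, 1), Speed = c(1, 1))

missing <- setdiff(train_features, names(realtime))
realtime[missing] <- 0                 # absent term treated as frequency 0
realtime <- realtime[train_features]   # match training column order
realtime
```

After this alignment, predict(fit, realtime) no longer fails on missing columns, because every feature the model was trained on is present.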

How to use the predict() function in the R package "pscl" with categorical predictor variables

I'm fitting count data (number of fledgling birds produced per territory) using zero-inflated Poisson models in R. While model fitting is working fine, I'm having trouble using the predict function to get estimates for multiple values of one category (Year) averaged over the values of another category (StudyArea). Both variables are dummy coded (0, 1) and are set up as factors. The data frame sent to the predict function looks like this:
Year_d StudyArea_d
1 0 0.5
2 1 0.5
However, I get the error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
If instead I use a data frame such as:
Year_d StudyArea_d
1 0 0
2 0 1
3 1 0
4 1 1
I get sensible estimates of fledgling counts per year and study site combination. However, I'm not really interested in the effect of study site (the effect is small and isn't involved in an interaction), and the year effect is really what the study was designed to examine.
I have previously used similar code to successfully get estimated counts from a model that had one categorical and one continuous predictor variable (averaging over the levels of the dummy-coded factor), using a data frame similar to:
VegHeight StudyArea_d
1 0 0.5
2 0.5 0.5
3 1 0.5
4 1.5 0.5
So I'm a little confused why the first attempt I describe above doesn't work.
I can work on constructing a reproducible example if it would help, but I have a hunch that I'm not understanding something basic about how the predict function works when dealing with factors. If anyone can help me understand what I need to do to get estimates at both levels of one factor, averaged over the levels of another factor, I would really appreciate it.
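One likely explanation for the contrasts error: a factor cannot take the value 0.5, so when StudyArea_d is a factor, predict() tries to build contrasts for a variable with no valid levels. The value 0.5 only worked in the earlier model because the averaging trick applies to numeric dummies, not factors. A workaround sketch (model and object names here are hypothetical) is to predict at every factor combination and average afterwards:

```r
# Build a prediction grid over all Year/StudyArea combinations
newdat <- expand.grid(Year_d      = factor(c(0, 1)),
                      StudyArea_d = factor(c(0, 1)))

# Hypothetical fitted zeroinfl() object from pscl:
# preds <- predict(zip_model, newdata = newdat, type = "response")
# tapply(preds, newdat$Year_d, mean)  # per-year estimates, averaged over area
```

Averaging the predictions over StudyArea after the fact gives the year-level estimates without asking predict() to evaluate a factor at an impossible level.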

consumer surplus in r

I'm trying to do some econometric analysis using R and can't figure out how to do the analysis I'm looking for. Specifically, I want to calculate consumer surplus.
I am trying to predict number of trips (dependent) based on variables like water quality, scenery, parking, etc. I've run a regression of my independent variables on my dependent variable using:
lm()
and also got my predicted values using:
y_hat <- as.matrix(mydata[c("y")])
Now I want to calculate the consumer surplus for each individual (~260 total) from my predicted (y_hat) values.
Welcome to R. I studied economics in college and wish R was taught. You will find that the programming language is very useful in your work.
Note that R is able to accomplish vectorized operations that may speed up your analysis. Consider:
mydata <- data.frame(x = letters[1:3], y = 1:3)
mydata
  x y
1 a 1
2 b 2
3 c 3
Let's say your predicted 'y' is 1.25.
y_hat <- 1.25
You can subtract that number from the entire column of the dataset and it will go row by row for you without having to use complicated 'for' loops.
y_hat - mydata[c("y")]
y
1 0.25
2 -0.75
3 -1.75
Without more information about your particular issue, that is all the help that I can offer. In the future, add a reproducible example that illustrates your data and the specific issue that you are stuck on.
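Since this looks like a travel-cost setup (trips predicted from site attributes), it may help to know the standard result for that literature: in a semi-log (count) demand model, consumer surplus per trip is -1 / beta_cost, and per-person surplus is predicted trips times that amount. A sketch with simulated data and hypothetical variable names, since the real dataset isn't shown:

```r
# Hypothetical Poisson travel-cost demand model for ~260 individuals
set.seed(1)
mydata <- data.frame(cost    = runif(260, 5, 50),
                     quality = rnorm(260))
mydata$trips <- rpois(260, exp(2 - 0.05 * mydata$cost + 0.3 * mydata$quality))

m <- glm(trips ~ cost + quality, family = poisson, data = mydata)

beta_cost     <- coef(m)["cost"]           # should be negative
cs_per_trip   <- -1 / beta_cost            # surplus per trip
cs_individual <- fitted(m) * cs_per_trip   # per-person consumer surplus
head(cs_individual)
```

If the original model was fit with lm() on trip counts instead, the surplus formula differs, so the functional form of the demand model matters before applying this.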
