I recently learned about Poisson regression and want to apply this new-to-me statistical method to real-world problems. After thinking about it for a while, I decided to try to predict the stock volumes of Fortune 500 companies based on financial information from a random sample of companies.
The problem I am encountering is that while the model accounts for a massive amount of variance and contains only significant predictors, when I ask the Poisson model to make predictions using the predict function, it returns predictions with virtually no variance that are way off the actual values.
The dataset I am playing with is not fully filled in, but I decided to take a peek at the results with a small sample size. I did this because I read something online suggesting that the power needed for Poisson regression is lower for large counts, and stock market volume includes some massive numbers. The dataset can be accessed here:
https://drive.google.com/file/d/1qvkwWSfUSodfceyNLvPjA4jqnWTDTeSo/view?usp=sharing
The code I used is presented below:
Stock<-read.csv("C:/FilePath/StockPrices.csv")
head(Stock)
summary(StockTest <- step(glm(
  formula = X2018.Volume ~ X2017.Stock.Price + X2017.Volume + Total.Revenue +
    Cost.of.Revenue + Research...Development + Selling.General...Administrative +
    Interest.Expense + Total.Other.Income...Expenses.Net + Income.Before.Tax +
    Income.Tax.Expense + Income.From.Continuing.Operation + Net.Income +
    Enviornment + Social + Governance,
  family = "poisson", data = Stock
)))
1-StockTest$deviance/StockTest$null.deviance
predict(StockTest)
The model has a great Pseudo R-squared, but its predicted values are way off the actual values. See for yourself:
predict(StockTest)
15.47486 15.00441 15.00881 14.01175 15.01126 16.24620 15.99307 15.68193 15.67123 14.98932 14.77741 15.43363 12.07001 13.84586 15.83090 14.28052 15.16039 13.83686
Versus
Stock[,"X2018.Volume"]
[1] 5160000 110853500 3310000 3310000 1200000 876000 3310000 11400000 8830000 6380000 6410000
[12] 820000 3500000 2620000 4860000 199000 741000 7680000 1287769 3810000 1460000 2310000
What am I doing wrong here? Are there special considerations to make when using the predict function on a Poisson model? Is Poisson regression simply not an appropriate analysis for the data I am playing with?
First, you need to read the manual page for predict.glm: by default, predict() returns predictions on the link scale, which for a Poisson GLM is the log scale. To get predictions on the scale of the counts, use type = "response" (here the fitted model has been stored as Stock.glm):
predict(Stock.glm, type="response")
# 1 3 4 5 7 8 9
# 5255867.7 3283450.0 3297945.2 1216812.4 3306021.9 11366695.1 8824739.9
# 10 11 13 14 15 16 17
# 6465084.7 6396289.7 3234293.9 2616649.3 5043601.7 174557.7 1030814.3
# 18 19 20 21
# 7503622.7 1592024.5 3837723.8 1021574.3
Stock.glm$model$X2018.Volume
# [1] 5160000 3310000 3310000 1200000 3310000 11400000 8830000 6380000 6410000
# [10] 3500000 2620000 4860000 199000 741000 7680000 1287769 3810000 1460000
You cannot compare these directly to the original data because of missing values: 4 rows of the original data were dropped from the data actually used in the fit.
cor(Stock.glm$model$X2018.Volume, predict(Stock.glm, type="response"))
# [1] 0.9983086
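To line the predictions up against the responses that actually entered the fit, a small sketch using the model frame stored in the fitted object:
# the model frame keeps only the complete cases used in the fit,
# so its rows match predict() one-to-one
comparison <- data.frame(
  actual    = Stock.glm$model$X2018.Volume,
  predicted = predict(Stock.glm, type = "response")
)
head(comparison)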
I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (cluster 1, with 0 observations), especially since its mean is nearly identical to that of the second cluster. This only appears when specifically asking for a univariate, equal-variance model; using, for example, modelNames="V" or leaving the default does not produce this problem.
This thread: Cluster contains no observations describes a similar problem, but if I understand correctly, that one appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there is quite a lot of data there (see the spike in the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust (or any GMM) is a probabilistic model: it estimates the mean and variance of each cluster as well as the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The model likelihood is built from these per-point membership probabilities, summed over the clusters for each data point; you can also check this in mclust's publication.
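As a rough sanity check of that description (a small sketch reusing the fitted object fit from above), the mixture log-likelihood can be recomputed directly from the estimated parameters:
p  <- fit$parameters$pro                # mixing proportions
mu <- fit$parameters$mean               # component means
s2 <- fit$parameters$variance$sigmasq   # shared variance for modelNames = "E"
x  <- data$value                        # the data the model was fitted on
dens <- sapply(seq_along(mu), function(k) p[k] * dnorm(x, mu[k], sqrt(s2)))
sum(log(rowSums(dens)))                 # should match fit$loglik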
In this model the means of cluster 1 and cluster 2 are close, but their expected proportions are quite different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that a data point lying around the means of clusters 1 and 2 will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
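Equivalently, the hard labels are just the column-wise argmax of the membership matrix z:
z <- predict(fit, 1350:1400)$z
head(apply(z, 1, which.max))   # same as predict(fit, 1350:1400)$classification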
To answer your question: no, you did not do anything wrong; it is a pitfall, at least with this implementation of GMM. I would say it is a bit of overfitting, but you can basically keep only the clusters that actually have members.
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv, what="classification")
Using scikit-learn's Gaussian mixture I don't see a similar issue. So if you need a Gaussian mixture with equal variances, consider using a fuzzy k-means instead:
library(ClusterR)
# fit_kmeans is assumed to be a fuzzy k-means fit from ClusterR (its creation is not shown here)
plot(NULL, xlim=range(data$value), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
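A minimal sketch of that alternative (the number of components, 3 here, is only an assumption, and ClusterR expects a numeric matrix):
library(ClusterR)
fit_gmm <- GMM(as.matrix(data$value), gaussian_comps = 3)  # Gaussian mixture without the equal-variance constraint
fit_gmm$centroids                                          # component means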
I need to write R code to model time series data from an .rsav file. Here is detailed information about the question:
The file “file.rsav” (which can be loaded into R using load(“file.rsav”)) contains a time series (“xx”). The series is a “demeaned” monthly revenue stream (in millions of
dollars) for a company. There are n = 96 observations.
The series has been "demeaned"; usually that would mean we subtract $\bar{X}$ from every data point, but pretend for now that we know the mean $\mu$ exactly, so we have subtracted $\mu$ from every data point and the new series is exactly (theoretically) mean 0. (Its sample mean is thus not precisely 0.)
We will consider possible ARMA models for the series $X_t$. We assume that the corresponding white noise is Gaussian (so $X_t$ is Gaussian). We will first consider an AR(2) model and assume we know the true model exactly: $X_t = 1.34X_{t-1} - 0.48X_{t-2} + W_t$, $W_t \sim \text{iid } N(0, \sigma^2)$.
I was asked to compute forecasts and backcasts using this model, up to 25 time steps into the future and into the past.
Write code to do the prediction by hand (i.e., not using the predict() function). Plot the data, the forecasts, and 95% prediction intervals (assuming Gaussianity), all on one plot. (Note: you do not need to do a multiplicity correction for the prediction intervals.)
Code:
load('./file.rsav')
str(xx)
xx
Out:
Time-Series [1:96] from 1 to 8.92: 2.45 2.18 0.389 -1.44 -1.47 ...
Jan Feb Mar Apr May Jun Jul
1 2.45017780 2.17955829 0.38874020 -1.43979552 -1.47049807 -2.25233354 -0.82580703
2 1.92378321 1.87944208 1.07382472 1.01933130 1.88660307 -0.31109156 -0.25732342
3 0.60732330 1.53185399 1.58614371 0.63922270 0.82728407 0.28910411 -1.18154941
4 0.41375543 1.96633332 1.97402973 4.16058136 5.15474250 5.71865844 3.93136013
5 -1.51228022 -3.03396294 -3.65446772 -4.69589618 -3.51276584 -2.97682246 -3.08655352
6 3.43027017 4.68909032 6.55598795 4.95816124 4.87626503 3.17103291 0.79093946
7 -0.62481997 -0.94469455 -2.13648402 -3.64364158 -2.07214317 -3.26793808 -3.38573375
8 0.67823828 1.09908274 0.93832242 0.08791237 1.77322327 2.01201710 3.70197246
Aug Sep Oct Nov Dec
1 0.53048061 1.31994246 0.69306401 1.25916404 1.53363966
2 -0.47154459 0.52849630 0.90548093 0.71783457 0.86908457
3 -0.52525201 -0.40335058 0.73415310 0.58501633 0.29875228
4 2.50242432 1.69408297 0.96230124 0.53164036 -0.64480235
5 -1.60735865 -0.20500888 -0.44508903 -0.01443040 1.71087427
6 -0.09975821 -0.85972650 -0.41557374 -0.99876068 0.52620555
7 -2.25968715 -0.91700127 -0.49302872 -1.44275203 -0.66221559
8 4.62724761 4.17549847 3.43992950 3.15302462 4.17300576
I don't know much about the .rsav file extension; could someone help me solve this issue or give me some tips? Thanks in advance.
I think "backcast" here means the in-sample fit for the last 25 observations. To forecast from an AR(2) model you simply need the last 2 observations for the next step.
The model is: x_t = ar1 * x_{t-1} + ar2 * x_{t-2} + error
Now we just insert the estimated AR parameters and the observations for x_{t-1} and x_{t-2}. For the step after that we need the first forecast and the last observation:
x_{t+1} = ar1 * x_{t} + ar2 * x_{t-1} + error
This is what we repeat 25 times. The error term is assumed to be normally distributed, so its expected value is zero.
We do the same thing for the "backcast" (the in-sample fit), but there we only need the observed values of the time series.
forecast <- numeric(25)
backcast <- numeric(25)
# AR coefficients from the stated model: phi1 = 1.34, phi2 = -0.48
forecast[1] <- 1.34 * xx[length(xx)] - 0.48 * xx[length(xx) - 1]
forecast[2] <- 1.34 * forecast[1] - 0.48 * xx[length(xx)]
for(i in 3:25)
{
  forecast[i] <- 1.34 * forecast[i - 1] - 0.48 * forecast[i - 2]
}
# "backcast": in-sample one-step fits, using only the observed values
for(i in 1:25)
{
  backcast[i] <- 1.34 * xx[length(xx) - i - 1] - 0.48 * xx[length(xx) - i - 2]
}
ts.plot(xx)
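A hedged sketch for the requested 95% prediction intervals, assuming the white-noise variance is estimated from the in-sample one-step residuals: the h-step forecast variance of an AR(2) is sigma^2 * (psi_0^2 + ... + psi_{h-1}^2), where the psi-weights follow the same recursion as the AR coefficients.
phi1 <- 1.34; phi2 <- -0.48
n <- length(xx)
res    <- xx[3:n] - phi1 * xx[2:(n - 1)] - phi2 * xx[1:(n - 2)]   # one-step residuals
sigma2 <- mean(res^2)                                             # estimated innovation variance
psi <- numeric(25); psi[1] <- 1; psi[2] <- phi1                   # psi_0 = 1, psi_1 = phi1
for (j in 3:25) psi[j] <- phi1 * psi[j - 1] + phi2 * psi[j - 2]   # psi_j = phi1*psi_{j-1} + phi2*psi_{j-2}
se    <- sqrt(sigma2 * cumsum(psi^2))
upper <- forecast + 1.96 * se
lower <- forecast - 1.96 * se
plot(c(as.numeric(xx), forecast), type = "l", xlab = "t", ylab = "xx")   # data followed by the forecasts
lines((n + 1):(n + 25), upper, lty = 2)   # upper 95% prediction limit
lines((n + 1):(n + 25), lower, lty = 2)   # lower 95% prediction limit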
New to StackOverflow and R.
I have a question about the different loss functions for cross-validation that are provided in the R package bnlearn, and which one I should use. I have continuous data (example below) with 32 rows and 8 columns, each column representing a species and each row representing the number of individuals of that species in a given year.
201 1.78e+08 18500000 1.87e+08 6.28e+07 1.08e+09 1.03e+08 7.22e+07 43100000
202 8.06e+07 9040000 5.04e+07 4.49e+07 6.66e+08 8.07e+07 2.58e+07 24100000
203 1.54e+08 4380000 1.51e+08 2.88e+07 9.94e+08 1.44e+08 7.32e+07 39000000
204 1.36e+08 6820000 3.80e+08 8.39e+06 7.38e+08 1.50e+08 4.25e+07 32600000
205 9.94e+07 9530000 8.99e+07 1.05e+07 6.62e+08 1.67e+08 1.90e+07 29200000
206 1.33e+08 6340000 4.27e+07 3.26e+06 5.31e+08 2.93e+08 2.70e+07 41500000
207 1.22e+08 5710000 4.41e+07 3.16e+06 4.58e+08 4.92e+08 4.02e+07 21600000
208 1.33e+08 13500000 1.20e+08 3.56e+06 4.40e+08 2.50e+08 3.93e+07 30000000
209 1.73e+08 21700000 4.35e+07 7.58e+06 5.62e+08 3.31e+08 4.98e+07 42100000
210 1.86e+08 6950000 3.40e+07 1.18e+07 4.41e+08 3.80e+08 4.83e+07 28100000
So far I have used tabu search to learn a fixed network structure and analyzed it with the cross-validation command
bn.cv(data = data, bn = bn.tabu, method = "k-fold", k = 10, runs = 100)
which gives the result
k-fold cross-validation for Bayesian networks
number of folds: 10
loss function: Log-Likelihood Loss (Gauss.)
number of runs: 100
average loss over the runs: 151.8083
standard deviation of the loss: 0.2384763
The question is: what loss function should I use for my data so that I can change the dataset I use and still get comparable results, and what does the "average loss over the runs" mean? The end goal is to obtain joint probability distributions and a prediction for year + 1, so basically a row 33 with numbers and their probability distributions.
Sorry for any inconsistencies, as I'm still learning statistics.
I am not sure whether I understand your question correctly. Regarding the second question, "what does the 'average loss over the runs' mean?": your cross-validation is repeated (10 folds per run, 100 runs), so this is simply the loss function averaged over those repetitions. Regarding the first question, it is better to have a look at this page:
https://stats.stackexchange.com/questions/339897/what-is-the-difference-between-loss-function-and-mle
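If you want an error measure for one particular node instead of the whole-network log-likelihood loss, bn.cv also accepts a loss argument; a hedged sketch (the target name "species1" is a placeholder for one of your column names):
bn.cv(data = data, bn = bn.tabu, method = "k-fold", k = 10, runs = 100,
      loss = "mse", loss.args = list(target = "species1"))   # mean squared error for that node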
Sorry for the bad language; my English isn't good, as you can see.
I've been doing a clustering analysis with a relatively large dataset (~50,000 observations and 16 variables).
library(mclust)
load(file="mdper.f.Rdata")#mdper.f = My stored data
As my computer was unable to handle the full dataset, I made a few subsets (10 subsets of 5,000 observations; 16,000 in the example below, which takes about 15 minutes to compute) and used Mclust to determine the optimal number of groups.
ind  <- sample(1:nrow(mdper.f), size=16000)  # sample of 16,000 rows (about 15 min of computing)
nfac <- mdper.f[ind,]                        # the sampled subset
Fnac <- scale(nfac)                          # scale the data
mod  <- Mclust(Fnac)                         # determine the optimal number of clusters
summary(mod)                                 # summary
#Results:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VII (spherical, varying volume) model with 9 components:
log.likelihood n df BIC ICL
128118.2 16000 80 255462 254905.3
Clustering table:
1 2 3 4 5 6 7 8 9
1879 2505 3452 3117 2846 464 822 590 325
The result is always 9 clusters (10 out of 10 subsets of 5,000), so I guess it's okay.
Now I would like to assign the estimated cluster labels to the rest of the data.
How can I do it?
I started to play with the Mclust object but I can't see how to handle it and apply to the rest of the data.
The optimal solution would be my original data with an extra column with the cluster number (1 to 9) assigned, for example.
I got the answer after a few minutes of work:
First of all, there was a conceptual mistake: the dataset must be scaled before partitioning; after that it is just a matter of using predict().
library(mclust)
load(file="mdper.f.Rdata")                        # mdper.f = my stored data
mdper.f.s <- scale(mdper.f)                       # scale the data
ind  <- sample(1:nrow(mdper.f.s), size=16000)     # sample 16,000 rows
nfac <- mdper.f.s[ind,]                           # the sampled subset
mod16 <- Mclust(nfac)                             # determine the optimal number of clusters (~15 min with 7 vars)
prediction <- predict(mod16, mdper.f.s)           # predict with the fitted model on the full scaled data
mdper.f <- cbind(mdper.f, prediction$classification)  # attach the assignments to the original data
colnames(mdper.f)[ncol(mdper.f)] <- "Cluster"     # name the new column
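A quick sanity check on the result (a small sketch, assuming the code above ran):
table(mdper.f[, "Cluster"])   # how many observations ended up in each of the 9 clusters
head(prediction$z)            # soft assignments (membership probabilities), if needed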
I want to run a multiple linear regression (MLR) on my data using the lm function in R. I am using a data-splitting cross-validation method to assess the reliability of the model: I intend to use the sample function to randomly split the data into calibration and validation datasets in an 80:20 ratio, and to repeat this, say, 100 times. Without setting a seed, the models from the different samplings will differ. I came across the function below in a previous post here, and it solves the first part;
lst <- lapply(1:100, function(repetition) {
mod <- lm(...)
# Replace this with the code you need to train your model
return(mod)
})
save(lst, file="myfile.RData")
The concern now is: how do I validate each of these 100 models and obtain reliability statistics such as RMSE, ME, and R-squared for each model, and hopefully confidence intervals?
If I can get output in the form of a data frame containing the predicted values for all 100 models, then I should be able to proceed from there.
Any help please?
Thanks
To quickly recap your question: it seems that you want to fit an MLR model to a large training set and then use this model to make predictions on the remaining validation set. You want to repeat this process 100 times, and afterwards you want to be able to analyze the characteristics and predictions of the individual models.
To accomplish this, you can store temporary model information in a data structure during the model generation and prediction process, and then retrieve and process all the information afterwards. You did not provide your own dataset in the description, so I will use one of R's built-in datasets to demonstrate how this might work:
> library(car)
> Prestige <- Prestige[,c("prestige","education","income","women")]
> Prestige[,c("income")] <- log2(Prestige[,c("income")])
> head(Prestige,n=5)
prestige education income women
gov.administrators 68.8 13.11 -0.09620212 11.16
general.managers 69.1 12.26 -0.04955335 4.02
accountants 63.4 12.77 -0.11643822 15.70
purchasing.officers 56.8 11.42 -0.11972061 9.11
chemists 73.5 14.62 -0.12368966 11.68
We start by initializing some variables. Let's say you want to create 100 models and use 80% of your data for training purposes:
nrIterations=100
totalSize <- nrow(Prestige)
trainingSize <- floor(0.80*totalSize)
We also want to create the data structure that will hold the intermediate model information. R is quite a generic high-level language in this regard, so we will just create a list of lists. This means that every list entry can itself hold another list of information, which gives us the flexibility to add whatever we need:
trainTestTuple <- vector(mode="list", length=nrIterations)
We are now ready to create our models and predictions. During every loop iteration a different random training subset is created, while the remaining data is used for testing. Next, we fit our model to the training data and then use the fitted model to make predictions on the test data. Note that we explicitly pass the independent variables in order to predict the dependent variable:
for(i in 1:nrIterations)
{
trainIndices <- sample(seq_len(totalSize),size = trainingSize)
trainSet <- Prestige[trainIndices,]
testSet <- Prestige[-trainIndices,]
trainingFit <- lm(prestige ~ education + income + women, data=trainSet)
# Perform predictions on the testdata
testingForecast <- predict(trainingFit,newdata=data.frame(education=testSet$education,income=testSet$income,women=testSet$women),interval="confidence",level=0.95)
# Do whatever else you want to do (compare with actual values, calculate other stuff/metrics ...)
# ...
# add your training and testData to a tuple and add it to a list
tuple <- list(trainingFit,testingForecast) # Add whatever else you need ..
trainTestTuple[[i]] <- tuple # Add this list to the "list of lists"
}
Now, the relevant part: at the end of each iteration we put both the fitted model and the out-of-sample prediction results in a list. This list contains all the intermediate information that we want to save for the current iteration, and we finish by putting it in our list of lists.
Now that we are done with the modeling, we still have access to all the information we need, and we can process and analyze it any way we want. Let's take a look at the modeling and prediction results of model 50. First, we extract both the model and the prediction results from the list of lists:
> tuple_50 <- trainTestTuple[[50]]
> trainingFit_50 <- tuple_50[[1]]
> testingForecast_50 <- tuple_50[[2]]
We take a look at the model summary:
> summary(trainingFit_50)
Call:
lm(formula = prestige ~ education + log2(income) + women, data = trainSet)
Residuals:
Min 1Q Median 3Q Max
-15.9552 -4.6461 0.5016 4.3196 18.4882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -287.96143 70.39697 -4.091 0.000105 ***
education 4.23426 0.43418 9.752 4.3e-15 ***
log2(income) 155.16246 38.94176 3.984 0.000152 ***
women 0.02506 0.03942 0.636 0.526875
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.308 on 77 degrees of freedom
Multiple R-squared: 0.8072, Adjusted R-squared: 0.7997
F-statistic: 107.5 on 3 and 77 DF, p-value: < 2.2e-16
We then explicitly obtain the model R-squared and the residual standard error (an in-sample analogue of the RMSE):
> summary(trainingFit_50)$r.squared
[1] 0.8072008
> summary(trainingFit_50)$sigma
[1] 7.308057
We take a look at the out of sample forecasts:
> testingForecast_50
fit lwr upr
1 67.38159 63.848326 70.91485
2 74.10724 70.075823 78.13865
3 64.15322 61.284077 67.02236
4 79.61595 75.513602 83.71830
5 63.88237 60.078095 67.68664
6 71.76869 68.388457 75.14893
7 60.99983 57.052282 64.94738
8 82.84507 78.145035 87.54510
9 72.25896 68.874070 75.64384
10 49.19994 45.033546 53.36633
11 48.00888 46.134464 49.88329
12 20.14195 8.196699 32.08720
13 33.76505 27.439318 40.09079
14 24.31853 18.058742 30.57832
15 40.79585 38.329835 43.26187
16 40.35038 37.970858 42.72990
17 38.38186 35.818814 40.94491
18 40.09030 37.739428 42.44117
19 35.81084 33.139461 38.48223
20 43.43717 40.799715 46.07463
21 29.73700 26.317428 33.15657
And finally, we obtain some more detailed results about the 2nd forecasted value and the corresponding confidence intervals:
> testingPredicted_2ndprediction <- testingForecast_50[2,1]
> testingLowerConfidence_2ndprediction <- testingForecast_50[2,2]
> testingUpperConfidence_2ndprediction <- testingForecast_50[2,3]
EDIT
After rereading, it occurred to me that you are obviously not splitting up the exact same dataset each time: you are using completely different partitions of the data during each iteration, and they should be split in an 80/20 fashion. However, the same solution can still be applied with minor modifications.
Also: for cross-validation purposes you should probably take a look at cv.lm() (from the DAAG package).
Description from the R help:
This function gives internal and cross-validation measures of predictive accuracy for multiple linear regression. (For binary logistic regression, use the CVbinary function.) The data are randomly assigned to a number of ‘folds’. Each fold is removed, in turn, while the remaining data is used to re-fit the regression model and to predict at the deleted observations.
EDIT: Reply to comment.
You can just take the means of the relevant performance metrics that you saved. For example, you can use sapply on trainTestTuple to extract the relevant element from each sublist; sapply returns these elements as a vector, from which you can calculate the mean. This should work:
# assumes each sublist stores the ME, MAD, MSE, RMSE and adjusted R-squared in positions 2 to 6
mean_ME     <- mean(sapply(trainTestTuple, "[[", 2))
mean_MAD    <- mean(sapply(trainTestTuple, "[[", 3))
mean_MSE    <- mean(sapply(trainTestTuple, "[[", 4))
mean_RMSE   <- mean(sapply(trainTestTuple, "[[", 5))
mean_adjRsq <- mean(sapply(trainTestTuple, "[[", 6))
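If you did not save an out-of-sample error during the loop, a hedged sketch for computing it afterwards (this assumes you also stored the test-set actuals, e.g. tuple <- list(trainingFit, testingForecast, testSet$prestige)):
rmse_per_model <- sapply(trainTestTuple, function(tp) {
  predicted <- tp[[2]][, "fit"]   # point forecasts saved in the loop
  actual    <- tp[[3]]            # test-set response, assumed to be saved as a third element
  sqrt(mean((actual - predicted)^2))
})
mean(rmse_per_model)              # average out-of-sample RMSE over the 100 models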
Another small edit: the calculation of your MAD looks rather strange. It might be a good idea to double-check that it is exactly what you want.