Calculate coverage rate of an OLS estimator - r

I want to calculate the 95% coverage rate of a simple OLS estimator.
The (for me) tricky addition is that the independent variable has 91 values that I have to test against each other in order to see which value leads to the best estimate.
For each value of the independent variable I want to draw 1,000 samples.
I tried reading up on the theory and searching platforms such as Stack Overflow, but I didn't manage to find an appropriate answer.
My biggest question is how to calculate the coverage rate of a 95% confidence interval.
I would deeply appreciate it if you could provide me with some possibilities or insights.
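The coverage rate is just the fraction of simulated confidence intervals that contain the true parameter. Here is a minimal sketch for one scenario, assuming a true model y = b0 + b1*x + e; the values of b0, b1, n, and the error standard deviation are illustrative and should be replaced with your own setup:

```r
# Coverage-rate simulation for the slope of a simple OLS regression.
set.seed(42)
b0 <- 1; b1 <- 2          # true parameters (assumed for illustration)
n    <- 50                # sample size per draw
reps <- 1000              # number of simulated samples

covered <- logical(reps)
for (r in seq_len(reps)) {
  x   <- rnorm(n)
  y   <- b0 + b1 * x + rnorm(n)
  fit <- lm(y ~ x)
  ci  <- confint(fit, "x", level = 0.95)   # 95% CI for the slope
  covered[r] <- ci[1] <= b1 && b1 <= ci[2] # did the interval cover the truth?
}
mean(covered)  # coverage rate: fraction of the 1000 intervals containing b1
```

For your 91 values of the independent variable, wrap this in `sapply` over the 91 scenarios and compare the resulting coverage rates (and, typically, the average interval widths) across scenarios.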

Related

Hitting a Target In-Stock Rate Through More Accurate Prediction Intervals in Univariate Time Series Forecasting

My job is to make sure that an online retailer achieves a certain service level (in stock rate) for their products, while avoiding aging and excess stock. I have a robust cost and leadtime simulation model. One of the inputs into that model is a vector of prediction intervals for cumulative demand over the next leadtime weeks.
I've been reading about quantile regression, conformal prediction, gradient boosting, and quantile random forests... frankly, all of these are far above my head, and they seem focused on multivariate regression of non-time-series data. I know that I can't just regress against time, so I'm not even sure how to set up a complex regression method correctly. Moreover, since I'm forecasting many thousands of items every week, the parameter setting and tuning needs to be completely automated.
To date, I've been using a handful of traditional forecast methods (TSB [a variation of Croston], ETS, ARIMA, etc.), including hybrids, using R packages like forecastHybrid. My prediction intervals are almost universally much narrower than our actual results (e.g. in a sample of 500 relatively steady-selling items, 20% were below my ARIMA 1% prediction interval, and 12% were above the 99% prediction interval).
I switched to using simulation + bootstrapping the residuals to build my intervals, but the results are directionally the same as above.
I'm looking for the simplest way to arrive at a univariate time series model with more accurate prediction intervals for cumulative demand over leadtime weeks, particularly at the upper/lower 10% and beyond. All my current models are trained on MSE, so one step is probably to use something more like pinball-loss scoring against cumulative demand (rather than the per-period error). Unfortunately, I'm totally unfamiliar with how to write a custom loss function for the legacy forecasting libraries (much less the sexy new ones above).
I'd deeply appreciate any advice!
A side note: we already have an AWS setup that can compute each item from an R job in parallel, so computing time is not a major factor.
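One reason bootstrapped intervals for cumulative demand come out too narrow is quantiling the per-week forecasts and then summing, rather than summing simulated paths and quantiling the sums. A minimal sketch of the path-then-sum approach; `fc_mean` and `resid` are hypothetical stand-ins for your model's weekly point forecasts and historical residuals:

```r
# Simulate whole demand paths over the leadtime, then take empirical
# quantiles of the cumulative sums (not sums of per-week quantiles).
set.seed(1)
leadtime <- 4
fc_mean  <- c(100, 110, 105, 95)   # assumed weekly point forecasts
resid    <- rnorm(200, 0, 15)      # stand-in for the model's residuals

nsim <- 10000
cum_demand <- replicate(nsim, {
  path <- fc_mean + sample(resid, leadtime, replace = TRUE)
  sum(pmax(path, 0))               # demand cannot go negative
})
quantile(cum_demand, c(0.01, 0.10, 0.50, 0.90, 0.99))
```

If the empirical tails are still too narrow against actuals, the usual suspects are residuals that understate reality (e.g. from overfit point forecasts) or positive autocorrelation in the residuals, which simple i.i.d. resampling ignores; a block bootstrap is one way to preserve that dependence.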

Introductory Statistics RStudio query

First time here. I'm taking an introductory statistics class which uses R. I have a dataset of 54 observations, and I have to determine the 90% upper-bounded confidence interval for the variance of the content. From that I have to deduce whether it is possible to claim, with confidence at least 90%, that the variance of the content is less than 0.001.
From my understanding I have to use the script:
sigma.test(data, alternative = "less", conf.level = 0.90)$conf.int
When I run it, I get a value of 0. That doesn't seem right at all. I'm not sure if I'm following the correct procedure.
I would appreciate any feedback to help my understanding of this more.
Cheers.
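As a sanity check on `sigma.test()` (which comes from the TeachingDemos package), the one-sided 90% upper bound for the variance can be computed by hand from the chi-square distribution of (n-1)s²/σ². The data below are a simulated stand-in; substitute your 54 observations:

```r
# One-sided 90% upper confidence bound for a normal variance:
# upper = (n-1) * s^2 / qchisq(0.10, df = n-1)
set.seed(7)
x <- rnorm(54, mean = 16, sd = 0.025)   # assumed data for illustration
n <- length(x)
upper <- (n - 1) * var(x) / qchisq(0.10, df = n - 1)
upper   # the claim "variance < 0.001" holds iff this bound is below 0.001
```

A literal 0 from the packaged function usually means the input isn't what you think it is (e.g. passing a data frame or a constant column rather than the numeric vector), so it's worth checking `var(data)` directly first.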

Optimising weights in R (Portfolio)

I tried several packages in R and I am really lost as to which one I should be using. I just need help with the general direction, and I can find my way myself for the exact code.
I am trying portfolio optimization in R. I need a weights vector to be calculated, where each weight in the vector represents the percentage allocated to that stock.
Given the weights, I calculate total return, variance, and the Sharpe ratio (a function of return and variance).
There could be constraints, such as the weights summing to 1 (100%), and maybe some others on a case-by-case basis.
I am trying to make my code flexible so that I can optimize with different objectives (one at a time, though). For example, I could want minimum variance in one simulation, maximum return in another, or even maximum Sharpe ratio in yet another.
This is pretty straightforward in Excel with the Solver add-in. Once I have the formulas entered, whichever cell I pick for the objective function, it will calculate weights based on that and then calculate the other parameters from those weights (e.g., if I optimize for minimum variance, it calculates the minimum-variance weights and then the return and Sharpe ratio from those weights).
I am wondering how to go about this in R. I am lost reading the documentation of several R packages and functions (lpSolve, optim, constrOptim, PortfolioAnalytics, etc.) and not able to find a starting point. My specific questions are:
Which would be the right R package for this kind of analysis?
Do I need to define separate functions for each possible objective (variance, return, Sharpe) and optimize those functions? This is a little tricky because the Sharpe ratio depends on variance and return. So if I want to optimize the Sharpe function, do I need to nest it within the variance and return functions?
I just need some ideas on how to start, and I can give it a try. If I at least get the right package and the right example to use, it would be great. I searched a lot on the web, but I am really lost.
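For general objectives and constraints, PortfolioAnalytics (or quadprog::solve.QP for quadratic problems) is the usual route. As a base-R sketch of the simplest case: with only the full-investment constraint, the minimum-variance weights have a closed form, w = Σ⁻¹1 / (1ᵀΣ⁻¹1), and return and Sharpe then follow from the weights, just like in your Excel setup. The covariance matrix and expected returns below are assumed for illustration:

```r
# Minimum-variance portfolio with weights summing to 1 (no other constraints).
Sigma <- matrix(c(0.04, 0.01, 0.00,
                  0.01, 0.09, 0.02,
                  0.00, 0.02, 0.16), nrow = 3)  # assumed asset covariances
mu <- c(0.06, 0.10, 0.14)                       # assumed expected returns

ones <- rep(1, nrow(Sigma))
w <- solve(Sigma, ones)
w <- w / sum(w)                  # normalize so the weights sum to 1
w

# Derived quantities, computed from the optimal weights:
port_var <- drop(t(w) %*% Sigma %*% w)
port_ret <- sum(w * mu)
sharpe   <- port_ret / sqrt(port_var)
```

For other objectives (maximum return, maximum Sharpe) and extra constraints (no shorting, position limits), you define one R function per objective that maps weights to a scalar and hand it to the optimizer; the Sharpe function simply calls the variance and return calculations internally, so no special nesting is needed.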

Is the model's prediction probability the same as the confidence level?

This question seems weird, let me explain it by example.
We train a particular classification model to determine if an image contains a person or not.
After the model is trained, we use a new image for prediction.
The prediction result shows there is a 94% probability that the image contains a person.
Thus, could I say the confidence level is 94% that the image contains a person?
Your third item is not properly interpreted. The model returns a normalized score of 0.94 for the category "person". Although this score correlates relatively well with our cognitive notions of "probability" and "confidence", do not confuse it with either of those. It's a convenient metric with some overall useful properties, but it is not an accurate prediction to two digits of precision.
Granted, there may well be models for which the model's prediction is an accurate figure. For instance, the RealOdds models you'll find on 538 are built and tested to that standard. However, that is a directed effort of more than a decade; your everyday deep learning model is not held to the same standard ... unless you work to tune it to that, making the accuracy of that number a part of your training (incorporate it into the error function).
You can run a simple (although voluminous) experiment: collect all of the predictions and bin them, say in ranges of 0.1 for each of 10 bins. Now, if this "prediction" is indeed a probability, then your 0.6-0.7 bin should correctly identify a person roughly 65% of the time. Check that against ground truth: did that bin get 65% correct and 35% wrong? Is the discrepancy within expected ranges? Do this for each of the 10 bins and run your favorite applicable statistical measures on it.
I expect that this will convince you that the inference score is neither a prediction nor a confidence score. However, I'm also hoping it will give you some ideas for future work.
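A minimal sketch of that binning experiment, on simulated data (`score` and `truth` here are stand-ins; with a real model you would use its output scores and the ground-truth labels). The simulated scores are perfectly calibrated by construction, so each bin's observed frequency should match its midpoint; a miscalibrated model would show systematic gaps:

```r
# Reliability check: bin predicted scores and compare each bin's
# observed frequency of "person" against the bin midpoint.
set.seed(3)
score <- runif(10000)
truth <- rbinom(10000, 1, prob = score)  # calibrated by construction

bins <- cut(score, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
tapply(truth, bins, mean)  # observed frequency per bin vs. 0.05, 0.15, ..., 0.95
```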

How to interpret a VAR model without significant coefficients?

I am trying to investigate the relationship between some Google Trends Data and Stock Prices.
I performed the augmented Dickey-Fuller (ADF) test and the KPSS test to make sure that both time series are integrated of the same order (I(1)).
However, after I took first differences, the ACF plot was completely insignificant (except at lag 0, of course), which told me that the differenced series behave like white noise.
Nevertheless, I tried to estimate a VAR model, which you can see attached.
As you can see, only one constant is significant. I have already read that because Stocks.ts.l1 is not significant in the equation for GoogleTrends, and GoogleTrends.ts.l1 is not significant in the equation for Stocks, there is no dynamic between the two time series, and both can also be modelled independently of each other with an AR(p) model.
I checked the residuals of the model. They fulfill the assumptions (normality of the residuals is not fully given, but acceptable; there is homoscedasticity; the model is stable; and there is no autocorrelation).
But what does it mean if no coefficient is significant, as in the case of the Stocks.ts equation? Is the model just inappropriate for the data because the data don't follow an AR process? Or is the model just so bad that a constant would describe the data better than the model? Or a combination of the above? Any suggestions on how I could proceed with my analysis?
Thanks in advance
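One way to make the white-noise impression from the ACF plot formal, before (or instead of) fitting a VAR, is a Ljung-Box test on each differenced series. A sketch with simulated stand-in data (`d_stocks` plays the role of the differenced stock series):

```r
# Ljung-Box test: H0 = no autocorrelation up to the chosen lag.
set.seed(5)
d_stocks <- rnorm(200)   # stand-in for diff(Stocks.ts)
Box.test(d_stocks, lag = 10, type = "Ljung-Box")
# A large p-value means no evidence of autocorrelation: a constant-plus-noise
# model (a random walk with drift for the levels) already describes the data,
# and insignificant VAR coefficients are the expected outcome, not a failure.
```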
