I want to build a model for rainfall prediction using data mining and machine learning. As one of the initial steps, I computed the correlation matrix for weather conditions including Rain Gauge, Average Temperature, Relative Humidity, Wind Speed and O3. I used the Spearman and Kendall correlation formulas and computed the correlation coefficients in RStudio.
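For reference, a minimal sketch of how such a correlation matrix can be computed in R; the data frame name weather and its column names are placeholders rather than the actual dataset:

# Spearman (or method = "kendall") correlation matrix, ignoring missing pairs
cor(weather[, c("RainGauge", "AvgTemp", "RelHumidity", "WindSpeed", "O3")],
    method = "spearman", use = "pairwise.complete.obs")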
The problem is that there is no significant relationship between Rain Gauge and the other variables, since all the correlation coefficients are close to zero.
I want to know whether there is any other way to find the variables most related to Rain Gauge, or whether it is OK to use these variables even though the correlation coefficients are close to zero.
Since this project is a new experience for me, please excuse any mistakes I have made.
I am relatively new to both R and Stack Overflow, so please bear with me. I am currently using GLMs to model ecological count data under a negative binomial distribution in brms. Here is my general model structure, which I have chosen based on fit, convergence, low LOOIC compared to other models, etc.:
My goal is to characterize population trends of study organisms over the study period. I have created marginal effects plots by using the model to predict on a new dataset where all covariates are constant except year (shaded areas are 80% and 95% credible intervals for posterior predicted means):
I am now hoping to extract trend magnitudes that I can report and compare across species (i.e. say a certain species declined or increased by x% (+/- y%) per year). Because I use poly() in the model, my understanding is that R uses orthogonal polynomials, and the resulting polynomial coefficients are not easily interpretable. I have tried generating raw polynomials (setting raw=TRUE in poly()), which I thought would produce the same fit and have directly interpretable coefficients. However, the resulting models don't really run (after 5 hours neither chain gets through even a single iteration, whereas the same model with raw=FALSE only takes a few minutes to run). Very simplified versions of the model (e.g. count ~ poly(year, 2, raw=TRUE)) do run, but take several orders of magnitude longer than setting raw=FALSE, and the resulting model also predicts different counts than the model with orthogonal polynomials. My questions are (1) what is going on here? and (2) more broadly, how can I feasibly extract the linear term of the quartic polynomial describing response to year, or otherwise get at a value corresponding to population trend?
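For what it's worth, here is a minimal sketch (using glm() and made-up data rather than brms, purely to illustrate) of the point that raw and orthogonal polynomials describe the same fitted curve, and of one way to turn model predictions into a per-year percent change:

# Made-up Poisson counts over a 21-year period
set.seed(1)
d <- data.frame(year = 2000:2020)
d$count <- rpois(nrow(d), lambda = exp(3 - 0.05 * (d$year - 2010)))

# Same model with orthogonal vs. raw polynomials (year centred to keep raw terms stable)
m_orth <- glm(count ~ poly(I(year - 2010), 2), family = poisson, data = d)
m_raw  <- glm(count ~ poly(I(year - 2010), 2, raw = TRUE), family = poisson, data = d)
all.equal(unname(fitted(m_orth)), unname(fitted(m_raw)))   # TRUE: identical fit, different coefficients

# One way to report a trend: year-over-year percent change along the fitted curve
pred <- predict(m_orth, newdata = d, type = "response")
round(100 * (pred[-1] / pred[-length(pred)] - 1), 2)

With a brms fit, the same idea applies per posterior draw, which also gives credible intervals on the percent change.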
I feel like this should be relatively simple, and I apologize if I'm overlooking something obvious. Please let me know if there is further code I should share for clarity; I didn't want to make the initial post overly long, but I'm happy to show specific predictions from different models or anything else. Thank you for any help.
I have a dataset containing information about weather, air pollution and health outcomes. I want to regress temperature (T) and temperature lag (T1) against cardiac deaths (CVD). I have previously used a glm model in R with the following script:
# for mean daily temperature and temperature lags separately
modelT <- glm(cvd ~ T, data = datapoisson, family = poisson(link = "log"), na.action = na.omit)
I get the effect estimates and standard errors, which I used to convert to risk ratios.
Now I want to use a dynamic linear model or distributed lag model to check the predictor-outcome and lagged predictor-outcome associations. However, I can't find a script for running such a model in R.
I installed the dlm package in R, but I still can't figure out how to build a model with it.
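In case it helps while you sort out the dlm side, here is a minimal sketch of one simple way to include a lagged predictor directly in the glm; it reuses the names from your script and assumes the data are daily and ordered by date. For a full distributed lag (non-linear) model, I believe the dlnm package's crossbasis()/crosspred() workflow is the usual route in this literature.

# Previous day's temperature as a separate column (assumes daily, date-ordered data)
datapoisson$T1 <- c(NA, head(datapoisson$T, -1))

model_lag <- glm(cvd ~ T + T1,
                 data = datapoisson,
                 family = poisson(link = "log"),
                 na.action = na.omit)
summary(model_lag)   # coefficients for same-day and lagged temperature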
I would appreciate it if someone could help with this.
Could you try least squares multiple regression to predict the outcome? I used that method when I tried to 'predict' which factors influenced power in a floating offshore wind turbine. It is good for correlating multiple parameters.
In the question linked below they fit a plane to a set of points, but it seems like a similar idea:
https://math.stackexchange.com/questions/99299/best-fitting-plane-given-a-set-of-points
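A minimal sketch of what that looks like in R, with a hypothetical data frame dat, outcome, and predictors x1 to x3:

# Least squares multiple regression: outcome against several predictors
fit <- lm(outcome ~ x1 + x2 + x3, data = dat)
summary(fit)    # coefficients, standard errors, R-squared
confint(fit)    # 95% confidence intervals for the coefficients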
First post!
I'm a biologist with a limited background in applied statistics and R. Basically, I know enough to be dangerous, so I'd appreciate it if someone could confirm or deny that I'm on the right path.
My dataset consists of count data (wildlife visits to water wells) as a response variable and multiple continuous predictor variables (environmental measurements).
First, I eliminated multicollinearity by dropping a few predictor variables. Second, I investigated the distribution of the response variable. Initially, it looked Poisson. However, a Poisson exact test came back significant, and the variance of the response variable was around 200 with a mean around 9, i.e., overdispersed. Because of this, I decided to move forward with negative binomial and quasi-Poisson regressions. Both selected the same model, whose residuals are normally distributed. Further, a plot of residuals against predicted values is unbiased and homoscedastic.
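For concreteness, a minimal sketch of those checks with a hypothetical data frame wells (visits = count response, x1 and x2 = environmental predictors):

library(MASS)

# Quick overdispersion check: Pearson chi-square / residual df well above 1
m_pois <- glm(visits ~ x1 + x2, family = poisson, data = wells)
sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)

# Negative binomial and quasi-Poisson alternatives
m_nb <- glm.nb(visits ~ x1 + x2, data = wells)
m_qp <- glm(visits ~ x1 + x2, family = quasipoisson, data = wells)

# Residual diagnostics for the negative binomial fit
plot(fitted(m_nb), residuals(m_nb, type = "pearson"))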
Questions:
1. Have I selected the correct regressions to model this data?
2. Are there additional assumptions of the NBR and QpR that I need to test? How should I do that, and where can I learn more about it?
3. Did I check for overdispersion correctly? Is there a difference in comparing the mean and variance vs comparing the conditional mean and variance of the response variable?
4. While the NBR and QpR selected the same model, is there a way to decide which is the "better" approach?
5. I would like to eventually publish. Are there more analyses I should perform on my selected model?
I am using R to do some evaluations of two different forecasting models. The basic idea of the evaluation is to compare the Pearson correlation and its corresponding p-value using the cor() function. The graph below shows the final result of the correlation coefficient and its p-value.
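For reference, a minimal sketch of getting each correlation coefficient together with its p-value; obs, pred_m1 and pred_m2 are hypothetical vectors of observed values and the two models' forecasts:

ct1 <- cor.test(obs, pred_m1, method = "pearson")
ct2 <- cor.test(obs, pred_m2, method = "pearson")
c(model1 = ct1$estimate, model2 = ct2$estimate)   # correlation coefficients
c(model1 = ct1$p.value,  model2 = ct2$p.value)    # corresponding p-values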
Our suggestion is that the model with the lower correlation coefficient and a correspondingly lower p-value (less than 0.05) is better (or, alternatively, the one with a higher correlation coefficient but a fairly high corresponding p-value).
So, in this case, overall, we would say that model 1 is better than model 2.
But the question here is: is there any other specific statistical method to quantify the comparison?
Thanks a lot!
I'm assuming you're working with time series data, since you called out a "forecast". I think what you're really looking for is backtesting of your forecast model. From Ruey S. Tsay's "An Introduction to Analysis of Financial Data with R", you might want to take a look at his backtest.R function.
backtest(m1, rt, orig, h, xre = NULL, fixed = NULL, inc.mean = TRUE)
# m1:       a time series model object
# rt:       the time series
# orig:     the starting forecast origin
# h:        forecast horizon
# xre:      the independent variables
# fixed:    parameter constraint
# inc.mean: flag for the constant term of the model
Backtesting lets you see how well your models perform on past data, and Tsay's backtest.R provides RMSE and mean absolute error, which give you another perspective beyond correlation. A word of caution: depending on the size of your data and the complexity of your model, this can be a very slow-running test.
To compare models you'll normally look at RMSE, which is essentially the standard deviation of your model's errors. The two models' RMSEs are directly comparable, and smaller is better.
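A minimal sketch of both measures, assuming vectors obs and pred of observed values and forecasts:

rmse <- sqrt(mean((obs - pred)^2))   # root mean squared error
mae  <- mean(abs(obs - pred))        # mean absolute error
c(RMSE = rmse, MAE = mae)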
An even better alternative is to set up training, testing, and validation sets before you build your models. If you train two models on the same training/test data, you can compare them against your validation set (which has never been seen by either model) to get a more accurate picture of each model's performance.
One final alternative: if you have a "cost" associated with an inaccurate forecast, apply those costs to your predictions and add them up. If one model performs poorly on a more expensive segment of the data, you may want to avoid using it.
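For example, a minimal sketch with a hypothetical per-observation cost vector (cost, obs, pred_m1 and pred_m2 are placeholders):

# Total cost of each model's forecast errors, weighted by how expensive an
# error is at each observation
total_cost_m1 <- sum(cost * abs(obs - pred_m1))
total_cost_m2 <- sum(cost * abs(obs - pred_m2))
c(model1 = total_cost_m1, model2 = total_cost_m2)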
As a side note, your interpretation of a p-value as "lower is better" isn't quite right.
P-values address only one question: how likely are your data, assuming the null hypothesis is true? They do not measure support for the alternative hypothesis.
I use "vars" R package to do a multivariate time series analysis. The thing is when I conduct a bivariate VAR, the result of serial.test() give always a really low p-value, so we reject H0 and the residuals are correlated. The right thing to do is to increase the order of the VAR but even with a very high order (p=20 or even more) my residuals are still correlated.
How is it possible ?
I can't really give you a reproducible code because i don't know how to reproduce a VAR with residuals always correlated. For me it's a really unusual situation, but if someone know how it's possible it would be great.
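For reference, here is a minimal sketch of that workflow with simulated data standing in for the real series (so not a reproduction of the problem, just the calls involved):

library(vars)

set.seed(1)
y <- data.frame(a = arima.sim(list(ar = 0.5), 200),
                b = arima.sim(list(ar = 0.3), 200))

fit <- VAR(y, p = 2, type = "const")
serial.test(fit, lags.pt = 16, type = "PT.asymptotic")   # Portmanteau test on the residuals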
This is probably a better question for Cross Validated, as it doesn't contain any R code or a reproducible example, but you're probably going to need to do more digging than "I have a low p-value". Have you tested your data for normality? Also, to say
The right thing to do is to increase the order of the VAR
is very inaccurate. What type of data are you working with that you would set a lag order as high as 20? A typical value for yearly data is 1, for quarterly data 4, and for monthly data 12. You can't just keep throwing higher and higher orders at your problem and expect it to fix issues in the underlying data.
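Rather than pushing the order up by hand, a minimal sketch of letting information criteria suggest the lag order (y stands for your multivariate series):

library(vars)
VARselect(y, lag.max = 12, type = "const")$selection   # lag chosen by AIC, HQ, SC and FPE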
Assuming you have an optimal lag value, your data are normally distributed, and you still have a low p-value, there are several ways to go.
Minor cases of positive serial correlation (say, lag-1 residual autocorrelation in the range 0.2 to 0.4, or a Durbin-Watson statistic between 1.2 and 1.6) indicate that there is some room for fine-tuning in the model. Consider adding lags of the dependent variable and/or lags of some of the independent variables. Or, if you have an ARIMA+regressor procedure available in your statistical software, try adding an AR(1) or MA(1) term to the regression model. An AR(1) term adds a lag of the dependent variable to the forecasting equation, whereas an MA(1) term adds a lag of the forecast error. If there is significant correlation at lag 2, then a 2nd-order lag may be appropriate.
If there is significant negative correlation in the residuals (lag-1 autocorrelation more negative than -0.3 or DW stat greater than 2.6), watch out for the possibility that you may have overdifferenced some of your variables. Differencing tends to drive autocorrelations in the negative direction, and too much differencing may lead to artificial patterns of negative correlation that lagged variables cannot correct for.
If there is significant correlation at the seasonal period (e.g. at lag 4 for quarterly data or lag 12 for monthly data), this indicates that seasonality has not been properly accounted for in the model. Seasonality can be handled in a regression model in one of the following ways: (i) seasonally adjust the variables (if they are not already seasonally adjusted), or (ii) use seasonal lags and/or seasonally differenced variables (caution: be careful not to overdifference!), or (iii) add seasonal dummy variables to the model (i.e., indicator variables for different seasons of the year, such as MONTH=1 or QUARTER=2, etc.) The dummy-variable approach enables additive seasonal adjustment to be performed as part of the regression model: a different additive constant can be estimated for each season of the year. If the dependent variable has been logged, the seasonal adjustment is multiplicative. (Something else to watch out for: it is possible that although your dependent variable is already seasonally adjusted, some of your independent variables may not be, causing their seasonal patterns to leak into the forecasts.)
Major cases of serial correlation (a Durbin-Watson statistic well below 1.0, autocorrelations well above 0.5) usually indicate a fundamental structural problem in the model. You may wish to reconsider the transformations (if any) that have been applied to the dependent and independent variables. It may help to stationarize all variables through appropriate combinations of differencing, logging, and/or deflating.
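As a rough single-equation illustration of the first remedy (with hypothetical y and a regressor matrix X, not the VAR itself), you can check the residual autocorrelation and then refit with an AR(1) error term:

library(lmtest)

fit <- lm(y ~ X)
dwtest(fit)             # Durbin-Watson statistic for lag-1 residual autocorrelation
acf(residuals(fit))     # residual autocorrelations at longer lags

# ARIMA + regressors: the same regressors with an AR(1) error term
arima(y, order = c(1, 0, 0), xreg = X)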