I have a set of numerical features that describe a phenomenon at different time points. To evaluate the individual performance of each feature, I perform a linear regression with leave-one-out validation, and I compute correlations and errors to evaluate the results.
So for a single feature, it would be something like:
Input: Feature F = {F_t1, F_t2, ... F_tn}
Input: Phenomenon P = {P_t1, P_t2, ... P_tn}
Linear regression of P on F, with leave-one-out validation.
Evaluation: compute correlations (linear/Pearson and Spearman) and errors (mean absolute error and root mean squared error).
For some of the variables, both correlations are really good (> 0.9), but when I take a look at the predictions, I realize that they are all very close to the average of the values to predict, so the errors are large.
How is that possible?
Is there a way to fix it?
As a technical detail, I use the Weka linear regression with the option "-S 1" in order to turn off feature selection.
It seems to be because the relationship you want to model is not linear while you are using a linear approach. In that case it is possible to get good correlations together with poor errors. It does not mean that the regression is wrong or worthless, but you have to be careful and investigate further.
In any case, a non-linear approach that minimizes the errors and maximizes the correlation is the way to go.
Moreover, outliers can also cause this problem.
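To see how high correlations and large errors can coexist, here is a minimal R sketch with made-up numbers: the predictions are shrunk towards the average of the target values, yet both correlations are perfect.
actual <- c(2, 4, 6, 8, 10, 12, 14)                    # values to predict (illustrative)
pred   <- mean(actual) + 0.1 * (actual - mean(actual)) # predictions hugging the average
cor(actual, pred)                                      # Pearson correlation: 1
cor(actual, pred, method = "spearman")                 # Spearman correlation: 1
mean(abs(actual - pred))                               # large mean absolute error
sqrt(mean((actual - pred)^2))                          # large root mean squared error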
I'm working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a data subset in which at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speak against using a particular method (such as logistic regression) on your data, you might want to study the nature of those "weird" observations before deciding which analysis method to use.
I would:
carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out residual analysis (Q-Q normal plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions. See whether the residuals are normally distributed: normality of the model residuals is the actual assumption in linear regression, not normality of each variable; of course you might have e.g. bimodally distributed data if there are differences between groups. Check whether there are observations that could be regarded as outliers, study them (see e.g. here), and if possible remove them from the final dataset before re-fitting the model without outliers. A minimal sketch of these checks follows this list.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
as suggested by GuedesBF, you may want to find a test or modelling method that has no normality assumption.
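As promised above, a minimal sketch of those residual checks in R, assuming a hypothetical data frame mydata with outcome y and predictors x1 and x2 (placeholder names, not the asker's variables):
fit <- lm(y ~ x1 + x2, data = mydata)  # or glm(..., family = binomial) for logistic regression
plot(fit, which = 1)  # Tukey-Anscombe plot: residuals vs. fitted values
plot(fit, which = 2)  # Q-Q normal plot of the residuals
plot(fit, which = 5)  # residuals vs. leverage, with Cook's distance contours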
Before modelling anything or removing any data, I would always plot the data by treatment/outcome group and inspect the pattern of missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good way to get an overview of all the data, their distributions and their pairwise correlations (as long as you don't have more than around 20 variables) is to use the psych library and pairs.panels:
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)  # read the raw data (no header row)
library(psych)
psych::pairs.panels(dat[, 1:12])   # histograms, scatter plots and correlations for variables 1-12
psych::pairs.panels(dat[, 13:23])  # same for variables 13-23
You can then quickly see the distribution of each variable and the correlations between each pair of variables. You can tune the arguments of that function to use different correlation methods and different displays. Happy exploratory data analysis :)
I'm fitting a multiple linear regression model. I used the bptest function (Breusch-Pagan test) to check for heteroscedasticity, and the result was significant at less than 0.05.
How can I resolve the issue of heteroscedasticity?
Try using a different type of linear regression (a minimal sketch follows this list):
Ordinary Least Squares (OLS) for homoscedasticity.
Weighted Least Squares (WLS) for heteroscedasticity without correlated errors.
Generalized Least Squares (GLS) for heteroscedasticity with correlated errors.
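A minimal R sketch of these options; the variables y, x and dat, as well as the weights and the correlation structure, are illustrative assumptions rather than anything implied by the question:
ols <- lm(y ~ x, data = dat)                     # OLS: assumes constant error variance
wls <- lm(y ~ x, data = dat, weights = 1 / x^2)  # WLS: illustrative inverse-variance weights

library(nlme)
gls_fit <- gls(y ~ x, data = dat,
               weights = varPower(form = ~ x),   # error variance modelled as a power of x
               correlation = corAR1())           # AR(1) correlation between successive errors
summary(gls_fit)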
Welcome to SO, Arun.
Personally, I don't think heteroskedasticity is something you "solve". Rather, it's something you need to allow for in your model.
You haven't given us any of your data, so let's assume that the variance of your residuals increases with the magnitude of your predictor. A typical, simplistic approach to handling this is to transform the data so that the variance becomes constant; one way of doing that might be to log-transform your data. That might give you a more constant variance, but it also transforms your model: on the original scale the errors now act multiplicatively rather than additively.
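A minimal sketch of that transformation route, assuming a hypothetical data frame dat with a positive response y and a predictor x:
fit_log <- lm(log(y) ~ x, data = dat)  # model the log of the response
plot(fit_log, which = 1)               # check whether the residual spread is now roughly constant
exp(predict(fit_log))                  # back-transformed predictions (multiplicative on the original scale)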
Alternatively, you might have two groups of observations that you want to compare with a t-test, but the variance in one group is larger than in the other. That's a different sort of heteroskedasticity. There are variants of the standard "pooled variance" t-test that can handle that.
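For example, Welch's t-test drops the equal-variance assumption; a minimal sketch with simulated data:
set.seed(1)
g1 <- rnorm(20, mean = 10, sd = 1)  # group with small variance
g2 <- rnorm(20, mean = 11, sd = 4)  # group with large variance
t.test(g1, g2, var.equal = FALSE)   # Welch's t-test (the default in R)
t.test(g1, g2, var.equal = TRUE)    # classical pooled-variance t-test, for comparison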
I realise this isn't an answer to your question in the conventional sense. I would have made it a comment, but I knew before I started that I'd need more words than a comment would let me have.
Our professor wanted us to run 10-fold cross-validation on a data set to get the lowest RMSE and to use the coefficients of that model to build a function that takes in parameters and returns a predicted "Fitness Factor" score, which ranges between 25 and 75.
He encouraged us to try transforming the data, so I did: I used scale() on the entire data set to standardize it and then ran my regression and 10-fold cross-validation. I then picked the model I wanted and copied the coefficients over. The problem is that my function's predictions are WAY off when I put unstandardized parameters into it to predict y.
Did I completely screw this up by standardizing the data to a mean of 0 and an sd of 1? Is there any way I can undo this mess if I did screw up?
My coefficients are extremely small numbers and I feel like I did something wrong here.
Build a proper pipeline, not just a hack with some R functions.
The problem is that you treat scaling as part of loading the data, not as part of the prediction process.
The proper protocol is as follows:
"Learn" the transformation parameters
Transform the training data
Train the model
Transform the new data
Predict the value
Inverse-transform the predicted value
During cross-validation these steps need to be run separately for each fold, or you may overestimate (overfit) your quality.
Standardization is a linear transform, so the inverse is trivial to find.
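A minimal R sketch of this protocol, assuming a training data frame train, a new data frame newdata, and placeholder column names y, x1, x2 (none of these come from the question):
# 1. "Learn" the transformation parameters from the training data only
ctr    <- colMeans(train[, c("x1", "x2")])
scl    <- apply(train[, c("x1", "x2")], 2, sd)
y_mean <- mean(train$y)
y_sd   <- sd(train$y)

# 2. Transform the training data
train_std <- train
train_std[, c("x1", "x2")] <- scale(train[, c("x1", "x2")], center = ctr, scale = scl)
train_std$y <- (train$y - y_mean) / y_sd

# 3. Train the model on the standardized data
fit <- lm(y ~ x1 + x2, data = train_std)

# 4. Transform the new data with the parameters learned from the training data
new_std <- newdata
new_std[, c("x1", "x2")] <- scale(newdata[, c("x1", "x2")], center = ctr, scale = scl)

# 5. Predict on the standardized scale
pred_std <- predict(fit, newdata = new_std)

# 6. Inverse-transform the prediction back to the original scale
pred <- pred_std * y_sd + y_mean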
My goal is to estimate two parameters of a model (see CE_hat).
I use 7 observations to fit two parameters: (w,a), so overfitting occurs a few times. One idea would be to restrict the influence of each observation so that outliers do not "hijack" the parameter estimates.
The method that was previously suggested to me was nlrob. The problem with that, however, is that extreme cases such as the example below return "Missing value or an infinity produced when evaluating the model".
To avoid this I used nlsLM, which works towards convergence, but at the cost of returning outlandish estimates.
Any ideas as to how I can use robust fitting with this example?
I include a reproducible example below. The observables here are CE, H and L. These three elements are fed into a function (CE_hat) in order to estimate "a" and "w". Values close to 1 for "a" and close to 0.5 for "w" are generally considered more reasonable. As you can hopefully see, when all observations are included, a = 91 while w is next to 0. However, if we exclude the 4th (or 7th) observation (of CE, H and L), we get much more sensible estimates. Ideally, I would like to achieve the same result without excluding these observations; one idea would be to restrict their influence. I understand that it might not be clear why these observations constitute some sort of "outliers". It's hard to say something about that without saying too much, I'm afraid, but I'm happy to go into more detail about the model should a question arise.
library("minpack.lm")
options("scipen"=50)
CE<-c(3.34375,6.6875,7.21875,13.375,14.03125,14.6875,12.03125)
H<-c(4,8,12,16,16,16,16)
L<-c(0,0,0,0,4,8,12)
CE_hat<-function(w,H,a,L){(w*(H^a-L^a)+L^a)^(1/a)}
aw<-nlsLM(CE~CE_hat(w,H,a,L),
start=list(w=0.5,a=1),
control = nls.lm.control(nprint=1,maxiter=100))
summary(aw)$parameters
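This is not full robust fitting, but one way to keep the estimates in a plausible region is to bound the parameters directly; nlsLM accepts lower and upper arguments. The bounds below are illustrative assumptions on my part, not values implied by the model:
aw_bounded <- nlsLM(CE ~ CE_hat(w, H, a, L),
                    start = list(w = 0.5, a = 1),
                    lower = c(w = 0.1, a = 0.5),   # illustrative lower bounds
                    upper = c(w = 1,   a = 2),     # illustrative upper bounds
                    control = nls.lm.control(nprint = 1, maxiter = 100))
summary(aw_bounded)$parameters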
I am fitting a standard multiple regression with the OLS method. I have 5 predictors (2 continuous and 3 categorical) plus 2 two-way interaction terms. I did regression diagnostics using a residuals vs. fitted plot. Heteroscedasticity is quite evident, which is also confirmed by bptest().
I don't know what to do next. First, my dependent variable is reasonably symmetric, so I don't think I need to try transformations of my DV. My continuous predictors are also not highly skewed. I want to use weights in lm(); however, how do I know what weights to use?
Is there a way to automatically generate weights for performing weighted least squares, or are there other ways to go about it?
One obvious way to deal with heteroscedasticity is to estimate heteroscedasticity-consistent standard errors. Most often these are referred to as robust or White standard errors.
You can obtain robust standard errors in R in several ways. The following page describes one possible and simple way to obtain robust standard errors in R:
https://economictheoryblog.com/2016/08/08/robust-standard-errors-in-r
However, sometimes there are more subtle and often more precise ways to deal with heteroscedasticity. For instance, you might encounter grouped data and find yourself in a situation where the error variance is heterogeneous across your dataset but homogeneous within groups (clusters). In this case you might want to apply clustered standard errors. See the following link on how to calculate clustered standard errors in R:
https://economictheoryblog.com/2016/12/13/clustered-standard-errors-in-r
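A minimal sketch of both kinds of standard errors in R, using the sandwich and lmtest packages; the model formula, the data frame dat and the grouping variable cluster_id are placeholders, not the asker's variables:
library(sandwich)
library(lmtest)

fit <- lm(y ~ x1 + x2, data = dat)

# Heteroscedasticity-consistent (robust / White) standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))

# Cluster-robust standard errors, clustering on a grouping variable in `dat`
coeftest(fit, vcov = vcovCL(fit, cluster = ~ cluster_id))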
What is your sample size? I would suggest that you make your standard errors robust to heteroskedasticity, but that you do not worry about heteroskedasticity otherwise. The reason is that with or without heteroskedasticity, your parameter estimates are unbiased (i.e. they are fine as they are). The only thing that is affected (in linear models!) is the variance-covariance matrix, i.e. the standard errors of your parameter estimates will be affected. Unless you only care about prediction, adjusting the standard errors to be robust to heteroskedasticity should be enough.
See e.g. here how to do this in R.
Btw, for your solution with weights (which is not what I would recommend), you may want to look into ?gls from the nlme package.
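If you do want concrete weights, one common heuristic (my suggestion, not something prescribed in the answers above) is feasible WLS: estimate the variance function from a first-stage OLS fit and use its inverse as weights; gls() from nlme can model the variance directly instead. Names below are placeholders:
# First-stage OLS fit (hypothetical formula and data frame)
ols <- lm(y ~ x1 + x2, data = dat)

# Model the log squared residuals on the fitted values to estimate the variance function
aux <- lm(log(resid(ols)^2) ~ fitted(ols))
w   <- 1 / exp(fitted(aux))  # inverse of the estimated variance

# Feasible weighted least squares
fwls <- lm(y ~ x1 + x2, data = dat, weights = w)
summary(fwls)

# Or let gls() estimate a variance function as part of the model
library(nlme)
fit_gls <- gls(y ~ x1 + x2, data = dat, weights = varPower(form = ~ fitted(.)))
summary(fit_gls)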