I am trying to compare model accuracy across several different measurement metrics. For example, some citations report accuracy while others report error. That one is rather obvious, but there are lots of different metrics and I am not entirely sure how to compare some of them without losing the integrity of the individual metrics, or whether some of them can be compared at all. The list I have is:
- Error Rate
- Mean Absolute Error
- Absolute Error
- Log-Loss
- Classification Accuracy
- Root Mean Squared Error
- Classification Error
- F-Measure
- Area Under Curve
- Mean Test Error
- Error Percentage
- Misclassification Error
- Test Error
So my question is how to convert between these effectively, or, if no direct conversion is possible, how to compare and rank them in a meaningful and accurate way.
You usually cannot convert between these metrics; they measure subtly different things. For example, linear (absolute) error is not the same as squared error.
Winning on one metric does not mean winning on a different metric. Suppose we want to summarize univariate data into a single number: the mean minimizes squared error, the median minimizes absolute error - so they have different optimal solutions, and depending on your evaluation measure, you may get different winners.
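To make that concrete, here is a minimal R sketch (the data vector is invented) showing that the value minimizing squared error is the mean while the value minimizing absolute error is the median, so different metrics can crown different winners:

    # Invented sample with an outlier: the "best" single-number summary
    # depends on which error metric you minimize.
    x <- c(1, 2, 2, 3, 50)

    sse <- function(c) sum((x - c)^2)    # squared error
    sae <- function(c) sum(abs(x - c))   # absolute (linear) error

    candidates <- seq(0, 60, by = 0.01)
    candidates[which.min(sapply(candidates, sse))]   # 11.6 -> the mean
    candidates[which.min(sapply(candidates, sae))]   # 2    -> the median

    mean(x)     # 11.6
    median(x)   # 2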
Don't compare numbers across different articles. They will have used different preprocessing, features, feature selection, normalization, subsets, cross-validation splits, and so on.
Usually, comparing such numbers will not work.
You will have to re-run their experiments yourself, with exactly the same input and evaluation.
I tried several ways of selecting predictors for a logistic regression in R. I used lasso logistic regression to get rid of irrelevant features, cutting their number from 60 to 24, then I used those 24 variables in my stepAIC logistic regression, after which I further dropped one variable with a p-value of approximately 0.1. What other feature selection methods can I, or even should I, use? I tried to look for ANOVA correlation coefficient analysis, but I didn't find any examples for R. And I think I cannot use a correlation heatmap in this situation since my output is categorical? I have seen some instances recommending lasso and stepAIC, and other instances criticising them, but I didn't find any definitive, comprehensive alternative, which left me confused.
Given the methodological nature of your question you might also get a more detailed answer at Cross Validated: https://stats.stackexchange.com/
From the information provided, 23-24 independent variables seems quite a number to me. If you do not have a large sample, remember that overfitting might be an issue (i.e. a low cases-to-variables ratio). Indications of overfitting are large parameter estimates and standard errors, or failure of convergence, for instance. You have obviously already used stepwise variable selection via stepAIC, which would also have been my first try if I had chosen to let the model do the variable selection.
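For reference, a minimal sketch of that stepwise step with MASS::stepAIC on a logistic model (my_data and outcome are placeholder names for your own data frame and dependent variable):

    # Stepwise selection by AIC, starting from the full logistic model.
    library(MASS)

    full_model <- glm(outcome ~ ., data = my_data, family = binomial)
    step_model <- stepAIC(full_model, direction = "both", trace = FALSE)
    summary(step_model)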
If you spot any issues with standard errors or parameter estimates, further options down the road might be to collapse categories of independent variables, or to check whether there is any evidence of multicollinearity, which could also result in deleting highly correlated variables and narrowing down the number of remaining features.
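One common multicollinearity check is the variance inflation factor; a sketch with the car package, applied to the fitted model from the previous sketch (rules of thumb often flag VIFs above 5 or 10; with multi-level factors, vif() reports GVIFs instead):

    # Variance inflation factors for the model selected above.
    library(car)
    vif(step_model)   # values well above 5-10 suggest collinear predictors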
Apart from a strictly mathematical approach you might also want to identify features that are likely to be related to your outcome of interest according to your underlying content hypothesis and your previous experience, meaning to look at the model from your point of view as expert in your field of interest.
If sample size is not an issue and the point is reducing the number of features, you may consider running a principal component analysis (PCA) to find out about highly correlated features and doing the regression with the principal components instead, which are uncorrelated linear combinations of your "old" features. The factoextra package, together with the base prcomp or princomp functions, can be used for PCA: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
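A minimal sketch of that route, assuming the predictors are numeric (my_data and outcome are again placeholders); how many components to keep is a judgment call based on the variance explained:

    # PCA on the predictors, then logistic regression on the leading components.
    predictors <- my_data[, setdiff(names(my_data), "outcome")]

    pca <- prcomp(predictors, center = TRUE, scale. = TRUE)
    summary(pca)                              # variance explained per component

    scores <- as.data.frame(pca$x[, 1:5])     # keep e.g. the first 5 components
    scores$outcome <- my_data$outcome
    pc_fit <- glm(outcome ~ ., data = scores, family = binomial)
    summary(pc_fit)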
I have been following the latest DESeq2 pipeline to perform an RNA-seq analysis. My problem is that the RIN of the experimental samples is quite low compared to the control ones. I read a paper in which they performed an RNA-seq analysis with time-course RNA degradation and concluded that including the RIN value as a covariate can mitigate some of the effects of low RIN in samples.
My question is how I should construct the design in the DESeq2 object:
~conditions+rin
~conditions*rin
~conditions:rin
none of them... :)
I cannot find proper resources that explain how to construct these models (I am new to the field...) and I recognise I have crashed against a wall with these kinds of things. I would also appreciate some links to good resources so I can understand which one is correct and why.
Thank you so much
This turned out to be too long to type in a comment.
It depends on your data.
First of all, counts ~conditions:rin does not make sense in your case, because conditions is categorical: you cannot fit a model with only an interaction term.
I would go with counts ~condition + rin. This assumes there is a condition effect and a linear effect from rin, and that the counts' dependence on rin does not differ between conditions.
As you mentioned, rin in one of the conditions is quite low, but is there any reason to suspect that the relationship between rin and counts differs between the two conditions? If you fit counts ~condition * rin, you are assuming a condition effect and a rin effect that differs between conditions, meaning a different slope for the rin effect if you plot counts vs rin. You would need to take a few genes out and see whether this holds. Also, to fit this model you need quite a lot of samples to estimate the effects accurately. Check whether both of these hold.
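For what it's worth, a minimal sketch of the additive design in DESeq2 (count_matrix, sample_info and the condition level names are placeholders for your own objects; the DESeq2 vignette recommends putting the variable of interest last in the design formula):

    # Additive design: a linear rin effect plus a condition effect.
    library(DESeq2)

    dds <- DESeqDataSetFromMatrix(countData = count_matrix,
                                  colData   = sample_info,   # must contain condition and (numeric) rin
                                  design    = ~ rin + condition)
    dds <- DESeq(dds)
    res <- results(dds, contrast = c("condition", "treated", "control"))  # adjust to your level names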
I have 50 predictors and 1 target variable. All my predictors and target variable are only binary numbers 0s and 1s. I am performing my analysis using R.
I will be implementing four algorithms.
1. RF
2. Log Reg
3. SVM
4. LDA
I have the following questions:
I converted them all into factors. How should I treat my variables beforehand, before feeding them into the other algorithms?
I used the caret package to train my model, and it takes a lot of time. I do practice ML regularly, but I don't know how to proceed with all variables being binary.
How to remove collinear variables?
I am mostly a Python user rather than an R user, but I bet there is a common approach:
1. Check your columns. Remove a column if the number of zeroes or ones is more than 95% of the total (i.e. the minority value occurs in under 5% of rows; you can try 2.5% or even 1% later).
2. Run a simple Random Forest with default settings and get the feature importances. Columns that turn out to be unnecessary you can process with LDA. (A sketch of steps 1 and 2 in R follows this list.)
3. Check the target column. If it's highly unbalanced, try oversampling or downsampling, or use classification methods that can handle an unbalanced target column (like XGBoost).
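Here is a hedged R translation of steps 1 and 2 (df and target are placeholder names; it assumes the 0/1 predictors are kept numeric for this step):

    # Step 1: drop near-constant columns; step 2: rank the rest by RF importance.
    library(caret)
    library(randomForest)

    predictors <- df[, setdiff(names(df), "target")]

    # Step 1: most frequent value in more than ~95% of rows -> drop the column.
    nzv <- nearZeroVar(predictors, freqCut = 95/5)
    if (length(nzv) > 0) predictors <- predictors[, -nzv]

    # Step 2: default random forest, then inspect variable importance.
    rf <- randomForest(x = predictors, y = as.factor(df$target))
    importance(rf)      # mean decrease in Gini per variable
    varImpPlot(rf)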
For the regression model you'll need to calculate a correlation matrix and remove correlated columns; the other methods can live without it.
Please check whether your SVM (or SVC) implementation supports all features being boolean; it usually works very well for binary classification.
I would also advise trying a neural network.
PS About collinear variables: I wrote some code in Python for my project. It's simple - you can do it like this:
- plot the correlation matrix
- find pairs whose correlation is above some threshold
- remove the column that has the lower correlation with the target variable (you can also check that the column you want to remove is not important; otherwise try the other way around, or perhaps combine the columns)
In my code I ran this algorithm iteratively for different thresholds, from 0.99 down to 0.9. It works well.
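If you want the same idea in R rather than Python, a minimal sketch with caret::findCorrelation (note it drops the variable with the larger mean absolute correlation, so it approximates rather than exactly reproduces the target-aware rule above; predictors is the placeholder numeric 0/1 data from the earlier sketch):

    # Correlation-based pruning of near-duplicate predictors.
    library(caret)

    cor_mat  <- cor(predictors)
    high_cor <- findCorrelation(cor_mat, cutoff = 0.9)   # try 0.99 down to 0.9, as above
    if (length(high_cor) > 0) predictors <- predictors[, -high_cor]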
This question is more about statistics than R programming, though as I am a beginning user of R, I would especially appreciate any thoughts in the context of R; thanks for considering it:
The outcome variable in one of our linear models (lm) is waist circumference, which is missing in about 20% of our dataset. Last year a model was published which reliably estimates waist circumference from BMI, age, and gender (all of which we do have). I'd like to use this model to impute the missing waist circumferences in our data, but I want to make sure I incorporate the known error in that estimation model. The standard error of the intercept and of each coefficient has been reported.
Could you suggest how I might go about responsibly imputing (or perhaps a better word is estimating) the missing waist circumferences and evaluating any effect on my own waist circumference prediction models?
Thanks again for any coding strategy.
As Frank has indicated, this question has a strong stats flavor to it. But one possible solution does indeed entail some sophisticated programming, so perhaps it's legitimate to put it in an R thread.
In order to "incorporate the known error in that estimation", one standard approach is multiple imputation, and if you want to go this route, R is a good way to do it. It's a little involved, so you'll have to work out the specifics of the code for yourself, but if you understand the basic strategy it's relatively straightforward.
The basic idea is that for every subject in your dataset you impute the waist circumference by first using the published model and the BMI, age, and gender to determine the expected value, and then you add some simulated random noise to that; you'll have to read through the publication to determine the numerical value of that noise. Once you've filled in every missing value, you perform whatever statistical computation you want to run and save the standard errors.
Now, you create a second dataset, derived from your original dataset with missing values, once again using the published model to impute the expected values, along with some random noise - since the noise is random, the imputed values for this dataset should be different from the imputed values for the first dataset. Do your statistical computation again and save the standard errors, which will be a little different from those of the first imputed dataset, since the imputed values contain random noise. Repeat this a number of times. Finally, average the saved standard errors, and this will give you an estimate of the standard error that incorporates the uncertainty due to the imputation.
What you're doing is essentially a two-level simulation: on a low level, for each iteration you are using the published model to create a simulated dataset with noisy imputed values for missing data, which then gives you a simulated standard error, and then on a high level you repeat the process to obtain a sample of such simulated standard errors, which you then average to get your overall estimate.
This is a pain to do in traditional stats packages such as SAS or Stata, although it IS possible, but it's much easier to do in R because it's based on a proper programming language. So, yes, your question is properly speaking a stats question, but the best solution is probably R-specific.
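For concreteness, here is a hedged R sketch of that two-level loop. Every name in it (the published coefficients b0, b_bmi, b_age, b_male, the residual SD resid_sd, the data frame dat and its columns) is a placeholder for your own numbers. It also shows Rubin's rules, the standard way to pool multiply imputed results, which adds the between-imputation variance of the estimates rather than only averaging the standard errors:

    # Multiple imputation "by hand" using the published prediction model.
    set.seed(1)
    m <- 20                                   # number of imputed datasets
    est <- se <- numeric(m)

    for (i in seq_len(m)) {
      imp  <- dat
      miss <- is.na(imp$waist)
      # Expected waist circumference from the published model, plus random noise.
      # (Optionally, also draw b0, b_bmi, b_age, b_male from normal distributions
      #  with their reported standard errors at each iteration, so the uncertainty
      #  in the published coefficients propagates as well.)
      mu <- b0 + b_bmi * imp$bmi[miss] + b_age * imp$age[miss] + b_male * imp$male[miss]
      imp$waist[miss] <- mu + rnorm(sum(miss), mean = 0, sd = resid_sd)

      fit    <- lm(outcome ~ waist + age + male, data = imp)   # your analysis model
      est[i] <- coef(fit)["waist"]
      se[i]  <- summary(fit)$coefficients["waist", "Std. Error"]
    }

    mean(se)                                  # simple average, as described above

    # Rubin's rules: combine within- and between-imputation variance.
    W <- mean(se^2)
    B <- var(est)
    sqrt(W + (1 + 1/m) * B)                   # pooled standard error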
I don't know anything about statistics, and it was difficult for me to find a way to describe my question clearly.
I am doing some initial research on a system that will measure the uniformity of electricity across a conductor. Basically we need to measure how evenly a signal is spread out on a surface.
I was doing research on how to determine uniformity of a data set and came across this question which is promising. However I realized that I don't know what unit to use to express uniformity. For example, if I take 100 equally spaced measurements in a grid pattern on the surface of an object and want to describe how uniform the values are, how would you say it?
"98% uniform?" - what does that mean? 98% of what?
"The signal is very evenly dispersed" - OK, great... but there must be a more specific or scientific way to communicate that... how "evenly"? What is a numeric representation of that statement?
Statistics and math are not my thing so if this seems like a dumb question, be gentle...
You are looking for the Variance. From Wikipedia:
In probability theory and statistics, the variance is a measure of how far a set of numbers are spread out from each other. It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean
Recipe for calculating the Variance:
1) Calculate the Mean of your dataset
2) For each point, calculate (X - Mean)^2
3) Add up all those (X - Mean)^2
4) Divide the sum by the number of points
5) That is it
The Variance gives you an idea of how "equal" your points are. A Variance of zero means all points are equal, and the value increases as the points spread out.
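In R, the recipe looks like this (x stands in for your 100 grid readings; note that R's built-in var() divides by n - 1 rather than n):

    # Variance of a set of measurements, following the recipe above.
    x <- rnorm(100, mean = 230, sd = 2)    # placeholder readings

    n <- length(x)
    pop_var <- sum((x - mean(x))^2) / n    # steps 1-4 (population variance)
    pop_var

    var(x)    # R's sample variance (divides by n - 1)
    sd(x)     # standard deviation, in the same units as the measurements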
Edit
Here you may find better algorithms (more numerically stable) for calculating the variance.
One has to first define "uniformity". Does it mean lack of variance in the data? Or does it also mean other things like lack of average change across a surface or over time?
If it's simply lack of variance in data, then the variance method already described is the ticket.
If you are also concerned about average "shift" in measurement across the surface, you could do a linear (or in this case a "cylindrical" or "planar") fit of the data to determine whether there's a general trend up or down in the data in either of two dimensions. (If the conductor is cylindrical, then radially and axially. If it's planar, then x/y.)
These three parameters, then, would give a reasonable uniformity measure by the above definition: overall variance (that belisarius described), and "flatness" in each of two dimensions.
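A hedged sketch of that combined check in R: fit a plane measurement ~ x + y over the grid and look at the slopes, then take the variance of what is left (x, y and z are placeholders for the grid coordinates and readings):

    # Planar-trend ("flatness") check plus residual variance.
    grid <- expand.grid(x = 1:10, y = 1:10)      # 10 x 10 measurement grid
    grid$z <- rnorm(100, mean = 230, sd = 2)     # placeholder readings

    fit <- lm(z ~ x + y, data = grid)
    coef(fit)[c("x", "y")]     # drift per grid step in each direction
    var(residuals(fit))        # spread remaining after removing the planar trend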