I am doing a personal project to see how well FIFA potential player stats predict the actual overall stat after 3 years.
Meaning, if a player has a potential of 85 in 2015, how accurately should I expect that to predict the player's overall score in 2018?
Should I use R2?
I also want to check if the histogram of errors (potential_2015 - overall_2018) has a normal distribution.
Do I need to use Chi-squared for this?
If the prediction is overestimating the player, I would like to know by how much.
Should I use the standard deviation of the errors?
If there is a list of key statistical tests I should look into, I would appreciate it if you could share it so I can research them and learn.
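For concreteness, here is a rough R sketch of the checks I have in mind (the data frame and column names are just made-up placeholders):

fit <- lm(overall_2018 ~ potential_2015, data = fifa)
summary(fit)$r.squared       # how much of the 2018 overall the 2015 potential explains

errors <- fifa$potential_2015 - fifa$overall_2018
hist(errors)                 # shape of the error distribution
shapiro.test(errors)         # one normality test; chi-squared goodness of fit is another option
mean(errors)                 # positive mean = potential tends to overestimate the 2018 overall
sd(errors)                   # typical size of the error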
Thank you,
Hello everyone, I am completely new to ML and trying to teach myself what is out there in the books, so apologies for my ignorance in advance.
Basically, I am trying to predict stock return values one period ahead based on a set of 15 predictors in a Random Forest Regression (using tidymodels in R, thank you kindly for your videos #Julia Silge :-)).
What bothers me is that the regression overestimates bad stocks and underestimates good stocks. I would like to just rotate this whole point cloud a few degrees counter-clockwise and my life would be easier. Is there an expert on random forest regressions with a trick up their sleeve for solving this?
Thank you in advance.
You are right to be concerned: the model basically returns the average for all stocks (the predictions lie on a flat horizontal line with a bit of noise). As you point out, this means that the model is biased: it underpredicts positive returns and overpredicts negative returns.
In short, this model predicts no returns and no losses (in the next period). This is boring but actually doesn't seem wrong.
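If you want to see this directly, a quick base-R plot of predictions against actuals makes the flat line obvious (I'm assuming a data frame results with the tidymodels-style columns truth and .pred; adjust the names to whatever your workflow produces):

plot(results$truth, results$.pred, xlab = "actual return", ylab = "predicted return")
abline(0, 1, lty = 2)   # perfect predictions would fall on this diagonal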
Since you are doing this to learn about machine learning, consider picking an "easier" problem. Stock returns are not very predictable in general.
I am trying to predict the Spotify popularity score using a range of machine learning algorithms in the R Caret package, including logistic regression. The aim is to predict track popularity based on audio features, e.g. danceability, energy, etc. The problem I have is that Spotify are not transparent about how the popularity score is calculated, but I know it is based on a number of things including play counts and how recent the track is. That means the number of days since release will have an impact on the popularity score, so I have included days_released as an independent variable in my modelling to try to control for it.
So, I have 50 variables (days_released being one of them). I am using the rfe function in Caret to perform feature selection but for every algorithm, days_released is the only variable selected. Does anyone have any advice or recommended reading on how to overcome this problem? I want to predict popularity and explore which track features have a significant relationship with popularity, controlling for days_released.
Do I take the days_released variable out altogether?
Do I leave it in but force rfe to select more than one feature?
Any help would be much appreciated! Thanks in advance!
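For reference, my rfe call looks roughly like this (the data frame and variable names are placeholders, and I swap in the helper functions for the other algorithms too):

library(caret)

ctrl <- rfeControl(functions = rfFuncs, method = "repeatedcv", number = 10, repeats = 3)
x <- tracks[, setdiff(names(tracks), "popularity")]   # the 50 candidate features, incl. days_released
y <- tracks$popularity
rfe_fit <- rfe(x, y, sizes = c(5, 10, 20, 30, 40), rfeControl = ctrl)
rfe_fit   # days_released comes out as the only selected variable every time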
This question may seem weird, so let me explain it with an example.
1. We train a classification model to determine whether an image contains a person or not.
2. After the model is trained, we use a new image for prediction.
3. The prediction result shows that there is a 94% probability that the image contains a person.
Could I then say that the confidence level is 94% that the image contains a person?
Your third item is not properly interpreted. The model returns a normalized score of 0.94 for the category "person". Although this score correlates relatively well with our cognitive notions of "probability" and "confidence", do not confuse it with either of those. It's a convenient metric with some overall useful properties, but it is not a prediction accurate to two decimal places.
Granted, there may well be models for which the model's prediction is an accurate figure. For instance, the RealOdds models you'll find on 538 are built and tested to that standard. However, that is a directed effort of more than a decade; your everyday deep learning model is not held to the same standard ... unless you work to tune it to that, making the accuracy of that number a part of your training (incorporate it into the error function).
You can run a simple (although voluminous) experiment: collect all of the predictions and bin them; say, a range of 0.1 for each of 10 bins. Now, if this "prediction" is, indeed, a probability, then your 0.6-0.7 bin should correctly identify a person about 65% of the time. Check that against ground truth: did that bin get 65% correct and 35% wrong? Is the discrepancy within expected ranges? Do this for each of the 10 bins and run your favorite applicable statistical measures on it.
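A rough sketch of that experiment in R, assuming a vector of model scores pred and a 0/1 ground-truth vector truth:

bins <- cut(pred, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
predicted <- tapply(pred, bins, mean)    # average score in each bin
observed  <- tapply(truth, bins, mean)   # fraction of actual persons in each bin
round(cbind(predicted, observed, n = as.vector(table(bins))), 3)
# if the score were a calibrated probability, predicted and observed would
# agree within sampling error in every bin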
I expect that this will convince you that the inference score is neither a prediction nor a confidence score. However, I'm also hoping it will give you some ideas for future work.
I'm trying to run a GLM in R for biomass data (reproductive biomass and the ratio of reproductive biomass to vegetative biomass) as a function of habitat type ("hab"), the year the data were collected ("year"), and the site of data collection ("site"). My data look like they would fit a Gamma distribution well, but I have 8 observations with zero biomass (out of ~800 observations), so the model won't run. What's the best way to deal with this? What would be another error distribution to use? Or would adding a very small value (such as .0000001) to my zero observations be viable?
My model is:
reproductive_biomass <- glm(repro.biomass ~ hab * year + site, data = biom, family = Gamma(link = "log"))
Ah, zeroes - gotta love them.
Depending on the system you're studying, I'd be tempted to check out zero-inflated or hurdle models - the basic idea is that there are two components to the model: a binomial process deciding whether the response is zero or nonzero, and then a gamma that works on the nonzeroes. The slick part is that you can then do inference on the coefficients of both components and even use different predictors in each.
This post walks through the gamma hurdle approach: http://seananderson.ca/2014/05/18/gamma-hurdle.html ... but a search for "zero-inflated gamma" or "Tweedie models" might also yield something informative and/or scholarly.
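A minimal sketch of the hurdle version of your model, following the two-part approach in that post and assuming the zeroes can be treated as a separate process:

# part 1: is there any reproductive biomass at all?
biom$any_repro <- as.numeric(biom$repro.biomass > 0)
occurrence <- glm(any_repro ~ hab * year + site, data = biom, family = binomial(link = "logit"))

# part 2: how much biomass, given that there is some?
positives <- subset(biom, repro.biomass > 0)
amount <- glm(repro.biomass ~ hab * year + site, data = positives, family = Gamma(link = "log"))

summary(occurrence)
summary(amount)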
In an ideal world, your analytic tool should fit your system and your intended inferences. The zero-inflated world is pretty sweet, but is conditional on the assumption of separate processes. Thus an important question to answer, of course, is what zeroes "mean" in the context of your study, and only you can answer that - whether they're numbers that just happened to be really really small, or true zeroes that are the result of some confounding process like your coworker spilling the bleach (or something otherwise uninteresting to your study), or else true zeroes that ARE interesting.
Another thought: ask the same question over on crossvalidated, and you'll probably get an even more statistically informed answer. Good luck!
I am working in machine learning and I am stuck on one thing.
I want to compare 4 machine learning techniques across 10 datasets. After performing the experiments, I got an Area Under the Curve (AUC) value for each. I then applied an analysis of variance (ANOVA) test, which shows there is a significant difference between the 4 machine learning techniques.
Now my problem is: which test will tell me that a particular algorithm performs better than the others? I want only one winner among the machine learning techniques.
A classifier's quality can be measured by the F-score, which combines the test's precision and recall into a single measure. Comparing these respective scores will give you a simple comparison.
However, if you want to test whether the difference between the classifiers' accuracies is significant, you can try a Bayesian comparison test or, if the classifiers are trained only once, McNemar's test.
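For two classifiers evaluated once on the same test set, McNemar's test in R is just a couple of lines (pred_a, pred_b and truth are assumed label vectors):

correct_a <- pred_a == truth
correct_b <- pred_b == truth
tab <- table(correct_a, correct_b)   # 2x2 agreement/disagreement table
mcnemar.test(tab)                    # significant if the two classifiers' error patterns differ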
There are other possibilities, and the papers "On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach" and "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" are probably worth reading.
If you are gathering performance metrics (ROC, accuracy, sensitivity, specificity, ...) from identically resampled data sets, then you can perform statistical tests using paired comparisons. Most statistical software implements Tukey's range test (ANOVA). https://en.wikipedia.org/wiki/Tukey%27s_range_test. A formal treatment of this material is here: http://epub.ub.uni-muenchen.de/4134/1/tr030.pdf. This is the test I like to use for the purpose you discuss, although there are others and people have varying opinions.
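In base R that might look something like the following, assuming a long-format data frame results with one AUC value per model per dataset:

fit <- aov(auc ~ model + dataset, data = results)   # dataset as a blocking factor makes the comparisons paired
summary(fit)
TukeyHSD(fit, which = "model")   # pairwise model differences with family-wise confidence intervals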
You will still have to choose how you resample your data: k-fold, repeated k-fold, bootstrap, leave-one-out, or repeated training/test splits. Bootstrap methods tend to give you the tightest confidence intervals after leave-one-out, but leave-one-out might not be an option if your data is huge.
That being said, you may also need to consider the problem domain. False positives may be an issue in classification, so you may need to consider other metrics to choose the best performer for the domain. AUC might not always be the best metric for a specific domain. For instance, a credit card company may not want to deny legitimate transactions to customers, so it needs a very low false-positive rate in fraud classification.
You may also want to consider implementation. If a logistic regression performs nearly as well, it may be a better choice than a more complicated random forest implementation. Are there legal implications to model use (Fair Credit Reporting Act, ...)?
A common sense approach is to begin with something like RF or gradient boosted trees to get an empirical sense of a performance ceiling. Then build simpler models and use the simpler model that performs reasonably well compared to the ceiling.
Or you could combine all your models using something like LASSO... or some other model.
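If you go that route, a rough sketch of the stacking idea with glmnet, assuming a matrix of out-of-fold predictions (one column per base model) and the true 0/1 outcome:

library(glmnet)

stack <- cv.glmnet(x = oof_preds, y = y, family = "binomial", alpha = 1)   # alpha = 1 gives the LASSO penalty
coef(stack, s = "lambda.min")   # which base models the ensemble keeps, and with what weights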