Predict the injury time of a football match? - r

I have a project which requires me to predict the injury time of a football match.
I have the relevant information, such as goals, corners, referee, 2 team names and the injury time for each half.
I tried Poisson regression, but I am unsure how to include the referee as a factor in my model, because different referees were involved in very different numbers of matches: say Tom was involved in 200 games while Jerry was in only 30.
I added the factor "referee" to the model, and the summary showed that only some of the referees have a significant effect on the results. Is it correct to add the referee to the model directly, and are there any other methods I can use?
Thanks!
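One way to judge the referee term as a whole, rather than level by level from the summary table, is a likelihood-ratio test between the models with and without it. Below is a minimal sketch on simulated data. The column names (`injury_time` in whole minutes, `goals`, `corners`, `referee`), the third referee and all effect sizes are assumptions made up for illustration, not values from the question.

```r
set.seed(1)

# Simulated matches with deliberately unbalanced referee workloads,
# mimicking "Tom: 200 games, Jerry: 30 games". "Anna" is invented.
referee <- factor(rep(c("Tom", "Jerry", "Anna"), times = c(200, 30, 80)))
n <- length(referee)
goals   <- rpois(n, 2.5)
corners <- rpois(n, 10)

# Hypothetical referee effects on the log scale (an assumption of this sketch).
ref_effect <- c(Anna = 0.00, Jerry = 0.25, Tom = -0.10)
mu <- exp(0.8 + 0.10 * goals + 0.01 * corners + ref_effect[as.character(referee)])
matches <- data.frame(injury_time = rpois(n, mu), goals, corners, referee)

# Poisson regression without and with the referee factor.
fit0 <- glm(injury_time ~ goals + corners, family = poisson, data = matches)
fit1 <- glm(injury_time ~ goals + corners + referee, family = poisson, data = matches)

# Likelihood-ratio test of the referee factor as a whole.
lrt <- anova(fit0, fit1, test = "Chisq")
print(lrt)
```

Referees with few matches simply get wider standard errors in `summary(fit1)`, which is one reason individual levels can look non-significant. If the number of referees is large, a random intercept per referee (for example `glmer(... + (1 | referee), family = poisson)` from the lme4 package) is a commonly used alternative worth considering.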

Related

Is a hurdle model adapted to analyse data such as the number of trials?

I am analysing data about song recording of birds. Birds had several recording trials to sing, some of them sang during the first or second trial, some needed more than 10 trials, some never sang even after 15 trials or more. Birds that sang were not recorded again. My data contains a binary variable (did or did not sing), the number of trials until singing or until we definitively stopped recording, and the amount of song phrases that were produced.
I have 4 groups of birds with different temperature treatments, and I am trying to see whether those treatments affect the propensity to sing. I first focused on the binary variable, but my colleagues suggested also including the number of trials (how hard it was to get them to sing) and the number of phrases produced (amount of singing behaviour).
They suggested using a hurdle model: first, did the bird sing or not, and then, if it did, how much. I liked this idea very much, but it doesn't take into account the number of trials before singing. I don't really know how to analyse these 3 variables, so I'm asking for advice and help.
I tried:
to include the number of trials as a covariate, but birds in some treatment groups needed significantly more trials to sing than birds in other groups, and I'm afraid it overlaps with the effect of the treatment in the model
to use the number of trials as the dependent variable, but it seems to me that a hurdle model wouldn't be the most adequate method to analyse this type of data. I see the number of trials more like a succession of opportunities for the bird to sing or not than one observation at a given point, contrary to the number of phrases the bird sang during a given recording.
I have very little experience with hurdle models and other zero-inflated models, so I have reached an impasse and I would really appreciate your opinion. Thanks in advance!
After I asked some collaborators, one of them suggested a much better way to analyse this type of data.
I was trying to apply a zero-inflated or zero-altered method, but my data is actually right-censored, so I used a survival analysis. I briefly explain it here in case someone has the same problem as I did:
Survival analysis is used to analyse the time until an event occurs (in health studies, survival within 5 years, for instance). Some individuals are censored because the event didn't happen within the period that we study.
I have exactly this type of data: I analyse whether a bird sang or not (event), and how many trials it needed to sing (time), but some birds didn't sing within the time I dedicated to recordings, and those individuals are censored because I don't know how many trials they would have needed to sing.
I hope this helps other people who are struggling, as I was, with this kind of data; it is not always easy to find an appropriate analysis.
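The survival setup described above can be sketched as follows, using the survival package that ships with standard R installations. The data are simulated: the treatment labels, the per-trial singing probabilities and the fixed cap of 15 trials are assumptions for illustration (in the real data the stopping point varied between birds).

```r
library(survival)  # recommended package shipped with standard R installations

set.seed(42)
n_per_group <- 30
birds <- data.frame(
  treatment = factor(rep(c("T1", "T2", "T3", "T4"), each = n_per_group))
)

# Hypothetical per-trial singing probabilities for each treatment.
p_sing <- c(T1 = 0.40, T2 = 0.25, T3 = 0.15, T4 = 0.08)
# Trial on which each bird would first sing (geometric waiting time, starting at 1).
true_trial <- rgeom(nrow(birds), prob = p_sing[as.character(birds$treatment)]) + 1

max_trials <- 15                                      # recording stops here (assumed)
birds$trials <- pmin(true_trial, max_trials)          # observed number of trials
birds$sang   <- as.integer(true_trial <= max_trials)  # 1 = sang, 0 = right-censored

# Kaplan-Meier estimate: share of birds still silent after each trial, per group.
km <- survfit(Surv(trials, sang) ~ treatment, data = birds)
print(km)

# Log-rank test for any difference between treatments.
lr <- survdiff(Surv(trials, sang) ~ treatment, data = birds)
print(lr)
```

Here `trials` plays the role of time and `sang` is the event indicator, so birds that never sang contribute the information "still silent after `trials` attempts" instead of being dropped or treated as zeros.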

Anomaly detection within a dataset in R

I would like to detect patterns within a weather dataset of around 10'000 data points. I have around 40 possible predictors (temperature, humidity etc.) which may explain good or bad weather the next day (dependent variable). Normally, I would apply classical machine learning methods like Random Forest to build and test models for classifying the whole dataset and find reliable predictors to forecast the next day's weather.
My task though is different. I want to find predictors and their parameters which "guarantee" me good or bad weather in a subset of the overall data. I am not interested in describing the whole dataset but finding the pattern of predictors (and their parameters) that give me good or bad weather indications. So I am trying to find, for example, 100 datapoints with 100% good weather if certain predictors are set to certain levels. I am not interested in the other 9'900 datapoints.
In effect, the task is to search over combinations of predictors and their thresholds to find a subset of the overall data points that can be predicted with very high accuracy.
How would you do this systematically?
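One possible starting point, offered only as a sketch, is to grow a classification tree and then inspect its individual leaves instead of its overall accuracy: each leaf is a subset defined by predictor thresholds, and its purity can be read off directly. The example uses the rpart package (shipped with standard R installations) on simulated data; the predictor names, the "warm and dry" rule and the minimum leaf size of 100 are all assumptions made up for illustration.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(7)
n <- 10000
weather_df <- data.frame(
  temperature = runif(n, 0, 35),
  humidity    = runif(n, 20, 100),
  pressure    = rnorm(n, 1013, 8)
)
# Invented rule: warm and dry days are almost always followed by good weather.
in_pocket <- weather_df$temperature > 20 & weather_df$humidity < 40
p_good <- ifelse(in_pocket, 0.98, 0.30)
weather_df$next_day <- factor(ifelse(runif(n) < p_good, "good", "bad"))

# Grow a tree whose leaves must each contain at least 100 days.
tree <- rpart(next_day ~ ., data = weather_df, method = "class",
              control = rpart.control(minbucket = 100, cp = 0.001))

# For classification trees, `dev` is the number of misclassified cases in a node.
leaves <- tree$frame[tree$frame$var == "<leaf>", ]
leaves$purity <- 1 - leaves$dev / leaves$n
leaves$class  <- levels(weather_df$next_day)[leaves$yval]
leaves <- leaves[order(-leaves$purity), c("n", "class", "purity")]
print(head(leaves, 5))

# Print the predictor thresholds that define the purest leaf.
best_node <- as.numeric(rownames(leaves)[1])
path.rpart(tree, nodes = best_node)
```

Leaf purity measured on the data used to grow the tree is optimistic, so any subset found this way would need to be checked on held-out data before it is trusted.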

How to adjust for age effect in an ANOVA?

I recently submitted a paper to a peer-reviewed journal. The aim is to assess whether the census blocks with the worst material deprivation index in a given region are those that have the lowest levels of access to health emergency services (in terms of travel time, taking into account the location of services and the location of ambulances: accumulated time).
For this we did the following: after measuring the MDI (material deprivation index - a rather classical index) and obtaining the total journey times to the emergency services, we performed an ANOVA (MDI as a categorical variable - 6 levels - and times as a continuous variable). I did it using R.
The reviewer now says that we should check whether there is an effect of age. He argues that in some countries, the more deprived patients are also the youngest. He adds that the effect of age could be taken into account in a linear regression (or an ordered logistic regression if normality is not assumed) with the proportion of people aged 30 (for example) as an independent variable in addition to the MDI categories. Thus, we would obtain the beta coefficient of the effect of MDI on travel times adjusted for age (mean age).
To get a view of the age effect, I intuitively would have added an age structure indicator (for example the % of young people, % of older people, etc.) and ANOVA would become an ANCOVA. It would simply be an ANOVA with an additional covariate. But the reviewer suggests a linear or logistic regression. Does it make sense to you? As a non-statistician, I am not sure about what I should do in this case.
I hope I am not too confusing. Please tell me if I am, and I will try to clarify.
Thanks in advance,
Cheers
PS: I am not a native English speaker and my English is rather poor. Apologies for the mistakes.
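On the ANCOVA-versus-regression point: an ANOVA with an added covariate and a linear regression with the same terms are the same model, which a small simulation can illustrate. The variable names (`mdi`, `prop_young`, `travel_time`) and the data-generating numbers below are hypothetical, chosen only to make the sketch runnable.

```r
set.seed(3)
n <- 600
blocks <- data.frame(
  mdi        = factor(sample(1:6, n, replace = TRUE)),  # 6 deprivation classes
  prop_young = runif(n, 0.10, 0.50)                      # hypothetical age indicator
)
# Invented data-generating process, for illustration only.
blocks$travel_time <- 10 + 1.5 * as.integer(blocks$mdi) -
  8 * blocks$prop_young + rnorm(n, sd = 3)

# ANCOVA written as an ANOVA with a covariate ...
fit_aov <- aov(travel_time ~ prop_young + mdi, data = blocks)
# ... and the same model written as a linear regression.
fit_lm  <- lm(travel_time ~ prop_young + mdi, data = blocks)

# The two calls estimate the same coefficients.
same_fit <- isTRUE(all.equal(coef(fit_aov), coef(fit_lm)))
print(same_fit)

# F test for MDI after adjusting for the age covariate (covariate entered first).
print(summary(fit_aov))
# Age-adjusted difference of each MDI class from class 1.
print(summary(fit_lm)$coefficients)
```

Because `aov()` uses sequential sums of squares, the covariate is placed before `mdi` so that the MDI test is the one adjusted for age.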

Naive Bayes for GPA classification?

As a beginner in machine learning, I am faced with a project where I have to find a method that uses both categorical and numerical variables collected from surveys to predict a child's "discretized" GPA values.
For example, the x-variables include yes/no/don't know responses to questions such as "I worry about taking tests", and numerical answers such as household income. The surveys were given to teachers, caregivers, and the children themselves.
The y-variable is GPA, which ranges from 1 to 4 in discrete increments of 0.25.
What I have attempted is to use the Boruta package to pick out the 65 most relevant features out of over 10000 features (and all of the features do make sense: they are often related to the child's behavior in school, and/or the child's scores/percentiles on standardized tests). Below is a sample of the features selected by Boruta.
A3D. Your dad misses events or activities that are important to you
G2C. I worry about taking tests
G2D. It's hard for me to pay attention
G2H. It's hard for me to finish my schoolwork
G2I. I worry about doing well in school
G2M. I get in trouble for talking and disturbing others
G19A. Frequency you had 4 or more drinks in one day in past 12 months
E6A. Father could count on someone to co-sign for a bank loan for $5000
i13. how much you earn in that job, before taxes
I19A. Amount earned from all regular jobs in past 12 months
J1. Total household income before taxes/deductions in past 12 months
J4A. Name on bank account
J6B. Amount owed on your vehicle
Then I ran a naive Bayes classifier. I do not know if this is appropriate or if there are better methods for this task, but the results are simply terrible. The model often produces extreme values such as 1 and 4, when the actual value should be somewhere in between. I thought I had relevant features for the task, but somehow the accuracy is very low.
I have also tried a gradient boosting machine from the caret package using the default parameters, but the result isn't very satisfying either.
What can I do to improve the model, and are there better methods to try?
Is regression more suited for this if I want to achieve better accuracy/minimize error?
Thanks!
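On the regression question: since the GPA levels are ordered, one option is to model GPA as a number, snap the predictions back to the 0.25 grid, and score the result by absolute error instead of exact-match accuracy. The sketch below does this with base R on simulated data. The three predictors, their effects and the train/test split are invented for illustration and do not come from the survey.

```r
set.seed(11)
n <- 2000
survey <- data.frame(
  worry_tests = factor(sample(c("yes", "no", "dont_know"), n, replace = TRUE)),
  log_income  = rnorm(n, mean = 10.5, sd = 0.7),
  test_pct    = runif(n, 1, 99)
)
# Invented relationship between the predictors and GPA.
latent <- 1.2 + 0.02 * survey$test_pct + 0.15 * (survey$log_income - 10.5) -
  0.2 * (survey$worry_tests == "yes") + rnorm(n, sd = 0.4)

# Snap a numeric value to the GPA grid 1.00, 1.25, ..., 4.00.
snap_gpa <- function(x) pmin(pmax(round(x * 4) / 4, 1), 4)
survey$gpa <- snap_gpa(latent)

train <- survey[1:1500, ]
test  <- survey[1501:n, ]

# Linear regression on the numeric GPA, predictions snapped back to the grid.
fit  <- lm(gpa ~ worry_tests + log_income + test_pct, data = train)
pred <- snap_gpa(predict(fit, newdata = test))

mae      <- mean(abs(pred - test$gpa))             # error in GPA points
baseline <- mean(abs(mean(train$gpa) - test$gpa))  # always predict the training mean
cat("MAE:", round(mae, 3), " baseline MAE:", round(baseline, 3), "\n")
```

A classifier such as naive Bayes treats 2.75 and 3.00 as unrelated labels, so a near miss counts as much as a wild one; a numeric or ordinal model (for example `polr()` from the MASS package) uses the ordering. Whether it actually does better on the real survey data would have to be checked.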

R, classification, variables input as a group

How do I classify variables in R when the classifying output is known only for a group of variables? Think of this as being similar to the 'Mastermind' board game.
Or, here is a concrete example I'm working on: a person eats different types of food on different days, and she either has an allergic reaction (to something she ate that day), or she does not. These data are available for a number of days. What is the person allergic to?
With real data, you cannot do simple elimination (all foods she ate on days with no reaction are fine), because there will be false positives and false negatives in the data. A probabilistic approach is needed (99% allergic to spinach, 20% allergic to mint, etc.).
This is really more of a Statistics 101 question and thus better suited for stats.stackexchange.com but I will handle it.
The answer to your food analogy example is to use something like
lm() # for linear models (least squares), univariate or multivariate
glm() # generalized linear models (note: despite the name you can use these 2 for non-linear models as well, like polynomial regression)
nnet #basic neural networks and
randomForest()
The caret package wraps over 100 classification models,
and so on. There are hundreds, if not thousands, of probabilistic approaches you could take. You can fit them with normal equations, gradient descent, and other methods. The possibilities are practically endless.
This should get you started:
https://cran.r-project.org/web/views/MachineLearning.html
https://cran.r-project.org/web/views/Multivariate.html
http://blog.revolutionanalytics.com/2012/08/cheat-sheet-for-prediction-and-classification-models-in-r.html
I'm sorry but I've never heard of 'Mastermind'.
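As a concrete illustration of the `glm()` route for the food example, here is a minimal sketch using logistic regression on a simulated food diary. The food names, the 5% background reaction rate and the effect sizes are all made up for the example.

```r
set.seed(5)
n_days <- 400
foods <- c("spinach", "mint", "peanut", "rice", "apple")

# 0/1 matrix: did she eat each food on each day?
ate <- matrix(rbinom(n_days * length(foods), 1, 0.4), nrow = n_days,
              dimnames = list(NULL, foods))

# Invented truth: spinach is a strong trigger, mint a weak one, and 5% of
# food-free days still show a reaction (noise in the data).
log_odds <- qlogis(0.05) + 3.0 * ate[, "spinach"] + 0.8 * ate[, "mint"]
diary <- data.frame(ate, reaction = rbinom(n_days, 1, plogis(log_odds)))

# Logistic regression: each coefficient is a food's effect on the log-odds
# of a reaction, holding the other foods fixed.
fit <- glm(reaction ~ ., family = binomial, data = diary)
print(round(summary(fit)$coefficients, 3))

# Foods ranked by estimated effect.
effects <- sort(coef(fit)[foods], decreasing = TRUE)
print(effects)
```

The coefficients are effects on the log-odds of a reaction, not statements like "99% allergic to spinach"; turning them into probabilities of the kind the question asks for would need a further modelling step, for example a Bayesian treatment.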

Resources