Naive Bayes for GPA classification? - r

As a beginner in machine learning, I am faced with a project where I have to find a method that uses both categorical and numerical variables collected from surveys to predict a child's "discretized" GPA values.
For example, the x-variables include yes/no/don't know responses to questions such as "I worry about taking tests", and numerical answers such as household income. The surveys were given to teachers, caregivers, and the children themselves.
The y-variable is GPA, which ranges from 1 to 4 in discrete increments of 0.25.
What I have attempted is to use the boruta package to pick out the 65 most relevant features out of over 10,000 (and all of the selected features do make sense---they are often related to the child's behavior in school and/or the child's scores/percentiles on standardized tests). Below is a sample of the features selected by boruta.
A3D. Your dad misses events or activities that are important to you
G2C. I worry about taking tests
G2D. It's hard for me to pay attention
G2H. It's hard for me to finish my schoolwork
G2I. I worry about doing well in school
G2M. I get in trouble for talking and disturbing others
G19A. Frequency you had 4 or more drinks in one day in past 12 months
E6A. Father could count on someone to co-sign for a bank loan for $5000
i13. how much you earn in that job, before taxes
I19A. Amount earned from all regular jobs in past 12 months
J1. Total household income before taxes/deductions in past 12 months
J4A. Name on bank account
J6B. Amount owed on your vehicle
Then I ran a naive Bayes classifier. I do not know if this is appropriate or if there are better methods for this task, but the results are simply terrible. The model often produces extreme values such as 1 and 4, when the actual value should be somewhere in between. I thought I had relevant features for the task, but somehow the accuracy is very low.
I have also tried a gradient boosting machine from the caret package using the default parameters, but the result isn't very satisfying either.
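For concreteness, a minimal sketch of the Boruta-then-naive-Bayes workflow using the boruta and e1071 packages (survey_df and gpa_class are placeholder names for my data frame and the discretized GPA stored as a factor):

    library(Boruta)
    library(e1071)

    set.seed(1)
    # Feature selection: keep only the attributes Boruta confirms as relevant
    bor  <- Boruta(gpa_class ~ ., data = survey_df)
    keep <- getSelectedAttributes(bor, withTentative = FALSE)

    # Naive Bayes on the selected features
    nb_fit <- naiveBayes(x = survey_df[, keep], y = survey_df$gpa_class)
    pred   <- predict(nb_fit, survey_df[, keep])
    mean(pred == survey_df$gpa_class)  # in-sample accuracy only; use cross-validation in practice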
What can I do to improve the model, and are there better methods to try?
Is regression more suited for this if I want to achieve better accuracy/minimize error?
Thanks!

Related

Is a hurdle model suited to analysing data such as the number of trials?

I am analysing song-recording data from birds. Birds had several recording trials in which to sing: some sang during the first or second trial, some needed more than 10 trials, and some never sang, even after 15 trials or more. Birds that sang were not recorded again. My data contain a binary variable (did or did not sing), the number of trials until singing or until we definitively stopped recording, and the number of song phrases produced.
I have 4 groups of birds with different temperature treatments, and I am trying to see whether those treatments affect the propensity to sing. I first focused on the binary variable, but my colleagues suggested also including the number of trials (how hard it was to get them to sing) and the number of phrases produced (amount of singing behaviour).
They suggested using a hurdle model: first, did the bird sing or not, and then, if it did, how much. I liked this idea very much, but it doesn't take into account the number of trials before singing. I don't really know how to analyse these 3 variables together, so I'm asking for advice and help.
I tried:
to include the number of trials as a covariate, but birds in some treatment groups needed significantly more trials to sing than birds in other groups, and I'm afraid it overlaps with the effect of the treatment in the model
to use the number of trials as the dependent variable, but it seems to me that a hurdle model wouldn't be the most adequate method for this type of data. I see the number of trials more as a succession of opportunities for the bird to sing or not than as one observation at a given point, in contrast to the number of phrases the bird sang during a given recording.
I have very little experience with hurdle models and other zero-inflated models, so I have reached an impasse and I would really appreciate your opinion. Thanks in advance!
After asking some collaborators, someone suggested a much better way to analyse this type of data.
I was trying to apply a zero-inflated or zero-altered method, but my data are actually right-censored. I used a survival analysis; I'll briefly explain it here in case someone has the same problem as I did:
We use survival analysis when we want to analyse the time until an event occurs (in health studies, survival within 5 years, for instance). Some individuals are censored because the event didn't happen within the time period that we study.
I have exactly this type of data: I analyse whether a bird sang or not (the event) and how many trials it needed to sing (the time), but some birds didn't sing within the time I dedicated to recordings; those individuals are censored because I don't know how many trials they would have needed to sing.
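A minimal sketch of that analysis with the survival package, assuming a data frame birds with columns trials, sang (1 = sang, 0 = censored) and treatment (all names hypothetical):

    library(survival)

    # Kaplan-Meier "singing" curves by temperature treatment
    km <- survfit(Surv(trials, sang) ~ treatment, data = birds)
    plot(km)

    # Cox model for the treatment effect on the hazard of singing
    cox <- coxph(Surv(trials, sang) ~ treatment, data = birds)
    summary(cox)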
I hope this can help other people struggling, like me, with this kind of data; it is not always easy to find an appropriate analysis.

How can Keras predict sales sequences (individually) for 11106 distinct customers, each a series of varying length (anywhere from 1 to 15 periods)?

I am approaching a problem that Keras must offer an excellent solution for, but I am having problems developing an approach (because I am such a neophyte concerning anything for deep learning). I have sales data. It contains 11106 distinct customers, each with its own time series of purchases of varying length (anywhere from 1 to 15 periods).
I want to develop a single model to predict each customer's purchase amount for the next period. I like the idea of an LSTM, but clearly, I cannot make one for each customer; even if I tried, there would not be enough data for an LSTM in any case---the longest individual time series only has 15 periods.
I have used types of Markov chains, clustering, and regression in the past to model this kind of data. I am asking the question here, though, about what type of model in Keras is suited to this type of prediction. A complication is that all customers can be clustered by their overall patterns. Some belong together based on similarity; others do not; e.g., some customers spend with patterns like $100-$100-$100, others like $100-$100-$1000-$10000, and so on.
Can anyone point me to a type of sequential model supported by Keras that might handle this well? Thank you.
I am trying to achieve this in R. I haven't been able to build a model that gives me more than about 0.3 accuracy.
I don't think the main difficulty comes from which model to use so much as from how to frame the problem.
As you mention, WHO is spending the money seems as relevant as their past transactions for knowing how much they will likely spend.
But you cannot train 10k+ models, one per customer, either.
Instead, I would suggest clustering your customer base and fitting one model per cluster, using the combined time series of all customers in that cluster to train the same model.
This would allow each model to learn the spending pattern of that particular group.
For that you can use an LSTM or another RNN model.
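As a rough sketch of that idea in the R interface to Keras (all names and shapes here are hypothetical), assuming the series of one cluster have been zero-padded to 15 periods so that x_train has shape (customers, 15, 1) and y_train holds the next-period amounts:

    library(keras)

    model <- keras_model_sequential() %>%
      layer_masking(mask_value = 0, input_shape = c(15, 1)) %>%  # skip padded periods
      layer_lstm(units = 32) %>%
      layer_dense(units = 1)                                     # next-period purchase amount

    model %>% compile(loss = "mse", optimizer = "adam")
    model %>% fit(x_train, y_train, epochs = 20, batch_size = 64, validation_split = 0.1)

One such model would be trained per cluster, each on the combined padded series of its customers.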
Hi, here's my suggestion; I will edit it later to provide you with more information.
Since it's a sequence problem, you should use RNN-based models: LSTMs or GRUs.

R caret training - but each sample has three separate measurements and I want to use majority vote to predict

I have a very specific dataset with 50 people. Each person has a response (sex) and ~2000 measurements of some biological stuff.
We have three independent replicates from each person, so 3 rows per person.
I can easily use caret and groupKFold() to keep each person in either training or test sets - that works fine.
Then I simply predict each replicate separately (so 3 predictions per person).
I want to use these three predictions together to make a combined prediction per person, using majority vote and/or some other scheme.
I.e., for each person I take the 3 predictions and predict the response to be the one with the most votes. That's pretty easy to do for the final model, but it should also be used in the tuning step (i.e. in the cross-validation that picks parameter values).
I think I can do that via summaryFunction = ... when calling caret::trainControl(), but I would simply like to ask:
Is there a simpler way of doing this?
I have googled around, but I keep failing to find people with similar problems. I really hope someone can point me in the right direction.
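For concreteness, the per-person majority vote I have in mind for the final model is something like the sketch below, assuming a data frame pred_df with one row per replicate and columns person, obs and pred (placeholder names); the open question is how to apply the same vote inside the resampling, e.g. via a custom summaryFunction:

    # Most frequent predicted class per person
    majority_vote <- function(x) names(which.max(table(x)))

    pred_person <- tapply(pred_df$pred, pred_df$person, majority_vote)
    obs_person  <- tapply(as.character(pred_df$obs), pred_df$person, unique)
    mean(pred_person == obs_person)  # person-level accuracy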

Predict the injury time of a football match?

I have a project which requires me to predict the injury time of a football match.
I have the relevant information, such as goals, corners, referee, 2 team names and the injury time for each half.
I tried to use Poisson regression, and I am confused about how I can include referee as a factor in my model, since different referees were involved in different numbers of matches. Say Tom was involved in 200 games while Jerry was in only 30.
I tried adding the factor "referee" to the model, and the summary told me that only some of the referees have a significant effect on the results. So I wonder: is it correct to add the referee into the model directly, and are there any other methods I can use?
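Concretely, the model I fitted looks roughly like the sketch below (matches and the column names are placeholders):

    # Poisson regression of injury time with referee as a fixed-effect factor
    fit <- glm(injury_time ~ goals + corners + referee,
               family = poisson(link = "log"), data = matches)
    summary(fit)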
Thanks!

Weighted mean tending towards center

I'm experimenting with some movie rating data, currently doing some hybrid item-based and user-based predictions. Mathematically, I'm unsure how to implement what I want; maybe the answer is just a straightforward weighted mean, but I feel like there might be some other option.
I have 4 values for now that I want to get the mean of:
item based prediction
user based prediction
Global movie average for given item
Global user average for given user
As this progresses there will be other values I'll need to add to the mix, such as weighted similarity, genre weighting, and I'm sure a few other things.
For now I want to focus on the data available to me as stated above as much for understanding as anything else.
Here is my theory: to start, I want to weight the item-based and user-based predictions equally, with both getting more weight than the global averages.
My feeling, though, based on my very rusty maths and some basic attempts to come up with a less linear solution, is to use something like a harmonic mean, but one that, instead of naturally tending towards the lowest value, tends towards the global average.
E.g.:
predicted item-based rating: 4.5
predicted user-based rating: 2.5
global movie rating: 3.8
global user rating: 3.6
so the "centre"/global average here would be 3.7
I may be way off base with this, as my maths is quite rusty, but does anyone have any thoughts on how I could mathematically represent what I'm thinking?
Or do you have any thoughts on a different approach?
I recommend looking into the "Recommender Systems Handbook" by F. Ricci et al. (2011). It summarizes all the common approaches in recommender engines and provides all the necessary formulas.
Here is an excerpt from 4.2.3:
As the number of neighbors used in the prediction increases, the rating predicted by the regression approach will tend toward the mean rating of item i. Suppose item i has only ratings at either end of the rating range, i.e. it is either loved or hated, then the regression approach will make the safe decision that the item’s worth is average. [...] On the other hand, the classification approach will predict the rating as the most frequent one given to i. This is more risky as the item will be labeled as either “good” or “bad”.
