I am trying to predict the Spotify popularity score using a range of machine learning algorithms in the R Caret package including logistic regression. The aim is to predict track popularity based on audio features e.g. danceability, energy etc. The problem I have is that Spotify are not transparent about how the popularity score is calculated but I know it is based on a number of things including play counts and how recent the track is. That means that the number of days released will have an impact on the popularity score so I have included days_released as an independent variable in my modelling to try and control for it.
So, I have 50 variables (days_released being one of them). I am using the rfe function in Caret to perform feature selection but for every algorithm, days_released is the only variable selected. Does anyone have any advice or recommended reading on how to overcome this problem? I want to predict popularity and explore which track features have a significant relationship with popularity, controlling for days_released.
Do I take the days_released variable out altogether?
Do I leave it in but force rfe to select more than one feature?
Any help would be much appreciated! Thanks in advance!
Related
I'm currently working on my master's thesis. For my research, I have to measure the impact of two independent variables (amount of infographics and amount of other images) on a dependent variable (amount raised). Other control variables are also included. The data consists of a database containing 2000+ observations, with some missing (NA in R) independent variable values. Someone suggested using a Heckman model for this. I watched a YouTube video to get a clearer picture, but I doubt whether a Heckman model is the right way to analyze this. Would you recommend using the Heckman model for this?
How would you analyze this?
Best regards
I am approaching a problem that Keras must offer an excellent solution for, but I am having problems developing an approach (because I am such a neophyte concerning anything for deep learning). I have sales data. It contains 11106 distinct customers, each with its time series of purchases, of varying length (anywhere from 1 to 15 periods).
I want to develop a single model to predict each customer's purchase amount for the next period. I like the idea of an LSTM, but clearly, I cannot make one for each customer; even if I tried, there would not be enough data for an LSTM in any case---the longest individual time series only has 15 periods.
I have used types of Markov chains, clustering, and regression in the past to model this kind of data. I am asking the question here, though, about what type of model in Keras is suited to this type of prediction. A complication is that all customers can be clustered by their overall patterns. Some belong together based on similarity; others do not; e.g., some customers spend with patterns like $100-$100-$100, others like $100-$100-$1000-$10000, and so on.
Can anyone point me to a type of sequential model supported by Keras that might handle this well? Thank you.
I am trying to achieve this in R. Haven't been able to build a model that gives me more than about .3 accuracy.
I don't think the main difficulty is coming from which model to use as much as how to frame the problem.
As you mention, "WHO" is spending the money seems as relevant as their past transactions in knowing how much they are likely to spend.
But you cannot train 10k+ models, one for each customer, either.
Instead, I would suggest clustering your customer base and fitting one model per cluster, training it on the combined time series of all the customers in that cluster.
This would allow each model to learn the spending pattern of that particular group.
For that you can use an LSTM or another RNN model.
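To make the "one model per cluster" idea concrete, here is a minimal sketch of the data preparation only, in plain Python. It assumes the cluster labels already exist; the customer IDs, cluster names, and the `lookback` length are made up for illustration. Each customer's series is cut into (recent history, next purchase) pairs, and the pairs of all customers in a cluster are pooled into one training set.

```python
from collections import defaultdict

def make_windows(series, lookback):
    """Turn one purchase series into (input window, next value) pairs."""
    pairs = []
    for end in range(1, len(series)):
        window = series[max(0, end - lookback):end]
        # Left-pad short histories with 0.0 so every input has the same length.
        padded = [0.0] * (lookback - len(window)) + list(window)
        pairs.append((padded, series[end]))
    return pairs

def pool_by_cluster(purchases, cluster_of, lookback=3):
    """Combine the windows of all customers that share a cluster label."""
    pooled = defaultdict(list)
    for customer, series in purchases.items():
        pooled[cluster_of[customer]].extend(make_windows(series, lookback))
    return dict(pooled)

# Hypothetical toy data: three customers, two clusters.
purchases = {
    "c1": [100.0, 100.0, 100.0],
    "c2": [100.0, 110.0],
    "c3": [100.0, 100.0, 1000.0, 10000.0],
}
cluster_of = {"c1": "steady", "c2": "steady", "c3": "escalating"}

training_sets = pool_by_cluster(purchases, cluster_of, lookback=3)
```

Each entry of `training_sets` could then feed one recurrent model. Note that a customer with a single period contributes no pairs, and that zero-padding is only one option; in Keras a `Masking` layer is a common way to make the network skip the padded steps.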
Hi here's my suggestion and I will edit it later to provide you with more information
Since it's a sequence problem, you should use RNN-based models such as LSTMs or GRUs.
I am a university student working on a research project. Because of our local lockdown I cannot go into the field to collect observation data, so I am looking for an R package that will allow me to model the effects of competition when testing for ideal free distribution (IFD).
To give you a better idea of what I am looking for I have described the project in more detail below.
In my original dataset (which I received, i.e. I did not collect the data myself) I have two patches (A, B) which received random treatments of food input (1:1, 2:1, 5:1). Under the ideal free distribution hypothesis, species should distribute into the patches in accordance with the treatment ratios. This is not the case.
Under normal circumstances I would go into the field and observe behaviour of individuals in the patches to see if dominance affects distribution. Since we are in a lockdown I am unable to do so. I am hoping that there is a package out there that would allow me to model this scenario and help me investigate how competition affects IFD.
I have already found two packages called coexist and EcoVirtual but they model coexistence and extinction dynamics, whereas I want to investigate how competition might alter distribution between profitable patches when there is variation in the level of competition.
I am fairly new to R and creating my own package is beyond my skillset at this point, so I would appreciate the help.
I hope this makes sense and thanks in advance.
Wow, that's an odd place to find another IFD researcher. I do not believe there are any R packages specifically about IFD. It's too specific, and most models are relatively simple to estimate using common tests. For example, the input-matching rule you mentioned can be tested using a simple run-of-the-mill t-test, already included in base R.
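As a rough illustration of that test, here is a sketch in plain Python of a one-sample t statistic comparing the observed share of individuals in patch A with the share input matching predicts. The observations are invented, and the function names are mine. In R the equivalent would be along the lines of `t.test(observed, mu = expected)`.

```python
import math
import statistics

def expected_share(ratio_a, ratio_b):
    """Share of individuals expected in patch A under input matching."""
    return ratio_a / (ratio_a + ratio_b)

def one_sample_t(observed, mu):
    """Return (t statistic, degrees of freedom) for H0: mean(observed) == mu."""
    n = len(observed)
    standard_error = statistics.stdev(observed) / math.sqrt(n)
    return (statistics.mean(observed) - mu) / standard_error, n - 1

# Hypothetical data: share of individuals in patch A on five trials
# of the 2:1 treatment.
observed_2_to_1 = [0.58, 0.61, 0.55, 0.60, 0.57]
mu = expected_share(2, 1)  # 2/3 under input matching
t_stat, df = one_sample_t(observed_2_to_1, mu)
```

A negative `t_stat` here means fewer individuals in the richer patch than input matching predicts. The sketch stops at the statistic; a p-value needs the t distribution, which R's `t.test` provides directly.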
What you have is not a coding problem per se, or even a statistical one. It is a biological problem. What ratio would you expect when animals are ideal (full knowledge of the environment) and free (no movement costs), but competition is present? Is this ratio equal to the ratio in your dataset? Sutherland (1983) suggests animals would undermatch.
I would love to discuss this in depth, given my PhD was on IFD, but I fear you have hit the wrong forum.
I've been using BTYD package for predicting customer churn and number of orders in the future, but I find the included models (Pareto/NBD, BG/NBD and BG/BB) limited in the sense that they only take recency, frequency, age and monetary value into account. Using these values I receive accuracy of up to 80% but I'm sure this could be improved by incorporating more meaningful predictors into the model. Is it possible in this package? I couldn't find any information about it in the vignette. Any help is welcome, thanks!
Kasia
I have historical purchase data covering 3 months for some 10k customers, and I want to use it to predict their purchases in the next 3 months. I am using Customer ID as an input variable because I want xgboost to learn individual spending across different categories. Is there a way to tweak the model so that it puts more emphasis on each individual's purchases? Or is there a better way of addressing this problem?
You can use a weight vector, passed via the weight argument in xgboost: a vector with one entry per training row, i.e. of length nrow(trainingData). However, this is generally used to penalize mistakes on particular instances in classification (think of sparse data where an item sells only once a month or so; to learn the sales you need to give more weight to the sale instances, or else every prediction will be zero). You appear to be trying to tweak the weight of an independent variable, which I do not fully understand.
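For the sparse-sales case, a minimal sketch of how such a per-row weight vector could be built, in plain Python. The inverse-frequency scheme and the toy target are my own assumptions, not something xgboost prescribes; the resulting list is what would be handed to the weight argument.

```python
def inverse_frequency_weights(labels):
    """One weight per row, inversely proportional to how common its label is."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n_rows, n_classes = len(labels), len(counts)
    return [n_rows / (n_classes * counts[label]) for label in labels]

# Hypothetical target: 1 = the item sold on that day, 0 = no sale.
sold = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
weights = inverse_frequency_weights(sold)
```

With this scheme the two sale rows get weight 2.5 and the eight no-sale rows get 0.625, so both classes carry the same total weight and the rare sales are not drowned out.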
Learning the behavior of the dependent variable (sales in your case) is what a machine learning model does, and you should let it do its job. You should not tweak it to force it to learn from some features only. For learning purchase behavior, clustering-type unsupervised techniques will be more useful.
To include user-specific behavior, a first step would be to cluster the users and identify under-indexed and over-indexed categories for each user. You can then create categorical features from these flags.
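One possible reading of under-indexed and over-indexed, sketched in plain Python: compare each user's share of spend in a category with the share across all users, and flag large deviations. The 0.8 and 1.2 thresholds, the function name, and the toy data are assumptions for illustration only.

```python
from collections import defaultdict

def category_flags(spend, low=0.8, high=1.2):
    """Flag each (user, category) pair as 'under', 'over' or 'typical'.

    spend maps user -> {category: amount}. The index is the user's share of
    spend in a category divided by that category's share across all users.
    """
    totals = defaultdict(float)
    for by_category in spend.values():
        for category, amount in by_category.items():
            totals[category] += amount
    grand_total = sum(totals.values())

    flags = {}
    for user, by_category in spend.items():
        user_total = sum(by_category.values())
        if user_total == 0:
            continue  # no spend at all, so no meaningful index
        for category in totals:
            user_share = by_category.get(category, 0.0) / user_total
            index = user_share / (totals[category] / grand_total)
            if index > high:
                flags[(user, category)] = "over"
            elif index < low:
                flags[(user, category)] = "under"
            else:
                flags[(user, category)] = "typical"
    return flags

# Hypothetical spend per user and category.
spend = {
    "u1": {"grocery": 80.0, "electronics": 20.0},
    "u2": {"grocery": 20.0, "electronics": 80.0},
    "u3": {"grocery": 50.0, "electronics": 50.0},
}
flags = category_flags(spend)
```

The flags can then be pivoted into one categorical column per category and used as model inputs in place of the raw Customer ID.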
PS: Some data to explain your problem can help others to help you better.
This arrived with XGBoost 1.3.0 (10 December 2020) under the name feature_weights: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit . I'll edit this answer when I have worked through or seen a tutorial on it.