I've been using the BTYD package to predict customer churn and the number of future orders, but I find the included models (Pareto/NBD, BG/NBD and BG/BB) limited in that they only take recency, frequency, age and monetary value into account. Using these values I get accuracy of up to 80%, but I'm sure this could be improved by incorporating more meaningful predictors into the model. Is this possible in the package? I couldn't find any information about it in the vignette. Any help is welcome, thanks!
Kasia
Related
I am working with R. I need to identify the predictors of higher Active trial start percentage over time (StartDateMonthsYrs). I will do linear regression with Percent.Active as the dependent variable.
My original dataframe is attached and my obtained Active trial start percentage over time (named Percent.Active) is presented here.
So, I need to assess whether federally sponsored trials, industry-sponsored trials or other-sponsored trials were associated with a higher active trial start percentage over time. I have many other variables that I need to assess, but this is a sample of my data.
I am thinking of doing a crosstab for each variable (e.g. Federal & Active, then Industry & Active, etc.) in each month (maybe with the help of lapply), then accumulating the resulting percentages in a second sheet and running the analysis on that.
My code for the linear regression is as follows:
q.lm0 <- lm(Percent.Active ~ Time.point + xyz, data = data.percentage)
summary(q.lm0)
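A minimal base R sketch of the aggregation step described above, replacing one crosstab per month with a single call. The trial-level data frame `trials`, the columns `Sponsor` and `Active`, and all simulated values are assumptions made for illustration; only `Percent.Active`, `Time.point` and `StartDateMonthsYrs` come from the post. The interaction term is one possible way to ask whether the trend differs by sponsor type, not necessarily the right model for the real data.

```r
# Hypothetical trial-level data: one row per trial (all values simulated)
set.seed(1)
trials <- data.frame(
  StartDateMonthsYrs = rep(c("2019-01", "2019-02", "2019-03", "2019-04"), each = 30),
  Sponsor = sample(c("Federal", "Industry", "Other"), 120, replace = TRUE),
  Active = rbinom(120, 1, 0.4)
)

# Percent of active trials per month and sponsor type
data.percentage <- aggregate(
  Active ~ StartDateMonthsYrs + Sponsor,
  data = trials,
  FUN = function(x) 100 * mean(x)
)
names(data.percentage)[names(data.percentage) == "Active"] <- "Percent.Active"

# Numeric time index; "YYYY-MM" strings sort in calendar order
data.percentage$Time.point <- as.integer(factor(data.percentage$StartDateMonthsYrs))

# Does the time trend in Percent.Active differ by sponsor type?
q.lm1 <- lm(Percent.Active ~ Time.point * Sponsor, data = data.percentage)
summary(q.lm1)
```

The same `aggregate()` call can be repeated for each additional grouping variable in place of `Sponsor`.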
I'm a little bit confused. You write 'associated'. If you really want to look for association then yes, a crosstab is possible, and may be sufficient, since association is not the same as causation (which goes a step beyond correlation and needs a theory behind it). If you are looking for correlation, and insights over time, a plain regression with the lm function is not useful.
If you want a regression-type analysis, there are R packages such as plm that can deal with panel data, which is clearly what you have (time points, the trial labels you are interested in, and repeated time points for these labels). See this post for information about the package: https://stackoverflow.com/questions/2804001/panel-data-with-binary-dependent-variable-in-r
I'm writing this because your Percent.Active variable is only a binary 0/1 outcome; I'm not sure if this is on purpose. However, even if your outcome is not binary, the plm package might help, and you will find other packages mentioned in that post.
I am using the eRm package in R to examine the properties of a clinical rating scale using a Partial Credit Model (PCM). I understand how to extract the person ability estimations (thetas) from a simple fitted PCM but I have a dataset with repeated observations (~1200 observations of the instrument in ~250 individuals). So as not to violate assumptions of conditional independence I've fitted the PCM to single observations drawn at random from each subject. This all works but I would now like to use the fitted PCM to generate person-ability estimates for the remaining ~950 observations not used to fit the model; and I can't see from the eRm package documentation how to do this?
Advice very much appreciated
For reference, should others be asking the same question: Patrick Nair (the package maintainer for eRm) confirms this is not possible in eRm at the time of writing, but pointed me to the PP package as an interim solution.
I have a model using medical data in R that I have created using XGBoost. I can tell feature importance, and the top feature is BMI. However, I do not know how to tell how this impacts the model. It is a binary outcome of hemorrhage, so does a lower or a higher BMI cause the patient to be classified as being at risk for hemorrhage? Does anyone know how to find this out? Thank you!
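One common way to see the direction of a feature's effect is a partial dependence curve: force the feature to each value on a grid, predict for every row, and average. The sketch below is model-agnostic and uses only base R. The simulated data, the `Age` column and the `glm` stand-in model are assumptions for illustration; with XGBoost, `predict_fun` would instead wrap `predict()` on the fitted booster.

```r
# Partial dependence: average prediction when one feature is forced to each grid value
partial_dependence <- function(predict_fun, data, feature, grid) {
  sapply(grid, function(v) {
    tmp <- data
    tmp[[feature]] <- v          # overwrite the feature for every row
    mean(predict_fun(tmp))       # average predicted risk at this value
  })
}

# Simulated stand-in data (hypothetical): risk rises with BMI by construction
set.seed(42)
n <- 500
dat <- data.frame(BMI = rnorm(n, 27, 4), Age = rnorm(n, 50, 10))
dat$hemorrhage <- rbinom(n, 1, plogis(-8 + 0.25 * dat$BMI + 0.01 * dat$Age))

# Stand-in model; any model with a prediction function can be used here
fit <- glm(hemorrhage ~ BMI + Age, data = dat, family = binomial)
predict_fun <- function(d) predict(fit, newdata = d, type = "response")

bmi_grid <- seq(20, 35, by = 5)
pd <- partial_dependence(predict_fun, dat, "BMI", bmi_grid)
data.frame(BMI = bmi_grid, mean_predicted_risk = pd)
```

If the averaged risk rises along the grid, higher BMI pushes predictions toward the at-risk class; if it falls, lower BMI does. This describes what the model has learned, not a causal effect of BMI.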
As you might be able to tell from the sample of my dataset, it contains a lot of dependency, with each study providing multiple outcomes for the construct I am looking at. I was planning on using the metacor library because I only have information about the sample size but not the variance. However, all the methods I came across that deal with dependency, such as the robumeta package, use the variance (I know some people average the effect sizes within each study, but I read that this tends to produce larger error rates). Do you know if there is an equivalent package that uses only the sample size, or is it mathematically impossible to determine the weights without the variance?
Please note that I am a student, no expert.
You could use the escalc function of the metafor package to calculate variances for each effect size. In the case of correlations, it only needs the raw correlation coefficients and the corresponding sample sizes.
See the section "Outcome Measures for Variable Association" at https://www.rdocumentation.org/packages/metafor/versions/2.1-0/topics/escalc
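To illustrate why the sample size is enough for correlations, here is a small sketch with made-up numbers. For Fisher's z-transformed correlation the sampling variance is 1 / (n - 3), so it can be computed by hand; the `escalc()` call with `measure = "ZCOR"` should give the same values and only runs if metafor is installed. The data frame and its values are hypothetical.

```r
# Hypothetical data: several correlations per study, with sample sizes only
dat <- data.frame(
  study = c(1, 1, 2, 3, 3),
  ri    = c(0.30, 0.45, 0.10, 0.25, 0.50),  # raw correlations
  ni    = c(40, 40, 120, 65, 65)            # sample sizes
)

# Fisher's z transformation and its sampling variance, computed by hand
dat$z     <- atanh(dat$ri)
dat$var_z <- 1 / (dat$ni - 3)
dat

# The same quantities via metafor, if the package is available
if (requireNamespace("metafor", quietly = TRUE)) {
  es <- metafor::escalc(measure = "ZCOR", ri = ri, ni = ni,
                        data = dat[, c("study", "ri", "ni")])
  print(es)  # columns yi (effect size) and vi (sampling variance)
}
```

The resulting effect sizes and variances are the inputs that variance-based methods for dependent effects expect.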
I am trying to predict the Spotify popularity score using a range of machine learning algorithms in the R Caret package including logistic regression. The aim is to predict track popularity based on audio features e.g. danceability, energy etc. The problem I have is that Spotify are not transparent about how the popularity score is calculated but I know it is based on a number of things including play counts and how recent the track is. That means that the number of days released will have an impact on the popularity score so I have included days_released as an independent variable in my modelling to try and control for it.
So, I have 50 variables (days_released being one of them). I am using the rfe function in Caret to perform feature selection but for every algorithm, days_released is the only variable selected. Does anyone have any advice or recommended reading on how to overcome this problem? I want to predict popularity and explore which track features have a significant relationship with popularity, controlling for days_released.
Do I take the days_released variable out altogether?
Do I leave it in but force rfe to select more than one feature?
Any help would be much appreciated! Thanks in advance!
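One possible reading of "controlling for days_released" is to remove its effect from popularity first and then study what is left. The sketch below does this in base R on simulated data; the data frame `tracks`, the three audio features and every coefficient are assumptions for illustration, and this is not a caret-specific answer. It is a simplification: if audio features are themselves correlated with days_released, residualizing only the outcome can understate their effects.

```r
# Simulated stand-in data (hypothetical values)
set.seed(7)
n <- 300
tracks <- data.frame(
  days_released = runif(n, 1, 2000),
  danceability  = runif(n),
  energy        = runif(n),
  acousticness  = runif(n)
)
# Popularity dominated by recency, with a smaller danceability effect
tracks$popularity <- 80 - 0.03 * tracks$days_released +
  10 * tracks$danceability + rnorm(n, sd = 3)

# Step 1: remove the part of popularity explained by days_released alone
tracks$pop_resid <- resid(lm(popularity ~ days_released, data = tracks))

# Step 2: rank audio features by their correlation with what is left
audio  <- c("danceability", "energy", "acousticness")
ranked <- sort(sapply(tracks[audio], function(x) cor(x, tracks$pop_resid)),
               decreasing = TRUE)
ranked
```

The residual column could then serve as the outcome for feature selection with days_released left out of the predictors, so the selection is no longer dominated by recency.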