I am relatively new to using predictive modeling and would like some brainstorming help/assessment of feasibility.
I currently have the following variables in the data set for 2018 to present, with one row per order:
date
day of week
item category
order id
lat/long for shipping address
I would like to predict weekly sales for the remaining weeks of this year BY item category. I am most comfortable using R at the moment.
What algorithm/package would you recommend I look into given that I would like to predict weekly sales volume by category?
The shortest answer is to start with the tidyverse packages. group_by() from dplyr is very powerful for computing values by some factor. It sounds like your data is already in tidy form, which works best with the tidyverse framework, since it lets you easily vectorize operations over a data.frame. Check out the main packages they offer and their overviews here. Start with simpler models like lm() and then, if the need arises, move on to more advanced ones. Which of the variables are you going to use as predictors?
No matter which model you choose, once you have built an appropriate one you can use the built-in predict() together with the group_by() function. More details on basic prediction here.
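For instance, here is a minimal sketch of that workflow, assuming the orders have already been aggregated into a data frame weekly_sales with columns category, week, and units (those names, and the simple trend-only lm(), are placeholders for illustration):

library(dplyr)
library(tidyr)
library(purrr)

# fit one model per item category
models <- weekly_sales %>%
  group_by(category) %>%
  nest() %>%
  mutate(fit = map(data, ~ lm(units ~ week, data = .x)))

# predict the remaining weeks of the year for every category
future <- tibble(week = 40:52)
forecasts <- models %>%
  mutate(pred = map(fit, ~ tibble(week = future$week,
                                  units = predict(.x, newdata = future)))) %>%
  select(category, pred) %>%
  unnest(pred)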
By the way, I can't see the data set you talk about, only the description of it. Could you provide a link to a representative sample? It would allow me to provide deeper insight.
I am approaching a problem that Keras must offer an excellent solution for, but I am having problems developing an approach (because I am such a neophyte concerning anything deep learning). I have sales data. It contains 11106 distinct customers, each with its own time series of purchases of varying length (anywhere from 1 to 15 periods).
I want to develop a single model to predict each customer's purchase amount for the next period. I like the idea of an LSTM, but clearly, I cannot make one for each customer; even if I tried, there would not be enough data for an LSTM in any case---the longest individual time series only has 15 periods.
I have used types of Markov chains, clustering, and regression in the past to model this kind of data. I am asking the question here, though, about what type of model in Keras is suited to this type of prediction. A complication is that all customers can be clustered by their overall patterns. Some belong together based on similarity; others do not; e.g., some customers spend with patterns like $100-$100-$100, others like $100-$100-$1000-$10000, and so on.
Can anyone point me to a type of sequential model supported by Keras that might handle this well? Thank you.
I am trying to achieve this in R. I haven't been able to build a model that gives me more than about 0.3 accuracy.
I don't think the main difficulty is which model to use so much as how to frame the problem.
As you mention, "WHO" is spending the money seems as relevant as their past transactions in predicting how much they will likely spend.
But you cannot train 10k+ models, one for each customer, either.
Instead, I would suggest clustering your customer base and fitting one model per cluster, using the combined time series of all customers in that cluster to train the same model.
This would allow each model to learn the spending pattern of that particular group.
For that you can use an LSTM or RNN model.
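A minimal sketch of that idea with the keras R package, assuming purchases is a list with one numeric vector of purchase amounts per customer (the list name, the summary features, and the cluster count are all placeholders):

library(keras)

# 1. cluster customers on simple summary features of their series
feats <- t(sapply(purchases, function(p) c(mean(p), sd(p), length(p))))
feats[is.na(feats)] <- 0
cluster_id <- kmeans(scale(feats), centers = 5)$cluster

# 2. for one cluster, predict the last purchase from the earlier ones
max_len <- 14
cl <- purchases[cluster_id == 1]
cl <- cl[lengths(cl) >= 2]            # need at least one lag to learn from
x <- pad_sequences(lapply(cl, function(p) head(p, -1)),
                   maxlen = max_len, dtype = "float32", padding = "pre")
y <- sapply(cl, function(p) tail(p, 1))

model <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(max_len, 1)) %>%
  layer_dense(units = 1)

model %>% compile(loss = "mse", optimizer = "adam")
model %>% fit(array_reshape(x, c(nrow(x), max_len, 1)), y, epochs = 20)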
Hi, here's my suggestion (I will edit it later to provide more information): since it's a sequence problem, you should use RNN-based models such as LSTMs or GRUs.
At work when I want to understand a dataset (I work with portfolio data in life insurance), I would normally use pivot tables in Excel to look at e.g. the development of variables over time or dependencies between variables.
I remembered from university the nice R function where you can plot every column of a data frame against every other column, i.e. the standard plot() of a data frame.
For the dependency between issue.age and duration this plot is actually interesting, because you can clearly see that high issue ages come with shorter policy durations (there is a maximum age for each policy). However, the plots involving the issue year iss.year are much less "visual"; in fact you can't see anything from them. I would like to see at a glance whether the distribution of issue ages has changed over the different issue years, something like a chart where you could see immediately that the average age of newly issued policies has been increasing from 2014 to 2016.
I don't want to write code that needs to be customized for every dataset I put in, because then I could just as well do it faster by hand in Excel.
So my question is, is there an easy way to plot each column of a matrix against every other column with more flexible chart types than with the standard plot(data.frame)?
Try the ggpairs() function from the GGally package. It has a lot of capability for visualizing columns of all different types, and provides a lot of control over what to visualize.
For example, here is a snippet from the vignette linked to above:
library(GGally)

data(tips, package = "reshape")
ggpairs(tips)
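To get closer to the issue-year view you describe, you can also restrict the columns and map a variable to colour; a sketch using the tips columns (substitute your own column names):

ggpairs(tips,
        columns = c("total_bill", "tip", "day"),
        mapping = ggplot2::aes(colour = sex))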
I have historical purchase data for some 10k customers over 3 months, and I want to use that data to make predictions about their purchases in the next 3 months. I am using customer ID as an input variable, as I want xgboost to learn individual spending across the different categories. Is there a way to tweak it so that the emphasis is on learning more from each individual's purchases? Or is there a better way of addressing this problem?
You can use a weight vector, passed via the weight argument in xgboost: a vector of size equal to nrow(trainingData). However, this is generally used to penalize mistakes on particular instances (think of sparse data with items that sell only, say, once a month; if you want to learn the sales, you need to give more weight to the sale instances, or else all predictions will be zero). You, apparently, are trying to tweak the weight of an independent variable, which I am not able to understand well.
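For reference, a minimal sketch of instance weights in the R package (the feature matrix X, label y, and the weighting rule are placeholders):

library(xgboost)

# up-weight the rare rows with actual sales so they are not drowned out
w <- ifelse(y > 0, 10, 1)

dtrain <- xgb.DMatrix(data = X, label = y, weight = w)
model <- xgb.train(params = list(objective = "reg:squarederror"),
                   data = dtrain, nrounds = 100)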
Learning the behavior of the dependent variable (sales in your case) is what machine learning models do; you should let the model do its job. You should not tweak it to force it to learn from some feature only. For learning purchase behavior, clustering-type unsupervised techniques will be more useful.
To include user-specific behavior, a first step would be to cluster and identify under-indexed and over-indexed categories for each user. Then you can create some categorical features using these flags.
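One way to build such flags, as a sketch (the orders data frame, its column names, and the 1.2/0.8 thresholds are all assumptions):

library(dplyr)

# per-user share of spend by category
user_share <- orders %>%
  group_by(customer_id, category) %>%
  summarise(spend = sum(amount), .groups = "drop_last") %>%
  mutate(user_share = spend / sum(spend)) %>%
  ungroup()

# overall share of spend by category
overall <- orders %>%
  group_by(category) %>%
  summarise(overall_share = sum(amount) / sum(orders$amount))

# flag categories where a user spends clearly more or less than average
flags <- user_share %>%
  left_join(overall, by = "category") %>%
  mutate(over_indexed = user_share > 1.2 * overall_share,
         under_indexed = user_share < 0.8 * overall_share)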
PS: Some data to explain your problem would help others help you better.
This has arrived in XGBoost 1.3.0 (released 10 December 2020) under the name feature_weights: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit . I'll edit this answer when I can work through or find a tutorial for it.
I am relatively new at using R. I have a dataset of around 5000 datapoints.
My goal is to predict a category using the comments entered.
I have a training dataset of 4500 records and a testing data set of 500 records.
I am looking for 2-3 packages which might help me in doing this. I have to evaluate these packages and prepare a report on them. Can anyone suggest some good packages which might be easier to use and also more efficient?
Again, I have 2 columns:
The 1st one is the comments, and based on these I have to predict the category.
Right now I have defined around 10 independent categories.
Most of the comments have specific keywords which I have tied to the categories.
One such example:
Comment 1
The website is pretty good --->> category would be WebsiteContent
Comment 2 might be like
Excellent article, very detailed --->> same category as above (WebsiteContent)
But keywords such as article and website are very limited and can be linked to the category;
all the comments are different, but the underlying keywords are mostly the same.
Thanks,
Ankan
Although all you really need is a very long and well-written set of if-else statements, try using a decision tree from the rpart package, plotted with prp() from the rpart.plot package. I'm saying this only because you're trying to learn, and I'm guessing this is for some assignment which you're supposed to be doing on your own.
library(rpart)
library(rpart.plot)

# build the classification tree, then plot it
tree <- rpart(decision ~ comment, data = train, method = "class")
prp(tree)
The rpart() call builds the model and prp() plots it. This might be a bit overboard actually, but since you're learning R this is a fun thing to work with and can be used for a wide variety of things. Decision trees do work better with more predictor variables, though.
Use predict(tree, test, type = "class") to try out the model on your test dataset (note that the fitted model comes first in predict(), then the new data).
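Note that rpart cannot split on raw text, so in practice you would first turn the comments into features, e.g. a document-term matrix; here is a sketch with the tm package (reusing the hypothetical train$comment and train$decision columns from above):

library(tm)
library(rpart)

# build a document-term matrix from the raw comments
corpus <- VCorpus(VectorSource(train$comment))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
dtm <- DocumentTermMatrix(corpus)

# one column per keyword, plus the category label
train_dtm <- as.data.frame(as.matrix(dtm))
train_dtm$decision <- train$decision

tree <- rpart(decision ~ ., data = train_dtm, method = "class")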
Pardon the newbie question; I just started learning R a couple of weeks ago (but intend to use it actively from now on). However, I could use some help if you already have a working example.
In order to determine own-price elasticity coefficients for each of our products (~100) in each of our states, I want to be able to write a multiple regression that regresses Units on a variety of independent variables. That's straightforward. However, I would like R to cycle through EACH product within a particular state, THEN move on to the next state in the data file and start the regression on the first product, repeating the cycle.
I have attached an example of what I'm trying to accomplish. At the end, I would also like R to export the regression coefficients (and summaries: p-values, t-stats) into a separate worksheet.
Does anyone have an example similar to this? I'm comfortable enough to read source code and modify it to fit my needs, but certainly not yet comfortable enough to write one from scratch. And, alas, I am tired of copying/pasting into Minitab/Excel (which is what I've been using up to this point) to run regressions 1,000 times.
Appreciate any help you could offer!
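In case it helps later readers, here is a minimal sketch of that loop with dplyr/broom (the sales data frame, its column names, and the log-log elasticity form are assumptions):

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# one regression per state x product combination
results <- sales %>%
  group_by(state, product) %>%
  nest() %>%
  mutate(fit = map(data, ~ lm(log(units) ~ log(price), data = .x)),
         stats = map(fit, tidy)) %>%
  unnest(stats)

# export coefficients, t-stats and p-values for use in Excel
write.csv(select(results, state, product, term, estimate, statistic, p.value),
          "elasticities.csv", row.names = FALSE)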