I've downloaded a data set from UCI Machine Learning Repository. In the description of the data set, they talk about "predictive attribute" and "non-predictive attribute". What does it mean and how can you identify them in a data set?
Predictive attributes are attributes that may help your prediction.
Non-predictive attributes are known to not help. For example, a record id, user number, etc. Unique keys usually fall into this category.
To me, it looks like the attributes relate to the types of data points available; a predictive attribute would therefore be a data point that can be used to "predict" something, such as MYCT, MMIN, MMAX, CACH, CHMIN, and CHMAX. The non-predictive attributes would be the vendor name and model name. PRP seems to be the goal field, and ERP is the linear regression estimate.
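For illustration, here is a minimal sketch in R, assuming the machine.data file from that UCI page and the column order given in its description (the file has no header row):

    # Column names follow the UCI "Computer Hardware" description.
    cols <- c("vendor", "model", "MYCT", "MMIN", "MMAX",
              "CACH", "CHMIN", "CHMAX", "PRP", "ERP")
    cpu <- read.csv("machine.data", header = FALSE, col.names = cols)

    # vendor/model are non-predictive identifiers and ERP is the published estimate,
    # so only the predictive attributes are used to model the goal field PRP.
    fit <- lm(PRP ~ MYCT + MMIN + MMAX + CACH + CHMIN + CHMAX, data = cpu)
    summary(fit)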
I am trying to understand the given protocol of the recommenderlab library in R.
From the original document https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf :
Testing is performed by withholding items (parameter given). Breese et al. (1998) introduced the four experimental withholding protocols called Given 2, Given 5, Given 10 and All-but-1. During testing, the Given x protocol presents the algorithm with only x randomly chosen items for the test user, and the algorithm is evaluated by how well it is able to predict the withheld items.
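For concreteness, here is roughly how that protocol is set up with recommenderlab's evaluationScheme; a minimal sketch using the MovieLense sample data that ships with the package, with purely illustrative parameter values:

    library(recommenderlab)
    data(MovieLense)

    # 90/10 split between training and test users; for each test user,
    # "given = 10" items are shown to the algorithm and the rest are withheld.
    scheme <- evaluationScheme(MovieLense, method = "split", train = 0.9,
                               k = 1, given = 10, goodRating = 4)

    rec  <- Recommender(getData(scheme, "train"), method = "UBCF")
    pred <- predict(rec, getData(scheme, "known"), type = "ratings")

    # Accuracy is measured against the withheld items of the test users.
    calcPredictionAccuracy(pred, getData(scheme, "unknown"))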
I do not understand why this is only applied to the test users. Why can't we simply apply it to all users? Why do we need both a split between training and test users AND the given protocol?
Since we give only a few items for the algorithm to learn patterns from and produce predictions, and then test/measure how good those predictions are on the rest of the items, why do we need to do this for the test users only rather than for all of them? What purpose do the training users serve? When the algorithm is given, say, 10 items to learn the target user's patterns, doesn't it use the whole dataset (both training and test users) to compute the user neighborhood (in UBCF, for example) in order to make the predictions that are later evaluated on the withheld items? And if not, meaning that during this process it only looks at the training users (say 80%, for instance), why wouldn't it look at the test users as well? What is the problem with having another test user in the neighborhood, instead of only training users? That part I don't get. Or are my assumptions wrong?
To conclude: why do we need both the given protocol and the split between training and test users?
I hope my question makes sense, I am really curious to find the solution. Thank you in advance :)
I want to create and analyze a data model (or at least a data schema) for my tabular dataset in R.
Ideally, it should be an entity-relationship model (https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model). I don't have a single package for working with ER models on my radar.
Alternatively, it could be a relational data model. I saw some packages, but few good ones.
For example, for a table of three variables: time, customerID, orderID.
I want to make each customer an entity and each order an entity, link them by the relation "belongs to", and link each customer's consecutive orders by the relation "subsequent", using the time variable.
That should result in a data schema (i.e., a directed graph with edges of different types). Is there any standard format for creating, storing, and analyzing that "meta" information in R?
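One possible sketch, not a dedicated ER-modelling package but simply the igraph package with a "type" edge attribute to distinguish the two relations (the toy data and column names below are assumptions):

    library(igraph)

    orders <- data.frame(
      time       = as.POSIXct(c("2024-01-01", "2024-01-05", "2024-01-03")),
      customerID = c("C1", "C1", "C2"),
      orderID    = c("O1", "O2", "O3"),
      stringsAsFactors = FALSE
    )

    # "belongs to": order -> customer
    belongs_to <- data.frame(from = orders$orderID,
                             to   = orders$customerID,
                             type = "belongs_to")

    # "subsequent": previous order -> next order, per customer, ordered by time
    orders_sorted <- orders[order(orders$customerID, orders$time), ]
    subsequent <- do.call(rbind, lapply(split(orders_sorted, orders_sorted$customerID),
      function(d) {
        if (nrow(d) < 2) return(NULL)
        data.frame(from = head(d$orderID, -1),
                   to   = tail(d$orderID, -1),
                   type = "subsequent")
      }))

    edges <- rbind(belongs_to, subsequent)
    g <- graph_from_data_frame(edges, directed = TRUE)
    E(g)$type   # edge types distinguish the two relations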
For a report I have to find association rules in a data set of transactions. I downloaded this data set:
http://archive.ics.uci.edu/ml/datasets/online+retail
Then I deleted some columns, converted the values to nominal, normalized them, and got this: https://ufile.io/gz3do
So I thought I had a data set of transactions on which I could use FP-growth and Apriori, but I'm not getting any rules.
It just tells me: No rules found!
Can someone please explain to me if and what I'm doing wrong?
One reason could be that your support and/or confidence values are too high; try lower ones, e.g. a support and confidence of 0.001%. Another reason could be that your data set simply doesn't contain any association rules at the chosen minimum support and confidence. Try another data set that definitely contains association rules at those thresholds.
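For example, with the arules package in R you can start from very low thresholds and tighten them once rules begin to appear. A minimal sketch, assuming the UCI Online Retail spreadsheet has been exported to a CSV with InvoiceNo and Description columns (the file name and column names are assumptions):

    library(arules)

    retail <- read.csv("OnlineRetail.csv", stringsAsFactors = FALSE)

    # One transaction per invoice, items identified by product description
    trans <- as(split(retail$Description, retail$InvoiceNo), "transactions")

    # Very low thresholds first, then tighten once you see rules appearing
    rules <- apriori(trans,
                     parameter = list(supp = 0.001, conf = 0.1, minlen = 2))

    inspect(head(sort(rules, by = "lift"), 10))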
I have a data set with a set of users and a history of documents they have read, all the documents have metadata attributes (think topic, country, author) associated with them.
I want to cluster the users based on their reading history per one of the metadata attributes associated with the documents they have clicked on. This attribute has 7 possible categorical values and I want to prove a hypothesis that there is a pattern to the users' reading habits and they can be divided into seven clusters. In other words, that users will often read documents based on one of the 7 possible values in the particular metadata category.
Does anyone have any advice on how to do this, especially in R (e.g., specific packages)? I realize that the standard k-means algorithm won't work well in this case, since the data is categorical rather than numeric.
Cluster analysis cannot be used to prove anything.
The results are highly sensitive to normalization, feature selection, and the choice of distance metric, so no single result is trustworthy and most results you get are outright useless. It's about as reliable as a proof by example.
It should only be used for exploratory analysis, i.e., to find patterns that you then need to study with other methods.
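If you do go ahead with an exploratory clustering, one possible sketch is to describe each user by the proportion of their reads falling in each of the 7 category values and then cluster those profiles; the toy data and column names below are assumptions, and klaR::kmodes is an alternative for purely categorical rows:

    library(cluster)   # pam() for k-medoids clustering

    history <- data.frame(
      userID   = c("u1", "u1", "u1", "u2", "u2", "u3"),
      category = c("sports", "sports", "politics", "tech", "tech", "politics")
    )

    # User-by-category proportion matrix
    counts   <- table(history$userID, history$category)
    profiles <- prop.table(counts, margin = 1)

    # k-medoids with up to 7 clusters (matching the hypothesised number);
    # capped below the number of users so the toy example still runs.
    set.seed(1)
    fit <- pam(as.matrix(profiles), k = min(7, nrow(profiles) - 1))
    fit$clustering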
I'm building a model for sales in R and integrating it into Tableau so I can compare the predictions with the actual values. The integration works by creating a calculated field that uses Tableau's SCRIPT_REAL function to return the model's predicted value for each record. The records all come from a MySQL database connection. The issue I'm having comes from using factors in my model (for example, month).
If I want to group all of the predictions by day of week, Tableau can't perform the calculation because it tries to aggregate each field I'm using before passing it into the model. When it tries to aggregate month, not all of the values are the same, so it returns "" instead. Obviously no prediction can then be produced, because there is no value associated with "". Essentially what I'm trying to do is get a prediction value for each record that I have, and then aggregate those prediction values in various ways.
Okay, now I understand a bit better what you're talking about. A .twbx with dummy data (and even a dummy model that reproduces the problem you're facing) would help even more, but let me try to say a couple of things.
One thing that is important to understand is that SCRIPT functions are like table calculations, i.e., they operate only on aggregated fields, they are computed last (after all aggregations, measures, and regular calculations), and you can define the level of aggregation you want.
So, if you want to display values on a daily basis, put your date field on the page, go to the day level, and partition the calculation by DAY(date_field). If you want it by week, same thing.
I find table calculations (including R scripts) very useful when they are an end in themselves, i.e., the calculation is the answer. They are not so useful (or rather, not so easily manipulated) when they are a means to an end, like an intermediate step before a final calculation that gets you to the answer. That is mainly because the level of aggregation is based on the fields that are on the page. So, for instance, if I have multiple orders from different clients and want to assess the average order value per customer, a table calculation is great: WINDOW_AVG(SUM(order_value)) partitioned by customer. If, for some reason, I then want to sum all those values, it gets tricky. I can't do it directly, as the average order value is not stored anywhere and cannot be retrieved unless all the clients are on the page. So what I usually do is create the table with all customers, export it to an .mdb file, and reconnect to that in Tableau.
I said all this because it might be your problem when you say "Tableau can't perform the calculation because it tries to aggregate each field I'm using before passing it into the model". Yes, Tableau does that, and there's nothing you can do about it other than find a way around it. Creating an intermediate table in Tableau, exporting it, and connecting to it again in Tableau might be one answer. Performing the calculations in R, exporting the results, and then connecting to them in Tableau might be another.
But again, without actually seeing what you're trying to do, it's hard to say what you need to do.
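A sketch of that last route, doing the per-record predictions in R and letting Tableau simply aggregate the stored predictions; the model formula, column names, and file names here are assumptions:

    # Fit the model and score every record in R, then hand the result to Tableau.
    sales <- read.csv("sales_history.csv", stringsAsFactors = FALSE)
    sales$month       <- factor(sales$month)         # month as a factor, as in the model
    sales$day_of_week <- factor(sales$day_of_week)

    fit <- lm(sales_amount ~ month + day_of_week, data = sales)

    # One prediction per record, so Tableau can aggregate the predictions any way you like
    sales$predicted_sales <- predict(fit, newdata = sales)

    write.csv(sales, "sales_with_predictions.csv", row.names = FALSE)
    # Connect Tableau to sales_with_predictions.csv (or push the table back into MySQL)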