inverse of market basket analysis with R - r

I want to analyse which items did not go well together, i.e. the inverse of market basket analysis: basically, finding out which items together did not make it out of the queue. I have a situation where a record (containing 13 attributes/columns) is incomplete because of various combinations of missing attributes,
for example: a1, a2, ..., a13.
Any of the above attributes may or may not have a value, but any attribute without a value makes the record incomplete.
Given this, I need to see which combinations of missing attributes occur most often in my record set. Knowing this pattern will help my team prioritize the records that need the most attention.
I see that the Apriori algorithm only takes values that are present, but I need to analyse the combinations that are not occurring. I am sure this problem has been solved before, but I don't see any hints in the forum.
Does anyone have experience with this kind of problem? Or would you suggest another algorithm that I should use? I am using R for this analysis, and there are 218k records in total.

If I understand your situation correctly, you have a dataset in which each item of a case either has a value or does not, and you want association rules only for the cases that have at least one item without a value, and within those cases only for the items without values. The Apriori algorithm is just fine for this purpose, and you don't even need to invert it. The solution lies in the formatting of the dataset: drop the items that have values, and give each item without a value a value equal to the name of that item, e.g. a12. Your dataset then contains only cases with at least one missing item, and only the missing items, which can be identified by their values, i.e. their names. The Apriori algorithm can now extract the frequent itemsets from the formatted dataset, and subsequently the association rules. As for whether you should use another algorithm to extract association rules: yes, use FP-Growth. It is generally much faster than the Apriori algorithm.
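The reformatting step described above can be sketched as follows. This is a minimal illustration in Python rather than R, and the record layout (a dict per record with None for a missing value) and the sample data are assumptions made for the example. It only counts exact combinations of missing attributes; a frequent-itemset miner such as Apriori would additionally report frequent sub-combinations.

```python
from collections import Counter

# Hypothetical sample: each record maps attribute name -> value, None meaning "missing".
records = [
    {"a1": 5, "a2": None, "a3": None},
    {"a1": 7, "a2": None, "a3": None},
    {"a1": None, "a2": 3, "a3": 1},
    {"a1": 2, "a2": 4, "a3": 9},  # complete record, dropped below
]

def missing_itemsets(rows):
    """Turn each incomplete record into the set of attribute names that are missing."""
    itemsets = []
    for row in rows:
        missing = frozenset(name for name, value in row.items() if value is None)
        if missing:  # keep only incomplete records
            itemsets.append(missing)
    return itemsets

def most_common_combinations(rows, top=3):
    """Count exact missing-attribute combinations, most frequent first."""
    counts = Counter(missing_itemsets(rows))
    return [(sorted(combo), n) for combo, n in counts.most_common(top)]

print(most_common_combinations(records))
```

The itemsets returned by `missing_itemsets` are the "transactions" that would be handed to a frequent-itemset algorithm in place of the original records.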

Thanks, that answer helped. I need to analyse all null items in each transaction and see which combination of nulls occurs most often across all the transactions.
I tried replacing all my null values with constants and made some tweaks to the Apriori algorithm to get those constants as the rhs. But I didn't understand how the FP-Growth algorithm could help with this. Can you explain?


Building an index in Excel

Working in Excel365, what would you say is the most resource-effective formula for building an index from percentage changes?
Assume you have a time series of percentage changes of any variable (e.g. daily changes in a stock price) in A2:A1000 in the form of a dynamic array, and you want to build an index starting at 100 in column B. In its simplest form, you would enter 100 in B1, enter B1*(1+A2) in B2 and copy that formula down to (in this case) B1000. But how would you suggest to do this in the most resource effective way, so that B1:B1000, or at least B2:B1000 becomes a dynamic array following the length of A2#, i.e. if A2# is 2345 rows (instead of 999 rows as in the example above), B1# becomes 2346 rows (or B2# 2345 rows if that solution is simpler)?
I do not have access to the values of the underlying variable, only to the percentage change, and I have many columns I need to build indexes for, therefore it is preferable if it is as resource-effective as possible.
Thanks a million for any ideas!
Kindly,
Johan
P.S. Using OFFSET() to get a dynamic array doesn't work, since the calculation is iterative (index value at t+1 is dependent on the index value at t), thus yielding a circular reference error. Instead I have tried BYROW() with LAMBDAs without much success and I'm not convinced that they are very resource-effective anyway. A seemingly simple problem that has thrown me into a dead-end street...
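For reference, the recurrence in question is a running product: each index level is the starting value multiplied by (1 + change) for every change so far. A minimal sketch in Python (not an Excel formula; the function name and sample changes are made up for illustration) shows that it can be computed in one accumulating pass without any self-reference:

```python
from itertools import accumulate

def build_index(pct_changes, start=100.0):
    """Return the index levels: start, then each level = previous level * (1 + change)."""
    return list(
        accumulate(pct_changes, lambda level, change: level * (1 + change), initial=start)
    )

# Hypothetical changes expressed as fractions (0.10 means +10%).
changes = [0.10, -0.50, 0.25]
print(build_index(changes))
```

The result always has one more element than the input, matching the B1# layout described above (start value followed by one level per change).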

Handle a string return from R to Tableau and SPLIT it

I connect Tableau to R and execute an R function for recommending products. When R ends, the return value is a string which will have all products details, like below:
ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA\nF003|NA|PROD_ABC\nF004|NA|PROD_ABC1\nC005|PROD_ABC2|NA\nC005|PRODABC3|PRODABC4
(Each line separated by \n indicating end of line)
On Tableau, I display the calculated field which is as below:
ID|Existing_Prod|Recommended_Prod
C001|NA|PROD008
C002|PROD003|NA
F003|NA|PROD_ABC
F004|NA|PROD_ABC1
C005|PROD_ABC2|NA
C005|PRODABC3|PRODABC4
The above data reaches Tableau through a calculated field as a single string, which I need to split into three columns on the pipe character ('|').
I used Split function on the calculated field :
SPLIT([R_Calculated_Field],'|',1)
SPLIT([R_Calculated_Field],'|',2)
SPLIT([R_Calculated_Field],'|',3)
But the error says "SPLIT function cannot be applied on Table calculations", which is self-explanatory. Are there any alternatives to solve this? I searched for best practices for handling the integration between R and Tableau, and all I could find was simple k-means clustering code.
Make sure you understand how partitioning and addressing work for table calcs. Table calcs pass vectors of arguments to the R script and receive a single vector in response. The cardinality of those vectors depends on the partitioning of the table calc. You can view that by editing the table calc and clicking Specific Dimensions. The fields that are not checked determine the partitioning, and thus the cardinality of the arguments you send to and receive from R.
This means it might be tricky to map your problem onto this infrastructure. Not necessarily impossible. It was designed to send a series of vector arguments with one cell per partitioning dimension, say, Manufacturer and get back one vector with one result per Manufacturer (or whatever combination of fields partition your data for the table calc). Sounds like you are expecting an arbitrary length list of recommendations. It shouldn’t be too hard to have your R script turn the string into a vector before returning, but the size of the vector has to make sense.
As an example of an approach that fits this model more easily, say you had a Tableau view that had one row per Product (and you had N products) - and some other aggregated measure fields in the view per Product. (In Tableau speak, the view’s level of detail is at the Product level.)
It would be straightforward to pass those measures as a series of argument vectors to R - each vector having N values, and then have R return a vector of reals of length N where the value returned at each location was a recommender score for the product at that position. (Which is why the ordering aka addressing of the vectors also matters)
Then you could filter out low scoring products from the view and visually distinguish highly recommended products.
So the first step to understanding R integration is to understand how table calcs operate with partitioning and addressing and to think in terms of vectors of fixed lengths passed in both directions.
If this model doesn’t support your use case well, you might be able to do something useful with URL actions or the JavaScript API.
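To illustrate the idea of turning the returned string into vectors before handing it back, here is a minimal sketch. It is written in Python rather than R purely for illustration, and the function name is hypothetical; in the real setup the equivalent logic would live inside the R script, and each returned vector would still have to match the length Tableau expects for the partition.

```python
def split_columns(payload):
    """Split a newline- and pipe-delimited payload into one list per column."""
    lines = [line for line in payload.split("\n") if line]
    header = lines[0].split("|")
    rows = [line.split("|") for line in lines[1:]]
    # One vector per header name, all of the same length (the number of data rows).
    return {name: [row[i] for row in rows] for i, name in enumerate(header)}

payload = "ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA"
columns = split_columns(payload)
print(columns["Recommended_Prod"])
```

Each value in `columns` is a fixed-length vector, which is the shape the table calc model described above is built around.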

filter by variable value in report block

I could find no relevant answers in either StackOverflow or google. Perhaps one of you has the answer. This is a continuation from a previous question...
In Business Objects Webi, I have two variables. One dimension, one measure. My goal is to create a simple pie chart.
Here's the dimension variable titled "EWFMCodeSelect":
`=If([Code]InList("BRK1"; "BRK2"; "BRK3" )) Then"BREAK"
ElseIf([Code]InList("TEAM"; "MTG"; "MTNG"; "PROJ"; "TRNG";"WCGB")) Then "DISC"
ElseIf([Code]InList("LUNCH")) Then "LUNCH"
ElseIf([Code]InList("LATE";"NOSHOW";"UNPAID";"UPVAC")) Then "MISS"
ElseIf([Code]InList("COACH";"VTO")) Then "NEUTR"
ElseIf([Code]InList("VAC";"LOA";"SICKUP";"SICKPL")) Then "NODISC"
ElseIf([Code]InList("PREP")) Then "OTHER"
ElseIf([Code]InList("OVER")) Then "OVER"
Else("SHIFT")`
This is the measure variable titled EWFMPieChart(%):
=[TimeDiff (ToInt)]
/ NoFilter(( Sum([TimeDiff (ToInt)]
ForAll([EWFMCodeSelect])
Where ([EWFMCodeSelect] = "SHIFT")))ForEach())
The previous advice I received was to filter the value "SHIFT" from the report block. I thought this would be a simple affair, but it's proving more difficult than anticipated. I tried creating Report Block filters in the Analysis tab ("EWFMCodeSelect Not Equal To SHIFT" and "EWFMCodeSelect Not In List > SHIFT"), but only ended up with a single row, the dimension field empty and the measure field showing #MULTIVALUE. I tried a variety of other combinations, but all had the same effect.
I tried a Column filter:
=[EWFMCodeSelect] Where ([EWFMCodeSelect] <> "SHIFT")
but ended up with a single row, the dimension field showing: "BREAKDISCLUNCH..." as the value and the measure, again, showing #MULTIVALUE.
I'm missing some important clue here. Can anyone explain why this approach is incorrect, and maybe point me in a direction that achieves my goal?
Thanks,
mfc
I don't have the correct answer, but I did solve the problem by cleaning up all of the TEST variables, deleting all the unused and unnecessary clutter that had collected up to this point, and re-running the report as a scheduled item. I also cleared the browser cache (never bad advice).
After re-opening the report, I was able to filter the report block without issue.
I guess the answer is "When in doubt and receiving undocumented results, clean up your work-space and try again".

What is the best way to determine what articles are available for a given usenet group?

I was wondering what the most efficient way is to get the available articles for a given nntp group. The method I have implemented works as follows:
(i) Select the group:
GROUP group.name.subname
(ii) Get a list of article numbers from the group (pushed back into a vector 'codes'):
LISTGROUP
(iii) Loop over codes and grab articles (e.g. headers)
for code in codes do
HEAD code
end
However, this doesn't scale well with large groups with many article codes.
In RFC 3977, the GROUP command is indicated as also returning the 'low' and 'high' article numbers. For example,
[C] GROUP misc.test
[S] 211 1234 3000234 3002322 misc.test
where 3000234 and 3002322 are the low and high numbers. I'm therefore thinking of using these rather than initially pushing back all article codes. But can these numbers be relied upon? Is 3000234 definitely the first article id in the above-selected group, and likewise is 3002322 definitely the last article id in that group, or are they just estimates?
Many thanks,
Ben
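A small sketch of reading the GROUP response shown above. It only parses the status line (no network access), and the class and function names are made up for the example. The sample line itself shows that the count and the numeric range need not agree, so not every number between low and high necessarily corresponds to an existing article.

```python
from typing import NamedTuple

class GroupStatus(NamedTuple):
    count: int  # number of articles reported by the server
    low: int    # reported low article number
    high: int   # reported high article number
    name: str   # group name

def parse_group_response(line):
    """Parse a '211 count low high group' status line from a GROUP command."""
    parts = line.split()
    if len(parts) < 5 or parts[0] != "211":
        raise ValueError(f"unexpected GROUP response: {line!r}")
    return GroupStatus(int(parts[1]), int(parts[2]), int(parts[3]), parts[4])

status = parse_group_response("211 1234 3000234 3002322 misc.test")
span = status.high - status.low + 1  # how many numbers the range covers
print(status.count, span)
```

Here the range covers more numbers than the reported count, so looping blindly from low to high would hit numbers with no article behind them.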
It turns out I was thinking about this all wrong. All I need to do is
(i) set the group using GROUP
(ii) execute the NEXT command followed by HEAD for however many headers I want (up to count):
for c : count do
articleId <-- NEXT
HEAD articleId
end
EDIT: I'm sure there must be a better way but until anyone suggests otherwise I'll assume this way to be the most effective. Cheers.

Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, TVs etc.) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"
should all match "Canon PowerShot A20 IS". I've worked with levenshtein distance with some added heuristics (removing obvious common words, assigning higher cost to number changes etc), which works to some extent, but not well enough unfortunately.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (OK, soundex might help to distinguish the T and R in this case, but the names might as well be 400T and 400R), while the first and the third are quite far from each other as strings, but are the same product.
Obviously, the matching algorithm cannot be 100% precise; my goal is to automatically match around 80% of the names with high confidence.
Any ideas or references are much appreciated.
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything obviously chaff, like maybe New is actually common but not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
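The "percentage of shared keywords" comparison could look roughly like this. It is a sketch only: the keyword list is a hypothetical hand-edited one, the hyphen stripping is an added assumption so that "T-400" and "T400" produce the same token, and the ratio used is shared keywords over all keywords found in either name.

```python
# Hypothetical hand-edited keyword list; "new", "core", "duo" are treated as chaff.
KEYWORDS = {"lenovo", "t400", "r400"}

def keywords_of(raw_name):
    """Return the key words found in a raw product name."""
    tokens = raw_name.lower().replace("-", "").split()
    return {token.strip(",") for token in tokens} & KEYWORDS

def shared_keyword_ratio(name_a, name_b):
    """Shared keywords divided by all keywords seen in either name (0.0 if none)."""
    keys_a, keys_b = keywords_of(name_a), keywords_of(name_b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0

print(shared_keyword_ratio("Lenovo T400", "New Lenovo T-400, Core 2 Duo"))
print(shared_keyword_ratio("Lenovo T400", "Lenovo R400"))
```

With this list, the first and third names from the question score higher than the first and second, which is the ordering plain edit distance gets wrong.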
Not a perfect solution by any stretch, but I don't think you are expecting one?
The key understanding here is that you do have a proper distance metric. That is in fact not your problem at all. Your problem is in classification.
Let me give you an example. Say you have 20 entries for the Foo X1 and 20 for the Foo Y1. You can safely assume they are two groups. On the other hand, if you have 39 entries for the Bar X1 and 1 for the Bar Y1, you should treat them as a single group.
Now, the distance X1 <-> Y1 is the same in both examples, so why is there a difference in the classification? That is because Bar Y1 is an outlier, whereas Foo Y1 isn't.
The funny part is that you do not actually need to do a whole lot of work to determine these groups up front. You simply do a recursive classification. You start out with a node per group, and then add a supernode for the two closest nodes. In the supernode, store the best assumption, the size of its subtree and the variation in it. As many of your strings will be identical, you'll soon get large subtrees with identical entries. Recursion ends with a single supernode containing everything at the root of the tree.
Now map the canonical names against this tree. You'll quickly see that each will match an entire subtree. Now, use the distances between these trees to pick the distance cutoff for that entry. If you have both Foo X1 and Foo Y1 products in the database, the cut-off distance will need to be lower to reflect that.
edg's answer is in the right direction, I think - you need to distinguish key words from fluff.
Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a CPU OEM package.
If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
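One way such an equivalence class could be expressed is a normalisation rule that glues a short letter prefix to the digits that follow it. The pattern below is an assumption chosen for the examples in this thread, not a general solution; it would also merge unrelated short words that happen to precede a number.

```python
import re

# Assumed rule: 1-3 letters, an optional space or hyphen, then 2+ digits form one model token.
_MODEL = re.compile(r"\b([A-Za-z]{1,3})[\s-]?(\d{2,})\b")

def normalize_models(name):
    """Rewrite variants such as 'T-400' or 'T 400' to the single form 'T400'."""
    return _MODEL.sub(lambda m: m.group(1).upper() + m.group(2), name)

for variant in ("T-400", "T400", "T 400"):
    print(normalize_models(variant))
```

Running the normalisation on both the raw names and the canonical names before any distance calculation puts all three spellings in the same class.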
Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.
You might be able to make use of a trigram search for this. I must admit I've never seen the algorithm to implement an index, but have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
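A bare-bones version of trigram comparison, without any index, might look like this. It is a sketch under assumptions: the padding scheme and the use of set overlap (shared trigrams over all trigrams) are one common choice, and the drug names are only sample strings.

```python
def trigrams(text):
    """Return the set of 3-character windows of the padded, lower-cased text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Shared trigrams divided by all trigrams of both strings (1.0 means identical sets)."""
    set_a, set_b = trigrams(a), trigrams(b)
    return len(set_a & set_b) / len(set_a | set_b)

print(trigram_similarity("ibuprofen", "ibuprofin"))    # misspelling
print(trigram_similarity("ibuprofen", "paracetamol"))  # different word
```

A misspelling changes only the few trigrams that touch the wrong letter, so the score stays high, whereas unrelated names share almost none.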
This is a problem of record linkage. The dedupe python library provides a complete implementation, but even if you don't use python, the documentation has a good overview of how to approach this problem.
Briefly, within the standard paradigm, this task is broken into three stages:
1. Compare the fields, in this case just the name. You can use one or more comparators for this, for example an edit distance like the Levenshtein distance, or something like the cosine distance that compares the number of common words.
2. Turn an array of distance scores into a probability that a pair of records are truly about the same thing.
3. Cluster those pairwise probability scores into groups of records that likely all refer to the same thing.
You might want to create logic that ignores the letter/number combination of model numbers (since they're nigh always extremely similar).
I don't have any experience with this type of problem, but I think a very naive implementation would be to tokenize the search term and search for matches that contain any of the tokens.
"Canon PowerShot A20 IS", for example, tokenizes into:
Canon
Powershot
A20
IS
which would match each of the other items you want to show up in the results. Of course, this strategy will likely produce a whole lot of false matches as well.
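The naive any-token match can be sketched in a few lines. The function names are made up, and splitting on non-alphanumeric characters is an assumption about what counts as a token.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return {token for token in re.split(r"[^a-z0-9]+", text.lower()) if token}

def matches_any_token(search_term, candidate):
    """True if the candidate shares at least one token with the search term."""
    return bool(tokenize(search_term) & tokenize(candidate))

canonical = "Canon PowerShot A20 IS"
for name in ("Canon PowerShot a20IS", "NEW powershot A20 IS from Canon", "Canon EOS 5D"):
    print(name, matches_any_token(canonical, name))
```

The last candidate matches purely on "canon", which is the kind of false match mentioned above.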
Another strategy would be to store "keywords" with each item, such as "camera", "canon", "digital camera", and searching based on items that have matching keywords. In addition, if you stored other attributes such as Maker, Brand, etc., you could search on each of these.
Spell checking algorithms come to mind.
Although I could not find a good sample implementation, I believe you can modify a basic spell-checking algorithm to come up with satisfactory results, i.e. by working with words as the unit instead of characters.
The bits and pieces left in my memory:
Strip out all common words (a, an, the, new). What is "common" depends on context.
Take the first letter of each word and its length and make that a word key.
When a suspect word comes up, look for words with the same or a similar word key.
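One possible reading of the word-key idea, as a sketch: the key format (first letter followed by the length) and the common-word list are assumptions, since the steps above are recalled from memory rather than taken from a reference implementation.

```python
# Assumed stop list; what counts as "common" depends on context.
COMMON_WORDS = {"a", "an", "the", "new"}

def word_key(word):
    """Build a key from the first letter and the length of the word, e.g. 'Canon' -> 'c5'."""
    return f"{word[0].lower()}{len(word)}"

def name_keys(name):
    """Word keys for every non-common word in a product name, in order."""
    return [word_key(word) for word in name.split() if word.lower() not in COMMON_WORDS]

print(name_keys("the new Canon PowerShot"))
print(word_key("Cannon"))  # a misspelling lands on a neighbouring key
```

Words whose keys share the first letter and differ by one in length would be the "similar" candidates to compare more closely.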
It might not solve your problems directly... but you say you were looking for ideas, right?
:-)
That is exactly the problem I'm working on in my spare time. What I came up with is:
based on keywords narrow down the scope of search:
in this case you could have some hierarchy:
type --> company --> model
so that you'd match
"Digital Camera" for a type
"Canon" for company and there you'd be left with much narrower scope to search.
You could work this down even further by introducing product lines etc.
But the main point is, this probably has to be done iteratively.
We can use the Datadecision service for matching products.
It will allow you to automatically match your product data using statistical algorithms. This operation is done after defining a threshold score of confidence.
All data that cannot be automatically matched will have to be manually reviewed through a dedicated user interface.
The online service uses lookup tables to store synonyms as well as your manual matching history. This allows you to improve the data matching automation next time you import new data.
I worked on the exact same thing in the past. What I did was use an NLP method, a TF-IDF vectorizer, to assign weights to each word. For example, in your case:
Canon PowerShot a20IS
Canon --> weight = 0.05 (not a very distinguishing word)
PowerShot --> weight = 0.37 (can be distinguishing)
a20IS --> weight = 0.96 (very distinguishing)
This will tell your model which words to care about and which to ignore. I had quite good matches thanks to TF-IDF.
But note this: a20IS cannot be recognized as a20 IS, so you may want to use some kind of regex to handle such cases.
After that, you can use a numeric calculation like cosine similarity.
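A self-contained sketch of the TF-IDF weighting followed by cosine similarity, written without any library so the arithmetic is visible. The corpus is hypothetical, the smoothed IDF formula is one common variant, and the resulting weights will not match the illustrative numbers quoted above.

```python
import math
from collections import Counter

def tokenize(name):
    return name.lower().split()

def idf_weights(corpus):
    """Smoothed inverse document frequency: rarer tokens get larger weights."""
    n = len(corpus)
    doc_freq = Counter(tok for name in corpus for tok in set(tokenize(name)))
    return {tok: math.log((1 + n) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def tfidf_vector(name, idf):
    """Sparse vector token -> term count * idf; tokens unseen in the corpus get weight 0."""
    return {tok: count * idf.get(tok, 0.0) for tok, count in Counter(tokenize(name)).items()}

def cosine(u, v):
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical canonical names.
corpus = ["canon powershot a20 is", "canon powershot a40", "canon eos 5d", "lenovo t400"]
idf = idf_weights(corpus)
query = tfidf_vector("Canon PowerShot A20 IS", idf)
for name in corpus:
    print(name, round(cosine(query, tfidf_vector(name, idf)), 3))
```

Because "canon" appears in most names it receives a lower weight than "a20", so sharing only the brand contributes little to the score. As noted above, "a20IS" would still be a different token from "a20" and "is" unless it is split first.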
