I understand that this site is probably overkill for my simple question, but I can't find the answer online at all.
If I have several ranked lists, like five people's list of their top ten movies...how do I find the average rank of the individual items (movies) when each list does not necessarily have all of the same items on it? I am trying to do this in Excel 2007.
Thanks!
Not sure what you mean by "each list does not necessarily have all of the same items on it", but you probably want to start by looking at Excel's AVERAGE function:
http://office.microsoft.com/en-us/excel-help/average-HP005208994.aspx
They go over some basic and more complex usage there.
Related
It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured- for example, the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (whilst trying to do the code I have restricted myself to just 3 years. The illustration of the structure is based on this test). The number of years captured will change overtime- generally it will increase.
The number of policies will fluctuate, I've just labelled them policy 1, 2 etc for sensitivity reasons and limited the number whilst testing the code. Again, I have limited the number to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column, and the grouped data would become the row. Then expand the data to get the metrics repeated for each year. A friend of mine who does coding (but has never used R) has suggested using loops might be a better way forward. Again, I am not sure of the best approach so welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer but this appears to be a summarise tool and I am not trying to summarise the data rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks
Not sure if someone could help me with this problem.
I have 5 lists of values of different lengths.
Note: Same value can be presence in different lists.
Does anyone know how to get the combination of 3 lists that will provide more total unique values?
Thanks in advance,
Miguel
I do not really have an answer to your question, which seems to be more of a combinatorics question than programming. My sense is that if you want an exact solution you will have to try all the possible combinations of subsets of 3 lists out of 5 (there are 10 of them). One thing to remember if you go that way is that if you want the number of unique elements of the concatenation of 3 lists you do not have necessarily to do length(unique(c(l1,l2,l3)) which I imagine could be inefficient if you have very long lists. You can use the formula for the size of the intersection of 3 sets, which you can find for example at https://math.stackexchange.com/questions/669249/probability-of-the-union-of-3-events .
This will require you only to compute the length of all the possible intersections of the lists. it could be a completely academic exercise: as I said, I am not offering an answer but if you are not familiar with that formula it is worth reading it, since it is relevant to the problem of finding the size of a set.
I am currently attempting to implement a trading idea that I have been playing around with. It consists of 50+ securities and has a strategy very similar to this one. (Current package I am using is quantmod).
http://www.r-bloggers.com/backtesting-a-simple-stock-trading-strategy/
For those who aren't interested in clicking, it is a strategy that will look at the pass X days( in his case 200 ) and enter a position depending on the peak reached in the stock. I understand how to do this strategy for my idea, but I cannot grasp how to aggregate my data into one summary.
Is there a way I can consolidate the summary for all the positions I have entered into one larger portfolio summary and chart that against the S&P 500?
Any advice on where I can find resources or being lead to the information. I have looked at portfolio analysis package for R and I do not believe that will be much help to me.
Thank you in advance.
Edit: In the link, at the bottom, there are 3 indexes that are FTSE, N225, DJIA. Could i combine those 3 summaries to show the same output as below, BUT combined
FTSE:
Me Index
Cumulative Return 3.56248582 3.8404476
Annual Return 0.05667121 0.0589431
Annualized Sharpe Ratio 0.45907768 0.3298633
Win % 0.53216374 0.5239884
Annualized Volatility 0.12344579 0.1786895
Maximum Drawdown -0.39653398 -0.5256991
Max Length Drawdown 1633.00000 2960.0000
Could I get that same output but for the 3 securities data combined? Is there a effective way of doing that. Thank you so much. Happy holidays
It's a little unclear to me what you mean by "combine" in this case. If you want a single column representing the combined returns from all three exchanges as if they were a single unified market, that's really tricky, because the exchanges trade in different currencies (British pounds; U.S. dollars, Japanese Yen, etc.). The underlying analysis would have to be modified substantially to take into account fluctuating daily foreign exchange rates.
I suspect that this is NOT want you want. Rather, you are simply asking how to take three sequential two-column outputs and turn them into a single parallel six-column output.
If that is indeed what you want, then you need to rewrite the testStrategy() function shown near the bottom of the link. As it's currently written, that function takes three inputs: an index name myStock (with allowed values of FTSE, DJIA, or N225), and two integer values, nHold and nHigh. You would need to change it so that it instead accepts five inputs; e.g., myStockA, myStockB and myStockC, plus the two integer values already mentioned. Then each of the lines currently referring to myStock would have to be replicated three times. Finally, the two cbind() lines that you see at the bottom would have to be modified so that instead of merging the data together into only two columns, you include all six.
For a good intro tutorial on how to write and modify your own R functions, please see this. To understand how to use the cbind() function, which you will have to call with six rather than two inputs, please see this.
I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.
I have a few related questions:
Is the tm package appropriate for this sort of task?
Is there a faster alternative to agrep? (Said function uses the
Levenshtein edit distance which is anecdotally slow.)
Are there other suitable tools in R, apart from agrep and tm?
Should I even be doing this in R, or should this sort of thing be
done directly in the database? (It's an Access database, so I'd
rather avoid touching it if possible.)
If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that use my own weighting scheme (also, as it is, you can't use soundex() for big data sets with RecordLinkage).
If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters in each list or the first three letters and the first three letters in the acronym). Then I calculate match scores using the above functions.
You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
Maybe google refine could help. It looks maybe more fitted if you have lots of exceptions and you don't know them all yet.
What you're doing is called record linkage, and it's been a huge field of research over many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.
These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.
In your case probably something like edit-distance calculation would work, but if you need to find near duplicates in larger text based documents, you can try
http://www.softcorporation.com/products/neardup/
I need to automatically match product names (cameras, laptops, tv-s etc) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"
should all match "Canon PowerShot A20 IS". I've worked with levenshtein distance with some added heuristics (removing obvious common words, assigning higher cost to number changes etc), which works to some extent, but not well enough unfortunately.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (ok, soundex might help to disinguish the T and R in this case, but the names might as well be 400T and 400R), the first and the third are quite far from each other as strings, but are the same product.
Obviously, the matching algorithm cannot be a 100% precise, my goal is to automatically match around 80% of the names with a high confidence.
Any ideas or references is much appreciated
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything obviously chaff, like maybe New is actually common but not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
Not a perfect solution by any stretch, but I don't think you are expecting one?
The key understanding here is that you do have a proper distance metric. That is in fact not your problem at all. Your problem is in classification.
Let me give you an example. Say you have 20 entries for the Foo X1 and 20 for the Foo Y1. You can safely assume they are two groups. On the other hand, if you have 39 entries for the Bar X1 and 1 for the Bar Y1, you should treat them as a single group.
Now, the distance X1 <-> Y1 is the same in both examples, so why is there a difference in the classification? That is because Bar Y1 is an outlier, whereas Foo Y1 isn't.
The funny part is that you do not actually need to do a whole lot of work to determine these groups up front. You simply do an recursive classification. You start out with node per group, and then add the a supernode for the two closest nodes. In the supernode, store the best assumption, the size of its subtree and the variation in it. As many of your strings will be identical, you'll soon get large subtrees with identical entries. Recursion ends with the supernode containing at the root of the tree.
Now map the canonical names against this tree. You'll quickly see that each will match an entire subtree. Now, use the distances between these trees to pick the distance cutoff for that entry. If you have both Foo X1 and Foo Y1 products in the database, the cut-off distance will need to be lower to reflect that.
edg's answer is in the right direction, I think - you need to distinguish key words from fluff.
Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a a CPU OEM package.
If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.
You might be able to make use of a trigram search for this. I must admit I've never seen the algorithm to implement an index, but have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
This is a problem of record linkage. The dedupe python library provides a complete implementation, but even if you don't use python, the documentation has a good overview of how to approach this problem.
Briefly, within the standard paradigm, this task is broken into three stages
Compare the fields, in this case just the name. You can use one or more comparator for this, for example an edit distance like the Levenshtein distance or something like the cosine distance that compares the number of common words.
Turn an array fo distance scores into a probability that a pair of records are truly about the same thing
Cluster those pairwise probability scores into groups of records that likely all refer to the same thing.
You might want to create logic that ignores the letter/number combination of model numbers (since they're nigh always extremely similar).
Not having any experience with this type of problem, but I think a very naive implementation would be to tokenize the search term, and search for matches that happen to contain any of the tokens.
"Canon PowerShot A20 IS", for example, tokenizes into:
Canon
Powershot
A20
IS
which would match each of the other items you want to show up in the results. Of course, this strategy will likely produce a whole lot of false matches as well.
Another strategy would be to store "keywords" with each item, such as "camera", "canon", "digital camera", and searching based on items that have matching keywords. In addition, if you stored other attributes such as Maker, Brand, etc., you could search on each of these.
Spell checking algorithms come to mind.
Although I could not find a good sample implementation, I believe you can modify a basic spell checking algorithm to comes up with satisfactory results. i.e. working with words as a unit instead of a character.
The bits and pieces left in my memory:
Strip out all common words (a, an, the, new). What is "common" depends on context.
Take the first letter of each word and its length and make that an word key.
When a suspect word comes up, looks for words with the same or similar word key.
It might not solve your problems directly... but you say you were looking for ideas, right?
:-)
That is exactly the problem I'm working on in my spare time. What I came up with is:
based on keywords narrow down the scope of search:
in this case you could have some hierarchy:
type --> company --> model
so that you'd match
"Digital Camera" for a type
"Canon" for company and there you'd be left with much narrower scope to search.
You could work this down even further by introducing product lines etc.
But the main point is, this probably has to be done iteratively.
We can use the Datadecision service for matching products.
It will allow you to automatically match your product data using statistical algorithms. This operation is done after defining a threshold score of confidence.
All data that cannot be automatically matched will have to be manually reviewed through a dedicated user interface.
The online service uses lookup tables to store synonyms as well as your manual matching history. This allows you to improve the data matching automation next time you import new data.
I worked on the exact same thing in the past. What I have done is using an NLP method; TF-IDF Vectorizer to assign weights to each word. For example in your case:
Canon PowerShot a20IS
Canon --> weight = 0.05 (not a very distinguishing word)
PowerShot --> weight = 0.37 (can be distinguishing)
a20IS --> weight = 0.96 (very distinguishing)
This will tell your model which words to care and which words to not. I had quite good matches thanks to TF-IDF.
But note this: a20IS cannot be recognized as a20 IS, you may consider to use some kind of regex to filter such cases.
After that, you can use a numeric calculation like cosine similarity.