I have a dataset like the following picture and want to read and extract each cell and assign them into parameters in an optimization model. For example, considering only one part of a row:
ID, Min, speed, Distance, Time Latitude, Longitude
1 2506 23271 11.62968 17.7 -37.731 144.898
Every row depicts a persons' information. So, is it better to define a dictionary of person and put all these into that? Or is it better to define a tuple? Arrays(like below)?
for i in 1:n_people
person_id = i
push!(requests, Request(ID[i], Min[i], speed[i], Distance[i], Latitude[i], Longitude[i]))
end
In any case, how can I access (extract), let's say, distance for that person?
I mean, I need to have a set of people in my model like
people[i] and then for each of these connect them to their information (model parameters) like person's 'i' distance, speed,... and then compare them with person j.
What is the best way to do that?
Since JuMP is agnostic to the format of the input data, the answer is: it depends on what you want to do with it. Pick whatever makes the most sense to you.
There are a few data related tutorials that address how to read data into a DataFrame and use that to create JuMP variables and constraints. Reading those is a good next step:
https://jump.dev/JuMP.jl/stable/tutorials/getting_started/getting_started_with_data_and_plotting/
https://jump.dev/JuMP.jl/stable/tutorials/linear/diet/
Let me say that I have a below Data Frame in R with 500 player records with the following columns
PlayerID
TotalRuns
RunRate
AutionCost
Now out of the 500 players, I want my code to give me multiple combinations of 3 players that would satisfy the following criteria. Something like a Moneyball problem.
The sum of auction cost of all the 3 players shouldn't exceed X
They should have a minimum of Y TotalRuns
Their RunRate must be higher than the average run rate of all the players.
Kindly help with this. Thank you.
So there are choose(500,3) ways to choose 3 players which is 20,708,500. It's not impossible to generate all these combinations combn might do it for you, but I couldn't be bothered waiting to find out. If you do this with player IDs and then test your three conditions, this would be one way to solve your problem. An alternative would be a Monte Carlo method. Select three players that initially satisfy your conditions. Randomly select another player who doesn't belong to the current trio, if he satisfies the conditions save the combination and repeat. If you're optimizing (it's not clear but your question has optimization in the tag), then the new player has to result in a new trio that's better than the last, so if he doesn't improve your objective function (whatever it might be), then you don't accept the trade.
choose(500,3)
Shows there are almost 21,000,000 combinations of 3 players drawn from a pool of 500 which means a complete analysis of the entire search space ought to be reasonably doable in a reasonable time on a modern machine.
You can generate the indeces of these combinations using iterpc() and getnext() from the iterpc package. As in
# library(iterpc) # uncomment if not loaded
I <- iterpc(5, 3)
getnext(I)
You can also drastically cut the search space in a number of ways by setting up initial filtering criteria and/or by taking the first solution (while loop with condition = meeting criterion). Or, you can get and rank order all of them (loop through all combinations) or some intermediate where you get n solutions. And preprocessing can help reduce the search space. For example, ordering salaries in ascending order first will give you the cheapest salary solution first. Ordering the file by descending runs will give you the highest runs solutions first.
NOTE: While this works fine, I see iterpc now is superseded by the arrangements package where the relevant iterator is icombinations(). getnext() is still the access method for succeeding iterators.
Thanks, I used a combination of both John's and James's answers.
Filtered out all the players who don't satisfy the criteria and that boiled down only to 90+ players.
Then I used picked up players in random until all the variations got exhausted
Finally, I computed combined metrics for each variation (set) of players to arrive at the optimized set.
The code is a bit messy and doesn't wanna post it here.
I am asking this here because I couldn't find the answer I am looking for elsewhere and I don't know where else I could ask this. I hope someone can reply without saying that the question is irrelevant to the forum. I have a biology background and I am currently using bioinformatics. I need to understand in lay language hash tables and suffix trees. Something simple, I don't get the O(n) concepts and all that stuff, I think they are both kind of the same: a way to store string data? But I would like to understand better the differences. This will help enormously to other people like me. We are a lot in this field now!
Thanks in advance.
OK, lets use bioinformatics to help illustrate the differences.
Let's say you have several DNA sequences that are pretty long. If we want to store these sequences in a datastructure.
If we want to use a hashtable
A Hashtable is a useful way to store a bunch of objects but very quickly search the datastructure to see if we already contain a particular object.
One bioinformatics usecase that we can solve with a hashtable is de-duping a large sequence set. Let's say we have a huge dataset of next-gen sequenced data and we want to de-duplicate it before we assemble. We can use a hashtable to store the unique sequences. Before inserting any sequences into the hashtable, we can first check to see if it already exists in the hashtable and if it does we skip that read. Only if it is not yet in the hashtable do we add it. Then when we are done the elements in the hash will be the unique sequences.
Hashtables are basically an array of LinkedLists. Each cell in the array we will call a "bin". When we insert or search for something in the hashtable, we have to first know what bin it is in. The way we determine which bin to use is by a hash algorithm.
We have to come up with a hash algorithm. Something that will convert our sequence into a number. A requirement of this equation is the same sequence must always evaluate to the same number. It's OK if different sequences evaluate to the same number (which is called as hash collision) since there are an infinite number of possible sequences and we will only have a limited range of possible number values in our hash.
A simple hash algorithm is to assign a value to each base A =1 G =2 C = 3 T =4 (assume no ambiguities) then we can just sum up the bases in our sequence. This would mean that any sequences with the same number of As, Cs Gs and Ts will have the same hash value. If we wanted, we could also have a more complicated algorithm that also takes position into account so to get the same number we would have to also have the same sequence in the same order.
Once we have our hash algorithm. We can make a hash table by binning the sequences by their hash values. The more bins we have in our table, the fewer hash values per bin. Hashtables are often implemented by an array of LinkedLists. This is a very fast lookup because to see if a sequence is in our hashtable or to add a new sequence to our hash table, we just compute the hash value for the sequence to see what bin it is in, then we only have to look at the values inside that bin. We can ignore the rest of the bins.
suffix tree
A Suffix Tree is a different datastructure which is a graph where each node is (in this case) a residue in our sequence. Edges in the graph will point to the next node etc. So for example if our sequence was ACGT the path in the graph will be A->C->G->T->$. If we had another sequence ACTT the path will be A->C->T->T->$.
We can combine consecutive nodes if there is only 1 path so in the previous example since both sequence start with AC then the paths will be AC->G->T->$and AC->T->T->$.
In bioinformatics this is really useful for substring matching (like finding repetitive regions or primer binding sites etc) since we can easily see where there are subpaths in our graph that match our motif.
Hope that helps
I was wondering what the most efficient way is to get the available articles for a given nntp group. The method I have implemented works as follows:
(i) Select the group:
GROUP group.name.subname
(ii) Get a list of article numbers from the group (pushed back into a vector 'codes'):
LISTGROUP
(iii) Loop over codes and grab articles (e.g. headers)
for code in codes do
HEAD code
end
However, this doesn't scale well with large groups with many article codes.
In RFC 3977, the GROUP command is indicated as also returning the 'low' and 'high' article numbers. For example,
[C] GROUP misc.test
[S] 211 1234 3000234 3002322 misc.test
where 3000234 and 2002322 are the low and high numbers. I'm therefore thinking of using these instead rather than initially pushing back all article codes. But can these numbers be relied upon? Is 3000234 definitely indicative of the first article id in the above-selected group and likewise is 3002322 definitely indicative of the last article id in the above-selected group or are they just estimates?
Many thanks,
Ben
It turns out I was thinking about this all wrong. All I need to do is
(i) set the group using GROUP
(ii) execute the NEXT command followed by HEAD for however many headers I want (up to count):
for c : count do
articleId <-- NEXT
HEAD articleID
end
EDIT: I'm sure there must be a better way but until anyone suggests otherwise I'll assume this way to be the most effective. Cheers.
I need to automatically match product names (cameras, laptops, tv-s etc) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"
should all match "Canon PowerShot A20 IS". I've worked with levenshtein distance with some added heuristics (removing obvious common words, assigning higher cost to number changes etc), which works to some extent, but not well enough unfortunately.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (ok, soundex might help to disinguish the T and R in this case, but the names might as well be 400T and 400R), the first and the third are quite far from each other as strings, but are the same product.
Obviously, the matching algorithm cannot be a 100% precise, my goal is to automatically match around 80% of the names with a high confidence.
Any ideas or references is much appreciated
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything obviously chaff, like maybe New is actually common but not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
Not a perfect solution by any stretch, but I don't think you are expecting one?
The key understanding here is that you do have a proper distance metric. That is in fact not your problem at all. Your problem is in classification.
Let me give you an example. Say you have 20 entries for the Foo X1 and 20 for the Foo Y1. You can safely assume they are two groups. On the other hand, if you have 39 entries for the Bar X1 and 1 for the Bar Y1, you should treat them as a single group.
Now, the distance X1 <-> Y1 is the same in both examples, so why is there a difference in the classification? That is because Bar Y1 is an outlier, whereas Foo Y1 isn't.
The funny part is that you do not actually need to do a whole lot of work to determine these groups up front. You simply do an recursive classification. You start out with node per group, and then add the a supernode for the two closest nodes. In the supernode, store the best assumption, the size of its subtree and the variation in it. As many of your strings will be identical, you'll soon get large subtrees with identical entries. Recursion ends with the supernode containing at the root of the tree.
Now map the canonical names against this tree. You'll quickly see that each will match an entire subtree. Now, use the distances between these trees to pick the distance cutoff for that entry. If you have both Foo X1 and Foo Y1 products in the database, the cut-off distance will need to be lower to reflect that.
edg's answer is in the right direction, I think - you need to distinguish key words from fluff.
Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a a CPU OEM package.
If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.
You might be able to make use of a trigram search for this. I must admit I've never seen the algorithm to implement an index, but have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
This is a problem of record linkage. The dedupe python library provides a complete implementation, but even if you don't use python, the documentation has a good overview of how to approach this problem.
Briefly, within the standard paradigm, this task is broken into three stages
Compare the fields, in this case just the name. You can use one or more comparator for this, for example an edit distance like the Levenshtein distance or something like the cosine distance that compares the number of common words.
Turn an array fo distance scores into a probability that a pair of records are truly about the same thing
Cluster those pairwise probability scores into groups of records that likely all refer to the same thing.
You might want to create logic that ignores the letter/number combination of model numbers (since they're nigh always extremely similar).
Not having any experience with this type of problem, but I think a very naive implementation would be to tokenize the search term, and search for matches that happen to contain any of the tokens.
"Canon PowerShot A20 IS", for example, tokenizes into:
Canon
Powershot
A20
IS
which would match each of the other items you want to show up in the results. Of course, this strategy will likely produce a whole lot of false matches as well.
Another strategy would be to store "keywords" with each item, such as "camera", "canon", "digital camera", and searching based on items that have matching keywords. In addition, if you stored other attributes such as Maker, Brand, etc., you could search on each of these.
Spell checking algorithms come to mind.
Although I could not find a good sample implementation, I believe you can modify a basic spell checking algorithm to comes up with satisfactory results. i.e. working with words as a unit instead of a character.
The bits and pieces left in my memory:
Strip out all common words (a, an, the, new). What is "common" depends on context.
Take the first letter of each word and its length and make that an word key.
When a suspect word comes up, looks for words with the same or similar word key.
It might not solve your problems directly... but you say you were looking for ideas, right?
:-)
That is exactly the problem I'm working on in my spare time. What I came up with is:
based on keywords narrow down the scope of search:
in this case you could have some hierarchy:
type --> company --> model
so that you'd match
"Digital Camera" for a type
"Canon" for company and there you'd be left with much narrower scope to search.
You could work this down even further by introducing product lines etc.
But the main point is, this probably has to be done iteratively.
We can use the Datadecision service for matching products.
It will allow you to automatically match your product data using statistical algorithms. This operation is done after defining a threshold score of confidence.
All data that cannot be automatically matched will have to be manually reviewed through a dedicated user interface.
The online service uses lookup tables to store synonyms as well as your manual matching history. This allows you to improve the data matching automation next time you import new data.
I worked on the exact same thing in the past. What I have done is using an NLP method; TF-IDF Vectorizer to assign weights to each word. For example in your case:
Canon PowerShot a20IS
Canon --> weight = 0.05 (not a very distinguishing word)
PowerShot --> weight = 0.37 (can be distinguishing)
a20IS --> weight = 0.96 (very distinguishing)
This will tell your model which words to care and which words to not. I had quite good matches thanks to TF-IDF.
But note this: a20IS cannot be recognized as a20 IS, you may consider to use some kind of regex to filter such cases.
After that, you can use a numeric calculation like cosine similarity.