Matching specific items in several discrete collections

I have a problem whereby I have several discrete lists of IDs, e.g.
List (A) 1,2,3,4,5,7,8
List (B) 2,3,4,5
List (C) 4,2,8,9,1
etc...
I then have another collection of ID's...
For example: 1,2,4
I need to try to match one collection ID into each list. If I can perfectly match all the IDs in my secondary collection (each collection ID paired with an ID from a different list) then I get a true result.
I have found that this becomes complicated: if you simply iterate over the lists, taking the first collection/list pair you encounter, you may preclude a possible combination further down the line and so return a false negative.
For example:
List (A) 1,2,3,4
List (B) 1,2,3,4
List (C) 3,4
Collection is: 3,1,2
The first ID from the collection (3) matches an entry in list A, and the second ID in the collection (1) matches an item in list B, but the final ID in the collection (2) doesn't match any entry in list C. However, if you rearrange the order of the collection to 2, 1, 3 then a match is found. Therefore I am looking for some form of logic for attempting a match over all possible combinations in an efficient manner(?)
To make it more complicated, the IDs are actually GUIDs, so they can't just be sorted in ascending order.
I hope I have described this well enough to make it clear what I am attempting; with a bit of luck somebody will be able to tell me that what I need to do is very easy and I am missing something really simple!
I am forced to code this in VB6, but any methods or pseudocode would be great. The backend of this is SQL Server, so a T-SQL solution would be even better, as all of the IDs are held in tables already.
Many thanks in advance.

Jake, yep, the lists and the collection both contain GUIDs. I used plain integers to simplify the problem a bit.
Once a list has been matched it can't be searched again, hence the ordering problem I tried to explain. Once a list is flagged as 'matched', no further attempts to match against it are made. It is this very behaviour that can cause a false negative.
'Sending' the collection in every possible order would work, but the number of permutations makes that a massive job...
I feel I must be missing a really straightforward concept or solution here!
Thanks for your assistance so far.

I don't see a way around checking each GUID contained in the lists against each GUID in the collection. You would have to keep a record of which lists each GUID in the collection occurs in.
To use your example of the Collection (3, 1, 2), 3 occurs in List A, B and C.
You will basically be left with this dataset:
3 (A, B, C)
1 (A, B)
2 (A, B)
Once you have distilled it down to this dataset you can determine whether there are any GUIDs with zero occurrences in the lists which would result in a negative.
I am not at all well versed in algorithms, but this is how I would proceed after that:
Start with the first set (A, B, C) and check how many times it occurs further on in the dataset. In this case no other occurrences are found.
Moving on to the next set (A, B): if the number of occurrences of this set were greater than the length of the set, i.e. more than two occurrences, that would mean a negative result. If the number of occurrences matches the length exactly, as is the case here, the set (A, B) can be removed from further consideration, leaving:
3 (C)
1 ()
2 ()
I guess you would repeat the process until a negative is identified or all the occurrences have been excluded. There is probably a recognized algorithm for this sort of problem, but my knowledge is a bit lacking in that respect. :(
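For what it's worth, this is the classic bipartite matching problem: collection IDs on one side, lists on the other, with an edge wherever a list contains an ID. The standard augmenting-path approach (Kuhn's algorithm) deals with exactly the false-negative problem described above, because when an ID cannot find a free list it tries to displace an earlier match. A minimal sketch in Kotlin rather than VB6 (names like canMatchAll are mine, and GUIDs are modelled as plain strings):

fun canMatchAll(ids: List<String>, lists: List<Set<String>>): Boolean {
    // matchedTo[j] = index of the id currently matched to list j, or -1 if free
    val matchedTo = IntArray(lists.size) { -1 }

    fun tryMatch(i: Int, visited: BooleanArray): Boolean {
        for (j in lists.indices) {
            if (!visited[j] && ids[i] in lists[j]) {
                visited[j] = true
                // Take list j if it is free, or if its current owner
                // can be re-routed to some other list.
                if (matchedTo[j] == -1 || tryMatch(matchedTo[j], visited)) {
                    matchedTo[j] = i
                    return true
                }
            }
        }
        return false
    }

    // Every id must find its own list; earlier matches may be displaced.
    return ids.indices.all { i -> tryMatch(i, BooleanArray(lists.size)) }
}

Run against the example above (lists 1,2,3,4 / 1,2,3,4 / 3,4 and collection 3,1,2), this finds the 2,1,3 arrangement automatically: when 2 cannot find a free list it displaces the earlier match for 1, which in turn displaces 3 into list C. There is no obvious way to express the augmenting search in a single T-SQL statement, so pulling the ID/list pairs down and matching client-side is probably the pragmatic route.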

Related

Idiomatic way to split a sequence into three lists using Kotlin

So this is possibly more about functional programming than Kotlin. I am at that stage where a little knowledge is dangerous, and since I wrote the app in Kotlin it seems fair to ask a Kotlin question, as it's Kotlin's structures that I am interested in.
I have a sequence of items, they are in batches of three, so the stream may look like
1,a,+,2,b,*,3,c,&.......
What I want to do is split this into three lists. Currently I do this by partitioning into two lists, one containing the numbers and one containing everything else, then taking the second half of the result (the letters and symbols) and partitioning again into letters and symbols, so that I end up with three lists.
This strikes me as somewhat inefficient; maybe a functional approach isn't the best approach here.
Is there an efficient way of doing this, or are my only choices this approach or a for loop?
Thanks
You can use the groupBy method to group the elements of your sequence by element type:
val elementsByType = sequence.groupBy { getElementType(it) }
where getElementType is a function returning the type of an element: whether it is a letter, a number, or a symbol. This function may return either a number, such as 1, 2, 3, or a value of some enum with three different entries.
groupBy returns a map from element type to list of elements of that type.
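For example (ElementType and getElementType here are hypothetical names, written inline to make the snippet self-contained):

enum class ElementType { NUMBER, LETTER, SYMBOL }

fun getElementType(e: Char): ElementType = when {
    e.isDigit()  -> ElementType.NUMBER
    e.isLetter() -> ElementType.LETTER
    else         -> ElementType.SYMBOL
}

fun main() {
    val sequence = sequenceOf('1', 'a', '+', '2', 'b', '*', '3', 'c', '&')
    // One pass over the sequence; the result is a Map<ElementType, List<Char>>.
    val elementsByType = sequence.groupBy { getElementType(it) }
    println(elementsByType[ElementType.NUMBER])  // [1, 2, 3]
    println(elementsByType[ElementType.LETTER])  // [a, b, c]
    println(elementsByType[ElementType.SYMBOL])  // [+, *, &]
}

Unlike the partition-twice approach, this walks the sequence only once.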

Merge sort, the recursion part

After studying the merge sort for a couple of days, I understand it conceptually, but there is one thing that I don't get.
What I get:
1.) It takes a list, for example an array of numbers, splits it in half, sorts the two halves, and in the end merges them together.
2.) Because it's a recursive algorithm, it uses recursion to do that.
So it splits the array until there is only one item in each list, at which point each list is considered sorted. And at that point the merge steps in.
What I don't get is, how does the recursion "know" after it splits all the lists to only one item in a list, to get back up the recursion tree? How does something that has a left and right side become the left side after it merges?
The thing that bothers me is this (the code in question is the merge sort from the interactivepython page): how does the code get from the point where lefthalf = [2] and righthalf = [1] to the point where lefthalf = [1,2] and righthalf = [4,3], without going back into the recursion that would divide what we have already merged?
Thanks,
Tom
Once each list contains only one element, each pair of leaves is sorted and joined. Then you can traverse through the list and find out where the next pair should be inserted. The recursion "knows" nothing about going back up the recursion tree; rather, it is the act of sorting and joining that has this effect.
The "recursion" does of course know nothing of that sort. It is the code that uses the recursion, which looks like this (a bit simplified):
sort list = merge (sort left_half) (sort right_half)
  where
    (left_half, right_half) = split list
Here you see that the "recursion" (i.e. the recursive invocations of sort) don't need to "know" anything. Their only job is to deliver a sorted list, array or whatever.
To put it differently: If we have merge satisfying the following invariant:
1. `merge`, given two sorted lists, will return a sorted list.
then we can write mergesort easily like outlined above. What is left to do in sort is to handle the easy cases: empty list, singleton and list with two elements.
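For comparison, the same structure in Kotlin (a sketch, not a tuned implementation):

fun mergeSort(list: List<Int>): List<Int> {
    // Easy cases: empty and singleton lists are already sorted.
    if (list.size <= 1) return list
    val mid = list.size / 2
    // Each recursive call's only job is to return a sorted half;
    // the merge on the way back out is what "climbs" the tree.
    return merge(mergeSort(list.subList(0, mid)),
                 mergeSort(list.subList(mid, list.size)))
}

fun merge(left: List<Int>, right: List<Int>): List<Int> {
    val result = ArrayList<Int>(left.size + right.size)
    var i = 0
    var j = 0
    while (i < left.size && j < right.size) {
        if (left[i] <= right[j]) result.add(left[i++]) else result.add(right[j++])
    }
    // One side is exhausted; append whatever remains of the other.
    while (i < left.size) result.add(left[i++])
    while (j < right.size) result.add(right[j++])
    return result
}

So [2] and [1] become [1,2] simply because merge runs in the caller that originally split them; nothing ever divides again on the way back up.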
If you are talking about odd-numbered sublists, then it is dependent on the implementation. It either puts the bigger sublist on the left every single time, or it puts it on the right every single time.

hash table suffix tree explanation

I am asking this here because I couldn't find the answer I am looking for elsewhere, and I don't know where else I could ask. I hope someone can reply without saying that the question is irrelevant to the forum. I have a biology background and I am currently working in bioinformatics, and I need to understand hash tables and suffix trees in lay language. Something simple; I don't get the O(n) concepts and all that stuff. I think they are both kind of the same thing: a way to store string data? But I would like to understand the differences better. This will help other people like me enormously. There are a lot of us in this field now!
Thanks in advance.
OK, let's use bioinformatics to help illustrate the differences.
Let's say you have several DNA sequences that are pretty long, and we want to store these sequences in a data structure.
If we want to use a hashtable
A hashtable is a useful way to store a bunch of objects while still being able to very quickly check whether the structure already contains a particular object.
One bioinformatics use case we can solve with a hashtable is de-duplicating a large sequence set. Let's say we have a huge dataset of next-gen sequencing reads and we want to de-duplicate it before we assemble. We can use a hashtable to store the unique sequences: before inserting any sequence, we first check whether it already exists in the hashtable, and if it does we skip that read. Only if it is not yet in the hashtable do we add it. When we are done, the elements in the hashtable will be the unique sequences.
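In code the whole de-duplication step collapses to set insertion, since a HashSet is just a hashtable keyed on the elements themselves. A sketch in Kotlin (dedupe is a made-up name):

// De-duplicate reads: HashSet is a hashtable keyed on the sequence string,
// so each lookup-and-insert is O(1) on average.
fun dedupe(reads: Sequence<String>): Set<String> {
    val unique = HashSet<String>()
    for (read in reads) {
        // add() returns false if the read was already present, i.e. the
        // "is it in there already?" check and the insert are one step.
        unique.add(read)
    }
    return unique
}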
Hashtables are basically an array of linked lists. We will call each cell in the array a "bin". When we insert or search for something in the hashtable, we first have to know which bin it is in. The way we determine which bin to use is a hash algorithm.
We have to come up with a hash algorithm: something that will convert our sequence into a number. A requirement is that the same sequence must always evaluate to the same number. It's OK if different sequences evaluate to the same number (which is called a hash collision), since there is an infinite number of possible sequences but only a limited range of possible number values in our hash.
A simple hash algorithm is to assign a value to each base (A=1, G=2, C=3, T=4, assuming no ambiguities) and then just sum up the bases in our sequence. This means that any sequences with the same number of As, Cs, Gs and Ts will have the same hash value. If we wanted, we could use a more complicated algorithm that also takes position into account, so that two sequences only get the same number if they have the same bases in the same order.
Once we have our hash algorithm, we can make a hash table by binning the sequences by their hash values. The more bins we have in our table, the fewer hash values per bin. Lookup is very fast: to see if a sequence is in our hashtable, or to add a new sequence to it, we just compute the hash value of the sequence to see which bin it is in, and then we only have to look at the values inside that bin. We can ignore the rest of the bins.
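The base-summing hash and the binning step described above might look like this (a toy sketch; production hashtables use better-distributed hash functions):

// Toy hash: A=1, G=2, C=3, T=4, summed over the whole sequence.
// Sequences with the same base composition deliberately collide.
fun dnaHash(seq: String): Int = seq.sumOf { base ->
    when (base) {
        'A' -> 1
        'G' -> 2
        'C' -> 3
        'T' -> 4
        else -> 0 // ignoring ambiguity codes for simplicity
    }
}

// Pick a bin: more bins means fewer sequences per bin to scan.
fun binFor(seq: String, binCount: Int): Int = dnaHash(seq) % binCount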
suffix tree
A suffix tree is a different data structure: a graph where each node is (in this case) a residue in our sequence, and edges point on to the next node, and so on. For example, if our sequence was ACGT, the path in the graph would be A->C->G->T->$. If we had another sequence ACTT, its path would be A->C->T->T->$.
We can combine consecutive nodes where there is only one path, so in the previous example, since both sequences start with AC, the paths become AC->G->T->$ and AC->T->T->$.
In bioinformatics this is really useful for substring matching (like finding repetitive regions or primer binding sites, etc.), since we can easily see where there are subpaths in our graph that match our motif.
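A bare-bones sketch of the uncompressed form in Kotlin (a trie keyed on bases; a real suffix tree also collapses the single-child runs described above, and indexes every suffix rather than whole sequences):

// One node per residue; '$' marks the end of a sequence.
class TrieNode {
    val children = HashMap<Char, TrieNode>()
}

fun insert(root: TrieNode, seq: String) {
    var node = root
    for (base in seq + "$") {
        node = node.children.getOrPut(base) { TrieNode() }
    }
}

// True if motif labels a path from the root. If every suffix of a
// sequence has been inserted, this becomes substring search.
fun containsPath(root: TrieNode, motif: String): Boolean {
    var node = root
    for (base in motif) {
        node = node.children[base] ?: return false
    }
    return true
}

After inserting ACGT and ACTT, both paths share the A->C prefix, which is exactly the structure sketched above.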
Hope that helps

What is the best way to determine what articles are available for a given usenet group?

I was wondering what the most efficient way is to get the available articles for a given NNTP group. The method I have implemented works as follows:
(i) Select the group:
GROUP group.name.subname
(ii) Get a list of article numbers from the group (pushed back into a vector 'codes'):
LISTGROUP
(iii) Loop over codes and grab articles (e.g. headers)
for code in codes do
    HEAD code
end
However, this doesn't scale well with large groups with many article codes.
In RFC 3977, the GROUP command is documented as also returning the 'low' and 'high' article numbers. For example,
[C] GROUP misc.test
[S] 211 1234 3000234 3002322 misc.test
where 3000234 and 3002322 are the low and high numbers. I'm therefore thinking of using these rather than initially pushing back all the article codes. But can these numbers be relied upon? Is 3000234 definitely the number of the first article in the above-selected group, and likewise is 3002322 definitely the number of the last, or are they just estimates?
Many thanks,
Ben
It turns out I was thinking about this all wrong. All I need to do is
(i) set the group using GROUP
(ii) execute the NEXT command followed by HEAD for however many headers I want (up to count):
for c : count do
    articleId <-- NEXT
    HEAD articleId
end
EDIT: I'm sure there must be a better way but until anyone suggests otherwise I'll assume this way to be the most effective. Cheers.
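Over a raw socket the GROUP/NEXT/HEAD loop looks roughly like this in Kotlin (a sketch only: no error handling, news.example.com is a placeholder, and the status codes are the ones defined in RFC 3977):

import java.io.BufferedReader
import java.io.InputStreamReader
import java.io.PrintWriter
import java.net.Socket

fun main() {
    Socket("news.example.com", 119).use { socket ->
        val reader = BufferedReader(InputStreamReader(socket.getInputStream()))
        val writer = PrintWriter(socket.getOutputStream(), true)

        fun send(cmd: String): String {
            writer.print(cmd + "\r\n")
            writer.flush()
            return reader.readLine()   // first line of the response
        }

        println(reader.readLine())          // server greeting
        println(send("GROUP misc.test"))    // "211 count low high misc.test"

        for (i in 1..10) {                  // fetch up to 10 headers
            if (!send("NEXT").startsWith("223")) break  // 421: no next article
            if (send("HEAD").startsWith("221")) {       // headers follow
                var line = reader.readLine()
                while (line != ".") {       // multi-line data ends with "."
                    println(line)
                    line = reader.readLine()
                }
            }
        }
        send("QUIT")
    }
}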

Why does union not return a unique list?

Is there a logical reason why the following statement from the Hyperspec is the way it is?
"If there is a duplication between list-1 and list-2, only one of the duplicate instances will be in the result. If either list-1 or list-2 has duplicate entries within it, the redundant entries might or might not appear in the result."
Until I read this, I was assuming that union should return a unique list, and I was frustrated that my code didn't do so. It also seems odd to remove duplicates between the lists but not within them. Why even specify this?
It seems like one should be able to assume union will produce a unique list of the set's elements, or am I missing something?
For the full page in Hyperspec see http://clhs.lisp.se/Body/f_unionc.htm
If your code has sets with only unique elements (like (1 2 3)), then UNION will preserve this property.
If your code has sets with non-unique elements (like (1 2 2 3)), then UNION does not need to make any effort to enforce uniqueness in the result set.
Removing duplicates is done with a separate function: REMOVE-DUPLICATES.
Assuming that the elements of both lists passed to UNION are unique means the complexity of the algorithm in the worst case (non-sortable, non-hashable elements) is O(n*m). On the other hand, removing duplicates within a list in that case is O(n^2). Making UNION remove duplicates would approximately triple the running time even when there are no duplicates, since most of the time is consumed doing the comparisons.
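The contract is easy to see in miniature. A sketch of UNION's obligations in Kotlin (not Lisp, and unionLike is a made-up name): duplicates between the two lists are removed, duplicates within a list pass straight through, and the cost is one membership test per element of the second list.

// Union in the Hyperspec's sense: drop elements of b already present in a,
// but never look for duplicates *within* a or b.
fun <T> unionLike(a: List<T>, b: List<T>): List<T> =
    a + b.filter { x -> x !in a }   // O(n*m) comparisons in the worst case

fun main() {
    println(unionLike(listOf(1, 2, 2, 3), listOf(3, 4)))
    // [1, 2, 2, 3, 4] -- the duplicate 2 inside the first list survives
}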
