I am designing a word filter that removes bad words (a list of about 200) from an article (about 2000 words). My problem is choosing a data structure to store the bad-word list so that the program spends as little time as possible finding bad words in articles.
-- more details
If the bad-word list has 2000 entries, each article has 50000 words, and the program processes about 1000 articles at a time, which data structure should I choose to get a search that is better than O(n^2)?
You can use a hash table, because its average complexity is O(1) for insert and search and your data is just 2000 words.
http://en.wikipedia.org/wiki/Hash_table
A dictionary usually is a mapping from one thing (word in 1st language) to another thing (word in 2nd language). You don't seem to need this mapping here, but just a set of words.
Most languages provide a set data structure out of the box that has insert and membership testing methods.
A small example in Python, comparing a list and a set:
import random
import string
import time


def create_word(min_len, max_len):
    # random.randint includes both endpoints, so no +1 is needed
    return "".join(random.choice(string.ascii_lowercase)
                   for _ in range(random.randint(min_len, max_len)))


def create_article(length):
    return [create_word(3, 10) for _ in range(length)]


wordlist = create_article(50000)
article = " ".join(wordlist)

good_words = []
bad_words_list = [random.choice(wordlist) for _ in range(2000)]

# membership test against a list: scans the list for every word
print("using list")
print(time.time())
for word in article.split(" "):
    if word in bad_words_list:
        continue
    good_words.append(word)
print(time.time())

good_words = []
bad_words_set = set(bad_words_list)

# membership test against a set: hash lookup for every word
print("using set")
print(time.time())
for word in article.split(" "):
    if word in bad_words_set:
        continue
    good_words.append(word)
print(time.time())
This creates an "article" of 50000 randomly created "words" with a length between 3 and 10 letters, then picks 2000 of those words as "bad words".
First, they are put in a list, and the "article" is scanned word by word, checking whether each word is in this list of bad words. In Python, the in operator tests for membership. For an unordered list, there's no better way than scanning the whole list.
The second approach uses the set datatype, initialized with the list of bad words. A set has no ordering, but it offers much faster lookup (again using the in operator) of whether an element is contained. That seems to be all you need to know.
On my machine, the timings are:
using list
1421499228.707602
1421499232.764034
using set
1421499232.7644095
1421499232.785762
So it takes about 4 seconds with a list and 2 hundredths of a second with a set.
I think the best structure you can use here is a set: http://en.wikipedia.org/wiki/Set_%28abstract_data_type%29
It takes log_2(n) time to add an element to the structure (a one-time operation) and the same time to answer each query. So if you have 200 elements in the data structure, your program will need only about 8 operations to check whether a word exists in the set.
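The log_2(n) figure fits a sorted collection searched by binary search (or a balanced tree). The answer does not say which structure it has in mind, so the following is only a sketch under that assumption, using Python's bisect module on a sorted list. The name contains_sorted and the generated word list are made up for illustration.

```python
from bisect import bisect_left
import math


def contains_sorted(sorted_words, word):
    """Binary search: about log2(n) comparisons per lookup."""
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


# 200 made-up "bad words", sorted once up front
bad_words = sorted(f"bad{i:03d}" for i in range(200))

print(contains_sorted(bad_words, "bad042"))   # a word that is in the list
print(contains_sorted(bad_words, "good"))     # a word that is not
print(math.ceil(math.log2(len(bad_words))))   # rough number of steps per lookup
```

A hash-based set, as in the other answers, avoids even these few comparisons on average, so this variant is mainly useful when the words also need to stay in sorted order.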
You need a Bag data structure for this problem. In a Bag, elements have no order, but the structure is designed for fast lookup of an element. Its lookup time complexity is O(1). So for N words in an article, the overall complexity turns out to be O(N), which is the best you can achieve in this case. Java's Set is an example of a Bag implementation in Java.
I asked over at the English Stack Exchange, "What is the English word with the longest single definition?" The best answer they could give is that I would need a program that could figure out the longest entry in a (text) file listing dictionary definitions, by counting the amount of characters or words in a given entry, and then provide a list of the longest entries. I also asked at Superuser but they couldn't come up with an answer either, so I decided to give it a shot here.
I managed to find a dictionary file which converted to text has the following format:
a /a/ indefinite article (an before a vowel) 1 any, some, one (have a cookie). 2 one single thing (there’s not a store for miles). 3 per, for each (take this twice a day).
aardvark /ard-vark/ n an African mammal with a long snout that feeds on ants.
abacus /a-ba-kus, a-ba-kus/ n a counting frame with beads.
As you can see, each definition comes after the pronunciation (enclosed by slashes), and then either:
1) ends with a period, or
2) ends before an example (enclosed by parentheses), or
3) follows a number and ends with a period or before an example, when a word has multiple definitions.
What I would need, then, is a function or program that can distinguish each definition (treating multiple definitions of a single word as separate ones), then count the number of characters and/or words in it (ignoring the examples in parentheses, since they are not part of the definition proper), and finally provide a list of the longest definitions (I don't think I would need more than, say, a top 20 or so to compare). If the file format is an issue, I can convert the file to PDF, EPUB, etc. with no problem. Ideally, I would also like to be able to choose between counting length by characters and by words, if possible.
How should I go about doing this? I have a little experience from programming classes I took a long time ago, but I think it's better to assume I know close to nothing about programming at all.
Thanks in advance.
I'm not going to write code for you, but I'll help think the problem through. Pick the programming language you're most familiar with from long ago, and give it a whack. When you run into problems, come back and ask for help.
I'd chop this task up into a bunch of subproblems:
Read the dictionary file from the filesystem.
Chunk the file up into discrete entries. If it's a text file like you show, most programming languages have a facility to easily iterate linewise through a file (i.e. take a line ending character or character sequence as the separator).
Filter bad entries: in your example, your lines appear separated by an empty line. As you iterate, you'll just drop those.
Use your human observation and judgement to look for strong patterns in the data that you can communicate as firm rules -- this is one of the central activities of programming. You've already started identifying some patterns in your question, i.e.
All entries have a preamble with the pronunciation and part of speech.
A multiple definition entry will be interspersed with lone numerals.
Otherwise, a single definition just follows the preamble.
Write the rules you've invented into code. It'll go something like this: First find a way to lop off the word itself and the preamble. With the remainder, identify multiple-def entries by presence of lone numerals or whatever; if it's not, treat it as single-def.
For each entry, iterate over each of the one-or-more definitions you've identified.
Write a function that will count a definition either word-wise or character-wise. If word-wise, you'll probably tokenize based on whitespace. Counting the length of a string character-wise is trivial in most programming languages. Why not implement both!
Keep a data structure in memory as you iterate the file to track "longest". For each definition in each entry, after you apply the length calculation, you'll compare against the previous longest entry. If the new one is longer, you'll record this new leading word and its word count in your data structure. Comparing 'greater than' and storing a variable are fundamental in most programming languages, so while this is the real meat of your program, this shouldn't be hard.
Implement some way to display your results once iteration is done. This may be as simple as a print statement.
Finally, write the glue code that lets you execute the program easily. A program like this could easily be a command-line tool that takes one or two arguments (the path to the file to be analyzed, perhaps you pass your desired counting method 'character|word' as an argument too, since you implemented both). Different languages vary in how easy it is to create an executable to run from the command line, but most support it, so it's a good option for tasks like this.
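For readers who want to see how the steps above might fit together, here is a rough Python sketch. It assumes the line format shown in the question; the regular expressions and the names definitions and longest are invented for illustration. It does not strip the part of speech from single-definition entries, and a definition that itself contains a lone number would be split incorrectly.

```python
import heapq
import re

PREAMBLE = re.compile(r"^(\S+)\s+/[^/]*/\s*")   # headword + /pronunciation/
EXAMPLE = re.compile(r"\([^)]*\)")               # parenthesised examples
SENSE_NUMBER = re.compile(r"(?:^|\s)\d+\s+")     # lone numerals between senses


def definitions(line):
    """Yield (headword, definition) pairs for one dictionary line."""
    m = PREAMBLE.match(line)
    if not m:                      # blank or malformed line: skip it
        return
    word = m.group(1)
    body = EXAMPLE.sub("", line[m.end():])
    parts = SENSE_NUMBER.split(body)
    if len(parts) > 1:             # numbered senses: drop the text before "1"
        parts = parts[1:]
    for part in parts:
        text = part.strip(" .")
        if text:
            yield word, text


def longest(lines, top=20, by="chars"):
    """Return the `top` longest definitions as (length, word, definition)."""
    measure = len if by == "chars" else (lambda s: len(s.split()))
    scored = ((measure(d), w, d) for line in lines for w, d in definitions(line))
    return heapq.nlargest(top, scored)


sample = [
    "a /a/ indefinite article (an before a vowel) 1 any, some, one (have a cookie). "
    "2 one single thing (there’s not a store for miles). 3 per, for each (take this twice a day).",
    "",
    "aardvark /ard-vark/ n an African mammal with a long snout that feeds on ants.",
    "abacus /a-ba-kus, a-ba-kus/ n a counting frame with beads.",
]

for length, word, text in longest(sample, top=3, by="words"):
    print(length, word, text)
```

To run it on a real file, the sample list could be replaced by an open file object, since iterating over a file yields its lines one at a time.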
I first used L.LSTM, then I found NStepLSTM, which is not covered in the official tutorial documentation.
https://docs.chainer.org/en/stable/reference/generated/chainer.links.NStepLSTM.html?highlight=Nstep
Why do chainer.links.NStepLSTM and chainer.links.NStepBiLSTM not have reset_state? How do I reset the state?
Is the input a list of sequences (each sequence being one chainer.Variable, e.g. one article containing multiple words is one Variable)? Is the purpose of this class to deal with variable-length sequences?
Can we use truncated BPTT to save memory with chainer.links.NStepLSTM? How?
1.
NStepLSTM takes a batch of sequences and returns a batch of output sequences, whereas LSTM takes a batch of words. You don't need a for-loop to use NStepLSTM. NStepLSTM uses cuDNN, a library provided by NVIDIA, and is very fast.
NStepLSTM does not have a state. If you want to chain NStepLSTMs, use outputs of NStepLSTM. See seq2seq example: https://github.com/chainer/chainer/blob/master/examples/seq2seq/seq2seq.py
2.
Yes. It takes, for example, a batch of sequences of embedding vectors created from sentences. You can use sequences with different lengths. See the seq2seq example.
Note that L.NStepLSTM takes a sequence of sentences, whereas F.NStepLSTM takes transposed sequences, that is, a sequence of batches of words. In fact, L.NStepLSTM calls F.transpose_sequences and F.NStepLSTM in its implementation.
3.
Sorry, that is difficult. As I said, NStepLSTM is a wrapper around cuDNN's RNN library. It does not support truncated BPTT. Of course, you can split the sentences and call NStepLSTM twice.
I'm writing a mapper function in R (using Rhipe for map-reduce). The mapper function is supposed to read a text file and create a corpus. R already has a package called tm which does text mining and creates a DocumentMatrix. If you want to know more about `tm', have a look here.
But the problem with using this package in map-reduce is that the matrix is converted to a list, and it is difficult to rebuild a matrix in the reducer from this jumbled-up "list". I found an algorithm for creating a corpus using map-reduce on this website, but I'm slightly confused as to how I could find the name or some unique identification of the document a mapper is reading.
The document I have is a 196MB text file, for which Hadoop spawned 4 mappers (blocksize=64MB). How can I build the key-value pair so that the mapper emits ((word#document),1)? The article explains it beautifully. However, I'm having a little trouble understanding how a mapper can tell which document number it is reading when there are multiple mappers. As far as I understand, the mapper counter is specific to the corresponding mapper. Anyone care to elaborate, or provide some suggestions as to what I should do?
I think I came up with my own solution. Instead of looking at mapper counts and whatnot, I appended a tag followed by a number to the end of each line, as in "This is a text, n:1". I used gsub to add the incrementing number. In the mapper, when I read a line, I also read the value n:1. Since n increases with each line, each line carries the correct value of n no matter which mapper reads it. I then use the value of n to create a new key for each line (document), as in ((word#doc=n),1), where n is the line number.
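The tagging idea can be illustrated outside of R and Rhipe. The following is only a plain-Python sketch of the same scheme, not the Rhipe code from the post: the function names tag_lines, mapper and reducer are hypothetical, and the "reduce" step simply runs in-process.

```python
import re
from collections import Counter

TAG = re.compile(r", n:(\d+)$")   # the ", n:<line number>" suffix


def tag_lines(lines):
    """Preprocessing step: append a unique line (document) number."""
    return [f"{line}, n:{i}" for i, line in enumerate(lines, start=1)]


def mapper(tagged_line):
    """Emit ("word#doc=n", 1) pairs; works whichever mapper gets the line."""
    m = TAG.search(tagged_line)
    doc_id = m.group(1)
    text = tagged_line[:m.start()]
    for word in re.findall(r"[a-z']+", text.lower()):
        yield f"{word}#doc={doc_id}", 1


def reducer(pairs):
    """Sum the counts for each key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


tagged = tag_lines(["This is a text", "a text is a text"])
counts = reducer(pair for line in tagged for pair in mapper(line))
print(counts["text#doc=2"])
```

Because the document number travels with the line itself, the key does not depend on any per-mapper counter.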
I am asking this here because I couldn't find the answer I am looking for elsewhere and I don't know where else I could ask. I hope someone can reply without saying that the question is irrelevant to the forum. I have a biology background and I am currently working in bioinformatics. I need to understand hash tables and suffix trees in lay language. Something simple: I don't get the O(n) concepts and all that stuff. I think they are both kind of the same thing, a way to store string data? But I would like to understand the differences better. This will help other people like me enormously. There are a lot of us in this field now!
Thanks in advance.
OK, let's use bioinformatics to help illustrate the differences.
Let's say you have several DNA sequences that are pretty long, and we want to store these sequences in a data structure.
If we want to use a hashtable
A hashtable is a useful way to store a bunch of objects while still being able to search the data structure very quickly to see if it already contains a particular object.
One bioinformatics usecase that we can solve with a hashtable is de-duping a large sequence set. Let's say we have a huge dataset of next-gen sequenced data and we want to de-duplicate it before we assemble. We can use a hashtable to store the unique sequences. Before inserting any sequences into the hashtable, we can first check to see if it already exists in the hashtable and if it does we skip that read. Only if it is not yet in the hashtable do we add it. Then when we are done the elements in the hash will be the unique sequences.
Hashtables are basically an array of LinkedLists. Each cell in the array we will call a "bin". When we insert or search for something in the hashtable, we have to first know what bin it is in. The way we determine which bin to use is by a hash algorithm.
We have to come up with a hash algorithm: something that will convert our sequence into a number. A requirement of this function is that the same sequence must always evaluate to the same number. It's OK if different sequences evaluate to the same number (which is called a hash collision), since there are an infinite number of possible sequences and we will only have a limited range of possible number values in our hash.
A simple hash algorithm is to assign a value to each base (A = 1, G = 2, C = 3, T = 4, assuming no ambiguities) and then just sum up the bases in our sequence. This would mean that any sequences with the same number of As, Cs, Gs and Ts will have the same hash value. If we wanted, we could use a more complicated algorithm that also takes position into account, so that to get the same number we would also have to have the same bases in the same order.
Once we have our hash algorithm, we can make a hash table by binning the sequences by their hash values. The more bins we have in our table, the fewer hash values per bin. Lookup is very fast because, to see if a sequence is in our hashtable or to add a new sequence to it, we just compute the hash value of the sequence to see what bin it is in, and then we only have to look at the values inside that bin. We can ignore the rest of the bins.
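As a rough illustration of the binning idea, here is a toy Python version that uses the base-sum hash described above to de-duplicate reads. The class name ToyHashTable and the bin count are arbitrary choices for the sketch, and plain Python lists stand in for the linked lists; in practice Python's built-in set already does this job.

```python
BASE_VALUE = {"A": 1, "G": 2, "C": 3, "T": 4}


def toy_hash(seq):
    """Sum of base values: same sequence, same number (order is ignored)."""
    return sum(BASE_VALUE[base] for base in seq)


class ToyHashTable:
    def __init__(self, n_bins=16):
        self.bins = [[] for _ in range(n_bins)]

    def _bin(self, seq):
        # the hash value decides which single bin we need to look at
        return self.bins[toy_hash(seq) % len(self.bins)]

    def contains(self, seq):
        return seq in self._bin(seq)

    def add(self, seq):
        """Insert seq; return False if it was already present."""
        bucket = self._bin(seq)
        if seq in bucket:
            return False
        bucket.append(seq)
        return True


reads = ["ACGT", "TGCA", "ACGT", "GGGG"]
table = ToyHashTable()
unique = [read for read in reads if table.add(read)]
print(unique)
```

Note that "ACGT" and "TGCA" have the same base composition, so they collide and land in the same bin; both are still kept because the final check inside the bin compares the actual sequences.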
suffix tree
A suffix tree is a different data structure: a graph where each node is (in this case) a residue in our sequence. Edges in the graph point to the next node, and so on. So, for example, if our sequence was ACGT, the path in the graph will be A->C->G->T->$. If we had another sequence ACTT, the path will be A->C->T->T->$.
We can combine consecutive nodes if there is only 1 path, so in the previous example, since both sequences start with AC, the paths will be AC->G->T->$ and AC->T->T->$.
In bioinformatics this is really useful for substring matching (like finding repetitive regions or primer binding sites etc) since we can easily see where there are subpaths in our graph that match our motif.
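To make the substring-matching use concrete, here is a small Python sketch. It builds a naive, uncompressed suffix trie (one node per residue, every suffix inserted, no merging of single-path nodes), which is a simplification of a real suffix tree and needs space quadratic in the sequence length. The function names are made up for the example.

```python
def build_suffix_trie(seq):
    """Insert every suffix of seq + '$' into a trie of nested dicts."""
    root = {}
    text = seq + "$"                      # '$' marks the end of the sequence
    for start in range(len(text)):
        node = root
        for residue in text[start:]:
            node = node.setdefault(residue, {})
    return root


def contains(trie, motif):
    """A motif occurs in the sequence if it spells a path from the root."""
    node = trie
    for residue in motif:
        if residue not in node:
            return False
        node = node[residue]
    return True


trie = build_suffix_trie("ACGTACGA")
print(contains(trie, "GTAC"))
print(contains(trie, "TTT"))
```

Once the structure is built, checking a motif takes time proportional to the motif length, regardless of how long the stored sequence is.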
Hope that helps
I'm attempting to clean up a database that, over the years, has acquired many duplicate records with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.
I have a few related questions:
Is the tm package appropriate for this sort of task?
Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance which is anecdotally slow.)
Are there other suitable tools in R, apart from agrep and tm?
Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)
If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that uses my own weighting scheme (also, as it is, you can't use soundex() for big data sets with RecordLinkage).
If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters in each list or the first three letters and the first three letters in the acronym). Then I calculate match scores using the above functions.
You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
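The normalise, block, then score workflow can be sketched in a few lines. The following Python version is only an illustration of the idea, not the R/MySQL setup described above: difflib's SequenceMatcher stands in for Jaro-Winkler or Levenshtein similarity, the synonym table is a made-up example, and blocking on the first three letters is just one possible choice.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

SYNONYMS = {"limited": "ltd", "incorporated": "inc"}   # illustrative only


def normalise(name):
    """Lower-case, drop non-letters, replace common synonyms."""
    words = re.sub(r"[^a-z ]", "", name.lower()).split()
    return " ".join(SYNONYMS.get(word, word) for word in words)


def candidate_pairs(names, key_len=3):
    """Blocking: only compare names whose normalised prefix matches."""
    blocks = defaultdict(list)
    for name in names:
        blocks[normalise(name)[:key_len]].append(name)
    for group in blocks.values():
        yield from combinations(group, 2)


def scored_matches(names):
    """Return (score, name_a, name_b) for each candidate pair, best first."""
    scored = [
        (SequenceMatcher(None, normalise(a), normalise(b)).ratio(), a, b)
        for a, b in candidate_pairs(names)
    ]
    return sorted(scored, reverse=True)


names = ["Some Company Limited", "SOME COMPANY LTD!", "Other Corp"]
for score, a, b in scored_matches(names):
    print(round(score, 2), a, "<->", b)
```

Blocking is what keeps this feasible for large tables: names in different blocks are never compared, at the cost of missing duplicates whose first letters differ.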
Maybe Google Refine could help. It looks like a better fit if you have lots of exceptions and you don't know them all yet.
What you're doing is called record linkage, and it's been a huge field of research over many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.
These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.
In your case probably something like edit-distance calculation would work, but if you need to find near duplicates in larger text based documents, you can try
http://www.softcorporation.com/products/neardup/