Distinguish word count by document number in mapper - Hadoop? - r

I'm writing a mapper function on R (using Rhipe for map-reduce). The mapper function is supposed to read the text file and create Corpus. Now, R already has a package called tm which does the Text Mining and create DocumentMatrix. If you want to know more about `tm', have a look here.
But the problem with using this package in map-reduce is that the matrix is converted to list, and is difficult to create a matrix in Reduce from this jumbled up "list". I found an algorithm for creating corpus using map-reduce in this website , but I'm slightly confused as to how I could find the name or some unique identification of the mapper document.
For the document that I have which is 196MB text file, hadoop spawned 4 mappers (blocksize=64MB). How can I classify the key value pair such that the mapper sends the pair as ((words#document),1). The article explains it beautifully. However, I'm having a little trouble understanding how mapper can distinguish the document number it's reading between multiple mappers. As far as I understand, the mapper counter is specific only for the corresponding mapper. Anyone care to elaborate, or provide some suggestions as to what I should do?

I think I came up with my own solution. What I did is instead of looking for mapper counts and what not, I added a text at the end of each line followed by number as in "This is a text, n:1". I used gsub to create increment. In the mapper, while I read the line, I also read the value n:1. Since the n increases for each line, no matter which mapper is reading which line, it gets the correct value of n. I'm then using the value of n to create a new key for each line (document) as in ((word#doc=n),1) where n is the value of each line number.

Related

How to find out the longest definition entry in an English dictionary text file?

I asked over at the English Stack Exchange, "What is the English word with the longest single definition?" The best answer they could give is that I would need a program that could figure out the longest entry in a (text) file listing dictionary definitions, by counting the amount of characters or words in a given entry, and then provide a list of the longest entries. I also asked at Superuser but they couldn't come up with an answer either, so I decided to give it a shot here.
I managed to find a dictionary file which converted to text has the following format:
a /a/ indefinite article (an before a vowel) 1 any, some, one (have a cookie). 2 one single thing (there’s not a store for miles). 3 per, for each (take this twice a day).
aardvark /ard-vark/ n an African mammal with a long snout that feeds on ants.
abacus /a-ba-kus, a-ba-kus/ n a counting frame with beads.
As you can see, each definition comes after the pronunciation (enclosed by slashes), and then either:
1) ends with a period, or
2) ends before an example (enclosed by parenthesis), or
3) follows a number and ends with a period or before an example, when a word has multiple definitions.
What I would need, then, is a function or program that can distinguish each definition (including considering multiple definitions of a single word as separate ones), then count the amount of characters and/or words within (ignoring the examples in parenthesis since that is not the proper definition), and finally provide a list of the longest definitions (I don't think I would need more than say, a top 20 or so to compare). If the file format was an issue, I can convert the file to PDF, EPUB, etc. with no problem. And, I guess ideally I would want to be able to choose between counting length by characters and by words, if it was possible.
How should I go to do this? I have little experience from programming classes I took a long time ago, but I think it's better to assume I know close to nothing about programming at all.
Thanks in advance.
I'm not going to write code for you, but I'll help think the problem through. Pick the programming language you're most familiar with from long ago, and give it a whack. When you run in to problems, come back and ask for help.
I'd chop this task up into a bunch of subproblems:
Read the dictionary file from the filesystem.
Chunk the file up into discrete entries. If it's a text file like you show, most programming languages have a facility to easily iterate linewise through a file (i.e. take a line ending character or character sequence as the separator).
Filter bad entries: in your example, your lines appear separated by an empty line. As you iterate, you'll just drop those.
Use your human observation and judgement to look for strong patterns in the data that you can give communicate as firm rules -- this is one of the central activities of programming. You've already started identifying some patterns in your question, i.e.
All entries have a preamble with the pronounciation and part of speech.
A multiple definition entry will be interspersed with lone numerals.
Otherwise, a single definition just follows the preamble.
Write the rules you've invented into code. It'll go something like this: First find a way to lop off the word itself and the preamble. With the remainder, identify multiple-def entries by presence of lone numerals or whatever; if it's not, treat it as single-def.
For each entry, iterate over each of the one-or-more definitions you've identified.
Write a function that will count a definition either word-wise or character-wise. If word-wise, you'll probably tokenize based on whitespace. Counting the length of a string character-wise is trivial in most programming languages. Why not implement both!
Keep a data structure in memory as you iterate the file to track "longest". For each definition in each entry, after you apply the length calculation, you'll compare against the previous longest entry. If the new one is longer, you'll record this new leading word and its word count in your data structure. Comparing 'greater than' and storing a variable are fundamental in most programming languages, so while this is the real meat of your program, this shouldn't be hard.
Implement some way to display your results once iteration is done. This may be as simple as a print statement.
Finally, write the glue code that lets you execute the program easily. A program like this could easily be a command-line tool that takes one or two arguments (the path to the file to be analyzed, perhaps you pass your desired counting method 'character|word' as an argument too, since you implemented both). Different languages vary in how easy it is to create an executable to run from the command line, but most support it, so it's a good option for tasks like this.

Why does NStepLSTM not have reset_state method?

I firstly use L.LSTM , then I found this NStepLSTM, which is uncovered part of offical tutorial document.
https://docs.chainer.org/en/stable/reference/generated/chainer.links.NStepLSTM.html?highlight=Nstep
Why does chainer.links.NStepLSTM or chainer.links.NStepBiLSTM not have reset_state? how to reset_state?
is it pass a list of sequences(each is one sequence chainer.Variable, e.g. one article contains multiple words is one Variable)? Is this class purpose is to deal with vary length sequence?
can we use truncate BPTT to save memory in chainer.links.NStepLSTM ? how
1.
NStepLSTM gets a batch of sequences and returns a batch of output sequences, though LSTM gets a batch of words. You don't need to use for-loop to use NStepLSTM. NStepLSTM uses cuDNN, that is a library NVIDIA provides, and is very fast.
NStepLSTM does not have a state. If you want to chain NStepLSTMs, use outputs of NStepLSTM. See seq2seq example: https://github.com/chainer/chainer/blob/master/examples/seq2seq/seq2seq.py
2.
Yes. It gots such as a batch of sequences of embed vectors created from sentences. You can use sequences with different lengths. See seq2seq example.
Note that L.NStepLSTM can get a sequence of sentences, but F.NStepLSTM can get transposed sequences. I mean it can get a sequence of batches of words. Actually L.NStepLSTM calls F.transpose_sequences and F.NStepLSTM in its implementation.
3.
Sorry it is difficult. As I said, NStepLSTM is a wrapper of cuDNN's RNN library.It does not support BPTT. Of course you can split sentences and call NStepLSTM twice.

Weka Apriori No Large Itemset and Rules Found

I am trying to do apriori association mining with WEKA (i use 3.7) using given database table
So, i exported two columns (orderLineNumber and productCode) and load it into weka, as far as i go, i haven't got any success attempt, always ended with "No large itemsets and rules found!"
Again, i tried to convert the csv into ARFF file first using ARFF Converter and still get the same message;
I also tried using database loader in WEKA, the data loaded just fine but still give the same result;
The filter i've applied in preprocessing is only numericToNominal filter;
What have i wrongly done here, i suspiciously think it was my ARFF format though, thank you
Update
After further trial, i found out that i exported wrong column and i lack 1 filter process, which is "denormalized", i installed the plugin via packet manager and denormalized my data after converting it to nominal first;
I then compared the results with "Supermarket" sample's result; The only difference are my output came with 'f' instead of 't' (like shown below) and the confidence value seems like always 100%;
First of all, OrderLine is the wrong column.
Obviously, the position on the printed bill is not very important.
Secondly, the file format is not appropriate.
You want one line for every order, one column for every possible item in the #data section. To save memory, it may be helpful to use sparse formats (do not forget to set flags appropriately)
Other tools like ELKI can process input formats like this, that may be easier to use (it also was a lot faster than Weka):
apple banana
milk diapers beer
but last I checked, ELKI would "only" find frequent itemsets (the harder part) not compute association rules. I then used a tiny python script to produce actual association rules as desired.

arcmap network analyst iteration over multiple files using model builder

I have 10+ files that I want to add to ArcMap then do some spatial analysis in an automated fashion. The files are in csv format which are located in one folder and named in order as "TTS11_path_points_1" to "TTS11_path_points_13". The steps are as follows:
Make XY event layer
Export the XY table to a point shapefile using the feature class to feature class tool
Project the shapefiles
Snap the points to another line shapfile
Make a Route layer - network analyst
Add locations to stops using the output of step 4
Solve to get routes between points based on a RouteName field
I tried to attach a snapshot of the model builder to show the steps visually but I don't have enough points to do so.
I have two problems:
How do I iterate this procedure over the number of files that I have?
How to make sure that every time the output has a different name so it doesn't overwrite the one form the previous iteration?
Your help is much appreciated.
Once you're satisfied with the way the model works on a single input CSV, you can batch the operation 10+ times, manually adjusting the input/output files. This easily addresses your second problem, since you're controlling the output name.
You can use an iterator in your ModelBuilder model -- specifically, Iterate Files. The iterator would be the first input to the model, and has two outputs: File (which you link to other tools), and Name. The latter is a variable which you can use in other tools to control their output -- for example, you can set the final output to C:\temp\out%Name% instead of just C:\temp\output. This can be a little trickier, but once it's in place it tends to work well.
For future reference, gis.stackexchange.com is likely to get you a faster response.

Generating a SequenceFile

Given data in the following format (tag_uri image_uri image_uri image_uri ...), I need to turn them into Hadoop SequenceFile format for further processing by Mahout (e.g. clustering)
http://flickr.com/photos/tags/100commentgroup http://flickr.com/photos/34254318#N06/4019040356 http://flickr.com/photos/46857830#N03/5651576112
http://flickr.com/photos/tags/100faves http://flickr.com/photos/21207178#N07/5441742937
...
Before this I would turn the input into csv (or arff) as follows
http://flickr.com/photos/tags/100commentgroup,http://flickr.com/photos/tags/100faves,...
0,1,...
1,1,...
...
with each row describes one tag. Then the arff file is converted into a vector file used by mahout for further processing. I am trying to skip the arff generation part, and generate a sequenceFile instead. If I am not mistaken, to represent my data as a sequenceFile, I would need to store each row of the data with $tag_uri as key, then $image_vector as value. What is the proper way of doing this (if possible, can I have the tag_url for each row to be included in the sequencefile somewhere)?
Some references that I found, but not sure if they are relevant:
Writing a SequenceFile
Formatting input matrix for svd matrix factorization (can I store my matrix in this form?)
RandomAccessSparseVector (considering I only list images that are assigned with a given tag instead of all the images in a line, is it possible to represent it using this vector?)
SequenceFile write
SequenceFile explanation
You just need a SequenceFile.Writer, which is explained in your link #4. This lets you write key-value pairs to the file. What the key and value are depends on your use case, of course. It's not at all the same for clustering versus matrix decomposition versus collaborative filtering. There's not one SequenceFile format.
Chances are that the key or value will be a Mahout Vector. The thing that knows how to write a Vector is VectorWritable. This is the class you would use to wrap a Vector and write it with SequenceFile.Writer.
You would need to look at the job that will consume it to make sure you're passing what it expects. For clustering, for example, I think the key is ignored and the value is a Vector.

Resources