Implement BidirectionalGridLSTM - grid

I’m implementing a chatbot using TensorFlow’s seq2seq model [1], feeding it with data from the Ubuntu Dialogue Corpus. I want to compare an RNN using standard LSTM cells with one using the Grid LSTM cells described in Kalchbrenner et al. [2].
I’m trying to implement the Grid LSTM cell in the translation model described in section 4.4 [2], but I’m struggling with the bidirectional part.
I have tried using BidirectionalGridLSTMCell, but I’m not sure what num_frequency_blocks means; the paper does not mention it. Does anyone know what it is for? The API docs say:
num_frequency_blocks: [required] A list of frequency blocks needed to cover the whole input feature splitting defined by start_freqindex_list and end_freqindex_list.
Further, I have tried to create my own cell. First I do the forward processing with the inputs, then I reverse the inputs and do the backward processing. But when I concatenate these results, the shape changes. For example, when I run the network with a batch size of 32, I get this error:
ValueError: Dimensions must be equal, but are 64 and 32
How can I concatenate the results without changing the shape? Is that even possible?
Does anyone have any other tips, on how I can implement Bidirectional Grid LSTM?
[1] https://www.tensorflow.org/tutorials/seq2seq/
[2] https://arxiv.org/abs/1507.01526

TensorFlow has bidirectional LSTMs built in: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/bidirectional_rnn.ipynb
Here's a tutorial on using bidirectional LSTMs for intent matching: https://blog.themusio.com/2016/07/18/musios-intent-classifier-2/
You're missing your second [2] reference link.
Is this a helpful baseline, even though they don't provide grids?
May I ask what you are using it for?
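On the shape question, a small NumPy sketch may help. It is not a Grid LSTM: `run_direction` is a hypothetical toy stand-in for one directional pass, and all sizes are made up. The guess it illustrates is that, with a batch size of 32, a "64 and 32" mismatch can come from concatenating along the batch axis instead of the feature axis. Concatenating along the feature axis still doubles that axis, so keeping the original shape needs either a sum or a projection.

```python
import numpy as np

def run_direction(inputs, weights):
    """Toy stand-in for one directional RNN pass (hypothetical, not a Grid LSTM)."""
    # inputs: (batch, time, features); weights: (features, units)
    return np.tanh(inputs @ weights)

batch, time, features, units = 32, 5, 8, 16  # made-up sizes
rng = np.random.default_rng(0)
inputs = rng.normal(size=(batch, time, features))
w_fw = rng.normal(size=(features, units))
w_bw = rng.normal(size=(features, units))

fw = run_direction(inputs, w_fw)
# Reverse along the time axis, process, then reverse back so time steps line up.
bw = run_direction(inputs[:, ::-1, :], w_bw)[:, ::-1, :]

wrong = np.concatenate([fw, bw], axis=0)    # doubles the batch axis
concat = np.concatenate([fw, bw], axis=-1)  # doubles the feature axis
summed = fw + bw                            # keeps the original shape

# A learned projection maps the concatenation back to `units` features.
w_proj = rng.normal(size=(2 * units, units))
projected = concat @ w_proj

print(wrong.shape, concat.shape, summed.shape, projected.shape)
```

Whether summing or projecting is appropriate depends on what the next layer expects; alternatively the next layer can simply be sized for 2 * units inputs.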

Related

Using OpenStreetMapX to create a powergrid graph network

I want to find out a few things about OpenStreetMapX, which from what I understand works well with transportation networks. I am wondering whether it is also possible to use this package along with LightGraphs.jl to create a power grid network. In my case, I have filtered some power grid data using Osmosis (a tool that filters OpenStreetMap data based on a tag).
I want to know whether it is relevant to use OpenStreetMapX for this kind of data (power grid)?
using OpenStreetMapX
# Load power data for Germany
deData = get_map_data("D:/PowerGridNetwork/data/germany/de_power_160718.osm")
# Get roadways (which I believe has the meta data for edges)
deData.roadways
I ended up with metadata for power as well as roads, and I am wondering how the road data got there in the first place, since I filtered only the power data.
My next question is: does deData.e return an adjacency list? What I am really after is creating a MetaGraph with nodes and edges and their respective properties.
Any ideas?
Thanks in advance

how to give fixed embedding matrix to EmbeddingLayer in Lasagne?

I have implemented a deep learning architecture which uses Lasagne EmbeddingLayer.
Now I have the word vectors already learned using word2vec and do not want the word vectors to be the parameters of my network.
After reading the documentation, I think it specifies that the numpy array provided to the 'W' parameter is the initial value for the Embedding Matrix.
How can I declare the EmbeddingLayer in the code so that it uses the input weight matrix as a fixed matrix of word vectors?
The above problem can be solved by adding the 'trainable=False' tag to the weight parameter of the custom layer defined to work as the Embedding Layer.

Distinguish word count by document number in mapper - Hadoop?

I'm writing a mapper function in R (using Rhipe for map-reduce). The mapper function is supposed to read the text file and create a corpus. R already has a package called tm which does the text mining and creates a DocumentMatrix. If you want to know more about 'tm', have a look here.
The problem with using this package in map-reduce is that the matrix is converted to a list, and it is difficult to rebuild a matrix in the reducer from this jumbled-up "list". I found an algorithm for creating a corpus using map-reduce on this website, but I'm slightly confused as to how I could find the name or some unique identification of the document a mapper is reading.
For the document that I have which is 196MB text file, hadoop spawned 4 mappers (blocksize=64MB). How can I classify the key value pair such that the mapper sends the pair as ((words#document),1). The article explains it beautifully. However, I'm having a little trouble understanding how mapper can distinguish the document number it's reading between multiple mappers. As far as I understand, the mapper counter is specific only for the corresponding mapper. Anyone care to elaborate, or provide some suggestions as to what I should do?
I think I came up with my own solution. Instead of looking for mapper counts and whatnot, I added a text at the end of each line followed by a number, as in "This is a text, n:1". I used gsub to create the increment. In the mapper, while I read the line, I also read the value n:1. Since n increases for each line, no matter which mapper is reading which line, it gets the correct value of n. I then use the value of n to create a new key for each line (document), as in ((word#doc=n),1), where n is the line number.
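The tagging idea can be sketched in a few lines. This is plain Python rather than R with Rhipe, and the helper names (`tag_lines`, `mapper`, `reducer`) are made up for illustration; it only shows that the document id travels with the line, so it does not matter which mapper receives it.

```python
import re
from collections import Counter

def tag_lines(lines):
    """Append a running document id to each line, e.g. 'some text, n:1'."""
    return [f"{line}, n:{i}" for i, line in enumerate(lines, start=1)]

TAG = re.compile(r",\s*n:(\d+)$")

def mapper(tagged_line):
    """Emit ('word#doc=n', 1) pairs for one tagged line."""
    match = TAG.search(tagged_line)
    doc_id = match.group(1)
    text = tagged_line[:match.start()]  # the line without its tag
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        yield f"{word}#doc={doc_id}", 1

def reducer(pairs):
    """Sum the counts per 'word#doc=n' key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

lines = ["the cat sat", "the dog saw the cat"]
tagged = tag_lines(lines)
# Split the tagged lines between two simulated mappers; the ids stay correct.
chunks = (tagged[:1], tagged[1:])
pairs = [pair for chunk in chunks for line in chunk for pair in mapper(line)]
counts = reducer(pairs)
print(counts["the#doc=2"])
```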

arcmap network analyst iteration over multiple files using model builder

I have 10+ files that I want to add to ArcMap then do some spatial analysis in an automated fashion. The files are in csv format which are located in one folder and named in order as "TTS11_path_points_1" to "TTS11_path_points_13". The steps are as follows:
1. Make XY event layer
2. Export the XY table to a point shapefile using the Feature Class to Feature Class tool
3. Project the shapefiles
4. Snap the points to another line shapefile
5. Make a Route layer (Network Analyst)
6. Add locations to stops using the output of step 4
7. Solve to get routes between points based on a RouteName field
I tried to attach a snapshot of the model builder to show the steps visually but I don't have enough points to do so.
I have two problems:
1. How do I iterate this procedure over the number of files that I have?
2. How do I make sure that the output has a different name every time, so it doesn't overwrite the one from the previous iteration?
Your help is much appreciated.
Once you're satisfied with the way the model works on a single input CSV, you can batch the operation 10+ times, manually adjusting the input/output files. This easily addresses your second problem, since you're controlling the output name.
You can use an iterator in your ModelBuilder model -- specifically, Iterate Files. The iterator would be the first input to the model, and has two outputs: File (which you link to other tools), and Name. The latter is a variable which you can use in other tools to control their output -- for example, you can set the final output to C:\temp\out%Name% instead of just C:\temp\output. This can be a little trickier, but once it's in place it tends to work well.
For future reference, gis.stackexchange.com is likely to get you a faster response.
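If the model is ever scripted instead of run in ModelBuilder, the %Name% substitution amounts to deriving each output name from the input file's base name. The sketch below shows only that naming logic in plain Python; it does not call any ArcGIS tool, and the ".csv" extension, the "out" folder and the "_routes" suffix are assumptions.

```python
from pathlib import Path

def output_path(csv_path, out_dir, suffix="_routes"):
    """Build an output name from the input file's base name."""
    stem = Path(csv_path).stem  # e.g. "TTS11_path_points_3"
    return Path(out_dir) / f"{stem}{suffix}.shp"

def plan_outputs(csv_paths, out_dir):
    """Pair every input CSV with its own output path."""
    plan = {Path(p).name: output_path(p, out_dir) for p in csv_paths}
    # Distinct inputs must map to distinct outputs, so nothing is overwritten.
    if len(set(plan.values())) != len(plan):
        raise ValueError("two inputs would write to the same output")
    return plan

# Hypothetical input names following the pattern from the question.
inputs = [f"TTS11_path_points_{i}.csv" for i in range(1, 14)]
plan = plan_outputs(inputs, "out")
print(plan["TTS11_path_points_4.csv"].name)
```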

Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, TVs etc.) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"
should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance with some added heuristics (removing obvious common words, assigning a higher cost to number changes etc.), which works to some extent, but unfortunately not well enough.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (ok, soundex might help to distinguish the T and R in this case, but the names might as well be 400T and 400R); the first and the third are quite far from each other as strings, but are the same product.
Obviously, the matching algorithm cannot be 100% precise; my goal is to automatically match around 80% of the names with high confidence.
Any ideas or references are much appreciated.
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything obviously chaff, like maybe New is actually common but not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
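A minimal sketch of the "percentage of shared keywords" comparison, assuming a hand-edited chaff list already exists. The `CHAFF` set here is hypothetical, and the ratio is computed as shared keywords over all keywords of both names, which is one of several reasonable definitions.

```python
import re

# Hypothetical hand-edited list of words that are common but not key.
CHAFF = {"new", "from", "digital", "camera"}

def keywords(name):
    """Lower-case the name, split it into tokens and drop the chaff."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return {t for t in tokens if t not in CHAFF}

def shared_keyword_ratio(a, b):
    """Shared keywords divided by all keywords of both names."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

same = shared_keyword_ratio("Canon PowerShot A20 IS",
                            "NEW powershot A20 IS from Canon")
different = shared_keyword_ratio("Lenovo T400", "Lenovo R400")
print(same, different)
```

Note that "Lenovo T400" and "Lenovo R400" still share one of three keywords, so the threshold for calling two names the same product has to sit well above that.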
Not a perfect solution by any stretch, but I don't think you are expecting one?
The key understanding here is that you do have a proper distance metric. That is in fact not your problem at all. Your problem is in classification.
Let me give you an example. Say you have 20 entries for the Foo X1 and 20 for the Foo Y1. You can safely assume they are two groups. On the other hand, if you have 39 entries for the Bar X1 and 1 for the Bar Y1, you should treat them as a single group.
Now, the distance X1 <-> Y1 is the same in both examples, so why is there a difference in the classification? That is because Bar Y1 is an outlier, whereas Foo Y1 isn't.
The funny part is that you do not actually need to do a whole lot of work to determine these groups up front. You simply do a recursive classification. You start out with one node per group, and then add a supernode for the two closest nodes. In the supernode, store the best assumption, the size of its subtree and the variation in it. As many of your strings will be identical, you'll soon get large subtrees with identical entries. Recursion ends with a single supernode at the root of the tree.
Now map the canonical names against this tree. You'll quickly see that each will match an entire subtree. Now, use the distances between these trees to pick the distance cutoff for that entry. If you have both Foo X1 and Foo Y1 products in the database, the cut-off distance will need to be lower to reflect that.
edg's answer is in the right direction, I think - you need to distinguish key words from fluff.
Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a CPU OEM package.
If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
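One way to sketch such an equivalence class is a normalization step that runs before any distance is computed. The regular expression below is an assumption about what model numbers look like (one or two letters, then a hyphen or space, then digits); it would need tuning on real data, since it also merges unrelated pairs such as "a 20".

```python
import re

def normalize_model_tokens(name):
    """Collapse 'T-400' and 'T 400' into the single token 't400'."""
    text = name.lower()
    # Join a short letter group and a following digit group that are
    # separated by a hyphen or a single whitespace character.
    return re.sub(r"\b([a-z]{1,2})[-\s](\d+)\b", r"\1\2", text)

variants = ["Lenovo T-400", "Lenovo T400", "Lenovo T 400"]
normalized = {normalize_model_tokens(v) for v in variants}
print(normalized)
print(normalize_model_tokens("New Lenovo T-400, Core 2 Duo"))
```

Because "Core 2" has more than two letters before the digit, it is left alone, while all three T400 spellings collapse to the same string and "Lenovo R400" stays distinct.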
Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.
You might be able to make use of a trigram search for this. I must admit I've never seen the algorithm to implement an index, but have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
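For reference, trigram matching without an index fits in a few lines. This sketch compares sets of three-character substrings with Jaccard similarity; the padding with spaces is a design choice so that word starts and ends count as trigrams too, and the drug names are only sample data.

```python
def trigrams(text):
    """Return the set of 3-character substrings of the padded, lower-cased text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0 and 1."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def best_match(query, candidates):
    """Pick the candidate whose trigram set overlaps most with the query."""
    return max(candidates, key=lambda c: trigram_similarity(query, c))

names = ["paracetamol", "ibuprofen", "penicillin"]
print(best_match("paracetmol", names))
```

A real index would map each trigram to the names containing it, so that only names sharing at least one trigram with the query are scored.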
This is a problem of record linkage. The dedupe python library provides a complete implementation, but even if you don't use python, the documentation has a good overview of how to approach this problem.
Briefly, within the standard paradigm, this task is broken into three stages:
1. Compare the fields, in this case just the name. You can use one or more comparators for this, for example an edit distance like the Levenshtein distance or something like the cosine distance that compares the number of common words.
2. Turn an array of distance scores into a probability that a pair of records are truly about the same thing.
3. Cluster those pairwise probability scores into groups of records that likely all refer to the same thing.
You might want to create logic that ignores the letter/number combination of model numbers (since they're nigh always extremely similar).
I don't have any experience with this type of problem, but I think a very naive implementation would be to tokenize the search term and search for matches that contain any of the tokens.
"Canon PowerShot A20 IS", for example, tokenizes into:
Canon
Powershot
A20
IS
which would match each of the other items you want to show up in the results. Of course, this strategy will likely produce a whole lot of false matches as well.
Another strategy would be to store "keywords" with each item, such as "camera", "canon", "digital camera", and searching based on items that have matching keywords. In addition, if you stored other attributes such as Maker, Brand, etc., you could search on each of these.
Spell checking algorithms come to mind.
Although I could not find a good sample implementation, I believe you can modify a basic spell checking algorithm to come up with satisfactory results, i.e. by working with words as the unit instead of characters.
The bits and pieces left in my memory:
Strip out all common words (a, an, the, new). What is "common" depends on context.
Take the first letter of each word and its length and make that a word key.
When a suspect word comes up, look for words with the same or a similar word key.
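The steps above can be sketched roughly as follows. The stop list is the one from the answer; treating "similar" as the same first letter with a length off by at most one is an assumption, since the answer does not define it.

```python
STOP_WORDS = {"a", "an", "the", "new"}  # what counts as common depends on context

def word_key(word):
    """First letter plus length, e.g. 'Canon' -> 'c5'."""
    return f"{word[0].lower()}{len(word)}"

def build_index(vocabulary):
    """Group the known words by their word key, skipping stop words."""
    index = {}
    for word in vocabulary:
        if word.lower() in STOP_WORDS:
            continue
        index.setdefault(word_key(word), []).append(word)
    return index

def candidates(suspect, index):
    """Known words with the same first letter and a length within one character."""
    first, length = suspect[0].lower(), len(suspect)
    found = []
    for delta in (0, -1, 1):
        found.extend(index.get(f"{first}{length + delta}", []))
    return found

index = build_index(["New", "Canon", "PowerShot", "Lenovo"])
print(candidates("Cannon", index))
```

The candidates would then be ranked with a finer measure such as an edit distance; the word key only narrows the search.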
It might not solve your problems directly... but you say you were looking for ideas, right?
:-)
That is exactly the problem I'm working on in my spare time. What I came up with is:
Narrow down the scope of the search based on keywords. In this case you could have a hierarchy:
type --> company --> model
so that you'd match "Digital Camera" for the type and "Canon" for the company, and be left with a much narrower scope to search.
You could work this down even further by introducing product lines etc.
But the main point is, this probably has to be done iteratively.
We can use the Datadecision service for matching products.
It will allow you to automatically match your product data using statistical algorithms. This operation is done after defining a threshold score of confidence.
All data that cannot be automatically matched will have to be manually reviewed through a dedicated user interface.
The online service uses lookup tables to store synonyms as well as your manual matching history. This allows you to improve the data matching automation next time you import new data.
I worked on the exact same thing in the past. I used an NLP method, a TF-IDF vectorizer, to assign a weight to each word. For example in your case:
Canon PowerShot a20IS
Canon --> weight = 0.05 (not a very distinguishing word)
PowerShot --> weight = 0.37 (can be distinguishing)
a20IS --> weight = 0.96 (very distinguishing)
This tells your model which words to care about and which to ignore. I had quite good matches thanks to TF-IDF.
But note that a20IS will not be recognized as a20 IS; you may want to use some kind of regex to handle such cases.
After that, you can use a numeric calculation like cosine similarity.
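A dependency-free sketch of this TF-IDF plus cosine similarity pipeline, using a tiny made-up catalog. The smoothed IDF formula is one common variant and is an assumption here, so the weights will not match the illustrative numbers above; query words that never occur in the catalog get weight zero.

```python
import math
import re
from collections import Counter

def tokenize(name):
    return re.findall(r"[a-z0-9]+", name.lower())

def idf_table(corpus):
    """Smoothed inverse document frequency per token."""
    n = len(corpus)
    df = Counter(tok for name in corpus for tok in set(tokenize(name)))
    return {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}

def tfidf(name, idf):
    """Sparse TF-IDF vector as a dict; unknown tokens get weight 0."""
    tf = Counter(tokenize(name))
    return {tok: freq * idf.get(tok, 0.0) for tok, freq in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up canonical names standing in for the database.
catalog = ["Canon PowerShot A20 IS", "Canon PowerShot A40",
           "Canon EOS 300D", "Lenovo T400", "Lenovo R400"]
idf = idf_table(catalog)

def best_match(query):
    q = tfidf(query, idf)
    return max(catalog, key=lambda name: cosine(q, tfidf(name, idf)))

print(best_match("NEW powershot A20 IS from Canon"))
```

In this catalog "canon" appears in three names and "a20" in one, so "a20" gets the higher weight, which is the behavior the answer describes.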

Resources