Given a query log of about 10 million queries, I have to write a program that asks the user for a query and displays the 10 most similar queries from the log as output. In case of spelling mistakes, it should also suggest the correct spelling.
In this context I have studied a few tutorials on Locality Sensitive Hashing, but I cannot understand how to apply it to this problem. At first I thought of sorting the log lexicographically, but given the size of the log I don't think that is a good idea, since it may not be feasible to load the whole log into memory.
Can anyone please suggest an approach to this problem? Thank you.
If you want to parallelize the processing, you would definitely want to look at Minhash Clustering in Mahout. The general recipe is:
Generate shingles (n-grams with appropriate n)
Generate MinHash
Run LSH
Very detailed information on LSH can be found here: Mining Massive Datasets
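To make the pipeline concrete, here is a rough sketch of those three steps in R (assuming the digest package for hashing; the function names are illustrative, not a library API):

library(digest)

# Step 1: character n-gram shingles for a single query.
shingles <- function(query, n = 3) {
  q <- tolower(query)
  if (nchar(q) <= n) return(q)
  unique(substring(q, 1:(nchar(q) - n + 1), n:nchar(q)))
}

# Step 2: a MinHash signature; k hash functions are simulated by salting one hash.
minhash <- function(sh, k = 64) {
  sapply(seq_len(k), function(i) {
    min(sapply(sh, function(s)
      strtoi(substr(digest(paste0(i, ":", s), algo = "murmur32"), 1, 7), 16L)))
  })
}

# Step 3: LSH banding -- two queries become candidates if any band key collides.
band_keys <- function(sig, bands = 16) {
  rows <- length(sig) / bands
  sapply(seq_len(bands), function(b)
    paste0(b, ":", paste(sig[((b - 1) * rows + 1):(b * rows)], collapse = "-")))
}

sig <- minhash(shingles("how to reverse a list"))
head(band_keys(sig), 3)

Only queries that share a band key need an exact similarity check, which is what makes this workable for 10 million queries without pairwise comparison; the same candidate buckets can feed the spelling suggestions.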
I understand that implementation specifics factor into this question, but I also realize that I may be doing something wrong here. If so, what could I do better? If Gremlin has some batch feature for submitting multiple queries and getting back multiple result sets that I don't know about, that would solve the problem. As in: hey Gremlin, run these three queries in parallel and give me their results.
Essentially, I need to know when a vertex has a certain edge and if it doesn't have that edge, I need to pull a blank. So...
g.V().as("v").coalesce(outE("someLabel").has("someProperty","someValue"),constant()).as("e").select("v","e")
That query is 10x more expensive than simply getting the edges using:
g.V().outE("someLabel").has("someProperty","someValue")
So if I want to get a set of vertices with their edges or blank placeholders, I have two options: Make two discrete queries and "join" the data in the API or make one very expensive query.
I'm working from the assumption that in Gremlin, we "do it in one trip", and that may in fact be quite wrong. That said, I also know that pulling back chunks of data and effectively doing joins in the API is bad practice because it breaks the encapsulation principle. It also adds round-trip overhead.
OK, so I found a solution that is ridiculous but fast. It involves fudging the traversal, so let me apologize up front if there's a better way...
g.inject(true).
union(
__.V().not(outE("someLabel")).constant().as("ridiculous"),
__.V().outE("someLabel").as("ridiculous")
).
select("ridiculous")
In essence, I have to write the query twice: once for the traversal with the edge I want, and once more for the traversal where the edge is missing. So if I have n present/not-present checks, I'm going to need 2^n copies of the query, each with its own combination of checks, to get the most optimal performance. Unfortunately, that approach runs the risk of a stack overflow, not to mention making the code impossible to manage reliably.
Your original query returned vertex-edge pairs, whereas your answer returns only edges.
You could just run g.E().hasLabel("somelabel") to get the same result.
A faster alternative to your original query would probably be:
g.E().hasLabel("somelabel").as("e").outV().as("v").select("v","e")
Or
g.V().as("v").outE("somelabel").as("e").select("v","e")
If Gremlin has some batch feature for submitting multiple queries and getting back multiple result sets that I don't know about, that would solve the problem
Gremlin/TinkerPop does not have such functionality built in. There is at least one graph that does have some form of Gremlin batching, DataStax Graph, but I'm not sure about others.
I'm also not sure I really have an answer you might find useful, but while I wouldn't expect a 10x difference in performance between those two traversals, I would expect the first to perform worse on most graph databases. Basically, the use of named steps with as() enables path-calculation requirements on the traversal, which increases costs. When optimizing Gremlin, one of my earliest steps is to look for ways to factor out anything that does that.
This question seems related to your other question on Jagged Result Array, but I'm having trouble carrying the context from one question over to the other to understand how to expound further.
I need to know whether the training data passed in the neuralnet call is randomized inside the routine, or whether the routine uses the data in the order it is given. I really need this information for a project I am working on, and I have not been able to figure it out by looking at the source.
Thanks!
Look into the code - that's one of the most important advantages of FOSS: you can actually check what it is doing. (neuralnet is pure R, so you don't even need to fear digging into FORTRAN or C code, and you can use debug to step through the code with example data to get an overview.)
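For example, a minimal sketch (the infert data set is the one from the package's own examples):

library(neuralnet)

# Print the function body to read the training loop directly.
print(neuralnet)

# Or step through a small run interactively and watch how the rows are used.
debug(neuralnet)
data(infert)
net <- neuralnet(case ~ age + parity + induced + spontaneous,
                 data = infert, hidden = 2)
undebug(neuralnet)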
Moreover, you can even introduce, for example, a new parameter that allows you to switch off randomization if needed.
Possibly maintainer("neuralnet") would be willing to help you as well (and be able to answer much faster than just about anyone else here on SE).
What's the best way to represent graph data structures in LabVIEW?
I'm doing some basic algorithm review over the holiday, and I'd prefer to not implement all of the storage and traversals myself, if possible.
(I'm aware that there was a thread a few years ago on LAVA; is that my best bet?)
I've never had a need to do this myself, so I never really looked into it, but as far as I know there are some people who have done some work on this.
Brian K. has posted something over here, although it's been a long time since I looked at it:
https://decibel.ni.com/content/docs/DOC-12668
If that doesn't help, I would suggest you read this and then try sending a PM to Daklu there, as he's the most likely candidate to have something.
https://decibel.ni.com/content/thread/8179?tstart=0
If not, I would suggest posting a question on LAVA, as you're more likely to find the relevant people there.
Well, from a simple point of view, you don't have that many options for graphs. It really depends on the types of algorithms you are running, in order to choose the most convenient representation.
An adjacency matrix is simple, but it can be slow for some tasks and wasteful if the graph is not dense.
You can keep a couple of lists and hash maps of your edges and vertices. With each edge or vertex assigned a unique index into its list as it is created, it's pretty simple to keep things under control. Each vertex can then be associated with a list of its neighbors, and depending on your needs you could divide that neighbor list into in-edges and out-edges. Also, depending on your lookup needs, you could index edges by their in-vertex, their out-vertex, both, or simply by a unique index number; see the sketch below.
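To make that concrete, here is a rough sketch of the bookkeeping (written in R only because LabVIEW diagrams don't paste into text; the same index-based scheme maps directly onto LabVIEW arrays):

# Vertices are addressed by integer index; each vertex keeps
# separate in- and out-neighbor lists, as described above.
make_graph <- function() {
  list(names = character(0), out_edges = list(), in_edges = list())
}

add_vertex <- function(g, name) {
  i <- length(g$names) + 1
  g$names[i] <- name
  g$out_edges[[i]] <- integer(0)
  g$in_edges[[i]] <- integer(0)
  g
}

add_edge <- function(g, from, to) {
  g$out_edges[[from]] <- c(g$out_edges[[from]], to)
  g$in_edges[[to]] <- c(g$in_edges[[to]], from)
  g
}

g <- make_graph()
g <- add_vertex(g, "A")
g <- add_vertex(g, "B")
g <- add_edge(g, 1, 2)
g$names[g$out_edges[[1]]]   # neighbors of "A"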
I had a glance at the LabVIEW quick reference, and while it was not obvious from there how you would do this, as long as there are arrays of some sort you can implement a graph. I'm sure you'll be fine.
I've been looking for a proper implementation of a hash map in R, with functionality similar to the map type in Python.
After some googling and searching the R documentation, I found that environments and named lists are the ONLY options I can use (is that really so?).
But the problem with those two is that they can only take characters as keys for the hashing - not even a number, let alone any other type of thing.
So is there a way to use arbitrary objects as keys, or at least more than just characters?
Or is there a better implementation of a hash map, with more functionality, that I didn't find?
Thanks in advance.
Edit:
My current problem: I need a map to store distance relationships between data points. That is, the key of the map is a tuple (p1, p2) and the value is a number.
The reason I asked a generic question instead of a concrete one is that I'm learning R at the moment and I want to know how to manipulate some of the most fundamental data structures, not only the ones my problem involves. I may need to use other things as keys in the future, and I want to avoid asking similar questions with only minor differences every time I run into them.
Edit 2:
I got a lot of very good advice on this topic. It seems I'm still thinking in quite a Pythonic way, rather than the R way. I should really get more R-ly! I think my purpose can easily be satisfied by a matrix in R. Thanks, all!
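For instance, something like this seems to cover my distance case (a sketch using base R's dist()):

# Pairwise distances for 5 points, keyed by point names through the
# matrix dimnames instead of a (p1, p2) tuple key.
pts <- matrix(rnorm(10), ncol = 2,
              dimnames = list(paste0("p", 1:5), NULL))
d <- as.matrix(dist(pts))
d["p1", "p3"]   # the distance between p1 and p3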
The reason people keep asking you for a specific example is that most problems for which hash tables are the appropriate technique in Python have a good solution in R that does not involve hash tables.
That said, there are certainly times when a real hash table is useful in R, and I recommend you check out the hash package for R. It uses environments as its base but lets you do a lot of R-like vector work with them. It's efficient and I've never run into a problem with it.
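For example, a small sketch (keys still have to be strings, so a pair of points can be encoded with paste()):

library(hash)

h <- hash()
key <- function(p1, p2) paste(p1, p2, sep = "|")

h[[key("p1", "p2")]] <- 0.37
h[[key("p1", "p3")]] <- 1.92

h[[key("p1", "p2")]]          # 0.37
has.key(key("p1", "p3"), h)   # TRUE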
Just keep in mind that if you're using hash tables a lot while working with R and your code is running slowly or is buggy, you may be able to get some mileage from figuring out a more R-like way of doing it :)
I am currently in the analysis phase of developing a locale-based stock screener (please see Google's for similar work), and I would appreciate advice from the SO experts.
Firstly, the stock screener would obviously need to store the formulae required to perform calculations. My initial conclusion is that the formulae would need to be stored in the database layer. What are your ideas on this? Could I improve speed (very important) by storing the formulae in a flat file (XML/TXT) instead?
Secondly, I would also like advice on the internal execution of formulae by the application. Currently I am leaning towards executing formulae on parameters AT RUN TIME, as opposed to running the formulae whenever parameters are provided to the system and storing the results in the DB for simple retrieval later (my local stock exchange currently does NOT support real-time stock price updates). While I am quite certain the initial plan (executing at run time) is better to begin with, the application could potentially handle a wide variety of formulae as well as a wide variety of input parameters. What are your thoughts on this?
I have also gone through SO for information on how to store formulae in a DB, but I wanted to ask about possible ways to resolve recursive formulae, i.e. formulae which require the results of other formulae to perform their calculations. I wouldn't mind pointers to other questions or fora at all.
[EDIT]
This page provides a lot of information about what I am trying to achieve, but what is different here is that I need to design formulae with SPECIAL tokens, such as SP, which represents the stock price for the current day, while SP(-1) represents the price for the previous day. These special tokens require the application to perform some sort of DB access to retrieve the values they are replaced with.
An example formula would be:
(SP / SP(-1) - 1) * 100
which calculates the price change for a security. My idea is to replace the SP tokens with the values for that security when requested by the user, THEN perform the calculation and send the result to the user.
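To illustrate, here is a rough sketch of that substitution (in R purely for illustration; get_price() is a hypothetical stand-in for the real DB lookup):

# Hypothetical price lookup: offset 0 is today, -1 the previous day.
get_price <- function(symbol, offset = 0) {
  prices <- c(`0` = 105.0, `-1` = 100.0)   # stand-in for the DB query
  unname(prices[as.character(offset)])
}

evaluate_formula <- function(formula_text, symbol) {
  # Replace SP(-1)-style tokens first, then bare SP tokens.
  txt <- gsub("SP\\((-?\\d+)\\)",
              sprintf("get_price('%s', \\1)", symbol), formula_text)
  txt <- gsub("\\bSP\\b", sprintf("get_price('%s', 0)", symbol), txt)
  eval(parse(text = txt))
}

evaluate_formula("(SP / SP(-1) - 1) * 100", "ABC")   # ~5, i.e. a 5% change

Recursive formulae could be resolved by the same pass: expand a referenced formula's text into the caller before evaluation, in dependency order, so that nested references eventually bottom out in SP-style tokens.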
Thanks a lot for all your assistance.
Kris, I don't mean to presume that I have a better understanding of your requirements than you, but by coincidence I read this article this afternoon, after I posted my earlier comment:
http://thedailywtf.com/Articles/Soft_Coding.aspx
Are you absolutely sure that the "convenience" of updating formulae without recompiling code is worth the maintenance headache that such a solution may become down the line?
I would strongly recommend that you hard code your logic unless you want someone without access to the source to be updating formulae on a fairly regular basis.
And I can't see this happening too often anyway, given that the particular domain here, stock prices, has a well-established set of formulae for calculating the various relevant metrics.
I think your effort will be much better spent in making a solid and easily extensible "stock price" framework, or even searching for some openly available one with a proven track record.
Anyway, your project sounds very interesting, I hope it works out well whatever approach you decide to take. Good luck!