ASP.NET - How to properly split a string for search? - asp.net

I'm trying to build a search that is similar to that on Google (with regards to exact match encapsulated in double quotes).
Let's use the following phrase for an example
"phrase search" single terms [different phrase]
Currently if I use the following code
Dim searchTermsArray As String() = searchTerms.Split(New String() {" ", ",", ";"}, StringSplitOptions.RemoveEmptyEntries)
For Each entry In searchTermsArray
Response.Write(entry & "<br>")
Next
my output is
"phrase
search"
single
terms
[different
phrase]
but what I really need is to build a key value pair
phrase search | table1
single | table1
terms | table1
different phrase | table2
where table1 is a table with general info, and table2 is a table of "tags" similar to that on stackoverflow.
Can anybody point me in the right direction on how to properly capture the input?

What are you trying to do is not that trivial. Implementing a search "similar to Google's" is far beyond parsing the search string.
I'd suggest you not to reinvent the wheel and instead use production ready solutions such as Apache Lucene.NET or Apache Solr. Those cope with both parsing and fulltext search.
But if you only need to parse this kind of strings then you should really consider solution Pete pointed to.

Regex is your friend. See this question

Depending on how fancy you plan in getting, you might consider the search grammar/implementation that's included with Irony.
http://irony.codeplex.com/

Search string parsing is a non-regular problem. That means that while a regular expression can get deceptively close, it won't take you all the way there without using proprietary extensions, building an unmaintainable mess of an expression, leaving nasty edge cases open that don't work how you'd like, or some combination of the three.
Instead, there are three correct ways to handle this:
Use a third-party solution like Lucene.
Build a grammar via something like antlr.
Build your own state machine.
For a problem of this level (and assuming that search is core enough to what you're doing to really want to implement it yourself), I'd probably go with option 3. This makes more sense when you realize that regular expressions are themselves instructions for how to set up state machines. All you're doing is building that right into your code. This should give you the ability to tune performance and features as well, without requiring adding a larger lexer component into your code.
For an example of how you might do this take a look at my answer to this question:
Reading CSV files in C#
hat I would do is build a state machine to parse the string character by character. This will be the easiest way to implement a fully-correct solution, and should also result in the fastest code.

Related

Is there a way disable lax quoting rules in sqlite?

It appears that SQLite, apparently as a "compatibility feature", parses double quoted identifiers as string literals if no matching column is found.
I understand that it does so for people who write improper sql, and for backwards compatibility with legacy projects created by such people, but it makes debugging very difficult for those of us writing proper sql on brand new projects.
For example,
SELECT * FROM "users" WHERE "usernme" = 'joe';
returns a query with 0 rows, since the string 'usernme' does not equal the string 'joe'.
This leaves me scratching my head wondering why i'm not getting joe's row even when i know there's a user by that name until I painstakingly backtrack my code and realize that I left out an a.
Is there any "strict mode" PRAGMA or API option to enforce quoting rules and treat all double-quoted strings as identifiers so that it will inform me immediately if one is misspelled?
(And please, no answers telling me not to quote identifiers if I don't need to, because any such answer is basically telling me that in order to get proper debugging, you have to write bad code in the first place.)
This is hardcoded in the SQLite parser and cannot be changed from the outside.
I also asked in the SQLite channel and someone there was kind enough to look through the source code and create a patch, and even started a thread on the mailing list describing the patch:
http://www.mail-archive.com/sqlite-users#sqlite.org/msg73832.html
It's not an answer that works for the official builds, but it may be someday. For the moment, I'm just going to recompile it myself with this patch.
Ten years later, and this doesn't completely meet your criteria about "strict mode" kinds of things, but here's a trick I used to make some queries safer, if you can remember to use it. It's to give your table an alias and reference it:
SELECT t."nosuch_column" FROM some_table t;
I suppose in this form, it's clear to SQLite that a literal isn't desired.

SQLite FTS necessary?

I am using Python with sqlite3.
Is there an advantage to using FTS3 or FTS4 if I only want to search for words in one column, or I could just use LIKE "%word%"?
While yes, you could handle the simplest of cases with LIKE '%word%', FTS allows user queries like clown NOT circus (matches rows talking about clowns but not circuses). It handles stemming (well, in English) so that break would match against breaks but not breakdance. It also builds indexes in a different way so that searching is much faster for this sort of query, and can return not just where the match happened but also a snippet of the matched text.
Finally, all the parsing of those potentially-complex user queries is done for you; that's code you don't have to write at all.

What is a strategy for a simple site site search in a SQL Server 2008 and ASP.NET MVC environment?

I am trying to hash out a strategy for implementing a very simple site search in ASP.NET MVC and SQL Server 2008.
Really, all I want to to do is to be able to rank search results based on the number of times a search word or phrase is found in the webpage. I attempted to do this using LINQtoSQL but I ran into a lot of issues where some LINQ commands don't have a SQL equivalent. This was a few months ago so I don't remember specific errors.
So, I'm just trying to figure out an approach. What I'm thinking is this:
Approach 1:
I should probably write a program to spider the site and somehow index the site's text - I'm thinking I should save information in a table like:
ID
Word
URL
I could then query that and rank based on how many time that word is associated with a certain URL. But then I realized that this technique would completely breakdown if a user was searching for a phrase.
Approach 2:
Then I was toying with the idea of using SPROCs to create a temporary table with a record for each URL that would somehow parse the text and determine how many times the phrase or word appeared in each individual URL. and then we would return the results from the temp table. I am thinking the temporary table would look something like this:
ID
SearchText
URL
Frequency
And then select * from temptable order by Frequency asc or something like that.
However, I'm not sure if SPROCs are capable of parsing text like that, or if simultanious searching would be possible.
I am looking for something very lightweight. I'm not really interested in using Lucene or Solr or anything like that because the learning curve seems very steep and those applications' features are far away more than what I need.
Any thoughts on how I should approach this problem? Is there a different approach that I should consider?
For your phrase versus word issue, why not use wildcards and LIKE operators?
Select Count(*) from temptable where SearchPhrase LIKE '%Apple%'
Maybe not exactly what you want, but Windows SharePoint Search Server isn't all that bad.
Yes, it has the word 'SharePoint' in it, which would usually make me grab the scissors on my desk and start stabbing my eyes out, but having to use it once in a pinch, I was actually somewhat impressed with it.
It's free, so maybe worth a couple of hours playing with it for comparison to writing something custom.
After a little poking around, it looks like SQL Server 2008's Full Text Search is what I would want to use. I'm not 100% sure yet, but it looks promising.
http://msdn.microsoft.com/en-us/library/ms142547.aspx
If you're considering Full Text Search, then also check out lucene.net.
I used FTS for one project, and later used lucene.net for another, and although the requirements were different from yours, I'd never go back to FTS now.

Is there a utility for finding SQL statements in multiple files and listing any referenced tables and stored procedures

I'm currently looking at a terrible legacy ColdFusion app written with very few stored procedures and lots of nasty inline SQL statements (it has a similarly bad database too).
Does anyone know of any app which could be used to search the files of the app picking out any SQL statements and listing the tables/stored procedures which are referenced?
Dreamweaver will allow you to search the code of the entire site. If the site is setup properly including the RDS password and provide a data source it can tell you a lot of information. I've only set it up once so I can't remember exactly what information it gives you, I think maybe just the DB structure. Application window > databases. Even if it isn't set up properly just searching for "cfquery" will quickly find all your queries.
You could also write a CF script using CFDirectory/CFFile to loop the .cfm files and parse everything between cfquery and /cfquery tags.
CFBuilder may have some features like that but I'm not to familiar with it yet.
edit I've heard that CFBuilder can't natively find all your cfqueries that don't have cfqueryparam but you can use CF to extend CFB to do so. I imagine you could find/write something for CFB to help you with your problem.
another edit
I know it isn't indexing the contents of the query, but you can use regex to search using the editor as well. searching for <cfquery.+(select|insert|update|delete) checking the regex box should find the queries that aren't using cfstoredProc (be sure to uncheck the match case option if there is one). I know Dreamweaver and Eclipse can both search for Regex.
HTH
As mentioned above I would try a grep with a regex looking for
"<cfquery*" "</cfquery>" and "<cfstoredproc*" "</cfstoredproc>"
In addition if you have tests that have good code coverage or even just feel like the app is fully exercised in production you could try turning on "Log Database Calls" in Admin - > Datasources or maybe even at the JDBC driver level, just monitor performance to make sure it does not slow the site down unacceptably.
In short: no. You'd have to do alot of tricky parsing to make sure you get all the SQL. And because you can glob SQL together from lots of strings, you'll almost always miss some of it.
The best you're likely to do will be a case insensitive grep for "SELECT|INSERT|UPDATE|DELETE" and then manually pulling out the table names.
Depending on how the code is structured, you might be able to get the table names by regexing the SQL from clause. But that's not foolproof. Alot of people use string concatenation to build SQL statements. This is bad because it can introduce SQL injection attacks, and it also make this particular problem harder.

How do I handle an apostrophe (') in MS SQL 2008 FTS?

I have a website that utilizes MS SQL 2008's FTS (Full-Text Search). The search works fine if the user searches for a string with an apostrophe like that's - it returns any results that contain that's. However, it will not return a result if the user searches for thats, and the database stores that's.
Also, ideally a search for that's should also return thats and that, etc.
Anybody know if FTS supports this, and what I can do to enable it?
WT
use a parameter
myCommand.Parameters.AddWithValue("#searchterm", txtSearch.Text.Trim());
it will be handle by you, without any hassle.
regarding that, that's and thats you should look up SOUNDEX keyword. 4GuysfromRolla have an old article about it as well.
updated
there is a great talk from one of the TFS Team member regarding this here.
quoting him:
Daniel is correct Full-text Search (FTS) does not use SOUNDEX directly,
but it can be used in combination with SOUNDEX. Additionally, you may
want to review the following links as well as the below TSQL
examples of combining CONTAINS & SOUNDEXYou may want to look at
some of the improved soundex algorithms as well as the
Levenshtein Distance algorithm You
should be able to search Google to find more code examples, for example:
'METAPHONE soundex "sql server" fuzzy name search' and I quickly found -
"Double Metaphone Sounds Great" at http://www.winnetmag.com/Article/ArticleID/26094/26094.html
You can freely download the code in a zip file that has several a user-defined function (UDF) that implement Double Metaphone. Below
are some additional SOUNDEX links:
http://www.merriampark.com/ld.htm
http://www.bcs-mt.org.uk/nala_006.htm
(omitted for out of scoop)
use pubs
-- Combined SOUNDEX OR CONTAINS query that
-- Searches for names that sound like "Michael".
SELECT
au_lname, au_fname
FROM
authors -- returns 2 rows
WHERE
contains(au_fname, 'Mich*') or SOUNDEX(au_fname) = 'M240'
Thanks,
John
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/

Resources