How to query for similar words? - mariadb

I have built an application in PHP with MariaDB, and I'm not sure how to accomplish this: I want users to be able to search for a word and get both exact matches and similar results.
For example, a user recently typed Adison into the search box and nothing came up. The correct spelling is Addison.
I saw some answers based on Levenshtein distance, but I figure there has to be a simpler solution.
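For reference, the Levenshtein idea mentioned above is roughly the following. This is only a sketch in Python rather than PHP or SQL, the name list is made up, and the helper names (levenshtein, similar_names) are hypothetical. It is here to show what "similar" means in that approach, not to suggest it is the simplest option.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similar_names(query: str, names: list[str], max_distance: int = 1) -> list[str]:
    """Return the names within max_distance edits of query, ignoring case."""
    return [n for n in names if levenshtein(query.lower(), n.lower()) <= max_distance]


# Made-up data; in the real application the names would come from the database.
names = ["Addison", "Jackson", "Harper"]
print(similar_names("Adison", names))
```

Adison is one inserted letter away from Addison, so it is found with max_distance=1, while the other names are further away and are skipped.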

Related

Wordpress search results by keyword

Is there a way to prioritise the search results by keywords or tags.
For example, if a user types in Admissions, then the domain/admissions page comes first, followed by all the rest. If the word is apply, then the domain/apply page is first on the search results page, and so on.
I was trying to find examples, but no luck. I have never used the extended search before, so I don't even know where to dig.
What you are talking about here is returning search results by "relevance" and there are a few examples out there including plugins such as Relevanssi.
I have no example of how to do this, but I thought it might be useful to know the term so that you can dig deeper.
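To make the idea of "relevance" a bit more concrete, here is a rough sketch in plain Python. It is not WordPress or Relevanssi code; the page dictionaries, the field names (slug, tags, body) and the weights are all assumptions picked for illustration.

```python
def rank_pages(query: str, pages: list[dict]) -> list[dict]:
    """Order matching pages so slug and tag matches come before text matches."""
    term = query.strip().lower()

    def score(page: dict) -> int:
        points = 0
        if page["slug"].lower() == term:
            points += 100  # the page whose slug is the keyword wins
        if term in (tag.lower() for tag in page["tags"]):
            points += 10   # page is tagged with the keyword
        if term in page["body"].lower():
            points += 1    # keyword merely appears in the text
        return points

    matches = [page for page in pages if score(page) > 0]
    return sorted(matches, key=score, reverse=True)


# Made-up pages for illustration.
pages = [
    {"slug": "fees", "tags": [], "body": "See admissions for fee details."},
    {"slug": "admissions", "tags": ["admissions"], "body": "How to apply."},
    {"slug": "about", "tags": [], "body": "Our history."},
]
print([page["slug"] for page in rank_pages("Admissions", pages)])
```

With these weights the admissions page is listed first and the fees page, which only mentions the word, comes after it.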

Advice to implement a search for a big table

I've created a table in ASP.NET (with GridView) that is expected to have lots of rows. There's also pagination, and because I didn't want the page to refresh every time the user uses it, everything goes through JavaScript.
Now I want to implement a search option with regex support but I'm not sure what the best way is to do it. I need the search on all the rows (not only the rows of the current page).
I'm not sure if it's possible to use regex in an sql query (I searched a bit on Google but didn't really find anything recent that helped me). The other option would be to query all the rows and filter them on the server but I'm afraid it will slow things down when we have like 2000 rows.
What would you do? Is there another option I didn't think of?
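The "query all the rows and filter them on the server" option could look roughly like this. It is a sketch in Python rather than ASP.NET, with made-up rows and a hypothetical search_rows helper; the point is that the regex is applied to every row first and only then is one page cut out of the matches.

```python
import re


def search_rows(rows, pattern, page=1, page_size=10):
    """Filter all rows with a regex, then return one page plus the total count."""
    try:
        regex = re.compile(pattern, re.IGNORECASE)
    except re.error:
        return [], 0  # invalid user-supplied pattern: treat as no results
    matches = [row for row in rows
               if any(regex.search(str(value)) for value in row)]
    start = (page - 1) * page_size
    return matches[start:start + page_size], len(matches)


# Made-up rows for illustration.
rows = [("Alice", "London"), ("Bob", "Paris"), ("Alina", "Lisbon")]
page_rows, total = search_rows(rows, r"^ali", page=1, page_size=1)
print(page_rows, total)
```

Returning the total alongside the page lets the JavaScript pager show the right number of pages for the filtered result.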

What is a strategy for a simple site search in a SQL Server 2008 and ASP.NET MVC environment?

I am trying to hash out a strategy for implementing a very simple site search in ASP.NET MVC and SQL Server 2008.
Really, all I want to do is rank search results based on the number of times a search word or phrase is found in the webpage. I attempted to do this using LINQ to SQL, but I ran into a lot of issues where some LINQ commands don't have a SQL equivalent. This was a few months ago, so I don't remember the specific errors.
So, I'm just trying to figure out an approach. What I'm thinking is this:
Approach 1:
I should probably write a program to spider the site and somehow index the site's text - I'm thinking I should save information in a table like:
ID
Word
URL
I could then query that and rank based on how many times that word is associated with a certain URL. But then I realized that this technique would completely break down if a user was searching for a phrase.
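Approach 1 boils down to counting rows per URL for a given word. A rough sketch in Python, with made-up rows standing in for the (ID, Word, URL) table (the ID column is left out) and a hypothetical rank_urls helper:

```python
from collections import Counter

# Made-up rows of the word index: one (word, url) pair per occurrence.
index_rows = [
    ("apple", "/fruit"),
    ("apple", "/fruit"),
    ("apple", "/pie"),
    ("pear", "/fruit"),
]


def rank_urls(word, rows):
    """Return (url, count) pairs for the word, most occurrences first."""
    counts = Counter(url for indexed_word, url in rows
                     if indexed_word == word.lower())
    return counts.most_common()


print(rank_urls("Apple", index_rows))
```

It also shows the limitation mentioned above: because the index stores single words, there is nothing to look up for a phrase such as "apple pie".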
Approach 2:
Then I was toying with the idea of using SPROCs to create a temporary table with a record for each URL. The SPROC would somehow parse the text and determine how many times the phrase or word appeared in each individual URL, and then we would return the results from the temp table. I am thinking the temporary table would look something like this:
ID
SearchText
URL
Frequency
And then select * from temptable order by Frequency desc (so the most frequent matches come first) or something like that.
However, I'm not sure if SPROCs are capable of parsing text like that, or if simultaneous searching would be possible.
I am looking for something very lightweight. I'm not really interested in using Lucene or Solr or anything like that, because the learning curve seems very steep and those applications offer far more features than I need.
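The counting step of Approach 2 is simple enough outside of a SPROC. Here is a sketch in Python, assuming the page text has already been fetched into a dictionary keyed by URL (the data and the phrase_frequencies helper are made up). It works the same for a single word or a phrase, and returns the highest frequency first.

```python
def phrase_frequencies(phrase, pages):
    """Return (url, frequency) for pages containing the phrase, best first."""
    needle = phrase.lower()
    results = []
    for url, text in pages.items():
        frequency = text.lower().count(needle)  # non-overlapping occurrences
        if frequency:
            results.append((url, frequency))
    return sorted(results, key=lambda result: result[1], reverse=True)


# Made-up page text for illustration.
pages = {
    "/fruit": "An apple a day.",
    "/pie": "Apple pie recipe. The best apple pie.",
    "/about": "Contact us.",
}
print(phrase_frequencies("apple", pages))
print(phrase_frequencies("apple pie", pages))
```

Since each call builds its own result list, simultaneous searches would not interfere with each other the way a shared temporary table might.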
Any thoughts on how I should approach this problem? Is there a different approach that I should consider?
For your phrase versus word issue, why not use wildcards and LIKE operators?
Select Count(*) from temptable where SearchPhrase LIKE '%Apple%'
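A quick way to try the LIKE idea is sketched below in Python, with an in-memory SQLite database standing in for SQL Server. The pages table, its columns and the count_pages_matching helper are assumptions for illustration, and LIKE behaviour (case sensitivity in particular) depends on the database and collation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("/pie", "Apple pie recipe"),
        ("/fruit", "An apple a day"),
        ("/tart", "Pear tart"),
    ],
)


def count_pages_matching(connection, phrase):
    """Count pages whose body contains the phrase anywhere."""
    # The phrase is bound as a parameter, never pasted into the SQL text.
    # Note: % and _ inside the phrase still act as LIKE wildcards here.
    row = connection.execute(
        "SELECT COUNT(*) FROM pages WHERE body LIKE ?",
        (f"%{phrase}%",),
    ).fetchone()
    return row[0]


print(count_pages_matching(conn, "apple pie"))
print(count_pages_matching(conn, "apple"))
```

Note that this counts matching pages, not occurrences within a page, so on its own it does not give the frequency ranking asked for in the question.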
Maybe not exactly what you want, but Windows SharePoint Search Server isn't all that bad.
Yes, it has the word 'SharePoint' in it, which would usually make me grab the scissors on my desk and start stabbing my eyes out, but having to use it once in a pinch, I was actually somewhat impressed with it.
It's free, so maybe worth a couple of hours playing with it for comparison to writing something custom.
After a little poking around, it looks like SQL Server 2008's Full Text Search is what I would want to use. I'm not 100% sure yet, but it looks promising.
http://msdn.microsoft.com/en-us/library/ms142547.aspx
If you're considering Full Text Search, then also check out lucene.net.
I used FTS for one project, and later used lucene.net for another, and although the requirements were different from yours, I'd never go back to FTS now.

Interpreting Search Results

I am tasked with writing a program that, given a search term and the HTML source of a page representing search results of some unknown search engine (it can really be anything: a blog, a shop, Google, eBay, ...), needs to build a data structure of the results containing "what's in the results": a title for each result, the "details" link, the position within the results, etc. It is not known whether the results page contains any of the data at all, or whether there are any search results. The goal is to feed the data structure into another program that extracts meaning.
What I am looking for is not BeautifulSoup or a RegExp but rather some clever ideas or algorithms on how to interpret the HTML source. What do I do to find out what part of the page constitutes a single result item? How do I filter the markup noise to extract the important bits? What would you do? Pointers to fields of research covering what I am trying to do are also greatly appreciated.
Thanks, Simon
I doubt that there exists a silver-bullet algorithm that, without any training, will just work on any arbitrary search query output.
However, this task can be solved, and actually is solved in many applications, but with a different approach. First you have to define the general structure of a single search result item based on what you are actually going to do with it (it could be name, date, link, description snippet, etc.), and then write a number of HTML parsers that will extract the necessary fields from the search result output of particular web sites.
I know it is not a super sexy solution, but it is probably the only one that works. And it is not rocket science. Writing parsers is actually extremely simple; you can make a dozen per day. If you look into the HTML source of a search result, you will notice that the output results are typically very structured and marked with specific div sections or class attributes, so it is very easy to find them in the document. You don't even have to use any complicated HTML parsing library for that; something grep-like will be enough.
For example, on this particular page your question starts with <div class="post-text"> and ends with </div>. Everything in between is actually a post text with some HTML formatting that you may want to remove along with extra spaces and "\n". And this <div class="post-text"> appears on the page only once.
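As a rough illustration of the grep-like approach, here is a small sketch in Python. The extract_post_text name is made up, and it assumes the post-text div contains no nested div elements (the non-greedy match stops at the first closing tag), which is exactly the kind of site-specific assumption such a parser bakes in.

```python
import html
import re


def extract_post_text(page_source):
    """Return the text inside <div class="post-text">, or None if absent."""
    match = re.search(r'<div class="post-text">(.*?)</div>',
                      page_source, re.DOTALL)
    if match is None:
        return None
    text = re.sub(r"<[^>]+>", " ", match.group(1))  # drop inline formatting tags
    text = html.unescape(text)                       # decode entities like &amp;
    return " ".join(text.split())                    # collapse spaces and newlines


page = '<html><div class="post-text">\n<p>Hello <b>world</b> &amp; all</p>\n</div></html>'
print(extract_post_text(page))
```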
Once you go to a large scale with your retrieval application, you will find that there is not that big a variety of search engines across different sites, and you will be able to reuse already created parsers for sites using similar search engines.
The only thing you have to remember is built-in self-testing. Sites tend to upgrade and change their design from time to time. If your application is going to live for some time, you will need to include in your parsers some logic that checks the validity of their results and notifies you every time the search output has changed and is no longer compatible with your parser. Then you will have to modify that particular parser or write a new one.
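Such a self-test can be as simple as a sanity check on what the parser returned, run against a query that is known to produce results. A sketch in Python, where the result fields (title, link) and the validate_results name are assumptions for illustration:

```python
def validate_results(results, min_expected=1):
    """Return a list of problems in the parsed results; empty means they look fine."""
    problems = []
    if len(results) < min_expected:
        problems.append(
            f"expected at least {min_expected} results, got {len(results)}")
    for position, item in enumerate(results, start=1):
        if not item.get("title", "").strip():
            problems.append(f"result {position}: empty title")
        if not item.get("link", "").startswith(("http://", "https://", "/")):
            problems.append(f"result {position}: suspicious link")
    return problems


good = [{"title": "First hit", "link": "https://example.com/a"}]
broken = [{"title": " ", "link": "javascript:void(0)"}]
print(validate_results(good))
print(validate_results(broken))
```

A non-empty list is the signal to send a notification and go look at whether the site changed its markup.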
Hope this helps.

Token replacement

I currently implement a replace function in the page render method which replaces commonly used strings - such as replace [cfe] with the root to the customer front end. This is because the value may be different based on the version of the site - for example the root to the image folder ([imagepath]) is /Images on development and live, but /Test/Images on test.
I have a catalogue of products for which I would like to change [productName] to a link to the catalogue page for that product. I would like to go through the entire page and replace all instances of [someValue] with the relevant link. Currently I do this by looping through all the products in the product database and replacing [productName] with the link to the catalogue page for that product. However, this is limited to products which exist in the database. "Links" to products which have been removed currently won't be replaced, so [someValue] will be displayed to the user. This does not look good.
So you should be able to see my problem from this. Does anyone know of a way to achieve what I would like to easily? I could use regexes, but I don't have much experience of those. If this is the easiest way, using "For Each Match As String In Regex.Matches(blah, blah)" then I am willing to look further into this.
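For what it's worth, the regex route can be sketched as below. This is Python rather than .NET, the token pattern and the replace_tokens helper are assumptions, and the fallback for unknown tokens (show the bare name without brackets) is just one possible choice. The idea is to find every [token] in one pass and look each one up, instead of looping over the products.

```python
import re

# A token is a bracketed run of letters, digits or underscores, e.g. [cfe].
TOKEN = re.compile(r"\[([A-Za-z0-9_]+)\]")


def replace_tokens(text, replacements):
    """Replace each [name] with its value; unknown names lose their brackets."""
    def substitute(match):
        name = match.group(1)
        return replacements.get(name, name)
    return TOKEN.sub(substitute, text)


# Made-up replacements; product links would be built from the database.
replacements = {
    "cfe": "/shop",
    "Widget": '<a href="/catalogue/widget">Widget</a>',
}
print(replace_tokens("Go to [cfe] and buy [Widget] or [OldGadget].", replacements))
```

Because the lookup happens per match, a removed product such as OldGadget no longer shows up in brackets; it is rendered as plain text instead.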
However, at some point I would like to take this further, for example setting page layouts such as 3 columns with an image top right using [layout type="3colImageTopRight" imageURL="imageURL"]Content here[/layout]. I think I could more or less do this now, but I can't figure out how to deal with it if the imageURL were, say, [Image:Product01.gif]. Using regex.match("[[a-zA-Z]{0,}]"), I think the match would be just [layout type="3colImageTopRight" imageURL="[Image:Product01.gif] (it would not get to the end of the layout tag). Obviously the above wouldn't quite work, as I haven't included double quotes in the match string or anything, but you should get the general idea of what I am trying to do.
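One possible way around the nesting problem is to resolve the inner tokens first, so that by the time the layout tag is matched it no longer contains any brackets. A sketch in Python under several assumptions: the patterns, the image root and the HTML that the layout turns into are all made up for illustration.

```python
import re

IMAGE = re.compile(r"\[Image:([^\[\]]+)\]")
# The attribute part may not contain brackets, so this only matches once
# the inner [Image:...] token has been resolved.
LAYOUT = re.compile(r"\[layout ([^\[\]]*)\](.*?)\[/layout\]", re.DOTALL)
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def render(text, image_root="/Images"):
    """Resolve [Image:...] tokens first, then expand [layout ...]...[/layout]."""
    # Pass 1: inner tokens become plain paths.
    text = IMAGE.sub(lambda m: f"{image_root}/{m.group(1)}", text)

    # Pass 2: expand the layout tag into (hypothetical) HTML.
    def layout(match):
        attrs = dict(ATTRIBUTE.findall(match.group(1)))
        return (f'<div class="{attrs.get("type", "")}">'
                f'<img src="{attrs.get("imageURL", "")}">'
                f'{match.group(2)}</div>')

    return LAYOUT.sub(layout, text)


source = ('[layout type="3colImageTopRight" '
          'imageURL="[Image:Product01.gif]"]Content here[/layout]')
print(render(source))
```

This handles one level of nesting; deeper or recursive structures would probably be better served by a small parser than by regular expressions.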
Does anyone have any ideas or pointers which could help me with this? Also if this is not strictly token replacement then please point me to what it is, so I can further develop this.
Aristos - hope reexplaining this resolves the confusion.
Thanks in advance,
Regards,
Richard Clarke
@RichardClarke - I would go with regular expressions. They're not as terrible to learn as you might think, and with a bit of careful usage they will solve your problems.
I've always found this a very useful tool.
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
goes nicely with a cheat sheet ;-)
http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/
Good luck.

Resources