Find Unique Words in One or More Columns?

I'm looking at implementing tags in my ASP.NET website. After looking at several algorithms, I'm leaning towards having a couple of database columns that contain one or more tag words. I will then use full-text search to locate rows with specified tags.
All of this seems pretty straightforward except for one thing: I need to be able to generate a list of available tags, which the user can select from.
I know I can write a C# program that builds the list of available tags and runs once a week or so, but I was wondering whether there's a way to do this more efficiently in SQL.
Also, I can't help but notice that the words will be extracted anyway as part of building the full-text index. I don't suppose there's any way to access that information?

This isn't how I'd choose to structure this but to answer the actual question...
In SQL Server 2008 you can query the sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document table-valued functions to get the information that you want.
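For example, to list every indexed term in a table's full-text index (the database and table names here are placeholders for your own):

SELECT display_term, column_id, document_count
FROM sys.dm_fts_index_keywords(DB_ID('MyDatabase'), OBJECT_ID('dbo.TaggedItems'));

Note that the output includes index internals such as word stems and the END OF FILE marker, so you'll likely want to filter the results before showing them to users.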

Why not use a separate table for tags, with a many-to-many relationship to the tagged items table?
I mean something like this:
--Articles
ArticleId
Text
--Tags
TagId
Name
--TagsToArticles
ArticleRef
TagRef
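
A minimal sketch of that schema in T-SQL (the names and types are assumptions; adjust to your model). With this in place, the list of available tags is just a query against Tags rather than something you rebuild weekly:

CREATE TABLE Articles (
    ArticleId int IDENTITY PRIMARY KEY,
    [Text] nvarchar(max) NOT NULL
);

CREATE TABLE Tags (
    TagId int IDENTITY PRIMARY KEY,
    Name nvarchar(100) NOT NULL UNIQUE
);

CREATE TABLE TagsToArticles (
    ArticleRef int NOT NULL REFERENCES Articles(ArticleId),
    TagRef int NOT NULL REFERENCES Tags(TagId),
    PRIMARY KEY (ArticleRef, TagRef)
);

-- the list of available tags is now a plain query
SELECT Name FROM Tags ORDER BY Name;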

How to query elements from a list of items in wikidata?

There is a list of proper names of stars here: https://www.wikidata.org/wiki/Q1433418
How can I query this in the Wikidata Query Service so that all individual names of stars are listed, along with other data in the list, such as the constellation?
In other words, how do I get at the members of the list? "Instance of" doesn't seem to work.
The confusion here comes from the fact that List of proper names of stars (Q1433418) is an item that centralizes links to the Wikipedia pages playing this role in the different Wikipedia editions, but it doesn't really play any meaningful role in Wikidata itself: no item is an instance of (P31) List of proper names of stars (Q1433418).
You will have more luck looking for items that are an instance of (P31) star (Q523), or an instance of some subclass of (P279) star, a pattern you will find in many of the SPARQL query examples: ?star wdt:P31/wdt:P279* wd:Q523 .
That could give this query (json version).
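As a rough sketch, such a query might look like the following (pulling in the constellation via P59 and English labels is my assumption of what you want; neither appears in the original answer):

SELECT ?star ?starLabel ?constellationLabel WHERE {
  ?star wdt:P31/wdt:P279* wd:Q523 .              # anything that is a star or an instance of a subclass of star
  OPTIONAL { ?star wdt:P59 ?constellation . }    # P59 = constellation
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100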
And if you're into JS, you can parse the JSON result with this function I wrote: wdk.simplifySparqlResults
I would not take official names of stars from there. Wikipedia is one of the most useful resources for getting first-hand, somewhat organised information on any topic. It is irreplaceable for this, and it would be a great mess not having it. However, the information is very sensitive to misuse caused by vandalism or clumsy editors.
To get the (only) official proper names of stars, look to the effort the IAU started this year; I would use that as the reference. The list is stored in a text file which is easy to retrieve programmatically, and it is updated as the Committee accepts more star names. It is here:
http://www.pas.rochester.edu/~emamajek/WGSN/IAU-CSN.txt
In fact, as you can see, the file is presented in a format ready for use by software applications. It was made to meet needs like yours.

Full-Text search in Sql server with multiple tables and ranking

We have a website running on DNN 7.1 with SQL Server, and we implemented full-text search to show search results. We need to search several tables and show the combined results to the user. Right now, the user enters search word(s) and clicks search; the code-behind creates several threads to search the different tables and then merges the data.

Currently we are using the CONTAINS predicate. The issue is that there is no ranking, so after the merge the results on the first page are sometimes not the best matches. I thought I could use CONTAINSTABLE and order the results by rank, but I read that the rank value has no meaning by itself; it merely tells which rows match best within the current result set. In my scenario I have multiple result sets, so how will I know which are the best matches across all of them? Or am I going about this the wrong way? What is a good way to handle this scenario? We need to improve the response time along with getting better results. Any help is greatly appreciated.
This is how we implemented full-text searching across multiple tables:
1) Create a new table with columns for the primary keys of the source tables, a column holding the concatenated string values of all the searchable fields from each table, and a column holding a checksum of the concatenated values.
2) Build the full-text index on this new table, and create a job that regularly synchronizes/updates the concatenated search values, but only when the BINARY_CHECKSUM value differs.
3) Use the CONTAINS predicate on this new table and, based on the results, join back to the corresponding source tables via the primary keys returned (a sketch follows below).
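A minimal sketch of that setup, with invented table and column names. Because everything ends up in one full-text index, using CONTAINSTABLE instead of CONTAINS also yields a single RANK that is comparable across all source tables, which addresses the ranking problem in the question:

CREATE TABLE dbo.SearchIndex (
    SearchId     int IDENTITY PRIMARY KEY,
    SourceTable  varchar(50) NOT NULL,   -- which table the row came from
    SourceId     int NOT NULL,           -- primary key in that source table
    SearchText   nvarchar(max) NOT NULL, -- concatenated searchable fields
    TextChecksum int NOT NULL            -- BINARY_CHECKSUM of SearchText
);
-- (assumes a full-text catalog plus a full-text index on SearchText, keyed on the SearchId PK)

-- in the sync job, touch only rows whose checksum changed (shown here for one source table)
UPDATE si
SET    SearchText   = a.Title + N' ' + a.Body,
       TextChecksum = BINARY_CHECKSUM(a.Title + N' ' + a.Body)
FROM   dbo.SearchIndex si
JOIN   dbo.Articles a ON a.ArticleId = si.SourceId
WHERE  si.SourceTable = 'Articles'
  AND  si.TextChecksum <> BINARY_CHECKSUM(a.Title + N' ' + a.Body);

-- search once, then join back to the source tables by SourceTable/SourceId
SELECT si.SourceTable, si.SourceId, ft.[RANK]
FROM   CONTAINSTABLE(dbo.SearchIndex, SearchText, '"full text" OR ranking') ft
JOIN   dbo.SearchIndex si ON si.SearchId = ft.[KEY]
ORDER  BY ft.[RANK] DESC;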

How to setup data model for customizable application

I have an ASP.NET data entry application that is used by multiple clients. The application consists of multiple data entry modules that are common to all clients.
I now have multiple clients that want their own custom module added which will typically consist of a dozen or so data points. Some values will be text, others numeric, some will be dropdown selections, etc.
I'm in need of suggestions for handling the data model for this. I have two thoughts on how to handle it. The first would be to create a new table for each new module for each client. This is pretty clean, but I don't particularly like it. My other thought is to have one table with columns for each custom data point for each client. This table would end up with a lot of columns and a lot of NULL values. I don't really like either solution and suspect there's a better way to do this, so any feedback you have will be appreciated.
I'm using SQL Server 2008.
As always with these questions, "it depends".
The dreaded key-value table.
This approach relies on a table which lists the fields and their values as individual records.
CustomFields(clientId int, fieldName sysname, fieldValue varbinary)
Benefits:
Infinitely flexible
Easy to implement
Easy to index
Non-existent values take no space
Disadvantage:
Showing a list of all records with the complete field list requires a very dirty query (see the pivot sketch below)
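To illustrate, flattening the key-value rows back into columns means a pivot along these lines (the field names are invented, and this assumes the varbinary values were stored as casts of nvarchar):

SELECT clientId,
       MAX(CASE WHEN fieldName = 'Color'  THEN CAST(fieldValue AS nvarchar(100)) END) AS Color,
       MAX(CASE WHEN fieldName = 'Weight' THEN CAST(fieldValue AS nvarchar(100)) END) AS Weight
FROM   CustomFields
GROUP  BY clientId;

Every new field means another CASE branch, which is why this gets dirty fast.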
The Microsoft way
The Microsoft answer to this kind of problem is "sparse columns" (introduced in SQL Server 2008)
Benefits:
Blessed by the people who design SQL Server
records can be queried without having to apply fancy pivots
Fields without data don't take space on disk
Disadvantages:
Many technical restrictions
a new field requires DDL (ALTER TABLE ... ADD ... SPARSE; see the sketch below)
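A minimal sketch, with invented names, of what the sparse-column approach looks like; the optional column set exposes all sparse columns as a single XML blob:

CREATE TABLE dbo.ClientModuleData (
    RecordId  int IDENTITY PRIMARY KEY,
    ClientId  int NOT NULL,
    SomeText  nvarchar(100) SPARSE NULL,   -- NULLs in sparse columns take no space
    SomeValue decimal(10,2) SPARSE NULL,
    AllFields xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);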
The xml tax
You can add an xml field to the table which will be used to store all the "extra" fields.
Benefits:
unlimited flexibility
can be indexed
storage efficient (when it fits in a page)
With some XPath gymnastics the fields can be included in a flat recordset (see the sketch after this list)
schema can be enforced with schema collections
Disadvantages:
not clearly visible what's in the field
XQuery support in SQL Server has gaps, which sometimes makes getting your data out a real nightmare
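For instance, pulling the "extra" fields into a flat recordset might look like this (the table name, column name, and XML shape are all assumptions):

SELECT RecordId,
       ExtraFields.value('(/fields/field[@name="Color"]/text())[1]',  'nvarchar(100)') AS Color,
       ExtraFields.value('(/fields/field[@name="Weight"]/text())[1]', 'decimal(10,2)') AS Weight
FROM   dbo.CustomRecords;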
There are maybe more solutions, but to me these are the main contenders. Which one to choose:
Key-value seems appropriate when the number of extra fields is limited (say, no more than 10-20 or so)
Sparse columns are more suitable for data with many properties that are each filled out infrequently, i.e. when you can have many extra fields
An xml column is very flexible, but a pain to query. Appropriate for solutions that write rarely and query rarely, i.e. don't run aggregates etc. on the data stored in this field.
I'd suggest you go with the first option you described. I wouldn't overthink it. The second option you outlined would be a bad idea in my opinion.
If there are fields common to all the modules you're adding to the system, consider keeping those in a single table, then have other tables with the fields specific to a particular module relate back to the primary key in the common table. This is basically table inheritance (http://www.sqlteam.com/article/implementing-table-inheritance-in-sql-server), and it will centralize the common module data and make it easier to query across modules.
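A rough sketch of that layout (the module and field names are invented):

CREATE TABLE dbo.ModuleEntry (            -- fields common to every module
    EntryId   int IDENTITY PRIMARY KEY,
    ClientId  int NOT NULL,
    CreatedAt datetime NOT NULL DEFAULT GETDATE()
);

CREATE TABLE dbo.ClientAcmeModule (       -- one table per custom module
    EntryId    int PRIMARY KEY REFERENCES dbo.ModuleEntry(EntryId),
    SomeText   nvarchar(255) NULL,
    SomeNumber decimal(10,2) NULL
);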

Search newbie, should I use Full Text or not? (SQL Server 2008 Express R2)

For a website I'm creating I need to search a few tables like Articles, Products, and maybe the ForumThread and ForumPosts tables. I currently have a very simple LIKE search query against each of these tables' title columns, which are VARCHAR(255). The title columns are indexed too.
In the future, however, I want to search the Description fields too, which are VARCHAR(MAX), and I'm guessing this will be very slow once there are lots of records.
Now I came across full text search and have the following questions about it:
Will full-text search speed up these kinds of simple search operations?
Can I still use a LIKE query in similar ways or do I need to rewrite all search queries?
Maybe not full text search related but how can I search in multiple tables? I'm now querying each table one by one.
If I enable full-text search, will this eat more RAM? (I'm on a 1 GB RAM VPS right now.)
As you can see I have absolutely no experience with this, and even after reading theory I'm still a little confused about what it really does.
I hope someone can give me a little guidance on this,
Thank you for your time.
Kind regards,
Mark
The big problem with your LIKE-based queries is that they almost certainly can't use normal indexes. So it won't do you any good to add an index on the description column to help with performance. Full Text queries consist of two parts: 1) changing your query to use (for example) the CONTAINS() keyword instead of LIKE and 2) creating a different kind of index that the queries using these keywords will be able to take advantage of.
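In concrete terms, the change looks like this (the table and column names are invented, and the second query assumes a full-text index already exists on the Description column):

-- LIKE with a leading wildcard cannot seek on a normal index; it scans
SELECT ProductId FROM dbo.Products WHERE Description LIKE '%widget%';

-- the full-text version can use the full-text index
SELECT ProductId FROM dbo.Products WHERE CONTAINS(Description, 'widget');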
Here's the thing: it's not just the size of the field that determines whether full text will have a big impact. It's also the number of rows. You may have a simple nvarchar(100) that's only expected to hold a short phrase, but if you have to search millions of rows full text can still search this faster. The key there is the "have to search" part - if you have other filters that can significantly limit the working set, your LIKE query might still do fine. Another scenario is an nvarchar(max) field with only a few dozen rows, but each of those records has as much text as a novel. In this case, you'll still want to use a full text index.
There are two other important considerations for full text searches. One is that they tend to hog disk space. This isn't hugely important for most databases, but it is worth mentioning. The other is that they often need to be manually re-calculated, such that an article isn't ready for searching the moment it's added to the DB.
An alternative that sits somewhere between full-text searching and simple LIKE searches, and that will give you much better performance, some weighting ability, and simpler searching across multiple tables, is to build your own keyword index, e.g. create a table:
keyword   count   tableid   columnid   rowid
-------   -----   -------   --------   -----
varchar   int     int       int        int
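
A concrete sketch of that table in T-SQL (names are my own), together with the kind of lookup a search then becomes:

CREATE TABLE dbo.KeywordIndex (
    Keyword  varchar(100) NOT NULL,  -- the extracted word
    Cnt      int NOT NULL,           -- occurrences of the keyword in that cell
    TableId  int NOT NULL,           -- which source table
    ColumnId int NOT NULL,           -- which column in that table
    RowId    int NOT NULL            -- primary key of the source row
);

CREATE INDEX IX_KeywordIndex_Keyword
    ON dbo.KeywordIndex (Keyword) INCLUDE (Cnt, TableId, ColumnId, RowId);

-- a search is then a single indexed lookup, weighted by keyword counts
SELECT TableId, RowId, SUM(Cnt) AS Weight
FROM   dbo.KeywordIndex
WHERE  Keyword IN ('forum', 'posts')
GROUP  BY TableId, RowId
ORDER  BY Weight DESC;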
You would of course need triggers or a service of some kind to keep this up to date, but what you end up with is a lightweight cross reference of the counts of all relevant keywords and where they appear. Your search queries then only need to look up the keywords in this index.
This only works for single keywords, though, so it won't support phrase searches. You'll also have to incorporate logic to deal with things like plurals and irrelevant words. On the other hand, it is extremely fast. If performance is becoming a problem for LIKE searches and you need more than just keyword searching, full-text search is probably the best way to go.
Full-text search is really intended for when your application needs to do intensive searching of BIG blocks of text rather than simple fields of text for storing names, descriptions etc.
For example, I've used it for things like quickly searching through the content of books/CVs. It actually creates word-by-word indexes of all the stored content, and it will probably be overkill if you're not working with massive bits of text.
One design change you could make instead is to use nvarchar(max) instead of varchar; this lets you handle Unicode text (covering most known writing systems) and should be large enough for your needs as outlined above.

Keyword search with SQL Server

I have a scenario where I need to search for cars by keyword using a single search field. The keywords can relate to any of the car's attributes, e.g. the make, the model, or the body style. In the database there is a table named 'Car' with foreign keys referencing tables that represent models, makes, and body styles.
What would be the best way of doing this? Specifically, how should I parse the query from the user (it must support exact phrases plus OR and AND operators), and how do I actually perform the search?
I am using SQL Server and ASP.NET 3.5 (Data access using LINQ)
Easily the best and most comprehensive article on the subject: http://www.sommarskog.se/dyn-search-2005.html
Regardless of which implementation you pick from the article linked above, I always log the search criteria and execution time in this situation. Just because you provide search flexibility, it doesn't mean most users will make use of it. You usually find that most searches occur on a limited number of fields, and logging the search criteria will allow you to create targeted indexes.
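A minimal sketch of that logging (the table and column names are my own; the duration would be measured in the application):

CREATE TABLE dbo.SearchLog (
    LogId      int IDENTITY PRIMARY KEY,
    Criteria   nvarchar(1000) NOT NULL,  -- what the user searched for
    DurationMs int NOT NULL,             -- measured around the search call
    LoggedAt   datetime NOT NULL DEFAULT GETDATE()
);

-- which criteria actually occur, and how slow they are, to guide targeted indexes
SELECT Criteria, COUNT(*) AS Searches, AVG(DurationMs) AS AvgMs
FROM   dbo.SearchLog
GROUP  BY Criteria
ORDER  BY Searches DESC;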
