I am developing an ASP.NET web application. The web application searches for people across about thirty different databases (it is a law enforcement app). For example, a police officer searches for Fred Smith, DOB 01/01/1950 (not a real person), and it returns any hits from all the databases.
For the majority of searches the speed is acceptable, i.e. if the average person has five hits then the average time to load the page is five seconds. However, some searches have hundreds and sometimes thousands of hits. I saw one search which took 25 minutes, which is obviously not acceptable.
Longer term a data warehouse will probably be created to ensure all the data is in one database. However, what is the best strategy for speeding up searches in this scenario? I thought of caching but the same person is rarely searched for twice in a small amount of time. Are there any other ideas?
You have given insufficient details (which databases, what the frontend language is, how you are querying, whether there is indexing, etc.),
but here are some preliminary suggestions (probably in increasing order of effort):
heavily index all the databases for the key columns and do your searches.
multi threading - spawn 30 threads (1 per database) and do the search. start displaying the results as the threads come back.
have a backend job to consolidate all data from 30 databases into a single denormalized table which is fully indexed. query that table.
set up #3 with a Solr/Lucene-like indexing engine, for even faster querying.
use big data etc.
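To make the multi-threading suggestion concrete, here is a minimal sketch in Python (used only for illustration; the question's app is ASP.NET). It assumes each database can be searched independently. `search_database` is a hypothetical stand-in for the real per-database call, and the database names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_database(db_name, query):
    """Hypothetical stand-in for a real per-database search call."""
    time.sleep(0.01)  # simulate network/database latency
    return [f"{db_name}: hit for {query}"]


def search_all(db_names, query, search_fn=search_database):
    """Yield (db_name, hits) pairs as soon as each database finishes."""
    with ThreadPoolExecutor(max_workers=len(db_names)) as pool:
        # One worker per database, so a slow source does not block the others.
        futures = {pool.submit(search_fn, name, query): name for name in db_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                yield name, future.result()
            except Exception as exc:
                # A failing source is reported instead of sinking the whole page.
                yield name, [f"error: {exc}"]


if __name__ == "__main__":
    databases = [f"db{i:02d}" for i in range(30)]
    for name, hits in search_all(databases, "Fred Smith 1950-01-01"):
        print(name, hits)
```

With this shape the total wait is roughly that of the slowest database rather than the sum of all thirty, and the page can render each source's hits as they arrive.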
You are asking the wrong question. If your db searches are taking too long, it's because the columns you are searching are not indexed, or because of some other db-related issue. Showing them 'faster' with ASP.NET is the least of your problems.
Show us some code and it's easier to help out.
Related
We have a large table of data with about 30 000 0000 rows and growing each day currently at 100 000 rows a day and that number will increase over time.
Today we generate different reports directly from the database (MS-SQL 2012) and do a lot of calculations.
The problem is that this takes time. We have indexes and so on but people today want blazingly fast reports.
We also want to be able to change time periods, look at the data in different ways, and so on.
We only need to look at data that is one day old so we can take all the data from yesterday and do something with it to speed up the queries and reports.
So do any of you have any good ideas for a solution that will be fast and still on the web, not in Excel or a BI tool?
Today all the reports are in ASP.NET C# WebForms with queries against MS SQL 2012 tables.
You have an OLTP system. You generally want to maximize your throughput on a system like this. Reporting is going to require latches and locks be taken to acquire data. This has a drag on your OLTP's throughput and what's good for reporting (additional indexes) is going to be detrimental to your OLTP as it will negatively impact performance. And don't even think that slapping WITH(NOLOCK) is going to alleviate some of that burden. ;)
As others have stated, you would probably want to look at separating the active data from the report data.
Partitioning a table could accomplish this if you have Enterprise Edition. Otherwise, you'll need to do some hackery like Partitioned Views, which may or may not work for you based on how your data is accessed.
I would look at extracting the needed data out of the system at a regular interval and pushing it elsewhere. Whether that elsewhere is a different set of tables in the same database, a different catalog on the same server, or an entirely different server would depend on a host of variables (cost, time to implement, complexity of data, speed requirements, storage subsystem, etc).
Since it sounds like you don't have super specific reporting requirements (currently you look at yesterday's data but it'd be nice to see more, etc), I'd look at implementing Columnstore Indexes in the reporting tables. It provides amazing performance for query aggregation, even over aggregate tables with the benefit you don't have to specify a specific grain (WTD, MTD, YTD, etc). The downside though is that it is a read-only data structure (and a memory & cpu hog while creating the index). SQL Server 2014 is going to introduce updatable columnstore indexes which will be giggity but that's some time off.
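As a rough illustration of the "extract at a regular interval and push it elsewhere" idea, here is a minimal sketch using Python's built-in SQLite in place of SQL Server. The table and column names (`events`, `daily_summary`, `category`, `amount`) are assumptions made up for the example, and a plain summary table stands in for whatever reporting structure (columnstore or otherwise) you end up using.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the large, constantly growing OLTP table.
    CREATE TABLE events (event_date TEXT, category TEXT, amount REAL);
    -- Small reporting table, loaded once per day.
    CREATE TABLE daily_summary (
        event_date TEXT, category TEXT, row_count INTEGER, total REAL,
        PRIMARY KEY (event_date, category)
    );
""")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("2024-01-01", "a", 10.0), ("2024-01-01", "a", 5.0),
     ("2024-01-01", "b", 1.0), ("2024-01-02", "a", 2.0)],
)


def load_day(conn, day):
    """Re-aggregate one day into the reporting table (safe to rerun)."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM daily_summary WHERE event_date = ?", (day,))
        conn.execute(
            """INSERT INTO daily_summary
               SELECT event_date, category, COUNT(*), SUM(amount)
               FROM events WHERE event_date = ?
               GROUP BY event_date, category""",
            (day,),
        )


def report(conn, start, end):
    """Answer a date-range report from the summary rows only."""
    return conn.execute(
        """SELECT category, SUM(row_count), SUM(total)
           FROM daily_summary WHERE event_date BETWEEN ? AND ?
           GROUP BY category ORDER BY category""",
        (start, end),
    ).fetchall()


load_day(conn, "2024-01-01")
load_day(conn, "2024-01-02")
print(report(conn, "2024-01-01", "2024-01-02"))
```

Because the requirement is only day-old data, the load can run as a nightly job, and changing the report's time period then touches a few summary rows per day instead of the raw table.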
Background:
I'm using SQL Server 2008 and ASP.NET 4 on Windows 2008
I have one table with about 10 million rows of products that I make available online for users to browse -- not search. Each of the 10 million products have extra attributes -- like categories -- that I keep in lookup tables -- there are three or four lookup tables.
Problem
When someone browses and starts using filters (shipping location, price, quality, brand), I need to join the tables, apply all the filters, and return the results. It's very slow and I want to make it faster. Sometimes users will apply a very broad filter, resulting in 800,000 results, and though I only return the first 10 of those for browsing, I still need to run the query for the full 800,000.
What I've Tried Already
I've joined all the information from the various tables into one physical table and then created a covering index for the table.
The queries are much faster, but there is a good bit of maintenance I have to do on the table behind the scenes with jobs to make sure if something goes out of stock I take it out within a reasonable time frame (5 mins or so).
I don't use materialized/indexed views b/c I've got aggregates in the results which SQL Server doesn't seem to like.
Question
How can I speed up browse results beyond the indexing and table optimization that I've already done? I'm not doing any full-text searches -- I'm filtering with exact parameters.
Possible Solutions I've Thought Of
Large caching solution -- AppFabric or MemCached. I know next to nothing about these and don't know whether they are appropriate.
Small caching solution -- Maybe leveraging ASP.NET caching -- but every person is going to apply different filters so I'm not sure how much this will give me.
SSDs -- as a larger-scale solution I've thought about getting SSDs but that will be down the road
CDN -- I don't think a CDN will help b/c the bottleneck here is my database's search capabilities, not the bandwidth/distance to the requester.
I had a similar problem with a complex join query causing horrible response times. I was able to solve it via using Lucene.NET. It's a .NET implementation of the Lucene search index. Basically, you build indexes on data fields (your categories) and then you can search via those categories and return thousands of rows very quickly. Basically, it takes the join operation out of the equation because it already knows, via the indexes, which records fit your criteria.
The following is a very good article on Lucene.NET. I highly recommend it. It took a search result that was taking 20 seconds using standard joins and reduced the response time to less than a second.
http://www.codeproject.com/Articles/29755/Introducing-Lucene-Net
Also, feel free to ping me if you have specific Lucene.NET implementation questions. I just got through a lot of research/learning in order to implement it properly on my site, so if you have specific questions on how to make it work I may be able to help with that as well.
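To show the general idea of why an index like this takes the join out of the equation, here is a toy inverted index in Python. This is not Lucene.NET's API, just a sketch of the underlying technique: each (field, value) pair maps to the set of matching record ids, and applying several filters becomes a set intersection. The field names and sample products are made up.

```python
from collections import defaultdict


def build_index(products):
    """Map each (field, value) pair to the set of product ids that have it."""
    index = defaultdict(set)
    for product_id, attrs in products.items():
        for field, value in attrs.items():
            index[(field, value)].add(product_id)
    return index


def search_by_filters(index, **filters):
    """Return the ids matching every filter, by intersecting posting sets."""
    postings = [index.get((field, value), set()) for field, value in filters.items()]
    if not postings:
        return set()
    postings.sort(key=len)  # start from the smallest set to keep intersections cheap
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


products = {
    1: {"category": "Shoes", "state": "TX"},
    2: {"category": "Shoes", "state": "CA"},
    3: {"category": "Hats", "state": "TX"},
}
index = build_index(products)
print(search_by_filters(index, category="Shoes", state="TX"))
```

The size of each posting set also gives you the per-filter result counts for free, which helps when the remaining filter options need to be populated.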
"I perform the full query b/c I need to populate the new filters and
the number of results along with the search results. For example, if
someone filters on category of "Shoes", and location of TX, some of
the other filters are going to be restricted based on the previous
filter."
Try executing two queries: One to count all results and one to select the top N. Maybe your bottleneck is copying 800,000 rows to the client. Doing two queries would fix this at the cost of an additional query. The cost is likely to be less than 2x though due to optimizations for few rows and for count-only queries.
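A minimal sketch of the two-query approach, using Python's built-in SQLite as a stand-in (on SQL Server you would use TOP or OFFSET/FETCH rather than LIMIT). The `products` table and its columns are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (category, price) VALUES (?, ?)",
    [("Shoes" if i % 2 == 0 else "Hats", float(i)) for i in range(1000)],
)


def browse(conn, category, page_size=10):
    """Return (total matching rows, first page of rows) using two queries."""
    # Query 1: count only, so no matching rows are shipped to the client.
    total = conn.execute(
        "SELECT COUNT(*) FROM products WHERE category = ?", (category,)
    ).fetchone()[0]
    # Query 2: fetch just the rows that will actually be displayed.
    rows = conn.execute(
        "SELECT id, price FROM products WHERE category = ? ORDER BY price LIMIT ?",
        (category, page_size),
    ).fetchall()
    return total, rows


total, rows = browse(conn, "Shoes")
print(total, rows)
```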
I am trying to figure out whether a web development project is feasible at the moment and have so far learned that the total row count of the proposed database (30 million rows, 5 columns and about 3 gb of storage) is well within the budget limits in terms of storage requirements, but because of the anticipated large number of queries that users will make to the database I am not sure if this will cause an unrealistic load to manage for the server to provide adequate performance (within my budget).
I will be using this grid (a live demo of performance benchmarks for 300,000 rows - http://demos.telerik.com/aspnet-ajax/grid/examples/performance/linq/defaultcs.aspx). Inserting a search term in the "product name" box and pressing enter takes 1.6 seconds from query to results render. It seems to me (a newbie) that 300,000 rows which take 1.6 seconds all in all must take much longer with 30 million rows, and so I am trying to figure out
what the increase in time would be as more rows are added, up to 30 million
what the increase in time would be for each additional 1000 people using the search grid at the same time.
what hardware requirements are necessary to reduce the delays to an acceptable level
Hopefully if I can figure that out I can get a more realistic assessment for feasibility. FYI: The database need not be updated very regularly, it is more for readonly purposes.
Can this problem be prototyped on paper for these 3 points?
Even wide ballpark estimates would help: without considering optimisation, am I talking hundreds of dollars for 5000 users to have searches below 10 seconds each, thousands, or tens of thousands of dollars?
[Will be asp.net RadControls for AJAX Grid, one of these cloud hosted servers: 4,096MB RAM,
160GB Diskspace, and either Microsoft® SQL Server® 2008 R2 or SQL Server 2012]
The database need not be updated very regularly, it is more for readonly purposes.
Your search filters allow for substring searches, so db indexes are not going to help you and the search will go row-by-row.
It looks like your data would probably fit in 5GB of memory or so. I would store the whole thing in memory and search there.
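A minimal sketch of what searching in memory could look like, in Python for illustration. `load_rows` is a hypothetical loader (a real one would read the table once at startup), and the sample rows are made up. It is still a row-by-row scan, just against RAM instead of disk.

```python
def load_rows():
    """Hypothetical loader; a real one would read the table once at startup."""
    return [(1, "Blue Widget"), (2, "Red Gadget"), (3, "Widget Stand")]


# Keep a lowercased copy of each name so every search avoids re-lowercasing.
ROWS = [(row_id, name, name.lower()) for row_id, name in load_rows()]


def substring_search(term, limit=50):
    """Case-insensitive substring scan that stops once `limit` rows match."""
    needle = term.lower()
    matches = []
    for row_id, name, lowered in ROWS:
        if needle in lowered:
            matches.append((row_id, name))
            if len(matches) == limit:
                break
    return matches


print(substring_search("widget"))
```

Stopping at `limit` matters for a grid that only shows one page: a common search term returns after a handful of rows instead of scanning everything.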
What would be the best way to store a very large amount of data for a web-based application?
Each record has just 3 fields, but there will be around 144 million records a day - stored for one month - 4,464,000,000 records total. Let's round up to 5 billion.
Data has to be searchable on keyword & return results as fast as possible to the end user.
Which programming language?
JSON / XML / Some Database System I've Never Heard Of?
What sort of infrastructure? Imagine this system is only serving the needs of a maximum of 1,000 users at the same time.
I assume the code is the same whether you're searching 10 records or 10 billion, you just have to be a whole lot more efficient. I also assume mySQL/PHP doesn't stand a chance, and we're going to be paying out a very large sum for a hosting solution.
Just need some guidance on where to start, really. Thank you!
There are many tools in the Big Data ecosystem (NoSQL databases, distributed computing, machine learning, search, etc) which can form an answer to your question. Since your application will be write-heavy, I would advocate Apache Cassandra for its excellent write-performance (although it requires more data modeling than a NoSQL/document database such as MongoDB). You also need a Solr or ElasticSearch based search solution, and Map/Reduce for indexes and queries.
The programming language doesn't matter unless you have business end-users which will be writing queries against your Big Data in which case you can use something very SQL-like such as Hive or Pig. To get you started, the following (recent) link might give you some idea on how to pick an analytics stack based on your needs - please note that every database or distributed computing paradigm specializes for some particular use case:
How we picked our analytics stack
Also look at High Scalability for various use cases on how companies tackle their scalability problems.
We've developed a system with a search screen that looks a little something like this:
(source: nsourceservices.com)
As you can see, there is some fairly serious search functionality. You can use any combination of statuses, channels, languages, campaign types, and then narrow it down by name and so on as well.
Then, once you've searched and the leads pop up at the bottom, you can sort the headers.
The query uses ROWNUM to do a paging scheme, so we only return something like 70 rows at a time.
The Problem
Even though we're only returning 70 rows, an awful lot of IO and sorting is going on. This makes sense of course.
This has always caused some minor spikes to the Disk Queue. It started slowing down more when we hit 3 million leads, and now that we're getting closer to 5, the Disk Queue pegs for up to a second or two straight sometimes.
That would actually still be workable, but this system has another area with a time-sensitive process, let's say for simplicity that it's a web service, that needs to serve up responses very quickly or it will cause a timeout on the other end. The Disk Queue spikes are causing that part to bog down, which is causing timeouts downstream. The end result is actually dropped phone calls in our automated VoiceXML-based IVR, and that's very bad for us.
What We've Tried
We've tried:
Maintenance tasks that reduce the number of leads in the system to the bare minimum.
Added the obvious indexes to help.
Ran the index tuning wizard in profiler and applied most of its suggestions. One of them was going to more or less reproduce the entire table inside an index so I tweaked it by hand to do a bit less than that.
Added more RAM to the server. It was a little low but now it always has something like 8 gigs idle, and the SQL server is configured to use no more than 8 gigs, however it never uses more than 2 or 3. I found that odd. Why isn't it just putting the whole table in RAM? It's only 5 million leads and there's plenty of room.
Pored over query execution plans. I can see that at this point the indexes seem to be mostly doing their job -- about 90% of the work is happening during the sorting stage.
Considered partitioning the Leads table out to a different physical drive, but we don't have the resources for that, and it seems like it shouldn't be necessary.
In Closing...
Part of me feels like the server should be able to handle this. Five million records is not so many given the power of that server, which is a decent quad core with 16 gigs of ram. However, I can see how the sorting part is causing millions of rows to be touched just to return a handful.
So what have you done in situations like this? My instinct is that we should maybe slash some functionality, but if there's a way to keep this intact that will save me a war with the business unit.
Thanks in advance!
Database bottlenecks can frequently be improved by improving your SQL queries. Without knowing what those look like, consider creating an operational data store or a data warehouse that you populate on a scheduled basis.
Sometimes flattening out your complex relational databases is the way to go. It can make queries run significantly faster, and make it a lot easier to optimize your queries, since the model is very flat. That may also make it easier to determine if you need to scale your database server up or out. A capacity and growth analysis may help to make that call.
Transactional/highly normalized databases are not usually as scalable as an ODS or data warehouse.
Edit: Your ORM may have optimizations as well that it may support, that may be worth looking into, rather than just looking into how to optimize the queries that it's sending to your database. Perhaps bypassing your ORM altogether for the reports could be one way to have full control over your queries in order to gain better performance.
Consider how your ORM is creating the queries.
If you're having poor search performance perhaps you could try using stored procedures to return your results and, if necessary, multiple stored procedures specifically tailored to which search criteria are in use.
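A related way to get queries tailored to the criteria in use, sketched here in Python rather than as stored procedures (so this is a swapped-in technique, not the suggestion itself): build the WHERE clause from only the filters the user actually set, with parameters for the values. The `leads` table and the column names are assumptions for illustration.

```python
# Hypothetical column names; only these may appear in a generated query.
ALLOWED_FILTERS = {"status", "channel", "language"}
ALLOWED_SORTS = {"name", "status", "channel", "language"}


def build_query(filters, sort_by="name", page_size=70):
    """Build a parameterized query containing only the criteria in use."""
    if sort_by not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sort column: {sort_by}")
    clauses, params = [], []
    for column, value in filters.items():
        if column not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter: {column}")
        if value is not None:  # skip criteria the user left blank
            clauses.append(f"{column} = ?")
            params.append(value)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = f"SELECT id, name FROM leads{where} ORDER BY {sort_by} LIMIT ?"
    params.append(page_size)
    return sql, params


print(build_query({"status": "open", "channel": None}))
```

Column names come from a fixed whitelist and values are always passed as parameters, so the generated SQL stays safe while the optimizer sees a simple query for each combination of criteria.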
determine which ad-hoc queries will most likely be run, or limit the search criteria with stored procedures. can you summarize data? treat this app like a data warehouse.
create indexes on each column involved in the search to avoid table scans.
create fragments on expressions.
periodically reorg the data and update statistics as more leads are loaded.
put the temporary files created by queries (result sets) in ramdisk.
consider migrating to a high-performance RDBMS engine like Informix OnLine.
Initiate another thread to start displaying N rows from the result set while the query continues to execute.
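To illustrate the point about indexing search columns to avoid table scans, here is a small sketch using Python's built-in SQLite (not the engine from the question, so the plan text will differ from what your RDBMS prints). The `leads` table and `idx_leads_name` index are made-up names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO leads (name, status) VALUES (?, ?)",
    [(f"lead{i}", "open" if i % 3 else "closed") for i in range(1000)],
)


def plan(conn, sql, params=()):
    """Return SQLite's query plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


QUERY = "SELECT * FROM leads WHERE name = ?"

before = plan(conn, QUERY, ("lead42",))  # no index yet: full table scan
conn.execute("CREATE INDEX idx_leads_name ON leads (name)")
after = plan(conn, QUERY, ("lead42",))   # the planner can now seek via the index

print("before:", before)
print("after: ", after)
```

The exact wording of the plan varies between SQLite versions, but the first plan reports a scan of `leads` and the second a search using `idx_leads_name`.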