MS SQL product list with filtering - asp.net

I'm building an application in ASP.NET (VB) with an MS SQL database. It is a search tool for cars that has a list of every car and all of its attributes (colors, number of doors, gas mileage, manufacturing year, etc.). The tool outputs the results in a GridView, and users can perform advanced searches and filtering. The filtering needs to be very fine-grained (gas mileage range, color(s), manufacturing year range, etc.), and I cannot find a way to do it without a large SQL WHERE clause that is going to greatly impact SQL performance and page load. I feel like I'm missing something very obvious here. Thank you for any help. I'm not sure what other details would be helpful.

This is not an OLTP database you're building--it's really an analytics database. There really isn't a way around the problem of having to filter. The question is whether the organization of the data will allow seeks most of the time, or will it require scans; and also whether the resulting JOINs can be done efficiently or not.
My recommendation is to go ahead and create the data normalized and all, as you are doing. Then, build a process that spins it into a data warehouse, denormalizing like crazy as needed, so that you can do filtering by WHERE clauses that have to do a lot less work.
For every single possible search result, you have a row in a table that doesn't require joining to other tables (or only a few fact tables).
You can reduce complexity a bit for some values, such as gas mileage, by striping the mileage into bands of, say, 5 mpg (15-19, 20-24, 25-29, etc.).
As you need to add to the data and change it, your data-warehouse-loading process (that runs once a day perhaps) will keep the data warehouse up to date. If you want more frequent loading that doesn't keep clients offline, you can build the data warehouse to an alternate node, then swap them out. Let's say it takes 2 hours to build. You build for 2 hours to a new database, then swap to the new database, and all your data is only 2 hours old. Then you wipe out the old database and use the space to do it again.
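The flattening and banding described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module instead of SQL Server so it can run anywhere, and the table names (cars, car_colors, car_search), the columns, and the sample rows are all made up for the example.

```python
import sqlite3

def build_search_table(conn):
    """Rebuild the flat search table from the normalized tables (the 'warehouse load')."""
    conn.executescript("""
        DROP TABLE IF EXISTS car_search;
        CREATE TABLE car_search AS
        SELECT c.car_id,
               c.model,
               c.mfg_year,
               (c.mpg / 5) * 5 AS mpg_band,  -- integer division: 23 mpg lands in the band starting at 20
               cc.color
        FROM cars AS c
        JOIN car_colors AS cc ON cc.car_id = c.car_id;
        CREATE INDEX ix_car_search ON car_search (color, mpg_band, mfg_year);
    """)

def search(conn, colors, band_low, band_high, year_low, year_high):
    """Filter the flat table only; no joins are needed at query time."""
    placeholders = ",".join("?" for _ in colors)
    sql = f"""
        SELECT DISTINCT model FROM car_search
        WHERE color IN ({placeholders})
          AND mpg_band BETWEEN ? AND ?
          AND mfg_year BETWEEN ? AND ?
        ORDER BY model
    """
    rows = conn.execute(sql, [*colors, band_low, band_high, year_low, year_high])
    return [row[0] for row in rows]

# Hypothetical normalized source data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cars (car_id INTEGER PRIMARY KEY, model TEXT, mpg INTEGER, mfg_year INTEGER);
    CREATE TABLE car_colors (car_id INTEGER, color TEXT);
    INSERT INTO cars VALUES (1, 'Alpha', 23, 2009), (2, 'Beta', 31, 2010), (3, 'Gamma', 18, 2008);
    INSERT INTO car_colors VALUES (1, 'red'), (1, 'blue'), (2, 'red'), (3, 'blue');
""")
build_search_table(conn)
results = search(conn, ["red"], 20, 30, 2008, 2010)
```

In a real system the rebuild step would run on a schedule, and the index columns would be chosen from the filters users actually apply most.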

Related

Can DynamoDB be used for this simple problem?

I am trying to understand the limitations of DynamoDB/NoSQL, mostly as a learning exercise. I came across a problem that is fairly simple in a relational database, but I cannot figure out how to accomplish it in DynamoDB even with full control of rebuilding the tables and indexes.
Problem: Every day everyone in an office chooses one fruit for lunch. At the end of the week, I just want a list of everyone who ate both an apple and a banana.
Example Data
I thought employee name should be the PK, day of the week should be the SK, and fruit would be an attribute. But that doesn't seem to work, because you can't query against an attribute.
Is there a way to structure the data to make this work? Is there another tool like OpenSearch, HiveQL, or GraphQL that can help me do what I am trying to do here?
Thanks.
When you say it's "fairly simple in a relational database", what you mean is it's simple to express, not exactly simple to compute. You're pushing a lot of list intersection work to the database. As your data set grows, the response time for your query will get slower and slower. At some point the database will no longer be able to give you the answer. And while it's consuming CPU (before timing out) you're negatively impacting the load on the relational database server for other users.
With DynamoDB you can't express queries that take unbounded effort to compute or that depend so much on total data set size for their performance characteristics. You have to design a query system up front that doesn't get exponentially slower as the data set grows.
The DynamoDB design then depends on what you know up front. For example, do you know it's always the intersection of an apple and a banana? Then, when inserting a new food record, check whether the person has now eaten both, and if so mark them on a user metadata item. Use that marker later during the query phase.
Sound like a nuisance? Well, if your data set isn't growing large and/or you don't need reliably fast query performance, then a relational database solves this problem well. Different databases for different purposes.
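The "mark it at insert time" idea can be sketched in plain Python. This is not DynamoDB code: the class and its dictionaries are hypothetical stand-ins for the table and the user metadata items, meant only to show that the intersection work moves from read time to write time.

```python
class LunchTable:
    """In-memory stand-in for the lunch items plus per-user metadata (hypothetical)."""

    def __init__(self):
        self.lunches = {}       # (employee, day) -> fruit, like PK + SK
        self.fruits_eaten = {}  # employee -> set of fruits seen so far
        self.ate_both = set()   # employees carrying the "apple and banana" marker

    def record_lunch(self, employee, day, fruit):
        """Store the lunch and update the marker as part of the same write."""
        self.lunches[(employee, day)] = fruit
        eaten = self.fruits_eaten.setdefault(employee, set())
        eaten.add(fruit)
        if {"apple", "banana"} <= eaten:
            self.ate_both.add(employee)

    def apple_and_banana_eaters(self):
        """Read only the precomputed markers; no per-day records are examined."""
        return sorted(self.ate_both)

table = LunchTable()
table.record_lunch("Bob", "Mon", "apple")
table.record_lunch("Bob", "Tue", "banana")
table.record_lunch("Ann", "Mon", "apple")
table.record_lunch("Ann", "Tue", "apple")
eaters = table.apple_and_banana_eaters()
```

The sketch ignores the weekly reset and concurrent writers; both would need handling in a real table.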
DynamoDB also supports SCAN and not only QUERY.
A simple design for the table is to make the person's name the PK and to store one numeric counter attribute per fruit, which you increment each day.
UPDATE "FRUIT_COUNTS"
SET BANANA=BANANA + 1
WHERE Employee='Bob'
Then, at the end of the week, you can run a simple PartiQL query on the table:
SELECT * FROM "FRUIT_COUNTS"
WHERE BANANA > 0 AND APPLE > 0

I need to Edit 100,000+ Products

I'm looking at accepting a project that would require me to clean up an existing e-commerce website. It's been relatively successful and has over 100,000 individual products, loaded both by the client and its publishers.
The site wasn't originally designed for this many products and has become fairly disorganized.
So, the client has asked me to look at a more robust search option, filterable and so forth. I completely agree it needs to be improved, but after looking at the database, I can tell that there are dozens and dozens of categories and not everything is labeled correctly.
Is there any database management software that could help me clean up 100,000 entries quickly? Make categories consistent - fix uppercase/lowercase problems etc.
Are there any companies out there that I can source just this particular part of the project to?
It's a massive amount of data entry. If I spent 2 minutes per product, 100,000 products would take over 3,300 hours, which is well over a year of full-time work, just to complete the database cleanup. I either need to get it down to a matter of seconds per product or find a company that specializes in this type of work.
I don't even know what to search for on Google.
Thanks guys!
--
Thanks everyone for your ideas! I have a lot of options now, so I feel a lot more comfortable heading into this project. Right now I think the direction we will go is to build a tool that allows the client to hire data entry people who can update the data as necessary. Then I will work as a consultant, taking care of any UPDATE-WHERE type operations as necessary.
Thanks again!
If there are inconsistencies like the ones you are describing, it sounds like the problem may be more an issue of a bad data model (i.e. lack of normalization) than just dirty data. If good normalization is in place, cleaning up categories should be as simple as updating a single record per category. But if the category name is stored instead of a foreign key, then you will most likely need to run a series of UPDATE WHERE statements to clean up the text.
You may want to look into an ETL (extract, transform, load) tool that can help with bulk data transformation. I'm not familiar with ETL tools for MySQL, but I'm sure they exist. SQL Server has a built-in service called SQL Server Integration Services (SSIS) that can extract data from an existing data source, perform bulk changes or transformations, and then load the data into a destination database. Tools like this may help speed up the process of standardizing capitalization, punctuation, categories, etc.
Even still, don't overlook the possibility that the data model may need tweaking to help prevent this type of situation in the future.
Edit: Wikipedia has a list of open-source ETL products that you may want to investigate.
In any case you'll probably need to do more than "clean the data", which means you'll need to build new normalized tables. So start there: build a new database that is fully normalized, and import the data "as is", with all the duplicate categories, etc.
For example, new tables:

Items
    ItemID        int identity/auto number
    ItemName      string
    CategoryID    int
    ....

Categories
    CategoryID    int identity/auto number
    CategoryName  string
    ....

Import the bad data into the new system:

Items
    ItemID  ItemName  CategoryID
    1       thing A   1
    2       thing B   2
    3       thing C   3
    4       thing D   1

Categories
    CategoryID  CategoryName
    1           Game
    2           food
    3           games
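Now you can consolidate the data using the PKs. Here the duplicate category 3 ("games") is folded into category 1 ("Game"):

    -- Repoint items from the duplicate category to the one being kept
    UPDATE Items
    SET CategoryID = 1
    WHERE CategoryID = 3;

    -- Then remove the now-unused duplicate category
    DELETE FROM Categories
    WHERE CategoryID = 3;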
You might just write an application where the customer can do the consolidation. Let them select the duplicates on a screen and merge them into a selected parent category, and have the application run the merge SQL from above.
If you need a clean cut-over date, create an application that generates a series of "Map" tables, where you store CategoryNameOld="games" and CategoryNameNew="Game", and use these when you do the conversion/load of the bad data into the new system's tables.
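A map-table-driven merge could look roughly like the sketch below. It is only an illustration: it runs against SQLite through Python's sqlite3 module so it is self-contained, the CategoryMap table name is an assumption, and the rows are the sample data from this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Categories (CategoryID INTEGER PRIMARY KEY, CategoryName TEXT);
    CREATE TABLE Items (ItemID INTEGER PRIMARY KEY, ItemName TEXT, CategoryID INTEGER);
    INSERT INTO Categories VALUES (1, 'Game'), (2, 'food'), (3, 'games');
    INSERT INTO Items VALUES (1, 'thing A', 1), (2, 'thing B', 2),
                             (3, 'thing C', 3), (4, 'thing D', 1);
    -- Hypothetical map table: each old (duplicate) name and the name it folds into.
    CREATE TABLE CategoryMap (CategoryNameOld TEXT PRIMARY KEY, CategoryNameNew TEXT);
    INSERT INTO CategoryMap VALUES ('games', 'Game');
""")

def apply_category_map(conn):
    """Repoint items from mapped-away categories, then delete those categories."""
    with conn:  # both statements commit together, or roll back together on error
        conn.execute("""
            UPDATE Items
            SET CategoryID = (
                SELECT n.CategoryID
                FROM Categories AS o
                JOIN CategoryMap AS m ON m.CategoryNameOld = o.CategoryName
                JOIN Categories AS n ON n.CategoryName = m.CategoryNameNew
                WHERE o.CategoryID = Items.CategoryID)
            WHERE CategoryID IN (
                SELECT o.CategoryID
                FROM Categories AS o
                JOIN CategoryMap AS m ON m.CategoryNameOld = o.CategoryName)
        """)
        conn.execute("""
            DELETE FROM Categories
            WHERE CategoryName IN (SELECT CategoryNameOld FROM CategoryMap)
        """)

apply_category_map(conn)
item_categories = conn.execute(
    "SELECT ItemID, CategoryID FROM Items ORDER BY ItemID").fetchall()
remaining = conn.execute(
    "SELECT CategoryID FROM Categories ORDER BY CategoryID").fetchall()
```

Because the merge is driven by rows in the map table, the client can keep adding mappings from a screen and the same two statements apply them all.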
I would implement the new search system or whatever and build them a tool that would allow them to easily go through and cleanup the listings, re-categorize, etc. This task requires domain knowledge, so they're the best ones to do it.
Do some number crunching so they can prioritize the list and clean in order of importance.
Keep in mind that one of your options is to build a crappy interface that somebody can use to edit records, hire half a dozen data-entry people from a temp agency, spend two days training them, and let them go to town.

Large Product catalog with statistics - alternatives to Sql Server?

I am building UI for a large product catalog (millions of products).
I am using Sql Server, FreeText search and ASP.NET MVC.
Tables are normalized and indexed. Most queries take less than a second to return.
The issue is this. Let's say user does the search by keyword. On search results page I need to display/query for:
Display 20 matching products on first page(paged, sorted)
Total count of matching products for paging
List of stores only of all matching products
List of brands only of all matching products
List of colors only of all matching products
Each query takes about 0.5 to 1 second. Altogether that adds up to about 5 seconds.
I would like to get the whole page to load under 1 second.
There are several approaches:
Optimize queries even more. I already spent a lot of time on this one, so not sure it can be pushed further.
Load products first, then load the rest of the information using AJAX. More like a workaround. Will need to revise UI.
Re-organize data to be more Report friendly. Already aggregated a lot of fields.
I checked out several similar sites, for example zappos.com. Not only do they display the same information I would like in under 1 second, but they also include statistics (the number of results in each category).
The following is the search for keyword "white"
http://www.zappos.com/white
How do sites like zappos, amazon make their results, filters and stats appear almost instantly?
So you asked specifically "how does Zappos.com do this". Here is the answer from our Search team.
An alternative idea for your issue would be to use a search index such as Solr. Basically, the way these work is that you load your data set into the system and it does a huge amount of indexing up front. My projects include product catalogs with 200+ data points for each of the 140k products. The average return time is less than 20 ms.
The search indexing system I would recommend is Solr, which is based on Lucene. Both of these projects are open source and free to use.
Solr fits perfectly for your described use case in that it can actually do all of those things all in one query. You can use facets (essentially group by in sql) to return the list of different data values for all applicable results. In the case of keywords it also would allow you to search across multiple fields in one query without performance degradation.
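To make the facet idea concrete, here is a toy sketch in plain Python. It is not Solr's API and says nothing about how Solr is implemented; the function name, field names, and product rows are invented. It only shows the shape of the result: one search that returns a page of hits, the total count, and per-field value counts together.

```python
from collections import Counter

# Hypothetical catalog rows.
PRODUCTS = [
    {"name": "white sneaker", "store": "ShoeBarn", "brand": "Acme", "color": "white"},
    {"name": "white sandal", "store": "ShoeBarn", "brand": "Zenith", "color": "white"},
    {"name": "black boot", "store": "BootHut", "brand": "Acme", "color": "black"},
    {"name": "off-white loafer", "store": "BootHut", "brand": "Acme", "color": "cream"},
]

def search_with_facets(products, keyword, facet_fields, page_size=20):
    """Return the first page, the total hit count, and value counts per facet field."""
    matches = [p for p in products if keyword in p["name"]]
    facets = {field: Counter(p[field] for p in matches) for field in facet_fields}
    return {"total": len(matches), "page": matches[:page_size], "facets": facets}

result = search_with_facets(PRODUCTS, "white", ["store", "brand", "color"])
```

Here result["facets"]["brand"] holds how many matching products each brand has, which is the kind of per-category statistic the question asks about.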
You could try replacing your aggregate queries with indexed (materialized) views of those aggregates. This pre-computes the aggregates, so reading them is as fast as selecting any regular row data.
0.5 seconds per query is too long on appropriate hardware. I agree with Aaronaught: the first thing to do is to convert the queries into a single SQL batch, or possibly a stored procedure, to ensure it's compiled only once.
Analyze your queries to see if you can create even better indexes (consider covering indexes), fine-tune existing indexes, or employ partitioning.
Make sure you have an appropriate hardware configuration: data, log, temp and even index files should be located on independent spindles. Make sure you have enough RAM and CPUs. I hope you are running a 64-bit platform.
After all this, if you still need more, analyze the most-used keywords and create aggregate result tables for the top 10 keywords.
About Amazon: they most likely use superior hardware and also take advantage of CDNs. They also have thousands of servers serving up the content, and there are no performance bottlenecks because the data is duplicated multiple times across several data centers.
As a completely separate approach, you may want to look into "in-memory" databases such as CACHE. This is the fastest you can get on the DB side.

De-normalize live data for the sake of reports - Good or Bad?

What are the pros/cons of de-normalizing an enterprise application database because it will make writing reports easier?
Pro - designing reports in SSRS will probably be "easier" since no joins will be necessary.
Con - developing/maintaining the app to handle de-normalized data will become more difficult due to duplication of data and synchronization.
Others?
Denormalization for the sake of reports is Bad, m'kay.
Creating views, or a denormalized data warehouse is good.
Views have solved most of my reporting related needs. Data warehouses are great when users will be generating reports almost constantly or when your views start to slow down.
This is why you want to normalize your database
To free the collection of relations from undesirable insertion, update and deletion dependencies;
To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;
To make the relational model more informative to users;
To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
—E.F. Codd, "Further Normalization of the Data Base Relational Model" via wikipedia
The only time you should consider de-normalization is when the time it takes the report to generate is not acceptable. De-normalization will cause consistency issues that are sometimes impossible to detect, especially in large datasets.
Don't denormalize just to get rid of complexity in reporting, it can cause huge problems in the rest of the application. Either you don't enforce the rules resulting in bad data or if you do then inserts, deletes and updates can be seriously slowed for everyone not just the two or three people who run reports.
If the reports truly can't run well, then create a data warehouse that is denormalized and populate it in a nightly or weekly feed. The kind of reports that typically need this do not generally care if the data is up-to-the minute as they are usually monthly, quarterly, or annual reports that process (and especially aggregate) large amounts of data after the fact.
You can do both: keep the normalized database for the application.
Then create a denormalized database for reports, and create an application that regularly copies data from one database to the other.
After all, reports don't always need the latest data. Most of the time you can simply run an update on the reporting database every hour, or even only once a day.
Beyond the data warehouse and view solutions provided in other answers, which are good in some ways: if you are willing to sacrifice some performance to get data that is good to the last second, but still want a normalized database, you could use a materialized view with fast refresh on commit in Oracle, or an indexed view (a view with a unique clustered index) in SQL Server.
Another con is that the data is likely not to be real-time, since it takes some time to move the data from the normalized form to the de-normalized one. If someone wants the report to be accurate up to the very second it was requested, that can be tough to do in this situation.
If this is a duplication of the synchronization in the original post, sorry I didn't quite see it that way.

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is - to store a database entry in a statistics table - each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever user sees a product and the lead submission form.
The problem is that after a month my statistics table is packed with rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought maybe writing a service that will somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks for anyone that helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data. Instead of calculating them "online" when the page that displays them loads, you calculate them offline, either via a nightly batch process or incrementally when each log record is written.
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them later. That would reduce your analytic processing requirements by a factor on the order of the number of hits per session. Of course, it would increase processing costs when inserting log entries.
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
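Incremental aggregation at write time might look like the sketch below. It is a variation under stated assumptions: it keeps one counter per zone per day (rather than per session), uses SQLite through Python's sqlite3 module only to stay self-contained, and the table and function names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE zone_daily_counts (
        zone TEXT NOT NULL,
        day  TEXT NOT NULL,              -- ISO date string, e.g. '2024-05-01'
        hits INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (zone, day)
    )
""")

def record_hit(conn, zone, day):
    """Bump the counter for (zone, day) instead of inserting one row per hit."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO zone_daily_counts (zone, day, hits) VALUES (?, ?, 0)",
            (zone, day))
        conn.execute(
            "UPDATE zone_daily_counts SET hits = hits + 1 WHERE zone = ? AND day = ?",
            (zone, day))

def hits_between(conn, zone, first_day, last_day):
    """Report query: sums at most one small row per day in the range."""
    row = conn.execute(
        "SELECT COALESCE(SUM(hits), 0) FROM zone_daily_counts "
        "WHERE zone = ? AND day BETWEEN ? AND ?",
        (zone, first_day, last_day)).fetchone()
    return row[0]

for day in ("2024-05-01", "2024-05-01", "2024-05-02"):
    record_hit(conn, "Category", day)
record_hit(conn, "Minisite", "2024-05-01")
category_hits = hits_between(conn, "Category", "2024-05-01", "2024-05-31")
```

The trade-off is the one noted above: each write does slightly more work, and any detail not captured in the counter's key (here, which user produced the hit) is lost unless the raw rows are kept as well.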
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning": choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month, depending on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range will not even be considered. That can lead to very significant time savings, even compared with an index: an index covers every row, so it grows with your data, whereas a daily partition only ever holds one day's rows.
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
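The pruning effect can be illustrated with a toy model in plain Python. A real database handles partitions internally through its own DDL; the class below is purely hypothetical and only shows why a date-bounded query touches rows from the matching days and nothing else.

```python
from collections import defaultdict
from datetime import date, timedelta

class DayPartitionedLog:
    """Toy model of a table range-partitioned by day (not a real database feature)."""

    def __init__(self):
        self.partitions = defaultdict(list)  # date -> list of zone names logged that day
        self.rows_examined = 0               # instrumentation for the example

    def insert(self, day, zone):
        self.partitions[day].append(zone)

    def count(self, zone, first_day, last_day):
        """Count hits for a zone, reading only the partitions inside the date range."""
        total = 0
        day = first_day
        while day <= last_day:
            rows = self.partitions.get(day, [])
            self.rows_examined += len(rows)
            total += sum(1 for z in rows if z == zone)
            day += timedelta(days=1)
        return total

log = DayPartitionedLog()
for offset in range(30):                      # 30 days of data, 100 rows per day
    for i in range(100):
        log.insert(date(2024, 5, 1) + timedelta(days=offset),
                   "Category" if i % 2 else "Website")
one_day = log.count("Category", date(2024, 5, 10), date(2024, 5, 10))
```

After the query, log.rows_examined reflects only the rows of the single day requested, even though the log holds thirty days of data.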
