I need to edit 100,000+ products - ASP.NET

I'm looking at accepting a project that would require me to clean up an existing e-commerce website. It's been relatively successful and has over 100,000 individual products - loaded both by the client and its publishers.
The site wasn't originally designed for this many products and has become fairly disorganized.
So, the client has asked me to look at a more robust search option - filterable and so forth. I completely agree it needs to be improved, but after looking at the database, I can tell that there are dozens and dozens of categories and not everything is labeled correctly, etc.
Is there any database management software that could help me clean up 100,000 entries quickly? Make categories consistent, fix uppercase/lowercase problems, etc.?
Are there any companies out there that I could outsource just this particular part of the project to?
It's a massive amount of data entry. If I spent 2 minutes per product, that's over 3,300 hours - more than a year and a half of full-time work just to complete the database cleanup. I either need to get it down to a matter of seconds per product or find a company that specializes in this type of work.
I don't even know what to search for on Google.
Thanks guys!
--
Thanks, everyone, for your ideas! I have a lot of options now, so I feel a lot more comfortable heading into this project. Right now I think the direction we will go is to build a tool that allows the client to hire data-entry people who can update it as necessary. Then I will work as a consultant, taking care of any UPDATE-WHERE type operations as necessary.
Thanks again!

If there are inconsistencies like you are describing, it sounds like the problem may be more an issue of a bad data model (i.e. lack of normalization) than just dirty data. If good normalization is in place, cleaning up categories should be as simple as updating a single record per category - but if the category name is stored as text instead of a foreign key, then you will most likely need to perform a series of UPDATE WHERE statements to clean up the text.
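For illustration, here is a rough sketch of what each case looks like (the table and column names are assumptions, since the actual schema isn't shown):

-- Normalized: one update per category fixes every product that references it
UPDATE Categories
SET CategoryName = 'Games'
WHERE CategoryID = 3;

-- Denormalized: the category text lives on every product row,
-- so each spelling variant needs its own UPDATE ... WHERE
UPDATE Products
SET Category = 'Games'
WHERE Category IN ('games', 'GAMES', 'Video Games');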
You may want to look into an ETL (extract, transform, load) tool that can help with bulk data transformation. I'm not familiar with ETL tools for MySQL, but I'm sure they exist. SQL Server has a built-in service called SQL Server Integration Services (SSIS) that provides the ability to extract data from an existing data source, perform bulk changes or transformations, and then reload the data into a destination database. Tools like this may help speed up the process of standardizing capitalization and punctuation, changing categories, etc.
Even still, don't overlook the possibility that the data model may need tweaking to help prevent this type of situation in the future.
Edit: Wikipedia has a list of open-source ETL products that you may want to investigate.

In any case you'll probably need to do more than "clean the data", which means you'll need to build new normalized tables. So start there: build a new database that is fully normalized, and import the data "as is", with all the duplicate categories, etc.
For example, new tables:

Items
    ItemID        int identity/auto number
    ItemName      string
    CategoryID    int
    ...

Categories
    CategoryID    int identity/auto number
    CategoryName  string
    ...

Import the bad data into the new system:

Items
    ItemID  ItemName  CategoryID
    1       thing A   1
    2       thing B   2
    3       thing C   3
    4       thing D   1

Categories
    CategoryID  CategoryName
    1           Game
    2           food
    3           games
Now you can consolidate the data using the PKs:

UPDATE Items
SET CategoryID = 1
WHERE CategoryID = 3;

DELETE FROM Categories
WHERE CategoryID = 3;
You might just write an application where the customer can do the consolidation. Let them select the duplicates on a screen and merge them into a selected parent category. Have this application run the merge SQL from above.
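A minimal sketch of the merge that application could call, wrapped in a stored procedure (names are illustrative, not from the original schema):

CREATE PROCEDURE dbo.MergeCategories
    @FromCategoryID int,
    @ToCategoryID   int
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    -- Move all items from the duplicate category to the surviving one
    UPDATE Items
    SET CategoryID = @ToCategoryID
    WHERE CategoryID = @FromCategoryID;

    -- Then remove the now-empty duplicate
    DELETE FROM Categories
    WHERE CategoryID = @FromCategoryID;

    COMMIT TRANSACTION;
END;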
If there are issues of needing a clean cut-over date, create an application that generates a series of "map" tables, where you store CategoryNameOld="games" and CategoryNameNew="Game", and use these when you do the conversion/load of the bad data into the new system's tables.
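A rough sketch of that approach, assuming the old data stores the category as plain text (table and column names are illustrative):

CREATE TABLE CategoryMap (
    CategoryNameOld varchar(100) PRIMARY KEY,
    CategoryNameNew varchar(100) NOT NULL
);

INSERT INTO CategoryMap (CategoryNameOld, CategoryNameNew)
VALUES ('games', 'Game'),
       ('food',  'Food');

-- During the conversion/load, rewrite the old text through the map
UPDATE s
SET s.Category = m.CategoryNameNew
FROM StagingItems s
JOIN CategoryMap m ON s.Category = m.CategoryNameOld;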

I would implement the new search system or whatever and build them a tool that would allow them to easily go through and clean up the listings, re-categorize, etc. This task requires domain knowledge, so they're the best ones to do it.
Do some number crunching so they can prioritize the list and clean in order of importance.

Keep in mind that one of your options is to build a crappy interface that somebody can use to edit records, hire half a dozen data-entry people from a temp agency, spend two days training them, and let them go to town.

Is there a best practice limitation of how many items I should keep in a single DynamoDB table?

I am setting up a Serverless application for a system and I am wondering the following:
Say that my table handles Companies. Each Company can have Invoices. Each Company has roughly 6,000-8,000 Invoices. Say that I have 14 Companies; that results in roughly 112,000 items in my table.
Is it "okay" to handle it this way? I will only pay for each Get request I do, and I can query a lot of items into the same get request.
I will not fetch every single item each time I write or get items.
So, is there a recommendation for the maximum number of items I should have in a table? I could bake some items together, but I mainly want a general recommendation.
There is no practical limit to the number of items you can have in a table. How many items each invoice becomes depends on your application's access patterns. You need to ask: what data does your app need, when does it need that data, how large is the data, and how often is the item updated? For example, if all the data in one item comes in under the 1 KB WCU and 4 KB RCU sizes, you do not write to it often, and when you read it you need all of the data in the item, then perhaps shove it all in one item. If the data is larger, or part of it gets written to more often, then perhaps split it up.
An example might be a package tracking app. You have the initial information about the package, size, weight, source address, destination address, etc. That could be a lot of data. When that package enters a sorting facility it is checked in. Do you want to update that entire item you already wrote? Or do you just write an item that has the same PK (item collection), but a different SK and then the info that it made it to the sorting facility? When it leaves the sorting facility, you want to write to the DB that it left, which truck it was on, etc. Same questions.
Now when you need to present the shipping information by tracking ID number (the PK), you can do a query to DynamoDB and get the entire item collection for that tracking ID number. That way you get all items with that ID, since your app presents much of that information on the tracking website for the customer.
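As a sketch only, that lookup in DynamoDB's SQL-compatible PartiQL syntax could look like this (the table name, key attribute and tracking ID format are assumptions for illustration):

SELECT *
FROM "PackageTracking"
WHERE PK = 'PKG#1Z999AA10123456784';

Because every event item shares the same partition key, this single query returns the whole item collection: the original package details plus each check-in along the way.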
So again, it really depends on the app and your access patterns, but you want to TRY to only read and write the items your app needs, when you need them, how you need them, and no more...within reason (there is such a thing as over slicing your data). That is how, in my opinion, you will make a NoSQL database like DynamoDB be the most performant and most cost effective.
DynamoDB won't even notice 100K entries...
As mentioned by LifeOfPi, individual items must be less than 400 KB (the DynamoDB item size limit).
The question indicates a distinct lack of understanding of what/why/how to use DDB. I suggest you do some more learning. The AWS re:Invent videos around DDB are quite useful.
In a standard RDBMS, you need to know the structure from the beginning. Accessing that data is then very flexible.
DDB is the opposite: you need to understand how you'll need to access your data; the structure is not important.
For 100K items and for most applications, you may find Aurora Serverless to be an easier fit for your needs, especially if you have complicated searching and/or sorting needs.

DynamoDB tables per customer considering DynamoDB's advanced recovery abilities

I am deciding whether to have a table per customer, or have each customer share a table with everybody else. Creating a table for every customer seems problematic, as it is just another thing to manage.
But then I thought about backing up the database. There could be a situation where a customer does not have strong IT security, or even has a disgruntled employee, and this person goes and deletes a whole bunch of the customer's crucial data.
In this scenario, if all the customers are in the same table, one couldn't just restore from a DynamoDB snapshot from 2 days ago, for instance, as then all other customers would lose the past 2 days of data. Before the cloud this really wasn't such a prevalent consideration, IMO, because backups were not as straightforward, and offering such functionality to customers who are not tier-1 businesses wasn't really on the table.
But this functionality could be a huge selling point for my SaaS application, so now I am thinking it will be worth the hassle for me to have a table per customer. Is this the right line of thinking?
Sounds like a good line of thinking to me. A couple of other things you might want to consider:
Having all customer data in one table will probably be cheaper, as you can distribute RCUs and WCUs more efficiently. From your customers' point of view this might be good or bad, because one customer can consume another customer's RCUs/WCUs (if you want to think about it like that). If you split customer data into separate tables you can provision them independently.
Fine-grained security isn't great in DynamoDB. You can only really implement row (item) level security if the partition key of the table is an Amazon user ID. If this isn't possible you are relying on application code to protect customer data. Splitting customer data into separate tables will improve security (if you can't use item-level security).
On to your question. DynamoDB backups don't actually have to be restored into the same table, so potentially you could have all your customer data in one table which is backed up. If one customer requests a restore, you could load the backup into a new table, sync their data into the live table, and then remove the restore table. This wouldn't necessarily be easy, but you could give it a try. Also, you could be paying for all the RCUs/WCUs as you perform your sync - a cost you don't incur on a plain restore.
Hope some of that is useful.
Separate tables:
There's a max number of tables per account. It's probably a soft limit, but you'd have to contact support rather often - extra overhead for you, because they prefer to raise limits in small (reasonable) increments.
A lot more things to manage, secure, monitor etc.
There's probably a lot more RCU and WCU waste.
Just throwing another idea up in the air, haven't tried it or considered every pro and con.
Pick up all the write ops with Lambda (e.g. via DynamoDB Streams) and write them to backup table(s). Use TTL (for however long users can restore their stuff) to delete old entries for free. You could even adjust the TTL on a per-customer basis if you e.g. provide longer backups for different price classes of your service.
You need a good schema to avoid hot keys.
customer-id (partition key) | time-of-operation#uuid (sort key) | data, source table, etc.
E.g. this example might be problematic if some of your customers are a lot more active than others.
Possible solution: use a known range of integers to suffix the IDs, e.g. customer-id#1, customer-id#2 ... customer-id#100. This will spread the writes, and since your app knows the range, it can still query everything.
Anyway, this is just a quick and dirty example off the top of my head.
A few pros and cons that come to mind:
Probably more expensive, unless the separate tables would have big RCU/WCU headroom.
Restoring from a regular backup might be a huge headache, e.g. which items to sync?
This is very granular: users can pick any moment within your TTL range to restore.
Restore specific items, revert specific ops w/ very low cost if your schema allows it.
Could use that backup data to e.g. show the history of items in the front-end.

What is the advantage of using 1 to many relationship over adding 1 more column in this particular situation?

This is a typical situation for 1-to-many relationships: a chat group iOS app, with a group table to record all the group-chat-related information, like group ID, creation time, thread title, etc.
To record the participants, of course, I would assume there is another 1:m table. So I was rather surprised to see the app just added another column called "participants" to record it, with each participant separated by a delimiter (':' to be exact). The problems with that are quite obvious: it mixes application code with SQL code, there's no way to see how many groups a specific user is in with SQL alone, it violates 1NF/2NF, etc.
But they said they understood all my points. However:
as this is a mobile app, you always need to use Objective-C code to access the SQLite tables; you won't use SQL alone. So it's not a "big deal" to mix them together.
participants don't change often and are normally set when a group is created. If we have 100 participants, we would rather just insert 1 record into the group table instead of inserting 100 records into another group-participants table.
The participant data will be used when someone wants to see who is in the chat group (several taps in the menu) and when someone joins or leaves the chat group, which we assume won't happen often.
So my question is: in this particular situation, what advantage would I gain by using another 1:m table?
----- update -----
Besides the answer I got, Renzo kindly pointed me to this discussion, which is also very helpful!
It's hard to respond to "is this design better/worse" style questions without understanding the full context. I'm going to make some assumptions based on your question.
You appear to be building a mobile application, supporting "many to many" user chat. I'm picturing something like Slack.
Your application design is using the SQLite database for local storage.
Your local SQLite database on the phone is some kind of subset of the overall application data - like a cache, only showing the data for the current user.
If all that is true, the question really is down to style/maintainability on the one hand, and performance and scalability on the other.
From a "style" point of view, storing the data in a comma-separated value in a column is ugly. A new developer who joins the project, with a background in "regular" database design will consider it at best a hack. On the other hand, iOS developers may consider it perfectly normal.
From a performance point of view, it's probably not worth arguing about - parsing the CSV is probably just as slow as reading/writing from the database.
From a scalability point of view, you may have a problem. If the application design needs to capture the order in which users joined the chat, or capture some kind of status (active/asleep, for instance), or provide a bit of history (user x exited at 21:20), you will almost certainly end up re-designing the database.
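For reference, a minimal sketch of the 1:m alternative in SQLite (table and column names are assumptions); it makes the queries the delimited column can't express trivial:

CREATE TABLE group_participant (
    group_id  INTEGER NOT NULL REFERENCES chat_group(group_id),
    user_id   INTEGER NOT NULL,
    joined_at TEXT,
    PRIMARY KEY (group_id, user_id)
);

-- How many groups is user 42 in?
SELECT COUNT(*) FROM group_participant WHERE user_id = 42;

-- Who is in group 7, in join order?
SELECT user_id FROM group_participant WHERE group_id = 7 ORDER BY joined_at;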

MS SQL product list with filtering

I'm building an application in ASP.NET (VB) with an MS SQL database. It is a search tool for cars that has a list of every car and all of its attributes (colors, # of doors, gas mileage, mfg. year, etc.). This tool outputs the results in a GridView, and users have the ability to perform advanced searches and filtering. The filtering needs to be very fine-grained (range of gas mileage, color(s), mfg. year range, etc.), and I cannot seem to find the best way to do this filtering without a large SQL WHERE statement that is going to greatly impact SQL performance and page load. I feel like I'm missing something very obvious here; thank you for any help. I'm not sure what other details would be helpful.
This is not an OLTP database you're building - it's really an analytics database. There really isn't a way around the problem of having to filter. The question is whether the organization of the data will allow seeks most of the time or will require scans, and also whether the resulting JOINs can be done efficiently or not.
My recommendation is to go ahead and create the data normalized and all, as you are doing. Then, build a process that spins it into a data warehouse, denormalizing like crazy as needed, so that you can do filtering by WHERE clauses that have to do a lot less work.
For every single possible search result, you have a row in a table that doesn't require joining to other tables (or only a few fact tables).
You can reduce complexity a bit for some values, such as gas mileage, by striping the mileage into bands of, say, 5 mpg (10-14, 15-19, 20-24, etc.).
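For example, a hedged sketch of how the warehouse-loading step might pre-compute such a band column (table and column names are assumptions about the schema):

SELECT
    c.CarID,
    c.Color,
    c.ModelYear,
    -- Pre-computed mileage band: a low-cardinality column the UI can filter on directly
    CASE
        WHEN c.GasMileage < 15 THEN 'under 15'
        WHEN c.GasMileage < 20 THEN '15-19'
        WHEN c.GasMileage < 25 THEN '20-24'
        ELSE '25+'
    END AS MileageBand
INTO dbo.CarSearch   -- denormalized search table in the warehouse
FROM dbo.Cars c;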
As you need to add to the data and change it, your data-warehouse-loading process (that runs once a day perhaps) will keep the data warehouse up to date. If you want more frequent loading that doesn't keep clients offline, you can build the data warehouse to an alternate node, then swap them out. Let's say it takes 2 hours to build. You build for 2 hours to a new database, then swap to the new database, and all your data is only 2 hours old. Then you wipe out the old database and use the space to do it again.

Large Product catalog with statistics - alternatives to Sql Server?

I am building a UI for a large product catalog (millions of products).
I am using SQL Server, full-text search and ASP.NET MVC.
Tables are normalized and indexed. Most queries take less than a second to return.
The issue is this. Let's say a user does a search by keyword. On the search results page I need to display/query for:
Display 20 matching products on the first page (paged, sorted)
Total count of matching products for paging
List of stores only of all matching products
List of brands only of all matching products
List of colors only of all matching products
Each query takes about 0.5 to 1 second. Altogether it is about 5 seconds.
I would like to get the whole page to load under 1 second.
There are several approaches:
Optimize queries even more. I already spent a lot of time on this one, so not sure it can be pushed further.
Load products first, then load the rest of the information using AJAX. More like a workaround. Will need to revise UI.
Re-organize data to be more Report friendly. Already aggregated a lot of fields.
I checked out several similar sites, for example zappos.com. Not only do they display the same information I would like in under 1 second, but they also include statistics (the number of results in each category).
The following is the search for keyword "white"
http://www.zappos.com/white
How do sites like zappos, amazon make their results, filters and stats appear almost instantly?
So you asked specifically "how does Zappos.com do this". Here is the answer from our Search team.
An alternative idea for your issue would be using a search index such as Solr. Basically, the way these work is you load your data set into the system and it does a huge amount of indexing. My projects include product catalogs with 200+ data points for each of 140k products. The average return time is less than 20 ms.
The search indexing system I would recommend is Solr, which is based on Lucene. Both of these projects are open source and free to use.
Solr fits your described use case perfectly in that it can actually do all of those things in one query. You can use facets (essentially GROUP BY in SQL) to return the list of distinct data values for all applicable results. In the case of keywords it would also allow you to search across multiple fields in one query without performance degradation.
You could try replacing your aggregate queries with materialized indexed views of those aggregates. This will pre-compute all the aggregates and will be as fast as selecting any regular row data.
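As a rough illustration, an indexed view in SQL Server might look like this (the table and column names are assumptions about your schema):

CREATE VIEW dbo.vProductBrandCounts
WITH SCHEMABINDING
AS
SELECT Brand, COUNT_BIG(*) AS ProductCount
FROM dbo.Products
GROUP BY Brand;
GO

-- The unique clustered index is what materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_vProductBrandCounts
ON dbo.vProductBrandCounts (Brand);

Keep in mind indexed views come with restrictions (SCHEMABINDING, COUNT_BIG(*) in grouped views, deterministic expressions), so check whether your aggregates qualify.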
0.5 sec is too long on appropriate hardware. I agree with Aaronaught, and the first thing to do is to combine it into a single SQL statement or possibly a stored procedure to ensure it's compiled only once.
Analyze your queries to see if you can create even better indexes (consider covering indexes), fine-tune existing indexes, and employ partitioning.
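For instance, a covering index for one of the filter queries might look like this (hypothetical column names):

-- Covers a "brand + model year range" filter without touching the base table,
-- because the key and INCLUDEd columns satisfy the whole query
CREATE NONCLUSTERED INDEX IX_Products_Brand_Year
ON dbo.Products (Brand, ModelYear)
INCLUDE (Color, StoreID, Price);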
Make sure you have an appropriate hardware config - data, log, tempdb and even index files should be located on independent spindles. Make sure you have enough RAM and CPUs. I hope you are running a 64-bit platform.
After all this, if you still need more, analyze the most-used keywords and create aggregate result tables for the top 10 keywords.
About Amazon - they most likely use superior hardware and also take advantage of CDNs. Also, they have thousands of servers serving up the content and there are no performance bottlenecks - data is duplicated multiple times across several data centers.
As a completely separate approach, you may want to look into "in-memory" databases such as Caché - this is the fastest you can get on the DB side.
