What would be the best way to store a very large amount of data for a web-based application?
Each record has just 3 fields, but there will be around 144 million records a day - stored for one month - 4,464,000,000 records total. Let's round up to 5 billion.
Data has to be searchable on keyword & return results as fast as possible to the end user.
Which programming language?
JSON / XML / Some Database System I've Never Heard Of?
What sort of infrastructure? Imagine this system is only serving the needs of a maximum of 1,000 users at the same time.
I assume the code is the same whether you're searching 10 records or 10 billion; you just have to be a whole lot more efficient. I also assume MySQL/PHP doesn't stand a chance, and that we're going to be paying out a very large sum for a hosting solution.
Just need some guidance on where to start, really. Thank you!
There are many tools in the Big Data ecosystem (NoSQL databases, distributed computing, machine learning, search, etc.) which can form an answer to your question. Since your application will be write-heavy, I would advocate Apache Cassandra for its excellent write performance (although it requires more up-front data modeling than a document database such as MongoDB). You will also need a Solr- or Elasticsearch-based search solution, and MapReduce for building indexes and running queries.
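To make the data-modeling point concrete, here is a minimal CQL sketch of how the Cassandra side might be keyed for exact keyword lookups (the table, column names, and bucketing scheme are my own invention, not from the question); full-text search would still go through Solr/Elasticsearch:

```sql
-- CQL sketch (schema is hypothetical): partition by keyword plus a day
-- bucket so a lookup reads one bounded partition, clustered by timestamp
-- so recent hits come back first.
CREATE TABLE records_by_keyword (
    keyword    text,
    day_bucket text,      -- e.g. '2013-05-01', keeps partitions bounded
    ts         timestamp,
    payload    text,
    PRIMARY KEY ((keyword, day_bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- Writes carry a 30-day TTL to match the one-month retention window.
INSERT INTO records_by_keyword (keyword, day_bucket, ts, payload)
VALUES ('example', '2013-05-01', '2013-05-01 12:00:00', 'record body')
USING TTL 2592000;

-- Typical read: hits for a keyword on a given day, newest first.
SELECT ts, payload
FROM records_by_keyword
WHERE keyword = 'example' AND day_bucket = '2013-05-01'
LIMIT 100;
```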
The programming language doesn't matter unless you have business end-users who will be writing queries against your Big Data, in which case you can use something SQL-like such as Hive (or a higher-level query language such as Pig). To get you started, the following (recent) link might give you some idea of how to pick an analytics stack based on your needs - please note that every database or distributed computing paradigm specializes in some particular use case:
How we picked our analytics stack
Also look at High Scalability for various use cases on how companies tackle their scalability problems.
Related
I am developing an ASP.NET web application. The web application searches for people across about thirty different databases (it is a law enforcement app). For example, a police officer searches for Fred Smith, DOB 01/01/1950 (not a real person), and it returns any hits from all the databases.
For the majority of searches, the speed is acceptable, i.e. if the average person has five hits then the average time to load the page is five seconds. However, some searches have hundreds and sometimes thousands of hits. I saw one search which took 25 minutes, which is obviously not acceptable.
Longer term a data warehouse will probably be created to ensure all the data is in one database. However, what is the best strategy for speeding up searches in this scenario? I thought of caching but the same person is rarely searched for twice in a small amount of time. Are there any other ideas?
You have given insufficient details (what databases, what's the frontend language, how are you querying, is there indexing, etc.),
but here are some preliminary suggestions (probably in increasing order of effort):
Heavily index all the databases on the key columns and do your searches.
Multithreading - spawn 30 threads (one per database) and do the search; start displaying the results as the threads come back.
Have a backend job consolidate all data from the 30 databases into a single denormalized table which is fully indexed, and query that table (a T-SQL sketch of this follows the list).
Set up #3 with a Solr/Lucene-style indexing engine for even faster querying.
Use Big Data tooling, etc.
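To illustrate option #3, here is a rough T-SQL sketch (all table, column, and index names are hypothetical) of a consolidated, fully indexed search table that a nightly job could populate from the 30 source databases:

```sql
-- T-SQL sketch for suggestion #3 (names are hypothetical): a backend job
-- consolidates the 30 source databases into one denormalized search table.
CREATE TABLE dbo.PersonSearch (
    PersonSearchId  bigint IDENTITY(1,1) PRIMARY KEY,
    SourceSystem    varchar(50)   NOT NULL,   -- which of the 30 DBs the hit came from
    SourceRecordId  varchar(50)   NOT NULL,   -- key back into that source DB
    LastName        nvarchar(100) NOT NULL,
    FirstName       nvarchar(100) NOT NULL,
    DateOfBirth     date          NULL
);

-- Covering index for the typical "surname + forename + DOB" search.
CREATE NONCLUSTERED INDEX IX_PersonSearch_Name_DOB
    ON dbo.PersonSearch (LastName, FirstName, DateOfBirth)
    INCLUDE (SourceSystem, SourceRecordId);

-- The search page then hits a single indexed table instead of 30 databases.
SELECT SourceSystem, SourceRecordId, LastName, FirstName, DateOfBirth
FROM dbo.PersonSearch
WHERE LastName = N'Smith'
  AND FirstName = N'Fred'
  AND DateOfBirth = '1950-01-01';
```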
You are asking the wrong question: if your DB searches are taking too long, it's because the columns you are searching are not indexed, or there is some other DB-related issue. Showing the results 'faster' with ASP.NET is the least of your problems.
Show us some code and it's easier to help out.
We have a large table of data with about 30,000,000 rows, growing each day - currently at 100,000 rows a day, and that number will increase over time.
Today we generate different reports directly from the database (MS-SQL 2012) and do a lot of calculations.
The problem is that this takes time. We have indexes and so on but people today want blazingly fast reports.
We also want to be able to change time periods, look at the data in different ways, and so on.
We only need to look at data that is one day old so we can take all the data from yesterday and do something with it to speed up the queries and reports.
So, do any of you have good ideas for a solution that will be fast and still on the web - not in Excel or a BI tool?
Today all the reports are in ASP.NET C# WebForms with queries against MS SQL 2012 tables.
You have an OLTP system. You generally want to maximize throughput on a system like this. Reporting is going to require latches and locks to be taken to acquire data. This puts a drag on your OLTP throughput, and what's good for reporting (additional indexes) is going to be detrimental to your OLTP workload, since every extra index slows down writes. And don't even think that slapping WITH(NOLOCK) everywhere is going to alleviate that burden. ;)
As others have stated, you would probably want to look at separating the active data from the report data.
Partitioning a table could accomplish this if you have Enterprise Edition. Otherwise, you'll need to do some hackery like Partitioned Views, which may or may not work for you based on how your data is accessed.
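For the non-Enterprise route, a partitioned view looks roughly like the following T-SQL sketch (table and column names are made up); the CHECK constraints on the date column are what let the optimizer skip member tables a query cannot touch:

```sql
-- T-SQL sketch of a partitioned view (names hypothetical): one member
-- table per month, each constrained to its own date range.
CREATE TABLE dbo.Sales_2013_01 (
    SaleId   bigint        NOT NULL,
    SaleDate date          NOT NULL CHECK (SaleDate >= '2013-01-01' AND SaleDate < '2013-02-01'),
    Amount   decimal(18,2) NOT NULL,
    CONSTRAINT PK_Sales_2013_01 PRIMARY KEY (SaleId, SaleDate)
);

CREATE TABLE dbo.Sales_2013_02 (
    SaleId   bigint        NOT NULL,
    SaleDate date          NOT NULL CHECK (SaleDate >= '2013-02-01' AND SaleDate < '2013-03-01'),
    Amount   decimal(18,2) NOT NULL,
    CONSTRAINT PK_Sales_2013_02 PRIMARY KEY (SaleId, SaleDate)
);
GO

-- The view unions the member tables; queries filtered on SaleDate only
-- touch the tables whose CHECK constraints can contain matching rows.
CREATE VIEW dbo.Sales
AS
SELECT SaleId, SaleDate, Amount FROM dbo.Sales_2013_01
UNION ALL
SELECT SaleId, SaleDate, Amount FROM dbo.Sales_2013_02;
```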
I would look at extracting the needed data out of the system at a regular interval and pushing it elsewhere. Whether that elsewhere is a different set of tables in the same database, a different catalog on the same server, or an entirely different server would depend on a host of variables (cost, time to implement, complexity of data, speed requirements, storage subsystem, etc).
Since it sounds like you don't have super specific reporting requirements (currently you look at yesterday's data, but it'd be nice to see more, etc.), I'd look at implementing Columnstore Indexes on the reporting tables. They provide amazing performance for query aggregation, even over aggregate tables, with the benefit that you don't have to commit to a specific grain (WTD, MTD, YTD, etc). The downside is that it is a read-only data structure (and a memory and CPU hog while creating the index). SQL Server 2014 is going to introduce updatable columnstore indexes, which will be giggity, but that's some time off.
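As a rough illustration (table, column, and index names are hypothetical), a nonclustered columnstore index on a SQL Server 2012 reporting table might look like this, with a drop/reload/recreate cycle working around the read-only limitation:

```sql
-- T-SQL sketch (SQL Server 2012, hypothetical reporting table): a
-- nonclustered columnstore index covering the columns the reports aggregate.
-- On 2012 this makes the table read-only, so it suits a reporting copy
-- that is rebuilt after each nightly load.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_ReportFacts_ColumnStore
    ON dbo.ReportFacts (ReportDate, CustomerId, ProductId, Quantity, Amount);

-- Typical aggregation that benefits from columnstore batch processing:
SELECT ReportDate, ProductId,
       SUM(Amount)   AS TotalAmount,
       SUM(Quantity) AS TotalQuantity
FROM dbo.ReportFacts
GROUP BY ReportDate, ProductId;

-- Nightly load pattern:
-- DROP INDEX IX_ReportFacts_ColumnStore ON dbo.ReportFacts;
-- ...bulk insert yesterday's rows...
-- CREATE NONCLUSTERED COLUMNSTORE INDEX ... (as above)
```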
I have a big analytics module in my system and plan to use Vertica for it.
Someone suggested that we also use Vertica in the rest of our app (a standard CRUD app with models from our domain) so as not to manage multiple databases.
Would Vertica fit this dual scenario?
High-frequency UPDATEs are probably where Vertica lags behind the worst. I would avoid using it for such data models.
Alec - I would like to respectfully challenge your comments on Vertica. In no way do you need to denormalize or sort data before loading. Vertica also holds the record for the fastest data loading of any database.
You also talk about Vertica not being able to do complex analytics as well as an RDBMS. Vertica IS an RDBMS and can do analytics faster than any other RDBMS and they prove it over and over.
As far as your numbers go: in my use case I load roughly 5 million records per second into my Vertica cluster and hold hundreds of billions of records.
So Yaron - I would highly recommend you look at Vertica before you rule it out based on this information.
As is often the case these days, a meaningful answer depends on what you need to do. In a general sense, 'big data' solutions have grown out of the large-data-volume deficiencies of RDBMS systems. No 'big data' solution can compete with the core capabilities of RDBMS systems, i.e. complex analytics, but RDBMS systems are poor (expensive) solutions for large-data-volume processing. Practical solutions, for now, have to be hybrid solutions. Vertica can be good once data is loaded, but I believe (not an expert) it requires denormalisation of data and pre-sorting before loading to perform at its best. For large data volumes this may add significantly to the required resources. There is a definite benefit to using one system for all your needs, but there are also benefits to keeping your options open.
The approach I take is to store and index new data and then provide specific feeds to various reporting/analytic engines as required. This separates the collection and storage of raw data from the complex analytic processing. I am happy to provide more details if you are interested. This separation addresses a core problem which has always been present in database systems. In the past you used to hear 'store fast, report slowly, or store slowly, report fast, but you cannot do both'. The search for a complete solution has, in the last few years, spawned the many NoSQL offerings, which typically address the 'store fast' task. Some systems also provide impressive query performance by storing data in memory or cache, but this requires many servers for large data volumes. I believe NoSQL and SQL solutions can, and will, be integrated, but this is still down the track.
To give you some context, I work with scenarios where at least 1 billion records a day are loaded. If you are dealing with say 100 million records a day (big is relative), then your Vertica approach will probably suffice, otherwise I think you need to expand your options.
Test it. Each use case is different. Assuming Vertica is a solution for every use case is almost as bad as using MongoDB for every use case.
Vertica is a high-performance analytics database, column oriented, designed to analyze incredibly large datasets and scale horizontally. It's also expensive, hard to administer, and the documentation is spotty. The payoff in the right environment can easily be worth the work, obviously.
MySQL is a traditional RDBMS, row oriented, designed to model relationships between structured data, and works well at single-node scale (though many companies have retrofitted it to great success, e.g. Facebook). It's incredibly well documented, seemingly works on any platform, language, or framework, and can be used by anyone.
My guess is using Vertica for an employee address book database is like showing up to a blue collar job in a $3000 suit. Sure it works, but is it the right tool for the job? Maybe if you already have a Vertica license and your applications already have the requisite data adaptors/ORM/etc..., go ahead and give it a shot. It's still a SQL database so it should work fine in those situations. If your goal is minimal programming as opposed to optimal performance, then why use Vertica at all? Sounds like something simpler would be more ideal. Vertica may or may not give better performance in a regular CRUD application environment since it's not optimized for that, but you can always test both and see.
Vertica has many issues with high concurrency (many small transactions per minute).
In MPP systems the data is segmented across the cluster, and at times a cluster-level lock needs to be taken (mainly at commit time), so many commits mean many cluster-level exclusive (X) locks.
High concurrency is less of a use case in DWH and reporting, so Vertica is a good fit there.
In most cases OLTP solutions (like CRM and so on) are required to provide high concurrency, and for that Vertica is a bad choice.
Thanks
We have a system that creates a lot of data, up to 1.5 million time stamped records, about 24MB, per second or about 2TB per day.
The data comes from multiple sources and has multiple formats, the one thing in common is the time stamp.
Currently we save about 5 days of data in files and have in-house software that generates reports.
We are contemplating creating a scalable system that can hold and query years of data.
We're leaning towards something like what Nathan Marz describes in How to beat the CAP theorem, using Hadoop/ElephantDB for long term batch storage and Storm/Cassandra for a realtime layer.
I'm wondering if the community can point out any alternatives or suggest further reading?
Does the fact that our data is primarily organized by time lend itself to a particular type of solution?
Is there a better forum to ask this kind of question?
Thanks
It is a tough problem to have both real-time access and scalable batch processing at the same time.
While there is no perfect solution, I would explore the two following options:
a) Hive, with partitions by time and subpartitions by some other key (like client ID or something similar; see the HiveQL sketch after this list). This solution will give you:
Good performance for data import
Good throughput on the aggregated reports
Probably acceptable access time for a single subpartition. Although - it will never be 1-2 seconds.
b) Brisk. It is Hadoop with Cassandra replacing HDFS. It promises to give you all you need, although I would expect data-load performance and batch-report performance to be inferior to vanilla Hadoop, because vanilla Hadoop is built specifically for that.
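To make option (a) concrete, here is a rough HiveQL sketch (table, column, and path names are invented for illustration) of a table partitioned by day and subpartitioned by client:

```sql
-- HiveQL sketch (names hypothetical): partition by day and subpartition
-- by client, so a report for one client/day only scans that directory,
-- while full-day aggregations scan a single day partition.
CREATE TABLE events (
    event_ts   BIGINT,
    event_type STRING,
    payload    STRING
)
PARTITIONED BY (event_date STRING, client_id STRING)
STORED AS SEQUENCEFILE;

-- Load one subpartition from a staging location:
-- LOAD DATA INPATH '/staging/2013-05-01/client_42'
--   INTO TABLE events PARTITION (event_date = '2013-05-01', client_id = '42');

-- Aggregated report over one day (partition pruning keeps the scan small):
SELECT event_type, COUNT(*) AS cnt
FROM events
WHERE event_date = '2013-05-01'
GROUP BY event_type;
```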
Using ASP.NET and Windows Stack.
Purpose:
I've got a website that takes in over 1GB of data about every 6 months, so as you can tell my database can become huge.
Problem:
Most hosting providers only offer databases in 1GB increments. This means that every time I go over another 1GB, I will need to create another database. I have absolutely no experience with this type of setup and I'm looking for some advice on what to do.
Wondering:
Do I move the membership stuff over to a separate database? This still won't solve much because of the size of the other data I have.
Do I archive data into another database? If I do, how do I allow users to access it?
If I split the data between two databases, do I name the tables the same?
I query all my data with LINQ. So establishing a few different connections wouldn't be a horrible thing.
Is there a hosting provider that anyone knows of that can scale their databases?
I just want to know what to do? How can I solve this dilemma? I don't have the advertising dollars coming in to spend more than $50 a month so far...
While http://www.ultimahosts.net/windows/vps/ seems to offer the best solution for the best price, they still split the databases up. So where do I go from here?
Again, I am a total amateur with multiple databases. I've only used one at a time.
I'd be genuinely surprised if they actually impose a hard 1GB per DB limit and create a new one for each additional GB, but on the assumption that that actually is the case -
Designate a particular database as your master database. This is the only one your app will directly connect to.
Create a clone of all the tables you'll need in your second (and third, fourth etc) databases.
Within your master database, create a view that does a UNION on the tables as a cross-DB query - SELECT * FROM Master..TableName UNION SELECT * FROM DB2..TableName UNION SELECT * FROM DB3..TableName
For writing, you'll need to use sprocs to locate the relevant records and update them, but you shouldn't have a major problem there. In principle you could extend the view above to return which DB the record was in if you wanted.
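As a rough T-SQL sketch of that setup (database, table, and column names are hypothetical), the master-side view could look like the following; I've used UNION ALL rather than UNION so SQL Server doesn't pay for duplicate elimination across the member databases:

```sql
-- T-SQL sketch of the cross-database view described above (names hypothetical).
CREATE VIEW dbo.AllRecords
AS
SELECT 'Master' AS SourceDb, RecordId, Keyword, Payload FROM Master.dbo.Records
UNION ALL
SELECT 'DB2'    AS SourceDb, RecordId, Keyword, Payload FROM DB2.dbo.Records
UNION ALL
SELECT 'DB3'    AS SourceDb, RecordId, Keyword, Payload FROM DB3.dbo.Records;
GO

-- Reads go through the view; the SourceDb column tells a write sproc
-- which database holds the row it needs to update.
SELECT SourceDb, RecordId, Keyword, Payload
FROM dbo.AllRecords
WHERE Keyword = 'example';
```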
Answering this question is very hard, for it requires knowing at least some basic facts about the data model, the way the data is queried, etc. Also, as suggested by rexem, a better understanding of the use model may allow using normalization to limit the growth (and it may also allow introducing compression, if applicable).
I'm more puzzled at the general approach and business model (and I do understand the need to keep costs down with a startup application based on ad revenue). Wouldn't you be able to contract an amount of storage that will fit your needs for the next 6 months, and then, when you start outgrowing this space, purchase additional storage (for an extra 6 months/year; by then you may be "rich")? Such a move may not even require anything on your end (it depends on the way the hosting service manages racks etc.), or at worst may require you to copy the old database to the new (bigger) storage.
In this fashion, you wouldn't need to split the database in any artificial fashion, and hence focus on customer-oriented features, rather than optimizing queries that need to compile info from multiple servers.
I believe the solution is much simpler than that: even if your provider manages databases in 1 GB increments, it does not mean that you have N databases of 1 GB each; it means that once you reach 1 GB, the database can be increased to 2 GB, 3 GB, and so on...
Regards
Massimo
You would have multiple questions to answer:
It seems the current hosting provider cannot be very reliable if it works the way you say: they create a new database every time the initial one grows beyond 1GB - this sounds strange... at the very least they should increase the storage for the current DB and announce that you'll be charged more. Find other hosting solutions with better options.
Is there any information in your current DB that could be archived? That's a very important question, since you may be carrying over "useless" data that could be archived into separate databases and queried only on special request. As other colleagues have told you already, that is difficult for us to evaluate since we do not know the data model.
Can you split the data model into two totally different storages and only replicate the common information between them? You could use SQL Server Replication (http://technet.microsoft.com/en-us/library/ms151198.aspx) to maintain the same membership information between the databases.
If the data model cannot be split, then I do not see multiple databases as a practical choice - just find a bigger storage solution.
You may want to look for a better hosting provider.
Even SQL Express supports a 4GB database, and it's free. Some hosts don't like using SQL Express in a shared environment, but disk space is so cheap these days that finding a plan that starts at or grows in chunks of more than 1GB should be pretty easy.
You should go for a Windows VPS solution. Most Windows VPS providers will offer SQL Server 2008 Web Edition, which can support up to 10 GB of database space ...