How to store big data? [closed] - sqlite

Suppose we have a web service that aggregates 20 000 users, and each one of them is linked to 300 unique user data entities containing whatever. Here's a naive approach to designing an example relational database able to store the above data:
Create table for users.
Create table for user data.
And thus, the user data table contains 6 000 000 rows.
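A minimal sketch of that naive two-table layout, using SQLite through Python's sqlite3 module (table and column names are my own illustration, not taken from the question):

    import sqlite3

    conn = sqlite3.connect("service.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE IF NOT EXISTS userdata (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        payload TEXT                 -- "whatever" each entity contains
    );
    -- the index that keeps "fetch one user's 300 rows" cheap
    CREATE INDEX IF NOT EXISTS idx_userdata_user ON userdata(user_id);
    """)
    conn.commit()

With the index on user_id, fetching a single user's 300 rows stays a cheap indexed range scan even at 6 000 000 rows; it's the later "weird stuff" on that data that hurts.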
Querying tables that have millions of rows is slow, especially since we have to deal with hierarchical data and do some uncommon computations quite different from SELECT * FROM userdata. At any given point we only need a specific user's data, not the whole thing - getting it is fast - but we have to do weird stuff with it afterwards. Multiple times.
I'd like our web service to be fast, so I thought of the following approaches:
Optimize the hell out of queries, do a lot of caching, etc. This is nice, but these are just temporary workarounds. When the database grows even further, they will cease to work.
Rewrite our model layer to use NoSQL technology. This is not possible due to the lack of relational database features, and even if we wanted this approach, early tests made some functionality even slower than it already was.
Implement some kind of scalability. (You hear about cloud computing a lot nowadays.) This is the most wanted option.
Implement some manual solution. For example, I could store all the users whose names begin with the letters "A" through "M" on server 1, while all other users would belong to server 2. The problem with this approach is that I'd have to redesign our architecture quite a lot, and I'd like to avoid that.
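A rough sketch of what that manual split could look like (the routing function and server names are hypothetical, purely for illustration):

    # Naive manual shard router: users whose names start with A..M go to one
    # server, everyone else to the other. Server names are made up.
    SHARD_1 = "server1"   # names A..M
    SHARD_2 = "server2"   # everything else

    def shard_for(username: str) -> str:
        first = username[:1].upper()
        return SHARD_1 if "A" <= first <= "M" else SHARD_2

    # every data-access call now needs this extra routing step, which is
    # exactly the architectural change the question would rather avoid
    print(shard_for("Alice"))  # server1
    print(shard_for("Zoe"))    # server2

The pain point is visible immediately: every query has to pass through the router, and rebalancing users between shards means migrating data and keeping the routing rule in sync everywhere.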
Ideally, I'd have some kind of transparent solution that would allow me to query seemingly uniform database server with no changes to code whatsoever. The database server would scatter its table data on many workers in a smart way (much like database optimizers), thus effectively speeding everything up. (Is this even possible?)
In both cases, achieving interoperability seems like a lot of trouble...
Switching from SQLite to a Postgres or Oracle solution. This isn't going to be cheap, so I'd like some kind of confirmation before doing it.
What are my options? I want all my SELECTs and JOINs with indexed data to be real-time, but the bigger the userdata is, the more expensive queries get.

I don't think you should use NoSQL by default for that amount of data. What kind of issue are you expecting it to solve?
IMHO this depends on your queries. You haven't mentioned any kind of massive write load, so SQL still seems appropriate so far.
It sounds like you want to perform queries using JOINs. This can be slow on very large data, even with appropriate indexes. What you can do is lower your level of decomposition and simply duplicate the data (so it all sits in one database row and is fetched from the hard drive together). If you're concerned about latency, avoiding joins is a good approach, but it still doesn't rule out SQL, since you can duplicate data in SQL too, as sketched below.
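One way to read that suggestion, sketched with SQLite's Python bindings (the blob-per-user layout is my illustration, not something from the answer):

    import json, sqlite3

    conn = sqlite3.connect(":memory:")
    # Denormalized layout: one row per user with all entities serialized together,
    # so a single primary-key lookup (no JOIN) returns everything for that user.
    conn.execute("CREATE TABLE userdata_blob (user_id INTEGER PRIMARY KEY, entities TEXT)")

    entities = [{"kind": "example", "value": i} for i in range(300)]
    conn.execute("INSERT INTO userdata_blob VALUES (?, ?)", (42, json.dumps(entities)))

    row = conn.execute("SELECT entities FROM userdata_blob WHERE user_id = ?", (42,)).fetchone()
    print(len(json.loads(row[0])))  # 300 entities fetched in one read, no JOIN

The trade-off is the usual one for denormalization: reads get simpler and faster, while updates have to rewrite the duplicated data.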
The structure of your queries should weigh heavily in your decision. Do you want to SELECT only a few fields in your queries (SQL), or do you always want to fetch the whole document (e.g. Mongo & JSON)?
The second significant criterion is scalability: NoSQL often relaxes the usual SQL guarantees (settling for eventual consistency, for example), so it can deliver better results when scaling out.

Related

Using LMDB to implement a sqlite-alike relational database, relevant resources?

For educational reasons I wish to build a functional, full relational database. I'm aware LMDB was once used as a storage backend for sqlite, but I don't know C. I'm on .NET, and I'm not interested in just duplicating a "traditional" RDBMS (so, for example, I'm not worried about implementing a SQL parser, but rather my own custom scripting language that I'm building); I want to expose the full relational model.
Consider this question similar to "How do I implement a programming language on top of LLVM?" before worrying about why I'm not using sqlite or something similar.
From the material I've read, LMDB looks great, especially because it provides transactions and reliability, plus the low-level plumbing. How that translates into changes that could touch several rows across several tables is another question...
Does any material exist that explains how to implement a relational layer on top of something like LMDB? Is using LMDB (or its competitors) good enough, or is there a better way to get the same results?
Is it possible to use LMDB to store other structures like hashtables, arrays and (the one I'm most interested in, for a columnar database) bitmap arrays, i.e. similar to redis?
P.S.: Is there a forum or another place to talk more about this subject?
I had this idea too. You should realize that this is tons of work and most likely no one will care. I haven't built a full-blown relational db, as that's crazy for one person to do. You could check it out here.
Anyway, I've used leveldb (and later rocksdb), so you get key-value pairs sorted by key, the ability to get a value by key, to iterate keys, to write many values atomically (WriteBatch), and a consistent view of the data at a given time - snapshots. These features are enough to build correct thread-safe reading of table rows (using snapshots), correct all-or-nothing writing of data and its related indexes (using WriteBatch), and even transactions.
Each column gets its own on-disk index - keys sorted by value - so you can run various operations on it efficiently, and the rows themselves are stored as keys with values, so you can efficiently read a row by its id.
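A rough sketch of that key layout, using an ordinary Python dict plus sorted iteration in place of LevelDB/RocksDB (the key encoding is illustrative, not taken from the answer):

    # Emulate a sorted key-value store with a plain dict + sorted() iteration;
    # LevelDB/RocksDB give the same primitives (get, put, ordered scan, batch).
    kv = {}

    def put_row(table, row_id, row):
        # primary data: key sorted by id, so "read row by id" is one lookup
        kv[f"{table}/row/{row_id:08d}"] = row
        # one secondary index per column: key sorted by value, pointing back at the id
        for col, val in row.items():
            kv[f"{table}/idx/{col}/{val}/{row_id:08d}"] = row_id

    put_row("users", 1, {"name": "alice", "age": "30"})
    put_row("users", 2, {"name": "bob",   "age": "25"})

    # "WHERE name = 'bob'": a prefix scan over the index keys
    prefix = "users/idx/name/bob/"
    ids = [v for k, v in sorted(kv.items()) if k.startswith(prefix)]
    print([kv[f"users/row/{i:08d}"] for i in ids])

In the real thing the row write and all of its index writes would go into one WriteBatch so they land atomically.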
This setup is efficient for writing and reading, using the available operations, on tables with little data (say, less than a million rows). However, if the table grows, iterating over many keys can get slow. To solve this, and to add a group-by statement, I decided to add in-memory indexes, but that's another story. So, all in all, it might be a fun idea, but in reality it's a lot of work with often frustrating results - why would you want to do that?

Best way to persist data? [closed]

I have a complex JSON object which I need to persist across two POST requests. Currently I'm storing the serialized JSON in TempData, but the second POST never succeeds because of Error 400 (the size of the headers is too long), which I confirmed by viewing the cookies in the Chrome debugger.
Am I doing MVC wrong? The data is probably too complex to be stored in TempData. However, in this example the JSON is only 234 lines (unsure whether that reflects the cookie size accurately). I know I could increase the size limit of the cookie, but that wouldn't fix the real issue.
Should I be storing the data some other way?
Basically, in my project I'm posting a value to the controller (many times, via POST), which then uses that value to get a certain part of the JSON. Is Session the only alternative?
I'm still a novice at MVC, so forgive me if I've made a simple mistake.
First, TempData and Session are the same thing. The only difference is the length of persistence: in the former, just until the next request, while in the latter for the life of the session.
Second, session storage has to be configured. If you don't configure it, then something like TempData will attempt to use cookies to persist the data. Otherwise, it will use your session store. Basically, by using any actual session store, you should have no issues with the size of the data.
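To make the idea concrete, here is a minimal sketch in Python rather than ASP.NET (all names are hypothetical): keep the large payload in a server-side store and let the cookie carry only a small token, which is roughly what TempData does once a real session store is configured.

    import json, uuid

    # server-side store (an in-memory dict here; a distributed cache in production)
    store = {}

    def save_payload(payload: dict) -> str:
        token = uuid.uuid4().hex           # this small token is all the cookie needs to carry
        store[token] = json.dumps(payload)
        return token

    def load_payload(token: str) -> dict:
        return json.loads(store.pop(token))   # pop() mimics TempData's read-once behaviour

    token = save_payload({"big": "complex JSON " * 100})
    print(len(token), "characters in the cookie instead of the whole document")
    restored = load_payload(token)

The header-size error goes away because the cookie never grows with the payload.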
Third, you have not provided much information about what you're actually doing here, but for the most part, sessions (Session or TempData) are a poor choice for persistence. The data you're trying to store between requests does not sound like it is user-specific, which makes sessions a particularly poor choice. Most likely you want a distributed cache here, though you could potentially get by with an in-memory cache. You should also consider whether you need to persist this data at all. It's far too common to over-optimize by worrying about running the same query against a database, for example, multiple times. Databases are designed to efficiently retrieve large amounts of data and, properly set up, can handle many thousands of simultaneous queries without breaking a sweat. Ironically, sometimes caching a query doesn't actually save you anything over just running the query, especially with distributed caching mechanisms.
Simple is better than complex. Start simple. Solve the problem in the most straightforward way possible. If that involves issuing the same query multiple times, do so. It doesn't matter. Then, once you have a working solution, profile. If it's running slower than you like, or starts to fall down when fielding thousands of requests, then look into ways to optimize it with caching, etc. Developers waste an enormous amount of time and energy trying to optimize things that aren't actually even problems.

Does it make sense to make multiple SQLite databases to improve performance?

I'm just learning SQL/SQLite, and plan to use SQLite 3 for a new website I'm building. It's replacing XML, so concurrency isn't a big concern. But I would like to make it as performant as possible with the technology I'm using. Are there any benefits to using multiple databases for performance, or is the best performance keeping all the data for the site in one file? I ask because 99% of the data will be read-only 99% of the time, but that last 1% will be written to 99% of the time. I know databases don't read in and re-write the whole file for every little change, but I guess I'm wondering if the writes will be much faster if the data is going to a separate 5KB database, rather than part of the ~ 250MB main database.
With proper performance tuning, sqlite can do around 63 300 inserts-per-second. Unless you're planning on some really heavy volume, I would avoid pre-optimizing. Splitting into two databases doesn't feel right to me, and if you're planning on doing joins in the future, you'll be hosed. Especially since you say concurrency isn't a big problem, I would avoid complicating the database design.
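As a sketch of the kind of tuning that gets you into that range (exact numbers depend on hardware and pragmas), batching inserts in a single transaction is the usual first step; the table and pragma here are just illustrative:

    import sqlite3, time

    conn = sqlite3.connect("site.db")
    conn.execute("PRAGMA journal_mode = WAL")   # a common write-performance tweak
    conn.execute("CREATE TABLE IF NOT EXISTS hits (page TEXT, ts REAL)")

    rows = [(f"/page{i % 100}", time.time()) for i in range(50000)]
    start = time.time()
    with conn:                                  # one transaction for the whole batch
        conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)
    print("%.0f inserts/sec" % (len(rows) / (time.time() - start)))

Committing once per batch, rather than once per row, is where most of that throughput comes from; splitting the data across files doesn't buy you anything comparable.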
Actually, with 50 000 databases you will have very bad performance.
You should try several tables in a single database; sometimes that really can speed things up. But since the description of the initial task is very general, it's hard to say exactly what you need - try a single table and multiple tables, and measure the speed.

What cache strategy do I need in this case?

I have what I consider to be a fairly simple application. A service returns some data based on another piece of data. A simple example, given a state name, the service returns the capital city.
All the data resides in a SQL Server 2008 database. The majority of this "static" data will rarely change. It will occasionally need to be updated and, when it does, I have no problem restarting the application to refresh the cache, if one is implemented.
Some data, which is more "dynamic", will be kept in the same database. This data includes contacts, statistics, etc. and will change more frequently (anywhere from hourly to daily to weekly). This data will be linked to the static data above via foreign keys (just like a SQL JOIN).
My question is, what exactly am I trying to implement here, and how do I get started doing it? I know the static data will be cached, but I don't know where to start with that. I tried searching but came up with so much stuff and I'm not sure where to start. Recommendations for tutorials would also be appreciated.
You don't need to cache anything until you have a performance problem. Only when you have a noticeable problem and have measured your application tiers to determine that your database is in fact the bottleneck (which it rarely is) should you start looking into caching data. It is always a trade-off: memory vs CPU vs real-time data availability. There is no reason to make your application more complicated than it needs to be just because.
An extremely simple 'win' here (I assume you're using WCF here) would be to use the declarative attribute-based caching mechanism built into the framework. It's easy to set up and manage, but you need to analyze your usage scenarios to make sure it's applied at the right locations to really benefit from it. This article is a good starting point.
Beyond that, I'd recommend looking into one of the many WCF books that deal with higher-level concepts like caching and try to figure out if their implementation patterns are applicable to your design.

What are some ways to optimize your use of ASP.NET caching? [closed]

I have been doing some reading on this subject, but I'm curious to see what the best ways are to optimize your use of the ASP.NET cache, and what some of the tips are for determining what should and should not go in the cache. Also, are there any rules of thumb for determining how long something should stay in the cache?
Some rules of thumb
Think in terms of the cache-miss-to-request ratio each time you contemplate using the cache. If cache requests for the item will miss most of the time, then the benefits may not outweigh the cost of maintaining that cache item.
Weigh the query expense against the cache-retrieval expense (e.g. for simple reads, SQL Server is often faster than a distributed cache due to serialization costs).
Some tricks
Gzip strings before sticking them in the cache. This effectively expands the cache and reduces network traffic in a distributed-cache situation (see the sketch after this list).
If you're worried about how long to cache aggregates (e.g. counts), consider having non-expiring (or long-lived) cached aggregates and proactively updating them when the underlying data changes. This is a controversial technique, and you should really consider your request/invalidation ratio before proceeding, but in some cases the benefits can be worth it (e.g. SO rep for each user might be a good candidate depending on implementation details; the number of unanswered SO questions would probably be a poor candidate).
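A small sketch of the gzip trick from the list above, with a plain dict standing in for the distributed cache (names are my own):

    import gzip, json

    cache = {}   # stand-in for a distributed cache that stores bytes

    def cache_set(key, obj):
        raw = json.dumps(obj).encode("utf-8")
        cache[key] = gzip.compress(raw)          # smaller payload in memory and over the wire

    def cache_get(key):
        return json.loads(gzip.decompress(cache[key]).decode("utf-8"))

    big = {"items": ["some repetitive string"] * 1000}
    cache_set("big", big)
    print(len(json.dumps(big)), "->", len(cache["big"]), "bytes")   # rough compression ratio
    assert cache_get("big") == big

The win depends on how compressible and how large the cached values are; tiny values can end up bigger after the gzip header.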
Don't implement caching yet.
Put it off until you've exhausted all the Indexing, query tuning, page simplification, and other more pedestrian means of boosting performance. If you flip caching on before it's the last resort, you're going to have a much harder time figuring out where the performance bottlenecks really live.
And, of course, if you have the backend tuned right when you finally do turn on caching, it will work a lot better for a lot longer than it would if you did it today.
The best quote I've heard about performance tuning and caching is that it's an art, not a science (sorry, I can't remember who said it). The point is that there are so many factors that can affect the performance of your app that you need to evaluate each situation case by case and make considered tweaks until you reach the desired outcome.
I realise I'm not giving any specifics here, but I don't really think you can.
I will give one example from a previous project, though. I worked on an app that made a lot of calls to web services to build up a client profile, e.g.:
GET client
GET client quotes
GET client quote
Each object returned by the web services contributed to a higher-level object that was then used to build the resulting page. At first we gathered up all the objects into the master object and cached that. However, when things were not as quick as we would have liked, we realised it would make more sense to cache each called object individually, so it could be re-used on the next page the client sees, e.g. (see the sketch after the list below):
[Cache] client
[Cache] client quotes
[Cache] client quote
GET client quote upgrades
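A sketch of that per-object caching idea (the cache and fetch functions are invented for illustration; the real app used the ASP.NET cache in front of the web services):

    cache = {}

    def get_cached(kind, key, fetch):
        """Return the object from cache, calling the web service only on a miss."""
        cache_key = (kind, key)
        if cache_key not in cache:
            cache[cache_key] = fetch(key)        # stand-in for a slow web-service call
        return cache[cache_key]

    def fetch_client(client_id):
        return {"id": client_id, "name": "..."}  # stand-in for GET client

    def fetch_client_quotes(client_id):
        return [{"quote": 1}, {"quote": 2}]      # stand-in for GET client quotes

    client = get_cached("client", 7, fetch_client)
    quotes = get_cached("client-quotes", 7, fetch_client_quotes)
    # the next page that needs the same client reuses both cached objects
    client_again = get_cached("client", 7, fetch_client)

Caching at the granularity of the individual service responses is what let later pages reuse the pieces they already had instead of rebuilding the whole master object.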
Unfortunately there are no pre-established rules... but to give you some common-sense guidance, I would say that you can easily cache:
Application Parameters (list of countries, phone codes, etc...)
Any other application non-volatile data (list of roles even if configurable)
Business data that is often read and does not change much (or not a big deal if it is not 100% accurate)
What you should not cache:
Volatile data that change frequently (usually the business data)
As for the cache duration, I tend to use different durations depending on the type of data and its size. Application Parameters can be cached for several hours or even days.
For some business data, you may want a smaller cache duration (minutes to 1 hour).
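One way to express those different durations, as a sketch (the categories and lifetimes are just examples of the ranges mentioned above):

    import time

    # suggested lifetimes by category, in seconds (illustrative values only)
    TTL = {
        "app-parameters": 24 * 3600,   # countries, phone codes: hours to days
        "business-data":  30 * 60,     # often-read, slow-changing: minutes to 1 hour
    }

    cache = {}   # key -> (expires_at, value)

    def cache_set(category, key, value):
        cache[(category, key)] = (time.time() + TTL[category], value)

    def cache_get(category, key):
        entry = cache.get((category, key))
        if entry is None or entry[0] < time.time():
            return None                 # miss or expired: the caller reloads from the database
        return entry[1]

    cache_set("app-parameters", "countries", ["FR", "US", "JP"])
    print(cache_get("app-parameters", "countries"))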
One last thing is always to challenge the amount of data you manipulate. Remember that the end-user won't read thousands of records at the same time.
Hope this will give you some guidance.
It's very hard to generalize this sort of thing. The only hard-and-fast rule to follow is not to waste time optimizing something unless you know it needs to be done. Then the proper course of action is going to be very much dependent on the nitty gritty details of your application.
That said... I'll almost always cache global application parameters in some easy-to-use object. This is certainly more of a programming convenience than an optimization.
The one time I've written specific data caching code was for an app that interfaced with a very slow accounting database, and then it was read-only for data that didn't change very often. All writes went to the DB. With SQL Server, I've never run into a situation where the built-in ASP.NET-to-SQL Server interface was the slow part of the equation.
