I've been googling around and can't really find an answer to my question. Let's say I have two large tables, and my final destination table requires rows that are joined between these two tables. In terms of scalability and best practices, where should I do these joins? On the source database? In memory after extraction? Or in staging tables?
Thanks
I agree, there are no rules here, just common sense.
It is good to get rid of unnecessary data as soon as possible, so you spend fewer resources and less storage down the road, but you should consider the impact on your PROD environment.
staging tables
Having a copy of the data in staging gives you more freedom and flexibility, and the ability to try different ETL approaches. I would do it there.
Even if your ETL looks simple for now, it can grow in the future, so you need a quiet place to play with your data.
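For illustration, here is a minimal sketch of the staging approach. All table and column names are invented, and the bulk-load step depends on your extraction tooling:

-- Staging tables, assumed to be bulk-loaded from the two source extracts.
CREATE TABLE stg_orders    (order_id INT, customer_id INT, amount DECIMAL(12,2));
CREATE TABLE stg_customers (customer_id INT, name VARCHAR(100));

-- Index the join key once the load is done.
CREATE INDEX ix_stg_orders_cust ON stg_orders (customer_id);

-- The join happens here, in the staging area, not on the source PROD systems.
INSERT INTO dest_customer_orders (customer_id, customer_name, order_id, amount)
SELECT c.customer_id, c.name, o.order_id, o.amount
FROM   stg_orders o
JOIN   stg_customers c ON c.customer_id = o.customer_id;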
in memory
In memory where? If it is a prod instance and you consume 95% of its memory... :) All computations are "in memory" anyway.
Best regards.
I've researched what I can about SQLite and UnQLite, but there are still a few things that haven't quite been answered yet. UnQLite appears to have been released within the past few years, which would account for the lack of benchmarks. "Performance" (read/write speed, querying, average database size before significant slowdown, etc.) comparisons may be somewhat apples-to-oranges here.
From all that I have seen, the two have very few differences, comparatively speaking: namely, that SQLite is a relational database whereas UnQLite is a key-value pair and document (via Jx9) database. They're both portable, cross-platform, and 32/64-bit friendly, and both allow single-write and multi-read connections. Very little can be found on UnQLite benchmarks, while SQLite has quite a few, with different implementations across various (scripting) languages. SQLite shows varied performance across in-memory databases, indexed data, and read/write modes with varying data sizes. Overall, SQLite appears quick and reliable.
All that I can find on UnQLite is unreliable and confusing; I cannot seem to find anything helpful. What read/write speeds does UnQLite peak at? Which languages are (or are not) recommended for use with UnQLite? What are some known disadvantages and bugs?
If it helps at all to explain my interest: I'm developing a network utility that will be reading and processing packets, with hot-swapping between network interfaces. Since the connections can, though it's unlikely, reach speeds up to 1 Gbps, there will be a lot of raw data being written out to a database. It's still in the early stages of development, and I'm having to find a way to balance performance. There are a lot of factors: missed packets, how large each write is, how quickly it can process and move data, how much organization will be required, how many tables will be needed, whether I can implement multiprocessing, how reliant each database is on HDD speeds, etc. My data will need tables, but whether I have to store them relationally is still up in the air. Seeing how the two stack up, with their own pros and cons (aside from the usual KVP vs. relational debate), may push me towards either one or, if I'm crazy enough, a mix of both.
I've done a bit of fooling around with UnQLite using Python bindings I wrote. The bindings use Cython and are quite fast.
What I've found from my experimentation is that UnQLite's key/value APIs are pretty damn fast, comparable to other DBMs. Things slow down a bit when you start using Jx9 and the document store, though.
Basically depends on what you need...
If you want SQL and ad-hoc querying, I'd suggest using SQLite. It is plenty fast and quite flexible (see the sketch after this list).
If you want just keys and values, I'd use something like leveldb or rocksdb.
If you want a lightweight JSON document store, or key/value with a bit "extra", then UnQLite may be a good fit.
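To make the "ad-hoc querying" point concrete, here is a rough SQLite sketch built around the packet-capture scenario from the question. The schema is entirely made up:

-- A throwaway schema plus an ad-hoc aggregate query -- the kind of thing
-- a pure key/value store would make you hand-roll.
CREATE TABLE packets (
    id       INTEGER PRIMARY KEY,
    iface    TEXT,
    captured INTEGER,  -- unix timestamp
    bytes    INTEGER
);

-- Ad-hoc question: which interface carried the most traffic per hour?
SELECT iface,
       strftime('%Y-%m-%d %H:00', captured, 'unixepoch') AS hour,
       SUM(bytes) AS total_bytes
FROM   packets
GROUP  BY iface, hour
ORDER  BY total_bytes DESC;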
I am confused because statistics are not being gathered for some tables in many schemas.
These tables were last analyzed during the night; I assume this was done by the auto optimizer stats job, which is enabled.
I realized this when trying to gather statistics manually and receiving:
ORA-20005: object statistics are locked
after the Tuning Advisor recommended gathering statistics for a long-running query.
What could have locked these statistics? Could it be disabled by default? I assume no one did this deliberately, because there is no long-term benefit to such behaviour.
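For reference, this is how locked statistics can be listed and, if the lock turns out to be unwanted, released (the schema and table names are just examples):

-- Which tables in the schema have locked statistics?
SELECT owner, table_name, stattype_locked, last_analyzed
FROM   dba_tab_statistics
WHERE  owner = 'APP_SCHEMA'
AND    stattype_locked IS NOT NULL;

-- Release the lock on one table.
EXEC DBMS_STATS.UNLOCK_TABLE_STATS('APP_SCHEMA', 'SOME_TABLE');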
After some research I found a partial answer:
https://blogs.oracle.com/optimizer/entry/maintaining_statistics_on_large_partitioned_tables
I also discovered that statistics are locked for the partitioned tables by a partitioning procedure which runs every night; it contains the line:
dbms_stats.lock_table_stats(...)
I wonder, is this good or bad practice? I suppose some time ago it was good, but since Oracle 11g it makes no sense at all.
I will try to introduce an approach using Incremental Statistics Maintenance (docs) instead of disabling global statistics gathering, which I think is a DEPRECATED idea...
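Something along these lines, where the schema and table names are placeholders:

-- Enable incremental global statistics on the partitioned table
-- instead of locking its stats.
EXEC DBMS_STATS.SET_TABLE_PREFS('APP_SCHEMA', 'BIG_PART_TABLE', 'INCREMENTAL', 'TRUE');

-- With GRANULARITY => 'AUTO', subsequent gathers refresh only changed
-- partitions and derive the global stats from per-partition synopses.
EXEC DBMS_STATS.GATHER_TABLE_STATS('APP_SCHEMA', 'BIG_PART_TABLE', granularity => 'AUTO');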
Why do you believe that it "makes no sense at all"?
Locking statistics is neither good nor bad practice. It all depends on why you're locking them. Presumably, someone in the past identified some sort of problem that locking the statistics solved. You'd need to find out what problem that was and whether it is still an issue. If you have tables with large amounts of transient data, for example, you may want to gather statistics when the tables are relatively full and lock the statistics so that the automatic statistics gathering job doesn't accidentally run when the tables are mostly empty and cause very expensive table scans later when the tables are full.
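Roughly, with placeholder names:

-- Gather statistics while the transient table is representative...
EXEC DBMS_STATS.GATHER_TABLE_STATS('APP_SCHEMA', 'STAGING_QUEUE');
-- ...then lock them so the nightly job can't replace them with stats
-- taken while the table happens to be empty.
EXEC DBMS_STATS.LOCK_TABLE_STATS('APP_SCHEMA', 'STAGING_QUEUE');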
If the problem that was being solved was that gathering global statistics on the partitioned table was slow, then using incremental statistics maintenance could well be a better solution. Given that incremental statistics maintenance is not the default behavior, however, it would be incorrect to consider alternative approaches "deprecated", particularly when you have an existing solution that meets your needs.
I'm just learning SQL/SQLite, and plan to use SQLite 3 for a new website I'm building. It's replacing XML, so concurrency isn't a big concern. But I would like to make it as performant as possible with the technology I'm using. Are there any benefits to using multiple databases for performance, or is the best performance keeping all the data for the site in one file? I ask because 99% of the data will be read-only 99% of the time, but that last 1% will be written to 99% of the time. I know databases don't read in and re-write the whole file for every little change, but I guess I'm wondering if the writes will be much faster if the data is going to a separate 5KB database, rather than part of the ~ 250MB main database.
With proper performance tuning, SQLite can do around 63,300 inserts per second. Unless you're planning on some really heavy volume, I would avoid pre-optimizing. Splitting into two databases doesn't feel right to me, and if you're planning on doing joins in the future, you'll be hosed. Especially since you say concurrency isn't a big concern, I would avoid complicating the database design.
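For what it's worth, figures like that depend heavily on batching and journal settings. A sketch of the usual bulk-insert tuning (the entries table is made up):

-- Relax fsync behaviour and use the write-ahead log.
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;

CREATE TABLE IF NOT EXISTS entries (key TEXT, value INTEGER);

-- Batch many inserts into one transaction instead of one fsync each.
BEGIN;
INSERT INTO entries (key, value) VALUES ('a', 1);
INSERT INTO entries (key, value) VALUES ('b', 2);
-- ... thousands more ...
COMMIT;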
Actually, with 50,000 databases you will get very bad performance.
You should try several tables in a single database; sometimes that really can speed things up. But as the description of the initial task is very general, it's hard to say exactly what you need. Try a single table and multiple tables, and measure the speed.
I'm working on a website running ASP.NET with an MSSQL database. I realized that the number of rows in several tables can become very high (possibly something like a hundred million rows). I think this would make the website run very slowly; am I right?
How should I proceed? Should I base it on a multidatabase system, so that users will be separated in different databases and each of the databases will be smaller? Or is there a different, more effective and easier approach?
Thank you for your help.
Oded's comment is a good starting point and you may be able to just stop there.
Start by indexing properly and only returning relevant result sets. Consider archiving unused (or rarely accessed) data.
However, if that isn't enough, partitioning or sharding is your next step. This is better than a "multidatabase" solution because your logical entities remain intact.
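For instance, a minimal SQL Server date-range partitioning sketch -- names and boundary values are made up:

-- Map date ranges to partitions.
CREATE PARTITION FUNCTION pf_OrderDate (date)
    AS RANGE RIGHT FOR VALUES ('2011-01-01', '2012-01-01', '2013-01-01');

-- Place every partition on the same filegroup for simplicity.
CREATE PARTITION SCHEME ps_OrderDate
    AS PARTITION pf_OrderDate ALL TO ([PRIMARY]);

-- Rows land in a partition according to OrderDate, so queries and
-- maintenance can touch one slice of the hundred million rows at a time.
CREATE TABLE dbo.Orders (
    OrderId    bigint IDENTITY NOT NULL,
    OrderDate  date            NOT NULL,
    CustomerId int             NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY (OrderId, OrderDate)
) ON ps_OrderDate (OrderDate);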
Finally, if that doesn't work, you could introduce caching. Jesper Mortensen gives a nice summary of the options that are out there for SQL Server:
Sharedcache -- open source, mature.
Appfabric -- from Microsoft, quite mature despite being "late beta".
NCache -- commercial, I don't know much about it.
StateServer and family -- commercial, mature.
Try partitioning the data. This should make each query faster, and the website shouldn't be as slow.
I don't know what kind of data you'll be displaying, but try to give users the option to filter it. As someone has already commented, partitioned data will make everything faster.
This is more of a business-oriented programming question that I can't seem to figure out how to resolve. I work with a team of programmers who have been working with BASIC for over 20 years. I was brought in to help write the same software in .NET, only with updates and modern practices. The problem is that I can't seem to get any of the other 3 team members (all BASIC programmers, though one does .NET now as well) to understand how to correctly do a relational database. Here's the thing they won't understand:
We basically have a transaction that keeps track of a customer's tag information. We need to be able to track current transactions and past transactions. In the old system, a flat-file database was used that had one table containing records with the basic current transaction of the customer, and another table containing all the previous transactions of the customer along with important money information. To prevent redundancy, they would overwrite the current transaction with the history transactions (the history file was updated first, then the current one). It's totally unnecessary, since you only need one transaction table, but neither my supervisor nor my other two co-workers can seem to understand this. How exactly can I convince them to see the light, so that we won't have to do ridiculous amounts of work and end up hitting the database too many times? Thanks for the input!
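For concreteness, this is roughly the single-table design I have in mind (a sketch; all names are invented):

-- One row per transaction; the current one is just the row flagged
-- current (or, equivalently, the latest row by date). Nothing is ever
-- copied between tables.
CREATE TABLE CustomerTransaction (
    TransactionId   bigint        NOT NULL PRIMARY KEY,
    CustomerId      int           NOT NULL,
    TagInfo         varchar(50)   NOT NULL,
    Amount          decimal(12,2) NOT NULL,
    TransactionDate datetime      NOT NULL,
    IsCurrent       bit           NOT NULL DEFAULT 0
);

-- Current transaction for customer 42:
SELECT * FROM CustomerTransaction WHERE CustomerId = 42 AND IsCurrent = 1;

-- Full history for the same customer, newest first:
SELECT * FROM CustomerTransaction WHERE CustomerId = 42 ORDER BY TransactionDate DESC;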
Firstly, I must admit it's not absolutely clear to me from your description what the data structures and logic flows in the existing system actually are. This does imply to me that perhaps you are not making yourself clear to your co-workers either, so one of your priorities must be to be able to explain, either verbally or preferably in writing and diagrams, the current situation and the proposed replacement. Please take this as an observation rather than any criticism of your question.
Secondly, I do find it quite remarkable that programmers of 20 years' experience do not understand relational databases and transactions. Flat-file coding went out of the mainstream a very long time ago - I first handled relational databases in a commercial setting back in 1988, and they were pretty commonplace by the mid-90s. What sector and product type are you working on? It sounds possible to me that you might be dealing with some sort of embedded or otherwise 'unusual' system, in which case you do need to make sure that you don't have some sort of communication issue and you're overlooking a large elephant that hasn't been pointed out to you - you wouldn't be the first 'consultant' brought into a team who has been set up in some manner by not being fed the appropriate information. That said, such archaic shops do still exist - one of my current clients' systems interfaces to a flat-file based system coded in COBOL, and yes, it is hell to manage ;-)
Finally, if you are completely sure of your ground and you are faced with a team who won't take on board your recommendations - and demonstration code is a good idea if you can spare the time - then you'll probably have to accept the decision gracefully and move on. Myself, in this position I would attempt to abstract out the issue - can the database updates be moved into stored procedures, for example, so the code to update both tables is in the SP and can be modified at a later date to move to your schema without a corresponding application change? Make sure your arguments are well documented and recorded so you can revisit them later should the opportunity arise.
You will not be the first coder who's had to implement a sub-optimal solution because of office politics - use it as a learning experience for your own personal development in handling such situations, and console yourself with the thought that you'll get paid for the additional work. Often the deciding factor in such arguments is not the logic but the 'weight of reputation' you yourself bring to the table. It sounds like, having just been brought in, you don't have much of that sort of leverage with your team, so you may have to build a reputation by excelling at implementing what they do agree to do before you carry sufficient weight in subsequent cases - you need to be modded up first!
Sometimes you can't.
If you read some XP books, they often say that one of your biggest hurdles will be convincing your team to abandon what they have always done.
Generally they will recommend letting people who can't adapt move to other projects (or just letting them go).
Code reviews might help in your case. Mandatory code reviews of every line of code is not unheard of.
Sometimes the best argument is an example. I'd write a prototype (or a replacement, if it's not too much work). With an example to examine, it will be easier to see the pros and cons of a relational database.
As an aside, flat-file databases have their places since they are so much easier to "administer" than a true relational database. Keep an open mind. ;-)
I think you may have to lead by example - when people see that the "new" way is less work they will adopt it (as long as you don't rub their noses in it).
I would also ask yourself whether the old design is actually causing a problem or whether it is just aesthetically annoying. It's important to pick your battles - if the old design isn't causing a performance problem or making the system hard to maintain you may want to leave the old design alone.
Finally, if you do leave the old design in place, try and abstract the interface between your new code and the old database so if you do persuade your co-workers to improve the design later you can drop the new schema in without having to change anything else.
It is difficult to extract a whole lot except general frustration from the original question.
Yes, there are a lot of techniques and habits long-timers pick up over time that can be useless and even costly in light of technology changes. Some things that made sense when processing power, memory, and even disk space were expensive can be foolish attempts at optimization now. It is also very much the case that people accumulate bad habits and bad programming patterns over time.
You have to be careful though.
Sometimes there are good reasons for the things those old timers do. Sadly, they may not even be able to verbalize the "why" - if they even know why anymore.
I see a lot of this sort of frustration when newbies come into an enterprise software development shop. It can be bad even when the environment is all fairly modern technology and tools. If most of your experience is in writing small-community desktop and Web applications a lot of what you "know" may be wrong.
Often there are requirements for transaction journaling at a level above what your DBMS may do. Quite often it can be necessary to go beyond DB transaction semantics in order to ensure time-sequence correctness, once-and-only-once updating, resiliency, and non-repudiation.
And this doesn't even begin to address the issues involved in enterprise or inter-enterprise scalability. When you begin to approach half a million complex transactions a day you will find that RDBMS technology fails you. Because relational databases are not designed to handle high transaction volumes you must often break with standard paradigms for normalization and updating. Conventional RDBMS locking techniques can destroy scalability no matter how much hardware you throw at the problem.
It is easy to dismiss all of it as stodginess or general wrong-headedness - even incompetence. But be careful because this isn't always the case.
And by the way: There are other models besides the RDBMS, and the alternative to an RDBMS is not necessarily "flat files" - contrary to the experience of most coders today. There are transactional hierarchical DBMSs that can handle much higher throughput than an RDBMS. IMS is still very much alive in large IBM shops, for example. Other vendors offer similar software for different platforms.
Of course in a 4-man shop maybe none of this applies.
Sign them up for some decent training, and then it's up to you to convince them that with new technologies a lot more is possible (or at least easier!).
But I think the most important thing here is that professional, certified trainers teach them the basics first. They will be more impressed by that instead of just one of their colleagues telling them: "hey, why not use this?"
Related post here.
The following may not apply in your situation, but you make very little mention of technical details, so I thought I'd mention it...
Sometimes the access patterns for current data are very different from those for historical data. I'm making this example up, but say that current data is accessed thousands of times per second, touches only a small subset of columns, and all of it fits in less than 1 GB, whereas historical data runs to thousands of GBs, is accessed only hundreds of times per day, and access is to all columns. In that case, what your co-workers are doing would make perfect sense as a performance optimization. By separating out the current data (albeit redundantly), you can optimize the indices and data structures in that table for the higher-frequency access patterns, in a way you could not in the historical table.
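A sketch of that kind of split, with invented names: a narrow, hot "current" table against a wide history table tuned for infrequent full-column reads:

-- Narrow current table: one row per customer, small enough to stay in cache.
CREATE TABLE CurrentTransaction (
    CustomerId int           NOT NULL PRIMARY KEY,
    TagInfo    varchar(50)   NOT NULL,
    Amount     decimal(12,2) NOT NULL
);

-- Wide history table: all columns, indexed for the rarer lookups.
CREATE TABLE TransactionHistory (
    TransactionId   bigint        NOT NULL PRIMARY KEY,
    CustomerId      int           NOT NULL,
    TagInfo         varchar(50)   NOT NULL,
    Amount          decimal(12,2) NOT NULL,
    TransactionDate datetime      NOT NULL
    -- ...plus the many rarely-read columns...
);
CREATE INDEX ix_hist_customer ON TransactionHistory (CustomerId, TransactionDate);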
Not everything that is "academically" or "technically" correct from a purely relational perspective makes sense when applied in an actual practical situation.