Big database and how to proceed - asp.net

I'm working on a website running ASP.NET with an MSSQL database. I realized that the number of rows in several tables could get very high (possibly something like a hundred million rows). I think this would make the website run very slowly, am I right?
How should I proceed? Should I base it on a multidatabase system, so that users are separated into different databases and each database is smaller? Or is there a different, more effective and easier approach?
Thank you for your help.

Oded's comment is a good starting point and you may be able to just stop there.
Start by indexing properly and only returning relevant result sets. Consider archiving unused (or rarely accessed) data.
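For instance, a narrow covering index plus a query that only pulls the rows a page actually needs can go a long way; the table and column names below are made up for illustration:

    -- Hypothetical table and columns, for illustration only.
    CREATE NONCLUSTERED INDEX IX_Orders_UserId_CreatedAt
        ON dbo.Orders (UserId, CreatedAt DESC)
        INCLUDE (Status, Total);

    -- Return only the slice a page actually displays, not the whole table.
    SELECT TOP (50) OrderId, Status, Total
    FROM dbo.Orders
    WHERE UserId = 12345          -- the current user
    ORDER BY CreatedAt DESC;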
However, if that isn't enough, partitioning or sharding is your next step. This is better than a "multidatabase" solution because your logical entities remain intact.
Finally, if that doesn't work, you could introduce caching. Jesper Mortensen gives a nice summary of the options that are out there for SQL Server:
SharedCache -- open source, mature.
AppFabric -- from Microsoft, quite mature despite being "late beta".
NCache -- commercial, I don't know much about it.
StateServer and family -- commercial, mature.

Try partitioning the data. This should make each query faster, and the website shouldn't be as slow.
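For example, on SQL Server a date-range partition might look roughly like this (the table, column, and boundary dates are all hypothetical):

    -- Hypothetical example: partition a large table by year (SQL Server syntax).
    CREATE PARTITION FUNCTION pfByYear (datetime)
        AS RANGE RIGHT FOR VALUES ('2010-01-01', '2011-01-01', '2012-01-01');

    CREATE PARTITION SCHEME psByYear
        AS PARTITION pfByYear ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Orders
    (
        OrderId   int           NOT NULL,
        CreatedAt datetime      NOT NULL,
        Total     decimal(18,2) NOT NULL
    ) ON psByYear (CreatedAt);

Queries that filter on CreatedAt can then touch a single partition instead of scanning the whole table.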

I don't know what kind of data you'll be displaying, but try to give users the option to filter it. As someone has already commented, partitioned data will make everything faster.


best practices on joining data during ETL

I've been googling around and can't really find an answer to my question. Let's say I have two large tables and my final destination table requires rows that are joined between these two tables. In terms of scalability and best practices, where should I do these joins? On the source database? In memory after extraction? Or in staging tables?
Thanks
I agree, there are no rules here, just common sense.
It is good to get rid of unnecessary data as soon as possible, so you spend fewer resources and less storage down the road, but you should consider the impact on your PROD environment.
staging tables
Having a copy of the data in staging gives you more freedom and flexibility, the ability to try different ETL approaches, etc. I would do it there.
Even if your ETL looks simple for now, it can grow in the future, so you need a quiet place to play with your data.
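For example, a minimal sketch of the staging approach (made-up table names, and assuming a destination table dw_customer_orders already exists): land both extracts first, then do the join inside the staging database:

    -- Hypothetical staging tables; bulk load them from the two sources first.
    CREATE TABLE stg_customers (customer_id INT, name VARCHAR(100), region VARCHAR(50));
    CREATE TABLE stg_orders    (order_id INT, customer_id INT, amount DECIMAL(18,2));

    -- Once both extracts have landed, the join runs locally in staging.
    INSERT INTO dw_customer_orders (customer_id, name, region, order_id, amount)
    SELECT c.customer_id, c.name, c.region, o.order_id, o.amount
    FROM stg_customers AS c
    JOIN stg_orders    AS o ON o.customer_id = c.customer_id;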
in memory
In memory where? If it is a prod instance and you consume 95% of its memory... :) All computations are "in memory" anyway.
Best regards.

Is R+bigmemory package sufficient for column-oriented data management?

I have a collection of financial time series of various sorts. Most of my analysis is either column- or row-oriented; very rarely do I have to do any sort of complex query. Also, I am (by now) doing almost all analysis in R.
Because of this, I am seriously considering not deploying any sort of RDBMS and instead managing the data in R directly (saving RDS files). This would save me the pain of installing and administering a DB, as well as probably improve data loading speeds.
Is there any reason I should consider otherwise? Do you know anyone who manages their data this way? I know this is vague, but I am looking for opinions, not answers.
If working in R is your comfort zone.. I'd keep your data management there as well, even if your analyses or runs are longer.
I've had a similar decision lately:
Should I go in the direction of learning and applying a new (language/dialect/system) to shave some milliseconds off execution time?
or...
Should I go forth with the same stodgy old tools I have used, even if they will run slower at execution time?
Is the product of your runs for you only? If so, I'd stick with data management in R only.. even if production runs are slower.
If you were designing something for a Bank, Cell Phone Service, or a similar transactional environment, I'd recommend finding the super solution.
But if your R production is for you.. I'd stay in R.
Consider the opportunity cost. Learning a new language/ecosystem - and something like PostgreSQL surely qualifies - will soak up far more time than you likely think. Those skills may be valuable, but will they generate a return on time invested that is as high as the return you would get from additional time spent on your existing analysis?
If it's for personal use and there is no pressing performance issue, stick with R. Given that it's generally easier to do foolish things with text and RDS files than it is with a fully-fledged DB, just make sure you back up everything. From being a huge skeptic about cloud-based storage, I have over the past half-year become a huge convert, and all but my most sensitive information is now stored there. I use Dropbox, which maintains previous versions of data if you do mess up badly.
Being able to check a document or script from the cafe on the corner on your smartphone is nice.
There is a column-by-column management package, colbycol, on CRAN, designed to provide DB-like functions for large datasets. I assume the author must have conducted the same sort of analysis.

What cache strategy do I need in this case?

I have what I consider to be a fairly simple application. A service returns some data based on another piece of data. A simple example, given a state name, the service returns the capital city.
All the data resides in a SQL Server 2008 database. The majority of this "static" data will rarely change. It will occasionally need to be updated and, when it does, I have no problem restarting the application to refresh the cache, if implemented.
Some data, which is more "dynamic", will be kept in the same database. This data includes contacts, statistics, etc. and will change more frequently (anywhere from hourly to daily to weekly). This data will be linked to the static data above via foreign keys (just like a SQL JOIN).
My question is, what exactly am I trying to implement here, and how do I get started doing it? I know the static data will be cached, but I don't know where to start with that. I tried searching but came up with so much stuff that I'm not sure where to start. Recommendations for tutorials would also be appreciated.
You don't need to cache anything until you have a performance problem. Only once you have a noticeable problem, and have measured your application tiers to determine that your database is in fact the bottleneck (which it rarely is), should you start looking into caching data. It is always a tradeoff: memory vs. CPU vs. real-time data availability. There is no reason to make your application more complicated than it needs to be just because you can.
An extremely simple 'win' here (I assume you're using WCF here) would be to use the declarative attribute-based caching mechanism built into the framework. It's easy to set up and manage, but you need to analyze your usage scenarios to make sure it's applied at the right locations to really benefit from it. This article is a good starting point.
Beyond that, I'd recommend looking into one of the many WCF books that deal with higher-level concepts like caching and try to figure out if their implementation patterns are applicable to your design.

Heavy calculations in mysql

I'm using ASP.NET together with MySQL, and my site requires some quite heavy calculations on the data. These calculations are done in MySQL; I thought it would be easier to do them in MySQL so I could work with the data easily in ASP.NET and databind my gridviews and other data controls without much code.
But I'm starting to think I made a mistake doing all the calculations in the back end, because everything seems quite slow. In one of my queries I need to get data from several tables at once and add them together into my gridview, so what I do in that query is:
Select from four different tables, each with some inner joins. Then union them together using UNION ALL. Then do some summing and grouping.
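Roughly, the shape is something like this (the table and column names are made up for illustration):

    -- Made-up tables; only the overall shape of the query matters here.
    SELECT t.account_id, SUM(t.amount) AS total
    FROM (
        SELECT s.account_id, s.amount FROM sales    s JOIN accounts a ON a.id = s.account_id
        UNION ALL
        SELECT r.account_id, r.amount FROM refunds  r JOIN accounts a ON a.id = r.account_id
        UNION ALL
        SELECT f.account_id, f.amount FROM fees     f JOIN accounts a ON a.id = f.account_id
        UNION ALL
        SELECT i.account_id, i.amount FROM interest i JOIN accounts a ON a.id = i.account_id
    ) AS t
    GROUP BY t.account_id;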
I can't post the full query here because it's quite long. But do you think it's a bad way to do the calculations like I've done? Would it be better to do them in ASP.NET? What is your opinion?
MySQL does not allow embedded assembly, so writing a query which will take a sequence of RGB records as input and output MPEG-4 is probably not the best idea.
On the other hand, it's good at summing and averaging.
Seeing what calculations you are talking about will surely help to improve the answer.
Self-update:
MySQL does not allow embedded assembly, so writing a query which will take a sequence of RGB records as input and output MPEG-4 is probably not the best idea.
I'm really thinking about how to implement this query, and worse than that: I feel it's possible :)
My experience with MySQL is that it can be quite slow for calculations. I would suggest moving a substantial amount of the work (particularly grouping and summing) to ASP.NET. This has not been my experience with other database engines, but a 30-day moving average in MySQL seems quite slow.
Without knowing the actual query, it sounds like you are doing relational/table work in your query, which an RDBMS is good at, so it seems that you are doing it in the correct place... the problem may be in the optimization of the query. You may want to run EXPLAIN on it, get an idea of the query plan MySQL is choosing, and try to optimize that way....
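For example (using made-up names again), running EXPLAIN on the query, or on each branch of the UNION, will show whether MySQL is doing full table scans:

    -- Look for type = ALL (full scan) or missing keys in the output.
    EXPLAIN
    SELECT s.account_id, SUM(s.amount)
    FROM sales s
    JOIN accounts a ON a.id = s.account_id
    GROUP BY s.account_id;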

How to Convince Programming Team to Let Go of Old Ways?

This is more of a business-oriented programming question that I can't seem to figure out how to resolve. I work with a team of programmers who have been working with BASIC for over 20 years. I was brought in to help write the same software in .NET, only with updates and modern practices. The problem is that I can't seem to get any of the other 3 team members (all BASIC programmers, though one does .NET now as well) to understand how to correctly do a relational database. Here's the thing they won't understand:
We basically have a transaction that keeps track of a customer's tag information. We need to be able to track current transactions and past transactions. In the old system, a flat-file database was used that had one table containing records with the customer's basic current transaction, and another table containing all the previous transactions of the customer along with important money information. To prevent redundancy, they would overwrite the current transaction with the history transactions (the history file was updated first, then the current one). It's totally unnecessary since you only need one transaction table, but neither my supervisor nor my other two co-workers seem able to understand this. How exactly can I convince them to see the light so that we won't have to do ridiculous amounts of work and end up hitting the database too many times? Thanks for the input!
Firstly I must admit it's not absolutely clear to me from your description what the data structures and logic flows in the existing structures actually are. This does imply to me that perhaps you are not making yourself clear to your co-workers either, so one of your priorities must be to be able to explain, either verbally or preferably in writing and diagrams, the current situation and the proposed replacement. Please take this as an observation rather than any criticism of your question.
Secondly I do find it quite remarkable that programmers of 20 years' experience do not understand relational databases and transactions. Flat-file coding went out of the mainstream a very long time ago - I first handled relational databases in a commercial setting back in 1988, and they were pretty commonplace by the mid-90s. What sector and product type are you working on? It sounds possible to me that you might be dealing with some sort of embedded or otherwise 'unusual' system, in which case you do need to make sure that you don't have some sort of communication issue and you're overlooking a large elephant that hasn't been pointed out to you - you wouldn't be the first 'consultant' brought into a team who has been set up in some manner by not being fed the appropriate information. That said, such archaic shops do still exist - one of my current clients' systems interfaces to a flat-file based system coded in COBOL, and yes, it is hell to manage ;-)
Finally, if you are completely sure of your ground and you are faced with a team who won't take on board your recommendations - and demonstration code is a good idea if you can spare the time - then you'll probably have to accept the decision gracefully and move on. In this position I would attempt to abstract out the issue - can the database updates be moved into stored procedures, for example, so the code to update both tables is in the SP and can be modified at a later date to move to your schema without a corresponding application change? Make sure your arguments are well documented and recorded so you can revisit them later should the opportunity arise.
You will not be the first coder who's had to implement a sub-optimal solution because of office politics - use it as a learning experience for your own personal development about handling such situations, and console yourself with the thought that you'll get paid for the additional work. Often the deciding factor in such arguments is not the logic, but the 'weight of reputation' you yourself bring to the table - it sounds like, having just been brought in, you don't have much of that sort of leverage with your team, so you may have to work on gaining a reputation by excelling at implementing what they do agree to do, until you have sufficient reputation for subsequent cases - you need to be modded up first!
Sometimes you can't.
If you read some XP books, they often say that one of your biggest hurdles will be convincing your team to abandon what they have always done.
Generally they will recommend letting people who can't adapt go to other projects (Or just letting them go).
Code reviews might help in your case. Mandatory code reviews of every line of code are not unheard of.
Sometimes the best argument is an example. I'd write a prototype (or a replacement if it isn't too much work). With an example to examine, it will be easier to see the pros and cons of a relational database.
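A minimal sketch of what such a prototype schema might look like (all names are hypothetical) - one transaction table, with the current record marked by a flag or simply taken as the latest row:

    -- Hypothetical single-table design: current and historical transactions together.
    CREATE TABLE CustomerTransaction
    (
        TransactionId INT IDENTITY(1,1) PRIMARY KEY,
        CustomerId    INT           NOT NULL,
        TagInfo       VARCHAR(100)  NOT NULL,
        Amount        DECIMAL(18,2) NOT NULL,
        CreatedAt     DATETIME      NOT NULL,
        IsCurrent     BIT           NOT NULL DEFAULT 0
    );

    -- Current transaction for a customer:
    SELECT * FROM CustomerTransaction WHERE CustomerId = 42 AND IsCurrent = 1;

    -- Full history, newest first:
    SELECT * FROM CustomerTransaction WHERE CustomerId = 42 ORDER BY CreatedAt DESC;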
As an aside, flat-file databases have their places since they are so much easier to "administer" than a true relational database. Keep an open mind. ;-)
I think you may have to lead by example - when people see that the "new" way is less work they will adopt it (as long as you don't rub their noses in it).
I would also ask yourself whether the old design is actually causing a problem or whether it is just aesthetically annoying. It's important to pick your battles - if the old design isn't causing a performance problem or making the system hard to maintain you may want to leave the old design alone.
Finally, if you do leave the old design in place, try and abstract the interface between your new code and the old database so if you do persuade your co-workers to improve the design later you can drop the new schema in without having to change anything else.
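For example (hypothetical names), a stored procedure can provide that seam: the application only ever calls the procedure, so the two-table write behind it can later be collapsed into a single table without an application change:

    -- Hypothetical wrapper around the existing two-table update.
    CREATE PROCEDURE SaveCustomerTransaction
        @CustomerId INT,
        @TagInfo    VARCHAR(100),
        @Amount     DECIMAL(18,2)
    AS
    BEGIN
        -- Current behaviour: write history first, then overwrite the "current" record.
        INSERT INTO TransactionHistory (CustomerId, TagInfo, Amount, CreatedAt)
        VALUES (@CustomerId, @TagInfo, @Amount, GETDATE());

        UPDATE CurrentTransaction
        SET TagInfo = @TagInfo, Amount = @Amount, UpdatedAt = GETDATE()
        WHERE CustomerId = @CustomerId;
    END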
It is difficult to extract a whole lot except general frustration from the original question.
Yes, there are a lot of techniques and habits long-timers pick up over time that can be useless and even costly in light of technology changes. Some things that made sense when processing power, memory, and even disk was expensive can be foolish attempts at optimization now. It is also very much the case that people accumulate bad habits and bad programming patterns over time.
You have to be careful though.
Sometimes there are good reasons for the things those old timers do. Sadly, they may not even be able to verbalize the "why" - if they even know why anymore.
I see a lot of this sort of frustration when newbies come into an enterprise software development shop. It can be bad even when the environment is all fairly modern technology and tools. If most of your experience is in writing small-community desktop and Web applications a lot of what you "know" may be wrong.
Often there are requirements for transaction journaling at a level above what your DBMS may do. Quite often it can be necessary to go beyond DB transaction semantics in order to ensure time-sequence correctness, once-and-only-once updating, resiliency, and non-repudiation.
And this doesn't even begin to address the issues involved in enterprise or inter-enterprise scalability. When you begin to approach half a million complex transactions a day you will find that RDBMS technology fails you. Because relational databases are not designed to handle high transaction volumes you must often break with standard paradigms for normalization and updating. Conventional RDBMS locking techniques can destroy scalability no matter how much hardware you throw at the problem.
It is easy to dismiss all of it as stodginess or general wrong-headedness - even incompetence. But be careful because this isn't always the case.
And by the way: there are other models besides the RDBMS, and the alternative to an RDBMS is not necessarily "flat files" - contrary to the experience of most coders today. There are transactional hierarchical DBMSs that can handle much higher throughput than an RDBMS. IMS is still very much alive in large IBM shops, for example. Other vendors offer similar software for different platforms.
Of course in a 4-man shop maybe none of this applies.
Sign them up for some decent training, and then it's up to you to convince them that with new technologies a lot more is possible (or at least easier!).
But I think the most important thing here is that professional, certified trainers teach them the basics first. They will be more impressed by that instead of just one of their colleagues telling them: "hey, why not use this?"
Related post here.
The following may not apply in your situation, but you make very little mention of technical details, so I thought I'd mention it...
Sometimes, if the access patterns are very different for current data than for historical data (I'm making this example up, but say that current data is accessed thousands of times per second, touches only a small subset of columns, and all of it fits in less than 1 GB, whereas historical data runs to thousands of GBs, is accessed only hundreds of times per day, and is accessed across all columns), then what your co-workers are doing would make perfect sense as a performance optimization. By separating the current data (albeit redundantly) you can optimize the indices and data structures in that table for the higher-frequency access patterns, in a way that you could not in the historical table.
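Concretely (hypothetical names, consistent with the stored-procedure sketch above), the split might look like this: a small, narrow "current" table that stays hot in memory, and a wide history table indexed for the rarer full-column access:

    -- Small, hot table: one current row per customer, only the frequently read columns.
    CREATE TABLE CurrentTransaction
    (
        CustomerId INT PRIMARY KEY,
        TagInfo    VARCHAR(100)  NOT NULL,
        Amount     DECIMAL(18,2) NOT NULL,
        UpdatedAt  DATETIME      NOT NULL
    );

    -- Wide, large table: the full history, including all the money detail columns.
    CREATE TABLE TransactionHistory
    (
        TransactionId INT IDENTITY(1,1) PRIMARY KEY,
        CustomerId    INT           NOT NULL,
        TagInfo       VARCHAR(100)  NOT NULL,
        Amount        DECIMAL(18,2) NOT NULL,
        CreatedAt     DATETIME      NOT NULL
        -- ...plus the remaining detail columns, indexed for the rarer access patterns...
    );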
Not everything that is "academically", or "technically" correct from a purely relational perspective makes sense when applied in an actual practical situation.
