I'm building a business app that will hold somewhere between 50,000 and 150,000 companies. Each company (DB row) is represented with 4-5 properties/columns (title, location, ...). The ORM is LINQ2SQL.
I have to do some calculations, and for that I run a lot of queries for a specific company. Right now I go to the DB every time I need something, which produces 50-200 queries depending on the calculation's complexity. I tried putting all companies into the cache, and for 10,000 rows (companies) in the DB it takes around 5.5 MB of cache. In that scenario I have only one query.
This application will be on a shared hosting server, so my resources are limited. What will happen if I try to load, let's say, 100,000 companies (rows, objects)? Or put them in the cache?
Is there any RAM limit that an average hosting company gives to an ASP.NET application? Does it depend on a dedicated Application Pool (I can put the app in a dedicated pool)?
Options are:
- load the whole table into C# objects. I did some memory profiling; 10,000 objects need about 5 MB of RAM
- query the DB to get referenced objects when needed.
The task is: for a given company A, build the tree of connected companies.
Table and columns:
Company : IdCompany, Title, Address, Contact
CompanyConnection: IdParentCompany, IdChildCompany
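For illustration, a minimal sketch of what the first option (loading the whole table into C# objects) could look like for this task; the classes below are illustrative stand-ins for the LINQ2SQL entities, not the real generated code:

    using System.Collections.Generic;
    using System.Linq;

    // Illustrative stand-ins for the LINQ2SQL entities.
    public class Company { public int IdCompany; public string Title, Address, Contact; }
    public class CompanyConnection { public int IdParentCompany, IdChildCompany; }

    public class CompanyNode
    {
        public Company Company;
        public List<CompanyNode> Children = new List<CompanyNode>();
    }

    public class CompanyTreeBuilder
    {
        private readonly Dictionary<int, Company> _byId;     // IdCompany -> Company
        private readonly ILookup<int, int> _childrenOf;      // IdParentCompany -> child ids

        // Load both tables once (a single query each), then build trees purely in memory.
        public CompanyTreeBuilder(IEnumerable<Company> companies, IEnumerable<CompanyConnection> connections)
        {
            _byId = companies.ToDictionary(c => c.IdCompany);
            _childrenOf = connections.ToLookup(c => c.IdParentCompany, c => c.IdChildCompany);
        }

        // Builds the tree of connected companies for company A without touching the database.
        public CompanyNode BuildTree(int idCompany, HashSet<int> visited = null)
        {
            visited = visited ?? new HashSet<int>();
            if (!visited.Add(idCompany)) return null;        // guard against connection cycles

            var node = new CompanyNode { Company = _byId[idCompany] };
            foreach (var childId in _childrenOf[idCompany])
            {
                var child = BuildTree(childId, visited);
                if (child != null) node.Children.Add(child);
            }
            return node;
        }
    }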
Your shared host will likely be IIS 7 on Windows Server running as a virtual machine. This machine will behave as any ordinary machine would - it is not 'aware' of being shared or virtualised.
You should expect Windows to begin paging to disk when it runs out of physical RAM, and out-of-memory errors to be thrown only when the page file has filled the disk. Of course, you don't ever want to page any part of the warm cache to disk.
Windows itself can begin nagging you about being low on memory, but this does not have the same 'urgency': applications will continue to be able to request RAM, and it will continue to be granted (albeit serviced from the page file).
If your application could crash and leave corrupt state or a partial transaction, then you should code defensively and check that memory is available before embarking upon an action.
Create the expected number of objects in a loop with pretend data and watch the memory consumption on the box - the Working Set of the worker process is the one to watch. You can do this in Task Manager.
Watch for page faults. These are events where a memory access had to be serviced from disk.
Also, very large sets of objects can cause long garbage collection cycles (>1 second). This can be a big issue in time-sensitive applications like trading and market data.
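A rough sketch of such a test harness (the Company shape and the 100,000 count are placeholders; run it inside the real worker process for realistic numbers):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    class MemoryEstimate
    {
        // Stand-in for the real entity being cached.
        class Company { public int IdCompany; public string Title, Address, Contact; }

        static void Main()
        {
            var proc = Process.GetCurrentProcess();
            long before = proc.WorkingSet64;

            // Create the expected number of objects with pretend data.
            var companies = new List<Company>(100000);
            for (int i = 0; i < 100000; i++)
            {
                companies.Add(new Company
                {
                    IdCompany = i,
                    Title = "Company " + i,
                    Address = "Some Street " + i,
                    Contact = "contact" + i + "@example.com"
                });
            }

            proc.Refresh();                          // re-read the working set
            long after = proc.WorkingSet64;
            Console.WriteLine("Approx. extra working set: {0:N0} bytes", after - before);
            GC.KeepAlive(companies);                 // keep the list alive past the measurement
        }
    }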
Hope that helps.
Update: I do a similar caching thang for a mega data-mining application.
Each ORM type has a GetObject method which uses a giant cache or goes to disk and then updates the cache: Person.GetPerson( check people cache, go to db, add to people cache )
Now my queries return just the unique keys of the results. Then each key is fetched using the above method. This is slow initially until the cache builds up but...
The point being that each query result points to the same instance in memory! This means the RAM footprint is much smaller due to sharing.
The query results are then cached, too. Of course.
Where objects are not immutable, each object-write updates its own instance in the giant cache but also causes all query caches that concern that type of object to void themselves!
Of course, in this application writes are rare, as it's mainly reference data.
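A condensed sketch of that pattern (names such as PersonCache and RunQuery are illustrative, not the application's actual code): an identity map per type, plus query caches that store only keys so every result shares the same instance.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Linq;

    public class Person { public int Id; public string Name; }

    public static class PersonCache
    {
        // The giant per-type cache (identity map): one shared instance per key.
        private static readonly ConcurrentDictionary<int, Person> People =
            new ConcurrentDictionary<int, Person>();

        // Query caches store only the unique keys of their results.
        private static readonly ConcurrentDictionary<string, List<int>> QueryResults =
            new ConcurrentDictionary<string, List<int>>();

        // Check the people cache, otherwise go to the DB, then add to the people cache.
        public static Person GetPerson(int id) => People.GetOrAdd(id, LoadPersonFromDb);

        // The query returns keys only; each key is resolved through GetPerson,
        // so all query results point at the same instances in memory.
        public static IEnumerable<Person> RunQuery(string queryKey, Func<List<int>> dbKeyQuery)
        {
            var ids = QueryResults.GetOrAdd(queryKey, _ => dbKeyQuery());
            return ids.Select(GetPerson);
        }

        // A write updates the shared instance and voids the query caches for this type.
        public static void UpdatePerson(Person p)
        {
            People[p.Id] = p;
            QueryResults.Clear();
        }

        private static Person LoadPersonFromDb(int id) => new Person { Id = id };   // placeholder DB call
    }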
Guys, I have a question.
Let's say I want to upload hundreds of thousands of posts/products to WordPress, which will slow down the website's performance, and the database will also keep getting bigger.
What if I split the WordPress site into several different installations in different subdirectories based on product or post category, so that one website only contains 25-30k posts/products, but there will be around 10 of those as separate installations? That way each database will be a lot smaller.
Do you think that will give better performance than putting everything in a single website?
My server has around 16 GB of RAM and 8 CPU cores.
I don't think it will make any difference, given you will run it on the same hardware. With multiple machines and one ingress node/load balancer you could route requests to different backend servers based on the product requested, but if you have only one server hosting everything (web server, database, etc.) you will hit the limits of CPU/RAM/etc. much sooner than the size of the database table becomes a problem (given the table is properly designed, has indices and so on).
However, you can measure performance in both cases using a load testing tool and see what response time, resource usage and the database slow query log look like in each deployment scenario.
Data size doesn't have to slow the site down. It becomes a matter of how fast you can get the data from the DB. A few things to consider:
Place the database on a dedicated host. If locally hosted, dedicate a crossover cable from the web tier to the DB tier, with a second IP for admin on the database host. You might also consider a managed instance of your database with a cloud provider.
Indexes are your friend. Larger datasets result in longer indexes, but you can deliberately shorten them. Choose a database that supports partitioned indexes. Combine partitioned indexes with the higher I/O operations per second of SSDs for your index partitions, and ensure that all lookups go through an index; then performance on large data sets won't suffer. How does a partitioned index increase access speed? Instead of having to traverse a single index from A to S for a query with an S-based WHERE clause, with a partitioned index you might have 26 indexes, one for A, then B, then C, and so on: you jump straight to the S partition for the lookup.
Shape your pool size on the PHP/web tier. You have already increased the pool size by pulling the database onto its own host. The next thing to do is effectively manage your cache of fixed assets, the items that do not change across user sessions. Commonly these are style sheets, images, fonts, JavaScript files, etc. Minimally, look at a cache node in front of your WordPress site. Take a look at Varnish or Nginx for this. I am partial to Varnish, but either should do the trick. If you pair this with a CDN for a multigenerational cache, all the better. If you are in the cloud then you have built-in CDN options with each cloud provider. You can also widen your bandwidth by placing these fixed assets on a dedicated host and then caching that one host, but this would require a lot of base modification of your WordPress image.
There is no reason why you cannot have multiple web front ends with a common database back end. You would need a load balancer to distribute the load, and your first-generation cache would sit in front of the load balancer. Realistically, if all of your queries are index-supported and your cache is effectively managed, then you can easily scale to hundreds of concurrent users on moderate hardware. Your most taxing item is going to be PHP execution to pull dynamic data for user sessions. Make the queries respond as fast as possible and you will have a small lock window in PHP for each session.
Watch your locks per session! You may be at the mercy of a template and how it manages your finite resource pool, but in general: (a) unless 50%+1 of users use something, do not allocate it early; (b) be merciless in cutting sessions to release session-based locks on memory; (c) pinch your assets until they bleed - no 45 MB images on the front page when a color-optimized 120 KB compressed image will do the job; (d) watch the repeat-access problem - this applies to subqueries in the database as well as to building a web page with hundreds of assets to resolve.
Have you considered other options, such as Drupal? The setup is a bit more complex, but I can validate running a dozen distinct websites out of a single Drupal instance with no degradation in performance, with the above dedicated database and cache nodes and hundreds of concurrent users on fairly moderate hardware (mini-ITX Atom-based PCs).
Hi, I am writing a web application that connects to 700 databases and executes a basic SELECT query.
For example:
There is a button to retrieve the managers of each branch.
There are 700 branches of a company, and each branch's details are stored in a separate database.
The SELECT query retrieves one record from each database and returns the manager of that branch.
So executing this code takes a long time.
I cannot make the user wait that long (30 minutes).
Due to memory constraints I cannot use multithreading.
Note: this web application uses Spring MVC. Server: Tomcat 7.
Any workaround possible?
With that many databases to query, the only possible solution I can see is caching. If real time is not a concern (and note that 30 minutes of execution pushes you out of real time anyway), then you might explore the following possibilities, all of which require centralizing data into a single, logical or physical database:
Clustering: put the database servers in a huge cluster, which is configured for performance and hence uses caching internally. Depending upon licence costs, this solution might be impractical or simply too expensive.
Push data to a central database: all 700 database servers push the data you need to a central database that your application will use. You can use the database servers' replication features (such as in MSSQL or PostgreSQL) or scheduled data transfers. This method requires administrative access to the database servers, either to configure replication or to drop scripts that run on a scheduled basis.
Pull data from a central database host: have a centralized host fetch the required data into a local database, the tables of which are updated through scheduled data transfers. This is the simplest method (see the sketch below). Its drawback is that real-time querying is impossible.
It is key to transfer only the data you need. Make your select statements as narrow as possible to limit execution time.
The central database could live on your web application server or on a distinct machine if your resource constraints are tight. I've found that PostgreSQL, with little effort, has excellent compatibility with MSSQL. Without further information it's difficult to be more specific.
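A minimal sketch of the "pull" option as a scheduled console job. It is shown in C#; the same pattern maps directly onto a Spring @Scheduled job in the question's Java stack. Connection strings, table and column names are assumptions for illustration only.

    using System.Collections.Generic;
    using System.Data.SqlClient;

    class BranchManagerSync
    {
        static void Main()
        {
            var branchConnectionStrings = LoadBranchConnectionStrings();   // the 700 branch DBs
            using (var central = new SqlConnection("Server=central;Database=BranchCache;Integrated Security=true"))
            {
                central.Open();
                foreach (var branchCs in branchConnectionStrings)
                {
                    using (var branch = new SqlConnection(branchCs))
                    {
                        branch.Open();
                        // Keep the SELECT as narrow as possible: only the columns you need.
                        using (var select = new SqlCommand("SELECT BranchId, ManagerName FROM BranchInfo", branch))
                        using (var reader = select.ExecuteReader())
                        {
                            while (reader.Read())
                            {
                                using (var upsert = new SqlCommand(
                                    "UPDATE BranchManagers SET ManagerName = @m WHERE BranchId = @b; " +
                                    "IF @@ROWCOUNT = 0 INSERT INTO BranchManagers (BranchId, ManagerName) VALUES (@b, @m);",
                                    central))
                                {
                                    upsert.Parameters.AddWithValue("@b", reader.GetInt32(0));
                                    upsert.Parameters.AddWithValue("@m", reader.GetString(1));
                                    upsert.ExecuteNonQuery();
                                }
                            }
                        }
                    }
                }
            }
        }

        static List<string> LoadBranchConnectionStrings()
        {
            // Placeholder: read the 700 connection strings from configuration or a table.
            return new List<string>();
        }
    }

The "managers of each branch" button then queries only the central BranchManagers table, which returns in milliseconds; the job runs on whatever schedule matches how fresh the data needs to be.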
I have an application that performs complex queries against what amounts to data organized in a "star schema". The gold-owner keeps adding new "axes" to perform searches on, with the result that performance becomes worse over time. Currently, the execution of a search operation, using a stored procedure to do all the work on the SQL server, takes about 2 seconds, which doesn't fit the gold-owner's desire to have the code be interactive (<0.1 sec response time). Looking at the SQL Server query analyzer, the search is IO-bound on 9 table scans of 100,000 records, and then doing brutal joins. Due to the nature of the queries I need to perform and the limitations of SQL, this cannot be improved.
In desperation, I've rewritten the query processor so that it sucks all 100,000 records into a cache at application start and then performs the complex queries against the cached memory. Loading all the records from the database takes about 12 seconds. This expensive initial load is mitigated by my rewritten query processor: it now only needs to do a single scan through the records and gives a response time of 0.02 seconds.
This good news is tainted by the gold-owner's discovery that the 12-second hit for populating the cache is being experienced every hour or so. I'm currently storing the data in the ASP.NET application state, as Application["FactTable"]. It seems the application state is being reset after the ASP.NET application is idle for longer than a dozen minutes or so.
If I move the 100,000 records into the ASP.NET application cache, will I be experiencing these evictions just as often, or can I rely on the data remaining in memory for the fast retrievals for longer periods of time? If the ASP.NET cache is also victim to application resets, what other mechanism should I use? A separate app domain hosting an instance of my database cache comes to mind, but I don't want to go down that route unless my other options are closed off.
I realise that you have a lot of data and processing, and that you must have tried a few things to speed this scenario up, but using Application state, which is managed by IIS, will always be volatile...
Have you thought of running your calculations etc. in another process? I.e., create a Windows service that periodically runs the queries to organise your data and saves that "flat" data to a database cache. When the user requests the data, they just get the latest DB-cached results... and you can further speed this up by holding those results in Application state, which can simply refresh itself if it gets destroyed.
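For the reader-facing side, a sketch of a self-refreshing entry (FactTable and the load method are placeholders): whether it sits in Application state or in the ASP.NET Cache, treat the entry as volatile and reload it lazily whenever it is missing.

    using System;
    using System.Web;
    using System.Web.Caching;

    public class FactTable { /* the ~100,000 flattened records */ }

    public static class FactTableCache
    {
        private const string Key = "FactTable";

        public static FactTable Get()
        {
            var table = HttpRuntime.Cache[Key] as FactTable;
            if (table == null)
            {
                table = LoadFromDbCache();                     // the pre-flattened data written by the service
                HttpRuntime.Cache.Insert(
                    Key,
                    table,
                    null,                                      // no dependency
                    Cache.NoAbsoluteExpiration,
                    Cache.NoSlidingExpiration,
                    CacheItemPriority.NotRemovable,            // ask not to be evicted under memory pressure
                    OnRemoved);
            }
            return table;
        }

        private static void OnRemoved(string key, object value, CacheItemRemovedReason reason)
        {
            // Even NotRemovable entries vanish when the app domain recycles;
            // the next Get() call simply reloads them.
        }

        private static FactTable LoadFromDbCache() => new FactTable();   // placeholder for the real load
    }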
I have a huge (>10 GB) SQLite database that is shared among many (up to CPU core count) processes (same executable). This is a specialized application, so RAM is not an issue and I want to cache as much of the database in memory as possible. I have found out about PRAGMA cache_size and I am successfully using it, but this blows the RAM usage out of proportion, as each of the many processes has its own private cache.
Now, I found SQLite Shared-Cache Mode, but I can't see whether this applies to different processes or just to threads within one process. I have run some tests which confirm the latter, but I am not sure if I am doing something wrong or whether something else needs to be done to make this work.
That page explains that "the same cache can be shared across an entire process".
In theory, you could try to configure your OS so that the entire database is held in the file cache.
If the amount of data in individual queries is small, it might be worthwhile to use a client/server database so that the caching needs to be done only in the server process.
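If you control how connections are opened, one way to lean on the OS file cache from within SQLite is memory-mapped I/O: mapped pages live in the operating system's page cache and are shared between processes reading the same file, unlike the private page cache that PRAGMA cache_size controls. A sketch, assuming .NET and Microsoft.Data.Sqlite purely for illustration (the path and size are placeholders, and the effective value is capped by the library's compile-time SQLITE_MAX_MMAP_SIZE):

    using Microsoft.Data.Sqlite;

    class SqliteMmapExample
    {
        static void Main()
        {
            using (var conn = new SqliteConnection("Data Source=/data/huge.db"))
            {
                conn.Open();
                using (var pragma = conn.CreateCommand())
                {
                    // Request up to ~8 GB of memory-mapped I/O for this connection.
                    pragma.CommandText = "PRAGMA mmap_size = 8589934592;";
                    pragma.ExecuteScalar();
                }
                // Subsequent reads are served from shared, OS-managed mapped pages
                // instead of each process's private page cache.
            }
        }
    }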
I'm developing a web app for which the client wants us to query their data as little as possible. The data will be coming from a Microsoft CRM instance.
So we've agreed that data will only be queried as and when it is needed, therefore if a web user wants to see a list of contacts (for example) that list is fetched into a local DataTable. Then if a new contact is created on the website the new contact is sent to CRM and added to the local DataTable at the same time. Likewise for edits.
If the user then looks at their contacts again the data will just come from the local DataTable.
At the moment the local data is being kept in Session, but my concern is that too much memory will start being used up. However, traffic is expected to be pretty small, perhaps no more than 20 concurrent users, so am I worrying about nothing, or is there a better way you can suggest to handle this?
You are worrying about nothing. Basically it is a scalability dump - stupid design. BUT: if you can throw 1 GB of memory at the problem, then for 20 users storing 16 MB of data is not a problem.
The main problem starts when the user count grows and the application needs to be rewritten.
20 concurrent users is not too many.
Clients "looks at their contacts": Depending on "contacts" table size, could you consider storing it in in-memory dataset( all contacts). You could then filter acc to primary key.
Alternative to session:Cache, Application
Cache with SqlCacheDependency and CacheItemRemovedCallback should be a good option to session.
XML files for each customer contacts.
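A sketch of the Cache + SqlCacheDependency option mentioned above, assuming a Contacts table that has been enabled for SQL cache dependency (aspnet_regsql plus a <sqlCacheDependency> entry in web.config); the names here are illustrative:

    using System;
    using System.Data;
    using System.Web;
    using System.Web.Caching;

    public static class ContactsCache
    {
        public static DataTable GetContacts()
        {
            var contacts = HttpRuntime.Cache["Contacts"] as DataTable;
            if (contacts == null)
            {
                contacts = LoadContactsFromCrm();              // placeholder for the CRM fetch

                // The entry is invalidated automatically when the Contacts table changes;
                // the callback lets you log evictions or proactively re-prime the cache.
                var dependency = new SqlCacheDependency("CrmDb", "Contacts");
                HttpRuntime.Cache.Insert(
                    "Contacts",
                    contacts,
                    dependency,
                    Cache.NoAbsoluteExpiration,
                    TimeSpan.FromMinutes(30),                  // sliding expiration as a safety net
                    CacheItemPriority.Default,
                    (key, value, reason) => { /* e.g. log the CacheItemRemovedReason */ });
            }
            return contacts;
        }

        private static DataTable LoadContactsFromCrm() => new DataTable("Contacts");
    }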