How to configure a shared cache for multiple environments with the C API? - berkeley-db

How do I configure a shared cache for multiple environments with the C API, just like the Java edition supports?
http://docs.oracle.com/cd/E17277_02/html/GettingStartedGuide/env.html#multienvsharedcache
I want to open a large number of databases, at least 100,000. But as the number of opened databases increases, the db->open operation becomes very slow; it takes almost 2 hours to open 100,000 databases.
So I am trying to distribute these databases across multiple environments (for example, 5 envs), and in order to use memory more efficiently, I want to share a cache between the envs.

I'm pretty sure you can't do this with the C API. However, consider some alternative solutions:
Use relatively small caches for each environment. You need only have enough memory in each environment's cache to hold the working set of pages for a read.
Open all the databases in the same environment. Though, if you need to recover each database independently, this isn't going to work.
It's likely that opening all those databases takes so long because your system is swapping. Keep in mind that the cache cost scales with the number of environments: even a 1MB cache per environment adds up to roughly 100GB of RAM if you end up with 100,000 of them. You may be able to get away with a 96KiB or smaller cache per environment, which keeps the total under 10GiB. Try 16KiB!
This won't wreck performance as much as it seems it might. Your OS already does a pretty good job of caching data that's on disk.
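As a rough illustration, here is a minimal sketch of capping one environment's cache with the C API (error handling trimmed; the "env-00" home directory and the DB_CREATE | DB_INIT_MPOOL flag set are assumptions for the example, not taken from your setup):

    #include <db.h>

    /* Sketch: one environment with a deliberately small cache. */
    int open_small_env(DB_ENV **envp)
    {
        DB_ENV *env;
        int ret;

        if ((ret = db_env_create(&env, 0)) != 0)
            return ret;

        /* 0 GB + 96 KiB, in a single cache region.
         * Must be called before DB_ENV->open(). */
        if ((ret = env->set_cachesize(env, 0, 96 * 1024, 1)) != 0)
            goto err;

        if ((ret = env->open(env, "env-00",
                             DB_CREATE | DB_INIT_MPOOL, 0)) != 0)
            goto err;

        *envp = env;
        return 0;

    err:
        env->close(env, 0);
        return ret;
    }

Repeat that for each environment you create; keep in mind that Berkeley DB imposes a minimum cache size, so treat the exact number as a tuning knob rather than a hard limit.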
If you can open all the databases in the same environment, that's even better.
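If you do go the single-environment route, the open loop looks something like the sketch below (again only an outline; the db-%06d.db naming scheme is invented for illustration, and the environment is assumed to have been opened as above):

    #include <stdio.h>
    #include <db.h>

    /* Sketch: open `count` databases inside one shared environment,
     * so they all draw pages from the same cache. */
    int open_databases(DB_ENV *env, DB **dbs, int count)
    {
        char fname[32];
        int i, ret;

        for (i = 0; i < count; i++) {
            if ((ret = db_create(&dbs[i], env, 0)) != 0)
                return ret;

            snprintf(fname, sizeof(fname), "db-%06d.db", i);

            ret = dbs[i]->open(dbs[i], NULL, fname, NULL,
                               DB_BTREE, DB_CREATE, 0644);
            if (ret != 0)
                return ret;
        }
        return 0;
    }

Each open database still carries per-handle overhead, so opening 100,000 of them will never be instant, but a single shared cache avoids duplicating memory across environments.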

Related

New Azure Server - CSV Reader takes much longer

We have an ASP.NET website that allows users to import data from a CSV file. Recently we moved from a dedicated server to an Azure Virtual Machine, and the import is taking much longer. The hardware specs of the two systems are similar.
It used to take less than a minute for the data to import; now it can take 10-15 minutes. The initial file upload speed is fine; it is the looping through the data and organizing it in the SQL database that takes the time.
Why is the Azure VM with similar specs taking so much longer and what can I do to fix it?
Our database is using Microsoft SQL Server 2012 installed on the same VM as the website.
It's very hard to make a comparison between the two environments. Was the previous environment virtualized? It might have to do with the speed of the hard disks, the placement of the SQL Server files, or some other infrastructural setup (or simply the iron). I would recommend having a look at the performance of the machine under load (Resource Monitor). This kind of operation is usually both processor- and I/O-intensive. The operation should ideally be done in parallel as well.
Hth
//Peter

Memory consumption differs by environment

I have an MVC4 web application that, when volume is put through it, consumes ~400MB RAM in all environments excluding the production environment. When a similar volume of load is put through it on a production server (hosted externally), the memory utilisation trebles to ~1.2GB and the memory isn't released even when the application is idle. The IIS configuration across all environments is the same.
It's also worth noting that the application, when idle, releases some of that memory in my test environments, but doesn't do the same in production. The RAM gradually increases and tops out at 1.2-1.3GB, but never drops below that, even if traffic is completely routed away from the server.
I have not been able to recreate this issue in any environment other than my third-party hosting platform, but before I conclusively blame the infrastructure and get the hosting company on the case, I wondered:
a) Is this a common problem, and why does it happen?
b) How can I see what is using the memory?
c) Would you expect the same code to consume significantly different levels of system resources depending on the platform? (I know my host may have monitoring etc. in production which will perhaps inflate it a little.)
Any help on this is appreciated.
This is a common problem that often comes up when working across different environments, because the system configuration, Windows version, etc. differ from machine to machine.
In this particular case, since the difference is so big, there is probably a loop holding on to objects, or memory that is not being freed at regular intervals.
A few steps:
Try to get to the root of the problem, i.e. which method is responsible. Use loggers like NLog.
Try using a profiler if you are using SQL Server.
Third, use ANTS Performance Profiler.
It also depends on the number of users hitting the site and on possible deadlock conditions.
There can be numerous reasons for the same symptoms.

How do you set up a large scale Alfresco CIFS server?

Alfresco provides a CIFS connector so it can act just like a normal file server in your intranet.
Compared with a "normal" (Windows/Samba) file server, certain operations can really hurt the system, e.g. listing a folder with a few thousand files in Windows Explorer. I'm not quite sure, but I think permission checking is the primary reason in this case. Anyway, now assume you have a big filesystem hierarchy exposed and many users accessing it over CIFS, stressing the system and effectively "knocking it down".
What is the suggested approach to scale and improve performance?
In my experience Windows Explorer is part of the CIFS performance issue. I don't have exact numbers, but I remember working on an instance with roughly 500GB of data, mostly small images and a few text files in a poorly balanced folder tree, where listing a folder with a thousand children took around a minute to display in Explorer. The same operation took around 3 seconds in the Chrome browser.
We never had time to investigate the issue thoroughly, but we did see an impressive amount of traffic generated by Explorer due to its prefetching of information about the subfolders of the currently open folder.
Been revisiting the issue a little, and I guess the best answer I can give for now is: Tweak the cache(s).
I used a space with 5k children and the default cache values, and benchmarked by executing "ls -alrt" on the CIFS mount, running Alfresco 4.0.d.
The first execution took roughly two minutes, bombarding the (lightning-fast) MySQL database with approximately 200k queries.
The second execution took "only" around 40 seconds, but the number of queries did not change significantly.
After increasing the CIFS fileinfo cache, I got the second run down to 30 seconds, but I still see 160k DB queries firing. I'm fairly sure the lion's share of these has to do with permissions/ACLs, and it should be possible to improve the situation a lot.
PS: Windows Explorer definitely behaves a little unexpectedly, but I cannot confirm that it makes a significant difference to the user experience.
PPS: https://issues.alfresco.com/jira/browse/ALFCOM-2951
PPPS: I'll look into this further when I find the time - should be this year. ;)
Update: the massive number of queries is not a permission issue.
Permission checks definitely ARE part of the problem. I can't link to anything specific, but from browsing the Alfresco forums and the net over the last few years I've learned that permissions can hurt performance.
I've read (and experienced) in several scenarios that Alfresco spaces with large numbers of children (1000+) can be painfully slow. One part you noticed yourself: it takes a while to get through 100-200k queries. But hook something into Alfresco to watch what it's doing and you'll see that massive amounts of time go into serialization/deserialization (e.g. webscripts for Share) and also node traversal (hence the thousands of queries and averages of 400-500 qps when nobody is logged on).
So you're on the right way with your cache optimizations.
Do you have dedicated hardware for your installation? I had big issues with performance, but I moved the MySQL server to a separate box (server-grade hardware: 4 cores, 8GB RAM, SSD for the MySQL server and SAS for the Tomcat server, etc.) and gained a lot. So get on with begging for new hardware too :)
I think you're on the right path here.

rsync vs SyncML (Funambol)

I would like some idea of how rsync compares to SyncML/Funambol, especially when it comes to bandwidth, syncing over an unstable network, and multiple clients to one server.
This is to sync several mobile devices with a directory structure of growing text files. (So we essentially want as much as possible on the server; inconsistent files are not really a problem, and we know where changes originate.)
So far, it seems Funambol doesn't compress, doesn't handle partial updates, and it is difficult to handle interruptions in a file-transfer.
I know rsync doesn't go through the server, but I don't quite see how that is a disadvantage.
Olav,
rsync can:
Compress the data (as you said), thus getting better performance over the net.
Synchronize only the changed parts of each file, once again saving time.
Be run by multiple users at the same time; that's very basic backup-software behavior.
And one of my favorites: work over a secure shell.
You might want to check Rsyncrypto, for compressing and encrypting at the same time.
Dotan

Build Server Hardware Configuration

So I've seen this question, but I'm looking for some more general advice: How do you spec out a build server? Specifically what steps should I take to decide exactly what processor, HD, RAM, etc. to use for a new build server. What factors should I consider to decide whether to use virtualization?
I'm looking for general steps I need to take to come to the decision of what hardware to buy. Steps that lead me to specific conclusions - think "I will need 4 gigs of ram" instead of "As much RAM as you can afford"
P.S. I'm deliberately not giving specifics because I'm looking for the teach-a-man-to-fish answer, not an answer that will only apply to my situation.
The answer depends on what the machine needs in order to "build" your code, and that is entirely dependent on the code you're talking about.
If it's a few thousand lines of code, then just pull that old desktop out of the closet. If it's a few billion lines of code, then speak to the bank manager about a loan for a blade enclosure!
I think the best place to start with a build server, though, is to buy yourself a new developer machine and then rebuild your old one to be your build server.
I would start by collecting some performance metrics on the build on whatever system you currently use to build. I would specifically look at CPU and memory utilization, the amount of data read and written from disk, and the amount of network traffic (if any) generated. On Windows you can use perfmon to get all of this data; on Linux, you can use tools like vmstat, iostat and top. Figure out where the bottlenecks are -- is your build CPU bound? Disk bound? Starved for RAM? The answers to these questions will guide your purchase decision -- if your build hammers the CPU but generates relatively little data, putting in a screaming SCSI-based RAID disk is a waste of money.
You may want to try running your build with varying levels of parallelism as you collect these metrics as well. If you're using gnumake, run your build with -j 2, -j 4 and -j 8. This will help you see if the build is CPU or disk limited.
Also consider the possibility that the right build server for your needs might actually be a cluster of cheap systems rather than a single massive box -- there are lots of distributed build systems out there (gmake/distcc, pvmgmake, ElectricAccelerator, etc) that can help you leverage an array of cheap computers better than you could a single big system.
Things to consider:
How many projects are going to be expected to build simultaneously? Is it acceptable for one project to wait while another finishes?
Are you going to do CI or scheduled builds?
How long do your builds normally take?
What build software are you using?
Most web projects are small enough (build times under 5 minutes) that buying a large server just doesn't make sense.
As an example,
We have about 20 devs actively working on 6 different projects. We are using a single TFS Build server running CI for all of the projects. They are set to build on every check in.
All of our projects build in under 3 minutes.
The build server is a single quad core with 4GB of RAM. The primary reason we use it is to perform dev and staging builds for QA. Once a build completes, that application is auto-deployed to the appropriate server(s). It is also responsible for running unit and web tests against those projects.
The type of build software you use is very important. TFS can take advantage of each core to parallel build projects within a solution. If your build software can't do that, then you might investigate having multiple build servers depending on your needs.
Our shop supports 16 products that range from a few thousand lines of code to hundreds of thousands of lines (maybe a million+ at this point). We use 3 HP servers (about 5 years old), dual quad core with 10GB of RAM. The disks are 7200 RPM SCSI drives. Everything is compiled via msbuild on the command line with parallel compilation enabled.
With that setup, our biggest bottleneck by far is disk I/O. We completely wipe our source code and re-check it out on every build, and the delete and checkout times are really slow. The compilation and publishing times are slow as well. The CPU and RAM are not remotely taxed.
I am in the process of refreshing these servers, so I am going the route of workstation-class machines, going with 4 instead of 3, and replacing the SCSI drives with the best/fastest SSDs I can afford. If you have a setup similar to this, then disk I/O should be a consideration.
