ESENT performance Vista vs. XP - SQLite

I'm in the process of evaluating ESENT (Extensible Storage Engine) from Microsoft for my company, but I'm getting odd performance results.
Compared with a similar technology (SQLite), read performance was very weak.
In my performance test, I read all of the data in the database in a more or less random order. I never read the same data twice, so I don't think the cache can help me. I run the test many times so that I'm measuring the speed once the data is "hot". I use an index on an id of type long, and I read with the following functions: JetSetCurrentIndex, JetMakeKey, JetSeek and JetRetrieveColumn.
On Windows Vista, I enabled the JET_paramEnableFileCache parameter and it worked wonders; ESENT was even faster than SQLite.
However, since this parameter is only available on Windows Vista and later, the performance on Windows XP is nowhere near SQLite's (roughly 15x slower); ESENT reads from disk every time. With SQLite on Windows XP, every read test except the first is served without touching the disk.
Am I missing another parameter or setting that would make the difference?
Thanks a lot!

If JET_paramEnableFileCache is helping then you must be terminating and restarting the process each time. JET_paramEnableFileCache was introduced to deal with applications which frequently initialize and terminate, which means the OS file cache has to be used instead of the normal database cache.
If you keep the process alive on XP then you will see the performance when the data is "hot".

@Spaceboy: I was thinking this myself... but do you replace ESENT.DLL in %windir%\system32? Sometimes I've had success by putting the DLL into my \bin subdirectory...

Related

cannot allocate memory - RSelenium and EC2

I am trying to implement a Selenium test that performs automated actions on a website (looping through pages). I am using R with the RSelenium package, plus a PostgreSQL database accessed through the DBI package, all running on an AWS EC2 server.
My problem is that a few minutes after the script is launched, my RStudio session freezes (as does my Linux session) and I see a message like "cannot allocate memory".
So this is clearly a memory issue, and by running top I could see that my Selenium Docker container was using most of the resources.
But my question is how can I reduce the amount of memory used by the Selenium test?
IMHO there is no practical way for a test to use less memory than it actually requires. You can try to simplify the test by breaking it up into two or more smaller tests, and check for memory leaks, as suggested in another answer.
It would be much easier to use the next larger instance type with more memory, and to shut the instance down when it is not in use if cost is a concern.
Don't forget to close your driver in your code (e.g. remDr$close() in RSelenium); if you don't close it, you will end up with a lot of Chrome instances.
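As a rough illustration (not the original poster's script), here is a minimal RSelenium sketch that keeps memory bounded by reusing one browser session for the whole loop and closing it at the end; the server address, port and URLs are placeholders:

    library(RSelenium)

    # Connect to a Selenium server already running (e.g. in Docker);
    # host, port and browser are assumptions - adjust to your setup.
    remDr <- remoteDriver(remoteServerAddr = "localhost",
                          port = 4444L,
                          browserName = "chrome")
    remDr$open()

    urls <- sprintf("https://example.com/page/%d", 1:100)  # placeholder URLs

    for (u in urls) {
      remDr$navigate(u)
      # ... scrape the page / write results to PostgreSQL via DBI here ...
    }

    # Close the browser session so Chrome processes don't pile up,
    # then drop the client object and trigger garbage collection.
    remDr$close()
    rm(remDr)
    gc()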

Difference between Jstack and gcore when generating Core Dumps?

We all know that core dumps are an essential diagnostic tool for analysing processes on Unix. I know that jstack and gcore are both used for generating javacore files or core dumps, but my question is whether gcore is mainly used for processes and jstack for threads.
From an operating-system perspective, processes and threads, although interrelated (a process is made up of threads), are quite different from each other with respect to memory, speed and execution. So is it the case that gcore diagnoses the process while jstack analyses the threads within that process?
gcore acts at the OS level and gives you a dump of the native process as it is currently running. From a Java point of view, it is not really readable.
jstack gets you the VM-level stack trace (the Java stack) of every thread your application has. From that you can see what Java code is actually being executed at that point.
In practice, gcore is almost never used (too low level, native code...). Only a really strange issue with a native library, or something like that, might call for this kind of tool.
There is also jmap, which can generate an hprof file containing the heap data from your VM. A tool like the Eclipse Memory Analyzer (MAT) can open the hprof file, and you can drill down into what was going on on the memory side.
If your VM crashes because of an OutOfMemoryError, you can also set a JVM parameter to produce the hprof file when the event occurs. That helps you understand why (too many users, a DB query that fetches too much data...).
One last thing: you can add a debug option when you start your VM so that you can connect to it and debug the running process. That can help if you have a strange issue that you cannot reproduce in your local environment.

What is a good way to get an in-memory cache with data.table?

Let's say I have a 4 GB dataset on a server with 32 GB of RAM.
I can read all of that into R, make it a data.table global variable, and have all of my functions use that global as a kind of in-memory database. However, when I exit R and restart, I have to read it from disk again. Even with smart disk caching strategies (save/load or R.cache) there is a delay of 10 seconds or so to get the data back in. Copying the data takes about 4 seconds.
Is there a good way to cache this in memory that survives the exit of an R session?
A couple of things come to mind: Rserve, redis/rredis, memcached, multicore...
Shiny Server and RStudio Server also seem to have ways of solving this problem.
But then again, it seems to me that perhaps data.table could provide this functionality since it appears to move data outside of R's memory block anyway. That would be ideal in that it wouldn't require any data copying, restructuring etc.
Update:
I ran some more detailed tests and I agree with the comment below that I probably don't have much to complain about.
But here are some numbers that others might find useful. I have a 32 GB server, and I created a data.table about 4 GB in size. According to gc(), and also watching top, it appeared to use about 15 GB of memory at peak, and that includes making one copy of the data. That's pretty good, I think.
I wrote it to disk with save(), deleted the object, and used load() to recreate it. This took 17 seconds and 10 seconds respectively.
I did the same with the R.cache package, and this was actually slower: 23 and 14 seconds.
Both of those reload times are quite fast, though. The load() approach gave me a transfer rate of 357 MB/s. By comparison, a copy took 4.6 seconds. This is a virtual server, so I'm not sure what kind of storage it has or how much that read speed is influenced by the cache.
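For anyone who wants to reproduce this kind of measurement, a sketch along these lines is enough (not the exact code used above; the table here is sized down and the file name is a placeholder):

    library(data.table)

    # Build a data.table of roughly 1 GB (scaled down for illustration).
    n  <- 5e7
    dt <- data.table(id = seq_len(n), x = rnorm(n), y = sample(letters, n, TRUE))
    print(object.size(dt), units = "auto")

    # Time writing to and reloading from disk with save()/load().
    system.time(save(dt, file = "dt.RData"))   # write
    system.time(dt2 <- copy(dt))               # in-memory copy, for comparison
    rm(dt, dt2); invisible(gc())
    system.time(load("dt.RData"))              # reload in a fresh state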
Very true: data.table hasn't got to on-disk tables yet. In the meantime, some options are:
Don't exit R. Leave it running on a server and use svSocket's evalServer() to talk to it, as the video on the data.table homepage demonstrates. Or the other similar options you mentioned.
Use a database for persistence, whether a SQL database or any of the NoSQL databases.
If you have large delimited files, some people have recently reported that fread() appears (much) faster than load(), but do experiment with compress=FALSE in save() too. Also, we've just pushed fwrite to the current development version (1.9.7; use devtools::install_github("Rdatatable/data.table") to install), which has some reported write times on par with native save(). There is a short sketch of this option just after this list.
Packages ff, bigmemory and sqldf, too. See the HPC Task View, the "Large memory and out-of-memory data" section.
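As a rough illustration of the save()/load() versus fwrite()/fread() option above (file names are placeholders, and fwrite() assumes the 1.9.7 development version mentioned):

    library(data.table)

    dt <- data.table(id = 1:1e7, x = rnorm(1e7))

    # Plain save()/load(); compress = FALSE trades file size for speed.
    system.time(save(dt, file = "dt.RData", compress = FALSE))
    system.time(load("dt.RData"))

    # Delimited-file route: fwrite() to CSV, fread() to reload.
    system.time(fwrite(dt, "dt.csv"))
    system.time(dt2 <- fread("dt.csv"))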
In enterprises where data.table is being used, my guess is that it is mostly being fed with data from some other persistent database, currently. Those enterprises probably:
use 64-bit R with, say, 16 GB, 64 GB or 128 GB of RAM. RAM is cheap these days. (But I realise this doesn't address persistency.)
The internals have been written with on-disk tables in mind. But don't hold your breath!
If you really need to exit R between computation sessions for some strange reason, and the server is not restarted, then just make a 4+ GB ramdisk in RAM and store the data there. Loading the data from RAM to RAM is much faster than from any SAS or SSD drive :)
This can be solved pretty easily on Linux with something like adding this line to /etc/fstab:
none /data tmpfs nodev,nosuid,noatime,size=5000M,mode=1777 0 0
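With the tmpfs mounted at /data as above, the R side is just the usual save()/load() (or fwrite()/fread()) pointed at the ramdisk; the file name below is a placeholder:

    library(data.table)

    dt <- data.table(id = 1:1e7, x = rnorm(1e7))

    # Write to the ramdisk before exiting R; compress = FALSE keeps it CPU-light.
    save(dt, file = "/data/dt.RData", compress = FALSE)

    # In a later, fresh R session: reload from RAM-backed storage.
    load("/data/dt.RData")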
Depending on what your dataset looks like, you might consider using the ff package. If you save your dataset as an ffdf, it will be stored on disk, but you can still access the data from R.
ff objects have a virtual part and a physical part. The physical part is the data on disk; the virtual part gives you information about the data.
To load this dataset in R, you only load the virtual part, which is a lot smaller - maybe only a few Kb, depending on whether you have a lot of factor data. So your data loads into R in a matter of milliseconds instead of seconds, while you still have access to the physical data on disk to do your processing.
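A minimal sketch of that ff workflow (the example data, column types and file path are assumptions, and ffsave() needs an external zip utility on the path):

    library(ff)

    # Convert an in-memory data.frame to an ffdf; the columns are written to
    # flat files on disk, and only the small "virtual" part stays in RAM.
    df  <- data.frame(id = 1:1e6, x = rnorm(1e6))
    fdf <- as.ffdf(df)

    # Persist the virtual part together with the on-disk column files.
    ffsave(fdf, file = "/data/mydataset")   # creates mydataset.RData + .ffData

    # In a later session: ffload() restores the virtual part in milliseconds;
    # the column data stays on disk until you actually touch it.
    ffload("/data/mydataset")
    fdf[1:5, ]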

WinDBG - Analyse dump file on local PC

I have created a memory dump of an ASP.NET process on a server using the following command: .dump /ma mydump.dmp. I am trying to identify a memory leak.
I want to look at the dump file in more detail on my local development PC. I read somewhere that it is advisable to debug on the same machine as the one where the dump file was created. However, I have also read that some developers do analyse dump files on their local development PCs. What is the best approach?
I also notice that when I create a dump file using the command above, the w3wp process's memory increases by about 1.5 times. Why is this? I suppose this should be avoided on a live server.
Analyzing on the same machine can save you from SOS loading issues later on. Unless you are familiar with WinDbg and SOS, you will find those issues confusing and frustrating.
If you have to use another machine for the analysis, make sure you read this blog post carefully, http://blogs.msdn.com/b/dougste/archive/2009/02/18/failed-to-load-data-access-dll-0x80004005-or-what-is-mscordacwks-dll.aspx , as it shows you how to copy the necessary files from the source machine (where the dump is captured) to the target machine (the one where you launch WinDbg).
For your second question: because you use WinDbg to attach to the process directly and the .dump command to capture the dump, the target process is unfortunately modified. That is not easy to explain in a few words. The recommended approach is to use ADPlus.exe or Debug Diag; even procdump from Sysinternals is better. Those tools are designed for dump capture and have minimal impact on the target process.
For a memory leak coming from unmanaged libraries, you should use the memory leak rule in Debug Diag. For a managed memory leak, you can simply capture hang dumps while memory usage is high.
I am no expert in WinDbg, but I once had to analyse a dump file from my ASP.NET site to track down a StackOverflowException.
I took the dump file from my live site (I had no choice, since that was what was failing) and originally tried to analyse it on my local dev PC, but I ran into problems when trying to load the CLR data from it. The reason was that the exact version of the .NET Framework differed between my dev PC and the server - both were .NET 4, but I imagine my dev PC had some cumulative updates installed that the server did not. The SOS module simply refused to load because of this discrepancy. I actually wrote a blog post about my findings.
So, to answer part of your question, you may have no choice but to run WinDbg on your server; at least then you can be sure that the dump file matches your environment.
It is not necessary to debug on the actual machine unless the problem is difficult to reproduce on your development machine.
As long as you have the PDBs with private symbols and the correct version of .NET installed, the symbols should resolve and the call stacks should display correctly.
In terms of looking at memory leaks, you should enable the GFlags user-mode stack trace and take memory dumps at two points in time, so you can compare memory usage before and after the action that provokes the leak. Remember to disable GFlags afterwards!
You could also run DebugDiag on the server, which has automated memory-pressure analysis scripts that work with .NET leaks.

Compare ODBC settings across multiple database servers

I've run into this problem with my three-node SQL cluster, though it's not unique to clusters. We have a dozen different ODBC drivers installed, both x86 and x64 versions, and we're constantly finding cases where some node in the cluster either has a different version of a driver, is missing the driver altogether, or has it configured improperly. Especially in a cluster, it's critical that all the nodes have the same configuration, or jobs can fail unexpectedly on one node and run fine on another, which leads to hours of frustration.
Is there a tool out there that will compare the installed/configured ODBC drivers and data sources and produce a report of what's out of sync? I've considered writing something in the past to do this, but haven't gotten around to it. If it's an issue for others and there's not a tool that does it, I'll put one together.
It seems that all the information related to your ODBC settings is stored together in the registry. Since nobody else knows of an app to compare these settings, I'll throw one together, post it on my website, and put a link here.
If you want to compare the settings yourself, they're stored at:
HKLM\SOFTWARE\ODBC\ODBC.INI\ (your data sources)
HKLM\SOFTWARE\ODBC\ODBCINST.INI\ (your installed providers)
Also, it's worth noting that if you're on an x64 machine, there are both x64 and x86 ODBC drivers and data sources, and they're stored separately. In that case, check out the accepted answer on the following post to see which location you should be checking:
http://social.msdn.microsoft.com/Forums/en/netfx64bit/thread/92f962d6-7f5e-4e62-ac0a-b8b0c9f552a3
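Until such a tool exists, one low-tech way to spot differences is to dump those keys on each node and diff the results. Here is a rough sketch using R's utils::readRegistry (it only reads the local registry, so the script has to run on every node; the output file names are placeholders):

    # Sketch: dump the ODBC registry keys on each node, save them, then compare.
    # readRegistry() is Windows-only and reads the local registry.
    dump_odbc <- function() {
      list(
        dsn_64     = readRegistry("SOFTWARE\\ODBC\\ODBC.INI",     hive = "HLM", maxdepth = 3),
        drivers_64 = readRegistry("SOFTWARE\\ODBC\\ODBCINST.INI", hive = "HLM", maxdepth = 3),
        # On x64 machines the x86 drivers/DSNs live in the 32-bit registry view.
        dsn_32     = readRegistry("SOFTWARE\\ODBC\\ODBC.INI",     hive = "HLM", maxdepth = 3, view = "32-bit"),
        drivers_32 = readRegistry("SOFTWARE\\ODBC\\ODBCINST.INI", hive = "HLM", maxdepth = 3, view = "32-bit")
      )
    }

    # On each node:
    saveRDS(dump_odbc(), sprintf("odbc_%s.rds", Sys.info()[["nodename"]]))

    # Then on one machine, after collecting the .rds files:
    # all.equal(readRDS("odbc_NODE1.rds"), readRDS("odbc_NODE2.rds"))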
