Running a simple SELECT * against a small table (10k rows) now takes 1 minute. It used to take 1 second, and still takes 1 second on my coworkers' computers. Running queries on larger tables in a reasonable time is now impossible.
I've tried querying from Teradata SQL Assistant and from a Jupyter Notebook; both are extremely slow. I've tried querying various ODBC Data Sources; all are suddenly slow.
I've tried deleting and then adding back the Data Sources in my ODBC Administrator. I've compared my ODBC settings to those of coworkers and they're exactly the same.
Related
Ever since upgrading to Fedora 23, information_schema queries have become very slow. This is an installation that started as MySQL on Fedora 17. The change definitely happened with the upgrade to 23.
mysql
use information_schema
select * from tables
....
5237 rows in set, 11 warnings (1 min 7.32 sec)
MariaDB [information_schema]>
There are 28 databases, none particularly large.
Is there any sort of clean up or optimization that can be done to make this reasonable again?
Thanks
Probably not a regression...
That query has to "open" every table in every database. That can mean a lot of OS I/O to fetch the .frm files, and the OS caches those files. I tested your query with my 1177 tables:
1st run: 32.54 seconds.
2nd run: 0.7 seconds.
3rd run: 0.7 seconds.
Try a second run on your "slower" machine.
Also, check this on both machines:
SHOW VARIABLES LIKE 'table_open_cache';
It might be more than 5237 on the fast machine and less than 5237 on the slower machine. (Actually, I don't think this is an issue. I shrank my setting, but the SELECT continued to be about 0.7 sec.)
Some confidential data is stored on a server and is accessible to researchers via remote access.
Researchers log in via some (I think Cisco) remote client and share virtual machines on the same host.
The virtual machine runs 64-bit Windows.
The system appears to be optimized for Stata; I'm among the first to use the data with R. There is no RStudio installed on the client, just RGui 3.0.2.
And here's my problem: the data is saved in Stata format (.dta), and I need to open it in R. At the moment I am doing
read.dta(fileName, convert.factors = FALSE)[fields]
Loading a smaller file (around 200 MB) takes 1-2 minutes. However, loading the main file (3-4 GB) takes so long that it outlasted my patience, and during that time the R GUI stops responding.
I can test my code on my own machine (OS X, RStudio) on a smaller data sample, and that works fine. Is this
because of OS X + RStudio, or only
because of the size of the file?
A colleague is using Stata on a similar file in his environment, and that works fine for him.
What can I do to improve the situation? Possible solutions I came up with are:
Load the data into R differently (perhaps there is a way that doesn't require all this memory usage). I also have access to Stata; if all else fails, I could prepare the data in Stata, for example slice it into smaller pieces and reassemble it in R (see the sketch after this list).
Ask them to allocate more memory to my user of the VM (if that is indeed the issue).
Ask them to provide RStudio (even if that's not faster, perhaps it's less prone to crashes).
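If slicing in Stata turns out to be the way to go, the reassembly step in R could look roughly like this. It is only a sketch: the slice file names are hypothetical, fields stands for the columns actually needed, and it assumes the foreign package already used above (data.table is optional, but its rbindlist is faster than do.call(rbind, ...)):

library(foreign)       # provides read.dta, as used above
library(data.table)    # rbindlist stacks the pieces quickly

# Hypothetical slices produced in Stata, e.g. slice_1.dta ... slice_5.dta
slice_files <- sprintf("slice_%d.dta", 1:5)
fields <- c("id", "year", "income")    # placeholder column names

# Read each slice, keep only the needed columns, then stack the pieces
pieces <- lapply(slice_files, function(f)
  read.dta(f, convert.factors = FALSE)[fields])
full_data <- rbindlist(pieces)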
Certainly the size of the file is a prime factor, but the machine and its configuration might be, too. It's hard to tell without more information. You need a 64-bit operating system and a 64-bit version of R.
I don't imagine that RStudio will help or hinder the process.
If the process scales linearly, your big-data case will take (120 seconds) * (4096 MB / 200 MB) ≈ 2458 seconds, or about 41 minutes. Is that how long you waited?
The process might not be linear.
Was the process making progress? If you checked CPU and memory, was it still running? Was it doing a lot of page swapping?
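To rule out the 32-bit/64-bit and memory questions from inside the R session on the VM, something along these lines should work (memory.limit() is Windows-only and only a rough indicator; the numbers will vary by machine):

# 8 bytes per pointer means a 64-bit build of R; 4 means 32-bit
.Machine$sizeof.pointer
R.version$arch          # e.g. "x86_64"

# Windows-only: the ceiling (in MB) that R is allowed to use
memory.limit()

# How much memory the current session has used so far
gc()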
Let's say I have a 4GB dataset on a server with 32 GB.
I can read all of that into R, make a data.table global variable and have all of my functions use that global as a kind of in-memory database. However, when I exit R and restart, I have to read it from disk again. Even with smart disk caching strategies (save/load or R.cache) I have a delay of about 10 seconds getting that data in. Copying that data takes about 4 seconds.
Is there a good way to cache this in memory that survives the exit of an R session?
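For context, the usage pattern being described is roughly this; the table and column names below are made up:

library(data.table)

# Loaded once per session; every function then references this global table
DT <- data.table(group_id = sample(1:100, 1e6, TRUE), value = rnorm(1e6))

mean_by_group <- function() {
  # grouping runs against the global table; the big table itself is not copied
  DT[, .(mean_value = mean(value)), by = group_id]
}

head(mean_by_group())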
A couple of things come to mind: Rserve, redis/rredis, Memcache, multicore ...
Shiny Server and RStudio Server also seem to have ways of solving this problem.
But then again, it seems to me that data.table could perhaps provide this functionality itself, since it appears to move data outside of R's memory block anyway. That would be ideal in that it wouldn't require any data copying, restructuring, etc.
Update:
I ran some more detailed tests and I agree with the comment below that I probably don't have much to complain about.
But here are some numbers that others might find useful. I have a 32GB server. I created a data.table of 4GB size. According to gc() and also looking at top, it appeared to use about 15GB peak memory and that includes making one copy of the data. That's pretty good I think.
I wrote to disk with save(), deleted the object and used load() to remake it. This took 17 seconds and 10 seconds respectively.
I did the same with the R.cache package, and that was actually slower: 23 and 14 seconds.
However, both of those reload times are quite fast. The load() method gave me a transfer rate of 357 MB/s. By comparison, a plain in-memory copy took 4.6 seconds. This is a virtual server; I'm not sure what kind of storage it has or how much that read speed is influenced by the cache.
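For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of the test described above; the object is a stand-in (much smaller than 4 GB) and the file path is a placeholder:

library(data.table)

# Stand-in for the real table; increase N to approach the 4 GB case
N  <- 1e7
dt <- data.table(id = 1:N, x = rnorm(N), y = sample(letters, N, TRUE))

system.time(save(dt, file = "/tmp/dt.RData"))   # write to disk
rm(dt); gc()                                    # drop the in-memory copy
system.time(load("/tmp/dt.RData"))              # read it back

# For comparison: an in-memory copy of the same object
system.time(dt2 <- copy(dt))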
Very true: data.table hasn't got to on-disk tables yet. In the meantime some options are:
Don't exit R. Leave it running on a server and use svSocket to evalServer() to it, as the video on the data.table homepage demonstrates. Or the other similar options you mentioned.
Use a database for persistency, such as an SQL database or any of the NoSQL databases.
If you have large delimited files, then some people have recently reported that fread() appears (much) faster than load(). But experiment with compress=FALSE. Also, we've just pushed fwrite to the current development version (1.9.7; use devtools::install_github("Rdatatable/data.table") to install), which has some reported write times on par with native save (see the sketch after this list).
Packages ff, bigmemory and sqldf, too. See the HPC Task View, the "Large memory and out-of-memory data" section.
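A rough sketch of that comparison, assuming a data.table version that has fwrite (1.9.7 development at the time of writing); the data and file paths are placeholders:

library(data.table)

dt <- data.table(id = 1:1e7, x = rnorm(1e7))    # stand-in data

# Binary RData route; compress = FALSE trades file size for speed
system.time(save(dt, file = "/tmp/dt.RData", compress = FALSE))
system.time(load("/tmp/dt.RData"))

# Delimited-file route using data.table's own writer and reader
system.time(fwrite(dt, "/tmp/dt.csv"))
system.time(dt2 <- fread("/tmp/dt.csv"))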
In enterprises where data.table is being used, my guess is that it is currently mostly being fed with data from some other persistent database. Those enterprises probably:
use 64-bit machines with, say, 16GB, 64GB or 128GB of RAM. RAM is cheap these days. (But I realise this doesn't address persistency.)
The internals have been written with on-disk tables in mind. But don't hold your breath!
If you really need to exit R between computation sessions for some strange reason, and the server is not restarted, then just make a 4 GB ramdisk in RAM and store the data there. Loading the data from RAM to RAM will be much faster than from any SAS or SSD drive :)
This can be solved pretty easily on Linux with something like adding this line to /etc/fstab:
none /data tmpfs nodev,nosuid,noatime,size=5000M,mode=1777 0 0
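On the R side nothing changes; you just point save()/load() at the mount point, so the "disk" I/O is really RAM-to-RAM. A tiny sketch, assuming the 4 GB data.table is called dt and /data is the tmpfs mount from the fstab line above:

save(dt, file = "/data/dt.RData", compress = FALSE)   # lands in RAM-backed tmpfs
rm(dt)
load("/data/dt.RData")                                # reloads from RAM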
Depending on what your dataset looks like, you might consider using the ff package. If you save your dataset as an ffdf, it will be stored on disk but you can still access the data from R.
ff objects have a virtual part and a physical part. The physical part is the data on disk, the virtual part gives you information about the data.
To load this dataset in R, you only load the virtual part, which is a lot smaller, maybe only a few KB, depending on whether you have a lot of factor data. So this loads your data into R in a matter of milliseconds instead of seconds, while you still have access to the physical data to do your processing.
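A minimal sketch of that workflow with ff; the object and file names are made up, and it assumes the ff package is installed:

library(ff)

# One-off conversion: an ordinary data.frame becomes an ffdf backed by files on disk
big_df <- data.frame(id = 1:1e6, x = rnorm(1e6))   # stand-in data
big_ff <- as.ffdf(big_df)

# Save the small virtual part together with the on-disk backing files
ffsave(big_ff, file = "/tmp/big_ff")   # creates /tmp/big_ff.RData and /tmp/big_ff.ffData

# In a fresh session: reload the virtual part only; the bulk stays on disk
ffload(file = "/tmp/big_ff")
big_ff[1:10, ]                         # rows are pulled from disk on access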
I'm in the process of testing ESENT (Extensible Storage Engine) from Microsoft for my company. However, I have weird performance results.
Compared with similar technologies (SQLite), the read performance was very weak.
In my performance test, I read all the data in the database in a more or less random order. I do not read the same data twice, so I don't think the cache can help me. I run the test many times to measure the speed when the data is "hot". I use an index on an id of type long, and the following functions to read: JetSetCurrentIndex, JetMakeKey, JetSeek and JetRetrieveColumn.
On Windows Vista, I activated the parameter JET_paramEnableFileCache and it worked miracles; it was even faster than SQLite.
However, since this parameter is only available on Windows Vista and later, performance on Windows XP is nowhere near SQLite's (about 15x slower): it reads from disk every time. With SQLite on Windows XP, none of the read tests (except the first) reads from disk.
Am I missing another parameter or setting that would make the difference?
Thanks a lot!
If JET_paramEnableFileCache is helping then you must be terminating and restarting the process each time. JET_paramEnableFileCache was introduced to deal with applications which frequently initialize and terminate, which means the OS file cache has to be used instead of the normal database cache.
If you keep the process alive on XP then you will see the performance when the data is "hot".
@Spaceboy: I was thinking this myself ... but do you replace the ESENT.DLL in windir\system32? Sometimes I have succeeded by putting the DLL into my \bin subdir ...
I've run into this problem with my three-node SQL cluster, though it's not unique to clusters. We have a dozen different ODBC drivers installed, both x86 and x64 versions, and we're constantly finding instances where some nodes in the cluster have a different version of a driver, are missing the driver, or have it configured incorrectly. Especially in a cluster, it's critical that all nodes have the same configuration, or jobs can fail unexpectedly on one node while running fine on another, and that leads to hours of frustration.
Is there a tool out there that will compare the installed/configured ODBC drivers and data sources and produce a report of what's out of sync? I've considered writing something in the past to do this, but haven't gotten around to it. If it's an issue for others and there's not a tool that does it, I'll put one together.
It seems that all the information related to your ODBC settings is stored together in the registry. Since nobody else knows of an app to compare these settings, I'll throw one together, post it on my website, and put a link here.
If you want to compare the settings yourself, they're stored at:
HKLM\SOFTWARE\ODBC\ODBC.INI\ (your data sources)
HKLM\SOFTWARE\ODBC\ODBCINST.INI\ (your installed providers)
Also, it's worth noting that if you're on an x64 machine, there are both x64 and x86 ODBC drivers and data sources, and they're stored separately. In that case, check the accepted answer on the following post to see which registry location you should be looking in:
http://social.msdn.microsoft.com/Forums/en/netfx64bit/thread/92f962d6-7f5e-4e62-ac0a-b8b0c9f552a3
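For a quick manual comparison in the meantime, the keys above can be dumped to a text file on each node and diffed. As one example, R on Windows can read them via utils::readRegistry; this is only a sketch, and the output file name is a placeholder:

# Windows-only sketch: dump ODBC data sources and drivers from both registry views
dump_odbc <- function(out_file = "odbc_dump.txt") {
  views <- c("64-bit", "32-bit")
  keys  <- c("SOFTWARE\\ODBC\\ODBC.INI", "SOFTWARE\\ODBC\\ODBCINST.INI")
  sink(out_file)
  for (v in views) {
    for (k in keys) {
      cat("=====", v, k, "=====\n")
      # readRegistry lives in utils and exists on Windows builds of R only
      print(try(readRegistry(k, hive = "HLM", maxdepth = 3, view = v), silent = TRUE))
    }
  }
  sink()
  invisible(out_file)
}

# Run on every node, then compare the resulting files with any diff tool
dump_odbc(sprintf("odbc_%s.txt", Sys.info()[["nodename"]]))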