I am trying to create a local SPARQL endpoint for Freebase to run some local experiments. With Virtuoso 7, I regularly see the server getting killed by the OOM killer. I have followed all the required steps as mentioned here. I have also made the required changes to my virtuoso.ini file as described in RDF Performance Tuning.
My system configuration is:
8 CPUs, 2.9 GHz
16 GB RAM
I have enough hard disk space too.
Regarding data dumps, I have split the Freebase data dump (23 GB gzipped, approx. 250 GB uncompressed) into 10 smaller gzipped files containing 200,000,000 triples each.
Following are the changes I made to virtuoso.ini
NumberOfBuffers = 1360000
MaxDirtyBuffers = 1000000
MaxCheckpointRemap = 340000 # (1/4th of NumberOfBuffers)
Along with this, I have set vm.swappiness = 10 as mentioned in 2.
Am I missing something obvious?
P.S.:
I did try virtuoso-opensource-6.1 too, but it appeared to be too slow.
One interesting observation was that during the bulk loading process, virtuoso-6.1 memory consumption rose very slowly, but that might be because indexing in general was too slow.
Another observation was that virtuoso-6.1 occupies almost negligible memory at start time (on the order of 500 MB), whereas virtuoso-7 starts with approx. 6500 MB and grows quickly.
Any help in this regard would be highly appreciated.
The number of buffers you are using is a little bit too high. Do not forget that some memory is also consumed by the OS and other processes.
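To put a number on that, here is a rough sketch of how much RAM the buffer pool alone implies. It assumes each buffer caches one 8 KB database page; the real per-buffer overhead is somewhat higher, so the result is a lower bound. The function names and the 0.5 fraction are only illustrative, not a recommended setting.

```python
# Rough estimate of the buffer pool size implied by NumberOfBuffers.
# Assumption: one buffer caches one 8 KB page; actual per-buffer memory
# is a bit larger, so treat these figures as a lower bound.
PAGE_BYTES = 8 * 1024
GIB = 1024 ** 3


def buffer_pool_gib(number_of_buffers: int) -> float:
    """Approximate buffer pool size in GiB."""
    return number_of_buffers * PAGE_BYTES / GIB


def buffers_for_fraction(total_ram_gib: float, fraction: float) -> int:
    """Number of 8 KB buffers that fit in the given fraction of RAM."""
    return int(total_ram_gib * GIB * fraction // PAGE_BYTES)


if __name__ == "__main__":
    # Setting from the question: about 10.4 GiB of a 16 GB machine.
    print(f"{buffer_pool_gib(1_360_000):.1f} GiB")
    # Hypothetical alternative: cap the pool at half of RAM.
    print(buffers_for_fraction(16, 0.5))
```

With the values from the question, the pool by itself takes roughly two thirds of the 16 GB before the OS, the loader and other processes get anything.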
Which exact version do you use (development or stable branch)?
Do you use disk striping?
I loaded Freebase into Virtuoso 7 too, but I used smaller files: circa 260 gzipped files of 10 million triples each (circa 100M). A commit is executed after every file load.
Maybe it would be easier for you to use images with Virtuoso preloaded with Freebase.
We have a project that deals with millions of transactions every day and has some tight SLAs. To parse the flat file that comes as input into a bean, we used BeanIO, which worked well without load. But under load it takes around 250 ms to parse a flat file into a bean.
Requirement: a simple string has to be converted to a single bean (nested and converted).
I heard that univocity can do better here, and tried it with the settings below.
FixedWidthParserSettings settings = new FixedWidthParserSettings();
settings.getFormat().setLineSeparator("\n");
settings.setRecordEndsOnNewline(false);
settings.setHeaderExtractionEnabled(false);
settings.setIgnoreLeadingWhitespaces(false);
settings.setIgnoreTrailingWhitespaces(false);
settings.setMaxColumns(100);
settings.setMaxCharsPerColumn(100);
settings.setNumberOfRecordsToRead(1);
settings.setReadInputOnSeparateThread(false);
settings.setInputBufferSize(10*1024);
settings.setLineSeparatorDetectionEnabled(false);
settings.setColumnReorderingEnabled(false);
When running with JMeter with 200 parallel threads, the average time taken is 10 ms (to parse and convert around 10 fields, whereas in the actual use case we have to do the same for around 500 fields),
but when we increased it to 300 or 350 parallel threads, the average time was around 300 ms. Our total SLA is around 10 ms.
Any help here is highly appreciated!
Probably you are running out of memory on your JVM. Try increasing it with the -Xms and -Xmx flags. Also too many threads won't help you if you don't have enough cores available.
My Codename One application downloads around 16000 records of data (approx 10 fields in each record).
On my Android phone (OS 6.0, 2 GB RAM) it is able to load 8000 to 9000 records but then shows an out-of-memory error.
From the trace, it looks like it ran out of the heap memory allocated to the app.
Any suggestion what would be the ideal way to handle that large amount of data, please?
Here is the log file
The amount of RAM on the phone doesn't mean much. The OS takes about half and then divides the rest among the various apps running in parallel. You would typically have much less; see What is the maximum amount of RAM an app can use?
You need to review your code and check what is eating up memory. 16k records of 1 KB each would be 16 MB, which probably shouldn't crash an app, so the question is where the memory is going. I would suggest reading the performance section of the developer guide to figure out memory usage.
This might not apply to your situation, but would it be possible to download only a fixed number of records at a time? Then, when the user takes some action (scrolls, hits next page, etc.), it loads the next batch. Codename One has a great endless scroller implementation. See here for an example - https://www.codenameone.com/blog/property-cross-revisited.html
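The batching idea can be sketched independently of any UI toolkit. The snippet below is plain Python, not Codename One API; `fetch_in_batches`, `fetch_page` and `make_fake_source` are hypothetical names, and the fake source stands in for whatever server call actually returns the records.

```python
from typing import Callable, Dict, Iterator, List

Record = Dict[str, object]


def fetch_in_batches(
    fetch_page: Callable[[int, int], List[Record]], batch_size: int
) -> Iterator[List[Record]]:
    """Yield records one batch at a time instead of holding them all in memory."""
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        if not page:
            return
        yield page
        if len(page) < batch_size:
            return  # a short page means the source is exhausted
        offset += len(page)


def make_fake_source(total: int) -> Callable[[int, int], List[Record]]:
    """Hypothetical stand-in for a server call that supports offset and limit."""
    def fetch_page(offset: int, limit: int) -> List[Record]:
        return [{"id": i} for i in range(offset, min(offset + limit, total))]
    return fetch_page


if __name__ == "__main__":
    for batch in fetch_in_batches(make_fake_source(25), 10):
        print(len(batch))
```

Only one batch is alive at a time, so peak memory depends on the batch size rather than on the total number of records. This assumes the server can return a slice by offset and limit.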
I'm using flat file data sources with the incremental load functionality and am seeing different performance depending on how I do the load. I have 3 datasets {d1,d2,d3} with d1 and d2 being the same size and d3 being 3 times larger. I am doing the following test on a machine with 16GB memory:
Load d1 - time: 1m07s
incrementally load d2 - time: 2m53s
incrementally load d3 - runs out of memory
On the other hand, if I do a single load of d1+d2+d3, the total time is 5m29s and there are no memory issues.
Is this just a matter of memory overhead when doing incremental vs single load or should I be better managing the performance?
Incremental load was implemented to support real-time use and does not follow the same logic as a normal load.
Additional data is pre-loaded into memory, which is why it takes more memory. During this pre-load the schema is still available; once the new data is fully pre-loaded and a first quality check is done, the schema is write-locked and the actual load is done. This means the schema is locked for only a few milliseconds.
Incremental load is suitable for real-time, 'small' amounts of additional data, not really for your scenario.
Are the slow times not due to the fact that you're running out of memory (a lot of GCs)?
Hope that helps.
PS: If you need additional support, please contact support directly.
We have an application which will need to store thousands of fairly small CSV files. 100,000+ and growing annually by the same amount. Each file contains around 20-80KB of vehicle tracking data. Each data set (or file) represents a single vehicle journey.
We are currently storing this information in SQL Server, but the size of the database is getting a little unwieldy and we only ever need to access the journey data one file at a time (so there is no need to query it in bulk or otherwise store it in a relational database). The performance of the database is degrading as we add more tracks, due to the time taken to rebuild or update indexes when inserting or deleting data.
There are 3 options we are considering:
1. We could use the FILESTREAM feature of SQL Server to externalise the data into files, but I've not used this feature before. Would FILESTREAM still result in one physical file per database object (blob)?
2. Alternatively, we could store the files individually on disk. There could end up being half a million of them after 3+ years. Will the NTFS file system cope OK with this amount?
3. If lots of files is a problem, should we consider grouping the datasets/files into small databases (one per user)? Is there a very lightweight database like SQLite that can store files?
One further point: the data is highly compressible. Zipping the files reduces them to only 10% of their original size. I would like to utilise compression if possible to minimise disk space used and backup size.
I have a few thoughts, and this is very subjective, so your mileage and other readers' mileage may vary, but hopefully it will still get the ball rolling for you even if other folks want to put differing points of view...
Firstly, I have seen performance issues with folders containing too many files. One project got around this by creating 256 directories, called 00, 01, 02... fd, fe, ff and inside each one of those a further 256 directories with the same naming convention. That potentially divides your 500,000 files across 65,536 directories giving you only a few in each - if you use a good hash/random generator to spread them out. Also, the filenames are pretty short to store in your database - e.g. 32/af/file-xyz.csv. Doubtless someone will bite my head off, but I feel 10,000 files in one directory is plenty to be going on with.
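A minimal sketch of that two-level layout, assuming the directory names are taken from a SHA-1 hash of the filename (the answer only asks for "a good hash/random generator", so the choice of SHA-1 and the name `sharded_path` are illustrative):

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(filename: str) -> PurePosixPath:
    """Map a filename to <2 hex chars>/<2 hex chars>/<filename>."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    # The first two bytes of the hash pick one of 256 * 256 = 65,536 directories.
    return PurePosixPath(digest[0:2], digest[2:4], filename)


if __name__ == "__main__":
    print(sharded_path("file-xyz.csv"))
    # Average load if 500,000 files are spread evenly over all directories.
    print(round(500_000 / 65_536, 1))
```

Because the path is derived from the name, it can be recomputed on demand or stored in the database as the short relative path the answer describes.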
Secondly, 100,000 files of 80 kB amounts to 8 GB of data, which is really not very big these days - a small USB flash drive in fact - so I think any arguments about compression are not that valid - storage is cheap. What could be important, though, is backup. If you have 500,000 files you have lots of 'inodes' to traverse, and I think the statistic used to be that many backup products can only traverse 50-100 'inodes' per second - so you are going to be waiting a very long time. Depending on the downtime you can tolerate, it may be better to take the system offline and back up from the raw block device - at say 100 MB/s you can back up 8 GB in 80 seconds, and I can't imagine a traditional, file-based backup can get close to that. Alternatives may be a filesystem that permits snapshots, so that you can back up from a snapshot, or a mirrored filesystem which permits you to split the mirror, back up from one copy and then rejoin the mirror.
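The arithmetic behind those estimates, as a small sketch. The rates are the ones quoted above (50-100 inodes per second, 100 MB/s); the function names are illustrative and sizes use decimal units.

```python
def traversal_seconds(num_files: int, inodes_per_second: float) -> float:
    """Time a file-based backup spends just walking the files."""
    return num_files / inodes_per_second


def block_copy_seconds(total_bytes: int, bytes_per_second: float) -> float:
    """Time to stream the raw block device at a sustained rate."""
    return total_bytes / bytes_per_second


if __name__ == "__main__":
    total_bytes = 100_000 * 80_000  # 100,000 files of 80 kB = 8 GB
    print(traversal_seconds(500_000, 50) / 3600)   # hours at 50 inodes/s
    print(traversal_seconds(500_000, 100) / 3600)  # hours at 100 inodes/s
    print(block_copy_seconds(total_bytes, 100_000_000))  # seconds at 100 MB/s
```

At the quoted rates, walking 500,000 files takes between roughly 1.4 and 2.8 hours, against 80 seconds for a raw copy of 8 GB.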
As I said, pretty subjective and I am sure others will have other ideas.
I work on an application that uses a hybrid approach, primarily because we wanted our application to be able to work (in small installations) in freebie versions of SQL Server... and the file load would have thrown us over the top quickly. We have gobs of files - tens of millions in large installations.
We considered the same scenarios you've enumerated, but what we eventually decided to do was to have a series of moderately large (2 GB) memory-mapped files that contain the would-be files as opaque blobs. Then, in the database, the blobs are keyed by blob-id (a SHA-1 hash of the uncompressed blob), and have fields for the container-file-id, offset, length, and uncompressed-length. There's also a "published" flag in the blob-referencing table. Because the hash faithfully represents the content, a blob is only ever written once. Modified files produce new hashes, and they're written to new locations in the blob store.
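A much simplified, in-memory sketch of that layout follows. It is not the poster's implementation: a `bytearray` stands in for a memory-mapped container file, a dict stands in for the database table, `zlib` stands in for the compressor, and the "published" flag and container rollover are left out. `BlobRef` and `BlobStore` are hypothetical names.

```python
import hashlib
import zlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BlobRef:
    container_id: int
    offset: int
    length: int               # bytes actually stored in the container
    uncompressed_length: int


class BlobStore:
    def __init__(self) -> None:
        self.containers: List[bytearray] = [bytearray()]  # stand-in for container files
        self.index: Dict[str, BlobRef] = {}               # stand-in for the DB table

    def put(self, data: bytes) -> str:
        blob_id = hashlib.sha1(data).hexdigest()
        if blob_id in self.index:
            return blob_id  # same content, same hash: written only once
        packed = zlib.compress(data)
        if len(packed) >= len(data):
            packed = data  # incompressible: store the raw bytes
        container = self.containers[-1]
        self.index[blob_id] = BlobRef(
            len(self.containers) - 1, len(container), len(packed), len(data)
        )
        container.extend(packed)
        return blob_id

    def get(self, blob_id: str) -> bytes:
        ref = self.index[blob_id]
        raw = bytes(self.containers[ref.container_id][ref.offset:ref.offset + ref.length])
        # Equal lengths mean the blob was stored uncompressed.
        return raw if ref.length == ref.uncompressed_length else zlib.decompress(raw)
```

Storing both `length` and `uncompressed_length` lets the reader tell compressed from raw blobs without an extra flag, which is one plausible reason for keeping both fields.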
In our case, the blobs weren't consistently text files - in fact, they're chunks of files of all types. Big files are broken up with a rolling-hash function into roughly 64k chunks. We attempt to compress each blob with lz4 compression (which is way fast compression - and aborts quickly on effectively-incompressible data).
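For readers unfamiliar with rolling-hash chunking, here is a minimal sketch. The answer does not say which rolling hash or parameters were used, so everything here is an assumption: a polynomial hash over a sliding window, a cut whenever the low bits of the hash are all ones, and minimum and maximum chunk sizes as guards. With the default mask, a cut is expected about every 64 KB on varied data.

```python
from typing import List


def split_chunks(
    data: bytes,
    window: int = 48,
    mask: int = 0xFFFF,
    min_size: int = 16 * 1024,
    max_size: int = 256 * 1024,
) -> List[bytes]:
    """Split data at content-defined boundaries using a polynomial rolling hash."""
    base = 257
    mod = (1 << 61) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks: List[bytes] = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i - start >= window:
            # Window is full: drop the oldest byte before adding the new one.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because boundaries depend on content rather than position, an insertion near the start of a large file changes only the nearby chunks, and the remaining chunks keep their hashes and are not rewritten.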
This approach works really well, but isn't lightly recommended. It can get complicated. For example, grooming the container files in the face of deleted content. For this, we chose to use sparse files and just tell NTFS the extents of deleted blobs. Transactional demands are more complicated.
All of the goop for db-to-blob-store is C# with a little interop for the memory-mapped files. Your scenario sounds similar, but somewhat less demanding. I suspect you could get away without the memory-mapped I/O complications.
Has anyone faced problems submitting a job on large data? The data is around 5-10 TB uncompressed, in approximately 500K files. When we try to submit a simple Java MapReduce job, it spends more than an hour in the getSplits() call, and it takes multiple hours to appear in the job tracker. Is there any possible solution to this problem?
With 500K files, you are spending a lot of time tree walking to find all these files, which then need to be assigned to a list of InputSplits (the result of getSplits).
As Thomas points out in his answer, if your machine performing the job submission has a low amount of memory assigned to the JVM, then you're going to see issues with the JVM performing garbage collection to try and find the memory required to build up the splits for these 500K files.
To make matters worse, if these 500K files are splittable and larger than a single block, then you'll get even more input splits to process the files (for a file of, say, 1 GB with a block size of 256 MB, you'll by default get 4 map tasks to process this file, assuming the input format and file compression support splitting the file). If this is applicable to your job (look at the number of map tasks spawned for your job; are there more than 500K?), then you can force fewer mappers to be created by setting the mapred.min.split.size configuration property to a size larger than the current block size (setting it to 1 GB for the previous example means you'll get a single mapper to process the file, rather than 4). This will help the performance of the getSplits method because the resultant list of splits will be smaller, requiring less memory.
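The effect of the minimum split size can be illustrated with a simplified model of how FileInputFormat sizes splits: the split size is the block size clamped between the minimum and maximum split sizes, and a file is cut into pieces of that size with a 10% slop on the last piece. This is a sketch of the arithmetic only, not Hadoop code; the function names are illustrative.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than the split size


def split_size(block_size: int, min_split: int = 1, max_split: int = 2 ** 63 - 1) -> int:
    """Block size clamped between the minimum and maximum split sizes."""
    return max(min_split, min(max_split, block_size))


def count_splits(file_size: int, block_size: int, min_split: int = 1) -> int:
    """Number of input splits produced for one splittable file."""
    size = split_size(block_size, min_split)
    remaining = file_size
    splits = 0
    while remaining / size > SPLIT_SLOP:
        splits += 1
        remaining -= size
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    print(count_splits(1 * GIB, 256 * MIB))                     # default: 4 splits
    print(count_splits(1 * GIB, 256 * MIB, min_split=1 * GIB))  # raised minimum: 1 split
```

Multiply the per-file count by 500K files and the difference in the size of the split list, and in the memory needed to build and serialize it, becomes significant.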
The second symptom of your problem is the time it takes to serialize the input splits to a file (client side), and then the deserialization time at the job tracker end. 500K+ splits is going to take time, and the jobtracker will have similar GC issues if it has a low JVM memory limit.
It largely depends on how "strong" your submission server is (or your laptop client), maybe you need to upgrade RAM and CPU to make the getSplits call faster.
I believe you ran into swap issues there, and the computation therefore takes multiple times longer than usual.