How do I speed up loading many small RDF files into Sesame?

I'm working with an RDF dataset generated as part of our data collection which consists of around 1.6M small files totalling 6.5G of text (ntriples) and around 20M triples. My problem relates to the time it's taking to load this data into a Sesame triple store running under Tomcat.
I'm currently loading it from a Python script via the HTTP api (on the same machine) using simple POST requests one file at a time and it's taking around five days to complete the load. Looking at the published benchmarks, this seems very slow and I'm wondering what method I might use to load the data more quickly.
I did think that I could write Java to connect directly to the store and so do without the HTTP overhead. However I read in an answer to another question here that concurrent access is not supported, so that doesn't look like an option.
If I were to write Java code to connect to the HTTP repository does the Sesame library do some special magic that would make the data load faster?
Would grouping the files into larger chunks help? This would cut down the HTTP overhead for sending the files. What size of chunk would be good? This blog post suggests 100,000 lines per chunk (it's cutting a larger file up, but the idea would be the same).
Thanks,
Steve

If you are able to work in Java instead of Python I would recommend using the transactional support of Sesame's Repository API to your advantage - start a transaction, add several files, then commit; rinse & repeat until you've sent all files.
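For illustration, here is a minimal sketch of that batched-transaction approach, assuming a Sesame 2.7+ client on the classpath; the server URL, the repository ID "myrepo", the data directory and the batch size are placeholders you would adjust (and tune experimentally) for your setup:

import java.io.File;

import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.http.HTTPRepository;
import org.openrdf.rio.RDFFormat;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        // Connect to the remote repository over HTTP (URL and repository ID are placeholders).
        Repository repo = new HTTPRepository("http://localhost:8080/openrdf-sesame", "myrepo");
        repo.initialize();
        RepositoryConnection con = repo.getConnection();
        try {
            int batchSize = 1000;   // files per commit -- tune this experimentally
            int inBatch = 0;
            con.begin();            // on older releases use con.setAutoCommit(false) instead
            for (File f : new File("/data/ntriples").listFiles()) {
                con.add(f, f.toURI().toString(), RDFFormat.NTRIPLES);
                if (++inBatch == batchSize) {
                    con.commit();   // send the whole batch in one transaction
                    con.begin();
                    inBatch = 0;
                }
            }
            con.commit();           // commit whatever is left in the last batch
        } finally {
            con.close();
            repo.shutDown();
        }
    }
}

Committing every N files amortizes the per-request overhead while keeping individual transactions from growing without bound.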
If that is not an option then indeed chunking the data into larger files (or larger POST request bodies - you of course do not necessarily need to physically modify your files) would help. A good chunk size would probably be around 500,000 triples in your case - it's a bit of a guess to be honest, but I think that will give you good results.
You can also cut down on overhead by using gzip compression on the POST request body (if you don't do so already).
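To make the chunking and compression concrete, here is a rough sketch of sending several N-Triples files as one (optionally gzipped) POST body to the standard Sesame REST endpoint .../repositories/{id}/statements. The server URL, repository ID and file handling are placeholders, and it assumes your server side actually decompresses gzip-encoded request bodies (Tomcat does not do this out of the box); if it does not, drop the GZIPOutputStream wrapper and the Content-Encoding header and you still benefit from fewer, larger requests:

import java.io.File;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class ChunkedUploader {
    static final String ENDPOINT =
        "http://localhost:8080/openrdf-sesame/repositories/myrepo/statements";

    // Send one batch of N-Triples files as a single (optionally gzipped) POST body.
    static void postChunk(List<File> files, boolean gzip) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(ENDPOINT).openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/plain"); // MIME type Sesame uses for N-Triples
        if (gzip) {
            con.setRequestProperty("Content-Encoding", "gzip");
        }
        OutputStream out = gzip ? new GZIPOutputStream(con.getOutputStream())
                                : con.getOutputStream();
        for (File f : files) {
            // N-Triples is line-based, so concatenating files still yields valid N-Triples
            Files.copy(f.toPath(), out);
            out.write('\n'); // guard against a missing trailing newline
        }
        out.close(); // finishes the gzip stream (if any) and flushes the request body
        int status = con.getResponseCode();
        if (status != HttpURLConnection.HTTP_NO_CONTENT && status != HttpURLConnection.HTTP_OK) {
            throw new RuntimeException("Upload failed with HTTP status " + status);
        }
        con.disconnect();
    }
}

Aim for batches somewhere in the region of the 500,000 triples per request suggested above.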

Related

Converapi - Pdf Limitation and Load

We are thinking of using the ConverAPI component to handle PDF conversion in our application.
But we are still unclear about the limitations of PDF generation and load handling.
How much load will it support for PDF conversion? (e.g. if we send 100 requests at a time, will it work without crashing?)
What are the limits on the documents it can convert? (e.g. if I send a document around 800 MB to 1024 MB in size, will it be able to handle the PDF conversion?)
100 simultaneous file uploads is inefficient. Around 4 concurrent uploads is usually best (though it also depends heavily on the situation). If you are really planning to convert 100 x 1 GB files simultaneously, please consult support.
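As a language-agnostic illustration (not ConverAPI's actual client API), here is a rough Java sketch of how you might cap conversions at roughly four at a time with a fixed-size thread pool; convertOne() and the file names are hypothetical placeholders for whatever conversion call and inputs you actually use:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedConversions {
    // Hypothetical placeholder: call your real PDF conversion client here.
    static void convertOne(String path) {
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> documents = List.of("a.docx", "b.docx", "c.docx"); // hypothetical inputs
        ExecutorService pool = Executors.newFixedThreadPool(4); // at most ~4 conversions in flight
        for (String doc : documents) {
            pool.submit(() -> convertOne(doc));
        }
        pool.shutdown();                          // no new jobs
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the queue to drain
    }
}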
The hard limit is 1 GB per processed file. The rest depends on file complexity and the conversion being performed.
The best would be to register and try it for free with your files.

Use Julia to perform computations on a webpage

I was wondering if it is possible to use Julia to perform computations on a webpage in an automated way.
For example, suppose we have a 3x3 HTML form in which we input some numbers. These form a square matrix A, and we can find its eigenvalues in Julia quite easily. I would like to use Julia to perform the computation and then return the results.
In my understanding (which is limited in this direction) I guess the process should be something like:
collect the data entered in the form
send the data to a machine which has Julia installed
run the Julia code with the given data and store the result
send the result back to the webpage and show it.
Do you think something like this is possible? (I've seen some stuff using HttpServer which allows computation in the browser, but I'm not sure this is the right thing to use.) If yes, what are the things I need to look into? Do you have any examples of such implementations of web calculations?
If you are using or can use Node.js, you can use node-julia. It has some limitations, but should work fine for this.
Coincidentally, I was already mostly done with putting together an example that does this. A rough mockup is available here, which uses express to serve the pages and plotly to display results (among other node modules).
Another option would be to write the server itself in Julia using Mux.jl and skip server-side javascript entirely.
Yes, it can be done with HttpServer.jl
It's pretty simple - you make a small script that starts your HttpServer, which then listens on the designated port. Part of configuring the web server is defining handlers (functions) that are invoked when certain events take place in your app's life cycle (new request, error, etc.).
Here's a very simple official example:
https://github.com/JuliaWeb/HttpServer.jl/blob/master/examples/fibonacci.jl
However, things can get complex fast:
- you already need to perform two actions: (a) render the HTML page where you take the user input (by default), and (b) render the response page as a consequence of receiving a POST request
- you'll need to extract the data payload coming through the form. Data sent via GET is easy to reach; data sent via POST, not so much.
- if you expose this to users, you need to set up some failsafe measure to respawn your server script, otherwise it might just crash and exit.
- if you open your script to the world, you must make sure it's not vulnerable to attacks: you don't want to let an attacker execute arbitrary Julia code on your server or access your DB.
So for basic usage on a small case, yes, HttpServer.jl should be enough.
If however you expect a bigger project, you can give Genie a try (https://github.com/essenciary/Genie.jl). It's still a work in progress, but it handles most of the low-level work, allowing developers to focus on the specific app logic rather than on the transport layer (Genie's author here, btw).
If you get stuck there's GitHub issues and a Gitter channel.
Try Escher.jl.
This enables you to build up the web page in Julia.

Store map key/values in a persistent file

I will be creating a structure more or less of the form:
type FileState struct {
    LastModified int64
    Hash         string
    Path         string
}
I want to write these values to a file and read them in on subsequent calls. My initial plan is to read them into a map and lookup values (Hash and LastModified) using the key (Path). Is there a slick way of doing this in Go?
If not, what file format can you recommend? I have read about and experimented with some key/value file stores in previous projects, but not using Go. Right now, my requirements are probably fairly simple so a big database server system would be overkill. I just want something I can write to and read from quickly, easily, and portably (Windows, Mac, Linux). Because I have to deploy on multiple platforms I am trying to keep my non-Go dependencies to a minimum.
I've considered XML, CSV, JSON. I've briefly looked at the gob package in Go and noticed a BSON package on the Go package dashboard, but I'm not sure if those apply.
My primary goal here is to get up and running quickly, which means the least amount of code I need to write along with ease of deployment.
As long as your entire data set fits in memory, you shouldn't have a problem. Using an in-memory map and writing snapshots to disk regularly (e.g. by using the gob package) is a good idea. The Practical Go Programming talk by Andrew Gerrand uses this technique.
If you need to access those files with different programs, using a popular encoding like JSON or CSV is probably a good idea. If you just have to access those files from within Go, I would use the excellent gob package, which has a lot of nice features.
As soon as your data becomes bigger, it's not a good idea to always write the whole database to disk on every change. Also, your data might not fit into the RAM anymore. In that case, you might want to take a look at the leveldb key-value database package by Nigel Tao, another Go developer. It's currently under active development (but not yet usable), but it will also offer some advanced features like transactions and automatic compression. Also, the read/write throughput should be quite good because of the leveldb design.
There's an ordered, key-value persistence library for Go that I wrote called gkvlite -
https://github.com/steveyen/gkvlite
JSON is very simple but makes bigger files because of the repeated variable names. XML has no advantage. You should go with CSV, which is really simple too. Your program will be less than a page of code.
But it depends, in fact, upon your modifications. If you make a lot of modifications and must have them stored synchronously on disk, you may need something a little more complex than a single file. If your map is mainly read-only, or if you can afford to dump it to file rarely (not every second), a single CSV file alongside an in-memory map will keep things simple and efficient.
BTW, use Go's encoding/csv package to do this.

Drawbacks to having (potentially) thousands of directories in a server instead of a database?

I'm trying to start using plain text files to store data on a server, rather than storing them all in a big MySQL database. The problem is that I would likely be generating thousands of folders and hundreds of thousands of files (if I ever have to scale).
What are the problems with doing this? Does it get really slow? Is it about the same performance as using a Database?
What I mean:
Instead of having a database that stores a blog table, then has a row that contains "author", "message" and "date" I would instead have:
A folder for the specific post, then *.txt files inside that folder that have "author", "message" and "date" stored in them.
This would be immensely slower to read than a database (file writes all happen at about the same speed -- you can't store a write in memory).
Databases are optimized and meant to handle such large amounts of structured data. File systems are not. It would be a mistake to try to replicate a database with a file system. After all, you can index your database columns, but it's tough to index the file system without another tool.
Databases are built for rapid data access and retrieval. File systems are built for data storage. Use the right tool for the job. In this case, it's absolutely a database.
That being said, if you want to create HTML files for the posts and then store their locations in a DB so that you can easily get to them, then that's definitely a good solution (a la Movable Type).
But if you store these things on a file system, how can you find out your latest post? Most prolific author? Most controversial author? All of those things are trivial with a database, and very hard with a file system. Stick with the database, you'll be glad you did.
It really depends:
What is the file size?
What durability requirements do you have?
How many updates do you perform?
What file system are you using?
It is not obvious that MySQL would be faster:
I once did such a comparison for small objects, in order to use them as session storage for CppCMS, with one index (key only) and two indexes (primary key and secondary timeout).
File System                 XFS        ext3
--------------------------------------------
Writes/s                    322      20,000

Database \ Indexes      Key Only   Key+Timeout
----------------------------------------------
Berkeley DB               34,400         1,450
Sqlite No Sync             4,600         3,400
Sqlite Delayed Commit     20,800        11,700
As you can see, the simple ext3 file system was faster than, or as fast as, Sqlite3 for storing data, because it does not give you the (D)urability of ACID.
On the other hand... DB gives you many, many important features you probably need, so
I would not recommend using files as storage unless you really need it.
Remember, the DB is not always the bottleneck of the system.
Forget about long-winded answers, here's the simplest reasons why storing data in plaintext files is a bad idea:
It's near-impossible to query. How would you sort blog posts by date? You'd have to read all the files and compare their date, or maintain your own index file (basically, write your own database system.)
It's a nightmare to backup. tar cjf won't cut it, and if you try you may end up with an inconsistent snapshot.
There are probably a dozen other good reasons not to use files: it's hard to monitor performance, very hard to debug, nearly impossible to recover from errors, there are no tools to handle them, etc.
I think the key here is that there will be NO indexing on your data. So retrieving anything in, say, a search would be ridiculously slow compared to an indexed database. Also, IO operations are expensive, and a database could be (partially) in memory, which makes the data available much faster.
You don't really say why you won't use a database yourself... But in the scenario you are describing I would definitely use a DB over folders any day, for a couple of reasons. First of all, the blog scenario seems very simple, but it is very easy to imagine that you, someday, would like to expand it with more functionality such as search, more post details, categories, etc.
I think that growing the model would be harder to do in a folder structure than in a DB.
Also, databases are usually MUCH faster than file access due to indexing and memory caching.
IIRC Fudforum used file storage for speed reasons; it can be a lot faster to grab a file than to search a DB index, retrieve the data from the DB and send it to the user. You're trading the filesystem interface for the DB and DB-library interfaces.
However, that doesn't mean it will be faster or slower. I think you'll find writing is quicker on the filesystem, but reading is faster on the DB for general use. If, like Fudforum, you have relatively immutable data and you want to show several posts in one go, then a file-based approach may be a lot faster: e.g. they don't have to search for every related post; they stick it all in one text file and display it once. If you can employ that kind of optimisation, then your file-based approach will work.
Also, mail servers use the file-based approach too: the Maildir format stores each email message as a file in a directory, not in a database.
One thing I would say, though: you'll be better off storing everything in one file, not three. The filesystem is better at reading (and caching) a single file than it is with multiple ones. So if you want to store each message as three parts, save them all in a single file, read it to get any of the parts, and just display the one you want to show.
...and then you want to search all posts by an author and you get to read a million files instead of a simple SQL query...
Databases are NOT faster. Think about it: in the end they store the data in the filesystem as well. So the question of whether a database is faster depends strongly on the access path.
If you have only one access path, which correlates with your file structure, the file system might be way faster than a database. Just make sure you have some caching available for the filesystem.
Of course you do lose all the nice things of a database:
- transactions
- flexible ways to index data, and therefore access data in a flexible way reasonably fast.
- flexible (though ugly) query language
- high recoverability.
The scaling really depends on the filesystem used. AFAIK most file systems have some kind of upper limit on the number of files (in total or per directory), though on the newer ones this is often very high. For hundreds of thousands of files, with some directory structure to keep directories at a reasonable size, it should be possible to find a well-performing file system.
Regarding Eric's comment:
It depends on what you need. If you only need the content of exactly one file per query, and you can determine the location and name of the file in a deterministic way, then direct access is faster than what a database does, which is roughly:
access a bunch of index entries, in order to
access a bunch of table rows (an RDBMS typically reads blocks that contain multiple rows), in order to
pick a single row from the block.
If you look at it that way: you have indexes and additional rows in memory, which makes your caching less efficient, so where is the speedup of a DB supposed to come from?
Databases are great for the general case. But if you have a special case, there is almost always a special solution that is better in some sense.
If you would prefer to move away from an RDBMS, why don't you try one of the other open-source key-value or document DBs (non-relational DBs)?
From your posting I understand that you are not going to rely on the ACID properties of a relational DB. It would be better to adopt one of the other key-value DBs (MongoDB, CouchDB or Hypertable) instead of your own file system implementation; it will give better performance than the existing approaches.
Note: I am not an expert in this either; I just started working with MongoDB and found it useful in similar scenarios. I just wanted to share in case you are not aware of these approaches.

Application for graphing lots of web related data

I know this isn't programming related, but I hope for some feedback which helps me out of this misery.
We actually have lots of different data from our web applications, dating back years.
For example, we have:
Apache logfiles
Daily statistics files from our tracking software (CSV)
Another set of daily statistics from nation-wide advertisement rankings (CSV)
.. and I can probably produce new data from other sources, too.
Some of the data records started in 2005, some in 2006, etc. However, at some point in time we start to have data from all of them.
What I'm drea^H^H^H^Hsearching for is an application that helps me understand all the data, lets me load it, compare individual data sets and timelines (graphically), compare different data sets within the same time span, and lets me filter (especially the Apache logfiles); and of course all of this should be interactive.
Just the BZ2 compressed Apache logfiles are already 21GB in total, growing weekly.
I've had no real success with things like awstats, Nihu Web Log Analyzer or similar tools. They can only produce static information, but I need to interactively query the information, apply filters, overlay other data, etc.
I've also tried data mining tools in the hope they could help me, but didn't really succeed in using them (i.e. they're over my head), e.g. RapidMiner.
Just to be clear: it can be a commercial application. But I have yet to find something which is really useful.
Somehow I get the impression I'm searching for something which does not exist or I've the wrong approach. Any hints are very welcome.
Update:
In the end it was a mixture of the following things:
I wrote bash and PHP scripts to parse and manage parsing of the log files, including lots of filtering capabilities
I generated plain old CSV files to read into Excel. I'm lucky to be able to use Excel 2007, and its graphing capabilities, albeit still working on a fixed set of data, helped a lot
I used Amazon EC2 to run the script and send me the CSV via email. I had to crawl through around 200GB of data and thus used one of the large instances to parallelize the parsing. I had to execute numerous parsing attempts to get the data right; the overall processing duration was 45 minutes. I don't know what I could have done without Amazon EC2. It was worth every buck I paid for it.
Splunk is a product for this sort of thing.
I have not used it myself though.
http://www.splunk.com/
The open source data mining and web mining software RapidMiner can import both Apache web server log files as well as CSV files and it can also import and export Excel sheets. Rapid-I offers a lot of training courses for RapidMiner, some also on web mining and web usage mining.
In the interest of full disclosure, I've not used any commercial tools for what you're describing.
Have you looked at LogParser? It might be more manual than what you're looking for, but it will allow you to query many different structured formats.
As for the graphical aspect of it, there are some basic charting capabilities built in, but you're likely to get much more mileage by piping the LogParser output into a tabular/delimited format and loading it into Excel. From there you can chart/graph just about anything.
As for cross joining different data sources, you can always pump all the data into the database where you'll have a richer language for querying the data.
What you're looking for is a "data mining framework", i.e. something which will happily eat gigabytes of somewhat random data and then lets you slice'n'dice it in yet unknown ways to find the gold nuggets buried deep inside of the static.
Some links:
CloudBase: "CloudBase is a high-performance data warehouse system built on top of Map-Reduce architecture. It enables business analysts using ANSI SQL to directly query large-scale log files arising in web site, telecommunications or IT operations."
RapidMiner: "RapidMiner already is a full data mining and business intelligence engine which also covers many related aspects ranging from ETL (Extract, Transform & Load) over Analysis to Reporting."
