How to analyse large networks/graphs with a limited amount of RAM

For a current project, I need to analyse a large protein-protein interaction network given to me as a >300 MB .csv file.
The graph has well over 5 million edges and tens of thousands of nodes.
I already tried using Cytoscape and Gephi to load and analyse my data, but neither seems capable of handling a network of this size.
While Cytoscape crashes seconds after trying to load the file, Gephi manages to load about 50% of it before running out of memory (yes, I set -Xmx to the maximum).
For reference, my PC has 8 GB of RAM.
At this point I'm starting to ask myself: is it even possible to analyse networks of this size with common network-analysis software, or am I forced to write and tweak my own algorithms? Or is there other software you know of?
For now, I don't necessarily need graph visualisation, just simple centrality measures and the like.
I really hope my question isn't too unspecific.
Thanks in advance!

Both Cytoscape.js and Cytoscape desktop (Java) support headless mode.
You can use Cytoscape.js directly in Node.js, with direct access to its API. Just require('cytoscape') and you're good to go. Using Cytoscape.js headlessly in Node.js is much less expensive w.r.t. CPU usage and RAM, as compared to visualising. Cytoscape.js supports lots of types of centrality calculations. It's just one API call to calculate the values, so it would be easy to try out even just in the Node.js REPL (and you could write out a JSON file).
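As a rough illustration (assuming a current Cytoscape.js release installed as the cytoscape npm package; the element data below is made up, since the real edges would first be parsed from the CSV), a headless centrality calculation looks roughly like this:

    // Minimal sketch: headless Cytoscape.js in Node.js, one API call per node
    // for degree centrality. The tiny element list is purely illustrative.
    const cytoscape = require('cytoscape');

    const cy = cytoscape({
      headless: true,
      elements: [
        { data: { id: 'p1' } },
        { data: { id: 'p2' } },
        { data: { id: 'p3' } },
        { data: { id: 'e1', source: 'p1', target: 'p2' } },
        { data: { id: 'e2', source: 'p2', target: 'p3' } }
      ]
    });

    cy.nodes().forEach(node => {
      const { degree } = cy.elements().degreeCentrality({ root: node });
      console.log(node.id(), degree);
    });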
You can communicate with Cytoscape desktop headlessly via CyREST, i.e. HTTP/REST requests. This means you can do your analysis in any language, but everything you do will be async and will require constant serialisation and deserialisation. Alternatively, I think you could write an app for Cytoscape desktop, as long as it's all headless.
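For the CyREST route, a bare-bones request from Node.js might look like the sketch below; port 1234 and the /v1/networks path are CyREST's usual defaults, but check the CyREST documentation for your Cytoscape version.

    // Hedged sketch: ask a running Cytoscape desktop (with CyREST enabled)
    // for the SUIDs of the networks it currently has loaded.
    const http = require('http');

    http.get('http://localhost:1234/v1/networks', res => {
      let body = '';
      res.on('data', chunk => (body += chunk));
      res.on('end', () => console.log('Loaded network SUIDs:', JSON.parse(body)));
    });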

Related

Why is it hard to implement fast network transfer of small files?

I use various kinds of network mounts (like Samba/Windows shares, sshfs, scp) on different networks (LAN, dial-up). Whenever it comes to transferring a large number of small files, I see poor performance, far from what would theoretically be possible. No resource appears to be really busy, so it seems to be a question about the software behind it (which is why I'm hopefully not off-topic here).
What is the problem behind this from a software developer's perspective? Why do those tools not saturate any component of my system or the network?
Is it just because the Linux kernel makes some things complicated, or is there more to it?

Freebase - how to use the freebase-rdf-latest?

I downloaded the freebase-rdf-latest dump from freebase.com. I uncompressed it, and now I have a file of 380.7 GB.
How can I read that data? Which program do you recommend?
Thanks for your help!
I'll disagree with @Nandana and say that you definitely should not load it into a triple store for most uses. There's a ton of redundancy in it and, even without the redundancy, you're usually only interested in a small portion of it.
Also, for most applications, you probably want to leave the file compressed. You can probably decompress it quicker than you can read the uncompressed version from the file system. If you need to split it for processing in a MapReduce environment, the file is (or at least used to be) a series of concatenated compressed files which can be split apart without having to decompress them.
Nandana has a good suggestion about considering derivative data products. The tradeoff to consider is how often they are updated and how transparent their filtering/extraction pipeline is.
For simple tasks, you can get pretty far with the very latest data using zgrep, cut, and associated Unix command line tools.
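If you would rather stay in one tool, the same kind of grep-and-cut pass can be done from Node.js by streaming the compressed dump. This is only a sketch: the predicate string and tab-separated N-Triples layout are assumptions to verify against the actual dump, and a recent Node version is assumed so that gunzip handles the concatenated gzip members.

    // Hedged sketch: stream freebase-rdf-latest.gz without fully unpacking it,
    // printing the object column of triples whose predicate matches a pattern.
    const fs = require('fs');
    const zlib = require('zlib');
    const readline = require('readline');

    const rl = readline.createInterface({
      input: fs.createReadStream('freebase-rdf-latest.gz').pipe(zlib.createGunzip())
    });

    rl.on('line', line => {
      if (line.includes('type.object.name')) {   // illustrative predicate filter
        const columns = line.split('\t');        // subject, predicate, object, "."
        console.log(columns[2]);
      }
    });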
You have to load the data into a triple store such as Virtuoso. You can take a look at how to load the data in the following references:
Virtuoso Freebase Setup
Load Freebase Dump into Virtuoso
Bulk Loading RDF Source Files into one or more Graph IRIs
Loading freebase to Jena Fuseki
However, you might be interested in other projects that provide a cleaned version of Freebase pre-loaded into a triple store:
SindiceTech Freebase distribution: Freebase data is available for full download but, as of today, using it "as a whole" is all but simple. The SindiceTech Freebase distribution solves that by providing all the Freebase knowledge preloaded in an RDF-specific database (also called a triplestore) and equipped with a set of tools that make it much easier to compose queries and understand the data as a whole.

:BaseKB: :BaseKB is an RDF knowledge base derived from Freebase, a major source of the Google Knowledge Graph. :BaseKB contains about half as many facts as the Freebase dump because it removes trivial, ill-formed and repetitive facts that make processing difficult. The most recent version of :BaseKB Gold can be downloaded via BitTorrent, or, if you wish to run SPARQL queries against it, you can run it in the AWS cloud, pre-loaded into OpenLink Virtuoso 7.

rsync vs SyncML (Funambol)

I would like some idea of how rsync compares to SyncML/Funambol, especially when it comes to bandwidth, syncing over an unstable network, and multiple clients to one server.
This is to sync several mobile devices with a directory structure of growing text files. (So we essentially want as much as possible on the server; inconsistent files are not really a problem, and we know where changes originate.)
So far, it seems Funambol doesn't compress, doesn't handle partial updates, and makes it difficult to handle interruptions in a file transfer.
I know rsync doesn't go through the server, but I don't quite see how that is a disadvantage.
Olav,
rsync can:
Compress the data (as you said), thus gaining better performance over the network.
Transfer only the changed parts of each file (delta transfer), thus, once again, saving time.
Be run by multiple users at the same time; that's very basic backup software behavior.
And one of my favorites: work over a secure shell.
You might want to check Rsyncrypto, for compressing and encrypting at the same time.
Dotan

What's the best strategy for large amounts of audio files in mobile application?

I have an S60 app which I am writing in Qt and which has a large number (several thousand) of small audio files attached to it, each containing a clip of a spoken word. My app doesn't demand high-fidelity reproduction of the sounds, but even at the lowest bit rate and in mono MP3 they average 6 KB each. That means my app may have a footprint of up to 10 MB. The nature of the app is such that only a few audio clips will be needed at any one time: maybe as many as 50, but more like 1-10.
So my question has two parts:
1) Is 10 MB for a mobile app too large?
2) What is a reasonable alternative to shipping all audio files at install time?
Thanks
Have you considered rolling all clips into a single file and then seeking within the stream? I'm not sure how much per-file overhead MP3 has, but it might help.
That said, every S60 mobile phone out there should have 1 GB of storage or more, so 10 MB doesn't sound like "too much". But you should deliver the app as a JAR file which people can download from your website with a PC and then install by cable. Downloading large amounts of data on the phone itself is pretty expensive in many parts of the world (Switzerland, for example).
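To make the single-file idea above concrete, here is a rough sketch (in Node.js rather than Qt/Symbian C++, with made-up file names) of packing the clips into one blob with a byte-offset index, then reading back a single clip on demand:

    // Hedged sketch: concatenate clips into one packed file plus a JSON index
    // of byte offsets, so playback only ever reads the slice it needs.
    const fs = require('fs');

    // Packing step (run once, e.g. at build time).
    function pack(clipPaths, packedFile, indexFile) {
      const index = {};
      let offset = 0;
      const out = fs.createWriteStream(packedFile);
      for (const path of clipPaths) {
        const data = fs.readFileSync(path);
        out.write(data);
        index[path] = { offset, length: data.length };
        offset += data.length;
      }
      out.end();
      fs.writeFileSync(indexFile, JSON.stringify(index));
    }

    // Playback step: read just one clip's bytes from the packed file.
    function readClip(packedFile, entry) {
      const buf = Buffer.alloc(entry.length);
      const fd = fs.openSync(packedFile, 'r');
      fs.readSync(fd, buf, 0, entry.length, entry.offset);
      fs.closeSync(fd);
      return buf;
    }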
In terms of persistent storage, 10 MB isn't a lot for modern mobile devices, so, once downloaded, storing your application data on the device shouldn't be a problem.
The other type of footprint you may want to consider, however, is memory usage. Ideally, you'd have clips buffered in RAM before starting playback, to minimise latency. Given that most Symbian devices impose a default per-process heap-size limit of 1 MB, you can't hold all clips in memory, so your app would need to manage the loading and clearing of a cache (see the sketch after this answer).
It isn't generally possible to buffer multiple compressed clips at a time on Symbian, however, since buffering a clip typically requires use of a scarce resource (namely the audio co-processor). Opening a new clip while another is already open will typically cause the first to be closed, meaning that you can only buffer one in memory at a time.
If you do need to reduce latency, your app will therefore need to take care of loading and decompressing where necessary, to produce PCM which you can then feed to the audio stack.
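A platform-agnostic sketch of that cache-management idea (in Node.js purely for illustration; a real S60 app would implement the same policy in Qt/Symbian C++, and the clip-loading function here is a placeholder):

    // Hedged sketch: keep at most maxClips decoded clips in memory, evicting
    // the least recently used one whenever the budget is exceeded.
    class ClipCache {
      constructor(maxClips, loadClip) {
        this.maxClips = maxClips;
        this.loadClip = loadClip;   // placeholder: reads/decodes one clip from storage
        this.clips = new Map();     // Map insertion order doubles as recency order
      }

      get(id) {
        if (this.clips.has(id)) {
          const clip = this.clips.get(id);
          this.clips.delete(id);    // re-insert to mark as most recently used
          this.clips.set(id, clip);
          return clip;
        }
        const clip = this.loadClip(id);
        this.clips.set(id, clip);
        if (this.clips.size > this.maxClips) {
          const oldest = this.clips.keys().next().value;
          this.clips.delete(oldest);  // evict the least recently used clip
        }
        return clip;
      }
    }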
10 MB is definitely on the large side. Most apps are under 1 MB, but I think I've seen some large ones (6, 10, even 15 MB), like dictionaries.
Most S60 phones have in-phone storage of around 100 MB, but they also have memory cards, and these are usually 128 MB+; 4 GB is not uncommon for higher-end phones. You need to check the specs for your target phones!
Having such a large install pack will make installing over the air prohibitive. Try to merge the files so that you only have a few large files instead of many small ones, or the install will take too long.
An alternative would be to ship the most used sounds and download the rest as needed. S60 has security checks and you will need to give the app special permissions when you sign it.
Have you thought about separating the thousands of audio files into batches of, say, 20?
You can include a few batches into the application installation file and let the user download one (or more) batch at a time from your application GUI, as and when needed...
Store the sound files in a SQLite database, and access them only upon demand. Sounds like you are writing a speaking dictionary. Keep the app itself as small as possible. This will make the app load very fast by comparison. As the database market has matured, it seems a developer only needs to know about two database engines anymore: SQLite, for maximum-performance desktop and handheld applications, and MySQL for huge multi-user databases. Do not load all those sounds on startup unless it is critical. My favorite speaking dictionary application is still the creaking Microsoft Bookshelf '96; no kidding.
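As a rough illustration of the on-demand SQLite approach (Node.js with the better-sqlite3 package purely for illustration; the table layout and file names are made up, and a Qt app would use its own SQLite bindings):

    // Hedged sketch: one database file holds every clip as a BLOB; the app
    // pulls a single clip when it is needed instead of loading them all.
    const Database = require('better-sqlite3');
    const db = new Database('clips.db');

    db.exec('CREATE TABLE IF NOT EXISTS clip (word TEXT PRIMARY KEY, audio BLOB)');

    // Look up one clip on demand; returns a Buffer with the MP3 bytes, or undefined.
    const getClip = db.prepare('SELECT audio FROM clip WHERE word = ?');
    const row = getClip.get('hello');
    const mp3Bytes = row ? row.audio : undefined;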
10 MB for a mobile app is not too large, provided you convince the user that the content they are going to pull over the air is worth the data charges they will incur.
Symbian as a platform can work well for this app, as the actual audio files will be delivered from within the SIS file but the binary will not contain them, and hence they will not cause memory problems...
The best option would be to offer the media files for download via your website so that the user can download and sync them via PC Suite / mass-storage transfer. Allow the user to download the files into e:\Others or some publicly available folder, and offer to read the media from there...
My 2 cents...

Developing a distributed system as a grid

Has anyone had experience with developing a distributed system as a grid?
By grid, I mean, a distributed system where all nodes are identical and there is no central management, database etc.
How can the grid achieve an even distribution of CPU, memory, disk, bandwidth, etc.?
Something akin to Plan 9, perhaps? (See the Wikipedia entry.)
What you're actually talking about is a cluster. There is a lot of software available for load balancing etc., including specific Linux distros such as Rocks, which come complete with MPI/PVM and monitoring tools built in.
