File Storage in clusters - MPI

I am working on a hybrid programming exercise using OpenMP and MPI that has to be tested on a cluster. The goal is basically to make an index of relevant words after processing a large set of files. What I am confused about is whether these files have to be sent to the respective nodes from the root, or whether they should already be present on the respective nodes. Thanks.
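To make the two alternatives concrete, here is a minimal sketch, assuming mpi4py and hypothetical file names: rank 0 hands out only file names, and each rank opens its own files, so this works whether the files live on a filesystem shared by all nodes or have been staged onto every node beforehand.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Hypothetical file list; in practice rank 0 might scan a directory.
if rank == 0:
    all_files = [f"corpus/doc_{i:04d}.txt" for i in range(1000)]
    chunks = [all_files[i::size] for i in range(size)]
else:
    chunks = None

# Rank 0 scatters only the file *names*; each rank then opens its own files,
# so the data itself must be reachable from every node (shared FS or pre-staged).
my_files = comm.scatter(chunks, root=0)

local_index = {}
for path in my_files:
    with open(path) as fh:
        for word in fh.read().split():
            local_index.setdefault(word, set()).add(path)

# Gather the per-rank partial indexes back on the root for merging.
all_indexes = comm.gather(local_index, root=0)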

Related

How can I update all UIDs of a DICOM study while maintaining its structure?

I have a DICOM study with 3 series and want to refresh its UIDs (StudyInstanceUID, SeriesInstanceUID, SOPInstanceUID) to do some tests. All the data is in a single directory, so it's not possible to tell from the layout which DICOM file belongs to which series.
What I have tried is using dcmodify (dcmtk) with some generate options:
dcmodify mydirectory/*.dcm -gst -gse -gin
but it turns every single file into a separate study, so the structure is broken.
Is there a way to do this, or do I have to use other dcmtk tools to identify the series UID of every single file?
-gst, -gse and -gin
create a new Study, Series and SOP Instance UID for each individual image matching mydirectory/*.dcm, hence destroying the study/series structure, as you already observed.
The answer is two-fold:
To assign the same UID to all images, use
-m (0020,000D)=...
instead (this example is for the Study Instance UID).
But there is no command line tool in DCMTK that I am aware of which would completely solve your problem. storescp has an option to create subdirectories for each study (e.g. --sort-on-study-uid), but that does not solve the series-level problem.
Working with DCMTK alone, I think you need to do some scripting around it: use dcmdump to dump the files to text, extract the Study and Series Instance UIDs, and then move each file to an appropriate Study+Series folder.
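As a concrete version of that scripting workaround, here is a sketch that uses pydicom instead of dcmdump (pydicom is my assumption, not part of DCMTK): map each existing Study/Series Instance UID to exactly one freshly generated UID, so files that were grouped together stay grouped after the rewrite.

import glob
import pydicom
from pydicom.uid import generate_uid

# Sketch only: regenerate UIDs while preserving the study/series grouping.
# Assumes a flat input directory as in the question.
study_map, series_map = {}, {}

for path in glob.glob("mydirectory/*.dcm"):
    ds = pydicom.dcmread(path)

    # Map each *old* UID to one *new* UID, so files that shared a
    # Study/Series UID before still share one afterwards.
    study_map.setdefault(ds.StudyInstanceUID, generate_uid())
    series_map.setdefault(ds.SeriesInstanceUID, generate_uid())

    ds.StudyInstanceUID = study_map[ds.StudyInstanceUID]
    ds.SeriesInstanceUID = series_map[ds.SeriesInstanceUID]
    ds.SOPInstanceUID = generate_uid()          # SOP Instance UID is per image
    ds.file_meta.MediaStorageSOPInstanceUID = ds.SOPInstanceUID

    ds.save_as(path)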

Insert multiple xml data files into tinkerpop graph database via gremlin server (with gremlin code)

I have a huge dataset that must be inserted into a graph database via gremlin (gremlin server). Because the XML file is too big (over 8 GB in size), I decided to split it into nine manageable XML files (about 1 GB each). My question is, is there a way to insert each of these data files into my TinkerPop graph database via gremlin server, i.e. by trying something like this? Or, what is the best way to insert this data?
graph.io(IoCore.graphml()).readGraph("data01.xml")
graph.io(IoCore.graphml()).readGraph("data02.xml")
graph.io(IoCore.graphml()).readGraph("data03.xml")
graph.io(IoCore.graphml()).readGraph("data04.xml")
graph.io(IoCore.graphml()).readGraph("data05.xml")
That's a large GraphML file. I'm not sure I've ever come across one so large. I'd wonder how you went about splitting it, since GraphML files aren't easily split: they are XML based, have a header, and have a structure where vertices and edges are in separate nodes. It was for these reasons (and others) that TinkerPop developed formats like Gryo and GraphSON, which are easily split for processing in Hadoop-like file structures.
That said, assuming you split your GraphML files correctly so that each file is a complete subgraph, I suppose you would be able to load them the way you are suggesting; however, I'd be concerned about how much memory you'd need to do this. The io() loader is not meant for bulk parallel loading and basically holds an in-memory cache of vertices to speed loading. That in-memory cache is essentially just a HashMap that doesn't expire its contents. Therefore, while the loading is occurring, you would need to be able to hold all of the Vertex instances for a particular file in memory.
I don't know what your requirements are or how you ended up with such a large GraphML file, but for a graph of this size I would look at either the provider-specific bulk loading tools of the graph you are using or some custom method that uses spark-gremlin or a Gremlin script of some sort to parallel-load your data.
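If you do go the sequential route anyway, one possible way to drive it from Python is to submit each readGraph call through the gremlinpython driver. This is my assumption, not part of the answer: it requires a Gremlin Server that allows Groovy script evaluation and binds the graph instance to the name graph, and the file names simply follow the pattern in the question.

from gremlin_python.driver import client

# Assumes a Gremlin Server at localhost:8182 configured to evaluate Groovy
# scripts and to expose the graph instance as 'graph'.
c = client.Client('ws://localhost:8182/gremlin', 'g')

files = [f"data{i:02d}.xml" for i in range(1, 10)]   # the nine split files (names assumed)
for f in files:
    # Load one file at a time; each call still builds an in-memory vertex
    # cache on the server, so watch heap usage per file.
    c.submit(f"graph.io(IoCore.graphml()).readGraph('{f}')").all().result()

c.close()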

Adding arbitrary information to a recorder

I'm running an optimization in OpenMDAO. One of the components in the model writes a few files to a directory which is given a random name. I track the progress of the optimization using a SqliteRecorder. I would like to be able to correlate iterations in the sqlite database to the directories of each evaluation.
Is there a way to attach arbitrary information to a recorder - in this case, the directory name?
I suggest that you add a string-typed output to the component and set it to the folder name. Then the recorder will capture it.
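A minimal sketch of that suggestion, with hypothetical component and variable names, assuming a recent OpenMDAO version where non-numeric values are declared as discrete outputs (which the SqliteRecorder should then pick up alongside each iteration):

import tempfile
import openmdao.api as om

class WritesFiles(om.ExplicitComponent):
    def setup(self):
        self.add_input('x', val=0.0)
        self.add_output('y', val=0.0)
        # String-valued output holding the work directory name; in recent
        # OpenMDAO versions non-numeric values go through discrete outputs.
        self.add_discrete_output('run_dir', val='')

    def compute(self, inputs, outputs, discrete_inputs, discrete_outputs):
        run_dir = tempfile.mkdtemp(prefix='eval_')   # stand-in for the random directory
        # ... write the component's files into run_dir here ...
        outputs['y'] = inputs['x'] ** 2
        discrete_outputs['run_dir'] = run_dir         # recorded with the iteration

prob = om.Problem()
prob.model.add_subsystem('comp', WritesFiles(), promotes=['*'])
prob.model.add_recorder(om.SqliteRecorder('cases.sql'))
prob.setup()
prob.run_model()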

Routing Engine Using OpenStreetMap Data

As part of my academic project, I have to build a routing engine based on data supplied from OSM. I have looked at the data model of OSM and I'm all fine with it. However, I'm having trouble converting an OSM XML file into a graph structure (nodes and edges) that I can use to apply search algorithms (Dijkstra, A* etc.) on. I would like the graph to be stored in memory to allow fast read/write.
So can anyone shed light on how this can be done, suggest techniques, or even provide pointers for further research?
Please note that I'm not allowed to re-use existing routing engines as this would defeat the purpose of doing the project.
All you need to do is:
create a node for every <node> item
every <way> entry is a sequenced list of <nd> items, each of which is a backreference to a node. So for each <way>, you iterate pairwise through its <nd>s and create an arc between the two nodes referenced.
You can do this in one pass using a streaming XML parser, since the XML data defines all the nodes before the ways.
The data doesn't intrinsically include distances, so you need to calculate them from the lat/lon of each node. You should also take account of the road type (highway=*) and the access info (access=*) in your routing, and you probably also want to ignore ways that are not traversable (e.g. waterway=stream), but that's all down to your specific situation; a sketch of the basic approach follows below.
http://wiki.openstreetmap.org/wiki/Elements
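A minimal one-pass sketch of that approach in Python. The streaming parser, the haversine helper and the highway-only filter are my assumptions, not part of the answer, and it ignores oneway and access restrictions.

import math
import xml.etree.ElementTree as ET
from collections import defaultdict

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in metres between two lat/lon points.
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def build_graph(osm_file):
    coords = {}               # node id -> (lat, lon)
    adj = defaultdict(list)   # node id -> [(neighbour id, distance), ...]
    # Nodes appear before ways in OSM XML, so one streaming pass is enough.
    for _, elem in ET.iterparse(osm_file, events=('end',)):
        if elem.tag == 'node':
            coords[elem.get('id')] = (float(elem.get('lat')), float(elem.get('lon')))
            elem.clear()      # free the element once its coordinates are stored
        elif elem.tag == 'way':
            tags = {t.get('k'): t.get('v') for t in elem.findall('tag')}
            if 'highway' in tags:   # keep only routable ways; refine per your rules
                refs = [nd.get('ref') for nd in elem.findall('nd')]
                for a, b in zip(refs, refs[1:]):
                    d = haversine(*coords[a], *coords[b])
                    adj[a].append((b, d))
                    adj[b].append((a, d))   # treats every way as bidirectional
            elem.clear()
    return adj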

Does Riak (open source) support some form of multi-site replication?

The site is unclear on this one, only saying
Masterless multi-site replication
Does this imply that there is some master-master or master-slave system to replicate to another site?
What are the other options to back up a single-server or multi-server Riak DB to another site?
We only provide multi-site replication in the enterprise product. It is a separate feature not present in the open source code. As the description notes, it is not a master-slave system; this allows for nodes to be down at either end.
Riak is partition tolerant because it's eventually consistent (AP in the CAP theorem); however, just having nodes in two data centers doesn't give you all the benefits of full replication. You may not have any copies of a particular piece of data in one data center just because you have nodes there. If a data center went down or there was a routing issue on the net, the data would eventually become consistent when it became available again, but during the outage the full set of data would not be in both places.
For example, the default bucket property for r (read quorum) is n_val/2 + 1 - this means that if you are configured for 3 replicas (n_val), at least 2 nodes have to respond. So even if the data center that was still up had a node with a copy of a piece of data, the read wouldn't be considered valid, because the other two replicas were in the data center that was down.
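To spell out that arithmetic (plain Python for illustration only, not Riak configuration code):

def default_read_quorum(n_val):
    # Riak's default r is "quorum": a majority of the n_val replicas must respond.
    return n_val // 2 + 1

assert default_read_quorum(3) == 2   # with 3 replicas, 2 nodes must answer a read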
For information on backing up a Riak cluster see: http://wiki.basho.com/Backups.html
If you have specific questions, please feel free to contact us on the riak-users mailing list:
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Masterless means exactly that. There is no one master node and therefore no slave nodes in the system.
Riak partitions your data among whatever servers (Basho people call them nodes) you give it and then it replicates, by default, each node's data to 2 other nodes. In essence, if your nodes are in separate data centres, then your data is replicated to multiple sites automatically.
There are a few extra details I've left out, like virtual nodes, and I'm willing to expand on those if you need it. The gist of my answer, though, is that servers in multiple data centres added to the system and managed by Riak will give you multi-site replication.