Spark RDD lineage graph representation - graph

I would like to know if there is a way to convert the information provided by
the Spark API function RDD.toDebugString() into a more structured format, so that it can be used to automatically generate a graphical representation, for example with Graphviz.
It seems that there is some activity around this going on:
https://issues.apache.org/jira/browse/SPARK-1015
But I would like to get the info from toDebugString() into a structured format first,
and decide later which graph format to use for the representation.

toDebugString() internally iterates over the recursive structure of an RDD and builds a displayable string.
Instead of trying to make toDebugString() return more structured output, read its inner implementation (which does operate on structured data) and adapt it to save the data in whatever format suits you.
You don't have to wait for any JIRA issue, just DIY :)
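Once you have the lineage in some structured form, emitting a Graphviz DOT file is the easy part. The sketch below is only an approximation: it scrapes the "SomeRDD[id]" tokens out of the toDebugString() output with a regular expression and links consecutive entries, so branches introduced by shuffles are flattened; for an exact graph you would walk rdd.dependencies inside Spark instead.

import re

RDD_TOKEN = re.compile(r"(\w+)\[(\d+)\]")  # matches e.g. "MapPartitionsRDD[4]"

def debug_string_to_dot(debug_string):
    nodes = {}   # rdd id -> label
    edges = []   # (parent id, child id); later lines in the output are the parents
    previous_id = None
    for line in debug_string.splitlines():
        match = RDD_TOKEN.search(line)
        if match is None:
            continue
        name, rdd_id = match.group(1), int(match.group(2))
        nodes[rdd_id] = "%s[%d]" % (name, rdd_id)
        if previous_id is not None:
            edges.append((rdd_id, previous_id))
        previous_id = rdd_id
    out = ["digraph lineage {"]
    out += ['  n%d [label="%s"];' % (i, label) for i, label in nodes.items()]
    out += ["  n%d -> n%d;" % (parent, child) for parent, child in edges]
    out.append("}")
    return "\n".join(out)

# usage: print(debug_string_to_dot(rdd.toDebugString()))
# (decode() the result of toDebugString() first if your PySpark version returns bytes)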

A more detailed and formatted visual representation can also be seen in the Spark UI, which runs on port 4040 by default.
[Screenshot of the Spark UI showing the lineage details]

Related

Ada dependency graph

I need to create a dependency graph for a software suite that I am working on. In the past the company I work for has always done this manually, but I am guessing that there is a tool somewhere that will do what we need.
The software I am working with is Ada95, and has about 200 code modules/files, with about 40 packages. I need to create a map that will trace every output, individually, back to each input or constant that will have an impact on the output. Does anybody know of a tool that would accomplish this? Or even just partially accomplish it?
AdaCore's GPS (available from http://libre.adacore.com) comes with a command line tool named gnatinspect. You can use this tool to load all the cross-reference information generated by the compiler (assuming you are compiling with GNAT). This creates a sqlite database (gnatinspect.db) which contains all the information you need. gnatinspect itself provides a number of pre-made queries that might get you at least partially to where you want to go.
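If you end up writing your own queries against that database, a quick way to see what it actually contains is to open it with Python's sqlite3 module and list its tables; the path to gnatinspect.db is an assumption and depends on where you ran the tool.

import sqlite3

# Open the database gnatinspect produced and list its tables; the path is
# an assumption, adjust it to wherever the file was generated.
conn = sqlite3.connect("gnatinspect.db")
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"):
    print(name)
conn.close()

From there you can inspect each table's columns with "PRAGMA table_info(<table>)" and work out the queries for the input-to-output trace you need.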
You could also look at ASIS as a way to run this kind of query directly on the code. I am told it is not so easy to use the first time around, though.
There is also an older tool provided with GNAT (gnatxref) which does something similar, although it is being superseded by gnatinspect.
Finally, you could look at gnat2xml as an alternative to ASIS if you are more comfortable parsing XML files.

Convert database data into graph with tinkerpop tools

I am looking for graph software that will create a graph from a database automatically. Exploring the TinkerPop documentation, the provided tutorials discuss querying ready-made graphs, but there is not much about creating graphs from a database. Is it possible to use any of the tools in the TinkerPop suite to automatically convert data from a database into a graph ready for querying?
Let's say we have an event stream like this:
event_type=create_file name="filename.txt" handle=1
event_type=read handle=1 data="file content"
event_type=write handle=1 data="new file content"
event_type=close handle=1
Is there a way to convert the event stream into a graph automatically by specifying which properties to follow for creating edges? For example, by selecting the "handle" property I should get:
create_file-->read-->write-->close
All the examples I could find teach me how to do some activity like
add_node create_file
add_node read
add_node write
add_node close
followed by adding all desired edges manually.
Thank you for your help.
I just came across http://neo4j-contrib.github.io/neo4j-etl-components/ which is a very interesting tool. It takes an RDBMS schema and generates a graph representation, turning foreign keys and join tables into relationships, and (other) tables into nodes in the graph. And then it generates CSV files to load into a graph database, either for wholesale or incremental updates.
I don't know if there's an equivalent tool for TinkerPop, but since much of the work (reading the SQL schema, mapping tables and foreign keys onto vertices and edges) is already done in this open-source project, it might be a good starting point.
The tool's output looks like it depends on having a clean source data model and may be somewhat naive, but it appears to be configurable, so when it guesses wrongly about which tables should be vertices and which should be edges you can override it.
What you are suggesting is simply not possible. A graph database is very different from a traditional relational database. The biggest difference is that graphs are naturally unstructured, which allows for more flexibility and manipulation. Traditional relational or tabular databases, on the other hand, are more rigidly structured, which provides less flexibility but easier control and querying.
As stated by the answer provided here, you also should not be using your original database as a frame of reference. You should instead be thinking about how to manipulate your data into a graph so as to take advantage of what graphs offer.
For example, a traversal in a graph, as opposed to a query in a tabular DB, is a lot more flexible (and arguably more powerful) but harder to construct and formalise.
There is a lot of good material providing guidelines for how to approach this problem [1] [2] [3] [4]. Unfortunately, though, there is no good automated migration tool at the moment.
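That said, the specific transformation in the question (pick a property such as "handle" and chain the events that share it) is small enough to script yourself. Below is a minimal sketch using gremlinpython against a running Gremlin Server; the server URL, the edge label and the in-memory event list are assumptions, not something TinkerPop provides out of the box.

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# The event stream from the question, held in memory for the example.
events = [
    {"event_type": "create_file", "name": "filename.txt", "handle": 1},
    {"event_type": "read", "handle": 1, "data": "file content"},
    {"event_type": "write", "handle": 1, "data": "new file content"},
    {"event_type": "close", "handle": 1},
]

connection = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(connection)

last_vertex_by_handle = {}  # remembers the previous event vertex for each handle
for event in events:
    v = g.addV(event["event_type"]).property("handle", event["handle"]).next()
    previous = last_vertex_by_handle.get(event["handle"])
    if previous is not None:
        # link consecutive events sharing a handle: create_file->read->write->close
        g.addE("next").from_(previous).to(v).iterate()
    last_vertex_by_handle[event["handle"]] = v

connection.close()

Generalising this to "follow any chosen property" is just a matter of parameterising the key used for the lookup dictionary.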

SSIS asynchronous component source code?

I'm building a custom SSIS data flow component that, as part of its process, takes multiple inputs and rolls them into a single output (along the lines of what the UNION ALL component does). I'm having a little bit of trouble implementing the runtime methods and have been looking for source code for asynchronous transformations that use multiple inputs, which I could use as a model for my own code.
Does anyone know of available source code or have their own code they'd be willing to share with me to help get me on the right track? Much appreciated!
Off the top of my head, I'd look at the Programming Samples from MS's IS team as well as the Community Transform. Some of those are open source so if you can find one that's async, you should be able to see what you're missing.
My half-remembered logic is that you'll need to look at all the InputCollections for a given IDTSComponentMetaData100 and then work with that data.

Accessing CoreData tables from fmdb

I'm using CoreData in my application for DML statements and everything is fine with it.
However, I don't want to use NSFetchedResultsController for simple queries like getting the count of rows, etc.
I've decided to use fmdb, but I don't know the actual table names to write SQL against. Entity and table names don't match.
I've even looked inside the .sqlite file with TextEdit, but no hope :)
FMResultSet *rs = [db getSchema] doesn't return any rows
Maybe there's a better solution to my problem?
Thanks in advance
Core Data prefixes all its SQL names with Z (an entity named Employee becomes a table named ZEMPLOYEE, for example). Use the sqlite3 command line tool to inspect your persistent store file and see which names it uses.
However, this is a very complicated and fragile solution. The Core Data schema is undocumented and can change without warning, because Core Data does not support direct SQL access. You are likely to make errors accessing the store file directly, and your solution may break at random when the API is next updated.
The Core Data API provides the functionality you are seeking. Just use a fetch request that fetches a specific value, with an NSExpressionDescription to perform a function. This lets you get information like counts, minimums, maximums, etc. You can create and use such fetches independently of an NSFetchedResultsController.
The Core Data API is very feature rich. If you find yourself looking outside the API for a data solution, chances are you've missed something in the API.

Convert nested dictionary/xml to flat file for sqlite

I've scoured the net and cannot seem to find an appropriate example so I thought I'd ask...
(Btw, much of this is new to me- not all, just most.)
Problem: trying to convert a Biopython nested dictionary (or XML) of PubMed citation data into a flat (normalized) structure, e.g. SQLite. The citation data was fetched from PubMed using Biopython and parsed into a dictionary, but it can also be retrieved as XML if needed.
Not all citations will have all fields/keys, and not all fields/keys will have the same number of items (authors, mesh terms, refs, etc.), and I understand that this is part of the normalization process.
This is about where my practical understanding ends.
That said, I think the process should go something like this: first remove/normalize all unique fields (those that occur once per paper, e.g. title, abstract, date, citation, etc., but not, say, affiliation, since that would be linked to the first author). Papers with no abstract could be filled in as null?
Then move on to, say, authors and create a separate table, again using PMID as the foreign key, and then do the same for the various other fields/keys/items in separate tables, e.g. mesh headings, EC numbers, refs, etc.
Is there a way to do this that removes (pops?) keys/items from the master dictionary so that I can visually see what's been done/needs to be done (obviously leaving the PMID)?
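(For illustration, that flow might look roughly like the sketch below; the key names "PMID", "Title", "Abstract" and "Authors" are placeholders rather than the exact keys Biopython returns, and popping keys as you go leaves the leftovers visible in the dictionary.)

import sqlite3

def normalize(citations):
    # citations: a list of per-paper dictionaries; key names are placeholders.
    conn = sqlite3.connect("pubmed.sqlite")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS paper (pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT)")
    cur.execute("CREATE TABLE IF NOT EXISTS author (pmid TEXT, position INTEGER, name TEXT)")
    for cit in citations:
        pmid = cit["PMID"]  # keep the PMID so the remaining keys stay linkable
        # one-per-paper fields go into the main table; a missing abstract becomes NULL
        cur.execute("INSERT INTO paper VALUES (?, ?, ?)",
                    (pmid, cit.pop("Title", None), cit.pop("Abstract", None)))
        # repeating fields go into their own table, keyed by PMID
        for position, name in enumerate(cit.pop("Authors", [])):
            cur.execute("INSERT INTO author VALUES (?, ?, ?)", (pmid, position, name))
        # whatever is still left in cit is what has not been captured yet
    conn.commit()
    return conn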
Again, apologies in advance if I'm asking a blindingly obvious question to the initiated; I do understand that you can't fit a nested structure into a flat space. I'm just looking for the least boneheaded way of going about this, and hopefully one that will let me make sure that everything was properly captured.
Many thanks,
chris
A quick question: if you already have the data in XML, why are you normalizing it into a SQL format? Why not just use the raw XML? Berkeley DB XML is a library (like SQLite) that links into your application. There is no separate server to install or maintain. The library allows you to store and query XML data using XPath or XQuery. It's very fast, has a small footprint, is transactional, recoverable and highly reliable. It has HA features as well, if that is required.
Keeping the data in XML should simplify the whole data import process and still allow you to query the semi-structured data.
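Whatever the storage engine, querying the semi-structured data directly looks something like the sketch below. It uses Python's standard ElementTree rather than Berkeley DB XML's own API, and the element names are recalled from the PubMed XML format, so verify them (and the file name) against your actual export.

import xml.etree.ElementTree as ET

# Illustration only: an XPath-style query over a PubMed XML export.
# The file name and element names are assumptions to check against your data.
tree = ET.parse("pubmed_export.xml")
for citation in tree.findall(".//MedlineCitation"):
    pmid = citation.findtext("PMID")
    title = citation.findtext(".//ArticleTitle")
    authors = [author.findtext("LastName") for author in citation.findall(".//Author")]
    print(pmid, title, authors)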
