If I have a directory which contains a volumetric data from a variety of sources, say PET and a 4D CT. I know how for any given dataset to use say, vtkGDCMImageReader, to load a 3D image from a series of files. To handle multiple modalities and/or 4D datasets I am currently just manually peeking at tags and dividing up the files into lists and parsing them separately.
Is there a particularly general way of going this or even better a method within GDCM ? What I am doing seems to work but feels like a bit of a hack and there must be a proper way of doing it, I just can't seem to find it.
You can checkout the following example
Related
One of the ancillary tools bundled with quilt is guards, which processes a list of guards and a configuration file matching guards and files, and outputs a list of files whose guard specifications match the provided guards.
However I can't figure how they're supposed to fit together: quilt(1) doesn't show any way to invoke a command to generate series files, I didn't find examples in the mans or the working copy, and the internets are less than helpful (all the hits talk about bedding).
I feel like guard has to be manually invoked whenever its "dependencies" change and the series file overwritten, is that the case? If so, how is data fed back the other way around e.g. when adding a new patch to the series, does it have to be manually synchronised to the guard file?
Background: a few years back I used mq quite a bit, but it integrates guards natively so the synchronisation back and forth is not an issue at all.
we have a larger dataset and have several preprocessing scripts.
These scripts alter data in place.
It seems when I try to register it with dvc run it complains about cyclic dependencies (input is the same as output).
I would assume this is a very common use case.
What is the best practice here ?
Tried to google around but i did not see any solution to this (besides creating another folder for the output).
Usually, we split input and output into separate files rather than modify everything in place, not only for the separation of concerns principles but also to make it fit with tools like DVC.
Hope you can try this way instead.
I need to create a dependency graph for a software suite that I am working on. In the past the company I work for has always done this manually, but I am guessing that there is a tool somewhere that will do what we need.
The software I am working with is Ada95, and has about 200 code modules/files, with about 40 packages. I need to create a map that will trace every output, individually, back to each input or constant that will have an impact on the output. Does anybody know of a tool that would accomplish this? Or even just partially accomplish it?
AdaCore's GPS (available from http://libre.adacore.com) comes with a command line tool named gnatinspect. You can use this tool to load all cross-reference information generated by the compiler (assuming you are compiling with GNAT). This creates a sqlite database (gnatinspect.db) which contains all information you need. gnatinspect itself provides a number of pre-made queries that might get you at least partially to where you want to go.
You could also look at ASIS, as a way to do this kind of queries directly on the code. I am told this is not so easy to use the first time around though.
There is also an older tool provided with gnat (gnatxref) which does something similar, although it is being superceded by gnatinspect.
Finally, you could look at gnat2xml as an alternative to ASIS if you are more comfortable parsing XML files.
Let me first say that I assiduously avoid hand-cleaning data in favor of regular expressions and the like. However, occasionally it is inevitable.
I use something like the Load-Clean-Func-Do workflow normally, so this obviously fits into the cleaning phase. However, any hand-editing breaks the ability to run the stuff before the hand-cleaning if it needs updating.
I can think of at least three ways to handle this:
Put the by-hand changes as early in the workflow as possible, so that everything after that remains runnable.
Write out regexes or assignment operations for every single change.
Use a tool that generates (2) for you after you close the spreadsheet where you've made the changes.
The problem with 2 is that it can be extremely unweildy. The problem with 3 is that I'm unaware of any such tool existing for R. Stata has an extremely good implementation of this.
So the questions are:
Which results in the most replicable code with the least-frustrating code writing?
Does a tool as in (3) exist?
I agree that hand-cleaning is generally a rather bad idea. However, sometimes it is unavoidable. I'd suggest one of the two, or both:
Keep a separate data file with "data fixing" containing three variables "case_id", "variable_name", "value". Use it to store information about which values in the original data need to be replaced. You may add some additional variables to extra information about cleaning (e.g. why value on variable "variable_name" need to be replaced with "value" for case "case_id", etc.). Then have a short piece of R code, which loads your original data and then cleans it with the additional information in the "fixing" file.
Perhaps you should start using some version control system like git or subversion (there are other progs also). Every hand-made change to the data could be recorded in the system as a separate commit. By the end of the day, you will be able to easily check the log for what change you made to the data and when. Moreover, you will be able to generate patch files that transform original data files to the cleaned ones. It is also beneficial to have your R code files version-controlled.
I've scoured the net and cannot seem to find an appropriate example so I thought I'd ask...
(Btw, much of this is new to me- not all, just most.)
Problem: trying to convert a bio/python nested dictionary (or xml) of pubmed citation data into a flat (normalized) structure eg, sqlite. Citation data was fetched from pubmed using biopython and was parsed into a dictionary, but can also retrieve as xml if needed.
Not all citations will have all fields/keys and not all fields/keys will have the same number of items (authors, mesh terms, refs, etc...) and understand that this is part of the normalization process.
This is about where my practical understanding ends.
That said, I think the process should go something like this: first remove/normalize all unique fields (those that have 1 per paper eg, title, abstract, date, citation, etc..., but say not affiliation as that would be linked to first author). Papers with no abstract could be filled as null?
Then move on to, say, authors and create a separate table again using PMID as the fk and then do same for the various other fields/keys/items in separate tables eg, mesh headings, EC numbers, ref, etc...
Is there a way to do this that removes (pops?) keys/items from the master dictionary so that I can visually see what's been done/needs to be done (obviously leaving the PMID)?
Again, apologies in advance if I'm asking a blindingly obvious question to the initiated- and I do understand that you can't fit a nested structure into a flat space- just looking for the least boneheaded way of going about this and hopefully one that will allow me to make sure that everything was properly captured.
Many thanks,
chris
A quick question -- if you already have the data in XML, why are you normalizing it into a SQL format? Why not just use the raw XML? Berkeley DB XML is a library (like SQLite) that links into your application. There is no separate server to install or maintain. The library allows you to store and query XML data using XPath or XQuery. It's very fast, has a small footprint. is transactional, recoverable and highly reliable. It has HA features as well, if that is required.
Keeping the data in XML should simplify the whole data import process and still allow you to query the semi-structured data.