Black box testing a remote DICOM Q/R server - dicom

I am wondering if anyone has ever tried to work on the following issue. I need to execute a series of test on a remote DICOM Q/R server. This would allow some easy DICOM Conformance Statement checking.
As an implementation detail of the test suite, I am running the following (DCMTK style command):
$ findscu --study --cancel 1 --key 0020,0010=* --key 8,52=STUDY --aetitle MINE --call THEIR dicom.example.com 11112
The goal here is to find a valid StudyID (later on I'll use that StudyID to execute lower key level C-FIND, and some related C-MOVE queries). Of course it would be much easier if I could upload my own dataset and try to fetch it back, but I cannot do that against a running PACS in a clinical environment. I need to define with a minimal number of queries how to find a valid StudyID.
However I fear that some DICOM implementation may have policies where quering the entire database is forbidden.
So I was wondering if anyone has written a list of those policies, and maybe describe a way to retrieve a valid StudyID from a remote server with a minimal number of C-FIND queries.

I think I may simply go with:
TODAY=`date +"%Y%m%d"`
findscu --study --key 0008,0020="$TODAY-" --key 0020,0010=* --key 8,52=STUDY --aetitle MINE --call THEIR dicom.example.com 11112
If this does not work (return empty), I'll check yesterday results.

Welcome to DICOM-wonderland.
You are right that you should be very, very, very, very careful to run just random queries on a clinical PACS. I've seen commercial PACses send their whole(!) database as a result of a query which it did not understand. Not a pretty sight. This (and privacy) is one of the reasons that in a lot of hospitals around the world PACS admins are very afraid of giving direct access to their PACS via DICOM.
In general I would say that standardization is not going to help you. So you have to find something which works for you, and which will not get bring the PACS down. No guarantees here.
Just a list of observations from querying PACSes in hospitals:
Some are case sensitive in their matching, some are not.
Most support some kind of wild card. This normally is a '*'. but I've also seen '%' (since that is a SQL wildcard, and the query is just passed as a SQL string). This is not well-defined I think.
The list you will get back might be limited to say the first 500 entries. Or 1000. Or random number between 500 and 1000. Or the whole PACS. You just don't know.
DICOM and cancellation do not play well. Cancelling a query is not implemented well. Normally a PACS sees it as a failed transfer, and will retry after some time. And the retry-queue is limited in size, so it might ignore new queries. So always let your STORE-SCP server running to drain this queue.
Sometimes queries take minutes, especially for retrieve. The next time it might have been retrieved (from tape?) and be fast for a while.
A DICOM query may take a lot of resources from the PACS, depending on the PACS. Don't be suprised if a PACS admin shows up if you experiment a little too much.
The queries supported differ very much. Only basic queries are supported by all: list of patients, list of studyID/study instanceuid for patients, list of series per study, retrieve study or series. Unless you get a funky research department which uses Osirix, which does not support patient-level queries but only study-level-queries.
So what I would advise if you want to have something working on any random PACS:
Use empty-return-key instead of '*'. This is the DICOM way to retrieve information.
Do not use '-cancel'. If you really need to cancel, just close the TCP connection (not supported in DCMTK)
Use a query on PatientId, PatientName, Birthdate, StudyDate to get a list of StudyIDs/StudyInstanceUids.
The simplest is just use a fixed StudyID, assuming that it stays in the PACS long enough. If not, think of a limiting query to not overload the PACS (the 'TODAY' suggestion of you fitted that description).
Good luck!

Related

Gremlin: OLAP vs dividing query

I have a query (link below) I must execute once per day or once per week in my application to find groups of connected users. In the query I check all possible groups for each user of the application (not all users are evaluated but could be a lot). For the moment I'm only making performance tests in localhost using Gremlin Server, since my application is not live yet.
The problem is that when testing this query simulating many users the query reaches the time limit a request can take that is configured in Gremlin Server by default, another problem is that the query does not take full CPU usage since it seems a single query is designed to use a single thread or a reduced amount of CPU processing in some way.
So I have 2 solutions in mind, divide the query in one chunk per user or use OLAP:
Solution 1:
Send a query to get the users first and then send one query per user, then remove duplicates in the server code, this should work in my case and since I can send all the queries at the same time I can use all resources available and bypass the time limits.
Solution 2:
Use OLAP. I guess OLAP does not have a time limit. The problem: My idea is to use Amazon Neptune and OLAP is not supported there as far as I know.
In this question about it:
Gremlin OLAP queries on AWS Neptune
David says:
Update: Since GA (June 2018), Neptune supports multiple queries in a single request/transaction
What does it mean "multiple queries in a single request"?
How my solution 1 compares with OLAP?
Should I look for another database service that supports OLAP instead of Neptune? Which one could be? I don't want an option that implies learning to setup my own "Neptune like" server, I have limited time.
My query in case you want to take a look:
https://gremlify.com/69cb606uzaj
This is a bit of a complicated question.
The problem is that when testing this query simulating many users the query reaches the time limit a request can take that is configured in Gremlin Server by default,
I'll assume there is a reason you can't change the default value, but for those who might be reading this answer the timeout is configurable both at the server (with evaluationTimeout in the server yaml) and per request both for scripts and bytecode based requests.
another problem is that the query does not take full CPU usage since it seems a single query is designed to use a single thread or a reduced amount of CPU processing in some way.
If you're testing with TinkerGraph in Gremlin Server then know that TinkerGraph is really simple. It doesn't do anything internally to run any aspect of a traversal in parallel (without TinkerGraphComputer which is OLAP related).
So I have 2 solutions in mind, divide the query in one chunk per user or use OLAP:
Either approach has the potential to work. In the first solution you suggest a form of poor man's OLAP where you must devise your own methods for doing this parallel processing (i.e. manage thread pools, synchronize state, etc). I think that this approach is a common first step that folks take to deal with this sort of problem. I'd wonder if you need to be as fine grained as one user per request. I would think that sending several at a time would be acceptable but only testing in your actual environment would yield the answer to that. The nice thing about this solution is that it will typically work on any graph system, including Neptune.
Using your second solution with OLAP is trickier. You have the obvious problem that Neptune does not directly support it, but going to a different provider that does will not instantly solve your problem. While OLAP rids you of having to worry about how to optimally parallelize your workload, it doesn't mean that you can instantly take that Gremlin query you want to run, throw it into Spark and get an instant win. For example, and I take this from the TinkerPop Reference Documentation:
In OLAP, where the atomic unit of computing is the vertex and its local
"star graph," it is important that the anonymous traversal does not leave the
confines of the vertex’s star graph. In other words, it can not traverse to an
adjacent vertex’s properties or edges.
In your query, there are already a places where you "leave the star graph" so you would immediately find problems there to solve. Usually that limitation can be worked around for OLAP purposes but it's not as simple as adding withComputer() to your traversal and getting a win in this case.
Going further down this path of using OLAP with a graph other than Neptune, you would probably want to at least consider if this complex traversal could be better written as a custom VertexProgram which might better bind your use case to the the capabilities of BSP than what the more generic TraversalVertexProgram does when processing arbitrary Gremlin. For that matter, a mix of Gremlin OLAP, a custom VertexProgram and some standard map/reduce style processing might ultimately lead to the most elegant and efficient answer.
An idea I've been considering for graphs that don't support OLAP has been to subgraph() (with Java) the portion of the graph that is relevant to your algorithm and then execute it locally in TinkerGraph! I think that might make sense in some use cases where the algorithm has some limits that can be defined ahead of time to form the subgraph, where those limits can be easily filtered and where the resulting subgraph is not so large that it takes an obscene amount of time to construct. It would be even better if the subgraph had some use beyond a single algorithm - almost behaving like a cache graph. I have no idea if that is useful to you but it's a thought. Here's a recent blog post I wrote that talks about writing VertexPrograms. Perhaps you will find it interesting.
All that said about OLAP, I think that your first solution seems fine to start with. You don't have a multi-billion edge graph yet and can probably afford to take this approach for now.
What does it mean "multiple queries in a single request"?
I believe that this just means that you can send a script like:
g.addV().iterate()
g.addV().iterate()
g.V()
where multiple Gremlin commands can be executed within the scope of a single transaction where each command must be "separated by newline ('\n'), spaces (' '), semicolon ('; '), or nothing (for example: g.addV(‘person’).next()g.V() is valid)". I think that only the last command returns a value. It doesn't seem like that particular feature would be helpful in your case. I would look more to batch users within a particular request where possible.
If you a looking for a native OLAP graph engine, perhaps take look at AnzoGraphDB which scales and performs much better for that style of more complex querying than anything else we know of. It's an MPP engine, so every core works on the query in parallel. Depending on how much data you need it to act on, the free version (single node only, RAM limited) may well be all you need and can be used commercially. You can find it in the AWS Marketplace or on Docker Hub.
Disclaimer: I work for Cambridge Semantics Inc.

Unity + Firebase: is it possible to append data to a keys value, or do I have to retrieve keys data every time?

I'm a bit worried that I will reach the free data limits of Firebase in a student project.
Basically my question is:
is it possible to append to the end of the string instead of retrieving key and value, appending and uploading again.
What I want to achieve:
I have to create statistics of user right/wrong answers for particular questions.
I want to have a kvp:
answers: 1r/5w/3r
Where number is the number of users guesses and r/w means right wrong. Whenever the guessing session ends I want to add /numberOfGuesses+RightOrWrongAnswer and the end.
I'm using Unity 2018.
Thank you in advance for all the help!
I don't know how your game is architected or how many people are playing, but I'd be surprised if you hit your free limit on a student project (you can store 1GB and download 10GB). That string is 8 bytes, let's assume worst case scenario: as a UTF32 string, that would be 32 bytes of data - you'd have to pull that down 312 million times to hit a cap (there'll be some overhead, but I can't imagine it being a hugely impactful). If you're afraid of being charged, you can opt to not have a credit card on file to be doubly sure you stay on a student budget.
If you want to reduce the amount of reading/writing though, I might suggest that instead of:
key: <value_string> (so, instead of session_id: "1r/5w/3r")
you structure more like:
key:
- wrong: 5
- right: 3
So have two more values nested under your key. One for all the wrong answers, just an incrementing integer. Then one for all the right answers: just an incrementing integer.
The mechanism to "append" would be a transaction, and you should use these whether you're mutating a string or counter. Firebase tries to be smart with data usage and offline caching, but you don't get much more control other than that.
If order really matters, you might want to get cleverer. You'll generally want to work with the abstractions Realtime Database gives you though to maximize any inherent optimizations (it likes to think in terms of JSON documents, so think about your data layout similarly). This may not be as data optimal, but you may want to consider instead using a ledger of some kind (perhaps using ServerValue.Timestamp to record a single right or wrong answer, and having a cloud function listening to sum up the results in the background after a game - this would be especially useful if you plan on having a lot of users trying to write the same key at the same time).

System Design Interview - Car API

System Design Question:
You are given a dataset of a few million used cars and information about them -- miles, color, price, etc. You have to create an API endpoint in two days that allows users to query the dataset.
This was the answer I gave:
Use a relational database (let's say PostgreSQL) to house the data. Expose a GET endpoint that takes query string parameters corresponding to the attributes in the dataset, parses them and uses them to query the database. The endpoint can also track which attributes are queried the most and add indexes to those attributes to speed up the queries. I was asked how I would handle a range (e.g. "car with 50,000 <= miles <= 100,000") to which I said this can be handled by the query string parameter and translated into the SQL query by the GET endpoint.
Feedback
I was told in feedback afterwards that this answer "didn't convey a strong understanding of how to design web systems." I was hoping for some insights as to where my solution may have been insufficient/weak or may have overlooked something about designing web systems.
Note: I reconstructed my answer from memory so it may be clearer here than it was in the interview.
Thanks for any help!
Like already discussed in the comments, the Interviewer wanted to hear something about SQL Injection. There are some counter measures, which you can do to avoid SQL Injection. These are (most probably not a complete list, but should give a hint, on what to look out for):
Use Prepared Statements
Take care about Access restrictions (in the DB as well as on the OS)
Validate the User Input

Can you get a log file of 'reads' on specific RECID(Tablename) in Progress-4GL/Openedge at RunTime without access to Source Code?

I want to know which tables are being read by a query.
for each Customer where CustomerID = 12345.
Eventually this customer will be found in the following example, but progress must 'read' many tables before getting to customer 12345.
How do I know exactly which tables are read (By CustomerID), prior to getting to customer 12345?
*NOTE: I do not have access to modify the code being run for this selection. Ideally I would run a separate set of code that is executed at the same time as the customer query above to track the reads.
EDIT: More clearly - Can you track reads from a given program (.p) OR ProcessID and output either a RECID or the PrimaryKey to a file?
I understand the information is being read off the Disk and probably stored in a database buffer. So how would I get at the information in the database buffer?
You seem to be mixing up a few different things.
In a situation like your example where you FIND a specific record in one, and only one table then there is just a single record read. Progress will find that record by first scanning a relevant index. That might be 2 or 3 "logical reads" of the b-tree to get to the proper node. The record block and index blocks may, or may not be read from disk - that depends on what has happened previously.
There are "Virtual System Tables" available that can tell you how many READ operations take place against a particular table or index. But they do not trace the specific ROWID or other identifying data. _TableStat and _IndexStat are aggregates for all users on the system, _UserTableStat and _UserIndexStat are specific to a particular user's activity. You do need to set the -tablerangesize and -indexrangesize parameters adequately to take advantage of these.
If you have enabled the table and index statistics then you can use a tool like ProTop - http://protop.wss.com to get insight into this activity. Or you can write your own code.
OpenEdge Auditing does not track reads. That would be prohibitively expensive.
It's probably not really a good idea but, in theory, you could write FIND triggers for the tables you are interested in. That doesn't require access to the application source but you would need a development license. It will probably kill performance to do this though - so unless this is a non-production test environment that you just want to fiddle with I wouldn't really do that.
You mention wanting to know how you got to that point. That sounds more like you might need to have a "4gl trace". One easy way to get the stack trace of a running process is to execute:
$DLC/bin/proGetStack PID (UNIX)
or
%DLC%\bin\proGetStack PID (Windows)
This command will generate a "protrace.pid" file containing a 4gl stack trace and other interesting information.
There are also more complicated ways to get that info like using PROMON and the "client statement cache" or setting various log entry types at session startup. But proGetStack is pretty convenient and requires no code or scripting changes.
Some great options from Tom above. And all of them may be relevant to you. The option he only skirts around is the logging options. I feel obliged to expand on this because I'm giving a talk on it in a couple of weeks!
Assuming you are running a modern version of Progress, or even 10.2B08, then you have client logging available to you. Start your session with these additional options:
-clientlog "\somefolder\somefile.txt"
-logentrytypes "QryInfo:3"
This will log all the info of all the queries in your session to the file you specified above. If you navigate to the point in the system where you want to analyse your query and empty the logfile and save it, you can then run the offending query and see all the detail you need.
The output tells you all sorts of useful info, including the number of reads on each table, compared with the number returned to the user. You also get the index selected.
Using Tom's advice and/or this will get you what you need.

How to hack proof a data submission program

I am writing a score submission system for games where I need to ensure that reports back to the server are not falsified (aka, hacked).
I know that I can store a password or private passkey in the program to authenticate or encrypt the request but if the program is decompiled, a crafty hacker can extract the password/passkey and use it to falsify reports.
Does a perfect solution exist?
Thanks in advance.
No. All you can do is make it difficult for cheaters.
You don't say what environment you're running on, but it sounds like you're trying to solve a code authentication problem*: knowing that the code that is executing is actually what you think it is. This is a problem that has plagued online games forever and does not have a good solution.
Common ways in which such systems are commonly broken:
Capture, modification and replay of submissions to the server
Modifying the binary to allow cheating
Using a debugger to modify the submission in-memory before the program applies signatures/encryption/whatever
Punkbuster is an example of a system which attempts to solve some of these problems: http://en.wikipedia.org/wiki/PunkBuster
Also consider http://en.wikipedia.org/wiki/Cheating_in_online_games
Chances are, this is probably too hard for your game. Hiding a public key in your binary and signing everything that leaves it will probably put you well ahead of the pack, security-wise.
* Apologies, I don't actually remember what the formal name for this is. I keep thinking "running code authentication", but Google comes up with nothing for the term.
There is one thing you can do - record all of the user inputs and send those to the server as part of the submission. The server can then replay the inputs through a local copy of the game engine to determine the score. Obviously this isn't appropriate for every type of game, though. Depending on the game, you may need to include replay protection.
Another method that may be appropriate for some types of games is to include a video recording of the high-scoring play within the submission. Provide links to the videos from the high score table, along with a link to report suspicious entries. This will let you "crowd-source" cheat detection - if a cheater's score hits the table at number 1, then the players behind scores 2 through 10 have a pretty big incentive to validate the video for you. If a score is reported enough times, you can check the video yourself and decide if it should be removed (and the user banned).

Resources