Python: best practice to decouple different access protocols to persisted processed data and back

I am writing a Python backend which will retrieve a text via several methods such as:
1. URL fetched via a request and beautifulsoup4 parsing -> str
2. RSS news text fetched via feedparser -> str
3. MongoDB Document key/value from a given collection/db -> str
4. MySQL select -> str
5. SPARQL query -> str
For every str I receive from any of the above, I will run an NLP pipeline and wish to persist its findings with some reference that lets me go back to that same text.
For case 1, maybe case 2 (also with a URL?), and case 5, I can benefit from a unique ID, which is the URL/URI. Correspondingly, I could use the unique MongoDB _id for case 3, while for case 4 I am not sure what I could store to be able to go back to the original text.
Maybe I could then persist the NLP processing using a schema such as:
METHOD | IDENTIFIER | NLPRESULT to persist the results of all of these access methods?
Just as an additional clarification, if the above works well and I am presented with a text that originates from a resource I have already processed, I should be able to just skip the processing and fetch the previous result.
Are there best practices recommended to approach this task?
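One possible way to sketch the METHOD | IDENTIFIER | NLPRESULT idea is a small lookup table keyed on the pair (method, identifier), checked before the pipeline runs. This is only an illustration under assumptions: the table name nlp_results, the helper names open_cache and get_or_process, the use of SQLite, and storing the result as JSON are all hypothetical choices, not something the question prescribes. For the MySQL case, the identifier could be something like a table name plus primary key, which is also an assumption.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the result store. Table and column names are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nlp_results ("
        " method TEXT NOT NULL,"
        " identifier TEXT NOT NULL,"
        " result TEXT NOT NULL,"
        " PRIMARY KEY (method, identifier))"
    )
    return conn


def get_or_process(conn, method, identifier, fetch_text, run_pipeline):
    """Return the stored result for (method, identifier), or compute and store it."""
    row = conn.execute(
        "SELECT result FROM nlp_results WHERE method = ? AND identifier = ?",
        (method, identifier),
    ).fetchone()
    if row is not None:
        # Already processed: skip both the fetch and the NLP pipeline.
        return json.loads(row[0])
    result = run_pipeline(fetch_text())
    conn.execute(
        "INSERT INTO nlp_results (method, identifier, result) VALUES (?, ?, ?)",
        (method, identifier, json.dumps(result)),
    )
    conn.commit()
    return result
```

Each access method then only has to supply its own identifier (URL, feed entry link, MongoDB _id, and so on) and a fetch_text callable, while the caching logic stays the same for all of them.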

Related

Is there a way to convert numbers to the strings that DynamoDB expects in Step Functions?

I have an IoT Topic receiving data from devices. Each IoT payload includes some properties and an array of objects, which looks something like this.
{
  "batchId": "someBatchId",
  "prop1": "someProp1",
  "objArray": [
    {
      "arrString1": "someArrString1",
      "arrString2": "someArrString2",
      "arrNum1": 1,
      "arrNum2": 2,
      "arrString3": "someArrString3"
    },
    {
      "arrString1": "someArrString4",
      "arrString2": "someArrString5",
      "arrNum1": 3,
      "arrNum2": 4,
      "arrString3": "someArrString6"
    }
  ]
}
The array can have hundreds of objects in it. We want to flatten this data out using a Map step, associate the top-level properties with each element in the array, and insert each element into DynamoDB. We have the table set up and the IoT topic working just fine.
The problem we have is that DynamoDB expects strings when inserting numbers. However, since we're receiving this data as a JSON object from IoT and the numbers are inside of the array of objects, we're having a hard time massaging the numbers into strings. So, we want the Step Function to convert the numbers into strings somehow, but I can't see how to do it. The goal here is to build a simple pipeline for storing IoT data into DynamoDB.
We also don't fully control all of the properties that could be sent, so we're also storing copies of the IoT payloads in S3 (which is already wired with the IoT rules engine and works just fine), but this is more of a backup and catch-all. We're mostly interested in the data getting into DynamoDB so that we can actually query it. How can we convince the Step Function to insert the numbers from the JSON payloads into DynamoDB?
You are really asking two questions here.
1. Can Amazon States Language convert numbers to strings?
2. How can you get Step Functions to add things to DynamoDB without specifying the data type?
The answer to the first question is yes: you can use ASL to convert numbers to strings with the States.Format intrinsic function, like so. Given a payload of a single number:
{
"key.$": "States.Format('{}',$.Payload)"
}
The States.Format intrinsic function interpolates the number into a string template, so the output of this step is a quoted string.
This will not be helpful in your use case, however, as you potentially have hundreds of numbers which may not always follow a set format.
In your case, the solution would be to save your data to DynamoDB using a Lambda function with the Document Client.
It would be nice if they had an option to use the document client directly within Step Functions, but as of this writing that is not the case. You simply need to perform the action manually within a Lambda function using the document client. Same result.
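As a rough sketch of what such a Lambda function might do before writing to DynamoDB, the Python below flattens the payload from the question and converts each value into DynamoDB's typed attribute format, where numbers are sent as strings under the "N" key. The function names to_attribute_value and flatten_payload are hypothetical, and only strings, numbers, and booleans are handled. The actual write call is left out. In a Python Lambda, boto3's Table resource accepts native Python types in a similar spirit to the Document Client, though it expects Decimal rather than float.

```python
from decimal import Decimal


def to_attribute_value(value):
    """Convert a plain JSON value into DynamoDB's typed attribute format."""
    if isinstance(value, bool):
        # bool must be checked before int, because bool is a subclass of int.
        return {"BOOL": value}
    if isinstance(value, (int, float, Decimal)):
        # DynamoDB expects numbers to be sent as strings.
        return {"N": str(value)}
    if isinstance(value, str):
        return {"S": value}
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def flatten_payload(payload, array_key="objArray"):
    """Merge the top-level properties into each array element, one item per element."""
    top_level = {k: v for k, v in payload.items() if k != array_key}
    items = []
    for element in payload.get(array_key, []):
        merged = {**top_level, **element}
        items.append({k: to_attribute_value(v) for k, v in merged.items()})
    return items
```

Each returned item is in the shape the low-level DynamoDB API expects, so it could be passed to a put or batch-write call as is.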

Can you save a result (Given) to a variable in a Gherkin feature file, and then compare the variable with another result (Then)? (Cucumber for Java)

I am new to Cucumber for Java and trying to automate testing of a SpringBoot server backed by a MS SQL Server.
I have an endpoint "Get All Employees".
Writing the traditional feature file, I will have to list all the Employees in the #Then clause.
This is not possible with thousands of employees.
So I just want to get a row count of the Employee table in the database, and then compare with the number of objects returned from the "Get All Employees" endpoint.
Compare
SELECT count(*) from EMPLOYEE
with size of the list returned from
List<Employee> getAllEmployees()
But how does one save the rowcount in a variable in the feature file and then pass it into the stepdefs Java method?
I have not found any way that Gherkin allows this.
After writing a few scenarios and feature files, I understood this about Cucumber and fixed the issue.
Gherkin/Cucumber is not a programming language. It is just a specification language. When the interpreter reaches keywords like Given or Then, the matching methods in the Java code are called. So they are just triggers.
These methods are part of a Java glue class. Data is not passed out of the Java class and into the Gherkin feature file. The class is instantiated at the beginning of a scenario and retained until the scenario ends. Because of this, it can store state between steps.
So, for the example in the question above, the response from the Spring endpoint call will be stored in a member variable of the glue class. The subsequent Then step, which verifies the result, calls the corresponding glue method, which reads the member variable to perform the comparison.
So Gherkin cannot do this, but the Java code in the glue class, one level lower, can.
You could create a package, for example named dataRun (with the corresponding classes in the package), and save the details there during the test via setters.
During the execution of the step "And I get the count of employees from the database" you set this count via the corresponding setter, and during the step "And I get all employees" you set the number via the dedicated setter.
Then, during the step "And I verify the number of employees is the same as the one in the database", you get the two numbers via the getters and compare them.
By the way, it's possible to compare the names of the employees (not just the count) if you put them into lists and compare the lists.
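The pattern described in both answers (steps sharing state through an object rather than through the feature file) is not specific to Java. As a minimal, language-neutral illustration written in Python, the hypothetical class below stands in for a glue class: each method plays the role of one step definition, and the instance attributes carry the values from one step to the next. A real Cucumber for Java project would use annotated Java methods instead. The class name, method names, and the way the database and endpoint are faked here are all assumptions.

```python
class EmployeeSteps:
    """Hypothetical stand-in for a glue class; one instance lives for one scenario."""

    def __init__(self, database_rows, endpoint):
        self._database_rows = database_rows  # fake for the EMPLOYEE table
        self._endpoint = endpoint            # fake for the "Get All Employees" call
        self.db_count = None
        self.employees = None

    def given_the_employee_row_count(self):
        # Stand-in for: SELECT count(*) from EMPLOYEE
        self.db_count = len(self._database_rows)

    def when_i_get_all_employees(self):
        # Stand-in for calling the endpoint; the response is kept on the instance.
        self.employees = self._endpoint()

    def then_the_counts_match(self):
        # Compare the two values stored by the earlier steps.
        assert self.db_count == len(self.employees)
```

The feature file itself only names the three steps in order; the values never appear in it.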

How to retrieve all document content from an Alfresco repository, separated by document type, using OpenCMIS

I want to retrieve all document content from an Alfresco repository. Can anyone help me with how to traverse the repository using CMIS? While traversing, I also want to separate the documents by type.
At the moment I am able to get a single document by specifying its path, but now my requirement is to traverse the whole repository and get all the documents.
Can anyone help me with this?
Also, please suggest which approach is better: "traverse all folders and then separate by specific type" or "search for a specific type of document using a CMIS query".
Thanks in advance.
Yagami's answer is a good start, but there are a few things to add.
First, do not do "select *" unless you actually need every single property the repository has. That is a potential performance problem. Only ask for what you need.
Second, one of your comments talks about segmenting results by type. In CMIS, a type is kind of like a SQL table. So in your case, you would do three different queries using each of your three custom types as a different type in the from clause:
select * from test:mainContract;
select * from test:subContract;
select * from test:royaltyStatement;
Finally, unless you have just a handful of documents in your repository, you are almost certainly going to want to use a paged result set. Otherwise, you will only get back the maximum number of results the server is configured to return. That may not be large enough to get the entire set.
For an example showing paging the result set, see Apache CMIS: Paging query result
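The paging idea itself can be sketched independently of any client library. The Python below is only an illustration: fetch_page is a hypothetical callable standing in for whatever issues the CMIS query, taking a skip count and a maximum number of items (CMIS queries accept skipCount and maxItems parameters). The loop keeps requesting pages until a short page signals the end of the result set.

```python
def iter_all_results(fetch_page, page_size=100):
    """Yield every result by requesting consecutive pages.

    fetch_page(skip_count, max_items) is a hypothetical callable that returns
    a list with at most max_items results, starting at offset skip_count.
    """
    skip_count = 0
    while True:
        page = fetch_page(skip_count, page_size)
        yield from page
        if len(page) < page_size:
            # A short (or empty) page means there is nothing left to fetch.
            break
        skip_count += len(page)
```

Running one such loop per custom type keeps the results segmented by type without ever relying on the server's default result limit.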
To perform an action like this (getting all the document content), you need to follow these steps.
Step 1: Create a saver class
By saver class I mean a class that holds two pieces of information (for me, the most valuable ones), both of which must be strings:
1 - The document ID
2 - The document name
Step 2: Get all the documents
To get all the documents, we have to use a query:
String query;
query = "SELECT * FROM cmis:document ";
You will get all the documents that you have in your repository.
You can add conditions to narrow the search, as in this example:
query = "SELECT * FROM cmis:document WHERE IN_FOLDER('" + folderId + "')";
In this example you will get the documents of a particular folder.
ItemIterable<QueryResult> resultList = session.query(query, false);
and finally
for (QueryResult qr : resultList) {
    String idDocument = qr.getPropertyByQueryName("cmis:objectId").getFirstValue().toString();
    String name = qr.getPropertyByQueryName("cmis:name").getFirstValue().toString();
    // Fetch the document by its ID, so no path is needed
    Document doc = (Document) session.getObject(idDocument);
}
You can read more about queries in CMIS query.
Step 3: Save the information in the saver class on every iteration
Each time through the loop (in step 2), you have to save the information in an instance of the saver class.
I hope that helps.

How to model Not In query in Couch DB [duplicate]

Folks, I was wondering what the best way is to model documents and/or map functions so that I can run "Not Equals" queries.
For example, my documents are:
1. { name : 'George' }
2. { name : 'Carlin' }
I want to run a query that returns every document where name does not equal 'John'.
Note: I don't have all possible names beforehand. So the parameter in the query can be any random text, like 'John' in my example.
In short: there is no easy solution.
You have four options:
1. send a multi-range query
2. filter the view response with a server-side list function
3. use a CouchDB plugin
4. use the Mango query language
Send a multi-range query
You can request the view with two ranges defined by startkey and endkey. You have to choose the ranges so that the key John is not requested.
Unfortunately, you have to track down the commit request that exists somewhere and compile CouchDB with it yourself. It's not included in the official source.
Filter the view response with a server-side list function
It's not recommended, but you can use a list function and skip the row with the key John in your response, much as you would filter a JavaScript array.
Use a CouchDB plugin
Create an additional index with, e.g., couchdb-lucene. The Lucene server has such query capabilities.
Use the Mango query language
It's included in the CouchDB 2.0 developer preview. It is not ready for production yet, but it will definitely be included in the stable release.
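For the Mango option, a "not equals" condition is expressed with the $ne operator inside a selector. The Python sketch below only builds the JSON request body. The helper name build_not_equals_query is hypothetical, and sending the body (a POST to the database's _find endpoint) is left out, since the server URL and database name are not given in the question.

```python
import json


def build_not_equals_query(field, value):
    """Build a Mango selector matching documents whose field differs from value."""
    return {"selector": {field: {"$ne": value}}}


# The value can be any text supplied at query time, such as 'John'.
body = json.dumps(build_not_equals_query("name", "John"))
```

Because the value is only a parameter of the selector, no view has to be defined in advance for each possible name.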

Modality work list - Which items are returned for C-FIND request of a sequence?

My question is a really basic one. Consider querying a modality worklist to get some work items with a C-FIND query. Consider using a sequence (SQ) as a Return Key attribute for the C-FIND query, for example [0040,0100] (Scheduled Procedure Step Sequence), with universal matching.
What should I expect in the SCP's C-FIND response? Or, to put it better, what should I expect to find regarding the scheduled procedure step for a specific work item? All the mandatory items that the Modality Worklist Information Model declares as encapsulated in the sequence? Or should I instead explicitly issue a C-FIND request for those keys I want the SCP to return in the response?
For example: if I want the SCP to return the Scheduled Procedure Step Start Time and Scheduled Procedure Step Start Date, do I need to issue a specific C-FIND request with those keys, or is querying for the Scheduled Procedure Step Sequence key enough to force the SCP to send all items related to the scheduled procedure step itself?
Yes, you should include the Scheduled Procedure Step Start Date and Start Time tags in the (0040,0100) sequence.
See also the Service Class Specifications (K.6.1.2.2).
This does not guarantee that you will get this information back, because which information is returned depends on the Modality Worklist provider.
You could also request a DICOM Conformance Statement from the modality provider to find out which tags are necessary for the request and the response.
As for Table K.6-1, you can read it as showing only the requirements on the SCP side: which attributes the SCP is required to support as matching keys (i.e. query filters) and which additional attribute values it is required to return (i.e. return keys) on a successful match. It is up to the SCP's implementation to support matching against the required keys, but you can always expect the SCP to use the values in the matching keys as the query filter.
Also note that the SCP is only required to return values for attributes that are present in the C-FIND request. One exception is sequence matching, where a mechanism similar to universal matching lets you pass a zero-length item to retrieve the entire sequence. So, as stated in PS 3.4 section C.2.2.2.6, you can just include an empty Item (FFFE,E000) element under the Scheduled Procedure Step Sequence (0040,0100), which has a VR of SQ, for universal matching.
