How to define multiple CAS Consumers in UIMA DUCC? - information-retrieval

I am designing a text mining pipeline in UIMA DUCC as follows:
|-----------------|
|                 | ==CAS_1==> Pipeline A ==> Consumer A
| CAS Multiplier  | ==CAS_2==> Pipeline B ==> Consumer B
|                 | ==CAS_3==> Pipeline C ==> Consumer C
|-----------------|
I intend to run Pipelines A, B, and C in parallel. I believe this can be done using a flow controller. Is my understanding right? If so, how do I define multiple CCs? The process_descriptor_CC field in the job description file takes only one consumer. How can we pass multiple consumers and their pipeline associations?

If the intention is to process a large collection of documents
with high throughput then the three pipelines, each including its
CAS consumer, would all be in the AE (process_descriptor_AE) and
the AE would include a custom flow controller to route CASes
as desired. CASes in an AE would run one at a time, but multiple
CM+AE threads could be run in parallel by specifying the number
of JP threads (process_thread_count) to be greater than 1.
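
The routing idea described above can be pictured with a toy model. This is a minimal sketch in Python, not the UIMA API: the names ROUTES, make_step and route are hypothetical, and each "step" is just a function standing in for an annotator or a CAS consumer. It only illustrates that a single aggregate can hold all three pipelines, consumers included, with a routing table deciding which delegates a given CAS visits.

```python
# Toy model only: this is NOT the UIMA API. It mimics a flow controller that
# sends each CAS through one pipeline and then through that pipeline's consumer.
from typing import Callable, Dict, List

Step = Callable[[str], str]


def make_step(label: str) -> Step:
    """Return a stand-in for an annotator or consumer that tags the text."""
    return lambda text: f"{text}|{label}"


# Hypothetical routing table: CAS kind -> ordered delegates, consumer last.
ROUTES: Dict[str, List[Step]] = {
    "CAS_1": [make_step("PipelineA"), make_step("ConsumerA")],
    "CAS_2": [make_step("PipelineB"), make_step("ConsumerB")],
    "CAS_3": [make_step("PipelineC"), make_step("ConsumerC")],
}


def route(kind: str, text: str) -> str:
    """Run one CAS through the delegates selected for its kind, in order."""
    for step in ROUTES[kind]:
        text = step(text)
    return text
```

In a real job the routing decision would live in the custom flow controller of the aggregate referenced by process_descriptor_AE, and parallelism would come from process_thread_count rather than from anything in this sketch.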

First, you need to understand the flow controller. Create an aggregate descriptor that uses a flow controller, and add each CAS consumer descriptor to it as a delegate, just as you would add an analysis engine descriptor.
After this, there are two options for your scenario:
Use process_descriptor_CR and process_descriptor_AE only, and reference the flow-controller-based aggregate descriptor as the AE.
Use process_descriptor_CR and process_dd only, and reference the flow-controller-based aggregate descriptor in the deployment descriptor.

Create a flow controller and add each CAS consumer to the aggregate as a delegate analysis engine.
This way you can add as many consumers as you want.
Then reference the flow-controller-based aggregate in the deployment descriptor, and give the path of the deployment descriptor in the job specification.

Related

Making blocking http call in akka stream processing

I am new to Akka and still trying to understand the different Akka and streaming concepts. For a new feature I need to add an HTTP call to an existing stream that works on an internal object. Something like this:
val step1Flow = Flow[SampleObject].filter(/* ... filtering condition ... */)
val step2Flow = Flow[SampleObject].map(obj => {
  ...
  // business logic to update values in the obj
  ...
})
...
override val flowGraph: Flow[SampleObject, SampleObject, NotUsed] =
  bufferIn.via(Flow.fromGraph(GraphDSL.create() {
    implicit builder =>
      import GraphDSL.Implicits._
      ...
      val step1 = builder.add(step1Flow)
      val step2 = builder.add(step2Flow)
      val step3 = builder.add(step3Flow)
      ...
      source ~> step1 ~> step2 ~> step3 ~> merge
      ...
  }))
I need to add the new HTTP request flow (let's call it newFlow) after step1. All these flows have SampleObject as both inlet and outlet type. My understanding is that newFlow would need to be blocking, because its outlet must emit a SampleObject and not a Future. For that I have used Await on the future returned by the HTTP call. The code looks like this:
val responseFuture: Future[(Try[HttpResponse], SomeContext)] =
  Source
    .single(httpRequest -> context)
    .via(Retry(retrySettings).join(clientFlow))
    .runWith(Sink.head)
...
val (httpTry, passedAlongContext) = Await.result(responseFuture, 30.seconds)
// logic to process response and return SampleObject
This works, but I think there should be a better way to do it without using Await. I also think this blocks the calling thread until the request completes, which is going to hurt the application's throughput.
Could you please tell me whether the approach I used is correct? And how can I use a separate thread pool to handle these blocking calls so that my main thread pool is not affected?
This question seems very similar to mine, but I do not understand it completely: connect Akka HTTP to Akka stream. Also, I can't change step2 or any of the later flows.
EDIT : Added some code details for the stream
I ended up using the approach described in the question because I couldn't find anything better after looking around. Adding this step decreased the throughput of my application, as expected, but there are techniques that can be used to win some of it back. Check these blog posts by Colin Breck:
https://blog.colinbreck.com/maximizing-throughput-for-akka-streams/
https://blog.colinbreck.com/partitioning-akka-streams-to-maximize-throughput/
To summarize:
Use asynchronous boundaries around flows that block.
Use Futures where possible and attach callbacks to them. There are several ways to do that.
Use buffers. Several types of buffer are available; choose the one that suits your needs.
Other than these, you can use built-in graph stages such as:
"Broadcast", to send every event to multiple consumers.
"Partition", to split your stream into multiple streams based on some condition.
"Balance", to split your stream when there is no logical way to partition the events or when they may have very different workloads.
You can use any one of the options above, or combine several.
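
The second and third points (futures instead of waiting, plus a separate pool for blocking work) can be sketched outside Akka. The following is a Python analogy, not Akka code: blocking_http_call is a hypothetical stand-in that sleeps instead of doing I/O, and BLOCKING_POOL and enrich_all are names invented for this example. It shows blocking calls being pushed onto a dedicated thread pool with a bounded number in flight, while results still come back in input order. In Akka Streams the operator usually used for this shape of problem is mapAsync, which takes a parallelism value and a function returning a Future.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool for blocking calls, so other work is not starved by them.
BLOCKING_POOL = ThreadPoolExecutor(max_workers=4, thread_name_prefix="blocking-http")


def blocking_http_call(obj: dict) -> dict:
    """Hypothetical stand-in for the HTTP request; sleeps instead of doing I/O."""
    time.sleep(0.05)
    return {**obj, "enriched": True}


async def enrich_all(objs, parallelism: int = 4):
    """Run the blocking call for every element, at most `parallelism` at a time,
    and return the results in input order."""
    loop = asyncio.get_running_loop()
    gate = asyncio.Semaphore(parallelism)

    async def one(obj):
        async with gate:
            # The event loop stays free while the pool thread blocks.
            return await loop.run_in_executor(BLOCKING_POOL, blocking_http_call, obj)

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(one(o) for o in objs))


def run_demo(n: int = 8):
    objs = [{"id": i} for i in range(n)]
    return asyncio.run(enrich_all(objs))
```

The design point carried over from the list above is that the stage stays typed "element in, element out" from the outside while the waiting happens off the main pool.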

MDriven ECO_ID duplicates

We appear to have a problem with MDriven generating the same ECO_ID for multiple objects. For the most part it seems to happen in conjunction with unexpected process shutdowns and/or server shutdowns, but it does also happen during normal activity.
Our system consists of one ASP.NET application and one WinForms application. The ASP.NET app is set up in IIS to use a single worker process. We have a mixture of WebForms and MVC, including ApiControllers. We're using a rather old version of the ECO packages: 7.0.0.10021. We're on VS 2017, and the target framework is 4.7.1.
We have it configured to use 64-bit integers for object IDs. The database is Firebird. The SQL configuration is set to use ReadCommitted transaction isolation.
As far as I can tell we have configured EcoSpaceStrategyHandler with EcoSpaceStrategyHandler.SessionStateMode.Never, which should mean that EcoSpaces are not reused at all, right? (Why would I even use EcoSpaceStrategyHandler in this case, instead of just creating the EcoSpace normally with the new keyword?)
We have created MasterController : Controller and MasterApiController : ApiController classes that we use for all our controllers. These have an EcoSpace property that simply does this:
if (ecoSpace == null)
{
    if (ecoSpaceStrategyHandler == null)
        ecoSpaceStrategyHandler = new EcoSpaceStrategyHandler(
            EcoSpaceStrategyHandler.SessionStateMode.Never,
            typeof(DiamondsEcoSpace),
            null,
            false
        );
    ecoSpace = (DiamondsEcoSpace)ecoSpaceStrategyHandler.GetEcoSpace();
}
return ecoSpace;
I.e. if no strategy handler has been created, create one specifying no pooling and no session state persisting of eco spaces. Then, if no ecospace has been fetched, fetch one from the strategy handler. Return the ecospace. Is this an acceptable approach? Why would it be better than simply doing this:
if (ecoSpace == null)
    ecoSpace = new DiamondsEcoSpace();
return ecoSpace;
In aspx we have a master page that has an EcoSpaceManager. It has been configured to use a pool but SessionStateMode is Never. It has EnableViewState set to true. Is this acceptable? Does it mean that EcoSpaces will be pooled but inactivated between round trips?
It is possible that we receive multiple incoming API calls in tight succession, so that one API call hasn't been completed before the next one comes in. I assume that this means that multiple instances of MasterApiController can execute simultaneously but in separate threads. There may of course also be MasterController instances executing MVC requests and also the WinForms app may be running some batch job or other.
But as far as I understand id reservation is made at the beginning of any UpdateDatabase call, in this way:
update "ECO_ID" set "BOLD_ID" = "BOLD_ID" + :N;
select "BOLD_ID" from "ECO_ID";
If the returned value is K, this reserves N new IDs ranging from K - N to K - 1. Using ReadCommitted transactions everywhere should ensure that the update locks the ID data row, forcing any concurrent save operation to wait; the transaction then fetches the updated value without interference from other transactions and commits. At that point any other pending save operation can proceed with its own ID reservation. I fail to see how this could result in the same ID being used for multiple objects.
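
To make the arithmetic concrete, here is a minimal sketch in Python of the reservation as the question describes it. It is a model only: IdTable is a hypothetical in-memory stand-in for the single-row ECO_ID table, and the lock stands in for the row lock the question expects the UPDATE to take. It is not MDriven's implementation.

```python
import threading


class IdTable:
    """In-memory stand-in for the single-row ECO_ID table (hypothetical model)."""

    def __init__(self, start: int):
        self.bold_id = start
        self._row_lock = threading.Lock()  # models the row lock taken by UPDATE

    def reserve(self, n: int) -> range:
        """Advance the counter by n and return the reserved ids K-N .. K-1."""
        with self._row_lock:
            self.bold_id += n   # update "ECO_ID" set "BOLD_ID" = "BOLD_ID" + :N
            k = self.bold_id    # select "BOLD_ID" from "ECO_ID"
        return range(k - n, k)
```

As long as the increment and the read happen as one unit, consecutive reservations cannot overlap: starting from 100, reserving 5 and then 3 yields 100..104 and 105..107.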
I should note that it does seem like it sometimes produces id duplicates within one single UpdateDatabase, i.e. when saving a set of new related objects, some of them end up with the same id. I haven't really confirmed this though.
Any ideas what might be going on here? What should I look for?
The issue is most likely that you use ReadCommitted isolation.
This allows two systems to start a transaction simultaneously, read the same current value, each increase it by its batch size, and then save one after the other.
You must use Serializable isolation for key generation, i.e. only read things that are not currently part of a write operation.
MDriven uses two settings for isolation level: UpdateIsolationLevel and FetchIsolationLevel.
Set your UpdateIsolationLevel to Serializable.
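
The interleaving described in this answer can be traced by hand. The sketch below is a toy Python model with hypothetical function names; it does not talk to Firebird and does not show whether the database actually permits this schedule under ReadCommitted. It only shows why ranges overlap if two writers both read the counter before either writes it back, and why serializing the two reservations removes the overlap.

```python
def reserve_interleaved(start: int, n_a: int, n_b: int):
    """Both writers read the counter before either one writes it back."""
    stored = start
    read_a = stored                      # writer A reads K
    read_b = stored                      # writer B reads the same K
    stored = read_a + n_a                # A saves its increment
    ids_a = range(read_a, read_a + n_a)
    stored = read_b + n_b                # B saves, overwriting A's increment
    ids_b = range(read_b, read_b + n_b)
    return ids_a, ids_b, stored


def reserve_serialized(start: int, n_a: int, n_b: int):
    """Writer B only starts after writer A has finished."""
    stored = start
    ids_a = range(stored, stored + n_a)
    stored += n_a
    ids_b = range(stored, stored + n_b)
    stored += n_b
    return ids_a, ids_b, stored
```

With a start value of 100 and batch sizes 5 and 3, the interleaved schedule hands out 100..104 and 100..102, so three IDs are issued twice; the serialized schedule hands out 100..104 and 105..107.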

Elixir - Changing Behavior

This is mostly a Functional Programming question rather than an Elixir one, but since I'm learning Elixir it would be nice if someone can answer it using that language. Even so, if someone wants to give a more general answer it'll be appreciated.
I'm an OO programmer myself and I can't wrap my head around how to change the behavior of a component based on a configuration file (for example).
Example:
I have an application that loads/saves users from a database. In a production environment, I want my users to be saved to and retrieved from a MongoDB database, while in development and testing I want to use an in-memory map. If I were writing this system in an OO language (let's say Java), I would simply create an interface named "UserRepository" with two implementations: "MemoryUserRepository" and "MongoDBUserRepository". I would then instantiate the corresponding repository at startup, based on a configuration file (or hardcoded, it doesn't matter). From then on, the objects that interact with the repository never know its implementation: they use a repository, but never care whether it's in memory or in Mongo.
That gives me the ability to create as many implementations as I want and the only thing I need to do to change the behavior of the system is instantiate the implementation that I want to use.
I want the same behavior in Elixir (let's use the same example). Since it's not an object-oriented language, I can't use the above approach. Obviously I want it to be extensible. I could pass a string with the type of repository I want to use in each call and use pattern matching to decide which behavior to use, but that doesn't scale well: every time I add an implementation, I have to find every place where I pattern match on the type and add the new implementation there. What would be the best approach to achieve this?
Thanks in advance!
Suppose you have these two (or more) repository implementations, which implement the same interface:
defmodule MyApp.Repository.Memory do
  def get(key) do
    # ...
  end

  def put(key, value) do
    # ...
  end
end

defmodule MyApp.Repository.Disk do
  def get(key) do
    # ...
  end

  def put(key, value) do
    # ...
  end
end
Then you can write a general repository module that will just forward the function calls to one of the repository backends, based on a configuration value in your config/config.exs file:
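defmodule MyApp.Repository do
  # The backend module is read from the application config at compile time.
  @backend Application.get_env(:my_app, :repository_backend)

  # Forward each call to the configured backend.
  defdelegate get(key), to: @backend
  defdelegate put(key, value), to: @backend
end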
The configuration can be made so that it is environment specific (just look at the default config.exs in a mix project freshly created with mix new my_app):
# config/config.exs
import_config "#{Mix.env}.exs"
# config/dev.exs
config :my_app, repository_backend: MyApp.Repository.Memory
# config/prod.exs
config :my_app, repository_backend: MyApp.Repository.Disk
Throughout your entire code, you can then just use the MyApp.Repository module without explicitly referencing one of the specific implementations:
MyApp.Repository.put(:foo, "Hello world!")
value = MyApp.Repository.get(:foo)

Query workflow tasks based on custom property with other criteria than equals

I have the need to construct a WorkflowTaskQuery with a custom workflow model date as criteria. The criteria needs to be "currentDate >= myCustomDate".
I have noticed that it is possible to add custom properties to the WorkflowTaskQuery, but looking into the implementation, it seems that those properties are all added as equals criteria (reference, 4.2.x: org.alfresco.repo.workflow.activiti.ActivitiWorkflowEngine.addTaskPropertiesToQuery).
Fetching all active tasks and filtering the returned result is not a good approach, since there will be thousands of running workflow tasks in this implementation.
The only other approach I can think of is to subclass both WorkflowTaskQuery and ActivitiWorkflowEngine, rewrite some private methods (like createRuntimeTaskQuery), and handle my special cases there myself. (Activiti has methods like greaterThan and so on for searching tasks based on variables.)
If anyone has a better suggestion, please feel free to share it with me :)
We are implementing a solution that drives Activiti through its REST interface and have successfully implemented task queries using POST /rest/service/query/task.
The body of the request contains the conditions, and the operator used in the query can have the following values: "equals", "notEquals", "equalsIgnoreCase", "notEqualsIgnoreCase", "lessThan", "greaterThan", "lessThanOrEquals", "greaterThanOrEquals" and "like".
Now, with that said, I'm not sure I understand your query.
In currentDate >= customDate, currentDate is self-explanatory, but is customDate a process instance variable or a task-local variable? That may affect the format of the query.
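
As a rough illustration of such a request body, here is a Python sketch that only builds the JSON and sends nothing. The field names processInstanceVariables, taskVariables, name, operation, value and type, and the ISO-8601 date format, are assumptions about the REST payload that should be checked against the Activiti version in use; myCustomDate is the variable name from the question. The condition currentDate >= myCustomDate is expressed as "myCustomDate lessThanOrEquals now".

```python
import json
from datetime import datetime, timezone


def build_task_query(variable_name: str, now: datetime, scope: str = "process") -> dict:
    """Build a query body for 'variable <= now'.

    scope="process" targets a process instance variable, anything else a
    task-local variable (key names are assumptions, see the lead-in).
    """
    key = "processInstanceVariables" if scope == "process" else "taskVariables"
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        key: [
            {
                "name": variable_name,
                "operation": "lessThanOrEquals",
                "value": stamp,
                "type": "date",  # assumed type tag for date variables
            }
        ]
    }


def to_json(body: dict) -> str:
    """Serialize the body as it would be sent in the POST request."""
    return json.dumps(body, sort_keys=True)
```

Which of the two keys applies depends on the answer to the question above, namely whether the date is stored on the process instance or on the task.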

Salesforce Batch Apex Class - Querying Against Large Data Sets

I have a batch Apex class in which I build collections of websites and emails, so that I can use those collections to filter other queries whose results are in turn put into collections. With all collections set, I want to run a final loop over the scope to perform business processes.
Mockup:
for (Object o : scope)
{
    listEmails.add(o.Email);
    listWebsites.add(o.Website);
}
// Pseudocode: the two queries below are described, not written out.
Map<String, Account> accounts = /* gather all accounts where website not in :listWebsites */; // Website is key
Map<String, Contact> contacts = /* gather all contacts where email not in :listEmails */; // Email is key
for (Object o : scope)
{
    Account account = accounts.get(o.Website);
    Contact contact = contacts.get(o.Email);
    // Perform business logic here
}
The problem is that when I run this batch, it keeps processing for hours. With a rather small database this works fine, but in a larger environment this is perhaps not the best solution.
Can anyone help me speed up the batch process with a more effective approach?
Is there any way to post the entire batch Apex class? Or to help us understand the data better?
From your maps, it looks like all of your accounts (in theory) have unique websites and all of your contacts have unique emails?
I assume you build those maps by hand? That is, you loop over the accounts and do a
map.put(account.website,account)?
Do you have any system debug statements to confirm your map sizes?
What happens if there is no account or no contact when you call accounts.get()?
And the business logic: is it more looping?
Are you using batch variables in a stateful manner, e.g. a counter for the total number of records processed? If so, is your variable a list? That can be dangerous, of course.
Also, what object is your scope object? Not that it matters, but I'd think you'd want your scope to be the Accounts themselves or the Contacts themselves.
I'd try adding System.debug statements to your batch to verify that it's running and to see where an infinite loop may be occurring.
