Rules Engine Real Time Execution - rules

I am using PHP and MySQL to run a backend server.
Currently I have a "Action" table that contains certain criteria and what action should be taken, the properties are following:
Conditions (contains complex criteria, e.g. [Item.Qty] > 100000 AND [User.Level] < 5 AND [Item.Price] > 100000)
Actions (e.g. SEND EMAIL TO [SUPERVISOR])
I already have the condition parser that will return boolean.
The software will need to validate every user's action to the every rules (e.g. enterprise software). But I have thousands of rules and number of request is high.
Any idea to run this in real-time when user submitting / saving data to the system?
Thanks :)
Notes:
Since I have tons of feature to compare, decision table is not a feasible option here

Related

How do I provide my Identity / Email when connecting to NCBI through Rentrez?

My project head is telling me that its unacceptable to connect with NCBI to retrieve sequence entries without sending along identifying information such as our institution email. They claim this means NCBI won't instantly block our connection if we violate their user guidelines, they'll 'email' us first. We are using Rstudio with the Rentrez package to retrieve protein sequences from NCBI Genbank.
But I'm not certain that's necessary or IF rentrez has any way to even do that. For reference this is general format of our code.
sequence <- entrez_fetch(db="nuccore", id=**accession_number**, rettype="fasta")
Rentrez says on their documentation: "The NCBI will ban IPs that don't use EUtils within their user guidelines. In particular /enumerated /item Don't send more than three request per second (rentrez enforces this limit) /item If you plan on sending a sequence of more than ~100 requests, do so outside of peak times for the US /item For large requests use the web history method (see examples for entrez_search or use entrez_post to upload IDs)"
Both entrez_search and entrez_post include an argument called "web_history A web_history object for use in subsequent calls to NCBI" I'm not sure if this is what I'm looking for though.
I can't find any arguments or functions etc. which allow the user to send identifying information to NCBI when connecting.
It seems like you need an API key. You can get one from your NCBI account interactively, and it needs to be specified in your .bash_profile (at least on a mac, using bash, not sure your OS / terminal of choice here).
For command line usage it just needs to be set as a variable with the following line added to your profile:
export NCBI_API_KEY=<yourkeyehere>
Then as long as R is loading up that profile when it spins up, you should be fine.
EDIT:
A bit of a tangential note here, you can grab files from the FTP site with utilities like curl and wget, or even Biostrings' functions like readDNAStringSet() without an API key, but if you're going to access things with eutils, you need one - as long as you're going OVER the X-number of queries per second - but if you're under that threshold, i don't think they care that much.

Fetching parent and child item in single query in DynamoDB

I have the following one-to-many relationship:
Account 1--* User
The Account contains global account-level information, which is mutable.
The User contains user-level information, which is also mutable.
When the user signs-in, they need both Account and User information. (I only know the UserId at this point).
I ideally want to design the schema such that a single query is necessary. However, I cannot determine how to do this without duplicating the Account into each User and thus requiring some background Lambda job to propagate changes to Account attributes across all User objects -- which, for the record, seems like more resource usage (and code to maintain) than simply normalizing the data and having 2 queries on each sign-in: fetch user, then fetch account (using an FK inside the user object that identifies the account).
Is it possible to design a schema that allows one query to fetch both and doesn't require a non-transactional background job to propagate updates? (Transactional batch updates are out of the question, since there's >25 users.) And if not, is the 2-query idea the best / an acceptable method?
I'll focus on one angle in your question - the 2-query idea. In many cases it is indeed an acceptable method, better than the alternatives. In fact in many NoSQL uses, every user-visible request results in significantly more than two database requests. In fact, it is often stated that this is the reason why NoSQL systems care about low tail latencies (i.e., even 99th percentile latencies should be low).
You didn't say why you wanted to avoid the 2-query solution. The 2-query implementation you presented has two downsides:
It is more costly: you need to do two queries instead of one, costing (when the reads are shorter than 4 KB) double than a single read.
Latency doubles if you need to do the first query, and only then can do the second query.
There may be tricks you can use to solve both problems, depending on more details of your use case:
For the latency: You didn't say what is a "user id" in your application. If it is some sort of unique numeric identifier, maybe it can be set up such that the account id can be determined from the user id directly, without a table lookup (e.g., the first bits of the user id are the account id). If this is the case, you can start both lookups at the same time, and not double the latency. The cost will still be double, but not the latency.
For the cost: If there is a large number of users per account (you said there are more than 25 - I don't know if it's much more or not), it may be useful to cache the Account data, so that not every user lookup will need to read the Account data again - it might often be cached. If Account information rarely changes and consistency of it is not a big deal (I don't know if it is...), you can also get by with doing an "eventual consistency" read for the Account information - which costs half of the regular "consistent" read.
I think the following scheme will be useful for.
You will store both account and user records inthe same table
You want to get both account metadata and linked users in a single query
PK: account SK: recordId
=== Account record ===
account: 123512321 recordId: METADATA attributes: name, environment, ownerId...
=== User record ===
account: 123512321 recordId: USERID#34543543 attributes: name, email, phone...
With this denormalization of the data, you can retrieve both account metadata and related users in a single query. You can also change the account metadata without a need to apply any change to related users.
BONUS: you can also link other types of assets to the account record

Lookup the existence of a large number of keys (up to1M) in datastore

We have a table with 100M rows in google cloud datastore. What is the most efficient way to look up the existence of a large number of keys (500K-1M)?
For context, a use case could be that we have a big content datastore (think of all webpages in a domain). This datastore contains pre-crawled content and metadata for each document. Each document, however, could be liked by many users. Now when we have a new user and he/she says he/she likes document {a1, a2, ..., an}, we want to tell if all these document ak {k in 1 to n} are already crawled. That's the reason we want to do the lookup mentioned above. If there is a subset of documents that we don't have yet, we would start to crawl them immediately. Yes, the ultimate goal is to retrieve all these document content and use them to build the user profile.
My current thought is to issue a bunch of batch lookup requests. Each lookup request can contain up to 1K of keys [1]. However to get the existence of every key in a set of 1M, I still need to issue 1000 requests.
An alternative is to use a customized middle layer to provide a quick look up (for example, can use bloom filter or something similar) to save the time between multiple requests. Assuming we never delete keys, every time we insert a key, we add it through the middle layer. The bloom-filter keeps track of what keys we have (with a tolerable false positive rate). Since this is a custom layer, we could provide a micro-service without a limit. Say we could respond to a request asking for the existence of 1M keys. However, this definitely increases our design/implementation complexity.
Is there any more efficient ways to do that? Maybe a better design? Thanks!
[1] https://cloud.google.com/datastore/docs/concepts/limits
I'd suggest breaking down the problem in a more scalable (and less costly) approach.
In the use case you mentioned you can deal with one document at a time, each document having a corresponding entity in the datastore.
The webpage URL uniquely identifies the page, so you can use it to generate a unique key/identifier for the respective entity. With a single key lookup (strongly consistent) you can then determine if the entity exists or not, i.e. if the webpage has already been considered for crawling. If it hasn't then a new entity is created and a crawling job is launched for it.
The length of the entity key can be an issue, see How long (max characters) can a datastore entity key_name be? Is it bad to haver very long key_names?. To avoid it you can have the URL stored as a property of the webpage entity. You'll then have to query for the entity by the url property to determine if the webpage has already been considered for crawling. This is just eventually consistent, meaning that it may take a while from when the document entity is created (and its crawling job launched) until it appears in the query result. Not a big deal, it can be addressed by a bit of logic in the crawling job to prevent and/or remove document duplicates.
I'd keep the "like" information as small entities mapping a document to a user, separated from the document and from the user entities, to prevent the drawbacks of maintaining possibly very long lists in a single entity, see Manage nested list of entities within entities in Google Cloud Datastore and Creating your own activity logging in GAE/P.
When a user likes a webpage with a particular URL you just have to check if the matching document entity exists:
if it does just create the like mapping entity
if it doesn't and you used the above-mentioned unique key identifiers:
create the document entity and launch its crawling job
create the like mapping entity
otherwise:
launch the crawling job which creates the document entity taking care of deduplication
launch a delayed job to create the mapping entity later, when the (unique) document entity becomes available. Possibly chained off the crawling job. Some retry logic may be needed.
Checking if a user liked a particular document becomes a simple query for one such mapping entity (with a bit of care as it's also eventually consistent).
With such scheme in place you no longer have to make those massive lookups, you only do one at a time - which is OK, a user liking documents one a time is IMHO more natural than providing a large list of liked documents.

Alfresco CMIS different result with same query

we have a bit of a problem.
We've builded a GWT application on top of our two Alfresco instances. The application should work like this:
User search a document
Our web app spam two same queries against two repositories, wait for both results and expose a merged resultset.
This is true in case the search is for a specific documento (number id for example) or 10, 20, 50 documents (we don't know when this begins to act strange).
If the query is a consistent one (like all documents from last month, there should be about 30-60k/month) obviously the limit of cmis query (500) stops before.
BUT, if the user hits "search" the first time, after a while, the resultset is composed of 2 documents. And if the users hits "search" right after that again, with the same query, the resultset is exposed almost immediately and there are 500 documents listed.
What the heck is wrong? Does CMIS caches results in some way? How do big CMIS queries work?
Thanks
A.
As you mentioned you're using Apache Chemistry. Chemistry has a clientside caching mechanism:
http://chemistry.apache.org/java/how-to/how-to-tune-perfomance.html
I suspect this is not CMIS related at all but is instead due to the Alfresco Lucene "max permission check" problem. At a high-level, there is a config setting for the maximum number of permission checks that Alfresco will do against a search result set. There is also a limit to the total amount of time it will spend performing such checks. These limits are configured in the repository properties file as:
# The maximum time spent pruning results
system.acl.maxPermissionCheckTimeMillis=10000
# The maximum number of results to perform permission checks against
system.acl.maxPermissionChecks=1000
The first time you run a search the server begins performing these checks and hits the limit. It then returns the search results it was able to filter. Now the permission cache is populated so the next time you run the search the results come back much faster and the result set is larger.
Searches in Alfresco are non-deterministic--you cannot guarantee that, for large result sets, you will get back the exact same result set every time, regardless of how big you make those settings.
If you are able to upgrade at some point you may find that configuring Alfresco to use Solr rather than Lucene could help alleviate this, but I'm not 100% sure it will.
To disable security checks replace public SearchService with searchService. Public services have enforced security so with searchService you can avoid security checking.

Explain Query Bands in Teradata

Can anyone explain Query Bands in Teradata?
I've searched regarding this a lot, but wasnt able to get information which I can understand.
Please be a bit detailed.
Thanks!!!
QUERY BANDING IN TERADATA:
QUERY BANDING PROVIDES CIRCUMSTANTIAL WORKFLOW INFORMATION.
Concept:
Scientists will often band the legs of birds with devices to track their flight paths. Monitoring and analyzing the data retrieved via the bands provides critical information about the species.
The same process is followed by DBAs who need some more information about a query than what is available.
Metadata—such as the name of requesting user, work unit & the application name is important, Workload management will be tracking the entire use of data warehouse & query troubleshooting.
Query banding feature is used such a way that, these metadata details are linked to the query in database.
A query band can contain any number of name or value pairs such as initiating users corporate ID, department & location, also the time of the initiation execution started.
Prashanth provided a good analogy with birds and bands. Adam is asking for specific situations. I can come up with several examples, when query banding may be very useful:
Your system is used by hundreds of users via an Application Server with a custom application or a reporting application like Business Objects, Tableau or Qlikview. Application server connects to Teradata using one user ID, however the administrator would still like to know what users, departments and groups of users generate each query to be able to analyze later in DBQL or simply to allocate proper system resources using TASM. For this the application can be configured in such a way that each query is "banded" with information like "AppUser:User1;Appgroup:DataScientists;QueryType:strategic02". Despite the fact that Application Server uses one Teradata user and a limited number of connection to route all the queries from hundreds of users, each individual query is marked with information exactly which user has initiated the query. You can then perform all kinds of analysis based on this information.
Suppose you have a complex ETL application, and you want to track and analyze your execution of loads - what and when went wrong. Usually you would need to log all the steps of your ETL process, and in the logs you must specify unique Load ID, Process ID, Step ID, etc. You do this because you want to be able to understand what specific process caused this halt or a performance degradation, and without such logging it would not be possible to distinguish running of the same steps between different runs of your ETL application. A good alternative would be to switch on DBQL and embellish your queries with Query Band information with Load ID, Process ID, Step ID, etc. In this way you would have all necessary information in DBQL without the necessity to create additional elaborate log tables.
SET QUERY BAND = 'name=value; name2=value;' FOR SESSION|TRANSACTION;
this will tag your query with some name value pairs. This can be used to manage your query's workload management for example in TDWM you have throttles and priority management hooks that will priorities all name2 types with the value "value". It means you can submit a very rich detail on the session or transaction
Yes, what you described can easily be done with QueryBanding; think of it as a "wagon of key-pair attributes in transit". you can access them via sql or prgrammatically with session attributes in bteq or jdbc for example.
Necromancing... Existing answers do a good job at explaining how query bands work, but since I could not find a complete working example, I thought of adding one here.
Setting query bands in Teradata is already covered, so I will provide an example of how to set them from a .NET client:
private void SetQueryBands()
{
TdQueryBand qb = Connection.QueryBand;
qb["CustomApplicationName"] = "MyAppName";
foreach (string key in CustomQueryBands.Keys)
{
qb[key] = CustomQueryBands[key];
}
Connection.ChangeQueryBand(qb);
}
Connection = new TdConnection(GetConnectionString());
Connection.Open();
SetQueryBands();
More details can be found here.
To retrieve stored queryband data, GetQueryBandValue function can be used:
SELECT CollectTimestamp, QueryBand,
GetQuerybandValue(queryband, 0, 'Key1') AS Value1,
GetQuerybandValue(queryband, 0, 'Key2') AS Value2,
GetQuerybandValue(queryband, 0, 'Key3') AS Value3,
FROM dbql_data.dbqlogtbl
WHERE dateofday = DATE - 1
AND queryband LIKE '%somekeyorvalue%'

Resources