Avoiding Salesforce governor limits on SOQL queries when getting group members for each group?

I am working in Apex on the Salesforce platform. I have this loop that grabs all group names, Ids, and their respective group members, places them in an object that collects this info, and then puts that object in a list so I end up with a list of all groups and all the information I need:
List<groupInfo> memberList = new List<groupInfo>();
for (Id key : groupMap.keySet()) {
    groupInfo newGroup = new groupInfo();
    Group g = groupMap.get(key);
    if (g.Name != null) {
        Set<Id> memberSet = getGroupEventRelations(new Set<Id>{ g.Id });
        if (memberSet.size() != 0) {
            newGroup.groupId = g.Id;
            newGroup.groupName = g.Name;
            newGroup.groupMemberIds = memberSet;
            memberList.add(newGroup);
        }
    }
}
My getGroupEventRelations method is as follows:
global static Set<Id> getGroupEventRelations(Set<Id> groupIds) {
    Set<Id> nestedIds = new Set<Id>();
    Set<Id> returnIds = new Set<Id>();
    List<GroupMember> members = [SELECT Id, GroupId, UserOrGroupId
                                 FROM GroupMember WHERE GroupId IN :groupIds];
    for (GroupMember member : members) {
        if (Schema.Group.SObjectType == member.UserOrGroupId.getSObjectType()) {
            nestedIds.add(member.UserOrGroupId);
        } else {
            returnIds.add(member.UserOrGroupId);
        }
    }
    if (nestedIds.size() > 0) {
        returnIds.addAll(getGroupEventRelations(nestedIds));
    }
    return returnIds;
}
getGroupEventRelations contains a SOQL query, and since it is called inside a loop over groups, someone with over 100 groups that have members, or a chain of 100 nested groups, is going to hit Salesforce's governor limit on SOQL queries pretty quickly.
I am wondering if anyone knows of a way to get rid of the SOQL query inside getGroupEventRelations, so there is no query in the loop. When I want the group members for a specific group, I don't really see a way around this without more loops inside loops, where I'd risk running into the CPU timeout governor limit instead :(
Thank you in advance for any help!

At large enough numbers there's no solution; you'll run into SOME governor limit. But you can certainly make your code work with bigger numbers than it does now. Here's a quick little cheat you could do to cut the nesting several-fold: instead of matching only members whose immediate group is in the set (a single level of children), also match members whose parent, grandparent, great-grandparent, etc. group is in the set, all in one query.
[SELECT Id, GroupId, UserOrGroupId
 FROM GroupMember
 WHERE (GroupId IN :groupIds
     OR Group.GroupId IN :groupIds
     OR Group.Group.GroupId IN :groupIds
     OR Group.Group.Group.GroupId IN :groupIds
     OR Group.Group.Group.Group.GroupId IN :groupIds
     OR Group.Group.Group.Group.Group.GroupId IN :groupIds)
   AND UserOrGroupId NOT IN :returnIds];
You just got six levels of membership (the group itself plus five ancestors) in one SOQL call, so you can support that many times more nesting levels now. Note that I added a 'NOT IN' clause on UserOrGroupId to make sure you don't repeat children that you already have, since you won't know which Ids came from the bottom level.
You can also make your very first call for all groups at once instead of one group at a time. So if someone has 100 groups you'll make just one call instead of 100.
List<Group> groups = groupMap.values();
List<GroupMember> allMembers = [SELECT Id, GroupId, UserOrGroupId
                                FROM GroupMember WHERE GroupId IN :groups];
Lastly, you could query all GroupMembers in a single SOQL call and then iterate yourself. Like you said, you risk running into the 10-second CPU limit here, but if the number of groups isn't in the millions you'll likely be just fine, especially if you do some O(n) analysis and choose good data structures and algorithms. On the plus side, you won't have to worry about SOQL limits regardless of the nesting and the tree complexity. This answer should be very helpful; it does almost exactly what you'd have to do if you pulled all members in one call:
How to efficiently build a tree from a flat structure?
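For illustration, here is a minimal sketch of that pull-everything-then-traverse approach. It is not from the original post: the method names are mine, the query pulls every GroupMember in the org, and it assumes the total row count stays under the 50,000-row query limit.

// Query every GroupMember once and index children by parent group.
global static Map<Id, Set<Id>> buildChildMap() {
    Map<Id, Set<Id>> childrenByGroup = new Map<Id, Set<Id>>();
    for (GroupMember gm : [SELECT GroupId, UserOrGroupId FROM GroupMember]) {
        if (!childrenByGroup.containsKey(gm.GroupId)) {
            childrenByGroup.put(gm.GroupId, new Set<Id>());
        }
        childrenByGroup.get(gm.GroupId).add(gm.UserOrGroupId);
    }
    return childrenByGroup;
}

// Iterative depth-first walk over the in-memory map: nested groups are
// expanded, user Ids are collected, and a visited set guards against cycles.
global static Set<Id> resolveUsers(Id root, Map<Id, Set<Id>> childrenByGroup) {
    Set<Id> users = new Set<Id>();
    Set<Id> visited = new Set<Id>();
    List<Id> stack = new List<Id>{ root };
    while (!stack.isEmpty()) {
        Id current = stack.remove(stack.size() - 1);
        if (visited.contains(current) || !childrenByGroup.containsKey(current)) {
            continue;
        }
        visited.add(current);
        for (Id child : childrenByGroup.get(current)) {
            if (child.getSObjectType() == Schema.Group.SObjectType) {
                stack.add(child);   // nested group: traverse deeper
            } else {
                users.add(child);   // user: collect
            }
        }
    }
    return users;
}

With this shape you call buildChildMap() once (one SOQL query total) and then resolveUsers() per group with no further queries, no matter how deep the nesting goes.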

Order results by number of coincidences in edge properties

I'm working on a recommendation system that recommends other users. The first results should be the users most "similar" to the "searcher" user. Users respond to questions, and the number of questions answered in the same way is the measure of similarity.
The problem is that I don't know how to write the query.
In technical terms: I need to sort the users by the number of edges that have specific property values. I tried this query, which I thought should work, but it doesn't:
let query = g.V().hasLabel('user');
let search = __;
for (const question of searcher.questions) {
    search = search.outE('response')
        .has('questionId', question.questionId)
        .has('answerId', question.answerId)
        .aggregate('x')
        .cap('x');
}
query = query.order().by(search.unfold().count(), order.asc);
It throws this Gremlin internal error:
org.apache.tinkerpop.gremlin.process.traversal.step.util.BulkSet cannot be cast to org.apache.tinkerpop.gremlin.structure.Vertex
I also tried with multiple .by() modulators, one for each question, but the result was not ordered by the number of coincidences.
How can I write this query?
When you cap() an aggregate() it returns a BulkSet, which is a Set that keeps a count of how many times each object exists in it. It behaves like a List when you iterate through it, unrolling each object by its associated count. So you get your error because the output of cap('x') is a BulkSet, but since you are building search in a loop you are basically calling outE('response') on that BulkSet, and that's not valid syntax: has() expects a graph Element such as a Vertex, as indicated by the error.
I think you would prefer something more like:
let query = g.V().hasLabel('user').
    outE('response');
let search = [];
for (const question of searcher.questions) {
    search.push(has('questionId', question.questionId).
        has('answerId', question.answerId));
}
query = query.or(...search).
    groupCount().
    by(outV()).
    order(local).by(values, asc);
I may not have the JavaScript syntax exactly right (and I used spread syntax in my or() just to convey quickly what needs to happen), but the basic idea here is to filter edges that match your question criteria and then use groupCount() to count up those edges per user.
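Filling in the surrounding boilerplate, a self-contained sketch might look like the following. The connection URL, the similarUsers function name, and the shape of searcher.questions are my assumptions, not from the original answer:

const gremlin = require('gremlin');
const { order, column, scope, statics: __ } = gremlin.process;
const traversal = gremlin.process.AnonymousTraversalSource.traversal;
const { DriverRemoteConnection } = gremlin.driver;

const g = traversal().withRemote(
    new DriverRemoteConnection('ws://localhost:8182/gremlin'));

async function similarUsers(searcher) {
    // One has()-pair per answered question, OR'd together.
    const filters = searcher.questions.map(q =>
        __.has('questionId', q.questionId).has('answerId', q.answerId));

    // Count matching response edges per user and order by that count.
    return g.V().hasLabel('user').
        outE('response').
        or(...filters).
        groupCount().by(__.outV()).
        order(scope.local).by(column.values, order.asc).
        toList();
}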
If you need to count users who have no connection then perhaps you could switch to project() - maybe like:
let query = g.V().hasLabel('user').
project('user','count').
by();
let search = [];
for (const question of searcher.questions) {
search.push(has('questionId', question.questionId).
has('answerId', question.answerId));
}
query = query.by(outE('response').or(...search).count()).
order().by('count', asc);
FWIW, I think you might consider a different schema for your data, one that makes this recommendation algorithm a bit more graph-like. One thought is to make each question/answer pair a vertex (with a "qa" label, perhaps) and have edges go from the user vertex to the "qa" vertex. Then users link directly to the question/answers they gave, and you can easily see, by way of edges (a direct relationship), which users gave the same question/answer combination. That change allows the query to flow much more naturally when asking, "Which users answered questions the same way user 'A' did?"
g.V().has('person','name','A').
    out('responds').
    in('responds').
    groupCount().
    order(local).by(values)
With that change you can see that we rid ourselves of all those has() filters, because they are implicit in the "responds" edges, which encode them into the graph data itself.

ArangoDB performance: edge vs. DOCUMENT()

I'm new to ArangoDB with graphs. I simply want to know if it is faster to build edges or to use DOCUMENT() for very simple 1:1 connections where querying the graph is not needed.
LET a = DOCUMENT(#from)
FOR v IN OUTBOUND a CollectionAHasCollectionB
    RETURN MERGE(a, {b: v})
vs.
LET a = DOCUMENT(#from)
RETURN MERGE(a, {b: DOCUMENT(a.bId)})
A simple benchmark you can try:
Create the collections products, categories and an edge collection has_category. Then generate some sample data:
FOR i IN 1..10000
    INSERT {_key: TO_STRING(i), name: CONCAT("Product ", i)} INTO products

FOR i IN 1..10000
    INSERT {_key: TO_STRING(i), name: CONCAT("Category ", i)} INTO categories

FOR p IN products
    LET random_categories = (
        FOR c IN categories
            SORT RAND()
            LIMIT 5
            RETURN c._id
    )
    LET category_subset = SLICE(random_categories, 0, RAND()*5+1)
    UPDATE p WITH {
        categories: category_subset,
        categoriesEmbedded: DOCUMENT(category_subset)[*].name
    } INTO products
    FOR cat IN category_subset
        INSERT {_from: p._id, _to: cat} INTO has_category
Then compare the query times for the different approaches.
Graph traversal (depth 1..1):
FOR p IN products
    RETURN {
        product: p.name,
        categories: (FOR v IN OUTBOUND p has_category RETURN v.name)
    }
Look-up in categories collection using DOCUMENT():
FOR p IN products
    RETURN {
        product: p.name,
        categories: DOCUMENT(p.categories)[*].name
    }
Using the directly embedded category names:
FOR p IN products
    RETURN {
        product: p.name,
        categories: p.categoriesEmbedded
    }
Graph traversal is the slowest of the three; the lookup in another collection is faster than the traversal; but by far the fastest query is the one with embedded category names.
If you query the categories for just one or a few products, however, the response times should be in the sub-millisecond range regardless of the data model and query approach, and therefore will not pose a performance problem.
The graph approach should be chosen if you need to query for paths with variable depth, long paths, shortest paths, etc. For your use case it is not necessary. Whether the embedded approach is suitable or not is something you need to decide:
Is it acceptable to duplicate information, and potentially have inconsistencies in the data? (If you want to change a category name, you need to change it in all product records instead of in just one category document, which products can refer to via an immutable ID.)
Is there a lot of additional information per category? If so, all that data needs to be embedded into every product document that has that category, basically trading memory / storage space for performance.
Do you need to retrieve a list of all (distinct) categories often? You can do this type of query really cheaply with the separate categories collection. With the embedded approach it will be much less efficient, because you need to go over all products and collect the category info; compare the two sketches below.
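For instance (my illustration, assuming the benchmark collections above), listing all category names against the separate collection is a trivial scan of a small collection, while the embedded variant has to touch every product and deduplicate:

// separate collection: one small scan, names are unique by construction
FOR c IN categories
    RETURN c.name

// embedded: visit every product, then deduplicate the names
FOR p IN products
    FOR name IN p.categoriesEmbedded
        COLLECT distinctName = name
        RETURN distinctName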
Bottom line: you should choose the data model and approach that fits your use case best. Thanks to ArangoDB's multi-model nature you can easily try another approach if your use case changes or you run into performance issues.
Generally speaking, the latter variant
LET a = DOCUMENT(#from)
RETURN MERGE(a, {b: DOCUMENT(a.bId)})
should have lower overhead than the full-featured traversal variant. This is because the DOCUMENT variant does a point lookup of a document, whereas the traversal variant is very general-purpose: it can return zero to many results from a variable number of collections, needs to keep track of the path seen, etc.
When I tried both variants in a local test case, the non-traversal variant was also a lot faster, supporting this claim.
However, the traversal-based variant is more flexible: it can also be used should there be multiple edges (no 1:1 mapping) and for longer paths.

Retain the value/state of a variable for a particular id of a spout in Storm

I have defined a single bolt that calculates a certain threshold. The bolt receives data for several values of a field. Is it possible to retain the value/state of a variable for a particular value of the field?
Suppose I have two sets of tuple inputs, s$tuple$input:
s$id = "21343254545454354343"    s$id = "45645465645456561234"
s$tuple$input = ["ABC", 2]       s$tuple$input = ["CDE", 5]
Is it possible to retain the value of a variable, like counter = 5 for "ABC" and counter = 9 for "CDE", and update them only when a tuple for the corresponding id is received?
I haven't played with Storm and R, but hopefully the ideas are similar to Java.
You have a few options for storing state:
In worker memory (per bolt)
External store (not within Storm)
What you choose depends on your requirements, but let's assume you're just trying to count words and don't really care if a worker dies. For this, the implementation is simple: just create a private variable in your bolt and keep track.
For example, let's say you have a counts variable:
Map<String, Integer> counts = new HashMap<String, Integer>();
Then, in your bolt's execute method, you just check whether you've seen the word before and, if so, increment the count:
Integer count = counts.get(word);
if (count == null) {
    count = 0;
}
count++;
counts.put(word, count);
Source: WordCountBolt.java
You also want to consider how tuples flow to the workers. You probably don't want to use shuffle grouping anymore; instead, use a fields grouping on the ID so that tuples with the same ID go to the same bolt task, for example:
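A minimal wiring sketch (the component names, classes, and parallelism here are my own placeholders, not from the original answer):

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("input-spout", new InputSpout(), 1);
// fieldsGrouping routes every tuple with the same "id" value
// to the same bolt task, so its in-memory counts stay consistent.
builder.setBolt("count-bolt", new CountBolt(), 4)
       .fieldsGrouping("input-spout", new Fields("id"));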
Going forward, you probably want something more durable (so that losing a worker doesn't lose all your counts), so you'd probably store your counts in something like HBase.

Graph DB: get the next best recommended node in Neo4j Cypher

I have a graph in Neo4j and am currently trying to build a simple recommendation system that is better than text-based search.
Nodes are created as: Album, People, Type, Chart
Relationships are created as:
People-[:role]->Album, where roles are: Artist, Producer, Songwriter
Album-[:is_a_type_of]->Type (Type is basically Pop, Rock, Disco...)
People-[:POPULAR_ON]->Chart (Chart is which Billboard chart they might have been on)
People-[:SIMILAR_TO]->People (predetermined similarity connection)
I have written the following Cypher:
MATCH (a:Album { id: { id } })-[:is_a_type_of]->(t)<-[:is_a_type_of]-(recommend)
WITH recommend, t, a
MATCH (recommend)<-[:ARTIST_OF]-(p)
OPTIONAL MATCH (p)-[:POPULAR_ON]->()
RETURN recommend, count(DISTINCT t) AS type
ORDER BY type DESC
LIMIT 25;
It works; however, it easily repeats itself if an album has only one type of music connected to it and therefore has the same neighbors.
Is there a suggested way to say: find me the next best album that has the most similar connected relationships to the starting album?
Any recommendation for a tie-breaker scenario? Right now it orders by type, so an album with more than one type of music is valued more, but if every album has the same number there is nothing more significant to distinguish them.
- I made the [:SIMILAR_TO] link to enforce a priority, marking that relationship as important, but I haven't gotten a working Cypher query with it.
- Same goes for [:POPULAR_ON] (maybe drop this relationship?)
You can use four configuration weights and order albums by their summed value, highest first. Keep each weight between 0 and 1 (e.g. 0.6):
a. People popular on a Chart and People are similar
b. People popular on a Chart and People are not similar
c. People not popular on a Chart and People are similar
d. People not popular on a Chart and People are not similar
Calculate and sum these four values for each album; the higher the value, the more strongly the album is recommended.
I have temporarily set the weights as a = 1, b = 0.8, c = 0.6, d = 0.4, and assumed some relationship is present which indicates that People like an Album. If you are basing the logic on Chart only, then use a & b only.
MATCH (me:People)
WHERE id(me) = 123
MATCH (album:Album { id: 456 })-[:is_a_type_of]->(t:Type)<-[:is_a_type_of]-(recommend)
// a: popular artists that "me" is similar to
OPTIONAL MATCH (recommend)<-[:ARTIST_OF]-(a:People)-[:POPULAR_ON]->(:Chart)
WHERE exists((me)-[:SIMILAR_TO]->(a))
// b: popular artists that "me" is not similar to
OPTIONAL MATCH (recommend)<-[:ARTIST_OF]-(b:People)-[:POPULAR_ON]->(:Chart)
WHERE NOT exists((me)-[:SIMILAR_TO]->(b))
// c: people who like the album and are similar to "me"
OPTIONAL MATCH (recommend)<-[:LIKES]-(c:People)
WHERE exists((me)-[:SIMILAR_TO]->(c))
// d: people who like the album and are not similar to "me"
OPTIONAL MATCH (recommend)<-[:LIKES]-(d:People)
WHERE NOT exists((me)-[:SIMILAR_TO]->(d))
RETURN recommend, (count(a)*1 + count(b)*0.8 + count(c)*0.6 + count(d)*0.4) AS rec_order
ORDER BY rec_order DESC
LIMIT 10;
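As a simpler starting point for the tie-breaker question, here is a sketch of my own (not part of the answer above) that keeps the original type count as the primary sort key and uses the number of [:SIMILAR_TO] connections between the starting album's artists and the candidate album's artists as the secondary key:

MATCH (start:Album { id: { id } })-[:is_a_type_of]->(t)<-[:is_a_type_of]-(recommend)
WITH start, recommend, count(DISTINCT t) AS type
// tie-breaker: artists of the candidate album who are similar
// to artists of the starting album
OPTIONAL MATCH (start)<-[:ARTIST_OF]-(:People)-[:SIMILAR_TO]-(p:People)-[:ARTIST_OF]->(recommend)
RETURN recommend, type, count(DISTINCT p) AS similarArtists
ORDER BY type DESC, similarArtists DESC
LIMIT 25;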

CouchDB: Merging Objects in Reduce Function

I'm new to CouchDB, so bear with me. I searched SO for an answer, but couldn't narrow it down to this specifically.
I have a map function which creates values for a user. The users have seen different product pages, and we want to tally the types and products they've seen.
var emit_values = {};
emit_values.name = doc.name;
...
emit_values.productsViewed = {};
emit_values.productsViewed[doc.product] = 1;
emit([doc.id, doc.customer], emit_values);
In the reduce function, I want to gather different values into that productsViewed object for a given user. So after the reduce, I have this:
productsViewed: {
    book1: 1,
    book3: 2,
    book8: 1
}
Unfortunately, doing this triggers a reduce overflow error. According to other posts, this is because the productsViewed object keeps growing in the reduce function, and Couch doesn't like that. Specifically:
A common mistake new CouchDB users make is attempting to construct complex aggregate values with a reduce function. Full reductions should result in a scalar value, like 5, and not, for instance, a JSON hash with a set of unique keys and the count of each.
So, I understand this is not the right way to do this in Couch. Does anyone have any insight into how to properly gather values into a document after the reduce?
You simply build a view with the customer as the key
emit(doc.customer, doc.product);
Then you can call
/:db/_design/:name/_view/:name?key=":customer"
to get all products a user has viewed.
If a customer can have viewed a product several times, you can build a multi-key view
emit([doc.customer, doc.product], null);
and reduce it with the built-in function _count:
/:db/_design/:name/_view/:name?startkey=[":customer","\u0000"]&endkey=[":customer","\u9999"]&reduce=true&group_level=2
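With group_level=2, the response would look something like this (illustrative data of my own, mirroring the tallies you want):

{"rows": [
    {"key": ["customer-42", "book1"], "value": 1},
    {"key": ["customer-42", "book3"], "value": 2},
    {"key": ["customer-42", "book8"], "value": 1}
]}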
You have to accept that you cannot "construct complex aggregate values" with CouchDB by requesting the view. If you want a data structure like your desired payload
productsViewed: {
    book1: 1,
    book3: 2,
    book8: 1
}
I recommend using an _update handler on the customer doc: every request that logs a product visit adds to the customer.productsViewed property instead of creating a new doc, for example:
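A minimal sketch of such a handler (my illustration; the query parameter name and the response body are assumptions):

function (doc, req) {
    // If the customer doc doesn't exist yet, create it.
    if (!doc) {
        doc = { _id: req.id, productsViewed: {} };
    }
    if (!doc.productsViewed) {
        doc.productsViewed = {};
    }
    var product = req.query.product;
    doc.productsViewed[product] = (doc.productsViewed[product] || 0) + 1;
    return [doc, 'logged view of ' + product];
}

It would be called with something like POST /:db/_design/:name/_update/:handler/:customer?product=book1, and each call increments the counter for that product on the customer document.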
