I have a CouchDB database that stores nearly half a million tweets. Each tweet has a screen_name. I use a map/reduce view in CouchDB to list all unique screen names. But how can I find out how many different screen names there are in this database? My JavaScript code:
map.js:
function(doc) {
emit(doc.screen_name, 1);
}
reduce.js:
_stats
You can answer the "how many" questions by using the group parameter. You already have a _stats reduce in place, so all you need to do now is query:
http://localhost:5984/your_db/_design/your_ddoc/_view/your_view?group=false&reduce=true
Which will give you a result like
{"rows":[
{"key":null,"value":{"sum":13700,"count":40,"min":232,"max":674,"sumsqr":6157480}}
]}
The value object in the result has a "count" key, which holds the number of rows in your view, one per emitted screen_name (duplicates included). This should give you the answer to "how many screen_names are there?"
If you instead pass
?group=true
to the same view URL, you should get a result like
{"rows":[
{"key":"some_key","value":{"sum":696,"count":3,"min":232,"max":232,"sumsqr":161472}}
]}
which gives you _stats for each unique key, one row per key. The number of rows in this result is therefore the answer to "how many unique screen_names are there?"
You can use group levels for complex keys, but for your use case I think group=false and group=true should be sufficient.
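To make the difference between the two queries concrete, here is a minimal in-memory sketch in Python. It does not talk to CouchDB; the sample documents and the helper names map_doc and stats are made up for illustration, and stats only mimics the fields shown in the results above.

```python
from collections import defaultdict

# Hypothetical sample documents; only screen_name matters here.
docs = [
    {"screen_name": "alice"},
    {"screen_name": "bob"},
    {"screen_name": "alice"},
    {"screen_name": "carol"},
    {"screen_name": "alice"},
]


def map_doc(doc):
    """Mirror of map.js: emit(doc.screen_name, 1)."""
    yield doc["screen_name"], 1


def stats(values):
    """Imitates the fields reported by the built-in _stats reduce."""
    return {
        "sum": sum(values),
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sumsqr": sum(v * v for v in values),
    }


emitted = [kv for doc in docs for kv in map_doc(doc)]

# group=false: a single row reduced over every emitted value.
ungrouped = {"rows": [{"key": None, "value": stats([v for _, v in emitted])}]}

# group=true: one row per distinct key.
by_key = defaultdict(list)
for key, value in emitted:
    by_key[key].append(value)
grouped = {"rows": [{"key": k, "value": stats(vs)} for k, vs in sorted(by_key.items())]}

total_rows = ungrouped["rows"][0]["value"]["count"]  # all screen_names, duplicates included
unique_names = len(grouped["rows"])                  # distinct screen_names
print(total_rows, unique_names)  # 5 3
```

Against a real database the same two numbers would come from the "count" field of the group=false response and from the length of "rows" in the group=true response.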
I want to retrieve a specific range of users (required for pagination) along with the total count. The query below retrieves the list of user vertices as expected, but the total count is returned as a BulkSet:
Map<String, Object> result = gt.V().hasLabel("user").sideEffect(__.count().store("total"))
.order().by("name", Order.desc)
.range(0, 10).fold().as("list")
.select("list","total").next();
The output is as below
How do I get the correct count as a Long value instead of the BulkSet?
Paging with Gremlin is discussed here and references this blog post which provides additional information on the topic. Those resources should help you with your paging strategy.
You framed this question in terms of inquiring about BulkSet so it isn't quite a duplicate of the answer I referenced, so I will try to answer that much for you. BulkSet allows for an important traversal optimization in TinkerPop which helps reduce object propagation, thus reducing memory requirements for a particular query. It does this by holding the traverser object and its count where the count is the number of times that object has been added to the BulkSet. Calling size() or longSize() (where the latter returns a long and the former returns int) will return the summation of the counts and therefore the "correct" or actual count of the objects. A call to uniqueSize() will return actual size of the set which will be the unique objects within it.
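As a rough analogy only (this is not TinkerPop code), Python's collections.Counter behaves much like the BulkSet described above: it maps each object to the number of times it was added. The vertex labels below are invented for the example.

```python
from collections import Counter

# Counter as a stand-in for BulkSet: each object maps to its bulk count.
bulk = Counter()
for vertex in ["v[1]", "v[2]", "v[1]", "v[1]"]:
    bulk[vertex] += 1

size = sum(bulk.values())         # analogous to size() / longSize()
unique_size = len(bulk)           # analogous to uniqueSize()
unrolled = list(bulk.elements())  # iterating repeats each object by its count

print(size, unique_size)  # 4 2
```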
If you want the size of the BulkSet you just need to count() it:
gt.V().hasLabel("user").sideEffect(__.count().store("total"))
.order().by("name", Order.desc)
.range(0, 10).fold().as("list")
.select("list","total")
.by()
.by(count(local))
That said, I don't think your traversal is really doing what you want. The sideEffect() just counts the current traverser, which will simply return "1", and then you store that "1" in the "total" side-effect. At least that's what I see with TinkerGraph:
gremlin> g.V().hasLabel("person").sideEffect(count().store("total")).range(0,1).fold().as('list').select('list','total').by().by(count(local))
==>[list:[v[1]],total:1]
gremlin> g.V().hasLabel("person").sideEffect(count().store("total")).range(0,10).fold().as('list').select('list','total').by().by(count(local))
==>[list:[v[1],v[2],v[4],v[6]],total:4]
Interesting that JanusGraph somehow gives you 114 rather than 10 for the "total". I'd not expect that. I'd consider avoiding reliance on that "feature" in the case it is a "bug" that is later "fixed". Instead, please consider the posts I'd provided and look at them for inspiration.
I'm working on a recommendation system that recommends other users. The first results should be the most "similar" users to the "searcher" user. Users respond to questions and the amount of questions responded in the same way is the amount of similarity.
The problem is that I don't know how to write the query
In technical terms, I need to sort the users by the number of edges that have specific property values. I tried this query, which I thought should work, but it doesn't:
let query = g.V().hasLabel('user');
let search = __;
for (const question of searcher.questions) {
search = search.outE('response')
.has('questionId', question.questionId)
.has('answerId', question.answerId)
.aggregate('x')
.cap('x')
}
query = query.order().by(search.unfold().count(), order.asc);
Throws this gremlin internal error:
org.apache.tinkerpop.gremlin.process.traversal.step.util.BulkSet cannot be cast to org.apache.tinkerpop.gremlin.structure.Vertex
I also tried with multiple .by() for each question, but the result was not ordered by the amount of coincidence.
How can I write this query?
When you cap() an aggregate() it returns a BulkSet, which is a Set that keeps a count of how many times each object was added. When you iterate it, it behaves like a List, repeating each object as many times as its count. You get your error because the output of cap('x') is a BulkSet, and since you build search in a loop, the next iteration effectively calls outE('response') on that BulkSet. That's not valid, because outE() expects a Vertex, as indicated by the error.
I think you would prefer something more like:
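// Assumes the usual gremlin-javascript aliases: __ = gremlin.process.statics,
// with order, scope and column taken from gremlin.process.
let query = g.V().hasLabel('user').
  outE('response');
// One filter per question the searcher answered.
let search = [];
for (const question of searcher.questions) {
  search.push(__.has('questionId', question.questionId).
    has('answerId', question.answerId));
}
// Keep edges matching any filter, then count them per user.
query = query.or(...search).
  groupCount().
  by(__.outV()).
  order(scope.local).by(column.values, order.asc);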
I may not have the javascript syntax exactly right (and I used spread syntax in my or() to just convey the idea quickly of what needs to happen) but basically the idea here is to filter edges that match your question criteria and then use groupCount() to count up those edges.
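The filter-then-groupCount idea can be sketched without a graph at all. The following Python snippet is only an illustration of the logic, with made-up users and answers; each tuple stands in for a 'response' edge.

```python
from collections import Counter

# Hypothetical 'response' edges: (user, questionId, answerId).
responses = [
    ("ann", 1, "a"), ("ann", 2, "b"), ("ann", 3, "c"),
    ("bob", 1, "a"), ("bob", 2, "x"),
    ("cat", 1, "z"),
]
# What the searcher answered, as (questionId, answerId) pairs.
searcher_answers = {(1, "a"), (2, "b"), (3, "c")}

# or(has(...).has(...), ...): keep edges matching any question/answer pair.
matching = [user for user, q, a in responses if (q, a) in searcher_answers]

# groupCount().by(outV()): number of matching edges per user.
counts = Counter(matching)

# order(local).by(values, asc): sort the map entries by their count.
ranked = sorted(counts.items(), key=lambda kv: kv[1])
print(ranked)  # [('bob', 1), ('ann', 3)]
```

Note that "cat" has no matching edge and so does not appear in the result at all.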
If you need to count users who have no connection then perhaps you could switch to project() - maybe like:
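// Assumes __ = gremlin.process.statics and order = gremlin.process.order.
let query = g.V().hasLabel('user').
  project('user', 'count').
  by();
let search = [];
for (const question of searcher.questions) {
  search.push(__.has('questionId', question.questionId).
    has('answerId', question.answerId));
}
// Users with no matching edges still appear, with a count of 0.
query = query.by(__.outE('response').or(...search).count()).
  order().by('count', order.asc);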
fwiw, I think you might consider a different schema for your data that might make this recommendation algorithm a bit more graph-like. A thought might be to make the question/answer a vertex (a "qa" label perhaps) and have edges go from the user vertex to the "qa" vertex. Then users directly link to the question/answers they gave. You can easily see by way of edges, a direct relationship, which users gave the same question/answer combination. That change allows the query to flow much more naturally when asking the question, "What users answered questions in the same way user 'A' did?"
g.V().has('person','name','A').
out('responds').
in('responds').
groupCount().
order(local).by(values)
With that change you can see that we can rid ourselves of all those has() filters because they are implied by the "responds" edges, which encode them into the graph data itself.
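To illustrate why that schema makes the query simpler, here is a small Python sketch of the same out/in/groupCount walk over plain dictionaries. The users and "qa" identifiers are hypothetical, and this only mimics the traversal rather than running Gremlin.

```python
from collections import Counter, defaultdict

# Hypothetical 'responds' edges from a user to a "qa" vertex.
responds = [
    ("A", "q1:a"), ("A", "q2:b"),
    ("B", "q1:a"), ("B", "q2:b"),
    ("C", "q1:a"), ("C", "q2:z"),
]

out_edges = defaultdict(set)  # user -> qa vertices
in_edges = defaultdict(set)   # qa vertex -> users
for user, qa in responds:
    out_edges[user].add(qa)
    in_edges[qa].add(user)

# out('responds').in('responds').groupCount(): walk from A to its qa
# vertices and back to every user attached to them.
shared = Counter(other for qa in out_edges["A"] for other in in_edges[qa])

# order(local).by(values): sort by count (name breaks ties for determinism).
ranked = sorted(shared.items(), key=lambda kv: (kv[1], kv[0]))
print(ranked)  # [('C', 1), ('A', 2), ('B', 2)]
```

As in the traversal, walking out and back in also reaches "A" itself, so the searcher would need to be filtered out if that is unwanted.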
I have a class
class Topic {
Integer id
String name
Integer numberPosts
}
and another one
class TopicDetails {
Integer id
Integer numberPosts
}
The second is actually a container for query results, which is why the two look similar.
I have two lists, List<Topic> and List<TopicDetails>. Objects are unique by id within each list, and every id in the second list also appears in the first.
I want to merge the data from the second list into the first list. I understand that there are simple ways, such as iterating over both lists, checking the ids and merging the details, or using a map for the details.
But is there some better way to do this? Collection framework has many new methods so I was thinking that there may be some elegant way to do this in groovy instead of doing the above mentioned methods.
EDIT: I forgot to mention that the first list initially has no numberPosts information. That is why the second one exists, as a container for information from the database.
A List is still just a list. You can use lambda expressions and "find" the ID each time, but you gain nothing in efficiency, because every lookup still scans the list. A map is the way to go, at least for one of the lists.
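A minimal sketch of the map-based merge, written in Python rather than Groovy to show the idea; the dataclasses mirror the question's Topic and TopicDetails, and merge_details is a name made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Topic:
    id: int
    name: str
    number_posts: Optional[int] = None  # unknown until merged


@dataclass
class TopicDetails:
    id: int
    number_posts: int


def merge_details(topics: List[Topic], details: List[TopicDetails]) -> List[Topic]:
    """Copy number_posts onto each topic that has a matching details entry."""
    by_id = {d.id: d for d in details}  # one pass to index the details by id
    for topic in topics:
        match = by_id.get(topic.id)     # constant-time lookup per topic
        if match is not None:
            topic.number_posts = match.number_posts
    return topics


topics = [Topic(1, "groovy"), Topic(2, "java"), Topic(3, "grails")]
details = [TopicDetails(2, 7), TopicDetails(3, 12)]
merge_details(topics, details)
print([(t.id, t.number_posts) for t in topics])  # [(1, None), (2, 7), (3, 12)]
```

Indexing one list by id means each list is traversed once, instead of scanning the details list again for every topic.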
Is it possible to get a users first name or surname from a freebase query?
For example, I have a person entry I have the id of, but I just want to extract their first name.
{
"id": "/en/paul_thomas_anderson",
"name" : null
}
How would I modify this query? It's something I've found nothing about by googling or searching here on S.O. I know this kind of thing is possible in dbpedia for most people entries.
No, it's not possible directly. The name is stored as a single unit. There are topics for given names and surnames (e.g. http://www.freebase.com/view/base/givennames/given_name), so you could split the name and see which list(s) it appears in, but that's indirect and doesn't tell you about the specific person you are querying.
Let's say I've got an online dating site where users can filter the user list based on various criteria: Height, Age, BodyType, Ethnic Origin...
I want to pass the criteria to the pager, via QueryString. Height and Age are easy as these are ranges, and I would use
MinHeight=3&MaxHeight=12&MinAge=21&MaxAge=30
However, other criteria like BodyType and Ethnic Origin are lists of ForeignKey values, e.g.:
Ethnicity:2,3,5
What is the best way to pass these as a QueryString? Should I convert it to a Json string eg:
www.site.com?filterjson={\"minage\":0,\"maxage\":0,\"minheight\":0,\"maxheight\":0,\"bodytypelist\":[1,2,3],\"ethnicitylist\":[2,3,4],\"eyecolorlist\":[],\"haircolorlist\":[],\"orientationlist\":[]}
Or is this not-valid/overkill/too complex?
Maybe something like this:
MinHeight=3&MaxHeight=12&bodytypes=1,2,3&
and parse the list values by splitting on the ','?
I don't know the ups and downs of all these ideas. So how would you pass a list of values in a querystring?
Using comma-separated values is the most pragmatic approach in my opinion. You can use this code to split values:
if (!string.IsNullOrEmpty(Request.QueryString["bodytypes"]))
{
string[] rgs = Request.QueryString["bodytypes"].Split(new char[] { ',' });
}
Both will work, though a querystring is much easier for users to tamper with. However, if you validate it well against malicious or unexpected values, I say it's fine.
Consuming data via querystring is relatively more straightforward than from JSON.
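As a rough illustration of both points above (splitting on commas and guarding against unexpected values), here is a small Python sketch. The function name parse_id_list and the ethnicity parameter name are assumptions made for the example, and silently dropping bad entries is just one possible policy.

```python
from urllib.parse import parse_qs


def parse_id_list(query: str, name: str) -> list:
    """Read a comma-separated list of non-negative integer ids from a query string.

    Entries that are not plain ASCII digits are dropped rather than trusted.
    """
    raw = parse_qs(query).get(name, [""])[0]  # missing parameter -> empty string
    ids = []
    for part in raw.split(","):
        part = part.strip()
        if part.isascii() and part.isdigit():
            ids.append(int(part))
    return ids


query = "MinHeight=3&MaxHeight=12&bodytypes=1,2,3&ethnicity=2,x,5"
print(parse_id_list(query, "bodytypes"))  # [1, 2, 3]
print(parse_id_list(query, "ethnicity"))  # [2, 5]
print(parse_id_list(query, "eyecolor"))   # []
```

Whether to drop invalid entries or reject the whole request is a design choice; rejecting is stricter, while dropping keeps the pager working when a link is mangled.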