Local analysis (pseudo-relevance feedback) - information-retrieval

While studying relevance feedback (pseudo-relevance feedback), I have learned that the model can go horribly wrong for some queries. Can anyone give reasons why this is?

The problem is also called query drift: if the top-k retrieved documents are all (or mostly) about a particular sub-topic, the importance of that sub-topic is boosted by the feedback mechanism.
A textbook example is a query about "copper mines": if most of the retrieved documents are about "copper mines in Chile", the feedback process will drift the results towards documents on Chile.
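To make the mechanism concrete, here is a minimal sketch in Python of Rocchio-style expansion under the pseudo-relevance assumption; the toy vocabulary and weights are made up for illustration, not taken from any particular system:

from collections import Counter

def rocchio_expand(query_terms, feedback_docs, alpha=1.0, beta=0.75, top_n=5):
    # Naive Rocchio-style expansion: original query terms get weight alpha,
    # and every term is boosted by beta times its average frequency in the
    # (pseudo-)relevant feedback documents.
    weights = Counter({t: alpha for t in query_terms})
    centroid = Counter()
    for doc in feedback_docs:
        centroid.update(doc)
    for term, freq in centroid.items():
        weights[term] += beta * freq / len(feedback_docs)
    return [term for term, _ in weights.most_common(top_n)]

# Query "copper mines", but four of the five top-ranked documents are about Chile.
feedback = [
    ["copper", "mines", "chile", "santiago"],
    ["copper", "chile", "mining", "export"],
    ["chile", "copper", "mines"],
    ["copper", "mines", "chile"],
    ["copper", "mines", "australia"],
]
print(rocchio_expand(["copper", "mines"], feedback))
# ['copper', 'mines', 'chile', ...] -- "chile" now outranks every other
# expansion term, so the reformulated query drifts towards Chile.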

Related

Neo4j Movie Database centrality

Hi, I'm fairly new to Neo4j and Cypher in general, and I just started playing around with the default movie database provided with Neo4j's installation.
I'm trying to obtain a measure that reflects the importance of each movie based on the actors who acted in it, one that in some iterative way also takes into account the importance of each actor based on the number of movies they acted in.
I guess my best option here would be to use the PageRank score:
CALL algo.pageRank(label, relationship, ...)
What would be, in this situation, the best and most correct way to obtain what I'm looking for?
I'm quite lost, and googling around didn't help as much as I hoped. Thanks for reading; sorry for my bad English.
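PageRank on the bipartite actor-movie graph is a natural fit here: a movie scores highly if important actors are in it, and an actor scores highly by appearing in (important) movies. A toy sketch of that mutual reinforcement in Python with networkx, on made-up data (Neo4j's PageRank procedures compute the same kind of score inside the database; the exact call is algo.* in the older Graph Algorithms plugin and gds.* in the newer Graph Data Science library):

import networkx as nx

# Undirected bipartite actor-movie graph; an edge means "acted in".
G = nx.Graph()
G.add_edges_from([
    ("Keanu Reeves", "The Matrix"),
    ("Keanu Reeves", "John Wick"),
    ("Carrie-Anne Moss", "The Matrix"),
    ("Laurence Fishburne", "The Matrix"),
])

scores = nx.pagerank(G)  # iterative: actor and movie scores feed each other
for movie in ("The Matrix", "John Wick"):
    print(movie, round(scores[movie], 3))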

Modeling document data and query performance

I have an aggregate data model (think a Customer entity with Widgets that belong to it as a list of embedded entities).
When I search for customers (e.g. DocumentDBRepository.GetItemsAsync), that will hydrate the customer data model along with the widgets for each. For efficiency reasons, I don't really need the customer search to consider the widgets.
Are there any strategies for this in document DBs (such as a "LiteCustomer" entity)? I suspect not, as that is just the nature of the "schema-less" data I've told it to store in the first place, but I'm interested to hear thoughts.
Is this simply a 'non-issue'?
First, a disclaimer: data modeling is hard. There are many nuances, and an SO question can never cover an entire business and everything left unsaid in both Q and A. There are no silver bullets. Regardless:
"LiteCustomer"
It's perfectly fine to have such a model in your client code. Your main Customer model may, and will, have many representations, most of them simple subsets of the full model. As in relational SQL, select only what you need; don't fetch data to the client that you don't need.
The SQL API provides quite cool SQL tools to compose the returned JSON documents for you.
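For example, with the Cosmos DB Python SDK the projection is just part of the query; the container and field names here are hypothetical:

from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com", credential="<key>")
container = client.get_database_client("mydb").get_container_client("customers")

# Project only the "lite" customer fields; the embedded widgets are
# never shipped to the client.
lite_customers = container.query_items(
    query="SELECT c.id, c.name, c.email FROM c WHERE c.city = @city",
    parameters=[{"name": "@city", "value": "Oslo"}],
    enable_cross_partition_query=True,
)
for customer in lite_customers:
    print(customer)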
Physical storage model may differ from domain model
Consider your usage scenarios. If many scenarios happen to work with a customer without its widgets (or vice versa), then consider having the widgets as separate document(s) in the storage model.
In DocDB, the question is often not so much the querying logic as what your application expects of the modification logic. Indexed querying is fast, and every SQL query can easily do transformations (though cross-document joining is troublesome). For C(R)UD you have fewer options: it's always by full document. Having too-large documents will end up with higher RU costs and complex code.
Questions to consider:
How often customer changes without widget count/details changing?
How often widgets change without customer changing?
Do widgets on customer change independently or always as a set?
When do you need transactional updates on customer+widget changes?
What would the queries look like? Can they be indexed?
Test.
True, changing the model later is cumbersome in DocDB, but don't try to fix something before you know it's broken. If you are not sure whether you have an issue, then most likely fixing the maybe-issue is costlier than not fixing it.
If in doubt, generate loads of data and test it out.

Modeling Reddit-style comments in DynamoDB

I am looking into using DynamoDB to store comments for my application. The comments will form a nested data structure like you would find on Reddit, so users can rate and reply to any comment. For example:
Topic1
  Comment1
    Reply1
    Reply2
  Comment2
    Reply1
My question is: how do I model the reply relationship in DynamoDB so I can query a topic's comments and all subsequent replies without doing a lot of grouping on the backend? This kind of data structure is obviously better suited to a graph database, but I am curious whether anyone has tried to model a tree-like data structure in DynamoDB.
With the document support introduced in late 2014, you can model tree data using the Map and List types. Your thread depth would be limited by the maximum nesting depth of a DynamoDB document, currently 32 levels. Alternatively, you could use the DynamoDB Storage Backend for Titan to model your message data as a graph. You get to decide how many hops your graph traversals perform, so you get to decide the limit on thread depth.
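A minimal sketch of the Map/List approach with boto3 (table and attribute names are hypothetical):

import boto3

table = boto3.resource("dynamodb").Table("Topics")

# Store the whole comment tree as nested Lists/Maps on a single item.
table.put_item(Item={
    "topic_id": "Topic1",
    "comments": [
        {"text": "Comment1", "score": 4, "replies": [
            {"text": "Reply1", "score": 2, "replies": []},
            {"text": "Reply2", "score": 1, "replies": []},
        ]},
        {"text": "Comment2", "score": 3, "replies": [
            {"text": "Reply1", "score": 0, "replies": []},
        ]},
    ],
})

# One read returns the whole tree -- no grouping on the backend.
tree = table.get_item(Key={"topic_id": "Topic1"})["Item"]

Keep in mind that the whole thread then lives in a single item, so DynamoDB's 400 KB item size limit caps the total thread size, in addition to the nesting limit mentioned above.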

Graph database modelling: Should I use a collection node to avoid too many relationships on a node

I'm currently working on my first application that uses a graph database (Neo4j). I'm in the process of modelling my graph on a whiteboard. My colleague and I are in a pickle about whether or not we should introduce a 'collection node'.
We have something like this (Cypher syntax, fictive example):
(parking:Parking) - Parking node
(car:Car) - Car node
Obviously, a Parking can have multiple Cars; let's say it can have up to 1 million cars.
Is it, in this case, better to introduce a new node:
(carCollection:CarCollection) - Car collection node?
A Parking could have a relationship to the 'car collection node', which in turn can have a lot of cars. This should prevent a simple query performed on the Parking node itself (say, querying the number of available spots) from losing performance.
Is this a good idea? Or is this bogus, and should you model it as it is? Does this even influence performance?
If anyone can provide a link or book with some graph modelling best practices, that would be awesome as well :).
Thx in advance.
Gr
Kwinten
Anyhow, there is no performance enhancer once you need to have 1 million nodes, one for each car.
If you simply query your parking node together with just one car, it will be as fast as if you had just one car in the car collection.
If you need to return all 1 million cars, then there is no enhancer (the main problem, however, would simply be the network connection streaming all that data).
You can play with labels, but I suggest keeping the millions of relationships directly on the parking node. If you could provide us with an example scenario with a query, then maybe we can figure something out.
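For illustration, here is a count query against the direct model, run through the Python driver (the PARKED_IN relationship name is made up for the example): the database only has to count the cars, not return a million nodes.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    result = session.run(
        "MATCH (p:Parking {name: $name})<-[:PARKED_IN]-(c:Car) "
        "RETURN count(c) AS cars",
        name="P1",
    )
    print(result.single()["cars"])

driver.close()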

What's SQLite's cost model?

I found only "SQLite uses a cost-based query planner that estimates the CPU and disk I/O costs of various competing query plans and chooses the plan that it thinks will be the fastest" in the documentation. Is there any paper or book that specifies the cost function in more detail? Thanks.
There is only the source code (where.c).
The query planner details are tweaked in every version.
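You can at least observe the planner's choices from any client; a sketch with Python's built-in sqlite3 module and a toy table, using EXPLAIN QUERY PLAN:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.execute("CREATE INDEX idx_a ON t(a)")

# EXPLAIN QUERY PLAN reports which plan the cost-based planner picked;
# the cost estimates themselves live only in the source (where.c).
for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1"):
    print(row)  # e.g. (..., 'SEARCH t USING INDEX idx_a (a=?)')

Running ANALYZE populates the sqlite_stat1 table, which is where the planner takes its row-count estimates from, so that is another place to look.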
