Public data sets for Cranfield experiment - information-retrieval

I am trying to evaluate an information retrieval approach. Thus, I need data suitable for a Cranfield experiment:
Documents (D)
Queries (Q)
Relevance(Q, D)
Unfortunately I haven't found anything that is freely available...

Data sets:
Cranfield released a collection of about 1400 abstracts and a set of queries with relevance judgements. However, working with this collection is not advisable because it is very small.
For moderate-sized collections, you can use the TREC ad hoc search data, which comes in 5 volumes. Volumes 4 and 5 are typically used. These documents (about half a million) correspond to the TREC Robust query set (TREC 6/7/8 and Robust tracks), comprising 250 queries in total.
The INEX ad hoc search task data comprises a collection of XML documents (a 27 GB Wikipedia dump). The relevance judgements mark the relevant paragraphs within whole articles. The task is to retrieve these passages.
For non-English documents, you can use the CLEF data (European languages) or the FIRE data (South Asian languages).
For larger collections you can use the ClueWeb data (TREC web search track). The size is 25 TB.
Alternatively, you can also use domain-specific test collections, such as the Tweets corpus (TREC microblog search track), legal documents (TREC legal track), patent collections (CLEF-IP), medical collections (PubMed), etc.
Availability:
Most of these collections are freely available. You just need to register for the track (if it's an ongoing one) and they will make the data available to you. Some past tracks make the data available on the track web pages.
The TREC ad hoc and the ClueWeb data are not freely available. The recent tracks of TREC make the data freely available, though. The various datasets of INEX, FIRE, and CLEF are all freely available. Just send an email to the (past) organizers in case they have removed the links to the data.
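Once you have a collection, the Cranfield-style evaluation itself is simple to script. Here is a minimal sketch in plain Python (the file names are hypothetical, and the formats follow the usual TREC conventions: qrels lines are "query_id 0 doc_id relevance", run lines are "query_id Q0 doc_id rank score tag") that computes mean average precision over Relevance(Q, D):

from collections import defaultdict

def load_qrels(path):
    # Keep only the judged-relevant documents per query.
    qrels = defaultdict(set)
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            if int(rel) > 0:
                qrels[qid].add(docid)
    return qrels

def load_run(path):
    # Ranked retrieval results per query, sorted by rank.
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _q0, docid, rank, _score, _tag = line.split()
            run[qid].append((int(rank), docid))
    return {qid: [d for _, d in sorted(pairs)] for qid, pairs in run.items()}

def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, docid in enumerate(ranked, start=1):
        if docid in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

qrels = load_qrels("qrels.txt")  # hypothetical file names
run = load_run("run.txt")
aps = [average_precision(run.get(qid, []), rel) for qid, rel in qrels.items()]
print("MAP:", sum(aps) / len(aps))

In practice you would use the official trec_eval tool for reported numbers, but a sketch like this is handy for quick iteration.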

Firebase and cartesian public reads

I'm working on a product that displays the results of running races. Races could have thousands of participants. So, in the days after a medium-sized event, there might be 3000 non-authenticated users wanting to browse 3000 results.
Although not every visitor will view all the results, the maximum damage at 3000 * 3000 would be 9,000,000 reads, which at $0.06 (Google Cloud pricing) would cost $540,000 (Update: I'm a dummy, I missed the "per 100,000 documents" part, so this would only be $5.40).
Obviously, I wouldn't deliver all 3000 results for each visit - there would be paging and limits. Though, there's something inherently scary about the possibility of those costs.
Questions:
Is Firebase simply the wrong technology for this type of product?
Is Firebase not really intended for non-authenticated apps? Obviously DDoS becomes a concern for public access, and there's no real protection in Firebase for this.
Every post I've read on these topics assumes developers are building apps for authenticated users.
9,000,000 reads which at $.06 (Google cloud pricing) would cost $540,000
The Firestore pricing of $0.06 is per 100,000 document reads, so 9 million document reads cost 9,000,000 / 100,000 × $0.06 = $5.40.
Aside from that: you should model your data in a way that ensures you read the data that the user actually sees. For example, if all users will read the entirety of all 3,000 documents, consider using a data bundle to distribute that to them.
Realistically though, it is more likely that each user will read just a subset of the documents, and probably not all 3,000. So consider whether you can combine the parts that they'll read into a more cost-efficient structure. For example, if these were news articles: you could store the headline and intro paragraph of the first 100 articles in a single document, and just read that document (let's call it the frontpage) into each client when it starts.
There are many more ways to model the data, depending on the use-cases of your app. To learn more on how to think about such data modeling, I recommend reading NoSQL data modeling and watching the excellent Get to know Cloud Firestore video series.
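For the paging and limits, something like this minimal sketch with the google-cloud-firestore Python client (the collection and field names are made up for illustration) keeps each visit bounded: a visitor who only looks at the first page costs you 20 reads, not 3,000.

from google.cloud import firestore

db = firestore.Client()

# Hypothetical structure: one document per finisher in races/{race_id}/results.
results = db.collection("races").document("spring-10k").collection("results")

# First page: at most 20 document reads, regardless of how many results exist.
first_page = results.order_by("finish_time").limit(20)
docs = list(first_page.stream())
for doc in docs:
    print(doc.id, doc.to_dict())

# Next page: resume after the last document of the previous page.
if docs:
    next_page = results.order_by("finish_time").start_after(docs[-1]).limit(20)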

Flutter Firebase: NoSQL data modelling subcollections reference

A newbie here.
I need help regarding Firebase subcollection referencing in a structured way, where a user can select and pass information through subcollections.
=> Tournaments => Cities => Cairo => Year => High Goal => Team A
It goes like this: from the root, I have a list of cities, let's say
1. Cairo
2. Alexandria
3. Sixth October
I want to keep a record of the tournaments hosted by these cities each year. Let's say
1. 2019
2. 2018
3. 2017
Each year there are 3 different cups competed for, let's say
1. High goal
2. Medium goal
3. Low goal
Every competed cup has teams that participate in the tournament:
1. Team A
2. Team B
3. Team C
I have added a visual representation of the app designed in Adobe XD.
Data modeling for NoSQL databases depends as much on the use-cases of your app as it depends on the data that you store. So there is no "perfect" data model, nor are there nearly as many best practices (or normal forms) for NoSQL databases as there are for relational data models.
Firestore (which you seem to be looking to use) offers a few tools for modeling data:
The discrete unit of storage is called a document. Each document contains fields of various types, including nested fields, and a document can be up to 1MB in size.
Documents are stored in named collections.
You can nest collections under a document, and build hierarchies that way.
Each document has a unique path of the form /collection1/docid1/collection2/docid2, etc.
To write to a document, you must know its exact path.
You can query a collection for a subset of the documents in there.
You can query across all collections with the same name, no matter their path in the database.
The performance of queries depends solely on the number of documents you retrieve, and not on the number of documents in the collection(s).
There are probably quite a few more rules, but these should be enough to get you started.
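To make these building blocks concrete for the tournament hierarchy in your question, here is a minimal sketch using the google-cloud-firestore Python client (all collection and document names are just one possible mapping of Cities => Years => Cups => Teams):

from google.cloud import firestore

db = firestore.Client()

# One possible path per team, mirroring the hierarchy in the question:
#   tournaments/{city}/years/{year}/cups/{cup}/teams/{team}
team_ref = (
    db.collection("tournaments").document("cairo")
      .collection("years").document("2019")
      .collection("cups").document("high-goal")
      .collection("teams").document("team-a")
)
team_ref.set({"name": "Team A", "city": "Cairo", "year": 2019, "cup": "High goal"})

# Read the teams of one specific cup:
teams = (
    db.collection("tournaments").document("cairo")
      .collection("years").document("2019")
      .collection("cups").document("high-goal")
      .collection("teams").stream()
)

# Or query across ALL "teams" collections, wherever they sit in the tree:
all_2019 = db.collection_group("teams").where("year", "==", 2019).stream()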
I typically recommend writing a list of your top 3-5 use-cases, and determining what reads/queries you need for that. With those queries, you can then start defining your data model, and implementing your application code.
Then each time you add a use-case, you figure out how to read/write the data for that use-case, and potentially change/expand the data model to allow for the new and existing use-cases. If you get stuck when adding a specific use-case, report back here and we can try to help.
Some good additional material to get started:
NoSQL data modeling
Getting to know Cloud Firestore
Firebase for SQL developers, which is for Firebase's other NoSQL database, but is a great primer on NoSQL modeling too.

Are CosmosDB attachments (still) a good way to store payloads larger than the document limit of 2MB?

I need to save CosmosDB documents that contain a large list - larger than the document limit of 2 MB.
I'd like to split that list into an 'attachment' associated to the "main document".
But this documentation page briefly mentions that
Attachment feature is being depreciated [sic]
What's the deprecation plan? Will newly created collections (in the future) stop supporting attachments?
The same page of documentation mentions a limit of 2GB for "Maximum attachment size per Account".
What does that mean? 2 GB per attachment? 2 GB total for all attachments?
I recommend not taking a dependency on attachments. We are still planning on deprecating them but have not started in earnest on this.
Depending on your access patterns for this data, you may want to break it up into individual documents or model it in some other way. CRUD operations on large documents can be very costly and will experience high latency because of the large payload in each request.
If you have an unbounded array these should definitely be stored as individual documents or modeled such that increasing size does not cause eventual performance issues. If your data is updated frequently it should be modeled such that the frequently updated portions are separate from properties that are static.
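As an illustration of the individual-documents approach, here is a minimal sketch with the azure-cosmos Python SDK (the database, container, and property names are assumptions, and the container is assumed to be partitioned on /parentId): each list entry becomes its own document that carries the parent's id, so the parent stays small and the list can grow without hitting the 2 MB limit.

from azure.cosmos import CosmosClient

client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("appdb").get_container_client("items")

large_list = [{"n": i} for i in range(10_000)]  # stand-in for the oversized payload
parent_id = "order-1000"

# One document per list entry, all in the parent's logical partition.
for i, entry in enumerate(large_list):
    container.upsert_item({
        "id": f"{parent_id}-entry-{i}",
        "parentId": parent_id,  # assumed partition key path: /parentId
        "payload": entry,
    })

# Reading the list back is a single-partition query:
entries = container.query_items(
    query="SELECT * FROM c WHERE c.parentId = @p",
    parameters=[{"name": "@p", "value": parent_id}],
    partition_key=parent_id,
)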
This article describes scenarios and considerations when modeling data in Cosmos and may help you come up with a more efficient model:
https://learn.microsoft.com/en-us/azure/cosmos-db/modeling-data
Hope this is helpful.

Maxing out document storage in Firestore

I'm working on some posting forum projects and trying to figure out the ideal Firestore database structure.
I read that documents have a max size of 1 MB, but what are the pros and cons of maxing out the storage space of each document by having multiple posts stored in a document rather than using a single document for each post?
I think it would be cheaper. Assuming that the app would make use of all the data in a document, the bandwidth costs would be the same but rather than multiple reads, I would be charged for only one document. Does this make sense?
Would it also be faster?
You can likely store many posts in a single document, and depending on your application, there may be good reasons for doing so. Just keep a few things in mind:
Firestore always reads complete documents. So if you store 100 posts in a single 1MB document, to only display 10 of those posts, you may have reduced the read operations by 10x, but you've increased the bandwidth consumption by 10x. And your mobile users will likely also pay for that bandwidth.
Implementing your own sharding strategy is not always hard, but it's seldom related to application functionality.
My guidelines when modeling data in any NoSQL database are:
model application screens in your database
I tend to model the data in my database after the screens that I have in my application. So if you typically show a list of headlines of recent articles when a user starts the app, I might actually create a document that contains just the headlines of recent articles. That way the app only has to read a single document with just the headlines, instead of having to read each individual post. This reduces not only the number of documents the app needs to read, but also the bandwidth it consumes.
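As a minimal sketch of that (using the google-cloud-firestore Python client, with invented collection and field names): one scheduled or triggered write produces the prerendered document, and every app start then costs a single read.

from google.cloud import firestore

db = firestore.Client()

# Hypothetical layout: one document per post in "posts",
# plus a single prerendered document for the start screen.
recent = (
    db.collection("posts")
      .order_by("created", direction=firestore.Query.DESCENDING)
      .limit(100)
      .stream()
)

# Store the duplicated headlines as a map of post id -> headline.
db.collection("screens").document("frontpage").set(
    {"headlines": {d.id: d.get("headline") for d in recent}}
)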
don't be afraid to duplicate data
This goes hand-in-hand with the previous guideline, and is very normal across all NoSQL databases, but goes against the core of what many of us have learned from relational databases. It is sometimes also referred to as denormalizing, as it counters the database normalization of relational database models.
Continuing the example from before: you'll probably have a separate document for each post, just to make sure that each post has its own single point of definition. But you'll store parts of that post in many other places, such as in the document-of-recent-headlines that we had before. This means that we'll have to duplicate the data for each new post into that document, and possibly multiple other places. This process is known as fan-out, and there are some common strategies for updating this denormalized data.
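A minimal sketch of such a fan-out update, with the same invented names as the previous sketch: when a headline changes, one atomic batch updates both the point of definition and the duplicated copy.

from google.cloud import firestore

db = firestore.Client()
post_ref = db.collection("posts").document("post123")  # hypothetical post id
front_ref = db.collection("screens").document("frontpage")

batch = db.batch()
batch.update(post_ref, {"headline": "New headline"})
# Dotted field path: rewrite only this post's entry in the headlines map.
batch.update(front_ref, {"headlines.post123": "New headline"})
batch.commit()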
I find that this duplication leads to no concerns, as long as it is clear what the main point of definition for each entity is. So in our example: if there ever is a difference between the headline of a post in the post-document itself, and the document-of-recent-headlines, I know that I should update the document-of-recent-headlines, since the post-document itself is my point-of-definition for the post.
The result of all this is that I often see my database as part actual data storage, part prerendered fragments of application screens. As long as the points of definition are clear, that works quite well and allows me to define data models that scale efficiently both for users of the applications that consume the data and for the cost to operate them.
To learn more about NoSQL data modeling:
NoSQL data modeling
Getting to know Cloud Firestore, which contains many more examples of these prerendered application screens.

Relational behavior against a NoSQL document store for ODBC support

The first assertion is that document-style NoSQL databases such as MarkLogic and Mongo should store each piece of information in a nested/complex object.
Consider the following model
<patient>
  <patientid>1000</patientid>
  <firstname>Johnny</firstname>
  <claim>
    <claimid>1</claimid>
    <claimdate>2015-01-02</claimdate>
    <charge><amount>100</amount><code>374.3</code></charge>
    <charge><amount>200</amount><code>784.3</code></charge>
  </claim>
  <claim>
    <claimid>2</claimid>
    <claimdate>2015-02-02</claimdate>
    <charge><amount>300</amount><code>372.2</code></charge>
    <charge><amount>400</amount><code>783.1</code></charge>
  </claim>
</patient>
In the relational world this would be modeled as a patient table, claim table, and claim charge table.
Our primary desire is to simultaneously feed downstream applications with this data, but also perform analytics on it. Since we don't want to write a complex program for every measure, we should be able to put a tool on top of this. For example Tableau claims to have a native connection with MarkLogic, which is through ODBC.
When we create views using range indexes on our document model, the SQL run against them in MarkLogic returns excessively repeating results, and the charge amounts are double-counted by sum functions. It does not work.
The thought is that through these index, view, and possibly fragment techniques of MarkLogic, we can define a semantic layer that resembles a relational structure.
The documentation hints that you should create 1 object per table, but this seems to be against the preferred document db structure.
What is the data modeling and application pattern to store large amounts of document data and then provide a turnkey analytics tool on top of it?
If the ODBC connection is always going to return bad data and is not aware of relationships, then the ODBC support that all of these tools claim against NoSQL is not really true.
References
https://docs.marklogic.com/guide/sql/setup
https://docs.marklogic.com/guide/sql/tableau
http://www.marklogic.com/press-releases/marklogic-and-tableau-build-connection/
https://developer.marklogic.com/learn/arch/data-model
For your question: "What is the data modeling and application pattern to store large amounts of document data and then provide a turnkey analytics tool on top of it?"
The rule of thumb I use is that when I want to count "objects", I model them as separate documents. So if you want to run queries that count patients, claims, and charges, you would put them in separate documents.
That doesn't mean we're constraining MarkLogic to only relational patterns. In UML terms, a one-to-many relationship can be a composition or an aggregation. In a relational model, I have no choice but to model those as separate tables. But in a document model, I can do separate documents per object or roll them all together - the choice is usually based on how I want to query the data.
So your first assertion is partially true - in a document store, you have the option of nesting all your related data, but you don't have to. Also note that because MarkLogic is schema-agnostic, it's straightforward to transform your data as your requirements evolve (corb is a good option for this). Certain requirements may require denormalization to help searches run efficiently.
Brief example - a person can have many names (aliases, maiden name) and many addresses (different homes, work address). In a relational model, I'd need a persons table, a names table, and an addresses table. But I'd consider the names to be a composite relationship - the lifecycle of a name equals that of the person - and so I'd rather nest those names into a person document. An address, on the other hand, has a lifecycle independent of the person, so I'd make that an address document and toss an element onto the person document for each related address. From an analytics perspective, I can now ask lots of interesting questions about persons and their names, and persons and addresses - I just can't get counts of names efficiently, because names aren't in separate documents.
I guess MarkLogic is a little atypical compared to other document stores. It works best when you don't store an entire table as one document, but one record per document. MarkLogic indexing is optimized for this approach, and handles searching across millions of documents easily that way. You will see that as soon as you store records as documents, results in Tableau will improve greatly.
Splitting documents into such small fragments also allows higher performance and a lower footprint. MarkLogic doesn't hold the data as persisted DOM trees that allow random access. Instead, it streams the data in a very efficient way, and relies on index resolution to pull relevant fragments quickly.
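As an illustration of that one-record-per-document shape, here is a minimal sketch in plain Python (standard library only) that splits the patient document from the question into one document per claim; loading the results into MarkLogic (e.g. through its REST API) is left out:

import xml.etree.ElementTree as ET

# The patient document from the question, condensed inline for the example.
patient_xml = """<patient>
  <patientid>1000</patientid>
  <firstname>Johnny</firstname>
  <claim>
    <claimid>1</claimid>
    <claimdate>2015-01-02</claimdate>
    <charge><amount>100</amount><code>374.3</code></charge>
    <charge><amount>200</amount><code>784.3</code></charge>
  </claim>
  <claim>
    <claimid>2</claimid>
    <claimdate>2015-02-02</claimdate>
    <charge><amount>300</amount><code>372.2</code></charge>
    <charge><amount>400</amount><code>783.1</code></charge>
  </claim>
</patient>"""

root = ET.fromstring(patient_xml)
pid = root.findtext("patientid")

for claim in root.findall("claim"):
    doc = ET.Element("claim")
    # Denormalize the parent key into each claim document so that
    # claims stay joinable back to their patient.
    ET.SubElement(doc, "patientid").text = pid
    for child in claim:
        doc.append(child)
    uri = f"/claims/{pid}-{claim.findtext('claimid')}.xml"  # hypothetical URI scheme
    print(uri, ET.tostring(doc, encoding="unicode"))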
HTH!
