I would like to store data retrieved hourly from RSS feeds in a database or in Lucene so that the text can be easily indexed for word counts.
I need to get the text from the title and description elements of RSS items.
Ideally, for each hourly retrieval from a given feed, I would add a row to a table in a dataset made up of the following columns:
feed_url, title_element_text, description_element_text, polling_date_time
From this, I can look up any element in a feed and calculate keyword counts over whatever time period is required.
This can be done with a database table, using hashmaps to calculate the counts. But can I do this in Lucene at this level of granularity at all? If so, would each feed form a Lucene document, or would each 'row' from the database table form one?
Can anyone advise?
Thanks
Martin O'Shea.
My parsing of your question is:
for each item in feed:
    calculate term frequency of item, then add to feed's frequency list
This is not something that Lucene excels at, so CouchDB or another db might be as good a choice if not better (as larsmans suggests). However, it can be done, in a way that is probably slightly easier than with other DBs:
// assumes an open org.apache.lucene.index.IndexReader named indexReader (Lucene 3.x API)
HashMap<String, Integer> terms = new HashMap<String, Integer>((int) indexReader.getUniqueTermCount());
TermEnum tEnum = indexReader.terms();
while (tEnum.next())
{
    // map each distinct term in the index to the number of documents containing it
    terms.put(tEnum.term().text(), tEnum.docFreq());
}
tEnum.close();
All Lucene is saving you is the difficulty of calculating the docfreq, and it will probably be a bit faster than looping through all the rows yourself. But I'd be surprised if the performance difference is noticeable for reasonably small data sets.
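For comparison, here is a rough sketch of "looping through all the rows yourself" with a plain database. The column names follow the layout proposed in the question; the table name feed_items, the database file name, and the cutoff date are placeholders I'm assuming:
# Hedged sketch of the plain-database alternative: count how many stored rows
# (hourly retrievals) contain each term, mirroring Lucene's docFreq.
import re
import sqlite3
from collections import Counter

conn = sqlite3.connect("feeds.db")  # database file name is an assumption
doc_freq = Counter()

rows = conn.execute(
    "SELECT title_element_text, description_element_text FROM feed_items "
    "WHERE polling_date_time >= ?", ("2011-01-01 00:00:00",))
for title, description in rows:
    # Count each term at most once per row so the numbers match document frequency.
    terms = set(re.findall(r"\w+", f"{title} {description}".lower()))
    doc_freq.update(terms)

print(doc_freq.most_common(10))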
Imagine you have a “Posts” model in Firestore which has image, description, rating, comments, etc. You want to display 10 or 15 comments when the post is clicked by a user. The question is:
Would you store comments in the “Posts” model as a field, or would you create a new “Comments” data model for that?
In the first situation, I wonder how to handle a post that has 1,000,000 comments. You can't paginate a field as far as I know, so each time you would need to fetch all of the comments, which is a heavy and wasteful request. What is the best way to store comments?
Would you store comments in the “Posts” model as a field, or would you create a new “Comments” data model for that? In the first situation, I wonder how to handle a post that has 1,000,000 comments.
There is no "100% correct" way of doing this, but your modeling should match the requirements of your expected use case. Without knowing how you are going to query this data, you might make a bad design decision.
Note that the maximum size of a Firestore document is 1MB. If you are expecting a large number of comments, they simply will not fit inside a single post document, and you should instead store each comment as a separate document in a subcollection.
If you need to paginate any items at all, you should always store them as separate documents. Firestore queries can't fetch partial documents - a read always gets everything in the document.
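As a rough illustration with the Python client (the collection names posts/comments and the created_at field are assumptions, not something from your model):
# Hedged sketch: comments live in a subcollection, so each page is a small read.
from google.cloud import firestore

db = firestore.Client()

def fetch_comments(post_id, page_size=15, start_after_snapshot=None):
    query = (db.collection("posts").document(post_id)
               .collection("comments")
               .order_by("created_at", direction=firestore.Query.DESCENDING)
               .limit(page_size))
    if start_after_snapshot is not None:
        # Cursor-based pagination: continue after the last document of the previous page.
        query = query.start_after(start_after_snapshot)
    return list(query.stream())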
Actually you have answered your question yourself :)
I would create another data model / collection, so you can more easily implement:
pagination
remove / edit
likes
answers to comments etc...
This brings more complexity, but it is a more elegant and flexible solution.
The first solution only makes sense if it is guaranteed that there can never be a million comments, for example in an intranet application. But even then it is better not to do it, because the effort is essentially the same.
I've been thinking a lot about the possible strategies for querying an unbounded number of items.
For example, think of a forum - you could have any number of forum posts categorized by topic. You need to support at least 2 access patterns: post details view and list of posts by topic.
// legend
PK = partition key, SK = sort key
PK = postId
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
PK = topic and SK = postId#addedDateTime
Great for querying all the posts for a given topic, but they all end up in the same partition ("hot partition").
PK = topic#date and SK = postId#addedDateTime
Store items in buckets, e.g. a new bucket for each day. This would push a lot of logic to the application layer and add latency. E.g. if you need to get 10 posts, you'd have to query today's bucket and, if that bucket contains fewer than 10 items, query yesterday's bucket, and so on. Don't even get me started on pagination; that would probably be a nightmare if it crosses buckets.
So my question is: how do you store and query an unbounded list of items the "DynamoDB way"?
I think you've got a good understanding about your options.
I can't profess to know the One True Way™ to solve this particular problem in DynamoDB, but I'll throw out a few thoughts for the sake of discussion.
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
This would definitely be the case if your Primary Key consists solely of the postId (I'll use POST#<postId> to make it easier to read). That table would look something like this:
This would be super efficient for the "fetch post details view (aka fetch post by ID)" access pattern. However, we haven't built in any way to access a group of Posts by topic. Let's give that a shot next.
There are a few ways to model the one-to-many relationship between Posts and topics. The first thing that comes to mind is creating a secondary index on the topic field. Logically, that would look like this:
Now we can get an item collection of Posts by topic using the efficient query operation. Pagination will help you if your number of Posts per topic grows larger. This may be enough for your application. For the sake of this discussion, let's assume it creates a hot partition and consider what strategies we can introduce to reduce the problem.
One Option
You said
Store items in buckets, e.g. a new bucket for each day.
This is a great idea! Let's update our secondary index partition key to be <topic>#<truncated_timestamp> so we can group posts by topic for a given time frame (day/week/month/etc).
I've done a few things here:
Introduced two new attributes to represent the secondary index PK and SK (GSIPK and GSISK respectively).
Introduced a truncated timestamp into the partition key to represent a given month. For example, POST#1 and POST#2 both have a posted_at timestamp in September. I truncated both of those timestamps to 2020-09-01 to represent the entire month of September (or whatever time boundary makes sense for your application).
This will help distribute your data across partitions, reducing the hot key issue. As you correctly note, this will increase the complexity of your application logic and increase latency, since you may need to make multiple requests to retrieve enough results for your application's needs. However, this might be a reasonable trade-off in this situation. If the increased latency is a problem, you could pre-populate a partition to contain the results of the prior N months' worth of a topic discussion (e.g. PK = TOPIC_CACHE#<topic> with a list attribute that contains the postIds from the prior N months).
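To make the multi-request bucket walk concrete, here is a rough boto3 sketch; the table name (posts), index name (GSI1), and attribute name (GSIPK) are placeholders I'm assuming, not prescribed names:
# Hedged sketch: walk month buckets of the assumed GSI until enough posts are found.
import boto3
from datetime import date
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("posts")  # table name is an assumption

def recent_posts(topic, year, month, want=10):
    results = []
    # For brevity this only walks back within a single year.
    while len(results) < want and month >= 1:
        bucket = f"{topic}#{date(year, month, 1).isoformat()}"  # e.g. "dynamodb#2020-09-01"
        resp = table.query(
            IndexName="GSI1",
            KeyConditionExpression=Key("GSIPK").eq(bucket),
            ScanIndexForward=False,      # newest first within the bucket
            Limit=want - len(results),
        )
        results.extend(resp["Items"])
        month -= 1                       # fall back to the previous month's bucket
    return results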
If the TOPIC_CACHE ends up being a hot partition, you could always shard the partition using a calculated suffix:
Your application could randomly select a TOPIC_CACHE between 1..N when retrieving the topic cache.
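A minimal sketch of that shard selection (the shard count and key format are assumptions):
# Hedged sketch: spread the topic cache across N shards by picking a random suffix.
import random

TOPIC_CACHE_SHARDS = 4  # assumed shard count

def topic_cache_key(topic):
    return f"TOPIC_CACHE#{topic}#{random.randint(1, TOPIC_CACHE_SHARDS)}"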
There are numerous ways to approach this access pattern, and these options represent only a few possibilities. If it were my application, I would start by creating a secondary index using the Post topic as the partition key. It's the easiest to implement and would give me an opportunity to see how my application access patterns performed in a production environment. If the hot key issue started to become a problem, I'd dive deeper into some sort of caching solution.
I'm working on a library app, and am using Firestore with the following two (simplified) collections, books and wishes:
Book
- locationIds[] # Libraries where the book is in stock
Wish
- userId # User who has wishlisted a book
- bookId # Book that was wishlisted
The challenge: I would like to be able to make a query which gets a list of all Book IDs which have been wishlisted by a user AND are currently available in a library.
I can imagine two ways to solve this:
APPROACH 1
Copy the locationIds[] array to each Wish, containing the IDs of every location having a copy of that book.
My query would then be (pseudocode):
collection('wishes')
.where('userId' equals myUserId)
.where('locationIds' contains myLocationId)
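Concretely, with the Python client it would be roughly as follows (array_contains being Firestore's "contains" operator; the variables are just placeholders):
# Rough sketch of Approach 1 with the google-cloud-firestore Python client.
from google.cloud import firestore

db = firestore.Client()
wishes = (db.collection("wishes")
            .where("userId", "==", myUserId)
            .where("locationIds", "array_contains", myLocationId)
            .stream())
book_ids = [w.get("bookId") for w in wishes]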
But I expect my Wishes collection to be pretty large, and I don't like the idea of having to update the locationIds[] of all (maybe thousands) of wishes whenever a book's location changes.
APPROACH 2
Add a wishers[] array to each Book, containing the IDs of every user who has wishlisted it.
Then the query would look something like:
collection('books')
.where('locationIds' contains myLocationId)
.where('wishers' contains myUserId)
The problem with this is that the wishers array for a particular book may grow pretty huge (I'd like to support thousands of wishes on each book), and then this becomes a mess.
Help needed
In my opinion, neither of these approaches is ideal. If I had to pick one, I would probably go with Approach 1, simply because I don't want my Book object to contain such a huge array.
I'm sure I'm not the first person to come across this sort of problem; is there a better way?
You could try dividing the query into two different requests. For instance, in pseudocode:
wishes = db.collection('wishes').where('userId', '==', myUserId).stream()
book_ids = [wish.get('bookId') for wish in wishes]
books = db.collection('books').where('bookId', 'in', book_ids).stream()
result = [book.get('bookId') for book in books if book.get('locationIds')]
Notice that this is just an example; this code probably doesn't work as-is, since I haven't tested it, and the keyword in only supports 10 values. But you get the idea. A good idea would be to store the length of locationIds, or whether it is empty, in a separate attribute, so you could skip the final filtering step and query the books with:
books = db.collection('books').where('bookId', 'in', book_ids).where('hasLocations', '==', True).stream()
Although you would still have to iterate to only get the bookId.
Also, you should avoid using arrays in Firestore since it doesn't have native support for them, as explained in their blog.
Is it mandatory to use NoSQL? Maybe you could do this M:M relation better in SQL. Bear in mind that I'm no database expert though.
I am new to NoSQL data modelling, so please excuse me if my question is trivial. One piece of advice I found for DynamoDB is to always supply the partition key when querying; otherwise it will scan the whole table. But there could be cases where we need to list our items, for instance on an e-commerce website where we need to list our products on a list page (with pagination).
How should we perform this listing while avoiding a scan, or at least using it efficiently?
Basically, there are three ways of reading data from DynamoDB:
GetItem – Retrieves a single item from a table. This is the most efficient way to read a single item, because it provides direct access to the physical location of the item.
Query – Retrieves all of the items that have a specific partition key. Within those items, you can apply a condition to the sort key and retrieve only a subset of the data. Query provides quick, efficient access to the partitions where the data is stored.
Scan – Retrieves all of the items in the specified table. (This operation should not be used with large tables, because it can consume large amounts of system resources.)
And that's it. As you can see, you should always prefer GetItem (or BatchGetItem) to Query, and Query to Scan.
You could use queries if you add a sort key to your data. I.e. you could use category as a hash key and product name as a sort key, so that a page showing the items for a particular category could query by that category and product name. But that design is fragile, as you may need other keys for other pages; for example, you may need a vendor + price query if the user is looking at a particular kind of mobile phone. Indexes can help here, but they come with their own trade-offs and limitations.
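For illustration, a rough boto3 sketch of that kind of category listing with pagination; the table name products and the key name category are assumptions:
# Hedged sketch: query one category and page through it with LastEvaluatedKey.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("products")  # table name is an assumption

def list_products(category, page_size=20, start_key=None):
    kwargs = {
        "KeyConditionExpression": Key("category").eq(category),
        "Limit": page_size,
    }
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key  # cursor from the previous page
    resp = table.query(**kwargs)
    # The returned LastEvaluatedKey (if any) is the cursor for the next page.
    return resp["Items"], resp.get("LastEvaluatedKey")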
Moreover, filtering by arbitrary expressions is applied after the query / scan operation completes but before you get the results, so you're charged for the whole query / scan. It's literally like filtering the data yourself in the application and not on the database side.
I would say that DynamoDB is simply not intended for many kinds of workloads, and it is probably not suited to your case either. Think of it as a rich key-value (key-to-object) store, and not a "classic" RDBMS, where indexes come at a lower cost and with fewer limitations and which gives developers rich querying capabilities.
There is a good article describing potential issues with DynamoDB; take a look. It contains an awesome decision tree that guides you through deciding whether DynamoDB fits your use case. I'm pasting it here, but please note that the original author is Forrest Brazeal.
Another article worth reading.
Finally, check out this short answer on SO about DynamoDB usecases and issues.
P.S. There is nothing criminal about doing scans (I actually run them on a schedule, once per day, in one of my projects), but that's an exceptional case, and I regret the decision to use DynamoDB there. It's not efficient in terms of speed, money, support, or "dirtiness". I had to increase the capacity before the job and reduce it afterwards, but that's another story…
I am working with MarkLogic.
I have a database of around 27,000 documents.
What I want to do is retrieve the keywords which have maximum frequency in the documents given by the result of any search query.
I am currently using xquery functions to count the frequency of each word in the set of all documents retrieved as query result. However, this is quite inefficient.
I was thinking that it would help me if I could get the list of words on which MarkLogic has performed indexing.
So is there a way to retrieve the list of indexed words from the universal index of MarkLogic?
Normally you would use something like this in MarkLogic:
(
  for $v in cts:element-values(xs:QName("myelem"))
  let $f := cts:frequency($v)
  order by $f descending
  return $v
)[1 to 10]
This kind of functionality is built into the search:search library, which works very conveniently.
But unfortunately you cannot use that on values from cts:words and the like. There is a little trick that could get you close, though. Instead of using cts:frequency, you could use an xdmp:estimate on a cts:search to get a fragment count:
(
  for $v in cts:words()
  let $f := xdmp:estimate(cts:search(collection(), $v))
  order by $f descending
  return $v
)[1 to 10]
The performance is lower, but still much faster than bluntly running through all the documents.
HTH!
What if your search contains multiple terms? How will you calculate the order?
What if some of your terms are very common in your corpus of documents, and others are very rare? Should the count of "the" contribute more to the score than "protease", or should they contribute the same?
If the words occur in the title vs elsewhere in the document, should that matter?
What if one document is relatively short, and another is quite long. How do you account for that?
These are some of the basic questions that come up when trying to determine relevancy. Most search engines use a combination of term frequency (how often do the terms occur in your documents), and document frequency (how many documents contain the terms). They can also use the location of the terms in your documents to determine a score, and they can also account for document length in determining a score.
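For intuition, the usual shape of such a score is a tf-idf style sum. This is a generic sketch, not MarkLogic's exact formula: $\mathrm{tf}(t, d)$ is how often term $t$ occurs in document $d$, $N$ is the total number of documents, and $\mathrm{df}(t)$ is how many documents contain $t$:
$$\mathrm{score}(d, q) \approx \sum_{t \in q} \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}$$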
MarkLogic uses a combination of term frequency and document frequency to determine relevance by default. These factors (and others) are used to determine a relevance score for your search criteria, and this score is the default sorting for results returned by search:search from the search API or the low-level cts:search and its supporting operators.
You can look at the details of the options for cts:search to learn about some of the different scoring options. See 'score-logtfidf' and others here:
http://community.marklogic.com/pubs/5.0/apidocs/SearchBuiltins.html#cts:search
I would also look at the search developers guide:
http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf
Many of the concepts are under consideration by the XQuery working group as enhancements for a future version of XQuery. They aren't part of the language today. MarkLogic has been at the forefront of search for a number of years, so you'll find there are many features in the product, and a lot of discussion related to this area in the archives.
"Is there a way to retrieve the list of indexed words from the universal index of marklogic?" No. The universal index is a hash index, so it contains hashes not words.
As noted by others you can create value-based lexicons that can list their contents. Some of these also include frequency information. However, I have another suggestion: cts:distinctive-terms() will identify the most distinctive terms from a sequence of nodes, which could be the current page of search results. You can control whether the output terms are just words, or include more complex terms such as element-word or phrase. See the docs for more details.
http://docs.marklogic.com/5.0doc/docapp.xqy#display.xqy?fname=http://pubs/5.0doc/apidoc/SearchBuiltins.xml&category=SearchBuiltins&function=cts:distinctive-terms
I have used cts:distinctive-terms(). In my case it gives mostly wildcarded terms, which are not of much use. Further, it is suitable for finding distinctive terms in a single document; when I try to run it on many documents, it is quite slow.
What I want to implement is a dynamic facet which is populated with the keywords of the documents that come up in the search result. I have implemented it, but it is inefficient, as it counts the frequency of all the words in the documents. I want it to be a suggestion or recommendation feature: if you have searched for this particular term or phrase, then you may be interested in these suggested terms or phrases. So I want an efficient method to find the terms which are common in the result set of documents of a search.
I tried cts:words() as suggested. It gives words similar to the search query word and the number of documents in which each is contained. What it does not take into account is the set of search result documents: it just shows the number of documents which contain similar words in the whole database, irrespective of whether those documents are present in the search result or not.