How to get the MID of all compound value types from the Freebase data dump

I want to process all the triples with compound value types (CVTs). Since the Freebase website has been shut down, it's hard to find documentation.
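For anyone hitting this without the old docs: in the RDF dump, CVT types are the ones flagged by the `freebase.type_hints.mediator` predicate, so CVT MIDs can be found in two passes. A sketch, assuming the tab-separated N-Triples layout of the last public dump (verify the predicate spelling and boolean encoding against your copy):

```python
import gzip

MEDIATOR = "<http://rdf.freebase.com/ns/freebase.type_hints.mediator>"
TYPE_PRED = "<http://rdf.freebase.com/ns/type.object.type>"

def cvt_mids(dump_path):
    # Pass 1: collect types flagged as mediators (Freebase's term for CVTs).
    cvt_types = set()
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue
            s, p, o = parts[:3]
            if p == MEDIATOR and o.startswith('"true"'):
                cvt_types.add(s)
    # Pass 2: any subject typed with a mediator type is a CVT instance.
    mids = set()
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue
            s, p, o = parts[:3]
            if p == TYPE_PRED and o in cvt_types:
                mids.add(s)
    return mids
```

Two passes keep memory bounded to the set of type and CVT MIDs rather than all triples.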

Related

Indexing frequently updated counters (e.g., likes on a post and timestamps) in Firebase

I'm new to Firebase and I'm currently trying to understand how to properly index frequently updated counters.
Let's say I have a list of articles on a news website. Every article is stored in my collection 'articles', and each document has a like counter, a publication date, and an id representing a news category. I would like to be able to retrieve the most-liked and the latest articles for every category. Therefore I'm thinking about creating two indices: one on category (ASC) and likes (DESC), and one on category (ASC) and publication date (DESC).
I tried researching limitations, and on the best-practices page I found this warning about creating hotspots with indices:
Creates new documents with a monotonically increasing field, like a timestamp, at a very high rate.
In my example, articles are not created very frequently, so I'm pretty sure this wouldn't cause an issue; correct me if I'm wrong, please. But I still wonder whether I could run into limitations or high costs with my approach, especially regarding likes, which can change frequently while the timestamp stays constant.
Is my approach of indexing likes and timestamps by category sound, or am I overlooking something?
If you are not adding documents at a high rate, then you will not trigger the limit that you cited in your question.
From the documentation:
Maximum write rate to a collection in which documents contain sequential values in an indexed field: 500 per second
If you are changing a single document frequently, then you will possibly trigger the limitation that a single document can't be updated more than once per second (in a sustained burst of updates only, not a hard limit).
From the documentation on distributed counters:
In Cloud Firestore, you can only update a single document about once per second, which might be too low for some high-traffic applications.
That limit seems to (now) be missing from the formal documentation, not sure why that is. But I'm told that particular rate limit has been dropped. You might want to start a discussion on firebase-talk to get an official answer from Google staff.
Whether or not your approach is "sound" depends entirely on your expected traffic. We can't predict that for you, but you are at least aware of when things will go poorly.
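The sharded-counter pattern that the distributed-counters page describes can be sketched in plain Python; this is just the shape of the idea (no Firestore calls; in the real pattern each list slot would be a shard document in a subcollection, updated with `firestore.Increment`):

```python
import random

NUM_SHARDS = 10  # more shards -> higher sustained write throughput

def make_counter():
    # Stand-in for a "shards" subcollection: one slot per shard document.
    return [0] * NUM_SHARDS

def add_like(shards):
    # Each write lands on a random shard, so no single document
    # absorbs every update.
    shards[random.randrange(NUM_SHARDS)] += 1

def total_likes(shards):
    # Reading the counter means summing all shards.
    return sum(shards)
```

The trade-off: reads cost N shard reads instead of one, in exchange for the higher write rate.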

Querying an unbounded number of items

I've been thinking a lot about possible strategies for querying an unbounded number of items.
For example, think of a forum - you could have any number of forum posts categorized by topic. You need to support at least 2 access patterns: post details view and list of posts by topic.
// legend
PK = partition key, SK = sort key

Option: PK = postId
While it's easy to get a single post, you can't effectively query a list of posts without a scan.

Option: PK = topic and SK = postId#addedDateTime
Great for querying all the posts for a given topic, but they all end up in the same partition ("hot partition").

Option: PK = topic#date and SK = postId#addedDateTime
Store items in buckets, e.g. a new bucket for each day. This would push a lot of logic to the application layer and add latency. E.g. if you need to get 10 posts, you'd have to query today's bucket and, if it contains fewer than 10 items, query yesterday's bucket, and so on. Don't even get me started on pagination; that would probably be a nightmare if it crosses buckets.
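The bucket-walking logic in that last option can be sketched in plain Python (the dict stands in for the table; a real implementation would issue one Query per bucket key):

```python
from datetime import date, timedelta

def bucket_pk(topic, day):
    # PK = topic#date: one partition per topic per day
    return f"{topic}#{day.isoformat()}"

def latest_posts(buckets, topic, today, n, max_days=30):
    """Walk day buckets backwards until n posts are collected."""
    posts = []
    for back in range(max_days):
        posts += buckets.get(bucket_pk(topic, today - timedelta(days=back)), [])
        if len(posts) >= n:
            break
    return posts[:n]
```

This shows where the latency comes from: a sparse topic can cost several round trips before `n` posts are found.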
So my question is: how do you store and query an unbounded list of items the "DynamoDB way"?
I think you've got a good understanding about your options.
I can't profess to know the One True Way™ to solve this particular problem in DynamoDB, but I'll throw out a few thoughts for the sake of discussion.
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
This would definitely be the case if your Primary Key consists solely of the postId (I'll use POST#<postId> to make it easier to read): one item per post, keyed by that ID alone.
This would be super efficient for the "fetch post details view" (aka fetch post by ID) access pattern. However, we haven't built in any way to access a group of Posts by topic. Let's give that a shot next.
There are a few ways to model the one-to-many relationship between Posts and topics. The first thing that comes to mind is creating a secondary index on the topic field, so that each topic's Posts form a single item collection.
Now we can get an item collection of Posts by topic using the efficient query operation. Pagination will help you if your number of Posts per topic grows larger. This may be enough for your application. For the sake of this discussion, let's assume it creates a hot partition and consider what strategies we can introduce to reduce the problem.
One Option
You said
Store items in buckets, e.g new bucket for each day.
This is a great idea! Let's update our secondary index partition key to be <topic>#<truncated_timestamp> so we can group posts by topic for a given time frame (day/week/month/etc).
I've done a few things here:
Introduced two new attributes to represent the secondary index PK and SK (GSIPK and GSISK respectively).
Introduced a truncated timestamp into the partition key to represent a given month. For example, POST#1 and POST#2 both have a posted_at timestamp in September. I truncated both of those timestamps to 2020-09-01 to represent the entire month of September (or whatever time boundary that makes sense for your application).
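The truncation step can be sketched as follows; treating GSISK as POST#<postId> is my assumption, since the example tables didn't survive in this copy:

```python
from datetime import datetime

def gsi_keys(topic, post_id, posted_at):
    # GSIPK groups posts by topic and month: any September 2020 timestamp
    # truncates to 2020-09-01.
    month_bucket = posted_at.date().replace(day=1).isoformat()
    return f"{topic}#{month_bucket}", f"POST#{post_id}"
```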
This will help distribute your data across partitions, reducing the hot-key issue. As you correctly note, this will increase the complexity of your application logic and increase latency, since you may need to make multiple requests to retrieve enough results for your application's needs. However, this might be a reasonable trade-off in this situation. If the increased latency is a problem, you could pre-populate a partition to contain the results of the prior N months' worth of a topic discussion (e.g. PK = TOPIC_CACHE#<topic> with a list attribute that contains the postIds from the prior N months).
If the TOPIC_CACHE ends up being a hot partition, you could always shard the partition using a calculated suffix.
Your application could randomly select a TOPIC_CACHE between 1..N when retrieving the topic cache.
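A sketch of the calculated-suffix idea (the shard count is illustrative):

```python
import random

N_SHARDS = 4

def topic_cache_pk(topic, shard=None):
    # Append a suffix 1..N so reads and writes spread across N partitions.
    if shard is None:
        shard = random.randint(1, N_SHARDS)
    return f"TOPIC_CACHE#{topic}#{shard}"
```

Each shard holds a full copy of the cache, so any random shard can serve a read; writes must update all N.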
There are numerous ways to approach this access pattern, and these options represent only a few possibilities. If it were my application, I would start by creating a secondary index using the Post topic as the partition key. It's the easiest to implement and would give me an opportunity to see how my application access patterns performed in a production environment. If the hot key issue started to become a problem, I'd dive deeper into some sort of caching solution.

Can we avoid Scan in DynamoDB?

I am new to NoSQL data modelling, so please excuse me if my question is trivial. One piece of advice I found for DynamoDB is to always supply a partition key when querying; otherwise it will scan the whole table. But there are cases where we need to list our items, for instance on an e-commerce website where we need to list our products on a list page (with pagination).
How should we perform such a listing while avoiding Scan, or at least using it efficiently?
Basically, there are three ways of reading data from DynamoDB:
GetItem – Retrieves a single item from a table. This is the most efficient way to read a single item, because it provides direct access to the physical location of the item.
Query – Retrieves all of the items that have a specific partition key. Within those items, you can apply a condition to the sort key and retrieve only a subset of the data. Query provides quick, efficient access to the partitions where the data is stored.
Scan – Retrieves all of the items in the specified table. (This operation should not be used with large tables, because it can consume large amounts of system resources.)
And that's it. As you see, you should always prefer GetItem (or BatchGetItem) to Query, and Query to Scan.
You could use queries if you add a sort key to your data. I.e. you can use category as a hash key and product name as a sort key, so that the page showing items for a particular category could query by that category and product name. But that design is fragile, as you may need other keys for other pages; for example, you may need a vendor + price query if the user looks for particular mobile phones. Indexes can help here, but they come with their own trade-offs and limitations.
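As a sketch of what such a query looks like at the API level, here is the low-level Query request that the category + product-name design would produce (table and attribute names are assumptions):

```python
def category_query(table_name, category, name_prefix=None):
    # Query items whose partition key is the category; optionally narrow
    # the result by a product-name prefix on the sort key.
    params = {
        "TableName": table_name,
        "KeyConditionExpression": "category = :c",
        "ExpressionAttributeValues": {":c": {"S": category}},
    }
    if name_prefix is not None:
        # "name" is a DynamoDB reserved word, hence the #n placeholder.
        params["KeyConditionExpression"] += " AND begins_with(#n, :p)"
        params["ExpressionAttributeNames"] = {"#n": "name"}
        params["ExpressionAttributeValues"][":p"] = {"S": name_prefix}
    return params
```

The dict can be passed directly to the low-level `Query` API; note that the sort-key condition is limited to operators like equality, ranges, and `begins_with`.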
Moreover, filtering by arbitrary expressions is applied after the query / scan operation completes but before you get the results, so you're charged for the whole query / scan. It's literally like filtering the data yourself in the application and not on the database side.
I would say that DynamoDB is just not intended for many kinds of workloads. Probably it's not suited for your case either. Think of it as a rich key-value (key-to-object) store, not a "classic" RDBMS, where indexes come at a lower cost, have fewer limitations, and give developers rich querying capabilities.
There is a good article describing potential issues with DynamoDB; take a look. It contains an awesome decision tree that guides you through the question of whether DynamoDB fits your case. I'm pasting it here, but please note that the original author is Forrest Brazeal.
Another article worth reading.
Finally, check out this short answer on SO about DynamoDB usecases and issues.
P.S. There is nothing criminal in doing scans (I actually run one on a schedule once per day in one of my projects), but that's an exceptional case, and I regret the decision to use DynamoDB there. It's not efficient in terms of speed, money, or maintenance. I had to increase the capacity before the job and reduce it afterwards, but that's another story…

Theory question: which strategy is faster? Querying a lot of documents, vs. querying fewer documents and then loading some?

I'm wondering what's the better structure for my Firestore database.
I want to create a sort of appointment manager for employees, where I can show each employee their appointments for a given date. I have thought of these two options:
Option 1:
Every employee has a collection Appointments where I save all the upcoming appointments. The appointment documents would have a column date.
When I want to load all appointments for a date I would have to query all appointments by this date.
Option 2:
Every employee has a collection Workdays with documents for each day. These workday documents would have the column date. And then a collection with Appointments where I save the appointments for a workday.
When I want to load all appointments, I would have to query the Workdays collection for the correct date and then load all its Appointments.
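Written as document paths, the two layouts look like this (collection and field names are illustrative):

```python
def option1_path(emp, appt_id):
    # Flat per-employee collection; each document carries a `date` field
    # that queries filter on.
    return f"employees/{emp}/Appointments/{appt_id}"

def option2_path(emp, day, appt_id):
    # One Workdays document per date, with appointments nested beneath it;
    # reading requires resolving the workday first.
    return f"employees/{emp}/Workdays/{day}/Appointments/{appt_id}"
```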
I expect an average workday to contain 10-20 appointments. And let's say I save appointments for the next 30 days. For option 1, I would then have to query 300-600 documents down to 10-20.
For option 2 I would have 30 documents and query them for 1 document, then load around 10-20 documents.
So in option 2 I would have to query fewer documents, but I would have to wait until the query is finished and then load 10-20 further documents. While for option 1, I would have to query more documents but once this query is finished I wouldn't have to load any more documents.
I'm wondering what option is the faster for my use case - any thoughts?
Documents are not necessarily tabular (columnar). Keep it simple, follow their documentation ("Choose a data structure"), and do not overthink optimizing search. Leave query optimization to the Firebase platform/team, as there are several search approaches that might be implemented depending on the type of data you are querying. Examples (source: Wikipedia) include Dijkstra's algorithm, Kruskal's algorithm, the nearest-neighbour algorithm, and Prim's algorithm.
Again, provided you basically follow their data-structure guidelines, the optimal search approach should be baked into the Firebase/Firestore platform and may be optimized by them when possible. In short, the speed of the compute platform will amaze you. Focus on higher-level tasks relating to your particular app.
If the total number of documents read in each case is the same, the faster option will be the one that reduces the number of round trips between the client and server. So, fewer total queries would be better. The total number of documents in the collection is not going to affect the results very much.
With Firestore, performance of queries scales with the size of the result set (total number of documents read), not with the total size of the collection.
The first option is rather straightforward, and is definitely the way you'd do it with a relational database. The date column could become a natural foreign key to any potential workday table, assuming one is even needed.
The second option is more complicated because there are three data cases:
Workday does not exist
Workday does exist but has no appointments in the list
Workday exists and has appointments
In terms of performance, they are not likely to be very different, but if there is a significant gap, I'd bet on option 1 being more efficient.

Neo4j: Java heap space error with 100k nodes

I have a Neo4j graph with a little more than 100,000 nodes. When I run the following Cypher query over REST, I get a Java heap error. The query produces 2-itemsets from a set of purchases.
MATCH (a)<-[:BOUGHT]-(b)-[:BOUGHT]->(c) RETURN a.id,c.id
The cross product of the two node types, Type1 (a, c) and Type2 (b), is of order 80k * 20k.
Is there a more optimized query for the same purpose? I am still a newbie to Cypher. (I have two indexes, on all Type1 and Type2 nodes respectively, which I can use.)
Or should I just go about increasing the Java heap size?
I am using py2neo for the REST queries.
Thanks.
As you said, the cross product is 80k * 20k, so you are probably pulling all of them across the wire?
That is probably not what you want. Usually such a query is bound to a start user or a start product.
You might try to run this query in the neo4j-shell:
MATCH (a:Type1)<-[:BOUGHT]-(b)-[:BOUGHT]->(c) RETURN count(*)
Just to see how many paths you are looking at. If you have a label on the nodes, you can use that label (Type1 here) to drive the match. But 80k times 20k is 1.6 billion paths.
And I'm not sure whether the py2neo version you are using (which one is it?) already streams results for that. Try the transactional endpoint with py2neo (i.e. the cypherSession.createTransaction() API).
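If streaming isn't available, one workaround is to page the query yourself with SKIP/LIMIT; a sketch (labels as above, page size arbitrary):

```python
def paged_query(skip, limit=10_000):
    # One page of the pair query; issue successive calls with a growing
    # skip until a page comes back empty.
    return (
        "MATCH (a:Type1)<-[:BOUGHT]-(b:Type2)-[:BOUGHT]->(c:Type1) "
        "RETURN a.id, c.id "
        f"SKIP {skip} LIMIT {limit}"
    )
```

Note that deep SKIP values still get progressively more expensive to compute server-side, so bounding the query to a start user or product, as suggested above, is the better fix.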
