How do I implement link walking in Riak 2.0 - riak

I am trying to model a one-to-many relationship where a user can have links to many objects in an organization bucket.
I would like to walk those links and return the results.
I am upgrading StackMob's Scala driver to support link walking: https://github.com/megamsys/scaliak
Any help would be greatly appreciated. The forums talk about using MapReduce.

Link walking is deprecated in the latest version of Riak, and will likely be removed in future versions. So it probably doesn't make sense to upgrade the Scala driver to support it.
The real question here is: how should you model a one-to-many relationship in Riak? There are two main approaches, depending on whether you have a read-heavy or a write-heavy use case.
1 - Links as Lists of Keys
You can store the list of links/associations as a separate object for easy retrieval. For example, if I have a users object stored at /buckets/users/keys/user-id-123:
{ id: "user-id-123", name: "Dmitri", ... }
I can then store the organizations that user belongs to (notice that I'm using the same key for the user and for their membership object) in /buckets/user-orgs/keys/user-id-123:
["organization-id-1", "organization-id-2", "organization-id-3"]
This allows me to answer the question of "Which organizations does this user belong to?" with a single GET to the user-orgs object (and, optionally, a multi-get to fetch each of the organization objects by their IDs).
Note: If you're using Riak 2.0 or above, you can use the new Riak Data Types (specifically, the sets data type) to store that list of IDs. (The sets data type provides operations to add, remove, and fetch elements in a way that is appropriate for distributed systems.)
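For illustration, a minimal sketch using the official Riak Python client, assuming a bucket type named sets has been created and activated with datatype set (client settings, bucket, and key names are placeholders):

from riak import RiakClient
from riak.datatypes import Set

client = RiakClient()  # assumes a local node with default ports

# Assumes: riak-admin bucket-type create sets '{"props":{"datatype":"set"}}'
#          riak-admin bucket-type activate sets
bucket = client.bucket_type('sets').bucket('user-orgs')

user_orgs = Set(bucket, 'user-id-123')
user_orgs.add('organization-id-1')
user_orgs.add('organization-id-2')
user_orgs.store()

# One GET answers "which organizations does this user belong to?"
fetched = bucket.get('user-id-123')   # returns a riak.datatypes.Set
print(fetched.value)                  # frozenset of organization ids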
Use this approach when you have a read-heavy use case (when the list of links is read frequently, but is not written to/updated frequently).
2 - Search / Query for the links
The other main approach is to use indexes (preferably via the Solr-based Riak Search, or, for rare cases, via Secondary Indexes) and queries to retrieve one-to-many association objects.
So, if you had a user object stored at /buckets/users/keys/user-id-123:
{ id: "user-id-123", name: "Dmitri", ... }
You would then insert multiple "membership entry" objects into a search-enabled /buckets/user-orgs/ bucket (meaning you would create a search index and associate it with the user-orgs bucket):
{user_id: "user-id-123", org_id: "organization-1"}
{user_id: "user-id-123", org_id: "organization-2"}
{user_id: "user-id-123", org_id: "organization-3"}
Afterwards, you can answer the question "Which organizations does the user belong to?" by issuing a Search query saying "Give me all of the objects in user-orgs where user_id equals user-id-123", for example.
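As a hedged sketch with the Riak Python client (the index name is an assumption, and field naming depends on your schema; with the default schema, string fields would typically be named user_id_s / org_id_s instead):

from riak import RiakClient

client = RiakClient()

# Assumes a search index 'user_orgs_idx' was created and associated
# with the user-orgs bucket, and a schema that indexes user_id.
results = client.fulltext_search('user_orgs_idx', 'user_id:user-id-123')

# _yz_rk is the Riak key of each matching membership object.
membership_keys = [doc['_yz_rk'] for doc in results['docs']]
print(results['num_found'], membership_keys)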
Incidentally, using Search / membership objects like this also allows you to model a Many-To-Many relationship (meaning, you can also answer the question "Which users belong to the organization organization-id-1?").
Because Search queries are more expensive than a single GET to fetch a membership list (like in the first strategy), you should use this strategy when you're not in a read-heavy use case (when the membership objects are updated often, but not read often), or when you need to also model the inverse relationship (many to many).
Note: Do not use Map/Reduce to model one-to-many relationships, and don't use the deprecated Link Walking mechanism (which uses Map/Reduce on the backend anyway).

Related

DynamoDB Modeling Multiple Query Elements

Background: I have a relational db background and have never built anything for DynamoDB that wasn't just used for fast writes with very few reads. I am trying to learn DynamoDB patterns by migrating one of my help desk apps from MySQL to DynamoDB.
The application is a fairly simple one from a data storage perspective. A user submits a request and that request generates 1 or more tickets.
Setup: I have screens where people see initial requests and that request's tickets and search views that allow support to query on a bunch of attributes of a ticket (last name of user, status of ticket, use case of ticket, phone number of user, dept of user). This design in a SQL db is pretty straightforward but in Dynamo, I'm really being thrown for a loop on how to structure primary/sort keys and secondary indexes (if necessary).
I created one collection for requests and one collection for tickets. Each request has an array of the ticket ids that belong to it, and each ticket item has an attribute that stores the request id so that I can search that way. What I am hung up on is how to support searching on a ticket's or request's attributes without having to do a full scan.
I read about composite keys and thought perhaps I could create a composite sort key that concatenates several attributes separated by # (something like lastname#status#usecase), so that I can search on each of those fields directly without having to know the primary key (ticket id).
Question: How do you design dynamo collections/tables that require querying a lot of different attribute values without relying on a primary key?
This is typically something that DynamoDB is not good at, though that's not to say it definitely cannot be done. DynamoDB's strength and speed come from having well-known access patterns and designing your schema for those patterns. In general, if you don't know what your users will search for, or there are many different possible queries, it's better to look at something like RDS or a native SQL DB. That said, one possible direction is to create multiple indexes/lists for each of the searchable fields and duplicate the data; this can all be done in the same table.
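For example, a hedged boto3 sketch of that direction, using a global secondary index per searchable attribute (the table, index, and attribute names are illustrative, not from the question):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('tickets')   # hypothetical table keyed by ticket_id

# Query by a non-key attribute through a GSI whose partition key is status.
resp = table.query(
    IndexName='by-status',
    KeyConditionExpression=Key('status').eq('open'),
)
for item in resp['Items']:
    print(item['ticket_id'], item.get('last_name'))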

Look up the existence of a large number of keys (up to 1M) in datastore

We have a table with 100M rows in Google Cloud Datastore. What is the most efficient way to look up the existence of a large number of keys (500K-1M)?
For context, a use case could be that we have a big content datastore (think of all webpages in a domain). This datastore contains pre-crawled content and metadata for each document. Each document, however, could be liked by many users. Now when we have a new user who says they like documents {a1, a2, ..., an}, we want to tell whether each document ak (k = 1..n) has already been crawled. That's the reason we want to do the lookup mentioned above. If there is a subset of documents that we don't have yet, we would start to crawl them immediately. Yes, the ultimate goal is to retrieve all these documents' content and use them to build the user profile.
My current thought is to issue a bunch of batch lookup requests. Each lookup request can contain up to 1K keys [1], but to check the existence of every key in a set of 1M, I would still need to issue 1,000 requests.
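A sketch of that batched approach with the google-cloud-datastore Python client (the Document kind and liked_doc_ids are placeholders); get_multi accepts a missing list that it fills with empty entities for the keys that were not found:

from google.cloud import datastore

client = datastore.Client()

def missing_keys(keys, batch_size=1000):
    """Return keys that do not exist, using batched lookups (1K keys max each)."""
    missing = []
    for i in range(0, len(keys), batch_size):
        not_found = []  # get_multi appends empty entities for absent keys
        client.get_multi(keys[i:i + batch_size], missing=not_found)
        missing.extend(entity.key for entity in not_found)
    return missing

liked_doc_ids = ['a1', 'a2', 'a3']  # placeholder for the user's liked documents
doc_keys = [client.key('Document', doc_id) for doc_id in liked_doc_ids]
to_crawl = missing_keys(doc_keys)   # subset of documents not yet crawled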
An alternative is to use a customized middle layer to provide a quick look up (for example, can use bloom filter or something similar) to save the time between multiple requests. Assuming we never delete keys, every time we insert a key, we add it through the middle layer. The bloom-filter keeps track of what keys we have (with a tolerable false positive rate). Since this is a custom layer, we could provide a micro-service without a limit. Say we could respond to a request asking for the existence of 1M keys. However, this definitely increases our design/implementation complexity.
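To make the idea concrete, here is a minimal hand-rolled Bloom filter of the kind such a middle layer could keep in memory (the sizing below is arbitrary; a real service would tune the bit-array size and hash count to the target false-positive rate):

import hashlib

class BloomFilter:
    """Minimal Bloom filter: no deletes, tolerable false-positive rate."""
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive num_hashes positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(key.encode('utf-8')).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, 'big') % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means present or false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))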
Are there any more efficient ways to do this? Maybe a better design? Thanks!
[1] https://cloud.google.com/datastore/docs/concepts/limits
I'd suggest breaking down the problem in a more scalable (and less costly) approach.
In the use case you mentioned you can deal with one document at a time, each document having a corresponding entity in the datastore.
The webpage URL uniquely identifies the page, so you can use it to generate a unique key/identifier for the respective entity. With a single key lookup (strongly consistent) you can then determine if the entity exists or not, i.e. if the webpage has already been considered for crawling. If it hasn't then a new entity is created and a crawling job is launched for it.
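One way to get such a key is to derive it from a hash of the URL (an assumption on my part; it also sidesteps the key-length concern discussed next). A sketch with the google-cloud-datastore Python client:

import hashlib
from google.cloud import datastore

client = datastore.Client()

def doc_key_for_url(url):
    # Fixed-length key name derived from the URL.
    return client.key('Document', hashlib.sha256(url.encode('utf-8')).hexdigest())

url = 'https://example.com/some/page'
key = doc_key_for_url(url)
entity = client.get(key)            # strongly consistent single-key lookup
if entity is None:                  # not yet considered for crawling
    entity = datastore.Entity(key=key)
    entity['url'] = url
    client.put(entity)
    # launch_crawl_job(entity)      # hypothetical hook for the crawling job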
The length of the entity key can be an issue; see How long (max characters) can a datastore entity key_name be? Is it bad to have very long key_names?. To avoid it you can store the URL as a property of the webpage entity. You'll then have to query for the entity by the url property to determine if the webpage has already been considered for crawling. This is only eventually consistent, meaning it may take a while from when the document entity is created (and its crawling job launched) until it appears in query results. Not a big deal; it can be addressed by a bit of logic in the crawling job to prevent and/or remove duplicate documents.
I'd keep the "like" information as small entities mapping a document to a user, separated from the document and from the user entities, to prevent the drawbacks of maintaining possibly very long lists in a single entity, see Manage nested list of entities within entities in Google Cloud Datastore and Creating your own activity logging in GAE/P.
When a user likes a webpage with a particular URL, you just have to check if the matching document entity exists (see the sketch after this list):
- if it does, just create the like mapping entity
- if it doesn't, and you used the above-mentioned unique key identifiers:
  - create the document entity and launch its crawling job
  - create the like mapping entity
- otherwise:
  - launch the crawling job, which creates the document entity and takes care of deduplication
  - launch a delayed job to create the mapping entity later, when the (unique) document entity becomes available; possibly chained off the crawling job, and some retry logic may be needed
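Putting the first two branches together, a sketch assuming the URL-derived keys from the earlier snippet (record_like, the Like kind, and the crawling hook are illustrative names):

from google.cloud import datastore

client = datastore.Client()

def record_like(user_id, url):
    doc_key = doc_key_for_url(url)  # from the earlier sketch
    if client.get(doc_key) is None:
        # Document not seen yet: create it and launch its crawling job.
        doc = datastore.Entity(key=doc_key)
        doc['url'] = url
        client.put(doc)
        # launch_crawl_job(doc)     # hypothetical crawling hook
    # Small mapping entity: one per (user, document) pair.
    like = datastore.Entity(key=client.key('Like', '%s:%s' % (user_id, doc_key.name)))
    like['user_id'] = user_id
    like['doc_id'] = doc_key.name
    client.put(like)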
Checking if a user liked a particular document becomes a simple query for one such mapping entity (with a bit of care as it's also eventually consistent).
With such a scheme in place you no longer have to make those massive lookups; you only do one at a time, which is OK: a user liking documents one at a time is IMHO more natural than providing a large list of liked documents.

Entity associations mapping, ORM and Data Modelling approach for a complex task

I'm working on a study project to learn different approaches using Symfony2, Doctrine 2.4.7 as the ORM, and MySQL 5.5 as the DB. I've deliberately minimized my question for better understandability and readability; if you need more details you only have to ask, and forgive me if my English is not so good.
To avoid a broad discussion prompted by the title of my question, let me summarize the problem with a simple and common case (though a complex one for me, because I'm new to Doctrine).
The Model:
The User entity (mapped) that stores the user's data
The Category entity (mapped) that stores some categories associated with the User via a bidirectional ManyToMany.
Each User can select one or more Categories.
The Problem:
User categories number close to 100.
Many Categories could have a specific associated form.
Each form is composed of common and/or category-specific fields (from 1 to 10 fields per category).
The Goal:
Understand the most balanced approach for this use case (in terms of flexibility and performance) for creating the entities and associations needed to store the data submitted by the user (some of which I would like to be searchable).
Some Related References:
Doctrine2 docs
Serialized LOB
Extensible Data Modelling
and many other threads that are not very relevant...
A Possible Solution:
Manually create a form type for each category containing the block of related fields (I register these forms as services in the DIC and use blocks for the fields I need to reuse in more than one category).
Create a CategoryForm entity with the properties needed to retrieve the name of the form related to the category (useful to the form factory when I build the form), with a unidirectional ManyToMany association with Category, and to store the serialized LOB (the data coming from the form and related to the User).
Is there a better approach that avoids serializing the object into a LOB? (Maybe I'm wrong, but serialized data is not searchable/indexable in MySQL.)
Any other solution or reference to a readable resource is welcome!
Well, I will try to answer the question with a simple guess: the category is something shared between several users (since you have the many-to-many).
So, if you want the User's form to be able to set (add/delete or update) Categories associated with the user, then you should just have a collection of entity widgets related to the Category.
Why do I say that?
Since your categories are linked to several users, the way you want to treat the relation between Categories and Users will cause any update of an existing Category from a User's form to be propagated to other Users.
This means that Categories should be created/updated by a single form (modulo your needs). You can then link the Category to the User from the User's form.
As far as the number of Category forms is concerned, there are several parameters to consider:
Are all the Categories validated the same way (so that you know whether simply hiding widgets will make validation work)?
Do you have a large number of different types of Categories?
If yes, are they always composed the same way for a given type?
Give further details if I'm wrong in my initial guess ;)

What's the best way of implementing a many-to-many relationship that contains meta information, RESTfully?

When representing data models through a RESTful interface, it is common practice to create top-level endpoints that correspond to a type/group of objects:
/users
/cars
We can re-use HTTP verbs to enable actions upon these groups (GET to list, POST to create, etc). And when representing a model with a "dependency" (being that it can't exist without a "parent"), we can create deeper endpoints to represent that dependency relationship:
/users/[:id]/tokens
In this case, it makes sense to not have a top-level endpoint of /tokens, as they shouldn't be able to exist without the user.
Many-to-many relationships get a bit more tricky. If two models can have a many-to-many relationship but can also truly exist on their own, it makes sense to give both objects a top-level endpoint and a deeper endpoint for defining that relationship:
/users
/cars
/users/[:id]/cars
/cars/[:id]/users
We can then use PUT and DELETE methods to define those relationships through an HTTP interface: PUT /users/[:user_id]/cars/[:car_id]. It makes sense that running that PUT operation would create a data-model that somehow links the two objects (like a join table in a relational DB).
The tricky part, then, becomes deciding on where to limit the interface to combat redundancy.
Do you allow a GET request to the second-level deep endpoints (GET /users/[:user_id]/cars/[:car_id])? Or do you require that they access the "car" from the top level GET /cars/[:id]?
Now, what if the many-to-many relationship contains meta information? How do you represent that and where do you return it?
For example, what if we wanted to keep track of how many times a user drove a certain car? Where would we return that information? If we return it at the nested endpoint, are we violating REST or being inconsistent if we return the meta information and not the resource? Do we embed the meta information in the requested resource through an attribute of some kind?
Pls advise. :P (but really, thanks)
This is really more of a personal design preference at this point IMHO.
I would personally opt for stopping at /users/[:user_id]/cars/ and then requiring a call to /cars/[:car_id] to get the car information.
If you are including relation specific metadata though, like "how many times a user drove a certain car?" it would make sense to keep that under a deeper relationship like /users/[:user_id]/cars/[:car_id].
Truthfully it's not an exact science. You have to do what is simplest, most expressive, yet still powerful enough for your data model.
You could create a new resource. Something like users/[:user_id]/cars/[:car_id]/stats, whose response includes {drivings_count: 123}. You'd probably only allow a GET of this resource.
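A minimal Flask sketch of that resource (the route shape follows the discussion above; get_drivings_count is a placeholder for whatever persistence layer stores the relationship metadata):

from flask import Flask, jsonify

app = Flask(__name__)

def get_drivings_count(user_id, car_id):
    # Placeholder: look up the counter on the join record in your datastore.
    return 123

@app.route('/users/<user_id>/cars/<car_id>/stats', methods=['GET'])
def car_stats(user_id, car_id):
    # Metadata about the relationship lives at its own sub-resource.
    return jsonify({'drivings_count': get_drivings_count(user_id, car_id)})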

What is the best way to implement multilingual domain objects using NHibernate?

What is the best way to design domain objects that can have multilingual fields? An example would be a Product class with a multilingual Description.
I have found a few links but could not decide which one is the best approach.
http://fabiomaulo.blogspot.com/2009/06/localized-property-with-nhibernate.html
(This stores all localised language data in one field. It can be a problem if we query from SQL.)
http://ayende.com/Blog/archive/2006/12/26/LocalizingNHibernateContextualParameters.aspx
(This one has a warning at the beginning that it is a hack and no longer supported)
http://www.webdevbros.net/2009/06/24/create-a-multi-languaged-domain-model-with-nhibernate-and-c/
(This does not describe how multilingual data will be structured in the database.)
Does anyone have experience using NHibernate with multilingual data? Is there a better way?
The third option looks great. The hibernate mapping is given, but not the database schema - if that's what you are missing, then I'll sketch it out here:
dictionary
----------
ID: int - identity
name: nvarchar(255)
phrase
------
dictionary_id:int (fkey dictionary.ID)
culture_id:int (LCID)
phrase:nvarchar(255) - this is the default size - seems too small
According to this blog entry, 255 is the default string length for String values. To overcome the short string length on the phrase text, you can change the <element> tag to
<element column="phrase" type="String" length="4001"></element>
To use this in your domain model, you add a PhraseDictionary property to your entity wherever you want translatable text, e.g. the title property or description property.
I think the article describes a great approach, and it is the one that I would go for.
EDIT: In response to the comments, make the length less than 4001 if you know the absolute maximum size is less than that, as this will typically be faster. Also, NHibernate will lazily fetch the collection, but it may fetch all the items at once. You can profile to determine if this has any performance implications. (If you have only a handful of languages then I doubt you will see a difference.) If you have many languages (Say 50+) then it may be worthwhile creating custom properties to fetch the localized text. These will issue queries to fetch specifically the text required. More importantly, you may be able to fetch all the text for a given entity in one query, rather than each localized text property as a separate query.
Note that this extra effort is only needed if profiling gives you reason to be concerned about the performance. Chances are that the implementation in the article as is will function more than adequately.
I only have experience with Hibernate, but since NHibernate is so similar:
One option is to define a component type MultiLingualString with members for each language (this assumes the set of languages is known at coding time). This type is also a convenient place for a getter that returns the string for a given language id.
class MultiLingualString {
    String english;
    String chinese;
    String klingon;

    String forLanguage(Language lang) {
        // Return the member matching the requested language.
        switch (lang) {
            case ENGLISH: return english;
            case CHINESE: return chinese;
            case KLINGON: return klingon;
            default: throw new IllegalArgumentException("Unsupported language: " + lang);
        }
    }
}
This results in the strings for all languages being stored in separate columns in the database while the representation in the object world retains fine granularity.
The advantage is that no join is required to fetch the strings. On the other hand, the only way not to fetch a string with this approach is to use a projection, which is a severe limitation if the strings are large, numerous and rarely needed.
If you do this a lot, writing a UserType might be worth it.
From a strictly database-oriented standpoint with SQL Server, you should have one table with all of the base data (record key, dates, numbers, etc.) and one table with all of the translatable string data. Let's call the two tables Base and Base_Description.
Base ensures that there is a single key for each record, the key might be a string or auto-generated id depending on your particular use case.
The Base_Description table is related to the Base table, but also contains a value to select the language that the data is in. In my projects we use the langid column from sys.languages, because we can set the language of the connection with SET LANGUAGE and then grab it with @@LANGID for most operations.
In our testing we found this to be significantly faster than having multiple fields for each language, it also allows you to add other languages more easily. We are also using SQL Server Full-Text indexing and it fully works with this method. You should index in the neutral language and then you can pick the language to search against at run time (also filtering against the LangID column in Base_Description).
Do your requirements include the domain objects actually having multiple-language properties in the same object? And, if so, is it unlimited translations stored in the object (in a collection, say - in which case I would say that it would need to be just like any master/detail or parent/child collection) or fixed translations, in which case the languages (and thus the mapping to results of a stored proc or whatever) have to be determined statically anyway?
In many internationalized applications I worked on, the data was in only one language - customer names, the product names (there was no point in mapping even identical products used in one country to products in another, they all had different distributors and different SKUs, and of course localized pricing). The interface was also only in one language (at a time). So all the domain objects only required one language at a time. Thus the language of the translation would be determined when the object was instantiated.
We had translation user interfaces which allowed users to update the translated texts, but these only required two languages at a time (local and the default). I can see this being closest to what you are talking about. I guess that you would have child collections for each translatable property with all the possible translations in the collection. This would probably be closest to the second solution in the third article you linked. Of course, at this point you would also need to see if you want eager/lazy loading etc.
