Graph Database - How to deal with multilingual data

I'm trying to design a multilingual graph database, but I'm struggling to come up with an optimal model.
My current proposal is to make two node types: Movie and MovieTranslation.
Movie holds all relationships such as likes, related, ratings and comments. MovieTranslation contains all translatable data (title, plot, genres). A Movie node does not contain these kinds of properties, only the original_title.
Movie and MovieTranslation are tied together by a translation relationship.
When I query nodes, I would check whether they have a translation relationship for the queried locale (en_US, for example). If so, I merge the translation with the main node as the result.
I think this way might not be the best, but I can't think of a better one.
Do you have a better suggestion for the database model? It would be much appreciated.
I'm using Neo4j, in case you need this information.
Thanks,
Vinicius.

I suggest moving the original title to its own node as well; call it MovieTitle. "Complicating" your model in this way should actually "simplify" (or at least standardise) your queries, because you're always looking in one place for film titles (also useful for indexing and searching).
You're assuming that films only have one original title, which isn't the case. A Korea-Japan co-production will have at least two original titles. Whole genres of Japanese cinema were released with different original Japanese titles in cinemas and on VHS.
Distinct from the idea of an original title is that of specific language titles. The same film released in different Chinese-speaking countries will have different Chinese-language titles that are deemed more marketable to the specific local audiences.
To get the original title:
MATCH (c:Country)<-[:HAS_NATIONALITY]-(m:Movie)-[:HAS_TITLE]->(t:MovieTitle)-[:HAS_NATIONALITY]->(c)
WHERE m.id = 1
RETURN COLLECT([t.title, c.country_code])
To get the original title in China:
MATCH (m:Movie)-[:HAS_TITLE]->(t:MovieTitle)-[:HAS_NATIONALITY]->(c:Country)
WHERE c.country_code = "CN"
RETURN m, COLLECT([t.title, c.country_code])
To get all language titles:
MATCH (m:Movie)-[:HAS_TITLE]->(t:MovieTitle)-[:HAS_NATIONALITY]->(c:Country)-[:HAS_LANGUAGE]->(l:Language)
RETURN m, COLLECT([t.title, l.language_code])
To get all Chinese-language titles:
MATCH (m:Movie)-[:HAS_TITLE]->(t:MovieTitle)-[:HAS_NATIONALITY]->(c:Country)-[:HAS_LANGUAGE]->(l:Language)
WHERE l.language_code = "zh"
RETURN m, COLLECT([t.title, c.name])
I would separate plot and genre into their own nodes. There is an argument that different national cinemas have unique genres, but if westerns and samurai dramas are both sub-genres of period dramas then you want to find them both on a period drama search.
I would still have the idea of Translation nodes, but don't confuse them with the domain you're modelling. Translation should be domain-ignorant and, for simple words/phrases like "romantic comedy", could almost be a third-party graph plug-in (the sort of thing GraphAware might release).
Get the French-language genre titles of a specific film:
MATCH (m:Movie)-[:HAS_GENRE*]->(g:Genre)-[:HAS_TRANSLATION]->(t:Translation)-[:HAS_LANGUAGE]->(l:Language)
WHERE m.id = 100 AND l.language_code = "fr"
RETURN COLLECT(t.translation)
Get all romantic comedies:
MATCH (m:Movie)-[:HAS_GENRE*]->(g:Genre)-[:HAS_TRANSLATION]->(t:Translation)
WHERE t.translation = "comédie romantique"
RETURN m
Unlike movie titles and genres, plots are altogether simpler, because you're modelling the film's story as a blob of text rather than as a domain object in itself. Perhaps later you may do textual analysis on the plot texts to find themes, gender bias, etc., and model this in the graph as well.
Get the French language plot for a specific movie:
MATCH (m:Movie)-[:HAS_PLOT]->(p:Plot)-[:HAS_LANGUAGE]->(l:Language)-[:HAS_TRANSLATION]->(t:Translation)
WHERE m.id = 100 AND t.translation = "French"
RETURN p.plot
(Please treat the Cypher queries as pseudo-code. I didn't make a graph and test them.)

I think the model is ok.
You can RETURN movie, translation or RETURN {movie: movie, translation: translation}.
Currently, converting nodes to maps and combining those maps is not yet supported; that's something on the roadmap.
How and where do you want to use the nodes? If it's for rendering, you can just access the two columns or entries. If it's for graph visualization, you can also combine them into a single node in the JSON source for the viz.
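For example, with the official Neo4j Python driver you can do that merge client-side. This is only a rough sketch of the idea, assuming a TRANSLATION relationship and a locale property (all names here are hypothetical):
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (m:Movie {id: $movie_id})
OPTIONAL MATCH (m)-[:TRANSLATION]->(t:MovieTranslation {locale: $locale})
RETURN m AS movie, t AS translation
"""

def movie_for_locale(movie_id, locale="en_US"):
    with driver.session() as session:
        record = session.run(QUERY, movie_id=movie_id, locale=locale).single()
        result = dict(record["movie"])          # node properties as a plain dict
        if record["translation"] is not None:   # overlay the localized title/plot/genres
            result.update(dict(record["translation"]))
        return result                           # falls back to original_title when no translation exists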


Do I need to create all nodes by hand in Neo4j?

I am probably missing something because I am very new to Neo4j, but looking at their Movie graph - probably the very first graph to play with when you are learning the platform - they give us a really big piece of code where every node, label and property is input by hand, one after the other. OK, that seems fair for a small graph for learning purposes. But how should I proceed when I want to import a CSV and create a graph from that data? I believe hand input is not expected at all.
My data looks something like this:
date,origin,destiny,value,type,balance
01-05-2021,A,B,500,transf,2500
It has more than 10 thousand rows like this.
I loaded it as:
LOAD CSV FROM "file:///MyData.csv" AS data
RETURN data;
and it worked. The data was loaded etc. But now I have some questions:
1- How do I proceed if I want origin to be one node, destiny to be another node, type to be the edge and value to be an edge property? I mean, I know how to create it like (a)-[]->(b), but how do I create the entire graph without creating it edge by edge, node by node, property by property, etc.?
2- Am I able to select by date and see something like a time evolution of this graph? I want to see all transactions on 20-05-2021, 01-05-2021, etc. and see how it evolves. Is that possible?
As the example in the official docs shows here: https://neo4j.com/docs/operations-manual/current/tutorial/neo4j-admin-import/#tutorial-neo4j-admin-import
You may want to create 3 separate files for the import:
First: you need the movies.csv to import nodes with label :Movie
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
Second: you need actors.csv to import nodes with label :Actor
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
Finally, you can import relationships
As you see, actors and movies are already imported. So now you just need to specify the relationships. In the example, you're importing the ACTED_IN relationships (with a role property) in the given format:
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
So as you see in the header, you've got values:
:START_ID - where the relationship starts, from which node
role - a property name (you can specify multiple properties here, just make sure the CSV contains data for them)
:END_ID - where the relationship ends, to which node
:TYPE - type of the relationship
That's all :)
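With those three files in place, the import itself is a single command, roughly neo4j-admin import --nodes=movies.csv --nodes=actors.csv --relationships=roles.csv, run against an empty/offline database (check the linked tutorial for the exact flags for your Neo4j version).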

QGIS rules - Finding roads crossing cities with a query

On the same layer in QGIS I have highways (primary, secondary, etc.) and city boundaries (place=city). I want to show/label only primary roads crossing cities. I tried something like
"highway" = 'tertiary' AND "place" = 'city'
but got no items, when
"highway" = 'tertiary'
alone returns 6 elements.
Or maybe another way? How can I find crossing elements?
Since both of your target attributes are in the same layer, it seems to be a query problem.
Have you checked whether the condition that you are looking for really exists in your dataset?
If so: like many things in QGIS, Rule-based Labeling uses Structured Query Language (SQL), which is strongly typed, to fetch the data you are looking for. Therefore, it's important to be sure about the data types of your dataset, and to follow the double-quote pattern for the "column" and single quotes for the 'values' if they are varchar or string, exactly as you already did in your example.
Furthermore, check whether the values that you are trying to match in your Rule-based Labeling are written exactly as they appear in your attribute table. In any case, if your dataset is huge, you can use this approach as a workaround:
"highway" ILIKE '%tertiary%' AND "place" ILIKE '%city%'
The ILIKE operator ignores the difference between upper and lower case in your query, which means that if you have a value 'Tertiary' it will still be matched.
Besides, the wildcard character % matches values that contain the strings 'tertiary' and 'city' in any position, which can avoid lots of headaches due to whitespace before or after the string.

R biomaRt package: obtaining all values in linked databases

A bioinformatics programming question. In R, I have a classic speciesA-to-speciesB gene symbol conversion, in this example from mouse to human, which I'm performing using biomaRt, and specifically the getLDS function.
x<-c("Lbp","Ndufv3","Ggt1")
require(biomaRt)
convert <- function(x){
  human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
  newgenes = getLDS(
    attributes = "mgi_symbol",
    filters = "mgi_symbol",
    values = x,
    mart = mouse,
    attributesL = "hgnc_symbol",
    martL = human,
    uniqueRows = TRUE
  )
  humanx <- unique(newgenes)
  return(humanx)
}
conversion<-convert(x)
However, I would like to obtain ALL IDs present in the linked database: in other words, all mouse/human pairs (in this example). Something to tell the values parameter of the getLDS function to retrieve all IDs, not just those specified in the x variable. I am talking about a full map, tens of thousands of lines long, specifying all orthologous relationships between symbols of the two databases.
Any ideas or workarounds? Thanks a lot!
I believe a workaround could be retrieving all IDs from the Biomart database itself, here: https://www.ensembl.org/biomart/martview/
Click on choose database -> Ensembl Genes
Choose dataset -> your selected species (e.g. Mouse genes)
Click on Results -> Check "Unique results only" -> Go
Profit
The list retrieved here currently has 53,605 IDs, which is, I believe, what you need.
Enjoy!

Is there a way to represent the changes over time in a graph of the knowledge base in grakn?

Given a context where, for example, you have a set of facts in your graph database / knowledge base (as in Grakn) that represent the current state of a graph (in plain text here), like:
version 1 (jan/2016): "Rachel is a person that is an English teacher for a class of 10 students at University ABC".
change 1 (mar/2016), which generates version 2: "Alice replaces Rachel".
version 2 (mar/2016): "Alice is a person that is an English teacher for a class of 10 students at University ABC".
Given that, I know that I could represent the versions inside the graph and replicate everything (minus the change) from version 1 into a new set of data (nodes and edges) for version 2.
But I am wondering if there is a best practice (or some mechanism of the engine) for representing these changes over time, like versioning of the data set, or something similar that would apply the change to a new data set but keep a history, so that you can recompose a previous state of the graph.
The only thing close to that is that Grakn can support attaching attributes to relationships. For example:
insert
$x (spouse: $p1, spouse: $p2) isa marriage;
$x has date "01/10/2010";
You can also attach attributes to attributes. So if you defined an attribute type, for example Version, you could attach that to all your relationships.
So while it cannot represent change over time out of the box, you can work around it to some degree, depending on your use case.

Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, tv-s etc) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"
should all match "Canon PowerShot A20 IS". I've worked with levenshtein distance with some added heuristics (removing obvious common words, assigning higher cost to number changes etc), which works to some extent, but not well enough unfortunately.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (OK, soundex might help to distinguish the T and R in this case, but the names might as well be 400T and 400R); the first and the third are quite far from each other as strings, but are the same product.
Obviously, the matching algorithm cannot be 100% precise; my goal is to automatically match around 80% of the names with high confidence.
Any ideas or references are much appreciated.
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything obviously chaff, like maybe New is actually common but not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
Not a perfect solution by any stretch, but I don't think you are expecting one?
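A rough sketch of that idea in Python, where word frequencies over the whole list of names stand in for the word cloud and the chaff list is the hand-edited part (names and thresholds are made up):
from collections import Counter
import re

names = [
    "Canon PowerShot a20IS",
    "NEW powershot A20 IS from Canon",
    "Digital Camera Canon PS A20IS",
]

def tokens(name):
    return re.findall(r"[a-z0-9]+", name.lower())

# Word frequencies over the whole corpus: the "word cloud" step.
frequencies = Counter(t for name in names for t in tokens(name))
print(frequencies.most_common(5))              # inspect this to decide what is chaff

# Hand-edited chaff list: words that are common but not key.
chaff = {"new", "from", "digital", "camera"}

def keywords(name):
    return {t for t in tokens(name) if t not in chaff}

def shared_keyword_ratio(a, b):
    ka, kb = keywords(a), keywords(b)
    return len(ka & kb) / max(len(ka | kb), 1)  # share of keywords the two raw names have in common

print(shared_keyword_ratio(names[0], names[1]))  # 0.4: 'canon' and 'powershot' match, the model tokens differ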
The key understanding here is that you do have a proper distance metric. That is in fact not your problem at all. Your problem is in classification.
Let me give you an example. Say you have 20 entries for the Foo X1 and 20 for the Foo Y1. You can safely assume they are two groups. On the other hand, if you have 39 entries for the Bar X1 and 1 for the Bar Y1, you should treat them as a single group.
Now, the distance X1 <-> Y1 is the same in both examples, so why is there a difference in the classification? That is because Bar Y1 is an outlier, whereas Foo Y1 isn't.
The funny part is that you do not actually need to do a whole lot of work to determine these groups up front. You simply do a recursive classification. You start out with one node per group, and then add a supernode for the two closest nodes. In the supernode, store the best assumption, the size of its subtree and the variation within it. As many of your strings will be identical, you'll soon get large subtrees with identical entries. Recursion ends with the supernode at the root of the tree.
Now map the canonical names against this tree. You'll quickly see that each will match an entire subtree. Now use the distances between these subtrees to pick the distance cutoff for that entry. If you have both Foo X1 and Foo Y1 products in the database, the cut-off distance will need to be lower to reflect that.
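A minimal sketch of that bottom-up grouping using SciPy's hierarchical clustering, with difflib's ratio standing in for whatever string distance you prefer (entries and cutoff are just for illustration):
from difflib import SequenceMatcher
from itertools import combinations
from scipy.cluster.hierarchy import fcluster, linkage

entries = ["Lenovo T400", "Lenovo T400", "Lenovo R400", "New Lenovo T-400, Core 2 Duo"]

def distance(a, b):
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Condensed pairwise distance matrix, in the pair order SciPy expects.
condensed = [distance(a, b) for a, b in combinations(entries, 2)]

# Merge the two closest nodes into a supernode, repeatedly, until one tree remains.
tree = linkage(condensed, method="average")

# Cut the tree at a distance threshold to get candidate groups; as described above,
# the right cutoff depends on which products actually coexist in your data.
labels = fcluster(tree, t=0.3, criterion="distance")
print(list(zip(entries, labels)))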
edg's answer is in the right direction, I think - you need to distinguish key words from fluff.
Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a CPU OEM package.
If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
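One very rough way to sketch the "numbers bind to the letters next to them" rule is a regex normalisation pass (this is only an illustration, and it will also glue things like "Core 2" together):
import re

def normalize_model(text):
    # "T-400", "T 400" and "T400" all collapse to "t400".
    return re.sub(r"\b([a-z]+)[\s\-]?(\d+)\b", r"\1\2", text.lower())

print(normalize_model("New Lenovo T-400, Core 2 Duo"))   # -> new lenovo t400, core2 duo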
Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.
You might be able to make use of a trigram search for this. I must admit I've never seen the algorithm to implement an index, but have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
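For illustration, a character-trigram similarity can be as simple as comparing sets of three-character substrings; a proper trigram index (as in PostgreSQL's pg_trgm, for instance) builds on the same idea:
def trigrams(text):
    padded = "  " + text.lower() + " "          # pad so short strings still yield trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / max(len(ta | tb), 1)  # Jaccard overlap of the two trigram sets

print(trigram_similarity("Canon PowerShot A20 IS", "NEW powershot A20 IS from Canon"))
print(trigram_similarity("amoxicillin", "amoxycillin"))   # still high despite the misspelling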
This is a problem of record linkage. The dedupe python library provides a complete implementation, but even if you don't use python, the documentation has a good overview of how to approach this problem.
Briefly, within the standard paradigm, this task is broken into three stages (sketched in code below):
1. Compare the fields, in this case just the name. You can use one or more comparators for this, for example an edit distance like the Levenshtein distance, or something like the cosine distance that compares the number of common words.
2. Turn an array of distance scores into a probability that a pair of records are truly about the same thing.
3. Cluster those pairwise probability scores into groups of records that likely all refer to the same thing.
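In miniature, and with crude stand-ins for each stage (difflib's ratio as the comparator, a bare threshold instead of a learned probability, and connected components as the clustering), the pipeline looks roughly like this:
from difflib import SequenceMatcher
from itertools import combinations
import networkx as nx

records = [
    "Canon PowerShot A20 IS",
    "NEW powershot A20 IS from Canon",
    "Digital Camera Canon PS A20IS",
    "Lenovo T400",
    "New Lenovo T-400, Core 2 Duo",
]

# Stage 1: compare the name field for every candidate pair.
def score(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Stage 2: decide which pairs are likely matches (here: a fixed threshold).
matches = [(a, b) for a, b in combinations(records, 2) if score(a, b) > 0.5]

# Stage 3: cluster the matched pairs into groups of records about the same thing.
graph = nx.Graph()
graph.add_nodes_from(records)
graph.add_edges_from(matches)
print(list(nx.connected_components(graph)))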
You might want to create logic that ignores the letter/number combination of model numbers (since they're nigh always extremely similar).
I don't have any experience with this type of problem, but I think a very naive implementation would be to tokenize the search term and search for matches that happen to contain any of the tokens.
"Canon PowerShot A20 IS", for example, tokenizes into:
Canon
Powershot
A20
IS
which would match each of the other items you want to show up in the results. Of course, this strategy will likely produce a whole lot of false matches as well.
Another strategy would be to store "keywords" with each item, such as "camera", "canon", "digital camera", and searching based on items that have matching keywords. In addition, if you stored other attributes such as Maker, Brand, etc., you could search on each of these.
Spell checking algorithms come to mind.
Although I could not find a good sample implementation, I believe you can modify a basic spell checking algorithm to come up with satisfactory results, i.e. working with words as the unit instead of characters.
The bits and pieces left in my memory:
Strip out all common words (a, an, the, new). What is "common" depends on context.
Take the first letter of each word and its length and make that a word key.
When a suspect word comes up, look for words with the same or similar word key (see the small sketch below).
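A tiny sketch of that word-key idea, with the key being the first letter plus the word length (the vocabulary here is made up):
from collections import defaultdict

def word_key(word):
    return (word[0].lower(), len(word))       # first letter of the word plus its length

vocabulary = ["Canon", "PowerShot", "Lenovo", "ThinkPad"]

# Index the known words by their key.
index = defaultdict(list)
for word in vocabulary:
    index[word_key(word)].append(word)

def candidates(suspect):
    first, length = word_key(suspect)
    found = []
    for delta in (-1, 0, 1):                  # also allow keys of similar length
        found.extend(index.get((first, length + delta), []))
    return found

print(candidates("Pwoershot"))                # -> ['PowerShot']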
It might not solve your problems directly... but you say you were looking for ideas, right?
:-)
That is exactly the problem I'm working on in my spare time. What I came up with is:
Based on keywords, narrow down the scope of the search.
In this case you could have some hierarchy:
type --> company --> model
so that you'd match
"Digital Camera" for the type and
"Canon" for the company, and there you'd be left with a much narrower scope to search.
You could work this down even further by introducing product lines etc.
But the main point is, this probably has to be done iteratively.
We can use the Datadecision service for matching products.
It will allow you to automatically match your product data using statistical algorithms. This operation is done after defining a threshold score of confidence.
All data that cannot be automatically matched will have to be manually reviewed through a dedicated user interface.
The online service uses lookup tables to store synonyms as well as your manual matching history. This allows you to improve the data matching automation next time you import new data.
I worked on the exact same thing in the past. What I did was use an NLP method, a TF-IDF vectorizer, to assign weights to each word. For example, in your case:
Canon PowerShot a20IS
Canon --> weight = 0.05 (not a very distinguishing word)
PowerShot --> weight = 0.37 (can be distinguishing)
a20IS --> weight = 0.96 (very distinguishing)
This will tell your model which words to care about and which to ignore. I had quite good matches thanks to TF-IDF.
But note this: a20IS will not be recognized as a20 IS; you may consider using some kind of regex to handle such cases.
After that, you can use a numeric calculation like cosine similarity.
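A small sketch of that approach with scikit-learn; the exact weights will differ from the illustrative numbers above:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = [
    "Canon PowerShot A20 IS",
    "NEW powershot A20 IS from Canon",
    "Digital Camera Canon PS A20IS",
    "Lenovo T400",
    "New Lenovo T-400, Core 2 Duo",
]

# Words that appear in many names get low weights; distinguishing words get high ones.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(names)

# Pairwise cosine similarity between every pair of product names.
similarity = cosine_similarity(matrix)
print(similarity.round(2))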
