Merge existing records in neo4j, remove duplicates, keep relationships - graph

I've imported my millions of records using CREATE for performance reasons, now I want to MERGE the records together, and keep all the relationships intact.
Any ideas?
EDIT:
MATCH (c1:company), (c2:company)
WITH c1, c2
WHERE c1.name = c2.name
SET c1=c2
Is the type of thing I'm looking for.

If you want to merge nodes in cypher you can do something like this:
MATCH (c:Company)
WITH c.name as name, collect(c) as companies, count(*) as cnt
WHERE cnt > 1
WITH head(companies) as first, tail(companies) as rest
LIMIT 1000
UNWIND rest AS to_delete
MATCH (to_delete)<-[r:WORKS_AT]-(e:Employee)
MERGE (first)<-[:WORKS_AT]-(e)
DELETE r
DELETE to_delete
RETURN count(*);
see: http://www.neo4j.org/graphgist?dropbox-14493611%2Fmerge_nodes.adoc

It doesn't work that way. There is no way to move relationships around, and no way to coalesce existing nodes. You should use MERGE from the beginning, along with constraints and indexes to aid performance.

Related

How are you supposed to think of joins in cosmosdb?

I am very confused by the cosmosdb documentation on joins. When I think of a join conventionally, I think of 2 tables, with 1 shared id, on which I perform the join. These 2 tables have different schemas, but the result of the join is a combined table with a merge of the columns from both tables. The join for cosmosdb does not seem to me intuitively congruent with that.
I have a collection with heterogenous data. Each document can have a different structure from the next. I want to count the number of documents that have a value that is present in the result set of a subquery. Intuitively, I want to do something like this:
SELECT COUNT(1) as c
FROM CollectionName as outer
where outer.type = "table"
JOIN ((SELECT c.id from c where c.type = "database") as inner) on outer.databaseId == t.id
// count the number of tables that are in deleted databases
It would seem like I would need to do a join on the result of the subquery with the result of the outer query, and then process that resulting table. But I am not understanding right now how to do that:
Select COUNT(1)
from Collection outer
where outer.type = 'table'
JOIN (select c.id from c IN outer.databaseId where c.type = "database" and c.state = "deleted")
I am constantly getting a 400 with the above query. So how am I supposed to think about joins in cosmosdb?
Cosmos is a document database. It stores and operates on json data which can be in hierarchical format. Joins in Cosmos reference tuples within these hierarchies where they can be projected with other data in the document.
There is a really good article that talks through this at pretty deep level but also have lots of examples too, Joins in Cosmos DB.
This takes some getting used to writing queries like this but once you get the hang of it you'll be ok. You can easily practice queries using the Query Playground that has a bunch of sample queries for nutrition dataset with food and ingredients. Or follow along with the families data in the docs. You can create additional items and then write some queries to see how joins work.
Hope that is helpful.

select * from (select...) sqlite python

I'm working on a sqlite database and try to make a special request between two tables.
In the first table (table1 for example), i have two columns named "reference" and "ID". I want to search an ID in it, get it value in "reference" and display all informations from the table which have this value as name.
I try to find something on the internet but I didn't find an answer.
This is the request I made:
select * from (select Reference from table1 where Name='Value1')
It only give me the result of
select Reference from table1 where Name='Value1'
EDIT:
I want
select Reference from table1 where Name='Value1' => name of table
select * from name of table => show all elements
I'm new in sqlite but I hope you can help me.
Thank you by advance
Matt
If I understand your question correctly, I don't think there's a way to do it in sql completely (or at least not in a portable way). I'd recommend one of 3 solutions:
Do exactly what you want, but do some processing in Python. That means query your master table, then construct new query based on each of the rows returned.
If you have many tables, possibly changing dynamically - it may be a good idea to rethink your database design. Maybe you can move some of the changing table names into a new column and put your data in one table?
If you have only a few tables available as the Reference and they never change, you could join all the possible tables, like:
SELECT ... FROM table1
LEFT JOIN table2
ON table1.id = table2.id AND table1.Reference = "table2"
LEFT JOIN table3 ...
But you may need to explain it all a bit better...

Indexing columns crossing multiple tables

I have 2 tables, headers with 2 millions rows and files with 30 rows.
I have a query that supposes to get the total number of headers for each directory.
The SQL looks like below:
SELECT files.dir_id, COUNT(*) AS "TOTAL"
FROM headers
LEFT JOIN files ON headers.file_id = files.file_id
GROUP BY files.dir_id
Currently, executing the SQL above is taking 20sec. How can I index it to make it faster?
I have tried CREATE INDEX IF NOT EXISTS HEADERS_FILE ON HEADERS(FILE_ID). This made GROUP BY file_id gets instant response (without left joining files table). However, it doesn't improve the performance for the original query above.
I'm thinking of something like CREATE INDEX INDEX_NAME ON HEADERS, FILES(FILE_ID, DIR_ID) should work. but I find no way in creating such index.
Appreciate for any help. Thanks!
The LEFT join prevents the database from using files as the outer table in the nested loop join.
Try using an inner join, and then adding the missing rows by hand; this might allow better optimization for the two subqueries:
SELECT files.dir_id, COUNT(*) AS "TOTAL"
FROM headers
-- LEFT JOIN expanded by hand for better optimization
INNER JOIN files ON headers.file_id = files.file_id
GROUP BY files.dir_id
UNION ALL
SELECT dir_id, 0
FROM files
WHERE file_id NOT IN (SELECT file_id
FROM headers)

What is the fastest way of selecting by a list of strings in sqlite database?

I have database with with roughly following structure:
table1 (name) -< table2 -< table3 (score)
where -< means 1 to many relationship. What i need to do is for every string in a given list find the linked entry from table3 with a maximum score value. The way i do it now is quite slow, and i wonder of it could be sped up.
How i am doing this:
SELECT k.score,k.yaw,k.pitch,k.roll,k.kp_number,k.ke_number,k.points,k.elems --various fields of third table
FROM File
JOIN FaceDetection AS d ON d.f_id=File.file_id --joining second table
JOIN FaceKey AS k ON k.face_det=d.fd_id --joining third table
WHERE name=:fld
ORDER BY k.score DESC
I open transaction, prepare query with the above text, and in cycle retrieve the entries i am interested in from the database, then commit transaction. What are better, faster ways?
Indexes can be used for all the columns that are used for lookups or sorting, but a query cannot use more than one index per table.
Check the EXPLAIN QUERY PLAN output to see whether this query does table scans or uses indexes.
You are not returning values from any table but FaceKey, so you do not actually need to do a join.
However, rewriting the query as below might or might not help:
SELECT score,
yaw,
pitch,
roll,
kp_number,
ke_number,
points,
elems
FROM FaceKey
WHERE face_det IN (SELECT fd_id
FROM FaceDetection
WHERE f_id IN (SELECT file_id
FROM File
WHERE name = :fld))
ORDER BY score DESC

Dynamic query and caching

I have two problem sets. What I am preferably looking for is a solution which combines both.
Problem 1: I have a table of lets say 20 rows. I am reading 150,000 rows from other table (say table 2). For each row read from table 2, I have to match it with a specific row of table 1 (not matching whole row, few columns. like if table2.col1 = table1.col && table2.col2 = table1.col2) etc. Is there a way that i can cache table 1 so that i don't have to query it again and again ?
Problem 2: I want to generate query string dynamically i.e., if parameter 2 is null then don't put it in where clause. Now the only option left is to use immidiate execute which will be very slow.
Now what i am asking that how can i have dynamic query to compare it with table 1 ? any ideas ?
For problem 1, as mentioned in the comments, let the database handle it. That's what it does really well. If it is something being hit often, then the blocks for the table should remain in the database buffer cache if the buffer cache is sized appropriately. Part of DBA tuning would be to identify appropriate sizing, pinning tables into the "keep" pool, etc. But probably not something that needs worrying over.
If the desire is just to simplify writing the queries rather than performance, then views or stored procs can simplify the repetitive use of the join.
For problem 2, a query in a format like this might work for you:
SELECT id, val
FROM myTable
WHERE filter = COALESCE(v_filter, filter)
If the input parameter v_filter is null, then just automatically match the existing column. This assumes the existing filter column itself is never null (since you can't use = for null comparisons). Also, it assumes that there are other indexed portions in the WHERE clause since a function like COALESCE isn't going to be able to take advantage of an index.
For problem 1 you just join the tables. If there is an equijoin and one table is quite small and the other large then you're likely to get a hash join. This is effectively a caching mechanism, and the total cost of reading the tables and performing the join is only very slightly higher than that of reading the tables (as long as the hash table fits in memory).
It does not make a difference if the query is constructed and run through execute immediate -- the RDBMS hash join will still act as an effective cache.

Resources