I have a task: find duplicates (documents with the same property value) in an Alfresco database, together with a count of how many times each value is duplicated.
In MySQL it would be something like this:
SELECT COUNT(*) AS repetitions, last_name, first_name
FROM cat_mailing
GROUP BY last_name, first_name
HAVING repetitions > 1;
But I have read that "The CMIS query language in Alfresco does not support GROUP BY or HAVING." Is there any query (in any supported language) that performs the described task?
Thank you!
UPD: for now I am counting in the JVM this way (overriding hashCode/equals on Form20Row):
Map<Form20Row, Form20Row> rowsMap = results.stream().parallel().map(doc -> {
    // build one row per document, each starting with a duplicate count of 1
    Form20Row row = new Form20Row();
    String propMark = propertyHelper.getNodeProp(doc, NDBaseDocumentModel.PROP_MARK);
    row.setGroupName(systemMessages.getString("form20.nss.name"));
    row.setDocMark(propMark);
    row.setDupesNumber(1);
    return row;
}).collect(Collectors.toConcurrentMap(
        form20Row -> form20Row,
        form20Row -> form20Row,
        // merge function: rows that compare equal (same hashCode/equals)
        // collapse into one entry, bumping the duplicate counter
        (existing, replacement) -> {
            existing.setDupesNumber(existing.getDupesNumber() + 1);
            return existing;
        }));
Alfresco uses SOLR to search nodes, but SOLR is very limited with joins, aggregate functions, counting, and so on. What you can do is query the SOLR index using facets, e.g. facet.field=field1&facet.mincount=2 (mincount=2 so that only values occurring more than once are returned).
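For illustration, a minimal sketch of such a raw facet request (the core name and the field name my_prop are assumptions; the SOLR field names Alfresco generates for properties depend on your model and version):
http://localhost:8983/solr/alfresco/select?q=*:*&rows=0&facet=true&facet.field=my_prop&facet.mincount=2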
Personally, I would prefer to query the Alfresco DB directly to find nodes having the same value for specific properties. This does not depend on the SOLR index and gives you the full flexibility of SQL.
I agree with Heiko.
This won't be trivial either, since Alfresco keeps "everything" in the alf_node_properties table, and in the case of strings in its string_value column. So you can't know which property a row belongs to without joining the alf_qname table, and for larger databases this can take a long time.
You probably also want to filter out deleted nodes and node versions, and not compare them at all.
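To make that concrete, a rough sketch of such a query against the standard Alfresco schema (the property local name 'mark' is a made-up example; adjust the store filter to your Alfresco version):
SELECT p.string_value, COUNT(*) AS repetitions
FROM alf_node_properties p
JOIN alf_qname q ON q.id = p.qname_id
JOIN alf_node n  ON n.id = p.node_id
JOIN alf_store s ON s.id = n.store_id
WHERE q.local_name = 'mark'           -- hypothetical property name
  AND s.protocol = 'workspace'        -- skip archived (deleted) nodes
  AND s.identifier = 'SpacesStore'    -- skip the version store
GROUP BY p.string_value
HAVING COUNT(*) > 1;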
Related
I have a query that uses WHERE id IN (1,2,3,...) where the list (1,2,3,...) is dynamically generated from an array of integers (not using parameters). Now I have a particular query that takes roughly 500ms with 26623 ids but 50s (100x slower) with 26624 ids.
I couldn't find anything that looks related in https://sqlite.org/limits.html
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
AND flows.id IN (1,2,3) /* <=== here more than 26623 IDs make it super slow */
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
Before I try to make this reproducible in isolation (e.g. search_params is a custom virtual table), does anyone know what limitation I might be running into? It's not the number of IDs per se, since a different query runs just fine with the same IDs.
SQLite version 3.36.0 via better-sqlite3 (Node.js) with a readonly database. The only pragma I use is journal_mode = WAL.
Compiled with (https://github.com/JoshuaWise/better-sqlite3/blob/master/docs/compilation.md#bundled-configuration):
SQLITE_DQS=0
SQLITE_LIKE_DOESNT_MATCH_BLOBS
SQLITE_THREADSAFE=2
SQLITE_USE_URI=0
SQLITE_DEFAULT_MEMSTATUS=0
SQLITE_OMIT_DEPRECATED
SQLITE_OMIT_GET_TABLE
SQLITE_OMIT_TCL_VARIABLE
SQLITE_OMIT_PROGRESS_CALLBACK
SQLITE_OMIT_SHARED_CACHE
SQLITE_TRACE_SIZE_LIMIT=32
SQLITE_DEFAULT_CACHE_SIZE=-16000
SQLITE_DEFAULT_FOREIGN_KEYS=1
SQLITE_DEFAULT_WAL_SYNCHRONOUS=1
SQLITE_ENABLE_MATH_FUNCTIONS
SQLITE_ENABLE_DESERIALIZE
SQLITE_ENABLE_COLUMN_METADATA
SQLITE_ENABLE_UPDATE_DELETE_LIMIT
SQLITE_ENABLE_STAT4
SQLITE_ENABLE_FTS3_PARENTHESIS
SQLITE_ENABLE_FTS3
SQLITE_ENABLE_FTS4
SQLITE_ENABLE_FTS5
SQLITE_ENABLE_JSON1
SQLITE_ENABLE_RTREE
SQLITE_ENABLE_GEOPOLY
SQLITE_INTROSPECTION_PRAGMAS
SQLITE_SOUNDEX
HAVE_STDINT_H=1
HAVE_INT8_T=1
HAVE_INT16_T=1
HAVE_INT32_T=1
HAVE_UINT8_T=1
HAVE_UINT16_T=1
HAVE_UINT32_T=1
Here's the answer from the SQLite forums. Essentially, this is a combination of how the query planner handles IN literals and the cost estimates my virtual table reports. That means I was running into the exact point at which the query planner makes a different decision.
SQLite NGQP is a cost based query planner. The IN () operator with a list of literal values gets implemented as a kind of temporary table; sometimes SQLite decides to create an index and do lookups, other times it decides to use that table as the outermost loop of the query.
EXPLAIN QUERY PLAN should show that in a more concise manner.
If compiled in DEBUG mode with WHERETRACE enabled, the .wheretrace command will show how SQLite NGQP reaches its plan. Essential input is the return values from the xBestIndex method of your virtual table, especially the "number of rows" and the "estimated cost". It is paramount to deliver accurate estimates. Cost should reflect processing cost relative to SQLite native tables.
Note that you can name the IN table by making it a CTE and use CROSS JOIN to force the query plan that works fast.
https://sqlite.org/forum/forumpost/a3d68ed8b40cf583?t=h
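For illustration, a minimal sketch of that CTE + CROSS JOIN workaround applied to the query above (same tables as before; in SQLite, CROSS JOIN prevents the planner from reordering the two tables, pinning ids as the outer loop):
WITH ids(id) AS (VALUES (1), (2), (3))
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM ids
CROSS JOIN flows, view_requests AS req, search_params(search) AS params
WHERE flows.id = ids.id
  AND flows.request_id = req.id
  AND search NOT IN ('', '?')
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC;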
The workaround I use is json_each, serializing the array of integers into a JSON string. In my particular use case this has some other benefits as well (e.g. I can bind a single parameter and reuse the query with any number of IDs), so I don't mind doing that:
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE search NOT IN ('', '?')
-AND flows.id IN (1,2,3)
+AND flows.id IN (SELECT value FROM json_each('[1,2,3]'))
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
I also know that the generic virtual table implementation of better-sqlite3 makes a trade-off between being easy to use (it's ridiculously easy) and achieving maximum performance.
In my Android application I use Cursor c = db.rawQuery(query, null); to query data from a local SQLite database, and one of the query strings looks like the following:
SELECT t1.* FROM table t1
WHERE NOT EXISTS (
SELECT 1 FROM table t2
WHERE t2.start_time = t1.start_time AND t2.stop_time > t1.stop_time
)
However, the query gets very slow when the database grows large. I have been trying to introduce indexing to speed it up, but so far without much success, so it would be great to have some help here; it is also hard to find examples of this for Android applications.
You can create a composite index for the columns start_time and stop_time:
CREATE INDEX idx_name ON table_name(start_time, stop_time);
You can read in The SQLite Query Optimizer Overview:
The ON and USING clauses of an inner join are converted into
additional terms of the WHERE clause prior to WHERE clause analysis
...
and:
If an index is created using a statement like this:
CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z);
Then the index might be used if the initial columns of the index
(columns a, b, and so forth) appear in WHERE clause terms. The initial
columns of the index must be used with the = or IN or IS operators.
The right-most column that is used can employ inequalities.
You may have to uninstall the app from the device so that the database is deleted and recreated on the next run, or increase the database version number so that you can create the index in the onUpgrade() method.
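To verify the index is actually picked up, you can prefix the query with EXPLAIN QUERY PLAN (a sketch, with the table renamed to the hypothetical events, since table is a reserved word):
EXPLAIN QUERY PLAN
SELECT t1.* FROM events t1
WHERE NOT EXISTS (
    SELECT 1 FROM events t2
    WHERE t2.start_time = t1.start_time
      AND t2.stop_time > t1.stop_time
);
-- the subquery should then report something like:
-- SEARCH t2 USING COVERING INDEX idx_name (start_time=? AND stop_time>?)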
I am trying to select data based on a status, which is a string. What I want is for status 'draft' to come first, so I tried this:
SELECT *
FROM c
ORDER BY c.status = "draft" ? 0:1
I get an error:
Unsupported ORDER BY clause. ORDER BY item expression could not be mapped to a document path
I checked the Microsoft site and I see this:
The ORDER BY clause requires that the indexing policy include an index for the fields being sorted. The Azure Cosmos DB query runtime supports sorting against a property name and not against computed properties.
Which I guess makes what I want to do impossible with queries... How could I achieve this? Using a stored procedure?
Edit:
About the stored procedure: thinking about it, that would mean I need to retrieve all the data before ordering it, which would be bad, as I take at most 100 documents from my database... Is there any way to do this without retrieving all the data first?
Thanks!
ORDER BY item expression could not be mapped to a document path.
Basically, we are told we can only sort by properties of the document, not by derived values; c.status = "draft" ? 0 : 1 is a derived value.
My idea:
Split the SQL into two queries: the first one SELECT c.* FROM c WHERE c.status = 'draft', the second one SELECT c.* FROM c WHERE c.status <> 'draft' ORDER BY c.status. Finally, combine the two result sets.
Or you could use the stored procedure you mentioned in your question to post-process the result of SELECT * FROM c ORDER BY c.status, putting the draft documents in front of the others with an if-else condition.
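A sketch of the first idea, keeping the 100-document page size from the question (how you issue the two calls depends on your SDK). First fetch the drafts:
SELECT TOP 100 * FROM c WHERE c.status = 'draft'
and if fewer than 100 documents come back, top up with the rest:
SELECT TOP 100 * FROM c WHERE c.status <> 'draft' ORDER BY c.status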
I am using aws-sdk to connect to DynamoDB, and I have run into a scenario where I have one DynamoDB table with different partition/hash keys, and I have to scan and filter to get results. Scanning the entire table would be a costly operation. Is there a way to scan only a certain partition/hash key of a table?
You have to use DynamoDB Query. You can query any table or secondary index that has a composite primary key (a partition key and a sort key).
In my opinion you shouldn't use Scan, because it is very costly and slow.
You didn't say which programming language you are using, but here are some examples for Query:
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB.html#query-property
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html
https://www.dynamodbguide.com/querying/
About indices:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SQLtoNoSQL.Indexes.html
UPDATE #1:
Maybe this will help:
Add a new column to your table whose value is static (for example, column name: const_value, value: const).
Create a new secondary index on your table:
'partition key': 'const_value'
'sort key': the column you want to filter on
Then you can use Query against that index, as sketched below.
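A rough sketch of the resulting Query request, in the JSON shape of the low-level API linked above (the table, index, and attribute names here are made up):
{
    "TableName": "MyTable",
    "IndexName": "const_value-index",
    "KeyConditionExpression": "const_value = :c AND created_at > :d",
    "ExpressionAttributeValues": {
        ":c": { "S": "const" },
        ":d": { "S": "2020-01-01" }
    }
}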
I have recently stumbled upon a problem with selecting relationship details from one table and inserting them into another table; I hope someone can help.
I have a table structure as follows:
ID (PK)  Name       ParentID
1        Myname     0
2        nametwo    1
3        namethree  2
This is the table I need to select from to get all the relationship data. There could be an unlimited number of sub-links (is there a function I can create to handle the looping?).
Then, once I have all the data, I need to insert it into another table, and the IDs will have to change, as the IDs must go in order (e.g. I cannot have ID 2 be a child of ID 3). I am hoping I can use the same function from the select to do the inserting.
If you are using SQL Server 2005 or above, you may use recursive queries to get your information. Here is an example:
With tree (id, Name, ParentID, [level])
As (
Select id, Name, ParentID, 1
From [myTable]
Where ParentID = 0
Union All
Select child.id
,child.Name
,child.ParentID
,parent.[level] + 1 As [level]
From [myTable] As [child]
Inner Join [tree] As [parent]
On [child].ParentID = [parent].id)
Select * From [tree];
This query will return the row requested by the first portion (Where ParentID = 0) and all sub-rows recursively. Does this help you?
I'm not sure I understand what you want to have happen with your insert. Can you provide more information in terms of the expected result when you are done?
Good luck!
For the retrieval part, you can take a look at Common Table Expressions. This feature provides recursive operations in SQL.
For the insertion part, you can use the CTE above to regenerate the IDs and insert accordingly, as sketched below.
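A hedged sketch of that idea for SQL Server (the target table name is made up; ordering by level guarantees every parent gets a smaller new ID than its children):
WITH tree (id, Name, ParentID, [level]) AS (
    SELECT id, Name, ParentID, 1
    FROM [myTable]
    WHERE ParentID = 0
    UNION ALL
    SELECT child.id, child.Name, child.ParentID, parent.[level] + 1
    FROM [myTable] AS child
    INNER JOIN tree AS parent ON child.ParentID = parent.id
),
numbered AS (
    -- assign new sequential IDs, parents before children
    SELECT *, ROW_NUMBER() OVER (ORDER BY [level], id) AS new_id
    FROM tree
)
INSERT INTO [targetTable] (ID, Name, ParentID)
SELECT n.new_id,
       n.Name,
       COALESCE(p.new_id, 0)    -- remap each parent reference to its new ID
FROM numbered AS n
LEFT JOIN numbered AS p ON n.ParentID = p.id;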
I hope this link helps: Self-Joins in SQL
This is the problem of finding the transitive closure of a graph in sql. SQL does not support this directly, which leaves you with three common strategies:
use a vendor specific SQL extension
store the Materialized Path from the root to the given node in each row
store the Nested Sets, that is the interval covered by the subtree rooted at a given node when nodes are labeled depth first
The first option is straightforward, and if you don't need database portability is probably the best. The second and third options have the advantage of being plain SQL, but require maintaining some de-normalized state. Updating a table that uses materialized paths is simple, but for fast queries your database must support indexes for prefix queries on string values. Nested sets avoid needing any string indexing features, but can require updating a lot of rows as you insert or remove nodes.
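For example, with materialized paths each row might store the chain of ancestor IDs, so a subtree lookup becomes a string prefix query (hypothetical column and data):
-- path holds the ancestor chain, e.g. '1/2/' for node 2 under node 1
SELECT * FROM myTable WHERE path LIKE '1/2/%';    -- all descendants of node 2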
If you're fine with always using MSSQL, I'd use the vendor specific option Adrian mentioned.