Analyzing an RDF graph: average number of a certain relation

I'm new to SPARQL.
I'm trying to find a way to analyze an RDF graph in general terms, meaning for example the average number of a certain relation per subject.
So if we would have the data
[Alice likes Money]
[Bob has Money]
[Bob likes Diving]
[Bob likes Skiing]
What is the average number of "likes" per node (here: 1.5)?
My first try was simply to write a script that iterates over all distinct subjects and queries for the count of likes relations on each.
Is there a way to do this directly in SPARQL?

Yes, you can use GROUP BY and aggregates for this kind of thing. See Aggregates in the SPARQL 1.1 specification for an overview.
If you want the number of likes per node, you can do it like so:
PREFIX : <http://example.org/ns#>
SELECT ?node (COUNT(*) AS ?likes)
WHERE
{
  ?node :likes ?thing
}
GROUP BY ?node
Here we group by ?node (the subject doing the liking) and do a COUNT(*), which simply counts the number of solutions in the group. This gives us the number of likes for every distinct ?node value in a single query.
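On the example data above, this would return something like:
?node    ?likes
Bob      2
Alice    1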
If we want to find the average likes per node, we can also do this using aggregates:
PREFIX : <http://example.org/ns#>
SELECT
  (COUNT(*) AS ?likeCount)
  (COUNT(DISTINCT ?node) AS ?nodeCount)
  (?likeCount / ?nodeCount AS ?avgLikesPerNode)
WHERE
{
  ?node :likes ?thing .
}
Here we use COUNT(*) again to get the total number of likes, and COUNT(DISTINCT ?node) to count the distinct nodes that like something; dividing ?likeCount by ?nodeCount gives us the average likes per node. On the example data this is 3 / 2 = 1.5.
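If you find it clearer, an equivalent formulation is to average over the grouped per-node counts with a subquery (standard SPARQL 1.1, so any compliant engine should accept it):
PREFIX : <http://example.org/ns#>
SELECT (AVG(?likes) AS ?avgLikesPerNode)
WHERE
{
  SELECT ?node (COUNT(*) AS ?likes)
  WHERE { ?node :likes ?thing }
  GROUP BY ?node
}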

Related

Writing a SPARQL query that constructs new triples with result from COUNT in an aggregate

I'm working with the Snap SPARQL tool in Protege, so to add data to the ontology I have to use CONSTRUCT, because the tool doesn't support INSERT (it gives the option to assert the newly constructed triples back into the ontology). I want to count the entries with a specific value and assert that count back into the ontology. I created a little test ontology about students and grades to help me figure this out. I have the following query, which works:
SELECT ?student (COUNT(?test) AS ?tcount)
WHERE {
  ?student test:tookTest ?test .
  ?test test:hasGrade test:A .
}
GROUP BY ?student
This gives me a table with each student in one column and their number of A grades in the next. What I want to do next is use ?tcount to assert the data back into the ontology. I've tried various things, like replacing the SELECT with a CONSTRUCT, or using an embedded query:
CONSTRUCT { ?student test:hasACount ?tcount . }
WHERE {
  SELECT ?student (COUNT(?test) AS ?tcount)
  WHERE {
    ?student test:tookTest ?test .
    ?test test:hasGrade test:A .
  }
  GROUP BY ?student
}
I think the problem with this is that ?tcount isn't in scope in the surrounding query. I've tried several options, like using BIND on ?tcount, or grouping by ?tcount rather than ?student, but no luck.
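For what it's worth, the embedded-query form is valid SPARQL 1.1: variables projected out of a subquery, ?tcount included, are in scope for the outer CONSTRUCT template, so a failure here would point at the tool rather than the query. A minimal standalone sketch that should run on a standard SPARQL 1.1 engine (the test: prefix URI is assumed):
PREFIX test: <http://example.org/test#>
CONSTRUCT { ?student test:hasACount ?tcount . }
WHERE {
  SELECT ?student (COUNT(?test) AS ?tcount)
  WHERE {
    ?student test:tookTest ?test .
    ?test test:hasGrade test:A .
  }
  GROUP BY ?student
}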

Clickhouse topK query on several columns

In ClickHouse, is there any way to use the topK aggregate function on more than one column?
For example:
select topK(10)(AGE, COUNTRY) ...
meaning I want the top 10 combinations of AGE+COUNTRY.
I only found a workaround using concat on the fields and topK on that; I wondered if there is any other way.
You can pass an array (or a tuple) of columns to topK:
SELECT topK(10)([Age, Country])
FROM table
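The tuple form should look analogous; a sketch (untested here, and support can vary by ClickHouse version):
SELECT topK(10)((Age, Country))
FROM table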
Or use the straightforward calculation (it is much slower, but it provides the exact result, whereas topK is approximate):
SELECT
Age,
Country
FROM table
GROUP BY
Age,
Country
ORDER BY count() DESC
LIMIT 10

How to find the nth largest value of each row in SQL

I have researched this problem and found the answer for a single query, where you can find the nth value of a single column using ORDER BY ... DESC LIMIT 1 OFFSET 2. What I am trying to do is find the nth value within each group. For example, I'm working with a database of bike-share data. The database stores the duration of each trip and the date. I'm trying to find the 3rd-longest duration for each day in the database. If I were going to find the max duration, I would use the following code.
SELECT DATE(start_date) trip_date, MAX(duration)
FROM trips
GROUP BY 1
I want the output to be something like this:
Date        3rd_duration
1/1/2017    334
1/2/2017    587
...
If the value of the third longest duration is the same for two or more different trips, I would like the trip with the lowest trip_id to be ranked 3rd.
I'm working in SQLite.
Any help would be appreciated.
Neither SQLite (before version 3.25) nor MySQL (before 8.0) has a built-in ROW_NUMBER function, so get ready for an ugly query. We can still group by the date, but to find the third-longest duration we can use a correlated subquery.
SELECT
    DATE(t1.start_date) AS trip_date,
    t1.duration
FROM trips t1
WHERE
    (SELECT COUNT(*)
     FROM trips t2
     WHERE DATE(t2.start_date) = DATE(t1.start_date)
       AND t2.duration >= t1.duration) = 3;
Note that this approach might break down if, for a given date, more than one record has the same duration. In that case you might get multiple rows for a date, or none at all, and none of them need actually be the third-highest duration. To apply your tie rule (the lowest trip_id wins among equal durations), you need a ranking that breaks ties explicitly, as sketched below.
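If your SQLite is 3.25 or newer, window functions are available, and one reading of your tie rule can be encoded directly in the ORDER BY. A sketch, assuming a trip_id column as described in the question:
SELECT trip_date, duration AS third_duration
FROM (
    SELECT
        DATE(start_date) AS trip_date,
        duration,
        trip_id,
        ROW_NUMBER() OVER (
            PARTITION BY DATE(start_date)
            ORDER BY duration DESC, trip_id ASC
        ) AS rn
    FROM trips
) AS ranked
WHERE rn = 3;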

DAX for % of column total

I've got a pivot table report with product category on the rows and years on the columns. In the values section, I want to show the number of sales. This works fine. But now I also want to show the % of column total for the product categories.
I use this DAX measure:
Measure := COUNT(factSales[salesnr]) / CALCULATE(COUNT(factSales[salesnr]); ALL(factSales))
But this yields the percentage of the grand total over all years. I want the percentage of the column total for each separate year.
The ALL(Table) function tells Power Pivot to ignore any filters applied over the whole table. Therefore, you're telling Power Pivot to count all the rows of the factSales table regardless of the Category or Year filters on the pivot table.
However, in your case, what you want is the total for ALL the categories in each year. Since you want the total over ALL the categories, you must use ALL(factSales[categories]). This way, you're ignoring only the filters on the categories, not the filters on the years.
Based on the previous explanation, the DAX formula would be:
Measure :=
COUNT(factSales[salesnr]) / CALCULATE(COUNT(factSales[salesnr]); ALL(factSales[categories]))
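An equivalent way to express "keep only the year filter" is ALLEXCEPT, which removes every filter on the table except those on the columns you list. A sketch, assuming the year lives in a factSales[year] column (adjust to your model):
Measure :=
COUNT(factSales[salesnr]) / CALCULATE(COUNT(factSales[salesnr]); ALLEXCEPT(factSales; factSales[year]))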

Difference between Qualify and Having

Can someone please explain the difference between QUALIFY ... OVER ... PARTITION BY and GROUP BY ... HAVING in Teradata? I would also like to know whether there are any differences in their performance.
QUALIFY is a proprietary Teradata extension for filtering the result of a windowed aggregate function.
A query is logically processed in a specific order:
FROM: create the basic result set
WHERE: remove rows from the previous result set
GROUP BY: apply aggregate functions on the previous result set
HAVING: remove rows from the previous result set
OVER: apply windowed aggregate functions on the previous result set
QUALIFY: remove rows from the previous result set
The HAVING clause is used to filter on the results of aggregate functions such as COUNT, MIN, MAX, etc. It eliminates groups based on some criterion, like this:
SELECT dept_no, MIN(salary), MAX(salary), AVG(salary)
FROM employee
WHERE dept_no IN (100,300,500,600)
GROUP BY dept_no
HAVING AVG(salary) > 37000;
The QUALIFY clause eliminates rows based on the value of a windowed function, which is computed for each participating row. It works on the final result set:
SELECT NAME, LOCATION
FROM EMPLOYEE
QUALIFY ROW_NUMBER() OVER (PARTITION BY NAME ORDER BY JOINING_DATE DESC) = 1;
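On platforms without QUALIFY you would need a derived table to apply the same filter, which is exactly what the clause saves you. A standard-SQL sketch of the equivalent, assuming the same EMPLOYEE table:
SELECT NAME, LOCATION
FROM (
    SELECT NAME, LOCATION,
           ROW_NUMBER() OVER (PARTITION BY NAME ORDER BY JOINING_DATE DESC) AS rn
    FROM EMPLOYEE
) t
WHERE rn = 1;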
We can combine both HAVING and QUALIFY in one query if we use both an aggregate and a windowed aggregate function, like below:
SELECT StoreID, SUM(sale),
SUM(profit) OVER (PARTITION BY StoreID)
FROM facts
GROUP BY StoreID, sale, profit
HAVING SUM(sale) > 15
QUALIFY SUM(profit) OVER (PARTITION BY StoreID) > 2;
You can see the order of execution in dnoeth's answer above.
