I have a relation in Pig named 'A':
(name:gender:zip-code)
(x:m:1234)
(y:f:1234)
(z:m:1245)
(s:f:1235)
How can I get the number of rows in relation A?
I want to get the result as 4.
Assuming your relation is named A.
B = GROUP A ALL;                        -- collapse all rows into a single group
B_COUNT = FOREACH B GENERATE COUNT(A);  -- count the tuples in that group
DUMP B_COUNT;
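With the four sample rows above, the DUMP should print a single tuple containing the count:
(4)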
I’m a total beginner in Cypher and I’m struggling to obtain the result I want.
So I have nodes that all have a property called "level". I want to keep only a certain level, but I want to recreate the missing links.
Here is my dataset:
in CSV:
n
"{owner:Team A,name:MySubscription,level:1}"
"{name:Database,level:2}"
"{owner:Team A,name:Service A,level:3}"
"{owner:Team A,name:MyTopic,level:2}"
"{name:Service B,level:3}"
"{name:Service C,level:3}"
"{name:MySecret,level:1}"
I want to keep only the nodes that are level >= 2, but I want to recreate the links between the remaining nodes.
Could you help me create the query that does just this?
I'm not sure it's the best way to do it, but I did find an answer:
MATCH (a:Asset)-[rel]-(b:Asset)
WHERE a.level >= 2 AND b.level >= 2
RETURN a, rel, b
UNION
MATCH (a:Asset)-[:USING]-(:Asset)-[:ATTACHED]-(b:Asset)
WHERE a.level >= 2 AND b.level >= 2
CALL apoc.create.vRelationship(a, 'USING', {}, b) YIELD rel
RETURN a, rel, b
UNION
MATCH (a) WHERE NOT (a)--()
RETURN a, null AS rel, null AS b;
The NTH function is really useful for extracting nested array elements in BQ, but its utility for a given table depends on each row's nested array containing the same number of elements, in the same order. If I have a 2+ column nested array where one column is a variable name/ID, and the different instances of the array in different rows have inconsistent naming and/or ordering, is there an elegant way to fetch/pivot a variable based on the variable name/ID?
For example, if row1 has customDimensions array:
index  value
4      aaa
23     bbb
70     ccc
and row2 has customDimensions array:
index  value
4      ddd
70     eee
I'd want to run something like
SELECT
NTHLOOKUP(70, customdims.index, customdims.value) as val70,
NTHLOOKUP(4, customdims.index, customdims.value) as val4,
NTHLOOKUP(23, customdims.index, customdims.value) as val23
from my_table;
And get:
val70  val4  val23
ccc    aaa   bbb
eee    ddd   (null)
I've been able to get this sort of result by making a subquery for each desired index value, unnesting the array in each and filtering WHERE index = (value), but that gets really ugly as the variables pile up. Is there an alternative?
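For reference, the per-index subquery approach I'm describing looks roughly like this (a sketch, assuming customDimensions is an ARRAY<STRUCT<index INT64, value STRING>> column directly on my_table):
SELECT
  -- one correlated subquery per desired index value
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 70) AS val70,
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 4) AS val4,
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 23) AS val23
FROM my_table;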
EDIT: Based on Mikhail's answer below (thank you!!) I was able to write my query more elegantly. Not quite as slick as an NTHLOOKUP, but I'll take it:
select id,
       max(case when index = 41 then value[OFFSET(0)] else '' end) as val41,
       max(case when index = 59 then value[OFFSET(0)] else '' end) as val59
from (
  select concat(array1.thing1, array1.thing2) as id,
         cd.index,
         ARRAY_AGG(distinct cd.value) as value
  FROM my_table g,
       unnest(array1) as array1,
       unnest(array1.customDimensions) as cd
  where index in (41, 59)
  group by 1, 2
  order by 1, 2
) x
group by 1
order by 1
The best I can "offer" is below (BigQuery Standard SQL)
#standardSQL
WITH `project.dataset.my_table` AS (
  SELECT ARRAY<STRUCT<index INT64, value STRING>>
    [(4, 'aaa'), (23, 'bbb'), (70, 'ccc')] customDimensions
  UNION ALL
  SELECT ARRAY<STRUCT<index INT64, value STRING>>
    [(4, 'ddd'), (70, 'eee')] customDimensions
)
SELECT cd.index, ARRAY_AGG(cd.value) VALUES
FROM `project.dataset.my_table`,
  UNNEST(customDimensions) cd
GROUP BY cd.index
with result as below
Row  index  values
1    4      aaa
            ddd
2    23     bbb
3    70     ccc
            eee
I would recommend staying with this flattened version, as it serves most of the practical cases I can think of.
But if you still want to pivot it further, there are quite a number of posts about how to pivot in BigQuery.
I've been able to get this sort of result by making a subquery for each desired index value, unnesting the array in each and filtering WHERE index = (value), but that gets really ugly as the variables pile up. Is there an alternative?
Yes, you can use a user-defined function to encapsulate the common logic. For example,
CREATE TEMP FUNCTION NTHLOOKUP(
targetIndex INT64,
customDimensions ARRAY<STRUCT<index INT64, value STRING>>
) AS (
(SELECT value FROM UNNEST(customDimensions)
WHERE index = targetIndex)
);
SELECT
NTHLOOKUP(70, customDimensions) as val70,
NTHLOOKUP(4, customDimensions) as val4,
NTHLOOKUP(23, customDimensions) as val23
from my_table;
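To sanity-check this, here is a self-contained version that combines the UDF with the sample rows from the question (the dummy data mirrors the WITH block from the other answer; swap in your real table):
#standardSQL
CREATE TEMP FUNCTION NTHLOOKUP(
  targetIndex INT64,
  customDimensions ARRAY<STRUCT<index INT64, value STRING>>
) AS (
  (SELECT value FROM UNNEST(customDimensions) WHERE index = targetIndex)
);
WITH my_table AS (
  SELECT ARRAY<STRUCT<index INT64, value STRING>>[(4, 'aaa'), (23, 'bbb'), (70, 'ccc')] AS customDimensions
  UNION ALL
  SELECT ARRAY<STRUCT<index INT64, value STRING>>[(4, 'ddd'), (70, 'eee')]
)
SELECT
  NTHLOOKUP(70, customDimensions) AS val70,
  NTHLOOKUP(4, customDimensions) AS val4,
  NTHLOOKUP(23, customDimensions) AS val23
FROM my_table;
-- expected output: (ccc, aaa, bbb) and (eee, ddd, NULL)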
If I have:
2 baskets of oranges with 7 and 10 oranges respectively, and
2 baskets of peaches with 12 and 15 peaches respectively,
then I want to set:
maxfruit to 10 for every orange basket, and
maxfruit to 15 for every peach basket.
I tried
update baskets set maxfruit = (select max(fruitCount) from baskets b where b.fruit = fruit)
but it just sets everything to 15...
In SQL, an unqualified column name resolves to the innermost table instance that is in scope, unless you use a table prefix.
So fruit refers to the innermost instance, b. This means that b.fruit and fruit are always the same value.
To refer to the outer table instance, you must use the name of the outer table:
update baskets
set maxfruit = (select max(fruitCount)
                from baskets b
                where b.fruit = baskets.fruit);
                                ^^^^^^^^
(And instead of b.fruit, you could write just fruit, but that could be unclear.)
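For a quick way to try this, here is a minimal setup matching the question's data (the column names fruit, fruitCount and maxfruit are assumed from the question):
CREATE TABLE baskets (fruit TEXT, fruitCount INTEGER, maxfruit INTEGER);
INSERT INTO baskets (fruit, fruitCount) VALUES
  ('orange', 7), ('orange', 10),
  ('peach', 12), ('peach', 15);

UPDATE baskets
SET maxfruit = (SELECT MAX(fruitCount) FROM baskets b WHERE b.fruit = baskets.fruit);
-- orange rows now have maxfruit = 10, peach rows have maxfruit = 15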
Your update is just pulling the max from the whole table. You can use a subquery to pull out the max for each fruit:
UPDATE b
SET b.maxfruit = b2.fruitCount
FROM baskets b
INNER JOIN (SELECT fruit, MAX(fruitCount) AS fruitCount
            FROM baskets
            GROUP BY fruit) b2 ON b.fruit = b2.fruit
I am trying to set up a trigger that will auto-calculate an ID field based on the count of a specific type of entry. I have it working so far, but the ID number indexes based on the number of all entries:
BEGIN
    UPDATE master_workorders
    SET wo_no = master_workorders.wo_sub || substr('0000'||master_workorders.pkuid, -4,4)||'-'|| substr(master_workorders.rdate,3,2)
    WHERE rowid = NEW.rowid;
END
This returns IDs like WO0001-17, BB0002-17 and M0003-17, where each ID number (the middle 4 digits) counts all entries. I want my ID numbers to represent the count of each type (WO, BB, M; these values are stored in the wo_sub column), i.e. WO0001-17, BB0001-17, M0001-17, and if a new BB work order is added it would be BB0002-17, and so on for each type.
To replace the autoincremented ID with the current count, replace master_workorders.pkuid with a subquery:
... || (SELECT COUNT(*) FROM master_workorders WHERE wo_sub = NEW.wo_sub) || ...
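Put together, the whole trigger would look roughly like this (a sketch; the trigger name and the AFTER INSERT timing are assumptions, since the question only shows the body):
CREATE TRIGGER set_wo_no AFTER INSERT ON master_workorders
BEGIN
    UPDATE master_workorders
    SET wo_no = wo_sub
        -- count only rows of the same wo_sub type, so each type gets its own 0001, 0002, ... sequence
        || substr('0000' || (SELECT COUNT(*) FROM master_workorders WHERE wo_sub = NEW.wo_sub), -4, 4)
        || '-' || substr(rdate, 3, 2)
    WHERE rowid = NEW.rowid;
END;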
I'm finding it hard to get my head around this problem, and I couldn't find any answers to this specific problem anywhere:
Say I have a table like this, I'm just using fruit as an example:
Fruit | Date | Value
=================================
Apple | 1 | other_random_value
Apple | 2 | some_value_1
Apple | 3 | some_value_2
Pear | 1 | other_random_value
Pear | 2 | unexpected_value_1
Pear | 3 | some_value_2
Everything will be ordered by Fruit, then Date.
Basically, if the last row (for each fruit) is some_value_2, but the one preceding it is not some_value_1, I want to match just those fruits (i.e. in this case, Pear).
So I always expect some_value_2 to come after a row with a certain value for that particular fruit, and if it doesn't I want to flag errors against those particular fruits. It would also be nice to match cases where nothing precedes some_value_2, though if this is too complicated I could match that separately and just check that some_value_2 is not the first row, which I don't imagine would be a difficult query.
EDIT: Also, being able to match any consecutive rows where the preceding value is unexpected would be nice, though I mainly care about the last 2 rows. So if being able to match all consecutive rows results in a simpler and better performing query, then I might go with that. I'm going to be doing an INSERT at the same time (into an alert table), so if I could flag it as an ERROR if it's the last two rows and a WARNING if it's not, that would be really nifty. Though I wouldn't know where to start with writing a query that does that. Also having a query that performs well is a must, as I will be using this across a large dataset.
EDIT:
This is what I used in the end, it's quite slow, but if I index Date, it's not so bad:
SELECT c.Id AS CId, c.Fruit AS CFruit,
       c.Date AS CDate, c.Value AS CValue,
       (SELECT Id
        FROM fruits
        WHERE Fruit = c.Fruit
          AND Date >= c.Date
          AND Id > c.Id
        ORDER BY Date, Id) AS NId,
       n.Fruit AS NFruit,
       n.Date AS NDate, n.Value AS NValue
FROM fruits AS c
JOIN fruits AS n ON n.Id = NId
ORDER BY c.Date, c.Id
I might try Joachim's method again at some point, as I realised I'm getting a lot of results I don't really care much about. Or I might even try incorporating the two somehow and delegate to INFO/ERROR as appropriate...
Solved: I used the same SELECT statement that I used to get NId, and used SELECT COUNT(*) instead of SELECT Id. This told me the number of results after the current one. Then I just used a CASE operator to turn it into a boolean field called Latest :). So I effectively combined Nicolas' and Joachim's methods. Performance still seems OK, probably because SQLite caches the results.
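Roughly, the extra column I describe looks like this (a sketch of my approach, using the same column names as the query above):
SELECT c.Id, c.Fruit, c.Date, c.Value,
       -- no later rows for this fruit means this is the latest row
       CASE WHEN (SELECT COUNT(*)
                  FROM fruits
                  WHERE Fruit = c.Fruit
                    AND Date >= c.Date
                    AND Id > c.Id) = 0
            THEN 1 ELSE 0 END AS Latest
FROM fruits AS c
ORDER BY c.Date, c.Id;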
SQLite is (as far as I know) a bit low on efficient operators for this, so this is the best I can come up with for now :)
SELECT Fruit FROM fruits
WHERE ( SELECT COUNT(*) FROM fruits f
WHERE f.fruit=fruits.fruit
AND f.date > fruits.date ) = 1
AND fruits.value <> 'some_value_1'
INTERSECT
SELECT Fruit FROM fruits
WHERE ( SELECT COUNT(*) FROM fruits f
WHERE f.fruit=fruits.fruit
AND f.date > fruits.date ) = 0
AND fruits.value = 'some_value_2'
An SQLfiddle to test with.
I named the table fruits. This query gets you the preceding date for a 'key' (fruit + date):
select fruit, date, value currvalue,
       (select max(date) precedingDate
        from fruits p
        where p.fruit = c.fruit
          and p.date < c.date) precedingdate
from fruits c;
From there we can get the preceding value for each key:
select f1.*, precedingdate, f2.value precedingvalue
from fruits f1
join (select fruit, date, value,
             (select max(date) precedingDate
              from fruits p
              where p.fruit = c.fruit
                and p.date < c.date) precedingdate
      from fruits c) f2
  on f1.fruit = f2.fruit and f1.date = precedingdate;
For all the rows that have a previous row, you get both the current and preceding date and the current and preceding value.
Edit: we add an id, used to choose between rows when there are several identical previous dates.
I will be using intermediate views for the sake of clarity but you could write one big query.
As before, what's the previous date:
create view VFruitsWithPreviousDate as
select fruit, date, value, id,
       (select max(date)
        from fruits p
        where p.fruit = c.fruit
          and p.date < c.date) previousdate
from fruits c;
What's the previous id:
create view VFruitsWithPreviousId as
select fruit, date, value,
       (select max(id)
        from fruits f
        where v.fruit = f.fruit
          AND v.previousdate = f.date) previousID
from VFruitsWithPreviousDate v;
A query for all consecutive rows:
select f.*, v.value
from fruits f
join VFruitsWithPreviousId v on f.id = v.previousid ;
You can then filter for the problem rows; since f is the row at the previous date and v the current row, add the condition WHERE v.value = 'some_value_2' AND f.value != 'some_value_1'.
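Putting the pieces together, a sketch of that final query (again, f is the preceding row and v the current one):
SELECT v.fruit, v.date, v.value, f.value AS precedingValue
FROM fruits f
JOIN VFruitsWithPreviousId v ON f.id = v.previousID
WHERE v.value = 'some_value_2'
  AND f.value <> 'some_value_1';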