Pig get distinct rows with counts

I have a Pig table (called table1) with more than one column (col1, col2) that contains many duplicate rows.
Here is a simple example:
| col1 | col2 |
-----------------
| 111 | bbb |
| 111 | ccc |
| 111 | bbb |
| 222 | bbb |
I would like to get the distinct rows together with the number of times each one appears (like uniq -c in bash), so that the result would be:
| count | col1 | col2 |
-----------------
| 2 | 111 | bbb |
| 1 | 111 | ccc |
| 1 | 222 | bbb |
What is the syntax for such a command?

Please try the following, which groups by both columns and then counts the rows in each group:
A = LOAD 'data'....;
GR = GROUP A by (col1,col2);
CNT = FOREACH GR GENERATE FLATTEN (group) AS (col1,col2) , COUNT(A) as cnt_col;
dump CNT;
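
To sanity-check what the GROUP ... COUNT script above should return, the same group-and-count logic can be sketched outside Pig. The Python snippet below is only an illustration and not Pig code; the rows are the sample from the question, and the function name count_distinct_rows is made up for this example.

```python
from collections import Counter


def count_distinct_rows(rows):
    """Return (col1, col2, count) tuples, one per distinct row, sorted by row."""
    # Counter plays the role of GROUP BY (col1, col2) followed by COUNT.
    counts = Counter(rows)
    return [(col1, col2, n) for (col1, col2), n in sorted(counts.items())]


# Sample rows from the question.
rows = [(111, "bbb"), (111, "ccc"), (111, "bbb"), (222, "bbb")]

for col1, col2, n in count_distinct_rows(rows):
    print(n, col1, col2)
```

Each printed line has the same layout as a line of uniq -c output: the count first, then the row.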

Related

Split column into different rows on SQLite recursively using delimiter ","

I have a SQLite table named 'surat'.
I want to split id_ayat into separate rows with a SQLite query. The expected result looks like this:
_id|id_surat|id_ayat
---+--------+-------
3 | 2 | 112
3 | 2 | 213
3 | 3 | 19
3 | 3 | 83
3 | 3 | 85
3 | 3 | 102
Is that possible? What query can I use in SQLite?

With a recursive CTE:
with recursive cte as (
  select _id, id_surat, id_ayat,
         id_ayat + 0 col
  from surat
  union all
  select _id, id_surat, trim(substr(id_ayat, length(col) + 2)),
         trim(substr(id_ayat, length(col) + 2)) + 0
  from cte
  where instr(id_ayat, ',')
)
select _id, id_surat, col id_ayat
from cte
order by _id, id_surat
Results:
| _id | id_surat | id_ayat |
| --- | -------- | ------- |
| 3 | 2 | 112 |
| 3 | 2 | 213 |
| 3 | 3 | 19 |
| 3 | 3 | 83 |
| 3 | 3 | 85 |
| 3 | 3 | 102 |
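
To try the recursive CTE without a separate database, it can be run through Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the question's input table is not shown, so the two input rows below are a guess reconstructed from the expected output; the table is named surat as in the question; and col has been added to the ORDER BY only to make the row order deterministic. The function name split_id_ayat is made up for this example.

```python
import sqlite3

# Same recursive CTE as in the answer, run against a table named surat.
# Adding col to ORDER BY is a change made here so the row order is stable.
SPLIT_QUERY = """
with recursive cte as (
  select _id, id_surat, id_ayat,
         id_ayat + 0 col
  from surat
  union all
  select _id, id_surat, trim(substr(id_ayat, length(col) + 2)),
         trim(substr(id_ayat, length(col) + 2)) + 0
  from cte
  where instr(id_ayat, ',')
)
select _id, id_surat, col id_ayat
from cte
order by _id, id_surat, col
"""


def split_id_ayat(rows):
    """Load rows into an in-memory surat table and return the split result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table surat (_id integer, id_surat integer, id_ayat text)"
    )
    conn.executemany("insert into surat values (?, ?, ?)", rows)
    result = conn.execute(SPLIT_QUERY).fetchall()
    conn.close()
    return result


# ASSUMED input: reconstructed from the expected output, not from the question.
sample = [(3, 2, "112, 213"), (3, 3, "19, 83, 85, 102")]

for row in split_id_ayat(sample):
    print(row)
```

The query relies on SQLite converting text such as '112, 213' to the number 112 when 0 is added, that is, it keeps only the leading numeric prefix. Each recursive step then strips that prefix and the comma that follows it.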

How do you assign groups to larger groups with dplyr

I would like to assign groups to larger groups (clusters) so that each cluster can be handed to one core for processing. I have 16 cores. This is what I have so far:
test <- data_extract %>% group_by(group_id) %>% sample_n(16, replace = TRUE)
This takes samples of 16 from each group.
Below is an example of what I would like the final result to look like (with two clusters). All I really want is for every row with the same group id to belong to the same cluster, with a set number of clusters.
________________________________
balance | group_id | cluster|
454452 | a | 1 |
5450441 | a | 1 |
5444531 | b | 1 |
5404051 | b | 1 |
5404501 | b | 1 |
5404041 | b | 1 |
544251 | b | 1 |
254252 | b | 1 |
541254 | c | 2 |
54123254 | d | 1 |
542541 | d | 1 |
5442341 | e | 2 |
541 | f | 1 |
________________________________

test <- data %>% group_by(group_id) %>% mutate(group = sample(1:16, 1))
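
The one-liner above works because, inside a grouped mutate, sample(1:16, 1) is evaluated once per group and the single value is recycled across that group's rows. The same idea can be sketched in plain Python, purely as a language-neutral illustration: draw one random cluster per distinct group id, then reuse it for every row of that group. The function name assign_clusters and the seed parameter are made up for this example.

```python
import random


def assign_clusters(group_ids, n_clusters=16, seed=None):
    """Return one cluster number per row; rows sharing a group id share a cluster."""
    rng = random.Random(seed)
    cluster_of = {}  # group id -> cluster drawn for that group
    clusters = []
    for gid in group_ids:
        if gid not in cluster_of:
            # One draw per distinct group; randint includes both endpoints.
            cluster_of[gid] = rng.randint(1, n_clusters)
        clusters.append(cluster_of[gid])
    return clusters


# Group ids loosely following the example table in the question.
group_ids = ["a", "a", "b", "b", "c", "d", "d", "e", "f"]

for gid, cluster in zip(group_ids, assign_clusters(group_ids, seed=0)):
    print(gid, cluster)
```

Random draws like this keep each group together but do not balance the clusters: some cores may receive many more rows than others, and with few groups some clusters may stay empty.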

add and subtract by type

I have a SQLite table payments:
+------+--------+-------+
| user | amount | type |
+------+--------+-------+
| AAA | 100 | plus |
| AAA | 200 | plus |
| AAA | 50 | minus |
| BBB | 100 | plus |
| BBB | 20 | minus |
| BBB | 5 | minus |
| CCC | 200 | plus |
| CCC | 300 | plus |
| CCC | 25 | minus |
I need to calculate the sum with type 'plus' and subtract from it the sum with type 'minus' for each user.
The result table should look like this:
+------+--------+
| user | total |
+------+--------+
| AAA | 250 |
| BBB | 75 |
| CCC | 475 |
I think that my query is terrible, and I need help to improve it:
select user,
(select sum(amount) from payments as TABLE1 WHERE TABLE1.type = 'plus' AND
TABLE1.user= TABLE3.user) -
(select sum(amount) from payments as TABLE2 WHERE TABLE2.type = 'minus' AND
TABLE2.user= TABLE3.user) as total
from payments as TABLE3
group by client
order by id asc

The type is more easily handled with a CASE expression, and the aggregation can then be merged into the outer query. The sample table shows only user, amount, and type, so this version groups and orders by user:
SELECT user,
       SUM(CASE type
             WHEN 'plus' THEN amount
             WHEN 'minus' THEN -amount
           END) AS total
FROM payments
GROUP BY user
ORDER BY user;
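
As a quick check of the CASE-based aggregation, the query can be run on the sample payments data with Python's built-in sqlite3 module. This is a sketch under one assumption: the table as shown has no client or id column, so the query here groups and orders by user. The function name net_totals is made up for this example.

```python
import sqlite3


def net_totals(payments):
    """Return (user, total) rows: sum of 'plus' amounts minus sum of 'minus' amounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (user TEXT, amount INTEGER, type TEXT)")
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", payments)
    rows = conn.execute(
        """
        SELECT user,
               SUM(CASE type
                     WHEN 'plus'  THEN amount
                     WHEN 'minus' THEN -amount
                   END) AS total
        FROM payments
        GROUP BY user   -- assumption: grouping column is user, as in the sample
        ORDER BY user
        """
    ).fetchall()
    conn.close()
    return rows


# Sample rows from the question.
sample = [
    ("AAA", 100, "plus"), ("AAA", 200, "plus"), ("AAA", 50, "minus"),
    ("BBB", 100, "plus"), ("BBB", 20, "minus"), ("BBB", 5, "minus"),
    ("CCC", 200, "plus"), ("CCC", 300, "plus"), ("CCC", 25, "minus"),
]

for user, total in net_totals(sample):
    print(user, total)
```

Rows whose type is neither 'plus' nor 'minus' fall through the CASE as NULL, and SUM ignores NULLs, so they do not affect the total.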

sqlite - how do I write a query to receive an additional column containing a selection of data from another table in every cell

Assume I have two tables, e.g.:
table_1:
+----+-------+------------+--
| id | name | table_2_id | ...
+----+-------+------------+--
| 1 | test1 | 2 | ...
| 2 | test2 | 1 | ...
| 3 | test3 | 1 | ...
...
and
table_2:
+----+------+--
| id | name | ...
+----+------+--
| 1 | xxx | ...
| 2 | yyy | ...
| 3 | zzz | ...
...
Now I want to select everything from table_2 and add another column containing in every cell a collection of all names from table_1 where table_2_id corresponds with the current id from table_2:
output:
+----+------+-----+--------------+
| id | name | ... | link |
+----+------+-----+--------------+
| 1 | xxx | ... | test2, test3 |
| 2 | yyy | ... | test1 |
| 3 | zzz | ... | % |
...
How can I achieve this?

This can be done with a correlated subquery.
To combine values from multiple rows, use group_concat:
SELECT id,
name,
(SELECT group_concat(name)
FROM table_1
WHERE table_2_id = table_2.id
) AS link
FROM table_2;
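
The correlated subquery can be tried on the two sample tables with Python's built-in sqlite3 module. This is a minimal sketch: the elided "..." columns are left out, the function name names_per_parent is made up for this example, and the second argument to group_concat (the separator ', ') is an addition here to match the spacing in the desired output.

```python
import sqlite3


def names_per_parent():
    """Return (id, name, link) rows of table_2 with the matching table_1 names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE table_2 (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE table_1 (id INTEGER PRIMARY KEY, name TEXT, table_2_id INTEGER);
        INSERT INTO table_2 VALUES (1, 'xxx'), (2, 'yyy'), (3, 'zzz');
        INSERT INTO table_1 VALUES (1, 'test1', 2), (2, 'test2', 1), (3, 'test3', 1);
        """
    )
    rows = conn.execute(
        """
        SELECT id,
               name,
               (SELECT group_concat(table_1.name, ', ')
                FROM table_1
                WHERE table_1.table_2_id = table_2.id
               ) AS link
        FROM table_2
        ORDER BY id
        """
    ).fetchall()
    conn.close()
    return rows


for row in names_per_parent():
    print(row)
```

For a table_2 row with no match in table_1 (id 3 here), the subquery yields NULL, which Python reports as None. The order of names inside each concatenated string is not guaranteed by SQLite.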

Is there an sqlite function that can check if a field matches a certain value and return 0 or 1?

Consider the following sqlite3 table:
+------+------+
| col1 | col2 |
+------+------+
| 1 | 200 |
| 1 | 200 |
| 1 | 100 |
| 1 | 200 |
| 2 | 400 |
| 2 | 200 |
| 2 | 100 |
| 3 | 200 |
| 3 | 200 |
| 3 | 100 |
+------+------+
I'm trying to write a query that will select the entire table and return 1 if the value in col2 is 200, and 0 otherwise. For example:
+------+--------------------+
| col1 | SOMEFUNCTION(col2) |
+------+--------------------+
| 1 | 1 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 2 | 0 |
| 2 | 1 |
| 2 | 0 |
| 3 | 1 |
| 3 | 1 |
| 3 | 0 |
+------+--------------------+
What is SOMEFUNCTION()?
Thanks in advance...

In SQLite, boolean values are just integer values 0 and 1, so you can use the comparison directly:
SELECT col1, col2 = 200 AS SomeFunction FROM MyTable

As described in "Does sqlite support any kind of IF(condition) statement in a select", you can use a CASE expression:
SELECT col1, CASE WHEN col2 = 200 THEN 1 ELSE 0 END AS col2 FROM table1
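
The two approaches above can be compared side by side with Python's built-in sqlite3 module. This is a small sketch: the table name MyTable is taken from the first answer, the function name flag_200 and the column aliases are made up for this example, and ORDER BY rowid is added only to keep the rows in insertion order.

```python
import sqlite3


def flag_200(values):
    """Return (col1, comparison_flag, case_flag) for each (col1, col2) row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MyTable (col1 INTEGER, col2 INTEGER)")
    conn.executemany("INSERT INTO MyTable VALUES (?, ?)", values)
    rows = conn.execute(
        """
        SELECT col1,
               col2 = 200 AS by_comparison,
               CASE WHEN col2 = 200 THEN 1 ELSE 0 END AS by_case
        FROM MyTable
        ORDER BY rowid
        """
    ).fetchall()
    conn.close()
    return rows


# Sample rows from the question.
sample = [
    (1, 200), (1, 200), (1, 100), (1, 200),
    (2, 400), (2, 200), (2, 100),
    (3, 200), (3, 200), (3, 100),
]

for row in flag_200(sample):
    print(row)
```

On the sample data both columns agree. They differ only when col2 is NULL: the bare comparison returns NULL, while the CASE expression returns 0.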