Word count using Hive - count

Let's say I have a table with columns id and content:
id | content
________________________
1 | abc abr abc as abs
2 | abc arc cre arc
3 | agr ann agd agd agd
What I want is output like this:
{"abc":2,"abr":1,"as":1, "abs":1} # for id 1
{"abc":1,"arc":2,"cre":1} # for id 2
{"agr":1,"agd":3,"ann":1} # for id 3
How could the task be done using Hive?

You'll need the Brickhouse library (the jar loaded in the query below), which provides the COLLECT UDAF. It's pretty straightforward to build.
Query:
ADD JAR /path/to/jar/brickhouse-0.7.1.jar;
CREATE TEMPORARY FUNCTION COLLECT AS 'brickhouse.udf.collect.CollectUDAF';
SELECT id
, COLLECT(words, c) AS count_map
FROM (
SELECT id
, words
, COUNT(*) AS c
FROM (
SELECT id, words
FROM db.tbl
LATERAL VIEW EXPLODE(SPLIT(content, ' ')) exptbl AS words ) x
GROUP BY id, words ) y
GROUP BY id
Output:
+----+---------------------------------+
|id |count_map |
+----+---------------------------------+
|1 |{"as":1,"abs":1,"abc":2,"abr":1} |
+----+---------------------------------+
|2 |{"cre":1,"arc":2,"abc":1} |
+----+---------------------------------+
|3 |{"ann":1,"agr":1,"agd":3} |
+----+---------------------------------+
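To make the three stages of the query easier to follow (split and explode, count per id and word, collect into a map), here is a minimal Python sketch of the same logic run locally. It is not Hive code and does not use Brickhouse; the function name count_words is made up for illustration, and the rows are the sample data from the question.

```python
from collections import Counter

# Sample rows copied from the question: (id, content)
rows = [
    (1, "abc abr abc as abs"),
    (2, "abc arc cre arc"),
    (3, "agr ann agd agd agd"),
]

def count_words(rows):
    """Return {id: {word: count}}, mirroring the count_map column.

    Hypothetical helper for illustration only.
    """
    counters = {}
    for row_id, content in rows:
        # content.split(" ") stands in for SPLIT + LATERAL VIEW EXPLODE;
        # Counter.update stands in for GROUP BY id, words with COUNT(*).
        counters.setdefault(row_id, Counter()).update(content.split(" "))
    # Turning each Counter into a plain dict stands in for COLLECT(words, c).
    return {row_id: dict(counter) for row_id, counter in counters.items()}

count_maps = count_words(rows)
for row_id, count_map in count_maps.items():
    print(row_id, count_map)
```

As in the Hive output, the key order inside each map carries no meaning.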

Related

De duping Table Joined to itself

I have the following table:
ID|ID2
-----+---
1234 |56473
56473|1234
34521|56473
35462|23457
23457|35462
56473|34521
As you can see, these IDs are linked together by a previous join on different fields. Each combination of IDs repeats throughout the table, just in a different order.
Desired output:
ID|ID2
-----+---
1234 |56473
34521|56473
35462|23457
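If your database accepts MIN() and MAX() with more than one argument as scalar functions (SQLite does; MySQL and PostgreSQL call them LEAST() and GREATEST()), you can use them to put the smaller ID first in every row and then drop the duplicates. Note that each pair comes back with the smaller ID first: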
select distinct
min(ID, ID2) ID, max(ID, ID2) ID2
from tablename
Results:
| ID | ID2 |
| ----- | ----- |
| 1234 | 56473 |
| 34521 | 56473 |
| 23457 | 35462 |
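If you want to try this locally, here is a small sketch using Python's built-in sqlite3 module. It assumes SQLite (where multi-argument min() and max() are scalar functions) and loads the sample rows from the question into an in-memory table; the ORDER BY 1 is only there to make the printed order stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (ID INTEGER, ID2 INTEGER)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?)",
    [
        (1234, 56473),
        (56473, 1234),
        (34521, 56473),
        (35462, 23457),
        (23457, 35462),
        (56473, 34521),
    ],
)

# min(a, b) / max(a, b) normalise each pair to (smaller, larger),
# so DISTINCT can collapse the mirrored rows.
pairs = conn.execute(
    """
    SELECT DISTINCT min(ID, ID2) AS ID, max(ID, ID2) AS ID2
    FROM tablename
    ORDER BY 1
    """
).fetchall()

print(pairs)
```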

Split data in SQLite column

I have a SQLite database that looks similar to this:
---------- ------------ ------------
| Car | | Computer | | Category |
---------- ------------ ------------
| id | | id | | id |
| make | | make | | record |
| model | | price | ------------
| year | | cpu |
---------- | weight |
------------
The record column in my Category table contains a comma separated list of the table name and id of the items that belong to that Category, so an entry would look like this:
Car_1,Car_2.
I am trying to split the items in the record on the comma to get each value:
Car_1
Car_2
Then I need to take it one step further and split on the _ and return the Car records.
So if I know the Category id, I'm trying to wind up with this in the end:
---------------- ------------------
| Car | | Car |
---------------| -----------------|
| id: 1 | | id: 2 |
| make: Honda | | make: Toyota |
| model: Civic | | model: Corolla |
| year: 2016 | | year: 2013 |
---------------- ------------------
I have had some success on splitting on the comma and getting 2 records back, but I'm stuck on splitting on the _ and making the join to the table in the record.
This is my query so far:
WITH RECURSIVE record(recordhash, data) AS (
SELECT '', record || ',' FROM Category WHERE id = 1
UNION ALL
SELECT
substr(data, 0, instr(data, ',')),
substr(data, instr(data, ',') + 1)
FROM record
WHERE data != '')
SELECT recordhash
FROM record
WHERE recordhash != ''
This is returning
--------------
| recordhash |
--------------
| Car_1 |
| Car_2 |
--------------
Any help would be greatly appreciated!
If your recursive CTE works as expected, you can split each recordhash value on _ and use the part after it as the id of the Car rows to return. Keep the WITH RECURSIVE clause from your query in front of this SELECT:
select * from Car
where id in (
select substr(recordhash, 5)
from record
where recordhash like 'Car%'
)
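Putting the CTE and the lookup together, a runnable sketch with Python's sqlite3 module could look like this. The schema is guessed from the diagram in the question, the third Car row is an extra made-up row (not from the question) to show that unlisted cars are filtered out, and instr() is used to find the underscore instead of hard-coding position 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Car (id INTEGER PRIMARY KEY, make TEXT, model TEXT, year INTEGER);
    CREATE TABLE Category (id INTEGER PRIMARY KEY, record TEXT);
    INSERT INTO Car VALUES (1, 'Honda', 'Civic', 2016);
    INSERT INTO Car VALUES (2, 'Toyota', 'Corolla', 2013);
    -- Hypothetical extra row, only here to prove the filter works.
    INSERT INTO Car VALUES (3, 'Ford', 'Focus', 2010);
    INSERT INTO Category VALUES (1, 'Car_1,Car_2');
    """
)

query = """
WITH RECURSIVE record(recordhash, data) AS (
    SELECT '', record || ',' FROM Category WHERE id = ?
    UNION ALL
    SELECT substr(data, 1, instr(data, ',') - 1),  -- text before the first comma
           substr(data, instr(data, ',') + 1)      -- remainder after it
    FROM record
    WHERE data != ''
)
SELECT * FROM Car
WHERE id IN (
    -- Take the part after '_' and compare it as an integer id.
    SELECT CAST(substr(recordhash, instr(recordhash, '_') + 1) AS INTEGER)
    FROM record
    WHERE substr(recordhash, 1, 4) = 'Car_'
)
ORDER BY id
"""

cars = conn.execute(query, (1,)).fetchall()
print(cars)
```

The same pattern would need one query per table name (Car, Computer), since a table name cannot be bound as a parameter.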

SQLITE order by numeric and not alphabetic

When I order my SQLite table by Classement, I get this:
Classement | Nom
1 | clem
10 | caro
11 | flo
12 | raph
2 | prisc
3 | karim
4 | prout
I would like to get:
Classement | Nom
1 | clem
2 | prisc
3 | karim
4 | prout
10 | caro
11 | flo
12 | raph
Here is my code:
SELECT t.Classement
FROM tableau t
WHERE 1 = (SELECT 1 + COUNT (*) FROM tableau t2 WHERE t2.Classement < t.Classement OR ( t2.Classement == t.Classement AND t2.Nom < t.Nom ))
Can anyone help me?
Thank you!
I guess the Classement column holds text rather than integers. So try this:
SELECT * FROM tableau ORDER BY cast(Classement as integer);
You get alphabetic order if the values are strings.
To change the table so that all Classement values are numbers, ensure that the column type is not a text type, and use this:
UPDATE tableau SET Classement = CAST(Classement AS NUMBER);
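Here is a quick way to see the difference, sketched with Python's sqlite3 module. It assumes Classement is declared as TEXT, which is a guess based on the output in the question, and uses the sample rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the column is stored as text, which would explain the ordering.
conn.execute("CREATE TABLE tableau (Classement TEXT, Nom TEXT)")
conn.executemany(
    "INSERT INTO tableau VALUES (?, ?)",
    [
        ("1", "clem"), ("10", "caro"), ("11", "flo"), ("12", "raph"),
        ("2", "prisc"), ("3", "karim"), ("4", "prout"),
    ],
)

# Plain ORDER BY compares the strings character by character.
as_text = [row[0] for row in conn.execute(
    "SELECT Classement FROM tableau ORDER BY Classement")]

# CAST makes the sort numeric without changing the stored values.
as_int = [row[0] for row in conn.execute(
    "SELECT Classement FROM tableau ORDER BY CAST(Classement AS INTEGER)")]

print(as_text)
print(as_int)
```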

sqlite, order by date/integer in joined table

I have two tables
Names
id | name
---------
5 | bill
15 | bob
10 | nancy
Entries
id | name_id | added | description
----------------------------------
2 | 5 | 20140908 | i added this
4 | 5 | 20140910 | added later on
9 | 10 | 20140908 | i also added this
1 | 15 | 20140805 | added early on
6 | 5 | 20141015 | late to the party
I'd like to order Names by the first of the numerically-lowest added values in the Entries table, and display the rows from both tables ordered by the added column overall, so the results will be something like:
names.id | names.name | entries.added | entries.description
-----------------------------------------------------------
15 | bob | 20140805 | added early on
5 | bill | 20140908 | i added this
10 | nancy | 20140908 | i also added this
I looked into joins on the first item (e.g. SQL Server: How to Join to first row) but wasn't able to get it to work.
Any tips?
Give this query a try:
SELECT Names.id, Names.name, Entries.added, Entries.description
FROM Names
INNER JOIN Entries
ON Names.id = Entries.name_id
ORDER BY Entries.added
Add DESC if you want it in reverse order i.e.: ORDER BY Entries.added DESC.
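This should do it. The subquery groups by name_id alone, and it relies on SQLite taking the other selected columns (here description) from the row that holds MIN(added):
SELECT n.id, n.name, e.added, e.description
FROM Names n INNER JOIN
(SELECT name_id, description, MIN(added) AS added FROM Entries GROUP BY name_id) e
ON n.id = e.name_id
ORDER BY e.added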
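If you would rather not depend on how a particular database fills in non-aggregated columns, another option is to compute the earliest added value per name in a subquery and join back to Entries on it. Below is a sketch of that variant using Python's sqlite3 module and the sample data from the question; the alias first_added and the n.id tie-breaker in ORDER BY are my own additions. If a name had two entries sharing its earliest added value, this version would return both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Names (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Entries (id INTEGER PRIMARY KEY, name_id INTEGER,
                          added INTEGER, description TEXT);
    INSERT INTO Names VALUES (5, 'bill'), (15, 'bob'), (10, 'nancy');
    INSERT INTO Entries VALUES
        (2, 5, 20140908, 'i added this'),
        (4, 5, 20140910, 'added later on'),
        (9, 10, 20140908, 'i also added this'),
        (1, 15, 20140805, 'added early on'),
        (6, 5, 20141015, 'late to the party');
    """
)

first_entries = conn.execute(
    """
    SELECT n.id, n.name, e.added, e.description
    FROM Names n
    JOIN Entries e ON e.name_id = n.id
    JOIN (
        -- Earliest added value for each name.
        SELECT name_id, MIN(added) AS first_added
        FROM Entries
        GROUP BY name_id
    ) f ON f.name_id = e.name_id AND f.first_added = e.added
    ORDER BY e.added, n.id
    """
).fetchall()

for row in first_entries:
    print(row)
```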

.NET Merge Single Column from datarows when ID matchs

Just in advance, I have no access to the SQL query written, so all I can do is try to handle the dataset after the query has executed.
I'm using ASP.NET Webforms to try and merge only one column across a SQL returned datatable e.g
PID | C1 | C2 | C3 | I1
1 | a | a | a | bob
1 | x | x | x | Jim
1 | b | b | b | Fred
2 | g | g | g | Jill
From this Dataset I would like to see:
PID | C1 | C2 | C3 | I1
1 | a | a | a | bob Jim Fred
2 | g | g | g | Jill
Essentially I don't care what is in C1-C3, it will just take the values of the first match. What I need to do though is join all the values of I1 into the one result based on a matching PID.
Any help would be greatly appreciated. LINQ answers acceptable, preferably in vb.net so I don't have to change it later.
Thank you.
You can use GroupBy with String.Join. I'm adding the answer in C#; you might want to convert it to VB (I'm not very good with VB syntax :P).
var result = dataListObject.GroupBy(l => l.PId)
    // g.Key is the PId itself; the other columns come from the first row of each group
    .Select(g => new { PID = g.Key, C1 = g.First().C1, C2 = g.First().C2, C3 = g.First().C3, I1 = string.Join(" ", g.Select(i => i.I1)) });
This takes C1, C2 and C3 from the first row of each PID group, then joins the I1 values with spaces. I haven't checked this, but it seems like it should work.
UPDATE
For a DataTable, you can call AsEnumerable() on it to make its rows enumerable:
var result = datatable.AsEnumerable().GroupBy(r => r["PID"])
    // rows are DataRow objects, so columns are read by name rather than as properties
    .Select(g => new { PID = g.Key, C1 = g.First()["C1"], C2 = g.First()["C2"], C3 = g.First()["C3"], I1 = string.Join(" ", g.Select(r => r["I1"])) });
