R: Create groups within column - r

I'm trying to group an age column into an age group column and summarize by that grouping.
That is, I need the dataset below:
AGE
1
2
5
68
27
4
2
33
45
To become
AGE_GRP COUNT
1-10 5
11-20 0
21-30 1
31-40 1
40+ 2
I'm using R
Thanks.

You need a CASE expression to split AGE into different groups:
SELECT CASE
WHEN AGE BETWEEN 1 AND 10 THEN '1-10'
WHEN AGE BETWEEN 11 AND 20 THEN '11-20'
WHEN AGE BETWEEN 21 AND 30 THEN '21-30'
WHEN AGE BETWEEN 31 AND 40 THEN '31-40'
ELSE '40+'
END AS AGE_GRP,
Count(1) as Cnt
FROM yourtable
GROUP BY CASE
WHEN AGE BETWEEN 1 AND 10 THEN '1-10'
WHEN AGE BETWEEN 11 AND 20 THEN '11-20'
WHEN AGE BETWEEN 21 AND 30 THEN '21-30'
WHEN AGE BETWEEN 31 AND 40 THEN '31-40'
ELSE '40+'
END
If you don't want to repeat the CASE statement in GROUP BY then use this
SELECT AGE_GRP,
Count(1) AS cnt
FROM (SELECT CASE
WHEN AGE BETWEEN 1 AND 10 THEN '1-10'
WHEN AGE BETWEEN 11 AND 20 THEN '11-20'
WHEN AGE BETWEEN 21 AND 30 THEN '21-30'
WHEN AGE BETWEEN 31 AND 40 THEN '31-40'
ELSE '40+'
END AS AGE_GRP
FROM yourtable) A
GROUP BY AGE_GRP
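As a quick check, here is a minimal sketch that runs the derived-table version against the sample ages from the question. It assumes SQLite through Python's built-in sqlite3 module, which is only a stand-in for whatever database you actually use. Note that the '11-20' group does not appear in the result at all, because no row falls into it.

```python
import sqlite3

# Sample ages from the question.
ages = [1, 2, 5, 68, 27, 4, 2, 33, 45]

# In-memory SQLite database; an assumption made only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (AGE INTEGER)")
conn.executemany("INSERT INTO yourtable (AGE) VALUES (?)", [(a,) for a in ages])

query = """
SELECT AGE_GRP, COUNT(1) AS cnt
FROM (SELECT CASE
               WHEN AGE BETWEEN 1 AND 10 THEN '1-10'
               WHEN AGE BETWEEN 11 AND 20 THEN '11-20'
               WHEN AGE BETWEEN 21 AND 30 THEN '21-30'
               WHEN AGE BETWEEN 31 AND 40 THEN '31-40'
               ELSE '40+'
             END AS AGE_GRP
      FROM yourtable) A
GROUP BY AGE_GRP
"""

# Map each group label to its count.
case_counts = dict(conn.execute(query).fetchall())
print(case_counts)
```

Groups with no rows are simply missing here, so if you need a 0 for them you have to supply the list of groups yourself.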

You have zero values so you need a left join:
select ag.agegrp, count(t.age) as cnt
from (select '1-10' as agegrp, 1 as lowb, 10 as hib union all
select '11-20', 11, 20 union all
select '21-30', 21, 30 union all
select '31-40', 31, 40 union all
select '40+', 41, NULL
) ag left join
t
on t.age >= ag.lowb and (t.age <= ag.hib or ag.hib is null)
group by ag.agegrp, ag.lowb
order by ag.lowb;
Note: this assumes the column is an integer, so a value like 30.5 isn't allowed. It is easy to adjust the query to handle non-integer ages, if that is the requirement.
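One possible way to make that adjustment for non-integer ages is sketched below. It assumes half-open intervals (lower bound inclusive, next group's lower bound exclusive), so 30.5 is counted in '21-30'; that convention is my assumption, not something the question specifies. The column name next_lowb is made up for the sketch, and SQLite via Python's sqlite3 module is used only so the query can be run.

```python
import sqlite3

# Sample ages from the question plus one non-integer value (30.5) for illustration.
ages = [1, 2, 5, 68, 27, 4, 2, 33, 45, 30.5]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (age REAL)")
conn.executemany("INSERT INTO t (age) VALUES (?)", [(a,) for a in ages])

# Each group is [lowb, next_lowb); the open-ended last group has next_lowb NULL.
query = """
SELECT ag.agegrp, COUNT(t.age) AS cnt
FROM (SELECT '1-10' AS agegrp, 1 AS lowb, 11 AS next_lowb UNION ALL
      SELECT '11-20', 11, 21 UNION ALL
      SELECT '21-30', 21, 31 UNION ALL
      SELECT '31-40', 31, 41 UNION ALL
      SELECT '40+', 41, NULL
     ) ag
LEFT JOIN t
  ON t.age >= ag.lowb
 AND (ag.next_lowb IS NULL OR t.age < ag.next_lowb)
GROUP BY ag.agegrp, ag.lowb
ORDER BY ag.lowb
"""

binned = conn.execute(query).fetchall()
for agegrp, cnt in binned:
    print(agegrp, cnt)
```

Because COUNT(t.age) only counts non-NULL values, the unmatched '11-20' group comes back with 0 instead of disappearing.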

Related

Merge two dataframes by a closest value in R

I have two dataframes that I want to merge by the closest value in one column. The first dataframe (DF1) consists of individuals and their estimated individual risk ("risk"):
DF1<- data.frame(ID = c(1, 2, 3), risk = c(22, 40, 20))
ID risk
1 22
2 40
3 20
The second dataframe (DF2) consists of population by age groups ("population_age") and the normal risks within each age group ("population_normal_risk"):
DF2<- data.frame(population_age = c("30-34","35-39","40-44"), population_normal_risk = c(15, 30, 45))
population_age population_normal_risk
30-34 15
35-39 30
40-44 45
What I want is to add a new column in the DF1 dataframe showing the population age group ("population_age") with the closest risk value ("population_normal_risk") to the estimated risk on each individual ("risk").
What I expected would be:
ID risk population_age_group
1 22 30-34
2 40 40-44
3 20 30-34
Thanks in advance!
We can use findInterval.
First we need to calculate our break points at the halfway points between the population risk values (this assumes DF2 is sorted by population_normal_risk):
breaks <- c(0, DF2$population_normal_risk + c(diff(DF2$population_normal_risk) / 2, Inf))
Then use findInterval to detect which bin each risk falls into:
matches <- findInterval(DF1$risk, breaks)
Finally, write the matches in:
DF1$population_age <- DF2$population_age[matches]
Giving us:
DF1
ID risk population_age
1 1 22 30-34
2 2 40 40-44
3 3 20 30-34
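For anyone who wants to see the midpoint idea outside R, here is a rough Python sketch of the same logic. It uses bisect_right from the standard library in place of findInterval, which is my substitution, and it assumes the reference risks are already sorted in ascending order.

```python
from bisect import bisect_right

# Data from the question.
risks = [22, 40, 20]
population_age = ["30-34", "35-39", "40-44"]
population_normal_risk = [15, 30, 45]  # assumed sorted ascending

# Break points halfway between neighbouring reference risks.
midpoints = [
    (low + high) / 2
    for low, high in zip(population_normal_risk, population_normal_risk[1:])
]

# bisect_right counts how many midpoints are <= the risk,
# which is the index of the closest reference value.
matches = [population_age[bisect_right(midpoints, r)] for r in risks]
print(matches)
```

A risk that sits exactly on a midpoint goes to the higher group here, which matches how findInterval treats its left-closed intervals.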
We can try the code below using outer + max.col
transform(
DF1,
population_age = DF2[max.col(-abs(outer(risk, DF2$population_normal_risk, `-`))), "population_age"]
)
which gives
ID risk population_age
1 1 22 30-34
2 2 40 40-44
3 3 20 30-34
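The outer + max.col approach compares every risk against every reference value and keeps the closest one. A small Python sketch of that brute-force comparison follows; it is not a translation of max.col itself. One difference to be aware of: min() here resolves ties by taking the first match, whereas max.col breaks ties at random by default.

```python
# Data from the question.
risks = [22, 40, 20]
population_age = ["30-34", "35-39", "40-44"]
population_normal_risk = [15, 30, 45]

nearest_groups = []
for risk in risks:
    # Index of the reference risk with the smallest absolute difference.
    best = min(
        range(len(population_normal_risk)),
        key=lambda j: abs(risk - population_normal_risk[j]),
    )
    nearest_groups.append(population_age[best])

print(nearest_groups)
```

Unlike the midpoint approach, this does not need the reference values to be sorted.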

Select from end of table until selection contains a number of valid values

I have an SQLite table like the following:
gary_ages
====================
Name | Age
--------------------
Gary 1 | 20
Gary 2 | 50
Gary 3 | 35
Gary 4 | 71
Gary 5 | 50
Gary 6 | 4
Gary 7 | 50
Gary 8 | 65
Gary 9 | 91
Gary 10 | 50
Gary 11 | 0
You can see that Garies 2, 5, 7 and 10 are 50 years old.
I would like to make a selection, starting from the end of the table, that contains 3 Garies whose age is 50. In this case that selection would range inclusively from Gary 5 to Gary 11.
The selection contains 3 Garies aged 50.
It would not include Gary 2, as then it would contain 4 Garies aged 50, and I only want 3.
The selection does include Gary 11, because Gary 11 is at the end of the table which is where the selection starts.
The selection does include all Garies between 5 and 11, even though not all of them are aged 50.
The selection does not include Garies 3 and 4, because the selection already has 3 Garies aged 50 and doesn't care about any more Garies.
Using SQLite via Python, I can do this fairly easily by selecting the final row of the table, checking if the total count of 50's is 3, and either selecting the next row or returning the current selection as a Python list depending. But ideally I'd like to confine this to the SQLite world.
Is there a simple solution?
SELECT * FROM gary_ages WHERE ...
I believe that
SELECT * FROM gary_ages WHERE rowid >= (SELECT min(rowid) FROM (SELECT rowid FROM gary_ages WHERE Age = 50 ORDER BY rowid DESC LIMIT 3));
will return the rows from Gary 5 through Gary 11.
That is:
the innermost sub-query selects the 3 latest 50-year-olds,
the outer sub-query then selects the lowest rowid among those 3,
and that lowest rowid drives the WHERE clause, which includes every row whose rowid is equal to or larger than it.
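Since the question mentions using SQLite via Python, here is a small sketch that rebuilds the sample table in memory and runs essentially the same query. The only change is that the inner rowid is aliased as rid (a name I made up) to keep the two levels easy to tell apart, and it relies on rows having been inserted in table order so that rowid reflects position.

```python
import sqlite3

# Ages of Gary 1 .. Gary 11 from the question, in table order.
ages = [20, 50, 35, 71, 50, 4, 50, 65, 91, 50, 0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gary_ages (Name TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO gary_ages (Name, Age) VALUES (?, ?)",
    [(f"Gary {i}", age) for i, age in enumerate(ages, start=1)],
)

query = """
SELECT Name
FROM gary_ages
WHERE rowid >= (SELECT min(rid)
                FROM (SELECT rowid AS rid   -- the 3 latest rows with Age = 50
                      FROM gary_ages
                      WHERE Age = 50
                      ORDER BY rowid DESC
                      LIMIT 3))
ORDER BY rowid
"""

names = [row[0] for row in conn.execute(query)]
print(names)
```

If fewer than 3 rows are aged 50, the sub-query returns the lowest rowid among however many there are, so the selection starts at the earliest such row.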
One way:
WITH ranked AS
(SELECT rowid, row_number() OVER (ORDER BY rowid DESC) AS rn
FROM gary_ages
WHERE age = 50)
SELECT name
FROM gary_ages
WHERE rowid >= (SELECT rowid FROM ranked WHERE rn = 3)
ORDER BY rowid;
name
----------
Gary 5
Gary 6
Gary 7
Gary 8
Gary 9
Gary 10
Gary 11
(Note: Requires SQLite 3.25 or newer for row_number())

SQL Count Data 1/2 hourly

I have a stored procedure that counts data for each hour,
Declare @DateTimeToFilter DATETIME;
--set @DateTimeToFilter = GetDate();
set @DateTimeToFilter = '6/5/14'
SET NOCOUNT ON;
WITH H ([Hour]) AS
( SELECT 7 UNION
SELECT 8 UNION
SELECT 9 UNION
SELECT 10 UNION
SELECT 11 UNION
SELECT 12 UNION
SELECT 13 UNION
SELECT 14 UNION
SELECT 15 UNION
SELECT 16 UNION
SELECT 17 UNION
SELECT 18 UNION
SELECT 19
)
SELECT H.[Hour],
COUNT(T.BookingID) AS NoOfUsers
FROM H
LEFT JOIN tbl_Visitor T
ON H.[Hour] = DATEPART(HOUR, T.TimeofArrival) AND
((DATEDIFF(dd, T.TimeofArrival, @DateTimeToFilter) = 0) AND (DATEDIFF(mm, T.TimeofArrival, @DateTimeToFilter) = 0) AND
(DATEDIFF(yy, T.TimeofArrival, @DateTimeToFilter) = 0))
GROUP BY H.[Hour];
This returns a row for each hour whether or not there is any data for it.
How could I also include the half-hourly data, so that the returned data looks like this?
Hour Count
7 0
7.5 0
8 0
8.5 0
9 0
9.5 0
10 4
10.5 0
11 0
11.5 0
12 0
12.5 0
13 0
13.5 0
14 5
14.5 0
15 2
15.5 0
16 2
16.5 0
17 0
17.5 0
18 0
18.5 0
19 0
19.5 0
The data is stored in the database as a smalldatetime, i.e. 2014-06-05 14:00:00
Any help is appreciated.
You can use minutes instead of hours:
with h ([Minute]) as (
select 420 union all
select 450 union all
select 480 union all
select 510 union all
select 540 union all
...
Divide the minutes to get fractional hours:
select h.[Minute] / 60.0 as [Hour], ...
Calculate the start and stop time for the interval to filter the data:
... on T.TimeofArrival >= dateadd(minute, h.[Minute], @DateTimeToFilter) and
T.TimeofArrival < dateadd(minute, h.[Minute] + 30, @DateTimeToFilter)
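To make the bucket arithmetic concrete, here is a plain Python sketch of the same idea: bucket starts expressed as minutes after midnight, a half-open start/stop window per bucket, and the minutes divided by 60 to get the fractional hour. The arrival times are made-up sample data, not taken from the question.

```python
from datetime import datetime, timedelta

# The day being filtered (2014-06-05 from the question).
day = datetime(2014, 6, 5)

# Hypothetical arrival times, for illustration only.
arrivals = [
    datetime(2014, 6, 5, 10, 0),
    datetime(2014, 6, 5, 10, 10),
    datetime(2014, 6, 5, 14, 45),
    datetime(2014, 6, 4, 10, 0),  # previous day, must not be counted
]

# Bucket starts in minutes after midnight: 420, 450, ..., 1170 (07:00 to 19:30).
bucket_starts = range(7 * 60, 20 * 60, 30)

rows = []
for minute in bucket_starts:
    start = day + timedelta(minutes=minute)
    stop = start + timedelta(minutes=30)
    # Same half-open condition as the join: start <= arrival < stop.
    count = sum(1 for ts in arrivals if start <= ts < stop)
    rows.append((minute / 60, count))

for hour, count in rows:
    print(hour, count)
```

Every bucket gets a row even when its count is 0, which is what the LEFT JOIN against the list of minutes achieves in SQL.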
Below is an example that groups by half-hour intervals and can easily be extended for other intervals. I suggest you avoid applying functions to columns in the WHERE clause as that prevents indexes on those columns from being used efficiently.
DECLARE
@DateTimeToFilter smalldatetime = '2014-06-05'
, @IntervalStartTime time = '07:00:00'
, @IntervalEndTime time = '20:00:00'
, @IntervalMinutes int = 30;
WITH
t4 AS (SELECT n FROM (VALUES(0),(0),(0),(0)) t(n))
, t256 AS (SELECT 0 AS n FROM t4 AS a CROSS JOIN t4 AS b CROSS JOIN t4 AS c CROSS JOIN t4 AS d)
, t64k AS (SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num FROM t256 AS a CROSS JOIN t256 AS b)
, intervals AS (SELECT DATEADD(minute, (num - 1) * @IntervalMinutes, @DateTimeToFilter) AS interval
FROM t64k
WHERE num <= 1440 / @IntervalMinutes)
SELECT
interval
, CAST(DATEDIFF(minute, @DateTimeToFilter, interval) / 60.0 AS decimal(3, 1)) AS Hour
, COUNT(T.BookingID) AS NoOfUsers
FROM intervals
LEFT JOIN dbo.tbl_Visitor T
ON T.TimeofArrival >= intervals.interval
AND T.TimeofArrival < DATEADD(minute, @IntervalMinutes, intervals.interval)
WHERE
interval >= DATEADD(minute, DATEDIFF(minute, '', @IntervalStartTime), @DateTimeToFilter)
AND interval < DATEADD(minute, DATEDIFF(minute, '', @IntervalEndTime), @DateTimeToFilter)
GROUP BY interval
ORDER BY Hour;

How To Group By Only Some Rows

I have some records.
ID Salary WillGroupBy Amount
----------------------------------------
6320 100 1 15
6320 150 1 20
6694 200 0 25
6694 300 0 30
7620 400 1 45
7620 500 1 50
How can I group only the records where WillGroupBy = 1?
(I will SUM Salary and Amount columns)
I want to get this result:
ID Salary WillGroupBy Amount
----------------------------------------
6320 250 1 35
6694 200 0 25
6694 300 0 30
7620 900 1 95
Can you help me please :( ?
Solution:
SELECT ID, Salary, WillGroupBy, Amount
FROM YourTable
where WILLGROUPBY = 0
union all
SELECT ID, SUM(Salary) Salary, WillGroupBy, SUM(Amount) Amount
FROM YourTable
where WILLGROUPBY = 1
group by ID, WillGroupBy
I used this solution, which Erhan suggested.
I would like to know whether it can be done another way.
With MySQL you can do:
SELECT ID, SUM(Salary) Salary, WillGroupBy, SUM(Amount) Amount, @row := @row + 1
FROM YourTable
JOIN (SELECT @row := 0) v
GROUP BY ID, IF(WillGroupBy = 1, -1, @row)
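If you are not on MySQL, a similar effect can probably be had without user variables by grouping on a conditional key. The sketch below is a different technique from the query above: rows with WillGroupBy = 1 share a constant key, while every other row is grouped on its own unique row identifier. It assumes SQLite and uses its implicit rowid; on another database you would substitute the table's primary key.

```python
import sqlite3

# Sample records from the question: (ID, Salary, WillGroupBy, Amount).
records = [
    (6320, 100, 1, 15),
    (6320, 150, 1, 20),
    (6694, 200, 0, 25),
    (6694, 300, 0, 30),
    (7620, 400, 1, 45),
    (7620, 500, 1, 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable (ID INTEGER, Salary INTEGER, WillGroupBy INTEGER, Amount INTEGER)"
)
conn.executemany("INSERT INTO YourTable VALUES (?, ?, ?, ?)", records)

# Rows flagged with WillGroupBy = 1 collapse per ID (constant key 0);
# all other rows stay separate because rowid is unique per row.
query = """
SELECT ID, SUM(Salary) AS Salary, WillGroupBy, SUM(Amount) AS Amount
FROM YourTable
GROUP BY ID, WillGroupBy, CASE WHEN WillGroupBy = 1 THEN 0 ELSE rowid END
ORDER BY MIN(rowid)
"""

grouped = conn.execute(query).fetchall()
for row in grouped:
    print(row)
```

This avoids scanning the table twice as the UNION ALL version does, at the cost of a slightly less obvious GROUP BY.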

Ref cursor with dynamic columns

I am using Oracle 11g and have written a stored procedure which stores values in a temporary table as follows:
id count hour age range
-------------------------------------
0 5 10 61 10-200
1 6 20 61 10-200
2 7 15 61 10-200
5 9 5 61 201-300
7 10 25 61 201-300
0 5 10 62 10-20
1 6 20 62 10-20
2 7 15 62 10-20
5 9 5 62 21-30
1 8 6 62 21-30
7 10 25 62 21-30
10 15 30 62 31-40
Now, using this temp table, I want to return two cursors: one for age 61 and one for age 62.
For each cursor, the distinct range values should become the columns. For example, the cursor for age 62 should return the following dataset:
user 10-20 21-30 31-40
Count/hour count/hour count/hour
----------------------------------------------
0 5 10 - - - -
1 6 20 8 6 - -
2 7 15 - - - -
5 - - 9 5 - -
7 - - 10 25 - -
10 - - - - 15 30
The range values in the temp table are not fixed; they are referenced from another table.
Edited: I am using PIVOT for the above problem, but all the examples I have seen online use fixed column values (range in my case). How can I get dynamic values? The following is an example query:
SELECT *
FROM (SELECT column_2, column_1
FROM test_table)
PIVOT (SUM(column_1) AS sum_values FOR (column_2) IN ('value1' AS a, 'value2' AS b, 'value3' AS c));
Instead of handwritten values, I am using the following query inside IN:
SELECT * from(
with x as (
SELECT DISTINCT range
FROM test_table
WHERE age = 62 )
select ltrim( max( sys_connect_by_path(range, ','))
keep (dense_rank last order by curr),
',') range
from (select range,
row_number() over (order by range) as curr,
row_number() over (order by range) -1 as prev
from x)
connect by prev = PRIOR curr
start with curr = 1 )
This gives an error. But when I use handwritten values, it gives the right output:
select * from (select user_id, nvl(count,0) count, nvl(hour,0) hour,nvl(range,0) range,nvl(age,0)
age from test_table)
PIVOT (SUM(count) as sum_count, sum(hour) as sum_hour for (range) IN
(
'10-20','21-30','31-40'
)
) where age = 62 order by user_id
How can I supply the values dynamically there?
Cursors are slow, so I would recommend trying to do this in a query unless there's no alternative (or speed doesn't matter). You may want to look into PIVOT / UNPIVOT, which can rotate columns (in this case "range").
Here's some PIVOT / UNPIVOT documentation and examples:
http://www.oracle-developer.net/display.php?id=506
Based on your last edit:
Pretty sure you have two options:
Build dynamic sql based on the distinct values found in the "range" column.
You'll probably be stuck using a cursor again to build the column names but at least it will be limited to just the distinct ranges.
Oracle has a PIVOT XML command that you can use for this.
See: http://www.oracle.com/technetwork/articles/sql/11g-pivot-097235.html
And scroll down to the section: "XML Type"
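To give a feel for the first option (building the SQL from the distinct range values), here is a rough sketch in Python against SQLite. It is not Oracle code: it uses conditional aggregation (SUM over CASE) instead of PIVOT, and the column names cnt, hr and rng are stand-ins I chose for count, hour and range so that no quoting of keywords is needed. The range values are bound as parameters, and only the generated column aliases are spliced into the SQL text, after quoting.

```python
import sqlite3

# Rows from the question's temp table: (user_id, cnt, hr, age, rng).
rows_in = [
    (0, 5, 10, 61, "10-200"),
    (0, 5, 10, 62, "10-20"),
    (1, 6, 20, 62, "10-20"),
    (2, 7, 15, 62, "10-20"),
    (5, 9, 5, 62, "21-30"),
    (1, 8, 6, 62, "21-30"),
    (7, 10, 25, 62, "21-30"),
    (10, 15, 30, 62, "31-40"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table (user_id INTEGER, cnt INTEGER, hr INTEGER, age INTEGER, rng TEXT)"
)
conn.executemany("INSERT INTO test_table VALUES (?, ?, ?, ?, ?)", rows_in)


def quote_ident(name):
    """Quote a string for use as an SQL identifier (double any embedded quotes)."""
    return '"' + name.replace('"', '""') + '"'


age = 62

# Step 1: find the distinct ranges for this age.
ranges = [
    r[0]
    for r in conn.execute(
        "SELECT DISTINCT rng FROM test_table WHERE age = ? ORDER BY rng", (age,)
    )
]

# Step 2: build one pair of aggregate columns per range.
select_parts = []
params = []
for rng in ranges:
    select_parts.append(
        f"SUM(CASE WHEN rng = ? THEN cnt END) AS {quote_ident(rng + '_count')}"
    )
    select_parts.append(
        f"SUM(CASE WHEN rng = ? THEN hr END) AS {quote_ident(rng + '_hour')}"
    )
    params += [rng, rng]

sql = (
    "SELECT user_id, "
    + ", ".join(select_parts)
    + " FROM test_table WHERE age = ? GROUP BY user_id ORDER BY user_id"
)
params.append(age)

# Step 3: run the generated statement.
cursor = conn.execute(sql, params)
columns = [d[0] for d in cursor.description]
pivot = cursor.fetchall()

print(columns)
for row in pivot:
    print(row)
```

In Oracle the same two-step shape would apply: query the distinct ranges first, assemble the PIVOT ... IN (...) list from them, and then open the ref cursor on the generated statement.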
