How to group and rank records based on 7 days.
Call 1 - 06-Jun-14 16.39.14 Rank 1
Call 7 - 10-Jun-14 14.28.40 Rank 7
After 7 days, whenever the next call date occurs,
I need to watch the next 7 days and rank accordingly.
Call 1 - 27-Jun-14 11.44.35 Rank 1
Call 4 - 03-Jul-14 14.23.39 Rank 4
CALL_DATE ROW_NUMBER
06-Jun-14 16.39.14 1
06-Jun-14 17.29.27 2
07-Jun-14 09.13.18 3
07-Jun-14 14.45.52 4
08-Jun-14 13.05.44 5
08-Jun-14 13.14.49 6
10-Jun-14 14.28.40 7
27-Jun-14 11.44.35 1
27-Jun-14 11.46.27 2
27-Jun-14 12.00.21 3
03-Jul-14 14.23.39 4
You can calculate the day number within the range by using the first_value() analytic function and getting the difference; then divide that by seven to get the week number (within the data); and then use that calculate the row_number() of each date within its calculated week number.
select call_date,
row_number() over (partition by week_num order by call_date) as row_num
from (
select call_date,
ceil((trunc(call_date)
- trunc(first_value(call_date) over (order by call_date))
+ 1) / 7) as week_num
from t42
)
order by call_date;
Which gives:
| CALL_DATE | ROW_NUM |
|-----------------------------|---------|
| June, 06 2014 16:39:14+0000 | 1 |
| June, 06 2014 17:29:27+0000 | 2 |
| June, 07 2014 09:13:18+0000 | 3 |
| June, 07 2014 14:45:52+0000 | 4 |
| June, 08 2014 13:05:44+0000 | 5 |
| June, 08 2014 13:14:49+0000 | 6 |
| June, 10 2014 14:28:40+0000 | 7 |
| June, 27 2014 11:44:35+0000 | 1 |
| June, 27 2014 11:46:27+0000 | 2 |
| June, 27 2014 12:00:21+0000 | 3 |
| July, 03 2014 14:23:39+0000 | 4 |
SQL Fiddle showing some of the intermediate steps and the final result.
Related
I have the following data and looking to create the "Final Col" shown below using dplyr in R. I would appreciate your ideas.
| Year | Week | MainCat|Qty |Final Col |
|:----: |:------: |:-----: |:-----:|:------------:|
| 2017 | 1 | Edible |69 |69/(69+12) |
| 2017 | 2 | Edible |12 |12/(69+12) |
| 2017 | 1 | Flowers|88 |88/(88+47) |
| 2017 | 2 | Flowers|47 |47/(88+47) |
| 2018 | 1 | Edible |90 |90/(90+35) |
| 2018 | 2 | Edible |35 |35/(90+35) |
| 2018 | 1 | Flowers|78 |78/(78+85) |
| 2018 | 2 | Flowers|85 |85/(78+85) |
It can be done with a group_by operation i.e. grouped by 'Year', 'MainCat', divide the 'Qty' by the sum of 'Qty' to create the 'Final' column
library(dplyr)
df1 <- df1 %>%
group_by(Year, MainCat) %>%
mutate(Final = Qty/sum(Qty))
You can use prop.table :
library(dplyr)
df %>% group_by(Year, MainCat) %>% mutate(Final = prop.table(Qty))
# Year Week MainCat Qty Final
# <int> <int> <chr> <int> <dbl>
#1 2017 1 Edible 69 0.852
#2 2017 2 Edible 12 0.148
#3 2017 1 Flowers 88 0.652
#4 2017 2 Flowers 47 0.348
#5 2018 1 Edible 90 0.72
#6 2018 2 Edible 35 0.28
#7 2018 1 Flowers 78 0.479
#8 2018 2 Flowers 85 0.521
You can also do this in base R :
df$Final <- with(df, ave(Qty, Year, MainCat, FUN = prop.table))
I need to make a SQLite query for something I just can't quite wrap my brain around.
I have a database table with a bunch of "issues". And each "issue" has a createddate, and resolutiondate.
id createddate resolutiondate
-------------------------------------------
1 2019-04-18 2019-08-18
2 2019-04-20 2019-04-21
3 2019-05-08 2019-06-05
etc....
What I need to do, is count how many "issues" every month in the past 12 months, had a created date <= that month, and where the resolutiondate is > that month. I want a table that looks like this:
Month No. Of Issues Not Resolved But Existed That Month
---------------------------------------------------------------
2019-04 20
2019-05 17
2019-06 15
etc...
I'm struggling, because I essentially need to check every row multiple times, for every month it's created date is <= that month, and it hasn't been resolved yet. The count for a particular issue could increase the value for "No. Of Issues" for both April 2019 AND May 2019, for example, if it wasn't resolved for 2 months. I'm not sure how to check all rows multiple times.
I have to do it in SQLite.
My current attempt that doesn't seem to be working:
SELECT * FROM(
SELECT substr(createddate, 1, 7) AS created
FROM {{ project_key }}
GROUP BY substr(createddate, 1, 7)
) a JOIN (
SELECT substr(createddate, 1, 7) AS created,
COUNT(CASE WHEN julianday(substr(resolutiondate, 1, 10)) >= julianday(substr(created, 1, 10)) THEN 1 ELSE NULL END) as "No. Issues Not Resolved"
FROM {{ project_key }}
GROUP BY substr(createddate, 1, 7)
) b ON b.created = a.created
With a recursive CTE that returns the past 12 months and a left join to the table:
with months as (
select strftime('%Y-%m', 'now', '-1 year') month
union all
select strftime('%Y-%m', strftime('%Y-%m-%d', month || '-01', '+1 month') )
from months
where month < strftime('%Y-%m', 'now', '-1 month')
)
select m.month,
count(id) [No. Of Issues Not Resolved But Existed That Month]
from months m left join tablename t
on strftime('%Y-%m', t.createddate) <= m.month and strftime('%Y-%m', t.resolutiondate) > m.month
group by m.month
See the demo.
Results:
| month | No. Of Issues Not Resolved But Existed That Month |
| ------- | ------------------------------------------------- |
| 2019-02 | 0 |
| 2019-03 | 0 |
| 2019-04 | 1 |
| 2019-05 | 2 |
| 2019-06 | 1 |
| 2019-07 | 1 |
| 2019-08 | 0 |
| 2019-09 | 0 |
| 2019-10 | 0 |
| 2019-11 | 0 |
| 2019-12 | 0 |
| 2020-01 | 0 |
I am importing data that is neither long nor wide:
clear
input str1 id purchased sold
A 2017 .
B . .
C 2016 2019
C 2018 .
D 2018 2019
D 2018 .
end
My goal is to get the data in the following long format, reflecting the count in each year:
Identifier Year Inventory
A 2016 0
A 2017 1
A 2018 1
A 2019 1
B 2016 0
B 2017 0
B 2018 0
B 2019 0
C 2016 1
C 2017 1
C 2018 2
C 2019 1
D 2016 0
D 2017 0
D 2018 2
D 2019 1
My initial approach would be to transform it first into a wide format, that is having only one row per identifier, and adding columns for years between 2016-2018. And then converting this format into the desired long format. However, this seems to be inefficient.
Is there any shorter and more efficient method to do this, as I have a much larger dataset?
This needs several small tricks. The most crucial are reshape long and fillin.
The inventory is essentially a running sum of purchases minus sales.
clear
input str1 Identifier Purchased Sold
A 2017 .
B . .
C 2016 2019
C 2018 .
D 2018 2019
D 2018 .
end
generate long id = _n
rename (Purchased Sold) year=
reshape long year, i(id) j(Event) string
drop id
fillin Id year
drop _fillin
drop if missing(year)
bysort Id (year Event) : generate inventory = sum((Event == "Purchased") - (Event == "Sold"))
drop Event
bysort Id year : keep if _n == _N
list, sepby(Id)
+----------------------------+
| Identi~r year invent~y |
|----------------------------|
1. | A 2016 0 |
2. | A 2017 1 |
3. | A 2018 1 |
4. | A 2019 1 |
|----------------------------|
5. | B 2016 0 |
6. | B 2017 0 |
7. | B 2018 0 |
8. | B 2019 0 |
|----------------------------|
9. | C 2016 1 |
10. | C 2017 1 |
11. | C 2018 2 |
12. | C 2019 1 |
|----------------------------|
13. | D 2016 0 |
14. | D 2017 0 |
15. | D 2018 2 |
16. | D 2019 1 |
+----------------------------+
I've a table as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 10 | 2019-09 |
| 1 | 12 | 2018-09 |
| 2 | 13 | 2019-10 |
| 2 | 14 | 2018-10 |
| 3 | 67 | 2019-01 |
| 3 | 78 | 2018-01 |
+----+-------+---------+
I want to be able to swap the VALUE column for all ID's where the DATE != year-month of system date
and If the DATE == year-month of system date then just keep this years value
the resulting table I need is as under
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 12 | 2019-09 |
| 2 | 13 | 2019-10 |
| 3 | 78 | 2019-01 |
+----+-------+---------+
As Jon and Maurits noticed, your example is unclear: you give no line with what is a wrong format to you, and you mention "current year" but do not describe the expected output for the next year for instance.
Here is an attempt of code to actually answer your question:
library(dplyr)
x = read.table(text = "
ID VALUE DATE
1 10 2019-09
1 12 2018-09
1 12 2018-09-04
1 12 2018-99
2 13 2019-10
2 14 2018-10
3 67 2019-01
3 78 2018-01
", header=T)
x %>%
mutate(DATE = paste0(DATE, "-01") %>% as.Date("%Y-%m-%d")) %>%
group_by(ID) %>%
filter(DATE==max(DATE, na.rm=T))
I inserted two lines with a "wrong" format (according to me) and treated "current year" as the maximum year you could find in the column for each ID.
This may be wrong assertions, but I'd need more information to better answer this.
This question already has answers here:
Can dplyr summarise over several variables without listing each one? [duplicate]
(2 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 6 years ago.
I have a large dataset containing the names of hospitals, the hospital groups and then the number of presenting patients by month. I'm trying to use dplyr to create a summary that contains the total number of presenting patients each month, aggregated by hospital group. The data frame looks like this:
Hospital | Hospital_group | Jan 03 | Feb 03 | Mar 03 | Apr 03 | .....
---------------------------------------------------------------
Hosp 1 | Group A | 5 | 5 | 6 | 4 | .....
---------------------------------------------------------------
Hosp 2 | Group A | 6 | 3 | 8 | 2 | .....
---------------------------------------------------------------
Hosp 3 | Group B | 5 | 5 | 6 | 4 | .....
---------------------------------------------------------------
Hosp 4 | Group B | 3 | 7 | 2 | 1 | .....
---------------------------------------------------------------
I'm trying to create a new dataframe that looks like this:
Hospital_group |Jan 03 | Feb 03 | Mar 03 | Apr 03 | .....
----------------------------------------------------------
Group A | 11 | 8 | 14 | 6 | .....
----------------------------------------------------------
Group B | 8 | 12 | 8 | 5 | .....
----------------------------------------------------------
I'm trying to use dplyr to summarise the data but am a little stuck (am very new at this as you might have guessed). I've managed to filter out the first column (hospital name) and group_by the hospital group but am not sure how to get a cumulative sum total for each month and year (there is a large number of date columns so I'm hoping there is a quick and easy way to do this).
Sorry about posting such a basic question - any help or advice would be greatly appreciated.
Greg
Use summarize_all:
Example:
df <- tibble(name=c("a","b", "a","b"), colA = c(1,2,3,4), colB=c(5,6,7,8))
df
# A tibble: 4 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 1 5
2 b 2 6
3 a 3 7
4 b 4 8
df %>% group_by(name) %>% summarize_all(sum)
Result:
# A tibble: 2 × 3
name colA colB
<chr> <dbl> <dbl>
1 a 4 12
2 b 6 14
Edit: In your case, your data frame contains one column that you do not want to aggregate (the Hospital name.) You might have to either deselect the hospital name column first, or use summarize_at(vars(-Hospital), funs(sum)) instead of summarize_all.
We can do this using base R
We split the dataframe by Hospital_group and then sum it column-wise.
do.call(rbind, lapply(split(df[-c(1, 2)], df$Hospital_group), colSums))
# Jan_03 Feb_03 Mar_03 Apr_03
#Group_A 11 8 14 6
#Group_B 8 12 8 5