I created a summary for my data and worked out percentages of occurrences per category.
Now I want to sum a subset of categories to show their combined value. For example, I want to be able to say that 51.1% of all occurrences fall within the categories 30, 60, and 120 days (the sum of rows #6, #9, and #3). The data.frame is named "Summary_2".
  Category Count Percent
1    1 day     4    3.3%
8   5 days     5    4.1%
4 180 days     8    6.5%
5 240 days     9    7.3%
2  10 days    15   12.2%
3 120 days    18   14.6%
6  30 days    19   15.4%
7 360 days    19   15.4%
9  60 days    26   21.1%
This is a summary of tickets. I want to be able to make arbitrary groupings and say, for example, that 50% of our tickets are resolved within 2 months, 30% in 180 to 360 days, and 20% within 10 days.
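A minimal sketch of one way to compute such a combined share in R (assuming Summary_2 has the Category and Count columns shown above; the Percent column is text, so recomputing from Count is more reliable):
subset_cats <- c("30 days", "60 days", "120 days")
combined_pct <- sum(Summary_2$Count[Summary_2$Category %in% subset_cats]) /
  sum(Summary_2$Count) * 100
round(combined_pct, 1)
# ~51.2; summing the individually rounded Percent values gives the quoted 51.1%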
In Excel it looks like this:
RECORD  ATTRIBUTE  DATE       MONTH  AMT  CML AMT
1       A          1/1/2021   1      10   10
2       A          2/1/2021   2      10   20
3       A          3/1/2021   3      10   30
4       A          4/1/2021   4      10   40
5       A          5/1/2021   5      10   50
6       A          6/1/2021   6      10   60
7       B          1/1/2021   1      20   20
8       B          3/1/2021   3      20   40
9       B          5/1/2021   5      20   60
10      B          7/1/2021   7      20   80
11      B          9/1/2021   9      20   80
12      B          11/1/2021  11     20   80
13      C          1/1/2021   1      30   30
14      C          8/1/2021   8      30   30
15      C          9/1/2021   9      30   60
I am looking to calculate the cumulative sum (CML AMT column) using the AMT column for the past 6 months.
The CML AMT column should only look at a rolling 6-month window.
If there is no other record for the same attribute within that window, it should simply return the AMT value.
I tried the below, which clearly won't work since the dates/months are not evenly spaced.
Any help will be appreciated.
SUM(AMT)
OVER (PARTITION BY ATTRIBUTE
ORDER BY DATE
ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
Unfortunately Teradata doesn't support RANGE, but if you need to sum over a small number of values only (six months = up to six rows), you can apply a brute-force approach:
AMT
+
CASE WHEN LAG(DATE,1) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE) >= ADD_MONTHS(DATE,-6)
THEN LAG(AMT,1) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE)
ELSE 0
END
+
CASE WHEN LAG(DATE,2) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE) >= ADD_MONTHS(DATE,-6)
THEN LAG(AMT,2) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE)
ELSE 0
END
+
CASE WHEN LAG(DATE,3) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE) >= ADD_MONTHS(DATE,-6)
THEN LAG(AMT,3) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE)
ELSE 0
END
+
CASE WHEN LAG(DATE,4) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE) >= ADD_MONTHS(DATE,-6)
THEN LAG(AMT,4) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE)
ELSE 0
END
+
CASE WHEN LAG(DATE,5) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE) >= ADD_MONTHS(DATE,-6)
THEN LAG(AMT,5) OVER (PARTITION BY ATTRIBUTE ORDER BY DATE)
ELSE 0
END
Looks ugly, but it's mostly cut, paste, and modify, and it's still a single step in Explain. Other possible solutions would be based on an additional EXPAND ON or time-series aggregation step.
Below is the sample data and one manipulation. The first data set is employment specific to an industry. The second data set is overall employment and the unemployment rate. I am seeking to do a left join (or at least that's what I think it should be) to achieve the desired result below. When I do it, I get a one-to-many issue and the row count grows: in this example from 14 to 18, and in the larger data set from 228 to 4348. My primary question is whether this can be done with a properly written join alone, or is there more to it?
area1<-c(000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000)
periodyear<-c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2021,2021)
month<-c(1,2,3,4,5,6,7,8,9,10,11,12,1,2)
emp1 <-c(10,11,12,13,14,15,16,17,20,21,22,24,26,28)
firstset<-data.frame(area1,periodyear,month,emp1)
area2<-c(000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000)
periodyear1<-c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2021,2021)
period<-c(01,02,03,04,05,06,07,08,09,10,11,12,01,02)
rate<-c(3.0,3.2,3.4,3.8,2.5,4.5,6.5,9.1,10.6,5.5,7.8,6.5,4.5,2.9)
emp2<-c(1001,1002,1005,1105,1254,1025,1078,1106,1099,1188,1254,1250,1301,1188)
secondset<-data.frame(area2,periodyear1,period,rate,emp2)
library(dplyr)
secondset <- secondset %>% mutate(month = as.numeric(period))
secondset <- left_join(firstset,secondset, by=c("month"))
Desired result (14 rows; the first 3 are shown below)
area1 periodyear month emp1 rate emp2
000000 2020 1 10 3.0 1001
000000 2020 2 11 3.2 1002
000000 2020 3 12 3.4 1005
We may have to add 'periodyear' (and the area columns) to the by as well:
library(dplyr)
left_join(firstset,secondset, by=c("periodyear" = "periodyear1",
"area1" = "area2", "month"))
-output
area1 periodyear month emp1 period rate emp2
1 0 2020 1 10 1 3.0 1001
2 0 2020 2 11 2 3.2 1002
3 0 2020 3 12 3 3.4 1005
...
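Before joining, it can also help to confirm that the keys uniquely identify rows in secondset. A small check with the same dplyr tools (zero rows returned means the join cannot multiply rows):
library(dplyr)
secondset %>%
  count(area2, periodyear1, period) %>%
  filter(n > 1)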
I'm trying to find the historical average temperature between a range of dates using NOAA data and compare it to the long-term average temperatures.
I'm using the rnoaa package and have hit a bit of a snag. For long-term averages, I have been successful with the following syntax:
library('rnoaa')
start_date = "2010-01-15"
end_date = "2010-11-14"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_DLY', stationid=paste0('GHCND:',station_id),
datatypeid='dly-tavg-normal',
startdate = start_date, enddate = end_date,limit=365)
This lets me parse weather_data$data for the long-term average temperatures for that given station between January 15th and November 14th.
However, I can't seem to find the right dataset or datatype for historical average temperatures. I'd like to get the same data as the code above except with the actual daily average temperatures for those days. Any idea how to query this? I've been at it for a few hours and have had no luck.
Something I tried was the following:
weather_data <- ncdc(datasetid='GHCND', stationid=paste0('GHCND:',station_id),
startdate = start_date, enddate = end_date,limit=365)
uniq_d_types = unique(weather_data$data$datatype)
View(uniq_d_types)
This let me see the unique data types in the GHCND dataset but none of the data types seemed to be daily average temperatures. Any thoughts?
In order to obtain average daily actual temperatures from the NOAA data using the rnoaa package, one must use the hourly data and aggregate it by day. Hourly NOAA data is in the NORMAL_HLY data set, and the required data type is HLY-TEMP-NORMAL.
library('rnoaa')
library(lubridate)
options(noaakey = "obtain key from NOAA website")
start_date = "2010-01-15"
end_date = "2010-01-31"
station_id = "USW00093738"
weather_data <- ncdc(datasetid='NORMAL_HLY', stationid=paste0('GHCND:',station_id),
datatypeid = "HLY-TEMP-NORMAL",
startdate = start_date, enddate = end_date,limit=500)
data <- weather_data$data
data$year <- year(data$date)
data$month <- month(data$date)
data$day <- day(data$date)
# summarize to average daily temps
aggregate(value ~ year + month + day,mean,data = data)
...and the output:
> aggregate(value ~ year + month + day,mean,data = data)
year month day value
1 2010 1 15 323.5417
2 2010 1 16 322.8750
3 2010 1 17 323.4167
4 2010 1 18 323.7500
5 2010 1 19 323.2083
6 2010 1 20 321.0833
7 2010 1 21 318.4167
8 2010 1 22 317.6667
9 2010 1 23 319.0000
10 2010 1 24 321.0833
11 2010 1 25 323.5417
12 2010 1 26 326.0833
13 2010 1 27 328.4167
14 2010 1 28 330.9583
15 2010 1 29 333.2917
16 2010 1 30 335.7917
17 2010 1 31 308.0000
>
Note that temperatures are stored in tenths of degrees Fahrenheit in this data set, so for the period between January 15th and 31st, 2010, the average daily temperatures at the Dulles International Airport weather station were between 30.8 and 33.6 degrees.
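For convenience, the aggregated values can be divided by 10 to get degrees. A minimal sketch, where daily is just a hypothetical name for the result of the aggregate() call above:
daily <- aggregate(value ~ year + month + day, mean, data = data)
daily$temp <- daily$value / 10   # tenths of degrees -> degrees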
Also note that to calculate the average by station and run across multiple weather stations, simply add station to the aggregate() formula:
> # summarize to average daily temps by station
> aggregate(value ~ station + year + month + day,mean,data = data)
station year month day value
1 GHCND:USW00093738 2010 1 15 323.5417
2 GHCND:USW00093738 2010 1 16 322.8750
3 GHCND:USW00093738 2010 1 17 323.4167
4 GHCND:USW00093738 2010 1 18 323.7500
5 GHCND:USW00093738 2010 1 19 323.2083
6 GHCND:USW00093738 2010 1 20 321.0833
7 GHCND:USW00093738 2010 1 21 318.4167
8 GHCND:USW00093738 2010 1 22 317.6667
9 GHCND:USW00093738 2010 1 23 319.0000
10 GHCND:USW00093738 2010 1 24 321.0833
11 GHCND:USW00093738 2010 1 25 323.5417
12 GHCND:USW00093738 2010 1 26 326.0833
13 GHCND:USW00093738 2010 1 27 328.4167
14 GHCND:USW00093738 2010 1 28 330.9583
15 GHCND:USW00093738 2010 1 29 333.2917
16 GHCND:USW00093738 2010 1 30 335.7917
17 GHCND:USW00093738 2010 1 31 308.0000
>
The answer is to grab historical (meaning actual, on the day specified, not long-term average) weather data from NOAA's ISD database. USAF and WBAN values can be found by looking through the isd-history.csv file found here:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa
Here's an example query.
out <- isd(usaf='724030', wban = '93738', year=2018)
This will grab a year's worth of roughly hourly weather data from the ISD database. You can then parse/process the data however you see fit (e.g., compute daily average temperatures as I did).
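To turn the hourly ISD output into daily averages, here is one possible aggregation sketch. It assumes the returned data frame has date (YYYYMMDD) and temperature columns, with temperature stored in tenths of degrees Celsius and "+9999" marking missing readings; check the ISD documentation to confirm the field formats:
library(rnoaa)
library(dplyr)

out <- isd(usaf = '724030', wban = '93738', year = 2018)

daily_avg <- out %>%
  filter(temperature != "+9999") %>%                 # drop missing readings
  mutate(temp_c = as.numeric(temperature) / 10) %>%  # tenths of deg C -> deg C
  group_by(date) %>%
  summarise(avg_temp_c = mean(temp_c, na.rm = TRUE))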
Is there any way in Oracle to make my months start every 28 days?
Example:
24-Dec-2015 to 20-Jan-2016 (we call this Dec 2015)
21-Jan-2016 to 17-Feb-2016 (we call this Jan 2016)
select rownum as month_number
,day1 + (rownum-1) * 28 as gregorian_month_start
,day1 + rownum * 28 - 1 as gregorian_month_end
from (select date'2015-12-24' day1
from dual connect by level <= 13);
1 24/DEC/2015 20/JAN/2016
2 21/JAN/2016 17/FEB/2016
3 18/FEB/2016 16/MAR/2016
4 17/MAR/2016 13/APR/2016
5 14/APR/2016 11/MAY/2016
6 12/MAY/2016 08/JUN/2016
7 09/JUN/2016 06/JUL/2016
8 07/JUL/2016 03/AUG/2016
9 04/AUG/2016 31/AUG/2016
10 01/SEP/2016 28/SEP/2016
11 29/SEP/2016 26/OCT/2016
12 27/OCT/2016 23/NOV/2016
13 24/NOV/2016 21/DEC/2016
Note: 13 × 28 = 364, so this doesn't handle the 365th day in normal years, or the 365th and 366th days in leap years. You would need to specify which month those leftover days should be added to.