Hive: No results are displayed from the query - datetime

I am writing a query on this table to get the sum of size for all the directories, group by directory where date is yesterday. I am getting no output from the below query.
test.id test.path test.size test.date
1 this/is/the/path1/fil.txt 232.24 2019-06-01
2 this/is/the/path2/test.txt 324.0 2016-06-01
3 this/is/the/path3/index.txt 12.3 2017-05-01
4 this/is/the/path4/test2.txt 134.0 2019-03-23
5 this/is/the/path1/files.json 2.23 2018-07-23
6 this/is/the/path1/code.java 1.34 2014-03-23
7 this/is/the/path2/data.csv 23.42 2016-06-23
8 this/is/the/path3/test.html 1.33 2018-09-23
9 this/is/the/path4/prog.js 6.356 2019-06-23
4 this/is/the/path4/test2.txt 134.0 2019-04-23
SELECT regexp_replace(path,'[^/]+$',''), sum(cast(size as decimal))
from test WHERE date > date_sub(current_date, 1) GROUP BY path,size;

You must not group by size, only by regexp_replace(path,'[^/]+$','').
Also, since you want only yesterday's rows why do you use WHERE date > '2019%?
You can get yesterday's date with date_sub(current_date, 1):
select
regexp_replace(path,'[^/]+$',''),
sum(cast(size as decimal))
from test
where date = date_sub(current_date, 1)
group by regexp_replace(path,'[^/]+$','');

You probably want WHERE date >= '2019-01-01'. Using % in matching strings, for example your 2019%, only works with LIKE, not inequality matching.
The example you gave looks like you want all rows in calendar year 2019.
For yesterday, you want
WHERE date >= DATE_SUB(current_date, -1)
AND date < current_date
This works even if your date column contains timestamps.

Related

Creating a SQLite query

I have a SQLite database, I want to create a query that will group records if the DateTime is within 60 minutes - the hard part is the DateTime is cumulative so if we have 3 records with DateTimes 2019-12-14 15:40:00, 2019-12-14 15:56:00 and 2019-12-14 16:55:00 it would all fall in one group. Please see the hands and desired output of the query to help you understand the requirement.
Database Table "Hands"
ID DateTime Result
1 2019-12-14 15:40:00 -100
2 2019-12-14 15:56:00 1000
3 2019-12-14 16:55:00 -2000
4 2012-01-12 12:00:00 400
5 2016-10-01 21:00:00 900
6 2016-10-01 20:55:00 1000
Desired output of query
StartTime Count Result
2019-12-14 15:40:00 3 -1100
2012-01-12 12:00:00 1 400
2016-10-01 20:55:00 2 1900
You can use some window functions to indicate at which record a new group should start (because of a datetime difference with the previous that is 60 minutes or larger), and then to turn that information into a unique group number. Finally you can group by that group number and perform the aggregation functions on it:
with base as (
select DateTime, Result,
coalesce(cast((
julianday(DateTime) - julianday(
lag(DateTime) over (order by DateTime)
)
) * 24 >= 1 as integer), 1) as firstInGroup
from Hands
), step as (
select DateTime, Result,
sum(firstInGroup) over (
order by DateTime rows
between unbounded preceding and current row) as grp
from base
)
select min(DateTime) DateTime,
count(*) Count,
sum(Result) Result
from step
group by grp;
DB-fiddle

How to check for continuity minding possible gaps in dates

I have a big data frame with dates and i need to check for the first date in a continuous way, as follows:
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12 --> Gap (required date)
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01 --> Not Gap (overlapping)
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20 --> No actual Gap(required date)
As shown, not all the dates have overlapping and i need to return by ID (not ID_2) the date when the first gap (going backwards in time) appears. I've tried using for but it's extremely slow (dataframe has 150k rows). I've been messing around with dplyr and mutate as follows:
df <- df%>%
group_by(ID)%>%
mutate(END_lead = lead(END))
df$FLAG <- df$BEG - days(1) == df$END_lead
df <- df%>%
group_by(ID)%>%
filter(cumsum(cumsum(FLAG == FALSE))<=1)
But this set of instructions stops at the first overlapping, filtering the wrong date. I've tried anything i could think of, ordering in decreasing or ascending order, and using min and max but could not figure out a solution.
The actual result wanted would be:
ID ID_2 END BEG
1 55 2015-12-31 2015-11-12
2 19 2008-07-26 1980-10-20
Is there a way of doing this using dplyr,tidyr and lubridate?
A possible solution using dplyr:
library(dplyr)
df %>%
mutate_at(vars(END, BEG), funs(as.Date)) %>%
group_by(ID) %>%
slice(which.max(BEG > ( lead(END) + 1 ) | is.na(BEG > ( lead(END) + 1 ))))
With your last data, it gives:
# A tibble: 2 x 4
# Groups: ID [2]
ID ID_2 END BEG
<int> <int> <date> <date>
1 1 55 2015-12-31 2015-11-12
2 2 19 2005-12-31 1980-10-20
What the solution does is basically:
Changes the dates to Date format (no need for lubridate);
Groups by ID;
Selects the highest row that satisfies your criteria, i.e. the highest row which is either a gap (TRUE), or if there is no gap it is the first row (meaning it has a missing value when checking for a gap, this is why is.na(BEG > ( lead(END) + 1 ))).
I would use xts package, first creating xts objects for each ID you have, than use first() and last() function on each objects.
https://www.datacamp.com/community/blog/r-xts-cheat-sheet

Having trouble in a PL/SQL program

I have a table(pay_period) as following
pay_period
period_id list_id start_date end_date price
1 100 2017-01-01 2017-08-31 100
2 100 2017-09-01 2017-12-31 110
3 101 2017-01-01 2017-08-31 75
Now I have list_id, checkin_date, checkout_date
list_id 100
checkin_date 2017-08-25
checkout_date 2017-09-10
I need to calculate the price of a list for the period from checkin date to checkout date.
therefore the calculation is supposed to be
7 * 100 + 10 * 110
I am thinking to do it with a for loop, if there is any other better way to do it, can you please suggest?
You have to see if the checkin_date and checkout_date are into the same period_id.
1.1 If yes, you multiply the price with the nunmber of days.
1.2 If no, you have count the days between checkin_day untill the end of your period 1 and multiply with the corresponding price, then do the same with checkout_day and beginning of next period.
Note: i guess it might happen to have more than 2 prices per list_id. for example:
period_id list_id start_date end_date price
1 100 2017-01-01 2017-04-30 100
2 100 2017-05-01 2017-09-30 110
3 100 2017-10-01 2017-12-31 120
4 101 2017-01-01 2017-08-31 75
and the calculation period to be:
list_id 100
checkin_date 2017-03-01
checkout_date 2017-11-10
In this case, yes, the solution would be to have a CURSOR where to keep the prices for list_id and periods; loop through it and compare the checkin_date and checkout_date with each record.
Best,
Mikcutu.
You can do the following for a much cleaner code. Although it is purely sql, I am using a function to make it code better to understand.
Create a generic function which gets you the number of overlapping days in 2 different date range.
CREATE OR REPLACE FUNCTION fn_count_range
( p_start_date1 IN DATE,
p_end_date1 IN DATE,
p_start_date2 IN DATE,
p_end_date2 IN DATE ) RETURN NUMBER AS
v_days NUMBER;
BEGIN
IF p_end_date1 < p_start_date1 OR p_end_date2 < p_start_date2 THEN
RETURN 0;
END IF;
SELECT COUNT(*) INTO v_days
FROM (
(SELECT p_start_date1 + LEVEL - 1
FROM dual CONNECT BY LEVEL <= p_end_date1 - p_start_date1 + 1 ) INTERSECT
(SELECT p_start_date2 + LEVEL - 1
FROM dual CONNECT BY LEVEL <= p_end_date2 - p_start_date2 + 1 ) );
RETURN v_days;
END;
/
Now, your query to calculate the total price is simplified.
WITH lists ( list_id,
checkin_date,
checkout_date) AS
( SELECT 100,
TO_DATE('2017-08-25','YYYY-MM-DD'),
TO_DATE('2017-09-10','YYYY-MM-DD')
FROM dual) --Not required if you have a lists table.
SELECT l.list_id,
SUM(fn_count_range(start_date,end_date,checkin_date,checkout_date) * price) total_price
FROM pay_period p
JOIN lists l ON p.list_id = l.list_id
GROUP BY l.list_id;

select date ranges for multiple years in r

I have a data set containing data for about 4.5 years. I'm trying to create two different data frames from this, for what I will call holiday and non-holiday periods. There are multiple periods per year, and these periods will repeat over multiple years.
For example, I'd like to choose a time period between Thanksgiving and New Year's Day, as well as periods prior to Valentine's Day and Mother's Day for each year, and make this my holiday data frame. Everything else would be non-holiday.
I apologize if this has been asked before, I just can't find it. I found a similar question for SQL, but I'm trying to figure out how to do this in R.
I've tried filtering and selecting, to no avail.
wine.holiday <- wine.sub2 %>%
select(total, cdate) %>%
subset(cdate>=2011-11-25, cdate<=2011-12-31)
wine.holiday
Source: local data frame [27,628 x 3]
Groups: clubgroup_id.x [112]
clubgroup_id.x total cdate
(chr) (dbl) (date)
1 1 45 2011-10-04
2 1 45 2011-10-08
3 1 45 2011-10-09
4 1 45 2011-10-09
5 1 45 2011-10-11
6 1 45 2011-10-15
7 1 45 2011-10-24
8 1 90 2011-11-13
9 1 45 2011-11-18
10 1 45 2011-11-26
.. ... ... ...
Clearly something isn't right, because not only is it not limiting the date range, but it's including a column in the data frame that I'm not even selecting.
As mentioned in the comments, dplyr uses filter not subset. Just a simple change to the code you've got (therefore not a complete solution to your issue, but hopefully helps) should get the subset working.
wine.holiday <- wine.sub2 %>%
select(total, cdate)
wine.holiday <- subset(wine.holiday, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"))
wine.holiday
Or, to stick with dplyr piping:
wine.holiday <- wine.sub2 %>%
select(total, cdate) %>%
filter( cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31") )
wine.holiday
EDIT to add: If the dplyr select isn't working (it looks fine to me), you could try this:
wine.holiday <- subset( wine.sub2, select = c( total, cdate ) )
wine.holiday <- subset(wine.holiday, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"))
wine.holiday
You could, of course, combine those two lines into one. This makes it harder to read, but would probably improve the processing efficiency:
wine.holiday <- subset(wine.sub2, cdate>=as.Date("2011-11-25") & cdate<=as.Date("2011-12-31"), select=c(total,cdate) )
I figured out another method for this through looking through SO posts (took a while).
> library(dateTime)
> wine.holiday <- data.table(start = c(as.Date(USThanksgivingDay(2010:2020))),
+ end = as.Date(USNewYearsDay(2011:2021))-1)
> wine.holiday
start end
1: 2010-11-25 2010-12-31
2: 2011-11-24 2011-12-31
3: 2012-11-22 2012-12-31
4: 2013-11-28 2013-12-31
5: 2014-11-27 2014-12-31
6: 2015-11-26 2015-12-31
7: 2016-11-24 2016-12-31
8: 2017-11-23 2017-12-31
9: 2018-11-22 2018-12-31
10: 2019-11-28 2019-12-31
11: 2020-11-26 2020-12-31
I still need to figure out how to add other ranges (e.g. two weeks before Valentine's Day or Mother's Day) to this, and will update this answer if/when I figure it out.

Get count by column values

I have a dataframe that looks like this:
month create_time request_id weekday
1 4 2014-04-25 3647895 Friday
2 12 2013-12-06 2229374 Friday
3 4 2014-04-18 3568796 Friday
4 4 2014-04-18 3564933 Friday
5 3 2014-03-07 3081503 Friday
6 4 2014-04-18 3568889 Friday
And I'd like to get the count of request_ids by the weekday. How would I do this in R?
I've tried a lot of stuff based on ddply and aggregate with no luck.
Try using aggregate
> aggregate(request_id ~ weekday, FUN=length, dat=df)
weekday request_id
1 Friday 6
There are several valid ways to do it. I usually go with my trusty sqldf(). If the dataframe is named D, then
library(sqldf)
counts <- sqldf('select weekday, count(request_id) as nrequests from D group by weekday')
sqldf() can be wordy, but it is just so easy to remember and get right the first time!
or ... u could try:
count(df,"weekday")
or
library(plyr)
ddply(df,.(weekday),summarise,count=length(month))
Another option is to use a table and take the rowSums
> rowSums(with(dat, table(weekday, request_id)))
Friday
6

Resources