I have this table; date is a TEXT field and the only field.
date
2020-01-01
2010-03-01
2010-06-01
2011-01-01
2012-01-01
2013-01-01
2014-01-01
2015-01-01
I want the table to join itself on the date that is 1 year smaller than the target record, and I tried the query below. However, it doesn't seem to work when I only add a number after doing a strftime.
SELECT d0.*, d1.* from the_table d0 left join the_table d1 on strftime(d0."date",'%Y') = strftime(d1."date",'%Y') + 1;
What I want is the following result
date date
2020-01-01 None
2010-03-01 2011-01-01
2010-06-01 2011-01-01
2011-01-01 2012-01-01
...
But this is what it returned instead.
I have several questions regarding this issue:
Besides this example, which joins the table on a specific difference in years, how do I do this for months, days, etc.?
Does strftime use the index if there is one on that field? The date field is the primary key in the example. How do I know whether I'm using an index? If I'm not, how do I make the query use it?
The syntax for the function strftime() requires the format to be the first argument and the date to be the second.
Also, strftime() returns a string, so you must convert it to a number (implicitly by adding 0) if you want to compare it to a number:
SELECT d0.*, d1.*
FROM the_table d0 LEFT JOIN the_table d1
ON strftime('%Y', d0."date") + 1 = strftime('%Y', d1."date") + 0;
Results:
date date
2020-01-01 null
2010-03-01 2011-01-01
2010-06-01 2011-01-01
2011-01-01 2012-01-01
2012-01-01 2013-01-01
2013-01-01 2014-01-01
2014-01-01 2015-01-01
2015-01-01 null
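As a rough check of the two points above (argument order and text-versus-number comparison), here is a minimal sketch that replays the corrected join through Python's built-in sqlite3 module. The in-memory database, the scalar helper and the ORDER BY clause are assumptions added for the illustration; the table contents are the sample dates from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE the_table ("date" TEXT PRIMARY KEY)')
dates = ["2020-01-01", "2010-03-01", "2010-06-01", "2011-01-01",
         "2012-01-01", "2013-01-01", "2014-01-01", "2015-01-01"]
conn.executemany("INSERT INTO the_table VALUES (?)", [(d,) for d in dates])

def scalar(sql):
    """Run a query that yields a single value and return it (hypothetical helper)."""
    return conn.execute(sql).fetchone()[0]

# Arguments swapped: '%Y' is not a valid date string, so the result is NULL.
swapped = scalar("SELECT strftime('2011-01-01', '%Y')")
# strftime() returns TEXT, and TEXT '2011' does not compare equal to INTEGER 2011 ...
text_vs_int = scalar("SELECT strftime('%Y', '2011-01-01') = 2011")
# ... until adding 0 converts the text to a number.
number_vs_int = scalar("SELECT strftime('%Y', '2011-01-01') + 0 = 2011")

# The corrected join; ORDER BY is added only to make the output order stable.
pairs = conn.execute("""
    SELECT d0."date", d1."date"
    FROM the_table d0 LEFT JOIN the_table d1
      ON strftime('%Y', d0."date") + 1 = strftime('%Y', d1."date") + 0
    ORDER BY d0."date"
""").fetchall()

print(swapped, text_vs_int, number_vs_int)
for row in pairs:
    print(row)
```

Both 2010 rows pair with 2011-01-01 because only the year part is compared.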
You can apply the same code by changing the format to '%m' or '%d' to compare month or day respectively, if the year is not relevant.
But, if you want to join on the next day of each date you can do it with the function date():
SELECT d0.*, d1.*
FROM the_table d0 LEFT JOIN the_table d1
ON date(d0."date", '+1 day') = d1."date";
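The same date() modifiers cover months and years as well as days. The sketch below, run through Python's sqlite3 module, shows a few of them; the shifted helper is a hypothetical name used only here. The last call illustrates that a month shift landing on a non-existent day is normalised into the following month, which may matter when joining on month offsets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def shifted(day, modifier):
    """Apply one SQLite date modifier to an ISO date string (hypothetical helper)."""
    return conn.execute("SELECT date(?, ?)", (day, modifier)).fetchone()[0]

next_day = shifted("2010-03-01", "+1 day")
year_end = shifted("2010-12-31", "+1 day")     # rolls over into the next year
next_month = shifted("2010-03-01", "+1 month")
prev_year = shifted("2011-01-01", "-1 year")
overflow = shifted("2001-03-31", "+1 month")   # April 31 does not exist

print(next_day, year_end, next_month, prev_year, overflow)
```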
Also, strftime() and date() are functions and normally SQLite would not use any index with these functions.
SQLite supports indexes on expressions (also: SQLite Expression-based Index), but I don't think that this would help in your case.
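To see whether a query uses an index, you can prefix it with EXPLAIN QUERY PLAN. The sketch below does this from Python's sqlite3 module for the two joins discussed here; the plan helper is a hypothetical name, and the exact wording of the plan text differs between SQLite versions, so treat the output as indicative. In general, a line starting with SEARCH means an index lookup, and SCAN means every row is visited.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE the_table ("date" TEXT PRIMARY KEY)')
conn.executemany("INSERT INTO the_table VALUES (?)",
                 [("2010-03-01",), ("2011-01-01",), ("2012-01-01",)])

def plan(sql):
    """Return the detail text of each EXPLAIN QUERY PLAN row (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description

# d1."date" appears bare on one side, so the primary key index can be used.
date_plan = plan("""
    SELECT d0.*, d1.* FROM the_table d0 LEFT JOIN the_table d1
    ON date(d0."date", '+1 day') = d1."date"
""")

# d1."date" is wrapped in strftime(), so the index on the raw column does not apply.
strftime_plan = plan("""
    SELECT d0.*, d1.* FROM the_table d0 LEFT JOIN the_table d1
    ON strftime('%Y', d0."date") + 1 = strftime('%Y', d1."date") + 0
""")

print("\n".join(date_plan))
print("\n".join(strftime_plan))
```

The practical takeaway is to keep the indexed column bare on one side of the comparison and move the date arithmetic to the other side.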
Not an exact answer or correction to your current approach, but you could use the DATE() function here with an offset of 1 year:
SELECT d0.*, d1.*
FROM the_table d0
LEFT JOIN the_table d1
ON d0."date" = DATE(d1."date", '+1 year');
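Note that this variant matches on the exact date rather than on the year, and that the side carrying the '+1 year' modifier decides whether d1 is the earlier or the later row. The following sketch runs both directions against the question's sample dates using Python's sqlite3 module; the ORDER BY clauses and variable names are assumptions added for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE the_table ("date" TEXT PRIMARY KEY)')
dates = ["2020-01-01", "2010-03-01", "2010-06-01", "2011-01-01",
         "2012-01-01", "2013-01-01", "2014-01-01", "2015-01-01"]
conn.executemany("INSERT INTO the_table VALUES (?)", [(d,) for d in dates])

# As written in the answer: d1 is the row dated exactly one year BEFORE d0.
earlier = conn.execute("""
    SELECT d0."date", d1."date"
    FROM the_table d0 LEFT JOIN the_table d1
      ON d0."date" = date(d1."date", '+1 year')
    ORDER BY d0."date"
""").fetchall()

# Modifier moved to d0: d1 is the row dated exactly one year AFTER d0.
later = conn.execute("""
    SELECT d0."date", d1."date"
    FROM the_table d0 LEFT JOIN the_table d1
      ON date(d0."date", '+1 year') = d1."date"
    ORDER BY d0."date"
""").fetchall()

for row in earlier:
    print("earlier:", row)
for row in later:
    print("later:  ", row)
```

Unlike the year-only comparison, 2010-03-01 and 2010-06-01 find no partner here because 2011-03-01 and 2011-06-01 are not in the table.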
I am trying to extract and join 2 data frames based on some date parts, but it's not working. The data frames are as follows:
startdf
startperiod
2015-10-01
2016-10-01
2017-10-01
2018-10-01
enddf
endperiod
2016-03-31
2017-03-31
2018-03-31
Both startperiod and endperiod are of 'Date' data type
This is final output I desire :-
startperiod, endperiod
2015-10-01 2016-03-31
2016-10-01 2017-03-31
2017-10-01 2018-03-31
2018-10-01 Null
The equivalent SQL would be something like this :-
Select startperiod, endperiod
From startdf a lef join enddf b
On year(b.endperiod) = (year(a.startperiod) + 1)
Is there a way to do this in R? I believe I need to use the sqldf and RH2 libraries, but I couldn't get it going no matter what I did.
Simplistically, this should work but doesn't!
sqldf("Select * from startperioddf a where year(startperiod) = 2016")
1) RH2 Assuming
the data shown in reproducible form in the Note below. In particular, note that startperiod and endperiod are assumed to be of Date class.
typos in the question are fixed
use of h2 database backend instead of the default sqlite
then your code works:
library(sqldf)
library(RH2)
sql <- "Select startperiod, endperiod
From startdf a left join enddf b
On year(b.endperiod) = (year(a.startperiod) + 1)"
sqldf(sql)
giving:
startperiod endperiod
1 2015-10-01 2016-03-31
2 2016-10-01 2017-03-31
3 2017-10-01 2018-03-31
4 2018-10-01 <NA>
Also
sqldf("Select * from startdf a where year(startperiod) = 2016")
giving:
startperiod
1 2016-10-01
Be sure to read the material on the sqldf github site: https://github.com/ggrothendieck/sqldf
2) sqlite If you want to use the default sqlite backend, be sure that RH2 is NOT loaded (otherwise sqldf will assume you want to use it). Note that Date class variables are uploaded to sqlite as numbers representing the number of days since the Unix epoch, since sqlite has no Date type. We therefore need to convert days since the epoch to years, which can be done using strftime as shown:
sql2 <- "Select startperiod, endperiod
From startdf a left join enddf b
On strftime('%Y', b.endperiod * 3600 * 24, 'unixepoch') + 0 =
strftime('%Y', a.startperiod * 3600 * 24, 'unixepoch') + 1"
sqldf(sql2)
sqldf("Select * from startdf a
where strftime('%Y', a.startperiod * 3600 * 24, 'unixepoch') = '2016'")
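The day-count conversion can be tried outside R as well. The sketch below imitates what the answer describes (Date values stored as days since 1970-01-01) with Python's sqlite3 module; the days helper and the REAL column type are assumptions made for the illustration, not a statement about how sqldf creates its tables.

```python
import sqlite3
from datetime import date

EPOCH = date(1970, 1, 1)

def days(iso):
    """Days since the Unix epoch, i.e. the number behind an R Date (hypothetical helper)."""
    return (date.fromisoformat(iso) - EPOCH).days

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE startdf (startperiod REAL)")
conn.execute("CREATE TABLE enddf (endperiod REAL)")
conn.executemany("INSERT INTO startdf VALUES (?)",
                 [(days(d),) for d in ["2015-10-01", "2016-10-01",
                                       "2017-10-01", "2018-10-01"]])
conn.executemany("INSERT INTO enddf VALUES (?)",
                 [(days(d),) for d in ["2016-03-31", "2017-03-31", "2018-03-31"]])

# days * 3600 * 24 gives seconds, which the 'unixepoch' modifier understands.
rows = conn.execute("""
    SELECT date(a.startperiod * 3600 * 24, 'unixepoch'),
           date(b.endperiod * 3600 * 24, 'unixepoch')
    FROM startdf a LEFT JOIN enddf b
      ON strftime('%Y', b.endperiod * 3600 * 24, 'unixepoch') + 0 =
         strftime('%Y', a.startperiod * 3600 * 24, 'unixepoch') + 1
    ORDER BY a.startperiod
""").fetchall()

for row in rows:
    print(row)
```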
Note
Lines1 <- "
startperiod
2015-10-01
2016-10-01
2017-10-01
2018-10-01"
Lines2 <- "
endperiod
2016-03-31
2017-03-31
2018-03-31"
startdf <- read.table(text = Lines1, header = TRUE, colClasses = "Date")
enddf <- read.table(text = Lines2, header = TRUE, colClasses = "Date")
The sqldf package in R uses the SQLite database engine by default, and SQLite has no year function, so you cannot use it in your query to extract the year part from the date. Use SQLite's strftime function instead. Because sqldf uploads Date columns to SQLite as the number of days since 1970-01-01, convert them to seconds and pass the 'unixepoch' modifier:
sqldf("Select * from startdf where strftime('%Y', startperiod * 3600 * 24, 'unixepoch') = '2016'")
If startperiod were stored as a character column in 'YYYY-MM-DD' form, strftime('%Y', startperiod) alone would be enough. The year function is defined under MySQL, so alternatively you can install the RMySQL package and use the drv = 'MySQL' argument to specify the database engine that you want sqldf to use.
I have a Hive table with millions of records.
The input is of the following type:
Input:
rowid |starttime |endtime |line |status
--- 1 |2007-07-19 00:05:00 |2007-07-19 00:23:00 |l1 |s1
--- 2 |2007-07-20 00:00:10 |2007-07-20 00:22:00 |l1 |s2
--- 3 |2007-07-19 00:00:00 |2007-07-19 00:11:00 |l2 |s2
What I want to do is first order the table by starttime, grouped by line.
Then find the difference between the endtime of one row and the starttime of the next consecutive row. If the difference is more than 5 mins, then add a new row in between, in a new table, with status misstime.
For input rows 1 & 2 the time difference is 1 hour 10 mins, so first I will create a row for the 19th that completes that day with missing time, and then add one more row for the 20th, as below.
output:
rowid |starttime |endtime |line |status
--- 1 |2007-07-19 00:05:00 |2007-07-19 00:23:00 |l1 |s1
--- 2 |2007-07-19 00:23:01 |2007-07-19 00:00:00 |l1 |misstime
--- 3 |2007-07-20 00:00:01 |2007-07-20 00:00:09 |l1 |misstime
--- 4 |2007-07-20 00:00:10 |2007-07-20 00:22:00 |l1 |s2
--- 3 |2007-07-19 00:00:00 |2007-07-19 00:11:00 |l2 |s2
Can anyone help me achieve this directly in hue - hive ?
Unix script will also do.
Thanks in advance.
The solution template is:
1) Use the LAG() function to get the previous row's starttime or endtime.
2) For each row, calculate the difference between the current and the previous time.
3) Filter rows with a difference of more than 5 minutes.
4) Transform the dataset into the required output.
Example:
insert into yourtable
select
s.rowid,
s.starttime,
s.endtime,
s.line,
'misstime' as status -- calculate your status here, etc, etc
from
(
select rowid, starttime, endtime, line,
-- endtime of the previous row on the same line, ordered by starttime
lag(endtime) over(partition by line order by starttime) prev_endtime
from yourtable ) s
where (unix_timestamp(s.starttime) - unix_timestamp(s.prev_endtime))/60 > 5 -- gap > 5 min
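The LAG() template can be tried on the question's three sample rows without a Hive cluster. The sketch below swaps in SQLite (via Python's sqlite3 module, which needs SQLite 3.25 or later for window functions) and strftime('%s', ...) in place of Hive's unix_timestamp(); the column name row_id and the gap_start/gap_end aliases are assumptions made for the illustration. It only finds the gaps; splitting a gap at midnight, as in the desired output, would be a further step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE yourtable
                (row_id INTEGER, starttime TEXT, endtime TEXT,
                 line TEXT, status TEXT)""")
conn.executemany("INSERT INTO yourtable VALUES (?, ?, ?, ?, ?)", [
    (1, "2007-07-19 00:05:00", "2007-07-19 00:23:00", "l1", "s1"),
    (2, "2007-07-20 00:00:10", "2007-07-20 00:22:00", "l1", "s2"),
    (3, "2007-07-19 00:00:00", "2007-07-19 00:11:00", "l2", "s2"),
])

# For each line, compare a row's starttime with the previous row's endtime
# and keep only the gaps longer than 5 minutes (300 seconds).
gaps = conn.execute("""
    SELECT line, prev_endtime AS gap_start, starttime AS gap_end
    FROM (
        SELECT line, starttime,
               LAG(endtime) OVER (PARTITION BY line ORDER BY starttime)
                   AS prev_endtime
        FROM yourtable
    ) s
    WHERE strftime('%s', starttime) - strftime('%s', prev_endtime) > 5 * 60
""").fetchall()

for gap in gaps:
    print(gap)
```

Rows with no predecessor on their line have a NULL prev_endtime, so the comparison is not true and they drop out of the result.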
I am trying to add to a date using sqldf. I know it should be simple, but I can't figure out what is wrong with my date format. Using:
sqldf("select date(model_date, '+1 day') from lapse_test")
gives answers like '-4666-01-23'
The model_date values are of Date class and look like 2015-01-01
I previously made them from a character string ('12/1/2015') using
lapse_test$model_date <- as.Date(lapse_test$date1,format = "%m/%d/%Y") or
lapse_test$model_date <- as.POSIXct(lapse_test$date1,format = "%m/%d/%Y")
I'm guessing this is the problem? Any ideas?
Passing a character variable to the date() function seems to work:
df <- data.frame(a=as.Date("2010-10-01"))
df$b <- as.character(df$a)
sqldf("select date(a) from df")
# date(a)
# 1 -4672-08-24
sqldf("select date(b) from df")
# date(b)
# 1 2010-10-01
sqldf("select date(b, '+1 day') from df")
# date(b, '+1 day')
# 1 2010-10-02
Note that you can do (some) arithmetic on Date objects in R directly, without needing SQL:
df$a <- df$a + 1
df
# a b
# 1 2010-10-02 2010-10-01
SQLite date functions interpret a bare number as a Julian day number, i.e. days since noon on Nov 24, 4714 BC. R stores the example date 2015-12-01 as the integer 16770 (days since 1970-01-01), so SQLite reads it as an ancient date somewhere in 4667 BC.
You can figure out that the difference between the R origin of 1970-01-01 and the SQLite origin is 2440588 days. Which means, you can take this constant into account if you want:
test <- data.frame(model_date=as.Date("12/1/2015",format="%m/%d/%Y"))
sqldf("select date(model_date + 2440588, '+1 day') as select_date from test")
# select_date
#1 2015-12-02
@HongOoi's answer is probably better, but I thought it might be interesting to know the underlying workings.
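The offset can be checked directly against SQLite. This is a small sketch using Python's sqlite3 module; the scalar helper and variable names are hypothetical. It also shows the 'unixepoch' route (days times 86400 seconds) as an alternative to the Julian day constant.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")

def scalar(sql, *params):
    """Run a query that yields a single value and return it (hypothetical helper)."""
    return conn.execute(sql, params).fetchone()[0]

# The number R keeps behind as.Date("2015-12-01"): days since 1970-01-01.
r_days = (date(2015, 12, 1) - date(1970, 1, 1)).days

# Julian day number of the Unix epoch; 2440588 is this value rounded up to noon.
epoch_jd = scalar("SELECT julianday('1970-01-01')")

# Passed as-is, the day count is read as a Julian day number (an ancient date).
as_julian = scalar("SELECT date(?)", r_days)

# Shifting by the constant, or converting to seconds, recovers the real date.
with_offset = scalar("SELECT date(? + 2440588, '+1 day')", r_days)
with_unixepoch = scalar("SELECT date(? * 86400, 'unixepoch', '+1 day')", r_days)

print(r_days, epoch_jd, as_julian, with_offset, with_unixepoch)
```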
I need to get the full date with timestamp from a column that has the format 'yyyymm'. For example, I need to get 2007-01-01 00:00:00.000 from 200701.
My Column 'A' consists of:
200701
200702
200703
...
...
...
I need to calculate another column 'B' showing:
2007-01-01 00:00:00.000
2007-02-01 00:00:00.000
2007-03-01 00:00:00.000
2007-04-01 00:00:00.000
Column B has to be a calculation based on Column A or Sys.Calendar. Using platform Teradata 14.
Please let me know the answer. Thank you in advance for your answers.
If the datatype is a string:
cast(col as timestamp(3) format 'yyyymm')
If it's numeric:
cast(cast(col * 100 - 19000000 + 1 as date) as timestamp(3))
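The arithmetic in the numeric version relies on Teradata's internal integer form of a DATE, (year - 1900) * 10000 + month * 100 + day. The Python sketch below only walks through that arithmetic for illustration; the function names are hypothetical, and the decoding assumes years from 1900 onwards.

```python
from datetime import datetime

def yyyymm_to_teradata_int(yyyymm: int) -> int:
    """Same arithmetic as the SQL: append day 01, then subtract the 1900 offset."""
    return yyyymm * 100 - 19000000 + 1

def teradata_int_to_timestamp(value: int) -> datetime:
    """Decode (year - 1900) * 10000 + month * 100 + day; assumes year >= 1900."""
    year = value // 10000 + 1900
    month = value // 100 % 100
    day = value % 100
    return datetime(year, month, day)

internal = yyyymm_to_teradata_int(200701)
stamp = teradata_int_to_timestamp(internal)
formatted = stamp.isoformat(sep=" ", timespec="milliseconds")

print(internal, formatted)
```

For 200701 the steps are 20070100, then 1070100 after subtracting 19000000, then 1070101 after adding the day, which decodes to year 107 + 1900, month 01, day 01.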
Let me begin by saying this question pertains to R (stat programming language), but I'm open to straightforward suggestions for other environments.
The goal is to merge outcomes from dataframe (df) A to sub-elements in df B. This is a one to many relationship but, here's the twist, once the records are matched by keys they also have to match over a specific frame of time given by a start time and duration.
For example, a few records in df A:
OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal
And from df B:
OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00
The desired outcome from the merge would be:
OBS ID Time Outcome
1 01 10:12:10 Normal
3 02 10:12:45 Weird
Desired result: dataframe B with outcomes merged in from A. Notice observations 2 and 4 were dropped because although they matched IDs on records in A they did not fall within any of the time intervals given.
Question
Is it possible to perform this sort of operation in R and how would you get started? If not, can you suggest an alternative tool?
Set up data
First set up the input data frames. We create two versions of the data frames: A and B just use character columns for the times and At and Bt use the chron package "times" class for the times (which has the advantage over "character" class that one can add and subtract them):
LinesA <- "OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal"
LinesB <- "OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00"
A <- At <- read.table(textConnection(LinesA), header = TRUE,
colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE,
colClasses = c("numeric", rep("character", 2)))
# in At and Bt convert times columns to "times" class
library(chron)
At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
sqldf with times class
Now we can perform the calculation using the sqldf package. We use method="raw" (which does not assign classes to the output), so we must assign the "times" class to the output "Time" column ourselves:
library(sqldf)
out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration",
method = "raw")
out$Time <- times(as.numeric(out$Time))
The result is:
> out
OBS ID Time Outcome
1 1 01 10:12:10 Normal
2 3 02 10:12:45 Weird
With the development version of sqldf this can be done without using method="raw" and the "Time" column will automatically be set to "times" class by the sqldf class assignment heuristic:
library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration")
sqldf with character class
It's actually possible to avoid the "times" class by performing all time calculations in sqlite on character strings, using sqlite's strftime function. The SQL statement is unfortunately a bit more involved:
sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
where strftime('%s', Time) - strftime('%s', StartTime)
between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
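The character-string query can also be run against plain SQLite to confirm it gives the desired two rows. Below is a minimal sketch with Python's sqlite3 module, loading the question's A and B data as text columns; the table definitions and the ORDER BY are assumptions added for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (OBS INTEGER, ID TEXT, StartTime TEXT, "
             "Duration TEXT, Outcome TEXT)")
conn.execute("CREATE TABLE B (OBS INTEGER, ID TEXT, Time TEXT)")
conn.executemany("INSERT INTO A VALUES (?, ?, ?, ?, ?)", [
    (1, "01", "10:12:06", "00:00:10", "Normal"),
    (2, "02", "10:12:30", "00:00:30", "Weird"),
    (3, "01", "10:15:12", "00:01:15", "Normal"),
    (4, "02", "10:45:00", "00:00:02", "Normal"),
])
conn.executemany("INSERT INTO B VALUES (?, ?, ?)", [
    (1, "01", "10:12:10"),
    (2, "01", "10:12:17"),
    (3, "02", "10:12:45"),
    (4, "01", "10:13:00"),
])

# strftime('%s', 'HH:MM:SS') gives seconds for that time of day on a fixed date,
# so differences between two such values are elapsed seconds.
matches = conn.execute("""
    SELECT B.OBS, ID, Time, Outcome FROM A JOIN B USING (ID)
    WHERE strftime('%s', Time) - strftime('%s', StartTime)
          BETWEEN 0 AND strftime('%s', Duration) - strftime('%s', '00:00:00')
    ORDER BY B.OBS
""").fetchall()

for row in matches:
    print(row)
```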
EDIT:
A series of edits which fixed grammar, added additional approaches and fixed/improved the read.table statements.
EDIT:
Simplified/improved final sqldf statement.
Here is an example:
# first, merge by ID
z <- merge(A[, -1], B, by = "ID")
# convert string to POSIX time
z <- transform(z,
s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) -
as.numeric(strptime("00:00:00", "%H:%M:%S")),
tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))
# subset by time range
subset(z, s_t < tim & tim < s_t + dur)
the output:
ID StartTime Duration Outcome OBS Time s_t dur tim
1 1 10:12:06 00:00:10 Normal 1 10:12:10 1321665126 10 1321665130
2 1 10:12:06 00:00:10 Normal 2 10:12:15 1321665126 10 1321665135
7 2 10:12:30 00:00:30 Weird 3 10:12:45 1321665150 30 1321665165
OBS #2 looks to be in the range. Does it make sense?
Merge the two data.frames together with merge(). Then subset() the resulting data.frame with the condition time >= startTime & time <= startTime + Duration or whatever rules make sense to you.
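The merge-then-subset idea is not specific to R. As a rough illustration, the sketch below does the same two steps in plain Python on the question's data, using the inclusive bounds from the condition above; the tuple layout and helper names are assumptions made for the example.

```python
from datetime import datetime, timedelta

def parse_time(text):
    """Parse 'HH:MM:SS' into a datetime on a fixed dummy date."""
    return datetime.strptime(text, "%H:%M:%S")

def parse_duration(text):
    """Parse 'HH:MM:SS' into a timedelta."""
    hours, minutes, seconds = map(int, text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

# (OBS, ID, StartTime, Duration, Outcome)
A = [(1, "01", "10:12:06", "00:00:10", "Normal"),
     (2, "02", "10:12:30", "00:00:30", "Weird"),
     (3, "01", "10:15:12", "00:01:15", "Normal"),
     (4, "02", "10:45:00", "00:00:02", "Normal")]
# (OBS, ID, Time)
B = [(1, "01", "10:12:10"),
     (2, "01", "10:12:17"),
     (3, "02", "10:12:45"),
     (4, "01", "10:13:00")]

# Step 1: merge on ID (every B row paired with every A row sharing its ID).
merged = [(b, a) for b in B for a in A if a[1] == b[1]]

# Step 2: keep pairs where Time lies within [StartTime, StartTime + Duration].
matches = [(b[0], b[1], b[2], a[4])
           for b, a in merged
           if parse_time(a[2]) <= parse_time(b[2])
           <= parse_time(a[2]) + parse_duration(a[3])]

for row in matches:
    print(row)
```

With the question's data, OBS 2 (10:12:17) falls one second past the end of the 10:12:06 to 10:12:16 interval, so it is dropped.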