Combine different date fields to a single Timestamp column in pyspark - datetime

If the dataframe is like below:
year | month | day | weekday | hour
2017 | January | 1 | Sunday | 0
2018 | September | 22 | Saturday | 11
Then I need to add another column with values of type timestamp like the following:
2017-01-01 00:00:00
2018-09-22 11:00:00
I tried unix_timestamp after concatenating the fields into a string, but it is not working.

You can concatenate the elements into a string and use to_timestamp (or from_unixtime(unix_timestamp())) with the appropriate datetime pattern.
Here's an example:
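from pyspark.sql import functions as func

# data_sdf is the dataframe from the question
data_sdf. \
    withColumn('ts',
               func.to_timestamp(func.concat_ws(' ', 'year', 'month', 'day', 'hour'),
                                 'yyyy MMMM d H'  # year, full month name, day, hour (0-23)
                                 )
               ). \
    show(truncate=False)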
# +----+---------+---+--------+----+-------------------+
# |year|month    |day|weekday |hour|ts                 |
# +----+---------+---+--------+----+-------------------+
# |2017|January  |1  |Sunday  |0   |2017-01-01 00:00:00|
# |2018|September|22 |Saturday|11  |2018-09-22 11:00:00|
# +----+---------+---+--------+----+-------------------+
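The concatenate-then-parse idea can also be checked outside Spark. The following is a minimal sketch using only Python's standard library, not the PySpark API: the `rows` list and the `combine_fields` helper are hypothetical names made up for illustration, and `%Y %B %d %H` is the strptime counterpart of the `yyyy MMMM d H` pattern above. Parsing full month names with `%B` assumes an English (or default C) locale.

```python
from datetime import datetime

# Hypothetical rows mirroring the question's columns:
# (year, month, day, weekday, hour)
rows = [
    (2017, "January", 1, "Sunday", 0),
    (2018, "September", 22, "Saturday", 11),
]

def combine_fields(year, month, day, hour):
    """Join the fields with spaces, then parse them with a matching pattern."""
    text = f"{year} {month} {day} {hour}"
    # %B = full month name; assumes an English / default C locale
    return datetime.strptime(text, "%Y %B %d %H")

# The weekday column is not needed to build the timestamp
timestamps = [combine_fields(y, m, d, h) for (y, m, d, _weekday, h) in rows]
for ts in timestamps:
    print(ts)
```

The weekday is deliberately left out of the concatenated string, just as in the PySpark answer, because year, month, day and hour already determine the timestamp.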

Related

Change Date format - Convert to Date Class [duplicate]

I have a column of dates that were read in as character. I want to produce a Date class with my desired format (say, US-style, 08/28/2020).
But all the solutions I found either change the format and produce a character class, or produce a Date class with the standard format (2020-08-28).
This is a reproducible example:
library(lubridate)  # provides parse_date_time()
df1 <- data.frame(date=c("08/27/2020", "08/28/2020", "08/29/2020"), cases=c(5,6,7))
class(df1$date)
df1$date1<- format(as.Date(df1$date, format = "%m/%d/%Y"), "%m/%d/%Y")
class(df1$date1)
df1$date2<-as.Date(parse_date_time(df1$date,"%m/%d/%Y"))
class(df1$date2)
df1$date3<- as.Date(df1$date, format = "%m/%d/%Y")
class(df1$date3)
df1
As you can see, date1 has my desired format but is not of class Date. In addition, date2 and date3 are of class Date but are shown in the undesired format.
  | date       | cases | date1      | date2      | date3
1 | 08/27/2020 | 5     | 08/27/2020 | 2020-08-27 | 2020-08-27
2 | 08/28/2020 | 6     | 08/28/2020 | 2020-08-28 | 2020-08-28
3 | 08/29/2020 | 7     | 08/29/2020 | 2020-08-29 | 2020-08-29
Where am I going wrong?
An object of class Date is always shown like "2020-08-27" in R. That's R's standard Date format. To reformat it into something different you can use strftime. It takes a Date object and outputs a character object in your desired format, e.g.
df1$date2
[1] "2020-08-27" "2020-08-28" "2020-08-29"
class(df1$date2)
[1] "Date"
strftime(df1$date2, format="%m/%d/%Y")
[1] "08/27/2020" "08/28/2020" "08/29/2020"
class(strftime(df1$date2, format="%m/%d/%Y"))
[1] "character"
When dealing with dates and times, the lubridate package is really handy: https://lubridate.tidyverse.org/.
In this case we could use the mdy function (month, day, year) for date and date1.
library(lubridate)
library(dplyr)
df1 %>%
mutate(across(c(date, date1), mdy))
date cases date1 date3
<date> <dbl> <date> <date>
1 2020-08-27 5 2020-08-27 2020-08-27
2 2020-08-28 6 2020-08-28 2020-08-28
3 2020-08-29 7 2020-08-29 2020-08-29

how to subset data between fixed time on successive days, for several months of data

I have data of the following form:
DateTime | Var1
11/01/2016 06:01 | 0
11/01/2016 06:02 | 0.70
...
...
11/01/2016 23:59 | 35.08
11/02/2016 00:01 | 33.29
...
11/02/2016 06:00 | 24.62
...
11/30/2016 23:59 | 42.08
12/01/2016 00:01 | 39.79
....
I have ~5 months of data. I have to subset the data from 6:00am of one day to just before 6:00am of the next day. I can use the following code to subset the data once I have the dates in hand, but how can I automatically obtain all the successive dates from the input data?
Date1 <- as.integer(as.POSIXct(Date1))
Date2 <- as.integer(as.POSIXct(Date2))
subset <- subset(data, as.integer(as.POSIXct(data$txtime)) >= Date1 & as.integer(as.POSIXct(data$txtime)) < Date2)
Right now, I can use the following code to obtain successive dates within a month, but this won't work for the last day of the month, where part of the data to be subsetted is on the first day of the next month. So I can't do it automatically for the period from 6:00am on 30th November to 5:59am on 1st December. Also, the code is not fully automated, as the number of days (used in the loop) varies across months.
for (dateofmonth in c(1:29)) {
Date1 <- paste("2016-11-", dateofmonth, ' 06:00:00', sep = '')
Date2 <- paste("2016-11-", (dateofmonth+1), ' 06:00:00', sep = '')
}
There is possibly an easier way to do this, but I can't figure it out. Please suggest.
Try this:
datelist <- split(data, as.Date(as.POSIXct(data$txtime)-21600))
This shifts your times 6 hours (21600 seconds) backwards and then splits your data by date, so that each sub-dataframe contains the times from 6:00 am on that date to 5:59 am on the next day.
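The shift-then-split trick is not specific to R. As a rough illustration, here is the same idea in plain Python; the `readings` sample and the variable names are assumptions made up for the sketch, and timestamps are treated as naive local times with no time zone handling.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sample readings in the question's "MM/DD/YYYY HH:MM" layout
readings = [
    ("11/01/2016 06:01", 0.0),
    ("11/01/2016 23:59", 35.08),
    ("11/02/2016 00:01", 33.29),
    ("11/02/2016 06:00", 24.62),
    ("11/30/2016 23:59", 42.08),
    ("12/01/2016 00:01", 39.79),
]

SHIFT = timedelta(hours=6)  # 21600 seconds, as in the R one-liner

groups = defaultdict(list)
for stamp, value in readings:
    moment = datetime.strptime(stamp, "%m/%d/%Y %H:%M")
    # Shifting back 6 hours maps 06:00 .. 05:59 (next day) onto one calendar date
    groups[(moment - SHIFT).date()].append((stamp, value))

for day, items in sorted(groups.items()):
    print(day, [stamp for stamp, _ in items])
```

Because the grouping key is a real calendar date, the month boundary needs no special case: the reading at 12/01/2016 00:01 lands in the 2016-11-30 group.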

How to find all dates from the present and previous month?

I have a table
EmployeeSalary:
Date | Salary
01.12.2016 | 2000
01.02.2016 | 3000
03.02.2016 | 5000
01.03.2017 | 1000
30.01.2017 | 5000
10.03.2017 | 1300
When the system date is 13.03.2017, how do I get the dates of the present month and the past month (i.e., from February 1 to the system date)?
My code is :
start= format(Sys.Date() - 30, '%Y-%m-01')
end=Sys.time()
while (start<end)
{
print(EmployeeSalary)
EmployeeSalary$"Date" = EmployeeSalary$"Date"+1
}
Error which I get:
Error: non-numeric argument to binary Operator
Expected Output is :
EmployeeSalary:
Date | Salary
01.02.2016 | 3000
03.02.2016 | 5000
01.03.2017 | 1000
10.03.2017 | 1300
Here is one way:
R> dates <- seq(Sys.Date(), length=62, by=-1)
R> mon <- function(d) as.integer(format(d, "%m")) %% 12
R> dates[mon(dates) >= mon(Sys.Date())-1]
[1] "2017-03-13" "2017-03-12" "2017-03-11" "2017-03-10" "2017-03-09"
[6] "2017-03-08" "2017-03-07" "2017-03-06" "2017-03-05" "2017-03-04"
[11] "2017-03-03" "2017-03-02" "2017-03-01" "2017-02-28" "2017-02-27"
[16] "2017-02-26" "2017-02-25" "2017-02-24" "2017-02-23" "2017-02-22"
[21] "2017-02-21" "2017-02-20" "2017-02-19" "2017-02-18" "2017-02-17"
[26] "2017-02-16" "2017-02-15" "2017-02-14" "2017-02-13" "2017-02-12"
[31] "2017-02-11" "2017-02-10" "2017-02-09" "2017-02-08" "2017-02-07"
[36] "2017-02-06" "2017-02-05" "2017-02-04" "2017-02-03" "2017-02-02"
[41] "2017-02-01"
R>
We create a sequence of dates going backwards. We then create a helper function to get the (integer-valued) month for a date.
Given those two, we index the date sequence down to the ones matching your criteria: from this month or the preceding month.
And by taking 'month modulo 12' we also catch the case of January being compared to December.
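An alternative that sidesteps month arithmetic is to compute the first day of the previous month explicitly and keep every row between that date and today. The sketch below is in Python rather than R and is only an illustration: `previous_month_start` and `rows_in_window` are hypothetical helper names, and the rows are copied from the question as typed.

```python
from datetime import date, datetime

def previous_month_start(today):
    """First day of the month before `today`; January rolls back to December."""
    if today.month == 1:
        return date(today.year - 1, 12, 1)
    return date(today.year, today.month - 1, 1)

# Rows from the question, in its "DD.MM.YYYY" layout
salaries = [
    ("01.12.2016", 2000),
    ("01.02.2016", 3000),
    ("03.02.2016", 5000),
    ("01.03.2017", 1000),
    ("30.01.2017", 5000),
    ("10.03.2017", 1300),
]

def rows_in_window(rows, today):
    """Keep rows dated from the start of the previous month up to `today`."""
    start = previous_month_start(today)
    kept = []
    for text, salary in rows:
        d = datetime.strptime(text, "%d.%m.%Y").date()
        if start <= d <= today:
            kept.append((text, salary))
    return kept

selected = rows_in_window(salaries, date(2017, 3, 13))
print(selected)
```

With the data exactly as typed, the two February rows carry the year 2016 and therefore fall outside the window; the question's expected output suggests they may have been meant as 2017.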

how can i extract month and date and year from data column in R

I have a column of Date type in which the dates are in 4/1/2007 format. I want to extract the month value and the day value from that column into separate columns in R. My dates run from 01/01/2012 to 01/01/2015. Please help.
If your variable is of Date type (as you say in the post), simply use the following to extract the month:
month_var = format(df$datecolumn, "%m") # this will give output like "09"
month_var = format(df$datecolumn, "%b") # this will give output like "Sep"
month_var = format(df$datecolumn, "%B") # this will give output like "September"
If your date variable is not in Date format, you will first have to convert it:
df$datecolumn <- as.Date(df$datecolumn, format = "%m/%d/%Y")
Assuming your initial data is character and not POSIX.
df <- data.frame(d = c("4/1/2007", "01/01/2012", "02/01/2015"),
stringsAsFactors = FALSE)
df
# d
# 1 4/1/2007
# 2 01/01/2012
# 3 02/01/2015
These are not yet "dates", just strings.
df$d2 = as.POSIXct(df$d, format = "%m/%d/%Y")
df
# d d2
# 1 4/1/2007 2007-04-01
# 2 01/01/2012 2012-01-01
# 3 02/01/2015 2015-02-01
Now they are proper dates (in the R fashion). The next two lines each extract a single component from each "date"; see ?strptime for details on all available formats.
df$dY = format(df$d2, "%Y")
df$dm = format(df$d2, "%m")
df
# d d2 dY dm
# 1 4/1/2007 2007-04-01 2007 04
# 2 01/01/2012 2012-01-01 2012 01
# 3 02/01/2015 2015-02-01 2015 02
An alternative method would be to extract the substrings from each string, but now you're getting into regex-pain; for that, I'd suggest sticking with somebody else's regex lessons-learned, and translate through POSIXct (or even POSIXlt if you want).
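The parse-first, extract-second approach carries over to other languages. A small Python sketch of the same steps follows, with the sample strings taken from the answer above and the variable names chosen only for illustration. It also shows that once the value is parsed, the components are available as integers as well as formatted strings.

```python
from datetime import datetime

# The same sample strings as in the R example
raw = ["4/1/2007", "01/01/2012", "02/01/2015"]

# Parse once; strptime accepts both "4/1" and "04/01" for %m/%d
parsed = [datetime.strptime(s, "%m/%d/%Y") for s in raw]

# Formatted (string) components, like format(d2, "%Y") / format(d2, "%m") in R
years = [d.strftime("%Y") for d in parsed]
months = [d.strftime("%m") for d in parsed]

# Numeric components, without any string handling
month_numbers = [d.month for d in parsed]
day_numbers = [d.day for d in parsed]

print(years, months, month_numbers, day_numbers)
```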

Difference between two dates from two consecutive rows in two different columns

I have a Hive table with millions of records.
The input is of the following type:
Input:
rowid |starttime |endtime |line |status
--- 1 |2007-07-19 00:05:00 |2007-07-19 00:23:00 |l1 |s1
--- 2 |2007-07-20 00:00:10 |2007-07-20 00:22:00 |l1 |s2
--- 3 |2007-07-19 00:00:00 |2007-07-19 00:11:00 |l2 |s2
What I want to do is first order the table by starttime within each line.
Then find the difference between the endtime of one row and the starttime of the next consecutive row. If the difference is more than 5 mins, then add a new row in between, in a new table, with status misstime.
For input rows 1 and 2 the time difference is 1 hour 10 mins, so first I will create a row for the 19th that completes that day with missing time, and then add one more row for the 20th, as below.
output:
rowid |starttime |endtime |line |status
--- 1 |2007-07-19 00:05:00 |2007-07-19 00:23:00 |l1 |s1
--- 2 |2007-07-19 00:23:01 |2007-07-19 00:00:00 |l1 |misstime
--- 3 |2007-07-20 00:00:01 |2007-07-20 00:00:09 |l1 |misstime
--- 4 |2007-07-20 00:00:10 |2007-07-20 00:22:00 |l1 |s2
--- 3 |2007-07-19 00:00:00 |2007-07-19 00:11:00 |l2 |s2
Can anyone help me achieve this directly in hue - hive ?
Unix script will also do.
Thanks in advance.
The solution template is:
Use the LAG() function to get the previous row's starttime or endtime.
For each row, calculate the difference between the current and the previous time.
Filter the rows with a difference of more than 5 minutes.
Transform the dataset into the required output.
Example:
insert into yourtable
select
    s.rowid,
    s.starttime,
    s.endtime
    -- , calculate your status here, etc, etc
from
(
    select rowid, starttime, endtime,
           -- previous row's endtime within the same line, ordered by starttime
           lag(endtime) over (partition by line order by starttime) as prev_endtime
    from yourtable
) s
-- gap between the previous endtime and the current starttime, in minutes
where (unix_timestamp(s.starttime) - unix_timestamp(s.prev_endtime)) / 60 > 5 -- gap > 5 min
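To make the LAG-based template concrete outside Hive, here is a hedged Python sketch of the first three steps only (pair each row with the previous endtime in its line, compute the gap, keep gaps over 5 minutes). It does not build the misstime rows. The `find_gaps` helper and the tuple layout are assumptions for illustration; the sample rows are the ones from the question.

```python
from datetime import datetime, timedelta
from itertools import groupby

FMT = "%Y-%m-%d %H:%M:%S"
THRESHOLD = timedelta(minutes=5)

# (rowid, starttime, endtime, line, status), as in the question's input
rows = [
    (1, "2007-07-19 00:05:00", "2007-07-19 00:23:00", "l1", "s1"),
    (2, "2007-07-20 00:00:10", "2007-07-20 00:22:00", "l1", "s2"),
    (3, "2007-07-19 00:00:00", "2007-07-19 00:11:00", "l2", "s2"),
]

def find_gaps(rows):
    """Return (line, previous_endtime, current_starttime) for gaps over 5 minutes."""
    gaps = []
    # Sort by line, then starttime; ISO-style strings sort chronologically
    ordered = sorted(rows, key=lambda r: (r[3], r[1]))
    for line, group in groupby(ordered, key=lambda r: r[3]):
        prev_end = None  # plays the role of lag(endtime) within one line
        for _rowid, start, end, _line, _status in group:
            start_dt = datetime.strptime(start, FMT)
            if prev_end is not None and start_dt - prev_end > THRESHOLD:
                gaps.append((line, prev_end, start_dt))
            prev_end = datetime.strptime(end, FMT)
    return gaps

gaps = find_gaps(rows)
for line, prev_end, start in gaps:
    print(line, prev_end, start, start - prev_end)
```

Each returned tuple marks where one or more misstime rows would have to be generated; splitting a gap at midnight, as in the desired output, would be a further step on top of this.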
