R - Extract time along with timezone which is part of string [closed]

I have a large database of text, read in as a data frame with a single text column. Some of its sentences mention a time, in different formats, as below:
Row 1. I tried to call you on xxx-xxx-xxxx, however reached voice mail I'm scheduling our next follow up on 6/13/2018 between 12 PM and 2 PM PST.
Row 2. I will call you again today if I hear something from them, if not, will call you tomorrow between 4 - 6PM EST.
Row 3. We will await for your reply, if we don't hear from you then we will call you tomorrow between 12:00PM to 2:00PM CST
Row 4. As discussed over the call, we scheduled call back for tomorrow between 12 - 02 PM EST.
Row 5. As suggested by you, we will have our next follow up on 6/13/2018 between 12 PM TO 2 PM PST.
Would like to extract just the time part along with EST/CST/PST.
Expected Outputs:
6/13/2018 4 PM - 6 PM EST
tomorrow 12 PM TO 2 PM PST
Have tried the below:
x <- text$string
sc1 <- str_match(x, " follow up on (.*?) T.")
which returns something like:
follow up on 6/13/2018 between 1 PM TO | 6/13/2018 between 1 PM
Tried to combine other formats using below codes
sc2 <- str_match(x, " will call you tomorrow between (.*?) T.")
and do a rowbind to include both formats (follow up * and will call you*)
sc1rb <- rbind(sc1,sc2)
which did not work.
Any way to extract only the time part along with timezone from the above example strings?
Thanks in advance!

Here's something that works for the sample. As @MrFlick mentioned, please try to share your data in a reproducible way.
Data
> dput(txt)
c("Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
"will call you tomorrow between 4 - 6PM EST.", "will call you tomorrow between 12:00PM to 2:00PM CST",
"will call you tomorrow between 11 AM to 12 PM EST", "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST."
)
code
> regmatches(txt, regexec('[[:space:]]([[:digit:]]{1,2}[[:space:]].*[[:upper:]]{3})', txt))
[[1]]
[1] " 12 PM and 2 PM PST" "12 PM and 2 PM PST"
[[2]]
[1] " 4 - 6PM EST" "4 - 6PM EST"
[[3]]
character(0)
[[4]]
[1] " 11 AM to 12 PM EST" "11 AM to 12 PM EST"
[[5]]
[1] " 12 PM TO 2 PM PST" "12 PM TO 2 PM PST"
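The output is a list in which each matched element holds two strings: the full match and the captured group (see the help page for regmatches). Element 3 is empty because the first pattern requires whitespace after the leading digits and so misses "12:00PM". You can simplify this further to get only the output indicated above: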
> unname(sapply(txt, function(z){
pattern <- '[[:space:]]([[:digit:]]{1,2}([[:space:]]|:).*[[:upper:]]{3})'
k <- unlist(regmatches(z, regexec(pattern = pattern, z)))
return(k[2])
}))
[1] "12 PM and 2 PM PST" "4 - 6PM EST" "12:00PM to 2:00PM CST" "11 AM to 12 PM EST"
[5] "12 PM TO 2 PM PST"
This is based on the sample input. Of course, if the input is far too irregular, it will be hard to use a single regex. In that case, I'd recommend using several regexes that are tried one after the other, moving on to the next whenever the preceding one returns NA. Hope this is helpful!
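The fallback idea above is not specific to R. A minimal sketch in Python, assuming two illustrative patterns (the names PATTERNS and extract_time_window, and the second pattern, are hypothetical and not from the answer) that are tried in order until one matches:

```python
import re

# Hypothetical patterns, tried in order; each captures the time window in group 1.
PATTERNS = [
    # Same shape as the R pattern above: digits, then a space or colon,
    # then anything up to the last run of three capital letters.
    re.compile(r"\s(\d{1,2}(?:\s|:).*[A-Z]{3})"),
    # Assumed fallback for a single time glued to AM/PM, e.g. "5PM EST".
    re.compile(r"(\d{1,2}(?::\d{2})?\s*[AP]M\s+[A-Z]{3})"),
]

def extract_time_window(text):
    """Return the capture of the first pattern that matches, else None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None

txt = [
    "Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
    "will call you tomorrow between 4 - 6PM EST.",
    "will call you tomorrow between 12:00PM to 2:00PM CST",
    "will call you tomorrow between 11 AM to 12 PM EST",
    "Next follow up on 6/13/2018 between 12 PM TO 2 PM PST.",
]

for line in txt:
    print(extract_time_window(line))
```

Rows that no pattern recognises come back as None, which makes it easy to list them and decide whether another pattern is worth adding.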

sub(".*?(\\d+\\s*[PA:-].*)","\\1",data)
[1] "12 PM and 2 PM PST." "4 - 6PM EST." "12:00PM to 2:00PM CST"
[4] "11 AM to 12 PM EST" "12 PM TO 2 PM PST."

This code works for almost all of your cases, except for the substring "4 - 6PM EST". I hope it will be useful on your whole data set.
data=c(
"Next follow up on 6/13/2018 between 12 PM and 2 PM PST.",
"will call you tomorrow between 4 - 6PM EST.",
"will call you tomorrow between 12:00PM to 2:00PM CST",
"will call you tomorrow between 11 AM to 12 PM EST",
"Next follow up on 6/13/2018 between 12 PM TO 2 PM PST.")
# remove dates such as 6/13/2018 with a regex
data=gsub("\\d{1,2}/\\d{1,2}/\\d{4}", "", data)
# parameters for exclusion and substitution
excluded_texts=c("Next follow up on","between","will call you tomorrow",":00","\\.")
replaced_input=c(" ","\'-","and","TO"," AM"," PM")
replaced_output=c("","to","to","to","AM","PM")
for (i in excluded_texts){
data=gsub(i, "", data)}
for (j in 1:length(replaced_input)){
data=gsub(replaced_input[j],replaced_output[j],data)
}
print(data)


Why is this date formatted with a different hour value?

I have three dates in the database. They all represent 5 pm "real user time" meaning that in real life, the event occurring at those timestamps will occur at 5 pm:
1600117200000 (09/14/2020 @ 5:00pm America/Toronto, 09/14/2020 @ 9:00pm UTC)
1615240800000 (03/08/2021 @ 5:00pm America/Toronto, 03/08/2021 @ 10:00pm UTC)
1615842000000 (03/15/2021 @ 5:00pm America/Toronto, 03/15/2021 @ 9:00pm UTC)
When I run the following, the third date is displayed as 4 pm instead of 5 pm. Why?
moment(1600117200000).tz('America/Toronto').format('LLL')
-> "September 14, 2020 5:00 PM"
moment(1615240800000).tz('America/Toronto').format('LLL')
-> "March 8, 2021 5:00 PM"
moment(1615842000000).tz('America/Toronto').format('LLL')
-> "March 15, 2021 4:00 PM"
I could understand why the middle date (1615240800000) would display an invalid hour since it's in different daylight saving time than I am currently, but the third (1615842000000) is in the same daylight saving time as when I execute the code.
Thanks!
:facepalm: I had an old version of moment-timezone and since the 2021 timezone updates did not exist, it was just displaying an invalid hour. I updated the lib (and database) and everything is fine.
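The 4 PM result is what you get when the standard-time offset (UTC-5) is applied to a date that is already on daylight time (UTC-4); in 2021 Toronto switched to daylight time on March 14. A small Python sketch of that arithmetic, using fixed offsets as a stand-in for time zone data (the names EST, EDT and local_hour are illustrative only):

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets standing in for time zone rules (assumption for illustration).
EST = timezone(timedelta(hours=-5), "EST")  # Toronto standard time
EDT = timezone(timedelta(hours=-4), "EDT")  # Toronto daylight time

def local_hour(epoch_ms, tz):
    """Hour of day for an epoch-milliseconds timestamp at a fixed UTC offset."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=tz).hour

ts = 1615842000000
print(datetime.fromtimestamp(ts / 1000, tz=timezone.utc))  # 2021-03-15 21:00:00+00:00
print(local_hour(ts, EDT))  # 17, the expected 5 PM
print(local_hour(ts, EST))  # 16, the 4 PM seen with outdated zone data
```

The March 8 timestamp (1615240800000, 22:00 UTC) falls before the switch, so UTC-5 is correct for it and it shows 5 PM either way.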

How to split Monday, July 1, 2019 12:00:00:000 AM

I have read, studied, and tested, but I'm just not getting it. Here is my data frame:
MyDate TEMP1 TEMP2
Monday, July 1, 2019 12:00:00:000 AM 90.0 1586
Monday, July 1, 2019 12:01:00:000 AM 88.6 1581
Monday, July 1, 2019 12:02:00:000 AM 89.4 1591
Monday, July 1, 2019 12:03:00:000 AM 90.5 1586
I need to compare it to a second data frame:
Date Time A.B.Flow A.B.Batch.Volume
7/1/2019 14:47:46 1.0 2.0
7/9/2019 14:47:48 3.0 5.0
7/11/2019 14:47:52 0.0 2.0
7/17/2019 14:48:52 3.8 4.0
7/24/2019 14:49:52 0.0 3.1
I just have to combine the two data frames where the dates, hours, and minutes match. The seconds do not have to match.
So far I have gleaned that I need to convert the first Column MyDate into separate Dates and Times. I've been unable to come up with a strsplit command that actually does this.
This just gives each element in quotes:
Tried, newdate <- strsplit(testdate$MyDate, "\\s+ ")[[3]]
This is better but "2019"is gone:
Tried, newdate <- strsplit(testdate$MyDate, "2019")
It looks like this:
[[1]]
[1] "Monday, July 1, " "12:00:00:000 AM"
[[2]]
[1] "Monday, July 1, " "12:01:00:000 AM"
[[3]]
[1] "Monday, July 1, " "12:02:00:000 AM"
[[4]]
[1] "Monday, July 1, " "12:03:00:000 AM"
Please tell me what I am doing wrong. I would love some input as to whether I am barking up the wrong tree.
I've tried a few other things using anytime and lubridate, but I keep coming back to this combined date and time with the day written out as my nemesis.
You could get rid of the day (Monday, ...) in your MyDate field by splitting on ',', removing the first element, then combining the rest and converting to POSIXct.
Assuming your first dataframe is called df:
dt <- strsplit(df$MyDate, ',')
df$MyDate2 <- sapply(dt, function(x) trimws(paste0(x[-1], collapse = ',')))
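# %B = full month name, %I with %p = 12-hour clock with AM/PM; ':000' matches the literal milliseconds
df$MyDate2 <- as.POSIXct(df$MyDate2, format = '%B %d, %Y %I:%M:%S:000 %p')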
And since you are not interested in the seconds part of the timestamps, you can do:
df$MyDate2 <- format(df$MyDate2, '%Y-%m-%d %H:%M')
You should similarly convert the Date/Time fields of your second dataframe df2, creating a MyDate2 field there with the seconds part removed as above.
Now you can merge the two dataframes on the MyDate2 column.
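The same steps (parse both timestamp styles, truncate to the minute, join on the truncated key) can be sketched in Python with only the standard library. The function names are illustrative, and the first temperature row is a hypothetical one added so that the join has something to match:

```python
from datetime import datetime

def minute_key_long(stamp):
    """'Monday, July 1, 2019 12:00:00:000 AM' -> '2019-07-01 00:00'."""
    dt = datetime.strptime(stamp, "%A, %B %d, %Y %I:%M:%S:%f %p")
    return dt.strftime("%Y-%m-%d %H:%M")

def minute_key_short(date, time):
    """('7/1/2019', '14:47:46') -> '2019-07-01 14:47'."""
    dt = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    return dt.strftime("%Y-%m-%d %H:%M")

# First frame: key -> (TEMP1, TEMP2). The 02:47 PM row is hypothetical.
temps = {
    minute_key_long("Monday, July 1, 2019 02:47:00:000 PM"): (88.0, 1590),
    minute_key_long("Monday, July 1, 2019 12:00:00:000 AM"): (90.0, 1586),
}
# Second frame: key -> (A.B.Flow, A.B.Batch.Volume).
flows = {
    minute_key_short("7/1/2019", "14:47:46"): (1.0, 2.0),
    minute_key_short("7/9/2019", "14:47:48"): (3.0, 5.0),
}

# Inner join on the minute-resolution key.
joined = {k: temps[k] + flows[k] for k in sorted(temps.keys() & flows.keys())}
print(joined)
```

Note that 12:00 AM becomes 00:00 only because the hour is read with %I together with %p; reading it as a 24-hour value would place that row at noon.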
This might give you a hint:
Since you have times, you shouldn't use as.Date but rather as.POSIXct, IMHO.
library(stringr) # provides str_remove_all() and word()
x=c("Monday, July 1, 2019 12:00:00:000 AM 90.0 1586")
Months=c("January","February","March","April","May","June","July","August","September","October","November","December")
GetDate=function(x){
x=str_remove_all(x,",") # get rid of the commas; note that AM/PM (word 6) is not used below
mo=which(Months==word(x,2))
day=word(x,3)
year=word(x,4)
time=word(x,5)
as.POSIXct(paste(paste(year,mo,day,sep="-"),time))
}
GetDate(x)

Having trouble converting df index to datetime object

So this is my dataframe
Ticker Owner \
SEC Form 4
Nov 09 02:19 PM HSY HERSHEY TRUST
Nov 09 02:05 PM HSY HERSHEY TRUST CO
Nov 09 02:03 PM WDFC PITTARD DANIEL E
Nov 09 01:34 PM IMGN Enyedy Mark J
Nov 09 01:25 PM ORI ZUCARO ALDO C
I'm trying to convert the index(SEC Form 4) into a datetime object, so I can use that object's methods. However, I like the current format style of the date (Nov 09 02:19 PM) and don't want to replace it with something like (2016-11-09 14:19).
pd.to_datetime(df.index, format = '%b %d')
pd.to_datetime(df.index, format = '%b %d %I:%M %p' )
I played around with some of these format parameters, but they change the display style of the date into something like (2016-11-09 14:19:00), which is not the format that I want.
I even tried to see if there was a dtype datetime that I can just convert to (so I won't have to change the display look) but I had no luck finding such a dtype.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html
Thank you.
Maybe you forgot to add the year, since it is not specified in the data. Here is a possible solution.
import io
import pandas as pd
zz = """"SEC Form 4" Ticker Owner
"Nov 09 02:19 PM" "HSY" "HERSHEY TRUST"
"Nov 09 02:05 PM" HSY "HERSHEY TRUST CO"
"Nov 09 02:03 PM" WDFC "PITTARD DANIEL E"
"Nov 09 01:34 PM" IMGN "Enyedy Mark J"
"Nov 09 01:25 PM" ORI "ZUCARO ALDO C"
"""
df = pd.read_table(io.StringIO(zz), sep=r'\s+')  # delim_whitespace=True is deprecated in recent pandas
df.set_index('SEC Form 4', inplace=True)
# Adding the missing year
df.index = '2016 ' + df.index
# There is no need to detail the expected format
df.index = pd.to_datetime(df.index)
print(df.index.dtype)
print(df)
# datetime64[ns]
# Ticker Owner
# 2016-11-09 14:19:00 HSY HERSHEY TRUST
# 2016-11-09 14:05:00 HSY HERSHEY TRUST CO
# 2016-11-09 14:03:00 WDFC PITTARD DANIEL E
# 2016-11-09 13:34:00 IMGN Enyedy Mark J
# 2016-11-09 13:25:00 ORI ZUCARO ALDO C
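On the display question: a datetime value has no display style of its own, so the usual approach is to keep the real datetime for computation and format it only when printing. A minimal standard-library sketch, assuming the year is 2016 (it is not in the data) and using an illustrative DISPLAY constant:

```python
from datetime import datetime

DISPLAY = "%b %d %I:%M %p"  # the style the question wants to keep

raw = ["Nov 09 02:19 PM", "Nov 09 02:05 PM", "Nov 09 01:34 PM"]

# Assumption: every row belongs to 2016.
stamps = [datetime.strptime("2016 " + s, "%Y " + DISPLAY) for s in raw]

latest = max(stamps)             # datetime methods and comparisons work
print(latest.hour)               # 14
print(latest.strftime(DISPLAY))  # Nov 09 02:19 PM
```

In pandas the counterpart should be df.index.strftime('%b %d %I:%M %p'), which returns strings for display while the index itself stays datetime64.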

converting multiple date formats into one in r

I am working with messy excel file with multiple date formats
2016-10-17T12:38:41Z
Mon Oct 17 08:03:08 GMT 2016
10-Sep-15
13-Oct-09
18-Oct-2016 05:42:26 UTC
I want to convert all of the above to yyyy-mm-dd format. I am using the following code for the conversion, but a lot of values come out as NA.
as.Date(parse_date_time(df$date,c('mdy', 'ymd_hms','a b d HMS y','d b y HMS')))
How can I convert all of them together? I have read other threads on similar cases, but nothing seems to work for mine.
Please help
If I add 'dmy' to the list then at least all of the cases in your example are successfully parsed:
z <- c("2016-10-17T12:38:41Z", "Mon Oct 17 08:03:08 GMT 2016",
"10-Sep-15", "13-Oct-09", "18-Oct-2016 05:42:26 UTC")
library(lubridate)
parse_date_time(z,c('mdy', 'dmy', 'ymd_HMS','a b d HMS y','d b y HMS'))
## [1] "2016-10-17 12:38:41 UTC" "2016-10-17 08:03:08 UTC"
## [3] "2015-09-10 00:00:00 UTC" "2009-10-13 00:00:00 UTC"
## [5] "2016-10-18 05:42:26 UTC"
Your big problem will be the third and fourth elements: are these actually meant to be 'ymd' and 'dmy' respectively? I'm not sure how any logic will let you auto-detect these differences ... out of context, "15 Sep 2010" and "10 September 2015" both seem perfectly reasonable possibilities ...
For what it's worth I also tried the new anytime package - it only handled the first and last element.
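The try-each-format idea behind the orders argument can also be written out by hand, which makes the dmy-versus-ymd decision explicit. A minimal Python sketch, assuming the two-digit-year values are day-month-year and that the literal zone names GMT, UTC and Z are the only ones present (FORMATS and to_iso_date are illustrative names):

```python
from datetime import datetime

# Candidate formats, tried in order. "%d-%b-%y" encodes the assumption that
# "10-Sep-15" means 10 September 2015, not 15 September 2010.
FORMATS = [
    "%Y-%m-%dT%H:%M:%SZ",        # 2016-10-17T12:38:41Z
    "%a %b %d %H:%M:%S GMT %Y",  # Mon Oct 17 08:03:08 GMT 2016
    "%d-%b-%Y %H:%M:%S UTC",     # 18-Oct-2016 05:42:26 UTC
    "%d-%b-%y",                  # 10-Sep-15
]

def to_iso_date(text):
    """Return yyyy-mm-dd for the first format that parses, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

z = ["2016-10-17T12:38:41Z", "Mon Oct 17 08:03:08 GMT 2016",
     "10-Sep-15", "13-Oct-09", "18-Oct-2016 05:42:26 UTC"]
print([to_iso_date(s) for s in z])
```

Values that match none of the formats come back as None instead of being guessed, so they can be inspected separately.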
Removing the times first makes it possible to specify only three alternatives in orders to parse the sample data in the question. This interprets 10-Sep-15 and 13-Oct-09 as dmy but if you want them interpreted as ymd then uncomment the commented out line:
orders <- c("dmy", "mdy", "ymd")
# orders <- c("ymd", "dmy", "mdy")
as.Date(parse_date_time(gsub("..:..:..", " ", x), orders = orders))
giving:
[1] "2016-10-17" "2016-10-17" "2015-09-10" "2009-10-13" "2016-10-18"
or if the commented out line is uncommented then:
[1] "2016-10-17" "2016-10-17" "2010-09-15" "2013-10-09" "2016-10-18"
Note: The input is:
x <- c("2016-10-17T12:38:41Z ", "Mon Oct 17 08:03:08 GMT 2016", "10-Sep-15",
"13-Oct-09", "18-Oct-2016 05:42:26 UTC")

R parse timestamp of form %j%Y with no leading zeroes

I am working with csv timestamp data given in the form '%j%Y %H:%M' with no leading zeroes. Here are some timestamp examples:
112005 22:00
1292005 6:00
R is reading the first line as the 112th day of the 005th year. How can I make R correctly parse this information?
Code I'm using which doesn't work:
train$TIMESTAMP <- strptime(train$TIMESTAMP, format='%j%Y %H:%M', tz='GMT')
train$hour <- as.numeric(format(train$TIMESTAMP, '%H'))
I don't think there's any simple way to decipher where the day stops and the year starts. Maybe you could split it at something that looks like a relevant year (20XX):
gsub("^(\\d{1,3})(20\\d{2})","\\1 \\2",train$TIMESTAMP)
#[1] "11 2005 22:00" "129 2005 6:00"
and do:
strptime(gsub("^(\\d{1,3})(20\\d{2})","\\1 \\2",train$TIMESTAMP), "%j %Y %H:%M")
#[1] "2005-01-11 22:00:00 EST" "2005-05-09 06:00:00 EST"
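For comparison, the same split-then-parse approach in Python, under the same assumption that every year looks like 20XX (parse_julian is an illustrative name):

```python
import re
from datetime import datetime

def parse_julian(stamp):
    """Parse '<day-of-year><year> H:MM' where the day has no leading zeroes."""
    # Insert a space between the day of year (1-3 digits) and a 20XX year.
    spaced = re.sub(r"^(\d{1,3})(20\d{2})", r"\1 \2", stamp)
    return datetime.strptime(spaced, "%j %Y %H:%M")

print(parse_julian("112005 22:00"))  # 2005-01-11 22:00:00
print(parse_julian("1292005 6:00"))  # 2005-05-09 06:00:00
```

The regex backtracks from three digits to two for "112005" because "005 " is not a valid 20XX year, which is what resolves the first example to day 11.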
