Extract part of string: date and times - r

I have a variable that usually has some gibberish like:
\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n
I am trying to extract the date (30.07.2019) and time (12:00 - 14:30). I am not very good with parsing so some help with implementing this in R would be appreciated.

If you can rely on the date and time part occurring only once in each string, you could use regular expressions to extract them (here using a data frame):
library(tidyverse)
data <-
tibble(gibberish_string = "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n")
data %>% mutate(date = str_extract(gibberish_string,
pattern = "\\d{1,2}\\.\\d{1,2}\\.\\d{4}"),
# match the whole range ("12:00 - 14:30"), not just the first time
time = str_extract(gibberish_string,
pattern = "\\d{1,2}:\\d{2}\\s*-\\s*\\d{1,2}:\\d{2}"))

String split, then extract date and times:
x <- "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"
lapply(strsplit(x, "[\n\t ]"), function(i){
dd <- i[ grepl("[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}", i) ] # escape the dots so they match a literal "."
tt <- i[ grepl("[0-9]{2}:[0-9]{2}", i) ]
c(dd, paste(tt, collapse = "-"))
})
# [[1]]
# [1] "30.07.2019" "12:00-14:30"

This pattern matches the date:
(\d{1,2}[\.\/]){2}((\d{4})|(\d{2}))
This pattern matches the time range:
\d{1,2}:\d{2}\s?-\s?\d{1,2}:\d{2}
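The two patterns above are written in generic regex syntax, so here is a minimal sketch of how they might be used from base R. The names x, date_re and time_re are only illustrative. Backslashes are doubled because they sit inside R strings, the character class [\.\/] is written as the equivalent [./], and perl = TRUE is an assumption made so that \d and \s are read as PCRE classes.

```r
x <- "\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"

# Same patterns as above, with backslashes doubled for R string literals
date_re <- "(\\d{1,2}[./]){2}(\\d{4}|\\d{2})"
time_re <- "\\d{1,2}:\\d{2}\\s?-\\s?\\d{1,2}:\\d{2}"

# regexpr() locates the first match, regmatches() pulls it out of the string
date_found <- regmatches(x, regexpr(date_re, x, perl = TRUE))
time_found <- regmatches(x, regexpr(time_re, x, perl = TRUE))

date_found
time_found
```

If a string could contain several dates or time ranges, gregexpr() in place of regexpr() would return all of them.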

A kind of lengthy step by step base/stringr approach:
tst<-"\n\t\n\t\n\t\n\t\tSeuat eselyt\n\t\t\t\t\t\n\t\t\tti 30.07.2019 klo 12:00 - 14:30\n\t\t\t\t\t\t\tTau ski 2342342 2342342\n\t\t\t\t\t\n\t\n"
cleaner<-gsub("\\n|\\t","",tst)
split_txt<-strsplit(cleaner, "\\s(?=[a-z])",perl=T)
dates<-stringr::str_extract_all(unlist(split_txt),
"\\d{1,}\\.\\d{2,}\\.\\d{4}")
times<-stringr::str_extract_all(stringr::str_remove_all(unlist(split_txt),
"[A-Za-z]"),".*\\-.*")
dates[lengths(dates)>0]
[[1]]
[1] "30.07.2019"
trimws(times[lengths(times)>0])
[1] "12:00 - 14:30"

Related

How to create a sequence of *%Year%Week* from numeric?

My inputs are numeric and represent a year and a week number. I need to create a sequence from one input to the other.
Inputs example :
input.from <- 202144
input.to <- 202208
Desired output would be :
c(202144:202152, 202201:202208)
I think this is a little more complex than it looks, because of these constraints:
Years with 53 weeks: I tried lubridate::isoweek(), the %W or %v format, ...
Always keep two digits for the week: I tried "%02d", ...
I also tried converting my input to a date, ...
Anyway, I made many attempts at writing my function, without success.
Thanks for your help!
In case it would be useful to someone one day, here is finally the function I wrote, which respects ISO 8601 :
library(ISOweek)
foo <- function(pdeb, pfin) {
from <- ISOweek::ISOweek2date(paste0(substr(pdeb, 1, 4), "-W", substr(pdeb, 5, 6), "-1"))
to <- ISOweek::ISOweek2date(paste0(substr(pfin, 1, 4), "-W", substr(pfin, 5, 6), "-1"))
res <- seq.Date(from, to, by = "week")
return(format(res, format = "%G%V"))
}
foo(201950, 202205)
Step #1: transform the input to a character string of the form YYYY-"W"WW-1
Step #2: convert that ISO week to a date
Step #3: build the sequence by week
Step #4: format the sequence as "%G%V", which still respects ISO 8601 and gives YYYYWW
I'd go with
x <- c("202144", "202208")
out <- do.call(seq, c(as.list(as.Date(paste0(x, "1"), format="%Y%U%u")), by = "week"))
out
# [1] "2021-11-01" "2021-11-08" "2021-11-15" "2021-11-22" "2021-11-29" "2021-12-06" "2021-12-13" "2021-12-20" "2021-12-27"
# [10] "2022-01-03" "2022-01-10" "2022-01-17" "2022-01-24" "2022-01-31" "2022-02-07" "2022-02-14" "2022-02-21"
If you really want to keep them in the %Y%W format, then
format(out, format = "%Y%W")
# [1] "202144" "202145" "202146" "202147" "202148" "202149" "202150" "202151" "202152" "202201" "202202" "202203" "202204"
# [14] "202205" "202206" "202207" "202208"
(This answer heavily informed by Transform year/week to date object)
We could do some mathematics.
f <- function(from, to) {
r <- from:to
r[r %% 100 > 0 & r %% 100 < 53]
}
input.from <- 202144; input.to <- 202208
f(input.from, input.to)
# [1] 202144 202145 202146 202147 202148 202149 202150 202151 202152
# [10] 202201 202202 202203 202204 202205 202206 202207 202208
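One caveat on the arithmetic filter above: it always drops week 53, so it would skip 202053, which is a valid ISO week. Below is a rough base-R sketch that handles 53-week years without extra packages. It relies on the ISO 8601 rule that January 4th always falls in week 1. The function names iso_week_monday and yearweek_seq are made up for illustration, and it assumes that format() supports "%G%V" on your platform.

```r
# Monday of a given ISO year-week, passed as a number such as 202144
iso_week_monday <- function(yw) {
  year <- yw %/% 100
  week <- yw %% 100
  jan4 <- as.Date(sprintf("%d-01-04", year))
  # "%u" is the ISO weekday (1 = Monday ... 7 = Sunday)
  week1_monday <- jan4 - (as.integer(format(jan4, "%u")) - 1)
  week1_monday + 7 * (week - 1)
}

# Sequence of ISO year-weeks between two inputs, both included
yearweek_seq <- function(from, to) {
  mondays <- seq(iso_week_monday(from), iso_week_monday(to), by = "week")
  as.integer(format(mondays, "%G%V"))
}

yearweek_seq(202144, 202208)
yearweek_seq(202050, 202102)  # 2020 is a 53-week year
```

The result is an integer vector, so wrap it in as.character() if you need strings.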

Discretize a date-time variable to "in-hours" and "after-hours"

I have date-times like:
x = c("2015-09-12 03:52:00", "2017-06-15 21:37:28", "2017-04-08 20:44:11")
I want to create two categories: if the time is between 6:30 pm and 8 am, I want to return "after-hours"; otherwise, "in-hours".
I tried to solve this by first extracting the time part, but that converted it to a character, which meant ifelse was not working.
Thank you in advance.
base R
Cheating a little, converting to %H%M as an integer on a 24h clock.
vec <- as.POSIXct(c("2015-09-12 03:52:00", "2017-06-15 21:37:28", "2017-04-08 20:44:11"))
hhmm <- as.integer(format(vec, format = "%H%M"))
ifelse(hhmm < 0800 | hhmm > 1830, "after-hours", "in-hours")
# [1] "after-hours" "after-hours" "after-hours"
lubridate
Similar, but using decimal hours instead of fake-hour/minute.
library(lubridate)
hhmm2 <- hour(vec) + minute(vec)/60
ifelse(hhmm2 < 8 | hhmm2 > 18.5, "after-hours", "in-hours")
# [1] "after-hours" "after-hours" "after-hours"
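Both variants above hard-code the boundaries and ignore seconds. If you need this in more than one place, a small helper along these lines might be handy. This is only a sketch: classify_hours and its open, close and tz arguments are hypothetical names, and it assumes that 08:00:00 counts as in-hours while 18:30:00 counts as after-hours.

```r
classify_hours <- function(x, open = "08:00", close = "18:30", tz = "UTC") {
  lt <- as.POSIXlt(x, tz = tz)
  # minutes since midnight, including the seconds as a fraction
  mins <- lt$hour * 60 + lt$min + lt$sec / 60
  # convert "HH:MM" to minutes since midnight
  to_mins <- function(hm) {
    parts <- as.integer(strsplit(hm, ":")[[1]])
    parts[1] * 60 + parts[2]
  }
  ifelse(mins >= to_mins(open) & mins < to_mins(close), "in-hours", "after-hours")
}

classify_hours(c("2015-09-12 03:52:00", "2017-06-15 21:37:28", "2017-04-08 12:00:00"))
```

Passing tz explicitly keeps the result independent of the machine's local time zone.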
times_as_char = c("2015-09-12 03:52:00", "2017-06-15 21:37:28", "2017-04-08 20:44:11")
# Converting character to date-time
times_as_datetimes <- lubridate::ymd_hms(times_as_char)
# We can use decimal hours to make time comparisons easier
times_as_hour_dec <- lubridate::hour(times_as_datetimes) +
lubridate::minute(times_as_datetimes)/60
time_status <- ifelse(times_as_hour_dec < 8 | times_as_hour_dec >= 18.5,
"after-hours",
"in-hours")

combine tibbles with inexact values

I have two tibbles and I want to combine them based on the Batsman column. However, the values in the 2 columns are not completely identical, i.e. "V Kohli" vs. "Virat Kohli (IND)". How can I combine the tibbles based on these inexact matches?
Thank you!
x1 <- tibble(Batsman=c("V Kohli (INDIA)","RG Sharma (INDIA)","Babar Azam (PAK)","GJ Maxwell (AUS)"),
Runs=c(500,400,300,200),
Matches=c(67,54,47,23))
x2 <- tibble(Rank=c(1,2,3,4),
Batsman=c("Virat Kohli", "Rohit Sharma", "Glenn Maxwell","Babar Azam"),
Rating=c(853,820,640,500))
So you want to join on two sets of text strings:
> x1$Batsman
[1] "V Kohli (INDIA)" "RG Sharma (INDIA)" "Babar Azam (PAK)" "GJ Maxwell (AUS)"
> x2$Batsman
[1] "Virat Kohli" "Rohit Sharma" "Glenn Maxwell" "Babar Azam"
I guess you have many more names than these four?
It is definitely a tricky task; computers are notoriously bad at this kind of matching. (There are some famous examples of very long functions written just to read phone numbers.) From the strings you provide, I can see that both columns always share the same last name.
I would use stringr to extract the last names.
The full code:
library(tibble)
library(stringr)
x1 <- tibble(Batsman=c("V Kohli (INDIA)","RG Sharma (INDIA)","Babar Azam (PAK)","GJ Maxwell (AUS)"),
Runs=c(500,400,300,200),
Matches=c(67,54,47,23) )
x2 <- tibble(Rank=c(1,2,3,4),
Batsman=c("Virat Kohli", "Rohit Sharma", "Glenn Maxwell","Babar Azam"),
Rating=c(853,820,640,500))
AA <- str_sub(x1$Batsman, start = str_locate(x1$Batsman, " ")[,1]+1, 20)
AA <- str_sub(AA, start = 1, end = str_locate(AA, " ")[,1]-1) %>%
str_to_lower()
BB <- str_sub(x2$Batsman, start = str_locate(x2$Batsman, " ")[,1]+1, 20) %>%
str_to_lower()
match(AA, BB)
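match() only gives the row positions, so to actually combine the two tables you still need a join on the extracted key. Here is a minimal base-R sketch of that step, using plain data frames so it runs without extra packages. It assumes the last word before the optional " (COUNTRY)" suffix is the surname and that surnames are unique in both tables, which will not hold for larger data. The column name key is made up for illustration.

```r
x1 <- data.frame(Batsman = c("V Kohli (INDIA)", "RG Sharma (INDIA)",
                             "Babar Azam (PAK)", "GJ Maxwell (AUS)"),
                 Runs = c(500, 400, 300, 200),
                 Matches = c(67, 54, 47, 23),
                 stringsAsFactors = FALSE)
x2 <- data.frame(Rank = c(1, 2, 3, 4),
                 Batsman = c("Virat Kohli", "Rohit Sharma", "Glenn Maxwell", "Babar Azam"),
                 Rating = c(853, 820, 640, 500),
                 stringsAsFactors = FALSE)

# Drop the " (COUNTRY)" suffix, then keep only the last word, lower-cased
x1$key <- tolower(sub("^.* ", "", sub(" \\(.*\\)$", "", x1$Batsman)))
x2$key <- tolower(sub("^.* ", "", x2$Batsman))

# Join on the surname key; the clashing Batsman columns get suffixes
combined <- merge(x1, x2, by = "key", suffixes = c(".x1", ".x2"))
combined
```

With tibbles, dplyr::inner_join(x1, x2, by = "key") should do the same job.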

Print all hours:minutes from 00:00 to 23:59

I would like to print all the hours: minutes in a day from 00:00 to 23:59.
This part goes beyond the question, but if you want to help me, this is the whole idea:
Once that is done, I would like to calculate all the "curious" times that can be interpreted as serendipities. Patterns like: 00:00, 22:22, 01:10, 12:34, 11:44, and the like.
Later on, I would like to count all the "serendipities" and divide that count by the total number of possible times, to find the probability of seeing a "serendipity" each time a person looks at the time on their smartphone.
To be honest, I am pretty lost; I have not coded for some months. For the first part of the problem, I guess a loop can do the task.
For the second part, an if conditional can probably do it.
For the first part of the problem I have tried loops like this
for(i in x){
for(k in y){
cat(i,":",k, ",")
}
}
For the second, something like
Assuming the digits of the time are ab:cd
if(a==b & a==c & a==d){
print(ab:cd)
TRUE
}
if(a==b & c==d){
print(ab:cd)
TRUE
}
I would like to get the whole list of numbers first. Then, the list of "serendipities", and finally the count of both to make the percentage.
I find it interesting how people find patterns in numbers when they look at the time, and I would like to know how probable it is to get one of these patterns out of the 24*60 = 1440 possible times.
I hope I have explained myself. (I used to be better with coding and maths, but after some months, I have forgotten almost everything).
Here's a way to generate the list of all possible times.
h <- seq(from=0, to=23)
m <- seq(from=0, to=59)
h <- sprintf('%02d', h)
m <- sprintf('%02d', m)
df <- data.frame(expand.grid(h, m))
df$times <- paste0(df$Var1, ':', df$Var2)
df <- df[order(df$times), ]
df$times
Partial output
df$times[1:25]
[1] "00:00" "00:01" "00:02" "00:03" "00:04" "00:05" "00:06" "00:07" "00:08"
[10] "00:09" "00:10" "00:11" "00:12" "00:13" "00:14" "00:15" "00:16" "00:17"
[19] "00:18" "00:19" "00:20" "00:21" "00:22" "00:23" "00:24"
Length of variable
dim(df)
[1] 1440 3
We can create a sequence of 1 minute interval starting from 00:00:00 to 23:59:00 and then use format to get output in desired format.
format(seq(as.POSIXct("00:00:00", format = "%T"),
as.POSIXct("23:59:00", format = "%T"), by = "1 min"), "%H:%M")
#[1] "00:00" "00:01" "00:02" "00:03" "00:04" "00:05" "00:06" "00:07" "00:08" "00:09"
# "00:10" "00:11" "00:12" "00:13" "00:14" "00:15" "00:16" "00:17" "00:18" "00:19" ...
Yet another way of doing it:
> result <- character(1440)
> for (i in 0:1439) result[i+1L] <- sprintf("%02d:%02d",
+ i %/% 60,
+ i %% 60
+ )
> head(result)
[1] "00:00" "00:01" "00:02" "00:03" "00:04" "00:05"
> tail(result)
[1] "23:54" "23:55" "23:56" "23:57" "23:58" "23:59"
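For the second part of the question (counting the "curious" times), here is a rough vectorised sketch. Which patterns count as a "serendipity" is entirely an assumption on my part: I only took the three families suggested by the examples in the question (pairs such as 11:44, mirrored times such as 01:10, and ascending runs such as 12:34). The variable names are illustrative too.

```r
# All 1440 times of day as "HH:MM"
times <- sprintf("%02d:%02d", rep(0:23, each = 60), rep(0:59, times = 24))

# The four digits of each time, read as d1 d2 : d3 d4
d1 <- as.integer(substr(times, 1, 1))
d2 <- as.integer(substr(times, 2, 2))
d3 <- as.integer(substr(times, 4, 4))
d4 <- as.integer(substr(times, 5, 5))

paired    <- d1 == d2 & d3 == d4                            # 00:00, 11:44, 22:22
mirrored  <- d1 == d4 & d2 == d3                            # 01:10, 12:21
ascending <- d2 == d1 + 1 & d3 == d2 + 1 & d4 == d3 + 1     # 12:34

curious <- paired | mirrored | ascending
serendipities <- times[curious]

serendipities
length(serendipities) / length(times)  # share of "curious" times in a day
```

Adding another pattern is just a matter of defining one more logical vector and including it in curious.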

R: generate dataframe of Friday dates for the year [duplicate]

I would like to generate a dataframe that contains all the Friday dates for the whole year.
Is there a simple way to do this?
eg for December 2013: (6/12/13,13/12/13,20/12/13,27/12/13)
Thank you for your help.
I'm sure there is a simpler way, but you could brute-force it easily enough, using either of the last two lines:
dates <- seq.Date(as.Date("2013-01-01"),as.Date("2013-12-31"),by="1 day")
dates[weekdays(dates)=="Friday"]
dates[format(dates,"%w")==5]
Building on @Frank's good work, you can find all of any specific weekday between two dates like so:
pick.wkday <- function(selday,start,end) {
fwd.7 <- start + 0:6
first.day <- fwd.7[as.numeric(format(fwd.7,"%w"))==selday]
seq.Date(first.day,end,by="week")
}
start and end need to be Date objects, and selday is the day of the week you want (0-6 representing Sunday-Saturday).
i.e. - for the current query:
pick.wkday(5,as.Date("2013-01-01"),as.Date("2013-12-31"))
Here is a way.
d <- as.Date(0:364, origin = "2013-1-1") # offsets 0:364 cover 2013-01-01 to 2013-12-31
d[strftime(d,"%A") == "Friday"]
Alternately, this would be a more efficient approach for generating the data for an arbitrary number of Fridays:
wk1 <- as.Date(0:6, origin = "2013-1-1") # choose start date & make 7 consecutive days
wk1[weekdays(wk1) == "Friday"] # find Friday in the sequence of 7 days
seq.Date(wk1[weekdays(wk1) == "Friday"], length.out=50, by=7) # use it to generate fridays
by=7 says go to the next Friday.
length.out controls the number of Fridays to generate. One could also use to to control how many Fridays are generated (e.g. use to=as.Date("2013-12-31") instead of length.out).
This takes a year as input and returns only the Fridays:
getFridays <- function(year) {
dates <- seq(as.Date(paste0(year,"-01-01")),as.Date(paste0(year,"-12-31")), by = "day")
dates[weekdays(dates) == "Friday"]
}
Example:
> getFridays(2000)
[1] "2000-01-07" "2000-01-14" "2000-01-21" "2000-01-28" "2000-02-04" "2000-02-11" "2000-02-18" "2000-02-25" "2000-03-03" "2000-03-10" "2000-03-17" "2000-03-24" "2000-03-31"
[14] "2000-04-07" "2000-04-14" "2000-04-21" "2000-04-28" "2000-05-05" "2000-05-12" "2000-05-19" "2000-05-26" "2000-06-02" "2000-06-09" "2000-06-16" "2000-06-23" "2000-06-30"
[27] "2000-07-07" "2000-07-14" "2000-07-21" "2000-07-28" "2000-08-04" "2000-08-11" "2000-08-18" "2000-08-25" "2000-09-01" "2000-09-08" "2000-09-15" "2000-09-22" "2000-09-29"
[40] "2000-10-06" "2000-10-13" "2000-10-20" "2000-10-27" "2000-11-03" "2000-11-10" "2000-11-17" "2000-11-24" "2000-12-01" "2000-12-08" "2000-12-15" "2000-12-22" "2000-12-29"
There are probably more elegant ways to do this, but here's one way to generate a vector of Fridays, given any year.
year = 2007
st <- as.POSIXlt(paste0(year, "/1/01"))
en <- as.Date(paste0(year, "/12/31"))
#get to the next Friday
skip_ahead <- 5 - st$wday
if(st$wday == 6) skip_ahead <- 6 #for Saturdays, skip 6 days ahead.
first.friday <- as.Date(st) + skip_ahead
dates <- seq(first.friday, to=en, by ="7 days")
dates
#[1] "2007-01-05" "2007-01-12" "2007-01-19" "2007-01-26"
# [5] "2007-02-02" "2007-02-09" "2007-02-16" "2007-02-23"
# [9] "2007-03-02" "2007-03-09" "2007-03-16" "2007-03-23"
I think this would be the most efficient way, and it also returns all the Fridays in the whole of 2013.
FirstWeek <- seq(as.Date("2013/1/1"), as.Date("2013/1/7"), "days")
seq(
FirstWeek[weekdays(FirstWeek) == "Friday"],
as.Date("2013/12/31"),
by = "week"
)
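Two small things the answers above leave open: the question asked for a data frame, and weekdays(dates) == "Friday" only works in an English locale. Here is a minimal sketch that addresses both by testing the ISO weekday number ("%u", where 5 is Friday) instead of the day name. The function name fridays_in_year and the column name friday are hypothetical.

```r
fridays_in_year <- function(year) {
  days <- seq(as.Date(sprintf("%d-01-01", year)),
              as.Date(sprintf("%d-12-31", year)),
              by = "day")
  # "%u" gives the weekday as "1" (Monday) to "7" (Sunday), whatever the locale
  data.frame(friday = days[format(days, "%u") == "5"])
}

fridays_2013 <- fridays_in_year(2013)
head(fridays_2013)
nrow(fridays_2013)
```

Swapping "5" for another digit gives any other weekday.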

Resources