Subset a string in r using gsub and regexpr

I need to change the following
test <- c("August 08, 2016, Hour 23",
"June 26, 2016, Hour 14",
"November 26, 2016, Hour 01")
test1 <- c("Wednesday:8pm-12pm:31days",
"Tuesday:7pm-10pm:6days|Today:7AM-6PM:7days")
Edit:-
In test1, I don't really care much about the day of the week; I am more interested in the timestamp. I would like to see 8PM-12PM converted into 24-hour format, e.g. 2000. I am agreeable to a string as output, since I require a 4-digit number. (Anything before 10 AM would need to be zero-padded, i.e. 0x.)
into two datasets as:-
a$date <- c(08/08/2016,06/26/2016,11/26/2016) # all in date class
a$hour <- c(23, 14 , 01) #all should be numeric
b$time <- c("2000","1922","0718") #can be character
b$days <- c(31,6,7) #needs to be numeric
The logic for the hour and days cases would be similar. I'm looking to use gsub and regexpr in R.
My current process for the date section is too long and tedious:-
mat <- as.data.frame(matrix(unlist(strsplit(test," ")),ncol=5,byrow=T))
mat$V6 <- str_replace_all(paste(as.numeric(str_replace_all(mat$V2,"[[:punct:]]","")),
"-",as.character(mat$V1),
"-",as.numeric(str_replace_all(mat$V3,"[[:punct:]]",""))),
"[[:space:]]","")
mat$V7 <- as.Date(mat$V6, format="%d-%B-%Y")
class(mat$V7)
mat$V8 <- as.numeric(as.character(mat$V5))
Any suggestions for using gsub and regexpr in both cases would be appreciated.

This does the same thing as your mat line. Go ahead and try it.
library(reshape2)
mat <- colsplit(test," ", c("M","D","YYYY","HR","Time"))
I think this is your best bet, instead of using gsub or regexpr.
mat$Len <- paste(mat$D,mat$M,mat$YYYY)
mat$Len <- gsub(",","",gsub(" ","-",mat$Len))
I am not a fan of nested gsub calls, but they serve a purpose here and keep this a bit more concise. This should take care of the mat$V6 line.
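Since the question asked specifically for gsub/sub, here is a minimal base R sketch of that route. It assumes an English locale for the month names, takes only the start of each time range in test1, and assumes the times are whole hours (the minutes are hard-coded as 00). The object names a_date, a_hour, b_days and b_time are made up for illustration.

```r
# Sample data copied from the question
test <- c("August 08, 2016, Hour 23",
          "June 26, 2016, Hour 14",
          "November 26, 2016, Hour 01")
test1 <- c("Wednesday:8pm-12pm:31days",
           "Tuesday:7pm-10pm:6days|Today:7AM-6PM:7days")

# Date: drop everything from ", Hour" onwards, then parse
# (assumption: the session locale uses English month names)
a_date <- as.Date(sub(", Hour.*$", "", test), format = "%B %d, %Y")

# Hour: keep only the digits that follow "Hour"
a_hour <- as.numeric(sub("^.*Hour ([0-9]+)$", "\\1", test))

# test1: one element per "|"-separated entry
entries <- unlist(strsplit(test1, "|", fixed = TRUE))

# Days: the digits immediately before the trailing "days"
b_days <- as.numeric(sub("^.*:([0-9]+)days$", "\\1", entries))

# Start time: the hour digits after the first colon, plus 12 when marked pm
# (assumption: whole hours only, so the minutes are always "00")
start_hour <- as.integer(sub("^[^:]+:([0-9]{1,2})[AaPp][Mm]-.*$", "\\1", entries))
is_pm <- grepl("^[^:]+:[0-9]{1,2}[Pp][Mm]-", entries)
b_time <- sprintf("%02d00", start_hour %% 12L + 12L * is_pm)
```

If the date has to be shown as month/day/year, format(a_date, "%m/%d/%Y") converts it to a string at display time while a_date itself stays a Date.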

Related

Formatting and Replacing Multiple Dates within a Single String in R

I have a question very similar to this one. The difference with mine is that I can have text with multiple dates within one string. All the dates are in the same format, as demonstrated below
rep <- "on the evening of june 11 2022, i was too tired to complete my homework that was due on august 4 2022. on august 25 2022 there will be a test "
All my sentences are lower case and all dates follow the %B %d %Y format. I'm able to extract all the dates using the following code:
> pattern <- paste(month.name, "[:digit:]{1,2}", "[:digit:]{4}", collapse = "|") %>%
regex(ignore_case = TRUE)
> str_extract_all(rep, pattern)
[[1]]
[1] "june 11 2022" "august 4 2022" "august 25 2022"
what I want to do is replace every instance of a date formatted %B %d %Y with the format %Y-%m-%d. I've tried something like this:
str_replace_all(rep, pattern, as.character(as.Date(str_extract_all(rep, pattern),format = "%B %d %Y")))
Which throws the error do not know how to convert 'str_extract_all' to class "Date". This makes sense to me since I'm trying to replace multiple different dates and R doesn't know which one to replace it with.
If I change the str_extract_all to just str_extract I get this:
"on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-06-11. on 2022-06-11 there will be a test "
Which again, makes sense since the str_extract is taking the first instance of a date, converting the format, and applying that same date across all instances of a date.
I would prefer if the solution used the stringr package just because most of my string tidying thus far has been using that package, BUT I am 100% open to any solution that gets the job done.

We can capture the pattern, i.e. one or more word characters (\\w+) followed by a space, then one or two digits (\\d{1,2}), another space, and four digits (\\d{4}), as a group ((...)), and in the replacement pass a function that converts each match to Date class and back to a string in the default %Y-%m-%d format
library(stringr)
# as.character() makes sure the replacement function returns a character vector
str_replace_all(rep, "(\\w+ \\d{1,2} \\d{4})", function(x) as.character(as.Date(x, "%b %d %Y")))
-output
[1] "on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-08-04. on 2022-08-25 there will be a test "
NOTE: It is better to name objects with different names as rep is a base R function name

You can pass a named vector with multiple replacements to str_replace_all():
library(stringr)
rep <- "on the evening of june 11 2022, i was too tired to complete my homework that was due on august 4 2022. on august 25 2022 there will be a test "
pattern <- paste(month.name, "[:digit:]{1,2}", "[:digit:]{4}", collapse = "|") %>%
regex(ignore_case = TRUE)
extracted <- str_extract_all(rep, pattern)[[1]]
replacements <- setNames(as.character(as.Date(extracted, format = "%B %d %Y")),
extracted)
str_replace_all(rep, replacements)
#> [1] "on the evening of 2022-06-11, i was too tired to complete my homework that was due on 2022-08-04. on 2022-08-25 there will be a test "
Created on 2022-05-26 by the reprex package (v2.0.1)
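For anyone who prefers to avoid extra packages, the same replacement can be sketched in base R with gregexpr() and the replacement form of regmatches(). This is only a sketch: it assumes an English locale for the month names, and the variable names txt and m are made up here (txt is used instead of rep to avoid masking the base function).

```r
txt <- "on the evening of june 11 2022, i was too tired to complete my homework that was due on august 4 2022. on august 25 2022 there will be a test "

# One alternation of all month names, followed by day and year
pattern <- paste0("(", paste(month.name, collapse = "|"), ") [0-9]{1,2} [0-9]{4}")

# Locate every date in the string, ignoring case
m <- gregexpr(pattern, txt, ignore.case = TRUE)

# Replace each match with its own reformatted date;
# format() on a Date gives "%Y-%m-%d" by default
regmatches(txt, m) <- lapply(
  regmatches(txt, m),
  function(d) format(as.Date(d, format = "%B %d %Y"))
)
txt
```

Each match is converted separately, so every date is replaced by its own value rather than by the first one found.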

Change date format with format() in R

So here's a basic algorithm in R that prints out the dates between two dates.
initial_date <- as.Date(toString((readline(prompt = "Enter a starting date in the format year-month-day:"))))
final_date <- as.Date(toString((readline(prompt = "Enter a final in the format year-month-day:"))))
dates <- seq(final_date, initial_date, by = "-1 day")
rev(dates[dates > initial_date & dates < final_date])
max.print = length(dates)
print(dates)
I would like to modify it so that the dates are in the format month-day-year like this: nov 27 2008. So I add "format(dates, format="%b %d %Y")".
initial_date <- as.Date(toString((readline(prompt = "Enter a starting date in the format year-month-day:"))))
final_date <- as.Date(toString((readline(prompt = "Enter a final in the format year-month-day:"))))
dates <- seq(final_date, initial_date, by = "-1 day")
format(dates, format="%b %d %Y")
rev(dates[dates > initial_date & dates < final_date])
max.print = length(dates)
print(dates)
But this keeps printing the same output as the previous code. How do I fix it?

There are a few points being misunderstood here:
format(dates, format="%b %d %Y") might be formatting it the way you want it to look, but it is not being stored, so the next command using dates is using the object as it was before the call to format(..). This as well as most R functions are functional, meaning that the effect of them is realized when it is stored in an object: calling the function itself has no side-effect. The "right" way to use format is to either print it right away (see far below) or to store it into the same or another variable. While I do not recommend doing this, a more functional use of this would have been
dates <- format(dates, format="%b %d %Y")
Ditto for rev(dates[...]): you need to use it immediately (as in print(rev(...)), i.e., the argument of an immediate function call) or store it somewhere else, such as
reversed_dates <- rev(dates[...])
In R, dates (proper Date-class) are number-like, so that one can safely make continuous-number comparisons such as date1 < date2 and date2 >= date3, etc. However, if you accidentally compare a %Y-%m-%d-string with another similarly-formatted string, then it will still work. It still works because strings are compared lexicographically. This means that when comparing strings "2020-01-01" and "2019-01-01", it will first compare "2" and "2", it's a tie; same with "0"s; then it will see that "2" > "1", and therefore "2019-01-01" comes before the other.
This still works, even as strings, because the components with the most-significance are years, and as long as they are first in the string, the relative ordering (>, sort, order) still works. This continues to work if the dates are 0-padded integers. This does not work if they are not 0-padded, where "2021-2-1" > "2021-11-1" is reported as TRUE; this is because it gets to the month portion and compares the "2" with the first "1" of "11", and does not see that the next digit makes the "1" greater than "2".
The moment one starts bringing in month names, this goes the same type of wrong, since the month names (in any language, perhaps?) are not ordered lexicographically (I don't know that this is an absolute truth, but it is certainly true in English and perhaps many/most western languages ... I'm not polyglot to speak for other languages). This means that "2020-Apr-01" < "2020-Jan-01" will again be TRUE, unfortunately.
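To make the comparison pitfalls concrete, here is a small sketch with made-up variable names. The string results assume a typical collation locale, where digits and ASCII letters sort in the usual order.

```r
# Zero-padded ISO strings happen to order correctly
padded_ok <- "2020-01-01" > "2019-01-01"

# Unpadded strings do not: "2" is compared with the first "1" of "11"
unpadded_chr <- "2021-2-1" > "2021-11-1"

# The same dates as Date objects compare as numbers
unpadded_date <- as.Date("2021-2-1") > as.Date("2021-11-1")

# Month names break the ordering too: "A" sorts before "J"
month_chr <- "2020-Apr-01" < "2020-Jan-01"

c(padded_ok = padded_ok, unpadded_chr = unpadded_chr,
  unpadded_date = unpadded_date, month_chr = month_chr)
```

Only the Date comparison gives the chronologically correct answer for the unpadded input; the string comparisons are right or wrong depending on how the text happens to be laid out.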
We'll combine the previous point about comparisons with the fact that in general, R will always print a Date-class object as "%Y-%m-%d"; there is no (trivial) way to get it to print a Date-class object as your "%b %d %Y" without either (a) converting it to a string and losing proper ordering; or (b) super-classing it so that it presents like you want on the console, but it is still a number underneath.
As for (a), this is a common thing to do for reports and labeling in plots, and I'm perfectly fine with that. I am not trying to convince the world that it should always see a date as %Y-%m-%d. However, what I am saying is that it is much easier to keep it as a proper Date-class object until you actually render it, and then format it at the last second. For this, do all of your filtering and ordering and then print(format(..)), such as this. I recommend this method.
dates <- seq(as.Date("2020-02-02"), as.Date("2020-02-06"), by = "day")
dates <- rev(dates[ dates > as.Date("2020-02-03") ])
print(format(dates, format = "%b %d %Y"))
# [1] "Feb 06 2020" "Feb 05 2020" "Feb 04 2020"
Again, above is the technique I recommend.
As for (b), yes, you can do it, but this approach is fragile since it is feasible that some functions that want Date-class objects will not immediately recognize that these are close enough to continue working as such; or they will strip the new class we assign at which point it will resort to "%Y-%m-%d"-format. You can use this, which requires that you change the class (see the # important line) of every Date-object you want to personalize the formatting. I recommend against doing this.
format.myDATE <- function(x, ...) { # fashioned after format.Date
xx <- format.Date(x, format = "%b %d %Y")
names(xx) <- names(x)
xx
}
print.myDATE <- function(x, max = NULL, ...) { # fashioned after print.Date
if (is.null(max))
max <- getOption("max.print", 9999L)
if (max < length(x)) {
print(format.myDATE(x[seq_len(max)]), ...)
cat(" [ reached 'max' / getOption(\"max.print\") -- omitted",
length(x) - max, "entries ]\n")
} else if (length(x))
print(format.myDATE(x), ...)
else cat(class(x)[1L], "of length 0\n")
invisible(x)
}
dates <- seq(as.Date("2020-02-02"), as.Date("2020-02-06"), by = "day")
class(dates) <- c("myDATE", class(dates)) ## important!
dates <- rev(dates[ dates > as.Date("2020-02-03") ])
print(dates) ## no need for format!
# [1] "Feb 06 2020" "Feb 05 2020" "Feb 04 2020"
### and number-like operations still tend to work
diff(dates)
# Time differences in days
# [1] -1 -1
Again, I recommend against doing this for data that you are working with. Many packages that pretty-print tables and plots and such may choose to override our preference for formatting, so there is no guarantee that this is honored across the board. This is why I suggest "accepting" the R way while working with it, regardless of your locale, and formatting it for your aesthetic preferences immediately before printing/rendering.
Another couple minor points:
remove toString, it's doing nothing for you here I think;
your use of max.print = ... suggests you think this is going to change anything else; most R things that have global options use options(...) for this, so you need to either set it globally in this R session with options(max.print=length(dates)), or a one-time limit with print(dates, max = length(dates)).
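Putting the points above together, a minimal sketch of the original script might look like the following. The two dates are hard-coded here as an assumption so that the example runs non-interactively; in the real script they would come from readline(). The names inner and out are made up for illustration.

```r
# Assumed example input; replace with as.Date(readline(...)) in interactive use
initial_date <- as.Date("2008-11-24")
final_date   <- as.Date("2008-11-28")

# Do all sequencing and filtering while the values are still Date objects
dates <- seq(initial_date, final_date, by = "day")
inner <- dates[dates > initial_date & dates < final_date]

# Format only at the last moment, and store the result so it is not lost
out <- format(inner, format = "%b %d %Y")

# max is passed to print() directly instead of assigning to a max.print variable
print(out, max = length(out))
```

The month abbreviations printed by %b depend on the session locale, so the exact strings may differ outside an English locale.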

Extracting Date from text using R

My dataframe looks like
df <- setNames(data.frame(c("2 June 2004, 5 words, ()(","profit, Insight, 2 May 2004, 188 words, reports, by ()("), stringsAsFactors = F), "split")
What I want is to split the column into date and words. So far I found
"Extract date text from string"
lapply(df2, function(x) gsub(".*(\\d{2} \\w{3} \\d{4}).*", "\\1", x))
But it's not working with my example. Thanks for the help, as always.

As there is only a single column, we can directly use gsub/sub after extracting the column. In the pattern, the day can have 1 or more digits; similarly, the month words have 3 ('May') or 4 ('June') characters, so we need to make those changes
sub(".*\\b(\\d{1,} \\w{3,4} \\d{4}).*", "\\1", df$split)
#[1] "2 June 2004" "2 May 2004"
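If the goal is separate columns, the extraction can be extended along these lines. This is a sketch under a few assumptions: the month is written out in English, every row contains a phrase of the form "<n> words", and perl = TRUE is used only so that the regex behaviour is the familiar PCRE one. The column names date and words are made up here.

```r
df <- data.frame(
  split = c("2 June 2004, 5 words, ()(",
            "profit, Insight, 2 May 2004, 188 words, reports, by ()("),
  stringsAsFactors = FALSE
)

# Date: day, month name (3 to 9 letters), four-digit year
date_txt <- sub(".*\\b(\\d{1,2} \\w{3,9} \\d{4}).*", "\\1", df$split, perl = TRUE)
df$date <- as.Date(date_txt, format = "%d %B %Y")

# Word count: the number directly in front of "words"
df$words <- as.integer(sub(".*\\b(\\d+) words.*", "\\1", df$split, perl = TRUE))

df[, c("date", "words")]
```

Keeping the date as a Date column means it can be sorted and filtered directly, and formatted only when it is displayed.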

Regex pattern questions in r

I need to match author and time from string in R.
test = "Postedby BeauHDon Friday November 24, 2017 #10:30PM from the cost-effective dept."
I am currently using gsub() to find the desired output.
Expected output would be:
#author
"BeauHDon"
#Month
"November"
#Date
24
#Time
22:30
I got to gsub("Postedby (.*).*", "\\1", test) but the output is
"BeauHDon Friday November 24, 2017 #10:30PM from the cost-effective dept."
Also, I understand that the time requires more coding after extracting 10:30.
Is it possible to add 12 if the next two characters are PM?
Thank you.

We can extract using capturing as a group (assuming that the patterns are as shown in the example). Here the pattern is to match one or more non-white spaces (\\S+) followed by spaces (\\s+) from the start (^) of the string, followed by word which we capture in a group (\\w+), followed by capturing word after we skip the next word and space, then get the numbers ((\\d+)) and the time that follows the #
v1 <- scan(text=sub("^\\S+\\s+(\\w+)\\s+\\w+\\s+(\\w+)\\s+(\\d+)[^#]+#(\\S+).*",
"\\1,\\2,\\3,\\4", test), what = "", sep=",", quiet = TRUE)
As the last entry is time, we can convert it to datetime with strptime and change the format, assign it to the last element
v1[4] <- format(strptime(v1[4], "%I:%M %p"), "%H:%M")
If needed, set the names of the element with author, Month etc.
names(v1) <- c("#author", "#Month", "#Date", "#Time")
v1
# #author #Month #Date #Time
#"BeauHDon" "November" "24" "22:30"
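On the "add 12 if it is PM" part of the question: yes, that can be done with plain arithmetic once the hour and the AM/PM marker are captured separately. Below is a minimal sketch using regexec() and regmatches(); the pattern assumes the string always has the layout shown in the question, and the names parts, offset, hour24 and result are made up for illustration.

```r
test <- "Postedby BeauHDon Friday November 24, 2017 #10:30PM from the cost-effective dept."

# Groups: author, month, day, hour, minute, AM/PM marker
pattern <- "^\\S+\\s+(\\w+)\\s+\\w+\\s+(\\w+)\\s+(\\d+),\\s+\\d{4}\\s+#(\\d{1,2}):(\\d{2})([AP]M)"
parts <- regmatches(test, regexec(pattern, test, perl = TRUE))[[1]]
# parts[1] is the full match; parts[2] to parts[7] are the groups

# 12-hour to 24-hour: 12 wraps to 0, then add 12 for PM
offset <- if (parts[7] == "PM") 12L else 0L
hour24 <- as.integer(parts[5]) %% 12L + offset

result <- list(
  author = parts[2],
  month  = parts[3],
  date   = as.integer(parts[4]),
  time   = sprintf("%02d:%s", hour24, parts[6])
)
result
```

The modulo step matters for the edge cases: 12:30AM becomes 00:30 and 12:30PM stays 12:30.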

converting numbers to time

I entered my data by hand, and to save time I didn't include any punctuation in my times. So, for example, 8:32am I entered as 832. 3:34pm I entered as 1534. I'm trying to use the 'chron' package (http://cran.r-project.org/web/packages/chron/chron.pdf) in R to convert these to time format, but chron seems to require a delimiter between the hour and minute values. How can I work around this or use another package to convert my numbers into times?
And if you'd like to criticize me for asking a question that's already been answered before, please provide a link to said answer, because I've searched and haven't been able to find it. Then criticize away.

I think you don't need the chron package necessarily. When:
x <- c(834, 1534)
Then:
time <- substr(as.POSIXct(sprintf("%04.0f", x), format='%H%M'), 12, 16)
time
[1] "08:34" "15:34"
should give you the desired result. When you also want to include a variable which represents the date, you can use the following line of code (the format has no %S, because the padded time only contains hours and minutes):
df$datetime <- as.POSIXct(paste(df$yymmdd, sprintf("%04.0f", df$x)), format='%Y%m%d %H%M')
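As a variation on the same idea, format() can be used instead of substr() so that the result does not depend on character positions, and for a plain "HH:MM" string no date parsing is needed at all. This is a minimal sketch; the variable names and the fixed UTC time zone are assumptions made for the example.

```r
x <- c(834, 1534)

# Zero-pad to four digits so that 834 becomes "0834"
padded <- sprintf("%04d", as.integer(x))

# Variant 1: parse as hour and minute, then format back out
time_parsed <- format(strptime(padded, format = "%H%M", tz = "UTC"), "%H:%M")

# Variant 2: integer arithmetic only, no parsing involved
time_arith <- sprintf("%02d:%02d", as.integer(x) %/% 100L, as.integer(x) %% 100L)

time_parsed
time_arith
```

Both variants return character vectors; the parsing variant has the side benefit that an invalid entry such as 2575 comes back as NA instead of passing through silently.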

Here's a sub solution using a regular expression:
set.seed(1); times <- paste0(sample(0:23,10), sample(0:59,10)) # ex. data
sub("(\\d+)(\\d{2})", "\\1:\\2", times) # put in delimiter
# [1] "6:12" "8:10" "12:39" "19:21" "4:43" "17:27" "18:38" "11:52" "10:19" "0:57"

Say
x <- c('834', '1534')
The last two characters represent minutes, so you can extract them using
mins <- substr(x, nchar(x)-1, nchar(x))
Similarly, extract hours with
hour <- substr(x, 0, nchar(x)-2)
Then create a fixed vector of time values with
time <- paste0(hour, ':', mins)
I think you are forced to specify dates in the chron package, so assuming a date value, you can convert to chron with this:
library(chron)
chron(dates.=rep('02/02/02', 2),
times.=paste0(hour, ':', mins, ':00'),
format=c(dates='m/d/y',times='h:m:s'))

I thought I'd throw out a non-regex solution that uses lubridate. This is probably overkill.
library(lubridate)
library(stringr)
time.orig <- c('834', '1534')
# zero pad times before noon
time.padded <- str_pad(time.orig, 4, pad="0")
# parse using lubridate
time.period <- hm(time.padded)
# make it look like time
time.pretty <- paste(hour(time.period), minute(time.period), sep=":")
And you end up with
> time.pretty
[1] "8:34" "15:34"

Here are two solutions that do not use regular expressions:
library(chron)
x <- c(832, 1534, 101, 110) # test data
# 1
times( sprintf( "%d:%02d:00", x %/% 100, x %% 100 ) )
# 2
times( ( x %/% 100 + x %% 100 / 60 ) / 24 )
Either gives the following chron "times" object:
[1] 08:32:00 15:34:00 01:01:00 01:10:00
ADDED second solution.
