Splitting character string in R - Extracting the timestamp - r

Thank you in advance for any feedback.
I am attempting to clean some data in R where a time stamp and a text string are included together in the same cell. I am not getting the expected result. I know the regex needs validation work, but just testing out this particular function
Expected:
"04/05/2018 17:14:35" " -(Additional comments) update"
Actual:
"04/05/2018 17:14:35 -(Additional comments) update"
What I tried:
string <- "04/05/2018 17:14:35 -(Additional comments) update"
pattern <- "[:digit:][:digit:][:punct:]
[:digit:][:digit:][:punct:]
[:digit:][:digit:][:digit:][:digit:]
[[:space:]]
[:digit:][:digit:]
[:punct:]
[:digit:][:digit:]
[:punct:]
[:digit:][:digit:]"
strsplit(string, pattern)
I also tried this variation, same result
pattern <- "[:digit:][:digit:]\\/
[:digit:][:digit:]\\/
[:digit:][:digit:][:digit:][:digit:]
[[:space:]]
[:digit:][:digit:]
\\:
[:digit:][:digit:]
\\:
[:digit:][:digit:]"

You can try :
string <- "04/05/2018 17:14:35 -(Additional comments) update"
gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2}).*","\\1", string)
#[1] "04/05/2018 17:14:35"
#RHS part
gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2})(.*)","\\2", string)
#" -(Additional comments) update"
Regex explanation:
\\d{2} - 2 digits
\\d{4} - 4 digits
/ - separator
: - separator
() - Group for selection
.* - Followed by anything
Seems OP is very keen on using strsplit. One option could be as:
strsplit(gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2})(.*)",
paste("\\1","####","\\2",sep=""), string), split = "####")
# [[1]]
# [1] "04/05/2018 17:14:35" " -(Additional comments) update"

Try this:
sub('-.*','',string)
[1] "04/05/2018 17:14:35 "

Related

Remove all punctuation except underline between characters in R with POSIX character class

I would like to use R to remove all underlines expect those between words. At the end the code removes underlines at the end or at the beginning of a word.
The result should be
'hello_world and hello_world'.
I want to use those pre-built classes. Right know I have learn to expect particular characters with following code but I don't know how to use the word boundary sequences.
test<-"hello_world and _hello_world_"
gsub("[^_[:^punct:]]", "", test, perl=T)
You can use
gsub("[^_[:^punct:]]|_+\\b|\\b_+", "", test, perl=TRUE)
See the regex demo
Details:
[^_[:^punct:]] - any punctuation except _
| - or
_+\b - one or more _ at the end of a word
| - or
\b_+ - one or more _ at the start of a word
One non-regex way is to split and use trimws by setting the whitespace argument to _, i.e.
paste(sapply(strsplit(test, ' '), function(i)trimws(i, whitespace = '_')), collapse = ' ')
#[1] "hello_world and hello_world"
We can remove all the underlying which has a word boundary on either of the end. We use positive lookahead and lookbehind regex to find such underlyings. To remove underlying at the start and end we use trimws.
test<-"hello_world and _hello_world_"
gsub("(?<=\\b)_|_(?=\\b)", "", trimws(test, whitespace = '_'), perl = TRUE)
#[1] "hello_world and hello_world"
You could use:
test <- "hello_world and _hello_world_"
output <- gsub("(?<![^\\W])_|_(?![^\\W])", "", test, perl=TRUE)
output
[1] "hello_world and hello_world"
Explanation of regex:
(?<![^\\W]) assert that what precedes is a non word character OR the start of the input
_ match an underscore to remove
| OR
_ match an underscore to remove, followed by
(?![^\\W]) assert that what follows is a non word character OR the end of the input

Insert characters when a string changes its case R

I would like to insert characters in the places were a string change its case. I tried this to insert a '\n' after a fixed number of characters and then a ' ', as I don't figure out how to detect the case change
s <-c("FloridaIslandE7", "FloridaIslandE9", "Meta")
gsub('^(.{7})(.{6})(.*)$', '\\1\\\n\\2 \\3', s )
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
This works because the positions are fixed but I would like to know how to do it for the general case.
Surely there's a less convoluted regex for this, but you could try:
gsub('([A-Z][0-9])', ' \\1', gsub('([a-z])([A-Z])', '\\1\n\\2', s))
Output:
[1] "Florida\nIsland E7" "Florida\nIsland E9" "Meta"
Here is an option
str_replace_all(s, "(?<=[a-z])(?=[A-Z])", "\n")
#[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"
If you really want to insert \n, try this:
gsub("([a-z])([A-Z])", "\\1\\\n\\2", s)
[1] "Florida\nIsland\nE7" "Florida\nIsland\nE9" "Meta"

Regex find the string between last two quotes " "?

For example, this is my string -> abcd 1234abcda="author 1" content="author 2.">\n
I only want the string author 2. by using the function str_extract() in R. How can I use regex to do that? Thank you so much.
You can use :
string = 'abcd 1234abcda="author 1" content="author 2.">\n'
sub('.*"(.*)".*', '\\1', string)
#[1] "author 2."
With str_match
library(stringr)
str_match(string, '.*"(.*)"')[, 2]
Another option is to extract all the values with "author" followed by a number and select the last one using tail.
tail(str_extract_all(string, 'author \\d+')[[1]], 1)

Regex for matching between a colon and last newline prior to next colon

I am trying to parse a string with regex to pull out information between a colon and the last newline prior to the next colon. How can I do this?
string <- "Name: Al's\nPlace\nCountry:\nState\n/ Province: RI\n"
stringr::str_extract_all(string, "(?<=:)(.*)(?:\\n)")
but I get:
[[1]]
[1] " Al's\n" " \n" " RI\n"
when I want:
[[1]]
[1] " Al's\nPlace\n" " \n" " RI\n"
I'm not sure if this is what you're after as your wanted output looks a bit different.
:((?:.*\\n?)+?)(?=.*:|$)
: match a colon
((?:.*\n?)+?) match and capture lazily any lines (to optional \n)
(?=.*:|$) until there is a line with colon ahead
See this demo at regex101

Replacing a special character does not work with gsub

I have a table with many strings that contain some weird characters that I'd like to replace with the "original" ones. Ä became ä, ö became ö, so I replace each ö with an ö in the text. It works, however, ß became à < U+009F> and I am unable to replace it...
# Works just fine:
gsub('ö', 'REPLACED', "Testing string ö")
# this does not work
gsub("Ã<U+009F>", "REPLACED", "Testing string Ã<U+009F> ")
# this does not work as well...
gsub("â<U+0080><U+0093>", "REPLACED", "Testing string â<U+0080><U+0093> ")
How do I tell R to replace These parts with some letter I want to insert?
As there are metacharacters (+ - to signify one or more), in order to evaluate it literally either escape (as #boski mentioned in the solution) or use fixed = TRUE
sub("Ã<U+009F>", "REPLACED", "Testing string Ã<U+009F> ", fixed = TRUE)
#[1] "Testing string REPLACED "
You have to escape the + symbol, as it is a regex command.
> gsub("Ã<U\\+009F>", "REPLACED", "Testing string Ã<U+009F> ")
[1] "Testing string REPLACED "
> gsub("â<U\\+0080><U\\+0093>", "REPLACED", "Testing string â<U+0080><U+0093> ")
[1] "Testing string REPLACED "

Resources