Replacing all occurrences of a pattern in a string - r

I am used to working in R with numbers and matrices, but when it comes to strings and characters I am lost. I want to analyze some data where the time is read into R as follows:
>my.time.char[1]
[1] "\"2011-10-05 15:55:00\""
I want to end up with a string containing only:
"2011-10-05 15:55:00"
Using the function sub() (which I barely understand...), I got the following result:
> sub("(\")","",my.time.char[1])
[1] "2011-10-05 15:55:00\""
This is closer to the format I am looking for, but I still need to get rid of the trailing escaped quote (\").

The second line from ?sub explains:
sub and gsub perform replacement of the first and all matches respectively.
which should tell you to use gsub instead.
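As a minimal sketch of that advice, assuming the input looks like the string shown in the question (the result variable names are only illustrative):

```r
# Example value copied from the question: the string itself contains literal quotes.
my.time.char <- "\"2011-10-05 15:55:00\""

# sub() replaces only the first match, so the trailing quote survives.
first_only <- sub("\"", "", my.time.char)

# gsub() replaces every match, so both quotes are removed.
cleaned <- gsub("\"", "", my.time.char)

# fixed = TRUE treats the pattern as a literal string instead of a regex,
# which is enough when the thing to remove is a plain character.
cleaned_fixed <- gsub("\"", "", my.time.char, fixed = TRUE)

print(first_only)
print(cleaned)
```

Note that print() shows embedded quotes as \", so the backslash seen in the console is part of the display, not a character in the string.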

Related

Match everything up until first instance of a colon

I'm trying to write a regex in R that matches everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?
Your regex itself is fine. The error comes from the missing perl=TRUE: the lookahead (?=:) is only supported by the Perl-compatible engine. Even with that fixed, I don't think grep is the right tool here.
I would recommend using:
stringr::str_extract( time, "\\d+?(?=:)")
grep works differently from how it is being used here: it is good for finding which elements of a vector match a pattern, but it cannot pluck a substring out of a string.
If you want to stay in base R, you can also use sub:
sub("^(\\d+?)(?=:)(.*)$","\\1",time, perl=TRUE)
Alternatively, you can split the string with strsplit and keep the first piece:
strsplit(time, ":")[[1]][1]
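For comparison, here is a rough base R sketch of three ways to get the part before the first colon without stringr. The pattern "^[^:]+(?=:)" and the result variable names are my own choices for illustration, not taken from the answer above.

```r
time <- "12:05:41"

# regexpr() locates the first match and regmatches() extracts it.
# perl = TRUE is required because the lookahead (?=:) is a PCRE feature.
via_lookahead <- regmatches(time, regexpr("^[^:]+(?=:)", time, perl = TRUE))

# No lookahead needed: delete the first colon and everything after it.
via_sub <- sub(":.*$", "", time)

# Split on literal colons and keep the first piece.
via_split <- strsplit(time, ":", fixed = TRUE)[[1]][1]

print(c(via_lookahead, via_sub, via_split))
```

All three are vectorised in spirit, but strsplit returns a list, so the [[1]][1] indexing shown here only handles a single input string.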

How to remove characters before matching pattern and after matching pattern in R in one line?

I have this vector: Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021"). I want to remove everything before SS and everything from T0 onward (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)
We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"
Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can keep a pattern and remove everything outside of it in two steps. Step 1 is to put the pattern we'd like to keep in parentheses. Step 2 is to reference the number of the parenthesized group we'd like to keep, since there may be more than one. For example:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
By the way, .* is a wildcard: . matches any single character and * repeats it zero or more times. If you'd like to learn more about using regex in R, the ?regex help page can get you started.
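One caveat with the greedy .* used above: it grabs as much as it can, so the result depends on where the last T0 sits. The sketch below contrasts it with a lazy variant; the lazy pattern and the input called tricky are hypothetical additions for illustration, not part of the original answers.

```r
Target <- c("tes_1123_SS1G_340T01", "tes_23_SS2G_340T021")

# Greedy pattern from the answer above.
greedy <- gsub(".*(SS.*)T0.*", "\\1", Target)

# Lazy variant: .*? stops at the first "SS", and the group stops at the
# first "T0" after it. perl = TRUE gives PCRE semantics for lazy quantifiers.
lazy <- sub("^.*?(SS.*?)T0.*$", "\\1", Target, perl = TRUE)

# Hypothetical input containing "T0" twice, to show where the two differ.
tricky <- "tes_9_SS1G_340T01_T02"
tricky_greedy <- sub(".*(SS.*)T0.*", "\\1", tricky, perl = TRUE)
tricky_lazy <- sub("^.*?(SS.*?)T0.*$", "\\1", tricky, perl = TRUE)

print(greedy)
print(lazy)
print(c(tricky_greedy, tricky_lazy))
```

On the vector from the question both versions agree; they only diverge when the delimiter appears more than once, so which one is appropriate depends on the real data.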

Subsetting different length strings by spaces in R

In R, I currently have a long vector of dates and times saved as strings. Depending on the date, a string can be 16, 17 or 18 characters long, so I cannot just take the first 8 or 10 characters, since that would not work for every date. But since there is a space between the date and the time, how can I subset each string so that I only get the characters before the space?
Here are a couple of examples of what the strings look like now:
"4/18/1950 0:00:00"
"6/8/1951 0:00:00"
"11/15/1951 0:00:00"
I'm not sure whether you are familiar with regular expressions; if not, they are well worth learning, as they are extremely useful.
As akrun pointed out you can use the "sub" command to remove the space and everything after it like this:
sub(" .*","",stringVar)
The first argument is the regular expression, which matches the space and everything that follows it.
The second argument is what to replace the match with, in this case nothing.
The third argument is the input string.
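Alternatively, you can split each string at the space with "strsplit" and keep the first piece. Note that strsplit returns a list with one element per input string, so the first piece has to be taken from each element:
sapply(strsplit(stringVar, " "), "[", 1)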
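A small sketch of both approaches on the example strings from the question. The conversion to Date at the end is an optional extra that assumes the month/day/year order shown in the examples.

```r
stringVar <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00", "11/15/1951 0:00:00")

# sub() is vectorised: it drops the first space and everything after it
# in every element.
dates_sub <- sub(" .*", "", stringVar)

# strsplit() returns a list with one character vector per input string;
# vapply() with "[" takes the first piece of each and checks the result type.
dates_split <- vapply(strsplit(stringVar, " ", fixed = TRUE), "[", character(1), 1)

# Optional: parse the text into Date objects (assumes month/day/year order).
parsed <- as.Date(dates_sub, format = "%m/%d/%Y")

print(dates_sub)
print(parsed)
```

Because the pattern keys on the space rather than on a character count, it does not matter whether the date part is 8, 9 or 10 characters long.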

how to use grep in R to get the specified character?

I have
str=c("00005.profit", "00005.profit-in","00006.profit","00006.profit-in")
and I want to get
"00005.profit" "00006.profit"
How can I achieve this using grep in R?
Here is one way:
> s <- c("00005.profit", "00005.profit-in","00006.profit","00006.profit-in")
> unique(gsub("([0-9]+\\.profit).*", "\\1", s))
[1] "00005.profit" "00006.profit"
We define a regular expression as digits followed by .profit, and capture it by wrapping the expression in parentheses. The \\1 then recalls the first captured group, and since we recall nothing else, that is all we get back. The unique() then reduces the four items to two unique ones.
Dirk's answer is pretty much the ideal generalisable answer, but here are a couple of other options based on the fact that your example always has a - character starting the part you wish to chop off:
1: gsub to return everything prior to the -
gsub("(.+)-.+","\\1",str)
2: strsplit on - and keep only the first part.
sapply(strsplit(str,"-"),head,1)
Both return:
[1] "00005.profit" "00005.profit" "00006.profit" "00006.profit"
which you can then wrap in unique to not return duplicates like:
unique(gsub("(.+)-.+","\\1",str))
unique(sapply(strsplit(str,"-"),head,1))
These will then return:
[1] "00005.profit" "00006.profit"
Another non-generalisable solution would be to just take the first 12 characters (assuming string length for the part you want to keep doesn't change):
unique(substr(str,1,12))
[1] "00005.profit" "00006.profit"
I'm actually interpreting your question differently. I think you might want
grep("[0-9]+\\.profit$",str,value=TRUE)
That is, if you only want the strings that end with profit. The $ special character stands for "end of string", so it excludes cases that have additional characters at the end ... The \\. means "I really want to match a dot, not any character at all" (a . by itself would match any character). You weren't entirely clear about your target pattern -- you might prefer "0+[1-9]\\.profit$" (any number of zeros followed by a single non-zero digit), or even "0{4}[1-9]\\.profit$" (4 zeros followed by a single non-zero digit).
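To make the two readings of the question concrete, here is a short sketch contrasting filtering with grep and rewriting with gsub. The variable is named s rather than str only to avoid shadowing the base function str(); the result names are illustrative.

```r
s <- c("00005.profit", "00005.profit-in", "00006.profit", "00006.profit-in")

# Filtering: grep(value = TRUE) returns only the elements that match,
# left unchanged. The $ anchor rejects anything after ".profit".
kept <- grep("[0-9]+\\.profit$", s, value = TRUE)

# Rewriting: gsub() edits every element, so all four are trimmed
# and unique() is needed to collapse the duplicates.
trimmed <- unique(gsub("([0-9]+\\.profit).*", "\\1", s))

# grepl() gives the same filter as a logical mask, handy for subsetting.
mask <- grepl("\\.profit$", s)

print(kept)
print(trimmed)
print(s[mask])
```

On this input both routes give the same two strings, but they would differ if, say, a "-in" entry existed without a matching plain ".profit" entry: grep would drop it, while gsub would still produce its trimmed form.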

Resources