I have a regex problem or somewhat regex related problem...
I have strings that look like this:
"..........))))..)))))))"
"....))))))))...)).))))..))"
"......))))...)))...)))))"
I want to remove the initial dot sequence, so that I only get the string starting at the first occurrence of the ")" symbol. The output would be something like:
"))))..)))))))"
"))))))))...)).))))..))"
"))))...)))...)))))"
I assume it would be somewhat similar to a lookahead regex but cannot figure out the correct one...
Any help?
Thanks
We match 0 or more dots (\\.*) from the start (^) of the string and replace them with an empty string
sub("^\\.*", "", v1)
#[1] "))))..)))))))" "))))))))...)).))))..))" "))))...)))...)))))"
If it needs to start from ), then, as above, match 0 or more dots up to the first ) and replace the match with )
sub("^\\.*\\)", ")", v1)
#[1] "))))..)))))))" "))))))))...)).))))..))" "))))...)))...)))))"
data
v1 <- c("..........))))..)))))))", "....))))))))...)).))))..))", "......))))...)))...)))))")
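To see why the ^ anchor matters, here is a small sketch. The vector x is made up for illustration and includes a string that has no leading dots; + is used instead of * so that the unanchored pattern has to match at least one dot.

```r
# x is a hypothetical test vector; the second element has no leading dots
x <- c("...))..)", "))..))")

# Anchored: only a run of dots at the very start of the string is removed
anchored <- sub("^\\.+", "", x)

# Unanchored: sub() removes the first run of dots wherever it occurs,
# so the interior dots of the second string are removed as well
unanchored <- sub("\\.+", "", x)

print(anchored)
print(unanchored)
```

With the anchor, the second string is returned unchanged, which is usually what you want when the goal is only to strip a leading prefix.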
You can simply remove dots from the beginning of the line (marked in the regex by ^) until you reach a non-dot character:
a <- "..........))))..)))))))"
b <- "....))))))))...)).))))..))"
c <- "......))))...)))...)))))"
sub("^\\.*", "", a) # "))))..)))))))"
sub("^\\.*", "", b) # "))))))))...)).))))..))"
sub("^\\.*", "", c) # "))))...)))...)))))"
The way your question is worded, the goal isn't to remove just . from the beginning, but any symbol until the first ) is encountered. So this answer is a more general solution.
stringr::str_extract("..........))))..)))))))","\\).*$")
Alternatively, if you want to stick with base R, you could use sub/gsub like this:
gsub("[^\\)]*(\\).*$)","\\1","..........))))..)))))))")
sub("[^\\)]*","","..........))))..)))))))")
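Another base R way to express the same "extract from the first ) onwards" idea is regexpr() with regmatches(). This is only a sketch; the vector x and its extra elements are made up to show what happens when a string contains no ) at all.

```r
# x is a hypothetical test vector; the last element contains no ")"
x <- c("..........))))..)))))))", "ab.))..", "no paren")

# regexpr() returns the start of the first match, or -1 if there is none
m <- regexpr("\\).*$", x)

# regmatches() silently drops the elements that did not match
matched <- regmatches(x, m)

# To keep one result per input, fill the non-matches with NA instead
start <- as.integer(m)
out <- rep(NA_character_, length(x))
out[start > 0] <- substring(x[start > 0], start[start > 0])

print(matched)
print(out)
```

The sub() versions above return a non-matching string unchanged, so which variant fits depends on how you want inputs without a ) to be handled.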
Related
I have to find different patterns in a data frame column, once it is found, the next letter should be wrapped between parentheses:
Data:
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
if the pattern is: '(acetyl)'
this is the output that I'd like to achieve:
Expected output:
b <- c('(R)KJOEQLKQ', 'LDFEION(E)FNEOW')
I know that I can find the pattern with gsub:
b <- gsub('(acetyl)', replacement = '', a)
However, I'm not sure how to approach the wrapping between the parenthesis of the next letter after the pattern is found.
Any help would be appreciated.
You can use
a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
gsub('\\(acetyl\\)(.)', '(\\1)', a)
## => [1] "(R)KJOEQLKQ" "LDFEION(E)FNEOW"
Details:
\(acetyl\) - matches a literal string (acetyl)
(.) - captures into Group 1 any single char
The (\1) replacement pattern replaces the matches with ( + Group 1 value + ).
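If there are several different patterns to look for, the escaping can be left to the regex engine. Below is a minimal sketch, assuming the tags never contain the sequence \E; wrap_next is a hypothetical helper name, not something from a package. It relies on \Q...\E quoting, which requires perl = TRUE.

```r
# Hypothetical helper: wrap the character that follows a literal tag in parentheses.
# \\Q...\\E quotes the tag, so its parentheses are treated as literal characters.
wrap_next <- function(x, tag) {
  pattern <- paste0("\\Q", tag, "\\E(.)")
  gsub(pattern, "(\\1)", x, perl = TRUE)
}

a <- c('(acetyl)RKJOEQLKQ', 'LDFEION(acetyl)EFNEOW')
res <- wrap_next(a, "(acetyl)")
print(res)

# A tag at the very end of a string has no following character, so nothing changes
tail_case <- wrap_next("ABC(acetyl)", "(acetyl)")
print(tail_case)
```

The helper can then be called once per tag, for example inside a loop over the patterns you need to handle.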
I want to remove the comma and the apostrophe but keep the decimal point, and then convert the result to numeric.
I have this:
characterExample <- "234'564,900.99"
I want 234564900.99
I tried the following, but it doesn't work:
result <- gsub("[:punct:].","", characterExample)
Another option is to explicitly remove the characters you want to remove:
gsub("[',]", "", characterExample)
#[1] "234564900.99"
An option is to not match the digits or the . by using ^ within the square bracket
gsub("[^0-9.]+","", characterExample)
#[1] "234564900.99"
Or another option is to make use of SKIP/FAIL for the ., while matching the rest of the punct
gsub("(\\.)(*SKIP)(*F)|[[:punct:]]+", "", characterExample, perl = TRUE)
#[1] "234564900.99"
NOTE: Both solutions make sure that it matches any punct characters other than the . and replace with blank ("")
You can also use the pipe (alternation) symbol like this:
#Code
gsub(",|'","", characterExample)
Output:
gsub(",|'","", characterExample)
[1] "234564900.99"
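None of the answers above shows the final conversion step the question asks for, so here is a short sketch of it. The variable names cleaned and num are just for illustration. As a side note, the original attempt most likely failed because a POSIX class such as [:punct:] only has its special meaning inside a bracket expression, i.e. written as [[:punct:]].

```r
characterExample <- "234'564,900.99"

# Remove only the apostrophe and the comma, keeping the decimal point
cleaned <- gsub("[',]", "", characterExample)

# Convert the cleaned string to a number
num <- as.numeric(cleaned)

print(cleaned)
print(is.numeric(num))
```

Note that printing num with the default settings may show fewer digits than are stored; the value itself keeps the decimals.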
I have a string which looks like this:
something-------another--thing
I want to replace the multiple dashes with a single one.
So the expected output would be:
something-another-thing
We can try using gsub here:
x <- "something-------another--thing"
gsub("-{2,}", "-", x)
[1] "something-another-thing"
More generally, if we want to replace any sequence of two or more of the same character with just the single character, then use this version:
x <- "something-------another--thing"
gsub("(.)\\1+", "\\1", x)
The second pattern could use an explanation:
(.) match AND capture any single character
\\1+ then match that same character again, one or more times
Then, we replace with just the single captured letter.
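One caveat with the general version: it also collapses legitimate double letters. A possible refinement, sketched below, is to capture only the characters you actually want to squeeze. The input string and the choice of - and _ are made up for illustration.

```r
# x is a hypothetical input containing a legitimate double letter ("oo")
x <- "too--many___marks"

# General version: every repeated character is collapsed, including "oo"
collapsed_all <- gsub("(.)\\1+", "\\1", x)

# Restricted version: only runs of "-" or "_" are collapsed
collapsed_some <- gsub("([-_])\\1+", "\\1", x)

print(collapsed_all)
print(collapsed_some)
```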
You can do it with gsub and a regex.
> text='something-------another--thing'
> gsub('-{2,}','-',text)
[1] "something-another-thing"
t2 <- "something-------another--thing"
library(stringr)
str_replace_all(t2, pattern = "-+", replacement = "-")
which gives:
[1] "something-another-thing"
If you're searching for the right regex to search for a string, you can test it out here https://regexr.com/
In the above, a single hyphen would be pattern = "-"; adding the plus quantifier makes the pattern match one or more consecutive hyphens, so we get pattern = "-+"
I have a column that looks like this:
Item_Number
R8934nr fd
4hgsi32df
Miognse daf
I only want to keep the first 7 characters and remove the rest. I am new to R and I tried:
gsub(Item_Number, '', '[7]')
Using sub is one option, as you suggested. This answer uses a lookbehind pattern to remove everything after the first 7 characters of the string.
Item_Number = "1234567890"
sub("(?<=^.{7}).*", "", Item_Number, perl=TRUE)
[1] "1234567"
If you only want to keep letters (dropping spaces and digits) before truncating to 7 characters, something like this would do it.
your.string <- "R8934nr fd"
your.string <- gsub(" ","",your.string)
your.string <- gsub("[[:digit:]]+","",your.string)
your.string <- substr(your.string,1,7)
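If the goal is simply the first 7 characters of every value, whatever they are, a regex is not needed at all: substr() is vectorised over a column. A minimal sketch, assuming the column lives in a data frame called df (a made-up name):

```r
# df is a hypothetical data frame holding the column from the question
df <- data.frame(
  Item_Number = c("R8934nr fd", "4hgsi32df", "Miognse daf"),
  stringsAsFactors = FALSE
)

# Keep characters 1 to 7 of every entry
df$Item_Number <- substr(df$Item_Number, 1, 7)

print(df$Item_Number)
```

Strings shorter than 7 characters are returned as they are.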
I used a code of regular expressions which only took stuff before the 2nd occurrence of a dot. The following is the code:-
colnames(final1)[i] <- gsub("^([^.]*.[^.]*)..*$", "\\1", colnames(final)[i])
But now I realized I wanted to take the stuff before the first occurrence of a pattern of 2 dots.
I tried
gsub(",.*$", "", colnames(final)[i]) (changed the , to ..)
gsub("...*$", "", colnames(final)[i])
But it didn't work
The example to try on
KC1.Comdty...PX_LAST...USD......Comdty........
converted to
KC1.Comdty.
or
"LIT.US.Equity...PX_LAST...USD......Comdty........"
to
"LIT.US.Equity."
Can anyone suggest anything?
Thanks
We could use sub to match 2 or more dots followed by other characters and replace it with blank
sub("\\.{2,}.*", "", str1)
#[1] "KC1.Comdty" "LIT.US.Equity"
The . is a metacharacter implying any character. So, we need to escape (\\.) to get the literal meaning of the character
data
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
Another solution with strsplit:
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
sapply(strsplit(str1, "\\.{2}\\w"), "[", 1)
# [1] "KC1.Comdty." "LIT.US.Equity."
To also include the dot at the end with @akrun's answer, one can do:
sub("\\.{2}\\w.*", "", str1)
# [1] "KC1.Comdty." "LIT.US.Equity."
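The last two patterns rely on the run of dots being followed by a word character. A variant that does not depend on what follows the dots is to capture the first dot of a run of two or more dots and put it back in the replacement. This is a sketch under the assumption that "keep one dot, drop the rest" is the intended rule; the extra string "a..b" is made up to show the two-dot case.

```r
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......",
          "LIT.US.Equity...PX_LAST...USD......Comdty........")

# (\\.) captures the first dot of the run, \\.+ requires at least one more dot,
# and .* consumes the rest of the string; only the captured dot is kept
res <- sub("(\\.)\\.+.*", "\\1", str1)
print(res)

# Hypothetical extra case: a run of exactly two dots
two_dots <- sub("(\\.)\\.+.*", "\\1", "a..b")
print(two_dots)
```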