I have a column of a dataframe in R like this:
names <- data.frame(name=c("ABC", "ABC-D", "ABCD-"))
I would like to remove the hyphen at the end of the strings while maintaining the hyphen in the middle of them. I've tried a few expressions like:
names$name <- gsub("+-\\w", "", names$name)
# the desired output is "ABC", "ABC-D", and "ABCD", respectively
While several combinations remove the hyphens entirely, I'm not sure how to specify the string boundary and the hyphen together.
Thanks!
Try:
gsub("\\-$", "", names$name)
# [1] "ABC" "ABC-D" "ABCD"
$ anchors the match to the end of the string, so only a trailing hyphen is removed.
Outside a character class, - has no special meaning in a regex, so the escape is unnecessary and this works too:
gsub("-$", "", names$name)
#[1] "ABC" "ABC-D" "ABCD"
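A minimal sketch, assuming the same sample values as the question, that contrasts an unanchored pattern with the anchored one. The vector name x is made up for illustration, and sub is used for the anchored case because a string has at most one match at its end.

```r
x <- c("ABC", "ABC-D", "ABCD-")   # hypothetical stand-in for names$name

# Unanchored: every hyphen is removed, including the one in the middle
no_anchor <- gsub("-", "", x)

# Anchored with $: only a hyphen at the very end of the string is removed
anchored <- sub("-$", "", x)

print(no_anchor)
print(anchored)
```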
I am trying to manipulate a character vector and want to delete all characters before the first occurrence of a specific string using the sub function in R. Since sub replaces only the first match, I expected this to work, but in my code sub appears to replace up to the last match rather than the first.
Here is an example:
Vec <- c("ID1.P.001", "ID2.P.002") # character vector
# I want to get rid of all characters before the first dot (including the dot)
# So i want to get this vector
c("P.001", "P.002")
#[1] "P.001" "P.002"
# my code
sub('.*\\.', "", Vec )
#[1] "001" "002"
# sub replaces up to the last match, not the first!
How can I use sub to remove everything before the first match (including the matched pattern itself)?
You can make the * quantifier lazy (as opposed to the default greedy matching) by adding a ? after it, i.e.:
sub('.*?\\.', "", Vec)
[1] "P.001" "P.002"
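A short sketch contrasting the greedy and lazy quantifiers on the question's data. Note that perl = TRUE is added here as an assumption, to make the lazy semantics explicit; the call above without it relies on R's default regex engine.

```r
Vec <- c("ID1.P.001", "ID2.P.002")

# Greedy: .* grabs as much as possible, so the match ends at the LAST dot
greedy <- sub(".*\\.", "", Vec)

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST dot
lazy <- sub(".*?\\.", "", Vec, perl = TRUE)

print(greedy)
print(lazy)
```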
We can anchor the match at the start of the string (^), match one or more characters that are not a dot ([^.]+), then match a literal dot (\\.), and replace the whole match with an empty string (""). Outside a character class the dot is a metacharacter, so it must be escaped; inside [] it is already treated literally.
sub("^[^.]+\\.", "", Vec)
#[1] "P.001" "P.002"
I have a string which looks like this:
something-------another--thing
I want to replace the multiple dashes with a single one.
So the expected output would be:
something-another-thing
We can try using gsub here:
x <- "something-------another--thing"
gsub("-{2,}", "-", x)
[1] "something-another-thing"
More generally, if we want to replace any sequence of two or more of the same character with just the single character, then use this version:
x <- "something-------another--thing"
gsub("(.)\\1+", "\\1", x)
The second pattern could use an explanation:
(.) match AND capture any single character
\\1+ then match that same character again, one or more times
Then, we replace the whole run with just the single captured character.
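As a hedged illustration of the difference between the two patterns: the backreference version collapses runs of any character, not only hyphens, which may or may not be what you want. The input "book--keeper" is made up for this sketch.

```r
x <- "something-------another--thing"

# Collapse runs of hyphens only
hyphens_only <- gsub("-{2,}", "-", x)

# Collapse runs of ANY repeated character via the backreference \\1;
# doubled letters are collapsed too
any_char <- gsub("(.)\\1+", "\\1", "book--keeper")

print(hyphens_only)
print(any_char)
```

If only hyphens should be collapsed, the first pattern is the safer choice.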
You can do it with gsub and a regex:
> text='something-------another--thing'
> gsub('-{2,}','-',text)
[1] "something-another-thing"
t2 <- "something-------another--thing"
library(stringr)
str_replace_all(t2, pattern = "-+", replacement = "-")
which gives:
[1] "something-another-thing"
If you're searching for the right regex to search for a string, you can test it out here https://regexr.com/
In the above, a single hyphen would be matched by pattern = "-". Adding the + quantifier makes the pattern match one or more consecutive hyphens, so each run is replaced by a single hyphen; that gives pattern = "-+".
I have a vector of names that looks like this:
names <- c("Verticordia (Cha)", "Whiteodendron\n(Loph)", "Platysace",
"Xanthostemon\n(Xan)", "Quercus (incl.\nCyclobalanopsis)\n(Fag)"
)
[1] "Verticordia (Cha)" "Whiteodendron\n(Loph)" "Platysace" "Xanthostemon\n(Xan)"
[5] "Quercus (incl.\nCyclobalanopsis)\n(Fag)"
I would like to conditionally remove all characters that come after a space or a \n, including the space or the \n itself. I have been able to remove the \n or the space using:
gsub("\n*","",names)
gsub(" *","",names)
However, I am having trouble getting the code to remove all following characters as well.
gsub("\n.*","",names)
gsub(" .*","",names)
You want the asterisk quantifier to apply to the dot (a wildcard matching any character). Your version applied the quantifier to the newline or space character, so you were removing only runs of consecutive newlines or spaces.
Or all in 1 regex:
names.reduced <- gsub('[ \\\n].*', '', names)
[1] "Verticordia" "Whiteodendron" "Platysace" "Xanthostemon" "Quercus"
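A self-contained sketch of the single-regex approach, assuming the character class only needs a space and a newline. The vector is renamed to taxa (a made-up name) so that it does not mask base R's names function.

```r
# Hypothetical copy of the question's vector, renamed to avoid masking base::names
taxa <- c("Verticordia (Cha)", "Whiteodendron\n(Loph)", "Platysace",
          "Xanthostemon\n(Xan)", "Quercus (incl.\nCyclobalanopsis)\n(Fag)")

# [ \n] matches a single space or newline; .* then matches what follows it
reduced <- gsub("[ \n].*", "", taxa)

print(reduced)
```

Strings without a space or newline, such as "Platysace", are left unchanged because the pattern never matches.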
I used a regular expression that kept only the text before the second occurrence of a dot. The code is:
colnames(final1)[i] <- gsub("^([^.]*.[^.]*)..*$", "\\1", colnames(final)[i])
But now I realize that I want the text before the first occurrence of two consecutive dots.
I tried
gsub(",.*$", "", colnames(final)[i]) (changed the , to ..)
gsub("...*$", "", colnames(final)[i])
But it didn't work
The example to try on
KC1.Comdty...PX_LAST...USD......Comdty........
converted to
KC1.Comdty.
or
"LIT.US.Equity...PX_LAST...USD......Comdty........"
to
"LIT.US.Equity."
Can anyone suggest anything?
Thanks
We could use sub to match two or more dots followed by any other characters, and replace the match with an empty string:
sub("\\.{2,}.*", "", str1)
#[1] "KC1.Comdty" "LIT.US.Equity"
The . is a metacharacter that matches any character, so we need to escape it (\\.) to match a literal dot.
data
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
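To illustrate why the escaping matters, here is a small sketch comparing the question's unescaped attempt with the escaped pattern on the same data. With unescaped dots, every . matches any character, so the pattern swallows the whole string.

```r
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......",
          "LIT.US.Equity...PX_LAST...USD......Comdty........")

# Unescaped: "..." means "any two characters, then anything", so everything matches
unescaped <- sub("...*$", "", str1)

# Escaped: \\. is a literal dot, so the match starts at the first run of 2+ dots
escaped <- sub("\\.{2,}.*", "", str1)

print(unescaped)
print(escaped)
```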
Another solution with strsplit:
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
sapply(strsplit(str1, "\\.{2}\\w"), "[", 1)
# [1] "KC1.Comdty." "LIT.US.Equity."
To also keep the dot at the end with @akrun's answer, one can do:
sub("\\.{2}\\w.*", "", str1)
# [1] "KC1.Comdty." "LIT.US.Equity."
From an import, I have a date being read in as a factor:
user$registrationDate[1]
[1] "2004-07-23 14:19:32"
15551 Levels: " "1" "2004-07-23 14:19:32" "2004-07-25 03:29:18" "2004-07-25 08:35:20" ... i10yo."
I convert it apparently successfully into a character vector
as.character(user$registrationDate[1])
[1] "\"2004-07-23 14:19:32\""
Whatever I try to strip off the leading and trailing quote, I still end up with a trailing quote (or something like it)
sub('"', "", as.character(user$registrationDate[10]), fixed=TRUE)
[1] "2004-09-12 22:39:21\""
I tried many variations of sub and keep getting the same result. Tips?
From ?sub: "sub replaces only the first occurrence of a pattern whereas gsub replaces all occurrences". Your string contains two quotes, one leading and one trailing, so sub removes only the leading one. Use gsub instead.
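A minimal sketch of the difference, using the sample value shown in the question; the variable names are made up for illustration.

```r
x <- "\"2004-07-23 14:19:32\""   # sample value copied from the question

# sub stops after the first match, so the trailing quote survives
first_only <- sub('"', "", x, fixed = TRUE)

# gsub replaces every match, so both quotes are removed
all_quotes <- gsub('"', "", x, fixed = TRUE)

print(first_only)
print(all_quotes)
```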