Remove extra characters in a string

I have a column that looks like this:
Item_Number
R8934nr fd
4hgsi32df
Miognse daf
I only want to keep the first 7 characters and remove the rest. I am new to R and I tried:
gsub(Item_Number, '', '[7]')

Using sub is one option, close to what you tried. This answer uses a pattern that removes everything after the first 7 characters of the string; strings with 7 or fewer characters are left unchanged.
Item_Number = "1234567890"
sub("(?<=^.{7}).*", "", Item_Number, perl=TRUE)
[1] "1234567"
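For comparison, here is a minimal sketch of the same truncation applied to a whole column, assuming the column is a character vector. The vector below is built from the question's sample rows, and the names first7_regex and first7_substr are only illustrative. Base R's substr is vectorized and should give the same result without a regex:

```r
# Sample values copied from the question (assumed to be a character vector)
Item_Number <- c("R8934nr fd", "4hgsi32df", "Miognse daf")

# Regex version from this answer: delete everything after the first 7 characters
first7_regex <- sub("(?<=^.{7}).*", "", Item_Number, perl = TRUE)

# Plain substring version: keep characters 1 through 7
first7_substr <- substr(Item_Number, 1, 7)

print(first7_substr)
# [1] "R8934nr" "4hgsi32" "Miognse"
```

In a data frame this would be something like df$Item_Number <- substr(df$Item_Number, 1, 7), where df is a hypothetical name for your data.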

If you only want to keep the letters (dropping spaces and digits) and then take the first 7 of what is left, something like this would do it.
your.string <- "R8934nr fd"
your.string <- gsub(" ","",your.string)
your.string <- gsub("[[:digit:]]+","",your.string)
your.string <- substr(your.string,1,7)
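To see what each step does, here is a sketch that runs the same three steps on the question's sample values (the intermediate names are made up for illustration). Note that the digits are removed before truncating, so the result differs from simply keeping the first 7 characters:

```r
# Sample values copied from the question
Item_Number <- c("R8934nr fd", "4hgsi32df", "Miognse daf")

# Step 1: remove spaces
no_spaces <- gsub(" ", "", Item_Number)
# "R8934nrfd" "4hgsi32df" "Miognsedaf"

# Step 2: remove runs of digits
no_digits <- gsub("[[:digit:]]+", "", no_spaces)
# "Rnrfd" "hgsidf" "Miognsedaf"

# Step 3: keep at most the first 7 remaining characters
result <- substr(no_digits, 1, 7)
print(result)
# [1] "Rnrfd"   "hgsidf"  "Miognse"
```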

Related

Replace multiple consecutive hyphens in R

I have a string which looks like this:
something-------another--thing
I want to replace the multiple dashes with a single one.
So the expected output would be:
something-another-thing

We can try using gsub here:
x <- "something-------another--thing"
gsub("-{2,}", "-", x)
[1] "something-another-thing"
More generally, if we want to replace any sequence of two or more of the same character with just the single character, then use this version:
x <- "something-------another--thing"
gsub("(.)\\1+", "\\1", x)
The second pattern could use an explanation:
(.) match AND capture any single character
\\1+ then match that same character again, one or more times
Then, we replace with just the single captured letter.
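A small sketch of what the backreference version does on a few inputs (the second and third strings are illustrative, not from the question). Because (.) matches any character, doubled letters are collapsed too, which may or may not be what you want:

```r
# First string is from the question; the other two are made-up examples
x <- c("something-------another--thing", "aaa--bb", "committee")

# Collapse any run of a repeated character down to a single copy
collapsed <- gsub("(.)\\1+", "\\1", x)
print(collapsed)
# [1] "something-another-thing" "a-b"                     "comite"

# Restricting the pattern to hyphens leaves doubled letters alone
hyphens_only <- gsub("-{2,}", "-", x)
print(hyphens_only)
# [1] "something-another-thing" "aaa-bb"                  "committee"
```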

You can do it with gsub and a regex.
> text='something-------another--thing'
> gsub('-{2,}','-',text)
[1] "something-another-thing"

t2 <- "something-------another--thing"
library(stringr)
str_replace_all(t2, pattern = "-+", replacement = "-")
which gives:
[1] "something-another-thing"
If you're looking for the right regex for a string, you can test it out at https://regexr.com/
In the above, the pattern starts out as a single hyphen, pattern = "-". Adding the + quantifier makes it match one or more consecutive hyphens (greedily, so the whole run is consumed in one match), which gives pattern = "-+".

Remove part of a string until a character is found R

I have a regex problem or somewhat regex related problem...
I have strings that look like this:
"..........))))..)))))))"
"....))))))))...)).))))..))"
"......))))...)))...)))))"
I want to remove the initial dot sequence, so that I only get the string starting at the first occurrence of the ")" symbol. Say, the output would be something like:
"))))..)))))))"
"))))))))...)).))))..))"
"))))...)))...)))))"
I assume it would be somewhat similar to a lookahead regex but cannot figure out the correct one...
Any help?
Thanks

We match 0 or more dots (\\.*) from the start (^) of the string and replace them with an empty string
sub("^\\.*", "", v1)
#[1] "))))..)))))))" "))))))))...)).))))..))" "))))...)))...)))))"
If the result needs to start with ), then, as above, match 0 or more dots up to the first ) and replace the match with )
sub("^\\.*\\)", ")", v1)
#[1] "))))..)))))))" "))))))))...)).))))..))" "))))...)))...)))))"
data
v1 <- c("..........))))..)))))))", "....))))))))...)).))))..))", "......))))...)))...)))))")

You can simply remove dots from the beginning of the line (marked in the regex by ^) until you reach a non-dot character:
a <- "..........))))..)))))))"
b <- "....))))))))...)).))))..))"
c <- "......))))...)))...)))))"
sub("^\\.*", "", a) # "))))..)))))))"
sub("^\\.*", "", b) # "))))))))...)).))))..))"
sub("^\\.*", "", c) # "))))...)))...)))))"

The way your question is worded, the goal isn't to remove just . from the beginning, but any symbol until the first ) is encountered. So this answer is a more general solution.
stringr::str_extract("..........))))..)))))))","\\).*$")
Alternatively, if you want to stick with base R, you could use sub/gsub like this:
gsub("[^\\)]*(\\).*$)","\\1","..........))))..)))))))")
sub("[^\\)]*","","..........))))..)))))))")
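The two base R variants behave differently when a string has no ) at all. Here is a hedged sketch of that difference; the second and third inputs are made up for illustration, and the names strip_any and keep_unmatched are hypothetical:

```r
# First string is from the question; the other two are illustrative
v <- c("..........))))..)))))))", "ab.c))..)", "no-paren")

# Drop everything before the first ")".
# If there is no ")", the whole string is consumed and "" is returned.
strip_any <- sub("[^)]*", "", v)
print(strip_any)
# [1] "))))..)))))))" "))..)"         ""

# Same idea, but the pattern requires a ")" to be present,
# so a string without one does not match and is returned unchanged.
keep_unmatched <- sub("[^)]*(\\).*$)", "\\1", v)
print(keep_unmatched)
# [1] "))))..)))))))" "))..)"         "no-paren"
```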

How to take only that part of a string which occurs before a pattern of 2 dots?

I used a regular expression that kept only the text before the 2nd occurrence of a dot. Here is the code:
colnames(final1)[i] <- gsub("^([^.]*.[^.]*)..*$", "\\1", colnames(final)[i])
But now I realize I want to take the text before the first occurrence of a pattern of 2 dots.
I tried
gsub(",.*$", "", colnames(final)[i]) (changed the , to ..)
gsub("...*$", "", colnames(final)[i])
But it didn't work.
An example to try on:
KC1.Comdty...PX_LAST...USD......Comdty........
converted to
KC1.Comdty.
or
"LIT.US.Equity...PX_LAST...USD......Comdty........"
to
"LIT.US.Equity."
Can anyone suggest anything?
Thanks

We could use sub to match 2 or more dots followed by other characters and replace it with blank
sub("\\.{2,}.*", "", str1)
#[1] "KC1.Comdty" "LIT.US.Equity"
The . is a metacharacter that matches any character, so we need to escape it (\\.) to match a literal dot.
data
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
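To illustrate why the escape matters, here is a sketch comparing the unescaped attempt from the question with the escaped pattern. The [.] form is an alternative way to write a literal dot, and the variable names are illustrative only:

```r
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......",
          "LIT.US.Equity...PX_LAST...USD......Comdty........")

# Unescaped: each . matches any character, so the whole string is removed
unescaped <- sub("...*$", "", str1)
print(unescaped)
# [1] "" ""

# Escaped: \\. matches only a literal dot
escaped <- sub("\\.{2,}.*", "", str1)
print(escaped)
# [1] "KC1.Comdty"    "LIT.US.Equity"

# Inside a character class the dot is also literal
bracketed <- sub("[.]{2,}.*", "", str1)
```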

Another solution with strsplit:
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
sapply(strsplit(str1, "\\.{2}\\w"), "[", 1)
# [1] "KC1.Comdty." "LIT.US.Equity."
To also include the dot at the end with @akrun's answer, one can do:
sub("\\.{2}\\w.*", "", str1)
# [1] "KC1.Comdty." "LIT.US.Equity."

Subsetting data with only entries within the parentheses

How can I subset the data so that the description column contains only the entries within the parentheses?
data =
ID       description                                 control
1814668  glycoprotein 2 (Gp2) (Fy2)                  LMN_2904435
1791634  claudin 10 (Cldn10), transcript variant 1   ILMN_1214954 NM
1790993  claudin 10 (Cldn10), transcript variant 2   ILMN_2515816
output
ID       description   control
1814668  Gp2, Fy2      LMN_2904435
1791634  Cldn10        ILMN_1214954 NM
1790993  Cldn10        ILMN_2515816

You could try
df2$description <- gsub('.*\\(([^)]+)\\).*', '\\1', df2$description)
Or use bracketXtract from qdap
library(qdap)
unlist(bracketXtract(df2$description, 'round'))
Or
library(qdapRegex)
unlist(rm_round(df2$description, extract=TRUE))
Update
Based on the new dataset "df2N",
df2N$description <- sapply(rm_round(df2N$description, extract=TRUE), toString)
Or using str_extract
library(stringr)
# current stringr versions no longer provide perl(); the default regex engine handles lookarounds
sapply(str_extract_all(df2N$description, '(?<=\\()[^)]+(?=\\))'), toString)
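If you would rather avoid extra packages, a base R sketch of the same lookaround extraction might look like this. It assumes the description values shown in the question; the names matches and extracted are illustrative:

```r
# Description values copied from the question's sample data
description <- c("glycoprotein 2 (Gp2) (Fy2)",
                 "claudin 10 (Cldn10), transcript variant 1",
                 "claudin 10 (Cldn10), transcript variant 2")

# Find every run of non-")" characters that sits between "(" and ")"
matches <- regmatches(description,
                      gregexpr("(?<=\\()[^)]+(?=\\))", description, perl = TRUE))

# Join multiple matches per row with ", "
extracted <- vapply(matches, toString, character(1))
print(extracted)
# [1] "Gp2, Fy2" "Cldn10"   "Cldn10"
```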

Probably not as great as @akrun's solutions, but here is another option, using the gsub function (twice...) from base R:
df2$description <- gsub("^,\\s|,\\s$",
                        "",
                        gsub("^[^(]*\\(|\\)[^()]*\\(|\\)[^(]*$",
                             ", ",
                             df2$description, perl=TRUE))
#[1] "Gp2, Fy2" "Cldn10" "Cldn10"
First, the inner gsub tells R to search for any of:
^[^(]*\\(: anything that is not an opening bracket, at the beginning of the string, and ending with an opening bracket
\\)[^()]*\\(: a closing bracket followed by anything that is not a bracket, ending with an opening bracket
\\)[^(]*$: a closing bracket, followed by anything that is not an opening bracket, up to the end of the string
and replace each match with a comma followed by a space.
Second, the outer gsub replaces the "comma followed by a space" at the beginning and at the end of the string with an empty string.
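To make the two passes easier to follow, here is a sketch that splits the nested call into two steps and shows the intermediate result. It assumes the description values from the question; step1 and step2 are illustrative names:

```r
# Description values copied from the question's sample data
description <- c("glycoprotein 2 (Gp2) (Fy2)",
                 "claudin 10 (Cldn10), transcript variant 1",
                 "claudin 10 (Cldn10), transcript variant 2")

# Pass 1: turn everything outside the parentheses into ", "
step1 <- gsub("^[^(]*\\(|\\)[^()]*\\(|\\)[^(]*$", ", ", description, perl = TRUE)
print(step1)
# [1] ", Gp2, Fy2, " ", Cldn10, "   ", Cldn10, "

# Pass 2: trim the leading and trailing ", "
step2 <- gsub("^,\\s|,\\s$", "", step1)
print(step2)
# [1] "Gp2, Fy2" "Cldn10"   "Cldn10"
```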

Remove hyphen at the end of string in R

I have a column of a dataframe in R like this:
names <- data.frame(name=c("ABC", "ABC-D", "ABCD-"))
I would like to remove the hyphen at the end of the strings while maintaining the hyphen in the middle of them. I've tried a few expressions like:
names$name <- gsub("+-\\w", "", names$name)
# the desired output is "ABC", "ABC-D", and "ABCD", respectively
While several combinations remove the hyphens entirely, I'm not sure how to specify the string boundary and the hyphen together.
Thanks!

Try:
gsub("\\-$", "", names$name)
# [1] "ABC" "ABC-D" "ABCD"
$ anchors the (escaped) hyphen to the end of the string.
A hyphen is only special inside a character class (square brackets), so here you don't need to escape it and this works too:
gsub("-$", "", names$name)
#[1] "ABC" "ABC-D" "ABCD"
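One thing to watch for: "-$" removes a single trailing hyphen only. If a value could end in several hyphens, a quantifier before the anchor should handle that. A sketch, where the fourth value is a made-up example that is not in the question:

```r
# First three values are from the question; "AB-C--" is hypothetical
name <- c("ABC", "ABC-D", "ABCD-", "AB-C--")

# Remove exactly one hyphen at the end of the string
one_trailing <- sub("-$", "", name)
print(one_trailing)
# [1] "ABC"   "ABC-D" "ABCD"  "AB-C-"

# Remove the whole run of hyphens at the end of the string
all_trailing <- sub("-+$", "", name)
print(all_trailing)
# [1] "ABC"   "ABC-D" "ABCD"  "AB-C"
```

In both cases the hyphen in the middle of "ABC-D" is untouched, because $ only lets the pattern match at the end of the string.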