Move location of special character - r

I have a vector of strings in which the only special character is "-".
To be clear, a sample string looks like 23 C-Exam
I'd like to change it to 23-C Exam
Essentially, I want R to find the position of "-" and move it two characters back.
This feels like a really simple task, but I can't figure out how to do it.
Assume that whenever R finds "-", the character two positions before it is whitespace, just like in the example above.

regex attempt:
x <- c("23 C-Exam","45 D-Exam")
#[1] "23 C-Exam" "45 D-Exam"
sub(".(.)-", "-\\1 ", x)
#[1] "23-C Exam" "45-D Exam"
The pattern matches any character (.), then a captured character ((.)), then a literal dash (-).
The replacement writes a literal dash (-), then the captured character (\\1), then a space in place of the original dash.
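As a hedged variation on the pattern above, the leading wildcard can be replaced by a literal space, so that the substitution only fires when the dash really is preceded by a space and one other character. The third test string is an assumption added for illustration.

```r
# Sketch: same idea as above, but require a literal space two positions before the dash
x <- c("23 C-Exam", "45 D-Exam", "no dash here")

# " (\\S)-" = a space, one captured non-space character, then a dash
moved <- sub(" (\\S)-", "-\\1 ", x)
print(moved)
```

Strings that do not contain the space-character-dash sequence are returned unchanged.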

There is probably a sleeker way of doing this with regular expressions, but one approach is to simply splice together the pieces of the desired output. First, I find the position of the - in the string, and then I use substr() to piece together the output.
x <- "23 C-Exam"
pos <- regexpr("-", x)
x <- paste0(substr(x, 1, pos - 3),        # everything before the space
            "-",
            substr(x, pos - 1, pos - 1),  # the character between the space and the dash
            " ",
            substr(x, pos + 1, nchar(x))) # everything after the dash
x
#[1] "23-C Exam"
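Because regexpr() and substr() are both vectorised, the same splicing can be applied to a whole vector. The following is a minimal sketch; the helper name move_dash_back and the guard for dash-free strings are assumptions, not part of the original answer.

```r
# Sketch: vectorised version of the substr() approach (helper name is hypothetical)
move_dash_back <- function(x) {
  pos <- as.integer(regexpr("-", x, fixed = TRUE))  # -1 where there is no dash
  out <- paste0(substr(x, 1, pos - 3),              # text before the space
                "-",
                substr(x, pos - 1, pos - 1),        # character between space and dash
                " ",
                substr(x, pos + 1, nchar(x)))       # text after the dash
  ifelse(pos > 0, out, x)                           # leave dash-free strings untouched
}

res <- move_dash_back(c("23 C-Exam", "45 D-Exam", "plain"))
print(res)
```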

We can also use chartr, which here turns every space into a dash and every dash into a space
chartr(" -", "- ", x)
#[1] "23-C Exam" "45-D Exam"
data
x <- c("23 C-Exam","45 D-Exam")
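Since chartr() translates every occurrence of the listed characters, it is worth checking what happens when a string has more than one space. The second input below is an invented example used only to illustrate that behaviour.

```r
# chartr() translates every occurrence, not just the characters around the dash
single <- chartr(" -", "- ", "23 C-Exam")        # one space, one dash
extra  <- chartr(" -", "- ", "23 C-Exam part 2") # additional spaces later on
print(single)
print(extra)  # the later spaces are turned into dashes as well
```

So this approach seems safe only when each string contains exactly one space and one dash, as in the sample data.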

Related

How to insert a white space before open bracket

I have a string 3.4(2.5-4.7), I want to insert a white space before the open bracket "(" so that the string becomes 3.4 (2.5-4.7).
Any idea how this could be done in R?
x <- "3.4(2.5-4.7)"
sub("(.*)(?=\\()", "\\1 ", x, perl = TRUE)
#[1] "3.4 (2.5-4.7)"
This regex is based on a lookahead: it creates one capturing group holding everything up to the lookahead, namely the opening parenthesis (?=\\(), recalls that group, and inserts one whitespace after it in the replacement argument to sub (which is enough unless you have more than one such substitution per string, in which case gsub is needed). The argument perl = TRUE needs to be added to enable the lookahead.
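A lookahead matches a position rather than characters, so a variant without any capturing group is also possible. This is a sketch under the assumption that every opening parenthesis should get a space in front of it; the second input is invented for illustration.

```r
x <- c("3.4(2.5-4.7)", "a(b)(c)")

# (?=\\() matches the empty position just before each "(",
# so the replacement simply inserts a space there
spaced <- gsub("(?=\\()", " ", x, perl = TRUE)
print(spaced)
```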
EDIT:
If you have a string like this:
x <- "3.4(2.5to4.7)"
the regex gets slightly more complex, but the underlying idea remains the same: you divide the string into capturing groups (...), which you then recall using the appropriate backreferences in the replacement argument while adding the desired spaces:
sub("(.*)(\\(\\d+\\.\\d+)(to)(\\d+\\.\\d+\\))", "\\1 \\2 \\3 \\4", x)
[1] "3.4 (2.5 to 4.7)"
EDIT2:
x <- '3.4(2.5,4.7)'
sub("(.*)(\\(\\d+\\.\\d+)(,)(\\d+\\.\\d+\\))", "\\1 \\2\\3 \\4", x)
[1] "3.4 (2.5, 4.7)"
EDIT3:
x <- '3(2,4)'
sub("(.*)(\\(\\d+)(,)(\\d+)", "\\1 \\2\\3 \\4", x)
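The decimal and integer cases in the two edits above could plausibly be handled by one pattern. The sketch below assumes the input always has the shape value(value,value) and uses the character class [0-9.]+ instead of \\d+\\.\\d+; that substitution is my own simplification, not the answer's.

```r
x <- c("3.4(2.5,4.7)", "3(2,4)")

# [0-9.]+ accepts both integers and decimals; the three groups are the
# leading value and the two values inside the parentheses
res <- sub("^([0-9.]+)\\(([0-9.]+),([0-9.]+)\\)$", "\\1 (\\2, \\3)", x)
print(res)
```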
A very short way uses sub, which will substitute the first open bracket ( with a space followed by an open bracket, i.e. " (".
x <- '3.4(2.5-4.7)'
sub("\\(", " (", x)
# [1] "3.4 (2.5-4.7)"
Alternatively, you can specify the argument fixed = TRUE which considers the pattern as fixed and not as a regular expression.
x <- '3.4(2.5-4.7)'
sub("(", " (", x, fixed = TRUE)
# [1] "3.4 (2.5-4.7)"
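To make the difference between sub and gsub concrete with a fixed pattern, here is a small sketch. The input with two bracketed parts is an invented example.

```r
x <- "1.2(3.4)(5.6)"

first_only  <- sub("(", " (", x, fixed = TRUE)   # only the first "(" gets a space
all_of_them <- gsub("(", " (", x, fixed = TRUE)  # every "(" gets a space
print(first_only)
print(all_of_them)
```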
Try
gsub('(.*)(\\(.*\\))', '\\1 \\2', '3.4(2.5-4.7)')
#[1] "3.4 (2.5-4.7)"
The regex creates two groups. The first group (.*) captures everything before the opening parenthesis, and the second group (\\(.*\\)) captures the parenthesized part, from the opening to the closing parenthesis. Note that the parentheses need to be escaped, hence \\( and \\). The replacement then joins the two groups with a space between them: \\1 \\2
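One consequence of the greedy first group is worth seeing on an input with more than one bracketed part. The string below is an invented example; with it, the space appears to land only before the last opening parenthesis.

```r
y <- "1(2)3(4)"

# The greedy (.*) swallows as much as possible, so group 2 is the last "(...)"
res <- gsub('(.*)(\\(.*\\))', '\\1 \\2', y)
print(res)
```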

Remove punctuation but keep hyphenated phrases in R text cleaning

Is there an effective way to remove punctuation from text while keeping hyphenated expressions, such as "accident-prone"?
I used the following function to clean my text
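library(stringr)  # provides str_replace_all()

clean.text = function(x)
{
  # remove rt
  x = gsub("rt ", "", x)
  # remove at-mentions
  x = gsub("@\\w+", "", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove digits
  x = gsub("[[:digit:]]", "", x)
  # remove http links
  x = gsub("http\\w+", "", x)
  # remove runs of two or more spaces, tabs or | characters
  x = gsub("[ |\t]{2,}", "", x)
  # remove a leading or trailing space
  x = gsub("^ ", "", x)
  x = gsub(" $", "", x)
  x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
  return(x)
}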
and applied it to a hyphenated expression, which returned
my_text <- "accident-prone"
new_text <- clean.text(my_text)
new_text
[1] "accidentprone"
while my desired output is
"accident-prone"
I have consulted this thread, but it did not work in my situation. There must be some regex detail that I haven't figured out. I would really appreciate it if someone could enlighten me.
Putting my two cents in, you could use (*SKIP)(*FAIL) with perl = TRUE and remove any non-word characters:
data <- c("my-test of #$%^&*", "accident-prone")
(gsub("(?<![^\\w])[- ](?=\\w)(*SKIP)(*FAIL)|\\W+", "", data, perl = TRUE))
Resulting in
[1] "my-test of" "accident-prone"
See a demo on regex101.com.
Here the idea is to match what you want to keep
(?<![^\\w])[- ](?=\\w)
# a whitespace or a dash between two word characters
# or at the very beginning of the string
let these fail with (*SKIP)(*FAIL) and put what you want to be removed on the right side of the alternation, in this case
\W+
effectively removing any non-word-characters not between word characters.
You'd need to provide more examples for testing though.
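In that spirit, here is a hedged sketch that runs the same pattern on a few additional inputs. The test strings are my own and are not from the question.

```r
data <- c("accident-prone!", "(well-known)", "rock & roll")

# Same pattern as above: dashes or spaces followed by a word character
# (and not preceded by a non-word character) are skipped; every other
# run of non-word characters is removed
cleaned <- gsub("(?<![^\\w])[- ](?=\\w)(*SKIP)(*FAIL)|\\W+", "", data, perl = TRUE)
print(cleaned)
```

Note that in the third string the whole run " & " is matched by \W+ and deleted, so the two words end up joined. Whether that is acceptable depends on the data.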
The [:punct:] character class includes the dash, so you are removing it. You could write an alternative character class that omits the dash. You do need to pay special attention to the placement of the square brackets, and escape the double quote and the backslash:
(test <- gsub("[]!\"#$%&'()*+,./:;<=>?@[\\^_`{|}~]", "", "my-test of #$%^&*") )
[1] "my-test of "
The ?regex (help page) advises against using ranges. I investigated whether there might be any simplification using my local ASCII sequence of punctuation, but it quickly became obvious that was not the way to go for other reasons. There were 5 separate ranges, and the "]" was in the middle of one of them so there would have been 7 ranges to handle in addition to the "]" which needs to come first.
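Instead of enumerating every punctuation character, one could negate a class of the characters to keep. This is a different technique from the one in the answer above, sketched under the assumption that only letters, digits, whitespace and the dash should survive; the extra test strings are invented.

```r
x <- c("my-test of #$%^&*", "accident-prone!", "it's 100% fine")

# Remove everything that is NOT alphanumeric, whitespace, or a dash.
# The dash is placed last inside the brackets so it is taken literally.
kept <- gsub("[^[:alnum:][:space:]-]", "", x)
print(kept)
```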

extracting character from character string as per certain conditions

Let's say
x = "R is so tough for SAS programmer"
y = "R why you so hard"
Now we have to find the word that ends just before the 8th position, i.e. start at the 7th character and read backwards (right to left) until the first space (" ") is encountered.
In case of x it would be the word "so"
In the case of y it would be "y"
How can I do this?
Here is another option with word and sub
library(stringr)
word(sub("^(.{1,7}).*", "\\1", x), -1)
#[1] "so" "y"
data
x <- c("R is so tough for SAS programmer", "R why you so hard")
Let's assume you have both strings in one vector:
x = c("R is so tough for SAS programmer", "R why you so hard")
Then, if I understand your question correctly, you can use a combination of substr to extract the first 7 characters of each string and then sub to extract the part after the last space:
sub(".*\\s", "", substr(x, 1, 7))
#[1] "so" "y"
It may be safer to use
sub(".*\\s", "", trimws(substr(x, 1, 7), "right"))
which removes any trailing whitespace from each string returned by substr. This ensures that the sub call won't accidentally match a space at the end of the string.
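The substr, trimws and sub steps can be wrapped in a small helper so that the cut-off position is a parameter. The function name last_word_before and its argument n are hypothetical, chosen only for this sketch.

```r
# Sketch: last word found within the first n characters of each string
last_word_before <- function(s, n = 7) {
  head_part <- trimws(substr(s, 1, n), which = "right")  # first n chars, no trailing space
  sub(".*\\s", "", head_part)                            # drop everything up to the last space
}

x <- c("R is so tough for SAS programmer", "R why you so hard")
at7 <- last_word_before(x)         # cut-off after the 7th character
at5 <- last_word_before(x, n = 5)  # cut-off after the 5th character
print(at7)
print(at5)
```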

Remove others in a string except a needed word including certain patterns in R

I have a vector of strings, and I would like to remove everything from each string except the word containing a certain pattern (here, mir).
s <- c("a mir-96 line (kk27)", "mir-133a cell",
"d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")
I want to obtain:
mir-96, mir-133a, mir-14-3p, mir133, mir_23_5p
I know the idea: use gsub() with a pattern that describes a word beginning with (or including) mir.
But I have no idea how to construct such a pattern.
Or is there another approach?
Any help will be appreciated!
One way in base R would be splitting every string into words and then extracting only those with mir in it
unlist(lapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE)))
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
We can skip the unlist step by using sapply instead of lapply, as suggested by @Rich Scriven in the comments
sapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE))
We can use sub to match zero or more characters (.*), followed by a word boundary (\\b), followed by the string mir and one or more characters that are not whitespace (\\S+). We capture this part as a group by placing it inside (...), match any remaining characters (.*), and in the replacement use the backreference to the captured group (\\1)
sub(".*\\b(mir\\S+).*", "\\1", s)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
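Rather than deleting what surrounds the match, the match itself can be extracted with regexpr and regmatches. This is an alternative sketch, not the answer's method. It assumes each string contains at most one relevant mir word; strings without a match are dropped from the result.

```r
s <- c("a mir-96 line (kk27)", "mir-133a cell",
       "d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")

# regexpr() locates the first "mir" followed by any non-whitespace characters;
# regmatches() pulls out exactly the matched text
hits <- regmatches(s, regexpr("mir\\S*", s))
print(hits)
```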
Update
If a string contains multiple 'mir.*' substrings, we want to extract the one that has a numeric part
sub(".*\\b(mir[^0-9]*[0-9]+\\S*).*", "\\1", s1)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p" "mir_23-5p"
data
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
"mir_23_5p r 27", "a mir_23-5p 1 mir-net")

string manipulation to remove the name of files

I have a list of strings
/temp/123/afedcgid/abc.csv
/temp/123/4388dkfa/abc1.csv
/temp/123/4388dkfa/ab1.csv
I want to remove the file name from each string.
The desired results are
/temp/123/afedcgid
/temp/123/4388dkfa
/temp/123/4388dkfa
How can I do it? Thanks.
You could try the below,
sub("/[^/]*$", "", x)
It removes everything from the last / symbol to the end of the string.
OR
> x <- "/temp/123/afedcgid/abc.csv"
> sub("(.*)/.*", "\\1", x)
[1] "/temp/123/afedcgid"
This captures all the characters from the start up to the last / symbol (excluding the / itself). The characters that follow are matched by .*. Replacing the whole match with the contents of group 1 gives the desired output.
Example:
> x <- "/temp/123/afedcgid/abc.csv"
> sub("/[^/]*$", "", x)
[1] "/temp/123/afedcgid"
OR
regmatches(x, gregexpr(".+(?=/)", x, perl=TRUE))
Use this regex to match the characters you want to remove:
\/\w+\.\w+$
files <- c("/temp/123/afedcgid/abc.csv" ,
"/temp/123/4388dkfa/abc1.csv" , "/temp/123/4388dkfa/ab1.csv")
sub("\\/\\w+\\.\\w+$" , "" , files)
As you may know, each backslash in the regex has to be doubled (\\) inside an R string.
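For completeness, base R also has dirname(), which returns the directory part of a path without any regex. The sketch below compares it with the first regex answer on the question's three paths. Treating the two as interchangeable is an assumption that holds for simple paths like these.

```r
files <- c("/temp/123/afedcgid/abc.csv",
           "/temp/123/4388dkfa/abc1.csv",
           "/temp/123/4388dkfa/ab1.csv")

by_regex   <- sub("/[^/]*$", "", files)  # strip the last "/" and what follows
by_dirname <- dirname(files)             # base R helper for the same task
print(by_regex)
print(by_dirname)
```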
