Removing spaces outside quotes in R

I am rearranging the Simpsons names with R so that they all follow a first name, last name format, but there are large spaces between the names. Is it possible to remove the spaces outside the quoted names?
library(stringr)
simpsons <- c("Moe Syzlak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy", "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")
# Split on commas, trim each piece, and reverse so the first name comes first
splitname <- lapply(str_split(simpsons, ","), function(x) rev(str_trim(x)))
# Paste the pieces of each name back together with a single space
for (i in seq_along(splitname)) {
  splitname[i] <- paste(unlist(splitname[i]), collapse = " ")
}
splitname <- unlist(splitname)

If we need to rearrange the names so that the first name is followed by the last name, we could use sub. We capture one or more characters that are not a , as the first group, followed by a , and zero or more spaces (\\s*), then capture one or more characters that are not a , as the second group. In the replacement, we reverse the backreferences to get the output.
sub("([^,]+),\\s*([^,]+)", "\\2 \\1", simpsons)
#[1] "Moe Syzlak" "C. Montgomery Burns" "Rev. Timothy Lovejoy" "Ned Flanders" "Homer Simpson" "Dr. Julius Hibbert"
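If the input itself contains stray or repeated spaces, the swap can be followed by a whitespace clean-up. The following is a minimal sketch in base R; the vector messy is a made-up example and not from the question.

```r
# Hypothetical input with irregular spacing (not from the original question)
messy <- c("Burns,    C.   Montgomery", "  Simpson ,Homer  ", "Ned   Flanders")

# Swap "Last, First" into "First Last"; names without a comma pass through unchanged
swapped <- sub("([^,]+),\\s*([^,]+)", "\\2 \\1", messy)

# Collapse every run of whitespace to one space, then trim both ends
tidy <- trimws(gsub("\\s+", " ", swapped))
tidy
#[1] "C. Montgomery Burns" "Homer Simpson" "Ned Flanders"
```

Doing the clean-up as a separate step keeps the swap pattern simple, at the cost of one extra pass over the vector.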

Related

Reverse the name if it is separated by a comma

Suppose a name is stored as last name, first name, like "nandan,vivek". I want to display it as "vivek nandan".
n <- "nandan,vivek"
Expected result:
[1] vivek nandan
where the first name is vivek
and the last name is nandan.
This is the author name.
We can try using sub here:
input <- "nankin,vivek"
sub("([^,]+),\\s*(.*)", "\\2 \\1", input)
[1] "vivek nankin"
The regex pattern used above matches the last name followed by the first name, in separate capture groups. It then replaces with those capture groups, in reverse order, separated by a single space.
An option would be to use sub to capture a run of letters ([a-z]+), followed by a , and then capture the next word ([a-z]+). In the replacement, reverse the order of the backreferences
sub("([a-z]+),([a-z]+)", "\\2 \\1", n)
#[1] "vivek nandan"
A non-regex option would be to split the string and then paste the reversed words
paste(rev(strsplit(n, ",")[[1]]), collapse=" ")
#[1] "vivek nandan"
Or extract the word and paste
library(stringr)
paste(word(n, 2, sep=","), word(n, 1, sep=","))
#[1] "vivek nandan"
data
n<- "nandan,vivek"
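The split-and-paste idea also extends to a whole vector of names. Here is a rough sketch; the helper reverse_name and the vector authors are hypothetical names introduced only for illustration.

```r
# Hypothetical vector of "last,first" strings; the last entry has no comma
authors <- c("nandan,vivek", "simpson, homer", "flanders")

# Hypothetical helper: split on the comma (plus optional spaces),
# reverse the pieces of each name, and paste them back with one space
reverse_name <- function(x) {
  parts <- strsplit(x, ",\\s*")
  vapply(parts, function(p) paste(rev(p), collapse = " "), character(1))
}

reversed <- reverse_name(authors)
reversed
#[1] "vivek nandan" "homer simpson" "flanders"
```

Entries without a comma come back unchanged, because strsplit returns them as a single piece.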

Extracting parts of text string between two characters

I am new to R and still learning, so I would greatly appreciate any help or suggestions.
I have different character strings similar to those:
"Department of Biophysical Chemistry, University of Braunschweig, Braunschweig, Germany; Consejo Superior de Investigaciones Científicas, CCHS, Madrid, Spain;"
Then I would like to extract only the name of the countries in those strings, including semicolon, that is:
"Germany; Spain;"
The problem for me is working out how to extract just the text from the last comma to the semicolon, and to do that repeatedly. I tried the gsub function but could not find the right approach.
For test input make a 3 component vector s as shown in the Note at the end so that we can see that it works for multiple lines -- here just three lines.
Now, we can get a one-line solution using strapply in the gsubfn package. We match the indicated pattern returning only the match to the capture group, i.e. the portion within parentheses. Then for each line we use sapply to paste the matches together.
library(gsubfn)
sapply(strapply(s, ", ([^,;]+;)"), paste, collapse = " ")
giving:
[1] "Germany; Spain;" "Germany; Spain;" "Germany; Spain;"
Note
s1 <- "Department of Biophysical Chemistry, University of Braunschweig, Braunschweig, Germany; Consejo Superior de Investigaciones Científicas, CCHS, Madrid, Spain;"
s <- c(s1, s1, s1)
We can try using strsplit along with sub here for a base R option:
x <- "Department of Biophysical Chemistry, University of Braunschweig, Braunschweig, Germany; Consejo Superior de Investigaciones Científicas, CCHS, Madrid, Spain;"
terms <- sapply(strsplit(x, ";\\s*")[[1]], function(x) {
sub("^.*\\s+", "", x)
})
output <- paste0(terms, ";", collapse=" ")
output
[1] "Germany; Spain;"
The logic here is to first split your semicolon-separated string on the pattern ;\s*, which results in a vector containing each department. Then, we use sapply with sub to remove everything up to, and including, the last appearance of whitespace. Finally, we use paste0 with collapse to generate another semicolon-separated string.
Note: I changed the names of the output vector only for demo purposes, because R was using the full department description as the name by default, making it hard to display.
I would simply find the last comma before the ; and capture everything between using a simple gsub call. This will also work for a vector
gsub(".*?(=?[^,]*;)", "\\1", x, perl = TRUE)
# [1] " Germany; Spain;"
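Another base R possibility is to pull out the matches directly with gregexpr and regmatches instead of deleting everything around them. This is only a sketch, and the shortened input below is made up for illustration (it is not the string from the question).

```r
# Hypothetical shortened affiliations, same shape as in the question
x <- c("Dept A, Univ B, Berlin, Germany; Inst C, Madrid, Spain;",
       "Lab D, Paris, France;")

# A run of characters containing no comma or semicolon, ending in ";",
# i.e. the text between the last comma and each semicolon
hits <- regmatches(x, gregexpr("[^,;]+;", x))

# Trim the leading space from every match and join the matches per input string
countries <- vapply(hits, function(h) paste(trimws(h), collapse = " "), character(1))
countries
#[1] "Germany; Spain;" "France;"
```

Because the matches are trimmed before pasting, the result has no leading space.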

R Regex searching text between delimiter

I have a data file which has text in the following format:
"name: alex age: 27 profession: it"
I want to pull the values that follow each ':' and exclude the field name before it (e.g. name, age, and profession should be dropped, and only their corresponding values retrieved). The field names are not fixed; they can change.
I want data to be
alex 27 it
We can use gsub to match word (\\w+), then a :, one or more spaces (\\s+) followed by a word captured as a group ((\\w+)) and replace it with the backreference.
str1 <- "name: alex age: 27 profession: it"
gsub("\\w+:\\s+(\\w+)", "\\1", str1)
#[1] "alex 27 it"
NOTE: Here, we assume the pattern of the string is in key: value pair
Using str_split with a negative lookbehind regex, you can split the text into a vector of three elements
library(stringr)
st <- "name: alex age: 27 profession: it"
str_split(st,"(?<!:) ")
After that, it is easy to remove the text that we don't want with gsub
str_split(st,"(?<!:) ") %>% unlist() %>% gsub("^.*: ","",.)
Now, using the same technique but extracting the names and using setNames, we get a named list, which is very comfortable to work with
dta <- setNames(
str_split(st,"(?<!:) ") %>%
unlist() %>%
gsub("^.*: ","",.) %>%
as.list(),
str_split(st,"(?<!:) ") %>%
unlist() %>%
gsub(":.*$","",.))
dta$profession
[1] "it"
A solution with str_extract_all from stringr. This matches alphanumerics ([[:alnum:]]) that are preceded by a : and a space (\\s), and ends at a word boundary (\\b):
library(stringr)
string <- "name: alex age: 27 profession: it"
str_extract_all(string, "(?<=:\\s)[[:alnum:]]+\\b")[[1]]
# [1] "alex" "27" "it"
or:
paste(str_extract_all(string, "(?<=:\\s)[[:alnum:]]+\\b")[[1]], collapse = " ")
# [1] "alex 27 it"
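The same lookaround idea can be written in base R with gregexpr and regmatches, which also makes it easy to keep the field names next to the values. A small sketch, assuming every value is a single word as in the example:

```r
string <- "name: alex age: 27 profession: it"

# Field names: word characters immediately followed by a colon
keys <- regmatches(string, gregexpr("\\w+(?=:)", string, perl = TRUE))[[1]]

# Values: word characters immediately preceded by a colon and one whitespace character
values <- regmatches(string, gregexpr("(?<=:\\s)\\w+", string, perl = TRUE))[[1]]

# Named character vector, so values can be looked up by field name
record <- setNames(values, keys)
record[["age"]]
#[1] "27"
```

The lookarounds need perl = TRUE, since R's default regex engine does not support them.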

Remove others in a string except a needed word including certain patterns in R

I have a vector of strings, and I would like to remove everything in each string except the word that includes a certain pattern (here, mir).
s <- c("a mir-96 line (kk27)", "mir-133a cell",
"d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")
I want to obtain:
mir-96, mir-133a, mir-14-3p, mir133, mir_23_5p
I know the idea: use gsub() with a pattern for a word beginning with (or including) mir.
But I have no idea how to construct such a pattern.
Or other idea?
Any help will be appreciated!
One way in base R would be splitting every string into words and then extracting only those with mir in it
unlist(lapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE)))
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
We can save the unlist step in lapply by using sapply, as suggested by @Rich Scriven in the comments
sapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE))
We can use sub to match zero or more characters (.*) followed by a word boundary (\\b) followed by the string mir and one or more characters that are not whitespace (\\S+), capture that as a group by placing it inside (...), followed by any other characters, and in the replacement use the backreference of the captured group (\\1)
sub(".*\\b(mir\\S+).*", "\\1", s)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
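One thing to keep in mind is that sub returns a string unchanged when the pattern does not match at all. If that matters, extracting with regexpr and regmatches is an alternative. A minimal sketch, where the last element of s2 is a made-up string without any mir word:

```r
# The vector from the question plus one hypothetical string with no match
s2 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in",
        "m mir133 (sas)", "mir_23_5p r 27", "control sample")

# Position of the first word starting with "mir" in each string (-1 if none)
m <- regexpr("\\bmir\\S+", s2, perl = TRUE)

# regmatches() drops non-matching strings, so pre-fill with NA to keep positions aligned
found <- rep(NA_character_, length(s2))
found[m > 0] <- regmatches(s2, m)
found
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p" NA
```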
Update
If a string contains several 'mir' substrings and we want the one that includes a numeric part, we can tighten the pattern
sub(".*\\b(mir[^0-9]*[0-9]+\\S*).*", "\\1", s1)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p" "mir_23-5p"
data
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
"mir_23_5p r 27", "a mir_23-5p 1 mir-net")

gsub only part of pattern

I want to use gsub to correct some names that are in my data. I want names such as "R. J." and "A. J." to have no space between the letters.
For example:
x <- "A. J. Burnett"
I want to use gsub to match the pattern of his first name, and then remove the space:
gsub("[A-Z]\\.\\s[A-Z]\\.", "[A-Z]\\.[A-Z]\\.", x)
But I get:
[1] "[A-Z].[A-Z]. Burnett"
Obviously, instead of the [A-Z]'s I want the actual letters in the original name. How can I do this?
Use capture groups by enclosing patterns in (...), and refer to the captured patterns with \\1, \\2, and so on. In this example:
x <- "A. J. Burnett"
gsub("([A-Z])\\.\\s([A-Z])\\.", "\\1.\\2.", x)
[1] "A.J. Burnett"
Also note that in the replacement you don't need to escape the . characters, as they don't have a special meaning there.
You can use a look-ahead ((?=\\w\\.)) and a look-behind ((?<=\\b\\w\\.)) to target such spaces and replace them with "".
x <- c("A. J. Burnett", "Dr. R. J. Regex")
gsub("(?<=\\b\\w\\.) (?=\\w\\.)", "", x, perl = TRUE)
# [1] "A.J. Burnett" "Dr. R.J. Regex"
The look-ahead matches a word character (\\w) followed by a period (\\.), and the look-behind matches a word-boundary (\\b) followed by a word character and a period.
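The two approaches behave differently when there are more than two initials, because the capture-group pattern consumes both letters of a pair while the lookarounds consume only the space. A short sketch to compare them; the second name is a made-up example:

```r
x <- c("A. J. Burnett", "A. B. C. Smith")

# Capture groups: each match uses up two initials, so a third initial keeps its space
pairs <- gsub("([A-Z])\\.\\s([A-Z])\\.", "\\1.\\2.", x)
pairs
#[1] "A.J. Burnett" "A.B. C. Smith"

# Lookarounds: only the space is matched, so every space between initials is removed
look <- gsub("(?<=\\b\\w\\.) (?=\\w\\.)", "", x, perl = TRUE)
look
#[1] "A.J. Burnett" "A.B.C. Smith"
```

Which result is preferable depends on how names with three or more initials should look in your data.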
