how to separate names using regular expression?

how to separate names using regular expression? - r

I have a name vector like the following:
vname<-c("T.Lovullo (73-58)","K.Gibson (63-96) and A.Trammell (1-2)","T.La Russa (81-81)","C.Dressen (16-10), B.Swift (32-25) and F.Skaff (40-39)")
Watch out for T.La Russa who has a space in his name
I want to use str_match to separate the name. The difficulty here is that some characters contain two names while the other contain only one like the example I gave.
I have write my code but it does not work:
str_match_all(ss,"(D[.]D+.+)s(\\(d+-d+\\))(s(and)s(D[.]D+.+)s(\\(d+-d+\\)))?")

Perhaps this helps
res <- unlist(strsplit(vname, "(?<=\\))(\\sand\\b\\s)*", perl = TRUE))
res
#[1] "T.Lovullo (73-58)" "K.Gibson (63-96)" "A.Trammell (1-2)" "T.La Russa (81-81)"
To get the names only (if that is what the expected)
sub("\\s*\\(.*", "", res)
#[1] "T.Lovullo" "K.Gibson" "A.Trammell" "T.La Russa"

Related

Extract a part of string and join to the start

I have a list
temp_list <- c("Temp01_T1", "Temp03_T1", "Temp04_T1", "Temp11_T6",
"Temp121_T6", "Temp99_T8")
I want to change this list as follows
output_list <- c("T1_Temp01", "T1_Temp03", "T1_Temp04", "T6_Temp11",
"T6_Temp121", "T8_Temp99")
Any leads would be appreciated

Base R answer using sub -
sub('(\\w+)_(\\w+)', '\\2_\\1', temp_list)
#[1] "T1_Temp01" "T1_Temp03" "T1_Temp04" "T6_Temp11" "T6_Temp121" "T8_Temp99"
You can capture the data in two capture groups, one before underscore and another one after the underscore and reverse them using backreference.

This should work.
sub("(.*)(_)(.*)", "\\3\\2\\1", temp_list)
You capture the three groups, one before the underscore, one is the underscore and one after and then you rearrange the order in the replacement expression.

python approach: splitting on '_', reversing order with -1 step index slicing, and joining with '_':
['_'.join(i.split('_')[::-1]) for i in temp_list]

You can use str_split() from stringr and then use sapply() to paste it in the order you want if it is always seperated by a _
x <- stringr::str_split(temp_list, "_")
sapply(x, function(x){paste(x[[2]],x[[1]],sep = "_")})

A trick with paste + read.table
> do.call(paste, c(rev(read.table(text = temp_list, sep = "_")), sep = "_"))
[1] "T1_Temp01" "T1_Temp03" "T1_Temp04" "T6_Temp11" "T6_Temp121"
[6] "T8_Temp99"

easy way to extract uppercase in string in R

I am beginner programmer in R.
I have "cCt/cGt" and I want to extract C and G and write it like C>G.
test ="cCt/cGt"
str_extract(test, "[A-Z]+$")

Try this:
gsub(".*([A-Z]).*([A-Z]).*", "\\1>\\2", test )
[1] "C>G"
Here, we capture the two occurrences of the upper case letters in capturing groups given in parentheses (...). This enables us to refer to them (and only to them but not the rest of the string!) in gsub's replacement clause using backreferences \\1 and \\2. In the replacement clause we also include the desired >.

You seem to look for a mutation in two concatenated strings, this function should solve your problem:
extract_mutation <- function(text){
splitted <- strsplit(text, split = "/")[[1]]
pos <- regexpr("[[:upper:]]", splitted)
uppercases <- regmatches(splitted, pos)
mutation <- paste0(uppercases, collapse = ">")
return(mutation)
}
If the two base exchanges are always at the same index, you could also return the position if you're interested:
position <- pos[1]
return(list(mutation, position))
instead of the return(mutation)

You might also capture the 2 uppercase chars followed and preceded by optional lowercase characters and a / in between.
test ="cCt/cGt"
res = str_match(test, "([A-Z])[a-z]*/[a-z]*([A-Z])")
sprintf("%s>%s", res[2], res[3])
Output
[1] "C>G"
See an R demo.
An exact match for the whole string could be:
^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$

RegEx for a conditional pattern in a string

I need to extract substrings from some strings,for example:
My data is a vector: c("Shigella dysenteriae","PREDICTED: Ceratitis")
a = "Shigella dysenteriae"
b = "PREDICTED: Ceratitis"
I hope that if the string starts with "PREDICTED:", it can be extracted to the subsequent word(maybe "Ceratitis"), and if the string doesn't start with "PREDICTED", it can be extracted to the first word(maybe Shigella);
In this example, the result would be:
result_of_a = "Shigella"
result_of_b = "Ceratitis"
Well,it is a typical conditional regular expression.I tried,but always failed;
I used R which can compatible perl's regular expression.
I know R supports perl's regular expression so I tried to use regexpr and regmatches, two functions to extract the substrings that I want.
The code is :
pattern = "(?<=PREDICTED:)?(?(1)(\\s+\\w+\\b)|(\\w+\\b))"
a = c("Shigella dysenteriae")
m_a = regexpr(pattern,a,perl = TRUE)
result_a = regmatches(a,m_a)
b = c("PREDICTED: Ceratitis")
m_b = regexpr(pattern,a,perl = TRUE)
result_b = regmatches(b,m_b)
Finaly,the result is :
# result_a = "Shigella"
# result_b = "PREDICTED"
It is not the result I expect,result_a is right,result_b is wrong.
WHY???Its seem that the condition didn't work...
PS:
I tried to read some details of conditional reg-expresstion. this is the web I tried to read : https://www.regular-expressions.info/conditional.html and I try to imitate "pattern" from this web ,and also tried to use "RegexBuddy" software to find the reason.

EDIT:
To use the function below on a vector, one can do:
Vector: myvec<-c("Shigella dysenteriae","PREDICTED: Ceratitis")
lapply(myvec,extractor)
[[1]]
[1] "Shigella"
[[2]]
[1] "Ceratitis"
Or:
unlist(lapply(myvec,extractor))
[1] "Shigella" "Ceratitis"
This assumes that the strings are always in the format shown above:
extractor<- function(string){
if(grepl("^PREDICTED",string)){
strsplit(string,": ")[[1]][2]
}
else{
strsplit(string," ")[[1]][1]
}
}
extractor(b)
#[1] "Ceratitis"
extractor(a)
#[1] "Shigella"

I think the reason it does not work is because (1) checks if a numbered capture group has been set but there is no first capturing group set yet, also not in the positive lookbehind (?<=PREDICTED:)?.
There are a first and second capturing group in the parts that follow. The if clause will check for group 1, it is not set so it will match group 2.
If you would make it the only capturing group (?<=(PREDICTED: )?) and omit the other 2 then the if clause will be true but you will get an error because the lookbehind assertion is not fixed length.
Instead of using a conditional pattern, to get both words you might use a capturing group and make PREDICTED: optional:
^(?:PREDICTED: )?(\w+)
Regex demo | R demo

If I understand correctly, the OP wants to extract
the first word after "PREDICTED:" if the strings starts with "PREDICTED:"
the first word of the string if the string does not start with "PREDICTED:".
So, if there is no specific requirement to use only one regex, this is what I would do:
Remove any leading "PREDICTED:" (if any)
Extract the first word from the intermediate result.
For working with regex, I prefer to use Hadley Wickham's stringr package:
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
library(magrittr) # piping used to improve readability
inp %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
[1] "Shigella" "Ceratitis"
To be on the safe side, I would remove any leading spaces beforehand:
inp %>%
stringr::str_trim() %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")

Resolving a formatter string

Suppose I have the following:
format.string <- "#AB#-#BC#/#DF#" #wanted to use $ but it is problematic
value.list <- c(AB="a", BC="bcd", DF="def")
I would like to apply the value.list to the format.string so that the named value is substituted. So in this example I should end up wtih a string: a-bcd/def
I tried to do it like the following:
resolved.string <- lapply(names(value.list),
function(x) {
sub(x = save.data.path.pattern,
pattern = paste0(c("#",x,"#"), collapse=""),
replacement = value.list[x]) })
But it doesn't seem to be working correctly. Where am I going wrong?

The glue package is designed for this. You can change the opening and closing delimiters using .open and .close, but they have to be different. Also note that value.list has to be either a list or a dataframe:
library(glue)
format.string <- "{AB}-{BC}/{DF}"
value.list <- list(AB="a", BC="bcd", DF="def")
glue_data(value.list, format.string)
# a-bcd/def

To answer your actual question, by using lapply over names(value.list) you, as your output shows, take each of the elements of value.list and perform the replacement. However, all this happens independently, i.e., the replacements aren't ultimately combined to a single result.
As to make something very similar to your approach work, we can use Reduce which does exactly this combining:
Reduce(function(x, y) sub(paste0(c("#", y, "#"), collapse = ""), value.list[y], x),
init = format.string, names(value.list))
# [1] "a-bcd/def"
If we call the anonymous function f, then the result is
f(f(f(format.string, "A"), "B"), "C")
exactly as you intended, I believe.

We can use gsubfn that can take a key/value pair as replacement to change the pattern with the 'value'
library(gsubfn)
gsub("#", "", gsubfn("[^#]+", as.list(value.list), format.string))
#[1] "a-bcd/def"
NOTE: 'value.list' is a vector and not a list

Avoid that space in column name is replaced with period (".") when using read.csv()

I am using R to do some data pre-processing, and here is the problem that I am faced with: I input the data using read.csv(filename,header=TRUE), and then the space in variable names became ".", for example, a variable named Full Code became Full.Code in the generated dataframe. After the processing, I use write.xlsx(filename) to export the results, while the variable names are changed. How to address this problem?
Besides, in the output .xlsx file, the first column become indices(i.e., 1 to N), which is not what I am expecting.

If your set check.names=FALSE in read.csv when you read the data in then the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need quote the column names (back quotes in some cases) or refer to the columns by location rather than name while editing.

To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
pattern = "\\.",
replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).

Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
# FIXME: Repetitive.
# Convert any number of consecutive dots to a single space.
names(ds) <- gsub(x = names(ds),
pattern = "(\\.)+",
replacement = " ")
# Drop the trailing spaces.
names(ds) <- gsub(x = names(ds),
pattern = "( )+$",
replacement = "")
ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)

Just to add to the answers already provided, here is another way of replacing the “.” or any other kind of punctation in column names by using a regex with the stringr package in the way like:
require(“stringr”)
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

how to separate names using regular expression? - r

Related

Extract a part of string and join to the start

easy way to extract uppercase in string in R

RegEx for a conditional pattern in a string

Resolving a formatter string

Avoid that space in column name is replaced with period (".") when using read.csv()

Categories

Resources