I have a data table and I want to add a string (FirstWord!) to one column values if the pattern (Letter digits:Letter(s)digits) matches is like below
ColName
New test defiend
G54:Y23 (matched)
test:New
The expected results would be
New test defiend
FirstWord!G54:Y23
test:New
dt[, ColName := ColName %>% str_replace('(?<=\d)\:(?=[[:upper:]])',
paste0("'FirstWord!'",.))]
I don't know how to add the "FristWord!" when I find the pattern in the ColName.
transform(df, ColName = sub("([A-Z][0-9]+:[A-Z]+[0-9]+)", 'FirstWord!\\1', ColName))
ColName
1 New test defiend
2 FirstWord!G54:Y23
3 test:New
You can use sub and backreference:
sub("([A-Z]+\\d+:[A-Z]+)", "FirstWord!\\1", txt)
[1] "ColName" "New test defiend" "FirstWord!G54:Y23 (matched)" "test:New"
Here, we wrap the pattern Upper-case letter(s)digit(s):Upper-case letter(s)digit(s) into a capturing group to be able to refer to it using backreference \\1 in sub's replacement clause; there, we also add the desired "First_Word!" string.
If the pattern should not be case-sensitive, just add (?i) to the front of the pattern.
Data:
txt <- c("ColName","New test defiend","G54:Y23 (matched)","test:New")
Related
A field from my dataset has some of it's observations to start with a "." e.g ".TN34AB1336"instead of "TN34AB1336".
I've tried
truck_log <- truck_log %>%
filter(BookingID != "WDSBKTP49392") %>%
filter(BookingID != "WDSBKTP44502") %>%
mutate(GpsProvider = str_replace(GpsProvider, "NULL", "UnknownGPS")) %>%
mutate(vehicle_no = str_replace(vehicle_no, ".TN34AB1336", "TN34AB1336"))
the last mutate command in my code worked but there are more of such issues in another field e.g "###TN34AB1336" instead of "TN34AB1336".
So I need a way of automating the process such that all observations that doesn't start with a character is trimmed from the left by a single command in R.
I've attached a picture of a filtered field from spreadsheet to make the question clearer.
We can use regular expressions to replace anything up to the first alphanumeric character with ""to remove everything that is not a Number/Character from the beginning of a string:
names <- c("###TN34AB1336",
".TN34AB1336",
",TN654835A34",
":+?%TN735345")
stringr::str_replace(names,"^.+?(?=[[:alnum:]])","") # Matches up to, but not including the first alphanumeric character and replaces with ""
[1] "TN34AB1336" "TN34AB1336" "TN654835A34" "TN735345"
``
You can use sub.
s <- c(".TN34AB1336", "###TN34AB1336")
sub("^[^A-Z]*", "", s)
#[1] "TN34AB1336" "TN34AB1336"
Where ^ is the start of the string, [^A-Z] matches everything what is not A, B, C, ... , Z and * matches it 0 to n times.
I have 4 values in the list
c("JSMITH_WWWFRecvd2001_asof_20220901.xlsx", "WSMITH_AMEXRecvd2002_asof_20220901.xlsx",
"PSMITH_WWWFRecvd2003_asof_20220901.xlsx", "QSMITH_AMEXRecvd2004_asof_20220901.xlsx")
I would like my outcome to be
"wwwf_01","amex_02","wwwf_03","amex_04"
You can use sub:
tolower(sub('.+_(.+)Recvd[0-9][0-9](..).+', '\\1_\\2', x))
Something like this would work. You can extract the string you want with str_extract() make it lower case with tolower() and paste the formatted counter to the end of the string with a "_" separator =.
paste(tolower(stringr::str_extract(x,"WWWF|AMEX" )), sprintf("%02d",seq_along(x)), sep = "_")
I am beginner programmer in R.
I have "cCt/cGt" and I want to extract C and G and write it like C>G.
test ="cCt/cGt"
str_extract(test, "[A-Z]+$")
Try this:
gsub(".*([A-Z]).*([A-Z]).*", "\\1>\\2", test )
[1] "C>G"
Here, we capture the two occurrences of the upper case letters in capturing groups given in parentheses (...). This enables us to refer to them (and only to them but not the rest of the string!) in gsub's replacement clause using backreferences \\1 and \\2. In the replacement clause we also include the desired >.
You seem to look for a mutation in two concatenated strings, this function should solve your problem:
extract_mutation <- function(text){
splitted <- strsplit(text, split = "/")[[1]]
pos <- regexpr("[[:upper:]]", splitted)
uppercases <- regmatches(splitted, pos)
mutation <- paste0(uppercases, collapse = ">")
return(mutation)
}
If the two base exchanges are always at the same index, you could also return the position if you're interested:
position <- pos[1]
return(list(mutation, position))
instead of the return(mutation)
You might also capture the 2 uppercase chars followed and preceded by optional lowercase characters and a / in between.
test ="cCt/cGt"
res = str_match(test, "([A-Z])[a-z]*/[a-z]*([A-Z])")
sprintf("%s>%s", res[2], res[3])
Output
[1] "C>G"
See an R demo.
An exact match for the whole string could be:
^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$
I have a list of filenames in the format "filename PID00-00-00" or just "PID00-00-00".
I want to extract part of the filename to create an ID column.
I am currently using this code for the string extraction
names(df) <- stringr::str_extract(names(df), "(?<=PID)\\d+")
binded1 = rbindlist(df, idcol = "ID")%>%
as.data.frame(binded1)
This gives the ID as the first set of digits after PID. e.g. filename PID1234-00-01 becomes ID 1234.
I want to also extract the first hyphen and following digits. So from filename PID1234-00-01 I want 1234-00.
What should my regex be?
try this:
stringr::str_extract(names(df),"(?<=PID)\\d{4}-\\d{2}")
I need to extract substrings from some strings,for example:
My data is a vector: c("Shigella dysenteriae","PREDICTED: Ceratitis")
a = "Shigella dysenteriae"
b = "PREDICTED: Ceratitis"
I hope that if the string starts with "PREDICTED:", it can be extracted to the subsequent word(maybe "Ceratitis"), and if the string doesn't start with "PREDICTED", it can be extracted to the first word(maybe Shigella);
In this example, the result would be:
result_of_a = "Shigella"
result_of_b = "Ceratitis"
Well,it is a typical conditional regular expression.I tried,but always failed;
I used R which can compatible perl's regular expression.
I know R supports perl's regular expression so I tried to use regexpr and regmatches, two functions to extract the substrings that I want.
The code is :
pattern = "(?<=PREDICTED:)?(?(1)(\\s+\\w+\\b)|(\\w+\\b))"
a = c("Shigella dysenteriae")
m_a = regexpr(pattern,a,perl = TRUE)
result_a = regmatches(a,m_a)
b = c("PREDICTED: Ceratitis")
m_b = regexpr(pattern,a,perl = TRUE)
result_b = regmatches(b,m_b)
Finaly,the result is :
# result_a = "Shigella"
# result_b = "PREDICTED"
It is not the result I expect,result_a is right,result_b is wrong.
WHY???Its seem that the condition didn't work...
PS:
I tried to read some details of conditional reg-expresstion. this is the web I tried to read : https://www.regular-expressions.info/conditional.html and I try to imitate "pattern" from this web ,and also tried to use "RegexBuddy" software to find the reason.
EDIT:
To use the function below on a vector, one can do:
Vector: myvec<-c("Shigella dysenteriae","PREDICTED: Ceratitis")
lapply(myvec,extractor)
[[1]]
[1] "Shigella"
[[2]]
[1] "Ceratitis"
Or:
unlist(lapply(myvec,extractor))
[1] "Shigella" "Ceratitis"
This assumes that the strings are always in the format shown above:
extractor<- function(string){
if(grepl("^PREDICTED",string)){
strsplit(string,": ")[[1]][2]
}
else{
strsplit(string," ")[[1]][1]
}
}
extractor(b)
#[1] "Ceratitis"
extractor(a)
#[1] "Shigella"
I think the reason it does not work is because (1) checks if a numbered capture group has been set but there is no first capturing group set yet, also not in the positive lookbehind (?<=PREDICTED:)?.
There are a first and second capturing group in the parts that follow. The if clause will check for group 1, it is not set so it will match group 2.
If you would make it the only capturing group (?<=(PREDICTED: )?) and omit the other 2 then the if clause will be true but you will get an error because the lookbehind assertion is not fixed length.
Instead of using a conditional pattern, to get both words you might use a capturing group and make PREDICTED: optional:
^(?:PREDICTED: )?(\w+)
Regex demo | R demo
If I understand correctly, the OP wants to extract
the first word after "PREDICTED:" if the strings starts with "PREDICTED:"
the first word of the string if the string does not start with "PREDICTED:".
So, if there is no specific requirement to use only one regex, this is what I would do:
Remove any leading "PREDICTED:" (if any)
Extract the first word from the intermediate result.
For working with regex, I prefer to use Hadley Wickham's stringr package:
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
library(magrittr) # piping used to improve readability
inp %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
[1] "Shigella" "Ceratitis"
To be on the safe side, I would remove any leading spaces beforehand:
inp %>%
stringr::str_trim() %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")