I need to export information from a string into different columns.
More specifically, I need the content of the brackets within the string.
Let's say I have a string:
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
What I am trying to output is a vector with the contents of the brackets. If a bracket contains a semicolon, the parts should be saved as separate bracketed strings, and the parentheses (with their contents) should be removed.
e.g.
tmp <- function(a)
Result
tmp
"[K89]" , "[K96]", "[N-Term]", "[S87]", "[S93]"
My approach so far:
pattern <- "(\\[.*?\\])"
hits <- gregexpr(pattern, a)
matches <- regmatches(a, hits)
unlisted_matches <- unlist(matches)
Results
"[K89; K96]" "[N-Term]" "[S87(100); S93(100)]"
This does give me the bracketed groups but still doesn't split the terms, and for some reason I am not able to efficiently separate the ";"-delimited terms.
Here's a way using the tidyverse:
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
library(tidyverse)
a %>%
# extract the text between square brackets, not keeping the brackets, and unlist
str_extract_all("(?<=\\[).*?(?=\\])") %>%
unlist() %>%
# remove round brackets and content
str_replace_all("\\(.*?\\)", "") %>%
# split by ";" and unlist
str_split("; ") %>%
unlist() %>%
# put the brackets back
str_c("[",.,"]")
#> [1] "[K89]" "[K96]" "[N-Term]" "[S87]" "[S93]"
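For comparison, the same steps can be sketched in base R without loading any packages. The function name extract_sites is made up for this illustration, and the sketch assumes that brackets are never nested and that terms inside a bracket are separated by ";".

```r
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"

# Hypothetical helper: base R version of the extract / clean / split / wrap steps
extract_sites <- function(x) {
  # 1. text between square brackets (brackets themselves are not kept)
  inside <- unlist(regmatches(x, gregexpr("(?<=\\[)[^\\]]*(?=\\])", x, perl = TRUE)))
  # 2. drop round brackets and their content
  inside <- gsub("\\([^()]*\\)", "", inside)
  # 3. split on ";" and trim the surrounding spaces
  parts <- trimws(unlist(strsplit(inside, ";", fixed = TRUE)))
  # 4. put the square brackets back
  paste0("[", parts, "]")
}

print(extract_sites(a))
```

Splitting on a bare ";" and trimming afterwards means the result does not depend on how many spaces follow each semicolon.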
You may use
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
pattern <- "(?:\\G(?!^)(?:\\([^()]*\\))?\\s*;\\s*|\\[)\\K[^][;()]+"
matches <- regmatches(a, gregexpr(pattern, a, perl=TRUE))
unlisted_matches <- paste0("[", unlist(matches),"]")
unlisted_matches
## => [1] "[K89]" "[K96]" "[N-Term]" "[S87]" "[S93]"
See the R demo and the regex demo.
Pattern details
(?:\G(?!^)(?:\([^()]*\))?\s*;\s*|\[) - either the end of the previous successful match (\G(?!^)) followed with any substring inside round parentheses (optional, see (?:\([^()]*\))?) and then a ; enclosed with optional 0+ whitespaces (see \s*;\s*) or a [ char
\K - match reset operator discarding all text matched so far
[^][;()]+ - one or more chars other than [, ], ;, ( and ).
The paste0("[", unlist(matches),"]") part wraps the matches with square brackets.
Related
A field from my dataset has some of its observations starting with a ".", e.g. ".TN34AB1336" instead of "TN34AB1336".
I've tried
truck_log <- truck_log %>%
filter(BookingID != "WDSBKTP49392") %>%
filter(BookingID != "WDSBKTP44502") %>%
mutate(GpsProvider = str_replace(GpsProvider, "NULL", "UnknownGPS")) %>%
mutate(vehicle_no = str_replace(vehicle_no, ".TN34AB1336", "TN34AB1336"))
The last mutate command in my code worked, but there are more such issues in another field, e.g. "###TN34AB1336" instead of "TN34AB1336".
So I need a way of automating the process such that all observations that don't start with an alphanumeric character are trimmed from the left by a single command in R.
I've attached a picture of a filtered field from spreadsheet to make the question clearer.
We can use a regular expression to replace everything before the first alphanumeric character with "", which removes anything that is not a letter or digit from the beginning of a string:
names <- c("###TN34AB1336",
".TN34AB1336",
",TN654835A34",
":+?%TN735345")
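stringr::str_replace(names, "^[^[:alnum:]]+", "") # Matches the run of non-alphanumeric characters at the start of the string and replaces it with "" (unlike "^.+?(?=[[:alnum:]])", this leaves already-clean values such as "TN34AB1336" untouched)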
[1] "TN34AB1336" "TN34AB1336" "TN654835A34" "TN735345"
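A base R sketch of the same idea, in case stringr is not available. It assumes every leading character that is not a letter or digit should be dropped. The negated class only consumes leading non-alphanumeric characters, so values that are already clean pass through unchanged. The sample vector is made up for illustration.

```r
# Illustrative data: three dirty values and one that is already clean
vehicle_no <- c("###TN34AB1336", ".TN34AB1336", "TN34AB1336", ":+?%TN735345")

# Remove one or more non-alphanumeric characters anchored at the start
cleaned <- sub("^[^[:alnum:]]+", "", vehicle_no)
print(cleaned)
```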
You can use sub.
s <- c(".TN34AB1336", "###TN34AB1336")
sub("^[^A-Z]*", "", s)
#[1] "TN34AB1336" "TN34AB1336"
Where ^ is the start of the string, [^A-Z] matches everything that is not A, B, C, ..., Z and * matches it 0 to n times.
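To apply this to several fields of a data frame in one go, a sketch along the following lines might work. The data frame contents and the second column name (carrier_no) are invented for the example. Note that [^A-Z] also strips leading digits and lowercase letters, so this assumes every valid value starts with an uppercase letter.

```r
# Made-up stand-in for the real truck_log data
truck_log <- data.frame(
  vehicle_no = c(".TN34AB1336", "###TN34AB1336", "TN34AB1336"),
  carrier_no = c("--KA01AB1234", "KA01AB1234", "?KA01AB1234"),
  stringsAsFactors = FALSE
)

# Columns to clean; every other column is left alone
cols <- c("vehicle_no", "carrier_no")

# Strip everything before the first uppercase letter in each chosen column
truck_log[cols] <- lapply(truck_log[cols], function(col) sub("^[^A-Z]*", "", col))
print(truck_log)
```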
I am a beginner programmer in R.
I have "cCt/cGt" and I want to extract C and G and write them as C>G.
test ="cCt/cGt"
str_extract(test, "[A-Z]+$")
Try this:
gsub(".*([A-Z]).*([A-Z]).*", "\\1>\\2", test )
[1] "C>G"
Here, we capture the two occurrences of the upper case letters in capturing groups given in parentheses (...). This enables us to refer to them (and only to them but not the rest of the string!) in gsub's replacement clause using backreferences \\1 and \\2. In the replacement clause we also include the desired >.
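An alternative that avoids backreferences is to pull out every uppercase letter and paste the hits together. This is only a sketch and assumes each string contains exactly the uppercase letters you want, in order. The second test string is made up.

```r
test <- c("cCt/cGt", "Aaa/Taa")

# One character vector of uppercase letters per input string
hits <- regmatches(test, gregexpr("[A-Z]", test))

# Join the letters of each string with ">"
result <- vapply(hits, paste, character(1), collapse = ">")
print(result)
```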
You seem to be looking for a mutation in two concatenated strings; this function should solve your problem:
extract_mutation <- function(text){
splitted <- strsplit(text, split = "/")[[1]]
pos <- regexpr("[[:upper:]]", splitted)
uppercases <- regmatches(splitted, pos)
mutation <- paste0(uppercases, collapse = ">")
return(mutation)
}
If the two base exchanges are always at the same index, you could also return the position if you're interested:
position <- pos[1]
return(list(mutation, position))
instead of the return(mutation)
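Put together, the variant that also reports the position could look like the sketch below. The name extract_mutation_pos and the list element names are made up here. As above, it assumes the exchanged base sits at the same index on both sides of the "/".

```r
# Hypothetical variant of extract_mutation that also returns the position
extract_mutation_pos <- function(text) {
  parts <- strsplit(text, split = "/", fixed = TRUE)[[1]]
  pos <- regexpr("[[:upper:]]", parts)   # index of the first uppercase letter in each part
  bases <- regmatches(parts, pos)
  list(
    mutation = paste0(bases, collapse = ">"),
    position = as.integer(pos)[1]        # position within the first part
  )
}

res <- extract_mutation_pos("cCt/cGt")
print(res)
```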
You might also capture the 2 uppercase chars followed and preceded by optional lowercase characters and a / in between.
test ="cCt/cGt"
res = str_match(test, "([A-Z])[a-z]*/[a-z]*([A-Z])")
sprintf("%s>%s", res[2], res[3])
Output
[1] "C>G"
See an R demo.
An exact match for the whole string could be:
^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$
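As a sketch of how the exact-match pattern could be used on a vector, the base R functions regexec and regmatches return the capture groups, and strings that do not fit the codon/codon shape come back as NA. The extra test strings are invented for illustration.

```r
pattern <- "^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$"
x <- c("cCt/cGt", "ccT/ccA", "aAa/aTa")

# Each element holds the full match plus the two groups, or character(0) on no match
groups <- regmatches(x, regexec(pattern, x))

mutation <- vapply(groups, function(g) {
  if (length(g) == 3) paste(g[2], g[3], sep = ">") else NA_character_
}, character(1))
print(mutation)
```

The second string is rejected because its uppercase letter is not in the middle position that the anchored pattern requires.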
I need to split the following sequence of letters into distinct chunks
SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC
I have used the following code, provided by another user, to achieve what I initially wanted, which was to split the sequence after every C.
library(dplyr)
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"
Test <- strsplit(TestSequence, "(?<=[C])", perl = TRUE) %>% unlist
df <- data.frame(Fragment = Test) %>%
mutate("position" = cumsum(nchar(Test)))
This allowed me to split the sequence after every C and retain its position in the sequence, for example C at position 2, 11 etc.
Now I need to split the same sequence at different locations, which I can do using the following to split after P,A,G or S:
Test2 <- strsplit(TestSequence, "(?<=[P,A,G,S])", perl = TRUE) %>% unlist
This is fine if I want it to split after a given character, but if I try to split it before a character, for example D, I cannot seem to retain the D in the fragment. I can only have it retained if it is split after the D.
I have tried every combination of lookbehind or lookahead I can think of. The following cuts before and after every D, which isn't that useful.
Test3 <- strsplit(TestSequence, "(?=[D])", perl = TRUE) %>% unlist
Also is there a way to retain the exact position of every C in the original sequence?
So if I were to split the test sequence after the initial K, I'd have a fragment that was SCDK, could I have a separate column that tells me where the C was in the original sequence. Just as a second example, the next fragment would be SFNRGECSCDK and in that separate column it would say the C was originally in position 11.
Zero-length matches that result from the use of lookahead only patterns used in strsplit are not handled properly.
In this case, you need to "anchor" the matches on the left, too. Either use a non-word boundary, or a lookbehind that disallows the match at the start of string:
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"
strsplit(TestSequence, "\\B(?=D)", perl = TRUE)
# => [[1]]
# => [1] "SC" "DKSFNRGECSC" "DKSFNRGECSC" "DKSFNRGEC"
strsplit(TestSequence, "(?<!^)(?=D)", perl = TRUE)
# => [[1]]
# => [1] "SC" "DKSFNRGECSC" "DKSFNRGECSC" "DKSFNRGEC"
See the online R demo.
The \B(?=D) pattern matches a location that is immediately preceded with a word char and is immediately followed with D.
The (?<!^)(?=D) pattern matches a location that is not immediately preceded with a start of string location (i.e. not at the start of string) and is immediately followed with D.
Also, note that [P,A,G,S] matches P, A, G, S and a comma. You should use [PAGS] to match one of the letters.
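For the second part of the question (keeping the original positions of every C), one possible sketch is to compute the fragment boundaries from the match positions instead of relying on strsplit. The helper name split_before and the column names are made up for this example, and it assumes a single literal character to split before.

```r
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

# Hypothetical helper: split x before every occurrence of the literal ch
split_before <- function(x, ch) {
  hits <- as.integer(gregexpr(ch, x, fixed = TRUE)[[1]])
  starts <- unique(c(1L, hits[hits > 0]))      # fragments start at 1 and at each hit
  ends <- c(starts[-1] - 1L, nchar(x))         # each fragment ends just before the next start
  data.frame(Fragment = substring(x, starts, ends),
             start = starts, end = ends, stringsAsFactors = FALSE)
}

df <- split_before(TestSequence, "D")

# Positions of every C in the original sequence
c_pos <- as.integer(gregexpr("C", TestSequence, fixed = TRUE)[[1]])

# For each fragment, list the original positions of the Cs it contains
df$C_positions <- mapply(function(s, e) paste(c_pos[c_pos >= s & c_pos <= e], collapse = ","),
                         df$start, df$end)
print(df)
```

Because start and end are kept alongside each fragment, the same lookup works for any other residue, not just C.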
I have a series of strings that have a particular set of characters. What I'd like to do is be able to extract just the word from the string with those characters in it, and discard the rest.
I've tried various regex expressions to do it but I either get it to split all the words or it returns the entire string. Following is an example of the kinds of strings. I've been trying to use stringr::str_extract_all() as there are instances where there are more than one word that needs to be pulled out.
data <- c("AlvariA?o, 1961","Andrade-Salas, Pineda-Lopez & Garcia-MagaA?a, 1994", "A?vila & Cordeiro, 2015", "BabiA?, 1922")
result <- unlist(stringr::str_extract_all(data, "regex"))
From this I'd like a result that pulls all the words that have the "A?", like this:
result <- c("AlvariA?o", "MagaA?a", "A?vila", "BabiA?")
It seems really simple but my regex knowledge is just not cutting it at the moment.
To match ? it needs to be escaped with \\?, so A\\? will match A?. \\w matches any word character (for ASCII text, equivalent to [a-zA-Z0-9_]) and * matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy).
unlist(stringr::str_extract_all(data, "\\w*A\\?\\w*"))
#[1] "AlvariA?o" "MagaA?a" "A?vila" "BabiA?"
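The same extraction can be sketched in base R. Here the question mark is made literal by putting it in a character class, [?], which is an alternative to the \\? escape. This is an illustration using the data from the question, not a replacement for the stringr call.

```r
data <- c("AlvariA?o, 1961",
          "Andrade-Salas, Pineda-Lopez & Garcia-MagaA?a, 1994",
          "A?vila & Cordeiro, 2015",
          "BabiA?, 1922")

# Word characters, then a literal "A?", then more word characters
words <- unlist(regmatches(data, gregexpr("\\w*A[?]\\w*", data, perl = TRUE)))
print(words)
```

Because "-" is not a word character, only the "MagaA?a" part of "Garcia-MagaA?a" is returned.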
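I wrote this as a function, but it is quite a bit worse than Gki's...
library(quanteda)
library(stringr) # provides str_split(), str_replace() and the %>% pipe used below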
set_of_character <- function(dummy, key){
n <- nchar(key)
dummy %>% str_split(., " ") %>%
unlist %>%
str_replace(., ",", "") %>%
sapply(., function(x) {
x %>%
tokens("character") %>%
unlist() %>%
char_ngrams(n, concatenator = "")
}) %>%
sapply(., function(x) (key %in% x)) %>% which(TRUE) %>% names %>%
return
}
for your example,
set_of_character(data, "A?")
[1] "AlvariA?o" "Garcia-MagaA?a" "A?vila" "BabiA?"
I am working with single cell data.
I am trying to match the cell barcodes I extracted against another data set, but the barcode formats are different.
Barcode I extracted: ,"SAMPLE_AAAGCAAAGATACACA-1_1" (weirdly, it was saved with a comma at the front)
Barcode I want: SAMPLE_AAAGCAAAGATACACA.1_1
Which function should I use to remove the leading <,"> and replace the hyphen?
Is this what you want?
Data:
x <- ',"SAMPLE_AAAGCAAAGATACACA-1_1"'
Solution:
cat(gsub(',', '', gsub('(?<=[A-Z])-(?=\\d)', '\\.', x, perl = TRUE)))
"SAMPLE_AAAGCAAAGATACACA.1_1"
Here we use 'nested' gsub to first change the hyphen into the period and then to delete the comma.
If you need it without double quote marks:
cat(gsub(',"|"$', '', gsub('(?<=[A-Z])-(?=\\d)', '\\.', x, perl = TRUE)))
SAMPLE_AAAGCAAAGATACACA.1_1
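Since the goal is to match the cleaned barcodes against another data set, the two substitutions could be wrapped in a small helper and followed by match(). The function name clean_barcode, the second barcode and the reference vector are all made up for this sketch. It assumes the only hyphen to convert is the one directly followed by a digit.

```r
# Hypothetical helper: strip the leading ," and trailing ", then turn "-<digit>" into ".<digit>"
clean_barcode <- function(x) {
  x <- gsub('^,?"|"$', "", x)
  sub("-(?=\\d)", ".", x, perl = TRUE)
}

extracted <- c(',"SAMPLE_AAAGCAAAGATACACA-1_1"', ',"SAMPLE_TTTGCAAAGATACACA-1_1"')
reference <- c("SAMPLE_TTTGCAAAGATACACA.1_1", "SAMPLE_AAAGCAAAGATACACA.1_1")

cleaned <- clean_barcode(extracted)
print(cleaned)

# Index of each cleaned barcode in the reference vector (NA if absent)
idx <- match(cleaned, reference)
print(idx)
```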
The following are some alternatives.
1) chartr/trimws Assume the test data v below. We replace each dash (minus sign) with a dot using chartr, and then strip all commas and double quotes from both ends using trimws. If you have a very old version of R you will need to upgrade, since the whitespace= argument was added more recently. No packages are used.
Note that the double quotes shown in the output are not part of the strings but are just how R displays character vectors.
# test input
v <- c(',"SAMPLE_AAAGCAAAGATACACA-1_1"', ',"SAMPLE_AAAGCAAAGATACACA-1_1"')
trimws(chartr("-", ".", v), whitespace = '[,"]')
## [1] "SAMPLE_AAAGCAAAGATACACA.1_1" "SAMPLE_AAAGCAAAGATACACA.1_1"
2) gsubfn gsubfn in the package of the same name can map all minus characters to dot and commas and double quotes to empty strings in a single command. The second argument defines the mapping.
This substitutes all double quotes, commas and minus signs. If there are embedded double quotes and commas (i.e. not on the ends) that are not to be substituted then use (1), which only trims commas and double quotes off the ends.
library(gsubfn)
gsubfn('.', list('"' = '', ',' = '', '-' = '.'), v)
## [1] "SAMPLE_AAAGCAAAGATACACA.1_1" "SAMPLE_AAAGCAAAGATACACA.1_1"
3) read.table/chartr This also uses only base R. Read in the input using read.table separating fields on comma and keeping only the second field. This will also remove the double quotes. Then use chartr to replace minus signs with dot.
This assumes that the only double quotes are the ones surrounding the field and all minus signs are to be replaced by dot. Embedded commas will be handled properly.
chartr("-", ".", read.table(text = v, sep = ",")[[2]])
## [1] "SAMPLE_AAAGCAAAGATACACA.1_1" "SAMPLE_AAAGCAAAGATACACA.1_1"