Modify a substring matched by regex and place it back [duplicate] - r

This question already has answers here:
Applying a function to a backreference within gsub in R
(4 answers)
Closed 3 years ago.
I have an arbitrary string, say "1a2 2a1 3a2 10a5", and I want to apply an arbitrary mathematical operation, say doubling, to some of the numbers, say any number followed by an "a".
I can extract the numbers I want with relative ease:
string = "1a2 2a1 3a2 10a5"
numbers = stringr::str_extract_all(string,'[0-9]+(?=a)')[[1]]
and doubling them is trivial:
numbers = 2*(as.integer(numbers))
But I am having trouble placing the new results back in their old positions to get the output "2a2 4a1 6a2 20a5". I feel like there should be a single function that accomplishes this, but all I can think of is recording the original indexes of the matches using gregexpr and manually placing the new results at those coordinates.

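We can use str_replace_all from stringr to match the numbers followed by "a" and multiply them by 2 in a callback function. The result is wrapped in as.character() because the callback is expected to return a character vector.
stringr::str_replace_all(string, "\\d+(?=a)", function(m) as.character(as.integer(m) * 2))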
#[1] "2a2 4a1 6a2 20a5"

gsubfn is like gsub except that the second argument can be a string, function (possibly expressed in formula notation), a list or a proto object. If it is a function the capture groups are input into it and the match is replaced with the output of the function.
library(gsubfn)
string <- "1a2 2a1 3a2 10a5"
gsubfn("(\\d+)(?=a)", ~ 2L * as.integer(..1), string)
## [1] "2a2 4a1 6a2 20a5"
This variation also works. backref = 0 tells gsubfn to pass the whole match to the function rather than the capture groups.
gsubfn("\\d+(?=a)", ~ 2 * as.integer(x), string, backref = 0)
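The approach described in the question (record the match positions, then write the new values back) can also be sketched in base R without extra packages, since regmatches() has a replacement form. This is only a minimal sketch; the variable names m, doubled and result are illustrative.

```r
string <- "1a2 2a1 3a2 10a5"

# Locate every run of digits that is followed by "a" (lookahead needs perl = TRUE)
m <- gregexpr("[0-9]+(?=a)", string, perl = TRUE)

# Extract the matches, double them, and convert back to character
doubled <- lapply(regmatches(string, m),
                  function(x) as.character(2L * as.integer(x)))

# Write the new values back into the matched positions
result <- string
regmatches(result, m) <- doubled
result
# [1] "2a2 4a1 6a2 20a5"
```

Because gregexpr() and regmatches() are vectorised over the input, the same code should work for a character vector with more than one string.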

Related

How to replace characters in a vector in R? [duplicate]

This question already has answers here:
How to reverse complement a DNA strand using R
(1 answer)
Reverse complementary Base
(2 answers)
Closed 1 year ago.
I have been tasked with writing an R function capable of taking a DNA string (s) as input and returning the complementary string on the reverse strand (e.g. "ACGT" returns "TGCA"). The result should look something like this:
> s <- "CCCTTAG"
> reverse.dna(s)
[1] "CTAAGGG"
I currently have the following functions for converting a string to a vector and vice versa, however any attempts I have made to use the replace() or switch() commands to substitute the complementary bases into either the string or vector have been unsuccessful.
string.to.vec <- function(s) {
  strsplit(s, "")[[1]]
}
vec.to.string <- function(v) {
  paste(v, collapse = "")
}
As I have very limited experience with R, I was wondering whether anyone could suggest the simplest way to implement this in my function. Thanks!
We may use chartr in combination with stri_reverse:
library(stringi)
chartr("ACGT", "TGCA", stri_reverse(s))
[1] "CTAAGGG"
The Bioconductor Biostrings package has functions for DNA strings. We convert s from the question to a DNAString object, run reverseComplement and then convert back to character class. Omit the last conversion if you want to keep it as a DNAString.
library(Biostrings)
s |> DNAString() |> reverseComplement() |> as.character()
## [1] "CTAAGGG"
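If neither stringi nor Biostrings is available, the same idea can be sketched in base R by reusing the split/reverse/paste steps from the question together with chartr. This is a minimal sketch that assumes the input contains only the upper-case bases A, C, G and T; any other character is passed through unchanged.

```r
reverse.dna <- function(s) {
  # Split into single characters and reverse their order
  bases <- rev(strsplit(s, "")[[1]])
  # Glue the characters back together, then swap each base for its complement
  chartr("ACGT", "TGCA", paste(bases, collapse = ""))
}

s <- "CCCTTAG"
reverse.dna(s)
# [1] "CTAAGGG"
```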

Extract all numbers from a character string into a SINGLE character string of numbers in the original order [duplicate]

How can I extract digits from a string that can have a structure of xxxx.x or xxxx.x-x and combine them as a number? e.g.
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1",
"1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
The desired (numeric) output would be:
101011, 101021, 101031...
I tried
regexp <- "([[:digit:]]+)"
solution <- str_extract(list, regexp)
However that only extracts the first set of digits; and using something like
regexp <- "([[:digit:]]+\\.[[:digit:]]+\\-[[:digit:]]+)"
returns the first result (data in its initial form) if matched otherwise NA for shorter strings. Thoughts?
Remove all non-digit symbols:
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1", "1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
as.numeric(gsub("\\D+", "", list))
## => [1] 101011 101021 101031 10301 10401 106011 106021 10701 110011 110021
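I have no experience with R, but I do know regular expressions. Looking at the pattern you're specifying, "([[:digit:]]+)", I assume [[:digit:]] stands for [0-9], so you're capturing one group of digits.
At first glance it seems you're missing a + to make it capture multiple groups of digits, i.e. "([[:digit:]]+)+".
That does not help, though: a repeated group still only matches one contiguous run of digits and stops at the first "." or "-", so the separators have to be removed, or the runs extracted one by one and pasted together.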
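As a quick check of how a repeated group behaves, here is a base R sketch (the variable names are illustrative). The repeated group still returns only the first run of digits, whereas collecting every run with gregexpr and pasting the runs together gives the combined number.

```r
x <- c("1010.1-1", "1030-1")

# A repeated group still matches only one contiguous run of digits
first_run <- regmatches(x, regexpr("([[:digit:]]+)+", x))
first_run
# [1] "1010" "1030"

# Collect every run of digits, then glue the runs together per element
runs <- regmatches(x, gregexpr("[[:digit:]]+", x))
combined <- as.numeric(vapply(runs, paste, character(1), collapse = ""))
combined
# [1] 101011  10301
```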

Identifying specific string along the row to identify the count [duplicate]

Is there a function for counting the number of times a particular keyword is contained in a dataset?
For example, if dataset <- c("corn", "cornmeal", "corn on the cob", "meal") the count would be 3.
Let's assume for the moment that you want the number of elements containing "corn":
length(grep("corn", dataset))
[1] 3
After you get the basics of R down better you may want to look at the "tm" package.
EDIT: I realize that this time around you wanted any-"corn" but in the future you might want to get word-"corn". Over on r-help Bill Dunlap pointed out a more compact grep pattern for gathering whole words:
grep("\\<corn\\>", dataset)
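Note that grep counts elements, not occurrences, so an element containing "corn" twice is still counted once. The difference can be sketched in base R as follows; the extra element "popcorn and corn" is made up purely for illustration.

```r
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# Number of elements that contain "corn" anywhere
n_any <- sum(grepl("corn", dataset, fixed = TRUE))
n_any
# [1] 3

# Number of elements that contain "corn" as a whole word
n_word <- sum(grepl("\\bcorn\\b", dataset, perl = TRUE))
n_word
# [1] 2

# Total number of occurrences, counting repeats within one element
x <- c(dataset, "popcorn and corn")
n_occurrences <- sum(lengths(regmatches(x, gregexpr("corn", x, fixed = TRUE))))
n_occurrences
# [1] 5
```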
Another quite convenient and intuitive way to do it is to use the str_count function of the stringr package:
library(stringr)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for mere occurrences of the pattern:
str_count(dataset, "corn")
# [1] 1 1 1 0
# for occurrences of the word alone:
str_count(dataset, "\\bcorn\\b")
# [1] 1 0 1 0
# summing it up
sum(str_count(dataset, "corn"))
# [1] 3
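If you only want the elements that are exactly equal to "corn" (a count of 1 for this dataset, not 3), you can also do something like the following: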
length(dataset[which(dataset=="corn")])
I'd just do it with string division like:
library(roperators)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for each vector element:
dataset %s/% 'corn'
# for everything:
sum(dataset %s/% 'corn')
You can use the str_count function from the stringr package to count how many times a keyword occurs in each element of a character vector.
The pattern argument of the str_count function accepts a regular expression that can be used to specify the keyword.
The regular expression syntax is very flexible and allows matching whole words as well as character patterns.
For example the following code will count all occurrences of the string "corn" and will return 3:
sum(str_count(dataset, regex("corn")))
To match complete words use:
sum(str_count(dataset, regex("\\bcorn\\b")))
The "\b" specifies a word boundary. With str_count, the default definition of a word boundary treats an apostrophe as a boundary, so if your dataset contains the string "corn's", it is matched and included in the result.
To prevent words containing an apostrophe from being counted, use the regex function with the parameter uword = TRUE. This makes the regular expression engine use the Unicode TR 29 definition of word boundaries, which does not treat an apostrophe as a word boundary. See http://unicode.org/reports/tr29/tr29-4.html.
The following code gives the number of times the word "corn" occurs. Words such as "corn's" are not included.
sum(str_count(dataset, regex("\\bcorn\\b", uword = TRUE)))

regular expression: remove consecutive repeated characters at least 2 times as well as those after it in a string in R

I have a vector with different strings like this:
s <- c("mir123mm8", "qwe98wwww98", "123m3tqppppw23!")
and
> s
[1] "mir123mm8" "qwe98wwww98" "123m3tqppppw23!"
I would like to have the answer like this:
> c("mir123", "qwe98", "123m3tq")
[1] "mir123" "qwe98" "123m3tq"
That is, if a string contains a character repeated at least twice in a row, the repeated characters and everything after them should be removed.
What is the best way to do this with a regular expression in R?
You can use back reference in the pattern to match repeated characters:
sub("(.*?)(.)\\2.*", "\\1", s)
# [1] "mir123" "qwe98" "123m3tq"
The pattern matches when the second capture group, a single character, is immediately followed by itself (\\2). The ? makes the first capture group non-greedy, so it stops at the first repeated pair; the whole match is then replaced by the first group. Strings without a repeated character do not match and are returned unchanged.
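To see the matching and non-matching cases side by side, here is a small sketch that adds a made-up fourth string, "abc", with no repeated characters. It uses perl = TRUE and explicit anchors, which is an assumption of this sketch rather than something the pattern requires.

```r
s <- c("mir123mm8", "qwe98wwww98", "123m3tqppppw23!", "abc")

# Which strings contain a character immediately followed by itself?
has_repeat <- grepl("(.)\\1", s, perl = TRUE)
has_repeat
# [1]  TRUE  TRUE  TRUE FALSE

# Keep only the text before the first repeated pair
trimmed <- sub("^(.*?)(.)\\2.*$", "\\1", s, perl = TRUE)
trimmed
# [1] "mir123"  "qwe98"   "123m3tq" "abc"
```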

how to add a character to a string in R [duplicate]

This question already has answers here:
Insert a character at a specific location in a string
(8 answers)
Closed 6 years ago.
I have something like this:
text <- "abcdefg"
and I want something like this:
"abcde.fg"
How could I achieve this without assigning a new string to the vector text, but instead changing the element of the vector itself? Ultimately, I would like to insert at a random position, and to insert not a dot but a character element of a vector.
We can try with sub to capture the first 5 characters as a group ((.{5})) followed by one or more characters in another capture group ((.*)) and then replace with the backreference of first group (\\1) followed by a . followed by second backreference (\\2).
sub("(.{5})(.*)", "\\1.\\2", text)
#[1] "abcde.fg"
NOTE: This solution is direct and doesn't need to paste anything together.
Also, substring with paste will work:
paste(substring(text, c(1,6), c(5,7)), collapse=".")
[1] "abcde.fg"
The substring function accepts vectors of start and stop positions and "splits" the string at the desired locations. We can then paste these pieces together using the collapse argument.
Without relying on the vector arguments, we could use the newer and recommended substr function:
paste(c(substr(text, 1, 5), substr(text, 6,7)), collapse=".")
[1] "abcde.fg"
Note that as mentioned by konrad-rudolph, this will create a copy of the vector.
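For the second part of the question (a random position, and a character taken from a vector), a minimal sketch could look like the following. The helper insert_at and the vector chars are hypothetical names introduced only for this example, and the result is assigned back to text, which, as noted above, still creates a copy.

```r
# Hypothetical helper: insert `ch` after the first `pos` characters of `text`
insert_at <- function(text, pos, ch) {
  paste0(substr(text, 1, pos), ch, substr(text, pos + 1, nchar(text)))
}

text <- "abcdefg"
fixed <- insert_at(text, 5, ".")
fixed
# [1] "abcde.fg"

# Random position strictly inside the string, random character from a vector
chars <- c(".", "-", "_")
pos <- sample(nchar(text) - 1, 1)
text <- insert_at(text, pos, sample(chars, 1))
```

Call set.seed() beforehand if the random insertion needs to be reproducible.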
