I would like to find the location of a character in a string.
Say: string = "the2quickbrownfoxeswere2tired"
I would like the function to return 4 and 24 -- the character location of the 2s in string.
You can use gregexpr
gregexpr(pattern ='2',"the2quickbrownfoxeswere2tired")
[[1]]
[1] 4 24
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
Or perhaps str_locate_all from the stringr package, which is a wrapper for stringi::stri_locate_all (as of stringr version 1.0):
library(stringr)
str_locate_all(pattern ='2', "the2quickbrownfoxeswere2tired")
[[1]]
start end
[1,] 4 4
[2,] 24 24
Note that you could simply use stringi directly; in stri_locate_all, the fixed argument carries the literal pattern itself:
library(stringi)
stri_locate_all("the2quickbrownfoxeswere2tired", fixed = '2')
Another option in base R, given a character vector x, would be something like:
lapply(strsplit(x, ''), function(x) which(x == '2'))
Here's another straightforward alternative.
> which(strsplit(string, "")[[1]]=="2")
[1] 4 24
You can make the output just 4 and 24 using unlist:
unlist(gregexpr(pattern ='2',"the2quickbrownfoxeswere2tired"))
[1] 4 24
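The gregexpr calls above treat the pattern as a regular expression, so a character such as "." or "(" would need escaping. A minimal sketch, assuming literal matching is wanted and a plain integer vector is the desired result; char_positions is a hypothetical helper name, not part of any package:

```r
# Hypothetical helper: positions of a literal substring within one string.
char_positions <- function(string, ch) {
  # fixed = TRUE matches ch literally instead of as a regular expression
  pos <- gregexpr(ch, string, fixed = TRUE)[[1]]
  # gregexpr signals "no match" with -1; return an empty vector instead
  if (pos[1] == -1L) integer(0) else as.integer(pos)
}

char_positions("the2quickbrownfoxeswere2tired", "2")
#> [1]  4 24
char_positions("a.b.c", ".")
#> [1] 2 4
```

Returning integer(0) for "no match" means the result can be passed straight to length() or used for indexing without a special case for -1.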
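Find the position of the nth occurrence of str2 in str1 (same parameter order as Oracle SQL INSTR); returns 0 if not found. This version uses gregexpr with fixed = TRUE, so str2 is matched literally and a match at the very end of str1 is still counted (strsplit drops a trailing empty piece, which would hide it):
instr <- function(str1, str2, startpos = 1, n = 1) {
  # positions of str2 within the part of str1 that starts at startpos
  pos <- gregexpr(str2, substring(str1, startpos), fixed = TRUE)[[1]]
  if (pos[1] == -1 || length(pos) < n) return(0)
  # shift back to a position in the full string
  pos[n] + startpos - 1
}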
instr('xxabcdefabdddfabx','ab')
[1] 3
instr('xxabcdefabdddfabx','ab',1,3)
[1] 15
instr('xxabcdefabdddfabx','xx',2,1)
[1] 0
To only find the first locations, use lapply() with min():
my_string <- c("test1", "test1test1", "test1test1test1")
unlist(lapply(gregexpr(pattern = '1', my_string), min))
#> [1] 5 5 5
# or the piped form (%>% comes from magrittr, which the tidyverse loads)
my_string %>%
gregexpr(pattern = '1') %>%
lapply(min) %>%
unlist()
#> [1] 5 5 5
To only find the last locations, use lapply() with max():
unlist(lapply(gregexpr(pattern = '1', my_string), max))
#> [1] 5 10 15
# or the readable tidyverse form
my_string %>%
gregexpr(pattern = '1') %>%
lapply(max) %>%
unlist()
#> [1] 5 10 15
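If only the first location is needed, regexpr() returns it directly without collecting every match. A sketch, assuming the pattern is a literal string; note that both regexpr() and gregexpr() report -1 for a string with no match, so the min() and max() forms above also yield -1 in that case. The fourth element of my_string is added here for illustration:

```r
my_string <- c("test1", "test1test1", "test1test1test1", "test")

# First match per string; -1 marks "no match"
first_pos <- as.integer(regexpr("1", my_string, fixed = TRUE))
first_pos
#> [1]  5  5  5 -1

# Last match per string; vapply() checks that each result is one integer
last_pos <- vapply(gregexpr("1", my_string, fixed = TRUE), max, integer(1))
last_pos
#> [1]  5 10 15 -1
```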
You could use grep as well:
grep('2', strsplit(string, '')[[1]])
#4 24
I am looking to split a string into n-grams of 3 characters, e.g. HelloWorld would become "Hel", "ell", "llo", "loW", etc.
How would I achieve this using R?
In Python it would take a loop using the range function, e.g. [myString[i:i+3] for i in range(len(myString) - 2)]
Is there a neat way to loop through the letters of a string using stringr (or another suitable function/package) to tokenize the word into a vector?
e.g.
dfWords <- c("HelloWorld", "GoodbyeMoon", "HolaSun") %>%
data.frame()
names(dfWords)[1] = "Text"
I would like to generate a new column which would contain a vector of the tokenized Text variable (preferably using dplyr). This can then be split later into new columns.
For others who come here, as I did, looking for the R equivalent of Python's range() function: I have found the answer.
It is the seq() function. A few examples are better than words. The usage is similar to Python's, except that seq() includes the endpoint given by to, whereas range() excludes its stop value:
> seq(from = 1, to = 5, by = 1)
[1] 1 2 3 4 5
> seq(from = 1, to = 6, by = 2)
[1] 1 3 5
> seq(5)
[1] 1 2 3 4 5
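When porting a Python loop, the nearest equivalents look roughly like this. seq_len() is included because it handles the empty case that 1:n gets wrong:

```r
# Python: range(5) -> 0, 1, 2, 3, 4
seq(0, 4)
#> [1] 0 1 2 3 4

# Python: range(2, 10, 3) -> 2, 5, 8
seq(2, 9, by = 3)
#> [1] 2 5 8

# seq_len(n) counts 1..n and is empty when n is 0,
# whereas 1:0 would give c(1, 0)
seq_len(3)
#> [1] 1 2 3
length(seq_len(0))
#> [1] 0
```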
In base R you could do something like this
ss <- "HelloWorld"
len <- 3
lapply(seq_len(nchar(ss) - len + 1), function(x) substr(ss, x, x + len - 1))
#[[1]]
#[1] "Hel"
#
#[[2]]
#[1] "ell"
#
#[[3]]
#[1] "llo"
#
#[[4]]
#[1] "loW"
#
#[[5]]
#[1] "oWo"
#
#[[6]]
#[1] "Wor"
#
#[[7]]
#[1] "orl"
#
#[[8]]
#[1] "rld"
Explanation: The approach is a basic sliding window method to extract substrings from ss. The return object is a list.
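Because substring() is vectorised over its start and stop arguments, the same sliding window can also be written without an explicit loop, returning a character vector rather than a list. A minimal sketch; ngrams is a hypothetical helper name:

```r
# Hypothetical helper: all substrings of length `len` from a single string
ngrams <- function(s, len = 3) {
  n <- nchar(s)
  # a string shorter than the window has no n-grams
  if (n < len) return(character(0))
  starts <- seq_len(n - len + 1)
  substring(s, starts, starts + len - 1)
}

ngrams("HelloWorld")
#> [1] "Hel" "ell" "llo" "loW" "oWo" "Wor" "orl" "rld"
```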
Another (sliding window) alternative could be zoo::rollapply with strsplit
library(zoo)
len <- 3
rollapply(unlist(strsplit(ss, "")), len, paste, collapse = "")
[1] "Hel" "ell" "llo" "loW" "oWo" "Wor" "orl" "rld"
In response to your comment/edit, here's a tidyverse option
# Sample data
df <- data.frame(words = c("HelloWorld", "GoodbyeMoon", "HolaSun"))
library(tidyverse)
library(zoo)
df %>% mutate(lst = map(str_split(words, ""), function(x) rollapply(x, len, paste, collapse = "")))
# words lst
#1 HelloWorld Hel, ell, llo, loW, oWo, Wor, orl, rld
#2 GoodbyeMoon Goo, ood, odb, dby, bye, yeM, eMo, Moo, oon
#3 HolaSun Hol, ola, laS, aSu, Sun
If my string is a DNA sequence,
x<-"TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"
I want to extract the substrings that run from ATG to TAA, TGA or TAG. I am able to extract from one fixed point to another by using the stringi package with a regex.
My code is
stri_extract_all(x, regex = "ATG.*?TAA")
How can I extend this so that a match may end at any of TAA, TGA or TAG?
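I believe that you meant str_extract_all from the stringr package. That function does not have an argument called regex; you need pattern. (stringi's own stri_extract_all does accept regex =, so your call works as written if stringi is loaded.) Either way, you can use the alternation operator | to allow any of the sequence endings.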
library(stringr)
str_extract_all(x, pattern="ATG.*?(TAA|TGA|TAG)")
[[1]]
[1] "ATGCAACGAGGGGCATAA" "ATGCCCAAAATCTGA" "ATGACCGGGTAG"
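The lazy pattern above stops at the first stop codon it sees, whether or not that codon is in the same reading frame as the ATG. If in-frame matches are wanted, one option is to consume the sequence three bases at a time. This is a sketch in base R, assuming the sequence contains only A, C, G and T; the short sequence y is made up for illustration:

```r
lazy     <- "ATG.*?(?:TAA|TGA|TAG)"
in_frame <- "ATG(?:[ACGT]{3})*?(?:TAA|TGA|TAG)"

y <- "ATGCTAACCTGA"

# Lazy match ends at the out-of-frame TAA
regmatches(y, gregexpr(lazy, y, perl = TRUE))[[1]]
#> [1] "ATGCTAA"

# In-frame match reads ATG CTA ACC TGA
regmatches(y, gregexpr(in_frame, y, perl = TRUE))[[1]]
#> [1] "ATGCTAACCTGA"
```

For the sequence x from the question, both patterns happen to return the same three substrings, because every first stop codon there is already in frame.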
Here is a possibility using Biostrings:
library("Biostrings")
x <- "TATAATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG"
# Get all combinations of substrings starting with "ATG" and ending with "TAA"
library(tidyverse)
df <- expand.grid(start(matchPattern("ATG", x)), end(matchPattern("TAA", x))) %>%
  filter(Var1 < Var2)
ir <- IRanges(df[, 1], df[, 2])
extractAt(BString(x), ir)
#A BStringSet instance of length 3
# width seq
#[1] 18 ATGCAACGAGGGGCATAA
#[2] 44 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAA
#[3] 20 ATGCCCAAAATCTGATATAA
Since you're working with DNA sequence data, I recommend familiarising yourself with Biostrings from Bioconductor. There exist many Bioconductor packages beyond Biostrings that will make your life a lot easier (down the track), when you're working with DNA/RNA sequence data.
Update
To account for multiple stop codons, simply wrap end(matchPattern(...)) within an sapply loop.
df <- expand.grid(
  start(matchPattern("ATG", x)),
  unlist(sapply(c("TAA", "TGA", "TAG"), function(ss) end(matchPattern(ss, x))))) %>%
  filter(Var1 < Var2)
ir <- IRanges(df[, 1], df[, 2])
extractAt(BString(x), ir)
# [1] 18 ATGCAACGAGGGGCATAA
# [2] 44 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAA
# [3] 20 ATGCCCAAAATCTGATATAA
# [4] 39 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGA
# [5] 15 ATGCCCAAAATCTGA
# ... ... ...
# [7] 23 ATGCCCAAAATCTGATATAATGA
# [8] 4 ATGA
# [9] 55 ATGCAACGAGGGGCATAATTATATATGCCCAAAATCTGATATAATGACCGGGTAG
#[10] 31 ATGCCCAAAATCTGATATAATGACCGGGTAG
#[11] 12 ATGACCGGGTAG
I want to count how many commas are at the end of a string with a regex:
x <- c("w,x,,", "w,x,", "w,x", "w,x,,,")
I'd like to get:
[1] 2 1 0 3
This is what I tried:
library(stringi)
stringi::stri_count_regex(x, ",+$")
## [1] 1 1 0 1
The quantifier makes each run of trailing commas a single match, and I don't know how to count how many times the single character is repeated at the end.
The "match.length" attribute of the regexpr result seems to get the job done (-1 indicates no match, which distinguishes it from zero-width matches such as lookaheads):
attr(regexpr(",+$", x), "match.length")
## [1] 2 1 -1 3
Another option (with a contribution from @JasonAizkalns) would be
nchar(x) - nchar(gsub(",+$", "", x))
## [1] 2 1 0 3
Or use the stringi package combined with nchar, specifying keepNA = TRUE (this way, strings with no match are reported as NA):
library(stringi)
nchar(stri_extract_all_regex(x, ",+$"), keepNA = TRUE)
## [1] 2 1 NA 3
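Another way to get the count directly is to match each trailing comma individually, using a lookahead that allows only commas between the match and the end of the string. A sketch in base R (perl = TRUE is needed for the lookahead), followed by a variant of the match.length approach that clamps -1 to 0:

```r
x <- c("w,x,,", "w,x,", "w,x", "w,x,,,")

# One match per comma that is followed only by commas up to the end
lengths(regmatches(x, gregexpr(",(?=,*$)", x, perl = TRUE)))
#> [1] 2 1 0 3

# regexpr() reports -1 for "no match"; pmax() turns that into 0
pmax(attr(regexpr(",+$", x), "match.length"), 0)
#> [1] 2 1 0 3
```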
I have a string and need to count the number of times a given value appears consecutively. I tried the stringr package, but it counts every occurrence of the value/pattern. For example, say we have to count appearances of "213" in the string "2132132132137889213": the output I need is 4, but I am getting 5 from the str_count function. Please help.
I'm not sure of my "regex" skills but, hopefully, you could make something out of this:
max_rep_pat = function(pat, text)
{
res = gregexpr(paste0("(", pat, ")+"), text)
sapply(res, function(x) max(attr(x, "match.length")) / nchar(pat))
}
max_rep_pat("213", c("2132132132137889213",
"21321321321378892132132132132132213213"))
#[1] 4 5
gregexpr returns the positions where a pattern occurred and the number of characters in each match. Wrapping the pattern in "(pattern)+" means 'find consecutive repetitions of the pattern'. Compare the following two:
gregexpr("213", "2132132132137889213")
[[1]]
[1] 1 4 7 10 17
attr(,"match.length")
[1] 3 3 3 3 3
#attr(,"useBytes")
#[1] TRUE
gregexpr("(213)+", "2132132132137889213")
[[1]]
[1] 1 17
attr(,"match.length")
[1] 12 3
#attr(,"useBytes")
#[1] TRUE
In the first case, it found the position of each "213", and the length of each match is just the nchar of the pattern. In the second case, it found every run of repeated "213", and we see that such runs occurred twice: the first with 12 / 3 = 4 repetitions and the second with 3 / 3 = 1 repetition. Using max(attr(x, "match.length")) / nchar(pattern) we get that 4.
Another way would be:
fun1 <- function(pat, text) {
max_rep_pat1 <- function(pat, text) {
text1 <- gsub(pat, paste(" ", pat, " "), text)
rl <- rle(scan(text = text1, what = "", quiet = TRUE) == pat)
max(rl$lengths[rl$values])
}
setNames(mapply(max_rep_pat1, pat, text), NULL)
}
str1 <- c("2132132132137889213", "21321321321378892132132132132132213213")
str2 <- "213421342134213477"
fun1("2134", str2)
#[1] 4
fun1("213", str1)
#[1] 4 5
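Both functions above assume that the pattern is safe to embed in a regular expression and that it occurs at least once in the text. A sketch that relaxes both assumptions by quoting the pattern with PCRE's \Q...\E and returning 0 when there is no match; max_rep is a hypothetical name, and the quoting would break if pat itself contained \E:

```r
# Hypothetical helper: longest run of consecutive literal repetitions of pat
max_rep <- function(pat, text) {
  # \Q...\E makes PCRE treat pat literally; (?:...)+ matches a whole run
  rx <- paste0("(?:\\Q", pat, "\\E)+")
  m <- gregexpr(rx, text, perl = TRUE)
  # match.length is -1 when nothing matched, so max(0, ...) yields 0
  vapply(m, function(x) max(0, attr(x, "match.length")) %/% nchar(pat),
         numeric(1))
}

max_rep("213", c("2132132132137889213", "7889"))
#> [1] 4 0
max_rep("a.", "a.a.axa.")
#> [1] 2
```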
I have a list of strings which contain random characters such as:
list=list()
list[1] = "djud7+dg[a]hs667"
list[2] = "7fd*hac11(5)"
list[3] = "2tu,g7gka5"
I'd like to know which numbers are present at least once (unique()) in this list. The solution of my example is:
solution: c(7,667,11,5,2)
If someone has a method that does not consider 11 as "eleven" but as "one and one", it would also be useful. The solution in this condition would be:
solution: c(7,6,1,5,2)
(I found this post on a related subject: Extracting numbers from vectors of strings)
For the second answer, you can use gsub to remove everything from the string that's not a number, then split the string as follows:
unique(as.numeric(unlist(strsplit(gsub("[^0-9]", "", unlist(ll)), ""))))
# [1] 7 6 1 5 2
For the first answer, similarly using strsplit,
unique(na.omit(as.numeric(unlist(strsplit(unlist(ll), "[^0-9]+")))))
# [1] 7 667 11 5 2
PS: don't name your variable list (as there's an inbuilt function list). I've named your data as ll.
Here is yet another answer, this one using gregexpr to find the numbers, and regmatches to extract them:
l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")
temp1 <- gregexpr("[0-9]", l) # Individual digits
temp2 <- gregexpr("[0-9]+", l) # Numbers with any number of digits
as.numeric(unique(unlist(regmatches(l, temp1))))
# [1] 7 6 1 5 2
as.numeric(unique(unlist(regmatches(l, temp2))))
# [1] 7 667 11 5 2
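The two variants differ only in the pattern, so they can be folded into one function. A minimal sketch built on the gregexpr/regmatches approach; unique_numbers and its split_digits argument are hypothetical names introduced for illustration:

```r
# Hypothetical helper: unique numbers (or unique single digits) in strings
unique_numbers <- function(strings, split_digits = FALSE) {
  strings <- unlist(strings)  # accept a list or a character vector
  pattern <- if (split_digits) "[0-9]" else "[0-9]+"
  found <- unlist(regmatches(strings, gregexpr(pattern, strings)))
  # convert before unique() so that "07" and "7" count as the same number
  unique(as.numeric(found))
}

ll <- list("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")
unique_numbers(ll)
#> [1]   7 667  11   5   2
unique_numbers(ll, split_digits = TRUE)
#> [1] 7 6 1 5 2
```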
A solution using stringi
# extract the numbers:
nums <- stri_extract_all_regex(list, "[0-9]+")
# Make vector and get unique numbers:
nums <- unlist(nums)
nums <- unique(nums)
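And that's your first solution.
For the second solution, extract single digits instead (taking substr(x, 1, 1) of each number would keep only its first digit, which happens to give the right answer here but would miss, for example, the 2 in 12):
digits <- unique(unlist(stri_extract_all_regex(unlist(list), "[0-9]")))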
You could use ?strsplit (as suggested in @Arun's answer in Extracting numbers from vectors (of strings)):
l <- c("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")
## split string at non-digits
s <- strsplit(l, "[^[:digit:]]")
## convert strings to numeric ("" become NA)
solution <- as.numeric(unlist(s))
## remove NA and duplicates
solution <- unique(solution[!is.na(solution)])
# [1] 7 667 11 5 2
A stringr solution with str_match_all and piped operators. For the first solution:
library(stringr)
str_match_all(ll, "[0-9]+") %>% unlist %>% unique %>% as.numeric
Second solution:
str_match_all(ll, "[0-9]") %>% unlist %>% unique %>% as.numeric
(Note: I've also called the list ll)
Use strsplit with a pattern that is the inverse of the numeric digits 0-9.
For the example you have provided, do this:
tmp <- sapply(list, function (k) strsplit(k, "[^0-9]"))
Then simply take a union of all 'sets' in the list, like so:
tmp <- Reduce(union, tmp)
Then you only have to remove the empty string.
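Put together, and with the empty string removed, the steps above might look like this. It is a sketch that uses lapply() rather than sapply() so that the intermediate result is always a list; strings stands in for the question's list:

```r
strings <- list("djud7+dg[a]hs667", "7fd*hac11(5)", "2tu,g7gka5")

# Split each string at every non-digit character
parts <- lapply(strings, function(k) strsplit(k, "[^0-9]")[[1]])

# Union of all pieces, keeping first-seen order
all_parts <- Reduce(union, parts)

# Drop the empty string left over from adjacent non-digits, then convert
as.numeric(all_parts[all_parts != ""])
#> [1]   7 667  11   5   2
```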
Check out the str_extract_numbers() function from the strex package.
pacman::p_load(strex)
list=list()
list[1] = "djud7+dg[a]hs667"
list[2] = "7fd*hac11(5)"
list[3] = "2tu,g7gka5"
charvec <- unlist(list)
print(charvec)
#> [1] "djud7+dg[a]hs667" "7fd*hac11(5)" "2tu,g7gka5"
str_extract_numbers(charvec)
#> [[1]]
#> [1] 7 667
#>
#> [[2]]
#> [1] 7 11 5
#>
#> [[3]]
#> [1] 2 7 5
unique(unlist(str_extract_numbers(charvec)))
#> [1] 7 667 11 5 2
Created on 2018-09-03 by the reprex package (v0.2.0).