How to loop through multiple values in the same string? - r

I wanna loop through a sequence of letters 'ABCDEFGHIJK', but the loop in R loops over 1 value at a time. Is there a way to loop over 3 values at a time? In this case the sequence 'ABCDEFGHIJK' would be looped as 'ABC' then 'DEF' and so on.
I've tried to change the length of the function but I still didn't find a way, I can do this in python but I didn't find any information about it in R nor in the help option of R.
xp <-'ACTGCT'
for(i in 1:length(xp)){
if(i == 'ACG'){
print('T')
}
}

We can use the vectorized substring, i.e.
substring('ABCDEFGHIJK', seq(1, nchar('ABCDEFGHIJK') - 1, 3), seq(3, nchar('ABCDEFGHIJK'), 3)) == 'ACG'
#[1] FALSE FALSE FALSE FALSE
NOTE: This only extracts complete 3-character chunks. If the string ends with one or two leftover characters, they are not returned (the last element is an empty string). For the above example, it outputs:
substring('ABCDEFGHIJK', seq(1, nchar('ABCDEFGHIJK') - 1, 3), seq(3, nchar('ABCDEFGHIJK'), 3))
#[1] "ABC" "DEF" "GHI" ""
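If the leftover characters should be kept, one option is to derive the end positions from the start positions and cap them at the string length. This is a minimal base R sketch; the names x, starts, ends and chunks are illustrative and not part of the answer above.

```r
x <- "ABCDEFGHIJK"
n <- nchar(x)

# Start of every chunk: 1, 4, 7, 10
starts <- seq(1, n, by = 3)
# End of every chunk, capped so the last chunk stops at the end of the string
ends <- pmin(starts + 2, n)

chunks <- substring(x, starts, ends)
chunks
# [1] "ABC" "DEF" "GHI" "JK"

# The comparison is vectorized over the chunks
chunks == "ACG"
```

Because starts and ends have the same length, substring does not need to recycle either argument.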

An option would be to split the string over each 3 characters and then do the comparison
lapply(strsplit(v1, "(?<=.{3})", perl = TRUE), function(x) x== 'ACG')
#[[1]]
#[1] FALSE FALSE FALSE FALSE
data
v1 <- 'ABCDEFGHIJK'

Here is a stringr solution that outputs a list for whether or not there are matches:
library(stringr)
# Split string into sequences of 3 (or fewer if length is not multiple of 3)
split_strings <- str_extract_all("ABCDEFGHIJK", ".{1,3}", simplify = T)[1,]
# The strings you want to loop through / search for
x <- c("ABC", "DEF", "GHI", "LMN")
# Output is named list
sapply(x, `%in%`, split_strings, simplify = F)
$ABC
[1] TRUE
$DEF
[1] TRUE
$GHI
[1] TRUE
$LMN
[1] FALSE
Or, if you only want to look for one element:
"ABC" %in% split_strings
[1] TRUE

1) Base R Iterate over the sequence 1, 4, 7, ... and use substr to extract the 3 character portion of the input string starting at that position number. Then perform whatever processing that is desired. If there are fewer than 3 characters in the last chunk it will use whatever is available for that chunk. This is a particularly good approach if you want to exit early since a break can be inserted into the loop.
for(i in seq(1, nchar(xp), 3)) {
s <- substr(xp, i, i+2)
print(s) # replace with desired processing
}
## [1] "ACT"
## [1] "GCT"
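As a sketch of the early exit mentioned above, the loop can stop as soon as a chunk of interest is seen. The variables target and found_at are hypothetical names introduced only for this illustration.

```r
xp <- "ACTGCT"
target <- "GCT"   # hypothetical chunk to search for
found_at <- NA    # start position of the first matching chunk, NA if none

for (i in seq(1, nchar(xp), by = 3)) {
  s <- substr(xp, i, i + 2)
  if (s == target) {
    found_at <- i
    break         # stop scanning once the chunk is found
  }
}
found_at
# [1] 4
```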
1a) lapply We translate the loop to lapply or sapply if one iteration does not depend on another.
process <- function(i) {
s <- substr(xp, i, i+2)
s # replace with desired processing
}
sapply(seq(1, nchar(xp), 3), process)
## [1] "ACT" "GCT"
2) rollapply Another possibility is to break the string up into single characters and then iterate over those passing a 3 element vector of single characters to the indicated function. Here we have used toString to process each chunk but that can be replaced with any other suitable function.
library(zoo)
rollapply(strsplit(xp, "")[[1]], 3, by = 3, toString, align = "left", partial = TRUE)
## [1] "A, C, T" "G, C, T"

Related

how to create loop for multiple output vectors with grabl function in stringdist

I'm trying to apply the grabl function of stringdist to a large character vector "testref". I want to check whether the strings in another character vector "testtitle" can be found in "testref". However, grabl only allows a single string to be tested at a time.
How can I circumvent this limitation?
Example to reproduce
#in reality each of the elements contains a full bibliography of a scientific article
testref <- c("asdfd sfgdgags dgsd.dsfas.dfs.f.sfas.f My beatiful title asfsdf dsf asfd dsf dsfsdfdsfsd, fdsf sdfdf: fsd fsdfafsd (2000) dsdfsf sfda", "sdfasfdsd, sdfsddf, fsagsg: sfds sfasdf sdfsdf", "sadfsdf: sdfsdf sdfggsdg another title here sdfdfsds, asdgasg (2021) blablabal")
#the pattern vector can contain up to 500 titles of scientific articles that contain typos or formatting mistakes. Hence, I need to use approximate matching
testtitle <- c("holy cow", "random notes", "MI beautiful title", "quantitative research is hard", "an0ther title here")
What I want to get out of this is a list of logical TRUE/FALSE vectors
results_list
#[[1]]
#[1] FALSE FALSE FALSE
#[[2]]
#[1] FALSE FALSE FALSE
#[[3]]
#[1] TRUE FALSE FALSE
#[[4]]
#[1] FALSE FALSE FALSE
#[[5]]
#[1] FALSE FALSE TRUE
So far, I have tried to loop the process as per @Rui Barradas's suggestion. Technically it works, but it takes a very long time.
results_list <- vector("list", length = 5)
for(i in 1:5) {
results_list[[i]] <- grabl(testref, testtitle[i], maxDist = 8)
}
I was wondering whether it is possible to use lapply in combination with the grabl function.
results_list <- lapply(testtitle, function(testtitle) grabl(testref, testtitle[], maxDist = 2))
But I get this error: Error in grabl(testref, testtitle[], maxDist = 2) :
could not find function "grabl"
I'm very grateful for your past suggestions and hope for more input!
Thank you!
Something like the following might do what you want. Untested, since there is no data.
# create a list to hold the results beforehand
results_list <- vector("list", length = 126)
for(i in 1:126) {
results_list[[i]] <- grabl(year2002$References, ref_year2002$Title[i], maxDist = 8)
}
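The "could not find function" error in the question usually means the package was not attached in that session, so lapply itself is probably not the problem. The following is a hedged sketch using the question's example data; it assumes the stringdist package is installed, and maxDist = 2 is only an illustrative threshold that would need tuning for real bibliographies.

```r
library(stringdist)  # provides grabl(); assumed to be installed

testref <- c("asdfd sfgdgags dgsd.dsfas.dfs.f.sfas.f My beatiful title asfsdf dsf asfd dsf dsfsdfdsfsd, fdsf sdfdf: fsd fsdfafsd (2000) dsdfsf sfda",
             "sdfasfdsd, sdfsddf, fsagsg: sfds sfasdf sdfsdf",
             "sadfsdf: sdfsdf sdfggsdg another title here sdfdfsds, asdgasg (2021) blablabal")
testtitle <- c("holy cow", "random notes", "MI beautiful title",
               "quantitative research is hard", "an0ther title here")

# One logical vector per title: is the title approximately found in each reference?
results_list <- lapply(testtitle, function(title) {
  grabl(testref, title, maxDist = 2)  # illustrative threshold
})
names(results_list) <- testtitle
str(results_list)
```

Naming the list by title makes it easier to see which pattern each logical vector belongs to.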

making for loop for character vector in R

char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport") # character vector
Suppose I have the above character vector.
I would like to create a for loop that prints only the elements of the vector that have more than 5 characters and start with a vowel,
and also deletes from the vector those elements that do not start with a vowel.
I created this for loop, but it also prints empty results (character(0)):
for (i in char_vector){
if (str_length(i) > 5){
i <- str_subset(i, "^[AEIOUaeiou]")
print(i)
}
}
The result for the above is
[1] "Africa"
[1] "identical"
[1] "ending"
character(0)
character(0)
My desired result would only be the first 3 characters
I'm really new to R and facing huge difficulty with creating a for loop for this problem. Any help would be greatly appreciated!
Use grepl with the pattern ^[AEIOUaeiuo]\w{5,}$:
char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport")
char_vector <- char_vector[grepl("^[AEIOUaeiuo]\\w{5,}$", char_vector)]
char_vector
[1] "Africa" "identical" "ending"
The regex pattern used here says to match words which:
^ from the start of the word
[AEIOUaeiuo] starts with a vowel
\w{5,} followed by 5 or more characters (total length > 5)
$ end of the word
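To see the pattern at work, here is a small check in base R that reuses the vector and regex from this answer; the names pattern and matches are illustrative.

```r
char_vector <- c("Africa", "identical", "ending", "aa", "bb", "rain", "Friday", "transport")
pattern <- "^[AEIOUaeiuo]\\w{5,}$"

# TRUE where the element starts with a vowel and has at least 6 characters in total
matches <- grepl(pattern, char_vector)

char_vector[matches]
# [1] "Africa" "identical" "ending"

# Every kept element has more than 5 characters
nchar(char_vector[matches])
# [1] 6 9 6
```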
You don't need a for loop, because these functions are vectorized in R.
A simple solution using grep and substr (refer to Tim Biegeleisen's answer for details):
substr(grep('^[aeiou].{5}', char_vector, ignore.case = TRUE, value = TRUE), 1, 3)
# [1] "Afr" "ide" "end"
With stringr functions, you'd rather use str_detect instead of str_subset, and you can take advantage of the fact that those functions are vectorized:
library(stringr)
char_vector[str_length(char_vector) > 5 & str_detect(char_vector, "^[AEIOUaeiou]")]
#[1] "Africa" "identical" "ending"
or if you want your for loop as a single vector:
vec <- c()
for (i in char_vector){
if (str_length(i) > 5 & str_detect(i, "^[AEIOUaeiou]")){
vec <- c(vec, i)
}
}
vec
# [1] "Africa" "identical" "ending"
The first 3 characters?
library(stringr)
for (i in char_vector){
if (str_length(i) > 5 & str_detect(i, "^[AEIOUaeiou]")) {
word <- str_sub(i, 1, 3)
print(word)
}
}
output is:
[1] "Afr"
[1] "ide"
[1] "end"
Using only base R functions. No need for a loop. I wrapped the steps in a function so you can use the function with other character vectors. You could make this code shorter (see @utubun's answer) but I feel it is easier to understand the process with a "one line one step" approach.
char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport")
yourfun <- function(char_vector){
char_vector <- char_vector[nchar(char_vector) > 5] # keep only the strings with more than 5 characters
char_vector <- char_vector[grep(pattern = "^[AEIOUaeiou]", char_vector)] # keep the strings that start with a vowel
return(char_vector) # return the filtered strings
# to get the first three characters of each string instead, replace the return() above with:
# out <- substring(char_vector, 1, 3) # select only the first 3 characters of each string
# return(out)
}
yourfun(char_vector = char_vector)
#> [1] "Africa" "identical" "ending"
Created on 2022-05-09 by the reprex package (v2.0.1)

Subset string by counting specific characters

I have the following strings:
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
I want to cut off the string as soon as the number of occurrences of A, G and N reaches a certain value, say 3. In that case, the result should be:
some_function(strings)
c("ABBSDGN", "AABSDG", "AGN", "GGG")
I tried to use the stringi, stringr and regex expressions but I can't figure it out.
You can accomplish your task with a simple call to str_extract from the stringr package:
library(stringr)
strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
str_extract(strings, '([^AGN]*[AGN]){3}')
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
The [^AGN]*[AGN] portion of the regex pattern says to look for zero or more consecutive characters that are not A, G, or N, followed by one instance of A, G, or N. The additional wrapping with parentheses and braces, like this ([^AGN]*[AGN]){3}, means look for that pattern three times consecutively. You can change the number of occurrences of A, G, N that you are looking for by changing the integer in the curly braces:
str_extract(strings, '([^AGN]*[AGN]){4}')
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
There are a couple ways to accomplish your task using base R functions. One is to use regexpr followed by regmatches:
m <- regexpr('([^AGN]*[AGN]){3}', strings)
regmatches(strings, m)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Alternatively, you can use sub:
sub('(([^AGN]*[AGN]){3}).*', '\\1', strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
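One thing to keep in mind with the base R variants: regmatches drops elements that have no match, and sub returns a non-matching string unchanged. If the result should stay aligned with the input and use NA for strings that never reach the count, a wrapper along these lines may help. The function name cut_after_n and its arguments are hypothetical, and the sketch assumes chars contains no regex-special characters.

```r
cut_after_n <- function(x, chars = "AGN", n = 3) {
  # Build e.g. "^((?:[^AGN]*[AGN]){3}).*$" from the arguments
  pattern <- sprintf("^((?:[^%s]*[%s]){%d}).*$", chars, chars, as.integer(n))
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))   # NA where the count is never reached
  out[hit] <- sub(pattern, "\\1", x[hit], perl = TRUE)
  out
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
cut_after_n(strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
cut_after_n(strings, n = 4)
# [1] "ABBSDGNHN" NA "AGNA" "GGGDSRTYHG"
```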
Here is a base R option using strsplit
sapply(strsplit(strings, ""), function(x)
paste(x[1:which.max(cumsum(x %in% c("A", "G", "N")) == 3)], collapse = ""))
#[1] "ABBSDGN" "AABSDG" "AGN" "GGG"
Or in the tidyverse
library(tidyverse)
map_chr(str_split(strings, ""),
~str_c(.x[1:which.max(cumsum(.x %in% c("A", "G", "N")) == 3)], collapse = ""))
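A caveat with the which.max approach: when a string never reaches the count, the comparison is all FALSE and which.max returns 1, so the string is silently cut to its first character. A possible variant, sketched here in base R with the hypothetical name cut_at_count, uses match to find the first position where the running count equals n and returns NA otherwise.

```r
cut_at_count <- function(x, chars = c("A", "G", "N"), n = 3) {
  vapply(strsplit(x, ""), function(ch) {
    # First position where the running count of target characters equals n
    idx <- match(n, cumsum(ch %in% chars))
    if (is.na(idx)) NA_character_ else paste(ch[seq_len(idx)], collapse = "")
  }, character(1))
}

strings <- c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
cut_at_count(strings)
# [1] "ABBSDGN" "AABSDG" "AGN" "GGG"
cut_at_count("ABC")   # only one A, G or N, so the count is never reached
# [1] NA
```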
Identify the positions of the pattern using gregexpr, then take the n-th position (3) and extract everything from 1 to this n-th position using substr.
nChars <- 3
pattern <- "A|G|N"
# Using sapply to iterate over strings vector
sapply(strings, function(x) substr(x, 1, gregexpr(pattern, x)[[1]][nChars]))
PS:
If there's a string that doesn't have 3 matches it will generate NA, so you just need to use na.omit on the final result.
This is just a version of Maurits Evers' neat solution without strsplit.
sapply(strings,
function(x) {
raw <- rawToChar(charToRaw(x), multiple = TRUE)
idx <- which.max(cumsum(raw %in% c("A", "G", "N")) == 3)
paste(raw[1:idx], collapse = "")
})
## ABBSDGNHNGA AABSDGDRY AGNAFG GGGDSRTYHG
## "ABBSDGN" "AABSDG" "AGN" "GGG"
Or, slightly different, without strsplit and paste:
test <- charToRaw("AGN")
sapply(strings,
function(x) {
raw <- charToRaw(x)
idx <- which.max(cumsum(raw %in% test) == 3)
rawToChar(raw[1:idx])
})
Interesting problem. I created a function (see below) that solves your problem. It's assumed that there are just letters and no special characters in any of your strings.
reduce_strings = function(str, chars, cnt){
# Replacing chars in str with "!"
chars = paste0(chars, collapse = "")
replacement = paste0(rep("!", nchar(chars)), collapse = "")
str_alias = chartr(chars, replacement, str)
# Obtain indices with ! for each string
idx = stringr::str_locate_all(pattern = '!', str_alias)
# Reduce each string in str
reduce = function(i) substr(str[i], start = 1, stop = idx[[i]][cnt, 1])
result = vapply(seq_along(str), reduce, "character")
return(result)
}
# Example call
str = c("ABBSDGNHNGA", "AABSDGDRY", "AGNAFG", "GGGDSRTYHG")
chars = c("A", "G", "N") # Characters that are counted
cnt = 3 # Count of the characters, at which the strings are cut off
reduce_strings(str, chars, cnt) # "ABBSDGN" "AABSDG" "AGN" "GGG"

Split a string every 5 characters

Suppose I have a long string:
"XOVEWVJIEWNIGOIWENVOIWEWVWEW"
How do I split this to get every 5 characters followed by a space?
"XOVEW VJIEW NIGOI WENVO IWEWV WEW"
Note that the last one is shorter.
I can do a loop where I constantly count and build a new string character by character but surely there must be something better no?
Using regular expressions:
gsub("(.{5})", "\\1 ", "XOVEWVJIEWNIGOIWENVOIWEWVWEW")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
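When the string length is an exact multiple of 5, the pattern above also appends a space after the last group. If that matters, a lookahead can restrict the replacement to groups that are followed by at least one more character. This is a sketch with a hypothetical helper name, add_spaces; it relies on perl = TRUE for the lookahead.

```r
add_spaces <- function(x, n = 5) {
  # Match n characters only when another character follows them
  pattern <- sprintf("(.{%d})(?=.)", as.integer(n))
  gsub(pattern, "\\1 ", x, perl = TRUE)
}

add_spaces("XOVEWVJIEWNIGOIWENVOIWEWVWEW")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"
add_spaces("ABCDEFGHIJ")   # length is a multiple of 5, no trailing space
# [1] "ABCDE FGHIJ"
```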
Using sapply
> string <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
> sapply(seq(from=1, to=nchar(string), by=5), function(i) substr(string, i, i+4))
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
You can try something like the following:
s <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW" # Original string
l <- seq(from=5, to=nchar(s), by=5) # Calculate the location where to chop
# Add sentinels 0 (beginning of string) and nchar(s) (end of string)
# and take substrings. (Thanks to @flodel for the condensed expression)
mapply(substr, list(s), c(0, l) + 1, c(l, nchar(s)))
Output:
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
Now you can paste the resulting vector (with collapse=' ') to obtain a single string with spaces.
A stringi solution without *apply:
library(stringi)
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
stri_sub(x, seq(1, stri_length(x),by=5), length=5)
[1] "XOVEW" "VJIEW" "NIGOI" "WENVO" "IWEWV" "WEW"
This extracts substrings just like in @Jilber's answer, but the stri_sub function is vectorized, so we don't need to use *apply here.
You can also use substring without a loop; substring is the vectorized version of substr.
x <- "XOVEWVJIEWNIGOIWENVOIWEWVWEW"
n <- seq(1, nc <- nchar(x), by = 5)
paste(substring(x, n, c(n[-1]-1, nc)), collapse = " ")
# [1] "XOVEW VJIEW NIGOI WENVO IWEWV WEW"

Use paste to combine letters instead of loops. R

I'm a newbie to R, but I'm trying to make a sliding window in R.
Using loops I can do it like this, but this gets very inefficient.
results=c(1:7)
letters=c("A","B","C","D","E","F","G","H","I","J")
for(i in 1:7){
results[i]=paste(letters[i:(i+3)],collapse="")
}
How can I use an apply function to get the same output?
A little different to Ramnath's answer:
lets <- LETTERS[1:10]
substring(paste(lets,collapse=""),1:7,4:10)
#[1] "ABCD" "BCDE" "CDEF" "DEFG" "EFGH" "FGHI" "GHIJ"
Here is one way to do this
sapply(1:7, function(i) {
paste(letters[i:(i+3)], collapse = '')
})
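The hard-coded 1:7 works because 10 letters with a window of 4 give 10 - 4 + 1 = 7 windows. A slightly more general sketch computes that count from the input; the function name sliding_windows is hypothetical, and vapply is used so the result is always a character vector.

```r
sliding_windows <- function(x, width) {
  n_windows <- length(x) - width + 1
  if (n_windows < 1) return(character(0))   # input shorter than one window
  vapply(seq_len(n_windows),
         function(i) paste(x[i:(i + width - 1)], collapse = ""),
         character(1))
}

sliding_windows(LETTERS[1:10], 4)
# [1] "ABCD" "BCDE" "CDEF" "DEFG" "EFGH" "FGHI" "GHIJ"
```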
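With the zoo time series package (letters here is the vector defined in the question, not R's built-in lowercase letters):
library(zoo)
apply(rollapply(letters,4,c), 1, paste, collapse="")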
[1] "ABCD" "BCDE" "CDEF" "DEFG" "EFGH" "FGHI" "GHIJ"
A "roll your own" way just for fun:
## n letters
nl <- 10
## length of string
len <- 4
## note I use the inbuilt LETTERS
apply(matrix(LETTERS[seq_len(nl)], nl + 1, len), 1, paste, collapse = "")[seq_len(nl - len + 1)]
(Leaves you with a warning based on incomplete recycling, but I like the trick of using a matrix to provide the offset for rolling windows).

Resources