Replace multiple characters, by index, in a string quickly - r

I'm trying to quickly replace multiple characters in a string with another character, such as *.
For example, I have a string such as:
string = "abcdefghij"
I also have a vector of indexes that indicate where I would like to replace letters in the above string with another character.
string_indexes_replaced = c(1, 4, 6, 9)
Desired output:
"*bc*e*gh*j"
What I've done
I've tried a fairly naive approach: splitting the string into a list of characters, replacing the selected characters with *, then collapsing the list back into the desired string, as shown below:
library(dplyr)
library(stringi)
string%>%
strsplit(split = "")%>%
lapply(function(x) replace(x, string_indexes_replaced, rep("*", length(string_indexes_replaced))))%>%
lapply(stri_flatten)%>%
unlist(use.names = FALSE)
which outputs
"*bc*e*gh*j"
but it is clear that there should be something simpler and faster than what I've posted above. Is there anything simpler & quicker than what I've demonstrated here?

In base R, besides the substring() and for-loop method shown by @akrun, you can use utf8ToInt() and intToUtf8() to do it:
v <- utf8ToInt(string)
v[string_indexes_replaced ] <- utf8ToInt("*")
res <- intToUtf8(v)
which gives
> res
[1] "*bc*e*gh*j"
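The same idea can be wrapped in a small helper that handles several strings at once. This is only a sketch: the name replace_at and its arguments are hypothetical, it assumes the replacement is a single character, and it ignores indexes that fall outside a given string.

```r
# Hypothetical helper: replace the characters at positions `idx`
# in every element of `strings` with `replacement`.
replace_at <- function(strings, idx, replacement = "*") {
  rep_code <- utf8ToInt(replacement)
  stopifnot(length(rep_code) == 1L)       # single replacement character only
  vapply(strings, function(s) {
    v <- utf8ToInt(s)                     # code points of one string
    keep <- idx[idx >= 1 & idx <= length(v)]  # drop out-of-range positions
    v[keep] <- rep_code
    intToUtf8(v)                          # back to a single string
  }, character(1), USE.NAMES = FALSE)
}

replace_at(c("abcdefghij", "xyz"), c(1, 4, 6, 9))
# [1] "*bc*e*gh*j" "*yz"
```

utf8ToInt() only accepts a single string, which is why the helper loops over the input with vapply().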

We can use substring
v1 <- c(1, 4, 6, 9)
for(i in seq_along(v1)) substring(string, v1[i], v1[i]) <- "*"
#[1] "*bc*e*gh*j"
As we are using stringi, another option is
library(stringi)
stri_sub_all(string, from = v1, length = 1) <- "*"
string
#[1] "*bc*e*gh*j"

A simple recursive solution. Its time efficiency should be roughly the same as the iterative (for loop) version. The benefit is that there are no side effects (the assignment to the integer vector ks is local to the function), so the whole computation can be treated as a functional abstraction and passed to other parts of a larger program. This helps keep the code modular.
# multi-replace for character vector input with length greater than 1
multi_replace_v <- function(v, r, ks) {
ks <- as.integer(ks)
if (length(ks) == 0) {
v
} else if (length(ks) == 1) {
if (ks[[1]] > length(v) || ks[[1]] < 1) {
stop("Invalid parameter: ks=", as.character(ks[[1]]), ". Valid range: 1-", as.character(length(v)))
} else if (ks[[1]] == 1) {
c(r, v[-1])
} else if (ks[[1]] == length(v)) {
c(v[-length(v)], r)
} else {
c(v[1:(ks[[1]]-1)], r, v[(ks[[1]]+1):length(v)])
}
} else {
multi_replace_v(multi_replace_v(v, r, ks[[1]]), r, ks[-1])
}
}
# multi-replace for input of single string character vector
multi_replace_s <- function(s, r, ks) paste0(multi_replace_v(unlist(strsplit(s, '')), r, ks), collapse = '')
# multi-replace for both single string and long vector input
multi_replace <- function(v_or_s, r, ks) {
if (length(v_or_s) == 1) {
multi_replace_s(v_or_s, r, ks)
} else if (length(v_or_s) > 1) {
multi_replace_v(v_or_s, r, ks)
} else {
NULL
}
}
# Example
> multi_replace('abcdefghij', "*", c(1,4,6,9))
[1] "*bc*e*gh*j"
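For comparison, the same side-effect-free behaviour can probably be had without recursion by using base R's replace(), which returns a modified copy of its input. This is a sketch, not the answer author's code; the name multi_replace_simple is hypothetical and it only handles the single-string case.

```r
# Hypothetical non-recursive variant for a single string `s`.
multi_replace_simple <- function(s, r, ks) {
  chars <- strsplit(s, "")[[1]]            # split into single characters
  ks <- as.integer(ks)
  # same range check as the recursive version
  stopifnot(all(ks >= 1L & ks <= length(chars)))
  # replace() returns a new vector; `chars` itself is not modified
  paste0(replace(chars, ks, r), collapse = "")
}

multi_replace_simple("abcdefghij", "*", c(1, 4, 6, 9))
# [1] "*bc*e*gh*j"
```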

Related

Searching strings with one mismatch using grepl

I would like to search for substrings in another string. The output has to be a logical, which is why I assumed grepl() would be the best option.
The requirements are that the function has to detect both strings where the pattern fits exactly (String 2 & 3) and strings where it fits with at least one mismatch allowed (String 1 & 3).
An example would look like this:
String1: ABCDEFGHIJKL
String2: ABDEFGHIJKL
String3: ABDEFG
Meaning the function has to detect both String 1 and String 2 when String 3 is the searched pattern.
Another option would be the matchPattern() function of the Biostrings package. But here, the output is not a logical but a:
Formal class 'XStringViews' [package "Biostrings"] with 5 slots
A way to convert this into a logical would work as well.
Thanks a lot
library(stringr)
stri <- c("ABCDEFG", "ABCDEFGHGT", "ABFCDE", "saffaf")
str_match <- function(pattern, st_c) {
logic_f <- NULL
for (i in seq_along(st_c)){
var <- strsplit(st_c[i], "")[[1]]
det <- str_detect(pattern, var)
logic <- ifelse(TRUE %in% det, TRUE, FALSE)
logic_f <- append(logic_f, logic)
}
return(logic_f)
}
str_match("ABD", stri)
## [1] TRUE TRUE TRUE FALSE
Purely for fun and not sure if viable for longer strings:
S1<-"ABCDEFGHIJKL"
S2<-"ABDEFGHIJKL"
S3<-"ABDEFG"
find_partial_matching_string<-function(string, pattern){
require(stringr)
a<-vector()
b<-vector()
for (i in 1:nchar(string)){
x<-str_sub(string, i, i)
a<-c(a,x)
}
for(j in 1:nchar(pattern)){
y<-str_sub(pattern, j, j)
b<-c(b,y)
}
z <- a %in% b
if(table(z[1:length(b)])<=1){
return(string)
}
}
> find_partial_matching_string(string = S1, pattern = S3)
[1] "ABCDEFGHIJKL"
> find_partial_matching_string(string = S2, pattern = S3)
[1] "ABDEFGHIJKL"
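Base R also has agrepl(), which performs approximate matching and returns a logical vector, so it may fit the original request more directly. The sketch below assumes that "one mismatch" means at most one inserted, deleted or substituted character; the extra non-matching string is only there for illustration.

```r
strings <- c("ABCDEFGHIJKL",   # String 1: contains the pattern with one extra letter
             "ABDEFGHIJKL",    # String 2: contains the pattern exactly
             "saffaf")         # illustrative non-match
pattern <- "ABDEFG"            # String 3

# max.distance = 1 allows a total edit cost of 1 between the pattern
# and the best-matching part of each string
hits <- agrepl(pattern, strings, max.distance = 1)
hits
```

Values of max.distance below 1 are interpreted as a fraction of the pattern length, so an integer is used here to make the allowed number of edits explicit.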

Can't manipulate global/local variables inside a function in R

dna = c("A","G","C","T")
x =sample(dna,50,replace =TRUE)
dna_f = function(x){
dnastring <- ""
for (val in x){
paste(dnastring,val,sep="")
}
return(dnastring)
}
dna_f(x)
I'm trying to produce a single string that contains all the randomly sampled letters. x contains all 50 letters, and I'm trying to combine them into one string using the paste function, but when I run this, the output is an empty string. I tried making dnastring a global variable because I thought function scope might work differently in R (I'm new to R), but I got the same output. Some help would be appreciated, thanks.
You don't need a for loop here. Try paste0() with the collapse argument.
dna_f = function(x){
paste0(x, collapse = '')
}
dna_f(x)
#[1] "CCTACCAACCCTTTCTAGCCCACTATGCATCACAACTGCGGTCTCATCAC"
You forgot the assignment dnastring <- inside the loop. paste() returns a new string; it does not modify dnastring in place.
dna = c("A","G","C","T")
x =sample(dna,50,replace =TRUE)
dna_f = function(x){
dnastring <- ""
for (val in x){
dnastring <- paste(dnastring,val,sep="")
}
return(dnastring)
}
Output:
> dna_f(x)
[1] "GGTCTGGCCGAACTACTGTACACCCCAAAGACAACGCCCCCGACGCTCTA"
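A minimal illustration of the point, independent of the DNA example: calling paste() without assigning the result leaves the variable untouched, so this is not a scoping issue.

```r
dnastring <- ""

paste(dnastring, "A", sep = "")   # result is computed and then discarded
unchanged <- dnastring            # still the empty string

dnastring <- paste(dnastring, "A", sep = "")   # result is stored
updated <- dnastring              # now "A"
```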

Function to abbreviate scientific names

Could you please help me?
I'm trying to modify an R function written by a colleague. This function receives a character vector with scientific names (Latin binomials), just like this one:
Name
Cerradomys scotti
Oligoryzomys sp
Philander frenatus
Byrsonima sp
Campomanesia adamantium
Cecropia pachystachya
Cecropia sp
Erythroxylum sp
Ficus sp
Leandra aurea
Then, it should abbreviate the scientific names, using only the first three letters of the genus (first term) and the epithet (second term) to make a short code. For instance, Cerradomys scotti should become Cersco.
This is the original function:
AbbreviatedNames <- function(vector) {
abbreviations <- character(length = length(vector))
splitnames <- strsplit(vector, " ")
for (i in 1:length(vector)) {
vector[i] <- if(splitnames[[i]][2] == "^sp") {
paste(substr(splitnames[[i]][1],1,3),
splitnames[[i]][2], sep = "")
}
else {
paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][2],1,3), sep = "")
}
}
vector
}
With a simple list like that one, the function works perfectly. However, when the list has some missing or extra elements, it does not work. The loop stops when it meets the first row that does not match the pattern. Let's take this more complex list as an example:
Name
Cerradomys scotti
Oligoryzomys sp
Philander frenatus
Byrsonima sp
Campomanesia adamantium
Cecropia pachystachya
Cecropia sp
Erythroxylum sp
Ficus sp
Leandra aurea
Morfosp1
Vismia cf brasiliensis
See that Morfosp1 has only 1 term. And Vismia cf brasiliensis has an additional term (cf) in the middle.
I've tried adapting the function, for instance, this way:
AbbreviatedNames <- function(vector) {
abbreviations <- character(length = length(vector))
splitnames <- strsplit(vector, " ")
for (i in 1:length(vector)) {
vector[i] <- if(splitnames[[i]][2] == "^sp" & is.na(splitnames[[i]][2])) {
paste(substr(splitnames[[i]][1],1,3),
splitnames[[i]][2], sep = "")
}
else {
paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][2],1,3), sep = "")
}
}
vector
}
Nevertheless, it does not work. I get this error message:
Error in if (splitnames[[i]][2] == "^sp" & is.na(splitnames[[i]][2])) { :
missing value where TRUE/FALSE needed
How could I make the function:
Deal also with names that have only 1 term?
Expected outcome: Morfosp1 -> Morfosp1 (stays the same)
Deal also with names that have an additional term in the middle?
Expected outcome: Vismia cf brasiliensis -> Visbra (term in the middle is ignored)
Thank you very much!
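The error itself can be reproduced in isolation. The sketch below uses the one-term name from the question and shows why the condition becomes NA; it also shows that == compares strings literally, so "^sp" is not treated as a regular expression there. The variable names are only for illustration.

```r
parts <- strsplit("Morfosp1", " ")[[1]]
second <- parts[2]                    # NA: there is no second term

na_comparison <- second == "sp"       # NA, and if (NA) raises the error above
still_na <- na_comparison & is.na(second)   # NA & TRUE is also NA

# isTRUE() turns NA into FALSE, which is safe inside if()
safe_test <- isTRUE(second == "sp")   # FALSE

literal <- "sp" == "^sp"              # FALSE: == does not interpret "^"
regex <- grepl("^sp", "sp")           # TRUE: grepl() does
```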
Something like this is pretty concise:
test <- c("Cerradomys scotti", "Oligoryzomys sp", "Latingstuff", "Latin staff more")
# function to truncate a given name
trunc_str <- function(latin_name) {
# split it on a space
name_split <- unlist(strsplit(latin_name, " ", fixed = TRUE))
# if one name, just return it
if (length(name_split) == 1) return(name_split)
# truncate to first 3 letters
name_trunc <- substr(name_split, 1, 3)
# paste the first and last term together (skipping any middle ones)
paste0(head(name_trunc, 1), tail(name_trunc, 1))
}
# iterate over all
vapply(test, trunc_str, "")
# Cerradomys scotti Oligoryzomys sp Latingstuff Latin staff more
# "Cersco" "Olisp" "Latingstuff" "Latmor"
If you don't want a named vector output, you can use USE.NAMES = FALSE in vapply(). Or feel free to use a loop here.
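As a quick check against the names from the question, here is a compact, self-contained variant of the same idea. It is a sketch with a hypothetical function name (abbreviate_name), and it assumes terms are separated by single spaces.

```r
# Hypothetical compact variant: first three letters of the first term
# plus first three letters of the last term; one-term names are kept.
abbreviate_name <- function(latin_name) {
  terms <- strsplit(latin_name, " ", fixed = TRUE)[[1]]
  if (length(terms) < 2) return(latin_name)
  paste0(substr(terms[1], 1, 3), substr(terms[length(terms)], 1, 3))
}

species <- c("Cerradomys scotti", "Oligoryzomys sp",
             "Morfosp1", "Vismia cf brasiliensis")
codes <- vapply(species, abbreviate_name, character(1), USE.NAMES = FALSE)
codes
# [1] "Cersco"   "Olisp"    "Morfosp1" "Visbra"
```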
AbbreviatedNames <- function(vector) {
abbreviations <- character(length = length(vector))
splitnames <- strsplit(vector, " ")
for (i in 1:length(vector)){
# One name: keep it unchanged (there is no second term to abbreviate)
if(length(splitnames[[i]])==1){
vector[i] <- splitnames[[i]][1]
}
# Two names
else if(length(splitnames[[i]])==2){
vector[i] <- if(splitnames[[i]][2] == "^sp") {
paste(substr(splitnames[[i]][1],1,3),
splitnames[[i]][2], sep = "")
}
else {
paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][2],1,3), sep = "")
}
}
# Three names
else if(length(splitnames[[i]])==3){
vector[i] <- paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][3],1,3), sep = "")
# Assuming that the unwanted word is always in the middle
}
}
return(vector)
}
I tested it on the list you gave and it seems to work. Tell me if you need more general code.
Thank you very much for the help, Ricardo and Adam! I've made the code available on GitHub to other people who work with interaction networks, and need to abbreviate scientific names to be used in graphs.

R: parse nested parentheses with specific text

I'm relatively new to R programming, but have a specific problem concerning the extraction of text from a syntactically parsed historical language corpus. The problem should be easy to solve, but I just can't get my head around it. My question is basically a more specific variation of this one: R: parse nested parentheses
I would like to parse nested parentheses in R. Here is an example of some data:
(sometext(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))sometext)
From this string I would like to extract all (potentially nested) substrings that begin with "NP", so the result should be
(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))
(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))
(NPR_REYNARD)
(NP-PRN(D_THE)(N_FOXE))
Any help would be much appreciated!
This probably isn't the most efficient, but here's a function which can extract the "tokens", that is, the strings between matched parentheses.
find_tokens <- function(s) {
stopifnot(length(s)==1)
mm <- gregexpr("[)()]", s)
stack <- numeric()
starts <- numeric()
stops <- numeric()
Map(function(i, v) {
if(v=="(") {
stack <<- c(stack, i)
} else if (v==")") {
starts <<- c(starts, tail(stack, 1))
stops <<- c(stops, i)
stack <<- stack[-length(stack)]
}
}, mm[[1]], regmatches(s, mm)[[1]])
rev(substring(s, starts, stops))
}
This will extract everything. If you want to keep just the values that start with "(NP" you can just grep this list
grep("^\\(NP", find_tokens(s), value=TRUE)
# [1] "(NP-SBJ(D+N_THYSTORYE)(PP(P_OF)(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))))"
# [2] "(NP(NPR_REYNARD)(NP-PRN(D_THE)(N_FOXE)))"
# [3] "(NP-PRN(D_THE)(N_FOXE))"
# [4] "(NPR_REYNARD)"
Here's another possible implementation of find_tokens that might be more efficient and that better supports multiple strings, returning a list.
find_tokens <- function(s) {
mm <- gregexpr("[)()]", s)
vv <- regmatches(s, mm)
extr <- function(x, mm, vv) {
open_i <- 0
shut_i <- 0
open <- numeric(length(vv)/2)
shut <- numeric(length(vv)/2)
close <- numeric(length(vv)/2)
for(i in seq_along(mm)) {
if (vv[i]=="(") {
open_i <- open_i + 1
shut_i <- shut_i + 1
open[open_i] <- mm[i]
shut[shut_i] <- open_i
} else if (vv[i]==")") {
close[shut[shut_i]] <- mm[i]
shut_i <- shut_i - 1
}
}
substring(x, open, close)
}
unname(Map(extr, s, mm, vv))
}
and then you would use
lapply(find_tokens(s), function(x) grep("^\\(NP", x, value=TRUE))
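The stack idea behind both versions can also be written as a plain loop over characters, without <<- and with the prefix filter built in. This is a sketch, not the answer author's code: the name extract_prefixed is hypothetical, it handles one string at a time, and it returns tokens in the order their closing parenthesis appears.

```r
# Hypothetical helper: return every balanced "(...)" substring of `s`
# that starts with `prefix`.
extract_prefixed <- function(s, prefix = "(NP") {
  chars <- strsplit(s, "")[[1]]
  stack <- integer(0)              # positions of currently open "("
  out <- character(0)
  for (i in seq_along(chars)) {
    if (chars[i] == "(") {
      stack <- c(stack, i)                  # push
    } else if (chars[i] == ")" && length(stack) > 0) {
      start <- stack[length(stack)]         # matching "("
      stack <- stack[-length(stack)]        # pop
      token <- substring(s, start, i)
      if (startsWith(token, prefix)) out <- c(out, token)
    }
  }
  out
}

extract_prefixed("(x(NP(A)(NPB)))")
# [1] "(NPB)"        "(NP(A)(NPB))"
```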

How to take value from one column and store it in newly created column using function call

First, sorry if this is a stupid question. I am learning R and don't have much experience.
I have the following function in R, which takes a value and returns a value.
dec2binSingle <- function(decimal) {
print(decimal)
binaryValue <- ""
index <- 0
decimal <- as.numeric(decimal)
while(decimal != 0) {
print(decimal)
temp <- as.numeric(decimal) %% 2
if (temp == 1) {
binaryValue <- paste("1", binaryValue, sep="", collapse = NULL)
decimal <- decimal - 1
} else {
binaryValue <- paste("0", binaryValue, sep="", collapse = NULL)
}
index <- index + 1
decimal <- decimal / 2
}
return(binaryValue)
}
The function converts a decimal number into its binary equivalent.
When I call the function, it completes without any error, but when I try to view the data, the following error appears:
Error in View : 'names' attribute [200] must be the same length as the vector [1]
And this is how the function is being called:
test_function <- function(value1) {return(dec2binSingle(as.numeric(unlist(value1))))}
data_example$tv <- with(data_example, test_function(data_example[which(colnames(data_example) == "numbers")]))
Any help is appreciated... thanks
EDIT:
I called the function for single value and it works as expected.
> dec2binSingle(23)
[1] "10111"
>
I hope this is what you wanted to achieve with your code.
#sample data
df <- data.frame(char1=c("abc","def","xyz"), num1=c(1,34,12), num2=c(34,20,8))
df
#function to convert decimal into binary
bin_func <- function(x) {gsub("^0+","",paste(rev(as.numeric(intToBits(x))), collapse=""))}
#check which columns are numeric
num_col <- sapply(df,is.numeric)
df1 <- as.data.frame(lapply(df[,num_col], FUN = function(x) {sapply(x, FUN = bin_func)}))
names(df1) <- paste(names(df1),"_converted",sep="")
#final dataframe having original as well as converted columns
df <- cbind(df,df1)
df
Please don't forget to let us know if it helped :)
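If you would rather keep a loop-based converter like the one in the question, the key point is that it works on one number at a time, so it has to be applied element by element when filling a column. A minimal sketch, assuming non-negative whole numbers; dec2bin is a hypothetical simplified version of dec2binSingle, and the data frame below only mimics the data_example and numbers names from the question.

```r
# Hypothetical simplified converter for a single non-negative integer.
dec2bin <- function(n) {
  n <- as.integer(n)
  stopifnot(length(n) == 1L, n >= 0L)
  if (n == 0L) return("0")
  bits <- character(0)
  while (n > 0L) {
    bits <- c(n %% 2L, bits)   # prepend the lowest bit
    n <- n %/% 2L              # integer division drops that bit
  }
  paste(bits, collapse = "")
}

data_example <- data.frame(numbers = c(1, 23, 34))
# vapply() calls dec2bin() once per row and checks each result is one string
data_example$tv <- vapply(data_example$numbers, dec2bin, character(1))
data_example$tv
# [1] "1"      "10111"  "100010"
```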
