Comparing Two Strings And Changing Case of Differing Characters - r

I'm trying to compare a "master" string to a list of strings using R. Based on this comparison, I'd like all characters in the string which differ from the master string changed to lowercase.
For example: the master string is "AGG". The list of strings being compared to is ["ATT", "AGT"]. I want to return ["Att","AGt"]. Order also matters. So ["GGA"] should return ["gGa"].
Any help would be greatly appreciated!

You could convert every string to lowercase first, then turn the characters that occur in the master string back to uppercase. Note that this ignores position, so "GGA" comes back unchanged:
master <- "AGG"
x <- c("ATT", "AGT", "GGA")
chartr(tolower(master), master, tolower(x))
# [1] "Att" "AGt" "GGA"
Update: If you want to compare x and master character-by-character, try this:
sapply(strsplit(x, ""), \(char) {
  paste(ifelse(char == strsplit(master, "")[[1]], char, tolower(char)), collapse = "")
})
# [1] "Att" "AGt" "gGa"
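If you need this for more than one vector, the character-by-character comparison can be wrapped in a small helper. This is only a sketch: the name lower_mismatches is made up for illustration, and it assumes every string has the same number of characters as the master string (it stops with an error otherwise).

```r
# Hypothetical helper: lowercase every character that differs from the
# character at the same position in `master`.
lower_mismatches <- function(x, master) {
  m <- strsplit(master, "")[[1]]
  vapply(strsplit(x, ""), function(chars) {
    # Assumption: position-wise comparison only makes sense for equal lengths
    stopifnot(length(chars) == length(m))
    paste(ifelse(chars == m, chars, tolower(chars)), collapse = "")
  }, character(1))
}

res <- lower_mismatches(c("ATT", "AGT", "GGA"), "AGG")
res
# [1] "Att" "AGt" "gGa"
```

Using vapply() instead of sapply() guarantees that a character vector comes back, even when x is empty.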

Related

How do I remove entire strings if they contain a matched pattern in R

Say I have the following string -
vector <- "this is a string of text containing stuff. something.com thisthat#co.uk and other stuff with something.anything"
I would like to remove any word that contains # or . inside it, so I would like to remove something.com, thisthat#co.uk and something.anything. I do not want to remove "stuff." because the . there ends a sentence rather than sitting inside the word. Ideally I would like to be able to use the %>% pipe to do this.
An alternative to the (much more terse/simple) gsub method:
gre <- gregexpr("[^ ]+[.#][^ ]+", vector)
regmatches(vector, gre)
# [[1]]
# [1] "something.com" "thisthat#co.uk" "something.anything"
regmatches(vector, gre) <- ""
vector
# [1] "this is a string of text containing stuff. and other stuff with "
This has the advantage of being able to replace them arbitrarily. Granted, we're just replacing them here with "", so this is a little overkill, but if you need to change the values somehow (change each substring), then this is a more powerful mechanism.
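As a sketch of what "replacing them arbitrarily" could look like: the assignment form of regmatches() also accepts a list of replacement values, one per match. The example text below is made up, and upper-casing is just a stand-in for whatever transformation you actually need.

```r
# Illustrative input (not from the question)
txt <- "see something.com and thisthat#co.uk today"

# Same pattern as above: a run of non-spaces containing . or # in the middle
m <- gregexpr("[^ ]+[.#][^ ]+", txt)

# Replace each matched substring with a transformed copy of itself
regmatches(txt, m) <- lapply(regmatches(txt, m), toupper)
txt
# [1] "see SOMETHING.COM and THISTHAT#CO.UK today"
```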
With gsub:
gsub(" ?\\w+[.#]\\S+", "", vector)
# [1] "this is a string of text containing stuff. and other stuff with"

Exact match with grepl R

I'm trying to extract certain records from a dataframe with grepl.
This is based on a comparison between two columns, Results and Names. The values are built like "WordNumber", but the same word occurs with many different numbers (more than 30), so when I use grepl to get, for instance, Word1, I also get results that I would like to avoid, such as Word12.
Any ideas on how to fix this?
Names <- data.frame(name = "Word1")
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "notThis")
Relationships <- data.frame(Results, Records)
Relationships <- subset(Relationships, grepl(paste(Names$name, collapse = "|"), Relationships$Results))
This doesn't work. If I use fixed = TRUE, then it doesn't return any result at all (which is weird). I have also tried concatenating the name part with other numbers like this, but with no success:
Relationships <- subset(Relationships, grepl(paste(paste(Names$name, '3', sep = ""), collapse = "|"), Relationships$Results))
Since I'm concatenating I'm not really sure of how to use the \b to enforce a full match.
Any suggestions?
In addition to @Richard's solution, there are multiple ways to enforce a full match.
\b
"\b" is a word-boundary anchor: it matches the position between a word character and a non-word character (or the start or end of the string)
> grepl("\\bWord1\\b",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
\< & \>
"\<" matches the beginning of a word, and "\>" matches the end of a word
> grepl("\\<Word1\\>",c("Word1","Word2","Word12"))
[1] TRUE FALSE FALSE
Use ^ to match the start of the string and $ to match the end of the string
Names <-c('^Word1$')
Or, to apply to the entire names vector
Names <-paste0('^',Names,'$')
I think this is just:
Relationships[Relationships$Results==Names,]
If you end up doing ^Word1$ you're just doing a straight subset.
If you have multiple names, then instead use:
Relationships[Relationships$Results %in% Names,]
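To tie the two approaches together, here is a minimal sketch that builds one anchored pattern from several names and checks that it selects the same rows as %in%. The second name ("Word15") and the "alsoWanted" record are added purely for illustration, and the regex version assumes the names contain no regex metacharacters.

```r
# Data modelled on the question; "Word15" / "alsoWanted" are illustrative additions
Names <- data.frame(name = c("Word1", "Word15"))
Results <- c("Word1", "Word11", "Word12", "Word15")
Records <- c("ThisIsTheResultIWant", "notThis", "notThis", "alsoWanted")
Relationships <- data.frame(Results, Records)

# One anchored alternation: ^(Word1|Word15)$
pattern <- paste0("^(", paste(Names$name, collapse = "|"), ")$")

by_regex <- Relationships[grepl(pattern, Relationships$Results), ]
by_in    <- Relationships[Relationships$Results %in% Names$name, ]

identical(by_regex, by_in)
# [1] TRUE
```

When only exact matches are needed, %in% is the simpler choice because nothing has to be escaped.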

concatenate a string that contains backquote characters R

I have a string that contains backquotes, which break the concatenate function c(). If you try to concatenate names wrapped in backticks, you get an error:
a <- c(`table`, `chair`, `desk`)
Error: object 'chair' not found
So I can create the variable:
bad.string <- "`table`, `chair`, `desk`"
a <- gsub("`", "", bad.string)
That gives a string "table, chair, desk".
It then should be like:
good.object <- c("table", "chair", "couch", "lamp", "stool")
I don't know why the backquotes cause the concatenate function to break, but how can I replace the string to not have the illegal characters?
Try:
good.string <- trimws(unlist(strsplit(gsub("`", "", bad.string), ",")))
Here gsub() removes the backticks, strsplit() splits the single string at each comma and returns a list, unlist() flattens that list into a character vector, and trimws() strips leading and trailing whitespace.
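The same pipeline, unrolled one step at a time so the intermediate values are visible. The intermediate variable names are only for illustration; fixed = TRUE is an optional addition that treats the pattern as a literal string rather than a regex.

```r
bad.string <- "`table`, `chair`, `desk`"

# 1. Drop the backticks
no_ticks <- gsub("`", "", bad.string, fixed = TRUE)
no_ticks
# [1] "table, chair, desk"

# 2. Split at commas; strsplit() returns a list with one element per input string
pieces <- strsplit(no_ticks, ",", fixed = TRUE)

# 3. Flatten to a character vector and trim the spaces left after each comma
good.string <- trimws(unlist(pieces))
good.string
# [1] "table" "chair" "desk"
```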
From the documentation on quotes, back ticks are reserved for non-standard variable names such as
`the dog` <- 1:5
`the dog`
# [1] 1 2 3 4 5
So when you call c() with backticked names, R is doing nothing wrong. It treats each name inside c() as a variable and tries to look it up, which causes the error.
If this is a vector you wrote yourself, just replace all of the backticks with single or double quotes.
If this is somehow being generated in R, capture the entire expression as a string, then use gsub() and eval(parse()):
eval(parse(text = gsub("`", "'", "c(`table`, `chair`, `desk`)")))
# [1] "table" "chair" "desk"
EDIT: For the new example of bad.string
You have to replace all of the backticks with double quotes; then you can parse the string with read.csv(). This is a little janky, as it gives back a one-row data frame, so we transpose it to get a one-column matrix:
bad.string <- "`table`, `chair`, `desk`"
okay_string <- gsub("`", "\"", bad.string)
okay_string
# [1] "\"table\", \"chair\", \"desk\""
t(read.csv(text = okay_string,header=FALSE, strip.white = TRUE))
#    [,1]
# V1 "table"
# V2 "chair"
# V3 "desk"

Retrieving a specific part of a string in R

I have the following vector of strings:
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of each string that comes after the equal sign, meaning I would like to get a vector like this:
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is placed in the string and where the part I want to retrieve ends.
We can use sub to match zero or more characters that are not = ([^=]*), followed by a =, and replace the match with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results as a vector (sapply names each element after its input string):
sapply(str1, function(x){
  strsplit(x, "=")[[1]][2]
})
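If the names that sapply() attaches are not wanted, one possible variant is to split the whole vector once and pull the second piece out of each element with vapply(). This is a sketch rather than part of the original answer; fixed = TRUE is an added assumption that "=" should be treated as a literal string.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# strsplit() is vectorised: one list element per input string.
# `[` with index 2 picks the part after the first "=".
ids <- vapply(strsplit(str1, "=", fixed = TRUE), `[`, character(1), 2)
ids
# [1] "BRYANPHI01"  "WILLIROB027" "THOMPWIL01"
```

A string without any "=" yields NA for that element instead of an error.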
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, whose ICU-based regex engine supports look-behind. (Base R's default regex engine does not, although passing perl = TRUE to the base functions enables it.)
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
library(stringi)
stri_extract_all_regex(str1, pattern = "(?<==).+$", simplify = TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise class if you are confident about the format of what you match. For example, '\w' matches any word character (letters, digits and underscore), so you could use "(?<==)\\w+$" (the \ must be escaped inside an R string, so you end up with \\w).
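For comparison, the same look-behind pattern can be run in base R by switching to the PCRE engine with perl = TRUE. This is a sketch, not part of the original answer. Note that regmatches() silently drops elements that have no match, which does not matter here because every string contains an equal sign.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# regexpr() locates the first match per string; perl = TRUE enables (?<=...)
m <- regexpr("(?<==)\\w+$", str1, perl = TRUE)
ids <- regmatches(str1, m)
ids
# [1] "BRYANPHI01"  "WILLIROB027" "THOMPWIL01"
```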

Available function for deletion of character from certain positions of a string

I am looking for a function which performs delete operation on a string based on the position.
For example, given string is like that
string1 <- "hello stackoverflow"
Suppose, I want to delete 4th,10th and 18th positions.
Preferred Output
"helo stakoverflw"
I am not sure about the existence of such function.
This worked for me.
string1 <- "hello stackoverflow"
paste((strsplit(string1, "")[[1]])[-c(4,10,18)],collapse="")
[1] "helo stakoverflw"
I used strsplit to split the string into a vector of characters, and then pasted only the desired characters back together into a string.
You could also write a function that does this:
delChar <- function(x, eliminate) {
  paste((strsplit(x, "")[[1]])[-eliminate], collapse = "")
}
delChar(string1, c(4, 10, 18))
# [1] "helo stakoverflw"
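One caveat with negative indexing: if the vector of positions is empty, x[-integer(0)] selects nothing, so the function above would return an empty string. A possible variant (the name del_chars is made up for illustration) keeps the complement of the positions instead, and also works on a vector of strings:

```r
# Hypothetical variant: drop the characters at `positions` from each string in `x`
del_chars <- function(x, positions) {
  vapply(strsplit(x, ""), function(chars) {
    # Keep every index that is not listed; out-of-range positions are ignored
    keep <- setdiff(seq_along(chars), positions)
    paste(chars[keep], collapse = "")
  }, character(1))
}

out <- del_chars(c("hello stackoverflow", "abc"), c(4, 10, 18))
out
# [1] "helo stakoverflw" "abc"

# An empty set of positions leaves the string untouched
del_chars("hello", integer(0))
# [1] "hello"
```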
