Swapping each letter in a string sequentially using R - r

For those fellow DnD fans, I recently found the Ring of the Grammarian. Thus I am trying to make a quick script for generating a list of sensible words based on swapping letters from an input string. For example, I want to input "mage hand" and have the program return a list or dataframe which reads:
cage hand
...yada yada ...
mage band
mage land
...yada yada ...
mage bang
so far, I've only gotten as far as this:
dictionary<-data.frame(DICTIONARY)
spell.suggester<-function(x){
for (i in 1:nchar(x)) {
for (k in 1:length(letters)) {
res1<-gsub(pattern = x[i] ,replace = letters[k], x)
res2<-grep("\\bres1\\b",dictionary[,1],value = F)
if_else(res2>1,print(grep("\\bres1\\b",dictionary[,1],value = T)),"nonsense")
return()
}
}
}
spell.suggester(x = "mage hand")
but I end up with an error message which reads
character(0)
NULL
I haven't found any answers on stack using R. Could someone please help me with some suggestions and guidance?

Your major problem here is that you're trying to index each letter of a string, and R doesn't like letting you do that - it treats a string as a whole value, so attempting to index the letters fails.
To fix that, you can use strsplit to turn a string into a vector of individual characters that you can index as normal.
Your second issue is that the dictionary search seems a bit over-complicated; you can use %in% to check if a value is present in a vector.
The code below shows a minimal example of how to do this; it only works with single words, and relies on you having a decent dictionary to check valid words against.
# minimal example of valid word list
dictionary <- c("vane", "sane", "pane", "cane",
                "bone", "bans", "bate", "bale")
spell.suggester <- function(spell) {
  # Split spell into a vector of single characters
  spell_letters <- strsplit(spell, "")[[1]]
  # Once for each letter in spell
  for (i in seq_along(spell_letters)) {
    # Skip spaces
    if (spell_letters[i] == " ") next
    # Once for each letter in letters
    for (k in seq_along(letters)) {
      # Create a new word by changing only the letter at position i
      # (gsub would replace every occurrence of that letter in the string)
      candidate <- spell_letters
      candidate[i] <- letters[k]
      word <- paste(candidate, collapse = "")
      # If the word is in the list of valid words, print the possibility
      if (word %in% dictionary) {
        print(word)
      }
    }
  }
}
spell.suggester(spell = "bane")
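To handle a multi-word input such as "mage hand", one option is to split each candidate phrase on spaces and keep it only if every word is valid. The following is a minimal sketch, not a definitive implementation: the function name suggest_spells and the small word list are made up for illustration, and a real run would need a full dictionary.

```r
# Hypothetical word list; a real run would load a full dictionary instead
dictionary <- c("mage", "cage", "page", "hand", "band", "land", "sand")

# Return every phrase that differs from `spell` by exactly one letter and
# whose words are all found in `dictionary`
suggest_spells <- function(spell, dictionary) {
  chars <- strsplit(spell, "")[[1]]
  suggestions <- character(0)
  for (i in seq_along(chars)) {
    if (chars[i] == " ") next                  # leave spaces alone
    for (letter in setdiff(letters, chars[i])) {
      candidate <- chars
      candidate[i] <- letter                   # change only position i
      phrase <- paste(candidate, collapse = "")
      words <- strsplit(phrase, " ", fixed = TRUE)[[1]]
      if (all(words %in% dictionary)) {
        suggestions <- c(suggestions, phrase)
      }
    }
  }
  suggestions
}

result <- suggest_spells("mage hand", dictionary)
print(result)
```

Returning a character vector instead of printing inside the loop makes it easy to put the suggestions into a data frame afterwards.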

Related

R: Loop should return numeric element from string

I have a question about how to write a loop in R which checks whether a certain expression occurs in a string. I want to check whether the expression “i-sty” occurs in my variable for each i in 1:200 and, if it does, return the corresponding i.
For example, if we have “4-sty” the loop should give me 4, and if there is no “i-sty” in the variable it should give me . for the observation.
I used
for (i in 1:200){
datafram$height <- ifelse(grepl("i-sty", dataframe$Description), i, ".")
}
But it did not work. I literally only receive points. Attached I show a picture of the string variable.
"i-sty" is just a string with the letter i in it. To use a regex pattern with your variable i, you need to paste together a string, e.g., grepl(paste0(i, "-sty"), ...). I'd also recommend using NA rather than "." for the "else" result - that way the resulting height variable can be numeric.
for (i in 1:200){
dataframe$height <- ifelse(grepl("i-sty", dataframe$Description), i, ".")
}
The above works syntactically, but not logically. You also have a problem that you are overwriting height each time through the loop - when i is 2, you erase the results from when i is 1, when i is 3, you erase the results from when i is 2... I think a better approach would be to extract the match, which is easy using stringr (but also possible in base). As a benefit, with the right pattern we can skip the loop entirely:
library(stringr)
dataframe$height = str_match(string = dataframe$Description, pattern = "([0-9]+)-sty")[, 2]
# might want to wrap in `as.numeric`
You use both datafram and dataframe. I've assumed dataframe is correct.
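For the base R route mentioned above, a rough sketch could use regexpr() and regmatches(). The three descriptions below are invented stand-ins for the asker's data, which is only shown as a picture in the question.

```r
# Hypothetical data standing in for the asker's dataframe
dataframe <- data.frame(
  Description = c("4-sty building", "no storeys given", "tower, 12-sty"),
  stringsAsFactors = FALSE
)

# Position of the first "<digits>-sty" in each description (-1 if absent)
m <- regexpr("[0-9]+-sty", dataframe$Description)

# Start with NA everywhere so the column stays numeric
height <- rep(NA_real_, nrow(dataframe))

# regmatches() returns only the matched pieces, e.g. "4-sty";
# strip the "-sty" suffix and convert what is left to a number
matched <- regmatches(dataframe$Description, m)
height[m != -1] <- as.numeric(sub("-sty", "", matched, fixed = TRUE))

dataframe$height <- height
print(dataframe)
```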

r replace text within a string by lookup table

I have already tried to find a solution on the internet for my problem, and I have the feeling I know all the small pieces but I am unable to put them together. I'm quite new at programming so please be patient :D...
I have a (in reality much larger) text string which look like this:
string <- "Test test [438] test. Test 299, test [82]."
Now I want to replace the numbers in square brackets using a lookup table and get a new string back. There are other numbers in the text but I only want to change those in brackets and need to have them back in brackets.
lookup <- read.table(text = "
Number orderedNbr
1 270 1
2 299 2
3 82 3
4 314 4
5 438 5", header = TRUE)
I have made a pattern to find the square brackets using regular expressions
pattern <- "\\[(\\d+)\\]"
Now I looked all around and tried sub/gsub, lapply, merge, str_replace, but I find myself unable to make it work... I don't know how to tell R! to look what's inside the brackets, to look for that same argument in the lookup table and give out what's standing in the next column.
I hope you can help me, and that it's not a really stupid question. Thx
We can use a regex look around to match only numbers that are inside a square bracket
library(gsubfn)
gsubfn("(?<=\\[)(\\d+)(?=\\])", setNames(as.list(lookup$orderedNbr),
lookup$Number), string, perl = TRUE)
#[1] "Test test [5] test. Test [3]."
Or without regex lookaround, by pasting the square brackets onto each column of 'lookup'
gsubfn("(\\[\\d+\\])", setNames(as.list(paste0("[", lookup$orderedNbr,
"]")), paste0("[", lookup$Number, "]")), string)
Read your table of keys and values (a 2 column table) into a data frame. If your source information is in a flat text file, you can use read.csv to obtain a data frame. In the example below, I hard code a data frame with just two entries. Then, I iterate over it and make replacements in the input string.
df <- data.frame(keys=c(438, 82), values=c(5, 3))
string <- "Test test [438] test. Test [82]."
for (i in 1:nrow(df)) {
string <- gsub(paste0("(?<=\\[)", df$keys[i], "(?=\\])"), df$values[i], string, perl=TRUE)
}
string
[1] "Test test [5] test. Test [3]."
Note: As @Frank wisely pointed out, my solution would fail if your number markers (e.g. [438]) have replacements which are numbers that also appear as other markers. That is, if replacing a key with a value results in yet another key, there could be problems. If this is a possibility, I would suggest using markers for which this cannot happen. For example, you could remove the brackets after each replacement.
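To make the pitfall in the note concrete, here is a small sketch with an invented lookup in which the value 5 is also a key. It compares the sequential loop with a single-pass replacement based on regmatches<-, which looks up every marker before changing anything.

```r
# Hypothetical lookup in which one value (5) is also a key
df <- data.frame(keys = c(438, 5), values = c(5, 9))
string <- "A [438] B [5]"

# Sequential replacement: [438] becomes [5], which is then replaced again
chained <- string
for (i in seq_len(nrow(df))) {
  chained <- gsub(paste0("(?<=\\[)", df$keys[i], "(?=\\])"),
                  df$values[i], chained, perl = TRUE)
}
print(chained)

# Single pass: find all markers first, then replace them in one go
m <- gregexpr("(?<=\\[)\\d+(?=\\])", string, perl = TRUE)
found <- as.numeric(regmatches(string, m)[[1]])
single <- string
regmatches(single, m) <- list(as.character(df$values[match(found, df$keys)]))
print(single)
```

The loop ends with "A [9] B [9]", while the single pass gives the intended "A [5] B [9]".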
You can use regmatches<- with a pattern containing lookahead/lookbehind:
patt = "(?<=\\[)\\d+(?=\\])"
m = gregexpr(patt, string, perl=TRUE)
v = as.integer(unlist(regmatches(string, m)))
`regmatches<-`(string, m, value = list(lookup$orderedNbr[match(v, lookup$Number)]))
# [1] "Test test [5] test. Test 299, test [3]."
Or to modify the string directly, change the last line to the more readable...
regmatches(string, m) <- list(lookup$orderedNbr[match(v, lookup$Number)])

collapse strings in a vector three times for an or statement in r

I have a vector with multiple strings
strings <- c("CD4","CD8A")
and I'd like to output an OR statement to be passed to grep like so
"CD4-|-CD4-|-CD4$|CD8A-|-CD8A-|-CD8A$"
and so on for each element in the vector..
basically I'm trying to find an exact word in a string that has three dashes in it, (I don't want grep(CD4, ..) to return strings with CD40). This is how I thought of doing it but I'm open to other suggestions
part of my data.frame looks like this:
Genes <- as.data.frame(c("CD4-MyD88-IL27RA", "IL2RG-CD4-GHR","MyD88-CD8B-EPOR", "CD8A-IL3RA-CSF3R", "ICOS-CD40-LMP1"))
colnames(Genes) <- "Genes"
Here is a one-liner...
Genes$Genes[grep(paste0("\\b",strings,"\\b",collapse="|"),Genes$Genes)]
[1] "CD4-MyD88-IL27RA" "IL2RG-CD4-GHR" "CD8A-IL3RA-CSF3R"
It uses word-boundary markers \\b to make sure that it matches complete substrings (as the - does not count as part of a word).
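As a quick check of how the word-boundary markers behave, the sketch below builds the same pattern and tests it on three of the gene strings from the question. Only base R is used.

```r
strings <- c("CD4", "CD8A")

# One alternation in which each gene name is wrapped in word boundaries
pattern <- paste0("\\b", strings, "\\b", collapse = "|")
print(pattern)

# CD4 matches as a whole word; CD40 and CD8B do not match anything
hits <- grepl(pattern, c("IL2RG-CD4-GHR", "ICOS-CD40-LMP1", "MyD88-CD8B-EPOR"))
print(hits)
```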
I'm not sure I understood the question, but if I did, the following command will return what you want
library(magrittr) # provides the %>% pipe
stringr::str_split(Genes$Genes, pattern = '-') %>%
purrr::map(
function(data) {
data[stringr::str_which(data, pattern = '^CD')]
}
) %>% unlist

Best approach to remove all parts of data column that don't match any part of a list?

I've got a dataframe with a column of song titles, label info and other messy string data. I also have an isolated vector of specific song titles. I'd like to filter out all characters that aren't a matched song from the song titles. I'm using something like this, but it is showing errors.
song.list <- c("Song.1","Song.2", "Song.3")
Mydata$Songs <- My data column containing all sorts of things including the songs I'm after
levels(Mydata$Songs)[(Mydata$Songs) %in% song.list] <- "" #I'd like the opposite of this
levels(Mydata$Songs)![(Mydata$Songs) %in% song.list] <- ""#My use of '!' doesn't work
I know that using the above indexing without the ! will work to replace my song list with blank space, but I'm trying to replace everything else with a blank space. I've got about 29 songs in my list and about 1000 rows of messy string data in a single column. I've also tried gsub and grep to no avail.
I haven't been able to come up with a vectorized solution, but if I understood you correctly this loop over the factor levels should do the job:
library(stringr)
for (level in levels(df$A)) {
match <- na.omit(str_extract(level, song.list))
if (length(match) > 0) {
levels(df$A)[levels(df$A) == level] <- match
}
}
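One possible vectorized alternative, offered as a sketch rather than a tested replacement for the loop, is to join the song titles into a single regular expression and extract the first match from each entry with regexpr() and regmatches(). The messy vector below is invented, and the sketch works on a plain character vector rather than factor levels.

```r
song.list <- c("Song.1", "Song.2", "Song.3")
# Hypothetical messy column
messy <- c("Label X - Song.1 (remix)", "no song here", "Song.3 / B-side")

# Treat each "." literally, then join the titles into one alternation
pattern <- paste(gsub(".", "[.]", song.list, fixed = TRUE), collapse = "|")

# Position of the first song title in each entry (-1 if none)
m <- regexpr(pattern, messy)

# Entries without a match become "", the others keep only the song title
cleaned <- rep("", length(messy))
cleaned[m != -1] <- regmatches(messy, m)
print(cleaned)
```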
Original answer which didn't do what the OP intended
I'm not sure I fully understand what you're trying to do, but I think this is what you're after. This doesn't remove the rows, though!
levels(Mydata$Songs)[!Mydata$Songs %in% song.list] <- ""

Use regular expressions inside only the end portion of strings

I am pre-processing a data frame with 100,000+ blog URLs, many of which contain content from the blog header. The grep function lets me drop many of those URLs because they pertain to archives, feeds, images, attachments or a variety of other reasons. One of them is that they contain “atom”.
For example,
string <- "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/"
row <- "one"
df <- data.frame(row, string)
df$string <- as.character(df$string)
df[-grep("atom", string), ]
My problem is that the pattern “atom” might appear in a blog header, which is important content, and I do not want to drop those URLs.
How can I concentrate the grep on only the final 20 characters (or some number that greatly reduces the risk that I will grep out content that contains the pattern rather than the ending elements)? The question "Regular Expressions _# at end of string" uses $ at the end but is not using R; besides, I don't know how to extend the $ back 20 characters.
Assume that it is not always the case that the pattern has forward slashes on either or both ends. E.g, /atom/.
The function substr can isolate the end portion of the strings, but I don’t know how to grep only within that portion. The pseudo-code below draws on the %in% function to try to illustrate what I would like to do.
substr(df$string, nchar(df$string)-20, nchar(df$string)) # extracts last 20 characters; start at nchar end -20, to end
But what is the next step?
string[-grep(pattern = "atom" %in% (substr(string, nchar(string)-20, nchar(string))), x = string)]
Thank you for your guidance.
lastpart=substr(df$string, nchar(df$string)-20, nchar(df$string))
if(length(grep("atom",lastpart))>0){
# atom was in there
} else {
# atom was not in there
}
could also do it without the lastpart..
if(length(grep("atom",substr(df$string, nchar(df$string)-20, nchar(df$string))))>0){
# atom was in there
} else {
# atom was not in there
}
but things become harder to read... (it saves an intermediate variable, though)
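Because if() tests the column as a whole, dropping individual rows needs one TRUE/FALSE per URL. A minimal sketch of that idea follows, using grepl() on the tail of each string. The second URL is invented for contrast. Note also that substr(x, nchar(x) - 20, nchar(x)) returns 21 characters, so the sketch starts at nchar - n + 1 to get exactly n.

```r
# Hypothetical URLs: the first ends in an "atom" feed path, the second only
# mentions "atomic" early in the title
urls <- c(
  "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/",
  "http://www.example.com/2014/05/update-on-atomic-energy-legislation-and-policy-news/"
)
df <- data.frame(row = c("one", "two"), string = urls, stringsAsFactors = FALSE)

n <- 20
# Last n characters of each URL (the start is clamped at 1 for short strings)
tail_part <- substr(df$string, pmax(nchar(df$string) - n + 1, 1), nchar(df$string))

# grepl() returns one TRUE/FALSE per row, so it can index rows directly
kept <- df[!grepl("atom", tail_part), ]
print(kept)
```

Indexing with !grepl() also avoids a known trap of df[-grep(...), ]: when nothing matches, grep() returns an empty vector and the negative index drops every row.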
You could try using a URL component depth approach (i.e. only return df rows which contain the word "atom" after 5 slashes):
find_first_match <- function(string, pattern) {
components <- unlist(strsplit(x = string, split = "/", fixed = TRUE), use.names = FALSE)
matches <- grepl(pattern = pattern, x = components)
if (any(matches)) {
# which.max() on a logical vector gives the index of the first TRUE
first.match <- which.max(matches)
} else {
first.match <- NA
}
return(first.match)
}
Which can be used as follows:
# Add index for first component match of "atom" in url
df$first.match <- sapply(df$string, find_first_match, pattern = "atom", USE.NAMES = FALSE)
# Return rows which have the word "atom" only after the first 5 components
df[which(df$first.match >= 6), ]
# row string first.match
# 1 one http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/ 6
This gives you control over which URLs to return based on the depth of when "atom" appears
I chose the second answer because it is easier for me to understand and because with the first one it is not possible to predict how many forward slashes to include in the “component depth”.
Read from the innermost function outward, the second answer says:
Take the final 20 characters of your string with the substr() function; this is your substring;
then look for the pattern “atom” in that substring with the grep() function;
then check whether “atom” was found at least once, i.e., whether the result has length greater than zero, in which case the row will be omitted;
finally, if the pattern is not matched, i.e., no “atom” is found in the final 20 characters, leave the row alone. All of this is done with an if ... else statement.
