String Detecting in R

I have the following strings.
x <- c("A1A1A1", "A3V???", "B4F3**")
I want to flag only the strings whose last 3 characters do not follow the pattern [[:digit:]][[:alpha:]][[:digit:]]
Thus, I would want to flag the 2nd and 3rd strings above. Any suggestions?

Just for clarification, are you trying to remove the strings that do not follow that pattern? One way I can think of is str_remove_all() from the stringr package: clnstrings <- str_remove_all(vectornameofstrings, "symbols or patterns that you would want removed")
There are probably more efficient ways to do this, but as far as I know (I am still learning) this could work. If anyone else has input on this answer, please don't hesitate to comment!

grepl is suitable here
> !grepl("\\d\\w\\d$", x)
[1] FALSE TRUE TRUE
If you want to get the position:
> grep("\\d\\w\\d$", x, invert = TRUE)
[1] 2 3
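One caveat: \\w also matches digits and underscores, so a tail such as "111" would pass the check above. Here is a minimal sketch using the POSIX classes from the question, assuming that only a letter should be accepted in the middle position. The extra string "ZZ111" is made up for illustration.

```r
x <- c("A1A1A1", "A3V???", "B4F3**", "ZZ111")  # "ZZ111" is a hypothetical extra case

# digit, letter, digit, anchored at the end of the string
strict <- "[[:digit:]][[:alpha:]][[:digit:]]$"

flag_strict <- !grepl(strict, x)         # flags "ZZ111"
flag_loose  <- !grepl("\\d\\w\\d$", x)   # does not flag "ZZ111": \w matches the middle "1"

print(flag_strict)
print(flag_loose)
```

If a digit or underscore in the middle position is acceptable for your data, the shorter \\w version is fine as it is.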

Related

String Matching in R - Problem with pattern

I have a small problem. I want to extract identifiers like these:
v-97bcer
or b-chyfvg or ghd6db
I tried this:
identifier_1 <- "([:alnum:]{6})" # for things like this ghd6db
identifier_2 <- "([:lower:]{1})[- ][:alnum:]{6})" # for things like this v-97bcer or b-chyfvg
The first identifier works reasonably well, but it also extracts other things, for example names. In an identifier such as GHD6D8, the digits have no fixed position and can occur anywhere; I only know that the length is 6.
The second problem is that v-97bcer can also occur as v97bcer, but I need the format v-97bcer. Here too the digits occur at random positions.
Could somebody help, or point me to a good source for understanding how to do this? I do not have much experience with string matching. Thank you.
This should work:
x <- c("v-97bcer", "b-chyfvg", "ghd6db", "v97bcer")
grep("^([a-z].)?[a-z0-9]{6}$", x)
Note that, to fix the length of the match, I anchor the pattern with ^ and $.
This pattern matches v-97bcer and b-chyfvg and ghd6db but not v97bcer.
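Note that the . in that pattern matches any character, not only a hyphen. The sketch below uses a literal hyphen instead and also inserts the missing hyphen with sub(). It assumes that a lowercase letter followed by exactly six lowercase alphanumerics is the hyphen-less variant of the prefixed format; that rule is a guess about the data, not something the question states.

```r
x <- c("v-97bcer", "b-chyfvg", "ghd6db", "v97bcer")

# Assumption: letter + exactly six [a-z0-9] characters means the hyphen is
# missing, so insert it after the first letter. Other strings are left as-is.
normalized <- sub("^([a-z])([a-z0-9]{6})$", "\\1-\\2", x)

# Literal "-" instead of ".", so only a real hyphen is accepted as separator
ok <- grepl("^([a-z]-)?[a-z0-9]{6}$", normalized)

print(normalized)
print(ok)
```

Six-character identifiers such as ghd6db pass through sub() unchanged because the pattern requires seven characters.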

Use grepl to match certain words but only in specific contexts where other words must not occur

Assuming this is my data...
mydata<-data.frame(text=c("There are books.","Books are bad.", "I like to read books."))
...how would I, using grepl, match rows in which "book(s)" occur, but "bad" doesn't (i.e. rows 1 and 3, but not row 2)?
I tried something like that with a negative lookahead...
grepl("book(s)?.*?(?!\\bbad\\b)", mydata$text, perl=T, ignore.case=T)
...but that didn't work, as it also matches the second row. I assume that is because as soon as "book(s)" is detected, it returns TRUE regardless of whether "bad" co-occurs.
EDIT: Just to add this as a condition: I don't know anything about the specific structure of the string and the location of books and bads, but let's just assume book(s) comes first. Example: "there are plenty of books, all of which are bad, but some I really like.".
Using negative lookahead, we can do
grepl("^(?!.*bad).*books.*$", mydata$text, perl = TRUE)
#[1] TRUE FALSE TRUE
The lookahead (?!.*bad), anchored at ^, ensures that bad does not occur anywhere in mydata$text before books is matched.
An easier option is
grepl('book(s)?', mydata$text) & !grepl('\\bbad\\b', mydata$text)
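Both answers above are case-sensitive and the first one matches bad inside longer words. Here is a sketch that adds ignore.case and word boundaries, assuming that "Books" should count as a match and that words such as "badge" should not count as "bad". The last two test sentences are illustrative; the fourth comes from the question's EDIT and the fifth is made up.

```r
texts <- c("There are books.",
           "Books are bad.",
           "I like to read books.",
           "there are plenty of books, all of which are bad, but some I really like.",
           "A badge on a book.")  # hypothetical: contains "bad" only inside "badge"

# Two simple checks combined
has_book <- grepl("\\bbooks?\\b", texts, ignore.case = TRUE, perl = TRUE)
has_bad  <- grepl("\\bbad\\b",    texts, ignore.case = TRUE, perl = TRUE)
keep <- has_book & !has_bad

# Same logic as a single PCRE pattern with a negative lookahead
keep_pcre <- grepl("^(?!.*\\bbad\\b).*\\bbooks?\\b", texts,
                   ignore.case = TRUE, perl = TRUE)

print(keep)
print(keep_pcre)
```

The two-call version is usually easier to read and extend when more excluded words are added later.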

How to search for strings with parentheses in R

Using R, I have a long list of keywords that I'm searching for in a dataset. One of the keywords needs to have parentheses around it in order to be included.
I've been attempting to replace the parenthesis in the keywords list with \\ then the parentheses, but have not been successful. If there is a way to modify the grepl() function to recognize them, that would also be helpful. Here is an example of what I'm trying to accomplish:
patterns<-c("dog","cat","(fish)")
data<-c("brown dog","black bear","salmon (fish)","red fish")
patterns2<- paste(patterns,collapse="|")
grepl(patterns2,data)
[1] TRUE FALSE TRUE TRUE
I would like salmon (fish) to give TRUE, and red fish to give FALSE.
Thank you!
As noted by @joran in the comments, the pattern should look like so:
patterns<-c("dog","cat","\\(fish\\)")
The double backslashes (\\) tell R to read the parentheses literally when searching for the pattern.
Easiest way to achieve this if you don't want to make the change manually:
patterns <- gsub("([()])","\\\\\\1", patterns)
Which will result in:
[1] "dog" "cat" "\\(fish\\)"
If you're not very familiar with regular expressions, here is what happens: the pattern looks for any one character within the square brackets, and the round brackets around them tell it to save whatever matched. In the second argument, the first four backslashes produce a single literal backslash in the result (R's string parser reduces four to two, and the replacement syntax reduces those two to one; print() then displays that one backslash as \\), and the \\1 tells it to add whatever it saved from the first argument, i.e., either ( or ).
Another option is to forget regex and use grepl with fixed = TRUE on the original, unescaped patterns:
rowSums(sapply(patterns, grepl, data, fixed = TRUE)) > 0
# [1] TRUE FALSE TRUE FALSE
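To tie the two approaches together, here is a small end-to-end sketch using the data from the question. It keeps the original patterns untouched and escapes a copy, because escaped patterns combined with fixed = TRUE would search for literal backslashes. The variable names escaped, regex, hits_regex and hits_fixed are only illustrative.

```r
patterns <- c("dog", "cat", "(fish)")
data <- c("brown dog", "black bear", "salmon (fish)", "red fish")

# Regex route: escape the parentheses in a copy, then join with "|"
escaped <- gsub("([()])", "\\\\\\1", patterns)
regex <- paste(escaped, collapse = "|")
hits_regex <- grepl(regex, data)

# Fixed route: match each original pattern literally, then combine per row
hits_fixed <- rowSums(sapply(patterns, grepl, data, fixed = TRUE)) > 0

print(hits_regex)
print(hits_fixed)
```

Note that this only escapes parentheses; keywords containing other metacharacters such as . or + would need a broader escaping rule or the fixed route.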

Searching for an exact String in another String

I'm dealing with a very simple question and that is searching for a string inside of another string. Consider the example below:
bigStringList <- c("SO1.A", "SO12.A", "SO15.A")
strToSearch <- "SO1."
bigStringList[grepl(strToSearch, bigStringList)]
I'm looking for something that when I search for "SO1.", it only returns "SO1.A".
I saw many related questions on SO but most of the answers include grepl() which does not work in my case.
Thanks very much for your help in advance.
When you want to match a string literally, so that characters such as . are not treated as regex metacharacters, you can set fixed=TRUE:
grep("SO1.", bigStringList, fixed=TRUE, value=TRUE)
# [1] "SO1.A"
Otherwise, as Frank notes, you'll need to escape the period (so that it'll be interpreted as an actual . rather than as a symbol meaning "any single character"):
grep("SO1\\.", bigStringList, value=TRUE)
# [1] "SO1.A"
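A side-by-side sketch of why the original call returns too much, using the data from the question. The startsWith() variant is an extra option that assumes the search string is always expected at the beginning of each element, which the question does not state explicitly.

```r
bigStringList <- c("SO1.A", "SO12.A", "SO15.A")
strToSearch <- "SO1."

# As a regex, "." matches any character, so "SO12" and "SO15" also match
as_regex <- grepl(strToSearch, bigStringList)

# As a fixed string, "." must be a literal period
as_fixed <- grepl(strToSearch, bigStringList, fixed = TRUE)

# Assumption: the search string is a prefix (base R >= 3.3.0)
as_prefix <- startsWith(bigStringList, strToSearch)

print(bigStringList[as_fixed])
```

fixed = TRUE finds the string anywhere in the element, while startsWith() only checks the beginning.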

Variable name restrictions in R

What are the restrictions as to what characters (and maybe other restrictions) can be used for a variable name in R?
(This screams of general reference, but I can't seem to find the answer)
You might be looking for the discussion from ?make.names:
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
In the help file itself, there's a link to a list of reserved words, which are:
if else repeat while function for in next break
TRUE FALSE NULL Inf NaN NA NA_integer_ NA_real_ NA_complex_
NA_character_
Many other good notes from the comments include the point by James to the R FAQ addressing this issue and Josh's pointer to a related SO question dealing with checking for syntactically valid names.
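As a quick illustration of the rule quoted from ?make.names, the sketch below compares a few candidate names with what make.names() turns them into. The candidate names are invented for the example; a name is treated as syntactically valid when make.names() leaves it unchanged.

```r
# Hypothetical candidate names covering the cases in the quoted rule
candidates <- c("my var", "2way", ".2way", "if", "ok_name", ".hidden")

fixed_up <- make.names(candidates)

# A name is syntactically valid if make.names() did not have to change it
is_syntactic <- fixed_up == candidates

print(data.frame(candidates, fixed_up, is_syntactic))
```

Only the last two candidates should come back unchanged: the others contain a space, start with a digit, start with a dot followed by a digit, or are a reserved word.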
Almost NONE! You can use 'assign' to make ridiculous variable names:
assign("1",99)
ls()
# [1] "1"
Yes, that's a variable called '1'. Digit 1. Luckily it doesn't change the value of integer 1, and you have to work slightly harder to get its value:
1
# [1] 1
get("1")
# [1] 99
The "syntactic restrictions" some people might mention are purely imposed by the parser. Fundamentally, there's very little you can't call an R object. You just can't do it with an unquoted name and the '<-' assignment operator. "get" will set you free :)
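Backtick quoting is another route worth knowing: a name wrapped in backticks is accepted by the parser even when it is not syntactically valid, so ordinary assignment works too. A minimal sketch with made-up names:

```r
# Non-syntactic names are fine when quoted with backticks
`my var` <- 5
`1` <- 99

# The backticks are needed again when reading the values
result <- `my var` + `1`
print(result)

# The unquoted literal is unaffected
print(1)
```

Such names are legal but awkward to work with, so they are usually best kept for cases like data frame columns imported from files.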
The following may not directly address your question but is of great help.
Try the exists() function to check whether a name is already in use; that way you know which system names to avoid for your own variables or functions.
Example...
> exists('for')
[1] TRUE
> exists('myvariable')
[1] FALSE
Using the make.names() function from the built-in base package may help:
is_valid_name <- function(x) {
  # Maximum name length: 256 before R 2.13.0, 10000 from then on
  max_length <- if (getRversion() < "2.13.0") 256L else 10000L
  is_short_enough <- nchar(x) <= max_length
  # make.names() returns its input unchanged only if it is already valid
  is_unchanged <- make.names(x) == x
  # & (not &&) keeps the check vectorized over x
  is_short_enough & is_unchanged
}

Resources