Return matching values instead of boolean - r

Consider:
chars <- c("A", "B", "C")
string <- c("B", "C")
chars[!(chars %in% string)]
So, I want to get the character(s) that are not in string.
The code works, but I feel like it's kind of inconvenient.
Is there a function in R which returns the actual value directly, instead of evaluating TRUE/FALSE and then indexing?

As akrun mentioned you want to use
setdiff(chars, string)
More generally, setdiff() shares its help page with union() and other useful (and, I feel, underused) functions for performing set operations, such as
intersect()
which more directly answers the phrasing of your initial question.
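For reference, a quick sketch of the base set-operation functions on the example vectors:
chars <- c("A", "B", "C")
string <- c("B", "C")
setdiff(chars, string)    # values of chars not in string
# [1] "A"
intersect(chars, string)  # values present in both vectors
# [1] "B" "C"
union(chars, string)      # all distinct values from either vector
# [1] "A" "B" "C"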

Related

Replace characters only if it is not repeating

Is there a way to replace a character only if it is not repeating, or repeating a certain number of times?
str = c("ddaabb", "daabb", "aaddbb", "aadbb")
gsub("d{1}", "c", str)
[1] "ccaabb" "caabb" "aaccbb" "aacbb"
#Expected output
[1] "ddaabb" "caabb" "aaddbb" "aacbb"
You can use negative lookarounds in your regex to exclude cases where d is preceded or followed by another d:
gsub("(?<!d)d(?!d)", "c", str, perl=TRUE)
Edit: adding perl=TRUE as suggested by the OP. For more info about the regex engine in R, see this question.
Now that you've added "or repeating a specified number of times," the regex-based approaches may get messy. Thus I submit my wacky code from a previous comment.
foo <- unlist(strsplit(str, ''))
bar <- rle(foo)
and then look for instances where bar$lengths equals the desired length, and use the returned indices to locate the position in the original sequence (by summing bar$lengths[1:k]). If you only want to replace a specific character, check the corresponding value of bar$values[k] and replace selectively as desired.
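A minimal sketch of that rle() idea for the simple case, assuming we only replace "d" where its run length is exactly 1 (replace_single is just an illustrative helper name; inverse.rle() rebuilds the character sequence):
replace_single <- function(s, target = "d", replacement = "c") {
  runs <- rle(strsplit(s, "")[[1]])        # run-length encode the characters
  hit <- runs$values == target & runs$lengths == 1
  runs$values[hit] <- replacement          # swap only the non-repeating runs
  paste(inverse.rle(runs), collapse = "")  # rebuild the string
}
sapply(str, replace_single, USE.NAMES = FALSE)
# [1] "ddaabb" "caabb"  "aaddbb" "aacbb"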

Wildcard to match string in R

This might sound quite silly but it's driving me nuts.
I have a matrix that has alphanumeric values and I'm struggling to test whether some elements of that matrix match only on the initial and final letters. Since I don't care about the middle character, I'm trying (without success) to use a wildcard.
As an example, consider this matrix:
m <- matrix(nrow=3,ncol=3)
m[1,]=c("NCF","NBB","FGF")
m[2,]=c("MCF","N2B","CCD")
m[3,]=c("A3B","N4F","MCP")
I want to evaluate if m[2,2] starts with "N" and ends with "B", regardless of the 2nd letter in the string. I've tried something like
grep("N.B",m)
and it works, but still I want to know if there is a more compact way of doing it, like:
m[2,2]=="N.B"
which obviously didn't work!
Thanks
You can use grepl with the subsetted m like:
grepl("^N.B$", m[2,2])
#[1] TRUE
or use startsWith and endsWith:
startsWith(m[2,2], "N") & endsWith(m[2,2], "B")
#[1] TRUE
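If you want to flag every element of the matrix at once, grepl() is vectorised over m, so a small sketch building on the same pattern is to reshape the result back into a logical matrix:
matches <- matrix(grepl("^N.B$", m), nrow = nrow(m))
matches
#       [,1]  [,2]  [,3]
# [1,] FALSE  TRUE FALSE
# [2,] FALSE  TRUE FALSE
# [3,] FALSE FALSE FALSE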

Translating vector elements in R using correspondence table

What is an efficient and simple way in R to do the following?
read in two-column data from a file
use this information to build some kind of translation dictionary, like a python dict
apply the translation to the content of a vector in order to obtain the translated vector, possibly for several vectors but using the same correspondence information
I thought that the hash package would help me to do that, but I'm not sure I'm performing step 3 correctly.
Say my initial vector is my_vect and my hash is my_dict
I tried the following:
values(my_dict, keys=my_vect)
The following observations make me doubt that I'm doing it the proper way:
The operation seems slow (more than one second on a powerful desktop computer, with a vector of 582 entries and a hash of 46665 entries).
The result doesn't look homogeneous with my_vect: while my_vect appears to be indexed by numbers (integer indices between square brackets appear beside the values when displaying the data in the interactive console), the result of calling values as above still somehow looks like a dictionary: each displayed translated value has the original value (i.e. the hash key) displayed above it. I just want the values.
Edit:
If I understand correctly, R has some way of using "names" instead of numerical indices for vectors, and what I obtain using the values function is such a vector with names. It seems to work for what I wanted to do, although I imagine it takes more memory than necessary.
I tried libraries hash and hashmap, and the second seemed more efficient.
A small usage example:
> library(hashmap)
> keys = c("a", "b", "c", "d")
> values = c("A", "B", "C", "D")
> my_dict <- hashmap(keys, values)
> my_vect <- c("b", "c", "c")
> translated <- my_dict$find(my_vect)
> translated
[1] "B" "C" "C"
To build the dictionary from a table obtained using read.table, the option stringsAsFactors = FALSE of read.table has to be used, otherwise weird things happen (see discussion in the comments of https://stackoverflow.com/a/38838271/1878788).
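For comparison, a plain named character vector already behaves like a simple dictionary in base R; a minimal sketch, where unname() drops the names mentioned in the edit above:
lookup <- c(a = "A", b = "B", c = "C", d = "D")   # names act as keys
my_vect <- c("b", "c", "c")
unname(lookup[my_vect])
# [1] "B" "C" "C"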
Did you try the str_replace_all function from the stringr package?
Let's say you have a dictionary data frame dict with columns original and replacement. The following code replaces all instances of original with replacement in the vector.
library(stringr)
translations <- setNames(dict$replacement, dict$original)
new_vect <- str_replace_all(vect, fixed(translations))
I'm not sure if it implements hashing, but the underlying work is done in C code from the stringi package, so it should be fast.
The only case where that won't work as-is is if some of the words in original contain other words in original. In that case you'll need to add regular-expression start-of-string (^) or end-of-string ($) markers to the original strings you want to replace.
translations <- setNames(dict$replacement, paste0("^", dict$original, "$"))
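A toy usage of that approach, mirroring the code above (the dict contents here are made up for illustration):
library(stringr)
dict <- data.frame(original    = c("b", "c"),
                   replacement = c("B", "C"),
                   stringsAsFactors = FALSE)
vect <- c("b", "c", "c")
translations <- setNames(dict$replacement, dict$original)
str_replace_all(vect, fixed(translations))
# [1] "B" "C" "C"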

R: Why can't for loop or c() work out for grep function?

Thanks to grep using a character vector with multiple patterns, I figured out my own problem as well.
The question there was how to find multiple values using the grep function, and the solution was either of these:
grep("A1|A9|A6", text)
or
toMatch <- c("A1", "A9", "A6")
matches <- unique(grep(paste(toMatch, collapse="|"), text))
So I used the second suggestion since I had MANY values to search for.
But I'm curious why c() or for loop doesn't work out instead of |.
Before I researched the possible solution in stackoverflow and found recommendations above, I tried out two alternatives that I'll demonstrate below:
First, what I had written in R was something like this:
find.explore.l <- lapply(text.words.bl, function(m) grep("^explor", m))
But then I had to 'grep' many words, so I tried this:
find.explore.l <- lapply(text.words.bl, function(m) grep(c("A1","A2","A3"), m))
It didn't work, so I tried another one (XXX is the list of words that I'm supposed to find in the text):
for (i in XXX){
  find.explore.l <- lapply(text.words.bl, function(m) grep("XXX[i]", m))
  .......(more lines to append lines etc)
}
and it seemed like R tried to match XXX[i] itself, not the words inside.
Why can't c() and for loop for grep return right results?
Someone please let me know! I'm so curious :P
From the documentation for the pattern= argument in the grep() function:
Character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr and gregexpr.
This confirms that, as #nrussell said in a comment, grep() is not vectorized over the pattern argument. Because of this, c() won't work for a list of regular expressions.
You could, however, use a loop; you just have to modify your syntax.
toMatch <- c("A1", "A9", "A6")
# Loop over the values to match, printing the matching indices for each pattern
for (i in toMatch) {
  print(grep(i, text))
}
Using "XXX[i]" as your pattern doesn't work because it's interpreting that as a regular expression. That is, it will match exactly XXXi. To reference an element of a vector of regular expressions, you would simply use XXX[i] (note the lack of surrounding quotes).
You can apply() this, but in a slightly different way than you had done. You apply it to each regex in the list, rather than each text string.
lapply(toMatch, function(rgx, text) grep(rgx, text), text = text)
However, the best approach would be, as you already have in your post, to use
matches <- unique(grep(paste(toMatch, collapse = "|"), text))
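For instance, with a hypothetical text vector (made up here just to show the shape of the result):
text <- c("A1 note", "B7 note", "A6 note", "A9 note", "A1 again")
toMatch <- c("A1", "A9", "A6")
unique(grep(paste(toMatch, collapse = "|"), text))
# [1] 1 3 4 5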
Consider that:
XXX <- c("a", "b", "XXX[i]")
grep("XXX[i]", XXX, value=T)
character(0)
grep("XXX\\[i\\]", XXX, value=T)
[1] "XXX[i]"
What is R doing? It is applying regular-expression rules to the first argument of grep: the brackets ([ and ]) are treated as special characters. I put in two backslashes to tell R to treat them as literal brackets. And imagine what would happen if I put that last expression into a for loop: it wouldn't do what I expected.
If you would like a for loop that goes through a character vector of possible matches, take out the quotes in the grep function.
#if you want the match returned
matches <- c("a", "b")
for (i in matches) print(grep(i, XXX, value=T))
[1] "a"
[1] "b"
#if you want the vector location of the match
for (i in matches) print(grep(i, XXX))
[1] 1
[1] 2
As the comments point out, grep(c("A1","A2","A3"), m) violates grep's required syntax: the pattern must be a single string, not a character vector.

Case-insensitive search of a list in R

Can I search a character list for a string where I don't know how the string is cased? Or more generally, I'm trying to reference a column in a dataframe, but I don't know exactly how the columns are cased. My thought was to search names(myDataFrame) in a case-insensitive manner to return the proper casing of the column.
I would suggest the grep() function and some of its additional arguments that make it a pleasure to use.
grep("stringofinterest",names(dataframeofinterest),ignore.case=TRUE,value=TRUE)
without the argument value=TRUE you will only get a vector of index positions where the match occurred.
Assuming that there are no variable names which differ only in case, you can search your all-lowercase variable name in tolower(names(myDataFrame)):
match("b", tolower(c("A","B","C")))
[1] 2
This will produce only exact matches, but that is probably desirable in this case.
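To get back the properly cased column name itself, the matched index can be used to pull the name out of the data frame; a small sketch using iris as a stand-in for myDataFrame:
idx <- match("species", tolower(names(iris)))
names(iris)[idx]
# [1] "Species"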
With the stringr package, you can modify the pattern with one of the built-in modifier functions (see ?modifiers). For example, since we are matching a fixed string (no special regular expression characters) but want to ignore case, we can do
str_detect(colnames(iris), fixed("species", ignore_case=TRUE))
Or you can use the (?i) case insensitive modifier
str_detect(colnames(iris), "(?i)species")
For anyone using this with %in%, simply use tolower on the right (or both) sides, like so:
"b" %in% c("a", "B", "c")
# [1] FALSE
tolower("b") %in% tolower(c("a", "B", "c"))
# [1] TRUE
The searchable package was created to allow various types of searching within objects:
l <- list( a=1, b=2, c=3 )
sl <- searchable(l) # make the list "searchable"
sl <- ignore.case(sl) # turn on case insensitivity
> sl['B']
$b
[1] 2
It works with lists and vectors and does a lot more than simple case-insensitive matching.
If you want to search for one set of strings in another set of strings, case insensitively, you could try:
s1 = c("a", "b")
s2 = c("B", "C")
matches = s1[ toupper(s1) %in% toupper(s2) ]
Another way of achieving this is to use str_which(string, pattern) from the stringr package:
library("stringr")
str_which(string = tolower(colnames(iris)), pattern = "species")

Resources