Case-insensitive search of a list in R - r

Can I search a character list for a string where I don't know how the string is cased? Or more generally, I'm trying to reference a column in a dataframe, but I don't know exactly how the columns are cased. My thought was to search names(myDataFrame) in a case-insensitive manner to return the proper casing of the column.

I would suggest the grep() function and some of its additional arguments that make it a pleasure to use.
grep("stringofinterest",names(dataframeofinterest),ignore.case=TRUE,value=TRUE)
Without the argument value=TRUE, you will only get a vector of index positions where the match occurred.
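A minimal sketch of using this to resolve a column name of unknown casing. The data frame df and its column names are made up for illustration, and the ^ and $ anchors are an addition (not part of the answer above) that keeps the pattern from also matching longer names.

```r
# Hypothetical data frame; the column names are illustrative only
df <- data.frame(Species = c("a", "b"), Species_code = 1:2, SEPAL = c(5.1, 4.9))

# Without anchors, the pattern matches any name containing "species"
loose <- grep("species", names(df), ignore.case = TRUE, value = TRUE)

# Anchored: only a whole-name match, whatever its casing
col <- grep("^species$", names(df), ignore.case = TRUE, value = TRUE)

# Without value = TRUE, grep() returns the position instead of the name
idx <- grep("^species$", names(df), ignore.case = TRUE)

df[[col]]  # reference the column by its actual name
```

Here loose holds both "Species" and "Species_code", while col holds only "Species".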

Assuming that there are no variable names which differ only in case, you can search for your all-lowercase variable name in tolower(names(myDataFrame)):
match("b", tolower(c("A","B","C")))
[1] 2
This will produce only exact matches, but that is probably desirable in this case.
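As a rough sketch, the position returned by match() can be used to recover the properly cased name. The data frame df and the variable wanted are hypothetical names chosen for this example.

```r
# Hypothetical data frame with inconsistently cased column names
df <- data.frame(Alpha = 1:2, BETA = 3:4, gamma = 5:6)

wanted <- "beta"  # the name as we believe it is spelled, in any case

# Compare lowercase against lowercase, then index the original names
pos <- match(tolower(wanted), tolower(names(df)))
actual <- names(df)[pos]

# match() returns NA when there is no exact match, so check before using it
missing_pos <- match("delta", tolower(names(df)))
```

After this, actual is "BETA" and df[[actual]] refers to the column; missing_pos is NA.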

With the stringr package, you can modify the pattern with one of the built-in modifier functions (see ?modifiers). For example, since we are matching a fixed string (no special regular expression characters) but want to ignore case, we can do
str_detect(colnames(iris), fixed("species", ignore_case=TRUE))
Or you can use the (?i) case insensitive modifier
str_detect(colnames(iris), "(?i)species")

For anyone using this with %in%, simply use tolower on the right (or both) sides, like so:
"b" %in% c("a", "B", "c")
# [1] FALSE
tolower("b") %in% tolower(c("a", "B", "c"))
# [1] TRUE

The searchable package was created for allowing for various types of searching within objects:
l <- list( a=1, b=2, c=3 )
sl <- searchable(l) # make the list "searchable"
sl <- ignore.case(sl) # turn on case insensitivity
> sl['B']
$b
[1] 2
It works with lists and vectors and does a lot more than simple case-insensitive matching.

If you want to search for one set of strings in another set of strings, case insensitively, you could try:
s1 = c("a", "b")
s2 = c("B", "C")
matches = s1[ toupper(s1) %in% toupper(s2) ]

Another way of achieving this is to use str_which(string, pattern) from the stringr package:
library("stringr")
str_which(string = tolower(colnames(iris)), pattern = "species")

Related

Replace characters only if it is not repeating

Is there a way to replace a character only if it is not repeating, or repeating a certain number of times?
str = c("ddaabb", "daabb", "aaddbb", "aadbb")
gsub("d{1}", "c", str)
[1] "ccaabb" "caabb" "aaccbb" "aacbb"
#Expected output
[1] "ddaabb" "caabb" "aaddbb" "aacbb"
You can use negative lookarounds in your regex to exclude cases where d is preceded or followed by another d:
gsub("(?<!d)d(?!d)", "c", str, perl=TRUE)
Edit: adding perl=TRUE as suggested by OP. For more info about regex engine in R see this question
Now that you've added "or repeating a specified number of times," the regex-based approaches may get messy. Thus I submit my wacky code from a previous comment.
foo <- unlist(strsplit(str, ''))
bar <- rle(foo)
and then look for instances of bar$lengths == desired_length and use the returned indices to locate (by summing all bar$lengths[1:k] ) the position in the original sequence. If you only want to replace a specific character, check the corresponding value of bar$values[k] and selectively replace as desired.
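One possible way to turn the rle() idea into working code is sketched below. The helper replace_runs and its arguments are hypothetical names, and instead of summing lengths to locate positions it edits the run values and rebuilds the string with inverse.rle(), which is a simplification of the approach described above. It assumes each replaced character maps to a single character.

```r
# Replace character `from` with `to`, but only in runs of exactly `n` repeats
replace_runs <- function(s, from, to, n = 1) {
  runs <- rle(strsplit(s, "")[[1]])              # run-length encode the characters
  hit <- runs$values == from & runs$lengths == n # runs of `from` with the desired length
  runs$values[hit] <- to                         # swap the character, keep the run length
  paste(inverse.rle(runs), collapse = "")        # rebuild the string
}

str <- c("ddaabb", "daabb", "aaddbb", "aadbb")

# Apply to each string separately so runs never span two strings
single <- vapply(str, replace_runs, character(1),
                 from = "d", to = "c", USE.NAMES = FALSE)
double <- vapply(str, replace_runs, character(1),
                 from = "d", to = "c", n = 2, USE.NAMES = FALSE)
```

With n = 1 this gives "ddaabb" "caabb" "aaddbb" "aacbb", matching the expected output in the question; with n = 2 only the doubled d runs are replaced.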

subset strings without a pattern stringr

I want to extract elements of a character vector which do not match a given pattern. See the example:
x<-c("age_mean","n_aitd","n_sle","age_sd","n_poly","n_sero","child_age")
x_age<-str_subset(x,"age")
x_notage<-setdiff(x,x_age)
In this example I want to extract those strings which do not match the pattern "age". How to achieve this in a single call of str_subset ? What is the appropriate syntax of the pattern "not age". As you can see I am not very expert with regex. Thanks for any comments.
In this case there seems to be no reason to use stringr (efficiency perhaps). You may simply use grep:
grep("age", x, invert = TRUE, value = TRUE)
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
If, however, you want to stick with stringr, note that (from ?str_subset)
str_subset() is a wrapper around x[str_detect(x, pattern)], and is equivalent to grep(pattern, x, value = TRUE).
So,
x[!str_detect(x, "age")]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"
or also
x[!grepl("age", x)]
# [1] "n_aitd" "n_sle" "n_poly" "n_sero"

Translating vector elements in R using correspondence table

What is an efficient and simple way in R to do the following?
1. Read in two-column data from a file.
2. Use this information to build some kind of translation dictionary, like a Python dict.
3. Apply the translation to the content of a vector in order to obtain the translated vector, possibly for several vectors but using the same correspondence information.
I thought that the hash package would help me to do that, but I'm unsure I perform step 3 correctly.
Say my initial vector is my_vect and my hash is my_dict
I tried the following:
values(my_dict, keys=my_vect)
The following observations make me doubt that I'm doing it the proper way:
The operation seems slow (more than one second on a powerful desktop computer with a vector of 582 entries and a hash of 46665 entries)
The result doesn't look homogeneous with my_vect: while my_vect appears "indexed by numbers" (integer numbers between square brackets appear beside the values when the data is displayed in the interactive console), the result of calling values as above still looks somewhat like a dictionary: each displayed translated value has the original value (i.e. the hash key) displayed above it. I just want the values.
Edit:
If I understand correctly, R has some way of using "names" instead of numerical indices for vectors, and what I obtain using the values function is such a vector with names. It seems to work for what I wanted to do, although I imagine it takes more memory than necessary.
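The named-vector idea mentioned in this edit can also be used directly in base R, without any hashing package. The sketch below assumes a small correspondence table dict_df with columns original and replacement (hypothetical names standing in for the two-column file); unname() drops the names so that only the values remain.

```r
# Hypothetical correspondence table, as it might come from read.table()
dict_df <- data.frame(original = c("a", "b", "c", "d"),
                      replacement = c("A", "B", "C", "D"),
                      stringsAsFactors = FALSE)

# A named character vector acts as the dictionary: names are the keys
my_dict <- setNames(dict_df$replacement, dict_df$original)

my_vect <- c("b", "c", "c")

# Indexing by name performs the translation; unname() strips the keys
translated <- unname(my_dict[my_vect])

# Keys that are absent from the dictionary come back as NA
unknown <- unname(my_dict["z"])
```

Here translated is "B" "C" "C". Whether this is fast enough for a dictionary of tens of thousands of entries would need to be measured.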
I tried libraries hash and hashmap, and the second seemed more efficient.
A small usage example:
> library(hashmap)
> keys = c("a", "b", "c", "d")
> values = c("A", "B", "C", "D")
> my_dict <- hashmap(keys, values)
> my_vect <- c("b", "c", "c")
> translated <- my_dict$find(my_vect)
> translated
[1] "B" "C" "C"
To build the dictionary from a table obtained using read.table, the option stringsAsFactors = FALSE of read.table has to be used, otherwise weird things happen (see discussion in the comments of https://stackoverflow.com/a/38838271/1878788).
Did you try the str_replace_all function from the stringr package?
Let's say you have a dictionary data frame dict with columns original and replacement. The following code replaces all instances of original with replacement in the vector.
library(stringr)
translations <- setNames(dict$replacement, dict$original)
new_vect <- str_replace_all(vect, fixed(translations))
I'm not sure if it implements hashing, but the underlying expression is in C code from the stringi package, so it should be fast.
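The only case where that won't work as is, is if some of the words in original contain other words in original. In this case you'll need to add regular expression start-of-string (^) or end-of-string ($) markers to the original strings you want to replace, and pass the translations as regular expressions rather than through fixed(), since fixed() treats ^ and $ as literal characters.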
translations <- setNames(dict$replacement, paste0("^", dict$original, "$"))

using grep with multiple entries in r to find matching strings

If I have a vector of strings:
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012")
and want to find the entries with '2011' in the string I can use
ifiles <- dd[grep("2011",dd)]
How do I search for entries with a combination of strings included, without using a loop?
For example, I would like to find the entries with both '2011' and 'sprd' in the string, which in this case will only return
sflxgrbfg_sprd_2011
How can this be done? I could define a variable
toMatch <- c('2011','sprd')
and then loop through the entries but I was hoping there was a better solution?
Note: To make this useful for different strings, is it also possible to determine which entries have these strings without them being in the order shown? For example, 'sflxlgrbfg_2011_sprd'.
If you want to find more than one pattern, try indexing with a logical value rather than the number. That way you can create an "and" condition, where only the string with both patterns will be extracted.
ifiles <- dd[grepl("2011",dd) & grepl("sprd_",dd)]
Try
grep('2011_sprd|sprd_2011', dd, value=TRUE)
#[1] "sflxgrbfg_sprd_2011" "sflxlgrbfg_2011_sprd"
Or using an example with more patterns
grep('(?<=sprd_).*(?=2011)|(?<=2011_).*(?=sprd)', dd1,
value=TRUE, perl=TRUE)
#[1] "sflxgrbfg_sprd_2011" "sflxlgrbfg_2011_sprd"
#[3] "sfxl_2011_14334_sprd" "sprd_124334xsff_2011_1423"
data
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012",
"sflxlgrbfg_2011_sprd")
dd1 <- c(dd, "sfxl_2011_14334_sprd", "sprd_124334xsff_2011_1423")
If you want a scalable solution, you can use lapply, Reduce and intersect to:
For each expression in toMatch, find the indices of all matches in dd.
Keep only those indices that are found for all expressions in toMatch.
dd <- c("sflxgrbfg_sprd_2011","sflxgrbfg_sprd2_2011","sflxgrbfg_sprd_2012")
dd <- c(dd, "sflxgrbfh_sprd_2011")
toMatch <- c('bfg', '2011','sprd')
dd[Reduce(intersect, lapply(toMatch, grep, dd))]
#> [1] "sflxgrbfg_sprd_2011" "sflxgrbfg_sprd2_2011"
Created on 2018-03-07 by the reprex package (v0.2.0).
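The same "all patterns must match" idea can be sketched with logical vectors instead of index sets, combining grepl() results with &. The helper name match_all is hypothetical; the data is taken from the example above.

```r
# Keep the elements of x that match every pattern, in any order
match_all <- function(patterns, x) {
  hits <- lapply(patterns, grepl, x = x)  # one logical vector per pattern
  x[Reduce(`&`, hits)]                    # TRUE only where all patterns matched
}

dd <- c("sflxgrbfg_sprd_2011", "sflxgrbfg_sprd2_2011", "sflxgrbfg_sprd_2012",
        "sflxgrbfh_sprd_2011")
toMatch <- c("bfg", "2011", "sprd")

found <- match_all(toMatch, dd)
```

This returns the same two entries as the intersect() version, and because each pattern is tested independently, the order in which the substrings appear does not matter.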

R: Why can't for loop or c() work out for grep function?

Thanks to the question "grep using a character vector with multiple patterns", I figured out my own problem as well.
The question here was how to find multiple values by using grep function,
and the solution was either these:
grep("A1| A9 | A6", text)
or
toMatch <- c("A1", "A9", "A6")
matches <- unique(grep(paste(toMatch, collapse = "|"), text))
So I used the second suggestion since I had MANY values to search for.
But I'm curious why c() or for loop doesn't work out instead of |.
Before I researched the possible solution in stackoverflow and found recommendations above, I tried out two alternatives that I'll demonstrate below:
First, what I've written in R was something like this:
find.explore.l<-lapply(text.words.bl ,function(m) grep("^explor",m))
But then I had to 'grep' many words, so I tried out this
find.explore.l<-lapply(text.words.bl ,function(m) grep(c("A1","A2","A3"),m))
It didn't work, so I tried another one(XXX is the list of words that I'm supposed to find in the text)
for (i in XXX){
find.explore.l<-lapply(text.words.bl ,function(m) grep("XXX[i]",m))
.......(more lines to append lines etc)
}
and it seemed like R tried to match XXX[i] itself, not the words inside.
Why can't c() and for loop for grep return right results?
Someone please let me know! I'm so curious :P
From the documentation for the pattern= argument in the grep() function:
Character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr and gregexpr.
This confirms that, as @nrussell said in a comment, grep() is not vectorized over the pattern argument. Because of this, c() won't work for a list of regular expressions.
You could, however, use a loop, you just have to modify your syntax.
toMatch <- c("A1", "A9", "A6")
# Loop over values to match
for (i in toMatch) {
  # results are not auto-printed inside a for loop, so print them explicitly
  print(grep(i, text))
}
Using "XXX[i]" as your pattern doesn't work because it's interpreting that as a regular expression. That is, it will match exactly XXXi. To reference an element of a vector of regular expressions, you would simply use XXX[i] (note the lack of surrounding quotes).
You can also use lapply() for this, but in a slightly different way than you had done. You apply it to each regex in the list, rather than to each text string.
lapply(toMatch, function(rgx, text) grep(rgx, text), text = text)
However, the best approach would be, as you already have in your post, to use
matches <- unique(grep(paste(toMatch, collapse = "|"), text))
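To make the difference concrete, here is a small sketch comparing the per-pattern lapply() approach with the combined "|" pattern. The vector text is made-up sample data, not taken from the question.

```r
# Made-up sample data for illustration
text <- c("A1 first", "B2 second", "A9 third", "A6 and A1")
toMatch <- c("A1", "A9", "A6")

# One grep() call per pattern: a list of index vectors, one per pattern
per_pattern <- lapply(toMatch, grep, x = text)
names(per_pattern) <- toMatch

# One grep() call with the patterns joined by "|": a single index vector
combined <- grep(paste(toMatch, collapse = "|"), text)

# Flattening the per-pattern results gives the same set of indices
flattened <- sort(unique(unlist(per_pattern)))
```

The per-pattern list is useful when it matters which pattern matched where; the combined pattern is simpler when only the set of matching elements is needed.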
Consider that:
XXX <- c("a", "b", "XXX[i]")
grep("XXX[i]", XXX, value=T)
character(0)
grep("XXX\\[i\\]", XXX, value=T)
[1] "XXX[i]"
What is R doing? The first argument of grep is interpreted as a regular expression, in which the brackets [ and ] are special characters (they delimit a character class). I put in two backslashes to tell R to treat them as literal brackets. And imagine what would happen if I put that last expression into a for loop? It wouldn't do what I expected.
If you would like a for loop that goes through a character vector of possible matches, take out the quotes in the grep function.
#if you want the match returned
matches <- c("a", "b")
for (i in matches) print(grep(i, XXX, value=T))
[1] "a"
[1] "b"
#if you want the vector location of the match
for (i in matches) print(grep(i, XXX))
[1] 1
[1] 2
As the comments point out, grep(c("A1","A2","A3"), m) violates the syntax grep requires, since the pattern must be a single string.
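When the search strings are meant literally rather than as regular expressions, an alternative to escaping the brackets is grep's fixed = TRUE argument. The sketch below reuses the XXX vector from above; the patterns vector is a hypothetical addition for the looped case.

```r
XXX <- c("a", "b", "XXX[i]")

# fixed = TRUE treats the pattern as a plain string, so [ and ] need no escaping
literal <- grep("XXX[i]", XXX, fixed = TRUE, value = TRUE)

# The same works when looping over several literal search strings
patterns <- c("a", "XXX[i]")
hits <- lapply(patterns, grep, x = XXX, fixed = TRUE, value = TRUE)
```

Here literal is "XXX[i]", and hits contains "a" for the first pattern and "XXX[i]" for the second.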
