How to grep two words in string data? (R)

So I have a data frame where one of the columns is of type character, consisting of strings. I want to find the rows where "foo" and "bar" both occur, in either order (bar can also occur before foo). Basically I need something like an AND operator for regular expressions. How can I do that?

You may try
rowIndx <- grepl('foo', df$yourcol) & grepl('bar', df$yourcol)
rowIndx is a logical TRUE/FALSE vector that can be used for subsetting the column (per @Konrad Rudolph's comment). If you need the numeric index, just wrap it in which, i.e. which(rowIndx).

Regular expressions are bad at logical operations. Your particular case, however, can be trivially implemented by the following expression:
(foo.*bar)|(bar.*foo)
However, this is a very inefficient regex and I strongly advise against using it. In practice, you'd use akrun's solution from the comments: grep for them individually and intersect the results (or do a logical grepl and & the results, which is semantically equivalent).
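A minimal sketch of both approaches, using a toy data frame with a hypothetical column txt:

```r
# Toy data frame; 'txt' is a hypothetical column name.
df <- data.frame(txt = c("foo then bar", "bar then foo", "only foo", "neither"),
                 stringsAsFactors = FALSE)

# Logical AND of two grepl() calls: rows containing both words, in any order.
both <- grepl("foo", df$txt) & grepl("bar", df$txt)
df$txt[both]                                    # "foo then bar" "bar then foo"

# Equivalent with numeric indices: intersect the two grep() results.
intersect(grep("foo", df$txt), grep("bar", df$txt))   # 1 2
```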

Related

Splitting strings into elements from a list

A function in a package gives me a character, where the original strings are merged together. I need to separate them, in other words I have to find the original elements. Here is an example and what I have tried:
orig<-c("answer1","answer2","answer3")
result<-"answer3answer2"
What I need as an outcome is:
c("answer2","answer3")
I have tried to split() result, but there is no separator to base it on, especially as I have no prior knowledge of what the answers will be.
I have tried to match() the result to the orig, but I would need to do that with all substrings.
There has to be an easy solution, but I haven't found it.
index <- gregexpr(paste(orig,collapse='|'),result)[[1]]
starts <- as.numeric(index)
stops <- starts + attributes(index)$match.length - 1
substring(result, starts, stops)
This should work for well defined and reversible inputs. Alternatively, is it possible to append some strings to the input of the function, such that it can be easily separated afterwards?
What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:
FindSubstrings <- function(orig, result) {
  orig[sapply(orig, grepl, result)]
}
In more detail: grepl takes a pattern argument and looks whether it occurs in your string (result in our case), and returns a TRUE/FALSE value. We subset the original values by the logical vector - does the value occur in the string?
Possible improvements:
fixed=TRUE may be a good idea, because you don't need the full power of regular expressions for plain string matching
some match patterns may contain others, for example, "answer10" contains "answer1"
stringi may be faster for such tasks (just rumors floating around, haven't rigorously tested), so you may want to look into it if you do this a lot.
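A minimal sketch of the fixed = TRUE variant, including the overlapping-prefix pitfall from the second point above (FindSubstringsFixed is a hypothetical name for the modified function):

```r
# Variant of FindSubstrings using literal matching (fixed = TRUE),
# since the candidates are plain strings, not regex patterns.
FindSubstringsFixed <- function(orig, result) {
  orig[sapply(orig, grepl, result, fixed = TRUE)]
}

orig   <- c("answer1", "answer2", "answer3")
result <- "answer3answer2"
FindSubstringsFixed(orig, result)    # "answer2" "answer3"

# The prefix pitfall: "answer1" is a substring of "answer10",
# so it is (perhaps unintentionally) reported as present too.
FindSubstringsFixed(c("answer1", "answer10"), "answer10")
```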

Cleaning data: which to use, grep or str_extract_all?

I need to extract from the dataset all the elements that mention "mean" and "std" (standard deviation). Below is an example of how the variables are written in column 2 of feat.
Goal: I am trying to extract only the elements written like this:
"tBodyAcc-mean()-Z"
"tBodyAcc-std()-X"
feat<-read.table("features.txt")
I assumed that using
grep("mean" & "std", feat[,2])
would work, but it does not; I get this error:
"operations are possible only for numeric, logical or complex types"
I found someone who has used this:
meansd<-grep("-(mean|std)\\(\\)",feat[,2])
It worked fine, but I do not understand the meaning of the backslashes, and I don't want to use code I don't understand.
What you need is an alternation operator | in a regex pattern. grep allows using literal values (when fixed=TRUE is used) or a regular expression (by default).
Now, you found:
meansd<-grep("-(mean|std)\\(\\)",feat[,2])
The -(mean|std)\(\) regex matches a -, then either mean or std (since (...) is a grouping construct that allows enumerating alternatives inside a bigger expression), then a literal ( and ). The parentheses must be escaped with a backslash, and the backslash itself must be doubled inside an R string literal - that is why \\( and \\) appear in the R code.
If you think the expression is overkill, and you only want to find entries containing either std or mean as a substring, you can use a simpler
meansd<-grep("mean|std",feat[,2])
Here, no grouping construct is necessary since you only have two alternatives in the expression.
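A small sketch of the difference between the two patterns, on feature names in the style quoted in the question (the extra names are made up for illustration):

```r
# Sample feature names; the last two are hypothetical extras that
# contain "mean" as a substring but not as "-mean()".
feats <- c("tBodyAcc-mean()-Z", "tBodyAcc-std()-X",
           "tBodyAcc-meanFreq()-Z", "angle(tBodyAccMean,gravity)")

# Loose match: any name containing "mean" or "std" (case-sensitive).
grep("mean|std", feats, value = TRUE)

# Strict match: only "-mean()" or "-std()", excluding meanFreq.
grep("-(mean|std)\\(\\)", feats, value = TRUE)
```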

R: If cells of a variable contain a specific text

I am trying to find out how many cells contain a specific text for a variable (in this case the "fruits" variable) in R. I tried to use the match() function but could not get the desired result, and I tried %in% as well, but to no avail.
The command I used is match("apple", lifestyle$fruits), and it returns a value much larger than the correct answer.
I think this will give you what you want:
sum(grepl("apple", lifestyle$fruits))
grepl returns a logical vector, TRUE wherever the pattern is found, and sum counts the TRUE values. You can make this a little faster with the fixed=TRUE argument:
sum(grepl("apple", lifestyle$fruits, fixed=TRUE))
This tells grepl that it doesn't have to spend time making a regular expression and to just match literally.
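A minimal sketch with a hypothetical stand-in for the asker's data, also showing why match() gave a misleading number (it returns the position of the first exact match, not a count):

```r
# Hypothetical stand-in for the asker's data frame.
lifestyle <- data.frame(
  fruits = c("apple pie", "banana", "apple", "grape"),
  stringsAsFactors = FALSE
)

# Count cells whose 'fruits' value contains "apple" anywhere.
sum(grepl("apple", lifestyle$fruits, fixed = TRUE))   # 2

# Contrast: match() reports only the FIRST exact match's position.
match("apple", lifestyle$fruits)                      # 3
```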

Fast grep with a vectored pattern or match, to return list of all matches

I guess this is trivial, I apologize, I couldn't find how to do it.
I am trying to avoid a loop by vectorizing the process:
I need to do something like grep, but where the pattern is a vector. Another option is a match, where the value is not only the first location.
For example data (which is not how the real data is, otherwise I would exploit its structure):
COUNTRIES=c("Austria","Belgium","Denmark","France","Germany",
"Ireland","Italy","Luxembourg","Netherlands",
"Portugal","Sweden","Spain","Finland","United Kingdom")
COUNTRIES_Target=rep(COUNTRIES,times=4066)
COUNTRIES_Origin=rep(COUNTRIES,each=4066)
Currently I have a loop:
var_pointer <- list()
for (i in seq_along(COUNTRIES_Origin)) {
  var_pointer[[i]] <- which(COUNTRIES_Origin[i] == COUNTRIES_Target)
}
The problem with match is that match(x=COUNTRIES_Origin,table=COUNTRIES_Target) returns a vector of the same length as COUNTRIES_Origin and the value is the first match, while I need all of them.
The issue with grep is that grep(pattern=COUNTRIES_Origin, x=COUNTRIES_Target) gives the warning:
Warning message:
In grep(pattern = COUNTRIES_Origin, x = COUNTRIES_Target) :
argument 'pattern' has length > 1 and only the first element will be used
Any suggestions?
Trying to vectorize M×N matches is fundamentally not very performant: however you do it, it is still M×N operations.
Use hashes instead for O(1) lookup.
For recommendations on using the hash package, see Can I use a list as a hash in R? If so, why is it so slow?
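As a base-R sketch of the same idea (no extra package needed): split() groups the target indices by name in one pass, and the subsequent named-list lookup is effectively a hash lookup per origin. Shown here on a shortened version of the question's data:

```r
COUNTRIES <- c("Austria", "Belgium", "Denmark")
COUNTRIES_Target <- rep(COUNTRIES, times = 4)
COUNTRIES_Origin <- rep(COUNTRIES, each = 4)

# One pass over the target vector builds a named list of index vectors;
# each origin is then looked up by name instead of re-scanning the targets.
target_index <- split(seq_along(COUNTRIES_Target), COUNTRIES_Target)
var_pointer  <- target_index[COUNTRIES_Origin]
```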
It seems like you can just lapply over the list rather than loop.
lapply(COUNTRIES_Origin, function(x) which(COUNTRIES_Target==x))
Here I use which rather than grep because grep is meant for partial (pattern) matches, and you're looking for exact matches.
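To illustrate that exact-versus-partial distinction on a hypothetical vector:

```r
x <- c("Ireland", "Iceland", "Northern Ireland")

# Exact matching: only the element equal to "Ireland".
which(x == "Ireland")     # 1

# Pattern matching: grep also hits "Northern Ireland".
grep("Ireland", x)        # 1 3
```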

find indexes in R by not using `which`

Is there a faster way to search for indices than which(... %in% ...) in R?
I have a statement that I need to execute, but it's taking a lot of time.
statement:
total_authors<-paper_author$author_id[which(paper_author$paper_id%in%paper_author$paper_id[which(paper_author$author_id%in%data_authors[i])])]
How can this be done in a faster manner?
Don't call which. R accepts logical vectors as indices, so the call is superfluous.
In light of sgibb's comment, you can keep which if you are sure that you will also get at least one match. (If there are no matches, then which returns an empty vector and you get everything instead of nothing. See Unexpected behavior using -which() in R when the search term is not found.)
Secondly, the code looks a little cleaner if you use with.
Thirdly, I think you want a single index with & rather than a double index.
total_authors <- with(
  paper_author,
  author_id[paper_id %in% paper_id & author_id %in% data_authors[i]]
)
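A minimal sketch of the rewritten lookup on a hypothetical paper_author data frame (the column values and data_authors below are made up for illustration):

```r
# Hypothetical stand-in: which authors wrote which papers.
paper_author <- data.frame(
  paper_id  = c(1, 1, 2, 3, 3),
  author_id = c(10, 20, 20, 30, 10)
)
data_authors <- c(20)
i <- 1

# Logical indexing with a single & condition; no which() needed.
total_authors <- with(
  paper_author,
  author_id[paper_id %in% paper_id & author_id %in% data_authors[i]]
)
total_authors   # 20 20
```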
