Cleaning data: which to use, grep or str_extract_all? - r

I need to extract from the dataset all the elements that mention "mean" or "std" (standard deviation).
Below is an example of how they are written in feat, in column 2 (the variable names).
Goal: I am trying to extract only the elements that have this written.
"tBodyAcc-mean()-Z"
"tBodyAcc-std()-X"
feat<-read.table("features.txt")
I assumed that using
grep("mean"&"std",feat[,2])
would work,
but it does not. I get this error:
"operations are possible only for numeric, logical or complex types"
I found someone who has used this:
meansd<-grep("-(mean|std)\\(\\)",feat[,2])
It worked fine, but I do not understand the meaning of the backslashes.
I don't understand exactly what it means, so I don't want to use it.

What you need is an alternation operator | in a regex pattern. grep allows using literal values (when fixed=TRUE is used) or a regular expression (by default).
Now, you found:
meansd<-grep("-(mean|std)\\(\\)",feat[,2])
The -(mean|std)\(\) regex matches a -, then either mean or std (the (...) is a grouping construct that lets you list alternatives inside a larger expression), then a literal ( and a literal ). Parentheses are special characters in a regex, so they must be escaped with a backslash to match literally. A backslash is itself an escape character inside an R string literal, which is why it is doubled in the R code.
If you think the expression is overkill and you only want to find entries that contain either std or mean as a substring, you can use a simpler
meansd<-grep("mean|std",feat[,2])
Here, no grouping construct is necessary since the alternation spans the whole pattern. Note that this simpler pattern also matches entries where mean or std is only part of a longer name.
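To make the difference between the two patterns concrete, here is a small sketch on a made-up vector. The name feat_names and its contents are assumptions standing in for feat[,2]; the "meanFreq" entry is included only to show where the two patterns disagree.

```r
# Hypothetical stand-in for feat[, 2]
feat_names <- c("tBodyAcc-mean()-Z", "tBodyAcc-std()-X",
                "fBodyAcc-meanFreq()-X", "tBodyAcc-max()-Y")

# In R source, "\\(" is the two-character regex \( (a literal parenthesis)
strict <- grep("-(mean|std)\\(\\)", feat_names)

# Plain alternation: any entry containing "mean" or "std" anywhere
loose <- grep("mean|std", feat_names)

# value = TRUE returns the matching strings instead of their positions
strict_names <- grep("-(mean|std)\\(\\)", feat_names, value = TRUE)
```

With this input, the strict pattern selects only the first two entries, while the loose one also picks up the "meanFreq" entry.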

Related

Splitting strings into elements from a list

A function in a package gives me a character, where the original strings are merged together. I need to separate them, in other words I have to find the original elements. Here is an example and what I have tried:
orig<-c("answer1","answer2","answer3")
result<-"answer3answer2"
What I need as an outcome is:
c("answer2","answer3")
I have tried to split() result, but there is no delimiter to split on, especially since I have no prior knowledge of what the answers will be.
I have tried to match() the result against orig, but I would need to do that with all substrings.
There has to be an easy solution, but I haven't found it.
index <- gregexpr(paste(orig,collapse='|'),result)[[1]]
starts <- as.numeric(index)
stops <- starts + attr(index, "match.length") - 1
substring(result, starts, stops)
This should work for well defined and reversible inputs. Alternatively, is it possible to append some strings to the input of the function, such that it can be easily separated afterwards?
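The same extraction can be written more compactly with regmatches(), which takes the match data from gregexpr() directly. This is a minimal sketch using the question's example; it assumes the original strings contain no regex metacharacters.

```r
orig <- c("answer1", "answer2", "answer3")
result <- "answer3answer2"

# Build one alternation pattern: "answer1|answer2|answer3"
pattern <- paste(orig, collapse = "|")

# regmatches() extracts the matched substrings in order of appearance
pieces <- regmatches(result, gregexpr(pattern, result))[[1]]

# Sort to get the order asked for in the question
sorted_pieces <- sort(pieces)
```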
What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:
FindSubstrings <- function(orig, result){
orig[sapply(orig, grepl, result)]
}
In more detail: grepl takes a pattern argument, checks whether it occurs in your string (result in our case), and returns TRUE or FALSE. We then subset the original values by the resulting logical vector, which answers the question: does this value occur in the string?
Possible improvements:
fixed=TRUE may be a good idea, because you don't need the full regex power for simple string matching
some match patterns may contain others, for example, "answer10" contains "answer1", so "answer1" would be reported even when only "answer10" is present
stringi may be faster for such tasks (just rumors floating around, haven't rigorously tested), so you may want to look into it if you do this a lot.
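The first two points can be illustrated with a short sketch. The input below is made up to trigger the prefix problem, and sorting the candidates by length is just one possible workaround, not something from the original answer.

```r
# Hypothetical input where one candidate is a prefix of another
orig <- c("answer1", "answer2", "answer10")
result <- "answer10answer2"

# Literal substring test per candidate (fixed = TRUE skips regex interpretation)
hits <- vapply(orig, function(p) grepl(p, result, fixed = TRUE), logical(1))
naive <- orig[hits]   # also reports "answer1", a false positive here

# One possible workaround: try longer candidates first in an alternation
by_length <- orig[order(nchar(orig), decreasing = TRUE)]
pattern <- paste(by_length, collapse = "|")
pieces <- regmatches(result, gregexpr(pattern, result))[[1]]
```

Here naive contains all three candidates, while pieces contains only "answer10" and "answer2".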

Split a string in a flexible manner with a regular expression

Context: I need to split strings that are too long and that are used as column headers in an html table. Those strings are variable names, so they don't have any spaces in them.
If I let the css max-width property do the job, the string is split at a fixed place, not making use of the dots or _'s in the string.
For example, suppose I have this string:
this.is.a.long.string.indeed.yeah.well.you.know
Using the dots as separators, I can split it in many, many different ways. But I pose these guiding principles:
All substrings must be 12 characters or less
Separators [._] should be at the end, not at the beginning of a substring
The number of substrings must be minimal
If several solutions exist, the one having the most similar substring lengths is to be preferred.
I could do this programmatically with R, but I'm turning to regex wizards to see whether this is possible using solely regular expressions.
What I have so far:
Regex: .{1,12}(_|\b|\Z)
Results: this.is.a. | long.string. | indeed.yeah. | well.you. | know
It works well, except when there is a long sequence of letters without any separators. Please see this example on regex101.com.
Ideally, separators would be used whenever possible, and a fallback split would occur when there is a sequence longer than 12 characters without a separator.
You were so close; you just need to add another alternative for cases where no separator is found:
.{1,12}(_|\b|\Z)|.{1,12}
Check it out: https://regex101.com/r/XrJuYj/2/
Edit: to ensure the split portion contains a non-separating character, you can use the following:
(?=.{1,12}(.*))(?=.*?[^\W_].*?[\W_].*?\1).{1,12}(?<=_|\b|\Z)|.{1,12}
See it at: https://regex101.com/r/XrJuYj/3
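Since the question mentions R, here is a rough sketch of applying the simpler fallback pattern (the one without the lookaheads) from R. The function name split_header is an assumption for illustration; perl = TRUE selects the PCRE engine, which supports \Z.

```r
# Split a long identifier into chunks of at most 12 characters,
# preferring to break after a separator and falling back to a hard split.
split_header <- function(x, pattern = ".{1,12}(_|\\b|\\Z)|.{1,12}") {
  # regmatches() returns every match of the pattern, in order
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

dotted <- split_header("this.is.a.long.string.indeed.yeah.well.you.know")
# "this.is.a." "long.string." "indeed.yeah." "well.you." "know"

# No separators at all: the second alternative forces a split every 12 characters
plain <- split_header("abcdefghijklmnopqrstuvwxyz")
```

Pasting the chunks back together gives the original string, so nothing is lost by the split.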

zsh : Removing run of certain characters at the end of a variable

To keep it simple, consider following task:
Given a non-empty shell variable var, where we know that the last character is the letter a and that at least one character is not an a, remove all the a from the right of the variable.
Example: If the variable initially contains abcadeaaa, it should contain abcade afterwards.
I am wondering whether this can be done in a compact way in Zsh.
Of course this is trivial to do using an external program (such as sed), or by using a while loop, where we consecutively strip the last a (${VAR%a}), until the value of the variable doesn't change anymore. Both would work in a POSIX shell, in bash or in ksh. However, given that Zsh has so many nice features for expansion, I wonder, whether there isn't a better way.
The problem is that matching a run of a certain character (independent of length) cries out for regular expressions, but the pattern after % in parameter expansion and the pattern in the s/// substitution are both wildcard patterns, which don't allow me to do what I want, at least according to my understanding of the zshexpn man page.
Any ideas for this?
Note that this question is not to solve a real-world problem (in which case I simply would use sed to do the job, as this would get the job done), but more out of academic interest, to find out how far we can stretch the limits of the Zsh expansion mechanism.

R: If cells of a variable contain a specific text

I am trying to find out how many cells contain a specific text for a variable (in this case the "fruits" variable) in R. I tried to use the match() function but could not get the desired result. I tried to use %in% as well, but to no avail.
The command which I used is match("apple", lifestyle$fruits) and it returns a value which is much more than the correct answer :X
I think this will give you what you want:
sum(grepl("apple", lifestyle$fruits))
grepl returns a logical vector with one element per cell, TRUE where the text is found and FALSE otherwise. sum adds these up, counting each TRUE as 1. You can make this a little faster using the fixed=TRUE argument:
sum(grepl("apple", lifestyle$fruits, fixed=TRUE))
This tells grepl that it doesn't have to spend time making a regular expression and to just match literally.
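The sketch below shows why match() gave a misleading number. The lifestyle data frame here is invented for illustration and is not the asker's data.

```r
# Hypothetical data standing in for the question's `lifestyle` data frame
lifestyle <- data.frame(
  fruits = c("banana", "pear", "green apple", "kiwi", "apple", "apple, pear"),
  stringsAsFactors = FALSE
)

# match() returns the position of the first exact match, not a count
first_pos <- match("apple", lifestyle$fruits)

# %in% also compares whole values, so substrings are not counted
exact_count <- sum(lifestyle$fruits %in% "apple")

# grepl() tests each cell for the substring; sum() counts the TRUEs
substring_count <- sum(grepl("apple", lifestyle$fruits, fixed = TRUE))
```

With this data, match() reports 5 (a row position), %in% counts 1 exact cell, and the grepl version counts 3 cells containing "apple".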

How to grep two words in string data?

So I have a data frame where one of the columns is of type character, consisting of strings. I want to find those rows where "foo" and "bar" both occur, but bar can also occur before foo. Basically like an AND operator for regular expressions. How shall I do that?
You may try
rowIndx <- grepl('foo', df$yourcol) & grepl('bar', df$yourcol)
rowIndx is a logical TRUE/FALSE vector which can be used for subsetting (comment from @Konrad Rudolph). If you need the numeric index, just wrap it with which, i.e. which(rowIndx)
Regular expressions are bad at logical operations. Your particular case, however, can be trivially implemented by the following expression:
(foo.*bar)|(bar.*foo)
However, this is a very inefficient regex and I strongly advise against using it. In practice, you’d use akrun’s solution from the comment: grep for them individually and intersect the result (or do a logical grepl and & the results, which is equivalent).
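A quick sketch comparing the three variants mentioned in these answers. The data frame df and the column yourcol are placeholder names with made-up contents.

```r
# Hypothetical data frame and column name, for illustration only
df <- data.frame(
  yourcol = c("foo then bar", "bar then foo", "only foo", "only bar", "neither"),
  stringsAsFactors = FALSE
)

# Logical AND of two independent searches
both <- grepl("foo", df$yourcol) & grepl("bar", df$yourcol)

# Index-based form: intersect the positions returned by grep()
idx <- intersect(grep("foo", df$yourcol), grep("bar", df$yourcol))

# Single-regex version covering both orders
single <- grepl("(foo.*bar)|(bar.*foo)", df$yourcol)

# Rows where both words occur
matched_rows <- df[both, , drop = FALSE]
```

All three approaches select the first two rows of this example.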

Resources