A function in a package gives me a character string in which the original strings are merged together. I need to separate them again, in other words I have to recover the original elements. Here is an example and what I have tried:
orig<-c("answer1","answer2","answer3")
result<-"answer3answer2"
What I need as an outcome is:
c("answer2","answer3")
I have tried strsplit() on result, but there is no separator to base the split on, especially since I have no prior knowledge of what the answers will be.
I have tried to match() the result to the orig, but I would need to do that with all substrings.
There has to be an easy solution, but I haven't found it.
index <- gregexpr(paste(orig,collapse='|'),result)[[1]]
starts <- as.numeric(index)
stops <- starts + attributes(index)$match.length - 1
substring(result, starts, stops)
This should work for well-defined and reversible inputs. Alternatively, is it possible to append some strings to the input of the function, so that the result can be easily separated afterwards?
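If you can modify what goes into the function, here is a sketch of that separator idea (it assumes you are free to pick a marker, e.g. a control character, that never occurs in the answers; the merging line below is only a stand-in for whatever the package function actually does):
sep <- "\x1f"                                   # "unit separator" control character, assumed absent from the answers
orig2 <- paste0(orig, sep)                      # tag each input before handing it to the package function
result2 <- paste(orig2[c(3, 2)], collapse = "") # stand-in for the package function's merging
strsplit(result2, sep, fixed = TRUE)[[1]]
# [1] "answer3" "answer2"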
What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:
FindSubstrings <- function(orig, result){
  orig[sapply(orig, grepl, result)]
}
In more detail: grepl takes a pattern argument and checks whether it occurs in your string (result in our case), returning TRUE or FALSE. We then subset the original values by that logical vector - does each value occur in the string?
Possible improvements:
fixed=TRUE is probably a good idea, because you don't need the full power of regular expressions for simple string matching (see the sketch after this list)
some match patterns may contain others, for example, "answer10" contains "answer1"
stringi may be faster for such tasks (just rumors floating around, haven't rigorously tested), so you may want to look into it if you do this a lot.
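For example, a sketch of the same function with literal matching switched on (the new name is just for illustration; note that fixed = TRUE does not by itself resolve the "answer10 contains answer1" overlap):
FindSubstringsFixed <- function(orig, result){
  # fixed = TRUE matches each candidate literally, without regex interpretation
  orig[sapply(orig, grepl, result, fixed = TRUE)]
}

FindSubstringsFixed(c("answer1", "answer2", "answer3"), "answer3answer2")
# [1] "answer2" "answer3"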
I have a large data.frame d that was read from a .csv file (it is actually a data.table resulting from running fread on a .csv file). I want to check every column of type character for weird/corrupted characters, meaning the weird sequences of characters that result from corrupted parts of a text file or from using the wrong encoding.
A data.table solution, or some other fast solution, would be best.
This is pseudo-code for a possible solution:
1. Create a vector str_cols with the names of all the character columns of d.
2. For each column j in str_cols, compute a frequency table of the values: tab <- d[, .N, j]. (This step is probably not necessary, just used to reduce the dimensions of the object that will be checked in columns with repetitions.)
3. Check the values of j in the summary table tab.
The crucial step is 3. Is there a function that does that?
Edit 1: Perhaps some smart regular expression? This is a related non-R question that tries to explicitly list all weird characters. Another possible solution is to find any character outside an accepted list of characters [a-z 0-9 + punctuation].
If you post some example data it would be easier to give a more definitive answer. You could likely try something like this though.
DT[, lapply(.SD, stringr::str_detect, "[^[:alnum:][:punct:][:space:]]")]
It will return a data.table of the same size, but any string that has characters that aren't alphanumeric, punctuation, and/or space will be replaced with TRUE, and everything else replaced with FALSE. This would be my interpretation of your question about wanting to detect values that contain these characters.
You can change the behavior by replacing str_detect with whatever base R or stringr function you want, and slightly modifying the regex as needed. For example, you can remove the offending characters with the following code.
DT[, lapply(.SD, stringr::str_replace_all, "[^[:alnum:][:punct:][:space:]]", "")]
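A small made-up example (no example data was posted, so the column names and the corrupted value are invented):
library(data.table)
library(stringr)

DT <- data.table(
  id   = c("a", "b"),
  text = c("fine", "bro\u0001ken")  # the second value contains a non-printable control character
)

DT[, lapply(.SD, str_detect, "[^[:alnum:][:punct:][:space:]]")]
# id: FALSE FALSE   text: FALSE TRUE

DT[, lapply(.SD, str_replace_all, "[^[:alnum:][:punct:][:space:]]", "")]
# the control character is stripped, leaving "broken"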
I am trying to find out how many cells contain a specific text for a variable (in this case the "fruits" variable) in R. I tried to use the match() function but could not get the desired result. I tried to use %in% as well but to no avail.
The command I used is match("apple", lifestyle$fruits) and it returns a value which is much larger than the correct answer :X
I think this will give you what you want:
sum(grepl("apple", lifestyle$fruits))
grepl returns a logical TRUE/FALSE vector, with TRUE wherever the pattern is found, and sum adds these up (each TRUE counts as 1). You can make this a little faster using the fixed=TRUE argument:
sum(grepl("apple", lifestyle$fruits, fixed=TRUE))
This tells grepl that it doesn't have to spend time making a regular expression and to just match literally.
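A quick illustration on a made-up vector (the real lifestyle data isn't shown in the question):
fruits <- c("apple pie", "banana", "apple", "grape")
sum(grepl("apple", fruits, fixed = TRUE))
# [1] 2   ("apple pie" and "apple" both contain the text)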
So I have a data frame where one of the columns is of type character, consisting of strings. I want to find those rows where "foo" and "bar" both occur, where bar may also occur before foo. Basically like an AND operator for regular expressions. How should I do that?
You may try
rowIndx <- grepl('foo', df$yourcol) & grepl('bar', df$yourcol)
rowIndx is a logical TRUE/FALSE vector which can be used for subsetting the column (comment from @Konrad Rudolph). If you need the numeric indices, just wrap it in which, i.e. which(rowIndx).
Regular expressions are bad at logical operations. Your particular case, however, can be trivially implemented by the following expression:
(foo.*bar)|(bar.*foo)
However, this is a very inefficient regex and I strongly advise against using it. In practice, you'd use akrun's solution from the comment: grep for them individually and intersect the results (or do a logical grepl and & the results, which is semantically equivalent).
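To make the two equivalent approaches concrete, here is a small sketch on a made-up column:
x <- c("foo and bar", "only foo", "bar before foo", "neither")

grepl("foo", x) & grepl("bar", x)
# [1]  TRUE FALSE  TRUE FALSE

intersect(grep("foo", x), grep("bar", x))
# [1] 1 3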
I guess this is trivial, I apologize, I couldn't find how to do it.
I am trying to avoid a loop, so I am trying to vectorize the process:
I need to do something like grep, but where the pattern is a vector. Another option would be match, but returning every matching location rather than only the first.
For example data (which is not how the real data looks, otherwise I would exploit its structure):
COUNTRIES=c("Austria","Belgium","Denmark","France","Germany",
"Ireland","Italy","Luxembourg","Netherlands",
"Portugal","Sweden","Spain","Finland","United Kingdom")
COUNTRIES_Target=rep(COUNTRIES,times=4066)
COUNTRIES_Origin=rep(COUNTRIES,each=4066)
Currently I have a loop that does it:
var_pointer=list()
for (i in 1:length(COUNTRIES_Origin))
{
var_pointer[[i]]=which(COUNTRIES_Origin[i]==COUNTRIES_Target)
}
The problem with match is that match(x=COUNTRIES_Origin,table=COUNTRIES_Target) returns a vector of the same length as COUNTRIES_Origin and the value is the first match, while I need all of them.
The issue with grep is that grep(pattern=COUNTRIES_Origin, x=COUNTRIES_Target) gives the following warning:
Warning message:
In grep(pattern = COUNTRIES_Origin, x = COUNTRIES_Target) :
argument 'pattern' has length > 1 and only the first element will be used
Any suggestions?
Trying to vectorize MxN matches is fundamentally not very performant; no matter how you do it, it is still M*N operations.
Use hashes instead for O(1) lookup.
For recommendations on using the hash package, see Can I use a list as a hash in R? If so, why is it so slow?
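A base-R sketch of the same build-the-lookup-once idea, using split() rather than the hash package (it assumes exact matches and that every origin country actually occurs in COUNTRIES_Target):
# One pass over COUNTRIES_Target: a list mapping each country to all positions where it occurs
lookup <- split(seq_along(COUNTRIES_Target), COUNTRIES_Target)

# Each origin country then becomes a direct lookup instead of a full scan
var_pointer <- lookup[COUNTRIES_Origin]

Up to the list names, this gives the same result as the original loop.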
It seems like you can just lapply over the vector rather than using a loop.
lapply(COUNTRIES_Origin, function(x) which(COUNTRIES_Target==x))
Here I use which because grep seems to be better for partial matches and you're looking for exact matches.
Is there a faster way to search for indices than using which with %in% in R?
I have a statement which I need to execute, but it's taking a lot of time.
statement:
total_authors <- paper_author$author_id[which(paper_author$paper_id %in% paper_author$paper_id[which(paper_author$author_id %in% data_authors[i])])]
How can this be done in a faster manner?
Don't call which. R accepts logical vectors as indices, so the call is superfluous.
In light of sgibb's comment, you can keep which if you are sure that you will get at least one match. (If there are no matches, then which returns an empty vector and, with negative indexing, you get everything instead of nothing. See Unexpected behavior using -which() in R when the search term is not found.)
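A quick illustration of the logical-indexing point on made-up numbers:
x <- c(5, 10, 15)
x[which(x > 7)]  # indices via which()
x[x > 7]         # same result; the which() call is unnecessary
# [1] 10 15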
Secondly, the code looks a little cleaner if you use with.
Thirdly, I think you want a single index with & rather than a double index.
total_authors <- with(
  paper_author,
  author_id[paper_id %in% paper_id & author_id %in% data_authors[i]]
)