find indexes in R by not using `which` - r

Is there a faster way to search for indices rather than which %in% R.
I am having a statement which I need to execute but its taking a lot of time.
statement:
total_authors<-paper_author$author_id[which(paper_author$paper_id%in%paper_author$paper_id[which(paper_author$author_id%in%data_authors[i])])]
How can this be done in a faster manner?

Don't call which. R accepts logical vectors as indices, so the call is superfluous.
In light of sgibb's comment, you can keep which if you are sure that you will also get at least one match. (If there are no matches, then which returns an empty vector and you get everything instead of nothing. See Unexpected behavior using -which() in R when the search term is not found.)
Secondly, the code looks a little cleaner if you use with.
Thirdly, I think you want a single index with & rather than a double index.
total_authors <- with(
paper_author,
author_id[paper_id %in% paper_id & author_id %in% data_authors[i]
)

Related

In R, dataframe[-NULL] returns an empty dataframe

I'm creating some routines in R to ease model creation and to distinguish several groups based on several parameters (ex: original watches VS fakes ones using watches common attributes).
During the proccess, I keep track of the potential excluded lines in a vector (empty at first), and I get ride of them at the end using:
model$var <- raw_data[-line_excluded,]
The problem is that if line_excluded is c() (ndlr no line exlcuded), model$var is an empty dataframe then in that case I want all the lines of the dataframe.
The only solution I have think about is the us of
if (!is.null(line_excluded)){
model$var <- raw_data[-line_excluded,]}
But that's not really pretty, and I have several tracking variables as line_excluded which need that.
Thanks for the help
You can make it in another way using setdiff(), which can deal with empty line_excluded i.e.,
model$var <- raw_data[setdiff(seq(nrow(raw_data)),line_excluded),]
You can also try:
model$var <- raw_data[!(1:nrow(raw_data) %in% line_excluded),]
This is similar to what #THomasIsCoding suggested, you look for the row numbers that are not in your line_excluded..

Splitting strings into elements from a list

A function in a package gives me a character, where the original strings are merged together. I need to separate them, in other words I have to find the original elements. Here is an example and what I have tried:
orig<-c("answer1","answer2","answer3")
result<-"answer3answer2"
What I need as an outcome is:
c("answer2","answer3")
I have tried to split() result, but there is no string to base it on, especially that I have no former knowledge of what the answers will be.
I have tried to match() the result to the orig, but I would need to do that with all substrings.
There has to be an easy solution, but I haven't found it.
index <- gregexpr(paste(orig,collapse='|'),result)[[1]]
starts <- as.numeric(index)
stops <- starts + attributes(index)$match.length - 1 )
substring(result, starts, stops)
This should work for well defined and reversible inputs. Alternatively, is it possible to append some strings to the input of the function, such that it can be easily separated afterwards?
What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:
FindSubstrings <- function(orig, result){
orig[sapply(orig, grepl, result)]
}
In more detail: grepl takes a pattern argument and looks whether it occurs in your string (result in our case), and returns a TRUE/FALSE value. We subset the original values by the logical vector - does the value occur in the string?
Possible improvements:
fixed=TRUE may be a bright idea, because you don't need the full regex power for simple strings matching
some match patterns may contain others, for example, "answer10" contains "answer1"
stringi may be faster for such tasks (just rumors floating around, haven't rigorously tested), so you may want to look into it if you do this a lot.

Using values from a dataframe to apply a function to a vector

I'll start off by admitting that I'm terrible at the apply functions, and function writing in general, in R. I am working on a course project to clean and model some text data, and I would like to include a step that cleans up contractions.
The qdapDictionaries package includes a contractions data frame with two columns, the first column is the contraction and the second is the expanded version. For example:
contraction expanded
5 aren't are not
I want to use the values in here to run a gsub function on my text, which I still have in a large character element. Something like gsub(contr,expd,text).
Here's an example vector that I am using to test things out:
vct <- c("I've got a problem","it shouldn't be that hard","I'm having trouble 'cause I'm dumb")
I'm stumped on how to loop through the data frame (without actually writing a loop, because it seems like the least efficient way to do it) so I can run all the gsubs that I need.
There's probably a simple answer, but here's what I tried: first, I created a function that would return the expanded version if passed a contraction:
expand <- function(contr) {
expd <- contractions[which(contractions[1]==contr),2]
}
I can use sapply with this and it does work, more or less; looping over the first column in contractions, sapply(contractions[,1],expand) returns a named vector of characters with the expanded phrases.
I can't figure out how to combine this vector with gsub though. I tried writing a second function gsub_expand and changing the expand function to return both the contraction and the expansion:
gsub_expand <- function(list, text) {
text <- gsub(list[[1]],list[[2]],text)
return(text)
}
When I ran gsub_expand(sapply(contractions[,1],expand),vct) it only corrected a portion of my vector.
[1] "I've got a problem" "it shouldn't be that hard" "I'm having trouble because I'm dumb"
The first entry in the contractions data frame is 'cause and because, so the interior sapply doesn't seem to actually be looping. I'm stuck in the logic of what I want to pass to what, and what I'm supposed to loop over.
Thanks for any help.
Two options:
stringr::str_replace_all
The stringr package does mostly the same things you can do with base regex functions, but sometimes in a dramatically simpler way. This is one of those times. You can pass str_replace_all a named list or character vector, and it will use the names as patterns and the values as replacements, so all you need is
library(stringr)
contractions <- c("I've" = 'I have', "shouldn't" = 'should not', "I'm" = 'I am')
str_replace_all(vct, contractions)
and you get
[1] "I have got a problem" "it should not be that hard"
[3] "I am having trouble 'cause I am dumb"
No muss, no fuss, just works.
lapply/mapply/Map and gsub
You can, of course, use lapply or a for loop to repeat gsub. You can formulate this call in a few ways, depending on how your data is stored, and how you want to get it out. Let's first make a copy of vct, because we're going to overwrite it:
vct2 <- vct
Now we can use any of these three:
lapply(1:length(contractions),
function(x){vct2 <<- gsub(names(contractions[x]), contractions[x], vct2)})
# `mapply` is a multivariate version of `sapply`
mapply(function(x, y){vct2 <<- gsub(x, y, vct2)}, names(contractions), contractions)
# `Map` is a multivariate version of `lapply`
Map(function(x, y){vct2 <<- gsub(x, y, vct2)}, names(contractions), contractions)
each of which will return slightly different useless data, but will also save the changes to vct2, which now looks the same as the results of str_replace_all above.
These are a little complicated, mostly because you need to save the internal version of vct as you go with each change made. The vct <<- writes to the initialized vct2 outside the function's environment, allowing us to capture the successive changes. Be a little careful with <<-; it's powerful. See ?assignOps for more info.

Fast grep with a vectored pattern or match, to return list of all matches

I guess this is trivial, I apologize, I couldn't find how to do it.
I am trying to abstain from a loop, so I am trying to vectorize the process:
I need to do something like grep, but where the pattern is a vector. Another option is a match, where the value is not only the first location.
For example data (which is not how the real data is, otherswise I would exploit it structure):
COUNTRIES=c("Austria","Belgium","Denmark","France","Germany",
"Ireland","Italy","Luxembourg","Netherlands",
"Portugal","Sweden","Spain","Finland","United Kingdom")
COUNTRIES_Target=rep(COUNTRIES,times=4066)
COUNTRIES_Origin=rep(COUNTRIES,each=4066)
Now, currently I got a loop that:
var_pointer=list()
for (i in 1:length(COUNTRIES_Origin))
{
var_pointer[[i]]=which(COUNTRIES_Origin[i]==COUNTRIES_Target)
}
The problem with match is that match(x=COUNTRIES_Origin,table=COUNTRIES_Target) returns a vector of the same length as COUNTRIES_Origin and the value is the first match, while I need all of them.
The issue with grep is that grep(pattern=COUNTRIES_Origin,x=COUNTRIES_Target) is the given warning:
Warning message:
In grep(pattern = COUNTRIES_Origin, x = COUNTRIES_Target) :
argument 'pattern' has length > 1 and only the first element will be used
Any suggestions?
Trying to vectorize MxN matches is fundamentally not very performant, no matter how you do it it's still MN operations.
Use hashes instead for O(1) lookup.
For recommendations on using the hash package, see Can I use a list as a hash in R? If so, why is it so slow?
It seems like you can just lapply over the list rather than loop.
lapply(COUNTRIES_Origin, function(x) which(COUNTRIES_Target==x))
Here I use which because grep seems to be better for partial matches and you're looking for exact matches.

Ordered Map / Hash Table in R

While working with lists i've noticed an issue that i didn't expect.
result5 <- vector("list",length(queryResults[[1]]))
for(i in 1:length(queryResults[[1]])){
id <- queryResults[[1]][i]
result5[[id]] <-getPrices(id)
}
The problem is that after this code runs instead of the result staying the same size (w/e queryResults[[1]] is) it goes up to the last index creating a bunch of null entries in the middle.
result5 current stores a number of int,double lists so it looks like :
result5[[index(int)]][[row]][col]
While on it's own it's not too problematic I would rather avoid that simply for easier size calculations later on.
For clarification, id is an integer. And in the given case for loop offers same performance, but greater convenience than the apply functions.
After some testing seems like the easiest way of doing it is :
Using a hash package to convert it using a hash using :
result6 <- hash(queryResults[[1]],lapply(queryResults[[1]],getPrices))
And if it needs to get accessed calling
result6[[toString(id)]]
With the difference in performance being marginal, albeit it's still fairly annoying having to include toString in your code.
It's not clear exactly what your question is, but judging by the structure of the loop, you probably want
result5[[i]] <- getPrices(id)
rather than result5[[id]] <- getPrices(id).

Resources