GenBank query (package seqinr): searching in sequence description

I am using the function query() from the seqinr package to download myoglobin DNA sequences from GenBank, e.g.:
query("myoglobins","K=myoglobin AND SP=Turdus merula")
Unfortunately, for a lot of the species I'm looking for I don't get any sequence at all (or only a very short one), even though I do find sequences when I search manually on the website. This is because the query looks for "myoglobin" in the keywords only, and often there is no keyword entry at all; frequently the protein is only specified in the name ("definition" on GenBank), but I have no idea how to search that field.
The help page for query() doesn't seem to offer any option for this in its details, a "generic search" without any "K=" doesn't work, and I haven't found anything via googling.
I'd be happy about any links, explanations and help. Thank you! :)

There is a complete manual for the seqinr package that describes the query language in more depth in chapter 5 (available at http://seqinr.r-forge.r-project.org/seqinr_2_0-1.pdf). I was trying to do a similar query, and the description for many of the genes/CDS entries is blank, so they don't come up when searching with the k= option. One alternative would be to search for the organism alone, then match gene names in the individual annotations and pull out the accession numbers, which you could then use to re-query the database for your sequences.
This would pull out the annotation for the first gene:
choosebank("emblTP")
query("ACexample", "sp=Turdus merula")
getName(ACexample$req[[1]])
annotations <- getAnnot(ACexample$req[[1]])
cat(annotations, sep = "\n")
I think that this would be a pretty time consuming way to tackle the problem but there doesn't seem to be an efficient way of searching the annotations directly. I'd be interested in any solutions you might come up with.
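Building on that, here is a rough, untested sketch of the "match gene names in the annotations and pull out the accession numbers" idea: it keeps only the hits whose annotation block mentions "myoglobin". The search pattern, and the assumption that the ACexample query from the snippet above is still open, are mine rather than the original poster's.
hits <- ACexample$req
keep <- vapply(hits, function(s) {
  ann <- getAnnot(s)                                 # annotation lines for this sequence
  any(grepl("myoglobin", ann, ignore.case = TRUE))   # does the annotation mention myoglobin?
}, logical(1))
accessions <- sapply(hits[keep], getName)            # accession numbers to re-query with
accessions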

Related

Why are my "y"s being changed for "i"s when using {litsearchr} in R to develop an academic search strategy?

I'm planning a systematic review and have been using the litsearchr package to help generate good search terms for the scientific databases.
The problem I'm seeing is that when I print my final search string, words that end in "y" have had the "y" changed to an "i". For example, my review will involve intermittent-style sports (e.g. team sports, tennis etc.) and thermoregulation, so my search terms include words like “rugby” and “body temperature”, but the output is giving me terms such as “rugbi*” and “bodi* temperatur*”.
For anyone who knows this package, or searching scientific/academic databases in general: do you think this is expected behaviour?
Thanks!!
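No answer is recorded here, but for what it's worth, the "rugbi*"/"bodi*" pattern looks like Porter-style stemming followed by a wildcard, which litsearchr's search-string generation appears to apply. A minimal sketch of the stemming step alone, using the SnowballC package purely for illustration (my assumption, not something from this thread):
library(SnowballC)
# Porter stemming turns a trailing "y" into "i" and trims common suffixes,
# which matches the wildcarded terms reported above
wordStem(c("rugby", "body", "temperature"))   # "rugbi" "bodi" "temperatur"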

Searching Wikipedia through R

I have a list of names in my data frame and I want a way to query them on Wikipedia. It's not as simple as just appending the name to "https://en.wikipedia.org/wiki/": I want to actually query Wikipedia so that it suggests a page even if the name isn't spelt correctly. For example, if I were to put in Dick Dawkins, it should come up with Richard Dawkins. I checked, and that is actually the first hit on Wikipedia's search.
Ideally I'd want to use rvest, but I don't want to manually get every URL. Is this possible?
You are right. I, too, had a hard time getting Dick Dawkins out of Wikipedia; even searching for Dick Dawkins in Wikipedia's own search brought me straight to Richard Dawkins.
However, if you want to search for a term (say "Richard Dawkins"), Wikipedia has a proper API for that (https://www.mediawiki.org/wiki/API:Tutorial). You can play around and find the right parameters that work for you.
Just to get you started, I wrote a function (somewhat similar to rg255's post). You can change the parameters of the MySearch function. Please make sure that spaces in the search string are replaced by '%20' for every query from your data frame; a simple gsub call should do the job. You will also have to install the 'jsonlite' package for this to work.
library(jsonlite)
# Build the MediaWiki search URL and return the parsed JSON response
MySearch <- function(srsearch) {
  FullSearchString <- paste("http://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=",
                            srsearch, "&format=json", sep = "")
  Response <- fromJSON(FullSearchString)   # fromJSON() downloads and parses the URL
  return(Response)
}
Response <- MySearch("Richard%20Dawkins")
You can now use the parsed JSON to pull out the properties you want. As I said, you will have to play with the parameters to get it right.
Please let me know if this is not what you wanted.
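A minimal sketch of how this might be applied to a data frame of names, keeping only the title of the top search hit. The people data frame and the top_hit helper are hypothetical, and res$query$search simply follows the structure of the JSON returned by the API call above:
people <- data.frame(name = c("Dick Dawkins", "Alan Turing"))   # made-up example data
top_hit <- function(name) {
  res  <- MySearch(gsub(" ", "%20", name))   # URL-encode spaces, as suggested above
  hits <- res$query$search                   # data frame of search results
  if (!is.null(hits) && nrow(hits) > 0) hits$title[1] else NA_character_
}
people$wiki_title <- vapply(people$name, top_hit, character(1))
people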

Is there any way to replicate RStudio's integrated search function in code?

For context, I asked a question earlier today about matching company names, with various spelling variations, against a big list of different company names using the stringdist function from the stringdist package, in order to identify the companies in that big list. This is the question I asked.
Unfortunately, I have not been able to make any improvements to my code, which is why I'm starting to look away from stringdist and try something completely different.
I use RStudio, and I've noticed that its built-in search function is much more effective:
As you can see in the picture, simply searching for the company name in the top right corner gives me the output I'm looking for, such as the longer name "AMMINEX EMISSIONS..." and "AMMINEX AS".
However, in my previous attempt with the stringdist function (see the link to my previous question) I would get results like "LAMINEX", which are not at all relevant but would appear before the more useful matches:
So it seems like the algorithm RStudio uses is much better suited to my case. However, I'm not sure whether it's possible to replicate this algorithm in code, instead of having to manually search for each company.
Assuming I have a data frame that looks like this:
Company_list <- data.frame(Companies=c('AMMINEX', 'Microsoft', 'Apple'))
What would be a way for me to search for all 3 companies at the same time and get the same type of results in a data frame, like Rstudio does in the first image?
From your description of which results are good or bad, it sounds like you want exact matches of a substring rather than strings that are merely close under those distance measures. In that case you can imitate RStudio's search function with grepl:
library(tidyverse)
# Toy data: 12 names built from three prefixes, plus a numeric column
demo.df <- data.frame(name = paste(rep(c("abc", "jkl", "xyz"), each = 4), sample(1:100, 4 * 3)),
                      limbs = 1:4 * 3)
# Keep only the rows whose name contains "abc" or "xyz"
demo.df %>% filter(grepl("abc|xyz", name))
The pipe in the grepl pattern string means "or", letting you search for multiple companies at the same time. So, to search for the names from the example data frame, the pattern string would be paste0(Company_list$Companies, collapse = "|"). Is this what you're after?
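Applied to the Company_list from the question, a minimal sketch might look like the following; big_list is a made-up stand-in for the real lookup table, and tidyverse is assumed to be loaded from the snippet above:
big_list <- data.frame(name = c("AMMINEX EMISSIONS TECHNOLOGY", "AMMINEX AS",
                                "LAMINEX", "Microsoft Corp", "Apple Inc"))
pattern <- paste0(Company_list$Companies, collapse = "|")   # "AMMINEX|Microsoft|Apple"
big_list %>% filter(grepl(pattern, name, ignore.case = TRUE))
# "LAMINEX" is not returned because "AMMINEX" is not a substring of it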

How to fuzzy match character strings of persons' names listed variously firstName lastName or lastName firstName and with misspellings [duplicate]


Techniques for finding near duplicate records

I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.
I have a few related questions:
Is the tm package appropriate for this sort of task?
Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance which is anecdotally slow.)
Are there other suitable tools in R, apart from agrep and tm?
Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)
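For illustration only, here is a minimal sketch of the cleaning plan described in the question, run on a made-up vector of names; the synonym rule and the agrep distance threshold are assumptions:
names_raw   <- c("Some Company Limited", "SOME COMPANY LTD!", "Another Co.")
names_clean <- tolower(names_raw)
names_clean <- gsub("\\blimited\\b", "ltd", names_clean)   # one example synonym rule
names_clean <- gsub("[^a-z ]", "", names_clean)            # strip non-alphabetic characters
# agrep() takes a single pattern, hence the slow loop the question mentions:
agrep(names_clean[1], names_clean, max.distance = 0.2, value = TRUE)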
If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
I use the jarowinkler(), levenshteinSim(), and soundex() functions in RecordLinkage to write my own function that uses my own weighting scheme (also, as it stands, you can't use soundex() on big data sets with RecordLinkage).
If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare, which I write as separate tables in a MySQL database.
So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters in each list or the first three letters and the first three letters in the acronym). Then I calculate match scores using the above functions.
You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
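A minimal sketch of just the scoring step described above, on made-up name pairs; the 0.7/0.3 weighting is an arbitrary stand-in for the answerer's own weighting scheme:
library(RecordLinkage)
# Lower-case and strip punctuation, as described above
a <- tolower(gsub("[[:punct:]]", "", c("Some Company Limited", "Acme Widgets Inc")))
b <- tolower(gsub("[[:punct:]]", "", c("SOME COMPANY LTD!",    "ACME WIDGET INC")))
# Blend two string-similarity measures into a single match score (weights are arbitrary)
score <- 0.7 * jarowinkler(a, b) + 0.3 * levenshteinSim(a, b)
data.frame(a, b, score)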
Maybe Google Refine could help. It looks better suited to cases where you have lots of exceptions and don't yet know them all.
What you're doing is called record linkage, and it's been a huge field of research over many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.
These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.
In your case, an edit-distance calculation would probably work, but if you need to find near duplicates in larger text-based documents, you can try
http://www.softcorporation.com/products/neardup/
