I have the following code in R.
library(biomaRt)
snp_mart = useMart("ENSEMBL_MART_SNP", dataset="hsapiens_snp")
snp_attributes = c("refsnp_id", "chr_name", "chrom_start",
"associated_gene", "ensembl_gene_stable_id", "minor_allele_freq")
getENSG <- function(rs, mart = snp_mart) {
results <- getBM(attributes = snp_attributes,
filters = "snp_filter", values = rs, mart = mart)
return(results)
}
getENSG("rs144864312")
refsnp_id chr_name chrom_start associated_gene ensembl_gene_stable_id
1 rs144864312 8 20254959 NA ENSG00000061337
minor_allele_freq
1 0.000399361
I have no background in biology, so please forgive me if this is an obvious question. I was told that rs144864312 should match the gene name "LZTS1".
I got most of the code above from the internet. My question is: where do I extract that gene name from? I understand that listAttributes(snp_mart) lists all possible outputs, but I don't see any that gives me the gene name above. How do I get this gene name using biomaRt, given the rs number? Thank you in advance.
PS: I need to do this for about 500 entries (not just 1), which is why I wrapped the query in a simple function.
First, I think your question will get more expert attention on https://www.biostars.org/
That said, to my knowledge, now that you have the Ensembl ID (ENSG00000061337), you are just one step away from getting the gene name. If you google "how to convert ensembl ID to gene name" you will find many approaches. Here are a few options:
use: https://david.ncifcrf.gov/conversion.jsp
use BioMart at Ensembl: http://www.ensembl.org/biomart/martview/1cb4c119ae91cb34b2cd5280be0a1aac
download a table with both gene name and ensembl ID, and customize your query. You might want to download it from UCSC Genome Browser, and here are some instructions: https://www.biostars.org/p/92939/
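The third option can be done entirely in R once a lookup table is on hand. A minimal sketch, assuming a hypothetical `gene_lookup` data frame (filled in by hand here with the one ID/name pair from the question; in practice it would be the downloaded table) and a `snp_results` data frame shaped like the getBM() output above. The column names `ensembl_gene_id` and `gene_name` are assumptions about how the downloaded table is labelled:

```r
# Toy stand-in for the getBM() result shown in the question.
snp_results <- data.frame(
  refsnp_id = "rs144864312",
  ensembl_gene_stable_id = "ENSG00000061337",
  stringsAsFactors = FALSE
)

# Hypothetical lookup table; its single row is the pair named in the question.
gene_lookup <- data.frame(
  ensembl_gene_id = "ENSG00000061337",
  gene_name = "LZTS1",
  stringsAsFactors = FALSE
)

# Left join: SNPs without a match are kept, with gene_name set to NA.
annotated <- merge(snp_results, gene_lookup,
                   by.x = "ensembl_gene_stable_id",
                   by.y = "ensembl_gene_id",
                   all.x = TRUE)
annotated
```

Because merge() works on whole data frames, the same call covers all 500 entries at once.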
Good luck
I'm currently working on a dataset which has an address and a zip code column. I'm trying to deal with the invalid/missing data in zip code by finding a different record with same address, and then filling the corresponding zip code to the invalid zip code. What would be the best approach to go about doing this?
Step 1. Using the non-missing addresses and zip codes construct a dictionary
data frame of sorts. For example, in a data frame "df" with an "address"
column and a "zip_code" column, you could get this via:
library(dplyr)
zip_dictionary <- na.omit(select(df, address, zip_code))
zip_dictionary <- distinct(zip_dictionary)
This assumes there is only one unique value of "zip_code" for each "address"
in your data. If not, you need to figure out which value to use and filter or
recode it accordingly.
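Whether that assumption holds can be checked before going further. A rough sketch in base R with made-up toy data, where one address deliberately appears with two different zip codes:

```r
# Toy dictionary: "1 Main St" is listed with two different zip codes.
zip_dictionary <- data.frame(
  address  = c("1 Main St", "1 Main St", "2 Oak Ave"),
  zip_code = c("10001", "10002", "20002"),
  stringsAsFactors = FALSE
)

# Count the distinct zip codes recorded for each address.
zips_per_address <- tapply(zip_dictionary$zip_code,
                           zip_dictionary$address,
                           function(z) length(unique(z)))

# Addresses that need a decision before the dictionary can be used.
ambiguous <- names(zips_per_address)[zips_per_address > 1]
ambiguous
```

An empty result means every address maps to exactly one zip code and the dictionary is safe to use as is.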
Step 2. Install the {elucidate} package from GitHub and use the translate()
function to fill in the missing zip codes using the extracted dictionary from
step 1:
remotes::install_github("bcgov/elucidate")
library(elucidate)
df <- df %>%
  mutate(zip_code = if_else(is.na(zip_code),
                            translate(address,
                                      old = zip_dictionary$address,
                                      new = zip_dictionary$zip_code),
                            zip_code)  # keep the existing zip code when it is not missing
  )
disclaimer: I am the author of the {elucidate} package
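If installing a package from GitHub is not an option, a similar fill can be approximated in base R with match(). This swaps in a different technique rather than the {elucidate} approach, and the data below are made up for illustration:

```r
df <- data.frame(
  address  = c("1 Main St", "1 Main St", "2 Oak Ave", "3 Elm Rd"),
  zip_code = c("10001", NA, "20002", NA),
  stringsAsFactors = FALSE
)

# Dictionary of known address/zip pairs, as in step 1.
zip_dictionary <- unique(na.omit(df[, c("address", "zip_code")]))

# Look up every address in the dictionary; unmatched addresses give NA.
looked_up <- zip_dictionary$zip_code[match(df$address, zip_dictionary$address)]

# Overwrite only the entries that were missing.
missing <- is.na(df$zip_code)
df$zip_code[missing] <- looked_up[missing]
df
```

Addresses that never appear with a valid zip code (here "3 Elm Rd") stay NA, which makes them easy to find afterwards.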
I'm not very experienced with R, but I'm trying to learn how to use the biomaRt package to find genes located in my regions of interest.
I've managed to produce a valid output using the ensembl dataset with the following code:
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
results <- getBM(attributes = c("chromosome_name", "start_position", "end_position",
                                "band", "hgnc_symbol", "entrezgene"),
                 filters = c("chromosome_name", "start", "end"),
                 values = list(1, 226767027, 227317593), mart = mart)
I know that the "entrezgene" corresponds to the NCBI gene ID, but I would like to have the GENE NAME from NCBI.
Is there a way to use biomaRt connected to the NCBI database and retrieve that information?
Thank you in advance.
Type listAttributes(mart) to see the list of attributes you can select.
Regarding the gene name, I think you might want external_gene_id (in more recent Ensembl releases this attribute appears as external_gene_name), but there are other gene name options as well.
I have an expression set matrix with the rownames being what I think is a GENCODE ID in the format for example
"ENSG00000000003.14"
"ENSG00000000457.13"
"ENSG00000000005.5" and so on.
I would like to convert these to gene_symbol, but I am not sure of the best way to do so, especially because of the ".14" or ".13", which I believe is the version. Should I first trim everything after the dot from each ID and then use biomaRt to convert? If so, what is the most efficient way of doing it? Is there a better way to get to the gene_symbol?
Many thanks for your help
As already mentioned, these are Ensembl IDs. The first thing you need to do is check your expression set object and identify which database it uses for annotation. The same ID may map to a different gene symbol in a newer (updated) annotation database.
Anyway, assuming the IDs are human, you can use this code to get the gene symbols very easily.
library(org.Hs.eg.db) ## Annotation DB
library(AnnotationDbi)
ids <- c("ENSG00000000003", "ENSG00000000457","ENSG00000000005")
gene_symbol <- select(org.Hs.eg.db,keys = ids,columns = "SYMBOL",keytype = "ENSEMBL")
You can try with org.Hs.eg.db or the exact db your expression set uses (if that information is available).
Thanks for the help. My problem was getting rid of the version (.XX) at the end of each Ensembl gene ID. I thought there would be a more straightforward way of going from a versioned Ensembl gene ID (GENCODE basic annotation) to a gene symbol. In the end I did the following, and it seems to be working:
df$ensembl_gene_id <- gsub('\\..+$', '', df$ensembl_gene_id)
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- df$ensembl_gene_id
symbol <- getBM(filters = "ensembl_gene_id",
attributes = c("ensembl_gene_id","hgnc_symbol"),
values = genes,
mart = mart)
df <- merge(x = symbol,
y = df,
by.x="ensembl_gene_id",
by.y="ensembl_gene_id")
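One thing to watch with the merge() step: by default it keeps only rows present in both tables, so any ID for which no symbol came back is silently dropped from df. A sketch with toy data showing the difference when all.y = TRUE is added. The `expr` column, the placeholder symbols, and the missing row for the third ID are all made up for illustration:

```r
df <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000457", "ENSG00000000005"),
  expr = c(1.2, 3.4, 5.6),
  stringsAsFactors = FALSE
)

# Stand-in for the getBM() result; one ID deliberately has no row,
# and the symbols are placeholders, not real annotations.
symbol <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000457"),
  hgnc_symbol = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

# Default merge: inner join, the unmatched ID disappears.
inner <- merge(x = symbol, y = df, by = "ensembl_gene_id")

# all.y = TRUE keeps every row of df; missing symbols become NA.
full <- merge(x = symbol, y = df, by = "ensembl_gene_id", all.y = TRUE)

nrow(inner)
nrow(full)
```

Comparing the two row counts is a quick way to see how many IDs failed to map.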
I'm using the "RISmed" library to query my gene or protein of interest. The output is basically a list of PubMed IDs, but most of the time it also includes non-specific hits that are of no interest to me. Since I can only see the PubMed IDs, I have to look up each returned ID manually in NCBI to see whether the paper is relevant.
Question: Is there a way, in R, to return the abstract or some kind of summary of each paper along with its PubMed ID?
If anyone can help, that would be great.
Following the example in the package manual, we need the EUtilsGet function.
library(RISmed)
search_topic <- 'copd'
search_query <- EUtilsSummary(search_topic, retmax = 10,
mindate = 2012, maxdate = 2012)
summary(search_query)
# see the ids of our returned query
QueryId(search_query)
# get actual data from PubMed
records <- EUtilsGet(search_query)
class(records)
# store it
pubmed_data <- data.frame('PMID' = PMID(records),
                          'Title' = ArticleTitle(records),
                          'Abstract' = AbstractText(records))
I have data in Excel sheets and I need a way to clean it. I would like remove inconsistent values, like Branch name is specified as (Computer Science and Engineering, C.S.E, C.S, Computer Science). So how can I bring all of them into single notation?
The car package has a recode function. See its help page for worked examples.
In fact an argument could be made that this should be a closed question:
Why is recode in R not changing the original values?
How to recode a variable to numeric in R?
Recode/relevel data.frame factors with different levels
And a few more questions easily identifiable with a search: [r] recode
EDIT:
I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.
my_recode <- function(fac, levslist) {
  # levslist is of the form:
  #   list(animal = c("cow", "pig"),
  #        bird   = c("eagle", "pigeon"))
  nfac <- factor(fac)
  inlevs <- levels(nfac)
  # levels not named in levslist are grouped under "all_others"
  othrlevs <- inlevs[!inlevs %in% unlist(levslist)]
  # wrap in list() so several leftover levels share one "all_others" label
  levels(nfac) <- c(levslist, list(all_others = othrlevs))
  nfac
}
df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"),
stringsAsFactors = FALSE)
df$type <- my_recode(df$name, list(
animal = c("cow", "pig"),
bird = c("eagle", "pigeon") ) )
df
#-----------
name type
1 cow animal
2 pig animal
3 eagle bird
4 pigeon bird
5 zebra all_others
You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)
If not, I would suggest OpenRefine. It is a great tool for exactly this job. It is not hosted: you download it and run it locally against your dataset (xls/xlsx work fine), then create a text facet and cluster away.
It uses clustering algorithms (and lets you choose among them) and is really helpful. I have cleaned a lot of data in no time.
The videos at the official web site are useful.
There is no one-size-fits-all solution for this type of problem. From what I understand, you have Branch Names that are inconsistently labelled.
You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc., and perhaps a number of other Branch Names that are also inconsistent.
The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using the built-in letters vector so you can see what I mean:
your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters, 2000, replace = TRUE)
your_df$BranchNames <- as.character(your_df$BranchNames) # only needed if the column is a factor
unique.names <- sort(unique(your_df$BranchNames))
Now that we have a sorted list of unique values, we can recode them.
Let's say we wanted to rename "a" through "g" as just "A":
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
You'd repeat the process above, eliminating or grouping the unique names as appropriate.
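Applied to the branch names from the question, the same idea might look like the sketch below. It uses a named lookup vector instead of positional indexing, which is an assumption about what is convenient here, and the "Mechanical" entry is made up to show that unlisted values are left alone:

```r
your_df <- data.frame(
  BranchNames = c("Computer Science and Engineering", "C.S.E", "C.S",
                  "Computer Science", "Mechanical"),
  stringsAsFactors = FALSE
)

# Named vector: names are the variants seen in the data,
# values are the single label to keep.
recodes <- c("Computer Science and Engineering" = "C.S.E",
             "C.S" = "C.S.E",
             "Computer Science" = "C.S.E")

# Replace only the values that have an entry in the lookup.
hit <- your_df$BranchNames %in% names(recodes)
your_df$BranchNames[hit] <- unname(recodes[your_df$BranchNames[hit]])

# Inspect what is left; anything unexpected goes into recodes next.
sort(unique(your_df$BranchNames))
```

Re-running the last line after each round of additions shows which variants still need a mapping.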