How to use the NCBI gene database in the biomaRt R package

I'm not very experienced with R, but I'm trying to learn how to use the biomaRt package to find genes located in my regions of interest.
I've managed to produce valid output using the Ensembl dataset with the following code:
library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
results <- getBM(attributes = c("chromosome_name", "start_position", "end_position",
                                "band", "hgnc_symbol", "entrezgene"),
                 filters = c("chromosome_name", "start", "end"),
                 values = list(1, 226767027, 227317593), mart = mart)
I know that "entrezgene" corresponds to the NCBI gene ID, but I would like to have the gene name from NCBI.
Is there a way to connect biomaRt to the NCBI database and retrieve that information?
Thank you in advance.

Type listAttributes(mart) to see the list of attributes you can select.
Regarding the gene name, I think you might want external_gene_id (called external_gene_name in more recent Ensembl releases), but there are other gene name options as well.

Related

Replacing column values based on related column in R

I'm currently working on a dataset which has an address and a zip code column. I'm trying to deal with the invalid/missing data in zip code by finding a different record with same address, and then filling the corresponding zip code to the invalid zip code. What would be the best approach to go about doing this?
Step 1. Using the non-missing addresses and zip codes construct a dictionary
data frame of sorts. For example, in a data frame "df" with an "address"
column and a "zip_code" column, you could get this via:
library(dplyr)
zip_dictionary <- na.omit(select(df, address, zip_code))
zip_dictionary <- distinct(zip_dictionary)
This assumes there is only one unique value of "zip_code" for each "address"
in your data. If not, you need to figure out which value to use and filter or
recode it accordingly.
Step 2. Install the {elucidate} package from GitHub and use the translate()
function to fill in the missing zip codes using the extracted dictionary from
step 1:
remotes::install_github("bcgov/elucidate")
library(elucidate)
df <- df %>%
  mutate(zip_code = if_else(is.na(zip_code),
                            # look up the zip code recorded for the same address
                            translate(address,
                                      old = zip_dictionary$address,
                                      new = zip_dictionary$zip_code),
                            # keep the existing zip code where it is not missing
                            zip_code))
disclaimer: I am the author of the {elucidate} package
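If installing a package from GitHub is not an option, the same lookup can be sketched in base R with match(). This is only a sketch on made-up data: the data frame df and all of its values are hypothetical, and it assumes, as above, that each address has a single zip code.

```r
# Hypothetical example data: three rows have a missing zip code.
df <- data.frame(
  address  = c("1 Main St", "2 Oak Ave", "1 Main St", "3 Elm Rd", "2 Oak Ave"),
  zip_code = c("10001", NA, NA, NA, "20002"),
  stringsAsFactors = FALSE
)

# Dictionary of known address/zip pairs (assumes one zip code per address).
zip_dictionary <- unique(na.omit(df[, c("address", "zip_code")]))

# For each row with a missing zip code, look up its address in the dictionary.
missing <- is.na(df$zip_code)
idx <- match(df$address[missing], zip_dictionary$address)
df$zip_code[missing] <- zip_dictionary$zip_code[idx]

print(df)
```

Addresses that never appear with a valid zip code, such as the fourth row here, stay NA.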

How do I tell R package Limma what to use as "targets" in read.idat()?

I am analyzing some microarray data. For each donor, I have a "before intervention" and an "after intervention" idat file. I have successfully read these into R using the Limma package with the read.idat() function. However, the resulting object only has one column in targets: "IDATfile". I believe that if I was using read.ilmn() I would specify a targets.txt file but I can't see this option when using read.idat(). E.g. in the Limma user guide Illumina example, the targets are "Donor", "Age", and "Cell Type". How do I tell Limma what to put as targets? I would like to have "Donor" and "Intervention".
An example of what I mean:
idatfiles <- dir(pattern="idat")
bgxfile <- dir(pattern="bgx")
x <- read.idat(idatfiles, bgxfile)
colnames(x$targets)
[1] "IDATfile"
Instead of "IDATfile", I would like this to be "Donor" and "Intervention". I can include some other columns of the original IDAT files as further targets by doing read.idat(..., dateinfo=TRUE), but I don't know how to edit these columns to make them "Donor" and "Intervention":
[1] "IDATfile" "ScanInfo" "DecodeInfo"
Let me know if any more info is needed, really appreciate any help!
If you want to simply edit colnames you can use:
colnames(x$targets) <- c("Donor")
But I think rows are samples in the targets data frame, so is that really what you want?
http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/EList.html
targets: data.frame containing information on the target RNA samples.
Rows correspond to samples. May have any number of columns.
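Since targets is an ordinary data frame with one row per sample, sample annotation can be added as new columns rather than by renaming IDATfile. A minimal sketch, using a plain data frame to stand in for x$targets; the file names, the donor labels, and the assumption that the rows come in this order are all hypothetical:

```r
# Stand-in for x$targets as returned by read.idat(); file names are made up.
targets <- data.frame(
  IDATfile = c("d1_before.idat", "d1_after.idat", "d2_before.idat", "d2_after.idat"),
  stringsAsFactors = FALSE
)

# Add one annotation column per variable, in the same order as the rows.
targets$Donor <- c("D1", "D1", "D2", "D2")
targets$Intervention <- factor(c("before", "after", "before", "after"),
                               levels = c("before", "after"))

print(colnames(targets))
```

With a real EList the same assignments would be made on x$targets directly, after checking that the order of the labels matches x$targets$IDATfile.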

biomaRt in R to convert rs SNP IDs to gene names

I have the following code in R.
library(biomaRt)
snp_mart = useMart("ENSEMBL_MART_SNP", dataset="hsapiens_snp")
snp_attributes = c("refsnp_id", "chr_name", "chrom_start",
"associated_gene", "ensembl_gene_stable_id", "minor_allele_freq")
getENSG <- function(rs, mart = snp_mart) {
results <- getBM(attributes = snp_attributes,
filters = "snp_filter", values = rs, mart = mart)
return(results)
}
getENSG("rs144864312")
refsnp_id chr_name chrom_start associated_gene ensembl_gene_stable_id
1 rs144864312 8 20254959 NA ENSG00000061337
minor_allele_freq
1 0.000399361
I have no background in biology so please forgive me if this is an obvious question. I was told that rs144864312 should match to the gene name "LZTS1".
I largely got the code above from the internet. My question is: where do I extract that gene name from? I understand that listAttributes(snp_mart) gives a list of all possible outputs, but I don't see any that gives me the gene name above. How do I get this gene name using biomaRt, given the rs number? Thank you in advance.
PS: I need to do this for something like 500 entries (not just 1). Hence why I created a simple function as above to extract the gene name.
First, I think your question will draw more expert attention on https://www.biostars.org/
That said, to my knowledge, now that you have the Ensembl ID (ENSG00000061337), you are just one step away from getting the gene name. If you google "how to convert ensembl ID to gene name" you will find many approaches. Here are a few options:
Use the DAVID conversion tool: https://david.ncifcrf.gov/conversion.jsp
Use BioMart on Ensembl: http://www.ensembl.org/biomart/martview/1cb4c119ae91cb34b2cd5280be0a1aac
Download a table with both gene names and Ensembl IDs, and customize your query. You might want to download it from the UCSC Genome Browser; here are some instructions: https://www.biostars.org/p/92939/
Good luck
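Whichever source the ID-to-name table comes from, attaching it to the SNP results is a plain merge on the Ensembl gene ID. A minimal sketch, assuming such a lookup table has already been obtained; the column names ensembl_gene_id and gene_name in that table are assumptions, and its single row is typed in by hand from the question:

```r
# SNP query result, copied from the getBM() output in the question.
snp_results <- data.frame(
  refsnp_id = "rs144864312",
  ensembl_gene_stable_id = "ENSG00000061337",
  stringsAsFactors = FALSE
)

# Hypothetical lookup table; in practice, load the downloaded table instead.
gene_lookup <- data.frame(
  ensembl_gene_id = "ENSG00000061337",
  gene_name = "LZTS1",
  stringsAsFactors = FALSE
)

# Left join: keep every SNP, even when no gene name is found for it.
annotated <- merge(snp_results, gene_lookup,
                   by.x = "ensembl_gene_stable_id", by.y = "ensembl_gene_id",
                   all.x = TRUE)
print(annotated)
```

The same merge works unchanged for a result with 500 rs IDs, since it is keyed on the gene ID column rather than on individual SNPs.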

Convert GENCODE IDs to Ensembl - Ranged SummarizedExperiment

I have an expression set matrix with the rownames being what I think is a GENCODE ID in the format for example
"ENSG00000000003.14"
"ENSG00000000457.13"
"ENSG00000000005.5" and so on.
I would like to convert these to gene symbols, but I am not sure of the best way to do so, especially because of the ".14" or ".13" suffix, which I believe is the version. Should I first trim everything after the dot from each ID and then use biomaRt to convert? If so, what is the most efficient way of doing it? Is there a better way to get to the gene symbol?
Many thanks for your help.
As already mentioned, these are Ensembl IDs. The first thing you need to do is check your expression set object and identify which annotation database it uses. Sometimes IDs map to different gene symbols in newer (updated) annotation databases.
Anyway, assuming the IDs are human, you can use this code to get the gene symbols very easily.
library(org.Hs.eg.db) ## Annotation DB
library(AnnotationDbi)
ids <- c("ENSG00000000003", "ENSG00000000457","ENSG00000000005")
gene_symbol <- select(org.Hs.eg.db,keys = ids,columns = "SYMBOL",keytype = "ENSEMBL")
You can try with org.Hs.eg.db or the exact db your expression set uses (if that information is available).
Thanks for the help. My problem was getting rid of the version .XX at the end of each Ensembl gene ID. I thought there would be a more straightforward way of going from an Ensembl gene ID with a version number (GENCODE basic annotation) to a gene symbol. In the end I did the following, and it seems to be working:
df$ensembl_gene_id <- gsub('\\..+$', '', df$ensembl_gene_id)
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- df$ensembl_gene_id
symbol <- getBM(filters = "ensembl_gene_id",
attributes = c("ensembl_gene_id","hgnc_symbol"),
values = genes,
mart = mart)
df <- merge(x = symbol,
y = df,
by.x="ensembl_gene_id",
by.y="ensembl_gene_id")
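One thing to watch in the merge above: by default merge() keeps only IDs found in both tables, so genes without a symbol silently drop out of df. A small sketch of the version trimming plus a merge that keeps them, using the three IDs from the question; the expr column and the hand-typed symbol table are hypothetical stand-ins for the real data and the getBM() result:

```r
# Versioned IDs from the question, with a made-up expression column.
df <- data.frame(
  ensembl_gene_id = c("ENSG00000000003.14", "ENSG00000000457.13", "ENSG00000000005.5"),
  expr = c(1.2, 3.4, 5.6),
  stringsAsFactors = FALSE
)

# Strip the version suffix (a dot followed by digits at the end of the ID).
df$ensembl_gene_id <- sub("\\.[0-9]+$", "", df$ensembl_gene_id)

# Stand-in for the getBM() result; one ID is left out on purpose.
symbol <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
  hgnc_symbol = c("TSPAN6", "TNMD"),
  stringsAsFactors = FALSE
)

# all.y = TRUE keeps rows of df with no match in symbol (hgnc_symbol becomes NA).
merged <- merge(x = symbol, y = df, by = "ensembl_gene_id", all.y = TRUE)
print(merged)
```

Whether unmatched genes should be kept as NA or dropped depends on the downstream analysis; the point is only that the default drops them without warning.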

How to extract sample titles (names) using GEOquery package?

GEOquery is a great R package to retrieve and analyze the Gene Expression data stored in NCBI Gene Expression Omnibus (GEO) database. I have used the following code provided from GEO2R service of GEO database (that generates the initial R script to analyze your desired data automatically) to extract some GEO series of experiments:
gset <- getGEO("GSE10246", GSEMatrix =TRUE)
if (length(gset) > 1) idx <- grep("GPL1261", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
gset # displays a summary of the data stored in this variable
The problem is that I can not retrieve the sample titles from it. I have found some function Columns() that works on GDS datasets and returns the sample names, but not on GSE.
Please note I am not interested in sample accession IDs (i.e. GSM258609 GSM258610, etc), what I want is the sample human readable titles.
Any ideas? Thanks.
After
gset <- getGEO("GSE10246", GSEMatrix =TRUE)
gset is a simple list whose first element is an ExpressionSet. The sample information is in the phenoData, accessible with pData(); the human-readable sample titles are normally in its title column. So maybe you're looking for
pData(gset[[1]])
See ?ExpressionSet for more.
