Convert GENCODE IDs to Ensembl - Ranged SummarizedExperiment - r

I have an expression set matrix whose rownames are what I think are GENCODE IDs, for example
"ENSG00000000003.14"
"ENSG00000000457.13"
"ENSG00000000005.5" and so on.
I would like to convert these to gene symbols, but I am not sure of the best way to do so, especially because of the ".14" or ".13" suffix, which I believe is the version. Should I first trim everything after the dot from each ID and then use biomaRt to convert? If so, what is the most efficient way of doing it? Is there a better way to get to the gene symbol?
Many thanks for your help

As already mentioned, these are Ensembl IDs. The first thing you need to do is check your expression set object and identify which database it uses for annotations. Sometimes the IDs may map to different gene symbols in newer (updated) annotation databases.
Anyway, assuming the IDs are human, you can use this code to get the gene symbols very easily.
library(org.Hs.eg.db) ## Annotation DB
library(AnnotationDbi)
ids <- c("ENSG00000000003", "ENSG00000000457","ENSG00000000005")
gene_symbol <- select(org.Hs.eg.db,keys = ids,columns = "SYMBOL",keytype = "ENSEMBL")
You can try with org.Hs.eg.db or the exact db your expression set uses (if that information is available).

Thanks for the help. My problem was getting rid of the version (.XX) at the end of each Ensembl gene ID. I thought there would be a more straightforward way of going from an Ensembl gene ID that has the version number (GENCODE basic annotation) to a gene symbol. In the end I did the following, and it seems to be working:
df$ensembl_gene_id <- gsub('\\..+$', '', df$ensembl_gene_id)
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- df$ensembl_gene_id
symbol <- getBM(filters = "ensembl_gene_id",
                attributes = c("ensembl_gene_id", "hgnc_symbol"),
                values = genes,
                mart = mart)
df <- merge(x = symbol,
            y = df,
            by.x = "ensembl_gene_id",
            by.y = "ensembl_gene_id")
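A minimal base R sketch of the same idea, in case it helps: it strips only a trailing ".<digits>" version and uses match() instead of merge(), because merge() sorts by the key and, by default, drops IDs that have no symbol. The lookup table and its symbols are hypothetical placeholders standing in for whatever getBM() or select() returns.

```r
# GENCODE-style IDs with a version suffix, as in the question
ids <- c("ENSG00000000003.14", "ENSG00000000457.13", "ENSG00000000005.5")

# Remove only a trailing ".<digits>" version
stripped <- sub("\\.[0-9]+$", "", ids)

# Hypothetical lookup table standing in for the getBM()/select() result;
# the symbols are placeholders, not real annotation.
lookup <- data.frame(
  ensembl_gene_id = c("ENSG00000000005", "ENSG00000000003"),
  hgnc_symbol = c("SYMBOL_B", "SYMBOL_A"),
  stringsAsFactors = FALSE
)

# match() keeps the original order and gives NA for IDs without a symbol
symbols <- lookup$hgnc_symbol[match(stripped, lookup$ensembl_gene_id)]

# Worth checking that stripping did not create duplicate IDs
any(duplicated(stripped))
```

The resulting symbols vector lines up with the original rownames, so it can be attached as a column without reordering the expression matrix.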

Related

Why am I getting 'Can't bind data because some arguments have the same nameTraceback:' [duplicate]

#use readtable to create data frames of following unzipped files below
x.train <- read.table("UCI HAR Dataset/train/X_train.txt")
subject.train <- read.table("UCI HAR Dataset/train/subject_train.txt")
y.train <- read.table("UCI HAR Dataset/train/y_train.txt")
x.test <- read.table("UCI HAR Dataset/test/X_test.txt")
subject.test <- read.table("UCI HAR Dataset/test/subject_test.txt")
y.test <- read.table("UCI HAR Dataset/test/y_test.txt")
features <- read.table("UCI HAR Dataset/features.txt")
activity.labels <- read.table("UCI HAR Dataset/activity_labels.txt")
colnames(x.test) <- features[,2]
dataset_test <- cbind(subject.test,y.test,x.test)
colnames(dataset_test)[1] <- "subject"
colnames(dataset_test)[2] <- "activity"
test <- select(features, V2)
dataset_test <- select(dataset_test,subject,activity)
[1] Error: Can't bind data because some arguments have the same name
features is a two column dataframe with the second columns containing
the names for x.test
subject.test is a single column data frame
y.test is a single column data frame
x.test is a wide data frame
After naming and binding these data frames I tried to use dplyr::select to select certain columns. However, I get an error when selecting from dataset_test:
"Error: Can't bind data because some arguments have the same name"
However, test does not return an error and properly filters. Why is there the difference in behaviour?
The data I am using can be downloaded online. The data sources correspond to the variable names, except "_" are used instead of "."
dput
> dput(head(x.test[,1:5],2))
structure(list(V1 = c(0.25717778, 0.28602671), V2 = c(-0.02328523,
-0.013163359), V3 = c(-0.014653762, -0.11908252), V4 = c(-0.938404,
-0.97541469), V5 = c(-0.92009078, -0.9674579)), row.names = 1:2, class = "data.frame")
> dput(head(subject.test,2))
structure(list(V1 = c(2L, 2L)), row.names = 1:2, class = "data.frame")
> dput(head(y.test,2))
structure(list(V1 = c(5L, 5L)), row.names = 1:2, class = "data.frame")
> dput(head(features,2))
structure(list(V1 = 1:2, V2 = c("tBodyAcc-mean()-X", "tBodyAcc-mean()-Y"
)), row.names = 1:2, class = "data.frame")
I had exactly the same problem and I think I'm looking at the same dataset as you. It's motion sensor data from a smart phone, isn't it?
The problem is exactly what the error message says! That dang set has duplicate column names. Here's how I explored it. I couldn't use your dput commands, so I couldn't try out your data. I'm showing my code and results. I suggest you substitute your variable, dataset_test, where I have samsungData.
Here's the error. If you just select the dataset, but don't indicate the columns, the error message identifies the duplicates.
select(samsungData)
That gave me this error, which is just what your own dplyr error was trying to tell you.
Error: Columns "fBodyAcc-bandsEnergy()-1,8", "fBodyAcc-bandsEnergy()-9,16", "fBodyAcc-bandsEnergy()-17,24", "fBodyAcc-bandsEnergy()-25,32", "fBodyAcc-bandsEnergy()-33,40", ... must have a unique name
Then I wanted to see where that first column was duplicated. (I don't think I'll ever work well with regular expressions, but this one made me mad and I wanted to find it.)
has_dupe_col <- grep("fBodyAcc\\-bandsEnergy\\(\\)\\-1,8", names(samsungData))
names(samsungData)[has_dupe_col]
Results:
[1] "fBodyAcc-bandsEnergy()-1,8" "fBodyAcc-bandsEnergy()-1,8" "fBodyAcc-bandsEnergy()-1,8"
That showed me that the same column name appears in three positions. That won't play nicely in dplyr.
Then I wanted to see a frequency table for all the column names and call out the duplicates.
names_freq <- as.data.frame(table(names(samsungData)))
names_freq[names_freq$Freq > 1, ]
A bunch of them appear three times! Here are just a few.
Var1 Freq
9 fBodyAcc-bandsEnergy()-1,16 3
10 fBodyAcc-bandsEnergy()-1,24 3
11 fBodyAcc-bandsEnergy()-1,8 3
Conclusion:
The tool (dplyr) isn't broken, the data is defective. If you want to use dplyr to select from this dataset, you're going to have to locate those duplicate column names and do something about them. Maybe you change the column names (assigning to names() in base R, for example with make.unique(), will do it without grief). On the other hand, maybe they're supposed to be duplicated and they're there because they're a time series or some iteration of experimental observations. Maybe then what you need to do is merge those columns into one and provide another dimension (variable) to distinguish them.
That's the analysis part of data analysis. You'll have to dig into the data to see what the right answer is. Either that, or the question you're trying to answer need not even include those duplicate columns, in which case you throw them away and sleep peacefully.
Welcome to data science! At best, it's just 10% cool math and machine learning. 90% is putting on gloves and a mask and wiping up crap like this in your data.
I recently ran into this same problem with a different data set. My tidyverse solution to identifying duplicate column names in the dataframe (df) was:
tibble::enframe(names(df)) %>% count(value) %>% filter(n > 1)
This error is often caused by a data frame having columns with identical names, so that should be the first thing to check. I was trying to check my own data frame with dplyr select helper functions (starts_with, contains, etc.), but even those won't work, so you may need to export to a CSV and check in Excel or some other program, or use base functions to check for duplicate column names.
Another possibility to find duplicate column names using Base R would be using duplicated:
colnames(df)[which(duplicated(colnames(df)))]
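Putting the answers above together, here is a small base R sketch on a toy data frame (the column names are made up for illustration): it lists the duplicated names and then makes them unique with make.unique() so that dplyr can work with the result. Whether renaming is the right fix for the real dataset is still a judgment call, as discussed above.

```r
# Toy data frame with a deliberately duplicated column name
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(df) <- c("x", "y", "x")

# Which names occur more than once?
dupes <- unique(names(df)[duplicated(names(df))])

# Append a numeric suffix to the repeats so every name is unique
names(df) <- make.unique(names(df), sep = "_")
names(df)
```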

How to use NCBI gene database in biomaRt R package

I'm not very experienced with R, but I'm trying to learn how to use the biomaRt package to find genes located in my regions of interest.
I've managed to produce a valid output using the ensembl dataset with the following code:
> mart= useMart(biomart="ensembl",dataset="hsapiens_gene_ensembl")
> results <- getBM(attributes =c("chromosome_name","start_position","end_position",
"band","hgnc_symbol","entrezgene"), filters = c("chromosome_name","start","end"),
values = list(1,226767027,227317593), mart=mart)
I know that "entrezgene" corresponds to the NCBI gene ID, but I would like to have the gene name from NCBI.
Is there a way to connect biomaRt to the NCBI database and retrieve that information?
Thank you in advance.
Type listAttributes(mart) to see the list of attributes you can select.
Regarding the gene name, I think you might want external_gene_id (called external_gene_name in more recent Ensembl releases), but there are other gene name options as well.

Biomart in R to convert rssnp to gene name

I have the following code in R.
library(biomaRt)
snp_mart = useMart("ENSEMBL_MART_SNP", dataset="hsapiens_snp")
snp_attributes = c("refsnp_id", "chr_name", "chrom_start",
"associated_gene", "ensembl_gene_stable_id", "minor_allele_freq")
getENSG <- function(rs, mart = snp_mart) {
results <- getBM(attributes = snp_attributes,
filters = "snp_filter", values = rs, mart = mart)
return(results)
}
getENSG("rs144864312")
refsnp_id chr_name chrom_start associated_gene ensembl_gene_stable_id
1 rs144864312 8 20254959 NA ENSG00000061337
minor_allele_freq
1 0.000399361
I have no background in biology so please forgive me if this is an obvious question. I was told that rs144864312 should match to the gene name "LZTS1".
I largely got the code above from the internet. My question is: where do I extract that gene name from? I understand that listAttributes(snp_mart) gives a list of all possible outputs, but I don't see any that give me the above "gene name". How do I extract this gene name using biomaRt, given the rs number? Thank you in advance.
PS: I need to do this for something like 500 entries (not just 1). Hence why I created a simple function as above to extract the gene name.
First, I think your question will draw more professional attention on https://www.biostars.org/
That said, to my knowledge, now that you have the Ensembl ID (ENSG00000061337), you are just one step away from getting the gene name. If you google "how to convert ensembl ID to gene name" you will find many approaches. Here are a few options:
* Use https://david.ncifcrf.gov/conversion.jsp
* Use BioMart under Ensembl: http://www.ensembl.org/biomart/martview/1cb4c119ae91cb34b2cd5280be0a1aac
* Download a table with both gene name and Ensembl ID, and customize your query. You might want to download it from the UCSC Genome Browser; here are some instructions: https://www.biostars.org/p/92939/
Good luck
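For the 500 entries, the two-step idea above (rs number to Ensembl ID, then Ensembl ID to gene name) comes down to a table join. A rough sketch in base R follows; the gene_names table is hypothetical and stands in for whichever ID-to-name source you pick from the options above, and the ENSG00000061337/LZTS1 pairing is simply the one stated in the question.

```r
# SNP -> Ensembl gene ID, shaped like the getENSG() output in the question
snp_hits <- data.frame(
  refsnp_id = "rs144864312",
  ensembl_gene_stable_id = "ENSG00000061337",
  stringsAsFactors = FALSE
)

# Ensembl gene ID -> gene name; hypothetical table typed in by hand here,
# in practice downloaded or queried from one of the sources listed above
gene_names <- data.frame(
  ensembl_gene_id = "ENSG00000061337",
  gene_name = "LZTS1",
  stringsAsFactors = FALSE
)

# Left join: keep every SNP, with NA where no gene name is found
annotated <- merge(snp_hits, gene_names,
                   by.x = "ensembl_gene_stable_id",
                   by.y = "ensembl_gene_id",
                   all.x = TRUE)
annotated
```

With all.x = TRUE, SNPs whose Ensembl ID has no entry in the name table stay in the result with NA rather than disappearing.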

Merge is duplicating rows in r

I have two data sets with country names in common.
first data frame
As you can see, both data sets have a two-letter country code formatted the same way.
After running this code:
merged<- merge(aggdata, Trade, by="Group.1" , all.y = TRUE, all.x=TRUE)
I get the following result
Rather than having 2 rows with the same country code, I'd like them to be combined.
Thanks!
I strongly suspect that the Group.1 strings in one or other of your data frames have one or more trailing spaces, so they appear identical when viewed, but are not. An easy way of visually checking whether they are the same:
levels(as.factor(Trade$Group.1))
levels(as.factor(aggdata$Group.1))
If the problem does turn out to be trailing spaces, then if you are using R 3.2.0 or higher, try:
Trade$Group.1 <- trimws(Trade$Group.1)
aggdata$Group.1 <- trimws(aggdata$Group.1)
Even better, if you are using read.table etc. to input your data, then use the parameter strip.white=TRUE
For future reference, it would be better to post at least a sample of your data rather than a screenshot.
The following works for me:
aggdata <- data.frame(Group.1 = c('AT', 'BE'), CASEID = c(1587.6551, 506.5), ISOCNTRY = c(NA, NA),
QC17_2 = c(2.0, 1.972332), D70 = c(1.787440, 1.800395))
Trade <- data.frame(Group.1 = c('AT', 'BE'), trade = c(99.77201, 100.10685))
merged<- merge(aggdata, Trade, by="Group.1" , all.y = TRUE, all.x=TRUE)
I had to transcribe your data by hand from your screenshots, so I only did the first two rows. If you could paste in a full sample of your data, that would be helpful. See here for some guidelines on producing a reproducible example: https://stackoverflow.com/a/5963610/236541
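To illustrate the trailing-space explanation, here is a small sketch with made-up values (only the Group.1, D70 and trade column names come from the question). One key carries a trailing space, so a full outer merge produces an extra row until the keys are trimmed.

```r
# Toy data: "AT " in aggdata has a trailing space, "AT" in Trade does not
aggdata <- data.frame(Group.1 = c("AT ", "BE"), D70 = c(1.79, 1.80),
                      stringsAsFactors = FALSE)
Trade <- data.frame(Group.1 = c("AT", "BE"), trade = c(99.8, 100.1),
                    stringsAsFactors = FALSE)

# Before trimming: "AT " and "AT" are different keys, so AT appears twice
before <- merge(aggdata, Trade, by = "Group.1", all = TRUE)
nrow(before)

# After trimming both key columns the rows combine as intended
aggdata$Group.1 <- trimws(aggdata$Group.1)
Trade$Group.1 <- trimws(Trade$Group.1)
after <- merge(aggdata, Trade, by = "Group.1", all = TRUE)
nrow(after)
```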

Vectorise an imported variable in R

I have imported a CSV file to R but now I would like to extract a variable into a vector and analyse it separately. Could you please tell me how I could do that?
I know that the summary() function gives a rough idea but I would like to learn more.
I apologise if this is a trivial question but I have watched a number of tutorial videos and have not seen that anywhere.
Read data into data frame using read.csv. Get names of data frame. They should be the names of the CSV columns unless you've done something wrong. Use dollar-notation to get vectors by name. Try reading some tutorials instead of watching videos, then you can try stuff out.
d = read.csv("foo.csv")
names(d)
v = d$whatever # for example
hist(v) # for example
This is totally trivial stuff.
I assume you have used the read.csv() or the read.table() function to import your data into R. (You can get help directly in R with ?, e.g. ?read.csv)
So normally, you have a data.frame. And if you check the documentation, the data.frame is described as a "[...]tightly coupled collections of variables which share many of the properties of matrices and of lists[...]"
So basically you can already handle each column of your data as a vector.
A quick search on SO turned up these two posts, among others:
Converting a dataframe to a vector (by rows) and
Extract Column from data.frame as a Vector
And I am sure there are more relevant ones. Try some good tutorials on R (videos are not so instructive in this case).
There is a ton of good ones on the Internet, e.g:
* http://www.introductoryr.co.uk/R_Resources_for_Beginners.html (which lists some)
or
* http://tryr.codeschool.com/
Anyways, one way to deal with your csv would be:
#import the data to R as a data.frame
mydata = read.csv(file="SomeFile.csv", header = TRUE, sep = ",",
quote = "\"",dec = ".", fill = TRUE, comment.char = "")
#extract a column to a vector
firstColumn = mydata$col1 # extract the column named "col1" of mydata to a vector
#This previous line is equivalent to:
firstColumn = mydata[,"col1"]
#extract a row
firstline = mydata[1,] #extract the first row of mydata (this is still a one-row data.frame, not a vector)
Edit: In some cases[1], you might need to coerce the data in a vector by applying functions such as as.numeric or as.character:
firstline=as.numeric(mydata[1,])#extract the first row of mydata to a vector
#Note: the entire row *has to be* numeric or compatible with that class
[1] e.g. it happened to me when I wanted to extract a row of a data.frame inside a nested function
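A short sketch of the column/row difference, using a made-up data frame in place of the imported CSV (the names mydata, col1 and col2 are only for illustration): a column comes out as a plain vector, while a single row stays a data.frame until it is flattened, for example with unlist().

```r
# Stand-in for the data frame you would get from read.csv()
mydata <- data.frame(col1 = c(1.5, 2.5), col2 = c(10, 20))

# A column is already a vector
v <- mydata$col1
is.vector(v)

# A row is still a data.frame with one row
row1 <- mydata[1, ]
class(row1)

# unlist() flattens it into a named numeric vector
row_vec <- unlist(mydata[1, ])
row_vec
```

Like as.numeric(), unlist() only gives a sensible numeric result when all columns in the row are numeric; with mixed types everything is coerced to a common type.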
