RNAseq: Convert RNA identifiers to Genes

I am using DESeq2 for DE analysis and want to analyze a published data set.
However, the count matrix provided does not show ENSEMBL gene IDs but mRNA transcript IDs instead, i.e. the first column looks like this: XM_0176201594.2
Could somebody tell me how to convert, or fetch from GTF to get the usual count matrix that serves as input for DESeq2? Basically, we are looking at a matrix, where several lines correspond to the same gene, right?
Thanks in advance!
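
If several rows do correspond to the same gene, one way to collapse them is to sum the transcript rows per gene. The following is only a minimal sketch: the counts and the tx2gene mapping table are made-up placeholders (in practice the mapping would have to come from the annotation, e.g. the GTF), and the IDs are hypothetical.

```r
# Hypothetical transcript-level count matrix (rows = transcripts, columns = samples)
counts <- matrix(c(10, 5,
                   7,  0,
                   3,  2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("XM_000001.1", "XM_000002.1", "XM_000003.1"),
                                 c("sampleA", "sampleB")))

# Hypothetical transcript-to-gene table; assumed to be derived from the annotation
tx2gene <- data.frame(transcript = c("XM_000001.1", "XM_000002.1", "XM_000003.1"),
                      gene = c("geneA", "geneA", "geneB"),
                      stringsAsFactors = FALSE)

# Look up the gene for every row of the count matrix
gene_ids <- tx2gene$gene[match(rownames(counts), tx2gene$transcript)]

# Sum all transcript rows that belong to the same gene
gene_counts <- rowsum(counts, group = gene_ids)
gene_counts
```

The resulting gene_counts has one row per gene and could then be used as a gene-level count matrix. Transcripts missing from the mapping would get NA as gene ID and should be checked before summing.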

Related

Row names showing up as row numbers in R

This probably has a simple fix, but I'm relatively new to using R and could use some assistance.
The toy data I'm using for a gene network analysis has rows that look like this: [image not included]
whereas the data that I've uploaded has rows that look like this: [image not included].
The code I'm using refers to the row names to map on as gene names. I am able to successfully run this analysis, however, the output I end up with has lists of row numbers where there should be lists of gene names.
Is there a simple way that I can convert my data into the toy data format so that the row names are gene names instead of numbers?
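
A minimal sketch of one common approach, assuming the gene names sit in an ordinary column of the data frame (the data frame expr and its column gene are made up for illustration):

```r
# Hypothetical data: gene names in a column, row names are just 1..n
expr <- data.frame(gene = c("TP53", "BRCA1", "MYC"),
                   s1 = c(5, 3, 8),
                   s2 = c(2, 7, 1),
                   stringsAsFactors = FALSE)

# Use the gene column as row names, then drop the now redundant column
rownames(expr) <- expr$gene
expr$gene <- NULL
expr
```

Row names must be unique, so this fails if a gene name occurs more than once. When reading from a file, read.csv(file, row.names = 1) sets the row names from the first column directly.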

CreateSeuratObject issue in R - when first column is gene names it states '0 features across X samples'

I'm completely new to the analysis of scRNA data and have been having issues with the CreateSeuratObject command in R.
Essentially, I have the gene expression matrix in a csv file named X with the first row being cells, and the first column being ENSG gene codes, and the number of counts expressed within the matrix.
The gene names are in ENSG format so I have merged this data frame with a geneID data frame, and then deleted the ENSG column using the following code.
X <- merge(X, geneidname, by.x="V1", by.y="gene_id")
X<-X[,c(19144, 1:19143)]
X$V1 <- NULL
The end result is a matrix where the first column holds the gene names and the first row the cells, with count data in the corresponding places in the matrix.
However, when I try to create the seurat object using
X_seurat <- CreateSeuratObject(counts = X, project = "X_seurat", min.cells = 5)
The end result is
An object of class Seurat
0 features across X samples within 1 assay
Active assay: RNA (0 features, 0 variable features)
However, when I get rid of the first column using X <- X[,-1] and then repeat the creation of the Seurat object, it works, giving me
An object of class Seurat
XXX features across 19142 samples within 1 assay
Active assay: RNA (XXX features, 0 variable features)
I'm super confused by this. Why am I not able to create the Seurat object with the gene name column included? I've checked and the class of the gene name column is 'character' and the class of the remaining columns is 'integer'. It seems a bit counter-intuitive to me that the gene names should be deleted to be able to create the Seurat object, and I've clearly gone wrong somewhere. Can anybody help? Thanks in advance!!
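
A possible explanation, offered as an assumption rather than a confirmed diagnosis: Seurat takes feature names from the row names of the counts matrix, so a character column inside the data is not treated as gene names. A minimal sketch of moving the names into the row names first (the small data frame X below is a made-up stand-in for the merged data):

```r
# Hypothetical stand-in for X after the merge: gene names in the first column
X <- data.frame(gene_name = c("A1BG", "TP53", "TP53"),
                cell1 = c(0L, 3L, 1L),
                cell2 = c(2L, 0L, 4L),
                stringsAsFactors = FALSE)

# Keep only the count columns, as a numeric matrix
counts_mat <- as.matrix(X[, -1])

# Use the gene names as row names; make.unique() disambiguates duplicates
rownames(counts_mat) <- make.unique(X$gene_name)
counts_mat

# With Seurat installed, the matrix could then be passed on, e.g.:
# X_seurat <- Seurat::CreateSeuratObject(counts = counts_mat, project = "X_seurat")
```

Duplicated gene names can appear after mapping ENSG IDs to symbols, which is why make.unique() is used here. Summing the duplicated rows would be an alternative.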

Converting TPM data to read counts for Seurat

I would like to do an analysis in R with Seurat, but for this I need a count matrix with read counts. However, the data I would like to use is provided in TPM, which is not ideal for using as input since I would like to compare with other analyses that used read counts.
Does anyone know a way to convert the TPM data to read counts?
Thanks in advance!

You would need total counts and gene (or transcript) lengths to get an approximation of that conversion. See https://support.bioconductor.org/p/91218/ for the reverse operation.
From that link:
You can create a TPM matrix by dividing each column of the counts matrix by some estimate of the gene length (again this is not ideal for the reasons stated above).
x <- counts.mat / gene.length
Then with this matrix x, you do the following:
tpm.mat <- t( t(x) * 1e6 / colSums(x) )
Such that the columns sum to 1 million.
Reversing this requires colSums(x), the per-sample sum of the length-normalized counts over the genes in the TPM matrix, and gene.length, which depends on the gene model used for read summarization.
So you may be out of luck, and would probably be better off using something like salmon or kallisto anyway to get the count matrix from the fastq files, if those are available, based on the gene or transcript model that you used in the data you want to compare it to.
If you have no other option than to use the TPM data (not really recommended), Seurat can work with that as well - see https://github.com/satijalab/seurat/issues/171.
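
To illustrate why both pieces of information are needed, here is a small sketch on made-up numbers. It assumes that the gene lengths and the per-sample total counts are known, which is usually not the case for published TPM tables. The names total.counts, rate and est.counts are introduced here for illustration only.

```r
# Made-up counts (rows = genes, columns = samples) and gene lengths
counts.mat <- matrix(c(10, 20, 30, 5, 15, 80), nrow = 3,
                     dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene.length <- c(g1 = 1000, g2 = 2000, g3 = 1500)

# Forward direction, as in the quoted answer: counts -> TPM
x <- counts.mat / gene.length              # each row divided by its gene length
tpm.mat <- t(t(x) * 1e6 / colSums(x))      # every column now sums to 1e6

# Reverse direction: only possible with the lengths AND the per-sample totals
total.counts <- colSums(counts.mat)        # assumed known here
rate <- tpm.mat * gene.length              # proportional to counts within a sample
est.counts <- t(t(rate) / colSums(rate) * total.counts)

all.equal(est.counts, counts.mat)
```

On this toy example the original counts are recovered exactly. With real data, the lengths and totals used by the original authors are rarely available, so the result would only be an approximation.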

Matching and sorting data in R or Excel

I have a list of bacteria, each with its own abundance, in a dataframe. I also have the same list of bacteria, but in a different order, in the same dataframe.
I want to match the abundances to this second list but I'm not sure how to go about doing it.
dplyr contains several methods for sorting data but I don't know how to match the abundance and print it into a new column so that it matches the second list of bacteria.
Here's the beginning of my dataset:
Taxon Total_abundance Tips
Acaricomes phytoseiuli 0.000382414 Methanothermobacter thermautotrophicus
Acetivibrio cellulolyticus 0.013979274 Methanobacterium beijingense
Acetobacter aceti 0.181150551 Methanobacterium bryantii
Acetobacter estunensis 0.023074895 Methanosarcina mazei
Acetobacter tropicalis 0.014615221 Persephonella marina
Achromobacter piechaudii 0.031811039 Sulfurihydrogenibium azorense
Achromobacter xylosoxidans 0.041558442 Balnearium lithotrophicum
Acidicapsa borealis 0.035525932 Isosphaera pallida
Acidimicrobium ferrooxidans 0.013841209 Simkania negevensis
Acidiphilium angustum 0.041702984 Parachlamydia acanthamoebae
Acidiphilium cryptum 0.039265944 Leptospira biflexa
Acidiphilium rubrum 0.041702984 Leptospira fainei
...
So, the abundance matches the data in Taxon column, and I want the abundance to also be matched with the bacteria in the "Tips" column.
For example, Acaricomes phytoseiuli has an abundance of 0.000382414, so in column D 0.000382414 will be printed next to where Acaricomes phytoseiuli is located. Again, Taxon and Tips contains exactly the same data, just in a different order.
I hope that makes sense.
It doesn't matter if this is done in R or Excel, thanks.

As others have mentioned, it's hard to test without some data that matches, but something like this should work, using match to match up values.
df$D <- df$Total_abundance[ match( df$Tips, df$Taxon ) ]
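
A small worked example of the same idea on made-up data (the letters stand in for the bacteria names):

```r
# Toy data: Tips holds the same names as Taxon, in a different order
df <- data.frame(Taxon = c("a", "b", "c", "d"),
                 Total_abundance = c(0.1, 0.2, 0.3, 0.4),
                 Tips = c("c", "a", "d", "b"),
                 stringsAsFactors = FALSE)

# For each Tips entry, find the row where the same name occurs in Taxon
idx <- match(df$Tips, df$Taxon)

# Pull the abundances in that order into a new column
df$D <- df$Total_abundance[idx]
df
```

match() returns NA for a name in Tips that does not occur in Taxon, so unmatched entries end up as NA in column D.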

I assume that your list of bacteria is unique. As a sample data frame:
dff <- data.frame(bacteria1=letters[1:10], abundance1=runif(10,0,1),
bacteria2=sample(letters[1:10],10), abundance2=0)
Now we find the matching row for each bacterium and insert its abundance:
for(i in seq_len(nrow(dff))){
# row of bacteria1 that holds the i-th entry of bacteria2
s <- which(dff$bacteria2[i]==dff$bacteria1)
dff$abundance2[i] <- dff$abundance1[s]
}

In excel under column D you can do the following:
=VLOOKUP(C3;A3:B13;2;FALSE)
C3 is the Tips entry to look up and A3:B13 is the range that is searched, with column A holding the bacteria names and column B the abundances. If a match is found, the corresponding abundance is returned.
If you get an error like #N/A then there is no match. You can also avoid these errors by using this formula:
=IFNA(VLOOKUP(C3;$A$3:$B$13;2;FALSE);"No match")
Edit: Adjust the ranges to your file!
Edit 2: Keep in mind the separator I use is ; and your Excel might use the comma , separator

First of all, if your Taxon and Tips columns contain exactly the same data, only in a different order, they have no place being together in the same data frame. You should either have two data frames, or come up with some sort of key to define the place of a Taxon item in the phylogenetic tree and then re-sort the data frame as needed, either in alphabetic order or by phylogeny.
As a quick solution, I would first extract the Tips column into a separate data frame and join it with the original data frame by the Tips and Taxon columns, which gives the abundance values in the order of Tips. Then (if you still insist) use cbind to glue the re-sorted abundance column back into the original data frame. Like so, using dplyr (df is a dummy stand-in for your data set):
library(dplyr)
df <- data.frame(Taxon=c("a","b","c","d","e"), Abundance=c(1:5), Tips=c("b","a","d","c","e"))
new_df <- select(df, Tips)
new_df <- left_join(new_df, df, by=c("Tips"= "Taxon"))
df <- cbind(df, New_Abund=new_df$Abundance)
rm(new_df)

Extract data between two pattern occurrences in dataframe column R

I am using R to perform some data manipulation. I want to extract all rows between 2 occurrences of a pattern. I have attached the dataframe image.
I want to extract all rows starting from 'edu-hist-mark' to 'objectives-mark' using "mark" as a pattern. But I am not sure how to achieve that. Appreciate any help.
Thanks.
EDIT:
After some manipulation, here is the data frame:
Data<- data.frame(class_name = c("edu-hist-mark","date","date","educational","qualif","date","date","educational","qualif","role","company","objectives-mark","additional-info-hobby-mark","nominal"),
text_val=c("EDUCATION AND QUALIFICATIONS:",2000,2003,"ILLINOIS INSTITUTE OF TECHNOLOGY","Master of Science,Computer Science",1999,2000,"MAHARASHTRA INSTITUTE OF TECHNOLOGY","Bachelor of Science","Mechanical Engineering","Enterprise Solution Architect","Liaison Technologies","SUMMARY:,PUBLICATIONS:","Abhay Daftari"))

In the code below, I find the indices of the rows where your first column contains the pattern "mark", and then subset the dataset to all rows between the first and the second instance of that pattern. If there are more than two instances of that pattern, you can change the indices to reflect how the data should be subsetted. Hope this helps!
idx <- grep("mark", Data$class_name)  # row indices whose class_name contains "mark"
Data[idx[1]:idx[2], ]
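
If the section boundaries are known by name, an alternative sketch is to look up the start and end labels explicitly instead of relying on the first two "mark" hits. The shortened data frame below is made up for illustration, and the names start, end and section are introduced here:

```r
# Shortened, made-up version of the data frame from the question
Data <- data.frame(class_name = c("nominal", "edu-hist-mark", "date", "qualif",
                                  "objectives-mark", "additional-info-hobby-mark"),
                   text_val = c("Name", "EDUCATION AND QUALIFICATIONS:", "2000",
                                "Master of Science", "SUMMARY:", "HOBBIES:"),
                   stringsAsFactors = FALSE)

# Position of the first row carrying each boundary label
start <- match("edu-hist-mark", Data$class_name)
end <- match("objectives-mark", Data$class_name)

# Stop early if a label is missing or the order is unexpected
stopifnot(!is.na(start), !is.na(end), start <= end)

# All rows from the start label to the end label, inclusive
section <- Data[start:end, ]
section
```

This keeps working when other "mark" labels appear before the section of interest, at the cost of hard-coding the two label names.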
