I would like to do an analysis in R with Seurat, but for this I need a count matrix with read counts. However, the data I would like to use is provided in TPM, which is not ideal for using as input since I would like to compare with other analyses that used read counts.
Does anyone know a way to convert the TPM data to read counts?
Thanks in advance!
You would need total counts and gene (or transcript) lengths to get an approximation of that conversion. See https://support.bioconductor.org/p/91218/ for the reverse operation.
From that link:
You can create a TPM matrix by dividing each column of the counts matrix by some estimate of the gene length (again this is not ideal for the reasons stated above).
x <- counts.mat / gene.length
Then with this matrix x, you do the following:
tpm.mat <- t( t(x) * 1e6 / colSums(x) )
Such that the columns sum to 1 million.
colSums(x) would be the counts per sample aligned to the genes in the TPM matrix, and gene.length would depend on the gene model used for read summarization.
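Inverting those two lines gives a rough sketch of the conversion you asked about, assuming you can obtain the gene lengths and the per-sample totals (gene.length and lib.size are my placeholder names):
# within each sample, counts are proportional to TPM * gene length
raw <- tpm.mat * gene.length
# rescale each column so it sums to that sample's total read count
counts.approx <- t( t(raw) * lib.size / colSums(raw) )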
So you may be out of luck, and would probably be better off using something like salmon or kallisto anyway to get the count matrix from the fastq files, if those are available, using the same gene or transcript model as the data you want to compare against.
If you have no other option than to use the TPM data (not really recommended), Seurat can work with that as well - see https://github.com/satijalab/seurat/issues/171.
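For completeness, a minimal sketch of loading such a matrix into Seurat (tpm.mat is a placeholder for your genes x cells TPM matrix; see the linked issue for the caveats about feeding already-normalized values to downstream steps):
library(Seurat)
# CreateSeuratObject accepts any non-negative matrix here, TPM included
seu <- CreateSeuratObject(counts = as.matrix(tpm.mat))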
Related
I have used RUVr() from the RUVSeq R package to correct for batch effects in my data, and now I would like to use the corrected table to calculate TPM.
Can I use normCounts() to extract the batch-corrected data for TPM?
normCounts() from the RUVSeq package does extract the normalized (batch-adjusted) counts from an RUVr result, so yes, you can start from that. Be aware, though, that TPM requires per-gene lengths; the normalization factors returned by edgeR's calcNormFactors() are for TMM/CPM-style scaling and are not what TPM is built from. Here's an illustration (gene lengths assumed available):
library(RUVSeq)
# Batch-corrected counts extracted from your RUVr result
# (seqRUVr is a placeholder name for that object)
bc_data <- normCounts(seqRUVr)
# TPM needs per-gene lengths; gene.length (in bp) is assumed available
rpk <- bc_data / (gene.length / 1000)           # reads per kilobase
tpm_values <- t( t(rpk) * 1e6 / colSums(rpk) )  # scale each sample to 1 million
I decided to use corr.test to calculate the correlation between genes.
I know the input object must be a matrix or data frame, but I don't think that's the part I need to worry about.
What confuses me is which kind of normalized gene expression matrix is suitable for corr.test. I used to think the FPKM values, transposed with t(), were suitable, but I'm not so sure since somebody mentioned vst-transformed counts.
Can somebody give me some advice? I currently have the FPKM values and corr.test. Do I need to change my normalization method?
FPKM is already normalized data, so you can apply a log transform and then transpose the matrix (so genes become columns) for the statistical analysis. If you want to work from count data instead, the VST (variance-stabilizing transformation) from the DESeq2 package in R is a good option.
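A minimal sketch of both routes, assuming fpkm.mat is a genes x samples FPKM matrix and counts.mat / sample.info are your raw counts and sample table (all placeholder names); note that corr.test gets expensive quickly, so subset to the genes of interest first:
library(psych)
log.fpkm <- log2(fpkm.mat + 1)  # log transform tames the right skew of FPKM
res <- corr.test(t(log.fpkm))   # corr.test wants variables (genes) in columns
res$r                           # gene-gene correlation matrix
res$p                           # corresponding p-values
# The VST route, starting from raw counts:
# library(DESeq2)
# dds <- DESeqDataSetFromMatrix(countData = counts.mat,
#                               colData = sample.info, design = ~ 1)
# res <- corr.test(t(assay(vst(dds))))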
I am a new R programmer and am trying to create a loop through a large amount of columns to weigh data by a certain metric.
I have a large data set of variables (some factors, some numerics). I want to loop through my columns, determine which one is a factor, and then if it is a factor I would like to use some tapply functions to do some weighting and return a mean. I have established a function that can do this one at a time here:
weight.by.mean <- function(metric, by, x, funct = sum){  # pass sum itself, not sum()
  if(is.factor(x)){
    a <- tapply(metric, x, funct)  # total metric within each level of x
    b <- tapply(by, x, funct)      # total weight within each level of x
    return(a / b)                  # weighted mean per level
  }
}
I am passing in the metric that I want to weigh and the by argument is what
I am weighting the metric BY. x is simply a factor variable that I would
like to group by.
Example: I have 5 donut types (my argument x) and I would like to see the mean dough (my argument metric) used by donut type but I need to weigh the dough used by the amount (argument by) of dough used for that donut type.
In other words, I am trying to avoid skewing my means by weighting some donut types more than others (maybe I use a lot of normal dough for glazed donuts but don't use as much special dough for cream-filled donuts). I hope this makes sense!
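To make that concrete, a toy call of the function above (data invented purely for illustration):
donuts <- data.frame(
  type   = factor(c("glazed", "glazed", "cream", "cream", "plain")),
  dough  = c(30, 50, 20, 25, 40),  # metric: dough used per batch
  amount = c(3, 5, 2, 2, 4)        # by: number of donuts per batch
)
weight.by.mean(donuts$dough, donuts$amount, donuts$type)
#  cream glazed  plain
#  11.25  10.00  10.00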
This is the function I am working on to loop through a large data set with many possible different factor variables such as "donut type" in my prior example. I am not sure it is fully right or what else to add. Thank you for any assistance you can provide. I have been using R for less than a month so please keep that in mind.
My end goal is to output a matrix or data frame of all these different means but each factor may have anywhere from 5 to 50 different levels so the row size is dependent on the number of levels of each factor.
weight.matrix <- function(df, metric, by, funct = sum){  # again: funct = sum, not sum()
  n <- ncol(df)              ##Number of columns to iterate through
  a <- list(); b <- list()   ##Holders for the per-factor summaries
  OutputList <- list()
  for(i in 1:n){
    if(is.factor(df[[i]])){  ##Test the column itself, not the string "df$..."
      a[[i]] <- tapply(metric, df[,i], funct)
      b[[i]] <- tapply(by, df[,i], funct)
      OutputList[[names(df)[i]]] <- a[[i]] / b[[i]]
    }
  }
  OutputList  ##Factors can have different numbers of levels, so return a list
}
If each of your factors has different levels, then it would make more sense to use a long data frame instead of a wide one. For example:
Metric     Value      Mean
DonutType  Glazed     3.0
DonutType  Chocolate  5.2
DonutSize  Small      1.2
DonutSize  Medium     2.3
DonutSize  Large      3.6
Data frames are not meant for vectors of different lengths. If you want to store your data in a data frame, you need to organize it so all the vector lengths are the same. gather() and spread() are functions from the tidyr package (part of the tidyverse) that you can use to convert between long and wide data frames.
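A sketch of how that long format could be built from the weighted means above (df is your data set; dough and amount stand in for your metric and weight columns):
fac.cols <- names(df)[sapply(df, is.factor)]  # every factor column in df
long <- do.call(rbind, lapply(fac.cols, function(f){
  m <- tapply(df$dough, df[[f]], sum) / tapply(df$amount, df[[f]], sum)
  data.frame(Metric = f, Value = names(m), Mean = as.numeric(m))
}))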
I've got a 1000x1000 matrix consisting of a random distribution of the letters a - z, and I need to plot the data in a rank abundance distribution plot. I'm having a lot of trouble with it because a) it is all in character format, b) it is a matrix and not a vector (though I have changed it to a vector in one attempt to sort it), and c) I have no idea how to summarise the data to get species abundances, let alone rank them.
My code for the matrix is:
##Create Species Vector
species.b<-letters[1:26]
#Matrix creation (Random)
neutral.matrix2 <- matrix(sample(species.b, 1000*1000, replace=TRUE),  # one draw per cell (10,000 draws would be silently recycled)
                          nrow=1000,
                          ncol=1000)
##Turn Matrix into Vector
neutral.b<-as.character(neutral.matrix2)
##Loop
lo.op <- 2
neutral.v3 <- neutral.matrix2
neutral.c<-as.character(neutral.v3)
repeat {
  # overwrite one random cell with a letter drawn from the current composition
  neutral.v3[sample(length(neutral.v3), 1)] <- as.character(sample(neutral.c, 1))
  neutral.c <- as.character(neutral.v3)
  lo.op <- lo.op + 1
  if(lo.op > 10000) {  # counter starts at 2, so this performs 9,999 replacements
    break
  }
}
Which creates a matrix, 1000x1000, then replaces elements at random 9,999 times (the counter starts at 2; I can't check it until I can compute the species abundances/rank distribution).
I've run it a couple of times to get neutral.v2, neutral.v3, and neutral.b, neutral.c, so I should theoretically have two matrices/vectors that I can plot and compare - I just have no idea how to do so on a purely character dataset.
I also created a matrix of the two vectors:
abundance.matrix <- matrix(c(neutral.b, neutral.c),  # the two vectors created above
                           nrow=1000000,
                           ncol=2)
A later requirement involves sites, and each repeat of my code (neutral.v2 to neutral.v11 eventually) could be considered a separate site for this; however, that doesn't change the fact that I have no idea how to treat the character data set in the first place.
I think I need to calculate the abundance of each species in the matrix/vectors, then run it through either radfit (vegan) or some form of the rankabundance/rankabun plot (BiodiversityR). However, the requirements for those functions:
rankabundance(x, y="", factor="", level, digits=1, t=qt(0.975, df=n-1))

x       Community data frame with sites as rows, species as columns,
        and species abundance as cell values.
y       Environmental data frame.
factor  Variable of the environment
aren't available in the data I have, as for all intents and purposes I just have a "map" of 1,000,000 species locations, and no idea how to analyse it at all.
Any help would be appreciated: I don't feel like I've explained it very well though, so sorry about that!
I'm not sure exactly what you want, but this will summarise the data into the one-row community data frame that rankabundance expects (sites as rows, species as columns):
counts <- as.data.frame(as.list(table(neutral.matrix2)))
BiodiversityR::rankabundance(counts)
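If you just want the plot and would rather avoid the extra package, a base-R sketch of the same idea:
ab <- sort(table(as.vector(neutral.matrix2)), decreasing = TRUE)  # abundance per letter
plot(seq_along(ab), as.numeric(ab), log = "y", type = "b",
     xlab = "Species rank", ylab = "Abundance")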
Dear friends, I would appreciate it if someone could help me with a question in R.
I have a data frame with 8 variables, let's say (v1, v2, ..., v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I can produce 2^8 - 1 = 255 subsets of variables, like {v1}, {v2}, ..., {v8}, {v1,v2}, ..., {v1,v2,v3}, ..., {v1,v2,...,v8}.
My goal is to produce a specific statistic for each of these groupings and then compare which subset produces the better statistic. My problem is how to produce these combinations.
Thanks in advance!
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time.
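To enumerate every non-empty subset in one go (all 255 of them), you can loop m over 1:8; a sketch, assuming the columns are numeric wherever the statistic needs them to be:
all.subsets <- unlist(lapply(seq_along(varnames),
                             function(m) combn(varnames, m, simplify = FALSE)),
                      recursive = FALSE)
length(all.subsets)  # 255
# e.g. one statistic per subset (mean over all included columns here):
stats <- sapply(all.subsets, function(v) mean(unlist(yourdataframe[v])))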
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn <- 8L
dt <- setnames(as.data.table(cbind(1:100, matrix(rnorm(100*nn), ncol=nn))),
               c("id", paste0("V", 1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x <- cbind(rep(c(0,1), each=128),
           rep(rep(c(0,1), each=64), 2),
           rep(rep(c(0,1), each=32), 4),
           rep(rep(c(0,1), each=16), 8),
           rep(rep(c(0,1), each=8), 16),
           rep(rep(c(0,1), each=4), 32),
           rep(rep(c(0,1), each=2), 64),
           rep(c(0,1), 128)) *
     matrix(1:nn, nrow=2^nn, ncol=nn, byrow=TRUE)
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first element, which is the empty subset (it unlists to NULL)
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
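As for the "smarter way" flagged in the comment above, one option (my suggestion, not from the original post) is to let expand.grid generate the 0/1 inclusion patterns directly:
# all 2^nn on/off patterns, then turn the 1s into column indices
x2 <- as.matrix(expand.grid(rep(list(0:1), nn))) *
      matrix(1:nn, nrow=2^nn, ncol=nn, byrow=TRUE)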