I have used RUVr() from the RUVSeq R package to correct for batch effects in my data, and now I would like to use the corrected table to calculate TPM.
Can I use normCounts() to extract the batch-corrected data for the TPM calculation?
Yes: normCounts() extracts the normalized, batch-adjusted counts from the SeqExpressionSet returned by RUVr(), so it is the right starting point. Note, however, that TPM (transcripts per million) normalizes for feature length as well as sequencing depth, so you will also need a vector of gene lengths. Here's a sketch of the standard TPM calculation (ruv_set and gene_lengths are placeholders for your own objects):
library(RUVSeq)
# Batch-corrected counts from the SeqExpressionSet returned by RUVr()
bc_data <- normCounts(ruv_set)
# gene_lengths: gene lengths in base pairs, in the same order as the rows of bc_data
rpk <- bc_data / (gene_lengths / 1000)        # reads per kilobase
tpm_values <- t(t(rpk) * 1e6 / colSums(rpk))  # scale each sample to one million
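As a sanity check, colSums(tpm_values) should return one million for every sample.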
I am trying to use the rbga.bin genetic algorithm function (from the genalg package) in R.
I have a dataframe with 40 observations (rows) and 189 metrics (columns). In the evaluation function, I have to run a Principal Component Analysis on both the original dataset and the "chromosome dataset" (i.e., the dataframe with some of the metrics columns - the ones that have 1s in the chromosome) in order to produce the fitness score.
For example, a possible solution (chromosome) is the following:
(1,1,1,0,0,...,0)
The solution dataset that I would want to run a PCA on, would just have only the first 3 columns of the original dataset.
How can I refer to that "reduced" dataset inside the evaluation function?
It seems that the variable passed to the evaluation function is the chromosome, i.e. the binary vector. You can get the reduced dataset as follows.
Assume chromosome is the binary vector, original is the starting dataframe and reduced is the resulting dataframe with only the columns that are 1 in the chromosome.
# convert the 0/1 chromosome to a logical vector, then keep the matching columns
keep <- as.logical(chromosome)
reduced <- original[, keep, drop = FALSE]
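For context, here is a minimal sketch of how this could sit inside an rbga.bin evaluation function; the fitness calculation is purely hypothetical, and original is assumed to be visible in the enclosing environment:
evalFunc <- function(chromosome) {
  # keep only the metrics flagged with 1 in the chromosome
  reduced <- original[, as.logical(chromosome), drop = FALSE]
  pca <- prcomp(reduced, scale. = TRUE)
  # hypothetical fitness: rbga.bin minimises, so negate the proportion
  # of variance explained by the first principal component
  -summary(pca)$importance[2, 1]
}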
I would like to do an analysis in R with Seurat, but for this I need a count matrix with read counts. However, the data I would like to use is provided in TPM, which is not ideal for using as input since I would like to compare with other analyses that used read counts.
Does anyone know a way to convert the TPM data to read counts?
Thanks in advance!
You would need the total counts and the gene (or transcript) lengths to get an approximation of that conversion. See https://support.bioconductor.org/p/91218/ for the reverse operation.
From that link:
You can create a TPM matrix by dividing each column of the counts matrix by some estimate of the gene length (again this is not ideal for the reasons stated above).
x <- counts.mat / gene.length
Then with this matrix x, you do the following:
tpm.mat <- t( t(x) * 1e6 / colSums(x) )
Such that the columns sum to 1 million.
colSums(x) would be the counts per sample aligned to the genes in the TPM matrix, and gene.length would depend on the gene model used for read summarization.
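Going the other way, as the question asks, is therefore only approximate. A sketch, assuming you have tpm.mat, a matching gene.length vector, and a vector lib.sizes of per-sample total counts:
rate <- tpm.mat * gene.length                            # undo the length normalization
counts.approx <- t(t(rate) / colSums(rate) * lib.sizes)  # rescale to the library sizes
counts.approx <- round(counts.approx)                    # counts should be integers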
So you may be out of luck. If the FASTQ files are available, you would probably be better off quantifying them with something like salmon or kallisto anyway, using the same gene or transcript model as the data you want to compare against, to obtain a proper count matrix.
If you have no other option than to use the TPM data (not really recommended), Seurat can work with that as well - see https://github.com/satijalab/seurat/issues/171.
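If you do go that route, a minimal sketch (treating the TPM matrix as if it were a counts matrix, which is the compromise discussed in that issue):
library(Seurat)
seu <- CreateSeuratObject(counts = as.matrix(tpm.mat))
# TPM values are already depth- and length-normalized, so apply the
# usual NormalizeData() step with care, or skip it, per the linked issue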
I am trying to run social capital data through a principal components analysis (PCA) in R, using the following dataset: https://aese.psu.edu/nercrd/community/social-capital-resources/social-capital-variables-for-2014/social-capital-variables-spreadsheet-for-2014/view
I run the analysis just fine, but I need to merge the factor loadings onto the original dataset for further analysis and presentation. I simply need to know how I can retain the id variables when I run a pca analysis so I can merge those onto the original dataset.
I have standardized the data, and then run the code below (which I gleaned from another source). I receive what appears to be factor loadings in one column for each county in the United States, but my problem is that my original database contains the id variables for each county (FIPS codes) but the factor loadings do not.
calcpc <- function(variables, loadings)
{
  # coerce to a data frame
  variables <- as.data.frame(variables)
  # find the number of samples in the data set
  numsamples <- nrow(variables)
  # make a vector to store the component
  pc <- numeric(numsamples)
  # find the number of variables
  numvariables <- length(variables)
  # calculate the value of the component for each sample
  for (i in 1:numsamples)
  {
    valuei <- 0
    for (j in 1:numvariables)
    {
      valueij <- variables[i, j]
      loadingj <- loadings[j]
      valuei <- valuei + (valueij * loadingj)
    }
    pc[i] <- valuei
  }
  return(pc)
}
xxx <- calcpc(standardisedconcentrations, socialcapital.pca$rotation[,1])
Assuming you computed socialcapital.pca as
socialcapital.pca <- prcomp(standardisedconcentrations)
and that standardisedconcentrations is equal to the standardized analysis variables in the same order they appear in the analysis dataset, then you can simply attach the FIPS codes (as another column or as row names) to the output PC vector created by the calcpc() function, since the order of the rows in the Principal Component scores is the same as the order of the rows in the original data.
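For example, a minimal sketch, assuming a vector fips holding the county codes in the same row order as standardisedconcentrations, with original_data standing in for your full dataset:
pc1 <- calcpc(standardisedconcentrations, socialcapital.pca$rotation[,1])
scores <- data.frame(FIPS = fips, PC1 = pc1)           # attach the IDs
merged <- merge(original_data, scores, by = "FIPS")    # merge back for presentation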
Also, note two things:
You can avoid the two loops inside the calcpc() function and speed up the computation by using a single matrix multiplication:
pc <- as.matrix(variables) %*% loadings
assuming you call the calcpc() function as:
calcpc(standardisedconcentrations, socialcapital.pca$rotation[,1,drop=FALSE])
where I have added drop=FALSE to make sure the first column of the rotation attribute is preserved as a matrix with one column.
If you call the princomp() function instead of the prcomp() function to run Principal Component analysis, you get the principal components or the scores directly as part of the output object (in attribute scores).
You just need to be aware of the differences in running PCA using princomp() vs. prcomp(), mainly, quoting the documentation:
princomp: "Note that the default calculation uses divisor N for the covariance matrix."
prcomp: "Unlike princomp, variances are computed with the usual divisor N - 1."
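A quick toy check of that divisor difference (a hypothetical example):
set.seed(1)
m <- matrix(rnorm(20), nrow = 5)     # 5 observations, 4 variables
prcomp(m)$sdev / princomp(m)$sdev    # constant ratio sqrt(n / (n - 1))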
EDIT: As indicated in my comment below, you could also set the rownames attribute of the analysis matrix or data frame to the FIPS variable in your data and the results of the analysis done by princomp() or prcomp() will contain those IDs as row names.
Example, using princomp():
rownames(standardisedconcentrations) <- FIPS
socialcapital.pca <- princomp(standardisedconcentrations)
Then the row names of the principal components matrix socialcapital.pca$scores will contain the FIPS codes.
Or using prcomp():
rownames(standardisedconcentrations) <- FIPS
socialcapital.pca <- prcomp(standardisedconcentrations)
pc1 <- standardisedconcentrations %*% socialcapital.pca$rotation[,1]
Then the row names of pc1 will contain the FIPS codes.
From a multiple-imputation output (e.g., an object of class mids from mice), I want to extract the imputed values for several of the imputed variables into a single dataset that also includes the original data with its missing values.
Here are sample dataset and code:
library("mice")
nhanes
tempData <- mice(nhanes, seed = 23109)
Using the code below I can extract these values for each variable into separate datasets:
age_imputed<-as.data.frame(tempData$imp$age)
bmi_imputed<-as.data.frame(tempData$imp$bmi)
hyp_imputed<-as.data.frame(tempData$imp$hyp)
chl_imputed<-as.data.frame(tempData$imp$chl)
But I want to extract several variables at once, preserving the order of the rows, for further analysis.
I would appreciate any help.
Use the complete() function from the mice package to extract the completed dataset, including the imputations:
complete(tempData, action = 1)
The action argument takes the imputation number, or values such as "all" or "long" if you need every imputation at once; refer to the R documentation.
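For instance, a sketch that stacks all imputations in "long" format while keeping the original data (with its NAs) as imputation 0:
all_imps <- complete(tempData, action = "long", include = TRUE)
head(all_imps)  # .imp = imputation number (0 = original data), .id = original row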
I've got a 1000x1000 matrix consisting of a random distribution of the letters a-z, and I need to plot the data in a rank abundance distribution plot. However, I'm having a lot of trouble because a) it is all in character format, b) it is a matrix and not a vector (though I have changed it to a vector in one attempt to sort it), and c) I have no idea how to summarise the data to get species abundances, let alone rank them.
My code for the matrix is:
##Create Species Vector
species.b <- letters[1:26]
#Matrix creation (random): draw one letter for each of the 1000*1000 cells
neutral.matrix2 <- matrix(sample(species.b, 1000*1000, replace = TRUE),
                          nrow = 1000,
                          ncol = 1000)
##Turn Matrix into Vector
neutral.b <- as.character(neutral.matrix2)
##Loop: replace elements one at a time
lo.op <- 2
neutral.v3 <- neutral.matrix2
neutral.c <- as.character(neutral.v3)
repeat {
  # overwrite one randomly chosen cell with a letter drawn from the current pool
  neutral.v3[sample(length(neutral.v3), 1)] <- as.character(sample(neutral.c, 1))
  neutral.c <- as.character(neutral.v3)
  lo.op <- lo.op + 1
  if (lo.op > 10000) {
    break
  }
}
Which creates a matrix, 1000x1000, then replaces 10,000 elements randomly (I think, I don't know how to check it until I can check the species abundances/rank distribution).
I've run it a couple of times to get neutral.v2, neutral.v3, and neutral.b, neutral.c, so I should theoretically have two matrices/vectors that I can plot and compare - I just have no idea how to do so on a purely character dataset.
I also created a matrix of the two vectors:
abundance.matrix <- matrix(c(neutral.b, neutral.c),
                           nrow = 1000000,
                           ncol = 2)
A later requirement involves sites, and each repeat of my code (neutral.v2 to neutral.v11 eventually) could be considered a separate site; however, this doesn't change the fact that I have no idea how to treat the character dataset in the first place.
I think I need to calculate the abundance of each species in the matrix/vectors, then run it through either radfit (vegan) or some form of the rankabundance/rankabun plot (biodiversityR). However the requirements for those functions:
rankabundance(x, y="", factor="", level, digits=1, t=qt(0.975, df=n-1))
x: Community data frame with sites as rows, species as columns and species abundance as cell values.
y: Environmental data frame.
factor: Variable of the environment
aren't available in the data I have, as for all intents and purposes I just have a "map" of 1,000,000 species locations, and no idea how to analyse it at all.
Any help would be appreciated. I don't feel like I've explained it very well, so sorry about that!
I'm not sure exactly what you want, but this will summarise the data and turn it into a data frame suitable for rankabundance():
# table() tallies each letter across the whole matrix; converting via a list
# yields a one-row data frame: species as columns, abundances as cell values
counts <- as.data.frame(as.list(table(neutral.matrix2)))
BiodiversityR::rankabundance(counts)
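From there, a sketch of the plotting step, using rankabunplot(), the companion plotting function in BiodiversityR:
ra <- BiodiversityR::rankabundance(counts)
BiodiversityR::rankabunplot(ra, scale = "abundance", addit = FALSE)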