I've got a 1000x1000 matrix consisting of a random distribution of the letters a-z, and I need to plot the data as a rank abundance distribution. I'm having a lot of trouble with it because a) it is all in character format, b) it is a matrix rather than a vector (though I have converted it to a vector in one attempt to sort it), and c) I have no idea how to summarise the data to get species abundances, let alone rank them.
My code for the matrix is:
##Create species vector
species.b <- letters[1:26]
#Matrix creation (random): draw one letter per cell so the matrix really is 1000x1000
neutral.matrix2 <- matrix(sample(species.b, 1000 * 1000, replace = TRUE),
                          nrow = 1000,
                          ncol = 1000)
##Turn matrix into vector
neutral.b <- as.character(neutral.matrix2)
##Loop: replace 10,000 randomly chosen cells, drawing each replacement
##letter from the current composition of the community
lo.op <- 1
neutral.v3 <- neutral.matrix2
neutral.c <- as.character(neutral.v3)
repeat {
  neutral.v3[sample(length(neutral.v3), 1)] <- sample(neutral.c, 1)
  neutral.c <- as.character(neutral.v3)
  lo.op <- lo.op + 1
  if (lo.op > 10000) {
    break
  }
}
This creates the 1000x1000 matrix and then replaces 10,000 elements at random (I think; I don't know how to check it until I can look at the species abundances/rank distribution).
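As a quick sanity check on the loop (using the objects defined above), something like the line below should count how many cells ended up different from the original matrix; it can come out a little under 10,000 because a cell may be redrawn as the same letter or hit more than once.
sum(neutral.v3 != neutral.matrix2)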
I've run it a couple of times to get neutral.v2, neutral.v3, and neutral.b, neutral.c, so I should theoretically have two matrices/vectors that I can plot and compare - I just have no idea how to do so on a purely character dataset.
I also created a matrix of the two vectors:
abundance.matrix <- matrix(c(neutral.b, neutral.c),
                           nrow = 1000000,
                           ncol = 2)
A later requirement is for sites, and each repeat of my code (neutral.v2 to neutral.v11 eventually) could be considered a separate site for this; however, this doesn't change the fact that I have no idea how to treat the character data set in the first place.
I think I need to calculate the abundance of each species in the matrix/vectors, then run it through either radfit (vegan) or some form of rankabundance/rankabunplot (BiodiversityR). However, the requirements for those functions:
rankabundance(x, y="", factor="", level, digits=1, t=qt(0.975, df=n-1))
x       Community data frame with sites as rows, species as columns and species abundance as cell values.
y       Environmental data frame.
factor  Variable of the environment.
aren't available in the data I have, as for all intents and purposes I just have a "map" of 1,000,000 species locations, and no idea how to analyse it at all.
Any help would be appreciated: I don't feel like I've explained it very well though, so sorry about that!
I'm not sure exactly what you want, but this will summarise the data and make it into a data frame for rankabundance:
counts <- as.data.frame(as.list(table(neutral.matrix2)))  # a single-row data frame: one column per letter, cell values = abundance
BiodiversityR::rankabundance(counts)
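If you just want a quick visual check before handing things to BiodiversityR, a minimal base-R sketch of a rank abundance plot (no extra packages assumed) could look like this:
abund <- sort(table(neutral.matrix2), decreasing = TRUE)   # abundance of each letter, highest first
plot(seq_along(abund), as.numeric(abund), type = "b", log = "y",
     xlab = "Rank", ylab = "Abundance (log scale)")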
I would like to do an analysis in R with Seurat, but for this I need a count matrix with read counts. However, the data I would like to use is provided in TPM, which is not ideal for using as input since I would like to compare with other analyses that used read counts.
Does anyone know a way to convert the TPM data to read counts?
Thanks in advance!
You would need total counts and gene (or transcript) lengths to do an approximation of that conversion. See https://support.bioconductor.org/p/91218/ for the reverse operation.
From that link:
You can create a TPM matrix by dividing each column of the counts matrix by some estimate of the gene length (again this is not ideal for the reasons stated above).
x <- counts.mat / gene.length
Then with this matrix x, you do the following:
tpm.mat <- t( t(x) * 1e6 / colSums(x) )
Such that the columns sum to 1 million.
colSums(x) would be the counts per sample aligned to the genes in the TPM matrix, and gene.length would depend on the gene model used for read summarization.
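Going from TPM back to counts is only possible up to a per-sample scale factor. Purely as a sketch, and assuming you could somehow obtain the gene lengths (gene.length) and each sample's total assigned read count (lib.size), neither of which is contained in the TPM matrix itself, the inversion would look roughly like:
scaled <- tpm.mat * gene.length                               # undo the length normalisation, up to a per-sample constant
counts.approx <- t( t(scaled) / colSums(scaled) * lib.size )  # rescale each column so it sums to that sample's library size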
So you may be out of luck; if the fastq files are available, you would probably be better off quantifying them with something like salmon or kallisto anyway, using the same gene or transcript model as the data you want to compare to, in order to get a proper count matrix.
If you have no other option than to use the TPM data (not really recommended), Seurat can work with that as well - see https://github.com/satijalab/seurat/issues/171.
I am a new R programmer and am trying to create a loop through a large amount of columns to weigh data by a certain metric.
I have a large data set of variables (some factors, some numerics). I want to loop through my columns, determine which one is a factor, and then if it is a factor I would like to use some tapply functions to do some weighting and return a mean. I have established a function that can do this one at a time here:
weight.by.mean <- function(metric, by, x, funct = sum){  # pass the function itself (sum), not sum()
  if (is.factor(x)) {
    a <- tapply(metric, x, funct)  # total metric per level of x
    b <- tapply(by, x, funct)      # total weight per level of x
    return(a / b)                  # weighted mean per level
  }
}
I am passing in the metric that I want to weigh and the by argument is what
I am weighting the metric BY. x is simply a factor variable that I would
like to group by.
Example: I have 5 donut types (my argument x) and I would like to see the mean dough (my argument metric) used by donut type but I need to weigh the dough used by the amount (argument by) of dough used for that donut type.
In other words, I am trying to avoid skewing my means by not weighting different donut types more than others (maybe I use a lot of normal dough for glazed donuts but don't use as much special dough for cream-filled donuts). I hope this makes sense!
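For what it's worth, a toy call to the function above might look like this (all data values are invented for illustration):
donuts <- data.frame(type   = factor(c("glazed", "glazed", "cream", "cream", "plain")),
                     dough  = c(3, 4, 2, 1, 5),    # metric: dough used in each batch
                     amount = c(10, 12, 4, 3, 8))  # by: amount to weight the dough by
weight.by.mean(metric = donuts$dough, by = donuts$amount, x = donuts$type)
# returns, for each donut type, total dough divided by total amount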
This is the function I am working on to loop through a large data set with many possible different factor variables such as "donut type" in my prior example. It is not yet functional because I am not sure what else to add. Thank you for any assistance you can provide for me. I have been using R for less than a month so please keep that in mind.
My end goal is to output a matrix or data frame of all these different means but each factor may have anywhere from 5 to 50 different levels so the row size is dependent on the number of levels of each factor.
weight.matrix <- function(df, metric, by, funct = sum){
  n <- ncol(df)                              ## number of columns to iterate through
  ColNames <- names(df)
  a <- list()
  b <- list()
  OutputMatrix <- matrix(nrow = 0, ncol = 3) ## placeholder; I don't know the final shape yet
  for (i in 1:n) {
    if (is.factor(df[[ColNames[i]]])) {      ## test the column itself, not a pasted string
      a[[i]] <- tapply(metric, df[, i], funct)
      b[[i]] <- tapply(by, df[, i], funct)
      ## this overwrites OutputMatrix each time; I am not sure how to collect
      ## the results when each factor has a different number of levels
      OutputMatrix <- a[[i]] / b[[i]]
    }
  }
}
If each of your factors has different levels, then it would make more sense to use a long data frame instead of a wide one. For example:
Metric     Value      Mean
DonutType  Glazed     3.0
DonutType  Chocolate  5.2
DonutSize  Small      1.2
DonutSize  Medium     2.3
DonutSize  Large      3.6
Data frames are not meant for vectors of different lengths. If you want to store your data in a data frame, you need to organize it so all the vector lengths are the same. gather() and spread() are functions from the tidyr package (part of the tidyverse) that you can use to convert between long and wide data frames.
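As a rough sketch of how that long summary could be built, assuming a data frame df with numeric columns metric and by plus several factor columns (the names here are stand-ins, not taken from your post):
factor.cols <- names(df)[sapply(df, is.factor)]            # which columns are factors
long.means <- do.call(rbind, lapply(factor.cols, function(col) {
  a <- tapply(df$metric, df[[col]], sum)                   # total metric per level
  b <- tapply(df$by, df[[col]], sum)                       # total weight per level
  data.frame(Metric = col, Value = names(a), Mean = as.numeric(a / b))
}))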
One of the projects I'm working on (in R) involves storing n different confidence intervals from n samples, and each confidence interval is represented as a numeric vector of size 2 (so, for instance, if an interval is c(1, 2), the left end of the interval is 1, and right is 2).
I need a way to store n of these vectors. I've tried using a data frame, but I can't seem to get it to work. Which data structure should I use to store/keep track of all these vectors? I don't think there's such a thing as a "vector of vectors"? I'm fairly new to R, and not quite familiar with all the data structures. Thanks!
There are a couple of ways.
You could store them as
A data frame with one column as the first value and another column as the second value.
Elements of a list.
An n x 2 matrix.
What it comes down to is how the data will be used.
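For instance, a minimal sketch of the n x 2 matrix option (interval values invented here):
ci.mat <- rbind(c(1.0, 2.0),    # each row is one confidence interval:
                c(0.5, 1.5),    # column 1 = lower bound, column 2 = upper bound
                c(2.1, 3.7))
colnames(ci.mat) <- c("lower", "upper")
ci.mat[2, ]          # the second interval
ci.mat[, "lower"]    # all lower bounds at once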
I think a list of vectors would be best here, try this sample code:
x <- c(0,1)
y <- c(0.25,0.75)
z <- c(0.1,0.9)
li <- list(x,y,z)
From here, you can access an individual confidence interval by double square bracket indexing with the index of the interval you want, e.g. li[[2]] returns the second interval.
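If you later need them all in one object, say for plotting, the list can be stacked into an n x 2 matrix:
ci.mat <- do.call(rbind, li)   # one row per confidence interval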
I wanted to generate correlation matrices built from the correlations of pairs of rows, and I used the corrgram function to generate them. In my first attempt, the function generated a correlation matrix whose diagonal was filled with ranks.
corrgram(t(datasetA),order="GW")
a sample of the output
However, when I use it on my second dataset, the diagonal of the correlation matrix is somehow filled with varxxx strings instead of the rank of the correlation.
corrgram(t(datasetB),order="GW")
The datasets contain nearly the same type of values (integers) and they are both data frames. How can I solve this?
Edit:
Here is a minimal set of commands that generates a correlation matrix with varxxx's on the diagonal:
erase <- matrix(c(1, 5, 2, 6, 8, 4, 1, 5, 6), nrow = 3)
corrgram(t(erase), order = "HC")
output:
Because it is a huge dataset and contains sensitive data, I cannot share the dataset and show the series of operations by which I ended up with the first output above.
Renaming the columns with numbers fixed the issue:
names(dataSetB) <- 1:ncol(dataSetB)  # i.e. replace the column names with 1 to the total number of columns
Here's my hypothetical data frame;
location<- as.factor(rep(c("town1","town2","town3","town4","town5"),100))
visited<- as.factor(rbinom(500,1,.4)) #'Yes or No' variable
variable<- rnorm(500,10,2)
id<- 1:500
DF<- data.frame(id,location,visited,variable)
I want to create a new data frame where the number of 0's and 1's are equal for each location. I want to accomplish this by taking a random sample of the 0's for each location (since there are more 0's than 1's).
I found this solution to sample by group;
library(plyr)
ddply(DF[DF$visited=="0",],.(location),function(x) x[sample(nrow(x),size=5),])
I entered '5' for the size argument so the code would run, but I can't figure out how to set the 'size' argument equal to the number of observations where DF$visited==1.
I suspect the answer could be in other questions I've reviewed, but they've been a bit too advanced for me to implement.
Thanks for any help.
The key to using ddply well is to understand that it will:
break the original data frame down by groups into smaller data frames,
then, for each group, it will call the function you give it, whose job it is to transform that data frame into a new data frame,
and finally, it will stitch all the little transformed data frames back together.
With that in mind, here's an approach that (I think) solves your problem.
sampleFunction <- function(df) {
  # Determine whether visited==1 or visited==0 is less common for this location,
  # and use that count as our sample size.
  n <- min(nrow(df[df$visited == "1", ]), nrow(df[df$visited == "0", ]))
  # Sample n rows from each of the two groups (visited==0 and visited==1).
  ddply(df, .(visited), function(x) x[sample(nrow(x), size = n), ])
}
newDF <- ddply(DF,.(location),sampleFunction)
# Just a quick check to make sure we have the equal counts we were looking for.
ddply(newDF, .(location, visited), summarise, N=length(variable))
How it works
The main ddply simply breaks DF down by location and applies sampleFunction, which does the heavy lifting.
sampleFunction takes one of the smaller data frames (in your case, one for each location), and samples from it an equal number of visited==1 and visited==0 rows. How does it do this? With a second call to ddply: this time, using visited to break it down, so we can sample separately from the 1's and the 0's.
Notice, too, that we're calculating the sample size for each location based on whichever sub-group (0 or 1) has fewer occurrences, so this solution will work even if there aren't always more 0's than 1's.