Function meaning: how is it deleting zero-expression genes?

I'm working with an expression matrix obtained by single-cell RNA sequencing, but I have a question about some R code a colleague sent me...
sort(unique(1 + slot(as(data_matrix, "dgTMatrix"), "i")))
# there are no more details in the code...
In theory, this code is supposed to delete non-expressed genes (genes that are zero in all samples, I think...), but I can't make sense of it. Can anyone give me a tip?

Well, I think I have understood this code... let me try to explain it! (Please correct me if I'm wrong.)
Our data is stored as a sparse matrix (i.e. more memory-friendly, link), and with as it's coerced to a dedicated representation of this kind of matrix (the triplet format for sparse matrices, link): three slots, with i and j holding the row/column indices of the non-zero values and x holding the values themselves.
y <- matrix_counts # sparse matrix
AAACCTGAGAACAACT-1 AAACCTGTCGGAAATA-1 AAACGGGAGAGCTGCA-1
ENSG00000243485 1 . .
ENSG00000237613 . . 2
y2 <- as(y, "dgTMatrix") #triplet format for sparse matrix
i j x
1 9 1 1 # in row 9 and column 1 we have the value 1
2 50 1 2
3 60 1 1
4 62 1 2
5 78 1 1
6 87 1 1
Then it takes only the slot "i" (slot(data, "i")), because we only need the row indices (to know which rows are non-zero), and removes duplicates (unique), finally obtaining a vector of row indices that will be used to filter the raw data:
y3 <- unique(1 + slot(y2, "i"))
[1] 9 50 60 62 78 87
data <- data_raw[y3,]
The sort and the 1 + puzzled me at first, but they make sense: the i slot of a dgTMatrix is 0-based, so adding 1 converts it to R's 1-based row indexing, and sort just returns the indices in increasing order. So, to summarize, we take the row indices of the non-zero rows (genes) and use them to filter our raw data... another original method for deleting non-expressed genes, interesting!
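To convince myself, here is a minimal self-contained sketch with a toy matrix (not the real data); the last line is an alternative spelling of the same filter using rowSums:
library(Matrix)
m <- Matrix(c(1, 0, 0, 0,
              0, 0, 2, 0,
              0, 0, 0, 0), nrow = 3, byrow = TRUE, sparse = TRUE)
keep <- sort(unique(1 + slot(as(m, "dgTMatrix"), "i")))  # i slot is 0-based
keep
# [1] 1 2
m[keep, ]  # rows 1 and 2 survive, the all-zero row 3 is dropped
# the same filter, stated directly:
m[rowSums(m != 0) > 0, ]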

Related

Apply a function that requires seq() in R

I am trying to run a summation on each row of a dataframe. Let's say I want to take the sum of 100n^2, from n=1 to n=4.
> df <- data.frame(n = seq(1:4),a = rep(100))
> df
n a
1 1 100
2 2 100
3 3 100
4 4 100
Simpler example:
Let's make fun our example summation function. I can pull the 100 out because I can just multiply it back in later.
fun <- function(x) {
  i <- seq(1, x, 1)  # 1, 2, ..., x
  sum(i^2)           # sum of squares up to x
}
I want to then apply this function to each row of the dataframe, where df$n provides the upper bound of the summation.
The desired outcome would be as follows, in df$b:
> df
n a b
1 1 100 1
2 2 100 5
3 3 100 14
4 4 100 30
To achieve these results I've tried the apply function
apply(df$n, 1, fun)
and also with df converted into a matrix
mat <- as.matrix(df)
apply(mat, 1, fun)
Both return an error:
Error in seq.default(1, x, 1) : 'to' must be of length 1
I understand this error, in that I understand why seq requires a 'to' value of length 1; I just don't know how to go forward.
Maybe less simple example:
In my case I only need to multiply the results above, df$b, by 100 (or df$a) to get my final answer for each row. In other cases, though, the second variable might be more deeply involved, for example a^i. How would I use both variables, a and n?
Underlying question:
My underlying goal is to apply a summation to each row of a dataframe (or a matrix). The above questions stem from my attempt to do so using seq(), as I saw advised in an answer on this site. I will gladly accept an answer that obviates the above questions with a different way to run a summation.
seq doesn't take vectors for from and to, so we loop over the elements and apply the function to each one:
df$b <- sapply(df$n, fun)
df$b
#[1] 1 5 14 30
Or we can Vectorize
Vectorize(fun)(df$n)
#[1] 1 5 14 30
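For the "maybe less simple" case where both columns are needed, one option is mapply, which loops over several vectors in parallel. A sketch along the same lines, reusing the df and fun logic above (b100 is just an illustrative column name):
df$b100 <- mapply(function(n, a) a * sum(seq_len(n)^2), df$n, df$a)
df$b100
#[1] 100 500 1400 3000
# and for the a^i variant mentioned in the question:
mapply(function(n, a) sum(a^seq_len(n)), df$n, df$a)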

covariance matrix from a community list with grouping factors

I am still learning to use data.table (from the data.table package) and, even after looking for help on the web and in the help files, I am struggling to do what I want.
I have a large data table with over 60 columns (the first three corresponding to factors and the remaining to response variables, in this case different species) and several rows corresponding to the different levels of the treatments and the species abundances. A very small version looks like this:
> TEST<-data.table(Time=c("0","0","0","7","7","7","12"),
Zone=c("1","1","0","1","0","0","1"),
quadrat=c(1,2,3,1,2,3,1),
Sp1=c(0,4,29,9,1,2,10),
Sp2=c(20,17,11,15,32,15,10),
Sp3=c(1,0,1,1,1,1,0))
> setkey(TEST,Time)
> TEST
Time Zone quadrat Sp1 Sp2 Sp3
1: 0 1 1 0 20 1
2: 0 1 2 4 17 0
3: 0 0 3 29 11 1
4: 12 1 1 10 10 0
5: 7 1 1 9 15 1
6: 7 0 2 1 32 1
7: 7 0 3 2 15 1
I need to calculate the sum of the covariances for each Zone x quadrat group. If I only had the species list for a given Zone x quadrat combination, I could use the cov() function directly, but using cov() in the same way I would use mean() or sum() in
Abundance = TEST[,lapply(.SD,mean),by="Zone,quadrat"]
does not work as I get the following error message:
Error in cov(value) : supply both 'x' and 'y' or a matrix-like 'x'
I understand why but I cannot figure out how to solve this.
What I exactly want is to be able to get, for each Zone x quadrat combination, the covariance matrix of all the species across all the sampling Time points. From each matrix, I then need to calculate the sum of the covariances of all pairs of species, so that then I can have a sum of covariance for each Zone x quadrat combination.
Any help would be greatly appreciated. Thanks.
From the help provided above by @Frank and some additional searching that I did around the use of the upper.tri function, the following code works:
Cov <- TEST[, sum(cov(.SD)[upper.tri(cov(.SD), diag = FALSE)]), by = 'Zone,quadrat', .SDcols = paste('Sp', 1:3, sep = '')]
In the initial version proposed, where the upper.tri() call did not appear inside [ ], the code only extracted logical values from the covariance matrix; having diag = FALSE excludes the diagonal values before summing the upper triangle of the matrix. In my case I didn't care whether it was the upper or lower triangle, but I'm sure that using lower.tri() would work equally well.
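As a side note, the same call can avoid computing cov(.SD) twice per group by storing the matrix first; a sketch against the same TEST table as above:
Cov <- TEST[, {V <- cov(.SD); sum(V[upper.tri(V)])},
            by = 'Zone,quadrat', .SDcols = paste0('Sp', 1:3)]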
I hope this helps other users who might encounter a similar issue.

Count the number of instances where a variable or a combination of variables are TRUE

I'm an enthusiastic R newbie that needs some help! :)
I have a data frame that looks like this:
id<-c(100,200,300,400)
a<-c(1,1,0,1)
b<-c(1,0,1,0)
c<-c(0,0,1,1)
y=data.frame(id=id,a=a,b=b,c=c)
Where id is a unique identifier (e.g. a person) and a, b and c are dummy variables for whether the person has this feature or not (as always, 1 = TRUE).
I want R to create a matrix or data frame with the variables a, b and c as both the column names and the row names. For the values of the matrix, R will have to calculate the number of identifiers that have each feature, or combination of features.
So for example, IDs 100, 200 and 400 have feature a, so in the diagonal of the matrix where a and a cross, R will input 3. Only ID 100 has both features a and b, hence R will input 1 where a and b cross, and so forth.
The resulting data frame will have to look like this:
l<-c("","a","b","c")
m<-c("a",3,1,1)
n<-c("b",1,2,1)
o<-c("c",1,1,2)
result<-matrix(c(l,m,n,o),nrow=4,ncol=4)
As my data set has 10 variables and hundreds of observations, I will have to automate the whole process.
Your help will be greatly appreciated.
Thanks a lot!
With base R (for a 0/1 matrix Y, crossprod(Y) = t(Y) %*% Y, and its [i, j] entry counts the rows where columns i and j are both 1, which is exactly the co-occurrence count you want):
crossprod(as.matrix(y[,-1]))
# a b c
# a 3 1 1
# b 1 2 1
# c 1 1 2
This is called an adjacency matrix. You can do this pretty easily with the qdap package:
library(qdap)
adjmat(y[,-1])$adjacency
## a b c
## a 3 1 1
## b 1 2 1
## c 1 1 2
It throws a warning because you're feeding it a dataframe. Not a big deal, and it can be ignored. Also notice I dropped the first column (the IDs) with negative indexing, y[, -1].
Note that because you started out with a Boolean matrix you could have gotten there with:
Y <- as.matrix(y[,-1])
t(Y) %*% Y

Adding values to a matrix using index vectors that include row and column names

Suppose I have a really big matrix of sparse data, but I'm only interested in looking at a sample of it, making it even more sparse. Suppose I also have a dataframe of triples, with columns for the row/column/value of the data (imported from a csv file). I know I can use the sparseMatrix() function of library(Matrix) to create a sparse matrix using
sparseMatrix(i=df$row,j=df$column,x=df$value)
However, because of my values I end up with a sparse matrix that's millions of rows by tens of thousands of columns (most of which are empty because my subset is excluding most of the rows and columns). All of those zero rows and columns end up skewing some of my functions (take clustering for example -- I end up with one cluster that includes the origin when the origin isn't even a valid point).
I'd like to perform the same operation, but using i and j as rownames and colnames. I've tried creating a dense matrix, sampling down to the max size, and adding values using
denseMatrix <- matrix(0, nrows, ncols, dimnames=list(df$row, df$column))
denseMatrix[as.character(df$row), as.character(df$column)] = df$value
(actually I've been setting it equal to 1 because I'm not interested in the value in this case), but I've found it fills in the entire matrix, because it takes the cross of all the rows and columns rather than just row1/col1, row2/col2...
Does anybody know a way to accomplish what I'm trying to do? Alternatively, I'd be fine with filling in a sparse matrix and simply having it somehow discard all of the zero rows and columns to compact itself into a denser form (though I'd like to maintain some reference back to the original row and column numbers).
I appreciate any suggestions!
Here's an example:
> rows<-c(3,1,3,5)
> cols<-c(2,4,6,6)
> mtx<-sparseMatrix(i=rows,j=cols,x=1)
> mtx
5 x 6 sparse Matrix of class "dgCMatrix"
[1,] . . . 1 . .
[2,] . . . . . .
[3,] . 1 . . . 1
[4,] . . . . . .
[5,] . . . . . 1
I'd like to get rid of columns 1, 3 and 5 as well as rows 2 and 4. This is a pretty trivial example, but imagine if instead of having row numbers 1, 3 and 5 they were 1000, 3000 and 5000; then there would be many more empty rows between them. Here's what happens when I use a dense matrix with named rows/columns:
> dmtx<-matrix(0,3,3,dimnames=list(c(1,3,5),c(2,4,6)))
> dmtx
2 4 6
1 0 0 0
3 0 0 0
5 0 0 0
> dmtx[as.character(rows),as.character(cols)]=1
> dmtx
2 4 6
1 1 1 1
3 1 1 1
5 1 1 1
When you say "get rid of" certain columns/rows, do you mean just this:
> mtx[-c(2,4), -c(1,3,5)]
3 x 3 sparse Matrix of class "dgCMatrix"
[1,] . 1 .
[2,] 1 . 1
[3,] . . 1
Subsetting works, so you just need a way of finding out which rows and columns are empty? If so, you can use colSums() and rowSums(), as these have been enhanced by the Matrix package with appropriate methods for sparse matrices. This should preserve the sparseness during the operation:
> dimnames(mtx) <- list(letters[1:5], LETTERS[1:6])
> mtx[which(rowSums(mtx) != 0), which(colSums(mtx) != 0)]
3 x 3 sparse Matrix of class "dgCMatrix"
B D F
a . 1 .
c 1 . 1
e . . 1
or, perhaps safer
> mtx[rowSums(mtx) != 0, colSums(mtx) != 0]
3 x 3 sparse Matrix of class "dgCMatrix"
B D F
a . 1 .
c 1 . 1
e . . 1
Your code almost works, you just need to cbind together the row names and column names. Each row of the resulting matrix is then treated as a pair instead of treating the rows and the columns separately.
> dmtx <- matrix(0,3,3,dimnames=list(c(1,3,5),c(2,4,6)))
> dmtx[cbind(as.character(rows),as.character(cols))] <- 1
> dmtx
2 4 6
1 0 1 0
3 1 0 1
5 0 0 1
This may be faster if you use factors.
> rowF <- factor(rows)
> colF <- factor(cols)
> dmtx <- matrix(0, nlevels(rowF), nlevels(colF),
dimnames=list(levels(rowF), levels(colF)))
> dmtx[cbind(rowF,colF)] <- 1
> dmtx
2 4 6
1 0 1 0
3 1 0 1
5 0 0 1
You can also use these factors in a call to sparseMatrix.
> sparseMatrix(i=as.integer(rowF), j=as.integer(colF), x=1,
+ dimnames = list(levels(rowF), levels(colF)))
3 x 3 sparse Matrix of class "dgCMatrix"
2 4 6
1 . 1 .
3 1 . 1
5 . . 1
Note that one of the other solutions may be faster; converting to factors can be slow if there's a lot of data.
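If building the factors does become the bottleneck, here is a sketch of the same construction using match instead (u_rows and u_cols are just helper names; this gives equivalent output on the example data above):
u_rows <- sort(unique(rows))
u_cols <- sort(unique(cols))
sparseMatrix(i = match(rows, u_rows), j = match(cols, u_cols), x = 1,
             dimnames = list(as.character(u_rows), as.character(u_cols)))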
Your first issue stems from the fact that the coordinate list (COO) has non-contiguous values for the row and column indices. When faced with this, or even when dealing with most sparse matrices, I tend to reorder the rows and columns by their support.
You can do this in two ways:
Produce the sparse matrix and then take colSums and rowSums of a logical version of yourMatrix (i.e. yourMatrix != 0) to get the support values, or
Use a function like table or bigtabulate (from the bigmemory suite) to count how many times each index value occurs in the coordinate list. (My preference is bigtabulate.)
Once you have the support, you can use the rank function (actually, rank(-1 * support, ties = "first")) to map the original indices to new ones, based on their ranks.
At this point, if you create the matrix with sparseMatrix, it will only produce a matrix with dimensions such that all of your rows and columns have support. It will not map to anything larger.
This is similar to @GavinSimpson's approach, though his method only drops the missing rows and columns, while my approach reorders to put the maximum density in the upper left corner of the matrix, with decreasing density as you move to larger indices for the rows and columns. To map back to the original indices in my approach, simply create a pair of mappings, "original to ranked" and "ranked to original", and you can perfectly recreate the original data if you choose.
@Iterator's answer is very helpful for my application, but it's a pity that the response didn't include an example to illustrate the idea. Here is my implementation of it, for reordering the rows and columns of a very large sparse matrix (e.g. about one million rows and a few thousand columns, on a machine with sufficient memory to load the sparse matrix).
library(Matrix)
sparseY <- sparseMatrix( i=sample(2000, 500, replace=TRUE), j=sample(1000,500, replace=TRUE), x=sample(10000,500) )
# visualize the original sparse matrix
image(sparseY, aspect=1, colorkey=TRUE, main="The original sparse matrix")
numObs <- length( sparseY@x )
# replace all non-zero entries with 1, to count observed entries per row/column; rank() will then sort by support
logicalY <- sparseY; logicalY@x <- rep(1, numObs)
# calculate the number of observed entries per row/column
colObsFreqs <- colSums(logicalY)
rowObsFreqs <- rowSums(logicalY)
colObsFreqs
rowObsFreqs
# get the rank of supports for rows and columns
colRanks <- rank( -1*colObsFreqs, ties="first" )
rowRanks <- rank( -1*rowObsFreqs, ties="first" )
# Sort the ranks from small to large
sortColInds <- sort(colRanks, index.return=TRUE)
sortRowInds <- sort(rowRanks, index.return=TRUE)
# reorder the original sparse matrix so that the maximum density data block is placed in the upper left corner of the matrix, with decreasing density as you move to larger indices for the rows and columns.
sparseY <- sparseY[ sortRowInds$ix, sortColInds$ix ]
# visualize the reordered sparse matrix
image(sparseY, aspect=1, colorkey=TRUE, main="The sparse matrix after reordering")
logicalY <- sparseY; logicalY@x <- rep(1, numObs)
# Check whether the resulting sparse matrix is what's expected, i.e. with the maximum density data block placed in the upper left corner of the matrix
colObsFreqs <- colSums(logicalY)
rowObsFreqs <- rowSums(logicalY)
colObsFreqs
rowObsFreqs
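To keep the reference back to the original row/column numbers that @Iterator mentions, it is enough to save the permutations used above. A small sketch on top of the code just shown (rowMap etc. are just illustrative names):
# reordered position -> original index
rowMap <- sortRowInds$ix
colMap <- sortColInds$ix
# original index -> reordered position (the inverse permutation)
rowMapInv <- order(rowMap)
colMapInv <- order(colMap)
# e.g. row 1 of the reordered matrix was row rowMap[1] of the original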

Applying a function on each row of a data frame in R

I would like to apply some function on each row of a dataframe in R.
The function can return a single-row dataframe or nothing (I guess return() returns nothing?).
I would like to apply this function on each of the rows of a given dataframe, and get the resulting dataframe (which is possibly shorter, i.e. has less rows, than the original one).
For example, if the original dataframe is something like:
id size name
1 100 dave
2 200 sarah
3 50 ben
And the function I'm using takes a row of the dataframe (i.e. a single-row dataframe), returns it as-is if the name rhymes with "brave", and otherwise returns NULL; then the result should be:
id size name
1 100 dave
This example actually refers to filtering a dataframe, and I would love to get an answer specific to this kind of task, but also one for the more general case, where the result of the helper function (the one that operates on a single row) may be an arbitrary single-row data frame. Please note that even in the case of filtering, I would like to use some sophisticated logic (not something simple like $size > 100, but a more complex condition checked by a function, say boo(single_row_df)).
P.S.
What I have done so far in these cases is to use apply(df, MARGIN=1, ...) and then do.call(rbind, ...), but I think it gives me some trouble when my dataframe has only a single row (I get Error in do.call(rbind, filterd) : second argument must be a list).
UPDATE
Following Stephen's reply, I did the following:
ranges.filter <- function(ranges,boo) {
subset(x=ranges,subset=!any(boo[start:end]))
}
I then call ranges.filter with some ranges dataframe that looks like this:
start end
100 200
250 400
698 1520
1988 2147
...
and some boolean vector
(TRUE,FALSE,TRUE,TRUE,TRUE,...)
I want to filter out any ranges that contain a TRUE value from the boolean vector. For example, the first range 100 .. 200 will be left in the data frame iff the boolean vector is FALSE in positions 100 .. 200.
This seems to do the job, but I get a warning saying numerical expression has 53 elements: only the first used (presumably because start and end are whole columns here, and : only uses the first element of each).
For the more general case of processing a dataframe, get the plyr package from CRAN and look at the ddply function, for example.
install.packages("plyr")
library(plyr)
help(ddply)
Does what you want without masses of fiddling.
For example...
> d
x y z xx
1 1 0.68434946 0.643786918 8
2 2 0.64429292 0.231382912 5
3 3 0.15106083 0.307459540 3
4 4 0.65725669 0.553340712 5
5 5 0.02981373 0.736611949 4
6 6 0.83895251 0.845043443 4
7 7 0.22788855 0.606439470 4
8 8 0.88663285 0.048965094 9
9 9 0.44768780 0.009275935 9
10 10 0.23954606 0.356021488 4
We want to compute the mean and sd of x within groups defined by "xx":
> ddply(d,"xx",function(r){data.frame(mean=mean(r$x),sd=sd(r$x))})
xx mean sd
1 3 3.0 NA
2 4 7.0 2.1602469
3 5 3.0 1.4142136
4 8 1.0 NA
5 9 8.5 0.7071068
And it gracefully handles all the nasty edge cases that sometimes catch you out.
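Applied to the question's filtering problem, the same pattern looks like this (a sketch, assuming the example table is stored as dfr; pieces for which the function returns NULL are simply dropped from the result):
ddply(dfr, .(id), function(r) if (grepl("ave$", r$name)) r else NULL)
#  id size name
#1  1  100 dave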
You may have to use lapply instead of apply to force the result to be a list.
> rhymesWithBrave <- function(x) substring(x,nchar(x)-2) =="ave"
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) dfr[i,] else NULL,
+ dfr))
id size name
1 1 100 dave
But in this case, subset would be more appropriate:
> subset(dfr,rhymesWithBrave(name))
id size name
1 1 100 dave
If you want to perform additional transformations before returning the result, you can go back to the lapply approach above:
> add100tosize <- function(x) within(x,size <- size+100)
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) add100tosize(dfr[i,])
+ else NULL,dfr))
id size name
1 1 200 dave
Or, in this simple case, apply the function to the output of subset.
> add100tosize(subset(dfr,rhymesWithBrave(name)))
id size name
1 1 200 dave
UPDATE:
To select rows that do not fall between start and end, you might construct a different function (note: when a logical vector is summed, TRUE counts as 1 and FALSE as 0; x is found in the enclosing function's scope rather than being passed to mapply):
test <- function(x)
  rowSums(mapply(function(start, end) x >= start & x <= end,
                 start = c(100, 250, 698, 1988),
                 end = c(200, 400, 1520, 2147))) == 0
subset(dfr,test(size))
It sounds like you want to use subset:
subset(orig.df,grepl("ave",name))
The second argument evaluates to a logical expression that determines which rows are kept. This expression can use values from as many columns as you want, e.g. grepl("ave", name) & size > 50.
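A quick check on the question's data (reconstructing the example table as dfr, since the question doesn't name it):
dfr <- data.frame(id = 1:3, size = c(100, 200, 50),
                  name = c("dave", "sarah", "ben"))
subset(dfr, grepl("ave", name))
#  id size name
#1  1  100 dave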
