Gene expression data matrix filtration in R

I have a matrix with 3064 rows and 27 columns containing values between -0.5 and 2.0. I want to extract every row that has at least one value >= 0.5. As the answer I would like the whole rows in their original matrix form.
With m as my matrix, I tried:
m[m[1:190,1:16]>0.5,1:16]
As this command would not accept more than 190 rows, I limited it to 190 rows, but somehow it went wrong, because it also gave me rows with values < 0.5.
Is it possible to write a function that can be applied to the whole matrix?

You can also try it like this, if your data frame is named df:
df2<- df[apply(df, MARGIN = 1, function(x) any(x >= 0.5)), ]

library(fBasics)
m2 <- subset(x = m, subset = rowMaxs(m)>=0.5)
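If you would rather avoid the fBasics dependency, base R's apply() computes the same row maxima (a sketch on simulated data with the question's dimensions; the matrix itself is made up):

```r
set.seed(1)
# Simulated stand-in for the OP's matrix: values in [-0.5, 2.0]
m <- matrix(runif(3064 * 27, min = -0.5, max = 2.0), nrow = 3064, ncol = 27)

# Keep every row whose maximum is at least 0.5
m2 <- m[apply(m, 1, max) >= 0.5, ]
```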

What mm = m[1:190,1:16] > 0.5 gives you is a logical matrix indicating which values of m[1:190,1:16] are greater than 0.5.
When you then do m[mm], R treats mm as a plain vector and recycles it over m. The thing is, dim(m) is 3064 x 27 while dim(m[1:190,1:16]) is 190 x 16, so the TRUE/FALSE values no longer line up with the elements of m they were computed from.
So in order to get only the elements greater than 0.5, you need to index a matrix of the same dimensions, i.e.:
m[1:190,1:16][m[1:190,1:16] > 0.5]
But what you do here is m[mm, 1:16], so each value of mm is used as a row subscript even though mm is a 190*16 matrix. That means you specify 190*16 = 3040 rows; it does not work with more because m only has 3064 rows.
What you want is a vector of length 190 (or 3064 for the full matrix) specifying which rows to take. You can get this vector with rowSums(m >= 0.5) > 0, which marks each row that has more than zero values greater than or equal to 0.5. Then you get your output with:
m[rowSums(m >= 0.5) > 0,]
And it will work for the whole matrix. Note that the result will still contain values smaller than 0.5, since the whole row is kept as soon as at least one value reaches 0.5.
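A quick check on a small made-up matrix:

```r
m <- matrix(c(-0.4,  0.1, 0.3,
               0.7, -0.2, 0.1,
               0.2,  0.2, 0.6), nrow = 3, byrow = TRUE)

# Rows 2 and 3 contain a value >= 0.5, row 1 does not
kept <- m[rowSums(m >= 0.5) > 0, ]
```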
Edit
For rows with values <0.5, the idea is the same:
m[rowSums(m < 0.5) > 0,]

Related

R - I don't understand why my code generates a count rather than a sum

I have a list of 10,000 values that look like this
Points
1 118
2 564
3 15
4 729
5 49
6 614
Calling the list t1 and running sum(t1 > quantile(t(t1), 0.8)), I would expect to get the sum of the values in the list that are greater than the 80th percentile, but what I actually get is a count (not a sum) of those values.
Try this:
sum(t1[t1>quantile(t(t1),0.8), ])
To see the difference check t1>quantile(t(t1),0.8) and then t1[t1>quantile(t(t1),0.8), ].
One is a logical vector containing TRUE (i.e. 1) where the value is greater than the 80% quantile and FALSE (i.e. 0) otherwise.
The other is t1 subset by that logical vector, so only the values greater than the 80% quantile are returned.
t1 > quantile(t(t1), 0.8) is a logical vector, i.e. a sequence of TRUE/FALSE values (you can check this easily). Consequently, the sum of this vector is the number of TRUE values, i.e. the count of elements that satisfy the condition you specified.
Here is an example:
set.seed(123)
df <- data.frame(Point = rnorm(10000))
sum(df$Point > quantile(df$Point, 0.8))
The second line returns the sum of a logical (TRUE/FALSE) vector, hence you get a count (the number of times TRUE occurs). Use
sum(df$Point[df$Point > quantile(df$Point, 0.8)])
to get what you want.
You could use the ifelse function, which adds t1 where t1 is above your threshold and 0 otherwise:
sum(ifelse(t1>quantile(t(t1),0.8),t1,0))
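Both approaches give the same number; a quick check on made-up data (here t1 is a plain numeric vector, not the OP's data frame):

```r
set.seed(42)
t1 <- round(runif(100, 0, 1000))     # made-up "Points" values
q80 <- quantile(t1, 0.8)             # the 80th percentile

s1 <- sum(t1[t1 > q80])              # subset first, then sum
s2 <- sum(ifelse(t1 > q80, t1, 0))   # zero out values at or below the threshold
```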

Convert a one column matrix to n x c matrix

I have an (n*c + n + c) by 1 matrix, and I want to drop the last n + c rows and convert the rest into an n by c matrix. Below is what I've tried, but it returns a matrix in which every element within a row is the same, and I'm not sure why. Could someone help me out please?
tmp=x[1:n*c,]
Membership <- matrix(tmp, nrow=n, ncol=c)
You have a vector x of length n*c + n + c, but when you do the extraction, you put a comma in your code.
You should do tmp = x[1:(n*c)].
Note the importance of the parentheses: if you do tmp = x[1:n*c], R takes the range from 1 to n, multiplies it by c (giving a new set of indices), and extracts based on those.
For example, you want to avoid:
(1:100)[1:5*5]
[1] 5 10 15 20 25
You can also do it without any index arithmetic:
matrix(head(x, n*c), ncol=c)
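For example, with n = 3 and c = 2 (made-up numbers):

```r
n <- 3; c <- 2
x <- 1:(n * c + n + c)                 # a vector of length n*c + n + c = 11

tmp <- x[1:(n * c)]                    # first n*c = 6 elements; note the parentheses
Membership <- matrix(tmp, nrow = n, ncol = c)   # filled column by column

# Equivalent, without the index arithmetic
Membership2 <- matrix(head(x, n * c), ncol = c)
```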

Problems with deleting rows in big data sets in R

I wrote a script that deletes rows in which 20% of the cells are smaller than 10.
It works great on small data sets, but for big ones it's useless.
Can somebody help me please?
Here is my script:
DataSets <- choose.files()
DataSet <- read.delim(DataSets, header = TRUE,
                      row.names = 1, sep = "\t", blank.lines.skip = TRUE)
delete <- 0
for (i in 1:length(DataSet[, 1]))
{
  count <- 0
  for (j in 1:length(DataSet[i, ]))
  {
    if (DataSet[i, j] < 10 || is.na(DataSet[i, j]))
    {
      count <- count + 1
    }
  }
  if (count > 0.2 * length(DataSet[i, ]))
  {
    DataSet <- DataSet[-i, ]
    delete <- delete + 1
  }
}
This is essentially instantaneous on my machine:
m <- matrix(runif(100000),10000,10)
system.time(m1 <- m[rowSums(m <= 0.25 | is.na(m)) < 2, ])
I only approximated your exact situation, but your version would be analogous. The idea here would be to:
Use a matrix, rather than a data frame, if your data is indeed all numeric.
Use vectorized comparison to determine which elements are less than some value (0.25 in my example).
Then use rowSums to count how many values are less than 0.25 in each row.
Subset the matrix according to which rows have fewer than two values less than (or equal to) 0.25.
Edit Added check for NAs to count them too.
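On a small made-up matrix with an NA planted in it, the row filter behaves like this (keeping rows with fewer than two values that are low or missing):

```r
m <- rbind(c(0.9, 0.8, 0.7),   # 0 low/NA values -> keep
           c(0.1, 0.8, 0.7),   # 1 low value     -> keep
           c(0.1, 0.2, 0.7),   # 2 low values    -> drop
           c(NA,  0.1, 0.7))   # 1 NA + 1 low    -> drop

# NA elements count as "low" because is.na(m) forces them to TRUE
m1 <- m[rowSums(m <= 0.25 | is.na(m)) < 2, ]
```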
This would solve your problem, and you can leave your data as a data frame.
dat<-data.frame(matrix(rnorm(100,10,1),10))
bad <- apply(dat, 1, function(x) {
  (sum(x < 10, na.rm = TRUE) + sum(is.na(x))) > length(x) * 0.2
})
dat<-dat[!bad,]
This works pretty quickly for me. Like @joran's solution, I use a matrix:
data <- matrix(rnorm(1000, 15, 5), 100, 10)
tf <- apply(data, 1, function(x) x < 10) # your value of 10
data[-which(colSums(tf) > ncol(data)*0.2),] # here is where the 20% comes in
TRUE = 1 and FALSE = 0, which is why one can use colSums here. (Note that apply over rows returns its results in columns, so tf is transposed relative to data; that is why colSums, not rowSums, gives the per-row counts.)
Update to handle NAs
If one follows the OP's comment to include "just 20% of the numeric values", rather than the original code that counts NA values as values < 10 (i.e. delete rows where 20% of the numeric entries are less than 10), then this will work:
data[-which(colSums(tf, na.rm=T) > (ncol(data) - colSums(apply(tf,2,is.na)))*0.2),]
colSums(apply(tf,2,is.na)) counts the number of entries in a row of data that are NA.
(ncol(data) - colSums(apply(tf,2,is.na))) subtracts that number from the number of columns so that only the total number of numeric columns is returned.
(ncol(data) - colSums(apply(tf,2,is.na)))*0.2 is 20% of the number of numeric entries per row
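A small deterministic example (made-up data) showing the NA-aware version keeping only rows where fewer than 20% of the numeric entries are below 10:

```r
data <- rbind(c(12, 15, 14, 13, 16),   # no values below 10            -> keep
              c( 5, 15, 14, 13, 16),   # 1 of 5 below 10 (20%)         -> keep
              c( 5,  6, 14, 13, 16),   # 2 of 5 below 10 (40%)         -> drop
              c( 5, NA, 14, 13, 16))   # 1 of 4 numeric below 10 (25%) -> drop

tf <- apply(data, 1, function(x) x < 10)   # transposed: one column per row of data
kept <- data[-which(colSums(tf, na.rm = TRUE) >
                      (ncol(data) - colSums(apply(tf, 2, is.na))) * 0.2), ]
```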

How to create a list from an array of z-scores in R?

I have an array of z-scores structured like num [1:27, 1:11, 1:467], so there are 467 entries, each with 27 rows and 11 columns. Is there a way to make a list from this array? For example, a list identifying which of the 467 entries contain a z-score over 2.0 (not just a list of z-scores).
Say that your array is called z in your R session. The function you are looking for is which with the argument arr.ind set to TRUE.
m <- which(z > 2, arr.ind=TRUE)
This will give you a selection matrix, i.e. a matrix with three columns, each row corresponding to an entry with a z-score greater than 2. To know the number of z-scores greater than 2, you can do
nrow(m)
# Note that 'sum(z > 2)' is easier.
and to get the values
z[m]
# Note that 'z[z > 2]' is easier
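Putting it together on a simulated array with the question's shape (the z-scores here are made up):

```r
set.seed(3)
z <- array(rnorm(27 * 11 * 467), dim = c(27, 11, 467))

m <- which(z > 2, arr.ind = TRUE)   # one row per z-score above 2
entries <- unique(m[, 3])           # which of the 467 entries contain one
```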

An elegant way to count number of negative elements in a vector?

I have a data vector with 1024 values and need to count the number of negative entries. Is there an elegant way to do this without looping and checking if an element is <0 and incrementing a counter?
You want to read 'An Introduction to R'. Your answer here is simply
sum( x < 0 )
which works thanks to vectorisation. The x < 0 expression returns a logical vector over which sum() can operate, converting the TRUE/FALSE values to 1/0.
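For example:

```r
x <- c(3.2, -1.5, 0, -0.7, 4.1, -2.2)
n_neg <- sum(x < 0)   # the three TRUEs are summed as 1s
n_neg
# [1] 3
```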
There is a good answer to this question from Steve Lianoglou: How to identify the rows in my dataframe with a negative value in any column?
Let me just replicate his code with one small addition (the 4th point).
Imagine you had a data.frame like this:
df <- data.frame(a = 1:10, b = c(1:3,-4, 5:10), c = c(-1, 2:10))
This will return you a boolean vector of which rows have negative values:
has.neg <- apply(df, 1, function(row) any(row < 0))
Here are the indexes for negative numbers:
which(has.neg)
Here is a count of elements with negative numbers:
length(which(has.neg))
The solutions above need to be tweaked in order to apply them to a data frame.
The command below gets the count of negative values (or values satisfying any other logical condition).
Suppose you have a dataframe:
df <- data.frame(x=c(2,5,-10,NA,7), y=c(81,-1001,-1,NA,-991))
To get the count of negative records in x (using which() so that the row with NA is not counted):
nrow(df[which(df$x < 0), ])
