Problems with deleting rows in big data sets in R

I wrote a script that deletes rows in which more than 20% of the cells are smaller than 10.
It works great on small data sets, but for big ones it is useless.
Can somebody help me, please?
Here is my script:
DataSets <- choose.files()
DataSet <- read.delim(DataSets, header = TRUE,
                      row.names = 1, sep = "\t", blank.lines.skip = TRUE)
delete <- 0
for (i in 1:length(DataSet[, 1])) {
  count <- 0
  for (j in 1:length(DataSet[i, ])) {
    if (DataSet[i, j] < 10 || is.na(DataSet[i, j])) {
      count <- count + 1
    }
  }
  if (count > 0.2 * length(DataSet[i, ])) {
    DataSet <- DataSet[-i, ]
    delete <- delete + 1
  }
}

This is essentially instantaneous on my machine:
m <- matrix(runif(100000), 10000, 10)
system.time(m1 <- m[rowSums(m <= 0.25 | is.na(m)) < 2, ])
I only approximated your exact situation, but your version would be analogous. The idea here would be to:
Use a matrix, rather than a data frame, if your data is indeed all numeric.
Use vectorized comparison to determine which elements are less than some value (0.25 in my example).
Then use rowSums to count how many values are less than 0.25 in each row.
Subset the matrix according to which rows have fewer than two values less than (or equal to) 0.25.
Edit: Added a check for NAs so they are counted too.
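Translated back to the original criterion (cells that are NA or below 10, dropping rows where more than 20% of the columns qualify), a rough sketch might look like this; it assumes DataSet is entirely numeric so that as.matrix() is safe:
# Sketch of the same idea with the original thresholds; assumes DataSet is all numeric.
m <- as.matrix(DataSet)
bad <- rowSums(m < 10 | is.na(m)) > 0.2 * ncol(m)
DataSet <- DataSet[!bad, ]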

This would solve your problem. You can leave your data as a data frame.
dat <- data.frame(matrix(rnorm(100, 10, 1), 10))
bad <- apply(dat, 1, function(x) {
  return((sum(x < 10, na.rm = TRUE) + sum(is.na(x))) > length(x) * 0.2)
})
dat <- dat[!bad, ]

This works pretty quickly for me. Like the solution @joran used, I use a matrix:
data <- matrix(rnorm(1000, 15, 5), 100, 10)
tf <- apply(data, 1, function(x) x < 10) # your value of 10
data[-which(colSums(tf) > ncol(data)*0.2),] # here is where the 20% comes in
apply(data, 1, ...) returns its result with one column per row of data, and TRUE counts as 1 and FALSE as 0, which is why one can use colSums here.
Update to handle NAs
If one follows the OP's comment to count "just 20% of the numeric values", rather than the original code, which counts NA values as values < 10 (i.e. delete rows where 20% of the numeric entries are less than 10), then this will work:
data[-which(colSums(tf, na.rm=T) > (ncol(data) - colSums(apply(tf,2,is.na)))*0.2),]
colSums(apply(tf, 2, is.na)) counts the number of entries in each row of data that are NA.
(ncol(data) - colSums(apply(tf, 2, is.na))) subtracts that count from the number of columns, leaving the number of non-missing (numeric) entries per row.
(ncol(data) - colSums(apply(tf, 2, is.na))) * 0.2 is then 20% of the number of numeric entries per row.
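For what it's worth, the same per-row fraction can be expressed with rowMeans, which handles the "fraction of non-missing values" reading directly; a small sketch, assuming data is the numeric matrix from above:
# Fraction of non-NA entries below 10 in each row; keep rows where that fraction is at most 20%.
keep <- rowMeans(data < 10, na.rm = TRUE) <= 0.2
data[keep, ]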

Related

How to select a specific amount of rows before and after predefined values

I am trying to select relevant rows from a large time-series data set. The tricky bit is, that the needed rows are before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones <- which(df$y == 1)
selection <- NULL
for (i in ones) {
  jj <- (i - 2):(i + 4)
  selection <- c(selection, jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat with:
twos <- which(df$y == 2)
selection <- NULL
for (i in twos) {
  jj <- (i - 2):(i + 4)
  selection <- c(selection, jj)
}
df$selection[selection] <- 2
The ideal scenario would be a function doing something like this imaginary selector(data = df$y, values = c(1, 2), before = 2, after = 5, afterafter = FALSE, beforebefore = FALSE), where values takes the critical values, before the number of rows to select before them, and after correspondingly.
afterafter would allow selecting from a certain number of rows after the value up to another, e.g. after = 5, afterafter = 10 (and beforebefore the same, but going in the other direction).
Any tips and suggestions are very welcome!
Thanks!
This is easy enough with rep and its each argument.
df$selection[rep(which(df$y == 2), each = 7L) + -2:4] <- 2
Here, rep repeats the row indices that match your criterion 7 times each (two before, the value itself, and four after; the L indicates that the argument should be an integer). Adding the values -2 through 4 gives those indices. Now, replace.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post "Why are these numbers not equal?" for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.
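If you want something closer to the imagined selector() function, here is one possible sketch; the function name and arguments mirror the hypothetical interface from the question (beforebefore/afterafter are left out):
# A minimal sketch of the imagined helper; names and arguments are hypothetical.
selector <- function(data, values, before = 0, after = 0) {
  hits <- which(data %in% values)
  idx <- rep(hits, each = before + after + 1) + (-before):after
  idx <- idx[idx >= 1 & idx <= length(data)]  # drop indices that fall outside the vector
  unique(idx)
}

# usage mirroring the question: mark windows around 1s and 2s separately
df$selection <- 0
df$selection[selector(df$y, values = 1, before = 2, after = 4)] <- 1
df$selection[selector(df$y, values = 2, before = 2, after = 4)] <- 2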

R: how to conditionally replace rows in data frame with randomly sampled rows from another data frame?

I need to conditionally replace rows in a data frame (x) with rows selected at random from another data frame (y). Some of the rows between the two data frames are the same, and so data frame x will contain rows with repeated information. What sort of base R code would I need to achieve this?
I am writing an agent-based model in R where rows can be thought of as vectors of attributes pertaining to an agent and columns are attribute types. For agents to transmit their attributes they need to send rows from one data frame (population) to another, but according to conditional learning rules. These rules need to be: conditionally replace the values in row n of data frame x if the attribute in column 10 for that row is 1 or more, and if probability s is greater than a randomly selected number between 0 and 1. Probability s is itself an adjustable parameter that can take any value from 0 to 1.
I have tried an if statement in the code below, but I am new to R and have made a mistake somewhere with it, as I get this warning:
"missing value where TRUE/FALSE needed"
I reckon that I have not specified what should happen to a row if the conditions are not satisfied.
I cannot think of an alternative method of achieving my aim.
Note: agent.dat is data frame x and top_ten_percent is data frame y.
s = 0.7
N = nrow(agent.dat)
copy <- runif(N)  # to generate a random probability for each row in agent.dat
for (i in 1:nrow(agent.dat)) {
  if (agent.dat[, 10] >= 1 & copy < s) {
    agent.dat <- top_ten_percent[sample(nrow(top_ten_percent), 1), ]
  }
}
The agent.dat data frame should have rows that are replaced with values from rows in the top_ten_percent data frame if the randomly selected value of copy between 0 and 1 for that row is less than the value of parameter s and if the value for that row in column 10 is 1 or more. For each row I need to replace the first 10 columns of agent.dat with the first 10 columns of top_ten_percent (excluding column 11 i.e. copy value).
Assistance with this problem is greatly appreciated.
So you just need to change a few things.
You need to get a particular value for copy for each iteration of the for loop (use: copy[i]).
You also need to make the & in the if statement an && (Boolean operators && and ||)
Then you need to replace a particular row (and columns 1 through 10) in agent.dat, instead of the whole thing (agent.dat[i,1:10])
So, the final code should look like:
copy <- runif(N)
for (i in 1:nrow(agent.dat)) {
  if (agent.dat[i, 10] >= 1 && copy[i] < s) {
    agent.dat[i, 1:10] <- top_ten_percent[sample(nrow(top_ten_percent), 1), ]
  }
}
This should fix your errors, assuming your data structure fits your code:
copy <- runif(nrow(agent.dat))
s <- 0.7
for (i in 1:nrow(agent.dat)) {
  if (agent.dat[i, 10] >= 1 & copy[i] < s) {
    agent.dat[i, ] <- top_ten_percent[sample(1:nrow(top_ten_percent), 1), ]
  }
}
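For large populations you can avoid the loop altogether with logical indexing; a rough vectorized sketch, assuming copy and s are defined as above and top_ten_percent has at least 10 columns:
# Vectorized sketch: find every row meeting both conditions, then replace them in one assignment.
replace_rows <- which(agent.dat[, 10] >= 1 & copy < s)
if (length(replace_rows) > 0) {
  picks <- sample(nrow(top_ten_percent), length(replace_rows), replace = TRUE)
  agent.dat[replace_rows, 1:10] <- top_ten_percent[picks, 1:10]
}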

gene expression datamatrix filtration

I have one matrix with 3064 rows and 27 columns which contains values between -0.5 and 2.0. I want to extract every row that has at least one value >= 0.5. As the answer, I would like to have the whole row in its original matrix form.
Consider m is my matrix, I tried:
m[m[1:190,1:16]>0.5,1:16]
As this command did not accept more than 190 rows at once, I went for 190 rows, but somehow it went wrong, because it gave me rows that also have values < 0.5.
Is it possible to write a function that can be applied to the whole matrix?
You can also try it like this, if your data is named df:
df2<- df[apply(df, MARGIN = 1, function(x) any(x >= 0.5)), ]
library(fBasics)
m2 <- subset(x = m, subset = rowMaxs(m)>=0.5)
What mm = m[1:190, 1:16] > 0.5 gives you is a matrix of booleans indicating which values of m[1:190, 1:16] are greater than 0.5.
Then when you do m[mm], it treats mm as a vector and returns the corresponding values. The thing is, dim(m) is 3064*27 while dim(m[1:190,1:16]) is 190*16, which means the first 27 values of mm are used to index the first line of m even though they correspond to part of the second line of mm.
So, in order to keep only the elements greater than 0.5, you need to apply the boolean matrix to m[1:190,1:16], which has the same dimensions, i.e.:
m[1:190,1:16][m[1:190,1:16] > 0.5]
But what you do here is m[mm, 1:16], so each individual value of mm is treated as a row number, while mm is a 190*16 matrix. That means you specify 190*16 = 3040 rows; it does not work with more because m only has 3064 rows.
What you want is a vector of length 190 (or rather 3064, for the whole matrix) specifying which rows to take. You can get this vector with rowSums(m >= 0.5) > 0, i.e. each row with more than zero values greater than or equal to 0.5. Then you get your output with:
m[rowSums(m >= 0.5) > 0,]
And it will work for the whole matrix. Note that some values in the result will still be smaller than 0.5, since the whole row is kept as soon as at least one of its values is greater than or equal to 0.5.
Edit
For rows containing at least one value < 0.5, the idea is the same:
m[rowSums(m < 0.5) > 0,]
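A quick self-contained check of the rowSums approach on simulated data of the same shape (the values here are made up purely for illustration):
set.seed(1)
m <- matrix(runif(3064 * 27, -0.5, 2), nrow = 3064, ncol = 27)  # stand-in for the expression matrix
kept <- m[rowSums(m >= 0.5, na.rm = TRUE) > 0, ]                # rows with at least one value >= 0.5
dim(kept)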

simulate x percentage of missing and error in data in r

I would like to perform two things on my fairly large data set, about 10K x 50K. The following is a smaller set of 200 x 10000.
First I want to generate 5% missing values, which is perhaps simple and can be done with a simple trick:
# dummy data
set.seed(123)
# matrix of X variable
xmat <- matrix(sample(0:4, 2000000, replace = TRUE), ncol = 10000)
colnames(xmat) <- paste ("M", 1:10000, sep ="")
rownames(xmat) <- paste("sample", 1:200, sep = "")
Generate missing values at 5% of randomly chosen positions in the data:
N <- 2000000*0.05 # 5% random missing values
inds_miss <- round ( runif(N, 1, length(xmat)) )
xmat[inds_miss] <- NA
Now I would like to generate errors (meaning values different from what I have in the above matrix). The above matrix has values of 0 to 4. So what I would like to do is:
(1) I would like to replace a value x with another value that is not x (for example, 0 can be replaced by a random sample from the values that are not 0, i.e. 1, 2, 3, or 4; similarly, 1 can be replaced by a value that is not 1, i.e. 0, 2, 3, or 4). Picking the indices where values will be replaced can simply be done with:
inds_err <- round ( runif(N, 1, length(xmat)) )
If I randomly sample from 0:4 and replace at those indices, this will sometimes replace a value with the same value (0 with 0, 1 with 1, and so on) without creating an error.
errorg <- sample(0:4, length(inds_err), replace = TRUE)
xmat[inds_err] <- errorg
(2) I would also like to introduce errors into the xmat that already has missing values; however, I do not want the NAs generated in the step above to be replaced with a value (0 to 4). So inds_err should not contain any member of the vector inds_miss.
So, in summary, the rules are:
(1) The missing values should not be replaced with error values.
(2) An existing value must be replaced with a different value (which is the definition of an error here); with naive random sampling of 0:4 there is a 1/5 probability of drawing the same value again.
How can this be done? I need a fast solution that can be used on my large dataset.
You can try this:
inds_err <- setdiff(round(runif(2 * N, 1, length(xmat))), inds_miss)[1:N]
xmat[inds_err] <- (xmat[inds_err] + sample(4, N, replace = TRUE)) %% 5
With the first line you generate 2*N possible error indices, then you remove the ones belonging to inds_miss and take the first N. With the second line you add a random number between 1 and 4 to each value you want to change and then take it mod 5. This way you are sure the new value is different from the original and still in the 0-4 range.
Here's an if/else solution that could work for you. It is a for loop, so not sure if that will be okay for you. Possibly vectorize it in some way to make it faster.
# vector of options
vec <- 0:4
# simple logic-based solution if you just don't want NAs changed
for (i in inds_err) {
  if (is.na(xmat[i])) {
    next                                        # leave missing cells alone
  } else {
    xmat[i] <- sample(vec[vec != xmat[i]], 1)   # draw a value different from the current one
  }
}
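Since the answer already suggests vectorizing, here is one possible vectorized form of the same idea, combining the modulo trick from the previous answer with an NA check (a sketch, assuming xmat and inds_err are defined as above):
# Keep only error positions that are not NA, then shift them by a random non-zero
# offset modulo 5, which guarantees a different value that is still in 0-4.
ok <- inds_err[!is.na(xmat[inds_err])]
xmat[ok] <- (xmat[ok] + sample(1:4, length(ok), replace = TRUE)) %% 5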

An elegant way to count number of negative elements in a vector?

I have a data vector with 1024 values and need to count the number of negative entries. Is there an elegant way to do this without looping and checking if an element is <0 and incrementing a counter?
You want to read 'An Introduction to R'. Your answer here is simply
sum( x < 0 )
which works thanks to vectorisation. The x < 0 expression returns a vector of booleans over which sum() can operate (by converting the booleans to standard 0/1 values).
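A quick illustration (the vector here is just simulated for the example); if your data can contain NAs, adding na.rm = TRUE keeps the count well defined:
set.seed(42)
x <- rnorm(1024)           # stand-in for the 1024-value data vector
sum(x < 0)                 # number of negative entries
sum(x < 0, na.rm = TRUE)   # same, but robust if x contains NAs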
There is a good answer to this question from Steve Lianoglou: How to identify the rows in my dataframe with a negative value in any column?
Let me just replicate his code with one small addition (4th point).
Imagine you had a data.frame like this:
df <- data.frame(a = 1:10, b = c(1:3,-4, 5:10), c = c(-1, 2:10))
This will return you a boolean vector of which rows have negative values:
has.neg <- apply(df, 1, function(row) any(row < 0))
Here are the indexes for negative numbers:
which(has.neg)
Here is a count of elements with negative numbers:
length(which(has.neg))
The solutions prescribed above need to be tweaked in order to apply them to a data frame.
The command below gives the count of negative values, or of any other logical relationship.
Suppose you have a dataframe:
df <- data.frame(x=c(2,5,-10,NA,7), y=c(81,-1001,-1,NA,-991))
In order to get the count of negative records in x:
nrow(df[df$x<0,])
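One caveat: with the example above, df$x contains an NA, and nrow(df[df$x < 0, ]) will count an extra all-NA row produced by the NA in the logical index. A sum-based variant avoids that:
sum(df$x < 0, na.rm = TRUE)   # counts only rows where x is genuinely negative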
