Remove columns based on a condition - r

I have a data set. I want to remove all columns whose value in the first row is less than 10. I have tried to make a reproducible example. Please see the code.
data_set <- matrix(8:100, nrow = 5)
required_data_set <- data_set[, -1]

We can subset the first row with indexing on i, create a logical vector by checking if the values are greater than or equal to 10 and use that in the j for subsetting the columns.
out <- data_set[,data_set[1,] >= 10]
identical(out, required_data_set)
#[1] TRUE
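The same idea can be sketched on a tiny made-up matrix. The name `m` and its values are illustrative and not from the question. Passing `drop = FALSE` is an optional precaution, assumed useful here so that the result stays a matrix even if only one column survives.

```r
# Hypothetical 2 x 4 matrix; the first-row values are 3, 12, 7, 25
m <- matrix(c(3, 1, 12, 2, 7, 3, 25, 4), nrow = 2)

# Logical vector: TRUE for columns whose first-row value is at least 10
keep <- m[1, ] >= 10

# drop = FALSE keeps the matrix structure even if a single column remains
out_small <- m[, keep, drop = FALSE]
```

Here `keep` selects the second and fourth columns, so `out_small` has two columns.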

Related

How to Delete Every Row and Column That Contains a Negative Value

I have a data frame called lexico with dimensions 11293 x 512.
I'd like to purge every row and column in which any element holds a negative value.
How could I do this?
The following is the code I tried, but it takes too long to run because of its nested loop structure.
(My first step was to collect the number of every column that holds a negative value.)
colneg <- c()
for(i in 1:11293){
for(j in 1:512){
if(as.numeric(as.character(lexico[i,j])) < 0)
colneg <- c(colneg, j)
}
}
I would appreciate your candid advice for this novice.
A possible solution:
# create an index of columns without negative values
col_index <- !colSums(d < 0)
# create an index of rows without negative values
row_index <- !rowSums(d < 0)
# subset the dataframe with the two indexes
d2 <- d[row_index, col_index]
What this does:
colSums(d < 0) gives a numeric vector with the number of negative values in each column.
Negating it with ! creates a logical vector that is TRUE for the columns with no negative values (a count of zero) and FALSE for all others.
It works the same way for rows.
Subsetting the data frame with row_index and col_index gives you a data frame from which every row and every column that contained a negative value has been removed.
Reproducible example data:
set.seed(171228)
d <- data.frame(matrix(rnorm(1e4, mean = 3), ncol = 20))
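A smaller sketch makes the effect easier to check by eye. The data frame `small` and its values are made up for illustration. The comparison `== 0` is used in place of `!` purely for readability; both give the same logical vector.

```r
# Hypothetical 3 x 3 data frame with two negative entries
small <- data.frame(a = c(1, 2, 3), b = c(4, -5, 6), c = c(7, 8, -9))

neg <- small < 0               # logical matrix marking the negative cells
col_ok <- colSums(neg) == 0    # TRUE for columns without negatives
row_ok <- rowSums(neg) == 0    # TRUE for rows without negatives

# drop = FALSE keeps a data frame even when only one column is left
cleaned <- small[row_ok, col_ok, drop = FALSE]
```

Only the first row and column `a` are free of negatives, so `cleaned` is a 1 x 1 data frame.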

How to compare a data frame with duplicates and a vector?

I have a data frame in which some ids appear more than once. I drew a sample of unique ids from it, and now I have a vector with the sampled ids. I need to create a logical vector that tells me which rows in the data frame have ids that also appear in my sample.
I have tried the match function, but it selects only the first appearance and I need all appearances.
I have also tried merge, but the dataset is too large, so there is not enough memory to do it.
You can use %in% to get a logical vector, and which together with %in% to get the row indices. Here is a reproducible example that contains duplicate IDs.
set.seed(1234)
df <- data.frame(id=sample(1:80, 100, replace=TRUE), b=rnorm(100))
mySample <- seq(1, 80, by=6)
#logical vector length of nrow(df)
myRows <- df$id %in% mySample
# row indices
myIndices <- which(df$id %in% mySample)
This is what you can do using match (as you were trying this function):
x=match(df$id, mySample, nomatch = 0) > 0
This gives you a logical vector that is TRUE if df$id appears in mySample and FALSE otherwise.
To retrieve the respective indices:
which(x)
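A likely reason match appeared to find only the first occurrence is the argument order. The sketch below, on made-up vectors `ids` and `sampled`, contrasts looking up the sample in the ids with testing each id against the sample.

```r
ids <- c(5, 7, 5, 9, 7)    # hypothetical id column with duplicates
sampled <- c(5, 9)         # hypothetical sampled ids

# One TRUE/FALSE per element of ids, so every duplicate is flagged
in_sample <- ids %in% sampled

# One position per element of sampled: only the first occurrence in ids
positions <- match(sampled, ids)
```

`in_sample` has the same length as `ids` and can index the data frame rows directly, while `positions` has the length of `sampled`.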

R select subset of data

I have a dataset with three columns.
## generate sample data
set.seed(1)
x<-sample(1:3,50,replace = T )
y<-sample(1:3,50,replace = T )
z<-sample(1:3,50,replace = T )
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only one column has 1 (could be any column)
Basically I want any two columns (for 2nd case) to fulfill the conditions and not any specific column.
I am aware of rows selection using
subset<-data[c(data$x==1,data$y==1,data$z==1),]
But this only selects rows based on conditions for specific columns, whereas I want any of the three (or any two) columns to fulfill my criteria.
Thanks
n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]
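To see what this one-liner computes, here is a sketch on a made-up data frame `toy`. Using `split` to collect all groups at once is an addition for illustration, not part of the answer above.

```r
# Hypothetical data frame with 3, 1, 1 and 0 ones in its four rows
toy <- data.frame(x = c(1, 1, 2, 3), y = c(1, 2, 1, 3), z = c(1, 3, 3, 2))

# toy == 1 is a logical matrix; rowSums counts the TRUE cells per row
ones_per_row <- rowSums(toy == 1)

# One data frame per count; fixed levels keep empty groups in the list
by_count <- split(toy, factor(ones_per_row, levels = 0:3))
```

`by_count[["1"]]` then holds the rows with exactly one 1, and `by_count[["2"]]` is an empty data frame because no row of `toy` has exactly two.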
Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list, especially when there is some structure that guides this storage, as is the case here. Following @richard-scriven's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
  data[as.integer(names(rowCounts)[rowCounts == i]),])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
You can then pull out the data.frames using either their index, df.oneCountList[[1]] or their name df.oneCountList[["df.oneOne"]].
@eddi below suggests a nice shortcut to my method of pulling out the table names, using tabulate and the arr.ind argument of which. When which is applied to a multidimensional object such as an array or a data.frame, setting arr.ind = TRUE produces the row and column indices where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the vector of row indices at which a 1 is found in any variable. The tabulate function is then applied to these row indices and returns a vector in which element i is the number of 1s in row i; rows without a 1 get a 0. Note that by default tabulate stops at the largest row index it sees, so set nbins to the number of rows to get one count per row.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1], nbins = nrow(data))
returns a vector from which you might immediately pull the values. You can include the above lapply to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
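As a quick check on made-up data (the name `toy2` is illustrative), the tabulate counts agree with `rowSums(data == 1)` once `nbins` is set to the number of rows. Without `nbins`, the result stops at the last row that contains a 1, which would make it too short to use as a row index.

```r
# Hypothetical data frame whose last row contains no 1
toy2 <- data.frame(x = c(1, 2, 1, 3), y = c(1, 1, 2, 2), z = c(3, 1, 2, 3))

# Matrix of positions with columns "row" and "col"
hits <- which(toy2 == 1, arr.ind = TRUE)

# One count per row, including trailing rows without any 1
counts <- tabulate(hits[, "row"], nbins = nrow(toy2))

# Default nbins: the vector ends at the largest row index seen
counts_default <- tabulate(hits[, "row"])
```

Here `counts` has four elements while `counts_default` has only three.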

R: Can I read a vector and use the value to assign another vector?

In R, I have a vector anno_ref_seq, rows containing values like the following
c("NM_026671", "NP_080947"), c("NM_027853", "NP_082129"), c("NM_025791", "NP_080067")
I want to take each value, say, *c("NM_026671", "NP_080947")*, and take the first element "NM_026671" and assign it to a variable, so that I can use it again. So from each row, I want to pick the first element and create another vector.
If I could read each row as a = c("NM_026671", "NP_080947"), where a is a vector containing two elements, that would help, but I don't know how to do it.
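You could obtain a vector b by converting x to a data.frame:
# x is assumed to be a 3 x 2 character matrix, one pair per row
x <- matrix(NA_character_, nrow = 3, ncol = 2)
x[1, ] <- c("NM_026671", "NP_080947")
x[2, ] <- c("NM_027853", "NP_082129")
x[3, ] <- c("NM_025791", "NP_080067")
dx <- as.data.frame(x)
# the first column holds the first element of every pair
b <- as.vector(dx[, 1])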
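If anno_ref_seq is actually a list of character vectors (an assumption, since the question does not show its structure), the first elements can be pulled out directly without building a matrix. A minimal sketch:

```r
# Assumption: anno_ref_seq is a list with one character vector per entry
anno_ref_seq <- list(c("NM_026671", "NP_080947"),
                     c("NM_027853", "NP_082129"),
                     c("NM_025791", "NP_080067"))

# Take element 1 of each entry; vapply checks that each result is one string
first_ids <- vapply(anno_ref_seq, function(v) v[1], character(1))
```

`first_ids` is then a plain character vector with one accession per entry.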

Error: (subscript) logical subscript too long

Can someone let me know why I am getting this error and how I can fix it?
Here is the code.
What I am trying to do is remove every row that has a 1 in a column whose sum is less than 10.
a0=rep(1,40)
a=rep(0:1,20)
b=c(rep(1,20),rep(0,20))
c0=c(rep(0,12),rep(1,28))
c1=c(rep(1,5),rep(0,35))
c2=c(rep(1,8),rep(0,32))
c3=c(rep(1,23),rep(0,17))
c4=c(rep(1,6),rep(0,34))
x=matrix(cbind(a0,a,b,c0,c1,c2,c3,c4),nrow=40,ncol=8)
nam <- paste("V",2:9,sep="")
colnames(x)<-nam
dat <- cbind(y=rnorm(40,50,7),x)
#===================================
toSum <- colSums(dat)
Col <- Val <- NULL
for(i in 1:length(toSum)){
if(toSum[i]<10){
Col <- c(Col,colnames(dat)[i])
Val <- c(Val,toSum[i])}
}
cs <- colSums(dat) < 10
indx <- dat[,which(cs)]==0
for(i in 1:dim(indx)[2]){
datnw <- dat[indx[,i],]
dat <- datnw}
datnw2 <- dat[, -which(cs)]
Thanks
If I understand correctly what you're trying to achieve, you might best write it this way:
cs <- colSums(dat) < 10
dat[rowSums(dat[,cs]) == 0, !cs]
This means: for any column with sum less than 10 (called a “small column” hereafter), drop any row which has a 1 in that column. So you only keep rows which have a zero in all those small columns. You drop the small columns as well, as they would only contain zeros in any case.
In your code, indx is a logical matrix with 40 rows, one for each row of input, and one column for each small column in the input. You use the first column of indx to remove the rows with a 1 in the first small column. This results in a new value for dat, which is a few rows shorter than the original. In the next iteration of the loop, you use the second column of indx in an attempt to remove more rows. But this won't work: after the first iteration, dat has fewer than 40 rows, while the second column of indx still has all 40 elements. This is what's causing the error: you're subscripting an object with fewer than 40 rows using a logical vector of length 40.
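The error can be reproduced in isolation with a small sketch. The names `m` and `too_long` are made up; `tryCatch` is used only so that the message is captured instead of stopping the script.

```r
# Hypothetical matrix with 3 rows
m <- matrix(1:6, nrow = 3)

# A logical row index with 4 elements, one more than m has rows
too_long <- c(TRUE, FALSE, TRUE, TRUE)

# Capture the error message rather than aborting
result <- tryCatch(m[too_long, ], error = function(e) conditionMessage(e))
```

`result` then holds the error message raised by the oversized subscript.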
You could combine the three columns of your indx into a single vector suitable to subscript the rows of interest using the following expression:
apply(indx, 1, all)
This will have a TRUE value in its result for exactly those rows which have TRUE in each column. However, I'd prefer my code above over this, as it is much shorter to write. The most likely reason to prefer the latter is if your data may contain negative numbers, so that a row sum of zero does not imply an all-zero row. That is not a problem in your example data.
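The difference between the two tests shows up only with negative entries. The sketch below uses a made-up matrix `toy`; the third variant, `rowSums(toy != 0) == 0`, is an added suggestion that stays vectorized while still being safe for negative values.

```r
# Hypothetical matrix; row 3 sums to zero without being all zeros
toy <- cbind(a = c(0, 1, -1, 0), b = c(0, 0, 1, 0))

by_sum <- rowSums(toy) == 0             # fooled by -1 + 1 in row 3
by_all <- apply(toy == 0, 1, all)       # TRUE only for all-zero rows
by_count <- rowSums(toy != 0) == 0      # counts non-zero cells instead
```

`by_all` and `by_count` agree, while `by_sum` wrongly marks the third row as all zeros.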
