How to Delete Every Row and Column That Contains a Negative Value - r

I have a dataframe called lexico with dimensions 11293x512.
I'd like to remove every row and every column in which any element holds a negative value.
How could I do this?
Below is the code I tried, but it takes too long to run because of its nested loop structure.
(My first step was to collect the number of every column that holds a negative value.)
colneg <- c()
for (i in 1:11293) {
  for (j in 1:512) {
    # test the current cell (i, j), not a fixed one
    if (as.numeric(as.character(lexico[i, j])) < 0)
      colneg <- c(colneg, j)
  }
}
I would appreciate any blunt advice for this novice.

A possible solution:
# create an index of columns without negative values
col_index <- !colSums(d < 0)
# create an index of rows without negative values
row_index <- !rowSums(d < 0)
# subset the dataframe with the two indexes
d2 <- d[row_index, col_index]
What this does:
colSums(d < 0) gives a numeric vector with the number of negative values in each column.
Negating it with ! creates a logical vector that is TRUE for the columns with no negative values (a count of 0 becomes TRUE, any positive count becomes FALSE).
It works the same way for rows.
Subsetting the dataframe with row_index and col_index gives you a dataframe from which every row and every column that contained a negative value has been removed.
Reproducible example data:
set.seed(171228)
d <- data.frame(matrix(rnorm(1e4, mean = 3), ncol = 20))
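To see the effect on something small enough to check by eye, here is a minimal sketch on a made-up 3x3 dataframe (the values and column names are illustrative only). It assumes every column is numeric and there are no NA values; with NAs, colSums() and rowSums() would return NA unless na.rm = TRUE is added.

```r
# Toy data: the only negative value sits in row 2, column b
d <- data.frame(a = c(1, 2, 3), b = c(4, -5, 6), c = c(7, 8, 9))

col_index <- !colSums(d < 0)   # TRUE for columns a and c
row_index <- !rowSums(d < 0)   # TRUE for rows 1 and 3

# drop = FALSE keeps a dataframe even if only one column survives
d2 <- d[row_index, col_index, drop = FALSE]
print(d2)
```

Only rows 1 and 3 of columns a and c remain.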

Related

Count number of rows in each column in a dataframe that specify a specific condition

New to R btw so I am sorry if it seems like a stupid question.
So basically I have a dataframe with 100 rows and 3 different columns of data. I also have a vector with 3 thresholds, one for each column. I was wondering how you would filter out the values of each column that are greater than that column's threshold.
Edit: Sorry for the incomplete question.
So essentially what I would like to create is a function (that takes a dataframe and a vector of thresholds as parameters) that applies each threshold to its respective column of the dataframe (so there is one threshold for every column of the dataframe). The number of elements in each column that "respect" their threshold should then be put in a vector. So for example:
Column 1: values = 1,2,3. Threshold = 3 (only values lower than 3)
Column 2: values = 4,5,6. Threshold = 6 (only values lower than 6)
Output: A vector (2,2) since there are two elements in each column that are under their respective thresholds.
Thank you everyone for the help!!
Your example data:
df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
One option to resolve your question is to use sapply(), which applies a function over a list or vector. In this case, I create a vector for the columns in df with 1:ncol(df). Inside the function, you can count the number of values less than a given threshold by summing the number of TRUE cases:
col_num <- 1:ncol(df)
sapply(col_num, function(x) {sum(df[, x] < threshold[x])})
Or, in a single line:
sapply(1:ncol(df), function(x) {sum(df[, x] < threshold[x])})
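Since the question asks for a function that takes the dataframe and the thresholds, the same idea can be wrapped up as below. This is only a sketch: the name count_below is made up for illustration, and mapply() is swapped in for sapply() so that each column is paired with its threshold directly.

```r
# Hypothetical helper: count, per column, the values strictly below that column's threshold
count_below <- function(df, thresholds) {
  stopifnot(ncol(df) == length(thresholds))  # one threshold per column
  mapply(function(column, limit) sum(column < limit), df, thresholds)
}

df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
result <- count_below(df, threshold)
print(result)
```

The result is a vector named after the columns, with one count per column (2 and 2 for the example data).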

Remove columns based on a condition

I have a data set. I want to remove all columns whose value in the first row is less than 10. I have tried to make a reproducible example. Please see the code.
data_set <- matrix(8:100, nrow = 5)
required_data_set <- data_set[, -1]
We can subset the first row with indexing on i, create a logical vector by checking if the values are greater than or equal to 10 and use that in the j for subsetting the columns.
out <- data_set[,data_set[1,] >= 10]
identical(out, required_data_set)
#[1] TRUE
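One thing to watch with this kind of subsetting: if only a single column passes the condition, R simplifies the result to a plain vector unless drop = FALSE is given. A small sketch on made-up data (not the data from the question):

```r
data_set <- matrix(1:10, nrow = 2)   # first row is 1 3 5 7 9
keep <- data_set[1, ] >= 9           # only the last column qualifies

dropped <- data_set[, keep]                 # simplified to a vector
kept    <- data_set[, keep, drop = FALSE]   # still a 2 x 1 matrix

print(dim(dropped))
print(dim(kept))
```

Adding drop = FALSE is a cheap safeguard when later code expects a matrix.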

Faster method of counting specified values from rows in large matrix in R

MC is a very large matrix, 1E6 rows (or more) and 500 columns. I am trying to get the number of occurrences of the values 1 through 13 for each of the columns. Sometimes the number of occurrences for one of these values will be zero. I would like my final output to be a 500x13 matrix (or data frame) with these count values. I am wondering if anyone can suggest a more efficient approach than what I currently have, which is the following:
MCct <- matrix(0, 500, 13)
for (j in 1:500) {
  for (i in 1:13) {
    MCct[j, i] <- length(which(MC[, j] == i))
  }
}
I don't think table works, because I also need to know when zero occurrences occurred... I couldn't figure out how to do that, if it is possible. And I am only somewhat familiar with apply, so maybe there is a way to use that... I haven't been successful in figuring that out yet.
Thanks for the help,
Vivien
You could do this with sapply (to iterate over the values 1 to 13) and colSums (to count, for each column of MC, the entries equal to i):
MCct <- sapply(1:13, function(i) {
colSums(MC == i)
})
Suppose you have a set of values you're interested in
set <- 1:4
n = length(set)
and you have a matrix that includes those values, and others
m <- matrix(sample(10, 120, TRUE), 12, 10)
Create a vector indicating the index in the set of each matching value
idx <- match(m, set)
then make the index unique to each column
idx <- idx + (col(m) - 1) * n
idx ranges from 1 (occurrences of the first set element in the first column) to n * ncol(m) (occurrence of the nth set element in the last column of m). Tabulate the unique values of idx
v <- tabulate(idx, nbins = n * ncol(m))
The first n elements of v summarize the number of times set elements 1..n appear in the first column of m. The second n elements of v summarize the number of times set elements 1..n appear in the second column of m, etc. Reshape as the desired matrix, where each row represents the corresponding member of the set.
matrix(v, ncol=ncol(m))
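As a sanity check, the tabulate() result can be compared with the more direct colSums() count. The sketch below uses an arbitrary seed and the same toy sizes as above; it relies on tabulate() ignoring the NA entries that match() returns for values outside the set.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
set <- 1:4
n <- length(set)
m <- matrix(sample(10, 120, TRUE), 12, 10)

# tabulate approach: one bin per (set element, column) pair
idx <- match(m, set) + (col(m) - 1) * n
counts_tab <- matrix(tabulate(idx, nbins = n * ncol(m)), ncol = ncol(m))

# direct approach: one colSums() call per set element, transposed to the same layout
counts_direct <- t(sapply(set, function(s) colSums(m == s)))

print(all(counts_tab == counts_direct))
```

Both matrices have one row per set element and one column per column of m; use t() if you want columns of m as rows, as in the question.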
table can count zero occurrences; you just need to create a factor that has the full range of levels, e.g.
apply(MC, 2, function(x) table(factor(x, levels=1:13)))
This is not as efficient as #Patronus' solution though.
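A tiny illustration of the factor trick, on a made-up 3x2 matrix with the levels shortened to 1:3:

```r
m <- matrix(c(1, 1, 2,
              3, 3, 3), nrow = 3)   # column 1 holds 1,1,2; column 2 holds 3,3,3

# Levels that never occur still get a count of 0
counts <- apply(m, 2, function(x) table(factor(x, levels = 1:3)))
print(counts)
```

Each column of the result holds the counts for one column of m, including the zeros for values that never appear.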

Choosing the Best Combination of Values from Data frame R

I have a data frame with 20 rows and 10 columns. Each value in the data is a number between 0 and 10.
I want to pick the combination of values with the highest sum, and I have to pick one and only one value from each column.
Is there a ready-made R function that does this, or an implementation of a known algorithm?
Is there an r function that generates all the possible combinations from which I would pick the one with the highest sum?
Is this what you're trying to do? (I'm assuming your data frame is named df.)
# Greedy pass that also uses each row at most once (an extra assumption, not stated in the question)
maxList <- which.max(df[, 1])  # Row of the maximum in the first column (first one on ties)
total <- max(df[, 1])  # Initialize sum of allowable maximum values
combination <- c(total)  # Initialize list of those maximum values
for (i in 2:ncol(df)) {  # For the remaining columns in df
  subCol <- df[, i]
  subCol[maxList] <- 0  # Set row values of previous maxima to zero
  maxList <- c(maxList, which.max(subCol))  # Update maxList once per column
  combination <- c(combination, max(subCol))
  total <- total + max(subCol)  # Update total
}
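If the only constraint really is "one and only one value from each column", as the question is worded, no search over combinations is needed: the columns are independent, so the best total is simply the sum of the column maxima. A minimal sketch on made-up data of the size described in the question (the seed and values are arbitrary):

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
df <- as.data.frame(matrix(sample(0:10, 200, replace = TRUE), nrow = 20))

best_rows   <- sapply(df, which.max)  # row of the maximum in each column (first on ties)
best_values <- sapply(df, max)        # the chosen value from each column
best_total  <- sum(best_values)       # highest achievable sum

print(best_total)
```

Enumerating all combinations with something like expand.grid() would mean 20^10 rows here, so it is not practical.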

Error: (subscript) logical subscript too long

Can someone let me know why I am getting this error and how I can fix it?
Here is the code.
What I am trying to do is remove every row that has a 1 in any column whose sum is less than 10.
a0=rep(1,40)
a=rep(0:1,20)
b=c(rep(1,20),rep(0,20))
c0=c(rep(0,12),rep(1,28))
c1=c(rep(1,5),rep(0,35))
c2=c(rep(1,8),rep(0,32))
c3=c(rep(1,23),rep(0,17))
c4=c(rep(1,6),rep(0,34))
x=matrix(cbind(a0,a,b,c0,c1,c2,c3,c4),nrow=40,ncol=8)
nam <- paste("V",2:9,sep="")
colnames(x)<-nam
dat <- cbind(y=rnorm(40,50,7),x)
#===================================
toSum <- colSums(dat)
Col <- Val <- NULL
for(i in 1:length(toSum)){
if(toSum[i]<10){
Col <- c(Col,colnames(dat)[i])
Val <- c(Val,toSum[i])}
}
cs <- colSums(dat) < 10
indx <- dat[,which(cs)]==0
for(i in 1:dim(indx)[2]){
datnw <- dat[indx[,i],]
dat <- datnw}
datnw2 <- dat[, -which(cs)]
Thanks
If I understand correctly what you're trying to achieve, you might best write it this way:
cs <- colSums(dat) < 10
dat[rowSums(dat[, cs, drop = FALSE]) == 0, !cs]
This means: for any column with sum less than 10 (called a “small column” hereafter), drop any row which has a 1 in that column. So you only keep rows which have a zero in all those small columns. You drop the small columns as well, as they would only contain zeros in any case.
In your code, indx is a logical matrix with 40 rows, one for each row of input, and one column for each small column in the input. You use the first column of indx to remove the rows with a 1 in the first small column. This results in a new value for dat, which is a few rows shorter than the original. In the next iteration of the loop, you use the second logical vector in an attempt to remove more rows. But this won't work: after the first iteration, dat has fewer than 40 rows, but the second column of indx still has all 40 entries. This is what's causing the error: you're subscripting an object with fewer than 40 rows using a logical vector of length 40.
You could combine the three columns of your indx into a single vector suitable for subscripting the rows of interest using the following expression:
apply(indx, 1, all)
This will have a TRUE value in its result for exactly those rows which have TRUE in each column. However, I guess I'd prefer my code above over this, as it is much shorter to write. The most likely reason to prefer the latter is if your data may contain negative numbers, so that a row sum of zero does not imply an all-zero row. That is not a problem in your example data.
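Both variants can be checked side by side on the data from the question. The sketch below rebuilds that data (the seed is an arbitrary addition so that y is reproducible) and adds drop = FALSE so it would also work if only one column were small.

```r
set.seed(42)  # arbitrary seed, not in the original code
x <- cbind(rep(1, 40), rep(0:1, 20),
           c(rep(1, 20), rep(0, 20)),
           c(rep(0, 12), rep(1, 28)),
           c(rep(1, 5),  rep(0, 35)),
           c(rep(1, 8),  rep(0, 32)),
           c(rep(1, 23), rep(0, 17)),
           c(rep(1, 6),  rep(0, 34)))
colnames(x) <- paste0("V", 2:9)
dat <- cbind(y = rnorm(40, 50, 7), x)

cs <- colSums(dat) < 10                                   # the "small" columns
keep_sum <- rowSums(dat[, cs, drop = FALSE]) == 0         # row-sum variant
keep_all <- apply(dat[, cs, drop = FALSE] == 0, 1, all)   # apply(..., all) variant

res <- dat[keep_all, !cs, drop = FALSE]
print(identical(keep_sum, keep_all))
print(dim(res))
```

The three small columns have their 1s in rows 1 to 8, so those eight rows are dropped along with the three columns.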

Resources