Delete rows that contain NAs in certain columns in R

I have a data.frame that contains many columns. I want to keep only the rows that have no NAs in 4 of these columns. The complication arises from the fact that the other columns are allowed to have NAs, so I can't simply use complete.cases or is.na on the whole data.frame. What's the most efficient way to do this?

You can still use complete.cases(). Just apply it to the desired columns (columns 1:4 in the example below) and then use the logical vector it returns to select valid rows from the entire data.frame.
set.seed(4)
x <- as.data.frame(replicate(6, sample(c(1:10,NA))))
x[complete.cases(x[1:4]),]
# V1 V2 V3 V4 V5 V6
# 1 7 4 6 8 10 5
# 2 1 2 5 5 1 2
# 5 6 8 4 10 6 6
# 6 2 6 9 3 4 4
# 7 4 3 3 1 2 1
# 9 8 5 2 7 7 3
# 10 10 10 1 2 5 NA
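For completeness, the same selection can be written without complete.cases(); a minimal sketch, assuming the x defined above (the second line additionally assumes the tidyr package is installed):
# base R: keep rows with zero NAs among the first four columns
x[rowSums(is.na(x[1:4])) == 0, ]
# tidyverse equivalent: drop rows with NAs in V1:V4 only
tidyr::drop_na(x, V1:V4)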

Related

Function to remove columns with max value less than a given value

I'm doing initial data clean-up on a data frame with 34,000 columns, and in order to do that I have to remove the columns whose max value is less than 2.
I'm not sure how to remove columns whose max value is less than 2, but just to get the max values I tried creating a function as below (the is.numeric line was added after my first attempt, without any conversion, failed):
protein <- is.numeric(protein)  # intended to convert the data to numeric
colMax <- function(data) sapply(data, max, na.rm = TRUE)
colMax(protein)
I got the 'max' not meaningful for factors error, which is why I used the is.numeric function to convert all the data to numeric form. Despite doing that, I am still not getting the desired result: when running the function I got 0 rather than a list of max values for each column.
Why am I getting 0 from my max function? How do I set up a function that can generate max values for each column and remove any columns whose max value is less than 2? Would I need 2 separate functions?
Here is another way, using dplyr, to select columns whose max value is greater than or equal to 2, assuming we want to test all the columns and all of them are of class factor. Using @Maurits' data:
library(dplyr)
df %>%
  # convert each column from factor to numeric
  mutate_all(~as.numeric(as.character(.))) %>%
  # keep columns whose max value is greater than or equal to 2
  select_if(~max(., na.rm = TRUE) >= 2)
# V2 V3 V4 V5 V6 V7 V8 V9 V10
#1 2 3 4 5 6 7 8 9 10
#2 2 3 4 5 6 7 8 9 10
#3 2 3 4 5 6 7 8 9 10
#4 2 3 4 5 6 7 8 9 10
#5 2 3 4 5 6 7 8 9 10
#6 2 3 4 5 6 7 8 9 10
#7 2 3 4 5 6 7 8 9 10
#8 2 3 4 5 6 7 8 9 10
#9 2 3 4 5 6 7 8 9 10
#10 2 3 4 5 6 7 8 9 10
Instead of max, we can also use any:
df %>%
  mutate_all(~as.numeric(as.character(.))) %>%
  select_if(~any(. >= 2, na.rm = TRUE))
You say that you have 34,000 columns. Do you want to check the greater-than-or-equal-to-2 condition for all of them? Are all the columns factors? The code above checks every column and keeps the ones which satisfy the condition. If you want to do this only on selected columns (not all), you would subset the data, apply the check to those columns, and then recombine (see the single-function sketch after the base R version below).
In base R, we can also use colSums after converting the data from factor to numeric
df[] <- lapply(df, function(x) as.numeric(as.character(x)))
df[, colSums(df >= 2) > 0]
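As for why you got 0: is.numeric() doesn't convert anything; it merely tests whether an object is numeric and returns a single TRUE or FALSE. So protein <- is.numeric(protein) overwrites your whole data frame with the logical FALSE, and sapply(FALSE, max, na.rm = TRUE) evaluates max(FALSE), which is 0. The conversion you want is as.numeric(as.character(.)) per column, as in the code above. You also don't need two separate functions; here is a sketch of a single helper (the name drop_low_max_cols, the threshold argument, and the optional cols subset are illustrative, not from the question):
# convert the chosen factor columns to numeric, then drop those whose
# max is below `threshold`; `cols` optionally restricts the check
drop_low_max_cols <- function(data, threshold = 2, cols = names(data)) {
  data[cols] <- lapply(data[cols], function(x) as.numeric(as.character(x)))
  low <- sapply(data[cols], function(x) max(x, na.rm = TRUE) < threshold)
  data[, setdiff(names(data), cols[low]), drop = FALSE]
}
drop_low_max_cols(df)  # with @Maurits' sample df this keeps V2:V10 (only V1 has max < 2)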
You were nearly there.
Since you don't provide reproducible sample data, let's first create a minimal sample data set:
df <- as.data.frame(matrix(rep(1:10, each = 10), ncol = 10))
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#1 1 2 3 4 5 6 7 8 9 10
#2 1 2 3 4 5 6 7 8 9 10
#3 1 2 3 4 5 6 7 8 9 10
#4 1 2 3 4 5 6 7 8 9 10
#5 1 2 3 4 5 6 7 8 9 10
#6 1 2 3 4 5 6 7 8 9 10
#7 1 2 3 4 5 6 7 8 9 10
#8 1 2 3 4 5 6 7 8 9 10
#9 1 2 3 4 5 6 7 8 9 10
#10 1 2 3 4 5 6 7 8 9 10
We now would like to keep only those columns where the max value is > 2; we can do this using sapply:
df[sapply(df, function(x) max(x, na.rm = TRUE) > 2)]
# V3 V4 V5 V6 V7 V8 V9 V10
#1 3 4 5 6 7 8 9 10
#2 3 4 5 6 7 8 9 10
#3 3 4 5 6 7 8 9 10
#4 3 4 5 6 7 8 9 10
#5 3 4 5 6 7 8 9 10
#6 3 4 5 6 7 8 9 10
#7 3 4 5 6 7 8 9 10
#8 3 4 5 6 7 8 9 10
#9 3 4 5 6 7 8 9 10
#10 3 4 5 6 7 8 9 10
Explanation: sapply loops over the columns of the data.frame df and returns a logical vector (with as many entries as there are columns in df).
Or we can work on the logical matrix df > 2 with apply:
df[apply(df > 2, 2, any)]
giving the same result. The difference to the first method is that df > 2 returns a logical matrix on which we operate column-wise with apply(..., MARGIN = 2, ...); any() is TRUE exactly when at least one entry in a column exceeds 2, i.e. when the column max exceeds 2.
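For completeness, base R's Filter() gives the same column selection in a single line; a minimal sketch, assuming the numeric df from above:
# keep the data.frame columns whose max exceeds 2
Filter(function(x) max(x, na.rm = TRUE) > 2, df)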

Order data frame by column and display WITH indices

I have the following R data frame
> df
a
1 3
3 2
4 1
5 3
6 6
7 7
8 2
10 8
I order it by the a column with the order function, df[order(df$a), ]:
[1] 1 2 2 3 3 6 7 8
This is the result I want, BUT how can I list the whole data frame with the permuted indices?
The only thing that works is the following, but it seems sloppy and I don't really understand what it does:
> df[ order(df$a), c(1,1) ] # I want this but without the a.1 column!
a a.1
4 1 1
3 2 2
8 2 2
1 3 3
5 3 3
6 6 6
7 7 7
10 8 8
Thanks
If we need the indices as well, use sort with index.return = TRUE, which returns a list holding the sorted values (x) and their original positions (ix):
data.frame(sort(df$a, index.return=TRUE))
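A minimal alternative, assuming the single-column df from the question: subset with order(), but add drop = FALSE so the one-column data.frame is not simplified to a vector; the original row names (the permuted indices you want) then print automatically.
df[order(df$a), , drop = FALSE]
#    a
# 4  1
# 3  2
# 8  2
# 1  3
# 5  3
# 6  6
# 7  7
# 10 8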

Reduce columns of a matrix by a function in R

I have a matrix sort of like:
data <- round(runif(30)*10)
dimnames <- list(c("1","2","3","4","5"),c("1","2","3","2","3","2"))
values <- matrix(data, ncol=6, dimnames=dimnames)
# 1 2 3 2 3 2
# 1 5 4 9 6 7 8
# 2 6 9 9 1 2 5
# 3 1 2 5 3 10 1
# 4 6 5 1 8 6 4
# 5 6 4 5 9 4 4
Some of the column names are the same. I want to essentially reduce the columns in this matrix by taking the min of all values in the same row where the columns have the same name. For this particular matrix, the result would look like this:
# 1 2 3
# 1 5 4 7
# 2 6 1 2
# 3 1 1 5
# 4 6 4 1
# 5 6 4 4
The actual data set I'm using here has around 50,000 columns and 4,500 rows. None of the values are missing and the result will have around 40,000 columns. The way I tried to solve this was by melting the data then using group_by from dplyr before reshaping back to a matrix. The problem is that it takes forever to generate the data frame from the melt and I'd like to be able to iterate faster.
We can use rowMins from library(matrixStats):
library(matrixStats)
res <- vapply(split(1:ncol(values), colnames(values)),
              function(i) rowMins(values[, i, drop = FALSE]),
              rep(0, nrow(values)))
res
# 1 2 3
#[1,] 5 4 7
#[2,] 6 1 2
#[3,] 1 1 5
#[4,] 6 4 1
#[5,] 6 4 4
row.names(res) <- row.names(values)
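A base R alternative without matrixStats: for each distinct column name, take the element-wise minimum of the columns sharing that name with pmin(). A sketch, assuming the values matrix from the question:
res2 <- sapply(unique(colnames(values)), function(nm) {
  # all columns carrying this name, as a list of vectors for pmin()
  same <- as.data.frame(values[, colnames(values) == nm, drop = FALSE])
  do.call(pmin, unname(same))  # element-wise min across these columns
})
rownames(res2) <- rownames(values)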

Obtaining a vector of non-removed row indices

I have two data frames which initially contain the same elements, but after eliminating some rows from one of them they are no longer the same length.
x <-c(4,2,3,6,7,3,1,8,5,2,4,1,2,6,3)
y <-c(1,4,2,3,6,7,3,1,8,5,2,3,1,4,3)
z <-c(4,2,3,1,8,5,2,4,1)
k <-c(1,4,2,3,1,8,5,2,3)
df1 <- data.frame(x,y)
df2 <- data.frame(z,k)
I would like to find a way, in the second data frame (df2), to create a column or index reference holding the row numbers from the first data frame (df1), so that it results in a new data frame as follows (a is the index reference from df1):
df3
a z k
1 1 4 1
2 2 2 4
3 3 3 2
4 7 1 3
5 8 8 1
6 9 5 8
7 10 2 5
8 11 4 2
9 12 1 3
I could manually create a column of all the rows that were eliminated, or use
library(sqldf)
a1NotIna2 <- (sqldf('SELECT * FROM df1 EXCEPT SELECT * FROM df2'))
a1NotIna2
x y
1 2 1
2 3 3
3 3 7
4 6 3
5 6 4
6 7 6
I have tried using which on this last expression, without success, to find out which rows of df1 were eliminated; the idea was to remove those positions from a sequence vector of length nrow(df1) so as to obtain an index vector like the one in df3.
Any help is welcome.
A generic solution if your data.frames have two columns, using pmatch. Unlike match, pmatch uses each position of the lookup table at most once, so a duplicated row in df2 is matched to the next unused occurrence in df1:
transform(df2, a=pmatch(do.call(paste0, df2), do.call(paste0, df1)))
# z k a
#1 4 1 1
#2 2 4 2
#3 3 2 3
#4 1 3 7
#5 8 1 8
#6 5 8 9
#7 2 5 10
#8 4 2 11
#9 1 3 12
You can get the first matching row of df1 for each row in df2 with:
match(paste(df2$z, df2$k), paste(df1$x, df1$y))
# [1] 1 2 3 7 8 9 10 11 7
Unfortunately this won't maintain ordering when you have duplicated rows, so for instance we got index 7 for the last row of df2 instead of 12.
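If you prefer match(), a common workaround for duplicates is to append a per-key occurrence counter so repeated rows become distinct keys; a sketch using ave():
key1 <- paste(df1$x, df1$y)
key2 <- paste(df2$z, df2$k)
# number the occurrences of each key, so the 2nd "1 3" matches the 2nd "1 3"
match(paste(key2, ave(seq_along(key2), key2, FUN = seq_along)),
      paste(key1, ave(seq_along(key1), key1, FUN = seq_along)))
# [1]  1  2  3  7  8  9 10 11 12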

Delete rows in R if a cell contains a value larger than x

I want to delete all rows containing a value larger than 7 in a cell in an arbitrary column, either across all columns or across specific columns.
a <- c(3,6,99,7,8,9)
b <- c(99,6,3,4,5,6)
c <- c(2,5,6,7,8,3)
df <- data.frame(a, b, c)
a b c
1 3 99 2
2 6 6 5
3 99 3 6
4 7 4 7
5 8 5 8
6 9 6 3
V1:
I want to delete all rows containing values larger than 7, regardless of the column.
# result V1
a b c
2 6 6 5
4 7 4 7
V2:
I want to delete all rows containing values larger than 7 in column b and c
# result V2
a b c
2 6 6 5
3 99 3 6
4 7 4 7
6 9 6 3
There are plenty of similar problems on SO, but I couldn't find a solution to this one. So far I can only find the rows that include 7, using res <- df[rowSums(df != 7) < ncol(df), ].
rowSums of the logical matrix df > 7 gives the number of TRUE values in each row; we get 0 if there are none for that particular row. Negating the result with ! turns 0 into TRUE and every other (non-zero) value into FALSE, which can be used for subsetting.
df[!rowSums(df >7),]
# a b c
#2 6 6 5
#4 7 4 7
For 'V2', we use the same principle, except that we build the logical matrix on a subset of 'df', i.e. selecting only the second and third columns (df[-1] drops the first column).
df[!rowSums(df[-1] >7),]
# a b c
#2 6 6 5
#3 99 3 6
#4 7 4 7
#6 9 6 3
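For reference, the same row filtering can also be written with dplyr; a sketch assuming dplyr >= 1.0.4 for if_all():
library(dplyr)
df %>% filter(if_all(everything(), ~ . <= 7))  # V1: keep rows with no value above 7
df %>% filter(if_all(c(b, c), ~ . <= 7))       # V2: check only columns b and c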
