Function to remove columns with max value less than a given value in R

I'm doing initial data cleanup on a dataframe with 34,000 columns, and to do that I have to remove the columns whose max value is less than 2.
I have no idea how to remove columns whose max value is less than 2, but just to get the max values I first tried the function below without converting the data, and then added the is.numeric line:
protein <- is.numeric(protein)
#a:
colMax <- function(data) sapply(data, max, na.rm = TRUE)
colMax(protein)
I got the "'max' not meaningful for factors" error, which is why I used is.numeric to try to convert all the data to numeric form. Despite doing that, I am still not getting the desired result: when running the function I get 0 rather than a list of max values, one per column.
Why am I getting 0 from my max function? How do I set up a function that generates the max value of each column and removes any column whose max value is less than 2? Would I need 2 separate functions?

Here is another way, using dplyr, to select the columns whose max value is greater than or equal to 2. This assumes that we want to test all the columns and that all of them are of class factor. Using @Maurits' data:
library(dplyr)
df %>%
  # Convert columns from factor to numeric
  mutate_all(~as.numeric(as.character(.))) %>%
  # Select columns whose max value is greater than or equal to 2
  select_if(~max(., na.rm = TRUE) >= 2)
# V2 V3 V4 V5 V6 V7 V8 V9 V10
#1 2 3 4 5 6 7 8 9 10
#2 2 3 4 5 6 7 8 9 10
#3 2 3 4 5 6 7 8 9 10
#4 2 3 4 5 6 7 8 9 10
#5 2 3 4 5 6 7 8 9 10
#6 2 3 4 5 6 7 8 9 10
#7 2 3 4 5 6 7 8 9 10
#8 2 3 4 5 6 7 8 9 10
#9 2 3 4 5 6 7 8 9 10
#10 2 3 4 5 6 7 8 9 10
Instead of max, we can also use any
df %>%
mutate_all(~as.numeric(as.character(.))) %>%
select_if(~any(. >= 2))
You say that you have 34000 columns. Do you want to check the "greater than or equal to 2" condition for all of them? Are all the columns factors? The code above checks every column and selects the ones that satisfy the condition. If you want to do this on selected columns only (not all), you might need to subset the data, select those columns and then apply the code.
In base R, we can also use colSums after converting the data from factor to numeric
df[] <- lapply(df, function(x) as.numeric(as.character(x)))
df[, colSums(df >= 2, na.rm = TRUE) > 0, drop = FALSE]
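On the question of whether two separate functions are needed: one function is probably enough. Below is a minimal base R sketch, assuming that any factor columns hold numeric labels. The name keep_cols_by_max and its threshold argument are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: keep the columns whose maximum is at least `threshold`.
keep_cols_by_max <- function(data, threshold = 2) {
  # Go through as.character() so factor labels, not level codes, are converted
  num <- lapply(data, function(x) as.numeric(as.character(x)))
  # Column maxima; an all-NA column gets NA instead of -Inf with a warning
  col_max <- vapply(
    num,
    function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE),
    numeric(1)
  )
  # Return the original (unconverted) columns that pass the test
  data[!is.na(col_max) & col_max >= threshold]
}

df <- data.frame(a = factor(c("0", "1", "1")),
                 b = factor(c("1", "5", "2")),
                 c = c(0.5, 2, NA))
res <- keep_cols_by_max(df)
names(res)
# "b" "c"
```

The helper returns the selected columns in their original class; convert them afterwards if numeric columns are needed downstream.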

You were nearly there.
Since you don't provide reproducible sample data, let's first create some minimal sample data:
df <- as.data.frame(matrix(rep(1:10, each = 10), ncol = 10))
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#1 1 2 3 4 5 6 7 8 9 10
#2 1 2 3 4 5 6 7 8 9 10
#3 1 2 3 4 5 6 7 8 9 10
#4 1 2 3 4 5 6 7 8 9 10
#5 1 2 3 4 5 6 7 8 9 10
#6 1 2 3 4 5 6 7 8 9 10
#7 1 2 3 4 5 6 7 8 9 10
#8 1 2 3 4 5 6 7 8 9 10
#9 1 2 3 4 5 6 7 8 9 10
#10 1 2 3 4 5 6 7 8 9 10
We now would like to keep only those columns where the max value is > 2 (use >= 2 instead if columns whose max is exactly 2 should also be kept, as in the question); we can do this using sapply
df[sapply(df, function(x) max(x, na.rm = TRUE) > 2)]
# V3 V4 V5 V6 V7 V8 V9 V10
#1 3 4 5 6 7 8 9 10
#2 3 4 5 6 7 8 9 10
#3 3 4 5 6 7 8 9 10
#4 3 4 5 6 7 8 9 10
#5 3 4 5 6 7 8 9 10
#6 3 4 5 6 7 8 9 10
#7 3 4 5 6 7 8 9 10
#8 3 4 5 6 7 8 9 10
#9 3 4 5 6 7 8 9 10
#10 3 4 5 6 7 8 9 10
Explanation: sapply loops over the columns of the data.frame df and returns a logical vector (with as many entries as there are columns in df).
Or we can build a logical matrix with df > 2 and use apply
df[apply(df > 2, 2, any, na.rm = TRUE)]
giving the same result. The difference from the first method is that df > 2 returns a logical matrix, on which we operate column-wise with apply(..., MARGIN = 2, ...) to check whether any entry in each column exceeds 2.
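As for the 0 in the question, the most likely explanation is that is.numeric() only tests its argument and returns a single TRUE or FALSE; it does not convert anything. The sketch below uses a made-up two-column protein data frame (the real data is not shown in the question) to reproduce the symptom and then does the conversion explicitly.

```r
# Made-up stand-in for the asker's data: two factor columns
protein <- data.frame(a = factor(c("1", "3")),
                      b = factor(c("0", "1")))

# is.numeric() is a test, not a conversion: a data.frame is a list, so this is FALSE
flag <- is.numeric(protein)

# Taking the max of that single FALSE gives 0, which matches the symptom described
zero <- sapply(flag, max, na.rm = TRUE)

# Explicit conversion, column by column, keeping the data.frame structure
protein[] <- lapply(protein, function(x) as.numeric(as.character(x)))
col_max <- sapply(protein, max, na.rm = TRUE)
col_max
# a b
# 3 1
```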

Related

How to collect outputs of vector-valued function into a dataframe?

I have a function f1 that takes a number k as input and returns 3 numbers k, k+1, k+2. I would like to ask how to concatenate these results into a dataframe for k from 1 to 10. In this way, the line k corresponds to the output f1(k).
f1 <- function(k){
return (c(k, k+1, k+2))
}
f1(1)
f1(2)
An option is to Vectorize the function 'f1' and pass it the values 1 to 10. This returns a matrix with one column per value of k, which can be converted to a data.frame with as.data.frame
as.data.frame(Vectorize(f1)(1:10))
If each k should be a row instead, transpose the output first and then apply as.data.frame
as.data.frame(t(Vectorize(f1)(1:10)))
Output:
# V1 V2 V3
#1 1 2 3
#2 2 3 4
#3 3 4 5
#4 4 5 6
#5 5 6 7
#6 6 7 8
#7 7 8 9
#8 8 9 10
#9 9 10 11
#10 10 11 12
Or we can use outer
as.data.frame(outer(1:10, 0:2, `+`))
You can also use:
as.data.frame(do.call(rbind,lapply(1:10,f1)))
Output:
as.data.frame(do.call(rbind,lapply(1:10,f1)))
V1 V2 V3
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 6
5 5 6 7
6 6 7 8
7 7 8 9
8 8 9 10
9 9 10 11
10 10 11 12
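Another base R variant, offered only as a sketch: vapply() declares the expected shape of each result up front (here, three numbers), so a call to f1 that returns something else fails immediately instead of silently producing a list.

```r
f1 <- function(k) c(k, k + 1, k + 2)

# vapply() returns a 3 x 10 matrix (one column per k); t() turns each k into a row
m <- t(vapply(1:10, f1, numeric(3)))
res <- as.data.frame(m)
dim(res)
# 10 3
```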

How to select consecutive columns with across function dplyr

I wanted to use the new across function from dplyr to select consecutive columns and replace the NAs in them with zeros. However, it does not work. It seems like a very simple thing, so it could be that I am missing something.
A working example:
> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 4 3 NA 3 7 6 6 10 6 5
2 9 8 9 5 10 NA 2 1 7 2
3 1 1 6 3 6 NA 1 4 1 6
4 NA 4 NA 7 10 2 NA 4 1 8
5 1 2 4 NA 2 6 2 6 7 4
6 NA 3 NA NA 10 2 1 10 8 4
7 4 4 9 10 9 8 9 4 10 NA
8 5 8 3 2 1 4 5 9 4 7
9 3 9 10 1 9 9 10 5 3 3
10 4 2 2 5 NA 9 7 2 5 5
This works fine:
d %>% mutate_at(vars(V1:V4), ~replace(., is.na(.), 0))
But if I try these options, I get an error:
d %>% mutate(across(vars(V1:V4)), ~replace(., is.na(.), 0))
d %>% mutate(across(V1:V4)), ~replace(., is.na(.), 0))
d %>% mutate(across("V1":"V4")), ~replace(., is.na(.), 0))
I am not sure why this doesn't work.
across() has two basic arguments. The first is the set of columns to be modified, and the second is the function to apply to those columns. Both go inside the across() call; in the attempts above, the function was passed outside it. In addition, vars() is no longer needed to select the variables. Thus, the correct form is:
d %>%
mutate(across(V1:V4, ~ replace(., is.na(.), 0)))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 2 6 0 6 5 6 10 5 3 1
2 2 9 2 4 10 6 9 4 NA NA
3 5 5 3 0 3 7 1 5 9 5
4 7 1 1 6 2 1 8 NA 8 4
5 3 5 3 0 2 3 4 2 3 NA
6 0 10 0 2 5 10 1 10 4 3
7 4 3 10 6 NA 5 9 3 3 9
8 9 9 8 5 8 1 3 1 NA 10
9 6 3 0 1 1 9 3 5 8 4
10 3 2 9 1 5 2 4 NA 6 1
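For comparison, the same replacement can be sketched in base R without dplyr. The small data frame below is made up for illustration; only the column range V1:V4 mirrors the question.

```r
# Made-up data with NAs inside and outside the target columns
d <- data.frame(V1 = c(NA, 2), V2 = c(1, NA), V3 = c(NA, 3),
                V4 = c(4, NA), V5 = c(NA, 5))

# Names of the consecutive columns to modify
cols <- paste0("V", 1:4)

# Replace NAs with 0 in those columns only; V5 is left untouched
d[cols][is.na(d[cols])] <- 0
d
```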

selecting common columns from different elements of a list

I have a data set in list format. The list is divided into 20 elements, and each element contains 12 rows and some columns. Now I want to extract the columns that are common to every element of the list and make a new data set. I tried to make a reproducible example. Please see the code:
library(plyr)  # for ldply
a <- data.frame(x = 1:10, y = 1:10, z = 1:10)
b <- data.frame(x = 1:10, y = 1:10, n = 1:10)
c <- data.frame(x = 1:10, y = 1:10, q = 1:10)
data <- list(a, b, c)
data1 <- ldply(data)
required_data <- data1[, -3:-5]
Find the common columns using Reduce, subset them from list and bind them together
cols <- Reduce(intersect, lapply(data, colnames))
do.call(rbind, lapply(data, `[`, cols))
# x y
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5
#6 6 6
#7 7 7
#8 8 8
#9 9 9
#10 10 10
#11 1 1
#...
The last step can also be performed using
purrr::map_df(data, `[`, cols)
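The two steps can also be wrapped into one function. This is a minimal sketch; bind_common is a hypothetical name, and the three small data frames are made up so that one column (z) is shared by only two of the three elements and is therefore dropped.

```r
# Hypothetical helper: keep only the columns present in every element, then stack
bind_common <- function(lst) {
  cols <- Reduce(intersect, lapply(lst, names))
  do.call(rbind, lapply(lst, function(d) d[cols]))
}

a  <- data.frame(x = 1:2, y = 3:4, z = 5:6)
b  <- data.frame(y = 7:8, x = 9:10, z = 0:1)   # same names, different order
c3 <- data.frame(x = 11:12, y = 13:14, q = 15:16)

res <- bind_common(list(a, b, c3))
names(res)
# "x" "y"
```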
With base R, you can first find the names that occur in every element of the list
commonName <- names((r <- table(unlist(Map(names, data))))[r == length(data)])
then retrieve the columns from the list and combine them (similar to the second step in the solution by @Ronak Shah)
res <- Reduce(rbind,lapply(data, '[',commonName))
which gives:
> res
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
11 1 1
12 2 2
13 3 3
14 4 4
15 5 5
16 6 6
17 7 7
18 8 8
19 9 9
20 10 10
21 1 1
22 2 2
23 3 3
24 4 4
25 5 5
26 6 6
27 7 7
28 8 8
29 9 9
30 10 10

Removing certain rows and replacing values based on a condition

I have the following data:
set.seed(2)
d <- data.frame(iteration=c(1,1,2,2,2,3,4,5,6,6,6),
value=sample(11),
var3=sample(11))
iteration value var3
1 1 3 7
2 1 8 4
3 2 6 8
4 2 2 3
5 2 7 9
6 3 9 11
7 4 1 10
8 5 4 1
9 6 10 2
10 6 11 6
11 6 5 5
Now, I want the following:
1. If an iteration has more than one row, remove its last row AND replace the value of the row before it with the value of the removed row.
So for the example above, here is the output that I want:
d <- data.frame(iteration = c(1,2,2,3,4,5,6,6),
value = c(8,6,7,9,1,4,10,5),
var3 = c(7,8,3,11,10,1,2,6))
iteration value var3
1 1 8 7
2 2 6 8
3 2 7 3
4 3 9 11
5 4 1 10
6 5 4 1
7 6 10 2
8 6 5 6
We can use data.table
library(data.table)
setDT(d)[, .(value = if(.N>1) c(value[seq_len(.N-2)], value[.N]) else value), iteration]
# iteration value
#1: 1 8
#2: 2 6
#3: 2 7
#4: 3 9
#5: 4 1
#6: 5 4
#7: 6 10
#8: 6 5
Update
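Based on the update in OP's post, we can first create a new column 'value1' holding the lead values of 'value' within each 'iteration', assign 'value1' to 'value' only for the rows indexed by 'i1' (the second-to-last row of each group with more than one row), and then subset the rows to drop the last row of those groups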
setDT(d)[, value1 := shift(value, type = "lead"), iteration]
i1 <- d[, if(.N >1) .I[.N-1], iteration]$V1
d[i1, value := value1]
d[d[, if(.N > 1) .I[-.N] else .I, iteration]$V1][, value1 := NULL][]
# iteration value var3
#1: 1 8 7
#2: 2 6 8
#3: 2 7 3
#4: 3 9 11
#5: 4 1 10
#6: 5 4 1
#7: 6 10 2
#8: 6 5 6
This base R solution using the split-apply-combine methodology returns the same values as @akrun's first data.table version, although the logic is different.
do.call(rbind, lapply(split(d, d$iteration),
function(i)
if(nrow(i) >= 3) i[-(nrow(i)-1),] else tail(i, 1)))
iteration value
1 1 8
2.3 2 6
2.5 2 7
3 3 9
4 4 1
5 5 4
6.9 6 10
6.11 6 5
The idea is to split the data.frame into a list of data.frames along iteration, then for each data.frame check whether it has three or more rows: if yes, drop the second-to-last row; if no, return only the final row. do.call with rbind then combines these pieces into a single data.frame.
Note that this will not work in the presence of other variables, because each kept row carries its own values for them (the final row brings its own var3, for example).
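A base R sketch that also keeps other variables such as var3 could copy the last value back one row and then drop the last row, group by group. The helper name drop_last_keep_value is hypothetical, and the data is typed in from the table shown in the question rather than regenerated with sample().

```r
# Data as printed in the question
d <- data.frame(iteration = c(1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 6),
                value     = c(3, 8, 6, 2, 7, 9, 1, 4, 10, 11, 5),
                var3      = c(7, 4, 8, 3, 9, 11, 10, 1, 2, 6, 5))

# Hypothetical helper applied to the rows of one iteration
drop_last_keep_value <- function(g) {
  n <- nrow(g)
  if (n > 1) {
    g$value[n - 1] <- g$value[n]  # carry the last value back one row
    g <- g[-n, , drop = FALSE]    # then drop the last row
  }
  g
}

res <- do.call(rbind, lapply(split(d, d$iteration), drop_last_keep_value))
rownames(res) <- NULL
res
```

Because only the value column is overwritten, every other column keeps the entry from its own row, which is what the desired output in the question shows.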

delete rows that contain NAs in certain columns R

I have a data.frame that contains many columns. I want to keep the rows that have no NAs in 4 of these columns. The complication arises from the fact that the other columns are allowed to have NAs in them, so I can't simply use complete.cases or is.na on the whole data.frame. What's the most efficient way to do this?
You can still use complete.cases(). Just apply it to the desired columns (columns 1:4 in the example below) and then use the Boolean vector it returns to select valid rows from the entire data.frame.
set.seed(4)
x <- as.data.frame(replicate(6, sample(c(1:10,NA))))
x[complete.cases(x[1:4]),]
# V1 V2 V3 V4 V5 V6
# 1 7 4 6 8 10 5
# 2 1 2 5 5 1 2
# 5 6 8 4 10 6 6
# 6 2 6 9 3 4 4
# 7 4 3 3 1 2 1
# 9 8 5 2 7 7 3
# 10 10 10 1 2 5 NA
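The is.na() route mentioned in the question works the same way once it is restricted to the columns of interest. A minimal sketch on made-up data, where key_cols is a hypothetical name for the columns that must be complete:

```r
# Made-up data: column d may contain NAs, columns a, b and c may not
x <- data.frame(a = c(1, NA, 3, 4),
                b = c(1, 2, 3, 4),
                c = c(NA, 2, 3, 4),
                d = c(1, 2, NA, NA))

key_cols <- c("a", "b", "c")

# Count NAs per row in the key columns only; keep rows with none
keep <- rowSums(is.na(x[key_cols])) == 0
res <- x[keep, ]
rownames(res)
# "3" "4"
```

Selecting by name rather than by position keeps the filter correct if the column order changes.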

Resources