Using R to remove all columns that sum to 0

I have a very large CSV file containing counts of unique DNA sequences, with a column for each unique sequence. I started with hundreds of samples and cut it down to only the 15 that I care about, but now I have thousands of columns that contain nothing but zeroes, and it is messing up my data processing. How do I go about completely removing any column that sums to zero? I've seen some similar questions on here, but none of those suggestions have worked for me.
I have 6653 columns and 16 rows in my data frame.
If it matters, my columns all have super long names, some several hundred characters (AATCGGCTAA..., etc.), and the row names are the sample IDs, which are also not entirely numeric. Any tips greatly appreciated. I am still new to R, so please let me know where I will need to change things in code examples if you can! Thanks!

You can use colSums to build a logical index of the columns to keep:
set.seed(10)
df <- as.data.frame(matrix(sample(0:1, 50, replace = TRUE, prob = c(.8, .2)),
                           5, 10))
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 0 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
df[colSums(df) != 0]
# V4 V5 V6 V7 V8 V10
# 1 0 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
But you might not want to remove every column that sums to 0, because a column can sum to 0 even when not all of its elements are 0. Take V4 in the data frame below as an example.
df$V4[1] <- -1
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 -1 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
So if you want to remove only the columns where all elements are 0, you can do
df[colSums(df == 0) < nrow(df)]
# V4 V5 V6 V7 V8 V10
# 1 -1 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
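Applied to your own data, this is a one-liner. A minimal sketch, assuming your counts were read into a data frame called counts (the object and file names here are placeholders; adjust them to match your CSV):
counts <- read.csv("counts.csv", row.names = 1, check.names = FALSE)
counts_nonzero <- counts[colSums(counts) != 0]  # drop the all-zero columns
dim(counts_nonzero)  # should show 16 rows and far fewer than 6653 columns
check.names = FALSE keeps your long sequence names exactly as they appear in the file, rather than letting R rewrite them into syntactically valid names.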

Welcome to SO! Here is a tidyverse approach:
library(tidyverse)
mtcars %>%
  select_if(is.numeric) %>%
  select_if(~ sum(.x) > 0)
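Note that select_if() still works but is superseded in current dplyr; a sketch of the same idea with select(where()), assuming dplyr >= 1.0.0:
library(dplyr)
mtcars %>%
  select(where(is.numeric)) %>%
  select(where(~ sum(.x) > 0))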

Related

R: How to drop columns with less than 10% 1's

My dataset:
a b c
1 1 0
1 0 0
1 1 0
I want to drop columns which have less than 10% 1's. I have this code but it's not working:
sapply(df, function(x) df[df[,c(x)]==1]>0.1))
Maybe I need a totally different approach.
Try this option with apply() and a small helper function to test the proportion of 1's in each column against the threshold. I have created a dummy example. The index i contains the columns that will be dropped after using myfun to compute the proportion of 1's in each column. Here is the code:
#Data
df <- as.data.frame(matrix(c(1,0),20,10))
df$V1<-c(1,rep(0,19))
df$V2<-c(1,rep(0,19))
#Function
myfun <- function(x) {sum(x==1)/length(x)}
#Index For removing
i <- unname(which(apply(df,2,myfun)<0.1))
#Drop
df2 <- df[,-i]
The output:
df2
V3 V4 V5 V6 V7 V8 V9 V10
1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1
6 0 0 0 0 0 0 0 0
7 1 1 1 1 1 1 1 1
8 0 0 0 0 0 0 0 0
9 1 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0 0
11 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0
13 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0
15 1 1 1 1 1 1 1 1
16 0 0 0 0 0 0 0 0
17 1 1 1 1 1 1 1 1
18 0 0 0 0 0 0 0 0
19 1 1 1 1 1 1 1 1
20 0 0 0 0 0 0 0 0
Columns V1 and V2 are dropped because their proportion of 1's is less than 0.1.
You can use colMeans in base R to keep columns that have at least 10% 1's. Note that the logical condition goes in the column position of [, after the comma:
df[, colMeans(df == 1) >= 0.1]
Or in dplyr, use select with where:
library(dplyr)
df %>% select(where(~mean(. == 1) >= 0.1))
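If you prefer base R without [ indexing, Filter() does the same column-wise selection on a data frame, since a data frame is a list of columns:
# keep the columns for which the predicate returns TRUE
Filter(function(x) mean(x == 1) >= 0.1, df)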

rmultinom() – but transposed?

I want a data frame of multinomially distributed dummies, with the probabilities applied to the columns. I have the following code, which seems a bit awkward. Does anyone have a better idea?
set.seed(1234)
data.table::transpose(data.frame(rmultinom(10, 1, c(1:5)/5)))
# V1 V2 V3 V4 V5
# 1 0 0 0 1 0
# 2 0 0 0 0 1
# 3 0 0 0 0 1
# 4 0 1 0 0 0
# 5 0 0 0 0 1
# 6 0 0 0 0 1
# 7 0 0 0 1 0
# 8 0 1 0 0 0
# 9 0 0 0 0 1
# 10 0 0 0 1 0
A little shorter, and it doesn't involve multiple coercions:
data.frame(t(rmultinom(10, 1, c(1:5)/5)))
or
library(data.table)
data.table(t(rmultinom(10, 1, c(1:5)/5)))
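One thing worth knowing: rmultinom() normalizes prob internally, so c(1:5)/5 (which sums to 3) gives the same distribution as (1:5)/15. A quick sanity check on a large sample:
set.seed(1234)
# column means of many draws should be close to (1:5)/15,
# i.e. roughly 0.067, 0.133, 0.200, 0.267, 0.333
colMeans(data.frame(t(rmultinom(1e5, 1, (1:5)/5))))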

How to get a matrix with all combinations of three values in R?

Suppose we have a vector x with three values:
x <- c(0,1,2)
How can I fill a matrix with 5 columns (V1, V2, V3, V4, V5) with all combinations of those values?
For example, we'd have:
V1 V2 V3 V4 V5
0 0 0 0 0
0 0 0 0 1
0 0 0 1 1
...
0 1 0 0 0
...
1 1 1 1 1
...
1 2 1 0 1
...
Is there a way to do that?
Something like:
head(expand.grid(x,x,x,x,x))
Var1 Var2 Var3 Var4 Var5
1 0 0 0 0 0
2 1 0 0 0 0
3 2 0 0 0 0
4 0 1 0 0 0
5 1 1 0 0 0
6 2 1 0 0 0
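If you'd rather not repeat x five times, expand.grid() also accepts a list of vectors, so the arguments can be built with rep():
x <- c(0, 1, 2)
combos <- expand.grid(rep(list(x), 5))
nrow(combos)  # 3^5 = 243 combinations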

Choose the specific variable in R

I have data like this table:
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15
4 0 0 2 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 3 0 0 0 0 0 0 0 0 0 0 0
and I wish to make a new matrix containing only the columns that have a numerical value different from zero (in this case, v1 and v4).
I know the subset function, but I cannot find a way to choose columns conditionally using an "if statement". How can I make a matrix with only those columns?
Thanks.
You haven't specified what format your data is in, but if you have a matrix or a data.frame, you should be able to use the R extract operator ([) to specify only the columns you want. You can feed it a vector of logical values (TRUE or FALSE) for that specification, so all you need is a function that will return the logical values you want.
As a simple example with a matrix, you could apply a function that checks whether there are any non-zero values in each column of the matrix:
> a
[,1] [,2] [,3] [,4]
[1,] 0 1 0 4
[2,] 0 2 0 5
[3,] 0 3 0 6
> a[, apply(a, 2, function(x) { return(any(x != 0)) })]
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
This same extract mechanism works on data.frames as well:
> a
V1 V2 V3 V4
1 0 1 0 4
2 0 2 0 5
3 0 3 0 6
> a[, sapply(a, function(x) { return(any(x != 0)) })]
V2 V4
1 1 4
2 2 5
3 3 6
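A vectorized equivalent uses colSums instead of an explicit anonymous function; it works on both the matrix and the data.frame version of a:
# keep columns with at least one non-zero entry
a[, colSums(a != 0) > 0]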

Calculate correlation on every column

I want to calculate the correlation of V2 with each of V3, V4, ..., V18:
That is, cor(V2, V3, na.rm = TRUE), cor(V2, V4, na.rm = TRUE), etc.
What is the most effective way to do this?
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1 141_21311223 2.000 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 44_33331123 2.000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 247_11131211 2.065 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 33_31122113 2.080 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 277_21212111 2.090 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0
Converting my comment to an answer, one simple approach would be to use the column positions in a sapply statement:
sapply(3:ncol(mydf), function(y) cor(mydf[, 2], mydf[, y]))
This should create a vector of the output values. Change sapply to lapply if you prefer a list as the output.
Note that cor() has no na.rm argument; missing values are handled through its use argument instead.
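For example, a minimal sketch with use = "pairwise.complete.obs" (mydf being the asker's data frame):
sapply(3:ncol(mydf), function(y)
  cor(mydf[, 2], mydf[, y], use = "pairwise.complete.obs"))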
