Select or subset variables whose column sums are not zero - r

I want to select the variables in a data frame whose column sums are not zero, while also keeping the factor variables. It should be fairly simple, but I cannot figure out how to run select_if() on a subset of variables using dplyr:
df <- data.frame(
A = c("a", "a", "b", "c", "c", "d"),
B = c(0, 0, 0, 0, 0, 0),
C = c(3, 0, 0, 1, 1, 2),
D = c(0, 3, 2, 1, 4, 5)
)
df %>%
select_if(funs(sum(.) > 0))
#Error in Summary.factor(c(1L, 1L, 2L, 3L, 3L, 4L), na.rm = FALSE) :
# ‘sum’ not meaningful for factors
Then I tried to only select B, C, D and this works, but I won't have variable A:
df %>%
select(-A) %>%
select_if(funs(sum(.) > 0)) -> df2
# C D
#1 3 0
#2 0 3
#3 0 2
#4 1 1
#5 1 4
#6 2 5
I could simply do cbind(A = df$A, df2) but since I have a dataset with 3000 rows and 200 columns, I am afraid this could introduce errors (if values sort differently for example).
Trying to subset variables B, C, D in the sum() function doesn't work either:
df %>%
select_if(funs(sum(names(.[2:4])) > 0))
#data frame with 0 columns and 6 rows

Try this:
df %>% select_if(~ !is.numeric(.) || sum(.) != 0)
# A C D
# 1 a 3 0
# 2 a 0 3
# 3 b 0 2
# 4 c 1 1
# 5 c 1 4
# 6 d 2 5
The rationale is that || short-circuits: if the left-hand side is TRUE, the right-hand side is not evaluated, so sum() is never called on a non-numeric column.
The second argument of select_if() should be a function name or a formula (lambda function). The ~ is necessary to tell select_if() that !is.numeric(.) || sum(.) != 0 should be converted to a function.
As commented below by @zx8754, is.factor(.) should be used if one only wants to keep factor columns.
Edit: a base R solution
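cols <- c('B', 'C', 'D')
# of the candidate columns, keep those with a non-zero sum
keep <- cols[colSums(df[cols]) != 0]
# keep every non-candidate column plus the surviving candidates
df[!names(df) %in% cols | names(df) %in% keep]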
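The short-circuit rationale can also be checked without dplyr. The following is a minimal base R sketch; keep_col is a hypothetical helper name, and stringsAsFactors = TRUE is set only so that column A is a factor, as in the question.

```r
df <- data.frame(
  A = c("a", "a", "b", "c", "c", "d"),
  B = c(0, 0, 0, 0, 0, 0),
  C = c(3, 0, 0, 1, 1, 2),
  D = c(0, 3, 2, 1, 4, 5),
  stringsAsFactors = TRUE  # assumption: make A a factor, as in the question
)

# Hypothetical helper: TRUE for columns to keep.
# For a non-numeric column the left side is TRUE, so sum(x) is never evaluated.
keep_col <- function(x) !is.numeric(x) || sum(x) != 0

# Apply the predicate to every column and subset by the logical result
kept <- df[vapply(df, keep_col, logical(1))]
names(kept)
# "A" "C" "D"
```

Calling sum() directly on the factor column would raise the error shown in the question, which is why the order of the two conditions matters.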

Here is an update for everyone who wants to use dplyr 1.0.0 or later, where the scoped variants (such as select_if(), as nicely shown by @mt1022) are superseded:
df %>%
select(where(is.numeric)) %>%
select(where(~sum(.) != 0))
If you want to compress the two select() calls into one, use the scalar, short-circuiting && rather than the element-wise &: where() needs a single TRUE or FALSE per column, and && skips sum() for non-numeric columns:
df %>% select(where(~ is.numeric(.x) && sum(.x) !=0 ))
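Note that a predicate starting with is.numeric(.x) && keeps only numeric columns, so A is dropped. A minimal sketch that also keeps the non-numeric columns, assuming dplyr 1.0.0 or later is installed, reuses the || condition from the select_if() answer inside where():

```r
library(dplyr)

df <- data.frame(
  A = c("a", "a", "b", "c", "c", "d"),
  B = c(0, 0, 0, 0, 0, 0),
  C = c(3, 0, 0, 1, 1, 2),
  D = c(0, 3, 2, 1, 4, 5)
)

# Keep a column if it is not numeric, or if it is numeric with a non-zero sum
kept <- df %>% select(where(~ !is.numeric(.x) || sum(.x) != 0))
names(kept)
# "A" "C" "D"
```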

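This is a solution using data.table
library(data.table)
df <- data.table(
A = c("a", "a", "b", "c", "c", "d"),
B = c(0, 0, 0, 0, 0, 0),
C = c(3, 0, 0, 1, 1, 2),
D = c(0, 3, 2, 1, 4, 5)
)
# one-row table of column sums; non-numeric columns get NA
df2 <- df[, lapply(.SD, function(x) if (is.numeric(x)) sum(x) else NA_real_)]
# keep columns whose sum is NA (non-numeric) or non-zero
sums <- unlist(df2[1, ])
df[, which(is.na(sums) | sums != 0), with = FALSE]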


Comparing two columns in a dataframe using R

I am trying to compare two columns in a dataframe to find rows where the two columns are not equal.
I would do:
df %>% filter(column1 != column2)
This will give me cases where values exist in both columns and are not equal (e.g. column1 = 5, column2 = 6)
However it will not give me cases where one of the values is NA (e.g. column1 = NA, column2 = 7)
How can I include the latter case into the filter function?
Or use xor:
df %>% filter(a != b | xor(is.na(a), is.na(b)))
Or as @thelatemail mentioned, you could use Base R:
df[which(df$a != df$b | xor(is.na(df$a), is.na(df$b))),]
Or as @runr mentioned, you could try subset in Base R:
subset(df, a != b | xor(is.na(a), is.na(b)))
You can include them with an OR (|) condition -
df <- data.frame(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8))
df %>% filter(a != b | is.na(a) | is.na(b))
# a b
#1 1 NA
#2 NA 3
#3 5 8
Another option would be to change NA values to string "NA" and then only using a != b should work.
df %>%
mutate(across(everything(), ~replace(., is.na(.), 'NA'))) %>%
filter(a != b) %>%
type.convert(as.is = TRUE)
We can use if_any
df %>%
filter(a != b | if_any(everything(), is.na))
a b
1 1 NA
2 NA 3
3 5 8
df <- structure(list(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8)),
class = "data.frame", row.names = c(NA, -5L))
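To see why the plain comparison drops these rows, and how the xor() and the is.na() | is.na() conditions differ, here is a minimal base R sketch. The sixth pair, where both values are NA, is an assumption added for illustration and is not part of the data above.

```r
# Sample vectors from the answers, plus one hypothetical pair where both are NA
a <- c(1, 2, NA, 4, 5, NA)
b <- c(NA, 2, 3, 4, 8, NA)

# Comparing with NA yields NA, and filter()/which()/subset() drop NA rows
plain <- a != b
which(plain)
# 5

# xor(): exactly one side is NA, so the both-NA pair is NOT reported
cond_xor <- a != b | xor(is.na(a), is.na(b))
which(cond_xor)
# 1 3 5

# is.na(a) | is.na(b): any NA counts as a difference, so the both-NA pair IS reported
cond_any <- a != b | is.na(a) | is.na(b)
which(cond_any)
# 1 3 5 6
```

Which variant is appropriate depends on whether two missing values should count as "equal" for the purpose of the comparison.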

generate a weighted matrix from r dataframe

I have a toy example of a dataframe:
df <- data.frame(matrix(, nrow = 5, ncol = 0))
df["A|A"] <- c(0.3, 0, 0, 100, 23)
df["A|B"]= c(0, 0, 0.3, 10, 0.23)
df["A|C"]= c(0.3, 0.1, 0, 100, 2)
df["B|B"]= c(0, 0, 0, 12, 2)
df["B|B"]= c(0, 0, 0.3, 0, 0.23)
df["B|C"]= c(0.3, 0, 0, 21, 3)
df["C|A"]= c(0.3, 0, 1, 100, 0)
df["C|B"]= c(0, 0, 0.3, 10, 0.2)
df["C|C"]= c(0.3, 0, 1, 1, 0.3)
I need to get a matrix with counts of non-zero values between A and A, A and B, ..., C and C.
I started splitting the colnames and assigning them to variables. But I don't know how to create a matrix with certain rows and columns in a loop
counts <- colSums(df != 0)
df <- rbind(df, counts)
for(i in colnames(df)) {
cluster1 <- (strsplit(i, "\\|")[[1]])[1]
cluster2 <- (strsplit(i, "\\|")[[1]])[2]
}
A base R option
table(read.table(text = rep(names(df), colSums(df > 0)), sep = "|"))
V1 A B C
A 3 3 4
B 0 2 3
C 3 3 4
or a longer version
parts <- strsplit(as.character(subset(stack(df), values > 0)$ind), "\\|")
table(data.frame(do.call(rbind, parts)))
X1 A B C
A 3 3 4
B 0 2 3
C 3 3 4
Reshape the data into 'long' format with pivot_longer, separate the 'name' column into two, and reshape back to 'wide' with pivot_wider, passing a lambda function as values_fn to count the non-zero values. This requires the dplyr and tidyr packages.
df %>%
pivot_longer(cols = everything()) %>%
separate(name, into = c('name1', 'name2')) %>%
pivot_wider(names_from = name2, values_from = value,
values_fn = list(value = ~ sum(. > 0)), values_fill = 0)
# A tibble: 3 x 4
name1 A B C
<chr> <int> <int> <int>
1 A 3 3 4
2 B 0 2 3
3 C 3 3 4
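For the loop-based approach the question started with, a minimal base R sketch follows. It assumes the data as it ends up after the duplicate "B|B" assignment (the second assignment overwrites the first), and the names parts, labels, and m are hypothetical.

```r
# Toy data as it ends up in the question (second "B|B" assignment wins)
df <- data.frame(
  "A|A" = c(0.3, 0, 0, 100, 23),
  "A|B" = c(0, 0, 0.3, 10, 0.23),
  "A|C" = c(0.3, 0.1, 0, 100, 2),
  "B|B" = c(0, 0, 0.3, 0, 0.23),
  "B|C" = c(0.3, 0, 0, 21, 3),
  "C|A" = c(0.3, 0, 1, 100, 0),
  "C|B" = c(0, 0, 0.3, 10, 0.2),
  "C|C" = c(0.3, 0, 1, 1, 0.3),
  check.names = FALSE  # keep the "|" in the column names
)

# Number of non-zero values per column
counts <- colSums(df != 0)

# Split each column name into its two cluster labels
parts <- strsplit(names(counts), "|", fixed = TRUE)
labels <- sort(unique(unlist(parts)))

# Pre-allocate a labelled matrix of zeros, then fill it cell by cell
m <- matrix(0, nrow = length(labels), ncol = length(labels),
            dimnames = list(labels, labels))
for (i in seq_along(counts)) {
  m[parts[[i]][1], parts[[i]][2]] <- counts[[i]]
}
m
#   A B C
# A 3 3 4
# B 0 2 3
# C 3 3 4
```

Pairs that never occur as a column (here B|A) simply stay at the initial value of 0.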

Calculate minimum distance between groups of points in data frame

my data frame looks like this:
Time, Value, Group
0, 1.0, A
1, 2.0, A
2, 3.0, A
0, 4.0, B
1, 6.0, B
2, 6.0, B
0, 7.0, C
1, 7.0, C
2, 9.0, C
I need to find for each combination (A, B), (A, C), (B, C) the maximum difference over each corresponding Time points.
So comparing A and B has maximum distance for t=1 which is 6 (B) - 2 (A) = 4.
The full output should be something like this:
AB, 0, 4
AC, 0, 6
BC, 0, 3
One way in base R using combn :
do.call(rbind, combn(unique(df$Group), 2, function(x) {
df1 <- subset(df, Group == x[1])
df2 <- subset(df, Group == x[2])
df3 <- merge(df1, df2, by = 'Time')
value <- abs(df3$Value.x - df3$Value.y)
data.frame(combn = paste(x, collapse = ''),
time = df3$Time[which.max(value)],
max_difference = max(value))
}, simplify = FALSE))
# combn time max_difference
#1 AB 1 4
#2 AC 0 8
#3 BC 0 5
We create all combination of unique Group values, subset the data for them and merge them on Time. Subtract the corresponding value columns and return the max difference between them.
df <- structure(list(Time = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 0L, 0L),
Value = c(1, 2, 3, 4, 6, 6, 7, 7, 9), Group = c("A", "A",
"A", "B", "B", "B", "C", "C", "C")),
class = "data.frame", row.names = c(NA, -9L))
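When every group has exactly one value per time point, as in the question's table, the same idea can be sketched by first reshaping to a Time x Group matrix. This is an alternative illustration rather than the answer's method; it uses the question's data (not the dput above), and the names wide, pairs, and res are hypothetical.

```r
# Data as listed in the question: each group observed at Time 0, 1, 2
df <- data.frame(
  Time  = rep(0:2, times = 3),
  Value = c(1, 2, 3, 4, 6, 6, 7, 7, 9),
  Group = rep(c("A", "B", "C"), each = 3)
)

# Time x Group matrix; mean() is a no-op here because each cell has one value
wide <- with(df, tapply(Value, list(Time, Group), mean))

# All unordered pairs of groups, one pair per column
pairs <- combn(colnames(wide), 2)

res <- do.call(rbind, lapply(seq_len(ncol(pairs)), function(j) {
  # absolute difference at each time point for this pair
  d <- abs(wide[, pairs[1, j]] - wide[, pairs[2, j]])
  data.frame(pair = paste0(pairs[1, j], pairs[2, j]),
             time = as.integer(names(d)[which.max(d)]),  # first time of the maximum
             max_difference = max(d))
}))
res
#   pair time max_difference
# 1   AB    1              4
# 2   AC    0              6
# 3   BC    0              3
```

If the maximum is reached at several time points, which.max() reports only the first one.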
One dplyr option could be:
df %>%
inner_join(df, by = "Time") %>%
filter(Group.x != Group.y) %>%
group_by(Time, Group = paste(pmax(Group.x, Group.y), pmin(Group.x, Group.y), sep = "-")) %>%
summarise(Max_Distance = abs(max(Value.x[Group.x == first(Group.x)]) - max(Value.y[Group.y == first(Group.y)])))
Time Group Max_Distance
<int> <chr> <dbl>
1 0 B-A 3
2 0 C-A 8
3 0 C-B 5
4 1 B-A 4
5 2 B-A 3

Remove columns from a dataframe based on number of rows with valid values

I have a dataframe:
df = data.frame(gene = c("a", "b", "c", "d", "e"),
value1 = c(NA, NA, NA, 2, 1),
value2 = c(NA, 1, 2, 3, 4),
value3 = c(NA, NA, NA, NA, 1))
I would like to keep all those columns (plus the first, gene) with at least 2 valid values (i.e., not NA). How do I do this?
I am thinking something like this ...
df1 = df %>% select_if(function(.) ...)
We can sum the non-NA elements and create a logical condition to select the columns of interest
df1 <- df %>%
select_if(~ sum(!is.na(.)) > 2)
# gene value2
#1 a NA
#2 b 1
#3 c 2
#4 d 3
#5 e 4
Or another option is keep
keep(df, ~ sum(!is.na(.)) > 2)
Or create the condition based on the number of rows
df %>%
select_if(~ mean(!is.na(.)) > 0.5)
Or use Filter from base R
Filter(function(x) sum(!is.na(x)) > 2, df)
We can use colSums in base R to count the non-NA value per column
df[colSums(!is.na(df)) > 2]
# gene value2
#1 a NA
#2 b 1
#3 c 2
#4 d 3
#5 e 4
Or using apply
df[apply(!is.na(df), 2, sum) > 2]
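The answers above use a threshold of "more than 2" valid values. A minimal base R sketch of how the result changes with "at least 2", as worded in the question; n_valid is a hypothetical name:

```r
df <- data.frame(gene = c("a", "b", "c", "d", "e"),
                 value1 = c(NA, NA, NA, 2, 1),
                 value2 = c(NA, 1, 2, 3, 4),
                 value3 = c(NA, NA, NA, NA, 1))

# Number of non-NA values per column: gene 5, value1 2, value2 4, value3 1
n_valid <- colSums(!is.na(df))

# Strictly more than 2 valid values
more_than_2 <- df[n_valid > 2]
names(more_than_2)
# "gene" "value2"

# At least 2 valid values also keeps value1
at_least_2 <- df[n_valid >= 2]
names(at_least_2)
# "gene" "value1" "value2"
```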

R (Stratified) Random Sampling for Defined Cases

I have a data frame:
DF <- data.frame(Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK", "KL", "LM"),
ID = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1))
My question: I would like to create a new column that includes a (binary) random number ('0' or '1') for cases 'ID' == 1 with a fixed proportion (or pre-defined prevalence) (e.g., random numbers '0' x 2 and '1' x 4).
For non-case specific purposes, the solution might be:
DF$RANDOM[sample(1:nrow(DF), nrow(DF), FALSE)] <- rep(RANDOM, c(nrow(DF)-4,4))
But I still need the case-specific assignment, and the aforementioned solution does not explicitly refer to '0' or '1'.
(Note: The variable 'value' is not relevant for the question; only an identifier.)
I figured out relevant posts on stratified sampling or random row selection - but this question is not covered by those (and other) posts.
Thank you VERY much in advance.
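You can first subset the data to the cases with ID == 1. To guarantee exactly two 0s and four 1s, build the values with rep() and draw them with sample() using replace = FALSE.
Here's a solution using data.table:
library(data.table)
setDT(DF)  # convert the data frame to a data.table by reference
DF[ID == 1, new_column := sample(rep(c(0, 1), c(2, 4)), .N, replace = FALSE)]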
Value ID new_column
1: AB 1 1
2: BC 0 NA
3: CD 0 NA
4: DE 1 1
5: EF 0 NA
6: FG 1 1
7: GH 1 1
8: HI 0 NA
9: IJ 0 NA
10: JK 1 0
11: KL 0 NA
12: LM 1 0
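The same assignment can be sketched in base R without data.table. The column name RANDOM follows the question; the seed and the name cases are assumptions for illustration.

```r
DF <- data.frame(Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH",
                           "HI", "IJ", "JK", "KL", "LM"),
                 ID = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1))

set.seed(42)  # arbitrary seed, only for reproducibility

# Row positions of the cases (ID == 1); there are six of them
cases <- which(DF$ID == 1)

# Start with NA everywhere, then shuffle exactly two 0s and four 1s over the cases
DF$RANDOM <- NA_integer_
DF$RANDOM[cases] <- sample(rep(c(0L, 1L), times = c(2, 4)))

table(DF$RANDOM, useNA = "ifany")
```

Because the pool of values is fixed by rep() and only its order is random, the 2:4 proportion holds exactly on every run, whatever the seed.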
DF <- data.frame(Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH",
"HI", "IJ", "JK", "KL", "LM"),
ID = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
stringsAsFactors = FALSE)
DF %>% group_by(ID) %>% sample_n(4, replace = FALSE)
