Select or subset variables whose column sums are not zero - r

I want to select or subset variables in a data frame whose column sum is not zero but also keeping other factor variables as well. It should be fairly simple but I cannot figure out how to run the select_if() function on a subset of variables using dplyr:
df <- data.frame(
A = c("a", "a", "b", "c", "c", "d"),
B = c(0, 0, 0, 0, 0, 0),
C = c(3, 0, 0, 1, 1, 2),
D = c(0, 3, 2, 1, 4, 5)
)
require(dplyr)
df %>%
select_if(funs(sum(.) > 0))
#Error in Summary.factor(c(1L, 1L, 2L, 3L, 3L, 4L), na.rm = FALSE) :
# ‘sum’ not meaningful for factors
Then I tried to only select B, C, D and this works, but I won't have variable A:
df %>%
select(-A) %>%
select_if(funs(sum(.) > 0)) -> df2
df2
# C D
#1 3 0
#2 0 3
#3 0 2
#4 1 1
#5 1 4
#6 2 5
I could simply do cbind(A = df$A, df2) but since I have a dataset with 3000 rows and 200 columns, I am afraid this could introduce errors (if values sort differently for example).
Trying to subset variables B, C, D in the sum() function doesn't work either:
df %>%
select_if(funs(sum(names(.[2:4])) > 0))
#data frame with 0 columns and 6 rows

Try this:
df %>% select_if(~ !is.numeric(.) || sum(.) != 0)
# A C D
# 1 a 3 0
# 2 a 0 3
# 3 b 0 2
# 4 c 1 1
# 5 c 1 4
# 6 d 2 5
The rationale is that for || if the left-side is TRUE, the right-side won't be evaluated.
Note:
the second argument for select_if should be a function name or formula (lambda function). the ~ is necessary to tell select_if that !is.numeric(.) || sum(.) != 0 should be converted to a function.
As commented below by #zx8754, is.factor(.)should be used if one only wants to keep factor columns.
Edit: a base R solution
cols <- c('B', 'C', 'D')
cols.to.keep <- cols[colSums(df[cols]) != 0]
df[!names(df) %in% cols || names(df) %in% cols.to.keep]

Here is an update for everyone who wants to use the new dplyr 1.0.0 which doesn't have the scoped variants (like select_if as nicely shown by #mt1022 but deprecated):
df %>%
select(where(is.numeric)) %>%
select(where(~sum(.) != 0))
If you want to compress the two select statements into one, you cannot do this by the element-wise & but longer form && because this produces the required boolean output:
df %>% select(where(~ is.numeric(.x) && sum(.x) !=0 ))

This is a soltion using data.table
df<-data.table(
A = c("a", "a", "b", "c", "c", "d"),
B = c(0, 0, 0, 0, 0, 0),
C = c(3, 0, 0, 1, 1, 2),
D = c(0, 3, 2, 1, 4, 5)
)
df2<-df[,lapply(X = .SD,FUN = function(x){sum(as.numeric(x))}),.SDcols = colnames(df)]
df[,which(is.na(df[1,]) == F),with = F]

Related

Comparing two columns in a dataframe using R

I am trying to compare two columns in a dataframe to find rows where the two columns are not equal.
I would do:
df %>% filter(column1 != column2)
This will give me cases where values exist in both columns and are not equal (e.g. column1 = 5, column2 = 6)
However it will not give me cases where one of the values is NA (e.g. column1 = NA, column2 = 7)
How can I include the latter case into the filter function?
Thanks
Or use xor:
df %>% filter(a != b | xor(is.na(a), is.na(b)))
Or as #thelatemail mentioned, you could use Base R:
df[which(df$a != df$b | xor(is.na(df$a), is.na(df$b))),]
Or as #runr mentioned, you could try subset in Base R:
subset(df, a != b | xor(is.na(a), is.na(b)))
You can include them with an OR (|) condition -
library(dplyr)
df <- data.frame(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8))
df %>% filter(a != b | is.na(a) | is.na(b))
# a b
#1 1 NA
#2 NA 3
#3 5 8
Another option would be to change NA values to string "NA" and then only using a != b should work.
df %>%
mutate(across(.fns = ~replace(., is.na(.), 'NA'))) %>%
filter(a != b) %>%
type.convert(as.is = TRUE)
We can use if_any
library(dplyr)
df %>%
filter(a != b | if_any(everything(), is.na))
a b
1 1 NA
2 NA 3
3 5 8
data
df <- structure(list(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8)),
class = "data.frame", row.names = c(NA,
-5L))

generate a weighted matrix from r dataframe

I have a toy example of a dataframe:
df <- data.frame(matrix(, nrow = 5, ncol = 0))
df["A|A"] <- c(0.3, 0, 0, 100, 23)
df["A|B"]= c(0, 0, 0.3, 10, 0.23)
df["A|C"]= c(0.3, 0.1, 0, 100, 2)
df["B|B"]= c(0, 0, 0, 12, 2)
df["B|B"]= c(0, 0, 0.3, 0, 0.23)
df["B|C"]= c(0.3, 0, 0, 21, 3)
df["C|A"]= c(0.3, 0, 1, 100, 0)
df["C|B"]= c(0, 0, 0.3, 10, 0.2)
df["C|C"]= c(0.3, 0, 1, 1, 0.3)
I need to get a matrix with counts of non-zero values between A and A, A and B, ..., C and C.
I started splitting the colnames and assigning them to variables. But I don't know how to create a matrix with certain rows and columns in a loop
counts <- colSums(df != 0)
df <- rbind(df, counts)
for(i in colnames(df)) {
cluster1 <- (strsplit(i, "\\|")[[1]])[1]
cluster2 <- (strsplit(i, "\\|")[[1]])[2]
}
A base R option
> table(read.table(text = rep(names(df), colSums(df > 0)), sep = "|"))
V2
V1 A B C
A 3 3 4
B 0 2 3
C 3 3 4
or a longer version
table(
data.frame(
do.call(
rbind,
strsplit(
as.character(subset(stack(df), values > 0)$ind),
"\\|"
)
)
)
)
gives
X2
X1 A B C
A 3 3 4
B 0 2 3
C 3 3 4
Reshape the data into 'long' format with pivot_longer, then separate the 'name' column into two, and reshape back to 'wide' with pivot_wider, specifying the values_fn as a lambda function to get the count of non-zero values
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything()) %>%
separate(name, into = c('name1', 'name2')) %>%
pivot_wider(names_from = name2, values_from = value,
values_fn = list(value = ~ sum(. > 0)), values_fill = 0)
-output
# A tibble: 3 x 4
name1 A B C
<chr> <int> <int> <int>
1 A 3 3 4
2 B 0 2 3
3 C 3 3 4

Calculate minimum distance between groups of points in data frame

my data frame looks like this:
Time, Value, Group
0, 1.0, A
1, 2.0, A
2, 3.0, A
0, 4.0, B
1, 6.0, B
2, 6.0, B
0, 7.0, C
1, 7.0, C
2, 9.0, C
I need to find for each combination (A, B), (A, C), (B, C) the maximum difference over each corresponding Time points.
So comparing A and B has maximum distance for t=1 which is 6 (B) - 2 (A) = 4.
The full output should be something like this:
combination,time,distance
AB, 0, 4
AC, 0, 6
BC, 0, 3
One way in base R using combn :
do.call(rbind, combn(unique(df$Group), 2, function(x) {
df1 <- subset(df, Group == x[1])
df2 <- subset(df, Group == x[2])
df3 <- merge(df1, df2, by = 'Time')
value <- abs(df3$Value.x - df3$Value.y)
data.frame(combn = paste(x, collapse = ''),
time = df3$Time[which.max(value)],
max_difference = max(value))
}, simplify = FALSE))
# combn time max_difference
#1 AB 1 4
#2 AC 0 8
#3 BC 0 5
We create all combination of unique Group values, subset the data for them and merge them on Time. Subtract the corresponding value columns and return the max difference between them.
data
df <- structure(list(Time = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 0L, 0L),
Value = c(1, 2, 3, 4, 6, 6, 7, 7, 9), Group = c("A", "A",
"A", "B", "B", "B", "C", "C", "C")),
class = "data.frame", row.names = c(NA, -9L))
One dplyr option could be:
df %>%
inner_join(df, by = "Time") %>%
filter(Group.x != Group.y) %>%
group_by(Time,
Group = paste(pmax(Group.x, Group.y), pmin(Group.x, Group.y), sep = "-")) %>%
summarise(Max_Distance = abs(max(Value.x[Group.x == first(Group.x)]) - max(Value.y[Group.y == first(Group.y)])))
Time Group Max_Distance
<int> <chr> <dbl>
1 0 B-A 3
2 0 C-A 8
3 0 C-B 5
4 1 B-A 4
5 2 B-A 3

Remove columns from a dataframe based on number of rows with valid values

I have a dataframe:
df = data.frame(gene = c("a", "b", "c", "d", "e"),
value1 = c(NA, NA, NA, 2, 1),
value2 = c(NA, 1, 2, 3, 4),
value3 = c(NA, NA, NA, NA, 1))
I would like to keep all those columns (plus the first, gene) with more than or equal to atleast 2 valid values (i.e., not NA). How do I do this?
I am thinking something like this ...
df1 = df %>% select_if(function(.) ...)
Thanks
We can sum the non-NA elements and create a logical condition to select the columns of interest
library(dplyr)
df1 <- df %>%
select_if(~ sum(!is.na(.)) > 2)
df1
# gene value2
#1 a NA
#2 b 1
#3 c 2
#4 d 3
#5 e 4
Or another option is keep
library(purrr)
keep(df, ~ sum(!is.na(.x)) > 2)
Or create the condition based on the number of rows
df %>%
select_if(~ mean(!is.na(.)) > 0.5)
Or use Filter from base R
Filter(function(x) sum(!is.na(x)) > 2, df)
We can use colSums in base R to count the non-NA value per column
df[colSums(!is.na(df)) > 2]
# gene value2
#1 a NA
#2 b 1
#3 c 2
#4 d 3
#5 e 4
Or using apply
df[apply(!is.na(df), 2, sum) > 2]

R (Stratified) Random Sampling for Defined Cases

I have a data frame:
DF <- data.frame(Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH", "HI", "IJ", "JK", "KL", "LM"),
ID = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1))
My question: I would like to create a new column that includes a (binary) random number ('0' or '1') for cases 'ID' == 1 with a fixed proportion (or pre-defined prevalence) (e.g., random numbers '0' x 2 and '1' x 4).
EDIT I:
For non-case specific purposes, the solution might be:
DF$RANDOM[sample(1:nrow(DF), nrow(DF), FALSE)] <- rep(RANDOM, c(nrow(DF)-4,4))
But, I still need the cas-specific assignment AND the aforementioned solution does not explicitly refer to '0' or '1'.
(Note: The variable 'value' is not relevant for the question; only an identifier.)
I figured out relevant posts on stratified sampling or random row selection - but this question is not covered by those (and other) posts.
Thank you VERY much in advance.
You can subset the data first by case ID == 1. To ensure occurrence of 1s and 0s, we use rep function and set replace to False in sample function.
Here's a solution.
library(data.table)
set.seed(121)
DF[ID == 1, new_column := sample(rep(c(0,1), c(2,4)), .N, replace = F)]
print(DF1)
Value ID new_column
1: AB 1 1
2: BC 0 NA
3: CD 0 NA
4: DE 1 1
5: EF 0 NA
6: FG 1 1
7: GH 1 1
8: HI 0 NA
9: IJ 0 NA
10: JK 1 0
11: KL 0 NA
12: LM 1 0
library(dplyr)
DF <- data.frame(Value = c("AB", "BC", "CD", "DE", "EF", "FG", "GH",
"HI", "IJ", "JK", "KL", "LM"),
ID = c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
stringsAsFactors = FALSE)
DF %>% group_by(ID) %>% sample_n(4, replace = FALSE)

Resources