Comparing two columns in a dataframe using R

I am trying to compare two columns in a dataframe to find rows where the two columns are not equal.
I would do:
df %>% filter(column1 != column2)
This returns cases where values exist in both columns and are not equal (e.g. column1 = 5, column2 = 6).
However, it does not return cases where one of the values is NA (e.g. column1 = NA, column2 = 7).
How can I include the latter case in the filter() call?
Thanks
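For context, the rows disappear because comparing NA with != yields NA rather than TRUE or FALSE, and filter() keeps only rows where the condition is TRUE. A minimal illustration:
library(dplyr)
NA != 7
# [1] NA
data.frame(column1 = c(5, NA), column2 = c(6, 7)) %>%
  filter(column1 != column2)
#   column1 column2
# 1       5       6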

Or use xor:
df %>% filter(a != b | xor(is.na(a), is.na(b)))
Or, as @thelatemail mentioned, you could use base R:
df[which(df$a != df$b | xor(is.na(df$a), is.na(df$b))),]
Or, as @runr mentioned, you could try subset() in base R:
subset(df, a != b | xor(is.na(a), is.na(b)))
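Note a subtle difference from the is.na() approach in the next answer: xor() is FALSE when both columns are NA, so rows where both values are missing are not returned here, while a != b | is.na(a) | is.na(b) keeps them. A small sketch, assuming a data frame that contains such a row:
library(dplyr)
df2 <- data.frame(a = c(1, NA, NA), b = c(2, 3, NA))
df2 %>% filter(a != b | xor(is.na(a), is.na(b)))
#    a b
# 1  1 2
# 2 NA 3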

You can include them with an OR (|) condition -
library(dplyr)
df <- data.frame(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8))
df %>% filter(a != b | is.na(a) | is.na(b))
# a b
#1 1 NA
#2 NA 3
#3 5 8
Another option would be to change NA values to the string "NA"; then a != b alone should work.
df %>%
mutate(across(.fns = ~replace(., is.na(.), 'NA'))) %>%
filter(a != b) %>%
type.convert(as.is = TRUE)

We can use if_any
library(dplyr)
df %>%
filter(a != b | if_any(everything(), is.na))
a b
1 1 NA
2 NA 3
3 5 8
data
df <- structure(list(a = c(1, 2, NA, 4, 5), b = c(NA, 2, 3, 4, 8)),
class = "data.frame", row.names = c(NA,
-5L))

Related

How to calculate the sum of all columns based on a grouped variable and remove NA

I have a dataset where I would like to group by the ID variable and then calculate the sum of each column. However, there are some NAs, as you can see, and I would like to ignore them while the sum is being computed, as they currently make the result NA even when other rows of the same ID contain values. I have looked around and tried different methods without success. I would appreciate any help.
Thank you in advance.
data <- data.frame(ID = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
var1 = c(1, 2, 5, 10, NA, 5, 23, NA, NA, 1),
var2 = c(1, NA, NA, 1, NA, 0, 1, 3, 23, 4))
data <- data %>%
group_by(ID) %>%
summarise(across(everything(), sum(., na.rm = T)))
Just the tilde ~ is missing:
data %>%
group_by(ID) %>%
summarise(across(everything(), ~sum(., na.rm = T)))
# A tibble: 4 x 3
ID var1 var2
* <dbl> <dbl> <dbl>
1 1 3 1
2 2 15 1
3 3 28 1
4 4 1 30
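For context, the tilde turns the expression into a purrr-style lambda; without it, sum(., na.rm = T) is evaluated immediately instead of being passed as a function to across(). The two spellings below are equivalent (using the data defined above):
library(dplyr)
data %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
data %>%
  group_by(ID) %>%
  summarise(across(everything(), function(x) sum(x, na.rm = TRUE)))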
In case one ID group has only NA values, you can do this:
data %>%
group_by(ID) %>%
summarise(across(everything(), ~ifelse(all(is.na(.)), NA, sum(., na.rm = T))))
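The guard matters because sum() of an all-NA vector with na.rm = TRUE returns 0, which can look like real data. A quick sketch:
sum(c(NA, NA), na.rm = TRUE)
# [1] 0
ifelse(all(is.na(c(NA, NA))), NA, sum(c(NA, NA), na.rm = TRUE))
# [1] NA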
We may specify the arguments of the function without using a lambda function:
library(dplyr)
data %>%
group_by(ID) %>%
summarise(across(everything(), sum, na.rm = TRUE), .groups = 'drop')
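Note that in dplyr 1.1.0 and later, passing extra arguments such as na.rm = TRUE through the ... of across() is deprecated, so the lambda form ~ sum(.x, na.rm = TRUE) shown in the earlier answers is the forward-compatible spelling.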

R group by column, count the combinations observed

I imagine this is already solved in many places, but I lack the right wordage to use to search for a solution. In R I have example data in long format like this:
A = tibble( c(1,2,3,1,2,4,5,5), c('a','b','c','a','f','-','b', 'f'))
and what I want returned is sort of a grouped result (something like a spread?) where I first collect the set of letters that match each number to get something like this.
1: 'a', 'a'
2: 'b', 'f'
3: 'c'
4: '-'
5: 'b', 'f'
and the actual final result I am looking for is the count of how many times each letter combination is observed:
'a','a': 1
'b','f': 2
'c': 1
'-': 1
I can do the last step with group_by() but I mention it here in case there is some magic sauce that does the whole thing.
We can group by 'a', then paste the second column together with str_c while taking the number of distinct elements in 'b', and keep the distinct rows:
library(dplyr)
library(stringr)
A %>%
group_by(a) %>%
summarise(out = str_c(b, collapse = ","), n = n_distinct(b)) %>%
distinct(out, n)
# A tibble: 4 x 2
# out n
# <chr> <int>
#1 a,a 1
#2 b,f 2
#3 c 1
#4 - 1
data
A <- structure(list(a = c(1, 2, 3, 1, 2, 4, 5, 5), b = c("a", "b",
"c", "a", "f", "-", "b", "f")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
This is close to what you are looking for:
library(tidyverse)
#Data
A <- structure(list(v1 = c(1, 2, 3, 1, 2, 4, 5, 5), v2 = c("a", "b",
"c", "a", "f", "-", "b", "f")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
#Code
A %>% group_by(v1) %>% summarise(chain=paste0(v2,collapse = ',')) %>% ungroup() %>%
group_by(chain) %>% summarise(N=n())
# A tibble: 4 x 2
chain N
<chr> <int>
1 - 1
2 a,a 1
3 b,f 2
4 c 1
Here is a base R option using nested aggregate:
aggregate(. ~ y, aggregate(y ~ ., A, toString), length)
which gives
> aggregate(. ~ y, aggregate(y ~ ., A, toString), length)
y x
1 - 1
2 a, a 1
3 b, f 2
4 c 1
Data
A = tibble(x = c(1,2,3,1,2,4,5,5), y = c('a','b','c','a','f','-','b', 'f'))
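For readability, the same nested call can be split into two steps (a sketch using the x/y column names from the data above): the inner aggregate() collapses the letters per number into a comma-separated string, and the outer one counts how many numbers share each string.
# step 1: one row per number, letters collapsed with toString
collapsed <- aggregate(y ~ ., A, toString)
# step 2: count how many numbers produced each letter combination
aggregate(. ~ y, collapsed, length)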
Maybe you want to cast the data to wide format and then count the combinations. Try:
library(dplyr)
library(tidyr)
A %>%
group_by(v1) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = row, values_from = v2, names_prefix = 'col_') %>%
ungroup %>%
count(col_1, col_2)
# col_1 col_2 n
# <chr> <chr> <int>
#1 - NA 1
#2 a a 1
#3 b f 2
#4 c NA 1

R dplyr - replace NA with 0 if [duplicate]

This question already has answers here:
Replace all NA with FALSE in selected columns in R
I have this dataframe
dtf <- data.frame(
id = seq(1, 4),
amt = c(1, 4, NA, 123),
xamt = c(1, 4, NA, 123),
camt = c(1, 4, NA, 123),
date = c("2020-01-01", NA, "2020-01-01", NA),
pamt = c(1, 4, NA, 123)
)
I'd like to replace all NA values in columns that are numeric, in my case amt, xamt, pamt and camt. I'm looking for a dplyr way. Normally I would use
replace(is.na(.), 0)
but this does not work because of the date column.
You can use across():
library(dplyr)
dtf %>% mutate(across(where(is.numeric), ~replace(., is.na(.), 0)))
#mutate_if for dplyr < 1.0.0
#dtf %>% mutate_if(is.numeric, ~replace(., is.na(.), 0))
You can also use replace_na from tidyr:
dtf %>% mutate(across(where(is.numeric), tidyr::replace_na, 0))
# id amt xamt camt date pamt
#1 1 1 1 1 2020-01-01 1
#2 2 4 4 4 <NA> 4
#3 3 0 0 0 2020-01-01 0
#4 4 123 123 123 <NA> 123
As suggested by @Darren Tsai, we can also use coalesce.
dtf %>% mutate(across(where(is.numeric), coalesce, 0))
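A small caveat: newer dplyr releases (1.1.0 and later) deprecate passing extra arguments (here the 0) through the ... of across(), so the lambda spelling avoids the warning; ~ tidyr::replace_na(.x, 0) works the same way:
dtf %>% mutate(across(where(is.numeric), ~ coalesce(.x, 0)))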

Remove columns from a dataframe based on number of rows with valid values

I have a dataframe:
df = data.frame(gene = c("a", "b", "c", "d", "e"),
value1 = c(NA, NA, NA, 2, 1),
value2 = c(NA, 1, 2, 3, 4),
value3 = c(NA, NA, NA, NA, 1))
I would like to keep all those columns (plus the first, gene) with at least 2 valid values (i.e., not NA). How do I do this?
I am thinking something like this ...
df1 = df %>% select_if(function(.) ...)
Thanks
We can sum the non-NA elements and create a logical condition to select the columns of interest
library(dplyr)
df1 <- df %>%
select_if(~ sum(!is.na(.)) > 2)
df1
# gene value2
#1 a NA
#2 b 1
#3 c 2
#4 d 3
#5 e 4
Or another option is keep
library(purrr)
keep(df, ~ sum(!is.na(.x)) > 2)
Or create the condition based on the number of rows
df %>%
select_if(~ mean(!is.na(.)) > 0.5)
Or use Filter from base R
Filter(function(x) sum(!is.na(x)) > 2, df)
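With dplyr 1.0.0 and later, where select_if() is superseded, the same condition can also be written with where(), a tidyselect helper that accepts a purrr-style lambda:
df %>% select(where(~ sum(!is.na(.x)) > 2))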
We can use colSums in base R to count the non-NA values per column
df[colSums(!is.na(df)) > 2]
# gene value2
#1 a NA
#2 b 1
#3 c 2
#4 d 3
#5 e 4
Or using apply
df[apply(!is.na(df), 2, sum) > 2]

Select or subset variables whose column sums are not zero

I want to select or subset variables in a data frame whose column sum is not zero but also keeping other factor variables as well. It should be fairly simple but I cannot figure out how to run the select_if() function on a subset of variables using dplyr:
df <- data.frame(
A = c("a", "a", "b", "c", "c", "d"),
B = c(0, 0, 0, 0, 0, 0),
C = c(3, 0, 0, 1, 1, 2),
D = c(0, 3, 2, 1, 4, 5)
)
require(dplyr)
df %>%
select_if(funs(sum(.) > 0))
#Error in Summary.factor(c(1L, 1L, 2L, 3L, 3L, 4L), na.rm = FALSE) :
# ‘sum’ not meaningful for factors
Then I tried to only select B, C, D and this works, but I won't have variable A:
df %>%
select(-A) %>%
select_if(funs(sum(.) > 0)) -> df2
df2
# C D
#1 3 0
#2 0 3
#3 0 2
#4 1 1
#5 1 4
#6 2 5
I could simply do cbind(A = df$A, df2) but since I have a dataset with 3000 rows and 200 columns, I am afraid this could introduce errors (if values sort differently for example).
Trying to subset variables B, C, D in the sum() function doesn't work either:
df %>%
select_if(funs(sum(names(.[2:4])) > 0))
#data frame with 0 columns and 6 rows
Try this:
df %>% select_if(~ !is.numeric(.) || sum(.) != 0)
# A C D
# 1 a 3 0
# 2 a 0 3
# 3 b 0 2
# 4 c 1 1
# 5 c 1 4
# 6 d 2 5
The rationale is that with ||, if the left-hand side is TRUE, the right-hand side won't be evaluated.
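A quick illustration of the short-circuiting (a sketch using the non-numeric column A from the example data): !is.numeric(df$A) is TRUE, so the right-hand side is never evaluated and sum() is never attempted on that column.
!is.numeric(df$A) || sum(df$A) != 0
# [1] TRUE
# with the element-wise |, both sides are evaluated and sum(df$A) errors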
Note:
The second argument for select_if should be a function name or formula (lambda function). The ~ is necessary to tell select_if that !is.numeric(.) || sum(.) != 0 should be converted to a function.
As commented below by @zx8754, is.factor(.) should be used if one only wants to keep factor columns.
Edit: a base R solution
cols <- c('B', 'C', 'D')
cols.to.keep <- cols[colSums(df[cols]) != 0]
df[!names(df) %in% cols | names(df) %in% cols.to.keep]
Here is an update for everyone who wants to use dplyr 1.0.0 or later, where the scoped variants (like select_if, as nicely shown by @mt1022) are superseded:
df %>%
select(where(is.numeric)) %>%
select(where(~sum(.) != 0))
If you want to compress the two select statements into one, you cannot do this with the element-wise &; you need the longer form &&, because it produces the required single boolean output:
df %>% select(where(~ is.numeric(.x) && sum(.x) !=0 ))
This is a solution using data.table:
library(data.table)
df <- data.table(
A = c("a", "a", "b", "c", "c", "d"),
B = c(0, 0, 0, 0, 0, 0),
C = c(3, 0, 0, 1, 1, 2),
D = c(0, 3, 2, 1, 4, 5)
)
df2 <- df[, lapply(X = .SD, FUN = function(x) sum(as.numeric(x))), .SDcols = colnames(df)]
# keep columns whose coerced sum is NA (non-numeric) or non-zero
df[, which(is.na(df2[1, ]) | df2[1, ] != 0), with = FALSE]
