How to count unique values in R while ignoring NAs

This is my input data
key col_a col_b
a QQQ <NA>
a QQC <NA>
b <NA> ACQ
b <NA> ACQ
I'd like to create this output
key col_a col_b
a 2 0
b 0 1
I tried length(unique(x$col_a)), but it counts NA as a value.
I'm creating this object with data.table, and the columns come from an ifelse() statement.
Can I change the value I'm putting in the ifelse() statement to something else, or count unique values while ignoring NAs?

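For each key, we can count the distinct values in each column with n_distinct:
library(dplyr)
df %>%
group_by(key) %>%
# pass na.rm inside a lambda; forwarding extra arguments through across() is deprecated since dplyr 1.1.0
summarise(across(col_a:col_b, ~ n_distinct(.x, na.rm = TRUE)))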
In data.table, this can be done as:
library(data.table)
setDT(df)[, lapply(.SD, uniqueN, na.rm = TRUE), key, .SDcols = col_a:col_b]
key col_a col_b
1: a 2 0
2: b 0 1

We can use base R methods for this
aggregate(. ~ key, df, FUN = function(x)
length(unique(na.omit(x))), na.action = NULL)
# key col_a col_b
#1 a 2 0
#2 b 0 1
Or with the tidyverse, using an anonymous function:
library(dplyr)
df %>%
group_by(key) %>%
summarise(across(everything(), ~ n_distinct(., na.rm = TRUE)))
data
df <- structure(list(key = c("a", "a", "b", "b"), col_a = c("QQQ",
"QQC", NA, NA), col_b = c(NA, NA, "ACQ", "ACQ")),
class = "data.frame", row.names = c(NA,
-4L))
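To see why length(unique(x)) overcounts, here is a minimal base R sketch on the same data. The helper name count_unique is made up for illustration and is not part of any package; the per-key table is just one way to arrange the result.

```r
# unique() keeps NA as its own value, so it is counted unless removed first
x <- c("QQQ", "QQC", NA, NA)
length(unique(x))             # 3
length(unique(x[!is.na(x)]))  # 2

# count_unique is a hypothetical helper, named here only for illustration
count_unique <- function(x) length(unique(x[!is.na(x)]))

df <- data.frame(
  key   = c("a", "a", "b", "b"),
  col_a = c("QQQ", "QQC", NA, NA),
  col_b = c(NA, NA, "ACQ", "ACQ"),
  stringsAsFactors = FALSE
)

# split the value columns by key, count per column, then put keys in rows
counts <- t(sapply(
  split(df[c("col_a", "col_b")], df$key),
  function(g) sapply(g, count_unique)
))
counts
```

Dropping the NAs before calling unique() is the same idea that na.rm = TRUE implements in n_distinct() and uniqueN().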

Related

How to get all rows sharing same url into 1 row?

After unnesting, my data frame has multiple rows with NA values that can be summarized into one row per link. All data is text/character.
Example:
link feature-1 feature-2 feature-3
link_1 a. NA NA
link_1. NA NA b
link_1. NA. c NA
link2 NA. a NA
link_2 NA NA d
link_2 x NA NA
Assuming that you are only ever combining NA values and text, then I recommend the following:
library(dplyr)
# here is a mock dataset
df = data.frame(grp = c('a','a','a','b','b','b'),
value1 = c(NA,NA,'text','text',NA,NA),
value2 = c(NA,'txt',NA,NA,'txt',NA),
stringsAsFactors = FALSE)
df %>%
# convert NA values to empty text strings
mutate(value1 = ifelse(is.na(value1), "", value1),
value2 = ifelse(is.na(value2), "", value2)) %>%
# specify the groups
group_by(grp) %>%
# append all the text in each group into a single row
summarise(val1 = paste(value1, collapse = ""),
val2 = paste(value2, collapse = ""))
Based on this answer.
Looking at the data in your question, you might first need to standardize some values, because "link_1" vs "link_1." and "NA" vs "NA." will be treated as different.
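A minimal base R sketch of that standardization step, assuming the stray trailing dots are typos and that the text "NA" stands for a missing value. The mock data is typed from the table in the question; clean_text is a hypothetical helper, and treating "link2" as "link_2" is an assumption.

```r
# Mock data reproducing the inconsistencies shown in the question
raw <- data.frame(
  link      = c("link_1", "link_1.", "link_1.", "link2", "link_2", "link_2"),
  feature.1 = c("a.", "NA", "NA.", "NA.", "NA", "x"),
  feature.2 = c("NA", "NA", "c", "a", "NA", "NA"),
  feature.3 = c("NA", "b", "NA", "NA", "d", "NA"),
  stringsAsFactors = FALSE
)

# clean_text is a hypothetical helper, not a library function
clean_text <- function(x) {
  x <- sub("\\.$", "", trimws(x))  # drop a single trailing dot
  x[x %in% "NA"] <- NA             # turn the literal text "NA" into a real NA
  x
}

clean <- raw
clean[] <- lapply(raw, clean_text)

# assumption: "link2" is a typo for "link_2"
clean$link <- sub("^link_?", "link_", clean$link)
clean
```

After this step every row belongs to either link_1 or link_2, so grouping by link behaves as intended.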
You can use across to get first non-NA value by group in multiple columns.
library(dplyr)
df %>% group_by(link) %>% summarise(across(starts_with('feature'), ~na.omit(.)[1]))
# link feature.1 feature.2 feature.3
# <chr> <chr> <chr> <chr>
#1 link_1 a c b
#2 link_2 x a d
data
df <- structure(list(link = c("link_1", "link_1", "link_1", "link_2",
"link_2", "link_2"), feature.1 = c("a", NA, NA, NA, NA, "x"),
feature.2 = c(NA, NA, "c", "a", NA, NA), feature.3 = c(NA,
"b", NA, NA, "d", NA)), class = "data.frame", row.names = c(NA, -6L))

How to fill a dataframe from another one in R?

I want to fill df2 with information from df1.
df1 as below
ID Mutation
1 A
2 B
2 C
3 A
df2 as below
ID A B C
1
2
3
For example, if mutation A is found in ID 1, then I want it marked as "Y" in df2.
So the df2 result should be
ID A B C
1 Y
2 Y Y
3 Y
I have hundreds of IDs and more than 20 mutations. How can I efficiently achieve this in R? Thanks!
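Using data.table, you can try:
library(data.table)
setDT(df)
# value.var is named explicitly so dcast does not have to guess it
df2 <- dcast(df, formula = ID ~ Mutation, value.var = "Mutation")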
df2[, c("A", "B", "C") := lapply(.SD, function(x) ifelse(is.na(x), " ", "Y")), ID]
df2
#Output
ID A B C
1: 1 Y
2: 2 Y Y
3: 3 Y
Create a new column with value 'Y' and cast the data in wide format.
library(dplyr)
library(tidyr)
df %>%
mutate(value = 'Y') %>%
pivot_wider(names_from = Mutation, values_from = value, values_fill = '')
# ID A B C
# <int> <chr> <chr> <chr>
#1 1 "Y" "" ""
#2 2 "" "Y" "Y"
#3 3 "Y" "" ""
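The same reshaping can be sketched in base R with a contingency table, assuming each ID/Mutation pair only needs a present/absent flag. The object names df1, flags and df2 are chosen for illustration.

```r
df1 <- data.frame(
  ID       = c(1L, 2L, 2L, 3L),
  Mutation = c("A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# table() counts how often each ID/Mutation pair occurs
counts <- table(df1$ID, df1$Mutation)

# any count above zero becomes "Y", everything else an empty string
flags <- ifelse(unclass(counts) > 0, "Y", "")

# the row names of the table are the IDs; turn them back into a column
df2 <- data.frame(ID = as.integer(rownames(flags)), flags,
                  row.names = NULL, stringsAsFactors = FALSE)
df2
```

Because table() builds one column per distinct mutation, this does not require listing the 20+ mutation names by hand.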
data
df <- structure(list(ID = c(1L, 2L, 2L, 3L), Mutation = c("A", "B",
"C", "A")), class = "data.frame", row.names = c(NA, -4L))

Joining dataframes, overwriting x with y if y has value

I have 2 dataframes
DF x:
ID A B C
1 x y z
2 x y z
DF y:
ID A B C
1 NA d f
2 e NA NA
I want to join them in such way that the value of x gets overwritten by the value of y, but only if there is a value in y for the matching column in x.
Hence, the outcome of the above should be:
ID A B C
1 x d f
2 e y z
One option is coalesce
library(dplyr)
left_join(dfx, dfy, by = 'ID') %>%
transmute(ID, A = coalesce(A.y, A.x),
B = coalesce(B.y, B.x),
C = coalesce(C.y, C.x))
# ID A B C
#1 1 x d f
#2 2 e y z
Or if there are many columns, reshape it to 'long' format, do the coalesce and then reshape into 'wide' format
library(tidyr)
left_join(dfx, dfy, by = 'ID') %>%
pivot_longer(cols = -ID, names_to = c("group", ".value"), names_sep = "\\.") %>%
mutate(x = coalesce(y, x)) %>%
select(-y) %>%
pivot_wider(names_from = group, values_from = x)
Or another option is to bind_rows the two datasets and then do a group_by summarise (assuming one row per 'ID')
bind_rows(dfy, dfx) %>%
group_by(ID) %>%
summarise_at(vars(-group_cols()), ~ first(.[!is.na(.)]))
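Or using a join in data.table (coalesce() here is the dplyr function loaded above; data.table's own fcoalesce() is an alternative)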
library(data.table)
nm1 <- names(dfx)[-1]
nm2 <- paste0("i.", nm1)
setDT(dfy)[dfx, (nm1) := Map(coalesce, mget(nm1), mget(nm2)), on = .(ID)]
dfy
data
dfx <- structure(list(ID = 1:2, A = c("x", "x"), B = c("y", "y"), C = c("z",
"z")), class = "data.frame", row.names = c(NA, -2L))
dfy <- structure(list(ID = 1:2, A = c(NA, "e"), B = c("d", NA), C = c("f",
NA)), class = "data.frame", row.names = c(NA, -2L))
Using base R, we can get the index of non-NA values in dfy and replace the corresponding values of dfx with it.
#Rearrange dfy if you have columns in different order than dfx
#dfy <- dfy[names(dfx)]
inds <- which(!is.na(dfy[-1]), arr.ind = TRUE)
dfx[-1][inds] <- dfy[-1][inds]
dfx
# ID A B C
#1 1 x d f
#2 2 e y z
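The index-based replacement above relies on dfx and dfy having their rows in the same order. A hedged base R sketch that aligns the rows by ID first, assuming ID is unique in both data frames (dfy is deliberately given in reverse order here to show the alignment):

```r
dfx <- data.frame(ID = 1:2, A = c("x", "x"), B = c("y", "y"), C = c("z", "z"),
                  stringsAsFactors = FALSE)
# same values as in the question, but with the rows reversed
dfy <- data.frame(ID = 2:1, A = c("e", NA), B = c(NA, "d"), C = c(NA, "f"),
                  stringsAsFactors = FALSE)

# for each row of dfx, the position of the matching ID in dfy
idx <- match(dfx$ID, dfy$ID)

res <- dfx
for (col in setdiff(names(dfx), "ID")) {
  candidate <- dfy[[col]][idx]  # dfy values lined up with the rows of dfx
  # keep the dfx value wherever dfy has nothing to offer
  res[[col]] <- ifelse(is.na(candidate), dfx[[col]], candidate)
}
res
```

IDs that are missing from dfy produce NA candidates, so the corresponding dfx rows are left unchanged.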

How can I select columns based on two conditions? [duplicate]

I have a data frame with lots of columns. For example:
sample treatment col5 col6 col7
1 a 3 0 5
2 a 1 0 3
3 a 0 0 2
4 b 0 1 1
I want to select the sample and treatment columns plus all columns that meet the following 2 conditions:
Their value on the row in which treatment == 'b' is 0
Their value from at least one row where treatment == 'a' is not 0.
The expected result should look like this:
sample treatment col5
1 a 3
2 a 1
3 a 0
4 b 0
Example dataframe:
structure(list(sample = 1:4, treatment = structure(c(1L, 1L,
1L, 2L), .Label = c("a", "b"), class = "factor"), col5 = c(3,
1, 0, 0), col6 = c(0, 0, 0, 1), col7 = c(5, 3, 2, 1)), class = "data.frame", row.names = c(NA,
-4L))
Here's a way in base R -
cs_a <- colSums(df[df$treatment == "a",-c(1:2)]) > 0
cs_b <- colSums(df[df$treatment == "b",-c(1:2)]) == 0
df[, c(TRUE, TRUE, cs_a & cs_b)]
sample treatment col5
1 1 a 3
2 2 a 1
3 3 a 0
4 4 b 0
With dplyr -
df %>%
select_at(which(c(TRUE, TRUE, cs_a & cs_b)))
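The colSums() test above works because the example values are non-negative. A sketch that encodes the two conditions literally (every 'b' row is 0, at least one 'a' row is not 0), assuming the first two columns are always sample and treatment:

```r
df <- data.frame(
  sample    = 1:4,
  treatment = c("a", "a", "a", "b"),
  col5 = c(3, 1, 0, 0),
  col6 = c(0, 0, 0, 1),
  col7 = c(5, 3, 2, 1),
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(df), c("sample", "treatment"))
is_a <- df$treatment == "a"
is_b <- df$treatment == "b"

# one TRUE/FALSE per value column: zero in all 'b' rows and non-zero in some 'a' row
keep <- vapply(
  df[value_cols],
  function(x) all(x[is_b] == 0) && any(x[is_a] != 0),
  logical(1)
)

res <- df[c("sample", "treatment", value_cols[keep])]
res
```

Testing the values directly, rather than their sums, also gives the intended answer if a column contains negative numbers that cancel out.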
Here is a much more verbose way in the tidyverse that does not require a manual colSums for each level of treatment:
library(dplyr)
library(purrr)
library(tidyr)
sample <- 1:4
treatment <- c("a", "a", "a", "b")
col5 <- c(3,1,0,0)
col6 <- c(0,0,0,1)
col7 <- c(5,3,2,1)
dd <- data.frame(sample, treatment, col5, col6, col7)
# first create new columns that report whether the entries are zero
dd2 <- mutate_if(
.tbl = dd,
.predicate = is.numeric,
.funs = function(x)
x == 0
)
# then find the sum per column and per treatment group
# in R TRUE = 1 and FALSE = 0
number_of_zeros <- dd2 %>%
group_by(treatment) %>%
summarise_at(.vars = vars(col5:col7), .funs = "sum")
# then find the names of the columns you want to keep
keeper_columns <-
number_of_zeros %>%
select(-treatment) %>% # remove the treatment grouping variable
map_dfr( # check whether every treatment level has at least one zero in the column
.x = .,
.f = function(x)
all(x > 0)
) %>%
gather(column, keeper) %>% # reformat
filter(keeper == TRUE) %>% # to grab the keepers
select(column) %>% # then select the column with column names
unlist %>% # and convert to character vector
unname
# subset the original dataset for the wanted columns
wanted_columns <- dd %>% select(1:2, all_of(keeper_columns))

How to fill in NAs of various columns grouped by duplicated IDs in R

I have a table with columns id, colA, and colB. The data contains duplicated id values where, for some rows, colA or colB is NA, but another row with the same id has valid values. I want to clean the data so that duplicates are removed but the data is complete. For example, my data looks like
id | colA | colB
1 NA X
1 Y X
2 Z NA
2 Z Y
3 Z Y
3 Z Y
4 NA NA
4 NA NA
and I want my dataframe to look like
id | colA | colB
1 Y X
2 Z Y
3 Z Y
4 NA NA
I usually use the ifelse statement to replace missing values, but I am confused on how to use this in the context of having duplicated ids.
First, add a column that counts the NAs in each row. Then, using dplyr, remove duplicated rows and, for each id, keep the row with the fewest missing values -
library(dplyr)
df$test <- rowSums(is.na(df))
df %>%
filter(!duplicated(.)) %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 y x
2 2 z y
3 3 z y
4 4 <NA> <NA>
EDIT:
Actually, there is no need to remove duplicates first. Keeping the row with the fewest missing values for each id should also work -
df$test <- rowSums(is.na(df))
df %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
Data -
df <- data.frame(
id = rep(1:4, each = 2), colA = c(NA, "y", "z", "z", "z", "z", NA, NA),
colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = FALSE)
This answer is very dependent on your actual data being similar in structure to your example data.
Your data:
df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA),
colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)),
class = "data.frame",
row.names = c(NA, -8L))
Assuming, as in your example, that each id occurs twice and that where one observation is NA, it is the first observation for that id, then this works:
library(dplyr)
library(tidyr)
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
ungroup() %>%
distinct()
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 Y X
2 2 Z Y
3 3 Z Y
4 4 NA NA
If the second observation for an id can be NA, you could try adding a second fill after the first one, but this time fill down:
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
fill(colA, colB, .direction = "down") %>%
ungroup() %>%
distinct()
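If the valid values for one id can be spread over different rows, a per-column collapse avoids any assumption about row order. A base R sketch, assuming character columns; the rows for id 5 are hypothetical additions to the example data, and first_non_na and collapse_col are illustrative helper names.

```r
df1 <- data.frame(
  id   = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L),
  colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA, "W", NA),
  colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA, NA, "V"),
  stringsAsFactors = FALSE
)

# first non-missing value, or NA if the whole group is missing
first_non_na <- function(x) x[!is.na(x)][1]

# apply first_non_na to one column within each id (groups come back sorted by id)
collapse_col <- function(x, g) {
  unname(vapply(split(x, g), first_non_na, character(1)))
}

res <- data.frame(
  id   = sort(unique(df1$id)),
  colA = collapse_col(df1$colA, df1$id),
  colB = collapse_col(df1$colB, df1$id),
  stringsAsFactors = FALSE
)
res
```

Here id 5 ends up with both "W" and "V", even though no single input row holds both values.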
Creating dataframe - it helps if you post the code to make the sample data
df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))
Removing rows with exactly one NA (vectorized, so no row indices shift while rows are dropped)
# TRUE where exactly one of colA / colB is missing
single_na <- xor(is.na(df$colA), is.na(df$colB))
df <- df[!single_na, ]
Removing remaining duplicates (i.e. double NA rows)
df <- df[!duplicated(df), ]
Output
df
There is probably a more computationally efficient way of doing this, but this ought to work.
