Extract mismatch by groups

I have a data frame like this:
ID col1 col2
AB 1 3
AB 1 3
CD 2 4
CD 2 3
I would like to compare the rows within each ID.
For each column whose values differ within an ID, I want to add a column that lists the mismatching values for that column.
Output:
ID col1 col2 mismatch_extract_col1 mismatch_extract_col2
AB 1 3 NA NA
AB 1 3 NA NA
CD 2 4 NA 4:3
CD 2 3 NA 4:3

You can use n_distinct() == 1 to check, within each ID group, whether a column has a single value (that is, no mismatch).
library(dplyr)
# .by requires dplyr >= 1.1.0
df %>%
  mutate(across(col1:col2, ~ if_else(n_distinct(.x) == 1, NA, toString(.x)),
                .names = "mismatch_extract_{.col}"),
         .by = ID)
# # A tibble: 4 × 5
# ID col1 col2 mismatch_extract_col1 mismatch_extract_col2
# <chr> <int> <int> <lgl> <chr>
# 1 AB 1 3 NA NA
# 2 AB 1 3 NA NA
# 3 CD 2 4 NA 4, 3
# 4 CD 2 3 NA 4, 3
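
Note that toString() joins the values with ", ", while the question asks for "4:3". A minimal sketch of one way to get the colon-separated form, assuming the data frame from the question is called df (the construction of df below is a reconstruction, not part of the original post):

```r
library(dplyr)

# Example data reconstructed from the question (assumed column types)
df <- tibble(
  ID   = c("AB", "AB", "CD", "CD"),
  col1 = c(1L, 1L, 2L, 2L),
  col2 = c(3L, 3L, 4L, 3L)
)

res <- df %>%
  mutate(
    across(
      col1:col2,
      # one value per group: NA when all rows agree, else the distinct values joined by ":"
      ~ if (n_distinct(.x) == 1) NA_character_ else paste(unique(.x), collapse = ":"),
      .names = "mismatch_extract_{.col}"
    ),
    .by = ID
  )
res
```

Using unique() means repeated values appear only once in the string; drop it if every row's value should be listed.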

Related

Remove groups if all NA

Let's say I have a table like so:
df <- data.frame("Group" = c("A","A","A","B","B","B","C","C","C"),
"Num" = c(1,2,3,1,2,NA,NA,NA,NA))
Group Num
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B NA
7 C NA
8 C NA
9 C NA
In this case, because group C has Num as NA for all entries, I would like to remove rows in group C from the table. Any help is appreciated!

You could group_by on your Group column and filter out the groups in which all values are NA. You can use the following code:
library(dplyr)
df %>%
  group_by(Group) %>%
  filter(!all(is.na(Num)))
#> # A tibble: 6 × 2
#> # Groups: Group [2]
#> Group Num
#> <chr> <dbl>
#> 1 A 1
#> 2 A 2
#> 3 A 3
#> 4 B 1
#> 5 B 2
#> 6 B NA
Created on 2023-01-18 with reprex v2.0.2

In base R you could index based on all the groups that have at least one non-NA value:
idx <- df$Group %in% unique(df[!is.na(df$Num),"Group"])
idx
df[idx,]
# or in one line
df[df$Group %in% unique(df[!is.na(df$Num),"Group"]),]
output
Group Num
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 B NA

Using ave.
df[with(df, !ave(Num, Group, FUN=\(x) all(is.na(x)))), ]
# Group Num
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 1
# 5 B 2
# 6 B NA
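
For completeness, the same filter can be sketched with data.table. This is only an illustration under the assumption that the table from the question is converted with as.data.table(); the names dt and kept are made up for the example:

```r
library(data.table)

df <- data.frame(Group = rep(c("A", "B", "C"), each = 3),
                 Num   = c(1, 2, 3, 1, 2, NA, NA, NA, NA))
dt <- as.data.table(df)

# Return the group's rows (.SD) only when at least one Num is not NA;
# groups for which the expression yields NULL are dropped from the result
kept <- dt[, if (!all(is.na(Num))) .SD, by = Group]
kept
```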

Calculating observations by group for multiple variables

I have the data below, and I want to count the non-missing observations per group for several columns:
library(data.table)
dat <- fread("col1 col2 col3 group
1 2 4 A
3 2 2 A
1 NA 1 B
3 2 1 B")
vars_of_interest <- c("col1", "col2")
vars_of_interest_obs <- paste0(vars_of_interest, "_obs_tot")
dat <- setDT(dat)[, (vars_of_interest_obs) := sum(!is.na(vars_of_interest)), by = c("group")]
However the outcome of this is:
col1 col2 col3 group col1_obs_tot col2_obs_tot
1: 1 2 4 A 2 2
2: 3 2 2 A 2 2
3: 1 NA 1 B 2 2
4: 3 2 1 B 2 2
Where the last column should be
col2_obs_tot
2
2
1
1
What am I doing wrong here?

This is because vars_of_interest is evaluated literally, as a character vector of column names rather than as the columns themselves:
sum(!is.na(c('col1', 'col2'))) = 2
You need to get the content of the columns:
setDT(dat)[, (vars_of_interest_obs) := lapply(vars_of_interest, function(x) sum(!is.na(get(x)))), by = c("group")][]
col1 col2 col3 group col1_obs_tot col2_obs_tot
<int> <int> <int> <char> <int> <int>
1: 1 2 4 A 2 2
2: 3 2 2 A 2 2
3: 1 NA 1 B 2 1
4: 3 2 1 B 2 1
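
An alternative to get() is to pass the columns through .SD and .SDcols. The sketch below rebuilds the example data with data.table() instead of fread() (an assumption made to keep it self-contained) and also shows why the original call returned 2 everywhere:

```r
library(data.table)

dat <- data.table(
  col1  = c(1L, 3L, 1L, 3L),
  col2  = c(2L, 2L, NA, 2L),
  col3  = c(4L, 2L, 1L, 1L),
  group = c("A", "A", "B", "B")
)
vars_of_interest <- c("col1", "col2")
vars_of_interest_obs <- paste0(vars_of_interest, "_obs_tot")

# The original call counts the two column *names*, which are never NA
literal_count <- sum(!is.na(vars_of_interest))

# .SD holds the columns listed in .SDcols for the current group
dat[, (vars_of_interest_obs) := lapply(.SD, function(x) sum(!is.na(x))),
    by = group, .SDcols = vars_of_interest]
dat
```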

counting the number of observations row wise using dplyr

I have a dataset that looks like this:
sample <- tibble(x = c (1,2,3,NA), y = c (5, NA,2, NA))
sample
# A tibble: 4 x 2
x y
<dbl> <dbl>
1 1 5
2 2 NA
3 3 2
4 NA NA
Now I want to create a new variable z that counts how many non-missing observations are in each row. For the sample dataset above, the first value of z should be 2 because both x and y have values. Similarly, for the 2nd row the value of z is 1 because one value is missing, and for the 4th row the value is 0 because there are no observations in the row.
The expected dataset looks like this:
x y z
<dbl> <dbl> <dbl>
1 1 5 2
2 2 NA 1
3 3 2 2
4 NA NA 0
I want to do this for a small subset of variables, not the whole dataset.

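Using base R. The first line counts across all columns, the second counts across columns selected by name, and the third adds one term per column, so it becomes unwieldy when there are many columns. Also note that is.finite() excludes Inf and -Inf, unlike !is.na().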
sample$z1 <- rowSums(!is.na(sample))
sample$z2 <- rowSums(!is.na(sample[c("x", "y")]))
sample$z3 <- is.finite(sample$x) + is.finite(sample$y)
> sample
# A tibble: 4 x 5
x y z1 z2 z3
<dbl> <dbl> <dbl> <dbl> <int>
1 1 5 2 2 2
2 2 NA 1 1 1
3 3 2 2 2 2
4 NA NA 0 0 0
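
The third variant is not strictly equivalent to the first two when the data contains infinite values. A small illustration (the vector x is made up for this example):

```r
x <- c(1, NA, Inf, NaN)

not_missing <- !is.na(x)     # Inf counts as an observation; NA and NaN do not
finite      <- is.finite(x)  # Inf is excluded as well

sum(not_missing)
sum(finite)
```

So the counts agree for the sample data in the question, but they would differ on a column containing Inf or -Inf.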

We can use
library(dplyr)
sample %>%
rowwise %>%
mutate(z = sum(!is.na(cur_data()))) %>%
ungroup
-output
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <int>
#1 1 5 2
#2 2 NA 1
#3 3 2 2
#4 NA NA 0
If only selected columns should be counted
sample %>%
rowwise %>%
mutate(z = sum(!is.na(select(cur_data(), x:y))))
Or with rowSums on a logical matrix
sample %>%
mutate(z = rowSums(!is.na(cur_data())))
-output
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 5 2
#2 2 NA 1
#3 3 2 2
#4 NA NA 0
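
In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick(). A sketch of the rowSums variant with pick(), restricted to chosen columns; the tibble is named sample_df here (a hypothetical name) to avoid masking base::sample:

```r
library(dplyr)

sample_df <- tibble(x = c(1, 2, 3, NA), y = c(5, NA, 2, NA))

res <- sample_df %>%
  # pick() returns the selected columns as a tibble; is.na() turns it into a logical matrix
  mutate(z = rowSums(!is.na(pick(x, y))))
res
```

Because rowSums() is vectorised over rows, no rowwise() step is needed here.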

An example using apply() on selected columns of a matrix:
set.seed(7)
vals <- sample(c(1:20, NA, NA), 20)
sample <- matrix(vals, ncol = 5)
# Select columns 1, 3, 4
cols <- c(1, 3, 4)
rowcnts <- apply(sample[ , cols], 1, function(x) length(x[!is.na(x)]))
sample <- cbind(sample, rowcnts)
> sample
rowcnts
[1,] 10 15 16 NA 12 2
[2,] 19 8 14 18 9 3
[3,] 7 17 6 4 1 3
[4,] 2 3 13 NA 5 2
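
The same row counts can also be obtained with rowSums(), which avoids the per-row function call. The matrix below copies the values printed above instead of relying on the random seed; the name m is only for illustration:

```r
# Values taken from the printed matrix above, filled column by column
m <- matrix(c(10, 19,  7,  2,
              15,  8, 17,  3,
              16, 14,  6, 13,
              NA, 18,  4, NA,
              12,  9,  1,  5), ncol = 5)
cols <- c(1, 3, 4)

cnt_apply   <- apply(m[, cols], 1, function(x) sum(!is.na(x)))
# drop = FALSE keeps a matrix even if only one column is selected
cnt_rowsums <- rowSums(!is.na(m[, cols, drop = FALSE]))
```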

How to combine the values of various columns in a tibble by the same row ID

So I have a tibble (data frame) like this (the actual data frame is like 100+ rows)
sample_ID <- c(1, 2, 2, 3)
A <- c(NA, NA, 1, 3)
B <- c(1, 2, NA, 1)
C <- c(5, 1, NA, 2)
D <- c(NA, NA, 3, 1)
tibble(sample_ID,A,B,C,D)
# which reads
# A tibble: 4 × 5
sample_ID A B C D
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 1 5 NA
2 2 NA 2 1 NA
3 2 1 NA NA 3
4 3 3 1 2 1
As can be seen here, the second and third rows have the same sample ID. I want to combine these two rows so that the tibble looks like
# A tibble: 3 × 5
sample_ID A B C D
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 1 5 NA
2 2 1 2 1 3
3 3 3 1 2 1
In other words, I want the rows for sample_ID to be unique (order doesn't matter), and the values of other columns are merged (overwrite NA when possible). Can this be achieved in a simple way, such as using gather and spread? Many thanks.

We can use summarise_each after grouping by 'sample_ID'. Note that summarise_each() and funs() are deprecated in current dplyr releases.
library(dplyr)
df %>%
group_by(sample_ID) %>%
summarise_each(funs(na.omit))
# A tibble: 3 × 5
# sample_ID A B C D
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 NA 1 5 NA
#2 2 1 2 1 3
#3 3 3 1 2 1
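
With a current dplyr, a similar result can be sketched with across(). This version takes the first non-NA value per group and column, which is an assumption about the intended merge rule: if a group holds several different non-NA values in one column, only the first is kept.

```r
library(dplyr)

df <- tibble(
  sample_ID = c(1, 2, 2, 3),
  A = c(NA, NA, 1, 3),
  B = c(1, 2, NA, 1),
  C = c(5, 1, NA, 2),
  D = c(NA, NA, 3, 1)
)

res <- df %>%
  group_by(sample_ID) %>%
  # indexing an empty vector with [1] yields NA, so all-NA groups stay NA
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))
res
```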

Add a column to a dataframe using (extracting unique values) from existing columns

I am new to R, and I was not able to find an answer to the specific problem I have encountered.
Suppose my data frame looks like this:
d <- data.frame(Name = c("Jon", "Jon", "Jon", "Kel", "Kel", "Kel", "Don", "Don", "Don"),
No1 = c(1,2,3,1,1,1,3,3,3),
No2 = c(1,1,1,2,2,2,3,3,3))
Name No1 No2
Jon 1 1
Jon 2 1
Jon 3 1
Kel 1 2
Kel 1 2
Kel 1 2
Don 3 3
Don 3 3
Don 3 3
...
How would I add new columns to the data frame that list the unique values found in columns No1 and No2 for each name? These would be (1,2,3), (1,2), (3) for Jon, Kel, Don, respectively.
So, if the new columns are named ID#, the desired result would be
d2 <- data.frame(Name = c("Jon", "Jon", "Jon", "Kel", "Kel", "Kel", "Don", "Don", "Don"),
No1 = c(1,2,3,1,1,1,3,3,3),
No2 = c(1,1,1,2,2,2,3,3,3),
ID1 = c(1,1,1,1,1,1,3,3,3),
ID2 = c(2,2,2,2,2,2,NA,NA,NA),
ID3 = c(3,3,3,NA,NA,NA,NA,NA,NA))
Name No1 No2 ID1 ID2 ID3
Jon 1 1 1 2 3
Jon 2 1 1 2 3
Jon 3 1 1 2 3
Kel 1 2 1 2 NA
Kel 1 2 1 2 NA
Kel 1 2 1 2 NA
Don 3 3 3 NA NA
Don 3 3 3 NA NA
Don 3 3 3 NA NA

A tidyverse approach:
library(dplyr)
library(tidyr)
# evaluate separately for each name
d %>% group_by(Name) %>%
# add a column of the unique values pasted together into a string
mutate(ID = paste(unique(c(No1, No2)), collapse = ' ')) %>%
# separate the string into individual columns, filling with NA and converting to numbers
separate(ID, into = paste0('ID', 1:3), fill = 'right', convert = TRUE)
## Source: local data frame [9 x 6]
## Groups: Name [3]
##
## Name No1 No2 ID1 ID2 ID3
## * <fctr> <dbl> <dbl> <int> <int> <int>
## 1 Jon 1 1 1 2 3
## 2 Jon 2 1 1 2 3
## 3 Jon 3 1 1 2 3
## 4 Kel 1 2 1 2 NA
## 5 Kel 1 2 1 2 NA
## 6 Kel 1 2 1 2 NA
## 7 Don 3 3 3 NA NA
## 8 Don 3 3 3 NA NA
## 9 Don 3 3 3 NA NA
Here's a nice base version with a basic split-apply-combine approach:
# store distinct values in No1 and No2
cols <- unique(unlist(d[,-1]))
# split No1 and No2 by Name,
ids <- data.frame(t(sapply(split(d[,-1], d$Name),
# find unique values for each split,
function(x){y <- unique(unlist(x))
# pad with NAs,
c(y, rep(NA, length(cols) - length(y)))
# and return a data.frame
})))
# fix column names
names(ids) <- paste0('ID', cols)
# turn rownames into column
ids$Name <- rownames(ids)
# join two data.frames on Name columns
merge(d, ids, sort = FALSE)
## Name No1 No2 ID1 ID2 ID3
## 1 Jon 1 1 1 2 3
## 2 Jon 2 1 1 2 3
## 3 Jon 3 1 1 2 3
## 4 Kel 1 2 1 2 NA
## 5 Kel 1 2 1 2 NA
## 6 Kel 1 2 1 2 NA
## 7 Don 3 3 3 NA NA
## 8 Don 3 3 3 NA NA
## 9 Don 3 3 3 NA NA
And just for kicks, here's a creative alternate base version that leverages table instead of splitting/grouping:
# copy d so as not to distort original with factor columns
d_f <- d
# make No* columns factors to ensure similar table structure
d_f[, -1] <- lapply(d[,-1], factor, levels = unique(unlist(d[, -1])))
# make tables of cols, sum to aggregate occurrences, and set as boolean mask for > 0
tab <- Reduce(`+`, lapply(d_f[, -1], table, d_f$Name)) > 0
# replace all TRUE values with values they tabulated
tab <- tab * matrix(as.integer(rownames(tab)), nrow = nrow(tab), ncol = ncol(tab))
# replace 0s with NAs
tab[tab == 0] <- NA
# store column names
cols <- paste0('ID', rownames(tab))
# sort each row, keeping NAs
tab <- data.frame(t(apply(tab, 2, sort, na.last = T)))
# apply stored column names
names(tab) <- cols
# turn rownames into column
tab$Name <- rownames(tab)
# join two data.frames on Name columns
merge(d, tab, sort = FALSE)
Results are identical.

library(dplyr)
library(tidyr)
d %>%
group_by(Name) %>%
mutate(unique_id = paste0(unique(c(No1, No2)), collapse = ",")) %>%
separate(., unique_id, paste0("id_", 1:max(c(.$No1, .$No2))), fill = "right")

We can get the output with a single external package, data.table. Convert the 'data.frame' to a 'data.table' (setDT(d)). Grouped by 'Name', unlist the columns given in .SDcols and take the unique values. Then dcast from 'long' to 'wide' format and join the result with the original dataset on the "Name" column.
library(data.table)
dcast(setDT(d)[, unique(unlist(.SD)) , Name, .SDcols = No1:No2],
Name~paste0("ID", rowid(Name)), value.var="V1")[d, on = "Name"]
# Name ID1 ID2 ID3 No1 No2
#1: Jon 1 2 3 1 1
#2: Jon 1 2 3 2 1
#3: Jon 1 2 3 3 1
#4: Kel 1 2 NA 1 2
#5: Kel 1 2 NA 1 2
#6: Kel 1 2 NA 1 2
#7: Don 3 NA NA 3 3
#8: Don 3 NA NA 3 3
#9: Don 3 NA NA 3 3
Or this can be done in one line by first pasting the unique elements of 'No1' and 'No2' together, grouped by 'Name', and then splitting the result into three columns with cSplit from splitstackshape.
library(splitstackshape)
cSplit(setDT(d)[, ID:= paste(unique(c(No1, No2)), collapse=" ") , Name], "ID", " ")
# Name No1 No2 ID_1 ID_2 ID_3
#1: Jon 1 1 1 2 3
#2: Jon 2 1 1 2 3
#3: Jon 3 1 1 2 3
#4: Kel 1 2 1 2 NA
#5: Kel 1 2 1 2 NA
#6: Kel 1 2 1 2 NA
#7: Don 3 3 3 NA NA
#8: Don 3 3 3 NA NA
#9: Don 3 3 3 NA NA
Or using the baseVerse just for kicks
d1 <- read.table(text=ave(unlist(d[-1]), rep(d$Name, 2),
FUN = function(x) paste(unique(x), collapse=" "))[1:nrow(d)],
header=FALSE, fill=TRUE, col.names= paste0("ID", 1:3))
cbind(d, d1)
# Name No1 No2 ID1 ID2 ID3
#1 Jon 1 1 1 2 3
#2 Jon 2 1 1 2 3
#3 Jon 3 1 1 2 3
#4 Kel 1 2 1 2 NA
#5 Kel 1 2 1 2 NA
#6 Kel 1 2 1 2 NA
#7 Don 3 3 3 NA NA
#8 Don 3 3 3 NA NA
#9 Don 3 3 3 NA NA
NOTE: No packages used and without much effort in splitting.
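
Several of the answers above hard-code three ID columns. A sketch of a tidyr variant that derives the number of columns from the data, assuming the data frame d from the question (the intermediate names ids and res are made up for this example):

```r
library(dplyr)
library(tidyr)

d <- data.frame(Name = c("Jon", "Jon", "Jon", "Kel", "Kel", "Kel", "Don", "Don", "Don"),
                No1 = c(1, 2, 3, 1, 1, 1, 3, 3, 3),
                No2 = c(1, 1, 1, 2, 2, 2, 3, 3, 3))

ids <- d %>%
  pivot_longer(c(No1, No2)) %>%               # stack No1 and No2 into a "value" column
  distinct(Name, value) %>%                   # unique values per name
  group_by(Name) %>%
  mutate(id = paste0("ID", row_number())) %>% # number the values within each name
  ungroup() %>%
  pivot_wider(names_from = id, values_from = value)

# attach the wide ID columns to every original row
res <- left_join(d, ids, by = "Name")
res
```

Names with fewer unique values are padded with NA by pivot_wider().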
