Different cells between two data frames - r

I need the differences between two data frames. setdiff() gives me the modified and new rows, but it shows each modified row in full, and I want only the cells that differ. How can I do this? Assume the number of columns is the same.
Input data:
df1 <- data.frame(ID = c(1, 2, 3),
A = c(1, 2, 3),
B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4),
A = c(1, 2, 3, 4),
B = c(1, 2, 3, NA))
newdata = setdiff(df2,df1) # does not give the result I expect
The result should be this data frame:
result <- data.frame(ID = c(3, 4),
A = c(NA, 4),
B = c(3, NA))
The ID column should be preserved and should always contain a value.
Summary:
The output should contain only the new or modified rows from df2.
In modified rows, only the modified or new cells should be displayed.
Values in the ID column should be displayed even if they are not modified.
Should I use compare or compare_df? How can I do this?

You can do this in separate steps, since you are applying different logic to different columns (ID vs. A and B); it can't be achieved as a single set operation over all columns.
df1 <- data.frame(ID = c(1, 2, 3),
A = c(1, 2, 3),
B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4),
A = c(1, 2, 3, 4),
B = c(1, 2, 3, NA))
newdata = setdiff(df2,df1)
newdata
ID A B
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 NA
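You can apply your logic to columns A and B, but not to ID:
# df2 has one more row than df1, so == recycles df1's values and warns;
# this works here only because the shared rows line up by position
newdata$A[which(df2$A == df1$A)] <- NA
newdata$B[which(df2$B == df1$B)] <- NA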
newdata
ID A B
1 1 NA NA
2 2 NA NA
3 3 NA 3
4 4 4 NA
newdata[3:4,]
There are wizards far better than me that might opine, but I see no way to do this in one pass with the ID restriction.
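The step-by-step idea above can also be sketched so that rows are aligned by ID rather than by position. This is only a minimal base R sketch: the helper name diff_cells and its key argument are made up for illustration, and it assumes ID values are unique within each data frame.

```r
df1 <- data.frame(ID = c(1, 2, 3), A = c(1, 2, 3), B = c(1, 2, NA))
df2 <- data.frame(ID = c(1, 2, 3, 4), A = c(1, 2, 3, 4), B = c(1, 2, 3, NA))

# Hypothetical helper (not from the answer above): keep new or modified rows
# of `new`, blanking every cell that is unchanged relative to `old`.
diff_cells <- function(old, new, key = "ID") {
  idx  <- match(new[[key]], old[[key]])  # matching row in `old`, NA for new IDs
  keep <- is.na(idx)                     # rows with a new ID are always kept
  out  <- new
  for (col in setdiff(names(new), key)) {
    old_val <- old[[col]][idx]
    new_val <- new[[col]]
    # NA-safe equality: both NA, or both present and equal
    equal <- (is.na(old_val) & is.na(new_val)) |
      (!is.na(old_val) & !is.na(new_val) & old_val == new_val)
    same <- !is.na(idx) & equal
    out[[col]][same] <- NA               # hide unchanged cells
    keep <- keep | !same                 # a changed cell marks the row as modified
  }
  out <- out[keep, , drop = FALSE]
  rownames(out) <- NULL
  out
}

result <- diff_cells(df1, df2)
result
```

With the example data this keeps the rows with ID 3 (only B shown) and ID 4 (only A shown). One limitation of this layout: a cell that changed to NA looks the same as an unchanged cell.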

Related

Filter only EQUAL values into columns with dplyr::filter_if

I have the following data:
df <- data.frame(
x = c(1, 4, 3, 4, 4, 3),
y = c(2, 3, 4, 4, 2, 3)
)
I tried this code:
library(tidyverse)
df %>%
filter_if(~ is.numeric(.), all_vars(. %in% c('3', '4')))
x y
1 4 3
2 3 4
3 4 4
4 3 3
But, the expected result is:
x y
1 3 3
2 4 4
How can I do this?
A different approach:
require(tidyverse)
df <- data.frame(
x = c(1, 4, 3, 4, 4, 3),
y = c(2, 3, 4, 4, 2, 3),
z = letters[1:6]
)
df %>%
filter(apply(.,1,function(x) length(unique(x[grepl('[0-9]',x)]))==1))
gives:
x y z
1 4 4 d
2 3 3 f
I have added a non-numeric column to the example data, to illustrate this solution.
Not a filter_if() possibility, but essentially following that logic:
df %>%
filter(rowMeans(select_if(., is.numeric) == pmax(!!!select_if(., is.numeric))) == 1)
x y z
1 4 4 d
2 3 3 f
Sample data:
df <- data.frame(x = c(1, 4, 3, 4, 4, 3),
y = c(2, 3, 4, 4, 2, 3),
z = letters[1:6],
stringsAsFactors = FALSE)
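For comparison, the two conditions (all numeric columns equal, and every value in the allowed set) can be written out separately in base R. This is only a sketch; the names num, all_equal and allowed are introduced here for illustration, and the allowed set c(3, 4) is taken from the question.

```r
df <- data.frame(x = c(1, 4, 3, 4, 4, 3),
                 y = c(2, 3, 4, 4, 2, 3),
                 z = letters[1:6],
                 stringsAsFactors = FALSE)

num <- df[sapply(df, is.numeric)]   # numeric columns only

# TRUE where every numeric column equals the first numeric column
all_equal <- rowSums(sapply(num, function(col) col == num[[1]])) == ncol(num)

# TRUE where every numeric value is one of the allowed values
allowed <- rowSums(sapply(num, `%in%`, c(3, 4))) == ncol(num)

res <- df[all_equal & allowed, ]
res
```

Keeping the two tests apart makes it easy to drop or change the allowed set without touching the equality check.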

Frequency of vectors inside list

Let's say I have a list
test <- list(c(1, 2, 3), c(2, 4, 6), c(1, 5, 10), c(1, 2, 3), c(1, 5, 10), c(1, 2, 3))
and I need to count how often each of these vectors occurs, so the desired output should look like:
Category Count
1, 2, 3 3
2, 4, 6 1
1, 5, 10 2
Is there a simple way to achieve this in R?
You can just paste and use table, i.e.
as.data.frame(table(sapply(test, paste, collapse = ' ')))
which gives,
Var1 Freq
1 1 2 3 3
2 1 5 10 2
3 2 4 6 1
The function unique() can work on a list. For counting one can use identical():
test <- list(c(1, 2, 3), c(2, 4, 6), c(1, 5, 10), c(1, 2, 3), c(1, 5, 10), c(1, 2, 3))
Lcount <- function(xx, L) sum(sapply(L, identical, y=xx))
sapply(unique(test), FUN=Lcount, L=test)
unique(test)
The result as data.frame:
result <- data.frame(
Set=sapply(unique(test), FUN=paste0, collapse=','),
count= sapply(unique(test), FUN=Lcount, L=test)
)
result
# > result
# Set count
# 1 1,2,3 3
# 2 2,4,6 1
# 3 1,5,10 2
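A third variant, sketched here under the assumption that all vectors have the same type, combines unique() with match() and tabulate(). Note that match() compares list elements through their character representation, so c(1, 2, 3) and 1:3 would not be treated as the same vector.

```r
test <- list(c(1, 2, 3), c(2, 4, 6), c(1, 5, 10),
             c(1, 2, 3), c(1, 5, 10), c(1, 2, 3))

u <- unique(test)                                 # distinct vectors, first-seen order
counts <- tabulate(match(test, u), nbins = length(u))

result <- data.frame(Category = sapply(u, toString),
                     Count = counts,
                     stringsAsFactors = FALSE)
result
```

Unlike the table() approach, this keeps the categories in order of first appearance, which matches the layout asked for in the question.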

Remove duplicate elements by row in a data frame

I need to replace duplicate elements with NA, row by row, in a data frame. I will take base, tidyverse or data.table solutions. Thank you. Example:
library(tibble)
#input data.frame
tribble(
~x, ~y, ~z,
1, 2, 3,
1, 1, NA,
4, 1, 4,
2, 2, 3
)
#desired output
tribble(
~x, ~y, ~z,
1, 2, 3,
1, NA, NA,
4, 1, NA,
2, 3, NA
)
Here is a base R option, assuming the input tibble is stored as df1. We loop through the rows, replace the duplicated elements with NA, concatenate (c) the non-NA elements with the NA elements, transpose (t) the result, and assign the output back to the original dataset:
df1[] <- t(apply(df1, 1, function(x) {
x1 <- replace(x, duplicated(x), NA)
c(x1[!is.na(x1)], x1[is.na(x1)])
}))
df1
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 2 3
#2 1 NA NA
#3 4 1 NA
#4 2 3 NA
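Since df1 is never defined in the thread, here is a self-contained version of the same idea on a plain data.frame, checked against the desired output from the question. The helper name dedupe_row is hypothetical.

```r
df1 <- data.frame(x = c(1, 1, 4, 2),
                  y = c(2, 1, 1, 2),
                  z = c(3, NA, 4, 3))

# Hypothetical helper for one row: blank repeated values, then push NAs right
dedupe_row <- function(x) {
  x[duplicated(x)] <- NA
  c(x[!is.na(x)], x[is.na(x)])
}

out <- df1
out[] <- t(apply(df1, 1, dedupe_row))   # apply() returns columns per row, so transpose

expected <- data.frame(x = c(1, 1, 4, 2),
                       y = c(2, NA, 1, 3),
                       z = c(3, NA, NA, NA))
isTRUE(all.equal(out, expected))
```

Because apply() converts the data frame to a matrix first, this sketch is only suitable when all columns share one type.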

Subsetting data in R to remove rows if values for two variables are NA

I want to remove all rows from my dataset that are NA in two columns. If a row has a non-NA value in either column, I want to keep it. How do I do this?
You can do this:
library(tidyverse)
df <- data.frame(a = c(2, 4, 6, NA, 3, NA),
b = c(5, 4, 8, NA, 6, 7))
df1 <- df %>%
filter(!is.na(a) | !is.na(b))
and you get:
> df1
a b
1 2 5
2 4 4
3 6 8
4 3 6
5 NA 7
Here are a couple of base R suggestions. Loop through the columns of the dataset, convert each to a logical vector with is.na, combine the logical vectors elementwise with Reduce and `&`, negate the result, and use it to subset the dataset:
df[!Reduce(`&`, lapply(df, is.na)),]
Or use rowSums on the logical matrix (!is.na(df)) to get a logical vector that keeps rows with at least one non-NA value:
df[rowSums(!is.na(df))>0,]
data
df <- data.frame(a = c(2, 4, 6, NA, 3, NA),
b = c(5, 4, 8, NA, 6, 7))
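Both base R versions above test every column of df. If the data frame has other columns besides the two of interest, the check can be restricted to those two. A minimal sketch, where the extra id column and the cols vector are assumptions added for illustration:

```r
df <- data.frame(id = 1:6,                      # extra column, not in the original data
                 a = c(2, 4, 6, NA, 3, NA),
                 b = c(5, 4, 8, NA, 6, 7))

cols <- c("a", "b")                             # only these columns are checked
keep <- rowSums(!is.na(df[cols])) > 0           # at least one non-NA among them

res <- df[keep, ]
res
```

Here only the fourth row is dropped, because it is the only one where both a and b are NA.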

match rows across two columns

Given a data frame
df=data.frame(
E=c(1,1,2,1,3,2,2),
N=c(4,4,10,4,3,2,2)
)
I would like to create a third column: whenever two or more rows have the same value in both columns, they count as a match, and each group of matching rows gets its own letter.
dfx=data.frame(
E=c(1,1,2,1,3,2,2,3, 2),
N=c(4,4,10,4,3,2,2,6, 10),
matched=c("A", "A", "B","A", NA, "C", "C", NA, "B")
)
Thanks!
Here, df is:
df <- structure(list(E = c(1, 1, 2, 1, 3, 2, 2, 3, 2), N = c(4, 4,
10, 4, 3, 2, 2, 6, 10)), .Names = c("E", "N"), row.names = c(NA,
-9L), class = "data.frame")
You can do:
dfx <- transform(df, matched = {
i <- as.character(interaction(df[c("E", "N")]))
tab <- table(i)[unique(i)]  # counts in order of first appearance
LETTERS[match(i, names(tab)[tab > 1])]
})
# E N matched
# 1 1 4 A
# 2 1 4 A
# 3 2 10 B
# 4 1 4 A
# 5 3 3 <NA>
# 6 2 2 C
# 7 2 2 C
# 8 3 6 <NA>
# 9 2 10 B
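The same labelling can be sketched without interaction() and table(), using a pasted key and duplicated(). The names key and dup_keys are introduced here for illustration, and the sketch assumes there are at most 26 repeated groups, since LETTERS has only 26 entries.

```r
df <- data.frame(E = c(1, 1, 2, 1, 3, 2, 2, 3, 2),
                 N = c(4, 4, 10, 4, 3, 2, 2, 6, 10))

key <- paste(df$E, df$N, sep = "_")             # one string per (E, N) pair

# keys that occur more than once, in order of first appearance
dup_keys <- unique(key[key %in% key[duplicated(key)]])

# rows whose key is not repeated get NA from match(), hence NA as label
df$matched <- LETTERS[match(key, dup_keys)]
df
```

For the example data this labels the (1, 4) rows "A", the (2, 10) rows "B" and the (2, 2) rows "C", and leaves the unmatched rows as NA.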
