Identify differences between 2 data frames with missing values - r

Suppose I have 2 data frames:
a1 <- data.frame(a = 1:5, b=2:6)
a2 <- data.frame(a = 1:5, b=c(2:5,NA))
I would like to identify which columns are not identical (I will need the column number later). I thought that this would do the trick:
apply(!a1==a2, 2, sum, na.rm=TRUE)
However, because the last entry in a2 is an NA, it doesn't work.

Not sure why you're using sum, but to identify which columns are not identical you could use mapply with identical and negate the result.
which(!mapply(identical, a1, a2))
# b
# 2
for the column number. Or more simply for use in a column subset
!mapply(identical, a1, a2)
# a b
# FALSE TRUE
Just as a note, the word identical has a meaning in R that may be different from the result of ==, so it's possible you may need to clarify your question a bit.
x <- 1
y <- 1L
x == y
# [1] TRUE
identical(x, y)
# [1] FALSE
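If a type difference such as integer versus double should not count as a mismatch, one option is to swap identical for all.equal, which compares numeric values with a small tolerance and does not distinguish integer from double. This is only a sketch, assuming both data frames have the same columns in the same order; the name differs is made up for illustration.

```r
a1 <- data.frame(a = 1:5, b = 2:6)
a2 <- data.frame(a = as.numeric(1:5), b = c(2:5, NA))  # 'a' is now double, not integer

# identical() flags column 'a' too, because the storage types differ
!mapply(identical, a1, a2)

# all.equal() returns TRUE or a character description of the differences,
# so wrap it in isTRUE() before negating
differs <- !mapply(function(u, v) isTRUE(all.equal(u, v)), a1, a2)
differs         # a: FALSE, b: TRUE
which(differs)  # b: 2
```

Whether a tolerance-based comparison is appropriate depends on the data; for exact matching, identical remains the stricter choice.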

If you wanted to use sum, you could try
colSums(a1==a2, na.rm=TRUE)!=nrow(a1)
# a b
#FALSE TRUE
Or using your code
apply(a1==a2, 2, sum, na.rm=TRUE)!=nrow(a1)
# a b
#FALSE TRUE
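Note that with na.rm=TRUE a column in which both data frames have NA in the same position is still flagged, because the NA comparison is dropped and the count of matches falls short of nrow(a1). If matching NAs should count as equal, a sketch along these lines may help; a3, a4 and a5 are made-up example data, and mismatch is a hypothetical helper name.

```r
a3 <- data.frame(a = 1:5, b = c(2:5, NA))
a4 <- data.frame(a = 1:5, b = c(2:5, NA))   # NA in the same position as a3
a5 <- data.frame(a = 1:5, b = 2:6)          # a value where a3 has NA

# TRUE where exactly one side is NA, or where both are present and differ;
# NA (later dropped by na.rm) only where both sides are NA
mismatch <- function(x, y) (is.na(x) != is.na(y)) | (x != y)

colSums(mismatch(a3, a4), na.rm = TRUE) > 0         # no column flagged
colSums(mismatch(a3, a5), na.rm = TRUE) > 0         # only b flagged
which(colSums(mismatch(a3, a5), na.rm = TRUE) > 0)  # column number of b
```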

Related

R subset vector when treated as strings

I have a large data frame where I've forced my vectors into a string (using lapply and toString) so they fit into a dataframe and now I can't check if one column is a subset of the other. Is there a simple way to do this.
X <- data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
X
y z
1 ABC ABC
2 A A,B,C
all(X$y %in% X$z)
[1] FALSE
(X$y[1] %in% X$z[1])
[1] TRUE
(X$y[2] %in% X$z[2])
[1] FALSE
I need to treat each y and z string value as a vector (comma separated) again and then check if y is a subset of z.
In the above case, A is a subset of A,B,C. However, because I've treated both as strings, it doesn't work.
In the above, y holds just one value per row, while z holds 1 and 3 values. The sample data frame I'll be testing has 10,000 rows; y will have 1-5 values per row and z 1-100 per row. It looks like the 1-5 values are always a subset of z, but I'd like to check.
df = data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
apply(df, 1, function(x) {                 # perform row-wise operations
  y = unlist(strsplit(x[1], ","))          # split X$y in case it contains ","
  z = y %in% unlist(strsplit(x[2], ","))   # which elements of 'X$y' are present in 'X$z'
  if (sum(z) == length(y))                 # if all are present, return TRUE
    return(TRUE)
  else
    return(FALSE)
})
# [1] TRUE TRUE
# Case 2: changed the data. You will have to define if you want perfect subset or not. Accordingly we can update the code
df = data.frame(y=c("ABC","A,B,D"), z=c("ABC","A,B,C"))
#[1] TRUE FALSE
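The same row-wise check can also be written without apply, by splitting both columns once and comparing the pieces pairwise. A minimal sketch, assuming the values are separated by a plain comma with no surrounding spaces; is_subset is just an illustrative name.

```r
df <- data.frame(y = c("ABC", "A,B,D"), z = c("ABC", "A,B,C"),
                 stringsAsFactors = FALSE)  # strsplit() needs character, not factor

# Split each string into its comma-separated parts (one list element per row)
y_parts <- strsplit(df$y, ",", fixed = TRUE)
z_parts <- strsplit(df$z, ",", fixed = TRUE)

# For each row, check that every element of y appears in z
is_subset <- mapply(function(y, z) all(y %in% z), y_parts, z_parts)
is_subset
# [1]  TRUE FALSE
```

With 10,000 rows this should be fast enough, since each string is split only once.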
I think it might work better for you not to use your lapply and toString combination, but store the lists in your data frame. For this purpose, I find the tbl_df (as found in the tibble package) more friendly, although I believe data.table objects can do this as well (someone correct me if I'm wrong)
library(tibble)
y_char <- list("ABC", "A")
z_char <- list("ABC", c("A", "B", "C"))
# tibble() replaces the deprecated data_frame()
X <- tibble(y = y_char,
z = z_char)
Notice that when you print X now, your entries in each row of the tibble are entries from the list. Now we can use mapply to do pairwise comparison.
# All y in z
mapply(function(x, y) all(x %in% y),
X$y,
X$z)
# All z in y
mapply(function(x, y) all(y %in% x),
X$y,
X$z)

Unstacking a stacked dataframe unstacks columns in a different order

Using R 3.1.0
a = as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
b = unstack(stack(a))
# Returns FALSE
all(colnames(a) == colnames(b))
The documentation on stack/unstack says unstacking should "reverse this [stack] operation". Am I missing something? Why do I need to re-order the columns of b?
The last few lines of the stack function (see utils:::stack.data.frame) create a data.frame with two columns, "values" and "ind". The "ind" column is created with the code:
ind = factor(rep.int(names(x), lapply(x, length)))
But, look at how factor works in general (pay attention to the order of the "Levels"):
factor(c(1, 2, 3, 10, 4))
# [1] 1 2 3 10 4
# Levels: 1 2 3 4 10
factor(paste0("A", c(1, 2, 3, 10, 4)))
# [1] A1 A2 A3 A10 A4
# Levels: A1 A10 A2 A3 A4
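By default factor() sorts the levels, which is why A10 lands before A2. Passing the levels explicitly keeps them in order of first appearance. A small sketch of that idea:

```r
x <- paste0("A", c(1, 2, 3, 10, 4))

# Default: levels are sorted as strings (e.g. A10 ahead of A2)
levels(factor(x))

# Explicit levels: order of first appearance is preserved
levels(factor(x, levels = unique(x)))
# [1] "A1"  "A2"  "A3"  "A10" "A4"
```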
If the functionality you describe is important for your analysis, you might do better modifying a version of stack.data.frame to capture the order of the data.frame names during the factoring process, like this:
Stack <- function (x, select, ...)
{
if (!missing(select)) {
nl <- as.list(1L:ncol(x))
names(nl) <- names(x)
vars <- eval(substitute(select), nl, parent.frame())
x <- x[, vars, drop = FALSE]
}
keep <- unlist(lapply(x, is.vector))
if (!sum(keep))
stop("no vector columns were selected")
if (!all(keep))
warning("non-vector columns will be ignored")
x <- x[, keep, drop = FALSE]
data.frame(values = unlist(unname(x)),
# REMOVE THIS --> ind = factor(rep.int(names(x), lapply(x, length))),
# AND ADD THIS:
ind = factor(rep.int(names(x), lapply(x, length)), unique(names(x))),
stringsAsFactors = FALSE)
}
Testing, one, two, three...
## Not using identical here because
## the factor levels are different
all.equal(Stack(a), stack(a))
# [1] TRUE
identical(unstack(Stack(a)), a)
# [1] TRUE
You'll never get me to defend the R documentation...
stack(...) creates a new data frame with two columns, values and ind. The latter has the column names from the original table, as a factor, ordered alphabetically. unstack(...) uses that factor to (re-) create columns of the new data frame. So the phrase "Unstacking reverses this operation" should be interpreted loosely...
To get the result you want, you need to reorder the factor ind, as follows:
a <- as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
c <- stack(a)
c$ind <- factor(c$ind, levels=colnames(a))
d <- unstack(c)
identical(a,d)
# [1] TRUE

Backfilling from another data.frame

I frequently have situations where I have to "fill in" information from another data source.
For example:
x <- data.frame(c1=letters[1:26],c2=letters[26:1])
x[x$c1 == "m","c2"] <- NA
x[x$c1 == "a","c2"] <- NA
c1 c2
1 a <NA>
2 b y
3 c x
4 d w
5 e v
6 f u
7 g t
8 h s
9 i r
10 j q
11 k p
12 l o
13 m <NA>
...
Now, with that missing variable, I'd like to check and fill it in using a separate data.frame; let's call it y
y <- data.frame(c1=c("m","a"),c2=c("n","z"))
So, what I would like to happen is for x to be filled in with y. (row 13 should be c("m","n"), row 1 should be c("a","z"))
The method I use to deal with this currently seems convoluted and indirect. What would your approach be? Keeping in mind that my data is not necessarily in a nice order like this one is but the order should be maintained in x. My preference would be for a solution that does not rely on anything but base R.
This will be a far simpler proposition if you deal with character variables, not factors.
I will present a simple data.table solution (for elegant and easy to use syntax amongst many other advantages)
x <- data.frame(c1=letters[1:26],c2=letters[26:1], stringsAsFactors =FALSE)
x[x$c1 == "m","c2"] <- NA
y <- data.frame(c1="m",c2="n", stringsAsFactors = FALSE)
library(data.table)
X <- as.data.table(x)
Y <- as.data.table(y)
For simplicity of merging, I will create a column indicating whether c2 is missing
X[,missing_c2 := is.na(c2)]
# a similar column in Y
Y[,missing_c2 := TRUE]
# key both tables on c1 and the missing flag, so the join matches rows by c1
setkey(X, c1, missing_c2)
setkey(Y, c1, missing_c2)
# merge and replace (by reference) those values in X with the values in `Y`
X[Y, c2 := i.c2]
The i.c2 means that we use the values of c2 from the i argument to [
This approach assumes that not all values where c1 = 'm' will be missing in X and you don't want to replace all values in c2 with 'm' where c1='m', only those which are missing
A base solution
Here is a base solution. I use merge so that the y data.frame can contain more missing replacements than actually needed (i.e. it could have values for all c1 values, although only c1 = 'm' is required).
# add a second missing value row to make the solution more generalizable
x <- rbind(x, data.frame(c1 = 'm',c2 = NA, stringsAsFactors = FALSE) )
missing <- x[is.na(x$c2),]
merged <- merge(missing, y, by = 'c1')
x[is.na(x$c2),] <- with(merged, data.frame(c1 = c1, c2 = c2.y, stringsAsFactors = FALSE))
If you use factors you will come up against a wall of pain ensuring that the levels correspond.
In base R, I believe this will work for you (match looks up each missing row's c1 in y, so the row order of y does not matter):
nas <- is.na(x$c2)
x$c2[nas] <- y$c2[match(x$c1[nas], y$c1)]

Compare if two dataframe objects in R are equal?

How do I check if two objects, e.g. dataframes, are value equal in R?
By value equal, I mean the value of each row of each column of one dataframe is equal to the value of the corresponding row and column in the second dataframe.
It is not clear what it means to test if two data frames are "value equal" but to test if the values are the same, here is an example of two non-identical dataframes with equal values:
a <- data.frame(x = 1:10)
b <- data.frame(y = 1:10)
To test if all values are equal:
all(a == b) # TRUE
To test if objects are identical (they are not, they have different column names):
identical(a,b) # FALSE: class, colnames, rownames must all match.
In addition, identical is still useful and supports the practical goal:
identical(a[, "x"], b[, "y"]) # TRUE
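One caveat with all(a == b): if either data frame contains NA, the result is NA rather than TRUE or FALSE, and == raises an error when the dimensions differ. A rough sketch of a helper that handles both cases follows. values_equal is a made-up name, not a base R function, and it treats NAs in matching cells as equal, which is an assumption about what "value equal" should mean.

```r
# Hypothetical helper: TRUE when both data frames have the same shape,
# NAs in the same cells, and equal values everywhere else.
values_equal <- function(a, b) {
  identical(dim(a), dim(b)) &&
    all(is.na(a) == is.na(b)) &&
    all(a == b, na.rm = TRUE)
}

a <- data.frame(x = c(1, NA, 3))
b <- data.frame(y = c(1, NA, 3))
d <- data.frame(y = c(1, 2, 3))

all(a == b)         # NA, because of the missing value
values_equal(a, b)  # TRUE
values_equal(a, d)  # FALSE
```

Column names are ignored here, as in the all(a == b) example.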
We can use the R package compare to test whether the names of the object and the values are the same, in just one step.
a <- data.frame(x = 1:10)
b <- data.frame(y = 1:10)
library(compare)
compare(a, b)
#FALSE [TRUE]#objects are not identical (different names), but values are the same.
In case we only care about equality of the values, we can set ignoreNames=TRUE
compare(a, b, ignoreNames=TRUE)
#TRUE
# dropped names
The package has additional interesting functions such as compareEqual and compareIdentical.
Here is another method using comparedf from the arsenal package.
It gives you the differences detected by variable, the variables not shared (different columns, for example), the number of observations not shared, as well as a summary of the overall comparison.
df1 <- data.frame(id = paste0("person", 1:3),
a = c("a", "b", "c"),
b = c(1, 3, 4))
> df1
id a b
1 person1 a 1
2 person2 b 3
3 person3 c 4
df2 <- data.frame(id = paste0("person", 4:1),
a = c("c", "b", "a", "f"),
b = c(1, 3, 4, 4),
d = paste0("rn", 1:4))
> df2
id a b d
1 person4 c 1 rn1
2 person3 b 3 rn2
3 person2 a 4 rn3
4 person1 f 4 rn4
library(arsenal)
comparedf(df1, df2)
Compare Object
Function Call:
comparedf(x = df1, y = df2)
Shared: 3 non-by variables and 3 observations.
Not shared: 1 variables and 0 observations.
Differences found in 2/3 variables compared.
0 variables compared have non-identical attributes.
A more detailed summary is also available:
summary(comparedf(df1, df2))
The code above will return several tables:
Summary of data.frames
Summary of overall comparison
Variables not shared
Other variables not compared
Observations not shared
Differences detected by variable
Differences detected
Non-identical attributes
More information about the package and the function is available in the arsenal documentation.
Additionally, you can use all.equal(df1, df2) too.
[1] "Attributes: < Component “row.names”: Numeric: lengths (3, 4) differ >"
[2] "Length mismatch: comparison on first 3 components"
[3] "Component “id”: Lengths (3, 4) differ (string compare on first 3)"
[4] "Component “id”: 3 string mismatches"
[5] "Component “a”: Lengths (3, 4) differ (string compare on first 3)"
[6] "Component “a”: 2 string mismatches"
[7] "Component “b”: Numeric: lengths (3, 4) differ"
To compare only the structure (class and attributes) of two data sets, without relying on another package:
structure_df1 <- sapply(df1, function(x) paste(class(x), attributes(x), collapse = ""))
structure_df2 <- sapply(df2, function(x) paste(class(x), attributes(x), collapse = ""))
all(structure_df1 == structure_df2)

How can I select rows from a dataframe that do not match?

I'm trying to identify the values in a data frame that do not match, but can't figure out how to do this.
# make data frame
a <- data.frame( x = c(1,2,3,4))
b <- data.frame( y = c(1,2,3,4,5,6))
# select only values from b that are not in 'a'
# attempt 1:
results1 <- b$y[ !a$x ]
# attempt 2:
results2 <- b[b$y != a$x,]
If a = c(1,2,3) this works, because the length of b is then a multiple of the length of a. However, I'm trying to just select all the values from b$y that are not in a$x, and don't understand what function to use.
If I understand correctly, you need the negation of the %in% operator. Something like this should work:
subset(b, !(y %in% a$x))
> subset(b, !(y %in% a$x))
y
5 5
6 6
Try the set difference function setdiff. So you would have
results1 = setdiff(a$x, b$y) # elements in a$x NOT in b$y
results2 = setdiff(b$y, a$x) # elements in b$y NOT in a$x
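One difference worth knowing: setdiff treats its arguments as sets, so duplicated values are returned only once, whereas the %in% approach keeps every matching element. A small illustration with made-up vectors:

```r
x <- c(1, 2, 3, 4)
y <- c(1, 5, 5, 6)

# Set difference: duplicates are collapsed
setdiff(y, x)
# [1] 5 6

# Logical indexing: every element of y not found in x is kept
y[!(y %in% x)]
# [1] 5 5 6
```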
You could also use dplyr for this task. To find what is in b but not a:
library(dplyr)
anti_join(b, a, by = c("y" = "x"))
# y
# 1 5
# 2 6
