Compare if two dataframe objects in R are equal?

How do I check if two objects, e.g. dataframes, are value equal in R?
By value equal, I mean the value of each row of each column of one dataframe is equal to the value of the corresponding row and column in the second dataframe.

It is not entirely clear what "value equal" means for two data frames, but if the goal is to test whether the values are the same, here is an example of two non-identical data frames with equal values:
a <- data.frame(x = 1:10)
b <- data.frame(y = 1:10)
To test if all values are equal:
all(a == b) # TRUE
To test if objects are identical (they are not, they have different column names):
identical(a,b) # FALSE: class, colnames, rownames must all match.

In addition, identical is still useful for this goal when applied to the columns themselves:
identical(a[, "x"], b[, "y"]) # TRUE
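If only the cell values matter, the checks above can be wrapped in a small helper. This is a sketch rather than an established function: values_equal is a hypothetical name, and it assumes that a missing value in either data frame should count as "not equal".

```r
# Hypothetical helper: TRUE only if both data frames have the same shape
# and every pair of corresponding cells compares equal.
values_equal <- function(a, b) {
  # `==` on data frames of different sizes raises an error, so check first
  if (!identical(dim(a), dim(b))) return(FALSE)
  # all() returns NA when a comparison involves NA; isTRUE() maps that to FALSE
  isTRUE(all(a == b))
}

a <- data.frame(x = 1:10)
b <- data.frame(y = 1:10)
values_equal(a, b)                     # TRUE
values_equal(a, data.frame(y = 1:9))   # FALSE
```

Column names are ignored here, which matches all(a == b) above; use identical when names and attributes must match as well.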

We can use the R package compare to test whether the names of the object and the values are the same, in just one step.
a <- data.frame(x = 1:10)
b <- data.frame(y = 1:10)
library(compare)
compare(a, b)
# FALSE [TRUE]
# objects are not identical (different names), but values are the same.
If we only care about equality of the values, we can set ignoreNames=TRUE:
compare(a, b, ignoreNames=TRUE)
#TRUE
# dropped names
The package has additional interesting functions such as compareEqual and compareIdentical.

Here is another method using comparedf from the arsenal package.
It gives you the differences detected by variable, the variables not shared (different columns, for example), the number of observations not shared, as well as a summary of the overall comparison.
df1 <- data.frame(id = paste0("person", 1:3),
a = c("a", "b", "c"),
b = c(1, 3, 4))
> df1
id a b
1 person1 a 1
2 person2 b 3
3 person3 c 4
df2 <- data.frame(id = paste0("person", 4:1),
a = c("c", "b", "a", "f"),
b = c(1, 3, 4, 4),
d = paste0("rn", 1:4))
> df2
id a b d
1 person4 c 1 rn1
2 person3 b 3 rn2
3 person2 a 4 rn3
4 person1 f 4 rn4
library(arsenal)
comparedf(df1, df2)
Compare Object
Function Call:
comparedf(x = df1, y = df2)
Shared: 3 non-by variables and 3 observations.
Not shared: 1 variables and 0 observations.
Differences found in 2/3 variables compared.
0 variables compared have non-identical attributes.
A more detailed summary is also available:
summary(comparedf(df1, df2))
This call returns several tables:
Summary of data.frames
Summary of overall comparison
Variables not shared
Other variables not compared
Observations not shared
Differences detected by variable
Differences detected
Non-identical attributes
Here you have more info about the package and the function.
Additionally, you can use all.equal(df1, df2) too.
[1] "Attributes: < Component “row.names”: Numeric: lengths (3, 4) differ >"
[2] "Length mismatch: comparison on first 3 components"
[3] "Component “id”: Lengths (3, 4) differ (string compare on first 3)"
[4] "Component “id”: 3 string mismatches"
[5] "Component “a”: Lengths (3, 4) differ (string compare on first 3)"
[6] "Component “a”: 2 string mismatches"
[7] "Component “b”: Numeric: lengths (3, 4) differ"
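Because all.equal returns either TRUE or a character vector describing the differences (as in the output above), it is usually wrapped in isTRUE when a single logical answer is needed. A minimal sketch, reusing the df1 and df2 from this answer:

```r
df1 <- data.frame(id = paste0("person", 1:3),
                  a = c("a", "b", "c"),
                  b = c(1, 3, 4))
df2 <- data.frame(id = paste0("person", 4:1),
                  a = c("c", "b", "a", "f"),
                  b = c(1, 3, 4, 4),
                  d = paste0("rn", 1:4))

# all.equal() gives TRUE or a description of the differences,
# so isTRUE() turns the result into a plain TRUE/FALSE
isTRUE(all.equal(df1, df2))   # FALSE
isTRUE(all.equal(df1, df1))   # TRUE
```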

Without the need to rely on another package, but to compare structure (class and attributes) of two data sets:
structure_df1 <- sapply(df1, function(x) paste(class(x), attributes(x), collapse = ""))
structure_df2 <- sapply(df2, function(x) paste(class(x), attributes(x), collapse = ""))
all(structure_df1 == structure_df2)

Related

Is there a way to replace rows in one dataframe with another in R?

I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match:
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of expected output. Nice one!
You have several options, but let's look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that have a text as label. So that could be your issue.
Running your code gives a warning because you are trying to import new factors (labels) into df1 that don't exist. And R doesn't know what to do with them, so it just inserts NA-values.
In his answer, r2evans used stringsAsFactors = FALSE to stop strings being converted to factors. You can even go as far as disabling it on a session-wide basis using options(stringsAsFactors=FALSE) (and I've heard it will be disabled by default in the forthcoming R 4.0, yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
The call df1$x %in% df2$x returns a logical vector indicating which elements of df1$x are found in df2$x, i.e. the first two and not the last two. But it doesn't tell us which positions in the first vector correspond to which positions in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
For solutions, I would recommend r2evans's answer, as it doesn't rely on extra packages (although data.table and dplyr are two powerful packages worth getting to know).
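The difference between membership and position can be seen on two small vectors. The names x1 and x2 below are made up for this illustration; match is the function used in the other answer.

```r
x1 <- c(1, 2, 3, 4)   # plays the role of df1$x
x2 <- c(2, 1)         # plays the role of df2$x after swapping its rows

x1 %in% x2            # TRUE TRUE FALSE FALSE  (membership only)
which(x1 %in% x2)     # 1 2                    (positions in x1, not in x2)
match(x1, x2)         # 2 1 NA NA              (position of each x1 value in x2)
```

Only match records where each value of x1 sits in x2, which is the information needed to pull the right replacement row.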
In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned from the merge function called in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table where we join on the 'x' column, assign the values of 'y' in second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to create the data with stringsAsFactors = FALSE (this is the default from R 4.0.0 onward); otherwise both datasets need to share all of their factor levels.

How do I reshape a character vector based on the contents in R?

I have a series of character vectors in which, for every participant (denoted in the reprex by a letter), there is a time point (in the reprex either 1 or 2) followed by a score. Here is the reprex:
l <- c("A","1","27","B","1","26","2","54")
How can I reshape the vector to create a dataframe that has three columns, with Column A as participant, Column B as Time Point, and Column C as Score?
The intended output would look something like this:
data.frame("Participant" = c("A","B","B"),
"Time Point" = c("1","1","2"),
"Score" = c("27","26","54"))
If easier to make, it could be brought into this shape:
data.frame("Participant" = c("A","B"),
"TimePoint1" = c("27","26"),
"TimePoint2" = c("NA","54"))
Any direction/thoughts are appreciated.
Here is one way in base R.
We can locate each participant by matching a pattern in the participant name with grep; in the example shared, every participant is an upper-case letter. We use those positions to split the vector so that each participant gets their own list element. Within each element, the first value is the participant name and the remaining values alternate between Time.Point and Score.
output <- do.call(rbind, lapply(split(l,
findInterval(seq_along(l), grep('[A-Z]', l))), function(x) {
data.frame(Participant = x[1],
Time.Point = x[-1][c(TRUE, FALSE)],
Score = x[-1][c(FALSE, TRUE)])
}))
rownames(output) <- NULL
output <- type.convert(output, as.is = TRUE)
output
# Participant Time.Point Score
#1 A 1 27
#2 B 1 26
#3 B 2 54
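The grouping step in the answer above can be inspected on its own. This is just an illustration using the same l; the names starts and groups are introduced here and are not part of the original answer.

```r
l <- c("A", "1", "27", "B", "1", "26", "2", "54")

starts <- grep("[A-Z]", l)                    # positions of participant labels: 1 4
groups <- findInterval(seq_along(l), starts)  # 1 1 1 2 2 2 2 2
split(l, groups)                              # one character vector per participant
```

Each element of l is assigned to the most recent participant label at or before its position, which is why the five values starting at "B" all land in group 2.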

How can I create vector of multiple strings containing specific column names?

Given a 3 x 100 matrix, how could I create a vector of strings containing individual column names? Specifically, columns comprise 20 sets of 5 consecutive measures and therefore strings should match variable (i.e. varA, ... varC), sets (SET1 to SET20) and order (1 to 5). For example:
my_matrix = replicate(100, rnorm(3))
my_names <- c("varA.SET1.1", "varA.SET1.2", "varA.SET1.3", "varA.SET1.4", "varA.SET1.5",
"varA.SET2.1", "varA.SET2.2", "varA.SET2.3", "varA.SET2.4", "varA.SET2.5",
...
"varC.SET5.5")
You can use sprintf.
v <- LETTERS[1:3]
set <- 1:20
ord <- 1:5
ex <- expand.grid(v, set, ord)
my_names <- sprintf("var%s.SET%i.%i", ex[, 1],ex[, 2], ex[, 3])
head(my_names)
#[1] "varA.SET1.1" "varB.SET1.1" "varC.SET1.1" "varA.SET2.1" "varB.SET2.1"
#[6] "varC.SET2.1"
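Note that expand.grid varies its first argument fastest, so the output above cycles through the variables first. If the names should run in the order shown in the question (order within set within variable), one possible adjustment is to pass the arguments in reverse. The column names ord, set and v in the grid are chosen here for readability.

```r
v <- LETTERS[1:3]
set <- 1:20
ord <- 1:5

# First argument varies fastest: ord, then set, then v
ex <- expand.grid(ord = ord, set = set, v = v, stringsAsFactors = FALSE)
my_names <- sprintf("var%s.SET%i.%i", ex$v, ex$set, ex$ord)
head(my_names)
#[1] "varA.SET1.1" "varA.SET1.2" "varA.SET1.3" "varA.SET1.4" "varA.SET1.5"
#[6] "varA.SET2.1"
```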

Identify difference in 2 data frame with missing values

Suppose I have 2 data frames:
a1 <- data.frame(a = 1:5, b=2:6)
a2 <- data.frame(a = 1:5, b=c(2:5,NA))
I would like to identify which columns are not identical (I will need the column number later). I thought that this would do the trick:
apply(!a1==a2, 2, sum, na.rm=TRUE)
However, because the last entry in a2 is an NA, it doesn't work.
Not sure why you're using sum, but to identify which columns are not identical you could use mapply with identical and negate the result.
which(!mapply(identical, a1, a2))
# b
# 2
This gives the column number. Or, more simply, for use in a column subset:
!mapply(identical, a1, a2)
# a b
# FALSE TRUE
Just as a note, the word identical has a meaning in R that may be different from the result of ==, so it's possible you may need to clarify your question a bit.
x <- 1
y <- 1L
x == y
# [1] TRUE
identical(x, y)
# [1] FALSE
If you wanted to use sum, you could try
colSums(a1==a2, na.rm=TRUE)!=nrow(a1)
# a b
#FALSE TRUE
Or using your code
apply(a1==a2, 2, sum, na.rm=TRUE)!=nrow(a1)
# a b
#FALSE TRUE
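The original attempt fails because the comparison of 6 with NA yields NA, and na.rm=TRUE then drops that cell from the count. If a cell-by-cell view is wanted, one option is to flag missingness explicitly. The sketch below assumes that NA versus a value counts as a difference and NA versus NA counts as equal; differs is a name introduced here.

```r
a1 <- data.frame(a = 1:5, b = 2:6)
a2 <- data.frame(a = 1:5, b = c(2:5, NA))

# A cell differs if the values differ or if exactly one side is NA
differs <- (a1 != a2) | (is.na(a1) != is.na(a2))
# Cells where both sides are NA come out as NA here; treat them as equal
differs[is.na(differs)] <- FALSE

which(colSums(differs) > 0)
# b
# 2
```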

Unstacking a stacked dataframe unstacks columns in a different order

Using R 3.1.0
a = as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
b = unstack(stack(a))
# Returns FALSE
all(colnames(a) == colnames(b))
The documentation on stack/unstack says unstacking should "reverse this [stack] operation". Am I missing something? Why do I need to re-order the columns of b?
The last few lines of the stack function (see utils:::stack.data.frame) create a data.frame with two columns, "values" and "ind". The "ind" column is created with the code:
ind = factor(rep.int(names(x), lapply(x, length)))
But, look at how factor works in general (pay attention to the order of the "Levels"):
factor(c(1, 2, 3, 10, 4))
# [1] 1 2 3 10 4
# Levels: 1 2 3 4 10
factor(paste0("A", c(1, 2, 3, 10, 4)))
# [1] A1 A2 A3 A10 A4
# Levels: A1 A10 A2 A3 A4
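The default level order comes from sorting the unique values, but factor also accepts an explicit levels argument. A small sketch with made-up names (nm is introduced only for this illustration):

```r
nm <- paste0("V", c(1, 2, 10))

levels(factor(nm))                        # "V1"  "V10" "V2"   (sorted as strings)
levels(factor(nm, levels = unique(nm)))   # "V1"  "V2"  "V10"  (order of appearance)
```

Passing unique(nm) keeps the levels in the order in which the names first appear.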
If the functionality you describe is important for your analysis, you might do better modifying a version of stack.data.frame to capture the order of the data.frame names during the factoring process, like this:
Stack <- function (x, select, ...)
{
if (!missing(select)) {
nl <- as.list(1L:ncol(x))
names(nl) <- names(x)
vars <- eval(substitute(select), nl, parent.frame())
x <- x[, vars, drop = FALSE]
}
keep <- unlist(lapply(x, is.vector))
if (!sum(keep))
stop("no vector columns were selected")
if (!all(keep))
warning("non-vector columns will be ignored")
x <- x[, keep, drop = FALSE]
data.frame(values = unlist(unname(x)),
# REMOVE THIS --> ind = factor(rep.int(names(x), lapply(x, length))),
# AND ADD THIS:
ind = factor(rep.int(names(x), lapply(x, length)), unique(names(x))),
stringsAsFactors = FALSE)
}
Testing, one, two, three...
## Not using identical here because
## the factor levels are different
all.equal(Stack(a), stack(a))
# [1] TRUE
identical(unstack(Stack(a)), a)
# [1] TRUE
You'll never get me to defend the R documentation...
stack(...) creates a new data frame with two columns, values and ind. The latter has the column names from the original table, as a factor, ordered alphabetically. unstack(...) uses that factor to (re-) create columns of the new data frame. So the phrase "Unstacking reverses this operation" should be interpreted loosely...
To get the result you want, you need to reorder the factor ind, as follows:
a <- as.data.frame(do.call(cbind, lapply(1:100, function(x) { c(1,2,3)})))
c <- stack(a)
c$ind <- factor(c$ind, levels=colnames(a))
d <- unstack(c)
identical(a,d)
# [1] TRUE
