Merge on multiple columns results in strange ordering

When two data frames are merged by a single numeric column, the result is (by default) ordered by that column as a number. However, if two numeric columns are passed as the by columns, the result has a different ordering; in fact, it looks as if the numeric columns were converted to strings and sorted as such. Is this expected, or a bug?
For example, consider the following two data frames:
A <- data.frame(a = 1:12, b = 1, x = runif(12))
B <- data.frame(a = 1:12, b = 1, y = runif(12))
Then merge(A, B, by = 'a') results in a data frame whose column a has the values 1, 2, ..., 9, 10, 11, 12 (i.e., the expected numerical ordering). However, merge(A, B, by = c('a', 'b')) results in a data frame whose column a has the values 1, 10, 11, 12, 2, 3, ..., 8, 9 (i.e., the same ordering as sort(as.character(1:12))).

I'd call this a feature of merge rather than a bug.
Inspection of the source code of merge shows that when multiple columns are used for merging, the 'key' columns are internally combined into a single vector using paste().
For example, columns a and b of your data frame A are represented by the character vector "1\r1" "2\r1" "3\r1" "4\r1" "5\r1" "6\r1" "7\r1" "8\r1" "9\r1" "10\r1" "11\r1" "12\r1".
merge sorts the resulting data frame by this character vector, which is how it ends up alphabetically ordered.
When you merge by only one column, there is no need for paste(), so the sort uses the original type of the column.
Here is the relevant piece of the source code of merge (the full text can be obtained by running merge.data.frame without parentheses in the R console):
if (l.b == 1L) {
    bx <- x[, by.x]
    if (is.factor(bx))
        bx <- as.character(bx)
    by <- y[, by.y]
    if (is.factor(by))
        by <- as.character(by)
}
else {
    if (!is.null(incomparables))
        stop("'incomparables' is supported only for merging on a single column")
    bx <- x[, by.x, drop = FALSE]
    by <- y[, by.y, drop = FALSE]
    names(bx) <- names(by) <- paste0("V", seq_len(ncol(bx)))
    bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
    bx <- bz[seq_len(nx)]
    by <- bz[nx + seq_len(ny)]
}
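You can see the mechanism in isolation; a minimal sketch of the internal step (the exact result of a character sort may also depend on your locale):
# Paste the key columns together the way merge() does internally,
# then look at the order a character sort produces.
keys <- paste(1:12, 1, sep = "\r")
order(keys)
# [1]  1 10 11 12  2  3  4  5  6  7  8  9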

Using the dplyr package, we can get the following result:
library(dplyr)
full_join(A, B, by = c("a", "b"))
    a b          x           y
1   1 1 0.39907404 0.700782559
2   2 1 0.84429488 0.600727090
3   3 1 0.32232471 0.141495156
4   4 1 0.74214210 0.262601640
5   5 1 0.92944116 0.779255689
6   6 1 0.10902661 0.001185645
7   7 1 0.46336478 0.961711785
8   8 1 0.58396008 0.211824751
9   9 1 0.63126074 0.422233784
10 10 1 0.09995935 0.179069642
11 11 1 0.40832159 0.581116173
12 12 1 0.48440814 0.004372634
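If you prefer to stay in base R, you can also just re-sort the merged result numerically afterwards; a minimal sketch:
# Merge on both columns, then restore the numeric ordering of the keys
m <- merge(A, B, by = c("a", "b"))
m <- m[order(m$a, m$b), ]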

Related

How to "translate" variables in one data frame using a second data frame as a key?

I have a data frame with two string variables, and would like to convert them to numeric values using a separate "key" data frame. The example below is simplified, but I need to be able to apply it to replace the contents of the V1 and V2 variables based on an arbitrary key that will not always be a=1, b=2, etc.
Example:
set.seed(1)
x <- data.frame(
  V1 = sample(letters, 10, replace = TRUE),
  V2 = sample(letters, 10, replace = TRUE)
)
key <- data.frame(letters, 1:26)
I need to reference the first element of V1 against the key, replace it with the corresponding value (e.g. a = 1, b = 2, etc.), do the same for the second element, and then, when done with V1, move on and do the same for V2.
I've been struggling to work out a solution using lapply() and sub() but keep getting stuck because I can't see a way to pass the sub() function more than a 1:1 comparison. Is there a different function I should be using?
Forgive me; I'm sure the solution must be simple, but I'm still quite new to R.
Here are two base R approaches. (Both assume the key columns are named, as in the data block further down: key <- data.frame(letters = letters, values = 1:26).)
Using sapply():
x[] <- with(key, sapply(x, function(v) values[match(v, letters)]))
or
x <- data.frame(with(key, sapply(x, function(v) values[match(v, letters)])))
Using as.matrix() (similar to the unlist() approach by @Ronak Shah):
x[] <- with(key, values[match(as.matrix(x), letters)])
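Another base R idiom, a sketch under the same assumption about key's column names: build a named lookup vector once, then index it with the whole matrix.
# Named lookup vector: names are the letters, values are the numbers
lookup <- setNames(key$values, as.character(key$letters))
x[] <- lookup[as.matrix(x)]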
You can create a lookup table with data.table and then apply the mapping along the columns of your data frame with apply:
library(data.table)
key <- data.table(letters = letters, value = 1:26, key = "letters")
apply(x, 2, function(col) key[col]$value)
      V1 V2
 [1,] 25  1
 [2,]  4 21
 [3,]  7 21
 [4,]  1 10
 [5,]  2 22
 [6,] 23 14
 [7,] 11 10
 [8,] 14  7
 [9,] 18  9
[10,] 19 15
You could unlist and match in base R
x[] <- key$values[match(unlist(x), key$letters)]
x
# V1 V2
#1 25 1
#2 4 21
#3 7 21
#4 1 10
#5 2 22
#6 23 14
#7 11 10
#8 14 7
#9 18 9
#10 19 15
Or using dplyr
library(dplyr)
x %>% mutate_all(~key$values[match(., key$letters)])
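In current dplyr (>= 1.0.0), mutate_all() is superseded; a sketch of the equivalent with across():
library(dplyr)
x %>% mutate(across(everything(), ~ key$values[match(.x, key$letters)]))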
data
set.seed(1)
x <- data.frame(
  V1 = sample(letters, 10, replace = TRUE),
  V2 = sample(letters, 10, replace = TRUE)
)
key <- data.frame(letters = letters, values = 1:26)
You could use apply with both row and column margins, e.g. as.data.frame(apply(x, c(1, 2), function(l) key[key$letters == l, 2])), though this applies the lookup one element at a time.

Create dataframe from smallest vector available

I want to create a dataframe from a list of dataframes, specifically from a certain column of those dataframes. However, each dataframe contains a different number of observations, so the following code gives me an error.
diffs <- data.frame(sensor1 = sensores[[1]]$Diff,
                    sensor2 = sensores[[2]]$Diff,
                    sensor3 = sensores[[3]]$Diff,
                    sensor4 = sensores[[4]]$Diff,
                    sensor5 = sensores[[5]]$Diff)
The error:
Error in data.frame(sensor1 = sensores[[1]]$Diff, sensor2 = sensores[[2]]$Diff, :
arguments imply differing number of rows: 29, 19, 36, 26
Is there some way to force data.frame() to take the minimal number of rows available from each of the columns, in this case 19?
Maybe there is a built-in function in R that can do this; any solution is appreciated, but I'd love for it to be as general and as clear as possible.
Thank you in advance.
I can think of two approaches:
Example data:
df1 <- data.frame(A = 1:3)
df2 <- data.frame(B = 1:4)
df3 <- data.frame(C = 1:5)
Compute the number of rows of the smallest dataframe:
min_rows <- min(sapply(list(df1, df2, df3), nrow))
Use subsetting when combining:
diffs <- data.frame(a = df1[1:min_rows, ], b = df2[1:min_rows, ], c = df3[1:min_rows, ])
diffs
  a b c
1 1 1 1
2 2 2 2
3 3 3 3
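For the original use case with a whole list of data frames, the same idea can be wrapped up generically; a sketch (truncate_bind is a hypothetical helper, not an existing function):
# Truncate every data frame in the list to the shortest one, then cbind
truncate_bind <- function(dfs) {
  n <- min(vapply(dfs, nrow, integer(1)))
  do.call(cbind, lapply(dfs, function(d) d[seq_len(n), , drop = FALSE]))
}
truncate_bind(list(df1, df2, df3))
For the question's data, something like truncate_bind(lapply(sensores, function(s) s["Diff"])) would keep the first 19 rows of each Diff column (the columns would then need renaming).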
Alternatively, use merge:
rowmerge <- function(x, y) {
  # create row indicators for the merge:
  x$ind <- 1:nrow(x)
  y$ind <- 1:nrow(y)
  out <- merge(x, y, all = TRUE, by = "ind")
  out["ind"] <- NULL
  return(out)
}
Reduce(rowmerge, list(df1, df2, df3))
   A  B C
1  1  1 1
2  2  2 2
3  3  3 3
4 NA  4 4
5 NA NA 5
To get rid of the rows with NAs, remove all = TRUE from the merge() call.
For your particular case, you would probably call Reduce(rowmerge, sensores), assuming that sensores is a list of dataframes.
Note: if you already have an index somewhere (e.g. a timestamp of some sort), then it would be advisable to simply merge on that index instead of creating ind.

How to make a fuzzy join in R using more than one variable on each side

I would like to join the two data frames:
a <- data.frame(x=c(1,3,5))
b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b"))
with a condition like (x > start) & (x < end), in order to get a result like this:
# x    y
#1 1    a
#2 3 <NA>
#3 5    b
I don't want to build a potentially large cartesian product and then keep only the few rows matching the condition, and I'd like a solution using the tidyverse (I am not interested in a solution using SQL, which would be a confession of failure). I thought of the fuzzyjoin package, but I cannot find examples fitting my need: the function to apply for the condition takes only two arguments. I also tried to put start and end into a single argument with data.frame(z = I(purrr::map2(b$start, b$end, list)), y = b$y)
# z y
#1 0, 2 a
#2 4, 6 b
but although the data looks fine, fuzzy_left_join doesn't accept it.
I'm looking for solutions that work in more general cases (n variables on the LHS, m on the RHS, not necessarily numeric, with arbitrary conditions).
UPDATE
I also want to be able to express conditions like (x == start + 1) | (x == end + 1), giving here:
# x y
#1 1 a
#2 3 a
#3 5 b
For this case you don't need multi_by or multi_match_fun; this works:
library(fuzzyjoin)
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`))
# x start end y
# 1 1 0 2 a
# 2 3 NA NA <NA>
# 3 5 4 6 b
I eventually dug into the code of fuzzy_join and found a way to do what I want, even though it isn't properly documented. fuzzy_left_join doesn't work here, but the following does (not really pretty, and it actually does a cartesian product internally):
library(fuzzyjoin)
library(dplyr)
g <- function(x, y) (x > y[, "start"]) & (x < y[, "end"])
fuzzy_join(a, b, multi_by = list(x = "x", y = c("start", "end")),
           multi_match_fun = g, mode = "left") %>%
  select(x, y)
A data.table approach could be
library(data.table)
name1 <- setdiff(names(setDT(b)), names(setDT(a)))
# perform a left outer join and then select the required columns
a[b, (name1) := mget(name1), on = .(x > start, x < end)][, .(x, y)]
which gives
   x    y
1: 1    a
2: 3 <NA>
3: 5    b
Sample data:
a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"))
Update: In case you want to join both dataframes on the (x == start + 1) | (x == end + 1) condition, then you can try
library(data.table)
DT1 <- as.data.table(a)
DT2 <- as.data.table(b)
# Perform the 1st join on "x = start + 1" and then another on "x = end + 1". Finally, row-bind both results.
DT <- rbindlist(list(
  DT1[DT2[, start_temp := start + 1], on = c(x = "start_temp"), .(x, y), nomatch = 0],
  DT1[DT2[, end_temp := end + 1], on = c(x = "end_temp"), .(x, y), nomatch = 0]
))
DT
# x y
#1: 1 a
#2: 5 b
#3: 3 a
A possible answer, to illustrate what I am trying to do: extending dplyr in some way. I will be happy to hear of ways to improve this solution, or of problems I didn't see.
The solution avoids the cartesian product, but it turns one of the input data frames, as well as the intermediate result, into a list of data frames. I didn't include the final selection of the x and y columns, which is easy to code.
my_left_join <- function(.DATA1, .DATA2, .WHERE) {
  call <- as.list(match.call())
  df1 <- .DATA1
  df1$._row_ <- 1:nrow(df1)
  dfl1 <- replyr::replyr_split(df1, "._row_")
  eval(substitute(
    dfl2 <- mapply(function(.x) {
      filter(.DATA2, with(.x, WHERE)) %>%
        mutate(._row_ = .x$._row_)
    }, dfl1, SIMPLIFY = FALSE),
    list(WHERE = call$.WHERE)))
  df2 <- replyr::replyr_bind_rows(dfl2)
  left_join(df1, df2, by = "._row_") %>% select(-._row_)
}
my_left_join(a,b,(x>start)&(x<end))
# x start end y
#1 1 0 2 a
#2 3 NA NA <NA>
#3 5 4 6 b
my_left_join(a,b,(x==(start+1))|(x==(end+1)))
# x start end y
#1 1 0 2 a
#2 3 0 2 a
#3 5 4 6 b
You can try a GenomicRanges solution. (Note that findOverlaps works on integer coordinates and treats the ranges as closed intervals, so boundary values would also match, unlike the strict inequalities in the question; for this data the results coincide.)
library(GenomicRanges)
# setup GRanges objects
a_gr <- GRanges(1, IRanges(a$x,a$x))
b_gr <- GRanges(1, IRanges(b$start, b$end))
# find overlaps between the two data sets
res <- as.data.frame(findOverlaps(a_gr,b_gr))
# create the expected output
a$y <- NA
a$y[res$queryHits] <- as.character(b$y)[res$subjectHits]
a
  x    y
1 1    a
2 3 <NA>
3 5    b
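Since this question was asked, dplyr (version 1.1.0 and later) has gained native non-equi joins via join_by(), which answers the original tidyverse request directly; a sketch:
library(dplyr)  # requires dplyr >= 1.1.0
left_join(a, b, by = join_by(x > start, x < end)) %>%
  select(x, y)
#   x    y
# 1 1    a
# 2 3 <NA>
# 3 5    b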

Faster Way to Create a Subset within a Loop or Apply Function in R

I'm new to R, so apologies in advance for bad form in my code.
I'm trying to figure out the best way to go through a dataframe, row by row, and modify a value based on logic that references other columns within that row or an entirely different dataframe. The issue is that the logic I'm using necessitates creating and subsetting a dataframe for each row to retrieve a minimum value. My real data set is 47,000 rows and 15 columns, so creating 47,000 subsets is taking a long time.
Here are sample datasets to help describe what I'm talking about.
df1 <- data.frame('A' = c(rep("Beer", 2), rep("Chip", 2)),
                  'B' = c(NA, 3, NA, 9),
                  'C' = 5:8,
                  'D' = NA)
df2 <- data.frame('Q' = c(rep("Beer", 2), rep("Chip", 2)),
                  'R' = 6:9,
                  'S' = c(12, 15, 4, 18),
                  'T' = c(23, 45, 75, 34))
df1:
   A  B C  D
Beer NA 5 NA
Beer  3 6 NA
Chip NA 7 NA
Chip  9 8 NA
df2:
   Q R  S  T
Beer 6 12 23
Beer 7 15 45
Chip 8  4 75
Chip 9 18 34
This loop does what I want: it checks whether the value in column B is NA; if it isn't, it uses that value for column D; if it is NA, it retrieves the minimum value from a filtered subset of df2. In the real use case I have other filtering conditions.
require(dplyr)
for (i in 1:nrow(df1)) {
  if (!is.na(df1$B[i])) {
    df1$D[i] <- df1$B[i]
  } else {
    x <- filter(df2, df1$A[i] == df2$Q)
    df1$D[i] <- min(x$S)
  }
}
Everyone says to avoid loops in R, so I created this function using apply, which also works (although it is a little more difficult to follow):
FUNC <- function(x) {
  apply(x, 1, function(y) {
    if (!is.na(y[2])) {
      y[4] <- y[2]
    } else {
      z <- filter(df2, y[1] == df2$Q)
      y[4] <- min(z$S)
    }
  })
}
df1$D <- as.numeric(FUNC(df1))
Output:
   A  B C  D
Beer NA 5 12
Beer  3 6  3
Chip NA 7  4
Chip  9 8  9
Aside question: is there a way to reference items in vector y by name instead of by index position?
So is there a better way to do this? Right now both methods take about 5-8 minutes to run through the 47,000+ rows, which seems long to me.
A vectorised dplyr approach: compute the per-group minimum of S once, join it onto df1, and pick B where it is not NA. (After the join, the group minimum is in D.x, because both frames carry a column named D.)
df1$D <- df2 %>%
  rename(A = Q) %>%
  group_by(A) %>%
  summarise(D = min(S)) %>%
  right_join(df1, by = "A") %>%
  mutate(D = ifelse(is.na(B), D.x, B)) %>%
  `[[`("D")

Comparing two columns

I am new to R, and I am having trouble with something I did all the time in Python.
I have two data frames (database and creditIDs), and what I want to do is compare one column in database with one column in creditIDs. More specifically, if a value in database[,5] doesn't exist in creditIDs[,1], I want to delete that entire row of database.
Here is the code:
for (i in 1:lengthColumns) {
  if (!(database$credit_id[i] %in% creditosVencidos)) {
    database[i, ] <- database[-i, ]
  }
}
But I keep on getting this error:
50: In `[<-.data.frame`(`*tmp*`, i, , value = structure(list( ... :
replacement element 50 has 9696 rows to replace 1 rows
Could someone explain why this is happening? Thanks!
The which() command returns the row indices that satisfy a boolean statement, much like numpy.where() in Python. Using $ after a dataframe with a column name gives you that column as a vector; alternatively, you could do d[, column_number].
In this example I create x and y columns that share their first five values, and use which() to slice the dataframe down to the rows where they are equal:
L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = rep(1:5, 2), y = 1:10, fac = fac))
d <- d[which(d$x == d$y), ]
d
  x y fac
1 1 1   A
2 2 2   B
3 3 3   C
4 4 4   B
5 5 5   B
You will need to adjust this for your column names/numbers.
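Adapted to the question's names, a sketch that keeps only the rows of database whose credit_id appears in creditosVencidos:
# Keep the rows whose credit_id is among the valid IDs
database <- database[which(database$credit_id %in% creditosVencidos), ]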
# Create two example data.frames
creditID <- data.frame(ID = c("896-19", "895-8", "899-1", "899-5"))
database <- data.frame(ID = c("896-19", "camel", "899-1", "goat", "899-1"))
# Method 1
database[database$ID %in% creditID$ID, ]
# Method 2 (subset() function)
database <- subset(database, ID %in% creditID$ID)
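Either way, only the rows of database whose ID appears in creditID are kept, so the result here would be:
      ID
1 896-19
3  899-1
5  899-1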
