I have two data frames in R.
dataframe 1
A B C D E F G
1 2 a a a a a
2 3 b b b c c
4 1 e e f f e
dataframe 2
X Y Z
1 2 g
2 1 h
3 4 i
1 4 j
I want to match dataframe1's columns A and B against dataframe2's columns X and Y. It is NOT a pairwise comparison: the order within a pair does not matter, i.e. row 1 of dataframe1 (A=1, B=2) is considered the same as both row 1 (X=1, Y=2) and row 2 (X=2, Y=1) of dataframe2.
When a match is found, I would like to add columns C, D, E, F and G of dataframe1 to the matched row of dataframe2, with NA where there is no match, as follows:
Final dataframe
X Y Z C D E F G
1 2 g a a a a a
2 1 h a a a a a
3 4 i na na na na na
1 4 j e e f f e
I only know how to match on a single column; matching on two interchangeable columns and merging the two data frames based on the result is difficult for me. Please suggest a smart way of doing this.
For ease of discussion (thanks to Vincent and DWin, who commented on my previous question that I should test the code), here is the code for loading dataframes 1 and 2 into R.
df1 <- data.frame(A = c(1,2,4), B=c(2,3,1), C=c('a','b','e'),
D=c('a','b','e'), E=c('a','b','f'),
F=c('a','c','f'), G=c('a','c', 'e'))
df2 <- data.frame(X = c(1,2,3,1), Y=c(2,1,4,4), Z=letters[7:10])
The following works, but no doubt can be improved.
I first create a little helper function that performs a row-wise sort on the first two columns (A and B, or X and Y) and renames them to V1 and V2.
replace_index <- function(dat){
x <- as.data.frame(t(sapply(seq_len(nrow(dat)),
function(i)sort(unlist(dat[i, 1:2])))))
names(x) <- paste("V", seq_len(ncol(x)), sep="")
data.frame(x, dat[, -(1:2), drop=FALSE])
}
replace_index(df1)
V1 V2 C D E F G
1 1 2 a a a a a
2 2 3 b b b c c
3 1 4 e e f f e
This means you can use a straight-forward merge to combine the data.
merge(replace_index(df1), replace_index(df2), all.y=TRUE)
V1 V2 C D E F G Z
1 1 2 a a a a a g
2 1 2 a a a a a h
3 1 4 e e f f e j
4 3 4 <NA> <NA> <NA> <NA> <NA> i
This is slightly clunky, and has some potential collision and order issues, but it works with your example:
df1a <- df1; df1a$A <- df1$B; df1a$B <- df1$A #reverse A and B
merge(df2, rbind(df1,df1a), by.x=c("X","Y"), by.y=c("A","B"), all.x=TRUE)
to produce
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i <NA> <NA> <NA> <NA> <NA>
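One form of the collision issue mentioned above: if a row of df1 has A equal to B, the reversed copy repeats it exactly and the merge returns that match twice. A minimal sketch, assuming such exact duplicates can simply be dropped with unique() before merging. The row (5, 5) and the shortened df1 are made up here only to trigger the case:

```r
df1 <- data.frame(A = c(1, 2, 4, 5), B = c(2, 3, 1, 5),
                  C = c("a", "b", "e", "k"))   # row (5, 5) is a made-up example
df2 <- data.frame(X = c(1, 2, 3, 1, 5), Y = c(2, 1, 4, 4, 5),
                  Z = letters[7:11])

# Reversed copy of df1: swap the two key columns
df1a <- df1
df1a$A <- df1$B
df1a$B <- df1$A

# unique() removes the exact duplicate created when A == B
both <- unique(rbind(df1, df1a))
res <- merge(df2, both, by.x = c("X", "Y"), by.y = c("A", "B"), all.x = TRUE)
res
```

This does not help if df1 itself contains both (1, 2) and (2, 1) with different values; in that case each df2 row would still match more than once.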
One approach would be to create an id key for matching that is order invariant.
# create id key to match
require(plyr)
df1 = adply(df1, 1, transform, id = paste(min(A, B), "-", max(A, B)))
df2 = adply(df2, 1, transform, id = paste(min(X, Y), "-", max(X, Y)))
# combine data frames using `match`
cbind(df2, df1[match(df2$id, df1$id),3:7])
This produces the output
X Y Z id C D E F G
1 1 2 g 1 - 2 a a a a a
1.1 2 1 h 1 - 2 a a a a a
NA 3 4 i 3 - 4 <NA> <NA> <NA> <NA> <NA>
3 1 4 j 1 - 4 e e f f e
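The same order-invariant key idea can be sketched in base R without plyr, using the vectorized pmin() and pmax(). The names key1, key2, idx and res are only illustrative; resetting the row names avoids the 1.1 / NA labels seen in the output above:

```r
df1 <- data.frame(A = c(1, 2, 4), B = c(2, 3, 1),
                  C = c("a", "b", "e"), D = c("a", "b", "e"),
                  E = c("a", "b", "f"), F = c("a", "c", "f"),
                  G = c("a", "c", "e"))
df2 <- data.frame(X = c(1, 2, 3, 1), Y = c(2, 1, 4, 4), Z = letters[7:10])

# Order-invariant key: smaller value first, larger value second
key1 <- paste(pmin(df1$A, df1$B), pmax(df1$A, df1$B), sep = "-")
key2 <- paste(pmin(df2$X, df2$Y), pmax(df2$X, df2$Y), sep = "-")

# match() gives, for each row of df2, the first matching row of df1 (or NA)
idx <- match(key2, key1)
res <- cbind(df2, df1[idx, c("C", "D", "E", "F", "G")])
rownames(res) <- NULL  # drop the row names inherited from df1
res
```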
You could also join the tables both ways (X == A and Y == B, then X == B and Y == A) and rbind them. This will produce duplicate pairs where one way yielded a match and the other yielded NA, so you would then reduce duplicates by slicing only a single row for each X-Y combination, the one without NA if one exists.
library(dplyr)
m <- left_join(df2,df1,by = c("X" = "A","Y" = "B"))
n <- left_join(df2,df1,by = c("Y" = "A","X" = "B"))
rbind(m,n) %>%
group_by(X,Y) %>%
arrange(C,D,E,F,G) %>% # sort to put NA rows on bottom of pairs
slice(1) # take top row from combination
Produces:
Source: local data frame [4 x 8]
Groups: X, Y
X Y Z C D E F G
1 1 2 g a a a a a
2 1 4 j e e f f e
3 2 1 h a a a a a
4 3 4 i NA NA NA NA NA
Here's another possible solution in base R. This solution cbind()s new key columns (K1 and K2) to both data.frames using the vectorized pmin() and pmax() functions to derive the canonical order of the key columns, and merges on those:
merge(cbind(df2, K1 = pmin(df2$X, df2$Y), K2 = pmax(df2$X, df2$Y)),
      cbind(df1, K1 = pmin(df1$A, df1$B), K2 = pmax(df1$A, df1$B)),
      all.x = TRUE)[, -c(1:2, 6:7)]
## X Y Z C D E F G
## 1 1 2 g a a a a a
## 2 2 1 h a a a a a
## 3 1 4 j e e f f e
## 4 3 4 i <NA> <NA> <NA> <NA> <NA>
Note that the use of pmin() and pmax() is only possible for this problem because you only have two key columns; if you had more, you'd have to use some kind of apply+sort solution to achieve the canonical key order for merging, similar to what @Andrie does in his helper function, which would work for any number of key columns but would be less performant.
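For illustration, here is a rough sketch of that apply+sort idea with three key columns. The data frames d1 and d2 and the helper canonical_key() are hypothetical and not part of the question, and the sketch assumes the key columns are all numeric:

```r
# Hypothetical three-key example: rows match if they hold the same key
# values in any order
d1 <- data.frame(A = c(1, 5), B = c(2, 6), C = c(3, 4), val = c("p", "q"))
d2 <- data.frame(X = c(3, 4, 9), Y = c(1, 5, 9), Z = c(2, 6, 9), id = 1:3)

# Sort the key columns within each row and collapse them into one string
canonical_key <- function(dat, cols) {
  apply(dat[cols], 1, function(r) paste(sort(r), collapse = "-"))
}

d1$key <- canonical_key(d1, c("A", "B", "C"))
d2$key <- canonical_key(d2, c("X", "Y", "Z"))

# Left join from d2 on the single canonical key column
res <- merge(d2, d1[c("key", "val")], by = "key", all.x = TRUE)
res
```

Collapsing the sorted keys into a single string keeps the merge on one column, at the cost of a row-wise apply() that will be slower than pmin()/pmax() on large data.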
I got these two data frames:
a <- c('A','B','C','D','E','F','G','H')
b <- c(1,2,1,3,1,3,1,6)
c <- c('K','K','H','H','K','K','H','H')
frame1 <- data.frame(a,b,c)
a <- c('A','A','B','B','C','C','D','D','E','E','F','F','G','H','H')
d <- c(5,5,6,3,1,9,1,0,2,3,6,5,5,5,4)
e <- c('W','W','D','D','D','D','W','W','D','D','W','W','D','W','W')
frame2<- data.frame(a,d,e)
And now I want to include the column 'e' from 'frame2' into 'frame1' depending on the matching value in column 'a' of both data frames. Note: 'e' is the same for all rows with the same value in 'a'.
The result should look like this:
a b c e
1 A 1 K W
2 B 2 K D
3 C 1 H D
4 D 3 H W
5 E 1 K D
6 F 3 K W
7 G 1 H D
8 H 6 H W
Any suggestions?
You can use match to look up the matching values in column 'a' of both data frames:
frame1$e <- frame2$e[match(frame1$a, frame2$a)]
frame1
# a b c e
#1 A 1 K W
#2 B 2 K D
#3 C 1 H D
#4 D 3 H W
#5 E 1 K D
#6 F 3 K W
#7 G 1 H D
#8 H 6 H W
or using merge:
merge(frame1, frame2[!duplicated(frame2$a), c("a", "e")], all.x=TRUE)
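The match version works even though 'a' repeats in frame2 because match() returns the position of the first occurrence only, so each value of frame1$a picks exactly one row of frame2. That is also why the merge version needs !duplicated() to avoid repeated rows. A small sketch with shortened, made-up vectors (lookup_keys, lookup_vals and wanted are illustrative names):

```r
lookup_keys <- c("A", "A", "B", "B", "C")   # keys with repeats, like frame2$a
lookup_vals <- c("W", "W", "D", "D", "D")   # value attached to each key
wanted <- c("A", "B", "C", "Z")             # "Z" has no counterpart

pos <- match(wanted, lookup_keys)
pos               # 1 3 5 NA: first occurrence of each key, NA when absent
lookup_vals[pos]  # "W" "D" "D" NA
```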
You can join the two data frames on column 'a' and keep only the matched values: do a left join, then drop the columns from the second data frame that aren't needed.
Using dplyr :
library(dplyr)
frame2 %>%
distinct(a, e, .keep_all = TRUE) %>%
right_join(frame1, by = 'a') %>%
select(-d) %>%
arrange(a)
# a e b c
#1 A W 1 K
#2 B D 2 K
#3 C D 1 H
#4 D W 3 H
#5 E D 1 K
#6 F W 3 K
#7 G D 1 H
#8 H W 6 H
I want to add a total row (as in the Excel tables) while writing my data.frame in a worksheet.
Here is my present code (using openxlsx):
writeDataTable(wb=WB, sheet="Data", x=X, withFilter=F, bandedRows=F, firstColumn=T)
X contains a data.frame with 8 character variables and 1 numeric variable, so the total row should contain a total only for the numeric column. Ideally I could switch on Excel's total row feature while writing the table to the workbook object, as I did with firstColumn, rather than adding a total row manually.
I searched for a solution both on StackOverflow and in the official openxlsx documentation, but to no avail. Please suggest solutions using openxlsx.
EDIT:
Adding data sample:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
After Total row:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
na na na na na na na 22 na
library(janitor)
adorn_totals(df, "row")
#> A B C D E F G H I
#> a b s r t i s 5 j
#> f d t y d r s 9 s
#> w s y s u c k 8 f
#> Total - - - - - - 22 -
If you prefer empty space instead of - in the character columns you can specify fill = "" or fill = NA.
Assuming your data is stored in a data.frame called df:
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)
You can create a row using lapply
totals <- lapply(df, function(col) {
if (is.numeric(col)) sum(col) else NA  # total for numeric columns, NA otherwise
})
and add it to df using rbind()
df <- rbind(df, totals)
head(df)
A B C D E F G H I
1 a b s r t i s 5 j
2 f d t y d r s 9 s
3 w s y s u c k 8 f
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA>
I want to join two dataframes by two columns they have in common but I do not want mutual pairs to be considered as duplicates.
Sample dataframes look like:
>df
letter1 letter2 value
d e 1
c d 2
c e 4
>dc
letter1 letter2
a e
c a
c d
c e
d a
d c
d e
e a
I want to join them by the first two columns, leaving in the third column the value in df$value and NA if the row does not exist in df. I have tried:
s <- join(dc,df, by = c("letter1","letter2"))
>s
letter1 letter2 value
a e NA
c a NA
c d 2
c e 4
d a NA
d c 2
d e 1
e a NA
Here, the pair d c is considered the same as c d and the value in the third column is the same. What I want is d c being considered as non-present in df, so their row value is NA. My desired output is:
>s
letter1 letter2 value
a e NA
c a NA
c d 2
c e 4
d a NA
d c NA
d e 1
e a NA
How can I join the dataframes so mutual pairs are considered different combinations?
UPDATE: I am sorry but I have just realized there was a problem with my input dataframes and that the join line I was trying actually works. I will accept the first answer that also works to give credit to the author.
We can use apply to change the order
df[1:2] <- t(apply(df[1:2], 1, sort))
dc[1:2] <- t(apply(dc[1:2], 1, sort))
and then do the join
You could use merge instead of join:
merge(dc,df, by = c("letter1","letter2"),all=TRUE)
#Creating the data frames
df <- data.frame(letter1=c("d","c","c"),
letter2=c("e","d","e"),
value=c(1,2,4))
dc <- data.frame(letter1=c("a","c","c","c","d","d","d","e"),
letter2=c("e","a","d","e","a","c","e","a"))
# Merging the data frames
dout <- merge(df,dc,by=c("letter1","letter2"),all=TRUE)
# Outcome
letter1 letter2 value
1 c d 2
2 c e 4
3 c a NA
4 d e 1
5 d a NA
6 d c NA
7 a e NA
8 e a NA
I have 8 columns of variables, of which I must keep columns 1 to 3. Of columns 4 to 8, I need to keep only those with exactly 3 levels and drop those that do not meet that condition.
I tried the following command
data3 <- data2[,sapply(data2,function(col)length(unique(col)))==3]
It managed to retain the variables with 3 levels, but deleted my first 3 columns.
You could do a two step process:
data4 <- data2[1:3]
#Your answer for the second part here:
data3 <- data2[,sapply(data2,function(col)length(unique(col)))==3]
merge(data3,data4)
Depending on what you would like your expected output to be, you could try the option all = TRUE inside merge().
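One caveat with the two-step approach: if data3 and data4 happen to share no column names, merge() returns their Cartesian product rather than a side-by-side combination. A minimal sketch that avoids merging altogether by building one logical index. The data here is made up, and n_unique and keep are illustrative names:

```r
# Made-up data: 3 columns to keep unconditionally, then 3 candidate columns
data2 <- data.frame(id = 1:6, x = letters[1:6], y = 6:1,
                    p = c("a", "b", "c", "a", "b", "c"),   # 3 distinct values
                    q = c("a", "a", "b", "b", "a", "b"),   # 2 distinct values
                    r = c(1, 2, 3, 3, 2, 1))               # 3 distinct values

n_unique <- sapply(data2, function(col) length(unique(col)))

# Keep a column if it is among the first three OR has exactly 3 distinct values
keep <- seq_along(data2) <= 3 | n_unique == 3
data3 <- data2[, keep, drop = FALSE]
names(data3)
```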
I would suggest another approach:
x = 1:3
cbind(data2[x], Filter(function(i) length(unique(i))==3, data2[-x]))
# 1 2 3 5
#1 a 1 3 b
#2 b 2 4 b
#3 c 3 5 b
#4 d 4 6 a
#5 e 5 7 c
#6 f 6 8 c
#7 g 7 9 c
#8 h 8 10 a
#9 i 9 11 c
#10 j 10 12 b
Data:
data2 = setNames(
data.frame(letters[1:10],
1:10,
3:12,
sample(letters[1:10],10, replace=T),
sample(letters[1:3],10, replace=T)),
1:5)
Assuming that the columns 4:8 are factor class, we can also use nlevels to filter the columns. We create 'toKeep' as the numeric index of columns to keep, and 'toFilter' as numeric index of columns to filter. We subset the dataset into two: 1) using the 'toKeep' as the index (data2[toKeep]), 2) using the 'toFilter', we further subset the dataset by looping with sapply to find the number of levels (nlevels), create logical index (==3) to filter the columns and cbind with the first subset.
toKeep <- 1:3
toFilter <- setdiff(seq_len(ncol(data2)), toKeep)
cbind(data2[toKeep], data2[toFilter][sapply(data2[toFilter], nlevels)==3])
# V1 V2 V3 V4 V6
#1 B B D C B
#2 B D D A B
#3 D E B A B
#4 C B E C A
#5 D D A D E
#6 E B A A B
data
set.seed(24)
data2 <- as.data.frame(matrix(sample(LETTERS[1:5], 8*6, replace=TRUE), ncol=8), stringsAsFactors=TRUE)
I have a data frame like this:
n = c(2, 2, 3, 3, 4, 4)
n <- as.factor(n)
s = c("a", "b", "c", "d", "e", "f")
df = data.frame(n, s)
df
n s
1 2 a
2 2 b
3 3 c
4 3 d
5 4 e
6 4 f
and I want to access the first element of each level of my factor (and have in this example a vector containing a, c, e).
It is possible to reach the first element of one level, with
df$s[df$n == 2][1]
but it does not work for all levels:
df$s[df$n == levels(n)]
[1] a f
How would you do that?
And to go further, I’d like to modify my data frame to see which is the first element for each level at every occurrence. In my example, a new column should be:
n s rep firstelement
1 2 a a a
2 2 b c a
3 3 c e c
4 3 d a c
5 4 e c e
6 4 f e e
Edit. The first part of my answer addresses the original question, i.e. before "And to go further" (which was added by OP in an edit).
Another possibility, using duplicated. From ?duplicated: "duplicated() determines which elements of a vector or data frame are duplicates of elements with smaller subscripts."
Here we use !, the logical negation (NOT), to select not duplicated elements of 'n', i.e. first elements of each level of 'n'.
df[!duplicated(df$n), ]
# n s
# 1 2 a
# 3 3 c
# 5 4 e
Update: I didn't see your "And to go further" edit until now. My first suggestion would definitely be to use ave, as already proposed by @thelatemail and @sparrow. But just to dig around in the R toolbox and show you an alternative, here's a dplyr way:
Group the data by n, then use the mutate function to create a new variable 'first' holding the first element of s within each group (s[1]):
library(dplyr)
df %>%
group_by(n) %>%
mutate(
first = s[1])
# n s first
# 1 2 a a
# 2 2 b a
# 3 3 c c
# 4 3 d c
# 5 4 e e
# 6 4 f e
Or go all in with dplyr convenience functions and use first instead of [1]:
df %>%
group_by(n) %>%
mutate(
first = first(s))
A dplyr solution for your original question would be to use summarise:
df %>%
group_by(n) %>%
summarise(
first = first(s))
# n first
# 1 2 a
# 2 3 c
# 3 4 e
Here is an approach using match:
df$s[match(levels(n), df$n)]
EDIT: Maybe this looks a bit confusing ...
To get a column which lists the first elements you could use match twice (but with x and table arguments swapped):
df$firstelement <- df$s[match(levels(n), df$n)[match(df$n, levels(n))]]
df$firstelement
# [1] a a c c e e
# Levels: a b c d e f
Let's look at this in detail:
## this returns the first matching elements
match(levels(n), df$n)
# [1] 1 3 5
## when we swap the x and table argument in match we get the level index
## for each df$n (the duplicated indices are important)
match(df$n, levels(n))
# [1] 1 1 2 2 3 3
## results in
c(1, 3, 5)[c(1, 1, 2, 2, 3, 3)]
# [1] 1 1 3 3 5 5
df$s[c(1, 1, 3, 3, 5, 5)]
# [1] a a c c e e
# Levels: a b c d e f
The function ave is useful in these cases:
df$firstelement = ave(df$s, df$n, FUN = function(x) x[1])
df
n s firstelement
1 2 a a
2 2 b a
3 3 c c
4 3 d c
5 4 e e
6 4 f e
In this case I prefer the plyr package; it gives more freedom to manipulate the data.
library(plyr)
ddply(df,.(n),function(subdf){return(subdf[1,])})
n s
1 2 a
2 3 c
3 4 e
You could also use data.table
library(data.table)
dt = as.data.table(df)
dt[, list(firstelement = s[1]), by=n]
which would get you:
n firstelement
1: 2 a
2: 3 c
3: 4 e
The by=n bit groups everything by each value of n so s[1] is getting the first element of each of those groups.
To get this as an extra column you could do:
dt[, newcol := s[1], by=n]
dt
# n s newcol
#1: 2 a a
#2: 2 b a
#3: 3 c c
#4: 3 d c
#5: 4 e e
#6: 4 f e
So this just takes the value of s from the first row of each group and assigns it to a new column.
df$s[sapply(levels(n), function(particular.level) { which(df$n == particular.level)[1]})]
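I believe your problem is that you are comparing two vectors of different lengths: df$n has six elements and levels(n) has three. vector == vector compares element by element and recycles the shorter vector, so it only runs without a warning here because the length of df$n is a multiple of the length of levels(n).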
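To make the recycling visible, here is a small sketch that spells out what the comparison in the question actually does. The names recycled and hits are only illustrative:

```r
n <- factor(c(2, 2, 3, 3, 4, 4))
s <- c("a", "b", "c", "d", "e", "f")
df <- data.frame(n, s)

# levels(n) has length 3, so it is recycled to length 6 before comparing
recycled <- rep(levels(n), length.out = nrow(df))
recycled                           # "2" "3" "4" "2" "3" "4"

# Element-by-element comparison against the recycled levels
hits <- as.character(df$n) == recycled
hits                               # TRUE FALSE FALSE FALSE FALSE TRUE
df$s[hits]                         # rows 1 and 6, i.e. a and f
```

Only rows 1 and 6 happen to line up with the recycled levels, which explains the "a f" result in the question.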
Surprised not to see this classic in the answer stream yet.
> do.call(rbind, lapply(split(df, df$n), function(x) x[1,]))
## n s
## 2 2 a
## 3 3 c
## 4 4 e