Comparing two columns of two dataframes and finding mismatches - r

I have two different dataframes as below
df1 <- data.frame(state=letters[1:3],district=letters[4:6])
state district
1 a d
2 b e
3 c f
and df2
df2 <- data.frame(state=letters[1:3], district= c("e","d","f"))
state district
1 a e
2 b d
3 c f
I want to check whether the districts of df1 exist in df2; if not, select the state and district.
And if a district of df1 does exist in df2, does it belong to the same state as in df1?
For example, district "d" belongs to state "a" in df1, but district "d" belongs to state "b" in df2, which is wrong.
Here is what I have tried so far:
library(dplyr)
'%noin%' <- Negate('%in%')
# creating a unique id for df1
df1$uuid <- tolower(paste0(df1$state, "_", df1$district))
# creating a unique id for df2
df2$uuid <- tolower(paste0(df2$state, "_", df2$district))
df_result <- df1 %>%
  filter(uuid %noin% df2$uuid) %>%
  select(state, district)
df_result
state district
1 a d
2 b e
How can I also select the state in df2 that these districts belong to?
My expected output looks like this:
expected_output <- data.frame(state=c("a","b"), district=c("d","e"),state_in_df_2=c("b","a"))
state district state_in_df_2
1 a d b
2 b e a
Thank you in advance

Using an anti_join and a left_join you could do:
library(dplyr)
df1 <- data.frame(state=letters[1:3],district=letters[4:6])
df2 <- data.frame(state=letters[1:3], district= c("e","d","f"))
df1 %>%
  anti_join(df2, by = c("state", "district")) %>%
  left_join(df2, by = "district", suffix = c("", "_in_df2"))
#> state district state_in_df2
#> 1 a d b
#> 2 b e a

Not sure if this will generalise to your case, but you can try:
filter(merge(df1, df2, by = 'district'), state.x != state.y)
# district state.x state.y
#1 d a b
#2 e b a
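If you also want the columns named as in the expected output, a small follow-up sketch (building on the merge result above) could be:
res <- filter(merge(df1, df2, by = 'district'), state.x != state.y)
# rename and reorder the columns to match the expected output
data.frame(state = res$state.x, district = res$district, state_in_df_2 = res$state.y)
which matches the expected_output shown in the question.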

Related

group_by, get most frequent and second most frequent

I have the following dataset:
a b
1 a
1 a
1 a
1 none
2 none
2 none
2 b
3 a
3 c
3 c
3 d
4 a
I want to get the most frequent value of b for each a, and the second most frequent value of b for each a. In case two values of b have the same frequency, I am indifferent about which of the two is considered the "first" or the "second".
In this case the expected output would be:
d2:
a first second
1 a none
2 none b
3 c a(or d, doesn't matter)
4 a NA
As you can see, a = 4 has just one value of b, so I expect an NA in the output column "second", as there is no second most frequent value.
data:
a <- c(1,1,1,1,2,2,2,3,3,3,3,4)
b<- c("a","a", "a", "none", "none", "none", "b", "a", "c" , "c", "d","a")
d <- data.frame(a,b)
What I have tried at the moment is the following:
d1 <- d %>% group_by(a) %>% summarize(first = names(which.max(table(b))), second = names(which.max(table(b)[-which.max(table(b))])))
but it doesn't work properly. Any idea on how to do this?
You can count the number of rows for each a and b combination and, for each value of a, select the 1st and 2nd value of b in summarise.
library(dplyr)
d %>%
  count(a, b, sort = TRUE) %>%
  group_by(a) %>%
  summarise(first = b[1], second = b[2])
# A tibble: 4 x 3
# a first second
# <dbl> <chr> <chr>
#1 1 a none
#2 2 none b
#3 3 c a
#4 4 a NA
Here is one option with data.table
library(data.table)
setDT(d)[, .N, .(a, b)][order(-N), .(first = first(b), second = b[2]), a]
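For comparison, a base R sketch of the same idea (not part of the original answers), using split(), table(), and sort():
res <- do.call(rbind, lapply(split(d$b, d$a), function(x) {
  tab <- sort(table(x), decreasing = TRUE)   # frequencies, most frequent first
  c(first = names(tab)[1], second = names(tab)[2])
}))
data.frame(a = as.numeric(rownames(res)), res, row.names = NULL)
This reproduces the expected output, with NA in "second" for groups that have only one distinct value.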

Joining / merging two data frames by symmetric differences in rows and columns

I would like to join / merge two data frames, but ignoring similarities in rows and columns in the resulting data frame. Consider the following example:
df1 <- data.frame(
  id = c("a","b","c"),
  a = runif(3, 1, 9),
  b = runif(3, 1, 9)
)
df2 <- data.frame(
  df1[1:2, ],
  c = runif(2, 1, 9)
)
This results in two data frames that have exactly four cells in common (not counting id), so df1[1:2, 2:3] == df2[1:2, 2:3]. However, they differ in that df1 has an additional row and df2 has an additional column:
> print(df1)
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469
> print(df2)
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
I want a new data frame to consist of the symmetric differences between these two, so no duplicates in rows or columns. The closest result I have achieved is by using dplyr::full_join(df1, df2, by = "id"), but this results in duplicated columns.
The result should look like this:
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280
3 c 5.608775 4.219469 NA
What's the best way of achieving this dynamically? Thanks
With data.table we can join on 'id' and assign 'c' from the second dataset to create the 'c' column in the first one. By default, non-matching elements will be assigned NA:
library(data.table)
setDT(df1)[df2, c := c, on = .(id)]
df1
# id a b c
#1: a 4.601639 1.065642 7.476494
#2: b 6.065758 6.234421 8.929932
#3: c 4.000351 7.365717 NA
NOTE: The values are different because set.seed was not used.
In base R, an option would be match
df1$c <- df2$c[match(df1$id, df2$id)]
Regarding the OP's use of full_join (a left_join would be fine based on the example), the trick is to remove the columns of the second dataset that are not needed:
library(dplyr)
nm1 <- c("id", setdiff(names(df2), names(df1)))
left_join(df1, select(df2, all_of(nm1)), by = 'id')
Another approach, if one of the data frames already has all the columns you want (df2 here):
library(dplyr)
bind_rows(df2, anti_join(df1, df2))
#Joining, by = c("id", "a", "b")
# id a b c
#1 a 1.912298 5.792475 6.899253
#2 b 2.537666 1.495075 1.186120
#3 c 5.947766 6.594028 NA
In this particular case this would be sufficient
library(sqldf)
sqldf("select * from df1 left natural join df2")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr:
library(dplyr)
left_join(df1, df2)
but in general you might need the following. Note that this is perfectly general: we did not need to specify the column or row names in either the code above or the code below, and the code below is symmetric in df1 and df2, so it does not rely on knowing the structure of either.
sqldf("select * from df1 left natural join df2
       union
       select * from df2 left natural join df1")
## id a b c
## 1 a 6.396168 4.037320 2.444122
## 2 b 4.119025 8.181253 6.444280
## 3 c 5.608775 4.219469 NA
or with dplyr. This will give a warning but still works; you can avoid the warning if id is character rather than factor, or by converting it to character first.
library(dplyr)
rbind(left_join(df1, df2), left_join(df2, df1)) %>% distinct
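For example (a sketch assuming the df1 and df2 reproduced in the Note below), converting id to character up front avoids the factor warning:
# convert the join column to character before joining
df1$id <- as.character(df1$id)
df2$id <- as.character(df2$id)
rbind(left_join(df1, df2), left_join(df2, df1)) %>% distinct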
Note
Because the question did not use set.seed, the code that generates the input is not reproducible, but we can copy the particular df1 and df2 so that we have the same data as in the question.
Lines1 <- "
id a b
1 a 6.396168 4.037320
2 b 4.119025 8.181253
3 c 5.608775 4.219469"
df1 <- read.table(text = Lines1)
Lines2 <- "
id a b c
1 a 6.396168 4.037320 2.444122
2 b 4.119025 8.181253 6.444280"
df2 <- read.table(text = Lines2)

Selecting values of one dataframe based on a partial string in another dataframe

I have two dataframes (DF1 and DF2)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
DF1
parties
A, B
C
A
C, D
library(dplyr)  # needed for bind_cols
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B, C)
names(DF2) <- c("party","party.number")
DF2
party party.number
A 1
B 2
C 3
D 4
E 5
F 6
G 7
H 8
I 9
J 10
The desired result should be an additional column in DF1 which contains the party numbers taken from DF2 for each row in DF1.
Desired result (based on DF1):
parties party.numbers
A, B 1, 2
C 3
A 1
C, D 3, 4
I strongly suspect that the answer involves something like str_match(DF1$parties, DF2$party.number) or a similar regular expression, but I can't figure out how to put two (or more) party numbers into the same row (DF1$party.numbers).
One option is gsubfn: match the pattern of an upper-case letter and, as the replacement, use a key/value list.
library(gsubfn)
DF1$party.numbers <- gsubfn("[A-Z]", setNames(as.list(DF2$party.number), DF2$party),
                            as.character(DF1$parties))
DF1
# parties party.numbers
#1 A, B 1, 2
#2 C 3
#3 A 1
#4 C, D 3, 4
An alternative solution using the tidyverse. You can reshape DF1 to have one party per row, then join DF2, and then reshape back to your initial form:
library(tidyverse)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
DF1 %>%
  group_by(id = row_number()) %>%
  separate_rows(parties) %>%
  left_join(DF2, by = c("parties" = "party")) %>%
  summarise(parties = paste(parties, collapse = ", "),
            party.numbers = paste(party.number, collapse = ", ")) %>%
  select(-id)
# # A tibble: 4 x 2
# parties party.numbers
# <chr> <chr>
# 1 A, B 1, 2
# 2 C 3
# 3 A 1
# 4 C, D 3, 4
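For completeness, a base R sketch (not part of the answers above) using strsplit() and match() gives the same party.numbers column:
# split each parties string on the comma, look the pieces up in DF2, and paste the numbers back together
DF1$party.numbers <- sapply(
  strsplit(as.character(DF1$parties), ",\\s*"),
  function(p) paste(DF2$party.number[match(p, DF2$party)], collapse = ", ")
)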

Function that ignores missing columns

Say I have the following two data frames:
col1 <- c("a","b","c","d","e")
col2 <- c("A","B","C","D","E")
col1a <- c("a","b","c","d","e")
col2a <- c("A","B","C","D","E")
df1 <- data.frame(col1, col2)
df2 <- data.frame(col1a, col2a)
colnames(df1) <- c("c1","c2")
colnames(df2) <- c("c1","c3")
And I have the following function to rename column headers:
library(dplyr)
col_rename <- function(x) x %>% rename(new_c1 = c1, new_c2 = c2, new_c3 = c3)
When I run this function, I get an error because the columns in the function do not match the columns in the data frame.
df1 <- col_rename(df1)
Error: `c3` contains unknown variables
How can I make the function run only on the present columns, and ignore the ones not present, without removing or changing the column names specified in the function?
EDIT:
I can see how the example was a bit confusing. I have many dataframes with many columns. These columns are shared by some dataframes but not all. However, I want to rename all columns specified by the function, regardless of what is present in the dataframe. It looks something like this:
col1 <- c(1:5)
col2 <- c(1:5)
col3 <- c(1:5)
col4 <- c(1:5)
df1 <- data.frame(col1,col2,col3,col4)
df2 <- data.frame(col1,col2,col3,col4)
colnames(df1) <- c("c1","c2","c6","c8")
colnames(df2) <- c("c1","c3","c2","c8")
AB_rename <- function(x) x %>% rename(aa = c1, bb = c2,
                                      cc = c3, dd = c4,
                                      ee = c5, ff = c6,
                                      gg = c7, hh = c8)
Therefore I cannot follow the example of #Ycw, as they do not all follow the same rename rule. How do I make this ignore columns that are not present?
Here is a workaround that uses setNames for the col_rename function.
col_rename <- function(x) setNames(x, paste0("new_", names(x)))
col_rename(df1)
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
col_rename(df2)
new_c1 new_c3
1 a A
2 b B
3 c C
4 d D
5 e E
Or use the select_all function from dplyr.
library(dplyr)
df1 %>% select_all(function(x) paste0("new_", x))
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
The formula shorthand (~) also works for select_all:
df2 %>% select_all(~paste0("new_", .))
new_c1 new_c3
1 a A
2 b B
3 c C
4 d D
5 e E
rename_all also works well
library(dplyr)
df1 %>% rename_all(~paste0("new_", .))
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
Update
This is an update to address the OP's updated question.
We can create a named vector showing the relationship between old and new column names, and define a function that changes the names based on setNames.
# Create name vector
vec <- paste0("c", 1:8)
names(vec) <- c("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh")
# Create the function
AB_rename <- function(x, name_vec){
  old_colname <- names(x)
  # look up each existing column in name_vec so the new names follow the column
  # order of x (assumes every column of x appears in name_vec)
  new_colname <- names(name_vec)[match(old_colname, name_vec)]
  x2 <- setNames(x, new_colname)
  return(x2)
}
AB_rename(df1, vec)
aa bb ff hh
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
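As a side note, recent dplyr versions (a sketch assuming dplyr >= 1.0, where rename() accepts a tidyselect lookup vector) can do this directly; any_of() silently skips entries whose old name is absent from the data frame:
library(dplyr)
# names are the new column names, values the old ones (same mapping as vec above)
lookup <- c(aa = "c1", bb = "c2", cc = "c3", dd = "c4",
            ee = "c5", ff = "c6", gg = "c7", hh = "c8")
df1 %>% rename(any_of(lookup))
df2 %>% rename(any_of(lookup))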

Select rows based on non-directed combinations of columns

I am trying to select the maximum value in a dataframe's third column based on the combinations of the values in the first two columns.
My problem is similar to this one but I can't find a way to implement what I need.
EDIT: Sample data changed to make the column names more obvious.
Here is some sample data:
library(tidyr)
set.seed(1234)
df <- data.frame(group1 = letters[1:4], group2 = letters[1:4])
df <- df %>% expand(group1, group2)
df <- subset(df, subset = group1!=group2)
df$score <- runif(n = 12,min = 0,max = 1)
df
# A tibble: 12 × 3
group1 group2 score
<fctr> <fctr> <dbl>
1 a b 0.113703411
2 a c 0.622299405
3 a d 0.609274733
4 b a 0.623379442
5 b c 0.860915384
6 b d 0.640310605
7 c a 0.009495756
8 c b 0.232550506
9 c d 0.666083758
10 d a 0.514251141
11 d b 0.693591292
12 d c 0.544974836
In this example rows 1 and 4 are 'duplicates'. I would like to select row 4 as the value in the score column is larger than in row 1. Ultimately I would like a dataframe to be returned with the group1 and group2 columns and the maximum value in the score column. So in this example, I expect there to be 6 rows returned.
How can I do this in R?
I'd prefer dealing with this problem in two steps:
library(dplyr)

# Create function for computing group IDs from a data frame of groups (per column)
get_group_id <- function(groups) {
  apply(groups, 1, function(row) {
    paste0(sort(row), collapse = "_")
  })
}

group_id <- get_group_id(select(df, -score))

# Perform the computation
df %>%
  mutate(groupId = group_id) %>%
  group_by(groupId) %>%
  slice(which.max(score)) %>%
  ungroup() %>%
  select(-groupId)
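An alternative sketch (swapping the apply() helper for pmin()/pmax(); not part of the original answer) builds the canonical pair key inline:
library(dplyr)
df %>%
  mutate(pair = paste(pmin(as.character(group1), as.character(group2)),
                      pmax(as.character(group1), as.character(group2)),
                      sep = "_")) %>%
  group_by(pair) %>%
  slice(which.max(score)) %>%
  ungroup() %>%
  select(-pair)
Sorting each pair alphabetically with pmin()/pmax() gives the same group key as the get_group_id() function, so both versions keep the row with the larger score per unordered pair.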
