Merge R dataframes

What I would like to do is a bit difficult to explain, but the code would look something like this:
df_merged <- merge(df1, df2,
by.x = c("City", "District"),
by.y = c("City", "District" | "Area"),
all.x = TRUE)
Here "|", in the code above, would mean "OR".
The basic point is that I would like to merge the two frames by two columns. "City" matches for both data frames. However, I also need there to be a match based on "District".
The problem is that, due to human error while the dataset was originally made, in df2 some values for "District" were put in the "Area" column. Hence, ideally, if we have "District" being "A" in df1, then the merge occurs if "A" is found in either the "District" or "Area" column from df2.
Here is an example:
df1 <- data.frame(City = c("A", "B"), District = c("cc", "dd"))
df2 <- data.frame(City = c("A", "A", "B", "B"), Code = c("1a","2a","3a","4a"), District = c("cc", "Apple", "Pear", "Orange"), Area = c("e", "a", "dd", "f"))
df3 <- data.frame(City = c("A", "B"), District = c("cc","dd"), Code = c("1a", "3a"))
Here df3 is what I am aiming for! As you can see in df2, there is something messed up and the values for district got into the wrong column. In my original dataset, it is difficult to clean up this error.
> df1
  City District
1    A       cc
2    B       dd
> df2
  City Code District Area
1    A   1a       cc    e
2    A   2a    Apple    a
3    B   3a     Pear   dd
4    B   4a   Orange    f
> df3
  City District Code
1    A       cc   1a
2    B       dd   3a

Here are a couple of options. If District is NA in df2 whenever the value was entered in Area by mistake, you can coalesce District with Area and then join on City and District. Alternatively, for each row of df1 you can look up the rows of df2 that match on City and on either District or Area, then widen df1 to hold the matched columns.
library(tidyverse)
df1 <- tibble(City = c(rep("Chicago", 4), rep("Tucson", 3)),
District = c("A", "A", "A", "B", "A", "B", "C"))
df2 <- tibble(City = c("Chicago", "Chicago", "Tucson", "Tucson"),
District = c("A", "B", NA, "C"),
Area = c("10", "30", "A", "20"),
value = c(1:4))
#option 1
left_join(df1,
df2 |>
mutate(District = coalesce(District, Area)),
by = c("City", "District" ))
#> # A tibble: 7 x 4
#> City District Area value
#> <chr> <chr> <chr> <int>
#> 1 Chicago A 10 1
#> 2 Chicago A 10 1
#> 3 Chicago A 10 1
#> 4 Chicago B 30 2
#> 5 Tucson A A 3
#> 6 Tucson B <NA> NA
#> 7 Tucson C 20 4
#option 2
df1 |>
mutate(matches = map2(City, District,
~filter(df2, City == .x & (District == .y | Area == .y))|>
select(-City, -District))) |>
unnest_wider(matches)
#> # A tibble: 7 x 4
#> City District Area value
#> <chr> <chr> <chr> <int>
#> 1 Chicago A 10 1
#> 2 Chicago A 10 1
#> 3 Chicago A 10 1
#> 4 Chicago B 30 2
#> 5 Tucson A A 3
#> 6 Tucson B <NA> NA
#> 7 Tucson C 20 4
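For the data in the question, where the misplaced values sit in Area while District still holds other text, a base R sketch could run two separate merges (one on District, one on Area) and stack the results. This is only an illustration under the assumption that a match in either column is acceptable; the names by_district, by_area and matched are made up for the example.

```r
# Data from the question
df1 <- data.frame(City = c("A", "B"), District = c("cc", "dd"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(City = c("A", "A", "B", "B"),
                  Code = c("1a", "2a", "3a", "4a"),
                  District = c("cc", "Apple", "Pear", "Orange"),
                  Area = c("e", "a", "dd", "f"),
                  stringsAsFactors = FALSE)

# Match df1$District against df2$District
by_district <- merge(df1, df2[, c("City", "District", "Code")],
                     by = c("City", "District"))

# Match df1$District against df2$Area instead
by_area <- merge(df1, df2[, c("City", "Area", "Code")],
                 by.x = c("City", "District"),
                 by.y = c("City", "Area"))

# Stack both sets of matches; unique() drops rows found through both columns
matched <- unique(rbind(by_district, by_area))

# Left join back onto df1 so unmatched rows are kept with NA in Code
df3 <- merge(df1, matched, by = c("City", "District"), all.x = TRUE)
df3
```

If a row of df1 matches several rows of df2, this sketch returns one output row per match, so the result can have more rows than df1.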

Related

How to combine two rows of a dataframe into one row

I have a dataframe which looks like this.
Name info.1 info.2
ab a 1
123 a 1
de c 4
456 c 4
fg d 5
789 d 5
The two rows that need to be combined are identical aside from the name column and are together in the dataframe. I want the new dataframe to look like this:
Name ID info.1 info.2
ab 123 a 1
de 456 c 4
fg 789 d 5
I have no clue how to do this, and a Google search hasn't been helpful so far.
In base R you could do:
data.frame(Name = df[seq(nrow(df)) %% 2 == 1, 1],
           ID = df[seq(nrow(df)) %% 2 == 0, 1],
           df[seq(nrow(df)) %% 2 == 0, 2:3])
#>   Name  ID info.1 info.2
#> 2   ab 123      a      1
#> 4   de 456      c      4
#> 6   fg 789      d      5
Created on 2022-07-20 by the reprex package (v2.0.1)
A possible solution:
library(tidyverse)
df %>%
group_by(info.1) %>%
summarise(Name = str_c(Name, collapse = "_"), info.2 = first(info.2)) %>%
separate(Name, into = c("Name", "ID"), convert = TRUE) %>%
relocate(info.1, .before = info.2)
#> # A tibble: 3 × 4
#> Name ID info.1 info.2
#> <chr> <int> <chr> <int>
#> 1 ab 123 a 1
#> 2 de 456 c 4
#> 3 fg 789 d 5
Assuming the Name column is consistently ordered Name-ID-Name-ID then:
library(tidyverse)
data <- tibble(Name = c('ab', 123, 'de', 456, 'fg', 789),
info.1 = c('a', 'a', 'c', 'c', 'd', 'd'),
info.2 = c(1, 1, 4, 4, 5, 5))
# remove the troublesome column and make a tibble
# with the unique combos of info1 and 2
data_2 <- data %>% select(info.1, info.2) %>% distinct()
# add columns for name and ID by skipping every other row in the
# original tibble
data_2$Name <- data$Name[seq(from = 1, to = nrow(data), by = 2)]
data_2$ID <- data$Name[seq(from = 2, to = nrow(data), by = 2)]
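The same alternating-row assumption can be written in base R by folding the Name column into a two-column matrix. This is a minimal sketch, assuming every name row is immediately followed by its ID row; name_id, odd_rows and result are illustrative names.

```r
df <- data.frame(
  Name   = c("ab", "123", "de", "456", "fg", "789"),
  info.1 = c("a", "a", "c", "c", "d", "d"),
  info.2 = c(1, 1, 4, 4, 5, 5),
  stringsAsFactors = FALSE
)

# The approach only makes sense if rows come in complete pairs
stopifnot(nrow(df) %% 2 == 0)

# Fill a 2-column matrix row by row: odd entries -> Name, even entries -> ID
name_id <- matrix(df$Name, ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("Name", "ID")))

# Take the info columns from the first row of each pair
odd_rows <- seq(1, nrow(df), by = 2)
result <- data.frame(name_id, df[odd_rows, c("info.1", "info.2")],
                     stringsAsFactors = FALSE)
rownames(result) <- NULL
result
```

ID stays a character column here; wrap it in as.integer() if numeric IDs are needed.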
We could also use summarise and extract first as name and last as id:
data |>
group_by(info.1, info.2) |>
summarise(name = first(Name), ID = last(Name)) |>
ungroup() #|>
#relocate(3:4,1:2)
Output:
# A tibble: 3 × 4
info.1 info.2 name ID
<chr> <dbl> <chr> <chr>
1 a 1 ab 123
2 c 4 de 456
3 d 5 fg 789
We could also use
library(dplyr)
library(stringr)
data %>%
group_by(across(starts_with('info'))) %>%
mutate(ID = str_subset(Name, "^\\d+$"), .before = 2) %>%
ungroup %>%
filter(str_detect(Name, '^\\d+$', negate = TRUE))
-output
# A tibble: 3 × 4
Name ID info.1 info.2
<chr> <chr> <chr> <dbl>
1 ab 123 a 1
2 de 456 c 4
3 fg 789 d 5
data
data <- structure(list(Name = c("ab", "123", "de", "456", "fg", "789"
), info.1 = c("a", "a", "c", "c", "d", "d"), info.2 = c(1, 1,
4, 4, 5, 5)), row.names = c(NA, -6L), class = "data.frame")

Slice out sequence of grouped rows [duplicate]

This question already has answers here:
Getting the top values by group
I have this data:
df <- data.frame(
node = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B", "B", "B", "B"),
left = c("ab", "ab", "ab", "ab", "cc", "xx", "cc", "ab", "zz", "xx", "xx", "zz", "zz")
)
I want to count grouped frequencies and proportions and slice/filter out a sequence of grouped rows. Say, given the small dataset, I want to have the rows with the two highest Freq_left values per group. How can that be done? I can only extract the rows with the maximum Freq_left values but not the desired sequence of rows:
df %>%
group_by(node, left) %>%
# summarise
summarise(
Freq_left = n(),
Prop_left = round(Freq_left/nrow(.)*100, 4)
) %>%
slice_max(Freq_left)
# A tibble: 2 × 4
# Groups: node [2]
node left Freq_left Prop_left
<chr> <chr> <int> <dbl>
1 A ab 4 30.8
2 B xx 3 23.1
Expected output:
node left Freq_left Prop_left
<chr> <chr> <int> <dbl>
A ab 4 30.8
A cc 2 15.4
B xx 3 23.1
B zz 2 15.4
You could use dplyr::slice_max with n = 2 (thanks to @PaulSmith for pointing out that dplyr::top_n is superseded in favor of dplyr::slice_max):
library(dplyr)
df %>%
group_by(node, left) %>%
# summarise
summarise(
Freq_left = n(),
Prop_left = round(Freq_left/nrow(.)*100, 4)
) %>%
slice_max(order_by = Prop_left, n = 2)
#> `summarise()` has grouped output by 'node'. You can override using the `.groups` argument.
#> # A tibble: 4 × 4
#> # Groups: node [2]
#> node left Freq_left Prop_left
#> <chr> <chr> <int> <dbl>
#> 1 A ab 4 30.8
#> 2 A cc 2 15.4
#> 3 B xx 3 23.1
#> 4 B zz 2 15.4
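For comparison, the same "top two per group" selection can be sketched in base R with table(), order() and head(). This is an illustration rather than a replacement for the dplyr answer; freq and top2 are made-up names. Unlike slice_max(), which keeps ties by default (with_ties = TRUE), head() simply cuts each group at two rows.

```r
df <- data.frame(
  node = c("A", "B", "A", "A", "A", "B", "A", "A", "A", "B", "B", "B", "B"),
  left = c("ab", "ab", "ab", "ab", "cc", "xx", "cc", "ab", "zz", "xx", "xx", "zz", "zz"),
  stringsAsFactors = FALSE
)

# Count every node/left combination and drop combinations that never occur
freq <- as.data.frame(table(node = df$node, left = df$left),
                      responseName = "Freq_left", stringsAsFactors = FALSE)
freq <- freq[freq$Freq_left > 0, ]

# Proportion relative to all rows, as in the question
freq$Prop_left <- round(freq$Freq_left / nrow(df) * 100, 4)

# Sort by node, then by descending frequency, and keep two rows per node
freq <- freq[order(freq$node, -freq$Freq_left), ]
top2 <- do.call(rbind, lapply(split(freq, freq$node), head, n = 2))
rownames(top2) <- NULL
top2
```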

Replacing missing values by group and identifying mutual exclusiveness

I am working with grouped data in R.
In the following data example, I would like to fill the missing values in the "sex" variable, and keep them as is if there is no corresponding data (i.e. for id=6).
In the "diagnosis" variable, some ids have only one diagnosis and some have multiple diagnoses. So, I would also like to group the variable "diagnosis" into "wanted" to identify mutual exclusiveness.
The example data is:
d.f <- tribble (
~id, ~sex, ~diagnosis,
1, "M", "A",
1, NA, "B",
1, NA, "C",
2, NA, "A",
2, "F", NA,
2, NA, "A",
3, NA, NA,
3, "M", "A",
3, "M", "B",
4, "F", "C",
5, "F", "B",
6, NA, "A",
7, "M", NA
)
The desired data is:
wanted <- tribble (
~id, ~sex, ~diagnosis,~wanted,
1, "M", "A", "ABC group",
1, "M", "B", "ABC group",
1, "M", "C", "ABC group",
2, "F", "A", "Only A",
2, "F", NA, "Only A",
2, "F", "A", "Only A",
3, "M", NA, "AB group",
3, "M", "A", "AB group",
3, "M", "B", "AB group",
4, "F", "C", "Only C",
5, "F", "B", "Only B",
6, NA, "A", "Only A",
7, "M", NA, "Missing"
)
Mutate the sex column with first(na.omit(sex)); first is just an aggregating function, which is safe to use here.
The other column, say wanted, can be mutated in two steps:
1. Paste all unique non-missing strings in the group together using paste(unique(na.omit(diagnosis)), collapse = '').
2. Then use case_when to turn that string into the label of your choice.
library(tidyverse)
d.f %>%
group_by(id) %>%
mutate(sex = first(na.omit(sex)),
wanted = { x <- paste(unique(na.omit(diagnosis)), collapse = '');
case_when(nchar(x) == 1 ~ paste0('Only ', x),
nchar(x) == 0 ~ 'Missing',
TRUE ~ paste0(x, ' Group'))})
#> # A tibble: 13 x 4
#> # Groups: id [7]
#> id sex diagnosis wanted
#> <dbl> <chr> <chr> <chr>
#> 1 1 M A ABC Group
#> 2 1 M B ABC Group
#> 3 1 M C ABC Group
#> 4 2 F A Only A
#> 5 2 F <NA> Only A
#> 6 2 F A Only A
#> 7 3 M <NA> AB Group
#> 8 3 M A AB Group
#> 9 3 M B AB Group
#> 10 4 F C Only C
#> 11 5 F B Only B
#> 12 6 <NA> A Only A
#> 13 7 M <NA> Missing
library(dplyr)
library(tidyr)
library(stringr)
d.f %>%
  group_by(id) %>%
  drop_na(diagnosis) %>%
  summarise(wanted = str_c(unique(diagnosis), collapse = "")) %>%
  full_join(d.f, . , by = "id") %>%
  group_by(id) %>%
  fill(sex, .direction = "updown")
#> # A tibble: 13 x 4
#> # Groups: id [7]
#> id sex diagnosis wanted
#> <dbl> <chr> <chr> <chr>
#> 1 1 M A ABC
#> 2 1 M B ABC
#> 3 1 M C ABC
#> 4 2 F A A
#> 5 2 F <NA> A
#> 6 2 F A A
#> 7 3 M <NA> AB
#> 8 3 M A AB
#> 9 3 M B AB
#> 10 4 F C C
#> 11 5 F B B
#> 12 6 <NA> A A
#> 13 7 M <NA> <NA>
This can also be used:
library(dplyr)
d.f %>%
group_by(id) %>%
mutate(sex = coalesce(sex, sex[!is.na(sex)][1]),
wanted = across(diagnosis, ~ {x <- unique(diagnosis[!is.na(diagnosis)])
if_else(length(x) > 1, paste(paste(x, collapse = ""), "Group"),
if_else(length(x) == 1, paste("Only", x[1]), "Missing")
)}))
# A tibble: 13 x 4
# Groups: id [7]
id sex diagnosis wanted$diagnosis
<dbl> <chr> <chr> <chr>
1 1 M A ABC Group
2 1 M B ABC Group
3 1 M C ABC Group
4 2 F A Only A
5 2 F NA Only A
6 2 F A Only A
7 3 M NA AB Group
8 3 M A AB Group
9 3 M B AB Group
10 4 F C Only C
11 5 F B Only B
12 6 NA A Only A
13 7 M NA Missing
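The two grouped steps (fill sex within each id, then label each id by its set of diagnoses) can also be sketched in base R with ave(), which applies a function per group and returns a vector of the original length. The helper names fill_first and label_group are hypothetical, and the lowercase "group" suffix follows the desired output in the question.

```r
d.f <- data.frame(
  id        = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 6, 7),
  sex       = c("M", NA, NA, NA, "F", NA, NA, "M", "M", "F", "F", NA, "M"),
  diagnosis = c("A", "B", "C", "A", NA, "A", NA, "A", "B", "C", "B", "A", NA),
  stringsAsFactors = FALSE
)

# Repeat the first non-missing value across the group (stays NA if none exists)
fill_first <- function(x) rep(x[!is.na(x)][1], length(x))

# Build one label per group from its distinct non-missing diagnoses
label_group <- function(dx) {
  codes <- sort(unique(dx[!is.na(dx)]))
  label <- if (length(codes) == 0) {
    "Missing"
  } else if (length(codes) == 1) {
    paste("Only", codes)
  } else {
    paste(paste(codes, collapse = ""), "group")
  }
  rep(label, length(dx))
}

d.f$sex    <- ave(d.f$sex, d.f$id, FUN = fill_first)
d.f$wanted <- ave(d.f$diagnosis, d.f$id, FUN = label_group)
d.f
```

Counting distinct codes with length() rather than nchar() means the labels would still be right if diagnosis codes were longer than one character.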

fill in NA values for multiple columns in 1 df based on values from another df

I have 2 dfs. There are NA values for 2 variables in 1 data frame that I want to replace with values in another df. Here is my sample data:
df1
id Sex Race     Income
1  M   White    1
2  NA  Hispanic 2
3  NA  NA       3
df2
id Sex Race
1  M   White
2  F   Hispanic
3  M   White
4  F   Black
I want the data to look like this where the NA values for df1 for sex and race are filled in by the values for df2.
df2
id Sex Race Income
1 M White 1
2 F Hispanic 2
3 M White 3
4 F Black NA
Can someone please help?
A base R option using merge
subset(
merge(df1, df2, by = "id", all.y = TRUE),
select = c("id", "Sex.y", "Race.y", "Income")
)
which gives
id Sex.y Race.y Income
1 1 M White 1
2 2 F Hispanic 2
3 3 M White 3
4 4 F Black NA
We can use a join here
library(data.table)
setDT(df2)[df1, Income := Income, on = .(id)]
-output
df2
# id Sex Race Income
#1: 1 M White 1
#2: 2 F Hispanic 2
#3: 3 M White 3
#4: 4 F Black NA
If we need to choose the 'Sex', 'Race' between the non-NA elements
nm1 <- names(df2)[-1]
setDT(df2)[df1, c(nm1, 'Income') := c(Map(fcoalesce,
.SD[, nm1, with = FALSE], mget(paste0('i.', nm1))), list(Income)), on = .(id)]
-output
df2
# id Sex Race Income
#1: 1 M White 1
#2: 2 F Hispanic 2
#3: 3 M White 3
#4: 4 F Black NA
Or using tidyverse, with just dplyr functions
library(dplyr)
left_join(df2, df1, by = 'id') %>%
transmute(id, Sex = coalesce(Sex.x, Sex.y),
Race = coalesce(Race.x, Race.y),
Income)
-output
# id Sex Race Income
#1 1 M White 1
#2 2 F Hispanic 2
#3 3 M White 3
#4 4 F Black NA
data
df1 <- structure(list(id = 1:3, Sex = c("M", NA, NA), Race = c("White",
"Hispanic", NA), Income = 1:3), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(id = 1:4, Sex = c("M", "F", "M", "F"), Race = c("White",
"Hispanic", "White", "Black")), class = "data.frame", row.names = c(NA,
-4L))
Another tidyverse approach is to reshape both data frames to long format with pivot_longer(), join them, fill each missing value from the other data frame, and then reshape back to wide format with pivot_wider() to obtain the expected result. Here is the code:
library(tidyverse)
#Code
newdf <- df2 %>%
mutate(across(-id,~as.character(.))) %>%
pivot_longer(-id) %>%
full_join(df1 %>%
mutate(across(-id,~as.character(.))) %>%
pivot_longer(-id) %>% rename(value2=value)) %>%
mutate(value=ifelse(is.na(value),value2,value)) %>% select(-value2) %>%
pivot_wider(names_from = name,values_from=value) %>%
mutate(Income=as.numeric(Income))
Output:
# A tibble: 4 x 4
id Sex Race Income
<int> <chr> <chr> <dbl>
1 1 M White 1
2 2 F Hispanic 2
3 3 M White 3
4 4 F Black NA
Some data used:
#Data 1
df1 <- structure(list(id = 1:3, Sex = c("M", NA, NA), Race = c("White",
"Hispanic", NA), Income = 1:3), class = "data.frame", row.names = c(NA,
-3L))
#Data 2
df2 <- structure(list(id = 1:4, Sex = c("M", "F", "M", "F"), Race = c("White",
"Hispanic", "White", "Black")), class = "data.frame", row.names = c(NA,
-4L))
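The coalescing idea behind these answers can be spelled out in base R as well: do a full merge on id, then for each shared column keep the df1 value unless it is NA. This is a sketch under the assumption that df1 should win whenever both data frames have a value; the suffixes ".1" and ".2" and the names merged, own, other and result are chosen only for the example.

```r
df1 <- data.frame(id = 1:3, Sex = c("M", NA, NA),
                  Race = c("White", "Hispanic", NA), Income = 1:3,
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = 1:4, Sex = c("M", "F", "M", "F"),
                  Race = c("White", "Hispanic", "White", "Black"),
                  stringsAsFactors = FALSE)

# Full outer join; shared columns get the suffixes .1 (df1) and .2 (df2)
merged <- merge(df1, df2, by = "id", all = TRUE, suffixes = c(".1", ".2"))

# For each shared column, fall back to the df2 value where df1 is NA
for (col in c("Sex", "Race")) {
  own   <- merged[[paste0(col, ".1")]]
  other <- merged[[paste0(col, ".2")]]
  merged[[col]] <- ifelse(is.na(own), other, own)
}

result <- merged[, c("id", "Sex", "Race", "Income")]
result
```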

Comparing two dataframes in R and extract the values from one dataframe

I have two dataframes with different numbers of rows and columns. One dataframe has two columns and the other has multiple columns.
The first dataframe (df) pairs the letters in col1 with the values in col2. The second dataframe (df2) contains those letters spread over several columns. Both are given as dput output below.
I need to replace the letters (A, B, C, etc.) in the second dataframe with the corresponding values from the second column of the first dataframe.
Please help me solve this problem.
dput:
df
structure(list(col1 = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J", "K", "L"), col2 = c(10, 1, 2, 3, 4, 3, 1, 8, 19, 200,
12, 112)), row.names = c(NA, -12L), class = c("tbl_df", "tbl",
"data.frame"))
df2
structure(list(col1 = c("A", "F", "W", "E", "F", "G"), col2 = c(NA,
NA, "J", "K", "L", NA), col3 = c(NA, "H", "I", NA, "A", "B")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
A one-liner:
as_tibble(`colnames<-`(matrix(df$col2[match(as.matrix(df2), df$col1)], ncol = 3), names(df2)))
#> # A tibble: 6 x 3
#> col1 col2 col3
#> <dbl> <dbl> <dbl>
#> 1 10 NA NA
#> 2 3 NA 8
#> 3 NA 200 19
#> 4 4 12 NA
#> 5 3 112 10
#> 6 1 NA 1
You can accomplish this with a little data manipulation. Make the data in df2 long, then join to df, then make the data wide again.
The rowid_to_column is necessary to make the transition from long to wide work. You can easily remove that column by adding select(-rowid) at the end of the chain.
library(tidyverse)
df2 %>%
rowid_to_column() %>%
pivot_longer(cols = -rowid) %>%
left_join(df, by = c("value" = "col1")) %>%
select(-value) %>%
pivot_wider(names_from = name, values_from = col2)
# rowid col1 col2 col3
# <int> <dbl> <dbl> <dbl>
# 1 1 10 NA NA
# 2 2 3 NA 8
# 3 3 NA 200 19
# 4 4 4 12 NA
# 5 5 3 112 10
# 6 6 1 NA 1
A one-liner in base R:
df2 <- as.data.frame(lapply(df2, function(x) ifelse(!is.na(x), setNames(df$col2, df$col1)[x], NA)))
Output
> df2
col1 col2 col3
1 10 NA NA
2 3 NA 8
3 NA 200 19
4 4 12 NA
5 3 112 10
6 1 NA 1
Another short one-liner in base R. You can use match and assign the result to df2[] (this relies on data.frame subsetting, so convert tibbles with as.data.frame() first):
df2[] <- df[match(unlist(df2), df[,1]), 2]
df2
# col1 col2 col3
#1 10 NA NA
#2 3 NA 8
#3 NA 200 19
#4 4 12 NA
#5 3 112 10
#6 1 NA 1
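The lookup that all of these answers rely on can be laid out step by step with a named vector. This is a commented sketch of the same idea using plain data frames; lookup and result are illustrative names. Letters that do not appear in df$col1 (such as "W") and NA cells both come back as NA.

```r
df <- data.frame(col1 = LETTERS[1:12],
                 col2 = c(10, 1, 2, 3, 4, 3, 1, 8, 19, 200, 12, 112),
                 stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("A", "F", "W", "E", "F", "G"),
                  col2 = c(NA, NA, "J", "K", "L", NA),
                  col3 = c(NA, "H", "I", NA, "A", "B"),
                  stringsAsFactors = FALSE)

# Named vector: names are the letters, values are the numbers to substitute
lookup <- setNames(df$col2, df$col1)

# Index the lookup by each column of df2; result[] keeps the data frame shape
result <- df2
result[] <- lapply(df2, function(column) unname(lookup[column]))
result
```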
