Create new column based on matches in another table - r

I have 2 data frames that look something like this (it is a count table). data1 has a column called "Method" that I would like to add into data2. I want to match it based on ONLY Col1, Col2, Col3 (the Count column doesn't need to match). If there is a match, take the Method from data1 and put it in data2. If there is no match, then make the value "Not Determined". I have an example of what I'm looking for in the data frame below called data_final.
data1 <- data.frame("Col1" = c("ABC", "ABC", "EFG", "XYZ"), "Col2" = c("AA", "AA",
"AA", "BB"), "Col3" = c("Al", "B", "Al", "Al"), "Count" = c(1, 4, 6, 2), "Method" =
c("Sample", "Dry", "Sample", "Sample"))
data2 <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2" =
c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Count" = c(1, 4, 5, 6, 2, 1))
I would like to create a new data frame that looks like this with what I described above:
data_final <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2"
= c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Count" = c(1, 4, 5, 6, 2, 1), "Method" = c("Sample", "Dry", "Not Determined",
"Sample", "Sample", "Not Determined"))
Thank you for your help!

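In base R, you could do a merge on the three key columns only, so that Count is not used for matching:
data3 <- merge( data2, data1[ c( "Col1", "Col2", "Col3", "Method" ) ], by = c( "Col1", "Col2", "Col3" ), all.x = TRUE )
and then replace the NAs with your string of choice:
data3$Method[ is.na( data3$Method ) ] <- "Not Determined"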
which gives you
> data3
  Col1 Col2 Col3 Count         Method
1  ABC   AA   Al     1         Sample
2  ABC   AA    B     4            Dry
3  ABC   CC    C     5 Not Determined
4  EFG   AA   Al     6         Sample
5  XYZ   BB   Al     2         Sample
6  XYZ   CC    C     1 Not Determined
Attention: If you are on an older version of R (< 4.0), Method may be a factor, in which case you need to add the new factor level first with
levels( data3$Method ) <- c( levels( data3$Method ), "Not Determined" )
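A note on why the key columns matter: when no by argument is given, merge() joins on every column name the two data frames share, and here that includes Count. The following is a minimal sketch of the difference, using made-up one-row data frames (d1 and d2 are hypothetical, not the data from the question):

```r
# Hypothetical one-row tables: same Col1-Col3 key, but different Count.
d1 <- data.frame(Col1 = "ABC", Col2 = "AA", Col3 = "Al", Count = 1,
                 Method = "Sample", stringsAsFactors = FALSE)
d2 <- data.frame(Col1 = "ABC", Col2 = "AA", Col3 = "Al", Count = 99,
                 stringsAsFactors = FALSE)

# Default merge: joins on Col1, Col2, Col3 AND Count, so nothing matches.
default_merge <- merge(d1, d2, all.y = TRUE)

# Explicit keys: joins on Col1, Col2, Col3 only; Count comes from d2.
keyed_merge <- merge(d2, d1[c("Col1", "Col2", "Col3", "Method")],
                     by = c("Col1", "Col2", "Col3"), all.x = TRUE)

is.na(default_merge$Method)  # TRUE: the Method was not carried over
keyed_merge$Method           # "Sample"
```

In the question's example the counts happen to agree, so both calls return the same rows, but only the keyed version follows the "match on Col1, Col2, Col3 only" requirement.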

Related

Create a new column based on matches in another table, but bring in other columns

I have 2 data frames that look something like this (it is a count table). data1 has a column called "Method_Final" that I would like to add into data2. I want to match it based on ONLY columns Method1, Method2, Method3 (the Col1, Col2, Col3, and Count columns don't need to match, but I want them to be brought into the final dataframe). If there is a match on those 3 columns, take the Method_Final from data1 and put it in data2. If there is no match, then make the value "Not Determined". I have an example of what I'm looking for in the data frame below called data_final.
data1 <- data.frame("Col1" = c("ABC", "ABC", "EFG", "XYZ"), "Col2" = c("AA", "AA",
"AA", "BB"),"Col3" = c("Al", "B", "Al", "Al"), "Method1" =
c("Sample", "Dry", "Sample", "Sample"), "Method2" = c("Blank", "Not Blank", "Blank",
"Not Blank"), "Method3" = c("Yes", "Yes", "No", "No"), "Count" = c(1, 4, 6, 2),
"Method_Final" = c("AAR", "ARG", "PCO", "YRG"))
data2 <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2" =
c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" =
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" =
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1))
I would like to create a new data frame that looks like this with what I described above:
data_final <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2"
= c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" =
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" =
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1),
"Method_Final" = c("AAR", "ARG", "Not Determined", "YRG", "Not Determined", "Not
Determined"))
This should be possible by:
1. Left joining data1 into data2 on just Method1, Method2, and Method3 (only these need to match).
2. Removing the extra Col1, Col2, Col3, and Count columns (I give them the suffix _remove and drop them after the join).
3. Replacing NAs with "Not Determined".
Below is an example. Note that I don't get the exact same result as in your data_final, but I believe I have captured the logic you want.
library(dplyr, warn.conflicts = FALSE)

data2 %>%
  left_join(
    data1,
    by = c("Method1", "Method2", "Method3"),
    keep = FALSE,
    suffix = c("", "_remove")
  ) %>%
  select(-contains("_remove")) %>%
  tidyr::replace_na(
    list("Method_Final" = "Not Determined")
  )
#>   Col1 Col2 Col3 Method1   Method2 Method3 Count   Method_Final
#> 1  ABC   AA   Al  Sample     Blank     Yes     1            AAR
#> 2  ABC   AA    B     Dry Not Blank     Yes     4            ARG
#> 3  ABC   CC    C  Sample     Blank     Yes     5            AAR
#> 4  EFG   AA   Al  Sample Not Blank      No     6            YRG
#> 5  XYZ   BB   Al     Dry Not Blank      No     2 Not Determined
#> 6  XYZ   CC    C  Bucket Not Blank     Yes     1 Not Determined
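One thing that may be worth checking before a join like this: a left join returns one output row per matching lookup row, so a key combination that appears more than once in the lookup table would duplicate rows of the target table. A base R sketch with made-up data, using merge() as a stand-in for left_join() (lookup, target, and has_duplicate_keys are hypothetical names):

```r
# Hypothetical lookup table in which the key (Method1, Method2) appears twice.
lookup <- data.frame(Method1 = c("Sample", "Dry", "Sample"),
                     Method2 = c("Blank", "Not Blank", "Blank"),
                     Method_Final = c("AAR", "ARG", "PCO"),
                     stringsAsFactors = FALSE)
target <- data.frame(Method1 = "Sample", Method2 = "Blank", Count = 5,
                     stringsAsFactors = FALSE)
keys <- c("Method1", "Method2")

# anyDuplicated() returns 0 when every key combination is unique.
has_duplicate_keys <- anyDuplicated(lookup[keys]) > 0

# merge(..., all.x = TRUE) is the base R counterpart of a left join.
joined <- merge(target, lookup, by = keys, all.x = TRUE)
nrow(joined)  # 2: the single target row matched both lookup rows
```

If the check reports duplicates, deciding which lookup row should win (for example by de-duplicating the lookup table first) is a modelling choice that the join itself will not make.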

Create new column based on matches in another table/merging

I have 2 data frames that look something like this (it is a count table). data1 has a column called "Method_Final" that I would like to add into data2. I want to match it based on ONLY columns Method1, Method2, Method3 (the Col1, Col2, Col3, and Count columns don't need to match, but I want them to be brought into the final dataframe). If there is a match on those 3 columns, take the Method_Final from data1 and put it in data2. If there is no match, then make the value "Not Determined".
Note: data1 has rows that are not in data2. I would only like rows that are in data2 to be in my final table. Any rows that are in data1 that are not in data2 should be removed.
I have an example of what I'm looking for in the data frame below called data_final.
data1 <- data.frame("Col1" = c("ABC", "ABC", "EFG", "XYZ", "ZZZ"), "Col2" = c("AA",
"AA","AA", "BB", "AA"), "Col3" = c("Al", "B", "Al", "Al", "B"), "Method1" =
c("Sample", "Dry", "Sample", "Sample", "Dry"), "Method2" = c("Blank", "Not Blank",
"Blank", "Not Blank", "Not Blank"), "Method3" = c("Yes", "Yes", "No", "No", "No"),
"Count" = c(1, 4, 6, 2, 4), "Method_Final" = c("AAR", "ARG", "PCO", "YRG", "ZYX"))
data2 <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2" =
c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" =
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" =
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1))
This is the data set I would like to end up with:
data_final <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2"
= c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" =
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" =
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1),
"Method_Final" = c("AAR", "ARG", "AAR", "YRG", "Not Determined", "Not Determined"))
I've tried various joins using dplyr (left_join, right_join, etc.) and I can't figure it out.
Thank you so much!
You can do:
library(tidyverse)
data2 %>%
  left_join(data1 %>%
              select(starts_with('Method')),
            by = paste0('Method', 1:3)) %>%
  mutate(Method_Final = if_else(is.na(Method_Final), 'Not Determined', Method_Final))
which gives:
  Col1 Col2 Col3 Method1   Method2 Method3 Count   Method_Final
1  ABC   AA   Al  Sample     Blank     Yes     1            AAR
2  ABC   AA    B     Dry Not Blank     Yes     4            ARG
3  ABC   CC    C  Sample     Blank     Yes     5            AAR
4  EFG   AA   Al  Sample Not Blank      No     6            YRG
5  XYZ   BB   Al     Dry Not Blank      No     2            ZYX
6  XYZ   CC    C  Bucket Not Blank     Yes     1 Not Determined
Note that this differs from your expected output for the fifth row. Can you please check what the Method_Final value should be here? Since there is a value for it in data1, it shouldn't be 'Not Determined'.
You can try this. Combine the fields to match into one:
dm1 <- paste(data1$Method1, data1$Method2, data1$Method3, sep="|")
dm2 <- paste(data2$Method1, data2$Method2, data2$Method3, sep="|")
Now match the two:
m <- match(dm2, dm1)
# will return NA where not matching
Get the Method_Final from data1 where it matches:
data2$Method_Final <- as.character(data1$Method_Final[m])
Where NA, make it "Not Determined":
data2$Method_Final[is.na(data2$Method_Final)] <- "Not Determined"
The result is the same as @deschen's above.
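The match-based steps can also be wrapped into a small reusable function. This is only a sketch: lookup_by_keys and its arguments are hypothetical names, the example tables are made up, and it assumes the "|" separator never occurs inside the key values.

```r
# Hypothetical helper: for each row of `to`, fetch `value_col` from the first
# row of `from` that agrees on all columns named in `keys`.
lookup_by_keys <- function(to, from, keys, value_col, default) {
  # Collapse the key columns into one string per row.
  key_to   <- do.call(paste, c(to[keys],   sep = "|"))
  key_from <- do.call(paste, c(from[keys], sep = "|"))
  # match() gives NA where a key of `to` is absent from `from`.
  found <- as.character(from[[value_col]])[match(key_to, key_from)]
  found[is.na(found)] <- default
  found
}

from <- data.frame(Method1 = c("Sample", "Dry"),
                   Method2 = c("Blank", "Not Blank"),
                   Method_Final = c("AAR", "ARG"),
                   stringsAsFactors = FALSE)
to <- data.frame(Method1 = c("Dry", "Bucket", "Sample"),
                 Method2 = c("Not Blank", "Not Blank", "Blank"),
                 stringsAsFactors = FALSE)

to$Method_Final <- lookup_by_keys(to, from, c("Method1", "Method2"),
                                  "Method_Final", "Not Determined")
to$Method_Final  # "ARG" "Not Determined" "AAR"
```

Unlike a join, match() keeps the row count and row order of the target table, and takes only the first matching row when a key is repeated in the lookup table.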

How to filter values based on the value in the row above

I have a df:
df<-structure(list(Name = c("test", "a", "nb", "c", "r", "f", NA,
"d", "ee", "test", "value", "test", "b")), row.names = c(NA,
-13L), class = c("tbl_df", "tbl", "data.frame"))
How can I keep only the rows whose preceding row is "test" and whose own value is not "value"?
The new df1 will look like this (either case is OK):
library(dplyr)
df %>%
  filter(lag(Name == "test"), Name != "value")
# A tibble: 2 x 1
  Name
  <chr>
1 a
2 b
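To see why this works: lag() shifts the column down by one row, so the first condition asks whether the previous row was "test", and filter() drops rows where a condition evaluates to NA. A minimal base R sketch of the same logic on the question's values (prev and keep are illustrative names):

```r
Name <- c("test", "a", "nb", "c", "r", "f", NA,
          "d", "ee", "test", "value", "test", "b")

# Previous row's value; the first row has no predecessor.
prev <- c(NA, head(Name, -1))

# Keep rows whose predecessor is "test" and whose own value is not "value".
# which() discards NA comparisons, much as filter() does.
keep <- which(prev == "test" & Name != "value")
Name[keep]  # "a" "b"
```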

calculate duration in a complex table

I have a table as shown.
df <- data.frame("name" = c("jack", "william", "david", "john"),
                 "01-Jan-19" = c(NA,"A",NA,"A"),
                 "01-Feb-19" = c("A","A",NA,"A"),
                 "01-Mar-19" = c("A","A","A","A"),
                 "01-Apr-19" = c("A","A","A","A"),
                 "01-May-19" = c(NA,"A","A","A"),
                 "01-Jun-19" = c("A","SA","A","SA"),
                 "01-Jul-19" = c("A","SA","A","SA"),
                 "01-Aug-19" = c(NA,"SA","A","SA"),
                 "01-Sep-19" = c(NA,"SA","A","SA"),
                 "01-Oct-19" = c("SA","SA","A","SA"),
                 "01-Nov-19" = c("SA","SA",NA,"SA"),
                 "01-Dec-19" = c("SA","SA","SA",NA),
                 "01-Jan-20" = c("SA","M","A","M"),
                 "01-Feb-20" = c("M","M","M","M"))
Over a time period, each person progresses through positions (3 position categories, from A to SA to M). My objective is:
Calculate the average duration of the A (assistant) position and the SA (senior assistant) position, i.e. the duration between the date a category first appears and the date it last appears, regardless of missing data in between.
I reshaped the data to long format using the tidyr "gather" function:
df1 <- gather(df, "date", "position", 2:15)
but I am not sure how best to proceed from there. What might be the best way to approach this further?
We can get the data in longer format and calculate the number of days between the first date when the person was "SA" and the first date when they were "A".
library(dplyr)
library(lubridate)
df %>%
  tidyr::pivot_longer(cols = -name, names_to = 'date', values_drop_na = TRUE) %>%
  mutate(date = dmy(date)) %>%
  group_by(name) %>%
  summarise(duration = date[match('SA', value)] - date[match('A', value)])
#  name    duration
#  <fct>   <drtn>
#1 david   275 days
#2 jack    242 days
#3 john    151 days
#4 william 151 days
If the mean value is needed, we can pull the column and calculate the mean by adding this to the chain above:
%>% pull(duration) %>% mean
#Time difference of 204.75 days
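The question's stated objective is the span from the first to the last appearance of each position, whereas the chain above measures first "A" to first "SA". A base R sketch of the first-to-last reading follows. It assumes the data is already in long format with parsed dates; long below is a hand-made illustration holding only the first and last month of each position for two of the people, and span and avg are illustrative names.

```r
# First and last month of each position for jack and william,
# read off the question's table and written as ISO dates.
long <- data.frame(
  name = c("jack", "jack", "jack", "jack",
           "william", "william", "william", "william"),
  date = as.Date(c("2019-02-01", "2019-07-01", "2019-10-01", "2020-01-01",
                   "2019-01-01", "2019-05-01", "2019-06-01", "2019-12-01")),
  position = c("A", "A", "SA", "SA", "A", "A", "SA", "SA"),
  stringsAsFactors = FALSE
)

# Days between the first and last date of each position, per person.
# Rows are people, columns are positions.
span <- tapply(as.numeric(long$date), list(long$name, long$position),
               function(d) max(d) - min(d))

# Average span per position across people.
avg <- colMeans(span, na.rm = TRUE)
```

Because only the minimum and maximum date per group are used, gaps in between (the NA months) do not affect the result.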
data
df <- structure(list(name = c("jack", "william", "david", "john"),
`01-Jan-19` = c(NA, "A", NA, "A"), `01-Feb-19` = c("A", "A",
NA, "A"), `01-Mar-19` = c("A", "A", "A", "A"), `01-Apr-19` = c("A",
"A", "A", "A"), `01-May-19` = c(NA, "A", "A", "A"), `01-Jun-19` = c("A",
"SA", "A", "SA"), `01-Jul-19` = c("A", "SA", "A", "SA"),
`01-Aug-19` = c(NA, "SA", "A", "SA"), `01-Sep-19` = c(NA,
"SA", "A", "SA"), `01-Oct-19` = c("SA", "SA", "A", "SA"),
`01-Nov-19` = c("SA", "SA", NA, "SA"), `01-Dec-19` = c("SA",
"SA", "SA", NA), `01-Jan-20` = c("SA", "M", "A", "M"), `01-Feb-20` = c("M",
"M", "M", "M")), row.names = c(NA, -4L), class = "data.frame")

Conditionally fill empty cells

I have a named vector with some missing values:
x = c(99, 88, 1, 2, 3, NA, NA)
names(x) = c("A", "C", "AA", "AB", "AC", "AD", "CA")
And a second dataframe which reflects the hierarchical naming structure (e.g. A is a superordinate to AA, AB, & AC)
filler = data.frame(super = c("A", "A", "A", "A", "C"), sub = c("AA", "AB", "AC", "AD", "CA"))
If a value is missing in x, I want to fill it with the superordinate from filler. So that the outcome would be
x = c(99, 88, 1, 2, 3, 99, 88)
Does anyone have any clever way to do this without looping through each possibility?
We can create a logical vector ('i1') based on the NA elements, get the index of the matching elements in 'filler' with match, and then do the assignment:
i1 <- is.na(x)
x[i1] <- x[match(filler$super[match(names(x[i1]), filler$sub)], names(x))]
as.vector(x)
#[1] 99 88 1 2 3 99 88
As x is a named vector, we could convert it to a dataframe (enframe), do a join, replace the NA values with the corresponding value, and, if needed, convert it back into a vector (deframe).
library(dplyr)
library(tibble)
enframe(x) %>%
  left_join(filler, by = c("name" = "sub")) %>%
  mutate(value = if_else(is.na(value), value[match(super, name)], value)) %>%
  select(-super) %>%
  deframe()
# A  C AA AB AC AD CA
#99 88  1  2  3 99 88
