Create new column based on matches in another table/merging - r

I have 2 data frames that look something like this (it is a count table). data1 has a column called "Method_Final" that I would like to add into data2. I want to match it based on ONLY columns Method1, Method2, Method3 (the Col1, Col2, Col3, and Count columns don't need to match, but I want them to be brought into the final dataframe). If there is a match on those 3 columns, take the Method_Final from data1 and put it in data2. If there is no match, then make the value "Not Determined".
Note: data1 has rows that are not in data2. I would only like rows that are in data2 to be in my final table. Any rows that are in data1 that are not in data2 should be removed.
I have an example of what I'm looking for in the data frame below called data_final.
data1 <- data.frame("Col1" = c("ABC", "ABC", "EFG", "XYZ", "ZZZ"), "Col2" = c("AA",
"AA","AA", "BB", "AA"), "Col3" = c("Al", "B", "Al", "Al", "B"), "Method1" =
c("Sample", "Dry", "Sample", "Sample", "Dry"), "Method2" = c("Blank", "Not Blank",
"Blank", "Not Blank", "Not Blank"), "Method3" = c("Yes", "Yes", "No", "No", "No"),
"Count" = c(1, 4, 6, 2, 4), "Method_Final" = c("AAR", "ARG", "PCO", "YRG", "ZYX"))
data2 <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2" =
c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" =
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" =
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1))
This is the data set I would like to end up with:
data_final <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2"
= c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" =
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" =
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1),
"Method_Final" = c("AAR", "ARG", "AAR", "YRG", "Not Determined", "Not Determined"))
I've tried various joins using dplyr (left_join, right_join, etc.) and I can't figure it out.
Thank you so much!

You can do:
library(tidyverse)
data2 %>%
  left_join(data1 %>%
              select(starts_with('Method')),
            by = paste0('Method', 1:3)) %>%
  mutate(Method_Final = if_else(is.na(Method_Final), 'Not Determined', Method_Final))
which gives:
  Col1 Col2 Col3 Method1   Method2 Method3 Count   Method_Final
1  ABC   AA   Al  Sample     Blank     Yes     1            AAR
2  ABC   AA    B     Dry Not Blank     Yes     4            ARG
3  ABC   CC    C  Sample     Blank     Yes     5            AAR
4  EFG   AA   Al  Sample Not Blank      No     6            YRG
5  XYZ   BB   Al     Dry Not Blank      No     2            ZYX
6  XYZ   CC    C  Bucket Not Blank     Yes     1 Not Determined
Note that this differs from your expected output for the fifth row. Can you please check what the Method_Final value should be there? Since data1 has a value for that combination of Method1, Method2 and Method3, it shouldn't be "Not Determined".

You can try this. Combine the fields to match into one:
dm1 <- paste(data1$Method1, data1$Method2, data1$Method3, sep="|")
dm2 <- paste(data2$Method1, data2$Method2, data2$Method3, sep="|")
Now match the two:
m <- match(dm2, dm1)
# will return NA where not matching
Get the Method_Final from data1 where it matches:
data2$Method_Final <- as.character(data1$Method_Final[m])
Where NA, make it "Not Determined":
data2$Method_Final[is.na(data2$Method_Final)] <- "Not Determined"
The result is the same as @deschen's above.
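The paste-and-match steps above can be sketched as one self-contained base R script. The helper name `lookup_by_keys` is hypothetical, the data frames are trimmed to just the columns involved, and the sketch assumes no key value contains the separator "|". Unlike a join, match() uses only the first matching row of data1, so data2 never gains rows.

```r
# Trimmed copies of the question's data: only the key columns and the value.
data1 <- data.frame(
  Method1 = c("Sample", "Dry", "Sample", "Sample", "Dry"),
  Method2 = c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank"),
  Method3 = c("Yes", "Yes", "No", "No", "No"),
  Method_Final = c("AAR", "ARG", "PCO", "YRG", "ZYX"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  Method1 = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"),
  Method2 = c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"),
  Method3 = c("Yes", "Yes", "Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row of `to`, look up `value` in `from`,
# matching on the columns named in `keys`.
lookup_by_keys <- function(to, from, keys, value, default) {
  key_to   <- do.call(paste, c(to[keys],   sep = "|"))
  key_from <- do.call(paste, c(from[keys], sep = "|"))
  out <- as.character(from[[value]][match(key_to, key_from)])
  out[is.na(out)] <- default  # rows without a match get the default
  out
}

data2$Method_Final <- lookup_by_keys(
  data2, data1,
  keys = c("Method1", "Method2", "Method3"),
  value = "Method_Final",
  default = "Not Determined"
)
print(data2)
```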

Related

Create a new column based on matches in another table, but bring in other columns

I have 2 data frames that look something like this (it is a count table). data1 has a column called "Method_Final" that I would like to add into data2. I want to match it based on ONLY columns Method1, Method2, Method3 (the Col1, Col2, Col3, and Count columns don't need to match, but I want them to be brought into the final dataframe). If there is a match on those 3 columns, take the Method_Final from data1 and put it in data2. If there is no match, then make the value "Not Determined". I have an example of what I'm looking for in the data frame below called data_final.
data1 <- data.frame("Col1" = c("ABC", "ABC", "EFG", "XYZ"), "Col2" = c("AA", "AA",
"AA", "BB"),"Col3" = c("Al", "B", "Al", "Al"), "Method1" =
c("Sample", "Dry", "Sample", "Sample"), "Method2" = c("Blank", "Not Blank", "Blank",
"Not Blank"), "Method3" = c("Yes", "Yes", "No", "No"), "Count" = c(1, 4, 6, 2),
"Method_Final" = c("AAR", "ARG", "PCO", "YRG"))
data2 <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2" =
c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" =
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" =
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1))
I would like to create a new data frame that looks like this with what I described above:
data_final <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2"
= c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Method1" = c("Sample", "Dry", "Sample", "Sample", "Dry", "Bucket"), "Method2" =
c("Blank", "Not Blank", "Blank", "Not Blank", "Not Blank", "Not Blank"), "Method3" =
c("Yes", "Yes", "Yes", "No", "No", "Yes"), "Count" = c(1, 4, 5, 6, 2, 1),
"Method_Final" = c("AAR", "ARG", "Not Determined", "YRG", "Not Determined",
"Not Determined"))
This should be possible by:
Left-joining data1 onto data2 on just Method1, Method2 and Method3 (only these need to match)
Removing the extra Col1, Col2, Col3 and Count columns (I give them the suffix _remove and drop them after the join)
Replacing NAs with "Not Determined"
Below is an example. Note that I don't get exactly the same result as your data_final, but I believe I have captured the logic you want.
library(dplyr, warn.conflicts = FALSE)
data2 %>%
  left_join(
    data1,
    by = c("Method1", "Method2", "Method3"),
    keep = FALSE,
    suffix = c("", "_remove")
  ) %>%
  select(-contains("_remove")) %>%
  tidyr::replace_na(
    list("Method_Final" = "Not Determined")
  )
#>   Col1 Col2 Col3 Method1   Method2 Method3 Count   Method_Final
#> 1  ABC   AA   Al  Sample     Blank     Yes     1            AAR
#> 2  ABC   AA    B     Dry Not Blank     Yes     4            ARG
#> 3  ABC   CC    C  Sample     Blank     Yes     5            AAR
#> 4  EFG   AA   Al  Sample Not Blank      No     6            YRG
#> 5  XYZ   BB   Al     Dry Not Blank      No     2 Not Determined
#> 6  XYZ   CC    C  Bucket Not Blank     Yes     1 Not Determined

Create new column based on matches in another table

I have 2 data frames that look something like this (it is a count table). data1 has a column called "Method" that I would like to add into data2. I want to match it based on ONLY Col1, Col2, Col3 (the Count column doesn't need to match). If there is a match, take the Method from data1 and put it in data2. If there is no match, then make the value "Not Determined". I have an example of what I'm looking for in the data frame below called data_final.
data1 <- data.frame("Col1" = c("ABC", "ABC", "EFG", "XYZ"), "Col2" = c("AA", "AA",
"AA", "BB"), "Col3" = c("Al", "B", "Al", "Al"), "Count" = c(1, 4, 6, 2), "Method" =
c("Sample", "Dry", "Sample", "Sample"))
data2 <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2" =
c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Count" = c(1, 4, 5, 6, 2, 1))
I would like to create a new data frame that looks like this with what I described above:
data_final <- data.frame("Col1" = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"), "Col2"
= c("AA", "AA","CC", "AA", "BB", "CC"), "Col3" = c("Al", "B", "C", "Al", "Al", "C"),
"Count" = c(1, 4, 5, 6, 2, 1), "Method" = c("Sample", "Dry", "Not Determined",
"Sample", "Sample", "Not Determined"))
Thank you for your help!
In Base R, you could do a merge:
data3 <- merge( data1, data2, all.y = TRUE )
and then replace the NAs with your string of choice:
data3[ is.na( data3[ 5 ] ), 5 ] <- "Not Determined"
which gives you
> data3
  Col1 Col2 Col3 Count         Method
1  ABC   AA   Al     1         Sample
2  ABC   AA    B     4            Dry
3  ABC   CC    C     5 Not Determined
4  EFG   AA   Al     6         Sample
5  XYZ   BB   Al     2         Sample
6  XYZ   CC    C     1 Not Determined
Attention: If you are on an older version of R (< 4.0), the Method column may be a factor, in which case you need to add the new factor level before replacing the NAs:
levels( data3$Method ) <- c( levels( data3$Method ), "Not Determined" )
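With no `by` argument, merge() joins on every column the two data frames share, which here includes Count. A minimal sketch, assuming the question's data, that matches on Col1, Col2 and Col3 only (the `keys` and `lookup` names are just for illustration). Note that merge() may return the rows sorted by the key columns rather than in data2's original order.

```r
data1 <- data.frame(
  Col1 = c("ABC", "ABC", "EFG", "XYZ"),
  Col2 = c("AA", "AA", "AA", "BB"),
  Col3 = c("Al", "B", "Al", "Al"),
  Count = c(1, 4, 6, 2),
  Method = c("Sample", "Dry", "Sample", "Sample"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  Col1 = c("ABC", "ABC", "ABC", "EFG", "XYZ", "XYZ"),
  Col2 = c("AA", "AA", "CC", "AA", "BB", "CC"),
  Col3 = c("Al", "B", "C", "Al", "Al", "C"),
  Count = c(1, 4, 5, 6, 2, 1),
  stringsAsFactors = FALSE
)

keys <- c("Col1", "Col2", "Col3")

# Keep only the key columns and Method from data1, so Count comes from
# data2 alone and is never used for matching.
lookup <- data1[, c(keys, "Method")]

# all.x = TRUE keeps every row of data2, matched or not.
data3 <- merge(data2, lookup, by = keys, all.x = TRUE)
data3$Method[is.na(data3$Method)] <- "Not Determined"
print(data3)
```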

How to filter rows based on the value in the row above

I have a df:
df<-structure(list(Name = c("test", "a", "nb", "c", "r", "f", NA,
"d", "ee", "test", "value", "test", "b")), row.names = c(NA,
-13L), class = c("tbl_df", "tbl", "data.frame"))
How can I keep only the rows whose previous row is "test" and whose own value is not "value"?
The new df1 will look like this (either approach is OK):
library(dplyr)
df %>%
filter(lag(Name == "test"), Name != "value")
# A tibble: 2 x 1
Name
<chr>
1 a
2 b
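For comparison, a base R sketch of the same filter, assuming the question's df (rebuilt here as a plain data frame). The `prev` vector is a hypothetical helper holding each row's predecessor, which is what lag() supplies in the dplyr version.

```r
df <- data.frame(
  Name = c("test", "a", "nb", "c", "r", "f", NA,
           "d", "ee", "test", "value", "test", "b"),
  stringsAsFactors = FALSE
)

# Shift Name down by one row; the first row has no predecessor, so it gets NA.
prev <- c(NA, head(df$Name, -1))

# which() drops comparisons that evaluate to NA, as filter() does.
keep <- which(prev == "test" & df$Name != "value")
df1 <- df[keep, , drop = FALSE]
print(df1)
```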

calculate duration in a complex table

I have a table as shown.
df <- data.frame("name" = c("jack", "william", "david", "john"),
"01-Jan-19" = c(NA,"A",NA,"A"),
"01-Feb-19" = c("A","A",NA,"A"),
"01-Mar-19" = c("A","A","A","A"),
"01-Apr-19" = c("A","A","A","A"),
"01-May-19" = c(NA,"A","A","A"),
"01-Jun-19" = c("A","SA","A","SA"),
"01-Jul-19" = c("A","SA","A","SA"),
"01-Aug-19" = c(NA,"SA","A","SA"),
"01-Sep-19" = c(NA,"SA","A","SA"),
"01-Oct-19" = c("SA","SA","A","SA"),
"01-Nov-19" = c("SA","SA",NA,"SA"),
"01-Dec-19" = c("SA","SA","SA",NA),
"01-Jan-20" = c("SA","M","A","M"),
"01-Feb-20" = c("M","M","M","M"))
Over a time period, each person progresses through a series of positions (3 position categories, from A to SA to M). My objective is:
Calculate the average duration of A (assistant) position and SA (senior assistant) position. i.e. the duration between the date the first of one category appears, and the date the last of this category appears, regardless of missing data in between.
I transposed the data using R “gather” function
df1 <- gather (df, "date", "position", 2:15)
then I am not sure how to best proceed. What might be the best way to further approach this?
We can get the data in longer format and calculate the number of days between first date when the person was "SA" and the first date when he was "A".
library(dplyr)
library(lubridate)  # for dmy()
df %>%
  tidyr::pivot_longer(cols = -name, names_to = 'person', values_drop_na = TRUE) %>%
  mutate(person = dmy(person)) %>%
  group_by(name) %>%
  summarise(duration = person[match('SA', value)] - person[match('A', value)])
# name duration
# <fct> <drtn>
#1 david 275 days
#2 jack 242 days
#3 john 151 days
#4 william 151 days
If the mean value is needed, we can pull the duration column and calculate its mean by adding this to the above chain:
%>% pull(duration) %>% mean
#Time difference of 204.75 days
data
df <- structure(list(name = c("jack", "william", "david", "john"),
`01-Jan-19` = c(NA, "A", NA, "A"), `01-Feb-19` = c("A", "A",
NA, "A"), `01-Mar-19` = c("A", "A", "A", "A"), `01-Apr-19` = c("A",
"A", "A", "A"), `01-May-19` = c(NA, "A", "A", "A"), `01-Jun-19` = c("A",
"SA", "A", "SA"), `01-Jul-19` = c("A", "SA", "A", "SA"),
`01-Aug-19` = c(NA, "SA", "A", "SA"), `01-Sep-19` = c(NA,
"SA", "A", "SA"), `01-Oct-19` = c("SA", "SA", "A", "SA"),
`01-Nov-19` = c("SA", "SA", NA, "SA"), `01-Dec-19` = c("SA",
"SA", "SA", NA), `01-Jan-20` = c("SA", "M", "A", "M"), `01-Feb-20` = c("M",
"M", "M", "M")), row.names = c(NA, -4L), class = "data.frame")

How to invoke functions on subsets of random samples of data

I'm trying to perform a t.test for a specific subset of data. Say I have a data set of 116 birds, and want to find a random sample of 35 birds (non-unique) of the "Species" category. I then want to find the mean of the "Body.Mass" of these random species. Then, I want to invoke a t.test on this sample as representative of the whole data.
I first stored the data in object "bird." I tried taking the random sample using sample(bird$Species, 35), which yielded 35 random species of bird. Now I can't seem to further subset this random sample to find the means of the Body.Mass of every random sample species. I tried to subset using tidyverse, but that's the only way I'm aware of to solve a problem like this.
library(dplyr)
bird = read.csv("NZBIRDS.csv")
dput(head(bird))
set.seed(20)
sambird = sample(bird$Species,35)
sambird
bmbird <- sambird %>% summarize(avg = mean(Body.Mass))
bmbird
structure(list(Species = c("Grebes", "Grebes", "Petrels", "Petrels",
"Petrels", "Petrels"), Name = c("P. cristatus", "P. rufopectus",
"P. gavia", "P. assimilis", "P. urinatrix", "P. georgicus"),
Extinct = c("No", "No", "Yes", "Yes", "Yes", "No"), Habitat = c("A",
"A", "A", "A", "A", "A"), Nest.Site = c("G", "G", "GC", "GC",
"GC", "GC"), Nest.Density = c("L", "L", "H", "H", "H", "H"
), Diet = c("F", "F", "F", "F", "F", "F"), Flight = c("Yes",
"Yes", "Yes", "Yes", "Yes", "Yes"), Body.Mass = c(1100L,
250L, 300L, 200L, 130L, 120L), Egg.Length = c(57, 43, 57,
54, 38, 39)), .Names = c("Species", "Name", "Extinct", "Habitat",
"Nest.Site", "Nest.Density", "Diet", "Flight", "Body.Mass", "Egg.Length"
), row.names = c(NA, 6L), class = "data.frame")
Error in UseMethod("summarise_") : no applicable method for 'summarise_' applied to an object of class "factor"
It's a bit unclear whether you want to sample from a list of the unique species in the data, or sample rows so that each "Species" type can appear multiple times in the data. If you want to sample from the unique species, you can do:
# Only sampling one species since the example data
# contains only two, should work fine
# for more random species
random_species = sample(unique(bird$Species), 1, replace = FALSE)
bird %>%
filter(Species %in% random_species) %>%
group_by(Species) %>%
summarize(avg = mean(Body.Mass))
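If the goal is instead to sample rows, so that a species can appear more than once, a base R sketch along these lines may help. It uses only the six rows from the dput() output above, draws 4 rows rather than 35, and tests the sample against the full-data mean; that choice of `mu` is an assumption about what the t-test is meant to check.

```r
bird <- data.frame(
  Species = c("Grebes", "Grebes", "Petrels", "Petrels", "Petrels", "Petrels"),
  Body.Mass = c(1100L, 250L, 300L, 200L, 130L, 120L),
  stringsAsFactors = FALSE
)

set.seed(20)
# Sample row indices, not the Species column, so Body.Mass stays attached.
rows <- sample(nrow(bird), 4)
sambird <- bird[rows, ]

# Mean body mass of the sampled birds
sample_mean <- mean(sambird$Body.Mass)

# One-sample t-test of the sample against the mean of the full data
# (assumed reference value; substitute whatever the hypothesis calls for).
tt <- t.test(sambird$Body.Mass, mu = mean(bird$Body.Mass))
print(sample_mean)
print(tt)
```

The error in the question arises because sample(bird$Species, 35) returns a bare vector of species names, which has no Body.Mass column for summarize() to work on.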
