I asked a similar question for Python (How to create column of ascending values based on unique values in another column in pandas) and got that script working, but for various reasons I now need to achieve the same thing in R. I am also adding the complexity of wanting to add new batches of data to the dataset periodically.
I have a list of samples which have unique sample ID numbers ("Sample_ID"). Each row of the dataset is a sample. Some samples are duplicated multiple times. I want to create a new set of sample names ("Sample_code") that ascend from 1 as you go down the rows, using a prefix (e.g. "SAMP0001", "SAMP0002" etc). I want the order of rows to be preserved (as they are roughly in date order of sample collection). For duplicated samples, I want the number given for Sample_code to correspond to the first row that sample ID appears in, not rows further down the table (which came later in sample collection).
My starting data is illustrated with df1:
# df1
Sample_ID <- c('123123','123456','123123','123789')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019')
Variable <- c("blue","red","red","blue")
Batch <- 1
df1 <- data.frame(Sample_ID, Date, Variable, Batch)
df1
I want to create the Sample_code column shown in df1b:
# df1b
Sample_ID <- c('123123','123456','123123','123789')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019')
Variable <- c("blue","red","red","blue")
Batch <- 1
Sample_code <- c('SAMP0001', 'SAMP0002', 'SAMP0001', 'SAMP0003')
df1b <- data.frame(Sample_ID, Date, Variable, Batch, Sample_code)
df1b
I would save df1b at this point, and those Sample_code names would be used for downstream processing. The added complexity comes because I will then collect a new batch of samples - let's call it df2 (Batch 2 samples):
# df2
Sample_ID <- c('456789', '123654', '123123', '123789', '121212')
Date <- c('15/07/2019', '31/07/2019', '12/08/2019', '27/08/2019', '31/08/2019')
Variable <- c("blue", "red","blue", "red", "red")
Batch <- 2
df2 <- data.frame(Sample_ID, Date, Variable, Batch)
df2
I want to rbind df2 to the bottom of df1, and generate more Sample_code names for the new rows. Importantly, the new Sample_code names need to take account of any Sample_ID duplicates that were present in df1, but also not change any of the Sample_code names that were already assigned back when I only had df1. The result at this point would be df2b, below:
# df2b
Sample_ID <- c('123123','123456','123123','123789','456789', '123654', '123123', '123789', '121212')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019', '15/07/2019', '31/07/2019', '12/08/2019', '27/08/2019', '31/08/2019')
Variable <- c("blue","red","red","blue","blue", "red","blue", "red", "red")
Batch <- c(1,1,1,1,2,2,2,2,2)
Sample_code <- c('SAMP0001', 'SAMP0002', 'SAMP0001', 'SAMP0003', 'SAMP0004', 'SAMP0005', 'SAMP0001', 'SAMP0003', 'SAMP0006')
df2b <- data.frame(Sample_ID, Date, Variable, Batch, Sample_code)
df2b
I would then add Batch 3 samples in the same way, and so on.
I appreciate there are at least 2 stages to this problem: 1) Producing an ascending list of Sample_code names using unique Sample_ID values; and 2) Building in an iterative way of adding batches of samples. But because the second point impacts on the functionality I want for the Sample_code names I have included both stages here.
Lastly - ideally I want to only use base R and tidyverse packages for this.
Any help much appreciated! Thanks.
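Because a Sample_ID can reappear in a later batch, you need to know all sample IDs before assigning sample codes. Consider reversing the order of operations: first call rbind on all the raw batch data frames, then assign Sample_code from factor levels taken in order of first appearance. Earlier batches always come first in the combined data frame, so codes that were assigned before do not change when a new batch is appended. Otherwise, you would have to re-assign Sample_code with each batch data frame.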
# BUILD A LIST OF DATA FRAMES BY CALLING lapply ON ITERATIVE PROCESS
# df_list <- lapply(batch_iterable, method_to_build_sample)
df_list <- list(df1, df2) # FOR THIS PARTICULAR POST (RAW BATCHES ONLY; df1b HAS AN EXTRA COLUMN AND WOULD BREAK rbind)
# RBIND ALL DFs TOGETHER
df2b <- do.call(rbind, df_list)
df2b <- within(df2b, {
  # CONVERT TO CHARACTER
  Sample_ID <- as.character(Sample_ID)
  # CONVERT TO FACTOR WITH LEVELS IN ORDER OF FIRST APPEARANCE, THEN INTEGER FOR LEVEL NUMBER
  Sample_code <- as.integer(factor(Sample_ID, levels = unique(Sample_ID)))
  # RE-ASSIGN WITH SAMP PREFIX, ZERO-PADDED TO FOUR DIGITS
  Sample_code <- sprintf('SAMP%04d', Sample_code)
})
df2b
#   Sample_ID       Date Variable Batch Sample_code
# 1    123123 15/06/2019     blue     1    SAMP0001
# 2    123456 23/06/2019      red     1    SAMP0002
# 3    123123 30/06/2019      red     1    SAMP0001
# 4    123789 07/07/2019     blue     1    SAMP0003
# 5    456789 15/07/2019     blue     2    SAMP0004
# 6    123654 31/07/2019      red     2    SAMP0005
# 7    123123 12/08/2019     blue     2    SAMP0001
# 8    123789 27/08/2019      red     2    SAMP0003
# 9    121212 31/08/2019      red     2    SAMP0006
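If you would rather keep the saved table and only code the rows of each new batch, the same idea can be written as a lookup. The following is a minimal base R sketch, not part of the answer above: `add_batch` and its `prefix` and `width` arguments are hypothetical names, and it assumes that every existing Sample_code is the prefix followed by a number. The data frames are cut down to two columns to keep the example short.

```r
# Hypothetical helper: code the rows of a new batch and append them,
# leaving every previously assigned Sample_code untouched.
add_batch <- function(coded, new_batch, prefix = "SAMP", width = 4) {
  old_ids <- as.character(coded$Sample_ID)
  new_ids <- as.character(new_batch$Sample_ID)

  # Lookup of codes issued so far: one entry per Sample_ID
  first_row <- !duplicated(old_ids)
  lookup <- setNames(as.character(coded$Sample_code[first_row]), old_ids[first_row])

  # Highest number issued so far (0 if nothing has been coded yet)
  issued <- as.integer(sub(prefix, "", lookup, fixed = TRUE))
  last_num <- if (length(issued) > 0) max(issued) else 0L

  # Sample_IDs not seen before, in order of first appearance in the new batch
  unseen <- setdiff(unique(new_ids), names(lookup))
  fresh <- sprintf(paste0(prefix, "%0", width, "d"), last_num + seq_along(unseen))
  lookup <- c(lookup, setNames(fresh, unseen))

  # Attach codes to the new rows only, then append below the existing rows
  new_batch$Sample_code <- unname(lookup[new_ids])
  rbind(coded, new_batch)
}

# Cut-down versions of the batches from the question
df1 <- data.frame(Sample_ID = c('123123', '123456', '123123', '123789'),
                  Batch = 1, stringsAsFactors = FALSE)
df2 <- data.frame(Sample_ID = c('456789', '123654', '123123', '123789', '121212'),
                  Batch = 2, stringsAsFactors = FALSE)

coded <- add_batch(NULL, df1)   # first batch: nothing coded yet
coded <- add_batch(coded, df2)  # later batches continue the numbering
coded$Sample_code
```

Passing `NULL` as the first argument starts the numbering from 1, and Batch 3 would be added with another `add_batch(coded, df3)` call. The result should agree with the factor approach as long as the batches are appended in the same order.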
I have a dataframe A that I want to filter based on whether or not the corresponding sample names in dataframe B have NA in the second column (ID). Sample names in dataframe A repeat while sample names in dataframe B appear only once, making the dataframe lengths different.
Essentially, I want to have a final table where the sample names that have a value in dataframe B, column 'ID', are removed from dataframe A entirely.
I tried the following filter function, but it gave me an error related to the different dataframe lengths:
filtered_table <- filter(A_table_to_filter, is.na(B_filter_table$ID))
Here is some example data:
dataframe_A_table_to_filter <- data.frame(sample = c("OP2645ii_d","OP5048___g","OP5046___e","OP5048___g","OP2413iiia","OP5048___g","OP5043___b","OP5048___g","OP3088i__a","OP5048___g","OP5046___a","OP5048___g","OP5048___b","OP5048___g","OP5043___a","OP5048___g","OP2645ii_d","OP5048___f","OP2645ii_d","OP5044___c","OP2413iiib","OP5048___g","OP5046___c","OP5048___g","OP5046___d","OP5046___e","OP5048___e","OP5048___g","OP5046___e","OP5048___c","OP2413iiia","OP5046___e","OP2645ii_b","OP2645ii_d","OP2645ii_a","OP5046___e","OP5046___e","OP5048___d","OP5046___e","OP5048___e","OP2413iiia","OP5048___f","OP5044___c","OP5046___e","OP2413iiia","OP2645ii_c","OP5046___e","OP5047___b","OP2645ii_a","OP2645ii_d","OP5046___c","OP5046___e","OP5046___d","OP5048___g","OP2645ii_e","OP5048___g","OP2645ii_c","OP5046___d","OP5048___c","OP5048___g","OP2645ii_c","OP5048___c","OP2645ii_c","OP5048___e","OP2645ii_c","OP5048___g","OP5046___e","OP5048___f","OP2645ii_d","OP5046___d","OP2645ii_c","OP5046___c","OP2645ii_d","OP5048___d","OP5043___b","OP5048___f","OP5046___c","OP5048___f","OP2645ii_d","OP5048___c","OP2413iiib","OP5046___e","OP2413iiib","OP5048___f","OP5044___a","OP5048___g","OP5043___a","OP5048___f","OP3088i__a","OP5048___f","OP5048___e","OP5048___f","OP5044___c","OP5048___b","OP2645ii_d","OP5047___b","OP2413iiia","OP2645ii_b","OP5046___a","OP5048___f","OP5043___b","OP5044___c","OP2645ii_c","OP5048___d","OP5047___b","OP5048___g","OP5048___b","OP5048___f","OP2413iiia","OP5044___c","OP2645ii_b","OP5046___e","OP2645ii_c","OP5047___b","OP5044___c","OP5046___a","OP2413iiib","OP2645ii_c","OP2645ii_e","OP5046___e","OP5048___d","OP5048___g","OP5046___d","OP5048___b","OP2645ii_a","OP2645ii_c","OP3088i__a","OP5044___c"),
gr = c("gr3","gr2","gr5","gr2","gr1","gr2","gr1","gr2","gr1","gr2","gr5","gr2","gr5","gr2","gr3","gr2","gr3","gr2","gr3","gr2","gr4","gr2","gr1","gr2","gr4","gr5","gr1","gr2","gr5","gr4","gr1","gr5","gr2","gr3","gr1","gr5","gr5","gr4","gr5","gr1","gr1","gr2","gr2","gr5","gr1","gr5","gr5","gr4","gr1","gr3","gr1","gr5","gr4","gr2","gr1","gr2","gr5","gr4","gr4","gr2","gr5","gr4","gr5","gr1","gr5","gr2","gr5","gr2","gr3","gr4","gr5","gr1","gr3","gr4","gr1","gr2","gr1","gr2","gr3","gr4","gr4","gr5","gr4","gr2","gr3","gr2","gr3","gr2","gr1","gr2","gr1","gr2","gr2","gr5","gr3","gr4","gr1","gr2","gr5","gr2","gr1","gr2","gr5","gr4","gr4","gr2","gr5","gr2","gr1","gr2","gr2","gr5","gr5","gr4","gr2","gr5","gr4","gr5","gr1","gr5","gr4","gr2","gr4","gr5","gr1","gr5","gr1","gr2"), dist = c(7.59036265840066,7.59036265840066,6.44967614976991,6.44967614976991,6.41995474653303,6.41995474653303,6.34991780754275,6.34991780754275,6.18262339507581,6.18262339507581,6.16265512136205,6.16265512136205,6.15423247141993,6.15423247141993,6.14014702309176,6.14014702309176,6.05863330633262,6.05863330633262,5.96292319399187,5.96292319399187,5.94395576047878,5.94395576047878,5.86375256401321,5.86375256401321,5.78102441659872,5.78102441659872,5.7345012847377,5.7345012847377,5.67874617854728,5.67874617854728,5.53957425202641,5.53957425202641,5.44753353881181,5.44753353881181,5.43742118904064,5.43742118904064,5.42270717863966,5.42270717863966,5.40852682965639,5.40852682965639,5.37916907844967,5.37916907844967,5.28542559212653,5.28542559212653,5.28127574537985,5.28127574537985,5.27883657001377,5.27883657001377,5.26111686809869,5.26111686809869,5.25446925024172,5.25446925024172,5.18612527748647,5.18612527748647,5.16152942865884,5.16152942865884,5.13493683199873,5.13493683199873,5.11477487647704,5.11477487647704,5.02518908529805,5.02518908529805,4.96387986494177,4.96387986494177,4.93803544508224,4.93803544508224,4.90484535173276,4.90484535173276,4.88609183324537,4.88609183324537,4.87064721174553,4.87064721174553,4.87044988024298,4.87044988024298,4.87018300982248,4.87018300982248,4.81850235997663,4.81850235997663,4.81315159594962,4.81315159594962,4.79708386349633,4.79708386349633,4.79137478521543,4.79137478521543,4.79076662890575,4.79076662890575,4.7629557294752,4.7629557294752,4.75107063347786,4.75107063347786,4.73927394720927,4.73927394720927,4.65856308508064,4.65856308508064,4.65459244413676,4.65459244413676,4.65168460273128,4.65168460273128,4.64631379714574,4.64631379714574,4.63427356346989,4.63427356346989,4.61758860663907,4.61758860663907,4.61520572342783,4.61520572342783,4.59738310693479,4.59738310693479,4.56270527374553,4.56270527374553,4.53521289030436,4.53521289030436,4.52843905005562,4.52843905005562,4.51867277408847,4.51867277408847,4.50634336104738,4.50634336104738,4.46047201471265,4.46047201471265,4.45241678415362,4.45241678415362,4.43613430884318,4.43613430884318,4.43212669019848,4.43212669019848,4.41051867890157,4.41051867890157))
dataframe_B_filter_table <- data.frame(sample = c("OP2413iiia","OP2413iiib","OP2645ii_a","OP2645ii_b","OP2645ii_c","OP2645ii_d","OP2645ii_e","OP3088i__a","OP5043___a","OP5043___b","OP5044___a","OP5044___b","OP5044___c","OP5046___a","OP5046___b","OP5046___c","OP5046___d","OP5046___e","OP5047___a","OP5047___b","OP5048___b","OP5048___c","OP5048___d","OP5048___e","OP5048___f","OP5048___g","OP5048___h","OP5049___a","OP5049___b","OP5051DNAa","OP5051DNAc","OP5052DNAa","OP5053DNAa","OP5053DNAb","OP5053DNAc","OP5054DNAa","OP5054DNAb","OP5054DNAc","OP5051DNAb"),
ID = c(NA,NA,"gr1",NA,NA,"gr3",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"gr3",NA,NA,NA,NA,NA,NA,NA,NA,"gr2",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
I expect a table where the rows with sample names from dataframe B that are not NA (i.e. that have a value) are removed from dataframe A. However, I receive an error related to the differing lengths of the tables.
So you want to keep the samples whose corresponding ID is NA.
We can use match to look up the corresponding ID in dfB for each sample in dfA, and keep a row only if that ID is NA:
dfA[is.na(dfB$ID[match(dfA$sample, dfB$sample)]), ]
#        sample  gr   dist
#3   OP5046___e gr5 6.4497
#5   OP2413iiia gr1 6.4200
#7   OP5043___b gr1 6.3499
#9   OP3088i__a gr1 6.1826
#11  OP5046___a gr5 6.1627
#13  OP5048___b gr5 6.1542
#....
I renamed your data frames to dfA and dfB for readability.
If you are sure that every sample in dfA is present in dfB, we can also use
dfA[dfA$sample %in% dfB$sample[is.na(dfB$ID)], ]
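The two expressions only differ when dfA contains a sample that is missing from dfB. Here is a small sketch on made-up toy data (the sample names "s1" to "s4" are invented for illustration and are not from the question), showing that the match version keeps such a row while the %in% version drops it:

```r
# Toy data (made up for illustration): "s4" appears in dfA but not in dfB
dfA <- data.frame(sample = c("s1", "s2", "s1", "s3", "s4"),
                  dist = c(5, 4, 3, 2, 1),
                  stringsAsFactors = FALSE)
dfB <- data.frame(sample = c("s1", "s2", "s3"),
                  ID = c("gr1", NA, NA),
                  stringsAsFactors = FALSE)

# match(): an unmatched sample gives NA, so is.na() is TRUE and the row is kept
keep_match <- dfA[is.na(dfB$ID[match(dfA$sample, dfB$sample)]), ]

# %in%: an unmatched sample is not among the NA-ID samples, so the row is dropped
keep_in <- dfA[dfA$sample %in% dfB$sample[is.na(dfB$ID)], ]

keep_match$sample  # "s2" "s3" "s4"
keep_in$sample     # "s2" "s3"
```

Which behaviour you want depends on whether a sample with no entry in dfB should count as having no ID.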
I just realized an answer has already been posted, so here is a solution based on {data.table} instead.
Building on your code, I ran the following lines.
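library(data.table)
# Convert both data frames to data.tables
DT_A <- data.table(dataframe_A_table_to_filter)
DT_B <- data.table(dataframe_B_filter_table)
# Samples whose ID is NA in B are the ones to keep in A
KeepSamples <- DT_B[is.na(ID), sample]
DT_A <- DT_A[sample %in% KeepSamples, ]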