R: summing values of matched names and adding on new names' values - r

I am trying to do a simple task and have created a simple example. I would like to add the counts of a taxon recorded in one vector ('introduced', below) to the counts already measured in another vector ('existing'), matching by taxon name. However, when there is a new taxon (present in introduced but not in existing), I would like this taxon and its count to be added as a new entry in the result (the order doesn't matter, but the name needs to be retained).
For example:
existing<-c(3,4,5,6)
names(existing)<-c("Tax1","Tax2","Tax3","Tax4")
introduced<-c(2,2)
names(introduced)<-c("Tax1","Tax5")
I want the new vector, called "combined" here, to look like this:
#names(combined)= c("Tax1","Tax2","Tax3","Tax4","Tax5")
#combined= c(5,4,5,6,2)
The main thing to see is that "Tax1"'s values are combined (3+2=5), "Tax5" (2) is added on to the end
I have looked around but previous answers similar to this have much more complex data and it is difficult to extract which function I need. I have been trying combinations of match and which, but just cannot get it right.

grp <- c(existing,introduced)
tapply(grp,names(grp),sum)
#Tax1 Tax2 Tax3 Tax4 Tax5
# 5 4 5 6 2
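This works because c() keeps the names of both vectors, and tapply then sums the values within each name (the result comes back sorted by name). If it helps to see the same idea spelled out step by step, here is a minimal base R sketch, assuming names are unique within each vector. The variable all_taxa is just an illustrative name, not something from the question.

```r
existing <- c(Tax1 = 3, Tax2 = 4, Tax3 = 5, Tax4 = 6)
introduced <- c(Tax1 = 2, Tax5 = 2)

# Every taxon seen in either vector, in order of first appearance
all_taxa <- union(names(existing), names(introduced))

# Start from zero counts, then add each vector by name
combined <- setNames(numeric(length(all_taxa)), all_taxa)
combined[names(existing)] <- combined[names(existing)] + existing
combined[names(introduced)] <- combined[names(introduced)] + introduced

combined
```

Unlike tapply, this keeps the taxa in order of first appearance rather than sorting them alphabetically.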

Instead of keeping your data in 'loose' vectors, you may consider collecting them in one data frame. First, put your two sets of vector data in data frames:
existing <- c(3, 4, 5, 6)
taxon <- c("Tax1", "Tax2", "Tax3", "Tax4")
df1 <- data.frame(existing, taxon)
introduced <- c(2, 2)
taxon <- c("Tax1", "Tax5")
df2 <- data.frame(introduced, taxon)
Then merge the two data frames by the common column, 'taxon'. Set all = TRUE to include all rows from both data frames:
df3 <- merge(df1, df2, all = TRUE)
Finally, sum 'existing' and 'introduced' taxon, and add the result to the data frame:
df3$combined <- rowSums(df3[ , c("existing", "introduced")], na.rm = TRUE)
df3
# taxon existing introduced combined
# 1 Tax1 3 2 5
# 2 Tax2 4 NA 4
# 3 Tax3 5 NA 5
# 4 Tax4 6 NA 6
# 5 Tax5 NA 2 2

Related

Create column of ascending values based on unique values in another column in R with new data added in batches

I have asked a similar question in Python (How to create column of ascending values based on unique values in another column in pandas), and got the script working, but for various reasons I need to achieve the same thing in R now. I am also adding complexity here of wanting to be able to add new batches of data to the dataset periodically.
I have a list of samples which have unique sample ID numbers ("Sample_ID"). Each row of the dataset is a sample. Some samples are duplicated multiple times. I want to create a new set of sample names ("Sample_code") that ascends up from 1 as you go down the rows using a prefix (e.g. "SAMP00001", "SAMP00002" etc). I want the order of rows to be preserved (as they are roughly in date order of sample collection). And for duplicated samples, I want the number given for Sample_code to correspond to the first row that sample ID appears in, not rows further down the table (which came later in sample collection).
My starting data is illustrated with df1:
# df1
Sample_ID <- c('123123','123456','123123','123789')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019')
Variable <- c("blue","red","red","blue")
Batch <- 1
df1 <- data.frame(Sample_ID, Date, Variable, Batch)
df1
I want to create the Sample_code column shown in df1b:
# df1b
Sample_ID <- c('123123','123456','123123','123789')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019')
Variable <- c("blue","red","red","blue")
Batch <- 1
Sample_code <- c('SAMP0001', 'SAMP0002', 'SAMP0001', 'SAMP0003')
df1b <- data.frame(Sample_ID, Date, Variable, Batch, Sample_code)
df1b
I would save df1b at this point and those Sample_code names used for downstream processing. The added complexity comes because I will then collect a new batch of samples - let's call it df2 (Batch 2 samples):
# df2
Sample_ID <- c('456789', '123654', '123123', '123789', '121212')
Date <- c('15/07/2019', '31/07/2019', '12/08/2019', '27/08/2019', '31/08/2019')
Variable <- c("blue", "red","blue", "red", "red")
Batch <- 2
df2 <- data.frame(Sample_ID, Date, Variable, Batch)
df2
I want to rbind df2 to the bottom of df1, and generate more Sample_code names for the new rows. Importantly, the new Sample_code names need to take account of any Sample_ID duplicates that were present in df1, but also not change any of the Sample_code names that were already assigned back when I only had df1. The result at this point would be df2b, below:
# df2b
Sample_ID <- c('123123','123456','123123','123789','456789', '123654', '123123', '123789', '121212')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019', '15/07/2019', '31/07/2019', '12/08/2019', '27/08/2019', '31/08/2019')
Variable <- c("blue","red","red","blue","blue", "red","blue", "red", "red")
Batch <- c(1,1,1,1,2,2,2,2,2)
Sample_code <- c('SAMP0001', 'SAMP0002', 'SAMP0001', 'SAMP0003', 'SAMP0004', 'SAMP0005', 'SAMP0001', 'SAMP0003', 'SAMP0006')
df2b <- data.frame(Sample_ID, Date, Variable, Batch, Sample_code)
df2b
And then I would add Batch 3 samples in the same way etc etc.
I appreciate there are at least 2 stages to this problem: 1) Producing an ascending list of Sample_code names using unique Sample_ID values; and 2) Building in an iterative way of adding batches of samples. But because the second point impacts on the functionality I want for the Sample_code names I have included both stages here.
Lastly - ideally I want to only use base R and tidyverse packages for this.
Any help much appreciated! Thanks.
Because you need to know all possible sample IDs before sample code assignment, consider reversing the order by calling rbind on all sample data frames. Then assign the Sample_code using factor levels. Otherwise, re-assign Sample_code with each batch data frame.
# BUILD A LIST OF DATA FRAMES BY CALLING lapply ON ITERATIVE PROCESS
# df_list <- lapply(batch_iterable, method_to_build_sample)
df_list <- list(df1, df2) # FOR THIS PARTICULAR POST
# RBIND ALL DFs TOGETHER
df2b <- do.call(rbind, df_list)
df2b <- within(df2b, {
  # CONVERT TO CHARACTER
  Sample_ID <- as.character(Sample_ID)
  # CONVERT TO FACTOR AT POSITIONED VALUES, THEN INTEGER FOR LEVEL NUMBER
  Sample_code <- as.character(as.integer(factor(Sample_ID, levels = unique(Sample_ID))))
  # RE-ASSIGN WITH SAMP AND LEADING ZEROS
  Sample_code <- ifelse(nchar(Sample_code) == 1, paste0('SAMP000', Sample_code),
                   ifelse(nchar(Sample_code) == 2, paste0('SAMP00', Sample_code),
                     ifelse(nchar(Sample_code) == 3, paste0('SAMP0', Sample_code), NA)
                   )
                 )
})
df2b
# Sample_ID Date Variable Batch Sample_code
# 1 123123 15/06/2019 blue 1 SAMP0001
# 2 123456 23/06/2019 red 1 SAMP0002
# 3 123123 30/06/2019 red 1 SAMP0001
# 4 123789 07/07/2019 blue 1 SAMP0003
# 5 456789 15/07/2019 blue 2 SAMP0004
# 6 123654 31/07/2019 red 2 SAMP0005
# 7 123123 12/08/2019 blue 2 SAMP0001
# 8 123789 27/08/2019 red 2 SAMP0003
# 9 121212 31/08/2019 red 2 SAMP0006
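If re-binding everything each time is not convenient, the batch-by-batch variant mentioned above could look roughly like the sketch below. It is only a sketch under some assumptions: assign_codes is a hypothetical helper name (not from the question or any package), codes always have the form SAMP plus four digits, and the example data frames are cut down to two columns. Already-assigned codes are looked up rather than regenerated, and new IDs continue from the highest number used so far.

```r
# Hypothetical helper: add Sample_code to a new batch and append it to the
# already-coded data (coded = NULL for the very first batch).
assign_codes <- function(new_batch, coded = NULL) {
  # Codes assigned so far, one entry per known Sample_ID
  lookup <- character(0)
  if (!is.null(coded)) {
    first <- !duplicated(coded$Sample_ID)
    lookup <- setNames(as.character(coded$Sample_code[first]),
                       as.character(coded$Sample_ID[first]))
  }
  ids <- as.character(new_batch$Sample_ID)
  # IDs not seen before, in order of first appearance in this batch
  new_ids <- setdiff(unique(ids), names(lookup))
  # Continue numbering after the highest code number already used
  last_num <- if (length(lookup) > 0) max(as.integer(sub("SAMP", "", lookup))) else 0L
  new_codes <- sprintf("SAMP%04d", last_num + seq_along(new_ids))
  lookup <- c(lookup, setNames(new_codes, new_ids))
  new_batch$Sample_code <- unname(lookup[ids])
  rbind(coded, new_batch)
}

# Cut-down versions of the batches from the question
batch1 <- data.frame(Sample_ID = c("123123", "123456", "123123", "123789"),
                     Batch = 1, stringsAsFactors = FALSE)
batch2 <- data.frame(Sample_ID = c("456789", "123654", "123123", "123789", "121212"),
                     Batch = 2, stringsAsFactors = FALSE)

coded <- assign_codes(batch1)          # first batch
coded <- assign_codes(batch2, coded)   # later batches reuse existing codes
coded
```

sprintf("SAMP%04d", ...) does the zero padding in one call, so the nested ifelse on nchar is not needed in this version.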

Filter one dataframe based on presence of NA in another dataframe of different length

I have a dataframe A that I want to filter based on whether or not the corresponding sample names in dataframe B have NA in the second column (ID). Sample names in dataframe A repeat while sample names in dataframe B appear only once, making the dataframe lengths different.
Essentially, I want to have a final table where the sample names that have a value in dataframe B, column 'ID', are removed from dataframe A entirely.
I tried the following filter function, but it gave me an error related to the different dataframe lengths:
filtered_table <- filter(A_table_to_filter, is.na(B_filter_table$ID))
Here is some example data:
dataframe_A_table_to_filter <- data.frame(sample = c("OP2645ii_d","OP5048___g","OP5046___e","OP5048___g","OP2413iiia","OP5048___g","OP5043___b","OP5048___g","OP3088i__a","OP5048___g","OP5046___a","OP5048___g","OP5048___b","OP5048___g","OP5043___a","OP5048___g","OP2645ii_d","OP5048___f","OP2645ii_d","OP5044___c","OP2413iiib","OP5048___g","OP5046___c","OP5048___g","OP5046___d","OP5046___e","OP5048___e","OP5048___g","OP5046___e","OP5048___c","OP2413iiia","OP5046___e","OP2645ii_b","OP2645ii_d","OP2645ii_a","OP5046___e","OP5046___e","OP5048___d","OP5046___e","OP5048___e","OP2413iiia","OP5048___f","OP5044___c","OP5046___e","OP2413iiia","OP2645ii_c","OP5046___e","OP5047___b","OP2645ii_a","OP2645ii_d","OP5046___c","OP5046___e","OP5046___d","OP5048___g","OP2645ii_e","OP5048___g","OP2645ii_c","OP5046___d","OP5048___c","OP5048___g","OP2645ii_c","OP5048___c","OP2645ii_c","OP5048___e","OP2645ii_c","OP5048___g","OP5046___e","OP5048___f","OP2645ii_d","OP5046___d","OP2645ii_c","OP5046___c","OP2645ii_d","OP5048___d","OP5043___b","OP5048___f","OP5046___c","OP5048___f","OP2645ii_d","OP5048___c","OP2413iiib","OP5046___e","OP2413iiib","OP5048___f","OP5044___a","OP5048___g","OP5043___a","OP5048___f","OP3088i__a","OP5048___f","OP5048___e","OP5048___f","OP5044___c","OP5048___b","OP2645ii_d","OP5047___b","OP2413iiia","OP2645ii_b","OP5046___a","OP5048___f","OP5043___b","OP5044___c","OP2645ii_c","OP5048___d","OP5047___b","OP5048___g","OP5048___b","OP5048___f","OP2413iiia","OP5044___c","OP2645ii_b","OP5046___e","OP2645ii_c","OP5047___b","OP5044___c","OP5046___a","OP2413iiib","OP2645ii_c","OP2645ii_e","OP5046___e","OP5048___d","OP5048___g","OP5046___d","OP5048___b","OP2645ii_a","OP2645ii_c","OP3088i__a","OP5044___c"),
gr = c("gr3","gr2","gr5","gr2","gr1","gr2","gr1","gr2","gr1","gr2","gr5","gr2","gr5","gr2","gr3","gr2","gr3","gr2","gr3","gr2","gr4","gr2","gr1","gr2","gr4","gr5","gr1","gr2","gr5","gr4","gr1","gr5","gr2","gr3","gr1","gr5","gr5","gr4","gr5","gr1","gr1","gr2","gr2","gr5","gr1","gr5","gr5","gr4","gr1","gr3","gr1","gr5","gr4","gr2","gr1","gr2","gr5","gr4","gr4","gr2","gr5","gr4","gr5","gr1","gr5","gr2","gr5","gr2","gr3","gr4","gr5","gr1","gr3","gr4","gr1","gr2","gr1","gr2","gr3","gr4","gr4","gr5","gr4","gr2","gr3","gr2","gr3","gr2","gr1","gr2","gr1","gr2","gr2","gr5","gr3","gr4","gr1","gr2","gr5","gr2","gr1","gr2","gr5","gr4","gr4","gr2","gr5","gr2","gr1","gr2","gr2","gr5","gr5","gr4","gr2","gr5","gr4","gr5","gr1","gr5","gr4","gr2","gr4","gr5","gr1","gr5","gr1","gr2"), dist = c(7.59036265840066,7.59036265840066,6.44967614976991,6.44967614976991,6.41995474653303,6.41995474653303,6.34991780754275,6.34991780754275,6.18262339507581,6.18262339507581,6.16265512136205,6.16265512136205,6.15423247141993,6.15423247141993,6.14014702309176,6.14014702309176,6.05863330633262,6.05863330633262,5.96292319399187,5.96292319399187,5.94395576047878,5.94395576047878,5.86375256401321,5.86375256401321,5.78102441659872,5.78102441659872,5.7345012847377,5.7345012847377,5.67874617854728,5.67874617854728,5.53957425202641,5.53957425202641,5.44753353881181,5.44753353881181,5.43742118904064,5.43742118904064,5.42270717863966,5.42270717863966,5.40852682965639,5.40852682965639,5.37916907844967,5.37916907844967,5.28542559212653,5.28542559212653,5.28127574537985,5.28127574537985,5.27883657001377,5.27883657001377,5.26111686809869,5.26111686809869,5.25446925024172,5.25446925024172,5.18612527748647,5.18612527748647,5.16152942865884,5.16152942865884,5.13493683199873,5.13493683199873,5.11477487647704,5.11477487647704,5.02518908529805,5.02518908529805,4.96387986494177,4.96387986494177,4.93803544508224,4.93803544508224,4.90484535173276,4.90484535173276,4.88609183324537,4.88609183324537,4.87064721174553,
4.87064721174553,4.87044988024298,4.87044988024298,4.87018300982248,4.87018300982248,4.81850235997663,4.81850235997663,4.81315159594962,4.81315159594962,4.79708386349633,4.79708386349633,4.79137478521543,4.79137478521543,4.79076662890575,4.79076662890575,4.7629557294752,4.7629557294752,4.75107063347786,4.75107063347786,4.73927394720927,4.73927394720927,4.65856308508064,4.65856308508064,4.65459244413676,4.65459244413676,4.65168460273128,4.65168460273128,4.64631379714574,4.64631379714574,4.63427356346989,4.63427356346989,4.61758860663907,4.61758860663907,4.61520572342783,4.61520572342783,4.59738310693479,4.59738310693479,4.56270527374553,4.56270527374553,4.53521289030436,4.53521289030436,4.52843905005562,4.52843905005562,4.51867277408847,4.51867277408847,4.50634336104738,4.50634336104738,4.46047201471265,4.46047201471265,4.45241678415362,4.45241678415362,4.43613430884318,4.43613430884318,4.43212669019848,4.43212669019848,4.41051867890157,4.41051867890157))
dataframe_B_filter_table <- data.frame(sample = c("OP2413iiia","OP2413iiib","OP2645ii_a","OP2645ii_b","OP2645ii_c","OP2645ii_d","OP2645ii_e","OP3088i__a","OP5043___a","OP5043___b","OP5044___a","OP5044___b","OP5044___c","OP5046___a","OP5046___b","OP5046___c","OP5046___d","OP5046___e","OP5047___a","OP5047___b","OP5048___b","OP5048___c","OP5048___d","OP5048___e","OP5048___f","OP5048___g","OP5048___h","OP5049___a","OP5049___b","OP5051DNAa","OP5051DNAc","OP5052DNAa","OP5053DNAa","OP5053DNAb","OP5053DNAc","OP5054DNAa","OP5054DNAb","OP5054DNAc","OP5051DNAb"),
ID = c(NA,NA,"gr1",NA,NA,"gr3",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,"gr3",NA,NA,NA,NA,NA,NA,NA,NA,"gr2",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
I expect a table where the rows with sample names from dataframe B that are not NA (i.e. that have a value) are removed from dataframe A. However, I receive an error related to the differing lengths of the tables.
So you want to keep the samples whose corresponding ID is NA.
We can match each sample to get its corresponding ID from dfB and keep the row only if that ID is NA:
dfA[is.na(dfB$ID[match(dfA$sample, dfB$sample)]), ]
# sample gr dist
#3 OP5046___e gr5 6.4497
#5 OP2413iiia gr1 6.4200
#7 OP5043___b gr1 6.3499
#9 OP3088i__a gr1 6.1826
#11 OP5046___a gr5 6.1627
#13 OP5048___b gr5 6.1542
#....
Renamed your dataframe to dfA and dfB for readability.
If you are sure that every value in dfA is present in dfB we can also use
dfA[dfA$sample %in% dfB$sample[is.na(dfB$ID)], ]
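To make the difference between the two one-liners above concrete, here is a small sketch on made-up data (the sample names s1, s2, s3, s9 are invented for illustration and are not from the question). The match version keeps samples that are missing from dfB altogether, because a failed match yields NA, while the %in% version drops them.

```r
# Toy data: s2 has an ID in dfB, s9 does not appear in dfB at all
dfA <- data.frame(sample = c("s1", "s2", "s1", "s3", "s9"),
                  dist = c(1.5, 1.5, 2.0, 2.0, 2.5),
                  stringsAsFactors = FALSE)
dfB <- data.frame(sample = c("s1", "s2", "s3"),
                  ID = c(NA, "gr1", NA),
                  stringsAsFactors = FALSE)

# Position of each dfA sample in dfB (NA when the sample is absent from dfB)
idx <- match(dfA$sample, dfB$sample)

# Version 1: keep rows whose looked-up ID is NA (also keeps unmatched samples)
by_match <- dfA[is.na(dfB$ID[idx]), ]

# Version 2: keep rows whose sample is listed in dfB with an NA ID
by_in <- dfA[dfA$sample %in% dfB$sample[is.na(dfB$ID)], ]

by_match$sample
by_in$sample
```

Which behaviour is wanted for samples absent from dfB depends on the data, so it is worth checking which case applies before picking one.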
Just realized an answer has been posted, will give a solution based on {data.table}.
To append to your code, I have run the following lines.
library(data.table)
DT_A <- data.table(dataframe_A_table_to_filter)
DT_B <- data.table(dataframe_B_filter_table)
KeepSamples <- DT_B[is.na(ID), sample]
DT_A <- DT_A[sample %in% KeepSamples, ]

How to merge data frame and a string

In the case that a=matrix(c(1,2,3,4),nrow=2,ncol=2) and b=c('name',3). I am trying to merge a and b such that the outcome is [1 3 name 3] in the first row and [2 4] in the second row.
The number of rows differs in each dataframe. Therefore cbind is going to have a hard time merging the data and will by default loop the shorter dataframe, in this case b.
I would suggest adding in the rowname as a column and then binding on that. By default, full_join will then generate NA values for dataframes missing that value of the bind. This question is partially a duplicate of Add (not merge!) two data frames with unequal rows and columns so you may find more help there.
# Load packages
library(tidyverse)
library(magrittr) # To use the inplace assignment operator (%<>%)
# Create dataframes
a <- data.frame(1:2,3:4)
b <- merge('name', 3)
# Create rowname column for each dataframe
a %<>% tibble::rownames_to_column()
b %<>% tibble::rownames_to_column()
# Use 'full join' to bind dataframes together
c <- dplyr::full_join(a, b, by = "rowname") %>%
  # Remove the rowname column
  dplyr::select(-rowname)
# Print c
print(c)
X1.2 X3.4 x y
1 1 3 name 3
2 2 4 <NA> NA
If you are satisfied with a list, not data frame, this will work.
a <- matrix(c(1,2,3,4),nrow=2,ncol=2)
b <- c('name',3)
c <- list(a[,1],a[,2],b[1],b[2] )
If you need a data frame,
you have to make the 1st and 2nd row have the same number of columns, by stuffing the gaps with something.
d <- as.data.frame(c)
d[2,3:4] <- NA
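The "stuff the gaps" idea can also be done up front, by building an NA-filled block the same height as a and writing b into its first row. This is only a sketch in base R. The names pad and the column names a1, a2, b1, b2 are made up for illustration.

```r
a <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
b <- c("name", 3)

# Character matrix of NAs: one row per row of a, one column per element of b
pad <- matrix(NA_character_, nrow = nrow(a), ncol = length(b))
pad[1, ] <- b  # b only fills the first row

# Bind side by side; the second row of the padded block stays NA
d <- data.frame(a, pad, stringsAsFactors = FALSE)
names(d) <- c("a1", "a2", "b1", "b2")
d
```

Note that b is a character vector (c('name', 3) coerces 3 to "3"), so the last two columns are character.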

match/merge dataframes with a number columns with different column names in r

I have two dataframes with different columns, each with a large number of rows (about 2 million).
The first one is df1
The second one is df2
I need to match the values in the y column of table one to the R column in table two.
Example:
see the two rows in df1 in red box have matched the two rows in df2 in red box
Then I need to get the score of the matched values
so the result should look like this and it should be stores in a dataframe:
My attempt: I am a beginner in R. When I searched, I found that I can use the match function or the merge function, but I did not get the result I want, probably because I did not know how to use them correctly. Therefore, I need a very simple, step-by-step solution.
We can use match from base R
df2[match(df2$R, df1$y, nomatch = 0) > 0, c("R", "score")]
# R score
#3 2 3
#4 111 4
Or another option is semi_join from dplyr
library(dplyr)
semi_join(df2[-1], df1, by = c(R = "y"))
# R score
#1 2 3
#2 111 4
merge(df1,df2,by.x="y",by.y="R")[c("y","score")]
y score
1 2 3
2 111 4
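Since the tables in the question were posted as images, here is a self-contained sketch on invented data. The values in df1 and df2, and the extra columns x and id, are assumptions chosen only so that the result resembles the output shown in the answers. It shows the base R filter and the merge route side by side.

```r
# Invented stand-ins for the tables in the question
df1 <- data.frame(x = c(10, 20, 30), y = c(2, 111, 7))
df2 <- data.frame(id = 1:4, R = c(5, 9, 2, 111), score = c(1, 2, 3, 4))

# Keep the rows of df2 whose R value occurs somewhere in df1$y
matched <- df2[df2$R %in% df1$y, c("R", "score")]

# Same information via merge: join df1$y to df2$R, then keep two columns
merged <- merge(df1, df2, by.x = "y", by.y = "R")[c("y", "score")]

matched
merged
```

The filter keeps the original row order and row names of df2, while merge sorts by the key and, if a value is repeated in both tables, returns one row per matching pair.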

Rename column R

I am trying to rename columns but I do not know if that column will be present in the dataset. I have a large data set and if a certain column name is present I want to rename it. For example:
A B C D E
1 4 5 9 2
3 5 6 9 1
4 4 4 9 1
newNames <- data %>% rename(1=A,2=B,3=C,4=D,5=E)
This works to rename what is in the dataset but I am looking for the flexibility to add more potential name changes, without an error occurring.
newNames2 <- data %>% rename(1=A,2=B,3=C,4=D,5=E,6=F,7=G)
This will not work; it gives me an error because F and G are not in the data set.
Is there any way to write a code to ignore the column change if the name does not exist?
Thanks!
There can be plenty of ways to do this. One would be to create a named vector with the names and their corresponding 'new name' (as the vector's names) and use that, i.e.
#The below vector v1, uses LETTERS as old names and 1:7 as the new ones
v1 <- setNames(LETTERS[1:7], 1:7)
names(df) <- names(v1)[v1 %in% names(df)]
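One caveat: the assignment above replaces all of names(df), so it relies on every column of df appearing in v1 and in the same order. A slightly more defensive sketch, assuming a lookup vector built the same way (the new names col1 to col7 and the extra column Z are made up for illustration), renames only the columns that are found and leaves the rest alone:

```r
# Example data with one column (Z) that has no entry in the lookup
df <- data.frame(A = c(1, 3, 4), B = c(4, 5, 4), C = c(5, 6, 4), Z = c(9, 9, 9))

# Lookup: values are the old names, names are the new names
v1 <- setNames(LETTERS[1:7], paste0("col", 1:7))

# Position of each column name in the lookup (NA when it is not listed)
idx <- match(names(df), v1)
found <- !is.na(idx)

# Rename only the columns that were found
names(df)[found] <- names(v1)[idx[found]]
names(df)
```

Entries in v1 for columns that do not exist (D to G here) are simply never used, so no error occurs.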
