Merge dataframes rows by shared pattern - r

I want to merge two data frames based on a shared pattern.
The pattern is the ID name (HAND2 in the example below): in ID=HAND2;ACS=20 it is the part captured by "ID=(.+);ACS".
If the ID matches in both data frames, the respective rows should be combined.
DF1:
col1   col2
HAND2  H2
HAND6  H6
DF2:
col1  col2
OFS   ID=GATA5;ACS=45
FAM   ID=HAND2;ACS=20
MERGED (DF2 + DF1):
col1  col2             col3   col4
OFS   ID=GATA5;ACS=45
FAM   ID=HAND2;ACS=20  HAND2  H2
In this example the ID HAND2 is matched, so the matching rows of DF1 and DF2 are combined.
Script tried
MERGED <- merge(data.frame(DF1, row.names=NULL), data.frame(DF2, row.names=NULL), by = ("ID=(.+);ACS"), all = TRUE)[-1]
error
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
I am struggling to find a similar command that matches data frame rows by a shared pattern instead of by column names.
Thank you in advance for your help.

You may try fuzzyjoin. Its match_fun argument lets you define a matching function for your specific needs.
In your case, gsub extracts the ID from the col2 column of DF2, and str_detect checks whether the col1 column of DF1 contains that ID.
Data
DF1 <- read.table(text = "col1 col2
HAND2 H2
HAND6 H6", header = T)
DF2 <- read.table(text = "col1 col2
OFS ID=GATA5;ACS=45
FAM ID=HAND2;ACS=20", header = T)
Code
library(fuzzyjoin)
library(stringr)
DF2 %>%
  fuzzy_left_join(DF1,
                  by = c("col2" = "col1"),
                  # x is DF2$col2, y is DF1$col1: extract the ID from x and look for it in y
                  match_fun = function(x, y) str_detect(y, gsub("ID=(.+);(.*)", "\\1", x)))
Output
col1.x col2.x col1.y col2.y
1 OFS ID=GATA5;ACS=45 <NA> <NA>
2 FAM ID=HAND2;ACS=20 HAND2 H2
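If you would rather avoid an extra package, the same idea can be sketched in base R: extract the ID into a helper column and do an ordinary left merge on it. The helper column name id and the suffixes below are choices made for this illustration, not part of the original answer.

```r
DF1 <- data.frame(col1 = c("HAND2", "HAND6"), col2 = c("H2", "H6"))
DF2 <- data.frame(col1 = c("OFS", "FAM"),
                  col2 = c("ID=GATA5;ACS=45", "ID=HAND2;ACS=20"))

# Pull the ID out of "ID=<id>;ACS=<n>" into a helper key column (name is illustrative)
DF2$id <- sub("^ID=([^;]+);.*$", "\\1", DF2$col2)

# Left join: keep every DF2 row and attach DF1's columns where the ID matches
MERGED <- merge(DF2, DF1, by.x = "id", by.y = "col1", all.x = TRUE,
                suffixes = c(".DF2", ".DF1"))
MERGED
```

Unlike the str_detect version, this matches the extracted ID exactly, so an ID of HAND2 would not match a DF1 entry such as HAND20.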

Related

Combining two columns with character strings into a new column

Below I have two columns of data (column 6 and 7) of genus and species names. I would like to combine those two columns with character string data into a new column with the names combined.
I am quite new to R, and the code below does not work. Thank you for the help, wonderful people of Stack Overflow!
#TRYING TO MIX GENUS & SPECIES COLUMN
accepted_genus <- merged_subsets_2[6]
accepted_species <- merged_subsets_2[7]
accepted_genus
accepted_species
merged_subsets_2%>%
bind_cols(accepted_genus, accepted_species)
merged_subsets_2
We can use str_c from stringr
library(dplyr)
library(stringr)
df %>%
mutate(Col3 = str_c(Col1, Col2))
Or with unite
library(tidyr)
df %>%
unite(Col3, Col1, Col2, sep="", remove = FALSE)
If the above doesn't answer your question, here is a base R approach on sample data.
df <- data.frame(Col1 = letters[1:2], Col2=LETTERS[1:2]) # Sample data
> df
Col1 Col2
1 a A
2 b B
df$Col3 <- paste0(df$Col1, df$Col2) # Without spacing
> df
Col1 Col2 Col3
1 a A aA
2 b B bB
df$Col3 <- paste(df$Col1, df$Col2) # With a space (paste's default separator)
> df
Col1 Col2 Col3
1 a A a A
2 b B b B
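Applied to the genus and species case from the question, a minimal sketch might look like the following. The data frame taxa and its contents are hypothetical stand-ins for merged_subsets_2, and the columns are picked by position with [[ ]], which returns a plain vector, whereas single brackets (as in merged_subsets_2[6]) return a one-column data frame.

```r
# Hypothetical stand-in for merged_subsets_2; here the two columns sit at positions 1 and 2
taxa <- data.frame(genus = c("Quercus", "Acer"),
                   species = c("robur", "rubrum"),
                   stringsAsFactors = FALSE)

# [[ ]] extracts each column as a vector; paste joins them with a space
taxa$full_name <- paste(taxa[[1]], taxa[[2]])
taxa
```

In the original data the positions would be 6 and 7 instead of 1 and 2.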

Get column names for all dataframes in R

I have a large number of data frames in R. I want a readable output listing the column names of each data frame.
Say there are three data frames A, B and C with different numbers of columns and different column names: c("Col1", "Col2", "Col3"), c("Col4", "Col5") and c("Col6", "Col7", "Col8", "Col9", "Col10") respectively.
Now I want to have an output like this
Note: My intention is to later write it in a .csv file and break column names as per requirement (separated by tabs or "," separated)
Here's a stab.
df1 <- data.frame(a=1,b=2,c=3)
df2 <- data.frame(A=1,E=2)
df3 <- data.frame(quux=7,cronk=9)
# ls.objects() is a user-defined helper, not part of base R; a base R equivalent is
# (tibbles and data.tables also pass is.data.frame):
dfnms <- names(Filter(is.data.frame, mget(ls())))
dfnms
# [1] "df1" "df2" "df3"
data.frame(name = dfnms, columns = sapply(mget(dfnms), function(x) paste(colnames(x), collapse = ",")))
# name columns
# df1 df1 a,b,c
# df2 df2 A,E
# df3 df3 quux,cronk
If you really need them double-quoted, then add dQuote (passing FALSE as its second argument gives plain rather than typographic quotes), as in
data.frame(name = dfnms, columns = sapply(mget(dfnms), function(x) paste(dQuote(colnames(x), FALSE), collapse = ",")))
# name columns
# df1 df1 "a","b","c"
# df2 df2 "A","E"
# df3 df3 "quux","cronk"
I am putting my work here, although it is similar to r2evans's solution.
Data
A <- data.frame(col1=1:2, col2=1:2, col3=1:2)
B <- data.frame(col4=1:2, col5=1:2)
C <- data.frame(col6=1:2, col7=1:2, col8=1:2, col9=1:2, col10=1:2)
Code
DataFrameName = c('A', 'B', 'C')
data.frame(DataFrameName = DataFrameName,
Columns = sapply(DataFrameName, function(x) paste(names(get(x)), collapse = ",")),
stringsAsFactors = FALSE)
Output
# DataFrameName Columns
# A A col1,col2,col3
# B B col4,col5
# C C col6,col7,col8,col9,col10
With the tidyverse, we can collect the datasets in a named list with lst, loop over the list to get each set of column names and collapse it into a single string with toString, then turn the list of named strings into a two-column tibble with enframe and unnest the 'Columns' column.
library(dplyr)
library(tidyr)
library(purrr)
library(tibble) # for enframe
lst(A, B, C) %>%
map(~ .x %>% names %>% toString) %>%
enframe(name = "DataFrameName", value = "Columns") %>%
unnest(c(Columns))
# A tibble: 3 x 2
# DataFrameName Columns
# <chr> <chr>
#1 A Col1, Col2, Col3
#2 B Col1, Col5
#3 C Col6, Col7, Col8, Col9, Col10
data
A <- data.frame(Col1 = 1:5, Col2 = 6:10, Col3 = 11:15)
B <- data.frame(Col1 = 1:5, Col5 = 6:10)
C <- data.frame(Col6 =1:5, Col7 = 6:10, Col8 = 6:10, Col9 = 7:11, Col10 = 11:15)
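Since the question mentions writing the result to a .csv file later, here is a small base R sketch of that step. The names dfs, summary_tbl and out_file are made up for this example, the file goes to a temporary path, and ";" is used inside the cell only so the column names are not confused with the CSV's own commas (write.csv quotes character cells either way).

```r
A <- data.frame(Col1 = 1:2, Col2 = 1:2, Col3 = 1:2)
B <- data.frame(Col4 = 1:2, Col5 = 1:2)

# Put the data frames in a named list (assumed names for this sketch)
dfs <- list(A = A, B = B)

# One row per data frame, column names collapsed into a single string
summary_tbl <- data.frame(
  DataFrameName = names(dfs),
  Columns = vapply(dfs, function(x) paste(names(x), collapse = ";"),
                   character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Write to a temporary file and read it back to check the round trip
out_file <- tempfile(fileext = ".csv")
write.csv(summary_tbl, out_file, row.names = FALSE)
read_back <- read.csv(out_file, stringsAsFactors = FALSE)
read_back
```

The collapsed string can later be split again with strsplit(read_back$Columns, ";") if the individual names are needed.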

How to combine all unique values of a dataframe column into a string

I have created a dataframe that looks like this
data <- data.frame(col1,col2,col3)
>data
col1 col2 col3
1 a1 b1 c1
2 a1 b2 c2
3 a1 b3 c3
and would like to transform into
col1 col2 col3
1 a1 b1,b2,b3 c1,c2,c3
It seems that rbind is what I am looking for. But after reading the description, I still have no clue how to implement this.
Create example dataset:
df <- data.frame(
col1 = c("a1","a1","a1"),
col2 = c("b1","b2","b3"),
col3 = c("c1","c2","c3"),
stringsAsFactors = FALSE
)
Short version:
data.frame(lapply(df, function(x) paste(unique(x), collapse=",")))
With explanation and intermediate steps:
#create a custom function to list unique elements as comma separated
myfun <- function(x) {
paste(unique(x), collapse=",")
}
#apply our function to our dataframe's columns
temp <- lapply(df, myfun)
#temp is a list, turn it into a dataframe
result <- data.frame(temp)
Another option would be to use summarise_all
library(dplyr)
df %>% summarise_all(~ paste(unique(.), collapse = ",")) # funs() is deprecated; use a formula lambda
# col1 col2 col3
# 1 a1 b1,b2,b3 c1,c2,c3
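The answers above collapse every column over the whole data frame. If col1 is meant as a grouping key and may take more than one value, a base R sketch with aggregate could look like this. The extra a2 row is added purely for illustration and is not part of the question's data.

```r
df <- data.frame(
  col1 = c("a1", "a1", "a1", "a2"),
  col2 = c("b1", "b2", "b3", "b4"),
  col3 = c("c1", "c2", "c3", "c4"),
  stringsAsFactors = FALSE
)

# Collapse the unique values of col2 and col3 within each value of col1
result <- aggregate(df[c("col2", "col3")],
                    by = df["col1"],
                    FUN = function(x) paste(unique(x), collapse = ","))
result
```

With a single value in col1, as in the question, this gives the same one-row result as the ungrouped versions.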

Purify df1 by rows that have no duplicates in df2 based on several columns

I have two data frames, df1 and df2, each with several columns. My goal is to modify df1 such that it contains only rows that have duplicates in df2 based on several columns. Unfortunately, I only found ways to do it based on either one or all columns. Here is an example:
df1 <- data.frame(c(seq(1:5)),
c(letters[1:5]),
c(letters[22:26]))
colnames(df1) <- c("col1", "col2", "col3")
df2 <- data.frame(c(1, 20, 30, 4, 5),
c(letters[1:5]),
c(letters[15:19]))
colnames(df2) <- c("col1", "col2", "col3")
Now, I want to modify df1 such that it contains only rows that have duplicates in df2 based on col1 and col2. Thus, my goal is to get:
> df3
col1 col2 col3
1 1 a v
2 4 d y
3 5 e z
With merge in base R, you can do
merge(df1, df2[, 1:2])
col1 col2 col3
1 1 a v
2 4 d y
3 5 e z
You have to drop the final column of df2 (or keep only the ID columns). By default, only rows whose IDs match in both data.frames are kept. Also, merge by default joins on the column names common to both data.frames (via intersect(names(x), names(y))), which is what we want here, so we don't even have to specify the "by" or "by.x" / "by.y" arguments.
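If you prefer not to rely on merge's default choice of key columns, the keys can be spelled out. This is a sketch of the same inner join; the variable name keys is an assumption of this example, and unique() is added only to guard against duplicated key pairs in df2 producing repeated rows.

```r
df1 <- data.frame(col1 = 1:5,
                  col2 = letters[1:5],
                  col3 = letters[22:26])
df2 <- data.frame(col1 = c(1, 20, 30, 4, 5),
                  col2 = letters[1:5],
                  col3 = letters[15:19])

# Name the join keys explicitly instead of relying on shared column names
keys <- c("col1", "col2")

# Inner join: keep df1 rows whose (col1, col2) pair also occurs in df2
df3 <- merge(df1, unique(df2[keys]), by = keys)
df3
```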
Here is a join option with data.table
library(data.table)
setDT(df1)[df2[1:2], on = .(col1, col2), nomatch = 0]
# col1 col2 col3
#1: 1 a v
#2: 4 d y
#3: 5 e z
A base R solution could be
df1[with(df1,paste0(col1,"_",col2)) %in% with(df2,paste0(col1,"_",col2)),]
modified according to comments by @docendo discimus
Alternative solution by @docendo discimus:
# compares df1 and df2 row by row, so both must have their rows in the same order
cols <- c("col1", "col2"); df1[Reduce(`&`, Map(`==`, df1[cols], df2[cols])),]
We can use semi_join from dplyr. df3 is the final output.
library(dplyr)
df3 <- df1 %>% semi_join(df2, by = c("col1", "col2"))

Merge multiple csv by columns r

I have multiple csv files, and these files contain some identical columns as well as different columns.
For example,
#1st.csv
col1,col2
1,2
#2nd.csv
col1,col3,col4
1,2,3
#3rd.csv
col1,col2,col3,col5
1,2,3,4
I am trying to combine these files on their shared columns. For columns that a file lacks, I simply want to include the column and fill its cells with NA.
So I expect to see:
col1,col2,col3,col4,col5
1,2,NA,NA,NA #this is 1st.csv
1,NA,2,3,NA #this is 2nd.csv
1,2,3,NA,4 #this is 3rd.csv
Here is the R code I tried, but it returns an error message
> Combine_data <- smartbind(1st,2nd,3rd)
Error in `[<-.data.frame`(`*tmp*`, , value = list(ID = c(1001, 1001, :
replacement element 1 has 143460 rows, need 143462
Does anyone know any alternative or elegant way to get the expected result?
The R version is 3.3.2.
You should be able to accomplish this with the bind_rows function from dplyr
df1 <- read.csv(text = "col1, col2
1,2", header = TRUE)
df2 <- read.csv(text = "col1, col3, col4
1,2,3", header = TRUE)
df3 <- read.csv(text = "col1, col2, col3, col5
1,2,3,4", header = TRUE)
library(dplyr)
res <- bind_rows(df1, df2, df3)
> res
col1 col2 col3 col4 col5
1 1 2 NA NA NA
2 1 NA 2 3 NA
3 1 2 3 NA 4
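For comparison, the same NA-filling row bind can be sketched in base R without extra packages. The inline CSV strings stand in for the three files, and names such as csv_texts, frames and padded are made up for this example; with real files the list could be built with something like lapply(list.files(pattern = "\\.csv$"), read.csv).

```r
# Inline CSV text standing in for 1st.csv, 2nd.csv and 3rd.csv
csv_texts <- c("col1,col2\n1,2",
               "col1,col3,col4\n1,2,3",
               "col1,col2,col3,col5\n1,2,3,4")
frames <- lapply(csv_texts, function(txt) read.csv(text = txt))

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(frames, names)))

# Add each missing column as NA, then put the columns in a common order
padded <- lapply(frames, function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA
  d[all_cols]
})

combined <- do.call(rbind, padded)
combined
```

For many large files bind_rows is likely the more convenient choice; the sketch mainly shows what the fill step amounts to.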
