Merge columns that are separated by character in order - r

I have a table that has two columns with information separated by ":". The problem is that not all of them have the same number of elements.
Here is an example:
Col1 Col2
AA:BB:CC 1:2:3
AA:DD:BB:CC 4:5:6:7
And I would like a third column that is
Col3
AA=1:BB=2:CC=3
AA=4:DD=5:BB=6:CC=7
I have no idea where to start. I tried splitting them, but it got me nowhere.

We can use strsplit to split 'Col1' and 'Col2' on ":", then paste the corresponding list elements together with str_c to create 'col3'.
library(dplyr)
library(purrr)
library(stringr)
df1 %>%
  mutate(col3 = map2_chr(strsplit(Col1, ":"), strsplit(Col2, ":"),
                         ~ str_c(.x, .y, sep = "=", collapse = ":")))
# Col1 Col2 col3
#1 AA:BB:CC 1:2:3 AA=1:BB=2:CC=3
#2 AA:DD:BB:CC 4:5:6:7 AA=4:DD=5:BB=6:CC=7
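For comparison, the same pairing can be sketched in base R without any packages. This is only an illustration, assuming every row has the same number of elements in Col1 and Col2; the names keys and vals are introduced here and are not part of the answer above.

```r
# Example data, same values as in the question
df1 <- data.frame(Col1 = c("AA:BB:CC", "AA:DD:BB:CC"),
                  Col2 = c("1:2:3", "4:5:6:7"),
                  stringsAsFactors = FALSE)

# Split both columns on the literal ":" separator
keys <- strsplit(df1$Col1, ":", fixed = TRUE)
vals <- strsplit(df1$Col2, ":", fixed = TRUE)

# Assumption: each row has as many keys as values
stopifnot(lengths(keys) == lengths(vals))

# Pair key i with value i, then glue the pairs back together with ":"
df1$Col3 <- mapply(function(k, v) paste(k, v, sep = "=", collapse = ":"),
                   keys, vals, USE.NAMES = FALSE)
df1$Col3
```

The stopifnot() check is there because paste() would otherwise silently recycle the shorter vector when a row has mismatched lengths.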
data
df1 <- structure(list(Col1 = c("AA:BB:CC", "AA:DD:BB:CC"), Col2 = c("1:2:3",
"4:5:6:7")), class = "data.frame", row.names = c(NA, -2L))

Related

Merge dataframes rows by shared pattern

I want merge 2 dataframes based on a shared pattern.
The pattern is the ID name, e.g. HAND2 in ID=HAND2;ACS=20, captured by the regex "ID=(.+);ACS".
If the ID matches in both dataframes, combine the respective rows.
DF1
col1   col2
HAND2  H2
HAND6  H6
DF2
col1  col2
OFS   ID=GATA5;ACS=45
FAM   ID=HAND2;ACS=20
MERGED (DF2 + DF1)
col1  col2             col3   col4
OFS   ID=GATA5;ACS=45
FAM   ID=HAND2;ACS=20  HAND2  H2
In this example the HAND2 ID is matched, so the matching rows of DF1 and DF2 are merged.
Script tried
MERGED <- merge(data.frame(DF1, row.names=NULL), data.frame(DF2, row.names=NULL), by = ("ID=(.+);ACS"), all = TRUE)[-1]
error
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
I am struggling to find a similar command that matches dataframe rows by a shared pattern instead of by column names.
Thank you in advance for your help.
You may try fuzzyjoin. In the match_fun argument you can define a function for your specific needs.
In your case, gsub extracts the ID from the col2 variable of DF2, and str_detect then checks whether the col1 column of DF1 contains that extracted ID.
Data
DF1 <- read.table(text = "col1 col2
HAND2 H2
HAND6 H6", header = T)
DF2 <- read.table(text = "col1 col2
OFS ID=GATA5;ACS=45
FAM ID=HAND2;ACS=20", header = T)
Code
library(fuzzyjoin)
library(stringr)
DF2 %>%
  fuzzy_left_join(DF1,
                  by = c("col2" = "col1"),
                  match_fun = function(x, y) str_detect(y, gsub("ID=(.+);(.*)", "\\1", x)))
Output
col1.x col2.x col1.y col2.y
1 OFS ID=GATA5;ACS=45 <NA> <NA>
2 FAM ID=HAND2;ACS=20 HAND2 H2
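If the IDs should match exactly rather than as substrings, a base R sketch is to extract the ID into its own column and then call merge() on it. The helper column id and the name df1_col2 are assumptions made for this illustration. Unlike the fuzzyjoin call above, this compares with equality and sorts the result by id.

```r
DF1 <- data.frame(col1 = c("HAND2", "HAND6"),
                  col2 = c("H2", "H6"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(col1 = c("OFS", "FAM"),
                  col2 = c("ID=GATA5;ACS=45", "ID=HAND2;ACS=20"),
                  stringsAsFactors = FALSE)

# Pull the text between "ID=" and the first ";" into a helper column
DF2$id <- sub("^ID=([^;]+);.*$", "\\1", DF2$col2)

# Rename the DF1 columns so the key is called "id" on both sides
lookup <- setNames(DF1, c("id", "df1_col2"))

# Left join: keep every DF2 row, fill NA where DF1 has no matching ID
MERGED <- merge(DF2, lookup, by = "id", all.x = TRUE)
MERGED
```

Extracting the key first keeps the join itself an ordinary equality join, which merge() already supports.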

Create two column with multiple separators

I have a dataframe such as
COl1
scaffold_97606_2-BACs_-__SP1_1
UELV01165908.1_2-BACs_+__SP2_2
UXGC01046554.1_9-702_+__SP3_3
scaffold_12002_1087-1579_-__SP4_4
and I would like to split it into two columns and get:
COL1 COL2
scaffold_97606 2-BACs_-__SP1_1
UELV01165908.1 2-BACs_+__SP2_2
UXGC01046554.1 9-702_+__SP3_3
scaffold_12002 1087-1579_-__SP4_4
As you can see, the text around the separator changes: it can be .Number_ or Number_Number.
So far I wrote:
df2 <- df1 %>%
separate(COL1, paste0('col', 1:2), sep = " the separator patterns ", extra = "merge")
but I do not know which separator pattern to use in place of " the separator patterns ".
You may use
> df1 %>%
separate(COl1, paste0('col', 1:2), sep = "(?<=\\d)_(?=\\d+-)", extra = "merge")
col1 col2
1 scaffold_97606 2-BACs_-__SP1_1
2 UELV01165908.1 2-BACs_+__SP2_2
3 UXGC01046554.1 9-702_+__SP3_3
4 scaffold_12002 1087-1579_-__SP4_4
Pattern details
(?<=\d) - a positive lookbehind that requires a digit immediately to the left of the current location
_ - an underscore
(?=\d+-) - a positive lookahead that requires one or more digits and then a - immediately to the right of the current location.
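To see where this pattern matches, here is a small base R sketch that locates the underscore with regexpr() and cuts each string around it. It assumes the pattern matches exactly once per string, as it does for the sample data; the names pos, left and right are introduced only for this illustration.

```r
x <- c("scaffold_97606_2-BACs_-__SP1_1",
       "UELV01165908.1_2-BACs_+__SP2_2",
       "UXGC01046554.1_9-702_+__SP3_3",
       "scaffold_12002_1087-1579_-__SP4_4")

# Lookarounds need the PCRE engine, hence perl = TRUE.
# regexpr() returns the position of the first matching underscore.
pos <- as.integer(regexpr("(?<=\\d)_(?=\\d+-)", x, perl = TRUE))

# Everything before the underscore, and everything after it
left  <- substr(x, 1, pos - 1)
right <- substr(x, pos + 1, nchar(x))

data.frame(col1 = left, col2 = right)
```

Because the lookbehind and lookahead consume no characters, only the single underscore is removed from each string.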
You can use extract:
tidyr::extract(df, COl1, c('Col1', 'Col2'), regex = '(.*?\\d+)_(.*)')
# Col1 Col2
#1 scaffold_97606 2-BACs_-__SP1_1
#2 UELV01165908.1 2-BACs_+__SP2_2
#3 UXGC01046554.1 9-702_+__SP3_3
#4 scaffold_12002 1087-1579_-__SP4_4
data
df <- structure(list(COl1 = c("scaffold_97606_2-BACs_-__SP1_1",
"UELV01165908.1_2-BACs_+__SP2_2",
"UXGC01046554.1_9-702_+__SP3_3", "scaffold_12002_1087-1579_-__SP4_4"
)), class = "data.frame", row.names = c(NA, -4L))
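The two capture groups in the extract regex can also be inspected in base R. The sketch below uses regexec() with perl = TRUE so that the lazy quantifier .*? behaves as it does in Perl-style engines; it is only meant to show what each group captures, and the names m and parts are assumptions of this example.

```r
x <- c("scaffold_97606_2-BACs_-__SP1_1",
       "UELV01165908.1_2-BACs_+__SP2_2",
       "UXGC01046554.1_9-702_+__SP3_3",
       "scaffold_12002_1087-1579_-__SP4_4")

# Group 1: shortest prefix that ends in digits followed by "_"
# Group 2: everything after that underscore
m <- regmatches(x, regexec("(.*?\\d+)_(.*)", x, perl = TRUE))

# Each list element is c(full match, group 1, group 2); stack them into a matrix
parts <- do.call(rbind, m)

data.frame(Col1 = parts[, 2], Col2 = parts[, 3])
```

The lazy .*? is what stops the first group at the earliest run of digits that is followed by an underscore, instead of running on to the last one.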

Get column names for all dataframes in R

I have a large number of dataframes in R. I want a readable output of all column names for each dataframe.
Say there are three dataframes A, B, C with different numbers of columns and different column names: c("Col1", "Col2", "Col3"), c("Col4", "Col5") and c("Col6", "Col7", "Col8", "Col9", "Col10") respectively.
Now I want to have an output like this
Note: My intention is to later write it to a .csv file and split the column names as required (tab- or comma-separated).
Here's a stab.
df1 <- data.frame(a=1,b=2,c=3)
df2 <- data.frame(A=1,E=2)
df3 <- data.frame(quux=7,cronk=9)
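# Note: ls.objects() is not part of base R; it appears to be a user-defined helper.
# dfnms <- rownames(subset(ls.objects(), Type %in% c("data.frame", "tbl_df", "data.table")))
# A base R equivalent that keeps the names of all objects that are data frames:
dfnms <- Filter(function(nm) is.data.frame(get(nm)), ls())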
dfnms
# [1] "df1" "df2" "df3"
data.frame(name = dfnms, columns = sapply(mget(dfnms), function(x) paste(colnames(x), collapse = ",")))
# name columns
# df1 df1 a,b,c
# df2 df2 A,E
# df3 df3 quux,cronk
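If you really need them double-quoted, then add dQuote (passing FALSE as its second argument so that plain ASCII quotes are used rather than typographic ones), as in
data.frame(name = dfnms, columns = sapply(mget(dfnms), function(x) paste(dQuote(colnames(x), FALSE), collapse = ",")))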
# name columns
# df1 df1 "a","b","c"
# df2 df2 "A","E"
# df3 df3 "quux","cronk"
I am putting my work here, although it is similar to r2evans's solution.
Data
A <- data.frame(col1=1:2, col2=1:2, col3=1:2)
B <- data.frame(col4=1:2, col5=1:2)
C <- data.frame(col6=1:2, col7=1:2, col8=1:2, col9=1:2, col10=1:2)
Code
DataFrameName = c('A', 'B', 'C')
data.frame(DataFrameName = DataFrameName,
Columns = sapply(DataFrameName, function(x) paste(names(get(x)), collapse = ",")),
stringsAsFactors = FALSE)
Output
# DataFrameName Columns
# A A col1,col2,col3
# B B col4,col5
# C C col6,col7,col8,col9,col10
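Since the question mentions writing the result to a .csv file later, here is a minimal sketch of that step. It assumes the dataframes are collected in a named list (dfs) and writes to a temporary file; both the list and the file location are choices made for this example, not something the question specifies.

```r
# Assumption: the dataframes of interest are gathered in a named list
dfs <- list(
  A = data.frame(col1 = 1:2, col2 = 1:2, col3 = 1:2),
  B = data.frame(col4 = 1:2, col5 = 1:2),
  C = data.frame(col6 = 1:2, col7 = 1:2, col8 = 1:2, col9 = 1:2, col10 = 1:2)
)

# One row per dataframe, column names collapsed into a single string
summary_df <- data.frame(
  DataFrameName = names(dfs),
  Columns = unname(vapply(dfs, function(d) paste(names(d), collapse = ","),
                          character(1))),
  stringsAsFactors = FALSE
)

# write.csv() quotes character fields, so the embedded commas stay in one cell
out_file <- tempfile(fileext = ".csv")
write.csv(summary_df, out_file, row.names = FALSE)

# Reading the file back returns the same two columns
read_back <- read.csv(out_file, stringsAsFactors = FALSE)
read_back
```

If the column names should end up in separate cells instead, a different collapse string such as ";" or a tab can be used and split on later.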
With tidyverse, we can get the datasets in a list with lst, then loop over the list, get the column names, convert it to string, get the list of named strings into a two column tibble with enframe and unnest the 'Columns'
library(dplyr)
library(tidyr)
library(purrr)
library(tibble) # provides enframe()
lst(A, B, C) %>%
map(~ .x %>% names %>% toString) %>%
enframe(name = "DataFrameName", value = "Columns") %>%
unnest(c(Columns))
# A tibble: 3 x 2
# DataFrameName Columns
# <chr> <chr>
#1 A Col1, Col2, Col3
#2 B Col1, Col5
#3 C Col6, Col7, Col8, Col9, Col10
data
A <- data.frame(Col1 = 1:5, Col2 = 6:10, Col3 = 11:15)
B <- data.frame(Col1 = 1:5, Col5 = 6:10)
C <- data.frame(Col6 =1:5, Col7 = 6:10, Col8 = 6:10, Col9 = 7:11, Col10 = 11:15)

Remove unnecessary symbols in the data in R

This is my dataset:
1.abc
2.def
3.2354
4.. $.?,
How can I delete the observations that contain only digits, only symbols (periods, commas, and so on), or any mix of symbols and digits (e.g. 1#5??%), as well as words with fewer than two letters?
We can use str_count to count the number of letters in each value and keep only the rows with more than two
library(stringr)
library(dplyr)
df1 %>%
filter(str_count(v1, "[[:alpha:]]") > 2)
Or with gsub to remove any character that is not a letter and count the number of characters with nchar to create a logical index for subsetting
subset(df1, nchar(gsub("[^[:alpha:]]+", "", v1))>2)
# v1
#1 1.abc
#2 2.def
data
df1 <- structure(list(v1 = c("1.abc", "2.def", "3.2354", "4.. $.?,")),
.Names = "v1", class = "data.frame", row.names = c(NA, -4L))
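To make the filtering rule easier to check, this small base R sketch prints the intermediate letter count for each value before subsetting. The threshold of more than two letters is taken from the answer above; the names n_letters and keep are introduced only for this illustration.

```r
v1 <- c("1.abc", "2.def", "3.2354", "4.. $.?,")

# Strip everything that is not a letter, then count what is left
n_letters <- nchar(gsub("[^[:alpha:]]+", "", v1))
n_letters

# Keep only the values with more than two letters
keep <- n_letters > 2
v1[keep]
```

Adjusting the threshold (for example to n_letters >= 2) changes how short a word may be before it is dropped.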

r compare column types between two dataframes

This may be a bad question because I am not posting any reproducible example. My main goal is to identify columns that are of different types between two dataframes that have the same column names.
For example
df1
Id Col1 Col2 Col3
Numeric Factor Integer Date
df2
Id Col1 Col2 Col3
Numeric Numeric Integer Date
Here both the dataframes (df1, df2) have same column names but the Col1 type is different and I am interested in identifying such columns. Expected output.
Col1 Factor Numeric
Any suggestions or tips on achieving this? Thanks
Try compare_df_cols() from the janitor package:
library(janitor)
mtcars2 <- mtcars
mtcars2$cyl <- as.character(mtcars2$cyl)
compare_df_cols(mtcars, mtcars2, return = "mismatch")
#> column_name mtcars mtcars2
#> 1 cyl numeric character
Self-promotion alert, I authored this package - am posting this function because it exists to solve precisely this problem.
Try this:
compareColumns <- function(df1, df2) {
  # Columns present in both data frames
  commonNames <- names(df1)[names(df1) %in% names(df2)]
  data.frame(Column = commonNames,
             # df[names] stays a data frame even when only one column is shared
             df1 = sapply(df1[commonNames], class),
             df2 = sapply(df2[commonNames], class))
}
For a more compact method, you could use a list with sapply(). Efficiency shouldn't be a problem here since all we're doing is grabbing the class. Here I add data frame names to the list to create a clearer output.
m <- sapply(list(df1 = df1, df2 = df2), sapply, class)
m[m[, "df1"] != m[, "df2"], , drop = FALSE]
# df1 df2
# Col1 "factor" "character"
where df1 and df2 are the data from @ycw's answer.
If two data frames have the same column names, the code below will give you the columns with different classes.
library(dplyr)
m1 = mtcars
m2 = mtcars %>% mutate(cyl = factor(cyl), vs = factor(cyl))
out = cbind(sapply(m1, class), sapply(m2, class))
out[apply(out, 1, function(x) !identical(x[1], x[2])), ]
We can use sapply with class to loop through all columns in df1 and df2. After that, we can compare the results.
# Create example data frames
df1 <- data.frame(ID = 1:3,
                  Col1 = as.character(2:4),
                  Col2 = 2:4,
                  Col3 = as.Date(paste0("2017-01-0", 2:4)),
                  stringsAsFactors = TRUE) # explicit, since the default is FALSE from R 4.0.0 on
df2 <- data.frame(ID = 1:3,
Col1 = as.character(2:4),
Col2 = 2:4,
Col3 = as.Date(paste0("2017-01-0", 2:4)),
stringsAsFactors = FALSE)
# Use sapply and class to find out all the class
class1 <- sapply(df1, class)
class2 <- sapply(df2, class)
# Combine the results, then filter for rows that are different
result <- data.frame(class1, class2, stringsAsFactors = FALSE)
result[!(result$class1 == result$class2), ]
#      class1    class2
# Col1 factor character
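One caveat with sapply(df, class) is that some columns have more than one class (a POSIXct column has two), in which case the result is a list rather than a character vector. The sketch below works around that by collapsing each column's classes into one string; class_string, common, types and mismatch are names made up for this example, and the data only mimics the example above.

```r
df1 <- data.frame(ID = 1:3,
                  Col1 = factor(c("2", "3", "4")),
                  When = as.POSIXct("2017-01-02", tz = "UTC") + 0:2)
df2 <- data.frame(ID = 1:3,
                  Col1 = c("2", "3", "4"),
                  When = as.POSIXct("2017-01-02", tz = "UTC") + 0:2,
                  stringsAsFactors = FALSE)

# Collapse multi-class columns (e.g. "POSIXct" "POSIXt") into a single label
class_string <- function(x) paste(class(x), collapse = "/")

# Only compare columns that exist in both data frames
common <- intersect(names(df1), names(df2))

types <- data.frame(
  column = common,
  df1 = unname(vapply(df1[common], class_string, character(1))),
  df2 = unname(vapply(df2[common], class_string, character(1))),
  stringsAsFactors = FALSE
)

# Rows where the two data frames disagree on the column type
mismatch <- types[types$df1 != types$df2, ]
mismatch
```

vapply() with character(1) also fails loudly if a column ever yields something other than a single string, which sapply() would not.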

Resources