Merging two data frames with columns with certain patterns in strings - r

(I have been stuck with this problem for the past two days, so if it already has an answer on SO please bear with me.)
I have two data frames A and B. I want to merge them on Name column. Suppose, A has two columns Name and Numbers. The Name column of A df has values ".tony.x.rds", ".tom.x.rds" and so on.
Name Numbers
.tony.x.rds 15.6
.tom.x.rds 14.5
The B df has two columns Name and ChaR. The Name column of B has values "tony.x","tom.x" and so on.
Name ChaR
tony.x ENG
tom.x US
The main element in column Name of both dfs is "tony", "tom" and so on.
So, ".tony.x.rds" is equal to "tony.x" and ".tom.x.rds" is equal to "tom.x".
I have tried gsub with various options, leaving me with "tony", "tom", and so on in column Name of both A and B data frames. But when I use
StoRe<-merge(A,B, all=T)
I get all the rows of A and B rather than a single row per name. That is, there are two rows for each "tony", "tom" and so on, with their respective values in the Numbers and ChaR columns. For example:
Name Numbers ChaR
tony 15.6 NA
tony NULL ENG
tom 14.5 NA
tom NULL US
It has been giving me splitting headache. I request you to help.

One possible solution. I am not completely sure what you want to do with the 'x' in the strings; I have kept it in the linkage key, but by changing \\1\\2 to \\1 you keep only the first part (the name, e.g. 'tony').
a <- data.frame(
Name = paste0(".", c("tony", "tom", "foo", "bar", "foobar"), ".x.rds"),
Numbers = rnorm(5)
)
b <- data.frame(
Name = paste0(c("tony", "tom", "bar", "foobar", "company"), ".x"),
ChaR = LETTERS[11:15]
)
# String consists of 'point letter1 point letter2 point rds'; replace by
# 'letter1 letter2'
a$Name_stand <- gsub("^\\.([a-z]+)\\.([a-z]+)\\.rds$", "\\1\\2", a$Name)
# String consists of 'letter1 point letter2'; replace by 'letter1 letter2'
b$Name_stand <- gsub("^([a-z]+)\\.([a-z]+)$", "\\1\\2", b$Name)
result <- merge(a, b, all = TRUE, by = "Name_stand")
Output:
#> result
# Name_stand Name.x Numbers Name.y ChaR
#1 barx .bar.x.rds 1.38072696 bar.x M
#2 companyx <NA> NA company.x O
#3 foobarx .foobar.x.rds -1.53076596 foobar.x N
#4 foox .foo.x.rds 1.40829287 <NA> <NA>
#5 tomx .tom.x.rds -0.01204651 tom.x L
#6 tonyx .tony.x.rds 0.34159406 tony.x K
Another approach, perhaps somewhat more robust to variations of the strings (for example, 'tom.rds' and 'tom' will still be linked; this can of course also be a disadvantage):
# Remove the rds from a$Name
a$Name_stand <- gsub("rds$" , "", a$Name)
# Remove all non alpha numeric characters from the strings
a$Name_stand <- gsub("[^[:alnum:]]", "", a$Name_stand)
b$Name_stand <- gsub("[^[:alnum:]]", "", b$Name)
result2 <- merge(a, b, all = TRUE, by = "Name_stand")
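For the exact data in the question, a minimal sketch could look like the following. It assumes the only differences between the two key columns are the leading "." and the trailing ".rds" in A, plus possibly stray whitespace (the trimws step is my guess at why the keys did not match after gsub, not something the question confirms).

```r
# Data as described in the question
A <- data.frame(Name = c(".tony.x.rds", ".tom.x.rds"),
                Numbers = c(15.6, 14.5),
                stringsAsFactors = FALSE)
B <- data.frame(Name = c("tony.x", "tom.x"),
                ChaR = c("ENG", "US"),
                stringsAsFactors = FALSE)

# Strip the leading "." and the trailing ".rds" so A's keys look like B's
A$Name <- sub("\\.rds$", "", sub("^\\.", "", A$Name))

# Assumption: keys may carry stray whitespace, which would stop them matching
A$Name <- trimws(A$Name)
B$Name <- trimws(B$Name)

# Name the key explicitly instead of relying on all common columns
StoRe <- merge(A, B, by = "Name", all = TRUE)
StoRe
```

If the result still contains one row from A and one from B for the same name, comparing `setdiff(A$Name, B$Name)` and `setdiff(B$Name, A$Name)` shows which keys differ.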

Related

Calculate all possible product combinations between variables

I have a df containing 3 variables, and I want to create an extra variable for each possible product combination.
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
I want to create a new df (output) containing the result of a*b, a*c, b*c.
output <- data.frame(d = test$a * test$b
, e = test$a * test$c
, f = test$b * test$c)
This is easily doable (manually) with a small number of columns, but even above 5 columns, this activity can get very lengthy - and error-prone, when column names contain prefix, suffix or codes inside.
It would be extra if I could also control the maximum number of columns to consider at the same time (in the example above, I only considered 2 columns, but it would be great to select that parameter too, so to add an extra variable a*b*c - if needed)
My initial idea was to use expand.grid() with column names and then somehow do a lookup to select the whole columns values for the product - but I hope there's an easier way to do it that I am not aware of.
You can use combn to create combination of column names taken 2 at a time and multiply them to create new columns.
cbind(test, do.call(cbind, combn(names(test), 2, function(x) {
setNames(data.frame(do.call(`*`, test[x])), paste0(x, collapse = '-'))
}, simplify = FALSE)))
# a b c a-b a-c b-c
#1 0.4098568 -0.3514020 2.5508854 -0.1440245 1.045498 -0.8963863
#2 1.4066395 0.6693990 0.1858557 0.9416031 0.261432 0.1244116
#3 0.7150305 -1.1247699 2.8347166 -0.8042448 2.026909 -3.1884040
#4 0.8932950 1.6330398 0.3731903 1.4587864 0.333369 0.6094346
#5 -1.4895243 1.4124826 1.0092224 -2.1039271 -1.503261 1.4255091
#6 0.8239685 0.1347528 1.4274288 0.1110321 1.176156 0.1923501
#7 0.7803712 0.8685688 -0.5676055 0.6778060 -0.442943 -0.4930044
#8 -1.5760181 2.0014636 1.1844449 -3.1543428 -1.866707 2.3706233
#9 1.4414434 1.1134435 -1.4500410 1.6049658 -2.090152 -1.6145388
#10 0.3526583 -0.1238261 0.8949428 -0.0436683 0.315609 -0.1108172
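The question also asks for control over how many columns go into each product. A possible sketch of that, using the same combn idea but with Reduce so that more than two columns can be multiplied; col_products and its arguments are hypothetical names introduced here for illustration:

```r
set.seed(1)
test <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# Hypothetical helper: products of every combination of `m` columns of `df`
col_products <- function(df, m = 2, sep = "_") {
  # All combinations of column names, m at a time, as a list of character vectors
  combos <- combn(names(df), m, simplify = FALSE)
  # Multiply the selected columns element-wise; Reduce handles any m >= 1
  out <- lapply(combos, function(cols) Reduce(`*`, df[cols]))
  names(out) <- vapply(combos, paste, character(1), collapse = sep)
  as.data.frame(out)
}

pairs  <- col_products(test, m = 2)   # columns a_b, a_c, b_c
triple <- col_products(test, m = 3)   # column a_b_c
output <- cbind(pairs, triple)
```

An underscore is used as the separator so that the new names stay syntactically valid; with "-" the columns would need backticks when referred to by name.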
This could also be a solution, although Ronak's solution is more elegant!
library(dplyr)
# your data
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
# new dataframe output
output <- test %>%
# use element-wise *; prod(a, b) would collapse both columns into a single number
mutate(a_b = a * b,
a_c = a * c,
b_c = b * c
) %>%
select(-a,-b,-c)

Replacement of some equal elements of a string using different values in a data frame

I want to replace each 'COL' word in the column 'b' of the 'test' data frame, by each element in the column 'a', and put the result in other column, but preserving both order and structure of the character string of the column 'b'.
test <- data.frame(a = c("COL167", "COL2010;COL2012"),
b = c("COL;MO, K", "P;COL, NY, S, COL"))
I have tried the following, but it is not the result that I need:
for(i in 1:length(test$a)){
test$c[i] <- gsub(pattern = "COL", x = test$b[i], replacement = test$a[i])
}
> test
a b c
1 COL167 COL;MO, K COL167;MO, K
2 COL2010;COL2012 P;COL, NY, S, COL P;COL2010;COL2012, NY, S, COL2010;COL2012
I expect the following result:
a b c
1 COL167 COL;MO, K COL167;MO, K
2 COL2010;COL2012 P;COL, NY, S, COL P;COL2010, NY, S, COL2012
Building on what you have already done, I think this would work, but note that you might see some performance issues if your table is large. Also note that this assumes the number of values to be replaced in column b equals the number of replacement values in column a.
As gsub doesn't allow for vectorized replacement (it replaces all the matched instances with the first value of replacement), here I have converted both strings and replacements into vectors, so I can replace each matched substring individually.
test <- data.frame(a = c("COL167", "COL2010;COL2012"),
b = c("COL;MO, K", "P;COL, NY, S, COL"))
re = function(string, replacement){
gsub('COL', replacement, string)
}
for(i in 1:nrow(test)){
#splitting values of column a into vector, this is required for replacement
replacement = unlist(strsplit(test$a[i], ';'))
#split values of column b into vector, this is required for replacement
b_value = unlist(strsplit(test$b[i], ' '))
#select those which have 'COL' substring
ind_to_replace = which(grepl('COL', b_value))
#replace matched values
result = mapply(re, b_value[ind_to_replace], replacement)
#replace the column b value with new string
b_value[ind_to_replace] = result
#join the string
test$results[i] = paste(b_value, collapse = ' ')
}
test
#> a b results
#> 1 COL167 COL;MO, K COL167;MO, K
#> 2 COL2010;COL2012 P;COL, NY, S, COL P;COL2010, NY, S, COL2012
Created on 2020-09-05 by the reprex package (v0.3.0)
I'll propose a solution using the rowwise function of dplyr.
While it's true that gsub isn't vectorized over its replacement argument, the mgsub function in the package of the same name is. My approach is, for each row, to:
1. turn all of the instances of COL in column b into a vector
2. make a vector from all the COL+ entries from column a
3. use vector 2 to replace the old values of COL from b. mutate creates a new column with the result.
library(mgsub)
library(stringr)
library(dplyr)
test %>%
rowwise() %>%
mutate(new_col =
unlist((mgsub(b,
unlist(str_extract_all(b,"COL")),
unlist(str_extract_all(a,"COL.*?\\b")))
)))
# A tibble: 2 x 3
# Rowwise:
a b new_col
<chr> <chr> <chr>
1 COL167 COL;MO, K COL167;MO, K
2 COL2010;COL2012 P;COL, NY, S, COL P;COL2010, NY, S, COL2010
mgsub takes 3 arguments. The string you're working on, the expression you want to replace within that string, and the expression you want to use as the replacement. This package allows you to have multiple patterns to replace and be replaced - both can appear as vectors.
I applied this function to each row. First, I designated the b column as the string of interest. Second, all the COLs in column b are what we want to replace, so I made them into a vector using stringr::str_extract_all. I extracted all instances of COL, and we then have to unlist this output because str_extract_all returns a list. Third, I used the same process to extract the COL+ entries from column a. In summary, we use the entries in column a to replace the characters of interest within column b.
"COL.*?\\b"
selects the letters COL followed by as few characters as possible before reaching a word boundary which allows us to turn the entries in column a into multiple items (COL2010, COL2012 etc).
We have to unlist the mutated row (i.e. the first "unlist") because dplyr outputs a list-column.
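Note that in the output shown above, row 2 ends in COL2010 rather than the expected COL2012. As a possible alternative, here is a sketch in base R that pairs the i-th COL in b with the i-th entry of a, using gregexpr and the replacement form of regmatches. It assumes the columns are character (not factor) and that each row has as many COL occurrences in b as entries in a.

```r
test <- data.frame(a = c("COL167", "COL2010;COL2012"),
                   b = c("COL;MO, K", "P;COL, NY, S, COL"),
                   stringsAsFactors = FALSE)

# Positions of every literal "COL" in each element of b
m <- gregexpr("COL", test$b, fixed = TRUE)

# One vector of replacement values per row, taken from column a
repl <- strsplit(test$a, ";", fixed = TRUE)

# Assumption stated above: the counts must agree row by row
stopifnot(lengths(regmatches(test$b, m)) == lengths(repl))

# Replace the matches in order: first COL gets the first value, and so on
out <- test$b
regmatches(out, m) <- repl
test$c <- out
test
```

Because the match positions are computed once on the original strings, a replacement such as COL2010 is never matched again by the pattern "COL".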

dplyr::mutate() return "Error in Ops.factor(gene_id$symbol, x) : level sets of factors are different"

My data
gene_list <- data.frame(mouse_gene = c("Ccnb1", "Cdk1", "Cdh3", "Cdkn1c"),
human_gene = c("SLCO2B1", "PPP1R3C", "MMP13", "CLEC6A"))
gene_id <- data.frame(gene_id = c("23334", "100001", "12341236", "34553433", "22998", "123121213"),
symbol = c("SLCO2B1", "PPP1R3C", "FX-232", "MMP13", "CLEC6A", "CSCCD"))
I want to add a column in gene_list that can find the corresponding gene_id of human_gene, so I defined a function
find_geneID <- function(x){
ID <- gene_id[which(gene_id$symbol == x),1]
return(ID)
}
Then I use dplyr::mutate
gene_list <- gene_list %>% mutate(find_geneID(human_gene))
However, I get a return
Error in Ops.factor(gene_id$symbol, x) : level sets of factors are different
I know that I can use join in this case. However, I would like to know what should I do if I need to use a function in dplyr::mutate.
Besides, sometimes when I want to use a value in one column, input into a function, and put into a new column, I will get
Column `new_column` must be length 568 (the number of rows) or one, not 2
Can someone tell me the reason? Thanks
Instead of ==, use match to get the index. == does an element-wise comparison, i.e. it compares row 1 of the first vector to row 1 of the second, row 2 to row 2, row 3 to row 3, which creates a length issue if the two datasets have different numbers of rows. The values can also be anywhere in the other column, so they are likely to be missed by ==, whereas match looks up each value wherever it occurs.
find_geneID <- function(x) {gene_id$gene_id[match(gene_list[[x]], gene_id$symbol)]}
gene_list %>%
mutate(gene_id = find_geneID('human_gene'))
# mouse_gene human_gene gene_id
#1 Ccnb1 SLCO2B1 23334
#2 Cdk1 PPP1R3C 100001
#3 Cdh3 MMP13 34553433
#4 Cdkn1c CLEC6A 22998
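Also, make sure that the columns are character class instead of factor by using stringsAsFactors = FALSE while constructing the datasets (this is the default since R 4.0.0). The "level sets of factors are different" error comes from comparing two factors whose levels differ with ==.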
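On the more general question of using your own function inside mutate: the function should accept the whole column and return either one value per input element or a single value. A small sketch of that, written with base R so it runs without dplyr (the mutate call is shown as a comment); the symbol "NOT_A_GENE" is made up for illustration:

```r
gene_list <- data.frame(mouse_gene = c("Ccnb1", "Cdk1", "Cdh3", "Cdkn1c"),
                        human_gene = c("SLCO2B1", "PPP1R3C", "MMP13", "CLEC6A"),
                        stringsAsFactors = FALSE)
gene_id <- data.frame(gene_id = c("23334", "100001", "12341236",
                                  "34553433", "22998", "123121213"),
                      symbol = c("SLCO2B1", "PPP1R3C", "FX-232",
                                 "MMP13", "CLEC6A", "CSCCD"),
                      stringsAsFactors = FALSE)

# Vectorised lookup: one ID per input symbol, NA where the symbol is unknown
find_geneID <- function(symbols) {
  gene_id$gene_id[match(symbols, gene_id$symbol)]
}

gene_list$gene_id <- find_geneID(gene_list$human_gene)
# dplyr equivalent: gene_list %>% mutate(gene_id = find_geneID(human_gene))

# By contrast, which() silently drops non-matches, so the result can be
# shorter than the input; that is the kind of length mismatch mutate rejects
ids_which <- gene_id$gene_id[which(gene_id$symbol %in% c("SLCO2B1", "NOT_A_GENE"))]
length(ids_which)
```

The "must be length 568 (the number of rows) or one, not 2" message is the same situation: the function returned two values where mutate needed one per row or a single value.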
It could be done easily with a join
left_join(gene_list, gene_id, by = c('human_gene' = 'symbol'))
# mouse_gene human_gene gene_id
#1 Ccnb1 SLCO2B1 23334
#2 Cdk1 PPP1R3C 100001
#3 Cdh3 MMP13 34553433
#4 Cdkn1c CLEC6A 22998
data
gene_list <- data.frame(mouse_gene = c("Ccnb1", "Cdk1", "Cdh3", "Cdkn1c"),
human_gene = c("SLCO2B1", "PPP1R3C", "MMP13", "CLEC6A"),
stringsAsFactors = FALSE)
gene_id <- data.frame(gene_id = c("23334", "100001", "12341236",
"34553433", "22998", "123121213"),
symbol = c("SLCO2B1", "PPP1R3C", "FX-232",
"MMP13", "CLEC6A", "CSCCD"), stringsAsFactors = FALSE)

Saving the output of a str_which loop in R

I work with a sheet of data that lists a variety of scientific publications. Rows are publications,
columns are a variety of metrics describing each publication (author name and position, Pubmed IDs, Date etc...)
I want to filter for publications for each author and extract parts of them. The caveat is the format:
all author names (5-80 per cell) are lumped together in one cell for each row.
I managed to solve this with the use of str_which, saving the coordinates for each author and later extract. This works only for manual use. When I try to automate this process using a loop to draw on a list of authors I fail to save the output.
I am at a bit of a loss on how to store the results without overwriting previous ones.
sampleDat <-
data.frame(var1 = c("Doe J, Maxwell M, Kim HE", "Cronauer R, Carst W, Theobald U", "Theobald U, Hey B, Joff S"),
var2 = c(1:3),
var3 = c("2016-01", "2016-03", "2017-05"))
list of names that I want the coordinates for
namesOfInterest <-
list(c("Doe J", "Theobald U"))
the manual extraction, requiring me to type the exact name and output object
Doe <- str_which(sampleDat$var1, "Doe J")
Theobald <- str_which(sampleDat$var1, "Theobald U")
one of many attempts that does not replicate the manual version.
results <- c()
for (i in namesOfInterest) {
results[i] <- str_which(sampleDat$var1, i)
}
The for loop is not iterating the way you expect: namesOfInterest is a list holding a single character vector, so the loop runs once with i equal to both names at the same time. It needs to loop over positions, something like for(i in 1:n){do something}. Also, even if you fix that, you'll run into a problem related to the fact that str_which returns a vector of varying length, indicating the position of each of the matches it makes (and it can make multiple matches). Thus, indexing a plain vector in a loop won't work here: whenever an author has multiple matches, more than one value has to be saved to a single element, and R warns and keeps only the first one.
Solve this by working with lists, because lists can hold vectors of arbitrary length. Index the list with double bracket notation: [[.
library(stringr)
sampleDat <-
data.frame(var1 = c("Doe J, Maxwell M, Kim HE", "Cronauer R, Carst W, Theobald U", "Theobald U, Hey B, Joff S"),
var2 = c(1:3),
var3 = c("2016-01", "2016-03", "2017-05"))
# no need for list here. a simple vector will do
namesOfInterest <- c("Doe J", "Theobald U")
# initialize list
results <- vector("list", length = length(namesOfInterest))
# loop over list, saving output of `str_which` in each list element.
# seq_along(x) is similar to 1:length(x)
for (i in seq_along(namesOfInterest)) {
results[[i]] <- str_which(sampleDat$var1, namesOfInterest[i])
}
which returns:
> results
[[1]]
[1] 1
[[2]]
[1] 2 3
The way to understand the output above is that the ith element of the list, results[[i]] contains the output of str_which(sampleDat$var1, namesOfInterest[i]), where namesOfInterest[i] is always exactly one author. However, the length of results[[i]] can be longer than one:
> sapply(results, length)
[1] 1 2
indicating that a single author can be mentioned multiple times. In the example above, sapply counts the length of each vector along the list results, showing that namesOfInterest[1] has one paper, and namesOfInterest[2] has two.
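The same loop can also be written with lapply, which builds the list for you, and naming the elements makes the later extraction easier. This is only a sketch: to keep it free of package dependencies it uses base R's grep with fixed = TRUE as a stand-in for str_which, which for literal names like these should return the same row positions.

```r
sampleDat <- data.frame(
  var1 = c("Doe J, Maxwell M, Kim HE",
           "Cronauer R, Carst W, Theobald U",
           "Theobald U, Hey B, Joff S"),
  var2 = 1:3,
  var3 = c("2016-01", "2016-03", "2017-05"),
  stringsAsFactors = FALSE
)
namesOfInterest <- c("Doe J", "Theobald U")

# One list element per author, holding the row numbers where the name occurs
results <- lapply(namesOfInterest, function(nm) {
  grep(nm, sampleDat$var1, fixed = TRUE)
})
names(results) <- namesOfInterest

# Look up an author by name and pull out the matching publications
results[["Theobald U"]]
theobald_pubs <- sampleDat[results[["Theobald U"]], ]
```

With a named list there is no need to create one object per author (Doe, Theobald, ...) by hand.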
Here is another approach for you. If you want to know which scholar is in which publication, you can do the following as well. First, assign unique IDs to publications. Then, split authors and create a long-format data frame. Define groups by authors and aggregate publication ID (pub_id) as string (character). If you need to extract some authors, you can use this data frame (foo) and subset rows.
library(tidyverse)
mutate(sampleDat, pub_id = 1:n()) %>%
separate_rows(var1, sep = ",\\s") %>%
group_by(var1) %>%
summarize(pub_id = toString(pub_id)) -> foo
var1 pub_id
<chr> <chr>
1 Carst W 2
2 Cronauer R 2
3 Doe J 1
4 Hey B 3
5 Joff S 3
6 Kim HE 1
7 Maxwell M 1
8 Theobald U 2, 3
filter(foo, var1 %in% c("Doe J", "Theobald U"))
var1 pub_id
<chr> <chr>
1 Doe J 1
2 Theobald U 2, 3
If you want to have index as numeric, you can twist the idea above and do the following. You can subset rows with targeted names with filter().
mutate(sampleDat, pub_id = 1:n()) %>%
separate_rows(var1, sep = ",\\s") %>%
group_by(var1) %>%
summarize(pub_id = list(pub_id)) %>%
unnest(pub_id)
var1 pub_id
<chr> <int>
1 Carst W 2
2 Cronauer R 2
3 Doe J 1
4 Hey B 3
5 Joff S 3
6 Kim HE 1
7 Maxwell M 1
8 Theobald U 2
9 Theobald U 3

Concatenate columns in data frame

We have brands data in a column/variable which is delimited by semicolon(;). Our task is to split these column data to multiple columns which we were able to do with the following syntax.
Attached the data as Screen shot.
Data set
Here is the R code:
x<-dataset$Pref_All
point<-df %>% separate(x, c("Pref_01","Pref_02","Pref_03","Pref_04","Pref_05"), ";")
point[is.na(point)] <- ""
However, our question is this: we have this type of brand data in 10 to 15 columns, and with the syntax above the number of columns to split into has to be decided from the number of brands each column holds (which we counted manually and set to 5).
We would like to know whether the code can be written dynamically, so that it calculates the maximum number of brands each column holds and creates that many new columns in the data frame, e.g.
Pref_01,Pref_02,Pref_03,Pref_04,Pref_05.
the preferred output is given as a screen shot.
Output
Thanks for the help in advance.
x <- c("Swift;Baleno;Ciaz;Scross;Brezza", "Baleno;swift;celerio;ignis", "Scross;Baleno;celerio;brezza", "", "Ciaz;Scross;Brezza")
strsplit(x,";")
library(dplyr)
library(tidyr)
x <- data.frame(ID = c(1,2,3,4,5),
Pref_All = c("S;B;C;S;B",
"B;S;C;I",
"S;B;C;B",
" ",
"C;S;B"))
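# make sure Pref_All is character (it would be a factor by default in R < 4.0.0)
x$Pref_All <- as.character(x$Pref_All)
# number of brands in each row; the maximum decides how many columns to create
b <- lengths(strsplit(x$Pref_All, ";"))
final_df <- x %>%
tidyr::separate(Pref_All, paste0("Pref_0", seq_len(max(b))), sep = ";", fill = "right")
# put the original string back in place of ID and rename that column
final_df$ID <- x$Pref_All
final_df <- rename(final_df, Pref_All = ID)
final_df[is.na(final_df)] <- ""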
Pref_All Pref_01 Pref_02 Pref_03 Pref_04 Pref_05
1 S;B;C;S;B S B C S B
2 B;S;C;I B S C I
3 S;B;C;B S B C B
4
5 C;S;B C S B
The trick for the column names is given by paste0 going from 1 to the maximum number of brands in your data!
I would use str_split() which returns a list of character vectors. From that, we can work out the max number of preferences in the dataframe and then apply over it a function to add the missing elements.
library(stringr)
df=data.frame("id"=1:5,
"Pref_All"=c("brand1", "brand1;brand2;brand3", "", "brand2;brand4", "brand5"))
spl = str_split(df$Pref_All, ";")
# Find the max number of preferences
maxl = max(unlist(lapply(spl, length)))
# Add missing values to each element of the list
spl = lapply(spl, function(x){c(x, rep("", maxl-length(x)))})
# Bind each element of the list in a data.frame
dfr = data.frame(do.call(rbind, spl))
# Rename the columns
names(dfr) = paste0("Pref_", 1:maxl)
print(dfr)
# Pref_1 Pref_2 Pref_3
#1 brand1
#2 brand1 brand2 brand3
#3
#4 brand2 brand4
#5 brand5
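Since the question mentions 10 to 15 such columns, the steps above could be wrapped in a small helper and applied to each column in turn. The following is a sketch in base R; split_pref, the Cons_All column and its values are made up for illustration, and it assumes every column to split is named with an "_All" suffix.

```r
# Hypothetical helper: split one delimited column into as many columns as needed
split_pref <- function(values, prefix, sep = ";") {
  parts <- strsplit(as.character(values), sep, fixed = TRUE)
  # Widest row decides the number of new columns (at least one)
  n_max <- max(lengths(parts), 1L)
  # Pad shorter rows with "" so every row has n_max entries
  padded <- lapply(parts, function(p) c(p, rep("", n_max - length(p))))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- sprintf("%s_%02d", prefix, seq_len(n_max))
  out
}

df <- data.frame(
  id = 1:3,
  Pref_All = c("Swift;Baleno;Ciaz", "", "Ciaz;Scross"),
  Cons_All = c("Alto", "Alto;Kwid", "Kwid"),  # made-up second column
  stringsAsFactors = FALSE
)

# Split every "_All" column and bind the pieces next to the id column
cols <- c("Pref_All", "Cons_All")
pieces <- lapply(cols, function(col) {
  split_pref(df[[col]], prefix = sub("_All$", "", col))
})
wide <- cbind(df["id"], do.call(cbind, pieces))
wide
```

Each source column gets its own number of new columns (three for Pref, two for Cons here), so nothing has to be counted by hand.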

Resources