Match part of a pattern to a string - r

I have two dataframes, and I want to do a match and merge.
Initially I was using inner_join and coalesce, but realized the match portion wasn't properly matching.
I found an example that seemed to point in the right direction: "How to merge two data frame based on partial string match with R?". One answer suggested this code:
idx2 <- sapply(df_mouse_human$Protein.IDs, grep, df_mouse$Protein.IDs)
idx1 <- sapply(seq_along(idx2), function(i) rep(i, length(idx2[[i]])))
merged <- cbind(df_mouse_human[unlist(idx1),,drop=F], df_mouse[unlist(idx2),,drop=F])
However, it fell short. The issue is that the dataset I want to use as the pattern has strings that are longer than the strings I want to match them to, so nothing matched. Here is a subset of the data:
dput(droplevels(df_mouse))
structure(list(Protein.IDs = c("Q8CBM2;A2AL85;Q8BSY0", "A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8",
"A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6", "Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15379-2;P15379-3;P15379-6;P15379-11;P15379-5;P15379-10;P15379-9;P15379-4;P15379-8;P15379-7;P15379;P15379-12;P15379-13",
"A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78", "A2AUR7;Q9D031;Q01730"
), Replicate = c(2L, 2L, 2L, 2L, 2L, 2L), Ratio.H.L.normalized.01 = c(NaN,
NaN, NaN, NaN, NaN, NaN), Ratio.H.L.normalized.02 = c(NaN, NaN,
NaN, NaN, NaN, NaN), Ratio.H.L.normalized.03 = c(NaN, NaN, NaN,
NaN, NaN, NaN)), .Names = c("Protein.IDs", "Replicate", "Ratio.H.L.normalized.01",
"Ratio.H.L.normalized.02", "Ratio.H.L.normalized.03"), row.names = 12:17, class = "data.frame")
dput(droplevels(df_mouse_human))
structure(list(Human = c("Q8WZ42", "Q8NF91", "Q9UPN3", "Q96RW7",
"Q8WXG9", "P20929", "Q5T4S7", "O14686", "Q2LD37", "Q92736"),
Protein.IDs = c("A2ASS6", "Q6ZWR6", "Q9QXZ0", "D3YXG0", "Q8VHN7",
"E9Q1W3", "A2AN08", "Q6PDK2", "A2AAE1", "E9Q401")), .Names = c("Human",
"Protein.IDs"), row.names = c(NA, 10L), class = "data.frame")
So I want to match the Protein.IDs in df_mouse to where they exist in df_mouse_human. In the sample data I'm trying to match A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 to the entry A2ASS6. It works well if I do it the other way, but is there a way so that if part of the pattern matches the query, it will come back TRUE?
My long term goal is to match and merge the data, so that df_mouse gets a new column with the matching Human protein ids, and where there is no match I'll just replace the NA value with the original string of mouse IDs.
thanks

One method I commonly use with partial matches like this is to reduce the more-complex field to make it look like the simpler one. Sometimes this involves just removing extraneous characters (e.g., if "match only on the first four chars", then I'd make a new index column from substr(idcol, 1, 4) and join on that), but in this case it involves breaking one string into multiple.
This involves associating each of the semi-colon-delimited ids with the big-string, making this intermediate frame taller (sometimes much taller) than the original data.
(Here df1 is the question's df_mouse and df2 is df_mouse_human. For the sake of presentability, I'm modifying df1 to remove the other invariant columns and, to stand in for "other data", adding a row-number column.)
I'm using dplyr and tidyr, so:
library(dplyr)
library(tidyr)
df1 <- select(df1, Protein.IDs) %>%
mutate(other = row_number())
First I'll break the 6-row frame into a much larger one:
df1ids <- as_tibble(df1) %>%   # tbl_df() is deprecated in current dplyr
  select(Protein.IDs) %>%
  mutate(eachID = strsplit(Protein.IDs, ";")) %>%
  unnest(eachID)               # name the list column explicitly
df1ids
# # A tibble: 46 x 2
# Protein.IDs eachID
# <chr> <chr>
# 1 Q8CBM2;A2AL85;Q8BSY0 Q8CBM2
# 2 Q8CBM2;A2AL85;Q8BSY0 A2AL85
# 3 Q8CBM2;A2AL85;Q8BSY0 Q8BSY0
# 4 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH3
# 5 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH5
# 6 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH4
# 7 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 Q6X893
# 8 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 Q6X893-2
# 9 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 A2AMH8
# 10 A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 A2AMW0
# # ... with 36 more rows
Notice how the first row, which holds three IDs, has become three rows, one per ID. We'll use "eachID" to join.
left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
filter(complete.cases(.)) %>%
select(Human, Protein.IDs) %>%
right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 6 x 3
# Human Protein.IDs other
# <chr> <chr> <int>
# 1 <NA> Q8CBM2;A2AL85;Q8BSY0 1
# 2 <NA> A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 2
# 3 <NA> A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 3
# 4 <NA> Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15~ 4
# 5 Q8WZ42 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 6 <NA> A2AUR7;Q9D031;Q01730 6
If you happen to have multiple Human rows for each Protein.IDs value, things change a little.
df2$Protein.IDs[2] <- "E9Q8K5"
left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
filter(complete.cases(.)) %>%
select(Human, Protein.IDs) %>%
right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 7 x 3
# Human Protein.IDs other
# <chr> <chr> <int>
# 1 <NA> Q8CBM2;A2AL85;Q8BSY0 1
# 2 <NA> A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 2
# 3 <NA> A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 3
# 4 <NA> Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15~ 4
# 5 Q8WZ42 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 6 Q8NF91 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 7 <NA> A2AUR7;Q9D031;Q01730 6
Notice how you now have two copies of other 5? Likely not what you want. If you intend to continue with the semi-colon-delimited theme, though:
left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
filter(complete.cases(.)) %>%
group_by(Protein.IDs) %>%
summarize(Human = paste(Human, collapse = ";")) %>%
select(Human, Protein.IDs) %>%
right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 6 x 3
# Human Protein.IDs other
# <chr> <chr> <int>
# 1 <NA> Q8CBM2;A2AL85;Q8BSY0 1
# 2 <NA> A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8 2
# 3 <NA> A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 3
# 4 <NA> Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM~ 4
# 5 Q8WZ42;Q8N~ A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 5
# 6 <NA> A2AUR7;Q9D031;Q01730 6
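The asker's stated end goal (fall back to the original mouse ID string where nothing matched) can be reached by passing the joined result through coalesce. The following is only a sketch under assumptions: it uses a two-row, cut-down copy of the question's frames, and the object names lookup and result are placeholders that do not appear in the answer above.

```r
library(dplyr)
library(tidyr)

# Cut-down, illustrative copies of the question's two frames.
df_mouse <- data.frame(
  Protein.IDs = c("Q8CBM2;A2AL85;Q8BSY0",
                  "A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78"),
  stringsAsFactors = FALSE
)
df_mouse_human <- data.frame(
  Human       = c("Q8WZ42", "Q8NF91"),
  Protein.IDs = c("A2ASS6", "Q6ZWR6"),
  stringsAsFactors = FALSE
)

# One row per individual ID, keeping the full string alongside it,
# then collapse any matching Human IDs back to one row per full string.
lookup <- df_mouse %>%
  mutate(eachID = Protein.IDs) %>%
  separate_rows(eachID, sep = ";") %>%
  inner_join(df_mouse_human, by = c("eachID" = "Protein.IDs")) %>%
  group_by(Protein.IDs) %>%
  summarize(Human = paste(unique(Human), collapse = ";"), .groups = "drop")

# Attach the Human column; where nothing matched, reuse the mouse IDs.
result <- df_mouse %>%
  left_join(lookup, by = "Protein.IDs") %>%
  mutate(Human = coalesce(Human, Protein.IDs))
```

Because the split pieces are compared as whole tokens, "A2ASS6" does not accidentally match "A2ASS6-2", which a plain substring search would allow.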

@r2evans asks a good question about what to do with multiple matches. Once that question gets answered, I may need to edit my answer, but here is a quick solution. First, we split up the string of possible IDs, then we see which IDs are matched in the other dataframe, then we join on the row index of the match.
library(tidyverse)
df_mouse %>% mutate(all_id = str_split(Protein.IDs, ";"),
                    # row index *in df_mouse_human* of any ID found in this row's list
                    # (which(.x %in% ...) would give positions within all_id instead)
                    row = map(all_id, ~ which(df_mouse_human$Protein.IDs %in% .x))) %>%
  unnest(row) %>%
  list(., df_mouse_human %>% rownames_to_column("row") %>% mutate(row = as.numeric(row))) %>%
  reduce(left_join, by = "row")
#> Protein.IDs.x Replicate
#> 1 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 2
#> Ratio.H.L.normalized.01 Ratio.H.L.normalized.02 Ratio.H.L.normalized.03
#> 1 NaN NaN NaN
#> row Human Protein.IDs.y
#> 1 1 Q8WZ42 A2ASS6

Related

Merging data with partial match

I have two large data frames and want to merge them based on one of the columns. However, some of the cells only partially match. Please see the example below:
df1 = data.frame(SampleID = c(1:6), Gene = c("ARF5;ARG1","AP3B1","CLDN5","XPO1;STX7","ABCC4","FLOT1"))
df2 = data.frame(Operation = c("Y"), Gene = c("ARG1","CLDN5;STK10","XPO1","PDE5A","ARF5","IPO7","VAPB","ABCC4"))
#-----------------
SampleID Gene
1 ARF5;ARG1
2 AP3B1
3 CLDN5
4 XPO1;STX7
5 ABCC4
6 FLOT1
#-----------------
Operation Gene
Y ARG1
Y CLDN5;STK10
Y XPO1
Y PDE5A
Y ARF5
Y IPO7
Y VAPB
Y ABCC4
Expected Output
#-----------------
SampleID Gene Operation
1 ARF5;ARG1 Y
2 AP3B1 -
3 CLDN5 Y
4 XPO1;STX7 Y
5 ABCC4 Y
6 FLOT1 -
You can see that df1$Gene and df2$Gene only partially match, and I want to add the Operation information to df1 whenever there is a match. In the example, rows 1 and 4 of df1 only partially match entries in df2 (ARF5;ARG1 contains ARG1 and ARF5, and XPO1;STX7 contains XPO1). Rows with no match can be NA, or whatever. My data frames have thousands of rows, so I cannot adjust them one by one.
Using dplyr and fuzzyjoin:
library(dplyr)
# library(fuzzyjoin) # regex_left_join
df2 %>%
mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
group_by(SampleID) %>%
summarize(Gene = Gene.x[1], Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
# SampleID Gene Operation
# <int> <chr> <chr>
# 1 1 ARF5;ARG1 Y
# 2 2 AP3B1 NA
# 3 3 CLDN5 Y
# 4 4 XPO1;STX7 Y
# 5 5 ABCC4 Y
# 6 6 FLOT1 NA
The first step converts df2$Gene[2] from CLDN5;STK10 to \\b(CLDN5|STK10)\\b, a pattern that allows a match on any of its ;-delimited values (inferred from your expected output).
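To see what that generated pattern accepts and rejects, here is a small stand-alone check. It is a sketch only: base grepl with perl = TRUE stands in for whatever matcher the join uses internally, and the test strings "CLDN50" and "XPO1;STK10" are made up for illustration.

```r
genes <- c("ARG1", "CLDN5;STK10")

# Same construction as in the answer: one alternation per ;-delimited cell,
# wrapped in word boundaries so that only whole symbols match.
patterns <- vapply(
  strsplit(genes, ";", fixed = TRUE),
  function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"),
  character(1)
)

# Second pattern tested against an exact symbol, a longer symbol that merely
# starts with CLDN5, and a delimited cell containing STK10.
hits <- grepl(patterns[2], c("CLDN5", "CLDN50", "XPO1;STK10"), perl = TRUE)
hits
```

The word boundaries are what keep "CLDN50" from matching; without them, any symbol that merely contains "CLDN5" would join.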
Edit: if you have a lot of other columns, you may be able to add them to the grouping such that you don't need to explicitly summarize them (with [1]). For example, the above might be rewritten as:
df2 %>%
mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
rename(Gene = Gene.x) %>%
group_by(across(SampleID:Gene)) %>%
summarize(Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
# SampleID Gene Operation
# <int> <chr> <chr>
# 1 1 ARF5;ARG1 Y
# 2 2 AP3B1 NA
# 3 3 CLDN5 Y
# 4 4 XPO1;STX7 Y
# 5 5 ABCC4 Y
# 6 6 FLOT1 NA
(Renaming from Gene.x to Gene is not necessary but looked nice :-)
This method assumes that all columns that you want to keep are either consecutive (allowing for fromcolumn:tocolumn use of :-ranges) or not difficult to add individually.

Paste colnames by sequence

Hi and happy new year to all.
I have a tricky task (in my opinion) and I cannot find a way to solve it.
Please see the following toy data. The original dataset has hundreds of cols/rows.
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan"),
US=c(8,2,NA,7),
UK=c(5,4,1,7))
I want to create a new column, called "origin", which pastes the column name of each non-NA cell, separated by "|", ordered by the corresponding value. Higher values should be pasted first. For equal values (like Zlatan), the sequence isn't relevant: the output for Zlatan could be US|UK or UK|US.
This is the desired output:
I tried for some hours to solve it, but no approach worked. Maybe it makes sense to convert the values with as.factor...
Help is much appreciated. Thank you in advance!
Here's a dplyr approach. First, we can use rowwise to work on individual rows independently. Next, we can use c_across which allows us to select values from that row only. We can subset a vector of c("US","UK") based on whether the US and UK columns are not NA.
paste with collapse = "|" allows us to put the values together with the separator. I added a row to see what would happen if they are both NA.
library(dplyr)
test %>%
rowwise() %>%
mutate(origin = paste(c("US","UK")[rev(order(c_across(US:UK), na.last = NA))], collapse = "|"))
# A tibble: 5 x 4
# Rowwise:
name US UK origin
<chr> <dbl> <dbl> <chr>
1 Amber 8 5 "US|UK"
2 Thomas 2 4 "UK|US"
3 Stefan NA 1 "UK"
4 Zlatan 7 7 "UK|US"
5 Bob NA NA ""
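The indexing expression inside mutate may be easier to follow on a single row's values, outside of any data frame. This is a sketch with made-up values (the vector names vals and all_na are placeholders):

```r
# One row's worth of values, named by column.
vals <- c(US = 2, UK = 4, AUS = NA)

# order() gives ascending positions; na.last = NA drops the NA entries.
idx <- order(vals, na.last = NA)

# Reverse to get highest first, then look up the names and collapse.
origin <- paste(names(vals)[rev(idx)], collapse = "|")
origin

# A row with only NAs leaves nothing to paste, which yields an empty string.
all_na <- c(US = NA_real_, UK = NA_real_)
empty_origin <- paste(names(all_na)[rev(order(all_na, na.last = NA))], collapse = "|")
empty_origin
```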
This is also trivially expanded to more columns:
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan","Bob"),
US=c(8,2,NA,7,NA),
UK=c(5,4,1,7,NA),
AUS=c(1,2,NA,NA,1))
test %>%
rowwise() %>%
mutate(origin = paste(c("US","UK","AUS")[rev(order(c_across(US:AUS), na.last = NA))], collapse = "|"))
# A tibble: 5 x 5
# Rowwise:
name US UK AUS origin
<chr> <dbl> <dbl> <dbl> <chr>
1 Amber 8 5 1 US|UK|AUS
2 Thomas 2 4 2 UK|AUS|US
3 Stefan NA 1 NA UK
4 Zlatan 7 7 NA UK|US
5 Bob NA NA 1 AUS
Or with tidyselect assistance to perform all columns but name:
test %>%
rowwise() %>%
mutate(origin = paste(names(across(-name))[rev(order(c_across(-name), na.last = NA))], collapse = "|"))
Another possibility with tidyverse. It is longer than the other two solutions, but it should work directly with a dataframe with as many columns as you need.
I changed the dataframe to long format, filtered out NAs, grouped by name, summarized using paste, and joined with the original dataframe to get the original columns (and rows with all NAs).
library(tidyverse)
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan","Bob"),
US=c(8,2,NA,7,NA),
UK=c(5,4,1,7,NA),
AUS=c(1,2,NA,NA,1))
test %>%
# change to long format
tidyr::pivot_longer(cols=-name, names_to = "country", values_to = "value") %>%
# remove rows with NA
dplyr::filter(!is.na(value)) %>%
# group by name and sort
dplyr::group_by(name) %>% dplyr::arrange(-value) %>%
# create summary of countries for each name in column 'origin'
dplyr::summarise(origin=paste(country, collapse = "|")) %>%
# join with original data frame to include original columns (and names with only NA) and change NA to '' in origin
dplyr::right_join(test, by='name') %>% dplyr::mutate(origin=ifelse(is.na(origin), '', origin)) %>%
# move origin column to end
dplyr::relocate(origin, .after = last_col())
Result
name US UK AUS origin
<chr> <dbl> <dbl> <dbl> <chr>
1 Amber 8 5 1 US|UK|AUS
2 Bob NA NA 1 AUS
3 Stefan NA 1 NA UK
4 Thomas 2 4 2 UK|US|AUS
5 Zlatan 7 7 NA US|UK
Here's a different tidyverse solution using case_when:
library(tidyverse)
data <- data.frame(
  name = c("Amber","Thomas","Stefan","Zlatan"),
  US = c(8,2,NA,7),
  UK = c(5,4,1,7))
data <- data %>% mutate(origin = case_when( US > UK ~ "US|UK",
UK >= US ~ "UK|US",
is.na(UK) & !is.na(US) ~ "US",
is.na(US) & !is.na(UK) ~ "UK"))
data
#> name US UK origin
#> 1 Amber 8 5 US|UK
#> 2 Thomas 2 4 UK|US
#> 3 Stefan NA 1 UK
#> 4 Zlatan 7 7 UK|US
Created on 2021-01-06 by the reprex package (v0.3.0)

convert data frame of "missed" numbers into data frame of numbers "hit"

I have quite a specific question, but it should be easy to solve; I just cannot think how...
I have a simple data frame like this:
mydf <- data.frame(Shooter=1:3, Targets.missed=c(paste(sample(1:10,4),collapse=";"), paste(sample(1:10,5),collapse=";"), paste(sample(1:10,8),collapse=";")))
mydf
Shooter Targets.missed
1 1 3;8;4;7
2 2 10;1;5;7;4
3 3 5;9;4;10;8;1;6;7
This data frame tells me the Targets (from 1 to 10) that are missed by each Shooter.
I would like to obtain a different data frame that tells me, per Target, which Shooter(s) hit it.
The result would be:
Target hit.by.Shooters
1 1
2 1;2;3
3 2;3
4 NA
5 1
6 1;2
7 NA
8 2
9 1;2
10 1
We split 'Targets.missed' at the ; to expand the data into 'long' format. Then, grouped by 'Shooter', we summarise into a list column holding the numbers from 1:10 that are not in 'Targets.missed', and unnest that list column. Finally, grouped by 'Target', we summarise by pasting the unique 'Shooter' elements into a single string, and use complete to fill in with NA the targets from 1:10 that nobody hit:
library(tidyverse)
mydf %>%
separate_rows(Targets.missed) %>%
group_by(Shooter) %>%
summarise(Target = list(setdiff(1:10, Targets.missed))) %>%
unnest(Target) %>%
group_by(Target) %>%
summarise(hit.by.Shooters = paste(unique(Shooter), collapse=";")) %>%
complete(Target = 1:10)
# A tibble: 10 x 2
# Target hit.by.Shooters
# <int> <chr>
# 1 1 1
# 2 2 1;2;3
# 3 3 2;3
# 4 4 <NA>
# 5 5 1
# 6 6 1;2
# 7 7 <NA>
# 8 8 2
# 9 9 1;2
#10 10 1
Another option is base R: split 'Targets.missed' (assuming character class) into a list of vectors, loop through the list keeping the values of 1:10 that are not in each vector (with setdiff), set the names of the list from the 'Shooter' column, stack the key/value list pairs into a two-column data.frame, take the unique rows, aggregate by pasting the 'ind' column grouped by 'values', and merge with a full 'values' dataset from 1:10:
out <- aggregate(ind ~ values,
unique(stack(setNames(lapply(strsplit(mydf$Targets.missed, ';'),
setdiff, x= 1:10), mydf$Shooter))), FUN = paste, collapse=";")
out1 <- merge(data.frame(values = 1:10), out, all.x = TRUE)
and change the column names if necessary
names(out1) <- c('Target', 'hit.by.Shooters')
data
mydf <- structure(list(Shooter = 1:3, Targets.missed = c("3;8;4;7", "10;1;5;7;4",
"5;9;4;10;8;1;6;7")), class = "data.frame", row.names = c("1",
"2", "3"))
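The nested base R call above can be unrolled into named intermediate steps, which makes each stage easy to inspect. This is a sketch of the same idea rather than a replacement; the intermediate names (missed, hits, long, collapsed, res) are placeholders, and the missed targets are converted to integer explicitly.

```r
mydf <- data.frame(
  Shooter = 1:3,
  Targets.missed = c("3;8;4;7", "10;1;5;7;4", "5;9;4;10;8;1;6;7"),
  stringsAsFactors = FALSE
)

# 1. One character vector of missed targets per shooter.
missed <- strsplit(mydf$Targets.missed, ";", fixed = TRUE)

# 2. Targets in 1:10 that each shooter did NOT miss, named by shooter.
hits <- lapply(missed, function(m) setdiff(1:10, as.integer(m)))
names(hits) <- mydf$Shooter

# 3. Long format: 'values' is the target, 'ind' is the shooter.
long <- stack(hits)

# 4. One row per target, shooters pasted together.
collapsed <- aggregate(ind ~ values, long, FUN = paste, collapse = ";")

# 5. Bring back targets nobody hit (they become NA).
res <- merge(data.frame(values = 1:10), collapsed, all.x = TRUE)
names(res) <- c("Target", "hit.by.Shooters")
res
```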
Another tidyverse possibility. We first create a data frame with all possible combinations of Shooter and Targets, then remove the rows that are present in mydf using anti_join, fill in the missing Targets by adding them as NA, and finally summarise by Targets to get the Shooters who actually hit the target.
library(tidyverse)
crossing(Shooter = unique(mydf$Shooter), Targets.missed = 1:10) %>%
anti_join(mydf %>% separate_rows(Targets.missed) %>% mutate_all(as.numeric)) %>%
complete(Targets.missed = 1:10) %>%
group_by(Targets.missed) %>%
summarise(hit.by.Shooters = paste0(Shooter, collapse = ";"))
# Targets.missed hit.by.Shooters
# <int> <chr>
# 1 1 1;2
# 2 2 1;2
# 3 3 1
# 4 4 1
# 5 5 2
# 6 6 1;3
# 7 7 1;2
# 8 8 2
# 9 9 NA
#10 10 3
data
set.seed(987)
mydf <- data.frame(Shooter=1:3,
Targets.missed=c(paste(sample(1:10,4),collapse=";"),
paste(sample(1:10,5),collapse=";"), paste(sample(1:10,8),collapse=";")))
data.table approach
library( data.table )
#vector with all possible targets
targets.v <- 1:10
#split the missed targets to a list
missed.list <- strsplit( mydf$Targets.missed, ";")
#inverse, to get all hit targets
hit.list <- lapply( missed.list, function(x) as.data.table( targets.v[!targets.v %in% x] ) )
#bind hit targets to data.table
dt <- rbindlist( hit.list, idcol = "shooter" )
#summarise (paste with collapse), and join on all possible targets
dt[, .(hit.by.shooters = paste(shooter, collapse = ";")), by = .(target = V1)][data.table(target = targets.v), on = c("target")]
# target hit.by.shooters
# 1: 1 1
# 2: 2 1;2;3
# 3: 3 2;3
# 4: 4 <NA>
# 5: 5 1
# 6: 6 1;2
# 7: 7 <NA>
# 8: 8 2
# 9: 9 1;2
# 10: 10 1

Using any() or all() with is.na() over multiple columns

I'd like to drop rows from my dataset that are all NAs (AKA keep rows with any non-NAs) for a list of columns. How could I update this code so that x & y are supplied as a vector? This would enable me to flexibly add and drop columns for inspection.
library(dplyr)
ds <-
tibble(
id = c(1:4),
x = c(NA, 1, NA, 4),
y = c(NA, NA , 3, 4)
)
ds %>%
rowwise() %>%
filter(
any(
!is.na(x),
!is.na(y)
)
) %>%
ungroup()
I'm trying to write something like any(!is.na(c(x,y))) but I'm not sure how to supply multiple arguments to is.na().
We can use filter_at with any_vars
ds %>%
filter_at(vars(x:y), any_vars(!is.na(.)))
# A tibble: 3 x 3
# id x y
# <int> <dbl> <dbl>
#1 2 1 NA
#2 3 NA 3
#3 4 4 4
Update (Feb 7, 2022):
In the new version of dplyr (as @GitHunter0 suggested), we can use if_all/if_any or across
ds %>%
filter(if_any(x:y, complete.cases))
# A tibble: 3 × 3
id x y
<int> <dbl> <dbl>
1 2 1 NA
2 3 NA 3
3 4 4 4
You can also use ds %>% filter(!if_all(x:y, is.na)).
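Since the question asks for the columns to be supplied as a vector, the same if_any call can take a character vector through all_of. A minimal sketch, assuming dplyr 1.0.4 or later; the names cols and kept are placeholders introduced here:

```r
library(dplyr)

ds <- tibble(
  id = 1:4,
  x  = c(NA, 1, NA, 4),
  y  = c(NA, NA, 3, 4)
)

# Columns to inspect, kept in an ordinary character vector so they can be
# added to or trimmed without touching the pipeline.
cols <- c("x", "y")

# Keep rows where at least one of the listed columns is not NA.
kept <- ds %>%
  filter(if_any(all_of(cols), ~ !is.na(.x)))
kept
```

all_of() errors if a listed column is missing from the data, which is usually the safer behaviour when the vector is edited by hand.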

mutate_at does not create variable suffixes in some cases?

I have been playing with dplyr::mutate_at to create new variables by applying the same function to some of the columns. When I name my function in the .funs argument, the mutate call creates new columns with a suffix instead of replacing the existing ones, which is a cool option that I discovered in this thread.
df = data.frame(var1=1:2, var2=4:5, other=9)
df %>% mutate_at(vars(contains("var")), .funs=funs('sqrt'=sqrt))
#### var1 var2 other var1_sqrt var2_sqrt
#### 1 1 4 9 1.000000 2.000000
#### 2 2 5 9 1.414214 2.236068
However, I noticed that when the vars argument used to select my columns returns only one column instead of several, the resulting new column drops the original name: it gets named sqrt instead of other_sqrt here:
df %>% mutate_at(vars(contains("other")), .funs=funs('sqrt'=sqrt))
#### var1 var2 other sqrt
#### 1 1 4 9 3
#### 2 2 5 9 3
I would like to understand why this behaviour happens, and how to avoid it because I don't know in advance how many columns the contains() will return.
EDIT:
The newly created columns must inherit the original name of the original columns, plus the suffix 'sqrt' at the end.
Thanks
Here is another idea. We can add setNames(sub("^sqrt$", "other_sqrt", names(.))) after the mutate_at call. The idea is to replace the column name sqrt with other_sqrt. The pattern ^sqrt$ should only match the derived column sqrt if there is only one column named other, which is demonstrated in Example 1. If there is more than one column with other, as in Example 2, the setNames would not change the column names.
library(dplyr)
# Example 1
df <- data.frame(var1 = 1:2, var2 = 4:5, other = 9)
df %>%
mutate_at(vars(contains("other")), funs("sqrt" = sqrt(.))) %>%
setNames(sub("^sqrt$", "other_sqrt", names(.)))
# var1 var2 other other_sqrt
# 1 1 4 9 3
# 2 2 5 9 3
# Example 2
df2 <- data.frame(var1 = 1:2, var2 = 4:5, other1 = 9, other2 = 16)
df2 %>%
mutate_at(vars(contains("other")), funs("sqrt" = sqrt(.))) %>%
setNames(sub("^sqrt$", "other_sqrt", names(.)))
# var1 var2 other1 other2 other1_sqrt other2_sqrt
# 1 1 4 9 16 3 4
# 2 2 5 9 16 3 4
Or we can design a function to check how many columns contain the string other before manipulating the data frame.
mutate_sqrt <- function(df, string){
string_col <- grep(string, names(df), value = TRUE)
df2 <- df %>% mutate_at(vars(contains(string)), funs("sqrt" = sqrt(.)))
if (length(string_col) == 1){
df2 <- df2 %>% setNames(sub("^sqrt$", paste(string_col, "sqrt", sep = "_"), names(.)))
}
return(df2)
}
mutate_sqrt(df, "other")
# var1 var2 other other_sqrt
# 1 1 4 9 3
# 2 2 5 9 3
mutate_sqrt(df2, "other")
# var1 var2 other1 other2 other1_sqrt other2_sqrt
# 1 1 4 9 16 3 4
# 2 2 5 9 16 3 4
I just figured out a (not so clean) way to do it:
I add an extra dummy variable to the dataset, with a name that ensures it will be selected so that we don't fall into the 1-variable case, and after the calculation I remove the 2 dummy columns, like this:
df %>% mutate(other_fake=NA) %>%
mutate_at(vars(contains("other")), .funs=funs('sqrt'=sqrt)) %>%
select(-contains("other_fake"))
#### var1 var2 other other_sqrt
#### 1 1 4 9 3
#### 2 2 5 9 3
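For readers on dplyr 1.0.0 or later, a different technique from the answers above is across() with its .names argument, which builds every output name from a template regardless of how many columns were selected. This is a sketch of that alternative, not what the answers used; res and df2 here are illustrative names.

```r
library(dplyr)

df <- data.frame(var1 = 1:2, var2 = 4:5, other = 9)

# "{.col}" is replaced by each selected column's name, so the suffix is
# applied even when only a single column matches the selection.
res <- df %>%
  mutate(across(contains("other"), sqrt, .names = "{.col}_sqrt"))
names(res)

# The same call with two matching columns, for comparison.
df2 <- data.frame(var1 = 1:2, var2 = 4:5, other1 = 9, other2 = 16)
res2 <- df2 %>%
  mutate(across(contains("other"), sqrt, .names = "{.col}_sqrt"))
names(res2)
```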

Resources