The piping in dplyr is cool and sometimes I want to clean up one column by applying multiple commands to it. Is there a way to use the pipe within the mutate() command? I notice this most when using regex and it comes up also in other contexts. In the example below, I can clearly see the different manipulations I am applying to the column "Clean" and I am curious if there is a way to do something that mimics %>% within mutate().
library(dplyr)
phone <- data.frame(Numbers = c("1234567890", "555-3456789", "222-222-2222",
"5131831249", "123.321.1234","(333)444-5555",
"+1 123-223-3234", "555-666-7777 x100"),
stringsAsFactors = FALSE)
phone2 <- phone %>%
mutate(Clean = gsub("[A-Za-z].*", "", Numbers), #remove extensions
Clean = gsub("[^0-9]", "", Clean), #remove parentheses, dashes, etc
Clean = substr(Clean, nchar(Clean)-9, nchar(Clean)), #grab the right 10 characters
Clean = gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", Clean)) #format
phone2
I know there might be a better gsub() command but for the purposes of this question, I want to know if there is a way to pipe these gsub() elements together so that I don't have to keep writing Clean = gsub(...) but also not have to use the method where I embed these inside each other.
It would be fine with me if you answer this question using a simpler example.
Don't fall into the trap of endless pipes. Do the right thing for readability and efficiency: write a function (defined below) and call it inside mutate().
phone %>% mutate(Clean = cleanPhone(Numbers))
# Numbers Clean
# 1 1234567890 (123)456-7890
# 2 555-3456789 (555)345-6789
# 3 222-222-2222 (222)222-2222
# 4 5131831249 (513)183-1249
# 5 123.321.1234 (123)321-1234
# 6 (333)444-5555 (333)444-5555
# 7 +1 123-223-3234 (123)223-3234
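# 8 555-666-7777 x100 (555)666-7777
Custom function:
cleanPhone <- function(x) {
  x1 <- gsub("[A-Za-z].*", "", x) # remove extensions
  x2 <- gsub("[^0-9]", "", x1) # keep digits only
  x3 <- substr(x2, nchar(x2)-9, nchar(x2)) # grab the right 10 characters
  gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", x3) # format
}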
I guess you need
phone %>%
mutate(Clean = gsub("[A-Za-z].*", "", Numbers) %>%
gsub("[^0-9]", "", .) %>%
substr(., nchar(.)-9, nchar(.)) %>%
gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", .))
# Numbers Clean
#1 1234567890 (123)456-7890
#2 555-3456789 (555)345-6789
#3 222-222-2222 (222)222-2222
#4 5131831249 (513)183-1249
#5 123.321.1234 (123)321-1234
#6 (333)444-5555 (333)444-5555
#7 +1 123-223-3234 (123)223-3234
#8 555-666-7777 x100 (555)666-7777
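If the chain of steps keeps growing, the same sequence can also be written without any pipe by folding a list of single-purpose functions over the column. This is only a minimal base R sketch and not part of the answer above; `steps` and `apply_steps` are hypothetical names introduced for illustration.

```r
# Each step takes a character vector and returns a character vector.
steps <- list(
  function(x) gsub("[A-Za-z].*", "", x),          # remove extensions
  function(x) gsub("[^0-9]", "", x),              # keep digits only
  function(x) substr(x, nchar(x) - 9, nchar(x)),  # grab the right 10 characters
  function(x) gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", x)  # format
)

# Hypothetical helper: feed the output of each step into the next one.
apply_steps <- function(x, steps) {
  Reduce(function(value, step) step(value), steps, x)
}

numbers <- c("1234567890", "+1 123-223-3234", "555-666-7777 x100")
cleaned <- apply_steps(numbers, steps)
print(cleaned)
```

Inside a pipeline this could then be called as mutate(Clean = apply_steps(Numbers, steps)), which keeps each manipulation visible on its own line.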
Even though the question is answered, consider this method, which uses magrittr on its own instead of dplyr:
library(magrittr)
phone <- data.frame(Numbers = c("1234567890", "555-3456789", "222-222-2222",
"5131831249", "123.321.1234","(333)444-5555",
"+1 123-223-3234", "555-666-7777 x100"),
stringsAsFactors = FALSE)
phone
cleanchain <- phone$Numbers %>%
  gsub("[A-Za-z].*", "", .) %>%
  gsub("[^0-9]", "", .) %>%
  substr(., nchar(.)-9, nchar(.)) %>%
  gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", .)
cleanchain
data.frame(old=phone$Numbers,new=cleanchain, stringsAsFactors = F)
Related
I would like to drop the word obovate, which appears in second position
df <- data.frame(x = c("Antidesma obovate",
"Ardisia obovate", "Knema obovate", "Lauraceae obovate"))
My desired output
Antidesma
Ardisia
Knema
Lauraceae
I found one topic that kind of answers my question (Drop characters from string based on position),
but here I need to name the specific characters that I want to remove.
So far, I only know how to use str_detect to change the name directly, e.g.
df %>% mutate(x = ifelse(str_detect(x, "Antidesma obovate"), "Antidesma ", x))
Any suggestions for me, please?
We don't need ifelse or str_detect here. Instead, use str_remove to remove the substring:
library(dplyr)
library(stringr)
df %>%
mutate(x = str_remove(x, "\\s+obovate"))
x
1 Antidesma
2 Ardisia
3 Knema
4 Lauraceae
Another option is to use gsub:
gsub(" obovate", "", df$x)
We could use word from stringr package:
library(dplyr)
library(stringr)
df %>%
mutate(x = word(x,1))
Output:
x
1 Antidesma
2 Ardisia
3 Knema
4 Lauraceae
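The two stringr approaches above behave differently once the second word varies: removing a named substring leaves other values untouched, while keeping the first word drops whatever follows. The following base R sketch illustrates that difference with `sub()` as a stand-in for the stringr calls; the second input value is an invented example and is not from the question.

```r
# Second element is a made-up value to show the contrast.
x <- c("Antidesma obovate", "Knema globularia")

# Remove only a trailing " obovate"; any other second word is left alone.
drop_obovate <- sub("\\s+obovate$", "", x)

# Keep only the first word, whatever follows it.
first_word <- sub("\\s.*$", "", x)

print(drop_obovate)
print(first_word)
```

Which of the two is appropriate depends on whether the goal is to strip one known word or to keep the genus regardless of what comes after it.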
I would like to remove all spaces from columns where I know the column names contain a specific string
Reproducible example
library(dplyr)
df <- data.frame(x_first = c("How are you", "Hello", "Good bye"), x_second = c(1:3))
x_first x_second
1 How are you 1
2 Hello 2
3 Good bye 3
df %>%
mutate_at(vars(contains("first")), gsub(" ", "", vars(.))
But I can't find out how I'm supposed to refer to the column inside gsub, right now I have vars(.) but this gives an error.
I would like to get
x_first x_second
1 Howareyou 1
2 Hello 2
3 Goodbye 3
The latest version of dplyr prefers the new across() function over mutate_at. Here's what that would look like:
df %>%
mutate(across(contains("first"), ~gsub(" ", "", .)))
You can use ~ to create an anonymous function where . will be the data from that column.
But the same would work for mutate_at
df %>%
mutate_at(vars(contains("first")), ~gsub(" ", "", .))
or you can use the longer function syntax
df %>%
mutate_at(vars(contains("first")), function(x) gsub(" ", "", x))
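To make the equivalence between the formula shorthand and the longer function syntax concrete, here is a small sketch that runs both forms through across() on the question's data and compares the results. It assumes dplyr 1.0.0 or later is installed; `with_formula` and `with_function` are names chosen for illustration.

```r
library(dplyr)

df <- data.frame(x_first = c("How are you", "Hello", "Good bye"),
                 x_second = 1:3)

# Formula shorthand: .x (or .) stands for the current column.
with_formula <- df %>%
  mutate(across(contains("first"), ~ gsub(" ", "", .x)))

# The same operation with an explicit anonymous function.
with_function <- df %>%
  mutate(across(contains("first"), function(col) gsub(" ", "", col)))

print(with_formula)
print(identical(with_formula, with_function))
```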
We could use str_remove_all from stringr. It may be better to use the regex pattern \\s+ (i.e. one or more whitespace characters), so that runs of spaces of unequal width are also removed.
library(dplyr)
library(stringr)
df %>%
mutate(across(ends_with('first'), ~ str_remove_all(.x, "\\s+")))
Output:
# x_first x_second
#1 Howareyou 1
#2 Hello 2
#3 Goodbye 3
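The difference between matching a literal space and matching \\s+ only shows up when the input contains other whitespace. A minimal base R sketch, using an invented input with a double space and a tab:

```r
# Invented input: the first element has a double space and a tab.
x <- c("How  are\tyou", "Good bye")

# A literal space pattern removes spaces but leaves the tab in place.
only_spaces <- gsub(" ", "", x)

# \\s+ matches any run of whitespace, including tabs.
all_whitespace <- gsub("\\s+", "", x)

print(only_spaces)
print(all_whitespace)
```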
A summary of my aim
I have the following dataframe structure:
my.df <-data.frame("col1_A.C"=c("AA","AC","CC"),
"col2_A.T"=c("TT","AT","TT"),
"col3_C.G"=c("GG","CG","CG"))
my.df
# col1_A.C col2_A.T col3_C.G
# 1 AA TT GG
# 2 AC AT CG
# 3 CC TT CG
For each column, I want to replace any character that matches the 3rd last character of the column name with the character "R".
Using the above dataframe I thus would like to obtain this:
my.df2 <- data.frame("col1_A.C"=c("RR","RC","CC"),
"col2_A.T"=c("TT","RT","TT"),
"col3_C.G"=c("GG","RG","RG"))
my.df2
# col1_A.C col2_A.T col3_C.G
# 1 RR TT GG
# 2 RC RT RG
# 3 CC TT RG
In the first column for instance the column name is col1_A.C, and A is the 3rd last character. All the A's were thus replaced with an R.
My code so far
To achieve this, I have produced the following code
my.df2 <- my.df %>% mutate(across(.cols=everything(),
.funs=str_replace_all(.,
substr(cur_column(),
nchar(cur_column()-2),
nchar(cur_column()-2)
),
"R")
)
)
Unfortunately, the resulting dataframe, my.df2, looks exactly like my.df and no character replacement occurred. No error is returned, though.
I have tested the str_replace_all() approach in the following way and it works on a vector. I imagine then there is something I am missing/not understanding in the way str_replace_all() is interpreted within the mutate(across()) function.
first.column <- c("AA","AT","AA")
first.column <- str_replace_all(first.column,
substr(colnames(my.df)[1],
nchar(colnames(my.df)[1])-2,
nchar(colnames(my.df)[1])-2
),
"R")
print(first.column)
# [1] "RR" "RT" "RR"
I have run out of ideas about what might not be working. My understanding of R and its functions is not very thorough, so I apologise if I have missed something simple. I have also searched for similar questions, but to no avail.
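I think you just needed a tilde ~, and to use .fns instead of .funs. Note also that the - 2 belongs outside nchar(), i.e. nchar(cur_column()) - 2 rather than nchar(cur_column() - 2).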
my.df %>%
mutate(
across(
.cols = everything(),
.fns = ~ str_replace_all(
string = ..1,
pattern = str_sub(cur_column(), nchar(cur_column()) - 2, nchar(cur_column()) - 2),
replacement = "R"
)
)
)
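The position arithmetic is the part that is easiest to get wrong, so it can help to check it on the column names alone. In the answer's code the - 2 sits outside nchar(), whereas in the question it sits inside. A small base R sketch of both forms (`nms`, `third_last` and `misplaced` are illustrative names):

```r
nms <- c("col1_A.C", "col2_A.T", "col3_C.G")

# nchar(nms) - 2 is the position of the 3rd last character of each name.
third_last <- substr(nms, nchar(nms) - 2, nchar(nms) - 2)
print(third_last)

# With the - 2 inside nchar(), R tries to subtract a number from a string
# and stops with an error; try() captures it so the script keeps running.
misplaced <- try(nchar(nms[1] - 2), silent = TRUE)
print(inherits(misplaced, "try-error"))
```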
You can use Map :
my.df[] <- Map(function(x, y) gsub(y, 'R', x), my.df,
substring(names(my.df), nchar(names(my.df)) - 2,nchar(names(my.df)) - 2))
my.df
# col1_A.C col2_A.T col3_C.G
#1 RR TT GG
#2 RC RT RG
#3 CC TT RG
Using @thelatemail's chartr trick with imap_dfc from purrr:
purrr::imap_dfc(my.df, ~chartr(substr(.y, nchar(.y)-2, nchar(.y)-2), 'R', .x))
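The reason chartr() is convenient here is that it translates characters literally, one for one, instead of interpreting its first argument as a regular expression. A short base R sketch of how it compares with gsub(); the "A.C" input is only an illustration:

```r
x <- c("AA", "AC", "CC")

# For an ordinary letter, chartr() and gsub() give the same result.
via_chartr <- chartr("A", "R", x)
via_gsub <- gsub("A", "R", x)

# For a regex metacharacter such as ".", they differ unless fixed = TRUE.
dot_chartr <- chartr(".", "R", "A.C")
dot_gsub <- gsub(".", "R", "A.C")
dot_gsub_fixed <- gsub(".", "R", "A.C", fixed = TRUE)

print(via_chartr)
print(c(dot_chartr, dot_gsub, dot_gsub_fixed))
```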
The same can be achieved by first converting your data from wide to long format:
library(tidyverse)
my.df %>%
gather(colx, rowx) %>%
mutate(rowx = str_replace_all(rowx,
substring(colx, nchar(colx) - 2, nchar(colx) - 2), "R")) %>%
group_by(colx) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = colx, values_from = rowx)
I have a dataframe that contains one column separated by ; like this
AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455
What I want is remove everything except the codes that starts with AB....
AB00001;AB00002
AB84375
AB84375;AB84375
AB001
I've tried to separate them with separate(), but I don't know how to continue. Any suggestions?
If your data frame is called df and your column is called V1, you could try:
sapply(strsplit(df$V1, ";"), function(x) paste(grep("^AB", x, value = TRUE), collapse = ";"))
#> [1] "AB00001;AB00002" "AB84375" "AB84375;AB84375" "AB001"
This splits at all the semicolons then matches all strings starting with "AB", then joins them back together with semicolons.
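The three stages described above can be followed on a single value from the question. This is a base R sketch with intermediate variables (`parts`, `kept`, `joined`) introduced only for illustration:

```r
x <- "AB00001;09843;AB00002;GD00001"

# 1. Split at every semicolon.
parts <- strsplit(x, ";")[[1]]

# 2. Keep only the pieces that start with "AB".
kept <- grep("^AB", parts, value = TRUE)

# 3. Join the surviving pieces back together with semicolons.
joined <- paste(kept, collapse = ";")

print(parts)
print(kept)
print(joined)
```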
I thought of using stringr and Daniel O's data:
df %>%
mutate(data = str_extract_all(data, "AB\\w+"))
which gives us
data
1 AB00001, AB00002
2 AB84375
3 AB84375, AB84375
4 AB001
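Note that str_extract_all() returns a list, which is why the values print separated by commas. If a single semicolon-separated string per row is wanted, the matches still need to be collapsed. The sketch below shows that idea in base R, with regmatches() and gregexpr() swapped in for the stringr call; `V1` holds the question's data and `joined` is an illustrative name.

```r
V1 <- c("AB00001;09843;AB00002;GD00001",
        "AB84375;34",
        "AB84375;AB84375",
        "74859375;AB001;4455;FG3455")

# One character vector of matches per input element.
matches <- regmatches(V1, gregexpr("AB\\w+", V1))

# Collapse each set of matches back into one semicolon-separated string.
joined <- vapply(matches, paste, character(1), collapse = ";", USE.NAMES = FALSE)
print(joined)
```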
1) Base R Assuming DF is as shown reproducibly in the Note at the end, we prefix each line with a semicolon, apply gsub with the pattern shown, and finally remove the semicolon we added. No packages are used.
transform(DF, V1 = sub("^;", "", gsub("(;AB\\d+)|;[^;]*", "\\1", paste0(";", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
2) dplyr/tidyr This one is longer than the others in this answer, but it is straightforward and has no complex regular expressions.
library(dplyr)
library(tidyr)
DF %>%
mutate(id = 1:n()) %>%
separate_rows(V1, sep = ";") %>%
filter(substr(V1, 1, 2) == "AB") %>%
group_by(id) %>%
summarize(V1 = paste(V1, collapse = ";")) %>%
ungroup %>%
select(-id)
giving:
# A tibble: 4 x 1
V1
<chr>
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
3) gsubfn Replace codes that do not start with AB with an empty string and then remove redundant semicolons from what is left.
library(gsubfn)
transform(DF, V1 = gsub("^;|;$", "", gsub(";+", ";",
gsubfn("[^;]*", ~ if (substr(x, 1, 2) == "AB") x else "", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
Note
Lines <- "AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455"
DF <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)
I would like to mutate a string differently, depending on the format. This example has 2 formats based on inclusion of certain punctuation. Each element of the vector contains specific words uniquely associated with the format.
I have tried multiple approaches with ifelse and case_when but am not getting the desired result, which is to "keep" the last part of the string.
I am trying to use easy verbs and am not proficient in regex. Open to any suggestions for an efficient general method.
library(dplyr)
library(stringr)
df <- data.frame(KPI = c("xxxxx.x...Alpha...Keep.1",
"xxxxx.x...Alpha..Keep.2",
"Bravo...Keep3",
"Bravo...Keep4",
"xxxxx...Charlie...Keep.5",
"xxxxx...Charlie...Keep.6"))
dot3dot3split <- function(x) strsplit(x, "..." , fixed = TRUE)[[1]][3]
dot3dot3split("xxxxx.x...Alpha...Keep.1") # returns as expected
"Keep.1"
dot3split <- function(x) strsplit(x, "..." , fixed = TRUE)[[1]][2]
dot3split("Bravo...Keep3") # returns as expected
"Keep3"
df1 <- df %>% mutate_if(is.factor, as.character) %>%
mutate(KPI.v2 = ifelse(str_detect(KPI, paste(c("Alpha", "Charlie"), collapse = '|')), dot3dot3split(KPI),
ifelse(str_detect(KPI, "Bravo"), dot3split(KPI), KPI))) # not working as expected
df1$KPI.v2
"Keep.1" "Keep.1" "Alpha" "Alpha" "Keep.1" "Keep.1"
The functions you designed (dot3dot3split and dot3split) are not vectorized: because of the [[1]], only the result for the first element is returned when the input has more than one element. That is what causes the problem here.
dot3dot3split(c("xxxxx.x...Alpha...Keep.1", "xxxxx.x...Alpha..Keep.2"))
# [1] "Keep.1"
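For comparison, a vectorized variant of the same splitting idea would pick the third piece from every element rather than only from the first. This is a hedged base R sketch; `dot3dot3split_vec` is a hypothetical name and not something from the question.

```r
# Hypothetical vectorized variant: take the 3rd piece of every element.
dot3dot3split_vec <- function(x) {
  pieces <- strsplit(x, "...", fixed = TRUE)
  # parts[3] is NA when an element has fewer than three pieces.
  vapply(pieces, function(parts) parts[3], character(1))
}

kpi <- c("xxxxx.x...Alpha...Keep.1", "xxxxx...Charlie...Keep.5")
result <- dot3dot3split_vec(kpi)
print(result)
```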
Since you are already using stringr, I suggest using str_extract, which is vectorized, to pull out the string you want. That way neither ifelse nor the non-vectorized helper functions are needed.
df <- data.frame(KPI = c("xxxxx.x...Alpha...apples",
"xxxxx.x...Alpha..bananas",
"Bravo...oranges",
"Bravo...grapes",
"xxxxx...Charlie...cherries",
"xxxxx...Charlie...guavas"))
library(dplyr)
library(stringr)
df1 <- df %>%
mutate_if(is.factor, as.character) %>%
mutate(KPI.v2 = str_extract(KPI, "[A-Za-z]*$"))
df1
# KPI KPI.v2
# 1 xxxxx.x...Alpha...apples apples
# 2 xxxxx.x...Alpha..bananas bananas
# 3 Bravo...oranges oranges
# 4 Bravo...grapes grapes
# 5 xxxxx...Charlie...cherries cherries
# 6 xxxxx...Charlie...guavas guavas
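The letters-only pattern fits the fruit names used above. The question's original values (Keep.1, Keep3) contain digits and a single dot, which that pattern would not capture in full. One possible alternative, sketched here in base R under the assumption that the part to keep always follows the last run of two or more dots, is to delete everything up to and including that run:

```r
KPI <- c("xxxxx.x...Alpha...Keep.1",
         "xxxxx.x...Alpha..Keep.2",
         "Bravo...Keep3",
         "xxxxx...Charlie...Keep.5")

# Remove everything up to the last run of two or more dots.
last_part <- sub("^.*\\.{2,}", "", KPI)
print(last_part)
```

Because sub() is vectorized, the same expression can be used directly inside mutate().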