I would like to drop the word "obovate", which appears in the second position of each string
df <- data.frame(x = c("Antidesma obovate",
"Ardisia obovate", "Knema obovate", "Lauraceae obovate"))
My desired output
Antidesma
Ardisia
Knema
Lauraceae
I found one topic that partly answers my question (Drop characters from string based on position),
but here I need to name the specific characters that I want to remove.
So far, I only know how to use str_detect to change one name at a time, e.g.
df %>% mutate(x= ifelse(str_detect(x, "Antidesma obovate"), "Antidesma ", x))
Any suggestions for me, please?
We don't need ifelse or str_detect here. Instead, use str_remove to remove the substring
library(dplyr)
library(stringr)
df %>%
mutate(x = str_remove(x, "\\s+obovate"))
x
1 Antidesma
2 Ardisia
3 Knema
4 Lauraceae
Another option is to use gsub:
gsub(" obovate", "", df$x)
We could use word from the stringr package:
library(dplyr)
library(stringr)
df %>%
mutate(x = word(x,1))
Output:
x
1 Antidesma
2 Ardisia
3 Knema
4 Lauraceae
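For comparison, here is a minimal base R sketch of the two ideas above: removing a known trailing word, and keeping only the first word. The vector name species and the result names are just for illustration, and the second pattern assumes the part you want to keep never contains whitespace.

```r
species <- c("Antidesma obovate", "Ardisia obovate",
             "Knema obovate", "Lauraceae obovate")

# Remove the known word, but only when it ends the string
by_word <- sub("\\s+obovate$", "", species)

# Keep everything before the first whitespace character,
# whatever the second word happens to be
by_position <- sub("\\s.*$", "", species)

by_word
by_position
```

The first form is safer when other second words should be left alone; the second form is closer to what word(x, 1) does.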
I would like to remove all spaces from columns where I know the column names contain a specific string
Reproducible example
library(dplyr)
df <- data.frame(x_first = c("How are you", "Hello", "Good bye"), x_second = c(1:3))
x_first x_second
1 How are you 1
2 Hello 2
3 Good bye 3
df %>%
mutate_at(vars(contains("first")), gsub(" ", "", vars(.))
But I can't find out how I'm supposed to refer to the column inside gsub, right now I have vars(.) but this gives an error.
I would like to get
x_first x_second
1 Howareyou 1
2 Hello 2
3 Goodbye 3
The latest version of dplyr prefers the new across() function over mutate_at. Here's what that would look like
df %>%
mutate(across(contains("first"), ~gsub(" ", "", .)))
You can use ~ to create an anonymous function where . will be the data from that column.
But the same would work for mutate_at
df %>%
mutate_at(vars(contains("first")), ~gsub(" ", "", .))
or you can use the longer function syntax
df %>%
mutate_at(vars(contains("first")), function(x) gsub(" ", "", x))
We could use str_remove_all from stringr. It may be better to use the regex pattern \\s+ (one or more whitespace characters), so that runs of spaces of unequal width are removed as well
library(dplyr)
library(stringr)
df %>%
mutate(across(ends_with('first'), ~ str_remove_all(.x, "\\s+")))
-output
# x_first x_second
#1 Howareyou 1
#2 Hello 2
#3 Goodbye 3
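If you would rather not depend on dplyr, roughly the same thing can be sketched in base R: pick the columns by name with grep, then apply gsub to each of them with lapply. The name target is only for illustration.

```r
df <- data.frame(x_first = c("How are you", "Hello", "Good bye"),
                 x_second = 1:3,
                 stringsAsFactors = FALSE)

# Positions of the columns whose name contains "first"
target <- grep("first", names(df), fixed = TRUE)

# Strip all whitespace from those columns only
df[target] <- lapply(df[target], function(col) gsub("\\s+", "", col))

df
```

This assumes the matched columns are character columns; gsub would silently convert numeric or factor columns to character.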
I have several IDs I am working with. I want to add a leading zero for values that have only 1 digit after the dash in id. Here is sample data.
id
2034-5
1023-12
1042-22
1231-9
I want this:
id
2034-05
1023-12
1042-22
1231-09
I tried this, but it's not working. Any advice?
x <-sprintf("%02d", df$id)
You could actually use sub here for a base R option:
df$id <- sub("-(\\d)$", "-0\\1", df$id)
df
id
1 2034-05
2 1023-12
3 1042-22
4 1231-09
Data:
df <- data.frame(id=c("2034-5", "1023-12", "1042-22", "1231-9"), stringsAsFactors=FALSE)
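To use sprintf you have to split the id into its two numbers, apply sprintf to the second one, and then combine them again. Convert the second part to an integer and use %02d, because zero-padding with %s is platform-dependent.
library(dplyr)
library(tidyr)
df %>%
separate(id, c('id1', 'id2')) %>%
mutate(id2 = sprintf('%02d', as.integer(id2))) %>%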
unite(id, id1, id2, sep = '-')
# id
#1 2034-05
#2 1023-12
#3 1042-22
#4 1231-09
An option with strsplit and sprintf from base R
df$id <- sapply(strsplit(df$id, "-"), function(x)
sprintf("%s-%02d", x[1], as.integer(x[2])))
df$id
#[1] "2034-05" "1023-12" "1042-22" "1231-09"
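As a side note on why the original attempt fails: %02d expects a number, and the whole id ("2034-5") is a character string. A minimal sketch of the split, pad, and recombine idea without the per-row function, assuming every id has exactly one dash (the names ids, prefix, suffix, and padded are only for illustration):

```r
ids <- c("2034-5", "1023-12", "1042-22", "1231-9")

# Split each id at the dash
parts <- strsplit(ids, "-", fixed = TRUE)

# First piece stays as text, second piece becomes an integer
prefix <- vapply(parts, `[`, character(1), 1)
suffix <- as.integer(vapply(parts, `[`, character(1), 2))

# sprintf is vectorized, so one call pads every suffix to two digits
padded <- sprintf("%s-%02d", prefix, suffix)
padded
```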
What's the easiest way to drop a string before a certain character?
The data looks as follows:
library(tidyverse)
df <- data.frame(var1 = c("lang:10,q1:10,m2:20,q3:20,m5:10",
"lang:1,q1:10,m2:20,m3:20,q3:10",
"lang:100,q1:10,m2:20"))
Now, I'd like to remove the "lang:xy," part at the beginning of each row.
I tried to use "separate", but the comma is also used afterwards (everything that comes after the first comma should stay together).
So my desired output is:
var1
-------------------------
q1:10,m2:20,q3:20,m5:10
q1:10,m2:20,m3:20,q3:10
q1:10,m2:20
Thanks!
You can use str_remove from stringr package:
df %>%
mutate(
var1 = var1 %>% str_remove("^lang:[0-9]*,")
)
Or try this:
library(tidyverse)
#Code
df %>% mutate(id=1:n()) %>%separate_rows(var1,sep = ',') %>%
filter(!grepl('lang',var1)) %>%
mutate(var='var') %>%
group_by(id) %>%
summarise(var1=paste0(var1,collapse = ',')) %>% ungroup() %>%
select(-id)
Output:
# A tibble: 3 x 1
var1
<chr>
1 q1:10,m2:20,q3:20,m5:10
2 q1:10,m2:20,m3:20,q3:10
3 q1:10,m2:20
Just to round out the answers, the sub function from base R can also work here:
df$var1 <- sub("^lang:\\d+,", "", df$var1)
df
var1
1 q1:10,m2:20,q3:20,m5:10
2 q1:10,m2:20,m3:20,q3:10
3 q1:10,m2:20
We can use trimws from base R
df$var1 <- trimws(df$var1, whitespace = "lang:\\d+,")
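If the prefix is not always "lang:" followed by digits, a more general sketch is to drop everything up to and including the first comma. This assumes every string contains at least one comma; strings without one are returned unchanged. The name trimmed is only for illustration.

```r
var1 <- c("lang:10,q1:10,m2:20,q3:20,m5:10",
          "lang:1,q1:10,m2:20,m3:20,q3:10",
          "lang:100,q1:10,m2:20")

# [^,]* matches everything that is not a comma, so the match
# stops at the first comma and the rest of the string is kept
trimmed <- sub("^[^,]*,", "", var1)
trimmed
```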
I would like to mutate a string differently, depending on the format. This example has 2 formats based on inclusion of certain punctuation. Each element of the vector contains specific words uniquely associated with the format.
I have tried multiple approaches with ifelse and case_when but am not getting the desired result, which is to "keep" the last part of the string.
I am trying to use easy verbs and am not proficient in regex. Open to any suggestions for an efficient general method.
library(dplyr)
library(stringr)
df <- data.frame(KPI = c("xxxxx.x...Alpha...Keep.1",
"xxxxx.x...Alpha..Keep.2",
"Bravo...Keep3",
"Bravo...Keep4",
"xxxxx...Charlie...Keep.5",
"xxxxx...Charlie...Keep.6"))
dot3dot3split <- function(x) strsplit(x, "..." , fixed = TRUE)[[1]][3]
dot3dot3split("xxxxx.x...Alpha...Keep.1") # returns as expected
"Keep.1"
dot3split <- function(x) strsplit(x, "..." , fixed = TRUE)[[1]][2]
dot3split("Bravo...Keep3") # returns as expected
"Keep3"
df1 <- df %>% mutate_if(is.factor, as.character) %>%
mutate(KPI.v2 = ifelse(str_detect(KPI, paste(c("Alpha", "Charlie"), collapse = '|')), dot3dot3split(KPI),
ifelse(str_detect(KPI, "Bravo"), dot3split(KPI), KPI))) # not working as expected
df1$KPI.v2
"Keep.1" "Keep.1" "Alpha" "Alpha" "Keep.1" "Keep.1"
The functions you wrote (dot3dot3split and dot3split) are not vectorized: because of the [[1]], they only ever look at the first element of their input. Inside mutate that single value is then recycled across all rows, which is why you see "Keep.1" and "Alpha" repeated.
dot3dot3split(c("xxxxx.x...Alpha...Keep.1", "xxxxx.x...Alpha..Keep.2"))
# [1] "Keep.1"
Since you are using stringr, I suggest using str_extract to extract the string you want. It is already vectorized, so you need neither ifelse nor helper functions.
df <- data.frame(KPI = c("xxxxx.x...Alpha...apples",
"xxxxx.x...Alpha..bananas",
"Bravo...oranges",
"Bravo...grapes",
"xxxxx...Charlie...cherries",
"xxxxx...Charlie...guavas"))
library(dplyr)
library(stringr)
df1 <- df %>%
mutate_if(is.factor, as.character) %>%
mutate(KPI.v2 = str_extract(KPI, "[A-Za-z]*$"))
df1
# KPI KPI.v2
# 1 xxxxx.x...Alpha...apples apples
# 2 xxxxx.x...Alpha..bananas bananas
# 3 Bravo...oranges oranges
# 4 Bravo...grapes grapes
# 5 xxxxx...Charlie...cherries cherries
# 6 xxxxx...Charlie...guavas guavas
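If you would rather keep the split-based approach from the question, here is a minimal vectorized sketch in base R. It splits on any run of two or more dots and keeps the last piece of each element, so it also copes with the row that has only two dots before the part to keep. The names last_piece and kept are hypothetical.

```r
KPI <- c("xxxxx.x...Alpha...Keep.1",
         "xxxxx.x...Alpha..Keep.2",
         "Bravo...Keep3",
         "Bravo...Keep4",
         "xxxxx...Charlie...Keep.5",
         "xxxxx...Charlie...Keep.6")

# Split every element on runs of 2+ dots, then take the last piece
# of each split instead of indexing only the first element with [[1]]
last_piece <- function(x) {
  pieces <- strsplit(x, "\\.{2,}")
  vapply(pieces, function(p) p[length(p)], character(1))
}

kept <- last_piece(KPI)
kept
```

Because last_piece returns one value per input element, it can be used directly inside mutate without ifelse.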
The piping in dplyr is cool and sometimes I want to clean up one column by applying multiple commands to it. Is there a way to use the pipe within the mutate() command? I notice this most when using regex and it comes up also in other contexts. In the example below, I can clearly see the different manipulations I am applying to the column "Clean" and I am curious if there is a way to do something that mimics %>% within mutate().
library(dplyr)
phone <- data.frame(Numbers = c("1234567890", "555-3456789", "222-222-2222",
"5131831249", "123.321.1234","(333)444-5555",
"+1 123-223-3234", "555-666-7777 x100"),
stringsAsFactors = F)
phone2 <- phone %>%
mutate(Clean = gsub("[A-Za-z].*", "", Numbers), #remove extensions
Clean = gsub("[^0-9]", "", Clean), #remove parentheses, dashes, etc
Clean = substr(Clean, nchar(Clean)-9, nchar(Clean)), #grab the right 10 characters
Clean = gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", Clean)) #format
phone2
I know there might be a better gsub() command but for the purposes of this question, I want to know if there is a way to pipe these gsub() elements together so that I don't have to keep writing Clean = gsub(...) but also not have to use the method where I embed these inside each other.
It would be fine with me if you answer this question using a simpler example.
Don't fall into the trap of endless pipes. Do the right thing for readability and efficiency: write a function.
phone %>% mutate(Clean = cleanPhone(Numbers))
# Numbers Clean
# 1 1234567890 (123)456-7890
# 2 555-3456789 (555)345-6789
# 3 222-222-2222 (222)222-2222
# 4 5131831249 (513)183-1249
# 5 123.321.1234 (123)321-1234
# 6 (333)444-5555 (333)444-5555
# 7 +1 123-223-3234 (123)223-3234
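# 8 555-666-7777 x100 (555)666-7777
Custom function:
cleanPhone <- function(x) {
x1 <- gsub("[A-Za-z].*", "", x) # remove extensions
x2 <- gsub("[^0-9]", "", x1) # keep digits only
x3 <- substr(x2, nchar(x2)-9, nchar(x2)) # grab the right 10 characters
gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", x3) # format
}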
I guess you need
phone %>%
mutate(Clean = gsub("[A-Za-z].*", "", Numbers) %>%
gsub("[^0-9]", "", .) %>%
substr(., nchar(.)-9, nchar(.)) %>%
gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", .))
# Numbers Clean
#1 1234567890 (123)456-7890
#2 555-3456789 (555)345-6789
#3 222-222-2222 (222)222-2222
#4 5131831249 (513)183-1249
#5 123.321.1234 (123)321-1234
#6 (333)444-5555 (333)444-5555
#7 +1 123-223-3234 (123)223-3234
#8 555-666-7777 x100 (555)666-7777
Even though the question is answered, consider this method that uses magrittr instead of dplyr
require(magrittr)
phone <- data.frame(Numbers = c("1234567890", "555-3456789", "222-222-2222",
"5131831249", "123.321.1234","(333)444-5555",
"+1 123-223-3234", "555-666-7777 x100"),
stringsAsFactors = F)
phone
cleanchain <- phone$Numbers %>%
gsub("[A-Za-z].*", "", .) %>%
gsub("[^0-9]", "", .) %>%
substr(., nchar(.)-9, nchar(.)) %>%
gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", .)
cleanchain
data.frame(old=phone$Numbers,new=cleanchain, stringsAsFactors = F)
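One more way to avoid both nesting and repeated assignments, sketched in base R without any pipe: keep the cleaning steps in a list of small functions and fold them over the input with Reduce. The names steps, apply_steps, and clean are hypothetical; the regular expressions are the ones from the question.

```r
numbers <- c("1234567890", "555-3456789", "222-222-2222",
             "5131831249", "123.321.1234", "(333)444-5555",
             "+1 123-223-3234", "555-666-7777 x100")

# Each step takes a character vector and returns a character vector
steps <- list(
  function(x) gsub("[A-Za-z].*", "", x),          # remove extensions
  function(x) gsub("[^0-9]", "", x),              # keep digits only
  function(x) substr(x, nchar(x) - 9, nchar(x)),  # right 10 characters
  function(x) gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", x)  # format
)

# Feed the output of each step into the next one, in order
apply_steps <- function(x, steps) {
  Reduce(function(acc, f) f(acc), steps, x)
}

clean <- apply_steps(numbers, steps)
data.frame(old = numbers, new = clean, stringsAsFactors = FALSE)
```

The list makes it easy to add, remove, or reorder steps, at the cost of being a little less direct than a single named function.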