Split String with second (single) Backslash / R Emojis (Unicode) without Modifier - r

I have a tribble with a chr column that contains the Unicode code points for emojis. I want to split these strings into two columns when needed, i.e. if there is more than one backslash in the whole string. So I need a split at the 2nd backslash. It would also be enough to just delete everything from the 2nd backslash on.
Here is what I tried:
df <- tibble::tribble(
~RUser, ~REmoji,
"User1", "\U0001f64f\U0001f3fb",
"User2", "\U0001f64f",
"User2", "\U0001f64f\U0001f3fc"
)
df %>% mutate(newcol = gsub("\\\\*", "", REmoji))
I found the solution Replace single backslash in R. But in my case I have only one backslash, and I don't understand how to separate the column here.
The result should look like this output:
df2 <- tibble::tribble(
~RUser, ~REmoji1, ~newcol,
"User1", "\U0001f64f", "\U0001f3fb",
"User2", "\U0001f64f", "", #This Field is empty, since there was no Emoji-Modification
"User2", "\U0001f64f", "\U0001f3fc"
)
Thanks a lot!

We could also use substring from base R
df$newcol <- substring(df$REmoji, 2)

Note these \U... are single Unicode code points, not just a backslash + digits/letters.
Using the ^. PCRE regex with sub provides the expected results:
> df %>% mutate(newcol = sub("^.", "", REmoji, perl=TRUE))
# A tibble: 3 x 3
RUser REmoji newcol
<chr> <chr> <chr>
1 User1 "\U0001f64f\U0001f3fb" "\U0001f3fb"
2 User2 "\U0001f64f" ""
3 User2 "\U0001f64f\U0001f3fc" "\U0001f3fc"
Make sure you pass the perl=TRUE argument.
And in order to do the reverse, i.e. keep the first code point only, you can use:
df %>% mutate(newcol = sub("^(.).+", "\\1", REmoji, perl=TRUE))
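Both directions can be combined to get the two columns from the desired output. The following is only a minimal base R sketch: it assumes a plain data.frame instead of the tibble and reuses the column names REmoji1 and newcol from the question.

```r
# Same data as in the question, as a base R data.frame (assumption: no tidyverse)
df <- data.frame(
  RUser  = c("User1", "User2", "User2"),
  REmoji = c("\U0001f64f\U0001f3fb", "\U0001f64f", "\U0001f64f\U0001f3fc"),
  stringsAsFactors = FALSE
)

# First code point only (the base emoji)
df$REmoji1 <- substr(df$REmoji, 1, 1)
# Everything after the first code point (the modifier, or "" if there is none)
df$newcol <- substring(df$REmoji, 2)

df[, c("RUser", "REmoji1", "newcol")]
```

This relies on each string being one emoji code point followed by at most one modifier, as in the sample data; longer sequences would need a different split rule.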

Related

split cell at special character if comma found after first word

Hi, I've got some budget data with names and titles that read "Last, First - Title", and other rows in the same column that read "anything really - ,asd;flkajsd". I'd like to split the column at the "-" IF the first word ends in a ",".
I've tried this:
C22$ITEM2 <- ifelse(grepl(",", C22$ITEM), C22$ITEM, NA)
test <- str_split_fixed(C22$ITEM2, "-", 2)
C22 <- cbind(C22, test)
But I'm getting other cells with commas elsewhere; I need to limit this to just "if the first word ends in a comma".
library(tidyverse)
data <- tibble(data = c("Doe, John - Mr", "Anna, Anna - Ms", " ,asd;flkajsd"))
data
data %>%
# keep only rows whose first word ends with a comma
filter(data %>% str_detect("^[A-Za-z]+,")) %>%
separate(data, into = c("Last", "First", "Title"), sep = "[,-]") %>%
mutate_all(str_trim)
# A tibble: 2 × 3
# Last First Title
# <chr> <chr> <chr>
#1 Doe John Mr
#2 Anna Anna Ms
We may use extract to do this: capture the regex pattern as two groups ((...)). The first group matches a word (\\w+) at the start (^) of the string, followed by a ,, zero or more spaces (\\s*) and another word (\\w+). Then comes the - (with zero or more spaces on either side), and the second capture group matches the word (\\w+) before the end ($) of the string.
library(tidyr)
library(dplyr)
extract(C22, ITEM, into = c("Name", "Title"),
"^(\\w+,\\s*\\w+)\\s*-\\s*(\\w+)$", remove = FALSE) %>%
mutate(Name = coalesce(Name, ITEM), .keep = 'unused')
NOTE: remove = FALSE keeps ITEM so that it is still available inside mutate. Where the regex doesn't match, extract returns NA, so we coalesce Name with the original column to fall back to the original value in those rows.
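The same "split only if the pattern matches, otherwise keep the original" logic can be sketched in base R. This is an illustration under assumptions: the vector item stands in for C22$ITEM, and its values are made up from the examples in the question.

```r
# Hypothetical stand-in for C22$ITEM
item <- c("Doe, John - Mr", "Anna, Anna - Ms", "anything really - ,asd;flkajsd")

# Same pattern as above: "Word, Word - Word"
pattern <- "^(\\w+,\\s*\\w+)\\s*-\\s*(\\w+)$"
matched <- grepl(pattern, item)

result <- data.frame(
  # Group 1 where the pattern matches, the untouched original otherwise
  Name  = ifelse(matched, sub(pattern, "\\1", item), item),
  # Group 2 where the pattern matches, NA otherwise
  Title = ifelse(matched, sub(pattern, "\\2", item), NA_character_),
  stringsAsFactors = FALSE
)
result
```

Rows whose first word does not end in a comma are left whole in Name and get NA in Title.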

Stringr pattern to detect capitalized words

I am trying to write a function to detect words that are written entirely in capital letters.
Current code:
df <- data.frame(title = character(), id = numeric())%>%
add_row(title= "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)
df <- df %>%
mutate(sec_code_1 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][1])
, sec_code_2 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][2])
, sec_code_3 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][3]))
df
Where the output is:
title | id | sec_code_1 | sec_code_2 | sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for | 6 | DONT | WAS |
The first 3-5 letter capitalized word is "THIS", the second should skip "EXAMPLE" (more than 5 letters) and be "DONT", and the third should be "WAS".
i.e.:
title | id | sec_code_1 | sec_code_2 | sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for | 6 | THIS | DONT | WAS
Does anyone know where I'm going wrong? Specifically, how can I denote "space or beginning of string" or "space or end of string" using stringr?
If you run the code with your regex you'll realise 'THIS' is not included in the output at all.
str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS "
This is because you are extracting words with leading and trailing whitespace. 'THIS' does not have leading whitespace because it is at the start of the sentence, hence it does not satisfy the regex pattern. You can use word boundaries (\\b) instead.
str_extract_all(df$title,"\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"
Your code would work if you use the above pattern in it.
Or you could also use :
library(tidyverse)
df %>%
mutate(code = str_extract_all(title,"\\b[A-Z]{3,5}\\b")) %>%
unnest_wider(code) %>%
rename_with(~paste0('sec_code_', seq_along(.)), starts_with('..'))
# title id sec_code_1 sec_code_2 sec_code_3
# <chr> <dbl> <chr> <chr> <chr>
#1 THIS is an EXAMPLE where I DONT get t… 6 THIS DONT WAS
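For comparison, the word-boundary pattern can also be tried without stringr. This is a rough base R sketch for a single title; spreading the matches into sec_code_ columns through a named list is my own choice here, not part of the answers above.

```r
title <- "THIS is an EXAMPLE where I DONT get the output i WAS hoping for"

# All runs of 3 to 5 capital letters that form a whole word
codes <- regmatches(title, gregexpr("\\b[A-Z]{3,5}\\b", title, perl = TRUE))[[1]]
names(codes) <- paste0("sec_code_", seq_along(codes))

# One column per match, next to the original columns
df <- cbind(
  data.frame(title = title, id = 6, stringsAsFactors = FALSE),
  as.data.frame(as.list(codes), stringsAsFactors = FALSE)
)
df
```

"EXAMPLE" is skipped because no 3 to 5 letter slice of it is surrounded by word boundaries on both sides.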

extracting names and numbers using regex

I think I might have some issues with understanding the regular expressions in R.
I need to extract phone numbers and names from a sample vector and create a data-frame with corresponding columns for names and numbers using stringr package functionality.
The following is my sample vector.
phones <- c("Ann 077-789663", "Johnathan 99656565",
"Maria2 099-65-6569 office")
The code that I came up with to extract those is as follows
numbers <- str_remove_all(phones, pattern = "[^0-9]")
numbers <- str_remove_all(numbers, pattern = "[a-zA-Z]")
numbers <- trimws(numbers)
names <- str_remove_all(phones, pattern = "[A-Za-z]+", simplify = T)
phones_data <- data.frame("Name" = names, "Phone" = numbers)
It doesn't work, as it takes the digit in the name and joins with the phone number. (not optimal code as well)
I would appreciate some help in explaining the simplest way for accomplishing this task.
Not a regex expert, however with stringr package we can extract a number pattern with optional "-" in it and replace the "-" with empty string to extract numbers without any "-". For names, we extract the first word at the beginning of the string.
library(stringr)
data.frame(Name = str_extract(phones, "^[A-Za-z]+"),
Number = gsub("-","",str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+")))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569
If you want to stick completely with stringr we can use str_replace_all instead of gsub
data.frame(Name = str_extract(phones, "[A-Za-z]+"),
Number=str_replace_all(str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+"), "-",""))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569
I think Ronak's answer is good for the name part, I don't really have a good alternative to offer there.
For numbers, I would go with "numbers and hyphens, with a word boundary at either end", i.e.
numbers = str_extract(phones, "\\b[-0-9]+\\b") %>%
str_remove_all("-")
# Can also specify that you need at least 5 numbers/hyphens
# in a row to match
numbers2 = str_extract(phones, "\\b[-0-9]{5,}\\b") %>%
str_remove_all("-")
That way, you're not locked into a fixed format for the number of hyphens that appear in the number (my suggested regex allows for any number).
If you (like me) prefer to use base-R and want to keep the regex as simple as possible you could do something like this:
phone_split <- lapply(
strsplit(phones, " "),
function(x) {
name_part <- grepl("[^-0-9]", x)
c(
name = paste(x[name_part], collapse = " "),
phone = x[!name_part]
)
}
)
phone_split
[[1]]
name phone
"Ann" "077-789663"
[[2]]
name phone
"Johnathan" "99656565"
[[3]]
name phone
"Maria2 office" "099-65-6569"
do.call(rbind, phone_split)
name phone
[1,] "Ann" "077-789663"
[2,] "Johnathan" "99656565"
[3,] "Maria2 office" "099-65-6569"
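The word-boundary pattern suggested earlier ("\\b[-0-9]{5,}\\b") also works with base R's regmatches. The sketch below assumes every element contains a phone number, because regmatches with regexpr silently drops elements without a match.

```r
phones <- c("Ann 077-789663", "Johnathan 99656565",
            "Maria2 099-65-6569 office")

# Leading run of letters only, so the "2" in "Maria2" is left out
name <- sub("^([A-Za-z]+).*", "\\1", phones)

# First run of at least 5 digits/hyphens delimited by word boundaries
number_raw <- regmatches(phones, regexpr("\\b[-0-9]{5,}\\b", phones, perl = TRUE))
number <- gsub("-", "", number_raw, fixed = TRUE)

phones_data <- data.frame(Name = name, Phone = number, stringsAsFactors = FALSE)
phones_data
```

The digit in "Maria2" is not picked up as a number because there is no word boundary between "a" and "2".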

regex commas not between two numbers

I am looking for a regex for gsub to remove all the unwanted commas:
Data:
,,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354
Desired result:
12345
12345,1345,1354
123
12345
12354
This is the progress I have made so far:
(,(?!\d+))
You seem to want to remove all leading and trailing commas.
You may do it with
gsub("^,+|,+$", "", x)
The regex contains two alternatives: ^,+ matches one or more commas at the start and ,+$ matches one or more commas at the end. gsub replaces these matches with empty strings.
R demo:
x <- c(",,,,,,,12345","12345,1345,1354","123,,,,,,","12345,",",12354")
gsub("^,+|,+$", "", x)
## [1] "12345" "12345,1345,1354" "123" "12345"
## [5] "12354"
You can also use str_extract from stringr. Thanks to greedy matching, you don't have to specify how many times a digit occurs, the longest match is automatically chosen:
library(dplyr)
library(stringr)
df %>%
mutate(V1 = str_extract(V1, "\\d.+\\d"))
or if you prefer base R:
df$V1 = regmatches(df$V1, gregexpr("\\d.+\\d", df$V1))
Result:
V1
1 12345
2 12345,1345,1354
3 123
4 12345
5 12354
Data:
df = read.table(text = ",,,,,,,12345
12345,1345,1354
123,,,,,,
12345,
,12354")
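One difference between the two approaches is worth checking before choosing: "\\d.+\\d" needs at least three characters (a digit, at least one more character, and a closing digit), so it cannot match one- or two-digit values. A small sketch, where ",7," is a hypothetical extra value that is not in the question's data:

```r
# Question data plus one hypothetical single-digit value
x <- c(",,,,,,,12345", "12345,1345,1354", "123,,,,,,", "12345,", ",12354", ",7,")

# Stripping leading/trailing commas works for any length
trimmed <- gsub("^,+|,+$", "", x)

# The extraction pattern only matches values of three or more characters
has_extract_match <- grepl("\\d.+\\d", x)

data.frame(x, trimmed, has_extract_match, stringsAsFactors = FALSE)
```

For the data shown in the question both approaches give the same result.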

add day to date using regex

Could someone show me how to add a day to a date using a regex?
Here is my starting code:
#Create data frame
a = c("01/2009","03/2006","","12/2003")
b = c("03/2016","05/2010","07/2011","")
df = data.frame(a,b)
Here's what I'd like to create:
#Create data frame
a = c("01/01/2009","03/01/2006","","12/01/2003")
b = c("03/01/2016","05/01/2010","07/01/2011","")
df = data.frame(a,b)
I tried something like this:
df$c <- gsub("(/.*)","\\01/\\1", df$a, perl=TRUE)
But am obviously not getting the results I'm looking for. Am new to regex's and am looking for some help. Thank you.
You needn't use a regex if all you've got are values like mm/yyyy or empty ones. Just use a literal string replacement:
gsub("/","/01/", df$a, fixed=TRUE)
This simply replaces every / with the substring /01/.
If you have to make sure you only change strings falling under 2-digits/4-digits pattern, use
gsub("^(\\d{2})/(\\d{4})$", "\\1/01/\\2", df$a)
where the pattern matches:
^ - start of string
(\\d{2}) - capturing group #1 matching 2 digits
/ - a literal /
(\\d{4}) - capturing group #2 matching 4 digits
$ - end of string.
The replacement pattern contains \\1, a backreference to Group 1 captured value, /01/ as a literal substring and the \\2 backreference (i.e. the value captured into Group 2).
R demo:
> a = c("01/2009","03/2006","","12/2003")
> b = c("03/2016","05/2010","07/2011","")
> df = data.frame(a,b)
> gsub("/","/01/", df$a, fixed=TRUE)
[1] "01/01/2009" "03/01/2006" "" "12/01/2003"
> gsub("^(\\d{2})/(\\d{4})$", "\\1/01/\\2", df$a)
[1] "01/01/2009" "03/01/2006" "" "12/01/2003"
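To apply the anchored replacement to both columns at once, something along these lines should work. It is a sketch that assumes every column of df holds such strings; sub is used instead of gsub since the anchored pattern can match at most once per string.

```r
a <- c("01/2009", "03/2006", "", "12/2003")
b <- c("03/2016", "05/2010", "07/2011", "")
df <- data.frame(a, b, stringsAsFactors = FALSE)

# Insert "01" as the day only where the whole value is 2 digits / 4 digits
add_day <- function(col) sub("^(\\d{2})/(\\d{4})$", "\\1/01/\\2", col)

# df[] keeps the data.frame structure while replacing every column
df[] <- lapply(df, add_day)
df
```

Empty strings do not match the pattern and are left unchanged.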
