Extract emails in brackets - r

I work with gmailr and need to extract email addresses enclosed in angle brackets <> (sometimes several per row). When an element has no brackets (e.g. name#mail.com), I need to keep it as is.
This is an example
x2 <- c("John Smith <jsmith#company.ch> <abrown#company.ch>","no-reply#cdon.com" ,
"<rikke.hc#hotmail.com>")
I need output like:
[1] "jsmith#company.ch" "abrown#company.ch"
[2] "no-reply#cdon.com"
[3] "rikke.hc#hotmail.com"
I tried the following, intending to merge the two results:
library("qdapRegex")
y1 <- ex_between(x2, "<", ">", extract = FALSE)
y2 <- rm_between(x2, "<", ">", extract = TRUE )
The code I run on my actual data:
from <- sapply(msgs_meta, gm_from)
from[sapply(from, is.null)] <- NA
from1 <- rm_bracket(from)
from2 <- ex_bracket(from)
gmail_DK <- gmail_DK %>%
mutate(from = unlist(y1)) %>%
mutate(from = unlist(y2))
But when I apply this to my data (one day of emails only) and unlist the result, I get:
Error in mutate():
! Problem while computing cc = unlist(cc2).
x cc must be size 103 or 1, not 104.
Run rlang::last_error() to see where the error occurred.
I suppose that with data from more days the difference would be even bigger, so I would prefer not to go this way.
An answer in R is preferred, but a solution in, for example, Power Query would be great too.

We may also use base R: split the strings at the whitespace that follows a > (strsplit() with a lookbehind), then capture the substring between < and > with sub(), using the backreference (\\1) to the captured group as the replacement. [^>]+ matches one or more characters that are not >. Elements without brackets do not match the pattern, so sub() returns them unchanged.
sub(".*<([^>]+)>", "\\1", unlist(strsplit(x2,
"(?<=>)\\s+", perl = TRUE)))
[1] "jsmith#company.ch" "abrown#company.ch"
[3] "no-reply#cdon.com" "rikke.hc#hotmail.com"
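The flattened vector above has four elements for three inputs, which is the same kind of length mismatch behind the mutate() size error in the question. A minimal base R sketch that keeps one result per input row is shown below; the helper name extract_addresses and the choice to collapse with ", " are assumptions made for illustration, not part of either answer.

```r
x2 <- c("John Smith <jsmith#company.ch> <abrown#company.ch>",
        "no-reply#cdon.com",
        "<rikke.hc#hotmail.com>")

# Hypothetical helper: returns a list with one character vector per input element
extract_addresses <- function(x) {
  # everything between "<" and ">" (lookarounds need perl = TRUE)
  inside <- regmatches(x, gregexpr("(?<=<)[^<>]+(?=>)", x, perl = TRUE))
  # elements without brackets produce no match; keep the original string for those
  no_match <- lengths(inside) == 0
  inside[no_match] <- as.list(x[no_match])
  inside
}

res <- extract_addresses(x2)

# One string per row, so the length matches the data frame being mutated
collapsed <- vapply(res, paste, character(1), collapse = ", ")
```

Here res can be stored as a list column, while collapsed has exactly one element per original row and can be assigned inside mutate() without unlist().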

Clunky but OK?
library(magrittr) ## provides %>%
(x2
## split into single words/tokens
%>% strsplit(" ")
%>% unlist()
## find e-mail-like strings, with or without brackets
%>% stringr::str_extract("<?[\\w-.]+#[\\w-.]+>?")
## drop elements with no e-mail component
%>% na.omit()
## strip brackets
%>% stringr::str_remove_all("[<>]")
)

Related

extract the shortest and first encounter match between two strings in R

I want the function to return the string that meets the conditions below:
after "def"
in the parentheses right before the first %ile after "def"
So the desired output is "4", not "5". So far, I was able to extract "2)(3)(4". If I change the function to str_extract_all, the output becomes "2)(3)(4" and "5". I couldn't figure out how to fix this problem. Thanks!
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile"
string.after.match <- str_match(string = x,
pattern = "(?<=def)(.*)")[1, 1]
parentheses.value <- str_extract(string.after.match, # get value in ()
"(?<=\\()(.*?)(?=\\)\\%ile)")
parentheses.value
Here is a one liner that will do the trick using gsub()
gsub(".*def.*(\\d+)\\)%ile.*%ile", "\\1", x, perl = TRUE)
Here's an approach that will work with any number of "%ile"s. Based on str_split()
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile(9)%ile"
x %>%
str_split("def", simplify = TRUE) %>%
subset(TRUE, 2) %>%
str_split("%ile", simplify = TRUE) %>%
subset(TRUE, 1) %>%
str_replace(".*(\\d+)\\)$", "\\1")
sub(".*?def.*?(\\d)\\)%ile.*", "\\1", x)
[1] "4"
You can use
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile"
library(stringr)
result <- str_match(x, "\\bdef(?:\\((\\d+)\\))+%ile")
result[,2]
See the R demo online and the regex demo.
Details:
\b - word boundary
def - a def string
(?:\((\d+)\))+ - one or more occurrences of ( + one or more digits (captured into Group 1) + ); only the last occurrence matched is kept in Group 1
%ile - an %ile string.
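The same pattern can also be tried without stringr. The sketch below uses base R's regexec() with perl = TRUE (an assumption that a PCRE-enabled R of version 3.3 or later is in use) and shows that a repeated capturing group keeps only its last iteration.

```r
x <- "abc(0)(1)%ile, def(2)(3)(4)%ile(5)%ile"

# regexec() returns the full match plus each capture group
m <- regmatches(x, regexec("\\bdef(?:\\((\\d+)\\))+%ile", x, perl = TRUE))

full_match <- m[[1]][1]  # the whole matched substring
last_group <- m[[1]][2]  # Group 1: the last "(digits)" before the first %ile after def
```

Because the group is overwritten on each repetition, last_group holds the value from the parentheses immediately before %ile.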

Use stringr to extract the whole word in a string with a particular set of characters in it

I have a series of strings that have a particular set of characters. What I'd like to do is be able to extract just the word from the string with those characters in it, and discard the rest.
I've tried various regex expressions to do it but I either get it to split all the words or it returns the entire string. Following is an example of the kinds of strings. I've been trying to use stringr::str_extract_all() as there are instances where there are more than one word that needs to be pulled out.
data <- c("AlvariA?o, 1961","Andrade-Salas, Pineda-Lopez & Garcia-MagaA?a, 1994", "A?vila & Cordeiro, 2015", "BabiA?, 1922")
result <- unlist(stringr::str_extract_all(data, "regex"))
From this I'd like a result that pulls all the words that have the "A?", like this:
result <- c("AlvariA?o", "MagaA?a", "A?vila", "BabiA?")
It seems really simple but my regex knowledge is just not cutting it at the moment.
To match a literal ? it needs to be escaped as \\?, so A\\? will match A?. \\w matches any word character (for ASCII text, equivalent to [a-zA-Z0-9_]), and * matches the previous token zero or more times, as many times as possible, giving back as needed (greedy). Because - is not a word character, only the part of a hyphenated name that contains A? is returned.
unlist(stringr::str_extract_all(data, "\\w*A\\?\\w*"))
#[1] "AlvariA?o" "MagaA?a" "A?vila" "BabiA?"
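If hyphenated names should be returned whole, the character class can be widened to include the hyphen. The following is a small base R sketch of both variants; the helper extract_words is a hypothetical name, and regmatches()/gregexpr() with perl = TRUE stand in for str_extract_all().

```r
data <- c("AlvariA?o, 1961",
          "Andrade-Salas, Pineda-Lopez & Garcia-MagaA?a, 1994",
          "A?vila & Cordeiro, 2015",
          "BabiA?, 1922")

# Hypothetical helper: all matches of `pattern`, flattened into one vector
extract_words <- function(x, pattern) {
  unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

# Word characters only: stops at the hyphen
words_only <- extract_words(data, "\\w*A\\?\\w*")

# Word characters or hyphens: keeps the full hyphenated name
with_hyphens <- extract_words(data, "[\\w-]*A\\?[\\w-]*")
```

The first call reproduces the result requested in the question; the second returns "Garcia-MagaA?a" instead of "MagaA?a".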
I wrote a function for this, though it is much clunkier than GKi's...
library(quanteda)
library(stringr) # for str_split() and str_replace(); also re-exports %>%
set_of_character <- function(dummy, key){
n <- nchar(key)
dummy %>% str_split(., " ") %>%
unlist %>%
str_replace(., ",", "") %>%
sapply(., function(x) {
x %>%
tokens("character") %>%
unlist() %>%
char_ngrams(n, concatenator = "")
}) %>%
sapply(., function(x) (key %in% x)) %>% which(TRUE) %>% names %>%
return
}
for your example,
set_of_character(data, "A?")
[1] "AlvariA?o" "Garcia-MagaA?a" "A?vila" "BabiA?"

How do I convert a string to a number in R if the string contains a letter?

I am currently helping a friend with his research and am gathering information about different natural disasters that occurred from 2004 to 2016. The data can be found using this link:
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/
When you import it into R it gives helpful information. However, my friend and I are only interested in State, Year, Month, Event, Type, County, direct and indirect deaths and injuries, and property damage. I first extract the columns I need and later combine them back together. The data is currently stored as strings, but I need the Property Damage column to be numeric since it is a cash value. For example, an entry that looks like "8.6k" should become 8600, and all "NA" entries should be replaced with 0.
I have this so far, but it gives me back a vector of NAs. Can anyone think of a better way of doing this?
State<- W2004$STATE
Year<-W2004$YEAR
Month<-W2004$MONTH_NAME
Event<-W2004$EVENT_TYPE
Type<-W2004$CZ_TYPE
County<-W2004$CZ_NAME
Direct_Death<-W2004$DEATHS_DIRECT
Indirect_Death<-W2004$DEATHS_INDIRECT
Direct_Injury<-W2004$INJURIES_DIRECT
Indirect_Injury<-W2004$INJURIES_INDIRECT
W2004$DAMAGE_PROPERTY<-as.numeric(W2004$DAMAGE_PROPERTY)
Damage_Property<-W2004$DAMAGE_PROPERTY
l <- cbind(State, Year, Month, Event, Type, County, Direct_Death, Indirect_Death,
Direct_Injury, Indirect_Injury, Damage_Property)
print(l)
We can try using a case_when() expression here to map each type of unit to a bona fide number. Going with the two examples you actually showed us:
library(dplyr)
x <- c("1.00M", "8.6k")
result <- case_when(
grepl("\\d+k$", x) ~ as.numeric(sub("\\D+$", "", x)) * 1000,
grepl("\\d+M$", x) ~ as.numeric(sub("\\D+$", "", x)) * 1000000,
TRUE ~ as.numeric(sub("\\D+$", "", x))
)
You can extract the letter and use switch(), which is easy to maintain: supporting an additional symbol only requires adding one more case.
First, the setup:
options(scipen = 999) # to prevent R from printing scientific numbers
library(stringr) # to extract letters
This is the sample vector:
numbers_with_letters <- c("1.00M", "8.6k", 50, "NA")
Use lapply() to loop through vector, extract the letter, replace it with a number, remove the letter, convert to numeric, and multiply:
lapply(numbers_with_letters, function(x) {
letter <- str_extract(x, "[A-Za-z]")
letter_to_num <- switch(letter,
k = 1000,
M = 1000000,
1) # 1 is the default option if no letter found
numbers_with_letters <- as.numeric(gsub("[A-Za-z]", "", x))
#remove all NAs and replace with 0
numbers_with_letters[is.na(numbers_with_letters)] <- 0
return(numbers_with_letters * letter_to_num)
})
This returns:
[[1]]
[1] 1000000
[[2]]
[1] 8600
[[3]]
[1] 50
[[4]]
[1] 0
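The same idea can be written as a vectorized base R function that returns a plain numeric vector instead of a list. This is only a sketch: the function name to_number and the multiplier table (including the upper-case K and the B suffix) are assumptions for illustration and should be adjusted to the suffixes actually present in the data.

```r
# Hypothetical helper: "8.6k" -> 8600, "1.00M" -> 1e6, NA or "NA" -> 0
to_number <- function(x) {
  x <- trimws(as.character(x))
  multipliers <- c(k = 1e3, K = 1e3, M = 1e6, B = 1e9)  # assumed suffix table

  suffix <- sub("^[0-9.]*", "", x)                 # what is left after the digits
  value  <- as.numeric(sub("[A-Za-z]+$", "", x))   # numeric part; "" and NA give NA

  # look up the multiplier, defaulting to 1 for unknown or missing suffixes
  mult <- ifelse(suffix %in% names(multipliers), multipliers[suffix], 1)

  out <- value * mult
  out[is.na(out)] <- 0                             # replace missing values with 0
  out
}

res <- to_number(c("8.6k", "1.00M", "50", NA, "NA"))
```

Applied to a column, this would look like W2004$DAMAGE_PROPERTY <- to_number(W2004$DAMAGE_PROPERTY).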
Maybe I'm oversimplifying here, but . . .
library(tidyverse)
data <- tibble(property_damage = c("8.6k", "NA"))
data %>%
mutate(
as_number = if_else(
property_damage != "NA",
# keeps only the numeric part ("8.6k" becomes 8.6); the unit suffix is not applied
str_extract(property_damage, "\\d+\\.*\\d*"),
"0"
),
as_number = as.numeric(as_number)
)

replacing repeated strings using regex in R

I have a string as follows:
text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
I want to eliminate all duplicated addresses, so my expected result is:
expected <- "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
I tried (^[\w|.|:|\/]*),\1+ in regex101.com and it works removing the first repetition of the string (fails at the second). However, if I port it to R's gsub it doesn't work as expected:
gsub("(^[\\w|.|:|\\/]*),\\1+", "\\1", text)
I've tried with perl = FALSE and TRUE to no avail.
What am I doing wrong?
If they are sequential, you just need to modify your regex slightly.
Take out your BOS anchor ^.
Add a cluster group around the comma and backreference, then quantify it (?:,\1)+.
And, lose the pipe symbol | as in a class it's just a literal.
([\w.:/]+)(?:,\1)+
https://regex101.com/r/FDzop9/1
( [\w.:/]+ ) # (1), The address
(?: # Cluster
, \1 # Comma followed by what found in group 1
)+ # Cluster end, 1 to many times
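Ported to R, the pattern above needs doubled backslashes and perl = TRUE so that \w is recognized inside the character class. The sketch below wraps it in a helper whose name, dedup_adjacent, is an assumption for illustration; it also shows that only consecutive repeats are collapsed.

```r
text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

# Hypothetical helper: collapse runs of identical, comma-separated items
dedup_adjacent <- function(x) {
  gsub("([\\w.:/]+)(?:,\\1)+", "\\1", x, perl = TRUE)
}

result <- dedup_adjacent(text)

# A repeat that is not adjacent to its first occurrence is left alone
non_adjacent <- dedup_adjacent("a.png,b.png,a.png")
```

For the question's input, result equals the expected string with one xyz.png address followed by the jpg.png address.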
Note - if you split, take the unique values, and then recombine, every repeat is dropped, including non-adjacent ones, so the original sequence of items is not preserved.
An alternative approach is to split the string on the comma, then unique the results, then re-combine for your single text
paste0(unique(strsplit(text, ",")[[1]]), collapse = ",")
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
You can use functions from tidyverse if your strings are in a dataframe:
text <- c("http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
"http://q.co/imag/qrs.png,http://q.co/imag/qrs.png")
df <- data.frame(no = 1:2, text)
library(tidyverse)
separate_rows(df, text, sep = ",") %>%
distinct %>%
group_by(no) %>%
mutate(text = paste(text, collapse = ",")) %>%
slice(1)
The output is:
# no text
# <int> <chr>
# 1 1 http://x.co/imag/xyz.png,http://x.co/imag/jpg.png
# 2 2 http://q.co/imag/qrs.png

Convert character to numeric without NA in r

I know this question has been asked many times (Converting Character to Numeric without NA Coercion in R, Converting Character\Factor to Numeric without NA Coercion in R, etc.) but I cannot seem to figure out what is going on in this one particular case (Warning message: NAs introduced by coercion). Here is some reproducible data I'm working with.
#dependencies
library(rvest)
library(dplyr)
library(pipeR)
library(stringr)
library(translateR)
#scrape data from website
url <- "http://irandataportal.syr.edu/election-data"
ir.pres2014 <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="content"]/div[16]/table') %>%
html_table(fill = TRUE)
ir.pres2014<-ir.pres2014[[1]]
colnames(ir.pres2014)<-c("province","Rouhani","Velayati","Jalili","Ghalibaf","Rezai","Gharazi")
ir.pres2014<-ir.pres2014[-1,]
#Get rid of unnecessary rows
ir.pres2014<-ir.pres2014 %>%
subset(province!="Votes Per Candidate") %>%
subset(province!="Total Votes")
#Get rid of commas
clean_numbers = function (x) str_replace_all(x, '[, ]', '')
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
#remove any possible whitespace in string
no_space = function (x) gsub(" ","", x)
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(no_space), -province)
This is where things start going wrong for me. I tried each of the following lines of code but I got all NA's each time. For example, I begin by trying to convert the second column (Rouhani) to numeric:
#First check class of vector
class(ir.pres2014$Rouhani)
#convert character to numeric
ir.pres2014$Rouhani.num<-as.numeric(ir.pres2014$Rouhani)
Above returns a vector of all NA's. I also tried:
as.numeric.factor <- function(x) {seq_along(levels(x))[x]}
ir.pres2014$Rouhani2<-as.numeric.factor(ir.pres2014$Rouhani)
And:
ir.pres2014$Rouhani2<-as.numeric(levels(ir.pres2014$Rouhani))[ir.pres2014$Rouhani]
And:
ir.pres2014$Rouhani2<-as.numeric(paste(ir.pres2014$Rouhani))
All those return NA's. I also tried the following:
ir.pres2014$Rouhani2<-as.numeric(as.factor(ir.pres2014$Rouhani))
That created a list of single digit numbers so it was clearly not converting the string in the way I have in mind. Any help is much appreciated.
The reason is what looks like a leading space before the numbers:
> ir.pres2014$Rouhani
[1] " 1052345" " 885693" " 384751" " 1017516" " 519412" " 175608" …
Just remove that as well before the conversion. The situation is complicated by the fact that this character isn’t actually a space, it’s something else:
mystery_char = substr(ir.pres2014$Rouhani[1], 1, 1)
charToRaw(mystery_char)
# [1] c2 a0
The bytes c2 a0 are the UTF-8 encoding of a non-breaking space (U+00A0), which as.numeric() does not treat as whitespace, so it needs to be removed:
str_replace_all(x, rawToChar(as.raw(c(0xc2, 0xa0))), '')
Furthermore, you can simplify your code by applying the same transformation to all your columns at once:
mystery_char = rawToChar(as.raw(c(0xc2, 0xa0)))
to_replace = sprintf('[,%s]', mystery_char)
clean_numbers = function (x) as.numeric(str_replace_all(x, to_replace, ''))
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
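The cleaning step can be checked in isolation on a small vector, without scraping the page. The sketch below uses made-up sample strings that mimic the scraped values and writes the mystery character as the escape "\u00a0"; it relies only on base R, and the vector names are assumptions for illustration.

```r
# Sample values with a leading non-breaking space and thousands separators
raw_values <- c("\u00a01,052,345", "\u00a0885,693")

# The first character is U+00A0, stored as the two bytes c2 a0 in UTF-8
first_char_bytes <- charToRaw(substr(raw_values[1], 1, 1))

# Remove the non-breaking space and the commas as literal strings, then convert
no_nbsp <- gsub("\u00a0", "", raw_values, fixed = TRUE)
no_commas <- gsub(",", "", no_nbsp, fixed = TRUE)
clean_values <- as.numeric(no_commas)
```

Using fixed = TRUE treats both characters as plain text, so no regular expression escaping is involved.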
