Extract words only with R

I have strings like these:
x <-c("DATE TODAY d. 011 + e. 0030 + r. 1061","Now or never d. 003 + e. 011 + g. 021", "Long term is long time (e. 104 to d. 10110)","Time is everything (1012) - /1072, 091A/")
Desired output:
d <- c("DATE TODAY","Now or never","Long term is long time","Time is everything")
After an hour with SO search, I just could not do it. Any help is appreciated.

This bit uses stringr to extract any run of two or more alphabetic characters:
> library(stringr)
> unlist(lapply(str_extract_all(x,"[a-zA-Z][a-zA-Z]+"),paste,collapse=" "))
[1] "DATE TODAY" "Now or never"
[3] "Long term is long time to" "Time is everything"
I'm hoping the "to" missing from your desired output is a mistake on your part. It's a perfectly good word, and you said you wanted to extract words.

The pattern is not very clear, but based on the example shown, here are a couple of ways to get the expected result.
sub('( .\\.| \\().*', '', x)
#[1] "DATE TODAY" "Now or never" "Long term is long time"
#[4] "Time is everything"
or
pat1 <- '(?<=[0-9] )[A-Za-z]+(*SKIP)(*F)|[A-Za-z]{2,}'
sapply(regmatches(x,gregexpr(pat1, x, perl=TRUE)), paste, collapse=" ")
#[1] "DATE TODAY" "Now or never" "Long term is long time"
#[4] "Time is everything"
If "to" is a valid word and the expected result had a typo:
pat1 <- '[A-Za-z]{2,}'
sapply(regmatches(x,gregexpr(pat1, x, perl=TRUE)), paste, collapse=" ")
#[1] "DATE TODAY" "Now or never"
#[3] "Long term is long time to" "Time is everything"
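To make the first sub() call in this answer easier to read, here is a small sketch that builds the same pattern from two named pieces. The variable names (letter_dot, open_paren, cut_from) are only illustrative; the resulting regex is the one used above.

```r
x <- c("DATE TODAY d. 011 + e. 0030 + r. 1061",
       "Now or never d. 003 + e. 011 + g. 021",
       "Long term is long time (e. 104 to d. 10110)",
       "Time is everything (1012) - /1072, 091A/")

# A space, any single character, then a literal dot: matches " d." or " e."
letter_dot <- " .\\."
# A space followed by a literal opening parenthesis: matches " ("
open_paren <- " \\("

# Cut from the first occurrence of either piece through the end of the string
cut_from <- paste0("(", letter_dot, "|", open_paren, ").*")
res <- sub(cut_from, "", x)
res
```

This relies on the unwanted tail always starting with a single-character code such as " d." or with " ("; strings laid out differently would need a different pattern.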

I agree with the others that "to" is a valid word. Here's a stringi approach
library(stringi)
stri_replace_all_regex(x, "\\s?[A-Za-z]?[+[:punct:]0-9]", "")
# [1] "DATE TODAY" "Now or never"
# [3] "Long term is long time to" "Time is everything"

Related

Remove both English and Non-English names from a dataframe

I am working with several hundred rows of junk data. A dummy example:
foo_data <- c("Mary Smith is not here", "Wiremu Karen is not a nice person",
"Rawiri Herewini is my name", "Ajibade Smith is my man", NA)
I need to remove all names (both English and non-English first names and family names) so that my desired output is:
[1] "is not here" " is not a nice person" " is my name"
[4] "is my man" NA
However, using the textclean package, I was only able to remove the English names, leaving the non-English ones:
library(textclean)
textclean::replace_names(foo_data)
[1] " is not here" "Wiremu is not a nice person" "Rawiri Herewini is my name"
[4] "Ajibade is my man" NA
Any help will be appreciated.
You could do the following, which also removes every remaining word that hunspell flags as misspelled:
s <- textclean::replace_names(foo_data)
trimws(gsub(sprintf('\\b(%s)\\b',
paste0(unlist(hunspell::hunspell(s)), collapse = '|')), '', s))
[1] "is not here" "is not a nice person" "is my name" "is my man" NA
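If the names are known in advance, the same sprintf() alternation idea can be sketched in base R without a dictionary. The vector known_names below is hypothetical and only covers the dummy data; it is an assumption for illustration, not something either package provides.

```r
foo_data <- c("Mary Smith is not here", "Wiremu Karen is not a nice person",
              "Rawiri Herewini is my name", "Ajibade Smith is my man", NA)

# Hypothetical lookup vector; in practice this would come from your own name list
known_names <- c("Mary", "Smith", "Wiremu", "Karen", "Rawiri", "Herewini", "Ajibade")

# Build one alternation pattern with word boundaries: \b(Mary|Smith|...)\b
pattern <- sprintf("\\b(%s)\\b", paste(known_names, collapse = "|"))

# Drop the names, squeeze leftover runs of spaces, trim the ends; NA stays NA
cleaned <- trimws(gsub("\\s+", " ", gsub(pattern, "", foo_data)))
cleaned
```

The trade-off is that only listed names are removed, whereas the hunspell approach removes any word missing from its dictionary, whether or not it is a name.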

R - put space at word begins with capital letter, for full column

I have a column imported from an XLSX file into R, where each row holds a sentence without spaces, but each word begins with a capital letter. I tried to use
gsub("([[:upper:]])([[:upper:]][[:lower:]])", "\\1 \\2", x)
but this only works if I convert each row individually.
Example
1 HowDoYouWorkOnThis
2 ThisIsGreatExample
3 ProgrammingIsGood
Expected is
1 How Do You Work On This
2 This Is Great Example
3 Programming Is Good
Is this what you're after?
s <- c("HowDoYouWorkOnThis", "ThisIsGreatExample", "ProgrammingIsGood")
sapply(s, function(x) trimws(gsub("([A-Z])", " \\1", x)))
# HowDoYouWorkOnThis ThisIsGreatExample ProgrammingIsGood
#"How Do You Work On This" "This Is Great Example" "Programming Is Good"
Or using stringr::str_replace_all:
library(stringr)
trimws(str_replace_all(s, "([A-Z])", " \\1"))
#[1] "How Do You Work On This" "This Is Great Example"
#[3] "Programming Is Good"
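Since the question asks about a whole column, here is a minimal sketch of applying the replacement to a data frame column in one call. The data frame df and its column names are hypothetical stand-ins for the imported sheet, and the pattern is a variant that inserts a space only between a lowercase letter and the uppercase letter that follows it.

```r
# Hypothetical data frame standing in for the imported XLSX sheet
df <- data.frame(text = c("HowDoYouWorkOnThis", "ThisIsGreatExample", "ProgrammingIsGood"),
                 stringsAsFactors = FALSE)

# Insert a space wherever a lowercase letter is directly followed by an uppercase one.
# gsub() is vectorised, so the whole column is processed without sapply().
df$spaced <- gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", df$text)
df$spaced
```

Because nothing is inserted before the first capital, no trimws() is needed, and a run of capitals such as "XLSX" is left unsplit.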

how to sort text value word by word

In my problem, each value of a vector is a text composed of multiple words. I want to sort the words within each text; I don't need to sort the vector itself.
e.g.
vect <- c("tim is a man", "sam was a student", "my young daughter")
How can I get vect arranged like this:
"a is man tim"
"a sam student was"
"daughter my young"
Thanks for your help.
We can split each string into words, sort the words, and paste them back together:
sapply(strsplit(vect, "\\s+"), function(x) paste(sort(x), collapse=' '))
#[1] "a is man tim" "a sam student was" "daughter my young"

Ignore part of a string when splitting using regular expression in R

I'm trying to split a string in R (using strsplit) at specific points (a dash, -), but not when the dash is inside square brackets ([ ]).
Example:
xx <- c("Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
xx
[1] "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
[2] "Total Internet-Time Spent Online-Past 7 Days"
should give me something like:
list(c("Radio Stations","Listened to Past Week","Toronto [FM-CFXJ-93.5 (93.5 The Move)]"), c("Total Internet","Time Spent Online","Past 7 Days"))
[[1]]
[1] "Radio Stations" "Listened to Past Week"
[3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
[[2]]
[1] "Total Internet" "Time Spent Online" "Past 7 Days"
Is there a way to do this with a regular expression? The position and number of dashes change with each element of the vector, and there are not always brackets. However, when there are brackets, they are always at the end.
I've tried different things, but none are working:
## Trying to match "-" before "[" in Perl
strsplit(xx, split = "-(?=\\[)", perl=T)
# does nothing
## trying to first extract what follow "[" then splitting what is preceding that
temp <- strsplit(xx, "[", fixed = T)
temp <- lapply(temp, function(yy) substr(head(yy, -1),"-"))
# doesn't work as there are some elements with no brackets...
Any help would be appreciated.
Based on: Regex for matching a character, but not when it's enclosed in square bracket
You can use:
strsplit(xx, "-(?![^\\[]*\\])", perl = TRUE)
[[1]]
[1] "Radio Stations" "Listened to Past Week"
[3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
[[2]]
[1] "Total Internet" "Time Spent Online" "Past 7 Days"
To match a - that is not inside [ and ], you must match any part of the string that is enclosed in [ and ] and skip it, then match - in all other contexts. In abc-def], the - is not between [ and ], so according to the spec it should be split on.
It is done with this regex:
\[[^][]*](*SKIP)(*FAIL)|-
Here,
\[ - matches a [
[^][]* - zero or more chars other than [ and ] (if you use [^]] it will match any char but ])
] - a literal ]
(*SKIP)(*FAIL) - PCRE verbs that discard the current match and make the engine resume searching after the end of the discarded text
| - or
- - a hyphen in other contexts.
Or, to match [...[...] like substrings:
\[[^]]*](*SKIP)(*FAIL)|-
Or, to account for nested square brackets:
(\[(?:[^][]++|(?1))*])(*SKIP)(*FAIL)|-
Here, (\[(?:[^][]++|(?1))*]) matches and captures a [, then zero or more repetitions of either one or more chars other than [ and ] (with [^][]++) or a recursion into the whole capturing group 1 pattern (with (?1)), and finally a ].
See the R demo:
xx <- c("abc-def]", "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
pattern <- "\\[[^][]*](*SKIP)(*FAIL)|-"
strsplit(xx, pattern, perl=TRUE)
# [[1]]
# [1] "abc" "def]"
# [[2]]
# [1] "Radio Stations"
# [2] "Listened to Past Week"
# [3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
# [[3]]
# [1] "Total Internet" "Time Spent Online" "Past 7 Days"
pattern_recursive <- "(\\[(?:[^][]++|(?1))*])(*SKIP)(*FAIL)|-"
xx2 <- c("Radio Stations-Listened to Past Week-Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
strsplit(xx2, pattern_recursive, perl=TRUE)
# [[1]]
# [1] "Radio Stations"
# [2] "Listened to Past Week"
# [3] "Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]"
# [[2]]
# [1] "Total Internet" "Time Spent Online" "Past 7 Days"
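As a side-by-side check of the abc-def] case discussed above, the sketch below runs the negative-lookahead pattern from the earlier answer and the (*SKIP)(*FAIL) pattern on the same input. The variable names are illustrative only.

```r
x <- "abc-def]"

# Lookahead version: refuses to split on a dash whenever a "]" follows
# with no "[" in between, even if no bracket was ever opened.
lookahead <- "-(?![^\\[]*\\])"
res_lookahead <- strsplit(x, lookahead, perl = TRUE)[[1]]
res_lookahead

# SKIP/FAIL version: only dashes inside a complete [...] pair are protected.
skip_fail <- "\\[[^][]*](*SKIP)(*FAIL)|-"
res_skip_fail <- strsplit(x, skip_fail, perl = TRUE)[[1]]
res_skip_fail
```

The lookahead version leaves "abc-def]" in one piece, while the (*SKIP)(*FAIL) version splits it into "abc" and "def]". On inputs where every ] has a matching [, such as the original xx, the two patterns give the same result.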
1) gsubfn Assuming square brackets are balanced and not nested, gsubfn locates each [...] and within those uses gsub to convert dashes to exclamation marks. We then split what is left on the remaining dashes and replace the exclamation marks with dashes.
The regular expression means match a [ followed by the shortest string until the next ].
library(gsubfn)
s <- strsplit(gsubfn("\\[.*?\\]", ~ gsub("-", "!", x), xx), "-")
lapply(s, gsub, pattern = "!", replacement = "-")
which could be expressed using a magrittr pipeline:
library(gsubfn)
library(magrittr)
xx %>%
  gsubfn(pattern = "\\[.*?\\]", replacement = ~ gsub("-", "!", x)) %>%
  strsplit("-") %>%
  lapply(gsub, pattern = "!", replacement = "-")
2) readLines This alternative uses no packages, does not use strsplit and uses only simple fixed regular expressions. It also assumes balanced non-nested square brackets.
Using gsub, it first prefixes each [ with a newline and suffixes each ] with a newline. Then, for each input string, it reads the result into r and, for the odd-positioned strings, replaces each dash with a newline. Finally, it pastes r back together and re-reads it, which has the effect of splitting it at the newlines (which were previously dashes).
lapply(gsub("\\]", "]\n", gsub("\\[", "\n[", xx)), function(x) {
  r <- readLines(textConnection(x))
  i <- seq(1, length(r), 2)
  r[i] <- gsub("-", "\n", r[i])
  readLines(textConnection(paste(r, collapse = "")))
})

R sorting by most commonly occurring

This is probably very simple, but I have a vector of phrases, some of which repeat and some of which don't, and I would like a list of the unique phrases, sorted by how often they occur.
e.g.
vec <- c("hello","hi","hi","greetings","good day", "hi", "hello", "good day","good morning","hello","good day")
sort(unique(vec))
[1] "good day" "good morning" "greetings" "hello" "hi"
I would expect "hi" to be first then followed by "hello" then followed by "good day" etc....
Just sort the frequency table in decreasing order:
sort(table(vec), decreasing=TRUE)
# vec
#     good day        hello           hi good morning    greetings
#            3            3            3            1            1
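Since the question asks for the unique phrases themselves rather than the counts, a small sketch of pulling them out of the table follows. The explicit alphabetical tie-break is an added assumption so that tied phrases come out in a fixed order; the variable names are illustrative.

```r
vec <- c("hello", "hi", "hi", "greetings", "good day", "hi", "hello",
         "good day", "good morning", "hello", "good day")

counts <- table(vec)

# Order by descending count, then alphabetically so ties come out in a fixed order
ord <- order(-as.vector(counts), names(counts))
phrases <- names(counts)[ord]
phrases
```

In this data, "hi", "hello" and "good day" each occur three times, so "hi" cannot come strictly first; any ordering among those three is consistent with sorting by frequency.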

Resources