How to keep only a specific punctuation mark in a column [duplicate] - r

This question already has answers here:
in R, use gsub to remove all punctuation except period
(4 answers)
Closed 2 years ago.
In the column text, how is it possible to remove all punctuation marks except the ? character?
data.frame(id = c(1), text = c("keep<>-??it--!##"))
expected output
data.frame(id = c(1), text = c("keep??it"))

A more general solution would be to use nested gsub commands that convert ? to a particular unusual string (like "foobar"), get rid of all punctuation, then convert "foobar" back to ?:
gsub("foobar", "?", gsub("[[:punct:]]", "", gsub("\\?", "foobar", df$text)))
#> [1] "keep??it"

Using gsub you could do:
gsub("(\\?+)|[[:punct:]]","\\1",df$text)
[1] "keep??it"

gsub('[[:punct:] ]+', ' ', data) removes all punctuation, which is not what you want.
But this is:
library(stringr)
sapply(df, function(x) str_replace_all(x, "<|>|-|!|#", ""))
id text
[1,] "1" "a"
[2,] "2" "keep??it"
Better IMO than other answers because no need for nesting, and lets you define whichever characters to sub.

Here's another solution using negative lookahead:
gsub("(?!\\?)[[:punct:]]", "", df$text, perl = TRUE)
[1] "keep??it"
The negative lookahead asserts that the next character is not a ? and then matches any punctuation.
Data:
df <- data.frame(id = c(1), text = c("keep<>-??it--!##"))
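If you need to keep more than one punctuation character, the lookahead approach seems to extend naturally. Below is a minimal sketch, assuming a hypothetical helper strip_punct_except (not from any package) and assuming that keep contains no characters that are special inside a regex character class (], ^, \ or -):

```r
# Hypothetical helper, for illustration only: remove every punctuation
# character from x except those listed in `keep`.
strip_punct_except <- function(x, keep = "?") {
  # Negative lookahead: the next character must not be one of `keep`;
  # then match (and delete) a single punctuation character.
  pattern <- paste0("(?![", keep, "])[[:punct:]]")
  gsub(pattern, "", x, perl = TRUE)
}

df <- data.frame(id = c(1), text = c("keep<>-??it--!##"))
only_question <- strip_punct_except(df$text)        # keeps only ?
question_bang <- strip_punct_except(df$text, "?!")  # keeps ? and !
```

With the data above, only_question should be "keep??it", and question_bang additionally retains the single !.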

Related

How to randomly reshuffle letters in words

I am trying to make a word scrambler in R. So I have put some words in a collection and tried to use strsplit() to split the letters of each word in the collection.
But I don't understand how to jumble the letters of a word and merge them back into one word in R. Does anyone know how I can solve this?
This is what I have done
Once you've split the words, you can use sample() to rescramble the letters, and then paste0() with collapse="", to concatenate back into a 'word'
lapply(words, function(x) paste0(sample(strsplit(x, split="")[[1]]), collapse=""))
You can use the stringi package if you want:
> stringi::stri_rand_shuffle(c("hello", "goodbye"))
[1] "oellh" "deoygob"
Here's a one-liner:
lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = "")
[[1]]
[1] "elfi"
[[2]]
[1] "vleo"
[[3]]
[1] "rmsyyet"
Use unlist to get rid of the list:
unlist(lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = ""))
Data:
strings <- c("life", "love", "mystery")
You can use the sample function for this.
Here is an example of doing it for a single word. You can use this within your for-loop:
yourword <- "hello"
# split: Split will return a list with one char vector in it.
# We only want to interact with the vector not the list, so we extract the first
# (and only) element with "[[1]]"
jumble <- strsplit(yourword,"")[[1]]
jumble <- sample(jumble,                 # sample random elements from jumble
                 size = length(jumble),  # as many times as the length of jumble,
                                         # ergo all letters
                 replace = FALSE         # do not sample an element multiple times
)
restored <- paste0(jumble,
                   collapse = ""         # join the letters back into one string
)
As the answer from langtang suggests, you can use the apply family for this, which is more efficient. But maybe this answer helps the understanding of what R is actually doing here.
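For completeness, here is one way the pieces above could be wrapped into a reusable function. This is only a sketch: scramble_words is a made-up name, and set.seed() is there purely so the shuffle is reproducible between runs.

```r
# Hypothetical helper: shuffle the letters of each word in a character vector.
scramble_words <- function(words) {
  letters_per_word <- strsplit(words, "")   # one vector of letters per word
  vapply(letters_per_word,
         function(chars) paste(sample(chars), collapse = ""),
         character(1))                      # always returns a character vector
}

set.seed(42)  # only for reproducibility of the example
strings <- c("life", "love", "mystery")
scrambled <- scramble_words(strings)
scrambled
```

Using vapply instead of lapply means you get a plain character vector back, so no unlist is needed afterwards.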

Extracting a pattern considering different patterns [duplicate]

This question already has answers here:
Can I use an OR statement to indicate the pattern in stringr's str_extract_all function?
(1 answer)
find multiple strings using str_extract_all
(3 answers)
Closed 2 years ago.
Let's say I have these toy vectors
vec <- c("FOO blabla", "fail bla", "blabla FEEbla", "textFOO", "textttt")
to_match <- c("FOO", "FEE")
I would like to obtain a vector of the same length of vec in which to store only the patterns from to_match, if present, otherwise leave NA. Therefore, my desired result would be
c("FOO", NA, "FEE", "FOO", NA)
My first thought was to replace everything that does not match any of the patterns in to_match with empty strings (""). I tried the following code, which does the exact opposite, i.e. it replaces everything that does match any of the patterns in to_match with empty strings.
sub(paste(to_match, collapse = "|"), "", vec)
# [1] " blabla" "fail bla" "blabla bla" "text" "textttt"
However, I tried to invert this behaviour using a caret (^) before a grouping structure, but with scarce success.
# fail
sub(paste0("^(", paste(to_match, collapse = "|"), ")"), "", vec)
# [1] " blabla" "fail bla" "blabla FEEbla" "textFOO" "textttt"
How can I reach the desired output?
Your approach was correct, but you should look at extracting the pattern that you want instead of removing what you don't want.
library(stringr)
str_extract(vec, str_c(to_match, collapse = "|"))
#[1] "FOO" NA "FEE" "FOO" NA
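If you would rather avoid the stringr dependency, the same idea can probably be done in base R with regexpr and regmatches. A minimal sketch (the variable names pattern, m and out are just for illustration):

```r
vec <- c("FOO blabla", "fail bla", "blabla FEEbla", "textFOO", "textttt")
to_match <- c("FOO", "FEE")

pattern <- paste(to_match, collapse = "|")  # "FOO|FEE"
m <- regexpr(pattern, vec)                  # -1 where there is no match

# Start with all NA, then fill in the positions that did match.
out <- rep(NA_character_, length(vec))
out[m != -1] <- regmatches(vec, m)          # regmatches returns matches only
out
#> [1] "FOO" NA    "FEE" "FOO" NA
```

Note that regexpr only finds the first match per element, which mirrors str_extract rather than str_extract_all.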

Match all elements with punctuation mark except asterisk in r [duplicate]

This question already has answers here:
in R, use gsub to remove all punctuation except period
(4 answers)
Closed 2 years ago.
I have a vector vec which has elements with a punctuation mark in it. I want to return all elements with punctuation mark except the one with asterisk.
vec <- c("a,","abc","ef","abc-","abc|","abc*01")
> vec[grepl("[^*][[:punct:]]", vec)]
[1] "a," "abc-" "abc|" "abc*01"
Why does it return "abc*01" if there is a negation [^*] for it?
Maybe you can try grep like below
grep("\\*",grep("[[:punct:]]",vec,value = TRUE), value = TRUE,invert = TRUE) # nested `grep`s for double filtering
or
grep("[^\\*[:^punct:]]",vec,perl = TRUE, value = TRUE) # but this will fail for case `abc*01|` (thanks for feedback from @Tim Biegeleisen)
which gives
[1] "a," "abc-" "abc|"
You could use grepl here:
vec <- c("a,","abc-","abc|","abc*01")
vec[grepl("^(?!.*\\*).*[[:punct:]].*$", vec, perl=TRUE)]
[1] "a," "abc-" "abc|"
The regex pattern used, ^(?!.*\\*).*[[:punct:]].*$, will only match strings that do not contain any asterisk characters while also containing at least one punctuation character:
^ from the start of the string
(?!.*\*) assert that no * occurs anywhere in the string
.* match any content
[[:punct:]] match any single punctuation character (it cannot be *, because the lookahead has already excluded strings containing one)
.* match any content
$ end of the string
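As for why the original pattern [^*][[:punct:]] still returns "abc*01": it only requires some non-asterisk character followed by a punctuation character, and * is itself punctuation, so "c*" qualifies. The sketch below checks that, and also shows an alternative that combines two plain grepl calls instead of a lookahead (the extra element "abc*01|" is added here only for illustration):

```r
vec <- c("a,", "abc", "ef", "abc-", "abc|", "abc*01", "abc*01|")

# What the original pattern actually matched inside "abc*01"
matched <- regmatches("abc*01", regexpr("[^*][[:punct:]]", "abc*01"))
matched
#> [1] "c*"

# Keep elements that contain punctuation but no asterisk anywhere
has_punct <- grepl("[[:punct:]]", vec)
has_star  <- grepl("*", vec, fixed = TRUE)  # fixed = TRUE: literal asterisk
result <- vec[has_punct & !has_star]
result
#> [1] "a,"   "abc-" "abc|"
```

The two-step version is a bit more verbose, but it does not need perl = TRUE and handles the "abc*01|" case as well.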

How would I remove the text before the initial period, the initial period itself and text after final period in a string?

I need to remove the text before the leading period (as well as the leading period) and the text following the last period from a string.
Given this string for example:
"ABCD.EF.GH.IJKL.MN"
I'd like to get the output:
[1] "IJKL"
I have tried the following:
split_string <- sub("^.*?\\.","", string)
split_string <- sub("^\\.+|\\.[^.]*$", "", string)
I believe I have it working for the final period and the text after it. However, the first line needs to be executed multiple times to remove the text before the period in question, e.g. '.I'.
One option in base R is to capture as a group ((...)) the word followed by the dot (\\.) and the word (\\w+) till the end ($) of the string. In the replacement, use the backreference (\\1) of the captured word
sub(".*\\.(\\w+)\\.\\w+$", "\\1", str1)
#[1] "IJKL"
Here, we match characters (.*) till the . (\\. - escaped to get the literal value because . is a metacharacter that will match any character if not escaped), followed by the word captured ((\\w+)), followed by a dot and another word at the end ($)of the string. The replacement part is mentioned above
Or another option is regmatches/regexpr from base R
regmatches(str1, regexpr("\\w+(?=\\.\\w+$)", str1, perl = TRUE))
#[1] "IJKL"
Or another option is word from stringr
library(stringr)
word(str1, -2, sep="[.]")
#[1] "IJKL"
data
str1 <- "ABCD.EF.GH.IJKL.MN"
Here is a janky dplyr/tidyr version, in case the other values are of importance and you want to select them later on; just include them in the select().
library(dplyr)
library(tidyr)
df <- data.frame(x = c("ABCD.EF.GH.IJKL.MN"))
df2 <- df %>%
  separate(x, into = c("var1", "var2", "var3", "var4", "var5")) %>%
  select("var4")
Split into groups at period and take the second one from last.
sapply(strsplit(str1, "\\."), function(x) x[length(x) - 1])
#[1] "IJKL"
Get indices of the periods and use substr to extract the relevant portion
sapply(str1, function(x){
ind = gregexpr("\\.", x)[[1]]
substr(x, ind[length(ind) - 1] + 1, ind[length(ind)] - 1)
}, USE.NAMES = FALSE)
#[1] "IJKL"
These alternatives all use no packages or regular expressions.
1) basename/dirname Assuming the test input s shown in the Note at the end, convert the dots to slashes and then use dirname and basename.
basename(dirname(chartr(".", "/", s)))
## [1] "IJKL" "IJKL"
2) strsplit Using strsplit split the strings at dot creating a list of character vectors, one vector per input string, and then for each such vector take the last 2 elements using tail and the first of those using indexing.
sapply(strsplit(s, ".", fixed = TRUE), function(x) tail(x, 2)[1])
## [1] "IJKL" "IJKL"
3) read.table It is not clear from the question what the general case is but if all the components of s have the same number of dot separated fields then we can use read.table to create a data.frame with one row per input string and one column per dot-separated component. Then take the column just before the last.
dd <- read.table(text = s, sep = ".", as.is = TRUE)
dd[[ncol(dd)-1]]
## [1] "IJKL" "IJKL"
4) substr Again, the general case is not clear but if the string of interest is always at character positions 12-15 then a simple solution is:
substr(s, 12, 15)
## [1] "IJKL" "IJKL"
Note
s <- c("ABCD.EF.GH.IJKL.MN", "ABCD.EF.GH.IJKL.MN")
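Since the general case is not clear from the question, it may be worth guarding against inputs with fewer than two dot-separated fields. A minimal sketch, assuming a hypothetical helper second_to_last and assuming that NA is the desired result when there is no second-to-last field:

```r
# Hypothetical helper: return the second-to-last field of each string,
# or NA when the string has fewer than two fields.
second_to_last <- function(x, sep = ".") {
  parts <- strsplit(x, sep, fixed = TRUE)  # fixed = TRUE: no regex involved
  vapply(parts,
         function(p) if (length(p) >= 2) p[length(p) - 1] else NA_character_,
         character(1))
}

res <- second_to_last(c("ABCD.EF.GH.IJKL.MN", "A.B", "nodots"))
res
#> [1] "IJKL" "A"    NA
```

This is the same idea as the strsplit answers above, just with an explicit length check instead of relying on indexing behaviour.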

Extract date from given string in r

string<-c("Posted 69 months ago (7/4/2011)")
library(gsubfn)
strapplyc(string, "(.*)", simplify = TRUE)
I applied the above function, but nothing happens.
From this string I want to extract only the date part, i.e. 7/4/2011.
The first one shows how to fix the code in the question to give the desired answer. The next 2 solutions are the same except they use different regular expressions. The fourth solution shows how to do it with gsub. The fifth breaks the gsub into two sub calls and the sixth uses read.table.
1) Escape parens The problem is that ( and ) have special meaning in regular expressions so you must escape them if you want to match them literally. By using "[(]" as we do below (or writing them as "\\(" ) they are matched literally. The inner parentheses define the capture group as we don't want that group to include the literal parentheses themselves:
strapplyc(string, "[(](.*)[)]", simplify = TRUE)
## [1] "7/4/2011"
2) Match content Another way to do it is to match the data itself rather than the surrounding parentheses. Here "\\d+" matches one or more digits:
strapplyc(string, "\\d+/\\d+/\\d+", simplify = TRUE)
## [1] "7/4/2011"
You could specify the number of digits if you want to be even more specific but it seems unnecessary here if the data looks similar to that in the question.
3) Match 8 or more digits and slashes Given that there are no other sequences of 8 or more characters consisting only of slashes and digits in the rest of the string we could just pick out that:
strapplyc(string, "[0-9/]{8,}", simplify = TRUE)
## [1] "7/4/2011"
4) Remove text before and after Another way of doing it is to remove everything up to the ( and after the ) like this:
gsub(".*[(]|[)].*", "", string)
## [1] "7/4/2011"
5) sub This is the same as (4) except it breaks the gsub into two sub invocations, one removing everything up to ( and the other removing ) onwards. The regular expressions are therefore slightly simpler.
sub(".*\\(", "", sub("\\).*", "", string))
6) read.table This solution uses no regular expressions at all. It defines sep and comment.char in read.table so that the second column of the result of read.table is the required date or dates.
read.table(text = string, sep = "(", comment.char = ")", as.is = TRUE)$V2
## [1] "7/4/2011"
Note: you don't need c() when defining string:
string <- c("Posted 69 months ago (7/4/2011)")
string2 <- "Posted 69 months ago (7/4/2011)"
identical(string, string2)
## [1] TRUE
We can do this with gsub by matching one or more characters that are not a ( ([^(]+) from the start (^) of the string followed by the ( itself, or (|) the ) at the end ($) of the string, and replacing the match with ""
gsub("^[^(]+\\(|\\)$", "", string)
#[1] "7/4/2011"
Or using capture groups
sub("^[^(]+\\(([^)]+).*", "\\1", string)
#[1] "7/4/2011"
Or with str_extract, we match one or more characters that are not a ) ([^)]+) that follows the ( ((?<=[(]))
library(stringr)
str_extract(string, "(?<=[(])[^)]+")
#[1] "7/4/2011"
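If the goal is an actual Date rather than a character string, the extracted text can be passed to as.Date. This is only a sketch, and it assumes the date is written month/day/year; the question does not say whether 7/4/2011 means July 4 or April 7, so adjust the format accordingly.

```r
string <- "Posted 69 months ago (7/4/2011)"

# Pull out the digits/slashes pattern with base R
date_text <- regmatches(string, regexpr("[0-9]+/[0-9]+/[0-9]+", string))

# ASSUMPTION: month/day/year order; use "%d/%m/%Y" for day/month/year
parsed <- as.Date(date_text, format = "%m/%d/%Y")
format(parsed)
#> [1] "2011-07-04"
```

Once it is a Date, things like difftime or seq work directly on it.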
