Extract words enclosed within asterisks in a column in R - r

I have a dataframe, col1 contains text and within the text there are words enclosed by double asterisks. I want to extract all of these words and put them in another column called col2. If there is more than 1 word enclosed with double asterisks, I would like them to be separated by a comma. col2 in the example shows the desired result.
col1<-c("**sometimes** i code in python",
"I like **walks** in the park",
"I **often** do **exercise**")
col2<-c("**sometimes**","**walks**","**often**,**exercise**")
df<-data.frame(col1, col2, stringsAsFactors = FALSE)
Can anyone suggest a solution?

You may use stringr::str_match_all -
df$col3 <- sapply(stringr::str_match_all(df$col1, '(\\*+.*?\\*+)'),
function(x) toString(x[, 2]))
df
# col1 col2 col3
#1 **sometimes** i code in python **sometimes** **sometimes**
#2 I like **walks** in the park **walks** **walks**
#3 I **often** do **exercise** **often**,**exercise** **often**, **exercise**
* has a special meaning in regex. Here we want to match a literal *, so we escape it with \\. The pattern extracts every value enclosed between runs of one or more *.
str_match_all returns a list of matrices. We are interested in the capture group between (...), which is the 2nd column, hence x[, 2]. Finally, when there is more than one value, we collapse them into one comma-separated string using toString.
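For readers without stringr, the same extraction can be sketched in base R. This is only an illustration, not the answer's own method: it assumes the markers are always exactly two asterisks and swaps in regmatches with gregexpr for str_match_all. The name col3 simply mirrors the answer.

```r
# Sample data from the question
col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

# gregexpr() locates every match per string; regmatches() returns them as a list
hits <- regmatches(col1, gregexpr("\\*\\*[^*]+\\*\\*", col1))

# Collapse each list element into one comma-separated string
col3 <- vapply(hits, paste, character(1), collapse = ",")
col3
# [1] "**sometimes**"          "**walks**"              "**often**,**exercise**"
```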

You can use str_extract_all:
library(stringr)
library(dplyr)
df %>%
mutate(col2 = str_extract_all(col1, "\\*\\*[^* ]+\\*\\*"))
col1 col2
1 **sometimes** i code in python **sometimes**
2 I like **walks** in the park **walks**
3 I **often** do **exercise** **often**, **exercise**
How the regex works:
\\*\\* matches two asterisks
[^* ]+ matches one or more characters that are neither a literal * nor a space
\\*\\* matches two asterisks
If you don't need the asterisks in col2, then this is how you can extract the strings without them:
df %>%
mutate(col2 = str_extract_all(col1, "(?<=\\*\\*)[^* ]+(?=\\*\\*)"))
col1 col2
1 **sometimes** i code in python sometimes
2 I like **walks** in the park walks
3 I **often** do **exercise** often, exercise
How this regex works:
(?<=\\*\\*) positive lookbehind asserting that there must be two asterisks to the left
[^* ]+ matches one or more characters that are neither a literal * nor a space
(?=\\*\\*) positive lookahead asserting that there must be two asterisks to the right
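Note that str_extract_all returns a list, so col2 above is a list column. If a plain character column is preferred, the matches can be collapsed into one string per row. The following is a hedged base R sketch of the lookaround pattern; it needs perl = TRUE because lookarounds are a Perl-style feature, and the names inner and col2 are only illustrative.

```r
col1 <- c("**sometimes** i code in python",
          "I like **walks** in the park",
          "I **often** do **exercise**")

# Lookbehind and lookahead require the Perl-compatible engine
inner <- regmatches(col1,
                    gregexpr("(?<=\\*\\*)[^* ]+(?=\\*\\*)", col1, perl = TRUE))

# toString() joins the matches of each row with ", "
col2 <- vapply(inner, toString, character(1))
col2
# [1] "sometimes"       "walks"           "often, exercise"
```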

Related

How to remove data after certain characters

I need to know how to remove all characters from a value after the leading letter D and the digits that follow it. I am not sure how to start.
I have a data frame and I have a column of type Character
The column is called " Eircode "
The postal codes go from D01 to D24 ( these are Dublin postal codes )
The values are inputted like so
What you see in red is what needs to be removed.
I need to be able to remove the characters after the last digit.
My dataframe is called "MainSchools"
So if the " Eircode " is D03P820, I need to have it as D03 after my change.
I would preferably like to be able to do this with the Tidyverse package if possible.
You may use sub here:
df <- data.frame(Eircode=c("D15P820", "K78YD27", "D03P820"),
stringsAsFactors=FALSE)
df$Eircode <- sub("^(D(?:0[1-9]|1[0-9]|2[0-4])).*$", "\\1", df$Eircode)
df
Eircode
1 D15
2 K78YD27
3 D03
The regex pattern used above matches and captures Dublin postal codes as follows:
D match D
(?:
0[1-9] 01-09
| OR
1[0-9] 10-19
| OR
2[0-4] 20-24
)
Then, we use \1 as the replacement in sub, leaving behind only the 3 character Dublin postal code.
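To see where the alternation draws the line, here is a small check of the same pattern against a few boundary values. The codes D24XY12 and D25AB12 are made-up test inputs, not real data, and perl = TRUE is added here only as a cautious choice for the (?: ) group.

```r
# Hypothetical inputs: two valid Dublin districts, one out of range, one non-Dublin
codes <- c("D03P820", "D24XY12", "D25AB12", "K78YD27")

pattern <- "^(D(?:0[1-9]|1[0-9]|2[0-4])).*$"

# Strings that do not match the pattern are returned unchanged by sub()
trimmed <- sub(pattern, "\\1", codes, perl = TRUE)
trimmed
# [1] "D03"     "D24"     "D25AB12" "K78YD27"
```

Values outside D01 to D24 are left untouched rather than truncated, which is the main difference from the shorter pattern in the next answer.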
I like to use the stringr package for such operations.
library(dplyr)
library(stringr)
df %>% mutate(Eircode = str_extract_all(Eircode, '^[A-Z][0-9]{2}'))
Output with the data from @Tim Biegeleisen:
Eircode
1 D15
2 K78
3 D03

Check which rows with each word in a string are capitalised and space separated

I have a column with string of values as shown below
a=["iam best in the world", "you are awesome", "Iam Good"]
and I need to find the rows in which every word of the string is lower case and the words are separated by spaces.
I know how to convert those to upper case and space separated, but I need to find which rows are lower case and space separated.
I have tried using
grepl("\\b([a-z])\\s([a-z])\\b",aa, perl = TRUE)
We can try using grepl with the pattern \\b[a-z]+(?:\\s+[a-z]+)*\\b:
matches = a[grepl("\\b[a-z]+(?:\\s+[a-z]+)*\\b", a$some_col), ]
matches
v1 some_col
1 1 iam best in the world
2 2 you are awesome
Data:
a <- data.frame(v1=c(1:3),
some_col=c("iam best in the world", "you are awesome", "Iam Good"))
The regex pattern used matches an all-lowercase word, followed by a space and another all-lowercase word, the latter repeated zero or more times. Note that we place word boundaries around the pattern to ensure that we don't get false positive matches from a word beginning with an uppercase letter.
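Because grepl reports a match anywhere in the string, the word boundaries alone would also accept a mixed row such as "Iam good". If the requirement is that every word in the row is lowercase, an anchored variant may be closer to the intent. This is a sketch under that assumption; the fourth test string is hypothetical and not part of the question's data.

```r
x <- c("iam best in the world", "you are awesome", "Iam Good", "Iam good")

# Pattern from the answer: succeeds if any lowercase word run exists
unanchored <- grepl("\\b[a-z]+(?:\\s+[a-z]+)*\\b", x, perl = TRUE)

# Anchored variant: the whole string must consist of lowercase words
anchored <- grepl("^[a-z]+(?:\\s+[a-z]+)*$", x, perl = TRUE)

unanchored
# [1]  TRUE  TRUE FALSE  TRUE
anchored
# [1]  TRUE  TRUE FALSE FALSE
```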
x <- c("iam best in the word ", "you are awesome", "Iam Good")
Here I did something different: first I separated by space, then I checked whether each word starts with a lower-case letter. So the output is a list with, for each phrase, only the lower-case words split by space.
sapply(strsplit(x, " "), function(x) {
x[grepl("^[a-z]", x)]
})
Another idea is to use stri_trans_totitle from stringi package,
a[!!!stringi::stri_trans_totitle(as.character(a$some_col)) == a$some_col,]
# v1 some_col
#1 1 iam best in the world
#2 2 you are awesome
We can convert the column to lower-case and compare it with the actual value. Using @Tim's data
a[tolower(a$some_col) == a$some_col, ]
# v1 some_col
#1 1 iam best in the world
#2 2 you are awesome
If we also need to check for space, we could add another condition with grepl
a[tolower(a$some_col) == a$some_col & grepl("\\s+", a$some_col), ]
We can use filter
library(dplyr)
a %>%
filter(tolower(some_col) == some_col)
# v1 some_col
#1 1 iam best in the world
#2 2 you are awesome

Apply a regex only to the first word of a phrase (defined with spaces)

I have this regex to separate letters from numbers (and symbols) of a word: (?<=[a-zA-Z])(?=([[0-9]|[:punct:]])). My test string is: "CALLE15 CRA22".
I want to apply this regex only to the first word of that sentence (the word is defined with spaces). Namely, I want apply that only to "CALLE15".
One solution is to split the string (sentence) into words and then apply the regex to the first word, but I want to do it all in one regex. Another solution is to use stringr::str_replace() (or sub()), which replaces only the first match, but I need stringr::str_replace_all (or gsub()) for other reasons.
What I need is to insert a space between the two, which I do with the replacement function. The outcome I want is "CALLE 15 CRA22", with the possibility of "CALLE15 CRA 22". I tried many positions for the space, and also the ^ at the beginning, but nothing worked.
https://rubular.com/r/7dxsHdOA3avTdX
Thanks for your help!!!!
I am unsure about your problem statement (see my comment above), but the following reproduces your expected output and uses str_replace_all
ss <- "CALLE15 CRA22"
library(stringr)
str_replace_all(ss, "^([A-Za-z]+)(\\d+)(\\s.+)$", "\\1 \\2\\3")
#[1] "CALLE 15 CRA22"
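The ^ anchor is what confines the replacement to the first word: it can only match at the start of the string, so even a replace-all function touches just one position. A minimal base R sketch of the same idea with gsub, assuming the first word is letters immediately followed by digits (the variable name out is only illustrative):

```r
ss <- "CALLE15 CRA22"

# ^ can match only once, so gsub() behaves like a single replacement here
out <- gsub("^([A-Za-z]+)([0-9]+)", "\\1 \\2", ss)
out
# [1] "CALLE 15 CRA22"
```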
Update
To reproduce the output of the sample string from the comment above
ss <- "CLL.6 N 5-74NORTE"
pat <- c(
"(?<=[A-Za-z])(?![A-Za-z])",
"(?<![A-Za-z])(?=[A-Za-z])",
"(?<=[0-9])(?![0-9])",
"(?<![0-9])(?=[0-9])")
library(stringr)
str_split(ss, sprintf("(%s)", paste(pat, collapse = "|"))) %>%
unlist() %>%
.[nchar(trimws(.)) > 0] %>%
paste(collapse = " ")
#[1] "CLL . 6 N 5 - 74 NORTE"

Using regular expression in string replacement

I have a broken csv file that I am attempting to read into R and repair using a regular expression.
The reason it is broken is that it contains some fields which include a comma but does not wrap those fields in double quotes. So I have to use a regular expression to find these fields, and wrap them in double quotes.
Here is an example of the data source:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00
So you can see that in the third row, the Price field contains a comma but it is not wrapped in double quotes. This breaks the read.table function.
My approach is to use readLines and str_replace_all to wrap the price with commas in double quotes. But I am not good at regular expressions and am stuck.
vector <- readLines(file)
vector_temp <- str_replace_all(vector, ",\\$[0-9]+,\\d{3}\\.\\d{2}", ",\"\\$[0-9]+,\\d{3}\\.\\d{2}\"")
I want the output to be:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,"$1,250.00"
With this format, I can read into R.
Appreciate any help!
lines <- readLines(textConnection(object="DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00"))
library(stringi)
library(tidyverse)
stri_split_regex(lines, ",", n=3, simplify=TRUE) %>%
as_data_frame() %>%
docxtractr::assign_colnames(1)
## DataField1 DataField2 Price
## 1 ID1 Value1
## 2 ID2 Value2 $500.00
## 3 ID3 Value3 $1,250.00
from there you can readr::write_csv() or write.csv()
The extra facilities in the stringi or stringr packages do not seem needed. gsub seems perfectly suited for this. You just need to understand capture groups with paired parentheses (brackets to Brits) and the use of the double-backslash-n convention for referring to capture-group matches in the replacement argument:
txt <- "DataField1,DataField2,Price, extra
ID1,Value1, ,
ID2,Value2,$500.00,
ID3,Value3,$1,250.00, o"
vector<- gsub("([$][0-9]{1,3}([,]([0-9]{3})){0,10}([.][0-9]{0,2}))" , "\"\\1\"", readLines(textConnection(txt)) )
> read.csv(text=vector)
DataField1 DataField2 Price extra
1 ID1 Value1
2 ID2 Value2 $500.00
3 ID3 Value3 $1,250.00 o
You are putting quotes around a specific sequence: a dollar sign and digits, possibly followed by repeated groups of (comma, three digits), and then a period and up to 2 digits. There might be earlier SO questions about formatting as "currency".
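The capture-group idea can be reduced to a smaller sketch on the question's original three-column data. The pattern below is a simplified assumption, not the one from the answer: it only quotes prices that contain at least one thousands separator and exactly two decimals, and leaves everything else alone. The names raw_lines, fixed and prices are illustrative.

```r
raw_lines <- c("DataField1,DataField2,Price",
               "ID1,Value1,",
               "ID2,Value2,$500.00",
               "ID3,Value3,$1,250.00")

# \\1 in the replacement refers to the whole price captured by the outer group
fixed <- gsub("(\\$[0-9]{1,3}(,[0-9]{3})+\\.[0-9]{2})", "\"\\1\"", raw_lines)
fixed[4]
# [1] "ID3,Value3,\"$1,250.00\""

# The quoted field now parses as a single column
prices <- read.csv(text = fixed, stringsAsFactors = FALSE)
ncol(prices)
# [1] 3
```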
Here are some solutions:
1) read.pattern This uses read.pattern in the gsubfn package to read in a file (assumed to be called sc.csv) such that the capture groups, i.e. the parenthesized portions, of the pattern are the fields. This will read in the file and process it all in one step so it is not necessary to use readLines first.
^(.*?), that begins the pattern will match everything from the start until the first comma. Then (.*?), will match to the next comma and finally (.*)$ will match everything else to the end. Normally * is greedy, i.e. it matches as much as it can, but the question mark after it makes it ungreedy. We needed to specify perl=TRUE so that it uses perl regular expressions since by default gsubfn uses tcl regular expressions based on Henry Spencer's regex parser which does not support *? . If you would rather have character columns instead of factor then add the as.is=TRUE argument to read.pattern.
The final line of code removes the $ and , characters from the Price column and converts it to numeric. (Omit this line if you actually want it formatted.)
library(gsubfn)
DF <- read.pattern("sc.csv", pattern = "^(.*?),(.*?),(.*)$", perl = TRUE, header = TRUE)
DF$Price <- as.numeric(gsub("[$,]", "", DF$Price)) ##
giving:
> DF
DataField1 DataField2 Price
1 ID1 Value1 NA
2 ID2 Value2 500
3 ID3 Value3 1250
2) sub This uses a very simple regular expression (just a single character match) and no packages. Using vector as defined in the question, this replaces the first two commas with semicolons. Then it can be read in using sep = ";"
read.table(text = sub(",", ";", sub(",", ";", vector)), header = TRUE, sep = ";")
Add the line marked ## in (1) if you want numeric prices.
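Since read.pattern in (1) needs a file on disk, here is a self-contained sketch of approach (2) combined with the numeric conversion from (1). It assumes the data has exactly three fields and that only the last one can contain commas; raw_lines stands in for the vector read with readLines.

```r
raw_lines <- c("DataField1,DataField2,Price",
               "ID1,Value1,",
               "ID2,Value2,$500.00",
               "ID3,Value3,$1,250.00")

# Each sub() replaces only the first remaining comma, so two calls
# convert the first two separators and leave the price untouched
semi <- sub(",", ";", sub(",", ";", raw_lines))

DF <- read.table(text = semi, header = TRUE, sep = ";",
                 stringsAsFactors = FALSE)

# Strip $ and , then convert; the empty price becomes NA
DF$Price <- as.numeric(gsub("[$,]", "", DF$Price))
DF$Price
# [1]   NA  500 1250
```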

Regular expression on separate function of Tidyr

I need to separate a column into two with tidyr.
The column has text like: I am Sam. The text always has exactly two white spaces, and it can contain any other symbols: [a-z][0-9][!\ºª, etc...].
The problem is that I need to split it into two columns: column one I am, and column two Sam.
I can't find a regular expression to separate at the second blank space.
Could you help me please?
We can use extract from tidyr. We match zero or more characters and place them in a capture group ((.*)), followed by one or more whitespace characters (\\s+) and another capture group that contains only non-whitespace characters (\\S+), to separate the original column into two columns.
library(tidyr)
extract(df1, Col1, into = c("Col1", "Col2"), "(.*)\\s+(\\S+)")
# Col1 Col2
#1 I am Sam
#2 He is Sam
data
df1 <- data.frame(Col1 = c("I am Sam", "He is Sam"), stringsAsFactors=FALSE)
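The way the greedy (.*) pushes the split to the last whitespace can be checked without tidyr. This is a hedged base R sketch of the same two capture groups using sub; the anchors ^ and $ and the names first and second are additions for illustration.

```r
x <- c("I am Sam", "He is Sam")

# Greedy (.*) takes as much as possible, so \\s+ lands on the last gap
pattern <- "^(.*)\\s+(\\S+)$"

first  <- sub(pattern, "\\1", x, perl = TRUE)
second <- sub(pattern, "\\2", x, perl = TRUE)

first
# [1] "I am"  "He is"
second
# [1] "Sam" "Sam"
```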
As an alternative, given:
library(tidyr)
df <- data.frame(txt = "I am Sam")
you can use
separate(df, txt, c("a", "b"), sep="(?<=\\s\\S{1,100})\\s")
# a b
# 1 I am Sam
with separate using stringi::stri_split_regex (ICU engine), or
separate(df, txt, c("a", "b"), sep="^.*?\\s(*SKIP)(*FAIL)|\\s", perl=TRUE)
with the older (?) separate using base::strsplit (Perl engine). See also
strsplit("I am Sam", "^.*?\\s(*SKIP)(*FAIL)|\\s", perl=TRUE)
# [[1]]
# [1] "I am" "Sam"
But it might be a bit "esoterique"...

Resources