extract words from a string into different strings - r

I'm very new to coding, and I have to clean a table with string variables. One of the columns I'm trying to clean holds several variables in a single string. One row of that column looks like this:
string<- ("'casual': True,'classy': False,'divey': False,'hipster': False,'intimate': False,'romantic': False,'touristy': False,'trendy': False,'upscale': False")
I'm trying to extract the Boolean values for each of the categories into separate columns. So my outcome should have 9 columns (one per category), and the rows should hold the True/False values.
What am I supposed to use in this case?

An option is to use str_extract_all to extract the word (\\w+) that follows a : and a space
library(stringr)
as.logical(str_extract_all(string, "(?<=: )\\w+")[[1]])
#[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
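If we need to parse into a data.frame, it would be better to use fromJSON from jsonlite, after wrapping each word in double quotes and dropping the single quotes so that the string becomes valid JSON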
library(jsonlite)
lst1 <- fromJSON(paste0("{", gsub("'", "", gsub("\\b(\\w+)\\b",
'"\\1"', string)), "}"))
data.frame(lapply(lst1, as.logical))
# casual classy divey hipster intimate romantic touristy trendy upscale
#1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or in base R
as.logical(regmatches(string, gregexpr("(?<=: )\\w+", string, perl = TRUE))[[1]])
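Since the question asks for one column per category across many rows, here is a minimal base R sketch of that step. The two-row vector `attrs` and the helper `parse_flags()` are made up for illustration, and the sketch assumes every row lists the same categories in the same order.

```r
# Hypothetical input: two rows shaped like the string in the question.
attrs <- c(
  "'casual': True,'classy': False,'divey': False",
  "'casual': False,'classy': True,'divey': False"
)

# parse_flags() is an illustrative helper, not part of any package.
parse_flags <- function(s) {
  # Keys are the words between single quotes; values are the words after ": ".
  keys <- regmatches(s, gregexpr("(?<=')\\w+(?=')", s, perl = TRUE))[[1]]
  vals <- regmatches(s, gregexpr("(?<=: )\\w+", s, perl = TRUE))[[1]]
  setNames(as.logical(vals), keys)
}

# One named logical vector per row, stacked into a data.frame.
flags <- as.data.frame(do.call(rbind, lapply(attrs, parse_flags)))
flags
```

If rows can list different categories, the jsonlite route above is likely the safer starting point, since it keeps keys and values paired per row.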

Related

Match multiple patterns in a string in R

I'm looking for a way in R to match multiple patterns in a string. For example:
test <- c("abcdefg", "defabc", "abcghdeft" , "abegrabc", "ghdefab", "dabce rdeft", "dedef abceg")
I want to look for 2 exact patterns "abc" and "def" in the string, and return TRUE if both of them are in the string regardless of position and order. So based on that the result would be:
TRUE TRUE TRUE FALSE FALSE TRUE TRUE
I can't seem to find the AND operator in regex like the OR operator |, I've tried other combinations like abc.*def|def.*abc but they didn't work.
Thank you in advance for your help!
We can use grepl
grepl("abc.*def|def.*abc", test)
#[1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE
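The alternation grows quickly once there are more than two patterns. Two other ways to express the same AND, sketched here on the question's `test` vector, are to combine separate grepl calls or to use lookaheads with perl = TRUE. The names `both_simple` and `both_lookahead` are only illustrative.

```r
test <- c("abcdefg", "defabc", "abcghdeft", "abegrabc", "ghdefab",
          "dabce rdeft", "dedef abceg")

# One grepl call per pattern, combined with the vectorised AND.
both_simple <- grepl("abc", test, fixed = TRUE) & grepl("def", test, fixed = TRUE)

# Each lookahead must succeed from the start of the string.
both_lookahead <- grepl("^(?=.*abc)(?=.*def)", test, perl = TRUE)

both_lookahead
#[1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE
```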

str_detect, validate phone number with specific pattern

I want to check whether the numbers in my vector match a specific format (nnn.nnn.nnnn). I expect the code to return a logical vector (FALSE, TRUE, FALSE, TRUE, FALSE, FALSE), but the last element returns TRUE when I want it to be FALSE.
library(stringr)
numbers <- c('571-566-6666', '456.456.4566', 'apple', '222.222.2222', '222 333 4444', '2345.234.2345')
str_detect(numbers, "[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}")
If I use:
str_detect(numbers, "[:digit:]{4}\\.[:digit:]{3}\\.[:digit:]{4}")
I get (FALSE, FALSE, FALSE, FALSE, FALSE, TRUE), so I know the pattern for exact matches works, but I am not sure why the first block of code returns TRUE for the last element when there are 4 digits, not 3, before the first '.'.
It is because that last value has '345.234.2345' at the end, and your pattern does not require the match to start at the beginning of the string and stop at its end.
Try this pattern:
"^[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}$"
If you wanted to match a number that may also sit inside a longer string, set off from the surrounding text by a space or by the start or end of the string, a more general pattern would be:
"(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)"
Testing:
numbers <- c('571-566-6666', '456.456.4566', 'apple', '222.222.2222', '222 333 4444',
'2345.234.2345', "interior test 456.456.4566 other",
'456.456.4566 beginning test', "end test 456.456.4566")
str_detect(numbers, "(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)")
#[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
And as Wictor points out, you could also use the word boundary \b, as long as you double the backslash in R patterns.
grepl("\\b[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}\\b", numbers)
[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
Caveat: The stringr functions (which if I remember correctly are based on stringi functions) appear to be different than the "ordinary" R regex functions in that they allow using the special character classes without double bracketing.
grepl("(^|[ ])[:digit:]{3}\\.[:digit:]{3}\\.[:digit:]{4}([ ]|$)", numbers)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
grepl("(^|[ ])[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}([ ]|$)", numbers)
[1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
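This is not an implicit setting of "fixed" to TRUE. The difference comes from the regex engines: stringr uses ICU (through stringi), which accepts a bare [:digit:] as a character class, while base R's default engine reads [:digit:] as an ordinary bracket list of the characters :, d, i, g and t, so the class name has to be wrapped in a second pair of brackets.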
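For the original six values, the strict whole-string check can also be written in base R. This is a minimal sketch using the double-bracketed class; the variable name `anchored` is only illustrative.

```r
numbers <- c('571-566-6666', '456.456.4566', 'apple', '222.222.2222',
             '222 333 4444', '2345.234.2345')

# ^ and $ force the whole string to have the form nnn.nnn.nnnn
anchored <- grepl("^[[:digit:]]{3}\\.[[:digit:]]{3}\\.[[:digit:]]{4}$", numbers)
anchored
#[1] FALSE TRUE FALSE TRUE FALSE FALSE
```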

R: Checking if multiple elements of a vector appear in vector of strings

I'm trying to create a function that checks if all elements of a vector appear in a vector of strings. The test code is presented below:
test_values = c("Alice", "Bob")
test_list = c("Alice,Chris,Mark", "Alice,Bob,Chris", "Alice,Mark,Zach", "Alice,Bob,Mark", "Mark,Bob,Zach", "Alice,Chris,Bob", "Mark,Chris,Zach")
I would like the output for this to be FALSE TRUE FALSE TRUE FALSE TRUE FALSE.
I first thought I'd be able to switch the | to & in the command grepl(paste(test_values, collapse='|'), test_list) to get when Alice and Bob are both in the string instead of when either of them appear, but I was unable to get the correct answer.
I also would rather not use the command: grepl(test_values[1], test_list) & grepl(test_values[2], test_list) because the test_values vector will change dynamically (varying from length 0 to 3), so I'm looking for something to take that into account.
We can use Reduce with grepl
Reduce(`&`, lapply(test_values, grepl, test_list))
#[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE
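Because test_values can have length 0, it may help to give Reduce a starting value: without one, Reduce returns NULL for an empty list. The sketch below wraps this in a hypothetical helper, `contains_all()`, and uses fixed = TRUE so that the names are treated as literal text, not regular expressions. Treating "no values" as "every string passes" is an assumption about the desired behaviour.

```r
test_values <- c("Alice", "Bob")
test_list <- c("Alice,Chris,Mark", "Alice,Bob,Chris", "Alice,Mark,Zach",
               "Alice,Bob,Mark", "Mark,Bob,Zach", "Alice,Chris,Bob",
               "Mark,Chris,Zach")

# contains_all() is an illustrative helper, not an existing function.
contains_all <- function(values, strings) {
  # One logical vector per value, each the same length as `strings`.
  hits <- lapply(values, grepl, x = strings, fixed = TRUE)
  # Start from all TRUE so that an empty `values` yields TRUE everywhere.
  Reduce(`&`, hits, rep(TRUE, length(strings)))
}

contains_all(test_values, test_list)
#[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE
```

Note that this is still substring matching, so "Bob" would also be found inside "Bobby".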

regex to detect string separated by non alphabet characters (or nothing)

I'd like to write a regex to detect the string "el" (stands for "eliminated" and is inside a bunch of poorly formatted score data).
For example
tests <- c("el", "hello", "123el", "el/27")
Here I'm looking for the result TRUE, FALSE, TRUE, TRUE. My sad attempts which don't work for obvious reasons:
library(stringr)
str_detect(tests, "el") # TRUE TRUE TRUE TRUE
str_detect(tests, "[^a-z]el") # FALSE FALSE TRUE FALSE
Use the regex (\\b|[^[:alpha:]])el(\\b|[^[:alpha:]]) along with grepl:
> tests <- c("el", "hello", "123el", "el/27")
> y <- grepl("(\\b|[^[:alpha:]])el(\\b|[^[:alpha:]])", tests)
> y
[1] TRUE FALSE TRUE TRUE
The condition for el to appear as a separate token is that each side of it is either a word boundary (\b) or a non-alphabetic character (the character class [^[:alpha:]] in R).
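The same idea, "no letter directly before or after el", can also be written with negative lookarounds and perl = TRUE. This is an alternative sketch, not the answer's pattern; the two extra test strings "elk" and "el27" are added here for illustration.

```r
tests <- c("el", "hello", "123el", "el/27", "elk", "el27")

# (?<![[:alpha:]]) : not preceded by a letter
# (?![[:alpha:]])  : not followed by a letter
isolated <- grepl("(?<![[:alpha:]])el(?![[:alpha:]])", tests, perl = TRUE)
isolated
#[1] TRUE FALSE TRUE TRUE FALSE TRUE
```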

Partial string matching with grep and regular expressions

I have a vector of three character strings, and I'm trying to write a command that will find which members of the vector have a particular letter as the second character.
As an example, say I have this vector of 3-letter strings...
example = c("AWA","WOO","AZW","WWP")
I can use grepl and glob2rx to find strings with W as the first or last character.
> grepl(glob2rx("W*"),example)
[1] FALSE TRUE FALSE TRUE
> grepl(glob2rx("*W"),example)
[1] FALSE FALSE TRUE FALSE
However, I don't get the right result when I try using it with glob2rx("*W*")
> grepl(glob2rx("*W*"),example)
[1] TRUE TRUE TRUE TRUE
I am sure my understanding of regular expressions is lacking, however this seems like a pretty straightforward problem and I can't seem to find the solution. I'd really love some assistance!
For future reference, I'd also really like to know if I could extend this to the case where I have longer strings. Say I have strings that are 5 characters long, could I use grepl in such a way to return strings where W is the third character?
I would have thought that this was the regex way:
> grepl("^.W",example)
[1] TRUE FALSE FALSE TRUE
If you wanted a particular position that is prespecified then:
> grepl("^.{1}W",example)
[1] TRUE FALSE FALSE TRUE
This would allow programmatic calculation:
pos <- 2      # position of the character to test
n <- pos - 1  # number of characters to skip before it
grepl(paste0("^.{", n, "}W"), example)
[1] TRUE FALSE FALSE TRUE
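If you would like to stay with glob2rx, note that in a glob * stands for any number of characters (including none), so "*W*" only asks for a W somewhere, and all four strings contain one. The glob wildcard for exactly one character is ?, so a sketch of the second-character test could look like this (the name `second_is_w` is illustrative):

```r
example <- c("AWA", "WOO", "AZW", "WWP")

# "?" = exactly one character, "W" = literal, "*" = anything after it
second_is_w <- grepl(utils::glob2rx("?W*"), example)
second_is_w
#[1] TRUE FALSE FALSE TRUE
```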
If you have 3-character strings and need to check the second character, you could just test the appropriate substring instead of using regular expressions:
example = c("AWA","WOO","AZW","WWP")
substr(example, 2, 2) == "W"
# [1] TRUE FALSE FALSE TRUE
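For the follow-up about 5-character strings with W as the third character, both approaches carry over by changing the position. A small sketch, where the vector `five` is made up for illustration:

```r
five <- c("AAWAA", "WAAAA", "ABWCD", "AAAAW")  # made-up test strings
pos <- 3

# Regex: skip pos - 1 characters from the start, then require a W.
by_regex <- grepl(paste0("^.{", pos - 1, "}W"), five)

# Substring: compare the single character at position pos.
by_substr <- substr(five, pos, pos) == "W"

by_regex
#[1] TRUE FALSE TRUE FALSE
```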
