Grepl for 2 words/phrases in proximity in R (dplyr)

I'm trying to create a filter for a large dataframe. I'm using grepl to search for a series of text strings within a specific column. I've done this for single words/combinations, but now I want to search for two words in close proximity (i.e. the word tumo(u)r within 3 words of the word colon).
I've checked my regular expression on https://www.regextester.com/109207 and my search works there, but it doesn't work within R.
The error I get is
Error: '\W' is an unrecognized escape in character string starting ""\btumor|tumour)\W"
Example below - trying to search for tumo(u)r within 3 words of colon.
Can anyone help?
library(tibble)
example.df <- tibble(number = 1:4,
                     AB = c('tumor of the colon is a very hard disease to cure',
                            'breast cancer is also known as a neoplasia of the breast',
                            'tumour of the colon is bad',
                            'colon cancer is also bad'))
filtered.df <- example.df %>%
  filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB, ignore.case=T)

R uses backslashes as escapes in character strings, and the regex engine does, too, so you need to double your backslashes. This is explained in multiple prior questions on StackOverflow, as well as in the help page brought up by ?regex. Try the escaped operators in a simpler set of tests before attempting complex patterns, and pay closer attention to the placement of parentheses and quotes in the pattern argument.
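For instance, a quick sanity check (an illustrative snippet, not part of the original answer) showing why the doubling matters:
# "\\b" reaches the regex engine as \b, a word boundary:
grepl("\\btumor\\b", c("tumor of the colon", "tumorous growth"))
#> [1]  TRUE FALSE
# A single backslash is consumed by the string parser first; "\W" is not even
# a valid string escape, so R errors before the regex engine ever runs:
# grepl("\Wtumor", "a tumor")   # Error: '\W' is an unrecognized escape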
filtered.df <- example.df %>%
  #filter(grepl(("\btumor|tumour)\W|\w+(\w+\W+){0,3}colon\b"), AB,
  # errors here ....^.^..............^..^...^..^.............^.^
  filter(grepl("(\\btumor|tumour)\\W|\\w+(\\w+\\W+){0,3}colon\\b", AB,
               ignore.case = TRUE))
> filtered.df
# A tibble: 2 × 2
  number AB
   <int> <chr>
1      1 tumor of the colon is a very hard disease to cure
2      3 tumour of the colon is bad

Related

regex: extract segments of a string containing a word, between symbols

Hello, I have a data frame that looks something like this:
dataframe <- data_frame(text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
                                 'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF'))
print(dataframe)
# A tibble: 2 × 1
  text
  <chr>
1 WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12
2 WUFF;other stuff to keep;WIFF2;yes yes IGWIFF
I want to extract the segment of each string containing the word "keep". Note that these segments can be separated from other parts by different symbols, for example , and ;.
The final dataset should look something like this:
final_dataframe <- data_frame(text = c('some words to keep',
                                       'other stuff to keep'))
print(final_dataframe)
# A tibble: 2 × 1
  text
  <chr>
1 some words to keep
2 other stuff to keep
Does anyone know how I could do this?
With stringr ...
library(stringr)
library(dplyr)
dataframe %>%
  mutate(text = trimws(str_extract(text, "(?<=[,;]).*keep")))
# A tibble: 2 × 1
  text
  <chr>
1 some words to keep
2 other stuff to keep
Created on 2022-02-01 by the reprex package (v2.0.1)
I've made great use of the positive lookbehind and positive lookahead group constructs -- check this out: https://regex101.com/r/Sc7h8O/1
If you want to assert that the text you're looking for comes after a character/group -- in your first case the apostrophe -- use (?<=').
If you want to do the same but assert that something follows the match, use the lookahead (?=').
And since you want to match between 0 and unlimited characters surrounding "keep", use .* on either side, and you wind up with (?<=').*keep.*(?=').
I did find in my test that a string like text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12', will also match the c(, which I didn't intend. But I assume your strings are all captured by pairs of apostrophes.
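As a quick illustration of that lookaround pattern in R (a hypothetical snippet, not from the original answer):
library(stringr)
# extract everything between apostrophes that contains "keep"
str_extract("c('some words to keep')", "(?<=').*keep.*(?=')")
#> [1] "some words to keep"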

splitting strings using regex in R

I have a really long list of strings that look like the following, and I want to split each one into several pieces.
strings <- c("https://www.website.com/stats/stat.227.y2020.eon.t879.html",
             "https://www.website.com/stats/stat.229.y2019.eoff.t476.html")
and the desired output is as below:
links                                    Year  Seas Tour
https://www.website.com/stats/stat.227. y2020 eon  t879
https://www.website.com/stats/stat.229. y2019 eoff t476
How can I achieve this using regex?
Using str_match():
stringr::str_match(strings, '.*\\.(y\\d+)\\.(\\w+)\\.(t\\d+)')
You can use the same regex in tidyr::extract if you put strings in a dataframe.
tidyr::extract(data.frame(strings), strings, c("Year", "Seas", "Tour"),
               '\\.(y\\d+)\\.(\\w+)\\.(t\\d+)', remove = FALSE)
#                                                        strings  Year Seas Tour
# 1  https://www.website.com/stats/stat.227.y2020.eon.t879.html y2020  eon t879
# 2 https://www.website.com/stats/stat.229.y2019.eoff.t476.html y2019 eoff t476
Here, we capture the data in 3 parts (capture groups):
1st part - 'y' followed by a number
2nd part - the next word following part 1
3rd part - 't' followed by a number
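If you want to assemble the str_match() output into a data frame yourself, a minimal sketch (column 1 of the result is the full match, the remaining columns are the capture groups):
m <- stringr::str_match(strings, '.*\\.(y\\d+)\\.(\\w+)\\.(t\\d+)')
data.frame(links = strings, Year = m[, 2], Seas = m[, 3], Tour = m[, 4])
#>                                                          links  Year Seas Tour
#> 1  https://www.website.com/stats/stat.227.y2020.eon.t879.html y2020  eon t879
#> 2 https://www.website.com/stats/stat.229.y2019.eoff.t476.html y2019 eoff t476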
You could use {unglue} :
library(unglue)
unglue::unglue_data(
  strings, "{links}.{Year=[^.]+}.{Seas=[^.]+}.{Tour=[^.]+}.html")
#>                                    links  Year Seas Tour
#> 1 https://www.website.com/stats/stat.227 y2020  eon t879
#> 2 https://www.website.com/stats/stat.229 y2019 eoff t476
Here, "[^.]+" means "one or more non-dot characters", which is what we want for Year, Seas, and Tour.

how to use regexpr to identify patterns in icd10 data

I am working with icd10 data, and I wish to create a new variable called complications based on the pattern "E1X.9X" using regular expressions, but I keep getting an error. Please help.
dm_2$icd9_9code <- (E10.49, E11.51, E13.52, E13.9, E10.9, E11.21, E16.0)
dm_2$DM.complications <- "present"
dm_2$DM.complications[regexpr("^E\\d{2}.9$", dm_2$icd9_code)] <- "None"
# Error in dm_2$DM.complications[regexpr("^E\\d{2}.9", dm_2$icd9_code)] <- "None" :
#   only 0's may be mixed with negative subscripts
I want:
icd9_9code complications
E10.49     present
E11.51     present
E13.52     present
E13.9      none
E10.9      none
E11.21     present
This problem has already been solved. The 'icd' R package, which my co-authors and I have been maintaining for five years, can do this. In particular, it uses standardized sets of comorbidities, including the diabetes-with-complications set you seek, from AHRQ, Elixhauser (original), Charlson, etc.
E.g., for ICD-10 AHRQ, you can see the codes for diabetes with complications here. From icd 4.0, these include ICD-10 codes from the WHO, and all years of ICD-10-CM.
icd::icd10_map_ahrq$DMcx
To use them, first just take your patient data frame and try:
library(icd)
pts <- data.frame(visit_id = c("encounter-1", "encounter-2", "encounter-3",
                               "encounter-4", "encounter-5", "encounter-6"),
                  icd10 = c("I70401", "E16", "I70.449", "E13.52", "I70.6", "E11.51"))
comorbid_ahrq(pts)
# and for diabetes with complications only:
comorbid_ahrq(pts)[, "DMcx"]
Or, you can get a data frame instead of a matrix this way:
comorbid_ahrq(pts, return_df = TRUE)
# then you can do:
comorbid_ahrq(pts, return_df = TRUE)$DMcx
If you give an example of the source data and your goal, I can help more.
It seems like there are a few errors in your code; I'll note them below.
You'll want to start by wrapping your ICD codes in quotes: "E13.9"
dm_2 <- data.frame(icd9_9code = c("E10.49", "E11.51", "E13.52", "E13.9",
                                  "E10.9", "E11.21", "E16.0"))
Next, let's use grepl() to search for the particular ICD pattern. Make sure you're applying it to the proper column: your code above attempts to use dm_2$icd9_code rather than dm_2$icd9_9code. (Strictly speaking, the dot in the pattern below should be escaped as \\. to match a literal period; the unescaped . matches any character, though it happens to work for these codes.)
dm_2$DM.complications <- "present"
dm_2$DM.complications[grepl("^E\\d{2}.9$", dm_2$icd9_9code)] <- "None"
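Why grepl() rather than regexpr()? regexpr() returns integer match positions (-1 when there is no match), not a logical vector, which is what triggered the "only 0's may be mixed with negative subscripts" error above; grepl() returns TRUE/FALSE, which is safe for subsetting. A quick illustration (added here for clarity; attributes of the regexpr() result omitted):
regexpr("^E\\d{2}.9$", c("E13.9", "E10.49"))   # integer positions
#> [1]  1 -1
grepl("^E\\d{2}.9$", c("E13.9", "E10.49"))     # logical vector
#> [1]  TRUE FALSE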
Finally,
dm_2
#>   icd9_9code DM.complications
#> 1     E10.49          present
#> 2     E11.51          present
#> 3     E13.52          present
#> 4      E13.9             None
#> 5      E10.9             None
#> 6     E11.21          present
#> 7      E16.0          present
A quick side note -- there is a wonderful ICD package you may find handy as well: https://cran.r-project.org/web/packages/icd/index.html

Extracting text between html tags and labelling it with the tag in R

I am trying to learn how to classify sentences in R.
I have a text file containing sentences in the following format:
<happy>
This did the trick : the boys now have a more distant friendship and David is much happier .
<\happy>
<happy>
When Anna left Inspector Aziz , she was much happier .
<\happy>
I intend to tag the sentences in the following way:
dataset$text = When Anna left Inspector Aziz , she was much happier
dataset$label = happy
I want to extract the sentences and label them with the emotion. How should I approach this? I know that I should use grouping in regex, but I don't know how to do it in R. I am new to it and learning.
rl <- readLines('sentences.txt')
Presently that's badly-formatted XML, for two reasons:
1. XML uses forward slashes in closing tags, not backslashes. In fact, you can't even read that into R as-is, as it will try to parse \h as an escaped character unless you add extra backslashes to escape the backslashes themselves.
2. XML needs to be enclosed in a single root tag. This problem is much easier to remedy (paste on some tags), though.
If, as is not unlikely, your actual data is properly formatted XML, you can use the xml2 or XML packages to parse it. I like purrr::map_df to iterate over the nodes and coerce the results to a data.frame, but you can do the same thing in base R if you prefer.
library(xml2)
library(purrr)
'<happy>
This did the trick : the boys now have a more distant friendship and David is much happier .
</happy>
<happy>
When Anna left Inspector Aziz , she was much happier .
</happy>' %>%
  paste('<sent>', ., '</sent>') %>%          # add enclosing tags
  read_xml() %>%
  xml_find_all('//text()/parent::*') %>%     # select nodes that are parents of text
  map_df(~list(text = xml_text(.x, trim = TRUE),
               emotion = xml_name(.x)))
## # A tibble: 2 × 2
##   text                                                                                          emotion
##   <chr>                                                                                         <chr>
## 1 This did the trick : the boys now have a more distant friendship and David is much happier .  happy
## 2 When Anna left Inspector Aziz , she was much happier .                                        happy
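If your file really does contain the backslash closing tags and you'd rather stay with base R and regex, here is a minimal sketch (assuming each sentence sits on a single line between its opening and closing tag; the vector below stands in for readLines('sentences.txt')):
rl <- c("<happy>",
        "This did the trick : the boys now have a more distant friendship and David is much happier .",
        "<\\happy>",
        "<happy>",
        "When Anna left Inspector Aziz , she was much happier .",
        "<\\happy>")
opens <- grep("^<[a-z]+>$", rl)                    # lines holding an opening tag
data.frame(text  = rl[opens + 1],                  # the sentence follows each opening tag
           label = sub("^<([a-z]+)>$", "\\1", rl[opens]))   # capture the tag name as the label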

Find the most frequently occurring words in a text in R

Can someone help me with how to find the most frequently used two- and three-word phrases in a text using R?
My text is...
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs) but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
The tidytext package makes this sort of thing pretty simple:
library(tidytext)
library(dplyr)
data_frame(text = text) %>%
  unnest_tokens(word, text) %>%    # split words
  anti_join(stop_words) %>%        # take out "a", "an", "the", etc.
  count(word, sort = TRUE)         # count occurrences
# Source: local data frame [73 x 2]
#
#           word     n
#          (chr) (int)
# 1       phrase     8
# 2     sentence     6
# 3        words     4
# 4       called     3
# 5       common     3
# 6  grammatical     3
# 7      meaning     3
# 8         alex     2
# 9         bird     2
# 10    complete     2
# ..         ...   ...
If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:
library(tokenizers)
tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>%    # tokenize bigrams and trigrams
  as_data_frame() %>%                                             # structure as a data frame
  count(value, sort = TRUE)                                       # count occurrences
# Source: local data frame [531 x 2]
#
#           value     n
#          (fctr) (int)
# 1        of the     5
# 2      a phrase     4
# 3  the sentence     4
# 4          as a     3
# 5        in the     3
# 6        may be     3
# 7    a complete     2
# 8   a phrase is     2
# 9    a sentence     2
# 10      a white     2
# ..          ...   ...
In Natural Language Processing, 2-word phrases are referred to as "bi-grams" and 3-word phrases as "tri-grams", and so forth. Generally, a combination of n words is called an "n-gram".
First, we install the ngram package (available on CRAN)
# Install package "ngram"
install.packages("ngram")
Then, we will find the most frequent two-word and three-word phrases
library(ngram)
# To find all two-word phrases in the text "text":
ng2 <- ngram(text, n = 2)
# To find all three-word phrases in the text "text":
ng3 <- ngram(text, n = 3)
Finally, we will print the objects (ngrams) using various methods, as below:
print(ng2, output = "truncated")
print(ng3, output = "full")
get.phrasetable(ng2)
ngram::ngram_asweka(text, min = 2, max = 3)
We can also use Markov Chains to babble new sequences:
# if we are using ng2 (bi-gram)
lnth <- 2
babble(ng = ng2, genlen = lnth)
# if we are using ng3 (tri-gram)
lnth <- 3
babble(ng = ng3, genlen = lnth)
We can split the words and use table to summarize the frequency:
words <- strsplit(text, "[ ,.\\(\\)\"]")
sort(table(words, exclude = ""), decreasing = T)
Simplest?
require(quanteda)
# bi-grams
topfeatures(dfm(text, ngrams = 2, verbose = FALSE))
##     of_the   a_phrase the_sentence     may_be       as_a     in_the  in_common  phrase_is
##          5          4            4          3          3          3          2          2
## is_usually   group_of
##          2          2
# for tri-grams
topfeatures(dfm(text, ngrams = 3, verbose = FALSE))
##     a_phrase_is  group_of_words   of_a_sentence of_the_sentence  for_example_in  example_in_the
##               2               2               2               2               2               2
## in_the_sentence  an_orange_bird orange_bird_with     bird_with_a
##               2               2                2               2
Here's a simple base R approach for the 5 most frequent words:
head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)
#      a    the     of     in phrase
#     21     18     12     10      8
What it returns is an integer vector with the frequency counts, and the names of the vector correspond to the words that were counted.
gsub("[[:punct:]]", "", text) to remove punctuation since you don't want to count that, I guess
strsplit(gsub("[[:punct:]]", "", text), " ") to split the string on spaces
table() to count unique elements' frequency
sort(..., decreasing = TRUE) to sort them in decreasing order
head(..., 5) to select only the top 5 most frequent words
