Extracting non-capitalized words using Regex - r

I'm trying to extract non-capitalized words using Regex in R. The data contains several columns (e.g. word, word duration, syllable, syllable duration ...), and in the Word column, there are tons of words that are either capitalized (e.g. EAT), non-capitalized (e.g. see), or in curly brackets (e.g. {VAO}). I want to extract all the words that are not capitalized in the word column. The following is a small example data frame with an expected outcome.
file word
1 sp
2 WHAT
3 ISN'
4 'EM
5 O
6 {PPC}
OUTCOME:
"sp", "{PPC}"
This is what I have tried so far:
> unique(full_dat$word[!grepl("^[A-Z].*[A-Z]|\\d", full_dat$word) & !grepl(" [[:punct:]] ", full_dat$word)])
The result is the following:
[1] "sp" "{OOV}"
[3] "O" "I"
[5] "A" NA
[7] "{XX}" "'S"
[9] "{LG}" "Y"
[11] "B" "'VE"
[13] "N" "{GAP_ANONYMIZATION_NAME}"
[15] "'EM" "W"
[17] "{GAP_ANONYMIZATION}" "K"
This looks good, since I can easily recognize the non-capitalized words, but there are still some capitalized words in this list.... How can I modify the code, so it shows only lower-case words and curly-bracketed words?

With the stringr library you can simply do this:
library(stringr)
x <- c("HELLO WORLD", "hello world", "Hello World", "hello World", "HeLLo wOrlD")
str_extract(x, "[A-Z]+")
This returns the first run of upper-case letters found in each string:
[1] "HELLO" NA "H" "W" "H"
You can drop the NAs with the na.omit function, which also reports the positions of the NAs, that is, the positions of the strings that contain no capital letters:
na.omit(str_extract(x, "[A-Z]+"))
[1] "HELLO" "H" "W" "H"
attr(,"na.action")
[1] 2
attr(,"class")
[1] "omit"
But you can also see which positions contain no capital letters by doing:
is.na(str_extract(x, "[A-Z]+"))
[1] FALSE TRUE FALSE FALSE FALSE
I hope this is helpful 😀
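To come back to the original question (keeping only lower-case words and curly-bracketed words), one possible approach is to whitelist those two shapes instead of excluding capitals. This is only a sketch under assumptions: the vector `word` is a hypothetical stand-in for full_dat$word, "lower-case" is taken to mean ASCII letters a-z only, and perl = TRUE is used so that [a-z] does not depend on the locale.

```r
# Hypothetical stand-in for full_dat$word, built from the example in the question
word <- c("sp", "WHAT", "ISN'", "'EM", "O", "{PPC}", NA)

# Keep entries that consist only of lower-case ASCII letters,
# or that are wrapped in curly brackets (NA entries yield FALSE)
keep <- grepl("^[a-z]+$|^\\{.*\\}$", word, perl = TRUE)

lower_or_braced <- unique(word[keep])
print(lower_or_braced)
```

If lower-case words may also contain apostrophes or hyphens, the first alternative would need to be widened accordingly.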

Related

Extract words that are repeated from one sentence to the next

I have sentences from spoken conversation and would like to identify the words that are repeated from sentence to sentence; here's some illustrative data (in reproducible format below)
df
# A tibble: 10 x 1
Orthographic
<chr>
1 "like I don't understand sorry like how old's your mom"
2 "eh sixty-one"
3 "yeah (...) yeah yeah like I mean she's not like in the risk age group but still"
4 "yeah"
5 "HH"
6 "I don't know"
7 "yeah I talked to my grandparents last night and last time I talked to them it was like two weeks…
8 "yeah"
9 "she said you should come home probably "
10 "no and like why would you go to the airport where people have corona sit in the plane where peop…
I have had some success extracting the repeated words using a for loop, but I also get some strange results. Here's what I've been doing so far:
library(stringr)
# initialize pattern and new column `rept` in `df`:
pattern1 <- c()
df$rept <- NA
# for loop:
for(i in 2:nrow(df)){
pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(df$Orthographic[i-1], " ")), collapse = "|"), ")\\b")
df$rept[i] <- str_extract_all(df$Orthographic[i], pattern1[i-1])
}
The results are these; result # 10 is strange/incorrect - it should be character(0). How can the code be improved so that no such strange results are obtained?
df$rept
[[1]]
[1] NA
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "yeah"
[[5]]
character(0)
[[6]]
character(0)
[[7]]
[1] "I" "I" "don't" "I" "I" "don't" "I"
[[8]]
[1] "yeah"
[[9]]
character(0)
[[10]]
[1] "" "" "" "" "" "" "" "" "" "" "you" "" "" "" "" ""
[17] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[33] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[49] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[65] "" "" "" "" "" "" "" "" "" "" "" ""
Reproducible data:
structure(list(Orthographic = c("like I don't understand sorry like how old's your mom",
"eh sixty-one", "yeah (...) yeah yeah like I mean she's not like in the risk age group but still",
"yeah", "HH", "I don't know", "yeah I talked to my grandparents last night and last time I talked to them it was like two weeks ago and they at that time they were already like maybe you should just get on a plane and come home and like you can't just be here and and then last night they were like are you sure you don't wanna come home and I was I don't think I can and my mom said the same thing",
"yeah", "she said you should come home probably ", "no and like why would you go to the airport where people have corona sit in the plane where people have corona to get there where people have corona and then go and take it to your family"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
When you debug such regex issues concerning dynamic patterns with word boundaries, there are a lot of things to keep in mind (so as to understand how to best approach the whole issue).
First, check the patterns you get:
for(i in 2:nrow(df)) {
pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(df$Orthographic[i-1], " ")), collapse = "|"), ")\\b")
print(pattern1[i-1])
}
Here is the list of regexps:
[1] "\\b(like|I|don't|understand|sorry|like|how|old's|your|mom)\\b"
[1] "\\b(eh|sixty-one)\\b"
[1] "\\b(yeah|(...)|yeah|yeah|like|I|mean|she's|not|like|in|the|risk|age|group|but|still)\\b"
[1] "\\b(yeah)\\b"
[1] "\\b(HH)\\b"
[1] "\\b(I|don't|know)\\b"
[1] "\\b(yeah|I|talked|to|my|grandparents|last|night|and|last|time|I|talked|to|them|it|was|like|two|weeks|ago|and|they|at|that|time|they|were|already|like|maybe|you|should|just|get|on|a|plane|and|come|home|and|like|you|can't|just|be|here|and|and|then|last|night|they|were|like|are|you|sure|you|don't|wanna|come|home|and|I|was|I|don't|think|I|can|and|my|mom|said|the|same|thing)\\b"
[1] "\\b(yeah)\\b"
[1] "\\b(she|said|you|should|come|home|probably|)\\b"
Look at the second pattern: \b(eh|sixty-one)\b. What if the first word were sixty? The \b(sixty|sixty-one)\b regex would never match sixty-one, because sixty would match first (the position before the hyphen is a word boundary) and the other alternative would not even be considered. When you use word boundaries and some alternatives may contain more than one word, always sort the alternatives by length in descending order to ensure the longest alternative is tried first. Here, you do not need to sort the alternatives because you only have single-word alternatives.
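The ordering effect described above can be checked with a small sketch. The two alternatives are the hypothetical sixty / sixty-one case from the paragraph, not taken from the data, and base R with perl = TRUE is used here instead of stringr (an assumption made so the example has no dependencies; both engines try alternatives from left to right).

```r
alts <- c("sixty", "sixty-one")   # hypothetical alternatives
input <- "sixty-one"

# Alternatives in their original order: the shorter one wins
unsorted <- paste0("\\b(", paste(alts, collapse = "|"), ")\\b")
first_unsorted <- regmatches(input, regexpr(unsorted, input, perl = TRUE))

# Alternatives sorted by length, longest first
by_length <- alts[order(nchar(alts), decreasing = TRUE)]
sorted <- paste0("\\b(", paste(by_length, collapse = "|"), ")\\b")
first_sorted <- regmatches(input, regexpr(sorted, input, perl = TRUE))

print(first_unsorted)  # "sixty"
print(first_sorted)    # "sixty-one"
```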
Look at the next pattern, which contains the |(...)| alternative. It matches any three chars other than line break chars and captures them into a group. However, the string contained a (...) substring where the parentheses and dots are literal chars. To match them with a regex, you need to escape all special chars.
Next, you consider "words" to be non-whitespace chunks of chars because you use str_split(df$Orthographic[i-1], " "). This invalidates the approach with \b altogether: you need to use whitespace boundaries, (?<!\S) at the start and (?!\S) at the end, instead of \bs. Moreover, since you only split on a single space, you may get empty alternatives if there are two or more consecutive spaces in the input string. You need to use the \s+ pattern here to split on one or more whitespace chars.
Next, there is a trailing space in the last but one string, and it creates an empty alternative. You need to trimws your input before splitting it into tokens/words.
This is what you need to do with the regex solution: add the escape.for.regex function:
## Escape for regex
escape.for.regex <- function(string) {
gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
and then use it to escape the tokens that you obtain by splitting the trimmed df$Orthographic[i-1] with the \s+ regex, apply unique to remove duplicates (which makes the pattern shorter and more efficient), and add the whitespace boundaries:
for(i in 2:nrow(df)){
pattern1[i-1] <- paste0("(?<!\\S)(?:", paste0(escape.for.regex(unique(unlist(str_split(trimws(df$Orthographic[i-1]), "\\s+")))), collapse = "|"), ")(?!\\S)")
df$rept[i] <- str_extract_all(df$Orthographic[i], pattern1[i-1])
}
See the list of regexps:
[1] "(?<!\\S)(?:like|I|don't|understand|sorry|how|old's|your|mom)(?!\\S)"
[1] "(?<!\\S)(?:eh|sixty-one)(?!\\S)"
[1] "(?<!\\S)(?:yeah|\\(\\.\\.\\.\\)|like|I|mean|she's|not|in|the|risk|age|group|but|still)(?!\\S)"
[1] "(?<!\\S)(?:yeah)(?!\\S)"
[1] "(?<!\\S)(?:HH)(?!\\S)"
[1] "(?<!\\S)(?:I|don't|know)(?!\\S)"
[1] "(?<!\\S)(?:yeah|I|talked|to|my|grandparents|last|night|and|time|them|it|was|like|two|weeks|ago|they|at|that|were|already|maybe|you|should|just|get|on|a|plane|come|home|can't|be|here|then|are|sure|don't|wanna|think|can|mom|said|the|same|thing)(?!\\S)"
[1] "(?<!\\S)(?:yeah)(?!\\S)"
[1] "(?<!\\S)(?:she|said|you|should|come|home|probably)(?!\\S)"
Output:
> df$rept
[[1]]
NULL
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "yeah"
[[5]]
character(0)
[[6]]
character(0)
[[7]]
[1] "I" "I" "don't" "I" "I" "don't" "I"
[[8]]
[1] "yeah"
[[9]]
character(0)
[[10]]
[1] "you"
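To see the pieces of this answer working together on something small, here is a self-contained sketch. The two strings `previous` and `current` are made up for illustration, and base R (strsplit, gregexpr with perl = TRUE) is used in place of the stringr calls as an assumption, so the output of the real loop may be shaped slightly differently.

```r
## Escape for regex (same helper as in the answer above)
escape.for.regex <- function(string) {
  gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}

# Hypothetical pair of consecutive utterances: note the double space
# and the trailing space in the first one
previous <- "yeah (...)  like "
current  <- "like (...) likes"

# Trim, split on runs of whitespace, drop duplicate tokens
tokens <- unique(strsplit(trimws(previous), "\\s+")[[1]])

# Escape each token and wrap the alternation in whitespace boundaries
pattern <- paste0("(?<!\\S)(?:",
                  paste(escape.for.regex(tokens), collapse = "|"),
                  ")(?!\\S)")

repeated <- regmatches(current, gregexpr(pattern, current, perl = TRUE))[[1]]
print(repeated)  # "like" and "(...)" are found, "likes" is not
```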
Depending on whether it is sufficient to identify repeated words, or also their repeat frequencies, you might want to modify the function, but here is one approach using the dplyr::lead function:
library(stringr)
library(dplyr)
# general function that identifies intersecting words from multiple strings
getRpt <- function(...){
l <- lapply(list(...), function(x) unlist(unique(
str_split(as.character(x), pattern=boundary(type="word")))))
Reduce(intersect, l)
}
df$rept <- mapply(getRpt, df$Orthographic, lead(df$Orthographic), USE.NAMES=FALSE)
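If only the identity of the repeated words matters, the same idea can be sketched in base R without building any regex. This is an illustration under assumptions: "words" are taken to be whitespace-separated tokens, the `sentences` vector is a made-up three-row example, and each result is stored at the position of the second sentence of the pair (as in the question's loop).

```r
# Hypothetical mini data set
sentences <- c("I don't know", "yeah I talked", "yeah ")

# Tokenize each sentence on runs of whitespace after trimming
tokens <- strsplit(trimws(sentences), "\\s+")

# Compare each sentence with the one before it; the first has no predecessor
rept <- c(list(NULL),
          Map(intersect, tokens[-1], tokens[-length(tokens)]))

print(rept)
```

Unlike the regex loop, intersect reports each repeated word once, so repeat frequencies are lost.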

What is the syntax in R for returning the number of words matched in regular expression?

R Package: stringr::words
I want to know the number of words that are exactly three letters long in the stringr::words file after applying the following regular expression:
x <- str_view(words, "^...$", match = TRUE)
While the code was able to extract words that are exactly three letters long, it does not tell me how many words there are. So, I thought the length function would be appropriate to find the number.
length(x)
The code returns 8, which cannot be right, since x clearly contains more than 8 words.
What is the proper syntax to calculate the number of words after matching with the regular expression, in this case, x?
Also, can anyone explain to me why length(x) returns 8 in the above example?
Thank you in advance.
str_view returns an HTML object which is used for viewing.
x <- str_view(words, "^...$", match = TRUE)
class(x)
#[1] "str_view" "htmlwidget"
The 8 components that you see are
names(x)
#[1] "x" "width" "height" "sizingPolicy" "dependencies"
#[6] "elementId" "preRenderHook" "jsHooks"
Instead of str_view, use str_subset:
library(stringr)
x <- str_subset(words, "^...$")
x
# [1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad" "bag"
# [14] "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but" "buy" "can" "car" "cat"
# [27] "cup" "cut" "dad" "day" "die" "dog" "dry" "due" "eat" "egg" "end" "eye" "far"
# [40] "few" "fit" "fly" "for" "fun" "gas" "get" "god" "guy" "hit" "hot" "how" "job"
# [53] "key" "kid" "lad" "law" "lay" "leg" "let" "lie" "lot" "low" "man" "may" "mrs"
# [66] "new" "non" "not" "now" "odd" "off" "old" "one" "out" "own" "pay" "per" "put"
# [79] "red" "rid" "run" "say" "see" "set" "sex" "she" "sir" "sit" "six" "son" "sun"
# [92] "tax" "tea" "ten" "the" "tie" "too" "top" "try" "two" "use" "war" "way" "wee"
#[105] "who" "why" "win" "yes" "yet" "you"
length(x)
#[1] 110
Another option is str_count:
library(stringr)
sum(str_count(x, "^...$"))
[1] 3
Data:
x <- c("abc", "abcd", "ab", "abc", "abcsd", "edf")
I'd suggest using grep with length:
length(grep("^.{3}$", words))
# => [1] 110
With grep, you actually get a subset of the words and length will return the count of the found matches.
stringr::str_view can be used to view an HTML rendering of the regular expression matches, and it does not actually return the list of matches. Besides grep, you may use stringr::str_subset.
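For completeness, a count can also be taken directly from a logical vector, without materialising the subset first. The sketch below uses the small x vector from the str_count answer above rather than stringr::words, so it runs in base R; the variable names are only illustrative.

```r
x <- c("abc", "abcd", "ab", "abc", "abcsd", "edf")

# grepl returns TRUE/FALSE per element; sum counts the TRUEs
n_regex <- sum(grepl("^.{3}$", x))

# For a pure length test, nchar avoids the regex altogether
n_nchar <- sum(nchar(x) == 3)

print(c(n_regex, n_nchar))
```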

How to retain character strings using positional indexing?

What I need to do is very similar to what the function below does
x = c("abcde", "ghij", "klmnopq")
tstrsplit(x, "", fixed=TRUE, keep=c(1,3,5), names=c('first','second','third'))
However, I would like to be able to return strings using ranges of values. For example, I would like to specify that in first I want to have the first two letters for each element.
Thus instead of having:
$first
[1] "a" "g" "k"
$second
[1] "c" "i" "m"
$third
[1] "e" NA "o"
The output should look like
$first
[1] "ab" "gh" "kl"
$second
[1] "c" "i" "m"
$third
[1] "e" NA "o"
Background:
I have a large .txt file of records and a lookup table that tells me the start position of each attribute and its expected maximum width. The txt file looks like:
James Brown M 01-01-1970
And then in a separate file I have a lookup table that says:
Field Start width
Name 1 7
FamilyN 9 7
Gender 11 1
Incidentally, I would appreciate any feedback on the best way to import this type of large .txt file. I feel like read.table is inappropriate since it tries to reduce to a dataframe format which is not what these files really are.
Something like this maybe:
x = c("abcde", "ghij", "klmnopq")
library(tidyverse)
list(c(1,3,5), c(2,1,1)) %>%
pmap(~ substr(x, .x, .x + .y - 1) %>% replace(., .=="", NA))
[[1]]
[1] "ab" "gh" "kl"
[[2]]
[1] "c" "i" "m"
[[3]]
[1] "e" NA "o"
I've hardcoded the positions. Per @MrFlick's comment, if you have a large number of strings, you'll need some strategy for deciding on the character positions so that you can automate it, rather than hardcoding it.
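One way to avoid hardcoding, sketched here in base R, is to drive substr from a lookup table of start positions and widths like the one described in the question. The `lookup` data frame and its column names are assumptions for illustration; the values reproduce the first/second/third example.

```r
x <- c("abcde", "ghij", "klmnopq")

# Hypothetical lookup table: one row per field
lookup <- data.frame(field = c("first", "second", "third"),
                     start = c(1, 3, 5),
                     width = c(2, 1, 1),
                     stringsAsFactors = FALSE)

# Cut every string at each field's start/width; empty results become NA
fields <- Map(function(s, w) {
  out <- substr(x, s, s + w - 1)
  out[out == ""] <- NA
  out
}, lookup$start, lookup$width)
names(fields) <- lookup$field

print(fields)
```

For reading such a file directly, base R's read.fwf (which takes a vector of field widths) may also be worth a look.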

split each character in R

I have song.txt file
*****
[1]"The snow glows white on the mountain tonight
Not a footprint to be seen."
[2]"A kingdom of isolation,
and it looks like I'm the Queen"
[3]"The wind is howling like this swirling storm inside
Couldn't keep it in;
Heaven knows I've tried"
*****
[4]"Don't let them in,
don't let them see"
[5]"Be the good girl you always have to be
Conceal, don't feel,
don't let them know"
[6]"Well now they know"
*****
I would like to loop over the lyrics and fill in a list so that
each element in the list contains a character vector, where each element of the vector is a word in the song,
like
[1] "The" "snow" "glows" "white" "on" "the" "mountain" "tonight" "Not" "a" "footprint"
"to" "be" "seen." "A" "kingdom" "of" "isolation," "and" "it" "looks" "like" "I'm" "the"
"Queen" "The" "wind" "is" "howling" "like" "this" "swirling" "storm" "inside"
"Couldn't" "keep" "it" "in" "Heaven" "knows" "I've" "tried"
[2]"Don't" "let" "them" "in,""don't" "let" "them" "see" "Be" "the" "good" "girl" "you"
"always" "have" "to" "be" "Conceal," "don't" "feel," "don't" "let" "them" "know"
"Well" "now" "they" "know"
First I made an empty list with words <- vector("list", 2).
I think that I should first put the text into one long character vector and then find where the ***** delimiters start and stop, with
star="\\*{5}"
pindex = grep(star, page)
After this what should I do?
It sounds like what you want is strsplit, run (effectively) twice. So, starting from the point of "a single long character string separated by ***** and spaces" (which I assume is what you have?):
list_of_vectors <- lapply(strsplit(song, split = "\\*{5}")[[1]], function(x) {
#Split each verse by spaces
split_verse <- strsplit(x, split = " ")
#Then return it as a vector
return(unlist(split_verse))
})
The result should be a list of each verse, with each element consisting of a vector of each word in that verse. If you're not dealing with a single character string in the read-in object, show us the file and how you're reading it in ;).
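A slightly more defensive variant of the same two-step split is sketched below. It assumes the whole file has been collapsed into one string with newlines between lines (the shortened `song` here is made up from the lyrics in the question), and it drops the empty pieces that appear before the first and after the last *****.

```r
# Hypothetical single-string version of a shortened song.txt
song <- paste("*****", "The snow glows white", "Not a footprint",
              "*****", "Don't let them in", "*****", sep = "\n")

# Step 1: split into verses on the five-star delimiter
verses <- trimws(strsplit(song, "\\*{5}")[[1]])
verses <- verses[nzchar(verses)]          # drop empty pieces

# Step 2: split each verse into words on any run of whitespace
verse_words <- strsplit(verses, "\\s+")

print(verse_words)
```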
To get it into the format you want, maybe give this a shot. Also, please update your post with more information so we can definitively solve your problem. There are a few areas of your posted question that need some clarification. Hope this helps.
## writeLines(text <- "*****
## The snow glows white on the mountain tonight
## Not a footprint to be seen.
## A kingdom of isolation,
## and it looks like I'm the Queen
## The wind is howling like this swirling storm inside
## Couldn't keep it in;
## Heaven knows I've tried
## *****
## Don't let them in,
## don't let them see
## Be the good girl you always have to be Conceal,
## don't feel,
## don't let them know
## Well now they know
## *****", "song.txt")
> read.song <- readLines("song.txt")
> split.song <- unlist(strsplit(read.song, "\\s"))
> star.index <- grep("\\*{5}", split.song)
> word.index <- sapply(2:length(star.index), function(i){
(star.index[i-1]+1):(star.index[i]-1)
})
> lapply(seq(word.index), function(i) split.song[ word.index[[i]] ])
## [[1]]
## [1] "The" "snow" "glows" "white" "on" "the" "mountain"
## [8] "tonight" "Not" "a" "footprint" "to" "be" "seen."
## [15] "A" "kingdom" "of" "isolation," "and" "it" "looks"
## [22] "like" "I'm" "the" "Queen" "The" "wind" "is"
## [29] "howling" "like" "this" "swirling" "storm" "inside" "Couldn't"
## [36] "keep" "it" "in;" "Heaven" "knows" "I've" "tried"
## [[2]]
## [1] "Don't" "let" "them" "in," "don't" "let" "them" "see" "Be"
## [10] "the" "good" "girl" "you" "always" "have" "to" "be" "Conceal,"
## [19] "don't" "feel," "don't" "let" "them" "know" "Well" "now" "they"
## [28] "know"

Importing data into R

So I have a set of data here (note: ignore the first line; the data starts on the second line). There are 311,522 characters in total. I wish to import this into R such that each single character is in one cell, so I end up with a 311,522 by 1 column vector. However, when I copied the data into a text file and then imported that into R, each line is recognized as one single "character" and instead I end up with a column vector where each entry is the entire line rather than a single character.
How can I get around this?
Just use readLines and strsplit. This is pretty straightforward stuff in R:
x <- readLines("Your_Actual_URL_Here")
Check for any junk:
head(x)
# [1] ""
# [2] "<PRE>"
# [3] ">hg19_knownGene_uc003qec.4 range=chr6:133551736-133863257 5'pad=0 3'pad=0 strand=+ repeatMasking=none"
# [4] "AGGGAGAGGAGTATCTTGTCTTGGGGAGGGTGGAGACAGACAACCATTTC"
# [5] "TGTTTTTGTTATATTGAATTGTACATCTTCCTAGGCATAAATACTCTTCA"
# [6] "TGATTTCAGGCCAGGTCCAAATGATACCTCCTACATTCCTTCAGCTGGAA"
tail(x)
# [1] "CTTGCTTTTCACAAAAAGAGATCCAAGAGGAAGAGGTGGAGCAAGCTAGC"
# [2] "AAGAGAGCACCCAAGATGGAAGCTGCAGTCTTTTACCCTAACCTCAGAAG"
# [3] "TGGTGTACCTTTTGCCATATGCCATTTGTCATATAGCTCAAGCATGGTAC"
# [4] "AGTGTGGGAGGGGGCTACATGGGATGTTAATACCAGGATGCAGGGGATCG"
# [5] "CTGGGGCTACTTTGGAGGCTGG"
# [6] "</PRE>"
So, we want from the fourth line to one less than the length of the vector:
y <- unlist(strsplit(x[4:(length(x)-1)], ""), use.names=FALSE)
head(y)
# [1] "A" "G" "G" "G" "A" "G"
tail(y)
# [1] "G" "G" "C" "T" "G" "G"
length(y)
# [1] 311522
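If you would rather not count lines by hand, the sequence lines can be picked out by their content instead of their position. The sketch below works on a tiny made-up vector standing in for the result of readLines, and it assumes that sequence lines contain nothing but the upper-case letters A, C, G, T and N.

```r
# Hypothetical stand-in for the lines returned by readLines()
x <- c("", "<PRE>", ">hg19_knownGene_example range=...", "AGGG", "TGT", "</PRE>")

# Keep only lines made up entirely of nucleotide letters
seq_lines <- x[grepl("^[ACGTN]+$", x)]

# Split into single characters and flatten to one vector
y <- unlist(strsplit(seq_lines, ""), use.names = FALSE)

print(y)
print(length(y))
```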
