Obtaining a count of phrases contained between parentheses and containing a specific character in R

There must be a simple answer to this, but I'm new to regex and couldn't find one.
I have a dataframe (df) with text strings arranged in a column vector of length n (df$text). Each of the texts in this column is interspersed with parenthetical phrases. I can identify these phrases using:
regmatches(df$text, gregexpr("(?<=\\().*?(?=\\))", df$text, perl=T))[[1]]
The code above returns all text between parentheses. However, I'm only interested in parenthetical phrases that contain 'v.' in the format 'x v. y', where x and y are any number of characters (including spaces) between the parentheses; for example, '(State of Arkansas v. John Doe)'. Matching phrases (court cases) always follow this format: an opening parenthesis, a word beginning with a capital letter, possibly more spaces and words, 'v.', another word beginning with a capital letter, possibly more spaces and words, and a closing parenthesis.
I'd then like to create a new column containing counts of x v. y phrases in each row.
Bonus if there's a way to do this separately for the same phrases denoted by italics rather than enclosed in parentheses: State of Arkansas v. John Doe (but perhaps this should be posed as a separate question).
Thanks for helping a newbie!

I believe I have figured out what you want, but it is hard to tell without example data. I have made an example data frame to work with. If it is not what you are going for, please give an example.
df <- data.frame(text = c("(Roe v. Wade) is not about boats",
"(Dred Scott v. Sandford) and (Plessy v. Ferguson) have not stood the test of time",
"I am trying to confuse you (this is not a court case)",
"this one is also confusing (But with Capital Letters)",
"this is confusing (With Capitols and v. d)"),
stringsAsFactors = FALSE)
The regular expression I think you want is:
cases <- regmatches(df$text, gregexpr("(?<=\\()([[:upper:]].*? v\\. [[:upper:]].*?)(?=\\))",
df$text, perl=T))
You can then get the number of cases and add it to your data frame with:
df$numCases <- vapply(cases, length, numeric(1))
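For what it's worth, on the example data frame above this should produce counts of 1, 2, 0, 0 and 0; the last two rows do not match because there is no ' v. ' followed by a word starting with a capital letter:
df$numCases
# [1] 1 2 0 0 0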
As for italics, I would really need an example of your data. Usually that kind of formatting isn't stored when you read a string into R, so the italics effectively don't exist anymore.
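If the italics do happen to survive in your strings as markup rather than formatting, say as HTML <i>...</i> tags (an assumption about your data, not something shown in the question), the same approach works with the tags as delimiters:
# hypothetical: only applies if italicised cases appear as <i>...</i> in df$text
italic_cases <- regmatches(df$text, gregexpr("(?<=<i>)[[:upper:]].*? v\\. [[:upper:]].*?(?=</i>)", df$text, perl = TRUE))
df$numItalicCases <- vapply(italic_cases, length, numeric(1))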

Change your regex as shown below:
regmatches(df$text, gregexpr("(?<=\\()[^()]*\\sv\\.\\s[^()]*(?=\\))", df$text, perl=T))[[1]]
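Note that, unlike the pattern in the previous answer, this one does not require capital letters around the 'v.', so on the example data frame above it would also match the last row:
regmatches(df$text, gregexpr("(?<=\\()[^()]*\\sv\\.\\s[^()]*(?=\\))", df$text, perl = TRUE))[[5]]
# [1] "With Capitols and v. d"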

Related

Add a character to each word within a sentence in R

I have sentences in R (they represent column names of an SQL query), like the following:
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
I would need to add a character(s) like "k." in front of every word of the sentence. Notice how sometimes words within the sentence may be separated by a comma and a space, but sometimes just by a comma.
The desired output would be:
new_sentence <- "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
I would prefer to achieve this without using a for loop. I saw this post: Add a character to the start of every word, but there they work with a vector and I can't figure out how to apply that code to my example.
Thanks
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
gsub(pattern = "(\\w+)", replacement = "k.\\1", sample_sentence)
# [1] "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
Explanation: in regex, \\w+ matches one or more "word" characters, and wrapping it in () makes each match a capturing group. We replace each match with k.\\1, where \\1 refers to the text captured by the first set of () for that match.
A possible solution:
sample_sentence <- "CITY, AGE,NAME,LAST_NAME, COUNTRY"
paste0("k.", gsub("(,\\s*)", "\\1k\\.", sample_sentence))
#> [1] "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"
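If you are already using stringr, a roughly equivalent alternative is str_replace_all with the same capturing group and backreference:
library(stringr)
str_replace_all(sample_sentence, "(\\w+)", "k.\\1")
#> [1] "k.CITY, k.AGE,k.NAME,k.LAST_NAME, k.COUNTRY"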

How to subset words with a certain number of vowels in RStudio?

I'm trying to subset a list of words having 5 or more vowel letters using the str_subset function in RStudio, but I can't figure it out.
Are there any suggestions for this issue?
Since you are evidently using stringr, the function str_count will give you what you are after. Assuming your "list of words" means a character vector of single words, the following should do the trick.
library(stringr)  # for str_count
testStrings <- c("Brillig", "slithey", "TOVES",
"Abominable", "EQUATION", "Multiplication", "aaagh")
VowelCount <- str_count(testStrings, pattern = "[AEIOUaeiou]")
OutputStrings <- testStrings[VowelCount >= 5]
The part in square brackets is a regular expression which matches any capital or lower case vowel in English. Of course other languages have different sets of vowels which you may need to take into account.
If you want to do the same in base R, the following single-liner should do it:
OutputStrings <- grep("([AEIOUaeiou].*){5,}", testStrings, value = TRUE)
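Both versions should pick out the same words; for the testStrings above the result should be (counting only the English vowels listed):
OutputStrings
# [1] "Abominable"     "EQUATION"       "Multiplication"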

Extract first letter in each word in R

I had a data.frame with some categorical variables. Let's suppose sentences is one of these variables:
sentences <- c("Direito à participação e ao controle social",
"Direito a ser ouvido pelo governo e representantes",
"Direito aos serviços públicos",
"Direito de acesso à informação")
For each value, I would like to extract just the first letter of each word, ignoring words that have 4 letters or less (e, de, à, a, aos, ser, pelo). My goal is to create acronym variables. I expect the following result:
[1] "DPCS" "DOGR" "DSP" "DAI"
I tried to make a pattern subset using stringr with a regex pattern found here:
library(stringr)
pattern <- "^(\b[A-Z]\w*\s*)+$"
str_subset(str_to_upper(sentences), pattern)
But I got an error when creating the pattern object:
Error: '\w' is an escape sequence not recognized in the string beginning with ""^(\b[A-Z]\w"
What am I doing wrong?
Thanks in advance for any help.
You can use gsub to delete all the unwanted characters and keep only the ones you want. From the expected output, it seems you are still using characters from words that are 3 characters long:
gsub('\\b(\\pL)\\pL{2,}|.','\\U\\1',sentences,perl = TRUE)
[1] "DPCS" "DSOPGR" "DASP" "DAI"
But if we were to ignore the words you indicated then it would be:
gsub('\\b(\\pL)\\pL{4,}|.','\\U\\1',sentences,perl = TRUE)
[1] "DPCS" "DOGR" "DSP" "DAI"
@Onyambu's answer is great, though as a regular-expression beginner it took me a long time to understand it well enough to make modifications to suit my own needs.
Here is my understanding of gsub('\\b(\\pL)\\pL{4,}|.','\\U\\1',sentences,perl = TRUE).
I'm posting it in the hope that it will be helpful to others.
Background information:
\\b: a word boundary
\\pL: matches any kind of letter from any language
{4,}: an occurrence indicator
{m}: the preceding item is matched exactly m times.
{m,}: the preceding item is matched m or more times, i.e., m+.
{m,n}: the preceding item is matched at least m times, but not more than n times.
|: the OR (alternation) operator
.: matches any one character except a newline.
\\U\\1 in the replacement text reinserts the text captured by the pattern and converts it to upper case. Note that the parentheses () create a numbered capturing group in the pattern.
With all the background knowledge in place, the interpretation of the command is:
replace each word matching \\b(\\pL)\\pL{4,} with its first (captured) letter
replace any character not matching the above pattern with "", as nothing is captured in that case
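As a quick check of that reading, running the pattern on a single sentence keeps only the first letters of the words with five or more letters:
gsub('\\b(\\pL)\\pL{4,}|.', '\\U\\1', "Direito de acesso à informação", perl = TRUE)
# [1] "DAI"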
Here are two great places where I learned all of this background:
https://www.regular-expressions.info/rlanguage.html
https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
You can use this pattern: (?<=^| )\S(?=\pL{4,})
I used a positive lookbehind to make sure the matches are preceded by either a space or the beginning of the line. Then I match one character, only if it is followed by 4 or more letters, hence the positive lookahead.
I suggest you don't use \w for non-English languages, because it won't match any characters with accents. Instead, \pL matches any letter from any language.
Once you have your matches, you can just concatenate them to create your strings (dpcs, dogr, etc...)
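In base R, that might look something like the following (a sketch; the lookarounds need perl = TRUE, and sentences is the vector from the question):
m <- regmatches(sentences, gregexpr("(?<=^| )\\S(?=\\pL{4,})", sentences, perl = TRUE))
toupper(sapply(m, paste, collapse = ""))
# [1] "DPCS" "DOGR" "DSP"  "DAI"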

Data Frame containing hyphens using R

I have created a list (based on items in a column) in order to subset my dataset into smaller datasets relating to a particular variable. This list contains strings with hyphens (-) in them.
dim.list <- c('Age_CareContactDate-Gender', 'Age_CareContactDate-Group',
'Age_ServiceReferralReceivedDate-Gender',
'Age_ServiceReferralReceivedDate-Gender-0-18',
'Age_ServiceReferralReceivedDate-Group',
'Age_ServiceReferralReceivedDate-Group-ReferralReason')
I have then written some code to loop through each item in this list subsetting my main data.
for (i in dim.list) {assign(paste("df1.",i,sep=""),df[df$Dimension==i,])}
This works fine; however, when I come to aggregate the results in order to get some summary statistics, I can't reference the datasets, as R stops reading after the hyphen (I assume the hyphen is some special character).
If I use a different list without hyphens, e.g.
dim.list.abr <- c('ACCD_Gen','ACCD_Grp',
'ASRRD_Gen',
'ASRRD_Gen_0_18',
'ASRRD_Grp',
'ASRRD_Grp_RefRsn')
When my for loop above executes I get 6 data.frames with no observations.
Why is this happening?
Comment to answer:
Hyphens aren't allowed in standard variable names. Think of a simple example: a-b. Is it a variable name with a hyphen or is it a minus b? The R interpreter assumes a minus b, because it doesn't require spaces for binary operations. You can force non-standard names to work using backticks, e.g.,
# terribly confusing names:
`a-b` <- 5
`x+y` <- 10
`mean(x^2)` <- "this is awful"
but you're better off following the rules and using standard names without special characters like + - * / % $ @ # ! & | ^ ( [ ' " in them. At ?quotes there is a section on Names and Identifiers:
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
So that's why you're getting an error, but what you're doing isn't good practice. I completely agree with Axeman's comments: use split to divide up your data frame into a list, and keep it in a list rather than using assign; it will be much easier to loop over or to use lapply on that way. You might want to read my answer at How to make a list of data frames for a lot of discussion and examples.
Regarding your comment "dim.list is not the complete set of unique entries in the Dimensions column", that just means you need to subset before you split:
nice_list = df[df$Dimension %in% dim.list, ]
nice_list = split(nice_list, nice_list$Dimension)
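Once the data are in a named list, you can get per-dimension summaries with one of the apply functions instead of juggling separate objects; for example, counting the rows in each subset (your real summaries will depend on the columns you have):
sapply(nice_list, nrow)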

remove multiple patterns from text vector r

I want to remove multiple patterns from multiple character vectors. Currently I am doing:
a.vector <- gsub("@\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]]", "", a.vector)
etc etc.
This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.
Neither mapply nor mgsub is working. I made these vectors:
remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")
Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.
a.vector looks like this:
[4951] "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
I want:
[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing"
I know this answer is late on the scene, but it stems from my dislike of having to manually list the removal patterns inside the grep functions (see the other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them (i.e., when needed) using the regex separator "|":
library(stringr)
remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))
Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.
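One small caveat: stripping the handles and punctuation can leave stray spaces behind; if that matters, str_squish (also from stringr) tidies them up:
str_squish(str_remove_all(a.vector, paste(remove, collapse = "|")))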
Try combining your subpatterns using |. For example:
> s <- "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("@\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation mental"
But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
Consider creating your remove vector as you suggested, then applying it in a loop
> s1 <- s
> remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants.
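For example, something along these lines (a sketch; note that gsub is already vectorised over its x argument, so the wrapper can be called on the whole character vector at once):
remove_patterns <- function(x, patterns) {
  # apply each removal pattern in turn to the character vector x
  for (p in patterns) x <- gsub(p, "", x)
  x
}
a.vector <- remove_patterns(a.vector, remove)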
In case the multiple patterns that you are looking for are fixed and don't change from case to case, you can consider creating a concatenated regex that combines all of the patterns into one combined pattern.
For the example you provided, you can try:
removePat <- "(@\\w+)|(http\\w+)|([[:punct:]])"
a.vector <- gsub(removePat, "", a.vector)
I had a vector containing the string "my final score" and I wanted to keep only the word final and remove the rest. This is what worked for me, based on Marian's suggestion:
str_remove_all("my final score", "my |score")
Note: "my final score" is just an example; I was dealing with a whole vector.
