Create a function that tidies up any text in R

I am new to R and trying to create a regular expression that abbreviates any word of more than five letters in the input vector (e.g. "hungry" would become "hungr."). At the moment I am using gsub, but it is replacing the first five letters of long words with "." rather than the letters that follow them.
Current code:
gsub(pattern = "[a-z]{5}", replacement = ".", x = text_converter)
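(The pattern "[a-z]{5}" matches the first five letters of any long word, so gsub swaps exactly those letters for ".". Capturing those five letters and replacing everything after them reproduces the "hungr." behaviour asked for; a minimal sketch, assuming lower-case input as in the original pattern:)
text_converter <- "the hungry dog jumped over the fences"
gsub("\\b([a-z]{5})[a-z]+", "\\1.", text_converter)
# [1] "the hungr. dog jumpe. over the fence."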

Here is the edited version, with the explanation added following the discussion in the comments.
We need the following function:
Your string is "the very reliable dog jumped over the fences"
strsplit splits this string at every space and gives you a list containing the vector of words:
[[1]]
[1] "the" "very" "reliable" "dog" "jumped" "over" "the" "fences"
lapply applies the function gsub("([a-zA-Z]{5}).*", "\\1", x) to each word in the list from step 2, which truncates each word to at most five letters:
[[1]]
[1] "the" "very" "relia" "dog" "jumpe" "over" "the" "fence"
Finally, paste with collapse concatenates the vector of strings from step 3 into a single string, which the split_reduce function returns as the final output:
[1] "the very relia dog jumpe over the fence"
So here is the function:
split_reduce <- function(text) {
  text <- strsplit(text, " ")
  text_reduced <- lapply(text, function(x) gsub("([a-zA-Z]{5}).*", "\\1", x))
  paste(unlist(text_reduced), collapse = " ")
}
split_reduce("the very reliable dog jumped over the fences")
#[1] "the very relia dog jumpe over the fence"

Related

How to find words in a string that have consecutive letters in R language

There is a problem that I do not know how to solve.
I need to write a function that returns all words from a string that contain repeated letters, together with the maximum number of repetitions of a character in a word.
Visually, this can be illustrated with the following example:
"hello good home aboba" after processing should become "hello good", and the maximum number of repetitions of a character in this string is 2.
The code I wrote tries to find duplicate characters and, based on that, extract the matching words into a separate vector, but something doesn't work. Please help me solve the problem.
library(tidyverse)
library(stringr)
text = 'tessst gfvdsvs bbbddsa daxz'
text = strsplit(text, ' ')
text
new = c()
new_2 = c()
for (i in text) {
  new = str_extract_all(i, '([[:alpha:]])\\1+')
  if (new != character(0)) {
    new_2 = c(new_2, i)
  }
}
new
new_2
Output:
Error in if (new != character(0)) { : argument is of length zero
> new
[[1]]
[1] "sss"
[[2]]
character(0)
[[3]]
[1] "bbb" "dd"
[[4]]
character(0)
> new_2
NULL
You can use
new <- unlist(str_extract_all(text, "\\p{L}*(\\p{L})\\1+\\p{L}*"))
i <- max(nchar( unlist(str_extract_all(new, "(.)\\1+")) ))
With str_extract_all(text, "\\p{L}*(\\p{L})\\1+\\p{L}*") you extract all words containing at least two consecutive identical letters, and with max(nchar(unlist(str_extract_all(new, "(.)\\1+")))) you get the length of the longest repeated-letter run.
Here is an R demo:
library(stringr)
text <- 'tessst gfvdsvs bbbddsa daxz'
new <- unlist(str_extract_all(text, "\\p{L}*(\\p{L})\\1+\\p{L}*"))
# => [1] "tessst" "bbbddsa"
i <- max(nchar( unlist(str_extract_all(new, "(.)\\1+")) ))
# => [1] 3
Regex details:
\p{L}* - zero or more letters
(\p{L}) - a letter captured into Group 1
\1+ - one or more repetitions of the captured letter
\p{L}* - zero or more letters
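For comparison, the same two steps can be done in base R with gregexpr() and regmatches(); a sketch (perl = TRUE enables PCRE, so the backreference behaves the same way as above):
text <- 'tessst gfvdsvs bbbddsa daxz'
# words with at least two consecutive identical letters
new <- unlist(regmatches(text, gregexpr("[[:alpha:]]*([[:alpha:]])\\1+[[:alpha:]]*", text, perl = TRUE)))
new
# [1] "tessst"  "bbbddsa"
# length of the longest repeated-letter run
max(nchar(unlist(regmatches(new, gregexpr("(.)\\1+", new, perl = TRUE)))))
# [1] 3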
text = "hello good home aboba"
paste0(
grep("(.)\\1{1,}",
unlist(strsplit(text, " ")),
value = TRUE),
collapse = " ")
[1] "hello good"

Why does paste() concatenate list elements in the wrong order?

Given the following string:
my.str <- "I welcome you my precious dude"
One splits it:
my.splt.str <- strsplit(my.str, " ")
And then concatenates:
paste(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6], sep = " ")
The result is:
[1] "I you precious" "welcome my dude"
When not using the colon operator it returns the correct order:
paste(my.splt.str[[1]][1], my.splt.str[[1]][2], my.splt.str[[1]][3], my.splt.str[[1]][4], my.splt.str[[1]][5], my.splt.str[[1]][6], sep = " ")
[1] "I welcome you my precious dude"
Why is this happening?
paste is designed to work with vectors element-by-element. Say you did this:
names <- c('Alice', 'Bob', 'Charlie')
paste('Hello', names)
You'd want the result to be [1] "Hello Alice" "Hello Bob" "Hello Charlie", rather than "Hello Hello Hello Alice Bob Charlie".
To make it work like you want it to, rather than giving the different sections to paste as separate arguments, you could first combine them into a single vector with c:
paste(c(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6]), collapse = " ")
## [1] "I welcome you my precious dude"
We can use collapse instead of sep
paste(my.splt.str[[1]], collapse= ' ')
If we use the OP's first approach, paste combines the corresponding elements from each of the subsets.
If we want to paste selectively, first create an object so that repeating [[ can be avoided:
v1 <- my.splt.str[[1]]
v1[3:4] <- toupper(v1[3:4])
paste(v1, collapse=" ")
#[1] "I welcome YOU MY precious dude"
When paste is given multiple vector arguments, it pastes the corresponding elements of each:
paste(v1[1:2], v1[3:4])
#[1] "I you" "welcome my"
If we use collapse, the result is a single string, but the order is still different, because the first element of v1[1:2] is pasted with the first element of v1[3:4], and the second with the second:
paste(v1[1:2], v1[3:4], collapse = ' ')
#[1] "I you welcome my"
This behaviour is documented in ?paste:
paste converts its arguments (via as.character) to character strings, and concatenates them (separating them by the string given by sep). If the arguments are vectors, they are concatenated term-by-term to give a character vector result. Vector arguments are recycled as needed, with zero-length arguments being recycled to "".
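The recycling rule from that passage is easy to see directly (a minimal sketch):
paste(c("a", "b", "c", "d"), c("1", "2"), sep = "-")
# [1] "a-1" "b-2" "c-1" "d-2"   # the shorter vector is recycled
paste("x", character(0), "y")
# [1] "x  y"                    # zero-length argument recycled to ""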
Converting to uppercase can also be done on a substring, without splitting:
sub("^(\\w+\\s+\\w+)\\s+(\\w+\\s+\\w+)", "\\1 \\U\\2", my.str, perl = TRUE)
#[1] "I welcome YOU MY precious dude"

R: using \\b and \\B in regex

I read about regex and came across word boundaries. I found a question about the difference between \b and \B, but using the code from that question does not give the expected output. Here:
grep("\\bcat\\b", "The cat scattered his food all over the room.", value= TRUE)
# I expect "cat" but it returns the whole string.
grep("\\B-\\B", "Please enter the nine-digit id as it appears on your color - coded pass-key.", value= TRUE)
# I expect "-" but it returns the whole string.
I use the code as described in the question, but with two backslashes, as suggested. Using one backslash does not work either. What am I doing wrong?
You can use regexpr and regmatches to get the match; grep only tells you which strings contain a match. You can also use sub.
x <- "The cat scattered his food all over the room."
regmatches(x, regexpr("\\bcat\\b", x))
#[1] "cat"
sub(".*(\\bcat\\b).*", "\\1", x)
#[1] "cat"
x <- "Please enter the nine-digit id as it appears on your color - coded pass-key."
regmatches(x, regexpr("\\B-\\B", x))
#[1] "-"
sub(".*(\\B-\\B).*", "\\1", x)
#[1] "-"
For more than one match, use gregexpr:
x <- "1abc2"
regmatches(x, gregexpr("[0-9]", x))
#[[1]]
#[1] "1" "2"
grep returns the whole string because it just checks whether the match is present in the string. If you want to extract cat, you need other functions, such as str_extract from the stringr package:
str_extract("The cat scattered his food all over the room.", "\\bcat\\b")
[1] "cat"
The difference between \b and \B is that \b marks word boundaries whereas \B is its negation. That is, \\bcat\\b matches only if cat stands alone as a word, whereas \\Bcat\\B matches only if cat is inside a word. For example:
str_extract_all("The forgot his education and scattered his food all over the room.", "\\Bcat\\B")
[[1]]
[1] "cat" "cat"
These two matches are from education and scattered.
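To see both boundaries on one string (a sketch, using stringr as above):
library(stringr)
x <- "The cat scattered his food."
str_extract_all(x, "\\bcat\\b")[[1]]  # "cat" as a standalone word
# [1] "cat"
str_extract_all(x, "\\Bcat\\B")[[1]]  # "cat" inside "scattered"
# [1] "cat"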

How to iterate through an R list of character vectors to modify each element by keeping all characters up to and including one character past comma

I have an R list of approx. 90 character vectors (representing 90 documents), each containing several author names. To normalize the names, I'd like to keep everything up to and including the first character after the comma and space in each element. So, for example, "Smith, Joe" would become "Smith, J" (or "Smith J" would be fine).
1) I've tried using lapply with str_sub, but I can't seem to specify keeping one character past the comma (each element has a different length). 2) I also tried using lapply to split on the comma, making the last and first names separate elements, and then using modify_depth to apply str_sub, but I can't figure out how to apply str_sub only to the second element.
A fake sample to replicate the issue:
doc1 = c("King, Stephen", "Martin, George")
doc2 = c("Clancy, Tom", "Patterson, James", "Stine, R.L.")
author = list(doc1,doc2)
What I've tried:
myfun1 = function(x, arg1) { str_split(x, ", ") }
author = lapply(author, myfun1)
myfun2 = function(x, arg1) { str_sub(x, end = 1L) }
f2 = modify_depth(author, myfun2, .depth = 2)
f2
[[1]]
[[1]][[1]]
[1] "K" "S"
[[1]][[2]]
[1] "M" "G"
Ultimately, I'm hoping that after applying a solution (maybe involving unite()), the result will be as follows:
[[1]]
[[1]][[1]]
[1] "King S"
[[1]][[2]]
[1] "Martin G"
lapply( author, function(x) gsub( "(^.*, [A-Z]).*$", "\\1", x))
# [[1]]
# [1] "King, S" "Martin, G"
#
# [[2]]
# [1] "Clancy, T" "Patterson, J" "Stine, R"
What it does:
lapply loops over the list of authors.
gsub replaces the part of each element matched by the regex "(^.*, [A-Z]).*$" with the first group (the part between the round brackets).
the regex "(^.*, [A-Z]).*$" captures everything from the start of the string (^.*) up to and including the comma, the space, and one capital letter (, [A-Z]) into a group; the trailing .*$ matches the rest, which is dropped.
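If the comma-free "Smith J" form mentioned in the question is acceptable, the same regex idea works with the comma dropped from the replacement (a sketch, reusing the original author list from the question):
lapply(author, function(x) gsub("^(.*), ([A-Z]).*$", "\\1 \\2", x))
# [[1]]
# [1] "King S"   "Martin G"
#
# [[2]]
# [1] "Clancy T"    "Patterson J" "Stine R"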

How do I 'efficiently' replace a vector of strings with another (pairwise) in a large text corpus

I have a large corpus of text in a vector of strings (approx. 700,000 strings). I'm trying to replace specific words/phrases within the corpus. That is, I have a vector of approx. 40,000 phrases and a corresponding vector of replacements.
I'm looking for an efficient way of solving the problem.
I can do it in a for loop, looping over each pattern + replacement, but it scales badly (around 3 days!).
I've also tried qdap::mgsub(), but it seems to scale badly as well:
txt <- c("this is a random sentence containing bca sk",
"another senctence with bc a but also with zqx tt",
"this sentence contains non of the patterns",
"this sentence contains only bc a")
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c("#a-specfic-tag-#abc sk",
"#a-specfic-tag-#bc a",
"#a-specfic-tag-#zqx tt")
#either
txt2 <- qdap::mgsub(patterns, replacements, txt)
#or
for(i in 1:length(patterns)){
txt <- gsub(patterns[i], replacements[i], txt)
}
Both solutions scale badly for my data, with approx. 40,000 patterns/replacements and 700,000 txt strings.
I figure there must be a more efficient way of doing this?
If you can tokenize the texts first, then vectorized replacement is much faster. It's also faster if a) you can use a multi-threaded solution and b) you use fixed instead of regular expression matching.
Here's how to do all that in the quanteda package. The last line pastes the tokens back into a single "document" as a character vector, if that is what you want.
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
quanteda_options(threads = 4)
txt <- c(
  "this is a random sentence containing bca sk",
  "another sentence with bc a but also with zqx tt",
  "this sentence contains none of the patterns",
  "this sentence contains only bc a"
)
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c(
  "#a-specfic-tag-#abc sk",
  "#a-specfic-tag-#bc a",
  "#a-specfic-tag-#zqx tt"
)
This will tokenize the texts and then use fast replacement of the hashed types, using a fixed pattern match (you could instead use valuetype = "regex" for regular expression matching). By wrapping patterns inside the phrase() function, you are telling tokens_replace() to look for token sequences rather than individual matches, which solves the multi-word issue.
toks <- tokens(txt) %>%
  tokens_replace(phrase(patterns), replacements, valuetype = "fixed")
toks
## tokens from 4 documents.
## text1 :
## [1] "this" "is" "a" "random" "sentence"
## [6] "containing" "bca" "sk"
##
## text2 :
## [1] "another" "sentence"
## [3] "with" "#a-specfic-tag-#bc a"
## [5] "but" "also"
## [7] "with" "#a-specfic-tag-#zqx tt"
##
## text3 :
## [1] "this" "sentence" "contains" "none" "of" "the"
## [7] "patterns"
##
## text4 :
## [1] "this" "sentence" "contains"
## [4] "only" "#a-specfic-tag-#bc a"
Finally, if you really want to put this back into character format, convert to a list of character tokens and paste them together.
sapply(as.list(toks), paste, collapse = " ")
## text1
## "this is a random sentence containing bca sk"
## text2
## "another sentence with #a-specfic-tag-#bc a but also with #a-specfic-tag-#zqx tt"
## text3
## "this sentence contains none of the patterns"
## text4
## "this sentence contains only #a-specfic-tag-#bc a"
You'll have to test this on your large corpus, but 700k strings does not sound like too large a task. Please try this and report how it did!
Create a vector of all words in each phrase
txt1 = strsplit(txt, " ")
words = unlist(txt1)
Use match() to find the index of words to replace, and replace them
idx <- match(words, patterns)
words[!is.na(idx)] = replacements[idx[!is.na(idx)]]
Re-form the phrases and paste together
phrases = relist(words, txt1)
updt = sapply(phrases, paste, collapse = " ")
I guess this won't work if patterns can have more than one word...
Create a map between the old and new values
map <- setNames(replacements, patterns)
Create a pattern that contains all patterns in a single regular expression
pattern = paste0("(", paste0(patterns, collapse="|"), ")")
Find all matches, and extract them
ridx <- gregexpr(pattern, txt)
m <- regmatches(txt, ridx)
Unlist, map, and relist the matches to their replacement values, and update the original vector
regmatches(txt, ridx) <- relist(map[unlist(m)], m)
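Putting those steps together on the sample data from the question (a sketch; note that if any pattern contained regex metacharacters it would need escaping first):
txt <- c("this is a random sentence containing bca sk",
         "another sentence with bc a but also with zqx tt",
         "this sentence contains none of the patterns",
         "this sentence contains only bc a")
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- paste0("#a-specfic-tag-#", patterns)

map <- setNames(replacements, patterns)
pattern <- paste0("(", paste0(patterns, collapse = "|"), ")")
ridx <- gregexpr(pattern, txt)
m <- regmatches(txt, ridx)
regmatches(txt, ridx) <- relist(map[unlist(m)], m)
txt
# [1] "this is a random sentence containing bca sk"
# [2] "another sentence with #a-specfic-tag-#bc a but also with #a-specfic-tag-#zqx tt"
# [3] "this sentence contains none of the patterns"
# [4] "this sentence contains only #a-specfic-tag-#bc a"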
