I've been wrapping my head around this for a while, trying plenty of variations of map, Reduce and the like, but without success so far.
I am looking for a functional, elegant approach to replace a sequence of gsub calls such as
text_example <- c(
"I'm sure dogs are the best",
"I won't, I can't think otherwise",
"We'll be happy to discuss about dogs",
"cant do it today tho"
)
text_example %>%
gsub(pattern = "'ll", replacement = " will") %>%
gsub(pattern = "can'?t", replacement = "can not") %>%
gsub(pattern = "won'?t", replacement = "will not") %>%
gsub(pattern = "n't", replacement = " not") %>%
gsub(pattern = "'m", replacement = " am") %>%
gsub(pattern = "'s", replacement = " is") %>%
gsub(pattern = "dog", replacement = "cat") %>%
Into something of the form
text_example %>%
???(dict$pattern, dict$replacement, gsub())
Where, for the sake of a reproducible example, dict can be a data.frame such as
dict <- structure(
list(
pattern = c("'ll", "can'?t", "won'?t", "n't", "'m", "'s", "dog"),
replacement = c(" will", "can not", "will not", " not", " am", " is", "cat")
),
row.names = c(NA, -7L),
class = "data.frame"
)
(and I am aware that the substitutions performed might not be correct linguistically, but that's not the problem now)
Of course, a brute-force
for(i in seq(nrow(dict))) {
text_example <- gsub(dict$pattern[i], dict$replacement[i], text_example)
}
would work, and I know that there are dozens of libraries that solve this issue with some specific function. But I want to understand how to deal with recursion and problems like this in a simple, functional way, keeping as close as possible to base R. I love my lambdas!
Thank you in advance for the help.
You can use mapply for a parallel apply-effect:
mapply(function(pttrn, rep) gsub(pttrn, rep, text_example), dict$pattern, dict$replacement)
(You might want to use SIMPLIFY=FALSE)
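Note that this applies each pattern to the original text_example independently, so you get one result per dictionary row rather than one text with all the substitutions chained. For the chained effect the question asks about, base R's Reduce folds the rows into a single accumulated result. A minimal sketch, using the dict and text_example from the question:
# Fold over the rows of dict: each step applies one substitution
# to the text accumulated so far.
Reduce(
  function(txt, i) gsub(dict$pattern[i], dict$replacement[i], txt),
  seq_len(nrow(dict)),
  init = text_example
)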
Maybe the following does what you want.
It is inspired by Functional Programming, the link in your comment.
I don't like the output, though: it is a list with as many elements as rows of the dataframe dict, and only the last element is the one of interest.
new_text <- function(text) {
  # Factory: returns a closure that keeps the working text in `txt`
  # and updates it on every call.
  txt <- text
  function(pattern, replacement) {
    txt <<- gsub(pattern, replacement, txt)
    txt
  }
}
Replace <- new_text(text = text_example)
Map(Replace, as.list(dict[[1]]), as.list(dict[[2]]))
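Since only the last element of the Map() output is the fully substituted text, one small follow-up, using the definitions above (Map() also accepts the plain vectors, so the as.list() wrappers are optional):
result <- Map(Replace, dict$pattern, dict$replacement)
tail(result, 1)[[1]]  # the final, fully substituted text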
So, I am writing a function that, among many other things, is supposed to keep only the first sentence from each paragraph of a text and preserve the paragraph structure (i.e. each sentence on its own line). Here is the code that I have so far:
library(stringr)  # str_split() comes from stringr

text_shortener <- function(input_text) {
  # keep only the first sentence of each element
  first.sentences <- unlist(lapply(input_text, function(x) str_split(x, "\\.", simplify = TRUE)[1]))
  # collapse runs of whitespace and trim leading/trailing spaces
  no.spaces <- gsub(pattern = "(?<=[\\s])\\s*|^\\s+|\\s+$", replacement = "", x = first.sentences, perl = TRUE)
  stopwords <- c("the", "really", "truly", "very", "The", "Really", "Truly", "Very")
  x <- unlist(strsplit(no.spaces, " "))
  no.stopwords <- paste(x[!x %in% stopwords], collapse = " ")
  # abbreviate every word longer than five characters with a period
  final.text <- gsub(pattern = "(?<=\\w{5})\\w+", replacement = ".", x = no.stopwords, perl = TRUE)
  return(final.text)
}
All of the functions are working as they should, but the one part I can't figure out is how to get the output to print onto separate lines. When I run the function with a vector of text (I was using some text from Moby Dick as a test), this is what I get:
> text_shortener(Moby_Dick)
[1] "Call me Ishma. It is a way I have of drivi. off splee., and regul. circu. This is my subst. for pisto. and ball"
What I want is for the output of this function to look like this:
[1] "Call me Ishma."
[2] "It is a way I have of drivi. off splee., and regul. circu."
[3] "This is my subst. for pisto. and ball"
I am relatively new to R and this is giving me a real headache, so any help would be much appreciated! Thank you!
Looking at your output, it seems like splitting on a period followed by a capital letter is what you need.
You could accomplish that with strsplit() and split the string up like so:
strsplit("Call me Ishma. It is drivi. off splee., and regul. circu. This is my subst. for pisto.","\\. (?=[A-Z])", perl=T)
That finds instances where a period is followed by a space and a capital letter and splits the character up there.
Edit: You could add it to the end of your function like so:
text_shortener <- function(input_text) {
  # library(stringr) is assumed here, as in your original function
  first.sentences <- unlist(lapply(input_text, function(x) str_split(x, "\\.", simplify = TRUE)[1]))
  no.spaces <- gsub(pattern = "(?<=[\\s])\\s*|^\\s+|\\s+$", replacement = "", x = first.sentences, perl = TRUE)
  stopwords <- c("the", "really", "truly", "very", "The", "Really", "Truly", "Very")
  x <- unlist(strsplit(no.spaces, " "))
  no.stopwords <- paste(x[!x %in% stopwords], collapse = " ")
  trim.text <- gsub(pattern = "(?<=\\w{5})\\w+", replacement = ".", x = no.stopwords, perl = TRUE)
  # split back into sentences on ". " followed by a capital letter
  final.text <- strsplit(trim.text, "\\. (?=[A-Z])", perl = TRUE)
  return(final.text)
}
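One caveat: strsplit() returns a list, so if you want the plain character vector shown in your desired output, unlist() the result, e.g.:
unlist(text_shortener(Moby_Dick))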
Context
I am working with a messy datafile right now. I have a list of comments that I'd like to sort through to find the most common combinations of phrases. Example phrases would be "Did not qualify because of X and Y" and "Did not qualify because of Y and X". I am trying to remove stop words so I can match X and Y as a common phrase. I was able to do this easily for common single words, but phrases are a little more difficult. Below is my code for context.
Create Datafile
dat1 <- dat %>% filter(Action != "Exclude")  # dplyr::filter(); "Exclude" quoted as a literal value
Remove problem characters
dat1$Comments <- stri_trans_general(dat1$Comments, "latin-ascii")  # stri_trans_general() is from the stringi package
dat1$Comments <- gsub(pattern='<[^<>]*>', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern='\n', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern="[[:punct:]]", replacement=" ", x=dat1$Comments)
Remove stop words (Where my problem is)
sw <- paste0("\\b(", paste0(stop_words$word, collapse="|"), ")\\b")  # stop_words comes from the tidytext package
dat1$Comments <- lapply(dat1$Comments, function(x) (gsub(pattern=sw, replacement=" ", x)))
Remove extra spaces between words
dat1$Comments <- trimws(gsub("\\s+", " ", dat1$Comments))
dat1$Comments <- gsub("(^[[:space:]]*)|([[:space:]]*$)", "", dat1$Comments)
Sweet Data
top_phrases <- data.frame(text = dat1$Comments) %>%
unnest_tokens(bigram, text, 'ngrams', n = Length, to_lower = TRUE) %>%
count(bigram, sort = TRUE)
Issue
This is the error that pops up; it traces back to the gsub call:
Error in gsub(pattern = sw, replacement = " ", x) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
If anyone is curious, here is what is stored in "sw"
"\\b(a|a's|able|about|above|according|accordingly|across|actually|after|afterwards|again|against|ain't|all|allow|allows|almost|alone|along|already|also|although|always|am|among|amongst|an|and|another|any|anybody|anyhow|anyone|anything|anyway|anyways|anywhere|apart|appear|appreciate|appropriate|are|aren't|around|as|aside|ask|asking|associated|at|available|away|awfully|b|be|became|because|become|becomes|becoming|been|before|beforehand|behind|being|believe|below|beside|besides|best|better|between|beyond|both|brief|but|by|c|c'mon|c's|came|can|can't|cannot|cant|cause|causes|certain|certainly|changes|clearly|co|com|come|comes|concerning|consequently|consider|considering|contain|containing|contains|corresponding|could|couldn't|course|currently|d|definitely|described|despite|did|didn't|different|do|does|doesn't|doing|don't|done|down|downwards|during|e|each|edu|eg|eight|either|else|elsewhere|enough|entirely|especially|et|etc|even|ever|every|everybody|everyone|everything|everywhere|ex|exactly|example|except|f|far|few|fifth|first|five|followed|following|follows|for|former|formerly|forth|four|from|further|furthermore|g|get|gets|getting|given|gives|go|goes|going|gone|got|gotten|greetings|h|had|hadn't|happens|hardly|has|hasn't|have|haven't|having|he|he's|hello|help|hence|her|here|here's|hereafter|hereby|herein|hereupon|hers|herself|hi|him|himself|his|hither|hopefully|how|howbeit|however|i|i'd|i'll|i'm|i've|ie|if|ignored|immediate|in|inasmuch|inc|indeed|indicate|indicated|indicates|inner|insofar|instead|into|inward|is|isn't|it|it'd|it'll|it's|its|itself|j|just|k|keep|keeps|kept|know|knows|known|l|last|lately|later|latter|latterly|least|less|lest|let|let's|like|liked|likely|little|look|looking|looks|ltd|m|mainly|many|may|maybe|me|mean|meanwhile|merely|might|more|moreover|most|mostly|much|must|my|myself|n|name|namely|nd|near|nearly|necessary|need|needs|neither|never|nevertheless|new|next|nine|no|nobody|non|none|noone|nor|normally|not|nothing|novel|now|nowhere|o|obviously|of|off|often|oh|ok|okay|old|on|once|one|ones|only|onto|or|other|others|otherwise|ought|our|ours|ourselves|out|outside|over|overall|own|p|particular|particularly|per|perhaps|placed|please|plus|possible|presumably|probably|provides|q|que|quite|qv|r|rather|rd|re|really|reasonably|regarding|regardless|regards|relatively|respectively|right|s|said|same|saw|say|saying|says|second|secondly|see|seeing|seem|seemed|seeming|seems|seen|self|selves|sensible|sent|serious|seriously|seven|several|shall|she|should|shouldn't|since|six|so|some|somebody|somehow|someone|something|sometime|sometimes|somewhat|somewhere|soon|sorry|specified|specify|specifying|still|sub|such|sup|sure|t|t's|take|taken|tell|tends|th|than|thank|thanks|thanx|that|that's|thats|the|their|theirs|them|themselves|then|thence|there|there's|thereafter|thereby|therefore|therein|theres|thereupon|these|they|they'd|they'll|they're|they've|think|third|this|thorough|thoroughly|those|though|three|through|throughout|thru|thus|to|together|too|took|toward|towards|tried|tries|truly|try|trying|twice|two|u|un|under|unfortunately|unless|unlikely|until|unto|up|upon|us|use|used|useful|uses|using|usually|uucp|v|value|various|very|via|viz|vs|w|want|wants|was|wasn't|way|we|we'd|we'll|we're|we've|welcome|well|went|were|weren't|what|what's|whatever|when|whence|whenever|where|where's|whereafter|whereas|whereby|wherein|whereupon|wherever|whether|which|while|whither|who|who's|whoever|whole|whom|whose|why|will|willing|wish|with|within|without|won't|wonder|would|would|wouldn't|x|y|yes|yet|you|you'd|you'll|you're|you've
|your|yours|yourself|yourselves|z|zero|i|me|my|myself|we|our|ours|ourselves|you|your|yours|yourself|yourselves|he|him|his|himself|she|her|hers|herself|it|its|itself|they|them|their|theirs|themselves|what|which|who|whom|this|that|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|would|should|could|ought|i'm|you're|he's|she's|it's|we're|they're|i've|you've|we've|they've|i'd|you'd|he'd|she'd|we'd|they'd|i'll|you'll|he'll|she'll|we'll|they'll|isn't|aren't|wasn't|weren't|hasn't|haven't|hadn't|doesn't|don't|didn't|won't|wouldn't|shan't|shouldn't|can't|cannot|couldn't|mustn't|let's|that's|who's|what's|here's|there's|when's|where's|why's|how's|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|around|as|ask|asked|asking|asks|at|away|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|best|better|between|big|both|but|by|came|can|cannot|case|cases|certain|certainly|clear|clearly|come|could|did|differ|different|differently|do|does|done|down|down|downed|downing|downs|during|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|had|has|have|having|he|her|here|herself|high|high|high|higher|highest|him|himself|his|how|however|i|if|important|in|interest|interested|interesting|interests|into|is|it|its|itself|just|keep|keeps|kind|knew|know|known|knows|large|largely|last|later|latest|least|less|let|lets|like|likely|long|longer|longest|made|make|making|man|many|may|me|member|members|men|might|more|most|mostly|mr|mrs|much|must|my|myself|necessary|need|needed|needing|needs|never|new|new|newer|newest|next|no|nobody|non|noone|not|nothing|now|nowhere|number|numbers|of|off|often|old|older|oldest|on|once|one|only|open|opened|opening|opens|or|order|ordered|ordering|orders|other|others|our|out|over|part|parted|parting|parts|per|perhaps|place|places|point|pointed|pointing|points|possible|present|presented|presenting|presents|problem|problems|put|puts|quite|rather|really|right|right|room|rooms|said|same|saw|say|says|second|seconds|see|seem|seemed|seeming|seems|sees|several|shall|she|should|show|showed|showing|shows|side|sides|since|small|smaller|smallest|some|somebody|someone|something|somewhere|state|states|still|still|such|sure|take|taken|than|that|the|their|them|then|there|therefore|these|they|thing|things|think|thinks|this|those|though|thought|thoughts|three|through|thus|to|today|together|too|took|toward|turn|turned|turning|turns|two|under|until|up|upon|us|use|used|uses|very|want|wanted|wanting|wants|was|way|ways|we|well|wells|went|were|what|when|where|whether|which|while|who|whole|whose|why|will|with|within|without|work|worked|working|works|would|year|years|yet|you|young|younger|youngest|your|yours)\\b"
Both TRE (the default regex engine used in base R regex functions) and PCRE (the engine used when perl=TRUE) have quite hard limits on pattern length.
In your case, the stringr regex functions will work better, as they use the ICU regex engine, which supports much longer patterns.
So, you may replace
gsub(pattern=sw, replacement=" ", x)
with
stringr::str_replace_all(x, sw, " ")
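A minimal sketch of the full fix, assuming stop_words comes from tidytext as in the question; str_replace_all() is also vectorized over its input, so the lapply() wrapper is no longer needed:
library(stringr)
# same pattern construction as in the question
sw <- paste0("\\b(", paste0(stop_words$word, collapse = "|"), ")\\b")
# ICU handles the long alternation, and the call works on the whole column at once
dat1$Comments <- str_replace_all(dat1$Comments, sw, " ")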
I am trying to clean some text data, and after tokenising and e.g. removing punctuation, I want to transform the tokens object into a vector/data frame/corpus.
My current approach is:
library(quanteda)
library(dplyr)
raw <- c("This is text #1.", "And a second document...")
tokens <- raw %>% tokens(remove_punct = T)
docs <- lapply(tokens, toString) %>% gsub(pattern = ",", replacement = "")
Is there a more "quanteda" or at least a simpler way to do this?
This would be how I would do it, and it preserves the docnames as element names in your output vector. (But you can add USE.NAMES = FALSE if you don't want to keep them.)
> sapply(tokens, function(x) paste(as.character(x), collapse = " "))
text1 text2
"This is text #1" "And a second document"
You don't need the library(dplyr) here.
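If you prefer a type-stable base R variant, vapply() guarantees a character(1) result per document (just an alternative sketch, not a quanteda-specific API):
vapply(tokens, function(x) paste(as.character(x), collapse = " "),
       character(1L), USE.NAMES = FALSE)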
I want to expand two shorthand notations in R.
For Ade/i, I should get Ade, Adi
For Do(i)lfal, I should get Dolfal, Doilfal
I have tried this, but it does not give the right result:
b='Do(i)lferl'
gsub(pattern = '(\\w+)\\((\\w+)+\\)', replacement='\\1\\i,\\1\\2', x=b)
Can anyone help me code this?
If these values are part of a dataframe, you can do this:
df <- data.frame(
Nickname = c("Ade/i", "Do(i)lfal")
)
library(stringr)  # str_split() is from stringr

df$Nickname_new[1] <- paste0(sub("(?=.*/)(.*)/.*", "\\1", df$Nickname[1], perl = TRUE), ",", paste0(unlist(str_split(df$Nickname[1], "\\w/")), collapse = ""))
df$Nickname_new[2] <- paste0(sub("(.*)(\\(.*\\))(.*)", "\\1\\3", df$Nickname[2]), ",", sub("(.*)\\((\\w)\\)(.*)", "\\1\\2\\3", df$Nickname[2]))
which gives you:
df
Nickname Nickname_new
1 Ade/i Ade,Adi
2 Do(i)lfal Dolfal,Doilfal
EDIT:
Just in case the whole thing is not part of a dataframe but an atomic vector, you can do this:
x <- c("Ade/i", "Do(i)lfal")
c(paste0(sub("/.*", "", x[grepl("/", x)]), ", ", sub("./", "", x[grepl("/", x)])),
  paste0(sub("(.*)\\((\\w)\\)(.*)", "\\1\\2\\3", x[grepl("\\(", x)]), ", ", sub("\\(\\w\\)", "", x[grepl("\\(", x)])))
which gives you:
[1] "Ade, Adi" "Doilfal, Dolfal"
If there are values that you don't want to change, then this regex by @Wiktor will work (it even leaves values with neither notation untouched):
x <- c("Ade/i", "Do(i)lfal", "Peter", "Mary")
gsub('(\\w*)\\((\\w+)\\)(\\w*)', '\\1\\2\\3, \\1\\3', gsub("(\\w*)(\\w)/(\\w)\\b", "\\1\\2, \\1\\3", x))
which gives you:
[1] "Ade, Adi" "Doilfal, Dolfal" "Peter" "Mary"
I have a list of strings which look like this:
categories <- "|Music|Consumer Electronics|Mac|Software|"
However, I only want to get the first field, in this case Music (without the |). I tried:
sub(categories, pattern = " |", replacement = "")
However, that does not give me the desired result. Any recommendation on how to correctly parse my string?
I appreciate your answer!
UPDATE
> dput(head(df))
structure(list(data.founded_at = c("01.06.2012", "26.10.2012",
"01.04.2011", "01.01.2012", "10.10.2011", "01.01.2007"), data.category_list = c("|Entertainment|Politics|Social Media|News|",
"|Publishing|Education|", "|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|",
"|Software|", "|Software|", "|Curated Web|")), .Names = c("data.founded_at",
"data.category_list"), row.names = c(NA, 6L), class = "data.frame")
An alternative for this could be scan: with sep = "|" the empty fields are read as NA (via na.strings = ""), and na.omit() drops them, leaving the first real entry:
na.omit(scan(text = categories, sep = "|", what = "", na.strings = ""))[1]
# Read 6 items
# [1] "Music"
Find a function that will tokenize a string at a particular character: strsplit would be my guess.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/strsplit.html
Note that the split parameter is a regexp, so using split="|" alone will not work, because | is a metacharacter (unless you specify fixed=TRUE, as suggested by joran in the comments, thanks):
strsplit(categories, split = "[|]")[[1]][2]
(The index is 2 rather than 1 because the leading | produces an empty first element.)
To apply this to the data frame you could do this:
sapply(df$data.category_list, function(x) strsplit(x,split="[|]")[[1]][2])
But this is faster (see the comments):
vapply(strsplit(df$data.category_list, "|", fixed = TRUE), `[`, character(1L), 2)
(thanks to Ananda Mahto)