Counting syllables - r

I'm looking to assign some different readability scores to text in R, such as the Flesch-Kincaid.
Does anyone know of a way to segment words into syllables using R? I don't necessarily need the syllable segments themselves but a count.
So, for instance:
x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle')
would yield:
1, 1, 2, 2, 1, 3
Each number corresponds to the number of syllables in the word.

qdap version 1.1.0 does this task:
library(qdap)
x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle')
syllable_sum(x)
## [1] 1 1 2 2 1 3

gsk3 is correct: if you want a correct solution, it is non-trivial.
For example, you have to watch out for strange things like a silent e at the end of a word (e.g. pane), or know when it's not silent, as in finale.
However, if you just want a quick-and-dirty approximation, this will do it:
> nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", tolower( x ))))
[1] 1 1 2 2 1 3
To understand how the parts work, just strip away the function calls from the outside in, starting with nchar and then gsub, etc... ...until the expression makes sense to you.
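For instance, with the same x as above, evaluating the expression from the inside out shows each stage:
> tolower(x)
[1] "dog"      "cat"      "pony"     "cracker"  "shoe"     "popsicle"
> gsub( "[aeiouy]+", "X", tolower( x ))
[1] "dXg"      "cXt"      "pXnX"     "crXckXr"  "shX"      "pXpsXclX"
> gsub( "[^X]", "", gsub( "[aeiouy]+", "X", tolower( x )))
[1] "X"   "X"   "XX"  "XX"  "X"   "XXX"
nchar then simply counts the remaining X markers, giving 1 1 2 2 1 3.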
But my guess is, considering a fight between R's power vs the profusion of exceptions in the English language, you could get a decent answer (maybe 99% right?) parsing through normal text, without a lot of work - heck, the simple parser above may get 90%+ right. With a little more work, you could deal with silent e's if you like.
It all depends on your application - whether this is good enough or you need something more accurate.
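For example, one rough refinement along those lines (a sketch only, to show the idea; it trades one class of errors for another):
count_syl <- function(w) {
  w <- tolower(w)
  n <- nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", w)))  # the quick-and-dirty count from above
  silent_e <- grepl("[^aeiouy]e$", w) & !grepl("le$", w)   # final e after a consonant, but keep -le endings
  pmax(n - silent_e, 1)
}
count_syl(c("pane", "finale", "cracker", "candle"))
# [1] 1 3 2 2  -- but a word like "pale" is now over-counted, showing how the exceptions pile up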

Some tools for NLP are available here:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
The task is non-trivial though. More hints (including an algorithm you could implement) here:
Detecting syllables in a word

The koRpus package will help you out immensely, but it's a little difficult to work with.
stopifnot(require(koRpus))
tokens <- tokenize(text, format="obj", lang='en')
flesch.kincaid(tokens)
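A minimal self-contained run might look like this (a sketch; text is just sample prose, and on current CRAN the English language support package koRpus.lang.en also needs to be installed for lang = "en"):
stopifnot(require(koRpus))
text <- "The quick brown fox jumps over the lazy dog. It was not amused."
tokens <- tokenize(text, format="obj", lang='en')
flesch.kincaid(tokens)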

Related

Filter rows based on dynamic pattern

I have speech data in a dataframe df, in column Orthographic:
df <- data.frame(
  Orthographic = c("this is it at least probably",
                   "well not probably it's not intuitive",
                   "sure no it's I mean it's very intuitive",
                   "I don't mean to be rude but it's anything but you know",
                   "well okay maybe"),
  Repeat = c(NA, "probably", "it's,intuitive", "I,mean,it's", NA),
  Repeat_pattern = c(NA, "\\b(probably)\\b", "\\b(it's|intuitive)\\b", "\\b(I,mean|it's)\\b",
                     NA))
I want to filter rows based on a dynamic pattern, namely the occurrence of no, never, not as words OR n't before any of the words listed in column Repeat. However, using the pattern \\b(no|never|not)\\b|n't\\b\\s together with the alternation patterns in column Repeat_pattern, I get this error:
df %>%
  filter(grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern), Orthographic))
Orthographic Repeat Repeat_pattern
1 well not probably it's not intuitive probably \\b(probably)\\b
2 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
Warning message:
In grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern), :
argument 'pattern' has length > 1 and only the first element will be used
I don't know why "only the first element will be used" as the two pattern components seem to connect well:
paste0("\\b(no|never|not)\\b|n't\\b\\s", df$Repeat_pattern)
[1] "\\b(no|never|not)\\b|n't\\b\\sNA" "\\b(no|never|not)\\b|n't\\b\\s\\b(probably)\\b"
[3] "\\b(no|never|not)\\b|n't\\b\\s\\b(it's|intuitive)\\b" "\\b(no|never|not)\\b|n't\\b\\s\\b(I,mean|it's)\\b"
[5] "\\b(no|never|not)\\b|n't\\b\\sNA"
The expected output is this:
2 well not probably it's not intuitive probably \\b(probably)\\b
3 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
4 I don't mean to be rude but it's anything but you know I,mean,it's \\b(I,mean|it's)\\b
It looks like a vectorization issue here: you need stringr::str_detect rather than grepl.
Also, you did not group the negative word alternatives correctly; all of them must reside in a single group, and your n't is currently obligatory in the string.
Also, NA values are coerced to text and appended to the regex patterns, while it seems you want to discard the rows where Repeat_pattern is NA.
You can fix your code by using
df %>%
  filter(ifelse(is.na(Repeat_pattern), FALSE, str_detect(Orthographic, paste0("(?:\\bno|\\bnever|\\bnot|n't)\\b.*", Repeat_pattern))))
Output:
Orthographic Repeat Repeat_pattern
1 well not probably it's not intuitive probably \\b(probably)\\b
2 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
3 I don't mean to be rude but it's anything but you know I,mean,it's \\b(I|mean|it's)\\b
I also think the last pattern must be \\b(I|mean|it's)\\b, not \\b(I,mean|it's)\\b.
If there can only be whitespace between the "no" words and the word from the Repeat column, replace .* with \\s+ in my pattern. I used .* to make sure there is a match anywhere to the right of the "no" words.
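To see the difference, here is a hypothetical sentence where another word sits between the negation and the Repeat word (a small check, assuming stringr is loaded; the sentence is made up for illustration):
s <- "well not really probably"
str_detect(s, "(?:\\bno|\\bnever|\\bnot|n't)\\b.*\\b(probably)\\b")    # TRUE: "really" may sit in between
str_detect(s, "(?:\\bno|\\bnever|\\bnot|n't)\\b\\s+\\b(probably)\\b")  # FALSE: only whitespace is allowed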

Correcting mis-typed data in R

I want to correct wrongly entered data in R. For example if I have a vector
V=c('PO','PO','P0')
I want R to recognize that the 0 in the last entry should be an O and to change it. Is there any way to do that? I have been trying to use correctTypos in the deducorrect package in R. However, I am having some problems with the editset: I cannot seem to specify that all the entries have to be letters. Any help greatly appreciated.
Another example would be
V2=c('PL','P1','PL','XX')
That 1 should be an L.
The Jaro-Winkler distance was developed to find issues with data entry. But on entries only 2 characters long that is going to be difficult, as 1 error tends to score higher than you want it to. You could combine this with other distance measurements available in the stringdist package, but in this case that might be too complicated.
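For illustration, the stringdist package computes it directly (a quick sketch; the exact values depend on the method's parameters):
library(stringdist)
stringdist("PO", "P0", method = "jw")   # roughly 0.33: one wrong character out of two is already a large distance
stringdist("PL", "XX", method = "jw")   # 1: nothing matches at all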
Given your examples you might want to use the base function chartr and set up a replacement of numbers to letters.
chartr("01","OL", V2)
[1] "PL" "PL" "PL" "XX"
chartr("01","OL", V)
[1] "PO" "PO" "PO"
This will always replace the 1 by an L and a 0 (zero) by an O. You can add 5 for S, and so on. But if there are other combos it might get complicated.
Also note that the next iteration of the deducorrect package is the deductive package.

Multiple pattern search - pick the line with the most hits from a document

I am trying to search a list of terms or keywords in a list of sentences. I want to pick the line from the list of lines (which are review comments from customers) that matches the most of my terms or keywords.
At present I am doing this:
mydata<-c("i like this product, awesome",
"i could not go with this produt, since s/w is problem",
"Very good s/w. keep up the good work. i really like it")
terms<-c("really, "good", "like", "product")
termco(mydata, 1:3, terms)
and i get
  3 word.count   really      good      like   product
1 1          5        0        0 1(20.00%) 1(20.00%)
2 2         11        0        0         0         0
3 3         12 1(8.33%) 2(16.67%)  1(8.33%)         0
I also tried a few other suggestions from HERE, but I could not get the result I wanted, even though the solution there is very nice.
My expectation is that only the line or lines containing the maximum number of the terms or keywords I am searching for should be displayed.
In this case I expected the line below, since it contains the maximum number of terms or keywords, i.e. "really", "good", and "like":
"Very good s/w. keep up the good work. i really like it"
Thanks in advance!!
Here is a base R solution using apply and grep. The basic idea is to call grep(term, sentence), for every term in a given sentence. Then, we sum the number of hit terms for each sentence. Note carefully that we add word boundary markers around each term. This is to prevent false matches where a term happens to be a substring of another word in a sentence.
sapply(mydata, function(x) {
  Reduce("+", sapply(terms, function(y) {
    sum(grep(paste0("\\b", y, "\\b"), x))
  }))
})
i like this product, awesome
2
i could not go with this product, since s/w is problem
1
Very good s/w. keep up the good work. i really like it
3
Using stringr's str_count can help as well:
Use str_count to get the counts of all the matches (4 in total for the last record), then use which.max to get the index within the vector (in this case it returns 3, meaning the third element of mydata).
mydata[which.max(stringr::str_count(mydata, paste0(terms, collapse="|")))]
In case you want an exact whole-word match with boundary conditions, you may use:
mydata[which.max(stringr::str_count(mydata,paste0("\\b",paste0(terms, collapse="\\b|\\b"),"\\b")))]
In your case both will give you the same answer; however, the second will yield fewer matches, e.g. when a sentence contains words like "keeping" instead of "keep". In that case the latter regex will not match, since it requires a whole word, whereas the former regex will still match because no boundary conditions are set.
Output:
> mydata[which.max(stringr::str_count(mydata, paste0(terms, collapse="|")))]
[1] "Very good s/w. keep up the good work. i really like it"

Alternative to grep function when searching

Relatively simple problem, but I'm thinking around in circles, and not reaching a solution. Essentially, I have x=sample(1:30,20) and I need to find the coordinates of each individual number. If I use grep, then I get where each number that has a 1 (including 10, 19, 21 etc.) is. I have a feeling there's an obscenely simple solution to this, but I can't think of it for some reason.
For example, if x=c(2,3,1,10,12),
then
f(1,x)
[1] 3
and
f(3,x)
[1] 2
Note: I tried using fixed = T, but that did not help.
You want which.
x <- c(2,3,1,10,12)
which(x==10)
[1] 4
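If you want it in the f(value, x) shape from the question, a one-line wrapper is enough (a sketch; f is just the question's placeholder name):
f <- function(value, vec) which(vec == value)
f(1, x)
[1] 3
f(3, x)
[1] 2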

R: Sample into bins of predefined sizes (partition sample vector)

I'm working on a dataset that consists of ~10^6 values which are clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering while keeping the bin sizes constant. As a toy example (in pseudocode), this would look something like this:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
for (rand in 1:no.of.randomizations) {
rand.data <- partition.sample(seq(1,15), partitions=sizes, replace=F)
}
So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)) and randomly sample from it, returning a list with the data partitioned into the right bin sizes given already by "sizes".
I've been trying to write one such function myself, since the task seems to be not so hard. However, the partitioning of a vector into given bin sizes looks like it would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could please point me to a smart solution that is around :-)
Your help & time are very much appreciated! :-)
Best,
Lymond
UPDATE:
By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.
Moreover, I would in addition be interested in a trick to do the above feat for sampling without replacement.
Thanks in advance, your help is very much appreciated!
Revised: This should be fairly efficient. Its complexity should be primarily in the permutation step:
# A single step:
x <- sample( unlist(data))
list( one=x[1:4], two=x[5:8], three=x[9], four=x[10:12], five=x[13:15])
As mentioned above, the "no.of.randomizations" may have been the number of repeated applications of this process, in which case you may want to wrap replicate around that:
replic <- replicate(n=4, { x <- sample(unlist(data))
list( x[1:4], x[5:8], x[9], x[10:12], x[13:15]) } )
After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that this is the fastest and most efficient way to go.
In principle, I can generate one long vector of a unique permutation of "data" and then split it into a list of vectors of lengths "sizes" by going via a factor argument supplied to split. For this, I need an additional ID scheme for my different groups of "data", which I happen to have in my case.
It becomes clearer when viewed as code:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
So far, everything as above
names <- c("set1", "set2", "set3", "set4", "set5");
In my case, I am lucky enough to have "names" already provided from the data. Otherwise, I would have to obtain them as (e.g.)
names <- seq(1, length(data));
This "names" vector can then be expanded by "sizes" using rep:
cut.by <- rep(names, times = sizes);
[1] 1 1 1 1 2 2 2 2 3 4 4 4 5
[14] 5 5
This new vector "cut.by" can then be provided as argument to split()
rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1] 8 9 14 4
$`2`
[1] 10 2 15 13
$`3`
[1] 12
$`4`
[1] 11 3 5
$`5`
[1] 7 6 1
This does the job I was looking for alright. It samples from the background "1:15" and splits the result into vectors of lengths "sizes" through the vector "cut.by".
However, I am still not happy to have to go via an additional (possibly) long vector to indicate the split positions, such as "cut.by" in the code above. This definitely works, but for very long data vectors, it could become quite slow, I guess.
Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)
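For reference, the step-by-step approach above can be folded into one small helper shaped like the partition.sample call from the question (a sketch; it assumes sizes is the list of bin lengths built earlier):
partition.sample <- function(pool, sizes) {
  split(sample(pool, length(pool)), rep(seq_along(sizes), times = unlist(sizes)))
}
rand.data <- partition.sample(1:15, sizes)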
