Counting positive smiles in a string using R

Each row of src$Review contains text in Russian. I want to count the number of positive smiles in each row. For example, in "My apricot is orange)) (for sure)" I do not want to count every closing bracket (the bracket in "(for sure)" must be excluded); I want to count only the smiling characters: "))" (a run of at least two closing brackets), ":)" and ":-)". A run of closing brackets counts only if it contains at least two brackets.
Take the string "I love this girl!)))) (she makes me happy) every day:):) :-)!". Here we count "))))" (4 units), ":)" (2 units) and ":-)" (1 unit), and then add the units up (7 in total). Note that the bracket in "(she makes me happy)" is not counted.
I currently have the following code in my script:
smilecounts <- str_count(src$Review, "[))]")
As far as I can tell from comparing the data set with the output of this command, it counts every closing bracket, including those that belong to ordinary bracket pairs ("()").
I only need the total number of ":)", ":-)" and "))" (closing brackets that appear in runs such as "))" in the rows). For example, ")))))" contains 5 closing brackets; the condition of at least two closing brackets in a row is satisfied, so we count all the brackets in this run (5).
Thank you in advance for your help.

We can use regex lookarounds to extract each ) that follows a ), a !, a : or a :-, then use length to get the count.
library(stringr)
length(str_extract_all(str1, '(?<=\\)|\\!)\\)')[[1]])
#[1] 4
length(str_extract_all(str1, '(?<=:)\\)')[[1]])
#[1] 2
length(str_extract_all(str1, '(?<=:-)\\)')[[1]])
#[1] 1
Or this can be done using a loop
pat <- c('(?<=\\)|\\!)\\)', '(?<=:)\\)', '(?<=:-)\\)')
sum(sapply(lapply(pat, str_extract_all, string=str1),
function(x) length(unlist(x))))
#[1] 7
data
str1 <- "I love this girl!)))) (she makes me happy) every day:):) :-)!"

One way with gregexpr and regmatches:
vec <- "I love this girl!)))) (she makes me happy) every day:):) :-)!"
Solution:
#matches the locations of :-) or ))+ or :)
a <- gregexpr(':-)+|))+|:)+', vec)
#extracts those
b <- regmatches(vec, a)[[1]]
b
#[1] "))))" ":)" ":)" ":-)"
#table counts the instances
table(b)
b
)))) :-) :)
1 1 2
Then I suppose you could count the number of single )s using
nchar(b[1])
[1] 4
Or in a more automated way:
tab <- table(b)
#the following means "if a name of the table consists only of ) then
#count the number of )s"
tab2 <- ifelse(gsub(')','', names(table(b)))=='', nchar(names(table(b))), table(b))
names(tab2) <- names(tab)
> tab2
)))) :-) :)
4 1 2
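To apply the same idea to a whole column such as src$Review, the counting can be wrapped in one function. This is only a sketch in base R: count_smiles is a hypothetical helper name, and it assumes that ":)" and ":-)" count as one unit each, while a run of two or more ")" counts every bracket in the run. A single ")" left over after ":)" (as in ":))") is not counted under this assumption.

```r
# Hypothetical helper: count positive smiles in each element of a character vector
count_smiles <- function(x) {
  # match ":)" / ":-)" or a run of at least two closing brackets
  hits <- regmatches(x, gregexpr(":-?\\)|\\){2,}", x, perl = TRUE))
  vapply(hits, function(h) {
    runs <- grepl("^\\)+$", h)          # TRUE for bracket runs such as "))))"
    sum(nchar(h[runs])) + sum(!runs)    # every bracket in a run + one per ":)" or ":-)"
  }, integer(1))
}

reviews <- c("I love this girl!)))) (she makes me happy) every day:):) :-)!",
             "My apricot is orange)) (for sure)",
             "no smiles (here)")
count_smiles(reviews)
#[1] 7 2 0
```

With the data from the question this would be called as count_smiles(src$Review).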

Related

R: grep multiple strings at once

I have a data frame with 1 variable and 5,000 rows, where each element is a string.
1. "Am open about my feelings."
2. "Take charge."
3. "Talk to a lot of different people at parties."
4. "Make friends easily."
5. "Never at a loss for words."
6. "Don't talk a lot."
7. "Keep in the background."
.....
5000. "Speak softly."
I need to find and output row numbers that correspond to 3 specific elements.
Currently, I use the following:
grep("Take charge." , df[,1])
grep("Make friends easily.", df[,1])
grep("Speak softly.", df[,1])
And get the following output:
[1] 2 [2] 4 [3] 5000
Question 1. Is there a way to make the syntax more succinct, so that I do not have to repeat grep and df[,1] on every line?
Question 2. If so, how can I output a single numeric vector of the matching row positions, so that the result looks something like this?
2, 4, 5000
What I tried so far.
grep("Take charge." , "Make friends easily.","Speak softly.",
df[,1]) # this didn't work
I also tried creating a vector, called m1, that contains all three elements and then calling grep(m1, df[,1]), but this didn't work either.
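Since these are exact matches, use this, where phrases is a character vector of the phrases you want to match:
match(phrases, df[, 1])
This also works, provided no phrase also occurs as a substring of other rows. Note that grep uses only the first element of a vector pattern, so the phrases are collapsed into a single alternation first (the unescaped "." matches any character here):
grep(paste(phrases, collapse = "|"), df[, 1])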
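A small self-contained sketch of both approaches, using a shortened five-row stand-in for the 5,000-row data frame (the column name item and the variable names are assumptions made for this example). The second variant uses fixed = TRUE so that the "." in each phrase is treated literally.

```r
# Shortened, made-up stand-in for the data frame in the question
df <- data.frame(item = c("Am open about my feelings.",
                          "Take charge.",
                          "Talk to a lot of different people at parties.",
                          "Make friends easily.",
                          "Speak softly."),
                 stringsAsFactors = FALSE)
phrases <- c("Take charge.", "Make friends easily.", "Speak softly.")

# exact matches: position of the first row equal to each phrase
exact_rows <- match(phrases, df[, 1])
exact_rows
#[1] 2 4 5

# substring matches: one literal grep per phrase, combined into one vector
substring_rows <- sort(unique(unlist(lapply(phrases, grep, x = df[, 1], fixed = TRUE))))
substring_rows
#[1] 2 4 5
```

match returns positions in the order of phrases and only the first hit per phrase, whereas the grep variant returns every matching row in row order.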

Best match of each element of a string with many strings

Input Data:
a <- c("coca cola","hot coffee","Running Shoes","Table cloth",
"mobile phones under 5000","Amazon kindle")
b <- c("running shoes","plastic cup","pizza","Let’s go to hill","motor van",
"coffee table","drinking coffee on a rainy day","Best mobile phones under 10000",
"kindle e-books","Coffee Cup")
Match each word of each sentence in one vector (here vector a) against all strings in a separate vector (here vector b), word by word, and get the best match.
Logic:
Every sentence of vector "a" has to be matched against every sentence of vector "b" word by word, and a match percentage has to be calculated.
There can be only one best match per sentence of vector "a".
Example 1: "Running Shoes" in vector "a" matches "running shoes" in vector "b" perfectly, so the percentage_match is 100% (both words match).
Example 2: the best match of "hot coffee" may be "drinking coffee on a rainy day", "coffee table" or "Coffee Cup", and the percentage_match is 50% (only "coffee" out of "hot coffee" matches in all three cases). In such a scenario, where more than one contender has the same maximum percentage_match, we choose the match with the shortest string, i.e. "coffee table" and "Coffee Cup" get priority over "drinking coffee on a rainy day". If there is still a tie after this, we are free to choose any of them (i.e. either "coffee table" or "Coffee Cup" can be the best match for "hot coffee").
Code Tried:
as <- strsplit(a, " ")
bs <- strsplit(b, " ")
matchFun <- function(x, y) length(intersect(x, y)) / length(x) * 100
mx <- outer(as, bs, Vectorize(matchFun))
m <- apply(mx, 1, which.max) # the maximum column of each row
z <- unlist(apply(mx, 1, function(x) x[which.max(x)])) # maximum percentage
z[z == 0] <- NA # this gives you the NA if you want it
data.frame(a, Matching_String=b[m], match_perc=z)
Problem faced: since my actual data is very big (more than 2 million records have to be matched against 1 million records), this code takes forever.
Here's one way to do this using stringdistmatrix from the stringdist package. Basically, we calculate the distance between every string in a and every string in b, and then keep the match with the smallest distance. There will always be a match, even if the distance is large. One thing you could do is set a maximum acceptable distance and return NA when the best match exceeds it.
library(stringdist)
m <- stringdistmatrix(tolower(a), tolower(b), method = "qgram")
b[apply(m, 1, which.min)]
#[1] "plastic cup" "coffee table" "running shoes"
#[4] "coffee table" "best mobile phones under 10000" "kindle e-books"
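For comparison, the word-overlap rule described in the question (percentage of matched words, ties broken by the shortest candidate) can be sketched in base R as follows. best_match is a hypothetical helper name; the sketch assumes case-insensitive matching, splitting on single spaces, and string length measured in characters. It illustrates the logic only and does nothing to address the running time on millions of records.

```r
# Hypothetical helper: best word-overlap match in `b` for each element of `a`
best_match <- function(a, b) {
  a_words <- strsplit(tolower(a), " ", fixed = TRUE)
  b_words <- strsplit(tolower(b), " ", fixed = TRUE)
  res <- lapply(a_words, function(x) {
    # percentage of the words of x that occur in each element of b
    perc <- vapply(b_words, function(y) 100 * mean(x %in% y), numeric(1))
    if (max(perc) == 0) return(list(match = NA_character_, perc = NA_real_))
    tied <- which(perc == max(perc))           # all candidates with the top score
    pick <- tied[which.min(nchar(b[tied]))]    # tie-break: shortest candidate
    list(match = b[pick], perc = perc[pick])
  })
  data.frame(a = a,
             Matching_String = vapply(res, `[[`, character(1), "match"),
             match_perc = vapply(res, `[[`, numeric(1), "perc"),
             stringsAsFactors = FALSE)
}

a <- c("coca cola", "hot coffee", "Running Shoes", "Table cloth",
       "mobile phones under 5000", "Amazon kindle")
b <- c("running shoes", "plastic cup", "pizza", "Let's go to hill", "motor van",
       "coffee table", "drinking coffee on a rainy day",
       "Best mobile phones under 10000", "kindle e-books", "Coffee Cup")
result <- best_match(a, b)
result
```

Under these assumptions "hot coffee" is paired with "Coffee Cup" at 50%, "Running Shoes" with "running shoes" at 100%, and "coca cola" gets NA because none of its words occur in b.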

Find longest word with even number of characters

So I have a string and I need to find the word which matches two constraints viz, the number of characters in the word should be even and it should be the longest such word.
For ex:
Input: I am a bad coder with good logical skills
Output: skills
Just starting off with R so any help would be great.
You can try the tokenizers library:
library(tokenizers)
text <- "I am a bad coder with good logical skills"
names(which.max(sapply(Filter(function(x) nchar(x) %% 2 == 0,
unlist(tokenize_words(text))), nchar)))
#[1] "skills"
Here is my code:
input<-"I am a bad coder with good logical skills"
words<-strsplit(input," ") # Split it to words
countWords<-sapply(words,nchar) # Count the length of words
dt<-data.frame( word=unlist(words), length=unlist(countWords) ) # Make a dataframe
dt<-dt[order(dt$length),] # Sort the dataframe based on length
dt<-dt[ which((dt$length %% 2)==0),] # Get the words with even length
dt[nrow(dt),] # Get the longest word
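The same steps can also be written without a data frame or an extra package. A minimal base R sketch, where longest_even_word is a hypothetical helper name and words are assumed to be separated by whitespace:

```r
# Hypothetical helper: longest word with an even number of characters
longest_even_word <- function(text) {
  words <- strsplit(text, "\\s+")[[1]]       # split on whitespace
  words <- words[nzchar(words)]              # drop empty pieces
  even <- words[nchar(words) %% 2 == 0]      # keep even-length words only
  if (length(even) == 0) return(NA_character_)
  even[which.max(nchar(even))]               # first of the longest ones
}

longest_even_word("I am a bad coder with good logical skills")
#[1] "skills"
```

If several even-length words share the maximum length, which.max returns the first of them.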

How to generate word sequence

1. I want to generate combinations of characters from a given word, with each letter repeated consecutively at most 2 times and at least once. The resulting words are of unequal lengths. For example, from
"cat"
to
"cat", "catt", "caat", "caatt", "ccat", "ccatt", "ccaat", "ccaatt"
The required function takes a word of length n and generates 2^n words of unequal length. This is similar to binary numbers, where n digits give 2^n combinations. For example, a 3-digit binary number gives
000 001 010 011 100 101 110 111
combinations, where 0=t and 1=tt.
2. The same function should also restrict the resulting sequence to at most 2 consecutive repetitions of a character, even if the given word already contains repeated letters. For example
"catt"
to
"catt" "ccatt" "caatt" "ccaatt"
I tried something like this
pos=expand.grid(l1=c(1,11),l2=c(2,22),l3=c(3,33))
result=chartr('123','cat',paste0(pos[,1],pos[,2],pos[,3]))
#[1] "cat" "ccat" "caat" "ccaat" "catt" "ccatt" "caatt" "ccaatt"
It gives correct sequence but I am stuck with generalizing it to any given word with different lengths.
Thank you.
x="cat"
l=seq(nchar(x))
n=paste(l,collapse="")
m=split(c(l,paste0(l,l)),rep(l,2))
chartr(n,x,do.call(paste0,expand.grid(m)))
1. Just an addition to the answer given by Onyambu, to solve the second part of the question, i.e. to restrict the output to at most 2 consecutive repetitions of a character, given any number of consecutive repetitions of characters in the input word.
x="catt"
l=seq(nchar(x))
n=paste(l,collapse="")
m=split(c(l,paste0(l,l)),rep(l,2))
o <- chartr(n,x,do.call(paste0,expand.grid(m)))
The line of code below reduces every run of more than 2 consecutive repeated characters to 2 and then drops the duplicate words:
unique(gsub('([[:alpha:]])\\1{2,}', '\\1\\1', o))
#[1] "catt" "ccatt" "caatt" "ccaatt"
2. If you want all the combinations from "cat" to "ccaatt", given any number of consecutive repetitions of characters in the input word, the code is
x1="catt"
The line of code below collapses each run of repeated characters to a single character.
x2= gsub('([[:alpha:]])\\1+', '\\1', x1)
l=seq(nchar(x2))
n=paste(l,collapse="")
m=split(c(l,paste0(l,l)),rep(l,2))
o <- chartr(n,x2,do.call(paste0,expand.grid(m)))
unique(gsub('([[:alpha:]])\\1{2,}', '\\1\\1', o))
#[1] "cat" "ccat" "caat" "ccaat" "catt" "ccatt" "caatt" "ccaatt"
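The chartr approach encodes letter positions as single digits, so it may not carry over to words longer than 9 characters. As a sketch of an alternative that avoids the digit encoding, the single and doubled form of each letter can be passed to expand.grid directly. stretch_word is a hypothetical helper name; it first collapses runs of a repeated letter, so it corresponds to the second variant above ("catt" is treated like "cat").

```r
# Hypothetical helper: all words with each letter written once or twice
stretch_word <- function(word) {
  # collapse runs of a repeated letter, e.g. "catt" -> c, a, t
  chars <- rle(strsplit(word, "")[[1]])$values
  # for every letter, offer the single and the doubled form
  opts <- lapply(chars, function(ch) c(ch, paste0(ch, ch)))
  # all combinations, pasted together position by position
  do.call(paste0, expand.grid(opts, stringsAsFactors = FALSE))
}

stretch_word("cat")
#[1] "cat"    "ccat"   "caat"   "ccaat"  "catt"   "ccatt"  "caatt"  "ccaatt"
```

A word with n distinct runs of letters gives 2^n results, so the output grows quickly with word length.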

Gathering the correct amount of digits for numbers when text mining

I need to search for specific information within a set of documents that follows the same standard layout.
After I used grep to find the keywords in every document, I went on collecting the numbers or characters of interest.
One piece of data I have to collect is the Total Power that appears as following:
TotalPower: 986559. (UoPow)
Since I had already correctly selected this excerpt, I created the following function that takes the characters between positions n and m, where n and m start counting up from right to left.
substrRight <- function(x, n,m){
substr(x, nchar(x)-n+1, nchar(x)-m)
}
It's important to say that from the ":" to the number 986559, there are 2 spaces; and from the "." to the "(", there's one space.
So I wrote:
TotalP = substrRight(myDf[i],17,9) [1]
where myDf is a character vector with all the relevant observations.
Line [1], after I loop over all my observations, gives me the numbers I want, but I noticed that when the number was 986559, the result was 98655. It simply doesn't "see" 9 as the last number.
The code seems to work fine for the rest of the data. This number (986559) is indeed the highest number in the data and is the only one with order 10^5 of magnitude.
How can I make sure that I will gather all digits in every number?
Thank you for the help.
We can extract the digits before a . using a regex lookaround:
library(stringr)
str1 <- "TotalPower:  986559. (UoPow)"
str_extract(str1, "\\d+(?=\\.)")
#[1] "986559"
The \\d+ matches one or more digits, and the lookahead (?=\\.) requires them to be followed by a literal dot.
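The same lookahead also works in base R, which avoids depending on fixed character positions. A minimal sketch, assuming every element contains a number followed by a dot; the vector lines and its second element are made-up example data:

```r
# Made-up example lines in the layout shown in the question
lines <- c("TotalPower:  986559. (UoPow)",
           "TotalPower:  12345. (UoPow)")

# digits of any length that are directly followed by a literal dot
digits <- regmatches(lines, regexpr("[0-9]+(?=\\.)", lines, perl = TRUE))
total_power <- as.numeric(digits)
total_power
#[1] 986559  12345
```

Note that regmatches drops elements without a match, so the result can be shorter than the input if some lines do not follow the layout.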
