Find longest word with even number of characters - r

So I have a string and I need to find the word that satisfies two constraints: the number of characters in the word should be even, and it should be the longest such word.
For example:
Input: I am a bad coder with good logical skills
Output: skills
Just starting off with R so any help would be great.

You can try the tokenizers library:
library(tokenizers)
text <- "I am a bad coder with good logical skills"
names(which.max(sapply(Filter(function(x) nchar(x) %% 2 == 0,
unlist(tokenize_words(text))), nchar)))
#[1] "skills"

Here is my code:
input<-"I am a bad coder with good logical skills"
words<-strsplit(input," ") # Split it to words
countWords<-sapply(words,nchar) # Count the length of words
dt<-data.frame( word=unlist(words), length=unlist(countWords) ) # Make a dataframe
dt<-dt[order(dt$length),] # Sort the dataframe based on length
dt<-dt[ which((dt$length %% 2)==0),] # Get the words with even length
dt[nrow(dt),] # Get the longest word
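For comparison, here is a minimal base R sketch of the same idea without a data frame. The function name longest_even_word is hypothetical, and the sketch assumes that words are separated by whitespace and that the first of several equally long words should win.

```r
# Hypothetical helper: longest word with an even number of characters.
longest_even_word <- function(text) {
  words <- strsplit(text, "\\s+")[[1]]     # split on runs of whitespace
  words <- words[nzchar(words)]            # drop empty strings (e.g. leading spaces)
  even  <- words[nchar(words) %% 2 == 0]   # keep even-length words only
  if (length(even) == 0) return(NA_character_)
  even[which.max(nchar(even))]             # first word of maximal length
}

longest_even_word("I am a bad coder with good logical skills")
#[1] "skills"
```

Returning NA when no even-length word exists is a design choice made for this sketch, not something the question specifies.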

Related

regex to find the position of the first four concurrent unique values

I've solved Advent of Code 2022 day 6, but was wondering if there was a regex way to find the first occurrence of 4 non-repeating characters:
From the question:
bvwbjplbgvbhsrlpgdmjqwftvncz
bvwbjplbgvbhsrlpgdmjqwftvncz
# discard as repeating letter b
bvwbjplbgvbhsrlpgdmjqwftvncz
# match the 5th character, which signifies the end of the first four character block with no repeating characters
In R I've tried:
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_match("(.*)\1", txt)
But I'm having no luck.
You can use
stringr::str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
Here, each (.) captures any character into a consecutively numbered group, and the (?!...) negative lookaheads make sure each subsequent . does not match an already captured character.
See the R demo:
library(stringr)
txt <- "bvwbjplbgvbhsrlpgdmjqwftvncz"
str_extract(txt, "(.)(?!\\1)(.)(?!\\1|\\2)(.)(?!\\1|\\2|\\3)(.)")
## => [1] "vwbj"
Note that stringr::str_match (like stringr::str_extract) takes the input string as the first argument and the regex as the second argument.
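If the window size needs to vary, a plain sliding window may be easier to adapt than the regex. The following is only a sketch in base R; the function name first_distinct_run_end and its parameter n are made up for illustration. It returns the position of the last character of the first run of n pairwise distinct characters.

```r
# Hypothetical helper: end position of the first run of n consecutive,
# pairwise distinct characters (NA if there is no such run).
first_distinct_run_end <- function(txt, n = 4) {
  chars <- strsplit(txt, "")[[1]]
  if (length(chars) < n) return(NA_integer_)
  for (i in seq_len(length(chars) - n + 1)) {
    window <- chars[i:(i + n - 1)]
    # anyDuplicated() returns 0 when all elements of the window are unique
    if (anyDuplicated(window) == 0) return(as.integer(i + n - 1))
  }
  NA_integer_
}

first_distinct_run_end("bvwbjplbgvbhsrlpgdmjqwftvncz")
#[1] 5
```

This agrees with the regex answer: "vwbj" starts at position 2 and ends at position 5.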

Filter rows based on dynamic pattern

I have speech data in a dataframe df in column Orthographic:
df <- data.frame(
Orthographic = c("this is it at least probably",
"well not probably it's not intuitive",
"sure no it's I mean it's very intuitive",
"I don't mean to be rude but it's anything but you know",
"well okay maybe"),
Repeat = c(NA, "probably", "it's,intuitive", "I,mean,it's", NA),
Repeat_pattern = c(NA, "\\b(probably)\\b", "\\b(it's|intuitive)\\b", "\\b(I,mean|it's)\\b",
NA))
I want to filter rows based on a dynamic pattern, namely the occurrence of no, never, or not as whole words, OR of n't, before any of the words listed in column Repeat. However, when I combine the pattern \\b(no|never|not)\\b|n't\\b\\s with the alternation patterns in column Repeat_pattern, I get the wrong rows and this warning:
df %>%
filter(grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern), Orthographic))
Orthographic Repeat Repeat_pattern
1 well not probably it's not intuitive probably \\b(probably)\\b
2 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
Warning message:
In grepl(paste0("\\b(no|never|not)\\b|n't\\b\\s", Repeat_pattern), :
argument 'pattern' has length > 1 and only the first element will be used
I don't know why "only the first element will be used" as the two pattern components seem to connect well:
paste0("\\b(no|never|not)\\b|n't\\b\\s", df$Repeat_pattern)
[1] "\\b(no|never|not)\\b|n't\\b\\sNA" "\\b(no|never|not)\\b|n't\\b\\s\\b(probably)\\b"
[3] "\\b(no|never|not)\\b|n't\\b\\s\\b(it's|intuitive)\\b" "\\b(no|never|not)\\b|n't\\b\\s\\b(I,mean|it's)\\b"
[5] "\\b(no|never|not)\\b|n't\\b\\sNA"
The expected output is this:
2 well not probably it's not intuitive probably \\b(probably)\\b
3 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
4 I don't mean to be rude but it's anything but you know I,mean,it's \\b(I,mean|it's)\\b
This looks like a vectorization issue: grepl uses only the first element of a pattern vector, so you need stringr::str_detect, which is vectorized over both the string and the pattern.
Also, you did not group the negative-word alternatives well. All of them must sit in a single group; as written, the top-level | splits the regex so that the Repeat_pattern part is attached only to the n't branch.
Also, NA values are coerced to text and appended to the regex patterns, whereas it seems you want to discard the rows where Repeat_pattern is NA.
You can fix your code by using
df %>%
filter(ifelse(is.na(Repeat_pattern), FALSE, str_detect(Orthographic, paste0("(?:\\bno|\\bnever|\\bnot|n't)\\b.*", Repeat_pattern))))
Output:
Orthographic Repeat Repeat_pattern
1 well not probably it's not intuitive probably \\b(probably)\\b
2 sure no it's I mean it's very intuitive it's,intuitive \\b(it's|intuitive)\\b
3 I don't mean to be rude but it's anything but you know I,mean,it's \\b(I|mean|it's)\\b
I also think the last pattern must be \\b(I|mean|it's)\\b, not \\b(I,mean|it's)\\b.
If there can only be whitespace between the "no" words and the word from the Repeat column, replace .* with \\s+ in my pattern. I used .* to make sure there is a match anywhere to the right of the "no" words.
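If you would rather stay in base R, the row-wise pairing can be sketched with mapply, since grepl itself takes only one pattern. This is just an illustration under a few assumptions: it uses the corrected last pattern \\b(I|mean|it's)\\b, it drops the Repeat column for brevity, and the names patterns, hit and keep are made up here.

```r
df <- data.frame(
  Orthographic = c("this is it at least probably",
                   "well not probably it's not intuitive",
                   "sure no it's I mean it's very intuitive",
                   "I don't mean to be rude but it's anything but you know",
                   "well okay maybe"),
  Repeat_pattern = c(NA, "\\b(probably)\\b", "\\b(it's|intuitive)\\b",
                     "\\b(I|mean|it's)\\b", NA),
  stringsAsFactors = FALSE
)

# One pattern per row: a negation word, anything, then the row's own Repeat_pattern
patterns <- paste0("(?:\\bno|\\bnever|\\bnot|n't)\\b.*", df$Repeat_pattern)

# grepl() takes a single pattern, so pair each pattern with its own string
hit <- mapply(function(p, x) grepl(p, x, perl = TRUE),
              patterns, df$Orthographic, USE.NAMES = FALSE)

# Rows without a Repeat_pattern never qualify
keep <- !is.na(df$Repeat_pattern) & hit
df[keep, ]   # keeps rows 2, 3 and 4
```

The perl = TRUE argument is there because the pattern uses a non-capturing group (?:...).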

R: grep multiple strings at once

I have a data frame with 1 variable and 5,000 rows, where each element is a string.
1. "Am open about my feelings."
2. "Take charge."
3. "Talk to a lot of different people at parties."
4. "Make friends easily."
5. "Never at a loss for words."
6. "Don't talk a lot."
7. "Keep in the background."
.....
5000. "Speak softly."
I need to find and output row numbers that correspond to 3 specific elements.
Currently, I use the following:
grep("Take charge." , df[,1])
grep("Make friends easily.", df[,1])
grep("Speak softly.", df[,1])
And get the following output:
[1] 2 [2] 4 [3] 5000
Question 1. Is there a way to make syntax more succinct, so I do not have to use grep and df[,1] on every single line?
Question 2. If so, how can I output a single numeric vector of the necessary row positions, so the result would look something like this?
2, 4, 5000
What I tried so far.
grep("Take charge." , "Make friends easily.","Speak softly.",
df[,1]) # this didn't work
I tried to create a vector, called m1, that contains all three elements and then grep(m1, df[,1]) # this didn't work either
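Since these are exact matches, use this, where phrases is a character vector of the phrases you want to match:
match(phrases, df[, 1])
This also works, provided none of the phrases occurs as part of a longer string in another row. grep accepts only a single pattern, so the phrases are collapsed into one alternation, and the result comes back in row order:
grep(paste(phrases, collapse = "|"), df[, 1])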
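A small self-contained illustration of the exact-match approach follows. The five-row data frame and its column name item are toy stand-ins for the real 5,000-row data, so the last phrase sits in row 5 here rather than row 5000.

```r
# Toy stand-in for the real data frame (contents are illustrative only)
df <- data.frame(item = c("Am open about my feelings.",
                          "Take charge.",
                          "Talk to a lot of different people at parties.",
                          "Make friends easily.",
                          "Speak softly."),
                 stringsAsFactors = FALSE)
phrases <- c("Take charge.", "Make friends easily.", "Speak softly.")

match(phrases, df[, 1])       # positions in the order of `phrases`; NA if absent
#[1] 2 4 5
which(df[, 1] %in% phrases)   # positions in row order; also finds repeated rows
#[1] 2 4 5
```

match returns only the first position of each phrase, so which(... %in% ...) is the safer choice if a phrase can appear in more than one row.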

Gathering the correct amount of digits for numbers when text mining

I need to search for specific information within a set of documents that follows the same standard layout.
After I used grep to find the keywords in every document, I went on collecting the numbers or characters of interest.
One piece of data I have to collect is the Total Power that appears as following:
TotalPower: 986559. (UoPow)
Since I had already correctly selected this excerpt, I created the following function that takes the characters between positions n and m, where n and m start counting up from right to left.
substrRight <- function(x, n,m){
substr(x, nchar(x)-n+1, nchar(x)-m)
}
It's important to say that from the ":" to the number 986559, there are 2 spaces; and from the "." to the "(", there's one space.
So I wrote:
TotalP = substrRight(myDf[i],17,9) [1]
where myDf is a character vector with all the relevant observations.
Line [1], after I loop over all my observations, gives me the numbers I want, but I noticed that when the number was 986559, the result was 98655. It simply doesn't "see" 9 as the last number.
The code seems to work fine for the rest of the data. This number (986559) is indeed the largest in the data and the only one of order 10^5.
How can I make sure that I will gather all digits in every number?
Thank you for the help.
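We can extract the digits before a . by using a regex lookahead:
library(stringr)
str1 <- "TotalPower:  986559. (UoPow)" # sample line from the question
str_extract(str1, "\\d+(?=\\.)")
#[1] "986559"
The \\d+ matches one or more digits, and the lookahead (?=\\.) requires a literal dot right after them without including it in the match.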
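The same extraction can be sketched in base R with regexpr and regmatches, in case stringr is not available. The variable names str1, m and total_power are only for illustration, and the sample line is the one quoted in the question.

```r
str1 <- "TotalPower:  986559. (UoPow)"

# perl = TRUE is needed because the pattern uses a lookahead (?=...)
m <- regexpr("\\d+(?=\\.)", str1, perl = TRUE)

# regmatches() returns the matched text; convert it to a number
total_power <- as.numeric(regmatches(str1, m))
total_power
#[1] 986559
```

Because the pattern matches however many digits there are, it does not depend on the fixed character positions that substrRight relies on.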

Counting positive smiles in string using R

In src$Review each row is filled with text in Russian. I want to count the number of positive smiles in each row. For example, in "My apricot is orange)) (for sure)" I want to count not just the quantity of outbound brackets (i.e., excluding general brackets in "(for sure)"), but the amount of positive smiling characters ("))" — at least two outbound brackets, number of ":)", ":-)"). So, it works only if at least two outbound brackets are exhibited.
Assume there is a string "I love this girl!)))) (she makes me happy) every day:):) :-)!" Here we count: )))) (4 units), ":)" (2 units), ":-)" (1 unit). Then we add up the units (i.e., 7). Note that we don't count the brackets in "(she makes me happy)".
Now I have following code in my script:
smilecounts <- str_count(src$Review, "[))]")
It counts only the total amount of bracket pairs ("()") (as I understand comparing data set and derivation of this command).
I only need the total number of ":)", ":-)", and "))" (the total number of outbound brackets which display as "))" in rows) to be counted. For example, ")))))" contains 5 outbound brackets; the condition of at least two outbound brackets together is satisfied, so we count all the brackets in this part of the text (i.e., 5 outbound brackets).
Thank you so much for help in advance.
We can use regex lookarounds to extract each ) that follows a ), a :, or a :- (the first pattern also allows a preceding !, which is what starts the run in the example), then use length to get the count.
length(str_extract_all(str1, '(?<=\\)|\\!)\\)')[[1]])
#[1] 4
length(str_extract_all(str1, '(?<=:)\\)')[[1]])
#[1] 2
length(str_extract_all(str1, '(?<=:-)\\)')[[1]])
#[1] 1
Or this can be done using a loop
pat <- c('(?<=\\)|\\!)\\)', '(?<=:)\\)', '(?<=:-)\\)')
sum(sapply(lapply(pat, str_extract_all, string=str1),
function(x) length(unlist(x))))
#[1] 7
data
str1 <- "I love this girl!)))) (she makes me happy) every day:):) :-)!"
One way with gregexpr and regmatches:
vec <- "I love this girl!)))) (she makes me happy) every day:):) :-)!"
Solution:
#matches the locations of :-) or ))+ or :)
a <- gregexpr(':-)+|))+|:)+', vec)
#extracts those
b <- regmatches(vec, a)[[1]]
b
#[1] "))))" ":)" ":)" ":-)"
#table counts the instances
table(b)
)))) :-) :)
1 1 2
Then I suppose you could count the number of single )s using
nchar(b[1])
[1] 4
Or in a more automated way:
tab <- table(b)
#the following means "if a name of the table consists only of ) then
#count the number of )s"
tab2 <- ifelse(gsub(')','', names(table(b)))=='', nchar(names(table(b))), table(b))
names(tab2) <- names(tab)
> tab2
)))) :-) :)
4 1 2
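The counting rule can also be sketched as one vectorized base R function, which may be handier for a whole column such as src$Review. The name count_smiles is hypothetical. The sketch assumes that ":)" and ":-)" each count as one unit and that every bracket in a run of two or more ")" counts as one unit.

```r
# Hypothetical helper: number of "positive smile" units in each string.
count_smiles <- function(x) {
  # Match ":)" or ":-)", or a run of two or more closing brackets
  hits <- regmatches(x, gregexpr(":-?\\)|\\){2,}", x, perl = TRUE))
  vapply(hits, function(m) {
    # a colon smiley is 1 unit; a bracket run is worth its length
    sum(ifelse(startsWith(m, ":"), 1L, nchar(m)))
  }, numeric(1))
}

count_smiles("I love this girl!)))) (she makes me happy) every day:):) :-)!")
#[1] 7
```

The single closing bracket in "(she makes me happy)" is ignored because the run must contain at least two brackets.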

Resources