Identifying specific string along the row to identify the count [duplicate] - r

Is there a function for counting the number of times a particular keyword is contained in a dataset?
For example, if dataset <- c("corn", "cornmeal", "corn on the cob", "meal") the count would be 3.

Let's for the moment assume you wanted the number of elements containing "corn":
length(grep("corn", dataset))
[1] 3
After you get the basics of R down better you may want to look at the "tm" package.
EDIT: I realize that this time around you wanted any-"corn" but in the future you might want to get word-"corn". Over on r-help Bill Dunlap pointed out a more compact grep pattern for gathering whole words:
grep("\\<corn\\>", dataset)
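As a quick check of the difference between the two patterns, here is a minimal base R sketch using the dataset from the question. It uses sum(grepl(...)) as an equivalent of length(grep(...)); the variable names any_corn and word_corn are just illustrative.

```r
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# Elements containing "corn" anywhere in the string
any_corn <- sum(grepl("corn", dataset))

# Elements containing "corn" as a whole word (\\< and \\> mark word edges)
word_corn <- sum(grepl("\\<corn\\>", dataset))

any_corn   # 3
word_corn  # 2
```

Note that both counts are numbers of matching elements, not numbers of occurrences within an element.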

Another quite convenient and intuitive way to do it is to use the str_count function of the stringr package:
library(stringr)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for mere occurrences of the pattern:
str_count(dataset, "corn")
# [1] 1 1 1 0
# for occurrences of the word alone:
str_count(dataset, "\\bcorn\\b")
# [1] 1 0 1 0
# summing it up
sum(str_count(dataset, "corn"))
# [1] 3
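If you would rather avoid a package dependency, the same per-element count can be sketched in base R. This is only one possible approach, not part of stringr; the vector dataset2 and its extra last element are made up here to show an element that contains the keyword twice.

```r
# Hypothetical data: the last element contains the keyword twice
dataset2 <- c("corn", "cornmeal", "corn on the cob", "meal", "corn bread with corn")

# gregexpr() locates every match, regmatches() extracts them,
# and lengths() counts the matches found in each element
per_element <- lengths(regmatches(dataset2, gregexpr("corn", dataset2, fixed = TRUE)))

per_element       # 1 1 1 0 2
sum(per_element)  # 5
```

Unlike length(grep(...)), this counts every occurrence, so an element that mentions the keyword twice contributes 2.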

You can also do something like the following, which counts only the elements that are exactly equal to "corn" (1 in this example, not 3):
length(dataset[which(dataset=="corn")])

I'd just do it with string division like:
library(roperators)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for each vector element:
dataset %s/% 'corn'
# for everything:
sum(dataset %s/% 'corn')

You can use the str_count function from the stringr package to count how many times a keyword occurs in each element of a character vector.
The pattern argument of str_count accepts a regular expression, which is how you specify the keyword.
Regular expression syntax is flexible enough to match whole words as well as arbitrary character patterns.
For example, the following code counts all occurrences of the string "corn" and returns 3:
sum(str_count(dataset, regex("corn")))
To match complete words use:
sum(str_count(dataset, regex("\\bcorn\\b")))
The "\b" specifies a word boundary. By default, str_count treats an apostrophe as a non-word character, so there is a word boundary between "corn" and the apostrophe in "corn's". If your dataset contains the string "corn's", it is therefore matched and included in the result.
To prevent words containing an apostrophe from being counted, use the regex function with the parameter uword = TRUE. This makes the regular expression engine use the Unicode TR 29 definition of word boundaries (see http://unicode.org/reports/tr29/tr29-4.html), which does not treat an apostrophe inside a word as a word boundary.
The following code gives the number of times the word "corn" occurs. Words such as "corn's" are not included.
sum(str_count(dataset, regex("\\bcorn\\b", uword = TRUE)))
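For comparison, here is a rough base R sketch of the same idea. It does not use uword at all; instead it swaps in a different technique, a Perl-style negative lookahead, to reject matches that are directly followed by an apostrophe. The vector words and the pattern are assumptions made for illustration, and the pattern only handles a trailing apostrophe.

```r
# Hypothetical test data
words <- c("corn", "corn's", "cornmeal", "sweet corn")

# \\bcorn\\b matches "corn" as a whole word; (?!') rejects the match
# when the next character is an apostrophe (requires perl = TRUE)
pattern <- "\\bcorn\\b(?!')"

matches <- regmatches(words, gregexpr(pattern, words, perl = TRUE))
whole_word_count <- sum(lengths(matches))

whole_word_count  # 2
```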

Related

Extract all numbers from a character string into a SINGLE character string of numbers in the original order [duplicate]

How can I extract digits from a string that can have a structure of xxxx.x or xxxx.x-x and combine them as a number? e.g.
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1",
"1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
The desired (numeric) output would be:
101011, 101021, 101031...
I tried
regexp <- "([[:digit:]]+)"
solution <- str_extract(list, regexp)
However that only extracts the first set of digits; and using something like
regexp <- "([[:digit:]]+\\.[[:digit:]]+\\-[[:digit:]]+)"
returns the first result (data in its initial form) if matched otherwise NA for shorter strings. Thoughts?
Remove all non-digit symbols:
list <- c("1010.1-1", "1010.2-1", "1010.3-1", "1030-1", "1040-1", "1060.1-1", "1060.2-1", "1070-1", "1100.1-1", "1100.2-1")
as.numeric(gsub("\\D+", "", list))
## => [1] 101011 101021 101031 10301 10401 106011 106021 10701 110011 110021
I have no experience with R, but I do know regular expressions. Looking at the pattern you're specifying, "([[:digit:]]+)", I assume [[:digit:]] stands for [0-9], so you're capturing one group of digits.
It seems to me you're missing a + to make it capture multiple groups of digits.
I'd think you'd need to use "([[:digit:]]+)+".
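Another way to look at the problem is to collect every run of digits in each string and paste the runs back together. The sketch below does this in base R; it is an alternative to the gsub approach rather than a replacement for it, and the names ids, digit_runs and combined are illustrative only.

```r
# A few of the values from the question
ids <- c("1010.1-1", "1030-1", "1060.2-1")

# gregexpr() finds all digit runs, not just the first one
digit_runs <- regmatches(ids, gregexpr("[0-9]+", ids))

# Paste the runs of each element together, then convert to numeric
combined <- as.numeric(vapply(digit_runs, paste, character(1), collapse = ""))

combined  # 101011 10301 106021
```

This also shows why a single-match extraction falls short: each string contains two or three separate digit runs.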

How to subset words with certain number of vowels in rstudio?

I am trying to subset a list of words that have 5 or more vowels using the str_subset function in RStudio, but I can't figure it out.
Are there any suggestions for this issue?
Since you are evidently using stringr, the function str_count will give you what you are after. Assuming your "list of words" means a character vector of single words, the following should do the trick.
library(stringr)
testStrings <- c("Brillig", "slithey", "TOVES",
"Abominable", "EQUATION", "Multiplication", "aaagh")
# count the vowels in each word, then keep words with at least 5
VowelCount <- str_count(testStrings, pattern = "[AEIOUaeiou]")
OutputStrings <- testStrings[VowelCount >= 5]
The part in square brackets is a regular expression which matches any capital or lower case vowel in English. Of course other languages have different sets of vowels which you may need to take into account.
If you want to do the same in base R, the following one-liner should do it:
OutputStrings <- grep("([AEIOUaeiou].*){5,}", testStrings, value = TRUE)
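If you also want the vowel counts themselves in base R, one possible sketch is shown below. It is an alternative to str_count, not what the answer above uses; the names vowel_matches and vowel_count are made up for this example.

```r
testStrings <- c("Brillig", "slithey", "TOVES",
                 "Abominable", "EQUATION", "Multiplication", "aaagh")

# Find every vowel in each word, ignoring case
vowel_matches <- gregexpr("[aeiou]", testStrings, ignore.case = TRUE)

# Count the matches per word
vowel_count <- lengths(regmatches(testStrings, vowel_matches))

vowel_count                    # 2 2 2 5 5 6 3
testStrings[vowel_count >= 5]  # "Abominable" "EQUATION" "Multiplication"
```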

R: How to count gaps at the beginning of a sequence alignment?

I am analyzing an alignment of amino acid sequences using R and need a reproducible way to figure out where the start is for each sequence. My alignment can be read in as a data frame. Here is a sample of 3.
alignment <- data.frame("Strains" = c("Strain.1", "Strain.2", "Strain.3"),
"Sequence" = c("MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSL---------------------",
"MASLIYRQLLTNSYTVNLSDEIQNIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTAEGVVIQGTNNTDRWLATILIEPNVQATNRTYNLFGQQETLLVENTSQTQWKFVDVSKTTSTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNAA-TGY-YSTTSYDTVNMTSSCDFYIIPRSQEGKCTEYINYGLPPIQNTRNVVPVALSAREIVHTRAQVNEDIVVSKTSLWKEMQYNRDITIRFKFDRTIIKAGGLGYKWSEISFKPITYQYTYTRDGEQITAHTTCSVNGVNNFSYNGGSLPTDFAIS--------------",
"-----------------------NIGSAKSQDVTINPGPFAQTGYAPVNWGAGETNDSTTVEPLLDGPYQPTTFNPPTSYWILLAPTVEGVVIQGTNNVDRWLATILIEPNVQATNRTYNLFGQQEILLIENTSQTQWKFVDVSKTTPTGSYTQHGPLFSTPKLYAVMKFSGKIYTYNGTTPNVT-TGY-YSTTNYDTVNMT-----------------------------------------------------"))
Each of the dashes represents a space. What I want to do is read through my data frame and count how many spaces are at the beginning of each sequence. So far I've tried using the str_count function. For example:
alignment$shift <- str_count(alignment$Sequence, "-")
but this fails me when I have gaps downstream in my sequence. Really I'm only interested in the gaps that occur at the beginning of the sequences.
I stumbled across the regex function in a post that almost perfectly matches my problem (How to count the number of hyphens at the beginning of a string in javascript?), but that is in JavaScript and I'm not sure how to translate it to R.
My questions are:
1) Is it possible to have str_count stop looking for "-" characters once it reaches a non-"-" character?
2) Is there a way to use regex or a similar function in R that outputs the length of a character match at the beginning of a string?
You could do this...
alignment$Sequence <- as.character(alignment$Sequence) #in case they are factors (as above)
alignment$shift <- nchar(alignment$Sequence) - nchar(gsub("^-+", "", alignment$Sequence))
alignment$shift
[1] 0 0 23
It just counts the number of characters removed when gsub deletes a run of one or more dashes (-+) anchored at the start of the string (the ^). Gaps further downstream are left alone because they do not touch the start of the string. You could use str_replace instead of gsub.
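For the second question (getting the length of a match at the beginning of a string), a base R sketch using regexpr is shown below. The short sequences in seqs are invented for illustration; regexpr reports the match length in the "match.length" attribute and uses -1 when there is no match, which pmax turns into 0 here.

```r
# Hypothetical short sequences with leading and internal gaps
seqs <- c("MAS-LI--", "---NIG-S", "-A")

# Match a run of dashes anchored at the start of each string
m <- regexpr("^-+", seqs)

# match.length is -1 when nothing matched, so clamp it at 0
leading_gaps <- pmax(attr(m, "match.length"), 0)

leading_gaps  # 0 3 1
```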
Maybe this might help? It'll return the position index of the start and end of the "---" string only if it begins at the start of the string.
library(stringr)
str_locate_all(string = alignment$Sequence, pattern = "^-{1,}[A-Z]")
[[1]]
     start end

[[2]]
     start end

[[3]]
     start end
[1,]     1  24

How to extract characters after a match from a string in r?

I have a variable in a data frame that contains raw json text. Some observations have a set 14 digit number that I want to extract and some don't. If the observation has the information it is under this format:
{"blur": "10010010010010"
I want to extract the 14 digits after {"blur": " if there is a match for this left-hand side part of the string. I tried str_extract but my regex syntax is not the best, any suggestions here?
If it's fully formed JSON you could use a JSON parser, but assuming that
1) it's just fragments as shown in the question, or it is fully formed and you prefer to use regular expressions anyway,
2) each input has 0 or 1 occurrences of the digit string, and
3) NA should be returned if there are 0 occurrences,
then try this.
The second argument to strapply is the regular expression. It returns the portion matched to the capture group, i.e. the part of the regular expression within parentheses. The empty=NA argument tells it what to return if no occurrences are found.
library(gsubfn)
s <- c('{"blur": "10010010010010"', 'abc') # test input
strapply(s, '{"blur": "(\\d+)"', empty = NA, simplify = TRUE)
## [1] "10010010010010" NA
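A base R sketch of the same extraction, for cases where installing gsubfn is not an option, might look like the following. It uses regexec and regmatches rather than strapply, and it assumes (as the question states) that the number is exactly 14 digits; the name blur is illustrative.

```r
s <- c('{"blur": "10010010010010"', 'abc')  # test input

# regexec() returns the full match plus each capture group;
# [{] matches a literal opening brace
m <- regmatches(s, regexec('[{]"blur": "([0-9]{14})"', s))

# Keep the capture group if there was a match, otherwise NA
blur <- vapply(m, function(x) if (length(x) == 2) x[2] else NA_character_,
               character(1))

blur  # "10010010010010" NA
```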

regular expression: remove consecutive repeated characters at least 2 times as well as those after it in a string in R

I have a vector with different strings like this:
s <- c("mir123mm8", "qwe98wwww98", "123m3tqppppw23!")
and
> s
[1] "mir123mm8" "qwe98wwww98" "123m3tqppppw23!"
I would like to have the answer like this:
> c("mir123", "qwe98", "123m3tq")
[1] "mir123" "qwe98" "123m3tq"
That means that if a string has at least 2 consecutive repeated characters, then those characters and everything after them should be removed.
What is the best way to do this using a regular expression in R?
You can use back reference in the pattern to match repeated characters:
sub("(.*?)(.)\\2.*", "\\1", s)
# [1] "mir123" "qwe98" "123m3tq"
The pattern matches when the second capture group, which is a single character, is immediately repeated. The ? makes the first capture group non-greedy, so it stops at the first repeated pair, and the whole match is replaced by the contents of that first group.
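To check what happens with a string that has no repeated characters, here is a small sketch. It is a variant of the call above, not the original: the anchors ^ and $, the perl = TRUE argument, and the extra test string "abc" are additions made for this example.

```r
s <- c("mir123mm8", "qwe98wwww98", "123m3tqppppw23!", "abc")

# (.*?) lazily grabs the prefix, (.)\\2 matches the first repeated pair,
# and .*$ swallows the rest; the whole match is replaced by the prefix
trimmed <- sub("^(.*?)(.)\\2.*$", "\\1", s, perl = TRUE)

trimmed  # "mir123" "qwe98" "123m3tq" "abc"
```

A string without any consecutive repeated character does not match the pattern, so sub returns it unchanged.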
