I am using a regex to search for two exact words in any order. The first part of the search works, but the second part does not.
REGEXP_SUBSTR('TWO WORDS ARE ONE','(?:^|\W)WORDS(?:$|\W)') - one word search
How do I add one more word to the above search?
I have a long text and have counted the letters in every word of it. Now I want to show the shortest and the longest words. I used:
words<-strsplit(text," ")
nchar(words[[1]])
w<-factor(nchar(words[[1]]))
table(w)
and I got a table with the number of words of each length. Now I know, for example, that the longest word has 19 letters, but how can I find and show that one word from the whole text?
EDIT: And how can I show, for example, every 5-letter word?
Try which.max to find the longest word (if several words share the maximum length, it returns only the first):
words[[1]][which.max(nchar(words[[1]]))]
If you want to find all 5-letter words, try the following:
words[[1]][nchar(words[[1]])==5]
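The two lookups above can be combined into one short sketch. The sample sentence is made up for illustration. Comparing against max() and min() instead of using which.max returns every word that ties for the longest or shortest length, not just the first one.

```r
# Made-up sample text; replace with your own.
text <- "The quick brown fox jumps over the extraordinarily lazy dog"

# Split on single spaces and take the character vector out of the list.
words <- strsplit(text, " ")[[1]]
len <- nchar(words)

longest  <- words[len == max(len)]   # all words tied for longest
shortest <- words[len == min(len)]   # all words tied for shortest
five     <- words[len == 5]          # every 5-letter word

print(longest)
print(shortest)
print(five)
```

Wrapping the results in unique() would drop repeated words if the text contains the same word more than once.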
I've been enjoying the powerful function aregexec that allows me to mine strings in a fuzzy way.
With it I can search for a nucleotide string such as "ATGGCTTCGTC" within a DNA section, with a defined allowance for insertions, deletions and substitutions.
However, it only shows me the first match and does not continue through the rest of the string. For example,
If I run
aregexec("a","adfasdfasdfaa")
only the first "a" shows up in the result. I'd like to see all the matches.
I wonder whether there are other, more powerful functions, or an argument that can be added to this one.
Thank you very much.
P.S. I explained the fuzzy search poorly. I mean that the match doesn't have to be perfect. Say I allow a substitution of one character and search for AATTGG in ctagtactaAATGGGatctgct: the capitalized part will be considered a match. I can similarly allow insertions and deletions of a certain number of characters.
gregexpr reports every occurrence of the pattern in the string, as in this example:
gregexpr("as","adfasdfasdfaa")
There is much more information under ?grep in R, which documents these functions and their arguments.
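As a hedged sketch of both halves of this thread: gregexpr combined with regmatches lists every exact match, but gregexpr does not do approximate matching. For the fuzzy case, the second part slides a window over the string and counts mismatches. The helper find_fuzzy is hypothetical (not part of base R), handles substitutions only, and assumes the pattern is no longer than the text.

```r
# Part 1: all exact matches of "a".
x <- "adfasdfasdfaa"
m <- gregexpr("a", x)
starts <- as.integer(m[[1]])     # start position of every match
hits <- regmatches(x, m)[[1]]    # the matched substrings themselves

# Part 2: hypothetical helper, substitutions only (no insertions/deletions).
# Compares the pattern with every window of the same length and keeps
# the windows that differ in at most max_sub positions.
find_fuzzy <- function(pattern, text, max_sub = 1) {
  n <- nchar(pattern)
  pos <- seq_len(nchar(text) - n + 1)
  windows <- substring(text, pos, pos + n - 1)
  p <- strsplit(pattern, "")[[1]]
  mismatches <- vapply(strsplit(windows, ""),
                       function(w) sum(w != p), integer(1))
  keep <- mismatches <= max_sub
  data.frame(start = pos[keep], match = windows[keep],
             stringsAsFactors = FALSE)
}

fuzzy <- find_fuzzy("AATTGG", "ctagtactaAATGGGatctgct", max_sub = 1)
print(starts)
print(fuzzy)
```

The comparison is case-sensitive, which is why only the capitalized stretch from the question qualifies here. Supporting insertions and deletions would need a different approach than fixed-length windows.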
This is my first post on Stack Overflow, so please be indulgent if it is lacking in quality.
I want to learn some webscraping with R and started with a simple example --> Extracting a table from a Wikipedia site.
I managed to download the specific page and identified the HTML sections I am interested in:
<td style="text-align:right">511.000.000\n</td>
Now I want to extract the number from the table data by using a regex. So I created a regex which, from my point of view, should match the structure of the number:
pattern<-"\\d*\\.\\d*\\.\\d*\\.\\d*\\."
I also tried other variations, but none of them found the number within the HTML code. I wanted to keep the pattern open, as the numbers might be in the hundreds, thousands, millions or billions.
My questions:
1. The number is within the HTML code. Might it be necessary to include something for the non-number code (which should not be extracted)?
2. What would be the correct version of the pattern to identify the number correctly?
Thank you very much for your support!!
So many stars imply a lot of backtracking.
Furthermore, \\d* would match more than 3 digits in any group, and would also match a group with no digits at all.
Assuming your numbers are always integers, formatted using a . as thousand separator, you could use the following: \\d{1,3}(?:\\.\\d{3})* (note the usage of non-capturing group construct (?:...) - implying the use of perl = TRUE in arguments, as mentioned in Regular Expressions as used in R).
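A minimal sketch of that pattern applied to the table cell from the question. The cell is hard-coded here as an assumption, with a real newline in place of the printed \n. The last step, stripping the separators and converting to numeric, is an addition for illustration.

```r
# Table cell from the question, hard-coded for the example.
html <- "<td style=\"text-align:right\">511.000.000\n</td>"

# 1 to 3 leading digits, then any number of ".ddd" groups.
pattern <- "\\d{1,3}(?:\\.\\d{3})*"

# regexpr finds the first match; regmatches extracts the matched text.
raw <- regmatches(html, regexpr(pattern, html, perl = TRUE))

# Drop the thousands separators (fixed = TRUE treats "." literally).
value <- as.numeric(gsub(".", "", raw, fixed = TRUE))

print(raw)
print(value)
```

No extra pattern is needed for the surrounding tags in this case, because the regex searches within the string and the markup shown contains no other digits. A cell whose attributes contain digits would need a stricter pattern or an HTML parser.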
Look closely at your regex. You are assuming that the number will have 4 periods (\\.) in it, but in your own example there are only two periods. It's not going to match because while the asterisk marks \\d as optional (zero or more), the periods are not marked as optional. If you add a ? modifier after the 3rd and 4th period, you may find that your pattern starts matching.
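The diagnosis can be checked with a small sketch: the original pattern finds nothing in the cell, while the version with the last two periods made optional does. The variable names are chosen for illustration only.

```r
# Table cell from the question, hard-coded for the example.
html <- "<td style=\"text-align:right\">511.000.000\n</td>"

# Original pattern: requires four literal periods.
original <- "\\d*\\.\\d*\\.\\d*\\.\\d*\\."
# Same pattern with "?" after the 3rd and 4th period.
relaxed <- "\\d*\\.\\d*\\.\\d*\\.?\\d*\\.?"

found_original <- grepl(original, html)
found_relaxed <- regmatches(html, regexpr(relaxed, html))

print(found_original)
print(found_relaxed)
```

The relaxed pattern still requires two periods and accepts digit groups of any length, including empty ones, so it also matches strings such as "..".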
I'm working on creating a word cloud. After creating it, I see many words with their last letters missing. For example: Movie --> movi, become --> becom
I've marked the words in yellow. The last one or two letters are missing.
For those who need the answer to this question: the last letters are missing in the TDM because stemming was applied to the data. The stem function reduces words that share the same root to that root. This is why we see "Movie" as "Movi", and so on.
Missing letters at the end of words are the result of preprocessing, specifically stemming. Avoid stemming before creating the DTM or TDM, and create the word cloud without it.
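A small sketch of the effect, assuming the SnowballC package is installed (the thread does not say which stemmer was used, so this is only one plausible candidate). The word list is made up around the two examples from the question.

```r
# Runs only if SnowballC is available; otherwise it does nothing.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  words <- c("movie", "movies", "become", "becoming")

  # wordStem strips suffixes so that related forms share one stem.
  stems <- SnowballC::wordStem(words, language = "english")

  # Side-by-side view of each word and the stem that would end up
  # in the term-document matrix.
  print(data.frame(word = words, stem = stems, stringsAsFactors = FALSE))
}
```

Building the matrix from the unstemmed words keeps the full spellings in the word cloud, at the cost of counting related forms separately.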
I am attempting to remove all one or two letter words in R with this regular expression:
\\b\\w{1,2}\\b
But I also want to exclude certain two letter words from the removal, e.g. IT.
Is there any way to do this?
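One possible approach, sketched under assumptions: put a negative lookahead in front of the pattern so that the protected words are skipped. This needs perl = TRUE. The sample sentence and the keep list are made up for illustration.

```r
# Made-up sample text and list of short words to protect.
x <- "IT is a no go at my UK job"
keep <- c("IT", "UK")

# (?!(?:IT|UK)\b) rejects a match that would start at a protected word;
# everything else of length 1 or 2 is matched as before.
pattern <- paste0("\\b(?!(?:", paste(keep, collapse = "|"),
                  ")\\b)\\w{1,2}\\b")

cleaned <- gsub(pattern, "", x, perl = TRUE)

# Collapse the spaces left behind by the removed words.
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
```

The match is case-sensitive, so a lowercase "it" would still be removed while "IT" is kept.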