I am attempting to remove all one- or two-letter words in R with this regular expression:
\\b\\w{1,2}\\b
But I also want to exclude certain two-letter words from the removal, e.g. IT.
Is there any way to do this?
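One possible approach, sketched here on the assumption that a PCRE negative lookahead is acceptable (perl = TRUE in base R), is to reject any word on a keep-list before the \\w{1,2} part gets a chance to match. The sample sentence and the keep vector are made up for illustration.

```r
# Hypothetical sample text and keep-list (not from the question).
x <- "IT is an ok day to go to IT"
keep <- c("IT")

# (?!(?:IT)\b) refuses to start a match at a word that is exactly on the
# keep-list. Lookaheads require perl = TRUE in base R.
pattern <- paste0("\\b(?!(?:", paste(keep, collapse = "|"), ")\\b)\\w{1,2}\\b")

removed <- gsub(pattern, "", x, perl = TRUE)

# Collapse the gaps left behind by the removed words.
result <- trimws(gsub("\\s+", " ", removed))
print(result)
```

The match is case-sensitive, so a lowercase "it" would still be removed unless it is added to the keep-list as well.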
I am scraping a Word document to get the frequency of "content words" only. So far, I have used the tidyverse and tidytext packages to remove words that are articles, contain punctuation, have a length of one, etc., with filters like:
!str_detect(word, pattern = "[[:digit:]]"), # removes any words with numeric digits
!str_detect(word, pattern = "[[:punct:]]"), # removes any remaining punctuations
!str_detect(word, pattern = "(.)\\1{2,}"), # removes any words with 3 or more repeated letters
!str_detect(word, pattern = "\\b(.)\\b") # removes any remaining single letter words
Now I no longer want to remove entire observations. I want to remove only certain characters from existing observations (e.g. remove "s" and "ed" endings).
Current Dataframe:
print(df)
WORD N
Happy 7
Apple 8
Coworkers 16
Customers 9
Kicked 11
Turtle 8
Desired Dataframe:
WORD N
Happy 7
Apple 8
Coworker 16
Customer 9
Kick 11
Turtle 8
Your regex may work for simple cases (nouns, verbs) but for more accurate results I recommend a proper stemmer/lemmatizer. I've had good results with spaCy's Lemmatizer.
Here is an R wrapper for spaCy: http://spacyr.quanteda.io/
You can use regular expressions like
/\w+((s)|(ed))$/g
The \w+ will match 1 or more word characters (letters, digits, and the underscore).
The ((s)|(ed))$ looks for an ending of either "s" or "ed". You can extend that list as needed.
The beginning and ending slashes aren't part of the regex, they just mark the beginning and end of the match pattern.
The final g after the last slash is a regex flag that indicates you want to match globally. In most languages this means the engine doesn't stop at the first match but finds all matches. This may not be appropriate in your case, so you'll have to experiment to figure out whether it's what you need.
Note that the beginning/ending slashes and the g flag are syntax that not every language uses. Some languages have regex libraries that make you pass the flags in as separate arguments, so read your language's documentation to figure out how that works. R, for instance, takes the pattern as a plain string without slashes, and sub() replaces only the first match while gsub() replaces all of them.
Wrapping things in parentheses automatically creates capturing groups, so you can check the regex match object to see whether the 1st capture group (corresponding to the outer parentheses) has a match, which tells you that the word has an ending you need to remove. You can then perform a regex replace on that capture group to get rid of any of those endings.
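In R, the replacement step described above might look like the following minimal sketch. It assumes base R's sub() is acceptable and rebuilds the data frame from the question. Only the trailing group is replaced, so the \w+ prefix is not needed.

```r
# Data frame rebuilt from the question's "Current Dataframe".
df <- data.frame(
  WORD = c("Happy", "Apple", "Coworkers", "Customers", "Kicked", "Turtle"),
  N = c(7, 8, 16, 9, 11, 8),
  stringsAsFactors = FALSE
)

# Strip a trailing "s" or "ed"; $ anchors the group to the end of the word.
df$WORD <- sub("(s|ed)$", "", df$WORD)
print(df)
```

A pattern this simple also shortens words such as "Bus" or "Red", which is one reason a stemmer or lemmatizer tends to give more accurate results.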
I recommend https://regex101.com to test your regular expressions while developing them. Here's a regex & test suite I saved pertaining to your question, if you want to use it: https://regex101.com/r/tBduP6/2
This is my first post on Stack Overflow, so please be indulgent if it is lacking in quality.
I want to learn some web scraping with R and started with a simple example: extracting a table from a Wikipedia page.
I managed to download the specific page and identified the HTML sections I am interested in:
<td style="text-align:right">511.000.000\n</td>
Now I want to extract the number from the table data by using regex. So I created a regex that, as far as I can tell, should match the structure of the number:
pattern<-"\\d*\\.\\d*\\.\\d*\\.\\d*\\."
I also tried other variations, but none of them found the number within the HTML code. I wanted to keep the pattern open, as the numbers might be in the hundreds, thousands, millions, or billions.
My questions:
1) The number is within the HTML code. Might it be necessary to include some code for the non-number code (which should not be extracted)?
2) What would be the correct version of the pattern to identify the number correctly?
Thank you very much for your support!!
So many stars implies a lot of backtracking.
Furthermore, \\d* would match more than 3 digits in any group and would also match a group with no digits.
Assuming your numbers are always integers formatted with . as the thousands separator, you could use the following: \\d{1,3}(?:\\.\\d{3})* (note the non-capturing group construct (?:...), which implies passing perl = TRUE, as mentioned in Regular Expressions as used in R).
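A minimal sketch of how that pattern could be applied in base R, assuming the cell's HTML is already in a character string (the html variable below is copied from the question):

```r
# The table cell from the question, as a single string.
html <- '<td style="text-align:right">511.000.000\n</td>'

# 1-3 leading digits, then any number of ".ddd" groups.
pattern <- "\\d{1,3}(?:\\.\\d{3})*"

# regexpr() locates the first match; regmatches() extracts its text.
m <- regmatches(html, regexpr(pattern, html, perl = TRUE))

# Drop the thousands separators before converting to a number.
number <- as.numeric(gsub(".", "", m, fixed = TRUE))
print(m)
print(number)
```

The surrounding tags do not need to appear in the pattern, because regexpr() searches for the pattern anywhere in the string.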
Look closely at your regex. You are assuming that the number will have 4 periods (\\.) in it, but in your own example there are only two periods. It's not going to match because while the asterisk marks \\d as optional (zero or more), the periods are not marked as optional. If you add a ? modifier after the 3rd and 4th period, you may find that your pattern starts matching.
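To see the difference, here is a small sketch that compares the original pattern with a relaxed variant in which the 3rd and 4th periods are optional. The variable names are illustrative only.

```r
x <- "511.000.000"

# Pattern from the question: four mandatory periods.
original <- "\\d*\\.\\d*\\.\\d*\\.\\d*\\."

# Same pattern with the 3rd and 4th periods made optional.
relaxed <- "\\d*\\.\\d*\\.\\d*\\.?\\d*\\.?"

# Does each pattern find anything in x?
original_found <- grepl(original, x, perl = TRUE)
relaxed_match <- regmatches(x, regexpr(relaxed, x, perl = TRUE))

print(original_found)
print(relaxed_match)
```

The relaxed variant still requires two periods, so on its own it would not match a number below one million.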
Context: I need to split strings that are too long and that are used as column headers in an HTML table. Those strings are variable names, so they don't have any spaces in them.
If I let the CSS max-width property do the job, the string is split at a fixed place, without making use of the dots or underscores in the string.
For example, suppose I have this string:
this.is.a.long.string.indeed.yeah.well.you.know
Using the dots as separators, I can split it in many, many different ways. But I pose these guiding principles:
All substrings must be 12 characters or less
Separators [._] should be at the end, not at the beginning of a substring
The number of substrings must be minimal
If several solutions exist, the one having the most similar substring lengths is to be preferred.
I could do this programmatically with R, but I'm turning to regex wizards to see whether this is possible using solely regular expressions.
What I have so far:
Regex: .{1,12}(_|\b|\Z)
Results: this.is.a. | long.string. | indeed.yeah. | well.you. | know
It works well, except when there is a long sequence of letters without any separators. Please see this example on regex101.com.
Ideally, separators would be used whenever possible, and a fallback split would occur when there is a sequence longer than 12 characters without a separator.
You were so close; you just need to add another alternative for cases where no separator is found:
.{1,12}(_|\b|\Z)|.{1,12}
Check it out: https://regex101.com/r/XrJuYj/2/
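For anyone who wants to try this from R rather than regex101, here is a minimal sketch using base R with perl = TRUE. The helper name split_header and the no-separator test string are hypothetical.

```r
# Hypothetical helper: return every piece the pattern matches, in order.
split_header <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Original pattern, then the same pattern with the fallback alternative.
original_pattern <- ".{1,12}(_|\\b|\\Z)"
pattern <- ".{1,12}(_|\\b|\\Z)|.{1,12}"

# String from the question: both patterns split it at the separators.
pieces <- split_header("this.is.a.long.string.indeed.yeah.well.you.know", pattern)

# 26 letters with no separator at all.
no_sep <- paste(letters, collapse = "")
fallback <- split_header(no_sep, pattern)          # hard cut every 12 characters
without_fallback <- split_header(no_sep, original_pattern)  # loses the start

print(pieces)
print(fallback)
print(without_fallback)
```

Without the second alternative, the engine can only match a run that ends at a boundary, so the beginning of a long separator-free name is silently skipped.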
Edit: to ensure the split portion contains a non-separating character, you can use the following:
(?=.{1,12}(.*))(?=.*?[^\W_].*?[\W_].*?\1).{1,12}(?<=_|\b|\Z)|.{1,12}
See it at: https://regex101.com/r/XrJuYj/3
I have spent a lot of time reading successful answers on how to extract part of a string using substr and substring. Still, I am having trouble applying those answers, because I cannot work out how to use specific punctuation characters as start and stop markers when those characters appear more than once within the string.
In the general case, I would like to split my string in several places based on the recurring characters "_" and "."
In my individual case, a cell in one column of my dataframe contains a filename as a string, and I would like to use that string to generate strings in 3 subsequent columns on the same row.
To demonstrate, the string might look like: "Name_12. Word_CsvData.txt" and that should be split without referring to numeric character positions into "Name", "12", "Word"
To do this, I am aiming to find out how to extract part of a string:
1) from the beginning of the string to the first instance of an underscore character.
2) from the first underscore to the first full stop.
3) from the space to the second underscore character.
Any help would be much appreciated.
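One way to do this without character positions, sketched in base R on the assumption that every filename follows the same "A_B. C_rest" layout as the example, is to capture the three parts with a single pattern:

```r
x <- "Name_12. Word_CsvData.txt"

# ([^_]*)  everything up to the first underscore
# ([^.]*)  everything between that underscore and the first full stop
# ([^_]*)  everything between the space and the next underscore
pattern <- "^([^_]*)_([^.]*)\\. ([^_]*)_"

# regexec() returns the full match followed by each capture group.
m <- regmatches(x, regexec(pattern, x))[[1]]

# Drop the full match and keep the three captured pieces.
parts <- m[-1]
print(parts)
```

For a whole data frame column, the same pattern could be applied to each element and the three pieces assigned to new columns.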
I have split the last letter of each individual's name into a separate column in order to count how many lowercase letters the column contains. The entries could be any letter of the alphabet, and I want to count them only if they are lowercase. Any assistance would be appreciated. Thank you.
This would be very easy to do with a VBA custom function, but as VBA isn't mentioned in your tags, you could combine SUMPRODUCT and EXACT to get what is essentially a case-sensitive COUNTIF. EXACT is case-sensitive, so the letters in the array must be lowercase in order to count lowercase entries:
=SUMPRODUCT(EXACT({"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"},D1)*1)
I hope that this helps