I want to put spaces between words that have meaning in R.
For example I want to change this sentence :
sentence<-c("haveagoodday!")
to this one :
"have a good day !"
is it possible ?
Related
I'd like to replace the content of a column in a data frame with only a specific word in that column.
The column always looks like this:
Place(fullName='Würzburg, Germany', name='Würzburg', type='city', country='Germany', countryCode='DE')
Place(fullName='Iphofen, Deutschland', name='Iphofen', type='city', country='Germany', countryCode='DE')
I'd like to extract the city name (in this case Würzburg or Iphofen) into a new column, or replace the entire row with the name of the town. There are many different towns so having a gsub-command for every city name will be tough.
Is there a way to maybe just use a gsub and tell Rstudio to replace whatever it finds inside the first two ' '?
Might it be possible to tell it, "give me the word after "name=' until the next '?
I'm very new to using R so I'm kind of out of ideas.
Thanks a lot for any help!
I know of the gsub command, but I don't think it will be the most appropriate in this case.
Yes, with a regular expression you can do exactly that:
string <- "Place(fullName='Würzburg, Germany', name='Würzburg', type='city', country='Germany', countryCode='DE')"
city <- gsub(".*name='(.*?)'.*", "\\1", string)
The regular expression says "match any characters followed by name=', then capture any characters until the next ' and then match any additional characters". Then you replace all of that with just the captured characters ("\\1").
The parentheses mean "capture this part", and the value becomes "\\1". (You can do multiple captures, with subsequent captures being \\2, \\3, etc.
Note the question mark in (.*?). This means "match as little as possible while still satisfying the rest of the regex". If you don't include the question mark, the regular expression will match "greedily" and you will capture the entire rest of the line instead of just the city since that would also satisfy the regular expression.
More about regular expression (specific to R) can be found here
I have a question similar to this one but instead of having two specific characters to look between, I want to get the text between a space and a specific character. In my example, I have this string:
myString <- "This is my string I scraped from the web. I want to remove all instances of a picture. picture-file.jpg. The text continues here. picture-file2.jpg"
but if I were to do something like this: str_remove_all(myString, " .*jpg) I end up with
[1] "This"
I know that what's happening is R is finding the first instance of a space and removing everything between that space and ".jpg" but I want it to be the first space immediately before ".jpg". My final result I hope for looks like this:
[1] "This is my string I scraped from the web. I want to remove all instances of a picture. the text continues here.
NOTE: I know that a solution may arise which does what I want, but ends up putting two periods next to each other. I do not mind a solution like that because later in my analysis I am removing punctuation.
You can use
str_remove_all(myString, "\\S*\\.jpg")
Or, if you also want to remove optional whitespace before the "word":
str_remove_all(myString, "\\s*\\S*\\.jpg")
Details:
\s* - zero or more whitespaces
\S* - zero or more non-whitespaces
\.jpg - .jpg substring.
To make it case insensitive, add (?i) at the pattern part: "(?i)\\s*\\S*\\.jpg".
If you need to make sure there is no word char after jpg, add a word boundary: "(?i)\\s*\\S*\\.jpg\\b"
I need to extract a specific portion of text that is in a Word (.docx). The document has the following structure:
Question 1:
How many ítems…
two
four
five
ten
Explanation:
There are four ítems in the bag.
Question 2:
How many books…
two
four
five
Explanation:
There are four books in the bag.
With this information I have to create a Dataframe like this one:
I'm able to open the document, extract the text and print the lines starting with , but I'm not able to extract the rest of the string of interest and create the Dataframe.
My code is:
import docx
def getText(filename):
doc = docx.Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
return '\n'.join(fullText)
text=getText('document.docx')
text
strings = re.findall(r" (.+)\n", text)
Any help?
Thanks in advance
I would suggest you expand your regular expression to include all of the information you need. In this case I think you'll need two passes - one to get each question, and a second to parse the possible answers.
Take a look at your source text and break it down into the parts you need. Each item starts with Question n:, then a line for the actual questions, multiple lines for each possible response, followed by Explanation and a line for the explanation. We'll use the grouping operator to extract the parts on interest.
The Question line can be described by the following pattern:
"Question ([0-9]+):\n"
The line that represents the actual question is just text:
"(.+)\n"
The collection of possible responses is a series of lines beginning with a special character (I've replaced it with '*' because I can't tell what character it is from the post), (allowing for possible whitespace)
\*\s*.+\n
but we can get the whole list of them using a combination of grouping including the non-capturing group:
((?:\*\s*.+\n)+)
That causes any number of matching lines to be captured as a single group.
Finally you have "Explanation" possibly preceded by some whitespace, and followed by a line of text:
\s*Explanation:\n(.+)\n
If we put these all together, our regex pattern is
r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"
Parsing this:
patt = r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"
matches = re.findall(patt, text)
yields:
[('1',
'How many ítems…',
'* two\n* four\n* five\n* ten\n',
'There are four ítems in the bag.'),
('2',
'How many books…',
'* two\n* four\n* five\n',
'There are four books in the bag.')]
Where each entry is a tuple. The 3rd item in each tuple is a text of all of the answers as a group, which you'll need to further break down.
The regex to match your answers (using the character '*') is:
\*\s*(.+)\n
Grouping it to eliminate the character, we can use:
r"(?:\*\s*(.+)\n)"
Finally, using a list comprehension we can replace the string value for the answers with a list:
matches = [ tuple([x[0],x[1],re.findall(r"(?:\*\s*(.+)\n)", x[2]),x[3]) for x in matches]
Yielding the result:
[('1',
'How many ítems…',
['two', 'four', 'five', 'ten'],
'There are four ítems in the bag.'),
('2',
'How many books…',
['two', 'four', 'five'],
'There are four books in the bag.')]
Now you should be prepared to massage that into your dataframe.
I have this code that is meant to add highlights to some numbers in a text stored in "lines"
stringr::str_replace_all(lines, nums, function(x) {paste0("<<", x, ">>")})
where nums is the following pattern being deteced
nums<-(Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+\\s?(Hundred|Thousand|Million|Billion|Trillion)?'
The problem I'm having is that the line of code above also leads to numbers embedded in words also being detected. In the following text this happens:
Get <<ten>> eggs. That is what is writ<<ten>>. I am <<one>> and al<<one>>.
when it should be:
Get <<ten>> eggs. That is what is written. I am <<one>> and alone.
I don't want to remove the question mark after the \s because I want to detect both numbers like "One" followed by no space and "One Hundred" which has a space in between.
Does anyone know how to do this?
Surround (Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+ with \b.
\b matches word boundaries, so this expression will newer match inside a word.
I am trying to figure out how I can put together a find and replace command with wildcards or figure out a way to find and replace the following example:
I would like to find terms that contain double quotes in front of them with a single quote at the end:
Example:
find "joe' and replace with 'joe'
Basically, I'm trying to find all terms with terms having "in front and at the end.'
Check the [x] Regular expression checkbox in textpad's replace dialog and enter the following values:
Find what:
"([^'"]*)'
Replace with:
'\1'
Explanation:
In a regular expression, square brackets are used to indicate character classes. A character class beginning with a caret will match anything not in the class.
Thus [^'"] will match any character except ' and ". The following * indicates that any number of these characters can follow. The ( and ) mark a group. And the group we're looking for starts with " and ends with '. Finally in the replace string we can refer to any group via \n where n is the nth group. In our case it is the first and only group and that is why we used \1.