Extract text from word and convert into Dataframe - docx

I need to extract a specific portion of text from a Word (.docx) document. The document has the following structure:
Question 1:
How many ítems…
 two
 four
 five
 ten
Explanation:
There are four ítems in the bag.
Question 2:
How many books…
 two
 four
 five
Explanation:
There are four books in the bag.
With this information I have to create a Dataframe like this one:
I'm able to open the document, extract the text and print the lines starting with  , but I'm not able to extract the rest of the string of interest and create the Dataframe.
My code is:
import re
import docx

def getText(filename):
    # Collect the text of every paragraph and join them with newlines
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

text = getText('document.docx')
text
strings = re.findall(r" (.+)\n", text)
Any help?
Thanks in advance

I would suggest you expand your regular expression to include all of the information you need. In this case I think you'll need two passes: one to get each question, and a second to parse the possible answers.
Take a look at your source text and break it down into the parts you need. Each item starts with Question n:, then a line for the actual question, multiple lines for the possible responses, followed by Explanation: and a line for the explanation. We'll use capturing groups to extract the parts of interest.
The Question line can be described by the following pattern:
"Question ([0-9]+):\n"
The line that represents the actual question is just text:
"(.+)\n"
The collection of possible responses is a series of lines, each beginning with a special character (I've replaced it with '*' because I can't tell what character it is from the post) and optional whitespace:
\*\s*.+\n
We can capture the whole list at once by wrapping a repeated non-capturing group in a capturing group:
((?:\*\s*.+\n)+)
That causes any number of matching lines to be captured as a single group.
Finally you have "Explanation" possibly preceded by some whitespace, and followed by a line of text:
\s*Explanation:\n(.+)\n
If we put these all together, our regex pattern is
r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"
Parsing this:
patt = r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"
matches = re.findall(patt, text)
yields:
[('1',
  'How many ítems…',
  '* two\n* four\n* five\n* ten\n',
  'There are four ítems in the bag.'),
 ('2',
  'How many books…',
  '* two\n* four\n* five\n',
  'There are four books in the bag.')]
Each entry is a tuple. The third item in each tuple is a single string containing all of the answers, which you'll need to break down further.
The regex to match your answers (using the character '*') is:
\*\s*(.+)\n
Grouping it to eliminate the character, we can use:
r"(?:\*\s*(.+)\n)"
Finally, using a list comprehension we can replace the string value for the answers with a list:
matches = [(x[0], x[1], re.findall(r"(?:\*\s*(.+)\n)", x[2]), x[3]) for x in matches]
Yielding the result:
[('1',
  'How many ítems…',
  ['two', 'four', 'five', 'ten'],
  'There are four ítems in the bag.'),
 ('2',
  'How many books…',
  ['two', 'four', 'five'],
  'There are four books in the bag.')]
Now you should be prepared to massage that into your dataframe.
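As a rough end-to-end sketch of the two passes, the snippet below runs both regexes on a hard-coded copy of the sample text and collects one dict per question. It keeps '*' as a stand-in for the unknown bullet character. The dict keys (number, question, answers, explanation) are hypothetical names, since the target layout from the question was not preserved. It also appends a trailing newline, on the assumption that the text comes from '\n'.join(...), which leaves none after the last explanation and would otherwise make the last question fail to match.

```python
import re

# Sample text mirroring the question's layout; '*' stands in for the
# unknown bullet character, as in the answer above.
text = "\n".join([
    "Question 1:", "How many ítems…", "* two", "* four", "* five", "* ten",
    "Explanation:", "There are four ítems in the bag.",
    "Question 2:", "How many books…", "* two", "* four", "* five",
    "Explanation:", "There are four books in the bag.",
])

# '\n'.join() leaves no newline after the last line, but the pattern
# expects one after every explanation, so add it before matching.
if not text.endswith("\n"):
    text += "\n"

patt = r"Question\s+([0-9]+):\n(.*)\n((?:\*\s*.+\n)+)\s*Explanation:\n(.+)\n"

# First pass: one tuple per question. Second pass: split the answer block.
# The dict keys are hypothetical column names.
rows = []
for number, question, answers, explanation in re.findall(patt, text):
    rows.append({
        "number": int(number),
        "question": question,
        "answers": re.findall(r"\*\s*(.+)\n", answers),
        "explanation": explanation,
    })
```

If pandas is installed, passing rows to pandas.DataFrame should give one row per question with one column per key.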

Related

Replacing Content of a column with part of that column's content

I'd like to replace the content of a column in a data frame with only a specific word in that column.
The column always looks like this:
Place(fullName='Würzburg, Germany', name='Würzburg', type='city', country='Germany', countryCode='DE')
Place(fullName='Iphofen, Deutschland', name='Iphofen', type='city', country='Germany', countryCode='DE')
I'd like to extract the city name (in this case Würzburg or Iphofen) into a new column, or replace the entire row with the name of the town. There are many different towns so having a gsub-command for every city name will be tough.
Is there a way to maybe just use a gsub and tell Rstudio to replace whatever it finds inside the first two ' '?
Might it be possible to tell it "give me the word after name=' until the next '"?
I'm very new to using R so I'm kind of out of ideas.
Thanks a lot for any help!
I know of the gsub command, but I don't think it will be the most appropriate in this case.
Yes, with a regular expression you can do exactly that:
string <- "Place(fullName='Würzburg, Germany', name='Würzburg', type='city', country='Germany', countryCode='DE')"
city <- gsub(".*name='(.*?)'.*", "\\1", string)
The regular expression says "match any characters followed by name=', then capture any characters until the next ' and then match any additional characters". Then you replace all of that with just the captured characters ("\\1").
The parentheses mean "capture this part", and the value becomes "\\1". (You can do multiple captures, with subsequent captures being \\2, \\3, etc.)
Note the question mark in (.*?). This means "match as little as possible while still satisfying the rest of the regex". If you don't include the question mark, the regular expression will match "greedily" and the capture will run up to the last ' on the line instead of stopping after the city, since that would also satisfy the regular expression.
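To see the difference between the lazy and greedy capture side by side, here is a small Python sketch using re.sub in place of R's gsub. The variable names lazy and greedy are only for illustration.

```python
import re

string = ("Place(fullName='Würzburg, Germany', name='Würzburg', "
          "type='city', country='Germany', countryCode='DE')")

# Lazy capture: stops at the first ' after name='
lazy = re.sub(r".*name='(.*?)'.*", r"\1", string)

# Greedy capture: runs on to the last ' in the string
greedy = re.sub(r".*name='(.*)'.*", r"\1", string)

print(lazy)
print(greedy)
```

Note that the lowercase name=' does not match inside fullName=', because the match is case-sensitive.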
More about regular expression (specific to R) can be found here

How to prevent code from detecting and pulling patterns within words (Example: I want 'one' detected but not 'one' in the word al'one')?

I have this code that is meant to add highlights to some numbers in a text stored in "lines"
stringr::str_replace_all(lines, nums, function(x) {paste0("<<", x, ">>")})
where nums is the following pattern being detected:
nums <- '(Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+\\s?(Hundred|Thousand|Million|Billion|Trillion)?'
The problem I'm having is that the line of code above also leads to numbers embedded in words also being detected. In the following text this happens:
Get <<ten>> eggs. That is what is writ<<ten>>. I am <<one>> and al<<one>>.
when it should be:
Get <<ten>> eggs. That is what is written. I am <<one>> and alone.
I don't want to remove the question mark after the \s because I want to detect both numbers like "One" followed by no space and "One Hundred" which has a space in between.
Does anyone know how to do this?
Surround (Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)+ with \b (written "\\b" inside an R string).
\b matches word boundaries, so this expression will never match inside a word.
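A minimal Python sketch of the same idea follows. It is a variant rather than the exact pattern from the question: it assumes case-insensitive matching, drops the + after the first group, and only consumes the space when a multiplier word actually follows, so that a trailing space never ends up inside the highlight. The helper name highlight is hypothetical.

```python
import re

# \b on both sides keeps the number words from matching inside longer words.
nums = re.compile(
    r"\b(?:Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine)\b"
    r"(?:\s(?:Hundred|Thousand|Million|Billion|Trillion)\b)?",
    re.IGNORECASE,
)

def highlight(line):
    # Wrap every match in << >>
    return nums.sub(lambda m: "<<" + m.group(0) + ">>", line)

result = highlight("I am one and alone. Someone paid Two Hundred.")
print(result)
```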

How do I extract a section number and the text after it?

I have a question.
My text file contains lines such as:
1.1        Description.
This is the description.
1.1.1      Quality Assurance
Random sentence.
1.6.1    Quality Control. Quality Control is the responsibility of the contractor.
I'm trying to find out how to get:
1.1        Description
1.1.1      Quality Assurance
1.6.1    Quality Control
Right now, I have:
txt1 <- readLines("text1.txt")
txt2<-grep("^[0-9.]+", txt1, value = TRUE)
file<-write(txt2, "text3.txt")
which results in:
1.1        Description.
1.1.1      Quality Assurance
1.6.1    Quality Control. Quality Control is the responsibility of the contractor.
You are using grep with value=TRUE, which
returns a character vector containing the selected elements of x
(after coercion, preserving names but no other attributes).
This means that if your regular expression matches anything in the line, the whole line is returned. Your regular expression matches numbers at the beginning of the line, so every line that begins with a number gets selected.
It seems that your goal is not to select the whole line, but only the part up to the first period or the end of the line.
So, you need to adjust the regular expression to be more specific, and you need to extract only the matching portion of the line.
A regular expression that matches what you want can be:
"^([0-9]\\.?)+ .+?(\\.|$)"
It matches digits and dots, then a space, then as few characters as possible up to and including the first . or the end of the line. I recommend the following website to better understand what the regex does: https://regexr.com/
The next step is extracting only the matching portion of each line, rather than the whole line in which the regex has a match. For this we'll use the function regexpr, which tells us where the matches are, and the function regmatches, which helps us extract those matches:
txt1 <- readLines("text1.txt")
regmatches(txt1, regexpr("^([0-9]\\.?)+ .+?(\\.|$)", txt1))
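The same extraction can be sketched in Python. One difference is an adaptation on my part, not the answer's regex: the final group is replaced by a lookahead, (?=\.|$), so the terminating period is left out of the match, which is closer to the output requested in the question. The sample lines are hard-coded here instead of being read from a file.

```python
import re

lines = [
    "1.1        Description.",
    "This is the description.",
    "1.1.1      Quality Assurance",
    "Random sentence.",
    "1.6.1    Quality Control. Quality Control is the responsibility of the contractor.",
]

# Section number, a space, then as little text as possible up to (but not
# including) the first period or the end of the line.
heading = re.compile(r"^(?:[0-9]\.?)+ .+?(?=\.|$)")

headings = []
for line in lines:
    m = heading.match(line)
    if m:  # lines that do not start with a section number are skipped
        headings.append(m.group(0))

print("\n".join(headings))
```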

Get all hashtags using regular expressions

I'm studying the recent hashtag #BalanceTonPorc in one of my classes. I'm trying to get all the occurrences of this hashtag appearing in tweets, but of course nobody uses the same format.
Some people use #BalanceTonPorc, some #balancetonporc, and so on and so forth.
Using gsub, I've so far done this :
df$hashtags <- gsub(".alance.on.orc", "BalanceTonPorc", df$hashtags)
Which does what I want, and all variations of this hashtag are stored under the same one. But there are A LOT of other variations. Some people used #BalanceTonPorc... or #BalanceTonPorc.
Is there a way to have a RegEx that says I want everything that contains .alance.on.orc with every character possible after the hashtag, except , (because it separates hashtags)? Here is a screenshot to illustrate what I mean.
I'm also having another issue, in my frequency table I have twice #BalanceTonPorc, so I guess R must consider them to be different. Can you spot the difference?
You may use [^,]* to match zero or more characters other than a comma:
gsub(".alance.on.orc[^,]*", "BalanceTonPorc", df$hashtags)
Or, to match balancetonporc literally, regardless of case:
gsub("balancetonporc[^,]*", "BalanceTonPorc", df$hashtags, ignore.case=TRUE)
See a regex demo and an R online test:
x <- c("#balancetonPorc#%$%#$%^","#balancetonporc#%$%, text")
gsub("balancetonporc[^,]*", "BalanceTonPorc", x, ignore.case=TRUE)
# => [1] "#BalanceTonPorc" "#BalanceTonPorc, text"
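For comparison, the case-insensitive version translates to Python roughly as follows, with re.sub and re.IGNORECASE standing in for gsub and ignore.case=TRUE. The list x reuses the test strings from the R example.

```python
import re

x = ["#balancetonPorc#%$%#$%^", "#balancetonporc#%$%, text"]

# [^,]* swallows everything after the hashtag up to the next comma
cleaned = [
    re.sub(r"balancetonporc[^,]*", "BalanceTonPorc", s, flags=re.IGNORECASE)
    for s in x
]
print(cleaned)
```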

Regex for comma delimited list with line breaks

The format I would like to allow in my text boxes are comma delimited lists followed by a line break in between the comma delimited lists. Here is an example of what I want from the user:
1,2,3
1,2,4
1,2,5
1,2,6
So far I have limited the user using this ValidationExpression:
^([1-9][0-9]*[]*[ ]*,[ ]*)*[1-9][0-9]*$
However with that expression, the user is only able to enter one row of comma delimited numbers.
How can I proceed to accept multiple rows by accepting line breaks?
It is possible to check if the input has the correct format. I would recommend to use groups and repeat them:
((\d+,)+\d+\n?)+
But to check if the matrix is symmetric you have to use something other than a regex.
Check it out here: https://regex101.com/r/GqtOuQ/2/
If you want to be a bit more user friendly, you can allow any amount of horizontal whitespace between the numbers and the commas. This can be done with \h, which matches horizontal whitespace such as spaces and tabs but not \n.
The regex now looks a bit messier:
((\h*\d+\h*,\h*)+\h*\d+\h*\n?\h*)+
Check this out here: https://regex101.com/r/GqtOuQ/3
Here is the version that should work with .NET:
(([ \t]*\d+[ \t]*,[ \t]*)+[ \t]*\d+[ \t]*\n?[ \t]*)+
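A quick way to try the last pattern outside a validator is a Python sketch like the one below. It assumes the whole input must match, which re.fullmatch enforces, and it uses the [ \t] form because Python's re module has no \h. The helper name is_valid is hypothetical.

```python
import re

# Same pattern as the .NET version above
rows = re.compile(r"(([ \t]*\d+[ \t]*,[ \t]*)+[ \t]*\d+[ \t]*\n?[ \t]*)+")

def is_valid(text):
    # True only if the entire input consists of comma-delimited rows
    return rows.fullmatch(text) is not None

print(is_valid("1,2,3\n1,2,4\n1,2,5\n1,2,6"))
print(is_valid("1 , 2,\t3\n4,5"))
print(is_valid("1,2,\n3"))
```

Because the line break is optional in the pattern, rows separated only by spaces are also accepted. If that is not wanted, the \n? would need to be made mandatory between rows.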

Resources