How can I retain the numbering of paragraphs when extracting text from a docx file?
I'm doing some NLP/ML work on a bunch of docx files, and to begin with I need to break up each document into a dataframe. I'm working with contracts, so almost every paragraph is numbered, e.g. most of the text I'm dealing with looks like this:
1.17. The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
1.17.1. The Agent will ensure that attendant resources are bla bla bla
1.18. An indicative Authority resource profile is set out in bla bla bla.
etc
The docx_summary() function of the officer package lays out the text in a dataframe wonderfully, except that it doesn't retain the paragraph numbering. The result is that I get a dataframe where the text looks like this:
The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
The Agent will ensure that attendant resources are bla bla bla
An indicative Authority resource profile is set out in bla bla bla.
I guess this is to do with how Word defines numbering as part of a style rather than as plain text, and I can see that in the docx_summary() output the $style_name variable has headings 1 through 4 according to the numbering hierarchy in the docx. But I can't figure out how to extract the actual numbering and apply it to each paragraph in the dataframe returned by docx_summary().
The output I want is the same docx_summary() dataframe, but with an added numbering column, to look like this:
output_df <- data.frame(content_type = "paragraph", style_name = "heading 2", numbering = "1.17", text = "The Agent will provide the attendant resources as set out in Annex 3 bla bla bla")
> output_df
content_type style_name numbering text
1 paragraph heading 2 1.17 The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
Any help would be much appreciated.
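One possible workaround is to rebuild the numbers from the heading levels rather than read them from the file. The sketch below is in Python purely to illustrate the counting logic; `add_numbering` is a hypothetical helper, not part of officer. It assumes that every "heading N" style is auto-numbered, that numbering starts at 1 and restarts under each parent, and that the style names look like the ones in the `style_name` column. The actual numbering definitions live in the docx's numbering XML and may differ from these assumptions.

```python
import re

def add_numbering(style_names):
    """Return a numbering string (or None) for each style name, in order.

    Assumption: a style called "heading N" is a numbered paragraph at depth N.
    """
    counters = []   # counters[i] holds the current count at heading level i + 1
    result = []
    for style in style_names:
        match = re.fullmatch(r"heading (\d+)", style.strip().lower())
        if match is None:
            result.append(None)   # not a heading style, so no number
            continue
        level = int(match.group(1))
        del counters[level:]              # a new heading resets all deeper levels
        while len(counters) < level:      # pad skipped levels with 0
            counters.append(0)
        counters[level - 1] += 1
        result.append(".".join(str(c) for c in counters))
    return result

styles = ["heading 1", "heading 2", "heading 2", "heading 3", "Normal", "heading 2"]
print(add_numbering(styles))
```

The same loop could be written in R over the `style_name` column and the result bound to the dataframe as a `numbering` column.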
Hello, how can I find a word and add a new line?
data <- c("ran hi i man and bla bla ran hi i man and bla bla ran hi i man and bla bla ran hi i man and bla bla ran hi i man and bla bla")
I tried this, but I don't know how to make it work:
str_replace(data,"(ran)","\n")
Are you looking for this?
stringr::str_replace(data,"(man)","\\1\n")
This can be written in base R as well:
gsub('(man)', '\\1\n', data)
You don't say what the problem is.
With your data, I ran this line:
library(stringr)
data_oneword <- str_replace_all(data, "hello", "hello\n")
and got what you seem to be asking for:
[1] "Ran hi i man and bla bla" "and want bla bla thxx" "Dan hello\n i want to fly"
[4] "thxx all" "David hello\n i want to fly" "thxx all"
[7] "Yanis hello\n i want to fly" "thxx all" "Ran hello\n i want to fly"
[10] "thxx all" "Yanis hello\n i want to fly" "thxx all"
[13] "David hello\n i want to fly" "thxx all" "Yanis hello\n i want to fly"
[16] "thxx all"
If you want to substitute the word with the new line, then your code works. If you want to keep the word and add a new line afterwards, use Ronak's code. Neither will actually display a line break when the vector is printed, though, because R shows the "\n" escape as-is in printed character vectors. To have those "\n" rendered as new lines, use the cat function:
cat(stringr::str_replace(data,"(man)","\\1\n"), sep="\n")
Or:
cat(stringr::str_replace(data,"(man)","\n"), sep="\n")
Note: sep="\n" adds a line break after every element in data. The output is:
> cat(stringr::str_replace(data,"(man)","\n"), sep="\n")
Ran hi i
and bla bla
and want bla bla thxx
Dan hello i want to fly
thxx all
David hello i want to fly
thxx all
Yanis hello i want to fly
thxx all
Ran hello i want to fly
thxx all
Yanis hello i want to fly
thxx all
David hello i want to fly
thxx all
Yanis hello i want to fly
thxx all
I suggest using "(man )" as your pattern to get rid of that space before "and bla bla".
In the adoc file I define a chapter header like:
== [big-number]#2064# Das Spiele-Labor
For HTML that translates to
<span class="big-number">2064</span>
For the EPUB version, converted with asciidoctor-epub3, the class is apparently omitted. The relevant line in converter.rb is:
<h1 class="chapter-title">#{title_upper}#{subtitle ? %[ <small class="subtitle">#{subtitle_formatted_upper}</small>] : nil}</h1>
(/var/lib/gems/1.9.1/gems/asciidoctor-epub3-1.5.0.alpha.7.dev/lib/asciidoctor-epub3/converter.rb)
How can I get the class information over to the chapter-title to format the first number in a special way?
Or is there another way to solve this? (The first number of the chapter title should be large, and CSS has no 'first-word' selector.)
I am working on a little project to learn more about graph analytics. I have dialogs from the TV show Archer, to which I added a Speaker field and a Speaking_to field. To show the level of interaction, I am using a basic word count.
My data looks like this:
TEXT Speaker Speaking_to Wordcount
Bla bla Archer Lana 2
Bla Archer Lana, Cyril 1
Bla bla bla Lana Archer, Cyril 3
I would like to use the sum of the word counts for every combination of Speaker and Speaking_to to show the strength of the interaction between characters.
How would you proceed in Neo4j?
How do you model the cases where I have multiple Speaking_to characters? I want all my nodes to be individual characters and not groups.
Thank you,
Model:
two types of nodes - the person and the text
two types of relationships - who speaks the text, and to whom the text is addressed
MERGE (A:Person {name:'Archer'})
MERGE (L:Person {name:'Lana'})
MERGE (C:Person {name:'Cyril'})
MERGE (T1:Text {name: 'Bla bla', wc: 2})
MERGE (T2:Text {name: 'Bla', wc: 1})
MERGE (T3:Text {name: 'Bla bla bla', wc: 3})
MERGE (A)-[:Speaking]->(T1)
MERGE (T1)-[:Speaking_to]->(L)
MERGE (A)-[:Speaking]->(T2)
MERGE (T2)-[:Speaking_to]->(L)
MERGE (T2)-[:Speaking_to]->(C)
MERGE (L)-[:Speaking]->(T3)
MERGE (T3)-[:Speaking_to]->(A)
MERGE (T3)-[:Speaking_to]->(C)
Strength of directional interaction:
MATCH (A:Person)-[:Speaking]->(S:Text)-[:Speaking_to]->(P:Person)
RETURN A.name, P.name, sum(S.wc) as wordcount
ORDER BY wordcount DESC
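To sanity-check the numbers that aggregation should produce, here is a minimal Python sketch of the same idea on the sample table from the question: each row is fanned out into one (speaker, listener) pair per addressee and the word counts are summed per pair. The `rows` list and the `interaction_strength` function are illustrative names, and the comma-separated Speaking_to format is assumed from the sample data.

```python
from collections import defaultdict

# Sample rows from the question: (text, speaker, speaking_to, wordcount)
rows = [
    ("Bla bla", "Archer", "Lana", 2),
    ("Bla", "Archer", "Lana, Cyril", 1),
    ("Bla bla bla", "Lana", "Archer, Cyril", 3),
]

def interaction_strength(rows):
    """Sum word counts for every directed (speaker, listener) pair."""
    totals = defaultdict(int)
    for _text, speaker, speaking_to, wordcount in rows:
        # A line addressed to several characters counts once for each of them.
        for listener in speaking_to.split(","):
            totals[(speaker, listener.strip())] += wordcount
    return dict(totals)

for (speaker, listener), total in sorted(interaction_strength(rows).items()):
    print(speaker, "->", listener, total)
```

Splitting the addressee list before loading is also what keeps every node an individual character rather than a group.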
I have a list of names looking like this
Noms<- c("André Coin", "XXXAndré Coin", "Gabriel Péri","Léon Blum", "XXXLéon Blum")
I am trying to write a function that finds every occurrence of each of these names in a very long text, but only at the beginning of a line, after "M" or "Mme".
My text is a vector in which each line is an element.
So at the end, a line like "M. André Coin said bla bla" would be matched; but a line like "He said bla bla bla to M. André Coin" would NOT be matched.
The final requirement is that "André Coin" can be distinguished from "XXXAndré Coin".
The solution I have found for the moment is:
findpattern <- function(name,vect) {
x<-paste0("^.{1,3}((M\\s*)|(Mme\\s*))*\\s*",name)
found<-grepl(x,vect)
return(found)
}
However, when I run findpattern(Noms,txt), it cannot distinguish "André Coin" from "XXXAndré Coin", meaning that findpattern("André Coin", "M. XXXAndré Coin") returns TRUE.
Can you help me find my mistake in the writing of my regular expression?
You've missed a dot after the M, and the first 3 characters must be optional (i.e. from 0 to 3 characters):
"^.{0,3}((M\\.\\s*)|(Mme\\s*))*\\s*"
If the dot after M is optional:
"^.{0,3}((M\\.?\\s*)|(Mme\\s*))*\\s*"
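As a quick check of how that anchored pattern behaves, here is a small Python port (the regex syntax used is the same in both languages). `find_pattern` is a hypothetical stand-in for the question's findpattern, and the name is escaped so that any special characters in it are matched literally, which the original function does not do.

```python
import re

def find_pattern(name, lines):
    """Return True for each line that starts with an optional M./Mme prefix plus the name."""
    # Port of "^.{0,3}((M\\.?\\s*)|(Mme\\s*))*\\s*" followed by the name.
    # Because the prefix is optional, a line starting directly with the name also matches.
    pattern = re.compile(r"^.{0,3}(?:M\.?\s*|Mme\s*)*\s*" + re.escape(name))
    return [bool(pattern.match(line)) for line in lines]

lines = [
    "M. André Coin said bla bla",
    "He said bla bla bla to M. André Coin",
    "M. XXXAndré Coin said bla bla",
]
print(find_pattern("André Coin", lines))
print(find_pattern("XXXAndré Coin", lines))
```

The pattern can skip at most three leading characters before the name must start, so "M. XXXAndré Coin" only matches when the name being searched for is "XXXAndré Coin".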
I'm trying to clean a text string. For example:
[bla bla bla fórmula MELD" width="990" height="718" bla bla bla bla]
I want to remove width="990" height="718" throughout my text. I want to locate every width="xxx" height="xxx" and remove them. In some cases they are inverted: height="xxx" width="xxx"
The numbers in width="990" height="718" are not the same every time.
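One way to approach this is to match each attribute on its own instead of the pair, so the order no longer matters. Below is a minimal Python sketch of that idea. `strip_dimensions` is an illustrative name, and the pattern assumes the values are always plain digits in double quotes, as in the example. The same regular expression should also work with a global replace function in other languages.

```python
import re

# Matches one width="..." or height="..." attribute plus any whitespace before it.
# Assumption: the value is always digits wrapped in double quotes.
DIMENSION_ATTR = re.compile(r'\s*\b(?:width|height)="\d+"')

def strip_dimensions(text):
    """Remove every width/height attribute, whatever order they appear in."""
    return DIMENSION_ATTR.sub("", text)

sample = '[bla bla bla fórmula MELD" width="990" height="718" bla bla bla bla]'
print(strip_dimensions(sample))
```

Consuming the leading whitespace along with each attribute avoids leaving double spaces behind.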