Cleaning ranges out of an NSString - nsstring

I'm trying to clean a text string. For example:
[bla bla bla fórmula MELD" width="990" height="718" bla bla bla bla]
I want to remove width="990" height="718" throughout my text: locate every width="xxx" height="xxx" pair and delete it. In some cases the order is inverted: height="xxx" width="xxx".
The numbers in width="990" height="718" are not the same every time.
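One way to approach this is a single regular expression that matches either attribute on its own, so the order of the pair stops mattering. The sketch below is in Python rather than Objective-C, and it assumes the values are always digits in double quotes, as in the example; strip_dimensions is a hypothetical helper name.

```python
import re

# Matches one width="..." or height="..." attribute plus the whitespace before it.
# Assumption: values are plain digits in double quotes, as in the example text.
ATTR = re.compile(r'\s*\b(?:width|height)="\d+"')

def strip_dimensions(text: str) -> str:
    """Remove every width="N" / height="N" attribute, in either order."""
    return ATTR.sub("", text)

sample = 'bla bla fórmula MELD" width="990" height="718" bla bla'
print(strip_dimensions(sample))
```

The pattern itself is not Python-specific, so it should carry over to NSRegularExpression with little change, though that has not been verified here.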


Find a word and add a new line

Hello, how can I find a word and add a new line after it?
data <- c("ran hi i man and bla bla ran hi i man and bla bla ran hi i man and bla bla ran hi i man and bla bla ran hi i man and bla bla")
I tried this but don't know how to proceed.
Are you looking for this?
which can be written in base R as well:
gsub('(man)', '\\1\n', data)
You don't say what the problem is.
With your data, I ran this line:
data_oneword <- str_replace_all(data, "hello", "hello\n")
and got what you seem to be asking for:
[1] "Ran hi i man and bla bla" "and want bla bla thxx" "Dan hello\n i want to fly"
[4] "thxx all" "David hello\n i want to fly" "thxx all"
[7] "Yanis hello\n i want to fly" "thxx all" "Ran hello\n i want to fly"
[10] "thxx all" "Yanis hello\n i want to fly" "thxx all"
[13] "David hello\n i want to fly" "thxx all" "Yanis hello\n i want to fly"
[16] "thxx all"
If you want to substitute the word with a new line, then your code works. If you want to keep the word and add a new line after it, use Ronak's code. Neither will actually print a new line, though, because R shows the "\n" escape as-is when it prints a character vector. To render each "\n" as a real line break, use the cat function:
cat(stringr::str_replace(data,"(man)","\\1\n"), sep="\n")
cat(stringr::str_replace(data,"(man)","\n"), sep="\n")
Note: sep="\n" breaks the line after every element in data. The output is:
> cat(stringr::str_replace(data,"(man)","\n"), sep="\n")
Ran hi i
and bla bla
and want bla bla thxx
Dan hello i want to fly
thxx all
David hello i want to fly
thxx all
Yanis hello i want to fly
thxx all
Ran hello i want to fly
thxx all
Yanis hello i want to fly
thxx all
David hello i want to fly
thxx all
Yanis hello i want to fly
thxx all
I suggest using "(man )" as your pattern to get rid of the space before "and bla bla".
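For comparison, the same keep-the-word-and-break idea can be sketched in Python (not R, so this is an illustration rather than a drop-in answer). It assumes the shortened sample string below and consumes the optional space after the word, for the reason given above.

```python
import re

data = "ran hi i man and bla bla ran hi i man and bla bla"

# Keep the matched word (group 1) and append a newline after it.
# The optional trailing space is consumed so the next line does not start with a blank.
with_breaks = re.sub(r"\b(man) ?", r"\1\n", data)

# print() renders the newline characters as real line breaks.
print(with_breaks)
```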

Retaining paragraph numbering in docx using the R officer package

How can I retain the numbering of paragraphs when extracting text from a docx file?
I'm doing some NLP/ML work on a bunch of docx files, and to begin with I need to break up each doc into a dataframe. I'm working with contracts, so almost every paragraph is numbered, e.g. most of the text I'm dealing with looks like this:
1.17. The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
1.17.1. The Agent will ensure that attendant resources are bla bla bla
1.18. An indicative Authority resource profile is set out in bla bla bla.
The docx_summary() function of the officer package lays out the text in a dataframe wonderfully, except that it doesn't retain the paragraph numbering. The result is that I get a dataframe where the text looks like this:
The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
The Agent will ensure that attendant resources are bla bla bla
An indicative Authority resource profile is set out in bla bla bla.
I guess this is because Word defines numbering as a style rather than as plain text. In the docx_summary() output I can see that the $style_name variable has Headings 1 through 4, following the numbering hierarchy in the docx. But I can't figure out how to extract the actual numbering and apply it to each paragraph in the dataframe that docx_summary() returns.
The output I want is the same docx_summary() dataframe, but with an added numbering column, to look like this:
output_df <- data.frame(content_type = "paragraph", style_name = "heading 2", numbering = "1.17", text = "The Agent will provide the attendant resources as set out in Annex 3 bla bla bla")
> output_df
content_type style_name numbering text
1 paragraph heading 2 1.17 The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
Any help would be much appreciated.
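If the numbers themselves cannot be read from the file, one fallback is to rebuild them from the heading levels. The sketch below is in Python and rests on strong assumptions: numbering is driven purely by the heading level (as style_name suggests), every heading is numbered, and counters start at 1 and never restart. Real Word numbering definitions may differ, and number_paragraphs is a hypothetical helper, not part of officer.

```python
from typing import List

def number_paragraphs(levels: List[int]) -> List[str]:
    """Rebuild '1.2.1'-style labels from a sequence of heading levels (1 = top)."""
    counters: List[int] = []
    labels: List[str] = []
    for level in levels:
        # Pad any missing parent levels with 0, then drop counters deeper than this level.
        while len(counters) < level:
            counters.append(0)
        del counters[level:]
        # Advance the counter for the current level.
        counters[level - 1] += 1
        labels.append(".".join(str(c) for c in counters))
    return labels

# Heading levels as they might be parsed from style_name ("heading 2" -> 2).
levels = [1, 2, 2, 3, 2]
print(number_paragraphs(levels))
```

Because the counters start from 1, this cannot reproduce a label such as 1.17 unless all sixteen preceding level-2 headings are present in the input.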

Scraping for a rank number using Nokogiri in Ruby

I'm still doing some web scraping practice using this article:
I'd like to get just the rank number of each show and found what I think is the HTML element:
<div class="copy entry manual-ads">
<b class="big">
Chewing Gum
I'm using the following code to grab just the rank number (in this case, "75."):
However, it returns the rank number along with the show title. How can I get just the rank number?
Use regex:
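As an illustration of the regex approach, here is a Python sketch (not the Ruby from this thread). It assumes the scraped text of the element has the form "75. Chewing Gum", which is a guess based on the description above; entry_text is a hypothetical stand-in for that scraped string.

```python
import re

# Hypothetical scraped text: rank, a dot, then the show title.
entry_text = "75. Chewing Gum"

# Capture the leading digits that come right before the first dot.
match = re.match(r"\s*(\d+)\.", entry_text)
rank = match.group(1) if match else None

print(rank)
```

To keep the trailing dot ("75."), use match.group(0).strip() instead of group 1.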

Show levels of interactions between movie characters with Neo4j

I am working on a little project to learn more about graph analytics. I have dialogs from the TV show Archer, to which I added a speaker field and a speaking_to field. To show the level of interaction, I am using a basic word count.
My data looks like this:
TEXT Speaker Speaking_to Wordcount
Bla bla Archer Lana 2
Bla Archer Lana, Cyril 1
Bla bla bla Lana Archer, Cyril 3
I would use the sum of the word counts for every speaker / speaking_to combination to show the strength of the interaction between characters.
How would you proceed in Neo4j?
How do you model the cases where I have multiple speaking_to characters? I want all my nodes to be individual characters and not groups.
Thank you,
Use two types of nodes: the person and the text.
And two types of relationships: who speaks, and to whom the text is addressed.
MERGE (A:Person {name:'Archer'})
MERGE (L:Person {name:'Lana'})
MERGE (C:Person {name:'Cyril'})
MERGE (T1:Text {name: 'Bla bla', wc: 2})
MERGE (T2:Text {name: 'Bla', wc: 1})
MERGE (T3:Text {name: 'Bla bla bla', wc: 3})
MERGE (A)-[:Speaking]->(T1)
MERGE (T1)-[:Speaking_to]->(L)
MERGE (A)-[:Speaking]->(T2)
MERGE (T2)-[:Speaking_to]->(L)
MERGE (T2)-[:Speaking_to]->(C)
MERGE (L)-[:Speaking]->(T3)
MERGE (T3)-[:Speaking_to]->(A)
MERGE (T3)-[:Speaking_to]->(C)
Strength of directional interaction:
MATCH (A:Person)-[:Speaking]->(S:Text)-[:Speaking_to]->(P:Person)
RETURN A.name, P.name, sum(S.wc) as wordcount
ORDER BY wordcount DESC
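The aggregation this query performs can be checked outside the database. The Python sketch below uses the three sample rows from the question and assumes, as the graph model above does, that a line spoken to several characters counts in full toward each of them.

```python
from collections import Counter

# (text, speaker, speaking_to, wordcount) rows from the question's sample data.
rows = [
    ("Bla bla", "Archer", "Lana", 2),
    ("Bla", "Archer", "Lana, Cyril", 1),
    ("Bla bla bla", "Lana", "Archer, Cyril", 3),
]

strength = Counter()
for _text, speaker, speaking_to, wordcount in rows:
    # Split multi-listener cells so every pair links two individual characters.
    for listener in (name.strip() for name in speaking_to.split(",")):
        strength[(speaker, listener)] += wordcount

# Strongest directional interactions first.
for (speaker, listener), total in strength.most_common():
    print(speaker, "->", listener, total)
```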

Regular Expression in R: disentangling similar possibilities

I have a list of names looking like this:
Noms<- c("André Coin", "XXXAndré Coin", "Gabriel Péri","Léon Blum", "XXXLéon Blum")
I am trying to write a function that finds every occurrence of each of these names in a very long text, but only at the beginning of a line that starts with "M" or "Mme".
My text is a vector in which each line is an element.
So at the end, a line like "M. André Coin said bla bla" would be matched; but a line like "He said bla bla bla to M. André Coin" would NOT be matched.
The final requirement is that "André Coin" can be distinguished from "XXXAndré Coin".
The solution I have found for the moment is:
findpattern <- function(name,vect) {
However, when I run findpattern(Noms, txt), it cannot distinguish "André Coin" from "XXXAndré Coin": findpattern("André Coin", "M. XXXAndré Coin") returns TRUE.
Can you help me find my mistake in the writing of my regular expression?
You've missed a dot after the M, and the first 3 characters must be optional (i.e. from 0 to 3 characters):
If the dot after M is optional:
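For illustration, the anchoring idea can be sketched in Python; this is not the R code from the original answer. It assumes the title is "M" or "Mme", followed by whitespace and then the exact name. starts_with_name and its dot_optional parameter are hypothetical names.

```python
import re

def starts_with_name(name: str, line: str, dot_optional: bool = False) -> bool:
    """True if the line begins with 'M.' or 'Mme.' followed by exactly this name."""
    dot = r"\.?" if dot_optional else r"\."
    # ^ anchors the match at the start of the line; re.escape treats the name literally;
    # \b stops "André Coin" from matching a longer name such as "André Coinard".
    pattern = rf"^(?:Mme|M){dot}\s+{re.escape(name)}\b"
    return re.search(pattern, line) is not None

print(starts_with_name("André Coin", "M. André Coin said bla bla"))
print(starts_with_name("André Coin", "M. XXXAndré Coin said bla bla"))
```

Because only whitespace is allowed between the title and the name, "XXXAndré Coin" matches only when it is passed as the name itself.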
