Find Word and add new line - r

Hello, how can I find a word and add a new line?
data <- c("ran hi i man and bla bla ran hi i man and bla bla ran hi i man and bla bla ran hi i man and bla bla ran hi i man and bla bla")
I tried this, but I don't know how to make it work:
str_replace(data,"(ran)","\n")

Are you looking for this?
stringr::str_replace(data,"(man)","\\1\n")
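This can be written in base R as well (note that gsub() replaces every match, while str_replace() replaces only the first; sub() is the first-match equivalent):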
gsub('(man)', '\\1\n', data)
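As a quick check of how first-match and all-match replacement differ, here is a minimal base R sketch. The short string `x` is a made-up stand-in for the question's data, not the original input.

```r
# Hypothetical short input, standing in for the question's longer string
x <- "ran hi i man and bla ran hi i man and bla"

# sub() replaces only the first match, like stringr::str_replace()
first_only <- sub("(man)", "\\1\n", x)

# gsub() replaces every match, like stringr::str_replace_all()
all_matches <- gsub("(man)", "\\1\n", x)

# cat() renders "\n" as an actual line break
cat(all_matches, "\n")
```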

You don't say what the problem is.
With your data, I ran this line:
library(stringr)
data_oneword <- str_replace_all(data, "hello", "hello\n")
and got what you seem to be asking for:
[1] "Ran hi i man and bla bla" "and want bla bla thxx" "Dan hello\n i want to fly"
[4] "thxx all" "David hello\n i want to fly" "thxx all"
[7] "Yanis hello\n i want to fly" "thxx all" "Ran hello\n i want to fly"
[10] "thxx all" "Yanis hello\n i want to fly" "thxx all"
[13] "David hello\n i want to fly" "thxx all" "Yanis hello\n i want to fly"
[16] "thxx all"

If you want to replace the word with a new line, then your code works. If you want to keep the word and add a new line after it, use Ronak's code. Neither will display a line break when the result is printed, though, because printing a character vector shows the "\n" escape instead of rendering it. To print those "\n" as new lines, use the cat function:
cat(stringr::str_replace(data,"(man)","\\1\n"), sep="\n")
Or
cat(stringr::str_replace(data,"(man)","\n"), sep="\n")
Note: sep="\n" breaks the line after every element in data. The output is:
> cat(stringr::str_replace(data,"(man)","\n"), sep="\n")
Ran hi i
and bla bla
and want bla bla thxx
Dan hello i want to fly
thxx all
David hello i want to fly
thxx all
Yanis hello i want to fly
thxx all
Ran hello i want to fly
thxx all
Yanis hello i want to fly
thxx all
David hello i want to fly
thxx all
Yanis hello i want to fly
thxx all
I suggest using "(man )" as your pattern to get rid of the space before "and bla bla".
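The difference between printing and cat() can be checked with a small sketch. The two-element vector below is a hypothetical stand-in for the question's data, and sub() is used in place of stringr::str_replace() so the example needs only base R.

```r
# Hypothetical two-element vector, standing in for the question's data
data <- c("Ran hi i man and bla bla", "thxx all")

# Replace the word (and the space after it) with a newline
replaced <- sub("(man )", "\n", data)

# print() shows the escape sequence "\n" inside the quoted string
print(replaced)

# cat() interprets "\n" as a real line break
cat(replaced, sep = "\n")

# capture.output() collects the lines that cat() actually emits
printed <- capture.output(cat(replaced, sep = "\n"))
length(printed)
```

Here the two input elements come out as three printed lines, because the first element now contains a line break of its own.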

Related

Retaining paragraph numbering in docx using the R officer package

How can I retain the numbering of paragraphs when extracting text from a docx file?
I'm doing some NLP-ML work on a bunch of docx files, and to begin with I need to break up each doc into a dataframe. I'm working with contracts, so almost every paragraph is numbered, e.g. most of the text I'm dealing with looks like this:
1.17. The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
1.17.1. The Agent will ensure that attendant resources are bla bla bla
1.18. An indicative Authority resource profile is set out in bla bla bla.
etc
The docx_summary() function of the officer package lays out the text in a dataframe wonderfully, except that it doesn't retain the paragraph numbering. The result is that I get a dataframe where the text looks like this:
The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
The Agent will ensure that attendant resources are bla bla bla
An indicative Authority resource profile is set out in bla bla bla.
I guess this has to do with how Word defines numbering as a Style rather than plain text, and I can see in the docx_summary() output that the $style_name variable has Headings 1 through 4 according to the numbering hierarchy in the docx. But I can't figure out how to extract the actual numbering and apply it to each paragraph in the dataframe returned by docx_summary().
The output I want is the same docx_summary() dataframe, but with an added numbering column, to look like this:
output_df <- data.frame(content_type = "paragraph", style_name = "heading 2", numbering = "1.17", text = "The Agent will provide the attendant resources as set out in Annex 3 bla bla bla")
> output_df
content_type style_name numbering text
1 paragraph heading 2 1.17 The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
Any help would be much appreciated.

An alternative to scan() in R (here-document style)

I am looking for a way to read text into a vector such that each line would be a different element, all happening within an R script.
One way that I found was something like:
bla <- scan(text = "line1
line2
line3",
what = character())
Which correctly gives me:
> bla
[1] "line1" "line2" "line3"
However, there are several problems. First, it is indented. I don't have to indent it, but any auto-indentation feature (which I commonly use) will just pop it back into alignment. Second, it requires escape codes if I want to use the double quote symbol, for example.
Is there a way to do something similar to the Here-Document method (<< EOF), in R scripts?
I am using RStudio as my IDE, running on Windows. Preferably there would be a platform independent way of doing this.
EDIT
Do you need to have the text inside the R script?
Yes.
An example of what I want to do:
R script here
⋮
bla <- <SOMETHING - BEGIN>
line1
line2
line3
<SOMETHING - END>
⋮
more R script here
Where the requirement, again, is that I can type freely without worrying about auto indentation moving the lines forward, and no need to worry about escape codes when typing things like ".
Both problems can be solved with the scan function and two little tricks, I think:
scan(text = '
line1
"line2" uses quotation mark
line3
', what = character(), sep = "\n")
Read 3 items
[1] "line1" "\"line2\" uses quotation mark"
[3] "line3"
When you put the quotation marks in a line of their own, you don't have a problem with auto indentation (tested using RStudio). If you only have double quotation marks in the text, you can use single quotation marks to start and end your character object. If you have single quotation marks in the text, use double quotation marks for character. If you have both, you should probably use search and replace to make them uniform.
I also added sep = "\n", so every line is one element of the resulting character vector.
Since R version 4.0, we have raw strings (see ?Quotes):
bla <- r"(line1
line2
"line3"
'line4'
 Here is indentation
Here is a backslash \
)"
#> [1] "line1\nline2\n\"line3\"\n'line4'\n Here is indentation\nHere is a backslash \\\n"
Note though it gives one single string, not separate elements. We can split it back with strsplit:
bla <- strsplit(bla, "\n")[[1]]
#> [1] "line1"
#> [2] "line2"
#> [3] "\"line3\""
#> [4] "'line4'"
#> [5] " Here is indentation"
#> [6] "Here is a backslash \\"
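The raw-string and strsplit steps can be wrapped together. The sketch below is one possible way to do it, assuming R >= 4.0; `here_doc` is a hypothetical helper name, not a function from base R or any package.

```r
# Hypothetical helper: split a multi-line string into one element per line
here_doc <- function(x) {
  lines <- strsplit(x, "\n", fixed = TRUE)[[1]]
  # Drop the empty first element produced when the text starts on its own line
  if (length(lines) > 0 && lines[1] == "") {
    lines <- lines[-1]
  }
  lines
}

# Raw strings need R >= 4.0; quotes and backslashes need no escaping
bla <- here_doc(r"(
line1
"line2"
C:\some\path
)")
bla
```

Starting the text on the line after the opening delimiter keeps every line at column one, so auto-indentation has nothing to realign.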
If authoring an R Markdown document instead of an R script is an option, we could use the knitr cat engine:
---
title: "Untitled"
output: html_document
---
```{cat engine.opts=list(file='foo')}
line1
line2
"line3"
'line4'
```
```{r}
bla <- readLines("foo")
bla
```

Show levels of interactions between movie characters with Neo4j

I am working on a little project to learn more about graph analytics. I have dialogs from the TV show Archer, to which I added a speaker field and a speaking_to field. To show the level of interaction, I am using a basic word count.
My data looks like this:
TEXT         Speaker  Speaking_to    Wordcount
Bla bla      Archer   Lana           2
Bla          Archer   Lana, Cyril    1
Bla bla bla  Lana     Archer, Cyril  3
I would use the sum of the word counts for every combination of speaker and speaking_to to show the strength of the interaction between characters.
How would you proceed in Neo4j?
How do you model the cases where I have multiple speaking_to characters? I want all my nodes to be individual characters and not groups.
Thank you,
Model:
two types of nodes: the person and the text
two types of relationships: who speaks, and to whom the text is addressed
MERGE (A:Person {name:'Archer'})
MERGE (L:Person {name:'Lana'})
MERGE (C:Person {name:'Cyril'})
MERGE (T1:Text {name: 'Bla bla', wc: 2})
MERGE (T2:Text {name: 'Bla', wc: 1})
MERGE (T3:Text {name: 'Bla bla bla', wc: 3})
MERGE (A)-[:Speaking]->(T1)
MERGE (T1)-[:Speaking_to]->(L)
MERGE (A)-[:Speaking]->(T2)
MERGE (T2)-[:Speaking_to]->(L)
MERGE (T2)-[:Speaking_to]->(C)
MERGE (L)-[:Speaking]->(T3)
MERGE (T3)-[:Speaking_to]->(A)
MERGE (T3)-[:Speaking_to]->(C)
Strength of directional interaction:
MATCH (A:Person)-[:Speaking]->(S:Text)-[:Speaking_to]->(P:Person)
RETURN A.name, P.name, sum(S.wc) as wordcount
ORDER BY wordcount DESC

Regular Expression in R: disentangling similar possibilities

I have a list of names looking like this
Noms<- c("André Coin", "XXXAndré Coin", "Gabriel Péri","Léon Blum", "XXXLéon Blum")
I am trying to create a function that finds every occurrence of each of these names in a very long text when the name appears at the beginning of a line, after "M" or "Mme".
My text is a vector in which each line is an element.
So at the end, a line like "M. André Coin said bla bla" would be matched; but a line like "He said bla bla bla to M. André Coin" would NOT be matched.
The final requirement is that "André Coin" can be distinguished from "XXXAndré Coin".
The solution I have found for the moment is:
findpattern <- function(name,vect) {
x<-paste0("^.{1,3}((M\\s*)|(Mme\\s*))*\\s*",name)
found<-grepl(x,vect)
return(found)
}
However, when I run findpattern(Noms,txt), it cannot distinguish "André Coin" from "XXXAndré Coin", meaning that findpattern("André Coin", "M. XXXAndré Coin") returns TRUE.
Can you help me find my mistake in the writing of my regular expression?
You've missed a dot after the M, and the first 3 characters must be optional (i.e. from 0 to 3 characters):
"^.{0,3}((M\\.\\s*)|(Mme\\s*))*\\s*"
If the dot after M is optional:
"^.{0,3}((M\\.?\\s*)|(Mme\\s*))*\\s*"

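To see how the pattern with the optional dot behaves, here is a minimal sketch that plugs it into the question's findpattern function. The three test lines are made up from the examples given in the question.

```r
# Build the pattern with the optional dot: 0 to 3 leading characters,
# then any number of "M", "M." or "Mme" titles, then the name
findpattern <- function(name, vect) {
  pattern <- paste0("^.{0,3}((M\\.?\\s*)|(Mme\\s*))*\\s*", name)
  grepl(pattern, vect)
}

# Hypothetical test lines based on the examples in the question
txt <- c(
  "M. André Coin said bla bla",
  "He said bla bla bla to M. André Coin",
  "M. XXXAndré Coin said bla bla"
)

found <- findpattern("André Coin", txt)
found
```

One caveat: the leading `.{0,3}` still allows up to three arbitrary characters before the name, so a line that begins directly with "XXXAndré Coin" (no title) would also match "André Coin". If lines always begin with the title, dropping that prefix may be worth considering.
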
Cleaning a NSString ranges

I'm trying to clean a text string. For example:
[bla bla bla fórmula MELD" width="990" height="718" bla bla bla bla]
I want to remove width="990" height="718" throughout my text, i.e. locate every width="xxx" height="xxx" pair and remove it. In some cases they are inverted: height="xxx" width="xxx".
The numbers in width="990" height="718" are not the same every time.
