I am working on a little project to learn more about graph analytics. I have dialogs from the TV show Archer to which I added a speaker field and a speaking_to field. To show the level of interaction, I am using a basic wordcount.
My data looks like this:
TEXT        | Speaker | Speaking_to   | Wordcount
Bla bla     | Archer  | Lana          | 2
Bla         | Archer  | Lana, Cyril   | 1
Bla bla bla | Lana    | Archer, Cyril | 3
I would use the wordcount sums between every combination of speaker and speaking_to to show the strength of the characters' interactions with each other.
How would you proceed in Neo4j?
How do you model the cases where I have multiple speaking_to characters? I want all my nodes to be individual characters and not groups.
Thank you,
Model:
two types of nodes - the person and the text
two types of relationships - who speaks the text and whom the text is addressed to
MERGE (A:Person {name:'Archer'})
MERGE (L:Person {name:'Lana'})
MERGE (C:Person {name:'Cyril'})
MERGE (T1:Text {name: 'Bla bla', wc: 2})
MERGE (T2:Text {name: 'Bla', wc: 1})
MERGE (T3:Text {name: 'Bla bla bla', wc: 3})
MERGE (A)-[:Speaking]->(T1)
MERGE (T1)-[:Speaking_to]->(L)
MERGE (A)-[:Speaking]->(T2)
MERGE (T2)-[:Speaking_to]->(L)
MERGE (T2)-[:Speaking_to]->(C)
MERGE (L)-[:Speaking]->(T3)
MERGE (T3)-[:Speaking_to]->(A)
MERGE (T3)-[:Speaking_to]->(C)
Strength of directional interaction:
MATCH (A:Person)-[:Speaking]->(S:Text)-[:Speaking_to]->(P:Person)
RETURN A.name, P.name, sum(S.wc) as wordcount
ORDER BY wordcount DESC
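To sanity-check the totals this query should return, the same aggregation can be sketched outside Neo4j. This is only an illustration in plain Python, assuming the three sample rows from the question and a comma-separated Speaking_to field; the function name interaction_strength is made up for the example.

```python
from collections import defaultdict

# Sample rows from the question: (text, speaker, speaking_to, wordcount)
rows = [
    ("Bla bla", "Archer", "Lana", 2),
    ("Bla", "Archer", "Lana, Cyril", 1),
    ("Bla bla bla", "Lana", "Archer, Cyril", 3),
]


def interaction_strength(rows):
    """Sum wordcounts per directed (speaker, listener) pair.

    A line addressed to several characters counts in full for each of them,
    which mirrors one Speaking_to relationship per listener in the graph model.
    """
    totals = defaultdict(int)
    for _text, speaker, speaking_to, wordcount in rows:
        for listener in speaking_to.split(","):
            totals[(speaker, listener.strip())] += wordcount
    return dict(totals)


strength = interaction_strength(rows)
for (speaker, listener), total in sorted(strength.items(), key=lambda kv: -kv[1]):
    print(speaker, listener, total)
```

Each listener becomes its own pair, so the nodes stay individual characters rather than groups.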
How can I retain the numbering of paragraphs when extracting text from a docx file?
I'm doing some NLP-ML work on a bunch of docx files, and to begin with I need to break up each doc into a dataframe. I'm working with contracts, so almost every paragraph is numbered, e.g. most of the text I'm dealing with looks like this:
1.17. The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
1.17.1. The Agent will ensure that attendant resources are bla bla bla
1.18. An indicative Authority resource profile is set out in bla bla bla.
etc
The docx_summary() function of the officer package lays out the text in a dataframe wonderfully, except that it doesn't retain the paragraph numbering. The result is that I get a dataframe where the text looks like this:
The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
The Agent will ensure that attendant resources are bla bla bla
An indicative Authority resource profile is set out in bla bla bla.
I guess this has to do with how Word defines numbering as a Style rather than plain text, and I can see in the docx_summary() output that the $style_name variable has Headings 1 through 4 according to the numbering hierarchy in the docx. But I can't figure out how to extract the actual numbering and apply it to each paragraph in the dataframe returned by docx_summary().
The output I want is the same docx_summary() dataframe, but with an added numbering column, to look like this:
output_df <- data.frame(content_type = "paragraph", style_name = "heading 2", numbering = "1.17", text = "The Agent will provide the attendant resources as set out in Annex 3 bla bla bla")
> output_df
content_type style_name numbering text
1 paragraph heading 2 1.17 The Agent will provide the attendant resources as set out in Annex 3 bla bla bla
Any help would be much appreciated.
I am very new to NetLogo and have been using it to do basic network analysis. I have created a social network made up of 5 different turtle breeds. In order to continue my analysis in another program, I need to create an edge list (a two-column list of all the connected nodes in the network). So the first column would have the breed and who number (ex. Actor 1) and the second column would list one of Actor 1's contacts
(ex [Actor1, Actor2] [Actor 1, Director5] [Actor1, Producer 1]..........)
The output needs to be a txt or csv file so that I can import it easily into EXCEL.
I've tried:
to make-edgelist
file-open "Test.txt"
ask links [file-show both-ends]
ask turtles[file-show who]
file-close
end
The problem is that 'both-ends' only reports the who number, not the breed. I can get the breed by using ask turtles [file-show who], but this appends the identification to the end of the edge list, which means a lot of manipulation to get things in the correct format. Does anyone have any suggestions about how to build the edge list with the breeds + who numbers? I feel like I'm probably missing something simple, but I am new to NetLogo. Thanks!
The csv extension makes this a one-liner. Assuming you have extensions [ csv ] at the top of your code, you can just do:
csv:to-file "test.csv" [ [ (word breed " " who) ] of both-ends ] of links
If you need column titles, you can add them using fput, e.g.:
csv:to-file "test.csv"
fput ["source" "target"]
[ [ (word breed " " who) ] of both-ends ] of links
Note, however, that both-ends is an agentset that will always be accessed in random order, so "source" and "target" are not very meaningful in that case.
If you have directed links and if the direction is important, you can preserve it with this slightly more complicated version:
csv:to-file "test.csv"
fput ["source" "target"]
[ map [ t -> [ (word breed " " who) ] of t ] (list end1 end2) ] of links
I'm assuming that whatever program you're using for your analysis can organize and rearrange your pairs as needed, so as long as all pairs are recorded your network should build fine regardless of order. Your approach using ask links and both-ends is a good one, and works with a bit of tweaking (mainly just using the breed primitive to have the turtles include their breed in the output). Here's one example:
to pairs-out
file-open "test.csv"
file-type (word "End_1, " "End_2,\n" )
ask links [
ask both-ends [
file-type (word breed " " who ",")
]
file-type "\n"
]
file-close
end
I am trying to split values in a dataframe column that looks like this:
Apple\Banana
Drink
---
Drink\Cup Cake
Apple
--
Fudge\Grape\Ham
Cup Cake
---
I am trying to match both newline and '\' using regex in strsplit.
Currently I am using this:
strsplit(as.character(df$Food), "[\\\\ \n]")
However, it is also matching the space and splitting up "Cup Cake" into "Cup" and "Cake".
I am trying to figure out the proper regex for this matching.
My aim is to split the multiple values into multiple food columns in the dataframe called Food.1, Food.2, Food.3, etc. Is there a standard way to do the split and create new columns in a dataframe? I think strsplit may not be the best way forward.
You have a space in the pattern, so the space is treated as a separator too. Try putting the newline first and dropping the space:
strsplit(as.character(df$Food), "[\n\\\\]")
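The effect of dropping the space from the character class can be checked with an equivalent pattern in another regex engine. The following is a small Python sketch, not R; the sample strings are assumptions about how the column values are stored (a literal backslash between items, and possibly an embedded newline), and split_food is a hypothetical helper name.

```python
import re

# Assumed sample values: items separated by a literal backslash or a newline.
values = ["Apple\\Banana", "Drink", "Drink\\Cup Cake", "Fudge\\Grape\\Ham\nCup Cake"]


def split_food(value):
    """Split on newline or backslash only, so spaces inside an item survive."""
    return re.split(r"[\n\\]", value)


split_values = [split_food(v) for v in values]
for parts in split_values:
    print(parts)
```

Because the class contains only the newline and the backslash, "Cup Cake" stays in one piece.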
I have a list of names looking like this
Noms<- c("André Coin", "XXXAndré Coin", "Gabriel Péri","Léon Blum", "XXXLéon Blum")
I am trying to create a function that finds every occurrence of each of these names in a very long text, but only where the name appears at the beginning of a line, preceded by "M" or "Mme".
My text is a vector in which each line is an element.
So at the end, a line like "M. André Coin said bla bla" would be matched; but a line like "He said bla bla bla to M. André Coin" would NOT be matched.
The final requirement is that "André Coin" can be distinguished from "XXXAndré Coin".
The solution I have found for the moment is:
findpattern <- function(name,vect) {
x<-paste0("^.{1,3}((M\\s*)|(Mme\\s*))*\\s*",name)
found<-grepl(x,vect)
return(found)
}
However, when I run findpattern(Noms,txt), it cannot distinguish "André Coin" from "XXXAndré Coin". Meaning that findpattern("André Coin", "M. XXXAndré Coin") returns TRUE.
Can you help me find my mistake in the writing of my regular expression?
You've missed a dot after the M, and the first 3 characters must be optional (i.e. from 0 to 3 characters):
"^.{0,3}((M\\.\\s*)|(Mme\\s*))*\\s*"
If the dot after M is optional:
"^.{0,3}((M\\.?\\s*)|(Mme\\s*))*\\s*"
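The behaviour of this pattern can be checked against the cases described in the question. The sketch below uses Python's re module rather than R, and the name find_pattern and the sample lines are only illustrative; it assumes the name is appended to the prefix pattern exactly as paste0 does in the original function.

```python
import re


def find_pattern(name, lines):
    """Return, for each line, whether it starts with an optional M./Mme prefix
    followed directly by the given name."""
    # Same prefix as the second pattern above; re.escape guards against
    # regex metacharacters in the name.
    pattern = re.compile(r"^.{0,3}((M\.?\s*)|(Mme\s*))*\s*" + re.escape(name))
    return [bool(pattern.search(line)) for line in lines]


lines = [
    "M. André Coin said bla bla",
    "He said bla bla bla to M. André Coin",
    "M. XXXAndré Coin",
    "Mme André Coin",
]

plain = find_pattern("André Coin", lines)
prefixed = find_pattern("XXXAndré Coin", lines)
print(plain)
print(prefixed)
```

Since the name has to follow the prefix directly, "André Coin" does not match the line that contains "XXXAndré Coin", and a name in the middle of a line is rejected by the anchor.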
I have a file with 3 columns, where one of the columns consists of an "array" with a delimiter such as ",". I need to link the text items inside the array to form something like a linked list. After that, the list should be linked to the other 2 columns.
For example:
Column 1 (Text): A
Column 2 (Array of text): B1, B2, B3, B4
Column 3 (Text): C
I need something like A->B1->B2->B3->B4->C to be visualised in Neo4j.
I need help in forming the "LOAD CSV..." query. Appreciate any help offered!
You can use split to extract each element of the array column:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
'file://directory/file.csv' AS line
with SPLIT(line.columnName,',') as arrayColumn
Now you can use each element of arrayColumn, for example:
arrayColumn[0], arrayColumn[1]
Then you can create nodes or relationships. Each node needs its own variable name:
MERGE (a:LabelName {name: arrayColumn[0]})
MERGE (b:LabelName {name: arrayColumn[1]})
MERGE (a)-[:relations]->(b)
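To get the full chain A->B1->B2->B3->B4->C rather than only the first pair, every consecutive pair of elements has to be linked. The pairing logic is sketched here in Python purely as an illustration of which relationships are needed; chain_edges is a hypothetical helper, and the inputs are the example values from the question.

```python
def chain_edges(first, array_text, last, delimiter=","):
    """Return the consecutive (from, to) pairs for first -> array items -> last."""
    # Split the array column and drop surrounding whitespace and empty items.
    middle = [item.strip() for item in array_text.split(delimiter) if item.strip()]
    nodes = [first, *middle, last]
    # Pair each node with its successor: (A, B1), (B1, B2), ..., (B4, C).
    return list(zip(nodes, nodes[1:]))


edges = chain_edges("A", "B1, B2, B3, B4", "C")
for source, target in edges:
    print(f"{source} -> {target}")
```

Each resulting pair corresponds to one relationship between two merged nodes.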
Hope this helps ...