SyntaxNet Tokenization for Italian and Spanish - syntaxnet

We're trying to use SyntaxNet on English, Italian and Spanish with models pretrained on Universal Dependencies datasets that we found here https://github.com/tensorflow/models/blob/master/syntaxnet/universal.md.
For Italian and Spanish we are running into tokenisation problems with contractions and clitics. Contractions combine a preposition and a determiner, so we want them split into their two parts. We noticed that the tokeniser always fails to do so, which makes the analysis of the whole sentence wrong. The same happens for clitics.
We are launching the models as follows:
MODEL_DIRECTORY=../pretrained/Italian
cat /mnt/test_ita.split | syntaxnet/models/parsey_universal/tokenize.sh \
$MODEL_DIRECTORY > /mnt/test_ita.tokenized
Below is an example of the output we currently get and the output we would like to have.
Italian (SyntaxNet analysis)
1 Sarebbe _ VERB V Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|fPOS=VERB++V 2 cop _ _
2 bello _ ADJ A Gender=Masc|Number=Sing|fPOS=ADJ++A 0 ROOT _ _
3 esserci _ PRON PE fPOS=NOUN++S 2 nsubj _ _
4 . _ PUNCT FS fPOS=PUNCT++FS 2 punct _ _
Italian (desired output)
1 Sarebbe _ VERB V Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|fPOS=VERB++V 2 cop _ _
2 bello _ ADJ A Gender=Masc|Number=Sing|fPOS=ADJ++A 0 ROOT _ _
3 esser _ VERB V VerbForm=Inf|fPOS=VERB++V 2 csubj _ _
4 ci _ PRON PC PronType=Clit|fPOS=PRON++PC 3 advmod _ _
How can we handle this problem? Thanks in advance.

Related

What in this Nearley grammar is causing an infinite loop?

I'm attempting to write a Nearley grammar that will parse a .pbtxt file (a protobuf in textual format). I'm very close but seem to be encountering an infinite loop when tested in the Nearley playground (https://omrelli.ug/nearley-playground/). Can someone more comfortable with Nearley grammars spot the issue more readily?
#builtin "number.ne"
#builtin "string.ne"
#builtin "whitespace.ne"
Start -> Field:+
Field -> _ (ScalarField | MessageField) _ "\n":*
ScalarField -> FieldName _ ":" _ (ScalarValue | ScalarList) _
MessageField -> FieldName _ ":" _ (MessageValue | MessageList) _
FieldName -> [A-Za-z0-9_]:+
MessageValue -> "{" Field:+ "}"
MessageList -> "{" Field (_ "\n":+ Field):* "}"
ScalarValue -> String | Float | Integer
ScalarList -> "{" _ ScalarValue (_ "\n":+ ScalarValue):* "}"
String -> sqstring | dqstring | [A-Za-z0-9_]:+
Float -> decimal
Integer -> int
Here's an example .pbtxt that should be parseable:
something: WHATEVER
some_list: {
bins: 32
bins: 64
bins: 128
bins: 256
}
things: {
some_path: "Data/thingything"
weight: 2
other_weights: {
positive_dense_bin: 8
low_heat: 1
}
}

Julia RemoteChannel example gives UndefVarError

I tried to run the example from https://docs.julialang.org/en/v1/manual/parallel-computing/#Channels-and-RemoteChannels-1
Just copy-pasted commands to my julia console. I am using version 1.0.0
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.0 (2018-08-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
julia> nprocs()
5
julia> const results = RemoteChannel(()->Channel{Tuple}(32));
julia> const jobs = RemoteChannel(()->Channel{Int}(32));
julia> @everywhere function do_work(jobs, results) # define work function everywhere
           while true
               job_id = take!(jobs)
               exec_time = rand()
               sleep(exec_time) # simulates elapsed time doing actual work
               put!(results, (job_id, exec_time, myid()))
           end
       end
julia> function make_jobs(n)
           for i in 1:n
               put!(jobs, i)
           end
       end;
julia> n = 12
12
julia> @async make_jobs(n);
julia> for p in workers() # start tasks on the workers to process requests in parallel
           remote_do(do_work, p, jobs, results)
       end
julia> @elapsed while n > 0 # print out results
           job_id, exec_time, where = take!(results)
           println("$job_id finished in $(round(exec_time; digits=2)) seconds on worker $where")
           n = n - 1
       end
It gave the following error:
4 finished in 0.46 seconds on worker 4
ERROR: UndefVarError: n not defined
Stacktrace:
[1] macro expansion at ./REPL[9]:4 [inlined]
[2] top-level scope at ./util.jl:213 [inlined]
[3] top-level scope at ./none:0
Why did it crash on worker 4? I assumed the while loop should run only on the master process.
The problem is that the line:
n = n - 1
should be
global n = n - 1
This is unrelated to workers; it is caused by the new scoping rules in Julia 1.0 (see https://docs.julialang.org/en/latest/manual/variables-and-scoping/#Local-Scope-1).
The example in the manual should be fixed.
Edit: the example is fixed on master https://docs.julialang.org/en/v1.1-dev/manual/parallel-computing/, but was not backported.

How to delete first and last items before the matching pattern or delimiter in R

I have this vector called myvec. I want to delete everything before the first delimiter _ and everything after the last delimiter _ (including the delimiters). How do I do this in R to get the result below?
myvec <- c("contamination_LPH-001-10_3.txt", "contamination_LPH-001-10_AK1_0.txt",
"contamination_LPH-001-10_AK2_1.txt", "contamination_LPH-001-10_PD_2.txt",
"contamination_LPH-001-10_SCC_4.txt")
Result:
LPH-001-10, LPH-001-10_AK1, LPH-001-10_AK2, LPH-001-10_PD, LPH-001-10_SCC
We can use gsub for this
gsub("^[^_]*_|_[^_]*$", "", myvec)
#[1] "LPH-001-10" "LPH-001-10_AK1" "LPH-001-10_AK2"
#[4] "LPH-001-10_PD" "LPH-001-10_SCC"
The pattern has two alternatives separated by |. The first matches, from the start (^) of the string, zero or more characters that are not a _ ([^_]*) followed by a _. The second matches a _ followed by zero or more characters that are not a _ ([^_]*) up to the end ($) of the string. Both matches are replaced with "".
Or we can also use capture groups ((...)) and replace with the backreference for the capture groups.
sub("^[^_]*_(.*)_[^_]*$", "\\1", myvec)
#[1] "LPH-001-10" "LPH-001-10_AK1" "LPH-001-10_AK2"
#[4] "LPH-001-10_PD" "LPH-001-10_SCC"
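For readers who want to check the two patterns outside R, here is a minimal sketch of the same idea in Python. It assumes Python's standard `re` module, whose `re.sub` replaces every match by default (like `gsub`); the variable names `trimmed` and `captured` are only illustrative.

```python
import re

myvec = [
    "contamination_LPH-001-10_3.txt",
    "contamination_LPH-001-10_AK1_0.txt",
    "contamination_LPH-001-10_AK2_1.txt",
    "contamination_LPH-001-10_PD_2.txt",
    "contamination_LPH-001-10_SCC_4.txt",
]

# Alternation: strip the leading "xxx_" and the trailing "_xxx" in one pass.
trimmed = [re.sub(r"^[^_]*_|_[^_]*$", "", s) for s in myvec]

# Capture group: keep only what lies between the first and the last "_".
captured = [re.sub(r"^[^_]*_(.*)_[^_]*$", r"\1", s) for s in myvec]

print(trimmed)
print(captured)
```

Both lists should come out identical, since the greedy `(.*)` in the second pattern extends up to the last underscore.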

Coordinates to Grid Box Number

Let's say I have some grid that looks like this
_ _ _ _ _ _ _ _ _
| | | |
| 0 | 1 | 2 |
|_ _ _|_ _ _|_ _ _|
| | | |
| 3 | 4 | 5 |
|_ _ _|_ _ _|_ _ _|
| | | |
| 6 | 7 | 8 |
|_ _ _|_ _ _|_ _ _|
How do I find which cell I am in if I only know the coordinates? For example, how do I get 0 from (0,0), or how do I get 7 from (1,2)?
Also, I found this question, which does what I want to do in reverse, but I can't reverse it for my needs because as far as I know there is not a mathematical inverse to modulus.
In this case, given cell index A in the range [0, 9), the row is given by R = floor(A/3) and the column is given by C = A mod 3.
In the general case, where MN cells are arranged into a grid with M rows and N columns (an M x N grid), given a whole number B in [0, MN), the row is found by R = floor(B/N) and the column is found by C = B mod N.
Going the other way, if you are given a grid element (R, C) where R is in [0, M) and C is in [0, N), finding the element in the scheme you show is given by A = RN + C.
cell = x + y*width
Programmers use this often to treat a 1D-array like a 2D-array.
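As a small sketch of the formulas above (in Python, with the hypothetical helper names `cell_index` and `cell_coords`), the forward mapping is one multiplication and one addition, and the reverse mapping is an integer division plus a remainder:

```python
def cell_index(x, y, width):
    """Map column x and row y to a cell number in a grid that is `width` cells wide."""
    return x + y * width


def cell_coords(cell, width):
    """Inverse mapping: recover (x, y) from the cell number."""
    y, x = divmod(cell, width)  # row = cell // width, column = cell % width
    return x, y


width = 3
print(cell_index(0, 0, width))  # top-left cell of the 3 x 3 grid
print(cell_index(1, 2, width))  # column 1, row 2
print(cell_coords(7, width))    # back from the cell number to (x, y)
```

For the 3 x 3 grid in the question, (0,0) maps to 0 and (1,2) maps to 7, reading the pair as (column, row).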
For future programmers
May this be useful:
let wide = 4;
let tall = 3;
let reso = wide * tall;
for (let indx = 0; indx < reso; indx++) {
    let y = Math.floor(indx / wide);
    let x = indx % wide;
    console.log(indx, `x:${x}`, `y:${y}`);
}

Breadth first search, and A* search in a graph?

I understand how to use a breadth first search and A* in a tree structure, but given the following graph, how would it be implemented? In other words, how would the search traverse the graph? S is the start state
Graph Here
It's exactly the same as doing it in a tree. You just need to somehow keep track of which nodes you've already visited so that you don't end up going in circles.
Basically, you treat a graph the same way that you'd treat a tree, except you need to keep track of nodes you've already visited. That's fine for BFS. On top of that, in the case of A*, consider what you'd do when you revisit a node but have found a cheaper route to it.
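To make the A* remark concrete, here is a minimal Python sketch of one common way to handle it: keep the best known cost per node and re-queue a node whenever a cheaper route to it turns up. The graph `GRAPH`, the heuristic `H` and the function name `a_star` are all hypothetical; they are not taken from the picture in the question.

```python
import heapq


def a_star(graph, heuristic, start, goals):
    """graph maps node -> list of (neighbour, edge_cost). Returns (cost, path) or None."""
    best_cost = {start: 0}   # cheapest known cost from start to each node
    parent = {start: None}   # back-pointers used to rebuild the path
    frontier = [(heuristic[start], start)]
    while frontier:
        _, node = heapq.heappop(frontier)
        if node in goals:
            cost = best_cost[node]
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return cost, path[::-1]
        for neighbour, edge_cost in graph[node]:
            new_cost = best_cost[node] + edge_cost
            # Revisit a node only if this route is cheaper than the best one so far.
            if neighbour not in best_cost or new_cost < best_cost[neighbour]:
                best_cost[neighbour] = new_cost
                parent[neighbour] = node
                heapq.heappush(frontier, (new_cost + heuristic[neighbour], neighbour))
    return None


# Hypothetical example: B is first reached directly from S (cost 4),
# then again via A (cost 2), so its entry is updated.
GRAPH = {"S": [("A", 1), ("B", 4)], "A": [("B", 1)], "B": [("G", 1)], "G": []}
H = {"S": 2, "A": 2, "B": 1, "G": 0}  # assumed admissible estimates to G

print(a_star(GRAPH, H, "S", {"G"}))
```

In this sketch, stale heap entries are simply left in the queue. They are harmless here because the costs are always read from `best_cost`, at the price of a few redundant expansions.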
Paint the graph: recursively search each node and mark the nodes you have visited as dirty. Only recurse into a node when it is not dirty.
If memory is not an issue, copy the graph and instead of marking the nodes, remove them from the copy graph.
It's a weighted graph. Do you want to find shortest paths or just traverse it?
If you want just traversing, here it is:
1) there is only S in the queue
2) we are adding C and A in the queue, only they are reachable from S directly (with one edge)
3) D, G2 - from C
4) B, E - from A
5) G1 - from D (G2 is already in the queue)
6) there is no outgoing edge from G2
7) there are no adjacent nodes of B which aren't already in the queue
So here's the order in which the nodes were added to the queue: S, C, A, D, G2, B, E, G1
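The walk-through above can be sketched in Python as a queue plus a visited set. The adjacency lists below contain only the edges named in the steps (an assumption, since the picture is not reproduced here); edges from B and E are left out because the steps say they lead to nothing new. The names `EDGES` and `bfs_order` are illustrative.

```python
from collections import deque

# Edges as described in the steps; anything not mentioned there is omitted.
EDGES = {
    "S": ["C", "A"],
    "C": ["D", "G2"],
    "A": ["B", "E"],
    "D": ["G1", "G2"],
    "G2": [],
    "B": [],
    "E": [],
    "G1": [],
}


def bfs_order(graph, start):
    """Return the nodes in the order they enter the queue."""
    visited = {start}           # nodes already queued, so none is queued twice
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order


print(bfs_order(EDGES, "S"))
```

With these assumed edges the order matches the list above; the visited set is what keeps G2 from being queued a second time when D is expanded.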
I don't know how helpful you will find this, but here's a complete solution coded in the functional language J (available for free from jsoftware.com).
First, it's probably simplest to work directly from a representation of the graph you show in your picture. I represent this as a (# nodes) x (# nodes) table with a number at (i,j) for the value of the link between node-i and node-j. Also, along the diagonal I've put the number associated with each node itself.
So, I enter this as follows - don't worry too much about the unfamiliar notation, you'll soon see what the result looks like:
grph=: <;.1&>TAB,&.><;._2 ] 0 : 0
A B C D E G1 G2 S
A 2 1 8 2
B 1 1 1 4 2
C 3 1 5
D 1 5 2
E 6 9 7
G1 0
G2 0
S 2 3 5
)
So, I've assigned the variable "grph" as a 9x9 table where the first row and first column are the labels "A"-"E", "G1", "G2", and "S"; I've used tabs to delimit items so this could be cut-and-pasted to or from a spreadsheet as needed.
Now, I'll check the size of my table and display it:
$grph
9 9
grph
+---+--+--+--+--+--+---+---+--+
| | A| B| C| D| E| G1| G2| S|
+---+--+--+--+--+--+---+---+--+
| A | 2| 1| | | 8| | | 2|
+---+--+--+--+--+--+---+---+--+
| B | | 1| 1| 1| | 4 | | 2|
+---+--+--+--+--+--+---+---+--+
| C | | 3| 1| | | | 5 | |
+---+--+--+--+--+--+---+---+--+
| D | | | | 1| | 5 | 2 | |
+---+--+--+--+--+--+---+---+--+
| E | | | | | 6| 9 | 7 | |
+---+--+--+--+--+--+---+---+--+
| G1| | | | | | 0 | | |
+---+--+--+--+--+--+---+---+--+
| G2| | | | | | | 0 | |
+---+--+--+--+--+--+---+---+--+
| S | 2| | 3| | | | | 5|
+---+--+--+--+--+--+---+---+--+
It looks OK and it's easy to compare this to the picture of the graph to check it.
Now I'll drop the first row and column so we're left only with numbers (as boxed literals),
and remove any extraneous tab characters.
grn=. TAB-.~&.>}.}."1 grph
You can see I assign this result to the variable "grn".
Next, I'll replace any empty cells with "_" - which represents infinity - then convert the literals to numeric representation (re-assigning the result to the same name "grn"):
grn=. ".&>(0=#&>grn)}grn,:<'_'
Finally, I'll move the last column and row to the beginning since this is the one for "S" and it's supposed to be first. I'll also display the result to confirm that it looks correct.
]grn=. _1|."1]_1|.grn NB. "S" goes first.
5 2 _ 3 _ _ _ _
2 2 1 _ _ 8 _ _
2 _ 1 1 1 _ 4 _
_ _ 3 1 _ _ _ 5
_ _ _ _ 1 _ 5 2
_ _ _ _ _ 6 9 7
_ _ _ _ _ _ 0 _
_ _ _ _ _ _ _ 0
So, now that I have a simple 8x8 table of numbers representing the graph, it's a simple matter to traverse it.
Here's a simple J function, called "traverseGraph", to read this table, traverse the graph it represents, and return two results: the indexes (0-based origin) of the nodes visited, and the values of the points and edges in the order visited.
traverseGraph=: 3 : 0
pts=. ,_-.~,ix{y [ nxt=. ix=. ,0
while. 0~:#nxt=. ~.ix-.~;([:I._~:])&.><"1 nxt{y do.
ix=. ix,nxt [ pts=. pts,_-.~,nxt{y
end.
ix;pts
)
We start by initializing three variables: the list of indexes "ix" (to zero, since we want to begin in the zeroth row of the table), a variable "nxt" to point to the next group of nodes (initially the same as the starting node), and the list of point values "pts" (starting as the 0th row of our input table, known internally as "y", with all the infinite values removed.)
In the "while." loop, we continue as long as there's more than zero "nxt" values resulting from pulling the current rows out of the table and removing any nodes (in "ix") we've already visited. Inside the loop, we accumulate the next set of indexes onto the end of "ix" and the point values onto "pts". At the end, we return the indexes and point values as our (two-element) result.
We run it like this - it displays the result by default:
traverseGraph grn
+---------------+---------------------------------------------+
|0 1 3 2 5 7 4 6|5 2 3 2 2 1 8 3 1 5 2 1 1 1 4 6 9 7 0 1 5 2 0|
+---------------+---------------------------------------------+
So, the first box contains the indexes starting with "0" and ending with "6". The second boxed item is the vector of point values in the order we accumulated them. I don't know what you do with these, so I just show them.
We can use the indexes to display the node names like this:
0 1 3 2 5 7 4 6{(<"0'SABCDE'),'G1';'G2'
+-+-+-+-+-+--+-+--+
|S|A|C|B|E|G2|D|G1|
+-+-+-+-+-+--+-+--+
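For readers who don't know J, here is a rough Python rendering of the index part of "traverseGraph", offered as a sketch rather than an exact translation: it works on the same 8x8 table, with `math.inf` standing in for "_", and it leaves out the accumulation of point values. The names `GRN`, `NAMES` and `traverse` are illustrative.

```python
import math

INF = math.inf  # plays the role of "_" (no link) in the J table
NAMES = ["S", "A", "B", "C", "D", "E", "G1", "G2"]
GRN = [
    [5, 2, INF, 3, INF, INF, INF, INF],
    [2, 2, 1, INF, INF, 8, INF, INF],
    [2, INF, 1, 1, 1, INF, 4, INF],
    [INF, INF, 3, 1, INF, INF, INF, 5],
    [INF, INF, INF, INF, 1, INF, 5, 2],
    [INF, INF, INF, INF, INF, 6, 9, 7],
    [INF, INF, INF, INF, INF, INF, 0, INF],
    [INF, INF, INF, INF, INF, INF, INF, 0],
]


def traverse(matrix, start=0):
    """Visit nodes level by level; return row indexes in the order visited."""
    visited = [start]
    frontier = [start]
    while frontier:
        nxt = []
        for i in frontier:
            for j, weight in enumerate(matrix[i]):
                # A finite entry means row i links to column j.
                if weight != INF and j not in visited and j not in nxt:
                    nxt.append(j)
        visited.extend(nxt)
        frontier = nxt
    return visited


order = traverse(GRN)
print(order)
print([NAMES[i] for i in order])
```

Run on this table, the sketch visits the rows in the same order as the J result shown above.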
I don't know how useful you'll find this but it does outline a simple solution to your problem.
