spaCy Example object format for SpanCategorizer - initialization

I am having an issue with SpanCategorizer that I believe is due to my Example object format and possibly its initialization.
Can someone provide a very simple Example object with the correct format? An example with two docs and two labels would be enough for me.
I don't understand what the prediction and the reference should look like. There is a gold standard mentioned in the spaCy documentation, but it looks out of date because the line reference = parse_gold_doc(my_data) doesn't work. Thanks so much for your help!
Here is the code I am using to annotate the docs:
```
from spacy.tokens import Span

phrase_matches = phrase_matcher(doc)
# Initialize an empty SpanGroup for each label
for label in labels:
    doc.spans[label] = []
# Turn each phrase match into a labeled Span and add it to the SpanGroup for its label
for match_id, start, end in phrase_matches:
    match_label = nlp.vocab.strings[match_id]
    span = Span(doc, start, end, label=match_label)
    doc.spans[match_label].append(span)
```
However, spaCy is not recognizing my labels.

If you want/need to create Example objects directly, the easiest way to do so is to use the function Example.from_dict, which takes a predicted doc and a dict. predicted in this context is a Doc with partial annotations, representing data from previous components. For many use-cases, it can just be a "clean" doc created with nlp.make_doc(text):
from spacy.training import Example
from spacy.lang.en import English
nlp = English()
text = "I like London and Berlin"
span_dict = {"spans": {"my_spans": [(7, 13, "LOC"), (18, 24, "LOC"), (7, 24, "DOUBLE_LOC")]}}
predicted = nlp.make_doc(text)
eg = Example.from_dict(predicted, span_dict)
This function takes the annotations from the dict and uses them to define the gold standard that is now stored in the Example object eg.
If you print this object (using spaCy >= 3.4.2), you'll see the internal representation of those gold-standard annotations:
{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O', 'O', 'O'], 'spans': {'my_spans': [(7, 13, 'LOC', ''), (18, 24, 'LOC', ''), (7, 24, 'DOUBLE_LOC', '')]}, 'links': {}}, 'token_annotation': {'ORTH': ['I', 'like', 'London', 'and', 'Berlin'], 'SPACY': [True, True, True, True, False], 'TAG': ['', '', '', '', ''], 'LEMMA': ['', '', '', '', ''], 'POS': ['', '', '', '', ''], 'MORPH': ['', '', '', '', ''], 'HEAD': [0, 1, 2, 3, 4], 'DEP': ['', '', '', '', ''], 'SENT_START': [1, 0, 0, 0, 0]}}
PS: the parse_gold_doc function in the docs is just a placeholder/dummy function. We'll clarify that in the docs to avoid confusion!
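The character offsets in a span dict like the one above are easy to get wrong when counted by hand. As a minimal sketch, they can be derived from the text itself. The helper char_span below is hypothetical (it is not part of spaCy) and assumes each phrase occurs in the text, using its first occurrence:

```python
def char_span(text, phrase, label):
    """Return (start_char, end_char, label) for the first occurrence of phrase."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase), label)


text = "I like London and Berlin"
span_dict = {
    "spans": {
        "my_spans": [
            char_span(text, "London", "LOC"),
            char_span(text, "Berlin", "LOC"),
            char_span(text, "London and Berlin", "DOUBLE_LOC"),
        ]
    }
}
print(span_dict)
```

The resulting dict has the same shape as span_dict in the answer and can be passed to Example.from_dict in the same way.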


Import JS Dictionary to Julia

I am very new to Julia Lang (in fact, just trying it instead of Python for some data analysis). However, I am stuck when loading my data.
My data is from a web-application built using ReactJS/ Python, saved in a csv. I get the data into a Julia DataFrame. The cell in this DataFrame that I need to analyse looks like this:
{'isClicked': [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True], 'continuation': [100, 100, 100, 100, 100, 0, 100, 100, 100, 0, 100, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
This comes from a JS-dictionary. Is there a way to convert it into a dictionary in Julia? I have tried a JSON3 converter (https://discourse.julialang.org/t/why-does-julia-not-support-json-syntax-to-create-a-dict/42873/20), but it seems not to work because of the single quotation mark. I.e., the error I get is:
ArgumentError: invalid JSON at byte position 2 while parsing type
JSON3.Object: ExpectedOpeningQuoteChar {'isClicked': [True, True,
Any suggestions are highly appreciated!
Thanks!
JSON requires double quotes instead of single quotes. Try
replace(text, "'" => "\"")
before sending it to the JSON parser.
That's not a JSON dictionary, it's a python one. In Python you should do
import json
with open('dict_file.json', 'w') as f:
    json.dump(my_py_dict, f)
Then in Julia
import JSON
my_julia_dict = JSON.parsefile("dict_file.json")
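If only the stored cell text is available, note that it is a Python literal rather than JSON: besides the single quotes, True is not valid JSON (JSON uses lowercase true), so replacing the quotes alone is not enough. A minimal sketch of converting such a string to real JSON on the Python side, where raw is a shortened stand-in for the actual cell:

```python
import ast
import json

# Shortened stand-in for the cell content shown in the question
raw = "{'isClicked': [True, True, True], 'continuation': [100, 0, 100]}"

# ast.literal_eval safely parses Python literals (dicts, lists, True/False, numbers)
parsed = ast.literal_eval(raw)

# json.dumps emits double quotes and lowercase true/false
as_json = json.dumps(parsed)
print(as_json)
```

The printed string can then be written back to the CSV or a file and read in Julia with a JSON parser.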

numpy array of unexpected dimension

I'm currently switching from Matlab to Python and I have a problem with understanding numpy arrays.
The following code (copied from Numpy documentation) creates a [2x3] array
np.array([[1, 2, 3], [4, 5, 6]], np.int32).
Which behaves as expected.
Now I tried to adapt this to my case and tried
myArray = np.array([\
[-0.000847283, 0.000000000, 0.141182070, 2.750000000],
[ 0.000876414, -0.025855453, 0.270459334, 2.534537894],
[-0.000098373, 0.003388169, -0.021976882, 3,509325279],
[ 0.000077079, -0.004507202, 0.096453685, 2,917172446],
[-0.000049944, 0.003114201, -0.055974372, 3,933359490],
[ 0.000042697, -0.003833862, 0.117727186, 2.485846507],
[-0.000000843, 0.000084733, 0.000169340, 3.661424974],
[ 0.000000676, -0.000074756, 0.005751451, 3.596300338],
[-0.000001860, 0.000229543, -0.006420507, 3.758593109],
[ 0.000006764, -0.000934745, 0.045972458, 2.972698644],
[ 0.000014803, -0.002140505, 0.106260454, 1.967898711],
[-0.000025975, 0.004587858, -0.263799480, 8.752330828],
[ 0.000009098, -0.001725357, 0.114993424, 1.176472749],
[-0.000010418, 0.002080207, -0.132368251, 6.535975709],
[ 0.000032572, -0.006947575, 0.499576502, -8.209401868],
[-0.000039870, 0.009351884, -0.722882956, 22.352084596],
[ 0.000046909, -0.011475011, 0.943268640, -22.078624629],
[-0.000067764, 0.017766572, -1.542265901, 48.344854010],
[ 0.000144148, -0.039449875, 3.607214322,-106.139552662],
[-0.000108830, 0.032648910, -3.242170215, 110.757624352]
])
However, the shape is (20,) rather than the (20, 4) I expected.
Question 1: Can anyone tell me why? And how do I create the array correctly?
Question 2: When I add the datatype , dtype=np.float, I get the following
Error:
*TypeError: float() argument must be a string or a number, not 'list'*
but the array isn't intended to be a list.
I found the mistake on my own after trying to np.vstack all vectors.
The resulting error said that the size of the arrays with the row index 2, 3, 4 is not 4 as expected.
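In those rows the last value was written with a comma instead of a decimal point (for example 3,509325279), so each of them has five elements. Replacing the comma with a dot solved the problem.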
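A typo like this can be located without NumPy by checking the row lengths before building the array. A minimal sketch using two of the rows from the question (the names rows and expected_len are illustrative):

```python
rows = [
    [-0.000847283, 0.000000000, 0.141182070, 2.750000000],
    [-0.000098373, 0.003388169, -0.021976882, 3,509325279],  # comma typo: parsed as 3 and 509325279
]

expected_len = 4

# Collect (row index, actual length) for every row that does not have 4 entries
bad_rows = [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected_len]
print(bad_rows)
```

Any row reported here has to be fixed before np.array can produce a two-dimensional result.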

Please how do i achieve the following using ramda

I have an array of the numbers 1 to 5 occurring in random order, for example [1,1,1,1,2,2]. I am tasked with finding the value with the highest occurrence every time. I achieved that in JavaScript as shown below, using a library called Ramda. After reading the documentation, I went with the following solution.
// filter out duplication in array that way you can get the uniq represented numbers
const uniqueItems = R.uniq(params);
// use the unique numbers as keys and create a new array of object
const mappedItemsWithRepresentations = map((a) => ({ color: a, rep: params.filter(b => b === a).length }), uniqueItems);
// and then finally, select the item with highest rep and return it key
const maxRepItem = mappedItemsWithRepresentations.reduce((acc, curr) => acc.rep > curr.rep ? acc : curr, []);
return maxRepItem.key; // gives me the correct value i need
However, after reading more of the documentation and going through the examples, I realised there is a way to combine the logic above and simplify it with Ramda. I made numerous attempts, and the closest I could get is below.
const getMaxRep = curry(pipe(uniq, map((a) => ({ color: a, rep: filter(b => b === a).length })), pipe(max(pathEq("rep")), tap(console.log))));
console.log("Max Rep here", getMaxRep(params));
I also tried using the reduced feature, all to no avail. How do I achieve that? Any help will be appreciated.
Ramda has R.countBy to get the number of occurrences. You can convert the resulting object of counts to pairs [value, count], and then reduce it to find the pair with the highest count:
const { pipe, countBy, identity, toPairs, reduce, maxBy, last, head } = R
const fn = pipe(
  countBy(identity), // count the occurrences
  toPairs, // convert to pairs of [value, count]
  reduce(maxBy(last), [0, 0]), // reduce to find the maximum occurrence
  head, // get the actual value
  Number, // convert back to a number
)
const arr = [1,1,1,1,2,2]
const result = fn(arr)
console.log(result)
<script src="https://cdnjs.cloudflare.com/ajax/libs/ramda/0.27.0/ramda.js"></script>
A slight variation on this idea collects values with the same count into an array. This handles cases in which several items have the same frequency:
const { pipe, countBy, identity, toPairs, invert, reduce, maxBy, last, head, map } = R
const fn = pipe(
  countBy(identity), // count the occurrences
  invert, // combine all values with the same count
  toPairs, // convert to pairs of [count, values]
  reduce(maxBy(head), [0, 0]), // reduce to find the maximum occurrence
  last, // get the actual values
  map(Number), // convert back to numbers
)
const arr = [1,1,1,1,2,2,3,3,3,3]
const result = fn(arr)
console.log(result)
<script src="https://cdnjs.cloudflare.com/ajax/libs/ramda/0.27.0/ramda.js"></script>
Nice use case, try this:
const maxReduce = reduce(maxBy(last), [0,0])
const getMaxRep = pipe(countBy(identity), toPairs, maxReduce, head)
console.log(getMaxRep([1,1,1,1,2,2]))
countBy is a really nice start. Sadly, Ramda doesn't support reduce for objects, but we can convert the object to an array of arrays using the toPairs function and finish the work.
It's not entirely clear to me what it is you're asking for.
But it might be something like this:
const maxRep = pipe (
countBy (identity),
toPairs,
map (zipObj(['color', 'rep'])),
reduce (maxBy (prop ('rep')), {rep: -Infinity}),
)
const params = [1, 2, 3, 4, 2, 3, 5, 2, 3, 2, 1, 1, 4, 5, 5, 3, 2, 5, 1, 5, 2]
console .log (
maxRep (params)
)
<script src="//cdnjs.cloudflare.com/ajax/libs/ramda/0.27.0/ramda.js"></script>
<script> const {pipe, countBy, identity, toPairs, map, zipObj, reduce, maxBy, prop} = R </script>
We start with a list of values drawn from {1, 2, 3, 4, 5}, occurring in some random, multiply-occurring order.
With countBy(identity) we change the original list into something like
{"1": 4, "2": 6, "3": 4, "4": 2, "5": 5}
with the counts associated with each entry.
toPairs formats that as an array like
[["1", 4], ["2", 6], ["3", 4], ["4", 2], ["5", 5]]
(You could also use Object.entries here.)
Then by calling map (zipObj (['color', 'rep'])), we turn this into
[{"color": "1", "rep": 4}, {"color": "2", "rep": 6}, ...]
Finally, we reduce the result, using maxBy (prop ('rep')), which chooses the one with the maximum rep value. For the initial value to the max call, we create a dummy object, {rep: -Infinity} that will compare less than any in your list.
If you wanted to also keep that final intermediate structure, you could rename that function to makeReps, dropping off the last function in the pipeline, and then making a new maxRep out of it.
Then you could call
const reps = makeReps (params)
const maxVal = maxRep (reps)
and use both.
But all this presupposes that the value with color and rep properties is what you need. If you just need the count then the other solutions already here handle that fine.

In Python how remove extra space elements from two dimensional array?

For example:
list = [['2', '2', '', ''], ['3', '3', '', ''], ['4', '4', '', '']]
and I want
newlist = [['2','2'],['3','3'],['4','4']]
Is there a compact list-comprehension way to achieve this?
For a 1D array we have [x for x in strings if x]. Is there anything similar for this case?
I think you mean you wish to remove empty elements from your 2D array. If that is the case, then:
old_list = [['2', '2', '', ''], ['3', '3', '', ''], ['4', '4', '', '']]
new_list = [[instance for instance in sublist if len(instance)>0] for sublist in old_list]
If you also wish to remove elements containing only whitespace (spaces etc.), then you may do something like:
old_list = [['2', '2', '', ''], ['3', '3', '', ''], ['4', '4', '', '']]
new_list = [[instance for instance in sublist if instance.strip()] for sublist in old_list]
list = [[x for x in y if x != ''] for y in list]
You can achieve this using filter. Also, unrelated, but since list is the name of a built-in, it's best not to reuse it as a variable name; try to come up with a more meaningful one. I've simply renamed it to original_list, since the list() function won't work otherwise.
original_list = [['2', '2', '', ''], ['3', '3', '', ''], ['4', '4', '', '']]
new_list = []
for sub_list in original_list:
    new_sub_list = list(filter(None, sub_list))
    new_list.append(new_sub_list)
print(new_list)
Or in short
new_list2 = [ list(filter(None, sub_list)) for sub_list in original_list ]
print(new_list2)
Use this list comprehension:
list = [[x for x in a if x] for a in list]
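The answers above differ in how they treat strings that contain only whitespace. A minimal sketch of the two behaviours, using a made-up input (rows) that includes a single-space element:

```python
rows = [['2', '2', '', ' '], ['3', '3', '', '']]

# Truthiness check: drops empty strings, keeps whitespace-only strings
drop_empty = [[x for x in row if x] for row in rows]

# strip() check: drops both empty and whitespace-only strings
drop_blank = [[x for x in row if x.strip()] for row in rows]

# Note: ''.isspace() is False, so "not x.isspace()" alone would keep empty strings
print(drop_empty)
print(drop_blank)
```

Which variant is appropriate depends on whether whitespace-only cells should count as data.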

minValue and maxValue as Time Range in hAxis in Google Chart

I need to set time range for my hAxis to have minValue of 09:00 and maxValue 17:00 with increment of 1 hour (i.e. 9, 10, 11, 12, 13, 14, ... , 17)
Currently my data is formatted as H:m (for example: 09:35, 10:20)
var formatter3 = new google.visualization.DateFormat({pattern: 'H:m'});
formatter3.format(data,0);
And below are my options:
var options = {
  curveType: "function",
  title: '',
  colors: ['red', '#3366CC', '#999999'],
  vAxes: {
    0: {logScale: false, format: '0.0000'},
    1: {logScale: false}
  },
  // single hAxis entry; a duplicate hAxis key would silently override the first one
  hAxis: {
    slantedTextAngle: 90,
    textStyle: {fontSize: 8},
    format: 'H:m',
    minValue: new Date(null, null, null, 9, 0, 0),
    maxValue: new Date(null, null, null, 17, 0, 0),
    viewWindow: {
      min: new Date(null, null, null, 9, 0, 0),
      max: new Date(null, null, null, 17, 0, 0)
    }
  },
  series: {
    0: {targetAxisIndex: 0, type: "line"},
    1: {targetAxisIndex: 0, type: "line"},
    2: {targetAxisIndex: 1, type: "bars"}
  }
};
However, it is still not working. Please advise. Thanks!
Unfortunately, the minValue, maxValue, and baseline values are ignored for date and time values. I am not sure whether this is a recent bug; I only noticed it a week ago. You might try experimenting with the viewWindow min and max and the gridlines.count option to get the desired result. Or, if the values are evenly spaced, you might be able to convert all your date values to strings, in which case the axes will use your explicit values.
Another new feature that could work for you is that you can provide an explicit array of tick values, with a ticks: [...] option. In the current release of gviz, the formatting is done using your format option, and that should be enough for your needs. In an upcoming release, you can also specify the formatting of each tick value.
So it might be best to specify the times in your example using timeofday values like so:
hAxis: {
ticks: [[9, 0, 0], [10, 0, 0], [11, 0, 0], [12, 0, 0], ...]
}
I think you could do the same kind of thing with datetime values instead, if that's what your data values are.
