Import JS Dictionary to Julia

I am very new to Julia (in fact, just trying it instead of Python for some data analysis). However, I am stuck when loading my data.
My data comes from a web application built using ReactJS/Python, saved in a CSV. I read the data into a Julia DataFrame. The cell in this DataFrame that I need to analyse looks like this:
{'isClicked': [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True], 'continuation': [100, 100, 100, 100, 100, 0, 100, 100, 100, 0, 100, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
This comes from a JS dictionary. Is there a way to convert it into a dictionary in Julia? I have tried a JSON3 converter (https://discourse.julialang.org/t/why-does-julia-not-support-json-syntax-to-create-a-dict/42873/20), but it seems not to work because of the single quotation marks. I.e., the error I get is:
ArgumentError: invalid JSON at byte position 2 while parsing type JSON3.Object: ExpectedOpeningQuoteChar {'isClicked': [True, True,
Any suggestions are highly appreciated!
Thanks!

JSON requires double quotes instead of single quotes. Try
replace(text, "'" => "\"")
before sending it to the JSON parser. Note that your sample also contains Python's True/False, which JSON spells true/false, so you will likely need to replace those as well ("True" => "true", "False" => "false").

That's not a JSON dictionary, it's a Python one. In Python you should do
import json

with open('dict_file.json', 'w') as f:
    json.dump(my_py_dict, f)
Then in Julia
import JSON
my_julia_dict = JSON.parsefile("dict_file.json")
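If the original Python objects are no longer available and all you have are the repr strings already saved in the CSV, you can convert them in a one-off Python pass. A minimal sketch, where the file and column names are made up, and which assumes each cell is a trusted Python literal (ast.literal_eval parses literals without executing arbitrary code):
```
import ast
import csv
import json

# Hypothetical file/column names; adjust to your data.
with open("data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    # The cell holds a Python dict repr like "{'isClicked': [True, ...]}".
    cell = ast.literal_eval(row["results"])
    # json.dumps emits double quotes and lowercase true/false.
    row["results"] = json.dumps(cell)

with open("data_json.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```
After this, parsing the cell contents with JSON3.read (or JSON.parsefile on a dumped file) in Julia should work unchanged.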

Related

spaCy Example object format for SpanCategorizer

I am having an issue with SpanCategorizer that I believe is due to my Example object format and possibly its initialization.
Can someone provide a very simple Example object with the correct format? Just an example with two docs and two labels would do it for me.
I am not getting what the prediction and the reference should look like. There is a gold standard mentioned in the spaCy documentation, but it looks out of date, because the line reference = parse_gold_doc(my_data) doesn't work. Thanks so much for your help!
Here is the code I am using to annotate the docs:
```
phrase_matches = phrase_matcher(doc)

# Initialize a SpanGroup for each label
for label in labels:
    doc.spans[label] = []

# Detect the phrase matches, label the spans, and add each span to the
# SpanGroup of its label
for match_id, start, end in phrase_matches:
    match_label = nlp.vocab.strings[match_id]
    span = Span(doc, start, end, label=match_label)
    doc.spans[match_label].append(span)
```
However, spaCy is not recognizing my labels.
If you want/need to create Example objects directly, the easiest way to do so is to use the function Example.from_dict, which takes a predicted doc and a dict. predicted in this context is a Doc with partial annotations, representing data from previous components. For many use-cases, it can just be a "clean" doc created with nlp.make_doc(text):
from spacy.training import Example
from spacy.lang.en import English
nlp = English()
text = "I like London and Berlin"
span_dict = {"spans": {"my_spans": [(7, 13, "LOC"), (18, 24, "LOC"), (7, 24, "DOUBLE_LOC")]}}
predicted = nlp.make_doc(text)
eg = Example.from_dict(predicted, span_dict)
What this function does is take the annotations from the dict and use them to define the gold standard that is now stored in the Example object eg.
If you print this object (using spaCy >= 3.4.2), you'll see the internal representation of those gold-standard annotations:
{'doc_annotation': {'cats': {}, 'entities': ['O', 'O', 'O', 'O', 'O'], 'spans': {'my_spans': [(7, 13, 'LOC', ''), (18, 24, 'LOC', ''), (7, 24, 'DOUBLE_LOC', '')]}, 'links': {}}, 'token_annotation': {'ORTH': ['I', 'like', 'London', 'and', 'Berlin'], 'SPACY': [True, True, True, True, False], 'TAG': ['', '', '', '', ''], 'LEMMA': ['', '', '', '', ''], 'POS': ['', '', '', '', ''], 'MORPH': ['', '', '', '', ''], 'HEAD': [0, 1, 2, 3, 4], 'DEP': ['', '', '', '', ''], 'SENT_START': [1, 0, 0, 0, 0]}}
PS: the parse_gold_doc function in the docs is just a placeholder/dummy function. We'll clarify that in the docs to avoid confusion!
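As for the labels not being recognized: the spancat component only learns its label set when labels are added explicitly or when it is initialized from Examples. A minimal sketch of the latter, assuming the gold spans are stored under the key "my_spans" as above:
```
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
# "spans_key" must match the key under which the gold spans are stored
spancat = nlp.add_pipe("spancat", config={"spans_key": "my_spans"})

text = "I like London and Berlin"
span_dict = {"spans": {"my_spans": [(7, 13, "LOC"), (18, 24, "LOC")]}}
examples = [Example.from_dict(nlp.make_doc(text), span_dict)]

# initialize() infers the label set from the gold-standard spans
nlp.initialize(get_examples=lambda: examples)
print(spancat.labels)  # ('LOC',)
```
The same Examples can then be passed to nlp.update during training.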

"You have to specify either input_ids or inputs_embeds", but I did specify the input_ids

I trained a BERT-based encoder-decoder model (EncoderDecoderModel) named ed_model with HuggingFace's transformers module.
I used a BertTokenizer named input_tokenizer.
I tokenized the input with:
txt = "Some wonderful sentence to encode"
inputs = input_tokenizer(txt, return_tensors="pt").to(device)
print(inputs)
The output clearly shows that input_ids is in the returned dict:
{'input_ids': tensor([[ 101, 5660, 7975, 2127, 2053, 2936, 5061, 102]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
But when I try to predict, I get this error:
ed_model.forward(**inputs)
ValueError: You have to specify either input_ids or inputs_embeds
Any ideas?
Well, apparently this is a known issue; see, for example, this issue for T5.
The problem is that there is probably a renaming step in the code: since we use an encoder-decoder architecture, there are two kinds of input IDs.
The solution is to specify the decoder input IDs explicitly:
ed_model.forward(decoder_input_ids=inputs['input_ids'],**inputs)
I wish it were documented somewhere, but now you know :-)
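For actual prediction, generate() is usually more convenient than calling forward() yourself, since it builds the decoder inputs step by step internally. A hedged sketch, assuming ed_model.config.decoder_start_token_id and ed_model.config.pad_token_id have been set (an EncoderDecoderModel needs both for generation):
```
# Greedy decoding with the trained encoder-decoder model; generate()
# creates the decoder_input_ids on its own.
generated_ids = ed_model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=50,
)
print(input_tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```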

Getting R observations back to NodeJS using "sort"

I am having a weird issue with r-script (an npm module) and passing its output back to NodeJS.
Using:
needs("arules")
data <- read.transactions(input[[1]], sep = ",")
library(arules)
# default settings result in zero rules learned
groceryrules <- apriori(data, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
summary(groceryrules)
inspect(groceryrules[1:5])
I get the result fine in nodeJS as:
[ { '2': '=>', lhs: '{potted plants}', rhs: '{whole milk}', support: 0.0069, confidence: 0.4, lift: 1.5655, count: 68, _row: '[1]' }, { '2': '=>', lhs: '{pasta}', rhs: '{whole milk}', support: 0.0061, confidence: 0.4054, lift: 1.5866, count: 60, _row: '[2]' } ...]
However, changing the last line to:
inspect(sort(groceryrules, by = "lift")[1:5])
I get no output. If I set the interval to 1:2, it correctly prints the top two observations (by lift).
Why can't I view more than 2 items when using sort?
My code in NodeJS:
var R = require("r-script");
var out = R("tests.R");
out = out.data(__dirname+"\\groceries.csv");
out = out.callSync();
console.log(out)
Thanks!
I managed to find the solution.
Using:
out <- capture.output(inspect(sort(groceryrules,by="lift")[1:10]))
out
It correctly captures the inspect output as a vector of strings and passes it to the NodeJS server as:
[' lhs rhs support confidence lift count',
'[1] {herbs} => {root vegetables} 0.007015760 0.4312500 3.956477 69',...]
A simple split on each string should now make the data manageable.
EDIT:
Managed to find a better solution that gets the JSON in the correct format straight away, by using:
data <- sort(groceryrules, by = "lift")
as(data, "data.frame")
This way it correctly converts the rules to a data frame, which is then passed to NodeJS as JSON.

GDI, get width of Asian characters

int dx[8];
int fit;
SIZE the_size;
res = GetTextExtentExPointW(dc, L"WWWWWWWW", 8, -1, &fit, &dx[0], &the_size);
This works; dx is filled with the numbers 7, 14, 21, etc. But when I try to do the same for Asian characters, like L"薔薇薔薇薔薇薔薇", this function fails. I even created a font for this; it doesn't change anything.
HFONT hFont = CreateFont(14,
0,
0,
0,
FW_DONTCARE,
FALSE, //fdwItalic
FALSE, //fdwUnderline
FALSE, //fdwStrikeOut
SHIFTJIS_CHARSET,
OUT_DEFAULT_PRECIS,
CLIP_DEFAULT_PRECIS,
NONANTIALIASED_QUALITY,
VARIABLE_PITCH,
TEXT("MS PGothic"));
if (hFont == NULL) FUCK();
SelectObject(dc, hFont);
The fourth parameter should be the maximum width allowed, not -1. Use a large value instead, and check to make sure GetTextExtentExPointW succeeded:
if(GetTextExtentExPointW(dc, L"薔薇薔薇薔薇薔薇", 8, 1000, &fit, &dx[0], &the_size))
{
...
}
Note that a Unicode code point may require 4 bytes in UTF-16, i.e. 2 wchar_t units per code point, so the unit count passed to the function may differ from the number of characters.

minValue and maxValue as Time Range in hAxis in Google Chart

I need to set the time range of my hAxis to a minValue of 09:00 and a maxValue of 17:00, with an increment of 1 hour (i.e. 9, 10, 11, 12, 13, 14, ..., 17).
Currently my data is formatted as H:m (for example: 09:35, 10:20):
var formatter3 = new google.visualization.DateFormat({pattern: 'H:m'});
formatter3.format(data,0);
And below are my options:
var options = {
  curveType: "function",
  title: '',
  colors: ['red', '#3366CC', '#999999'],
  vAxes: {
    0: {logScale: false, format: '0.0000'},
    1: {logScale: false}
  },
  hAxis: {
    slantedTextAngle: 90,
    textStyle: {fontSize: 8},
    format: 'H:m',
    minValue: new Date(null, null, null, 9, 0, 0),
    maxValue: new Date(null, null, null, 17, 0, 0),
    viewWindow: {
      min: new Date(null, null, null, 9, 0, 0),
      max: new Date(null, null, null, 17, 0, 0)
    }
  },
  series: {
    0: {targetAxisIndex: 0, type: "line"},
    1: {targetAxisIndex: 0, type: "line"},
    2: {targetAxisIndex: 1, type: "bars"}
  }
};
However, it is still not working. Please advise. Thanks!
Unfortunately, the minValue, maxValue, and baseline values are ignored for date and time values. I am not sure whether this is a recent bug, but I just noticed it a week ago. You might experiment with the viewWindow min and max and the gridlines.count option to get the desired result. Or you might convert all your date values to strings, if the values are evenly spaced, in which case the axis will use your explicit values.
Another new feature that could work for you is that you can provide an explicit array of tick values with the ticks: [...] option. In the current release of gviz, the formatting is done using your format option, and that should be enough for your needs. In an upcoming release, you will also be able to specify the formatting of each tick value.
So it might be best to specify the times in your example using timeofday values like so:
hAxis: {
  ticks: [[9, 0, 0], [10, 0, 0], [11, 0, 0], [12, 0, 0], ...]
}
I think you could do the same kind of thing with datetime values instead, if that's what your data values are.
