How is it possible that fetching a key from some nodes returns 404 while another node has the key (returns 200 with data)?
AAE is enabled, the cluster is alive, and there are no errors or handoffs. Where should I dig?
The cluster consists of 6 nodes. All but one of them were recently migrated to 2.1.4; one node is still at 1.4.12 (that node has the key).
Where should I look, and how do I repair the inconsistency?
Update, values:
r,w=quorum, notfound_ok=false, but I've also tried requesting it with notfound_ok=true and r=3, with the same result.
I've found that on the node that has the keys, some vnodes have had no AAE exchange at all:
riak-admin aae-status
================================== Exchanges ==================================
Index Last (ago) All (ago)
-------------------------------------------------------------------------------
0 -- --
34253944624943037145398863266787883273185918976 3.6 d --
91343852333181432387730302044767688728495783936 4.2 d --
171269723124715185726994316333939416365929594880 3.9 d --
216941649291305901920859467356323260730177486848 -- --
262613575457896618114724618378707105094425378816 -- --
342539446249430371453988632667878832731859189760 4.4 d --
388211372416021087647853783690262677096107081728 3.5 d --
433883298582611803841718934712646521460354973696 3.7 d --
513809169374145557180982949001818249097788784640 -- --
570899077082383952423314387779798054553098649600 -- --
627988984790622347665645826557777860008408514560 -- --
730750818665451459101842416358141509827966271488 -- --
810676689456985212441106430647313237465400082432 -- --
867766597165223607683437869425293042920709947392 -- --
913438523331814323877303020447676887284957839360 -- --
970528431040052719119634459225656692740267704320 3.7 d --
1027618338748291114361965898003636498195577569280 3.8 d --
1141798154164767904846628775559596109106197299200 -- --
1198888061873006300088960214337575914561507164160 -- --
1233142006497949337234359077604363797834693083136 -- --
1267395951122892374379757940871151681107879002112 3.6 d --
1301649895747835411525156804137939564381064921088 3.6 d --
1370157784997721485815954530671515330927436759040 8.6 hr --
1404411729622664522961353393938303214200622678016 -- --
Is it possible to force-run AAE on a given node?
All inter-node communication is fine:
Report: net_kernel summary ('riak#192.168.135.45')
Node State Type In Out Address
riak#192.168.172.232 up normal 13530445 13587408 192.168.172.232:6000
riak#192.168.202.11 up normal 15055379 15009545 192.168.202.11:6000
riak#192.168.135.180 up normal 15850450 15598452 192.168.135.180:6000
riak#192.168.205.253 up normal 14317197 14327591 192.168.205.253:6000
riak#192.168.157.36 up normal 6291569 5811633 192.168.157.36:6000
riak_maint_15246#192 up hidden 11 16 192.168.135.45:53159
Total 65045051 64334645
This is strange, but different versions of Riak URL-encode keys differently:
If you PUT a key named test%40key on a Riak 1.x node, that key can be read fine from the 1.x nodes in the cluster but returns 404 on 2.x nodes. On 2.x nodes it can be found under the name test%2540key.
If you PUT a key named test%40key on a Riak 2.x node, that key can be read fine from 2.x nodes but returns 404 on a 1.x node. On 1.x nodes it can be found under the name test@key.
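The relationship between the three key names can be checked with plain percent-encoding. This is only a sketch of the string arithmetic using Python's standard library; it does not talk to Riak, and which decoding each Riak version applies is taken from the observation above rather than verified here.

```python
from urllib.parse import quote, unquote

key_as_sent = "test%40key"

# Decoding once turns %40 into "@": the name the 1.x node
# reports for a key written through a 2.x node.
decoded_once = unquote(key_as_sent)

# Encoding the literal string again turns "%" into "%25": the name the
# 2.x node needs for a key written through a 1.x node.
encoded_again = quote(key_as_sent, safe="")

print(decoded_once)
print(encoded_again)
```

If a key seems to be missing on one version, trying both the decoded and the re-encoded form of its name is a quick way to tell whether this is what happened.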
I am using ssd_mobilenet_v1_coco.config.
After training with 13 classes, I added more and changed num_classes to 20.
python model_main.py --alsologtostderr --model_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_coco.config
I keep trying to train with this command, but I get an error.
What should I do to increase num_classes?
Should I set num_classes=100 from the beginning and start over?
I need help.
model {
ssd {
num_classes: 20
box_coder {
faster_rcnn_box_coder {
y_scale: 10.0
x_scale: 10.0
height_scale: 5.0
width_scale: 5.0
}
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py", line 1326, in restore
err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Assign requires shapes of both tensors to match. lhs shape= [126] rhs shape= [84]
[[node save/Assign_56 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
I recently had a similar issue. To solve my problem, I had to do the following:
In the train_config section of pipeline.config, have fine_tune_checkpoint point to the previous model checkpoint, e.g. `fine_tune_checkpoint: "./model/model.ckpt"`
In the model_main.py command call, make model_dir refer to a different folder from the previous checkpoint:
python research/object_detection/model_main.py \
--model_dir=./model/finetune0 \
--pipeline_config_path=./model/pipeline.config \
--alsologtostderr
My file structure:
+ models
-+ model
--+ checkpoint
--+ model.ckpt.index
--+ model.ckpt.meta
--+ model.ckpt.data-00000-of-00001
--+ pipeline.config
--- finetune0 (will be autogenerated)
-- data (tfrecord dataset)
-- annotations (labels)
...
Context
It looks like when a checkpoint already exists in model_dir, the script tries to resume training from it, but the new configuration in pipeline.config no longer matches the saved model (num_classes differs), so restoring fails with the shape mismatch.
If you instead provide this checkpoint as fine_tune_checkpoint and point model_dir to a new folder, it will build the model from the checkpoint variables, adjust it to match the new config, and then start training.
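The two shapes in the error message are consistent with this explanation. As a rough check, assuming the failing tensor is a class-predictor bias with one entry per anchor per class plus a background class, and assuming 6 anchors per location (neither assumption is stated in the question), the numbers work out as follows:

```python
# Assumption: 6 anchors per feature-map location; not taken from the post.
ANCHORS_PER_LOCATION = 6


def class_predictor_size(num_classes, anchors=ANCHORS_PER_LOCATION):
    """Entries in a class-predictor bias: anchors * (classes + background)."""
    return anchors * (num_classes + 1)


print(class_predictor_size(20))  # size the new config would build
print(class_predictor_size(13))  # size stored by a 13-class checkpoint
```

Under those assumptions, 20 classes give the `lhs shape= [126]` of the new graph and 13 classes give the `rhs shape= [84]` stored in the old checkpoint.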
Production environment-> 2 app server domains, 2 web server domains, 3 process schedulers [AIX, 1 NT], NON RAC Oracle DB
PT 8.55.15, HCM 9.2 DB 12.2.0.1
We are running a custom AE program in our production environment. This AE calls delivered EOEN application packages. No customizations have been made to this package. A custom component interface is also involved. This AE has been running in our environment for 5 years. No changes have been made to the code recently.
This AE has been behaving very strangely for the last 2-3 weeks. It errors out on the first run but succeeds when re-run a second or third time. We do not change any parameters or bounce any services between runs. The application package [EOEN_MVC.EOEN_MODEL] calls the PeopleCode function GetNextNumberWithGapsCommit.
We enabled tracing on this AE and found that the program errors out every time the "GetNextNumberWithGapsCommit" function is called. These are the lines from the trace:
3905919 23:51:18.263 0.005472 Cur#8.15073404.HCMPRO RC=0
Dur=0.000094 COM Stmt=SELECT DESCR,DESCRLONG FROM PS_EOEN_REGE_LNG
WHERE EOEN_EVENT_NAME=:1 AND LANGUAGE_CD = :2
3905920 23:51:18.264 0.000974 Cur#8.15073404.HCMPRO RC=0
Dur=0.000001 Bind-1 type=2 length=16 value=CreateTriggerESP
3905921 23:51:18.265 0.000966 Cur#8.15073404.HCMPRO RC=0
Dur=0.000000 Bind-2 type=2 length=3 value=ESP
3905922 23:51:18.270 0.004369 258: If
All(&RS_RegEvnt(1).EOEN_REG_EVNT.EOEN_EVENT_NAME.Value) Then
3905923 23:51:18.271 0.001026 Fetch Field:
EOEN_REG_EVNT.EOEN_EVENT_NAME Value=CreateTriggerESP
3905924 23:51:18.272 0.001004 259:
&NextEventID = GetNextNumberWithGapsCommit(EOEN_CONFIG.EOEN_LAST_ID,
2147483647, 1);
3905925 23:51:18.273 0.001065 Cur#7.15073404.HCMPRO RC=0
Dur=0.000031 COM Stmt=UPDATE PS_EOEN_CONFIG SET EOEN_LAST_ID =
EOEN_LAST_ID + 1
3905926 23:51:18.274 0.001453 Cur#2.15073404.HCMPRO RC=0
Dur=0.000042 COM Stmt=SELECT
TO_CHAR(SYSTIMESTAMP,'YYYY-MM-DD-HH24.MI.SS.FF') FROM PSCLOCK
3905927 23:51:18.278 0.003895 Caught Exception: Error
grave de SQL. (2,125) EOEN_MVC.EOEN_MODEL.EOENInterface.OnExecute
Name:RaiseEvent PCPC:14959 Statement:259
Called from:FUNCLIB_HR_ESP.TRGR_FUNCTIONS_ESP.FieldFormula
Name:Create_Triggers_ESP Statement:60
Called from:JOB.REPORTS_TO.SavePostChange Statement:3
The error message in the Process Monitor is the same as above ("Error grave de SQL" is Spanish for "serious SQL error"):
Error -> Error grave de SQL. (2,125)
EOEN_MVC.EOEN_MODEL.EOENInterface.OnExecute Name:RaiseEvent PCPC:14959
Statement:259 Called
from:FUNCLIB_HR_ESP.TRGR_FUNCTIONS_ESP.FieldFormula
Name:Create_Triggers_ESP Statement:60 Called
from:JOB.REPORTS_TO.SavePostChange Statement:3
I have tried keeping the DBFlags parameter as "8" and "0" in both application server and process schedulers.
I also checked the EOEN_MSG_CHNL queue, its associated service operation EOEN_MSG, handlers, routings etc. Everything is active and running. No failed IB messages. Domain status active.
Nothing works. And the error remains.
Any suggestions may be of great help.
I am extracting the ip address of an interface and using that address' 3rd octet as part of the BGP AS number. I need to insert a 0 before the number if the 3rd octet is < 10.
For example, if 3rd octet = 8 then BGP AS = 11108
Here is my current and unfinished applet.
event manager applet replace
event none
action 1.0 cli command "conf t"
action 1.1 cli command "do show ip int brief vlan 1"
action 1.2 regexp " [0-9.]+ " $_cli_result ip match
action 2.0 regexp {([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)} $_cli_result match ip
action 2.1 regexp {([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)} $ip match first second third forth
action 2.2 set vl1 $first.$second.$third.$forth
action 2.3 cli command "router bgp 111$third"
The simplest method here is to use format with the right formatting sequence. (If you've ever used sprintf() in C, you'll understand what the format command does straight off. Except the Tcl command doesn't have any problems with buffer overruns or other tricky bits like that.)
# Rest of your script unchanged; I'm lazy so I'll not repeat it here
set bgp [format "111%02d" $third]
action 2.3 cli command "router bgp $bgp"
The key here is that %02d does formatting (%) of a decimal number (d) in a zero-padded (0) field of width two (2). And there's a literal 111 in front of it (no % there, so it is literal).
You can roll the above into a single line if you want, but I think it is much clearer to write it in two (there's really no good excuse for writing unclear code, as it just makes your life harder later and it doesn't really take much less time to write clearly in the first place):
action 2.3 cli command "router bgp [format 111%02d $third]"
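The effect of the %02d sequence can be checked off the router. Here is a small sketch in Python, whose % operator uses the same printf-style sequences; the helper name `bgp_as_from_ip` and the sample addresses are made up for illustration, and the 111 prefix is the one from the question.

```python
def bgp_as_from_ip(ip, prefix="111"):
    """Build the AS string from the third octet, zero-padded to width 2."""
    third = int(ip.split(".")[2])
    return "%s%02d" % (prefix, third)


print(bgp_as_from_ip("10.20.8.1"))   # third octet below 10 gets a leading 0
print(bgp_as_from_ip("10.20.42.1"))  # two-digit octet is left unchanged
```

A width of two only pads. An octet of 100 or more is printed in full and makes the AS string one digit longer.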
I'm confused about the RISC-V ABI register names. For example, Table 18.2 on page 85 of the "RISC-V Instruction Set Manual, Volume I: User-Level ISA, Version 2.0" specifies that the stack pointer sp is register x14. However, the instruction
addi sp,zero,0
is compiled to 0x00000113 by riscv64-unknown-elf-as (-m32 does not make a difference). In binary:
000000000000 00000 000 00010 0010011
^imm ^rs1 ^f3 ^rd ^opcode
So here sp seems to be x2. Then I googled a bit and found the RISC-V Linux User's Manual. This document states that sp is x30.
So which is it? Are there different ABIs? Can I set the ABI with a command line option to riscv64-unknown-elf-*? Is there a comprehensive table somewhere?
The stack pointer is now x2.
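The field breakdown from the question can be reproduced by decoding the assembled word by hand. This is a minimal sketch of I-type field extraction, written for illustration only (the function name is made up, and it is not part of any RISC-V tool):

```python
def decode_i_type(insn):
    """Split a 32-bit RISC-V I-type instruction word into its fields."""
    imm = (insn >> 20) & 0xFFF
    if imm & 0x800:          # sign-extend the 12-bit immediate
        imm -= 0x1000
    return {
        "opcode": insn & 0x7F,
        "rd": (insn >> 7) & 0x1F,
        "funct3": (insn >> 12) & 0x7,
        "rs1": (insn >> 15) & 0x1F,
        "imm": imm,
    }


fields = decode_i_type(0x00000113)  # addi sp,zero,0 as emitted by the assembler
print(fields)
```

The decoded rd field is 2, which matches the assembler treating sp as x2.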
Here is the current ABI documentation, which has been moved out of the User-Level ISA specification, which now contains that same link.
The ABI was modified to better accommodate the new RISC-V compressed spec, which puts the 8 most-used registers next to each other in x8-x15.
Note: do not trust ANY non-riscv.org webpage. Quan Nguyen makes it clear in his introduction that the "RISC-V Linux User's Manual" documents the porting process and that accuracy is NOT guaranteed.
I am running a job on a Sun Grid Engine (now known as Oracle Grid Engine) cluster. To see whether my job is slowing down because the node is overloaded, I tried to check the status of the node:
$ qstat -l hostname=hnode03 -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q#hnode03.rnd.mycorp.com BP 0/0/0 103.41 lx24-amd64
---------------------------------------------------------------------------------
highmem.q#hnode03.rnd.mycorp BP 0/37/40 103.41 lx24-amd64
977530 0.76963 runJob1 userme r 09/13/2013 17:53:26 2
---------------------------------------------------------------------------------
threaded.q#hnode03.rnd.mycor BP 0/24/32 103.41 lx24-amd64
---------------------------------------------------------------------------------
workflow.q#hnode03.rnd.mycor B 0/0/0 103.41 lx24-amd64
and
$ qhost -h hnode03
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
hnode03 lx24-amd64 64 103.4 504.8G 122.9G 16.0G 58.0M
Now, the load_avg is 103.41, while NCPU is only 64. Is this ever supposed to happen? Are some jobs using more CPU than the slots they are assigned?
Update: In response to queries, the configurations are uploaded to http://pastebin.com/hLnJBetS.
Yes, it can.
Slots are not synonymous with cores (NCPU). Slots should be seen as "how many jobs can be scheduled in parallel on a node."
If you only want one job to run at a time, set the slot count for your machines to one.
As for the load factor, even if your job only uses one slot, if it has too many threads or subprocesses, then all the cores will be used and the load factor will definitely go above 1.
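To judge whether a node is oversubscribed, it helps to look at the load average per core rather than the raw number. Here is a small sketch using the figures from the qhost output above; the helper name is made up, and reading the local load with `os.getloadavg()` assumes a Unix-like host.

```python
import os


def load_per_core(load_avg, ncpu):
    """Load average divided by core count; values above 1.0 mean oversubscription."""
    return load_avg / ncpu


# Figures reported by qhost for hnode03 in the question.
hnode03_ratio = load_per_core(103.41, 64)
print(round(hnode03_ratio, 2))

# The same ratio for the machine this script runs on (Unix-like hosts only).
if hasattr(os, "getloadavg") and os.cpu_count():
    print(round(load_per_core(os.getloadavg()[0], os.cpu_count()), 2))
```

A ratio well above 1.0, as on hnode03, means more runnable threads than cores, however many slots are in use.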