Slurm: all CPUs in a node are allocated to a job that needs only a subset of them

I have every node configured as follows in slurm.conf:
NodeName=node1 NodeAddr=xxx.xxx.xxx.xxx State=UNKNOWN Procs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128000 TmpDisk=65536
When I run the following command:
srun -n 2 sleep 60
I found that all the cores in the node are allocated to this job. If another job wants to run on this node, it is blocked until the previous job finishes.
scontrol shows the following job information:
JobId=51 JobName=sleep
UserId=hadoop(1002) GroupId=hadoop(1002) MCS_label=N/A
Priority=4294901703 Nice=0 Account=hadoop QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:12 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2018-07-16T21:46:56 EligibleTime=2018-07-16T21:46:56
StartTime=2018-07-16T21:46:56 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2018-07-16T21:46:56
Partition=TOTAL AllocNode:Sid=node1:25124
ReqNodeList=(null) ExcNodeList=(null)
NodeList=xxx.xxx.xxx
BatchHost=xxx.xxx.xxx
NumNodes=1 NumCPUs=32 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=32,mem=125G,node=1,billing=32
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=125G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=sleep
WorkDir=/home/hadoop
Power=
Using sacct to get the job history, I get the following output:
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
51 sleep TOTAL hadoop 32 COMPLETED 0:0
51.0 sleep hadoop 2 COMPLETED 0:0
The partition information:
PartitionName=TOTAL
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO
MaxCPUsPerNode=UNLIMITED
Nodes=xxxxxxx
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=96 TotalNodes=3 SelectTypeParameters=NONE
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
Something seems to be wrong: a 2-task job should not take all 32 CPUs.

The problem is caused by SelectType. I had left it at the default value, which I believe is select/linear. As mentioned in the Select Plugin Design Guide, select/linear is node-centric:
The select/linear and select/cons_res plugins have similar modes of operation. The obvious difference is that data structures in select/linear are node-centric, while those in select/cons_res contain information at a finer resolution (sockets, cores, threads, or CPUs depending upon the SelectTypeParameters configuration parameter).
I changed SelectType to select/cons_res and restarted the whole cluster, and the problem was solved.
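To make the difference concrete, here is a toy model of the two allocation styles. It is only a sketch, not Slurm's actual algorithm: the `Node` class and both functions are hypothetical names invented for illustration, and the consumable-resource version counts plain CPUs, whereas the real granularity depends on SelectTypeParameters.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for a compute node; not a Slurm data structure.
    name: str
    total_cpus: int
    free_cpus: int


def allocate_node_centric(node: Node, ntasks: int) -> int:
    """Toy select/linear behaviour: a job gets the whole node or nothing."""
    if node.free_cpus < node.total_cpus or ntasks > node.total_cpus:
        return 0  # node already in use (or too small), so the job must wait
    node.free_cpus = 0
    return node.total_cpus


def allocate_per_cpu(node: Node, ntasks: int) -> int:
    """Toy consumable-resource behaviour: one CPU per task (an assumption)."""
    if ntasks > node.free_cpus:
        return 0
    node.free_cpus -= ntasks
    return ntasks


linear_node = Node("node1", total_cpus=32, free_cpus=32)
first_linear = allocate_node_centric(linear_node, ntasks=2)   # takes all 32 CPUs
second_linear = allocate_node_centric(linear_node, ntasks=2)  # blocked

shared_node = Node("node1", total_cpus=32, free_cpus=32)
first_shared = allocate_per_cpu(shared_node, ntasks=2)   # takes 2 CPUs
second_shared = allocate_per_cpu(shared_node, ntasks=2)  # also fits
```

In the node-centric case the first job holds all 32 CPUs, which matches the NumCPUs=32 shown by scontrol above, and the second job gets nothing until the first finishes.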


Replication factor: 3 larger than available brokers: 1 in @EmbeddedKafka

I want to test Kafka transactions.
kafkaTemplate.executeInTransaction { tx ->
tx.sendDefault("abacaba") // Should I do .get() ??
tx.sendDefault("abacaba")
}
And I get the following log when the test starts:
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
2023-01-27 16:18:17.831 INFO 81975 --- [quest-handler-4] kafka.server.ZkAdminManager
: [Admin Manager on Broker 0]: Error processing create topic request
CreatableTopic(name='__transaction_state', numPartitions=50, replicationFactor=3,
assignments=[], configs=[CreateableTopicConfig(name='compression.type',
value='uncompressed'), CreateableTopicConfig(name='cleanup.policy', value='compact'),
CreateableTopicConfig(name='min.insync.replicas', value='2'),
CreateableTopicConfig(name='segment.bytes', value='104857600'),
CreateableTopicConfig(name='unclean.leader.election.enable', value='false')])
I tried setting the replication factor, but it doesn't work.
Please help.
You didn't say in your question that you are dealing with @EmbeddedKafka. See its Javadoc for more info:
/**
 * @return the number of brokers
 */
@AliasFor("value")
int count() default 1;
When you have enough brokers in the cluster, you can request a replication factor for a given topic, but of course no more than the number of brokers.
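The rule behind the error can be sketched in a few lines. This is a toy model of the check, not Kafka's code: the exception class and function names are hypothetical. As far as I know, the replicationFactor=3 and min.insync.replicas=2 in the log come from the broker defaults for `transaction.state.log.replication.factor` and `transaction.state.log.min.isr`, so either the broker count has to go up or those two broker properties have to come down to 1.

```python
class InvalidReplicationFactor(ValueError):
    """Hypothetical stand-in for Kafka's InvalidReplicationFactorException."""


def check_replication_factor(replication_factor: int, available_brokers: int) -> None:
    # A topic cannot have more replicas than there are brokers to host them.
    if replication_factor > available_brokers:
        raise InvalidReplicationFactor(
            f"Replication factor: {replication_factor} "
            f"larger than available brokers: {available_brokers}."
        )


def is_valid(replication_factor: int, available_brokers: int) -> bool:
    try:
        check_replication_factor(replication_factor, available_brokers)
        return True
    except InvalidReplicationFactor:
        return False


default_setup = is_valid(replication_factor=3, available_brokers=1)  # the failing case
more_brokers = is_valid(replication_factor=3, available_brokers=3)   # e.g. count = 3
lower_factor = is_valid(replication_factor=1, available_brokers=1)   # factor lowered to 1
```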

Airflow how to connect the previous task to the right next dynamic branch with multiple tasks?

I am facing this situation:
I have generated two dynamic branches. Each branch has multiple chained tasks.
This is what I need Airflow to create for me:
taskA1->taskB1 taskC1->taskD1
taskA2->taskB2... taskZ.. taskC2->taskD2
taskA3->taskB3 taskC3->taskD3
and here is my pseudocode:
def create_branch1(task_ids):
    source = []
    for task_id in task_ids:
        source += [
            Operator1(task_id='task_A{0}'.format(task_id)) >>
            Operator2(task_id='task_B{0}'.format(task_id))]
    return source
def create_branch2(task_ids):
    source = []
    for task_id in task_ids:
        source += [
            Operator1(task_id='task_C{0}'.format(task_id)) >>
            Operator2(task_id='task_D{0}'.format(task_id))]
    return source
create_branch1 >> dummyOperator(Z) >> create_branch2 >> end
However, what Airflow generates looks like this:
taskA1->taskB1 taskD1<-taskC1
taskA2->taskB2...taskZ...taskD2<-taskC2
taskA3->taskB3 taskD3<-taskC3
I mean that in the second branch, dummyOperator(Z) is connected to the last task of the chain (D) instead of the first task of the chain (C).
It seems that, no matter what, DummyOperator (task Z) connects to the last task of each chained branch.
Do you have any idea how to tackle this issue?
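One likely cause is that in Airflow the expression `a >> b` evaluates to `b`, so each list element built in the pseudocode is the last task of its chain (B or D), not the first. The sketch below reproduces that with a small stand-in `Task` class (hypothetical, not Airflow's operator class) and shows one way around it: keep the first and last task of every chain separately, then wire Z to the first tasks of the second branch.

```python
class Task:
    """Hypothetical stand-in that mimics only the `>>` return value of an Airflow task."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # task >> task  or  task >> [tasks]; returns the right-hand side.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self.downstream.append(target.task_id)
        return other

    def __rrshift__(self, other):
        # [tasks] >> task; returns the right-hand side (self).
        for source in other:
            source >> self
        return self


def create_branch(first_prefix, last_prefix, task_ids):
    """Build one chain per id and return (first tasks, last tasks)."""
    heads, tails = [], []
    for task_id in task_ids:
        head = Task("task_{0}{1}".format(first_prefix, task_id))
        tail = Task("task_{0}{1}".format(last_prefix, task_id))
        head >> tail
        heads.append(head)
        tails.append(tail)
    return heads, tails


a_heads, b_tails = create_branch("A", "B", [1, 2, 3])
c_heads, d_tails = create_branch("C", "D", [1, 2, 3])
z = Task("task_Z")
end = Task("end")

b_tails >> z >> c_heads  # the B tasks feed Z, and Z feeds the C tasks
d_tails >> end           # the D tasks feed the final task
```

With this wiring Z's downstream tasks are the C tasks, and each C task still leads to its D task.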

MariaDB + MaxScale Replication Error: The slave I/O thread stops because a fatal error is encountered when it tried to SELECT @master_binlog_checksum

I am trying to set up real-time data streaming to Kafka with MaxScale CDC and MariaDB version 10.0.32. After configuring replication, I am getting the status:
"The slave I/O thread stops because a fatal error is encountered when it tried to SELECT @master_binlog_checksum".
Below are all of my configurations:
MariaDB - Configuration
server-id = 1
#report_host = master1
#auto_increment_increment = 2
#auto_increment_offset = 1
log_bin = /var/log/mysql/mariadb-bin
log_bin_index = /var/log/mysql/mariadb-bin.index
binlog_format = row
binlog_row_image = full
# not fab for performance, but safer
#sync_binlog = 1
expire_logs_days = 10
max_binlog_size = 100M
# slaves
#relay_log = /var/log/mysql/relay-bin
#relay_log_index = /var/log/mysql/relay-bin.index
#relay_log_info_file = /var/log/mysql/relay-bin.info
#log_slave_updates
#read_only
MaxScale Configuration
[server1]
type=server
address=192.168.56.102
port=3306
protocol=MariaDBBackend
[Replication]
type=service
router=binlogrouter
version_string=10.0.27-log
user=myuser
passwd=mypwd
server_id=3
#binlogdir=/var/lib/maxscale
#mariadb10-compatibility=1
router_options=binlogdir=/var/lib/maxscale,mariadb10-compatibility=1
#slave_sql_verify_checksum=1
[Replication Listener]
type=listener
service=Replication
protocol=MySQLClient
port=5308
Starting Replication
CHANGE MASTER TO MASTER_HOST='192.168.56.102', MASTER_PORT=5308, MASTER_USER='myuser', MASTER_PASSWORD='mypwd', MASTER_LOG_POS=328, MASTER_LOG_FILE='mariadb-bin.000018';
START SLAVE;
Replication Status
Master_Host: 192.168.56.102
Master_User: myuser
Master_Port: 5308
Connect_Retry: 60
Master_Log_File: mariadb-bin.000018
Read_Master_Log_Pos: 328
Relay_Log_File: mysqld-relay-bin.000002
Relay_Log_Pos: 4
Relay_Master_Log_File: mariadb-bin.000018
**Slave_IO_Running: No**
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 328
Relay_Log_Space: 248
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1593
Last_IO_Error: **The slave I/O thread stops because a fatal error is encountered when it tried to SELECT @master_binlog_checksum. Error:**
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 0
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
The binlogrouter performs the following query to set the value of @master_binlog_checksum (real replication slaves perform the same query).
SET @master_binlog_checksum = @@global.binlog_checksum
Checking the output of that query will probably explain why the replication won't start. Most likely the SET query failed, which is why the later SELECT @master_binlog_checksum query returns unexpected results.
In cases like these, it is recommended to open a bug report on the MariaDB Jira under the MaxScale project. This way the possibility of a bug is ruled out and if it turns out to be a configuration problem, the documentation can be updated to more clearly explain how to configure MaxScale.

Time command equivalent in PowerShell

What is the flow of execution of the time command in detail?
I have a user-defined function in PowerShell that computes the execution time of a command in the following way:
It will open the new PowerShell window.
It will execute the command.
It will close the PowerShell window.
It will get the different execution times using the GetProcessTimes function.
Is the "time command" in Unix also calculated in the same way?
The Measure-Command cmdlet is your friend.
PS> Measure-Command -Expression {dir}
You could also get execution time from the command history (last executed command in this example):
$h = Get-History -Count 1
$h.EndExecutionTime - $h.StartExecutionTime
I've been doing this:
Time {npm --version ; node --version}
With this function, which you can put in your $profile file:
function Time([scriptblock]$scriptblock, $name)
{
    <#
    .SYNOPSIS
    Run the given scriptblock, and say how long it took at the end.
    .DESCRIPTION
    .PARAMETER scriptBlock
    The script block to run and time.
    .PARAMETER name
    Use this for long scriptBlocks to avoid quoting the entire script block in the final output line
    .EXAMPLE
    time { ls -recurse}
    .EXAMPLE
    time { ls -recurse} "All the things"
    #>
    if (!$stopWatch)
    {
        $script:stopWatch = new-object System.Diagnostics.StopWatch
    }
    $stopWatch.Reset()
    $stopWatch.Start()
    . $scriptblock
    $stopWatch.Stop()
    if ($name -eq $null) {
        $name = "$scriptblock"
    }
    "Execution time: $($stopWatch.ElapsedMilliseconds) ms for $name"
}
Measure-Command works, but it swallows the stdout of the command being run. (Also see Timing a command's execution in PowerShell)
If you need to measure the time taken by something, you can follow this blog entry.
Basically, it suggests using the .NET StopWatch class:
$sw = [System.Diagnostics.StopWatch]::startNew()
# The code you measure
$sw.Stop()
Write-Host $sw.Elapsed
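As for how the Unix time command arrives at its numbers: it typically starts the command as a child process, waits for it to exit, and reports the wall-clock time together with the user and system CPU time that the operating system accounted to that child. No new terminal window is involved. The following is a rough Python sketch of that approach for Unix-like systems; `time_command` is a hypothetical helper name, and this is an illustration of the idea rather than the actual implementation of time.

```python
import resource
import subprocess
import sys
import time


def time_command(argv):
    """Run argv as a child process and return real/user/sys times in seconds."""
    # CPU time already accumulated by earlier, finished child processes.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()

    completed = subprocess.run(argv)  # blocks until the child exits

    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "real": real,                                 # wall-clock time
        "user": after.ru_utime - before.ru_utime,     # CPU time in user mode
        "sys": after.ru_stime - before.ru_stime,      # CPU time in kernel mode
        "returncode": completed.returncode,
    }


timings = time_command([sys.executable, "-c", "pass"])
print("real {real:.3f}s  user {user:.3f}s  sys {sys:.3f}s".format(**timings))
```

The user and sys values come from the kernel's per-process accounting, which plays a role similar to GetProcessTimes on Windows.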

Solaris 5.9 issue

Does anyone know how I can fix the problem below? I'm not deeply familiar with UNIX/Solaris. I've googled it and found some information.
An unexpected exception has been detected in native code outside the VM.
Unexpected Signal : 11 occurred at PC=0xFF2B44E4
Function=strlen+0x80
Library=/usr/lib/libc.so.1
Current Java thread:
at com.tertio.tome.Tome.init0(Native Method)
at com.tertio.tome.TomeConfig.<init>(TomeConfig.java:42)
at com.tertio.tome.Tome.initConfig(Tome.java:124)
at com.tertio.tome.Tome.initConfig(Tome.java:118)
at com.tertio.provident.rmi.server.ServerConfig.<init>(ServerConfig.java:28)
at com.tertio.provident.cli.Cli.connect(Cli.java:38)
at com.tertio.provident.cli.Admin.main(Admin.java:23)
Dynamic libraries:
0x10000 /oracle/product/home0/jre/1.4.2/bin/java
0xff370000 /usr/lib/libthread.so.1
0xff3fa000 /usr/lib/libdl.so.1
0xff280000 /usr/lib/libc.so.1
0xff3a0000 /usr/platform/SUNW,Sun-Fire/lib/libc_psr.so.1
0xfec00000 /oracle/product/home0/jre/1.4.2/lib/sparc/client/libjvm.so
0xff230000 /usr/lib/libCrun.so.1
0xff210000 /usr/lib/libsocket.so.1
0xff100000 /usr/lib/libnsl.so.1
0xff1c0000 /usr/lib/libm.so.1
0xff0e0000 /usr/lib/libsched.so.1
0xff0b0000 /usr/lib/libmp.so.2
0xff070000 /oracle/product/home0/jre/1.4.2/lib/sparc/native_threads/libhpi.so
0xfebd0000 /oracle/product/home0/jre/1.4.2/lib/sparc/libverify.so
0xfeb90000 /oracle/product/home0/jre/1.4.2/lib/sparc/libjava.so
0xff040000 /oracle/product/home0/jre/1.4.2/lib/sparc/libzip.so
0xfe3b0000 /ACS/DEV/users/dvprv02/prov52/lib/libjtome.so
0xfcbb0000 /ACS/DEV/users/dvprv02/prov52/lib/libtome.so
0xfe390000 /ACS/DEV/users/dvprv02/prov52/lib/libtome_ev.so
0xfcad0000 /oracle/product/home0/jre/1.4.2/lib/sparc/libnet.so
Heap at VM Abort:
Heap
def new generation total 2112K, used 476K [0xf1c00000, 0xf1e20000, 0xf2310000)
eden space 2048K, 23% used [0xf1c00000, 0xf1c77230, 0xf1e00000)
from space 64K, 0% used [0xf1e00000, 0xf1e00000, 0xf1e10000)
to space 64K, 0% used [0xf1e10000, 0xf1e10000, 0xf1e20000)
tenured generation total 1408K, used 0K [0xf2310000, 0xf2470000, 0xf5c00000)
the space 1408K, 0% used [0xf2310000, 0xf2310000, 0xf2310200, 0xf2470000)
compacting perm gen total 4096K, used 1223K [0xf5c00000, 0xf6000000, 0xf9c00000)
the space 4096K, 29% used [0xf5c00000, 0xf5d31c80, 0xf5d31e00, 0xf6000000)
Local Time = Wed Jan 23 18:03:25 2013
Elapsed Time = 0
#
# The exception above was detected in native code outside the VM
#
# Java VM: Java HotSpot(TM) Client VM (1.4.2_03-b02 mixed mode)
#
# An error report file has been saved as hs_err_pid11057.log.
# Please refer to the file for further information.
#
Abort (core dumped)
