I copied directories with ROBOCOPY, from C: to D: (so disks on the same VM, no network issues). I used options
*.* /V /X /TS /FP /S /E /COPYALL /PURGE /MIR /ZB /NP /R:3 /W:3
Shortly afterwards, I did a comparison with the same options plus /L:
/V /X /TS /FP /L /S /E /COPYALL /PURGE /MIR /ZB /NP /R:3 /W:3
The summary starts by saying that 12 directories FAILED:
         Total   Copied     Skipped  Mismatch   FAILED   Extras
Dirs :  (many)       30           0         0       12        0
Files:  (more)      958  (more-958)         0        0        0
By Google(R)-brand Web searches, I see that "FAILED" should have lines above with the word "ERROR". But I can find no such lines. If I do a comparison without listing files or directories,
*.* /X /NDL /NFL /L /S /E /COPYALL /PURGE /MIR /ZB /NP /R:3 /W:3
there are no output rows at all other than the header and summary.
Am I missing some error messages in the megalines of verbose output? Does anyone have any idea how to find the problem, if any? I'm thinking of a recursive dir + a script to do my own diff, to at least check names and sizes.
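The "recursive dir + my own diff" idea could look roughly like the sketch below. It only compares relative names and file sizes, not contents, timestamps or ACLs, and the function names (snapshot, diff_trees) are made up for illustration.

```python
import os

def snapshot(root):
    """Map each relative path under root to its size in bytes (None for directories)."""
    entries = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            entries[rel] = None
        for name in filenames:
            full = os.path.join(dirpath, name)
            entries[os.path.relpath(full, root)] = os.path.getsize(full)
    return entries

def diff_trees(source, destination):
    """Return (missing, extra, size_mismatch) as sorted lists of relative paths."""
    src, dst = snapshot(source), snapshot(destination)
    missing = sorted(set(src) - set(dst))   # in source, not in destination
    extra = sorted(set(dst) - set(src))     # in destination, not in source
    size_mismatch = sorted(p for p in set(src) & set(dst) if src[p] != dst[p])
    return missing, extra, size_mismatch
```

Calling diff_trees with the two roots (for example the C: source directory and its D: copy) and getting three empty lists back would at least confirm that names and sizes agree.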
(updated a couple of hours later:)
I've got this as well. Posting in case it helps anyone get closer to an answer.
My summary reported 126 failed Dirs, but that doesn't match the number of "ERROR 3" messages about directories not found / not created (108, a figure I only got, after a lot of effort, by importing the log file into Excel).
So what happened to the other 18 failed dirs?
Turns out there are 18 error messages about retries exceeded for the directories mentioned in the ERROR 3 messages.
I therefore conclude that the "Failed" count in the RC summary includes each ERROR 3 "directory not found" log item - even if it is multiply reporting the same directory on multiple failures - PLUS the error reported when it finally exceeds its allowed retry count. So in my case, I have 18 failed directories, each of which is reported on the first attempt and then each of the 5 retries I allowed plus again when the retries exceeded message is given. That is: (18 problem directories) * (1 try + 5 retries + 1 exceeded message) = 18 * 7 = 126 fails. Now it is up to you whether or not you sulk about the "fails" not being unique, but that seems to be how they get counted.
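If you would rather not go through Excel, a small script can do the same tally. This is only a sketch: the two marker strings are my assumption about how the messages are worded in the log, so adjust them to whatever your log actually contains.

```python
# Assumed wording of the two kinds of log lines; check against your own log.
ERROR3_MARKER = "ERROR 3 "
RETRY_EXCEEDED_MARKER = "RETRY LIMIT EXCEEDED"

def tally(log_lines):
    """Return (error3_count, retry_exceeded_count, total) for the given log lines."""
    error3 = sum(1 for line in log_lines if ERROR3_MARKER in line)
    exceeded = sum(1 for line in log_lines if RETRY_EXCEEDED_MARKER in line)
    return error3, exceeded, error3 + exceeded

def expected_failed(problem_dirs, retries):
    """Failed count under the theory above: 1 try + retries + 1 exceeded message per directory."""
    return problem_dirs * (1 + retries + 1)
```

With my numbers, expected_failed(18, 5) gives 126, and the tally splits into 108 ERROR 3 lines plus 18 retry-exceeded lines.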
Hope that helps.
I'm running a DAG that runs once per day. It starts with 9 concurrently running tasks that all do the same thing: each one polls S3 to see whether that task's designated file exists. Each task uses the same code in Airflow and is put into the structure in the same way. One of these tasks, on random days, fails to "begin": it never enters the running state and just sits as queued. When this happens, here's what its log says:
*** Log file isn't local.
*** Fetching here: http://:8793/log/my.dag.name./my_airflow_task/2020-03-14T07:00:00
*** Failed to fetch log file from worker.
*** Reading remote logs...
Could not read logs from s3://mybucket/airflow/logs/my.dag.name./my_airflow_task/2020-03-14T07:00:00
Why does this only happen on random days? All similar questions I've seen point to this error happening consistently, and once overcome, no longer continues. To "trick" this task into "running" I manually touch whatever the name of the log file is supposed to be, and then it changes to running.
So the issue appears to have been the ownership of the folder that this particular task's logs are written to. I used a CI tool to ship the new task_3 when I updated my Airflow Python code in the production environment, so the task was created that way. When I peeked at the log directory ownership, I noticed this for the tasks:
# inside/airflow/log/dir:
drwxrwxr-x 2 root root 4096 Mar 25 14:53 task_3 # is the offending task
drwxrwxr-x 2 airflow airflow 20480 Mar 25 00:00 task_2
drwxrwxr-x 2 airflow airflow 20480 Mar 25 15:54 task_1
So I think what was going on was that, at random, Airflow couldn't get permission to write the log file and therefore wouldn't start the rest of the task. I applied the appropriate chown command, something like sudo chown -R airflow:airflow task_3. Ever since I changed this, the issue has disappeared.
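To catch this earlier next time, the ownership check can be scripted. A minimal sketch, assuming a Unix host and that every task log directory sits directly under one log root; the function names are invented for the example.

```python
import os
import pwd

def owner_name(path):
    """Return the user name owning path, or the numeric uid if it has no passwd entry."""
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)

def wrongly_owned(log_root, expected_owner):
    """List subdirectories of log_root that are not owned by expected_owner."""
    return sorted(
        entry.name
        for entry in os.scandir(log_root)
        if entry.is_dir() and owner_name(entry.path) != expected_owner
    )
```

In the listing above, wrongly_owned(log_root, "airflow") would have returned just task_3.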
I always get the same error report from snakemake in my RNA-seq pipeline:
MissingOutputException in line 44 of /root/s/r/snakemake/my_rnaseq_data/Snakefile:
Missing files after 5 seconds:
03_align/wt2.bam
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Here is my Snakefile:
SBT=["wt1","wt2","epcr1","epcr2"]

rule all:
    input:
        expand("02_clean/{nico}_1.paired.fq", nico=SBT),
        expand("02_clean/{nico}_2.paired.fq", nico=SBT),
        expand("03_align/{nico}.bam", nico=SBT)

rule trim:
    input:
        "01_raw/{nico}_1.fastq",
        "01_raw/{nico}_2.fastq"
    output:
        "02_clean/{nico}_1.paired.fq.gz",
        "02_clean/{nico}_1.unpaired.fq.gz",
        "02_clean/{nico}_2.paired.fq.gz",
        "02_clean/{nico}_2.unpaired.fq.gz",
    shell:
        "java -jar /software/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 16 {input[0]} {input[1]} {output[0]} {output[1]} {output[2]} {output[3]} ILLUMINACLIP:/software/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 &"

rule gzip:
    input:
        "02_clean/{nico}_1.paired.fq.gz",
        "02_clean/{nico}_2.paired.fq.gz"
    output:
        "02_clean/{nico}_1.paired.fq",
        "02_clean/{nico}_2.paired.fq"
    run:
        shell("gzip -d {input[0]} > {output[0]}")
        shell("gzip -d {input[1]} > {output[1]}")

rule map:
    input:
        "02_clean/{nico}_1.paired.fq",
        "02_clean/{nico}_2.paired.fq"
    output:
        "03_align/{nico}.sam"
    log:
        "logs/map/{nico}.log"
    threads: 40
    shell:
        "hisat2 -p 20 --dta -x /root/s/r/p/A_th/WT-Al_VS_WT-CK/index/tair10 -1 {input[0]} -2 {input[1]} -S {output} >{log} 2>&1 &"

rule sort2bam:
    input:
        "03_align/{nico}.sam"
    output:
        "03_align/{nico}.bam"
    threads: 30
    shell:
        "samtools sort -@ 20 -m 20G -o {output} {input} &"
Everything is fine until I add the "rule sort2bam" part.
A dry run goes fine, but when I execute it, it reports the error described above. Surprisingly, the task it reports as failed keeps running in the background. And each run only ever gets through one task, like these:
rule sort2bam:
input: 03_align/epcr1.sam
output: 03_align/epcr1.bam
jobid: 11
wildcards: nico=epcr1
Waiting at most 5 seconds for missing files.
MissingOutputException in line 45 of /root/s/r/snakemake/my_rnaseq_data/Snakefile:
Missing files after 5 seconds:
03_align/epcr1.bam
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
[Sat Apr 27 06:10:22 2019]
rule sort2bam:
input: 03_align/wt1.sam
output: 03_align/wt1.bam
jobid: 9
wildcards: nico=wt1
Waiting at most 5 seconds for missing files.
MissingOutputException in line 45 of /root/s/r/snakemake/my_rnaseq_data/Snakefile:
Missing files after 5 seconds:
03_align/wt1.bam
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
[Sat Apr 27 06:23:13 2019]
rule sort2bam:
input: 03_align/wt2.sam
output: 03_align/wt2.bam
jobid: 6
wildcards: nico=wt2
Waiting at most 5 seconds for missing files.
MissingOutputException in line 44 of /root/s/r/snakemake/my_rnaseq_data/Snakefile:
Missing files after 5 seconds:
03_align/wt2.bam
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I don't know what's wrong with my code. Any ideas? Thanks in advance!
As you figured out, & is the problem. Control operator & makes your command run in the background in a subshell, and this leads snakemake to think that job is complete when in fact it is not. In your case, its usage doesn't appear to be required.
From man bash on usage of & (stolen from this answer):
If a command is terminated by the control operator &, the shell
executes the command in the background in a subshell. The shell does
not wait for the command to finish, and the return status is 0.
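The "return status is 0" part is easy to see for yourself. Below is a small sketch (Python rather than a Snakefile, and exit_status is just a name I picked) that runs a command line through the shell the way a shell: directive would and reports the status the caller gets back.

```python
import subprocess

def exit_status(command):
    """Run command through the shell and return the exit status the caller sees."""
    completed = subprocess.run(
        command,
        shell=True,
        stdout=subprocess.DEVNULL,  # keep the background job from holding our pipes
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode
```

exit_status("false") is 1, but exit_status("false &") is 0: the shell returns as soon as the job has been put in the background, so a caller cannot tell from the status whether the real work finished or even succeeded. That is why the output file is not there yet when snakemake looks for it.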
I know how to solve it, but I don't know why it works!
Just delete the '&' in
samtools sort -@ 20 -m 20G -o {output} {input} &
I am using percona-toolkit for analysing mysql-slow-query (logs). So the command is pretty basic:
pt-query-digest slowquery.log
Now the result(error) is:
18.2s user time, 100ms system time, 35.61M rss, 105.19M vsz
Current date: Thu Jul 7 17:18:43 2016
Hostname: Jammer
Files: slowquery.log
Pipeline process 5 (iteration) caused an error: Redundant argument in sprintf at /usr/bin/pt-query-digest line 2556.
Will retry pipeline process 4 (iteration) 2 more times.
..
..(same result prints twice)
..
The pipeline caused an error: Pipeline process 5 (iteration) caused an error: Redundant argument in sprintf at /usr/bin/pt-query-digest line 2556.
Terminating pipeline because process 4 (iteration) caused too many errors.
As for the environment, I am using Ubuntu 16.04, MariaDB 10.1.14 and Percona-Toolkit 2.2.16.
I found a bug report about this, but what it offers is more of a workaround and does not actually fix the error. Even after applying the patch, the command's result still isn't satisfactory.
I am facing the same problem on Ubuntu 16.04 with MySQL.
The contents of my slow query log are as follows.
/usr/sbin/mysqld, Version: 5.7.16-0ubuntu0.16.04.1-log ((Ubuntu)). started with:
Tcp port: 3306 Unix socket: /var/run/mysqld/mysqld.sock
Time Id Command Argument
/usr/sbin/mysqld, Version: 5.7.16-0ubuntu0.16.04.1-log ((Ubuntu)). started with:
Tcp port: 3306 Unix socket: /var/run/mysqld/mysqld.sock
Time Id Command Argument
# Time: 2016-12-08T05:13:55.140764Z
# User@Host: root[root] @ localhost [] Id: 20
# Query_time: 0.003770 Lock_time: 0.000200 Rows_sent: 1 Rows_examined: 2
SET timestamp=1481174035;
SELECT COUNT(*) FROM INFORMATION_SCHEMA.TRIGGERS;
The error is the same:
The pipeline caused an error: Pipeline process 5 (iteration) caused an
error: Redundant argument in sprintf at /usr/bin/pt-query-digest line 2556.
Ubuntu 16.04
MySql Ver 14.14 Distrib 5.7.16
pt-query-digest 2.2.16
The bug appears to be fixed in the current version of the toolkit (2.2.20), and apparently in previous ones, starting from 2.2.17.
This patch seems to do the trick for this particular place in pt-query-digest:
--- percona-toolkit-2.2.16/bin/pt-query-digest 2015-11-06 14:56:23.000000000 -0500
+++ percona-toolkit-2.2.20/bin/pt-query-digest 2016-12-06 17:01:51.000000000 -0500
@@ -2555,8 +2583,8 @@
}
return sprintf(
$num =~ m/\./ || $n
- ? "%.${p}f%s"
- : '%d',
+ ? '%1$.'.$p.'f%2$s'
+ : '%1$d',
$num, $units[$n]);
}
But as mentioned in the original question and the bug report, quite a few tools/functions were affected, so the full bugfix consisted of a lot of small changes:
https://github.com/percona/percona-toolkit/pull/73/files
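For anyone wondering what the patched function is doing: one of the two formats ('%d') consumes a single argument, yet two ($num and the unit) are always passed, and that is what triggers the complaint. Here is a rough Python analogue, not the toolkit's code. The unit list, default precision and divisor are my assumptions; only $num, $p, $n and @units appear in the patch context. Python's % operator is stricter than Perl's sprintf and raises an exception instead of warning, and it has no explicit argument indexes like '%1$d', so the sketch simply passes each format only the arguments it uses.

```python
UNITS = ["", "k", "M", "G", "T", "P", "E", "Z", "Y"]  # assumed unit suffixes

def shorten(num, precision=2, divisor=1024):
    """Scale num down by divisor and append a unit suffix, e.g. 2048 -> '2.00k'."""
    n = 0
    while num >= divisor and n < len(UNITS) - 1:
        num /= divisor
        n += 1
    if n or num != int(num):
        # Float branch: the format consumes the precision, the number and the unit.
        return "%.*f%s" % (precision, num, UNITS[n])
    # Integer branch: pass only the number, nothing is left over.
    return "%d" % num

def redundant_argument_error():
    """Show what happens when '%d' is handed one argument too many."""
    try:
        "%d" % (512, "")
    except TypeError as exc:
        return str(exc)
    return None
```

shorten(512) returns '512' and shorten(2048) returns '2.00k', while redundant_argument_error() returns Python's version of the complaint ("not all arguments converted during string formatting").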
I might be late here, but I want to share how I overcame the same error, as it might help someone searching for an answer. At the time of writing, the latest tag of Percona Toolkit is 3.0.9.
I tried running pt-query-digest after installing it via apt and after downloading the deb file, following the methods in the Percona documentation, but neither helped. I got the same error:
Pipeline process 5 (iteration) caused an error:
Redundant argument in sprintf at /usr/bin/pt-query-digest line (some line)
1 - So I removed the existing installation of percona-toolkit.
2 - First, I cleaned up/updated the Perl version:
sudo apt-get install perl
3 - Then I installed Percona Toolkit from source, as described in the repository's README, like this. I used branch 3.0.
git clone git@github.com:percona/percona-toolkit.git
cd percona-toolkit
perl Makefile.PL
make
make test
make install
That's it. Hope this helps someone.
I hit this error with percona-toolkit-3.0.12-1.el7.x86_64.rpm, while percona-toolkit-3.0.10-1.el7.x86_64.rpm works fine (percona-toolkit is very useful to me). The error output ends with:
at ./pt-query-digest line 9302.
Terminating pipeline because process 4 (iteration) caused too many errors.
Note that you will see the error message:
"Redundant argument in sprintf at"
if you forget to put a % in front of your format spec (first argument).
I'm using rsync on Solaris and couldn't find an exit code that tells me whether any file or folder was modified, added or deleted in the destination folder. How can I get that status if rsync doesn't have an exit code for it? These are the exit codes I found:
0 Success
1 Syntax or usage error
2 Protocol incompatibility
3 Errors selecting input/output files, dirs
4 Requested action not supported: an attempt was made to manipulate 64-bit
files on a platform that cannot support them; or an option was specified
that is supported by the client and not by the server.
5 Error starting client-server protocol
6 Daemon unable to append to log-file
10 Error in socket I/O
11 Error in file I/O
12 Error in rsync protocol data stream
13 Errors with program diagnostics
14 Error in IPC code
20 Received SIGUSR1 or SIGINT
21 Some error returned by waitpid()
22 Error allocating core memory buffers
23 Partial transfer due to error
24 Partial transfer due to vanished source files
25 The --max-delete limit stopped deletions
30 Timeout in data send/receive
35 Timeout waiting for daemon connection
Thank you
There is a workaround:
rsync --log-format=%f ...
Note that rsync outputs files anytime any attribute changes, not only if the content of the file is updated.
There is also a -i option (or --log-format=%i) that itemizes all of the changes. See the rsync man page for details of the output format.
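Since the exit code won't tell you, the practical approach is to capture that output and check whether it is empty. A sketch under some assumptions: rsync is on the PATH, -i prints one line per changed item with the change flags first and the path after them, and the function names are my own.

```python
import subprocess

def parse_itemized(output):
    """Split rsync -i output into (flags, path) pairs, one per changed item."""
    changes = []
    for line in output.splitlines():
        if not line.strip():
            continue
        flags, _, path = line.partition(" ")
        changes.append((flags, path.lstrip()))
    return changes

def sync_and_report(source, destination):
    """Run rsync -ai and return (exit_code, list_of_changes)."""
    completed = subprocess.run(
        ["rsync", "-ai", source, destination],
        capture_output=True,
        text=True,
    )
    return completed.returncode, parse_itemized(completed.stdout)
```

An exit code of 0 together with an empty change list then means "succeeded and nothing had to be updated". Keep in mind the note above: attribute-only changes are listed too.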
Say my program attempts a read of a byte in a file on a ZFS filesystem. ZFS can locate a copy of the necessary block, but cannot locate any copy with a valid checksum (they're all corrupted, or the only disks present have corrupted copies). What does my program see, in terms of the return value from the read, and the byte it tried to read? And is there a way to influence the behavior (under Solaris, or any other ZFS-implementing OS), that is, force failure, or force success, with potentially corrupt data?
EIO is indeed the only answer with current ZFS implementations.
An open ZFS "bug" asks for some way to read corrupted data:
http://bugs.opensolaris.org/bugdatabase/printableBug.do?bug_id=6186106
I believe this is already doable using the undocumented but open source zdb utility.
Have a look at http://www.cuddletech.com/blog/pivot/entry.php?id=980 for explanations about how to dump a file content using zdb -R option and "r" flag.
Solaris 10:
# Create a test pool
[root@tesalia z]# cd /tmp
[root@tesalia tmp]# mkfile 100M zz
[root@tesalia tmp]# zpool create prueba /tmp/zz
# Fill the pool
[root@tesalia /]# dd if=/dev/zero of=/prueba/dummy_file
dd: writing to `/prueba/dummy_file': No space left on device
129537+0 records in
129536+0 records out
66322432 bytes (66 MB) copied, 1.6093 s, 41.2 MB/s
# Umount the pool
[root@tesalia /]# zpool export prueba
# Corrupt the pool on purpose
[root@tesalia /]# dd if=/dev/urandom of=/tmp/zz seek=100000 count=1 conv=notrunc
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0715209 s, 7.2 kB/s
# Mount the pool again
zpool import -d /tmp prueba
# Try to read the corrupted data
[root@tesalia tmp]# md5sum /prueba/dummy_file
md5sum: /prueba/dummy_file: I/O error
# Read the manual
[root@tesalia tmp]# man -s2 read
[...]
RETURN VALUES
Upon successful completion, read() and readv() return a
non-negative integer indicating the number of bytes actually
read. Otherwise, the functions return -1 and set errno to
indicate the error.
ERRORS
The read(), readv(), and pread() functions will fail if:
[...]
EIO A physical I/O error has occurred, [...]
You must export/import the test pool because, if not, the direct overwrite (pool corruption) will be missed since the file will still be cached in OS memory.
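From the program's side, that failure is just an ordinary read error. A minimal sketch of what the caller sees, assuming a Unix-like OS; read_byte is a name made up for the example and the path you pass in is up to you.

```python
import os

def read_byte(path, offset):
    """Read one byte at offset; return (data, 0) on success or (None, errno) on failure."""
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            data = os.pread(fd, 1, offset)
        finally:
            os.close(fd)
    except OSError as exc:
        # On a block with no valid copy, exc.errno would be errno.EIO
        # and no data is handed back at all.
        return None, exc.errno
    return data, 0
```

So for the corrupted file in the session above, the program would get (None, errno.EIO) instead of a possibly wrong byte.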
And no, currently ZFS will refuse to give you corrupted data. As it should.
How would returning anything but an EIO error from read() make sense outside a file system specific low level data rescue utility?
The low level data rescue utility would need to use an OS and FS specific API other than open/read/write/close to access the file. The semantics it would need are fundamentally different from reading normal files, so it would need a specialized API.