How to re-create an equivalent to a Linux bash statement in Deno with Deno.run - deno

How to re-create an equivalent to the following Linux bash statement in Deno?
docker compose exec container_name -uroot -ppass db_name < ./dbDump.sql
I have tried the following:
const encoder = new TextEncoder
const p = await Deno.run({
cmd: [
'docker',
'compose',
'exec',
'container_name',
'mysql',
'-uroot',
'-ppass',
'db_name',
],
stdout: 'piped',
stderr: 'piped',
stdin: "piped",
})
await p.stdin.write(encoder.encode(await Deno.readTextFile('./dbDump.sql')))
await p.stdin.close()
await p.close()
But for some reason, whenever I do it this way I get the error ERROR 1064 (42000) at line 145: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version, which does not happen when I perform the exact same command in bash.
Could someone please explain to me how this should be done properly?

Without a sample input file, it's impossible to be certain of your exact issue.
Given the context though, I suspect that your input file is too large for a single proc.stdin.write() call. Try using the writeAll() function to make sure the full payload goes through:
import { writeAll } from "https://deno.land/std@0.119.0/streams/conversion.ts";
await writeAll(proc.stdin, await Deno.readFile(sqlFilePath));
To show what this fixes, here's a Deno program pipe-to-wc.ts which passes its input to the Linux 'word count' utility (in character-counting mode):
#!/usr/bin/env -S deno run --allow-read=/dev/stdin --allow-run=wc
const proc = await Deno.run({
cmd: ['wc', '-c'],
stdin: 'piped',
});
await proc.stdin.write(await Deno.readFile('/dev/stdin'));
proc.stdin.close();
await proc.status();
If we use this program with a small input, the count lines up:
# use the shebang to make the following commands easier
$ chmod +x pipe-to-wc.ts
$ dd if=/dev/zero bs=1024 count=1 | ./pipe-to-wc.ts
1+0 records in
1+0 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 0.000116906 s, 8.8 MB/s
1024
But as soon as the input is big, only 65,536 bytes (64 KiB) are going through!
$ dd if=/dev/zero bs=1024 count=100 | ./pipe-to-wc.ts
100+0 records in
100+0 records out
102400 bytes (102 kB, 100 KiB) copied, 0.0424347 s, 2.4 MB/s
65536
To fix this issue, let's replace the write() call with writeAll():
#!/usr/bin/env -S deno run --allow-read=/dev/stdin --allow-run=wc
import { writeAll } from "https://deno.land/std@0.119.0/streams/conversion.ts";
const proc = await Deno.run({
cmd: ['wc', '-c'],
stdin: 'piped',
});
await writeAll(proc.stdin, await Deno.readFile('/dev/stdin'));
proc.stdin.close();
await proc.status();
Now all the bytes are getting passed through on big inputs :D
$ dd if=/dev/zero bs=1024 count=1000 | ./pipe-to-wc.ts
1000+0 records in
1000+0 records out
1024000 bytes (1.0 MB, 1000 KiB) copied, 0.0854184 s, 12.0 MB/s
1024000
Note that this will still fail on huge inputs once they exceed the amount of memory available to your program. The writeAll() solution should be fine up to 100 megabytes or so. After that point you'd probably want to switch to a streaming solution.
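If you do reach that point, here is a minimal streaming sketch for the original question. It assumes copy() is exported from the same std streams module as writeAll(); it pumps the file into the child process in chunks instead of buffering it all in memory:
import { copy } from "https://deno.land/std@0.119.0/streams/conversion.ts";

const proc = Deno.run({
  cmd: ['docker', 'compose', 'exec', '-T', 'container_name', 'mysql', '-uroot', '-ppass', 'db_name'],
  stdin: 'piped',
});

// Stream the dump file into the subprocess chunk by chunk.
const sqlFile = await Deno.open('./dbDump.sql', { read: true });
await copy(sqlFile, proc.stdin);
sqlFile.close();
proc.stdin.close();
await proc.status();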

First, a couple of notes:
Deno currently doesn't offer a way to create a detached subprocess. (You didn't mention this, but it seems potentially relevant to your scenario given typical docker compose usage) See denoland/deno#5501.
Deno's subprocess API is currently being reworked. See denoland/deno#11016.
Second, here are links to the relevant docs:
docker-compose exec
CLI APIs > Deno.run
Manual > Creating a subprocess (Deno v1.17.0)
Now, here's a commented breakdown of how to create a subprocess (according to the current API) using your scenario as an example:
module.ts:
const dbUser = 'actual_database_username';
const dbPass = 'actual_database_password';
const dbName = 'actual_database_name';
// Note: mysql expects the password attached directly to -p (no space in between)
const dockerExecProcCmd = ['mysql', '-u', dbUser, `-p${dbPass}`, dbName];
const serviceName = 'actual_compose_service_name';
// Build the run command
const cmd = ['docker', 'compose', 'exec', '-T', serviceName, ...dockerExecProcCmd];
/**
* Create the subprocess
*
* For now, leave `stderr` and `stdout` undefined so they'll print
* to your console while you are debugging. Later, you can pipe (capture) them
* and handle them in your program
*/
const p = Deno.run({
cmd,
stdin: 'piped',
// stderr: 'piped',
// stdout: 'piped',
});
/**
* If you use a relative path, this will be relative to `Deno.cwd`
* at the time the subprocess is created
*
* https://doc.deno.land/deno/stable/~/Deno.cwd
*/
const sqlFilePath = './dbDump.sql';
// Write contents of SQL script to stdin
// (note: a single write() call is not guaranteed to write the entire buffer;
// for large files prefer std's writeAll())
await p.stdin.write(await Deno.readFile(sqlFilePath));
/**
* Close stdin
*
* I don't know how `mysql` handles `stdin`, but if it needs the EOT sent by
* closing and you don't need to write to `stdin` any more, then this is correct
*/
p.stdin.close();
// Wait for the process to finish (either OK or NOK)
const {code} = await p.status();
console.log({'docker-compose exit status code': code});
// Not strictly necessary, but better to be explicit
p.close();
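For reference, a possible invocation (--allow-run and --allow-read are standard Deno permission flags; scope them as tightly as you like):
deno run --allow-run --allow-read module.ts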

Related

path not being detected by Nextflow

I'm new to nf-core/Nextflow and, needless to say, the documentation does not reflect what might be actually implemented. But I'm defining the basic pipeline below:
nextflow.enable.dsl=2
process RUNBLAST{
input:
val thr
path query
path db
path output
output:
path output
script:
"""
blastn -query ${query} -db ${db} -out ${output} -num_threads ${thr}
"""
}
workflow{
//println "I want to BLAST $params.query to $params.dbDir/$params.dbName using $params.threads CPUs and output it to $params.outdir"
RUNBLAST(params.threads,params.query,params.dbDir, params.output)
}
Then I'm executing the pipeline with
nextflow run main.nf --query test2.fa --dbDir blast/blastDB
Then I get the following error:
N E X T F L O W ~ version 22.10.6
Launching `main.nf` [dreamy_hugle] DSL2 - revision: c388cf8f31
Error executing process > 'RUNBLAST'
Caused by:
Not a valid path value: 'test2.fa'
Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run
I know test2.fa exists in the current directory:
(nfcore) MN:nf-core-basicblast jraygozagaray$ ls
CHANGELOG.md conf other.nf
CITATIONS.md docs pyproject.toml
CODE_OF_CONDUCT.md lib subworkflows
LICENSE main.nf test.fa
README.md modules test2.fa
assets modules.json work
bin nextflow.config workflows
blast nextflow_schema.json
I also tried with "file" instead of path, but that is deprecated and raises other kinds of errors.
It'll be helpful to know how to fix this to get myself started with the pipeline-building process.
Shouldn't Nextflow copy the file to the execution path?
Thanks
You get the above error because params.query is not actually a path value. It's probably just a simple String or GString. The solution is to instead supply a file object, for example:
workflow {
query = file(params.query)
BLAST( query, ... )
}
Note that a value channel is implicitly created by a process when it is invoked with a simple value, like the above file object. If you need to be able to BLAST multiple query files, you'll instead need a queue channel, which can be created using the fromPath factory method, for example:
params.query = "${baseDir}/data/*.fa"
params.db = "${baseDir}/blastdb/nt"
params.outdir = './results'
db_name = file(params.db).name
db_path = file(params.db).parent
process BLAST {
publishDir(
path: "${params.outdir}/blast",
mode: 'copy',
)
input:
tuple val(query_id), path(query)
path db
output:
tuple val(query_id), path("${query_id}.out")
"""
blastn \\
-num_threads ${task.cpus} \\
-query "${query}" \\
-db "${db}/${db_name}" \\
-out "${query_id}.out"
"""
}
workflow{
Channel
.fromPath( params.query )
.map { file -> tuple(file.baseName, file) }
.set { query_ch }
BLAST( query_ch, db_path )
}
Note that the usual way to specify the number of threads/CPUs is using the cpus directive, which can be configured using a process selector in your nextflow.config. For example:
process {
withName: BLAST {
cpus = 4
}
}

nix-shell script does nothing when using script

I'm quite new to Nix and I'm trying to create a very simple shell.nix script file.
Unfortunately I need an old package: mariadb-10.4.21. After reading and searching a bit, I found out that version 10.4.17 (it would've been nice to have the exact version, but I couldn't find it) is in the nixos-20.09 channel, but when I do
$ nix-shell --version
nix-shell (Nix) 2.5.1
$ cat shell.nix
let
pkgs = import <nixpkgs> {};
# git ls-remote https://github.com/nixos/nixpkgs nixos-20.09
pkgs-20_09 = import (builtins.fetchGit {
name = "nixpks-20.09";
url = "https://github.com/nixos/nixpkgs";
ref = "refs/heads/nixos-20.09";
rev = "1c1f5649bb9c1b0d98637c8c365228f57126f361";
}) {};
in
pkgs.stdenv.mkDerivation {
pname = "test";
version = "0.1.0";
buildInputs = [
pkgs-20_09.mariadb
];
}
$ nix-shell
it just waits indefinitely without doing anything. But if I do
$ nix-shell -p mariadb -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/1c1f5649bb9c1b0d98637c8c365228f57126f361.tar.gz
[...]
/nix/store/yias2v8pm9pvfk79m65wdpcby4kiy91l-mariadb-server-10.4.17
[...]
copying path '/nix/store/yias2v8pm9pvfk79m65wdpcby4kiy91l-mariadb-server-10.4.17' from 'https://cache.nixos.org'...
[nix-shell:~/Playground]$ mariadb --version
mariadb Ver 15.1 Distrib 10.4.17-MariaDB, for Linux (x86_64) using readline 5.1
it works perfectly.
What am I doing wrong in the script for it to hang?
EDIT: I got a bit more info by running
$ nix-shell -vvv
[...]
did not find cache entry for '{"name":"nixpks-20.09","rev":"1c1f5649bb9c1b0d98637c8c365228f57126f361","type":"git"}'
did not find cache entry for '{"name":"nixpks-20.09","ref":"refs/heads/nixos-20.09","type":"git","url":"https://github.com/nixos/nixpkgs"}'
locking path '/home/test/.cache/nix/gitv3/17blyky0ja542rww32nj04jys1r9vnkg6gcfbj83drca9a862hwp.lock'
lock acquired on '/home/test/.cache/nix/gitv3/17blyky0ja542rww32nj04jys1r9vnkg6gcfbj83drca9a862hwp.lock.lock'
fetching Git repository 'https://github.com/nixos/nixpkgs'...
Is it me, or does it seem like it's trying to fetch from two different sources? As far as I understood, all three of url, rev and ref are needed for Git fetching, but it looks as if it's splitting them.
EDIT2: I've been trying with fetchFromGitHub
pkgs-20_09 = import (pkgs.fetchFromGitHub {
name = "nixpks-20.09";
owner = "nixos";
repo = "nixpkgs";
rev = "1c1f5649bb9c1b0d98637c8c365228f57126f361";
sha256 = "0f2nvdijyxfgl5kwyb4465pppd5vkhqxddx6v40k2s0z9jfhj0xl";
}) {};
and fetchTarball
pkgs-20_09 = import (builtins.fetchTarball "https://github.com/NixOS/nixpkgs/archive/1c1f5649bb9c1b0d98637c8c365228f57126f361.tar.gz") {};
and both work just fine. I'll use fetchFromGitHub from now on, but it'd be interesting to know why fetchGit doesn't work.

How to get Rust sqlx sqlite query to work?

main.rs:
#[async_std::main]
async fn main() -> Result<(),sqlx::Error> {
use sqlx::Connect;
let mut conn = sqlx::SqliteConnection::connect("sqlite:///home/ace/hello_world/test.db").await?;
let row = sqlx::query!("SELECT * FROM tbl").fetch_all(&conn).await?;
println!("{}{}",row.0,row.1);
Ok(())
}
Cargo.toml:
[package]
name = "hello_world"
version = "0.1.0"
authors = ["ace"]
edition = "2018"
[dependencies]
async-std = {version = "1", features = ["attributes"]}
sqlx = { version="0.3.5", default-features=false, features=["runtime-async-std","macros","sqlite"] }
bash session:
ace@SLAB:~/hello_world$ sqlite test.db
SQLite version 2.8.17
Enter ".help" for instructions
sqlite> create table tbl ( num integer, chr varchar );
sqlite> insert into tbl values (1,'ok');
sqlite> .quit
ace@SLAB:~/hello_world$ pwd
/home/ace/hello_world
ace@SLAB:~/hello_world$ export DATABASE_URL=sqlite:///home/ace/hello_world/test.db
ace@SLAB:~/hello_world$ cargo run
Compiling hello_world v0.1.0 (/home/ace/hello_world)
error: failed to connect to database: file is not a database
--> src/main.rs:8:12
|
8 | let row = sqlx::query!("SELECT * FROM tbl").fetch_all(&conn).await?;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error: aborting due to previous error
error: could not compile `hello_world`.
To learn more, run the command again with --verbose.
ace@SLAB:~/hello_world$ rustc --version
rustc 1.44.0 (49cae5576 2020-06-01)
ace@SLAB:~/hello_world$ uname -r
5.4.0-33-generic
ace@SLAB:~/hello_world$ cat /etc/os-release | head -2
NAME="Ubuntu"
VERSION="20.04 LTS (Focal Fossa)"
ace@SLAB:~/hello_world$
Also tried using DATABASE_URL "sqlite::memory:" (both in environment variable and in main.rs) with system table "sqlite_master". Got different error:
error[E0277]: the trait bound `&sqlx_core::sqlite::connection::SqliteConnection: sqlx_core::executor::RefExecutor<'_>` is not satisfied
... but it must have gotten partway to success because when I used table name "Xsqlite_master" with memory db, it complained that there was no such table.
Tried "sqlite://home"(etc) and every other number of slashes, zero through 4. Tried several hundred other things. :(
Thank you!
There may be several more things to try:
Try sqlite3 /home/ace/hello_world/test.db to double-check that the DB does exist, and make sure the tbl table is defined there with .schema tbl. Note that your session used a sqlite binary reporting version 2.8.17, i.e. SQLite 2; a database created with it cannot be read by SQLite 3, which sqlx uses, and that would produce exactly the "file is not a database" error, so re-create the database with sqlite3.
Try the DB path with a single slash, i.e. sqlite:/home/ace/hello_world/test.db
Lastly, try using the query function instead of the macro (https://docs.rs/sqlx/0.3.5/sqlx/fn.query.html) to see if it works.
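For the last suggestion, a minimal sketch using the runtime query() function might look like this (assuming sqlx 0.3.5 with the async-std runtime, an SQLite 3 database, and the num/chr columns from the question; note that fetch_all takes &mut conn here):
#[async_std::main]
async fn main() -> Result<(), sqlx::Error> {
    use sqlx::{Connect, Row};

    let mut conn =
        sqlx::SqliteConnection::connect("sqlite:/home/ace/hello_world/test.db").await?;

    // query() checks the SQL at runtime, so no DATABASE_URL is needed at compile time.
    let rows = sqlx::query("SELECT num, chr FROM tbl")
        .fetch_all(&mut conn)
        .await?;

    for row in rows {
        let num: i64 = row.get("num");
        let chr: String = row.get("chr");
        println!("{} {}", num, chr);
    }
    Ok(())
}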

Send SIGINT to a process by sending ctrl-c to stdin

I'm looking for a way to mimic a terminal for some automated testing: i.e. start a process and then interact with it via sending data to stdin and reading from stdout. E.g. sending some lines of input to stdin, including Ctrl-C and Ctrl-\, which should result in signals being sent to the process.
Using std::process::Command I'm able to send input to e.g. cat and I'm also seeing its output on stdout, but sending Ctrl-C (as I understand, that is byte 3) does not cause SIGINT to be sent to the shell. E.g. this program should terminate:
use std::process::{Command, Stdio};
use std::io::Write;
fn main() {
let mut child = Command::new("sh")
.arg("-c").arg("-i").arg("cat")
.stdin(Stdio::piped())
.spawn().unwrap();
let mut stdin = child.stdin.take().unwrap();
stdin.write(&[3]).expect("cannot send ctrl-c");
child.wait();
}
I suspect the issue is that sending Ctrl-C needs some tty, and via sh -i it's only in "interactive mode".
Do I need to go full-fledged and use e.g. termion or ncurses?
Update: I confused shell and terminal in the original question. I've cleared this up now. Also, I mentioned ssh, which should have been sh.
The simplest way is to directly send the SIGINT signal to the child process. This can be done easily using nix's signal::kill function:
// add `nix = "0.15.0"` to your Cargo.toml
use std::process::{Command, Stdio};
use std::io::Write;
fn main() {
// spawn child process
let mut child = Command::new("cat")
.stdin(Stdio::piped())
.spawn().unwrap();
// send "echo\n" to child's stdin
let mut stdin = child.stdin.take().unwrap();
writeln!(stdin, "echo");
// sleep a bit so that child can process the input
std::thread::sleep(std::time::Duration::from_millis(500));
// send SIGINT to the child
nix::sys::signal::kill(
nix::unistd::Pid::from_raw(child.id() as i32),
nix::sys::signal::Signal::SIGINT
).expect("cannot send ctrl-c");
// wait for child to terminate
child.wait().unwrap();
}
You should be able to send all kinds of signals using this method. For more advanced "interactivity" (e.g. child programs like vi that query terminal size) you'd need to create a pseudoterminal like @hansaplast did in his solution.
After a lot of research I figured out it's not too much work to do the pty fork myself. There's pty-rs, but it has bugs and seems unmaintained.
The following code needs the pty module of nix, which is not yet on crates.io, so Cargo.toml needs this for now:
[dependencies]
nix = {git = "https://github.com/nix-rust/nix.git"}
The following code runs cat in a tty and then writes/reads from it and sends Ctrl-C (3):
extern crate nix;
use std::path::Path;
use nix::pty::{posix_openpt, grantpt, unlockpt, ptsname};
use nix::fcntl::{O_RDWR, open};
use nix::sys::stat;
use nix::sys::wait;
use nix::unistd::{fork, ForkResult, setsid, dup2};
use nix::libc::{STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO};
use std::os::unix::io::{AsRawFd, FromRawFd};
use std::io::prelude::*;
use std::io::{BufReader, LineWriter};
fn run() -> std::io::Result<()> {
// Open a new PTY master
let master_fd = posix_openpt(O_RDWR)?;
// Allow a slave to be generated for it
grantpt(&master_fd)?;
unlockpt(&master_fd)?;
// Get the name of the slave
let slave_name = ptsname(&master_fd)?;
match fork() {
Ok(ForkResult::Child) => {
setsid()?; // create new session with child as session leader
let slave_fd = open(Path::new(&slave_name), O_RDWR, stat::Mode::empty())?;
// assign stdin, stdout, stderr to the tty, just like a terminal does
dup2(slave_fd, STDIN_FILENO)?;
dup2(slave_fd, STDOUT_FILENO)?;
dup2(slave_fd, STDERR_FILENO)?;
std::process::Command::new("cat").status()?;
}
Ok(ForkResult::Parent { child: _ }) => {
let f = unsafe { std::fs::File::from_raw_fd(master_fd.as_raw_fd()) };
let mut reader = BufReader::new(&f);
let mut writer = LineWriter::new(&f);
writer.write_all(b"hello world\n")?;
let mut s = String::new();
reader.read_line(&mut s)?; // what we just wrote in
reader.read_line(&mut s)?; // what cat wrote out
writer.write(&[3])?; // send ^C
writer.flush()?;
let mut buf = [0; 2]; // needs bytewise read as ^C has no newline
reader.read(&mut buf)?;
s += &String::from_utf8_lossy(&buf).to_string();
println!("{}", s);
println!("cat exit code: {:?}", wait::wait()?); // make sure cat really exited
}
Err(_) => println!("error"),
}
Ok(())
}
fn main() {
run().expect("could not execute command");
}
Output:
hello world
hello world
^C
cat exit code: Signaled(2906, SIGINT, false)
Try adding -t option TWICE to force pseudo-tty allocation. I.e.
klar (16:14) ~>echo foo | ssh user@host.ssh.com tty
not a tty
klar (16:14) ~>echo foo | ssh -t -t user@host.ssh.com tty
/dev/pts/0
When you have a pseudo-tty, I think it should convert that to SIGINT as you wanted to do.
In your simple example, you could also just close stdin after the write, in which case the server should exit. For this particular case it would be more elegant and probably more reliable.
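For completeness, closing stdin from the Rust side just means dropping the handle; here is a minimal sketch based on the question's cat example:
use std::io::Write;
use std::process::{Command, Stdio};

fn main() {
    let mut child = Command::new("cat")
        .stdin(Stdio::piped())
        .spawn()
        .unwrap();

    {
        let mut stdin = child.stdin.take().unwrap();
        stdin.write_all(b"hello\n").unwrap();
    } // stdin is dropped here, which closes the pipe and lets cat see EOF

    child.wait().unwrap();
}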
Solution without using a crate
Now that you are spawning a command in Rust, you might as well spawn another to send SIGINT to it. That command is kill.
So, you can do this:
use std::process::{Command, Stdio};
use std::io::Result;
fn main() -> Result<()> {
let mut child = Command::new("sh")
.arg("-c").arg("-i").arg("cat")
.stdin(Stdio::piped())
.spawn()?;
// keep the handle around so the pipe to the child's stdin stays open
let _stdin = child.stdin.take().unwrap();
// `kill` sends SIGTERM by default, so pass -INT to actually send SIGINT
let mut kill = Command::new("kill")
.arg("-INT")
.arg(child.id().to_string())
.spawn()?;
kill.wait()?;
Ok(())
}

Removing Airflow task logs

I'm running 5 DAGs which have generated a total of about 6 GB of log data in the base_log_folder over a month. I just added a remote_base_log_folder, but it seems it does not exclude logging to the base_log_folder.
Is there any way to automatically remove old log files, rotate them, or force Airflow not to log on disk (base_log_folder) and only log to remote storage?
Please refer to https://github.com/teamclairvoyant/airflow-maintenance-dags
This repository has DAGs that can kill halted tasks and clean up logs.
You can grab the concepts and come up with a new DAG that cleans up as per your requirements.
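As a rough illustration (not taken from that repository), such a cleanup DAG could be a daily BashOperator that deletes old task log files; the path, retention window and Airflow 2-style imports below are assumptions to adapt to your setup:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

BASE_LOG_FOLDER = "/opt/airflow/logs"  # assumption: your base_log_folder
MAX_LOG_AGE_DAYS = 30                  # assumption: retention window in days

with DAG(
    dag_id="airflow_log_cleanup_example",
    start_date=days_ago(1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="delete_old_task_logs",
        # delete log files older than the retention window, then prune empty directories
        bash_command=(
            f"find {BASE_LOG_FOLDER} -type f -name '*.log' -mtime +{MAX_LOG_AGE_DAYS} -delete && "
            f"find {BASE_LOG_FOLDER} -type d -empty -delete"
        ),
    )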
We remove the Task logs by implementing our own FileTaskHandler, and then pointing to it in the airflow.cfg. So, we overwrite the default LogHandler to keep only N task logs, without scheduling additional DAGs.
We are using Airflow==1.10.1.
[core]
logging_config_class = log_config.LOGGING_CONFIG
log_config.LOGGING_CONFIG
BASE_LOG_FOLDER = conf.get('core', 'BASE_LOG_FOLDER')
FOLDER_TASK_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}'
FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'
LOGGING_CONFIG = {
'formatters': {},
'handlers': {
'...': {},
'task': {
'class': 'file_task_handler.FileTaskRotationHandler',
'formatter': 'airflow.job',
'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
'filename_template': FILENAME_TEMPLATE,
'folder_task_template': FOLDER_TASK_TEMPLATE,
'retention': 20
},
'...': {}
},
'loggers': {
'airflow.task': {
'handlers': ['task'],
'level': JOB_LOG_LEVEL,
'propagate': False,
},
'airflow.task_runner': {
'handlers': ['task'],
'level': LOG_LEVEL,
'propagate': True,
},
'...': {}
}
}
file_task_handler.FileTaskRotationHandler
import os
import shutil
from airflow.utils.helpers import parse_template_string
from airflow.utils.log.file_task_handler import FileTaskHandler
class FileTaskRotationHandler(FileTaskHandler):
def __init__(self, base_log_folder, filename_template, folder_task_template, retention):
"""
:param base_log_folder: Base log folder to place logs.
:param filename_template: template filename string.
:param folder_task_template: template folder task path.
:param retention: Number of folder logs to keep
"""
super(FileTaskRotationHandler, self).__init__(base_log_folder, filename_template)
self.retention = retention
self.folder_task_template, self.folder_task_template_jinja_template = \
parse_template_string(folder_task_template)
@staticmethod
def _get_directories(path='.'):
return next(os.walk(path))[1]
def _render_folder_task_path(self, ti):
if self.folder_task_template_jinja_template:
jinja_context = ti.get_template_context()
return self.folder_task_template_jinja_template.render(**jinja_context)
return self.folder_task_template.format(dag_id=ti.dag_id, task_id=ti.task_id)
def _init_file(self, ti):
relative_path = self._render_folder_task_path(ti)
folder_task_path = os.path.join(self.local_base, relative_path)
subfolders = sorted(self._get_directories(folder_task_path))  # folder names are timestamps, so sorting keeps the newest runs
to_remove = set(subfolders) - set(subfolders[-self.retention:])
for dir_to_remove in to_remove:
full_dir_to_remove = os.path.join(folder_task_path, dir_to_remove)
print('Removing', full_dir_to_remove)
shutil.rmtree(full_dir_to_remove)
return FileTaskHandler._init_file(self, ti)
Airflow maintainers don't consider truncating logs to be part of the Airflow core logic; see this discussion, where the maintainers suggest changing the LOG_LEVEL to avoid producing too much log data.
And in the related PR, we can learn how to change the log level in airflow.cfg.
Good luck.
I know it sounds savage, but have you tried pointing base_log_folder to /dev/null? I use Airflow as part of a container, so I don't care about the files either, as long as the logger also pipes to STDOUT.
Not sure how well this plays with S3 though.
For your concrete problems, I have some suggestions.
For those, you would always need a specialized logging config as described in this answer: https://stackoverflow.com/a/54195537/2668430
automatically remove old log files and rotate them
I don't have any practical experience with the TimedRotatingFileHandler from the Python standard library yet, but you might give it a try:
https://docs.python.org/3/library/logging.handlers.html#timedrotatingfilehandler
It not only offers to rotate your files based on a time interval, but if you specify the backupCount parameter, it even deletes your old log files:
If backupCount is nonzero, at most backupCount files will be kept, and if more would be created when rollover occurs, the oldest one is deleted. The deletion logic uses the interval to determine which files to delete, so changing the interval may leave old files lying around.
Which sounds pretty much like the best solution for your first problem.
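As a rough, Airflow-agnostic sketch of that handler (the log path and retention values are made up):
import logging
from logging.handlers import TimedRotatingFileHandler

handler = TimedRotatingFileHandler(
    filename="/var/log/airflow/task.log",  # hypothetical log path
    when="midnight",                       # rotate once per day
    backupCount=7,                         # keep at most 7 rotated files; older ones are deleted
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("this line ends up in a daily-rotated file")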
force airflow to not log on disk (base_log_folder), but only in remote storage?
In this case you should specify the logging config in such a way that you do not have any logging handlers that write to a file, i.e. remove all FileHandlers.
Rather, try to find logging handlers that send the output directly to a remote address.
E.g. CMRESHandler which logs directly to ElasticSearch but needs some extra fields in the log calls.
Alternatively, write your own handler class and let it inherit from the Python standard library's HTTPHandler.
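For instance, a minimal sketch of such a custom handler (the host, endpoint path and extra field are hypothetical):
import logging
import logging.handlers

class RemoteHTTPLogHandler(logging.handlers.HTTPHandler):
    """Send each log record to a remote HTTP collector."""

    def mapLogRecord(self, record):
        # Start from the default dict representation and add whatever
        # extra fields the remote side expects.
        data = super().mapLogRecord(record)
        data["service"] = "airflow"
        return data

handler = RemoteHTTPLogHandler(host="logs.example.com:8080", url="/ingest", method="POST")
logging.getLogger("airflow.task").addHandler(handler)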
A final suggestion would be to combine both the TimedRotatingFileHandler and setup ElasticSearch together with FileBeat, so you would be able to store your logs inside ElasticSearch (i.e. remote), but you wouldn't store a huge amount of logs on your Airflow disk since they will be removed by the backupCount retention policy of your TimedRotatingFileHandler.
Usually Apache Airflow grabs disk space for three reasons:
1. Airflow scheduler log files
2. MySQL binary logs [major]
3. XCom table records
To clean these up on a regular basis, I have set up a DAG which runs daily, purges the binary logs, and truncates the xcom table to free up disk space.
You also might need to install the MySQL connector [pip install mysql-connector-python].
To clean up the scheduler log files, I delete them manually twice a week to avoid the risk of deleting logs that are still needed for some reason.
I clean the log files with the [sudo rm -rd airflow/logs/] command.
Below is my Python code for reference
"""Example DAG demonstrating the usage of the PythonOperator."""
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
args = {
'owner': 'airflow',
'email_on_failure':True,
'retries': 1,
'email':['Your Email Id'],
'retry_delay': timedelta(minutes=5)
}
dag = DAG(
dag_id='airflow_logs_cleanup',
default_args=args,
schedule_interval='@daily',
start_date=days_ago(0),
catchup=False,
max_active_runs=1,
tags=['airflow_maintenance'],
)
def truncate_table():
import mysql.connector
connection = mysql.connector.connect(host='localhost',
database='db_name',
user='username',
password='your password',
auth_plugin='mysql_native_password')
cursor = connection.cursor()
sql_select_query = """TRUNCATE TABLE xcom"""
cursor.execute(sql_select_query)
connection.commit()
connection.close()
print("XCOM Table truncated successfully")
def delete_binary_logs():
import mysql.connector
from datetime import datetime
date = datetime.today().strftime('%Y-%m-%d')
connection = mysql.connector.connect(host='localhost',
database='db_name',
user='username',
password='your_password',
auth_plugin='mysql_native_password')
cursor = connection.cursor()
query = 'PURGE BINARY LOGS BEFORE ' + "'" + str(date) + "'"
sql_select_query = query
cursor.execute(sql_select_query)
connection.commit()
connection.close()
print("Binary logs deleted successfully")
t1 = PythonOperator(
task_id='truncate_table',
python_callable=truncate_table, dag=dag
)
t2 = PythonOperator(
task_id='delete_binary_logs',
python_callable=delete_binary_logs, dag=dag
)
t2 << t1
I am surprised, but it worked for me. Update your config as below:
base_log_folder=""
It is tested in MinIO and in S3.
Our solution looks a lot like Franzi's:
Running on Airflow 2.0.1 (py3.8)
Override default logging configuration
Since we use a Helm chart for the Airflow deployment, it was easiest to push an env there, but it can also be done in airflow.cfg or using ENV in the Dockerfile.
# Set custom logging configuration to enable log rotation for task logging
AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS: "airflow_plugins.settings.airflow_local_settings.DEFAULT_LOGGING_CONFIG"
Then we added the logging configuration together with the custom log handler to a python module we build and install in the docker image. As described here: https://airflow.apache.org/docs/apache-airflow/stable/modules_management.html
Logging configuration snippet
This is only a copy of the default from the Airflow codebase, but the task logger gets a different handler.
DEFAULT_LOGGING_CONFIG: Dict[str, Any] = {
'version': 1,
'disable_existing_loggers': False,
'formatters': {
'airflow': {'format': LOG_FORMAT},
'airflow_coloured': {
'format': COLORED_LOG_FORMAT if COLORED_LOG else LOG_FORMAT,
'class': COLORED_FORMATTER_CLASS if COLORED_LOG else 'logging.Formatter',
},
},
'handlers': {
'console': {
'class': 'airflow.utils.log.logging_mixin.RedirectStdHandler',
'formatter': 'airflow_coloured',
'stream': 'sys.stdout',
},
'task': {
'class': 'airflow_plugins.log.rotating_file_task_handler.RotatingFileTaskHandler',
'formatter': 'airflow',
'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
'filename_template': FILENAME_TEMPLATE,
'maxBytes': 10485760, # 10MB
'backupCount': 6,
},
...
RotatingFileTaskHandler
And finally the custom handler which is just a merge of the logging.handlers.RotatingFileHandler and the FileTaskHandler.
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""File logging handler for tasks."""
import logging
import logging.handlers
import os
from pathlib import Path
from typing import TYPE_CHECKING, Optional
import requests
from airflow.configuration import AirflowConfigException, conf
from airflow.utils.helpers import parse_template_string
if TYPE_CHECKING:
from airflow.models import TaskInstance
class RotatingFileTaskHandler(logging.Handler):
"""
FileTaskHandler is a python log handler that handles and reads
task instance logs. It creates and delegates log handling
to `logging.FileHandler` after receiving task instance context.
It reads logs from task instance's host machine.
:param base_log_folder: Base log folder to place logs.
:param filename_template: template filename string
"""
def __init__(self, base_log_folder: str, filename_template: str, maxBytes=0, backupCount=0):
self.max_bytes = maxBytes
self.backup_count = backupCount
super().__init__()
self.handler = None # type: Optional[logging.FileHandler]
self.local_base = base_log_folder
self.filename_template, self.filename_jinja_template = parse_template_string(filename_template)
def set_context(self, ti: "TaskInstance"):
"""
Provide task_instance context to airflow task handler.
:param ti: task instance object
"""
local_loc = self._init_file(ti)
self.handler = logging.handlers.RotatingFileHandler(
filename=local_loc,
mode='a',
maxBytes=self.max_bytes,
backupCount=self.backup_count,
encoding='utf-8',
delay=False,
)
if self.formatter:
self.handler.setFormatter(self.formatter)
self.handler.setLevel(self.level)
def emit(self, record):
if self.handler:
self.handler.emit(record)
def flush(self):
if self.handler:
self.handler.flush()
def close(self):
if self.handler:
self.handler.close()
def _render_filename(self, ti, try_number):
if self.filename_jinja_template:
if hasattr(ti, 'task'):
jinja_context = ti.get_template_context()
jinja_context['try_number'] = try_number
else:
jinja_context = {
'ti': ti,
'ts': ti.execution_date.isoformat(),
'try_number': try_number,
}
return self.filename_jinja_template.render(**jinja_context)
return self.filename_template.format(
dag_id=ti.dag_id,
task_id=ti.task_id,
execution_date=ti.execution_date.isoformat(),
try_number=try_number,
)
def _read_grouped_logs(self):
return False
def _read(self, ti, try_number, metadata=None): # pylint: disable=unused-argument
"""
Template method that contains custom logic of reading
logs given the try_number.
:param ti: task instance record
:param try_number: current try_number to read log from
:param metadata: log metadata,
can be used for steaming log reading and auto-tailing.
:return: log message as a string and metadata.
"""
# Task instance here might be different from task instance when
# initializing the handler. Thus explicitly getting log location
# is needed to get correct log path.
log_relative_path = self._render_filename(ti, try_number)
location = os.path.join(self.local_base, log_relative_path)
log = ""
if os.path.exists(location):
try:
with open(location) as file:
log += f"*** Reading local file: {location}\n"
log += "".join(file.readlines())
except Exception as e: # pylint: disable=broad-except
log = f"*** Failed to load local log file: {location}\n"
log += "*** {}\n".format(str(e))
elif conf.get('core', 'executor') == 'KubernetesExecutor': # pylint: disable=too-many-nested-blocks
try:
from airflow.kubernetes.kube_client import get_kube_client
kube_client = get_kube_client()
if len(ti.hostname) >= 63:
# Kubernetes takes the pod name and truncates it for the hostname. This truncated hostname
# is returned for the fqdn to comply with the 63 character limit imposed by DNS standards
# on any label of a FQDN.
pod_list = kube_client.list_namespaced_pod(conf.get('kubernetes', 'namespace'))
matches = [
pod.metadata.name
for pod in pod_list.items
if pod.metadata.name.startswith(ti.hostname)
]
if len(matches) == 1:
if len(matches[0]) > len(ti.hostname):
ti.hostname = matches[0]
log += '*** Trying to get logs (last 100 lines) from worker pod {} ***\n\n'.format(
ti.hostname
)
res = kube_client.read_namespaced_pod_log(
name=ti.hostname,
namespace=conf.get('kubernetes', 'namespace'),
container='base',
follow=False,
tail_lines=100,
_preload_content=False,
)
for line in res:
log += line.decode()
except Exception as f: # pylint: disable=broad-except
log += '*** Unable to fetch logs from worker pod {} ***\n{}\n\n'.format(ti.hostname, str(f))
else:
url = os.path.join("http://{ti.hostname}:{worker_log_server_port}/log", log_relative_path).format(
ti=ti, worker_log_server_port=conf.get('celery', 'WORKER_LOG_SERVER_PORT')
)
log += f"*** Log file does not exist: {location}\n"
log += f"*** Fetching from: {url}\n"
try:
timeout = None # No timeout
try:
timeout = conf.getint('webserver', 'log_fetch_timeout_sec')
except (AirflowConfigException, ValueError):
pass
response = requests.get(url, timeout=timeout)
response.encoding = "utf-8"
# Check if the resource was properly fetched
response.raise_for_status()
log += '\n' + response.text
except Exception as e: # pylint: disable=broad-except
log += "*** Failed to fetch log file from worker. {}\n".format(str(e))
return log, {'end_of_log': True}
def read(self, task_instance, try_number=None, metadata=None):
"""
Read logs of given task instance from local machine.
:param task_instance: task instance object
:param try_number: task instance try_number to read logs from. If None
it returns all logs separated by try_number
:param metadata: log metadata,
can be used for steaming log reading and auto-tailing.
:return: a list of listed tuples which order log string by host
"""
# Task instance increments its try number when it starts to run.
# So the log for a particular task try will only show up when
# try number gets incremented in DB, i.e logs produced the time
# after cli run and before try_number + 1 in DB will not be displayed.
if try_number is None:
next_try = task_instance.next_try_number
try_numbers = list(range(1, next_try))
elif try_number < 1:
logs = [
[('default_host', f'Error fetching the logs. Try number {try_number} is invalid.')],
]
return logs, [{'end_of_log': True}]
else:
try_numbers = [try_number]
logs = [''] * len(try_numbers)
metadata_array = [{}] * len(try_numbers)
for i, try_number_element in enumerate(try_numbers):
log, metadata = self._read(task_instance, try_number_element, metadata)
# es_task_handler return logs grouped by host. wrap other handler returning log string
# with default/ empty host so that UI can render the response in the same way
logs[i] = log if self._read_grouped_logs() else [(task_instance.hostname, log)]
metadata_array[i] = metadata
return logs, metadata_array
def _init_file(self, ti):
"""
Create log directory and give it correct permissions.
:param ti: task instance object
:return: relative log path of the given task instance
"""
# To handle log writing when tasks are impersonated, the log files need to
# be writable by the user that runs the Airflow command and the user
# that is impersonated. This is mainly to handle corner cases with the
# SubDagOperator. When the SubDagOperator is run, all of the operators
# run under the impersonated user and create appropriate log files
# as the impersonated user. However, if the user manually runs tasks
# of the SubDagOperator through the UI, then the log files are created
# by the user that runs the Airflow command. For example, the Airflow
# run command may be run by the `airflow_sudoable` user, but the Airflow
# tasks may be run by the `airflow` user. If the log files are not
# writable by both users, then it's possible that re-running a task
# via the UI (or vice versa) results in a permission error as the task
# tries to write to a log file created by the other user.
relative_path = self._render_filename(ti, ti.try_number)
full_path = os.path.join(self.local_base, relative_path)
directory = os.path.dirname(full_path)
# Create the log file and give it group writable permissions
# TODO(aoen): Make log dirs and logs globally readable for now since the SubDag
# operator is not compatible with impersonation (e.g. if a Celery executor is used
# for a SubDag operator and the SubDag operator has a different owner than the
# parent DAG)
Path(directory).mkdir(mode=0o777, parents=True, exist_ok=True)
if not os.path.exists(full_path):
open(full_path, "a").close()
# TODO: Investigate using 444 instead of 666.
os.chmod(full_path, 0o666)
return full_path
Maybe a final note: the links in the Airflow UI to the logging will now only open the latest log file, not the older rotated files, which are only accessible by means of SSH or any other interface to the Airflow logging path.
I don't think that there is a rotation mechanism, but you can store them in S3 or Google Cloud Storage as described here: https://airflow.incubator.apache.org/configuration.html#logs
