OCaml Unix Error - unix

I have run into an error that I am not sure how to debug. The error is Exception: (Unix.Unix_error "Too many open files" pipe ""). I am not opening any files and only have a single Unix process open. Anybody have some tips on how to debug this?
The function causing the error is:
let rec update_act_odrs ?(sec_to_wait = 0.0) () =
  try
    (act_odrs := active_orders ())
    |> fun _ -> Lwt_io.print "active_orders Updated\n"
  with _ ->
    Lwt_unix.sleep sec_to_wait
    >>= update_act_odrs ~sec_to_wait:(sec_to_wait +. 1.0)
where active_orders () is a function that gets JSON data from a server.

I would suggest using ltrace to trace calls to the open or pipe functions. It is also a good idea to grep your codebase for functions that usually open descriptors, e.g. openfile, socket, pipe, popen.
Also, be aware that the failing function is not always the root of the evil: the descriptors can be consumed by some other function in your process, or by another process entirely.
You can also look at the /proc filesystem, if you're on Linux, to make sure that your process is actually the one consuming so many fds. Maybe another process on your system, launched by your application, is responsible for the fd leak.
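For example, on Linux every open descriptor shows up as a symlink under /proc/<pid>/fd. A small helper script (sketched here in Python; the file name fd_count.py is just for illustration) can show how many descriptors a given process holds and what they point to:
# Sketch: list the open file descriptors of a process via /proc (Linux only).
# Usage: python fd_count.py [pid]   (defaults to this script's own process)
import os
import sys

pid = sys.argv[1] if len(sys.argv) > 1 else "self"
fd_dir = "/proc/%s/fd" % pid

fds = os.listdir(fd_dir)
print("%d open descriptors" % len(fds))
for fd in sorted(fds, key=int):
    try:
        # Each entry is a symlink to the underlying file, socket or pipe.
        print(fd, "->", os.readlink(os.path.join(fd_dir, fd)))
    except OSError:
        pass  # the descriptor was closed between listdir() and readlink()
Running it repeatedly while your program loops will show whether the count keeps growing and whether the leaked entries are sockets, pipes, or regular files.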
Finally, from the code you've shown, I can conclude that the only possible source of the leak is the active_orders function. If it downloads JSON data from a server, it should open a socket connection. The fact that the error message points to pipe is strange; maybe it is implemented with the popen or system functions.

Related

How to prevent "Execution failed:[Errno 32] Broken pipe" in Airflow

I just started using Airflow to coordinate our ETL pipeline.
I encountered the pipe error when I run a dag.
I've seen a general stackoverflow discussion here.
My case is more on the Airflow side. According to the discussion in that post, the possible root cause is:
The broken pipe error usually occurs if your request is blocked or takes too long; after the request-side timeout, the client closes the connection, and then, when the response side (the server) tries to write to the socket, it throws a broken pipe error.
This might be the real cause in my case: I have a PythonOperator that starts another job outside of Airflow, and that job can be very lengthy (i.e. 10+ hours). I wonder what mechanism Airflow provides that I can leverage to prevent this error.
Can anyone help?
UPDATE1 20190303-1:
Thanks to @y2k-shubham for the SSHOperator: I was able to use it to set up an SSH connection successfully and to run some simple commands on the remote side (indeed, the default SSH connection has to point to localhost because the job runs on localhost), and I can see the correct results of hostname and pwd.
However, when I attempted to run the actual job, I received the same error; again, the error comes from the pipeline job rather than from the Airflow DAG/task.
UPDATE2 20190303-2:
I had a successful run (airflow test) with no error, followed by another failed run (scheduler) with the same error from the pipeline.
While I'd suggest you keep looking for a more graceful way of achieving what you want, I'm putting up example usage as requested.
First you've got to create an SSHHook. This can be done in two ways:
The conventional way, where you supply all requisite settings such as host, user, password (if needed), etc. from the client code where you are instantiating the hook. I'm citing an example from test_ssh_hook.py here, but you should thoroughly go through SSHHook as well as its tests to understand all possible usages.
ssh_hook = SSHHook(remote_host="remote_host",
                   port="port",
                   username="username",
                   timeout=10,
                   key_file="fake.file")
The Airflow way, where you put all connection details inside a Connection object that can be managed from the UI, and only pass its conn_id to instantiate your hook:
ssh_hook = SSHHook(ssh_conn_id="my_ssh_conn_id")
Of course, if you're relying on SSHOperator, then you can directly pass the ssh_conn_id to the operator.
ssh_operator = SSHOperator(ssh_conn_id="my_ssh_conn_id")
Now if you're planning to have a dedicated task for running a command over SSH, you can use SSHOperator. Again I'm citing an example from test_ssh_operator.py, but go through the sources for a better picture.
task = SSHOperator(task_id="test",
                   command="echo -n airflow",
                   dag=self.dag,
                   timeout=10,
                   ssh_conn_id="ssh_default")
But then you might want to run a command over SSH as part of your bigger task. In that case, you don't want an SSHOperator; you can still use just the SSHHook. The get_conn() method of SSHHook gives you an instance of paramiko's SSHClient, with which you can run a command using the exec_command() call:
ssh_hook = SSHHook(ssh_conn_id="ssh_default")
ssh_client = ssh_hook.get_conn()   # paramiko SSHClient
my_command = "echo airflow"
stdin, stdout, stderr = ssh_client.exec_command(
    command=my_command,
    get_pty=my_command.startswith("sudo"),
    timeout=10)
If you look at SSHOperator's execute() method, it is a rather complicated (but robust) piece of code trying to achieve a very simple thing. For my own usage, I have created some snippets that you might want to look at:
For using SSHHook independently of SSHOperator, have a look at ssh_utils.py
For an operator that runs multiple commands over SSH (you can achieve the same thing by using bash's && operator), see MultiCmdSSHOperator
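As an aside, the &&-chaining mentioned in the last point needs nothing beyond the SSHOperator shown above; here is a minimal sketch in which the task id, directory, and script names are made up for illustration:
chained_task = SSHOperator(task_id="run_all_steps",
                           ssh_conn_id="ssh_default",
                           # && stops the chain at the first failing step
                           command="cd /tmp/jobs && ./step_one.sh && ./step_two.sh",
                           timeout=10,
                           dag=dag)  # assumes a `dag` object is in scope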

Delaying part of an R script inside of a loop

I'm executing a batch file inside an R script. I'd like to run this and another large section of the R script twice using a foreach loop.
foreach (i = 1:2, .combine = rbind) %do% {
    shell.exec("\\\\network\\path\\to\\batch\\script.ext")
    # ... rest of the R script ...
}
One silly problem, though, is that this batch file generates data, and that data is connected to SQL Server localdb inside the loop. I thought at first that the script would execute the batch file, wait for it to finish, and then move on. However (seems obvious in hindsight), the script instead executes the batch file, tries to grab data that hasn't been created yet (because the batch file hasn't finished running), and then executes the batch file again before the first run finishes.
I've been trying to find a way to delay the rest of the script from executing until the batch script has finished, but have not come up with anything yet. I'd appreciate any insights anyone has.
Use system2 instead of shell.exec. system2 calls are blocking, meaning the function waits until the external program has finished running. On most systems this can be used directly to run scripts. On Windows, you may have to invoke rundll32 to execute a script:
cmd = c('rundll32.exe', 'Shell32.dll,ShellExecute', 'NULL', 'open', scriptpath)
system2(paste(shQuote(cmd), collapse = ' '))
Windows users may instead use shell, which defaults to wait=TRUE and therefore causes R to wait for the command's completion. You may choose whether or not to directly "intern" the result.
On Unix-like systems, use system, which also defaults to wait=TRUE.
If your batch file simply launches another process and terminates, then it may need to be modified to either wait for completion or return a suitable process or file indicator that can be monitored.
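The underlying distinction is language-agnostic: a blocking launch returns only after the child process has exited, whereas a fire-and-forget launch (which is roughly what shell.exec does) returns immediately. A small Python sketch of the same idea, reusing the network path from the question:
import subprocess

# Blocking: run() returns only once the batch script has exited, so any code
# after this line can safely read the data the script produced.
subprocess.run(["cmd", "/c", r"\\network\path\to\batch\script.ext"], check=True)

# Fire-and-forget: Popen() returns immediately; you would have to call
# proc.wait() yourself before touching the script's output.
proc = subprocess.Popen(["cmd", "/c", r"\\network\path\to\batch\script.ext"])
proc.wait()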

Erlang stopping application doesn't end all processes?

When I stop an Erlang application that I built, the cowboy listener process stays alive and continues to handle requests. In the gen_server that I wrote, I start the server in init, as you can see below:
init([Port]) ->
    Dispatch = cowboy_router:compile([
        {'_', [
            {"/custom/[...]", ?MODULE, []},
            % Serve index.html as default file
            % Serve entire directory
            {"/[...]", cowboy_static, {priv_dir, app, "www"}}
        ]}
    ]),
    Name = custom_name,
    {ok, Pid} = cowboy:start_http(Name, 100,
                                  [{port, Port}],
                                  [{env, [{dispatch, Dispatch}]}]),
    {ok, #state{handler_pid = Pid}}.
This starts the Cowboy HTTP server, which uses cowboy_static to serve some stuff in the priv/app/ dir and the current module to handle custom stuff (the module implements all the Cowboy HTTP handler callbacks). It takes the pid returned from the call and assigns it to handler_pid in the state record. This all works. However, when I start up the application containing this module (which works) and then stop it, all processes end (at least the ones in my application). The custom handler (which is implemented in the same module as the gen_server) no longer works, but the cowboy_static handler continues to handle requests. It keeps serving static files until I kill the node. I tried fixing this by adding this to the gen_server:
terminate(_Reason, State) ->
    exit(State#state.handler_pid, normal),
    cowboy:stop_listener(listener_name()),
    ok.
But nothing changes. The cowboy_static handler continues to serve static files.
Questions:
Am I doing anything wrong here?
Is cowboy_static running under the cowboy application? I assume it is.
If so, how do I stop it?
And also, should I be concerned about stopping it? Maybe this isn't that big a deal.
Thanks in advance!
I don't think it is really important; generally you use one node/VM per application (in fact a bunch of Erlang applications working together, but I don't have a better word for it). But I think you can stop the server using application:stop(cowboy), application:stop(ranch).
You should fix 3 things:
the symbol in start_http(Name, ...) and stop_listener(Name) should match.
trap exit in service init: process_flag(trap_exit, true)
remove exit call from terminate.

Haskell System.Timeout.timeout crashing when called from certain function

I'm scraping some data from the frontpages of a list of website domains. Some of them are not answering, or are very slow, causing the scraper to halt.
I wanted to solve this by using a timeout. The various HTTP libraries available don't seem to support that, but System.Timeout.timeout seems to do what I need.
Indeed, it seems to work fine when I test the scraping function, but it crashes as soon as I run the enclosing function: (Sorry for bad/ugly code. I'm learning.)
fetchPage domain =
  -- Try to read the file from disk.
  catch
    (System.IO.Strict.readFile $ "page cache/" ++ domain)
    (\e -> downloadAndCachePage domain)

downloadAndCachePage domain =
  catch
    (do
      -- Failed, so try to download it.
      -- This crashes when called by fetchPage, but works fine when called directly.
      maybePage <- timeout 5000000 (simpleHTTP (getRequest ("http://www." ++ domain)) >>= getResponseBody)
      let page = fromMaybe "" maybePage
      -- This mostly works, but won't time out if the domain is slow. (lswb.com.cn)
      -- page <- (simpleHTTP (getRequest ("http://www." ++ domain)) >>= getResponseBody)
      -- Cache it.
      writeFile ("page cache/" ++ domain) page
      return page)
    (\e -> catch
      (do
        -- Failed, so just fuggeddaboudit.
        writeFile ("page cache/" ++ domain) ""
        return "")
      (\e -> return "")) -- Failed BIG, so just don't give a crap.
downloadAndCachePage works fine with the timeout when called from the REPL, but fetchPage crashes. If I remove the timeout from downloadAndCachePage, fetchPage works.
Anyone who can explain this, or know an alternative solution?
Your catch handler in fetchPage looks wrong: it seems you're trying to read a file, and on a file-not-found exception you are directly calling into your HTTP function from the exception handler. Don't do this. For complicated reasons, as I recall, code in exception handlers doesn't always behave like normal code, particularly when it attempts to handle exceptions itself. And indeed, under the covers, timeout uses asynchronous exceptions to kill threads.
In general, you should put as little code as possible in exception handlers, and especially not put code that tries to handle further exceptions (although it is generally fine to reraise a handled exception to "pass it on" [as with bracket]).
That said, even if you're not doing the right thing, a crash (if it is a segfault type crash as opposed to a <<loop>> type crash), even from weird code, is nearly always wrong behavior from GHC, and if you're on GHC 7 then you should consider reporting this.

How can I wrap an executable on UNIX (SunOS) so that it is never run more than once at the same time?

I have an executable (no source) that I need to wrap, to make sure that it is not called more than once at a time. I immediately think of some sort of queue wrapper, but how do I actually make it so that my wrapper is called instead of the executable itself? Is there a better way to do this? The solution needs to be invisible because the users are other applications. Any information/recommendations are appreciated.
Method 1: Put the executable in some location not in the standard path. Create a shell script that checks for a sentinel file and, if the sentinel file is absent, creates it, executes the program, waits for the program to complete, and then deletes the sentinel file. If the sentinel file is present, the script enters a loop with a short delay (1 second? How long does a typical run of this program take? Take that and halve it), checks for the sentinel file again, and so on.
Method 2: Create a separate program that does the same thing as the script, but using a system-level semaphore or lock instead. You could even simply use a read/write lock on a file. The program would fork() and exec() the real program, waiting for the child to exit before releasing the lock.
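A minimal sketch of Method 2's locking idea, written here in Python purely for brevity (a small C program using fcntl locks would work the same way); the program and lock-file paths below are hypothetical:
# Sketch of a wrapper that serialises runs of the real executable with a file lock.
# /usr/local/libexec/myprog.real and /var/tmp/myprog.lock are made-up paths.
import fcntl
import subprocess
import sys

REAL_PROGRAM = "/usr/local/libexec/myprog.real"
LOCK_FILE = "/var/tmp/myprog.lock"

with open(LOCK_FILE, "w") as lock:
    # lockf() blocks until no other wrapper instance holds the lock,
    # so at most one copy of the real program runs at a time.
    fcntl.lockf(lock, fcntl.LOCK_EX)
    rc = subprocess.call([REAL_PROGRAM] + sys.argv[1:])
    # The lock is released automatically when the file is closed.

sys.exit(rc)
Because every invocation of the wrapper goes through the same lock file, a second caller simply blocks until the first run finishes, which is exactly the serialisation being asked for.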
If the users are other applications, you can just rename the executable (e.g. name -> name.real) and call the wrapper with the original name. To make sure that it's only called once at a time, you can use the pidof command (e.g. pidof name.real) to check if the program is running already (pidof actually gives you the PID of the running process, so that you can use stuff such as kill or whatever to send signals to it).
