Difference between write() and printf() - unix

Recently I am studying operating system..I just wanna know:
What’s the difference between a system call (like write()) and a standard library function (like printf())?

A system call is a call to a function that is not part of the application but is inside the kernel. The kernel is a software layer that provides you some basic functionalities to abstract the hardware to you. Roughly, the kernel is something that turns your hardware into software.
You always ultimately use write() to write anything on a peripheral whatever is the kind of device you write on. write() is designed to only write a sequence of bytes, that's all and nothing more. But as write() is considered too basic (you may want to write an integer in ten basis, or a float number in scientific notation, etc), different libraries are provided to you by different kind of programming environments to ease you.
For example, the C programming langage gives you printf() that lets you write data in many different formats. So, you can understand printf() as a function that convert your data into a formatted sequence of bytes and that calls write() to write those bytes onto the output. But C++ gives you cout; Java System.out.println, etc. Each of these functions ends to a call to write() (at least on POSIX systems).
One thing to know (important) is that such a system call is costly! It is not a simple function call because you need to call something that is outside of your own code and the system must ensure that you are not trying to do nasty things, etc. So it is very common in higher print-like function that some buffering is built-in; such that write is not always called, but your data are kept into some hidden structure and written only when it is really needed or necessary (buffer is full or you really want to see the result of your print).
This is exactly what happens when you manage your money. If many people gives you 5 bucks each, you won't go deposit each to the bank! You keep them on your wallet (this is the print) up to the point it is full or you don't want to keep them anymore. Then you go to the bank and make a big deposit (this is the write). And you know that putting 5 bucks to your wallet is much much faster than going to the bank and make the deposit. The bank is the kernel/OS.

System calls are implemented by the operating system, and run in kernel mode. Library functions are implemented in user mode, just like application code. Library functions might invoke system calls (e.g. printf eventually calls write), but that depends on what the library function is for (math functions usually don't need to use the kernel).

System Call's in OS are used in interacting with the OS. E.g. Write() could be used something into the system or into a program.
While Standard Library functions are program specific, E.g. printf() will print something out but it will only be in GUI/command line and wont effect system.
Sorry couldnt comment, because i need 50 reputation to comment.
EDIT: Barmar has good answer

I am writing a small program. At the moment it just reads each line from stdin and prints it to stdout. I can add a call to write in the loop, and it would add a few characters at the end of each line. But when I use printf instead, then all the extra characters are clustered and appear all at once, instead of appearing on each line.
It seems that using printf causes stderr to be buffered. Adding fflush(stdout); after calling printf fixes the discrepancy in output.

I'd like to mention another point that the stdio buffers are maintained in a process’s user-space memory, while system call write transfers data directly to a kernel buffer. It means that if you fork a process after write and printf calls, flushing may bring about to give output three times subject to line-buffering and block-buffering, two of them belong to printf call since stdio buffers are duplicated in the child by fork.

printf() is one of the APIs or interfaces exposed to user space to call functions from C library.
printf() actually uses write() system call. The write() system call is actually responsible for sending data to the output.

Related

Is there a way to simplify OpenCl kernels usage ?

To use OpenCL kernel the following is needed:
Put the kernel code in a string
call clCreateProgramWithSource
call clBuildProgram
call clCreateKernel
call clSetKernelArg (x number of arguments)
call clEnqueueNDRangeKernel
This need to be done for each kernel. Is there a way to do this repeating less code for each kernel?
There is no way to speed up the process. You need to go step by step as you listed.
But it is important to know why it is needed these steps, to understand how flexible the chain is.
clCreateProgramWithSource: Allows to add different strings from different sources to generate the program. Some string might be static, but some might be downloaded from a server, or loaded from disk. It allows the CL code to be dynamic and updated over time.
clBuildProgram: Builds the program for a given device. Maybe you have 8 devices, so you need to call this multiple times. Each device will produce a different binary code.
clCreateKernel: Creates a kernel. But a kernel is an entry point in a binary. So it is possible you create multiple kernels from a program (for different functions). Also the same kernel might be created multiple times, since it holds the arguments. This is useful for having ready-to-be-launched instances with proper parameters.
clSetKernelArg: Changes the parameters in the instance of the kernel. (it is stored there, so it can used multiple times in the future).
clEnqueueNDRangeKernel: Launches it, configuring the size of the launch and the chain of dependencies with other operations.
So, even if you could have a way to just call "getKernelFromString()", the functionality will be very limited, and not very flexible.
You can have look at wrapper libraries
https://streamhpc.com/knowledge/for-developers/opencl-wrappers/
I suggest you look into SYCL. The building steps are performed offline, saving execution time by skipping the clCreateProgramWithSource. The argument setting is done automatically by the runtime, extracting the information from the user lambda
There is also CLU: https://github.com/Computing-Language-Utility/CLU - see https://www.khronos.org/assets/uploads/developers/library/2012-siggraph-opencl-bof/OpenCL-CLU-and-Intel-SIGGRAPH_Aug12.pdf for more info. It is a very simple tool, but should make life a bit easier.

Writing an entire array to QFile

The QFile::write() documentation says:
Writes at most maxSize bytes of data from data to the device. Returns the number of bytes that were actually written, or -1 if an error occurred.
(Yes that's the entire documentation - unusually poor for Qt.)
This seems to imply that it is common for it not to write all of the data you pass. Does that mean I should call write() in a loop, passing in the remaining unwritten data until it has fully written it?
If so that seems like a pain. Is there some convenience function in Qt that will do it for me?
This seems to imply that it is common for it not to write all of the data you pass
Common, no, but it is possible, which we'll see, below.
Is there some convenience function in Qt that will do it for me?
If you're writing a whole file, a better method would be to use QSaveFile, for "safely writing to files".
As it states in the documentation for QSaveFile:
QSaveFile automatically detects errors while writing, such as the full partition situation, where write() cannot write all the bytes.
So a full partition would be one instance that could cause a QIODevice::write to fail to fully write data to disk. Disk failure, removal of a portable storage device or network drive during a write to those can also cause only partial data to be written.
Yes that's the entire documentation - unusually poor for Qt.
There's not much more to it.
Does that mean I should call write() in a loop
Something has to prevent write from finishing, perhaps the missing bit is that it won't fail to write maxBytes for no reason. I.e. it's not done to spite you, but because write can't proceed.
If you remove the condition that caused it to fail you can call it again. In most cases, there's no point in retrying - when write fails to write what you meant to write, you're done and you handle it as an error.

Why does zumero_sync need to be called multiple times?

According to the documentation for zumero_sync:
If a large amount of information needs to be pulled from the server,
this function may need to be called more than once.
In my Android app that uses Zumero that's no problem; I just keep calling zumero_sync until the return value doesn't start with "0;".
However, now I'm trying to write an admin script that also syncs with my server dbfiles. I'd like to use the sqlite3 shell, and have the script pass the SQL to execute via command line arguments. I need to call zumero_sync in a loop (which SQLite doesn't support) to make sure the db is fully synced. If I had to, I could invoke sqlite3 in a loop (reading its output, looking for "0;"), or even write a C++ app to call the SQLite/Zumero functions natively. But it certainly would be easier if a single zumero_sync was enough.
I guess my real question is: could zumero_sync be changed so it completes the sync before returning? If there are cases where the existing behavior is more useful, maybe there could be a parameter for specifying which mode to use?
I see two basic questions here:
(1) Why does zumero_sync() work the way it does?
(2) Can it work differently?
I'll answer (2) first, since it's easier: Yes, it could work differently. Rather, we could (and probably will, soon, you brought this up) implement an additional function, named something like zumero_sync_complete(), which performs [the guts of] zumero_sync() in a loop and returns after the sync is complete.
We didn't implement zumero_sync_complete() because it doesn't add much value. It's a simple loop, so you can darn well write it yourself. :-)
Er, except in scripting environments which don't support loops. Like the sqlite3 shell.
Answer to (1):
The Zumero sync protocol is designed to give the server the flexibility to return partial results if it wants to do so. And for the sake of reducing load on the server (and increasing its scalability) it often does want to do exactly that.
Given that, one reason to expose this to the client is to increase the client's flexibility as well. As long we're making multiple roundtrips, we might as well give the client an opportunity to do something (like, maybe, update a progress bar) in between them.
Another thing a client might want to do in between loop iterations is handle an error.
Or, in the case of a multithreaded client, it might want to deal with changes that happened on the client while the sync is going on.
Which raises the question of how locking should be managed? Do we hold the sqlite write lock during the entire loop? Or only when absolutely necessary?
Bottom line: A robust app would probably want to implement the loop itself so that it can make its own decisions and retain full control over things.
But, as you observe, the sqlite3 shell doesn't have loops. And it's not an app. And it doesn't have threads. Or progress bars. So it's a use case where a simpler-and-less-powerful form of zumero_sync() would make sense.

Real-time control of Windows Console game

another quick question, I want to make simple console based game, nothing too fancy, just to have some weekend project to get more familiar with C. Basically I want to make tetris, but I end up with one problem:
How to let the game engine go, and in the same time wait for input? Obviously cin or scanf is useless for me.
You're looking for a library such as ncurses.
Many Rogue-like games are written using ncurses or similar.
There's two ways to do it:
The first is to run two threads; one waits for input and updates state accordingly while the other runs the game.
The other (more common in game development) way is to write the game as one big loop that executes many times a second, updating game state, redrawing the screen, and checking for input.
But instead of blocking when you get key input, you check for the presence of pending keypresses, and if nothing has happened, you just continue through your loop. If you have multiple input sources (keyboard, network, etc.) they all get put there in the loop, checking one after another.
Yes, it's called polling. No, it's not efficient. But high-end games are usually all about pulling the maximum performance and framerates out of the computer, not running cool.
For added efficiency, you can optionally block with a timeout -- saying "wait for a keypress, but no longer than 300 milliseconds" so you can continue on with your loop.
select() comes to mind, but there are other ways of waiting or checking for input as well.
You could work out how to change stdin to non-blocking, which would enable you to write something like tetris, but the game might be more directly expressed in an event-driven paradigm. Maybe it's a good excuse to learn windows programming.
Anyway, if you want to go the console route, if you are using the microsoft compiler, then you should have kbhit() available (via conio.h) which can tell you whether a call to fgetc on stdin would block.
Actually should mention that the MinGW gcc compiler 3.4.5 also supports kbhit().

Multitasking using setjmp, longjmp

is there a way to implement multitasking using setjmp and longjmp functions
You can indeed. There are a couple of ways to accomplish it. The difficult part is initially getting the jmpbufs which point to other stacks. Longjmp is only defined for jmpbuf arguments which were created by setjmp, so there's no way to do this without either using assembly or exploiting undefined behavior. User level threads are inherently not portable, so portability isn't a strong argument for not doing it really.
step 1
You need a place to store the contexts of different threads, so make a queue of jmpbuf stuctures for however many threads you want.
Step 2
You need to malloc a stack for each of these threads.
Step 3
You need to get some jmpbuf contexts which have stack pointers in the memory locations you just allocated. You could inspect the jmpbuf structure on your machine, find out where it stores the stack pointer. Call setjmp and then modify its contents so that the stack pointer is in one of your allocated stacks. Stacks usually grow down, so you probably want your stack pointer somewhere near the highest memory location. If you write a basic C program and use a debugger to disassemble it, and then find instructions it executes when you return from a function, you can find out what the offset ought to be. For example, with system V calling conventions on x86, you'll see that it pops %ebp (the frame pointer) and then calls ret which pops the return address off the stack. So on entry into a function, it pushes the return address and frame pointer. Each push moves the stack pointer down by 4 bytes, so you want the stack pointer to start at the high address of the allocated region, -8 bytes (as if you just called a function to get there). We will fill the 8 bytes next.
The other thing you can do is write some very small (one line) inline assembly to manipulate the stack pointer, and then call setjmp. This is actually more portable, because in many systems the pointers in a jmpbuf are mangled for security, so you can't easily modify them.
I haven't tried it, but you might be able to avoid the asm by just deliberately overflowing the stack by declaring a very large array and thus moving the stack pointer.
Step 4
You need exiting threads to return the system to some safe state. If you don't do this, and one of the threads returns, it will take the address right above your allocated stack as a return address and jump to some garbage location and likely segfault. So first you need a safe place to return to. Get this by calling setjmp in the main thread and storing the jmpbuf in a globally accessible location. Define a function which takes no arguments and just calls longjmp with the saved global jmpbuf. Get the address of that function and copy it to your allocated stacks where you left room for the return address. You can leave the frame pointer empty. Now, when a thread returns, it will go to that function which calls longjmp, and jump right back into the main thread where you called setjmp, every time.
Step 5
Right after the main thread's setjmp, you want to have some code that determines which thread to jump to next, pulling the appropriate jmpbuf off the queue and calling longjmp to go there. When there are no threads left in that queue, the program is done.
Step 6
Write a context switch function which calls setjmp and stores the current state back on the queue, and then longjmp on another jmpbuf from the queue.
Conclusion
That's the basics. As long as threads keep calling context switch, the queue keeps getting repopulated, and different threads run. When a thread returns, if there are any left to run, one is chosen by the main thread, and if none are left, the process terminates. With relatively little code you can have a pretty basic cooperative multitasking setup. There are more things you probably want to do, like implement a cleanup function to free the stack of a dead thread, etc. You can also implement preemption using signals, but that is much more difficult because setjmp doesn't save the floating point register state or the flags registers, which are necessary when the program is interrupted asynchronously.
It may be bending the rules a little, but GNU pth does this. It's possible, but you probably shouldn't try it yourself except as an academic proof-of-concept exercise, use the pth implementation if you want to do it seriously and in a remotely portable fashion -- you'll understand why when you read the pth thread creation code.
(Essentially it uses a signal handler to trick the OS into creating a fresh stack, then longjmp's out of there and keeps the stack around. It works, evidently, but it's sketchy as hell.)
In production code, if your OS supports makecontext/swapcontext, use those instead. If it supports CreateFiber/SwitchToFiber, use those instead. And be aware of the disappointing truth that one of the most compelling use of coroutines -- that is, inverting control by yielding out of event handlers called by foreign code -- is unsafe because the calling module has to be reentrant, and you generally can't prove that. This is why fibers still aren't supported in .NET...
This is a form of what is known as userspace context switching.
It's possible but error-prone, especially if you use the default implementation of setjmp and longjmp. One problem with these functions is that in many operating systems they'll only save a subset of 64-bit registers, rather than the entire context. This is often not enough, e.g. when dealing with system libraries (my experience here is with a custom implementation for amd64/windows, which worked pretty stable all things considered).
That said, if you're not trying to work with complex external codebases or event handlers, and you know what you're doing, and (especially) if you write your own version in assembler that saves more of the current context (if you're using 32-bit windows or linux this might not be necessary, if you use some versions of BSD I imagine it almost definitely is), and you debug it paying careful attention to the disassembly output, then you may be able to achieve what you want.
I did something like this for studies.
https://github.com/Kraego/STM32L476_MiniOS/blob/main/Usercode/Concurrency/scheduler.c
The context/thread switching is done by setjmp/longjmp. The difficult part was to get the allocated stack correct (see allocateStack()) this depends on your platform.
This is just a demonstration how this could work, I would never use this in production.
As was already mentioned by Sean Ogden,
longjmp() is not good for multitasking, as
it can only move the stack upward and can't
jump between different stacks. No go with that.
As mentioned by user414736, you can use getcontext/makecontext/swapcontext
functions, but the problem with those is that
they are not fully in user-space. They actually
call the sigprocmask() syscall because they switch
the signal mask as part of the context switching.
This makes swapcontext() much slower than longjmp(),
and you likely don't want the slow co-routines.
To my knowledge there is no POSIX-standard solution to
this problem, so I compiled my own from different
available sources. You can find the context-manipulating
functions extracted from libtask here:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/mcontext
The functions are:
getmcontext(), setmcontext(), makemcontext() and swapmcontext().
They have the similar semantic to the standard functions with similar names,
but they also mimic the setjmp() semantic in that getmcontext()
returns 1 (instead of 0) when jumped to by setmcontext().
On top of that you can use a port of libpcl, the coroutine library:
https://github.com/dosemu2/dosemu2/tree/devel/src/base/lib/libpcl
With this, it is possible to implement the fast cooperative user-space
threading. It works on linux, on i386 and x86_64 arches.

Resources