In microcontroller architecture, regarding the SRAM memory segments:
As I understand it, uninitialized global variables are allocated in the .bss segment of SRAM (and initialized ones are allocated in the .data segment).
So, the question: when the value of a global variable is updated during the run time of the program, does the global variable travel from the .bss segment to the .data segment?
No. The location of the variable is set by the compiler. It makes sure all references to the global (and static) variables point to the correct location.
The reason that there are two segments is that they are treated differently by the C runtime, a portion of code that is added by the compiler to your program and that runs before jumping to your code.
The variables that are initialized live in the .data section. The C runtime has a function that copies the initial values for these variables from the end of the program code space into SRAM. These are most often placed at the start of RAM.
The variables that are not initialized need no such values in the program code. Instead, the C runtime runs a function that initializes all these variables to 0. (Some C runtimes don't even do this; for those runtimes, a variable will have whatever value is already in SRAM when your program begins.) These variables usually occupy the RAM directly after the initialized ones.
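As an illustration, here is a minimal sketch of such C runtime startup code for a small ARM-style MCU. The linker-script symbol names (_sidata, _sdata, and so on) are assumptions; every toolchain names them differently.

    /* Symbols assumed to be defined by the linker script. */
    extern unsigned long _sidata; /* initial values for .data, stored in flash */
    extern unsigned long _sdata;  /* start of .data in SRAM */
    extern unsigned long _edata;  /* end of .data in SRAM */
    extern unsigned long _sbss;   /* start of .bss in SRAM */
    extern unsigned long _ebss;   /* end of .bss in SRAM */

    extern int main(void);

    void Reset_Handler(void)
    {
        unsigned long *src = &_sidata;
        unsigned long *dst;

        /* Copy initial values for .data from flash into SRAM. */
        for (dst = &_sdata; dst < &_edata; )
            *dst++ = *src++;

        /* Zero-fill .bss. */
        for (dst = &_sbss; dst < &_ebss; )
            *dst++ = 0;

        main(); /* only now jump to your code */
    }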
The end of RAM is the stack, and the heap starts after the uninitialized globals. Often, there is no mechanism to prevent these from overwriting each other. This is the famous "Stack Overflow".
Related
There are [] operations which are similar to dereferencing in high-level languages. So does that mean that every variable and register name is just a pointer, or are pointers a high-level-language idea that has no use in assembly?
Pointers are a useful concept in asm, but I wouldn't say that symbol names are truly pointers. They're addresses, but they aren't pointers because there's no storage holding them (except metadata, and embedded into the machine code), and thus you can't modify them. A pointer can be incremented.
Variables are a high-level concept that doesn't truly exist in assembly. Asm has labels which you can put at any byte position in any section, including .data, .text, or whatever. Along with directives like dd, you can reserve space for a global variable and attach a symbol to it. (But a variable's value can temporarily be in a register, or for its whole lifetime if it's a local variable. The high-level concept of a variable doesn't have to map to static storage with a label.)
Types like integer vs. pointer also don't really exist in assembly; everything is just bytes that you can load into an integer, XMM, or even x87 FP register. (And you don't really have to think of that as type-punning to integer and back if you use eax to copy the bytes of a float from one memory location to another, you're just loading and storing to copy bytes around.)
But on the other hand, a pointer is a low-enough-level concept to still be highly relevant in assembly. We have the stack pointer (RSP) which usually holds a valid address, pointing to some memory we're using as stack space. (You can use the RSP register to hold values that aren't valid addresses, though. In that case you're not using it as a pointer. But at any time you could execute a push instruction, or mov eax, [rsp], and cause an exception from the invalid address.)
A pointer is an object that holds the address of another object. (I'm using "object" in C terms here: any byte[s] of storage that you can access, including something like an int. Not object as in object-oriented programming.) So a pointer is basically a type of integer data, especially in assembly for a flat memory model where there aren't multiple components to it. For a segmented memory model, a seg:off far pointer is a pair of integers.
So any valid address stored anywhere in register or memory can usefully be thought of as a pointer.
But no, a symbol defined by a label is not a pointer. Conceptually, I think it's important to think of it as just a label. A pointer is itself an object (some bytes of storage), and thus can be modified. e.g. increment a pointer. But a symbol is just a way to reference some fixed position.
In C terms, a symbol is like a char symbol[], not a char *symbol = NULL; If you use bare symbol, you get the address. Like mov edi, symbol in NASM syntax. (Or mov edi, OFFSET symbol in GNU .intel_syntax or MASM. See also How to load address of function or label into register for practical considerations like using RIP-relative LEA if 32-bit absolute addresses don't work.)
You can deref any symbol in asm to access the bytes there, whether that's mov eax, [main] to load the first 4 bytes of machine code of that function, or mov eax, [global_arr + rdi*8] to index into an array, or any other x86 addressing mode. (Caveat: 32-bit absolute addresses no longer allowed in x86-64 Linux? for that last example).
But you can't do arr++; that makes no sense. There is no storage anywhere holding that address. It's embedded into the machine code of your program at each site that uses it. It's not a pointer. (Note that C arr[4] = 1 compiles to different asm depending on char *arr; vs. char arr[], but in asm you have to manually load the pointer value from wherever it's stored, and then deref it with some offset.)
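To make that last point concrete, here is a hedged C sketch of the difference; the asm in the comments is approximate, and the exact instructions depend on the compiler and ABI:

    char arr_storage[8];          /* a label attached to 8 bytes of storage */
    char *arr_ptr = arr_storage;  /* a pointer object holding an address */

    void demo(void)
    {
        arr_storage[4] = 1;  /* one store to a fixed address, roughly:
                                  mov byte [arr_storage + 4], 1 */
        arr_ptr[4] = 1;      /* load the pointer first, then store, roughly:
                                  mov rax, [arr_ptr]
                                  mov byte [rax + 4], 1 */
        arr_ptr++;           /* fine: a pointer is an object we can modify */
        /* arr_storage++;       won't compile: an array name is not a
                                modifiable lvalue, just like a label */
    }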
If you have a label in asm whose address you want to use in C, but that isn't attached to some bytes of storage, you usually want to declare it as extern const char end_of_data_section[]; or whatever.
So for example you can do size_t data_size = data_end - data_start; and get the size in bytes as a link-time constant, if you arranged for those symbols to be at the end/start of your .data section. With a linker script or with global data_end / data_end: in your NASM source. Probably at the same address as some other symbol, for the start.
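A small sketch of that pattern, assuming you placed data_start / data_end yourself via a linker script or labels in the asm source:

    #include <stddef.h>

    /* Array declarations make the bare names yield addresses, per above. */
    extern const char data_start[];
    extern const char data_end[];

    size_t data_section_size(void)
    {
        return (size_t)(data_end - data_start); /* link-time constant */
    }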
Assembly language doesn't have variables, pointers, or type checking. It has addresses and registers.
A variable (e.g. in C) is a higher level thing - a way to abstract the location of a piece of data so that you don't have to care if the compiler felt like putting it in a register or in memory. A variable also has a type; which is used to detect some kinds of bugs, and used to allow the compiler to automatically convert data to a different type for you.
A pointer (e.g. in C) is a variable (see above). The main difference between a pointer and a "not pointer" is the type of data it contains - for a pointer the variable typically contains an address, but a pointer is not just an address, it's the address of something with a type. This is important - if it was just an address then the compiler wouldn't know how big it is, couldn't detect some kinds of bugs, and couldn't automatically convert data (e.g. consider what would happen for int foo = *pointer_to_char; if the compiler didn't know what type of data the pointer points to).
I'm studying for a test next week and we're learning about microcontrollers. We just ran a sample program with interrupts that reported the temperature in F and C when we pushed a button (triggering an interrupt). How can C and F be accessed from both main and the IRQ() functions?
The simplest way to share a variable between IRQ handlers and the main thread on any bare-metal system:
1. Make sure that the variable type is one which the CPU can atomically read and write.
2. Make the variable global and declare it volatile, so that the generated machine code cannot optimize away accesses to the shared variable.
3. To read a value, use something like const atomic_type local_copy = shared_variable; and work with that local copy. An expression like shared_variable * shared_variable might otherwise use two different values of shared_variable.
4. Make sure that only one IRQ handler (which must not run more than once at the same time), or only the main thread, writes to the shared variable. All other parts of the code are only allowed to read the shared variable.
If the data you want to communicate between IRQ handler and main thread does not fit inside an atomic type, have fun with complex locking protocols.
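A minimal C sketch of those rules, assuming a 32-bit MCU where uint16_t loads and stores are atomic; the handler name, scaling, and helpers are made up:

    #include <stdint.h>

    /* Shared between the ISR and main. volatile forces the compiler to
       perform every access instead of caching the value in a register. */
    static volatile uint16_t latest_raw_temp;

    /* Hypothetical ADC interrupt handler: the ONLY writer. */
    void adc_irq_handler(void)
    {
        uint16_t sample = 0; /* read the ADC data register here */
        latest_raw_temp = sample; /* one atomic store */
    }

    /* Main thread: only ever reads the shared variable. */
    void main_loop(void)
    {
        for (;;) {
            /* Take one snapshot; both conversions use the same sample. */
            const uint16_t raw = latest_raw_temp;
            int celsius = (int)raw - 50;          /* made-up scaling */
            int fahrenheit = celsius * 9 / 5 + 32;
            (void)celsius; (void)fahrenheit;      /* display them here */
        }
    }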
One thing I was recently thinking about is how a computer finds its variables. When we run a program, it creates multiple layers on the stack, one layer for every new scope it opens, and puts either the variable's value, or a pointer in case the data is stored on the heap, into this scope. When the scope is done, it and all its variables are destroyed. But how does the computer know where its variables are? And which ones to use if the same variable name occurs more than once?
How I imagine it: the computer searches the scope it is in like an array, and if it doesn't find the variable, it follows the stack downwards like a linked list and searches the next scope like an array.
That leads to the assumption that a global variable is the slowest to use, since the lookup has to traverse all the way back to the outermost scope. So it would have a computational cost of a * n (a = average number of variables per scope, n = number of scopes). If I now assume that my code is recursive and the recursive function refers to a global variable (let's say I have defined the variable const PI = 3.1416 and I use it in every recursion), then it would traverse backwards for every single call, and if my recursive function takes 1000 recursions, it does that 1000 times.
But on the other hand, while learning about recursion, I have never heard that referring to variables not found inside the recursive scope should be avoided if possible. Therefore I wonder whether my thoughts are right. Can someone please shed some light on the issue?
You got it the other way around: scopes, frames, heaps don't make variables, variables make scopes, frames, heaps.
Both are a bit of a stretch actually but my point is to avoid focusing on the lifetime of a variable (that's what terms like heap and stack really mean) and instead take a look under the hood.
Memory is a form of storage where each cell is assigned a number; the cell is called a word and the number is called an address.
The set of addresses is called the address space; an address space is usually a range of addresses, or a union of ranges of addresses.
The compiler assumes the program data will be loaded at a specific address, say X, and that there is enough memory after X (i.e. X+1, X+2, X+3, ... all exist) for all the data.
Variables are then laid out sequentially from X onward; it is the job of the compiler to maintain the association between the address X+k and the variable instance.
Note that a variable may be instantiated more than once; calling a function twice and recursion are both examples of that.
In the first case, the two instances can share the same address X+k, since they don't overlap in time (by the time the second instance is alive, the first is over).
In the second case, the two instances overlap in time, and two addresses must be used.
So we see that it is the lifetime of a variable that determines how the mapping between the variable name and its address (a.k.a. the allocation of the variable) is done.
Two common strategies are:
A stack
We start from an address X+b and allocate new instances at successive addresses X+b+1, X+b+2, etc.
The current address (e.g. X+b+54) is stored somewhere (it is the stack pointer).
When we want to free a variable we set the stack pointer back (e.g. from X+b+54 to X+b+53).
We can see that it's impossible to free a variable that is not the last allocated.
This allows for a very fast allocation/deallocation and naturally fits the need of a function frame that holds the local variables: when a function is invoked the new variables are allocated, when it ends they are removed.
From what we noted above, we see that if f calls g (i.e. f is the parent of g) then the variables of f cannot be deallocated before those of g.
This again naturally fits the semantics of functions.
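A toy bump allocator in C makes the mechanism concrete. This is only an illustration of the strategy, not how a real compiler emits code; real stacks are managed through the CPU's stack pointer register:

    static unsigned char arena[1024];  /* the addresses X+b onward */
    static unsigned int stack_ptr = 0; /* the "stack pointer" */

    void *stack_alloc(unsigned int n)
    {
        void *p = &arena[stack_ptr];
        stack_ptr += n;  /* new instances go at successive addresses */
        return p;
    }

    void stack_free(unsigned int n)
    {
        stack_ptr -= n;  /* only the most recent allocation can go */
    }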
The heap
This strategy dynamically allocates a variable instance at an address X+o.
The runtime reserves a block of addresses and manages their status (free or occupied); when asked, it can hand out a free address and mark it occupied.
This is useful to allocate an object whose size depends on the user input, for example.
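In C this is what malloc does; a minimal example:

    #include <stdlib.h>

    /* The size comes from user input, so the address X+o is only
       known at run time; the heap manager hands out a free block. */
    int *make_buffer(size_t n)
    {
        int *p = malloc(n * sizeof *p);
        return p; /* caller must free(p) when the lifetime ends */
    }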
The static heap
Some variables have the lifespan of the program, but their size and number are known at compile time.
In this case, the compiler simply assigns each instance a unique address X+i.
They cannot be deallocated, they are loaded in memory in batch along with the program code and stay there until the program is unloaded.
I left out some details, like the fact that the stack more often than not grows from higher to lower addresses (so it can be put at the farthest edge of the memory) and that variables occupy more than one address.
Some programming languages, especially interpreted ones, don't associate addresses with variable instances; instead, they keep a map between the (properly qualified) variable name and the variable value. This way the lifespan of a variable can be controlled in many particular ways (see closures in JavaScript).
Global variables are allocated in the static heap; only one instance is present (only one address).
A recursive function that uses one always references the sole instance directly, because its unique address is known at compile time.
Local variables of a function are allocated on the stack, and each invocation of the function (recursive or not) uses a new set of instances (the addresses don't need to be the same each time, but they can be).
Simply put, there is no lookup: variables are allocated so that the code can access them directly once compiled (either relatively, on the stack, or absolutely, in the static heap).
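You can watch this happen with a short C program: the global has one fixed address on every call, while each recursive invocation gets a fresh instance of its local (the printed addresses will vary by platform):

    #include <stdio.h>

    const double PI = 3.1416; /* one instance, address fixed at link time */

    static void recurse(int depth)
    {
        int local = depth; /* a fresh instance on the stack per call */
        printf("depth %d: &PI = %p, &local = %p\n",
               depth, (const void *)&PI, (void *)&local);
        if (depth < 3)
            recurse(depth + 1);
    }

    int main(void)
    {
        recurse(0); /* &PI prints the same every time; &local moves */
        return 0;
    }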
I currently have a kernel that processes a global buffer by reading into local memory and doing calculations. Now, I would like to use registers instead of local memory. How do I convert to registers?
Thanks!
Edit: the project can be found here:
https://github.com/boxerab/ocldwt
Without seeing some code, it's impossible to give much more guidance than has already been given, but I will try to elaborate on the comments.
Any variable declared without __local or __global is private, so if you remove the modifier the memory will only be visible to the single processing element running the work item. This will likely be stored in a register, although that will only happen if there is register space available. The compiler will already be putting some values into registers on your behalf, even if you haven't asked it to do so. You can see evidence of this if, for example, you are running on the NVIDIA platform and pass the -cl-nv-verbose flag when you build your kernels. You will see output like this:
ptxas info : Compiling entry function 'monte' for 'sm_35'
ptxas info : Function properties for monte
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 61 registers, 360 bytes cmem[0], 96 bytes cmem[2]
indicating that 61 registers are in use.
However, as pointed out by @DarkZeros, the decision to move from local memory to private memory is much more about the scope of variables. If your algorithm depends on all of the members of a compute unit having access to the same copy of a variable, then it isn't going to work any more.
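For reference, here is a sketch of the two address spaces side by side; the kernel, names, and work-group size of 64 are illustrative:

    __kernel void demo(__global const float *in, __global float *out)
    {
        __local float tile[64];  /* shared by the whole work-group */
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        tile[lid] = in[gid];     /* stage through local memory */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* No qualifier => private memory: each work item has its own
           copy, which the compiler keeps in a register if it can. */
        float x = tile[lid];
        out[gid] = x * 2.0f;
    }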
In my OpenCL program, I am going to end up with 60+ global memory buffers that each kernel will need to be able to access. What's the recommended way to let each kernel know the location of each of these buffers?
The buffers themselves are stable throughout the life of the application -- that is, we will allocate the buffers at application start, call multiple kernels, then only deallocate the buffers at application end. Their contents, however, may change as the kernels read/write from them.
In CUDA, the way I did this was to create 60+ program scope global variables in my CUDA code. I would then, on the host, write the address of the device buffers I allocated into these global variables. Then kernels would simply use these global variables to find the buffer it needed to work with.
What would be the best way to do this in OpenCL? It seems that CL's global variables are a bit different from CUDA's, but I can't find a clear answer on whether my CUDA method will work, and if so, how to go about transferring the buffer pointers into global variables. If that won't work, what's the best way otherwise?
60 global variables sure is a lot! Are you sure there isn't a way to refactor your algorithm a bit to use smaller data chunks? Remember, each kernel should be a minimum work unit, not something colossal!
However, there is one possible solution. Assuming your 60 arrays are of known size, you could store them all into one big buffer, and then use offsets to access various parts of that large array. Here's a very simple example with three arrays:
A is 100 elements
B is 200 elements
C is 100 elements
big_array = A[0:100] B[0:200] C[0:100]
offsets = [0, 100, 300]
Then, you only need to pass big_array and offsets to your kernel, and you can access each array. For example:
A[50] = big_array[offsets[0] + 50]
B[20] = big_array[offsets[1] + 20]
C[0] = big_array[offsets[2] + 0]
I'm not sure how this would affect caching on your particular device, but my initial guess is "not well." This kind of array access is a little nasty, as well. I'm not sure if it would be valid, but you could start each of your kernels with some code that extracts each offset and adds it to a copy of the original pointer.
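A hedged sketch of that extraction, using the three arrays above (the kernel name and the access pattern are made up):

    __kernel void use_packed(__global float *big_array,
                             __constant uint *offsets)
    {
        /* Recover per-array pointers once, at the top of the kernel. */
        __global float *A = big_array + offsets[0];
        __global float *B = big_array + offsets[1];
        __global float *C = big_array + offsets[2];

        size_t i = get_global_id(0);
        C[i % 100] = A[i % 100] + B[i % 200]; /* arbitrary demo access */
    }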
On the host side, in order to keep your arrays more accessible, you can use clCreateSubBuffer: http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubBuffer.html which would also allow you to pass references to specific arrays without the offsets array.
I don't think this solution will be any better than passing the 60 kernel arguments, but depending on your OpenCL implementation's clSetKernelArg overhead, it might be faster. It will certainly reduce the length of your argument list.
You need to do two things. First, each kernel that uses each global memory buffer should declare an argument for each one, something like this:
kernel void awesome_parallel_stuff(global float* buf1, ..., global float* buf60)
so that each utilized buffer for that kernel is listed. And then, on the host side, you need to create each buffer and use clSetKernelArg to attach a given memory buffer to a given kernel argument before calling clEnqueueNDRangeKernel to get the party started.
Note that if the kernels will keep using the same buffer with each kernel execution, you only need to setup the kernel arguments one time. A common mistake I see, which can bleed host-side performance, is to repeatedly call clSetKernelArg in situations where it is completely unnecessary.
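Host-side, that one-time setup looks roughly like this in C (error handling omitted; the function and variable names are illustrative):

    #include <CL/cl.h>

    static void run_frames(cl_command_queue queue, cl_kernel kernel,
                           cl_mem buffers[60], size_t global_size,
                           int num_frames)
    {
        /* Attach each buffer to its kernel argument ONCE. */
        for (cl_uint i = 0; i < 60; i++)
            clSetKernelArg(kernel, i, sizeof(cl_mem), &buffers[i]);

        /* Then only the enqueue is repeated per execution. */
        for (int frame = 0; frame < num_frames; frame++)
            clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                   &global_size, NULL, 0, NULL, NULL);
    }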