I have a doubt: in all microcontrollers the flash memory is much larger than the RAM (example: the ATmega16 has 16 KB of flash, but only 1 KB of RAM).
So how exactly is the code executed? Does the CPU execute directly from the flash itself? If yes, then what is the use of that small RAM?
The flash memory is for storing the programs that you want to execute. Programs seldom change, so flash memory is appropriate.
The RAM is for the memory required during execution of the program: the stack (local variables), the heap (malloc), etc.
AVRs use a Harvard architecture that strictly separates program and data memory.
Unlike a PC, which first loads the program into RAM and executes it from there, an AVR executes code directly from program memory; only runtime data is stored in RAM.
Be aware that declaring a variable as const does not automatically place it in flash. Although it may or may not be best off in flash, the compiler does not do this by itself.
For an example, check out the following link for avr-gcc:
http://www.nongnu.org/avr-libc/user-manual/pgmspace.html
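For instance, with avr-gcc you mark the data with PROGMEM and read it back through the pgmspace accessors. A minimal sketch (the table contents here are made up):

```c
#include <stdint.h>
#include <avr/pgmspace.h>

/* Stored in flash, not copied to SRAM at startup. */
const uint8_t sine_table[4] PROGMEM = { 0, 49, 90, 117 };

uint8_t read_entry(uint8_t i)
{
    /* Flash must be read through the pgmspace accessors;
       a plain dereference would read from SRAM instead. */
    return pgm_read_byte(&sine_table[i]);
}
```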
The program is written in C and compiled on some other IDE/computer (cross-compiling) and then loaded as binary data into the flash memory of the controller.
What am I not understanding in Bare Metal / No RTOS?
Which program/code takes care of loading from flash to RAM?
Does the RAM in a microcontroller have intelligence/a program to understand the binary, or is the intelligence added to the binary file by the compiler at compile time?
Ideally your program runs in flash, not RAM. On many MCUs you can run from RAM; it is primarily an architecture limit if running from RAM is not supported. In a pinch you can run your code in RAM if you need a trampoline to reprogram the flash, as when downloading new firmware in the field (on a chip with only one flash bank that cannot execute and be erased/modified at the same time), or for performance. But if you need RAM for performance, then perhaps you need to rethink your design: small sections, sure, but if the whole application has to be in RAM for reasons other than development, you need to rethink your system design.
You can easily wrap your program with a small copy-to-RAM piece of code, so that the MCU boots into the copy-and-jump code and then the main application runs in RAM. That is your choice, and it is somewhat trivial, just a few lines of code. Whether you can handle interrupts in that situation, and how you need to design for it, is chip/architecture dependent (it may be more than just a copy and jump; you might need handlers in flash that hop over to RAM too).
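A minimal sketch of that copy-and-jump idea, assuming hypothetical linker-script symbols (__code_load, __ram_start, __code_size) and ignoring interrupts:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical linker-script symbols: where the image sits
   in flash, where it should run in RAM, and its size. */
extern uint8_t __code_load[];   /* load address in flash */
extern uint8_t __ram_start[];   /* run address in RAM    */
extern uint8_t __code_size[];   /* size of the image     */

void boot(void)
{
    /* Copy the application image from flash into RAM. */
    memcpy(__ram_start, __code_load, (size_t)__code_size);

    /* Jump to the entry point of the copied image.
       (On Cortex-M you would OR the address with 1 for Thumb.) */
    void (*app_entry)(void) = (void (*)(void))__ram_start;
    app_entry();
}
```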
There is no magic here; the MCU's processor is no different from others: you need some non-volatile way to get the program in there. Like most other CPUs, your processor boots from a ROM/flash, then as desired it works toward the final application, be it an operating system or not. For an MCU the typical approach is to boot right into the application: run from flash for the read-only items (.text and .rodata) and from RAM for the read-write items (.data, .bss). This is handled by knowing how to use your toolchain, which is a critical part of bare-metal success.
CPUs generally don't care about flash, RAM, or peripherals; they are just addresses, and the CPU is very, very dumb. You, the programmer, are smart: you lay down the tracks for the CPU to follow, and the instructions have to follow the rules and guide the processor. The processor starts in a well-known way at a well-known address or vector table; from there it is all on you to keep the processor on track by working within the address space where there are resources: flash, RAM, and peripherals. The processor may or may not have rules on the address space it can fetch/execute from; it depends on the implementation. For implementations where the executable address space has both flash and RAM, then yes, you can simply place code in RAM and execute it.
Running code in RAM on an MCU is the exception, not the rule.
Commonly a microcontroller does not load the (single) program into RAM. Instead, it is run "in place" in the (flash or other non-volatile) memory. The program is built so that the memory at the (fixed) start address contains the startup code of the program.
Having said that, you might wonder how (static) variables are initialized with zero and non-zero values. That is done by the startup code linked in when the program is built.
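For illustration, a minimal sketch of such startup code; the section symbols (_sidata, _sdata, _edata, _sbss, _ebss) follow a common linker-script convention and are assumptions here, not fixed names:

```c
#include <stdint.h>

/* Hypothetical linker-script symbols marking the sections. */
extern uint32_t _sidata;        /* initial values of .data, in flash */
extern uint32_t _sdata, _edata; /* .data in RAM */
extern uint32_t _sbss, _ebss;   /* .bss in RAM  */

extern int main(void);

void reset_handler(void)
{
    /* Copy initialized data from flash to RAM. */
    uint32_t *src = &_sidata;
    for (uint32_t *dst = &_sdata; dst < &_edata; )
        *dst++ = *src++;

    /* Zero out the uninitialized data. */
    for (uint32_t *dst = &_sbss; dst < &_ebss; )
        *dst++ = 0;

    main();
}
```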
There is no need to add any "intelligence", assuming you mean something like a byte-code interpreter to execute the binary commands. The CPU of the microcontroller executes the machine code directly, and your compiler generates exactly that machine code.
I'm using PIC18F67K40 microcontroller in my project.
It has 1kB EEPROM memory and 128kB program memory (flash).
For now I'm using EEPROM to store my settings.
The application is "growing", and I realized that at some point 1 kB will not be enough. Some of the settings are arrays of pretty big structures.
I realize that flash memory has 10k write cycles and that I could buy an external EEPROM, but I don't want to change anything in the hardware, and the memory in this product will never see 2k writes, for sure.
My question is:
How can I switch from EEPROM storage to flash storage?
Do I have to recalculate some CRC after program memory changes?
Do I have to define somewhere in the project settings that I'm using some flash memory for storage?
Is there anything else I have to do in order to use flash memory like this?
100k writes is only the endurance of the data EEPROM, not of the flash memory (which is only 10k writes). You could extend the endurance with EEPROM emulation.
There is a really nice library from Microchip for EEPROM emulation in flash memory.
Have a look here: EEPROM emulation
I did this for a client a couple of years ago. I can't post the code for NDA and copyright reasons, but the basic trick was to use something called RTSP (Run-Time Self-Programming). RTSP may be going obsolete now, but whatever replaces it may work in a similar way.
Essentially the flash looks like a series of pages which can be written a word at a time, but erased a page at a time. What you will need to do is write some code that can unlock and erase a page, then write to it. Once you have done this, the page can be read as ordinary memory.
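A rough sketch of that flow; flash_unlock, flash_erase_page, and flash_write_word are hypothetical stand-ins for the device-specific RTSP routines, and the page address and size are made up, so check the device's reference manual for the real sequence:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 128u                        /* device dependent        */
#define SETTINGS_PAGE ((uint16_t *)0x1F800)   /* hypothetical page, well */
                                              /* clear of the code       */
/* Hypothetical device-specific primitives. */
void flash_unlock(void);
void flash_erase_page(void *page);
void flash_write_word(uint16_t *dst, uint16_t value);

void settings_save(const void *data, uint16_t len)
{
    const uint16_t *src = data;

    flash_unlock();
    flash_erase_page(SETTINGS_PAGE);          /* erase before rewriting */
    for (uint16_t i = 0; i < (len + 1u) / 2u; i++)
        flash_write_word(SETTINGS_PAGE + i, src[i]);
}

/* Reading back is then a plain memory read. */
void settings_load(void *data, uint16_t len)
{
    memcpy(data, SETTINGS_PAGE, len);
}
```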
You don't need to change the settings. However, make sure that the page you use is well clear of the program code.
If you want a CRC (usually a good move), you'll have to calculate it yourself.
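For example, a small CRC-16-CCITT routine you can run over the stored block (a common choice; any checksum your project already uses works just as well):

```c
#include <stdint.h>

/* CRC-16-CCITT: polynomial 0x1021, initial value 0xFFFF. */
uint16_t crc16_ccitt(const uint8_t *data, uint16_t len)
{
    uint16_t crc = 0xFFFF;

    while (len--) {
        crc ^= (uint16_t)(*data++) << 8;
        for (uint8_t bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}
```

Store the CRC alongside the settings and verify it after each read; recompute and rewrite it whenever the page is reprogrammed.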
I am a newbie to OpenCL. I have a doubt about how OpenCL functions when a kernel is running on a CPU device. Suppose we have a kernel running on a CPU device: can it read from a file on the disk? If yes, then how? If no, then why not?
Can you please suggest a source for detailed information?
Thanks in advance.
It can't, simply because not every OpenCL device has a file system, or even a disk.
You can't. OpenCL tries to unify access to computing power, and file-system access depends on the OS. If you want this feature, threads (C++11 threads, pthreads, ...) or OpenMP should be able to handle it, because it's a CPU-only thing.
It doesn't make sense to allow device kernels to access the filesystem, because most of the semantics of filesystem access are essentially incompatible with the massively parallel nature of device kernels.
There are two ways to work around this, considering you're only asking about the CPU:
if you intend to use OpenCL as a way to do multithreading on the CPU, consider using what OpenCL calls "native kernels", which are essentially just plain C functions called within an OpenCL context (see clEnqueueNativeKernel);
a more general approach, which might work on a GPU too, would be to mmap the files you want to operate on and pass the resulting pointers to clCreateBuffer with the CL_MEM_USE_HOST_PTR flag, as sketched below.
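A minimal sketch of the second approach (POSIX, error handling omitted; the file name and function name are made up):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <CL/cl.h>

/* Map a file and wrap it in an OpenCL buffer without copying.
   Assumes 'ctx' is an already-created cl_context. */
cl_mem wrap_file(cl_context ctx, const char *path, size_t *size)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    *size = (size_t)st.st_size;

    void *ptr = mmap(NULL, *size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after close */

    cl_int err;
    return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                          *size, ptr, &err);
}
```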
I'd like to know what exactly happens when we assign a memory object to a context in OpenCL.
Does the runtime copy the data to all of the devices associated with the context?
I'd be thankful if you could help me understand this issue :-)
Generally and typically, the copy happens when the runtime handles the clEnqueueWriteBuffer / clEnqueueReadBuffer commands.
However, if you created the memory object using certain combinations of flags, the runtime can choose to copy the memory sooner than that (like right after creation) or later (like on-demand before running a kernel or even on-demand as it needs it). Vendor documentation often indicates if they take special advantage of any of these flags.
A couple of the "interesting" variations:
Shared memory (Intel Integrated Graphics GPUs, AMD APUs, and CPU drivers): You can allocate a buffer and never copy it to the device because the device can access host memory.
On-demand paging: Some discrete GPUs can copy buffer memory over PCIe as it is read or written by a kernel.
Those are both "advanced" usage of OpenCL buffers. You should probably start with "regular" buffers and work your way up if they don't do what you need.
This post describes the extra flags fairly well.
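For example, here are the two most common flag combinations side by side (a sketch; when the actual copy happens for the second one is up to the runtime):

```c
#include <CL/cl.h>

/* Two ways to populate a buffer; which one triggers the actual
   copy, and when, is up to the runtime and the flags used. */
void make_buffers(cl_context ctx, cl_command_queue queue,
                  const float *host_data, size_t size)
{
    cl_int err;

    /* Variant 1: create empty, copy later with an explicit command. */
    cl_mem buf1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
    clEnqueueWriteBuffer(queue, buf1, CL_TRUE, 0, size, host_data,
                         0, NULL, NULL);

    /* Variant 2: hand the data over at creation time; the runtime
       may copy immediately, lazily, or never (shared memory). */
    cl_mem buf2 = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 size, (void *)host_data, &err);

    clReleaseMemObject(buf1);
    clReleaseMemObject(buf2);
}
```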
I am solving a 2d Laplace equation using OpenCL.
The global memory access version runs faster than the one using shared memory.
The algorithm used for shared memory is the same as that in the OpenCL Game of Life code.
https://www.olcf.ornl.gov/tutorials/opencl-game-of-life/
If anyone has faced the same problem, please help. If anyone wants to see the kernel, I can post it.
If your global-memory version really runs faster than your local-memory version (assuming both are equally well optimized for the memory space they use), maybe this paper could answer your question.
Here's a summary of what it says:
Usage of local memory in a kernel adds another constraint to the number of concurrent workgroups that can be run on the same compute unit.
Thus, in certain cases, it may be more efficient to remove this constraint and live with the high latency of global memory accesses. More wavefronts (warps in NVIDIA parlance; each workgroup is divided into wavefronts/warps) running on the same compute unit allow your GPU to hide latency better: if one is waiting for a memory access to complete, another can compute during that time.
In the end, each kernel will take more wall time to complete, but your GPU will be busier because it is running more of them concurrently.
No, it doesn't. It only says that, ALL OTHER THINGS BEING EQUAL, an access from local memory is faster than an access from global memory. It seems to me that the global accesses in your kernel are being coalesced, which yields better performance.
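For illustration, a plain global-memory Jacobi step for the 2D Laplace equation (a sketch, not the asker's kernel): adjacent work-items read adjacent addresses, so the reads coalesce naturally.

```c
// OpenCL C: one Jacobi iteration of the 2D Laplace equation.
// Work-items along dimension 0 touch consecutive addresses,
// so each row of reads and writes is coalesced.
__kernel void laplace_step(__global const float *in,
                           __global float *out,
                           const int width, const int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);

    if (x > 0 && x < width - 1 && y > 0 && y < height - 1) {
        out[y * width + x] = 0.25f * (in[y * width + (x - 1)] +
                                      in[y * width + (x + 1)] +
                                      in[(y - 1) * width + x] +
                                      in[(y + 1) * width + x]);
    }
}
```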
Using shared memory (memory shared with the CPU) isn't always going to be faster. With a modern graphics card, it would only be faster in the situation where the GPU and CPU are both performing operations on the same data and need to share information with each other, as memory wouldn't have to be copied from the card to the system and vice versa.
However, if your program is running entirely on the GPU, it could very well execute faster by running exclusively out of the card's own memory (GDDR5), since the GPU's memory will not only likely be much faster than your system's, there will not be any latency caused by reading memory over the PCI-E lanes.
Think of the graphics card's memory as a kind of "L3 cache" and your system's memory as a resource shared by the entire system; you only use it when multiple devices need to share information (or if your cache is full). I'm not a CUDA or OpenCL programmer; I've never even written Hello World with them. I've only read a few white papers, so it's just common sense (or maybe my Computer Science degree is useful after all).