How to measure the amount of memory or RAM consumed by code on an Arduino Mega or Due - arduino

Can anybody tell me how to measure the RAM consumed by a particular piece of code running on an Arduino Mega or Due?

There are two kinds of numbers that answer this question:
global static usage and current run-time usage.
The estimated static usage can be determined by adding the following line (if it does not already exist) to
.\arduino-1.5.5\hardware\arduino\avr\boards.txt
uno.upload.maximum_ram_size=2048
This then allows the compiler to output the additional second line shown in the following example in the IDE's result window:
Binary sketch size: 25,880 bytes (of a 32,256 byte maximum)
Estimated used SRAM memory: 990 bytes (of a 2048 byte maximum)
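Since the question asks about the Mega and the Due, the analogous lines would be something like the following. The entry names and values here are my own assumptions (board IDs from the stock boards.txt files, and the 8 KB / 96 KB SRAM sizes of the Mega 2560 and Due), so check them against your own boards.txt; the Due entry lives in the sam package's boards.txt rather than the avr one.
mega.upload.maximum_ram_size=8192
arduino_due_x.upload.maximum_ram_size=98304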
To see the amount of memory used at any given point, including memory that is only in use while inside functions and members (this covers the heap and such), I use the following MemoryFree library at specific points in the code to reveal the high-water mark. The readme explains how to save RAM that is unnecessarily/unintentionally used by prints.
Note: while the original Arduino IDE 1.0.5's boards.txt file does contain these ram_sizes, it does not actually display the usage. The original Arduino IDE 1.5.5 does, as does Arduino ERW 1.0.5 (a non-supported fork).

In my Arduino IDE 2.1.0
I edited the file /usr/share/arduino/hardware/arduino/boards.txt,
but the second line doesn't appear.
After reading:
check-ram-memory-usage-arduino-optimization
measuring-free-memory
I tried:
Show verbose output during compilation
and using avr-size /tmp/build4042914391435450796.tmp/XXXXXXX.cpp.elf
and then I get the memory used.
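The output looks something like this (the numbers below are only an illustration, not real output). Static RAM usage at start-up is data + bss, while text + data is what ends up in flash:
   text    data     bss     dec     hex filename
  25682     198     792   26672    6830 XXXXXXX.cpp.elf
So this illustrative build would use 198 + 792 = 990 bytes of SRAM.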
Best Regards!

int freeRam () {
  // Free RAM on AVR is the gap between the end of the heap and the current
  // stack, approximated here by the address of a local variable.
  extern int __heap_start, *__brkval;
  int v;
  int fr = (int) &v - (__brkval == 0 ? (int) &__heap_start : (int) __brkval);
  Serial.print("Free ram: ");
  Serial.println(fr);
  return fr;
}
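A minimal usage sketch (my own example, not part of the library): call freeRam() at the points you suspect are the heaviest, and treat the lowest value you ever see as the high-water mark.
void setup() {
  Serial.begin(9600);
  freeRam();                              // baseline right after start-up
}

void loop() {
  char buf[128];                          // a large local buffer eats stack space while in scope
  memset(buf, '.', sizeof buf);
  Serial.write((const uint8_t *) buf, 8); // use the buffer so it is not optimised away
  Serial.println();
  freeRam();                              // momentary free RAM, with buf still on the stack
  delay(1000);
}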

Related

XC8 builds font tables from top ROM

I wrote a bare-bones program template in XC8 (1.37) that I use to develop and test new GLCD functions for the 18F family. Programming is done via a PICkit3. Since I need to quickly reprogram the code several times, it is really important that programming is as fast as possible.
Typically, the code size is around 2K and it takes less than 10 seconds to program.
Everything is fine until I have to use a font table, defined as:
const char font8[] = {....
Now, with just $400 bytes added, the compiler places the table at the end of ROM, and programming the full 64K of memory takes more than 1 minute.
Is there any way to avoid this?
I tried to manually limit the memory range in the MPLABX options, but this is annoying and a little unsafe (sometimes part of the code is truncated).
A while back I had to write some code for emissions testing, where I needed to copy data between extreme ends of RAM. To do that I needed to specify exact memory addresses, which you can do with the C extension __at() construct: http://ww1.microchip.com/downloads/en/DeviceDoc/50002053F.pdf#page=27
int scanMode __at(0x200);
const char keys[] __at(123) = { 'r', 's', 'u', 'd' };
int modify(int x) __at(0x1000) {
    return x * 2 + 3;
}
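Applied to the font table from the question, the same construct can pin the table just above your existing code instead of at the top of ROM, so the programmer only has to touch the low pages. A minimal sketch; the address 0x0800 and the glyph bytes are placeholders, not values from the question:
// Hypothetical placement: keep the table in the low pages of flash so the
// upper part of the 64K range never needs to be programmed.
const char font8[] __at(0x0800) = {
    0x00, 0x3E, 0x41, 0x41, 0x3E, 0x00,   /* placeholder glyph data */
    /* ... */
};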

Run arduino sketch from an sd card

Is it possible to put a sketch (.HEX file) to an SD card and run it from there?
My objective is to utilize SD storage instead of flash memory for a program.
If yes, are there any libraries doing exactly this?
All I found was "flashing arduino from sd card", which is not what I need.
UPDATE:
The sketch's loop() calling is implemented in the bootloader,
so I assume there is something like this in the bootloader:
while (true)
{
    call_sketch_loop();
}
Can it be changed to this?
// signature changed from void loop() to int loop()
while (true)
{
    int retval = call_sketch_loop(); // get the loop call's return value
    if (0 == retval)
        continue; // if 0, iterate the loop as usual
    else
    {
        // copy 1.HEX from SD to flash and reboot
        copy_hex_from_sd_to_flash( retval + ".HEX" );
        reboot();
    }
}
change the loop signature to int loop()
put {int}.HEX files on an SD card - 1.HEX, 2.HEX, 3.HEX
if the loop() call returns 0:
continue with the next iteration as usual
if the loop() call returns 2:
copy file 2.HEX from the SD card into program flash memory
reboot the device
With this approach, we can run programs that exceed the flash capacity if we split them up into smaller subprograms.
The technical term you are looking for is "SD card bootloader".
Have you looked into this: "https://github.com/thseiler/embedded/tree/master/avr/2boots"?
As far as I understand, 2boots will first load the hex into flash and then execute it from there. This is not exactly what you are looking for (you want to load it directly into RAM, right?).
The problem with what you are looking for is that the Arduino's RAM is really small, and there is little advantage in loading directly into RAM. Therefore such a library might not exist at all.
I can suggest a poor man's approach for doing this. First write a sketch that contains a function with an infinite loop inside it, and inside this loop put the code of your desired "loop". In the setup of the sketch, take the pointer to this function and write a sufficient number of bytes into a binary file on the SD card.
Then upload another sketch which has an empty buffer. This sketch will load the binary file into it and reference its beginning as a pointer to a function. Voilà, you can now execute your "loop".
This is ugly, and unless you have a very specific and esoteric need for loading directly into RAM, I suggest trying the 2boots library.

Arduino IDE maximum PROGMEM char array length

I am trying to use a single 4KB string in my Arduino sketch, but this always seems to give a whole bunch of Java errors in the console and never compiles. I believe I am using it correctly:
const char sequence[] PROGMEM = {"0F0FF0 ... 0F0F0FF"};
By trial and error I determined that the maximum length I can get to compile successfully is 1104 characters. This doesn't seem to make much sense. Is there some unknown limitation in the compiler, or is it an issue with the IDE? I'm using 1.0.5 but I get the same results in 1.6.5 as well. I'd really rather not split the array. Reading online, the size limit should be 32KB, which is far higher than what I need.
Any help or explanation appreciated, please and thank you.
It's a limitation of the IDE, not the compiler. If you still make it a single string, but use C's string concatenation, it will compile, e.g.
const char sequence[] PROGMEM = {
"0F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF0"
"0F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF0"
...
"0F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF00F0F0FF0"
};
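Once it compiles, remember that the data lives in flash, so it has to be read back through the pgm_* helpers rather than indexed like a normal array. A minimal sketch of printing it, assuming the sequence array above:
#include <avr/pgmspace.h>

void printSequence() {
  size_t n = strlen_P(sequence);                // length of the string stored in flash
  for (size_t i = 0; i < n; i++) {
    Serial.write(pgm_read_byte(sequence + i));  // fetch one byte from program memory
  }
  Serial.println();
}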

Why doesn't OpenCL Nvidia compiler (nvcc) use the registers twice?

I'm doing a small OpenCL benchmark using Nvidia drivers,
my kernel performs 1024 fused multiply-adds and stores the result in an array:
#define FLOPS_MACRO_1(x) { (x) = (x) * 0.99f + 10.f; } // Multiply-add
#define FLOPS_MACRO_2(x) { FLOPS_MACRO_1(x) FLOPS_MACRO_1(x) }
#define FLOPS_MACRO_4(x) { FLOPS_MACRO_2(x) FLOPS_MACRO_2(x) }
#define FLOPS_MACRO_8(x) { FLOPS_MACRO_4(x) FLOPS_MACRO_4(x) }
// more recursive macros ...
#define FLOPS_MACRO_1024(x) { FLOPS_MACRO_512(x) FLOPS_MACRO_512(x) }
__kernel void ocl_Kernel_FLOPS(int iNbElts, __global float *pf)
{
for (unsigned i = get_global_id(0); i < iNbElts; i += get_global_size(0))
{
float f = (float) i;
FLOPS_MACRO_1024(f)
pf[i] = f;
}
}
But when I look in the PTX generated, I see this:
.entry ocl_Kernel_FLOPS(
.param .u32 ocl_Kernel_FLOPS_param_0,
.param .u32 .ptr .global .align 4 ocl_Kernel_FLOPS_param_1
)
{
.reg .f32 %f<1026>; // 1026 float registers !
.reg .pred %p<3>;
.reg .s32 %r<19>;
ld.param.u32 %r1, [ocl_Kernel_FLOPS_param_0];
// some more code unrelated to the problem
// ...
BB1_1:
and.b32 %r13, %r18, 65535;
cvt.rn.f32.u32 %f1, %r13;
fma.rn.f32 %f2, %f1, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f3, %f2, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f4, %f3, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f5, %f4, 0f3F7D70A4, 0f41200000;
// etc
// ...
If I am correct, the PTX uses 1026 float registers to perform the 1024 operations and never reuses a register, even though it could perform all the multiply-add operations using only 2 registers. 1026 is far above the maximum number of registers a thread is allowed to have (according to the specs), so I guess this ends up in memory spilling.
Is it a compiler bug or am I totally missing something?
I am using nvcc version 6.5 on a Quadro K2000 GPU.
EDIT
Actually I did miss something in the specs:
"Since PTX supports virtual registers, it is quite common for a compiler frontend to generate
a large number of register names. Rather than require explicit declaration of every name,
PTX supports a syntax for creating a set of variables having a common prefix string
appended with integer suffixes. For example, suppose a program uses a large number, say
one hundred, of .b32 variables, named %r0, %r1, ..., %r99"
The PTX file format is intended to describe a virtual machine and instruction set architecture:
PTX defines a virtual machine and ISA for general purpose parallel thread execution. PTX programs are translated at install time to the target hardware instruction set. The PTX-to-GPU translator and driver enable NVIDIA GPUs to be used as programmable parallel computers.
So the PTX output that you are obtaining there is not a form of "GPU assembler". It is only an intermediate representation, intended to be capable of describing virtually any form of parallel computation.
The PTX representation is then compiled into actual binaries for the respective target GPU. This is important in order to be able to abstract from the actual architecture - specifically, regarding your example: it should be possible to use the same PTX representation of a program regardless of the number of registers that are available on a specific target machine. The 1026 "registers" that you see there are "virtual" registers, and in the end they may be mapped to the (few) real hardware registers that are actually available. You may add the --ptxas-options=-v argument to nvcc during compilation to obtain additional information about the register usage.
(This is roughly the same idea as the one behind LLVM - namely, to have a representation that can be optimized and reasoned about, abstracting both from the original source code and from the actual target architecture.)

What's the equivalent of rdtsc opcode for PPC?

I have an assembly program that has the following code.
This code compiles fine for an Intel processor. But when I use a PPC (cross-)compiler, I get an error that the opcode is not recognized. I am trying to find out whether there is an equivalent opcode for the PPC architecture.
.file "assembly.s"
.text
.globl func64
.type func64,#function
func64:
rdtsc
ret
.size func64,.Lfe1-func64
.globl func
.type func,#function
func:
rdtsc
ret
PowerPC includes a "time base" register which is incremented regularly (although perhaps not at each clock -- it depends on the actual hardware and the operating system). The TB register is a 64-bit value, read as two 32-bit halves with mftb (low half) and mftbu (high half). The four least significant bits of TB are somewhat unreliable (they increment monotonically, but not necessarily with a fixed rate).
Some of the older PowerPC processors do not have the TB register (but the OS might emulate it, probably with questionable accuracy); however, the 603e already has it, so it is a fair bet that most if not all PowerPC systems actually in production have it. There is also an "alternate time base register".
For details, see the Power ISA specification, available from the power.org Web site. At the time of writing that answer, the current version was 2.06B, and the TB register and opcodes were documented at pages 703 to 706.
When you need a 64-bit value on a 32-bit architecture (not sure how it works on 64-bit) and you read the TB register, you can run into the problem of the lower half going from 0xffffffff to 0 - granted, this doesn't happen often, but you can be sure it will happen when it does the most damage ;)
I recommend you read the upper half first, then the lower and finally the upper again. Compare the two uppers and if they are equal, no problemo. If they differ (the first should be one less than the last) you have to look at the lower to see which upper it should be paired with: if its highest bit is set it should be paired with the first, otherwise with the last.
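A minimal GCC inline-assembly sketch of that read-upper / read-lower / read-upper-again scheme on 32-bit PowerPC (my own illustration, untested):
#include <stdint.h>

static uint64_t read_timebase(void)
{
    uint32_t hi1, lo, hi2;
    __asm__ __volatile__("mftbu %0" : "=r"(hi1));  /* upper half, first read  */
    __asm__ __volatile__("mftb  %0" : "=r"(lo));   /* lower half              */
    __asm__ __volatile__("mftbu %0" : "=r"(hi2));  /* upper half, second read */
    if (hi1 != hi2) {
        /* The lower half wrapped between the two upper reads: if its high bit
           is set it belongs with the first upper value, otherwise with the last. */
        hi1 = (lo & 0x80000000u) ? hi1 : hi2;
    }
    return ((uint64_t)hi1 << 32) | lo;
}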
Apple has three versions of mach_absolute_time() for the different types of code:
32-bit
64-bit kernel, 32-bit app
64-bit kernel, 64-bit app
Inspired by a comment from Peter Cordes and the disassembly of clang's __builtin_readcyclecounter:
mfspr 3, 268
blr
For gcc you can do the following:
unsigned long long rdtsc() {
  unsigned long long rval;
  // SPR 268 is the time base; %0 lets the compiler pick the output register.
  __asm__ __volatile__("mfspr %0, 268" : "=r" (rval));
  return rval;
}
Or for clang:
unsigned long long readTSC() {
  // _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
  return __builtin_readcyclecounter();
}
