We have an ECU with 8KB of RAM memory which has been divided into segments in the .xcl file (NO_INIT, ECU_ID,INIT etc). In our code, depending on #pragmas, we have the variable going into their respective segments. Now is my assumption true that irrespective of where the variables go, they are always cleared on power ON reset?
There is a crt.asm file which has the startup code. Is there a possibility of defining in this file what area of RAM needs to be cleared and what shouldn't, when the micro is reset?
Related
There is a memory-management process on a server I use that kills processes that go above a certain limit.
There are very large data files that must be loaded in their entirety, but doing so goes above the memory limit.
Is there a way to force R to limit the amount of memory on RAM and use swap-space/a swapfile after it is hit?
First, this is done on a unix-based system. I was asked to simulate writes to a 128MB file, using different methods. First writing to an aligned random location (lseek to a location that is a multiple of the write size) in the file, with changing write sizes (1MB, 256KB,64KB,16KB,4KB), and keep writing until 128MB are written. Then to perform the same operation but with the O_DIRECT flag, and lastly perform it again with unaligned location without the flag.
The results I got were that for the aligned + O_DIRECT writes, decreasing the write size drastically reduces performance, which is understandable because with O_DIRECT it will access the disk directly more with smaller write sizes.
With the unaligned writes, reducing write sizes again reduces performance which makes sense because we have more writes done to different locations in the file, and then when the cache is flushed, during the physical writing, the disk might go back and forth between sectors to write the data. More writes, more going back and forth.
The results that confused me were for the aligned writes without the O_DIRECT flag. This time for smaller write sizes, the performance drastically increased, with 32,678 4KB writes being a lot faster than 128 1MB writes. And I can't figure out what would cause such an increase. I should also add the unix computer i'm working with is a remote one, and I do not know its specs, i.e what hdd it has, whats its sector size and so on. I guess I should also add that the code to write unaligned and aligned is exactly the same, the only change is the location that lseek receives.
One thing I did think about, if the physical sector size is say, 4KB, and the PC has enough RAM, all the write operations can be cached in pages and then flushed. Then if the disk goes from the smallest address to the biggest, it can go sector by sector forward because its aligned, which might mean it can write everything in one go.
I'm trying to get something to work but I run out of ideas so I figured I would ask here.
I have a kernel that has a large global size (usually 5 Million)
Each of the threads can require up to 1Mb of global memory (exact size not known in advance)
So i figured... ok, on my typical target GPU I have 6Gb and I can run 2880 threads in parrallel, more than enough right ?
My idea is to create a big buffer (well actually 2 because of the max buffer size limitation...)
Each thread pointing to a specific global memory area (with the coalescence and stuff, but you get the idea...)
My problem is, How do I know which thread is currenctly being run (in the kernel code) to point to the right memory area ?
I did find the cl_arm_get_core_id extension but this only gives me the workgroup, not the acutal thread being used, plus this does not seem to be available on all GPUs, since it's an extension.
I have the option to have work_group_size = nb_compute_units / nb_cores and have the offset to be arm_get_core_id() * work_group_size + global_id() % work_group_size
But maybe this group size is not optimal, and the portability issue still exists.
I can also enqueue a lot of kernels calls with global size 2880, and there I obviously know where to point to with the global Id.
But won't this lead to a lot of overhead because of the 5Million / 2880 kernel calls ? Plus any work group that finishes before the others will be idle until all workgroups for this call have finished their job.
Any ideas to do this properly are very welcome !
Well, you are storing 1MB per WI for temporal computations (because you are not saving them, otherwise your wouldn't have memory).
Then, why not simply let it spill to global memory? Does the compiler complain? If it does complain, then you need other approaches:
One possibility is to create a queue (just a boolean array), of the memory zones empty for usage by the WorkGroups. And every time a new workgroup is launched it takes an empty slot and sets the boolean to "used" state. You can do this with atomic_cmpxchg() atomic operation.
It may introduce a small overhead to launch each WG, but it would be probably negligible if each WI is needing 1MB of global memory.
Here you have a small example of how to do atomic_cmpxchg() LINK
I was reading this excellent article which gives an introduction to Asynchronous programming here http://krondo.com/blog/?p=1209 and I came across the following line which I find hard to understand.
Since there is no actual parallelism(in asnyc), it appears from our diagrams that an asynchronous program will take just as long to execute as a synchronous one, perhaps longer as the asynchronous program might exhibit poorer locality of reference.
Could someone explain how locality of reference comes into picture here?
Locality of reference, like that Wikipedia article mentions, is the observation that when some data is accessed (on disk, in memory, whatever), other data near that location is often accessed as well. This observation makes sense since developers tend to group similar data together. Since the data are related, they're often processed together. Specifically, this is known as spatial locality.
For a weak example, imagine computing the sum of an array or doing a matrix multiplication. The data representing the array or matrix are typically stored in continguous memory locations, and for this example, once you access one specific location in memory, you will be accessing others close to it as well.
Computer architecture takes locality of reference into account. Operating systems have the notion of "pages" which are (roughly) 4KB chunks of data that can be paged in and out individually (moved between physical memory and disk). When you touch some memory that's not resident (not physically in RAM), the OS will bring the entire page of data off disk and into memory. The reason for this is locality: you're likely to touch other data around what you just touched.
Additionally, CPUs have the concept of caches. For example, a CPU might have an L1 (level 1) cache, which is really just a big block of on-CPU data that the CPU can access faster than RAM. If a value is in the L1 cache, the CPU will use that instead of going out to RAM. Following the principle of the locality of reference, when a CPU access some value in main memory, it will bring that value and all values near it into the L1 cache. This set of values is known as a cache line. Cache lines vary in size, but the point is that when you access the first value of an array, the CPU might have to get it from RAM, but subsequent accesses (close in proximity) will be faster since the CPU brought the whole bundle of values into the L1 cache on the first access.
So, to answer your question: if you imagine a synchronous process computing the sum of a very large array, it will touch memory locations in order one after the other. In this case, your locality is good. In the asynchronous case, however, you might have n threads each taking a slice of the array (of size 1/n) and computing the sub-sum. Each thread is touching a potentially very different location in memory (since the array is large) and since each thread can be switched in and out of execution, the actual pattern of data access from the point of view of the OS or CPU is poor. The L1 cache on a CPU is finite, so if Thread 1 brings in a cache line (due to an access), this might evict the cache line of Thread 2. Then, when Thread 2 goes to access its array value, it has to go to RAM, which will bring in its cache line again and potentially evict the cache line of Thread 1, and so on. Depending on the system resources and usage as a whole, this pattern could happen on the OS/page level as well.
The poorer locality of reference results in poorer cache usage -- each time you do a thread switch, you can expect that most of what's in the cache relates to that previous thread, not the current one, so most reads will get data from main memory instead of the cache.
He's ultimately wrong though, at least for quite a few programs. The reason is pretty simple: even though you gain nothing on CPU-bound code, when you can combine some CPU-bound code with some I/O bound code, you can expect an overall speed improvement. You can, for example, initiate a read or write, then switch to doing computation while the disk is busy, then switch back to the I/O bound thread when the disk finishes its work.
Why can't we move data directly from a memory location into another memory location.
Pardon me if I am asking a dumb question, but I think this is a true situation, at least for the ones I've encountered (8085,8086 n 80386)
I am not really looking for a solution for moving the data (like for eg, using movs n all), but actually the reason for this anomaly.
What about MOVS? It moves a 8/16/32-bit value addressed by esi to the location addressed by edi.
The basic reason is that most instruction sets allow one register operand, and one memory operand, and sticking to this format makes designing the instruction decoder easier. It also makes the execution engine inside the CPU easier, because the instruction can issue typically a memory operation to just one memory location, and at most one register block read or write.
To do a memory-to-memory instruction directly requires two memory locations to be designated. This is awkward given a register/memory instruction format. Given the performance of the machines, there is little justification for modifying the instruction format just for this.
A hack used by more modern CPUs is to provide some type of block-move instruction, in which the source and destination locations are located in registers (for the X86 this is ESI and EDI respectively). Then an instruction can just designate two registers (or in the case of the x86, instructions that simply know which registers). That solves the instruction decoding problem.
The instruction execution problem is a little harder but people have lots of transistors. Organizing a read indirect from one register, and write indirect through another, and increment both is awkward in silicon but that just chews up some transistors.
Now you can have an instruction that moves from memory to memory, just as you asked.
One of the other posters noted for the X86 there are instrucitons (MOVB, MOVW, MOVS, ...) that do exactly this, one memory byte/word/... at a time.
Moving a block of memory would be ideal because the CPU can generate high-bandwith reads and writes. The x86 does this with with a REP (repeat) prefix on MOV- to move a larger block.
But if a single insturction can do this, you have the problem that it might take a long time to execute (how long to move 1Gb? --> millions of clock cycles!) and that ruins the interrupt response rate of the CPU.
The x86 solves this by allowing REP MOV- to be interrupted, with the PC being set back to the beginning of the instruction. By updating the registers during the move appropriately, you can interrupt and restart the REP MOV- instruction having both a fast block move and high interrupt response rates. More transistors down the tube.
The RISC guys figured out that all this complexity for a block move instruction was mostly not worth it. You can code a dumb loop (even the x86):
copy: MOV EAX,[ESI]
ADD ESI,4
MOV [EDI],EAX
ADD EDI,4
DEC ECX
JNE copy
which does the same basic thing as REP MOV- . Pretty much the modern CPUs (x86, others) execute this so fast (superscalar, etc.) that the bus is just as utilized as the custom move instruction, but now you don't need all those wasted transistors (or corresponding heat).
Most CPU varieties don't allow memory-to-memory moves. Normally the CPU can access only one memory location at at time, which means you need a temporary spot to store the value when moving it (a general purpose register, usually). If you think about it, moving directly from one memory location to another would require that the CPU be able to access two different spots in RAM simultaneously - that means two full memory controllers at least, and even then, the chances they'd "play nice" enough to access the same RAM would be pretty bad. The chip designers might have been able to pull some tricks to allow direct copies from one RAM chip to another, but that would be a pretty special-application kind of feature that would just add cost and complexity to solve a very uncommon problem.
You might be able to use some special DMA hardware to make it look to your program like memory is being moved without that temporary storage, at least from the perspective of your CPU.
You have one set of address lines, one set of data lines, and a few control lines between the CPU and RAM. You can't physically move directly from memory to memory without a second set of address lines and a whole bunch of complicated logic inside the RAM. Therefore, we have to store it temporarily in a register.
You could make an instruction that does the load and store together and looks like one instruction to the programmer, but there are other considerations like instruction size, non-duplication of effective address calculation logic, pipelining, etc. that make it desirable to keep it more simple.
Memory-memory machines turn out to be slower in general than load-store machines. This was deduced/figured out/invented by the RISC researchers in 1980ish or so. So the older architectures (VAX/OS360) tend to have memory-memory architectures; newer machines do load-store.
Another interesting variant is stack machines; they seem to always be around as a minority.