I am running thermal load tests on a Skylake processor and am trying to use the RAPL MSRs as an early-warning system for oncoming thermal spikes, instead of reading the sysfs files behind "sensors".
I have several questions. As background, when I run sensors, I get the following:
acpitz-virtual-0
Adapter: Virtual device
temp1: +43.0°C (crit = +119.0°C)
pch_skylake-virtual-0
Adapter: Virtual device
temp1: +42.5°C
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +43.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +41.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +41.0°C (high = +100.0°C, crit = +100.0°C)
And when I read the RAPL MSRs, I get the following data points, as described in Intel's now-deprecated page here.
Package energy: 2.493103J
PowerPlane0 (cores): 0.105652J
PowerPlane1 (on-core GPU if avail): 0.106750 J
DRAM: 0.619141J
Now, I am trying to find a relationship between the energy and the temperatures. For example, which one of them is the GPU temperature? Which is DRAM? How do I know these sensor locations?
Are there any MSR-based ways to throttle the CPUs from user space? One easy method is to enable /sys/devices/system/cpu/intel_pstate/no_turbo, but this does not seem to be the right thing to do. Is there any formal means to throttle the CPU/load on the system?
Does RAPL also provide "power" in addition to energy? Can I deduce other details such as battery life left, based on MSR readings? Any other fancy stuff that can be done by reading and deducing from MSRs?
I am using an N210 USRP to view the RF spectrum around the 2.4 GHz range.
I have programmed two TelosB nodes, and they are using RadioCountToLeds to send and receive signals.
I have set the TelosB nodes to the highest power level, following the datasheet.
I have also fixed them to one channel (26), around 2.48 GHz.
I can see that the TelosB nodes are communicating, and the LEDs are blinking.
Now I should be able to observe this in the USRP RF spectrum. However, I observe nothing in the Scope Sink. I have set the center frequency in the 2.48 GHz range.
The RX gain is set to 0.
The sampling rate is 2M.
Is it even possible to observe it?
I guess I solved the problem. I was using the wrong daughterboard. Now I am using the SBX board, which supports the 2.5 GHz range.
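For reference, the center frequency behind "channel 26, around 2.48 GHz" can be computed from the IEEE 802.15.4 channel numbering for the 2.4 GHz band (channels 11 to 26, spaced 5 MHz apart). This is only a small sketch; the function name is made up for illustration.

```python
def channel_center_mhz(channel):
    """Center frequency in MHz of a 2.4 GHz IEEE 802.15.4 channel (11..26)."""
    if not 11 <= channel <= 26:
        raise ValueError("2.4 GHz 802.15.4 channels are numbered 11 to 26")
    # Channel 11 sits at 2405 MHz; each following channel is 5 MHz higher.
    return 2405 + 5 * (channel - 11)


if __name__ == "__main__":
    print(channel_center_mhz(26), "MHz")
```

The value printed for channel 26 is the frequency to tune the USRP to, which also has to lie inside the range the installed daughterboard supports.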
Hi, I am working on Tmote Sky motes (MSP430 microcontroller) with Contiki OS. I want to know the number of instruction cycles used when I do a multiplication operation in my program (software).
Thank you,
Avijit
The MSP430 is a 16-bit system, so 32-bit values are not supported directly. A 32-bit operation is typically translated to assembly code as a sequence of 16-bit operations.
The execution times of 8-bit and 16-bit operations can be found in the TI application report "The MSP430 Hardware Multiplier":
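To illustrate what "a sequence of 16-bit ops" means for multiplication, here is a rough sketch of a 32-bit multiply built only from 16x16 partial products. It shows the arithmetic decomposition, not the code any particular compiler emits, and the function names are made up for this example.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul16(a, b):
    """16x16 -> 32-bit unsigned multiply, the primitive a 16-bit multiplier offers."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def mul32_lo(a, b):
    """Low 32 bits of an unsigned 32x32 multiply, using only mul16."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    low = mul16(a_lo, b_lo)
    cross = mul16(a_hi, b_lo) + mul16(a_lo, b_hi)
    # a_hi * b_hi only contributes to bits 32 and above, so it is dropped.
    return (low + ((cross & MASK16) << 16)) & MASK32


if __name__ == "__main__":
    print(hex(mul32_lo(0x12345678, 0x9ABCDEF0)))
```

One 32-bit multiply therefore costs three 16-bit multiplies plus shifts and additions, which is why its cycle count is a multiple of the 16-bit figures.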
Table 4. CPU Cycles Needed With Different Multiplication Modes
OPERATION                                   SOFTWARE LOOP   HARDWARE MPYer   SPEED INCREASE
Unsigned multiply (MPY)                     139...171       8                17.4...21.4
Unsigned multiply-and-accumulate (MAC)      137...169       8                17.1...21.1
Signed multiply (MPYS)                      145...179       8                18.1...22.4
Signed multiply-and-accumulate (MACS)       143...177       17               8.4...10.4
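The speed-increase figures in the table are the software-loop cycle counts divided by the hardware-multiplier cycle count. The short sketch below recomputes them from the cycle counts quoted above; the variable and function names are illustrative only.

```python
# (operation, (min, max) software-loop cycles, hardware-multiplier cycles)
CYCLES = [
    ("unsigned multiply", (139, 171), 8),
    ("unsigned multiply-and-accumulate", (137, 169), 8),
    ("signed multiply", (145, 179), 8),
    ("signed multiply-and-accumulate", (143, 177), 17),
]


def speed_increase(software_cycles, hardware_cycles):
    """Ratio of software-loop cycles to hardware cycles, rounded to one decimal."""
    return round(software_cycles / hardware_cycles, 1)


if __name__ == "__main__":
    for name, (lo, hi), hw in CYCLES:
        print(f"{name}: {speed_increase(lo, hw)}...{speed_increase(hi, hw)}")
```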
The HW multiplier should be active with default compilation settings, but check the generated object file with msp430-objdump to make sure.
You can use naken_asm by Michael Kohn to disassemble an Intel hex or ELF file, and it will calculate the cycle counts for each instruction. I've used it in the past, and the cycle counter is OK for CPU (such as in your Tmote) but not fully supported for CPUX.
You can invoke it from the command line as simply as:
naken_util -disasm <infile>
where <infile> is the name of your hex or ELF file. The default processor is MSP430, but you'd need the assembly listing from your compiler in order to be able to match up the original code with the disassembled code which includes cycle counts.
Another alternative would be to use MSPDebug's tracer option which can track running software and provide an up-to-date instruction cycle count. However, I've never used it for that purpose so cannot provide an example.
I'm using the Tesla M1060 for GPGPU computation. It has the following specs:
# of Tesla GPUs 1
# of Streaming Processor Cores (XXX per processor) 240
Memory Interface (512-bit per GPU) 512-bit
When I use OpenCL, I can display the following board information:
available platform OpenCL 1.1 CUDA 6.5.14
device Tesla M1060 type:CL_DEVICE_TYPE_GPU
max compute units:30
max work item dimensions:3
max work item sizes (dim:0):512
max work item sizes (dim:1):512
max work item sizes (dim:2):64
global mem size(bytes):4294770688 local mem size:16383
How can I relate the GPU card information to the OpenCL memory information?
For example:
What does "Memory Interface" mean? Is it linked to a Work Item?
How can I relate the "240 cores" of the GPU to Work Groups/Items?
How can I map the work-groups to it (what would be the number of Work Groups to use)?
Thanks
EDIT:
After the following answers, there is a thing that is still unclear to me:
The CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE value is 32 for the kernel I use.
However, my device has a CL_DEVICE_MAX_COMPUTE_UNITS value of 30.
The OpenCL 1.1 API specification says (p. 15):
Compute Unit: An OpenCL device has one or more compute units. A work-group executes on a single compute unit
It seems that either something is inconsistent here, or that I didn't fully understand the difference between Work-Groups and Compute Units.
As previously stated, when I set the number of Work Groups to 32, the program fails with the following error:
Entry function uses too much shared data (0x4020 bytes, 0x4000 max).
The value 16 works.
Addendum
Here is my Kernel signature:
// enable double precision (not enabled by default)
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#error "IEEE-754 double precision not supported by OpenCL implementation."
#endif
#define BLOCK_SIZE 16 // --> this is what defines the WG size to me
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void mmult(__global double * A, __global double * B, __global double * C, const unsigned int q)
{
__local double A_sub[BLOCK_SIZE][BLOCK_SIZE];
__local double B_sub[BLOCK_SIZE][BLOCK_SIZE];
// stuff that does matrix multiplication with __local
}
In the host code part:
#define BLOCK_SIZE 16
...
const size_t local_work_size[2] = {BLOCK_SIZE, BLOCK_SIZE};
...
status = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
The memory interface doesn't mean anything to an OpenCL application. It is the width, in bits, of the bus the memory controller uses to read from and write to device memory (the GDDR5 chips on modern GPUs). The maximum global memory bandwidth is approximately pipelineWidth * memoryClockSpeed, but since OpenCL is meant to be cross-platform, you won't really need this value unless you are trying to estimate an upper bound for memory performance. Knowing about the 512-bit interface is somewhat useful when you're dealing with memory coalescing. wiki: Coalescing (computer science)
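As a rough sketch of that upper-bound formula: the function below multiplies the bus width in bytes by the memory clock and by the number of transfers per clock (2 for double-data-rate memory). The 800 MHz clock and the factor of 2 in the example are assumptions for illustration, not values taken from the question; look them up for your own card.

```python
def peak_bandwidth_bytes_per_s(bus_width_bits, memory_clock_hz, transfers_per_clock=1):
    """Theoretical upper bound on global memory bandwidth in bytes per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * memory_clock_hz * transfers_per_clock


if __name__ == "__main__":
    # 512-bit interface from the question; clock and data rate are ASSUMED values.
    bandwidth = peak_bandwidth_bytes_per_s(512, 800_000_000, transfers_per_clock=2)
    print(f"{bandwidth / 1e9:.1f} GB/s")
```

Measured bandwidth will be lower than this bound, so it is only useful as a ceiling when benchmarking memory-bound kernels.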
The max work item sizes have to do with 1) how the hardware schedules computations, and 2) the amount of low-level memory on the device -- eg. private memory and local memory.
The 240 figure doesn't matter to OpenCL very much either. You can determine that each of the 30 compute units is made up of 8 streaming processor cores for this GPU architecture (because 240/30 = 8). If you query for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, it will very likely be a multiple of 8 for this device. see: clGetKernelWorkGroupInfo
I have answered similar questions about work-group sizing. see here, and here
Ultimately, you need to tune your application and kernels based on your own benchmarking results. I find it worth the time to write many tests with various work-group sizes and eventually hard-code the optimal size.
Adding another answer to address your local memory issue.
Entry function uses too much shared data (0x4020 bytes, 0x4000 max)
Since you are allocating A_sub and B_sub, each taking 32*32*sizeof(double) = 8192 bytes, you run out of local memory: together they already use 16384 bytes (0x4000). The device should allow you to allocate 16 KB, or 0x4000 bytes, of local memory without an issue.
0x4020 is 32 bytes or 4 doubles more than what your device allows. There are only two things I can think of that may cause the error: 1) there could be a bug with your device or drivers preventing you from allocating the full 16kb, or 2) you are allocating the memory somewhere else in your kernel.
You will have to use a BLOCK_SIZE value less than 32 to work around this for now.
There's good news though. If you only want the work-group size to be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, BLOCK_SIZE=16 already does this for you (16*16 = 256 = 32*8). To take better advantage of local memory, try BLOCK_SIZE=24 (24*24 = 576 = 32*18), provided 576 work-items does not exceed your device's CL_DEVICE_MAX_WORK_GROUP_SIZE.
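The arithmetic behind these numbers can be checked with a small sketch. It assumes the kernel holds exactly two BLOCK_SIZE x BLOCK_SIZE tiles of doubles in local memory, as in the kernel signature above, and it treats the extra 0x20 bytes from the error message as a fixed overhead (an assumption; the error message does not say where they come from). The function names are made up for illustration.

```python
SIZEOF_DOUBLE = 8


def local_bytes(block_size, tiles=2, elem_size=SIZEOF_DOUBLE):
    """Local memory needed for `tiles` square tiles of block_size x block_size elements."""
    return tiles * block_size * block_size * elem_size


def largest_block_size(limit_bytes, overhead_bytes=0, tiles=2, elem_size=SIZEOF_DOUBLE):
    """Largest block_size whose tiles plus a fixed overhead fit in limit_bytes."""
    block_size = 0
    while local_bytes(block_size + 1, tiles, elem_size) + overhead_bytes <= limit_bytes:
        block_size += 1
    return block_size


if __name__ == "__main__":
    for b in (16, 24, 32):
        print(b, hex(local_bytes(b)), "bytes,", b * b, "work-items per group")
    # 0x4000 limit and 0x20 overhead are the figures from the error message.
    print("largest fitting BLOCK_SIZE:", largest_block_size(0x4000, overhead_bytes=0x20))
```

A block size also has to keep BLOCK_SIZE*BLOCK_SIZE within the device's maximum work-group size, which this sketch prints but does not check.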
I have so far not found any way to do anything similar to Xilinx' RLOC constraints for Altera FPGAs.
Does anyone know a way to do this?
For example place two FFs in the same or adjacent LABs
So to answer my own question, after some consultation with some Altera manuals and some trial and error, I found that this pretty much does what I want.
module synchronizer (input  wire dat_i,
                     input  wire out_clk,
                     output wire dat_o);
  // Mark the two registers below as a synchronizer chain for TimeQuest.
  (* altera_attribute = "-name SYNCHRONIZATION_REGISTER_CHAIN_LENGTH 2; -name SYNCHRONIZER_IDENTIFICATION \"FORCED IF ASYNCHRONOUS\"" *)
  logic [1:0] out_sync_reg;
  always_ff @(posedge out_clk) begin
    out_sync_reg <= {out_sync_reg[0], dat_i};
  end
  assign dat_o = out_sync_reg[1];
endmodule
I tested this by setting global synchronizer detection to off and observed that TimeQuest found and analysed the correct paths for metastability.
This works well even when dat_i is latched by clk_a and out_clk is driven by clk_b and where the two clocks are set as:
set_clock_groups -asynchronous -group {clk_a}
set_clock_groups -asynchronous -group {clk_b}
This creates false paths between all connections from registers clocked by clk_a to registers clocked by clk_b.
set_max_delay/set_min_delay won't work, since they are ignored (as stated by Altera) if the two clocks are in different asynchronous clock groups.
Altera do not support RLOC style constraints. Apparently this is something to do with the underlying physical architecture. I believe they over-provision ALMs and fuse out columns during chip test to improve yield, therefore relative locations constraints won't translate as expected to a given physical device.
If you are worried about a synchroniser chain placement, you can enable synchroniser chain detection using SYNCHRONIZATION_REGISTER_CHAIN_LENGTH and SYNCHRONIZER_IDENTIFICATION QSF settings (see also this answer).
If you just want to ensure particular timing properties then use set_max_delay and set_min_delay timing constraints on your path.
I am wondering how to choose optimal local and global work sizes for different devices in OpenCL.
Is there any universal rule for AMD, NVIDIA, and Intel GPUs?
Should I analyze the physical build of the devices (number of multiprocessors, number of streaming processors per multiprocessor, etc.)?
Does it depend on the algorithm/implementation? I saw that some libraries (like ViennaCL) find the right values by simply testing many combinations of local/global work sizes and choosing the best combination.
NVIDIA recommends that your (local) work-group size is a multiple of 32 (equal to one warp, which is their atomic unit of execution, meaning that 32 threads/work-items are scheduled atomically together). AMD, on the other hand, recommends a multiple of 64 (equal to one wavefront). I am unsure about Intel, but you can find this type of information in their documentation.
So when you are doing some computation and, let's say, you have 2300 work-items (the global size), 2300 is divisible by neither 64 nor 32. If you don't specify the local size, OpenCL will choose a bad local size for you. When the local size is not a multiple of the atomic unit of execution, you get idle threads, which leads to bad device utilization. Thus, it can be beneficial to add some "dummy" threads so that you get a global size which is a multiple of 32/64, and then use a local size of 32/64 (the global size has to be divisible by the local size). For 2300 you can add 4 dummy threads/work-items, because 2304 is divisible by 32. In the actual kernel, you can write something like:
int globalID = get_global_id(0);
if(globalID >= realNumberOfThreads)
globalID = 0;
This will make the four extra threads do the same as thread 0 (it is often faster to do some extra work than to have many idle threads).
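On the host side, the padded global size is just the real work-item count rounded up to the next multiple of the local size. A minimal sketch of that calculation follows; the helper name is made up, and the result would be passed as the global work size when enqueueing the kernel.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    real_work_items = 2300
    for local_size in (32, 64):
        global_size = round_up(real_work_items, local_size)
        print(local_size, global_size, global_size - real_work_items)
```

The difference between the padded and the real size is the number of dummy work-items that the kernel-side check has to handle.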
Hope that answered your question. GL HF!
If your processing uses little memory (e.g. only to store kernel private state), you can choose the most intuitive global size for your problem and let OpenCL choose the local size for you.
See my answer here : https://stackoverflow.com/a/13762847/145757
If memory management is a central part of your algorithm and will have a great impact on performance, you should indeed go a little further and first check the maximum local size (which depends on the local/private memory usage of your kernel) using clGetKernelWorkGroupInfo; that local size will in turn determine your global size.