I don't understand the difference in cache-miss counts between cachegrind and perf (Intel)

I am studying cache effects using a simple micro-benchmark.
I think that if N is bigger than the cache size, the cache should miss once on the first read of each cache line.
On my machine the cache line size is 64 bytes, so I expect about N/8 misses in total (8 doubles per 64-byte line), and cachegrind does show that.
However, perf displays a very different result: only 34,265 cache-miss events.
I suspected hardware prefetching, so I turned it off in the BIOS; the result was the same anyway.
I really don't understand why perf counts so many fewer cache misses than cachegrind.
Could someone give me a reasonable explanation?
1. Here is a simple micro-benchmark program.
#include <stdio.h>
#define N 10000000
double A[N];

int main() {
    int i;
    double temp = 0.0;
    for (i = 0; i < N; i++) {
        temp = A[i] * A[i];
    }
    return 0;
}
2. Here is cachegrind's output:
==27612== Cachegrind, a cache and branch-prediction profiler
==27612== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==27612== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==27612== Command: ./test
==27612==
--27612-- warning: L3 cache found, using its data for the LL simulation.
==27612==
==27612== I refs: 110,102,998
==27612== I1 misses: 728
==27612== LLi misses: 720
==27612== I1 miss rate: 0.00%
==27612== LLi miss rate: 0.00%
==27612==
==27612== D refs: 70,038,455 (60,026,965 rd + 10,011,490 wr)
==27612== D1 misses: 1,251,802 ( 1,251,288 rd + 514 wr)
==27612== LLd misses: 1,251,624 ( 1,251,137 rd + 487 wr)
==27612== D1 miss rate: 1.7% ( 2.0% + 0.0% )
==27612== LLd miss rate: 1.7% ( 2.0% + 0.0% )
==27612==
==27612== LL refs: 1,252,530 ( 1,252,016 rd + 514 wr)
==27612== LL misses: 1,252,344 ( 1,251,857 rd + 487 wr)
==27612== LL miss rate: 0.6% ( 0.7% + 0.0% )
Generated report file:
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 4-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 8388608 B, 64 B, 16-way associative
Command: ./test
Data file: cache_block
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated: /home/jin/1_dev/99_test/OI/test.s
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
110,102,998 728 720 60,026,965 1,251,288 1,251,137 10,011,490 514 487 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
110,000,011 1 1 60,000,003 1,250,000 1,250,000 10,000,003 0 0 /home/jin/1_dev/99_test/OI/test.s:main
--------------------------------------------------------------------------------
-- User-annotated source: /home/jin/1_dev/99_test/OI/test.s
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
-- line 2 ----------------------------------------
. . . . . . . . . .comm A,80000000,32
. . . . . . . . . .comm B,80000000,32
. . . . . . . . . .text
. . . . . . . . . .globl main
. . . . . . . . . .type main, #function
. . . . . . . . . main:
. . . . . . . . . .LFB0:
. . . . . . . . . .cfi_startproc
1 0 0 0 0 0 1 0 0 pushq %rbp
. . . . . . . . . .cfi_def_cfa_offset 16
. . . . . . . . . .cfi_offset 6, -16
1 0 0 0 0 0 0 0 0 movq %rsp, %rbp
. . . . . . . . . .cfi_def_cfa_register 6
1 0 0 0 0 0 0 0 0 movl $0, %eax
1 1 1 0 0 0 1 0 0 movq %rax, -16(%rbp)
1 0 0 0 0 0 1 0 0 movl $0, -4(%rbp)
1 0 0 0 0 0 0 0 0 jmp .L2
. . . . . . . . . .L3:
10,000,000 0 0 10,000,000 0 0 0 0 0 movl -4(%rbp), %eax
10,000,000 0 0 0 0 0 0 0 0 cltq
10,000,000 0 0 10,000,000 1,250,000 1,250,000 0 0 0 movsd A(,%rax,8), %xmm1
10,000,000 0 0 10,000,000 0 0 0 0 0 movl -4(%rbp), %eax
10,000,000 0 0 0 0 0 0 0 0 cltq
10,000,000 0 0 10,000,000 0 0 0 0 0 movsd A(,%rax,8), %xmm0
10,000,000 0 0 0 0 0 0 0 0 mulsd %xmm1, %xmm0
10,000,000 0 0 0 0 0 10,000,000 0 0 movsd %xmm0, -16(%rbp)
10,000,000 0 0 10,000,000 0 0 0 0 0 addl $1, -4(%rbp)
. . . . . . . . . .L2:
10,000,001 0 0 10,000,001 0 0 0 0 0 cmpl $9999999, -4(%rbp)
10,000,001 0 0 0 0 0 0 0 0 jle .L3
1 0 0 0 0 0 0 0 0 movl $0, %eax
1 0 0 1 0 0 0 0 0 popq %rbp
. . . . . . . . . .cfi_def_cfa 7, 8
1 0 0 1 0 0 0 0 0 ret
. . . . . . . . . .cfi_endproc
. . . . . . . . . .LFE0:
. . . . . . . . . .size main, .-main
. . . . . . . . . .ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
. . . . . . . . . .section .note.GNU-stack,"",#progbits
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
100 0 0 100 100 100 100 0 0 percentage of events annotated
3. Here is perf's output:
$ sudo perf stat -r 10 -e instructions -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e LLC-loads -e LLC-load-misses -e LLC-prefetches ./test
Performance counter stats for './test' (10 runs):
113,898,951 instructions # 0.00 insns per cycle ( +- 12.73% ) [17.36%]
53,607 cache-references ( +- 12.92% ) [29.23%]
1,483 cache-misses # 2.767 % of all cache refs ( +- 26.66% ) [39.84%]
48,612,823 L1-dcache-loads ( +- 4.58% ) [50.45%]
34,256 L1-dcache-load-misses # 0.07% of all L1-dcache hits ( +- 18.94% ) [54.38%]
14,992,686 L1-dcache-stores ( +- 4.90% ) [52.58%]
1,980 L1-dcache-store-misses ( +- 6.36% ) [61.83%]
1,154 LLC-loads ( +- 61.14% ) [53.22%]
18 LLC-load-misses # 1.60% of all LL-cache hits ( +- 16.26% ) [10.87%]
0 LLC-prefetches [ 0.00%]
0.037949840 seconds time elapsed ( +- 3.57% )
More experimental results (2014-05-13):
jin#desktop:~/1_dev/99_test/OI$ sudo perf stat -r 10 -e instructions -e r53024e -e r53014e -e L1-dcache-loads -e L1-dcache-load-misses -e r500f0a -e r500109 ./test
Performance counter stats for './test' (10 runs):
116,464,390 instructions # 0.00 insns per cycle ( +- 2.67% ) [67.43%]
5,994 r53024e <-- L1D hardware prefetch misses ( +- 21.74% ) [70.92%]
1,387,214 r53014e <-- L1D hardware prefetch requests ( +- 2.37% ) [75.61%]
61,667,802 L1-dcache-loads ( +- 1.27% ) [78.12%]
26,297 L1-dcache-load-misses # 0.04% of all L1-dcache hits ( +- 48.92% ) [43.24%]
0 r500f0a <-- LLC lines allocated [56.71%]
41,545 r500109 <-- Number of LLC read misses ( +- 6.16% ) [50.08%]
0.037080925 seconds time elapsed
In the result above, the number of "L1D hardware prefetch requests" (1,387,214) is close to the D1 miss count (1,250,000) reported by cachegrind.
My conclusion: when memory is accessed in a streaming pattern, the L1D prefetcher kicks in, and I cannot tell from the LLC miss counters how many bytes were actually loaded from memory.
Is my conclusion correct?
Editor's notes:
(1) According to the output of cachegrind, the OP was most probably using gcc 4.6.3 with no optimizations.
(2) Some of the raw events used in perf stat are only officially supported on Nehalem/Westmere, so I think that's the microarchitecture the OP is using.
(3) The bits set in the most significant byte (i.e., the third byte) of the raw event codes are ignored by perf. (Although not all bits of the third byte are ignored.) So the events are effectively r024e, r014e, r0f0a, and r0109.
(4) The events r0f0a and r0109 are uncore events, but the OP has specified them as core events, which is wrong: perf will measure them as core events.

Bottom line: your assumption regarding prefetches is correct, but your workaround isn't.
First, as Carlo pointed out, this loop would usually be optimized out by any compiler. Since both perf and cachegrind show ~110M retired instructions, I guess you didn't compile with optimizations, which means the behavior isn't very realistic; for example, your loop variable is likely stored in memory instead of in a register, adding pointless memory accesses and skewing the cache counters.
Now, the difference between your runs is that cachegrind is just a cache simulator: it doesn't simulate prefetches, so every first access to a line misses, as expected. The real CPU, on the other hand, does have hardware prefetchers, as you can see, so the first time each line is brought from memory it's done by a prefetch (thanks to the simple streaming pattern) and not by an actual demand load. This is why perf doesn't count these accesses in the normal demand-miss counters.
You can see that when you enable the prefetch counter, you get roughly the same N/8 prefetches (plus some additional ones, probably from other types of accesses).
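A quick back-of-the-envelope check of that N/8 figure, using the array size and cache-line size given in the question:

```python
# Expected compulsory misses for a sequential scan of the question's array.
N = 10_000_000                       # array length in doubles
LINE_SIZE = 64                       # cache line size in bytes
DOUBLES_PER_LINE = LINE_SIZE // 8    # a double is 8 bytes, so 8 per line

expected_misses = N // DOUBLES_PER_LINE
print(expected_misses)  # 1250000, matching cachegrind's D1mr/DLmr on the load
```

This is also in the same ballpark as the 1,387,214 L1D hardware prefetch requests measured with perf.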
Disabling the prefetcher would seem like the right thing to do, but most CPUs don't offer much control over it. You didn't specify which processor type you're using, but on Intel, for example, you can see here that only the L2 prefetchers are controlled by the BIOS, while your output shows L1 prefetches:
https://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers
Search the manuals for your CPU type to see which L1 prefetchers exist, and understand how to work around them. Usually a simple stride (larger than a single cache line) is enough to defeat them, but if that doesn't work, you'll need to make the access pattern more random, for example by iterating over a random permutation of the indices.
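As a sketch of that last suggestion, here is a randomized access order in Python (a hypothetical analogue of the C benchmark; N is smaller than the question's 10,000,000 only to keep the sketch quick, and the idea carries over unchanged):

```python
import random

N = 1_000_000
A = [0.0] * N

# Visit every element exactly once, but in a random order, so the
# hardware prefetcher cannot detect a simple stream or stride.
order = list(range(N))
random.shuffle(order)

temp = 0.0
for i in order:
    temp = A[i] * A[i]
```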

Related

mpirun hangs after program completion

When I run the following command, I get the expected output but the program does not terminate immediately.
$ mpirun -np 2 echo 1
1
1
The program does not respond to interrupts either. Only after a minute or so do I get back to the shell.
Put differently: mpirun -np 2 echo 1; echo 'done' runs successfully but takes forever.
Update:
I ran strace mpirun -np 2 echo 1
The program hangs here:
sysinfo({uptime=5064793, loads=[153856, 184128, 229600], totalram=67362279424, freeram=26006364160, sharedram=8040448, bufferram=1739857920, totalswap=34359734272, freeswap=34358018048, procs=309, totalhigh=0, freehigh=0, mem_unit=1}) = 0
uname({sysname="Linux", nodename="euler", ...}) = 0
ioctl(13, _IOC(0, 0, 0x25, 0)
and then here:
openat(AT_FDCWD, "/tmp/openmpi-sessions-216211#euler_0/42701", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
munmap(0x7f61ed88c000, 2127408) = 0
munmap(0x7f61ee0a1000, 2101720) = 0
close(9) = 0
munmap(0x7f61ede9e000, 2105664) = 0
munmap(0x7f61ed685000, 2122480) = 0
munmap(0x7f61eda95000, 2109856) = 0
munmap(0x7f61ed47c000, 2130304) = 0
munmap(0x7f61ed05b000, 2109896) = 0
munmap(0x7f61ecc9a000, 3934648) = 0
munmap(0x7f61ed25f000, 2212016) = 0
munmap(0x7f61ec8e3000, 3894144) = 0
munmap(0x7f61ec6bd000, 2248968) = 0
munmap(0x7f61ea776000, 28999696) = 0
munmap(0x7f61edc99000, 2110072) = 0
exit_group(0) = ?
Could you help me debug this further?
Apparently, the NVIDIA driver was broken. Updating the driver to 440.64.00 resolved the issue.

One hot encoding / binary columns for each day of the year and select them

I have an R dataset of flight data. I need to add 365 columns to this dataset, one for each day of the year, with value 1 if the entry's data[i]$FlightDate corresponds to that day of the year and 0 otherwise (see this question for why).
Previously I had managed to extract the day of the year from a FlightDate string using lubridate:
data$DayOfYear <- yday(ymd(data$FlightDate))
How would I go about generating the 365 columns, and keeping only those columns (along with some others) for a future SVD? I will actually need to do the same for the hours of the day (which I will probably split into ranges of 30 or 10 minutes), so an extra 48-120 one-hot columns for a different variable will have to be added later.
Note: my dataset contains about 500k flights per month (so about 16k flights for a single day of the year if I take just one year of data) and has 100 variables (columns).
Sample input data row data[1,]:
{
DayOfYear: 10,
FieldGoodForSvd1 : 235
FieldBadForSvd2 : "some string"
...
}
Sample output data row (after generating 365 binary cols and selecting fields compatible with an SVD)
{
DayOfYear1: 0,
...
DayOfYear9: 0,
DayOfYear10: 1, // The flight had taken place on that DayOfYear
DayOfYear11: 0,
...
DayOfYear365: 0,
FieldGoodForSvd1 : 235
}
EDIT
Suppose my input data matrix looks like that
DayOfYear ; FieldGoodForSvd1 ; FieldBadForSvd2
1 ; 275 ; "los angeles"
1 ; 256 ; "san francisco"
5 ; 15 ; "chicago"
The final output should be
FieldGoodForSvd1 ; DayOfYear1 ; DayOfYear2 ; ... ; DayOfYear4 ; DayOfYear5 ; DayOfYear6 ; ... ; DayOfYear365
275 ; 1 ; 0 ; ... ; 0 ; 0 ; 0 ; ... ; 0
256 ; 1 ; 0 ; ... ; 0 ; 0 ; 0 ; ... ; 0
5 ; 0 ; 0 ; ... ; 0 ; 1 ; 0 ; ... ; 0
Here is my final code that one-hot encodes DayOfYear and TimeSlot, then proceeds to the SVD:
dsan <- d[!is.na(d$FieldGoodForSvd1) & !is.na(d$FieldGoodForSvd2), ]
# We need factors to perform one hot encoding
dsan$DayOfYear <- as.factor(yday(ymd(dsan$FlightDate)))
dsan$TimeSlot <- as.factor(round(dsan$DepTime/100)) # in my case time slots were like 2055 for 20h55
dSvd= with(dsan,data.frame(
FieldGoodForSvd1,
FieldGoodForSvd2,
# ~ performs one hot encoding (on factors), -1 removes intercept term
model.matrix(~DayOfYear-1,dsan),
model.matrix(~TimeSlot-1,dsan)
))
theSVD = svd(scale(dSvd))
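For comparison, the one-hot step can be sketched in plain Python (a hypothetical analogue of model.matrix(~DayOfYear-1, dsan); the column names mirror the DayOfYear1..DayOfYear365 layout from the question):

```python
# Sample rows mirroring the question's input matrix (string column dropped).
rows = [
    {"DayOfYear": 1, "FieldGoodForSvd1": 275},
    {"DayOfYear": 1, "FieldGoodForSvd1": 256},
    {"DayOfYear": 5, "FieldGoodForSvd1": 15},
]

N_DAYS = 365
encoded = []
for row in rows:
    # One indicator column per day of the year, exactly one of them set.
    one_hot = {f"DayOfYear{d}": 0 for d in range(1, N_DAYS + 1)}
    one_hot[f"DayOfYear{row['DayOfYear']}"] = 1
    one_hot["FieldGoodForSvd1"] = row["FieldGoodForSvd1"]
    encoded.append(one_hot)

print(encoded[2]["DayOfYear5"])  # 1
```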

How can I add days and months in the field?

Using an if statement, check whether field one plus 1 day equals field two.
Using an if statement, check whether field one plus 1 month equals field two.
I have this input
09-11-2013 09-12-2013
10-02-2013 10-02-2013
26-10-2013 27-10-2013
12-01-2013 12-02-2013
22-02-2013 23-02-2013
I used this code but it works with years only:
awk '{if ($1+1==$2) print }'
Have a look at the mktime function in awk.
With it you can convert a date to seconds, so it is easy to compare.
This prints how many days there are between $1 and $2:
awk '{split($1, sd, "-");split($2, ed, "-");print $0,(mktime(ed[3] s ed[2] s ed[1] s 0 s 0 s 0)-mktime(sd[3] s sd[2] s sd[1] s 0 s 0 s 0))/86400}' s=' ' file
09-11-2013 09-12-2013 30
10-02-2013 10-02-2013 0
26-10-2013 27-10-2013 1
12-01-2013 12-02-2013 31
22-02-2013 23-02-2013 1
Here it prints 1 if the difference is one day, and 2 if it is one month.
It takes into account that February may have 28 or 29 days:
awk '
BEGIN {
arr="31,28,31,30,31,30,31,31,30,31,30,31"
split(arr, month, ",")
x=0}
{
split($1, sd, "-")
split($2, ed, "-")
t=(mktime(ed[3] s ed[2] s ed[1] s 0 s 0 s 0)-mktime(sd[3] s sd[2] s sd[1] s 0 s 0 s 0))/86400
month[2]=sd[3]%4==0?29:28
}
t==month[sd[2]+0] {x=2}
t==1 {x=1}
{print $0,x
x=0}
' s=' ' file
09-11-2013 09-12-2013 2
10-02-2013 10-02-2013 0
26-10-2013 27-10-2013 1
12-01-2013 12-02-2013 2
22-02-2013 23-02-2013 1
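The same classification can be sketched in Python for comparison (a hypothetical analogue of the awk script; the dd-mm-yyyy format follows the sample input, and "one month later" is checked on the calendar rather than by counting days, which gives the same answers on this data):

```python
from datetime import date

def classify(line):
    """Return 1 if field two is one day after field one,
    2 if it is exactly one calendar month after, 0 otherwise."""
    a, b = line.split()
    d1, m1, y1 = (int(x) for x in a.split("-"))
    d2, m2, y2 = (int(x) for x in b.split("-"))
    start, end = date(y1, m1, d1), date(y2, m2, d2)
    if (end - start).days == 1:
        return 1
    if d1 == d2 and (y2 * 12 + m2) - (y1 * 12 + m1) == 1:
        return 2
    return 0

for line in ["09-11-2013 09-12-2013", "10-02-2013 10-02-2013",
             "26-10-2013 27-10-2013", "12-01-2013 12-02-2013",
             "22-02-2013 23-02-2013"]:
    print(line, classify(line))
```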

AIR NativeProcess does the job, but it sends progress data to ProgressEvent.STANDARD_ERROR_DATA

In a Flex AIR application, I would like to upload a file to an FTP server with the NativeProcess API and curl.
Here is the simplified code:
protected function startProcess(event:MouseEvent):void
{
var processInfo:NativeProcessStartupInfo = new NativeProcessStartupInfo();
processInfo.executable = new File('/usr/bin/curl');
var processArgs:Vector.<String> = new Vector.<String>();
processArgs.push("-T");
processArgs.push("/Users/UserName/Desktop/001.mov");
processArgs.push("ftp://domainIp//www/site.com/");
processArgs.push("--user");
processArgs.push("username:password");
processInfo.arguments = processArgs;
var process:NativeProcess = new NativeProcess();
process.addEventListener(ProgressEvent.STANDARD_OUTPUT_DATA, outputDataHandler);
process.addEventListener(ProgressEvent.STANDARD_ERROR_DATA, errorOutputDataHandler);
process.start(processInfo);
}
It does the job well (i.e., the target file is uploaded), but it emits ProgressEvent.STANDARD_ERROR_DATA instead of ProgressEvent.STANDARD_OUTPUT_DATA, and all progress data goes to process.standardError.
protected function errorOutputDataHandler(event:ProgressEvent):void
{
var process = event.currentTarget as NativeProcess;
trace(process.standardError.readUTFBytes(process.standardError.bytesAvailable));
}
Here is an output of the latter function:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
1 15.8M 0 0 1 200k 0 166k 0:01:37 0:00:01 0:01:36 177k
2 15.8M 0 0 2 381k 0 143k 0:01:53 0:00:02 0:01:51 146k
...
What's wrong with my code? How can I debug it?
Thanks.
What you see is curl's progress meter, which curl writes to stderr by design, so nothing is wrong with your code. Try the -sS option to disable the meter but keep error messages.
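This behavior is not specific to AIR: a child process's stderr stream is delivered separately from its stdout, so a progress meter written to stderr always arrives on the error channel. A minimal Python sketch of capturing the two streams independently (using a tiny inline script in place of curl):

```python
import subprocess
import sys

# Run a child that writes to both streams, standing in for curl:
# curl sends its progress meter to stderr, not stdout.
result = subprocess.run(
    [sys.executable, "-c",
     "import sys; sys.stdout.write('payload'); sys.stderr.write('progress meter')"],
    capture_output=True, text=True,
)
print(result.stdout)  # payload
print(result.stderr)  # progress meter
```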

How to understand and change CPU register FLAG in bochsdbg?

I can use 'r' to get the contents of the CPU FLAGS register.
1. Is my understanding below correct?
eflags 0x00000082: id vip vif ac vm rf nt IOPL=0 of df if tf SF zf af pf cf
0x00000082 = 0b10000010, so among the flags listed only SF (bit 7) is set; bit 1 is a reserved bit that always reads as 1.
2.How to change the FLAG? By 'set' command?
<bochs:5> set eflags=0x03
:5: syntax error at 'eflags'
Thank you~
If the flag name is in capitals, then the flag is set; e.g., 'SF' means the sign flag is set, while 'sf' means it is not set. Did you mean this, or something else in your question?
The bochs manual says: "Currently only general purpose registers are supported, you may not change: eflags, eip, cs, ss, ds, es, fs, gs" (http://bochs.sourceforge.net/doc/docbook/user/internal-debugger.html#AEN3098).
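To make the capitals/lowercase display concrete, here is a small sketch decoding an eflags value (bit positions follow the Intel SDM layout; only the low arithmetic flags are shown):

```python
# EFLAGS bit positions per the Intel SDM (low 12 bits only).
FLAG_BITS = {"CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7,
             "TF": 8, "IF": 9, "DF": 10, "OF": 11}

def decode_eflags(value):
    """Return the set of flag names whose bits are set in value."""
    return {name for name, bit in FLAG_BITS.items() if value >> bit & 1}

# 0x82 = 0b10000010: SF (bit 7) is set; bit 1 is a reserved bit that
# always reads as 1, which is why the value is 0x82 rather than 0x80.
print(decode_eflags(0x82))  # {'SF'}
```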
