Meaning of MPI and OpenMP execution times

I have a cluster made up of 4 nodes, where each node has 2 processors and each processor has 8 cores, for a total of 16 cores per machine. I have an application for image convolution in two versions, one in MPI and the other in OpenMP. When I run the MPI program with 1, 2, 4, 8, 12 and 16 cores on the cluster, I get the following execution times in seconds:
1 core 9.4
2 cores 4.95
4 cores 2.55
8 cores 1.4
12 cores 0.91
16 cores 0.72
When I instead run the OpenMP code with 2 threads, I get the following execution times:
1 core 4.95
2 cores 2.65
4 cores 1.33
8 cores 0.68
12 cores 0.77
16 cores 0.64
When I run it with OpenMP again, this time with 4 threads, I get the following execution times:
1 core 2.64
2 cores 1.37
4 cores 0.80
8 cores 0.66
12 cores 0.86
16 cores 1.10
What I can't figure out is why, with 2 threads, the time decreases as the number of cores increases, except in the case with 12 cores. With 4 threads, instead, the times decrease up to 8 cores, and from 8 cores onwards they start to increase. Can anyone give me an explanation of these execution times? In particular:
why, in the case with 2 threads, do I get a higher time with 12 cores than with 8 cores and 16 cores?
why, in the case with 4 threads, do the times increase from 8 cores onwards?
Thanks in advance to anyone who answers.
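For reference, the speedup and parallel efficiency implied by the MPI timings can be worked out directly (a quick R sketch using only the numbers above; S(p) = T(1)/T(p) and E(p) = S(p)/p are the standard definitions):

cores <- c(1, 2, 4, 8, 12, 16)
times <- c(9.4, 4.95, 2.55, 1.4, 0.91, 0.72)  # MPI execution times in seconds
speedup <- times[1] / times                   # S(p) = T(1) / T(p)
efficiency <- speedup / cores                 # E(p) = S(p) / p
round(speedup, 2)     # 1.00  1.90  3.69  6.71 10.33 13.06
round(efficiency, 2)  # 1.00  0.95  0.92  0.84  0.86  0.82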

Related

Error when trying to extract values from a raster layer to each polygon of a shapefile using R

I'm trying to extract the values from a raster layer to a shapefile layer, but the process is extremely time-consuming: it has been running for 2 hours without giving any result. Considering the size of the polygons and the raster, this process should not take more than 2 minutes, based on similar cases I have done before.
The message I get is this:
no loop for break/next, jumping to top level
My code is this:
library(sf)
library(raster)

# shapefile geometry
shp <- sf::st_read("/vsicurl/http://wesleysc352.github.io/seg_s3_r3_m10_fix_estat_amost_val.shp")

# raster geotiff
raster3 <- raster::raster("http://wesleysc352.github.io/1final.tif")

# extract values
ext1 <- raster::extract(raster3, shp)
Edit: links for a reproducible example:
shape: https://github.com/wesleysc352/wesleysc352.github.io/raw/master/seg_s3_r3_m10_fix_estat_amost_val.zip
raster: https://github.com/wesleysc352/wesleysc352.github.io/raw/master/1final.tif
Use exact_extract from the exactextractr package. It takes 5 seconds:
> system.time({ext1<-exact_extract(raster3, shp)})
|======================================================================| 100%
user system elapsed
4.805 0.296 5.103
and gives you a list of data frames of pixel values and fractional coverage:
> head(ext1[[1]])
value coverage_fraction
1 4 0.25
2 4 0.50
3 4 0.50
4 4 0.25
5 4 0.25
6 4 0.75
> head(ext1[[2]])
value coverage_fraction
1 4 0.25
2 4 0.50
3 4 0.25
4 4 0.50
5 4 1.00
6 4 0.50
[etc]
which is more than extract gives you (a list of pixel values only).
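If you only need a single summary per polygon rather than all the pixel values, exact_extract can also compute it directly; a small sketch, where the coverage-weighted 'mean' is just an assumed choice of summary, reusing the raster3 and shp objects from the question:

library(exactextractr)
poly_means <- exact_extract(raster3, shp, 'mean')  # one coverage-weighted mean per polygon
head(poly_means)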
Why is it slow? I don't know, and I can't really see the point in finding out when exactextractr exists. If you want to know, try subsets of your polygons. For example, the first ten rows take:
> system.time({ext10<-extract(raster3, shp[1:10,])})
user system elapsed
2.549 0.029 2.574
2.57 seconds, which extrapolates to:
> nrow(shp)
[1] 2308
> 2.5*2308/10
[1] 577
577 seconds for the lot. If the full set is still going after ten minutes I'd split it in parts and see if maybe there's an odd polygon with bad geometry that's messing it all up. But in that time you could have run exact_extract about 200 times and played a round of golf/gone fishing/engaged in pleasant leisure activity of your choice.
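If you do want to hunt for a polygon with bad geometry, a quick validity check with sf is one place to start (a sketch, reusing the shp object from the question):

library(sf)
bad <- !st_is_valid(shp)                            # flag polygons whose geometry is invalid
which(bad)                                          # indices of any offending polygons
if (any(bad, na.rm = TRUE)) shp <- st_make_valid(shp)  # try to repair them before extracting again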

R parLapply taking increasing time to finish

I'm running the function parLapply inside a loop and I'm seeing some strange behaviour: the time per iteration keeps increasing significantly, and such an increase doesn't make much sense.
So I started clocking the functions within the loop to see which one was taking the most time, and I found out that parLapply was taking >95% of it. I then timed the code inside the parLapply call as well, to see whether the times measured inside and outside the function match. They did not, by quite a large margin. This margin increases over time, and the difference can reach several seconds, which makes quite an impact on the time it takes for the algorithm to complete.
while (condition) {
  start.time_1 <- Sys.time()
  predictions <- parLapply(cl, array, function(i) {
    start.time_par <- Sys.time()
    # code
    end.time <- Sys.time()
    time.taken_par <- end.time - start.time_par
    print(time.taken_par)
    return(value)
  })
  end.time <- Sys.time()
  time.taken <- end.time - start.time_1
  print(time.taken)
}
I would expect time.taken to be similar to the sum of all the time.taken_par values, but it is not: the sum of all time.taken_par is usually about 0.026 seconds, while time.taken starts out at about 4 times that value, which is fine, but then increases to a lot more (>5 seconds).
Can anyone explain what is going on, and/or tell me whether what I think should happen is wrong? Is it a memory issue?
Thanks for the help!
Edit:
The output of parLapply is the following. However, in my tests there are 10 lists instead of just the 3 in this example. The size of each individual list returned by parLapply is always the same, and in this case is 25.
[1] 11
[[1]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14
-0.01878590 -0.03462315 -0.03412670 -0.06016549 -0.02527741 -0.06271799 -0.05429947 -0.02521108 -0.04291305 -0.03145491 -0.08571382 -0.07025075 -0.07704650 0.25301839
15 16 17 18 19 20 21 22 23 24 25
-0.02332236 -0.02521089 -0.01170326 0.41469539 -0.15855689 -0.02548952 -0.02545446 -0.10971302 -0.02521836 -0.09762386 0.02044592
[[2]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14
-0.01878590 -0.03462315 -0.03412670 -0.06016549 -0.02527741 -0.06271799 -0.05429947 -0.02521108 -0.04291305 -0.03145491 -0.08571382 -0.07025075 -0.07704650 0.25301839
15 16 17 18 19 20 21 22 23 24 25
-0.02332236 -0.02521089 -0.01170326 0.41469539 -0.15855689 -0.02548952 -0.02545446 -0.10971302 -0.02521836 -0.09762386 0.02044592
[[3]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14
-0.01878590 -0.03462315 -0.03412670 -0.06016549 -0.02527741 -0.06271799 -0.05429947 -0.02521108 -0.04291305 -0.03145491 -0.08571382 -0.07025075 -0.07704650 0.25301839
15 16 17 18 19 20 21 22 23 24 25
-0.02332236 -0.02521089 -0.01170326 0.41469539 -0.15855689 -0.02548952 -0.02545446 -0.10971302 -0.02521836 -0.09762386 0.02044592
Edit2:
OK, I have found out what the problem was. I have an array that I initialize using vector("list", 10000), and in each iteration of the loop I add a list of lists to this array. Each of these lists of lists has a size of 6656 bytes, so over the 10000 iterations it doesn't even add up to 0.1 GB. However, as this array starts filling up, the performance of the parallelization starts to degrade. I have no idea why this is happening, as I'm running the script on a machine with 64 GB of RAM. Is this a known problem?
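For what it's worth, here is a minimal, self-contained sketch of the pattern described above, with toy sizes and dummy work standing in for the real script:

library(parallel)

cl <- makeCluster(2)
results <- vector("list", 100)   # preallocated container (10000 in the real script)

for (i in seq_along(results)) {
  # each iteration stores another list of length-25 results, mirroring the question
  results[[i]] <- parLapply(cl, 1:10, function(x) rnorm(25))
}

stopCluster(cl)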

How to compute an average runtime?

I have a set of programs, and each program contains many subprograms, one of which has the longest runtime. My goal is to calculate the average ratio of (longest runtime)/(entire program runtime).
I want to know the right way to do so.
program  longest runtime  entire runtime  ratio
1        10 secs          50 secs         0.2
2        5 secs           40 secs         0.125
3        1 sec            10 secs         0.1
4        20 secs          80 secs         0.25
5        15 secs          20 secs         0.75
So I want to see what percentage of the entire runtime the longest runtime takes.
There are two ways to do so:
1: compute the ratio for each program and then calculate the average of the ratios.
(0.2 + 0.125 + 0.1 + 0.25 + 0.75) / 5 = 1.425 / 5 = 0.285
2: compute the sum of the longest runtimes and divide it by the sum of the entire runtimes.
sum_longest = 51 secs
sum_entire = 200 secs
average = 51 / 200 = 0.255
Which way is correct?
I'd say that your latter answer (getting 0.255) is correct, because your first method does not take the weights (i.e. how long each program takes to run) into account.
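Both computations, using the numbers from the table (a quick R check):

longest <- c(10, 5, 1, 20, 15)
entire  <- c(50, 40, 10, 80, 20)
mean(longest / entire)       # method 1: plain average of the per-program ratios -> 0.285
sum(longest) / sum(entire)   # method 2: weighted by each program's total runtime -> 0.255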

What is the difference between a bank conflict and channel conflict on AMD hardware?

I am learning OpenCL programming and running some programs on an AMD GPU. I referred to the AMD OpenCL Programming Guide to read about global memory optimization for the GCN architecture, but I am not able to understand the difference between a bank conflict and a channel conflict.
Can someone explain the difference between them?
Thanks in advance.
If two memory access requests are directed to the same memory controller, the hardware serializes the access. This is called a channel conflict. In other words, each of the integrated memory controller circuits can serve a single task at a time, so if you happen to map two tasks' addresses to the same channel, they are served serially.
Similarly, if two memory access requests go to the same memory bank, the hardware serializes the access. This is called a bank conflict. If there are multiple memory chips, you should avoid using a stride equal to the interleaving width of the hardware.
Example with 4 channels and 2 banks (not a real-world configuration, since the number of banks is normally greater than or equal to the number of channels):
address 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
channel 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1
bank 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
so you should not read like this:
address 1 3 5 7 9
channel 1 3 1 3 1 // 50% channel conflict
bank    1 1 1 1 1 // 100% bank conflict, serialized at the bank level
nor like this:
address 1 5 9 13
channel 1 1 1 1 // 100% channel conflict, serialized
bank    1 1 1 1 // 100% bank conflict, serialized
but this could be OK:
address 1 6 11 16
channel 1 2 3  4 // no conflict, 100% channel usage
bank    1 2 1  2 // no conflict, 100% bank usage
because the stride is not a multiple of either the channel or the bank width.
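The channel and bank assignments in these examples are just modular arithmetic over the address; a small sketch of the toy 4-channel/2-bank layout above (not of any real hardware mapping), in R:

addresses <- c(1, 6, 11, 16)           # the stride-5 pattern from the last example
channel <- (addresses - 1) %% 4 + 1    # 4 channels, assigned round-robin by address
bank    <- (addresses - 1) %% 2 + 1    # 2 banks, assigned round-robin by address
channel   # 1 2 3 4 -> every channel hit once, no channel conflict
bank      # 1 2 1 2 -> banks alternate, no bank conflict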
Edit: if your algorithm is more local-storage oriented, then you should pay attention to local data share channel conflicts. On top of this, some cards can use constant memory as an independent channel to speed up read rates.
Edit: you can use multiple wavefronts to hide conflict-based latencies, or you can use instruction-level parallelism too.
Edit: local data share (LDS) channels are much faster and more numerous than global channels, so optimizing for LDS is very important; uniformly gathering on global channels and then scattering on local channels shouldn't be as problematic as scattering on global channels and uniformly gathering on local channels.
http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_pgfId-472173
For an AMD APU with a decent mainboard, you should be able to select n-way channel interleaving or n-way bank interleaving to suit your needs, if your software cannot be altered.

How to tell whether the OS is 32-bit or 64-bit with a UNIX command?

How can you find out whether the operating system is 32-bit or 64-bit? Thanks in advance.
On Linux, the answer to such a generic question is simply to use
uname -m
or even:
getconf LONG_BIT
In C you can use the uname(2) system call.
On Windows you can use:
systeminfo | find /I "System type"
or even examine the environment:
set | find "ProgramFiles(x86)"
(or with getenv() in C)
Original question:
How to know the bits of operating system in C or by some other way?
The correct way is to use some system API or command to get the architecture.
Comparing sizeof in C won't give you the OS's pointer size but the pointer size of the architecture the program was compiled for. That's because most architectures/OSes are backward compatible, so they can run older 16- or 32-bit programs without problems: a 32-bit program's pointers are still 32 bits wide even on a 64-bit OS. And even on 64-bit architectures, some OSes may still use 32-bit pointers, such as with the x32 ABI.
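As an illustration of that point from inside a running program: R, for example, reports the pointer size of its own build, not of the OS, so a 32-bit build on a 64-bit OS would still report 4:

.Machine$sizeof.pointer   # 8 for a 64-bit build of R, 4 for a 32-bit build
.Machine$sizeof.long      # size of the C long type the R build was compiled with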
If you use C, you can check sizeof(void*) or sizeof(long): if it is 8 then it's 64-bit, otherwise 32-bit. It's the same for all architectures.
I'm so sorry for my carelessness and mistake; this only holds for Linux. In Linux Device Drivers, 3rd edition, section 11.1, "Use of Standard C Types", it says:
The program can be used to show that long integers and pointers
feature a different size on 64-bit platforms, as demonstrated by
running the program on different Linux computers:
arch     Size:  char  short  int  long  ptr  long-long  u8  u16  u32  u64
i386            1     2      4    4     4    8          1   2    4    8
alpha           1     2      4    8     8    8          1   2    4    8
armv4l          1     2      4    4     4    8          1   2    4    8
ia64            1     2      4    8     8    8          1   2    4    8
m68k            1     2      4    4     4    8          1   2    4    8
mips            1     2      4    4     4    8          1   2    4    8
ppc             1     2      4    4     4    8          1   2    4    8
sparc           1     2      4    4     4    8          1   2    4    8
sparc64         1     2      4    4     4    8          1   2    4    8
x86_64          1     2      4    8     8    8          1   2    4    8
And there are some exceptions. For example:
It's interesting to note that the SPARC 64 architecture runs with a
32-bit user space, so pointers are 32 bits wide there, even though
they are 64 bits wide in kernel space. This can be verified by loading
the kdatasize module (available in the directory misc-modules within
the sample files). The module reports size information at load time
using printk and returns an error (so there's no need to unload it):
@user1437033 I guess Windows isn't compatible with the GCC standard, so you may get an answer from Windows programmers.
@Paul R We should consider it regular code, right? If you use cross-compile tools, such as for ARM (which is only 32-bit there), then you also can't get the answer this way.
PS: I don't recommend using the Dev-C++ compiler; it's weird in many situations and isn't standard. Code::Blocks or VS 2010 may be a good choice.
I hope this can help you.
