Do "if conditions" affect the performance of kernel execution in OpenCL?

I am running a filter on an image as a vertical pass followed by a horizontal pass. The function for this task is the same for both passes; only the argument values change, and I call it in a loop. To vectorize the operations in that function, I had to write separate functions for the two passes, so the loops for the horizontal and vertical passes are now separate. This change introduced an "if condition", and I noticed that even though the computations are vectorized, the kernel takes longer to execute. I have run the code several times, and the average time of the vectorized code is higher than that of the original code. Is this because of the "if condition" added to the code?
Original code:
global int* a;
for(int i = 0; i < 4; i++)
    filter(a + i, b, c);
Modified code:
global int* a;
if(offset == 1)
    for(int i = 0; i < 4; i++)
        filter_vertical(a + i, b, c);
filter_horizontal(a, b, c);

Did you mean offset == 1?
if(offset = 1)
assigns 1 to offset, so the condition is always true, and the assignment adds extra latency per thread. This is slower than the original. Apart from that, an "if" can move performance up or down depending on how the "taken" and "not taken" outcomes are grouped together. On SIMD architectures such as GPUs, when a pipeline does not take the same branch as its neighbor, bubbles are inserted into the parallel SIMD pipelines. Those idle slots are left for the threads of other wavefronts to occupy; if they cannot fill them either, performance drops.
For more performance, unroll the loop. Replacing
for(int i = 0; i < 4; i++)
    filter_vertical(a + i, b, c);
with
filter_vertical(a, b, c);
filter_vertical(a + 1, b, c);
filter_vertical(a + 2, b, c);
filter_vertical(a + 3, b, c);
needs more instruction cache, but fewer branches, less memory, and fewer cycles.
If you can group the offset == 1 cases together, it will be faster, as long as memory access operations do not affect it.
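The effect of grouping can be illustrated with a toy cost model. This is a sketch in Python rather than OpenCL, and it rests on a loud simplification: a SIMD group is assumed to pay the full cost of every branch path that at least one of its lanes takes. The function name, the SIMD width of 4, and the cost values are all hypothetical; real hardware timings depend on much more than this.

```python
def divergence_cost(branch_flags, simd_width, cost_taken, cost_not_taken):
    """Toy model: each SIMD group pays for every branch path any of its lanes takes."""
    total = 0
    for start in range(0, len(branch_flags), simd_width):
        group = branch_flags[start:start + simd_width]
        if any(group):          # at least one lane takes the branch
            total += cost_taken
        if not all(group):      # at least one lane skips the branch
            total += cost_not_taken
    return total


# 16 work-items, half of which take the branch (hypothetical workload).
interleaved = [i % 2 == 0 for i in range(16)]  # taken / not taken alternate
grouped = sorted(interleaved)                  # same outcomes, grouped together

print(divergence_cost(interleaved, 4, 10, 10))  # every group runs both paths
print(divergence_cost(grouped, 4, 10, 10))      # every group runs one path
```

In this model the interleaved pattern costs twice as much as the grouped one, which is the intuition behind grouping the offset == 1 cases.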


Memoization code for "Longest Common Substring" doesn't work as expected

I was able to come up with a recursive solution for the "Longest Common Substring" problem, but when I try to memoize it, it doesn't work as I expected and returns a wrong answer.
Here is the recursive code.
int lcs(string X, string Y, int i, int j, int count)
{
    if (i == 0 || j == 0)
        return count;
    if (X[i - 1] == Y[j - 1])
        count = lcs(X, Y, i - 1, j - 1, count + 1);
    count = max(count, max(lcs(X, Y, i, j - 1, 0), lcs(X, Y, i - 1, j, 0)));
    return count;
}
int longestCommonSubstr(string S1, string S2, int n, int m)
{
    return lcs(S1, S2, n, m, 0);
}
And here is the memoized code.
int lcs(string X, string Y, int i, int j, int count, vector<vector<vector<int>>>& dp)
{
    if (i == 0 || j == 0)
        return count;
    if (dp[i - 1][j - 1][count] != -1)
        return dp[i - 1][j - 1][count];
    if (X[i - 1] == Y[j - 1])
        count = lcs(X, Y, i - 1, j - 1, count + 1, dp);
    count = max(count, max(lcs(X, Y, i, j - 1, 0, dp), lcs(X, Y, i - 1, j, 0, dp)));
    return dp[i - 1][j - 1][count] = count;
}
int longestCommonSubstr(string S1, string S2, int n, int m)
{
    int maxSize = max(n, m);
    vector<vector<vector<int>>> dp(n, vector<vector<int>>(m, vector<int>(maxSize, -1)));
    return lcs(S1, S2, n, m, 0, dp);
}
I do know that the problem can be solved using a 2D DP vector as well, but my objective was to convert my original recursive solution into a memoized solution, not to write a solution from scratch. Since I have 3 parameters that change, it should use a 3D DP table.
Can anyone figure out what's wrong, or help me out with a 3D DP solution whose recursive code is the same as or similar to mine?
An interesting observation: the arguments of the max function are for some reason evaluated from left to right on my Mac and on Ubuntu running under Parallels, but from right to left on a Windows machine and in online compilers. I do not know the reason, but I would be happy to learn it. I'm running the code on an M1 Mac, and I don't know whether the ARM compiler is different from the x86 Mac compiler.
Another thing, the memoized code gives different answers depending upon which recursive call is called first on the line,
count = max(count,max(lcs(X,Y,i, j-1, 0),lcs(X,Y,i - 1, j, 0)));
If I swap the positions of the function call statements then it gives a correct output but for that specific test case and probably similar cases.
This Memo solution gives TLE as well in large test cases, and I do not know why.
I recently started studying DP and this is the only question which I wasn't able to solve by just modifying the original recursive solution. It has been two days and I just can't figure out the proper reasons.
Submission Link:-
Any help in this regard would be great.
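One likely cause, read directly off the posted code: the lookup uses the incoming count, but by the time the result is stored, count has already been overwritten, so the value is written under a different key than the one it is later looked up with. That would also explain why the answer depends on which recursive call runs first, since C++ leaves the order in which function arguments are evaluated unspecified. A minimal Python sketch of the same recursion, memoized on the incoming (i, j, count), could look like this; the function names are hypothetical, and it is a sketch of one possible fix, not a tuned solution.

```python
from functools import lru_cache


def longest_common_substr(s1, s2):
    """Length of the longest common substring, using the three-parameter recursion."""

    @lru_cache(maxsize=None)
    def lcs(i, j, count):
        # count is the length of the matching run that ends just after (i, j).
        if i == 0 or j == 0:
            return count
        best = count  # keep the incoming count as the cache key; use a new name for the result
        if s1[i - 1] == s2[j - 1]:
            best = lcs(i - 1, j - 1, count + 1)
        return max(best, lcs(i, j - 1, 0), lcs(i - 1, j, 0))

    return lcs(len(s1), len(s2), 0)


print(longest_common_substr("ABCDGH", "ACDGHR"))  # common substring "CDGH"
```

Even with a consistent key, the state space is n * m * min(n, m), which is a plausible reason for TLE on large inputs; the 2D formulation avoids the third dimension. The sketch is also only suitable for short strings because of Python's recursion limit.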

Loop for minimum spanning tree does not work

We are three friends trying to solve the minimum spanning tree with conflicts problem using R. To solve it, we read .txt files that contain, for example,
"1 2 5
2 4 6" etc., which indicates that there is an edge with weight 5 from node 1 to node 2, and
"1 2 2 4" etc., which indicates that there is a conflict relationship between the edges 1-2 and 2-4. To continue, we have to form an n x n conflict matrix that stores 0 if there is no conflict relation between two edges and 1 if there is one. For this purpose, we developed a triple for loop:
for(i in 1:dim(edges_read)[1]){
  for(k in 1:dim(edges_read)[1]){
    for(t in 1:dim(conflicts)[1]){
      if(all(conflicts[t,] == c(edges_read[i,1], edges_read[i,2],
                                edges_read[k,1], edges_read[k,2]))){
        conflictmatrix[i,k] <- 1
      }
    }
  }
}
However, R does not return a solution, and these for loops take a very long time. How can we solve this? Thanks in advance for any assistance.
As you have discovered, for() loops are not fast in R. There are faster approaches, but it's hard to provide examples without data. Please use something like dput(edges_read) and dput(conflicts) to provide a small example of the data.
As one example, you could implement the for loops in the Rcpp package for speed improvement. Based on the code in your question, you could re-implement the 3-loop code sort of like this:
Rcpp::cppFunction('NumericMatrix MSTC_nxn_Cpp(NumericMatrix edges_read, NumericMatrix conflicts){
  int n = edges_read.nrow();           // number of edges (output is n x n)
  int m = conflicts.nrow();            // number of conflict rows
  NumericMatrix conflictmatrix(n, n);  // the output matrix, filled with 0
  for(int i = 0; i < n; i++){          // your i loop
    for(int k = 0; k < n; k++){        // your k loop
      // same as c(edges_read[i,1], edges_read[i,2], edges_read[k,1], edges_read[k,2])
      double w[4] = {edges_read(i, 0), edges_read(i, 1),
                     edges_read(k, 0), edges_read(k, 1)};
      for(int t = 0; t < m; t++){      // your t loop
        bool same = true;              // similar to all()
        for(int p = 0; p < 4; p++){    // compare conflicts[t,] with w
          if(conflicts(t, p) != w[p]){ same = false; break; }
        }
        if(same){ conflictmatrix(i, k) = 1; break; }
      }
    }
  }
  return conflictmatrix;               // your output
}')
# Then run the function
MSTC_nxn_Cpp(edges_read, conflicts)
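Another way to speed this up is to avoid the triple loop altogether: look up each edge's row index once, then walk through the conflict rows a single time. The sketch below shows the idea in Python rather than R. It assumes each edge (from, to) appears only once in the edge list, and the function and variable names are hypothetical.

```python
def build_conflict_matrix(edges, conflicts):
    """edges: rows of (from, to, weight); conflicts: rows of (from1, to1, from2, to2)."""
    # Map each edge's endpoints to its row index (assumes edges are unique).
    index = {(u, v): i for i, (u, v, _weight) in enumerate(edges)}
    n = len(edges)
    matrix = [[0] * n for _ in range(n)]
    for a, b, c, d in conflicts:
        i = index.get((a, b))
        k = index.get((c, d))
        if i is not None and k is not None:
            matrix[i][k] = 1  # same cell the triple loop would set
    return matrix


edges = [(1, 2, 5), (2, 4, 6)]
conflicts = [(1, 2, 2, 4)]
print(build_conflict_matrix(edges, conflicts))  # [[0, 1], [0, 0]]
```

This does work proportional to the number of edges plus the number of conflict rows, instead of edges x edges x conflicts. The same lookup idea carries over to R, for example by matching on pasted edge keys.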

How does "runif" function work internally in R?

I am trying to generate a set of uniformly distributed numbers in R. I know that the function "runif" does this, but I really want to understand the idea behind it, that is, how the code of "runif" works. In a nutshell, I want to write my own function that does the same task as "runif".
Ultimately, runif calls a pseudorandom number generator. One of the simpler ones is defined in C within the R code base and should be straightforward to emulate:
static unsigned int I1=1234, I2=5678;

void set_seed(unsigned int i1, unsigned int i2)
{
    I1 = i1; I2 = i2;
}

void get_seed(unsigned int *i1, unsigned int *i2)
{
    *i1 = I1; *i2 = I2;
}

double unif_rand(void)
{
    I1= 36969*(I1 & 0177777) + (I1>>16);
    I2= 18000*(I2 & 0177777) + (I2>>16);
    return ((I1 << 16)^(I2 & 0177777)) * 2.328306437080797e-10; /* in [0,1) */
}
So effectively this takes the initial integer seed values, shuffles them bitwise, and then converts the result to a double-precision floating-point number by multiplying it by a small constant (roughly 2^-32) that normalises it into the [0, 1) range.
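As a rough illustration, the C routine above can be emulated in Python. Python integers do not wrap around, so this sketch masks to 32 bits, which assumes unsigned int is 32 bits wide in the C version. The names SimpleUnif and my_runif are hypothetical, and the scaling to [low, high] is my own addition, not taken from the R sources.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned int wrap-around (assumption)


class SimpleUnif:
    """Emulation of the two-seed generator quoted above."""

    def __init__(self, i1=1234, i2=5678):
        self.i1 = i1 & MASK32
        self.i2 = i2 & MASK32

    def unif_rand(self):
        # 0177777 (octal) in the C code is 0xFFFF: the low 16 bits.
        self.i1 = (36969 * (self.i1 & 0xFFFF) + (self.i1 >> 16)) & MASK32
        self.i2 = (18000 * (self.i2 & 0xFFFF) + (self.i2 >> 16)) & MASK32
        bits = ((self.i1 << 16) ^ (self.i2 & 0xFFFF)) & MASK32
        return bits * 2.328306437080797e-10


def my_runif(n, low=0.0, high=1.0, rng=None):
    """Return n pseudorandom numbers scaled from the unit interval to [low, high]."""
    rng = rng or SimpleUnif()
    return [low + (high - low) * rng.unif_rand() for _ in range(n)]


print(my_runif(3))
```

The same seeds always give the same sequence, which is what setting a seed relies on.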

How to find a pair of numbers in a list given a specific range?

The problem is as follows:
Given an array of N numbers, find two numbers in the array such that their range (max - min) is K.
For example:
Input:
5 3
25 9 1 6 8
Output:
9 6
So far, what I've tried is first sorting the array and then finding two complementary numbers using a nested loop. However, because this is a brute-force method, I don't think it is as efficient as other possible ways.
import java.util.*;

public class Main {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        int n = sc.nextInt(), k = sc.nextInt();
        int[] arr = new int[n];
        for(int i = 0; i < n; i++) {
            arr[i] = sc.nextInt();
        }
        int count = 0;
        int a, b;
        // Compare every pair and print those whose difference is k
        for(int i = 0; i < n; i++) {
            for(int j = i; j < n; j++) {
                if(Math.max(arr[i], arr[j]) - Math.min(arr[i], arr[j]) == k) {
                    a = arr[i];
                    b = arr[j];
                    System.out.println(a + " " + b);
                }
            }
        }
    }
}
Much appreciated if the solution was in code (any language).
Here is code in Python 3 that solves your problem. This should be easy to understand, even if you do not know Python.
This routine uses your idea of sorting the array, but I use two variables left and right (which define two places in the array) where each makes just one pass through the array. So other than the sort, the time efficiency of my code is O(N). The sort makes the entire routine O(N log N). This is better than your code, which is O(N^2).
I never use the inputted value of N, since Python can easily handle the actual size of the array. I add a sentinel value to the end of the array to make the inner short loops simpler and quicker. This involves another pass through the array to calculate the sentinel value, but this adds little to the running time. It is possible to reduce the number of array accesses, at the cost of a few more lines of code--I'll leave that to you. I added input prompts to aid my testing--you can remove those to make my results closer to what you seem to want. My code prints the larger of the two numbers first, then the smaller, which matches your sample output. But you may have wanted the order of the two numbers to match the order in the original, un-sorted array--if that is the case, I'll let you handle that as well (I see multiple ways to do that).
# Get input
N, K = [int(s) for s in input('Input N and K: ').split()]
arr = [int(s) for s in input('Input the array: ').split()]

# Sort the array and add a sentinel value at the end
arr.sort()
sentinel = max(arr) + K + 2
arr.append(sentinel)

# Look for two values in the sorted array whose difference is K
left = right = 0
while arr[right] < sentinel:
    # Move the right index until the difference is too large
    while arr[right] - arr[left] < K:
        right += 1
    # Move the left index until the difference is too small
    while arr[right] - arr[left] > K:
        left += 1
    # Check if we are done
    if arr[right] - arr[left] == K:
        print(arr[right], arr[left])
        break
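For reuse and testing, the same two-index scan can be wrapped in a function. The following is a sketch of a variant, not the code above: it drops the sentinel in favour of a bounds check, returns the pair instead of printing it, and returns None when no pair exists. The function name is hypothetical, and K is assumed to be non-negative.

```python
def find_pair_with_difference(values, k):
    """Return (larger, smaller) with larger - smaller == k, or None if there is no such pair."""
    arr = sorted(values)
    left, right = 0, 1
    while right < len(arr):
        diff = arr[right] - arr[left]
        if left == right or diff < k:
            right += 1   # difference too small (or same element): widen the window
        elif diff > k:
            left += 1    # difference too large: shrink the window
        else:
            return arr[right], arr[left]
    return None


print(find_pair_with_difference([25, 9, 1, 6, 8], 3))  # (9, 6)
```

Each index only moves forward, so after sorting the scan is still a single O(N) pass.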

Recursion confusion with local variables

I'm trying to improve my recursion skills (reading recursive functions that others have written) by looking at examples. I can easily follow the logic of recursions without local variables, but in the example below I can't understand how the total variable works. How should I think about reading and writing a recursive function that uses local variables? I picture it as the stack going down, hitting the base case, and coming back. By the way, I also wrote the example without variables. I tried writing just countThrees(n / 10); instead of total = total + countThrees(n / 10); but it doesn't work.
With the total variable:
int countThrees(int n) {
    if (n == 0) { return 0; }
    int lastDigit = n % 10;
    int total = 0;
    total = total + countThrees(n / 10);
    if (lastDigit == 3) {
        total = total + 1;
    }
    return total;
}
Simplified version:
int countThrees(int x) {
    if (x / 10 == 0) return 0;
    if (x % 10 == 3)
        return 1 + countThrees(x / 10);
    return countThrees(x / 10);
}
In both cases a stack is indeed used, but when there are local variables, more stack space is needed because every local variable has to be stored there. In all cases, the line number from which you jump into a new call is also stored.
So, in your second algorithm, if x = 13, the stack will store "line 4" in the first step and "line 4; line 3" in the second one; in the third step nothing is added to the stack because there is no new recursive call. At the end of this step, you read the stack (it is a first in, last out stack) to know where to go back to, you remove "line 3" from the stack, and so on.
In your first algorithm, the only difference is that the local variables are also put on the stack. So, at the end of the second step, it looks like "Total = 0, line 4; Total = 0, line 4".
I hope this is clear enough.
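To see that every call has its own copy of the local variables, it can help to record what each call holds when it returns. The sketch below is a Python transcription of the first version with an extra, hypothetical trace list added purely for illustration. It also shows why calling the function without using its result does not work: the count comes back only through the return value, so it has to be added to this call's own total.

```python
def count_threes(n, depth=0, trace=None):
    """Count the digits equal to 3 in n, recording (depth, n, total) as each call returns."""
    if trace is None:
        trace = []
    if n == 0:
        trace.append((depth, n, 0))
        return 0, trace
    last_digit = n % 10
    total = 0  # a fresh local variable for this call only
    sub_total, _ = count_threes(n // 10, depth + 1, trace)
    total = total + sub_total  # the result of the deeper call must be used
    if last_digit == 3:
        total = total + 1
    trace.append((depth, n, total))
    return total, trace


result, frames = count_threes(13)
print(result)  # 1
for depth, n, total in frames:
    print("  " * depth + f"n={n} returns total={total}")
```

The deepest call returns first, and each outer call then finishes its own work with its own total, which is the "go, hit, come back" picture from the question.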
The first condition should read:
if (x == 0) return 0;
Otherwise the single 3 would yield 0.
And in functional style the entire code reduces to:
return x == 0 ? 0
: countThrees(x / 10) + (x % 10 == 3 ? 1 : 0);
On the local variables:
int countThrees(int n) {
    if (n == 0) {
        return 0;
    }
    // Let an alter ego do the other digits:
    int total = countThrees(n / 10);
    // Do this digit:
    int lastDigit = n % 10;
    if (lastDigit == 3) {
        ++total;
    }
    return total;
}
The original code was a bit undecided about when to do what, such as adding to total right after initializing it with 0.
Declaring the variable at its first use makes things clearer.
It also shows the absolute laziness: first let the recursive instances calculate the total of the other digits, and only then do the last digit yourself.
Using a variable lastDigit with only one usage is not wrong; it explains what is happening: you inspect the last digit.
The preincrement operator ++x; is x += 1; is x = x + 1;.
One could have done it (recursive call and own work) the other way around, so the order probably says something about the writer's psychological preferences.
The stack usage: yes, total before the recursive call is an extra variable on the stack. That is irrelevant for numbers. Also, a smart compiler could see that total is a result.
On the usage of variables: they can be stateful, and hence are useful for turning recursion into iteration. For that, tail recursion is easiest: the recursion happens last.
int countThrees(int n) {
    int total = 0;
    while (n != 0) {
        int digit = n % 10;
        if (digit == 3) {
            ++total;
        }
        n /= 10; // Divide by 10
    }
    return total;
}
