Today an interviewer asked me about the following declarations: is this overloading or overriding?
int a(int n1, int n2)
float a(int n1, int n2)
It is overloading, because some modern languages support this approach.
Overloading by return type is possible and is done by some modern
languages. The usual objection is that in code like
int func();
string func();
int main() { func(); }
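For comparison, C++ itself rejects two functions that differ only in their return type. A common workaround is to return a proxy object with several conversion operators, so that the target type at the call site picks the result. The sketch below is only an illustration of that idea; the Result type and the values it returns are made up for the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical proxy type: the conversion that runs depends on
// the type the caller assigns the result to.
struct Result {
    operator int() const { return 42; }
    operator std::string() const { return "forty-two"; }
};

// A single function; the apparent overloading by return type
// happens in the conversion operators of Result.
inline Result func() { return Result(); }
```

Written as `int i = func();` or `std::string s = func();` the call is unambiguous. A bare `func();` simply discards the proxy, which sidesteps the ambiguity that the objection above is about.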
I recently started using Arduino so I still have to adapt and find the differences between C/C++ and the Arduino language.
So I have a question for you.
When I see someone using a C-style string in Arduino (char *str), they always initialize it like this (and never free it):
char *str = "Hello World";
In pure C, I would have done something like this:
int my_strlen(char const *str)
{
int i = 0;
while (str[i]) {
i++;
}
return (i);
}
char *my_strcpy(char *dest, char const *src)
{
char *it = dest;
while (*src != 0) {
*it = *src;
it++;
src++;
}
*it = '\0'; /* copy the terminating null byte as well */
return (dest);
}
char *my_strdup(char const *s)
{
char *result = NULL;
int length = my_strlen(s);
result = my_malloc(sizeof(char const) * (length + 1));
if (result == NULL) {
return (NULL);
}
my_strcpy(result, s);
return (result);
}
and then initialize it like this:
char *str = my_strdup("Hello World");
my_free(str);
So here is my question: for C-style strings on Arduino, is malloc optional, or did these people just get it wrong?
Thank you for your answers.
In C++ it's better to use new[]/delete[] and not mix them with malloc/free.
On Arduino there is also the String class, which hides those allocations from you.
However, dynamic memory allocation on such a constrained platform has its pitfalls, such as heap fragmentation. This happens mainly because String overloads the + operator, so people overuse it, as in Serial.println(String{"something : "} + a + b + c + d + ......), and then wonder about mysterious crashes.
More about it on Majenko's blog: The Evils of Arduino String class (Majenko has the highest reputation on the Arduino Stack Exchange).
Basically, with the String class your strdup code would be as simple as this:
String str{"Hello World"};
String copyOfStr = str;
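As for the original question: a string literal has static storage duration, so a pointer to it needs neither malloc nor free. A heap copy is only needed when the text has to be modified or outlive its source. The sketch below shows both cases in standard C++ using the new[]/delete[] pairing mentioned above. The helper name duplicate is an assumption for illustration, and on AVR boards the header would be <string.h> rather than <cstring>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The literal lives in static storage for the whole program:
// nothing to allocate, nothing to free. It must not be modified,
// hence the const.
const char *greeting = "Hello World";

// Hypothetical helper: returns a writable heap copy made with new[].
// The caller releases it with delete[], never with free().
char *duplicate(const char *s) {
    std::size_t n = std::strlen(s) + 1;  // +1 for the terminating '\0'
    char *copy = new char[n];
    std::memcpy(copy, s, n);
    return copy;
}
```

A caller would pair each `duplicate(...)` with exactly one `delete[]`, while `greeting` is used directly and never released.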
I have two pointers in memory and I want to swap them atomically, but atomic operations in CUDA support only integer types. Is there a way to do the following swap?
classA* a1 = malloc(...);
classA* a2 = malloc(...);
atomicSwap(a1,a2);
When writing device-side code...
While CUDA provides atomics, they can't cover multiple (possibly remote) memory locations at once.
To perform this swap, you will need to "protect" access to both values with something like a mutex, and have whoever wants to write to them hold the mutex for the duration of the critical section (as with C++'s host-side std::lock_guard). This can be built from CUDA's actual atomic facilities, e.g. compare-and-swap, and is the subject of this question:
Implementing a critical section in CUDA
A caveat to the above is mentioned by @RobertCrovella: if you can make do with, say, a pair of 32-bit offsets rather than a 64-bit pointer, and you store them in a 64-bit aligned struct, you can use compare-and-exchange on the whole struct to implement an atomic swap of the whole struct.
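The packed-offsets idea can be sketched on the host side with std::atomic. This is only an analogue under stated assumptions: the OffsetPair layout and the function names are invented for the example, and in device code the 64-bit atomicCAS overload on unsigned long long would play the role that compare_exchange_weak plays here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical pair of 32-bit offsets that fits in one 64-bit word.
struct OffsetPair {
    std::uint32_t first;
    std::uint32_t second;
};

inline std::uint64_t pack(OffsetPair p) {
    return (static_cast<std::uint64_t>(p.first) << 32) | p.second;
}

inline OffsetPair unpack(std::uint64_t w) {
    return OffsetPair{static_cast<std::uint32_t>(w >> 32),
                      static_cast<std::uint32_t>(w & 0xFFFFFFFFu)};
}

// Swap the two halves in one atomic step using a compare-and-exchange loop.
inline void atomic_swap_pair(std::atomic<std::uint64_t>& word) {
    std::uint64_t seen = word.load();
    for (;;) {
        OffsetPair p = unpack(seen);
        std::uint64_t wanted = pack(OffsetPair{p.second, p.first});
        // On failure, compare_exchange_weak stores the current value
        // in `seen`, so the next iteration retries with fresh data.
        if (word.compare_exchange_weak(seen, wanted)) {
            return;
        }
    }
}
```

Because both offsets live in the same word, no reader can ever observe one half swapped and the other not.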
... but is it really device side code?
Your code actually doesn't look like something one would run on the device: Memory allocation is usually (though not always) done from the host side before you launch your kernel and do actual work. If you could make sure these alterations only happen on the host side (think CUDA events and callbacks), and that device-side code will not be interfered with by them - you can just use your plain vanilla C++ facilities for concurrent programming (like lock_guard I mentioned above).
I managed to get the behaviour I needed. It is not an atomic swap, but it is still safe. The context was a monotonic linked list working on both CPU and GPU:
template<typename T>
union readablePointer
{
T* ptr;
unsigned long long int address;
};
template<typename T>
struct LinkedList
{
struct Node
{
T value;
readablePointer<Node> previous;
};
Node start;
Node end;
int size;
__host__ __device__ void initialize()
{
size = 0;
start.previous.ptr = nullptr;
end.previous.ptr = &start;
}
__host__ __device__ void push_back(T value)
{
// malloc returns the new block; it does not fill in a pointer argument.
Node* node = static_cast<Node*>(malloc(sizeof(Node)));
readablePointer<Node> nodePtr;
nodePtr.ptr = node;
nodePtr.ptr->value = value;
#ifdef __CUDA_ARCH__
nodePtr.ptr->previous.address = atomicExch(&end.previous.address, nodePtr.address);
atomicAdd(&size,1);
#else
nodePtr.ptr->previous.address = end.previous.address;
end.previous.address = nodePtr.address;
size += 1;
#endif
}
__host__ __device__ T pop_back()
{
assert(end.previous.ptr != &start);
readablePointer<Node> lastNodePtr;
lastNodePtr.ptr = nullptr;
#ifdef __CUDA_ARCH__
lastNodePtr.address = atomicExch(&end.previous.address,end.previous.ptr->previous.address);
atomicSub(&size,1);
#else
lastNodePtr.address = end.previous.address;
end.previous.address = end.previous.ptr->previous.address;
size -= 1;
#endif
T toReturn = lastNodePtr.ptr->value;
free(lastNodePtr.ptr);
return toReturn;
}
__host__ __device__ void clear()
{
while(size > 0)
{
pop_back();
}
}
};
I'm doing the following in an OpenCL kernel (simplified example):
__kernel void step(const uint count, __global int *map, __global float *sum)
{
const uint i = get_global_id(0);
if(i < count) {
sum[map[i]] += 12.34;
}
}
Here, sum is some quantity I want to calculate (previously set to zero in another kernel) and map is a mapping from integers i to integers j, such that multiple i's can map to the same j.
(map could be in constant memory rather than global, but it seems the amount of constant memory on my GPU is incredibly limited)
Will this work? Is a "+=" implemented in an atomic way, or is there a chance of concurrent operations overwriting each other?
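It will not work. A plain "+=" is a separate read, add, and write, so when several work-items update the same element, updates can be lost. You need to resort explicitly to atomic operations; in this case, atomic_add. Note that atomic_add only accepts integer types, so the kernel below keeps the sum as a fixed-point integer (hundredths).
Something like:
__kernel void step(const uint count, __global int *map, __global int *sum)
{
const uint i = get_global_id(0);
if(i < count) {
atomic_add(&sum[map[i]], 1234); /* 12.34 expressed in hundredths */
}
}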
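If the total has to stay a floating-point value, a widely used workaround is a compare-and-exchange loop: read the current value, compute the new one, and retry if another thread changed it in between. The sketch below shows the pattern in host-side C++ with std::atomic<float>. It is an analogue only, and the function name atomic_add_float is an assumption; in an OpenCL kernel the same loop is typically written with atomic_cmpxchg on the 32-bit bit pattern of the float.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>

// Add `amount` to `target` atomically with a compare-and-exchange loop.
// Returns the value held before the addition, like fetch_add does.
inline float atomic_add_float(std::atomic<float>& target, float amount) {
    float seen = target.load();
    // If another thread modified `target` after we read it, the exchange
    // fails, `seen` is refreshed, and seen + amount is recomputed.
    while (!target.compare_exchange_weak(seen, seen + amount)) {
    }
    return seen;
}
```

Under heavy contention on a single element this loop retries often, so the integer atomic_add route is usually the cheaper one when fixed-point precision is acceptable.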
In OpenCL, what are the consequences of, and differences between, the following struct declarations? And if they are illegal, why?
struct gr_array
{
int ndims;
__global m_integer* dim_size;
__global m_real* data;
};
typedef struct gr_array g_real_array;
struct lr_array
{
int ndims;
__local m_integer* dim_size;
__local m_real* data;
};
typedef struct lr_array l_real_array;
__kernel void temp(...){
__local g_real_array A;
g_real_array B;
__local l_real_array C;
l_real_array D;
}
My question is: where will the structures (and their members) be allocated? Who can access them? And is this good practice or not?
EDIT
How about this:
struct r_array
{
__local int ndims;
};
typedef struct r_array real_array;
__kernel void temp(...){
__local real_array A;
real_array B;
}
If a work-item modifies ndims in struct B, is the change visible to other work-items in the work-group?
I've rewritten your code as valid CL, or at least CL that will compile. Here:
typedef struct gr_array {
int ndims;
global int* dim_size;
global float* data;
} g_float_array;
typedef struct lr_array {
int ndims;
local int* dim_size;
local float* data;
} l_float_array;
kernel void temp() {
local g_float_array A;
g_float_array B;
local l_float_array C;
l_float_array D;
}
One by one, here's how this breaks down:
A is in local space. It's a struct that is composed of one int and two pointers. These pointers point to data in global space, but are themselves allocated in local space.
B is in private space; it's an automatic variable. It is composed of an int and two pointers that point to stuff in global memory.
C is in local space. It contains an int and two pointers to stuff in local space.
D, you can probably guess at this point. It's in private space, and contains an int and two pointers that point to stuff in local space.
I cannot say whether any of these is preferable for your problem, since you haven't described what you are trying to accomplish.
EDIT: I realized I didn't address the second part of your question -- who can access the structure fields.
Well, you can access the fields anywhere the variable is in scope. I'm guessing you thought that the fields you had marked as global in g_float_array were in global space (and in local space for l_float_array). But they're just pointing to stuff in global (or local) space.
So, you'd use them like this:
kernel void temp(
global float* data, global int* global_size,
local float* data_local, local int* local_size,
int num)
{
local g_float_array A;
g_float_array B;
local l_float_array C;
l_float_array D;
A.ndims = B.ndims = C.ndims = D.ndims = num;
A.data = B.data = data;
A.dim_size = B.dim_size = global_size;
C.data = D.data = data_local;
C.dim_size = D.dim_size = local_size;
}
By the way -- if you're hacking CL on a Mac running Lion, you can compile .cl files using the "offline" CL compiler, which makes experimenting with this kind of stuff a bit easier. It's located here:
/System/Library/Frameworks/OpenCL.framework/Libraries/openclc
There is some sample code here.
It probably won't work, because current GPUs have separate memory spaces for OpenCL kernels and for the ordinary host program. You have to make explicit calls to transfer data between the two spaces, and this is often the bottleneck of the program (because the bandwidth of the PCI Express link to the graphics card is comparatively low).
I am preparing myself for the definition of user-defined literals with a variadic template:
template<...>
unsigned operator "" _binary();
unsigned thirteen = 1101_binary;
GCC 4.7.0 does not support operator "" yet, but I can simulate this with a simple function until then.
Alas, my recursion is the wrong way around. I cannot think of a nice way to shift the leftmost values instead of the rightmost ones:
template<char C> int _bin();
template<> int _bin<'1'>() { return 1; }
template<> int _bin<'0'>() { return 0; }
template<char C, char D, char... ES>
int _bin() {
return _bin<C>() | _bin<D,ES...>() << 1; // <-- WRONG!
}
which of course is not quite right:
int val13 = _bin<'1','1','0','1'>(); // <-- gives 11 (binary 1011), not 13
because my recursion shifts the rightmost '1's farthest, and not the leftmost ones.
It is probably a tiny thing, but I just cannot see it.
Can I correct the line _bin<C>() | _bin<D,ES...>() << 1;?
Or do I have to forward everything and then turn it all around afterwards (not nice)?
Or is there any other way that I cannot see?
Update: I could not fold the recursion the other way around, but I discovered sizeof... (the number of elements in a parameter pack). It works, but it is not perfect. Is there another way?
template<char C, char D, char... ES>
int _bin() {
return _bin<C>() << (sizeof...(ES)+1) | _bin<D,ES...>() ;
}
At any one step of the recursion you already know the rank of the leftmost digit.
template<char C> int _bin();
template<> int _bin<'1'>() { return 1; }
template<> int _bin<'0'>() { return 0; }
template<char C, char D, char... ES>
int _bin() {
return _bin<C>() << (1 + sizeof...(ES)) | _bin<D,ES...>();
}
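The same shift-by-rank idea can be wired into the literal operator template from the question. The sketch below is one possible arrangement and not the only one: the bin_value helper is a made-up name, and the computation is moved into a class template so that everything is usable in constant expressions.

```cpp
#include <cassert>

// Hypothetical helper: value of a pack of binary digit characters.
template<char... Digits>
struct bin_value;

// Single digit: its value is 0 or 1.
template<char C>
struct bin_value<C> {
    static_assert(C == '0' || C == '1', "only 0 and 1 are binary digits");
    static constexpr unsigned value = C - '0';
};

// Leftmost digit has rank 1 + sizeof...(Rest); the rest recurse unchanged.
template<char C, char D, char... Rest>
struct bin_value<C, D, Rest...> {
    static constexpr unsigned value =
        (bin_value<C>::value << (1 + sizeof...(Rest))) |
        bin_value<D, Rest...>::value;
};

// Numeric literal operator template: 1101_binary passes '1','1','0','1'.
template<char... Digits>
constexpr unsigned operator""_binary() {
    return bin_value<Digits...>::value;
}

static_assert(1101_binary == 13u, "leftmost digit is the most significant");
```

The static_assert inside the single-digit case turns a stray digit such as 2 into a compile-time error instead of a silent wrong value.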
Parameter packs are relatively inflexible, and you don't usually write algorithms directly in them. Variadic function templates are good for forwarding, but I'd get that packed into a more manageable tuple before trying to manipulate it.
Using a simple binary_string_value metafunction where the 1's place comes first, and a generic tuple_reverse metafunction, the pattern would be
template< char ... digit_pack >
constexpr unsigned long long _bin() {
typedef std::tuple< std::integral_constant< int, digit_pack - '0' > ... > digit_tuple; // integral_constant needs the value type first
return binary_string_value< typename tuple_reverse< digit_tuple >::type >::value;
}
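The two metafunctions are not defined in the answer, so the following is only a guess at what they might look like: tuple_reverse moves elements one at a time onto an accumulator tuple, and binary_string_value treats the first element as the 1's place. The function is named bin rather than _bin here, because names starting with an underscore are reserved at global scope.

```cpp
#include <cassert>
#include <tuple>
#include <type_traits>

// Assumed definition: reverse the element types of a std::tuple by
// moving each head onto the front of an accumulator tuple.
template<typename Tuple, typename Acc = std::tuple<>>
struct tuple_reverse;

template<typename Acc>
struct tuple_reverse<std::tuple<>, Acc> { typedef Acc type; };

template<typename Head, typename... Tail, typename... Done>
struct tuple_reverse<std::tuple<Head, Tail...>, std::tuple<Done...>>
    : tuple_reverse<std::tuple<Tail...>, std::tuple<Head, Done...>> {};

// Assumed definition: value of a digit tuple whose FIRST element is the
// least significant bit.
template<typename Tuple>
struct binary_string_value;

template<>
struct binary_string_value<std::tuple<>> {
    static constexpr unsigned long long value = 0;
};

template<typename Low, typename... Rest>
struct binary_string_value<std::tuple<Low, Rest...>> {
    static constexpr unsigned long long value =
        Low::value + 2 * binary_string_value<std::tuple<Rest...>>::value;
};

// The pattern from the answer, with the helpers above plugged in.
template<char... digit_pack>
constexpr unsigned long long bin() {
    typedef std::tuple<std::integral_constant<int, digit_pack - '0'>...> digit_tuple;
    return binary_string_value<typename tuple_reverse<digit_tuple>::type>::value;
}

static_assert(bin<'1', '1', '0', '1'>() == 13, "1101 in binary is 13");
```

This is heavier than the sizeof... version, but the reversal and the evaluation are separate pieces that can be reused for other digit bases.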
One possibility would be using an accumulator:
template <char C>
int _binchar();
template<>
int _binchar<'0'>() { return 0; }
template<>
int _binchar<'1'>() { return 1; }
template<char C>
int _bin(int acc=0) {
return (acc*2 + _binchar<C>());
}
template<char C, char D, char... ES>
int _bin(int acc=0) {
return _bin<D, ES...>(acc*2 + _binchar<C>());
}