OK, say I have a boolean array called bits, and an int called cursor
I know I can access individual bits by using bits[cursor], and that I can use bit logic to get larger datatypes from bits, for example:
short result = (bits[cursor] << 3) |
(bits[cursor+1] << 2) |
(bits[cursor+2] << 1) |
bits[cursor+3];
This is going to result in lines and lines of code when reading larger types like int32 and int64 though.
Is it possible to do a cast of some kind and achieve the same result? I'm not concerned about safety at all in this context (these functions will be wrapped into a class that handles that)
Say I wanted to get an uint64_t out of bits, starting at an arbitrary address specified by cursor, when cursor isn't necessarily a multiple of 64; is this possible by a cast? I thought this
uint64_t result = (uint64_t *)(bits + cursor)[0];
Would work, but it doesn't want to compile.
Sorry I know this is a dumb question, I'm quite inexperienced with pointer math. I'm not looking just for a short solution, I'm also looking for a breakdown of the syntax if anyone would be kind enough.
Thanks!
You could try something like this and cast the result to your target data size.
uint64_t bitsToUint64(bool *bits, unsigned int bitCount)
{
uint64_t result = 0;
uint64_t tempBits = 0;
if(bitCount > 0 && bitCount <= 64)
{
for(unsigned int i = 0, j = bitCount - 1; i < bitCount; i++, j--)
{
tempBits = (bits[i])?1:0;
result |= (tempBits << j);
}
}
return result;
}
Related
I'm trying to parse a hex number byte by byte, and concatenate to a string the representation of each byte, in the order they're stored in memory. (for a little test on endianness, but that's not important I guess).
Here is the code (please ignore the glaring unit-test issues with it :D; also, some of the code might look weird since initially the display_bytes method took in a char* not an int8_t*, but I thought using an int8_t might make it more obvious to me, what the issue is)
TEST_CLASS(My001littlebigendian)
{
public:
TEST_METHOD(TestMethod1)
{
int i = 0x12345678;
display_bytes((int8_t*)&i, sizeof(i));
}
void display_bytes(int8_t* b, int length)
{
std::stringstream ss;
for (int i = 0; i < length; ++i)
{
int8_t signedCharRepresentation = *(b + i); //signed char has 1 byte
int8_t signed8ByteInt = (int8_t)signedCharRepresentation; //this is not ok
int32_t signed32ByteInt = (int32_t)signedCharRepresentation; //this is ok. why?
//ss << std::hex << signed8ByteInt; //this is not ok. why?
ss << std::hex << signed32ByteInt; //this is ok
}
std::string stringRepresentation = ss.str();
if (stringRepresentation.compare("78563412") == 0)
{
Assert::IsTrue(true, L"machine is little-endian");
}
else if(stringRepresentation.compare("01234567") == 0)
{
Assert::IsTrue(true, L"machine is big-endian");
}
else
{
Assert::IsTrue(true, L"machine is other-endian");
}
}
};
Now, what I don't understand (as hopefull the comments make clear) is why does this only work when I cast each byte to a 4 byte int, and not an 1 byte int. Since I am working with chunks of 1 byte. Intuitively it would make me think doing it like this should cause some sort of overflow? But it seems not.
I've not dug deeper into why this is the issue yet, since I was hoping to not need to. And maybe if someone with more knowledge in this area can give me nudge in the right direction, or maybe even an outright answer if I'm missing something very obvious. (which I do feel I might be, since I'm not used to working at this low level).
I am trying to run C code in R using Rcpp, but am unsure how to convert a buffer used to hold data from a file. In the third line of code below, I allocate an unsigned char buffer and my problem is that I don't know what Rcpp data type to use. Once the data are read into the buffer, I figured out how to use Rcpp::NumericMatrix to hold the final result, but not the character buffer. I have seen several responses by Dirk Eddelbuettel to similar questions where he suggests replacing all 'malloc' calls with Rcpp initialization commands. I tried using an Rcpp::CharacterVector, but then there is a type mismatch in the loop at the end: the Rcpp::CharacterVector cannot be read as an unsigned long long int. The code runs for some C-compilers, but throws a 'memory corruption' error for others, so I would prefer to do things the way Dirk suggests (use Rcpp data types) so that the code will run regardless of the specific compiler.
FILE *fp = fopen( filename, "r" );
fseek( fp, index_data_offset, SEEK_SET );
unsigned char* buf = (unsigned char *)malloc( 3 * number_of_index_entries * sizeof(unsigned long long int) );
fread( buf, sizeof("unsigned long long int"), (long)(3 * number_of_index_entries), fp );
fclose( fp );
// Convert "buf" into a 3-column matrix.
unsigned long long int l;
Rcpp::NumericMatrix ToC(3, number_of_index_entries);
for (int col=0; col<number_of_index_entries; col++ ) {
l = 0;
int offset = (col*3 + 0)*sizeof(unsigned long long int);
for (int i = 0; i < 8; ++i) {
l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
}
ToC(0,col) = l;
l = 0;
offset = (col*3 + 1)*sizeof(unsigned long long int);
for (int i = 0; i < 8; ++i) {
l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
}
ToC(1,col) = l;
l = 0;
offset = (col*3 + 2)*sizeof(unsigned long long int);
for (int i = 0; i < 8; ++i) {
l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
}
ToC(2,col) = l;
}
return( ToC );
C and C++ can be lovely. If you know what you're doing, you have both a very direct line to the underlying hardware and higher-level abstraction for efficient reasoning.
I would suggest to simplify and reduce the problem. Start with a simple and known case, for example an STL vector of double. Let's call is x. Fill it with 10 or hundred elements, then open a FILE and write a blob from
x.data(), x.size() * sizeof(double)
Close the file. The read it into Rcpp by first allocation a NumericVector v of the same size, then reading the bytes back and then calling memcpy to &(v[0]).
It should be the same vector.
Then you can generalize to different types. Because vectors are guaranteed to be contiguous memory you can this serialization trick directly.
You can do variations on this with character buffers, or void*, or ... None of that matters for as long as you are careful not to mismatch. I.e. don't assing an int payload to a double and so on.
Now, is any this recommended? Hell no, unless you are chasing performance and know well enough what you are doing in which case it is reasonable. Otherwise rely on fantastic existing packages like fst or qs
to do it for you.
I hope this helps with your question. I wasn't entirely what it was you were asking. Maybe you clarify (and possibly shorten / focus) it if not.
A typecast did the trick:
Rcpp::NumericVector NumVecBuf( 3 * number_of_index_entries * sizeof(unsigned long long int) );
unsigned char* buf = (unsigned char*) &(NumVecBuf[0]);
Dirk's statement about "contiguous memory" suggested that this would work, so I went ahead and marked his comment as the answer. Thanks, Dirk! And, thanks for developing and maintaining Rcpp!
I have a question that I found many threads in, but none did explicitly answer my question.
I am trying to have a multidimensional array inside the kernel of the GPU using thrust. Flattening would be difficult, as all the dimensions are non-homogeneous and I go up to 4D. Now I know I cannot have device_vectors of device_vectors, for whichever underlying reason (explanation would be welcome), so I tried going the way over raw-pointers.
My reasoning is, a raw pointer points onto memory on the GPU, why else would I be able to access it from within the kernel. So I should technically be able to have a device_vector, which holds raw pointers, all pointers that should be accessible from within the GPU. This way I constructed the following code:
thrust::device_vector<Vector3r*> d_fluidmodelParticlePositions(nModels);
thrust::device_vector<unsigned int***> d_allFluidNeighborParticles(nModels);
thrust::device_vector<unsigned int**> d_nFluidNeighborsCrossFluids(nModels);
for(unsigned int fluidModelIndex = 0; fluidModelIndex < nModels; fluidModelIndex++)
{
FluidModel *model = sim->getFluidModelFromPointSet(fluidModelIndex);
const unsigned int numParticles = model->numActiveParticles();
thrust::device_vector<Vector3r> d_neighborPositions(model->getPositions().begin(), model->getPositions().end());
d_fluidmodelParticlePositions[fluidModelIndex] = CudaHelper::GetPointer(d_neighborPositions);
thrust::device_vector<unsigned int**> d_fluidNeighborIndexes(nModels);
thrust::device_vector<unsigned int*> d_nNeighborsFluid(nModels);
for(unsigned int pid = 0; pid < nModels; pid++)
{
FluidModel *fm_neighbor = sim->getFluidModelFromPointSet(pid);
thrust::device_vector<unsigned int> d_nNeighbors(numParticles);
thrust::device_vector<unsigned int*> d_neighborIndexesArray(numParticles);
for(unsigned int i = 0; i < numParticles; i++)
{
const unsigned int nNeighbors = sim->numberOfNeighbors(fluidModelIndex, pid, i);
d_nNeighbors[i] = nNeighbors;
thrust::device_vector<unsigned int> d_neighborIndexes(nNeighbors);
for(unsigned int j = 0; j < nNeighbors; j++)
{
d_neighborIndexes[j] = sim->getNeighbor(fluidModelIndex, pid, i, j);
}
d_neighborIndexesArray[i] = CudaHelper::GetPointer(d_neighborIndexes);
}
d_fluidNeighborIndexes[pid] = CudaHelper::GetPointer(d_neighborIndexesArray);
d_nNeighborsFluid[pid] = CudaHelper::GetPointer(d_nNeighbors);
}
d_allFluidNeighborParticles[fluidModelIndex] = CudaHelper::GetPointer(d_fluidNeighborIndexes);
d_nFluidNeighborsCrossFluids[fluidModelIndex] = CudaHelper::GetPointer(d_nNeighborsFluid);
}
Now the compiler won't complain, but accessing for example d_nFluidNeighborsCrossFluids from within the kernel will work, but return wrong values. I access it like this (again, from within a kernel):
d_nFluidNeighborsCrossFluids[iterator1][iterator2][iterator3];
// Note: out of bounds indexing guaranteed to not happen, indexing is definitely right
The question is, why does it return wrong values? The logic behind it should work in my opinion, since my indexing is correct and the pointers should be valid addresses from within the kernel.
Thank you already for your time and have a great day.
EDIT:
Here is a minimal reproducable example. For some reason the values appear right despite of having the same structure as my code, but cuda-memcheck reveals some errors. Uncommenting the two commented lines leads me to my main problem I am trying to solve. What does the cuda-memcheck here tell me?
/* Part of this example has been taken from code of Robert Crovella
in a comment below */
#include <thrust/device_vector.h>
#include <stdio.h>
template<typename T>
static T* GetPointer(thrust::device_vector<T> &vector)
{
return thrust::raw_pointer_cast(vector.data());
}
__global__
void k(unsigned int ***nFluidNeighborsCrossFluids, unsigned int ****allFluidNeighborParticles){
const unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
if(i > 49)
return;
printf("i: %d nNeighbors: %d\n", i, nFluidNeighborsCrossFluids[0][0][i]);
//for(int j = 0; j < nFluidNeighborsCrossFluids[0][0][i]; j++)
// printf("i: %d j: %d neighbors: %d\n", i, j, allFluidNeighborParticles[0][0][i][j]);
}
int main(){
const unsigned int nModels = 2;
const int numParticles = 50;
thrust::device_vector<unsigned int**> d_nFluidNeighborsCrossFluids(nModels);
thrust::device_vector<unsigned int***> d_allFluidNeighborParticles(nModels);
for(unsigned int fluidModelIndex = 0; fluidModelIndex < nModels; fluidModelIndex++)
{
thrust::device_vector<unsigned int*> d_nNeighborsFluid(nModels);
thrust::device_vector<unsigned int**> d_fluidNeighborIndexes(nModels);
for(unsigned int pid = 0; pid < nModels; pid++)
{
thrust::device_vector<unsigned int> d_nNeighbors(numParticles);
thrust::device_vector<unsigned int*> d_neighborIndexesArray(numParticles);
for(unsigned int i = 0; i < numParticles; i++)
{
const unsigned int nNeighbors = i;
d_nNeighbors[i] = nNeighbors;
thrust::device_vector<unsigned int> d_neighborIndexes(nNeighbors);
for(unsigned int j = 0; j < nNeighbors; j++)
{
d_neighborIndexes[j] = i + j;
}
d_neighborIndexesArray[i] = GetPointer(d_neighborIndexes);
}
d_nNeighborsFluid[pid] = GetPointer(d_nNeighbors);
d_fluidNeighborIndexes[pid] = GetPointer(d_neighborIndexesArray);
}
d_nFluidNeighborsCrossFluids[fluidModelIndex] = GetPointer(d_nNeighborsFluid);
d_allFluidNeighborParticles[fluidModelIndex] = GetPointer(d_fluidNeighborIndexes);
}
k<<<256, 256>>>(GetPointer(d_nFluidNeighborsCrossFluids), GetPointer(d_allFluidNeighborParticles));
if (cudaGetLastError() != cudaSuccess)
printf("Sync kernel error: %s\n", cudaGetErrorString(cudaGetLastError()));
cudaDeviceSynchronize();
}
A device_vector is a class definition. That class has various methods and operators associated with it. The thing that allows you to do this:
d_nFluidNeighborsCrossFluids[...]...;
is a square-bracket operator. That operator is a host operator (only). It is not usable in device code. Issues like this give rise to the general statements that "thrust::device_vector is not usable in device code." The device_vector object itself is generally not usable. However the data it contains is usable in device code, if you attempt to access it via a raw pointer.
Here is an example of a thrust device vector that contains an array of pointers to the data contained in other device vectors. That data is usable in device code, as long as you don't attempt to make use of the thrust::device_vector object itself:
$ cat t1509.cu
#include <thrust/device_vector.h>
#include <stdio.h>
template <typename T>
__global__ void k(T **data){
printf("the first element of vector 1 is: %d\n", (int)(data[0][0]));
printf("the first element of vector 2 is: %d\n", (int)(data[1][0]));
printf("the first element of vector 3 is: %d\n", (int)(data[2][0]));
}
int main(){
thrust::device_vector<int> vector_1(1,1);
thrust::device_vector<int> vector_2(1,2);
thrust::device_vector<int> vector_3(1,3);
thrust::device_vector<int *> pointer_vector(3);
pointer_vector[0] = thrust::raw_pointer_cast(vector_1.data());
pointer_vector[1] = thrust::raw_pointer_cast(vector_2.data());
pointer_vector[2] = thrust::raw_pointer_cast(vector_3.data());
k<<<1,1>>>(thrust::raw_pointer_cast(pointer_vector.data()));
cudaDeviceSynchronize();
}
$ nvcc -o t1509 t1509.cu
$ cuda-memcheck ./t1509
========= CUDA-MEMCHECK
the first element of vector 1 is: 1
the first element of vector 2 is: 2
the first element of vector 3 is: 3
========= ERROR SUMMARY: 0 errors
$
EDIT: In the mcve you have now posted, you point out that an ordinary run of the code appears to give correct results, but when you use cuda-memcheck, errors are reported. You have a general design problem that will cause this.
In C++, when an object is defined within a curly-braces region:
{
{
Object A;
// object A is in-scope here
}
// object A is out-of-scope here
}
// object A is out of scope here
k<<<...>>>(anything that points to something in object A); // is illegal
and you exit that region, the object defined within the region is now out of scope. For objects with constructors/destructors, this usually means the destructor of the object will be called when it goes out-of-scope. For a thrust::device_vector (or std::vector) this will deallocate any underlying storage associated with that vector. That does not necessarily "erase" any data, but attempts to use that data are illegal and would be considered UB (undefined behavior) in C++.
When you establish pointers to such data inside an in-scope region, and then go out-of-scope, those pointers no longer point to anything that would be legal to access, so attempts to dereference the pointer would be illegal/UB. Your code is doing this. Yes, it does appear to give the correct answer, because nothing is actually erased on deallocation, but the code design is illegal, and cuda-memcheck will highlight that.
I suppose one fix would be to pull all this stuff out of the inner curly-braces, and put it at main scope, just like the d_nFluidNeighborsCrossFluids device_vector is. But you might also want to rethink your general data organization strategy and flatten your data.
You should really provide a minimal, complete, verifiable/reproducible example; yours is neither minimal, nor complete, nor verifiable.
I will, however, answer your side-question:
I know I cannot have device_vectors of device_vectors, for whichever underlying reason (explanation would be welcome)
While a device_vector regards a bunch of data on the GPU, it's a host-side data structure - otherwise you would not have been able to use it in host-side code. On the host side, what it holds should be something like: The capacity, the size in elements, the device-side pointer to the actual data, and maybe more information. This is similar to how an std::vector variable may refer to data that's on the heap, but if you create the variable locally the fields I mentioned above will exist on the stack.
Now, those fields of the device vector that are located in host memory are not generally accessible from the device-side. In device-side code you would typically use the raw pointer to the device-side data the device_vector manages.
Also, note that if you have a thrust::device_vector<T> v, each use of operator[] means a bunch of separate CUDA calls to copy data to or from the device (unless there's some caching going on under the hoold). So you really want to avoid using square-brackets with this structure.
Finally, remember that pointer-chasing can be a performance killer, especially on a GPU. You might want to consider massaging your data structure somewhat in order to make it amenable to flattening.
I wrote a function to convert a hexa string representation (like x00) of some binary data to the data itself.
How to improve this code?
QByteArray restoreData(const QByteArray &data, const QString prepender = "x")
{
QByteArray restoredData = data;
return QByteArray::fromHex(restoredData.replace(prepender, ""));
}
How to improve this code?
Benchmark before optimizing this. Do not do premature optimization.
Beyond the main point: Why would you like to optimize it?
1) If you are really that concerned about performance where this negligible code from performance point of view matters, you would not use Qt in the first place because Qt is inherently slow compared to a well-optimized framework.
2) If you are not that concerned about performance, then you should keep the readability and maintenance in mind as leading principle, in which case your code is fine.
You have not shown any real world example either why exactly you want to optimize. This feels like an academic question without much pratical use to me. It would be interesting to know more about the motivation.
That being said, several improvement items, which are also optimization, could be done in your code, but then again: it is not done for optimization, but more like logical reasons.
1) Prepender is bad name; it is usually called "prefix" in the English language.
2) You wish to use QChar as opposed to QString for a character.
3) Similarly, for the replacement, you wish to use '' rather than the string'ish "" formula.
4) I would pass classes like that with reference as opposed to value semantics even if it is CoW (implicitly shared).
5) I would not even use an argument here for the prefix since it is always the same, so it does not really fit the definition of variable.
6) It is needless to create an interim variable explicitly.
7) Make the function inline.
Therefore, you would be writing something like this:
QByteArray restoreData(QByteArray data)
{
return QByteArray::fromHex(data.replace('x', ''));
}
Your code has a performance problem because of replace(). Replace itself is not very fast, and creating intermediate QByteArray object slows the code down even more. If you are really concerned about performance, you can copy QByteArray::fromHex implementation from Qt sources and modify it for your needs. Luckily, its implementation is quite self-contained. I only changed / 2 to / 3 and added --i line to skip "x" characters.
QByteArray myFromHex(const QByteArray &hexEncoded)
{
QByteArray res((hexEncoded.size() + 1)/ 3, Qt::Uninitialized);
uchar *result = (uchar *)res.data() + res.size();
bool odd_digit = true;
for (int i = hexEncoded.size() - 1; i >= 0; --i) {
int ch = hexEncoded.at(i);
int tmp;
if (ch >= '0' && ch <= '9')
tmp = ch - '0';
else if (ch >= 'a' && ch <= 'f')
tmp = ch - 'a' + 10;
else if (ch >= 'A' && ch <= 'F')
tmp = ch - 'A' + 10;
else
continue;
if (odd_digit) {
--result;
*result = tmp;
odd_digit = false;
} else {
*result |= tmp << 4;
odd_digit = true;
--i;
}
}
res.remove(0, result - (const uchar *)res.constData());
return res;
}
Test:
qDebug() << QByteArray::fromHex("54455354"); // => "TEST"
qDebug() << myFromHex("x54x45x53x54"); // => "TEST"
This code can behave unexpectedly when hexEncoded is malformed (.e.g. "x54x45x5" will be converted to "TU"). You can fix this somehow if it's a problem.
Could somebody tell me how to convert double precision into network byte ordering.
I tried
uint32_t htonl(uint32_t hostlong);
uint16_t htons(uint16_t hostshort);
uint32_t ntohl(uint32_t netlong);
uint16_t ntohs(uint16_t netshort);
functions and they worked well but none of them does double (float) conversion because these types are different on every architecture. And through the XDR i found double-float precision format representations (http://en.wikipedia.org/wiki/Double_precision) but no byte ordering there.
So, I would much appreciate if somebody helps me out on this (C code would be great!).
NOTE: OS is Linux kernel (2.6.29), ARMv7 CPU architecture.
You could look at IEEE 754 at the interchanging formats of floating points.
But the key should be to define a network order, ex. 1. byte exponent and sign, bytes 2 to n as mantissa in msb order.
Then you can declare your functions
uint64_t htond(double hostdouble);
double ntohd(uint64_t netdouble);
The implementation only depends of your compiler/plattform.
The best should be to use some natural definition,
so you could use at the ARM-platform simple transformations.
EDIT:
From the comment
static void htond (double &x)
{
int *Double_Overlay;
int Holding_Buffer;
Double_Overlay = (int *) &x;
Holding_Buffer = Double_Overlay [0];
Double_Overlay [0] = htonl (Double_Overlay [1]);
Double_Overlay [1] = htonl (Holding_Buffer);
}
This could work, but obviously only if both platforms use the same coding schema for double and if int has the same size of long.
Btw. The way of returning the value is a bit odd.
But you could write a more stable version, like this (pseudo code)
void htond (const double hostDouble, uint8_t result[8])
{
result[0] = signOf(hostDouble);
result[1] = exponentOf(hostDouble);
result[2..7] = mantissaOf(hostDouble);
}
This might be hacky (the char* hack), but it works for me:
double Buffer::get8AsDouble(){
double little_endian = *(double*)this->cursor;
double big_endian;
int x = 0;
char *little_pointer = (char*)&little_endian;
char *big_pointer = (char*)&big_endian;
while( x < 8 ){
big_pointer[x] = little_pointer[7 - x];
++x;
}
return big_endian;
}
For brevity, I've not include the range guards. Though, you should include range guards when working at this level.