I'm working on a personal project to implement a 2-dimensional kd-tree construction algorithm in C++.
I know there are libraries that already do this, but I want to gain experience in C++ programming
(personal projects also help on a resume).
Input: the number of points, followed by the points themselves (the input can be read from the command line).
I want this to run in O(n log n) time. Can this be done? If so, can someone provide some pseudo-code to help get me started? Thanks in advance.
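I've been playing around with kd-trees recently. Here's some untested code. kd-tree construction is done in 2 stages: traversing the list of points, O(n), and sorting along each dimension, O(n log n). So, yes, you can construct kd-trees in O(n log n) time. Note that re-sorting at every level of the recursion, as the code below does for simplicity, costs O(n log² n) overall; to stay within O(n log n), sort once per dimension up front or select the median in linear time.
Searching through the tree (say you are looking for nearest neighbors) gets trickier, though. I haven't found any easy-to-follow documentation for that.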
#include <algorithm>
#include <array>

struct Node
{
    std::array<int, 2> point;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Sorts the len points in values by the given axis (0 = x, 1 = y)
void sortAlongDim (std::array<int, 2>* values, int len, int axis)
{
    std::sort (values, values + len,
               [axis] (const std::array<int, 2>& a, const std::array<int, 2>& b) { return a[axis] < b[axis]; });
}

Node* createtreeRecursive (std::array<int, 2>* values, int len, int dim = 2, int depth = 0)
{
    // If empty, return
    if (len <= 0) return nullptr;
    // Get axis to split along
    int axis = depth % dim;
    sortAlongDim (values, len, axis);
    int mid = len / 2;
    // The median point becomes this node
    Node* curr = new Node ();
    curr->point = values[mid];
    // Points before the median go left, points after it go right
    curr->left = createtreeRecursive (values, mid, dim, depth + 1);
    curr->right = createtreeRecursive (values + mid + 1, len - mid - 1, dim, depth + 1);
    return curr;
}
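To illustrate the O(n log n) bound, here is a minimal sketch that replaces the per-level sort with std::nth_element, which finds the median in linear time on average. The names KdNode, Point, buildKdTree and countNodes are hypothetical and not part of the code above; the tree is fixed to 2 dimensions and owns its nodes through std::unique_ptr.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

using Point = std::array<int, 2>;

struct KdNode {
    Point point{};
    std::unique_ptr<KdNode> left;
    std::unique_ptr<KdNode> right;
};

// Builds a 2-d tree over pts[first, last). The range is reordered in place.
std::unique_ptr<KdNode> buildKdTree(std::vector<Point>& pts,
                                    std::size_t first, std::size_t last,
                                    int depth = 0) {
    if (first >= last) return nullptr;
    const int axis = depth % 2;
    const std::size_t mid = first + (last - first) / 2;
    // Move the median along this axis to position mid; smaller-or-equal
    // points end up before it, larger-or-equal points after it.
    std::nth_element(pts.begin() + first, pts.begin() + mid, pts.begin() + last,
                     [axis](const Point& a, const Point& b) { return a[axis] < b[axis]; });
    auto node = std::make_unique<KdNode>();
    node->point = pts[mid];
    node->left = buildKdTree(pts, first, mid, depth + 1);
    node->right = buildKdTree(pts, mid + 1, last, depth + 1);
    return node;
}

// Counts the nodes of a tree; only a helper for checking the result.
std::size_t countNodes(const KdNode* n) {
    return n ? 1 + countNodes(n->left.get()) + countNodes(n->right.get()) : 0;
}
```

Calling buildKdTree(pts, 0, pts.size()) builds the whole tree. Each level of the recursion does linear work on average and there are O(log n) levels, which is where the O(n log n) average cost comes from.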
I am solving the word break problem and have used dynamic programming to optimise it; the solution works. However, I cannot figure out the time complexity of this approach.
Code:
class Solution {
public:
int util(string &s, int i, unordered_set<string> &dict, vector<int> &DP) {
if(i >= s.size()) {
return 1;
}
if(DP[i] != -1) {
return DP[i];
}
string next = "";
for(int itr = i; itr < s.size(); itr++) { // O(N)
next += s[itr];
if(dict.find(next) != dict.end() and util(s, itr+1, dict, DP)) { // ?
return DP[i] = 1;
}
}
return DP[i] = 0;
}
bool wordBreak(string s, vector<string>& wordDict) {
unordered_set<string> dict(wordDict.begin(), wordDict.end());
vector<int> DP(s.size() + 1, -1);
return util(s, 0, dict, DP);
}
};
Could anyone please help me to understand the time complexity of the above algorithm step-by-step?
Thanks
For each i ∈ [0..n) (where n is the length of s), your recursive function executes the full inner loop at most once (any later call returns the result cached in DP).
At this point you might say that the whole algorithm is O(n²), but that is not true; there is a twist.
The inner loop which you labeled O(n) is not actually O(n) but O(n²), because you're searching for next (a substring of s) in the unordered_set. Each such search takes O(next.length) time, and since the length of next ranges from 1 to length(s), each dict search is O(n) and the inner loop is consequently O(n²).
Given all of the above, the whole algorithm is O(n³): O(n) from the recursion, multiplied by O(n²) from the inner loop.
Or, to be precise, O(n³ + k), where k is the cumulative size of all strings in wordDict (since you're constructing a set from them).
P.S. To reduce the complexity by a factor of O(n), you can use a trie instead of an unordered_set to store the words from wordDict. With that, your algorithm would be O(n² + k).
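The trie variant could look roughly like the sketch below. It is a bottom-up version rather than the recursive one from the question, and it assumes all input consists of lowercase ASCII letters; TrieNode, insertWord and wordBreakTrie are hypothetical names. Walking the trie one character at a time replaces the O(n) hash lookup per substring with O(1) work per character.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Trie over 'a'..'z'; the alphabet is an assumption of this sketch.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> next;
    bool isWord = false;
};

void insertWord(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->next[c - 'a'];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->isWord = true;
}

bool wordBreakTrie(const std::string& s, const std::vector<std::string>& wordDict) {
    TrieNode root;
    for (const auto& w : wordDict) insertWord(root, w);  // O(k) in total

    const std::size_t n = s.size();
    // reachable[i] is 1 when the suffix s[i..n) can be split into dictionary words.
    std::vector<char> reachable(n + 1, 0);
    reachable[n] = 1;
    for (std::size_t i = n; i-- > 0;) {
        const TrieNode* node = &root;
        // Extend the candidate word one character at a time: O(1) per step.
        for (std::size_t j = i; j < n; ++j) {
            const auto& child = node->next[s[j] - 'a'];
            if (!child) break;  // no dictionary word starts with s[i..j]
            node = child.get();
            if (node->isWord && reachable[j + 1]) {
                reachable[i] = 1;
                break;
            }
        }
    }
    return reachable[0] != 0;
}
```

The two nested loops do constant work per (i, j) pair, which gives the O(n² + k) bound mentioned above.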
I am having trouble passing values between host code and kernel code because of some vector data types. The following code and explanation only illustrate my problem; my real code is much bigger and more complicated. With this small example, I hope to explain where I am having a problem. If anything more is needed, please let me know.
std::vector<vector<double>> output;
for (int i = 0;i<2; i++)
{
auto& out = output[i];
sum =0;
for (int l =0;l<3;l++)
{
for (int j=0;j<4; j++)
{
if (some condition is true)
{ out[j+l] = 0.;}
sum+= .....some addition...
}
out[j+l] = sum
}
}
Now I want to parallelize this code, from the second loop. This is what I have done in host code:
cl::buffer out = (context,CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, output.size(), &output, NULL)
Then, I have set the arguments
cl::SetKernelArg(0, out);
Then the loop,
for (int i = 0,i<2, i++)
{
auto& out = output[i];
// sending some more arguments (which change on every loop iteration) for the sum operations
queue.enqueueNDRangeKernel(.......)
queue.enqueuereadbuffer(.....,&out,...)
}
In Kernel Code:
__kernel void sumout(__global double* out, ....)
{
int l = get_global_id(0);
int j = get_global_id(1);
if (some condition is true)
{ out[j+l] = 0.; // Here it goes out of the loop then
return}
sum+= .....some addition...
}
out[j+l] = sum
}
So now, when the if condition holds, out[j+l] is set to 0 inside the loop, so the value of out changes regularly. In the normal code, out is a reference to a vector. I am not able to read the values of output from out in my kernel and host code. I want to read the values in output[i] for every out[j+l], but I am confused by the buffer and the vector.
Just for more clarification: output is a vector of vectors, and out is a reference to one of the vectors inside output. I need to update the values in output for every change in out. Since these are vectors, I passed out as a cl buffer. I hope it is clear.
Please let me know if more code is required; I will try to provide as much as I can.
You are sending pointers to vectors to OpenCL. The inner vector objects are contiguous at the pointer level, of course, but the data as a whole is not contiguous in memory, since each inner vector points to a different memory area. OpenCL does not follow those nested host pointers into device memory, and there is no command for that in the API.
You could use a vector of arrays (std::array, in recent C++ versions) or plain arrays.
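One way to get contiguous data, sketched here under the assumption that every inner vector has the same length, is to copy the nested vectors into a single flat std::vector<double> before creating the buffer and to copy the results back afterwards. The helpers flatten and unflatten are hypothetical names, and no OpenCL calls are made in this sketch.

```cpp
#include <cstddef>
#include <vector>

// Copies a rectangular vector-of-vectors into one contiguous row-major buffer.
// Assumes every row has the same length.
std::vector<double> flatten(const std::vector<std::vector<double>>& rows) {
    std::vector<double> flat;
    if (rows.empty()) return flat;
    flat.reserve(rows.size() * rows.front().size());
    for (const auto& row : rows) flat.insert(flat.end(), row.begin(), row.end());
    return flat;
}

// Copies values from the flat buffer back into the nested layout.
// Assumes flat holds at least as many values as rows has elements.
void unflatten(const std::vector<double>& flat, std::vector<std::vector<double>>& rows) {
    std::size_t offset = 0;
    for (auto& row : rows) {
        for (auto& value : row) value = flat[offset++];
    }
}
```

With this layout, flat.data() points at one contiguous block of flat.size() * sizeof(double) bytes, and element (i, j) of the original lives at index i * rowLength + j.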
I have a rather simple MPI program where each node does a calculation, and at the end I need the sum of all the calculations. The nodes have no need to communicate anything other than the final sum each has calculated.
Currently this is what I am doing and it is working.
MPI_Init(&argc, &argv); // start up "virtual machine"
MPI_Comm_size(MPI_COMM_WORLD, &p); // get size of VM
MPI_Comm_rank(MPI_COMM_WORLD, &id); // get own rank in VM
int localsum[1] = {0};
int globalsum[1] = {0};
for (i = lower+id; i <= upper; i=i+p) {
localsum[0] = localsum[0] + getResult(i);
}
MPI_Reduce(localsum,globalsum,1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);
if(id==0)
{
printf("globalsum1 = %d \n",globalsum[0]);
}
So each node handles every p-th element, where p is the size of the VM. However, here's the problem: for any i, getResult(i) takes less time to compute than getResult(i+1). This means that some nodes will have a much bigger workload than others.
Is there any way to balance this out more, or to let nodes steal work from other nodes when they are done?
As Wesley Bland points out in the comments, this is a hard question to answer without knowing more about what getResult() does and how much extra time we are talking about.
However, one suggestion I have is to pair expensive calls to getResult() with cheaper ones.
Example: pair getResult(lower) with getResult(upper), and getResult(lower+1) with getResult(upper-1).
Sample loop (it will need some modifications to fix some edge cases):
for (i = id; i <= (upper-lower)/2; i=i+p) {
localsum[0] = localsum[0] + getResult(lower+i) + getResult(upper-i) ;
}
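One possible way to handle the edge cases, sketched without any MPI calls: the hypothetical helper pairedIndices lists the indices a given rank would process, so that the middle index of an odd-sized range is visited only once and an empty range yields nothing. Each rank would then sum getResult over its own list before the MPI_Reduce.

```cpp
#include <vector>

// Returns the indices in [lower, upper] that rank `id` out of `p` ranks
// processes when the i-th cheapest index is paired with the i-th most
// expensive one.
std::vector<int> pairedIndices(int lower, int upper, int id, int p) {
    std::vector<int> indices;
    const int count = upper - lower + 1;
    // There are ceil(count / 2) pairs; the last one may be a single index.
    for (int i = id; i < (count + 1) / 2; i += p) {
        const int lo = lower + i;
        const int hi = upper - i;
        indices.push_back(lo);
        if (hi != lo) indices.push_back(hi);  // odd count: middle index only once
    }
    return indices;
}
```

Taken over all ranks, every index in [lower, upper] appears exactly once, so the reduced sum stays the same as in the original loop. How well this balances the load still depends on how the cost of getResult grows with i.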
I have a struct:
typedef struct
{
double distance;
int* path;
} tour;
Then I try to gather the results from all processes:
MPI_Gather(&best, sizeof(tour), MPI_BEST, all_best, sizeof(tour)*proc_count, MPI_BEST, 0, MPI_COMM_WORLD);
After the gather, my root sees that all_best contains only one valid element and garbage in the others.
The type of all_best is tour*.
Initialisation of MPI_BEST:
void ACO_Build_best(tour *tour,int city_count, MPI_Datatype *mpi_type /*out*/)
{
int block_lengths[2];
MPI_Aint displacements[2];
MPI_Datatype typelist[2];
MPI_Aint start_address;
MPI_Aint address;
block_lengths[0] = 1;
block_lengths[1] = city_count;
typelist[0] = MPI_DOUBLE;
typelist[1] = MPI_INT;
MPI_Address(&(tour->distance), &displacements[0]);
MPI_Address(&(tour->path), &displacements[1]);
displacements[1] = displacements[1] - displacements[0];
displacements[0] = 0;
MPI_Type_struct(2, block_lengths, displacements, typelist, mpi_type);
MPI_Type_commit(mpi_type);
}
Any ideas are welcome.
Apart from passing incorrect lengths to MPI_Gather, MPI actually does not follow pointers to pointers. With such a structured type you would be sending the value of distance and the value of the path pointer (essentially an address which makes no sense when sent to other processes). If one supposes that distance essentially gives the number of elements in path, then you can kind of achieve your goal with a combination of MPI_Gather and MPI_Gatherv:
First, gather the lengths:
int counts[proc_count];
MPI_Gather(&best->distance, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD);
Now that counts is populated with the correct lengths, you can continue and use MPI_Gatherv to receive all paths:
int disps[proc_count];
disps[0] = 0;
for (int i = 1; i < proc_count; i++)
disps[i] = disps[i-1] + counts[i-1];
// Allocate space for the concatenation of all paths
int *all_paths = malloc((disps[proc_count-1] + counts[proc_count-1])*sizeof(int));
MPI_Gatherv(best->path, best->distance, MPI_INT,
all_paths, counts, disps, MPI_INT, 0, MPI_COMM_WORLD);
Now you have the concatenation of all paths in all_paths. You can examine or extract an individual path by taking counts[i] elements starting at position disps[i] in all_paths. Or you can even build an array of tour structures and make them use the already allocated and populated path storage:
tour *all_best = malloc(proc_count*sizeof(tour));
for (int i = 0; i < proc_count; i++)
{
all_best[i].distance = counts[i];
all_best[i].path = &all_paths[disps[i]];
}
Or you can duplicate the segments instead:
for (int i = 0; i < proc_count; i++)
{
all_best[i].distance = counts[i];
all_best[i].path = malloc(counts[i]*sizeof(int));
memcpy(all_best[i].path, &all_paths[disps[i]], counts[i]*sizeof(int));
}
// all_paths is not needed any more and can be safely free()-ed
Edit: Because I've overlooked the definition of the tour structure, the above code actually works with:
struct
{
int distance;
int *path;
}
where distance holds the number of significant elements in path. This is different from your case, but without some information on how tour.path is being allocated (and sized), it's hard to give a specific solution.
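If every path holds exactly city_count entries (the block length used in ACO_Build_best suggests this, but it is an assumption), two plain MPI_Gather calls would be enough: one for the MPI_DOUBLE distances and one for the MPI_INT paths, city_count entries per process. The sketch below shows only the unpacking step on the root and makes no MPI calls; Tour and unpackTours are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

struct Tour {
    double distance;
    std::vector<int> path;
};

// Rebuilds one Tour per process from two flat receive buffers:
// distances holds one entry per process, paths holds cityCount entries
// per process, stored back to back in rank order.
std::vector<Tour> unpackTours(const std::vector<double>& distances,
                              const std::vector<int>& paths,
                              std::size_t cityCount) {
    std::vector<Tour> tours;
    tours.reserve(distances.size());
    for (std::size_t r = 0; r < distances.size(); ++r) {
        const auto first = paths.begin() + static_cast<std::ptrdiff_t>(r * cityCount);
        const auto last = first + static_cast<std::ptrdiff_t>(cityCount);
        tours.push_back(Tour{distances[r], std::vector<int>(first, last)});
    }
    return tours;
}
```

Because the path data travels as plain integers rather than as a pointer, each rebuilt Tour owns a valid copy of its path on the root.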
I am still learning about pointers and structs, but I am hoping someone might know whether it is possible to access individual members sequentially by use of a pointer.
Typedef record_data {
float a;
float b;
float c;
}records,*Sptr;
records lists[5];
Sptr ptr;
Example: assign all members of the 5 lists with a value of float 1.0
// instead of this
(void)testworks(void){
int i;
float j=1.0;
ptr = &lists[i]
ptr->lists[0].a = j;
ptr->lists[0].b = j;
ptr->lists[0].c = j;
ptr->lists[1].a = j;
// ... and so on
ptr->lists[4].c = j;
}
// want to do this
(void)testwannado(void){
int a,i;
float j=1.0;
ptr = &lists[i]
for (a=0;a<5;a++){ // step through typedef structs
for (i=0;i<3;i++){ // step through members
???
}
}
Forgive my errors in the example below, but it represents the closest thing I can think of to what I am trying to accomplish.
int *mptr;
mptr = &(ptr->lists[0].a) // want to assign a pointer to members so all 3 members can be used...
*mptr++ = j; // so I can do something like this.
This wasn't compiled, so any other errors are unintentional.
You generally don't want to do that. Structure members should be accessed individually. You can run into a lot of portability problems by making assumptions about how consecutive structure members are laid out in memory (for example, padding between members). And most (C-like) languages do not give you a way to "introspect" through the members of a structure.
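If the code can be compiled as C++, one portable alternative is to list the members explicitly as pointers to data members and loop over that list. This is only a sketch: Record, kMembers and fillAll are hypothetical names, and plain C has no direct equivalent of this feature. Each member is still accessed by name through the list, so nothing is assumed about the memory layout.

```cpp
#include <array>
#include <cstddef>

struct Record {
    float a;
    float b;
    float c;
};

// Explicit list of the members to visit; no assumption about padding or layout.
constexpr std::array<float Record::*, 3> kMembers = {&Record::a, &Record::b, &Record::c};

// Sets every listed member of every record to value.
template <std::size_t N>
void fillAll(std::array<Record, N>& records, float value) {
    for (Record& rec : records) {                          // step through the structs
        for (auto member : kMembers) rec.*member = value;  // step through the members
    }
}
```

For example, declaring std::array<Record, 5> lists{} and calling fillAll(lists, 1.0f) sets all fifteen floats to 1.0.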