I am experiencing (very) slow page load times that increase proportionately to the number of active users on the system. I have a hunch that this is related to a custom defined thread object:
define stageStoreCache => thread {
    parent map
    public oncreate() => ..oncreate()
}
This stageStoreCache object simply mimics the behavior of a map whose data is available across the entire instance.
Many threads are reading it and very few threads are writing to it. Is this a poorly conceived solution to having a large map of data available across the instance? It's a fairly large map of maps that, when exported with map->asString, can exceed 5 MB. The objective is to prevent translating data stored as JSON in the database into Lasso types on the fly.
It seems that the large size of the stageStoreCache is not what causes problems. It seems to really be the number of concurrent users on the system.
Thanks for any insight you can offer.
You said that this holds a map of maps and is rather large. If those sub-maps are large, it is possible that the way you are accessing the data is causing the issue. Here's what I mean: if you are doing something like this:
// Potential problem as it copies the sub-map each time
stageStoreCache->find('sub-map')->find('data')
stageStoreCache->find('sub-map')->find('other')
The problem comes in that each time stageStoreCache->find('sub-map') is called it actually has to copy all the map data it finds for "sub-map" out of the thread object and into the thread requesting that data. If those sub-maps are large, this takes time. A better approach would be to do this once and stash it in a local variable:
// Better Approach
local(cache) = stageStoreCache->find('sub-map')
#cache->find('data')
#cache->find('other')
This at least only has to copy the "sub-map" over once. Another approach that might be better (only testing could tell) would be to refactor your code so that each call to stageStoreCache drills down to the data you actually want, and have just that small amount of data copied over.
// Might even be better as it just copies the values you want
stageStoreCache->drill('sub-map', 'data')
stageStoreCache->drill('sub-map', 'other')
Ultimately, I would love for Lasso to improve thread objects so that they never blocked for reads. (I had thought this had been submitted as a feature request, but I'm not finding it on Rhinotrac.) Until that happens, if none of my suggestions help, then you may need to investigate caching this data in something else, such as memcached.
Testing is the only way to tell for sure. But I would go a long way to avoid having a thread object that contains some 5 MB of data.
Take this snippet from the Lasso guide into consideration:
"all parameter values given to a thread object method are copied, as well as any return value of a thread object method"
http://www.lassoguide.com/language/threading.html
Meaning that one of the key features that makes Lasso 9 so fast, the extensive use of reference data, is lost.
Each time you have a call for stageStoreCache all the data it contains will first be copied into the thread that asks for it. That is an awful lot of copying.
I have found that keeping settings and site-wide data in the smallest possible chunks is convenient and fast, and that it pays to set them up only when they are actually called for. This is unlike the old approach of a config file included on every call, which set up a bunch of variables of which the majority might never get used on that particular call. Here's a Ke trick that I'm using instead. Consider this:
define mysetting1 => var(__mysetting1) || $__mysetting1 := 'Setting 1 value'
define mysetting2 => var(__mysetting2) || $__mysetting2 := 'Setting 2 value'
define mysetting3 => var(__mysetting3) || $__mysetting3 := 'Setting 3 value'
Have this in a file that is read at startup, either in a LassoApp that's initiated or in a file in the startup folder.
These settings can then be called like this:
// code blabla
mysetting2
// more code blabla
mysetting1
mysetting2
The beauty is that, in this case, there is no wasted processing to initiate mysetting3, since it's never called for, while mysetting2 is called for several times but is still only initiated once.
This technique can be used for simple things like the above, but also to initiate complex types or methods, such as session management, reading POST or GET params, etc.
I'm trying to work with 2D in Vulkan alongside 3D, so right now I'm testing out updating a texture every frame with whatever 2D is going on. I've gotten something of a texture updater working; the problem is that it's very slow and probably not the way it's supposed to be done. Is there a better way of getting this done? The code is based on the https://vulkan-tutorial.com/ code.
https://vulkan-tutorial.com/code/26_depth_buffering.cpp
void UpdateTexture()
{
    vkDeviceWaitIdle(device);
    vkFreeMemory(device, textureImageMemory, nullptr);
    VkBuffer stagingBuffer;
    VkDeviceMemory stagingBufferMemory;
    createBuffer(imageSize, VK_BUFFER_USAGE_TRANSFER_SRC_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, stagingBuffer, stagingBufferMemory);
    void* data;
    vkMapMemory(device, stagingBufferMemory, 0, imageSize, 0, &data);
    memcpy(data, pixel2.data(), static_cast<size_t>(imageSize));
    vkUnmapMemory(device, stagingBufferMemory);
    createImage(texWidth, texHeight, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_TILING_OPTIMAL, VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, textureImage, textureImageMemory);
    transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_LAYOUT_UNDEFINED, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL);
    copyBufferToImage(stagingBuffer, textureImage, static_cast<uint32_t>(texWidth), static_cast<uint32_t>(texHeight));
    transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL);
    vkDestroyBuffer(device, stagingBuffer, nullptr);
    vkFreeMemory(device, stagingBufferMemory, nullptr);
    createTextureImageView();
    createDescriptorPool();
    createDescriptorSets();
    createCommandBuffers();
}
This code looks like a direct translation of some OpenGL code, and not particularly good/modern OpenGL code at that.
There's a lot wrong in this code, but most of it boils down to over-synchronization.
First, you should always view any call to vkDeviceWaitIdle as the wrong thing to do. The only exception would be when you are preparing to destroy the VkDevice itself. There is no other reason to do a full CPU/GPU sync like that.
Presumably, this synchronization exists so that you can be sure the GPU is finished using the image before modifying it. This is the wrong thing to do. You should instead employ multiple-buffering. That is, you should have two images that you use. One is currently being used in a rendering process, while the other is being transferred into.
Instead of doing a full device sync, you instead synchronize with the batch you sent two frames ago. That is, if you're wanting to transfer data for use by frame 10, then you must first do a fence-sync operation with the batch you sent in frame 8. Frame 9 is still being processed, but frame 8 is probably done by now. So the synchronization shouldn't hurt too much.
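Roughly, the waiting side of that scheme looks like this. This is a minimal sketch only, assuming per-frame-in-flight fences and per-frame staging resources; the names are placeholders, not the tutorial's exact code:
#include <vulkan/vulkan.h>
#include <cstdint>

// Placeholder globals standing in for tutorial-style state.
extern VkDevice device;
constexpr int MAX_FRAMES_IN_FLIGHT = 2;
extern VkFence inFlightFences[MAX_FRAMES_IN_FLIGHT];

// Before reusing frame slot `frameIndex` (its staging buffer, its texture image,
// its command buffer), wait on the fence signalled by that slot's submission
// from MAX_FRAMES_IN_FLIGHT frames ago.
void WaitForFrameSlot(uint32_t frameIndex)
{
    vkWaitForFences(device, 1, &inFlightFences[frameIndex], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &inFlightFences[frameIndex]);
    // It is now safe to overwrite this slot's staging memory and re-record the
    // transfer that updates this slot's texture.
}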
Second, never allocate memory in the middle of an operation like this. Memory gets allocated early in your application, and you leave it allocated until it's time to destroy your application. If you need a staging buffer, then keep it around and reuse it in subsequent frames. Make sure to allocate sufficient storage up-front.
Whatever your createBuffer call is doing, it seems very much like a bad idea. Vulkan is not OpenGL; Vulkan separated memory from buffers/textures that use it for a reason. Creating APIs that hide this separation basically throws all of that away.
Similarly, never unmap memory unless you're about to destroy that memory object. There's no problem in Vulkan (or OpenGL) with leaving a piece of memory mapped indefinitely. Just map the entire memory's range and leave it mapped. Indeed, you could just pass the mapped pointer directly to your image loader, depending on how the memory gets written by the image loading code (if it tries to read data back from this pointer, there could be trouble).
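Putting the last two points together, here is a sketch of a staging buffer that is created once, mapped once, and then reused every frame. The names are assumptions (createBuffer is the tutorial's helper), and with multiple frames in flight you would want one of these per frame slot:
#include <vulkan/vulkan.h>
#include <cstring>

// Assumed to exist elsewhere (tutorial-style globals and helper).
extern VkDevice device;
extern VkDeviceSize imageSize;
void createBuffer(VkDeviceSize size, VkBufferUsageFlags usage,
                  VkMemoryPropertyFlags properties,
                  VkBuffer& buffer, VkDeviceMemory& bufferMemory);

VkBuffer stagingBuffer = VK_NULL_HANDLE;
VkDeviceMemory stagingBufferMemory = VK_NULL_HANDLE;
void* stagingMapped = nullptr;

// Run once at startup: create the staging buffer and leave it mapped.
void CreateStagingResources()
{
    createBuffer(imageSize, VK_BUFFER_USAGE_TRANSFER_SRC_BIT,
                 VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
                 stagingBuffer, stagingBufferMemory);
    vkMapMemory(device, stagingBufferMemory, 0, imageSize, 0, &stagingMapped);
}

// Run every frame: just copy the new pixels; no create/map/unmap/destroy.
void UploadPixels(const void* pixels)
{
    std::memcpy(stagingMapped, pixels, static_cast<size_t>(imageSize));
}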
Lastly, the commands doing the transfer need to be synchronized with the commands that consume the image. How this happens depends on which queues are being used to do the transfer.
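For the simple case where the transfer and the rendering happen on the same graphics queue, that synchronization is an ordinary pipeline barrier, essentially what the tutorial's transitionImageLayout already records. A sketch with assumed handles:
#include <vulkan/vulkan.h>

// Assumed handles: the command buffer being recorded and the updated image.
extern VkCommandBuffer commandBuffer;
extern VkImage textureImage;

// Make the transfer write visible to fragment-shader reads.
void RecordTransferToSampleBarrier()
{
    VkImageMemoryBarrier barrier{};
    barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    barrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image = textureImage;
    barrier.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

    vkCmdPipelineBarrier(commandBuffer,
                         VK_PIPELINE_STAGE_TRANSFER_BIT,        // after the copy
                         VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, // before sampling
                         0, 0, nullptr, 0, nullptr, 1, &barrier);
}
If the transfer instead happens on a dedicated transfer queue, you would also need a queue family ownership transfer (or VK_SHARING_MODE_CONCURRENT) plus a semaphore between the two submissions.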
And of course, if you want optimal performance, you may want to check to see if your implementation can read from linear images in your shader. If it can, then you may not need staging at all; you can just write the data directly to the memory in Vulkan's image format, and use it directly.
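You can query that support up front. A short sketch, assuming the usual physicalDevice handle:
#include <vulkan/vulkan.h>

extern VkPhysicalDevice physicalDevice;

// Returns true if linearly tiled R8G8B8A8_SRGB images can be sampled in shaders,
// in which case the staging copy could potentially be skipped entirely.
bool CanSampleLinearRGBA8()
{
    VkFormatProperties props{};
    vkGetPhysicalDeviceFormatProperties(physicalDevice, VK_FORMAT_R8G8B8A8_SRGB, &props);
    return (props.linearTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT) != 0;
}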
Employing all of the above is going to add a lot of complexity to your application. But that's how it's supposed to work.
A naive way is to use the CPU to compute the update depending on the time or the data, and then push the updated data to the shader, such as an MVP transformation matrix. But this is inefficient: it means lots of syncing, low refresh rates, and a CPU overloaded in a loop.
So people recommend using many buffers, sometimes mentioning old drivers. If someone can clarify that, it would be nice. I have a naive and probably wrong guess: if you knew the frame rate exactly, you could calculate the time for each frame and dispatch several frames in advance. But that confuses me, because the frame rate is dynamic, especially on new screens with FreeSync and similar dynamic refresh rates.
I have thought of a third possibility: using the clock directly in the shader. GL_EXT_shader_realtime_clock provides clockRealtimeEXT. It has no defined unit and will wrap when it exceeds the maximum value, but it is said to be "globally coherent by all invocations on the GPU". During initialization, you could measure its rate via a uniform buffer, then assume the rate stays constant, and also handle the wrapping.
Then, if you can write your shaders as a function of time, for example a translation, that would be efficient; you just need the initial data. Remember that one should avoid if conditions in shaders.
I've run a plan to create a large set of objects in drake's cache. Now, outside of a plan, I ran lapply over a subset of those objects so I can summarize some of their properties and plan my next steps.
I'm using readd to load each one of these cached objects inside of the function I'm applying over, but they seem to still use up RAM after I'm done with them. That's a problem in my scenario because it's 100 GiB of RAM by the time it's finished. I'm not sure where in the environment I should be looking for them if I need to explicitly remove them.
I understand that drake is doing something similar to memoization with the cache, since if I readd the same object twice, the first one takes time to read from disk, and the second time is instantaneous. But in this scenario, I'd like to treat the cache as a simple data source like any other file, so that an object doesn't take up RAM if it's rm()'d or goes outside of the scope.
Figured it out! It looks like the storr object returned by get_cache or new_cache has a flush_cache method. Calling that, then gc(), returns the memory.
Should flush_cache be documented somewhere in drake, even though it comes from storr?
I also found that if I call readd from multiple processes with mclapply, the objects don't stay in RAM, since they don't get transferred back to the main process.
I understand that sessionless operations are the preferred method of using Gremlin. I'm wondering when the sessioned approach is better.
So I might be doing something like...
graph.addVertex("foo").property("name","bar")
graph.traversal().V().has("name","bar").as("f").addV("foo").property("name","baz").as("g").addE("test").from("f").to("g")
I'm doing this type of operation a lot. Often there's also a query (usually involving a coalesce) beforehand to check if a node (g in my example) exists, and create it if not.
So I'm wondering if a session might be better because I could hold a handle to the previous vertices and just attach new nodes to them without the expense of the lookup.
Feel free to tell me why I'm wrong in anything else that I'm doing. I'm just trying to make things faster.
First of all, I would avoid use of addVertex() and stick to addV() - see more details here.
As to your question, I think the only time to leverage sessions is if you have some sort of loading operation that requires explicit control over transactions and you're not using a JVM based language. Even then, I might consider other options for dealing with that and just avoid sessions completely. You end up with a less portable solution as there are a number of graph systems which don't even support them directly (e.g. Neptune).
A T.id-based lookup should be really fast, so saving a vertex between requests in a session really shouldn't vastly improve performance. Even if you keep the vertex between requests, you will still need to pass the vertex into your traversal, so you still have the lookup anyway; I'm not sure I see the difference in cost there.
// first request
v = g.addV(...).property(...).next()
// second request
g.V(v).addE(....
// third request
g.V(v).addE(....
The above should not be that much faster than:
// first request - returns id=1
g.addV(...).property(...).id().next()
// second request - where "1" is just passed in on the next request as a parameter
g.V(1).addE(....
// third request
g.V(1).addE(....
I have a program that reads 173 C data structures from a memory map that need to be converted to Go. The value of the type is stored as a string in those structures. The structures are received 60 times per second.
I'm now using reflection (FieldByName) to get a reference to the Go struct field and set the received data. But because there are many fields (173) and they get updated a lot, this adds a lot of overhead, and that function call is the slowest part of my program (yay, go prof!).
What is the best way to make this faster? As far as I can see I have three options:
cache the reflect.Values in a map and make a function that receives data, uses a template struct tied to the cache map, fills that struct, and returns a copy of that template struct
go generate all the setters and a giant switch statement for each received field
Just code all the different setters
What would be the "best" option? Is there an option I'm overlooking?
With #1, to be concurrency-safe you'd need a pool of those template structs, or at least a mutex protecting them. That adds some overhead and can be tricky to debug.
#3 is a nightmare to maintain.
I would go with #2. The running code will be fast, concurrency-safe and easy to debug.
Once your tool is set up, a change in your struct only requires running a command line to update the setters.
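For a feel of what option #2's generated code boils down to, here is a minimal sketch of the name-to-setter dispatch, written in C++ here; the real generated code in your case would of course be Go, and every name below is invented:
#include <string>

// Invented example struct standing in for the real 173-field type.
struct Frame {
    int speed = 0;
    std::string label;
    // ... remaining fields ...
};

// What the generator would emit: one branch per field name, no reflection.
// Returns false for unknown field names.
bool SetField(Frame& f, const std::string& name, const std::string& value)
{
    if (name == "speed")      { f.speed = std::stoi(value); return true; }
    else if (name == "label") { f.label = value;            return true; }
    // ... one branch per remaining field, emitted by the generator ...
    return false;
}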
I have the feeling that I should not care about thread-safe reading from / writing to a
public static int MyVar = 12;
in ASP.NET.
I read/write to this variable from various user threads. Let's suppose this variable will store the number of clicks on a certain button/link.
My theory is that no thread can read/write to this variable at the same time. It's just a simple variable of 4 bytes.
I do care about thread safety, but only for reference objects, List instances, or other types that take more cycles to read/update.
Am I wrong in my presumption?
EDIT
I understand this depends on my scenario, but that wasn't the point of the question. The question is: is it right that thread-safe code can be written with a (static int) variable without using the lock keyword?
It is my problem to write correct code. The answer seems to be: yes, if you write correct and simple code that is not too complicated, you can create thread-safe functions without needing the lock keyword.
If one thread simply sets the value and another thread reads the value, then a lock is not necessary; the read and write are atomic. But if multiple threads might be updating it and are also reading it to do the update (e.g., increment), then you definitely do need some kind of synchronization. If only one thread is ever going to update it even for an increment, then I would argue that no synchronization is necessary.
Edit (three years later) It might also be desirable to add the volatile keyword to the declaration to ensure that reads of the value always get the latest value (assuming that matters in the application).
The concept of thread 'safety' is too vague to be meaningful unfortunately. If you're asking whether you can read and write to it from multiple threads without the program crashing during the operation, the answer is almost certainly yes. If you're also asking if the variable is guaranteed to either be the old value or the new value without ever storing any broken intermediate values, the answer for this data type is again almost certainly yes.
But if your question is "will my program work correctly if I access this from multiple threads", then the answer depends entirely on what your program is doing. For example, if you run the following pseudo code in 2 threads repeatedly in most programming languages, eventually you'll hit the assertion.
if MyVar >= 1:
    MyVar = MyVar - 1
assert MyVar >= 0
Primitives like int are thread-safe in the sense that reads/writes are atomic. But as with most any type, it's left to you to do proper checking with more complex operations. For example, if (x > 0) x--; would be problematic in a multi-threaded scenario because x might change in between the if condition check and decrement.
A simple read or write on a field of 32 bits or less is always atomic. But you should still make sure that your read/write code as a whole is thread safe.
Check out this post: http://msdn.microsoft.com/en-us/magazine/cc163929.aspx
It explains why you need to synchronize access to the integers in this scenario.
Try Interlocked.Increment() or Interlocked.Add() and you'll be right. Your code complexity will be the same but you truly won't have to worry. If you're not worried about losing a few clicks in your counter, you can continue as you are.
Reading or writing integers is atomic. However, reading and then writing is not atomic. So, if you have one thread that writes and many that read, you may be able to get away without locks.
However, even though the operations are atomic, there are still potential multi-threading issues. In order for one thread to be guaranteed that another thread can see values it writes, you need a memory barrier. Otherwise, the compiler can optimize the code so that the variable stays in a register (or even optimize the operation away completely), so changes would be invisible from one thread to another.
You can establish a memory barrier explicitly (volatile or Thread.MemoryBarrier), or with the Interlocked class -- or with the lock statement (Monitor).
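To make the difference concrete, here is a small sketch of a plain increment versus an atomic one, shown with C++ std::atomic; the .NET counterpart of the atomic version is Interlocked.Increment on the static int:
#include <atomic>

int plainCounter = 0;              // plain int: ++ is a read-modify-write, so
                                   // concurrent clicks can be lost
std::atomic<int> atomicCounter{0}; // atomic int: the whole increment is indivisible

void RecordClickPlain()  { ++plainCounter; }              // racy under concurrency
void RecordClickAtomic() { atomicCounter.fetch_add(1); }  // safe without a lock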