Should I care about thread safety of a static int (4 bytes) variable in ASP.NET?

I have the feeling that I should not care about thread-safe reading/writing of a
public static int MyVar = 12;
in ASP.NET.
I read/write to this variable from various user threads. Let's suppose this variable will store the number of clicks on a certain button/link.
My theory is that no two threads can read/write this variable at the same time; it's just a simple 4-byte variable.
I do care about thread safety, but only for reference objects, List instances, or other types that take more cycles to read/update.
Am I wrong in my presumption?
EDIT
I understand this depends on my scenario, but that wasn't the point of the question. The question is: is it correct that thread-safe code can be written against a static int variable without using the lock keyword?
Writing correct code is my problem. The answer seems to be: yes, if you keep the code correct, simple, and not too complicated, you can write thread-safe functions without needing the lock keyword.

If one thread simply sets the value and another thread reads the value, then a lock is not necessary; the read and write are atomic. But if multiple threads might be updating it and are also reading it to do the update (e.g., increment), then you definitely do need some kind of synchronization. If only one thread is ever going to update it even for an increment, then I would argue that no synchronization is necessary.
Edit (three years later) It might also be desirable to add the volatile keyword to the declaration to ensure that reads of the value always get the latest value (assuming that matters in the application).
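For instance (a minimal sketch with made-up names), the single-writer / many-readers case described above could look like this:
public static class ServerStatus
{
    // volatile ensures reader threads always observe the most recent write.
    public static volatile int CurrentValue;
}

// Writer thread:   ServerStatus.CurrentValue = newValue;
// Reader threads:  int v = ServerStatus.CurrentValue;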

The concept of thread 'safety' is too vague to be meaningful unfortunately. If you're asking whether you can read and write to it from multiple threads without the program crashing during the operation, the answer is almost certainly yes. If you're also asking if the variable is guaranteed to either be the old value or the new value without ever storing any broken intermediate values, the answer for this data type is again almost certainly yes.
But if your question is "will my program work correctly if I access this from multiple threads", then the answer depends entirely on what your program is doing. For example, if you run the following pseudo code in 2 threads repeatedly in most programming languages, eventually you'll hit the assertion.
if MyVar >= 1:
    MyVar = MyVar - 1
assert MyVar >= 0

Primitives like int are thread-safe in the sense that individual reads and writes are atomic. But as with almost any type, it's left to you to do proper synchronization for more complex operations. For example, if (x > 0) x--; would be problematic in a multi-threaded scenario because x might change between the if condition check and the decrement.
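For instance, that check-then-act can be made safe with a compare-exchange loop (an illustrative sketch, not code from the question):
using System.Threading;

static class SafeCounter
{
    // Decrements value only if it is currently positive; retries if another
    // thread changed the value between our read and our write.
    public static bool TryDecrement(ref int value)
    {
        while (true)
        {
            int observed = Volatile.Read(ref value);
            if (observed <= 0)
                return false;

            // Succeeds only if value still equals observed; otherwise loop and retry.
            if (Interlocked.CompareExchange(ref value, observed - 1, observed) == observed)
                return true;
        }
    }
}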

A simple read or write of a field of 32 bits or less is always atomic. But you should still review the code that reads and writes it to make sure the operation as a whole is thread safe.

Check out this post: http://msdn.microsoft.com/en-us/magazine/cc163929.aspx
It explains why you need to synchronize access to the integers in this scenario.

Try Interlocked.Increment() or Interlocked.Add() and you'll be fine. Your code complexity will be the same, but you truly won't have to worry. If you're not worried about losing a few clicks in your counter, you can continue as you are.
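For the click counter from the question, that could look like this (a sketch; the class and member names are mine):
using System.Threading;

public static class ClickStats
{
    private static int _clicks;

    // Safe under any number of concurrent request threads.
    public static void RecordClick() => Interlocked.Increment(ref _clicks);

    // Volatile read so callers always see the latest published count.
    public static int Clicks => Volatile.Read(ref _clicks);
}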

Reading or writing integers is atomic. However, reading and then writing is not atomic. So, if you have one thread that writes and many that read, you may be able to get away without locks.
However, even though the operations are atomic, there are still potential multi-threading issues. In order for one thread to be guaranteed that another thread can see values it writes, you need a memory barrier. Otherwise, the compiler can optimize the code so that the variable stays in a register (or even optimize the operation away completely), so changes would be invisible from one thread to another.
You can establish a memory barrier explicitly (volatile or Thread.MemoryBarrier), with the Interlocked class, or with the lock statement (Monitor).

Related

Rust Global.dealloc vs ptr::drop_in_place vs ManuallyDrop

I'm relatively new to Rust. I was working on some lock-free algorithms, and started playing around with manually managing memory, something similar to C++ new/delete. I noticed a couple different ways that do this throughout the standard library components, but I want to really understand the differences and use cases of each. Here's what it seems like to me:
ManuallyDrop<Box<T>> will prevent Box's destructor from running. I can save a raw pointer to the ManuallyDrop element, and have the actual element go out of scope (what would normally be dropped in Rust) without being dropped. I can later call ManuallyDrop::drop(&mut *ptr) to drop this value manually.
I can also dereference the ManuallyDrop<Box<T>> element, save a raw pointer to just the Box<T>, and later call std::ptr::drop_in_place(box_ptr). This is supposed to destroy the Box itself and drop the heap-allocated T.
Looking at the ManuallyDrop::drop implementation, it looks like those are literally doing the exact same thing. Since ManuallyDrop is zero cost and just stores a value in its struct, is there any difference between the above two approaches?
I can also call std::alloc::Global.dealloc(...), which looks like it will deallocate the memory block without calling drop. So if I call this on a pointer to Box<T>, it'll deallocate the heap pointer, but won't call drop, so T will still be lying around on the heap. I could call it on a pointer to T itself, which will remove T.
From exploring the standard library, it looks like Global.dealloc gets called in the raw_vec implementation to actually remove the heap-allocated array that Vec points to. This makes sense, since it's literally trying to remove a block of memory.
Rc has a drop implementation that looks roughly like this:
// destroy the contained object
ptr::drop_in_place(self.ptr.as_mut());
// remove the implicit "strong weak" pointer now that we've
// destroyed the contents.
self.dec_weak();
if self.weak() == 0 {
    Global.dealloc(self.ptr.cast(), Layout::for_value(self.ptr.as_ref()));
}
I don't really understand why it needs both the dealloc and the drop_in_place. What does the dealloc add that the drop_in_place doesn't do?
Also, if I just save a raw pointer to a heap-allocated value by doing something like Box::into_raw(Box::new(5)), does my pointer now control that memory allocation? As in, will it remain alive until I explicitly call ptr::drop_in_place()?
Finally, when I was playing with all this, I ran into a strange issue. After running ManuallyDrop::drop or ptr::drop_in_place on my raw pointer, I then tried running println! on the pointer's dereferenced value. Sometimes I get a scary heap error and my test fails, which is what I would expect. Other times, it just prints the same value, as if no drop happened. I also tried running ManuallyDrop::drop multiple times on the exact same value, and got the same thing: sometimes a heap error, sometimes totally fine, and the same value prints out.
What is happening here?
If you come from C++, you can think of drop_in_place as calling the destructor manually, and dealloc as calling plain old C free.
They serve different purposes:
drop_in_place just calls Drop::drop, which releases the resources held by your type.
dealloc frees the memory pointed to by a pointer, previously allocated with alloc.
You seem to think that drop_in_place also frees the memory, but that is not the case. I think your confusion arises because Box<T> contains a dynamically allocated object, so its Box::drop implementation does release the memory used by that object, after calling its drop_in_place, of course.
That is what you see in the Rc implementation, first it calls the drop_in_place (destructor) of the inner object, then it releases the memory.
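To see the split concretely, here is a minimal sketch of my own (not library code) that does by hand the two steps that Box<String>'s Drop would otherwise do for you:
use std::alloc::{dealloc, Layout};
use std::ptr;

fn main() {
    // Take over the allocation from Box; nothing will be dropped automatically any more.
    let raw: *mut String = Box::into_raw(Box::new(String::from("hello")));

    unsafe {
        // Step 1: run String's destructor. This frees the String's internal buffer,
        // but the allocation that holds the String value itself is still alive.
        ptr::drop_in_place(raw);

        // Step 2: free that allocation, without running any destructor.
        dealloc(raw as *mut u8, Layout::new::<String>());
    }
}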
About what happens if you call drop_in_place several times in a row... well, the function is unsafe for a reason: you will most likely get Undefined Behavior. From the docs:
...if T is not Copy, using the pointed-to value after calling drop_in_place can cause undefined behavior.
Note the "can cause". I think it is perfectly possible to write a type that allows calling drop several times, but it doesn't sound like such a good idea.

How are you supposed to update a texture per frame in Vulkan?

I'm trying to work with 2D in Vulkan along with 3D, so right now I'm testing out updating a texture every frame for whatever 2D is going on. I've gotten something of a texture updater working; the problem is that it's very slow and probably not the way it's supposed to be done. Is there any better way of getting this done? The code is based on the https://vulkan-tutorial.com/ code.
https://vulkan-tutorial.com/code/26_depth_buffering.cpp
void UpdateTexture()
{
    vkDeviceWaitIdle(device);
    vkFreeMemory(device, textureImageMemory, nullptr);

    VkBuffer stagingBuffer;
    VkDeviceMemory stagingBufferMemory;
    createBuffer(imageSize, VK_BUFFER_USAGE_TRANSFER_SRC_BIT, VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, stagingBuffer, stagingBufferMemory);

    void* data;
    vkMapMemory(device, stagingBufferMemory, 0, imageSize, 0, &data);
    memcpy(data, pixel2.data(), static_cast<size_t>(imageSize));
    vkUnmapMemory(device, stagingBufferMemory);

    createImage(texWidth, texHeight, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_TILING_OPTIMAL, VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, textureImage, textureImageMemory);
    transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_LAYOUT_UNDEFINED, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL);
    copyBufferToImage(stagingBuffer, textureImage, static_cast<uint32_t>(texWidth), static_cast<uint32_t>(texHeight));
    transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL);

    vkDestroyBuffer(device, stagingBuffer, nullptr);
    vkFreeMemory(device, stagingBufferMemory, nullptr);

    createTextureImageView();
    createDescriptorPool();
    createDescriptorSets();
    createCommandBuffers();
}
This code looks like a direct translation of some OpenGL code, and not particularly good/modern OpenGL code at that.
There's a lot wrong in this code, but most of it boils down to over-synchronization.
First, you should always view any call to vkDeviceWaitIdle as the wrong thing to do. The only exception would be when you are preparing to destroy the VkDevice itself. There is no other reason to do a full CPU/GPU sync like that.
Presumably, this synchronization exists so that you can be sure the GPU is finished using the image before modifying it. This is the wrong thing to do. You should instead employ multiple-buffering. That is, you should have two images that you use. One is currently being used in a rendering process, while the other is being transferred into.
Instead of doing a full device sync, you instead synchronize with the batch you sent two frames ago. That is, if you're wanting to transfer data for use by frame 10, then you must first do a fence-sync operation with the batch you sent in frame 8. Frame 9 is still being processed, but frame 8 is probably done by now. So the synchronization shouldn't hurt too much.
Second, never allocate memory in the middle of an operation like this. Memory gets allocated early in your application, and you leave it allocated until it's time to destroy your application. If you need a staging buffer, then keep it around and reuse it in subsequent frames. Make sure to allocate sufficient storage up-front.
Whatever your createBuffer call is doing, it seems very much like a bad idea. Vulkan is not OpenGL; Vulkan separated memory from buffers/textures that use it for a reason. Creating APIs that hide this separation basically throws all of that away.
Similarly, never unmap memory unless you're about to destroy that memory object. There's no problem in Vulkan (or OpenGL) with leaving a piece of memory mapped indefinitely. Just map the entire memory's range and leave it mapped. Indeed, you could just pass the mapped pointer directly to your image loader, depending on how the memory gets written by the image loading code (if it tries to read data from this pointer, there could be trouble).
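To make the fence and persistent-mapping points concrete, here is a rough sketch (my own illustration; FrameData, frames, and recordAndSubmitUpload are made-up names, and the tutorial's globals such as device are assumed):
struct FrameData {
    VkFence        inFlightFence;   // signaled by the submit that last used this slot
    VkBuffer       stagingBuffer;   // created once at startup
    VkDeviceMemory stagingMemory;   // HOST_VISIBLE | HOST_COHERENT, mapped once at startup
    void*          stagingMapped;   // persistent mapping, never unmapped
    VkImage        image;           // the texture this slot uploads into
};

std::array<FrameData, 2> frames;    // double-buffered texture uploads

void UpdateTexture(uint32_t frameIndex, const std::vector<uint8_t>& pixels, VkDeviceSize imageSize)
{
    FrameData& frame = frames[frameIndex % frames.size()];

    // Wait only for the submission that used this slot two frames ago,
    // instead of stalling the whole device with vkDeviceWaitIdle.
    vkWaitForFences(device, 1, &frame.inFlightFence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &frame.inFlightFence);

    // Reuse the persistently mapped staging memory: no allocation, no map/unmap.
    memcpy(frame.stagingMapped, pixels.data(), static_cast<size_t>(imageSize));

    // Record the layout transitions and vkCmdCopyBufferToImage into a reusable
    // command buffer, then submit it so that it signals frame.inFlightFence.
    recordAndSubmitUpload(frame);
}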
Lastly, the commands doing the transfer need to be synchronized with the commands that consume the image. How this happens depends on which queues are being used to do the transfer.
And of course, if you want optimal performance, you may want to check to see if your implementation can read from linear images in your shader. If it can, then you may not need staging at all; you can just write the data directly to the memory in Vulkan's image format, and use it directly.
Employing all of the above is going to add a lot of complexity to your application. But that's how it's supposed to work.
A naive way consists of using the CPU to compute the update based on time or data, and then updating the data for the shader, such as an MVP transformation matrix. But this is inefficient: lots of syncing, low refresh rates, and the CPU is overloaded in a loop.
So people recommend using multiple buffers, sometimes mentioning old drivers. If someone can clarify this, that would be nice. I have a naive and probably wrong guess: if you know the frame rate exactly, you can calculate the time for each frame and dispatch several frames in advance. But that confuses me, because the frame rate is dynamic, especially on newer screens with FreeSync and its dynamic refresh rates.
I have thought of a third possibility: use the clock directly in the shader. GL_EXT_shader_realtime_clock provides clockRealtimeEXT. It has no defined unit and will wrap when exceeding the maximum value, but it is said to be "globally coherent by all invocations on the GPU". During initialization, you can measure its rate using a uniform buffer, then assume the rate is constant (and also manage the wrapping).
Then if you can write your shaders as a function of time, for example in a translation, that would be efficient. You just need the initial data. Remember that one must avoid if conditions in shaders.

Concurrent Read Access to Thread Object that Emulates Map

I am experiencing (very) slow page load times that increase proportionately to the number of active users on the system. I have a hunch that this is related to a custom defined thread object:
define stageStoreCache => thread {
    parent map
    public oncreate() => ..oncreate()
}
This stageStoreCache object simply mimics the behavior of a map whose data is available across the entire instance.
Many threads read from it and very few threads write to it. Is this a poorly conceived solution to having a large map of data available across the instance? It's a fairly large map of maps that, when exported with map->asstring, can exceed 5 MB. The objective is to avoid translating data stored as JSON in the database into Lasso types on the fly.
It seems that the large size of the stageStoreCache is not what causes problems. It seems to really be the number of concurrent users on the system.
Thanks for any insight you can offer.
You said that this holds a map of maps and is rather large. If those sub-maps are large, it is possible that the way you are accessing the data is causing the issue. Here's what I mean, if you are doing something like this:
// Potential problem as it copies the sub-map each time
stageStoreCache->find('sub-map')->find('data')
stageStoreCache->find('sub-map')->find('other')
The problem is that each time stageStoreCache->find('sub-map') is called, it actually has to copy all of the map data it finds for "sub-map" out of the thread object and into the thread requesting that data. If those sub-maps are large, this takes time. A better approach would be to do this once and stash it in a local variable:
// Better Approach
local(cache) = stageStoreCache->find('sub-map')
#cache->find('data')
#cache->find('other')
This at least only has to copy the "sub-map" over once. Another approach that might be better (only testing could tell) would be to refactor your code so that each call to stageStoreCache drills down to the data you actually want, and have just that small amount of data copied over.
// Might even be better as it just copies the values you want
stageStoreCache->drill('sub-map', 'data')
stageStoreCache->drill('sub-map', 'other')
Ultimately, I would love for Lasso to improve thread objects so that they never block for reads. (I had thought this had been submitted as a feature request, but I'm not finding it on Rhinotrac.) Until that happens, if none of my suggestions help, then you may need to investigate caching this data in something else, such as memcached.
Testing is the only way to tell for sure. But I would go a long way to avoid having a thread object that contains some 5 MB of data.
Take this snippet from the Lasso guide into consideration:
"all parameter values given to a thread object method are copied, as well as any return value of a thread object method"
http://www.lassoguide.com/language/threading.html
Meaning that one of the key features that makes Lasso 9 so fast, the extensive use of reference data, is lost.
Each time you have a call for stageStoreCache all the data it contains will first be copied into the thread that asks for it. That is an awful lot of copying.
I have found that having settings and site-wide data contained in the smallest possible chunks is convenient and fast, and that it pays to only set a value up when it is actually called for. This is unlike the old approach of a config file included on every call, setting up a bunch of variables of which the majority might never be used on that particular call. Here's a trick (credit to Ke) that I'm using instead. Consider this:
define mysetting1 => var(__mysetting1) || $__mysetting1 := 'Setting 1 value'
define mysetting2 => var(__mysetting2) || $__mysetting2 := 'Setting 2 value'
define mysetting3 => var(__mysetting3) || $__mysetting3 := 'Setting 3 value'
Have this in a file that is read at startup, either in a LassoApp that's initiated or a file in the startup folder.
These settings can then be called like this:
code blabla
mysetting2
more code blabla
mysetting1
mysetting2
The beauty is that, in this case, there is no wasted processing to initiate mysetting3, since it's never called for, while mysetting2 is called for several times but is still only initiated once.
This technique can be used for simple things like the above, but also to initiate complex types or methods. Like session management, calling post or get params etc.

High rate of Gen 1 garbage collections

I am profiling an application (using VS 2010) that is behaving badly in production. One of the recommendations given by VS 2010 is:
Relatively high rate of Gen 1 garbage collections is occurring. If, by
design, most of your program's data structures are allocated and
persisted for a long time, this is not ordinarily a problem. However,
if this behavior is unintended, your app may be pinning objects. If
you are not certain, you can gather .NET memory allocation data and
object lifetime information to understand the pattern of memory
allocation your application uses.
Searching on Google gives the following link: http://msdn.microsoft.com/en-us/library/ee815714.aspx. Are there some obvious things that I can do to reduce this issue? I seem to be lost here.
Double-click the message in the Errors List window to navigate to the
Marks View of the profiling data. Find the .NET CLR Memory# of Gen 0
Collections and .NET CLR Memory# of Gen 1 Collections columns.
Determine if there are specific phases of program execution where
garbage collection is occurring more frequently. Compare these values
to the % Time in GC column to see if the pattern of managed memory
allocations is causing excessive memory management overhead.
To understand the application’s pattern of managed memory usage,
profile it again running a .NET Memory allocation profile and request
Object Lifetime measurements.
For information about how to improve garbage collection performance,
see Garbage Collector Basics and Performance Hints on the Microsoft
Web site. For information about the overhead of automatic garbage
collection, see Large Object Heap Uncovered.
The relevant line there is:
To understand the application’s pattern of managed memory usage, profile it again running a .NET Memory allocation profile and request Object Lifetime measurements.
You need to understand how many objects are being allocated by your application and when, and how long they are alive for. You're probably allocating hundreds (or thousands!) of tiny objects inside a loop somewhere without really thinking about the consequences of reclaiming that memory when the references fall out of scope.
http://msdn.microsoft.com/en-us/library/ms973837.aspx states:
Now that we have a basic model for how things are working, let's
consider some things that could go wrong that would make it slow. That
will give us a good idea what sorts of things we should try to avoid
to get the best performance out of the collector.
Too Many Allocations
This is really the most basic thing that can go wrong. Allocating new
memory with the garbage collector is really quite fast. As you can see
in Figure 2 above, all that needs to happen typically is for the
allocation pointer to get moved to create space for your new object on
the "allocated" side—it doesn't get much faster than that. However,
sooner or later a garbage collect has to happen and, all things being
equal, it's better for that to happen later than sooner. So you want
to make sure when you're creating new objects that it's really
necessary and appropriate to do so, even though creating just one is
fast.
This may sound like obvious advice, but actually it's remarkably easy
to forget that one little line of code you write could trigger a lot
of allocations. For example, suppose you're writing a comparison
function of some kind, and suppose that your objects have a keywords
field and that you want your comparison to be case insensitive on the
keywords in the order given. Now in this case you can't just compare
the entire keywords string, because the first keyword might be very
short. It would be tempting to use String.Split to break the keyword
string into pieces and then compare each piece in order using the
normal case-insensitive compare. Sounds great right?
Well, as it turns out doing it like that isn't such a good idea. You
see, String.Split is going to create an array of strings, which means
one new string object for every keyword originally in your keywords
string plus one more object for the array. Yikes! If we're doing this
in the context of a sort, that's a lot of comparisons and your
two-line comparison function is now creating a very large number of
temporary objects. Suddenly the garbage collector is going to be
working very hard on your behalf, and even with the cleverest
collection scheme there is just a lot of trash to clean up. Better to
write a comparison function that doesn't require the allocations at
all.
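A rough sketch of what such an allocation-free comparison could look like on a modern runtime, walking the keyword lists token by token with spans instead of String.Split (my own illustration, not code from the article):
using System;

static class KeywordComparer
{
    // Compares two comma-separated keyword strings case-insensitively,
    // token by token, without allocating arrays or substrings.
    public static int Compare(string left, string right)
    {
        ReadOnlySpan<char> a = left.AsSpan(), b = right.AsSpan();
        while (true)
        {
            int ia = a.IndexOf(',');
            int ib = b.IndexOf(',');
            ReadOnlySpan<char> ta = ia < 0 ? a : a.Slice(0, ia);
            ReadOnlySpan<char> tb = ib < 0 ? b : b.Slice(0, ib);

            int cmp = ta.CompareTo(tb, StringComparison.OrdinalIgnoreCase);
            if (cmp != 0)
                return cmp;

            if (ia < 0 || ib < 0)                   // at least one list is exhausted
                return (ia < 0 ? 0 : 1) - (ib < 0 ? 0 : 1);

            a = a.Slice(ia + 1);
            b = b.Slice(ib + 1);
        }
    }
}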

Barriers in OpenCL

In OpenCL, my understanding is that you can use the barrier() function to synchronize threads in a work group. I do (generally) understand what they are for and when to use them. I'm also aware that all threads in a work group must hit the barrier, otherwise there are problems. However, every time I've tried to use barriers so far, it seems to result in either my video driver crashing, or an error message about accessing invalid memory of some sort. I've seen this on 2 different video cards so far (1 ATI, 1 NVIDIA).
So, my questions are:
Any idea why this would happen?
What is the difference between barrier(CLK_LOCAL_MEM_FENCE) and barrier(CLK_GLOBAL_MEM_FENCE)? I read the documentation, but it wasn't clear to me.
Is there a general rule about when to use barrier(CLK_LOCAL_MEM_FENCE) vs. barrier(CLK_GLOBAL_MEM_FENCE)?
Is there ever a time that calling barrier() with the wrong parameter type could cause an error?
As you have stated, barriers may only synchronize threads in the same workgroup. There is no way to synchronize different workgroups in a kernel.
Now to answer your question, the specification was not clear to me either, but it seems to me that section 6.11.9 contains the answer:
CLK_LOCAL_MEM_FENCE – The barrier function will either flush any
variables stored in local memory or queue a memory fence to ensure
correct ordering of memory operations to local memory.
CLK_GLOBAL_MEM_FENCE – The barrier function will queue a memory fence
to ensure correct ordering of memory operations to global memory.
This can be useful when work-items, for example, write to buffer or
image memory objects and then want to read the updated data.
So, to my understanding, you should use CLK_LOCAL_MEM_FENCE when writing to and reading from the __local memory space, and CLK_GLOBAL_MEM_FENCE when writing to and reading from the __global memory space.
I have not tested whether this is any slower, but most of the time, when I need a barrier and I have a doubt about which memory space is impacted, I simply use a combination of the two, i.e.:
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
This way you should not have any memory read/write ordering problem (as long as you are sure that every thread in the group goes through the barrier, but you are aware of that).
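As an illustration (my own sketch, not code from the question), here is the classic work-group reduction pattern where barrier(CLK_LOCAL_MEM_FENCE) is required between writing __local memory and reading it back in the next pass; it assumes the work-group size is a power of two and the global size covers the input:
__kernel void reduce_sum(__global const float* in,
                         __global float* out,
                         __local float* scratch)
{
    const size_t lid = get_local_id(0);

    // Each work-item stages one element into local memory.
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);       // make every write to scratch visible to the group

    // Tree reduction within the work-group.
    for (size_t offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);   // outside the if, so ALL work-items reach it
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}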
Hope it helps.
Reviving an old-ish thread here. I have had a little bit of trouble with barrier() myself.
Regarding your crash problem, one potential cause could be if your barrier is inside a condition. I read that when you use barrier, ALL work items in the group must be able to reach that instruction, or it will hang your kernel - usually resulting in a crash.
if(someCondition){
    //do stuff
    barrier(CLK_LOCAL_MEM_FENCE);
    //more stuff
}else{
    //other stuff
}
My understanding is that if one or more work items satisfies someCondition, ALL work items must satisfy that condition, or there will be some that will skip the barrier. Barriers wait until ALL work items reach that point. To fix the above code, I need to restructure it a bit:
if(someCondition){
    //do stuff
}
barrier(CLK_LOCAL_MEM_FENCE);
if(someCondition){
    //more stuff
}else{
    //other stuff
}
Now all work items will reach the barrier.
I don't know to what extent this applies to loops; if a work item breaks from a for loop, does it hit barriers? I am unsure.
UPDATE: I have successfully crashed a few OpenCL programs with a barrier in a for loop. Make sure all work items exit the for loop at the same time - or better yet, put the barrier outside the loop.
(source: Heterogeneous Computing with OpenCL Chapter 5, p90-91)
