Data corruption when uploading a texture (Direct3D11) - Intel

I have encountered the following issue in my Direct3D11-based application:
Sometimes (and on some machines) I get a texture which is seemingly corrupted; a couple of lines are just black (screenshot of one such texture omitted).
Sometimes it is as bad as that, sometimes it is only one or two lines. I made sure that it is not a rendering issue; all observations indicate that the texture data is not correct.
So far this has been observed only under the following conditions:
we are creating an immutable texture (CPUAccessFlags=None, Usage=Default, BindFlags=ShaderResource)
using a laptop with "Intel HD Graphics 520" and "AMD Radeon R7 M360" with Windows 7 (with "AMD Enduro"-technology)
only with the texture formats "R16G16B16A16_UNorm", "R8G8B8A8_UNorm" and "R32_Float", not with "R8_UNorm" or "R16_UNorm" (for the rest it is unknown to me whether they work or not)
and it seems to occur only intermittently (seems to be more likely if the GPU is quite busy)
I tried putting together a small sample which reproduces the issue: https://github.com/ptahmose/ImmutableTextureUploadTest
In this code, I am uploading an immutable texture, then copying it to a staging texture, and then comparing the data to what was uploaded.
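One pitfall worth ruling out in a readback comparison like this: mapping a staging texture returns the data with a driver-chosen RowPitch, which can be larger than width times bytes-per-pixel, so the comparison has to walk row by row rather than compare the whole allocation at once. A language-neutral sketch of that check (pure Python, all values made up):

```python
# Hypothetical sketch (pure Python, made-up values): a mapped staging
# texture exposes each row at a driver-chosen RowPitch, which may be
# larger than width * bytes_per_pixel, so compare row by row.
width, height, bpp = 256, 4, 4              # e.g. R8G8B8A8_UNorm
row_bytes = width * bpp
row_pitch = row_bytes + 256                 # pretend the driver padded rows

uploaded = bytes(range(256)) * (row_bytes * height // 256)

# Simulated mapped staging data: each row padded out to row_pitch
mapped = b"".join(
    uploaded[r * row_bytes:(r + 1) * row_bytes] + b"\x00" * (row_pitch - row_bytes)
    for r in range(height)
)

bad_rows = [r for r in range(height)
            if mapped[r * row_pitch:r * row_pitch + row_bytes]
            != uploaded[r * row_bytes:(r + 1) * row_bytes]]
print(bad_rows)
```

If the comparison already accounts for the pitch, an empty list here corresponds to a clean readback; the repro on GitHub would still be the authoritative check.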
Now, with this repro-project I get the following:
If this repro-application is running alone on the machine, all is fine.
If I run another application which uses D3D11 on the same machine, this test application reports errors.
The same happens if I run two instances of the test application, or the actual D3D11-based application alongside it.
Did somebody run across something similar? Can this issue be reproduced? Am I missing something? ...and of course: what are my options to solve/work around this issue?
Thanks for reading!

Related

Is there a way to treat a cl_image as a cl_mem?

I've been working on speeding up some image processing code written in OpenCL, and I've found that for my kernel, buffers (cl_mem) are significantly faster than images (cl_image).
So I want to process my images as cl_mem, but unfortunately I'm stuck with an API that only spits out cl_images: I'm using an OS X API, clCreateImageFromIOSurface2DAPPLE, that creates an image for me.
Is there any way to take a cl_image and treat it as a cl_mem? When I've tried to do that I get an error when running my kernel.
I've tried copying the image to a buffer using clEnqueueCopyImageToBuffer, but that's also too slow. Any ideas? Thanks in advance.
PS: I believe my kernel operating on a buffer is much faster because I can do a vload4 and load 4 pixels at a time, vs read_imagei which does just one.
You cannot treat an OpenCL image as linear memory. The memory layout of an image is private to the implementation and should be considered unknown.
If your code creates the image, however, you could create a buffer and then use cl_khr_image2d_from_buffer. Otherwise write a kernel that copies the data from image to buffer and see if it is faster than clEnqueueCopyImageToBuffer (unlikely).
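This is not OpenCL code, but as a loose analogy for what cl_khr_image2d_from_buffer buys you: the flat buffer and the 2D view share a single allocation, so no copy is ever made (numpy used purely for illustration):

```python
import numpy as np

# Not OpenCL: a numpy analogy for cl_khr_image2d_from_buffer. The flat
# "buffer" and the 2D "image" view share one allocation, so writes
# through either are visible in both and no copy is ever made.
width, height = 8, 4
buf = np.zeros(width * height, dtype=np.uint32)   # the cl_mem-like buffer
img = buf.reshape(height, width)                  # the image-like 2D view

img[2, 3] = 0xDEADBEEF
print(hex(int(buf[2 * width + 3])))
```

In the real extension, the row pitch of the buffer-backed image is fixed by you at creation time, which is exactly what makes vload4-style linear access through the buffer view legal.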

Is filebacked.big.matrix in the bigmemory package memory neutral?

I have been using filebacked.big.matrix to store a very large matrix (~1 million x 20 thousand). I am working on a cluster with very high memory, but not quite that much. I have previously used the ff package, which worked great and kept the memory usage consistent despite the matrix size, but it died when I surpassed 2^31 items in the matrix (the R community really needs to fix that problem). The filebacked.big.matrix initially seemed to work very well and generally runs without problems, but when I check on the memory usage it is sometimes spiking into the 100s of GBs. I am careful to only read/write a relatively few rows of the matrix at a time, so I think there should not be much in memory at any given time.
Does it do some sort of automatic memory caching or something that is driving the memory usage up? If so can this caching be disabled or limited? The high memory usage is causing some nasty side effects on the cluster so I need a way to do this that is memory neutral. I have checked the filebacked.big.matrix help page, but can't find any pertinent information there.
Thanks!
UPDATE:
I am also using bigmemoryExtras.
I was wrong earlier; the problem is happening when I loop through the entire matrix, reading it into a different, smaller file-backed matrix like this:
tmpGeno = fileBackedMatrix(rowIndex-1, numMarkers, 'double', tmpDir)
front = 1
back = 40000
# the large matrix must be copied in chunks to avoid integer.max errors
while (front < rowIndex-1) {
  if (back > rowIndex-1) back = rowIndex-1
  tmpGeno[front:back, 1:numMarkers] = genotypeMatrix[front:back, 1:numMarkers, drop=F]
  front = front + 40000
  back = back + 40000
}
The physical memory usage is initially very low (with virtual memory very high), but while running this loop, and even after it has finished, it seems to just keep most of the matrix in physical memory. I need it to keep only one small chunk of the matrix in memory at a time.
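The intent of the R loop can be mirrored in a language-neutral sketch (Python with numpy.memmap, all sizes made up): copy fixed-size row chunks between two file-backed matrices and flush after each chunk, so only a bounded window of dirty pages exists at any time:

```python
import os
import tempfile
import numpy as np

# Language-neutral mirror of the R chunked-copy loop (sizes made up):
# copy a file-backed matrix into another one in fixed-size row chunks,
# flushing after each chunk so only a bounded window is dirty at a time.
tmpdir = tempfile.mkdtemp()
rows, cols, chunk = 100_000, 20, 40_000

src = np.memmap(os.path.join(tmpdir, "src.dat"), dtype="float64",
                mode="w+", shape=(rows, cols))
src[:] = 1.0
dst = np.memmap(os.path.join(tmpdir, "dst.dat"), dtype="float64",
                mode="w+", shape=(rows, cols))

for front in range(0, rows, chunk):
    back = min(front + chunk, rows)
    dst[front:back, :] = src[front:back, :]
    dst.flush()                  # write the chunk's dirty pages out

print(float(dst.sum()))
```

Note that even with explicit flushing, already-written pages may stay resident as reclaimable page cache, which tools like top report as used memory; that is an OS caching effect, not the process holding the data.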
UPDATE 2:
It is a bit confusing to me: the cluster metrics and the top command say that it is using tons of memory (~80 GB), but the gc() command says that memory usage never went over 2 GB. The free command says that all the memory is used, but its -/+ buffers/cache line says only 7 GB are actually in use.

How to run Klaus Dormann's 6502 test suite on real hardware with separate ROM and RAM

I would like to run the full 6502 test suite by Klaus Dormann to test my Kansas Lava 6502 implementation. However, the code uses self-modification (see all uses of range_adr), which, while trivial to implement in an emulator, doesn't bode well for a hardware implementation: the program image needs to be stored in ROM, so the write-backs are going to be black-holed by whatever routes writes based on whether the address maps to ROM or RAM.
The same problem, of course, applies both to synthesizing it to a real FPGA and to running it in a simulator (either the low-level VHDL one or the high-level Kansas Lava one).
Is there a way to run the test suite without doing a lengthy (in terms of cycles) dance of suspending the CPU, copying the program from some unaddressable ROM into an all-RAM memory byte-by-byte, and then initializing the CPU and letting it run? I'd prefer not doing this because simulating these extra cycles at startup will slow down running the test considerably.
Knee-jerk observations:
Despite coming as a 64 KB image, the test is actually just 14,093 bytes of actual content, from $0000 up to $370d, followed by $ff padding up to the three vectors at $fffa–$ffff. So you'd need to copy at most 14,099 bytes rather than the prima facie 65,536.
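Reading "up to $370d" as exclusive, the two byte counts above check out:

```python
# Sanity-checking the byte counts above: the content occupies $0000
# through $370c, i.e. 0x370d bytes; adding the six vector bytes at
# $fffa-$ffff gives the total that actually needs copying.
content_bytes = 0x370d               # 14,093 bytes of test content
vector_bytes = 0xffff - 0xfffa + 1   # NMI/RESET/IRQ vectors: 6 bytes
print(content_bytes, content_bytes + vector_bytes)
```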
Having set up that very test suite in an emulator I wrote yesterday (no, really), the full range of touched addresses (using [x, y] to denote the closed range, i.e. including both x and y) is:
[000a, 0012], [0100, 0101], [01f9, 01ff] (i.e. the stack and zero pages);
0200;
[0203, 0207];
04a8;
2cbb;
2cdc;
2eb1;
2ed2;
30a7;
30c8;
33f2;
3409;
353b; and
3552.
From the .lst version of the program, that means all you need to move are the variables with labels:
test_case;
ada2;
sba2;
range_adr;
... and either move or remove the routines that:
test AND immediate from 2cac down to 2cec;
test EOR immediate from 2ea2 to 2ee2;
test ORA immediate from 3098 to 30d8;
test decimal ADC/SBC immediate from 33e7 down to 3414 (specifically to include chkdadi and chksbi);
test binary ADC/SBC immediate from 3531 down to 355d.
All the immediate tests self-modify the operand. If you're happy leaving that one addressing mode untested, then it shouldn't be too troublesome.
So, I guess, edit those tests out of the original file, and you can safely relocate range_adr to the middle of the stack page if my simulation is accurate.

Give CPU more power to plot in Octave

I made this function in Octave which plots fractals. Now, it takes a long time to plot all the points I've calculated. I've made my function as efficient as possible, the only way I think I can make it plot faster is by having my CPU completely focus itself on the function or telling it somehow it should focus on my plot.
Is there a way I can do this or is this really the limit?
To determine how much CPU is being consumed by your plot, run your plot and, in a separate window (assuming you're on Linux/Unix), run the top command. (For Windows, launch Task Manager, switch to the 'Processes' tab, and click the CPU header to sort by CPU.)
(The rollover description for Octave on the tag on your question says that Octave is a scripting language. I would expect it's calling gnuplot to create the plots. Look for this as the highest CPU consumer).
You should see your Octave/gnuplot command near the top of the list; top has a column labeled %CPU (or similar) that shows how much CPU that process is consuming.
I would expect to see that process consuming 95% or more CPU. If you see a significantly lower number, then check the processes below it: are they consuming the remaining CPU (some sort of virus scan (on a PC), or a DB or server)? If a competing program is the problem, then you'll have to decide whether you can wait until it is finished, or whether you can kill it and restart it later. (On Linux, use kill -15 pid (SIGTERM) first, and only use kill -9 pid (SIGKILL) as a last resort; note that kill -11 sends SIGSEGV and is not a graceful way to stop a process. Search here for articles on the correct order of signals to try.)
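The suggested escalation (polite SIGTERM first, SIGKILL only if the process survives) can be sketched like this; the sleep child is just a stand-in for the competing process, whose real PID you would take from top's output:

```python
import os
import signal
import subprocess
import time

# Stand-in for the competing process; in practice you would take the
# real PID from top's output instead of spawning one here.
child = subprocess.Popen(["sleep", "300"])

os.kill(child.pid, signal.SIGTERM)       # polite request to exit (kill -15)
time.sleep(0.5)
if child.poll() is None:                 # still alive? escalate
    os.kill(child.pid, signal.SIGKILL)   # kill -9, last resort only
child.wait()
print("exited with", child.returncode)
```

A negative return code is Python's way of reporting which signal terminated the child.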
If there are no competing processes AND octave/gnuplot is using less than 95%, then you'll have to find other tools to see what is holding up the process. (This is unlikely, but it's possible some part of your overall plotting process is Disk I/O or Network I/O bound.)
So, it depends on the timescale you're currently experiencing versus the time you "want" to experience.
Does your system have multiple CPUs? Then you'll need to study the octave/gnuplot documentation to see if it supports a switch to indicate "use $n available CPUs for processing". (Or find a plotting program that does support using $n multiple CPUs).
Realistically, if your process now takes 10 minutes and you can, by eliminating competing processes, go from 60% to 90% of a CPU, that is a 50% increase in throughput, but it only cuts the run time by a third (to roughly 6-7 minutes), not to 5. Being able to divide the task over 5-10-?? CPUs will be the most certain path to faster turn-around times.
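The arithmetic here is simple: run time scales with the inverse of the CPU share the job gets, so:

```python
# Run time scales inversely with the CPU share the job receives:
# new_time = old_time * old_share / new_share.
old_share, new_share = 0.60, 0.90
old_minutes = 10
new_minutes = old_minutes * old_share / new_share
print(round(new_minutes, 1))
```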
So, to go further with this, you'll need to edit your question with some data points. How long does your plot take? How big is the file it's processing? Is there something especially math-intensive about the plotting you're doing? Could a pre-processed data file speed up the calculations? Also, if the results of top don't show gnuplot running at 99% CPU, edit your posting to include the top output; that will help us understand your problem. (Paste in your top output, select it with your mouse, and then use the formatting tool {} at the top of the input box to keep the formatting and avoid having the output wrap in your posting.)
IHTH.
P.S. Note the # of followers for each of the tags you've assigned to your question by rolling over. You might get more useful "eyes" on your question by including a tag for the OS you're using, and a tag related to performance measurement/testing (Go to the tags tab and type in various terms to see how many followers you're getting. One bit of S.O. etiquette is to only specify 1 programming language (if appropriate) and that may apply to OS's too.)

R error allocMatrix

Hi all,
I was trying to load a certain amount of Affymetrix CEL files, with the standard BioConductor command (R 2.8.1 on 64 bit linux, 72 GB of RAM)
abatch<-ReadAffy()
But I keep getting this message:
Error in read.affybatch(filenames = l$filenames, phenoData = l$phenoData, :
allocMatrix: too many elements specified
What's the general meaning of this allocMatrix error? Is there some way to increase its maximum size?
Thank you
The problem is that all the core functions use INTs instead of LONGs when generating R objects. For example, your error message comes from array.c in /src/main:
if ((double)nr * (double)nc > INT_MAX)
error(_("too many elements specified"));
where nr and nc are integers generated before, standing for the number of rows and columns of your matrix:
nr = asInteger(snr);
nc = asInteger(snc);
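To see how a large CEL data set trips that check (with entirely made-up dimensions for illustration):

```python
# Hypothetical illustration of the array.c check above: the matrix
# length nr * nc must stay at or below INT_MAX (2^31 - 1) because R
# stored it as a 32-bit int. Dimensions below are made up.
INT_MAX = 2**31 - 1

nr = 2_500_000               # made-up probe count (rows)
nc = 1_000                   # made-up number of CEL files (columns)
elements = nr * nc
print(elements, elements > INT_MAX)   # True -> "too many elements specified"
```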
So, to cut it short, everything in the source code would have to be changed to LONG, probably not only in array.c but in most core functions, and that would require some rewriting. Sorry for not being more helpful, but I guess this is the only solution. Alternatively, you may wait for R 3.x next year; hopefully they will implement this...
If you're trying to work on huge affymetrix datasets, you might have better luck using packages from aroma.affymetrix.
Also, Bioconductor is a (particularly) fast-moving project and you'll typically be asked to upgrade to the latest version of R in order to get any continued "support" (help on the BioC mailing list). I see that Thrawn also mentions having a similar problem with R 2.10, but you might still think about upgrading anyway.
I bumped into this thread by chance. No, the aroma.* framework is not limited by the allocMatrix() limitation of ints and longs, because it does not address data using the regular address space alone; instead it also subsets via the file system. It never holds or loads the complete data set in memory at any time. Basically the file system sets the limit, not the RAM nor the address space of your OS.
/Henrik
(author of aroma.*)