Sudden degradation of network file-transfer rate - networking

I've been using jcifs-1.3.17 for over a year now, transferring thousands of files from place A to place B, without problems. The file transfers have always taken from 15-30ms. Yesterday (7/13/16) something changed and the transfers are now taking about 90seconds - not milliseconds, but seconds. I can identify a 15-minute window in which the change happened. Operations insists that nothing in the network or on the servers changed. My code code didn't change, nor the character/size of the files.
Has anyone else experienced something like this? Any ideas on what I might look at to determine root cause?

Related

Copying 100 GB with continues change of file between datacenters with R-sync is good idea?

I have a datacenter A which has 100GB of the file changing every millisecond. I need to copy and place the file in Datacenter B. In case of failure on Datacenter A, I need to utilize the file in B. As the file is changing every millisecond does r-sync can handle it at 250 miles far datacenter? Is there any possibility of getting the corropted file? As it is continuously updating when we call this as a finished file in datacenter B ?
rsync is a relatively straightforward file copying tool with some very advanced features. This would work great for files and directory structures where change is less frequent.
If a single file with 100GB of data is changing every millisecond, that would be a potential data change rate of 100TB per second. In reality I would expect the change rate to be much smaller.
Although it is possible to resume data transfer and potentially partially reuse existing data, rsync is not made for continuous replication at that interval. rsync works on a file level and is not as commonly used as a block-level replication tool. However there is an --inplace option. This may be able to provide you the kind of file synchronization you are looking for. https://superuser.com/questions/576035/does-rsync-inplace-write-to-the-entire-file-or-just-to-the-parts-that-need-to
When it comes to distance, the 250 miles may result in at least 2ms of additional latency, if accounting for the speed of light, which is not all that much. In reality this would be more due to cabling, routers and switches.
rsync by itself is probably not the right solution. This question seems to be more about physics, link speed and business requirements than anything else. It would be good to know the exact change rate, and to know if you're allowed to have gaps in your restore points. This level of reliability may require a more sophisticated solution like log shipping, storage snapshots, storage replication or some form of distributed storage on the back end.
No, rsync is probably not the right way to keep the data in sync based on your description.
100Gb of data is of no use to anybody without without the means to maintain it and extract information. That implies structured elements such as records and indexes. Rsync knows nothing about this structure therefore cannot ensure that writes to the file will transition from one valid state to another. It certainly cannot guarantee any sort of consistency if the file will be concurrently updated at either end and copied via rsync
Rsync might be the right solution, but it is impossible to tell from what you have said here.
If you are talking about provisioning real time replication of a database for failover purposes, then the best method is to use transaction replication at the DBMS tier. Failing that, consider something like drbd for block replication but bear in mind you will have to apply database crash recovery on the replicated copy before it will be usable at the remote end.

is this the result of a partial image transfer?

I have code that generates thumbnails from JPEGs. It pulls an image from S3 and then generates the thumbs.
One in about every 3000 files ends up looking like this. It happens in batches. The high res looks like this and they're all resized down to low res. It does not fail on resize. I can go to my S3 bucket and see that the original file is indeed intact.
I had this code written in Ruby and ported it all over to clojure hoping it would just fix my issue but it's still happening.
What would result in a JPEG that looks like this?
I'm using standard image copying code like so
(with-open [in (clojure.java.io/input-stream uri)
out (clojure.java.io/output-stream file)]
(clojure.java.io/copy in out))
Would there be any way to detect the transfer didn't go well in clojure? Imagemagick? Any other command line tool?
My guess is it is one of 2 possible issues (you know your code, so you can probably rule one out quickly):
You are running out of memory. If the whole batch of processing is happening at once, the first few are probably not being released until the whole process is completed.
You are running out of time. You may be reaching your maximum execution time for the script.
Implementing some logging as the batches are processed could tell you when the issue happens and what the overall state is at that moment.

Time synchronization of views generated by different instances of the game engine

I'm using open source Torque 3d game engine for the avia simulator project.
I need to generate single image from the several IG (image generator) PCs.Each IG displays has its own view camera with certain angle offset and get the info about the current position from the server via LAN.
I've already setup multi IG system.
Network connection is robust (less than <1 ms)
Frame rate is good as well - about 70 FPS on each IG.
However while moving the whole picture looks broken because some IG are updating their views faster than others.
I'm looking for the solution that will make the IG update simultaneously. Maybe some kind of precise time synchronization algorithms that make different PC connected via LAN act as one.
I had a much simpler problem, but my approach might help you.
You've got to run clocks on all your machines with, say, a 15 millisecond tick. Each image needs to be generated correctly for a specific tick and marked with its tick ID time. The display machine can check its own clock, determine the specific tick number (time) for which it should display, grab the images for that specific time, and display them.
(To have the right mindset to think about this, imagine your network is really bad and think about one IG delivering 1000 images ahead of the current display tick while another is 5 ticks behind. Write for this sort of system and the results will look really good on the one you have.)
Ideally you want your display running a bit behind the IGs so you always have a full set of images for the current tick. I had a client-server setup and slowed the display (client) timer down if it came close to missing updates and sped it up if it was getting too far behind. You have to synchronize all your IG machines, so it might be better to have the master clock on the display and have it send messages to speed up any IG machine that's getting behind. (You may not have the variable network delays I had, but it's best to plan for them.)
The key is that each image must be made at a particular time, that the display include only images for the time being displayed, and that the composite images appear right when they should (every 15 milliseconds, on the millisecond). Also, do not depend on your network or even your machines to do anything in a timely manner. Use feedback to keep everything synched.
Addition On Feedback:
Say the last image for the frame at time T arrives 5ms after time T by the display computer's time (real time). If you display the frame for time T at T plus 10 ms, no one will notice the lag and you'll have plenty of time to assemble the images. Using a constant (10 ms) delay might work for you, especially if you make it big enough. It may be the way to go if you always run with the exact same network.
But you are depending on all your IG machines being precisely synchronized for real time, taking no more than a certain amount of time to produce their image, and delivering their image to the display machine all in predictable lengths of time.
What I'd suggest is have your display machine determine the delay based on the time stamps on the images it receives. It would want to increase the delay if it isn't getting the images it needs in time, and decrease it if all the IG's are running several images ahead of what the display needs. (You might want to ignore the occasional really late image. You have to decide which is more annoying: images that are out-of-date, a display that is running noticably behind time, or a display that noticably speeds up and slows down.)
In my original answer I was suggesting some kind of feedback from the display to keep the IG machines running on time, but that may be overkill: your computer's clocks are probably good enough for that.
Very generally, when any two processes have to coordinate over time, it's best if they talk to each other to stay in step (feedback) rather than each stick to a carefully timed schedule.

Can you sacrifice performance to get concurrency in Sqlite on a NFS?

I need to write a client/server app stored on a network file system. I am quite aware that this is a no-no, but was wondering if I could sacrifice performance (Hermes: "And this time I mean really slash.") to prevent data corruption.
I'm thinking something along the lines of:
Create a separate file in the system everytime a write is called (I'm willing do it for every connection if necessary)
Store the file name as the current millisecond timestamp
Check to see if the file with that time or earlier exists
If the same one exists wait a random time between 0 to 10 ms, and try again.
While file is the earliest timestamp, do work, delete file lock, otherwise wait 10ms and try again.
If a file persists for more than a minute, log as an error, stop until it is determined that the data is not corrupted by a person.
The problem I see is trying to maintain the previous state if something locks up. Or choosing to ignore it, if the state change was actually successful.
Is there a better way of doing this, that doesn't involve not doing it this way? Or has anyone written one of these with a lot less problems than the Sqlite FAQ warns about? Will these mitigations even factor in to preventing data corruption?
A couple of notes:
This must exist on an NSF, the why is not important because it is not my decision to make (it doesn't look like I was clear enough on that point).
The number of readers/writers on the system will be between 5 and 10 all reading and writing at the same time, but rarely on the same record.
There will only be clients and a shared memory space, there is no way to put a server on there, or use a server based RDMS, if there was, obviously I would do it in a New York minute.
The amount of data will initially start off at about 70 MB (plain text, uncompressed), it will grown continuous from there at a reasonable, but not tremendous rate.
I will accept an answer of "No, you can't gain reasonably guaranteed concurrency on an NFS by sacrificing performance" if it contains a detailed and reasonable explanation of why.
Yes, there is a better way. Don't use NFS to do this.
If you are willing to create a new file every time something changes, I expect that you have a small amount of data and/or very infrequent changes. If the data is small, why use SQLite at all? Why not just have files with node names and timestamps?
I think it would help if you described the real problem you are trying to solve a bit more. For example if you have many readers and one writer, there are other approaches.
What do you mean by "concurrency"? Do you actually mean "multiple readers/multiple writers", or can you get by with "multiple readers/one writer with limited latency"?

Performance monitor shows 4294967293 sessions active

I have an ASP.Net 3.5 website running in IIS 6 on Windows Server 2003 R2. It is a relatively small internal application that probably serves less than ten users at any given time. The server has 4 Gig of memory and shows that 3+ Gig is available while the site is active.
Just minutes after restarting the web application Performance monitor shows that there is a whopping 4,294,967,293 sessions active! I am fairly certain that this number is incorrect; at the time this reading there were only 100 requests to the website.
Has anyone else experienced this kind odd behavior from perf mon? Any ideas on how to get an accurate reading?
UPDATE: After running for about an hour the number of active sessions has dropped by 4. So it does seem to be responding to sessions timing out.
Could be an overflow, but my money's on an underflow. I think that the program started with 0 people, someone logged off, and then the number of sessions went negative.
Well, 2^32 = 4,294,967,296, so sounds like there's some kind of overflow occurring. Can't say exactly why.
We have the same problem. It looks like MS has a Hotfix available: http://support.microsoft.com/kb/969722
Update 9/10/2009: Our IT department contacted MS for the Hotfix. It fixed our issue. We are running .NET 2.0 if it matters any.
I am also showing a high number, currently 4,294,967,268.
Every time I abandon a Session, the Sessions abandoned count goes up by 1, and the Sessions Active count decreases by 1. Currently my abandoned session count = 16, so this number probably started at 4,294,967,84.
Is there a fix for this?
My counters were working fine, but one morning I logged in remotely to the production server, and the counter was on this huge number (which is as somebody mentioned very close to 2^32 indicating an underflow). But the only difference from the day before when everything worked was the fact that during the night, windows had installed updates.
So for some reason these updates caused this pretty annoying error.
Observing the counter a little more, I found out that whenever the application is restarted - after some time with no traffic, the counter starts correctly at zero. When users start logging on, it increments fine. When they start logging off again, it still decrements fine until it reaches what is supposed to be zero. At that point it goes bananas...
Sigh...
If you have to use your existing statistics, I opened the log file in Excel and used a formula to bring a more accurate value. I cannot guarantee its accuracy, but the results did look okay:
If B2 is the (aspnet_wp)\Sessions Active value , and the formula sits in C2
/* This one is quicker as it doesn't have to do the extra calculations */
=IF(B2>1073741824,4294967296-B2,B2)
Or
/* This one is clearer what is going on */
=IF(B2>power(2,30),(4*power(2,30))-B2,B2)
P.S. (I feel your pain - I have to explain why they have 4.2 billion sessions opening whereas a second earlier they had 0!)

Resources