I have an existing piece of code that scans through a directory recursively, and then decodes and loads information from supported filetypes. This works fine.
The problem is that this process can take a long time, so I'm trying to find a way to report progress (% complete and estimated time remaining).
There is a conceptual problem though:
I don't know in advance how many subfolders and files there are. This could be huge.
I don't know in advance what the structure of the folders is.
Scanning through the entire folder recursively can potentially take very long.
My initial thought was to go over all the folders first, count how many files there are, and then report progress against that number. This poses problems for large file structures (with millions of files and folders).
Alternatively, I thought of counting progress as I go, adding files to my total as I walk the directories. But this means that progress can go both up and down as new files/folders are discovered; my progress would be meaningless, since a single large enough folder could reduce it significantly.
Is there an alternative solution to this conceptual problem? Perhaps some form of a hybrid solution?
I'm using Java, should that matter (though I don't know how it would).
Stochastic Estimation
You can also do a depth-first search and estimate the remaining work, updating the estimate as you progress. Simply average the fan-out ratio at each level, and apply those statistics to every unknown folder. This is hardly perfect, and you might want to be sensitive to the heavily skewed, power-law-like distribution of real-world folder organizations (e.g. the log of the number of files in a folder will tend to be normally (Gaussian) distributed).
However, this is still subject to fluctuations. If the first leaf contains 1,000 target files (and is under a large fan-out) and the rest of the tree has only 0-5 per leaf, your initial estimate will be quite high. If the order is reversed, you'll have an artificially low estimate, until the final 0.1% takes half the actual running time.
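A minimal sketch of that running-average idea in Java (class and method names here are illustrative, not taken from the original code): track how many directories and files have been discovered and processed so far, use the observed average files-per-directory to project the work hidden in directories that haven't been listed yet, and recompute the fraction after each step.

```java
/**
 * Running-average progress estimate: project the size of unlisted
 * directories from the average number of files per directory seen so far.
 * Illustrative sketch only; names are not from the original post.
 */
public class StochasticProgress {
    private long dirsListed;         // directories whose children are known
    private long dirsDiscovered = 1; // start with the root
    private long filesDiscovered;
    private long filesProcessed;

    /** Call once per directory after listing its children. */
    public void onDirectoryListed(int childFiles, int childDirs) {
        dirsListed++;
        filesDiscovered += childFiles;
        dirsDiscovered += childDirs;
    }

    /** Call after each file has been decoded/loaded. */
    public void onFileProcessed() {
        filesProcessed++;
    }

    /** Estimated fraction complete, in the range [0, 1]. */
    public double estimatedFraction() {
        if (dirsListed == 0) {
            return 0.0;
        }
        double avgFilesPerDir = (double) filesDiscovered / dirsListed;
        long unlistedDirs = dirsDiscovered - dirsListed;
        double estimatedTotalFiles = filesDiscovered + unlistedDirs * avgFilesPerDir;
        return estimatedTotalFiles == 0 ? 1.0 : filesProcessed / estimatedTotalFiles;
    }
}
```

The estimate will still swing whenever a directory turns out to be far larger than average, which is exactly the fluctuation described above; clamping the reported percentage so it never decreases makes the display calmer at the cost of accuracy.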
Full Count
Does it really take that long to count the entries in a large file system, compared to the load/decode stages? Is it worth that overhead to give a more accurate progress report? Remember, you don't have to interpret every file name: just count the number of target files and then recurse on each directory's node ID. Deal with those node IDs (integers) rather than the overhead of string names, if you can.
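If a full count is acceptable, a two-pass version is straightforward with java.nio.file.Files.walkFileTree. In this rough sketch, isSupported and loadAndDecode are hypothetical stand-ins for the asker's existing filter and decode logic:

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicLong;

public class TwoPassProgress {

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);

        // Pass 1: count target files only; reads directory metadata, no decoding.
        AtomicLong total = new AtomicLong();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (isSupported(file)) {
                    total.incrementAndGet();
                }
                return FileVisitResult.CONTINUE;
            }
        });

        // Pass 2: do the real work, reporting progress against the known total.
        AtomicLong done = new AtomicLong();
        long start = System.nanoTime();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (isSupported(file)) {
                    loadAndDecode(file);
                    double fraction = (double) done.incrementAndGet() / total.get();
                    long elapsed = System.nanoTime() - start;
                    long etaSeconds = (long) (elapsed / fraction - elapsed) / 1_000_000_000L;
                    System.out.printf("%.1f%% complete, ~%ds remaining%n",
                            fraction * 100, etaSeconds);
                }
                return FileVisitResult.CONTINUE;
            }
        });
    }

    // Hypothetical placeholders for the existing filter and decode logic.
    private static boolean isSupported(Path file) {
        return file.toString().endsWith(".dat");
    }

    private static void loadAndDecode(Path file) {
        // existing per-file work goes here
    }
}
```

The first pass touches only directory entries and attributes, so even over a large tree it is usually much cheaper than the decode pass; whether it is cheap enough is precisely the trade-off described above.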
Related
I have a huge problem: I'm taking a concurrent programming class and can't come up with a solution. I have to develop my own formula by which I will calculate the order in which clients upload a file to the cloud disk. The files are sorted from smallest to largest at startup.
The graph (not reproduced here) shows how the ratio changes with the weight of the file and the amount of time the client has spent in the queue. The main assumption is that, on the basis of this calculated factor, I should determine the order of the clients. The values I took into account are:
the weight of the first file,
the total weight of all files,
the waiting time in the queue,
the number of files,
the number of clients in line.
Maybe some of you have encountered a similar math problem and know the answer, or know of a related problem that would help me with this.
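Purely as an illustration of the ratio described above (waiting time versus file weight), here is a hedged Java sketch of how such a factor might be computed and used to order the queue; the formula and every name in it are assumptions, not a known solution to the assignment.

```java
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative sketch of one possible ordering factor
 * (waiting time relative to the weight of the next file).
 * The formula and all names are assumptions, not the asker's solution.
 */
public class UploadScheduler {

    static class Client {
        final String name;
        final long firstFileWeight;  // size of the next file to upload
        final long totalWeight;      // total size of all the client's files
        final int fileCount;
        final long waitingMillis;    // time spent waiting in the queue

        Client(String name, long firstFileWeight, long totalWeight,
               int fileCount, long waitingMillis) {
            this.name = name;
            this.firstFileWeight = firstFileWeight;
            this.totalWeight = totalWeight;
            this.fileCount = fileCount;
            this.waitingMillis = waitingMillis;
        }

        /** Higher score = earlier in the queue. */
        double score() {
            // Favors clients that have waited long and whose next file is small;
            // the +1 avoids division by zero for empty files.
            return (double) waitingMillis / (firstFileWeight + 1);
        }
    }

    /** Reorders the queue by descending score. */
    static void reorder(List<Client> queue) {
        queue.sort(Comparator.comparingDouble(Client::score).reversed());
    }
}
```

The unused fields (total weight, file count) are included only because they are among the values listed above and could be folded into the score as additional weights.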
I have a datacenter A which has a 100GB file changing every millisecond. I need to copy the file to datacenter B. In case of a failure in datacenter A, I need to use the file in B. As the file is changing every millisecond, can rsync handle this to a datacenter 250 miles away? Is there any possibility of getting a corrupted file? And as it is continuously updating, when can we call the file in datacenter B finished?
rsync is a relatively straightforward file copying tool with some very advanced features. It works great for files and directory structures that change less frequently.
If a single file with 100GB of data is changing every millisecond, that would be a potential data change rate of 100TB per second. In reality I would expect the change rate to be much smaller.
Although it is possible to resume data transfer and potentially reuse existing data in part, rsync is not made for continuous replication at that interval. rsync works at the file level and is not commonly used as a block-level replication tool. However, there is an --inplace option, which may be able to provide the kind of file synchronization you are looking for: https://superuser.com/questions/576035/does-rsync-inplace-write-to-the-entire-file-or-just-to-the-parts-that-need-to
When it comes to distance, the 250 miles adds at least about 2 ms of latency from the speed of light alone, which is not all that much. In reality it would be more, due to cabling, routers and switches.
rsync by itself is probably not the right solution. This question seems to be more about physics, link speed and business requirements than anything else. It would be good to know the exact change rate, and to know if you're allowed to have gaps in your restore points. This level of reliability may require a more sophisticated solution like log shipping, storage snapshots, storage replication or some form of distributed storage on the back end.
No, rsync is probably not the right way to keep the data in sync based on your description.
100GB of data is of no use to anybody without the means to maintain it and extract information. That implies structured elements such as records and indexes. rsync knows nothing about this structure, and therefore cannot ensure that writes to the file transition it from one valid state to another. It certainly cannot guarantee any sort of consistency if the file is concurrently updated at either end and copied via rsync.
Rsync might be the right solution, but it is impossible to tell from what you have said here.
If you are talking about provisioning real time replication of a database for failover purposes, then the best method is to use transaction replication at the DBMS tier. Failing that, consider something like drbd for block replication but bear in mind you will have to apply database crash recovery on the replicated copy before it will be usable at the remote end.
I'm using a 3rd party application that uses BerkeleyDB for its local datastore (called BMC Discovery). Over time, its BDB files fragment and become ridiculously large, and BMC Software scripted a compact utility that basically uses db_dump piped into db_load with a new file name, and then replaces the original file with the rebuilt file.
The time it takes for large files is insanely long: some take hours, while others of the same size take half that time. It seems to really depend on the level of fragmentation in the file and/or the type of data in it (I assume?).
The provided utility uses a crude method to guesstimate the duration based on the total size of the datastore (which is composed of multiple BDB files). For example, if it is larger than 1 GB it says "will take a few hours", and if larger than 100 GB it says "will take many hours". This doesn't help at all.
I'm wondering if there would be a better, more accurate way, using the commands provided with BerkeleyDB 6.0 (on Red Hat), to estimate the duration of a db_dump/db_load operation for a specific BDB file?
Note: Even though this question mentions a specific 3rd party application, that's just to give you context. The question is generic to BerkeleyDB.
db_dump/db_load are the usual (portable) way to defragment.
Newer BDB (roughly the last 4-5 years, certainly db-6.x) has a db_hotbackup(8) command that might be faster by avoiding hex conversions.
(solutions below would require custom coding)
There is also a DB->compact(3) call that "optionally returns unused Btree, Hash or Recno database pages to the underlying filesystem." This will likely lead to a sparse file, which will appear ridiculously large (with "ls -l") but actually only uses the blocks necessary to store the data.
Finally, there is db_upgrade(8) / db_verify(8), both of which might be customized with DB->set_feedback(3) to do a callback (i.e. a progress bar) for long operations.
Before anything, I would check configuration using db_tuner(8) and db_stat(8), and think a bit about tuning parameters in DB_CONFIG.
Is it possible to estimate the heat generated by an individual process at runtime?
Temperature readings of the processor are easily accessible, but what I need is process-specific information.
Is it possible to map information such as CPU utilization, I/O, running time, memory usage, etc. to get some kind of estimate?
I'm gonna say no, because the overall temperature of your system components isn't a simple mathematical equation either, with everything that's moving and switching.
Heat generated by and inside a computer depends on many external factors: the hardware setup, the ambient temperature of the room, possibly the age of the components, whether there is dust on them or in the fans, whether the cooling paste was correctly applied on the CPU or elsewhere, where heat sinks are present, how heat is being dissipated, etc. In short, again: no.
Additionally, your computer runs a LOT of processes at any given time apart from the ones that you control (and "control" is a relative term). Even if it is possible to access certain sensor data for individual components (as you can see to some extent in the BIOS), isolating one single process's contribution to the total generated heat is, well, impossible.
At the lowest levels (gate networks, control signalling, etc.), an external observer no longer has any means to observe or measure what's going on; but there as well, things are constantly changing state, a variable amount of electricity is being used, and thus a variable amount of heat is generated.
Pertaining to your second question: that's basically what your task manager does. There are countless examples and articles on the internet on how to get that done in a plethora of programming languages.
That is, unless some of the actually smart people in this merry little community of keytappers and screengazers say that it IS actually possible, at which point I will be thoroughly amazed...
EDIT: Monitoring the processes is a first step toward what you're looking for. Take a look at How to detect a process start & end using C# in Windows? and be sure to follow up on duplicates like the one mentioned by Hans.
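For the monitoring side of this, here is a hedged Java sketch (Java only because it is the language used elsewhere on this page) that reads CPU figures for the current JVM process via com.sun.management.OperatingSystemMXBean. It reports utilization, not heat, and the cast only works on JVMs that expose that interface (e.g. HotSpot).

```java
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

public class ProcessCpuProbe {
    public static void main(String[] args) throws InterruptedException {
        // Cast is JVM-specific; HotSpot exposes the com.sun.management interface.
        OperatingSystemMXBean os =
                (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

        for (int i = 0; i < 5; i++) {
            // getProcessCpuLoad() returns a value in [0, 1], or a negative
            // number if the load is not yet available.
            double load = os.getProcessCpuLoad();
            long cpuTimeMs = os.getProcessCpuTime() / 1_000_000; // nanoseconds -> ms
            System.out.printf("process CPU load: %.1f%%, cumulative CPU time: %d ms%n",
                    load * 100, cpuTimeMs);
            Thread.sleep(1_000);
        }
    }
}
```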
You could take a look at PowerTOP or some other tool that monitors power usage. I am not sure how accurate it is across different systems, but a power estimate should provide at least some relative indication of the heat generated, assuming the processes you are comparing are running in a similar manner on similar hardware. In reality there are just too many factors to predict power, much less heat, effectively, but you may be able to get an idea of the usage.
I am trying to use the Flex Profiler to improve the application's performance (loading time, etc.). I have seen the profiler results for the current design. I want to compare these results with those for a new design on the same set of data. Is there some direct way to do it? I don't know of any way to save the current profiling results and compare them later with the results of a new design.
Otherwise I have to do it manually: write the two sets of results in a notepad and then compare them.
Thanks in advance.
Your stated goal is to improve aspects of the application performance (loading time, etc.). I have similar issues in other languages (C#, C++, C, etc.). I suggest that you focus not so much on the timing measurements that the Flex profiler gives you, but rather use it to extract a small number of samples of the call stack while it is being slow. Don't deal in summaries, but rather examine those stack samples closely. This may bend your mind a little bit, because it will not give you particularly precise time measurements. What it will tell you is which lines of code you need to focus on to get your speedup, and it will give you a very rough idea of how much speedup you can expect. To get the exact amount of speedup, you can time it afterward. (I just use a stopwatch. If I'm getting the load time down from 2 minutes to 10 seconds, timing it is not a high-tech problem.)
(If you are wondering how/why this works: the reason the program is slower than it will eventually be is that it's requesting work, mostly via method calls, that you are going to avoid doing so often. For the fraction of time being spent in those method calls, they are sitting exposed on the stack, where you can easily see them. For example, if there is a line of code that is costing you 60% of the time, and you take 5 stack samples, it will appear on roughly 3 of them, plus or minus 1, regardless of whether it is executed once or a million times. So any such line that shows up on multiple stack samples is a possible target for optimization, and targets for optimization will appear on multiple stack samples if you take enough.
The hard part about this is learning not to be distracted by all the profiling results that are irrelevant. Milliseconds, average or total, for methods, are irrelevant. Invocation counts are irrelevant. "Self time" is irrelevant. The call graph is irrelevant. Some packages worry about recursion - it's irrelevant. CPU-bound vs. I/O bound - irrelevant. What is relevant is the fraction of stack samples that individual lines of code appear on.)
ADDED: If you do this, you'll notice a "magnification effect". Suppose you have two independent performance problems, A and B, where A costs 50% and B costs 25%. If you fix A, total time drops by 50%, so now B takes 50% of the remaining time and is easier to find. On the other hand, if you happen to fix B first, time drops by 25%, so A is magnified to 67%. Any problem you fix makes the others appear bigger, so you can keep going until you just can't squeeze it any more.
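To make the manual stack-sampling idea concrete outside of Flex, here is a rough Java sketch (all names are illustrative): sample a worker thread's stack on a timer and count the fraction of samples on which each frame appears.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative "poor man's sampler": periodically grab the target thread's
 * stack and count how often each frame shows up. Frames that appear on a
 * large fraction of samples are where the time goes.
 */
public class StackSampler {
    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(StackSampler::slowWork, "worker");
        worker.start();

        Map<String, Integer> hits = new HashMap<>();
        int samples = 20;
        for (int i = 0; i < samples; i++) {
            for (StackTraceElement frame : worker.getStackTrace()) {
                hits.merge(frame.getClassName() + "." + frame.getMethodName(), 1, Integer::sum);
            }
            Thread.sleep(100);   // sample roughly ten times a second
        }
        worker.interrupt();

        // Report how many samples each frame appeared on.
        hits.forEach((frame, count) ->
                System.out.printf("%-60s %d/%d samples%n", frame, count, samples));
    }

    /** Stand-in for the slow code being investigated. */
    private static void slowWork() {
        while (!Thread.currentThread().isInterrupted()) {
            Math.sqrt(Math.random());    // busy work
        }
    }
}
```

A frame that shows up on most of the samples is exactly the kind of line the answer says to look at; the precise percentage doesn't matter.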