One-to-one correspondence to files in Unix (log files)

I am writing a Log Unifier program. That is, I have a system that produces logs:
my.log, my.log.1, my.log.2, my.log.3...
On each iteration I want to store the number of lines I have read from each file, so that on the next iteration I can continue reading from that point.
The problem is that when the files are full, they roll:
The last log is deleted
...
my.log.2 becomes my.log.3
my.log.1 becomes my.log.2
my.log becomes my.log.1
and a new my.log is created
I can of course keep track of them using inodes, which are almost a one-to-one correspondence to files.
I say "almost", because I fear of the following scenario:
Between two of my iterations - some files are deleted (let's say the logging is very fast), and are then new files are created and some have inodes of files just deleted. The problem is now - that I will mistake these files as old files - and start reading from line 500 (for example) instead of 0.
So I am hoping to find a way to solve this. Here are a few directions I have thought about that may help you help me:
Another one-to-one correspondence to files, other than inodes.
An ability to mark a file. I thought about using chmod +x to mark a file as an existing file; new files that lack that permission would be known to be new. But if somebody were to change the permissions manually, that would confuse my program, so any other way to mark a file would help.
I thought about creating soft links to the files that would be deleted when the files are deleted; that would let me know which files got deleted.
Any way to get the "creation date"
Any idea that comes to mind, maybe using timestamps (atime, ctime, mtime) in some clever way, would be good, as long as it lets me know which files are new, or any idea that creates a one-to-one correspondence to files.
Thank you

I can think of a few alternatives:
Use POSIX extended attributes to store metadata about each log file that your program can use for its operation.
It should be a safe assumption that the contents of old log files are not modified after being archived, i.e. after my.log becomes my.log.1. You could generate a hash for each file (e.g. SHA-256) to uniquely identify it.
All decent log formats embed a timestamp in each entry. You could use the timestamp of the first entry - or even the whole entry itself - in the file for identification purposes. Log files are usually rolled on a periodic basis, which would ensure a different starting timestamp for each file.
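A minimal sketch of how the first two alternatives could look, assuming Linux and Python; the extended-attribute name and helper names are made up for illustration, not taken from the question:

import hashlib
import os

XATTR_LINES = "user.logunifier.lines"  # hypothetical attribute; the "user." namespace is required

def save_progress(path, lines_read):
    # Attach the number of lines already consumed to the file itself
    # (needs a filesystem with user extended attributes, e.g. ext4/xfs).
    os.setxattr(path, XATTR_LINES, str(lines_read).encode())

def load_progress(path):
    # A brand-new file carries no attribute, so reading starts at line 0.
    try:
        return int(os.getxattr(path, XATTR_LINES))
    except OSError:
        return 0

def fingerprint(path, head_bytes=4096):
    # Hash the beginning of the file; a recycled inode whose content is
    # different produces a different fingerprint, so it cannot be
    # mistaken for an already-read log.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(head_bytes)).hexdigest()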

Related

DataFactory copies files multiple times when using wildcard

Hi all, complete ADF newbie here. I have a strange issue with DataFactory and surprisingly can't see that anyone else has experienced this same issue.
To summarize:
I have set up a basic copy activity from blob to an Azure SQL database with no transformation steps
I have set up a trigger based on a wildcard name, i.e. any files loaded to blob that start with IDT* will be copied to the database
I have loaded a few files to a specific location in Azure Blob
The trigger is activated
Just when it looks like it all works, a quick check of the record count shows that the same files have been imported X number of times
I have analysed what is happening: basically, when I load my files to blob, they don't technically arrive at exactly the same time. So when file 1 hits the blob, the wildcard search is triggered and it finds 1 file. Then, when the 2nd file hits the blob some milliseconds later, the wildcard search is triggered again and this time it processes 2 files (the first and the second).
The problem keeps compounding based on the number of files loaded.
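To see why the counts compound, here is a tiny simulation of the behaviour described above (plain Python, nothing ADF-specific; the file names are made up): each arriving file fires one run, and each run copies everything the wildcard currently matches, so with n files the first file is copied n times and the total number of copies is n(n+1)/2.

# Each arriving file fires one trigger run, and each run copies every
# blob currently matching the wildcard.
files = [f"IDT_{i}.csv" for i in range(1, 5)]  # hypothetical file names

copies = {}   # file name -> how many times it ends up being copied
landed = []   # files already sitting in the container
for f in files:
    landed.append(f)          # file f arrives...
    for g in landed:          # ...and the run it triggers copies every match
        copies[g] = copies.get(g, 0) + 1

print(copies)  # {'IDT_1.csv': 4, 'IDT_2.csv': 3, 'IDT_3.csv': 2, 'IDT_4.csv': 1}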
I have tried multiple things to get this fixed to no avail, because fundamentally it is behaving "correctly".
I have tried:
Deleting the file once it has been processed, but again due to the millisecond issue the file is technically still there and can still be processed
Adding a loop to process 1 file at a time, then deleting the file (based on its name in the blob) before the next is loaded, but that hasn't worked (and I can't explain why)
Limiting ADF to only 1 concurrent connection; this reduces the number of duplicates, but unfortunately it still duplicates
Putting a wait timer at the start of the copy activity, but this causes a resource-locking issue; I get an error saying that multiple waits are causing the process to fail
A combination of 1, 2 and 3, and I end up with an entirely different issue: it is trying to find file X, which no longer exists because it was deleted as part of step 2 above
I am really struggling with something that seems extremely basic, so I am sure it is me overlooking something extremely fundamental, as no one else seems to have this issue with ADF.

The best way to store a big number of files while allowing quick search

In my Windows 8/RT app I use a SQLite database (sqlite-net) which is stored in Isolated Storage. In the database I have a lot of data, including links to files (images, PDFs and others). I get those links from a web server. When I get a link, I want to download the file and store it locally.
My question is: what is the best way to store a big number of files (100+)? One important thing: I need to be able to find the desired file quickly.
I have three ideas:
Create another database only for files (I can't modify the existing one)
Create a folder in Isolated Storage and store the files there directly.
Create a list of files and store it in Isolated Storage.
Which would be better/faster? Or does somebody have another great solution?
100 files isn't such a big number; you can easily store up to 100k files (or folders) in a single (NTFS) directory.
If you receive the files from a webserver then the question is whether the source makes sure there are no duplicate filenames. If this can't be assured, I'd recommend having a database table mapping from original filename and metadata to its hash (SHA256 or similar) and store the file with a filename corresponding to its hash.
Then, when using the file, you can pass it to the user under its original filename using the StorageFile API.
Going beyond 100k files, you could create a subfolder structure from the first two letters of the hash.
Either way, storing the file metadata in a database and the files in a directory has been the most useful approach for us in the past.
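A rough sketch of that hash-named layout, using plain Python file operations rather than the WinRT StorageFile API (the folder and function names are illustrative):

import hashlib
import os
import shutil

STORE_ROOT = "filestore"   # hypothetical local folder

def store_file(src_path):
    # Name the stored copy after its SHA-256 hash so duplicates collapse
    # into one physical file, and return the hash so it can be recorded in
    # the database next to the original filename and any metadata.
    h = hashlib.sha256()
    with open(src_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    digest = h.hexdigest()
    # A two-letter subfolder keeps directories small once you go past ~100k files.
    dest_dir = os.path.join(STORE_ROOT, digest[:2])
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, digest)
    if not os.path.exists(dest):
        shutil.copyfile(src_path, dest)
    return digest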
100 files with average size of 1MB is only 100MB.
Most people say that storing binary files in a database is wrong and suggest storing the files separately and only keeping the file names in the database, but I think it is fine provided you know what you are doing and why.
A big advantage of storing files in the database is that you keep the files together with their properties logically in one place. Also, you can simply copy one file and that backs up everything.
A database also gives you transaction support. You may have some problems reading and writing BLOBs, but it is not very difficult.
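For completeness, a small sketch of the BLOB route, using Python's built-in sqlite3 module rather than sqlite-net (the table and column names are made up):

import sqlite3

conn = sqlite3.connect("files.db")
conn.execute("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB)")

def save_blob(name, path):
    with open(path, "rb") as f:
        data = f.read()
    # The insert runs inside a transaction, so a crash cannot leave half a file.
    with conn:
        conn.execute("INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)",
                     (name, data))

def load_blob(name):
    row = conn.execute("SELECT data FROM files WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None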

Efficiency of reading file names of a directory in ASP.NET

How efficient is reading the names of files in a directory in ASP.NET?
Background: I want to update pictures on a web server automatically and deploy them in advance. E.g. until the 1st of April I want to pick 'image1.png'; after the 1st of April, 'image2.png'. To achieve this I have to map every image name to a date which indicates whether this image has to be picked or not.
In order to avoid mapping between file name and date in a separate file or database, the idea is to put the date in the file name. Iterating over the directory and parsing the dates lets me find my file.
E.g.:
image_2013-01-01.png
image_2013-04-31.png
The second one will be picked from May until eternity, unless an image with a later date is dropped in.
So I wonder how this solution impacts the speed of a website assuming <20 files.
If you are using something like Directory.GetFiles, that is one call to the OS.
This will access the disk to get the listing.
For fewer than 20 files this will be very quick. However, since this data is unlikely to change very often, consider caching the name of your image.
You could store it in the application context to share it among all users of your site.
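A sketch of the selection logic, shown in Python since the idea itself is language-neutral (the folder name and filename pattern are assumptions; the rule used is "pick the file whose embedded date is the latest one not in the future"):

import datetime
import os
import re

IMAGE_DIR = "images"  # hypothetical folder holding the dated images
PATTERN = re.compile(r"image_(\d{4}-\d{2}-\d{2})\.png$")

def current_image(today=None):
    # Pick the image whose embedded date is the latest one not in the future.
    today = today or datetime.date.today()
    best_name, best_date = None, None
    for name in os.listdir(IMAGE_DIR):     # one directory listing, ~20 entries
        m = PATTERN.match(name)
        if not m:
            continue
        try:
            d = datetime.date.fromisoformat(m.group(1))
        except ValueError:
            continue                       # skip names with impossible dates
        if d <= today and (best_date is None or d > best_date):
            best_name, best_date = name, d
    return best_name

Caching the chosen name once per day, as suggested above, means the directory is only listed once a day rather than on every request.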

In cvs diff I want to see the author name and date of files that differ

How can I get the author name and date of the files that differ when I do a cvs diff?
I'm using WinCVS 2.1.1.1 (build 1) on Windows. I can also do it on a Unix server through the command line.
Thanks
You can't, unless this information appears in the source (which it shouldn't, unless you want a conflict at every merge). You need to use other cvs commands for this; cvs log comes to mind, but it's been a while. You could always write a script or batch file which displays the changes' log entries and diffs, though.

In reverse-history-search in zsh, is there a maximum # of items it will go back?

I use control-r interactive reverse history search in zsh all day long. Just now I tried to find something I haven't used in quite some time, and it wasn't there. But when I grepped my .zsh_history file, it was there.
Is there a maximum # of history entries reverse search will search? Can I configure this?
You can configure (in your .zshrc) how long the per-session history and the saved history are:
HISTSIZE=5000 # session history size
SAVEHIST=1000 # saved history
HISTFILE=~/.zshistory # history file
That is what zsh searches; I am not aware of any separate limit for reverse-history-search.
You might want to check the zshparam(1) man page.
In-memory history may not contain entries that are in the history file (if the file size limit is larger than the in-memory limit) and the history file may not contain entries that are in memory (until they're written - either explicitly or when the shell exits - subject to size limits set by environment variables). There is no other limit on reverse-history-search.
This is true in Bash and I believe the Z shell is similar.
