Batch renaming / moving / hashing of files - encryption

I have a highly structured hierarchical directory containing multiple files that need to be moved into a flat structure and renamed at the same time. The original path and name must be logged along with the new path and name and eventually loaded into a database. Finally, each renamed file must get a unique, unguessable (i.e., encrypted or hashed) file name. When the renamed file is moved into the new directory structure, I also want to limit the number of files in each directory: each directory would be created with a sequential number for its name, and files would be loaded into it until a maximum number of files was reached (e.g., 255) before rolling over into a new directory with the next sequential number as its name.
Is there a tool or piece of software that does this? I did some initial research and nothing came up matching the following criteria:
batch rename & copy into alternative (flatter) structure
hash / encrypt filename and ensure uniqueness
sequentially name folders and limit file count
log each file's original name and path, and new (encrypted) name and path

I have several Bash scripts I have used in the past to migrate hand-made file repositories to hashed repositories to be accessed and managed from a web application (mostly PHP apps). In these repositories filenames are hashed (to avoid collisions with files with the same content/name) and files are distributed evenly (in a deterministic fashion or randomly) to keep files-per-dir count low for performance reasons. The following is one fully-working example:
#!/bin/bash
# Configuration
MAXFILESPERDIR=500              # how many files to store per directory
TARGETROOTDIR="./newrepository" # root of the new repository
RANDOMDISTRIBUTION=1            # 1 = random distribution, anything else = deterministic

if [ -d "$1" ]; then
    LOGFILE=$(basename "$0").$(date +"_%Y%m%d_%H%M").${$}.log
    SQLFILE=$(basename "$0").$(date +"_%Y%m%d_%H%M").${$}.sql
    SOURCEDIR="$1"
    TOTALSOURCEFILES=$(find "$1" -type f | wc -l)
    let "TOTALTARGETDIRS=$TOTALSOURCEFILES / $MAXFILESPERDIR"
    # Avoid a modulo-by-zero below when there are fewer files than MAXFILESPERDIR
    [ "$TOTALTARGETDIRS" -lt 1 ] && TOTALTARGETDIRS=1
    PADLENTARGETDIRS=${#TOTALTARGETDIRS}   # pad length for directory names
    PADLENTARGETFILE=${#TOTALSOURCEFILES}  # pad length for the file counter
    echo "We will create $TOTALTARGETDIRS directories to hold $MAXFILESPERDIR files per directory."
    if [ "$RANDOMDISTRIBUTION" == "1" ]; then
        echo "We will rename and distribute each file randomly."
    else
        echo "We will rename and distribute each file uniformly."
    fi
    echo "Do you want to continue?"
    select choice in yes no; do
        if [ "$choice" == "yes" ]; then
            COUNTER=1
            find "$1" -type f | while IFS= read -r SOURCEFILE; do
                CHECKSUMFILE=$(sha1sum "$SOURCEFILE" | cut -d " " -f 1)         # hash of the content
                CHECKSUMNAME=$(echo "$SOURCEFILE" | sha1sum | cut -d " " -f 1)  # hash of the path (not used below)
                DETERMINISTICNONCE=$(printf "%0${PADLENTARGETFILE}d" $COUNTER)  # zero-padded counter
                # Pick the target directory number: random, or derived from the counter
                if [ "$RANDOMDISTRIBUTION" == "1" ]; then
                    PROBABILISTICNONCE=$(let "XX=$RANDOM % $TOTALTARGETDIRS + 1"; printf "%0${PADLENTARGETDIRS}d" $XX)
                else
                    PROBABILISTICNONCE=$(let "XX=$COUNTER % $TOTALTARGETDIRS + 1"; printf "%0${PADLENTARGETDIRS}d" $XX)
                fi
                FILEDATE=$(stat -c %z "$SOURCEFILE" | cut -d "." -f 1)
                FILESIZE=$(stat -c %s "$SOURCEFILE")
                echo "Source file $SOURCEFILE" >> "$LOGFILE"
                echo "Target file $TARGETROOTDIR/$PROBABILISTICNONCE/$PROBABILISTICNONCE$CHECKSUMFILE$DETERMINISTICNONCE" >> "$LOGFILE"
                echo "INSERT INTO files (Filename, Location, Checksum, CDate, Size) VALUES ('$PROBABILISTICNONCE$CHECKSUMFILE$DETERMINISTICNONCE', '$PROBABILISTICNONCE', '$CHECKSUMFILE', '$FILEDATE', $FILESIZE);" >> "$SQLFILE"
                mkdir -p "$TARGETROOTDIR/$PROBABILISTICNONCE"
                cp -v "$SOURCEFILE" "$TARGETROOTDIR/$PROBABILISTICNONCE/$PROBABILISTICNONCE$CHECKSUMFILE$DETERMINISTICNONCE"
                let "COUNTER+=1"
            done
            echo "Done."
            echo
            break
        fi
        if [ "$choice" == "no" ]; then
            echo
            echo "Operation cancelled"
            echo
            break
        fi
    done
else
    echo
    echo "Missing source directory"
    echo
fi
Just run it from the root of your new repository. You can configure it by modifying the first few variables: MAXFILESPERDIR defines how many files to store per directory, TARGETROOTDIR is the name of the root directory to create (the script uses only two levels: this single root plus the numbered directories inside it), and RANDOMDISTRIBUTION defines whether the files will be distributed randomly (it may look uneven, especially for small runs) or deterministically (by simple counting).
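For example, a hypothetical invocation (migrate.sh and the directory names are placeholders):
cd /path/to/workdir          # the new repository will be created here, as ./newrepository
./migrate.sh ./oldrepository # source tree to migrate; the script asks for confirmation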
How it works (FYI, just in case this is not what you are looking for but maybe you can get some ideas):
Count the source files.
Calculate how many target directories will be created.
Ask for confirmation.
For each file:
Calculate the SHA1 hash for the file content.
Create a deterministic nonce.
Create a probabilistic nonce: a random directory number if RANDOMDISTRIBUTION is 1, otherwise one derived from the counter.
Get the file's size and date.
Combine the directory number with the hash and the counter to get the new file name (the directory number alone becomes the path).
Log the source and target full paths.
Create and log a SQL insert query.
Create the target directory (if it does not exist).
Copy the file. (You can move it if you want but I'm playing safe).
Finish
If you set RANDOMDISTRIBUTION to 1 and run the script several times, you'll get duplicates of your source files, as each file will get a different target filename/path each time you run it. If RANDOMDISTRIBUTION is set to something else, every time you run the script the files will be renamed the same way (for the same file set; if you add or remove files, they will get different names/paths).
The objective of using a random value + hash + counter is to be sure we can handle duplicates (won't collide thanks to the counter) while still distributing the files randomly (for long enough runs, this will distribute the files evenly).
Also, the prefix of the generated file name is the name of the directory, so if you have the file name and the length of the directory names, you can calculate the directory name (just in case you don't store that in your database table).
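As a minimal sketch of that recovery in bash (the stored name and pad length below are hypothetical):
NEWFILENAME="04f572e81f72b1a12c8cd1a5a063feae75930300210007"  # hypothetical stored file name
PADLENTARGETDIRS=2                                            # known length of the directory names
DIRNAME=${NEWFILENAME:0:$PADLENTARGETDIRS}                    # bash substring -> "04"
echo "$DIRNAME/$NEWFILENAME"                                  # path relative to the repository root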
Finally, this is a one-time migration script, it was not really written to be executed regularly over the same set of files.

Related

Loop through folders inside the zip file using unix shell script

My zip file contains several folders. After unzipping it, I want to loop over the extracted folders.
The condition inside the loop is as follows:
Only if a folder contains an index file (a file containing some data) do I want to run some process (I know what this process is). Otherwise that folder can be ignored.
The loop should then continue with the other folders, if there are any.
Thanks in advance.
something like this?
(note: I assume $destdir will only contain the zipfile and its extraction!)
zipfile="/path/to/the/zipfile.zip"
destdir="/path/to/where/you/want/to/unzip"
indexfile="index.txt"  # name of the index files
mkdir -p "$destdir" 2>/dev/null  # make "sure" it exists, ignoring errors in case it already does
cd "$destdir" || { echo "Can not go into destdir=$destdir" ; exit 1 ; }
# At that point we are inside $destdir: we can start to work.
unzip "$zipfile"
for i in ./*/ ; do  # you could change ./*/ to ./*/*/ if the zip contains a master directory too
    cd "$i" && {  # the && is important: you want to be sure you could enter that subdir!
        if [ -e ./"$indexfile" ]; then
            dosomething  # you can define the function dosomething and use it here...
            # ...or just place commands here
        fi
        cd -  # back to $destdir; we can safely assume this works, as we started from there
    }
done
Note: I iterate over ./*/ instead of */ because a directory name could have a leading -, which would make cd -something fail (it would complain that it can't recognise some options). Prefixing ./ makes that problem go away: cd ./-something will work.
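For completeness, a minimal placeholder for dosomething, assuming you substitute your actual processing:
dosomething() {
    # at this point we are inside a folder that contains "$indexfile";
    # replace the echo with the real work
    echo "processing $(pwd), found $indexfile"
}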

Powershell Script to rename TXT file based on certain criteria

I realise it is likely that each of the requirements listed below is available individually within the forum here, but I am struggling to bring it all together (if at all possible!).
I am hoping someone has the patience and time to point me in the right direction to make this happen.
What I need to do is the following:
Scan a directory (and all sub-directories) for a particular filename
NOTE: Whilst there are many files within the sub-directories with the filename in question, we only wish to target those in a sub-directory with a suffix of JERRY
ie. In the below example the files indicated by the arrow would be targeted
ONE\NEW1-JERRY\FILENAME.TXT <----
ONE\NEW1-TOM\FILENAME.TXT
ONE\NEW1-SYLVESTER\FILENAME.TXT
TWO\NEW2-JERRY\FILENAME.TXT <----
TWO\NEW2-TOM\FILENAME.TXT
TWO\NEW2-SYLVESTER\FILENAME.TXT
THREE\NEW3-JERRY\FILENAME.TXT <----
THREE\NEW3-TOM\FILENAME.TXT
THREE\NEW3-SYLVESTER\FILENAME.TXT
FOUR\NEW4-JERRY\FILENAME.TXT <----
FOUR\NEW4-TOM\FILENAME.TXT
FOUR\NEW4-SYLVESTER\FILENAME.TXT
When a file is found matching the filename and it is within a sub-directory as listed above, take a copy of the file (to remain in the same directory) and rename it based on the following criteria:
a) Created date/time
b) Certain content within the file
The content in the file is always located on ROW 8 and it is the first 9 characters
Original filename: FILENAME.txt
Finished Product: FILENAME-20121129#1300-123456789.txt
Thanks in advance!
You should tell us what you have tried so far and what your main problems are...
try this :
(remove the -whatif flag to actually copy files)
#list dir & subdirs
ls c:\ -Recurse |
Foreach {
    #find subdirs whose name contains JERRY
    if($_.PSIsContainer -and $_.Name -match "JERRY"){
        ls $_.FullName -Filter 'FILENAME*' |
        Foreach{
            $fcontent = Get-Content $_.FullName
            #first 9 characters of row 8 (index 7)
            $content = $fcontent[7].Substring(0,9)
            #created date/time of the file, e.g. 20121129#1300
            $created = Get-Date $_.CreationTime -UFormat "%Y%m%d#%H%M"
            $destination = "$($_.Directory)\$($_.BaseName)-$created-$content$($_.Extension)"
            Write-Verbose $destination
            Copy-Item $_.FullName -Destination $destination -WhatIf
        }
    }
}

DOS extract directory from find command

I am writing a DOS script, and I want to get the path of a known file so that the path can be used within the script to change to that directory, use the file specified, and output log files to the same directory.
The script adds some directories to the path and changes to the required directory to execute a command using input files in the same directory. The command generates a number of files that are saved to the same directory.
Here is what I have so far:
@ECHO OFF
:: Check argument count
set argC=0
for %%x in (%*) do Set /A argC+=1
IF %argC% LSS 3 (echo WARNING: must include the parent directory, the filename and the timestep for the simulation)
:: Assign meaningful names to the input arguments
set parentDirectory=%1
set filename=%2
set scenarioTimestep=%3
:: Check validity of the input arguments
:: TODO: implement check for directory, filename exists, and possibly limits to the timestep
IF "%parentDirectory%"=="" (
set parentDirectory=P:\Parent\Directory
)
IF "%filename%"=="" (
set filename=ship2.xmf
)
IF "%scenarioTimestep%"=="" (
set scenarioTimestep=0.1
)
echo parent Directory: %parentDirectory%
echo filename: %filename%
echo timestep: %scenarioTimestep%
set MSTNFYURI=file:mst.log
set MSTNFYLEVEL=debug
set MSTNFYFLUSH=1
set XSFNFYURI=file:xsf.log
set XSFNFYLEVEL=debug
set XSFNFYFLUSH=1
set parentNFYURI=file:parent.log
set parentNFYLEVEL=debug
set parentNFYFLUSH=1
:: Add the parent directories to the path
set PATH=%parentDirectory%\bin\;%parentDirectory%\bin\ext\;%parentDirectory%\bin\gtkmm\;%parentDirectory%\bin\osg\;%PATH%
:: Change to the target directory
set targetDirectory=%parentDirectory%\examples\testing_inputs
cd %targetDirectory%
echo command will be: ft -c %filename% -T %scenarioTimestep%
::ft -c %filename% -T %scenarioTimestep%
@ECHO ON
What I want to be able to do, instead of using the hard-coded directory path examples\testing_inputs for targetDirectory, is to search for the filename supplied and change directory to that path.
I know I can get the information displayed using
"dir filename.ext /s"
DOS output:
Volume in drive C is OS
Volume Serial Number is XXXX-XXXX
Directory of C:\Users\Me\parent\examples\testing_input
15/11/2012 02:51 PM <size> filename
...
...
How do I extract the directory from this info to be used within the script? Also, if there is more than one file of the same name, how can I select the path based on the timestamp of the file?
:: "delims=" keeps paths containing spaces intact; the last match wins
for /f "delims=" %%F in ('dir /B /S /A:-D filename.ext') do set "file_path=%%F"
:: temporarily change to the file's parent directory to capture its full path
pushd "%file_path%\.."
set "dir_path=%CD%"
popd
echo %file_path%
echo %dir_path%
Is this what you are looking for?
EDIT: Check dbenham's comment.

Unix cp command destination = . (dot)?

What does . (dot) mean as the destination of the cp command?
For example:
cp ~dir1/dir2/dir3/executableFile.x .
When this executes it copies the file successfully with the correct file name, but I'm wondering is this what a destination of '.' will always do or is there another purpose?
Within the reference material I've seen, dots are used in front of files to indicate 'hidden', but that has no relation to the command above.
dot represents the current directory
while dotdot is the parent directory.
As EvilTeach's answer says, . is the current directory, and .. is the parent directory.
There are basically two ways to use the cp command:
cp file1 file2
will copy file1 to file2, creating file2 if it doesn't exist or (depending on permissions) possibly clobbering it if it does.
The other way is:
cp file1 file2 ... dir
where dir is an existing directory. With this form, you can specify one or more files, and they'll all be copied into the specified directory dir with their existing names.
(This can be a pitfall sometimes; cp foo bar behaves very differently depending on whether there's an existing directory named bar.)
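A quick illustration of that difference (the file and directory names are hypothetical):
touch foo
cp foo bar   # no "bar" exists yet: creates a regular file named bar
mkdir baz
cp foo baz   # "baz" is an existing directory: creates baz/foo inside it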
As you mention, files (including directories) whose names start with . are hidden. What this means is that (a) the ls command won't list them (unless you use the -a or -A option), and (b) a shell wildcard such as * or *.txt will omit them. (GUI directory managers such as Nautilus may also omit them, depending on your settings.)
This applies to the current directory . and the parent directory .. as well: ls won't include the . and .. entries in its output; ls -a will.
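For example, in an otherwise empty directory:
touch .hidden visible
ls      # prints: visible
ls -a   # prints: .  ..  .hidden  visible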

Why did my use of the read command not do what I expected?

I wreaked some havoc on my computer when I played with the commands suggested by vezult [1]. I expected the one-liner to ask for the file names to be removed. However, it immediately removed the files in a folder:
> find ./ -type f | while read x; do rm "$x"; done
I expected it to wait for my input on stdin [2]. I cannot understand its behaviour. How does the read command work, and where do you use it?
What happened there is that read reads from stdin. When you put it at the end of a pipe, it reads from that pipe.
So the output of your find becomes
file1
file2
and so on; read reads that and replaces x successively with file1 then file2, and so your loop becomes
rm "file1"
rm "file2"
and sure enough, that rm's every file starting at the current directory ".".
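You can verify this safely by substituting echo for rm, which prints each command instead of running it:
find . -type f | while read x; do echo rm "$x"; done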
A couple of hints.
You didn't need the "/".
It's better and safer to say
find . -type f
because should you happen to type ". /" (i.e., dot SPACE slash), find will start at the current directory and then go looking from the root directory. That trick, given the right privileges, would delete every file on the computer. "." is already the name of a directory; you don't need to add the slash.
The find or rm commands will do this
It sounds like what you wanted to do was go through all the files in all the directories starting at the current directory ".", and have it ASK if you want to delete it. You could do that with
find . -type f -exec rm -i {} \;
or
find . -type f -ok rm {} \;
and not need a loop at all. You can also do
rm -r -i *
and get nearly the same effect, except that it will try to delete directories too. If the directory is empty, that'll even work.
Another thought
Come to think of it, unless you have a LOT of files, you could also do
rm -i `find . -type f`
Now the find in backquotes will become a bunch of file names on the command line, and the '-i' interactive flag on rm will ask the yes or no question.
Charlie Martin gives you a good dissection and explanation of what went wrong with your specific example, but doesn't address the general question of:
When should you use the read command?
The answer to that is - when you want to read successive lines from some file (quite possibly the standard output of some previous sequence of commands in a pipeline), possibly splitting the lines into several separate variables. The splitting is done using the current value of '$IFS', which normally means on blanks and tabs (newlines don't count in this context; they separate lines). If there are multiple variables in the read command, then the first word goes into the first variable, the second into the second, ..., and the residue of the line into the last variable. If there's only one variable, the whole line goes into that variable.
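A minimal illustration of that splitting (the variable names are arbitrary):
echo "alpha beta gamma delta" | while read first second rest
do
    echo "first=$first second=$second rest=$rest"
done
# prints: first=alpha second=beta rest=gamma delta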
There are many uses. This is one of the simpler scripts I have that uses the split option:
#!/bin/ksh
#
# #(#)$Id: mkdbs.sh,v 1.4 2008/10/12 02:41:42 jleffler Exp $
#
# Create basic set of databases
MKDUAL=$HOME/bin/mkdual.sql
ELEMENTS=$HOME/src/sqltools/SQL/elements.sql
cat <<! |
mode_ansi with log mode ansi
logged with buffered log
unlogged
stores with buffered log
!
while read dbs logging
do
    if [ "$dbs" = "unlogged" ]
    then bw=""; cw=""
    else bw="-ebegin"; cw="-ecommit"
    fi
    sqlcmd -xe "create database $dbs $logging" \
        $bw -e "grant resource to public" -f $MKDUAL -f $ELEMENTS $cw
done
The cat command with a here-document has its output sent to a pipe, so the output goes into the while read dbs logging loop. The first word goes into $dbs and is the name of the (Informix) database I want to create. The remainder of the line is placed into $logging. The body of the loop deals with unlogged databases (where begin and commit do not work), then runs a program sqlcmd (completely separate from the Microsoft newcomer of the same name; it's been around since about 1990) to create a database and populate it with some standard tables and data - a simulation of the Oracle 'dual' table, and a set of tables related to the 'table of elements'.
Other scripts that use the read command are bigger (by far), but generally read lines containing one or more file names and some other attributes of relevance, and then apply an appropriate transform to the files using the attributes.
Osiris JL: file * | grep 'sh.*script' | sed 's/:.*//' | xargs wgrep read
esqlcver:read version letter
jlss: while read directory
jlss: read x || exit
jlss: read x || exit
jlss: while read file type link owner group perms
jlss: read x || exit
jlss: while read file type link owner group perms
kb: while read size name
mkbod: while read directory
mkbod:while read dist comp
mkdbs:while read dbs logging
mkmsd:while read msdfile master
mknmd:while read gfile sfile version notes
publictimestamp:while read name type title
publictimestamp:while read name type title
Osiris JL:
'Osiris JL: ' is my command-line prompt; I ran this in my 'bin' directory. 'wgrep' is a variant of grep that only matches entire words (to avoid matching 'read' inside words like 'already'). This gives some indication of how I've used it.
The 'read x || exit' lines are for an interactive script that reads a response from standard input, but exits if the command gets EOF (for example, if standard input comes from /dev/null).
