What is the most efficient way to get a "ls"-like output of the most recently created files in a very large unix file system (100 thousand files +)?
Have tried ls -a and some other variants.
You can also use less to search and scroll it easily.
ls -la | less
If I'm understanding your question correctly, try
ls -a | tail
If the files are in a single directory, then you can use:
ls -lt | less
The -t option to ls sorts the files by modification time (newest first), and less lets you scroll through them.
If you want recent files across an entire file system, i.e., in different directories, then you can use the find command:
find dir -mtime -1 -print | xargs ls -ld
Substitute the directory where you want to start the search for "dir". The find command will print the names of everything that has been modified in the last day (-mtime -1 means modified less than 24 hours ago; -mtime 1 would only match things modified between 24 and 48 hours ago), and the xargs command will take that list of names and feed it to ls, giving you the ls-like output you want.
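If you also want that list ordered newest-first, here is a sketch (untested; it assumes GNU find for -printf and file names without embedded newlines):
find dir -type f -mtime -1 -printf '%T@ %p\n' | sort -nr | head -n 20 | cut -d' ' -f2-
Adjust head to taste; sorting on the numeric timestamp avoids relying on ls to order a huge argument list.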
I have a directory containing over a thousand subdirectories. I only want to 'ls' the directories that contain more than 2 files; I don't need the ones that contain 2 or fewer. This is in C-shell, not bash. Does anyone know of a good command for this?
I tried the command below, but it's not giving the desired output. I simply want the full list of directories with more than 2 files. One reason it isn't working is that it also descends into subdirectories of those directories to check whether they have more than 2 files. I don't want a recursive search, just a list of the first-level directories in the main directory they are in.
$ find . -type f -printf '%h\n' | sort | uniq -c | awk '$1 > 2'
My mistake, I was thinking bash rather than csh. Although I don't have a csh to test with, I think this is the csh syntax for the same thing:
foreach d (*)
if (-d "$d" && `ls -1 "$d" | wc -l` > 2) echo "$d"
end
I've added a guard so that non-directories aren't unnecessarily processed, and I've included double-quotes in case there are any "funny" file or directory names (containing spaces, for example).
One possible problem (I don't know what your exact task is): any immediate subdirectories will also count as files.
Sorry, I was working in bash here:
for d in *; do if [ "$(ls -1 "$d" | wc -l)" -gt 2 ]; then echo "$d"; fi; done
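If the recursion is the worry, here is a sketch in plain sh/bash (again not csh, untested, and it assumes a find that supports -maxdepth) that counts only the regular files directly inside each first-level directory:
find . -mindepth 1 -maxdepth 1 -type d | while read -r d; do
    [ "$(find "$d" -mindepth 1 -maxdepth 1 -type f | wc -l)" -gt 2 ] && echo "$d"
done
Because of -type f, immediate subdirectories are no longer counted as files either.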
For a faster solution, you could try "cheating" by deconstructing the on-disk format of the directories themselves, if you're on a classic Unix where directories can be read as ordinary files and their contents analyzed. Needless to say that is NOT PORTABLE (to, e.g., any bash running on Windows), so it's not recommended.
I am currently working on a script, to store/backup our old files, so that we have more space on our server. This script will be used as a cronjob to backup the stuff every week. My script currently looks like this:
#!/bin/bash
currentDate=$(date '+%Y%m%d%T' | sed -e 's/://g')
find /Directory1/ -type f -mtime +90 | xargs tar cvf - | gzip > /Directory2/Backup$currentDate.tar.gz
find /Directory1/ -type f -mtime +90 -exec rm {} \;
The script first saves the current date + timestamp (without ":") in a variable. Afterwards it searches for files older than 90 days, tars them, and finally gzips the result into a file named "Backup$currentDate.tar.gz".
Then it's supposed to find the files again and remove them.
I do however have some issues here:
Directory1 consists of multiple directories. It does find the files and creates the gz file, but while some files are zipped properly (for instance /DirName1/DirName2/DirName3/File), others appear directly in the "root" dir. What could be the issue here?
Is there a way to tell the script to only create the gz file if files are found? Because currently we get gz files even if nothing was found, leaving empty archives behind.
Can I somehow store the find output (in a variable?) so that the remove at the end really only targets the files found in the step before? Because if the third step took, let's say, an hour, and the last step only ran after it finished, it could remove files that weren't older than 90 days before but are now, so they would never be backed up but would still be deleted (highly unlikely, but not impossible).
If there's anything else you need to know, feel free to ask ^^
Best regards
I've "rephrased" your original code a bit. I don't have an AIX machine to test anything, so DO NOT cut and paste this. Using this code, you should be able to address your issues. To wit:
It makes a record of what files it intends to operate on ($BFILES).
This record can be used to check for empty tar files.
This record can be used to see why your find is producing "funny" output. It wouldn't surprise me to find that xargs hit a space character.
This record can be used to delete exactly the files archived.
As a child, I had a serious accident with xargs and have avoided it ever since. Maybe there is a safe version out there.
#!/bin/bash
# I don't have an AIX machine to test this, so exit immediately until
# someone can proof this code.
exit 1
currentDate=$(date '+%Y%m%d%T' | sed -e 's/://g')
BFILES=/tmp/Backup$currentDate.files
find /Directory1 -type f -mtime +90 -print > $BFILES
# Here is the time to proofread the file list, $BFILES
# The AIX page I read lists the '-L' option to take filenames from an
# input file.
#tar -c -v -L $BFILES -f - | gzip -9 > /Directory2/Backup$currentDate.tar.gz
# I've found xargs to be sketchy unless you are very careful about
# quoting. I would rather loop over the input file one well quoted
# line at a time rather than use the faster, less safe xargs. But
# here it is.
#xargs rm < $BFILES
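To illustrate points 2 and 4 above, here is a sketch of the empty-list guard and the per-line delete loop described in the comments (bash assumed, and GNU tar's -T option assumed for reading names from a file; like the code above, it is untested on AIX):
if [ -s "$BFILES" ]; then
    # Archive only if the list is non-empty; -T reads one filename per line.
    tar -cvf - -T "$BFILES" | gzip -9 > /Directory2/Backup$currentDate.tar.gz
    # Delete exactly the files that were recorded (and archived) above.
    while IFS= read -r f; do
        rm -- "$f"
    done < "$BFILES"
fi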
I have a directory full of files with names such as:
file_name_is_001
file_name_001
file_name_is_002
file_name_002
file_name_is_003
file_name_003
I want to copy only the files that don't contain 'is'. I'm not sure how to do this. I have tried to search for it, but can't seem to google the right phrase to find the results.
Details depend on operating system, shell, etc.
For a unix system a quite verbose but easy to understand approach could look like this (please mind that I didn't test it):
mkdir some_temporary_directory
mv *_is_* some_temporary_directory
cp * where_ever_you_want_to_copy_it
mv some_temporary_directory/* .
rmdir some_temporary_directory
You can do this using bash. First, here's a command to get you a list of files that don't contain the text _is_:
ls | grep -v "_is_"
This takes the output of ls and, using grep -v, matches all the names that DO NOT contain _is_.
In order to then copy these files, we need to turn the lines output by grep into arguments of cp. We can do this using xargs:
ls | grep -v "_is_" | xargs -J % cp % new_folder
From the xargs man page, it is a tool to "build and execute command lines from standard input".
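Note that -J is the BSD xargs spelling; GNU xargs uses -I instead. A sketch that sidesteps parsing ls output altogether (assuming the files sit in the current directory and new_folder already exists) is to let find do the filtering:
find . -maxdepth 1 -type f ! -name '*_is_*' -exec cp {} new_folder/ \;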
I'm very new to Unix, and currently taking a class learning the basics of the system and its commands.
I'm looking for a single command line to list off all of the user home directories in alphabetical order from the /etc/passwd directory. This applies only to the home directories, and not the contents within them. There should be no duplicate entries. I've tried many permutations of commands such as the following:
sort -d | find /etc/passwd /home/* -type -d | uniq | less
I've tried using -path, -name, removing -type, using -prune, and changing the search pattern to things like /home/*/$, but haven't gotten good results once. At best I can get a list of my own directory (complete with every directory inside it, which is bad), and the directories of the other students on the server (without the contained directories, which is good). I just can't get it to display the /home/user directories and nothing else for my own account.
Many thanks in advance.
/etc/passwd is a file. The home directory is usually in field/column 6, where ":" is the delimiter. When you are dealing with a file format that has a distinct delimiter character, you should use a tool that can break the data down into fields for easier manipulation. awk, cut, etc. can do the job; even the shell, with the IFS variable set, works. For example:
awk -F":" '{print $6}' /etc/passwd | sort
cut -d":" -f6 /etc/passwd |sort
Using the shell to read the file:
while IFS=":" read -r a b c d e home_dir g
do
    echo "$home_dir"
done < /etc/passwd | sort
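Since the question also asks for no duplicate entries, sort's -u option deduplicates in the same pass, e.g. with the cut variant above:
cut -d: -f6 /etc/passwd | sort -u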
I think the tools you want are grep, tr and awk. Grep will give you lines from the file that actually contain home directories. tr will let you break up the delimiter into spaces, which makes each line easier to parse.
Awk is just one program that would help you display the results that you want.
Good luck :)
Another hint: try ls --color=auto /etc; passwd isn't the kind of file that you think it is. Directories show up in blue.
In Unix, find is a command for finding files under one or more directories. I think you are looking for a command for finding lines within a file that match a pattern? Look into the command grep.
sed 's|\(.[^:]*\):\(.[^:]*\):\(.*\):\(.[^:]*\):\(.[^:]*\)|\4|' /etc/passwd|sort
I think all this processing could be avoided. There is a utility to list directory contents.
ls -1 /home
If you'd like the order of the sorting reversed
ls -1r /home
Granted, this lists just the directory names and doesn't include the '/home/' prefix, but that can be added back easily enough if desired with something like this
ls -1 /home | while read -r line; do echo "/home/$line"; done
I used something like :
ls -l -d $(cut -d':' -f6 /etc/passwd) 2>/dev/null | sort -u
The only thing I didn't manage is sorting alphabetically by path; I haven't figured that out yet.
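One way to get the alphabetical order (a sketch, assuming GNU xargs for the -d option and home paths without embedded newlines) is to sort the unique paths first and then hand them to ls:
cut -d: -f6 /etc/passwd | sort -u | xargs -d '\n' ls -ld 2>/dev/null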
I have two directories with the same list of files. I need to compare all the files present in both the directories using the diff command. Is there a simple command line option to do it, or do I have to write a shell script to get the file listing and then iterate through them?
You can use the diff command for that:
diff -bur folder1/ folder2/
This will output a recursive diff that ignores whitespace, with a unified context:
b flag means ignoring whitespace
u flag means a unified context (3 lines before and after)
r flag means recursive
If you are only interested to see the files that differ, you may use:
diff -qr dir_one dir_two | sort
Option "q" will only show the files that differ but not the content that differ, and "sort" will arrange the output alphabetically.
Diff has an option -r which is meant to do just that.
diff -r dir1 dir2
diff can not only compare two files; by using the -r option, it can walk entire directory trees, recursively checking differences between subdirectories and files that occur at comparable points in each tree.
$ man diff
...
-r --recursive
Recursively compare any subdirectories found.
...
Another nice option is the über-diff-tool diffoscope:
$ diffoscope a b
It can also emit diffs as JSON, HTML, Markdown, and more.
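For example, to get an HTML report (the flag name is from memory, so check diffoscope --help; --html is assumed here to take the output file as its argument):
diffoscope --html report.html a b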
If you specifically don't want to compare the contents of files and only want to check which ones are not present in both directories, you can compare lists of files generated by another command.
diff <(find DIR1 -printf '%P\n' | sort) <(find DIR2 -printf '%P\n' | sort) | grep '^[<>]'
-printf '%P\n' tells find to not prefix output paths with the root directory.
I've also added sort to make sure the order of files will be the same in both calls of find.
The grep at the end removes information about identical input lines.
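A variation on the same idea (same GNU find assumption): comm can compare the two sorted lists and label the differences by column instead of with < and > markers:
comm -3 <(find DIR1 -printf '%P\n' | sort) <(find DIR2 -printf '%P\n' | sort)
The first column is paths present only in DIR1; the tab-indented column is paths present only in DIR2.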
If it's GNU diff then you should just be able to point it at the two directories and use the -r option.
Otherwise, try using
for i in $(\ls -d ./dir1/*); do diff ${i} dir2; done
N.B. As pointed out by Dennis in the comments section, you don't actually need to do the command substitution on the ls. I've been doing this for so long that I'm pretty much doing this on autopilot and substituting the command I need to get my list of files for comparison.
Also I forgot to add that I do '\ls' to temporarily disable my alias of ls to GNU ls so that I lose the colour formatting info from the listing returned by GNU ls.
When working with git/svn, or with several git/svn checkouts on disk, this has been one of the most useful things for me over the past 5-10 years; somebody else might find it useful too:
diff -burN /path/to/directory1 /path/to/directory2 | grep +++
or:
git diff /path/to/directory1 | grep +++
It gives you a snapshot of the different files that were touched without having to "less" or "more" the output. Then you just diff on the individual files.
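For the git case specifically, git can produce that file-name snapshot directly, without grepping the +++ lines:
git diff --name-only
git diff --stat
--name-only lists just the touched paths; --stat adds a per-file summary of how much changed.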
In practice the question often arises together with some constraints. In that case the following solution template may come in handy.
cd dir1
find . \( -name '*.txt' -o -iname '*.md' \) | xargs -i diff -u '{}' 'dir2/{}'
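A variant of the same template that survives spaces in file names (assuming GNU find and xargs for -print0/-0):
cd dir1
find . \( -name '*.txt' -o -iname '*.md' \) -print0 | xargs -0 -I{} diff -u '{}' 'dir2/{}'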
Here is a script to show differences between files in two folders. It works recursively. Change dir1 and dir2.
(
  search() {
    for i in "$1"/*; do
      [ -f "$i" ] && (diff "$1/${i##*/}" "$2/${i##*/}" || echo "files: $1/${i##*/} $2/${i##*/}")
      [ -d "$i" ] && search "$1/${i##*/}" "$2/${i##*/}"
    done
  }
  search "dir1" "dir2"
)
Try this:
diff -rq /path/to/folder1 /path/to/folder2