(grep) Regex to match non-ASCII characters when committing from Windows - unix

I'm developing a pre-commit hook to avoid committing files with non-ASCII characters. It works well from a Unix system, using the regex below:
grep -P -n '[\x80-\xFF]' /tmp/app.txt
Now the issue that is giving me a lot of pain is that when I commit from Windows, the result of the grep changes, matching many more characters than just the non-ASCII ones...
Does anyone know how to fix this? I have tried a lot of different things.

strings -n 1 filename will show the normal characters, but perhaps you only want to determine the kind of file: file filename will show the kind of file, though I am afraid that won't work for you.
You might try something like
cat /tmp/app.txt | tr -d "[:print:]\r\n" | wc -c
or, avoiding the cat:
tr -d "[:print:]\r\n" < /tmp/app.txt | wc -c

Related

Suggest a good example to explain Unix input redirection

Can somebody explain how unix input redirection is used with a good example?
To find text in a file I can use this approach:
grep "text" file.txt
or
grep "text" < file.txt
Both of them give the same output. I'm unable to explain to somebody how Unix input redirection can actually be helpful and where it should be used.
grep supports both stdin (through redirection) and a filename parameter, so here you gain no benefit.
With tr you can delete '\r' characters from a Windows file:
cat windowsfile | tr -d '\r' > newunixfile
The use of the cat command can be avoided (one less program):
tr -d '\r' < windowsfile > newunixfile
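One small case where redirection genuinely changes behavior is wc: given a filename argument it echoes the name back, while with redirection it prints the count alone, which is handier in scripts.
wc -l file.txt     # prints something like: 12 file.txt
wc -l < file.txt   # prints just: 12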

Unix query --- finding the numbers

On a Linux/Unix system, you have a directory full of files. You want to identify the files that have phone numbers in them, where you can assume a phone number looks like xxx-xxx-xxxx. Which standard Unix tools could you use to solve your problem?
I can use egrep '[0-9]+-[0-9]+-[0-9]+' *
Is there any way I can use the find command, or any other command, to solve this?
What's wrong with ls | egrep '[0-9]{3}-[0-9]{3}-[0-9]{4}'? You don't need find to search a flat directory unless you want some complex post-processing.
If you want to check the contents of the files, use egrep '[0-9]{3}-[0-9]{3}-[0-9]{4}' *.
ls -1 | perl -lne 'print if(/^[\d]{3}-\d{3}-\d{4}$/)'
Tested below:
> touch 903-745-39913
> touch 903-745-3991
> ls -1 | perl -lne 'print if(/^[\d]{3}-\d{3}-\d{4}$/)'
903-745-3991
>
Your question is not quite clear. If you mean the phone numbers are present inside the files,
then
ls | xargs perl -lne 'print if(/\b[\d]{3}-\d{3}-\d{4}\b/)'
will serve the purpose.
Use sed:
sed -n '/^[0-9]\{3\}-[0-9]\{3\}-[0-9]\{4\}/p' phonelist.txt
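Since the task is to identify the files rather than to print the matching lines, grep's standard -l switch (print only the names of files that match) combines with any of the answers above, and find covers subdirectories too:
egrep -l '[0-9]{3}-[0-9]{3}-[0-9]{4}' *
find . -type f -exec egrep -l '[0-9]{3}-[0-9]{3}-[0-9]{4}' {} \;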

How do I perform a recursive directory search for strings within files in a UNIX TRU64 environment?

Unfortunately, due to the limitations of our Unix Tru64 environment, I am unable to use grep's -r switch to perform my search for strings within files across multiple directories and subdirectories.
Ideally, I would like to pass two parameters. The first will be the directory my search is to start in. The second is a file containing a list of all the strings to be searched for. This list will consist of various directory path names and will include special characters:
ie:
/aaa/bbb/ccc
/eee/dddd/ggggggg/
etc..
The purpose of this exercise is to identify all shell scripts that may have specific hard coded path names identified in my list.
There was one example I found during my investigations that perhaps comes close, but I am not sure how to customize this to accept a file of string arguments:
eg: find etb -exec grep test {} \;
where 'etb' is the directory and 'test', a hard coded string to be searched.
This should do it:
find dir -type f -exec grep -F -f strings.txt {} \;
dir is the directory from which searching will commence
strings.txt is the file of strings to match, one per line
-F means treat search strings as literal rather than regular expressions
-f strings.txt means use the strings in strings.txt for matching
You can add -l to the grep switches if you just want filenames that match.
Footnote:
Some people prefer a solution involving xargs, e.g.
find dir -type f -print0 | xargs -0 grep -F -f strings.txt
which is perhaps a little more robust/efficient in some cases.
From reading your question, I assume we cannot use the GNU coreutils, and egrep is not available.
I assume (for some reason) the system is broken, and escapes do not work as expected.
Under normal circumstances, grep -rf patternfile.txt /some/dir/ is the way to go.
a file containing a list of all the strings to be searched
Assumptions: GNU coreutils not available, grep -r does not work, handling of special characters is broken.
Now, do you have a working awk? No? awk makes life so much easier, but let's stay on the safe side.
Assume: a working sed, and one of od, hexdump, or xxd (from the vim package) is available.
Let's call the string list patternfile.txt.
1. Convert list into a regexp that grep likes
Example patternfile.txt contains
/foo/
/bar/doe/
/root/
(The example does not show the special characters, but they are there.) We must turn it into something like
(/foo/|/bar/doe/|/root/)
Assuming the echo -en command is not broken, and xxd, od, or hexdump is available:
Using hexdump
cat patternfile.txt |hexdump -ve '1/1 "%02x \n"' |tr -d '\n'
Using od
cat patternfile.txt |od -A none -t x1|tr -d '\n'
and pipe it into (common for both hexdump and od)
|sed 's:[ ]*0a[ ]*$::g'|sed 's: 0a:\\|:g' |sed 's:^[ ]*::g'|sed 's:^: :g' |sed 's: :\\x:g'
then pipe result into
|sed 's:^:\\(:g' |sed 's:$:\\):g'
and you have a regexp pattern that is escaped.
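Worked by hand for the example patternfile.txt above (so treat it as illustrative; the invisible special characters are left out here), the whole pipeline produces:
\(\x2f\x66\x6f\x6f\x2f\|\x2f\x62\x61\x72\x2f\x64\x6f\x65\x2f\|\x2f\x72\x6f\x6f\x74\x2f\)
which echo -en in the next step expands back into the bytes of /foo/, /bar/doe/ and /root/, joined as a grep alternation.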
2. Feed the escaped pattern into the broken grep
Assuming a bare minimum of shell escaping is available,
we use grep "$(echo -en "ESCAPED_PATTERN")" to do our job.
3. To sum it up
Building an escaped regexp pattern (using hexdump as the example):
grep "$(echo -en "$( cat patternfile.txt |hexdump -ve '1/1 "%02x \n"' |tr -d '\n' |sed 's:[ ]*0a[ ]*$::g'|sed 's: 0a:\\|:g' |sed 's:^[ ]*::g'|sed 's:^: :g' |sed 's: :\\x:g'|sed 's:^:\\(:g' |sed 's:$:\\):g')")"
will escape all characters and enclose the result in \( \| \) grouping, so a regexp OR match is performed.
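For example, you can capture the result in a variable and then hand grep the files to search (the target path here is hypothetical):
pattern="$(echo -en "$( cat patternfile.txt |hexdump -ve '1/1 "%02x \n"' |tr -d '\n' |sed 's:[ ]*0a[ ]*$::g'|sed 's: 0a:\\|:g' |sed 's:^[ ]*::g'|sed 's:^: :g' |sed 's: :\\x:g'|sed 's:^:\\(:g' |sed 's:$:\\):g')")"
grep "$pattern" /usr/local/bin/somescript.sh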
4. Recursive directory lookup
Under normal situations, even when grep -r is broken, find /dir/ -exec grep PATTERN {} \; should work.
Some may prefer xargs instead (unless you happen to have a buggy xargs).
We would prefer the find /somedir/ -type f -print0 | xargs -0 grep -f 'patternfile.txt' approach, but since
this is not available (for whatever valid reason),
we need to exec grep for each file, and this is normally the wrong way.
But let's do it.
Assume: find -type f works.
Assume: xargs is broken or not available.
First, if you have a buggy pipe, it might not handle a large number of files,
so we avoid xargs on such systems (I know, I know, let's just pretend it is broken).
find /whatever/dir/to/start/looking/ -type f > list-of-all-file-to-search-for.txt
If your shell handles large lists nicely,
for file in $(cat list-of-all-file-to-search-for.txt) ; do grep REGEXP_PATTERN "$file" ; done
is a nice way to get by. Unfortunately, some systems do not like that,
and in that case, you may require
cat list-of-all-file-to-search-for.txt | split -a 4 -d -l 2000 - file-smaller-chunk.part.
to turn the list into smaller chunks. Now this is for a seriously broken system.
Then
for file in file-smaller-chunk.part.* ; do for single_line in $(cat "$file") ; do grep REGEXP_PATTERN "$single_line" ; done ; done
should work.
A
cat filelist.txt | while read file ; do grep REGEXP_PATTERN "$file" ; done
may be used as a workaround on some systems.
What if my shell does not handle quotes?
You may have to escape the file list beforehand.
This can be done much more nicely in awk, perl, whatever, but since we are restricting ourselves to
sed, let's do it.
We assume 0x27, the ' character, will actually work.
cat list-of-all-file-to-search-for.txt |sed 's#['\'']#'\''\\'\'\''#g'|sed 's:^:'\'':g'|sed 's:$:'\'':g'
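As a hand-worked illustration (the path is a made-up example of mine), an entry containing a quote comes out shell-safe:
echo "/home/it's dir/script" | sed 's#['\'']#'\''\\'\'\''#g' | sed 's:^:'\'':g' | sed 's:$:'\'':g'
# prints: '/home/it'\''s dir/script'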
The only time I had to use this was when feeding output back into bash.
What if my shell does not handle that either?
xargs fails, grep -r fails, the shell's for loop fails.
Do we have other options? Yes.
Escape all input suitably for your shell, and make a script.
But you know what, I got bored, and writing automated scripts for csh just seems
wrong. So I am going to stop here.
Take-home note
Use the right tool for the job. Writing an interpreter in bc is perfectly
possible, but it is just plain wrong. Install coreutils, perl, a better grep,
whatever. It makes life better.

Unix [Homework]: Get a list of /home/user/ directories in /etc/passwd

I'm very new to Unix, and currently taking a class learning the basics of the system and its commands.
I'm looking for a single command line to list off all of the user home directories in alphabetical order from the /etc/passwd directory. This applies only to the home directories, and not the contents within them. There should be no duplicate entries. I've tried many permutations of commands such as the following:
sort -d | find /etc/passwd /home/* -type -d | uniq | less
I've tried using -path, -name, removing -type, using -prune, and changing the search pattern to things like /home/*/$, but haven't gotten good results even once. At best I can get a list of my own directory (complete with every directory inside it, which is bad), and the directories of the other students on the server (without the contained directories, which is good). I just can't get it to display the /home/user directories and nothing else for my own account.
Many thanks in advance.
/etc/passwd is a file. The home directory is usually field/column 6, with ":" as the delimiter. When you are dealing with a file structure that has a distinct delimiter character, you should use a tool that can break the data down into fields for easier manipulation. awk, cut, etc., or even the shell with the IFS variable set, can do the job, e.g.
awk -F":" '{print $6}' /etc/passwd | sort
cut -d":" -f6 /etc/passwd |sort
using the shell to read the file
while IFS=":" read -r a b c d e home_dir g
do
echo "$home_dir"
done < /etc/passwd | sort
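One extra detail for the assignment's "no duplicate entries" requirement: sort's standard -u switch removes duplicates in the same pass, e.g.
cut -d":" -f6 /etc/passwd | sort -u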
I think the tools you want are grep, tr and awk. grep will give you the lines from the file that actually contain home directories. tr will let you turn the delimiter into spaces, which makes each line easier to parse.
awk is just one program that would help you display the results that you want.
Good luck :)
Another hint: try ls --color=auto /etc; passwd isn't the kind of file that you think it is. Directories show up in blue.
In Unix, find is a command for finding files under one or more directories. I think you are looking for a command for finding lines within a file that match a pattern? Look into the command grep.
sed 's|\(.[^:]*\):\(.[^:]*\):\(.*\):\(.[^:]*\):\(.[^:]*\)|\4|' /etc/passwd | sort
I think all this processing could be avoided. There is a utility to list directory contents.
ls -1 /home
If you'd like the sort order reversed:
ls -1r /home
Granted, this lists out just the directory names and doesn't include the '/home/' prefix, but that can be added back easily enough if desired, with something like this:
ls -1 /home | (while read line; do echo "/home/"$line; done)
I used something like:
ls -l -d $(cut -d':' -f6 /etc/passwd) 2>/dev/null | sort -u
The only thing I didn't do is sort alphabetically; I haven't figured that out yet.
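If it helps, one way to get the alphabetical order (my own suggestion, not part of the original answer) is to drop -l, so that each output line is just the path itself, which sort -u then orders alphabetically:
ls -d $(cut -d':' -f6 /etc/passwd) 2>/dev/null | sort -u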

What's the best way to convert Windows/DOS files to Unix in batch?

Basically, we need to change the end-of-line characters for a group of files.
Is there a way to accomplish this with a batch file? Is there a freeware utility?
dos2unix
It can be done with a somewhat shorter command.
find ./ -type f | xargs -I {} dos2unix {}
You should be able to use tr in combination with xargs to do this.
On the Unix side at least, this should be the simplest way. However, I tried doing it that way once on a Windows box over a decade ago, and discovered that the Windows version of tr was translating my terminators right back to Windows format for me. :-( I think the tools have gotten smarter in the intervening decade, though.
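A rough sketch of that tr approach on the Unix side (my own; tr cannot edit a file in place, so it goes through a temporary copy, and filenames are assumed to contain no newlines):
find . -type f | while IFS= read -r f; do
    tr -d '\r' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done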
Combine find with dos2unix/fromdos to convert a directory of files (excluding binary files).
Just add this to your .bashrc:
DOS2UNIX=$(which fromdos || which dos2unix) \
|| echo "*** Please install fromdos or dos2unix"
function finddos2unix {
# Usage: finddos2unix Directory
find "$1" -type f -exec file {} \; | grep " text" | cut -d ':' -f1 | xargs $DOS2UNIX
}
First, the DOS2UNIX variable checks whether you have either utility installed and picks one to use.
Then find makes a list of all files, and file appends something like ": ASCII text" after each text file's name.
Finally, grep keeps only the text files, cut removes everything after the ':', and xargs turns this into one big command line for $DOS2UNIX.
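After sourcing your .bashrc, usage would look like this (the directory name is hypothetical):
finddos2unix ~/src/myproject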
