Check if a file contains certain ASCII characters - unix

I need a Unix command to verify that a file contains only ASCII printable characters (between ASCII hex 20 and 7E, inclusive).
I have the command below to check whether a file contains non-ASCII characters, but I cannot figure out how to answer the question above.
if LC_ALL=C grep -q '[^[:print:][:space:]]' file; then
    echo "file contains non-ascii characters"
else
    echo "file contains ascii characters only"
fi

Nice to have:
- Stop scanning once a match is found. Sometimes one is enough.
To find 20 to 7E characters in a file you can use:
grep -P "[\x20-\x7E]" file
Note the use of -P to enable Perl-compatible regular expressions.
But in this case you want to check whether the file contains only these kinds of characters, so the best approach is to check whether any character falls outside this range, i.e. match [^range]:
grep -P "[^\x20-\x7E]" file
All together, I would say:
grep -qP "[^\x20-\x7E]" file && echo "weird ASCII" || echo "clean one"

This can be done in unix using the POSIX grep options:
if LC_ALL=C grep -q '[^ -~]' file; then
    echo "file contains non-ascii characters"
else
    echo "file contains ascii characters only"
fi
where the characters in [ ... ] are ^ (caret), space, - (ASCII minus sign), ~ (tilde).
You could also allow an ASCII tab. The standard refers to the characters listed inside the brackets as collating elements. Neither \x (hexadecimal) nor \0 (octal) escapes appear in the standard description of bracket expressions (see 7.4.1), so the tab has to go in as a literal tab character rather than as \x09 or \011.
According to the description, grep by default treats the pattern as a basic regular expression (BRE). Adding -E would give you an extended regular expression, but that is not needed here.
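For example, a minimal sketch that also allows tabs (the tab is generated with printf so the script stays readable; typing a literal tab inside the brackets works just as well):
tab=$(printf '\t')
if LC_ALL=C grep -q "[^$tab -~]" file; then
    echo "file contains non-ascii characters"
else
    echo "file contains ascii characters only"
fi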

Related

Remove latin1 chars from a file

My file is UTF-8 but contains several Latin-1 chars as well as other foreign languages. My aim is to get rid of these chars using a Unix command. Earlier, when I tried to achieve this by removing all the non-ASCII chars, the command below went ahead and removed all the accented chars as well. I want to retain the accented chars and at the same time remove only the non-English (Mandarin, Japanese, Korean, Thai, Arabic) terms from the file.
grep --color='auto' -P -n "[\x80-\xFF]" file.txt -> this command helped me remove the non-ASCII chars, but it also removes the accented chars (í, æ, Ö, etc.)... is it possible to get rid of only the non-English characters while keeping the accented ones?
888|Jobin|Matt|NORMALSQ|YUOZ|IOP|OPO|洁|ID12|doorbell|geo#xyx.comd
1011|ICE|LAND|邵|DUY|DUY|123|EOP|dataset1|geo#xyx.com
53101|炜|GUTTI|RR|Hi|London|UK|WLU|GB|dataset1|陈
สัอ |JOH|LIU|ABC|DUY|DUY|57T2P|EOP|unknown|geo#xyx.com
เมื่รกเริ่ม|JOH|LIU|ABC|DUYសា|DUY|57T2P|EOP|unknown|geo#xyx.com
👼|👼🏻| RAVI|OLE|Hi|London|UK|NA|GB|unknown| WELSH#WELSH.COM
Rogério|Davies|Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Balázs| Roque| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Johny|Peniç| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Mike|Mane| Hi | USA |US|WLU|US|unknown| USA#WELSH.COM
Output:
888|Jobin|Matt|NORMALSQ|YUOZ|IOP|OPO||ID12|doorbell|geo#xyx.comd
1011|ICE|LAND||DUY|DUY|57T2P|EOP|dataset1|geo#xyx.com
53101||GUTTI|RR|Hi|London|UK|WLU|GB|dataset1|
|JOH|LIU|ABC|DUY|DUY|57T2P|EOP|unknown|geo#xyx.com
|JOH|LIU|ABC|DUY|DUY|57T2P|EOP|unknown|geo#xyx.com
|| RAVI|OLE|Hi|London|UK|NA|GB|unknown| WELSH#WELSH.COM
Rogério|Davies|Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Balázs| Roque| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Johny|Peniç| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Mike|Mane| Hi | USA |US|WLU|US|unknown| USA#WELSH.COM
You can use Unicode properties to detect characters that belong to Latin and Basic Latin, which are the ones you seem to want to preserve. Perl supports them in regular expressions:
perl -CSD -pe 's/[^\p{Basic Latin}\p{Latin}]//g' file.txt
(but, unlike your expected output, it doesn't change 123 to 57T2P)
-CSD turns on UTF-8 decoding/encoding of input and output.
-p reads the input line by line and prints each line after processing.
s/PATTERN/REPLACEMENT/g is a global replacement: it replaces all occurrences of PATTERN with the replacement, which in this case is empty.
[...] introduces a character class; ^ at the beginning negates it, i.e. we match anything that is not Latin or Basic Latin.
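A quick sanity check on a made-up line (a UTF-8 terminal is assumed): the Han character is stripped, while the accented Latin character and the ASCII field separators survive:
$ printf 'Rogério|陈|sample\n' | perl -CSD -pe 's/[^\p{Basic Latin}\p{Latin}]//g'
Rogério||sample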
If you really have UTF-8 and want to keep only the extended ASCII characters (a.k.a., usually, Latin-1), iconv may work for you.
iconv -c -f UTF8 -t LATIN1 input_file > output_file
-c    Silently discard characters that cannot be converted instead of terminating when encountering such characters.
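If the rest of your tooling expects UTF-8 rather than Latin-1, one option (my own suggestion, not part of the original answer) is to round-trip through Latin-1 and back; this keeps only the characters Latin-1 can represent while leaving the output encoded as UTF-8:
iconv -c -f UTF8 -t LATIN1 input_file | iconv -f LATIN1 -t UTF8 > output_file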
Here is the most non-elegant solution to your problem:
$ sed -e 's/[^,./#|[:space:]0-9[=a=][=b=][=c=][=d=][=e=][=f=][=g=][=h=][=i=][=j=][=k=][=l=][=m=][=n=][=o=][=p=][=q=][=r=][=s=][=t=][=u=][=v=][=w=][=x=][=y=][=z=][=A=][=B=][=C=][=D=][=E=][=F=][=G=][=H=][=I=][=J=][=K=][=L=][=M=][=N=][=O=][=P=][=Q=][=R=][=S=][=T=][=U=][=V=][=W=][=X=][=Y=][=Z=]]//g' file.txt
To my big surprise I could not use [:punct:], because some of the symbols I wanted to remove are actually defined as punctuation (and so would have been kept).

zsh function throwing "Bad math expression: illegal character" error

I want to create a simple function that prints, with the appropriate unit label, the size of a file accessed via curl. This is what I have included in my .zshrc config:
function curl-size {
    BYTELENGTH=$(curl -sI $1 | grep -i Content-Length | awk '{print $2}')
    if (($BYTELENGTH>1000000000)); then
        VALUE=$(echo "scale=3;$BYTELENGTH/1000000000" | bc -l)
        LABEL="gb"
    elif (($BYTELENGTH>1000000)); then
        VALUE=$(echo "scale=3;$BYTELENGTH/1000000" | bc -l)
        LABEL="mb"
    elif (($BYTELENGTH>1000)); then
        VALUE=$(echo "scale=3;$BYTELENGTH/1000" | bc -l)
        LABEL="kb"
    else
        VALUE=$BYTELENGTH
        LABEL="bytes"
    fi
    echo $(echo "$VALUE" | grep -o '.*[1-9]') $LABEL
}
Trying to run curl-size https://i.imgur.com/A8eQsll.jpg in the terminal returns:
curl-size:2: bad math expression: illegal character: ^M
curl-size:5: bad math expression: illegal character: ^M
curl-size:8: bad math expression: illegal character: ^M
curl-size:12: bad math expression: illegal character: ^M
^M is the character otherwise known as a carriage return -- an instruction for the cursor to go back to the beginning of the current line. On DOS-derived platforms, lines of a text file are separated by a <CR><LF> sequence, whereas on UNIX-family platforms lines are terminated by <LF> alone. (This means that on UNIX a text file is expected to end with a <LF> for its last line to be valid, whereas on Windows a trailing <CR><LF> results in an empty line at the end of the file.)
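You can reproduce the failure mode in isolation with a made-up value; zsh's arithmetic evaluation chokes on the trailing carriage return just as it does inside the function:
$ val=$'123\r'
$ (( val > 0 ))
zsh: bad math expression: illegal character: ^M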
If the web server you're connecting to with curl is returning content with DOS newlines, those carriage returns will be considered content rather than code. A somewhat inefficient but workable fix might look like:
BYTELENGTH=$(curl -sI "$1" | tr -d '\r' | awk '/Content-Length/ {print $2}')
Note that using all-caps names for your own variables is a bad idea when writing scripts for POSIX-compliant shells, which by standard-mandated convention reserve lowercase names for application use and the all-caps namespace for variables that modify the shell's and utilities' behavior. zsh is not POSIX-compliant, but the convention is still worth following there, since all-caps names are the ones most likely to collide with variables the shell or environment already treats specially.
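Putting that together, a minimal sketch of the fixed function might look like this (lowercase variable names per the note above; it assumes the server actually sends a Content-Length header):
function curl-size {
    local bytelength value label
    # strip carriage returns from the headers and match the header name case-insensitively
    bytelength=$(curl -sI "$1" | tr -d '\r' | awk 'tolower($1) == "content-length:" {print $2}')
    if [[ -z "$bytelength" ]]; then
        echo "no Content-Length header found" >&2
        return 1
    fi
    if (( bytelength > 1000000000 )); then
        value=$(echo "scale=3;$bytelength/1000000000" | bc -l); label="gb"
    elif (( bytelength > 1000000 )); then
        value=$(echo "scale=3;$bytelength/1000000" | bc -l); label="mb"
    elif (( bytelength > 1000 )); then
        value=$(echo "scale=3;$bytelength/1000" | bc -l); label="kb"
    else
        value=$bytelength; label="bytes"
    fi
    # trim trailing zeros, as in the original
    echo "$(echo "$value" | grep -o '.*[1-9]') $label"
}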

Check if file contains some text (not regex) in Unix

I want to check whether a file contains a given multiline text. grep comes close, but I couldn't find a way to make it interpret the pattern as plain text, not a regex.
How can I do this, using only Unix utilities?
Use grep -F:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by
POSIX.)
EDIT: Initially I didn't understand the question well enough. If the pattern itself contains newlines, use the -z option as well:
-z, --null-data
Treat the input as a set of lines, each terminated by a zero
byte (the ASCII NUL character) instead of a newline. Like the
-Z or --null option, this option can be used with commands like
sort -z to process arbitrary file names.
I've tested it; multiline patterns worked.
If the input string you are trying to match does not contain a blank line (i.e., it does not have two consecutive newlines), you can do:
awk 'index( $0, "needle\nwith no consecutive newlines" ) { m=1 }
END{ exit !m }' RS= input-file && echo matched
If you need to find a string containing consecutive newlines, set RS to some string that is not in the file. (Note that the results of awk are unspecified if you set RS to more than one character, but most awk implementations will allow it to be a string.) If you are willing to make the sought string a regex, and if your awk supports setting RS to more than one character, you could do:
awk 'END{ exit NR == 1 }' RS='sought regex' input-file && echo matched
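As a quick illustration of the first form (the file contents are made up for the demo; RS= puts awk into paragraph mode, so a file with no blank lines is read as a single record):
$ printf 'header\nneedle\nwith no consecutive newlines\ntrailer\n' > input-file
$ awk 'index($0, "needle\nwith no consecutive newlines") { m=1 } END{ exit !m }' RS= input-file && echo matched
matched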

grep for special characters in Unix

I have a log file (application.log) which might contain the following string of normal & special characters on multiple lines:
*^%Q&$*&^#$&*!^#$*&^&^*&^&
I want to search for the line number(s) of the lines that contain this special character string.
grep '*^%Q&$*&^#$&*!^#$*&^&^*&^&' application.log
The above command doesn't return any results.
What would be the correct syntax to get the line numbers?
Tell grep to treat your pattern as a fixed string using the -F option.
grep -F '*^%Q&$*&^#$&*!^#$*&^&^*&^&' application.log
Add the -n option to get the line numbers:
grep -Fn '*^%Q&$*&^#$&*!^#$*&^&^*&^&' application.log
The one that worked for me is:
grep -e '->'
The -e flag means that the next argument is the pattern and won't be interpreted as an option (useful because the pattern here starts with a dash).
From: http://www.linuxquestions.org/questions/programming-9/how-to-grep-for-string-769460/
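A closely related trick, assuming your grep follows the usual POSIX option-parsing guidelines (GNU and BSD grep both do): the -- marker ends option parsing, so everything after it is treated as an operand rather than an option (logfile here is just a placeholder name):
grep -n -- '->' logfile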
A related note
To grep for carriage return, namely the \r character, or 0x0d, we can do this:
grep -F $'\r' application.log
Alternatively, generate the carriage return with printf for POSIX shell compatibility (echo -e works in some shells, but its escape handling is less portable):
grep -F "$(printf '\r')" application.log
And we can use hexdump, or less to see the result:
$ printf "a\rb" | grep -F $'\r' | hexdump -c
0000000 a \r b \n
Regarding the use of $'\r' and other supported characters, see Bash Manual > ANSI-C Quoting:
Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard
grep -n "\*\^\%\Q\&\$\&\^\#\$\&\!\^\#\$\&\^\&\^\&\^\&" test.log
1:*^%Q&$&^#$&!^#$&^&^&^&
8:*^%Q&$&^#$&!^#$&^&^&^&
14:*^%Q&$&^#$&!^#$&^&^&^&
You could also invert a match on lines that consist only of alphanumeric characters and spaces; with -n, that prints the line number of every line containing anything else. Try the following:
grep -vn "^[a-zA-Z0-9 ]*$" application.log
Try vi with the -b option; this will show special end-of-line characters.
(I typically use it to see Windows line endings in a txt file on a Unix OS.)
But if you want a scripted solution, vi obviously won't work, so you can try the -f or -e options with grep and pipe the result into sed or awk.
From grep man page:
Matcher Selection
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified
by POSIX.)

Checking for DOS files on UNIX

I want to check whether any DOS files exist in any specific directory.
Is there any way to distinguish DOS files from UNIX files apart from the ^M chars?
I tried using file, but it gives the same output for both.
$ file test_file
test_file: ascii text
And after conversion:
$ unix2dos test_file test_file
$ file test_file.txt
test_file.txt: ascii text
The CRLF (\r\n, ^M) line endings are the only difference between Unix and DOS/Windows ASCII files, so no, there's no other way.
What you might try, if you have the fromdos command, is to compare its output with the original file:
file=test_file
fromdos < $file | cmp $file -
This fails (non-zero $?) if fromdos stripped any \r away.
dos2unix might be used in a similar way, but I don't know its exact syntax.
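dos2unix can indeed be used the same way, assuming your version acts as a filter when given no file arguments (most modern ones do):
file=test_file
dos2unix < "$file" | cmp -s "$file" - && echo "no CR characters" || echo "file contains DOS line endings"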
If you actually put Windows newlines in, you'll see the following output from file:
test_file.txt: ASCII text, with CRLF line terminators
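So, to answer the original question of finding DOS files in a specific directory, one sketch is to loop over the directory and look for that CRLF note in file's output (the directory path below is just a placeholder):
for f in /path/to/dir/*; do
    [ -f "$f" ] || continue              # skip anything that is not a regular file
    if file "$f" | grep -q CRLF; then
        printf '%s: DOS line endings\n' "$f"
    fi
done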
