Remove latin1 chars from a file - unix

My file is UTF-8, but it contains characters from several other scripts. My aim is to get rid of these characters using a Unix command. Earlier, when I tried to achieve this by removing all the non-ASCII characters, the command below removed all the accented characters as well. I want to retain the accented characters and remove only the non-English (Mandarin, Japanese, Korean, Thai, Arabic) terms from the file.
grep --color='auto' -P -n "[\x80-\xFF]" file.txt — this command helped me find the non-ASCII characters, but it also matches the accented characters (í, æ, Ö, etc.). Is it possible to keep those? Sample input:
888|Jobin|Matt|NORMALSQ|YUOZ|IOP|OPO|洁|ID12|doorbell|geo#xyx.comd
1011|ICE|LAND|邵|DUY|DUY|123|EOP|dataset1|geo#xyx.com
53101|炜|GUTTI|RR|Hi|London|UK|WLU|GB|dataset1|陈
สัอ |JOH|LIU|ABC|DUY|DUY|57T2P|EOP|unknown|geo#xyx.com
เมื่รกเริ่ม|JOH|LIU|ABC|DUYសា|DUY|57T2P|EOP|unknown|geo#xyx.com
👼|👼🏻| RAVI|OLE|Hi|London|UK|NA|GB|unknown| WELSH#WELSH.COM
Rogério|Davies|Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Balázs| Roque| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Johny|Peniç| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Mike|Mane| Hi | USA |US|WLU|US|unknown| USA#WELSH.COM
Expected output:
888|Jobin|Matt|NORMALSQ|YUOZ|IOP|OPO||ID12|doorbell|geo#xyx.comd
1011|ICE|LAND||DUY|DUY|57T2P|EOP|dataset1|geo#xyx.com
53101||GUTTI|RR|Hi|London|UK|WLU|GB|dataset1|
|JOH|LIU|ABC|DUY|DUY|57T2P|EOP|unknown|geo#xyx.com
|JOH|LIU|ABC|DUY|DUY|57T2P|EOP|unknown|geo#xyx.com
|| RAVI|OLE|Hi|London|UK|NA|GB|unknown| WELSH#WELSH.COM
Rogério|Davies|Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Balázs| Roque| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Johny|Peniç| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Mike|Mane| Hi | USA |US|WLU|US|unknown| USA#WELSH.COM

You can use Unicode properties to detect characters that belong to the Latin script or the Basic Latin block, which are the ones you seem to want preserved. Perl supports them in regular expressions:
perl -CSD -pe 's/[^\p{Basic Latin}\p{Latin}]//g' file.txt
(Note that this doesn't change the 123 in your input to the 57T2P shown in your expected output.)
-CSD turns on UTF-8 decoding/encoding of input and output
-p reads the input line by line and prints each line after processing
s/PATTERN/REPLACEMENT/g is a global substitution: it replaces all occurrences of PATTERN with the replacement; here the replacement is empty, so the matched characters are deleted
[...] introduces a character class; ^ at its beginning negates it, i.e. we match anything that is not Basic Latin or Latin

If you really have UTF-8 and want to keep only the extended ASCII characters (i.e., usually, Latin-1), iconv may work for you.
iconv -c -f UTF8 -t LATIN1 input_file > output_file
-c    Silently discard characters that cannot be converted, instead of terminating when encountering such characters.
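Note that the output of the command above is Latin-1 encoded. If you want the result to stay UTF-8, a round trip through iconv works; a sketch (assuming GNU iconv's encoding names):

```shell
# Drop anything with no Latin-1 equivalent, then re-encode as UTF-8
# so the file stays UTF-8; -c silently discards unconvertible chars.
iconv -c -f UTF8 -t LATIN1 input_file | iconv -f LATIN1 -t UTF8 > output_file
```

For example, `printf 'Balázs|洁\n'` piped through the two iconv calls comes out as `Balázs|` — the CJK character is dropped, the accented one survives.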

Here is a decidedly inelegant solution to your problem:
$ sed -e 's/[^,./#|[:space:]0-9[=a=][=b=][=c=][=d=][=e=][=f=][=g=][=h=][=i=][=j=][=k=][=l=][=m=][=n=][=o=][=p=][=q=][=r=][=s=][=t=][=u=][=v=][=w=][=x=][=y=][=z=][=A=][=B=][=C=][=D=][=E=][=F=][=G=][=H=][=I=][=J=][=K=][=L=][=M=][=N=][=O=][=P=][=Q=][=R=][=S=][=T=][=U=][=V=][=W=][=X=][=Y=][=Z=]]//g' file.txt
To my surprise I could not use [:punct:], because some of the unwanted symbols are actually classified as punctuation.

Related

Check if a file contains certain ASCII characters

I need a Unix command to verify that a file contains only printable ASCII characters (between ASCII hex 20 and 7E inclusive).
I found the command below to check whether a file contains non-ASCII characters, but I cannot figure out the question above.
if LC_ALL=C grep -q '[^[:print:][:space:]]' file; then
echo "file contains non-ascii characters"
else
echo "file contains ascii characters only"
fi
To find 20 to 7E characters in a file you can use:
grep -P "[\x20-\x7E]" file
Note the usage of -P to perform Perl regular expressions.
But in this case you want to check that the file contains only these characters. So the best thing to do is check whether any character falls outside that range, i.e. match [^range]:
grep -P "[^\x20-\x7E]" file
All together, I would say:
grep -qP "[^\x20-\x7E]" file && echo "weird ASCII" || echo "clean one"
This can be done in unix using the POSIX grep options:
if LC_ALL=C grep -q '[^ -~]' file; then
echo "file contains non-ascii characters"
else
echo "file contains ascii characters only"
fi
where the characters in [ ... ] are ^ (caret), space, - (ASCII minus sign), and ~ (tilde).
You could also specify the ASCII tab. The standard refers to these as collating elements. Both \x (hexadecimal) and \0 (octal) forms appear in the standard's description of bracket expressions (see 7.4.1), so you could use \x09 or \011 for a literal tab.
According to the description, by default -e accepts a basic regular expression (BRE). If you added a -E, you could have an extended regular expression (but that is not needed).
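Putting the POSIX variant to work on two hypothetical sample streams (any shell with printf will do):

```shell
# Pure printable ASCII: the bracket expression matches nothing.
printf 'hello world\n' | LC_ALL=C grep -q '[^ -~]' \
  && echo "non-ascii" || echo "ascii only"
# prints: ascii only

# "café" contains a multi-byte UTF-8 sequence outside the space..tilde range.
printf 'café\n' | LC_ALL=C grep -q '[^ -~]' \
  && echo "non-ascii" || echo "ascii only"
# prints: non-ascii
```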

How to get character string out from file in UNIX

I have a file on Unix that has lines with special characters as well as lines that are pure character strings. A special character could be any of .,$%&*()-#. Sample below:
sample input
\302\275B\303\236Gcl\302\275t eRkAtMbn;
Jacob
Chinese
39:00
Language
53.00
output:
Jacob
Chinese
Language
I want to extract only the pure character-string lines from this file. I could read each line and compare each character against the alphabet, but if the file is big that would consume a lot of time.
Is there a better approach or any suggestions?
Your best bet is the grep utility.
grep -i '^[a-z]\+$' file.txt
Specifically, we're doing a case-insensitive search (-i) for lines that contain only the characters [a-z], and only those characters from start (^) to finish ($).
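For instance, run against the sample lines from the question (GNU grep's \+ is assumed; on other greps use grep -iE '^[a-z]+$'):

```shell
# Only the lines made purely of letters survive.
printf '%s\n' 'Jacob' '39:00' 'Chinese' '53.00' | grep -i '^[a-z]\+$'
# prints:
# Jacob
# Chinese
```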

grep for special characters in Unix

I have a log file (application.log) which might contain the following string of normal & special characters on multiple lines:
*^%Q&$*&^#$&*!^#$*&^&^*&^&
I want to search for the line number(s) which contains this special character string.
grep '*^%Q&$*&^#$&*!^#$*&^&^*&^&' application.log
The above command doesn't return any results.
What would be the correct syntax to get the line numbers?
Tell grep to treat your input as fixed string using -F option.
grep -F '*^%Q&$*&^#$&*!^#$*&^&^*&^&' application.log
Option -n is required to get the line number,
grep -Fn '*^%Q&$*&^#$&*!^#$*&^&^*&^&' application.log
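A quick demonstration on a hypothetical two-line log (note the %% needed to get a literal % through printf):

```shell
# -F takes the pattern literally; -n prefixes the line number.
printf 'ok line\n*^%%Q&$*\n' | grep -Fn '*^%Q&$*'
# prints: 2:*^%Q&$*
```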
The one that worked for me is:
grep -e '->'
The -e flag means that the next argument is the pattern, so a pattern starting with - won't be interpreted as an option.
From: http://www.linuxquestions.org/questions/programming-9/how-to-grep-for-string-769460/
A related note
To grep for carriage return, namely the \r character, or 0x0d, we can do this:
grep -F $'\r' application.log
Alternatively, use printf, or echo, for POSIX compatibility
grep -F "$(printf '\r')" application.log
And we can use hexdump, or less to see the result:
$ printf "a\rb" | grep -F $'\r' | hexdump -c
0000000 a \r b \n
Regarding the use of $'\r' and other supported characters, see Bash Manual > ANSI-C Quoting:
Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard
grep -n "\*\^\%\Q\&\$\&\^\#\$\&\!\^\#\$\&\^\&\^\&\^\&" test.log
1:*^%Q&$&^#$&!^#$&^&^&^&
8:*^%Q&$&^#$&!^#$&^&^&^&
14:*^%Q&$&^#$&!^#$&^&^&^&
You could also match every line that consists only of alphanumeric characters and spaces, and invert the match with -v; adding -n gives you the line numbers. Try the following:
grep -vn "^[a-zA-Z0-9 ]*$" application.log
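For example, on a hypothetical log where line 2 holds the special-character junk:

```shell
# -v keeps only lines that are NOT purely alphanumerics/spaces; -n numbers them.
printf 'normal line 1\n*^%%Q&$*\n' | grep -vn '^[a-zA-Z0-9 ]*$'
# prints: 2:*^%Q&$*
```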
Try vi with the -b option; this will show special end-of-line characters.
(I typically use it to see Windows line endings in a txt file on a Unix OS.)
But if you want a scripted solution, vi obviously won't work, so you can try the -f or -e options with grep and pipe the result into sed or awk.
From grep man page:
Matcher Selection
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX.)

Unix sort text file with user-defined newline character

I have a plain text file where the newline character is not \n but a special character.
Now I want to sort this file.
Is there a direct way to specify a custom newline character when using the Unix sort command?
I'd prefer not to use a script for this, if possible.
Please note the data in the text file contains \n, \r\n, and \t characters (the reason for such data is application-specific, so please don't comment on that).
The sample data is as below:
1111\n1111<Ctrl+A>
2222\t2222<Ctrl+A>
3333333<Ctrl+A>
Here Ctrl+A is the newline character.
Use perl -001e 'print sort <>' to do this:
prompt$ cat -tv /tmp/a
2222^I2222^A3333333^A1111
1111^A
prompt$ perl -001e 'print sort <>' /tmp/a | cat -tv
1111
1111^A2222^I2222^A3333333^Aprompt$
That works because character 001 (octal 1) is control-A ("\cA"), which is your record terminator in this dataset.
You can also give the code point in hex using -0xHHHHH. Note that with this shortcut it must be a single code point, not a string. There are ways of doing it for strings and even regexes that involve infinitesimally more code.

grep -w with only space as delimiter

grep -w uses punctuations and whitespaces as delimiters.
How can I set grep to only use whitespaces as a delimiter for a word?
If you want to match just spaces: grep -w foo is the same as grep " foo ". If you also want to match line endings or tabs, you can start doing things like grep '\(^\| \)foo\($\| \)', but you're probably better off with perl -ne 'print if /\sfoo\s/'.
You cannot change the way grep -w works. However, you can replace the punctuation with, say, an X character using tr or sed and then use grep -w; that will do the trick.
The --word-regexp flag is useful, but limited. The grep man page says:
-w, --word-regexp
Select only those lines containing matches that form whole
words. The test is that the matching substring must either be
at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end
of the line or followed by a non-word constituent character.
Word-constituent characters are letters, digits, and the
underscore.
If you want to use custom field separators, awk may be a better fit for you. Or you could just write an extended regular expression with egrep or grep --extended-regexp that gives you more control over your search pattern.
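As a sketch of the awk route: print a line only when the word appears as a whole whitespace-separated field (awk splits on runs of blanks by default, and never on punctuation):

```shell
# Hypothetical sample: only the first line has "foo" as a whole field.
printf '%s\n' 'foo bar' 'foobar baz' 'x foo:y' |
  awk '{ for (i = 1; i <= NF; i++) if ($i == "foo") { print; next } }'
# prints: foo bar
```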
Use tr to replace spaces with newlines, then grep for your string. In my case, grep -w treated the colons inside the contiguous string I needed as word boundaries. Furthermore, I only knew the first part, and the second part was the unknown data I needed to pull. Therefore, the following helped me:
echo "$your_content" | tr ' ' '\n' | grep 'string'
