How to get character strings out of a file in UNIX

I have a file on Unix that contains lines with special characters as well as lines that are pure character strings. A special character could be anything like .,$%&*()-#.
Sample input:
\302\275B\303\236Gcl\302\275t eRkAtMbn;
Jacob
Chinese
39:00
Language
53.00
Output:
Jacob
Chinese
Language
I want to get only the pure character-string lines out of this file. I could read each line and compare every character against the alphabet, but if the file is big that will take a lot of time.
Is there a better approach or any suggestions?

Your best bet is the grep utility.
grep -i '^[a-z]\+$' file.txt
Specifically, we're doing a case-insensitive search (-i) for lines that contain only the characters [a-z], and only those characters from start (^) to finish ($).
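With the sample above (and GNU grep, since \+ is a GNU extension to basic regular expressions), that filters the file down to exactly the lines you want:
$ grep -i '^[a-z]\+$' file.txt
Jacob
Chinese
Language
Because grep makes a single pass over the file, this stays fast even on large files.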

Related

Remove latin1 chars from a file

My file is UTF-8 but contains a number of characters from other (non-Latin) languages. My aim is to get rid of these characters using a Unix command. Earlier, when I tried to achieve this by removing all non-ASCII characters, the command below removed all the accented characters as well. I want to retain the accented characters and remove only the non-English (Mandarin, Japanese, Korean, Thai, Arabic) terms from the file.
grep --color='auto' -P -n "[\x80-\xFF]" file.txt
This command helped me target the non-ASCII characters, but it also catches the accented characters (í, æ, Ö, etc.). Is it possible to get the output shown below instead?
Input:
888|Jobin|Matt|NORMALSQ|YUOZ|IOP|OPO|洁|ID12|doorbell|geo#xyx.comd
1011|ICE|LAND|邵|DUY|DUY|123|EOP|dataset1|geo#xyx.com
53101|炜|GUTTI|RR|Hi|London|UK|WLU|GB|dataset1|陈
สัอ |JOH|LIU|ABC|DUY|DUY|57T2P|EOP|unknown|geo#xyx.com
เมื่รกเริ่ม|JOH|LIU|ABC|DUYសា|DUY|57T2P|EOP|unknown|geo#xyx.com
👼|👼🏻| RAVI|OLE|Hi|London|UK|NA|GB|unknown| WELSH#WELSH.COM
Rogério|Davies|Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Balázs| Roque| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Johny|Peniç| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Mike|Mane| Hi | USA |US|WLU|US|unknown| USA#WELSH.COM
Output:
888|Jobin|Matt|NORMALSQ|YUOZ|IOP|OPO||ID12|doorbell|geo#xyx.comd
1011|ICE|LAND||DUY|DUY|57T2P|EOP|dataset1|geo#xyx.com
53101||GUTTI|RR|Hi|London|UK|WLU|GB|dataset1|
|JOH|LIU|ABC|DUY|DUY|57T2P|EOP|unknown|geo#xyx.com
|JOH|LIU|ABC|DUY|DUY|57T2P|EOP|unknown|geo#xyx.com
|| RAVI|OLE|Hi|London|UK|NA|GB|unknown| WELSH#WELSH.COM
Rogério|Davies|Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Balázs| Roque| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Johny|Peniç| Hi|USA|US|WLU|US|unknown| USA#WELSH.COM
Mike|Mane| Hi | USA |US|WLU|US|unknown| USA#WELSH.COM
You can use Unicode properties to detect characters that belong to the Latin script and the Basic Latin block, which are the ones you seem to want preserved. Perl supports them in regular expressions:
perl -CSD -pe 's/[^\p{Basic Latin}\p{Latin}]//g' file.txt
(note that, unlike your sample output, it doesn't change 123 to 57T2P)
-CSD turns on UTF-8 decoding/encoding of input and output
-p reads the input line by line and prints each line after processing
s/PATTERN/REPLACEMENT/g is a global replacement: it replaces all occurrences of PATTERN with the replacement, which in this case is empty
[...] introduces a character class; ^ at the beginning negates it, i.e. we want to match anything that's not Latin or Basic Latin.
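Applied to the whole file, the cleaned text can be written to a new file (cleaned.txt is just a placeholder name here), or edited in place by adding -i:
perl -CSD -pe 's/[^\p{Basic Latin}\p{Latin}]//g' file.txt > cleaned.txt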
If you really have UTF-8 and want to keep only the extended ASCII characters (i.e., usually, Latin-1), iconv may work for you.
iconv -c -f UTF8 -t LATIN1 input_file > output_file
-c  Silently discard characters that cannot be converted instead of terminating when encountering such characters.
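If the result still needs to be UTF-8 afterwards, the conversion can be chained back, which effectively keeps only the characters representable in Latin-1 (a sketch):
iconv -c -f UTF8 -t LATIN1 input_file | iconv -f LATIN1 -t UTF8 > output_file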
Here is the most non-elegant solution to your problem:
$ sed -e 's/[^,./#|[:space:]0-9[=a=][=b=][=c=][=d=][=e=][=f=][=g=][=h=][=i=][=j=][=k=][=l=][=m=][=n=][=o=][=p=][=q=][=r=][=s=][=t=][=u=][=v=][=w=][=x=][=y=][=z=][=A=][=B=][=C=][=D=][=E=][=F=][=G=][=H=][=I=][=J=][=K=][=L=][=M=][=N=][=O=][=P=][=Q=][=R=][=S=][=T=][=U=][=V=][=W=][=X=][=Y=][=Z=]]//g' file.txt
To my big surprise I could not include [:punct:] in the list, because some of the symbols I want to remove are actually classified as punctuation.

Line entry/count difference between sed and nl on unix vs. mac

I have a simple and annoying problem, and I apologize for not posting an example. The files are big and I haven't been able to recreate the exact issue using smaller files:
These are tab-delimited files (some entries contain ", ;, or a single space character). On UNIX, when I look up a unique word via nl file | sed -n '/word/p', I see that my word is on exactly the same line in all my files.
Now I copy the files to my Mac and run the same command on the exact same files, but the line numbers are all different! The total number of lines via wc -l is still identical to what I get on Unix, but nl file | tail -n1 shows a different last line number. Yet when I take the number returned by nl on Unix and print that line via sed -n '12345p' file, I get the correct entry!?
My question: I must have something in some of my lines that is interpreted as a line break on my Mac but not on Unix, and only by nl, not sed. Can anyone help me figure out what it is? I already know it's not on every line. The issue persists when I load the data into R, and I'm stumped. Thank you!
"Phantom newlines" can be hidden in text in the form of a multi-byte UTF-8 character called an "overlong sequence".
UTF-8 normally represents ASCII characters as themselves: UTF-8 bytes in the range 0 to 127 are just those character values. However, overlong sequences can be used to (incorrectly) encode ASCII characters using multiple UTF-8 bytes (which are in the range 0x80-0xFF). A properly written UTF-8 decoder must detect overlong sequences and somehow flag them as invalid bytes. A naively written UTF-8 decoder will simply extract the implied character.
Thus, it's possible that your data is being treated as UTF-8, and contains some bytes which look like an overlong sequence for a newline, and that this is fooling some of the software you are working with. A two-byte overlong sequence for a newline would look like C0 8A, and a three-byte overlong sequence would be E0 80 8A.
It is hard to come up with an alternative hypothesis not involving character encodings.
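If you want to check whether such bytes are actually present, a byte-level search can help (a sketch, assuming GNU grep built with PCRE support for -P):
LC_ALL=C grep -n -P '\xC0\x8A|\xE0\x80\x8A' file
Any line numbers it reports would point at overlong newline encodings; od -c on those lines then shows the raw bytes.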

Reading in text file with unmatched quotes

I have a large (>1GB) CSV file I'm trying to read into a data frame in R.
The non-numeric fields are enclosed in double-quotes so that internal commas are not interpreted as delimiters. That's well and good. However, there are also sometimes unmatched double-quotes in an entry, like "2" Nails".
What is the best way to work around this? My current plan is to use a text processor like awk to change the quoting character from the double quote (") to a non-conflicting character like the pipe (|). My heuristic for finding true quoting characters is a double quote next to a comma, or at the start or end of a line:
gawk '{gsub(/(^\")|(\"$)/,"|");gsub(/,\"/,",|");gsub(/\",/,"|,");print;}' myfile.txt > newfile.txt
A related question suggests passing quote="" to read.csv, but that solution is not viable for me because my file has non-delimiting commas enclosed in the quotation marks.
Your idea of looking for quotes next to a comma is probably the best thing you can do; you could however try to turn it around and have the regex escape all the quotes that are not next to a comma (or start/end of line):
Search for
(?<!^|,)"(?!,|$)
and replace all the matches with "" (a doubled quote, which is the standard CSV escape for a literal quote inside a field).
R might not be the best tool for this because its regex engine doesn't have a multiline mode, but in Perl it would be a one-liner:
$subject =~ s/(?<!^|,)"(?!,|$)/""/mg;
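Run over the whole file from the shell, that might look like the following (fixed.txt is just a placeholder name; since -p processes one line at a time, the /m modifier isn't needed):
perl -pe 's/(?<!^|,)"(?!,|$)/""/g' myfile.txt > fixed.txt
read.csv should then be able to parse the file with its default quote handling, because embedded quotes are doubled in the standard CSV way.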
This would be a more foolproof variant of Tim's solution, in case non-boundary commas surrounded by whitespace appear next to the quotes:
(?<!,\s+)"(?!\s+,$)
Note, however, that Perl does not allow a variable-length pattern such as \s+ inside a lookbehind, so this variant would need to be reworked before it compiles. I'm not sure if it would have any other bugs though.

Unix sort text file with user-defined newline character

I have a plain text file where the newline character is not "\n" but a special character.
Now I want to sort this file.
Is there a direct way to specify a custom newline character when using the Unix sort command?
I would prefer not to use a script for this, as far as possible.
Please note that the data in the text file contain \n, \r\n, and \t characters (the reason for such data is application specific, so please don't comment on that).
The sample data is as below:
1111\n1111<Ctrl+A>
2222\t2222<Ctrl+A>
3333333<Ctrl+A>
Here Ctrl+A is the newline character.
Use perl -001e 'print sort <>' to do this:
prompt$ cat -tv /tmp/a
2222^I2222^A3333333^A1111
1111^A
prompt$ perl -001e 'print sort <>' /tmp/a | cat -tv
1111
1111^A2222^I2222^A3333333^Aprompt$
That works because character 001 (octal 1) is control-A ("\cA"), which is your record terminator in this dataset.
You can also give the code point in hex using -0xHHHHH. Note that it must be a single code point, not a string, when using this shortcut. There are ways of doing it for strings and even regexes that involve infinitesimally more code.
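For example, the same sort as above, written with the hex form of the separator, would be:
perl -0x1 -e 'print sort <>' /tmp/a | cat -tv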

How to handle non-printable ASCII character parameters?

I'm working on a project where we are dealing with importing/exporting data from database tables using ksh scripts and Perl scripts. We have an existing process to export data from a table to a file and it is then imported into another system.
Here's the catch: the export process dumps out pipe-delimited files, while the system doing the import expects files delimited by the ASCII group separator character, which is decimal 29, hexadecimal 1d, or octal 35. It shows up in vi as ^]. Right now, I'm converting the delimiter via a Perl script. What I'd like to do is tell our export process to just use the delimiter we are expecting. Something like:
export_table.ksh -d '\035'
The problem is I can't figure out how to pass this character to the export script.
I've tried all kinds of combinations of single quotes, double quotes, backslashes, and the octal and hex versions of this character.
I'm on Solaris 10 using ksh and/or Perl.
Have you tried ksh's ANSI-C style quoting, $'...'? Be careful with the value, though: $'\x29' is hex 29, i.e. the ')' character, while the group separator is decimal 29, which is hex 1d. Likewise, the ^] you see in vi is just how the single control character 0x1d is displayed, not a literal ^ followed by ]. You can check what the shell actually emits with:
echo $'\x1d' | cat -v
which prints ^], so you just need to do:
export_table.ksh -d $'\x1d'
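One caveat: $'...' quoting is a ksh93 feature, and the stock /bin/ksh on Solaris 10 is typically ksh88, which doesn't understand it. In that case, a command substitution around printf is a portable alternative (a sketch, reusing the invocation from the question):
export_table.ksh -d "$(printf '\035')"
printf interprets \035 as octal 35, which is the same group separator character (decimal 29).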
In bash(1), one can press Ctrl-V before typing a character to enter that character verbatim at the prompt. Perhaps ksh(1) does the same?
