special character removal 'sed' - unix

I'm facing an issue where I'm getting some special characters at the beginning of each line in my file; a sample is below:
^#<9b>200931350515,test1,910,420032400825443
^#<9a>200931350515,test1,910,420032400825443
^#<9d>200931746996,test2,910,420031390086807
I'm using the following command to remove anything other than numbers in first column:
sed 's/^[^0-9]*//g' file.dat
That had no effect. The file is created by a FastExport from Teradata, and the export process itself adds these special characters.
Any idea on the command?

If you want to remove NUL bytes and any non-ASCII bytes (octal 200 to 377) anywhere in a line, you can use tr.
tr -d '\000\200-\377' <file >file.new

Using perl
perl -lne 'print /\d+,.*/g'
200931350515,test1,910,420032400825443
200931350515,test1,910,420032400825443
200931746996,test2,910,420031390086807
The pattern matches the digits before the first comma, the comma itself, and the rest of the line, so anything ahead of those digits is dropped.

sed is too big a gun for such a small problem;
use cut to remove the beginning of each line:
cut -b 2- file.dat
Here 2- is the range of bytes you want to retain. I'm not sure how many such strange characters you have there, so I would experiment with 1-, 2-, 3-, 4-, 5-, etc.

It looks like the number of characters that should be removed is constant across all lines. To remove a fixed number of characters from the beginning of each line, you could simply do
$ sed 's/^.....//' input >output
Adjust the number of dots to fit your need.
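One possible reason the original sed 's/^[^0-9]*//' did nothing, though nothing in the thread confirms it: in a UTF-8 locale, GNU sed does not let [^0-9] or . match bytes such as \233 that are not valid UTF-8. A minimal sketch, assuming that is the cause, forces the C locale so that every byte counts as one character. The name strip_leading_junk and the sample bytes are made up for illustration.

```shell
# Hypothetical helper: drop everything before the first digit on each line.
# LC_ALL=C makes sed treat every byte as a single character (assumed fix).
strip_leading_junk() {
  LC_ALL=C sed 's/^[^0-9]*//' "$@"
}

# Made-up sample line with a control byte and a high-bit byte in front
printf '\036#\233200931350515,test1,910,420032400825443\n' | strip_leading_junk
```

Called with a file name (strip_leading_junk file.dat) it reads that file instead of standard input.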

Related

How can I delete the last comma in each record of a comma-delimited csv?

Example Input : A,B,"C,D",E,F,G,
Example Output : A,B,"C,D",E,F,G
The issue I face with using the 'cut' command to accomplish the same is that my data has comma as well.
I wish to do the same in an automated process. So, Linux commands would be helpful.
This should work:
sed 's/,$//g' < input_file.csv > output_file.csv
,$ is a regular expression that matches a comma at the end of each line. The s command replaces that match with nothing.
Proof:
$ echo 'A,B,"C,D",E,F,G,' | sed 's/,$//g'
A,B,"C,D",E,F,G
Note that some CSV dialects can also have line endings inside double quotes. If there happens to be a comma right before such a quoted line ending, that comma will also be stripped. If you want to handle this case correctly, you'll need a proper CSV parser.

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer and it seems like it's possible to do that with a sed or a grep command by using something like this to look forward and back and extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: Adding also the working command I've used just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
First off, you might want to think about your regular expression. While the one you have you say works, I wonder if it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines that start with a case-insensitive "PALL", possibly preceded by any number of other characters that start with a capital letter, and that must not end in a dot (with grep -P, the backslash inside the brackets only escapes the dot). So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So .. in the command line you're using right now, your sed command is utterly ignored, as sed's output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from whence that match originated. That's also what you're seeing in my example above -- two inputs, one x.txt and the other - a.k.a. stdin, separated by a colon from the match they supplied.
While grep does include the most minuscule capability for filtering (with -o, or GNU grep's \K with optional Perl compatible RE), it does NOT provide you with any options for formatting the filename. Since you can't change how grep prints the filename, you're limited to either parsing the output you've got, or using some other tool.
Parsing is easy, if your filenames are predictably structured as they seem to be from the two examples you've provided.
For this, we can ignore that these lines contain a file and data. For the purpose of the filter, they are a stream which follows a pattern. It looks like you want to strip off all characters from the beginning of each line up to and not including the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
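Note that stripping only the leading non-digits still leaves the .txt extension in front of the colon. A sketch that removes both, assuming every file name ends in digits and underscores followed by .txt, as in the two examples given. The name strip_name is a hypothetical helper. It also accepts the - separator that grep -A puts after the file name on context lines.

```shell
# Hypothetical helper: turn "aja_EPL_1999_03_02.txt:" into "1999_03_02:"
# at the start of each line of grep output read from stdin.
strip_name() {
  sed -E 's/^[^0-9]*([0-9_]+)\.txt([:-])/\1\2/'
}

printf '%s\n' 'aja_EPL_1999_03_02.txt:PALLILENNUD : korraga' | strip_name
```

Lines that do not start with such a file name, like the -- group separator, pass through unchanged.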
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
/\.$/ {next}                            # skip lines ending in a dot
/^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll][Ll]/ {   # lines to match, any case of "pall"
  f=FILENAME
  sub(/^[^0-9]*/,"",f)                  # strip unwanted part of filename, like sed
  sub(/\.txt$/,"",f)                    # drop the extension as well
  printf "%s:%s\n", f, $0
  if ((getline) > 0)                    # simulate the "-A 1" from grep
    printf "%s:%s\n", f, $0
}' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.

Unix redact data

I want to mask only the 2nd column of the data.
Input:
First_name,second_name,phone_number
ram,prakash,96174535
hari,pallavi,98888234
anurag,aakash,82783784
Output Expected:
First_name,second_name,phone_number
ram,*******,96174535
hari,*******,98888234
anurag,******,82783784
The sed program will do this just fine:
sed '2,$s/,[^,]*,/,*****,/'
The 2,$ only operates on lines 2 through to the end of the file (to leave the header line alone) and the substitute command s/,[^,]*,/,*****,/ will replace anything between the first and second comma with the mask *****.
Note that I've specifically used a fixed number of asterisks in the replacement string. Whether you're hiding passwords or anonymising data (as seems to be the case here), you don't want to leak any information, including the size of the names being replaced.
If you really want to use the same number of characters as in the original data, and you also want to cater for the possibility of replacing multiple fields, you can use something like:
awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{gsub(/./,"*",$2);gsub(/./,"*",$4);print}'
This will also leave the first line untouched but will anonymise columns two and four (albeit with the information leakage previously mentioned):
echo 'First_name,second_name,phone_number,other
ram,prakash,96174535,abc
hari,pallavi,98888234,def
anurag,aakash,82783784,g
bob,santamaria,124,xyzzy' | awk -F, 'BEGIN{OFS=FS}NR==1{print;next}{gsub(/./,"*",$2);gsub(/./,"*",$4);print}'
First_name,second_name,phone_number,other
ram,*******,96174535,***
hari,*******,98888234,***
anurag,******,82783784,*
bob,**********,124,*****
Doing multiple columns with full anonymising would entail using $2="*****" rather than the gsub (for both columns of course).
Another in awk. Using gsub to replace every char in $2 with an *:
$ awk 'BEGIN{FS=OFS=","}NR>1{gsub(/./,"*",$2)}1' file
First_name,second_name,phone_number
ram,*******,96174535
hari,*******,98888234
anurag,******,82783784
You could also try the following, which sets the second field to a fixed mask on every line after the header:
awk -F"," 'NR>1{$2="*******"} 1' OFS=, Input_file

How to remove special characters from a csv file in unix

I am having a hard time removing the special characters from a CSV file.
My process is like this: in my output table I have some data like this:
Col1
BC,BS/APP
I have another 10 columns like this where special characters may appear. When I tried PATINDEX, I was able to remove only the first special character; removing the others requires a WHILE loop, which is very slow.
So I tried to remove the special characters after exporting the data to the CSV file with bcp. Below are the bcp and sed commands I am using:
bcp_with_error_check tempdb..STT_IM166_WEB_MWE out temp.dat -SSVR -UUSR -PPWD -c -b1000 -t'","'
sed -e 's/,"0/,="0/g;s/,"1/,="1/g;s/,"2/,="2/g;s/,"3/,="3/g;s/,"4/,="4/g;s/,"5/,="5/g;s/,"6/,="6/g;s/,"7/,="7/g;s/,"8/,="8/g;s/,"9/,="9/g' temp.dat > temp1.dat
sed -e 's/$/"/g' temp1.dat > temp2.dat
sed -e 's/^/="/g' temp2.dat >>Filename.csv
My problem is that, since it is a CSV file, removing the comma (,) as a special character disturbs the file layout.
I can replace the commas within the data in the database, but I can't find a command that keeps the comma and removes the other special characters. Please help; I badly need this command.
I'm not clear what you're really after, but at the very least you can shrink your first sed command by a factor of 10:
sed -e 's/,"\([0-9]\)/,="\1/g' temp.dat > temp1.dat
The pattern looks for comma, double quote and a digit (and remembers what the digit is); it is replaced by comma, equals, double quote and the remembered digit.
Unless you have a reason for the different temporary files, you can collapse the three sed commands into one with:
sed -e 's/,"\([0-9]\)/,="\1/g' -e 's/$/"/g' -e 's/^/="/g' temp.dat >>Filename.csv
And if bcp_with_error_check will write to standard output if you omit the out temp.dat arguments, then you don't need any temporary files (which is generally a good idea). Note that if two people innocently ran this command at the same time in the same directory, they'd be trampling over each other's temporary files (or running into problems because they couldn't). With no temporary files, you've only got the final file name, Filename.csv to worry about.
However, that does not address your main question — it just improves your scripting.
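For the main question, a rough sketch, assuming "special characters" means anything other than letters, digits, whitespace and the punctuation the CSV layout itself relies on (comma, double quote, equals sign, dot, underscore, hyphen). That list is a guess, since the question never defines it, and strip_special is a hypothetical helper name; adjust the bracket expression to whatever should survive.

```shell
# Hypothetical helper: delete every character that is not in the keep-list.
# Assumed keep-list: letters, digits, whitespace and , " = . _ -
# The hyphen is placed last in the brackets so that it is taken literally.
strip_special() {
  LC_ALL=C sed 's/[^[:alnum:][:space:],"=._-]//g' "$@"
}

printf '%s\n' 'BC,BS/APP' | strip_special
```

Because the comma is in the keep-list, the field separators and the file layout are left alone.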

Regex to remove junk from a .txt file in Unix

I am new to Unix.
I am using a sed command to remove junk from a .txt file in Unix.
This is the command that I used:
sed -e 's/[^ -~]//g' final.txt > file1_now
The junk is removed, but if my data contains a '-', that is removed as well. I don't want that.
Appreciate your help.
Thanks,
Binayak
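Try doing this:
sed -e 's/[^ -~-]//g' final.txt > file1_now
Between two characters in a bracket expression, - denotes a range, as in [a-z]; here ' -~' is the range from space to tilde, which covers all printable ASCII. Writing [^ ~-] instead would drop that range and delete everything except spaces, tildes and hyphens.
The - character is treated as a literal character only if it is the last or the first (after the ^) character within the brackets: [abc-], [-abc]. The ASCII hyphen already falls inside the space-to-tilde range, so if a dash is still being removed it is probably a non-ASCII dash (an en dash, for example), which has to be added to the brackets explicitly.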
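A quick way to see how the position of - changes a bracket expression, sketched here with sed in the C locale. The two function names are hypothetical and the sample string is made up.

```shell
# ' -~' is a range (space through tilde, i.e. printable ASCII),
# so only non-printable bytes are deleted.
keep_printable() {
  LC_ALL=C sed 's/[^ -~]//g'
}

# With - last it is a literal: everything except space, ~ and - is deleted.
keep_only_space_tilde_dash() {
  LC_ALL=C sed 's/[^ ~-]//g'
}

printf 'id-42\001 ~ ok\n' | keep_printable
printf 'id-42 ~ ok\n' | keep_only_space_tilde_dash
```

The first call keeps the hyphen along with the letters and digits; the second leaves only the hyphen, the tilde and the spaces around it.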
http://en.wikipedia.org/wiki/Regular_expression
