Regex to remove junk from a .txt file in Unix

I am new to Unix.
I am using a sed command to remove junk from a .txt file.
This is the command I used:
sed -e 's/[^ -~]//g' final.txt > file1_now
The junk is getting removed, but I am facing a problem: if my data contains a '-', it is removed as well. I don't want that.
Appreciate your help.
Thanks,
Binayak

Try this:
sed -e 's/[^ -~-]//g' final.txt > file1_now
The - character must be the last (or the first) character in your character class, because anywhere else it has a different meaning: it denotes a range, as in [a-z].
The - character is treated as a literal character if it is the last character within the brackets, or the first one (after the ^): [abc-], [-abc].
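A quick illustration of the placement rule, using a made-up sample string:
$ echo 'abc-def~ghi' | sed 's/[a-c]//g'      # a-c is a range: a, b and c are deleted
-def~ghi
$ echo 'abc-def~ghi' | sed 's/[ac-]//g'      # trailing - is literal: a, c and - are deleted
bdef~ghi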
http://en.wikipedia.org/wiki/Regular_expression

Related

Unix: multi and single character delimiter in cut or awk commands

This is the string I have:
my_file1.txt-myfile2.txt_my_file3.txt
I want to remove all the characters after the first "_" that follows the first ".txt".
From the above example, I want the output to be my_file1.txt-myfile2.txt. I have to search for the first occurrence of ".txt" and continue parsing until I find the underscore character, then remove everything from there on.
Is it possible to do this with sed/awk/cut, etc.?
You can't do this job with cut, but you can with sed or awk:
$ sed 's/\.txt/\n/g; s/\([^\n]*\n[^_]*\)_.*/\1/; s/\n/.txt/g' file
my_file1.txt-myfile2.txt
$ awk 'match($0,/\.txt[^_]*_/){print substr($0,1,RSTART+RLENGTH-2)}' file
my_file1.txt-myfile2.txt
You could also try the following, written based on your shown samples:
awk '{sub(/\.txt_.*/,".txt")} 1' Input_file
This simply substitutes everything from .txt_ through the end of the line with .txt, then prints the line.
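As a sanity check, feeding the sample string in on a pipe (instead of from Input_file):
$ echo 'my_file1.txt-myfile2.txt_my_file3.txt' | awk '{sub(/\.txt_.*/,".txt")} 1'
my_file1.txt-myfile2.txt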

How can I delete the last comma in each record of a comma-delimited csv?

Example input: A,B,"C,D",E,F,G,
Example output: A,B,"C,D",E,F,G
The issue I face when using the cut command to accomplish this is that my data contains commas as well.
I wish to do this in an automated process, so Linux commands would be helpful.
This should work:
sed 's/,$//g' < input_file.csv > output_file.csv
,$ is a regular expression that matches a comma at the end of each line; the s command replaces it with nothing.
Proof:
$ echo 'A,B,"C,D",E,F,G,' | sed 's/,$//g'
A,B,"C,D",E,F,G
Note that some CSV dialects can also have line endings inside double quotes. If there happens to be a comma right before such a quoted line ending, that comma will also be stripped. If you want to handle this case correctly, you'll need a proper CSV parser.
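If your sed supports in-place editing (GNU sed does), the -i option avoids the separate output file:
sed -i 's/,$//' input_file.csv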

Special character removal with 'sed'

I'm facing an issue where I'm getting some special characters at the beginning of lines in my file; a snippet is below:
^#<9b>200931350515,test1,910,420032400825443
^#<9a>200931350515,test1,910,420032400825443
^#<9d>200931746996,test2,910,420031390086807
I'm using the following command to remove anything other than numbers in the first column:
sed 's/^[^0-9]*//g' file.dat
No success with that. The file is created during a FastExport from Teradata, by the way; the process adds some special characters by itself during the extract.
Any idea on the command?
If you want to remove any non-ASCII characters anywhere in a line, you can use tr.
tr -d '\000\200-\377' <file >file.new
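If you're not sure exactly which bytes are creeping in, inspecting a line first can help you decide what to delete; od is the standard tool for that:
$ head -1 file.dat | od -c | head -3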
Using perl
perl -lne 'print /\d+,.*/g'
200931350515,test1,910,420032400825443
200931350515,test1,910,420032400825443
200931746996,test2,910,420031390086807
The pattern matches the digits up to the first comma and then everything after it; only the matched part is printed, so the leading junk is dropped.
sed is too big a gun for such a small problem;
use cut to remove the beginning of each line:
cut -b 2- file.dat
Here 2- is the range of bytes you want to retain. I'm not sure how many such strange characters you have there, so I would experiment with 1-, 2-, 3-, 4-, 5-, etc.
It looks like the number of characters that should be removed is constant across all lines. To remove a fixed number of characters from the beginning of each line, you could simply do
$ sed 's/^.....//' input >output
Adjust the number of dots to fit your need.
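For instance, with three visible stand-in characters in place of the unprintable bytes (made up, since the real bytes can't be typed here):
$ echo 'XYZ200931350515,test1,910' | sed 's/^...//'
200931350515,test1,910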

Delete lines containing a specific string starting with a dollar sign using Unix sed

I am very new to Unix.
I have a parameter file Parameter.prm containing following lines.
$$ErrorTable1=ErrorTable1
$$Filename1_New=FileNew.txt
$$Filename1_Old=FileOld.txt
$$ErrorTable2=ErrorTable2
$$Filename2_New=FileNew.txt
$$Filename2_Old=FileOld.txt
$$ErrorTable3=ErrorTable3
$$Filename3_New=FileNew.txt
$$Filename3_Old=FileOld.txt
I want to get the output as
$$ErrorTable1=ErrorTable1
$$ErrorTable2=ErrorTable2
$$ErrorTable3=ErrorTable3
Basically, I need to delete the lines starting with $$Filename.
Since $ is a special character in regular expressions, I am not able to match it as a literal string. How can I accomplish this using sed?
With sed:
$ sed '/$$Filename/d' infile
$$ErrorTable1=ErrorTable1
$$ErrorTable2=ErrorTable2
$$ErrorTable3=ErrorTable3
The /$$Filename/ part is the address: for every line matching it, the command that follows is executed. The command is d, which deletes the line; lines that don't match are printed as is. This works unescaped because $ is only special at the end of a regular expression; elsewhere most seds treat it as a literal character (escaping it, as in the grep example below, is the portable choice).
Extracting information from a text file based on a pattern search is a job for grep:
grep ErrorTable file
or even
grep -F '$$ErrorTable' file
-F tells grep to treat the search term as a fixed string instead of a regular expression.
Just to answer your question, if a regular expression needs to search for characters which have a special meaning in the regex language, you need to escape them:
grep '\$\$ErrorTable' file
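The same escaping works in sed, if you would rather delete the unwanted lines than select the wanted ones:
$ sed '/^\$\$Filename/d' Parameter.prm
$$ErrorTable1=ErrorTable1
$$ErrorTable2=ErrorTable2
$$ErrorTable3=ErrorTable3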

How to remove special characters from a csv file in unix

I am facing a hard time removing the special characters from a CSV file.
My process is like this: in my output table I have some data like this:
Col1
BC,BS/APP
Like this, I have another 10 columns where there is a chance of getting special characters. When I tried with PATINDEX, I was able to remove only the first special character; removing the other characters needs a WHILE loop, which takes a long time.
So I tried to remove the special characters after bcp-ing the data out to the CSV file. Below are the bcp and sed commands I am using:
bcp_with_error_check tempdb..STT_IM166_WEB_MWE out temp.dat -SSVR -UUSR -PPWD -c -b1000 -t'","'
sed -e 's/,"0/,="0/g;s/,"1/,="1/g;s/,"2/,="2/g;s/,"3/,="3/g;s/,"4/,="4/g;s/,"5/,="5/g;s/,"6/,="6/g;s/,"7/,="7/g;s/,"8/,="8/g;s/,"9/,="9/g' temp.dat > temp1.dat
sed -e 's/$/"/g' temp1.dat > temp2.dat
sed -e 's/^/="/g' temp2.dat >> Filename.csv
My problem is that since it is a CSV file, if I remove the comma (,) as a special character, it disturbs the file layout.
I can replace the comma alone in the database, but I cannot find a command that excludes the comma and removes the other characters. Please help me out; I am in real need of this command.
I'm not clear what you're really after, but at the very least you can shrink your first sed command by a factor of 10:
sed -e 's/,"\([0-9]\)/,="\1/g' temp.dat > temp1.dat
The pattern looks for comma, double quote and a digit (and remembers what the digit is); it is replaced by comma, equals, double quote and the remembered digit.
Unless you have a reason for the different temporary files, you can collapse the three sed commands into one with:
sed -e 's/,"\([0-9]\)/,="\1/g' -e 's/$/"/g' -e 's/^/="/g' temp.dat >>Filename.csv
And if bcp_with_error_check will write to standard output when you omit the out temp.dat arguments, then you don't need any temporary files at all (which is generally a good idea: if two people innocently ran this command at the same time in the same directory, they'd be trampling over each other's temporary files, or running into problems because they couldn't). With no temporary files, you've only got the final file name, Filename.csv, to worry about.
However, that does not address your main question — it just improves your scripting.
