How to replace string by an escape character plus string in unix

How can I convert a single line like the one below:
794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|794170|VWSD|CCC|z|x|c|v|STRING3|...and so on
into linefeed-delimited records?
Expected output:
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|
and so on.
BTW, I'm not a Unix expert and just want steps or simple commands to resolve this. I appreciate your help.

I assume your string is in a file named "x"; then you can do this.
Here sed inserts the placeholder character ":" before every occurrence of 794170, and tr then turns each ":" into a newline. Choose a different placeholder if ":" occurs in your data. The output is as you desire, except that there is an extra empty line at the beginning.
sed 's/794170/:794170/g' x | tr ':' '\n'
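A placeholder-free variant is sketched below, assuming the record ID (794170 here) always follows a "|" when it starts a new record and never appears as an ordinary field value. The function name split_records is made up for illustration; it uses awk to put a newline before every "|ID|" boundary, so no extra empty line appears at the top.

```shell
# Hypothetical helper: split one long line into records that start with a given ID.
# Assumption: the ID appears between two "|" characters only at record boundaries.
split_records() {
    awk -v id="$1" '{
        # Turn every "|ID|" into "|", newline, "ID|"; the first ID on the
        # line has no "|" before it, so no leading blank line is produced.
        gsub("[|]" id "[|]", "|\n" id "|")
        print
    }'
}

# Usage: split_records 794170 < x
```

If the ID can also occur as a regular field value, this splits in the wrong place, so a stricter boundary test would be needed.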

You can use the fold command, provided every record is exactly 32 characters wide, as in your sample:
$ fold -w32 file
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|

I don't think you can do it with a simple command. There are several options for creating scripts that can split lines more or less arbitrarily. Any Unix will have the awk utility available. On most systems you will also find Python and Perl. My guess is that a Perl or Python script is the easiest way to split lines like the one you gave.
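This would be one way to do it in Python:
inline = "794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|794170|VWSD|CCC|z|x|c|v|STRING3|"
# Split on the record ID, then put the ID back at the front of each piece.
splits = ['794170' + s for s in inline.split('794170')]
# splits[0] comes from the empty string before the first ID, so skip it.
for s in splits[1:]:
    print(s)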
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|

Related

How can I delete the last comma in each record of a comma-delimited csv?

Example Input : A,B,"C,D",E,F,G,
Example Output : A,B,"C,D",E,F,G
The problem with using the 'cut' command for this is that my data contains commas as well.
I want to do this in an automated process, so Linux commands would be helpful.
This should work:
sed 's/,$//g' < input_file.csv > output_file.csv
,$ is a regular expression that matches a comma at the end of a line; the s command replaces it with nothing. The g flag is harmless but unnecessary, since the pattern can match at most once per line.
Proof:
$ echo 'A,B,"C,D",E,F,G,' | sed 's/,$//g'
A,B,"C,D",E,F,G
Note that some CSV dialects can also have line endings inside double quotes. If there happens to be a comma right before such a quoted line ending, that comma will also be stripped. If you want to handle this case correctly, you'll need a proper CSV parser.

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer and it seems like it's possible to do that with a sed or a grep command by using something like this to look forward and back and extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: Adding also the working command I've used just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
First off, you might want to think about your regular expression. While the one you have you say works, I wonder if it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines that contain a case-insensitive "PALL", either at the very start or preceded by text that starts with a capital letter, and that do not end in a dot. (With grep -P, the backslash inside [^\.] merely escapes the dot.) So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So, in the command line you're using right now, your sed command is utterly ignored, as sed's output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from whence that match originated. That's also what you're seeing in my example above -- two inputs, one x.txt and the other - a.k.a. stdin, separated by a colon from the match they supplied.
While grep does include a minimal capability for filtering (with -o, or \K when GNU grep is run with Perl-compatible regular expressions), it does NOT provide any options for formatting the file name. Since you can't change how grep prints the file name, you're limited to either parsing the output you've got or using some other tool.
Parsing is easy, if your filenames are predictably structured as they seem to be from the two examples you've provided.
For this, we can ignore that these lines contain a file and data. For the purpose of the filter, they are a stream which follows a pattern. It looks like you want to strip off all characters from the beginning of each line up to and not including the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
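Both filters above keep the .txt extension, so the result is 1999_03_02.txt:... rather than the bare date. A minimal sketch that drops the extension as well is shown below. It assumes every file name ends in a run of digits and underscores followed by .txt, and strip_name is a made-up helper name. It accepts both the ":" that grep puts after the name on matching lines and the "-" it uses on -A context lines.

```shell
# Hypothetical filter for grep output such as
#   aja_EPL_1999_03_02.txt:PALLILENNUD : ...
# Assumption: names look like <prefix without digits><digits and underscores>.txt
strip_name() {
    # \1 = the date part, \2 = the separator grep printed (":" or "-")
    sed 's/^[^0-9:]*\([0-9_]*\)\.txt\([-:]\)/\1\2/'
}

# Usage: grep 'pattern' *.txt | strip_name
```

Lines that do not start with such a file name are passed through unchanged.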
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
  /\.$/ {next}                            # skip lines ending in a dot
  /^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll][Ll]/ {   # lines to match
    f=FILENAME
    sub(/^[^0-9]*/,"",f)                  # strip unwanted part of filename, like sed
    sub(/\.txt$/,"",f)                    # also drop the extension
    printf "%s:%s\n", f, $0
    if ((getline) > 0)                    # simulate the "-A 1" from grep
      printf "%s:%s\n", f, $0
  }' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.

Field separator to be used only when it is not escaped, using awk

I have one question. Suppose I am using "=" as the field separator, and my string contains, for example:
abc=def\=jkl
If I use = as the field separator, it will split into 3 fields:
abc def\ jkl
But since I have escaped the second "=", my output should be:
abc def\=jkl
Can anyone please suggest how I can achieve this?
Thanks in advance.
I find it simplest to convert the escaped separator to some other string or character that doesn't appear in your input records, process the fields as normal, and convert back within each field when necessary. I tend to use RS if it's not a regexp* (since it cannot appear within a record), or the awk builtin SUBSEP otherwise (since if that appears in your input you have other problems), e.g.:
$ cat file
abc=def\=jkl
$ awk -F= '{
    gsub(/\\=/,RS)
    for (i=1; i<=NF; i++) {
        gsub(RS,"\\=",$i)
        print i":"$i
    }
}' file
1:abc
2:def\=jkl
* The issue with using RS if it is an RE (i.e. multiple characters) is that the gsub(RS...) within the loop could match a string that didn't get resolved to a record separator initially, e.g.
$ echo "aa" | gawk -v RS='a$' '{gsub(RS,"foo",$1); print "$1=<"$1">"}'
$1=<afoo>
When the RS is a single character, e.g. the default newline, that cannot happen so it's safe to use.
If your data is like the example in your question, it can be done.
awk doesn't support look-around in regular expressions, so it would be a bit difficult to get what you want just by setting FS.
If I were you, I would do some preprocessing to make the data easier for awk to handle. Or you could read the line and use other awk functions, e.g. gensub(), to remove the = characters you don't want in the result, and then split... But I guess you want to achieve the goal through the field separator, so I won't give those solutions.
However, it can be done with the FPAT variable (a GNU awk feature):
awk -vFPAT='\\w*(\\\\=)?\\w*' '...' file
This will work for your example. I am not sure if it will work for your real data.
Let's take an example and split this string: "abc=def\=jkl=foo\=bar=baz"
kent$ echo "abc=def\=jkl=foo\=bar=baz"|awk -vFPAT='\\w*(\\\\=)?\\w*' '{for(i=1;i<=NF;i++)print $i}'
abc
def\=jkl
foo\=bar
baz
I think that is the result you want, isn't it?
My awk version:
kent$ awk --version|head -1
GNU Awk 4.0.2

Unix sort text file with user-defined newline character

I have a plain text file where the newline character is not "\n" but a special character.
Now I want to sort this file.
Is there a direct way to specify a custom newline character when using the unix sort command?
I'd like to avoid using a script for this as far as possible.
Please note that the data in the text file contains \n, \r\n, and \t characters (the reason for such data is application specific, so please don't comment on that).
The sample data is as below:
1111\n1111<Ctrl+A>
2222\t2222<Ctrl+A>
3333333<Ctrl+A>
Here Ctrl+A is the newline character.
Use perl -001e 'print sort <>' to do this:
prompt$ cat -tv /tmp/a
2222^I2222^A3333333^A1111
1111^A
prompt$ perl -001e 'print sort <>' /tmp/a | cat -tv
1111
1111^A2222^I2222^A3333333^Aprompt$
That works because character 001 (octal 1) is control-A ("\cA"), which is your record terminator in this dataset.
You can also give the code point in hex using -0xHHHHH. Note that with this shortcut it must be a single code point, not a string. There are ways of doing it for strings and even regexes that involve infinitesimally more code.
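If Perl is not available, roughly the same effect can be had with standard tools by swapping the roles of newline and Ctrl+A around the sort. This is only a sketch: sort_records is a made-up name, it assumes every record (including the last) ends in Ctrl+A, and it forces the C locale so that the ordering is plain byte order.

```shell
# Hypothetical helper: sort records that are terminated by Ctrl+A (octal 001).
sort_records() {
    # 1. Swap newline and Ctrl+A so each record sits on one line
    #    (embedded newlines temporarily become Ctrl+A).
    # 2. Sort the lines bytewise.
    # 3. Swap the two characters back.
    tr '\n\001' '\001\n' | LC_ALL=C sort | tr '\n\001' '\001\n'
}

# Usage: sort_records < /tmp/a > /tmp/a.sorted
```

Because the swap is symmetric, the newlines, carriage returns and tabs inside each record come out of the pipeline untouched.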

Unix ksh loop and put the result into variable

I have a simple thing to do, but I'm a novice in UNIX.
So, I have a file, and each line contains an ID.
I need to go through the file and put all the IDs into one variable.
I've tried something like what I would do in Java, but it does not work.
for variable in `cat myFile.txt`
do
    param=`echo "${param} ${variable}"`
done
It does not seem to add all the values to param.
Thanks.
I'd use:
param=$(<myFile.txt)
The variable holds the IDs separated by white space (actually newlines). When it is used without quotes, the shell splits it into separate words, so the IDs come out separated by spaces, as in:
echo $param
If used with quotes, the IDs will remain on separate lines, as in:
echo "$param"
Note that the Korn shell special-cases the '$(<file)' notation and does not fork and execute any command.
Also note that your original idea can be made to work more simply:
param=
for variable in `cat myFile.txt`
do
    param="${param} ${variable}"
done
This introduces a blank at the front of the parameter; it seldom matters. Interestingly, you can avoid the blank at the front by having one at the end, using param="${param}${variable} ". This also works without messing things up, though it looks as though it jams things together. Also, the '${var}' notation is not necessary, though it does no harm either.
And, finally for now, it is better to replace the back-tick command with '$(cat myFile.txt)'. The difference becomes crucial when you need to nest commands:
perllib=$(dirname $(dirname $(which perl)))/lib
vs
perllib=`dirname \`dirname \\\`which perl\\\`\``/lib
I know which I prefer to type (and read)!
Try this:
param=`cat myFile.txt | tr '\n' ' '`
The tr command translates all occurrences of \n (new line) to spaces. Then we assign the result to the param variable.
Lovely.
param="$(< myFile.txt)"
or
while read line
do
    param="$param$line"$'\n'
done < myFile.txt
awk
var=$(awk '1' ORS=" " file)
ksh
while read -r line
do
    t="$t $line"
done < file
echo $t
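The loop variants above can be wrapped up as follows. This is a sketch rather than a definitive version: collect_ids is a made-up name, it assumes one ID per line, skips empty lines, and joins the IDs with single spaces so there is no stray blank at either end.

```shell
# Hypothetical helper: print all IDs from a file (one per line) on a single line.
collect_ids() {
    ids=
    # The "|| [ -n ... ]" part also picks up a last line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue          # skip empty lines
        ids="${ids:+$ids }$line"            # add a space only between IDs
    done < "$1"
    printf '%s\n' "$ids"
}

# Usage: param=$(collect_ids myFile.txt)
```

Reading with a redirected while loop, instead of for over `cat`, keeps each line intact even if an ID ever contains a space.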
