how to take substring in ksh - unix

I have a file named "output.txt" having data in format:
400949703|2000025967912|20130614010652|20130614131543
355949737|2144050263|20120407100407|20120407101307
355499738|2144500262|20110911010901|20110911135601
I am executing an awk command as shown below:
awk -F"|" '{num1="`echo $3| cut -c1-8`"; print $num1}' output.txt
My expected output is :
20130614
20120407
20110911
But I am getting output as what is actually the input.
400949703|2000025967912|20130614010652|20130614131543
355949737|2144050263|20120407100407|20120407101307
355499738|2144500262|20110911010901|20110911135601
I can't work out why. My task is to compare the first 8 characters of the 3rd and 4th columns, but I am stuck at this step.
Could someone point out what I am missing?

What about using cut twice?
$ cut -d'|' -f4 file | cut -c-8
20130614
20120407
20110911
The first cut selects the 4th field, using | as the delimiter (use -f3 for the 3rd column).
The second cut keeps the first 8 characters (note that cut -c-8 is the same as your cut -c1-8).

You're mixing bash with awk, one tool is just enough:
awk -F\| 'a=substr($3, 1, 8){if(a==substr($4, 1, 8)){print a}}' output.txt
This takes the first 8 characters of columns 3 and 4, compares them, and prints the value if they match.
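As for why the command in the question echoes the input: the awk program is inside single quotes, so the shell never runs the backticks and num1 is just a literal string. awk converts a non-numeric string to 0 when it is used as a field index, so $num1 becomes $0, the whole line. A minimal sketch of both behaviours, assuming the sample data is recreated in a temporary file (the variable names whole and dates are illustrative):

```shell
# Recreate the sample data in a temporary file (path is illustrative).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
400949703|2000025967912|20130614010652|20130614131543
355949737|2144050263|20120407100407|20120407101307
355499738|2144500262|20110911010901|20110911135601
EOF

# num1 is a plain string here; $num1 evaluates to $0, i.e. the whole line.
whole=$(awk -F'|' '{num1="not a number"; print $num1}' "$tmp")

# substr() does the slicing inside awk, with no call out to cut.
dates=$(awk -F'|' '{print substr($3, 1, 8)}' "$tmp")

printf '%s\n' "$dates"
rm -f "$tmp"
```

The second command prints the first 8 characters of the third column for every line; swapping $3 for $4 gives the same slice of the fourth column.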


Linux - Get Substring from 1st occurence of character

FILE1.TXT
0020220101
or
01 20220101
I need to extract the date part from the file, where the date text starts at the first 2.
Options tried:
t_FILE_DT1='awk -F"2" '{PRINT $NF}' FILE1.TXT'
t_FILE_DT2='cut -d'2' -f2- FILE1.TXT'
echo "$t_FILE_DT1"
echo "$t_FILE_DT2"
1st output : 0101
2nd output : 0220101
Expected Output: 20220101
I'm new to Linux scripting. Could someone point out where I'm going wrong?
Use grep like so:
printf '0020220101\n01 20220101\n' | grep -P -o '\d{8}\b'
20220101
20220101
Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.
SEE ALSO:
grep manual
perlre - Perl regular expressions
Using any awk:
$ awk '{print substr($0,length()-7)}' file
20220101
20220101
The above was run on this input file:
$ cat file
0020220101
01 20220101
Regarding PRINT $NF in your question - PRINT != print. Get out of the habit of using all-caps unless you're writing Cobol. See correct-bash-and-shell-script-variable-capitalization for some reasons.
The 2 in your scripts is telling awk and cut to use the character 2 as the field separator, so each will carve up the input into substrings everywhere a 2 occurs.
The 's in your question are single quotes, which make strings literal. You were intending to use backticks, `cmd`, but those are deprecated in favor of $(cmd) anyway.
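A minimal sketch of the assignment written with $(...) instead of single quotes, assuming FILE1.TXT holds the two sample lines (it is recreated here in a scratch directory; t_FILE_DT follows the asker's naming, everything else is illustrative):

```shell
# Work in a scratch directory so the sample file does not clobber anything.
dir=$(mktemp -d)
printf '0020220101\n01 20220101\n' > "$dir/FILE1.TXT"

# $(...) runs the command and captures its output; single quotes would
# store the command text itself.
t_FILE_DT=$(awk '{print substr($0, length($0) - 7)}' "$dir/FILE1.TXT")

echo "$t_FILE_DT"
rm -rf "$dir"
```

With the two sample lines this prints 20220101 twice, once per input line.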
Instead of looking for what comes "after" the 2 (and having to worry about whether a space is involved as well),
think about extracting the last 8 characters, which you know for a fact is your date:
input="/path/to/txt/file/FILE1.TXT"
while IFS= read -r line
do
    # read in the last 8 characters of $line .. You KNOW this is the date ..
    # No need to worry about exact matching at that point, or spaces ..
    myDate=${line: -8}
    echo "$myDate"
done < "$input"
About the cut and awk commands that you tried:
Using awk -F"2" '{PRINT $NF}' file sets the field separator to 2, and $NF is the last field, so the value printed for the last field is 0101
Using cut -d'2' -f2- file uses a delimiter of 2 as well, and then prints all fields starting at the second field, which gives 0220101
If you want to match the 2 followed by 7 digits until the end of the string:
awk '
match ($0, /2[0-9]{7}$/) {
print substr($0, RSTART, RLENGTH)
}
' file
Output
20220101
The accepted answer shows how to extract the first eight digits, but that's not what you asked.
grep -o '2.*' file
will extract from the first occurrence of 2, and
grep -o '2[0-9]*' file
will extract all the digits after every occurrence of 2. If you specifically want eight digits, try
grep -Eo '2[0-9]{7}'
maybe also with a -w option if you want to only accept a match between two word boundaries. If you specifically want only digits after the first occurrence of 2, maybe try
sed -n 's/[^2]*\(2[0-9]*\).*/\1/p' file

unix command to print every 2nd line of duplicate

I have a text file that has 110132 lines and looks like this,
b3694658:heccc 238622
b3769025:heccc 238622
b3694659:heccc 238623
b3769026:heccc 238623
b3694660:heccc 238624
b3769027:heccc 238624
b3694661:heccc 238625
b3769028:heccc 238625
Notice that every 2nd line has a duplicate entry at heccc etc. I want an output that only has the 2nd occurrence of each duplicate, so it would look like this:
b3769025:heccc 238622
b3769026:heccc 238623
b3769027:heccc 238624
b3769028:heccc 238625
Thanks for your help!
It appears that you are just looking to output unique values. If that is so, just do this:
cat textfile | sort | uniq
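uniq -f1 file.txt
comes close in this case: -f1 skips the first field when comparing adjacent lines. Note that uniq keeps the first line of each run of duplicates, not the second.
See the uniq manual for how the -f and -s options work.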
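If the second line of each pair is what is needed, one option is to count how often the key has been seen and print a line only on its second sighting. A sketch, assuming the key is the whitespace-separated second field (238622 and so on) and using a shortened copy of the sample data in a temporary file:

```shell
# Shortened copy of the sample data (path is illustrative).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
b3694658:heccc 238622
b3769025:heccc 238622
b3694659:heccc 238623
b3769026:heccc 238623
EOF

# seen[$2]++ yields the count before incrementing, so the condition is
# true exactly when a key appears for the second time.
second=$(awk 'seen[$2]++ == 1' "$tmp")

printf '%s\n' "$second"
rm -f "$tmp"
```

Unlike uniq, this does not require the duplicates to be on adjacent lines.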

Customizing print output after getting a column using 'cut' command

I'm trying to print the first column of output in a "customized" way, after executing a program that prints out a table. I know how to get the first column from the output, but I want to print each row between single quotes. So, right now I have the commands that can get me the first column:
./genTable | cut -f2 | xargs -0
What can I add to this command so that it prints the values between quotes. For example, the output right now looks like
apple
cider
vinegar
I want it to look like
'apple'
'cider'
'vinegar'
I'd use Perl (\x27 is the single quote, which cannot be escaped inside a single-quoted shell string):
./genTable | perl -nwla -e 'print "\x27$F[1]\x27"'
I'd use awk ;-) , i.e.
./genTable | awk -v singleQ="'" '{print singleQ $1 singleQ}'
And of course, if you want it super-minimalist, change all references from singleQ to Q ;-)
output
'apple'
'cider'
'vinegar'
IHTH
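Another possibility is sed, wrapping each whole line in quotes. A small sketch, where the printf merely stands in for the output of ./genTable | cut -f2 and the variable name quoted is illustrative:

```shell
# Sample input standing in for the output of ./genTable | cut -f2.
# The sed script is in double quotes so the single quotes stay literal;
# & is the text matched by .* (the whole line).
quoted=$(printf 'apple\ncider\nvinegar\n' | sed "s/.*/'&'/")

printf '%s\n' "$quoted"
```

Each input line comes back wrapped in single quotes, matching the output wanted in the question.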

unix cut to extract column from text file and save rest of the contents to a new file

I can do the following using unix cut :
cut -f 1 myfile.out
Output:
6DKK463WXXK
VKFQ9PYP9CG
This prints the column that I want to extract. How do I create a new file without this column? In other words, I want to remove this column and keep the rest of the content.
Depending on your version of Unix, you may use the negate option to select the fields not listed.
cut -f 2 --complement myfile.input > myfile.output
That will place all the columns from the input file into the output file, except for column 2.
You use the -d argument to specify a delimiter other than tab, which is the default.
Note from experience: Be careful with the > especially when using similar names for input and output so that you don't accidentally overwrite your input file (using tab completion, this is easy to do).
Example:
$ echo one two three | cut -d ' ' -f 2 --complement
one three
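Where --complement (a GNU extension) is not available, dropping the first column can also be written as "keep field 2 onward". A sketch with two invented tab-separated rows; only the first-column values come from the question, and the file name myfile.rest is hypothetical:

```shell
dir=$(mktemp -d)
# Two tab-separated rows; the trailing columns are invented sample values.
printf '6DKK463WXXK\tfoo\t1\nVKFQ9PYP9CG\tbar\t2\n' > "$dir/myfile.out"

# -f 2- keeps every field from the second onward, i.e. drops column 1.
# Writing to a differently named file avoids truncating the input.
cut -f 2- "$dir/myfile.out" > "$dir/myfile.rest"

rest=$(cat "$dir/myfile.rest")
printf '%s\n' "$rest"
rm -rf "$dir"
```

The same range syntax works with -d for delimiters other than tab.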
Gdang! S.O. must be swamped right now.
This is very easy in awk
echo "1 2 3 4 5" | awk -F" " '{sub(/[^ ]+ /,""); print}'
output
2 3 4 5
The sub() deletes everything up to and including the first space character.
Then the remaining record is printed.
IHTH

Median Calculation in Unix

I need to calculate median value for the below input file. It is working fine for odd occurrences but not for even occurrences. Below is the input file and the script used. Could you please check what is wrong with this command and correct the same.
Input file:
col1,col2
AR,2.52
AR,3.57
AR,1.29
AR,6.66
AR,3.05
AR,5.52
Desired Output:
AR,3.31
Unix command:
cat test.txt | sort -t"," -k2n,2 | awk '{arr[NR]=$1} END { if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}'
Don't forget that your input file has an additional line, containing the header. You need to take an additional step in your awk script to skip the first line.
Also, because you're using the default field separator, $1 contains the whole line, so your expression (arr[NR/2]+arr[NR/2+1])/2 is never going to work. I suggest having awk split the input on commas and then using the second field, $2.
sort -t, -k2n,2 file | awk -F, 'NR>1{a[++i]=$2}END{if(i%2==1)print a[(i+1)/2];else print (a[i/2]+a[i/2+1])/2}'
I also removed your useless use of cat. Most tools, including sort and awk, are capable of reading in files directly, so you don't need to use cat with them.
Testing it out:
$ cat file
col1,col2
AR,2.52
AR,3.57
AR,1.29
AR,6.66
AR,3.05
AR,5.52
$ sort -t, -k2n,2 file | awk -F, 'NR>1{a[++i]=$2}END{if(i%2==1)print a[(i+1)/2];else print (a[i/2]+a[i/2+1])/2}'
3.31
It shouldn't be too difficult to modify the script slightly to change the output to whatever you want.
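For instance, one way to get the AR,3.31 form from the question is to remember the first column alongside the values. This is only a sketch: it assumes every row shares the same key, as in the sample, and it drops the header with tail before sorting instead of skipping it in awk (the variable name median is illustrative):

```shell
dir=$(mktemp -d)
cat > "$dir/file" <<'EOF'
col1,col2
AR,2.52
AR,3.57
AR,1.29
AR,6.66
AR,3.05
AR,5.52
EOF

# Drop the header, sort numerically on column 2, then keep the key (k)
# and the sorted values (a) so the END block can print "key,median".
median=$(tail -n +2 "$dir/file" | sort -t, -k2,2n |
  awk -F, '{k=$1; a[++i]=$2}
       END {m = (i%2==1) ? a[(i+1)/2] : (a[i/2]+a[i/2+1])/2; print k "," m}')

echo "$median"
rm -rf "$dir"
```

Input with several different keys would need the values grouped per key first, which this sketch does not attempt.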
