Unix uniq -c to csv format

I'm trying to get a uniq -c output like the following:
100 a
99 b
45 c
etc...
To write to a file in csv format using only unix utils, with the ordering reversed:
a,100
b,99
c,45
etc...
I figure I can pipe the output into tr " " "," to get the csv format, but how do I get the order of the columns to be reversed, only using unix utils?

cat test.txt | awk '{print $2,$1}'

armnotstrong's answer is probably the simplest, but you can also use sed:
sed -r 's/(.+) (.+)/\2,\1/'

For one-liners like this in unix, I like to use perl.
uniq -c yourfile | perl -lne 's/^(\s*)(\d*)(\s*)(.*)/$4,$2/g; print $_'
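For example, fed a hypothetical line of uniq -c style output:
$ echo '    100 a' | perl -lne 's/^(\s*)(\d*)(\s*)(.*)/$4,$2/g; print $_'
a,100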

The difference is how the "," gets included in the output. If done as per @armnotstrong, the result will just be a bunch of lines separated by spaces.
=> cat text
100 a
99 b
98 c
=> cat text | awk '{print $2","$1}'
a,100
b,99
c,98
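An equivalent variant (just a stylistic choice) sets the output field separator instead of concatenating the comma by hand:
$ cat text | awk -v OFS=, '{print $2,$1}'
a,100
b,99
c,98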

The awk solutions don't work correctly if your text data contains spaces.
Based on @cbreezier's answer, I came up with this solution:
sed -r 's/ +([0-9]+) (.+)/\2,\1/'
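For example, with hypothetical uniq -c output whose keys contain spaces:
$ printf '      3 New York\n      2 Los Angeles\n' | sed -r 's/ +([0-9]+) (.+)/\2,\1/'
New York,3
Los Angeles,2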

Related

How to delete duplicate lines in a file in unix?

I can delete duplicate lines in a file using the sort -u and uniq commands. Is the same possible using sed or awk?
There's a "famous" awk idiom:
awk '!seen[$0]++' file
It has to keep the unique lines in memory, but it preserves the file order.
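A quick illustration on hypothetical input:
$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c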
If you only need to remove duplicates and don't mind re-ordering the file, sort and uniq will do:
cat filename | sort | uniq >> filename2
If the file consists of numbers, use sort -n.
After sorting, we can use this sed command to drop adjacent duplicates:
sed -E '$!N; /^(.*)\n\1$/!P; D' filename
(The script keeps two lines in the pattern space: $!N appends the next line, P prints the first one only when the two differ, and D deletes it and repeats.)
If the file is unsorted, combine it with sort:
sort filename | sed -E '$!N; /^(.*)\n\1$/!P; D'
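A quick check on hypothetical sorted input:
$ printf 'a\na\nb\nc\nc\n' | sed -E '$!N; /^(.*)\n\1$/!P; D'
a
b
c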

Grep part of a large file without splitting it

How can I grep only part of a large file, for example lines 1000 to 2000, everything up to line 1000, or everything from line 1000 onwards?
I don't want to split the file in smaller files.
You could use sed to pre-process. EDIT: added a q per Kent's suggestion:
sed -n '1000,2000{p;2000q}' file.txt | grep 'abc'
For line 1000 through the end of the file:
sed -n '1000,$p' file.txt | grep 'abc'
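Another option for that last case is tail, which starts output at a given line number:
tail -n +1000 file.txt | grep 'abc'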
As a minor improvement over the sed solution by @ravoori, refactor the grep into the sed:
sed -n '1000,2000{/pattern/p};2000q' file.txt
If you have the pattern in a variable, use double quotes:
sed -n "1000,2000{/$pattern/p};2000q" file.txt
Or equivalently in awk (exiting once past line 2000 so the rest of the file is never read):
awk 'NR>2000{exit} NR>=1000 && /pattern/' file.txt
or with a variable:
awk -v pat="$pattern" 'NR>2000{exit} NR>=1000 && $0~pat' file.txt
I'd suggest
head -2000 FILE.TXT | tail -1000 | grep XXX
as the neatest solution, because head does not have to read the whole huge file, just the first couple of thousand lines. It essentially achieves what q does in the sed solution.
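One caveat: head -2000 | tail -1000 actually gives lines 1001 through 2000. To start exactly at line 1000, take the tail from that line instead:
head -2000 FILE.TXT | tail -n +1000 | grep XXX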

How can I delete the second word of every line of top(1) output?

I have a formatted list of processes (top output) and I'd like to remove unnecessary information. How can I remove, for example, the second word and its trailing whitespace from each line?
Example:
1 a hello
2 b hi
3 c ahoi
I'd like to delete a, b and c.
You can use the cut command:
cut -d' ' -f2 --complement file
--complement does the inverse: -f2 selects the second field, and --complement makes it print every field except the second. This is useful when you have a variable number of fields.
GNU cut has the --complement option. In case --complement is not available, the following does the same:
cut -d' ' -f1,3- file
Meaning: print the first field, then everything from the 3rd field to the end, i.e. exclude the second field and print the rest.
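Applied to the example data above, this gives:
$ cut -d' ' -f1,3- file
1 hello
2 hi
3 ahoi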
Edit:
If you prefer awk you can do: awk '{$2=""; print $0}' file
This sets the second field to empty and prints the modified line; note that it leaves a doubled space where the second field used to be.
Using sed to substitute the second column:
sed -r 's/(\w+\s+)\w+\s+(.*)/\1\2/' file
1 hello
2 hi
3 ahoi
Explanation:
(\w+\s+) # Capture the first word and trailing whitespace
\w+\s+ # Match the second word and trailing whitespace
(.*) # Capture everything else on the line
\1\2 # Replace with the captured groups
Notes: use the -i option to save the results back to the file; -r enables extended regular expressions (check the man page, as it may be -E depending on the implementation).
Or use awk to only print the specified columns:
$ awk '{print $1, $3}' file
1 hello
2 hi
3 ahoi
Both solutions have their merits: the awk solution is nice for a small fixed number of columns, but you need a temp file to store the changes (awk '{print $1, $3}' file > tmp; mv tmp file), whereas the sed solution is more flexible, as extra columns aren't an issue and the -i option does the edit in place.
One way using sed:
sed 's/ [^ ]*//' file
Results:
1 hello
2 hi
3 ahoi
Using Bash:
$ while read f1 f2 f3
> do
> echo $f1 $f3
> done < file
1 hello
2 hi
3 ahoi
This might work for you (GNU sed):
sed -r 's/\S+\s+//2' file
The trailing 2 tells the s command to delete only the second match of the pattern, i.e. the second word and its trailing whitespace.

Unix uniq, sort & cut commands: remove duplicate lines

If we have the following result:
Operating System,50
Operating System,40
Operating System,30
Operating System,23
Data Structure,87
Data Structure,21
Data Structure,17
Data Structure,8
Data Structure,3
Crypo,33
Crypo,31
C++,65
C Language,39
C Language,19
C Language,4
Java 1.6,16
Java 1.6,11
Java 1.6,10
Java 1.6,2
I only want to compare the first field (book name), and remove duplicate lines except the first line of each book, which records the largest number. So the result is as below:
Operating System,50
Data Structure,87
Crypo,33
C++,65
C Language,39
Java 1.6,16
Can anyone help me out with how I could do this using the uniq, sort & cut commands? Maybe also using tr, head or tail?
The most elegant in this case would seem to be:
rev input | uniq -f1 | rev
(One caveat: after rev, uniq -f1 skips the count together with the last word of the title, so titles that differ only in their last word, and single-word titles like Crypo and C++, can be collapsed into each other.)
If your input is already grouped by book name with the largest number first (as in your example), you can use GNU awk like this:
awk -F, '!array[$1]++' file.txt
This is the same !seen[...]++ idiom as above, but keyed on the first comma-separated field, so only the first line for each book is printed.
Results:
Operating System,50
Data Structure,87
Crypo,33
C++,65
C Language,39
Java 1.6,16
If your input is unsorted, you can use GNU awk like this:
awk -F, 'FNR==NR { if ($2 > array[$1]) array[$1]=$2; next } !dup[$1]++ { if ($1 in array) print $1 FS array[$1] }' file.txt{,}
(file.txt{,} is brace expansion for file.txt file.txt, so the file is read twice: the first pass records the largest number per book, the second prints one line per book.)
Results:
Operating System,50
Data Structure,87
Crypo,33
C++,65
C Language,39
Java 1.6,16
awk -F, '{if(p!=$1)print; p=$1}' your_file
(This also relies on the lines for each book being grouped together.)
This could be done in different ways, but I've tried to restrict myself to the tools you suggested:
cut -d, -f1 file | uniq | xargs -I{} grep -m 1 "{}" file
Alternatively, if you are sure that no two different book names share the same first three characters, you can simply use uniq -w3 file, which tells uniq to compare no more than the first three characters of each line.

Get only the value... not what you're grepping for

pdftk file.pdf dump_data output | grep NumberOfPages:
gives me:
NumberOfPages: 5
I don't want it to output the NumberOfPages: label; I want just the 5 in this case. Is there a flag I can pass to grep to get just that? I checked man grep and nothing seemed to do the trick.
grep doesn't really know how to parse strings in different formats, but other utilities like awk will help you:
pdftk file.pdf dump_data output | grep NumberOfPages: | awk '{print $2}'
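You can also let awk do both the matching and the field extraction, dropping the separate grep (a minor variant of the above, assuming the label is exactly NumberOfPages):
pdftk file.pdf dump_data output | awk '/NumberOfPages/ {print $2}'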
pdftk file.pdf dump_data output | grep NumberOfPages: | sed 's/NumberOfPages: //'
Yes, in GNU grep you can use the -o option to get "only" the matching portion of your expression. So something like:
pdftk file.pdf dump_data output | grep -o ' .*'
Could work for you. As other answers have pointed out, if you want only the number you'd be better off using something in addition to grep.
For example:
$ echo 'NumberOfPages: 5' | grep -o ' .*'
5
Notice the space before the 5 being included.
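If your grep supports Perl-compatible regular expressions (GNU grep with -P), a lookbehind keeps only the digits and avoids the leading space:
$ echo 'NumberOfPages: 5' | grep -oP '(?<=NumberOfPages: )\d+'
5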
