I have a test folder that contains some text:
1 Line one. $
2 Line $
3 This is a new line $
4 This is the same line$
5 $
How would I go about using a PIPED grep command to display only lines 1, 2 and 5?
My attempt at this was grep -n '^ ' test | grep -n ' $' test
However I get lines 1, 2, 3, 5
Your second grep invocation in the pipe reads the file test and ignores the pipe. The first grep writes to the pipe, but its output is discarded. So your command is effectively identical to
grep -n ' $' test
Just remove the test argument from the second grep invocation.
You are probably better off removing the -n option there as well, because then it would number the lines it reads from the pipe, which is probably not what you want.
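Following that advice, the corrected pipeline (keeping your patterns as they are) would look like this:
grep -n '^ ' test | grep ' $'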
I have a huge file with many columns. I want to count the number of occurrences of each value in one column, so I use
cut -f 2 "file" | sort | uniq -c
I get the result I want. However, when I read this file into R, it shows that I have only one column, and the data looks like the example below.
Example:
123 Chelsea
65 Liverpool
77 Manchester city
2 Brentford
What I want is two columns, one for the counts and the other for the names, but I only get one. Can anyone help me split the column into two, or suggest a better method to extract this from the big file?
Thanks in advance !!!!
Not a beautiful solution, but try this.
Pipe the output of the previous command into this while loop:
"your program" | while read count city
do
printf "%20s\t%s" $count $city
done
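For example, combined with the pipeline from the question (counts.tsv is just an assumed name for the output file):
cut -f 2 "file" | sort | uniq -c | while read count city
do
    printf "%20s\t%s\n" "$count" "$city"
done > counts.tsv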
If you simply want to count the unique instances in each column, your best bet is the cut command with a custom delimiter, in this case the whitespace delimiter.
Keep in mind that there can be further spaces after the first one, e.g. Manchester city.
So, in order to count the unique occurrences of the first column:
cut -d ' ' -f1 <your_file> | uniq | wc -l
where -d sets the delimiter to whitespace ' ', and -f1 gives you the first column; uniq keeps only the unique instances and wc -l counts them.
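For instance, with the four sample lines from the question saved in a hypothetical example.txt (assuming single spaces between count and name, as shown):
cut -d ' ' -f1 example.txt | uniq | wc -l
4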
Similarly, to count the unique occurrences of the second column:
cut -d ' ' -f2- <your_file> | uniq | wc -l
where all parameters/commands are the same except for -f2-, which gives you everything from the second column to the last (see the cut man page, -f<from>-<to>).
EDIT
Based on the update of your question, here is a suggestion for how to get what you want in R:
You can use cut with pipe:
df = read.csv(pipe("cut -f1,2- -d ' ' <your_csv_file>"))
And this should return a dataframe with the data separated as you want.
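Alternatively (just a sketch, not part of the answer above), you could make the separator explicit on the shell side before reading the result into R, for example by turning the leading count from uniq -c into a tab-delimited column with GNU sed; R can then read the file as two tab-separated columns (e.g. with read.delim and header = FALSE). The file name counts.tsv is arbitrary:
cut -f 2 "file" | sort | uniq -c | sed -E 's/^ *([0-9]+) /\1\t/' > counts.tsv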
I want to get a count of matches with grep.
grep -c works fine with one pattern.
But when I pipe (|) a second pattern, it returns nothing.
Here is an example:
File test with contents:
A C
A A
A B
A B C
I run the command:
>grep A test
A C
A A
A B
A B C
I run the command:
>grep -c A test
4
I run the command:
>grep A test | grep B
A B
A B C
I run the command:
>grep -c A test | grep B
returns nothing
How can I get a count of 2 for the last example?
thank you
grep -c A test outputs 4. The result of searching for rows that match B in text consisting only of 4 is, unsurprisingly, empty.
Instead, you want to count at the last step:
grep A test | grep -c B
Here, the first command filters the rows down to only those that have A, then the second command filters those rows into those that contain B, finally counting them.
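With the sample file test from the question, this gives the expected count:
grep A test | grep -c B
2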
The role of the pipe is to feed the output of the previous command to the next command as its input, so you should use grep A test | grep -c B
I need a command to look for a pattern in specific columns in a fixed length file and output the entire line to a different file.
Example"
File1
2345abcdef450022677
1234sdfght350022677
3456abcdef350022677
I need to extract the lines where columns 5 to 10 = abcdef and columns 15 to 16 = 22.
I want the output file to have the following data
2345abcdef450022677
3456abcdef350022677
I can use the cut command with grep to find the pattern, but I'm not sure how to output the entire line.
You can use sed:
sed -r '/^.{4}abcdef.{4}22/!d' inputfile > outputfile
This finds lines starting with any four characters (^.{4}), followed by abcdef, followed by any four characters (.{4}), followed by 22, and deletes (!d) the lines that don't match.
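For example, running it on the sample File1 from the question (outputfile is just a name for the new file):
sed -r '/^.{4}abcdef.{4}22/!d' File1 > outputfile
cat outputfile
2345abcdef450022677
3456abcdef350022677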
I have time-series data where I would like to find all lines matching each other, even though the values can be different (match until the first tab)! You can see in the vimdiff below that I would like to get rid of days that occur in only one of the two time series.
I am looking for the simplest unix tool to do this!
Time series here and here.
Simple example
Input
Left file Right File
------------------------ ------------------------
10-Apr-00 00:00 0 || 10-Apr-00 00:00 7
20-Apr-00 00:00 7        || 21-Apr-00 00:00 3
Output
Left file Right File
------------------------ ------------------------
10-Apr-00 00:00 0 || 10-Apr-00 00:00 7
Let's consider these sample input files:
$ cat file1
10-Apr-00 00:00 0
20-Apr-00 00:00 7
$ cat file2
10-Apr-00 00:00 7
21-Apr-00 00:00 3
To merge together those lines with the same date:
$ awk 'NR==FNR{a[$1]=$0;next;} {if ($1 in a) print a[$1]"\t||\t"$0;}' file1 file2
10-Apr-00 00:00 0 || 10-Apr-00 00:00 7
Explanation
NR==FNR{a[$1]=$0;next;}
NR is the number of lines read so far and FNR is the number of lines read so far from the current file. So, when NR==FNR, we are still reading the first file. If so, save this whole line, $0, in array a under the key of the first field, $1, which is the date. Then, skip the rest of the commands and jump to the next line.
if ($1 in a) print a[$1]"\t||\t"$0
If we get here, then we are reading the second file, file2. If the first field on this line, $1 is a date that we already saw in file1, in other words, if $1 in a, then print this line out together with the corresponding line from file1. The two lines are separated by tab-||-tab.
Alternative Output
If you just want to select lines from file2 whose dates are also in file1, then the code can be simplified:
$ awk 'NR==FNR{a[$1]++;next;} {if ($1 in a) print;}' file1 file2
10-Apr-00 00:00 7
Or, still simpler:
$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file1 file2
10-Apr-00 00:00 7
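If you want to keep only the matching days in both files (which is what the desired output in the question suggests), the same idea can be run in each direction; a sketch, where the output filenames are arbitrary:
awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file1 file2 > file2.matched
awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file2 file1 > file1.matched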
There is the relatively unknown unix command join. It can join sorted files on a key column.
To use it in your context, we follow this strategy (left.txt and right.txt are your files):
add line numbers (to put everything in the original sequence in the last step)
nl left.txt > left_with_lns.txt
nl right.txt > right_with_lns.txt
sort both files on the date column
sort left_with_lns.txt -k 2 > sl.txt
sort right_with_lns.txt -k 2 > sr.txt
join the files using the date column (all times are 0:00). This would normally merge all the columns of both files with corresponding keys, but we provide an output template so that the columns from the first file go to one result and the columns from the second file go to the other (only lines with a matching key will end up in the results fl.txt and fr.txt)
join -j 2 -t $'\t' -o 1.1,1.2,1.3,1.4 sl.txt sr.txt > fl.txt
join -j 2 -t $'\t' -o 2.1,2.2,2.3,2.4 sl.txt sr.txt > fr.txt
sort both results on the line-number column and output the other columns
sort -n fl.txt | cut -f 2- > left_filtered.txt
sort -n fr.txt | cut -f 2- > right_filtered.txt
Tools used: cut, join, nl, sort.
As requested by @Masi, I tried to work out a solution using sed.
My first attempt uses two passes; the first transforms file1 into a sed script that is used in the second pass to filter file2.
sed 's/\([^ \t]*\).*/\/^\1\t\/p;t/' file1 > sed1
sed -nf sed1 file2 > out2
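For the sample file1 shown in the awk answer above, the generated sed1 script would contain one print-and-branch command per date (shown here with \t standing for the tab character that GNU sed writes into the replacement):
/^10-Apr-00\t/p;t
/^20-Apr-00\t/p;t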
With big input files, this is s-l-o-w; for each line from file2, sed has to process a number of patterns equal to the number of lines in file1. I haven't done any profiling, but I wouldn't be surprised if the time complexity turned out to be quadratic.
My second attempt merges and sorts the two files, then scans through all lines in search of pairs. This runs in linear time and consequently is a lot faster. Please note that this solution will ruin the original order of the file; alphabetical sorting doesn't work too well with this date notation. Supplying files with a different date format (y-m-d) would be the easiest way to fix that.
sed 's/^[^ \t]\+/&#1/' file1 > marked1
sed 's/^[^ \t]\+/&#2/' file2 > marked2
sort marked1 marked2 > sorted
sed '$d;N;/^\([^ \t]\+\)#1.*\n\1#2/{s/\(.*\)\n\(.*\)/\2\n\1/;P};D' sorted > filtered
sed 's/^\([^ \t]\+\)#2/\1/' filtered > out2
Explanation:
In the first command, s/^[^ \t]\+/&#1/ appends #1 to every date. This makes it possible to merge the files, keep equal dates together when sorting, and still be able to tell lines from different files apart.
The second command does the same for file2; obviously with its own marker #2.
The sort command merges the two files, grouping equal dates together.
The third sed command returns all lines from file2 that have a date that also occurs in file1.
The fourth sed command removes the #2 marker from the output.
The third sed command in detail:
$d suppresses inappropriate printing of the last line
N reads and appends another line of input to the line already present in the pattern space
/^\([^ \t]\+\)#1.*\n\1#2/ matches two lines originating from different files but with the same date
{ starts a command group
s/\(.*\)\n\(.*\)/\2\n\1/ swaps the two lines in the pattern space
P prints the first line in the pattern space
} ends the command group
D deletes the first line from the pattern space
The bad news is that even the second approach is slower than the awk approach by @John1024. Sed was never designed to be a merge tool. Neither was awk, but awk has the advantage of being able to store an entire file in a dictionary, making @John1024's solution blazingly fast. The downside of a dictionary is memory consumption; on huge input files, my solution should have the advantage.
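For reference, with the sample files from the awk answer above (and single-space separators, as in that sample), the intermediate files of the second approach would look roughly like this; marked2 is analogous to marked1:
$ cat marked1
10-Apr-00#1 00:00 0
20-Apr-00#1 00:00 7
$ cat sorted
10-Apr-00#1 00:00 0
10-Apr-00#2 00:00 7
20-Apr-00#1 00:00 7
21-Apr-00#2 00:00 3
$ cat out2
10-Apr-00 00:00 7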
Given a sorted file like so:
AAA 1 2 3
AAA 2 3 4
AAA 3 4 2
BBB 1 1 1
BBB 1 2 1
and a desired output of
AAA 1 2 3
BBB 1 1 1
what's the best way to achieve this with sed?
Basically, if the col starts with the same field as the previous line, how do I delete it? The rest of the data must be kept on the output.
I imagine there must be some way to do this either using the hold buffer, branching, or the test command.
Another way with awk:
awk '!($1 in a){print;a[$1]}' file
This could be done with AWK:
$ gawk '{if (last != $1) print; last = $1}' in.txt
AAA 1 2 3
BBB 1 1 1
Maybe there's a simpler way with sed, but:
sed ':a;N;/\([[:alnum:]]*[[:space:]]\).*\n\1/{s/\n.*//;ta};P;D'
This produces the output
AAA 1 2 3
BBB 1 1 1
which differs from that in the question, but matches the description:
if the col starts with the same field as the previous line, how do I delete it?
This might work for you (GNU sed):
sed -r ':a;$!N;s/^((\S+\s).*)\n\2.*/\1/;ta;P;D' file
or maybe just:
sort -uk1,1 file
One way using GNU awk:
awk '!array[$1]++' file.txt
Results:
AAA 1 2 3
BBB 1 1 1
Using sed:
#!/bin/sed -nf
P
: loop
s/\s.*//
N
/\([^\n][^\n]*\)\n\1/ b loop
D
Firstly, we must pass the -n flag to sed so it will only print what we tell it to.
We start off by printing the line with the "P" command, because the first line will always be printed and we will force sed to only execute this line when we want it to.
Now we will do a loop. We define a loop with a starting label through the ":" command (in this case we name the label as "loop"), and when necessary we jump back to this label with a "b" command (or a "t" test command). This loop is quite simple:
Remove everything but the first field (replace the first space character and everything that follows it with nothing)
Append the next line (a newline character will be included)
Check if the new line starts with the field we isolated. We do this by using a capture. A capture is defined as a "sub-match" whose matched input is stored in a special "variable", named numerically following the order of the captures present. We specify captures using parentheses escaped with backslashes (starting with \( and ending with \)). In this case we match all characters that aren't a newline character (i.e. [^\n]) up to the end of the line. We do this by matching at least one non-newline character followed by an arbitrary sequence of them; this prevents matching an empty string before a newline. After the capture, we match a newline character followed by the result of the capture, using the special variable \1, which contains the input matched by that first capture. If this succeeds, we have a line that repeats the first field, so we jump back to the start of the loop with the "b" branch command.
When we exit the loop, we have found a line that has a different first field, so we must prepare the input line and jump back to the beginning of the script. This can be done with the "D" delete-first-line-and-restart-script command.
This can be shortened into a single line (notice that we have renamed the "loop" label to "a"):
sed -n -e 'P;:a;s/\s.*//;N;/\([^\n][^\n]*\)\n\1/ba;D'
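For example, with the sample data from the question saved in a file called file, the one-liner produces the same result as the other answers:
sed -n -e 'P;:a;s/\s.*//;N;/\([^\n][^\n]*\)\n\1/ba;D' file
AAA 1 2 3
BBB 1 1 1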