How to split a column into two in Bash shell

How to split a column into two in Bash shell - r

I have a huge file with many column. I want to count the number of occurences of each values in 1 column. Therefore, I use
cut -f 2 "file" | sort | uniq -c
. I got the result as I want. However, when I read this file to R, It shows that I have only 1 column but the data is like the example below
Example:
123 Chelsea
65 Liverpool
77 Manchester city
2 Brentford
The thing I want is two columns, one for the counts the other for the names. However, I got one only. Can anyone help me to split the column into 2 or a better method to extract from the big file?
Thanks in advance !!!!

Not a beautiful solution, but try this.
Pipe the output of the previous command into this while loop:
"your program" | while read count city
do
printf "%20s\t%s" $count $city
done

If you want to simply count the unique instances in each column, your best bet would be the cut command with the custom delimiter. For instance, it would be the whitespace delimiter.
In this case you have to consider that you have subsequent spaces after the first one e.g. Manchester city.
So, in order to count the unique occurrences of the first column:
cut -d ' ' -f1 <your_file> | uniq | wc -l
where -d sets the delimiter to whitespace ' ', and -f1 gives you the first column; uniq keeps the unique instances and wc -l counts the number of occurrences.
Similarly, to count the unique occurrences of the second column:
cut -d ' ' -f2- <your_file> | uniq | wc -l
where all parameters/commands are the same except for -f2- which allows you to get the from the second column to the last (see cut man page -f<from>-<to>).
EDIT
Based on the update of your question, here is a proposition on how to get what you want in r:
You can use cut with pipe:
df = read.csv(pipe("cut -f1,2- -d ' ' <your_csv_file>"))
And this should return a dataframe with the data separated as you want.

Related

reducing stream data to single result without putting them all into memory

I can reduce produced lines like:
seq 5 | jq --slurp ' reduce .[] as $i (0;.+($i|tonumber))'
to get
15
but this put whole input into memory, I don't want that. Following:
seq 5 | jq ' reduce . as $i (0;.+($i|tonumber))'
produces incorrect output
1
2
3
4
5
similar happens when foreach is used.
What is correct syntax?

Use inputs instead of the context ., along with the --null-input (or -n) option so the context wouldn't eat up the first item:
seq 5 | jq -n 'reduce inputs as $i (0;.+($i|tonumber))'
15
Demo
Explanation: If you just use . as context, your filter will be executed once for each input (five times in this case, hence five outputs, each summing up just one value). Providing the -n option sets the input to null, so the filter is executed only once, while input and inputs sequentially read in another/all new items from the input stream.

awk output into specified number of columns

EDIT: I can get all my desired values from the command:
awk '/Value/{print $4}' *.log > ofile.csv
This makes a .csv file with a single column with hundreds of values. I would like to separate these values into a specified number of columns, i.e. instead of having values 1-1000 in a single column, I could specify that I want 100 columns and then my .csv file would have the first column be 1-10, 2nd column be 11-20... 100th column be 991-1000.
Previously, I was using the pr command to do this, but it doesn't work when the number of columns I want is too high (>36 in my experience).
awk '/Value/{print $4}' *.log | pr -112s',' > ofile.csv
the pr command gives the following message:
pr: page width too narrow
Is there an alternative to this command that I can use, that won't restrict the amount of comma delimiters in a row of data?

If your values are always the same length, you can use column:
$ seq 100 200 | column --columns 400 | column --table --output-separator ","
100,103,106,109,112,115,118,121,124,127,130,133,136,139,142,145,148,151,154,157,160,163,166,169,172,175,178,181,184,187,190,193,196,199
101,104,107,110,113,116,119,122,125,128,131,134,137,140,143,146,149,152,155,158,161,164,167,170,173,176,179,182,185,188,191,194,197,200
102,105,108,111,114,117,120,123,126,129,132,135,138,141,144,147,150,153,156,159,162,165,168,171,174,177,180,183,186,189,192,195,198,
The --columns control the number of columns, but in my example, 400 is the number of characters, not the number of columns.
If your values are not the same character length, you will find spaces inserted where the values have a different width.

Parsing a file with multiple line and replace the each first occurrence with Pipe

Currently we have receiving a file, which is more like a name value pair. Each of the pair data is separated by pipe delimiter and Name and value pair is separated by space. I want to replace the space with Pipe inside the pipe delimited values.
I replace pipe to double code and tried using the below Perl Command line to add then replacing the space with the Pipe value. But it is add Pipe to each occurrence of the space.
perl -pe' s{("[^"]+")}{($x=$1)=~tr/ /|/;$x}ge'
Sample data:
|id 12345|code_value TTYE|Code_text Sample Data|Comments3 |
|id 23456|code_value2 UHYZ|Code_text3 Second Line Text|Comments M D Test|
|id 45677|code_value4 TEST DAT|Code_text Third line|Comments2 A D T Come|
|id 78904|code_value |Code_text2 Done WIth Sample data|Comments |
Expected Result:
|id|12345|code_value|TTYE|Code_text|Sample Data|Comments3 |
|id|23456|code_value2|UHYZ|Code_text3|Second Line Text|Comments|M D Test|
|id|45677|code_value4|TEST DAT|Code_text|Third line|Comments2|A D T Come|
|id|78904|code_value |Code_text2|Done WIth Sample data|Comments |

This sed script creates the output as shown in the question.
sed 's/\(|[^ ][^ ]*\) \([^|]\)/\1|\2/g' inputfile
From your expected output I assume the first space after a pipe should not be replaced with a pipe if it is followed by a pipe as in |code_value | or |Comments3 |.
Explanation:
\(|[^ ][^ ]*\) - 1st capturing group that contains a character which is not a space followed by 0 or more of the same
- followed by a space
\([^|]\) - 2nd capturing group that contains a character which is not a pipe
\1|\2 - replaced by group 1 followed by pipe and group 2
/g - replace all occurrences (global)
Using the two grouped patterns before and after the space makes sure the script does not replace a space followed immediately by a pipe.
Edit: Depending on your sed you cpould replace the double [^ ] in the first group \(|[^ ][^ ]*\) with \(|[^ ]+*\) or other variants.

Count occurence of unique value in 2nd field using awk

I'm using this syntax to count occurrence of unique values in 2nd field of the file. Can somebody explain how does this work. How is Unix calculating this count ? Is it reading each line or whole file as one.. how is it assigning count and incrementing it?
Command:
awk -F: '{a[$2]++} END {for ( i in a) { print i,a[i]}}' inputfile

It's not Unix calculating but awk; awk is not Unix or shell, it's a language. Presented awk program calculates how many times each unique value in the second field ($2. separated by :) occurs and outputs the values and related counts.
awk -F: ' # set the field separator to ":"
{
# awk reads in records or lines in a loop
a[$2]++ # here it hashes each value to a and counts each occurrance
}
END { # after all records have been processed
for ( i in a) { # hash a is looped thru in no particular order
print i,a[i] # and value-count pairs are outputed
}
}' inputfile
If you want to learn more about awk, please read following quote (* see below) by #EdMorton: The best source of all awk information is the book Effective Awk Programming, 4th Edition, by Arnold Robbins. If you have any other book, throw it away, and if you're trying to learn from a web site - don't as most of them are full of complete nonsense. Just get the book.
*) Now go read the book.
Edit How a[$2]++ works:
Sample data and a[$2]'s value:
1 val1 # a[$2]++ causes: a["val1"] = 1
2 val2 # a[$2]++ causes: a["val2"] = 1
3 val1 # a[$2]++ causes: a["val1"] = 2
4 val1 # a[$2]++ causes: a["val1"] = 3

Unix: find all lines having timestamps in both time series?

I have time-series data where I would like to find all lines matching each another but values can be different (match until the first tab)! You can see the vimdiff below where I would like to get rid of days that occur only on the other time series.
I am looking for the simplest unix tool to do this!
Timeserie here and here.
Simple example
Input
Left file Right File
------------------------ ------------------------
10-Apr-00 00:00 0 || 10-Apr-00 00:00 7
20-Apr 00 00:00 7 || 21-Apr-00 00:00 3
Output
Left file Right File
------------------------ ------------------------
10-Apr-00 00:00 0 || 10-Apr-00 00:00 7

Let's consider these sample input files:
$ cat file1
10-Apr-00 00:00 0
20-Apr-00 00:00 7
$ cat file2
10-Apr-00 00:00 7
21-Apr-00 00:00 3
To merge together those lines with the same date:
$ awk 'NR==FNR{a[$1]=$0;next;} {if ($1 in a) print a[$1]"\t||\t"$0;}' file1 file2
10-Apr-00 00:00 0 || 10-Apr-00 00:00 7
Explanation
NR==FNR{a[$1]=$0;next;}
NR is the number of lines read so far and FNR is the number of lines read so far from the current file. So, when NR==FNR, we are still reading the first file. If so, save this whole line, $0, in array a under the key of the first field, $1, which is the date. Then, skip the rest of the commands and jump to the next line.
if ($1 in a) print a[$1]"\t||\t"$0
If we get here, then we are reading the second file, file2. If the first field on this line, $1 is a date that we already saw in file1, in other words, if $1 in a, then print this line out together with the corresponding line from file1. The two lines are separated by tab-||-tab.
Alternative Output
If you just want to select lines from file2 whose dates are also in file1, then the code can be simplified:
$ awk 'NR==FNR{a[$1]++;next;} {if ($1 in a) print;}' file1 file2
10-Apr-00 00:00 7
Or, still simpler:
$ awk 'NR==FNR{a[$1]++;next;} ($1 in a)' file1 file2
10-Apr-00 00:00 7

There is the relatively unknown unix command join. It can join sorted files on a key column.
To use it in your context, we follow this strategy (left.txt and right.txt are your files):
add line numbers (to put everything in the original sequence in the last step)
nl left.txt > left_with_lns.txt
nl right.txt > right_with_lns.txt
sort both files on the date column
sort left_with_lns.txt -k 2 > sl.txt
sort right_with_lns.txt -k 2 > sr.txt
join the files using the date column (all times are 0:00) (this would merge all columns of both files with correponding key, but we provide a output template to write the columns from the first file somewhere and the columns from the second file somewhere else (but only those line with a matching key will end in the result fl.txt and fr.txt)
join -j 2 -t $'\t' -o 1.1 1.2 1.3 1.4 sl.txt sr.txt > fl.txt
join -j 2 -t $'\t' -o 2.1 2.2 2.3 2.4 sl.txt sr.txt > fr.txt
sort boths results on the linenumber column and output the other columns
sort -n fl |cut -f 2- > left_filtered.txt
sort -n fr.txt | cut -f 2- > right_filtered.txt
Tools used: cut, join, nl, sort.

As requested by #Masi, I tried to work out a solution using sed.
My first attempt uses two passes; the first transforms file1 into a sed script that is used in the second pass to filter file2.
sed 's/\([^ \t]*\).*/\/^\1\t\/p;t/' file1 > sed1
sed -nf sed1 file2 > out2
With big input files, this is s-l-o-w; for each line from file2, sed has to process an amount of patterns that equals the number of lines in file1. I haven't done any profiling, but I wouldn't be surprised if the time complexity is quadratic.
My second attempt merges and sorts the two files, then scans through all lines in search of pairs. This runs in linear time and consequently is a lot faster. Please note that this solution will ruin the original order of the file; alphabetical sorting doesn't work too well with this date notation. Supplying files with a different date format (y-m-d) would be the easiest way to fix that.
sed 's/^[^ \t]\+/&#1/' file1 > marked1
sed 's/^[^ \t]\+/&#2/' file2 > marked2
sort marked1 marked2 > sorted
sed '$d;N;/^\([^ \t]\+\)#1.*\n\1#2/{s/\(.*\)\n\(.*\)/\2\n\1/;P};D' sorted > filtered
sed 's/^\([^ \t]\+\)#2/\1/' filtered > out2
Explanation:
In the first command, s/^[^ \t]\+/&#1/ appends #1 to every date. This makes it possible to merge the files, keep equal dates together when sorting, and still be able to tell lines from different files apart.
The second command does the same for file2; obviously with its own marker #2.
The sort command merges the two files, grouping equal dates together.
The third sed command returns all lines from file2 that have a date that also occurs in file1.
The fourth sed command removes the #2 marker from the output.
The third sed command in detail:
$d suppresses inappropriate printing of the last line
N reads and appends another line of input to the line already present in the pattern space
/^\([^ \t]\+\)#1.*\n\1#2/ matches two lines originating from different files but with the same date
{ starts a command group
s/\(.*\)\n\(.*\)/\2\n\1/ swaps the two lines in the pattern space
P prints the first line in the pattern space
} ends the command group
D deletes the first line from the pattern space
The bad news is, even the second approach is slower than the awk approach made by #John1024. Sed was never designed to be a merge tool. Neither was awk, but awk has the advantage of being able to store an entire file in a dictionary, making #John1024's solution blazingly fast. The downside of a dictionary is memory consumption. On huge input files, my solution should have the advantage.