awk output into specified number of columns - unix

EDIT: I can get all my desired values from the command:
awk '/Value/{print $4}' *.log > ofile.csv
This makes a .csv file with a single column with hundreds of values. I would like to separate these values into a specified number of columns, i.e. instead of having values 1-1000 in a single column, I could specify that I want 100 columns and then my .csv file would have the first column be 1-10, 2nd column be 11-20... 100th column be 991-1000.
Previously, I was using the pr command to do this, but it doesn't work when the number of columns I want is too high (>36 in my experience).
awk '/Value/{print $4}' *.log | pr -112s',' > ofile.csv
The pr command gives the following message:
pr: page width too narrow
Is there an alternative to this command that I can use, that won't restrict the amount of comma delimiters in a row of data?

If your values are always the same length, you can use column:
$ seq 100 200 | column --columns 400 | column --table --output-separator ","
100,103,106,109,112,115,118,121,124,127,130,133,136,139,142,145,148,151,154,157,160,163,166,169,172,175,178,181,184,187,190,193,196,199
101,104,107,110,113,116,119,122,125,128,131,134,137,140,143,146,149,152,155,158,161,164,167,170,173,176,179,182,185,188,191,194,197,200
102,105,108,111,114,117,120,123,126,129,132,135,138,141,144,147,150,153,156,159,162,165,168,171,174,177,180,183,186,189,192,195,198,
The --columns option sets the output width; in my example, 400 is a number of characters, not a number of columns.
If your values are not the same character length, you will find spaces inserted where the values have a different width.
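Alternatively, if your values are not all the same width, a small awk script can do the column-major layout itself, with no limit on the number of columns. This is only a sketch: cols (112 here) is whatever column count you want, and the distribution may differ slightly from pr's when the value count is not an exact multiple of cols.
awk '/Value/{print $4}' *.log |
awk -v cols=112 '
    { vals[NR] = $0 }                          # collect all values first
    END {
        rows = int((NR + cols - 1) / cols)     # ceiling(NR / cols)
        for (r = 1; r <= rows; r++) {
            line = ""
            for (c = 0; c < cols; c++) {
                i = c * rows + r               # column-major index, like pr
                if (i <= NR)
                    line = line (c ? "," : "") vals[i]
            }
            print line
        }
    }' > ofile.csv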

How to split a column into two in Bash shell

I have a huge file with many columns. I want to count the number of occurrences of each value in one column, so I use
cut -f 2 "file" | sort | uniq -c
I get the result I want. However, when I read this file into R, it shows that I have only one column, but the data looks like the example below.
Example:
123 Chelsea
65 Liverpool
77 Manchester city
2 Brentford
What I want is two columns: one for the counts and the other for the names. However, I get only one. Can anyone help me split the column into two, or suggest a better way to extract this from the big file?
Thanks in advance !!!!
Not a beautiful solution, but try this.
Pipe the output of the previous command into this while loop:
"your program" | while read count city
do
printf "%20s\t%s\n" "$count" "$city"
done
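Another option (a sketch, assuming your file is the raw tab-separated input from the question; counts.tsv is just a name used here) is to let awk rebuild the pair as two tab-separated columns straight from the counting pipeline, which R can then read without any splitting:
cut -f 2 "file" | sort | uniq -c |
awk '{count = $1                        # the count produced by uniq -c
      $1 = ""; sub(/^[ \t]+/, "")       # what remains is the (possibly multi-word) name
      print count "\t" $0}' > counts.tsv
In R, read.delim("counts.tsv", header = FALSE) then gives one column of counts and one of names, even when a name contains spaces.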
If you want to simply count the unique instances in each column, your best bet is the cut command with a custom delimiter, in this case a space.
Keep in mind that some values contain further spaces after the first one, e.g. Manchester city.
So, in order to count the unique occurrences of the first column:
cut -d ' ' -f1 <your_file> | uniq | wc -l
where -d sets the delimiter to a space (' ') and -f1 selects the first column; uniq collapses repeated adjacent values (sort first if the column is not already sorted) and wc -l counts what is left.
Similarly, to count the unique occurrences of the second column:
cut -d ' ' -f2- <your_file> | uniq | wc -l
where all parameters/commands are the same except for -f2-, which selects everything from the second column to the last (see the cut man page, -f<from>-<to>).
EDIT
Based on the update to your question, here is a suggestion for getting what you want in R:
You can read the command's output directly with pipe(), using sed to strip the leading spaces that uniq -c adds and to turn the first remaining space into a comma, so that read.csv sees two columns:
df <- read.csv(pipe("sed 's/^ *//; s/ /,/' <your_csv_file>"), header = FALSE)
This returns a data frame with the counts and names separated as you want.

Parsing a file with multiple lines and replacing each first occurrence with a pipe

We currently receive a file that is essentially a set of name-value pairs. The pairs are separated by a pipe delimiter, and within each pair the name and value are separated by a space. I want to replace that space with a pipe inside the pipe-delimited values.
I replaced the pipes with double quotes and tried the Perl one-liner below to replace the space with a pipe, but it adds a pipe at every occurrence of a space.
perl -pe' s{("[^"]+")}{($x=$1)=~tr/ /|/;$x}ge'
Sample data:
|id 12345|code_value TTYE|Code_text Sample Data|Comments3 |
|id 23456|code_value2 UHYZ|Code_text3 Second Line Text|Comments M D Test|
|id 45677|code_value4 TEST DAT|Code_text Third line|Comments2 A D T Come|
|id 78904|code_value |Code_text2 Done WIth Sample data|Comments |
Expected Result:
|id|12345|code_value|TTYE|Code_text|Sample Data|Comments3 |
|id|23456|code_value2|UHYZ|Code_text3|Second Line Text|Comments|M D Test|
|id|45677|code_value4|TEST DAT|Code_text|Third line|Comments2|A D T Come|
|id|78904|code_value |Code_text2|Done WIth Sample data|Comments |
This sed script creates the output as shown in the question.
sed 's/\(|[^ ][^ ]*\) \([^|]\)/\1|\2/g' inputfile
From your expected output I assume the first space after a pipe should not be replaced when it is immediately followed by another pipe (i.e. the value is empty), as in |code_value | or |Comments3 |.
Explanation:
\(|[^ ][^ ]*\) - 1st capturing group: a pipe followed by one or more characters that are not spaces
' ' - followed by a single literal space (the one to be replaced)
\([^|]\) - 2nd capturing group that contains a character which is not a pipe
\1|\2 - replaced by group 1 followed by pipe and group 2
/g - replace all occurrences (global)
Using the two grouped patterns before and after the space makes sure the script does not replace a space followed immediately by a pipe.
Edit: Depending on your sed, you could replace the double [^ ] in the first group \(|[^ ][^ ]*\) with \(|[^ ]\+\) (GNU sed), or use sed -E with (\|[^ ]+).
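If awk is also an option, the same rule can be written per field. This is only a sketch based on the sample data above: within each |-separated field, the first space is turned into a pipe only when a non-space character follows it, so empty values such as |code_value | are left alone.
awk -F'|' -v OFS='|' '{
    for (i = 1; i <= NF; i++)
        if ($i ~ / [^ ]/)        # the field has a space followed by a real value
            sub(/ /, "|", $i)    # replace only the first space in that field
    print
}' inputfile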

Using sed or awk (or similar) incrementally or with a loop to do deletions in data file based on lines and position numbers given in another text file

I am looking to do deletions in a data file at specific positions in specific lines, based on a list in a separate text file, and have been struggling to get my head around it.
I'm working in cygwin, and have a (generally large) data file (data_file) to do the deletions in, and a tab-delimited text file (coords_file) listing the relevant line numbers in column 2 and the matching position numbers for each of those lines in column 3.
Effectively, I think I'm trying to do something similar to the following incomplete sed command, where coords_file$2 represents the line number taken from the 2nd column of coords_file and coords_file$3 represents the position in that line to delete from.
sed -r 's coords_file$2/(.{coords_file$3}).*/\1/' datafile
I'm wondering if there's a way to include a loop or iteration so that sed runs first using the values in the first row of coords_file to fill in the relevant line and position coordinates, and then runs again using the values from the second row, etc. for all the rows in coords_file? Or if there's another approach, e.g. using awk to achieve the same result?
e.g. for awk, I identified these coordinates based on string matches using this really handy awk command from Ed Morton's response to this question: line and string position of grep match.
awk 'NR==FNR{strings[$0]; next} {for (string in strings) if ( (idx = index($0,string)) > 0 ) print string, FNR, idx }' strings.txt data_file > coords_file.txt
I was thinking something similar could potentially work, doing an in-place deletion rather than just finding the lines, e.g. incorporating a simple find-and-replace like {if($0=="somehow_reference_coords_file_values_here"){$0=""}}. But it's a bit beyond me (I'm a coding novice, so I barely understand how the original command actually works, let alone how to modify it).
File examples
data_file
#vandelay.1
blablablablablablablablablablablabla
+
mehmehmehmehmehmehmehmehmehmehmehmeh
#vandelay.2
blablablablablablablablablablablabla
+
mehmehmehmehmehmehmehmehmehmehmehmeh
#vandelay.3
blablablablablablablablablablablabla
+
mehmehmehmehmehmehmehmehmehmehmehmeh
coords_file (tab-delimited)
(column 1 is just the string that was matched, column 2 is the line number it matched in, and column 3 is the position number of the match).
stringID 2 20
stringID 4 20
stringID 10 27
stringID 12 27
Desired result:
#vandelay.1
blablablablablablab
+
mehmehmehmehmehmehm
#vandelay.2
blablablablablablablablablablablabla
+
mehmehmehmehmehmehmehmehmehmehmehmeh
#vandelay.3
blablablablablablablablabl
+
mehmehmehmehmehmehmehmehme
Any guidance would be much appreciated thanks! (And as I mentioned, I'm very new to this coding scene, so apologies if some of that doesn't make sense or my question format's shonky (or if the question itself is rudimentary)).
Cheers.
(Incidentally, this has all been a massive workaround to delete strings identified in the blablabla lines of data_file, as well as the same positions 2 lines below (i.e. the mehmehmeh lines), since the mehmehmeh characters are quality scores that match the blablabla characters for each sample (each #vandelay.xx). Essentially this: sed -i 's/string.*//' datafile, but also running the same deletion 2 lines below every time it identifies the string. So if there's actually an easier script to do just that instead of all the stuff in the question above, please let me know!)
You can simply use a one-liner in awk to do that:
$ awk 'NR==FNR{a[$2]=$3;next} (FNR in a){$0=substr($0,1,a[FNR]-1)}1' coords_file data_file
#vandelay.1
blablablablablablab
+
mehmehmehmehmehmehm
#vandelay.2
blablablablablablablablablablablabla
+
mehmehmehmehmehmehmehmehmehmehmehmeh
#vandelay.3
blablablablablablablablabl
+
mehmehmehmehmehmehmehmehme
Brief explanation:
NR==FNR{a[$2]=$3;next}: build a map from line number to match position in array a. This part only processes coords_file, because NR==FNR is true only while reading the first file.
(FNR in a): once awk moves on to data_file, this checks whether the current line number FNR is a key in array a.
$0=substr($0,1,a[FNR]-1): reassign $0 to the truncated line, keeping everything before the stored position.
1: print all lines
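Since the question mentions sed -i: awk has no equally portable in-place flag, so a common pattern (a sketch reusing the same one-liner; data_file.tmp is just a scratch name) is to write to a temporary file and move it over the original:
awk 'NR==FNR{a[$2]=$3;next} (FNR in a){$0=substr($0,1,a[FNR]-1)}1' coords_file data_file > data_file.tmp &&
mv data_file.tmp data_file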

How to exclude columns from a data.frame and keep the spaces between columns?

I have a tab-delimited table (like below) with millions of rows and 340 columns
HanXRQChr00c0001 68062 N N N N A
HanXRQChr00c0001 68080 N N N N A
HanXRQChr00c0001 68285 N N N N A
I want to remove column 28. That is easy to do, but in the output file I lose the spacing between my columns.
Is there any way to exclude this column and still keep the spacing between the remaining columns, like above?
You can try different things. I include some of them below:
awk -i inplace '{$0=gensub(/\s*\S+/,"",28)}1' file
or
sed -i -r 's/(\s+)?\S+//28' file
or (note that this variant rebuilds the line with single-space separators and leaves an empty field where column 28 was, so the original spacing is not preserved):
awk '{$28=""; print $0}' file
or using cut as mentioned in the comments.
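For completeness, a sketch of the cut variant, assuming the file really is tab-delimited (tab is cut's default field separator): selecting every field except the 28th leaves the tabs between the remaining fields untouched.
cut -f1-27,29- file
GNU cut can also write this as cut --complement -f28 file.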

Counting specific characters in a string, across a data frame. sapply

I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries into their individual elements to get the following (i.e., for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations,
i.e.
[hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is to simply count the rows with commas ',' as they don't appear in any other text in any other columns, except where there are multiple genomic locations for the feature.
However I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually, running grep -c on the same data at the command line shows I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas on how to then extract the two (or more) locations and put them on their own rows, filling in the adjacent data.
Actually, what I intended to do was stick to something I know (on the command line): grep out the rows containing ',', duplicate the file, split and awk the selected columns (first and second location in the respective files), then cat and sort them. If there is a niftier way for me to do this in R, I would love a pointer.
gregexpr does in fact return a result for every entry: when there is no match, the element is a length-1 vector containing -1, so length() is 1 whether there is no comma or exactly one. To find the rows which have a match versus the ones which don't, you need to look at the returned value, not the length; a match failure returns -1.
Try foo <- sapply(gregexpr(',', testdat$genome_coordinates), function(m) any(m > 0)) to get the rows with a comma.
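If you do end up taking the command-line route you describe, a rough awk sketch along those lines is below. It assumes the coordinate string is in the first tab-separated column of an exported file (exported_file is a placeholder; adjust the column number for your real data), and it prints only the parsed location, so the adjacent columns of the original row would still need to be carried along.
awk -F'\t' '{
    coord = $1
    gsub(/^\[|\]$/, "", coord)          # strip the surrounding [ ]
    split(coord, p, ":")                # p[1]=hg19  p[2]=chromosome  p[3]=ranges  p[4]=strand
    m = split(p[3], r, ",")             # one entry per start-end range
    for (i = 1; i <= m; i++) {
        split(r[i], se, "-")
        print p[1], p[2], se[1], se[2], p[4]   # one output row per range
    }
}' exported_file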
