Group records based on field count (unix)

I'm having a tough time dealing with a file in Unix. Can someone give me directions on how to deal with it? Thanks in advance.
My file's field count is not consistent. I'm trying to gather all the records that have X number of fields.
When I did an awk check on the file based on the delimiter, I found that a big percentage of my records have 19 fields. So I want to quarantine those records into a separate file.
file: x_orbit.txt
records: 1000
delimiter: comma
awk -F',' '{print NF}' x_orbit.txt | sort -n | uniq -c
rec col
700 19
50 20
50 21
50 22
50 23
10 24
10 25
10 26
10 27
10 28
9 29
1 31
1 32
1 33
1 35
10 36
27 42

awk -F',' '{print > ("out_" (NF==19 ? NF : "other"))}' x_orbit.txt
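A quick way to sanity-check the split afterwards (a sketch, assuming the out_19/out_other file names produced by the command above):
# out_19 should hold the 700 quarantined 19-field records, out_other everything else
wc -l out_19 out_other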

Related

Print the file according to the pattern

Consider a file "test" containing the following data:
REG NO: 123
20
24
REG NO: 124
30
70
Need to print the following:
123 20
123 24
124 30
124 70
The values need to be printed from different fields, and the content of the heading column (the reg no, i.e. 123, 124) needs to be repeated for their marks.
In awk:
$ awk 'NF>1 {p=$NF;next} {print p, $1}' file
123 20
123 24
124 30
124 70
Explained:
$ awk '
NF>1 { # when a record has more than one field
p=$NF # store the last field to p var
next # proceed to next record
}
{
print p, $1 # otherwise output the p and the value on the record
}' file
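An equivalent sketch that keys on the literal "REG NO:" prefix rather than the field count (assuming every header line starts exactly with that text, as in the sample):
# remember the reg no from each header line, then print it in front of every mark line
awk '/^REG NO:/ {p=$NF; next} {print p, $1}' file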

Moving certain rows of a text file into a new one using shell

I have a tab delimited text file like
file 1
x1 23 47 59
x2 58 23 12
x3 39 30 11
...
x21909 020
and a simple list of values like
file 2
x1
x34
x56
x123
..
x9876
I'd like to take all of the rows in file 1 beginning with the values in file 2 and move them into a file 3, such that..
file 3
x1 23 47 59
x34 38 309 20
x56 49 201 10
x123 39 30 10
..
x9876 48 309 123
I've tried using the grep command but I'm not sure how to do it with a long list of values in file 2, and how to make it take the entire row. Is there a simple shell command which could do this?
With tabs as the delimiter, this should work:
for i in `cat file2`; do grep -P "^$i\t" file1; done
And with spaces as the delimiter, something like this should work:
for i in `cat file2`; do grep "^$i " file1; done
Cheers.
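If file2 is long or its values might be mangled by shell word splitting, a while read loop is a slightly safer variant of the same idea (a sketch, assuming GNU grep for -P):
# read one ID per line and pull the matching tab-delimited rows into file3
while IFS= read -r id; do
  grep -P "^${id}\t" file1
done < file2 > file3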
awk 'NR==FNR{a[$1]=$0; next} $1 in a {print a[$1]}' file1 file2
If the first field values don't appear elsewhere in the file (and no ID is a substring of another), grep fits the bill:
grep -Ff file2 file1
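If exact matching on the first column is needed (so that, say, x1 in file2 cannot match the x123 row), an awk lookup is one way to express it (a sketch, writing the result to file3 as in the example):
# load the wanted IDs from file2, then keep only file1 rows whose first field is in that set
awk 'NR==FNR {want[$1]; next} $1 in want' file2 file1 > file3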

changing fasta headers to contain count info

I have a file with several rows like this:
fg-000001 ATGGATGACCGATGCTAGC 34 23 33 45 34 23 34 35 34 35 56 47
I would like to convert it to a FASTA-style file, with the sum of the counts appended to the name:
fg-000001_x433
ATGGATGACCGATGCTAGC
Is that possible?
Thanks in advance!
This awk script should do what you are asking:
awk '{sum=0; for (i=3; i<=NF; i++) sum += $i; print $1 "_x" sum; print $2}' file
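If a conventional FASTA header line (which starts with ">") is wanted, the same idea with the name prefixed (a small variation, not part of the original answer):
# sum fields 3..NF and emit a two-line FASTA record per input row
awk '{sum=0; for (i=3; i<=NF; i++) sum += $i; print ">" $1 "_x" sum; print $2}' file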

Dynamically change what is being awked

I have this input below:
IDNO H1 H2 H3 HT Q1 Q2 Q3 Q4 Q5 M1 M2 EXAM
OUT OF 100 100 150 350 30 30 30 30 30 100 150 400
1434 22 95 135 252 15 20 12 18 14 45 121 245
1546 99 102 140 341 15 17 14 15 23 91 150 325
2352 93 93 145 331 14 17 23 14 10 81 101 260
(et cetera)
I need to write a Unix script that uses awk to dynamically find any column that is entered and display it on the screen. I have successfully awked specific columns, but I can't seem to figure out how to make it change based on different columns. My instructor will simply pick a column of the test data and my program needs to find that column.
What I was trying was something like:
#!/bin/sh
awk '{print $(I dont know what goes here)}' testdata.txt
EDIT: Sorry, I should have been more specific. He is entering the header name as the input, for example "H3". Then it needs to awk that column.
I think you are just looking for:
#!/bin/sh
awk 'NR==1{ for( i = 1; i <= NF; i++ ) if( $i == header ) col=i }
{ print $col }' header=${1?No header entered} testdata.txt
This makes no attempt to deal with a column header that does not appear
in the input. (Left as an exercise for the reader.)
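Assuming the script above is saved as, say, pickcol.sh (a hypothetical name) and made executable, it would be run with the header name as its only argument:
# prints the H3 column of testdata.txt (the header name appears on the first output line)
chmod +x pickcol.sh
./pickcol.sh H3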
Well, your question is quite diffuse and in principle you want someone else to write your awk script... You should check the man pages for awk; they are quite descriptive.
My two cents' worth, as an example (for selecting rows):
myscript.sh:
#!/bin/sh
awk -v a="$2" -v b="$3" '{if ($(a)==b) {print $0}}' "$1"
If you just want a column, well:
#!/bin/sh
awk -v a="$2" '{print $(a)}' "$1"
Your input would be:
myscript.sh file_of_data col_num
Again, reiterating: please study the man pages of awk. Also, when asking a question, please present what you have tried (code) and any errors (logs). This will make people more willing to help you.
Your line format has a lot of variation (in the number of fields). That said, what about something like this:
echo "Which column name?"
read column
case $column in
(H1) N=2;;
(H2) N=3;;
(H3) N=4;;
...
(*) echo "Please try again"; exit 1;;
esac
awk -v N=$N '{print $N}' testdata.txt

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
13 68 13 117 34 3.275941e-37
23 78 23 117 2 4.503111e-32
....
[[2]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
....
While I don't show the full tables here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row when I look at the tables in different list elements, so I can't just do a simple sum.
The brute-force way to do this is to: 1) make a blank table, and 2) copy in the appropriate $cluster_size, $start, $end, $number columns from the first table and pull the correct p-values from all the tables using a which() statement. Is there a cleverer way of doing this? Or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happens to match. That will not always be the case.
Seems like you can do this in two steps:
1) Convert your list to a data.frame.
2) Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)
#need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
cluster_size start end number sump
1 1 12 12 100 5.550142e-184
2 1 13 13 31 3.117856e-37
3 1 22 22 1 9.000000e+00
...
29 105 23 117 2 6.271469e-16
30 106 22 146 13 7.266746e-25
31 107 23 146 12 1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at the data.table package if your data is large.
