Merging the rows in a file using awk - unix
Can somebody please explain the meaning of the script below?
awk -F "," 's != $1 || NR ==1{s=$1;if(p){print p};p=$0;next}
{sub($1,"",$0);p=p""$0;}
END{print p}' file
The file has the following data:
2,"URL","website","aaa.com"
2,"abc","text","some text"
2,"Password","password","12345678"
3,"URL","website","10.0.10.75"
3,"uname","text","some text"
3,"Password","password","password"
3,"ulang","text","some text"
4,"URL","website","192.168.2.110"
4,"Login","login","admin"
4,"Password","password","blah-blah"
and the output is:
2,"URL","website","aaa.com","abc","text","some text",Password","password","12345678"
3,"URL","website","10.0.10.75","uname","text","some text""Password","password","password","ulang","text","some text"
awk has this structure:

pattern { action }

For your script, let's analyze the elements. The first pattern:

s != $1 || NR == 1    # if the variable s is not equal to the first field,
                      # or we're operating on the first row

and its action:

s = $1                # remember the first field in variable s
if (p) {              # if p is not empty, print it
    print p
}
p = $0                # start a new group with the current line
next                  # read the next record, skipping the remaining rules

The next rule has no pattern, so its action applies to every row that reaches it (i.e. every row not consumed by the next above):

sub($1, "", $0)       # remove the first occurrence of $1 from the line;
                      # note $1 is used as a regular expression here, and
                      # the leading comma is left in place
p = p "" $0           # append the rest of the line to p

The last pattern is the special reserved word END, which applies only after all rows have been consumed (its counterpart BEGIN applies before the first row is read):

END {
    print p           # print the accumulated value of p
}
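One subtle point worth knowing: because sub($1, "", $0) treats $1 as a regular expression, it could misbehave if a key ever contained regex metacharacters (harmless for the numeric keys above). A minimal variant using substr() avoids that, assuming the key is always the literal start of the line (which it is, since $1 is the first comma-separated field):

awk -F, '
    s != $1 || NR == 1 {                  # first record, or a new key in field 1
        s = $1                            # remember the current key
        if (p) print p                    # flush the previous group, if any
        p = $0                            # start a new group
        next
    }
    { p = p substr($0, length($1) + 1) }  # append everything after the key,
                                          # keeping the leading comma
    END { if (p) print p }                # flush the final group
' file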
Related
awk to print incremental count of occurrences of unique values in each column
I would like to incrementally count and print the number of occurrences of the unique values in column 1, column 2, column 3 ... column NF, and in $0 as a whole. If a value appears only once in column 1, print "No" as the duplicated flag; if it appears more than once, print "Yes". I was looking for something like this:

awk -F"," '{OFS=","; if (word == $1) { counter++ } else { counter = 1; word = $1 }; print $0 ",", "Yes/No", counter }'

For example, I am trying to check whether there is any duplicate information in field $1 (the fruit name). Under the Name field, "Apple" appears three times, "Orange" appears two times, and "Mango" appears one time. Any value that is not repeated is considered non-duplicated, so it gets "Name_Dup=No" and a count of "Name_Counter=1" (i.e. Mango). "Apple" appears 3 times, so it is flagged as duplicated: the first time it appears, "Name_Dup=Yes" and "Name_Counter=1"; the second time, "Name_Dup=Yes" and "Name_Counter=2"; the third time, "Name_Dup=Yes" and "Name_Counter=3". The same check is then needed for each column $2, $3 ... up to $NF, and for $0. My actual input file is not sorted in any order, and the number of fields varies (10 fields, 12 fields, 15 fields, etc.).

Input.csv

Name,Amount,Dept
Apple,10,eee
Orange,20,csc
Apple,30,mec
Mango,40,sss
Apple,10,eee
Orange,10,csc

Desired Output

Name,Amount,Dept,Name_Dup,Name_Counter,Amount_Dup,Amount_Counter,Dept_Dup,Dept_Counter,EntireLine_Dup,EntireLine_Counter
Apple,10,eee,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,Yes,1,No,1,Yes,1,No,1
Apple,30,mec,Yes,2,No,1,No,1,No,1
Mango,40,sss,No,1,No,1,No,1,No,1
Apple,10,eee,Yes,3,Yes,2,Yes,2,Yes,2
Orange,10,csc,Yes,2,Yes,3,Yes,2,No,1

For reference, here are the per-column steps:

Step #1 - field $1 check and output

Name,Name_Dup,Name_Counter
Apple,Yes,1
Orange,Yes,1
Apple,Yes,2
Mango,No,1
Apple,Yes,3
Orange,Yes,2

Step #2 - field $2 check and output

Amount,Amount_Dup,Amount_Counter
10,Yes,1
20,No,1
30,No,1
40,No,1
10,Yes,2
10,Yes,3

Step #3 - field $3 check and output

Dept,Dept_Dup,Dept_Counter
eee,Yes,1
csc,Yes,1
mec,No,1
sss,No,1
eee,Yes,2
csc,Yes,2

Step #4 - $0 check (the combination of $1, $2 and $3) and output

"Name,Amount,Dept",EntireLine_Dup,EntireLine_Counter
"Apple,10,eee",Yes,1
"Orange,20,csc",No,1
"Apple,30,mec",No,1
"Mango,40,sss",No,1
"Apple,10,eee",Yes,2
"Orange,10,csc",No,1
awk solution. As I understand it, the OP wants, per line and per column, a flag saying whether that column's value shows up more than once in the file, plus a running count of how many times that particular value has been seen so far.

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==1 {
    header=$0
    n=split("Dup,Counter",h)
    for (i=1; i<=NF; i++)
        for (j=1; j<=n; j++)
            header=header OFS $i"_"h[j]
    printf("%s,EntireLine_Dup,EntireLine_Counter\n", header)
    next
}
{
    r[++lines]=$0
    for (col=1; col<=NF; col++)
        v[col][$col]++
    v[col][$0]++
}
END {
    for (l=1; l<=lines; l++){
        n=split(r[l], s)
        res=""
        for (c=1; c<=n; c++)
            res=res OFS output(v,c,s[c])
        res=res OFS output(v,c,r[l])
        print r[l] res
    }
}
function output(arr, col, val){
    return sprintf("%s,%s", (arr[col][val] > 1 ? "Yes" : "No"), ++count[col][val])
}

With this input:

$ cat input.txt
Name,Amount,Dept,Nonsense
Apple,10,eee,eee
Orange,20,csc,eee
Apple,30,mec,eee
Mango,40,sss,eee
Apple,10,eee,eee
Orange,10,csc,eee

this gives (I've deleted the header line manually, because I couldn't get it to fit in the code sample):

$ awk -f tst.awk input.txt
# deleted header line
Apple,10,eee,eee,Yes,1,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,eee,Yes,1,No,1,Yes,1,Yes,2,No,1
Apple,30,mec,eee,Yes,2,No,1,No,1,Yes,3,No,1
Mango,40,sss,eee,No,1,No,1,No,1,Yes,4,No,1
Apple,10,eee,eee,Yes,3,Yes,2,Yes,2,Yes,5,Yes,2
Orange,10,csc,eee,Yes,2,Yes,3,Yes,2,Yes,6,No,1
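One caveat: v[col][$col] relies on GNU awk's true arrays of arrays, so tst.awk needs gawk 4.0 or later. A rough sketch of the same logic for POSIX awks, using comma-subscripted (SUBSEP) arrays instead; the translation is mechanical, but treat it as untested:

BEGIN { FS = OFS = "," }
NR == 1 {
    header = $0
    n = split("Dup,Counter", h)
    for (i = 1; i <= NF; i++)
        for (j = 1; j <= n; j++)
            header = header OFS $i "_" h[j]
    print header OFS "EntireLine_Dup" OFS "EntireLine_Counter"
    next
}
{
    r[++lines] = $0
    for (col = 1; col <= NF; col++)
        v[col, $col]++               # v[col SUBSEP value] instead of v[col][value]
    v[col, $0]++                     # col == NF+1 after the loop: whole-line slot
}
END {
    for (l = 1; l <= lines; l++) {
        n = split(r[l], s)
        res = ""
        for (c = 1; c <= n; c++)
            res = res OFS flag(c, s[c])
        print r[l] res OFS flag(c, r[l])
    }
}
function flag(col, val) {            # "Yes"/"No" plus the running counter
    return (v[col, val] > 1 ? "Yes" : "No") OFS ++count[col, val]
}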
You are not showing what you have tried so far, so here is a hint about where I would start. Since awk is the tool to use, start with the shell command sort Input.csv and pipe it into awk. While reading the input, populate a plain array with the lines as well as an associative array indexed by the first field. In the END section, go over the array and see whether each first field occurred more than once. It takes a bit of time, but this sounds like homework rather than a production problem.
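A minimal sketch of that hint, handling only column 1 (Name_Dup and Name_Counter); the other columns follow the same pattern. Because all the counts live in arrays before anything is printed, the sort step becomes optional and the original line order is preserved:

awk -F, -v OFS=, '
    NR == 1 { print $0, "Name_Dup", "Name_Counter"; next }
    {
        line[NR] = $0                # plain array: lines in input order
        key[NR]  = $1                # first field of each line
        seen[$1]++                   # associative array: occurrences per value
    }
    END {
        for (i = 2; i <= NR; i++)
            print line[i], (seen[key[i]] > 1 ? "Yes" : "No"), ++count[key[i]]
    }
' Input.csv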
Get a specific column number by column name using awk
I have n files, and in these files a specific column named "thrudate" appears at a different column number in every file. I just want to extract the values of this column from all the files in one go. So I tried awk. Here I'm considering only one file and extracting the values of thrudate:

awk -F, -v header=1,head="" '{for(j=1;j<=2;j++){if($header==1){for(i=1;i<=$NF;i++){if($i=="thrudate"){$head=$i;$header=0;break}}} elif($header==0){print $0}}}' file | head -10

How I have approached it: I used the find command to find all the similar files, then planned to run the second step on every file. I loop over all fields in the first row, checking each column name, with header initialized to 1 so that only the first row is checked; once a name matches 'thrudate', I set header to 0 and break out of the loop. Once I have the column number, I print that column for every row.
You can use the following awk script, print_col.awk:

# Find the column number in the first line of a file
FNR==1{
    for(n=1;n<=NF;n++) {
        if($n == header) {
            next
        }
    }
}
# Print that column on all other lines
{ print $n }

Then use find to execute this script on every file:

find ... -exec awk -v header="foo" -f print_col.awk {} +

In comments you've asked for a version that can print multiple columns based on their header names. You may use the following script for that, print_cols.awk:

BEGIN {
    # Parse headers into an assoc array h
    split(header, a, ",")
    for(i in a) {
        h[a[i]]=1
    }
}
# Find the column numbers in the first line of a file
FNR==1{
    split("", cols)   # This will re-init cols
    for(i=1;i<=NF;i++) {
        if($i in h) {
            cols[i]=1
        }
    }
    next
}
# Print those columns on all other lines
{
    res = ""
    for(i=1;i<=NF;i++) {
        if(i in cols) {
            s = res ? OFS : ""
            res = res s $i
        }
    }
    if (res) {
        print res
    }
}

Call it like this:

find ... -exec awk -v header="foo,bar,test" -f print_cols.awk {} +
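For a quick sanity check, here is a run against a small hypothetical sample file (whitespace-separated, awk's default FS, since the question doesn't show the real data; for comma-separated files, add -F, to the call):

$ cat sample.txt
id thrudate amount
1 2020-01-01 10
2 2020-02-01 20
$ awk -v header="thrudate" -f print_col.awk sample.txt
2020-01-01
2020-02-01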
unix split FASTA using a loop, awk and split
I have a long list of data organised as below (INPUT). I want to split the data up so that I get the output below (desired OUTPUT). The code below first identifies all the lines containing ">gi" and saves the line numbers of those lines in an array called B. Then, in a new file, it should replace those lines from array B with a shortened version of the text following ">gi". I figured the easiest way would be to split at "|"; however, this does not work (no separation happens with my code if I replace " " with "|"). My code is below and does split nicely after " " if I replace the "|" with " " in the INPUT; however, I get into trouble when I want the text between the [ ] brackets, which is NOT always there and is not always only 2 words:

B=$( grep -n ">gi" 1VAO_1DII_5fxe_all_hits_combined.txt | cut -d : -f 1)
awk <1VAO_1DII_5fxe_all_hits_combined.txt >seqIDs_1VAO_1DII_5fxe_all_hits_combined.txt -v lines="$B" '
BEGIN {split(lines, a, " "); for (i in a) change[a[i]]=1}
NR in change {$0 = ">" $4}
1
'

Let me know if more explanation is needed!

INPUT:

>gi|9955361|pdb|1E0Y|A:1-560 Chain A, Structure Of The D170sT457E DOUBLE MUTANT OF VANILLYL- Alcohol Oxidase
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVA
>gi|557721169|dbj|GAD99964.1|:1-560 hypothetical protein NECHADRAFT_63237 [Byssochlamys spectabilis No. 5]
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVAPRNV

desired OUTPUT:

>1E0Y
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVAPRNV
>GAD99964.1 Byssochlamys spectabilis No. 5
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVA
This can be done in one step with awk (GNU awk, for the three-argument match()):

awk -F'|' '/^>gi/{a=1;match($NF,/\[([^]]*)]/, b);print ">"$4" "b[1];next}a{print}!$0{a=0}' input > output

In a more readable form:

/^>gi/ {                        # when the line starts with ">gi"
    a=1                         # set flag "a" to 1
    # extract the optional part between brackets in the last field
    match($NF, /\[([^]]*)]/, b)
    print ">" $4 " " b[1]       # display the shortened header line
    next                        # jump to the next record
}
a { print }                     # while "a" is set, display the line
!$0 { a=0 }                     # on an empty line, clear "a" to stop the display
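As an aside, match() with a third array argument is a GNU awk extension. A sketch of a portable variant using the POSIX RSTART/RLENGTH variables instead; it also avoids the trailing space the gawk version leaves on headers with no bracketed part:

awk -F'|' '
/^>gi/ {
    a = 1
    desc = ""
    if (match($NF, /\[[^]]*]/))                       # POSIX two-argument match
        desc = " " substr($NF, RSTART + 1, RLENGTH - 2)
    print ">" $4 desc
    next
}
a { print }
!$0 { a = 0 }
' input > output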
print if all values are higher
I have a file like:

A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
B 37.40,37.40,38.40,38.80,58.40,58.80,45.00,44.8
.
.
.

I want to print those lines where all values in column 2 are more than 50.

Output:

A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2

I tried:

cat file | tr ',' '\t' | awk '{for (i=2; i<=NF; i++){if($i<50) continue; else print $i}}'
I hope you meant the r tag you added to your question.

tab <- read.table("file")
splt <- strsplit(as.character(tab[[2]]), ",")
rows <- unlist(lapply(splt, function(a) all(as.numeric(a) > 50)))
tab[rows,]

This reads your file as a space-separated table, splits the second column into individual values (resulting in a list of character vectors), then computes a logical value for each row depending on whether or not all of its values are > 50. These results are combined into a logical vector which is then used to subset your data.
The field separator can be any regular expression, so if you include commas in FS your approach works:

awk '{ for(i=2; i<=NF; i++) if($i<=50) next } 1' FS='[ \t,]+' infile

Output:

A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2

Explanation

The for-loop runs through the comma-separated values in the second column, and if any of them is lower than or equal to 50, next is executed, i.e. skip to the next line. If the first block is passed, the 1 is encountered, which evaluates to true and executes the default block: { print $0 }.
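An equivalent sketch that leaves FS at its whitespace default and splits the second column itself, in case you want to keep the two original columns addressable as $1 and $2:

awk '{
    n = split($2, v, ",")        # break the comma-separated list into an array
    for (i = 1; i <= n; i++)
        if (v[i] <= 50) next     # any value <= 50 disqualifies the line
    print                        # all values were above 50
}' infile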
Printing all lines that have/are duplicates of first field in Unix shell script
I have a list of semicolon-delimited data:

TR=P561;dir=o;day=sa;TI=16:30;stn=south station;Line=worcester

I need to take this file and print out only the lines with TR values that occur more than once. I would like the first occurrence and all duplicates listed. Thanks
If you are willing to make 2 passes on the file:

awk '!/TR=/ { next }                  # ignore lines that do not set TR
     { t = $0; sub(".*TR=", "", t); sub(";.*", "", t) }   # extract the TR value
     FNR == NR { a[t] += 1 }          # 1st pass: count how often each TR value is seen
     FNR != NR && a[t] > 1            # 2nd pass: print lines whose TR value is seen more than once
' input-file input-file

This uses a common awk idiom of comparing FNR with NR to tell which pass over the file we are in. By passing input-file as an argument twice, you can run one action on the first pass and a different one on the second.
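A single-pass sketch is also possible if you buffer the lines per TR value. Note that it prints the matching lines grouped by TR value, in arbitrary group order, rather than in their original interleaving, which may or may not matter for your use case:

awk '
    !/TR=/ { next }
    {
        t = $0; sub(".*TR=", "", t); sub(";.*", "", t)   # extract the TR value
        cnt[t]++
        buf[t] = buf[t] $0 ORS                           # buffer lines per TR value
    }
    END {
        for (k in cnt)
            if (cnt[k] > 1) printf "%s", buf[k]          # emit only duplicated groups
    }
' input-file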