Conditionally Merging two lines into one line - unix

How can I merge two lines if they meet specific criteria, in a Unix terminal?
I have data like:
A1
B1
A2
B2
A3
A4
A5
B5
And I want to merge them like this:
A1, B1
A2, B2
A3,
A4,
A5, B5
Real data looks like this:
"224222"
<Frequency freq="0.136" allele="T" sampleSize="5008"/>
"224223"
<Frequency freq="0.3864" allele="T" sampleSize="5008"/>
"224224"
"224225"
<Frequency freq="0.3894" allele="G" sampleSize="5008"/>
"1801179"
"1861759"
I actually tried adding dummy delimiter text before the "A" lines to separate them, but I couldn't achieve it.

Using sed
sed 's/$/, /;N;/\n<Freq/{s/\n//};P;D' <file>
Explanation:
s/$/, / - Append a comma and a space to the current line
N - Append the next line to the pattern space
/\n<Freq/{s/\n//} - If that next line starts with <Freq, delete the newline joining the two
P - Print the pattern space up to the first newline
D - Delete the pattern space up to the first newline and restart the cycle with what remains
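Applied to the real data with GNU sed, this produces the following (note that unpaired lines, including the last one, keep the trailing ", "):
"224222", <Frequency freq="0.136" allele="T" sampleSize="5008"/>
"224223", <Frequency freq="0.3864" allele="T" sampleSize="5008"/>
"224224", 
"224225", <Frequency freq="0.3894" allele="G" sampleSize="5008"/>
"1801179", 
"1861759", 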

It can be done using awk getline (condition is a placeholder for whatever test marks the first line of a pair):
awk '{ if (condition) { if ((getline var) > 0) print $0", "var; else print $0 } else print $0 }' <file>
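For the real data above, a concrete way to fill in that skeleton might be the following sketch, which assumes a pair is always a quoted ID line followed directly by a <Frequency line:
awk '
/^"/ {
  line = $0
  while ((getline nxt) > 0) {
    if (nxt ~ /^<Freq/) { print line ", " nxt; line = ""; break }
    print line ","                 # no partner: print the ID alone, comma-terminated
    line = nxt                     # the lookahead was another ID; keep scanning
  }
  if (line != "") print line ","   # unpaired ID at end of input
}' <file>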

Related

Cut specific columns and collapse with delimiter in Unix

Say I have 6 different columns in a text file (as shown below)
A1 B1 C1 D1 E1 F1
1 G PP GG HH GG
z T CC GG FF JJ
I would like to extract the first, second and fourth columns, collapsed together with underscores (A1_B1_D1), followed by the third column, separated by a tab.
So the result would be:
A1_B1_D1 C1
1_G_GG PP
z_T_GG CC
I tried
cut -f 1,2,4 -d$'\t', but it is just not what I want.
If you need to maintain your column alignment, you can check the length of the combination of fields 1, 2 and 4 and add one or two tab characters as necessary:
awk '{
  printf (length($1"_"$2"_"$4) >= 8) ? "%s_%s_%s\t%s\n" : "%s_%s_%s\t\t%s\n",
         $1, $2, $4, $3
}' file
Example Output
A1_B1_D1 C1
1_G_GG PP
z_T_GG CC
Could you please try the following.
awk '
BEGIN{
  OFS="\t"
}
{
  print $1"_"$2"_"$4,$3
}
' Input_file
I've tried RavinderSingh13's code and it has the same output as mine, but I don't quite know the difference; anyway, here it is:
awk -F ' ' '{print $1"_"$2"_"$4"\t"$3}' /path/to/file
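(There is no real difference: the version above sets OFS="\t" and lets the comma in print insert it, while this one hard-codes the tab into the output string; the results are identical.)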
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+.*/\1_\2_\4\t\3/' file
Use pattern matching and back references.
\S+ means one or more non-white space characters.
\s+ means one or more white space characters.
\t represents a tab.
Another awk and using column -t for formatting.
$ cat cols_345.txt
A1 B1 C1 D1 E1 F1
1 G PP GG HH GG
z T CC GG FF JJ
$ awk -v OFS="_" '{ $3="\t"$3; print $1,$2,$4 $3 } ' cols_345.txt | column -t
A1_B1_D1 C1
1_G_GG PP
z_T_GG CC
$
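For completeness, the cut idea from the question can be rescued by pairing cut with paste and process substitution. This is a sketch assuming single-space separators as shown (if the real file is tab-separated, use -d$'\t' in the cuts); paste cycles through its delimiter list _, _, tab:
paste -d'__\t' <(cut -d' ' -f1 file) <(cut -d' ' -f2 file) <(cut -d' ' -f4 file) <(cut -d' ' -f3 file)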

deleting repetitive columns in unix

I would like to delete multiple repetitive columns from a huge file (about 1 million).
The columns that I want to delete all have the same column name, A, while the others have different, unique names. Say:
A B2 A B3
1.1 AA 1.2 AA
2.1 AB 4.3 CT
2.2 AC 6.4 GT
so the column headers are A, B2, A, B3, and so on.
How could I delete all the columns named A from the data?
Another in awk:
$ awk '
NR==1 {
  split($0,a)
  for(i in a)
    if(a[i]=="A")
      delete a[i]
}
{
  for(i=1;i<=NF;i++)
    printf "%s",(i in a?$i OFS:"")
  printf ORS
}' file
B2 B3
AA AA
AB CT
AC GT
I'm not sure I'm understanding your question correctly, but here is a (GNU) awk solution to delete all duplicate columns (keeping only the first occurrence):
#!/usr/bin/awk -f
NR==1 {
  seen[$1] = 1
  cols[0] = 1
  for (i=2; i<=NF; i++) {
    if (!($i in seen)) {
      seen[$i] = 1
      cols[length(cols)] = i
    }
  }
}
{
  for (i=0; i<length(cols); i++)
    printf $(cols[i]) " "
  printf "\n"
}
For the first line (NR==1), we find all non-duplicate columns (preserving their order); for all other lines, we just print the fields selected before (the cols array holds the field indexes we wish to keep).
$ ./filter.awk file
A B2 B3
1.1 AA AA
2.1 AB CT
2.2 AC GT
cut -d' ' -f $(head -1 filename|tr ' ' '\n'|awk '{if(!seen[$0]++) print NR}'|paste -s -d ',') filename
this will work like a charm.
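To unpack that command substitution (like the script above, it keeps the first occurrence of every header name):
cols=$(head -1 filename |                  # take the header line
       tr ' ' '\n' |                       # one column name per line
       awk '{if(!seen[$0]++) print NR}' |  # field numbers of first occurrences
       paste -s -d ',')                    # join them into a cut list, here 1,2,4
cut -d' ' -f "$cols" filename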
The question is solved by the James Brown code.
I added
#!/usr/bin/awk -f
as the first line of his code and corrected a tiny typo at the end of the code (an extra ' was deleted).
I am sorry, I did not have time to try all the other suggestions.
With my best wishes

Copy values from Dataframe to new Dataframe depending on value in first column and first row [duplicate]

I am trying to use unix to transform a tab delimited file from a short/wide format to long format, in a similar way as the reshape function in R. I hope to create three rows for each row in the starting file. Column 4 currently contains 3 values separated by commas. I hope to keep columns 1, 2, and 3 the same for each starting row, but have column 4 be one of the values from the initial column 4. This example probably makes it more clear than I can describe verbally:
current file:
A1 A2 A3 A4,A5,A6
B1 B2 B3 B4,B5,B6
C1 C2 C3 C4,C5,C6
goal:
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
As someone just becoming familiar with this language, my initial thought was to use sed to find the commas and replace them with a hard return:
sed 's/,/&\n/' data.frame
I am really not sure how to include the values for columns 1-3. I had low hopes of this working, but the only thing I could think of was to try inserting the column values with {print $1, $2, $3}.
sed 's/,/&\n{print $1, $2, $3}/' data.frame
Not to my surprise, the output looked like this:
A1 A2 A3 A4
{print $1, $2, $3} A5
{print $1, $2, $3} A6
B1 B2 B3 B4
{print $1, $2, $3} B5
{print $1, $2, $3} B6
C1 C2 C3 C4
{print $1, $2, $3} C5
{print $1, $2, $3} C6
It seems like an approach might be to store the values of columns 1-3 and then insert them. I am not really sure how to store the values; I think it may involve an adaptation of the following script, but I am having a hard time understanding all of its components.
NR==FNR{a[$1, $2, $3]=1}
Thanks in advance for your thoughts on this.
You can write a simple read loop for this and use parameter expansion to split the comma-delimited field:
#!/bin/bash
while read -r f1 f2 f3 c1; do
# split the comma delimited field 'c1' into its constituents
for c in ${c1//,/ }; do
printf "$f1 $f2 $f3 $c\n"
done
done < input.txt
Output:
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
As solution without calling an external program :
#!/bin/bash
data_file="d"
while IFS=" " read -r f1 f2 f3 r
do
IFS="," read f4 f5 f6 <<<"$r"
printf "$f1 $f2 $f3 $f4\n$f1 $f2 $f3 $f5\n$f1 $f2 $f3 $f6\n"
done <"$data_file"
In the great Miller there is the nest verb to do it
With
mlr --nidx --ifs "\t" nest --explode --values --across-records -f 4 --nested-fs "," input.tsv
you will have
A1 A2 A3 A4
A1 A2 A3 A5
A1 A2 A3 A6
B1 B2 B3 B4
B1 B2 B3 B5
B1 B2 B3 B6
C1 C2 C3 C4
C1 C2 C3 C5
C1 C2 C3 C6
If you don't need the output to be in any particular order within a group of the fourth column, the following awk one-liner might do:
awk '{split($4,a,","); for(i in a) print $1,$2,$3,a[i]}' input.txt
This works by splitting your 4th column into an array, then for each element of the array, printing the "new" four columns.
If order is important -- that is, A4 must come before A5, etc, then you can use a classic for loop:
awk '{split($4,a,","); for(i=1;i<=length(a);i++) print $1,$2,$3,a[i]}' input.txt
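A small portability note: length() on an array is a GNU awk extension. Since split() returns the element count, a version that works in more awks is
awk '{n=split($4,a,","); for(i=1;i<=n;i++) print $1,$2,$3,a[i]}' input.txt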
But that's awk. And you're asking about bash.
The following might work:
#!/usr/bin/env bash
mapfile -t arr < input.txt
for s in "${arr[@]}"; do
  t=($s)
  mapfile -t -d, u <<<"${t[3]}"
  for v in "${u[@]}"; do
    printf '%s %s %s %s\n' "${t[@]:0:3}" "${v%$'\n'}"
  done
done
This copies your entire input file into the elements of an array, and then steps through that array, mapping each 4th-column into a second array. It then steps through that second array, printing the first three columns from the first array, along with the current field from the second array.
It's obviously similar in structure to the awk alternative, but much more cumbersome to read and code.
Note the ${v%$'\n'} on the printf line. This strips off the last field's trailing newline, which doesn't get stripped by mapfile because we're using an alternate delimiter.
Note also that there's no reason you have to copy all your input into an array; I just did it that way to demonstrate a little more of mapfile. You could of course use the old standard,
while read s; do
...
done < input.txt
if you prefer.

Using AWK to take a range of columns and print them as a single column

I have a file where I want to print every entry for a column i>N followed by the contents of the next column. Each line has the same number of columns. An example input:
a b c d
a1 b1 c1 d1
a2 b2 c2 d2
a3 b3 c3 d3
say in this case I want to skip the first column so the desired output would be
b
b1
b2
b3
c
c1
c2
c3
d
d1
d2
d3
I got close to what I wanted using
awk '{for(i=2; i<=NF; i++) print $i}'
but this prints the entries row by row instead of all entries from each column consecutively.
Thanks in advance
If every line has the same number of fields then you can do:
awk '
{
  for(i=2;i<=NF;i++)
    rec[i]=(rec[i]?rec[i]RS$i:$i)
}
END {
  for(i=2;i<=NF;i++) print rec[i]
}' file
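The ternary rec[i]=(rec[i]?rec[i]RS$i:$i) appends each field to its column's buffer, inserting the record separator (a newline) before every value except the first, so END can print each finished column as a ready-made block.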
If the number of fields is uneven, then you also need to remember the maximum number of fields:
awk '
{
  for(i=2;i<=NF;i++) {
    rec[i]=(rec[i]?rec[i]RS$i:$i)
  }
  num=(num>NF?num:NF)
}
END {
  for(i=2;i<=num;i++) print rec[i]
}' file
Output:
b
b1
b2
b3
c
c1
c2
c3
d
d1
d2
d3
Using cut would be easier here:
# figure out how many fields
read -a fields < <(sed 1q file)
nf=${#fields[@]}
# start dumping the columns from the second one (i > N with N = 1)
n=2
for ((i = n; i <= nf; i++)); do
  cut -d " " -f "$i" file
done
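Note that this reads the file once per column, while the awk answers above make a single pass; for large files the awk approach is the cheaper one.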

Separate into different files after splitting the pattern

I have the following pattern
k0
lj33
lp90
ko00
j9
mn12
sh30
lp33
ji90
e3
nd32
jk90
hi43
df45
cv89
er43
I need different files containing
File1   File2   File3
k0      j9      e3
lj33    mn12    nd32
lp90    sh30    jk90
ko00    lp33    hi43
        ji90    df45
                cv89
                er43
Any suggestions?
Do you mean: each file starts with a two-character string?
Try this command:
csplit input '/^..$/' '{*}'
Please ignore the first empty file xx00.
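csplit names the pieces xx00, xx01, xx02, ..., so here xx01, xx02 and xx03 correspond to File1, File2 and File3.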
assuming that you need the data to be split whenever you reach a two-character string:
awk '{if(length($0)==2){filename=$0}; print >filename}' your_file
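If you want the File1/File2/File3 names from the question rather than files named after the two-character headers, a small variant of the same idea (a sketch; the close() keeps the number of open files down):
awk 'length($0)==2 { if (out) close(out); out = "File" (++n) } { print > out }' your_file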
