How to print non-duplicated rows based on a field with AWK?

I wish to print the non-duplicated rows based on the 1st field using AWK. Could anyone kindly help?
Thanks
Input
1 28324 2077 2 1
1 24682 2088 1 0
1 25399 2074 1 0
2 28925 1582 2 1
3 30254 1450 1 0
4 25552 1131 1 1
4 31033 1134 1 0
5 29230 1522 2 0
Desired Output
2 28925 1582 2 1
3 30254 1450 1 0
5 29230 1522 2 0

awk '
(count[$1]++ < 1) { data[$1] = $0; }
END { for (x in data) if (count[x] == 1) print data[x]; }
' inputfile
If the output should be sorted on the first column, pipe it through sort -nk1.
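For example (assuming the data is in a file named inputfile, as above):
awk '(count[$1]++ < 1){data[$1]=$0} END{for (x in data) if (count[x]==1) print data[x]}' inputfile | sort -nk1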

If your data is sorted on the first field (as the sample input is), you can use this version, which doesn't accumulate a potentially large array:
awk '
$1 != prev { if (count == 1) print line; count = 0 }
{ prev=$1; line = $0; ++count }
END { if (count == 1) print }' inputfile

For a fixed number of characters in the first column and a uniq implementation that supports the -w option (e.g. GNU uniq):
sort infile | uniq -uw1
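With the sample input above the key is a single character, so comparing just the first character (-w1) is the same as comparing the first field; the result matches the desired output:
2 28925 1582 2 1
3 30254 1450 1 0
5 29230 1522 2 0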

Related

Most efficient way to subset a file by a list of text patterns to match

I have a large, tab-delimited file (technically a VCF of genetic variants), call it file.vcf, with millions of lines that look something like this:
locus1 1 15 0 0/0,21,2,2,;0
locus1 2 17 0 0/0,21,2,1,;0
locus2 1 10 0 0/1,21,2,2,;0
locus3 1 2 0 0/1,21,2,1,;0
...
locus123929 1 3 0 1/0,22,2,1,;0
locus123929 2 4 0 1/2,1,1,3,;0
I'd like to subset this original file to include all lines from loci in another file (search-file.txt). For example, if search-file.txt were:
locus1
locus3
locus123929
Then the final file would be:
locus1 1 15 0 0/0,21,2,2,;0
locus1 2 17 0 0/0,21,2,1,;0
locus3 1 2 0 0/1,21,2,1,;0
locus123929 1 3 0 1/0,22,2,1,;0
locus123929 2 4 0 1/2,1,1,3,;0
What is the most efficient way to subset a file this large using either bash or R? (Note: reading the entire file into memory, as R would, is very slow and often crashes the system.)
I'd use awk:
awk -F'\t' '
NR == FNR { a[$0]; next }   # first file: remember each locus
$1 in a                     # second file: print lines whose first field is a known locus
' search-file.txt file.vcf > filtered_file
bash would be too slow for this job.
Note: Make sure the file search-file.txt doesn't have DOS line endings.
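If in doubt, one way to strip carriage returns first (a sketch; dos2unix search-file.txt would also work):
tr -d '\r' < search-file.txt > search-file.clean.txt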
Alternatively,
LC_ALL=C sort search-file.txt file.vcf |
awk '
NF == 1 { loc = $1; next }   # a line from search-file.txt: remember the locus
$1 == loc                    # a data line belonging to that locus
' > filtered_file
but this version may disturb the original order of lines (the first awk version preserves it).

AWK Split File every n-th Row but group IDs together

Let's assume I have the following file text.txt:
#something
#somethingelse
#anotherthing
1
2
2
3
3
3
4
4
4
5
5
6
7
7
8
9
9
9
10
11
11
11
14
15
I want to split this into multiple files every 5th data row, but if the number on the next row is identical, it should still end up in the same file. The header should be in every file, though it could also be stripped and reintroduced later.
This means something like this:
text.txt.1
#something
#somethingelse
#anotherthing
1
2
2
3
3
3
text.txt.2
#something
#somethingelse
#anotherthing
4
4
4
5
5
text.txt.3
#something
#somethingelse
#anotherthing
6
7
7
8
9
9
9
text.txt.4
#something
#somethingelse
#anotherthing
10
11
11
11
14
text.txt.5
#something
#somethingelse
#anotherthing
15
So I was thinking about something like this:
awk 'NR%5==1 && $1!=prev{i++;prev=$1}{print > FILENAME"."i}' text.txt
Both statements work by themselves but not together... Is that possible using awk?
Nice question.
With your example, this would work:
awk 'BEGIN{i=1}/^#/{header = header == "" ? $0 : header "\n" $0; next}c>=5 && $1!=prev{i++;c=0}{if(!c) print header>FILENAME"."i; print > FILENAME"."i;c++;prev=$1}' text.txt
You need to strip the header out and maintain a counter (c above); NR is just the current line number of the input, so it won't meet your needs when the rows that belong together don't fall on multiples of 5.
Broken up and improved a tiny bit:
awk 'BEGIN{i=1}
    /^#/{header = header == "" ? $0 : header ORS $0; next}
    c>=5 && $1!=prev{i++; c=0}
    !c {print header > FILENAME"."i}
    {print > FILENAME"."i; c++; prev=$1}
' text.txt
To solve the potential problems mentioned in the comment:
awk 'BEGIN{i=1}
    /^#/{header = header == "" ? $0 : header ORS $0; next}
    c>=5 && $1!=prev{i++; c=0}
    !c {close(f); f=(FILENAME"."i); print header > f}
    {print > f; c++; prev=$1}
' text.txt
or check Ed's answer, which is more precise and compatible across platforms and awk versions.
Using any awk in any shell on every Unix box:
$ cat tst.awk
/^#/ {
    hdr = hdr $0 ORS
    next
}
( (++numLines) % 5 ) == 1 {
    if ( $0 == prev ) {
        --numLines
    }
    else {
        close(out)
        out = FILENAME "." (++numBlocks)
        printf "%s", hdr > out
        numLines = 1
    }
}
{
    print > out
    prev = $0
}
$ awk -f tst.awk text.txt
$ head text.txt.*
==> text.txt.1 <==
#something
#somethingelse
#anotherthing
1
2
2
3
3
3
==> text.txt.2 <==
#something
#somethingelse
#anotherthing
4
4
4
5
5
==> text.txt.3 <==
#something
#somethingelse
#anotherthing
6
7
7
8
9
9
9
==> text.txt.4 <==
#something
#somethingelse
#anotherthing
10
11
11
11
14
==> text.txt.5 <==
#something
#somethingelse
#anotherthing
15
With your shown samples, please try the following awk program, written and tested in GNU awk. It collects the rows for each distinct value and re-emits them in ascending numeric order, so it assumes the values are numeric and will regroup identical values even when they are not adjacent.
awk '
BEGIN{
    outFile="text.txt."
    count=1
}
/^#/{
    header=(header?header ORS:"")$0
    next
}
{
    arr[$0]=(arr[$0]?arr[$0] ORS:"")$0
}
END{
    PROCINFO["sorted_in"] = "#ind_num_asc"
    print header > (outFile count)
    for(i in arr){
        num=split(arr[i],arr2,"\n")
        print arr[i] > (outFile count)
        len+=num
        if(len>=5){ len=0 }
        if(len==0){
            close(outFile count)
            count++
            print header > (outFile count)
        }
    }
}
' Input_file

Replace a value if this value is present in a txt file

Good morning everyone. I have a data.ped file made up of thousands of columns and hundreds of lines. The first 6 columns and the first 4 lines of the file look like this:
186 A_Han-4.DG 0 0 1 1
187 A_Mbuti-5.DG 0 0 1 1
188 A_Karitiana-4.DG 0 0 1 1
191 A_French-4.DG 0 0 1 1
And I have a ids.txt file that looks like this:
186 Ignore_Han(discovery).DG
187 Ignore_Mbuti(discovery).DG
188 Ignore_Karitiana(discovery).DG
189 Ignore_Yoruba(discovery).DG
190 Ignore_Sardinian(discovery).DG
191 Ignore_French(discovery).DG
192 Dinka.DG
193 Dai.DG
What I need is to replace (in unix) the value in the first column of the data.ped file with the value in the second column of ids.txt on the line whose first column matches it. For example, I want to replace the value "186" in the first column of data.ped with "Ignore_Han(discovery).DG" from the second column of ids.txt, because the first column of that line contains "186". So the output.ped file must look like this:
Ignore_Han(discovery).DG A_Han-4.DG 0 0 1 1
Ignore_Mbuti(discovery).DG A_Mbuti-5.DG 0 0 1 1
Ignore_Karitiana(discovery).DG A_Karitiana-4.DG 0 0 1 1
Ignore_French(discovery).DG A_French-4.DG 0 0 1 1
The values in the first column of the data.ped file are a subset of the values in the first column of the ids.txt file, so there is always a match.
Edit:
I've tried with this:
awk 'NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]; print}' ids.txt data.ped
but when I check the result with:
cut -f 1-6 -d " " output.ped
I get this strange output:
A_Han-4.DG 0 0 1 1y).DG
A_Mbuti-5.DG 0 0 1 1y).DG
A_Karitiana-4.DG 0 0 1 1y).DG
A_French-4.DG 0 0 1 1y).DG
while if I use this command:
cut -f 1-6 -d " " output.ped | less
I get this:
Ignore_Han(discovery).DG^M A_Han-4.DG 0 0 1 1
Ignore_Mbuti(discovery).DG^M A_Mbuti-5.DG 0 0 1 1
Ignore_Karitiana(discovery).DG^M A_Karitiana-4.DG 0 0 1 1
Ignore_French(discovery).DG^M A_French-4.DG 0 0 1 1
and I can't figure out why there is that ^M in every line.
awk 'NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]} 1' ids.txt data.ped
output:
Ignore_Han(discovery).DG A_Han-4.DG 0 0 1 1
Ignore_Mbuti(discovery).DG A_Mbuti-5.DG 0 0 1 1
Ignore_Karitiana(discovery).DG A_Karitiana-4.DG 0 0 1 1
Ignore_French(discovery).DG A_French-4.DG 0 0 1 1
This is a classic awk task, with various modifications according to your requirements. Here we replace the first field of data.ped only if we find its value in ids.txt; otherwise we print the line unchanged. If you would like to remove lines that don't match:
awk 'NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]; print}' ids.txt data.ped
There is no need for the input files to be sorted, and the order of the second file is preserved.
UPDATE:
If you have carriage-return (Ctrl-M) characters in your inputs, remove them first with
tr -d '\r' < file > file.tmp && mv file.tmp file
for any file you use. In general, I suggest running dos2unix on any text file that could contain characters like ^M or \r, usually coming from DOS/Windows editing.
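For example, to clean both inputs in place:
dos2unix ids.txt data.ped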
Use the join command to join the two files:
join ids.txt data.ped > temp
You can then use the cut command to remove the join column:
cut -d " " -f 2- temp > output.ped
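Note that join expects both inputs sorted lexicographically on the join field. A combined sketch using bash process substitution (assuming space-separated fields, and accepting that the original row order may change):
join <(sort -k1,1 ids.txt) <(sort -k1,1 data.ped) | cut -d ' ' -f 2- > output.ped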

How to fill in observations using other observations R or Stata

I have a dataset like this:
ID dum1 dum2 dum3 var1
1 0 1 . hi
1 0 . 0 hi
2 1 . . bye
2 0 0 1 .
What I'm trying to do is fill in missing observations using other observations with the same ID. So my end product would be something like:
ID dum1 dum2 dum3 var1
1 0 1 0 hi
1 0 1 0 hi
2 1 0 1 bye
2 0 0 1 bye
Is there any way I can do this in R or Stata?
This continues discussion of Stata solutions. The solution by @Pearly Spencer looks backward and forward from observations with missing values and so is fine for the example with just two observations per group, and possibly fine for some other situations.
An alternative approach makes use, as appropriate, of the community-contributed commands mipolate and stripolate from SSC as explained also at https://www.statalist.org/forums/forum/general-stata-discussion/general/1308786-mipolate-now-available-from-ssc-new-program-for-interpolation
Examples first, then commentary:
clear
input ID dum1a dum2a dum3a str3 var1a
1 0 1 . "hi"
1 0 . 0 "hi"
2 1 . . "bye"
2 0 0 1 ""
2 0 1 . ""
end
gen long obsno = _n
foreach v of var dum*a {
    quietly count if missing(`v')
    if r(N) > 0 capture noisily mipolate `v' obsno, groupwise by(ID) generate(`v'_2)
}
foreach v of var var*a {
    quietly count if missing(`v')
    if r(N) > 0 capture noisily stripolate `v' obsno, groupwise by(ID) generate(`v'_2)
}
list
+----------------------------------------------------------------+
| ID dum1a dum2a dum3a var1a obsno dum3a_2 var1a_2 |
|----------------------------------------------------------------|
1. | 1 0 1 . hi 1 0 hi |
2. | 1 0 . 0 hi 2 0 hi |
3. | 2 1 . . bye 3 1 bye |
4. | 2 0 0 1 4 1 bye |
5. | 2 0 1 . 5 1 bye |
+----------------------------------------------------------------+
Notes:
The groupwise option of mipolate and stripolate uses the rule: replace missing values within groups with the non-missing value in that group if and only if there is only one distinct non-missing value in that group. Thus if the non-missing values in a group are all 1, or all 42, or whatever it is, then interpolation uses 1 or 42 or whatever it is. If the non-missing values in a group are 0 and 1, then no go.
The variable obsno created here plays no role in that interpolation and is needed solely to match the general syntax of mipolate.
There is no assumption here that groups consist of just two observations or have the same number of observations. A common playground for these problems is data on families whenever some variables were recorded only for certain family members but it is desired to spread the values recorded to other family members. Naturally, in real data families often have more than two members and the number of family members will vary.
This question exposed a small bug in mipolate, groupwise and stripolate, groupwise: it doesn't exit as appropriate if there is nothing to do, as in dum1a where there are no missing values. In the code above, this is trapped by asking for interpolation if and only if missing values are counted. At some future date, the bug will be fixed and the code in this answer simplified accordingly, or so I intend as program author.
mipolate, groupwise and stripolate, groupwise both exit with an error message if any group is found with two or more distinct non-missing values; no interpolation is then done for any groups, even if some groups are fine. That is the point of the code capture noisily: the error message for dum2a is not echoed above. As program author I am thinking of adding an option whereby such groups will be ignored but that interpolation will take place for groups with just one distinct non-missing value.
Assuming your data is in df:
library(dplyr)
df %>%
    group_by(ID) %>%
    mutate(dum1 = dum1[dum1 != "."][1],
           dum2 = dum2[dum2 != "."][1],
           dum3 = dum3[dum3 != "."][1],
           var1 = var1[var1 != "."][1])
Using your toy example:
clear
input ID dum1a dum2a dum3a str3 var1a
1 0 1 . "hi"
1 0 . 0 "hi"
2 1 . . "bye"
2 0 0 1 "."
end
replace var1a = "" if var1a == "."
sort ID dum2a
list
+------------------------------------+
| ID dum1a dum2a dum3a var1a |
|------------------------------------|
1. | 1 0 1 . hi |
2. | 1 0 . 0 hi |
3. | 2 0 0 1 |
4. | 2 1 . . bye |
+------------------------------------+
In Stata you can do the following:
ds ID, not
local varlist `r(varlist)'
foreach var of local varlist {
    generate `var'b = `var'
    bysort ID (`var'): replace `var'b = cond(!missing(`var'[_n-1]), `var'[_n-1], ///
        `var'[_n+1]) if missing(`var')
}
list ID dum?ab var?ab
+----------------------------------------+
| ID dum1ab dum2ab dum3ab var1ab |
|----------------------------------------|
1. | 1 0 1 0 hi |
2. | 1 0 1 0 hi |
3. | 2 0 0 1 bye |
4. | 2 1 0 1 bye |
+----------------------------------------+

Is preprocessing the file with awk needed, or can it be done directly in R?

I used to process the csv file with awk; here is my 1st script:
tail -n +2 shifted_final.csv | awk -F, 'BEGIN {old=$2} {if($2!=old){print $0; old=$2;}}' | less
This script looks for repeating values in the 2nd column (if the value on line n is the same as on lines n+1, n+2, ...) and prints only the first occurrence. For example, if you feed it the following input:
ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
Then the output will be:
1,0,0,1.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
EDIT:
I've made this a bit more challenging by adding a 2nd script.
The second script does the same but prints the last occurrence of each run:
tail -n +2 shifted_final.csv | awk -F, 'NR==1 {old=$2; line=$0; next} {if($2==old){line=$0}else{print line; old=$2; line=$0}} END {print line}' | less
Its output will be:
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
I suppose R is a powerful language that should handle such tasks, but I've found only questions about calling awk scripts from R, etc. How can this be done in R?
Regarding the update to your question, a more general solution, thanks to @nicola:
Idx.first <- c(TRUE, tbl$orig[-1] != tbl$orig[-nrow(tbl)])
##
R> tbl[Idx.first,]
# ord orig pred as o.p
# 1 1 0 0 1 0
# 23 23 4 0 0 4
# 24 24 402 0 1 402
# 25 25 0 0 1 0
If you want to use the last occurrence of a value in a run, rather than the first, just append TRUE to @nicola's indexing expression instead of prepending it:
Idx.last <- c(tbl$orig[-1] != tbl$orig[-nrow(tbl)], TRUE)
##
R> tbl[Idx.last,]
# ord orig pred as o.p
# 22 22 0 0 0 0
# 23 23 4 0 0 4
# 24 24 402 0 1 402
# 25 25 0 0 1 0
In either case, tbl$orig[-1] != tbl$orig[-nrow(tbl)] compares the 2nd through nth values in column 2 with the 1st through (n-1)th values in column 2. The result is a logical vector whose TRUE elements indicate a change in consecutive values. Since the comparison is of length n-1, prepending an extra TRUE value (case 1) selects the first occurrence in each run, whereas appending an extra TRUE (case 2) selects the last occurrence in each run.
Data:
tbl <- read.table(text = "ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0",
header = TRUE,
sep = ",")
For the (updated) question, you could use for example (thanks to @nrussell for his comment and suggestion):
idx <- c(1, cumsum(rle(tbl[,2])[[1]])[-1])
tbl[idx,]
# ord orig pred as o.p
#1 1 0 0 1 0
#23 23 4 0 0 4
#24 24 402 0 1 402
#25 25 0 0 1 0
It will return the first row of each 'block' of identical values in column orig.
rle(tbl[,2])[[1]] computes the run lengths of each new (different from the previous) value that appears in column orig
cumsum(...) computes the cumulative sum of those run lengths, i.e. the position at which each run ends
finally, c(1, cumsum(...)[-1]) replaces the first number in that vector with a 1, so that the very first line of the data will always be present
For the sample, the run lengths are c(22, 1, 1, 1), the cumulative sums c(22, 23, 24, 25), and the final index vector c(1, 23, 24, 25).
