Replace a value if this value is present in a txt file - unix

Good morning everyone. I have a data.ped file made up of thousands of columns and hundreds of lines. The first 6 columns of the first 4 lines of the file look like this:
186 A_Han-4.DG 0 0 1 1
187 A_Mbuti-5.DG 0 0 1 1
188 A_Karitiana-4.DG 0 0 1 1
191 A_French-4.DG 0 0 1 1
And I have an ids.txt file that looks like this:
186 Ignore_Han(discovery).DG
187 Ignore_Mbuti(discovery).DG
188 Ignore_Karitiana(discovery).DG
189 Ignore_Yoruba(discovery).DG
190 Ignore_Sardinian(discovery).DG
191 Ignore_French(discovery).DG
192 Dinka.DG
193 Dai.DG
What I need is to replace (in Unix) the value in the first column of the data.ped file with the value from the second column of ids.txt on the line whose first column matches the value being replaced. For example, I want to replace the value "186" in the first column of data.ped with "Ignore_Han(discovery).DG" from the second column of ids.txt, because the first column of that line in ids.txt is "186". So the output.ped file must look like this:
Ignore_Han(discovery).DG A_Han-4.DG 0 0 1 1
Ignore_Mbuti(discovery).DG A_Mbuti-5.DG 0 0 1 1
Ignore_Karitiana(discovery).DG A_Karitiana-4.DG 0 0 1 1
Ignore_French(discovery).DG A_French-4.DG 0 0 1 1
The values in the first column of the data.ped file are a subset of the values in the first column of the ids.txt file, so there is always a match.
Edit:
I've tried this:
awk 'NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]; print}' ids.txt data.ped > output.ped
but when I check the result with:
cut -f 1-6 -d " " output.ped
I get this strange output:
A_Han-4.DG 0 0 1 1y).DG
A_Mbuti-5.DG 0 0 1 1y).DG
A_Karitiana-4.DG 0 0 1 1y).DG
A_French-4.DG 0 0 1 1y).DG
while if I use this command:
cut -f 1-6 -d " " output.ped | less
I get this:
Ignore_Han(discovery).DG^M A_Han-4.DG 0 0 1 1
Ignore_Mbuti(discovery).DG^M A_Mbuti-5.DG 0 0 1 1
Ignore_Karitiana(discovery).DG^M A_Karitiana-4.DG 0 0 1 1
Ignore_French(discovery).DG^M A_French-4.DG 0 0 1 1
and I can't figure out why there is that ^M in every line.

awk 'NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]} 1' ids.txt data.ped
output:
Ignore_Han(discovery).DG A_Han-4.DG 0 0 1 1
Ignore_Mbuti(discovery).DG A_Mbuti-5.DG 0 0 1 1
Ignore_Karitiana(discovery).DG A_Karitiana-4.DG 0 0 1 1
Ignore_French(discovery).DG A_French-4.DG 0 0 1 1
This is a classic awk task, with small variations depending on your requirements. Here we replace the first field of data.ped only if its value was found in ids.txt; otherwise we print the line unchanged. If you would rather drop the lines that don't match:
awk 'NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]; print}' ids.txt data.ped
There is no need for the input files to be sorted, and the order of the lines in data.ped is preserved.
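To write the result to output.ped and spot-check the first columns the way the question does, something like this should work:
awk 'NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]} 1' ids.txt data.ped > output.ped
cut -d " " -f 1-6 output.ped | head -n 4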
UPDATE:
If you have Ctrl-M characters in your inputs, remove them first with
tr -d '\r' < file > file.tmp && mv file.tmp file
for any file you use. In general, I suggest running dos2unix on any text file that could contain ^M (carriage return, \r) characters, which usually come from DOS/Windows editing.
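You can also strip the carriage returns inside awk itself, so no separate cleanup pass is needed; a sketch of the same one-liner with that added:
awk '{sub(/\r$/,"")} NR==FNR{a[$1]=$2; next} $1 in a{$1=a[$1]} 1' ids.txt data.ped > output.ped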

Use the join command to join the two files:
join ids.txt data.ped > temp
Then you can use the cut command to remove the first column:
cut -d " " -f 2- temp > output.ped
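Note that join expects both files to be sorted on the join field, and its output is ordered by that key rather than by the original line order of data.ped. If the files are not already sorted, a sketch like this should work:
sort -k1,1 ids.txt > ids.sorted
sort -k1,1 data.ped > data.sorted
join ids.sorted data.sorted | cut -d " " -f 2- > output.ped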

Related

Most efficient way to subset a file by a list of text patterns to match

I have a large, tab-delimited file (technically a VCF of genetic variants), call it file.vcf, with millions of lines that look something like this:
locus1 1 15 0 0/0,21,2,2,;0
locus1 2 17 0 0/0,21,2,1,;0
locus2 1 10 0 0/1,21,2,2,;0
locus3 1 2 0 0/1,21,2,1,;0
...
locus123929 1 3 0 1/0,22,2,1,;0
locus123929 2 4 0 1/2,1,1,3,;0
I'd like to subset this original file to include all lines from loci in another file (search-file.txt). For example, if search-file.txt were:
locus1
locus3
locus123929
Then the final file would be:
locus1 1 15 0 0/0,21,2,2,;0
locus1 2 17 0 0/0,21,2,1,;0
locus3 1 2 0 0/1,21,2,1,;0
locus123929 1 3 0 1/0,22,2,1,;0
locus123929 2 4 0 1/2,1,1,3,;0
What is the most efficient way to subset this large a file using either bash or R? (Note: reading the entire file into memory, as in R, is very slow and often crashes the system.)
I'd use awk:
awk -F'\t' '
NR == FNR { a[$0]; next }
$1 in a
' search-file.txt file.vcf > filtered_file
A pure bash loop would be too slow for this job.
Note: Make sure the file search-file.txt doesn't have DOS line endings.
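If you want to check for them, GNU cat can make carriage returns visible (a sketch; dos2unix or tr -d '\r' will clean the file if needed):
cat -A search-file.txt | head    # DOS line endings show up as "^M$" at the end of each line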
Alternatively,
LC_ALL=C sort search-file.txt file.vcf |
awk '
NF == 1 { loc = $1; next }
$1 == loc
' > filtered_file
but this version may disturb the original order of lines.

How can I ignore lines while reading a text file in R? [duplicate]

I have a text file with tens of thousands of rows, with timestamps such as 2010 5 3 0 0 interspersed in between. The timestamps do not appear at consistent intervals, but the data rows consistently have two columns.
How can I import the two columns (the trial label and the number) while ignoring the rows that contain these timestamps?
a <- read.table('test.txt')
Currently, I get this error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 5 did not have 2 elements
Data
Trial 0.214526266019124
Trial 0.213914388985549
Trial 0.213886659329060
Trial 0.213886587273578
2010 5 3 0 0
Trial 0.213886587273578
Trial 0.213256610071994
Trial 0.213232963405967
Trial 0.213232928149832
2011 2 3 0 0
Trial 0.213886587273578
Trial 0.213256610071994
Trial 0.213232963405967
Trial 0.213232928149832
Trial 0.213886587273578
Trial 0.213256610071994
Trial 0.213232963405967
2011 2 6 0 0
You can use read.table (or another function) in combination with grep:
read.table(text=grep("Trial", readLines(path_to_your_file), value=TRUE))
Does this solve your problem?
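The same idea with column names added, in case that helps (a sketch; test.txt is the file from the question):
lines <- readLines("test.txt")
dat <- read.table(text = grep("^Trial", lines, value = TRUE),
                  col.names = c("trial", "value"))
str(dat)   # two columns: the "Trial" label and the numeric value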
If you have perl, you can do the data cleaning with it and capture the output without leaving R by using pipe(). Having to escape the regex and quotes in the perl "one-liner" makes it a little awkward, and it would probably be better as its own script.
The pipe to perl here is maybe more complicated than you need; perl -lne 'print $1 if m/Trial (.*)/' would probably suffice. The version below also captures the timestamp and appends it to all subsequent lines until the next timestamp is found. \W+ matches one or more non-word characters (here, the whitespace after Trial), but needs an extra escape to get past R's parser and reach perl as \\W+. \" is used to keep R from thinking the string we are giving it has ended, while still allowing string delimiters inside the perl code (you could use qq(..) instead of "..." in perl).
a <- read.table(
pipe("perl -lne '
BEGIN{$ts=\"0 0 0 0 0\"}
chomp;
if(/Trial\\W+(.*)/){
print \"$1 $ts\"
} else {
$ts=$_
}' test.txt"))
For the example data, the output would be:
V1 V2 V3 V4 V5 V6
1 0.2145263 0 0 0 0 0
2 0.2139144 0 0 0 0 0
3 0.2138867 0 0 0 0 0
4 0.2138866 0 0 0 0 0
5 0.2138866 2010 5 3 0 0
6 0.2132566 2010 5 3 0 0
7 0.2132330 2010 5 3 0 0
8 0.2132329 2010 5 3 0 0
txt<-readLines("C:\\Users\\abc\\Desktop\\new2.txt")
table<-strsplit(txt[grepl("T",substr(txt,1,1))],split = "\\s\\s")
table<-do.call("rbind", table)
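A possible follow-up, assuming the split yields two columns (the label and the number), to turn the character matrix into a numeric data frame:
df <- data.frame(trial = table[, 1], value = as.numeric(table[, 2]))  # assumes column 2 holds the numbers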

Is preprocessing the file with awk needed, or can it be done directly in R?

I used to process the csv file with awk; here is my first script:
tail -n +2 shifted_final.csv | awk -F, 'BEGIN {old=$2} {if($2!=old){print $0; old=$2;}}' | less
This script looks for repeating values in the 2nd column (the value on line n being the same as on lines n+1, n+2, ...) and prints only the first occurrence. For example, if you feed it the following input:
ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
Then the output will be:
1,0,0,1.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
EDIT:
I've made this a bit more challenging by adding a 2nd script.
The second script does the same but prints the last occurrence of each run of duplicates:
tail -n +2 shifted_final.csv | awk -F, 'BEGIN {old=$2; line=$0} {if($2==old){line=$0}else{print line; old=$2; line=$0}} END {print $0}' | less
Its output will be:
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0
I suppose R is a powerful language that should handle such tasks, but I've only found questions about calling awk scripts from R, etc. How can I do this in R?
Regarding the update to your question, a more general solution, thanks to @nicola:
Idx.first <- c(TRUE, tbl$orig[-1] != tbl$orig[-nrow(tbl)])
##
R> tbl[Idx.first,]
# ord orig pred as o.p
# 1 1 0 0 1 0
# 23 23 4 0 0 4
# 24 24 402 0 1 402
# 25 25 0 0 1 0
If you want to use the last occurrence of a value in a run, rather than the first, just append TRUE to @nicola's indexing expression instead of prepending it:
Idx.last <- c(tbl$orig[-1] != tbl$orig[-nrow(tbl)], TRUE)
##
R> tbl[Idx.last,]
# ord orig pred as o.p
# 22 22 0 0 0 0
# 23 23 4 0 0 4
# 24 24 402 0 1 402
# 25 25 0 0 1 0
In either case, tbl$orig[-1] != tbl$orig[-nrow(tbl)] compares the 2nd through nth values in column 2 with the 1st through (n-1)th values in column 2. The result is a logical vector, where TRUE elements indicate a change in consecutive values. Since the comparison is of length n-1, pushing an extra TRUE value to the front (case 1) selects the first occurrence in a run, whereas adding an extra TRUE at the back (case 2) selects the last occurrence in a run.
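For the sample data in the Data block below, you can confirm which rows each version keeps (a quick check):
which(c(TRUE, tbl$orig[-1] != tbl$orig[-nrow(tbl)]))   # 1 23 24 25  (first row of each run)
which(c(tbl$orig[-1] != tbl$orig[-nrow(tbl)], TRUE))   # 22 23 24 25 (last row of each run)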
Data:
tbl <- read.table(text = "ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0",
header = TRUE,
sep = ",")
For the (updated) question, you could use for example (thanks to @nrussell for his comment and suggestion):
idx <- c(1, cumsum(rle(tbl[,2])[[1]])[-1])
tbl[idx,]
# ord orig pred as o.p
#1 1 0 0 1 0
#23 23 4 0 0 4
#24 24 402 0 1 402
#25 25 0 0 1 0
It will return the first row of each 'block' of identical values in column orig.
rle(tbl[,2])[[1]] computes the run length of each new (different from the previous) value that appears in column orig
cumsum(...) computes the cumulative sum of those run lengths
finally, c(1, cumsum(...)[-1]) replaces the first number in that vector with a 1, so that the very first line of the data is always kept (see the worked example below)
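For the sample data, the intermediate values look like this (a quick check, assuming tbl as defined above):
rle(tbl[, 2])[[1]]                    # 22 1 1 1    (run lengths in column orig)
cumsum(rle(tbl[, 2])[[1]])            # 22 23 24 25
c(1, cumsum(rle(tbl[, 2])[[1]])[-1])  # 1 23 24 25  (rows kept)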

How do I create/sort a table containing a list of matched terms with their corresponding counts?

I am having problems trying to create a table containing a master list of names that have been matched and counted in two separate groups.
The Input_list.txt contains a master list of names and looks like this:
-5S_rRNA
-7SK
-ABCA8
-AC002480.4
-AC002978.1
-RP11-129B22.2
These names have been grep'ed and counted in two separate data groups, group1_data.txt and group2_data.txt, which look like this:
group1_data.txt
-5S_rRNA 20
-7SK 25
-AC002480.4 1
-AC002978.1 2
group2_data.txt
-5S_rRNA 1
-ABCA8 1
I would like to create a table that combines the master Input_list.txt with the two data files, showing the matched names and their corresponding counts. Where there is no match, I would like a value of 0, so it looks like this:
Input group1 group2
5S_rRNA 20 1
7SK 25 0
ABCA8 0 1
AC002480.4 1 0
AC002978.1 2 0
The number of matched names is not the same between Input_list.txt and the two data files.
I've tried sort but I'm really stuck. Any suggestions would be great!
Using join:
join -e 0 -a 1 -o '1.1 2.2' Input_list.txt group1_data.txt | \
join -a 1 -e 0 -o '1.1 1.2 2.2' - group2_data.txt | \
sed '/ 0 0$/d'
Prints:
-5S_rRNA 20 1
-7SK 25 0
-ABCA8 0 1
-AC002480.4 1 0
-AC002978.1 2 0
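One caveat: join expects its inputs to be sorted on the join field (they already are in the example above). If yours are not, a sketch like this will sort them in place first:
sort -k1,1 -o Input_list.txt Input_list.txt
sort -k1,1 -o group1_data.txt group1_data.txt
sort -k1,1 -o group2_data.txt group2_data.txt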

How to print non-duplicated rows based on a field with AWK?

I wish to print the non-duplicated rows, based on the 1st field, using AWK. Could anyone please kindly help?
Thanks
Input
1 28324 2077 2 1
1 24682 2088 1 0
1 25399 2074 1 0
2 28925 1582 2 1
3 30254 1450 1 0
4 25552 1131 1 1
4 31033 1134 1 0
5 29230 1522 2 0
Desired Output
2 28925 1582 2 1
3 30254 1450 1 0
5 29230 1522 2 0
awk '
(count[$1]++ < 1) { data[$1] = $0; }
END { for (x in data) if (count[x] == 1) print data[x]; }
'
If the output should be sorted on the first column, pipe it through sort -nk1.
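For example, on the sample input above (a sketch; inputfile stands for that data):
awk '(count[$1]++ < 1) { data[$1] = $0 }
     END { for (x in data) if (count[x] == 1) print data[x] }' inputfile | sort -nk1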
If your data is sorted on the first field, you can use this, which doesn't accumulate a potentially large array:
awk '
$1 != prev { if (count == 1) print line; count = 0 }
{ prev=$1; line = $0; ++count }
END { if (count == 1) print }' inputfile
For a fixed number of characters in the first column, and a uniq implementation that supports the -w option:
sort infile|uniq -uw1
