If columns match then - with-statement

I have looked at several examples and I am confused as to what to use.
I have 3 columns. For each line, I would like to check whether the data in all 3 columns match, and either highlight the line or give some indication that at least one of the columns differs. What is the best approach?
3550 3550 3550 true
3551 3150 3550 false

A quick way would be with a spreadsheet like Excel or LibreOffice:
Paste your data into Excel or similar.
In the fourth column put a formula that checks the values and returns true or false based on whether they match.
E.g. if you have your values in columns A, B and C, put a formula like the following in the fourth column:
=IF(AND($A1=$B1,$A1=$C1),"true","false")
Another way would be using awk, if you have the data in a text file, e.g.:
$ cat data.txt
3550 3550 3550
3550 3551 3552
3550 3550 3550
$ awk '{if (($1==$2) && ($1==$3)) { print $1, $2, $3, "true" } else { print $1, $2, $3}}' data.txt
3550 3550 3550 true
3550 3551 3552
3550 3550 3550 true
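If you also want an explicit "false" on the rows that differ (as in the example output in the question), a small variation of the same awk command should do it:
$ awk '{ print $0, (($1==$2 && $1==$3) ? "true" : "false") }' data.txt
3550 3550 3550 true
3550 3551 3552 false
3550 3550 3550 true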

Related

merging columns but it goes into a new line in perl code

I am pretty new to Perl, and I am merging some datasets with the following code. The data is set up as such: the first row specifies the sample names, and the counts follow in the second, third, ... columns. The first column specifies the gene names. I've got 2 big datasets that I'm merging together, and I have been using the following Perl script, specifying the path to the script and running it in Terminal:
$ cd /path/to/file
$ perl /path/to/file dataset1.txt dataset2.txt merged.txt
The Perl script is as follows:
use strict;
my $file1=$ARGV[0];
my $file2=$ARGV[1];
my $out=$ARGV[2];
my %hash=();
open(RF,"$file1") or die $!;
while(my $line=<RF>){
    chomp($line);
    my @arr=split(/\t/,$line);
    my $gene=shift(@arr);
    $hash{$gene}=join("\t",@arr);
}
close(RF);
open(RF,"$file2") or die $!;
open(WF,">$out") or die $!;
while(my $line=<RF>){
    chomp($line);
    my @arr=split(/\t/,$line);
    my $gene=shift(@arr);
    if(exists $hash{$gene}){
        print WF $gene . "\t" . $hash{$gene} . "\t" . join("\t",@arr) . "\n";
    }
}
close(WF);
close(RF);
With the above code I am supposed to get a merged table, with the duplicate rows deleted and the second text file's columns (Sample A to Sample Z) merged onto the first text file's columns (Sample 1 to Sample 100), so it should look like this, separated by tabs:
Gene Name Sample 1 Sample 2 ..... Sample A Sample B...
TP53 2.345 2.234 4.32 4.53
The problem is that my merged files come back with the two datasets merged, but with the second dataset on the next row instead of the same row. It will recognise, sort, and merge the counts, but onto the next row. Is there something wrong with my code or my input?
Thank you for all of your help!!
The double line issue might be because of foreign line endings in your input file. You can check this with a command such as:
$ perl -MData::Dumper -ne'$Data::Dumper::Useqq=1; print Dumper $_' file1.txt
There are more issues with your code, as follows.
What you seem to be doing is joining lines based on the name in column 1. You should be aware that this match is done case-sensitively, so it will differentiate between, for example, tp53 and TP53, or Gene name and Gene Name, or something as subtle as "TP53" and "TP53 " (the latter with a trailing space). That can be both good and bad, but be prepared for edge cases.
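If you decide such variants should count as the same gene, one option is to normalize the key before storing and looking it up. A small sketch, assuming case and surrounding whitespace should be ignored:
my $key = lc $gene;        # fold case so "tp53" and "TP53" collide
$key =~ s/^\s+|\s+$//g;    # strip leading/trailing whitespace
$hash{$key} = join("\t", @arr);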
You are expecting 3 arguments to your program, the input files and the output file, but this is quite an un-Perlish way to go about it. I would use the diamond operator for the input files, and then redirect the output with shell commands, such as:
$ perl foo.pl file1 file2 > merged.txt
This will give you the flexibility of adding more files to merge, for example, and gives you the option to test the merge without committing to a file.
You are using the 2-argument form of open, without specifying an open mode (e.g. "<"). That is very dangerous and leaves you open to code injection. For example, someone could pass "| rm -rf /" as the first argument to your program and delete your whole hard drive (or as much of it as their permissions allow). To prevent this, use the 3-argument open and specify a hard-coded open mode.
Open calls in Perl should also use a lexical file handle, e.g. my $fh, not a global one. It should look like this:
open my $fh, "<", $input1 or die $!;
open my $fh_out, ">", $output or die $!;
But since we are using the diamond operator, Perl handles that for us automagically.
You also do not need to separate the reading of the files into two loops, since you are basically doing the same thing. There is also no need to first split the lines, and then join them back together.
I wrote this as a sample of how it can be done:
use strict;
use warnings;
my %data;
while (<DATA>) {
    chomp;
    my ($name, $line) = /^([^\t]+)(.+)/;   # a regex match here avoids split
    $data{$name} .= $line;                 # merge lines using concatenation
}
for my $name (sort keys %data) {
    print $name . $data{$name} . "\n";
}
__DATA__
Gene Name Sample 1 Sample 2 Sample 3 Sample 4
TP53 2.345 2.234 4.32 4.53
TP54 2.345 2.234 4.32 4.53
TP55 2.345 2.234 4.32 4.53
Gene Name Sample A Sample B Sample C Sample D
TP53 2.345 2.234 4.32 2.53
TP54 2.212 1.234 3.32 6.53
TP55 1.345 2.114 7.32 5.53
On my system it gives the output:
Gene Name Sample 1 Sample 2 Sample 3 Sample 4 Sample A Sample B Sample C Sample D
TP53 2.345 2.234 4.32 4.53 2.345 2.234 4.32 2.53
TP54 2.345 2.234 4.32 4.53 2.212 1.234 3.32 6.53
TP55 2.345 2.234 4.32 4.53 1.345 2.114 7.32 5.53
This will output the lines in alphabetical order. If you want to preserve the order of the files, you can collect the names in an array while reading the file, and use that when printing. Arrays preserve the order, hash keys do not.
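A minimal sketch of that idea, reusing the loop above (only the @order array is new):
my %data;
my @order;
while (<DATA>) {
    chomp;
    my ($name, $line) = /^([^\t]+)(.+)/;
    push @order, $name unless exists $data{$name};   # remember first-seen order
    $data{$name} .= $line;
}
for my $name (@order) {
    print $name . $data{$name} . "\n";
}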

grep and awk, combine commands?

I have a file that looks like:
This is a RESTRICTED site.
All connections are monitored and recorded.
Disconnect IMMEDIATELY if you are not an authorized user!
sftp> cd outbox
sftp> ls -ltr
-rw------- 1 0 0 1911 Jun 12 20:40 61N0584832_EDIP000749728818_MFC_20190612203409.txt
-rw------- 1 0 0 1878 Jun 13 06:01 613577165_EDIP000750181517_MFC_20190613055207.txt
I want to print only the .txt file names, ideally in one command.
I can do:
grep -e '^-' outfile.log > outfile.log2
...which gives only the lines that start with '-':
-rw------- 1 0 0 1911 Jun 12 20:40 61N0584832_EDIP000749728818_MFC_20190612203409.txt
-rw------- 1 0 0 1878 Jun 13 06:01 613577165_EDIP000750181517_MFC_20190613055207.txt
And then:
awk '{print $9}' outfile.log2 > outfile.log3
..which gives the desired output:
61N0584832_EDIP000749728818_MFC_20190612203409.txt
613577165_EDIP000750181517_MFC_20190613055207.txt
And so the question is, can these 2 commands be combined into 1?
You may use a single awk:
awk '/^-/{ print $9 }' file > outputfile
Or
awk '/^-/{ print $9 }' file > tmp && mv tmp file
It works like this:
/^-/ - finds each line starting with -
{ print $9 } - prints Field 9 of the matching lines only.
It seems like matching the leading - is not really what you want. If you just want the .txt files as output, filter on the file name:
awk '$9 ~ /\.txt$/{print $9}' input-file
Using grep with the PCRE (-P) flag enabled:
grep -oP '^-.* \K.*' outfile.log
61N0584832_EDIP000749728818_MFC_20190612203409.txt
613577165_EDIP000750181517_MFC_20190613055207.txt
'^-.* \K.*' : On a line starting with -, everything up to the last space is matched but discarded (anything to the left of \K is matched and ignored), and the part matched to the right of \K is printed.
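If you would rather key on the .txt suffix than on the leading -, a plain grep -o (no PCRE needed) should also do it:
$ grep -o '[^ ]*\.txt$' outfile.log
61N0584832_EDIP000749728818_MFC_20190612203409.txt
613577165_EDIP000750181517_MFC_20190613055207.txt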
Since they clearly write "I want to print only the .txt file names", we should test for a .txt file, and since the file name is always the last column, we can make it more portable by testing only the last field, like this:
awk '$NF ~ /\.txt$/{print $NF}' outfile.log > outfile.log2
61N0584832_EDIP000749728818_MFC_20190612203409.txt
613577165_EDIP000750181517_MFC_20190613055207.txt

R truncates text files with certain encodings

I'm trying to read into R a test file encoded in Code page 437. Here is the file, and here is its hex-dump:
00000000: 0b0c 0e0f 1011 1213 1415 1617 1819 1a1b ................
00000010: 1c1d 1e1f 2021 2223 2425 2627 2829 2a2b .... !"#$%&'()*+
00000020: 2c2d 2e2f 3031 3233 3435 3637 3839 3a3b ,-./0123456789:;
00000030: 3c3d 3e3f 4041 4243 4445 4647 4849 4a4b <=>?@ABCDEFGHIJK
00000040: 4c4d 4e4f 5051 5253 5455 5657 5859 5a5b LMNOPQRSTUVWXYZ[
00000050: 5c5d 5e5f 6061 6263 6465 6667 6869 6a6b \]^_`abcdefghijk
00000060: 6c6d 6e6f 7071 7273 7475 7677 7879 7a7b lmnopqrstuvwxyz{
00000070: 7c7d 7e7f ffad 9b9c 9da6 aeaa f8f1 fde6 |}~.............
00000080: faa7 afac aba8 8e8f 9280 90a5 999a e185 ................
00000090: a083 8486 9187 8a82 8889 8da1 8c8b a495 ................
000000a0: a293 94f6 97a3 9681 989f e2e9 e4e8 eae0 ................
000000b0: ebee e3e5 e7ed fc9e f9fb ecef f7f0 f3f2 ................
000000c0: a9f4 f5c4 b3da bfc0 d9c3 b4c2 c1c5 cdba ................
000000d0: d5d6 c9b8 b7bb d4d3 c8be bdbc c6c7 ccb5 ................
000000e0: b6b9 d1d2 cbcf d0ca d8d7 cedf dcdb ddde ................
000000f0: b0b1 b2fe 0a .....
The file contains 245 characters (including the final newline), but R only reads 242 of them:
> test_text <- readLines(file('437__characters.txt', encoding='437'))
Warning message:
In readLines(file("437__characters.txt", :
incomplete final line found on '437__characters.txt'
> test_text
[1] "\v\f\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037 !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\177 ¡¢£¥ª«¬°±²µ·º»¼½¿ÄÅÆÇÉÑÖÜßàáâäåæçèéêëìíîïñòóôö÷ùúûüÿƒΓΘΣΦΩαδεπστφⁿ₧∙√∞∩≈≡≤≥⌐⌠⌡─│┌┐└┘├┤┬┴┼═║╒╓╔╕╖╗╘╙╚╛╜╝╞╟╠╡╢╣╤╥╦╧╨╩╪╫╬▀▄█▌▐░▒"
> nchar(test_text)
[1] 242
You'll note that R doesn't read the final characters "▓■\n".
My best guess is that this is something to do with how R determines the length of text files, because of the following:
Even though the file is terminated with a newline (0x0a), R gives an 'incomplete final line found' warning
Adding seven or more characters to the end of the file makes it read correctly
Similarly, the file is read correctly if you remove three characters from anywhere in the file
The same issue seems to occur with reading files encoded in other DOS code pages
This question might be related: R: read.table stops when meeting specific utf-16 characters.
It appears to be something wrong with readLines(), but it could very well be an issue with the text-mode file connection, with something going amiss in the encoding = part. Anyway, here's a workaround: load the file as binary, and then convert. And stay away from bad voodoo 1980s code pages.
Using readLines()
This does not capture the last \n, since that delimits the unit of text input for readLines().
test_text2 <- readLines(file("~/Downloads/437__characters.txt", raw = TRUE))
test_text3 <- stringi::stri_conv(test_text2, "IBM437", "UTF-8")
stringi::stri_length(test_text3)
## [1] 244
test_text3
## [1] "\v\f\016\017\020\021\022\023\024\025\026\027\030\031\034\033\177\035\036\037 !\"#$%&'()*+,-./
## 0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\032 ¡¢£¥ª«¬°±²μ·º
## »¼½¿ÄÅÆÇÉÑÖÜßàáâäåæçèéêëìíîïñòóôö÷ùúûüÿƒΓΘΣΦΩαδεπστφⁿ₧∙√∞∩≈≡≤≥⌐⌠⌡─│┌┐└┘├┤┬┴┼═║╒╓╔╕╖╗╘╙╚╛╜╝╞╟╠╡╢╣╤╥
## ╦╧╨╩╪╫╬▀▄█▌▐░▒▓■"
Using readBin()
Captures everything including the \n.
test_text_bin <- readBin(file("~/Downloads/437__characters.txt", "rb"),
n = 245, what = "raw")
test_text_bin_UTF8 <- stringi::stri_conv(test_text_bin, "IBM437", "UTF-8")
stringi::stri_length(test_text_bin_UTF8)
## [1] 245
test_text_bin_UTF8
## [1] "\v\f\016\017\020\021\022\023\024\025\026\027\030\031\034\033\177\035\036\037 !\"#$%&'()*+,-./
## 0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\032 ¡¢£¥ª«¬°±²μ·º
## »¼½¿ÄÅÆÇÉÑÖÜßàáâäåæçèéêëìíîïñòóôö÷ùúûüÿƒΓΘΣΦΩαδεπστφⁿ₧∙√∞∩≈≡≤≥⌐⌠⌡─│┌┐└┘├┤┬┴┼═║╒╓╔╕╖╗╘╙╚╛╜╝╞╟╠╡╢╣╤╥
## ╦╧╨╩╪╫╬▀▄█▌▐░▒▓■\n"

Unix Pipelining "AWK" - The summation whilst matching

Below I have some raw data. My goal is to match on the 'column one' values and output the total number of bytes on a single line for each IP address.
Example output:
81.220.49.127 6654
81.226.10.238 328
81.227.128.93 84700
Raw Data:
81.220.49.127 328
81.220.49.127 328
81.220.49.127 329
81.220.49.127 367
81.220.49.127 5302
81.226.10.238 328
81.227.128.93 84700
Can anyone advise me on how to do this?
Using an associative array:
awk '{a[$1]+=$2}END{for (i in a){print i,a[i]}}' infile
Alternative to preserve order:
awk '!($1 in a){b[++cont]=$1}{a[$1]+=$2}END{for (c=1;c<=cont;c++){print b[c],a[b[c]]}}' infile
Another way, where arrays are not needed (this assumes lines with the same IP address are adjacent, as in the sample data):
awk 'lip != $1 && lip != ""{print lip,sum;sum=0}
{sum+=$NF;lip=$1}
END{print lip,sum}' infile
Result
81.220.49.127 6654
81.226.10.238 328
81.227.128.93 84700

How can we split values in a column separated by pipe into multiple rows

For example:
Roll No: Name City
100|200|300 Vicky Hyd
400|500|600 Kalyan Viz
into
100 vicky Hyd
200 vicky Hyd
300 vicky Hyd
400 Kalyan Viz
500 Kalyan Viz
600 Kalyan Viz
Could you please suggest a solution? Thanks
You can use awk to split the first field and then print each part along with the other values:
awk 'NR!=1{split($1, arr, "|"); for ( i in arr) print arr[i], $2, $3}' file
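Note that for (i in arr) does not guarantee any particular order in awk. If the parts should come out in their original left-to-right order, you can loop over the count returned by split() instead:
awk 'NR!=1{n=split($1, arr, "|"); for (i=1; i<=n; i++) print arr[i], $2, $3}' file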
