Using `diff` from R via `system(..)` - r

I wanted to compare two paths (named a and b) in R using the diff command from Bash.
In bash I would do
$ a=Path/to/foo/directory/
$ b=Path/to/bar/directory/
$ diff <(printf ${a} | tr / '\n') <(printf ${b} | tr / '\n')
3c3
< foo
---
> bar
So from R I am trying
a="Path/to/foo/directory/"
b="Path/to/bar/directory/"
system(
paste0(
"a=",a,
";b=",b,
";diff <(printf ${a} | tr / '\n') <(printf ${b} | tr / '\n')"
)
)
OR
system(
paste0(
"diff <(printf ",a," | tr / '\n') <(printf ",b," | tr / '\n')"
)
)
but both return an error.
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `a=Path/to/foo/directory/;b=Path/to/bar/directory/;diff <(printf ${a} | tr / ''
even though copy-pasting the output of the paste0 function into bash works fine.
There might be better ways to compare strings in R and I would welcome alternative solutions. However, I am particularly interested in understanding what is going wrong with my usage of the system() function and how to solve it.

As explained here, system(..) is not running /usr/bin/bash but /usr/bin/sh. Here are two possible solutions to the problem.
Solution in "/usr/bin/sh"
sh has no process substitution, so to get a command that runs under /usr/bin/sh I had to write the strings to files first.
DiffPath = function(a,b,ManipulationFolder="~")
{
if (file.exists(ManipulationFolder))
{
system(
paste0(
"cd ",ManipulationFolder,
";a=",a,
";b=",b,
";printf ${a} | tr / '\n' > a.txt",
";printf ${b} | tr / '\n' > b.txt",
";diff a.txt b.txt",
";rm a.txt;rm b.txt"
)
)
} else
{
warning(paste0("Cannot find the ManipulationFolder ( ",ManipulationFolder," )"))
}
}
Solution in "/usr/bin/bash"
A nicer alternative is to pass the command explicitly to bash.
DiffPath = function(a,b)
{
system(
paste0(
'bash -c \'diff <(printf ',a,' | tr / "\n") <(printf ',b,' | tr / "\n")\''
)
)
}
Function call
a="Path/to/foo/directory/"
b="Path/to/bar/directory/"
DiffPath(a,b)
3c3
< foo
---
> bar
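
The same idea can be checked from a plain shell. The sketch below is not from the original answer: the function name diff_paths is made up for illustration, and it assumes bash and diff are installed. It passes the two paths to an explicit bash as positional parameters, so the calling shell (possibly sh) never has to parse the <( ... ) syntax and the paths are not pasted into the command string.

```shell
# diff_paths (hypothetical name): compare two slash-separated paths
# component by component. The paths arrive in the inner bash as $1 and $2,
# so only bash itself ever sees the process substitution.
diff_paths() {
    bash -c 'diff <(printf "%s" "$1" | tr / "\n") <(printf "%s" "$2" | tr / "\n")' _ "$1" "$2"
}

# diff exits with status 1 when the inputs differ, hence the "|| true".
diff_paths "Path/to/foo/directory/" "Path/to/bar/directory/" || true
```

Passing the paths as arguments rather than concatenating them also keeps spaces or quotes in a path from breaking the command.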

Related

jq parsing date to timestamp

I have the following script:
curl -s -S 'https://bittrex.com/Api/v2.0/pub/market/GetTicks?marketName=BTC-NBT&tickInterval=thirtyMin&_=1521347400000' | jq -r '.result|.[] |[.T,.O,.H,.L,.C,.V,.BV] | @tsv | tostring | gsub("\t";",") | "(\(.))"'
This is the output:
(2018-03-17T18:30:00,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(2018-03-17T19:00:00,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(2018-03-17T19:30:00,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(2018-03-17T20:00:00,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
I want to replace the date with timestamp.
I can make this conversion with date in the shell
date -d '2018-03-17T18:30:00' +%s%3N
1521325800000
I want this result:
(1521325800000,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(1521327600000,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(1521329400000,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(1521331200000,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
This data is stored in MySQL.
Is it possible to execute the date conversion with jq or another command like awk, sed, perl in a single command line?
Here is an all-jq solution that assumes the "Z" (UTC+0) timezone.
In brief, simply replace .T by:
((.T + "Z") | fromdate | tostring + "000")
To verify this, consider:
timestamp.jq
[splits("[(),]")]
| .[1] |= ((. + "Z")|fromdate|tostring + "000") # milliseconds
| .[1:length-1]
| "(" + join(",") + ")"
Invocation
jq -rR -f timestamp.jq input.txt
Output
(1521311400000,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(1521313200000,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(1521315000000,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(1521316800000,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
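
For comparison, the same UTC conversion can be sketched in portable awk, without jq or an external date call. This is only a sketch under stated assumptions: the timestamps are taken to be UTC (as in the jq answer above) and to have the form YYYY-MM-DDTHH:MM:SS, and the names to_epoch_ms and days are invented for this example.

```shell
# to_epoch_ms (hypothetical name): read "(ISO-timestamp,rest...)" lines on
# stdin and replace the first field with epoch milliseconds.
# Assumption: timestamps are UTC and shaped like 2018-03-17T18:30:00.
to_epoch_ms() {
    awk -F, 'BEGIN { OFS = FS }
    # Days since 1970-01-01 for a Gregorian calendar date.
    function days(y, m, d,    era, yoe, doy) {
        if (m <= 2) y--
        era = int(y / 400)
        yoe = y - era * 400
        doy = int((153 * (m > 2 ? m - 3 : m + 9) + 2) / 5) + d - 1
        return era * 146097 + yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy - 719468
    }
    NF == 0 { next }
    {
        ts = substr($1, 2)                    # drop the leading "("
        split(ts, p, /[-T:]/)                 # p[1..6] = Y M D h m s
        secs = days(p[1] + 0, p[2] + 0, p[3] + 0) * 86400 + p[4] * 3600 + p[5] * 60 + p[6]
        $1 = sprintf("(%d000", secs)          # seconds to milliseconds
        print
    }'
}
```

Run as to_epoch_ms < input.txt. Because it assumes UTC, it gives the same values as the jq output above, not the local-time values produced by date.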
Here is an unportable awk solution. It is not portable because it relies on the system date command; on the system I'm using, the relevant invocation looks like: date -j -f "%Y-%m-%eT%T" STRING "+%s"
awk -F, 'BEGIN{OFS=FS}
NF==0 { next }
{ sub(/\(/,"",$1);
cmd="date -j -f \"%Y-%m-%eT%T\" " $1 " +%s";
cmd | getline $1;
$1=$1 "000"; # milliseconds
printf "%s", "(";
print;
}' input.txt
Output
(1521325800000,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(1521327600000,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(1521329400000,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(1521331200000,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
Solution with sed:
sed -e 's/(\([^,]\+\)\(,.*\)/echo "(\$(date -d \1 +%s%3N),\2"/g' | ksh
Test:
<curl_command> | sed -e 's/(\([^,]\+\)\(,.*\)/echo "(\$(date -d \1 +%s%3N),\2"/g' | ksh
or:
<curl_command> > results_curl.txt
cat results_curl.txt | sed -e 's/(\([^,]\+\)\(,.*\)/echo "(\$(date -d \1 +%s%3N),\2"/g' | ksh

Calling comm from system() in R with process substitution

For efficiency reasons, I'd like to call comm in R via system(). I've grown accustomed to using syntax like:
comm -13 <(hadoop fs -cat /path/to/file | gunzip | awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{if($7 ~ /^".*"$/ && $9 ~ /^".*"$/) {print toupper($7),toupper($9)} else if($7 ~ /^[^"]/ && $9 ~ /^["]/) {print "\""toupper($7)"\"",toupper($9)} else if($7 ~ /^[^"]/ && $9 ~ /^[^"]/) {print "\""toupper($7)"\"","\""toupper($9)"\""}}' | sort) <(awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{if($1 ~ /^".*"$/ && $2 ~ /^".*"$/) {print toupper($1),toupper($2)} else if($1 ~ /^[^"]/ && $2 ~ /^["]/) {print "\""toupper($1)"\"",toupper($2)} else if($1 ~ /^[^"]/ && $2 ~ /^[^"]/) {print "\""toupper($1)"\"","\""toupper($2)"\""}}' /path/to/file | sort)
But when using this syntax from system, as in
system("comm -13 <(filea) <(fileb)")
I get the familiar error:
sh: -c: line 0: syntax error near unexpected token `('
From the above it's clear that system() is using sh and not bash, and that process substitution isn't supported. After reading other articles, I've attempted using
system("bash -c 'comm -13 <(hadoop fs -cat /path/to/file | gunzip | awk -vFPAT='([^,]*)|(\"[^\"]+\")' -vOFS=, '{if($7 ~ /^\".*\"$/ && $9 ~ /^\".*\"$/) {print toupper($7),toupper($9)} else if($7 ~ /^[^\"]/ && $9 ~ /^[\"]/) {print \"\\\"\"toupper($7)\"\\\"\",toupper($9)} else if($7 ~ /^[^\"]/ && $9 ~ /^[^\"]/) {print \"\\\"\"toupper($7)\"\\\"\",\"\\\"\"toupper($9)\"\\\"\"}}' | sort) <(awk -vFPAT='([^,]*)|(\"[^\"]+\")' -vOFS=, '{if($1 ~ /^\".*\"$/ && $2 ~ /^\".*\"$/) {print toupper($1),toupper($2)} else if($1 ~ /^[^\"]/ && $2 ~ /^[\"]/) {print \"\\\"\"toupper($1)\"\\\"\",toupper($2)} else if($1 ~ /^[^\"]/ && $2 ~ /^[^\"]/) {print \"\\\"\"toupper($1)\"\\\"\",\"\\\"\"toupper($2)\"\\\"\"}}' /path/to/file | sort)")
That is, escaping double quotes and backslashes as necessary. However, this returns the same error:
sh: -c: line 0: syntax error near unexpected token `('
I'm guessing this has something to do with the escaping of single quotes within bash -c within a double quoted string in system(). I'm a little confused as to how to manage the single quoting within bash -c within a double quoted string in system(). How should I navigate all of this escaping?
To solve this, I needed to escape everything in [within] in
bash -c "[within]"
using bash's escape rules (https://www.gnu.org/software/bash/manual/html_node/Double-Quotes.html), and then everything in [within2] in
system("[within2]")
using R's escape rules.
The end result is that backslashes and double quotes are escaped twice (once for bash, once for R), and $ is escaped once (for bash).
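
The bash layer of that escaping can be tried on its own, before R adds its own layer. The following is a minimal sketch, with a made-up function name (swap_fields) and a toy awk program standing in for the long command in the question:

```shell
# swap_fields (hypothetical name): the awk program sits inside bash -c "...".
# Inside the double quotes the calling shell would expand $2 and end the
# string at a bare ", so both are backslash-escaped. The inner bash then
# receives:  awk -F, '{ print $2 "-" $1 }'
swap_fields() {
    bash -c "awk -F, '{ print \$2 \"-\" \$1 }'"
}

printf 'a,b\n' | swap_fields    # prints b-a
```

Wrapping this in system("...") from R would then need R's own escaping applied to every backslash and double quote in the string, which is where the double escaping comes from.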

print duplicate entries without deleting unix/linux

Let's say I have a file like this with 2 columns
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
I am trying to get an output like this:
56-cde
56-cao
67-cde
67-cgh
456-hhh
456-jjjj
45678-aief
45678-nnmn
So basically instead of printing out the unique values I need to print the duplicates:
I tried to accomplish this using awk like this :
cat input.txt | awk -F"-" '{print $1,$2}' | sort -n | uniq -w 2 -D
This does show which values in column 1 are duplicated, along with their column 2 values. But because I hardcode the comparison width to 2 characters, it only works for the 2-digit numbers in column 1. Is there a way to do this using awk?
Thanks in advance.
See if your uniq has a -D option. My cygwin version does:
cat input.txt | sort | uniq -w 2 -D
Another awk solution without arrays (but with presort):
sort -n file | awk -F- '
NR==1{p=$1; a=$0; c++; next}
p==$1{a=a RS $0; c++; next}
c{print a}
{a=$0; p=$1; c=0}
END{if(c) print a}'
This is what I came up with (just an awk program, no external sort, uniq etc.):
BEGIN { FS = "-" }
{ arr[$1] = arr[$1] "-" $2 }
END {
for (i in arr) {
if ((n = split(arr[i], a)) < 3) continue
for (j = 2; j <= n; ++j)
print i"-"a[j]
}
}
It collects all numbers along with the different strings attached
in arr (assuming the strings won't contain dashes -).
With gawk, you could use arrays of arrays in order to avoid the concatenation and splitting with dashes.
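
A related sketch, not taken from the answers here: reading the file twice lets awk count each key on the first pass and print the repeated ones on the second, with no concatenation or splitting. The function name print_dups is invented for the example, and the input must be a regular file because it is read twice.

```shell
# print_dups (hypothetical name): print every line whose first "-"-separated
# field occurs more than once in the file. Lines keep their input order.
print_dups() {
    awk -F- 'NR == FNR { count[$1]++; next }   # pass 1: count each key
             count[$1] > 1' "$1" "$1"          # pass 2: print repeated keys
}
```

Called as print_dups input.txt, it keeps the lines in input order, so the output can be piped through sort if the duplicates should appear grouped as in the question.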
I would handle the varying-number-of-digits case by pre-conditioning the data so that the number field is a fixed large width (and use that width in uniq):
cat input.txt | awk -F- '{printf "%12d-%s\n",$1,$2}'| sort | uniq -w 12 -D
If you need the output left-justified as well, just tack on this post-conditioning step:
| awk '{print $1}'
Using Perl
$ cat two_cols.txt
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
$ perl -F"-" -lane ' @t=@{$kv{$F[0]}}; push(@t,$_); $kv{$F[0]}=[@t]; END { while(($x,$y)=each(%kv)){ print join("\n",@{$y}) if scalar @{$y}>1 }} ' two_cols.txt
67-cde
67-cgh
56-cde
56-cao
456-hhh
456-jjjj
45678-nnmn
45678-aief
$

While read line, awk $line and write to variable

I am trying to split a file into different smaller files depending on the value of the fifth field. A very nice way to do this was already suggested and also here.
However, I am trying to incorporate this into a .sh script for qsub, without much success.
The problem is that in the section where the file to which output the line is specified,
i.e., f = "Alignments_" $5 ".sam" print > f
, I need to pass a variable declared earlier in the script, which specifies the directory where the file should be written. I need to do this with a variable which is built for each task when I send out the array job for multiple files.
So say $output_path = ./Sample1
I need to write something like
f = $output_path "/Alignments_" $5 ".sam" print > f
But it does not seem to like having a $variable that is not a $field belonging to awk. I don't even think it likes having two "strings" before and after the $5.
The error I get back is that it takes the first line of the file to be split (little.sam) and tries to name f like that, followed by /Alignments_" $5 ".sam" (those last three put together correctly). It says, naturally, that it is too big a name.
How can I write this so it works?
Thanks!
awk -F '[:\t]' ' # read the list of numbers in Tile_Number_List
FNR == NR {
num[$1]
next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
f = "Alignments_" $5 ".sam" print > f
} ' Tile_Number_List.txt little.sam
UPDATE, AFTER ADDING -v TO AWK AND DECLARING THE VARIABLE opath
input=$1
outputBase=${input%.bam}
mkdir -v $outputBase\_TEST
newdir=$outputBase\_TEST
samtools view -h $input | awk 'NR >= 18' | awk -F '[\t:]' -v opath="$newdir" '
FNR == NR {
num[$1]
next
}
$5 in num {
f = newdir"/Alignments_"$5".sam";
print > f
} ' Tile_Number_List.txt -
mkdir: created directory `little_TEST'
awk: cmd. line:10: (FILENAME=- FNR=1) fatal: can't redirect to `/Alignments_1101.sam' (Permission denied)
awk variables are like C variables - just reference them by name to get their value, no need to stick a "$" in front of them like you do with shell variables:
awk -F '[:\t]' ' # read the list of numbers in Tile_Number_List
FNR == NR {
num[$1]
next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
output_path = "./Sample1/"
f = output_path "Alignments_" $5 ".sam"
print > f
} ' Tile_Number_List.txt little.sam
To pass the value of the shell variable such as $output_path to awk you need to use the -v option.
$ output_path=./Sample1/
$ awk -F '[:\t]' -v opath="$output_path" '
# read the list of numbers in Tile_Number_List
FNR == NR {
num[$1]
next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
f = opath"Alignments_"$5".sam"
print > f
} ' Tile_Number_List.txt little.sam
Also you still have the error from your previous question left in your script
EDIT:
The awk variable created with -v is opath, but you use newdir. What you want is:
input=$1
outputBase=${input%.bam}
mkdir -v $outputBase\_TEST
newdir=$outputBase\_TEST
samtools view -h "$input" | awk -F '[\t:]' -v opath="$newdir" '
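FNR == NR {
num[$1]
next
}
FNR >= 18 && $5 in num { # skip the first 17 lines of the samtools output
f = opath"/Alignments_"$5".sam" # <-- opath is the awk variable not newdir
print > f
}' Tile_Number_List.txt -
The NR >= 18 filter from the separate awk call is folded into the second block as FNR >= 18, since it applies to the samtools output and not to Tile_Number_List.txt.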
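
The -v mechanism can be tried in isolation with a small sketch. Everything named here is hypothetical (the function split_by_field, the group_ file prefix, and the choice of the first field as the key); only the -v option and the print > f redirection come from the answer above.

```shell
# split_by_field (hypothetical name): usage  split_by_field OUTDIR < input
# The shell variable holding the directory is handed to awk with -v and is
# then used inside the program by its plain awk name, without a "$".
split_by_field() {
    awk -v opath="$1" '{
        f = opath "/group_" $1 ".txt"   # one output file per value of field 1
        print > f
    }'
}
```

For example, printf '%s\n' 'x 1' 'y 2' 'x 3' | split_by_field ./out writes the two x lines to ./out/group_x.txt and the y line to ./out/group_y.txt, provided ./out already exists.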

extracting a pattern and a certain field from the line above it using awk and grep preferably

I have a text file like this:
********** time1 **********
line of text1
line of text1.1
line of text1.2
********** time2 **********
********** time3 **********
********** time4 **********
line of text2.1
line of text2.2
********** time5 **********
********** time6 **********
line of text3.1
I want to extract each line of text together with the time above it (without the stars) and store the result in a file. Times with no line of text beneath them have to be ignored. I want to do this preferably with grep and awk.
So for example, my output for the above input should be
time1 : line of text1
time1 : line of text1.1
time1 : line of text1.2
time4 : line of text2.1
time4 : line of text2.2
time6 : line of text3.1
How do I go about it?
This assumes that there are no spaces in the time and that there is only one (or zero) line of text after each time marker.
awk '$1 ~ /\*+/ {prev = $2} $1 !~ /\*+/ {print prev, ":", $0}' inputfile
Works with spaces in the time:
awk '/^[^*]+/ { gsub(/*/,"",x);printf x": "; print };{x=$0}' data.txt
You can do it like this with vim:
:%s_\*\+ \(YOUR TIME PATTERN\) \*\+\_.\(\[^*\].*\)$_\1 : \2_ | g_\*\+ YOUR TIME PATTERN \*\+_d
That is, it searches for TIME PATTERN lines and captures the time pattern together with the next line, provided that line does not start with *. It then builds the new line from the two captures and deletes every remaining TIME PATTERN line.
Note that this assumes the time pattern lines end with *, etc.
With awk:
awk '/\*+ YOUR TIME PATTERN \*+/ { time=gensub("\*+ (YOUR TIME PATTERN) \*+","\\1","g") }
! /\*+ YOUR TIME PATTERN \*+/ { print time " : " $0 }' INPUTFILE
And there are other ways to do it.
In awk:
#!/bin/bash
awk '
BEGIN{
t=0
}
{
if ($0 ~ " time[0-9]+ ") {
v=$2
t=1
}
else if ($0 ~ "line of text") {
if (t==1) {
printf("%s : %s\n", v, $0)
} else {
t=0;
}
}
}
' FILE
Just replace FILE by your filename.
This might work for you (GNU sed):
sed '/^\*\+ \S\+.*/!d;s/[ *]//g;$!N;/\n[^*]/!D;s/\n/ : /' file
Explanation:
Look for lines beginning with *'s if not delete. /^\*\+ \S\+.*/!d
Got a time line. Delete *'s and spaces (leaving time). s/[ *]//g
Get next line $!N
Check the second line doesn't begin with *'s otherwise delete first line /\n[^*]/!D
Got intended pattern, replace \n with spaced : and print. s/\n/ : /
awk '{ if( $0 ~ /^\*+ time[0-9] \*+$/ ) { time = $2 } else { print time " : " $0 } }' file
$ uniq -f 2 input-file | awk '{getline n; print $2 " : " n}'
If your timestamp has spaces in it, change the argument to the -f option so that uniq is only comparing the final string of *. Eg, use -f X where X-2 is the number of spaces in the timestamp. Also if there are spaces in the timestamp, the awk will need to change. Either of these will work:
$ uniq -f 3 input-file | awk -F '**********' '{getline n; print $2 " : " n}'
$ uniq -f 3 input-file | awk '{getline n; $1=""; $NF=""; print $0 ": " n }'
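
As a further sketch, with an invented function name (label_lines) and assuming every marker line has the form stars, space, time, space, stars, plain awk can handle both spaces in the time and several text lines under one marker:

```shell
# label_lines (hypothetical name): prefix each text line with the time taken
# from the most recent marker line. Markers with no text under them print
# nothing, because only non-marker lines are printed.
label_lines() {
    awk '/^\*+ .* \*+$/ {              # marker line: keep what is between the stars
             t = $0
             gsub(/^\*+ +| +\*+$/, "", t)
             next
         }
         { print t " : " $0 }'         # any other line is text under the last marker
}
```

Used as label_lines < inputfile > outputfile.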
