Calling comm from system() in R with process substitution

For efficiency reasons, I'd like to call comm in R via system(). I've grown accustomed to using syntax like:
comm -13 <(hadoop fs -cat /path/to/file | gunzip | awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{if($7 ~ /^".*"$/ && $9 ~ /^".*"$/) {print toupper($7),toupper($9)} else if($7 ~ /^[^"]/ && $9 ~ /^["]/) {print "\""toupper($7)"\"",toupper($9)} else if($7 ~ /^[^"]/ && $9 ~ /^[^"]/) {print "\""toupper($7)"\"","\""toupper($9)"\""}}' | sort) <(awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{if($1 ~ /^".*"$/ && $2 ~ /^".*"$/) {print toupper($1),toupper($2)} else if($1 ~ /^[^"]/ && $2 ~ /^["]/) {print "\""toupper($1)"\"",toupper($2)} else if($1 ~ /^[^"]/ && $2 ~ /^[^"]/) {print "\""toupper($1)"\"","\""toupper($2)"\""}}' /path/to/file | sort)
But when using this syntax from system, as in
system("comm -13 <(filea) <fileb)")
I get the familiar error:
sh: -c: line 0: syntax error near unexpected token `('
From the above it's clear that system() is using sh and not bash, and that process substitution isn't supported. After reading other articles, I've attempted using
system("bash -c 'comm -13 <(hadoop fs -cat /path/to/file | gunzip | awk -vFPAT='([^,]*)|(\"[^\"]+\")' -vOFS=, '{if($7 ~ /^\".*\"$/ && $9 ~ /^\".*\"$/) {print toupper($7),toupper($9)} else if($7 ~ /^[^\"]/ && $9 ~ /^[\"]/) {print \"\\\"\"toupper($7)\"\\\"\",toupper($9)} else if($7 ~ /^[^\"]/ && $9 ~ /^[^\"]/) {print \"\\\"\"toupper($7)\"\\\"\",\"\\\"\"toupper($9)\"\\\"\"}}' | sort) <(awk -vFPAT='([^,]*)|(\"[^\"]+\")' -vOFS=, '{if($1 ~ /^\".*\"$/ && $2 ~ /^\".*\"$/) {print toupper($1),toupper($2)} else if($1 ~ /^[^\"]/ && $2 ~ /^[\"]/) {print \"\\\"\"toupper($1)\"\\\"\",toupper($2)} else if($1 ~ /^[^\"]/ && $2 ~ /^[^\"]/) {print \"\\\"\"toupper($1)\"\\\"\",\"\\\"\"toupper($2)\"\\\"\"}}' /path/to/file | sort)")
That is, escaping double quotes and backslashes as necessary. However, this returns the same error:
sh: -c: line 0: syntax error near unexpected token `('
I'm guessing this has something to do with how single quotes are handled inside bash -c, which itself sits inside a double-quoted string in system(). How should I navigate all of this escaping?

To solve this issue, I merely needed to escape everything in [within]:
bash -c "[within]"
using bash's escape rules (https://www.gnu.org/software/bash/manual/html_node/Double-Quotes.html), and everything in [within2]:
system("[within2]")
using R's escape rules.
The end result is double-escaping backslashes and double quotes (once for bash, once for R), and single-escaping $ (for bash).
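For instance, a minimal sketch of the pattern, with hypothetical files a.txt and b.txt standing in for the real pipelines:
# R's \\$ yields \$ in the command line that sh parses; sh's double-quote
# rules then hand bash a literal $, keeping awk's $0 intact.
system("bash -c \"comm -13 <(awk '{print toupper(\\$0)}' a.txt | sort) <(sort b.txt)\"")
Here sh only ever sees bash -c followed by one double-quoted argument, so the process substitutions are parsed by bash rather than sh.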

Related

jq parsing date to timestamp

I have the following script:
curl -s -S 'https://bittrex.com/Api/v2.0/pub/market/GetTicks?marketName=BTC-NBT&tickInterval=thirtyMin&_=1521347400000' | jq -r '.result|.[] |[.T,.O,.H,.L,.C,.V,.BV] | @tsv | tostring | gsub("\t";",") | "(\(.))"'
This is the output:
(2018-03-17T18:30:00,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(2018-03-17T19:00:00,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(2018-03-17T19:30:00,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(2018-03-17T20:00:00,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
I want to replace the date with a timestamp.
I can make this conversion with date in the shell:
date -d '2018-03-17T18:30:00' +%s%3N
1521325800000
I want this result:
(1521325800000,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(1521327600000,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(1521329400000,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(1521331200000,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
This data is stored in MySQL.
Is it possible to execute the date conversion with jq or another command like awk, sed, perl in a single command line?
Here is an all-jq solution that assumes the "Z" (UTC+0) timezone; this is why the timestamps below differ from the desired output, which was produced by date using the asker's local timezone.
In brief, simply replace .T by:
((.T + "Z") | fromdate | tostring + "000")
To verify this, consider:
timestamp.jq
[splits("[(),]")]
| .[1] |= ((. + "Z")|fromdate|tostring + "000") # milliseconds
| .[1:length-1]
| "(" + join(",") + ")"
Invocation
jq -rR -f timestamp.jq input.txt
Output
(1521311400000,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(1521313200000,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(1521315000000,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(1521316800000,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
Here is an unportable awk solution. It is not portable because it relies on the system date command; on the system I'm using, the relevant invocation looks like: date -j -f "%Y-%m-%eT%T" STRING "+%s"
awk -F, 'BEGIN{OFS=FS}
NF==0 { next }
{ sub(/\(/,"",$1);
  cmd="date -j -f \"%Y-%m-%eT%T\" " $1 " +%s";
  cmd | getline $1;
  close(cmd);        # close the pipe so each line runs a fresh date command
  $1=$1 "000";       # seconds -> milliseconds
  printf "%s", "(";
  print;
}' input.txt
Output
(1521325800000,0.00012575,0.00012643,0.00012563,0.00012643,383839.45768188,48.465051)
(1521327600000,0.00012643,0.00012726,0.00012642,0.00012722,207757.18765437,26.30099514)
(1521329400000,0.00012726,0.00012779,0.00012698,0.00012779,97387.01596624,12.4229077)
(1521331200000,0.0001276,0.0001278,0.00012705,0.0001275,96850.15260027,12.33316229)
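On a GNU system (a hypothetical variant, not from the original answer), the same approach could use GNU date's -d and %s%3N to get milliseconds directly, so nothing needs to be appended:
awk -F, 'BEGIN{OFS=FS}
NF==0 { next }
{ sub(/\(/,"",$1);
  cmd="date -d \"" $1 "\" +%s%3N";   # GNU date parses the ISO 8601 string itself
  cmd | getline $1;
  close(cmd);
  printf "%s", "(";
  print;
}' input.txt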
Solution with sed:
sed -e 's/(\([^,]\+\)\(,.*\)/echo "(\$(date -d \1 +%s%3N)\2"/g' | ksh
Test:
<curl_command> | sed -e 's/(\([^,]\+\)\(,.*\)/echo "(\$(date -d \1 +%s%3N)\2"/g' | ksh
or:
<curl_command> > results_curl.txt
cat results_curl.txt | sed -e 's/(\([^,]\+\)\(,.*\)/echo "(\$(date -d \1 +%s%3N)\2"/g' | ksh
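Since the question also mentions perl, here is a minimal perl sketch (my own, not from the original answers; it assumes UTC like the jq answer, and uses Time::Piece, which ships with core perl):
perl -MTime::Piece -pe 's{\(([^,]+),}{"(" . Time::Piece->strptime($1, "%Y-%m-%dT%H:%M:%S")->epoch . "000,"}e' results_curl.txt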

No results when making a precise match using awk

I am having rows like this in my source file:
"Sumit|My Application|PROJECT|1|6|Y|20161103084527"
I want to make a precise match on column 3, i.e., I do not want to use the '~' operator in my awk command. However, the command:
awk -F '|' '($3 ~ /'"$Var_ApPJ"'/) {print $3}' ${Var_RDR}/${Var_RFL};
fetches me the correct result, but the command:
awk -F '|' '($3 == "${Var_ApPJ}") {print $3}' ${Var_RDR}/${Var_RFL};
fails to do so. Can anyone help explain why this happens? I want to use '==' because I do not want a match when the value is "PROJECT1" in the source file.
Parameter Var_ApPJ="PROJECT"
${Var_RDR}/${Var_RFL} -> Refers to source file.
Refer to this part of the documentation to see how to pass a variable to awk.
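For example, a minimal sketch using -v (the awk-side name pj is mine):
awk -F '|' -v pj="$Var_ApPJ" '$3 == pj {print $3}' ${Var_RDR}/${Var_RFL}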
I found an alternative to '==' using '~' with anchors (note the quote-splicing, so the shell expands the variable):
awk -F '|' '($3 ~ "^'"${Var_ApPJ}"'$") {print $3}' ${Var_RDR}/${Var_RFL};
Here is the problem - try the command below:
awk -F '|' -v Var_ApPJ="$Var_ApPJ" '$3 == Var_ApPJ {print $3}' ${Var_RDR}/${Var_RFL};
Remove the quotes and curly braces from around the variable inside the awk program, and pass the shell value in with -v.
vipin@kali:~$ cat kk.txt
a 5 b cd ef gh
vipin@kali:~$ awk -v var1="5" '$2 == var1 {print $3}' kk.txt
b
vipin@kali:~$
OR
#cat kk.txt
a 5 b cd ef gh
#var1="5"
#echo $var1
5
#awk '$2 == "'"$var1"'" {print $3}' kk.txt ### With "{}"
b
#
#awk '$2 == "'"${var1}"'" {print $3}' kk.txt ### without "{}"
b
#

Stop awk search on a condition

At the moment I have my R function generate an awk script to selectively load a subset of a CSV into fread.
The resulting awk string looks something like this:
tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv | parallel -k -q --block 500M --pipe awk -F , ' $5 > \"2013-01-01\" && $5 < \"2015-11-17\" && $2 ~ /^F59PHI$|^GP20ECO$|^GT42CU-ACE$/ && $20 ~ /^Disregard$|^EMD Work Item$|^Pending$|^Pre-Work Item$|^Road Failure$|^Unit Shopped$|^Watch$|^Work Item$|^NA$/ {print $2 \",\" $88 \",\" $17 \",\" $5 \",\" $9 \",\" $22 \",\" $3 \",\" $15 \",\" $14 } '
The thing is: my CSV is now ordered by dates ($5), in descending order, so if the user enters a specific lower-bound date and awk gets past that line, it makes sense for it to stop. (I am not sure how that would interact with the parallelization I am doing above. Maybe there is a way to select only the part of the CSV that is "above" the lower bound of the date and then pass the resulting CSV into the awk script.) Is there a way to do that?
Generate an awk program such as this, which explicitly exits on a condition (here, if field 3 exceeds 4):
$2 > 3 && $2 < 10
$3 > 4 { exit }
Put the above in a file called myprog.awk, say, and assuming default separators run it with this (your awk may be called something else):
gawk -f myprog.awk mydata.dat
or put it on the command line, but you will have to be careful about quoting depending on the shell you use:
$2 > 3 && $2 < 10; $3 > 4 { exit }
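Applied to the original question's descending-by-date CSV, a hedged sketch of the same early-exit idea (field 5 holds ISO dates, so plain string comparison works; the filename is from the question):
awk -F, '
$5 < "2013-01-01" { exit }                 # dates are descending: nothing past this point can match
$5 > "2013-01-01" && $5 < "2015-11-17"     # bare pattern: default action prints the row
' ../data/faults_main_only_dp_1_shopFlag.csv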

arguments in awk command

for var in 1 2 3 4 5 6 7 8 9 10
do
    echo $var
    export lower_bound=$((($var - 1) * 10))
    echo $lower_bound
    export upper_bound=$(($var * 10))
    echo $upper_bound
    awk '$2 >= $(($lower_bound)) && $2 < $(($upper_bound))' shubham_test.txt > file_$var.txt
    #awk -v var1="$lower_bound" -v var2="$upper_bound" '{$2 >= $var1 && $2 < $var2}' shubham_test.txt > file_$var.txt
done
I am trying to split a file based on the value of a field in a text file in Unix. The awk works when I pass hardcoded values to compare, but does not work properly if I pass variables for comparison (upper_bound and lower_bound).
I looked it up and even replaced the awk command with the following:
awk -v var1="$lower_bound" -v var2="$upper_bound" '{$2 >= $var1 && $2 < $var2}' shubham_test.txt > file_$var.txt
so that it takes arguments. But this is not working either. Can anyone help?
Your second format should work, but without the $ for the variable names, and without the braces (a bare condition acts as a pattern whose default action is to print the line), so:
awk -v var1="$lower_bound" -v var2="$upper_bound" '$2 >= var1 && $2 < var2' shubham_test.txt
Your first version fails because the single quotes mean the Unix shell does not expand the $lower_bound variables; instead it passes the entire literal string to awk. If you're writing shell scripts, you must know how to use single, double and back quotes - take a look at http://www.tutorialspoint.com/unix/unix-quoting-mechanisms.htm for example.
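Putting it together, a minimal sketch of the corrected loop (same hypothetical file names as in the question):
for var in 1 2 3 4 5 6 7 8 9 10
do
    lower_bound=$(( (var - 1) * 10 ))
    upper_bound=$(( var * 10 ))
    # bare condition = pattern; the default action prints matching lines
    awk -v var1="$lower_bound" -v var2="$upper_bound" '$2 >= var1 && $2 < var2' shubham_test.txt > "file_$var.txt"
done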

While read line, awk $line and write to variable

I am trying to split a file into different smaller files depending on the value of the fifth field. A very nice way to do this was already suggested and also here.
However, I am trying to incorporate this into a .sh script for qsub, without much success.
The problem is that in the section where the file to which the line is output is specified, i.e.,
f = "Alignments_" $5 ".sam"; print > f
I need to pass a variable declared earlier in the script, which specifies the directory where the file should be written. I need to do this with a variable that is built for each task when I send out the array job for multiple files.
So say $output_path = ./Sample1
I need to write something like
f = $output_path "/Alignments_" $5 ".sam" print > f
But it does not seem to like having a $variable that is not a $field belonging to awk. I don't even think it likes having two "strings" before and after the $5.
The error I get back is that it takes the first line of the file to be split (little.sam) and tries to name f like that, followed by /Alignments_" $5 ".sam" (those last three put together correctly). It says, naturally, that it is too big a name.
How can I write this so it works?
Thanks!
awk -F '[:\t]' ' # read the list of numbers in Tile_Number_List
FNR == NR {
num[$1]
next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
f = "Alignments_" $5 ".sam" print > f
} ' Tile_Number_List.txt little.sam
UPDATE, AFTER ADDING -v TO AWK AND DECLARING THE VARIABLE opath
input=$1
outputBase=${input%.bam}
mkdir -v $outputBase\_TEST
newdir=$outputBase\_TEST
samtools view -h $input | awk 'NR >= 18' | awk -F '[\t:]' -v opath="$newdir" '
FNR == NR {
num[$1]
next
}
$5 in num {
f = newdir"/Alignments_"$5".sam";
print > f
} ' Tile_Number_List.txt -
mkdir: created directory 'little_TEST'
awk: cmd. line:10: (FILENAME=- FNR=1) fatal: can't redirect to `/Alignments_1101.sam' (Permission denied)
awk variables are like C variables - just reference them by name to get their value, no need to stick a "$" in front of them like you do with shell variables:
awk -F '[:\t]' ' # read the list of numbers in Tile_Number_List
FNR == NR {
num[$1]
next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
output_path = "./Sample1/"
f = output_path "Alignments_" $5 ".sam"
print > f
} ' Tile_Number_List.txt little.sam
To pass the value of a shell variable such as $output_path to awk, you need to use the -v option.
$ output_path=./Sample1/
$ awk -F '[:\t]' -v opath="$output_path" '
# read the list of numbers in Tile_Number_List
FNR == NR {
num[$1]
next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
f = opath"Alignments_"$5".sam"
print > f
} ' Tile_Number_List.txt little.sam
Also, you still have the error from your previous question left in your script.
EDIT:
The awk variable created with -v is opath but you use newdir; what you want is:
input=$1
outputBase=${input%.bam}
mkdir -v $outputBase\_TEST
newdir=$outputBase\_TEST
samtools view -h "$input" | awk -F '[\t:]' -v opath="$newdir" '
FNR == NR {
num[$1]
next
}
FNR < 18 { next } # skip the SAM header lines in the second input (the old NR >= 18 filter)
$5 in num {
f = opath"/Alignments_"$5".sam" # <-- opath is the awk variable not newdir
print > f
}' Tile_Number_List.txt -
Note that the NR >= 18 header filter has moved inside the single awk script (applied to the second input), so the separate awk 'NR >= 18' stage is no longer needed.
