Combining R + awk + bash commands

I want to combine awk and R. I have a set of *.txt files in a given directory, and I don't know the length of each file's header: in some files I have to skip 25 lines, in others 27, and so on. So I want to run an awk command to get the number of lines to skip. Once I have this value, I can begin processing the data with R.
Furthermore, I combine R and bash in the R file, so my code looks like this:
#!/usr/bin/env Rscript
...
argv <- commandArgs(TRUE)
# error checking...
import_file <- argv[1]
export_file <- argv[2]
# your function call
format_windpro(import_file, export_file)
Where and how can I put my awk command? Thanks!
I tried what you suggested about awk commands and I still get an error. R doesn't recognize my command, so I cannot pass the number of lines to skip to my function. Here is my code:
nline <- paste('$(grep -n 'm/s' import_file |awk -F":" '{print $1}')')
nline <- scan(pipe(nline),quiet=T)
I look for the pattern m/s in the first column to find where my header text is. I use R under Windows 7.

Besides Vincent's hint of using system("awk ...", intern=TRUE), you can also use the pipe() function that is part of the usual text connections:
R> sizes <- read.table(pipe("ls -l /tmp | awk '!/^total/ {print $5}'"))
R> summary(sizes)
V1
Min. : 0
1st Qu.: 482
Median : 4096
Mean : 98746
3rd Qu.: 13952
Max. :27662342
R>
Here I am piping a command into awk and then reading all of awk's output. The output could also be a single line:
R> cmd <- "ls -l /tmp | awk '!/^total/ {sum = sum + $5} END {print sum}'"
R> totalsize <- scan(pipe(cmd), quiet=TRUE)
R> totalsize
[1] 116027050
R>

You can use system to run an external program from R.
system("gawk --version", intern=TRUE)
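For the header-length problem in the question, the awk step might look like the sketch below. It assumes the header ends at the first line containing m/s, as the question describes. The sample file contents and the variable names sample and nskip are made up for illustration. The printed number is what a call such as scan(pipe(cmd), quiet=TRUE) would read back into R.

```shell
#!/bin/bash
# Hypothetical sample: three header lines, a units line containing "m/s", then data.
sample=$(mktemp)
printf 'site: demo\nheight: 80\nnotes\ntime m/s\n1 5.2\n2 6.1\n' > "$sample"

# Print the number of the first line matching "m/s", then stop reading the file.
nskip=$(awk '/m\/s/ { print NR; exit }' "$sample")
echo "lines to skip: $nskip"

rm -f "$sample"
```

Using awk alone avoids the grep | awk pair from the question and stops at the first match, so later occurrences of m/s in the data do not matter.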

Related

R: pass variable from R to unix

I am running an R script from a bash script and want to return the output of the R script to the bash script, to keep working with it there.
The bash script is something like this:
#!/bin/bash
Rscript MYRScript.R
a=OUTPUT_FROM_MYRScript.R
# do something with a
and the R script is something like this:
for(i in 1:5){
i
sink(type="message")
}
I want bash to work with one value from R at a time, meaning: bash receives i=1 and works with that; when that task is done, it receives i=2, and so on.
Any ideas how to do that?
One option is to make your R script executable with #!/usr/bin/env Rscript (setting the executable bit; e.g. chmod 0755 myrscript.r, chmod +x myrscript.r, etc...), and just treat it like any other command, e.g. assigning the results to an array variable below:
myrscript.r
#!/usr/bin/env Rscript
cat(1:5, sep = "\n")
mybashscript.sh
#!/bin/bash
RES=($(./myrscript.r))
for elem in "${RES[@]}"
do
echo elem is "${elem}"
done
nrussell$ ./mybashscript.sh
elem is 1
elem is 2
elem is 3
elem is 4
elem is 5
Here is MYRScript.R:
for(iter in 1:5) {
cat(iter, ' ')
}
and here is your bash script:
#!/bin/bash
r_output=`Rscript ~/MYRscript.R`
for iter in `echo $r_output`
do
echo Here is some output from R: $iter
done
Here is some output from R: 1
Here is some output from R: 2
Here is some output from R: 3
Here is some output from R: 4
Here is some output from R: 5
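A line-by-line variant of the loops above can be sketched as follows. The function produce_values is a hypothetical stand-in for Rscript MYRScript.R so that the sketch runs without R; count and last are illustrative names. Note that this form captures all of the script's output before the loop starts. Piping Rscript directly into while read would hand values over as they are printed, but in bash's default configuration variables set inside such a loop are not visible after it ends.

```shell
#!/bin/bash
# Stand-in for `Rscript MYRScript.R` (an assumption, so the sketch runs without R):
# it prints the values 1..5, one per line, like cat(1:5, sep = "\n") would.
produce_values() {
  printf '%s\n' 1 2 3 4 5
}

count=0
last=""
# Read the captured output line by line; each value is handled before the next is read.
while IFS= read -r value; do
  echo "working on i=$value"
  count=$((count + 1))
  last=$value
done <<EOF
$(produce_values)
EOF

echo "handled $count values, last was $last"
```

Reading with IFS= read -r keeps each line intact, so values containing spaces or backslashes are not split or altered the way they would be with unquoted word splitting.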

Better string interpolation in R

I need to build up long command lines in R and pass them to system(). I find it is very inconvenient to use paste0/paste function, or even sprintf function to build each command line. Is there a simpler way to do like this:
Instead of this hard-to-read-and-too-many-quotes:
cmd <- paste("command", "-a", line$elem1, "-b", line$elem3, "-f", df$Colum5[4])
or:
cmd <- sprintf("command -a %s -b %s -f %s", line$elem1, line$elem3, df$Colum5[4])
Can I have this:
cmd <- buildcommand("command -a %line$elem1 -b %line$elem3 -f %df$Colum5[4]")
For a tidyverse solution see https://github.com/tidyverse/glue. Example
name="Foo Bar"
glue::glue("How do you do, {name}?")
With version 1.1.0 (CRAN release on 2016-08-19), the stringr package has gained a string interpolation function str_interp() which is an alternative to the gsubfn package.
# sample data
line <- list(elem1 = 10, elem3 = 30)
df <- data.frame(Colum5 = 1:4)
# do the string interpolation
stringr::str_interp("command -a ${line$elem1} -b ${line$elem3} -f ${df$Colum5[4]}")
#[1] "command -a 10 -b 30 -f 4"
This comes pretty close to what you are asking for. When any function f is prefaced with fn$, i.e. fn$f, string interpolation is performed on its character arguments, replacing each backtick-quoted expression with the result of evaluating it as an R expression.
library(gsubfn)
cmd <- fn$identity("command -a `line$elem1` -b `line$elem3` -f `df$Colum5[4]`")
Here is a self contained reproducible example:
library(gsubfn)
# test inputs
line <- list(elem1 = 10, elem3 = 30)
df <- data.frame(Colum5 = 1:4)
fn$identity("command -a `line$elem1` -b `line$elem3` -f `df$Colum5[4]`")
## [1] "command -a 10 -b 30 -f 4"
system
Since any function can be used we could operate directly on the system call like this. We have used echo here to make it executable but any command could be used.
exitcode <- fn$system("echo -a `line$elem1` -b `line$elem3` -f `df$Colum5[4]`")
## -a 10 -b 30 -f 4
Variation
This variation would also work. fn$f also performs substitution of $whatever with the value of variable whatever. See ?fn for details.
with(line, fn$identity("command -a $elem1 -b $elem3 -f `df$Colum5[4]`"))
## [1] "command -a 10 -b 30 -f 4"
Another option would be to use whisker.render from https://github.com/edwindj/whisker which is a {{Mustache}} implementation in R. Usage example:
require(dplyr); require(whisker)
bedFile="test.bed"
whisker.render("processing {{bedFile}}") %>% print
Not really a string interpolation solution, but still a very good option for the problem is to use the processx package instead of system() and then you don't need to quote anything.
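Another option is the GetoptLong package, which provides qq() and qqcat() for variable interpolation:
library(GetoptLong)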
str = qq("region = (#{region[1]}, #{region[2]}), value = #{value}, name = '#{name}'")
cat(str)
qqcat("region = (#{region[1]}, #{region[2]}), value = #{value}, name = '#{name}'")
https://cran.r-project.org/web/packages/GetoptLong/vignettes/variable_interpolation.html

SED command to change the header

I have about 114 files that I want to join side by side on the first column, which holds the ID number that all files share. Each file consists of 2 columns and over 400,000 lines. I used write.table to join those tables into one table, and I got X's in my header. For example, my header should look like this:
ID 1_sample1 2_sample2 3_sample3
But I get it like this:
ID X1_sample1 X2_sample2 X3_sample3
I read about this problem and found that check.names gets rid of it, but in my case, when I use check.names, I get the following error:
"unused argument (check.name = F)"
So I decided to use sed to fix the problem. It works, BUT it joins the 1st and 2nd lines. For instance, my first two lines should be something like this:
ID 1_sample1 2_sample2 3_sample
cg123 .0235 2.156 -5.546
But I get the following instead:
ID 1_sample1 2_sample2 3_sample cg123 .0235 2.156 -5.546
Can anyone check this code for me, please? I might have done something wrong, since the lines are not kept separate from each other.
head -n 1 inFILE | tr "\t" "\n" | sed -e 's/^X//g' | sed -e 's/\./-/' | sed -e 's/\./(/' |sed -e 's/\./)/' | tr "\n" "\t" > outFILE
tail -n +2 beta.norm.txt >> outFILE
If your data is tab delimited, the simple fix would be
sed '1,1s/\tX/\t/g' < inputfile > outputfile
1,1: operate only on the range "line 1 to line 1"
\tX: find a tab followed by X
/\t/: replace it with a tab
g: apply to all occurrences on the line
It does seem as though your original attempt does more than just strip the X - it also changes successive dots to (-) but you don't show in your example why you need that. The reason your code joins the first two lines is that you only replace \n with \t in your last tr command - which leaves you with no \n at the end of the line.
You need to attach a \n at the end of your first line before concatenating lines 2 and beyond with your second command. Experiment with
head -n 1 inFILE | tr "\t" "\n" | sed -e 's/^X//g' | sed -e 's/\./-/' | sed -e 's/\./(/' |sed -e 's/\./)/' | tr "\n" "\t" > outFILE
printf '\n' >> outFILE
tail -n +2 beta.norm.txt >> outFILE
printf '\n' behaves the same everywhere, whereas whether echo "\n" emits a newline or a literal \n depends on your shell and OS. There are other ways to add a newline...
Edit: using awk is probably much cleaner, for example:
awk '(NR==1){gsub(" X"," ", $0);}{print;}' inputFile > outputFile
Explanation:
(NR==1) for the first line only (record number == 1) do:
{gsub(" X"," ", $0);} do a global substitution of "space followed by X" with "space"
for all lines (including the one that was just modified) do:
{print;} print the whole line
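Since the sed answer assumes tab-delimited data while the awk one-liner matches a space, a tab-aware awk variant might look like the sketch below. The sample rows and the names infile, outfile, header and nlines are made up for illustration. It only strips a leading X from the header fields and does not attempt the dot replacements from the original pipeline.

```shell
#!/bin/bash
# Hypothetical tab-delimited sample with the unwanted X prefixes in the header.
infile=$(mktemp)
outfile=$(mktemp)
printf 'ID\tX1_sample1\tX2_sample2\tX3_sample3\n' > "$infile"
printf 'cg123\t.0235\t2.156\t-5.546\n' >> "$infile"

# On line 1 only, strip a leading X from every field after the ID column.
# Every line, modified or not, is then printed with its own newline.
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { for (i = 2; i <= NF; i++) sub(/^X/, "", $i) }
     { print }' "$infile" > "$outfile"

header=$(head -n 1 "$outfile")
nlines=$(wc -l < "$outfile" | tr -d ' ')
echo "$header"
echo "lines: $nlines"

rm -f "$infile" "$outfile"
```

Because awk prints each record with its own terminating newline, the header and the first data row stay on separate lines and no head/tail stitching is needed.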

Read only part of file/ cut to specific symbol

I have 100 files which all have a similar structure
line1
line2
stuff
RR
important stuff
The problem is that I want to cut at the point where RR appears (which it does in each file). However, it is not always on the same line (it can be line 20, it can be line 35), but it is always there. Is there any way in bash or R (when reading in the file) to do that, i.e. just cut off the header? I would prefer R.
You can read all rows and remove the unnecessary ones:
dat <- readLines(textConnection(
"line1
line2
stuff
RR
important stuff"))
# dat <- readLines("file.name")
dat[seq(which.max(dat == "RR") + 1, length(dat))]
# [1] "important stuff"
If you have awk available through bash you could do:
awk '(/RR/){p=1; next} (p){print}' < file.txt
$ cat file.txt
line1
line2
stuff
RR
important stuff
$ awk '(/RR/){p=1; next} (p){print}' < file.txt
important stuff
This sets a flag p when the 'RR' string is found; next then moves on to the following line without evaluating (p){print} for the RR line itself. All subsequent lines are printed.
Here are a few ways:
Using basic tools:
$ tail -n+$((1 + $(grep -n '^RR$' file.txt | cut -d: -f1))) file.txt
important stuff
$
Using pure bash:
$ { while read ln; do [ "$ln" == RR ] && break; done; cat; } < file.txt
important stuff
$
And another way, assuming you can guarantee no more than 9999 lines in a file:
$ grep -A9999 '^RR$' file.txt | tail -n+2
important stuff
$
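One more variant, not taken from the answers above, is a sed range delete. The sketch below assumes the marker line is exactly RR and is not the very first line of the file; the names sample and body are illustrative.

```shell
#!/bin/bash
# Hypothetical sample file with the layout from the question.
sample=$(mktemp)
printf 'line1\nline2\nstuff\nRR\nimportant stuff\n' > "$sample"

# Delete everything from line 1 through the first line that is exactly "RR".
body=$(sed '1,/^RR$/d' "$sample")
echo "$body"

rm -f "$sample"
```

Unlike the grep -A9999 approach, this places no limit on how many lines follow the marker.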

two layers of quotes around a bash variable

I've written a bash script that generates a series of R scripts. However, I've run into difficulty quoting a bash variable that I echo into the R script as the name of a file to be read into R. I have
echo "loadings_file <- $loadings ; calls_file <- $file" | cat - template.R > temp && mv temp $scriptname
$loadings and $file are files I want R to read in. But when I run it as is, they end up in the R script with no quotes around them, so R does not treat them as strings. How do I make sure they're quoted in R but still expanded in bash first?
Single quotes inside a double-quoted string are literal, so the variables still expand:
echo "loadings_file <- '$loadings' ; calls_file <- '$file'"
If you specifically need double quoting:
echo "loadings_file <- \"$loadings\" ; calls_file <- \"$file\""
You have to escape your quotes (\") around the variables:
echo "loadings_file <- \"$loadings\" ; calls_file <- \"$file\"" | cat - template.R > temp && mv temp $scriptname
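An alternative that avoids the backslash escapes is printf with a single-quoted format, sketched below. The file names are hypothetical stand-ins for $loadings and $file, and r_line is an illustrative name. A file name that itself contains a double quote or a backslash would still need escaping before R could parse the line.

```shell
#!/bin/bash
# Hypothetical file names standing in for $loadings and $file from the question.
loadings="loadings 01.txt"
file="calls.txt"

# printf fills each %s with one argument; the double quotes in the format are literal.
r_line=$(printf 'loadings_file <- "%s" ; calls_file <- "%s"' "$loadings" "$file")
echo "$r_line"
```

The resulting line can be fed to cat - template.R in place of the echo output from the question.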
