I need to execute a Perl program as part of a larger R program.
The R code generates a series of output files with different extensions, for instance .out or .lis.
I have a Perl program that converts those files to CSV.
I've seen Perl commands executed from R, but nothing with this complexity.
@outfiles = glob( "*.lis" );
foreach $outfile ( @outfiles ) {
    print $outfile, "\n";
    $outfile =~ /(\S+)lis$/;
    $csvfile = $1 . "lis.csv";
    print $csvfile, "\n";
    open( OUTFILE, "$outfile" ) || die "ERROR: Unable to open $outfile\n";
    open( CSVFILE, ">$csvfile" ) || die "ERROR: Unable to open $csvfile\n";
    $lineCnt = 0;
    while ( $outline = <OUTFILE> ) {
        chomp( $outline );
        $lineCnt++;
        $outline =~ s/^\s+//;   # Remove whitespace at the beginning of the line
        if ( $lineCnt == 1 ) {
            $outline =~ s/,/\./g;   # Replace all the commas with periods in the header line
        }
        $outline =~ s/\s+/,/g;  # Replace remaining whitespace delimiters with a comma
        print CSVFILE "$outline\n";
    }
    close( OUTFILE );
    close( CSVFILE );
}
Is there any way I can integrate the Perl code into my R code? I could develop an R program that does the same, but I wouldn't know where to start converting a .lis or .out file to .csv.
Call it by using R's system call:
my.seed <- as.numeric(try(system("perl -e 'print int(rand(1000000))'", intern = TRUE)))  # get a random number :D
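Applied to the conversion script above (saved under a hypothetical name such as lis2csv.pl), the call is just as simple; run it with R's working directory set to the folder that holds the .lis files:
system("perl lis2csv.pl", intern = TRUE)  # lis2csv.pl is a hypothetical name for the script above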
However, I must agree with @ikegami: there are better Perl modules for handling CSV data.
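That said, if you would rather drop the Perl dependency entirely, the script translates almost line-for-line into base R. A minimal sketch, assuming (as the Perl does) whitespace-delimited .lis files whose header line may contain commas:
lis_to_csv <- function(outfile) {
  csvfile <- paste0(outfile, ".csv")    # foo.lis -> foo.lis.csv, as in the Perl
  lines <- readLines(outfile)
  lines <- sub("^\\s+", "", lines)      # remove whitespace at the beginning of each line
  lines[1] <- gsub(",", ".", lines[1])  # replace commas with periods in the header line
  lines <- gsub("\\s+", ",", lines)     # replace remaining whitespace with commas
  writeLines(lines, csvfile)
}
for (f in Sys.glob("*.lis")) lis_to_csv(f)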
I want to call an .exe program (spi_sl_6.exe) using R's system() command, but I can't pass parameters to the program through system. The following is my command and parameters: system("D:\\working\\spi_sl_6.exe")
I have been searching the net for a long time, but to no avail. Please help, or suggest some ideas for how to achieve this. Thanks in advance.
This is using the Standardized Precipitation Index software from
http://drought.unl.edu/MonitoringTools/DownloadableSPIProgram.aspx.
This seems to give a working solution on Windows (though not without warnings!).
First, download the software and example files:
# Create directory to download software
mydir <- "C:\\Users\\david\\spi"
dir.create(mydir)
url <- "http://drought.unl.edu/archive/Programs/SPI"
download.file(file.path(url, "spi_sl_6.exe"), file.path(mydir, "spi_sl_6.exe"), mode="wb")
# Download example files
download.file(file.path(url, "SPI_samplefiles.zip"), file.path(mydir, "SPI_samplefiles.zip"), mode="wb")
# extract one example file, and write out
temp <- unzip(file.path(mydir, "SPI_samplefiles.zip"), "wymo.cor")
dat <- read.table(temp)
# Use this file as an example input
write.table(dat, file.path(mydir,"wymo.cor"), col.names = FALSE, row.names = FALSE)
From page 3 of the help file basic-spi-program-information.pdf at the above link, the command line call should be of the form spi 3 6 12 <infile.dat >outfile.dat. However, neither of the following worked (just from the command line, not in R), nor did various iterations of how to pass the parameters.
C:\Users\david\spi\spi_sl_6 3 <C:\Users\david\spi\wymo.cor >C:\Users\david\spi\out.dat
cd C:\Users\david\spi && spi_sl_6 3 <wymo.cor >out.dat
However, using the accepted answer from Running .exe file with multiple parameters in c# seems to work. That is, again from the command line:
cd C:\Users\david\spi && (echo 2 && echo 3 && echo 6 && echo wymo.cor && echo out1.dat) | spi_sl_6
So to run this in R, you can wrap the call in shell() (you will need to change the path to wherever you saved the exe):
shell("cd C:\\Users\\david\\spi && (echo 2 && echo 3 && echo 6 && echo wymo.cor && echo out2.dat) | spi_sl_6", intern=TRUE)
out1.dat and out2.dat should be the same.
This throws warning messages, I think from the echo (in R but not from the command line), but the output file is produced.
You can automate the echo calls slightly, so that all you need to input is the time parameters:
timez <- c(2, 3, 6)
stime <- paste("echo", timez, collapse =" && ")
infile <- "wymo.cor"
outfile <- "out3.dat"
spiCall <- paste("cd", mydir, "&& (", stime, "&& echo", infile, "&& echo", outfile, ") | spi_sl_6")
shell(spiCall)
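You could even wrap the whole construction in a small helper (run_spi is a hypothetical name) so that repeated runs only differ in their arguments:
# hypothetical convenience wrapper around the shell() construction above
run_spi <- function(timez, infile, outfile, dir = mydir) {
  stime <- paste("echo", timez, collapse = " && ")
  cmd <- paste("cd", dir, "&& (", stime, "&& echo", infile,
               "&& echo", outfile, ") | spi_sl_6")
  shell(cmd, intern = TRUE)
}
run_spi(c(2, 3, 6), "wymo.cor", "out5.dat")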
You can construct the command using sprintf:
cmd_name <- "D:\\working\\spi_sl_6.exe"
param1 <- "a"
param2 <- "b"
system(sprintf("%s %s %s", cmd_name, param1, param2))
Or using system2() (I prefer this option):
system2(cmd_name, args = c(param1, param2))
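Since spi_sl_6 reads its parameters from standard input (which is why the echo pipeline above works), system2() can also feed them directly through its input argument; a sketch, assuming the exe and the input file sit in the working directory:
setwd(mydir)
system2("spi_sl_6", input = c("2", "3", "6", "wymo.cor", "out4.dat"))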
I have two files; file1 contains substrings of file2. I want to match file1 against file2 and output the part that is to the left of the match, not the match itself. I would also like to know how to output what is to the right of the match, again without the match itself.
Here is part of my data (these strings probably do not match; it is just example data):
file1
ACUGUACAGGCCACUGCCUUGC
CUGCGCAAGCUACUGCCUUGCU
UGGAAUGUAAAGAAGUAUGUAU
CGAAUCAUUAUUUGCUGCUCUA
AUCACAUUGCCAGGGAUUACC
UUCACAGUGGCUAAGUUCUGC
file2
CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC
example:
file1:
GCUGUGGAGAUAACUGCGC
file2:
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC
output:
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCC
Here are a couple of ways to keep only the text that comes before your pattern, if it exists:
a <- "GCUGUGGAGAUAACUGCGC"
b <- "CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC"
# split b at the first occurrence of a and keep the left-hand piece
strsplit(b, a)[[1]][1]
# or delete everything from the start of the match to the end of the string
sub(paste0(a, ".*$"), "", b)
Now, you just need to read the files into R and loop over each pattern. I'm not exactly sure what you are looking for, but here is an idea
# read data into 2 variables, a and b
# (you would point readLines() at the files themselves to read from disk)
a <- readLines(textConnection("ACUGUACAGGCCACUGCCUUGC
CUGCGCAAGCUACUGCCUUGCU
UGGAAUGUAAAGAAGUAUGUAU
CGAAUCAUUAUUUGCUGCUCUA
AUCACAUUGCCAGGGAUUACC
UUCACAGUGGCUAAGUUCUGC"))
b <- readLines(textConnection("CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC"))
Now, loop over each value from the first file
lapply(a, function(x) sapply(strsplit(b, x), "[", 1))
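Note that this crosses every pattern in a with every line in b. If, as in the Perl answers below, line i of file1 should only be matched against line i of file2, mapply() pairs them up (a sketch; fixed = TRUE treats the pattern as a literal string):
before <- mapply(function(pat, line) strsplit(line, pat, fixed = TRUE)[[1]][1], a, b)
after  <- mapply(function(pat, line) strsplit(line, pat, fixed = TRUE)[[1]][2], a, b)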
Opening file handles to strings for testing:
use strict;
use warnings;
use autodie;
open my $fh1, '<', \ "ACUGUACAGGCCACUGCCUUGC\nCUGCGCAAGCUACUGCCUUGCU\nUGGAAUGUAAAGAAGUAUGUAU\nCGAAUCAUUAUUUGCUGCUCUA\nAUCACAUUGCCAGGGAUUACC\nUUCACAGUGGCUAAGUUCUGC\n";
open my $fh2, '<', \ "CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG\nCUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG\nGCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC\nCUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG\nGGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC\n";
while ( !eof $fh1 && !eof $fh2 ) {
    chomp( my $line1 = <$fh1> );
    chomp( my $line2 = <$fh2> );
    print join( ' ', split /$line1/, $line2, 2 ), "\n";
}
Outputs:
GUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA CAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA AG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA UUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG G
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA ACGCAACC
You can also try the Perl code below, which prints the text before, after, and inside a match using the PREMATCH ($`), POSTMATCH ($') and MATCH ($&) special variables:
Input files:
file1.txt:
ACUGUACAGGCCACUGCCUUGC
CUGCGCAAGCUACUGCCUUGCU
UGGAAUGUAAAGAAGUAUGUAU
CGAAUCAUUAUUUGCUGCUCUA
AUCACAUUGCCAGGGAUUACC
UUCACAGUGGCUAAGUUCUGC
file2.txt:
CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC
Code:
use strict;
use warnings;
open my $fh1, '<', "file1.txt" or die "Couldn't open the file file1.txt: $!";
open my $fh2, '<', "file2.txt" or die "Couldn't open the file file2.txt: $!";
while ( !eof $fh1 && !eof $fh2 ) {
    chomp( my $line1 = <$fh1> );
    chomp( my $line2 = <$fh2> );
    if ( $line2 =~ /$line1/isg ) {
        print "Prematch: $`\n";
        print "Postmatch: $'\t";
    }
}
close($fh1);
close($fh2);
Output:
Prematch: CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA Postmatch: CAGG
Prematch: CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA Postmatch: AG
Prematch: GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA Postmatch: UUCAGGC
Prematch: CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG Postmatch: G
Prematch: GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA Postmatch: ACGCAACC
I want to cut large CSV files (file size greater than RAM) into pieces and either use them directly or save each piece to disk for later use. Which R package is best for doing this with large files?
I haven't tried it, but the skip and nrows parameters of read.table or read.csv are worth a try. From ?read.table:
skip integer: the number of lines of the data file to skip before beginning to read data.
nrows integer: the maximum number of rows to read in. Negative and other invalid values are ignored.
To avoid trouble at the end of the file you need to do some error handling; in other words, I don't know what happens when the skip value is greater than the number of rows in your big CSV.
P.S. I also don't know whether header=TRUE affects skip; you will have to check that as well.
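A minimal sketch of that chunked read, with a try() guard for the end-of-file case just mentioned (assumes big.csv has a header; the chunk size is arbitrary):
hdr <- names(read.csv("big.csv", nrows = 1))    # read the column names once
chunk.size <- 100000
i <- 0
repeat {
  rows <- try(read.csv("big.csv", header = FALSE, col.names = hdr,
                       skip = 1 + i * chunk.size, nrows = chunk.size),
              silent = TRUE)  # skip the header plus all previously read chunks
  if (inherits(rows, "try-error") || nrow(rows) == 0) break
  write.csv(rows, paste0("chunk", i + 1, ".csv"), row.names = FALSE)
  i <- i + 1
}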
The answer given by @berkorbay is OK, and I can confirm that header can be used with skip. However, if your file is really large this gets painfully slow, since each subsequent read after the first must skip over all previously read lines.
I had to do something similar and, after wasting quite a bit of time, I wrote a short Perl script which fragments the original file into chunks that you can read one after the other. It is much faster. I enclose the source here, translating some parts so that the intent is clear:
#!/usr/bin/perl
# (identifiers are Spanish: entrada = input, salida = output stem, nlineas = max lines,
#  cabecera = header, leidas = lines read, fragmento = fragment, fichero = file name)
system("cls") ;   # clear the screen (Windows)
print("Fragment .csv file keeping header in each chunk\n") ;
print("\nEnter input file name = ") ;
chomp( $entrada = <STDIN> ) ;
print("\nEnter maximum number of lines in each fragment = ") ;
chomp( $nlineas = <STDIN> ) ;
print("\nEnter output file name stem = ") ;
chomp( $salida = <STDIN> ) ;
open(IN, $entrada) || die "Cannot open input file: $!\n" ;
$cabecera = <IN> ;      # remember the header line
$leidas = 0 ;           # lines written to the current fragment
$fragmento = 1 ;
$fichero = $salida.$fragmento ;
open(OUT, ">$fichero") || die "Cannot open output file: $!\n" ;
print OUT $cabecera ;
while(<IN>) {
    if ($leidas > $nlineas) {   # fragment full: open the next one
        close(OUT) ;
        $fragmento++ ;
        $fichero = $salida.$fragmento ;
        open(OUT, ">$fichero") || die "Cannot open output file: $!\n" ;
        print OUT $cabecera ;   # repeat the header in every fragment
        $leidas = 0 ;
    }
    $leidas++ ;
    print OUT $_ ;
}
close(OUT) ;
Just save it under whatever name and execute it. The first line might have to be changed if you have Perl in a different place (and, if you are on Windows, you might have to invoke the script as "perl name-of-script").
You can use read.csv.ffdf from the ff package, with parameters like these, to read a big file:
library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)
Once the big file is read into an ff object, you can subset it into ordinary data frames using:
a[1000:1000000,]
The rest of the code handles subsetting and saving the chunks as data frames:
totalrows = dim(a)[1]
row.size = as.integer(object.size(a[1:10000, ])) / 10000  # bytes per row
block.size = 200000000  # block size in bytes (200 MB)
# rows.block is the number of rows per block
rows.block = ceiling(block.size / row.size)
# nmaps + 1 is the number of chunks the big ff data frame is split into
nmaps = floor(totalrows / rows.block)
for (i in 0:nmaps) {
  if (i == nmaps) {
    df = a[(i * rows.block + 1):totalrows, ]   # last (possibly partial) chunk
  } else {
    df = a[(i * rows.block + 1):((i + 1) * rows.block), ]
  }
  # process df or save it
  write.csv(df, paste0("M", i + 1, ".csv"))
  rm(df)  # free the chunk before reading the next one
}
Alternatively, you can first read the file into MySQL using dbWriteTable and then use the read.dbi.ffdf function from the ETLUtils package to read it back into R. Consider the function below:
library(RMySQL)    # provides MySQL(), dbConnect(), dbWriteTable()
library(ETLUtils)  # provides read.dbi.ffdf()

read.csv.sql.ffdf <- function(file, name, overwrite = TRUE, header = TRUE,
                              drv = MySQL(), dbname = "new", username = "root",
                              host = "localhost", password = "1234") {
  conn = dbConnect(drv, user = username, password = password,
                   host = host, dbname = dbname)
  dbWriteTable(conn, name, file, header = header, overwrite = overwrite)
  on.exit(dbRemoveTable(conn, name))
  command = paste0("select * from ", name)
  ret = read.dbi.ffdf(command,
                      dbConnect.args = list(drv = drv, dbname = dbname,
                                            username = username, password = password))
  return(ret)
}
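Usage would then be a one-liner (assuming a running MySQL server reachable with the credentials above):
big <- read.csv.sql.ffdf(file = "big.csv", name = "bigtable")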
I have to extract the value of a variable that occurs multiple times in a file. For example, I have a text file abc.txt. There is a variable result; suppose the value of result in the first line is 2, in the third line it is 55, and in the last line it is 66.
Then my desired output should be:
result:2,55,66
I am new to Unix, so I could not figure out how to do this. Please help.
The contents of the text file can be as follows:
R$#$#%$W%^BHGF, result=2,
fsdfsdsgf
VSDF$TR$R,result=55
fsdf4r54
result=66
Try this awk code:
awk -F'(,| |^)result=' '
/result=/{
    gsub(",", "", $2)              # strip the trailing comma from the value
    v = $2
    str = (str) ? str "," v : v    # accumulate the values, comma separated
}
END{print "result:" str}
' abc.txt
Or this Perl code:
perl -lane '
    push @arr, $& if /\bresult=\K\d+/;
    END{print "result:" . join ",", @arr}
' abc.txt
Output:
result:2,55,66
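If you happen to be doing this from R rather than the shell, the same extraction is a short sketch in base R (\K requires perl = TRUE):
x <- readLines("abc.txt")
vals <- regmatches(x, regexpr("result=\\K\\d+", x, perl = TRUE))
cat("result:", paste(vals, collapse = ","), "\n", sep = "")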
I am trying to use awk to read the input at field position 3, $3; field 3 is a string.
awk -F'","' '{print $1}' input.txt
My file input.txt looks like this:
field1,field2,field3,field4,field5
The problem is that these fields are separated by commas; some of them are double quoted while others are not. Field 5 is double quoted and contains every type of symbol. Example:
imfield1,imfield2,"imfield3",imfield4,"im"",""fi"",el,""d5"
Can awk handle a situation like this? More generally, how can I get the whole string by typing $5?
You can use Lorance Stinson's Awk CSV parser, in which case it's as simple as:
function parse_csv(..) {
..
}
{
num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
print csv[2]
}
If you're not hell-bent on Awk, Python also comes with a nice CSV parser:
import csv, sys

for row in csv.reader(sys.stdin):
    print(row[2])
Or from the command line (a bit tricky in one line):
python -c 'import csv,sys;[sys.stdout.write(row[2]+"\n") for row in csv.reader(sys.stdin)]' < input.txt
The separator is a simple comma, not a comma between quotes. If the fields do not contain commas, then awk may be up to the task:
awk -F , '
{
    if ($3 ~ /^".*"$/) {
        $3 = substr($3, 2, length($3)-2);   # strip the surrounding quotes
        gsub(/""/, "\"", $3);               # un-escape doubled quotes
    }
    print $3;
}' input.txt
This is already getting pretty complicated. If there can be commas inside fields, use a proper CSV parser, for example in Perl or Python. See https://unix.stackexchange.com/questions/7425/is-there-a-robust-command-line-tool-for-processing-csv-files
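In the same spirit, R's own read.csv implements the standard quoting rules, including the doubled "" escape, so it may be the shortest route if R is already in your toolchain (a sketch; assumes the file is otherwise well-formed CSV):
df <- read.csv("input.txt", header = FALSE, quote = "\"", stringsAsFactors = FALSE)
df$V5   # the whole fifth field, embedded commas intact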
You can parse the line in awk by setting a null field separator. Instead of printf("%s",$i) you can assign $i to a variable and print it out when inda==0:
#echo "\"AAA,BBB\",\"CCC\",\"DDD, EEE, FFF\"" > uno
awk 'BEGIN { FS="" }             # null FS: every character is its own field
{
    for ( i=1; i<=NF; i++) {     # note <=NF, so the last character is not dropped
        if ( $i == "\"" )        # toggle the inside-quotes flag
            if ( inda == 0 )
                inda = 1
            else
                inda = 0
        if ( $i == "," )
            if ( inda == 0 )     # only rewrite commas outside quotes
                $i = "|"
        printf("%s", $i)
    }
    printf("\n")
}' uno
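With the sample line from the commented echo saved in uno, this prints "AAA,BBB"|"CCC"|"DDD, EEE, FFF": the delimiting commas become |, while the commas inside quoted fields survive, so a later pass can split safely on |.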