Output left or right part of a matched string [closed] - r

I have two files; file1 contains substrings of the lines in file2. I want to match each line of file1 against file2 and output the part to the left of the match, not the match itself. I would also like to know how to output the part to the right of the match, again without the match itself.
Here is part of my data (these strings probably do not match; this is just example data):
file1
ACUGUACAGGCCACUGCCUUGC
CUGCGCAAGCUACUGCCUUGCU
UGGAAUGUAAAGAAGUAUGUAU
CGAAUCAUUAUUUGCUGCUCUA
AUCACAUUGCCAGGGAUUACC
UUCACAGUGGCUAAGUUCUGC
file2
CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC
example:
file1:
GCUGUGGAGAUAACUGCGC
file2:
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC
output:
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCC

Here are a couple of ways to keep only the text that comes before your pattern, if it exists:
a <- "GCUGUGGAGAUAACUGCGC"
b <- "CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC"
# split b on the pattern a and keep the first piece
strsplit(b, a)[[1]][1]
# or: remove the pattern and everything after it
sub(paste0(a, ".*$"), "", b)
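To keep the text to the right of the match instead, you can take the second piece from strsplit(), or strip the match and everything before it with sub(). A small sketch using the same a and b as above:
# second piece after splitting on the pattern
strsplit(b, a)[[1]][2]
# or: remove everything up to and including the pattern
sub(paste0("^.*", a), "", b)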
Now, you just need to read the files into R and loop over each pattern. I'm not exactly sure what you are looking for, but here is an idea
# read data into 2 variables, a and b
# you could use readLines() to read from disk
a <- readLines(textConnection("ACUGUACAGGCCACUGCCUUGC
CUGCGCAAGCUACUGCCUUGCU
UGGAAUGUAAAGAAGUAUGUAU
CGAAUCAUUAUUUGCUGCUCUA
AUCACAUUGCCAGGGAUUACC
UUCACAGUGGCUAAGUUCUGC"))
b <- readLines(textConnection("CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC"))
Now, loop over each value from the first file
lapply(a, function(x) sapply(strsplit(b, x), "[", 1))
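This returns, for every pattern in a, the part of each line of b that precedes that pattern; lines that do not contain the pattern come back unchanged. If each line of file1 is meant to pair with the same-numbered line of file2 (an assumption on my part), a pairwise version may be closer to what you want:
# pair pattern i with line i and keep the left-hand part
# (use "[[1]][2]" instead of "[[1]][1]" for the right-hand part)
n <- min(length(a), length(b))
mapply(function(pat, seq) strsplit(seq, pat)[[1]][1], a[seq_len(n)], b[seq_len(n)])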

Opening file handles to strings for testing:
use strict;
use warnings;
use autodie;
open my $fh1, '<', \ "ACUGUACAGGCCACUGCCUUGC\nCUGCGCAAGCUACUGCCUUGCU\nUGGAAUGUAAAGAAGUAUGUAU\nCGAAUCAUUAUUUGCUGCUCUA\nAUCACAUUGCCAGGGAUUACC\nUUCACAGUGGCUAAGUUCUGC\n";
open my $fh2, '<', \ "CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG\nCUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG\nGCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC\nCUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG\nGGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC\n";
while ( !eof $fh1 && !eof $fh2 ) {
    chomp( my $line1 = <$fh1> );
    chomp( my $line2 = <$fh2> );
    print join( ' ', split /$line1/, $line2, 2 ), "\n";
}
Outputs:
GUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA CAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA AG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA UUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG G
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA ACGCAACC

You can also try the Perl code below, which gives you the text before, after, and the match itself using $PREMATCH ($`), $POSTMATCH ($') and $MATCH ($&):
Input files:
file1.txt:
ACUGUACAGGCCACUGCCUUGC
CUGCGCAAGCUACUGCCUUGCU
UGGAAUGUAAAGAAGUAUGUAU
CGAAUCAUUAUUUGCUGCUCUA
AUCACAUUGCCAGGGAUUACC
UUCACAGUGGCUAAGUUCUGC
file2.txt:
CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC
Code:
use strict;
use warnings;
open my $fh1, '<', "file1.txt" or die "Couldn't open the file file1.txt: $!";
open my $fh2, '<', "file2.txt" or die "Couldn't open the file file2.txt: $!";
while ( !eof $fh1 && !eof $fh2 ) {
    chomp( my $line1 = <$fh1> );
    chomp( my $line2 = <$fh2> );
    if ( $line2 =~ /$line1/isg ) {
        print "Prematch: $`\n";
        print "Postmatch: $'\t";
    }
}
close($fh1);
close($fh2);
Output:
Prematch: CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA Postmatch: CAGG
Prematch: CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA Postmatch: AG
Prematch: GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA Postmatch: UUCAGGC
Prematch: CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG Postmatch: G
Prematch: GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA Postmatch: ACGCAACC

Related

How to remove the tab delimiter after the last column by using unix

I have a tab-separated file. I am using the code below:
awk -v var="MAS_CONTROL_WL_column_nmbr.dat" 'BEGIN{RS="\n"}
{ while(getline line < var){ printf("%s\t",$line)};close(var);
printf( "\n") }' MAS_CONTROL_WL.tsv > test.tsv
This prints the columns whose numbers are listed in the column-number file, but the issue I am facing is that a \t appears after the last column.
How can I remove it?
First a test file:
$ cat > foo
1
2
3
And the awk:
$ awk -v var=foo '
BEGIN { RS="\n" }
{
out="" # introducing output buffer
while(getline line < var) {
out=out sprintf("%s%s",(out==""?"":"\t"),line) # controlling tabs
}
close(var)
print out # output output buffer
}' foo | cat -T # useful use of cat
Output:
1^I2^I3
1^I2^I3
1^I2^I3
Instead of printing "field-tab" for every field, print the first field without a tab, then append the rest as "tab-field":
awk -v var="MAS_CONTROL_WL_column_nmbr.dat" '
BEGIN{RS="\n"}
{
if (getline line < var) printf("%s",$line);
while (getline line < var) printf("\t%s",$line);
close(var);
printf( "\n");
}
' MAS_CONTROL_WL.tsv > test.tsv
In case you still need an answer to your original question (removing the \t after the last column): sed -i 's/[[:space:]]$//' your_file.tsv will remove the whitespace at the end of the lines of your file.

Execute a Perl script from R

I need to execute a Perl program as part of a larger R program.
The R code generates a series of output files with different extensions, for instance .out or .lis.
I have a Perl program that converts those files to CSV.
I've seen Perl commands executed from R, but nothing with this complexity.
@outfiles = glob( "*.lis" );
foreach $outfile ( @outfiles ) {
    print $outfile, "\n";
    $outfile =~ /(\S+)lis$/;
    $csvfile = $1 . "lis.csv";
    print $csvfile, "\n";
    open( OUTFILE, "$outfile" ) || die "ERROR: Unable to open $outfile\n";
    open( CSVFILE, ">$csvfile" ) || die "ERROR: Unable to open $csvfile\n";
    $lineCnt = 0;
    while ( $outline = <OUTFILE> ) {
        chomp( $outline );
        $lineCnt++;
        $outline =~ s/^\s+//;     # Remove whitespace at the beginning of the line
        if ( $lineCnt == 1 ) {
            $outline =~ s/,/\./g; # Replace all the commas with periods in the hdr line
        }
        $outline =~ s/\s+/,/g;    # Replace remaining whitespace delimiters with a comma
        print CSVFILE "$outline\n";
    }
    close( OUTFILE );
    close( CSVFILE );
}
Is there any way I can integrate the Perl code into my R code? I could develop an R program that does the same, but I wouldn't know where to start converting a .lis or .out file to .csv.
Call it by using R's system call:
my.seed <- as.numeric(try(system(" perl -e 'print int(rand(1000000))'", intern = TRUE))) #get random number :D
However, I must agree with @ikegami, there are better modules to handle CSV data.
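If you prefer to keep the conversion in Perl, you can save the script above as a standalone file and call it from R, then read the generated CSV files back in. A sketch, assuming the script has been saved under the hypothetical name lis2csv.pl in R's working directory:
# run the converter; its output and errors are captured as a character vector
system2("perl", args = "lis2csv.pl", stdout = TRUE, stderr = TRUE)
# read every CSV it produced back into R
csv_files <- list.files(pattern = "\\.csv$")
dat_list <- lapply(csv_files, read.csv)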

Bash script awk

I am new to Bash scripting. I am struggling to understand this particular line of code. Please help.
old_tag = awk -v search="$new_tag" -F" " '$1==search { a[count] = $2; count++; } END { srand();print a[int(rand()*(count-1))+1] }' $tag_dir/$file
[ -z "$new_tag" ] && break
The code seems to be incorrect. With old_tag = awk it tries to store the result of the awk command in the variable old_tag. A variable assignment should be done without spaces around the =, and the command should be enclosed in $(..). It might have been backticks in the original code; these are deprecated, and backticks are also used for code formatting on SO.
Your question would have been easier to answer with an example input file, but I will try to explain assuming input lines like:
apple x1
car a
rotten apple
tree sf
apple x5
car a4
apple x3
I switched old_tag and new_tag; that seems to make more sense.
new_tag=$(awk -v search="$old_tag" -F" " '
$1==search { a[count] = $2; count++; }
END { srand(); print a[int(rand()*(count-1))+1] }
' $tag_dir/$file)
[ -z "$new_tag" ] && break
This code tries to find a new tag by searching for the old tag in $tag_dir/$file. When the tag occurs more than once, one of the matching lines is chosen at random.
The code explained in more detail:
# assign output to variable new_tag
new_tag=$(..)
# use awk program
awk ..
# Assign the value of old_tag to a variable "search" that can be used in awk
-v search="$old_tag"
# Different fields separated by spaces
-F" "
# The awk programming lines
' .. '
# Check first field of line with the variable search
$1==search { .. }
# When true, store second field of line in array and increment index
a[count] = $2; count++;
# Additional commands after processing everything
END {..}
# Print random index from array
srand(); print a[int(rand()*(count-1))+1]
# Use file as input for awk
$tag_dir/$file
# Stop when no new_tag has been found
[ -z "$new_tag" ] && break
# I would have preferred the syntax
test -z "${new_tag}" && break
With the sample input and old_tag="apple", the code will find the lines with apple as the first word
apple x1
apple x5
apple x3
The words x1, x5 and x3 are stored in array a, and one of these three is randomly assigned to new_tag.

How to call exe program and input parameters using R?

I want to call an .exe program (spi_sl_6.exe) using R's system command, but I can't pass parameters to the program using system. The following is my command and parameters: system("D:\\working\spi_sl_6.exe")
I have been searching for a long time on the net, but with no luck. Please help or give some ideas on how to achieve this. Thanks in advance.
This is using the Standardized Precipitation Index software from
http://drought.unl.edu/MonitoringTools/DownloadableSPIProgram.aspx.
This seems to give a working solution using Windows (but not without warnings!)
First download the software and example files:
# Create directory to download software
mydir <- "C:\\Users\\david\\spi"
dir.create(mydir)
url <- "http://drought.unl.edu/archive/Programs/SPI"
download.file(file.path(url, "spi_sl_6.exe"), file.path(mydir, "spi_sl_6.exe"), mode="wb")
# Download example files
download.file(file.path(url, "SPI_samplefiles.zip"), file.path(mydir, "SPI_samplefiles.zip"))
# extract one example file, and write out
temp <- unzip(file.path(mydir, "SPI_samplefiles.zip"), "wymo.cor")
dat <- read.table(temp)
# Use this file as an example input
write.table(dat, file.path(mydir,"wymo.cor"), col.names = FALSE, row.names = FALSE)
From page 3 of the help file basic-spi-program-information.pdf at the above link, the command-line call should be of the form spi 3 6 12 <infile.dat >outfile.dat. However, neither of the following worked (run directly from the command line, not in R), nor did various other ways of passing the parameters:
C:\Users\david\spi\spi_sl_6 3 <C:\Users\david\spi\wymo.cor >C:\Users\david\spi\out.dat
cd C:\Users\david\spi && spi_sl_6 3 <wymo.cor >out.dat
However, the approach from the accepted answer to Running .exe file with multiple parameters in c# seems to work. That is, again from the command line:
cd C:\Users\david\spi && (echo 2 && echo 3 && echo 6 && echo wymo.cor && echo out1.dat) | spi_sl_6
So to run this in R, you can wrap the command in shell() (you will need to change the path to where you have saved the exe):
shell("cd C:\\Users\\david\\spi && (echo 2 && echo 3 && echo 6 && echo wymo.cor && echo out2.dat) | spi_sl_6", intern=TRUE)
out1.dat and out2.dat should be the same.
This throws warning messages, I think from the echo calls (in R but not from the command line), but the output file is produced.
I suppose you can automate the echo calls slightly, so that all you need to do is input the time parameters:
timez <- c(2, 3, 6)
stime <- paste("echo", timez, collapse =" && ")
infile <- "wymo.cor"
outfile <- "out3.dat"
spiCall <- paste("cd", mydir, "&& (" , stime, "&& echo", infile, "&& echo", outfile, " ) | spi_sl_6")
shell(spiCall)
You can construct the command using sprintf:
cmd_name <- "D:\\working\\spi_sl_6.exe"
param1 <- "a"
param2 <- "b"
system(sprintf("%s %s %s", cmd_name, param1, param2))
Or using system2 (I prefer this option):
system2(cmd_name, args = c(param1,param2))
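If you also need the program's output back in R, system2() can capture it directly. A minimal sketch reusing cmd_name, param1 and param2 from above:
# capture stdout as a character vector instead of printing it to the console
out <- system2(cmd_name, args = c(param1, param2), stdout = TRUE)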

Extract values of a variable occurring multiple times in a file

I have to extract the value of a variable which occurs multiple times in a file. For example, I have a text file abc.txt that contains a variable result. Suppose the value of result in the first line is 2, in the third line it is 55, and in the last line it is 66.
Then my desired output should be:
result:2,55,66
I am new to Unix, so I could not figure out how to do this. Please help.
The contents of the text file can be as follows:
R$#$#%$W%^BHGF, result=2,
fsdfsdsgf
VSDF$TR$R,result=55
fsdf4r54
result=66
Try this:
Using awk code:
awk -F'(,| |^)result=' '
/result=/{
gsub(",", "", $2)
v = $2
str = (str) ? str","v : v
}
END{print "result:"str}
' abc.txt
Using Perl code:
perl -lane '
push @arr, $& if /\bresult=\K\d+/;
END{print "result:" . join ",", @arr}
' abc.txt
Output:
result:2,55,66
