Unix : Split line with delimiter

Unix : Split line with delimiter - unix

I've a file like this
a b c,d
e f g
x y r,s,t
and I would like to split this into columns using "," as delimiter. The other columns should be copied.
Expected result :
a b c
a b d
e f g
x y r
x y s
x y t
Thank you

Using awk. Expects field separators to be space or tab:
$ awk '{
split($3,a,",") # split the third field on commas, hash to a
for(i in a) { # for all entries in a
sub(/[^ \t]+$/,a[i],$0) # replace last field with entries in a...
print # ... preserving separators, space or tab
}
}' file
a b c
a b d
e f g
x y r
x y s
x y t
Due to the use of sub() it will produce false results if there is a & in the $3. Also, as mentioned in the comments, using for(i in a) may result in records outputing in seemingly random order. If that is a problem, use:
$ awk '{
n=split($3,a,",") # store element count to n
for(i=1;i<=n;i++) { # iterate in order
sub(/[^ \t]+$/,a[i],$0)
print
}
}' file
For tab separated files:
$ awk '
BEGIN { OFS="\t" } # output field separator
{
n=split($3,a,",")
for(i=1;i<=n;i++) {
$3=a[i] # & friendly version
print
}
}' file

Related

Single line input multiple values used space 4 5 without press enter

How can I input two digits in a single line using space like 5 6?
c code:
int a=0,b=0;
printf(" Input a and b in digits: \'use space\' ");
scanf("%d%d",&a,&b);
R single line input multiple values used space 4 5 without press enter:
sum_of_total_number <- function (first,second){
first <- as.numeric(first)
second <- as.numeric(second)
return (first + second )
}
main <- function(a,b){
{ a <- readline("Input value a: "); b <- readline("INput value b: ")}
print(sum_of_total_number(a,b))
}
main()

You can use the scan() function to read multiple values.
main <- function(){
vals <- scan(n=2, what=numeric(), quiet=TRUE)
print(sum(vals))
}
main()

find frequency of substring in a set of strings

I have as input a gene list where each genes has a header like >SomeText.
For each gene I would like to find the frequency of the string GTG. (number of occurences divided by length of gene). The string should only be counted if it starts at position 1,4,7,10 etc (every thids position).
>ENST00000619537.4 cds:known chromosome:GRCh38:21:6560714:6564489:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGATGTGACCATCCAGCACCCCTGGTTCAAGCGCACCCTGGGGCCCTTCTACCCCAGC
CGGCTGTTCGACCAGTTTTTCGGCGAGGGCCTTTTTGAGTATGACCTGCTGCCCTTCCTG
TCGTCCACCATCAGCCCCTACTACCGCCAGTCCCTCTTCCGCACCGTGCTGGACTCCGGC
ATCTCTGAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTC
TCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCAC
AACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTG
CCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACC
TTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCC
GTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624019.3 cds:known chromosome:GRCh38:21:6561284:6563978:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGGACGCCCCCCCCCCCCACCCAACCACAGGCCTCCTCTCTGAGCCACGGGTTCGATCC
GACCGGGACAAGTTCGTCATCTTCCTCGATGTGAAGCACTTCTCCCCGGAGGACCTCACC
GTGAAGGTGCAGGACGACTTTGTGGAGATCCACGGAAAGCACAACGAGCGCCAGGACGAC
CACGGCTACATTTCCCGTGAGTTCCACCGCCGCTACCGCCTGCCGTCCAACGTGGACCAG
TCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGCATGCTGACCTTCTGTGGCCCCAAGATC
CAGACTGGCCTGGATGCCACCCACGCCGAGCGAGCCATCCCCGTGTCGCGGGAGGAGAAG
CCCACCTCGGCTCCCTCGTCCTAA
>ENST00000624932.1 cds:known chromosome:GRCh38:21:6561954:6564203:1 gene:ENSG00000276076.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CH507-152C13.3 description:alpha-crystallin A chain [Source:RefSeq peptide;Acc:NP_001300979]
ATGCCTGTCTGTCCAGGAGACAGTCACAGGCCCCCGAAAGCTCTGCCCCACTTGGTGTGT
GGGAGAAGAGGCCGGCAGGTTCGATCCGACCGGGACAAGTTCGTCATCTTCCTCGATGTG
AAGCACTTCTCCCCGGAGGACCTCACCGTGAAGGTGCAGGACGACTTTGTGGAGATCCAC
GGAAAGCACAACGAGCGCCAGGACGACCACGGCTACATTTCCCGTGAGTTCCACCGCCGC
TACCGCCTGCCGTCCAACGTGGACCAGTCGGCCCTCTCTTGCTCCCTGTCTGCCGATGGC
ATGCTGACCTTCTGTGGCCCCAAGATCCAGACTGGCCTGGATGCCACCCACGCCGAGCGA
GCCATCCCCGTGTCGCGGGAGGAGAAGCCCACCTCGGCTCCCTCGTCCTAA
Output:
Gene Frequency
Gene1: 3
Gene2 6.3
....
I was thinging of something like this, but I dont now how to define the positions requirements:
freq <- sapply(gregexpr("GTG",x),function(x)if(x[[1]]!=-1) length(x) else 0)

Here is an idea in R using stringi.
We use stri_locate_all_fixed() to find the start and end position of each GTG occurence. Then we create a column condition to test if start position is in 1,4,7,10,13,16,19,22 ....
library(stringi)
library(dplyr)
data.frame(stri_locate_all_fixed(gene1, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(gene), 3))
Which gives:
# start end condition
#1 4 6 TRUE
If you want to generalize this to a list of genes, you could do:
lst <- list(gene1, gene2, gene3)
res <- lapply(lst, function(x) {
data.frame(stri_locate_all_fixed(x, "GTG")) %>%
mutate(condition = start %in% seq(1, nchar(x), 3))
})
Which would give:
#[[1]]
# start end condition
#1 4 6 TRUE
#
#[[2]]
# start end condition
#1 NA NA FALSE
#
#[[3]]
# start end condition
#1 3 5 FALSE
#2 9 11 FALSE
#3 21 23 FALSE
#4 70 72 TRUE
#5 75 77 FALSE
Following #Sobrique's comment, if divided by length means number of occurences respecting condition divided by total number of char in each gene, you could do:
lapply(1:length(res), function(x) sum(res[[x]][["condition"]]) / nchar(lst[[x]]))
Which would give:
#[[1]]
#[1] 0.004830918
#
#[[2]]
#[1] 0
#
#[[3]]
#[1] 0.003021148

Here's a Perl solution that does as you ask
But I don't understand how your example output is derived: the first and last sequences have only one occurrence of GTG in the positions you require, and the second sequence has none at all. That means the outputs are 1 / 207, 0 / 74, and 1 / 331 respectively. None of those are anything like 3 and 6.3 that you say you're expecting
This program expects the path to the input file as a parameter on the command line
use strict;
use warnings 'all';
print "Gene Frequency\n";
my $name;
local $/ = '>';
while ( <> ) {
chomp;
next unless /\S/;
my ($name, $seq) = split /\n/, $_, 2;
$seq =~ tr/A-Z//cd;
my $n = 0;
while ( $seq =~ /(?=GTG)/g ) {
++$n if $-[0] % 3 == 0;
}
printf "%-7s%.6f\n", $name, $n / length($seq);
}
output
Gene Frequency
Gene1 0.004831
Gene2 0.000000
Gene3 0.003021

Here is an alternate solution that does not use a pattern match. Not that it will matter much.
use strict;
use warnings;
my $gene;
while ( my $line = <> ) {
if ( $line =~ /^>(.+)/ ) {
$gene = $1;
next;
}
chomp $line;
printf "%s: %s\n",
$gene,
( grep { $_ eq 'GTG' } split /(...)/, $line ) / length $line;
}
Output:
Gene1: 0.00483091787439614
Gene2: 0
Gene3: 0.00302114803625378
It is essentially similar to Sobrique's answer, but assumes that the gene lines contain the right characters. It splits up the gene string into a list of three-character pieces and takes the ones that are literally GTG.
The splitting works by abusing the fact that split uses a pattern as the delimiter, and that it will also capture the delimiter if a capture group is used. Here's an example.
my #foo = split /(...)/, '1234567890';
p #foo; # from Data::Printer
__END__
[
[0] "",
[1] 123,
[2] "",
[3] 456,
[4] "",
[5] 789,
[6] 0
]
The empty elements get filter out by grep. It might not be the most efficient way, but it gets the job done.
You can run it by calling perl foo.pl horribly-large-gene-sequence.file.

Well, you have an R solution. I've hacked something together in perl because you tagged it:
#!/usr/bin/env perl
use strict;
use warnings;
my $target = 'GTG';
local $/ = "\n>";
while ( <> ) {
my ($gene) = m/(Gene\d+)/;
my #hits = grep { /^$target$/ } m/ ( [GTCA]{3} ) /xg;
print "$gene: ".( scalar #hits), "\n";
}
This doesn't give the same results as your input though:
Gene1: 1
Gene2: 0
Gene3: 1
I'm decomposing your string into 3 element lists, and looking for ones that specifically match. (And I haven't divided by length, as I'm not entirely clear if that's the actual string length in letters, or some other metric).
Including length matching - we need to capture both name and string:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n>";
while (<>) {
my ($gene, $gene_str) = m/(Gene\d+)\n([GTCA]+)/m;
my #hits = grep { /^GTG$/ } $gene_str =~ m/ ( [GTCA]{3} ) /xg;
print "$gene: " . #hits . "/". length ( $gene_str ), " = ", #hits / length($gene_str), "\n";
}
We use <> which is the 'magic' filehandle, and tells perl to read from either STDIN or a file specified on command line. Much like sed or grep does.
With your input:
Gene1: 1/207 = 0.00483091787439614
Gene2: 0/74 = 0
Gene3: 1/331 = 0.00302114803625378

Here is a function I created based on your requirement. I am pretty sure there are alternate ways better than this but this solves the problem.
require(stringi)
input_gene_list<- list(gene1= "GTGGGGGTTTGTGGGGGTG", gene2= "GTGGGGGTTTGTGGGGGTG", gene3= "GTGGGGGTTTGTGGGGGTG")
gene_counter<- function(gene){
x<- gene
y<- gsub(pattern = "GTG",replacement = "GTG ", x = x, perl=TRUE)
if(str_count(y,pattern = "GTG")) {
gene_count<- unlist(gregexpr(pattern = " ", y))
counter<- 0
for(i in 1:length(gene_count)){
if((gene_count[i] %% 3) == 1) counter=counter+1
}
return(counter/nchar(x))
}
}
output_list<- lapply(input_gene_list, function(x) gene_counter(x))
result<- t(as.data.frame(output_list))
result
[,1]
gene1 0.1052632
gene2 0.1052632
gene3 0.1052632
Also share your thoughts on it! Thanks!

Counting data that is valid to two conditions in columns 1 and 2

I am trying to run the following loop, the two while statements work, but the # c awk line seems to be causing me some problems.
printf "" >! loop.txt
# x = -125
while ($x <= -114)
# y = 32
while ($y <= 42)
# c =`awk '{ for ($1 = $x*; $2 = $y*){count[$1]++}}' text.txt`
printf "$x $y $c\n" >> loop.txt
# y++
end
# x++
end
With the awk line, I am trying to reference a file with lots of differing values in columns 1 and 2 of the text.txt file.
I want to be able to firstly reference all of the values in column 1 that start with $x (as they all have several decimal places), then reference from that sub-list all of the values in column 2 that begin with $y. After this second sub-list has been formed, I would like to count all of the entries valid to those conditions.
However, I keep getting syntax errors with the line, and I'm not sure that I'm using the correct function!
EDIT:
The executable file is a .csh type (C shell, I think)
A sample input format...
-125.025 32.058 2.25
-125.758 32.489 2.67
-125.349 32.921 3.49
-125.786 32.753 4.69
-125.086 33.008 2.78
And the expected output...
-125 32 4
-125 33 1

So this is all you want?
$ awk '{cnt[int($1)][int($2)]++} END{for (x in cnt) for (y in cnt[x]) print x, y, cnt[x][y]}' file
-125 32 4
-125 33 1
If you want to specify a range of x and y values, just add that range check before incrementing the array entry:
awk '
{ x=int($1); y=int($2) }
x>=-125 && x<=-114 && y>=32 && y<=42 { cnt[x][y]++ }
END { for (x in cnt) for (y in cnt[x]) print x, y, cnt[x][y] }
' file
I spit it into multiple lines to improve readability and added variables to avoid calling int() multiple times for each field.
Note that the above will read your input file just once compared to the script you posted in your question which will read the whole input file 132 times so you can imagine the performance improvement from that alone, never mind all the starting/stopping processes 132 times, etc.
The above use GNU awk for 2D arrays but can be easily simulated with other awks.

GNU Make Corresponding Lists

I have a list of products:
1, 2, 3, 4...
Which depends on a different list of sources:
z, y, x, w...
The dependence is one-to-one (the nth element of the first list has the nth element of the second list as its source), and the recipe is the same in all cases. There is, for all intents and purposes, no structure to the lists - it's not possible to write a simple expression which allows the nth element of the list to be generated from n. The solution that I know will work is
1 : z
[recipe]
2 : y
[identical recipe]
3 : x
[identical recipe]
4 : w
[identical recipe]
...but I don't like this because it makes it easier to make a mistake when modifying the lists or the recipe. I would prefer to take advantage of the correspondence pattern and begin with
SRCLIST = z y x w
DSTLIST = 1 2 3 4
And then somehow have a general rule like
DSTLIST_n : SRCLIST_n
[recipe]
Is there any way of doing something like this?

This is a bit ugly but I believe it should work. (There are probably slightly better ways but this was the first thing I came up with.)
SRCLIST = z y x w
DSTLIST = 1 2 3 4
# Create a list of : the length of SRCLIST
MIDLIST = $(foreach s,$(SRCLIST),:)
$(info SRCLIST:$(SRCLIST))
$(info DSTLIST:$(DSTLIST))
$(info MIDLIST:$(MIDLIST))
# Join the three lists together (in two passes since $(join) only takes two lists)
JOINLIST = $(join $(join $(DSTLIST),$(MIDLIST)),$(SRCLIST))
$(info joined:$(JOINLIST))
# eval each of the created strings to create the prerequisite entries
$(foreach r,$(JOINLIST),$(eval $r))
# Set the rules to build all the targets.
$(DSTLIST):
echo '$# for $^'
$ touch z y x w
$ make
SRCLIST:z y x w
DSTLIST:1 2 3 4
MIDLIST:: : : :
joined:1:z 2:y 3:x 4:w
echo '1 for z'
1 for z
echo '2 for y'
2 for y
echo '3 for x'
3 for x
echo '4 for w'
4 for w
I should note that this will not deal with spaces in any of the entries at all well (but that's generally true of make so nothing specific to this solution).
You could also always just create a Canned Recipe and then just stick that in each explicitly written out target as in your original idea.

Inspired by Etan, here is what I found worked:
SRCLIST = z y x w
DSTLIST = 1 2 3 4
# Make a list of ":" for combining
SEPARATOR = $(foreach s,$(SRCLIST),:)
# Define a parameterized rule which accepts the dst:src info as an argument
define dst-src
$1
[rule]
endef
# Define the list of dependencies
DST_SRC_RELNS = $(join $(join $(DSTCLIST),$(SEPARATOR)),$(SRCLIST))
# ^ DST_SRC_RELNS evaluates to z:1 y:2 x:3 w:4
# Print a preview of the rules the makefile generates itself
$(info $(foreach r,$(DST_SRC_RELNS),$(call dst-src,$r)))
# Generate the rules
$(foreach r,$(DST_SRC_RELNS),$(eval $(call dst-src,$r)))
I think that you could get away with not defining the parameterized rule dst-src by actually writing the rule out inside the $(eval ...), but I didn't like this for two reasons:
you need to define a newline macro for the result to be something that make will recognize as a rule
adding more text within the $(foreach ...) makes it even harder for a human reader to figure out what's really going on

Nice problem. You didn't mention which version of make you are using, but .SECONDEXPANSION often works well for these sorts of source lookup tables.
A sketch:
srcs := z x y w
targets := 1 2 3 4
.SECONDEXPANSION:
pairs := $(join ${targets},$(addprefix :,${srcs}))
lookup-src = $(patsubst $1:%,%,$(filter $1:%,${pairs}))
${targets}: $$(call lookup-src,$$#)
echo '[$^] -> [$#]'

How to calculate the tree the results by combining individual leaf paths?

Let's say I have an input file where each line contains the path from the root (A) to a leaf
echo "A\tB\tC\nA\tB\tD\nA\tE" > lines.txt
A B C
A B D
A E
How can I easily generate the resulting tree?: (A(B(C,D),E))
I'd like to use GNU tools (awk, sed, etc.) because they tend to work better with large files, but an R script would also work. The R input would be:
# lines <- lapply(readLines("lines.txt"), strsplit, " +")
lines <- list(list(c("A", "B", "C")), list(c("A", "B", "D")), list(c("A","E")))

In Perl:
#!/usr/bin/env perl
use strict;
my $t = {};
while (<>) {
my #a = split;
my $t1 = $t;
while (my $a = shift #a) {
$t1->{$a} = {} if not exists $t1->{$a};
$t1 = $t1->{$a};
}
}
print &p($t)."\n";
sub p {
my ($t) = #_;
return
unless keys %$t;
return '('
. join(',', map { $_ . p($t->{$_}) } sort keys %$t)
. ')';
}
This script returns:
% cat <<EOF | perl l.pl
A B C
A B D
A E
EOF
(A(B(C,D),E))
Note that this script, due to recursion in p is not at all suited for large datasets. But that can be easily resolved by turning that into a double for loop, like in the first while above.

Why do it the easy way, if you can use Bourne Shell script instead? Note, this is not even Bash, this is plain old Bourne shell, without arrays...
#!/bin/sh
#
# A B C
# A B D
# A E
#
# "" vs "A B C" -> 0->3, ident 0 -> -0+3 -> "(A(B(C"
# "A B C" vs "A B D" -> 3->3, ident 2 -> -1+1 -> ",D"
# "A B D" vs "A E" -> 3->2, ident 1 -> -2+1 -> "),E"
# "A E" vs. endc -> 2->0, ident 0 -> -2+0 -> "))"
#
# Result: (A(B(C,D),E))
#
# Input stream is a path per line, path segments separated with spaces.
process_line () {
local line2="$#"
n2=$#
set -- $line1
n1=$#
s=
if [ $n2 = 0 ]; then # last line (empty)
for s1 in $line1; do
s="$s)"
done
else
sep=
remainder=false
for s2 in $line2; do
if ! $remainder; then
if [ "$1" != $s2 ]; then
remainder=true
if [ $# = 0 ]; then # only children
sep='('
else # sibling to an existing element
sep=,
shift
for s1 in $#; do
s="$s)"
done
fi
fi
fi
if $remainder; then # Process remainder as mismatch
s="$s$sep$s2"
sep='('
fi
shift # remove the first element of line1
done
fi
result="$result$s"
}
result=
line1=
(
cat - \
| sed -e 's/[[:space:]]\+/ /' \
| sed -e '/^$/d' \
| sort -u
echo '' # last line marker
) | while read line2; do
process_line $line2
line1="$line2"
test -n "$line2" \
|| echo $result
done
This produces the correct answer for two different files (l.sh is the shell version, l.pl the version in Perl):
% for i in l l1; do cat $i; ./l.sh < $i; ./l.pl < $i; echo; done
A
A B
A B C D
A B E F
A G H
A G H I
(A(B(C(D),E(F)),G(H(I))))
(A(B(C(D),E(F)),G(H(I))))
A B C
A B D
A E
(A(B(C,D),E))
(A(B(C,D),E))
Hoohah!

Okay, so I think I got it:
# input
lines <- c(list(c("A", "B", "C")), list(c("A", "B", "D")), list(c("A","E")))
# generate children
generate_children <- function(lines){
children <- list()
for (line in lines) {
for (index in 1:(length(line)-1)){
parent <- line[index]
next_child <- line[index + 1]
if (is.null(children[[parent]])){
children[[parent]] <- next_child
} else {
if (next_child %notin% children[[parent]]){
children[[parent]] <- c(children[[parent]], next_child)
}
}
}
}
children
}
expand_children <- function(current_parent, children){
if (current_parent %in% names(children)){
expanded_children <- sapply(children[[current_parent]], function(current_child){
expand_children(current_child, children)
}, USE.NAMES = FALSE)
output <- setNames(list(expanded_children), current_parent)
} else {
output <- current_parent
}
output
}
children <- generate_children(lines)
root <- names(children)[1]
tree <- expand_children(root, children)
dput(tree)
# structure(list(A = structure(list(B = c("C", "D"), "E"), .Names = c("B",""))), .Names = "A")
Is there a simpler answer?

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Unix : Split line with delimiter - unix

I've a file like this a b c,d e f g x y r,s,t and I would like to split this into columns using "," as delimiter. The other columns should be copied. Expected result : a b c a b d e f g x y r x y s x y t Thank you

Related

Single line input multiple values used space 4 5 without press enter

find frequency of substring in a set of strings

Counting data that is valid to two conditions in columns 1 and 2

GNU Make Corresponding Lists

How to calculate the tree the results by combining individual leaf paths?

Categories

Resources