read textfile as formatted list - r

I'd like to import a text file as a list into R but don't know how to tell R the desired format. The source is a txt file formatted as follows:
Column1=a
Date=21.01.2020
Column2=1|2|3
Subtable_Column1=a|2|3
Subtable_Column2=c|d|e
[2]
Column1=b
Date=21.02.2020
Column2=1|2|4
Subtable_Column1=a|2|3
Subtable_Column2=c|d|e
Subtable_Column3=c|d|e
In the end, each [n] should become a list index whose element contains the named vectors (e.g. Column1, Date, etc.).
How's that possible in R?
Thanks for your help!

Here is a way to pre-format the data.
varval.dat
Column1=a
Date=21.01.2020
Column2=1|2|3
Subtable_Column1=a|2|3
Subtable_Column2=c|d|e
[2]
Column1=b
Date=21.02.2020
Column2=1|2|4
Subtable_Column1=a|2|3
Subtable_Column2=c|d|e
Subtable_Column3=c|d|e
The following awk script puts each dataset into a row of a CSV that R can read as a data frame:
awk -F '=' 'BEGIN{ i=1 }
  /^\[/{ if (j) i++; j=0; next }   # a "[n]" marker starts a new record (the sample separates records this way, not with blank lines)
  /=/{ j++; cname[0,j]=$1; content[i,j]=$2; rows=i; cols[i]++ }
  END{
    m=0; for(k in cols){ if(cols[k]>m){ m=cols[k] } }
    for(j=1;j<m;j++){ printf("%s,",cname[0,j]) }; print cname[0,m]
    for(i=1;i<=rows;i++){
      for(j=1;j<m;j++){ printf("%s,",content[i,j]) }
      print content[i,m]
    }
  }' varval.dat > varval.csv
R:
a <- read.csv("varval.csv", na.strings="")
# Column1 Date Column2 Subtable_Column1 Subtable_Column2 Subtable_Column3
# 1 a 21.01.2020 1|2|3 a|2|3 c|d|e <NA>
# 2 b 21.02.2020 1|2|4 a|2|3 c|d|e c|d|e
I'm sure this is not exactly what you wanted but may serve as a starting point.

Related

How to read the comma separated values

We have a fixed width file
Col1 length 10
Col2 length 10
Col3 length 30
Col4 length 40
Sample records
ABC 123 xyz. 5171-5261,51617
ABC. 1234. Xxy. 81651-61761
Col4 can have any number of comma-separated values,
one or more, within its 40-character length. If a record has a single value, there is no change in the output file.
If more than one value is present, i.e. comma separated (5171-5261,51617),
the output file should have multiple records.
One input record then becomes two:
ABC. 123. Xyz. 5171-5261
ABC 123. Xyz. 51617
What is the most efficient way to do this?
Right now I am trying while and for loops, but execution takes very long since we do this splitting by reading each record.
The output file can be comma separated or fixed width.
awk is your friend here.
A single line of awk will achieve what you need:
awk -v FIELDWIDTHS="10 10 30 40" '{ if (match($4,",")) { split($4,array,","); for (i in array) { print $1,$2,$3,array[i]; }; } else { print $1,$2,$3,$4 }; }' samp.dat
For ease of reading the code is:
{
    if (match($4,",")) {
        split($4,array,",");
        for (i in array) {
            print $1,$2,$3,array[i];
        };
    } else {
        print $1,$2,$3,$4
    };
}
Testing with the sample data you supplied gives:
ABC 123 xyz. 5171-5261
ABC 123 xyz. 51617
ABC. 1234. Xxy. 81651-61761
How it works:
awk reads your file one line at a time.
The FIELDWIDTHS variable (a GNU awk feature) allows us to reference each column as $1,$2...
Now that we have our columns we can look for a comma in the fourth field with match($4,",").
If we find one we make an array of the values in the fourth field that are separated by commas with split($4,array,",").
Then we loop through this array and print multiple lines of output, one for each element of the array.
If the fourth field has no comma the else clause prints a single line.
This process repeats for each line in your fixed width file.
NOTE:
awk's for (i in array) traversal of associative arrays does not guarantee to preserve the order of your data.
This means that your output might come out as
ABC 123 xyz. 51617
ABC 123 xyz. 5171-5261
ABC. 1234. Xxy. 81651-61761
i.e. 5171-5261,51617 in the input data produced a line from the second value before the first.
If the ordering is important to you, then you can use the code below, which first makes a CSV from your input data and then produces the output while preserving the order.
awk -v FIELDWIDTHS="10 10 30 40" '{print $1,$2,$3,$4}' OFS=',' samp.dat > samp.csv
awk -F',' '{ for (i=4; i<=NF; i++) { print $1,$2,$3,$i } }' samp.csv
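Alternatively, since split() returns the number of elements it found, a numeric loop keeps the single-pass approach and still preserves the input order (a sketch along the same lines, untested against your real data):
awk -v FIELDWIDTHS="10 10 30 40" '{
    n = split($4, array, ",")       # n is the number of comma-separated values
    for (i = 1; i <= n; i++)        # a numeric loop preserves the input order
        print $1, $2, $3, array[i]
}' samp.dat
Note that the if/else is no longer needed: when there is no comma, split() returns 1 and the loop prints the single value as-is.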

Get a specific column number by column name using awk

I have n files; in each of them, a specific column named "thrudate" appears at a different column number.
I just want to extract the values of this column from all files in one go, so I tried awk. Here I'm considering only one file and extracting the values of thrudate:
awk -F, -v header=1,head="" '{for(j=1;j<=2;j++){if($header==1){for(i=1;i<=$NF;i++){if($i=="thrudate"){$head=$i;$header=0;break}}} elif($header==0){print $0}}}' file | head -10
How I have approached it:
I used the find command to locate all the similar files and then execute the second step for every file.
I loop over all fields in the first row, checking each column name while header is 1 (initialized to 1 so that only the first row is checked); once it matches 'thrudate', I set header to 0 and break out of the loop.
Once I have the column number, I print that column for every row.
You can use the following awk script:
print_col.awk:
# Find the column number in the first line of a file
FNR==1{
    for(n=1;n<=NF;n++) {
        if($n == header) {
            # "next" both skips printing the header line and leaves
            # the loop, so n keeps the matched column number
            next
        }
    }
}
# Print that column on all other lines
{
    print $n
}
Then use find to execute this script on every file:
find ... -exec awk -v header="foo" -f print_col.awk {} +
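Since your own attempt used -F,, your files are presumably comma-separated, so pass the field separator along as well (the path pattern is a placeholder):
find . -name 'file*' -exec awk -F',' -v header="thrudate" -f print_col.awk {} +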
In comments you've asked for a version that could print multiple columns based on their header names. You may use the following script for that:
print_cols.awk:
BEGIN {
    # Parse headers into an assoc array h
    split(header, a, ",")
    for(i in a) {
        h[a[i]]=1
    }
}
# Find the column numbers in the first line of a file
FNR==1{
    split("", cols) # This will re-init cols
    for(i=1;i<=NF;i++) {
        if($i in h) {
            cols[i]=1
        }
    }
    next
}
# Print those columns on all other lines
{
    res = ""
    for(i=1;i<=NF;i++) {
        if(i in cols) {
            s = res ? OFS : ""
            res = res "" s "" $i
        }
    }
    if (res) {
        print res
    }
}
Call it like this:
find ... -exec awk -v header="foo,bar,test" -f print_cols.awk {} +
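For comma-separated input and output, set FS and OFS as well (the header names here are only examples):
find . -name 'file*' -exec awk -F',' -v OFS=',' -v header="thrudate,fromdate" -f print_cols.awk {} +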

unix split FASTA using a loop, awk and split

I have a long list of data organised as below (INPUT).
I want to split the data up so that I get an output as below (desired OUTPUT).
The code below first identifies all the lines containing ">gi" and saves the line numbers of those lines in an array called B.
Then, in a new file, it should replace those lines from array B with a shortened version of the text following ">gi".
I figured the easiest way would be to split at "|"; however, this does not work (no separation happens with my code if I replace " " with "|").
My code is below. It does split nicely after the " " if I replace the "|" with " " in the INPUT, but I get into trouble when I want to get the text between the [ ] brackets, which is NOT always there and does not always contain only 2 words:
B=$( grep -n ">gi" 1VAO_1DII_5fxe_all_hits_combined.txt | cut -d : -f 1)
awk <1VAO_1DII_5fxe_all_hits_combined.txt >seqIDs_1VAO_1DII_5fxe_all_hits_combined.txt -v lines="$B" '
BEGIN {split(lines, a, " "); for (i in a) change[a[i]]=1}
NR in change {$0 = ">" $4}
1
'
let me know if more explanations are needed!
INPUT:
>gi|9955361|pdb|1E0Y|A:1-560 Chain A, Structure Of The D170sT457E DOUBLE MUTANT OF VANILLYL- Alcohol Oxidase
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVA
>gi|557721169|dbj|GAD99964.1|:1-560 hypothetical protein NECHADRAFT_63237 [Byssochlamys spectabilis No. 5]
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVAPRNV
desired OUTPUT:
>1E0Y
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVAPRNV
>GAD99964.1 Byssochlamys spectabilis No. 5
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVA
This can be done in one step with awk (gnu awk):
awk -F'|' '/^>gi/{a=1;match($NF,/\[([^]]*)]/, b);print ">"$4" "b[1];next}a{print}!$0{a=0}' input > output
In a more readable way:
/^>gi/ {                       # when the line starts with ">gi"
    a=1;                       # set flag "a" to 1
    # extract the (optional) part between brackets in the last field
    match($NF, /\[([^]]*)]/, b);
    print ">"$4" "b[1];        # display the shortened header
    next                       # jump to the next record
}
a { print }                    # while "a" is set (inside a record), display the line
!$0 { a=0 }                    # when the line is empty, reset "a" to stop the display
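The three-argument match() is a GNU awk extension. On a POSIX awk you can recover the bracketed part via RSTART/RLENGTH instead (a sketch of the same logic):
awk -F'|' '/^>gi/ {
    name = ""
    if (match($NF, /\[[^]]*]/))                        # find an optional [...] part
        name = " " substr($NF, RSTART + 1, RLENGTH - 2)
    print ">" $4 name
    a = 1; next
}
a { print }
!$0 { a = 0 }' input > output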

Unix formatting text into table with grep or awk or sed

I have been able to locate things no problem with grep; however, the assignment is basically pulling data out, formatting it, and displaying it as a table of columns with multiple rows. It shouldn't be anything crazy, because we only have basic knowledge of awk and sed. I'm curious: is there any way to take my output from grep and format it so that, for example, I get:
Jake
0001
Bob
0002
Kim
0003
and turn it into something like this:
# Name LD #
--- ---- ----
1 Jake 0001
2 Bob 0002
3 Kim 0003
Also, is it possible to explain each part of your solution, and can it be made expandable if I have a larger record to deal with?
You need to define (or identify) control logic that matches your grep output.
Derived from what you gave I assume the following:
the heading is constant text that is intrinsic to your formatting
(not to be deduced from the input)
the first column is an ordinal number starting with one
the records from the input are identified by a string of all digits
Then the following awk script will do the formatting:
BEGIN {
    # initialize ordinal
    ordinal=1;
    # print heading
    printf "%-3s %5s %4s\n", "#", "Name", "LD #"
}
# match trigger line for output
/^[0-9]+$/ {
    printf "%3d %5s %4s\n", ordinal++, label, $1;
    # clearing label is not necessary in the single-data-item case
    # we are done with this input line
    next;
}
# collect data item
{
    label=$1;
    # we are done with this input line
    next;
}
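Assuming the script is saved as, say, format.awk (the name is only an example), running it against your sample gives:
awk -f format.awk input.txt
#    Name LD #
  1  Jake 0001
  2   Bob 0002
  3   Kim 0003
(The --- separator row from your example could be added with one more printf in the BEGIN block.)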
If you want to include more record items (leading to more columns), you might check whether the preceding column values have been encountered.
Or even just use a counter for indicating at what column you are within your record.
Then you could use e.g.:
BEGIN {
    # initialize ordinal and column counter
    ordinal=1;
    column=0;
    # print heading
    printf "%-3s %5s %4s\n", "#", "Name", "LD #"
}
# match trigger line for output
/^[0-9]+$/ {
    printf "%3d (%d)", ordinal++, column;
    for (i=0; i < column; i++) {
        printf " %s", data[i];
        data[i] = "";
    }
    # finish the row with the trigger value itself
    printf " %s\n", $1;
    # we are done with this input line
    column=0;
    next;
}
# collect data item
{
    # record the widest value seen so far under the same index as the data
    if (length($1) > max[column]) {
        max[column]=length($1);
    }
    data[column++]=$1;
    # we are done with this input line
    next;
}
END {
    for (i=0; i < length(max); i++) {
        printf "Col %d: %d\n", i, max[i];
    }
}
I also included a way of determining the size of the columns (character count).
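With your sample input and the script saved as, say, table.awk (again, an illustrative name), the output, including the width report from the END block, would look like:
awk -f table.awk input.txt
#    Name LD #
  1 (1) Jake 0001
  2 (1) Bob 0002
  3 (1) Kim 0003
Col 0: 4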

print if all values are higher

I have a file like:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
B 37.40,37.40,38.40,38.80,58.40,58.80,45.00,44.8
.
.
.
I want to print those lines where all values in column 2 are greater than 50.
output:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
I tried:
cat file | tr ',' '\t' | awk '{for (i=2; i<=NF; i++){if($i<50) continue; else print $i}}'
I hope you meant that r tag you added to your question.
tab <- read.table("file")
splt <- strsplit(as.character(tab[[2]]), ",")
rows <- unlist(lapply(splt, function(a) all(as.numeric(a) > 50)))
tab[rows,]
This will read your file as a space-separated table, split the second column into individual values (resulting in a list of character vectors), then compute a logical value for each such row depending on whether or not all values are > 50. These results are combined to a logical vector which is then used to subset your data.
The field separator can be any regular expression, so if you include commas in FS your approach works:
awk '{ for(i=2; i<=NF; i++) if($i<=50) next } 1' FS='[ \t,]+' infile
Output:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
Explanation
The for-loop runs through the comma-separated values in the second column, and if any of them is less than or equal to 50, next is executed, i.e. we skip to the next line. If the first block is passed, the 1 is reached; it evaluates to true and executes the default block: { print $0 }.
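The same test can also be written without next, collecting the verdict in a flag first (an equivalent sketch):
awk -F'[ \t,]+' '{
    ok = 1
    for (i = 2; i <= NF; i++)       # scan every numeric field
        if ($i <= 50) ok = 0        # any value at or below 50 disqualifies the line
    if (ok) print
}' infile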
