Automatically extract and organize KEGG annotation results into Excel - r

I have launched a query with amino acid sequences on "KAAS - KEGG Automatic Annotation Server".
I then downloaded the results file, called "myfile.keg". A small example file that shows what it looks like can be downloaded at: https://www.dropbox.com/s/ixf0091z5q3cx9z/myfile.keg?dl=0
+D KO
#<h2><img src="/Fig/bget/kegg3.gif" align="middle" border=0> KEGG Orthology (KO)</h2> 75prot_protdiff_GD_5h
!
A<b>Metabolism</b>
B
B <b>Carbohydrate metabolism</b>
C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D MYGENEACCESSION01; K01623 ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
C 00020 Citrate cycle (TCA cycle) [PATH:ko00020]
C 00030 Pentose phosphate pathway [PATH:ko00030]
D MYGENEACCESSION02; K01623 ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
C 00040 Pentose and glucuronate interconversions [PATH:ko00040]
C 00051 Fructose and mannose metabolism [PATH:ko00051]
D MYGENEACCESSION03; K17497 PMM; phosphomannomutase [EC:5.4.2.8]
D MYGENEACCESSION04; K01623 ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
C 00052 Galactose metabolism [PATH:ko00052]
C 00053 Ascorbate and aldarate metabolism [PATH:ko00053]
C 00500 Starch and sucrose metabolism [PATH:ko00500]
C 00520 Amino sugar and nucleotide sugar metabolism [PATH:ko00520]
D MYGENEACCESSION05; K01183 E3.2.1.14; chitinase [EC:3.2.1.14]
C 00620 Pyruvate metabolism [PATH:ko00620]
C 00630 Glyoxylate and dicarboxylate metabolism [PATH:ko00630]
C 00640 Propanoate metabolism [PATH:ko00640]
C 00650 Butanoate metabolism [PATH:ko00650]
C 00660 C5-Branched dibasic acid metabolism [PATH:ko00660]
C 00562 Inositol phosphate metabolism [PATH:ko00562]
B
!
#<hr>
#<b>[ KO | BRITE | KEGG2 | KEGG ]</b><br>
#Last updated: May 18, 2018
#<br><br>» All categories
(I open it with Notepad++)
In this file, you can see the different KEGG functional categories for each of my genes, the latter being referred to as "MYGENEACCESSION01" (or "02", "03", etc.).
I want to extract and organize all the information from this .keg file into a new file (e.g., Excel) that looks something like this: https://www.dropbox.com/s/xq4714ngesap9dx/annotation.xlsx?dl=0
CSV version here:
accession,kegg.first.level,kegg.second.level,kegg.third.level,kegg.fourth.level,path ,KO
MYGENEACCESSION01,metabolism,carbohydrate metabolism,glycolisis / Gluconeogenesis,"ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]",PATH:ko00010,K01623
MYGENEACCESSION02,metabolism,carbohydrate metabolism,Pentose phosphate pathway ,"ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]",PATH:ko00030,K01623
MYGENEACCESSION03,metabolism,carbohydrate metabolism,Fructose and mannose metabolism, PMM; phosphomannomutase [EC:5.4.2.8],PATH:ko00051,K17497
MYGENEACCESSION04,metabolism,carbohydrate metabolism,Fructose and mannose metabolism,"ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]",PATH:ko00051,K01623
MYGENEACCESSION05,metabolism,carbohydrate metabolism,Amino sugar and nucleotide sugar metabolism,chitinase [EC:3.2.1.14],PATH:ko00520,K01183
I have done it manually, but it is very tedious and my real dataset is much larger than the example provided.
Any idea how to do this automatically with R or another program? (Do you think an R script could do the job?)
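Since the question allows "another program", here is a minimal Python sketch of one way to do it. It assumes the layout seen in the example file: each line starts with a level letter (A, B, C or D), C lines end with a [PATH:...] tag, and D lines have the form "accession; KO description". The names parse_keg, to_csv and the regular expressions are made up for illustration, and the sketch keeps the text exactly as it appears in the file rather than lowercasing or trimming it.

```python
import csv
import io
import re

# Hypothetical patterns, written against the example file shown above.
TAG = re.compile(r"</?b>")
PATHWAY = re.compile(r"^\d+\s+(?P<name>.*?)\s*\[(?P<path>PATH:\w+)\]$")
GENE = re.compile(r"^(?P<acc>[^;]+);\s*(?P<ko>K\d+)\s+(?P<desc>.*)$")
FIELDS = ["accession", "kegg.first.level", "kegg.second.level",
          "kegg.third.level", "kegg.fourth.level", "path", "KO"]


def parse_keg(lines):
    """Yield one dict per D line, carrying along the A/B/C levels above it."""
    first = second = third = path = ""
    for raw in lines:
        line = raw.strip()
        if not line or line[0] not in "ABCD":
            continue  # header, footer and "!" separator lines
        body = TAG.sub("", line[1:]).strip()
        if not body:
            continue  # bare "B" spacer lines
        level = line[0]
        if level == "A":
            first = body
        elif level == "B":
            second = body
        elif level == "C":
            m = PATHWAY.match(body)
            third, path = (m["name"], m["path"]) if m else (body, "")
        else:
            m = GENE.match(body)
            if m:
                values = (m["acc"], first, second, third, m["desc"], path, m["ko"])
                yield dict(zip(FIELDS, values))


def to_csv(rows):
    """Return the rows as CSV text, which Excel can open directly."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# A few lines copied from the example file in the question.
SAMPLE = """\
+D KO
A<b>Metabolism</b>
B
B <b>Carbohydrate metabolism</b>
C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D MYGENEACCESSION01; K01623 ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
C 00020 Citrate cycle (TCA cycle) [PATH:ko00020]
C 00051 Fructose and mannose metabolism [PATH:ko00051]
D MYGENEACCESSION03; K17497 PMM; phosphomannomutase [EC:5.4.2.8]
"""

if __name__ == "__main__":
    print(to_csv(parse_keg(SAMPLE.splitlines())))
```

For a real file, the lines could come from open("myfile.keg") instead of SAMPLE, and the returned text could be written to a .csv file.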

Related

Issues with importing R Data due to formatting

I'm trying to import txt data into R; however, due to the txt file's unique formatting, I'm unsure of how to do this. I definitely feel that the issue is related to the fact that the txt file was formatted to line up columns with column names; however, as it's a text file, this was done with a variety of spaces. For example:
Gene Chromosomal Swiss-Prot MIM Description
name position AC Entry name code
______________ _______________ ______________________ ______ ______________________
A3GALT2 1p35.1 U3KPV4 A3LT2_HUMAN Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87) (Isoglobotriaosylceramide synthase) (iGb3 synthase) (iGb3S) [A3GALT2P] [IGBS3S]
AADACL3 1p36.21 Q5VUY0 ADCL3_HUMAN Arylacetamide deacetylase-like 3 (EC 3.1.1.-)
AADACL4 1p36.21 Q5VUY2 ADCL4_HUMAN Arylacetamide deacetylase-like 4 (EC 3.1.1.-)
ABCA4 1p21-p22.1 P78363 ABCA4_HUMAN 601691 Retinal-specific phospholipid-transporting ATPase ABCA4 (EC 7.6.2.1) (ATP-binding cassette sub-family A member 4) (RIM ABC transporter) (RIM protein) (RmP) (Retinal-specific ATP-binding cassette transporter) (Stargardt disease protein) [ABCR]
ABCB10 1q42 Q9NRK6 ABCBA_HUMAN 605454 ATP-binding cassette sub-family B member 10, mitochondrial precursor (ATP-binding cassette transporter
Because of this, I have not been able to import my data at all. Because the text was justified with spaces, the number of spaces between columns isn't uniform.
This is the link to the data sheet that I am using: https://www.uniprot.org/docs/humchr01.txt
Each field has a fixed width. Therefore, you can use the function read.fwf to read the file.
The following code reads the input file (assuming the file contains only the data rows, without the headers):
# widths gives the number of characters in each of the six columns
f <- read.fwf('input.txt', widths = c(14, 16, 11, 12, 7, 250), strip.white = TRUE)
colnames(f) <- c('Gene name', 'Chromosomal position', 'Swiss-Prot AC',
                 'Swiss-Prot Entry name', 'MIM code', 'Description')
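The same fixed-width idea can be sketched in Python by slicing each line at cumulative offsets. The widths below are copied from the read.fwf call above. The skip parameter is a hypothetical addition for dropping header lines, and the demo line is synthetic: it is padded to those widths, not taken from the real file.

```python
from itertools import accumulate

WIDTHS = (14, 16, 11, 12, 7, 250)  # copied from the read.fwf call above
NAMES = ("Gene name", "Chromosomal position", "Swiss-Prot AC",
         "Swiss-Prot Entry name", "MIM code", "Description")


def split_fixed(line, widths=WIDTHS):
    """Cut one line into fields at cumulative column offsets."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [line[s:e].strip() for s, e in zip(starts, ends)]


def read_fixed(lines, skip=0):
    """Parse data lines into dicts; skip is a hypothetical header-line count."""
    rows = []
    for line in list(lines)[skip:]:
        if line.strip():
            rows.append(dict(zip(NAMES, split_fixed(line.rstrip("\n")))))
    return rows


if __name__ == "__main__":
    # Synthetic line padded to WIDTHS, for illustration only.
    demo = ("A3GALT2".ljust(14) + "1p35.1".ljust(16) + "U3KPV4".ljust(11)
            + "A3LT2_HUMAN".ljust(12) + "".ljust(7)
            + "Alpha-1,3-galactosyltransferase 2")
    print(read_fixed(["header line", demo], skip=1))
```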

Syntax error when using count in loop

I am trying to run a loop where I count the total in each file under the variable _merge, and then count certain outcomes of _merge, such as _merge=1 and so on. I then want to calculate percentages by dividing each instance of _merge by the total under _merge.
Below is my code:
/*define local list*/
local ward_names B C D E FN FS GS HE
/*loop for each dbase*/
foreach file of local ward_names {
use "../../../cleaning/sra/output/`file'_ward_CTS_Merged.dta", clear
count if _merge
local ward_count=r(N)
count if _merge==1
local count_master=r(N)
count if _merge==2
local count_using=r(N)
count if _merge==3
local count_match=r(N)
clear
set obs 1
g ward_count='ward_count'
g count_master=`count_master'
g count_using=`count_using'
g count_match=`count_match'
g ward= "`file'"
save "../temp/`file'_collapsed_diagnostics.dta", replace
clear
}
The code was running fine until I tried to add the total count for each ward file:
g ward_count='ward_count'
'ward_count' invalid name
Is this a syntax error or something more severe?
A local macro reference opens with a backtick (`) and closes with an apostrophe ('), so you need ` instead of ' as the opening quote:
generate ward_count = `ward_count'
EDIT:
As per @NickCox's recommendation, you can improve your code by using the tabulate command with its matcell() option to get all the counts at once:
tabulate _merge, matcell(A)
                 _merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        master only (1) |          1       16.67       16.67
            matched (3) |          5       83.33      100.00
------------------------+-----------------------------------
                  Total |          6      100.00
matrix list A
A[2,1]
    c1
r1   1
r2   5
So you could then do the following:
generate count_master = A[1,1]
generate count_match = A[2,1]

R read.table(): some field separators and line breaks not recognized

I ran into this issue while reading this tab delimited table GO_MF.txt into R.
Here's part of my input:
YAL004W YAL004W Unknown
YAL005C SSA1 unfolded protein binding ATPase activity
YAL007C ERP2 molecular_function
YAL008W FUN14 molecular_function
YAL009W SPO7 phosphoprotein phosphatase activity
YAL012W CYS3 cystathionine gamma-lyase activity
YAL013W DEP1 molecular_function
YAL014C SYN8 SNAP receptor activity
YAL017W PSK1 protein serine/threonine kinase activity
It is perfectly fine when I open it in excel: 3606 rows and up to 11 columns.
However, when I tried to input this table into R using the following command:
no_col <- max(count.fields("GO_MF.txt", sep = "\t"), na.rm=T)
pop <- read.table("GO_MF.txt", sep = "\t",fill = TRUE, as.is=T, col.names=1:no_col)
I found there were only 2284 obs. of 11 variables. When I View(pop) in RStudio, I found a cell that doesn't seem to recognize \t and \n (see below):
my pop[261,3] is
5-flap endonuclease activity
YBR229C ROT2 alpha-glucosidase activity
YBR230C OM14 molecular_function
YBR231C SWC5 molecular_function
YBR232C YBR232C Unknown
YBR233W PBP2 mRNA binding
YBR234C ARC40 ubiquitin binding
YBR235W VHC1 ion transmembrane transporter activity
YBR236C ABD1 mRNA (guanine-N7-)-methyltransferase activity
YBR237W PRP5 RNA-dependent ATPase activity
YBR241C YBR241C substrate-specific transmembrane transporter activity
YBR242W YBR242W molecular_function
YBR243C ALG7 UDP-N-acetylglucosamine-dolichyl-phosphate N-acetylglucosaminephosphotransferase activity
YBR244W GPX2 phospholipid-hydroperoxide glutathione peroxidase activity glutathione peroxidase activity
YBR245C ISW1 nucleosome binding ATPase activity rDNA binding DNA binding
YBR246W RRT2 molecular_function
YBR248C HIS7 imidazoleglycerol-phosphate synthase activity
YBR249C ARO4 3-deoxy-7-phosphoheptulonate synthase activity
YBR250W SPO23 molecular_function
YBR251W MRPS5 structural constituent of ribosome
YBR253W SRB6 molecular_function
YBR255W MTC4 molecular_function
YBR258C SHG1 histone methyltransferase activity (H3-K4 specific)
YBR260C RGD1 Rho GTPase activator activity phosphatidylinositol-4,5-bisphosphate binding phosphatidylinositol-3,5-bisphosphate binding phosphatidylinositol-3-phosphate binding phosphatidylinositol-4-phosphate binding phosphatidylinositol-5-phosphate binding
YBR261C TAE1 N-terminal protein N-methyltransferase activity
YBR262C AIM5 molecular_function
YBR263W SHM1 glycine hydroxymethyltransferase activity
YBR264C YPT10 GTPase activity guanyl nucleotide binding
YBR266C SLM6 Unknown
YBR267W REI1 sequence-specific DNA binding
YBR269C FMP21 molecular_function
YBR270C BIT2 molecular_function
YBR271W EFM2 S-adenosylmethionine-dependent methyltransferase activity protein-lysine N-methyltransferase activity
YBR273C UBX7 molecular_function
YBR275C RIF1 telomeric DNA binding
YBR277C YBR277C Unknown
YBR278W DPB3 double-stranded DNA binding single-stranded DNA binding DNA-directed DNA polymerase activity
YBR280C SAF1 ubiquitin-protein ligase activity
YBR281C DUG2 peptidase activity gamma-glutamyltransferase activity omega peptidase activity
YBR283C SSH1 protein transmembrane transporter activity signal sequence binding
YBR284W YBR284W AMP deaminase activity molecular_function
YBR287W YBR287W molecular_function
YBR288C APM3 molecular_function
YBR290W BSD2 protein binding
YBR291C CTP1 tricarboxylate secondary active transmembrane transporter activity
YBR293W VBA2 basic amino acid transmembrane transporter activity drug transmembrane transporter activity
YBR294W SUL1 sulfate transmembrane transporter activity
YBR297W MAL33 sequence-specific DNA binding transcription factor activity
YBR298C MAL31 alpha-glucoside:hydrogen symporter activity
YBR299W MAL32 sucrose alpha-glucosidase activity alpha-glucosidase activity maltose alpha-glucosidase activity
YBR301W PAU24 molecular_function
YCL001W RER1 molecular_function
YCL002C YCL002C molecular_function
YCL005W LDB16 molecular_function
YCL008C STP22 ubiquitin binding
YCL009C ILV6 enzyme regulator activity acetolactate synthase activity
YCL010C SGF29 methylated histone residue binding RNA polymerase II transcription factor recruiting transcription factor activity
YCL016C DCC1 molecular_function
YCL023C YCL023C Unknown
YCL029C BIK1 microtubule binding protein homodimerization activity
YCL030C HIS4 phosphoribosyl-ATP diphosphatase activity phosphoribosyl-AMP cyclohydrolase activity histidinol dehydrogenase activity
YCL033C MXR2 peptide-methionine (R)-S-oxide reductase activity
YCL035C GRX1 disulfide oxidoreductase activity glutathione transferase activity glutathione peroxidase activity
YCL037C SRO9 RNA binding
YCL042W YCL042W molecular_function
YCL044C MGR1 misfolded protein binding
YCL046W YCL046W Unknown
YCL047C POF1 ATPase activity
YCL048W SPS22 molecular_function
YCL049C YCL049C molecular_function
YCL050C APA1 bis(5-nucleosyl)-tetraphosphatase activity
Any thoughts?
Thanks in advance!
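One possible explanation, which the post does not confirm, is an unmatched quote character inside a field. By default read.table treats both " and ' as quote characters, so a lone apostrophe can make it read the following tabs and newlines as part of one quoted string. In R, passing quote = "" turns that behaviour off. The Python sketch below shows the same idea with the csv module: quoting is disabled and short rows are padded to the widest row. The function name and the sample rows are invented for illustration.

```python
import csv
import io


def read_ragged_tsv(handle):
    """Read tab-separated rows with quoting disabled, padding short rows."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


if __name__ == "__main__":
    # Invented sample: the first row contains a lone apostrophe.
    text = ("GENE1\tNAME1\t5'-flap endonuclease activity\n"
            "GENE2\tNAME2\talpha-glucosidase activity\n"
            "GENE3\tNAME3\tATPase activity\tDNA binding\n")
    for row in read_ragged_tsv(io.StringIO(text, newline="")):
        print(row)
```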

Can two Unix processes simultaneously write to different positions in a single file?

This is an unresolved exam question of mine.
Can two Unix processes simultaneously write to different positions
in a single file?
1. Yes, the two processes will have their own file table entries
2. No, the shared i-node contains a single offset pointer
3. Only one process will have write privilege
4. Yes, but only if we operate using NFS
No file offset is recorded in an inode, so answer 2 is incorrect.
There is no documented reason for a process to have its access rights modified, so answer 3 is incorrect.
NFS allows simultaneous access by processes on different hosts, but the question here is about processes on the same host, so NFS shouldn't make a difference.
Here is a shell script demonstrating that the remaining answer, 1, is correct:
# create a 10m file
dd if=/dev/zero of=/var/tmp/file bs=1024k count=10
# create two 1 MB files
cd /tmp
printf "aaaaaaaa" > aa
printf "bbbbbbbb" > bb
i=0
while [ $i -lt 17 ]; do
cat aa aa > aa.new && mv aa.new aa
cat bb bb > bb.new && mv bb.new bb
i=$((i+1))
done
ls -lG /var/tmp/file /tmp/aa /tmp/bb
# launch 10 processes that will write at different locations in the same file.
# Uses dd notrunc option for the file not to be truncated
# Uses GNU dd fdatasync option for unbuffered writes
i=0
while [ $i -lt 5 ]; do
(
dd if=/tmp/aa of=/var/tmp/file conv=notrunc,fdatasync bs=1024k count=1 seek=$((i*2)) 2>/dev/null &
dd if=/tmp/bb of=/var/tmp/file conv=notrunc,fdatasync bs=1024k count=1 seek=$((i*2+1)) 2>/dev/null &
) &
i=$((i+1))
done
# Check concurrency
printf "\n%d processes are currently writing to /var/tmp/file\n" "$(fuser /var/tmp/file 2>/dev/null | wc -w)"
# Wait for write completion and check file contents
wait
printf "/var/tmp/file contains:\n"
od -c /var/tmp/file
Its output shows ten processes successfully and simultaneously writing to the very same file:
-rw-r--r-- 1 jlliagre 1048576 oct. 30 08:25 /tmp/aa
-rw-r--r-- 1 jlliagre 1048576 oct. 30 08:25 /tmp/bb
-rw-r--r-- 1 jlliagre 10485760 oct. 30 08:25 /var/tmp/file
10 processes are currently writing to /var/tmp/file
/var/tmp/file contains:
0000000 a a a a a a a a a a a a a a a a
*
4000000 b b b b b b b b b b b b b b b b
*
10000000 a a a a a a a a a a a a a a a a
*
14000000 b b b b b b b b b b b b b b b b
*
20000000 a a a a a a a a a a a a a a a a
*
24000000 b b b b b b b b b b b b b b b b
*
30000000 a a a a a a a a a a a a a a a a
*
34000000 b b b b b b b b b b b b b b b b
*
40000000 a a a a a a a a a a a a a a a a
*
44000000 b b b b b b b b b b b b b b b b
*
50000000
Explanation:
Yes, the two processes will have their own file table entries.
If the file is opened twice with open(), two file descriptors are created.
Each descriptor refers to its own file table entry, with separate file status flags and a separate file offset.
Here both descriptors are opened for writing, and both initially point to the first byte of the file.
This is easy to test by seeking each descriptor to a different position and then writing through it.
The content of file.txt
My name is Chandru. This is an empty file.
Coding for testing:
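#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>   /* lseek, write, close */

int main(void)
{
    int fd1, fd2;

    /* Open the same file twice: each open() gets its own file offset. */
    if ((fd1 = open("file.txt", O_WRONLY)) < 0) {
        perror("open");
        exit(EXIT_FAILURE);
    }
    if ((fd2 = open("file.txt", O_WRONLY)) < 0) {
        perror("open");
        exit(EXIT_FAILURE);
    }

    /* Write "testing" at offset 20 through the first descriptor. */
    if (lseek(fd1, 20, SEEK_SET) != 20) {
        perror("lseek");
        exit(EXIT_FAILURE);
    }
    if (write(fd1, "testing", 7) != 7) {
        perror("write");
        exit(EXIT_FAILURE);
    }

    /* Write "condition" at offset 10 through the second descriptor. */
    if (lseek(fd2, 10, SEEK_SET) != 10) {
        perror("lseek");
        exit(EXIT_FAILURE);
    }
    if (write(fd2, "condition", 9) != 9) {
        perror("write");
        exit(EXIT_FAILURE);
    }

    close(fd1);
    close(fd2);
    return 0;
}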
Output:
After that my output is
My name isconditionitesting empty file.
Yes, of course they can, with the following caveats:
Depending on the open() mode, one process can easily wipe the file contents.
Depending on scheduling, the order of write operations is not deterministic.
There is (in general) no mandatory locking, so careful design calls for advisory locking.
If they write to the same area using buffered I/O, the results can be non-deterministic.
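The point about separate file table entries can also be sketched with two real processes. The Python example below is a minimal illustration, assuming a Unix system where os.fork is available. Each child opens the file itself, seeks to its own offset and writes there. The helper names and the 16-byte scratch file are invented for the demo.

```python
import os
import tempfile


def write_at(path, offset, data):
    """Open the file independently, seek to offset, and write data there."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, data)
    finally:
        os.close(fd)


def demo():
    """Fork two children that write to different positions of one file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"." * 16)  # 16-byte scratch file filled with dots
        path = tmp.name
    pids = []
    for offset, data in ((0, b"aaaa"), (8, b"bbbb")):
        pid = os.fork()
        if pid == 0:
            # Child process: it has its own descriptor and its own offset.
            write_at(path, offset, data)
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)  # wait for both writers to finish
    with open(path, "rb") as handle:
        result = handle.read()
    os.unlink(path)
    return result


if __name__ == "__main__":
    print(demo())
```

Because the two writes touch disjoint byte ranges, the final contents do not depend on which child runs first.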

What is the best way to parse this flat text format in R?

1. ZFP112
Official Symbol: ZFP112 and Name: zinc finger protein 112 homolog (mouse)[Homo sapiens]
Other Aliases: ZNF112, ZNF228
Other Designations: zfp-112; zinc finger protein 112; zinc finger protein 228
Chromosome: 19; Location: 19q13.2
Annotation: Chromosome 19NC_000019.9 (44830706..44860856, complement)
ID: 7771
2. SEP15
15 kDa selenoprotein[Homo sapiens]
Chromosome: 1; Location: 1p31
Annotation: Chromosome 1NC_000001.10 (87328128..87380107, complement)
MIM: 606254
ID: 9403
3. MLL4
myeloid/lymphoid or mixed-lineage leukemia 4[Homo sapiens]
Other Aliases: HRX2, KMT2B, MLL2, TRX2, WBP7
Other Designations: KMT2D; WBP-7; WW domain binding protein 7; WW domain-binding protein 7; histone-lysine N-methyltransferase MLL4; lysine N-methyltransferase 2B; lysine N-methyltransferase 2D; mixed lineage leukemia gene homolog 2; myeloid/lymphoid or mixed-lineage leukemia protein 4; trithorax homolog 2; trithorax homologue 2
Chromosome: 19; Location: 19q13.1
Annotation: Chromosome 19NC_000019.9 (36208921..36229779)
MIM: 606834
ID: 9757
37. LOC100509547
hypothetical protein LOC100509547[Homo sapiens]
This record was discontinued.
ID: 100509547
43. LOC100509587
hypothetical protein LOC100509587[Homo sapiens]
Chromosome: 6
This record was replaced with GeneID: 100506601
ID: 100509587
I want to get the gene name (ZFP112, SEP15, MLL4), the Location field (if present), the ID field, and skip the other stuff. All the string utilities like scan() seem geared toward more regular data. The blank line between records is effectively the record separator. I can write this to disk and read it back in with readLines() but I'd prefer to do it from memory since I downloaded it over HTTP.
Read the data in from "myfile.dat", say (or start from L below if you have already read it in as separate lines). Extract the lines that begin with digits followed by a dot and a space, that contain Location: or that start with ID:. Then remove everything in those lines up to and including the last space. Create a grouping vector g that identifies the group to which each component of v2 belongs. (This uses the fact that the first field of each group starts with a non-digit and the other fields start with a digit.) Then split v2 into those groups. Expand short components of s by inserting an NA, assuming that if a component is short then Location: is missing. (We assume the first field and the ID field cannot be missing.) Finally, transpose the result so that the fields are in columns and the cases in rows.
L <- readLines("myfile.dat")
v <- grep("^\\d+\\. |Location: |^ID: ", L, value = TRUE)
v2 <- sub(".* ", "", v)
g <- cumsum(regexpr("^\\D", v2) > 0)
s <- split(v2, g)
m <- sapply(s, function(x) if (length(x) == 2) c(x[[1]], NA, x[[2]]) else x)
t(m)
Using the sample data in the post we get this from the last line:
  [,1]           [,2]      [,3]
1 "ZFP112"       "19q13.2" "7771"
2 "SEP15"        "1p31"    "9403"
3 "MLL4"         "19q13.1" "9757"
4 "LOC100509547" NA        "100509547"
5 "LOC100509587" NA        "100509587"
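For comparison, the same extraction can be sketched in Python directly on an in-memory string, which matches the wish to avoid writing the download to disk. This is an illustration rather than a translation of the R code above: it starts a new record at each numbered line instead of grouping by leading character, and the names parse_records, HEADER, LOCATION and ID_LINE are made up.

```python
import re

HEADER = re.compile(r"^\d+\. (\S+)")      # e.g. "1. ZFP112"
LOCATION = re.compile(r"Location: (\S+)")  # e.g. "...; Location: 19q13.2"
ID_LINE = re.compile(r"^ID: (\d+)")       # e.g. "ID: 7771"


def parse_records(text):
    """Return (gene, location, id) tuples; location is None when absent."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER.match(line)
        if m:
            # A numbered line starts a new record.
            current = {"gene": m.group(1), "location": None, "id": None}
            records.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first record
        m = LOCATION.search(line)
        if m:
            current["location"] = m.group(1)
        m = ID_LINE.match(line)
        if m:
            current["id"] = m.group(1)
    return [(r["gene"], r["location"], r["id"]) for r in records]


if __name__ == "__main__":
    # Two records copied from the sample in the question.
    sample = """1. ZFP112
Official Symbol: ZFP112 and Name: zinc finger protein 112 homolog (mouse)[Homo sapiens]
Chromosome: 19; Location: 19q13.2
ID: 7771

43. LOC100509587
hypothetical protein LOC100509587[Homo sapiens]
Chromosome: 6
This record was replaced with GeneID: 100506601
ID: 100509587
"""
    for record in parse_records(sample):
        print(record)
```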
