How do I extract a specific word after a keyword in R?
I have the following input text, which contains details about an insurance policy. I need to extract the values of specific fields such as FirstName, Surname, Father's Name and Date of Birth.
input.txt
In Case of unit linked plan, Investment risk in Investment Portfolio is borne by the policyholder.
ly
c I ROPOSAL FORM z
Insurance
Proposal Form Number: 342525 PF 42242
Advisor Coe aranch Code 2
Ff roanumber =F SSOS™S™~™S~S rancid ate = |
IBR. Code S535353424
re GFN ——
INSTRUCTION FOR FILLING THES APPLICATION FORM ; 1. Compiets the proocsal form in CAPITAL LETTERS using = Black Ball Point P]n. 2. Sless= mark your selection by marking “X" insides the
Boe. 3. Slnsse bases 2 Blank soece after eect word, letter or initial 4. Slssse write "MA" for questions whic are not apolicatie. 5.00 NOT USE the Sor") to identify your initial or seperate the sddressiiine.
6. Sulmissson of age proof ie mandatory along wall Ge propel fonm.
IMPORTANT INSTRUCTIONS WITH REGARD TO DISCLOSURE OF INFORMATION: Inturance it a contract of UTMOST GOOD FAITH and itis required by disclose all material and nelevant
fach: complebehy, DO) NOT suppress any fac: in response by the questions in the priposal form. FAILURE TO PROVIDE COMPLETE AND ACCURATE INFORMATION OR
MISREPRESENTATION OF THE FACTS COULD DECLARE THES POLICY CONTRACT NULL AND VOID AFTER PAYMENT OF SURRENDER VALUE, IF ANY, SUBJECT TO SECTION 45 OF
INSURANCE ACT, 1998 As AMENDED FROM TIME TO TIME,
Section I - Details of the Life to be Assured
1. Tite E-] Mr. LJ Mrs. LJ Miss [J Or. LJ Others (Specify)
2. FirstName PETER PAUL
3. Surname T
44. Father's Name
46, Mother's Name ERIKA RESWE D
5. Date of Birth 13/02/1990 6, Gender E] Male ] Female
7. Age Proof L] School Certificate [] Driving License [] Passport {Birth Certificate E"] PAN Card
3, Marital Status D) Single EF] Married 0 Widower) 0 Civorcee
9, Spouse Name ERISEWQ FR
10. Maiden Name
iL. Nationality -] Resident Indian National [J Non Resident Indian (MRI) L] Others (Specify)
12, Education J Postgraduate / Doctorate Ee) Graduate [] 12thstd. Pass [J 10thstd. Pass [J Below 10th std.
OO Dliterate / Uneducated CJ Others (Specify)
13. Address For No 7¥%a vaigai street Flower
Communication Nagar selaiyur
Landmark
City Salem
Pin Code BO00 73: State TAMIL NADU
Address proof [] Passport ([] Driving License [] Voter ID [] Bank Statement [] Utility Bill G4 Others (Specify) Aadhaar Card
14, Permanent No 7¥a vaigai street Flower
Address :
Nagar selaiyur
Landmark
City Salem
Pin Code 5353535 state (TAMIL NADU
Address proof CJ] Passport [9 DrivingLicense [J Voter ID [ Bank Statement [ Utility Bill B] Others (Specify) Aadhaar Card
15. Contact Details Mobile 424242424 Phone (Home)
Office / Business
E-mail fdgrgtr13#yahoo.com
Preferred mode: ((] Letter EF) E-Mail
Preferred Language for Letter {other than English): [] Hindi [] Kannada [-] Tamil J Telugu C] Malayalam C) Gujarati
Bengali GOriya =D] Marathi
16. Occupation CL] Salaried-Govt /PSU ( Salaried-other [9 Self Employed Professional [J Aagriculturist {Farmer [Part Time Business
LJ Retired ] Landlord J Student (current Std) -] Others (Specify) Salaried - MNC
17. Full Name of the Capio software
Employers Businnes/
School/College
18, Designation & Exact nature of Work / Business Manager
19. AnnualIncomein 1,200,000.00 20. Annual Income of Husband / Father = 1,500,000.00
Figures (%) (for female and minor lives)
21. Exact nature of work / business of Husband / Father for female and minor lives Government Employee
Page 10fé
The code below works for me, but if the line order changes, every extracted value is wrong. Is there a way to extract the value that follows a keyword irrespective of line order?
Current Code
path <- getwd()
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
fName <- sub('.*FirstName', '', my_txt[7])
SName <- sub('.*Surname', '', my_txt[8])
FatherNm <- sub(".*Father's Name", '', my_txt[9])
dob <- sub("6, Gender.*", '',sub(".*Date of Birth", '', my_txt[11]))
You can collapse the text into a single string and extract the values with patterns anchored on the keywords. This works regardless of which line a keyword appears on, provided every file uses the same keyword labels and the same layout around them.
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
#Collapse data in one string
text <- paste0(my_txt, collapse = '\n')
#Extract text after FirstName till '\n'
fName <- sub('.*FirstName (.*?)\n.*', '\\1', text)
fName
#[1] "John Woo"
#Extract text after Surname till '\n'
SName <- sub('.*Surname (.*?)\n.*', '\\1', text)
SName
#[1] "T"
#Extract text after Father's Name till '\n'
FatherNm <- sub(".*Father's Name (.*?)\n.*", '\\1', text)
FatherNm
#[1] "Bill Woo"
#Extract numbers which come after Date of Birth.
dob <- sub(".*Date of Birth (\\d+/\\d+/\\d+).*", '\\1', text)
dob
#[1] "13/07/1970"
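One caveat with the approach above: sub() returns its input unchanged when the pattern does not match, so a blank field would yield the whole text rather than an empty value. A minimal line-wise sketch that returns NA in that case is shown below. The helper name extract_after and the inline sample lines are illustrative assumptions, not part of the original code.

```r
# Hypothetical helper: return the text that follows `label` on the first
# line containing it, or NA if the label is absent or has no value.
extract_after <- function(lines, label) {
  hit <- grep(label, lines, fixed = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  pos <- as.integer(regexpr(label, hit[1], fixed = TRUE))
  value <- trimws(substring(hit[1], pos + nchar(label)))
  if (nzchar(value)) value else NA_character_
}

# A few lines copied from the sample input; their order does not matter.
lines <- c(
  "5. Date of Birth 13/02/1990 6, Gender E] Male ] Female",
  "44. Father's Name",
  "2. FirstName PETER PAUL",
  "3. Surname T"
)

fName    <- extract_after(lines, "FirstName")      # "PETER PAUL"
SName    <- extract_after(lines, "Surname")        # "T"
FatherNm <- extract_after(lines, "Father's Name")  # NA, the field is blank

# Keep only the date from the rest of the Date of Birth line.
dob_line <- extract_after(lines, "Date of Birth")
dob <- regmatches(dob_line, regexpr("[0-9]{2}/[0-9]{2}/[0-9]{4}", dob_line))
```

With a real file, lines would come from readLines() as in the question.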
I'm new to extracting data from XML files. I'm trying to process the following XML file using the R XML package. The information I want is in the attribute values.
I have run into two difficulties:
1. Some attributes exist in one node but not in another. For example, "DRP" carries information for the second individual but not for the first.
2. Some attributes have multiple values for one individual, and I don't know how to link them to that individual. For example, "EmpHs" has multiple records per individual (identified by indvlPK).
Ideally, I want the output data to have a structure similar to the following:

| lastNm | firstNm | indvlPK | fromDt  | orgNm                                  | hasCustComp |
|--------|---------|---------|---------|----------------------------------------|-------------|
| GIGAX  | JEFFREY | 2783477 | 03/2004 | GATEWAY FINANCIAL ADVISORS, INC        |             |
| GIGAX  | JEFFREY | 2783477 | 03/2004 | GFA INC                                |             |
| GIGAX  | JEFFREY | 2783477 | 01/2007 | UNITED FIRST                           |             |
| HINSON | BRIAN   | 2783737 | 07/1996 | LINCOLN FINANCIAL ADVISORS CORPORATION | Y           |
| HINSON | BRIAN   | 2783737 | 07/1996 | FIRST FINANCIAL GROUP                  | Y           |
Is there any way I can parse the data correctly? Thanks!
Here is the code I tried, which did not give me what I want:
library(XML)
doc <- "Test.xml"
ind <- xmlParse(doc)
xmltop <- xmlRoot(ind)
# XPath selects attributes with "@"; each query returns a flat vector,
# so the results are not linked to the individual they came from.
temp1 <- data.frame(unlist(getNodeSet(xmltop, "//Info/@lastNm")))
temp2 <- data.frame(unlist(getNodeSet(xmltop, "//Info/@firstNm")))
temp3 <- data.frame(unlist(getNodeSet(xmltop, "//Info/@indvlPK")))
temp4 <- data.frame(unlist(getNodeSet(xmltop, "//EmpHs/@fromDt")))
temp5 <- data.frame(unlist(getNodeSet(xmltop, "//DRP/@hasCustComp")))
The data is here:
<?xml version="1.0" encoding="ISO-8859-1"?>
<IAPDIndividualReport GenOn="2021-03-29">
<Indvls>
<Indvl>
<Info lastNm="GIGAX" firstNm="JEFFREY" midNm="W" indvlPK="2783477" actvAGReg="Y" link="https://adviserinfo.sec.gov/individual/summary/2783477"/>
<OthrNms/>
<CrntEmps>
<CrntEmp orgNm="CAMBRIDGE INVESTMENT RESEARCH ADVISORS, INC." orgPK="134139" str1="1776 PLEASANT PLAIN RD." city="FAIRFIELD" state="IA" cntry="United States" postlCd="52556-8757">
<CrntRgstns>
<CrntRgstn regAuth="MO" regCat="RA" st="APPROVED" stDt="2010-09-09"/>
</CrntRgstns>
<BrnchOfLocs>
<BrnchOfLoc city="O'FALLON" state="MO" cntry="United States"/>
</BrnchOfLocs>
</CrntEmp>
</CrntEmps>
<Exms>
<Exm exmCd="S63" exmNm="Uniform Securities Agent State Law Examination" exmDt="1996-08-20"/>
<Exm exmCd="S65" exmNm="Uniform Investment Adviser Law Examination" exmDt="1999-12-21"/>
</Exms>
<Dsgntns/>
<PrevRgstns>
<PrevRgstn orgNm="WOODBURY FINANCIAL SERVICES, INC." orgPK="421" regBeginDt="2009-01-05" regEndDt="2009-12-03">
<BrnchOfLocs>
<BrnchOfLoc city="OFALLON" state="MO"/>
<BrnchOfLoc city="OFALLON" state="MO"/>
<BrnchOfLoc city="DUBLIN" state="CA"/>
</BrnchOfLocs>
</PrevRgstn>
<PrevRgstn orgNm="FSC SECURITIES CORPORATION" orgPK="7461" regBeginDt="2004-10-29" regEndDt="2008-12-01">
<BrnchOfLocs>
<BrnchOfLoc city="O'FALLON" state="MO"/>
<BrnchOfLoc city="ST. PETERS" state="MO"/>
</BrnchOfLocs>
</PrevRgstn>
<PrevRgstn orgNm="GATEWAY FINANCIAL ADVISORS, INC." orgPK="115025" regBeginDt="2004-11-11" regEndDt="2006-10-11">
<BrnchOfLocs>
<BrnchOfLoc city="ST. PETERS" state="MO"/>
</BrnchOfLocs>
</PrevRgstn>
</PrevRgstns>
<EmpHss>
<EmpHs fromDt="03/2004" orgNm="GATEWAY FINANCIAL ADVISORS, INC" city="OFALLON" state="MO"/>
<EmpHs fromDt="03/2004" orgNm="GFA INC" city="OFALLON" state="MO"/>
<EmpHs fromDt="01/2007" orgNm="UNITED FIRST" city="OFALLON" state="MO"/>
<EmpHs fromDt="09/2010" orgNm="CAMBRIDGE INVESTMENT RESEARCH ADVISORS, INC" city="FAIRFIELD" state="IA"/>
<EmpHs fromDt="09/2010" orgNm="CAMBRIDGE INVESTMENT RESEARCH, INC" city="FAIRFIELD" state="IA"/>
</EmpHss>
<OthrBuss>
<OthrBus desc="1)STONEBRIDGE WEALTH MANAGEMENT GROUP, 728 HAWK RUN DR, O'FALLON, MO, 3/2008 AS INDEPENDENT INSURANCE AGENT FOR VARIOUS INDEPENDENT INSURANCE COMPANIES. INV REL - 40/MO - 20/TRADING. 2)UNITED FIRST FINANCIAL MORTGAGE SOFTWARE SALES. START 6/1/07, 10 HOURS PER MONTH, 5 DURING TRADING HOURS. NO OWNERSHIP INTEREST. 3)MORTGAGE STOP INC., 728 HAWK RUN DR., OFALLON, MO 63368. LOAN OFFICER PROCESSING LOAN APPS FOR CLIENTS. START 6/1/2002, 25 HOURS PER MONTH, 10 DURING TRADING HOURS. NO OWNERSHIP. 4)CIRA, 1776 PLEASANT PLAIN RD, FAIRFIELD, IA, AS ADVISORY REP OF A RIA. INV REL - 40 HR/WK - 40/TRADING. SEE EMPLOYMENT HISTORY FOR START DATE. 5) THE MORTGAGE SHOP, 355 MID RIVERS MALL DRIVE, STE E, ST. PETERS, MO 63376. MORTGAGE ORIGINATOR SINCE 01/01/99. NOT INVESTMENT RELATED. WORKS 60 HOURS PER MONTH, 20 OF WHICH ARE DURING TRADING HOURS. 6.365 PROPERTIES LLC, O'FALLON, MO, 8/2018 AS OWNER OF LLC THAT BUYS, SELLS, & HOLDS REAL ESTATE. NIR - 20/MO - 0/TRADING. 7. BEST OFFER HOMES, LLC, 728 HAWK RUN DRIVE, O'FALLON, MO, REAL ESTATE SALES/MORTGAGE ORIGINATION/ ACCOUNTING/FINANCIAL ACTIVITIES, 06/16/20, NIR, 20/MO- 0/TRADING 8. GIGAX WEALTH MANAGEMENT, 728 HAWK RUN DRIVE, OFALLON, MO, INDEPENDENT INSURANCE AGENT FOR VARIOUS INDEPENDENT INSURANCE COMPANIES,11/23/20, INV REL, 10 HR/WK- 10 TRADING HR."/>
</OthrBuss>
<DRPs/>
</Indvl>
<Indvl>
<Info lastNm="HINSON" firstNm="BRIAN" midNm="TROY" indvlPK="2783737" actvAGReg="Y" link="https://adviserinfo.sec.gov/individual/summary/2783737"/>
<OthrNms/>
<CrntEmps>
<CrntEmp orgNm="BRIDGEWORTH WEALTH MANAGEMENT" orgPK="164100" str1="101 25TH STREET NORTH" city="BIRMINGHAM" state="AL" cntry="United States" postlCd="35203">
<CrntRgstns>
<CrntRgstn regAuth="AL" regCat="RA" st="APPROVED" stDt="2015-05-12"/>
<CrntRgstn regAuth="TX" regCat="RA" st="APPROVED_RES" stDt="2015-05-01"/>
</CrntRgstns>
<BrnchOfLocs>
<BrnchOfLoc str1="400 MERIDIAN STREET" str2="SUITE 200" city="HUNTSVILLE" state="AL" cntry="United States" postlCd="35801"/>
<BrnchOfLoc str1="101 25TH STREET NORTH" city="BIRMINGHAM" state="AL" cntry="United States" postlCd="35203"/>
</BrnchOfLocs>
</CrntEmp>
</CrntEmps>
<Exms>
<Exm exmCd="S63" exmNm="Uniform Securities Agent State Law Examination" exmDt="1996-10-11"/>
</Exms>
<Dsgntns>
<Dsgntn dsgntnNm="Certified Financial Planner"/>
<Dsgntn dsgntnNm="Chartered Financial Consultant"/>
<Dsgntn dsgntnNm="Personal Financial Specialist"/>
</Dsgntns>
<PrevRgstns>
<PrevRgstn orgNm="LINCOLN FINANCIAL ADVISORS CORPORATION" orgPK="3978" regBeginDt="2000-04-25" regEndDt="2015-05-11">
<BrnchOfLocs>
<BrnchOfLoc city="HUNTSVILLE" state="AL"/>
<BrnchOfLoc city="HUNTSVILLE" state="AL"/>
</BrnchOfLocs>
</PrevRgstn>
</PrevRgstns>
<EmpHss>
<EmpHs fromDt="04/2015" orgNm="BRIDGEWORTH, LLC" city="HUNTSVILLE" state="AL"/>
<EmpHs fromDt="07/1996" toDt="04/2015" orgNm="LINCOLN FINANCIAL ADVISORS CORPORATION" city="HUNTSVILLE" state="AL"/>
<EmpHs fromDt="07/1996" toDt="04/2015" orgNm="FIRST FINANCIAL GROUP" city="BIRMINGHAM" state="AL"/>
<EmpHs fromDt="04/2015" orgNm="LPL FINANCIAL LLC" city="HUNTSVILLE" state="AL"/>
</EmpHss>
<OthrBuss>
<OthrBus desc="1) 04/30/2015: BRIDGEWORTH FINANCIAL, LLC - DBA FOR LPL BUSINESS (ENTITY FOR LPL BUSINESS) - INV REL - AT REPORTED BUSINESS LOCATIONS - START 01/01/2015 - 1% OF TIME SPENT 2) 04/30/2015: BRIDGEWORTH, LLC - INV REL - AT REPORTED BUSINESS LOCATION(S) - REGISTERED INVESTMENT ADVISOR HYBRID - START 01/2015 - 99% OF TIME SPENT. 3) 5/11/2015: NO BUSINESS NAME - INVESTMENT RELATED - AT REPORTED BUSINESS LOCATION(S) - NON-VARIABLE INSURANCE - STARTED 4/1/2015 - TIME SPENT 1% - LINES OF INSURANCE INCLUDE TERM, WHOLE, UNIVERSAL, LTC, DISABILITY. 4) 6/2/2017 - Bridgeworth Financial - Investment Related - At Reported Business Location(s) - DBA for LPL Business (entity for LPL business) - Started 04/30/2015 - 5 Hours Per Month/3 Hours During Securities Trading. 5) 5/8/2018 - Foster Properties Ltd - Not Investment Related - Home Based - Other-Family Business - Started 12/22/1997 - 1 Hours Per Month/0 Hours During Securities Trading - Handle the majority of business matters for this family business."/>
</OthrBuss>
<DRPs>
<DRP hasRegAction="N" hasCriminal="N" hasBankrupt="N" hasCivilJudc="N" hasBond="N" hasJudgment="N" hasInvstgn="N" hasCustComp="Y" hasTermination="N"/>
</DRPs>
</Indvl>
</Indvls>
</IAPDIndividualReport>
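One way to keep each EmpHs record linked to its individual is to loop over the Indvl nodes and query Info, EmpHs and DRP relative to each node, filling in NA where no DRP exists. The sketch below uses the XML package from the question and runs on a cut-down copy of the data. The names xml_text, rows and result are illustrative, and it assumes every individual has at least one EmpHs entry.

```r
library(XML)

# Cut-down copy of the data above so the sketch is self-contained.
xml_text <- '<IAPDIndividualReport>
<Indvls>
<Indvl>
<Info lastNm="GIGAX" firstNm="JEFFREY" indvlPK="2783477"/>
<EmpHss>
<EmpHs fromDt="03/2004" orgNm="GFA INC"/>
<EmpHs fromDt="01/2007" orgNm="UNITED FIRST"/>
</EmpHss>
<DRPs/>
</Indvl>
<Indvl>
<Info lastNm="HINSON" firstNm="BRIAN" indvlPK="2783737"/>
<EmpHss>
<EmpHs fromDt="07/1996" orgNm="FIRST FINANCIAL GROUP"/>
</EmpHss>
<DRPs>
<DRP hasCustComp="Y"/>
</DRPs>
</Indvl>
</Indvls>
</IAPDIndividualReport>'

doc <- xmlParse(xml_text, asText = TRUE)

# Build one block of rows per <Indvl>, so that every EmpHs row inherits
# the Info attributes of the individual it belongs to.
rows <- lapply(getNodeSet(doc, "//Indvl"), function(indvl) {
  info <- xmlAttrs(getNodeSet(indvl, "./Info")[[1]])
  emp  <- getNodeSet(indvl, "./EmpHss/EmpHs")
  # Empty when the individual has no <DRP> child.
  comp <- xpathSApply(indvl, "./DRPs/DRP", xmlGetAttr,
                      "hasCustComp", NA_character_)
  data.frame(
    lastNm      = info[["lastNm"]],
    firstNm     = info[["firstNm"]],
    indvlPK     = info[["indvlPK"]],
    fromDt      = vapply(emp, xmlGetAttr, character(1), "fromDt", NA_character_),
    orgNm       = vapply(emp, xmlGetAttr, character(1), "orgNm", NA_character_),
    hasCustComp = if (length(comp) > 0) comp[[1]] else NA_character_,
    stringsAsFactors = FALSE
  )
})
result <- do.call(rbind, rows)
print(result)
```

This returns one row per EmpHs record, so on the full file it produces every employment entry rather than only the subset shown in the desired output. Filter the rows afterwards if only some are needed.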
Hi, I've just downloaded an XML file for the 5.8S region in Aedes aegypti from NCBI Nucleotide. As an example, I have pasted below the info I get for the first sample.
From here I wish to extract
1. <INSDSeq_accession-version>CH477247.1</INSDSeq_accession-version>
2. <INSDSeq_update-date>23-MAR-2015</INSDSeq_update-date>
3. <INSDSeq_create-date>28-OCT-2005</INSDSeq_create-date>
4. <INSDReference_journal>Submitted (07-OCT-2005) Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA </INSDReference_journal>
Also, as I said, this is only a short excerpt of what I actually downloaded (13 samples, from https://www.ncbi.nlm.nih.gov/nuccore/?term=aedes+aegypti+5.8). Is it possible to extract this info for all the samples?
I'm familiar with R, but which platform is better suited to this?
<INSDSeq_locus>CH477247</INSDSeq_locus>
<INSDSeq_length>3065330</INSDSeq_length>
<INSDSeq_strandedness>double</INSDSeq_strandedness>
<INSDSeq_moltype>DNA</INSDSeq_moltype>
<INSDSeq_topology>linear</INSDSeq_topology>
<INSDSeq_division>CON</INSDSeq_division>
<INSDSeq_update-date>23-MAR-2015</INSDSeq_update-date>
<INSDSeq_create-date>28-OCT-2005</INSDSeq_create-date>
<INSDSeq_definition>Aedes aegypti strain Liverpool supercont1.62 genomic scaffold, whole genome shotgun sequence</INSDSeq_definition>
<INSDSeq_primary-accession>CH477247</INSDSeq_primary-accession>
<INSDSeq_accession-version>CH477247.1</INSDSeq_accession-version>
<INSDSeq_other-seqids>
<INSDSeqid>gnl|WGS:AAGE|supercont1.62</INSDSeqid>
<INSDSeqid>gb|CH477247.1|</INSDSeqid>
<INSDSeqid>gi|78216626</INSDSeqid>
</INSDSeq_other-seqids>
<INSDSeq_project>PRJNA12434</INSDSeq_project>
<INSDSeq_keywords>
<INSDKeyword>WGS</INSDKeyword>
</INSDSeq_keywords>
<INSDSeq_source>Aedes aegypti (yellow fever mosquito)</INSDSeq_source>
<INSDSeq_organism>Aedes aegypti</INSDSeq_organism>
<INSDSeq_taxonomy>Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota; Neoptera; Holometabola; Diptera; Nematocera; Culicoidea; Culicidae; Culicinae; Aedini; Aedes; Stegomyia</INSDSeq_taxonomy>
<INSDSeq_references>
<INSDReference>
<INSDReference_reference>1</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Nene,V.</INSDAuthor>
<INSDAuthor>Wortman,J.R.</INSDAuthor>
<INSDAuthor>Lawson,D.</INSDAuthor>
<INSDAuthor>Haas,B.</INSDAuthor>
<INSDAuthor>Kodira,C.</INSDAuthor>
<INSDAuthor>Tu,Z.J.</INSDAuthor>
<INSDAuthor>Loftus,B.</INSDAuthor>
<INSDAuthor>Xi,Z.</INSDAuthor>
<INSDAuthor>Megy,K.</INSDAuthor>
<INSDAuthor>Grabherr,M.</INSDAuthor>
<INSDAuthor>Ren,Q.</INSDAuthor>
<INSDAuthor>Zdobnov,E.M.</INSDAuthor>
<INSDAuthor>Lobo,N.F.</INSDAuthor>
<INSDAuthor>Campbell,K.S.</INSDAuthor>
<INSDAuthor>Brown,S.E.</INSDAuthor>
<INSDAuthor>Bonaldo,M.F.</INSDAuthor>
<INSDAuthor>Zhu,J.</INSDAuthor>
<INSDAuthor>Sinkins,S.P.</INSDAuthor>
<INSDAuthor>Hogenkamp,D.G.</INSDAuthor>
<INSDAuthor>Amedeo,P.</INSDAuthor>
<INSDAuthor>Arensburger,P.</INSDAuthor>
<INSDAuthor>Atkinson,P.W.</INSDAuthor>
<INSDAuthor>Bidwell,S.</INSDAuthor>
<INSDAuthor>Biedler,J.</INSDAuthor>
<INSDAuthor>Birney,E.</INSDAuthor>
<INSDAuthor>Bruggner,R.V.</INSDAuthor>
<INSDAuthor>Costas,J.</INSDAuthor>
<INSDAuthor>Coy,M.R.</INSDAuthor>
<INSDAuthor>Crabtree,J.</INSDAuthor>
<INSDAuthor>Crawford,M.</INSDAuthor>
<INSDAuthor>Debruyn,B.</INSDAuthor>
<INSDAuthor>Decaprio,D.</INSDAuthor>
<INSDAuthor>Eiglmeier,K.</INSDAuthor>
<INSDAuthor>Eisenstadt,E.</INSDAuthor>
<INSDAuthor>El-Dorry,H.</INSDAuthor>
<INSDAuthor>Gelbart,W.M.</INSDAuthor>
<INSDAuthor>Gomes,S.L.</INSDAuthor>
<INSDAuthor>Hammond,M.</INSDAuthor>
<INSDAuthor>Hannick,L.I.</INSDAuthor>
<INSDAuthor>Hogan,J.R.</INSDAuthor>
<INSDAuthor>Holmes,M.H.</INSDAuthor>
<INSDAuthor>Jaffe,D.</INSDAuthor>
<INSDAuthor>Johnston,J.S.</INSDAuthor>
<INSDAuthor>Kennedy,R.C.</INSDAuthor>
<INSDAuthor>Koo,H.</INSDAuthor>
<INSDAuthor>Kravitz,S.</INSDAuthor>
<INSDAuthor>Kriventseva,E.V.</INSDAuthor>
<INSDAuthor>Kulp,D.</INSDAuthor>
<INSDAuthor>Labutti,K.</INSDAuthor>
<INSDAuthor>Lee,E.</INSDAuthor>
<INSDAuthor>Li,S.</INSDAuthor>
<INSDAuthor>Lovin,D.D.</INSDAuthor>
<INSDAuthor>Mao,C.</INSDAuthor>
<INSDAuthor>Mauceli,E.</INSDAuthor>
<INSDAuthor>Menck,C.F.</INSDAuthor>
<INSDAuthor>Miller,J.R.</INSDAuthor>
<INSDAuthor>Montgomery,P.</INSDAuthor>
<INSDAuthor>Mori,A.</INSDAuthor>
<INSDAuthor>Nascimento,A.L.</INSDAuthor>
<INSDAuthor>Naveira,H.F.</INSDAuthor>
<INSDAuthor>Nusbaum,C.</INSDAuthor>
<INSDAuthor>O'leary,S.</INSDAuthor>
<INSDAuthor>Orvis,J.</INSDAuthor>
<INSDAuthor>Pertea,M.</INSDAuthor>
<INSDAuthor>Quesneville,H.</INSDAuthor>
<INSDAuthor>Reidenbach,K.R.</INSDAuthor>
<INSDAuthor>Rogers,Y.H.</INSDAuthor>
<INSDAuthor>Roth,C.W.</INSDAuthor>
<INSDAuthor>Schneider,J.R.</INSDAuthor>
<INSDAuthor>Schatz,M.</INSDAuthor>
<INSDAuthor>Shumway,M.</INSDAuthor>
<INSDAuthor>Stanke,M.</INSDAuthor>
<INSDAuthor>Stinson,E.O.</INSDAuthor>
<INSDAuthor>Tubio,J.M.</INSDAuthor>
<INSDAuthor>Vanzee,J.P.</INSDAuthor>
<INSDAuthor>Verjovski-Almeida,S.</INSDAuthor>
<INSDAuthor>Werner,D.</INSDAuthor>
<INSDAuthor>White,O.</INSDAuthor>
<INSDAuthor>Wyder,S.</INSDAuthor>
<INSDAuthor>Zeng,Q.</INSDAuthor>
<INSDAuthor>Zhao,Q.</INSDAuthor>
<INSDAuthor>Zhao,Y.</INSDAuthor>
<INSDAuthor>Hill,C.A.</INSDAuthor>
<INSDAuthor>Raikhel,A.S.</INSDAuthor>
<INSDAuthor>Soares,M.B.</INSDAuthor>
<INSDAuthor>Knudson,D.L.</INSDAuthor>
<INSDAuthor>Lee,N.H.</INSDAuthor>
<INSDAuthor>Galagan,J.</INSDAuthor>
<INSDAuthor>Salzberg,S.L.</INSDAuthor>
<INSDAuthor>Paulsen,I.T.</INSDAuthor>
<INSDAuthor>Dimopoulos,G.</INSDAuthor>
<INSDAuthor>Collins,F.H.</INSDAuthor>
<INSDAuthor>Birren,B.</INSDAuthor>
<INSDAuthor>Fraser-Liggett,C.M.</INSDAuthor>
<INSDAuthor>Severson,D.W.</INSDAuthor>
</INSDReference_authors>
<INSDReference_title>Genome sequence of Aedes aegypti, a major arbovirus vector</INSDReference_title>
<INSDReference_journal>Science 316 (5832), 1718-1723 (2007)</INSDReference_journal>
<INSDReference_xref>
<INSDXref>
<INSDXref_dbname>doi</INSDXref_dbname>
<INSDXref_id>10.1126/science.1138878</INSDXref_id>
</INSDXref>
</INSDReference_xref>
<INSDReference_pubmed>17510324</INSDReference_pubmed>
</INSDReference>
<INSDReference>
<INSDReference_reference>2</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Galagan,J.</INSDAuthor>
<INSDAuthor>Devon,K.</INSDAuthor>
<INSDAuthor>Henn,M.R.</INSDAuthor>
<INSDAuthor>Severson,D.W.</INSDAuthor>
<INSDAuthor>Collins,F.</INSDAuthor>
<INSDAuthor>Jaffe,D.</INSDAuthor>
<INSDAuthor>Rounsley,S.</INSDAuthor>
<INSDAuthor>DeCaprio,D.</INSDAuthor>
<INSDAuthor>Kodira,C.</INSDAuthor>
<INSDAuthor>Lander,E.</INSDAuthor>
<INSDAuthor>Crawford,M.</INSDAuthor>
<INSDAuthor>Butler,J.</INSDAuthor>
<INSDAuthor>Alvarez,P.</INSDAuthor>
<INSDAuthor>Gnerre,S.</INSDAuthor>
<INSDAuthor>Grabherr,M.</INSDAuthor>
<INSDAuthor>Kleber,M.</INSDAuthor>
<INSDAuthor>Mauceli,E.</INSDAuthor>
<INSDAuthor>Brockman,W.</INSDAuthor>
<INSDAuthor>Young,S.</INSDAuthor>
<INSDAuthor>LaButti,K.</INSDAuthor>
<INSDAuthor>Pushparaj,V.</INSDAuthor>
<INSDAuthor>Koehrsen,M.</INSDAuthor>
<INSDAuthor>Engels,R.</INSDAuthor>
<INSDAuthor>Montgomery,P.</INSDAuthor>
<INSDAuthor>Pearson,M.</INSDAuthor>
<INSDAuthor>Howarth,C.</INSDAuthor>
<INSDAuthor>Zeng,Q.</INSDAuthor>
<INSDAuthor>Yandava,C.</INSDAuthor>
<INSDAuthor>Oleary,S.</INSDAuthor>
<INSDAuthor>Alvarado,L.</INSDAuthor>
<INSDAuthor>Nusbaum,C.</INSDAuthor>
<INSDAuthor>Birren,B.</INSDAuthor>
</INSDReference_authors>
<INSDReference_consortium>The Broad Institute Genome Sequencing Platform</INSDReference_consortium>
<INSDReference_title>Direct Submission</INSDReference_title>
<INSDReference_journal>Submitted (07-OCT-2005) Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA</INSDReference_journal>
</INSDReference>
<INSDReference>
<INSDReference_reference>3</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Loftus,B.J.</INSDAuthor>
<INSDAuthor>Nene,V.M.</INSDAuthor>
<INSDAuthor>Hannick,L.I.</INSDAuthor>
<INSDAuthor>Bidwell,S.</INSDAuthor>
<INSDAuthor>Haas,B.</INSDAuthor>
<INSDAuthor>Amedeo,P.</INSDAuthor>
<INSDAuthor>Orvis,J.</INSDAuthor>
<INSDAuthor>Wortman,J.R.</INSDAuthor>
<INSDAuthor>White,O.R.</INSDAuthor>
<INSDAuthor>Salzberg,S.</INSDAuthor>
<INSDAuthor>Shumway,M.</INSDAuthor>
<INSDAuthor>Koo,H.</INSDAuthor>
<INSDAuthor>Zhao,Y.</INSDAuthor>
<INSDAuthor>Holmes,M.</INSDAuthor>
<INSDAuthor>Miller,J.</INSDAuthor>
<INSDAuthor>Schatz,M.</INSDAuthor>
<INSDAuthor>Pop,M.</INSDAuthor>
<INSDAuthor>Pai,G.</INSDAuthor>
<INSDAuthor>Utterback,T.</INSDAuthor>
<INSDAuthor>Rogers,Y.-H.</INSDAuthor>
<INSDAuthor>Kravitz,S.</INSDAuthor>
<INSDAuthor>Fraser,C.M.</INSDAuthor>
</INSDReference_authors>
<INSDReference_title>Direct Submission</INSDReference_title>
<INSDReference_journal>Submitted (07-OCT-2005) The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA</INSDReference_journal>
</INSDReference>
<INSDReference>
<INSDReference_reference>4</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_consortium>VectorBase</INSDReference_consortium>
<INSDReference_title>Direct Submission</INSDReference_title>
<INSDReference_journal>Submitted (05-SEP-2012) VectorBase / Ensembl, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK</INSDReference_journal>
<INSDReference_remark>Annotation update by submitter</INSDReference_remark>
</INSDReference>
</INSDSeq_references>
<INSDSeq_comment>The sequence for this assembly was produced jointly by The Broad Institute of Harvard/MIT and The Institute for Genomic Research. The assembly represents 7.6X sequence coverage of the genome and the total length of the contigs is 1.31 Gb. Additional information about the Aedes aegypti sequencing project and assembly can be found at http://www.broad.mit.edu/annotation/disease_vector/aedes_aegypti/ and http://www.tigr.org/msc/aedes/aedes.shtml. Long-term curation of the sequence and subsequent annotation updates will be the responsibility of VectorBase at http://www.vectorbase.org.~Annotation was updated by VectorBase in Sept 2012.</INSDSeq_comment>
<INSDSeq_feature-table>
<INSDFeature>
<INSDFeature_key>source</INSDFeature_key>
<INSDFeature_location>1..3065330</INSDFeature_location>
<INSDFeature_intervals>
<INSDInterval>
<INSDInterval_from>1</INSDInterval_from>
<INSDInterval_to>3065330</INSDInterval_to>
<INSDInterval_accession>CH477247.1</INSDInterval_accession>
</INSDInterval>
</INSDFeature_intervals>
<INSDFeature_quals>
<INSDQualifier>
<INSDQualifier_name>organism</INSDQualifier_name>
<INSDQualifier_value>Aedes aegypti</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>mol_type</INSDQualifier_name>
<INSDQualifier_value>genomic DNA</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>strain</INSDQualifier_name>
<INSDQualifier_value>Liverpool</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>db_xref</INSDQualifier_name>
<INSDQualifier_value>taxon:7159</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>chromosome</INSDQualifier_name>
<INSDQualifier_value>2</INSDQualifier_value>
</INSDQualifier>
</INSDFeature_quals>
</INSDFeature>
</INSDSeq_feature-table>
<INSDSeq_contig>join(AAGE02003964.1:1..7226,gap(unk100),AAGE02003965.1:1..6376,gap(unk100),AAGE02003966.1:1..16236,gap(4301),AAGE02003967.1:1..174188,gap(unk100),AAGE02003968.1:1..24199,gap(1396),AAGE02003969.1:1..104064,gap(29770),AAGE02003970.1:1..12303,gap(56956),AAGE02003971.1:1..2368,gap(12542),AAGE02003972.1:1..29888,gap(1379),AAGE02003973.1:1..98175,gap(unk100),AAGE02003974.1:1..13180,gap(unk100),AAGE02003975.1:1..2872,gap(unk100),AAGE02003976.1:1..18626,gap(unk100),AAGE02003977.1:1..52378,gap(151),AAGE02003978.1:1..153108,gap(901),AAGE02003979.1:1..3583,gap(unk100),AAGE02003980.1:1..32852,gap(unk100),AAGE02003981.1:1..68239,gap(unk100),AAGE02003982.1:1..61056,gap(unk100),AAGE02003983.1:1..21852,gap(unk100),AAGE02003984.1:1..49659,gap(unk100),AAGE02003985.1:1..33070,gap(315),AAGE02003986.1:1..411266,gap(unk100),AAGE02003987.1:1..2985,gap(unk100),AAGE02003988.1:1..38365,gap(159),AAGE02003989.1:1..110697,gap(890),AAGE02003990.1:1..22405,gap(2299),AAGE02003991.1:1..7510,gap(187),AAGE02003992.1:1..447937,gap(263),AAGE02003993.1:1..92770,gap(1409),AAGE02003994.1:1..2258,gap(132),AAGE02003995.1:1..5605,gap(unk100),AAGE02003996.1:1..3451,gap(2717),AAGE02003997.1:1..20215,gap(unk100),AAGE02003998.1:1..35683,gap(514),AAGE02003999.1:1..307288,gap(unk100),AAGE02004000.1:1..71359,gap(433),AAGE02004001.1:1..10550,gap(unk100),AAGE02004002.1:1..289125,gap(unk100),AAGE02004003.1:1..45622,gap(unk100),AAGE02004004.1:1..35927)</INSDSeq_contig>
<INSDSeq_xrefs>
<INSDXref>
<INSDXref_dbname>BioProject</INSDXref_dbname>
<INSDXref_id>PRJNA12434</INSDXref_id>
</INSDXref>
<INSDXref>
<INSDXref_dbname>BioSample</INSDXref_dbname>
<INSDXref_id>SAMN02953616</INSDXref_id>
</INSDXref>
</INSDSeq_xrefs>
Use an XPath expression or a CSS selector, depending on the language and libraries you use.
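Since the question mentions R, here is a minimal XPath sketch using the XML package. It assumes the full download wraps each record in an INSDSeq element inside a top-level INSDSet, which the excerpt above omits, and it assumes the wanted journal entry is the first one whose text starts with "Submitted". The helper first_value and the names xml_text, records and result are illustrative.

```r
library(XML)

# Cut-down record, wrapped in the assumed <INSDSet>/<INSDSeq> elements.
xml_text <- '<INSDSet>
<INSDSeq>
<INSDSeq_update-date>23-MAR-2015</INSDSeq_update-date>
<INSDSeq_create-date>28-OCT-2005</INSDSeq_create-date>
<INSDSeq_accession-version>CH477247.1</INSDSeq_accession-version>
<INSDSeq_references>
<INSDReference>
<INSDReference_journal>Science 316 (5832), 1718-1723 (2007)</INSDReference_journal>
</INSDReference>
<INSDReference>
<INSDReference_journal>Submitted (07-OCT-2005) Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA</INSDReference_journal>
</INSDReference>
</INSDSeq_references>
</INSDSeq>
</INSDSet>'

doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: first text value matching `path` under `node`, else NA.
first_value <- function(node, path) {
  hit <- xpathSApply(node, path, xmlValue)
  if (length(hit) > 0) hit[[1]] else NA_character_
}

# One data frame row per <INSDSeq>, so all samples are handled alike.
records <- lapply(getNodeSet(doc, "//INSDSeq"), function(seq_node) {
  data.frame(
    accession = first_value(seq_node, "./INSDSeq_accession-version"),
    updated   = first_value(seq_node, "./INSDSeq_update-date"),
    created   = first_value(seq_node, "./INSDSeq_create-date"),
    submitted = first_value(
      seq_node, ".//INSDReference_journal[starts-with(., 'Submitted')]"),
    stringsAsFactors = FALSE
  )
})
result <- do.call(rbind, records)
print(result)
```

For the real file, replace the inline string with xmlParse("your_file.xml"). Because the loop runs over every INSDSeq node, the same code should yield one row for each of the 13 samples.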