How can I select specific information from an XML file, in R or other platforms? - r

Hi, I've just downloaded an XML file referring to the 5.8S region in Aedes aegypti from NCBI Nucleotide. As an example, I paste below the info I get for the first sample.
From it I wish to extract:
1. <INSDSeq_accession-version>CH477247.1</INSDSeq_accession-version>
2. <INSDSeq_update-date>23-MAR-2015</INSDSeq_update-date>
3. <INSDSeq_create-date>28-OCT-2005</INSDSeq_create-date>
4. <INSDReference_journal>Submitted (07-OCT-2005) Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA </INSDReference_journal>
Also, as I said, this is a short version of everything I actually downloaded (13 samples): https://www.ncbi.nlm.nih.gov/nuccore/?term=aedes+aegypti+5.8 Is there a possibility to extract the info I want for all the samples?
I'm familiar with R, but which platform is better suited to this?
<INSDSeq_locus>CH477247</INSDSeq_locus>
<INSDSeq_length>3065330</INSDSeq_length>
<INSDSeq_strandedness>double</INSDSeq_strandedness>
<INSDSeq_moltype>DNA</INSDSeq_moltype>
<INSDSeq_topology>linear</INSDSeq_topology>
<INSDSeq_division>CON</INSDSeq_division>
<INSDSeq_update-date>23-MAR-2015</INSDSeq_update-date>
<INSDSeq_create-date>28-OCT-2005</INSDSeq_create-date>
<INSDSeq_definition>Aedes aegypti strain Liverpool supercont1.62 genomic scaffold, whole genome shotgun sequence</INSDSeq_definition>
<INSDSeq_primary-accession>CH477247</INSDSeq_primary-accession>
<INSDSeq_accession-version>CH477247.1</INSDSeq_accession-version>
<INSDSeq_other-seqids>
<INSDSeqid>gnl|WGS:AAGE|supercont1.62</INSDSeqid>
<INSDSeqid>gb|CH477247.1|</INSDSeqid>
<INSDSeqid>gi|78216626</INSDSeqid>
</INSDSeq_other-seqids>
<INSDSeq_project>PRJNA12434</INSDSeq_project>
<INSDSeq_keywords>
<INSDKeyword>WGS</INSDKeyword>
</INSDSeq_keywords>
<INSDSeq_source>Aedes aegypti (yellow fever mosquito)</INSDSeq_source>
<INSDSeq_organism>Aedes aegypti</INSDSeq_organism>
<INSDSeq_taxonomy>Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota; Neoptera; Holometabola; Diptera; Nematocera; Culicoidea; Culicidae; Culicinae; Aedini; Aedes; Stegomyia</INSDSeq_taxonomy>
<INSDSeq_references>
<INSDReference>
<INSDReference_reference>1</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Nene,V.</INSDAuthor>
<INSDAuthor>Wortman,J.R.</INSDAuthor>
<INSDAuthor>Lawson,D.</INSDAuthor>
<INSDAuthor>Haas,B.</INSDAuthor>
<INSDAuthor>Kodira,C.</INSDAuthor>
<INSDAuthor>Tu,Z.J.</INSDAuthor>
<INSDAuthor>Loftus,B.</INSDAuthor>
<INSDAuthor>Xi,Z.</INSDAuthor>
<INSDAuthor>Megy,K.</INSDAuthor>
<INSDAuthor>Grabherr,M.</INSDAuthor>
<INSDAuthor>Ren,Q.</INSDAuthor>
<INSDAuthor>Zdobnov,E.M.</INSDAuthor>
<INSDAuthor>Lobo,N.F.</INSDAuthor>
<INSDAuthor>Campbell,K.S.</INSDAuthor>
<INSDAuthor>Brown,S.E.</INSDAuthor>
<INSDAuthor>Bonaldo,M.F.</INSDAuthor>
<INSDAuthor>Zhu,J.</INSDAuthor>
<INSDAuthor>Sinkins,S.P.</INSDAuthor>
<INSDAuthor>Hogenkamp,D.G.</INSDAuthor>
<INSDAuthor>Amedeo,P.</INSDAuthor>
<INSDAuthor>Arensburger,P.</INSDAuthor>
<INSDAuthor>Atkinson,P.W.</INSDAuthor>
<INSDAuthor>Bidwell,S.</INSDAuthor>
<INSDAuthor>Biedler,J.</INSDAuthor>
<INSDAuthor>Birney,E.</INSDAuthor>
<INSDAuthor>Bruggner,R.V.</INSDAuthor>
<INSDAuthor>Costas,J.</INSDAuthor>
<INSDAuthor>Coy,M.R.</INSDAuthor>
<INSDAuthor>Crabtree,J.</INSDAuthor>
<INSDAuthor>Crawford,M.</INSDAuthor>
<INSDAuthor>Debruyn,B.</INSDAuthor>
<INSDAuthor>Decaprio,D.</INSDAuthor>
<INSDAuthor>Eiglmeier,K.</INSDAuthor>
<INSDAuthor>Eisenstadt,E.</INSDAuthor>
<INSDAuthor>El-Dorry,H.</INSDAuthor>
<INSDAuthor>Gelbart,W.M.</INSDAuthor>
<INSDAuthor>Gomes,S.L.</INSDAuthor>
<INSDAuthor>Hammond,M.</INSDAuthor>
<INSDAuthor>Hannick,L.I.</INSDAuthor>
<INSDAuthor>Hogan,J.R.</INSDAuthor>
<INSDAuthor>Holmes,M.H.</INSDAuthor>
<INSDAuthor>Jaffe,D.</INSDAuthor>
<INSDAuthor>Johnston,J.S.</INSDAuthor>
<INSDAuthor>Kennedy,R.C.</INSDAuthor>
<INSDAuthor>Koo,H.</INSDAuthor>
<INSDAuthor>Kravitz,S.</INSDAuthor>
<INSDAuthor>Kriventseva,E.V.</INSDAuthor>
<INSDAuthor>Kulp,D.</INSDAuthor>
<INSDAuthor>Labutti,K.</INSDAuthor>
<INSDAuthor>Lee,E.</INSDAuthor>
<INSDAuthor>Li,S.</INSDAuthor>
<INSDAuthor>Lovin,D.D.</INSDAuthor>
<INSDAuthor>Mao,C.</INSDAuthor>
<INSDAuthor>Mauceli,E.</INSDAuthor>
<INSDAuthor>Menck,C.F.</INSDAuthor>
<INSDAuthor>Miller,J.R.</INSDAuthor>
<INSDAuthor>Montgomery,P.</INSDAuthor>
<INSDAuthor>Mori,A.</INSDAuthor>
<INSDAuthor>Nascimento,A.L.</INSDAuthor>
<INSDAuthor>Naveira,H.F.</INSDAuthor>
<INSDAuthor>Nusbaum,C.</INSDAuthor>
<INSDAuthor>O&apos;leary,S.</INSDAuthor>
<INSDAuthor>Orvis,J.</INSDAuthor>
<INSDAuthor>Pertea,M.</INSDAuthor>
<INSDAuthor>Quesneville,H.</INSDAuthor>
<INSDAuthor>Reidenbach,K.R.</INSDAuthor>
<INSDAuthor>Rogers,Y.H.</INSDAuthor>
<INSDAuthor>Roth,C.W.</INSDAuthor>
<INSDAuthor>Schneider,J.R.</INSDAuthor>
<INSDAuthor>Schatz,M.</INSDAuthor>
<INSDAuthor>Shumway,M.</INSDAuthor>
<INSDAuthor>Stanke,M.</INSDAuthor>
<INSDAuthor>Stinson,E.O.</INSDAuthor>
<INSDAuthor>Tubio,J.M.</INSDAuthor>
<INSDAuthor>Vanzee,J.P.</INSDAuthor>
<INSDAuthor>Verjovski-Almeida,S.</INSDAuthor>
<INSDAuthor>Werner,D.</INSDAuthor>
<INSDAuthor>White,O.</INSDAuthor>
<INSDAuthor>Wyder,S.</INSDAuthor>
<INSDAuthor>Zeng,Q.</INSDAuthor>
<INSDAuthor>Zhao,Q.</INSDAuthor>
<INSDAuthor>Zhao,Y.</INSDAuthor>
<INSDAuthor>Hill,C.A.</INSDAuthor>
<INSDAuthor>Raikhel,A.S.</INSDAuthor>
<INSDAuthor>Soares,M.B.</INSDAuthor>
<INSDAuthor>Knudson,D.L.</INSDAuthor>
<INSDAuthor>Lee,N.H.</INSDAuthor>
<INSDAuthor>Galagan,J.</INSDAuthor>
<INSDAuthor>Salzberg,S.L.</INSDAuthor>
<INSDAuthor>Paulsen,I.T.</INSDAuthor>
<INSDAuthor>Dimopoulos,G.</INSDAuthor>
<INSDAuthor>Collins,F.H.</INSDAuthor>
<INSDAuthor>Birren,B.</INSDAuthor>
<INSDAuthor>Fraser-Liggett,C.M.</INSDAuthor>
<INSDAuthor>Severson,D.W.</INSDAuthor>
</INSDReference_authors>
<INSDReference_title>Genome sequence of Aedes aegypti, a major arbovirus vector</INSDReference_title>
<INSDReference_journal>Science 316 (5832), 1718-1723 (2007)</INSDReference_journal>
<INSDReference_xref>
<INSDXref>
<INSDXref_dbname>doi</INSDXref_dbname>
<INSDXref_id>10.1126/science.1138878</INSDXref_id>
</INSDXref>
</INSDReference_xref>
<INSDReference_pubmed>17510324</INSDReference_pubmed>
</INSDReference>
<INSDReference>
<INSDReference_reference>2</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Galagan,J.</INSDAuthor>
<INSDAuthor>Devon,K.</INSDAuthor>
<INSDAuthor>Henn,M.R.</INSDAuthor>
<INSDAuthor>Severson,D.W.</INSDAuthor>
<INSDAuthor>Collins,F.</INSDAuthor>
<INSDAuthor>Jaffe,D.</INSDAuthor>
<INSDAuthor>Rounsley,S.</INSDAuthor>
<INSDAuthor>DeCaprio,D.</INSDAuthor>
<INSDAuthor>Kodira,C.</INSDAuthor>
<INSDAuthor>Lander,E.</INSDAuthor>
<INSDAuthor>Crawford,M.</INSDAuthor>
<INSDAuthor>Butler,J.</INSDAuthor>
<INSDAuthor>Alvarez,P.</INSDAuthor>
<INSDAuthor>Gnerre,S.</INSDAuthor>
<INSDAuthor>Grabherr,M.</INSDAuthor>
<INSDAuthor>Kleber,M.</INSDAuthor>
<INSDAuthor>Mauceli,E.</INSDAuthor>
<INSDAuthor>Brockman,W.</INSDAuthor>
<INSDAuthor>Young,S.</INSDAuthor>
<INSDAuthor>LaButti,K.</INSDAuthor>
<INSDAuthor>Pushparaj,V.</INSDAuthor>
<INSDAuthor>Koehrsen,M.</INSDAuthor>
<INSDAuthor>Engels,R.</INSDAuthor>
<INSDAuthor>Montgomery,P.</INSDAuthor>
<INSDAuthor>Pearson,M.</INSDAuthor>
<INSDAuthor>Howarth,C.</INSDAuthor>
<INSDAuthor>Zeng,Q.</INSDAuthor>
<INSDAuthor>Yandava,C.</INSDAuthor>
<INSDAuthor>Oleary,S.</INSDAuthor>
<INSDAuthor>Alvarado,L.</INSDAuthor>
<INSDAuthor>Nusbaum,C.</INSDAuthor>
<INSDAuthor>Birren,B.</INSDAuthor>
</INSDReference_authors>
<INSDReference_consortium>The Broad Institute Genome Sequencing Platform</INSDReference_consortium>
<INSDReference_title>Direct Submission</INSDReference_title>
<INSDReference_journal>Submitted (07-OCT-2005) Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA</INSDReference_journal>
</INSDReference>
<INSDReference>
<INSDReference_reference>3</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_authors>
<INSDAuthor>Loftus,B.J.</INSDAuthor>
<INSDAuthor>Nene,V.M.</INSDAuthor>
<INSDAuthor>Hannick,L.I.</INSDAuthor>
<INSDAuthor>Bidwell,S.</INSDAuthor>
<INSDAuthor>Haas,B.</INSDAuthor>
<INSDAuthor>Amedeo,P.</INSDAuthor>
<INSDAuthor>Orvis,J.</INSDAuthor>
<INSDAuthor>Wortman,J.R.</INSDAuthor>
<INSDAuthor>White,O.R.</INSDAuthor>
<INSDAuthor>Salzberg,S.</INSDAuthor>
<INSDAuthor>Shumway,M.</INSDAuthor>
<INSDAuthor>Koo,H.</INSDAuthor>
<INSDAuthor>Zhao,Y.</INSDAuthor>
<INSDAuthor>Holmes,M.</INSDAuthor>
<INSDAuthor>Miller,J.</INSDAuthor>
<INSDAuthor>Schatz,M.</INSDAuthor>
<INSDAuthor>Pop,M.</INSDAuthor>
<INSDAuthor>Pai,G.</INSDAuthor>
<INSDAuthor>Utterback,T.</INSDAuthor>
<INSDAuthor>Rogers,Y.-H.</INSDAuthor>
<INSDAuthor>Kravitz,S.</INSDAuthor>
<INSDAuthor>Fraser,C.M.</INSDAuthor>
</INSDReference_authors>
<INSDReference_title>Direct Submission</INSDReference_title>
<INSDReference_journal>Submitted (07-OCT-2005) The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA</INSDReference_journal>
</INSDReference>
<INSDReference>
<INSDReference_reference>4</INSDReference_reference>
<INSDReference_position>1..3065330</INSDReference_position>
<INSDReference_consortium>VectorBase</INSDReference_consortium>
<INSDReference_title>Direct Submission</INSDReference_title>
<INSDReference_journal>Submitted (05-SEP-2012) VectorBase / Ensembl, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK</INSDReference_journal>
<INSDReference_remark>Annotation update by submitter</INSDReference_remark>
</INSDReference>
</INSDSeq_references>
<INSDSeq_comment>The sequence for this assembly was produced jointly by The Broad Institute of Harvard/MIT and The Institute for Genomic Research. The assembly represents 7.6X sequence coverage of the genome and the total length of the contigs is 1.31 Gb. Additional information about the Aedes aegypti sequencing project and assembly can be found at http://www.broad.mit.edu/annotation/disease_vector/aedes_aegypti/ and http://www.tigr.org/msc/aedes/aedes.shtml. Long-term curation of the sequence and subsequent annotation updates will be the responsibility of VectorBase at http://www.vectorbase.org.~Annotation was updated by VectorBase in Sept 2012.</INSDSeq_comment>
<INSDSeq_feature-table>
<INSDFeature>
<INSDFeature_key>source</INSDFeature_key>
<INSDFeature_location>1..3065330</INSDFeature_location>
<INSDFeature_intervals>
<INSDInterval>
<INSDInterval_from>1</INSDInterval_from>
<INSDInterval_to>3065330</INSDInterval_to>
<INSDInterval_accession>CH477247.1</INSDInterval_accession>
</INSDInterval>
</INSDFeature_intervals>
<INSDFeature_quals>
<INSDQualifier>
<INSDQualifier_name>organism</INSDQualifier_name>
<INSDQualifier_value>Aedes aegypti</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>mol_type</INSDQualifier_name>
<INSDQualifier_value>genomic DNA</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>strain</INSDQualifier_name>
<INSDQualifier_value>Liverpool</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>db_xref</INSDQualifier_name>
<INSDQualifier_value>taxon:7159</INSDQualifier_value>
</INSDQualifier>
<INSDQualifier>
<INSDQualifier_name>chromosome</INSDQualifier_name>
<INSDQualifier_value>2</INSDQualifier_value>
</INSDQualifier>
</INSDFeature_quals>
</INSDFeature>
</INSDSeq_feature-table>
<INSDSeq_contig>join(AAGE02003964.1:1..7226,gap(unk100),AAGE02003965.1:1..6376,gap(unk100),AAGE02003966.1:1..16236,gap(4301),AAGE02003967.1:1..174188,gap(unk100),AAGE02003968.1:1..24199,gap(1396),AAGE02003969.1:1..104064,gap(29770),AAGE02003970.1:1..12303,gap(56956),AAGE02003971.1:1..2368,gap(12542),AAGE02003972.1:1..29888,gap(1379),AAGE02003973.1:1..98175,gap(unk100),AAGE02003974.1:1..13180,gap(unk100),AAGE02003975.1:1..2872,gap(unk100),AAGE02003976.1:1..18626,gap(unk100),AAGE02003977.1:1..52378,gap(151),AAGE02003978.1:1..153108,gap(901),AAGE02003979.1:1..3583,gap(unk100),AAGE02003980.1:1..32852,gap(unk100),AAGE02003981.1:1..68239,gap(unk100),AAGE02003982.1:1..61056,gap(unk100),AAGE02003983.1:1..21852,gap(unk100),AAGE02003984.1:1..49659,gap(unk100),AAGE02003985.1:1..33070,gap(315),AAGE02003986.1:1..411266,gap(unk100),AAGE02003987.1:1..2985,gap(unk100),AAGE02003988.1:1..38365,gap(159),AAGE02003989.1:1..110697,gap(890),AAGE02003990.1:1..22405,gap(2299),AAGE02003991.1:1..7510,gap(187),AAGE02003992.1:1..447937,gap(263),AAGE02003993.1:1..92770,gap(1409),AAGE02003994.1:1..2258,gap(132),AAGE02003995.1:1..5605,gap(unk100),AAGE02003996.1:1..3451,gap(2717),AAGE02003997.1:1..20215,gap(unk100),AAGE02003998.1:1..35683,gap(514),AAGE02003999.1:1..307288,gap(unk100),AAGE02004000.1:1..71359,gap(433),AAGE02004001.1:1..10550,gap(unk100),AAGE02004002.1:1..289125,gap(unk100),AAGE02004003.1:1..45622,gap(unk100),AAGE02004004.1:1..35927)</INSDSeq_contig>
<INSDSeq_xrefs>
<INSDXref>
<INSDXref_dbname>BioProject</INSDXref_dbname>
<INSDXref_id>PRJNA12434</INSDXref_id>
</INSDXref>
<INSDXref>
<INSDXref_dbname>BioSample</INSDXref_dbname>
<INSDXref_id>SAMN02953616</INSDXref_id>
</INSDXref>
</INSDSeq_xrefs>

Use an XPath expression or a CSS selector, depending on the language and libraries you use.
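
As one possible illustration of the XPath-style approach, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It assumes the full download wraps one INSDSeq element per sample inside a single INSDSet root; that wrapper, the trimmed sample, the function name extract_records, and the file name in the comment are assumptions for illustration, not something shown in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the NCBI download. Assumption: the full file wraps
# one <INSDSeq> element per sample inside a single <INSDSet> root.
SAMPLE = """<INSDSet>
  <INSDSeq>
    <INSDSeq_update-date>23-MAR-2015</INSDSeq_update-date>
    <INSDSeq_create-date>28-OCT-2005</INSDSeq_create-date>
    <INSDSeq_accession-version>CH477247.1</INSDSeq_accession-version>
    <INSDSeq_references>
      <INSDReference>
        <INSDReference_journal>Science 316 (5832), 1718-1723 (2007)</INSDReference_journal>
      </INSDReference>
      <INSDReference>
        <INSDReference_journal>Submitted (07-OCT-2005) Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA</INSDReference_journal>
      </INSDReference>
    </INSDSeq_references>
  </INSDSeq>
</INSDSet>"""


def extract_records(root):
    """Return one dict per <INSDSeq> with the four requested fields."""
    records = []
    for seq in root.iter("INSDSeq"):
        journals = [j.text.strip() for j in seq.iter("INSDReference_journal") if j.text]
        records.append({
            "accession_version": seq.findtext("INSDSeq_accession-version"),
            "update_date": seq.findtext("INSDSeq_update-date"),
            "create_date": seq.findtext("INSDSeq_create-date"),
            # Keep only the "Submitted (...)" journal entries, as in the question.
            "submitted": [j for j in journals if j.startswith("Submitted")],
        })
    return records


# For a real file: root = ET.parse("sequences.xml").getroot()  (hypothetical name)
root = ET.fromstring(SAMPLE)
records = extract_records(root)
for record in records:
    print(record)
```

Because the loop visits every INSDSeq element, the same function should cover all 13 samples if they sit in one file. In R, the XML and xml2 packages accept comparable XPath expressions.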

Related

Best way to import words from a text file into a data frame in R

I have a text file filled with words separated by spaces as seen below:
ACNES ACOCK ACOLD ACORN ACRED ACRES ACRID ACTED ACTIN ACTON ACTOR ACUTE ACYLS ADAGE ADAPT ADAWS ADAYS ADDAX ADDED ADDER ADDIO ADDLE ADEEM ADEPT ADHAN ADIEU ADIOS ADITS ADMAN ADMEN ADMIN ADMIT ADMIX ADOBE ADOBO ADOPT ADORE ADORN ADOWN ADOZE ADRAD ADRED ADSUM ADUKI ADULT ADUNC ADUST ADVEW ADYTA ADZED ADZES AECIA AEDES AEGIS AEONS AERIE AEROS AESIR AFALD AFARA AFARS AFEAR AFFIX AFIRE AFLAJ AFOOT AFORE AFOUL AFRIT AFROS AFTER AGAIN AGAMA AGAMI AGAPE AGARS AGAST AGATE AGAVE AGAZE AGENE AGENT AGERS AGGER AGGIE AGGRI AGGRO AGGRY AGHAS AGILA AGILE AGING AGIOS AGISM AGIST AGITA AGLEE AGLET AGLEY AGLOO AGLOW AGLUS AGMAS AGOGE AGONE AGONS AGONY AGOOD AGORA AGREE AGRIA AGRIN
What's the best way to import all these words into a 1 column data frame?

extract data from XML files - R

I'm new to extracting data from XML files. I'm trying to process the following XML file using the R XML package. The information I want is in the attribute values.
I have run into two difficulties:
1. Some attribute values exist in one node but not in another. For example, "DRP" has the information for the second individual but not for the first.
2. Some attributes have multiple values for an individual, and I don't know how to link them to that individual. For example, "EmpHs" has multiple records for an individual (identified by indvlPK).
Ideally, I want the output data to have a structure similar to the following:
lastNm  firstNm  indvlPK  fromDt   orgNm                                   hasCustComp
GIGAX   JEFFREY  2783477  03/2004  GATEWAY FINANCIAL ADVISORS, INC
GIGAX   JEFFREY  2783477  03/2004  GFA IN
GIGAX   JEFFREY  2783477  01/2007  UNITED FIRST
HINSON  BRIAN    2783737  07/1996  LINCOLN FINANCIAL ADVISORS CORPORATION  Y
HINSON  BRIAN    2783737  07/1996  FIRST FINANCIAL GROUP                   Y
Is there any way I can parse the data correctly? Thanks!
Here is the code I used, which didn't give me what I want:
library(XML)
doc <- "Test.xml"
ind <- xmlParse(doc)
xmltop <- xmlRoot(ind)
# In XPath, "@name" selects the attribute called "name"
temp1 <- data.frame(unlist(getNodeSet(xmltop, "//Info/@lastNm")))
temp2 <- data.frame(unlist(getNodeSet(xmltop, "//Info/@firstNm")))
temp3 <- data.frame(unlist(getNodeSet(xmltop, "//Info/@indvlPK")))
temp4 <- data.frame(unlist(getNodeSet(xmltop, "//EmpHs/@fromDt")))
temp5 <- data.frame(unlist(getNodeSet(xmltop, "//DRP/@hasCustComp")))
The data is here:
<?xml version="1.0" encoding="ISO-8859-1"?>
<IAPDIndividualReport GenOn="2021-03-29">
<Indvls>
<Indvl>
<Info lastNm="GIGAX" firstNm="JEFFREY" midNm="W" indvlPK="2783477" actvAGReg="Y" link="https://adviserinfo.sec.gov/individual/summary/2783477"/>
<OthrNms/>
<CrntEmps>
<CrntEmp orgNm="CAMBRIDGE INVESTMENT RESEARCH ADVISORS, INC." orgPK="134139" str1="1776 PLEASANT PLAIN RD." city="FAIRFIELD" state="IA" cntry="United States" postlCd="52556-8757">
<CrntRgstns>
<CrntRgstn regAuth="MO" regCat="RA" st="APPROVED" stDt="2010-09-09"/>
</CrntRgstns>
<BrnchOfLocs>
<BrnchOfLoc city="O&apos;FALLON" state="MO" cntry="United States"/>
</BrnchOfLocs>
</CrntEmp>
</CrntEmps>
<Exms>
<Exm exmCd="S63" exmNm="Uniform Securities Agent State Law Examination" exmDt="1996-08-20"/>
<Exm exmCd="S65" exmNm="Uniform Investment Adviser Law Examination" exmDt="1999-12-21"/>
</Exms>
<Dsgntns/>
<PrevRgstns>
<PrevRgstn orgNm="WOODBURY FINANCIAL SERVICES, INC." orgPK="421" regBeginDt="2009-01-05" regEndDt="2009-12-03">
<BrnchOfLocs>
<BrnchOfLoc city="OFALLON" state="MO"/>
<BrnchOfLoc city="OFALLON" state="MO"/>
<BrnchOfLoc city="DUBLIN" state="CA"/>
</BrnchOfLocs>
</PrevRgstn>
<PrevRgstn orgNm="FSC SECURITIES CORPORATION" orgPK="7461" regBeginDt="2004-10-29" regEndDt="2008-12-01">
<BrnchOfLocs>
<BrnchOfLoc city="O&apos;FALLON" state="MO"/>
<BrnchOfLoc city="ST. PETERS" state="MO"/>
</BrnchOfLocs>
</PrevRgstn>
<PrevRgstn orgNm="GATEWAY FINANCIAL ADVISORS, INC." orgPK="115025" regBeginDt="2004-11-11" regEndDt="2006-10-11">
<BrnchOfLocs>
<BrnchOfLoc city="ST. PETERS" state="MO"/>
</BrnchOfLocs>
</PrevRgstn>
</PrevRgstns>
<EmpHss>
<EmpHs fromDt="03/2004" orgNm="GATEWAY FINANCIAL ADVISORS, INC" city="OFALLON" state="MO"/>
<EmpHs fromDt="03/2004" orgNm="GFA INC" city="OFALLON" state="MO"/>
<EmpHs fromDt="01/2007" orgNm="UNITED FIRST" city="OFALLON" state="MO"/>
<EmpHs fromDt="09/2010" orgNm="CAMBRIDGE INVESTMENT RESEARCH ADVISORS, INC" city="FAIRFIELD" state="IA"/>
<EmpHs fromDt="09/2010" orgNm="CAMBRIDGE INVESTMENT RESEARCH, INC" city="FAIRFIELD" state="IA"/>
</EmpHss>
<OthrBuss>
<OthrBus desc="1)STONEBRIDGE WEALTH MANAGEMENT GROUP, 728 HAWK RUN DR, O&apos;FALLON, MO, 3/2008 AS INDEPENDENT INSURANCE AGENT FOR VARIOUS INDEPENDENT INSURANCE COMPANIES. INV REL - 40/MO - 20/TRADING. 2)UNITED FIRST FINANCIAL MORTGAGE SOFTWARE SALES. START 6/1/07, 10 HOURS PER MONTH, 5 DURING TRADING HOURS. NO OWNERSHIP INTEREST. 3)MORTGAGE STOP INC., 728 HAWK RUN DR., OFALLON, MO 63368. LOAN OFFICER PROCESSING LOAN APPS FOR CLIENTS. START 6/1/2002, 25 HOURS PER MONTH, 10 DURING TRADING HOURS. NO OWNERSHIP. 4)CIRA, 1776 PLEASANT PLAIN RD, FAIRFIELD, IA, AS ADVISORY REP OF A RIA. INV REL - 40 HR/WK - 40/TRADING. SEE EMPLOYMENT HISTORY FOR START DATE. 5) THE MORTGAGE SHOP, 355 MID RIVERS MALL DRIVE, STE E, ST. PETERS, MO 63376. MORTGAGE ORIGINATOR SINCE 01/01/99. NOT INVESTMENT RELATED. WORKS 60 HOURS PER MONTH, 20 OF WHICH ARE DURING TRADING HOURS. 6.365 PROPERTIES LLC, O&apos;FALLON, MO, 8/2018 AS OWNER OF LLC THAT BUYS, SELLS, & HOLDS REAL ESTATE. NIR - 20/MO - 0/TRADING. 7. BEST OFFER HOMES, LLC, 728 HAWK RUN DRIVE, O&apos;FALLON, MO, REAL ESTATE SALES/MORTGAGE ORIGINATION/ ACCOUNTING/FINANCIAL ACTIVITIES, 06/16/20, NIR, 20/MO- 0/TRADING 8. GIGAX WEALTH MANAGEMENT, 728 HAWK RUN DRIVE, OFALLON, MO, INDEPENDENT INSURANCE AGENT FOR VARIOUS INDEPENDENT INSURANCE COMPANIES,11/23/20, INV REL, 10 HR/WK- 10 TRADING HR."/>
</OthrBuss>
<DRPs/>
</Indvl>
<Indvl>
<Info lastNm="HINSON" firstNm="BRIAN" midNm="TROY" indvlPK="2783737" actvAGReg="Y" link="https://adviserinfo.sec.gov/individual/summary/2783737"/>
<OthrNms/>
<CrntEmps>
<CrntEmp orgNm="BRIDGEWORTH WEALTH MANAGEMENT" orgPK="164100" str1="101 25TH STREET NORTH" city="BIRMINGHAM" state="AL" cntry="United States" postlCd="35203">
<CrntRgstns>
<CrntRgstn regAuth="AL" regCat="RA" st="APPROVED" stDt="2015-05-12"/>
<CrntRgstn regAuth="TX" regCat="RA" st="APPROVED_RES" stDt="2015-05-01"/>
</CrntRgstns>
<BrnchOfLocs>
<BrnchOfLoc str1="400 MERIDIAN STREET" str2="SUITE 200" city="HUNTSVILLE" state="AL" cntry="United States" postlCd="35801"/>
<BrnchOfLoc str1="101 25TH STREET NORTH" city="BIRMINGHAM" state="AL" cntry="United States" postlCd="35203"/>
</BrnchOfLocs>
</CrntEmp>
</CrntEmps>
<Exms>
<Exm exmCd="S63" exmNm="Uniform Securities Agent State Law Examination" exmDt="1996-10-11"/>
</Exms>
<Dsgntns>
<Dsgntn dsgntnNm="Certified Financial Planner"/>
<Dsgntn dsgntnNm="Chartered Financial Consultant"/>
<Dsgntn dsgntnNm="Personal Financial Specialist"/>
</Dsgntns>
<PrevRgstns>
<PrevRgstn orgNm="LINCOLN FINANCIAL ADVISORS CORPORATION" orgPK="3978" regBeginDt="2000-04-25" regEndDt="2015-05-11">
<BrnchOfLocs>
<BrnchOfLoc city="HUNTSVILLE" state="AL"/>
<BrnchOfLoc city="HUNTSVILLE" state="AL"/>
</BrnchOfLocs>
</PrevRgstn>
</PrevRgstns>
<EmpHss>
<EmpHs fromDt="04/2015" orgNm="BRIDGEWORTH, LLC" city="HUNTSVILLE" state="AL"/>
<EmpHs fromDt="07/1996" toDt="04/2015" orgNm="LINCOLN FINANCIAL ADVISORS CORPORATION" city="HUNTSVILLE" state="AL"/>
<EmpHs fromDt="07/1996" toDt="04/2015" orgNm="FIRST FINANCIAL GROUP" city="BIRMINGHAM" state="AL"/>
<EmpHs fromDt="04/2015" orgNm="LPL FINANCIAL LLC" city="HUNTSVILLE" state="AL"/>
</EmpHss>
<OthrBuss>
<OthrBus desc="1) 04/30/2015: BRIDGEWORTH FINANCIAL, LLC - DBA FOR LPL BUSINESS (ENTITY FOR LPL BUSINESS) - INV REL - AT REPORTED BUSINESS LOCATIONS - START 01/01/2015 - 1% OF TIME SPENT 2) 04/30/2015: BRIDGEWORTH, LLC - INV REL - AT REPORTED BUSINESS LOCATION(S) - REGISTERED INVESTMENT ADVISOR HYBRID - START 01/2015 - 99% OF TIME SPENT. 3) 5/11/2015: NO BUSINESS NAME - INVESTMENT RELATED - AT REPORTED BUSINESS LOCATION(S) - NON-VARIABLE INSURANCE - STARTED 4/1/2015 - TIME SPENT 1% - LINES OF INSURANCE INCLUDE TERM, WHOLE, UNIVERSAL, LTC, DISABILITY. 4) 6/2/2017 - Bridgeworth Financial - Investment Related - At Reported Business Location(s) - DBA for LPL Business (entity for LPL business) - Started 04/30/2015 - 5 Hours Per Month/3 Hours During Securities Trading. 5) 5/8/2018 - Foster Properties Ltd - Not Investment Related - Home Based - Other-Family Business - Started 12/22/1997 - 1 Hours Per Month/0 Hours During Securities Trading - Handle the majority of business matters for this family business."/>
</OthrBuss>
<DRPs>
<DRP hasRegAction="N" hasCriminal="N" hasBankrupt="N" hasCivilJudc="N" hasBond="N" hasJudgment="N" hasInvstgn="N" hasCustComp="Y" hasTermination="N"/>
</DRPs>
</Indvl>
</Indvls>
</IAPDIndividualReport>
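
One way to address both difficulties is to loop over each Indvl element and build the rows from inside it, so that every EmpHs record stays linked to its own Info and a missing DRP simply yields an empty value. The following is a minimal sketch in Python (standard library xml.etree.ElementTree) rather than R; the trimmed sample and the function name flatten are assumptions for illustration, and the same per-individual loop can be written with the R XML package.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the data above (only the attributes needed here).
SAMPLE = """<IAPDIndividualReport GenOn="2021-03-29">
  <Indvls>
    <Indvl>
      <Info lastNm="GIGAX" firstNm="JEFFREY" indvlPK="2783477"/>
      <EmpHss>
        <EmpHs fromDt="03/2004" orgNm="GFA INC"/>
        <EmpHs fromDt="01/2007" orgNm="UNITED FIRST"/>
      </EmpHss>
      <DRPs/>
    </Indvl>
    <Indvl>
      <Info lastNm="HINSON" firstNm="BRIAN" indvlPK="2783737"/>
      <EmpHss>
        <EmpHs fromDt="07/1996" orgNm="FIRST FINANCIAL GROUP"/>
      </EmpHss>
      <DRPs>
        <DRP hasCustComp="Y"/>
      </DRPs>
    </Indvl>
  </Indvls>
</IAPDIndividualReport>"""


def flatten(root):
    """Return one row per EmpHs record, repeated with its individual's fields."""
    rows = []
    for indvl in root.iter("Indvl"):
        info = indvl.find("Info")      # assumed to exist for every individual
        drp = indvl.find("DRPs/DRP")   # None when <DRPs/> is empty
        has_cust_comp = drp.get("hasCustComp", "") if drp is not None else ""
        for emp in indvl.iter("EmpHs"):
            rows.append({
                "lastNm": info.get("lastNm"),
                "firstNm": info.get("firstNm"),
                "indvlPK": info.get("indvlPK"),
                "fromDt": emp.get("fromDt"),
                "orgNm": emp.get("orgNm"),
                "hasCustComp": has_cust_comp,
            })
    return rows


rows = flatten(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Searching with document-wide expressions such as //EmpHs/@fromDt loses the link to the individual, which is why the sketch searches relative to each Indvl instead.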

Getting table content using BeautifulSoup

I am trying to retrieve table content from this website using the following Python code: https://whalewisdom.com/filer/hillhouse-capital-advisors-ltd#tabholdings_tab_link
stat_table = soup.find_all('table', id='current_holdings_table', class_='table table-bordered table-striped table-hover')
But len(stat_table) returned zero, indicating that nothing was retrieved from the website. Does anyone know where I went wrong? Thank you for the help.
The data you see is loaded via JavaScript from another URL. To load the data, you can use this example:
import json
import requests

# JSON endpoint that the page's JavaScript requests to fill the holdings table
url = 'https://whalewisdom.com/filer/holdings?id=hillhouse-capital-advisors-ltd&q1=-1&type_filter=1,2,3,4&symbol=&change_filter=&minimum_ranking=&minimum_shares=&is_etf=0&sc=true&sort=current_mv&order=desc&offset=0&limit=25'
data = json.loads(requests.get(url).text)
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for row in data['rows']:
    print('{:<5} {:<50} {:<15} {:<15}'.format(row['symbol'], row['name'], row['current_shares'], row['current_mv']))
Prints:
BGNE BeiGene Ltd ADR 147035258.0 28823321625.74
ZM Zoom Video Communications Inc 6856980.0 1738519000.0
IQ iQIYI Inc 46694629.0 1082848000.0
BABA Alibaba Group Holding Ltd ADR 3930086.0 847720000.0
PDD Pinduoduo Inc 9863866.0 846714000.0
UBER Uber Technologies Inc 19260700.0 598623000.0
TAL TAL Education Group American Depositary ADR 7906041.0 540615000.0
JD JD.com Inc ADR 7810402.0 470030000.0
BILI Bilibili Inc 9102063.0 421608000.0
CBPO China Biologic Products Holdings Inc 2962076.0 302665000.0
ESGR Enstar Group Ltd 1747840.0 267018000.0
ALGN Align Technology Inc 790365.0 216908000.0
APLS Apellis Pharmaceuticals Inc 5028289.0 164224000.0
FGEN FibroGen Inc 3955787.0 160328000.0
BBIO BridgeBio Pharma Inc 4711604.0 153645000.0
TSLA Tesla Inc 130378.0 140783000.0
CRM Salesforce.com Inc. 709495.0 132910000.0
ZTO ZTO Express Cayman Inc ADR 3433592.0 126047000.0
MDLZ Mondelez International Inc. (Kraft Foods) 2431164.0 124305000.0
VIE Viela Bio, Inc. 2815868.0 121983000.0
VIPS Vipshop Holdings Ltd ADR 5477392.0 109055000.0
BPMC Blueprint Medicines Corp 1364631.0 106441000.0
ARGX Argenx SE ADS ADR 470000.0 105858000.0
GOSS Gossamer Bio Inc 7420974.0 96473000.0
BEAM Beam Therapeutics Inc. 2966403.0 83059000.0

How to read csv with double quotes from WoS?

I'm trying to read CSV files from the citation report of Web of Science. This is the structure of the file:
TI=clinical case of cognitive dysfunction syndrome AND CU=MEXICO
null
Timespan=All years. Indexes=SCI-EXPANDED, SSCI, A&HCI, ESCI.
"Title","Authors","Corporate Authors","Editors","Book Editors","Source Title","Publication Date","Publication Year","Volume","Issue","Part Number","Supplement","Special Issue","Beginning Page","Ending Page","Article Number","DOI","Conference Title","Conference Date","Total Citations","Average per Year","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016"
""Didy," a clinical case of cognitive dysfunction syndrome","Heiblum, Moises; Labastida, Rocio; Chavez Gris, Gilberto; Tejeda, Alberto","","","","JOURNAL OF VETERINARY BEHAVIOR-CLINICAL APPLICATIONS AND RESEARCH","MAY-JUN 2007","2007","2","3","","","","68","72","","10.1016/j.jveb.2007.05.002","","","2","0.20","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","1","0","0","0","1","0","0","0"
""Didy," a clinical case of cognitive dysfunction syndrome (vol 2, pg 68, 2007)","Heiblum, A.; Labastida, R.; Gris, Chavez G.; Tejeda, A.; Edwards, Claudia","","","","JOURNAL OF VETERINARY BEHAVIOR-CLINICAL APPLICATIONS AND RESEARCH","SEP-OCT 2007","2007","2","5","","","","183","183","","","","","0","0.00","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
I managed to import it using fread; however, I still want to know which quote setting is appropriate, and why read.csv is assigning "Didy," as row names even though row.names is NULL. These are the arguments that I'm using:
s_file <- read.csv("savedrecs.txt",
                   skip = 4,
                   header = TRUE,
                   row.names = NULL,
                   quote = '\"',
                   stringsAsFactors = FALSE)
What you have shown is not a valid CSV file. There are some doubled double quotes (i.e. "") that are not followed by a comma. For example, there is one at the beginning of the second data line:
""Didy," a clinical case of cognitive dysfunction syndrome", etc.
So the parser sees an empty string, followed by Didy, followed by " a clinical case of cognitive dysfunction syndrome". Fix up the file and you should be OK. E.g. the second line should start with
"","Didy","a clinical case of cognitive dysfunction syndrome"
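
The quoting rule behind this answer can be checked with a short sketch. It uses Python's csv module rather than R's read.csv, so parser details may differ, but the convention is the usual CSV one: inside a quoted field, a literal double quote is written as two double quotes. The two-column row here is a made-up reduction of the real record, and this form keeps the title as a single field, which is an alternative to the three-field fix suggested above.

```python
import csv
import io

# Title exactly as it should read once parsed (it contains literal quotes).
title = '"Didy," a clinical case of cognitive dysfunction syndrome'

# Write one row with every field quoted; the writer doubles embedded quotes.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow([title, "2007"])
line = buffer.getvalue().strip()
print(line)

# Reading the valid line back recovers the original title as one field.
row = next(csv.reader(io.StringIO(line)))
print(row)
```

A file whose title fields follow this form should load with the default double-quote setting and without the stray row names.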

change different columns using colClasses in a single attempt in R

I have a dataset in CSV format; a sample of the data is pasted below.
It has a Transaction.Date column, which holds dates although read.csv does not read it as a Date; a Transaction column, which is a string that I want to convert to Dr. (0) and Cr. (1); and Amount and Balance columns, which contain commas and which I want to convert to numeric values so that I can plot them.
To convert Cr. (1) and Dr. (0), I found this solution in my previous question:
setClass("CrDr")
setAs("character", "CrDr", function(from) c(Cr.=1,Dr.=0)[from])
So now I have three things to do while reading the CSV:
Transaction.Date <- date
Transaction <- Dr.(0) Cr.(1)
Amount/Balance <- numeric
How can I achieve all of these changes in a single attempt?
Data Sample
Transaction Date  Remarks                                             Transaction  Amount       Balance
26/05/2014        ATM/CASH WDL/26-05-14/18:12:12/0                    Dr.          3,000.00     1,11,216.17
26/05/2014        ATD/Auto Debit CC5xx3009                            Dr.          3,953.22     1,14,216.17
22/05/2014        TRFR FROM:SRI GANESH INFRATECH &SOFTWARE PVT LTD    Cr.          36,000.00    1,18,169.39
21/05/2014        BIL/000593351901/priyanka/VODAESP_MICI335           Dr.          555          82,169.39
17/05/2014        IPS/SPENCERS RE/20140517124555/0                    Dr.          514          82,724.39
12/5/2014         BIL/000589207330/Kolkataairfare/INDIGO_MICI3346     Dr.          7,617.00     83,238.39
6/5/2014          BIL/000586940549/Mumma#May/NSP                      Dr.          1,10,000.00  90,855.39
3/5/2014          BIL/000585385115/airtel#bb/AIRTEL_MICI3338          Dr.          797          2,00,855.39
3/5/2014          IPS/SPENCERS RE/20140503112817/0                    Dr.          328          2,01,652.39
1/5/2014          NEFT-AXMB141215740194-ABHISHEK CHOUDHARY-may month  Cr.          1,00,000.00  2,01,980.39
29/04/2014        TRFR FROM:SRI GANESH INFRATECH & SOFTWARE PVT LTD   Cr.          12,000.00    1,01,980.39
26/04/2014        ATM/CASH WDL/26-04-14/21:20:31/0                    Dr.          1,000.00     89,980.39
25/04/2014        ATD/Auto Debit CC5xx3009                            Dr.          897          90,980.39
19/04/2014        VIN/Tata_Sky_DT/20140419180921/0                    Dr.          351          91,877.39
10/4/2014         BY CASH - BHOPAL                                    Cr.          3,000.00     92,228.39
31/03/2014        BIL/000570396248/Mumma#Mar/NSP                      Dr.          1,50,000.00  89,228.39
31/03/2014        NEFT-AXMB140902244145-ABHISHEK CHOUDHARY-           Cr.          30,000.00    2,39,228.39
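
The idea of applying one converter per column while reading can be sketched as follows. This is Python rather than R, so it only mirrors what colClasses with setAs does. It assumes the file is tab-separated, which the sample does not confirm; the names CONVERTERS and read_transactions are hypothetical.

```python
import csv
import io
from datetime import datetime

# Three rows from the sample, written tab-separated (the delimiter is an assumption).
SAMPLE = (
    "Transaction Date\tRemarks\tTransaction\tAmount\tBalance\n"
    "26/05/2014\tATM/CASH WDL/26-05-14/18:12:12/0\tDr.\t3,000.00\t1,11,216.17\n"
    "6/5/2014\tBIL/000586940549/Mumma#May/NSP\tDr.\t1,10,000.00\t90,855.39\n"
    "1/5/2014\tNEFT-AXMB141215740194-ABHISHEK CHOUDHARY-may month\tCr.\t1,00,000.00\t2,01,980.39\n"
)

# One converter per column, applied while reading; other columns stay strings.
CONVERTERS = {
    "Transaction Date": lambda s: datetime.strptime(s, "%d/%m/%Y").date(),
    "Transaction": lambda s: {"Cr.": 1, "Dr.": 0}[s],
    "Amount": lambda s: float(s.replace(",", "")),
    "Balance": lambda s: float(s.replace(",", "")),
}


def read_transactions(handle):
    """Read tab-separated rows and convert each column in a single pass."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {name: CONVERTERS.get(name, str)(value) for name, value in row.items()}
        for row in reader
    ]


rows = read_transactions(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

In R, the analogous step would be to register further setAs converters (for the date format and for comma-stripped numbers) alongside the existing "CrDr" one and name them all in colClasses.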
