How to match a multiline pattern in R

I am trying to match a pattern in a text file. It works well as long as the pattern sits within one line, but in some cases the pattern can span two lines.
I have the following code:
#indicate the Name pattern to R
name_pattern = '<nameOfIssuer>([^<]*)</nameOfIssuer>'
#Collect the lines that match the pattern we are looking for
datalines = grep(name_pattern, thepage[1:length(thepage)], value = TRUE)
#We will use gregexpr and gsub to extract the information without the html tags
#create a function first
getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
gg = gregexpr(name_pattern, datalines)
matches = mapply(getexpr, datalines, gg)
result = gsub(name_pattern, '\\1', matches)
result <- gsub("&amp;", "&", result)
names(result) = NULL
It works well when the text looks like this:
<nameOfIssuer>Posco ADR</nameOfIssuer>
But when the text is as follows, it does not work:
<nameOfIssuer>Bank of
America Corp</nameOfIssuer>
Does anyone know how to handle both cases dynamically?
The full text is as follows:
<SEC-DOCUMENT>0001437749-18-018038.txt : 20181009
<SEC-HEADER>0001437749-18-018038.hdr.sgml : 20181009
<ACCEPTANCE-DATETIME>20181005183736
ACCESSION NUMBER: 0001437749-18-018038
CONFORMED SUBMISSION TYPE: 13F-HR
PUBLIC DOCUMENT COUNT: 2
CONFORMED PERIOD OF REPORT: 20180930
FILED AS OF DATE: 20181009
DATE AS OF CHANGE: 20181005
EFFECTIVENESS DATE: 20181009
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: DAILY JOURNAL CORP
CENTRAL INDEX KEY: 0000783412
STANDARD INDUSTRIAL CLASSIFICATION: NEWSPAPERS: PUBLISHING OR PUBLISHING & PRINTING [2711]
IRS NUMBER: 954133299
STATE OF INCORPORATION: SC
FISCAL YEAR END: 0930
FILING VALUES:
FORM TYPE: 13F-HR
SEC ACT: 1934 Act
SEC FILE NUMBER: 028-15782
FILM NUMBER: 181111587
BUSINESS ADDRESS:
STREET 1: 915 EAST FIRST STREET
CITY: LOS ANGELES
STATE: CA
ZIP: 90012
BUSINESS PHONE: 2132295300
MAIL ADDRESS:
STREET 1: 915 EAST FIRST STREET
CITY: LOS ANGELES
STATE: CA
ZIP: 90012
FORMER COMPANY:
FORMER CONFORMED NAME: DAILY JOURNAL CO
DATE OF NAME CHANGE: 19870427
</SEC-HEADER>
<DOCUMENT>
<TYPE>13F-HR
<SEQUENCE>1
<FILENAME>primary_doc.xml
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?>
<edgarSubmission xmlns="http://www.sec.gov/edgar/thirteenffiler" xmlns:com="http://www.sec.gov/edgar/common">
<headerData>
<submissionType>13F-HR</submissionType>
<filerInfo>
<liveTestFlag>LIVE</liveTestFlag>
<flags>
<confirmingCopyFlag>false</confirmingCopyFlag>
<returnCopyFlag>true</returnCopyFlag>
<overrideInternetFlag>false</overrideInternetFlag>
</flags>
<filer>
<credentials>
<cik>0000783412</cik>
<ccc>XXXXXXXX</ccc>
</credentials>
</filer>
<periodOfReport>09-30-2018</periodOfReport>
</filerInfo>
</headerData>
<formData>
<coverPage>
<reportCalendarOrQuarter>09-30-2018</reportCalendarOrQuarter>
<filingManager>
<name>DAILY JOURNAL CORP</name>
<address>
<com:street1>915 EAST FIRST STREET</com:street1>
<com:city>LOS ANGELES</com:city>
<com:stateOrCountry>CA</com:stateOrCountry>
<com:zipCode>90012</com:zipCode>
</address>
</filingManager>
<reportType>13F HOLDINGS REPORT</reportType>
<form13FFileNumber>028-15782</form13FFileNumber>
<provideInfoForInstruction5>N</provideInfoForInstruction5>
</coverPage>
<signatureBlock>
<name>Gerald L. Salzman</name>
<title>Chief Executive Officer, President, CFO, Treasurer</title>
<phone>213-229-5300</phone>
<signature>/s/ Gerald L. Salzman</signature>
<city>Los Angeles</city>
<stateOrCountry>CA</stateOrCountry>
<signatureDate>10-05-2018</signatureDate>
</signatureBlock>
<summaryPage>
<otherIncludedManagersCount>0</otherIncludedManagersCount>
<tableEntryTotal>4</tableEntryTotal>
<tableValueTotal>159459</tableValueTotal>
<isConfidentialOmitted>false</isConfidentialOmitted>
</summaryPage>
</formData>
</edgarSubmission>
</XML>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>INFORMATION TABLE
<SEQUENCE>2
<FILENAME>rdgit100518.xml
<TEXT>
<XML>
<?xml version="1.0" encoding="us-ascii"?>
<informationTable xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.sec.gov/edgar/document/thirteenf/informationtable">
<infoTable>
<nameOfIssuer>Bank of
America Corp</nameOfIssuer>
<titleOfClass>Common Stock</titleOfClass>
<cusip>060505104</cusip>
<value>67758</value>
<shrsOrPrnAmt>
<sshPrnamt>2300000</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>2300000</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>Posco ADR</nameOfIssuer>
<titleOfClass>Sponsored ADR</titleOfClass>
<cusip>693483109</cusip>
<value>643</value>
<shrsOrPrnAmt>
<sshPrnamt>9745</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>9745</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>US Bancorp</nameOfIssuer>
<titleOfClass>Common Stock</titleOfClass>
<cusip>902973304</cusip>
<value>7393</value>
<shrsOrPrnAmt>
<sshPrnamt>140000</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>140000</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>Wells Fargo &amp; Co</nameOfIssuer>
<titleOfClass>Common Stock</titleOfClass>
<cusip>949746101</cusip>
<value>83665</value>
<shrsOrPrnAmt>
<sshPrnamt>1591800</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>1591800</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
</informationTable>
</XML>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>

Assuming there could be multiple matching <nameOfIssuer> tags in your document and you want to match all of them, we can try using gregexpr with regmatches:
input <- "<nameOfIssuer>Bank of\n America Corp</nameOfIssuer>\n blah blah blah \n"
input <- paste0(input, "<nameOfIssuer>Citigroup</nameOfIssuer>")
m <- gregexpr("(?<=<nameOfIssuer>)([^<]*?)(?=</nameOfIssuer>)", input, perl=TRUE)
regmatches(input, m)[[1]]
[1] "Bank of\n America Corp" "Citigroup"

Using Tim's solution plus the collapse option of paste, the program works. The code is as follows:
thepage <- paste(thepage, collapse = "")
m <- gregexpr("(?<=<nameOfIssuer>)([^<]*?)(?=</nameOfIssuer>)", thepage, perl=TRUE)
result <- regmatches(thepage, m)[[1]]
names(result) = NULL
#put the result into a dataframe
Positions = as.data.frame(matrix(result, ncol=1, byrow = TRUE))
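As a side note, since the information table is itself XML, an XML parser handles values that span lines without any regex. A minimal sketch, assuming the <XML>...</XML> portion of the filing has been isolated into a single string xml_text (a name not used in the original code):
library(XML)
doc <- xmlParse(xml_text, asText = TRUE)
# local-name() sidesteps the default namespace on <informationTable>
issuers <- xpathSApply(doc, "//*[local-name()='nameOfIssuer']", xmlValue)
# collapse the embedded line break into a single space
issuers <- gsub("\\s+", " ", issuers)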

Related

How to separate a JSON string into multiple columns in R

I have a column with strings in dictionary format, like this:
{'district': 'Ilha do Retiro', 'city': 'RECIFE', 'state': 'PE', 'country': 'BR', 'latitude': -8.062004, 'longitude': -34.908081, 'timezone': 'Etc/GMT+3', 'zipCode': '50830000', 'streetName': 'Avenida Engenheiro Abdias de Carvalho', 'streetNumber': '365'}
I want to create one column for each key, filled with the value for each row.
I tried using separate with regex:
separate(data, address, into = c('district', 'city', 'state', 'country',
'latitude', 'longitude', 'timezone',
'zipCode', 'streetName', 'streetNumber'),
"/:.*?/,")
However this is not working, and perhaps using separate with regex is not even the best course of action for this. Any suggestions on how to do it?
Since it is effectively JSON (though with single-quotes instead of the requisite double-quotes), we can use jsonlite:
txt <- "{'district': 'Ilha do Retiro', 'city': 'RECIFE', 'state': 'PE', 'country': 'BR', 'latitude': -8.062004, 'longitude': -34.908081, 'timezone': 'Etc/GMT+3', 'zipCode': '50830000', 'streetName': 'Avenida Engenheiro Abdias de Carvalho', 'streetNumber': '365'}"
as.data.frame(jsonlite::parse_json(gsub("'", '"', txt)))
# district city state country latitude longitude timezone zipCode streetName streetNumber
# 1 Ilha do Retiro RECIFE PE BR -8.062004 -34.90808 Etc/GMT+3 50830000 Avenida Engenheiro Abdias de Carvalho 365
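If the whole column needs to be expanded, a hedged sketch building on the same idea (this assumes the data frame is called data with an address column, as in the question, and that every row parses to the same set of keys):
parsed <- lapply(data$address, function(x) {
  as.data.frame(jsonlite::parse_json(gsub("'", '"', x)))
})
# bind the parsed key/value pairs back onto the original data as new columns
data <- cbind(data, do.call(rbind, parsed))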

R extract specific word after keyword

How do I extract a specific word after a keyword in R?
I have the following input text, which contains details about a policy. I need to extract specific values such as FirstName, SurName, FatherName and dob.
input.txt
In Case of unit linked plan, Investment risk in Investment Portfolio is borne by the policyholder.
ly
c I ROPOSAL FORM z
Insurance
Proposal Form Number: 342525 PF 42242
Advisor Coe aranch Code 2
Ff roanumber =F SSOS™S™~™S~S rancid ate = |
IBR. Code S535353424
re GFN ——
INSTRUCTION FOR FILLING THES APPLICATION FORM ; 1. Compiets the proocsal form in CAPITAL LETTERS using = Black Ball Point P]n. 2. Sless= mark your selection by marking “X" insides the
Boe. 3. Slnsse bases 2 Blank soece after eect word, letter or initial 4. Slssse write "MA" for questions whic are not apolicatie. 5.00 NOT USE the Sor") to identify your initial or seperate the sddressiiine.
6. Sulmissson of age proof ie mandatory along wall Ge propel fonm.
IMPORTANT INSTRUCTIONS WITH REGARD TO DISCLOSURE OF INFORMATION: Inturance it a contract of UTMOST GOOD FAITH and itis required by disclose all material and nelevant
fach: complebehy, DO) NOT suppress any fac: in response by the questions in the priposal form. FAILURE TO PROVIDE COMPLETE AND ACCURATE INFORMATION OR
MISREPRESENTATION OF THE FACTS COULD DECLARE THES POLICY CONTRACT NULL AND VOID AFTER PAYMENT OF SURRENDER VALUE, IF ANY, SUBJECT TO SECTION 45 OF
INSURANCE ACT, 1998 As AMENDED FROM TIME TO TIME,
Section I - Details of the Life to be Assured
1. Tite E-] Mr. LJ Mrs. LJ Miss [J Or. LJ Others (Specify)
2. FirstName PETER PAUL
3. Surname T
44. Father's Name
46, Mother's Name ERIKA RESWE D
5. Date of Birth 13/02/1990 6, Gender E] Male ] Female
7. Age Proof L] School Certificate [] Driving License [] Passport {Birth Certificate E"] PAN Card
3, Marital Status D) Single EF] Married 0 Widower) 0 Civorcee
9, Spouse Name ERISEWQ FR
10. Maiden Name
iL. Nationality -] Resident Indian National [J Non Resident Indian (MRI) L] Others (Specify)
12, Education J Postgraduate / Doctorate Ee) Graduate [] 12thstd. Pass [J 10thstd. Pass [J Below 10th std.
OO Dliterate / Uneducated CJ Others (Specify)
13. Address For No 7¥%a vaigai street Flower
Communication Nagar selaiyur
Landmark
City Salem
Pin Code BO00 73: State TAMIL NADU
Address proof [] Passport ([] Driving License [] Voter ID [] Bank Statement [] Utility Bill G4 Others (Specify) Aadhaar Card
14, Permanent No 7¥a vaigai street Flower
Address :
Nagar selaiyur
Landmark
City Salem
Pin Code 5353535 state (TAMIL NADU
Address proof CJ] Passport [9 DrivingLicense [J Voter ID [ Bank Statement [ Utility Bill B] Others (Specify) Aadhaar Card
15. Contact Details Mobile 424242424 Phone (Home)
Office / Business
E-mail fdgrgtr13#yahoo.com
Preferred mode: ((] Letter EF) E-Mail
Preferred Language for Letter {other than English): [] Hindi [] Kannada [-] Tamil J Telugu C] Malayalam C) Gujarati
Bengali GOriya =D] Marathi
16. Occupation CL] Salaried-Govt /PSU ( Salaried-other [9 Self Employed Professional [J Aagriculturist {Farmer [Part Time Business
LJ Retired ] Landlord J Student (current Std) -] Others (Specify) Salaried - MNC
17. Full Name of the Capio software
Employers Businnes/
School/College
18, Designation & Exact nature of Work / Business Manager
19. AnnualIncomein 1,200,000.00 20. Annual Income of Husband / Father = 1,500,000.00
Figures (%) (for female and minor lives)
21. Exact nature of work / business of Husband / Father for female and minor lives Government Employee
Page 10fé
The code below works for me, but the problem is that if the line order changes, everything breaks. Is there a way to extract the keyword values irrespective of line order?
Current Code
path <- getwd()
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
fName <- sub('.*FirstName', '', my_txt[7])
SName <- sub('.*Surname', '', my_txt[8])
FatherNm <- sub(".*Father's Name", '', my_txt[9])
dob <- sub("6, Gender.*", '',sub(".*Date of Birth", '', my_txt[11]))
You can combine the text into one string and extract the values based on patterns in the data. This approach works irrespective of line numbers, provided the patterns hold for all the files.
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
#Collapse data in one string
text <- paste0(my_txt, collapse = '\n')
#Extract text after FirstName till '\n'
fName <- sub('.*FirstName (.*?)\n.*', '\\1', text)
fName
#[1] "John Woo"
#Extract text after Surname till '\n'
SName <- sub('.*Surname (.*?)\n.*', '\\1', text)
SName
#[1] "T"
#Extract text after Father's Name till '\n'
FatherNm <- sub(".*Father's Name (.*?)\n.*", '\\1', text)
FatherNm
#[1] "Bill Woo"
#Extract numbers which come after Date of Birth.
dob <- sub(".*Date of Birth (\\d+/\\d+/\\d+).*", '\\1', text)
dob
#[1] "13/07/1970"

Data scraping with a list in Excel

I have a list in Excel: one code in Column A and another in Column B.
There is a website where I need to input both details into two different boxes, which then takes me to another page.
That page contains certain details which I need to scrape into Excel.
Any help with this?
Ok. Give this a shot:
import pandas as pd
import requests

df = pd.read_excel('C:/test/data.xlsx')
url = 'http://rla.dgft.gov.in:8100/dgft/IecPrint'
results = pd.DataFrame()
for row in df.itertuples():
    payload = {
        'iec': '%010d' % row[1],
        'name': row[2]}
    response = requests.post(url, params=payload)
    print('IEC: %010d\tName: %s' % (row[1], row[2]))
    try:
        dfs = pd.read_html(response.text)
    except:
        print('The name Given By you does not match with the data OR you have entered less than three letters')
        temp_df = pd.DataFrame([['%010d' % row[1], row[2], 'ERROR']],
                               columns=['IEC', 'Party Name and Address', 'ERROR'])
        results = results.append(temp_df, sort=False).reset_index(drop=True)
        continue
    generalData = dfs[0]
    generalData = generalData.iloc[:, [0, -1]].set_index(generalData.columns[0]).T.reset_index(drop=True)
    directorData = dfs[1]
    directorData = directorData.iloc[:, [-1]].T.reset_index(drop=True)
    directorData.columns = ['director_%02d' % (each + 1) for each in directorData.columns]
    try:
        branchData = dfs[2]
        branchData = branchData.iloc[:, [-1]].T.reset_index(drop=True)
        branchData.columns = ['branch_%02d' % (each + 1) for each in branchData.columns]
    except:
        branchData = pd.DataFrame()
        print('No Branch Data.')
    temp_df = pd.concat([generalData, directorData, branchData], axis=1)
    results = results.append(temp_df, sort=False).reset_index(drop=True)
results.to_excel('path.new_file.xlsx', index=False)
Output:
print (results.to_string())
IEC IEC Allotment Date File Number File Date Party Name and Address Phone No e_mail Exporter Type IEC Status Date of Establishment BIN (PAN+Extension) PAN ISSUE DATE PAN ISSUED BY Nature Of Concern Banker Detail director_01 director_02 director_03 branch_01 branch_02 branch_03 branch_04 branch_05 branch_06 branch_07 branch_08 branch_09
0 0305008111 03.05.2005 04/04/131/51473/AM20/ 20.08.2019 NISSAN MOTOR INDIA PVT. LTD. PLOT-1A,SIPCOT IN... 918939917907 shailesh.kumar#rnaipl.com 5 Merchant/Manufacturer Valid IEC 2005-02-07 AACCN0695D FT001 NaN NaN 3 Private Limited STANDARD CHARTERED BANK A/C Type:1 CA A/C No :... HARDEEP SINGH BRAR GURMEL SINGH BRAR HOUSE NO ... JEROME YVES MARIE SAIGOT THIERRY SAIGOT A9/2, ... KOJI KAWAKITA KIHACHI KAWAKITA 3-21-3, NAGATAK... Branch Code:165TH FLOOR ORCHID BUSINESS PARK,S... Branch Code:14NRPDC , WAREHOUSE NO.B -2A,PATAU... Branch Code:12EQUINOX BUSINESS PARK TOWER 3 4T... Branch Code:8GRAND PALLADIUM,5TH FLR.,B WING,,... Branch Code:6TVS LOGISTICS SERVICES LTD.SING,C... Branch Code:2PLOT 1A SIPCOT INDUL PARK,ORAGADA... Branch Code:5BLDG.NO.3 PART,124A,VALLAM A,SRIP... Branch Code:15SURVEY NO. 678 679 680 681 682 6... Branch Code:10INDOSPACE SKCL INDL.PARK,BULD.NO...

XML parsing with namespaces using XML package in R

I am trying to parse an XML file using the XML package in R. Sample XML content is as follows:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Header>
<messageContext xmlns="http://www.deltavista.com/dspone/ordercheck-if/V001">
<credentials>
<user>foobar</user>
<password>barbaz</password>
</credentials>
</messageContext>
</soapenv:Header>
<soapenv:Body>
<ns1:orderCheckResponse xmlns:ns1="http://www.deltavista.com/dspone/ordercheck-if/V001">
<ns1:returnCode>1</ns1:returnCode>
<ns1:product>
<ns1:name>Consumer</ns1:name>
<ns1:country>POL</ns1:country>
<ns1:language>POL</ns1:language>
</ns1:product>
<ns1:archiveID>420</ns1:archiveID>
<ns1:reportCreationTime>201911151220</ns1:reportCreationTime>
<ns1:foundAddress>
<ns1:legalForm>PERSON</ns1:legalForm>
<ns1:address>
<ns1:name>John</ns1:name>
<ns1:firstName>Dow</ns1:firstName>
<ns1:gender>MALE</ns1:gender>
<ns1:dateOfBirth>19960410</ns1:dateOfBirth>
<ns1:location>
<ns1:street>nowhere</ns1:street>
<ns1:house>48</ns1:house>
<ns1:city>farfarland</ns1:city>
<ns1:zip>00-500</ns1:zip>
<ns1:country>POL</ns1:country>
</ns1:location>
</ns1:address>
</ns1:foundAddress>
<ns1:myDecision>
<ns1:decision>YELLOW</ns1:decision>
</ns1:myDecision>
<ns1:personBasicData>
<ns1:knownSince>20181201</ns1:knownSince>
<ns1:contact>
<ns1:item>EMAIL</ns1:item>
<ns1:value>foo#gmail.com</ns1:value>
</ns1:contact>
<ns1:contact>
<ns1:item>PHONE</ns1:item>
<ns1:value>123456789</ns1:value>
</ns1:contact>
</ns1:personBasicData>
<ns1:decisionMatrix>
<ns1:identificationDecision>
<ns1:personStatus xsi:type="ns1:DecisionMatrixItemPersonStatus" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>ADULT</ns1:value>
</ns1:personStatus>
<ns1:identificationType xsi:type="ns1:DecisionMatrixItemIdentificationType" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>IDENTITY_IN_CITY</ns1:value>
</ns1:identificationType>
<ns1:similarHit xsi:type="ns1:DecisionMatrixItemInt" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>0</ns1:value>
</ns1:similarHit>
<ns1:houseType xsi:type="ns1:DecisionMatrixItemString" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>SHARED_USAGE</ns1:value>
</ns1:houseType>
<ns1:nameHint xsi:type="ns1:DecisionMatrixItemNameHint" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>CONFIRMED</ns1:value>
</ns1:nameHint>
<ns1:locationIdentificationType xsi:type="ns1:DecisionMatrixItemLocationIdentificationType" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>HOUSE_CONFIRMED</ns1:value>
</ns1:locationIdentificationType>
</ns1:identificationDecision>
<ns1:solvencyDecision>
<ns1:paymentExperience xsi:type="ns1:DecisionMatrixItemPHS" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>NOPROBLEM</ns1:value>
</ns1:paymentExperience>
<ns1:externalSourcesProcessingStatus xsi:type="ns1:DecisionMatrixItemExternalProcessingStatus" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>OK</ns1:value>
</ns1:externalSourcesProcessingStatus>
</ns1:solvencyDecision>
<ns1:clientExtensionsDecision>
<ns1:applicationFilter xsi:type="ns1:DecisionMatrixItemStringWithOverride" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>GREEN</ns1:partialDecision>
<ns1:value>0</ns1:value>
</ns1:applicationFilter>
<ns1:myScore xsi:type="ns1:DecisionMatrixItemInt" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:partialDecision>YELLOW</ns1:partialDecision>
<ns1:value>401</ns1:value>
</ns1:myScore>
</ns1:clientExtensionsDecision>
</ns1:decisionMatrix>
<ns1:paymentHistory>
<ns1:currency>PLN</ns1:currency>
<ns1:count>0</ns1:count>
<ns1:dateOfLastEntry>20191111</ns1:dateOfLastEntry>
<ns1:amountTotal>0.0</ns1:amountTotal>
<ns1:amountTotalOpen>0.0</ns1:amountTotalOpen>
<ns1:creditStatusMax>0</ns1:creditStatusMax>
<ns1:masterRiskStatus>Brak danych o negatywnej historii</ns1:masterRiskStatus>
</ns1:paymentHistory>
<ns1:normalization>
<ns1:searchedAddress xsi:type="ns1:SearchedAddressN" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:name>John</ns1:name>
<ns1:firstName>Dow</ns1:firstName>
<ns1:gender>MALE</ns1:gender>
<ns1:dateOfBirth>19960410</ns1:dateOfBirth>
<ns1:location>
<ns1:street>nowhere</ns1:street>
<ns1:house>39</ns1:house>
<ns1:houseExtension/>
<ns1:city>farfarland</ns1:city>
<ns1:zip>00-500</ns1:zip>
<ns1:country>POL</ns1:country>
</ns1:location>
<ns1:addressID>123</ns1:addressID>
<ns1:unitID>123</ns1:unitID>
<ns1:liableID>1231</ns1:liableID>
<ns1:houseID>1232</ns1:houseID>
<ns1:streetID>1233</ns1:streetID>
<ns1:cityID>1234</ns1:cityID>
</ns1:searchedAddress>
<ns1:foundAddress>
<ns1:addressID>1235</ns1:addressID>
<ns1:unitID>1236</ns1:unitID>
<ns1:liableID>1237</ns1:liableID>
<ns1:houseID>1238</ns1:houseID>
<ns1:streetID>1239</ns1:streetID>
<ns1:cityID>1230</ns1:cityID>
</ns1:foundAddress>
</ns1:normalization>
<ns1:clientExtensions>
<ns1:additionalData>
<ns1:name>pesel_verification_status</ns1:name>
<ns1:value>1</ns1:value>
</ns1:additionalData>
<ns1:additionalData>
<ns1:name>pesel_verification_execution_code</ns1:name>
<ns1:value>200</ns1:value>
</ns1:additionalData>
<ns1:additionalData>
<ns1:name>pesel_verification_codes</ns1:name>
<ns1:value>12010; 12013</ns1:value>
</ns1:additionalData>
</ns1:clientExtensions>
<ns1:executionStrategy/>
</ns1:orderCheckResponse>
</soapenv:Body>
</soapenv:Envelope>
Here is the code I am using to read and parse this XML content (referred to as str henceforth):
library(XML)
foobar <- xmlInternalTreeParse(str, encoding = 'KOI8-R', useInternalNodes = F)
xmlSApply(foobar$doc$children$Envelope, function(x) xmlSApply(x, names))
xmlSApply(foobar$doc$children$Envelope, function(x) xmlSApply(x, function(x1) xmlSApply(x1, names)))
Here I am able to parse the XML content and tried iterating over the nodes to at least print their names. However, I couldn't extract the values inside, even though I read many SO questions and tried countless combinations using xpathApply() etc. (reference)
Any hint on what I might be doing wrong here?
Consider a straightforward descendant XPath search with //* that acknowledges namespaces to retrieve all element names and values. (Since this particular XPath does not reference any namespace prefixes, the namespaces argument is actually redundant in the xpathSApply calls below.)
doc <- xmlParse(str, asText=TRUE)
nmsp <- c(soapenv = "http://schemas.xmlsoap.org/soap/envelope/",
doc = "http://www.deltavista.com/dspone/ordercheck-if/V001",
ns1 = "http://www.deltavista.com/dspone/ordercheck-if/V001")
# NAMED CHARACTER VECTOR OF ALL 117 ELEMENT NAMES AND VALUES
elem_vals <- setNames(xpathSApply(doc, path="//*", namespaces = nmsp, xmlValue) ,
xpathSApply(doc, path="//*", namespaces = nmsp, xmlName))
Output
Names (first 20 items)
head(names(elem_vals), 20)
# [1] "Envelope" "Header" "messageContext" "credentials" "user"
# [6] "password" "Body" "orderCheckResponse" "returnCode" "product"
# [11] "name" "country" "language" "archiveID" "reportCreationTime"
# [16] "foundAddress" "legalForm" "address" "name" "firstName"
Values (last 20 items)
tail(elem_vals, 20)
# streetID cityID
# "1233" "1234"
# foundAddress addressID
# "123512361237123812391230" "1235"
# unitID liableID
# "1236" "1237"
# houseID streetID
# "1238" "1239"
# cityID clientExtensions
# "1230" "pesel_verification_status1pesel_verification_execution_code200pesel_verification_codes12010; 12013"
# additionalData name
# "pesel_verification_status1" "pesel_verification_status"
# value additionalData
# "1" "pesel_verification_execution_code200"
# name value
# "pesel_verification_execution_code" "200"
# additionalData name
# "pesel_verification_codes12010; 12013" "pesel_verification_codes"
# value executionStrategy
# "12010; 12013" ""

How to read csv with double quotes from WoS?

I'm trying to read CSV files from the citation report of Web of Science. This is the structure of the file:
TI=clinical case of cognitive dysfunction syndrome AND CU=MEXICO
null
Timespan=All years. Indexes=SCI-EXPANDED, SSCI, A&HCI, ESCI.
"Title","Authors","Corporate Authors","Editors","Book Editors","Source Title","Publication Date","Publication Year","Volume","Issue","Part Number","Supplement","Special Issue","Beginning Page","Ending Page","Article Number","DOI","Conference Title","Conference Date","Total Citations","Average per Year","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016"
""Didy," a clinical case of cognitive dysfunction syndrome","Heiblum, Moises; Labastida, Rocio; Chavez Gris, Gilberto; Tejeda, Alberto","","","","JOURNAL OF VETERINARY BEHAVIOR-CLINICAL APPLICATIONS AND RESEARCH","MAY-JUN 2007","2007","2","3","","","","68","72","","10.1016/j.jveb.2007.05.002","","","2","0.20","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","1","0","0","0","1","0","0","0"
""Didy," a clinical case of cognitive dysfunction syndrome (vol 2, pg 68, 2007)","Heiblum, A.; Labastida, R.; Gris, Chavez G.; Tejeda, A.; Edwards, Claudia","","","","JOURNAL OF VETERINARY BEHAVIOR-CLINICAL APPLICATIONS AND RESEARCH","SEP-OCT 2007","2007","2","5","","","","183","183","","","","","0","0.00","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
I managed to import it using fread; however, I still want to know which quote character is appropriate, and why "Didy," is being assigned as row names even though the row.names argument is NULL. These are the arguments I'm using:
s_file <- read.csv("savedrecs.txt",
skip = 4,
header = TRUE,
row.names = NULL,
quote = '\"',
stringsAsFactors = FALSE)
What you have shown is not a valid CSV file. There are some doubled double quotes (i.e. "") without a comma between them. For example, there is one at the beginning of the second line:
""Didy," a clinical case of cognitive dysfunction syndrome", etc.
So the parser thinks there is a null field, followed by Didy, followed by " a clinical case of cognitive dysfunction syndrome". Fix up the file and you should be OK; e.g. the second line should start with:
"","Didy","a clinical case of cognitive dysfunction syndrome"
