How to read csv with double quotes from WoS? - r

I'm trying to read CSV files from the citation report of Web of Science. This is the structure of the file:
TI=clinical case of cognitive dysfunction syndrome AND CU=MEXICO
null
Timespan=All years. Indexes=SCI-EXPANDED, SSCI, A&HCI, ESCI.
"Title","Authors","Corporate Authors","Editors","Book Editors","Source Title","Publication Date","Publication Year","Volume","Issue","Part Number","Supplement","Special Issue","Beginning Page","Ending Page","Article Number","DOI","Conference Title","Conference Date","Total Citations","Average per Year","1988","1989","1990","1991","1992","1993","1994","1995","1996","1997","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013","2014","2015","2016"
""Didy," a clinical case of cognitive dysfunction syndrome","Heiblum, Moises; Labastida, Rocio; Chavez Gris, Gilberto; Tejeda, Alberto","","","","JOURNAL OF VETERINARY BEHAVIOR-CLINICAL APPLICATIONS AND RESEARCH","MAY-JUN 2007","2007","2","3","","","","68","72","","10.1016/j.jveb.2007.05.002","","","2","0.20","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","1","0","0","0","1","0","0","0"
""Didy," a clinical case of cognitive dysfunction syndrome (vol 2, pg 68, 2007)","Heiblum, A.; Labastida, R.; Gris, Chavez G.; Tejeda, A.; Edwards, Claudia","","","","JOURNAL OF VETERINARY BEHAVIOR-CLINICAL APPLICATIONS AND RESEARCH","SEP-OCT 2007","2007","2","5","","","","183","183","","","","","0","0.00","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
I managed to import it using fread; however, I still want to know which quote argument is appropriate and why "Didy," is being assigned as row names even though row.names is NULL. These are the arguments I'm using:
s_file <- read.csv("savedrecs.txt",
                   skip = 4,
                   header = TRUE,
                   row.names = NULL,
                   quote = '\"',
                   stringsAsFactors = FALSE)

What you have shown is not a valid CSV format. There are some doubled double quotes (i.e. "") without a comma; for example, there is one at the beginning of this line:
""Didy," a clinical case of cognitive dysfunction syndrome", etc.
So the parser thinks there is a null, followed by Didy, followed by " a clinical case of cognitive dysfunction syndrome". Fix up the file and you should be OK. E.g. the line should start with:
"","Didy","a clinical case of cognitive dysfunction syndrome"

Related

R extract specific word after keyword

How do I extract a specific word after a keyword in R?
I have the following input text, which contains details about a policy. I need to extract the values of specific fields such as FirstName, Surname, Father's Name, and dob.
input.txt
In Case of unit linked plan, Investment risk in Investment Portfolio is borne by the policyholder.
ly
c I ROPOSAL FORM z
Insurance
Proposal Form Number: 342525 PF 42242
Advisor Coe aranch Code 2
Ff roanumber =F SSOS™S™~™S~S rancid ate = |
IBR. Code S535353424
re GFN ——
INSTRUCTION FOR FILLING THES APPLICATION FORM ; 1. Compiets the proocsal form in CAPITAL LETTERS using = Black Ball Point P]n. 2. Sless= mark your selection by marking “X" insides the
Boe. 3. Slnsse bases 2 Blank soece after eect word, letter or initial 4. Slssse write "MA" for questions whic are not apolicatie. 5.00 NOT USE the Sor") to identify your initial or seperate the sddressiiine.
6. Sulmissson of age proof ie mandatory along wall Ge propel fonm.
IMPORTANT INSTRUCTIONS WITH REGARD TO DISCLOSURE OF INFORMATION: Inturance it a contract of UTMOST GOOD FAITH and itis required by disclose all material and nelevant
fach: complebehy, DO) NOT suppress any fac: in response by the questions in the priposal form. FAILURE TO PROVIDE COMPLETE AND ACCURATE INFORMATION OR
MISREPRESENTATION OF THE FACTS COULD DECLARE THES POLICY CONTRACT NULL AND VOID AFTER PAYMENT OF SURRENDER VALUE, IF ANY, SUBJECT TO SECTION 45 OF
INSURANCE ACT, 1998 As AMENDED FROM TIME TO TIME,
Section I - Details of the Life to be Assured
1. Tite E-] Mr. LJ Mrs. LJ Miss [J Or. LJ Others (Specify)
2. FirstName PETER PAUL
3. Surname T
44. Father's Name
46, Mother's Name ERIKA RESWE D
5. Date of Birth 13/02/1990 6, Gender E] Male ] Female
7. Age Proof L] School Certificate [] Driving License [] Passport {Birth Certificate E"] PAN Card
3, Marital Status D) Single EF] Married 0 Widower) 0 Civorcee
9, Spouse Name ERISEWQ FR
10. Maiden Name
iL. Nationality -] Resident Indian National [J Non Resident Indian (MRI) L] Others (Specify)
12, Education J Postgraduate / Doctorate Ee) Graduate [] 12thstd. Pass [J 10thstd. Pass [J Below 10th std.
OO Dliterate / Uneducated CJ Others (Specify)
13. Address For No 7¥%a vaigai street Flower
Communication Nagar selaiyur
Landmark
City Salem
Pin Code BO00 73: State TAMIL NADU
Address proof [] Passport ([] Driving License [] Voter ID [] Bank Statement [] Utility Bill G4 Others (Specify) Aadhaar Card
14, Permanent No 7¥a vaigai street Flower
Address :
Nagar selaiyur
Landmark
City Salem
Pin Code 5353535 state (TAMIL NADU
Address proof CJ] Passport [9 DrivingLicense [J Voter ID [ Bank Statement [ Utility Bill B] Others (Specify) Aadhaar Card
15. Contact Details Mobile 424242424 Phone (Home)
Office / Business
E-mail fdgrgtr13#yahoo.com
Preferred mode: ((] Letter EF) E-Mail
Preferred Language for Letter {other than English): [] Hindi [] Kannada [-] Tamil J Telugu C] Malayalam C) Gujarati
Bengali GOriya =D] Marathi
16. Occupation CL] Salaried-Govt /PSU ( Salaried-other [9 Self Employed Professional [J Aagriculturist {Farmer [Part Time Business
LJ Retired ] Landlord J Student (current Std) -] Others (Specify) Salaried - MNC
17. Full Name of the Capio software
Employers Businnes/
School/College
18, Designation & Exact nature of Work / Business Manager
19. AnnualIncomein 1,200,000.00 20. Annual Income of Husband / Father = 1,500,000.00
Figures (%) (for female and minor lives)
21. Exact nature of work / business of Husband / Father for female and minor lives Government Employee
Page 10fé
The code below works for me, but the problem is that if the line order changes, everything breaks. Is there a way to extract the keyword values irrespective of line order?
Current Code
path <- getwd()
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
fName <- sub('.*FirstName', '', my_txt[7])
SName <- sub('.*Surname', '', my_txt[8])
FatherNm <- sub(".*Father's Name", '', my_txt[9])
dob <- sub("6, Gender.*", '',sub(".*Date of Birth", '', my_txt[11]))
You can collapse the text into a single string and extract the values based on patterns in the data. This approach works irrespective of line order, provided the patterns hold for all the files.
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
#Collapse data in one string
text <- paste0(my_txt, collapse = '\n')
#Extract text after FirstName till '\n'
fName <- sub('.*FirstName (.*?)\n.*', '\\1', text)
fName
#[1] "John Woo"
#Extract text after Surname till '\n'
SName <- sub('.*Surname (.*?)\n.*', '\\1', text)
SName
#[1] "T"
#Extract text after Father's Name till '\n'
FatherNm <- sub(".*Father's Name (.*?)\n.*", '\\1', text)
FatherNm
#[1] "Bill Woo"
#Extract numbers which come after Date of Birth.
dob <- sub(".*Date of Birth (\\d+/\\d+/\\d+).*", '\\1', text)
dob
#[1] "13/07/1970"

Data Scraping with list in excel

I have a list in Excel: one code in Column A and another in Column B.
There is a website where I need to enter both details into two different boxes, which then takes me to another page.
That page contains certain details which I need to scrape into Excel.
Any help with this?
Ok. Give this a shot:
import pandas as pd
import requests

df = pd.read_excel('C:/test/data.xlsx')
url = 'http://rla.dgft.gov.in:8100/dgft/IecPrint'
results = pd.DataFrame()
for row in df.itertuples():
    payload = {
        'iec': '%010d' % row[1],
        'name': row[2]}
    response = requests.post(url, params=payload)
    print('IEC: %010d\tName: %s' % (row[1], row[2]))
    try:
        dfs = pd.read_html(response.text)
    except:
        print('The name Given By you does not match with the data OR you have entered less than three letters')
        temp_df = pd.DataFrame([['%010d' % row[1], row[2], 'ERROR']],
                               columns=['IEC', 'Party Name and Address', 'ERROR'])
        results = results.append(temp_df, sort=False).reset_index(drop=True)
        continue

    generalData = dfs[0]
    generalData = generalData.iloc[:, [0, -1]].set_index(generalData.columns[0]).T.reset_index(drop=True)

    directorData = dfs[1]
    directorData = directorData.iloc[:, [-1]].T.reset_index(drop=True)
    directorData.columns = ['director_%02d' % (each + 1) for each in directorData.columns]

    try:
        branchData = dfs[2]
        branchData = branchData.iloc[:, [-1]].T.reset_index(drop=True)
        branchData.columns = ['branch_%02d' % (each + 1) for each in branchData.columns]
    except:
        branchData = pd.DataFrame()
        print('No Branch Data.')

    temp_df = pd.concat([generalData, directorData, branchData], axis=1)
    results = results.append(temp_df, sort=False).reset_index(drop=True)

results.to_excel('path.new_file.xlsx', index=False)
Output:
print (results.to_string())
IEC IEC Allotment Date File Number File Date Party Name and Address Phone No e_mail Exporter Type IEC Status Date of Establishment BIN (PAN+Extension) PAN ISSUE DATE PAN ISSUED BY Nature Of Concern Banker Detail director_01 director_02 director_03 branch_01 branch_02 branch_03 branch_04 branch_05 branch_06 branch_07 branch_08 branch_09
0 0305008111 03.05.2005 04/04/131/51473/AM20/ 20.08.2019 NISSAN MOTOR INDIA PVT. LTD. PLOT-1A,SIPCOT IN... 918939917907 shailesh.kumar#rnaipl.com 5 Merchant/Manufacturer Valid IEC 2005-02-07 AACCN0695D FT001 NaN NaN 3 Private Limited STANDARD CHARTERED BANK A/C Type:1 CA A/C No :... HARDEEP SINGH BRAR GURMEL SINGH BRAR HOUSE NO ... JEROME YVES MARIE SAIGOT THIERRY SAIGOT A9/2, ... KOJI KAWAKITA KIHACHI KAWAKITA 3-21-3, NAGATAK... Branch Code:165TH FLOOR ORCHID BUSINESS PARK,S... Branch Code:14NRPDC , WAREHOUSE NO.B -2A,PATAU... Branch Code:12EQUINOX BUSINESS PARK TOWER 3 4T... Branch Code:8GRAND PALLADIUM,5TH FLR.,B WING,,... Branch Code:6TVS LOGISTICS SERVICES LTD.SING,C... Branch Code:2PLOT 1A SIPCOT INDUL PARK,ORAGADA... Branch Code:5BLDG.NO.3 PART,124A,VALLAM A,SRIP... Branch Code:15SURVEY NO. 678 679 680 681 682 6... Branch Code:10INDOSPACE SKCL INDL.PARK,BULD.NO...

Platform Data Extension: how do I parse ROAD_NAME_FCN

I'm using HERE's Platform Data Extension to retrieve road names. However, I don't understand the strings that I'm getting. I suspect they're encoded somehow but I don't know how to decode them.
For example:
ENGBNFDR Dr NNASN"e|fe "de "e|rre "dri|ve "nol|te;NASY"e|fe "de "e|rre;<snip>
If I split them by a "record separator" character, e.g. link_names.split('\x1e') the values look slightly more intelligible, but only slightly. There are still bizarre abbreviations I don't understand, e.g. ENGBN.
The PDE Layers documents can be found here: http://pde.cit.api.here.com/1/doc/content.html?detail=1&app_id=xxx&app_code=yyy
Layers > ROAD_NAME_FC1 > NAMES.
List of all names for this object, in all languages, latin1/pinyin/phonetic transliterations.
For convenience, non-exonym base names are listed first.
Format:
NAMES = NAME1 \u001D NAME2 \u001D NAME3 ...
NAME = NAME_TEXT \u001E TRANSLIT1 ; TRANSLIT2 ; ... \u001E PHONEME1 ; PHONEME2 ; ...
NAME_TEXT = LANGUAGE_CODE NAME_TYPE IS_EXONYM text
TRANSLIT = LANGUAGE_CODE text
PHONEME = LANGUAGE_CODE IS_PREFERRED text
LANGUAGE_CODE is a 3 character string
NAME_TYPE is one letter (A = abbreviation, B = base name, E = exonym, K = shortened name, S = synonym)
IS_EXONYM = Y if the name is a translation into another language
IS_PREFERRED = Y if this is the preferred phoneme.
Please note, the delimiters are:
\u001D between languages (NAMES level)
\u001E between name text, transliterations, and phonemes
';' between different transliterations and phonemes of the same name
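Putting the pieces together, here is a minimal parsing sketch in R; the function and list-field names are mine, not part of the PDE API. It splits a NAMES value on \u001D, splits each name on \u001E, then peels the 3-character language code, the name type and the exonym flag off the front of the name text.
parse_names <- function(names_field) {
  lapply(strsplit(names_field, "\u001d")[[1]], function(name) {
    parts <- strsplit(name, "\u001e")[[1]]   # name text, transliterations, phonemes
    list(
      language  = substr(parts[1], 1, 3),    # e.g. "ENG"
      name_type = substr(parts[1], 4, 4),    # A/B/E/K/S as listed above
      is_exonym = substr(parts[1], 5, 5),    # "Y" if the name is a translation
      text      = substring(parts[1], 6),
      translits = if (length(parts) >= 2) strsplit(parts[2], ";")[[1]] else character(0),
      phonemes  = if (length(parts) >= 3) strsplit(parts[3], ";")[[1]] else character(0)
    )
  })
}
# A name text such as "ENGBNFDR Dr" decodes as language "ENG", type "B" (base name),
# not an exonym, text "FDR Dr".
parse_names("ENGBNFDR Dr")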

Lower Case Certain Words R

I need to convert certain words to lower case. I am working with a list of movie titles, where prepositions and articles are normally lower case if they are not the first word in the title. If I have the vector:
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')
What I need is this:
movies_updated = c('The Kings of Summer', 'The Words', 'Out of the Furnace', 'Me and Earl and the Dying Girl')
Is there an elegant way to do this without using a long series of gsub(), as in:
movies_updated = gsub(' In ', ' in ', movies)
movies_updated = gsub(' In', ' in', movies_updated)
movies_updated = gsub(' Of ', ' of ', movies)
movies_updated = gsub(' Of', ' of', movies_updated)
movies_updated = gsub(' The ', ' the ', movies)
movies_updated = gsub(' the', ' the', movies_updated)
And so on.
In effect, it appears that you are interested in converting your text to title case. This can be easily achieved with the stringi package, as shown below:
>> stringi::stri_trans_totitle(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings Of Summer" "The Words" "Out Of The Furnace"
An alternative approach involves the toTitleCase function from the tools package, which keeps articles and prepositions in lower case:
>> tools::toTitleCase(c('The Kings of Summer', 'The Words', 'Out of the Furnace'))
[1] "The Kings of Summer" "The Words" "Out of the Furnace"
Though I like @Konrad's answer for its succinctness, I'll offer an alternative that is more literal and manual.
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace',
'Me And Earl And The Dying Girl')
gr <- gregexpr("(?<!^)\\b(of|in|the)\\b", movies, ignore.case = TRUE, perl = TRUE)
mat <- regmatches(movies, gr)
regmatches(movies, gr) <- lapply(mat, tolower)
movies
# [1] "The Kings of Summer" "The Words"
# [3] "Out of the Furnace" "Me And Earl And the Dying Girl"
The tricks of the regular expression:
(?<!^) ensures we don't match a word at the beginning of a string. Without this, the first The of movies 1 and 2 will be down-cased.
\\b sets up word boundaries, so that the in inside Dying will not match. This is slightly more robust than your use of spaces, since hyphens, commas, etc. are not spaces but do mark the beginning/end of a word.
(of|in|the) matches any one of of, in, or the. More patterns can be added with separating pipes |.
Once identified, it's as simple as replacing them with down-cased versions.
Another example of how to turn certain words to lower case with gsub (with a PCRE regex):
movies = c('The Kings Of Summer', 'The Words', 'Out Of The Furnace', 'Me And Earl And The Dying Girl')
gsub("(?!^)\\b(Of|In|The)\\b", "\\L\\1", movies, perl=TRUE)
Details:
(?!^) - not at the start of the string (it does not matter if we use a lookahead or lookbehind here since the pattern inside is a zero-width assertion)
\\b - find leading word boundary
(Of|In|The) - capture Of or In or The into Group 1
\\b - assure there is a trailing word boundary.
The replacement contains the lowercasing operator \L that turns all the chars in the first backreference value (the text captured into Group 1) to lower case.
Note that this can turn out to be a more flexible approach than using tools::toTitleCase. The part of that function's code that keeps specific words in lower case is:
## These should be lower case except at the beginning (and after :)
lpat <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"
If you only need to apply lowercasing and do not care about the other logic in the function, it might be enough to add these alternatives (without the ^ and $ anchors) to the regex at the top of this answer, as sketched below.
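For instance, a rough sketch (not from either answer) that strips the anchors from that word list and drops it into the \L replacement shown earlier:
lpat  <- "^(a|an|and|are|as|at|be|but|by|en|for|if|in|is|nor|not|of|on|or|per|so|the|to|v[.]?|via|vs[.]?|from|into|than|that|with)$"
words <- sub("\\$$", "", sub("^\\^", "", lpat))   # drop the ^ and $ anchors
gsub(paste0("(?!^)\\b", words, "\\b"), "\\L\\1", movies,
     perl = TRUE, ignore.case = TRUE)
# [1] "The Kings of Summer"            "The Words"
# [3] "Out of the Furnace"             "Me and Earl and the Dying Girl"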

Replace rogue double-quotes in vector in R

I have a broken CSV file with long text fields containing both double quotes and commas. I've been able to clean it up to some extent and now have tab-separated fields as a vector of whole lines (each value is a line).
head(temp, 2)
[1] "\"org_order\"\t\"organizations.api_path\"\t\"permalink\"\t\"api_path\"\t\"web_path\"\t\"name\"\t\"also_known_as\"\t\"short_description\"\t\"description\"\t\"profile_image_url\"\t\"primary_role\"\t\"role_company\"\t\"role_investor\"\t\"role_group\"\t\"role_school\"\t\"founded_on\"\t\"founded_on_trust_code\"\t\"is_closed\"\t\"closed_on\"\t\"closed_on_trust_code\"\t\"num_employees_min\"\t\"num_employees_max\"\t\"stock_exchange\"\t\"stock_symbol\"\t\"total_funding_usd\"\t\"number_of_investments\"\t\"homepage_url\"\t\"created_at\"\t\"updated_at\""
[2] "1\t\"organizations/care1st-health-plan-arizona\"\t\"care1st-health-plan-arizona\"\t\"organizations/care1st-health-plan-arizona\"\t\"organization/care1st-health-plan-arizona\"\t\"Care1st Health Plan Arizona\"\t\"\"\t\"Care1st Health Plan Arizona provides high quality health care services.\"\t\"Care1st is a health plan providing support and services to meet the health care needs of eligible members enrolled in KidsCare, AHCCCS, and DDD.\"\t\"http://public.crunchbase.com/t_api_images/v1475743278/m2teurxnhkwacygzdn2m.png\"\t\"company\"\t\"\"\t\"\"\t\"\"\t\"\"\t\"2003-01-01\"\t\"4\"\t\"FALSE\"\t\"\"\t\"0\"\t\"251\"\t\"500\"\t\"\"\t\"\"\t\"0\"\t\"0\"\t\"\"\t\"1475743348\"\t\"1475899305\""
I then write temp as a file and read it back (which I've found much faster than textConnection). However, read.table("temp", sep = "\t", quote = "\"", encoding = "UTF-8", colClasses = "character") chokes on certain lines and gives me messages such as:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec
= dec, : line 66951 did not have 29 elements
I think this is due to rogue double quotes, as in the following line (rogue quote can be found immediately after "TripAdvisor de la sant?").
temp[66951]
[1] "67654\t\"organizations/docotop\"\t\"docotop\"\t\"organizations/docotop\"\t\"organization/docotop\"\t\"DOCOTOP\"\t\"\"\t\"Le 'TripAdvisor de la sant?\" est arriv?. Docotop permet de trouver le meilleur professionnel de sant?gr?e ?la communaut?de patients\"\t\"\"\t\"http://public.crunchbase.com/t_api_images/v1455271104/ry9lhcfezcmemoifp92h.png\"\t\"company\"\t\"TRUE\"\t\"\"\t\"\"\t\"\"\t\"2015-11-17\"\t\"7\"\t\"\"\t\"\"\t\"0\"\t\"1\"\t\"10\"\t\"EURONEXT\"\t\"\"\t\"0\"\t\"0\"\t\"http://docotop.com/\"\t\"1455271299\"\t\"1473443321\""
I propose to replace the rogue double quotes with single quotes, but I have to leave the expected quotes in place. Quotes are expected right before or after a separator (tab), at the beginning of a line (first line only), and at the end of a line. I've written the following attempt at a regex with lookarounds for tabs and line start/end, but it doesn't work:
temp <- gsub("(?<![^\t])\"(?![\t$])", "'", temp, perl = T)
EDIT: I tried @akrun's solution, but get:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec
= dec, : line 181 did not have 29 elements
The line in question (which didn't cause an error before):
temp[181]
[1] "198\torganizations/playfusion\tplayfusion\torganizations/playfusion\torganization/playfusion\tPlayFusion\t\tPlayFusion is a developer of computer games.\tPlayFusion is pioneering the next generation of connected interactive entertainment. PlayFusion's proprietary technology platform fuses video games, robotics, toys, and trans-media entertainment. The company is currently working on its own original IP to trail-blaze its vision ahead of opening its platform to others. PlayFusion is an independent, employee-owned company with offices in Cambridge and Derby in the UK, Douglas in the Isle of Man, and New York and San Francisco in the USA.\thttp://public.crunchbase.com/t_api_images/v1475688372/xnhrd4t254pxj6yxegzt.png\tcompany\t\t\t\t\t2015-01-01\t4\tFALSE\t\t0\t11\t50\t\t\t0\t0\thttp://playfusion.com/#intro\t1475688521\t1475899292"
Your (?<![^\t])"(?![\t$]) regex matches a " that is not preceded by a character other than a tab (so there must be a tab or the start of the string before the "), and that is not followed by a tab or a $ symbol.
So, the ^ and $ inside character classes lose their anchor meaning.
Replace the character classes with alternation groups:
gsub("(?<!\t|^)\"(?!\t|$)", "'", temp, perl=TRUE)
The (?<!\t|^) lookbehind requires that the " is not at the start of the string and is not preceded by a tab.
The (?!\t|$) lookahead requires that the " is not at the end of the string ($) and is not followed by a tab character.
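A quick spot-check of the result on the record quoted in the question (a sketch):
temp_fixed <- gsub("(?<!\t|^)\"(?!\t|$)", "'", temp, perl = TRUE)
# the rogue quote after "TripAdvisor de la sant?" is now an apostrophe, while
# the tab-adjacent field quotes are left untouched
substr(temp_fixed[66951], 1, 160)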
