R: extract a specific word after a keyword

How do I extract a specific word after a keyword in R?
I have the following input text, which contains details about a policy. I need to extract the values of specific fields such as FirstName, Surname, Father's Name and Date of Birth.
input.txt:
In Case of unit linked plan, Investment risk in Investment Portfolio is borne by the policyholder.
ly
c I ROPOSAL FORM z
Insurance
Proposal Form Number: 342525 PF 42242
Advisor Coe aranch Code 2
Ff roanumber =F SSOS™S™~™S~S rancid ate = |
IBR. Code S535353424
re GFN ——
INSTRUCTION FOR FILLING THES APPLICATION FORM ; 1. Compiets the proocsal form in CAPITAL LETTERS using = Black Ball Point P]n. 2. Sless= mark your selection by marking “X" insides the
Boe. 3. Slnsse bases 2 Blank soece after eect word, letter or initial 4. Slssse write "MA" for questions whic are not apolicatie. 5.00 NOT USE the Sor") to identify your initial or seperate the sddressiiine.
6. Sulmissson of age proof ie mandatory along wall Ge propel fonm.
IMPORTANT INSTRUCTIONS WITH REGARD TO DISCLOSURE OF INFORMATION: Inturance it a contract of UTMOST GOOD FAITH and itis required by disclose all material and nelevant
fach: complebehy, DO) NOT suppress any fac: in response by the questions in the priposal form. FAILURE TO PROVIDE COMPLETE AND ACCURATE INFORMATION OR
MISREPRESENTATION OF THE FACTS COULD DECLARE THES POLICY CONTRACT NULL AND VOID AFTER PAYMENT OF SURRENDER VALUE, IF ANY, SUBJECT TO SECTION 45 OF
INSURANCE ACT, 1998 As AMENDED FROM TIME TO TIME,
Section I - Details of the Life to be Assured
1. Tite E-] Mr. LJ Mrs. LJ Miss [J Or. LJ Others (Specify)
2. FirstName PETER PAUL
3. Surname T
44. Father's Name
46, Mother's Name ERIKA RESWE D
5. Date of Birth 13/02/1990 6, Gender E] Male ] Female
7. Age Proof L] School Certificate [] Driving License [] Passport {Birth Certificate E"] PAN Card
3, Marital Status D) Single EF] Married 0 Widower) 0 Civorcee
9, Spouse Name ERISEWQ FR
10. Maiden Name
iL. Nationality -] Resident Indian National [J Non Resident Indian (MRI) L] Others (Specify)
12, Education J Postgraduate / Doctorate Ee) Graduate [] 12thstd. Pass [J 10thstd. Pass [J Below 10th std.
OO Dliterate / Uneducated CJ Others (Specify)
13. Address For No 7¥%a vaigai street Flower
Communication Nagar selaiyur
Landmark
City Salem
Pin Code BO00 73: State TAMIL NADU
Address proof [] Passport ([] Driving License [] Voter ID [] Bank Statement [] Utility Bill G4 Others (Specify) Aadhaar Card
14, Permanent No 7¥a vaigai street Flower
Address :
Nagar selaiyur
Landmark
City Salem
Pin Code 5353535 state (TAMIL NADU
Address proof CJ] Passport [9 DrivingLicense [J Voter ID [ Bank Statement [ Utility Bill B] Others (Specify) Aadhaar Card
15. Contact Details Mobile 424242424 Phone (Home)
Office / Business
E-mail fdgrgtr13#yahoo.com
Preferred mode: ((] Letter EF) E-Mail
Preferred Language for Letter {other than English): [] Hindi [] Kannada [-] Tamil J Telugu C] Malayalam C) Gujarati
Bengali GOriya =D] Marathi
16. Occupation CL] Salaried-Govt /PSU ( Salaried-other [9 Self Employed Professional [J Aagriculturist {Farmer [Part Time Business
LJ Retired ] Landlord J Student (current Std) -] Others (Specify) Salaried - MNC
17. Full Name of the Capio software
Employers Businnes/
School/College
18, Designation & Exact nature of Work / Business Manager
19. AnnualIncomein 1,200,000.00 20. Annual Income of Husband / Father = 1,500,000.00
Figures (%) (for female and minor lives)
21. Exact nature of work / business of Husband / Father for female and minor lives Government Employee
Page 10fé
The code below works for me, but it depends on fixed line numbers, so if the line order changes every extracted value is wrong. Is there a way to extract the value after a keyword irrespective of line order?
Current code:
path <- getwd()
my_txt <- readLines(paste(path, "/input.txt", sep = ""))
fName <- sub('.*FirstName', '', my_txt[7])
SName <- sub('.*Surname', '', my_txt[8])
FatherNm <- sub(".*Father's Name", '', my_txt[9])
dob <- sub("6, Gender.*", '',sub(".*Date of Birth", '', my_txt[11]))

You can collapse the text into a single string and extract the values with patterns that match the labels in the data. This approach works irrespective of which line a value is on, provided the labels and layout around each value are the same in every file.
path <- getwd()
my_txt <- readLines(file.path(path, "input.txt"))
#Collapse data in one string
text <- paste0(my_txt, collapse = '\n')
#Extract text after FirstName till '\n'
fName <- sub('.*FirstName (.*?)\n.*', '\\1', text)
fName
#[1] "John Woo"
#Extract text after Surname till '\n'
SName <- sub('.*Surname (.*?)\n.*', '\\1', text)
SName
#[1] "T"
#Extract text after Father's Name till '\n'
FatherNm <- sub(".*Father's Name (.*?)\n.*", '\\1', text)
FatherNm
#[1] "Bill Woo"
#Extract numbers which come after Date of Birth.
dob <- sub(".*Date of Birth (\\d+/\\d+/\\d+).*", '\\1', text)
dob
#[1] "13/07/1970"
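For comparison, the same line-order-independent idea can be sketched in Python rather than R. This is only an illustration under stated assumptions: the helper name `extract_fields`, the dictionary keys, and the shortened, reordered sample text are hypothetical, and the patterns assume each value sits on the same line as its label.

```python
import re

# Field name -> pattern; each pattern captures the rest of the line after the label.
# [ \t]* skips spaces but not newlines, so a label with no value yields "".
FIELD_PATTERNS = {
    "first_name": r"FirstName[ \t]*(.*)",
    "surname": r"Surname[ \t]*(.*)",
    "father_name": r"Father's Name[ \t]*(.*)",
    "dob": r"Date of Birth[ \t]*(\d+/\d+/\d+)",
}

def extract_fields(text):
    """Return a dict of field values found anywhere in text (None if a label is absent)."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1).strip() if match else None
    return fields

# Hypothetical excerpt of input.txt with the lines deliberately reordered
sample = "\n".join([
    "5. Date of Birth 13/02/1990 6, Gender E] Male ] Female",
    "3. Surname T",
    "44. Father's Name",
    "2. FirstName PETER PAUL",
])
print(extract_fields(sample))
```

Because the search runs over the whole text, the position of each line does not matter; a label that is present but has no value on its line comes back as an empty string.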

Related

How can I get the key to increment when it is a string

I need to take someone’s age and then output a key event from every year they have lived through.
dYearlyEvents = {
    "1993": "Bill Clinton is inaugurated as the 42nd president.",
    "1994": "The People's Republic of China gets its first connection to the Internet.",
    "1995": "eBay is founded by Pierre Omidyar.",
    "1996": "Murder of Tupac Shakur.",
    "1997": "The first episode of Pokémon airs on TV Tokyo.",
    "1998": "Death of Frank Sinatra.",
    "1999": "The Columbine High School massacre in Colorado, United States, causes 15 deaths.",
    "2000": "The Sony PlayStation 2 releases in Japan. ",
}
sBirthYear = (input("What year were you born in: \n"))
while True:
    if sBirthYear in dYearlyEvents:
        print(dYearlyEvents[sBirthYear])
    sBirthYear += 1
This is what I tried, but because the input is a string, it won't add a year each time it loops. Instead of printing all events from 1993 to 2000, it just prints 1993.
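One possible way around this, sketched under the assumption that the dictionary keys stay strings: convert the input to an integer once, count with integers, and convert back with `str()` only for the lookup. The function name `events_since` and the shortened `sample_events` dictionary are hypothetical.

```python
def events_since(birth_year, yearly_events):
    """Return the events from birth_year up to the last year in yearly_events."""
    if not yearly_events:
        return []
    start = int(birth_year)                        # input() returns a string
    last = max(int(key) for key in yearly_events)  # stop at the latest known year
    events = []
    for year in range(start, last + 1):
        key = str(year)                            # dictionary keys are strings
        if key in yearly_events:
            events.append(yearly_events[key])
    return events

# Shortened copy of the dictionary from the question
sample_events = {
    "1998": "Death of Frank Sinatra.",
    "1999": "The Columbine High School massacre in Colorado, United States, causes 15 deaths.",
    "2000": "The Sony PlayStation 2 releases in Japan. ",
}
for event in events_since("1999", sample_events):
    print(event)
```

Using a bounded `range` instead of `while True` also means the loop ends once the last year in the dictionary has been handled.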

PowerApps - One Collection feeds another, with List lookups

There are a few steps I'm trying to hit, here.
STEP 1:
I have created a Collection (ScanDataCollection) with the following command:
ClearCollect(ScanDataCollection,Split(ScanData.Text,Char(10)));
Where ScanData is a multiline text control, containing data strings such as this:
REQ1805965.RITM2055090.01
REQ1805965.RITM2055091.01
REQ1805982.RITM2055144.01
REQ1805982.RITM2055145.01
This produces a Collection of:
RESULT
REQ1805965.RITM2055090.01
REQ1805965.RITM2055091.01
REQ1805982.RITM2055144.01
REQ1805982.RITM2055145.01
The unique lookup value in this list is the RITM string (for example: RITM2055091)
I want to build a Collection that looks like this:
CUSTOMERNAME CUSTOMEREMAIL MANAGERNAME MANAGEREMAIL ITEMLIST
Edward edward#fish.com Tony tony#fish.com <li><strong>REQ1805965 - RITM2055090 - Vulcan Banana</strong></li>
Edward edward#fish.com Tony tony#fish.com <li><strong>REQ1805965 - RITM2055091 - Vulcan Grape</strong></li>
Joseph joey#fish.com Kate kate#fish.com <li><strong>REQ1805982 - RITM2055144 - Romulan Catfish</strong></li>
Joseph joey#fish.com Kate kate#fish.com <li><strong>REQ1805982 - RITM2055145 - Romulan Salmon</strong></li>
The values in the rows come from a List (called "Spiderfood" at the moment) in SharePoint (this is where RITM value is typically unique, and can be used as the lookup):
Title REQUEST RITM TASK OPENED_DATE ITEM_DESCRIPTION VIP CUSTOMER_NAME CUSTOMER_NT MANAGER_NAME MANAGER_NT TASK_DESCRIPTION CUSTOMER_LOCATION
8-5-2021 REQ1805965 RITM2055090 TASK123 7-27-2021 Vulcan Banana false Edward edward#fish.com Tony tony#fish.com a string a string
8-5-2021 REQ1805965 RITM2055091 TASK123 7-27-2021 Vulcan Grape false Edward edward#fish.com Tony tony#fish.com a string a string
8-5-2021 REQ1805982 RITM2055144 TASK123 7-27-2021 Romulan Catfish false Joseph joey#fish.com Kate kate#fish.com a string a string
8-5-2021 REQ1805982 RITM2055145 TASK123 7-27-2021 Romulan Salmon false Joseph joey#fish.com Kate kate#fish.com a string a string
...[among hundreds of other records in this List]
Then...
STEP 2:
Take the Collection I built above, and deduplicate, based on CUSTOMEREMAIL, but in the process of deduplicating, concatenate the items in the ITEMLIST column.
The result would be a Collection with only two rows, for example:
CUSTOMERNAME CUSTOMEREMAIL MANAGERNAME MANAGEREMAIL ITEMLIST
Edward edward#fish.com Tony tony#fish.com <li><strong>REQ1805965 - RITM2055090 - Vulcan Banana</strong></li><li><strong>REQ1805965 - RITM2055091 - Vulcan Grape</strong></li>
Joseph joey#fish.com Kate kate#fish.com <li><strong>REQ1805982 - RITM2055144 - Romulan Catfish</strong></li><li><strong>REQ1805982 - RITM2055145 - Romulan Salmon</strong></li>
I sure would appreciate guidance/suggestions on this, please!
Thank you kindly in advance!
Okay, for STEP 1:
ClearCollect(ScanDataCollection,Split(ScanData.Text,Char(10)));
ClearCollect(MailingListExploded, AddColumns(ScanDataCollection,
    "CustomerName", LookUp('Spiderfood - RITMs', RITM = Mid(Result, 12, 11), Customer_Name),
    "CustomerEmail", "edward#fish.com", // this is what I use as a test so that I don't email customers.
    //"CustomerEmail", LookUp('Spiderfood - RITMs', RITM = Mid(Result, 12, 11), Customer_NT),
    "ManagerName", LookUp('Spiderfood - RITMs', RITM = Mid(Result, 12, 11), Manager_Name),
    "ManagerEmail", LookUp('Spiderfood - RITMs', RITM = Mid(Result, 12, 11), Manager_NT),
    "ItemListHTML", "<li><strong>" & Left(Result,10) & " - " & Mid(Result, 12, 11) & " - " & LookUp('Spiderfood - RITMs', RITM = Mid(Result, 12, 11), Item_Description) & "</strong></li>"));
It carries over an additional column from ScanDataCollection called "Result", but I can live with that. (I take it out later.)
I had to add that specific list ('Spiderfood - RITMs') as a data source in the PowerApp project, which took me a minute to remember. Derp.
It shows a delegation warning about the use of LookUp if the dataset is very large (well, it's gonna be around 15,000 records, give or take), but for now, I'll not worry about it.
Now, on to STEP 2:
What would have gotten me there quicker is a better understanding of the GroupBy function and how it accepts multiple column arguments. Concatenating the strings was also a bit of a head-scratcher.
But it seems to work, so here it is:
// Trim away the Result column
ClearCollect(MailingListExplodedTrimmed, DropColumns(MailingListExploded, "Result"));
// Group and concatenate - TransmissionGrid is what we need to send the emails
ClearCollect(RecordsByCustEmail, GroupBy(MailingListExplodedTrimmed, "CustomerEmail", "CustomerName", "ManagerName", "ManagerEmail", "OrderData"));
ClearCollect(TransmissionGridExtra, AddColumns(RecordsByCustEmail, "ConcatenatedOrderString", Concat(OrderData, ItemListHTML)));
ClearCollect(TransmissionGrid, DropColumns(TransmissionGridExtra, "OrderData"));
Notify("Process complete!");
I might be able to shave away some steps by nesting things, but in this instance I wanted to be super obvious in case I have to look at this in 96 hours.
Anyway, that's what did it for me. Onward!
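For readers who want to see the group-and-concatenate logic of STEP 2 outside Power Apps, here is a rough sketch in Python. It is not Power Apps code: the function name `group_and_concat` and the two sample rows are hypothetical, and it assumes every row for a given e-mail shares the same name and manager values.

```python
def group_and_concat(rows, key="CustomerEmail", concat_field="ItemListHTML"):
    """Collapse rows sharing the same key, joining concat_field in row order."""
    grouped = {}  # dicts keep insertion order, so the first-seen order is preserved
    for row in rows:
        group_key = row[key]
        if group_key not in grouped:
            # Copy every column except the one being concatenated
            grouped[group_key] = {k: v for k, v in row.items() if k != concat_field}
            grouped[group_key][concat_field] = ""
        grouped[group_key][concat_field] += row[concat_field]
    return list(grouped.values())

# Hypothetical rows modelled on the MailingListExplodedTrimmed collection
rows = [
    {"CustomerName": "Edward", "CustomerEmail": "edward#fish.com",
     "ItemListHTML": "<li><strong>REQ1805965 - RITM2055090 - Vulcan Banana</strong></li>"},
    {"CustomerName": "Edward", "CustomerEmail": "edward#fish.com",
     "ItemListHTML": "<li><strong>REQ1805965 - RITM2055091 - Vulcan Grape</strong></li>"},
]
for record in group_and_concat(rows):
    print(record["CustomerEmail"], record["ItemListHTML"])
```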

Python code to scrape ticker symbols from Yahoo finance

I have a list of more than 1,000 companies that I could invest in. I need the ticker symbols for all of these companies. I am having difficulty stripping the output of the soup and looping through all the company names.
Please see an example of the site: https://finance.yahoo.com/lookup?s=asml. The idea is to replace asml with each company name, building the URL as 'https://finance.yahoo.com/lookup?s=' + Companies, so I can loop through all the companies.
companies=df
Company name
0 Abbott Laboratories
1 ABBVIE
2 Abercrombie
3 Abiomed
4 Accenture Plc
This is the code I have now. The strip call doesn't work, and the loop over all the companies isn't working either.
#Create a function to scrape the data
def scrape_stock_symbols():
    Companies=df
    url= 'https://finance.yahoo.com/lookup?s='+ Companies
    page= requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    Company_Symbol=Soup.find_all('td',attrs ={'class':'data-col0 Ta(start) Pstart(6px) Pend(15px)'})
    for i in company_symbol:
        try:
            row = i.find_all('td')
            company_symbol.append(row[0].text.strip())
        except Exception:
            if company not in company_symbol:
                next(Company)
    return (company_symbol)
#Loop through every company in companies to get all of the tickers from the website
for Company in companies:
    try:
        (temp_company_symbol) = scrape_stock_symbols(company)
    except Exception:
        if company not in companies:
            next(Company)
Another difficulty is that the symbol lookup on Yahoo Finance returns many company names for a single query.
I will have to clean the data afterwards. I want to use the AMS exchange as the standard: if a company is listed on multiple exchanges, I am only interested in the AMS ticker symbol. The final goal is to create a new dataframe:
Company name Company_symbol
0 Abbott Laboratories ABT
1 ABBVIE ABBV
2 Abercrombie ANF
Here's a solution that doesn't require any scraping. It uses a package called yahooquery (disclaimer: I'm the author), which utilizes an API endpoint that returns symbols for a user's query. You can do something like this:
import pandas as pd
import yahooquery as yq

def get_symbol(query, preferred_exchange='AMS'):
    try:
        data = yq.search(query)
    except ValueError: # Will catch JSONDecodeError
        print(query)
    else:
        quotes = data['quotes']
        if len(quotes) == 0:
            return 'No Symbol Found'
        # Default to the first result, then prefer a listing on the requested exchange
        symbol = quotes[0]['symbol']
        for quote in quotes:
            if quote['exchange'] == preferred_exchange:
                symbol = quote['symbol']
                break
        return symbol

companies = ['Abbott Laboratories', 'ABBVIE', 'Abercrombie', 'Abiomed', 'Accenture Plc']
df = pd.DataFrame({'Company name': companies})
df['Company symbol'] = df.apply(lambda x: get_symbol(x['Company name']), axis=1)
Company name Company symbol
0 Abbott Laboratories ABT
1 ABBVIE ABBV
2 Abercrombie ANF
3 Abiomed ABMD
4 Accenture Plc ACN
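The exchange-preference step can be looked at on its own, without any network call. The sketch below assumes each quote is a dictionary with `symbol` and `exchange` keys, as used in the code above; the function name `pick_symbol` and the sample quotes (including their symbol and exchange values) are made up for illustration and are not real search results.

```python
def pick_symbol(quotes, preferred_exchange="AMS"):
    """Return the symbol listed on preferred_exchange, else the first symbol, else None."""
    if not quotes:
        return None
    for quote in quotes:
        if quote.get("exchange") == preferred_exchange:
            return quote["symbol"]
    # No listing on the preferred exchange: fall back to the first result
    return quotes[0]["symbol"]

# Hypothetical search results; real responses carry more fields
sample_quotes = [
    {"symbol": "ASML", "exchange": "NMS"},
    {"symbol": "ASML.AS", "exchange": "AMS"},
]
print(pick_symbol(sample_quotes))
```

Keeping this logic in a separate function makes it easy to check the fallback behaviour with hand-written data before running it against more than 1,000 company names.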

Data Scraping with list in excel

I have a list in Excel, with one code in column A and another in column B.
There is a website where I need to enter both values in two different boxes, after which it takes me to another page.
That page contains certain details that I need to scrape into Excel.
Can anyone help with this?
Ok. Give this a shot:
from io import StringIO

import pandas as pd
import requests

df = pd.read_excel('C:/test/data.xlsx')
url = 'http://rla.dgft.gov.in:8100/dgft/IecPrint'

# Collect one DataFrame per input row. DataFrame.append was removed in pandas 2.0,
# so the pieces are combined with pd.concat at the end.
frames = []
for row in df.itertuples():
    # row[0] is the index; row[1] and row[2] are columns A and B
    payload = {
        'iec': '%010d' % row[1],
        'name': row[2]}
    response = requests.post(url, params=payload)
    print('IEC: %010d\tName: %s' % (row[1], row[2]))
    try:
        # Wrap the HTML in StringIO; newer pandas versions deprecate passing a raw string
        dfs = pd.read_html(StringIO(response.text))
    except Exception:
        print('The name Given By you does not match with the data OR you have entered less than three letters')
        temp_df = pd.DataFrame([['%010d' % row[1], row[2], 'ERROR']],
                               columns=['IEC', 'Party Name and Address', 'ERROR'])
        frames.append(temp_df)
        continue
    # First table: general details, reshaped into a single row
    generalData = dfs[0]
    generalData = generalData.iloc[:, [0, -1]].set_index(generalData.columns[0]).T.reset_index(drop=True)
    # Second table: directors, one column per director
    directorData = dfs[1]
    directorData = directorData.iloc[:, [-1]].T.reset_index(drop=True)
    directorData.columns = ['director_%02d' % (each + 1) for each in directorData.columns]
    try:
        # Third table (branches) is missing for some records
        branchData = dfs[2]
        branchData = branchData.iloc[:, [-1]].T.reset_index(drop=True)
        branchData.columns = ['branch_%02d' % (each + 1) for each in branchData.columns]
    except IndexError:
        branchData = pd.DataFrame()
        print('No Branch Data.')
    temp_df = pd.concat([generalData, directorData, branchData], axis=1)
    frames.append(temp_df)

results = pd.concat(frames, sort=False).reset_index(drop=True) if frames else pd.DataFrame()
results.to_excel('path.new_file.xlsx', index=False)
Output:
print (results.to_string())
IEC IEC Allotment Date File Number File Date Party Name and Address Phone No e_mail Exporter Type IEC Status Date of Establishment BIN (PAN+Extension) PAN ISSUE DATE PAN ISSUED BY Nature Of Concern Banker Detail director_01 director_02 director_03 branch_01 branch_02 branch_03 branch_04 branch_05 branch_06 branch_07 branch_08 branch_09
0 0305008111 03.05.2005 04/04/131/51473/AM20/ 20.08.2019 NISSAN MOTOR INDIA PVT. LTD. PLOT-1A,SIPCOT IN... 918939917907 shailesh.kumar#rnaipl.com 5 Merchant/Manufacturer Valid IEC 2005-02-07 AACCN0695D FT001 NaN NaN 3 Private Limited STANDARD CHARTERED BANK A/C Type:1 CA A/C No :... HARDEEP SINGH BRAR GURMEL SINGH BRAR HOUSE NO ... JEROME YVES MARIE SAIGOT THIERRY SAIGOT A9/2, ... KOJI KAWAKITA KIHACHI KAWAKITA 3-21-3, NAGATAK... Branch Code:165TH FLOOR ORCHID BUSINESS PARK,S... Branch Code:14NRPDC , WAREHOUSE NO.B -2A,PATAU... Branch Code:12EQUINOX BUSINESS PARK TOWER 3 4T... Branch Code:8GRAND PALLADIUM,5TH FLR.,B WING,,... Branch Code:6TVS LOGISTICS SERVICES LTD.SING,C... Branch Code:2PLOT 1A SIPCOT INDUL PARK,ORAGADA... Branch Code:5BLDG.NO.3 PART,124A,VALLAM A,SRIP... Branch Code:15SURVEY NO. 678 679 680 681 682 6... Branch Code:10INDOSPACE SKCL INDL.PARK,BULD.NO...

Simplifying e-mail network with igraph

I'm analysing an e-mail network. I loaded the following information into a directed igraph graph in R:
Vertex types: person, e-mail
V(g)[ type == "person" ]
V(g)[ type == "email" ]
Edge types: sends, receives
E(g)[ type == "send" ]
E(g)[ type == "receive" ]
So for example:
John --send--> email1 --receive--> Mary
John --send--> email2 --receive--> Mary
Mary --send--> email3 --receive--> John
I would like to generate a summary of the e-mail activity, with edges with an attribute representing the number of emails:
John --2--> Mary
Mary --1--> John
How would I go about doing that?
Thanks,
Mulone
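The aggregation being asked for can be sketched independently of igraph. The example below is plain Python, not an igraph solution: it assumes the edges are available as `(source, type, target)` triples with types "send" and "receive", and the function name `summarize_emails` is hypothetical.

```python
from collections import Counter

def summarize_emails(edges):
    """Count e-mails per (sender, receiver) pair from (source, type, target) triples."""
    sender_of = {}      # email -> person who sent it
    receivers_of = {}   # email -> list of people who received it
    for source, edge_type, target in edges:
        if edge_type == "send":          # person --send--> email
            sender_of[target] = source
        elif edge_type == "receive":     # email --receive--> person
            receivers_of.setdefault(source, []).append(target)
    counts = Counter()
    for email, sender in sender_of.items():
        for receiver in receivers_of.get(email, []):
            counts[(sender, receiver)] += 1
    return counts

# The three e-mails from the example in the question
edges = [
    ("John", "send", "email1"), ("email1", "receive", "Mary"),
    ("John", "send", "email2"), ("email2", "receive", "Mary"),
    ("Mary", "send", "email3"), ("email3", "receive", "John"),
]
for (sender, receiver), n in summarize_emails(edges).items():
    print(f"{sender} --{n}--> {receiver}")
```

Each resulting pair and count corresponds to one weighted person-to-person edge in the summary graph.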
