Startup Blink web scraping - web-scraping

Hello and have a great day!
I am trying to collect information for my research on startups from the Startup Blink website (https://www.startupblink.com/startups). Here is my code:
import requests
import pandas as pd
import urllib.request
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from bs4 import BeautifulSoup
from time import sleep
from time import time
%time
df=pd.DataFrame()
for p in range(1,770):
    url=f'https://www.startupblink.com/startups?page={p}&location=united-states'
    r=requests.get(url)
    us=r.text
    soup=BeautifulSoup(us, 'html.parser')
    allbus=soup.find_all('div', class_='sc-2ozyz3-0 jlGOJO entity-card laptop:test')
    for bus in allbus:
        business_name=bus.find('a', class_='sc-2ozyz3-3 bPSWdR').text
        city=bus.find('div', class_='sc-2ozyz3-4 iNXPUy').find('a').text
        industry=bus.find_all('div', class_='sc-2ozyz3-4 iNXPUy')[1].find_all('a')[0].text
        industryspec=bus.find_all('div', class_='sc-2ozyz3-4 iNXPUy')[1].find_all('a')[1].text
        description=bus.find('div', class_='sc-2ozyz3-9 gHVzj').text
        description=description.rstrip('\xa0Read more')
        df = df.append({"Business_name": business_name, "City": city, "Industry": industry, 'Industry Specific': industryspec, 'Description': description}, ignore_index=True)
    sleep(0.01)
    print(p)
df=df.dropna()
df=df.drop_duplicates()
df.describe()
Unfortunately, I could not figure out a better approach that gets all the information I need directly from the page without the inner for loop, which goes through each page several times and takes too much time.
Any suggestions?
Also, I do not yet understand how to get the country name from the HTML below (it is the second item in div class="sc-2ozyz3-4 iNXPUy"):
<a class="sc-2ozyz3-3 bPSWdR" href="/startups/qiwi">QIWI</a>
<div class="sc-2ozyz3-4 iNXPUy"><div class="sc-2ozyz3-6 sc-2ozyz3-7 kDRpwA bsjXoB"></div>
Moscow,
Russia</div>
<div class="sc-2ozyz3-4 iNXPUy"><div class="sc-2ozyz3-6 sc-2ozyz3-8 kDRpwA jspnHi"></div>
<a href="/startups/industry/fintech">
Appreciate your help and advice!

That page is hydrated from an API, which is visible in the Network tab of the browser's dev tools. To get the information, scrape that API endpoint instead of the HTML.
Here is one way to do it:
import requests
import pandas as pd
from tqdm import tqdm
s = requests.Session()
big_df = pd.DataFrame()
for x in tqdm(range(27)):
    r = s.get(f'https://www.startupblink.com/api/entities?entity=startups&page={x}&bounds=-48.58314637707078,-177.71484375,80.2661234640419,-6.152343750000001&sortBy=rank&order=desc&leaderType=1&countryId=1')
    df = pd.json_normalize(r.json()['page'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
print(big_df)
Result in terminal:
id title description lat lng unicorn import_tag update_method lowtech pantheon exit slugNumber cb_logo url_rank local_rank stage featured when industry_slug industry_name industry_id subindustry_slug subindustry_id subindustry_name tags tags_name logo url crunchbase linkedin_url city city_slug country_slug country state city_id country_id state_id status highest_rank location claimed_by region_ids city_bounds country_bounds region_name region_bounds region_id cluster_parent
0 4227 DuckDuckGo DuckDuckGo is a general search engine with:\n --No tracking.\n --Better instant answers.\n --Way less spam and clutter.\n\nMore at https://duckduckgo.com/press/ 40.0025 -75.118 0 angellist angellist 0 0 0 0 None 981818.18181818176526576281 2 NaN NaN 1397184129 software-data Software & Data 10 software 80.0 Software 365 Search https://www.startupblink.com/uploads/startups_logo/3c3044925df3260f03ce454bf947349c.jpg https://duckduckgo.com/ None None Philadelphia philadelphia united-states United States PA 154 1 54.0 1 10677525 Philadelphia, United States NaN 4,43,37,15 39.8670041,-75.280303,40.1379919,-74.9557629 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 154.0
1 33033 Medium Medium is rethinking how ideas and storied are shared with the world. We believe: \n\n- Great ideas can come from anywhere\n- People create better things together\n- Design matters at a deep level\n\nWe also care deeply about how media shapes the lives of individuals and the decisions of society — and we think it can be better. \n\nWe have a world-class engineering and design team, which we are looking to grow slowly and deliberately. Let us know if you're interested. 37.7749 -122.419 0 angellist angellist 0 0 0 0 None 514218.05752427189145237207 3 NaN NaN 1397182260 software-data Software & Data 10 apps 72.0 Apps 267 Mobile https://www.startupblink.com/uploads/startups_logo/77cc196151a296effc9295ab70da4302.jpg http://medium.com/ None http://www.linkedin.com/company/medium-com San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
2 176060 Eventbrite Eventbrite brings people together around the power of live events. Founded in 2006, the innovative ticketing, registration, and event discovery platform has sold more than 140M tickets in 176 countries, and processed over $2B in gross ticket sales (25% of the in the last six months). We’re transforming the ticketing and registration industry from the ground up, and we're looking for amazing people to help us change the way people get together. 37.7749 -122.419 0 angellist angellist 0 0 0 0 None 454876.68161434977082535625 4 NaN NaN 1397189157 software-data Software & Data 10 apps 72.0 Apps 267 Mobile https://www.startupblink.com/uploads/startups_logo/1c8ed51b74f154a7eb29fdb881417fb2.jpg http://www.eventbrite.com/ None http://www.linkedin.com/company/eventbrite San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
3 282599 FTX Exchange FTX Exchange is a cryptocurrency derivatives exchange company built by traders, for traders. 37.7749 -122.419 0 massive_CB_import21_2018 any 0 0 0 0 /image/upload/v3wgeajl4zaccve2fqgh 370193.95945386844687163830 5 NaN NaN 1612865162 fintech Fintech 4 cryptocurrency 20.0 Cryptocurrency None None https://res.cloudinary.com/crunchbase-production/image/upload/vqz68owblsgchsqpyjzm https://ftx.com/ https://www.crunchbase.com/organization/ftx-exchange None San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
4 341985 JUUL JUUL is a manufacturer and distributor of electronic nicotine vaporizers. 37.7749 -122.419 0 massive_CB_2022 any 0 0 0 0 /image/upload/v1429671971/po5mfc1lakppkxasfvaz.png 343775.01932146179024130106 6 NaN 0.0 1642957296 social-leisure Social & Leisure 9 social-leisure-other 68.0 Social & Leisure-Other None None None https://www.juul.com None None San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1291 299976 AiCure AiCure is an advanced data analytics company that uses artificial intelligence to understand how patients respond to treatments. 40.7128 -74.006 0 massive_CB_2022 any 0 0 0 0 None 235.74892181180308625699 2372 NaN 0.0 1642945905 software-data Software & Data 10 data-analytics 77.0 Data Analytics None None None http://www.aicure.com None None New York new-york united-states United States NY 15 1 27.0 1 10677525 New York, United States None 4,43,37,15 40.4959961,-74.2590879,40.9152556,-73.7002721 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 15.0
1292 262861 Primary Primary is making better clothes for kids and building a better experience for busy parents to shop for them. 40.7128 -74.006 0 massive_CB_import21_2015 any 0 0 0 0 /image/upload/v1427864328/d3eplpf1udmzamqlxbok.png 235.54787246262657163243 2373 NaN NaN 1612862505 ecommerce-retail Ecommerce & Retail 1 ecommerce 2.0 Ecommerce None None https://res.cloudinary.com/crunchbase-production/image/upload/v1427864328/d3eplpf1udmzamqlxbok.png https://www.primary.com/ https://www.crunchbase.com/organization/primary None New York new-york united-states United States NY 15 1 27.0 1 10677525 New York, United States None 4,43,37,15 40.4959961,-74.2590879,40.9152556,-73.7002721 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 15.0
1293 275123 OLIPOP OLIPOP is the clinically backed consumer beverage that meets consumer’s real-world taste preferences in a delicious tonic. 37.8044 -122.271 0 massive_CB_import21_2017 any 0 0 0 0 /image/upload/yx6qdieek1mffmbjrph0 235.45512740329783696325 2374 NaN NaN 1612864255 foodtech Foodtech 5 food-and-beverage 32.0 Food and Beverage None None https://res.cloudinary.com/crunchbase-production/image/upload/yx6qdieek1mffmbjrph0 https://www.drinkolipop.com/ https://www.crunchbase.com/organization/olipop None Oakland oakland united-states United States CA 348 1 25.0 1 10677525 Oakland, United States None 4,43,37,15 37.699192,-122.3426648,37.8847249,-122.1149234 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
1294 240892 SecuredTouch Solving real-world authentication problems to support digital transformation into the “mobile era” 37.4419 -122.143 0 crunchbase crunchbase 0 0 0 0 /image/upload/v1492674022/pkuky18gpvm5m6fkef79.png 235.38222864173755510819 2376 NaN NaN 1569702672 fintech Fintech 4 fintech-other 23.0 Fintech-Other None None https://res.cloudinary.com/crunchbase-production/image/upload/v1492674022/pkuky18gpvm5m6fkef79.png http://www.securedtouch.com/ https://www.crunchbase.com/organization/securedtouch https://www.linkedin.com/company-beta/9187630/ Palo Alto palo-alto united-states United States CA 77 1 25.0 1 10677525 Palo Alto, United States None 4,43,37,15 37.2853458,-122.202476,37.4659713,-122.0867789 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
1295 48632 Womply Womply brings online tools like Google Analytics, Compete.com & Salesforce to offline merchants. \n\nWomply lets merchants:\n-visualize their revenue, social media, & online reputation performance\n-compare performance to competitors\n-identify their best customers\n-see where else customers spend\n-engage customers automatically via email/mobile to drive revenue\n\nWomply is special because it runs in the cloud: no hardware to install, no software to integrate, no training, & no Δ in payment behavior. 37.7749 -122.419 0 angellist angellist 0 0 0 0 None 235.07372273596880063451 2377 NaN NaN 1397199100 software-data Software & Data 10 data-analytics 77.0 Data Analytics None None https://www.startupblink.com/uploads/startups_logo/88c6ef3c716188d445c0f39bc40107c6.jpg https://womply.com/insights None https://www.linkedin.com/company/womply San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States None 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
1296 rows × 49 columns
For TQDM visit https://pypi.org/project/tqdm/
For Requests documentation, see https://requests.readthedocs.io/en/latest/
Also for pandas: https://pandas.pydata.org/pandas-docs/stable/index.html
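On the country question: the columns in the result above already include `country`, `city` and a combined `location` (for example `Philadelphia, United States`), so the country can be read straight from the API rows. If only the combined string or the card text (`Moscow,` / `Russia`) is at hand, it can be split on the last comma. A minimal sketch in plain Python; `split_location` is a hypothetical helper name, and it assumes the country is always the part after the last comma:

```python
def split_location(text):
    """Split 'City, Country' into (city, country).

    Assumption: the country is whatever follows the last comma.
    Whitespace and line breaks around either part are dropped.
    """
    cleaned = " ".join(text.split())  # collapse newlines and runs of spaces
    city, sep, country = cleaned.rpartition(",")
    if not sep:  # no comma found: treat the whole text as the city
        return cleaned, None
    return city.strip(), country.strip()


print(split_location("Moscow,\nRussia"))
print(split_location("Philadelphia, United States"))
```

With BeautifulSoup, the input would be the `.text` of the first `sc-2ozyz3-4 iNXPUy` div of a card.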

You can just use the API and take the fields you need.
import requests
import pandas as pd
results = []
for page in range(770):
    url = f"https://www.startupblink.com/api/entities?entity=startups&page={page}&sortBy=rank&order=desc&leaderType=1"
    response = requests.get(url)
    for business in response.json()['page']:
        results.append({
            'title': business['title'],
            'city': business['city'],
            'industry_name': business['industry_name'],
            'subindustry_name': business['subindustry_name'],
            'description': business['description']
        })
df = pd.DataFrame(results)
print(df.to_string(index=False))
OUTPUT:
title city industry_name subindustry_name description
GrabFood London Ecommerce & Retail Ecommerce GrabFood is a same-day grocery delivery company, offering delivery in as little as one hour.
DuckDuckGo Philadelphia Software & Data Software DuckDuckGo is a general search engine with:\n --No tracking.\n --Better instant answers.\n --Way less spam and clutter.\n\nMore at https://duckduckgo.com/press/
Medium San Francisco Software & Data Apps Medium is rethinking how ideas and storied are shared with the world. We believe: \n\n- Great ideas can come from anywhere\n- People create better things together\n- Design matters at a deep level\n\nWe also care deeply about how media shapes the lives of individuals and the decisions of society — and we think it can be better. \n\nWe have a world-class engineering and design team, which we are looking to grow slowly and deliberately. Let us know if you're interested.
Eventbrite San Francisco Software & Data Apps Eventbrite brings people together around the power of live events. Founded in 2006, the innovative ticketing, registration, and event discovery platform has sold more than 140M tickets in 176 countries, and processed over $2B in gross ticket sales (25% of the in the last six months). We’re transforming the ticketing and registration industry from the ground up, and we're looking for amazing people to help us change the way people get together.
...
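The number of API pages need not match the number of HTML pages, so a fixed `range(770)` may request too many or too few. One option is to keep requesting pages until one comes back empty. The sketch below assumes (this is not confirmed by the source) that a page past the end returns an empty `page` list. `collect_pages` and `fake_fetch` are hypothetical names; the fetch function is passed in so the loop can be tried offline:

```python
def collect_pages(fetch_page, fields, max_pages=1000):
    """Call fetch_page(0), fetch_page(1), ... until a page comes back empty.

    fetch_page is any callable that returns the list of records for a page.
    Only the requested fields are kept; missing fields become None.
    """
    rows = []
    for page in range(max_pages):
        records = fetch_page(page)
        if not records:  # assumed end-of-data signal
            break
        for rec in records:
            rows.append({f: rec.get(f) for f in fields})
    return rows


# Stand-in for the real request, so the sketch runs without network access.
fake_pages = [
    [{"title": "DuckDuckGo", "city": "Philadelphia", "country": "United States"}],
    [{"title": "Medium", "city": "San Francisco", "country": "United States"}],
]


def fake_fetch(page):
    return fake_pages[page] if page < len(fake_pages) else []


rows = collect_pages(fake_fetch, ["title", "country"])
print(rows)
```

With `requests`, `fetch_page` could be a small function that returns `response.json()['page']` for the URL used above, and the resulting list can be passed to `pd.DataFrame` once at the end.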

Related

Couldn't get tq_exchange() or stockSymbols() to work

I am trying to get stock symbols with these functions (both failed)
TTR::stockSymbols("AMEX")
Error in symbols[, sort.by] : incorrect number of dimensions
tidyquant::tq_exchange("AMEX")
Getting data...
Error: Can't rename columns that don't exist.
x Column Symbol doesn't exist.
Do these functions work for you? What fixes do you know to correct them? Thank you!
I get the same error. It seems there have been some changes to the website from which these packages get their information. There is an open issue about this.
The same thread mentions that you can get the data directly from the underlying JSON endpoint:
tmp <- jsonlite::fromJSON('https://api.nasdaq.com/api/screener/stocks?tableonly=true&limit=25&offset=0&exchange=AMEX&download=true')
head(tmp$data$rows)
# symbol name
#1 AAMC Altisource Asset Management Corp Com
#2 AAU Almaden Minerals Ltd. Common Shares
#3 ACU Acme United Corporation. Common Stock
#4 ACY AeroCentury Corp. Common Stock
#5 AE Adams Resources & Energy Inc. Common Stock
#6 AEF Aberdeen Emerging Markets Equity Income Fund Inc. Common Stock
# lastsale netchange pctchange volume marketCap country ipoyear
#1 $24.60 -0.3595 -1.44% 15183 40595215.00 United States
#2 $0.846 0.0359 4.432% 2272603 101984125.00 Canada 2015
#3 $33.82 0.61 1.837% 7869 112922038.00 United States 1988
#4 $11.76 2.01 20.615% 739133 18179596.00 United States
#5 $28.31 0.11 0.39% 6217 120099060.00 United States
#6 $9.10 0.09 0.999% 40775 461841180.00 United States
# industry sector
#1 Real Estate Finance
#2 Precious Metals Basic Industries
#3 Industrial Machinery/Components Capital Goods
#4 Diversified Commercial Services Technology
#5 Oil Refining/Marketing Energy
#6
# url
#1 /market-activity/stocks/aamc
#2 /market-activity/stocks/aau
#3 /market-activity/stocks/acu
#4 /market-activity/stocks/acy
#5 /market-activity/stocks/ae
#6 /market-activity/stocks/aef

Variable selection using regsubset in R

I'm working on a Tweets project and extracted 87 variables. Now I need to perform variable selection, so I used forward subset selection, but I'm facing an error.
regfit.fwd = regsubsets(screen_name ~.,merge_tweets,method = "forward",
complete.cases(merge_tweets),nvmax = 15)
Error in leaps.setup(x[, ii[reorder], drop = FALSE], y, wt,
force.in[reorder], : NA/NaN/Inf in foreign function call (arg 4)
> head(merge_tweets)
X user_id status_id created_at screen_name
1 1 1339835893 1.090257e+18 1548772454 HillaryClinton
2 2 1339835893 1.090002e+18 1548711688 HillaryClinton
3 3 1339835893 1.089999e+18 1548710912 HillaryClinton
4 4 1339835893 1.089994e+18 1548709837 HillaryClinton
5 5 1339835893 1.089994e+18 1548709756 HillaryClinton
6 6 1339835893 1.089994e+18 1548709738 HillaryClinton
text
1 On
top of human suffering and lasting damage to our national parks, the Trump
shutdown cost the economy $11 billion. End shutdowns as a political hostage-
taking tactic.
2 Hurricane Maria decimated trees and ecosystems in Puerto Rico. Para La
Naturaleza's nurseries have made a CGI commitment to plant 750,000 trees in
seven years. The team here has already grown 120,000 seedlings and planted
30,000 trees. source display_text_width is_quote is_retweet favorite_count
retweet_count lang
1 Twitter Web Client 192 FALSE FALSE 14324
4168 en
2 Twitter Web Client 235 FALSE FALSE 10684
2526 en
3 Twitter Web Client 238 FALSE FALSE 11423
2089 en
4 Twitter Web Client 34 FALSE FALSE 1293
113 en
5 Twitter Web Client 222 FALSE FALSE 6641
951 en
6 Twitter Web Client 214 FALSE FALSE 12192
2108 en
status_url name
location
Hillary
Clinton New York, NY
Hillary
Clinton New York, NY
description
1 2016 Democratic Nominee, SecState, Senator, hair icon. Mom, Wife, Grandma
x2, lawyer, advocate, fan of walks in the woods & standing up for our
democracy.
2 2016 Democratic Nominee, SecState, Senator, hair icon. Mom, Wife, Grandma
x2, lawyer, advocate, fan of walks in the woods & standing up for our
democracy.
url protected followers_count friends_count listed_count
statuses_count
1 FALSE 24017203 784
41782 10667
2 FALSE 24017203 784
41782 10667
3 FALSE 24017203 784
41782 10667
favourites_count account_created_at verified profile_url
profile_expanded_url
1 1138 1365530675 TRUE
2 1138 1365530675 TRUE
3 1138 1365530675 TRUE
I have removed some URL columns because the site does not allow URLs to be posted. It would be great if anyone could help me solve this problem.
Thanks in advance!!

Rfacebook: get reactions to posts

I want to use Rfacebook to get the reactions (not just likes) to specific posts but couldn't find a way to do that. Basically, I would want the same output for a comment as I get for a post:
> BBC <- getPage(page="bbcnews", token=fb_oauth, n=5, since="2017-10-03", until="2017-10-06", feed=FALSE, reactions=TRUE, verbose=TRUE)
5 posts > BBC
id likes_count from_id from_name
1 228735667216_10155178331342217 1602 228735667216 BBC News
2 228735667216_10155178840252217 7575 228735667216 BBC News
3 228735667216_10155178915482217 5735 228735667216 BBC News
4 228735667216_10155180617187217 6843 228735667216 BBC News
5 228735667216_1964396086910573 1736 228735667216 BBC News
message
1 "What did those people do to deserve that?" \n\nThis woman left the scene of the Las Vegas shooting just moments before it began.
2 Puerto Rico: President Donald J. Trump compares Hurricane Maria to a "real catastrophe like Katrina" bbc.in/2yG9gyZ
3 Do mass shootings ever change gun laws? http://bbc.in/2fIbjv0
4 "Boris asked me to give you this" - The moment comedian Lee Nelson interrupts Prime Minister Theresa May's speech.. by handing her a P45.
5 In her big conference speech, Theresa May talked about council houses and energy prices - but the announcements were overshadowed by a coughing fit and a protester. (Via BBC Politics)\nhttp://bbc.in/2fMCIw3
created_time type link story comments_count shares_count
1 2017-10-03T18:23:36+0000 video https://www.facebook.com/bbcnews/videos/10155178331342217/ NA 406 230
2 2017-10-03T21:34:21+0000 video https://www.facebook.com/bbcnews/videos/10155178840252217/ NA 14722 12284
3 2017-10-03T21:56:01+0000 video https://www.facebook.com/bbcnews/videos/10155178915482217/ NA 3059 2418
4 2017-10-04T11:17:28+0000 video https://www.facebook.com/bbcnews/videos/10155180617187217/ NA 1737 2973
5 2017-10-04T17:16:33+0000 video https://www.facebook.com/bbcnews/videos/1964396086910573/ NA 636 238
love_count haha_count wow_count sad_count angry_count
1 125 16 18 1063 20
2 318 1155 5023 1072 23698
3 104 69 61 980 504
4 513 4127 76 10 80
5 83 467 24 11 21
Now, I want for the first 5 comments of the first post to also have an output like above. I get all of it except the reactions (corresponding to the columns love_count, haha_count, wow_count, sad_count, angry_count) by using the following code:
> BBC_post <- getPost(BBC$id[1], token=fb_oauth, comments=TRUE, n.comments=5, likes=FALSE, reactions=FALSE)
> BBC_post
$post
from_id from_name
1 228735667216 BBC News
message
1 "What did those people do to deserve that?" \n\nThis woman left the scene of the Las Vegas shooting just moments before it began.
created_time type link id
1 2017-10-03T18:23:36+0000 video https://www.facebook.com/bbcnews/videos/10155178331342217/ 228735667216_10155178331342217
likes_count comments_count shares_count
1 1602 406 230
$comments
from_id from_name
1 880124212162441 David Bourton
2 10159595379610445 Valerie Gregory
3 10159810965680122 Nadeem Hussain
4 1657693134252376 Samir Amghar
5 10215327133878123 Shlomo Resnikov
message
1 It's unfathomable to the rest of the world that there are so many people who believe the killer's right to their guns are greater than their victims right to life.
2 That's backwards. The victims didn't do anything. The NRA, the politicians who are bought and paid for by them, including President Trump, and the shooter did. That is where solving the problem begins.
3 BBC ask Israel the same Question... what did the Palestinians civilians do to deserve an Apartheid regime !!!
4 Praying and thinking of the victims will not prevent the next shooting. One failed attempt at a shoe bomb and we all take off our shoes at the airport. 274 Mass shootings since January and no change in your regulation of guns.
5 As a Jew , we constantly ask those kind of questions regarding to the holocaust ,”where was god in the holocaust “? Or “How did he allow this horror”? And the answer that facilitates the most is mysterious ways of god are beyond our perception ,we cannot grasp divine calculation .
created_time likes_count comments_count id
1 2017-10-03T18:25:58+0000 225 71 10155178331342217_10155178338952217
2 2017-10-03T18:29:04+0000 79 45 10155178331342217_10155178346307217
3 2017-10-03T18:28:34+0000 60 38 10155178331342217_10155178345382217
4 2017-10-03T18:32:11+0000 37 3 10155178331342217_10155178354272217
5 2017-10-03T18:44:19+0000 16 20 10155178331342217_10155178380902217
### how do I also display the REACTIONS a comment got? It is not "reactions=TRUE" since that will display the reactions to the post itself and not the comment of the post
Does anyone know how to get there? Or does Rfacebook simply not allow for that (yet) since the feature of 'reacting to comments' was introduced not too long ago?
Many thanks in advance and all the best,
Ivo

R:Fuzzy Logic Name match

I have been working on a large data set of customer names. Each name has to be checked against a master file that holds the correct names (300 KB); if it matches, the master file name should be appended to the customer file as a new column value. My previous question worked for small data sets.
Both the customer and master files have been cleaned using tm, and I have tried different approaches, but they only work on small sets of data and are not effective on huge files. In my opinion pattern matching doesn't help here, because no name comes in an exact pattern.
Cus File
1 chang chun petrochemical
2 chang chun plastics
3 church dwight
4 citrix systems asia pacific
5 cnh industrial services srl
6 conoco phillips
7 conocophillips
8 dfk laurence varnay
9 dtz worldwide
10 electro motive maintenance operati
11 enterasys networks
12 esso resources
13 expedia
14 expedia
15 exponential interactive aust
16 exxonmobil asia pacific pte
17 exxonmobil chemical asia pac div
18 exxonmobil png
19 formula world championship
20 fortitech asia pacific sdn bhd
Master
1 chang chun group
2 church dwight
3 citrix systems asia pacific
4 cnh industrial nv
5 conoco phillips
6 dfk laurence varnay
7 dtz group zealand
8 caterpillar
9 enterasys networks
10 exxon mobil group
11 expedia group
12 exponential interactive aust
13 formula world championship
14 fortitech asia pacific sdn bhd
15 frhi hotels resorts
16 gardner denver industries
17 glencore xstrata international plc
18 grace
19 incomm nz
20 information resources
21 kbr holdings llc
22 kennametal
23 komatsu
24 leonhard hofstetter pelzdesign
25 communications corporation
26 manhattan associates
27 mattel
28 mmg finance
29 nokia oyj group
30 nortek
I have tried this simple loop:
for (i in 1:100){
result$x[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
#result$Y[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}
*result *
1 chang chun petrochemical <NA> NA
2 chang chun plastics <NA> NA
3 church dwight church dwight 2
4 citrix systems asia pacific citrix systems asia pacific 3
5 cnh industrial services srl <NA> NA
6 conoco phillips church dwight 2
7 conocophillips <NA> NA
8 dfk laurence varnay <NA> NA
9 dtz worldwide church dwight 2
10 electro motive maintenance operati <NA> NA
11 enterasys networks <NA> NA
12 esso resources church dwight 2
13 expedia <NA> NA
14 expedia <NA> NA
15 exponential interactive aust church dwight 2
16 exxonmobil asia pacific pte <NA> NA
17 exxonmobil chemical asia pac div <NA> NA
18 exxonmobil png church dwight 2
19 formula world championship <NA> NA
20 fortitech asia pacific sdn bhd
I tried lapply, but to no avail. As you can see, my master file is large, and sometimes I get an error that the row lengths don't match!
mm<-dt[lapply(result, function(x) levenshteinDist(x ,lapply(result1, function(x) x)))]
#using looping stat. for checking each cus name with all the master names
for(i in seq(nrow(result)) )
{
if((levenshteindist(result[i],lapply(result1, function(x) String(x))))==0)
sprintf("%s", x)
}
Which method would be best for this? I referred to a few similar questions on Stack Overflow, but they were not much help.
It might be naive, but when applied to huge data sets it misbehaves. Could anybody familiar with R correct the above code for levenshteinDist?
code:
#check with each value of master file and if matches more than .90 then return master value.
for(i in seq(1:nrow(gr1))
{
for(j in seq(1:nrow(gr2))
{
gr1$jar[i,j]<-jarowinkler(gr1$ICIS_Cust_Names[i],gr2$Master_Names[j])
if(gr1$jar[i,j]>.90)
gr1$res[i] = gr2$Master_Names[j]
}
}
#Please let me know if there is any small error in this code
If anybody has worked with such data in R, please help!
I achieved a partial result with:
code:
df$result<-data.frame(df$Cust_Names, df$Master_Names[max.col(-adist(df$Cust_Names,df$Master_Names))])
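The matching step described above (compare each customer name with every master name and keep the best match only if it is similar enough) can be sketched as follows. This is in Python rather than R, and it swaps in the standard library's `difflib` similarity ratio, which is neither Levenshtein distance nor Jaro-Winkler; `match_to_master` and the 0.8 cutoff are illustrative assumptions, not values from the question:

```python
from difflib import get_close_matches


def match_to_master(customers, masters, cutoff=0.8):
    """Map each customer name to its closest master name, or None.

    A name is matched only if its difflib similarity ratio to the best
    candidate is at least `cutoff` (0..1, higher is stricter).
    """
    result = {}
    for name in customers:
        best = get_close_matches(name, masters, n=1, cutoff=cutoff)
        result[name] = best[0] if best else None
    return result


masters = ["conoco phillips", "church dwight", "expedia group"]
customers = ["conocophillips", "church dwight", "dtz worldwide"]

matches = match_to_master(customers, masters)
for customer, master in matches.items():
    print(customer, "->", master)
```

Returning None below the cutoff avoids the situation in the result above, where unrelated names were all assigned `church dwight`. The cutoff would need tuning on real data.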

Issue with sorting one column after rank is assigned

*****This relates to a question asked on Coursera, so I may not be able to reveal the complete code*****
Hi,
Below is my data frame (outcome_H):
Hospital_Name H_A H_F PN
ABC 4.5 5 6
CDE 4.5 1 3
EFG 5 2 1
1) I need to rank the column provided in the function call (it could be one of H_A, H_F, PN).
2) A rank will also be provided in the call. I need to match that rank with the rank calculated above and return the respective Hospital_Name.
I used ties.method="first" to resolve ties. However, when I look at the final output, the hospital names are not sorted along with the values.
Example: if I give rank=2, I expect CDE to be printed, but due to some problem (which I am not aware of) ABC gets printed for rank=2 and CDE is printed for rank=1.
Below are some parts of code for better understanding:
H_A <- as.numeric(outcome_H$H_A)
HA <- H_A[order(H_A)] # newly added piece to order the value
df <- data.frame(HA,round(rank(HA,ties.method="first")),outcome_H$Hospital_Name)
rowss <- df[order(df$round.rank.HA..),]
Before ordering Output:
HA round.rank.HA.. outcome_H.Hospital.Name
42 8.1 1 FORT DUNCAN MEDICAL CENTER
192 8.5 2 TOMBALL REGIONAL MEDICAL CENTER
61 8.7 4 DETAR HOSPITAL NAVARRO
210 8.7 4 CYPRESS FAIRBANKS MEDICAL CENTER
69 8.8 6 MISSION REGIONAL MEDICAL CENTER
117 8.8 6 METHODIST HOSPITAL,THE
After Ordering output:
HA round.rank.HA..ties.method....first... outcome_H.Hospital.Name
1 8.1 1 PROVIDENCE MEMORIAL HOSPITAL
2 8.5 2 MEMORIAL HERMANN BAPTIST ORANGE HOSPITAL
3 8.7 3 PETERSON REGIONAL MEDICAL CENTER
4 8.7 4 CHILDREN'S HOSPITAL -SCOTT & WHITE HEALTHCARE
5 8.8 5 UNITED REGIONAL HEALTH CARE SYSTEM
6 8.8 6 ST JOSEPH REGIONAL HEALTH CENTER
As you can see, the data with hospital names are completely incorrect.
Any help is very much appreciated.
Thanks,
Pravellika J
You could try H_A <- as.numeric(as.character(outcome_H$H_A))
Output
HA round.rank.HA..ties.method....first... outcome_H.Hospital_Name
1 4.5 1 ABC
2 4.5 2 CDE
3 5.0 3 EFG
I figured it out myself. I had initially built HA from only one of the three columns (H_A, H_F, PN). Now I combine it with Hospital_Name and order by both attributes.
Thanks,
Pravellika J
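The underlying issue is that sorting the values on their own detaches them from the hospital names; sorting whole rows keeps each name with its value. A minimal sketch of that idea, in Python rather than R, using the small example frame from the question. `name_at_rank` is a hypothetical helper, and ties here keep their original row order (similar in spirit to ties.method="first"), whereas the fix described above breaks ties by hospital name:

```python
def name_at_rank(rows, column, rank):
    """Return the Hospital_Name at the given 1-based rank for `column`.

    rows is a list of dicts. Whole rows are sorted, so each name stays
    attached to its value. Python's sort is stable, so tied values keep
    their original order.
    """
    ordered = sorted(rows, key=lambda r: r[column])
    if not 1 <= rank <= len(ordered):
        return None  # rank outside the available range
    return ordered[rank - 1]["Hospital_Name"]


outcome = [
    {"Hospital_Name": "ABC", "H_A": 4.5, "H_F": 5, "PN": 6},
    {"Hospital_Name": "CDE", "H_A": 4.5, "H_F": 1, "PN": 3},
    {"Hospital_Name": "EFG", "H_A": 5.0, "H_F": 2, "PN": 1},
]

print(name_at_rank(outcome, "H_A", 2))  # CDE
print(name_at_rank(outcome, "PN", 1))   # EFG
```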
