Startup Blink web scraping - web-scraping
Hello and have a great day!
I was trying to get some information for my research on startups from the Startup Blink website (https://www.startupblink.com/startups), and here is my code:
import requests
import pandas as pd
import urllib.request
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from bs4 import BeautifulSoup
from time import sleep
from time import time
%time
df=pd.DataFrame()
for p in range(1,770):
    url=f'https://www.startupblink.com/startups?page={p}&location=united-states'
    r=requests.get(url)
    us=r.text
    soup=BeautifulSoup(us, 'html.parser')
    allbus=soup.find_all('div', class_='sc-2ozyz3-0 jlGOJO entity-card laptop:test')
    for bus in allbus:
        business_name=bus.find('a', class_='sc-2ozyz3-3 bPSWdR').text
        city=bus.find('div', class_='sc-2ozyz3-4 iNXPUy').find('a').text
        industry=bus.find_all('div', class_='sc-2ozyz3-4 iNXPUy')[1].find_all('a')[0].text
        industryspec=bus.find_all('div', class_='sc-2ozyz3-4 iNXPUy')[1].find_all('a')[1].text
        description=bus.find('div', class_='sc-2ozyz3-9 gHVzj').text
        description=description.rstrip('\xa0Read more')
        df = df.append({"Business_name": business_name, "City": city, "Industry": industry, 'Industry Specific': industryspec, 'Description': description}, ignore_index=True)
    sleep(0.01)
    print(p)
df=df.dropna()
df=df.drop_duplicates()
df.describe()
Unfortunately, I could not figure out a better approach that gets all the information I need directly from the page without the inner for loop, which goes through each page several times and takes too much time.
Any suggestions?
Also, I cannot yet understand how to get the country name from the output HTML (it is the second item in div class="sc-2ozyz3-4 iNXPUy"):
<a class="sc-2ozyz3-3 bPSWdR" href="/startups/qiwi">QIWI</a>
<div class="sc-2ozyz3-4 iNXPUy"><div class="sc-2ozyz3-6 sc-2ozyz3-7 kDRpwA bsjXoB"></div>
Moscow,
Russia</div>
<div class="sc-2ozyz3-4 iNXPUy"><div class="sc-2ozyz3-6 sc-2ozyz3-8 kDRpwA jspnHi"></div>
<a href="/startups/industry/fintech">
Appreciate your help and advice!
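A side note on the question's code: `str.rstrip` takes a set of characters to strip, not a suffix, so `description.rstrip('\xa0Read more')` can also eat the end of the description itself. Below is a minimal sketch of the difference. `strip_suffix` is a hypothetical helper, and the sample text is made up for illustration.

```python
SUFFIX = "\xa0Read more"


def strip_suffix(text: str, suffix: str = SUFFIX) -> str:
    """Remove `suffix` once from the end of `text`, if it is present."""
    if suffix and text.endswith(suffix):
        return text[: -len(suffix)]
    return text


# Made-up description text, not taken from the site.
sample = "Great app made" + SUFFIX

# rstrip removes any trailing characters found in the argument,
# so the letters of "made" and the space before it are stripped too.
print(repr(sample.rstrip(SUFFIX)))  # 'Great app'

# The helper removes only the exact suffix.
print(repr(strip_suffix(sample)))   # 'Great app made'
```

On Python 3.9 or later, `sample.removesuffix(SUFFIX)` does the same as the helper.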
That page is hydrated from an API, which is visible in the Network tab of the browser's dev tools. To get the information, you need to scrape that API endpoint.
Here is one way to do it:
import requests
import pandas as pd
from tqdm import tqdm
s = requests.Session()
big_df = pd.DataFrame()
for x in tqdm(range(27)):
    r = s.get(f'https://www.startupblink.com/api/entities?entity=startups&page={x}&bounds=-48.58314637707078,-177.71484375,80.2661234640419,-6.152343750000001&sortBy=rank&order=desc&leaderType=1&countryId=1')
    df = pd.json_normalize(r.json()['page'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
print(big_df)
Result in terminal:
id title description lat lng unicorn import_tag update_method lowtech pantheon exit slugNumber cb_logo url_rank local_rank stage featured when industry_slug industry_name industry_id subindustry_slug subindustry_id subindustry_name tags tags_name logo url crunchbase linkedin_url city city_slug country_slug country state city_id country_id state_id status highest_rank location claimed_by region_ids city_bounds country_bounds region_name region_bounds region_id cluster_parent
0 4227 DuckDuckGo DuckDuckGo is a general search engine with:\n --No tracking.\n --Better instant answers.\n --Way less spam and clutter.\n\nMore at https://duckduckgo.com/press/ 40.0025 -75.118 0 angellist angellist 0 0 0 0 None 981818.18181818176526576281 2 NaN NaN 1397184129 software-data Software & Data 10 software 80.0 Software 365 Search https://www.startupblink.com/uploads/startups_logo/3c3044925df3260f03ce454bf947349c.jpg https://duckduckgo.com/ None None Philadelphia philadelphia united-states United States PA 154 1 54.0 1 10677525 Philadelphia, United States NaN 4,43,37,15 39.8670041,-75.280303,40.1379919,-74.9557629 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 154.0
1 33033 Medium Medium is rethinking how ideas and storied are shared with the world. We believe: \n\n- Great ideas can come from anywhere\n- People create better things together\n- Design matters at a deep level\n\nWe also care deeply about how media shapes the lives of individuals and the decisions of society — and we think it can be better. \n\nWe have a world-class engineering and design team, which we are looking to grow slowly and deliberately. Let us know if you're interested. 37.7749 -122.419 0 angellist angellist 0 0 0 0 None 514218.05752427189145237207 3 NaN NaN 1397182260 software-data Software & Data 10 apps 72.0 Apps 267 Mobile https://www.startupblink.com/uploads/startups_logo/77cc196151a296effc9295ab70da4302.jpg http://medium.com/ None http://www.linkedin.com/company/medium-com San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
2 176060 Eventbrite Eventbrite brings people together around the power of live events. Founded in 2006, the innovative ticketing, registration, and event discovery platform has sold more than 140M tickets in 176 countries, and processed over $2B in gross ticket sales (25% of the in the last six months). We’re transforming the ticketing and registration industry from the ground up, and we're looking for amazing people to help us change the way people get together. 37.7749 -122.419 0 angellist angellist 0 0 0 0 None 454876.68161434977082535625 4 NaN NaN 1397189157 software-data Software & Data 10 apps 72.0 Apps 267 Mobile https://www.startupblink.com/uploads/startups_logo/1c8ed51b74f154a7eb29fdb881417fb2.jpg http://www.eventbrite.com/ None http://www.linkedin.com/company/eventbrite San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
3 282599 FTX Exchange FTX Exchange is a cryptocurrency derivatives exchange company built by traders, for traders. 37.7749 -122.419 0 massive_CB_import21_2018 any 0 0 0 0 /image/upload/v3wgeajl4zaccve2fqgh 370193.95945386844687163830 5 NaN NaN 1612865162 fintech Fintech 4 cryptocurrency 20.0 Cryptocurrency None None https://res.cloudinary.com/crunchbase-production/image/upload/vqz68owblsgchsqpyjzm https://ftx.com/ https://www.crunchbase.com/organization/ftx-exchange None San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
4 341985 JUUL JUUL is a manufacturer and distributor of electronic nicotine vaporizers. 37.7749 -122.419 0 massive_CB_2022 any 0 0 0 0 /image/upload/v1429671971/po5mfc1lakppkxasfvaz.png 343775.01932146179024130106 6 NaN 0.0 1642957296 social-leisure Social & Leisure 9 social-leisure-other 68.0 Social & Leisure-Other None None None https://www.juul.com None None San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States NaN 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1291 299976 AiCure AiCure is an advanced data analytics company that uses artificial intelligence to understand how patients respond to treatments. 40.7128 -74.006 0 massive_CB_2022 any 0 0 0 0 None 235.74892181180308625699 2372 NaN 0.0 1642945905 software-data Software & Data 10 data-analytics 77.0 Data Analytics None None None http://www.aicure.com None None New York new-york united-states United States NY 15 1 27.0 1 10677525 New York, United States None 4,43,37,15 40.4959961,-74.2590879,40.9152556,-73.7002721 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 15.0
1292 262861 Primary Primary is making better clothes for kids and building a better experience for busy parents to shop for them. 40.7128 -74.006 0 massive_CB_import21_2015 any 0 0 0 0 /image/upload/v1427864328/d3eplpf1udmzamqlxbok.png 235.54787246262657163243 2373 NaN NaN 1612862505 ecommerce-retail Ecommerce & Retail 1 ecommerce 2.0 Ecommerce None None https://res.cloudinary.com/crunchbase-production/image/upload/v1427864328/d3eplpf1udmzamqlxbok.png https://www.primary.com/ https://www.crunchbase.com/organization/primary None New York new-york united-states United States NY 15 1 27.0 1 10677525 New York, United States None 4,43,37,15 40.4959961,-74.2590879,40.9152556,-73.7002721 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 15.0
1293 275123 OLIPOP OLIPOP is the clinically backed consumer beverage that meets consumer’s real-world taste preferences in a delicious tonic. 37.8044 -122.271 0 massive_CB_import21_2017 any 0 0 0 0 /image/upload/yx6qdieek1mffmbjrph0 235.45512740329783696325 2374 NaN NaN 1612864255 foodtech Foodtech 5 food-and-beverage 32.0 Food and Beverage None None https://res.cloudinary.com/crunchbase-production/image/upload/yx6qdieek1mffmbjrph0 https://www.drinkolipop.com/ https://www.crunchbase.com/organization/olipop None Oakland oakland united-states United States CA 348 1 25.0 1 10677525 Oakland, United States None 4,43,37,15 37.699192,-122.3426648,37.8847249,-122.1149234 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
1294 240892 SecuredTouch Solving real-world authentication problems to support digital transformation into the “mobile era” 37.4419 -122.143 0 crunchbase crunchbase 0 0 0 0 /image/upload/v1492674022/pkuky18gpvm5m6fkef79.png 235.38222864173755510819 2376 NaN NaN 1569702672 fintech Fintech 4 fintech-other 23.0 Fintech-Other None None https://res.cloudinary.com/crunchbase-production/image/upload/v1492674022/pkuky18gpvm5m6fkef79.png http://www.securedtouch.com/ https://www.crunchbase.com/organization/securedtouch https://www.linkedin.com/company-beta/9187630/ Palo Alto palo-alto united-states United States CA 77 1 25.0 1 10677525 Palo Alto, United States None 4,43,37,15 37.2853458,-122.202476,37.4659713,-122.0867789 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
1295 48632 Womply Womply brings online tools like Google Analytics, Compete.com & Salesforce to offline merchants. \n\nWomply lets merchants:\n-visualize their revenue, social media, & online reputation performance\n-compare performance to competitors\n-identify their best customers\n-see where else customers spend\n-engage customers automatically via email/mobile to drive revenue\n\nWomply is special because it runs in the cloud: no hardware to install, no software to integrate, no training, & no Δ in payment behavior. 37.7749 -122.419 0 angellist angellist 0 0 0 0 None 235.07372273596880063451 2377 NaN NaN 1397199100 software-data Software & Data 10 data-analytics 77.0 Data Analytics None None https://www.startupblink.com/uploads/startups_logo/88c6ef3c716188d445c0f39bc40107c6.jpg https://womply.com/insights None https://www.linkedin.com/company/womply San Francisco san-francisco united-states United States CA 5 1 25.0 1 10677525 San Francisco, United States None 4,43,37,15 37.6933354,-123.1077733,37.9297707,-122.3279149 17.5749789,-142.2328836,55.7358113,-50.7387430 North America 7.0717019,174.8410874,72.6976623,-8.1392116 4 5.0
1296 rows × 49 columns
For tqdm, see https://pypi.org/project/tqdm/
For Requests documentation, see https://requests.readthedocs.io/en/latest/
For pandas, see https://pandas.pydata.org/pandas-docs/stable/index.html
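The loop above hardcodes the number of pages. A possible variation, sketched below, builds the query string from a dict and keeps requesting pages until one comes back empty. Treat it as a sketch under assumptions: that the endpoint returns an empty `page` list past the last page is a guess, not documented behaviour; `build_url`, `fetch_page` and `fetch_all` are hypothetical names; and the `bounds` parameter from the URL above is left out here without knowing whether the endpoint requires it.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are taken from the URL in the answer above.
API_URL = "https://www.startupblink.com/api/entities"


def build_url(page: int, country_id: int = 1) -> str:
    """Build the API URL for one page of results."""
    params = {
        "entity": "startups",
        "page": page,
        "sortBy": "rank",
        "order": "desc",
        "leaderType": 1,
        "countryId": country_id,
    }
    return f"{API_URL}?{urlencode(params)}"


def fetch_page(page: int) -> list:
    """Fetch one page of records from the live API (needs `requests`)."""
    import requests  # third-party; imported here so the rest runs without it

    response = requests.get(build_url(page), timeout=30)
    response.raise_for_status()
    return response.json()["page"]


def fetch_all(fetch=fetch_page, max_pages: int = 1000) -> list:
    """Collect records page by page until a page is empty."""
    records = []
    for page in range(max_pages):
        rows = fetch(page)
        if not rows:  # assumption: an empty page marks the end
            break
        records.extend(rows)
    return records
```

Passing the collected list to `pd.json_normalize(fetch_all())` once at the end avoids concatenating a DataFrame on every iteration.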
You can just use the API and take the fields you need.
import requests
import pandas as pd
results = []
for page in range(770):
    url = f"https://www.startupblink.com/api/entities?entity=startups&page={page}&sortBy=rank&order=desc&leaderType=1"
    response = requests.get(url)
    for business in response.json()['page']:
        results.append({
            'title': business['title'],
            'city': business['city'],
            'industry_name': business['industry_name'],
            'subindustry_name': business['subindustry_name'],
            'description': business['description']
        })
df = pd.DataFrame(results)
print(df.to_string(index=False))
OUTPUT:
title city industry_name subindustry_name description
GrabFood London Ecommerce & Retail Ecommerce GrabFood is a same-day grocery delivery company, offering delivery in as little as one hour.
DuckDuckGo Philadelphia Software & Data Software DuckDuckGo is a general search engine with:\n --No tracking.\n --Better instant answers.\n --Way less spam and clutter.\n\nMore at https://duckduckgo.com/press/
Medium San Francisco Software & Data Apps Medium is rethinking how ideas and storied are shared with the world. We believe: \n\n- Great ideas can come from anywhere\n- People create better things together\n- Design matters at a deep level\n\nWe also care deeply about how media shapes the lives of individuals and the decisions of society — and we think it can be better. \n\nWe have a world-class engineering and design team, which we are looking to grow slowly and deliberately. Let us know if you're interested.
Eventbrite San Francisco Software & Data Apps Eventbrite brings people together around the power of live events. Founded in 2006, the innovative ticketing, registration, and event discovery platform has sold more than 140M tickets in 176 countries, and processed over $2B in gross ticket sales (25% of the in the last six months). We’re transforming the ticketing and registration industry from the ground up, and we're looking for amazing people to help us change the way people get together.
...
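As for the country name asked about in the question: the column list printed in the first answer includes a `country` field, so it can probably be picked up the same way as the other fields. A small offline sketch follows. `pick_fields` is a hypothetical helper, and the sample record copies a few values from the DuckDuckGo row shown earlier, with the description abbreviated.

```python
# Field names come from the column list in the first answer's output.
FIELDS = ["title", "city", "country", "industry_name", "subindustry_name", "description"]


def pick_fields(record: dict, fields=FIELDS) -> dict:
    """Keep only the wanted keys; missing keys become None instead of raising."""
    return {name: record.get(name) for name in fields}


# Sample record based on the DuckDuckGo row above (description abbreviated).
sample = {
    "id": 4227,
    "title": "DuckDuckGo",
    "description": "DuckDuckGo is a general search engine with: ...",
    "city": "Philadelphia",
    "country": "United States",
    "industry_name": "Software & Data",
    "subindustry_name": "Software",
}

row = pick_fields(sample)
print(row["city"], "/", row["country"])  # Philadelphia / United States
```

Using `record.get` rather than `record[...]` is a design choice here: a record that lacks one of the fields yields `None` for it rather than stopping the whole scrape with a `KeyError`.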
Related
Couldn't get tq_exchange() or stockSymbols() to work
I am trying to get stock symbols with these functions (both failed):
TTR::stockSymbols("AMEX")
Error in symbols[, sort.by] : incorrect number of dimensions
tidyquant::tq_exchange("AMEX")
Getting data...
Error: Can't rename columns that don't exist.
x Column Symbol doesn't exist.
Do these functions work for you? What fixes do you know to correct them? Thank you!
I get the same error. It seems there have been some changes to the website from which these packages get the information. There is an open issue about this. In the same thread it is mentioned that you can get the information from the underlying JSON.
tmp <- jsonlite::fromJSON('https://api.nasdaq.com/api/screener/stocks?tableonly=true&limit=25&offset=0&exchange=AMEX&download=true')
head(tmp$data$rows)
# symbol name
#1 AAMC Altisource Asset Management Corp Com
#2 AAU Almaden Minerals Ltd. Common Shares
#3 ACU Acme United Corporation. Common Stock
#4 ACY AeroCentury Corp. Common Stock
#5 AE Adams Resources & Energy Inc. Common Stock
#6 AEF Aberdeen Emerging Markets Equity Income Fund Inc. Common Stock
# lastsale netchange pctchange volume marketCap country ipoyear
#1 $24.60 -0.3595 -1.44% 15183 40595215.00 United States
#2 $0.846 0.0359 4.432% 2272603 101984125.00 Canada 2015
#3 $33.82 0.61 1.837% 7869 112922038.00 United States 1988
#4 $11.76 2.01 20.615% 739133 18179596.00 United States
#5 $28.31 0.11 0.39% 6217 120099060.00 United States
#6 $9.10 0.09 0.999% 40775 461841180.00 United States
# industry sector
#1 Real Estate Finance
#2 Precious Metals Basic Industries
#3 Industrial Machinery/Components Capital Goods
#4 Diversified Commercial Services Technology
#5 Oil Refining/Marketing Energy
#6
# url
#1 /market-activity/stocks/aamc
#2 /market-activity/stocks/aau
#3 /market-activity/stocks/acu
#4 /market-activity/stocks/acy
#5 /market-activity/stocks/ae
#6 /market-activity/stocks/aef
Variable selection using regsubset in R
I'm working on a Tweets Project and I extracted 87 variables, now i need to perform variable selection method so i used forward subset selection. But i'm facing an error. regfit.fwd = regsubsets(screen_name ~.,merge_tweets,method = "forward", complete.cases(merge_tweets),nvmax = 15) Error in leaps.setup(x[, ii[reorder], drop = FALSE], y, wt, force.in[reorder], : NA/NaN/Inf in foreign function call (arg 4) > head(merge_tweets) X user_id status_id created_at screen_name 1 1 1339835893 1.090257e+18 1548772454 HillaryClinton 2 2 1339835893 1.090002e+18 1548711688 HillaryClinton 3 3 1339835893 1.089999e+18 1548710912 HillaryClinton 4 4 1339835893 1.089994e+18 1548709837 HillaryClinton 5 5 1339835893 1.089994e+18 1548709756 HillaryClinton 6 6 1339835893 1.089994e+18 1548709738 HillaryClinton text 1 On top of human suffering and lasting damage to our national parks, the Trump shutdown cost the economy $11 billion. End shutdowns as a political hostage- taking tactic. 2 Hurricane Maria decimated trees and ecosystems in Puerto Rico. Para La Naturaleza's nurseries have made a CGI commitment to plant 750,000 trees in seven years. The team here has already grown 120,000 seedlings and planted 30,000 trees. source display_text_width is_quote is_retweet favorite_count retweet_count lang 1 Twitter Web Client 192 FALSE FALSE 14324 4168 en 2 Twitter Web Client 235 FALSE FALSE 10684 2526 en 3 Twitter Web Client 238 FALSE FALSE 11423 2089 en 4 Twitter Web Client 34 FALSE FALSE 1293 113 en 5 Twitter Web Client 222 FALSE FALSE 6641 951 en 6 Twitter Web Client 214 FALSE FALSE 12192 2108 en status_url name location Hillary Clinton New York, NY Hillary Clinton New York, NY description 1 2016 Democratic Nominee, SecState, Senator, hair icon. Mom, Wife, Grandma x2, lawyer, advocate, fan of walks in the woods & standing up for our democracy. 2 2016 Democratic Nominee, SecState, Senator, hair icon. 
Mom, Wife, Grandma x2, lawyer, advocate, fan of walks in the woods & standing up for our democracy. url protected followers_count friends_count listed_count statuses_count 1 FALSE 24017203 784 41782 10667 2 FALSE 24017203 784 41782 10667 3 FALSE 24017203 784 41782 10667 favourites_count account_created_at verified profile_url profile_expanded_url 1 1138 1365530675 TRUE 2 1138 1365530675 TRUE 3 1138 1365530675 TRUE I have removed some url columns as it doesn't support url to be posted. It would be great if anyone can help me out in solving this problem. Thanks in advance!!
Rfacebook: get reactions to posts
I want to use Rfacebook to get the reactions (not just likes) to specific posts but couldn't find a way to do that. Basically, I would want the same output for a comment as I get for a post: > BBC <- getPage(page="bbcnews", token=fb_oauth, n=5, since="2017-10-03", until="2017-10-06", feed=FALSE, reactions=TRUE, verbose=TRUE) 5 posts > BBC id likes_count from_id from_name 1 228735667216_10155178331342217 1602 228735667216 BBC News 2 228735667216_10155178840252217 7575 228735667216 BBC News 3 228735667216_10155178915482217 5735 228735667216 BBC News 4 228735667216_10155180617187217 6843 228735667216 BBC News 5 228735667216_1964396086910573 1736 228735667216 BBC News message 1 "What did those people do to deserve that?" \n\nThis woman left the scene of the Las Vegas shooting just moments before it began. 2 Puerto Rico: President Donald J. Trump compares Hurricane Maria to a "real catastrophe like Katrina" bbc.in/2yG9gyZ 3 Do mass shootings ever change gun laws? http://bbc.in/2fIbjv0 4 "Boris asked me to give you this" - The moment comedian Lee Nelson interrupts Prime Minister Theresa May's speech.. by handing her a P45. 5 In her big conference speech, Theresa May talked about council houses and energy prices - but the announcements were overshadowed by a coughing fit and a protester. 
(Via BBC Politics)\nhttp://bbc.in/2fMCIw3 created_time type link story comments_count shares_count 1 2017-10-03T18:23:36+0000 video https://www.facebook.com/bbcnews/videos/10155178331342217/ NA 406 230 2 2017-10-03T21:34:21+0000 video https://www.facebook.com/bbcnews/videos/10155178840252217/ NA 14722 12284 3 2017-10-03T21:56:01+0000 video https://www.facebook.com/bbcnews/videos/10155178915482217/ NA 3059 2418 4 2017-10-04T11:17:28+0000 video https://www.facebook.com/bbcnews/videos/10155180617187217/ NA 1737 2973 5 2017-10-04T17:16:33+0000 video https://www.facebook.com/bbcnews/videos/1964396086910573/ NA 636 238 love_count haha_count wow_count sad_count angry_count 1 125 16 18 1063 20 2 318 1155 5023 1072 23698 3 104 69 61 980 504 4 513 4127 76 10 80 5 83 467 24 11 21 Now, I want for the first 5 comments of the first post to also have an output like above. I get all of it except the reactions (corresponding to the columns love_count, haha_count, wow_count, sad_count, angry_count) by using the following code: > BBC_post <- getPost(BBC$id[1], token=fb_oauth, comments=TRUE, n.comments=5, likes=FALSE, reactions=FALSE) > BBC_post $post from_id from_name 1 228735667216 BBC News message 1 "What did those people do to deserve that?" \n\nThis woman left the scene of the Las Vegas shooting just moments before it began. created_time type link id 1 2017-10-03T18:23:36+0000 video https://www.facebook.com/bbcnews/videos/10155178331342217/ 228735667216_10155178331342217 likes_count comments_count shares_count 1 1602 406 230 $comments from_id from_name 1 880124212162441 David Bourton 2 10159595379610445 Valerie Gregory 3 10159810965680122 Nadeem Hussain 4 1657693134252376 Samir Amghar 5 10215327133878123 Shlomo Resnikov message 1 It's unfathomable to the rest of the world that there are so many people who believe the killer's right to their guns are greater than their victims right to life. 2 That's backwards. The victims didn't do anything. 
The NRA, the politicians who are bought and paid for by them, including President Trump, and the shooter did. That is where solving the problem begins. 3 BBC ask Israel the same Question... what did the Palestinians civilians do to deserve an Apartheid regime !!! 4 Praying and thinking of the victims will not prevent the next shooting. One failed attempt at a shoe bomb and we all take off our shoes at the airport. 274 Mass shootings since January and no change in your regulation of guns. 5 As a Jew , we constantly ask those kind of questions regarding to the holocaust ,”where was god in the holocaust “? Or “How did he allow this horror”? And the answer that facilitates the most is mysterious ways of god are beyond our perception ,we cannot grasp divine calculation . created_time likes_count comments_count id 1 2017-10-03T18:25:58+0000 225 71 10155178331342217_10155178338952217 2 2017-10-03T18:29:04+0000 79 45 10155178331342217_10155178346307217 3 2017-10-03T18:28:34+0000 60 38 10155178331342217_10155178345382217 4 2017-10-03T18:32:11+0000 37 3 10155178331342217_10155178354272217 5 2017-10-03T18:44:19+0000 16 20 10155178331342217_10155178380902217 ### how do I also display the REACTIONS a comment got? It is not "reactions=TRUE" since that will display the reactions to the post itself and not the comment of the post Does anyone know how to get there? Or does Rfacebook simply not allow for that (yet) since the feature of 'reacting to comments' was introduced not too long ago? Many thanks in advance and all the best, Ivo
R:Fuzzy Logic Name match
I have been working on large data set which has names of customers , each of this has to be checked with the master file which has correct names (300 KB) and if matched append the master file name to names of customer file as new column value. My prev Question worked for small data sets Both Customer & Master file has been cleaned using tm and have tried different logic , but only works on small set of data when applied to huge files not effective, pattern matching doesn't help here my opinion cause no names comes with exact pattern Cus File 1 chang chun petrochemical 2 chang chun plastics 3 church dwight 4 citrix systems asia pacific 5 cnh industrial services srl 6 conoco phillips 7 conocophillips 8 dfk laurence varnay 9 dtz worldwide 10 electro motive maintenance operati 11 enterasys networks 12 esso resources 13 expedia 14 expedia 15 exponential interactive aust 16 exxonmobil asia pacific pte 17 exxonmobil chemical asia pac div 18 exxonmobil png 19 formula world championship 20 fortitech asia pacific sdn bhd Master 1 chang chun group 2 church dwight 3 citrix systems asia pacific 4 cnh industrial nv 5 conoco phillips 6 dfk laurence varnay 7 dtz group zealand 8 caterpillar 9 enterasys networks 10 exxon mobil group 11 expedia group 12 exponential interactive aust 13 formula world championship 14 fortitech asia pacific sdn bhd 15 frhi hotels resorts 16 gardner denver industries 17 glencore xstrata international plc 18 grace 19 incomm nz 20 information resources 21 kbr holdings llc 22 kennametal 23 komatsu 24 leonhard hofstetter pelzdesign 25 communications corporation 26 manhattan associates 27 mattel 28 mmg finance 29 nokia oyj group 30 nortek i have tried with this simple loop for (i in 1:100){ result$x[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4)) #result$Y[i] = agrep(result$ICIS_Cust_Names[i], result1$Master_Names, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4)) } *result * 1 
chang chun petrochemical <NA> NA 2 chang chun plastics <NA> NA 3 church dwight church dwight 2 4 citrix systems asia pacific citrix systems asia pacific 3 5 cnh industrial services srl <NA> NA 6 conoco phillips church dwight 2 7 conocophillips <NA> NA 8 dfk laurence varnay <NA> NA 9 dtz worldwide church dwight 2 10 electro motive maintenance operati <NA> NA 11 enterasys networks <NA> NA 12 esso resources church dwight 2 13 expedia <NA> NA 14 expedia <NA> NA 15 exponential interactive aust church dwight 2 16 exxonmobil asia pacific pte <NA> NA 17 exxonmobil chemical asia pac div <NA> NA 18 exxonmobil png church dwight 2 19 formula world championship <NA> NA 20 fortitech asia pacific sdn bhd tried with lapply but no use , as you can notice my master file is large and some times i get error of rows length doesn't match! mm<-dt[lapply(result, function(x) levenshteinDist(x ,lapply(result1, function(x) x)))] #using looping stat. for checking each cus name with all the master names for(i in seq(nrow(result)) ) { if((levenshteindist(result[i],lapply(result1, function(x) String(x))))==0) sprintf("%s", x) } which method would be best for this ? similar to my Q but not much helpfullI referd few Q from STO it might be naive but when applied with huge data sets it mis behaves, can anybody familiar with R could correct me with the above code for levenshteinDist code: #check with each value of master file and if matches more than .90 then return master value. for(i in seq(1:nrow(gr1)) { for(j in seq(1:nrow(gr2)) { gr1$jar[i,j]<-jarowinkler(gr1$ICIS_Cust_Names[i],gr2$Master_Names[j]) if(gr1$jar[i,j]>.90) gr1$res[i] = gr2$Master_Names[j] } } #Please let know if there is any minute error with this code Please if anybody has worked with such data in R please help !
I achieved a partial result with this code:
df$result<-data.frame(df$Cust_Names, df$Master_Names[max.col(-adist(df$Cust_Names,df$Master_Names))])
Issue with sorting one column after rank is assigned
*****This is to deal with the question asked in Coursera and hence I may not be able to reveal the complete code*****
Hi, below is my data frame (outcome_H):
Hospital_Name H_A H_F PN
ABC 4.5 5 6
CDE 4.5 1 3
EFG 5 2 1
1) I need to rank the column provided in the function call (it could be one of H_A, H_F, PN).
2) A rank will also be provided in the call. I need to match that rank with the rank calculated above and return the respective Hospital_Name.
I used ties.method="first" to solve the tie problem. However, when I look at the final output, the hospital names are not sorted. Example: if I give rank=2, I expect CDE to be printed, but due to some problem (which I am not aware of) ABC gets printed for rank=2 and CDE is printed for rank=1.
Below are some parts of the code for better understanding:
H_A <- as.numeric(outcome_H$H_A)
HA <- H_A[order(H_A)] # newly added piece to order the value
df <- data.frame(HA,round(rank(HA,ties.method="first")),outcome_H$Hospital_Name)
rowss <- df[order(df$round.rank.HA..),]
Output before ordering:
HA round.rank.HA.. outcome_H.Hospital.Name
42 8.1 1 FORT DUNCAN MEDICAL CENTER
192 8.5 2 TOMBALL REGIONAL MEDICAL CENTER
61 8.7 4 DETAR HOSPITAL NAVARRO
210 8.7 4 CYPRESS FAIRBANKS MEDICAL CENTER
69 8.8 6 MISSION REGIONAL MEDICAL CENTER
117 8.8 6 METHODIST HOSPITAL,THE
Output after ordering:
HA round.rank.HA..ties.method....first... outcome_H.Hospital.Name
1 8.1 1 PROVIDENCE MEMORIAL HOSPITAL
2 8.5 2 MEMORIAL HERMANN BAPTIST ORANGE HOSPITAL
3 8.7 3 PETERSON REGIONAL MEDICAL CENTER
4 8.7 4 CHILDREN'S HOSPITAL -SCOTT & WHITE HEALTHCARE
5 8.8 5 UNITED REGIONAL HEALTH CARE SYSTEM
6 8.8 6 ST JOSEPH REGIONAL HEALTH CENTER
As you can see, the hospital names are completely incorrect. Any help is very much appreciated.
Thanks, Pravellika J
You could try
H_A <- as.numeric(as.character(outcome_H$H_A))
Output
HA round.rank.HA..ties.method....first... outcome_H.Hospital_Name
1 4.5 1 ABC
2 4.5 2 CDE
3 5.0 3 EFG
I figured it out myself. I had initially built HA from only one of the three columns (H_A, H_F, PN). Now I have combined it with Hospital_Name and ordered it based on both attributes. Thanks, Pravellika J