How to scrape address from websites using Scrapy? [closed] - web-scraping

I am using Scrapy and I need to scrape the address from the "Contact Us" page of a given domain. The domains are provided by the Google Search API, so I do not know in advance what the exact structure of each web page is going to be. Is this kind of scraping possible? Any examples would be nice.

Providing a few examples would help produce a better answer, but the general idea would be to:
find the "Contact Us" link
follow the link and extract the address
assuming you don't have any information about the websites you'll be given.
Let's focus on the first problem.
The main problem here is that websites are structured differently and, strictly speaking, you cannot build a 100% reliable way to find the "Contact Us" page. But you can cover the most common cases:
follow <a> tags with text such as "Contact Us", "Contact", "About Us", "About", etc.
check /about, /contact_us and similar endpoints, for example:
http://www.sample.com/contact.php
http://www.sample.com/contact
follow all links whose text contains contact, about, etc.
From these heuristics you can build a set of Rules for your CrawlSpider, as sketched below.
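A minimal sketch along these lines (the start URL, rule patterns and the parse callback are illustrative assumptions, not a drop-in solution):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class ContactSpider(CrawlSpider):
        name = "contact"
        # start_urls would come from your Google Search API results
        start_urls = ["http://www.sample.com/"]

        rules = (
            # Follow links whose URL contains "contact" or "about";
            # newer Scrapy versions also offer restrict_text for matching anchor text.
            Rule(
                LinkExtractor(allow=(r"contact", r"about")),
                callback="parse_contact",
            ),
        )

        def parse_contact(self, response):
            # Address extraction itself is site-specific; see the note below.
            yield {
                "url": response.url,
                "page_text": " ".join(response.css("body ::text").getall()),
            }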
The second problem is no easier - you don't know where on the page the address is located (it may not be on the page at all), and you don't know the address format. You may need to dive into Natural Language Processing and Machine Learning.
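As a rough stopgap before full NLP/ML, here is a crude heuristic sketch for US-style street addresses; the regex is an illustration only and will miss many real-world formats:

    import re

    ADDRESS_RE = re.compile(
        r"\d{1,5}\s+\w+(?:\s\w+)*\s+"
        r"(?:Street|St\.?|Avenue|Ave\.?|Road|Rd\.?|Boulevard|Blvd\.?|Drive|Dr\.?)"
        r"[^\n]{0,80}?\b\d{5}(?:-\d{4})?\b",
        re.IGNORECASE,
    )

    def find_addresses(text):
        # Return substrings that loosely look like "123 Main Street ... 62704"
        return [m.group(0) for m in ADDRESS_RE.finditer(text)]

    print(find_addresses("Visit us at 123 Main Street, Springfield, IL 62704."))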

Related

Creating LinkedIn App. Why is my company not listed? [closed]

I just got admin access to the company page and even added it as my current position.
Still, when creating an app here https://www.linkedin.com/developer/apps/new the given company/page is not listed on the dropdown. Any tips?
Had the same issue. My problem was that I was trying to use a Showcase rather than a Company page.
Guess I was just a victim of bad UI (?)
So, it turns out that "creating a new company" on that interface does not really mean "create a new company page".
If you have the same problem, just select "Create a new company" and type the name. Company pages will pop up.
This seems to be a LinkedIn company search issue. It can occur when your company name is too common. As a workaround, you can temporarily add a unique part to your company name.
For example, if your company name is "Dog Store", then you may want to change it temporarily to "Dog Store (DOG168)". Now try to search by "DOG168" on the create an application page. This trick helped in my case.
You are adding an app right?
What you may need to do is add a Company Page.

modifying url to display more results per page [closed]

How would one find out if this is even possible on a specific site?
For example https://forums.eveonline.com/default.aspx?g=topics&f=257
There are many more sites where I wanted to display more results per page, but the option is not available.
Without knowledge of the code base, there is no way to know whether you can change the page behavior via a URL parameter, other than trial and error.
If the site is well designed, all URL parameters ought to be validated against a whitelist, so it should not be possible to hack the URL this way. So you should not rely on it.
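If you do want to try the trial-and-error route, a small sketch could probe a few commonly seen page-size parameter names and compare result counts. The parameter names and the CSS selector below are guesses, not a documented API:

    import requests
    from bs4 import BeautifulSoup

    BASE = "https://forums.eveonline.com/default.aspx"
    PARAMS = {"g": "topics", "f": "257"}
    CANDIDATES = ["pagesize", "pageSize", "limit", "count", "results"]

    def count_rows(params):
        # The CSS selector is a placeholder; inspect the real markup first.
        html = requests.get(BASE, params=params, timeout=10).text
        return len(BeautifulSoup(html, "html.parser").select("table tr"))

    baseline = count_rows(PARAMS)
    for name in CANDIDATES:
        rows = count_rows({**PARAMS, name: 100})
        if rows != baseline:
            print(name, "changed the row count:", baseline, "->", rows)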
I know that this is not answering the real question, and I know that John Wu is right: you can't obtain this via the query string if you don't know whether it is supported server-side. But I think there is always a way:
For example, in this case you can use the RSS feed (the button at the bottom of the page):
https://forums.eveonline.com/default.aspx?g=rsstopic&pg=Topics&f=257

Preserving old links on new WordPress site [closed]

It is a simple scenario. The client wanted a remake of an old site and we chose WordPress. Besides the visual redesign there were changes to the content structure and, naturally, to the links. The problem is that the old content is ranked quite highly on Google, so the question is actually twofold.
Will switching to new site affect the ranking?
How do I preserve the links that are already indexed by Google so they point to the same content at different URLs on the new site?
Switching will affect your Google rankings. The ranking is tied to the address of a page, so when you move it to a new address, you lose the ranking you've built up. However, if you use 301 redirects from the old content to the new, you will preserve your Google rankings. This tells the search engine that the content that was at page A is now at page B. Think "change of address" cards for the Internet. It works for search engines as well as for users in browsers.
Here's a good article on the subject: http://www.bruceclay.com/blog/2007/03/how-to-properly-implement-a-301-redirect/
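Once the redirects are in place, one way to sanity-check them is a small script like the following sketch (the URL mapping is a hypothetical placeholder; assumes the requests library):

    import requests

    # Hypothetical mapping of old URLs to the new URLs they should land on.
    REDIRECTS = {
        "http://example.com/old-page.html": "http://example.com/new-page/",
    }

    for old, new in REDIRECTS.items():
        resp = requests.get(old, allow_redirects=True, timeout=10)
        # The first hop should be a 301, and the final URL should be the new page.
        first_hop = resp.history[0].status_code if resp.history else None
        ok = first_hop == 301 and resp.url == new
        print(old, "->", resp.url, "(first hop:", first_hop, ")", "OK" if ok else "CHECK")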

How to get list of pages indexed by Google? [closed]

I created a site two years ago in PHP and am now converting it to ASP.NET MVC. I would like to get all the pages indexed by Google so I can validate that they all work with the new site.
Searching Google for "site:mysite.com" shows 21,000 results. How can I get these 21,000 results and validate that they all work with the new site?
I don't know if there is a tool that gives you a list. However, one way I can think of is to keep an eye on your Google Webmaster Tools account for errors. Any page that Google can't reach will be a page you need to look at and fix. This isn't a fast solution, but it's a reliable one.
If your previous website has a structure to its URLs, then it should be easy to replicate that using routes in ASP.NET.
I think this topic should be moved to the Webmasters section of Stack Exchange.
Generally, what people do when URLs change is set up 301 redirects from the old pages to the new ones (a.k.a. URL rewriting).
To verify that all your links work, you can use a tool called Xenu. Being a webmaster, you might already know it. It simply goes through the entire list of URLs on the pages and verifies them. Since in your case you want to check whether all your old PHP links still work, you could build a sitemap of the existing URLs and run the check against that sitemap on the new site.
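Assuming you can produce a sitemap of the old URLs, here is a rough sketch of such a check (the sitemap URL is a placeholder; assumes the requests library):

    import requests
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "http://mysite.com/sitemap.xml"  # placeholder
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    root = ET.fromstring(requests.get(SITEMAP_URL, timeout=10).content)
    for loc in root.findall(".//sm:loc", NS):
        url = loc.text.strip()
        # Some servers reject HEAD requests; switch to GET if that happens.
        status = requests.head(url, allow_redirects=True, timeout=10).status_code
        if status != 200:
            print(status, url)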

Construct a Netflix Affiliate URL to a search result page? [closed]

I have a Netflix Affiliate account, but I don't want to direct users to the homepage for them to create an account, I want to direct them to a search result page. The reason for this is that on our site we have lots of titles but they can't be reliably linked to a single Netflix result programmatically, so we would prefer if we could direct users to a search page, and if the user signs up, get the revenue. Is this possible? I find the whole Netflix-Affiliate-but-Google-Affiliate scheme a bit daunting.
Sorry for answering my own question once again. I found the answer here (that link is a bit hidden). For anyone wondering about the same problem, here's a quote from that page:
Deep linking (linking to pages other than the main Netflix login page) is a little more complicated - the structure is:
http://clickserve.cc-dt.com/link/tplclick?lid=41000000030242852&pubid=00000000000000000&redirect=[Encoded deep-linking URL]
That works pretty well.
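In Python terms, building such a link just means URL-encoding the deep link into the redirect parameter. A minimal sketch (the lid/pubid values are the placeholders from the quoted structure, not real account IDs, and the target URL is hypothetical):

    from urllib.parse import quote

    deep_link = "http://www.netflix.com/Search?v1=Jaws"  # hypothetical target page

    affiliate_url = (
        "http://clickserve.cc-dt.com/link/tplclick"
        "?lid=41000000030242852&pubid=00000000000000000"
        "&redirect=" + quote(deep_link, safe="")
    )
    print(affiliate_url)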
I would be wary of using Felix's example above before confirming your account details with the Netflix API team. I think you need to first submit your Google ID (GAN) along with your API key so that they will credit leads generated with the above structure.
Read the details on http://developer.netflix.com/docs/Affiliate_Program
Even worse, if you use http://www.netflix.com/Search?v1=Jaws, it correctly returns search results for "Jaws" when you're not logged in to Netflix, but if you happen to be logged in to your Netflix account it takes you to the Netflix homepage instead. This is with the affiliate-linkage element out of the equation entirely, so there are multiple issues at stake.
I'm looking for the solution too, and have had no response from Netflix or the Google Affiliate Network.
