Scraping MSN News with Scrapy - web-scraping

I'm currently attempting to scrape MSN News with Scrapy and am having some difficulty getting the proper response inside the Scrapy shell.
When I go to https://www.msn.com/en-us/news/world in my browser, the page looks exactly the way it's supposed to.
But when I run the command scrapy shell https://www.msn.com/en-us/news/world and then view(response), I see something completely different.
I've tried disabling JavaScript to check whether the content was being loaded with AJAX and that's why it wasn't working, but all that did was stop the thumbnails from loading. Does anybody know why it's behaving this way?

The website definitely has a lot of JavaScript running. The way to approach this is to disable JavaScript in one browser instance and keep a normal instance open on the side.
Then you can dig around and compare: for example, find a thumbnail id and search for it in the no-JavaScript source - it's probably sitting in a JSON blob or a JavaScript variable.
This is what Scrapy sees with JavaScript disabled.
You can see that the article names and short descriptions are there. If you inspect a title you can even see that there's a link to the thumbnail as well!
articles = response.xpath("//li[@data-m]/a[@aria-label]")
for article in articles:
    # thumbnail (a small JSON string that contains the image URL)
    article.xpath("img/@data-src").extract_first()
    # '{"default":"//img-s-msn-com.akamaized.net/tenant/amp/entityid/AAp0iW6.img?h=414&w=624&m=6&q=60&u=t&o=t&l=f&f=jpg&x=1280&y=688"}'
    # title
    article.xpath("@aria-label").extract_first()
    # 'north korea can hit most of united states: u.s. officials provided by reuters'
    # description
    article.xpath("img/@alt").extract_first()
    # 'This Friday, July 28, 2017, photo distributed by the Nort...'
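The data-src value above is itself a small JSON string, so pulling out the actual image URL takes one more step. A minimal sketch building on the selectors above (not part of the original answer):

import json

# the thumbnail attribute holds JSON; the URL inside it is protocol-relative
thumb_json = article.xpath("img/@data-src").extract_first()
if thumb_json:
    thumb_url = "https:" + json.loads(thumb_json)["default"]
    # -> 'https://img-s-msn-com.akamaized.net/tenant/amp/entityid/AAp0iW6.img?...'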

Related

How does Google Keep fetch link preview?

My websites (which were just recovered from an attack) are being fetched like this in Google Keep. However, when I test them with the Facebook Open Graph Debugger they appear normal. So is there a way to know what mechanism Google Keep uses to fetch link previews?
Also asked in the Google Help Center.
This is a sign of a cloaking hack. See Fixing the Japanese keyword hack | Web Fundamentals for more details.
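One way to see whether cloaking is still happening is to compare what the server returns to a crawler-like request with what it returns to a normal browser request. A rough sketch only (the URL is a placeholder; a real check should also look at Google's cached copy of the page):

import requests

url = "https://example.com/"  # placeholder: use the affected page
bot_headers = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"}
browser_headers = {"User-Agent": "Mozilla/5.0"}

bot_html = requests.get(url, headers=bot_headers).text
browser_html = requests.get(url, headers=browser_headers).text

# a large difference between the two responses is the classic cloaking symptom
print(len(bot_html), len(browser_html))
print(bot_html == browser_html)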

LinkedIn Not utilizing og:image

I've got a WordPress site with multiple share buttons on its entries.
We designed it so there are no individual entry pages to view; the entries are podcasts and videos. The listing page has a minimum of 10 entries, each with its own share buttons.
Currently the share links and titles are working correctly, but LinkedIn is not recognizing the og:image and is instead picking up the site's default logo.
I read another post on Stack Overflow saying it might be an issue for LinkedIn if the image link uses SSL, but I find that hard to believe.
The other issue I'm struggling with: the docs say that once an image is scraped it stays cached for approximately 7 days.
I had a similar issue with Facebook, and there's a debugger that lets you rescrape the page, which let me verify my changes worked.
My two questions are: first, is there something other than og:image I should be specifying? Since I can't specify it per post, it's in the head of the page itself; I would think it would pick that up, no?
Second, is there a way a developer can re-check the page after the meta info has been changed to see whether the changes worked, without having to wait out the cache TTL?
Try this:
url/link?blah=1
url/link?blah=2
url/link?blah=3
to get around the cache.
This should trick it into thinking it's a new page each time.
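In other words, each share gets a throwaway query parameter so the scraper treats it as a brand-new URL. A tiny sketch of the idea (the base URL and parameter name are placeholders):

from urllib.parse import urlencode

base = "https://example.com/podcasts/"  # placeholder listing page
for i in range(1, 4):
    print(base + "?" + urlencode({"blah": i}))
    # https://example.com/podcasts/?blah=1  ... and so on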
Can I get a link to test?
Anthony Walz posted the correct answer. Over email he also helped with another problem I had, which corrected an issue I didn't realize existed until I looked.
My LinkedIn shares were not picking up the show title; they were picking up the page description instead (I have several podcasts showing on one page; we don't use individual post pages, they all play from the listing).
He pointed me to the developer docs on formatting sharing links,
which give a real-world example - here:
https://www.linkedin.com/shareArticle?mini=true&url=http://developer.linkedin.com&title=LinkedIn%20Developer%20Network&summary=My%20favorite%20developer%20program&source=LinkedIn
Thanks a ton for the assist, Anthony!
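For anyone building those links dynamically, the shareArticle URL from the docs can be assembled like this (a sketch only; the parameter values are the example values from the docs, not anything specific to this site):

from urllib.parse import urlencode

params = {
    "mini": "true",
    "url": "http://developer.linkedin.com",
    "title": "LinkedIn Developer Network",
    "summary": "My favorite developer program",
    "source": "LinkedIn",
}
share_url = "https://www.linkedin.com/shareArticle?" + urlencode(params)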

How can I get this Google RSS gadget to work?

I am trying to embed an RSS feed on a web page I am designing. The feed is for an Austin, TX Craigslist page:
http://austin.craigslist.org/search/fua?query=%22modern+salvage%22&srchType=A&minAsk=&maxAsk=
Depending on which URL I use for the feed I get one of these results:
**Feed URL**
http://www.gmodules.com/ig/creator?url=http://austin.craigslist.org/search/fua?query=%22modern%20salvage%22&srchType=A&format=rss
Error parsing module spec:
Not a properly formatted file
missing xml header
**<link> in the XML for the above URL**
http://www.gmodules.com/ig/creator?url=http://austin.craigslist.org/search/fua?query=&quot;modern%20salvage&quot;&amp;srchType=A
Information is temporarily unavailable
**I have also tried the URL in the head of the HTML doc:**
http://austin.craigslist.org/search/fua?query=%22modern+salvage%22&srchType=A&format=rss
(the <link> tag's title is: RSS feed for craigslist | furniture - all "modern salvage" in austin)
Information is temporarily unavailable.
Although Craigslist encourages users to embed RSS feeds, I wonder if the Craigslist server is denying the request. I have a background in design, not programming. Any suggestions?
Thank You.
I'm not sure I understand what gadget you're using...
Anyway, I was able to make your feed load with the RSS Reader+ gadget on a Google Sites page (it won't stay up forever).
Attempts to make it work with http://www.gstatic.com/sites-gadgets/rss-sites/rss_sites.xml were unsuccessful. I think that gadget is broken, according to comments on http://www.google.ca/ig/directory?type=gadgets&url=www.google.com/ig/modules/reader.xml
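If the gadget keeps failing, it can also help to confirm the feed itself is reachable before blaming the embed. A minimal sketch using the third-party feedparser library (an assumption on my part; any RSS parser would do):

import feedparser  # third-party: pip install feedparser

feed_url = ("http://austin.craigslist.org/search/fua"
            "?query=%22modern+salvage%22&srchType=A&format=rss")
d = feedparser.parse(feed_url)
print("parse problem:", d.bozo)   # non-zero if the feed is malformed or blocked
print("entries:", len(d.entries))
for entry in d.entries[:3]:
    print(entry.title, entry.link)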

Display Page feed to webpage using XFBML

I have a Facebook Page whose feed I want to display on my webpage.
I can get it to work with my OWN feed, but that is not the plan.
(https://www.facebook.com/pages/Inferno-Online-Stockholm/149232401793444) is the page I want to pull from, but getting it to work from the webpage has lost me.
Fox
I have written a small Wall Feed Plugin that you can install with a simple iframe, if you wish to check it out.
https://apps.facebook.com/AnotherFeed/?pageid=149232401793444&since=30+days+ago&limit=8&type=feed&ref=get.code
This will work for any page or application, and it can also work for personal profiles when the posts are public.
HOW? I used the php-sdk and the Graph API for this plugin. The same thing can be done with the JavaScript SDK.
SEE: https://developers.facebook.com/docs/sdks/
SEE: https://developers.facebook.com/docs/reference/api/
SEE: https://developers.facebook.com/docs/reference/api/post/
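For context, here is a bare-bones sketch of reading a public Page's feed through the Graph API, which is what the plugin does under the hood. The access token is a placeholder, and Facebook requires a valid token (and, in newer API versions, appropriate page permissions) for this call:

import requests

PAGE_ID = "149232401793444"         # the Inferno Online Stockholm page from the question
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder

resp = requests.get(
    "https://graph.facebook.com/" + PAGE_ID + "/feed",
    params={"access_token": ACCESS_TOKEN, "limit": 8},
)
for post in resp.json().get("data", []):
    print(post.get("created_time"), (post.get("message") or "")[:80])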

Flex 3: Project Architecture & SEO

I've got a Flex 3 project. One of the problems I have is that not much of its content is indexed by Google. Currently I pull data from a MySQL database, so the Googlebot doesn't see most of the site.
My goal is to increase the amount of content indexed by Google, improve the SEO, and improve SERPs.
I thought that instead of pulling the data from the database that I would change the project's architecture and create separate "pages". So, in my case, I would compile each puzzle separately and upload it to the server in its own directory. This way the info in each puzzle would get indexed.
The negative is that if I add a puzzle, I'd have to add a link to it in all of the puzzles that are already on the server. I would have to add the link, re-compile each puzzle and upload it to the server. Is there a way to get around this problem? Also, if I wanted to communicate some data from one puzzle to another in the future, I wouldn't be able to do so.
Any suggestions?
Thank you.
-Laxmidi
The usual way to achieve this goal is to develop a hidden parallel site in HTML.
On the first page you will have your Flash and, hidden by JavaScript, a list of links to the other pages. These links will be parsed by the robots. Ideally, the href pages are virtual (look up "URL rewriting"). On each "fake" page, your server-side language prints the content or links from your database AND the Flash. The Flash is given a string telling it where it is and what it's supposed to show.
Ex: http://www.mysite.com/category1/content7 - URL rewriting sends this request to http://www.mysite.com/index.php?uri=category1/content7. The page should display the Flash with the FlashVar "uri=category1/content7". The Flash then knows which content it has to display, so when a user arrives from Google via this link, they find the content they were looking for.
Every link and every piece of content meant for SEO should be in HTML; don't trust the robots' ability to read Flash.
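The answer describes this with PHP and URL rewriting; purely as an illustration, here is the same pattern sketched in Python/Flask (all names below are made up): one catch-all route serves crawlable HTML pulled from the database and embeds the Flash with a FlashVars value telling it which content to show.

from flask import Flask

app = Flask(__name__)

def load_content(uri):
    # hypothetical database lookup for the "virtual" page
    return {"title": uri, "body": "HTML version of the content at " + uri}

@app.route("/<path:uri>")
def virtual_page(uri):
    content = load_content(uri)
    return (
        "<h1>{title}</h1><p>{body}</p>"
        '<object data="/site.swf" type="application/x-shockwave-flash">'
        '<param name="FlashVars" value="uri={uri}"/>'
        "</object>"
    ).format(title=content["title"], body=content["body"], uri=uri)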
Have a look at Adobe's reference on deep linking.
You can also generate the website's sitemap.xml with a (daily) cron process, such that the URLs encode the state of the application you need. Each URL encodes whatever content needs to be retrieved from the DB, with just one index.html page.
Good luck!
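As a sketch of the cron idea (the puzzle ids and domain are placeholders): write one <url> entry per puzzle, with each URL encoding the state the single index.html page should load.

# one <url> entry per puzzle; each URL encodes which puzzle index.html should load
puzzle_ids = ["puzzle-001", "puzzle-002", "puzzle-003"]  # would come from the database

entries = "\n".join(
    "  <url><loc>http://www.mysite.com/index.html?puzzle={0}</loc></url>".format(pid)
    for pid in puzzle_ids
)
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + entries + "\n</urlset>"
)
with open("sitemap.xml", "w") as f:
    f.write(sitemap)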
