Scrapy: Unable to access a class even though it's there - css

I am trying to scrape this page to fetch the color name, LT. BLUE. In Chrome I see this HTML:
<div id="desc-options"><div class="option"><span class="label">Color:</span> LT. BLUE</div><div class="option"><span class="label">Size:</span> 6.5</div></div>
I tried response.css("#desc-options") to access everything inside it, but it returns []. Even BeautifulSoup fails.

The element you're looking for is dynamically created via JavaScript. You cannot parse it from the plain HTML.
The good news is: the data you're looking for is probably still in the page. Check out the <script> tag defining the spConfig variable. Looks like there's some JSON there you can parse ...
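As a minimal sketch of that suggestion, the JSON can be pulled out of the script text with the standard library alone. The sample markup, the shape of the JSON, and the extract_sp_config helper below are assumptions for illustration; the real page's structure is not shown in the question. In a Scrapy callback, response.text would be passed in place of SAMPLE_HTML.

```python
import json

# Hypothetical page source. The actual contents of spConfig on the real
# page are unknown, so this JSON shape is an assumption.
SAMPLE_HTML = """
<script type="text/javascript">
    var spConfig = new Product.Config({"attributes": {"92": {"label": "Color", "options": [{"label": "LT. BLUE"}]}}});
</script>
"""


def extract_sp_config(html):
    """Return the first JSON object that follows 'spConfig', or None."""
    start = html.find("spConfig")
    if start == -1:
        return None
    brace = html.find("{", start)
    if brace == -1:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever trails it (here the closing ");" of the script).
    obj, _end = json.JSONDecoder().raw_decode(html, brace)
    return obj


config = extract_sp_config(SAMPLE_HTML)
if config is not None:
    for attribute in config["attributes"].values():
        labels = [option["label"] for option in attribute["options"]]
        print(attribute["label"], labels)
```

Using raw_decode avoids writing a regular expression that has to guess where the JSON object ends.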

Related

Fill input tag of an html in python

I have been trying to scrape a website that has a field you can type into (a navigation bar of some sort).
Whenever I type something there, it shows a dropdown of items that contain what I typed.
What I'm trying to do is use requests.post from the Python requests library to fill in a value, and then grab whatever the dropdown shows.
I've had a few problems while doing it:
The dropdown disappears whenever you click somewhere else on the website, so the HTML tags for the list only exist temporarily.
I couldn't find a way to actually post something into the navigation bar.
A good example I've found is FUTWIZ, which does exactly what I described above. When I inspect it with F12 I can see it creates some HTML for the dropdown. Is there a way to grab that HTML after the value has been entered in the navigation bar?
EDIT
This is the code I've tried:
import requests
from bs4 import BeautifulSoup
urls = "https://www.futwiz.com/en/"
requst = requests.get(urls)
bs4Out = BeautifulSoup(requst.text, "html.parser")
poster = requests.post(urls, data={"form-control": "Messi"})
print(poster.text)
Now, I know that data in requests.post only sends it as a query, but I can't figure out how to fill in the header.
This is the link to FUTWIZ; it has the navigation bar I'm trying to work with:
https://www.futwiz.com/en/

Extract link text from an <a> tag with web scraping using Google Sheets

I have the following HTML:
<a href="link.html">Text</a>
How can I get the "Text" value? I am trying this, but I get an empty value:
=INDEX(importxml("http://www.remoteurl.com";"//a[@href='link.html']");1)
I tried using your syntax and it worked for me. I shortened it a little for testing purposes.
=importxml("https://www.remoteurl.com","//a[@href='link.html']")
Be sure that the href value you are passing in the xpath query is exactly what is present on the web page, e.g. if the web page uses a relative path then you must also use the same relative path.
I was doing it properly, but the problem was that the markup was inside an iframe, so it was impossible to reach.
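To illustrate the point about matching the href value exactly, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which accepts the same [@href='...'] predicate. The sample markup and the link_text helper are assumptions for illustration only; IMPORTXML itself runs inside Google Sheets, and real pages are often not well-formed XML, so this parser would not work on them directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (an assumption, not the real page).
SAMPLE = (
    "<html><body>"
    '<a href="link.html">Text</a>'
    '<a href="other.html">Other</a>'
    "</body></html>"
)


def link_text(markup, href):
    """Return the text of the first <a> whose href equals `href` exactly."""
    root = ET.fromstring(markup)
    node = root.find(f".//a[@href='{href}']")
    return node.text if node is not None else None


print(link_text(SAMPLE, "link.html"))   # exact relative path matches
print(link_text(SAMPLE, "/link.html"))  # a different path does not match
```

The second call returns None because the predicate compares the attribute value literally, which is why a relative path on the page must be queried as the same relative path.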

When using apoc.load.html, Is it possible to return the full HTML rather than only text?

Let's say I want to scrape the Neo4j RefCard found at: https://neo4j.com/docs/cypher-refcard/current/
And I would like to fetch a 'code' example along with its styling. Here's my target. Notice that it has CSS treatment (font, color...):
...so in Neo4j I call the apoc.load.html procedure as shown here, and you can see it's no problem finding the content:
It returns a map with three keys: tagName, attributes, and text.
The text is the issue for me. It's stripped of all styling. I would like for it to let me know more about the styling of the different parts of this text.
The actual HTML in the webpage looks like following image with all of these span class tags: cm-string, cm-node, cm-atom, etc. Note that this was not generated by Neo4j's apoc.load.html procedure. It came straight from my Chrome browser's inspect console.
I don't need the actual fonts and colors, just the tag names.
I can see in the documentation that there is an optional config map you can supply, but there's no explanation of what can be configured there. It would be lovely if I could configure it to return, say, HTML rather than text.
The library that Neo4j uses for CSS selection here is jsoup.
So I am hoping to not strip the <span> tags, or otherwise, extract their class names for each segment of text.
Could you not generate the HTML yourself from the properties in your object? It looks like they are all span tags with 3 different classes, depending on whether you're using the property name, property value, or property delimiter.
That is probably how they are generating the HTML themselves.
Okay, two years later I revisited this question I posted, and did find a solution. I'll keep it short.
The APOC procedure apoc.load.html uses the scraping library jsoup, which is not a full-fledged browser. When it visits a page it reads the HTML sent by the server but ignores any JavaScript. As a result, if a page uses JavaScript to insert content, or even just to format it, jsoup will miss the HTML that the JavaScript would have generated had it run.
So I tried out the service at prerender.com. It's simple to use: you send it a URL, and it fetches that page itself, executing the page's JavaScript as it does, and returns the final result as static HTML.
So if I call prerender.com from apoc.load.html, jsoup simply asks for the HTML and this time gets the fully rendered HTML. :)
You can try the following two queries to see the difference pre-rendering makes. The span tags in this page are rendered only by JavaScript, so if we ask for its span tags without pre-rendering, nothing is returned.
CALL apoc.load.html("https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags
...but if we call it through the pre-rendering service, we get a bunch of span tags and their content.
CALL apoc.load.html("https://service.prerender.cloud/https://neo4j.com/docs/cypher-refcard/current/", {target:".listingblock pre:contains(age: 38) span"}) YIELD value
UNWIND value.target AS spantags
RETURN spantags

Meteor: Get url of uploaded image in CollectionFS photo-blog example

I'd like to be able to pass the URL of an uploaded image in javascript in the tutorial example making a photoblog in meteor.
In that example (in home.js), the helper for templates that render images returns Images.find(), which is used in the image template (image.html) to output html to show the image via:
<img src="{{url}}" />
This works fine, as does the entire tutorial, including S3. However, I'd like to combine it with another project, and that one will require storing and passing around the url under program control.
It would seem that because the template is able to use {{url}}, that in js, one could, in the simplest case, use Images.findOne().url to get at least the first url. E.g., I have modified the given helper to contain this:
Template.home.helpers({
  'images': function() {
    console.log("url from home helper: = " + Images.findOne().url); //cannot read url property
    return Images.find();
  }
});
However, this gets the error "cannot read url property..." (and after that, for some reason, the console prints out a huge batch of source code!!) If the template is able to render the field "url" from the collection image object, why can't js see it?
How can I get at the url in javascript?
url is a function, not a property, so you have to call Images.findOne().url() rather than Images.findOne().url.
If you still get the same error, it is because findOne returned undefined.
These are the possible causes:
Your Images collection is empty.
You did not publish the images, or did not subscribe to them.
You may be making this call before the images have been uploaded.
I hope this solves your issue.

How to reload the page with iron-router in Meteor?

I fetch random data from the database for my page, and I would like to add an <a> tag that links to the same page to get another random record. However, the link does nothing when the target page is the same as the current page.
Is there a better way to get new data?
Have a look at this issue: https://github.com/EventedMind/iron-router/pull/324
Basically iron-router adds events to all your a tags. Also the corresponding bit of code: https://github.com/EventedMind/iron-router/blob/79861385df5d2b667630ec82abe4de3efa3166e3/lib/client/location.js#L48
So you need to pass IronLocation a selector that does not match the a tag that should reload the page.
E.g. you could do:
IronLocation.configure({
  'linkSelector' : 'a[href][data-router="true"]'
});
I'm not sure what your a tags look like, but you could either mark all the a tags you want iron-router to handle, as above, or mark a specific a tag and exclude it:
'linkSelector' : 'a:not([ironskip])'
Then use <a href=".." ironskip>..</a> for the route you don't want iron-router to handle.
This way you can specify which a tags you want iron-router to touch and which ones you don't.

Resources