Scrape an image from a website in Google Sheets (using IMPORTXML) - web-scraping

I'm trying to scrape a product image from the AliExpress website to display in a cell together with a link and other details. I've been trying to formulate the XPath, but I keep getting an error.
This is the formula I've used:
=IMAGE(IMPORTXML(A2, "(//img[@class='magnifier-image'])[1]/@src"))
A2 contains the product link.
Link to the product: https://www.aliexpress.com/item/1005001845596088.html
Could anyone help me understand what I'm doing wrong please?
I would be very grateful for any ideas.
Thank you,
Kristyna

Unfortunately, the HTML data retrieved by IMPORTXML does not contain the img[@class='magnifier-image'] element. Fortunately, that image URL can be retrieved from a meta tag instead. Reflected in the formula, that becomes the following.
Sample formula:
=IMAGE(IMPORTXML(A2,"//meta[@property='og:image']/@content"))
The cell "A2" has the URL of https://www.aliexpress.com/item/1005001845596088.html.
In this case, the image URL of https://ae01.alicdn.com/kf/S962937cf821a4e0196a11cf0c877df11Y/Birthday-Valentine-Day-Keychain-Gifts-for-Boyfriend-Husband-My-Man-I-love-you-Couples-Keyring-for.jpg is retrieved.
Note:
This sample formula is for the URL https://www.aliexpress.com/item/1005001845596088.html, so it might not work if you change the URL. It might also stop working if the site's structure changes. Please be careful about this.
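If you want to double-check that meta tag outside of Sheets, here is a minimal Python sketch; the requests library, the browser-like User-Agent, and the attribute order inside the tag are all assumptions on my part:

import re
import requests

# Fetch the product page; a browser-like User-Agent is assumed to be enough here.
url = "https://www.aliexpress.com/item/1005001845596088.html"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# Look for <meta property="og:image" content="...">; the attribute order is an assumption.
match = re.search(r'<meta[^>]*property="og:image"[^>]*content="([^"]+)"', html)
if match:
    print(match.group(1))  # the product image URL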

Related

cannot see some data after scraping a link using requests.get or scrapy

I am trying to scrape data from a stock exchange website. Specifically, I need to read the numbers in the top left table. If you inspect the HTML page, you will see these numbers under <div> tags, following <td> tags whose id is "e0", "e3", "e1" and "e4". However, the response, once saved into a text file, lacks all these numbers and some others. I have tried using Selenium with some 20-second delays (so that the JavaScript is loaded), but this does not work and the element cannot be found.
Is there any workaround for this issue?
If you use Inspect Element > Network > filter by XHR, you will see the request that actually delivers the data.
In your case it is this link: http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=7745894403636165&c=23%20
Unfortunately, the data is poorly structured, so you will have to work out at which position in the response the values that interest you appear. Good luck.
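As a rough sketch of that idea in Python (requests is assumed to be installed, and the ';'/',' delimiters and field positions are assumptions you would need to verify by inspecting the payload):

import requests

# Hit the XHR endpoint directly instead of the rendered page.
url = "http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=7745894403636165&c=23"
raw = requests.get(url).text

# The payload looks like plain delimited text rather than JSON;
# print each section with its index to find the values you need.
for i, section in enumerate(raw.split(";")):
    print(i, section.split(","))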

Scraping the gender of clothing items

Looking for advice please on methods to scrape the gender of clothing items on a website that doesn't specify the gender on the product page.
The website I'm crawling is www.very.co.uk and an example of a product page would be this - https://www.very.co.uk/berghaus-combust-reflect-long-jacket-red/1600352465.prd
Looking at that page, there looks to be no easy way to create a script that could identify this item as womenswear. Other websites might have breadcrumbs to use, or the gender might be in the title / URL, but this page has nothing.
As I'm using Scrapy, with the crawl template and Rules to build a hierarchy of links to scrape, I was wondering if it's possible to pass a variable in one of the Rules or the start URL, so that all items scraped by following that Rule / start URL are marked as womenswear? I could then feed this variable into a method / loader statement to tag the item as womenswear before putting it into a database.
If not, would anyone have any other ideas on how to categorise this item as womenswear? I saw an example where you could use an Excel spreadsheet to create the start_urls and, in that spreadsheet, tag each row as womenswear, mens, etc. However, I feel this method might cause issues further down the line and would prefer to avoid it if possible. I'll spare the details of why I think this would be problematic unless anyone asks.
Thanks in advance
There does seem to be a breadcrumb in your example; but as an alternative, you can usually check the page source by simply searching for your term - maybe there's some embedded JavaScript/JSON that can be extracted.
Here, the page source contains some JavaScript with a subcategory value indicating that this item is a "womens_everyday_sports_jacket".
You can parse it quite easily with some regex:
import re
re.findall('subcategory: "(.+?)"', response.body_as_unicode())
# ['womens_everyday_sports_jacket']
(In newer Scrapy versions, response.body_as_unicode() has been replaced by response.text.)
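Put together in a minimal, hypothetical Scrapy spider, tagging items from that value might look like the sketch below; the spider name, the start URL, and the "womens" prefix check are assumptions, not the site's documented structure:

import re
import scrapy

class VerySpider(scrapy.Spider):
    name = "very"  # hypothetical spider name
    start_urls = [
        "https://www.very.co.uk/berghaus-combust-reflect-long-jacket-red/1600352465.prd",
    ]

    def parse(self, response):
        # Pull the embedded subcategory value out of the raw page source.
        matches = re.findall(r'subcategory: "(.+?)"', response.text)
        subcategory = matches[0] if matches else None
        # Assumption: womenswear subcategories start with "womens".
        gender = "womenswear" if subcategory and subcategory.startswith("womens") else None
        yield {"url": response.url, "subcategory": subcategory, "gender": gender}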

Rvest web scrape returns empty character

I am looking to scrape some data from a chemical database using R, mainly name, CAS Number, and molecular weight for now. However, I am having trouble getting rvest to extract the information I'm looking for. This is the code I have so far:
library(rvest)
library(magrittr)
# Read HTML code from website
# I am using this format because I ultimately hope to pull specific items from several different websites
webpage <- read_html(paste0("https://pubchem.ncbi.nlm.nih.gov/compound/", 1))
# Use CSS selectors to scrape the chemical name
chem_name_html <- webpage %>%
  html_nodes(".short .breakword")
# Convert the data to text
chem_name_data <- html_text(chem_name_html)
However, when I try to create chem_name_html, R only returns character (empty). I am using SelectorGadget to get the HTML node, but I noticed that SelectorGadget gives me a different node than the Inspector does in Google Chrome. I have tried both ".short .breakword" and ".summary-title short .breakword" in that line of code, but neither gives me what I am looking for.
I have recently run into the same issues using rvest to scrape PubChem. The problem is that the information on the page is rendered with JavaScript as you scroll down the page, so rvest only gets minimal information from the page.
There are a few workarounds though. The simplest way to get the information that you need into R is using an R package called webchem.
If you are looking up name, CAS number, and molecular weight then you can do something like:
library(webchem)
chem_properties <- pc_prop(1, properties = c('IUPACName', 'MolecularWeight'))
The full list of compound properties that can be extracted using this API can be found here. Unfortunately there isn't a property through this API to get the CAS number, but webchem gives us another way to query it, using the Chemical Translation Service.
chem_cas <- cts_convert(query = '1', from = 'CID', to = 'CAS')
The second way to get information from the page that is a bit more robust but not quite as easy to work with is by grabbing information from the JSON api.
library(jsonlite)
chem_json <-
read_json(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/", "1", "/JSON/?response_type=save&response_basename=CID_", "1"))
That command returns a list of lists, and I had to write a fairly convoluted script to parse the information I needed out of it. If you are familiar with JSON, you can parse far more information from the page, but not quite everything. For example, the information in sections like Literature, Patents, and Biomolecular Interactions and Pathways will not fully show up in the JSON output.
The final and most comprehensive way to get all the information from the page is to use something like PhantomJS or another headless browser to render the full HTML output of the PubChem page (Scrapy on its own will not execute the JavaScript), then use rvest to scrape it like you originally intended. This is something that I'm still working on, as it is my first time using web scrapers as well.
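As a very rough sketch of that render-first approach - shown here in Python with Selenium and headless Chrome standing in for PhantomJS (my assumption, since recent Selenium releases have dropped PhantomJS support):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Render the page in a headless browser so the JavaScript-built content exists.
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://pubchem.ncbi.nlm.nih.gov/compound/1")
html = driver.page_source  # fully rendered HTML, ready to hand to rvest or another parser
driver.quit()

You may still need an explicit wait before reading page_source, since some sections load as you scroll.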
I'm still a beginner in this realm, but hopefully this helps you a bit.

Why is ImportXML not working for a specific field while trying to scrape kickstarter.com?

I am trying to screen scrape funding status of a specific Kickstarter project.
I am using the following formula in my Google spreadsheet; what I am trying to get here is the $ amount of the project's funding status:
=ImportXML("http://www.kickstarter.com/projects/1904431672/trsst-a-distributed-secure-blog-platform-for-the-o","//data[@class='Project942741362']")
It returns #N/A in the cell, with comment:
error: The xPath query did not return any data.
When I try using ImportXML on other parts of the same webpage it seems to work perfectly well. Could someone please point out what I am doing wrong here?
It seems that the <data> tag is not parsed correctly by IMPORTXML.
One possible workaround is:
=REGEXEXTRACT(IMPORTXML("http://...", "//div[@id='pledged']"), "^\S*")
This imports the text of the div with id='pledged', which should begin with the pledged amount, and REGEXEXTRACT's "^\S*" keeps just that leading token.

How to OR solr term facets via the search URL in Drupal 7 site?

I have a Drupal 7 website that is running apachesolr search and is using faceting through the facetapi module.
When I use the facets to narrow my searches, everything works perfectly and I can see the filters being added to the search URL, so I can copy them as links (ready-made narrowed searches) elsewhere on the site.
Here is an example of how the apachesolr URL looks after I select several facets/filters:
search_url/search_keyword?f[0]=im_field_tag_term1%3A1&f[1]=im_field_tag_term2%3A100
Where the 'search_keyword' portion is the text I'm searching for and the '%3A' is just the URL-encoded ':' (colon).
Knowing this format, I can create any number of ready-made searches by creating the correct format for the URL. Perfect!
However, these filters are always ANDed, the same way they are when using the facet interface. Does anyone know if there is a syntax I can use, specifically in the search URL, to OR my filters/facets? Meaning, to make the result contain all entries that match EITHER of the two filters?
Thanks in advance for any help or pointers in the right direction!
New edit:
I do know how to OR terms within the same vocabulary through the URL, I'm just wondering how to do it for terms in different vocabularies. ;-)
You can write a filter query that looks like:
fq=field1:value1 OR field2:value2
Alternatively you can use localparams to specify the query operator:
fq={!q.op=OR}field1:value1 field2:value2
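Using the field names from the example URL above (hypothetically - the exact field names depend on your index), that could look like:
fq={!q.op=OR}im_field_tag_term1:1 im_field_tag_term2:100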
As far as I know, there's no easier way to do this. There is, in fact, a rather old bug asking for a way to OR the fq parameters...
I finally found a way to do this in Drupal.
Enable the fq parameter setting:
1. Go to admin/config/search/apachesolr/[your_search_page]/core_search/edit, or just navigate to the settings of the search page you're trying to modify.
2. Check the 'Allow user input using the URL' setting.
URL syntax:
Add the following at the end of the URL: ?fq=tid:(16 OR 38), where 16 and 38 are the term IDs.
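Combined with the URL format from the question, a ready-made ORed search could then look like this (the space and colon may need URL-encoding as %20 and %3A):
search_url/search_keyword?fq=tid:(16%20OR%2038)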
