Scraping availability of the product

My goal is to scrape data on whether a product is available or not.
At present I am using the following:
=IMPORTXML(B2, "//*[@id='product-controls']/div/div[1]/div[1]")
Unfortunately, I am receiving an error. Here is the link to the file: https://docs.google.com/spreadsheets/d/11OJvxRRIXJolpi2UttmNIOArAdwh1qeZhjqczlVI8oc/edit#gid=1531110146
As an example, I want to get the data from the URL https://radiodetal.com.ua/mikroshema-5m0365r-dip8, and the XPath should be taken from there.

I got it working with the formula:

=IMPORTXML(B2, "//div[@class='stock']")
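
For reference, the same formula with the product URL inlined rather than read from cell B2 (a sketch; the stock class name comes from the example page above):

=IMPORTXML("https://radiodetal.com.ua/mikroshema-5m0365r-dip8", "//div[@class='stock']")

Note that IMPORTXML only sees content present in the raw HTML the server returns, not anything rendered by JavaScript.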

Related

Retrieving Company Name from Web Search using Ticker

I have a list of 3000+ tickers for which I would like to get their respective company names. I couldn't find these names using the Bloomberg database.
I manually checked on Google for a few of them and found that, for US-based tickers, the first page in the Google search results that gives the company's name is Bloomberg's. For example, when I searched for "0000284D US Equity", the first result was https://www.bloomberg.com/profile/company/0000284D:US. For non-US tickers such as "010520 KS Equity", the first page that shows the company's name could be something else.
I checked posts like this one - Finding company name from a ticker in Bloomberg - but couldn't find the relevant solution.
Is there any R package that can help in fetching the company name from a web search using the ticker? Please suggest. Thanks.
I couldn't find a package in R, but I did find a Python package that can help. You can pass the tickers in a list and print their long names as below.
# install once from the shell: pip install yfinance
import yfinance as yfin

tickers = ["AAPL", "MSFT"]  # replace with your full list of tickers
for ticker in tickers:
    print(yfin.Ticker(ticker).info['longName'])

What to do if rvest isn't recognizing a node in R?

I am trying to scrape referee game data using rvest. See the code below:
library(rvest)

page_ref <- read_html("https://www.pro-football-reference.com/officials/HittMa0r.htm")
ref_tab <- page_ref %>%
  html_node("#games") %>%
  html_text()
  # html_table()
But rvest does not recognize any of the nodes for the "Games" table in the link, although it can pull the data from the first table, "Season Totals", just fine. Am I missing something? In general, what does it mean if rvest doesn't recognize a node that SelectorGadget identifies and that is clearly visible in the developer tools?
It is because the first table is present in the HTML you get from the server, while the other tables are filled in by JavaScript after the page loads. rvest can only see what is in the HTML response from the server. If you want the data filled in by JS, you need a tool that drives a real browser, such as Selenium or Puppeteer (see the sketch after the links below).
Selenium
Puppeteer / gepetto
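
As a minimal sketch of the browser-automation route, here is the idea in Python with selenium (assuming a working Chrome/chromedriver setup; the #games selector comes from the question):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # give the page's JavaScript time to build the tables
try:
    driver.get("https://www.pro-football-reference.com/officials/HittMa0r.htm")
    # Once the browser has run the JS, the "Games" table exists in the live DOM.
    games = driver.find_element(By.CSS_SELECTOR, "#games")
    print(games.text)
finally:
    driver.quit()

The same idea carries over to RSelenium if you want to stay in R.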

Extract Google Analytics search keywords and impressions

I'm trying to extract search engine keywords using the "rga" package in R.
I'm using the report:
Acquisition > Search Console > Queries
However, I can't find the right name in the API for the search-keyword dimension.
I'm looking for a dimension like ga:searchkeyword or ga:searchenginekeyword.
I want to run the following code:
ga$getData(ids, dimensions = "ga:xxxxxx", metrics = "ga:impressions",
           start.date = "2016-04-01", end.date = "2017-04-01")
What dimension should I supply in dimensions parameter?
ga:adMatchedQuery just gives you the keywords a user googled before clicking your AdWords ad; you will not see the organic keywords.
You don't get the organic keywords from the Search Console through the GA Reporting API.
You can use the searchConsoleR package (https://cran.r-project.org/web/packages/searchConsoleR/searchConsoleR.pdf) together with googleAuthR for the authentication part.
The query looks like this:
library(googleAuthR)
library(searchConsoleR)
scr_auth()  # authenticate; opens a browser the first time
keywords <- search_analytics(website_name,
                             start_date, end_date,
                             c("query", "page"),
                             searchType = "web")
ga:adMatchedQuery is what you're looking for
https://developers.google.com/analytics/devguides/reporting/core/dimsmets#view=detail&group=adwords&jump=ga_admatchedquery

Scrapy returning numbers and letters instead of "?" for href value

I am trying to scrape a web forum using Scrapy for the href link info, and when I do, I get the href back with a long string of letters and numbers where the question mark should be.
I am scraping the href values from the HTML using the following code:
response.xpath('.//*[contains(@id, "thread_title")]/@href').extract()
When I run this, I get the following results:
[u'showthread.php?s=f969fe6ed424b22d8fddf605a9effe90&t=2676278']
What should be returned is:
[u'showthread.php?t=2676278']
I have run other tests scraping href data with question marks elsewhere in the document, and I also get the "s=f969fe6ed424b22d8fddf605a9effe90&" returned there.
Why am I getting this data back with "s=f969fe6ed424b22d8fddf605a9effe90&" instead of just the question mark?
Thanks!
It turned out that the site I am scraping uses a unique identifier in order to more accurately update the number of views per thread. I was not able to return scraped data without the unique id, and it changed over time, so instead I scraped a different HTML tag for the thread ID and then joined it to the web address (showthread.php?t=) to create the link I was looking for, roughly as in the sketch below.
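
A minimal sketch of that workaround as a Scrapy spider. It assumes (my assumption, not confirmed above) that the id attribute looks like thread_title_2676278, so the numeric suffix can be split off and appended to showthread.php?t=:

import scrapy

class ThreadLinkSpider(scrapy.Spider):
    # Hypothetical spider illustrating the workaround described above.
    name = "thread_links"
    start_urls = ["https://forum.example.com/forumdisplay.php?f=1"]  # placeholder URL

    def parse(self, response):
        # Assumes ids of the form "thread_title_2676278"; take the numeric
        # suffix and rebuild a session-free link from it.
        for element_id in response.xpath(
                './/*[contains(@id, "thread_title")]/@id').extract():
            thread_id = element_id.split('_')[-1]
            yield {'link': 'showthread.php?t=' + thread_id}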

Finding article by date code in Google Search Appliance (GSA)

The Google Search Appliance goes through and finds out the date of each article when it crawls (last modified date is the default).
However, it doesn't turn up articles when you query by date code.
Is there any way to get the GSA to do this?
(We have a daily broadcast which people often search for by date code. Right now we have to manually put the 4 most common date codes into the meta-keywords in order for them to be pulled up through a query.)
Have you tried using inmeta:date as described in the Search Protocol Reference documentation?
Alternatively, if the date code is in the document content or the URL you could use entity recognition to extract it.
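For example, a query shaped roughly like this (a sketch only; the exact meta attribute name and date format depend on how the appliance is configured, so check the Search Protocol Reference):

broadcast inmeta:date:daterange:2016-04-01..2016-04-30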
One way to make sure GSA is collecting the document date is to check the search results in XML format and see whether the date field has a value. You can see the results in XML format by removing the proxystylesheet parameter from the URL.
If the date field is empty, then GSA is not getting the document dates.
You can configure the document dates under Crawl and Index > Document Dates (at least in GSA version 7). We are using the meta tag approach: we put a date meta tag on each document/page and tell GSA to use that meta tag to sort the documents (a sketch of such a tag follows the list below). The full list of options is:
URL
Meta Tag
Title
Body
Last Modified
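
With the meta tag approach, the tag looks something like this (the name publishdate is only an illustration; you tell GSA which meta tag name to read in the Document Dates configuration):

<meta name="publishdate" content="2016-04-01">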
Here are some links that helped me to find answers when dealing with a similar problem:
https://support.google.com/gsa/answer/2675414?hl=en
https://developers.google.com/search-appliance/documentation/64/xml_reference#request_sort_by_date
https://groups.google.com/forum/#!searchin/google-search-appliance-help/sort$20by$20date$20not$20working
