Scraping a user's video on Twitter with rvest - CSS

I'm using rvest to scrape static elements on the web. However, I can't scrape dynamic content. For example, how do I scrape the audience count (44K) in the following video post?
I tried this:
library(rvest)

video_tweet <- read_html("https://twitter.com/estrellagalicia/status/993432910584659968")
video_tweet %>%
  html_nodes("#permalink-overlay #permalink-overlay-dialog div #permalink-overlay-body div div div div div div div div div div div div span div div div div span span") %>%
  as.character()

You need to use RSelenium, and you should pick the right CSS selector for it. This should work:
library(RSelenium)
library(rvest)
# start a Chrome session and grab the client
rmDr <- rsDriver(browser = "chrome")
myclient <- rmDr$client

video_tweet <- "https://twitter.com/estrellagalicia/status/993432910584659968"
myclient$navigate(video_tweet)

# hand the fully rendered page source over to rvest
mypagesource <- myclient$getPageSource()
read_html(mypagesource[[1]]) %>%
  html_nodes("#permalink-overlay-dialog > div.PermalinkOverlay-content > div > div > div.permalink.light-inline-actions.stream-uncapped.has-replies.original-permalink-page > div.permalink-inner.permalink-tweet-container > div > div.js-tweet-details-fixer.tweet-details-fixer > div.card2.js-media-container.has-autoplayable-media > div.PlayableMedia.LiveBroadcastCard-playerContainer.LiveBroadcastCard--supportsLandscapePresentation.watched.playable-media-loaded > div > div > div > div:nth-child(2) > div.rn-1oszu61.rn-1efd50x.rn-14skgim.rn-rull8r.rn-mm0ijv.rn-13yce4e.rn-fnigne.rn-ndvcnb.rn-gxnn5r.rn-1nlw0im.rn-deolkf.rn-6koalj.rn-1pxmb3b.rn-7vfszb.rn-eqz5dr.rn-1r74h94.rn-1mnahxq.rn-61z16t.rn-p1pxzi.rn-11wrixw.rn-ifefl9.rn-bcqeeo.rn-wk8lta.rn-9aemit.rn-1mdbw0j.rn-gy4na3.rn-u8s1d.rn-1lgpqti > span > div > div.rn-1oszu61.rn-1efd50x.rn-14skgim.rn-rull8r.rn-mm0ijv.rn-13yce4e.rn-fnigne.rn-ndvcnb.rn-gxnn5r.rn-deolkf.rn-6koalj.rn-1pxmb3b.rn-7vfszb.rn-eqz5dr.rn-1mnahxq.rn-61z16t.rn-p1pxzi.rn-11wrixw.rn-ifefl9.rn-bcqeeo.rn-wk8lta.rn-9aemit.rn-1mdbw0j.rn-gy4na3.rn-bnwqim.rn-1lgpqti > div > div > span > span") %>%
  as.character()
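The reason the rvest attempt returns nothing is that the audience count is injected client-side by JavaScript, so it is simply absent from the HTML a static fetch receives; only a real browser (which is what RSelenium drives) ever sees it. A minimal sketch of that difference, here in Python with BeautifulSoup and made-up markup (the real class names on twitter.com are auto-generated and change often):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: what a static scraper like rvest downloads.
# The viewer count is not in the raw HTML at all.
static_html = '<div id="permalink-overlay"></div>'

# What the browser's DOM looks like after the page's scripts have run,
# i.e. what getPageSource() returns from a driven browser.
rendered_html = '<div id="permalink-overlay"><span>44K</span></div>'

static_hits = BeautifulSoup(static_html, "html.parser").select("#permalink-overlay span")
rendered_hits = BeautifulSoup(rendered_html, "html.parser").select("#permalink-overlay span")

print(len(static_hits))                           # 0 -- nothing for a static parser to find
print(len(rendered_hits), rendered_hits[0].text)  # 1 44K
```

The same selector succeeds or fails purely depending on whether the HTML was rendered first, which is why switching from a plain HTTP fetch to a browser-driven one fixes the problem.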

Related

Can't Access Static Drop-down Inside an Iframe

I want to click a dropdown inside an iframe and select some options, but I keep getting this error in Cypress:
cypress-iframe commands can only be applied to exactly one iframe at a time. Instead found 2
Please see my code below:
it('Publish and Lock Results', function () {
    setClasses.clickTools()
    setClasses.clickConfiguration()
    setClasses.publishAndLockResult()
    cy.frameLoaded();
    cy.wait(5000)
    cy.iframe().find("body > div:nth-child(6) > div:nth-child(1) > div:nth-child(1) > form:nth-child(2) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(2) > div:nth-child(5) > p:nth-child(2)").click()
})

**Please see screenshot of the element:**
![screenshot of the element](https://i.stack.imgur.com/WpUfn.png)
There should be an iframe identifier you can use (a unique id, class, or attribute on the iframe, such as data-test):
let iframe = cy.get(iframeIdentifier).its('0.contentDocument.body').should('not.be.empty').then(cy.wrap)
iframe.find("body > div:nth-child(6) > div:nth-child(1) > div:nth-child(1) > form:nth-child(2) > table:nth-child(5) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(2) > div:nth-child(5) > p:nth-child(2)").click()
From what the error output suggests, you've got two iframes on that page that Cypress is trying to send commands to, and you have to pick one.
Answer adapted from source: cypress.io blog post on working with iframes

Click SalesForce Report Refresh Button

I'm trying to find the simplest way to have a Salesforce report's refresh button clicked automatically every 2 minutes. Below are the various "copy" details from inspecting the element (sorry, I'm not sure which information would be most helpful for this request).
Element:
Refresh
Outer HTML:
Refresh
Selector:
#brandBand_1 > div > div > div > div > div.slds-page-header--object-home.slds-page-header_joined.slds-page-header_bleed.slds-page-header.slds-shrink-none.test-headerRegion.forceListViewManagerHeader > div:nth-child(2) > div:nth-child(3) > force-list-view-manager-button-bar > div > div:nth-child(1) > lightning-button-icon
JS Path:
document.querySelector("#brandBand_1 > div > div > div > div > div.slds-page-header--object-home.slds-page-header_joined.slds-page-header_bleed.slds-page-header.slds-shrink-none.test-headerRegion.forceListViewManagerHeader > div:nth-child(2) > div:nth-child(3) > force-list-view-manager-button-bar > div > div:nth-child(1) > lightning-button-icon")
Xpath:
//*[@id="brandBand_1"]/div/div/div/div/div[1]/div[2]/div[3]/force-list-view-manager-button-bar/div/div[1]/lightning-button-icon
Full Xpath:
/html/body/div[4]/div[1]/section/div[1]/div/div[2]/div[1]/div/div/div/div/div/div/div/div[1]/div[2]/div[3]/force-list-view-manager-button-bar/div/div[1]/lightning-button-icon
I've tried some browser extensions, but so far I have only found extensions that refresh the whole browser page.

Scrapy splash can't find element

Problem:
I am using Scrapy with Splash to scrape a web page. However, the CSS path for imageURL does not return any element, while the ones for name and category work fine. (The XPath and selectors were all copied directly from Chrome.)
Things I've Tried:
At first I thought it was because the page had not fully loaded when parse was called, so I changed the wait argument of SplashRequest to 5, but it did not help. I also downloaded a copy of the HTML response from the Splash GUI (http://localhost:8050) and verified that the XPath/selectors all work on the downloaded copy. I assumed that this HTML would be exactly what Scrapy sees in parse, so I couldn't make sense of why it wouldn't work inside the Scrapy script.
Code:
Here is my code:
import scrapy
from scrapy_splash import SplashRequest

class NikeSpider(scrapy.Spider):
    name = 'nike'
    allowed_domains = ['nike.com', 'store.nike.com']
    start_urls = ['https://www.nike.com/t/air-vapormax-flyknit-utility-running-shoe-XPTbVZzp/AH6834-400']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                args={'wait': 5}
            )

    def parse(self, response):
        name = response.xpath('//*[@id="RightRail"]/div/div[1]/div[1]/h1/text()').extract_first()
        imageURL = response.css('#PDP > div > div:nth-child(2) > div.css-1jldkv2 > div:nth-child(1) > div > div > div.d-lg-h.bg-white.react-carousel > div > div.slider-container.horizontal.react-carousel-slides > ul > li.slide.selected > div > picture:nth-child(3) > img::attr(src)').extract_first()
        category = response.css('#RightRail > div > div.d-lg-ib.mb0-sm.mb8-lg.u-full-width > div.ncss-base.pr12-sm > h2::text').extract_first()
        url = response.url
        if name is not None and imageURL is not None and category is not None:
            item = ProductItem()  # defined in the project's items.py
            item['name'] = name
            item['imageURL'] = imageURL
            item['category'] = category
            item['URL'] = url
            yield item
Maybe they use different formatting, but for me it works with source::attr(srcset) at the end:
imageURL = response.css('#PDP > div > div:nth-child(2) > div.css-1jldkv2 > div:nth-child(1) > div > div > div.d-lg-h.bg-white.react-carousel > div > div.slider-container.horizontal.react-carousel-slides > ul > li.slide.selected > div > picture:nth-child(3) > source::attr(srcset)').extract_first()
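The fix works because, inside a &lt;picture&gt; element, the URL often lives on a &lt;source srcset=...&gt; child rather than on the &lt;img src=...&gt; fallback, so selecting img::attr(src) comes back empty. A minimal sketch of that distinction, in Python with BeautifulSoup and simplified stand-in markup (the real Nike page structure is assumed, not reproduced):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the rendered product page: the <img> tag carries
# no src attribute, while the <source> sibling holds the URL in srcset.
html = """
<picture>
  <source srcset="https://example.com/shoe-large.jpg">
  <img alt="Air VaporMax">
</picture>
"""
soup = BeautifulSoup(html, "html.parser")

img = soup.select_one("picture > img")
source = soup.select_one("picture > source")

print(img.get("src"))    # None -- an img::attr(src) style query finds nothing
print(source["srcset"])  # https://example.com/shoe-large.jpg
```

Checking which tag actually carries the attribute (in the browser devtools or in the saved Splash response) is usually faster than tweaking wait times.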

R - html_nodes doesn't find selector

I wanted to scrape some data with the "rvest" package from the URL http://www.finanzen.ch/kurse/historisch/ABB/SWL/1.1.2001_27.10.2015
I wanted to get the table with the following selector (copied via the inspect option in Chrome):
#historic-price-list > div > div.content > table
But html_nodes doesn't work:
> url="http://www.finanzen.ch/kurse/historisch/ABB/SWL/1.1.2001_27.10.2015"
> css_selector="#historic-price-list > div > div.content > table"
> html(url) %>% html_nodes(css_selector)
{xml_nodeset (0)}
What I can find is:
> css_selector="#historic-price-list"
> html(url) %>% html_nodes(css_selector)
{xml_nodeset (1)}
[1] <div id="historic-price-list"/>
But it doesn't go any further.
Does anyone have an idea why?

Converting XPath to CSS

How do you convert the following XPath to CSS?
By.xpath("//div[@id='j_id0:form:j_id687:j_id693:j_id694:j_id700']/div[2]/table/tbody/tr/td");
Here's what I tried but it didn't work:
By.cssSelector("div[#id='j_id0:form:j_id687:j_id693:j_id694:j_id700'] > div:nth-of-type(2) > table > tbody > tr > td");
By.cssSelector("div > #j_id0:form:j_id687:j_id693:j_id694:j_id700 > div:nth-of-type(2) > table > tbody > tr > td");
Thank you.
Try this:
#j_id0\:form\:j_id687\:j_id693\:j_id694\:j_id700 tbody td
The colons in the id have to be escaped with a backslash, because an unescaped colon starts a pseudo-class in CSS. Also notice the whitespace instead of >, which lets you skip intermediate tags.
I also do not like that long id. If the ending part of the id is unique, you can use a partial match:
[id$='_id700'] tbody>tr>td
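Both selectors can be checked offline. Here is a sketch in Python with BeautifulSoup (whose select method accepts standard CSS selectors), against toy markup built from the id in the question; the table contents are invented for the example:

```python
from bs4 import BeautifulSoup

# Toy markup reproducing the structure implied by the XPath in the question.
html = """
<div id="j_id0:form:j_id687:j_id693:j_id694:j_id700">
  <div></div>
  <div><table><tbody><tr><td>cell</td></tr></tbody></table></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Colons inside an id must be backslash-escaped in a CSS selector,
# or they are parsed as the start of a pseudo-class.
full = soup.select_one(r"#j_id0\:form\:j_id687\:j_id693\:j_id694\:j_id700 tbody td")

# Matching on a unique suffix of the id sidesteps the escaping entirely.
partial = soup.select_one("[id$='_id700'] tbody>tr>td")

print(full.text, partial.text)  # cell cell
```

The same escaping rule applies in Selenium's By.cssSelector, since the selector string is handed to the browser's CSS engine.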
One should learn how to write CSS selectors, but for a quick fix, try cssify.
For example, I put in your XPath and it spat out: div#j_id0:form:j_id687:j_id693:j_id694:j_id700 > div:nth-of-type(2) > table > tbody > tr > td
