RSelenium - Difficulty Extracting Data from Complex Table

I'm trying to webscrape some soccer data. I'm able to loop through all of the necessary web pages, but I'm having trouble getting the data that I need from each page. I think the tables that hold the data are generated by some form of JavaScript, which makes them difficult to scrape.
I'm trying to get the goal times for each team from the following website:
http://www.scoreboard.com/uk/match/arsenal-west-brom-2014-2015/AyTNt38e/#match-summary|match-statistics;0|lineups;1
but I can't seem to distinguish between goals/cards/other events that are present. Can anyone help me, or is this simply a lost cause on this website?
My code to get the time of the first event (goal/card/other) is:
library("RSelenium")
startServer()
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("http://www.scoreboard.com/uk/match/arsenal-west-brom-2014-2015/AyTNt38e/#match-summary|match-statistics;0|lineups;1")
x<-mybrowser$findElements(using = 'css selector', ".time-box")
x[[1]]$getElementText()

You need to pick a specific parent element that holds all of the elements you want and nothing else. In this case, "#summary-content div.time-box" works as the CSS selector.
If you want the event type, e.g. goal vs card vs ..., use the CSS selector "#summary-content div.icon-box" and then look at the other class on the DIV element: soccer-ball for a goal, y-card for a yellow card, and so on. For example,
<div class="icon-box soccer-ball">
That should be enough to get you started. You should be able to get the rest of them yourself.
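For instance, building on the code in the question, a rough sketch of how to pull out just the goal times (the soccer-ball and y-card class names come from inspecting the page and may have changed since, and the pairing assumes each event row has one time-box and one icon-box):
# assumes mybrowser is the remoteDriver opened in the question above
times <- mybrowser$findElements(using = 'css selector', "#summary-content div.time-box")
icons <- mybrowser$findElements(using = 'css selector', "#summary-content div.icon-box")
event_time <- sapply(times, function(el) el$getElementText()[[1]])
event_type <- sapply(icons, function(el) el$getElementAttribute("class")[[1]])
# keep only the goal times, i.e. icon boxes whose class contains "soccer-ball"
goal_times <- event_time[grepl("soccer-ball", event_type)]
goal_times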

Related

cannot see some data after scraping a link using requests.get or scrapy

I am trying to scrape data from a stock exchange website. Specifically, I need to read the numbers in the top left table. If you inspect the HTML page, you will see these numbers under <div> tags, following <td> tags whose id is "e0", "e3", "e1" and "e4". However, the response, once saved into a text file, lacks all of these numbers and some others. I have tried using selenium with 20-second delays (so that the JavaScript is loaded), but this does not work and the element cannot be found.
Is there any workaround for this issue?
If you use Inspect Element > Network > filter by XHR, you will see the request that sends the data.
In your case it is this link: http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=7745894403636165&c=23%20.
Unfortunately, the data is not well structured, so you will have to work out which position in the response holds the values you are interested in. Good luck.
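If you prefer to stay in R (the rest of this page is R-based), a minimal httr sketch of the same idea; how the response is delimited is an assumption, so inspect it and adjust the split:
library(httr)
url <- "http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=7745894403636165&c=23%20"
resp <- GET(url)
raw_txt <- content(resp, as = "text", encoding = "UTF-8")
# the payload looks like delimiter-separated values; work out which
# position holds the numbers you need and adjust the split accordingly
fields <- strsplit(raw_txt, ";")[[1]]
fields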

Empty nodes when scraping links with rvest in R

My goal is to get links to all Kaggle challenges together with their titles. I am using the rvest library for this, but I do not seem to get far: the nodes come back empty once I am a few divs deep.
I am trying to do it for the first challenge at first and should be able to transfer that to every entry afterwards.
The xpath of the first entry is:
/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a
My idea was to get the link via html_attr( , "href") once I am in the right tag.
My attempt is:
library(rvest)
url = "https://www.kaggle.com/competitions"
kaggle_html = read_html(url)
kaggle_text = html_text(kaggle_html)
kaggle_node <- html_nodes(kaggle_html, xpath = "/html/body/div[1]/div[2]/div/div/div[2]/div/div/div[2]/div[2]/div/div/div[2]/div/div/div[1]/a")
html_attr(kaggle_node, "href")
I can't go past a certain div. The following snippet shows the last node I can access:
node <- html_nodes(kaggle_html, xpath="/html/body/div[1]/div[2]/div")
html_attrs(node)
Once I go one step further with html_nodes(kaggle_html,xpath="/html/body/div[1]/div[2]/div/div"), the node will be empty.
I think the issue is that Kaggle uses a lazily loaded list that expands the further I scroll down.
(I am aware that I can use %>%. I am saving every step so that I am able to access and view them more easily to be able to learn how it properly works.)
I solved the issue. I cannot access the full HTML of the site from R because the list is loaded by a script that expands the table (and thus the HTML) as the user scrolls.
I resolved it by expanding the list manually in the browser, saving the whole HTML page, and loading the local file.
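A minimal rvest sketch of that workaround; the file name competitions.html is just a placeholder for the saved page, and the href filter assumes competition links use Kaggle's /c/ path:
library(rvest)
# "competitions.html" is the manually expanded page saved from the browser
kaggle_html <- read_html("competitions.html")
# matching <a> tags by their href is less brittle than the absolute xpath
links <- html_nodes(kaggle_html, xpath = "//a[starts-with(@href, '/c/')]")
titles <- html_text(links)
hrefs <- paste0("https://www.kaggle.com", html_attr(links, "href"))
data.frame(title = titles, link = hrefs)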

Scraping an HTML page that is still loading using R

I am trying to scrape some information from a web page using R. The only problem (so far) is that when I inspect the HTML object that was returned, I see that the key DIV element (from which I want to return data) has the message that it is loading.
The code I am using is below.
How can I ensure that all elements on the web page have been rendered before harvesting the HTML?
library(xml2)
html <- xml2::read_html("https://www.holidayhouses.co.nz/")
lst_node <- xml_find_all(html, "//body/div[@class = 'MapView js-MapView']/h1")
lst_node
# returns <h1 class="LoadingMessage">Loading...</h1>
Thanks for any suggestions...
Both of the ideas below apply if you use the package RSelenium. It is not the fastest way to scrape, but it should do the job. With this package you can use R to interact with your web browser.
So once you use RSelenium to go to the URL, you can either:
If you are confident that the div will load within a certain amount of time, set a delay with Sys.sleep() before you save the div into lst_node.
Alternatively, iterate with a while loop until lst_node != "Loading...". A rough sketch of both ideas follows.
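This is only a sketch (the h1.LoadingMessage selector comes from the snippet above, and the 30-second cap is just a safety net):
library(RSelenium)
library(xml2)
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("https://www.holidayhouses.co.nz/")
# idea 1: a fixed delay before reading the page
Sys.sleep(10)
# idea 2: poll until the loading message disappears (give up after 30 s)
start <- Sys.time()
repeat {
  loading <- mybrowser$findElements(using = "css selector", "h1.LoadingMessage")
  if (length(loading) == 0 || difftime(Sys.time(), start, units = "secs") > 30) break
  Sys.sleep(1)
}
# now the rendered source can be handed back to xml2
html <- read_html(mybrowser$getPageSource()[[1]])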

Web Scraping with R using xpathSApply and trying to get only the links with "/overview"

I'm doing a project for college that involves web scraping. I'm trying to get all the links to the players' profiles on this website (http://www.atpworldtour.com/en/rankings/singles?rankDate=2015-11-02&rankRange=1-5001). I've tried to grab the links with the following code:
library(XML)
doc_parsed <- htmlTreeParse("ranking.html", useInternalNodes = TRUE)
root <- xmlRoot(doc_parsed)
hrefs1 <- xpathSApply(root, fun = xmlGetAttr, "href", path = '//a')
"ranking.html" is the saved link. When I run the code, it gives me a list with 6887 instead of the 5000 links of the players profiles.What should I do?
To narrow down to the links you want, you must include in your expression attributes that are unique to the elements you are after. The best and fastest option is an id (which should be unique); next best is a path under an element with a specific class. For example:
hrefs1 <- xpathSApply(root, fun = xmlGetAttr, "href", path = '//td[@class="player-cell"]/a')
By the way, the page you link to has at the moment exactly 2252 links, not 5000.
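The hrefs you get back this way appear to be site-relative paths, so (assuming you want full profile URLs and only the ones ending in /overview, as in the title) you can filter and prefix the domain:
# keep only the profile links and turn them into absolute URLs
profile_urls <- paste0("http://www.atpworldtour.com", hrefs1[grepl("/overview$", hrefs1)])
head(profile_urls)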

using the chrome console to select out data

I'm looking to pull out all of the companies from this page (https://angel.co/finder#AL_claimed=true&AL_LocationTag=1849&render_tags=1) in plain text. I saw someone use the Chrome Developer Tools console to do this and was wondering if anyone could point me in the right direction?
TLDR; How do I use Chrome console to select and pull out some data from a URL?
Note: since jQuery is available on this page, I'll just go ahead and use it.
First of all, we need to select the elements we want, e.g. the names of the companies. These are kept in the list with ID startups_content, inside elements with class items, in a field with class name. Therefore, a selector for them can look like this:
$('#startups_content .items .name a')
As a result, we get a bunch of HTMLElements. Since we want plain text, we need to extract it from these HTMLElements by doing:
.map(function(idx, item){ return $(item).text(); }).toArray()
This gives us an array of company names. However, let's make a single plain-text list out of it:
.join('\n')
Connecting all the steps above we get:
$('#startups_content .items .name a').map(function(idx, item){ return $(item).text(); }).toArray().join('\n');
which should be executed in the DevTools console.
If you need some other data, e.g. company URLs, just follow the same steps as described above doing appropriate changes.
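For readers coming from the R questions above, a rough RSelenium equivalent of the same idea (assuming the selectors on the page have not changed and the list has had time to render):
library(RSelenium)
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("https://angel.co/finder#AL_claimed=true&AL_LocationTag=1849&render_tags=1")
Sys.sleep(10)  # give the company list time to render
name_links <- mybrowser$findElements(using = "css selector", "#startups_content .items .name a")
company_names <- sapply(name_links, function(el) el$getElementText()[[1]])
cat(company_names, sep = "\n")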
