Get comments from XVIDEOS using rvest [R]

I am a data journalist and I am trying to scrape all the comments on XVIDEOS, so that it gets easier to find victims of leaked personal videos. I have the following code in R, but I can't go on, because I don't know how to click the "comment" button or how to change the URL so that the comments are shown by default. Could you give me a hand? Thank you.
library(tidyverse)
library(rvest)

url <- "https://www.xvideos.com/new/1"

links <- url %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  as.data.frame() %>%
  `colnames<-`("link") %>%
  filter(str_detect(link, "/video"))

I'm not sure why you necessarily want to use R for this; for this kind of workload I would much rather suggest the Selenium framework. The comments are loaded by JavaScript doing an XHR, so they will not be parseable with read_html, which does not execute the site's code.
Nonetheless, you can also reverse engineer the requests. If you want to work with R, here is a solution concept that will work:
You get a list of the videos with your code so you should have URLs like this:
https://de.xvideos.com/video52314867/...
You can use a regular expression like \/video(\d+)\/ to get the ID from there and then request the comment URL:
POST https://de.xvideos.com/threads/video-comments/get-posts/top/52314867/0/0
I guess you can see where the ID belongs... This way you will get the video comments directly in the response, without executing any JavaScript.
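A minimal sketch of that idea in R, building on the links data frame from the question above. The exact response format (JSON vs. an HTML fragment) and any required headers are assumptions here, so check the request in your browser's dev tools first:

library(httr)
library(stringr)

get_comments <- function(video_link) {
  # pull the numeric ID out of a link like "/video52314867/some-title"
  id <- str_match(video_link, "/video(\\d+)/")[, 2]
  if (is.na(id)) return(NULL)

  # endpoint pattern from the answer: .../get-posts/top/<ID>/0/0
  res <- POST(paste0("https://www.xvideos.com/threads/video-comments/get-posts/top/",
                     id, "/0/0"))
  content(res, as = "text", encoding = "UTF-8")   # inspect this to decide how to parse it
}

# comments_raw <- get_comments(links$link[1])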

Related

How can I get dynamic html tags using rvest?

I want to read the "Next Pet" button link from this page so that I can go on to scrape the next page.
I can find the link for the next pet in the source code of the web page,
but whatever I try, I just can't read the <a> tag with the link I need.
I already tried rvest, because from the source code I found that the Next Pet button has the class .pdpNav-inner-btn:
library(xml2)
library(rvest)

nextpage <- read_html('https://www.petfinder.com/dog/lilly-45365714/gu/yigo/guam-animals-in-need-gu01/') %>%
  html_nodes('.pdpNav-inner-btn') %>%
  html_attrs()
but all I got was an empty list or an empty character vector.
I also tried reading the whole html and just looking for the "Next Pet" string:
library(httr)

url <- GET('https://www.petfinder.com/dog/lilly-45365714/gu/yigo/guam-animals-in-need-gu01/')
url <- content(url, as = "text")
gregexpr('Next Pet', url)
but I still got nothing.
How do I solve the problem?

R web scraping packages failing to read in all tables of url

I'm trying to scrape a number of tables from the following link:
'https://www.pro-football-reference.com/boxscores/201209050nyg.htm'
From what I can tell after trying a number of methods/packages, I think R is failing to read in the entire URL. Here are a few of the attempts I've made:
library(RCurl)
library(XML)
a <- getURL(url)
tabs <- readHTMLTable(a, stringsAsFactors = TRUE)
and
library(rvest)
x <- read_html(url)
y <- html_nodes(x, xpath = '//*[@id="div_home_snap_counts"]')
I've had success reading in the first two tables with both methods but after that I can't read in any others regardless of whether I use xpath or css. Does anyone have any idea why I'm failing to read in these later tables?
If you use a browser like Chrome, you can go into settings and disable javascript. You will then see that only a few tables are present; the rest require javascript to run in order to load, so they are not loaded, as they would be in the browser, when you use your current method. Possible solutions are:
Use a method like RSelenium which will allow the javascript to run (a rough sketch follows after this list).
Inspect the HTML of the page to see if the info is stored elsewhere and can be obtained from there. Sometimes info is retrieved from script tags, for example, where it is stored as a json/javascript object.
Monitor network traffic when refreshing the page (F12 to open dev tools, then the Network tab) and see if you can find the source the additional content is being loaded from. You may find other endpoints you can use.
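A rough sketch of the RSelenium route, assuming you have a working browser driver installed; the page URL is the one from the question, and everything after getPageSource is plain rvest:

library(RSelenium)
library(rvest)

# start a browser session (requires a local driver, e.g. geckodriver for firefox)
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client

remDr$navigate("https://www.pro-football-reference.com/boxscores/201209050nyg.htm")

# grab the page source after the javascript has run, then parse it as usual
rendered <- remDr$getPageSource()[[1]]
tables <- read_html(rendered) %>% html_nodes("table") %>% html_table(fill = TRUE)

remDr$close()
rD$server$stop()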
Looking at the page, it seems that at least two of those missing tables (likely all) are actually stored in comments in the returned html, associated with divs having class placeholder, and that you need to either remove the comment markers or use a method that allows for parsing comments. Presumably, when javascript runs, these comments are converted to displayed content.
Following this answer by @alistaire, one method is as follows (shown for a single example table, #game_info):
library(rvest)

h <- read_html('https://www.pro-football-reference.com/boxscores/201209050nyg.htm')

df <- h %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('#game_info') %>%
  html_table()
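The same idea extends to pulling every comment-hidden table at once; a small extension of the snippet above (the column headers will still need cleaning up):

all_hidden <- h %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_nodes('table') %>%
  html_table(fill = TRUE)

length(all_hidden)   # number of tables recovered from the comments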

Rvest web scrape returns empty character

I am looking to scrape some data from a chemical database using R, mainly name, CAS Number, and molecular weight for now. However, I am having trouble getting rvest to extract the information I'm looking for. This is the code I have so far:
library(rvest)
library(magrittr)

# Read HTML code from website
# I am using this format because I ultimately hope to pull specific items from several different websites
webpage <- read_html(paste0("https://pubchem.ncbi.nlm.nih.gov/compound/", 1))

# Use CSS selectors to scrape the chemical name
chem_name_html <- webpage %>%
  html_nodes(".short .breakword") %>%
  html_text()

# Convert the data to text
chem_name_data <- html_text(chem_name_html)
However, when I try to create chem_name_html, R only returns character (empty). I am using SelectorGadget to get the CSS selector, but I noticed that SelectorGadget gives me a different node than the Inspector does in Google Chrome. I have tried both ".short .breakword" and ".summary-title short .breakword" in that line of code, but neither gives me what I am looking for.
I have recently run into the same issues using rvest to scrape PubChem. The problem is that the information on the page is rendered using javascript as you are scrolling down the page, so rvest is only getting minimal information from the page.
There are a few workarounds though. The simplest way to get the information that you need into R is using an R package called webchem.
If you are looking up name, CAS number, and molecular weight then you can do something like:
library(webchem)
chem_properties <- pc_prop(1, properties = c('IUPACName', 'MolecularWeight'))
The full list of compound properties that can be extracted through this API is in the PUG REST documentation. Unfortunately there isn't a property through this API to get the CAS number, but webchem gives us another way to query that, using the Chemical Translation Service.
chem_cas <- cts_convert(query = '1', from = 'CID', to = 'CAS')
The second way to get information from the page, which is a bit more robust but not quite as easy to work with, is to grab it from the JSON API.
library(jsonlite)
chem_json <-
  read_json(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/", "1",
                   "/JSON/?response_type=save&response_basename=CID_", "1"))
With that command you'll get a list of lists, and I had to write a fairly convoluted script to parse out the information I needed from it. If you are familiar with JSON, you can parse far more information from the page, but not quite everything. For example, the information in sections like Literature, Patents, and Biomolecular Interactions and Pathways will not fully show up in the JSON.
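For reference, here is a small sketch of pulling a couple of fields out of the chem_json object created above. The field names used here (Record, RecordTitle, Section, TOCHeading) are my reading of the PUG View JSON layout and are assumptions worth verifying with str(chem_json, max.level = 2) before relying on them:

# compound name for the CID requested above
chem_json$Record$RecordTitle

# headings of the top-level sections you can drill into further
section_headings <- vapply(chem_json$Record$Section,
                           function(s) s$TOCHeading,
                           character(1))
section_headings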
The final and most comprehensive way to get all of the information from the page is to use something like Scrapy or PhantomJS to render the full html output of the PubChem page, then use rvest to scrape it as you originally intended. This is something that I'm still working on, as it is my first time using web scrapers as well.
I'm still a beginner in this realm, but hopefully this helps you a bit.

How can I scrape data from a website within a frame using R?

The following link contains the results of the Paris marathon: http://www.schneiderelectricparismarathon.com/us/the-race/results/results-marathon.
I want to scrape these results, but the information lies within a frame. I know the basics of scraping with rvest and RSelenium, but I am clueless about how to retrieve the data from within such a frame. To give an idea, one of the things I tried was:
url = "http://www.schneiderelectricparismarathon.com/us/the-race/results/results-marathon"
site = read_html(url)
ParisResults = site %>% html_node("iframe") %>% html_table()
ParisResults = as.data.frame(ParisResults)
Any help in solving this problem would be very welcome!
The results are loaded by ajax from the following url :
url="http://www.aso.fr/massevents/resultats/ajax.php?v=1460995792&course=mar16&langue=us&version=3&action=search"
table <- url %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes(xpath = '//table[@class="footable"]') %>%
  html_table()
PS: I don't know exactly what ajax is, and I only know the basics of rvest.
EDIT: to answer the question in the comment: I don't have a lot of experience with web scraping. If you only use very basic techniques with rvest or xml, you have to understand the website a little better, and every site has its own structure. For this one, here is how I did it:
As you see, in the source code you don't see any results because they are inside an iframe; when inspecting the code, you can see the following after "RESULTS OF 2016 EDITION":
class="iframe-xdm iframe-resultats" data-href="http://www.aso.fr/massevents/resultats/index.php?langue=us&course=mar16&version=3"
Now you can use this url directly: http://www.aso.fr/massevents/resultats/index.php?langue=us&course=mar16&version=2
But you still can't get the results from it directly. You can then use Chrome developer tools > Network > XHR. When refreshing the page, you can see that the data is loaded from this url (when you choose the Men category): http://www.aso.fr/massevents/resultats/ajax.php?course=mar16&langue=us&version=2&action=search&fields%5Bsex%5D=F&limiter=&order=
Now you can get the results!
And if you want the second page, and so on, you can click on the page number and then use the developer tools to see what happens!
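Putting that together in R, here is a sketch of calling the ajax endpoint with explicit query parameters. The parameter names are the ones visible in the XHR url above; the endpoint itself may well have changed since this was written:

library(httr)
library(rvest)

res <- GET("http://www.aso.fr/massevents/resultats/ajax.php",
           query = list(course = "mar16",
                        langue = "us",
                        version = "2",
                        action = "search",
                        `fields[sex]` = "F",   # the sex filter seen in the XHR url
                        limiter = "",
                        order = ""))

results <- content(res, as = "text", encoding = "UTF-8") %>%
  read_html() %>%
  html_nodes(xpath = '//table[@class="footable"]') %>%
  html_table()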

Download ASPX page with R

There are a number of fairly detailed answers on SO which cover authenticated login to an aspx site and downloading from it. As a complete n00b, I haven't been able to find a simple explanation of how to get data from a web form.
The following MWE is intended as an example only, and this question is more about teaching me how to do it for a wider collection of webpages.
Website:
http://data.un.org/Data.aspx?d=SNA&f=group_code%3a101
What I tried, and (obviously) failed with:
test <- read.csv('http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc')
which gives me gobbledegook when I View(test).
Anything that steps me through this or points me in the right direction would be very gratefully received.
The URL you are accessing with read.csv is returning a zipped file. You could download it using httr, say, and write the contents to a temp file:
library(httr)

urlUN <- "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc"
response <- GET(urlUN)

# write the raw zip to disk, then unzip it and read the csv inside
if (!dir.exists("temp")) dir.create("temp")
writeBin(content(response, as = "raw"), "temp/temp.zip")
fName <- unzip("temp/temp.zip", list = TRUE)$Name
unzip("temp/temp.zip", exdir = "temp")
read.csv(paste0("temp/", fName))
Alternatively Hmisc has a useful getZip function:
library(Hmisc)
urlUN <- "http://data.un.org/Handlers/DownloadHandler.ashx?DataFilter=group_code:101;country_code:826&DataMartId=SNA&Format=csv&c=2,3,4,6,7,8,9,10,11,12,13&s=_cr_engNameOrderBy:asc,fiscal_year:desc,_grIt_code:asc"
unData <- read.csv(getZip(urlUN))
The links are being generated dynamically. The other problem is that the content isn't actually at that link. You're making a request to a (very odd and poorly documented) API which will eventually return the zip file. If you look in the Chrome dev tools as you click on that link, you'll see the message and response headers.
There are a few ways you can solve this. If you know some javascript, you can script a headless webkit instance like Phantom to load up these pages, simulate click events, wait for a content response, and then pipe that to something.
Alternatively, you may be able to finagle httr into treating this like a proper restful API. I have no idea if that's even remotely possible. :)
