Beautiful Soup renders only a portion of the page - web-scraping

So I am trying to scrape fantasy player data from this website: http://www.footywire.com/. As an example page, I shall refer to this one: http://www.footywire.com/afl/footy/pu-gold-coast-suns--gary-jnr-ablett.
My efforts using BeautifulSoup have unfortunately not yielded positive results. So far, it seems to only render the menu on the left side of the page, and nothing in the middle (being the 2016 Supercoach stats and Breakeven Analysis) for me to actually scrape.
import urllib
from bs4 import BeautifulSoup
r = urllib.urlopen('http://www.footywire.com/afl/footy/pu-gold-coast-suns--gary-jnr-ablett').read()
soup = BeautifulSoup(r)
print soup.prettify()
Where am I going wrong, or what do I need to adjust to get to the stats data that I'm after?
EDIT: Following a quick discussion with @Padraic Cunningham, I have determined that the issue is specifically that the output of BeautifulSoup(r) is less complete than what is contained in r by itself.
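One common cause of this symptom is the parser that BeautifulSoup uses, since different parsers recover differently from malformed markup. Whether that is what happens on this particular page is an assumption. A minimal sketch for comparing how many tags each installed parser finds in the same HTML (the helper name tag_counts_by_parser is hypothetical; lxml and html5lib are optional extra packages):

```python
from bs4 import BeautifulSoup, FeatureNotFound


def tag_counts_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number_of_tags_found} for each installed parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser library is not installed; skip it.
            continue
        # find_all(True) matches every tag in the parsed tree.
        counts[name] = len(soup.find_all(True))
    return counts


# Tiny stand-in document; in practice pass the downloaded page (r) instead.
sample = "<html><body><div><p>stats</p></div></body></html>"
print(tag_counts_by_parser(sample))
```

If one parser reports far fewer tags than another for the real page, naming the fuller one explicitly, e.g. BeautifulSoup(r, 'html5lib'), is worth trying.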

Related

Web Scraping BoardGameGeek with RVest

I'm pretty much brand new to web scraping with rvest, and really new to most everything except Qlik coding.
I am attempting to scrape data found at BoardGameGeek, see the link below. Using inspect, it certainly seems possible, yet rvest is not finding the tags. I first thought I had to go through the whole JavaScript process using V8 (JavaScript is called at the top of the HTML), but when I just use html_text on the whole document, all the information I need is in there.
UPDATE: It appears to be in JSON. I used a combination of Notepad++ and a web tool to clean it and load it into R. Any recommendations on tutorials/demos for how to do this systematically? I have all the links I need to loop through, but I am not sure how to go from the html_text output to a clean JSON input via code.
I provided examples below, but I need to scrape the majority of the data elements available, so not looking for code to copy and paste but rather the best method to pursue. See below.
Link: https://boardgamegeek.com/boardgame/63888/innovation
HTML Example I am trying to pull from. Span returns nothing with html_nodes so I couldn't even start there.
<span ng-if="min > 0" class="ng-binding ng-scope">45</span>
OR
<a title="Civilization" ng-href="/boardgamecategory/1015/civilization" class="ng-binding" href="/boardgamecategory/1015/civilization">Civilization</a>
JavaScript sections at the top of the page like this (about 8 of them):
<script type="text/javascript" src="https://cf.geekdo-static.com/static/geekcollection_master2_5e84926ab7e90.js"></script>
When I just use html_text on the whole object, I can see all the elements I am looking for, e.g.:
\"minplaytime\":\"30\" OR {\"name\":\"Deck, Bag, and Pool Building\"
I'm assuming this is JSON? Is there a way to parse the html_text output, or another method? Is it easier just to run the JavaScript at the top of the page using V8? Is there an easy guide for this?
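For the "html_text output to clean JSON" step, one possible approach is to locate where the JSON starts in the page text and let a JSON decoder read exactly one value from that position. A minimal sketch in Python using only the standard library; the page text, the marker "var preload =" and the helper name extract_json_after are all made up for illustration and are not taken from the real page:

```python
import json


def extract_json_after(text, marker):
    """Decode the first JSON value that follows `marker` in `text`."""
    start = text.index(marker) + len(marker)
    # Skip whitespace between the marker and the JSON value.
    while text[start].isspace():
        start += 1
    # raw_decode parses one JSON value and ignores whatever trails it.
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value


# Hypothetical page text: the variable name and layout are assumptions.
page_text = (
    'var preload = {"minplaytime": "30", '
    '"mechanics": [{"name": "Deck, Bag, and Pool Building"}]}; init();'
)
data = extract_json_after(page_text, "var preload =")
print(data["minplaytime"])
print(data["mechanics"][0]["name"])
```

The escaped quotes in the output above suggest the JSON may itself sit inside a quoted string; if so, the decoded value would be a string and would need a second json.loads call.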
Are you aware that BGG has an API? Documentation can be found here: URL
The data is provided as an XML file. For your example, take the ID of your game, 63888 (it's in the URL); the XML file can then be found at: https://www.boardgamegeek.com/xmlapi2/thing?id=63888
You can read the info with this code:
library(dplyr)
library(rvest)
game_data <- read_xml("https://www.boardgamegeek.com/xmlapi2/thing?id=63888")
game_data %>%
  html_nodes("name[type=primary]") %>%
  html_attr("value") %>%
  as.character()
#> [1] "Innovation"
By inspecting the xml file you can choose what node you want to export.
Created on 2020-04-06 by the reprex package (v0.3.0)
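The same idea of picking nodes out of the XML can be sketched in Python with the standard library. The sample below is a trimmed, illustrative stand-in for the API response: the values come from the question, but the exact layout is an assumption, and game_summary is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative sample; the layout is assumed, not copied from the API.
sample_xml = """
<items>
  <item type="boardgame" id="63888">
    <name type="primary" value="Innovation"/>
    <minplaytime value="30"/>
    <link type="boardgamecategory" id="1015" value="Civilization"/>
  </item>
</items>
"""


def game_summary(xml_text):
    """Collect a few fields from a 'thing' response into a dict."""
    item = ET.fromstring(xml_text).find("item")
    return {
        "name": item.find("name[@type='primary']").get("value"),
        "minplaytime": item.find("minplaytime").get("value"),
        "categories": [
            link.get("value")
            for link in item.findall("link[@type='boardgamecategory']")
        ],
    }


print(game_summary(sample_xml))
```

Looping over the game IDs and calling a function like this on each downloaded response would give one record per game.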

how to print jupyter background job output in cell it was launched from?

When launching a background job from an IPython Jupyter Notebook, how can I make the printed output appear in the cell it was launched from, rather than in the cell I am currently working in?
The print() command seems to print into the current working cell, not into the cell that the background job was launched from. Is there a way to make it print nicely in the cell it was launched from? This is particularly relevant when running multiple background job sets, as to determine which jobset was responsible for that line of output.
Edit: it occurs with any code but here is a small snippet to reproduce it:
from IPython.lib import backgroundjobs as bg
import time
def launchJob():
    for i in range(100):
        print('x')
        time.sleep(1)
jobs = bg.BackgroundJobManager()
for r in range(3):
    jobs.new("launchJob()")
That does exactly what you'd expect it to do: it prints 3 x's every second in the output under the cell. Now go to the next cell, type 1+1 and execute it. The output 2 appears, but any remaining x's also get printed in this new cell rather than in the original cell.
I am looking for a way to specifically tell my job to always print to the original cell it was executed from, as to obtain sort of a log in the background, or to generate a bunch of plots in the background, or any kind of data that I want in one place rather than appearing all over my notebook.
The IPython.lib.backgroundjobs.BackgroundJobManager.new documentation states:
All threads running share the same standard output. Thus, if your background jobs generate output, it will come out on top of whatever you are currently writing. For this reason, background jobs are best used with silent functions which simply return their output.
In GitHub pull request #856 of IPython (October 2011), the developers discussed this issue and concluded that keeping the output in the original cell would be difficult to implement. Hence, they decided to table the idea and in the latest version (7.9.0 at the time of writing) it still has not been solved.
In the meantime, a workaround could be to store the output in a variable, or to write/pickle the output to a file and print/display it once the background jobs are finished.
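A minimal sketch of the "store the output in a variable" workaround. To keep it runnable outside IPython it uses threading.Thread from the standard library instead of BackgroundJobManager, which is a deliberate substitution; the names launch_job and log are illustrative.

```python
import threading
import time


def launch_job(job_id, log, steps=3, delay=0.01):
    """Record output in a shared list instead of printing it."""
    for _ in range(steps):
        # Tag each entry with the job that produced it.
        log.append((job_id, "x"))
        time.sleep(delay)


log = []
threads = [threading.Thread(target=launch_job, args=(n, log)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every background job has finished

# Print everything in one place, grouped by the job that produced it.
for job_id, line in sorted(log):
    print(f"job {job_id}: {line}")
```

In a notebook you would normally skip the join calls, which block the cell, and instead run the final loop in whichever cell you want the output to appear in once the jobs are done.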
While it is not a direct answer to my question as it does not work with the print command, I did manage to solve my problem partially in the sense that I can have graphs updating in the back (and can hence log any kind of data on them as-we-go without any need to re-run).
Some proof of concept code below, based on What is the currently correct way to dynamically update plots in Jupyter/iPython?
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import time
from IPython.lib import backgroundjobs as bg
def pltsin(ax, colors=['b']):
    x = np.linspace(0,1,100)
    if ax.lines:
        for line in ax.lines:
            line.set_xdata(x)
            y = np.random.random(size=(100,1))
            line.set_ydata(y)
    else:
        for color in colors:
            y = np.random.random(size=(100,1))
            ax.plot(x, y, color)
    fig.canvas.draw()
fig,ax = plt.subplots(1,1)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_xlim(0,1)
ax.set_ylim(0,1)
def launchJob():
    for i in range(100):
        pltsin(ax, ['b', 'r'])
        time.sleep(1)
jobs = bg.BackgroundJobManager()
jobs.new("launchJob()")
While this is running, typing 1+1 in another cell does not disrupt the updating of the figure. For me this was a game changer, so I'll post this in case anyone is helped by it.
@Peter As other answers address doing this with pickle or an output file, I would like to suggest an alternative method.
I would advise initiating the print operation in the parent window. When the criteria for execution are reached in the child process window, return those values to the print function in the parent.
Use %matplotlib inline to print the values in the cell of the parent. This prints the outputs and visuals in the same cell.

How do I get HoloViews to display in Google Colabs notebooks?

I can't get any HoloViews graphics to display in any Google Colabs notebook.
For example, even the simple Bokeh example right out of the HoloViews introduction
points = hv.Points(np.random.randn(500,2))
points.hist(num_bins=51, dimension=['x','y'])
fails to show anything, without any error being reported, while the same code (and all example code from HoloViews) works fine in local Jupyter notebooks.
If I download the Colabs notebook locally and open it, I see the following where I saw nothing for output in Colabs:
No (safe) renderer could be found for output. It has the following MIME types: application/javascript, application/vnd.bokehjs_load.v0+json
How do I get Bokeh HoloViews to display in Google Colabs notebooks?
See https://github.com/pyviz/holoviews/issues/3551 . Colaboratory has some serious limitations on how it handles notebooks, and for now you have to do this once:
import os, holoviews as hv
os.environ['HV_DOC_HTML'] = 'true'
Then for every single cell with a plot in it you have to re-load the JS:
hv.extension('bokeh')
hv.Curve([1, 2, 3])
It would be great if Google could fix that, as it's unworkable in my opinion.

R: rvest package read_html() function gives different outputs on same URL

Specifically I am trying to parse Amazon reviews of a product with the rvest library in R.
library(rvest)
reviews_url <- "https://www.amazon.com/Magic-Bullet-Blender-Small-Silver/product-reviews/B012T634SM/ref=cm_cr_getr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1"
amazon_review <- read_html(reviews_url)
reviewRaw <- amazon_review %>%
  html_nodes(".review-text") %>%
  html_text()
The problem I am facing is that if I rerun the function, I sometimes get different outputs, as if it had somehow parsed a different site. Sometimes it is the right output.
How can I fix this?
I already tried using the RSelenium package with the WebDriver to load the page and give it time to load, but it does not help.
Interestingly the output alternates between 2 alternatives. So either the reviews are parsed correctly or they are not. The wrong alternative always looks the same however.
There definitely is some pattern there, but I just can't get my head around what the problem could be here. It might have to do something with the way the reviews are being loaded in at Amazon?
Anyways, I am thankful for any idea to solve this.
Best regards.
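The alternating output suggests the server sometimes returns a different page for the same URL; that is a guess, not something confirmed here. Under that assumption, one way to cope is to check each response for the expected content and ask again if it is missing. A minimal Python sketch, where fetch_until_valid is a hypothetical helper and the two pages are stand-ins for real downloads:

```python
import time


def fetch_until_valid(fetch, is_valid, attempts=5, delay=1.0):
    """Call fetch() until is_valid(result) is true or attempts run out."""
    for attempt in range(1, attempts + 1):
        result = fetch()
        if is_valid(result):
            return result
        if attempt < attempts:
            time.sleep(delay)  # brief pause before asking again
    raise RuntimeError(f"no valid page after {attempts} attempts")


# Stand-in for a real download: returns a bad page first, then a good one.
pages = iter([
    "<html>unexpected page</html>",
    '<span class="review-text">Great blender</span>',
])
html = fetch_until_valid(
    lambda: next(pages),
    lambda page: "review-text" in page,
    delay=0,
)
print(html)
```

Saving the "wrong" response to a file and reading it is also worthwhile, since its content usually explains why the server answered differently.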

LDAvis HTML output from serVis is blank

I'm trying to use LDAvis for the first time, but have run into the following issue:
After running serVis on my JSON object,
serVis(json, out.dir = 'LDAvis', open.browser = FALSE)
the 5 expected files are created (i.e., d3.v3.js, index.html, lda.css, lda.json, and ldavis.js). As I understand LDAvis, opening the html file should open the interactive viewer. However, in doing this, only a blank webpage is opened.
I've compared the html source code with that from LDAvis projects found online, and they are the same. This was built using Christopher Gandrud's script found here where the LDA results come from the topicmodels package and used the Gibbs method. The underlying data uses ~45K documents with ~15K unique terms. For what it's worth, the lda.json file seems a bit small at ~6MB.
Unfortunately, this issue seems too large to provide sample data or reproducible code. (If I could isolate the issue more, then perhaps I could add sample code.) Instead, I was hoping if readers had any ideas for the cause of this issue or if it has come about before.
Thanks ahead for any feedback!
I've resolved the issue after realizing that most web browsers restrict access to local files. For Chrome, the .exe needs to be called with the option "--allow-file-access-from-files". Otherwise, no error is displayed opening the LDAvis output unless you inspect HTML elements manually.
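An alternative to relaxing the browser's local-file restriction is to serve the output directory over HTTP on localhost, so the page loads lda.json through a normal request. A minimal sketch using Python's standard library (3.7 or later); serve_directory is a hypothetical helper name and 'LDAvis' is the out.dir from the question:

```python
import functools
import http.server
import threading


def serve_directory(directory, port=0):
    """Serve `directory` over HTTP on localhost in a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Port 0 asks the operating system for any free port.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Example use (keep the Python session alive while browsing):
#   server = serve_directory("LDAvis")
#   print(f"http://127.0.0.1:{server.server_address[1]}/index.html")
#   ...
#   server.shutdown()
```

From a shell, running python -m http.server inside the LDAvis directory has much the same effect.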
