How to figure out where is the raw data in a table?

How to figure out where is the raw data in a table? - web-scraping

https://www.nyse.com/quote/XNYS:A
After I access the above URL, I open Developer Tools in Firefox. Then change the date in HISTORIC PRICES, then click 'GO'. The table is updated. But I don't see relevant HTTP requests sent in devtools.
So this means that the data has already been downloaded in the first request. But I can not figure out how to extract the raw data of the table. Could anybody take a look at how to extract the raw data from the table? (Note that I don't want to use methods like selenium, I want to stay with raw HTTP requests to get the raw data.)
EDIT: websocket is mentioned in the comment. But I can't see it in Developer Tools. I add websocket tag anyway in case somebody knows more about websocket can chime in.

I am afraid you cannot extract javascript rendered content without selenium. You can always make use of a headless browser(you don't see any instance on your screen, the only pitfall is that you have to wait until the page fully loads) and it won't bother you anymore.
In other words, all the other scraping libs are based on urls and forms. Scrapy can post forms but not run javascripts.
Selenium will save the day, all you lose is a couple of seconds for each attempt(will be milliseconds if it is run in frontend). You can share page source with driver.page_source and it can be directly used for parsing(as a html text) with BeautifulSoup or whatever.

You can do it with requests-html, for example let's grab the first row of the table:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://www.nyse.com/quote/XNYS:A'
r = session.get(url)
r.html.render(sleep=7)
first_row = r.html.find('.flex_tr', first=True)
print(first_row.text)
Output:
06/18/2021
146.31
146.83
144.94
145.01
3,220,680
As #Nikita said you will have to wait the page loading (here 7sec but maybe less), but if you want to do multiple requests you can do it asynchronously !

Related

Loop asLongAs() in Gatling using Scala

`I am trying to use a loop e.g. asLongAs() in Galing but not getting enough data on google about how to use.
My scenrio is to open a HTML page and that page takes some time to load and for that I have a css selector to check that once the report get loaded we have one css selector to check in the source code.
my code is like:
`exec (http("ABC -${ID} - Id -${ID2}")
.get("web/a/b/c/")
.check(css(.abc).saveAs("URL"))
.exec(session =\> {
val response = session("URL").as\[String\]
println(s"url is: \\n$response")
session
})
exec(http("Open the redirected report - ${ID1} Id-${ID2}")
.get(session =\> session("URL").as\[String\])
Some checks
.check(css(".Image").exists)`
I want to create a loop till this css(.Image) is loading. Because once the URL is hitting at that time this CSS doesn't appear and it takes time to load and i want to calculate that time only.

but not getting enough data on google
Have you tried the official documentation ? It has samples for Java, Scala and Kotlin.
https://gatling.io/docs/gatling/reference/current/core/scenario/#aslongas

Scrape BSCScan Token Holdings Page

I'm trying to get data from this page
https://bscscan.com/tokenholdings?a=0xFAe2dac0686f0e543704345aEBBe0AEcab4EDA3d
But the Website owner doesn't provide endpoints APIs for this purpose. So I tried to achieve it in different ways:
-USING DRYSCRAPE but the library seems to be abandoned;
-USING REQUESTS but the data are provided dinamically by javascript;
-USING REQUESTS HTML but even in this case the data doesn't seems to be loaded.
I would like to ignore selenium cause it's slow but I don't know how to solve this issue. Anyone has a solution that could work? The data I need is the table containing the tokens of the wallet. Thank U in advice and hv a nice day.

You can do it with requests-html, for example let's grab the symbol of the first row:
from requests_html import HTMLSession
session = HTMLSession()
url='https://bscscan.com/tokenholdings'
token={'a': '0xFAe2dac0686f0e543704345aEBBe0AEcab4EDA3d'}
r = session.get(url, params=token)
r.html.render(sleep=2)
binance_row = r.html.find('tbody tr', first=True)
symbol = binance_row.find('td')[2].text
print(symbol)
Output:
BNB

How to create a stock availability checker with python requests if JavaScript is used?

I wrote some code which should check whether a product is back in stock and when it is, send me an email to notify me. This works when the things I'm looking for are in the html.
However, sometimes certain objects are loaded through JavaScript. How could I edit my code so that the web scraping also works with JavaScript?
This is my code thus far:
import time
import requests
while True:
# Get the url of the IKEA page
url = 'https://www.ikea.com/nl/nl/p/flintan-bureaustoel-vissle-zwart-20336841/'
# Get the text from that page and put everything in lower cases
productpage = requests.get(url).text.lower()
# Set the strings that should be on the page if the product is not available
outofstockstrings = ['niet beschikbaar voor levering', 'alleen beschikbaar in de winkel']
# Check whether the strings are in the text of the webpage
if any(x in productpage for x in outofstockstrings):
time.sleep(1800)
continue
else:
# send me an email and break the loop

Instead of scraping and analyzing the HTML you could use the inofficial stock API that the IKEA website is using too. That API return JSON data which is way easier to analyze and you’ll also get estimates when the product gets back to stock.
There even is a project written in javascript / node which provides you this kind of information straight from the command line: https://github.com/Ephigenia/ikea-availability-checker
You can easily check the stock amount of the chair in all stores in the Netherlands:
npx ikea-availability-checker stock --country nl 20336841

Google Analytics Realtime Sandbox Environment

I am looking for a way to setup a google analytics sandbox environment that will allow me
to test out my custom js code near real time.
My app will be using custom variables for advanced segmentation, and I would like to test out multiple scenarios quickly, as opposed to setting up a dummy GA account and wait for a whole day to confirm the test.
Thanks

Great question.
For GA, server updates occur every four hours, and after every sixth such update, the entire set is recalculated, which means a 24-hour lag from code change to reliable feedback. This delay also applies to most customizations to the GA Browser (e.g., "custom filters").
So if you are going to use GA as your web metrics system, and you expect to actually rely on those data then a test rig is essential.
For me, it's useful to group test systems for client-side analytics using two rubrics: (i) complete, self-contained (closed-loop) systems; or (ii) simpler automated data pulls from the production system (by "production system" here i mean GA's system, not the Site whose pages the GA code is tracking).
For the latter, just add this line to each page of your Site that contains the GA tracking code, just below '__trackPageview()':
pageTracker._setLocalRemoteServerMode();
That line will cause a copy of each transaction line to be logged to your server's activity log--so in essence, you get the data captured by GA in real-time That's all you need to do to capture the data; to parse it, you can use, for instance, any of the excellent open source web log analyzers like AWStats, or roll your own.
This is simple and reliable--but all it can do is tell you (in real-time) "does the analytics code i just implemented on pages served by my production server actually work?"
Usually, that's not good enough--you would rather know if your code will work before it's on your production server. To do that, you need to simulate the production environment and find a way to access in real-time the data GA collects.
This kind of test rig is a little more involved, but still not difficult.
In sum, it requires these steps:
host/serve the ga.js and the
tracking pixel locally;
log the __utm.gif requests (in the
GA data flow, each request
corresponds to one logged
transaction); and
parse the headers into some
convenient human-readable form.
If you want more detail than that (ie, a step-by-step implementation), here it is:
I. Hosting/Serving the GA Script (& automating updates
To do that, you can create a small shell script like this one to wget the latest ga.js version into your local directory (replacing the extant version it finds there).
#!/bin/sh
rm /My_Sites/sitename.com/analytics/ga.js
cd /My_Sites/sitename.com/analytics/
wget http://www.google-analytics.com/ga.js
chmod 644 /My_Sites/sitename.com/analytics/ga.js
cd ${OLDPWD}
exit 0;
(Thanks to AskApache.com, which provided the original motivation and config details to do this in a production context.)
II. Create __utm.gif file
This is just a transparent 1x1 pixel gif image, which you will place in Site directory (doesn't matter where, it just needs to match the location recited in your pages)
III. Log the __utm.gif Requests
For a testing protocol in which you are the source of the client-side activity (e.g., you want to verify the cross-browser fidelity of some event-tracking code you've added to a page on your Site, so you automate 5000 clicks on the button you just wired up,serving the page from your dev server set up for this purpose) it's probably simplest to just log the Request Headers, because it's in those headers that the GA script directs the client to gather various data from the DOM, from the location bar (url), and from prior http headers, and append them to a request for a resource on the GA server (__utm.gif, which is just a 1x1 transparent pixel).
For this type of protocol, i use the Firefox addon, LiveHTTPHeaders. You install it like any other Firefox addon, a few mouse clicks is all. Next, open it, and click the "Generator" tab. From this window, you can see the actual requests in real time. At the bottom of the window is a 'save' button to store the log. I find it easier to configure LiveHTTPHeaders to log only the __utm.gif requests; to do that, just click the 'Edit' tab and create a siimple filter to exclude everything except these particular gif images (using the check boxes on the right, and the large text box to the right).
Other kinds of test protocols require you to work from your Server Activity Logs; in that case just add this line to each page of your Site, just below __trackPageview():
pageTracker._setLocalRemoteServerMode();
IV. Parse those logged requests so you can actually read them
So now your log will contain individual transction lines, each one of which is a string appended to an HTTP Request for the GA tracking pixel. This string is just a concatenation of key-value pairs, each key begins with the letters "utm" (probably for "urchin tracker"). Each of these parameters corresponds to a variable that you see in the GA Dashboard (here's a complete list and description of them). This is all you need to know to build a parser. In more detail:
First, here's a sanitized __utm.gif request (the entries in your LiveHTTPHeaders log):
http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.0%20r45&utmcn=1&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmhn=lindenlab.hrmdirect.com&utmr=http://lindenlab.com/employment&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X&utmcc=__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
This is my parser (in Python):
# regular expression module imported
import re
pattern = r'\&{1,2}'
pat_obj = re.compile(pattern)
# splitting the gif request on the '&' character
# (which GA originally used to concatenate each piece to build the request)
# (here, i've bound the __utm.gif to the variable by 'gfx')
gfx1 = pat_obj.split(gfx)
# create a look-up table to map a descriptive name to each gif request parameter
# (note, this isn't the entire list, which i've linked to above)
keys = "utmje utmsc utmsr utmac utmcc utmcn utmcr utmcs utmdt utme utmfl utmhn utmn utmp utmr utmul utmwv"
values = "java_enabled screen_color_depth screen_resolution account_string cookies campaign_session_new repeat_campaign_visit language_encoding page_title event_tracking_data flash_version host_name GIF_req_unique_id page_request referral_url browser_language gatc_version"
keys = keys.strip().split()
#create the look-up table
GIF_REQUEST_PARAMS = dict(zip(keys, values))
# parse each request parameter and map the parameter name to a descriptive name:
pattern = r'(utm\w{1,2})=(.*?)$'
pat_obj = re.compile(pattern)
for itm in gfx1 :
m = pat_obj.search(itm)
if m :
fmt = '{0:25} {1:10}'
print( fmt.format( GIF_REQUEST_PARAMS[m.group(1)], m.group(2) ) )
The result looks like this:
gatc_version              1         
GIF_req_unique_id         1669045322
language_encoding         UTF-8     
screen_resolution         1280x800  
screen_color_depth        24-bit    
browser_language          en-us     
java_enabled              1         
flash_version             10.0%20r45
campaign_session_new      1         
page_title                Position%20Listings%20%7C%20Linden%20Lab
host_name                 lindenlab.hrmdirect.com
referral_url              http://lindenlab.com/employment
page_request              /employment/openings.php?sort=da
account_string            UA-XXXXXX-X
cookies
To avoid making this longer still, i left out the cookies' value. They obviously require a separate parsing step, though it's virtually identical to the step i just showed. Again, each request represents a single transaction, so you can store them as you need to.

Python-getting data from an asp.net AJAX application

Using Python, I'm trying to read the values on http://utahcritseries.com/RawResults.aspx. I can read the page just fine, but am having difficulty changing the value of the year combo box, to view data from other years. How can I read the data for years other than the default of 2002?
The page appears to be doing an HTTP Post once the year combo box has changed. The name of the control is ct100$ContentPlaceHolder1$ddlSeries. I try setting a value for this control using urllib.urlencode(postdata), but I must be doing something wrong-the data on the page is not changing. Can this be done in Python?
I'd prefer not to use Selenium, if at all possible.
I've been using code like this(from stackoverflow user dbr)
import urllib
postdata = {'ctl00$ContentPlaceHolder1$ddlSeries': 9}
src = urllib.urlopen(
"http://utahcritseries.com/RawResults.aspx",
data = urllib.urlencode(postdata)
).read()
print src
But seems to be pulling up the same 2002 data. I've tried using firebug to inspect the headers and I see a lot of extraneous and random-looking data being sent back and forth-do I need to post these values back to the server also?

Use the excellent mechanize library:
from mechanize import Browser
b = Browser()
b.open("http://utahcritseries.com/RawResults.aspx")
b.select_form(nr=0)
year = b.form.find_control(type='select')
year.get(label='2005').selected = True
src = b.submit().read()
print src
Mechanize is available on PyPI: easy_install mechanize

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to figure out where is the raw data in a table? - web-scraping

Related

Loop asLongAs() in Gatling using Scala

Scrape BSCScan Token Holdings Page

How to create a stock availability checker with python requests if JavaScript is used?

Google Analytics Realtime Sandbox Environment

Python-getting data from an asp.net AJAX application

Categories

Resources