I want to scrape this website - https://recorder.co.clark.nv.us/RecorderEcommerce/default.aspx.
I need to simulate clicking the 'Parcel #' link first, then entering a value (e.g. 1234) into the Parcel # textbox and clicking Search.
I need to scrape the data in the table which is shown at the bottom.
I'd like to write this in ASP.NET so I can push the Parcel # and other parameters through as part of the request. Once I get the response back, I'm confident I can parse it myself; I'm just not sure how exactly I should send the original request, as it's not as simple as sending across parameters.
In your question you've specified both Javascript and ASP.NET, so I really have no idea what technologies you're planning on using. I'd recommend HtmlAgilityPack. It can download from a URL, and it'll help with the parsing too.
// HtmlWeb downloads the page from a URL (HtmlDocument.Load only reads local files/streams)
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://recorder.co.clark.nv.us/RecorderEcommerce/default.aspx");
I'm trying to get a single .ics file containing all calendar entries for a particular user of OpenSRS, which I believe has something to do with Sabre. I've tried using code that I have used successfully on other CalDAV servers, but it doesn't seem to work the same. Alternatively, I could make multiple HTTP calls to get individual .ics files, if there is a way to do that.
Using what I believe is the correct server (example: mail.servername.ca/caldav/username#domain.ca)
I make an HTTP request with the custom method "PROPFIND". I get back HTTP 207, and the return data is a bunch of XML. Some of it is an href to a web page that, if I retrieve it, is an HTML file displaying links to the .ics files of each event (under the heading "Nodes"). So I suppose I could scrape this HTML to get a list of links, then download them one by one. But I'm not entirely sure what would happen if I have hundreds or thousands of events - would I get them all on a single HTML page? And that would be very slow, of course.
I've also tried the "REPORT" command which is how I get .ics data from other CalDAV servers, but that does not return useful data. I was hoping someone could point me at a better method of doing this.
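For reference, a minimal sketch of the kind of PROPFIND request described above, using Python's requests library (the server URL and credentials are placeholders following the pattern in the question):

import requests

CAL_URL = "https://mail.servername.ca/caldav/username%23domain.ca/"   # '#' URL-encoded
AUTH = ("username@domain.ca", "password")                             # placeholder credentials

body = """<?xml version="1.0" encoding="utf-8"?>
<d:propfind xmlns:d="DAV:">
  <d:prop><d:displayname/><d:resourcetype/><d:getetag/></d:prop>
</d:propfind>"""

# PROPFIND with "Depth: 1" asks the server to list the child resources
# (the individual event .ics files); a 207 Multi-Status XML response comes back.
resp = requests.request("PROPFIND", CAL_URL, data=body, auth=AUTH,
                        headers={"Depth": "1", "Content-Type": "application/xml"})
print(resp.status_code)   # expect 207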
I'm trying to scrape the daily temperature data from this page - specifically the min and max daily temp: https://www.wunderground.com/calendar/gb/birmingham/EGBB/date/2020-8
I found the line in the HTML where the data is located:
(screenshot: calendar-days temperature li tag)
The rest of the daily temperature data can also be found in the other li tags:
(screenshot: other li tags containing the temperature data)
I'm trying to use Beautiful Soup to scrape that data, but when I use the following code I don't get all the li tags from the HTML, even though they are there when I inspect the page on the website.
When I print the resulting temp_cont, the other li tags are there but not the ones that contain the daily data: (screenshot: result of soup.find_all)
I've already tried using other HTML parsers, but it didn't help - all the parsers output the same data.
I'm looking at other solutions, like trying to load the page with javascript, since others suggest that some pages may load content dynamically, but I don't really understand this.
Hope you can help me with this one.
from requests import get
from bs4 import BeautifulSoup

response = get(url, headers=headers)   # url and headers are defined earlier in the script
soup = BeautifulSoup(response.content, 'lxml')
temp_cont = soup.find_all('li')
EDIT (ADDITIONAL QUESTION):
I tested the solution recommended by @AaronS below and the scraping worked perfectly fine. However, when I re-ran the script a few hours later, I got a 'NoneType' error because one of the list elements is None.
When I inspected the website again in the network preview of the API, the first element of temperatureMax is now "null". I don't understand why/how it changed, or whether there's a workaround so that the scraping works again. See screenshot here: (screenshot: network preview with null)
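One possible guard (a sketch, assuming the same response object as in the answer below) is to check for None before using the value and fall back explicitly:

daily = response.json()
max_temp = daily['temperatureMax'][0]
if max_temp is None:
    # The first element can come back as null later in the day (as in the screenshot above);
    # fall back to whatever makes sense for you -- here, tomorrow's value as a placeholder.
    max_temp = daily['temperatureMax'][1]
min_temp = daily['temperatureMin'][0]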
So if you disable javascript in your browser you will find that none of the information you require is there. This is what Roman was explaining: javascript can make HTTP requests and grab data from APIs, and the response is then fed back to the browser.
If you inspect the page and go to the network tools, you will be able to see all the requests made to load the page. Among those requests there's one that, if you click it and go to its preview, shows some temperature data.
I'm lazy, so I copy the cURL of this request and input it into a website like curl.trillworks.com, which converts it to a python request.
(Screenshot: copying the cURL of the request from the network tools.)
Code Example
import requests

# Parameters copied from the request the page itself makes to api.weather.com
params = (
    ('apiKey', '6532d6454b8aa370768e63d6ba5a832e'),
    ('geocode', '52.45,-1.75'),
    ('language', 'en-US'),
    ('units', 'e'),
    ('format', 'json'),
)
response = requests.get('https://api.weather.com/v3/wx/forecast/daily/15day', params=params)

max_temp = response.json()['temperatureMax'][0]
min_temp = response.json()['temperatureMin'][0]
print('Min Temperature:', min_temp)
print('Max Temperature:', max_temp)
Output
Min Temperature: 65
Max Temperature: 87
Explanation
So the URL is an API that weather.com has for daily forecasts. It has parameters specifying the location and the format the response should be returned in.
We make an HTTP GET request with those parameters, and the response we get back is a JSON object. We can convert this to a python dictionary using the response.json() method.
Now if you output response.json() you'll get a lot of data; if you look at the preview of that HTTP request in your browser you can navigate down to the data you want. We want the data under the keys 'temperatureMax' and 'temperatureMin'. Each is actually a list, and today's max and min are the first items of those lists. Hence response.json()['temperatureMax'][0] and response.json()['temperatureMin'][0].
Additional Information
This is a case where the website has dynamic content which is loaded by javascript. There are two broad ways to deal with this type of content.
1. Mimic the HTTP request that the javascript invokes (this is what we have done here).
2. Use a package like selenium to drive a real browser. Selenium can render the entire HTML, including the javascript-generated parts, and you can then select the data you want.
The benefit of the first option is efficiency: it's much faster and the data is more structured. The reason to treat the selenium option as a last resort is that selenium was never meant for web scraping; it's brittle, and if anything changes on the website you'll find yourself having to maintain the code often.
So my advice is to try the first option: inspect the page, go to the network tools and look at the previews of all the requests made. Play about with the preview data to see if it has what you want, then re-create that request. Sometimes you just need a simple HTTP request without parameters, cookies or headers.
In this particular example you only needed parameters, but sometimes you'll need all of them, and possibly more, for other websites. So be mindful of that if you're not able to get the data. It's not foolproof; there are definitely instances where re-creating the HTTP request is difficult because there are things required that you, as a user of the website, are not privy to. In fact, a well-developed API may build such measures in precisely to stop people scraping it.
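For completeness, a minimal sketch of the selenium route mentioned in option 2 above (it assumes Chrome with a matching chromedriver; in practice you may also need an explicit wait for the calendar to finish rendering):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.wunderground.com/calendar/gb/birmingham/EGBB/date/2020-8")
html = driver.page_source        # the rendered HTML, including javascript-generated parts
driver.quit()

soup = BeautifulSoup(html, "lxml")
temp_cont = soup.find_all("li")  # the calendar li tags should now be present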
This is my first time posting here. I do not have much experience (less than a week) with HTML parsing/web scraping, and I'm having difficulties parsing this webpage:
https://www.jobsbank.gov.sg/
What I want to do is parse the content of all the available job listings on the site.
My approach:
1. Click search on an empty search bar, which will return all listed records. The resulting web page is: https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do
2. Provide the search result web address to R and identify all the job listing links.
3. Supply the job listing links to R and ask R to go to each listing and extract its content.
4. Look for the next page and repeat steps 2 and 3.
However, the problem is that the web address I get from step 1 does not direct me to the search result page; instead, it directs me back to the home page.
Is there any way to overcome this problem?
Supposing I manage to get the web address for the search results, I intend to use the following code:
library(RCurl)

base_url <- "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"
base_html <- getURLContent(base_url, cainfo = "cacert.pem")[[1]]
links <- strsplit(base_html, "a href=")[[1]]
Learn to use the web developer tools in your web browser (hint: Use Chrome or Firefox).
Learn about HTTP GET and HTTP POST requests.
Notice the search box sends a POST request.
See what the Form Data parameters are (they seem to be {actionForm.checkValidRequest}: YES and {actionForm.keyWord}: my search string).
Construct a POST request using one of the R http packages with that form data in.
Hope the server doesn't care about the cookies; if it does, get the cookies and feed it cookies.
Hence you end up using postForm from the RCurl package:
# the .params names must match the Form Data field names seen in the developer tools
p <- postForm(url, .params = list(checkValidRequest = "YES", keyword = "finance"))
And then just extract the table from p. Getting the next page involves constructing another form request with a bunch of different form parameters.
Basically, a web request is more than just a URL: there's a whole conversation going on between the browser and the server involving form parameters, cookies, and sometimes AJAX requests made internally by the web page to update parts of itself.
There are a lot of "I can't scrape this site" questions on SO, and although we could spoon-feed you the precise answer to this exact problem, I do feel the world would be better served if we just told you to go learn about the HTTP protocol, and forms, and cookies, and then you'll understand how to use the tools better.
Note I've never seen a job site or a financial site that likes you scraping its content - although I can't see a warning about it on this site, that doesn't mean it's not there, and I would be careful about breaking the Terms and Conditions of Use. Otherwise you might find all your requests failing.
I am trying to write a program that extracts shipping container information from a specific site. I've had success with several shipping companies' websites that use POST methods to submit searches. For those sites I have been using cURL from PHP. However, this one site, http://www.cma-cgm.com/eBusiness/Tracking/, has been very difficult to interact with. I have tried using cURL, but all I retrieve is the surrounding HTML without the actual search results.
A sample container I am trying to track is CMAU1173561.
The actual tracking URL seems to be http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx?ContNum=CMAU1173561&T=292012319448 where ContNum is the shipping container and T is a value constructed from current time.
I also noted the .aspx extension. What is the best approach for retrieving these search results programmatically?
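For what it's worth, a minimal sketch of replaying the tracking URL described above, shown here with Python's requests library; the T value is derived from the current time, so a plain timestamp is used as a placeholder (the exact format the site expects is unclear), and if the results are injected by javascript this will still return only the surrounding HTML:

import time
import requests

params = {
    "ContNum": "CMAU1173561",     # the sample container from the question
    "T": str(int(time.time())),   # placeholder; built from the current time as described above
}
resp = requests.get("http://www.cma-cgm.com/eBusiness/Tracking/Default.aspx", params=params)
print(resp.status_code, len(resp.text))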
I originally asked this question on Super User but was told that it might be better placed here...
I have a running blog and to help me track and write about my runs I've recently bought a Garmin GPS watch. The setup works a treat and I'm able to share links to my runs in my blog such as:
http://connect.garmin.com/activity/23842182
Is there an easy way for me to capture the map itself out of the Garmin Connect site (see the link) and display it in my blog post? I can take a screenshot, but an interactive map would be heaps better. It's obviously a Google Map with the run info overlaid, so there must be a way... right?
To create an embedded interactive Google Map that renders your run polylines, you will need to extract the data that the Garmin site is using to render the line.
From the Garmin site, there are two Javascript files that do the work:
http://connect.garmin.com/resource/garmin-js-lib/map/MapsUtil.js - Bunch of utility functions for rendering Google maps based on data in the Garmin system
http://connect.garmin.com/api/activity/component/mapLoader.js - Uses Garmin.service.ActivityClient to grab the JSON data describing the polyline. It feeds this data into Garmin.map.MapsUtil.addEncodedPolylineToMap to render the map.
So to do this on your blog, you will need to either request the JSON data from the Garmin site (and trust that the URI format doesn't change) or grab the data and store it on your own site. The URI format is currently:
http://connect.garmin.com/proxy/activity-service-1.0/gpolyline/activity/<activity id>?full=true
Where activity ID is the last number in your original URL. So:
http://connect.garmin.com/activity/23842182
http://connect.garmin.com/proxy/activity-service-1.0/gpolyline/activity/23842182?full=true
This data request will return some JSON that you can then use to render a Google Map.
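As a rough sketch (in Python, and assuming the proxy endpoint above is still exposed), fetching that JSON looks like this:

import requests

activity_id = "23842182"   # the last number in the original activity URL
url = ("http://connect.garmin.com/proxy/activity-service-1.0/"
       "gpolyline/activity/" + activity_id + "?full=true")

# The response is the encoded-polyline JSON that the Garmin map code
# feeds into GPolyline.fromEncoded.
data = requests.get(url).json()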
Once you have decided how you want to store the JSON data, you will need to write some Javascript to request the JSON and, in the callback, feed it into the GPolyline.fromEncoded method. Once you have a GPolyline object (that is populated from the encoded JSON data), you can add it to a Google Maps GMap2 with the addOverlay method.
I realize that this answer is fairly technically involved and might be overwhelming if you haven't played with Google Maps before. If this is the case, I suggest heading over to the Google Maps API intro page for some hints on getting started.
Since this question was first posted, Garmin Connect has added a quick code snippet to embed in your WordPress site to display your maps and course data. If you're having issues getting the code snippet to stay in the post after saving, check out these instructions for embedding Garmin Connect activities in WordPress.