Parsing a web page with R

This is my first time posting here. I do not have much experience (less than a week) with HTML parsing/web scraping, and I'm having difficulties parsing this webpage:
https://www.jobsbank.gov.sg/
What I want to do is parse the content of all the job listings available on the site.
My approach:
1. Click search on an empty search bar, which will return all listed records. The resulting web page is: https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do
2. Provide the search result web address to R and identify all the job listing links.
3. Supply the job listing links to R and ask R to go to each listing and extract the content.
4. Look for the next page and repeat steps 2 and 3.
However, the problem is that the resulting webpage I get from step 1 does not take me to the search result page. Instead, it directs me back to the home page.
Is there any way to overcome this problem?
Supposing I manage to get the web address for the search results, I intend to use the following code:
library(RCurl)  # getURLContent() comes from RCurl

base_url <- "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"
base_html <- getURLContent(base_url, cainfo = "cacert.pem")[[1]]
links <- strsplit(base_html, "a href=")[[1]]  # crude split on anchor tags

1. Learn to use the web developer tools in your web browser (hint: use Chrome or Firefox).
2. Learn about HTTP GET and HTTP POST requests.
3. Notice that the search box sends a POST request.
4. See what the Form Data parameters are (they seem to be {actionForm.checkValidRequest}:YES and {actionForm.keyWord}:my search string).
5. Construct a POST request using one of the R HTTP packages with that form data in it.
6. Hope the server doesn't care about the cookies; if it does, get the cookies and feed them back in.
Hence you end up using postForm from the RCurl package:
p = postForm(url, .params=list(checkValidRequest="YES", keyword="finance"))
And then just extract the table from p. Getting the next page involves constructing another form request with a bunch of different form parameters.
Basically, a web request is more than just a URL; there's a whole conversation going on between the browser and the server involving form parameters and cookies, and sometimes there are AJAX requests going on internally, updating parts of the page.
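For readers who want to see the same form-data-plus-cookies flow outside R, here is a rough sketch using Python's requests library. It is an illustration of the HTTP mechanics only: the form field names are copied from the Form Data view described above, and the live site may require more fields, headers or cookies.

# Illustration only: field names taken from the browser's Form Data view above;
# the real site may need additional fields or cookies.
import requests

url = "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"
form_data = {
    "{actionForm.checkValidRequest}": "YES",
    "{actionForm.keyWord}": "finance",
}

# A Session stores any cookies the server sets, so visiting the home page
# first picks up a session cookie if one is needed.
with requests.Session() as session:
    session.get("https://www.jobsbank.gov.sg/")
    response = session.post(url, data=form_data)
    print(response.status_code, len(response.text))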
There's a lot of "I can't scrape this site" questions on SO, and although we could spoonfeed you the precise answer to this exact problem, I do feel the world would be better served if we just told you to go learn about the HTTP protocol, and Forms, and Cookies, and then you'll understand how to use the tools better.
Note that I've never seen a job site or a financial site that likes you scraping its content. Although I can't see a warning about it on this site, that doesn't mean it's not there, and I would be careful about breaking the Terms and Conditions of Use. Otherwise you might find all your requests failing.

Related

Why is soup.find_all not showing all tags

I'm trying to scrape the daily temperature data from this page - specifically the min and max daily temp: https://www.wunderground.com/calendar/gb/birmingham/EGBB/date/2020-8
I found the line in the HTML where the data is located (screenshot: calendar days temperature li tag), and the rest of the daily temperatures can also be found in the other li tags (screenshot: other li tags where temp data is inside).
I'm trying to use Beautiful Soup to scrape the said data, but with the following code I am not getting all the li tags from the HTML, even though they are there when I inspect the page in the browser.
When I print the resulting temp_cont, the other li tags are there but not the ones that contain the daily data (screenshot: result of soup.find_all).
I've already tried using other HTML parsers, but it didn't help: all the parsers output the same data.
I'm looking at other solutions, like loading the page with JavaScript support, since others suggest that some pages load content dynamically, but I don't really understand this.
Hope you can help me with this one.
from requests import get
from bs4 import BeautifulSoup
response = get(url, headers=headers)  # url and headers as defined earlier in the script
soup = BeautifulSoup(response.content, 'lxml')
temp_cont = soup.find_all('li')  # only returns li tags present in the raw (pre-JavaScript) HTML
EDIT (ADDITIONAL QUESTION):
I tested the solution recommended by @AaronS below and the scraping worked perfectly fine. However, when I re-ran the script after a few hours, a 'NoneType' error was raised because one of the list elements is None.
When I inspected the website again in the network preview of the API, the first element of temperatureMax is now "null". I don't understand why/how it changed, or whether there's a workaround so that the scraping works again. See screenshot here: network preview with null
So if you disable JavaScript in your browser, you will find that none of the information you require is there. This is what Roman is referring to: JavaScript can make HTTP requests and grab data from APIs, and the response is then fed back into the page.
If you inspect the page and go to the network tools, you will be able to see all the requests made to load the page up. Among those requests there's one that, if you click it and go to its preview, shows some temperature data.
I'm lazy, so I copy the cURL of this request and input it into a website like curl.trillworks.com, which converts it to a Python request (screenshot: copying the cURL request).
Code Example
import requests

# query parameters copied from the request seen in the browser's network tools
params = (
    ('apiKey', '6532d6454b8aa370768e63d6ba5a832e'),
    ('geocode', '52.45,-1.75'),
    ('language', 'en-US'),
    ('units', 'e'),
    ('format', 'json'),
)

response = requests.get('https://api.weather.com/v3/wx/forecast/daily/15day', params=params)

# temperatureMax and temperatureMin are lists; index 0 is today's forecast
max = response.json()['temperatureMax'][0]
min = response.json()['temperatureMin'][0]
print('Min Temperature: ', min)
print('Max Temperature: ', max)
Output
Min Temperature: 65
Max Temperature: 87
Explanation
So the URL is an API that weather.com has for daily forecasts. It has parameters specifying where you are and the format the response should be returned in.
We make an HTTP GET request with those parameters, and the response we get back is a JSON object. We can convert this to a Python dictionary using the response.json() method.
Now if you output response.json() you'll get a lot of data. If you look at the preview of that HTTP request in your browser, you can navigate down to the data you want. We want the data under the keys 'temperatureMax' and 'temperatureMin'. Each result is actually a list, and today's max and min temperatures are the first items of those lists, hence response.json()['temperatureMax'][0] and response.json()['temperatureMin'][0].
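If you want more than today's values, the same response holds the whole 15-day forecast. A small sketch, using only the key names already shown above:

# temperatureMax and temperatureMin are parallel lists, one entry per forecast day
data = response.json()
for day, (t_max, t_min) in enumerate(zip(data['temperatureMax'], data['temperatureMin'])):
    print('day', day, 'max', t_max, 'min', t_min)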
Additional Information
This is a case where the website has dynamic content which is loaded by JavaScript. There are two broad ways to deal with this type of content:
1. Mimic the HTTP request that the JavaScript invokes (this is what we have done here).
2. Use a package like selenium to drive a browser. You can use selenium to render the entire HTML, including the JavaScript-generated parts of that HTML, and then select the data you want.
The benefit of the first option is efficiency: it's much faster and the data is more structured. The reason to treat the selenium option as a last resort is that selenium was never meant for web scraping; it's brittle, and if anything changes on the website you'll find yourself having to maintain the code often.
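If you do go the selenium route anyway, a minimal sketch might look like the following. This is only a sketch and assumes Chrome with a working chromedriver; it is not the approach used in the answer above.

from selenium import webdriver
from bs4 import BeautifulSoup

# Sketch only: assumes Chrome and a matching chromedriver are installed.
driver = webdriver.Chrome()
driver.get('https://www.wunderground.com/calendar/gb/birmingham/EGBB/date/2020-8')
# page_source contains the JavaScript-rendered HTML; an explicit wait may be
# needed if the temperature elements have not finished loading yet.
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'lxml')
li_tags = soup.find_all('li')  # the daily-temperature li tags should now be present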
So my advice is to try the first option: inspect the page, go to the network tools and look at the previews of all the requests made. Play about with the preview data to see if it has what you want, then re-create that request. Sometimes you just need a simple HTTP request without parameters, cookies or headers.
In this particular example you just needed parameters, but sometimes you'll need all three and possibly more for different websites, so be mindful of that if you're not able to get the data. It's not foolproof; there are definitely instances where re-creating the HTTP request is difficult, and there are things required that you, as a user of the website, are not privy to. In fact, a well-developed API may have such measures in place precisely to stop people scraping it.
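Regarding the later edit: response.json() converts JSON null into Python None, so data['temperatureMax'][0] can legitimately be None, which matches the asker's observation that the value turned null later in the day. A small guard avoids the 'NoneType' error; falling back to the next available value is just one possible choice.

data = response.json()
max_today = data['temperatureMax'][0]
if max_today is None:
    # fall back to the first non-null value (an assumption about what you want)
    max_today = next((v for v in data['temperatureMax'] if v is not None), None)
print('Max Temperature: ', max_today)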

How to know if a user clicked a link using its network traffic

I have large traffic files that I'm trying to analyze in order to get statistical features of users.
One of the features that I would like to extract is link clicking on specific sites (for example, clicking on popups and more).
My first idea was to look at the packets' content and search for hrefs and links, save them all in some kind of data structure with their timestamps, and then iterate over the packets again to search for requests made close to the time the links appeared.
Something like the following pseudocode (here the packets are grouped by flow, where a flow is IP1 <=> IP2):
for each packet in each flow:
    search for "href" or "http://" or "https://"
    save the links with their timestamps

for each packet in each flow:
    if it's an HTTP request and its URL matches any URL in the list
            and the time is close enough, record it
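A concrete sketch of that two-pass idea, assuming the traffic has been saved to a pcap file and using scapy purely as an example library (the question does not name one); this only works for unencrypted HTTP payloads:

from scapy.all import rdpcap, TCP, Raw

packets = rdpcap('capture.pcap')  # placeholder file name

# First pass: remember payloads that contain links, with their timestamps.
link_payloads = []
for pkt in packets:
    if pkt.haslayer(TCP) and pkt.haslayer(Raw):
        payload = bytes(pkt[Raw].load)
        if b'href' in payload or b'http://' in payload or b'https://' in payload:
            link_payloads.append((float(pkt.time), payload))

# Second pass: flag GET requests whose path appeared in a recently seen payload.
for pkt in packets:
    if pkt.haslayer(TCP) and pkt.haslayer(Raw):
        payload = bytes(pkt[Raw].load)
        if payload.startswith(b'GET '):
            requested = payload.split(b' ', 2)[1]
            for ts, body in link_payloads:
                if requested in body and 0 < float(pkt.time) - ts < 30:
                    print('possible click:', requested.decode(errors='replace'))
                    break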
The problem with this code is that some links are dynamically generated while the page is loading (using JavaScript or similar), and cannot be found using the above method.
I have also tried to check the referrer field in the HTTP header and look for packets that were referred by the relevant sites. This method generates a lot of false positives because of iframes and embedded objects.
It is important to mention that this is not my server, and my intention is to make a tool for statistical analysis of users' behavior (thus, I can't add any kind of click tracker to the site).
Does anyone have an idea of what I can do to check whether users clicked on links, based on their network traffic?
Any help will be appreciated!
Thank you

Scrape ASP.NET Website with heavy javascript calls

I want to scrape this website - https://recorder.co.clark.nv.us/RecorderEcommerce/default.aspx.
I need to simulate clicking the 'Parcel #' link first then entering a value (i.e. 1234) into the Parcel # textbox and clicking search.
I need to scrape the data in the table which is shown at the bottom.
I'd like to write this in ASP.NET so I can push the Parcel # and the other parameters through as part of the request. Once I get the response back, I'm confident I can parse it myself; I'm just not sure how exactly I should send the original request, as it's not as simple as sending across parameters.
In your question you've specified both JavaScript and ASP.NET, so I really have no idea what technologies you're planning on using. I'd recommend HtmlAgilityPack. It can download from a URL, and it'll help with the parsing too.
// HtmlWeb fetches the page from a URL; HtmlDocument.Load() on its own expects a local file or stream.
var web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://recorder.co.clark.nv.us/RecorderEcommerce/default.aspx");

Request GA statistic data for a specific large set of pages

I've spent the last few days trying to find a solution to the problem below.
I have a set of URLs for which I would like to request data, mainly pageviews and visits by month in a specific time interval. These URLs make up one web section, and we would like to get statistics for this section. I'm using PHP GAPI.
I am able to construct the correct filter for the URL set:
ga:pagePath==[url1]||ga:pagePath==[url2]||ga:pagePath==[url3]...
But this only works for a few URLs, because the request is sent via GET and there is a length limitation on GET requests.
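To make the size problem concrete, here is a small sketch of how that OR filter grows with the URL list (placeholder URLs; the same idea applies whether you build the string in PHP or anything else):

urls = ['/section/page-a', '/section/page-b', '/section/page-c']   # placeholders
filters = '||'.join('ga:pagePath==' + u for u in urls)
print(filters)       # ga:pagePath==/section/page-a||ga:pagePath==/section/page-b||...
print(len(filters))  # this string ends up in the GET query string, hence the length limit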
At first I tried to make several requests, each for a few URLs from the whole set, and after all the requests (when I had data for all pages) I summed the pageviews and visits. Then I realized that this could work for pageviews but not for visits (one particular visit could be counted in more than one response, so the sum counted it multiple times).
And then I have these limitations:
I can't use a regular expression to shorten the filter. The URLs of the pages are badly designed (not thanks to us :) ), so the pages in a web section don't share a nice URL prefix like /my-section/*.
I need historical data (2 years back), so it won't help to start tracking some custom variable or event for pages in a particular web section from now on.
So I tried to make a POST request to the API. I was able to get an auth token, but POSTing the request for statistics data returns:
403 Forbidden
Target feed is read-only
I tried to find out whether it is actually possible to use the POST method, but had no luck finding exact info (some clues suggest that it is not possible).
Another idea could be redesigning the URLs to have some nice prefix to filter by regexp, and somehow changing the URLs already stored in GA, but I have a feeling that's not possible either.
Does anyone have an idea how to solve this?
Thanks for any suggestions :)

Track where 404s come from in Google Analytics?

I want to figure out how visitors are getting to my /error404 page. I want to see what URL they attempted to visit (e.g., http://mydomain.com/iloveyou) before they received the 404, so I can see what content my users think I have.
You can get by without touching your source files: as long as your error pages/templates are tagged, that's all you need, configuration-wise.
As an aside, our error page templates all look about the same, so to allow a person viewing the GA data to clearly distinguish the various error pages, we annotate them by passing a descriptive string to _trackPageview(), e.g.,
pageTracker._trackPageview("404_removed_directory");
Needless to say, this annotation is just for GA, it isn't shown to the user.
So with regard to viewing the information you are after (i.e., page paths in which one of your error pages is the terminus), you can use either the GA data browser or one of the two GA APIs.
Using the GA Data Export API
I would code my Request this way:
dimensions=ga:previousPagePath
metrics=ga:pageviews
filters=ga:nextPagePath%3D~SomeErrorPage.html
# or if your API client does not require URL encoding, then:
filters=ga:nextPagePath=~SomeErrorPage.html
If you haven't used the GA Data Export API, the Data Feed page shows a complete API Request that you can use as a template.
In addition, Google's Data Feed Query Explorer is a decent sandbox to interactively test the queries that comprise your Requests.
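As a rough illustration of what such a Request looks like when assembled into a Data Feed URL (the profile ID and dates are placeholders, and urlencode takes care of the %3D~ encoding mentioned above):

from urllib.parse import urlencode

params = {
    'ids': 'ga:PROFILE_ID',                            # placeholder profile ID
    'dimensions': 'ga:previousPagePath',
    'metrics': 'ga:pageviews',
    'filters': 'ga:nextPagePath=~SomeErrorPage.html',
    'start-date': '2010-01-01',                        # placeholder dates
    'end-date': '2010-12-31',
}
print('https://www.google.com/analytics/feeds/data?' + urlencode(params))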
Using the GA data browser
From the main Dashboard, on the left-hand side panel, click Content, then click Overview underneath it. To the right, in the main window, you will see a heading under the Pageviews chart called Navigation Analysis; this has two linked options under it, Navigation Summary and Entrance Paths. Clicking the latter will reveal the entrance-path view. In the textbox, just enter the name of your error page to get the entrance paths for that error page.
Finally, relying on your server access log for this information is less reliable for all of the usual reasons (caching, etc.). In addition, given that your question is specific to GA, I assume you already use GA, so modifying your server config and parsing the access log, if you are not already doing so, is a lot of trouble compared to getting a more accurate count of the same data through a channel (GA) you have already set up.
You don't need Google Analytics for this - just expand the logging of your web server to record referrer data for this specific page. With this information you can see where the majority of people are coming from.
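A minimal sketch of that approach, assuming a combined-format access log where the referrer is the second quoted field (the log file name and the /error404 path are placeholders):

import re
from collections import Counter

# matches: request line, status, size, then the quoted referrer field
pattern = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" \d+ \S+ "(?P<referer>[^"]*)"')
referers = Counter()

with open('access.log') as log:
    for line in log:
        m = pattern.search(line)
        if m and m.group('path').startswith('/error404'):
            referers[m.group('referer')] += 1

for ref, count in referers.most_common(10):
    print(count, ref)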
