I am pretty new to web scraping and recently I have been trying to automatically scrape the phone number from pages like this. I am not supposed to use Selenium or headless browser libraries, and I am trying to find a way to request the phone number directly, say via a web service or any other possible solution that could give me the number without having to go through the actual button press with Selenium.
I totally understand that it may not even be possible to automatically reveal the phone number in one shot, as it is meant not to be accessible by nosy newbie web scrapers like me; but I would still like to raise the question, for my own information, to get a detailed answer from an expert's point of view.
If I inspect the "Reveal" button's DOM element, it shows some attributes I have never seen before. I have two main questions which I believe could be helpful for newbies like me.
1) Given a set of unknown attributes (i.e. data-q and data-reveal in the button below), how can one find out which scripts on the page are actually using them?
2) I googled the button element's attributes, data-q and data-reveal, and the only relevant result I could find was this, which for some reason I don't have access to, even if I use a proxy.
Any clue, particularly on the first question, is much appreciated.
Regards,
Below is the button's code:
Reveal
OK, there are several steps to go through before you finally get a solution.
1st step: Open your browser and enter your target page (https://www.gumtree.com/p/vans/2015-ford-transit-custom-2.2tdci-290-l1-h1/1190345514).
2nd step: (Assuming you are using Chrome as your favorite browser) press Ctrl+Shift+I to open the console, then select the 'Network' tab.
3rd step: Press the 'Reveal' button on that page and watch the console carefully; catch the HTTP request which is sent immediately when you press the button. You can see the request contains a long numeric string in its Query String Parameters; it is actually a timestamp.
4th step: You can also see a part named 'Request Headers' in that HTTP request; copy the values of referer, user-agent and x-gumtree-token.
5th step: Construct your own request (I am a fan of Python, so I am going to show you my example code in Python):
import time
import requests

headers = {
    'referer': 'please enter the value you just copied from that specific request',
    'user-agent': 'please enter the value you just copied from that specific request',
    'x-gumtree-token': 'please enter the value you just copied from that specific request'
}

# The '_' query parameter is a millisecond timestamp, i.e. JavaScript's Date.now()
url = 'https://www.gumtree.com/ajax/account/seller/reveal/number/1190345514?_='
url += str(int(time.time() * 1000))

response = requests.get(url=url, headers=headers)
phone_number = response.json()['data']  # the revealed number is in the 'data' field
https://www.nyse.com/quote/XNYS:A
After I access the above URL, I open Developer Tools in Firefox, change the date in HISTORIC PRICES, and click 'GO'. The table is updated, but I don't see any relevant HTTP requests being sent in devtools.
So this means that the data has already been downloaded in the first request, but I cannot figure out how to extract the raw data of the table. Could anybody take a look at how to extract it? (Note that I don't want to use methods like Selenium; I want to stick with raw HTTP requests to get the raw data.)
EDIT: websockets were mentioned in the comments, but I can't see any in Developer Tools. I am adding the websocket tag anyway, in case somebody who knows more about websockets can chime in.
I am afraid you cannot extract JavaScript-rendered content without Selenium. You can always make use of a headless browser (you don't see any browser window on your screen; the only pitfall is that you have to wait until the page fully loads), and it won't bother you anymore.
In other words, all the other scraping libs are based on URLs and forms. Scrapy can post forms but cannot run JavaScript.
Selenium will save the day; all you lose is a couple of seconds for each attempt (it will be milliseconds if run in the frontend). You can grab the page source with driver.page_source, and it can be parsed directly (as HTML text) with BeautifulSoup or whatever.
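For instance, here is a minimal sketch of that approach against the NYSE page, assuming Chrome and a matching chromedriver are installed (the 7-second sleep mirrors the render delay used in the requests-html answer below):

# Load the page in headless Chrome, then parse the rendered HTML with BeautifulSoup.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # no visible browser window
driver = webdriver.Chrome(options=options)
driver.get('https://www.nyse.com/quote/XNYS:A')
time.sleep(7)  # give the page's JavaScript time to populate the table

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
for row in soup.select('.flex_tr'):  # '.flex_tr' is the row class used below
    print(row.get_text())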
You can do it with requests-html. For example, let's grab the first row of the table:
from requests_html import HTMLSession

session = HTMLSession()
url = 'https://www.nyse.com/quote/XNYS:A'
r = session.get(url)
r.html.render(sleep=7)  # run the page's JavaScript, sleeping 7 s so the table can load
first_row = r.html.find('.flex_tr', first=True)
print(first_row.text)
Output:
06/18/2021
146.31
146.83
144.94
145.01
3,220,680
As @Nikita said, you will have to wait for the page to load (here 7 seconds, but maybe less), but if you want to make multiple requests you can do them asynchronously!
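For example, a minimal sketch of the asynchronous variant using requests-html's AsyncHTMLSession (the second ticker URL is only there to illustrate concurrent requests):

from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

async def get_first_row(url):
    r = await asession.get(url)
    await r.html.arender(sleep=7)  # async counterpart of render()
    return r.html.find('.flex_tr', first=True)

# run() drives the coroutines concurrently and returns their results
rows = asession.run(
    lambda: get_first_row('https://www.nyse.com/quote/XNYS:A'),
    lambda: get_first_row('https://www.nyse.com/quote/XNYS:IBM'),
)
for row in rows:
    print(row.text)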
I currently have an application that works with Firebase.
I repeatedly load profile pictures, and since the link is quite long, it consumes a certain amount of data. To reduce this load, I would like to store the fixed part of the link and only fetch the token that is appended to it.
To explain, a link looks like this: "https://firebasestorage.googleapis.com/v0/b/fir-development.appspot.com/o/9pGveKDGphYVNTzRE5U3KTpSdpl2?alt=media&token=f408c3be-07d2-4ec2-bad7-acafedf59708"
So I would like to hard-code this part: https://firebasestorage.googleapis.com/v0/b/fir-developpement.appspot.com/o/
followed by "9pGveKDGphYVNTzRE5U3KTpSdpl2", which is the user's UID that I already retrieve, and then the part that poses my problem: "alt=media&token=f408c3be-07d2-4ec2-bad7-acafedf59708", which is generated randomly for each photo.
I would like to retrieve only this last random piece.
Is it possible?
Thank you
UPDATE (01/11): Still no solution.
Breaking apart and reassembling download URLs is not supported. You should treat these strings as opaque, since their implementation details might change without warning.
UPDATE: Google has recently updated their error message with an additional error code possibility: "timeout-or-duplicate".
This new error code seems to cover 99% of our previously mentioned mysterious cases.
We are still left wondering why we get that many validation requests that are either timeouts or duplicates. Determining this with certainty is likely to be impossible, but now I am just hoping that someone else has experienced something like it.
Disclaimer: I cross-posted this to Google Groups, so apologies for spamming the ether for those of you who frequent both sites.
I am currently working on a page, as part of an ASP.NET MVC application, with a form that uses reCAPTCHA validation. The page currently has many daily users.
In my server-side validation** of a reCAPTCHA response, I have for a while now been seeing cases of the response having its success property set to false, but with an accompanying empty error code array.
Most of the requests pass validation, but some keep exhibiting this pattern.
So after doing some research online, I explored the two possible scenarios I could think of:
1. The validation has timed out and is no longer valid.
2. The user has already been validated using the response value, so they are rejected the second time.
After collecting data for a while, I have found that all cases of "Success: false, error codes: []" have either had the validation be rather old (ranging from 5 minutes to 10 days(!)), or it has been a case of a re-used response value, or sometimes a combination of the two.
Even after implementing client-side prevention of double-clicking my submit button, a lot of double submits still seem to get through to the server-side Google reCAPTCHA validation logic.
My data tells me that 1.6% (28) of all requests (1760) have failed with at least one of the above scenarios being true ("timeout" or "double submission").
Meanwhile, not a single one of the 1760 requests has failed with a non-empty error code array.
I just have a hard time imagining a practical use case where a ChallengeTimeStamp gets issued, and then after 10 days validation is attempted, server side.
My question is:
What could be the reason for a non-negligible percentage of all Google reCAPTCHA server side validation attempts to be either very old or a case of double submission?
**By "server side validation" I mean logic that looks like this:
public bool IsVerifiedUser(string captchaResponse, string endUserIp)
{
    string apiUrl = ConfigurationManager.AppSettings["Google_Captcha_API"];
    string secret = ConfigurationManager.AppSettings["Google_Captcha_SecretKey"];

    using (var client = new HttpClient())
    {
        var parameters = new Dictionary<string, string>
        {
            { "secret", secret },
            { "response", captchaResponse },
            { "remoteip", endUserIp },
        };

        var content = new FormUrlEncodedContent(parameters);
        var response = client.PostAsync(apiUrl, content).Result;
        var responseContent = response.Content.ReadAsStringAsync().Result;
        GoogleCaptchaResponse googleCaptchaResponse = JsonConvert.DeserializeObject<GoogleCaptchaResponse>(responseContent);

        if (googleCaptchaResponse.Success)
        {
            _dal.LogGoogleRecaptchaResponse(endUserIp, captchaResponse);
            return true;
        }
        else
        {
            // Actual code omitted:
            // Try to determine the cause of failure
            // Look at googleCaptchaResponse.ErrorCodes array (this has been empty in all of the 28 cases of "success: false")
            // Measure time between googleCaptchaResponse.ChallengeTimeStamp (which is UTC) and DateTime.UtcNow
            // Check the reCAPTCHA response against a local database of previously used responses to detect cases of double submission
            return false;
        }
    }
}
Thank you in advance to anyone who has a clue and can perhaps shed some light on the subject.
You will get the timeout-or-duplicate error if your captcha response is validated twice.
Save the logs to a file in append mode and check whether you are validating a captcha twice.
Here is an example
// Verify the reCAPTCHA response and append Google's raw JSON reply to a log file
$verifyResponse = file_get_contents('https://www.google.com/recaptcha/api/siteverify?secret=' . $secret . '&response=' . $_POST['g-recaptcha-response']);
file_put_contents('logfile', $verifyResponse, FILE_APPEND);
Now read the content of the logfile created above and check whether the captcha was verified twice.
This is an interesting question, but it's going to be impossible to answer with any sort of certainty. I can give an educated guess about what's occurring.
As far as the old submissions go, that could simply be users leaving the page open in the browser and coming back later to finally submit. You can handle this scenario in a few different ways:
1. Set a meta refresh for the page, so that it will update itself after a defined period of time, and hopefully either get a new reCAPTCHA validation code or at least prompt the user to verify the CAPTCHA again. However, this is less than ideal, as it increases requests to your server and will blow away any work the user has done on the form. It's also very brute-force: it will simply refresh after a certain amount of time, regardless of whether the user is currently actively using the page or not.
2. Use a JavaScript timer to notify the user about the page timing out and then refresh. This is like #1, but with much more finesse. You can pop a warning dialog telling the user that they've left the page sitting too long and it will soon need to be refreshed, giving them time to finish up if they're actively using it. You can also check for user activity via events like onmousemove. If the user's not moving the mouse, it's very likely they aren't on the page.
3. Handle it server-side, by catching this scenario. I actually prefer this method the most, as it's the most fluid and honestly the easiest to achieve. When you get back success: false with no error codes, simply send the user back to the page, as if they had made a validation error in the form. Provide a message telling them that their CAPTCHA validation expired and they need to verify again. Then, all they have to do is verify and resubmit.
The double-submit issue is a perennial one that plagues all web developers. User behavior studies have shown that the vast majority occur because users have been trained to double-click icons, and as a result, think they need to double-click submit buttons as well. Some of it is impatience if something doesn't happen immediately on click. Regardless, the best thing you can do is implement JavaScript that disables the button on click, preventing a second click.
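Server-side, the duplicate check described above can be as simple as remembering every response token you have already verified. A minimal sketch of that idea in Python (an in-memory set stands in for the asker's database of previously used responses):

# reCAPTCHA response tokens are single-use; seeing one twice means a double submit.
seen_tokens = set()  # in a real app, a database table of used response tokens

def is_duplicate_submission(captcha_response):
    if captcha_response in seen_tokens:
        return True
    seen_tokens.add(captcha_response)
    return False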
Take a standard web page with lots of text fields, drop-downs, etc.
What is the most efficient way in WebDriver to fill out the values and then verify that the values have been entered correctly?
You only have to test that the values are entered correctly if you have some JavaScript validation or other magic happening on your input fields. You don't want to test that WebDriver/Selenium itself works correctly.
There are various ways, depending on whether you want to use WebDriver or the Selenium RC API. Here is a potpourri of the stuff I'm using.
Assert.assertEquals("input field must be empty", "", selenium.getValue("name=model.query"));
driver.findElement(By.name("model.query")).sendKeys("Testinput");
// here you have to wait for the JavaScript to finish, e.g. wait for a CSS class or id to appear
Assert.assertEquals("Testinput", selenium.getValue("name=model.query"));
With WebDriver only:
WebElement inputElement = driver.findElement(By.id("input_field_1"));
inputElement.clear();
inputElement.sendKeys("12");
// here you have to wait for the JavaScript to finish, e.g. wait for a CSS class or id to appear
Assert.assertEquals("12", inputElement.getAttribute("value"));
Hopefully, the results of filling out your form are visible to the user in some manner. So you could think along these BDD-esque lines:
When I create a new movie
Then I should see my movie page
That is, your "new movie" steps would do the field entry & submit. And your "Then" would assert that the movie shows up with your entered data.
element = driver.find_element(:id, "movie_title")
element.send_keys 'The Good, the Bad, the Ugly'
# etc.
driver.find_element(:id, "submit").click
I'm just dabbling in this now, but this is what I came up with so far. It certainly seems more verbose than something like Capybara:
fill_in 'movie_title', :with => 'The Good, the Bad, the Ugly'
Hope this helps.
Does Google have an API with a function that will verify whether a specific phrase can be found at a given URL?
Say I have a webpage URL: www.mysite/2011/01/check-if-phrase-exists
I want to know if the phrase foobar exists somewhere in that document (it can be anywhere in the HTML document, not just the "readable text").
The function/API would return True or False.
Question update: The method should save me from having to retrieve the entire page to my server and search it myself. It is the fetching of the webpage to my server that I am trying to avoid (to cut down on bandwidth).
I don't think they do, but you could do this yourself without much code (this is adapted from the App Engine docs):
import urllib2

url = "http://www.google.com/"
try:
    result = urllib2.urlopen(url)
    my_search_function(result)
    # or perhaps my_search_function(result.content)
except urllib2.URLError, e:
    handleError(e)
Then you can just define my_search_function(text) to do what you need.
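For instance, a minimal sketch of the hypothetical my_search_function (Python 2, to match the urllib2 code above):

# Read the response body and report whether the phrase occurs anywhere in the HTML.
def my_search_function(result, phrase='foobar'):
    html = result.read()  # urllib2 responses are file-like objects
    return phrase in html

Note that this still downloads the whole page to your server, which is exactly what the question update hoped to avoid; I am not aware of a Google API that performs the check remotely for you.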