Jsoup doesn't show a certain element that is visible on the webpage - web-scraping

On this page, https://www.futbin.com/21/player/541/lionel-messi,
there is a span element with the id ps-lowest-1. When I fetch this element via
doc.getElementById("ps-lowest-1"), it has no data-price attribute and its text is just a dash. What could cause this?

What you need to do is look through the network HTTP requests and responses in Chrome's DevTools.
If you search them for the value you're looking for (688000 for PS4), you'll eventually find it in the response to a request to:
https://www.futbin.com/21/playerPrices?player=158023&rids=50489671&_=1603238284786
This is the data I think you want.
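To parse it, you could use something like the following (Spring's RestTemplate plus the Jayway JsonPath library):
String url = "https://www.futbin.com/21/playerPrices?player=158023&rids=50489671&_=1603238284786";
RestTemplate restTemplate = new RestTemplate();
ResponseEntity<String> document = restTemplate.getForEntity(url, String.class);
String json = document.getBody();
// The path below is a placeholder; replace it with the real path to the values in the JSON.
List<String> listOfItems = JsonPath.read(json, "$.path[*].to.items.you.want");
This should give you a rough idea of how to get the data you want.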
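When the exact JSON path is not known yet, one option is to search the parsed response for a key by name. The following is a minimal Python sketch of that idea; the SAMPLE payload and the key name "lowest" are invented for illustration, since the real response structure is not shown here.

```python
import json
from typing import Any, Iterator


def find_values(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in parsed JSON."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            # Keep descending: the key may also appear deeper down.
            yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Hypothetical payload, NOT the real response from the site.
SAMPLE = '{"prices": {"ps": {"lowest": 688000}}}'

if __name__ == "__main__":
    print(list(find_values(json.loads(SAMPLE), "lowest")))
```

Once the real structure is known, a fixed path (as in the JsonPath call above) is usually clearer than a recursive search.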


How does a Body get applied to an HTTP request?

In a previous post (link below) I was able to figure out how to apply a body to an HTTP request using VB.NET and the HttpWebRequest class object. However, I'm trying to understand how this is working technically. In other words:
How is the body actually being added to the http request because I don't see it being assigned to a property of the request object?
Here's the code snippet.
Dim vHTTPREQUESTBODY As String = "the body of the request"
Dim vBYTEVERSIONOFBODY As Byte() = Encoding.ASCII.GetBytes(vHTTPREQUESTBODY)
vHTTPREQUEST.ContentLength = vBYTEVERSIONOFBODY.Length
Dim vDATASTREAM As Stream = vHTTPREQUEST.GetRequestStream()
vDATASTREAM.Write(vBYTEVERSIONOFBODY, 0, vBYTEVERSIONOFBODY.Length)
vDATASTREAM.Close()
Any help appreciated and more information below for those interested.
More Information
Why the question:
What is confusing to me is that you cannot clearly see where or how the body is added to the request before it is sent. In older VB objects that do the same thing, there would be a simple “Body” property on the object to set a value on. For example, in line 3 of the above, you can see the “vHTTPREQUEST.ContentLength” property being set. However, you don’t see anything like that for the body, as in “vHTTPREQUEST.BodyOrBlahBlahBlah”. Yet somehow the above does add the body string you set in the “vHTTPREQUESTBODY” variable to the request.
Can anyone explain how? I've searched online and nothing specifically addresses this.
This is my guess:
Take a look at line 5. If you look up the “Write” method of the “Stream” object, it basically states that you are adding the bytes you passed in the first parameter to the end of the current stream. So I’m assuming that the stream is in the computer’s memory at that point in the code, and when you use the “Write” method you are adding your body, in the form of bytes, to that stream that is already in memory. Therefore, you don’t need to explicitly assign it to a property of the request. Again, that is a guess. I was hoping for confirmation, the true explanation, or a resource/reference where I can read about this.
Previous Post:
Here's a link to previous post
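For intuition, an HTTP request on the wire is a request line, the headers, a blank line, and then the body bytes, all written to one stream in that order. Here is a minimal Python sketch of that layout. It illustrates the message format only, not how HttpWebRequest works internally; the path /example and the host example.com are placeholders.

```python
import io


def build_request(body_text: str) -> bytes:
    """Write a raw HTTP/1.1 POST message into an in-memory stream."""
    body = body_text.encode("ascii")
    stream = io.BytesIO()
    # Request line and headers come first (placeholder path and host).
    stream.write(b"POST /example HTTP/1.1\r\n")
    stream.write(b"Host: example.com\r\n")
    stream.write(b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n")
    # A blank line marks the end of the headers.
    stream.write(b"\r\n")
    # The body is simply the bytes written after that blank line.
    stream.write(body)
    return stream.getvalue()


if __name__ == "__main__":
    print(build_request("the body of the request").decode("ascii"))
```

In that picture there is no separate "body" property: whatever is written to the stream after the headers is the body, which matches the Write call in the snippet above.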

ASP.NET Core URL Parameter Decoding

I have an ASP.NET Core web API and an issue with encoded URLs in query parameters.
I have a URL parameter like 'path/to/IDENTIFIER'. The IDENTIFIER part is something like 'HÄÄ/20/19'. This is URL-encoded in the frontend into a link URL. The result is a link like
domain.com/new/stuff/path/to/H%C3%84%C3%84%2F20%2F19
Now, at some point, the user gets redirected to a controller where this URL is used in a query parameter like:
param=%2Fpath%2Fto%2FH%C3%84%C3%84%2F20%2F19
I'm using request query to get the param
var param = HttpContext.Request.Query["param"].ToString();
After this the value of param is
%2Fpath%2Fto%2FHÄÄ%2F20%2F19
So the LATIN CAPITAL LETTER A WITH DIAERESIS characters are automatically decoded, while the other encoded characters are not.
The actual problem comes when I redirect the user to this URL. It ends up in a Referer header, where it causes havoc with the error message
System.InvalidOperationException: Invalid non-ASCII or control character in header: 0x00C4
I tried to just replace all the 'Ä' characters with 'A' and the problem is fixed. This is not a real fix though. I cannot encode the whole variable (see above) as it would result in double encoding for other encoded characters.
This problem only occurs with IE11 and Edge (AFAIK) and works fine with at least Chrome.
I'm not 100% sure where the actual problem is or why this is happening. Does anyone have any ideas on where to start looking and how to fix this without hacking around it with string.Replace?
EDIT
I could fix it with something like this, but I'm not seriously doing this. Seems way too hacky.
var problemPart = param.Substring(param.LastIndexOf('/') + 1, param.Length - param.LastIndexOf('/') - 1);
var fixedPart = WebUtility.UrlDecode(problemPart);
fixedPart = WebUtility.UrlEncode(fixedPart);
param = param.Replace(problemPart, fixedPart);
EDIT 2
I think the problem is that IE11 and Edge change the encoding by adding control characters to it when the URL ends up in the Referer header. The fix I added to the original post doesn't actually fix the problem; it just works around it. The control character that gets added to the URL is %C2%84 (so Ä becomes %C3%84%C2%84 instead of just %C3%84).
TEMPORARY WORKAROUND
I basically used the code above to work around the issue. I iterated over the parameter value and re-encoded all the invalid characters in it. This doesn't fix the root cause, but it works around the issue and the user doesn't get any errors on the screen.
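The idea behind the workaround, re-encoding only the characters that are not ASCII while leaving the existing percent-escapes alone, can be sketched in Python. This is an illustration of the technique and not the ASP.NET Core code; the function name is made up.

```python
from urllib.parse import quote


def reencode_non_ascii(value: str) -> str:
    """Percent-encode unsafe characters but leave existing %XX escapes alone.

    Treating "%" as safe prevents double encoding of sequences such as %2F.
    Assumption: every "%" in the input already starts a valid escape.
    """
    return quote(value, safe="%")


if __name__ == "__main__":
    print(reencode_non_ascii("%2Fpath%2Fto%2FHÄÄ%2F20%2F19"))
```

With the partially decoded value from the question, this yields the fully encoded form again (Ä becomes %C3%84) without turning %2F into %252F.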

ashx - get all the possible items of QueryString

Looking at this
http://www.dotnetperls.com/ashx
I might have bits of code like this:
string file = context.Request.QueryString["file"];
if (file == "logo")
{
r.WriteFile("Logo1.png");
}
else
{
r.WriteFile("Flower1.png");
}
That should allow me to see different things depending on URL that I enter in a browser, for example:
http://www.dotnetperls.com/?file=logo
http://www.dotnetperls.com/?file=sth_else_eg_flower
The problem I am facing now is this: knowing just http://www.dotnetperls.com/?file, how can I find out what all the accepted values of the file variable are? In this case it would be "logo" and anything else.
What I have in reality is http://www.somewebstie.com/somefile.ashx?somevariable=. I can Google the string and get a few results (e.g. http://www.somewebstie.com/somefile.ashx?somevariable=abcde or http://www.somewebstie.com/somefile.ashx?somevariable=xyz), so I know it exists and is somehow searchable. I would just like to know all the other values like "abcde" and "xyz". If I try just http://www.somewebstie.com/somefile.ashx, I get a single-line error saying that I am giving a wrong variable, and I cannot see anything important in the source of the site.
What might be important here - I have zero knowledge about web technologies.
You can't get this information. It's all hidden in the code implementation. There is no published format (by default) that will show you all of the available options the code is looking for.
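To see why, here is the handler logic from the question rewritten as a small Python sketch (the function name and the dictionary standing in for the query string are illustrative). The accepted value exists only as a comparison inside the server-side code, and the client only ever sees the result.

```python
def handle(query: dict) -> str:
    """Return the file that would be served for a parsed query string."""
    # "logo" is known only to this code; nothing advertises it to the client.
    if query.get("file") == "logo":
        return "Logo1.png"
    # Every other value, including a missing one, falls through to here.
    return "Flower1.png"


if __name__ == "__main__":
    print(handle({"file": "logo"}))
    print(handle({"file": "sth_else_eg_flower"}))
```

From the outside, the only way to learn about "logo" would be to try it, find it in existing links, or read documentation the site owner chose to publish.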

How do I identify the referrer page in ASP.NET?

In VS2003, I am trying to find out the particular page where the request is coming from. I want to identify the exact aspx page name.
Is there a way to get only the page name, or to somehow strip everything but the page name?
Currently I am using the following instruction...
string referencepage = HttpContext.Current.Request.UrlReferrer.ToString();
and I get the following result...
"http://localhost/MyPage123.aspx?myval1=3333&myval2=4444"
I want to get the result back without any query string parameters and be able to identify the page MyPage123.aspx accurately.
How do I do that?
Instead of calling .ToString on the Uri, use the AbsolutePath property instead:
string referencepage = HttpContext.Current.Request.UrlReferrer.AbsolutePath;
This should get you "/MyPage123.aspx" in your case.
Edit: Had LocalPath instead of AbsolutePath by mistake
Look at the Segments property of the Uri class (which is what HttpContext.Current.Request.UrlReferrer returns).
Something like HttpContext.Current.Request.UrlReferrer.Segments[1] (change the index 1 to get the segment you require).
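The same two steps, dropping the query string and then taking the last path segment, look like this in a Python sketch. It mirrors the idea rather than the .NET API, and the function name is illustrative.

```python
import posixpath
from urllib.parse import urlsplit


def page_name(referrer: str) -> str:
    """Return the last path segment of a URL, without query or fragment."""
    # urlsplit separates the path from the query string and fragment.
    path = urlsplit(referrer).path
    # basename keeps only the final segment of the path.
    return posixpath.basename(path)


if __name__ == "__main__":
    print(page_name("http://localhost/MyPage123.aspx?myval1=3333&myval2=4444"))
```

Taking the last segment instead of a fixed index also works when the page sits in a subfolder.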

Checking The Date A Webpage Has Been Updated?

I want to be able to run a little script that I can populate with a list of URLs, and have it check when each page was last updated. Has anyone done this?
I can only find a manual way of doing this using JavaScript, by pasting this into the browser URL field:
javascript:alert(document.lastModified)
Any ideas gratefully received :)
The following will step through an array of URLs and display the last modified date or, if it's not present, the date of the server request.
string[] urls = { "http://boflynn.net", "http://slashdot.org" };
foreach (string url in urls)
{
    System.Net.HttpWebRequest req =
        (System.Net.HttpWebRequest) System.Net.WebRequest.Create(url);
    // Dispose the response so the underlying connection is released.
    using (System.Net.HttpWebResponse resp =
        (System.Net.HttpWebResponse) req.GetResponse())
    {
        Console.WriteLine("{0} - {1}", url, resp.LastModified);
    }
}
If you use urllib2 (urllib.request in Python 3), or perhaps httplib (http.client in Python 3), which might be better still, in a Python script, you can inspect the returned headers for the Last-Modified field.
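A minimal Python 3 sketch of that approach, using only the standard library, might look like the following. The function names are illustrative, the URLs are the ones from the example above, and a server is free to omit the header, in which case None is returned.

```python
import urllib.request
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_http_date(value: Optional[str]) -> Optional[datetime]:
    """Convert a Last-Modified header value to a datetime, or None."""
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # The header was present but not a valid HTTP date.
        return None


def last_modified(url: str) -> Optional[datetime]:
    """Send a HEAD request and read the Last-Modified response header."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_http_date(response.headers.get("Last-Modified"))


if __name__ == "__main__":
    for url in ["http://boflynn.net", "http://slashdot.org"]:
        print(url, last_modified(url))
```

A HEAD request is used so that only the headers are transferred; some servers reject HEAD, in which case a normal GET would be the fallback.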
It depends on what you mean by "last updated". Sure, there is the Last-Modified HTTP header, but it can be very misleading. For example, if the page is being served dynamically, there is a good chance that this field will be the current time, even if the content of the page itself (the part useful to humans) has not been updated in a rather long time. This page itself is a good example of this phenomenon.
If you are truly interested in the last time the content was updated, then I don't have an immediate answer.
