I'm new to Scrapy and I'm trying to extract sports-betting data from sportsbooks.
I am currently trying to extract data on the upcoming matches in the Premier League: https://sport.mrgreen.com/da-DK/filter/football/england/premier_league
(The site is in Danish)
First I used the fetch command on the website, and I can get data back from the response object using both CSS and XPath selectors on the body of the HTML. However, when I try to extract data beyond a certain point in the HTML (a div with a data-ui-view attribute), response just returns an empty list. (See picture)
Example
I have encircled the xpath in red. I return something when I run the following:
response.xpath('/html/body/div[1]/div')
I have tried both a CSS selector on the innermost class I could find around the data I want, and the direct XPath. Still only an empty list.
response.xpath('/html/body/div[1]/div/div')
(The above code returns "[]")
response.xpath('response.xpath('/html/body/div[1]/div/div/div[2]/div/div/div[1]/div/div[3]/div[2]/div/div/div/div/div/div[4]/div/div[2]/div/div/ul/li[1]/a/div/div[2]/div/div/div/div/button[1]/div/div[1]/div'))
(The above xpath is to a football club name)
Does anybody know what the problem might be? Thanks
You can't do response.xpath(response.xpath(...)); one response.xpath() call is enough. Also, I always use "" instead of '', and I avoid full (absolute) XPaths, which rarely work. Instead, try a relative expression like .//div and see what it returns, and for better results use XPath's search functions, e.g. response.xpath(".//div[contains(text(), 'Chelsea Wolves')]//text()"). Make sure your response.url matches the URL you want to scrape.
Remember: a short, specific XPath is better than a long, ambiguous one.
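To make the relative-path advice concrete, here is a minimal sketch using Python's stdlib ElementTree on an invented HTML fragment (the class names and team are made up; Scrapy's response.xpath() accepts the same style of short, relative expression):

```python
# Sketch (not Scrapy itself): why a short relative XPath beats an
# absolute /html/body/div[1]/... chain, which breaks on any layout change.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div><div class="match">
    <button><div class="team">Chelsea</div></button>
  </div></div>
</body></html>
"""
root = ET.fromstring(html)

# A short relative path keyed on a class attribute finds the node
# no matter how deeply the site nests its wrapper divs.
team = root.find(".//div[@class='team']")
print(team.text)  # Chelsea
```

In Scrapy the equivalent would be a relative expression like response.xpath('.//div[@class="team"]/text()') rather than the full absolute path.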
Related
I'm writing a Gatling simulation, and I want to verify both that a certain element exists, and that the content of one of its attributes starts with a certain substring. E.g.:
val scn: ScenarioBuilder = scenario("BasicSimulation")
.exec(http("request_1")
.get("/path/to/resource")
.check(
status.is(200),
css("form#name", "action").ofType[String].startsWith(BASE_URL).saveAs("next_url")))
Now, when I add the startsWith above, the compiler reports an error saying startsWith is not a member of io.gatling.http.check.body.HttpBodyCssCheckBuilder[String]. If I leave the startsWith out, everything works just fine. I know that the expected form element is there, but I can't confirm that its action attribute starts with the correct base.
How can I confirm that the attribute starts with a certain substring?
Refer to https://gatling.io/docs/2.3/general/scenario/
I have copied the snippet below from there; it is a session function and works like this:
doIf(session => session("myKey").as[String].startsWith("admin")) {
  // executed if the session value stored in "myKey" starts with "admin"
  exec(http("if true").get("..."))
}
I just had the same problem. I guess one option is to use a validator, but I'm not sure whether you can declare one on the fly to validate against your BASE_URL (the documentation doesn't really give any examples). You can use transform and is instead.
Could look like this:
css("form#name", "action").transform(_.startsWith(BASE_URL)).is(true)
If you also want to include the saveAs call in one go you could probably also do something like this:
css("form#name", "action").transform(_.substring(0, BASE_URL.length)).is(BASE_URL).saveAs("next_url")
But that's harder to read. Also, I'm not sure what happens when substring throws an exception (like IndexOutOfBoundsException).
I want to verify that a text on a webpage exists 2 times, or 'n' times. I have used the "Page Should Contain" keyword, but it passes as soon as it finds a single occurrence. I don't want to verify using a locator.
Ex: I want to verify that the text "Success" appears on the current webpage 3 times, using Robot Framework.
Any inputs/suggestions would be helpful.
Too bad you don't want to use a locator, as robotframework has a keyword just for that:
Xpath Should Match X Times    //*[contains(., "Success")]    2
The caveat is the locator should not be prepended with xpath= - just straight expression.
The library keyword Page Should Contain does pretty much exactly that, by the way.
And if you want to find how many times the string is present in the page - easy:
${count}=    Get Matching Xpath Count    //*[contains(., "Success")]
And then do any kind of checks on the result, e.g.
Should Be Equal    ${count}    2
I thought the problem of not using locator sounds fun (the rationale behind the requirement still unclear, yet), so another solution - look in the source yourself:
${source}=    Get Source    # you have the whole html of the page here
${matches}=    Get Regexp Matches    ${source}    >.*\b(Success)\b.*<
${count}=    Get Length    ${matches}
The first line gets the source, the second gets all non-overlapping (separate) occurrences of the target string when it is (hopefully) inside a tag, and the third returns the count.
Disclaimer - please don't actually do that, unless you're 100% sure of the source and the structure. Use a locator.
I am integrating with a system that creates part of a URL and I supply part of the URL.
I supply this:
http://myServer/gis/default.aspx?MAP_NAME=myMap
The system supplies this:
?type=mrolls&rolls='123','456'
(the "rolls" change depending on what the user chooses in the system)
so, my URL ends up looking like this:
http://myServer/gis/default.aspx?MAP_NAME=myMap?type=mrolls&rolls='123','456'
I need to get the rolls but when I try this in VB.Net:
Dim URL_ROLL As String = Request.QueryString("rolls")
I get an incorrect syntax error.
I think it's a combination of the 2nd question mark and the single quotes.
When the system is only passing one roll, it works, I can get the rolls from the URL
which looks like this:
http://myServer/gis/default.aspx?MAP_NAME=myMap?type=roll&roll=123
I asked them to change the format of the system's URL but they can't change it without affecting the rest of their users.
Can anyone give me some ideas on how to get the rolls from the URL with single quotes?
OK, I believe I've fixed my problem.
I used a regular expression to remove anything in the querystring that wasn't a number or a comma.
Thanks again for taking time to make your comments, it made me look at the problem from a different angle.
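The fix described above can be sketched as follows (in Python here rather than VB.Net, purely for illustration; .NET's Regex.Replace works the same way): strip every character from the raw value that is not a digit or a comma. The raw value is taken from the question.

```python
import re

# The raw value the external system appends, single quotes and all
raw = "'123','456'"

# Keep only digits and commas; everything else is removed
rolls = re.sub(r"[^0-9,]", "", raw)
print(rolls)  # 123,456
```

The cleaned string can then be split on commas to get the individual roll numbers.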
I'm looking to extract some data from a definition list in some html code in R. So far I've done the following;
library(XML)
url <- "myurl"
doc <- htmlParse(url)
and then I (think I) want to use xpathSApply to extract the list data; however, it keeps returning an error. I'm new to the concept of web scraping and HTML, so I'm not entirely sure how the function locates the data to scrape.
How do I find the xpath to pass to xpathSApply?
an example url would be http://opencorporates.com/companies/gb/06309283
and I would want to scrape the data regarding company name, number, address, directors etc. into one observation per query.
Firefox has an amazing plugin called Firebug, and an extension to it called FirePath. Using that, you can right-click on any element on a web page and click "Inspect". That will show you the XPath to pass to xpathSApply.
If you can't use Firebug, there's a nifty bookmarklet called SelectorGadget that does much the same thing and should work in IE9.
Turns out the syntax I needed was '//node[@class="myclass"]' for use in the xpathSApply function. Cheers all
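The same class-predicate idea can be sketched with Python's stdlib ElementTree on a toy definition list (the tag and class names here are invented; in R you would pass the equivalent path string to xpathSApply):

```python
import xml.etree.ElementTree as ET

# Toy company-page fragment, invented for illustration
html = """<dl class="attributes">
  <dt>Company Number</dt><dd class="number">06309283</dd>
  <dt>Status</dt><dd class="status">Active</dd>
</dl>"""
root = ET.fromstring(html)

# A relative path with a class predicate, like '//node[@class="myclass"]'
number = root.find(".//dd[@class='number']").text
print(number)  # 06309283
```

In R, the analogous call would be xpathSApply(doc, '//dd[@class="number"]', xmlValue).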
I have one issue I'm struggling with regarding my HTTPModule filter:
I notice that the module gets its data in chunks. This is problematic for me because I'm using a regex to find and replace. If I get a partial match in one chunk and the rest of the match in the next, it will not work. Is there any way to get the entire response before I do my thing to it? I have seen code that appends data to a StringBuilder until it matches an "</html>" end tag, but my code must work for more than just HTML (XML, custom tags, etc.). I don't know how to detect the end of the stream, or if that is even possible.
I am attaching the filter in the BeginRequest.
Have a look at this example. It looks for "</html>" in the stream of the page.
Here's a sample project which performs buffered search and replace in an HttpModule using a Request.Filter and Response.Filter. You should be able to adapt this technique to perform a Regex easily.
https://github.com/snives/HttpModuleRewrite
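The buffering idea behind that project can be sketched language-agnostically (Python here rather than C#, with invented names): the filter accumulates every chunk it is given and runs the regex only once, when the stream is closed, so a match can no longer be split across chunk boundaries.

```python
import io
import re

class BufferingFilter:
    """Collects all chunks, then rewrites the whole body once at close."""
    def __init__(self, downstream, pattern, replacement):
        self.downstream = downstream
        self.pattern = pattern
        self.replacement = replacement
        self.buffer = io.BytesIO()

    def write(self, chunk):
        # Never forward chunks immediately; a match could straddle two of them.
        self.buffer.write(chunk)

    def close(self):
        # The whole response is available now, so one regex pass is safe.
        body = self.buffer.getvalue().decode("utf-8")
        body = re.sub(self.pattern, self.replacement, body)
        self.downstream.write(body.encode("utf-8"))

# The match "foo bar" is deliberately split across two chunks:
out = io.BytesIO()
f = BufferingFilter(out, r"foo\s+bar", "baz")
f.write(b"... foo ")
f.write(b"bar ...")
f.close()
print(out.getvalue())  # b'... baz ...'
```

In an actual HttpModule the close/flush of the response stream plays the role of close() here; the cost is that nothing is sent to the client until the full response has been buffered.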