BeautifulSoup not returning source - web-scraping

I am trying to download the table data from http://www.footywire.com/afl/footy/ft_match_statistics?mid=5634, but I run into problems when I try to obtain the soup from BeautifulSoup.
This is what I'm running:
url='http://www.footywire.com/afl/footy/ft_match_statistics?mid=5634'
soup=BeautifulSoup(url)
but just get back the header, or nothing at all.
I've also tried using a different parser (html5lib), and reading the page through urllib2, but I'm still not getting any of the body of the page. I'm pretty useless at web interaction, so maybe there is something fundamental I am missing; the same approach seems to work on other websites.
Any help in pulling this data would be much appreciated. Why am I not getting the expected source?

Hello fellow Aussie :)
I'd use requests and lxml if I were you. I think that website is checking for cookies and a few headers. Requests' session class stores cookies and will let you pass headers too. lxml will let you use xpath here which I think will be less painful than BeautifulSoup's interface.
See below:
>>> import lxml.html
>>> import requests
>>> session = requests.session()
>>> headers = {
...     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36",
...     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
...     "Referer": "http://www.footywire.com/afl/footy/ft_match_statistics?mid=5634",
...     "Cache-Control": "max-age=0",
... }
>>> response = session.get("http://www.footywire.com/afl/footy/ft_match_statistics?mid=5634", headers=headers)
>>> tree = lxml.html.fromstring(response.text)
>>> rows = tree.xpath("//table//table//table//table//table//table//tr")
>>> for row in rows:
...     row.xpath(".//td//text()")
...
[u'\xa0\xa0', 'Sydney Match Statistics (Sorted by Disposals)', 'Coach: ', 'John Longmire', u'\xa0\xa0']
['Player', 'K', 'HB', 'D', 'M', 'G', 'B', 'T', 'HO', 'I50', 'FF', 'FA', 'DT', 'SC']
['Josh Kennedy', '20', '17', '37', '2', '1', '1', '1', '0', '3', '1', '0', '112', '126']
['Jarrad McVeigh', '23', '11', '34', '1', '0', '0', '2', '0', '5', '1', '1', '100', '116']
... cont...
The xpath query could be a bit brittle, but you get the idea :)
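If you'd rather stay with BeautifulSoup for the parsing step, the same trick applies: fetch the page with a requests session so the cookies and headers go along, then hand the HTML (not the URL!) to BeautifulSoup. A minimal sketch, reusing the headers above:

import requests
from bs4 import BeautifulSoup

session = requests.session()
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
response = session.get("http://www.footywire.com/afl/footy/ft_match_statistics?mid=5634",
                       headers=headers)

# Parse the downloaded HTML; BeautifulSoup(url) only parses the URL string itself,
# which is why the original attempt returned next to nothing.
soup = BeautifulSoup(response.text, "html5lib")
for row in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print(cells)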

Related

airflow db init error in WSL2 Ubuntu: "Additional properties are not allowed ('logo' was unexpected)"

Here is the error I'm getting:
File "/home/tinmar/.local/lib/python3.10/site-packages/jsonschema/validators.py", line 353, in validate
raise error
jsonschema.exceptions.ValidationError: Additional properties are not allowed ('**logo**' was unexpected)
Failed validating 'additionalProperties' in schema['properties']['integrations']['items']:
{'additionalProperties': False,
'properties': {'external-doc-url': {'description': 'URL to external '
'documentation for '
'the integration.',
'type': 'string'},
'how-to-guide': {'description': 'List of paths to '
'how-to-guide for the '
'integration. The path '
'must start with '
"'/docs/'",
'items': {'type': 'string'},
'type': 'array'},
'integration-name': {'description': 'Name of the '
'integration.',
'type': 'string'},
'tags': {'description': 'List of tags describing the '
"integration. While we're "
'using RST, only one tag is '
'supported per integration.',
'items': {'enum': ['apache',
'aws',
'azure',
'gcp',
'gmp',
'google',
'protocol',
'service',
'software',
'yandex'],
'type': 'string'},
'maxItems': 1,
'minItems': 1,
'type': 'array'}},
'required': ['integration-name', 'external-doc-url', 'tags'],
'type': 'object'}
On instance['integrations'][0]:
{'external-doc-url': 'https://www.sqlite.org/index.html',
'how-to-guide': ['/docs/apache-airflow-providers-sqlite/operators.rst'],
'integration-name': 'SQLite',
**'logo': '/integration-logos/sql**,
'tags': ['software']}
I really don't know what is happening here. I've tried almost every solution I could find on the internet, but nothing has helped.
By the way, before this error I hit another one:
ModuleNotFoundError: No module named 'wtforms.compat'
That one was solved by pinning these packages:
pip install wtforms==2.3.3
pip install marshmallow==3.0.0
I think the first error may be the cause of everything, but I'm not sure, and I don't know how that would even be possible.
These errors appeared when I tried to uninstall Airflow 2.5.0 in order to downgrade to Airflow 2.0.0; I believe that downgrade attempt is what triggered the other two errors.
I also tried uninstalling and reinstalling WSL2. That threw some errors of its own, which I solved, but I don't think WSL2 is the problem, because the other errors had already appeared before I tried this.
All I want is a local Airflow environment on Ubuntu under WSL2. I'd appreciate any ideas on how to solve this. Thanks in advance :)
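For what it's worth, manually pinning individual packages like wtforms often causes exactly this kind of version mismatch. Airflow's documented install method uses a version-matched constraints file to keep every dependency compatible; a sketch, assuming Python 3.10 as shown in the traceback:

pip install "apache-airflow==2.5.0" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.5.0/constraints-3.10.txt"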

How to replace or remove special characters from scrapy?

I just started learning Scrapy and I'm writing a spider to grab some info from a website. I'm trying to replace or remove special characters in the 'short_descr' field:
import scrapy


class TravelspudSpider(scrapy.Spider):
    name = 'travelSpud'
    allowed_domains = ['www.tripadvisor.ca']
    start_urls = [
        'https://www.tripadvisor.ca/Attractions-g294265-Activities-c57-Singapore.html/'
    ]
    base_url = 'https://www.tripadvisor.ca'

    def parse(self, response, **kwargs):
        for items in response.xpath('//div[@class= "_19L437XW _1qhi5DVB CO7bjfl5"]'):
            yield {
                'name': items.xpath('.//span/div[@class= "_1gpq3zsA _1zP41Z7X"]/text()').extract()[1],
                'reviews': items.xpath('.//span[@class= "DrjyGw-P _26S7gyB4 _14_buatE _1dimhEoy"]/text()').extract(),
                'rating': items.xpath('.//a/div[@class= "zTTYS8QR"]/svg/@title').extract(),
                'short_descr': items.xpath('.//div[@class= "_3W_31Rvp _1nUIPWja _17LAEUXp _2b3s5IMB"]'
                                           '/div[@class="DrjyGw-P _26S7gyB4 _3SccQt-T"]/text()').extract(),
                'place': items.xpath('.//div[@class= "ZtPwio2G"]'
                                     '/div'
                                     '/div[@class= "DrjyGw-P _26S7gyB4 _3SccQt-T"]/text()').extract(),
                'cost': items.xpath('.//div[@class= "DrjyGw-P _26S7gyB4 _3SccQt-T"]'
                                    '/div[@class= "DrjyGw-P _1SRa-qNz _2AAjjcx8"]'
                                    '/text()').extract(),
            }

        next_page_partial_url = response.css("div._1I73Kb0a").css("div._3djM0GaD").xpath('.//a/@href').extract_first()
        if next_page_partial_url is not None:
            next_page_url = self.base_url + next_page_partial_url
            yield scrapy.Request(next_page_url, callback=self.parse)
The character I'm trying to replace is the bullet in 'Hiking Trails • Scenic Walking Areas'. The dot in the middle comes out garbled when the CSV file is opened (it shows up as 'â€¢').
Everything else works like a charm.
I've tried to use .replace(), but I'm getting an error:
AttributeError: 'list' object has no attribute 'replace'
Any help would be appreciated
If you're removing these special characters just because they appear garbled in the CSV file, then I suggest not removing them. Simply add the following line to your settings.py file:
FEED_EXPORT_ENCODING = 'utf-8-sig'
This writes a UTF-8 byte-order mark, so the special characters display correctly when the CSV file is opened.
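If you do want to strip or replace the bullet itself, note that the AttributeError comes from calling .replace() on the list that .extract() returns rather than on a string. A small sketch (the sample list stands in for whatever your XPath extracts):

# .extract() returns a list of strings, so .replace() must be applied per element
parts = ['Hiking Trails \u2022 Scenic Walking Areas']   # stand-in for items.xpath(...).extract()
cleaned = [p.replace('\u2022', '-') for p in parts]
print(cleaned)  # ['Hiking Trails - Scenic Walking Areas']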

running a bash script in salt-minions

I am using CentOS 7 with the latest Salt installed (salt 2017.7.2, Nitrogen). I want to execute a certain script on all the salt-minions connected to the salt-master and get back the exit status from each minion, so that I can decide whether the salt state should be declared a pass or a fail.
Can anyone help me in this regard? If you could provide some example code, that would really help.
Regards
Pradeep
Note: I tried to post the contents of my init.sls here, but Stack Overflow kept rejecting it for improper indentation.
First attempt
https://docs.saltstack.com/en/latest/ref/states/all/salt.states.cmd.html#salt.states.cmd.script
Then I tried the Jinja way after googling around a bit:
Second Attempt
https://groups.google.com/forum/#!topic/salt-users/IJo6Z8Hro2w
Rendering SLS 'base:abcd' failed: Problem running salt function in Jinja template: Unable to run command '['/root/scripts/test.sh']' with the context '{'timeout': None, 'with_communicate': True, 'shell': False, 'bg': False, 'stderr': -2, 'env': {'LANG': 'en_US.UTF-8', 'LC_NUMERIC': 'C', 'NOTIFY_SOCKET': '/run/systemd/notify', 'LC_MESSAGES': 'C', 'LC_IDENTIFICATION': 'C', 'LC_MONETARY': 'C', 'LC_COLLATE': 'C', 'LC_CTYPE': 'C', 'LC_ADDRESS': 'C', 'LC_MEASUREMENT': 'C', 'LC_TELEPHONE': 'C', 'LC_PAPER': 'C', 'LC_NAME': 'C', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/bin:/sbin', 'LC_TIME': 'C'}, 'stdout': -1, 'close_fds': True, 'stdin': None, 'cwd': '/root'}', reason: command not found; line 1
Here is how I run scripts on my SaltStack install:
run_my_script.sh:
  cmd.script:
    - name: my_script.sh
    - source: salt://scripts/my_script.sh
You can then just check the exit status of the state.
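For example, applying the state from the master to every minion (assuming the state file is named run_my_script.sls) prints a per-minion result, and cmd.script marks the state as failed whenever the script exits non-zero:

salt '*' state.apply run_my_script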

Unix: Grabbing dates from file and sorting them

I have multiple files that look like this:
//file start
$thing1 = {'item1' => '0', 'item2 => '3', 'itemDate' => '2013-10-01'};
$thing2 = {'item1' => '0', 'item2 => '3', 'itemDate' => '2012-11-01'};
$thing3 = {'item1' => '0', 'item2 => '3', 'itemDate' => '2014-12-01'};
//file end
Using Unix, what is the best way to grab all of the items in a file that are dates? I know that the items I'm looking for in the file look like
{somethingDate = '1111-11-11'}
From this I want to grab '1111-11-11'. File one will have multiple 'fileOneDate' entries, file two will have multiple 'fileTwoDate' entries, etc. My goal is to take all of these dates that are '*Date', remove duplicates, and sort them into an output file, which is easy enough using the sort command and pipes. However, I'm stuck on the first part. What I have so far looks like this:
<command I'm working on now that grabs dates> | sort -n > outputfile.txt
I believe the way to go would be an AWK script. What would be the right way to parse these files?
Do you need something like this?
sed -n "s/.*'\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)'.*/\1/p"
If your sed has the -r option:
sed -nr "s/.*'([0-9]{4}-[0-9]{2}-[0-9]{2})'.*/\1/p"
Test:
sat:~# echo "{somethingDate = '1111-11-11'}" | sed -n "s/.*'\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)'.*/\1/p"
1111-11-11
sat:~#
sat:~# echo "$thing1 = {'item1' => '0', 'item2 => '3', 'itemDate' => '2013-10-01'};" | sed -n "s/.*'\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)'.*/\1/p"
2013-10-01
grep -o is the simplest way to extract the matching text.
sort -u sorts and removes duplicates in one go.
grep -oE '\<[0-9]{4}-[0-9]{2}-[0-9]{2}\>' <<'END' | sort -u
$thing1 = {'item1' => '0', 'item2 => '3', 'itemDate' => '2013-10-01'};
$thing2 = {'item1' => '0', 'item2 => '3', 'itemDate' => '2012-11-01'};
$thing3 = {'item1' => '0', 'item2 => '3', 'itemDate' => '2014-12-01'};
$thing2b= {'item1' => '0', 'item2 => '3', 'itemDate' => '2012-11-01'};
$thing2c= {'item1' => '0', 'item2 => '3', 'itemDate' => 'foo2012-01-01bar'};
END
2012-11-01
2013-10-01
2014-12-01
If your sample file is called datefile, then:
$ sed -nr "s/.*Date' => '([^']+)'.*/\1/p" datefile | sort -n
2012-11-01
2013-10-01
2014-12-01
The above regex looks for lines containing Date' => 'datestring' and prints the datestring.
In more detail, the sed command consists of a substitution which, in sed style, is written as s/old/new/options. The old part is a bit complicated, so I will go through it piece by piece: the old regex looks for (a) .*, meaning anything (any number of any characters), followed by (b) Date' => ', followed by (c) ([^']+), meaning one or more characters that are not single quotes, followed by (d) a single quote, followed by (e) .*, again meaning anything. If a match is made, the line is replaced with the date string (saved as \1 because the date-string part of the regex was in parens) and then, because of the p at the end of the expression, that date is printed. Because the -n option is given to sed, lines with no matching date string are not printed.
If your sed does not support -r (as on OS X), use a similar expression with a few added backslashes:
sed -n "s/.*Date' => '\([^']\+\)'.*/\1/p" datefile | sort -n

Is there any way to add a class to a specific node when using the Google OrgChart API?

I am using the Google OrgChart API. I would like to style a particular node, but I don't see any way to add a className or an id to a specific node so that I can style it with CSS.
I can see how to change the style of all nodes, but I don't see any way to do it for a single node.
Is this possible?
You can set "style" and "selectedStyle" properties on the DataTable row for the node you want to style (see the OrgChart custom properties).
If you specifically need to use a class, then your only option is to set the formatted value of the cell to wrap the contents in a <div> with the desired class.
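As a sketch of the DataTable route (the row index and style values here are made up), the row-level custom properties can be set like this:

// style row 2's node; 'style' and 'selectedStyle' are OrgChart row properties
data.setRowProperty(2, 'style', 'border: 2px solid orange; background-color: #ffffcc;');
data.setRowProperty(2, 'selectedStyle', 'background-color: #ffdd88;');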
If you want to specify the styling in a JSON literal, you can use the p:{style: 'some styling here'} property on the row object. However, you cannot specify a class in the p attribute :(
JSON Example:
var dataAsJSON = {
  cols: [{type: 'string'}, {type: 'string'}, {type: 'string'}],
  rows: [
    {c: [{v: '0', f: 'Final Fantasy'}, null, {v: 'First Root'}], p: {style: 'background-color:violet;'}},
    {c: [{v: '1', f: 'DmC'}, null, {v: 'Second Root'}], p: {style: 'background-color:lime;'}},
    {c: [{v: '2', f: 'Cloud Strife'}, {v: '0'}, null]},
    {c: [{v: '3', f: getFormattedCell('Vincent Valentine')}, {v: '0'}, null]},
    {c: [{v: '4', f: 'Sephiroth'}, {v: '2'}, null]},
    {c: [{v: '5', f: 'Dante'}, {v: '1'}, null]},
    {c: [{v: '6', f: 'Nero'}, {v: '1'}, null]}
  ]
};
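The getFormattedCell helper referenced above isn't defined in this thread; in the spirit of the first answer's suggestion, it could wrap the cell contents in a div carrying a CSS class (the class name below is made up):

function getFormattedCell(name) {
  // Wrap the node's contents in a div with a class so it can be styled from CSS.
  return '<div class="special-node">' + name + '</div>';
}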
