How to display a nested dictionary with bokeh datatable - bokeh

I'm trying to display a dictionary that has nested dictionaries inside it. For cells containing a nested dictionary, the Bokeh DataTable displays the value as [object Object].
So far I've been unpacking the dictionaries manually to display the actual data, but I was wondering if there is a better way of doing it, perhaps some sort of formatter on the DataTable?
I'm open to suggestions.

You could do some data manipulation with Pandas before handing the data to Bokeh:
from bokeh.models import ColumnDataSource
from bokeh.models.widgets import TableColumn, DataTable
from bokeh.io import output_file, show
import pandas as pd

nested_dictionary = {"user1": {'name': 'John', 'age': '27', 'sex': 'Male'},
                     "user2": {'name': 'Marie', 'age': '22', 'sex': 'Female'},
                     "user3": {'name': 'Luna', 'age': '24', 'sex': 'Female', 'married': 'No'}}

# Flatten the nested dictionary into a DataFrame (one row per user)
df = pd.DataFrame.from_dict(nested_dictionary, orient='index')
print(df.head())

source = ColumnDataSource(df)

# DataFrame -> DataTable
table_columns = [TableColumn(field=name, title=name) for name in df.columns]
data_table = DataTable(source=source, columns=table_columns, width=580, height=280)

# Specify the name of the output file and show the result
output_file('example_so.html')
show(data_table)
If your data is coming from a JSON file, have a look at json_normalize() (pandas.json_normalize in recent pandas versions, formerly in pandas.io.json).
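A minimal sketch of that approach (the records and nesting here are hypothetical; in practice you would load them with json.load()):

import pandas as pd

# Hypothetical nested JSON records, e.g. loaded from 'users.json'
records = [
    {'id': 'user1', 'profile': {'name': 'John', 'age': '27', 'sex': 'Male'}},
    {'id': 'user2', 'profile': {'name': 'Marie', 'age': '22', 'sex': 'Female'}},
]
# json_normalize flattens the nested 'profile' dicts into columns like profile.name, profile.age, ...
df = pd.json_normalize(records)
print(df.head())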

Related

How to convert div tags to a table?

I want to extract the table from this website: https://www.rankingthebrands.com/The-Brand-Rankings.aspx?rankingID=37&year=214
Checking the source of that website, I noticed that the table tag is somehow missing. I assume the table is built from multiple div classes. Is there any easy approach to convert this table to Excel/CSV? I barely have any coding skills/experience...
Appreciate any help.
There are a few ways to do that. One of them (in Python) is below; it's pretty self-explanatory, I believe:
import lxml.html as lh
import csv
import requests

url = 'https://www.rankingthebrands.com/The-Brand-Rankings.aspx?rankingID=37&year=214'
req = requests.get(url)
doc = lh.fromstring(req.text)

headers = ['Position', 'Name', 'Brand Value', 'Last']
with open('brands.csv', 'a', newline='') as fp:
    # note the 'a' in there - for 'append'
    file = csv.writer(fp)
    file.writerow(headers)
    # with the headers out of the way, the heavier xpath lifting begins:
    for row in doc.xpath('//div[@class="top100row"]'):
        pos = row.xpath('./div[@class="pos"]//text()')[0]
        name = row.xpath('.//div[@class="name"]//text()')[0]
        brand_val = row.xpath('.//div[@class="weighted"]//text()')[0]
        last = row.xpath('.//div[@class="lastyear"]//text()')[0]
        file.writerow([pos, name, brand_val, last])
The resulting file should be at least close to what you're looking for.
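If you then want an Excel file rather than CSV, one option (a sketch, assuming pandas and openpyxl are installed) is to load the CSV produced above and re-export it:

import pandas as pd

# Read the CSV written by the scraper and save it as an Excel workbook
df = pd.read_csv('brands.csv')
df.to_excel('brands.xlsx', index=False)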

Convert panel.widgets.tables.Tabulator to layoutDOM

I'm new to Bokeh so apologies if I get the terminology wrong.
I have a simple dashboard and I'm trying to add a chart, built with Tabulator, to the page (docs).
The basic setup is as follows:
from bokeh.io import curdoc
from bokeh.models import Select, Panel
from bokeh.models.widgets import Tabs
from irrelevant_code import my_func

chart = my_func()  # this is a Tabulator object
tab1 = Panel(child=summary_layout, title="Summary")
tab2 = Panel(child=chart, title="Chart")
tabs = Tabs(tabs=[tab1, tab2])

document = curdoc()
document.add_root(tabs)
This runs into a problem since Panel expects a LayoutDOM object and chart is a panel.widgets.tables.Tabulator object.
How can I convert chart to a layoutDOM object?
The specific error I get is
*** ValueError: failed to validate Panel(id='1212', ...).child: expected an instance of type LayoutDOM, got Tabulator(formatters={'testDate': DateForm...}, groups={'testGroup': ['col1',...}, selectable='checkbox', selection=[0, 1, 2, 3, 4, ...], titles={'col1': 'Column 1', ...}, value= val1 val2 v...) of type Tabulator
So while in theory you could use the .get_root() or .get_model() methods on the Tabulator to turn the Panel object into a Bokeh model, I would generally recommend just sticking with Panel; e.g. your example can be written as:
import panel as pn
from irrelevant_code import my_func

chart = my_func()  # this is a Tabulator object
tabs = pn.Tabs(('Summary', summary_layout), ('Chart', chart))
tabs.servable()
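For completeness, a rough sketch of the get_root() route mentioned above (untested; it assumes you are serving the app so that Panel's extension resources are loaded, and that summary_layout is already a Bokeh layout):

from bokeh.io import curdoc
from bokeh.models import Panel as BkPanel, Tabs as BkTabs
from irrelevant_code import my_func

chart = my_func()  # Panel Tabulator object
doc = curdoc()
# get_root() resolves the Panel object into a Bokeh model that a bokeh Panel can accept
chart_model = chart.get_root(doc)
tabs = BkTabs(tabs=[BkPanel(child=summary_layout, title="Summary"),
                    BkPanel(child=chart_model, title="Chart")])
doc.add_root(tabs)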

Scraping specific checkbox values using Python

I am trying to analyze the data on this website: website
I want to scrape a couple of country pairs, such as BZN|PT - BZN|ES and BZN|RO - BZN|BG.
For forecastedTransferCapacitiesMonthAhead I tried the following:
from bs4 import BeautifulSoup
import requests

page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show')
soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:
    print(''.join(price.get_text("|", strip=True).split()))
But I only get the preselected country. How can I pass my arguments so that I can select the countries that I want? Much obliged.
The code is missing a crucial part, i.e. the parameters that inform the request, such as the import/export direction and the from/to countries and area types.
To solve the issue, below is code built on yours which passes those parameters via requests.get(). To run the complete code for other borders, you will need to find out the corresponding parameter values per country.
from bs4 import BeautifulSoup
import requests

payload = {  # this is the dictionary whose values can be changed for the request
    'name': '',
    'defaultValue': 'false',
    'viewType': 'TABLE',
    'areaType': 'BORDER_BZN',
    'atch': 'false',
    'dateTime.dateTime': '01.05.2020 00:00|UTC|MONTH',
    'border.values': 'CTY|10YPL-AREA-----S!BZN_BZN|10YPL-AREA-----S_BZN_BZN|10YDOM-CZ-DE-SKK',
    'direction.values': ['Export', 'Import']
}

page = requests.get('https://transparency.entsoe.eu/transmission-domain/r2/forecastedTransferCapacitiesMonthAhead/show',
                    params=payload)  # GET request + parameters
soup = BeautifulSoup(page.text, 'html.parser')
tran_month = soup.find('table', id='dv-datatable').findAll('tr')
for price in tran_month:  # print all values, row by row (date, export and import)
    print(price.text.strip())

Adding values to dictionary in FOR loop. Updating instead of "Appending"

import requests
from bs4 import BeautifulSoup

urls = ['url1']
dictionary = {}
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    for sub_heading in soup.find_all('h3'):
        dictionary[url] = sub_heading.text
print(dictionary)
I'm getting a result that looks like this {url : sub_heading.text} instead of getting a dictionary containing all the values I'm expecting.
It seems that the loop is updating instead of "appending"...
Python dictionaries hold key:value pairs and cannot have duplicate keys.
In this code, 'url' is the key and 'sub_heading.text' is the value, so every time the inner loop runs, the value stored for 'url' simply gets overwritten:
for sub_heading in soup.find_all('h3'):
    dictionary[url] = sub_heading.text
You should use some other data structure instead of a plain dict (e.g. a list of tuples or a DataFrame), or collect all the values for each URL in a list, as sketched below.
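A minimal sketch of the dict-of-lists variant, using the same placeholder URL list as the question:

import requests
from bs4 import BeautifulSoup

urls = ['url1']
dictionary = {}
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    # collect every <h3> text for this URL instead of overwriting a single value
    dictionary[url] = [sub_heading.text for sub_heading in soup.find_all('h3')]
print(dictionary)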

How To Remove White Space in Scrapy Spider Data

I am writing my first spider in Scrapy and attempting to follow the documentation. I have implemented ItemLoaders. The spider extracts the data, but the data contains many line returns. I have tried many ways to remove them, but nothing seems to work. The replace_escape_chars utility is supposed to work, but I can't figure out how to use it with the ItemLoader. Also, some people use (unicode.strip), but again I can't seem to get it to work. Some people try to use these in items.py and others in the spider. How can I clean the data of these line returns (\r\n)? My items.py file only contains the item names and Field(). The spider code is below:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.utils.markup import replace_escape_chars
from ccpstore.items import Greenhouse

class GreenhouseSpider(BaseSpider):
    name = "greenhouse"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://www.domain.com",
    ]

    def parse(self, response):
        items = []
        l = XPathItemLoader(item=Greenhouse(), response=response)
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('title', '//h1')
        l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]')
        l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]')
        l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]')
        items.append(l.load_item())
        return items
You can use default_output_processor on the loader and other processors on individual fields; see the title field:
from scrapy.spider import BaseSpider
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Compose, MapCompose
from w3lib.html import replace_escape_chars, remove_tags
from ccpstore.items import Greenhouse

class GreenhouseSpider(BaseSpider):
    name = "greenhouse"
    allowed_domains = ["domain.com"]
    start_urls = ["http://www.domain.com"]

    def parse(self, response):
        l = XPathItemLoader(Greenhouse(), response=response)
        l.default_output_processor = MapCompose(lambda v: v.strip(), replace_escape_chars)
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('title', '//h1', Compose(remove_tags))
        l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]')
        l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]')
        l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]')
        return l.load_item()
It turns out that there were also many blank spaces in the data, so combining Steven's answer with some more research allowed all tags, line returns and duplicate spaces to be removed. The working code is below. Note the addition of text() on the loader lines, which removes the tags, and the split/Join processors, which remove spaces and line returns.
def parse(self, response):
    items = []
    l = XPathItemLoader(item=Greenhouse(), response=response)
    # MapCompose and Join come from scrapy.contrib.loader.processor (see imports above)
    l.default_input_processor = MapCompose(lambda v: v.split(), replace_escape_chars)
    l.default_output_processor = Join()
    l.add_xpath('title', '//h1/text()')
    l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]/text()')
    l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]/text()')
    l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]/text()')
    items.append(l.load_item())
    return items
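As a side note for anyone on a recent Scrapy version: the scrapy.spider and scrapy.contrib.* paths used above have since been renamed, so the same approach would look roughly like this (a sketch, untested against the original site):

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import Join, MapCompose
from w3lib.html import replace_escape_chars
from ccpstore.items import Greenhouse

class GreenhouseSpider(scrapy.Spider):
    name = "greenhouse"
    allowed_domains = ["domain.com"]
    start_urls = ["http://www.domain.com"]

    def parse(self, response):
        # ItemLoader replaces XPathItemLoader; add_xpath works the same way
        l = ItemLoader(item=Greenhouse(), response=response)
        l.default_input_processor = MapCompose(lambda v: v.split(), replace_escape_chars)
        l.default_output_processor = Join()
        l.add_xpath('title', '//h1/text()')
        yield l.load_item()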
