Webscraping on html function parameter and export to csv - web-scraping

<div class="readmore">
<a href="" onclick="updateDetailModal({"name":"Company Name 1","website":"https:\/\/hello.com.sg\/","phone":"65 8123 4567","email":"hello#gmail.com.sg"})" class="btn btn-primary" data-toggle="modal" data-target="#exampleModal">More
</a>
</div>
Hi, I'm looking to scrape the snippet above so that I can get the data into a .csv file in this format:
Company Name | Website Url | Phone | Email -> 1st Row
Company Name 1 | https://hello.com.sg/ | 81234567 | hello#gmail.com -> 2nd Row
Company Name 2 | https://hello2.com.sg/ | 87654321 | hello2#gmail.com -> Subsequent rows for all links
Is there a way to use regex to get the individual fields and export them to a CSV file? I've been trying Python and Beautiful Soup, but I only know how to select elements by class or id. I'm not sure how to do it for function parameters.
Appreciate your help!

To extract the information you are looking for, Beautiful Soup (or lxml) alone is not enough; you also need json and a bit of string manipulation.
Assuming your html looks like this (the double quotes inside the onclick attribute have to be escaped as &quot;, otherwise they would end the attribute early):
modal = r"""<div class="readmore">
<a href="" onclick="updateDetailModal({&quot;name&quot;:&quot;Company Name 1&quot;,&quot;website&quot;:&quot;https:\/\/hello.com.sg\/&quot;,&quot;phone&quot;:&quot;65 8123 4567&quot;,&quot;email&quot;:&quot;hello#gmail.com.sg&quot;})" class="btn btn-primary" data-toggle="modal" data-target="#exampleModal">More
</a>
<a href="" onclick="updateDetailModal({&quot;name&quot;:&quot;Company Name 2&quot;,&quot;website&quot;:&quot;https:\/\/hello2.com.sg\/&quot;,&quot;phone&quot;:&quot;87654321&quot;,&quot;email&quot;:&quot;hello2#gmail.com.sg&quot;})" class="btn btn-primary" data-toggle="modal" data-target="#exampleModal">More
</a>
</div>"""
Then:
from bs4 import BeautifulSoup as bs
import json

soup = bs(modal, "lxml")
infos = soup.select('a')
companies = []
for info in infos:
    # The JSON argument is the text between the first "(" and the next ")".
    target = info.attrs['onclick'].split('(')[1].split(')')[0]
    data = json.loads(target)
    # One row per link: name, website, phone, email.
    companies.append(list(data.values()))
Your data is now in the companies list:
for co in companies:
    print(co)
Output:
['Company Name 1', 'https://hello.com.sg/', '65 8123 4567', 'hello#gmail.com.sg']
['Company Name 2', 'https://hello2.com.sg/', '87654321', 'hello2#gmail.com.sg']
From here you write it to csv using standard methods.
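One way to do that last step is with Python's built-in csv module. This is a minimal sketch, not the only option: the rows are copied from the output shown in the answer, the header comes from the format in the question, and the file name companies.csv and the helper write_companies are placeholders.

```python
import csv

# Rows as printed by the loop in the answer.
companies = [
    ['Company Name 1', 'https://hello.com.sg/', '65 8123 4567', 'hello#gmail.com.sg'],
    ['Company Name 2', 'https://hello2.com.sg/', '87654321', 'hello2#gmail.com.sg'],
]
# Header row taken from the format requested in the question.
header = ['Company Name', 'Website Url', 'Phone', 'Email']


def write_companies(rows, path):
    """Write the header and one line per company to a CSV file."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


write_companies(companies, 'companies.csv')  # placeholder file name
```

This relies on every JSON object listing its keys in the same order (name, website, phone, email). If that is not guaranteed, keeping the parsed dicts and using csv.DictWriter with explicit field names is the safer choice.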

Related

Django #hashtags in url

What I'm trying to do is have a link on CONTACT redirect to HOME and scroll down to some content, but I don't know how to pass # in URLs in Django. Any help is appreciated. The scroll works fine on HOME, but I can't get it to work from CONTACT.
URL
path('/#products', HomeView.as_view(), name='products'),
CONTACT.html
<a class="nav-link" href="{% url 'core:products' %}">Products</a>
HOME.html
this is in navbar
<a class="nav-link" style="cursor: pointer" href='#products'>Products</a>
this is where i want it scrolled
<a class="anchor" id="products"></a>

Use a RedirectView for this:
views.py
from django.views.generic import RedirectView
from django.urls import reverse

class ViewpostRedirectView(RedirectView):
    def get_redirect_url(self, *args, **kwargs):
        # The fragment is appended after the resolved path; it never appears in urls.py.
        hash_part = "add_data_Modal"  # the data you want to add to the hash part
        return reverse("createpost") + "#{0}".format(hash_part)
urls.py
path('viewpost/', views.createpost, name='createpost'),
path('viewpost/modal/', views.ViewpostRedirectView.as_view(), name='createpost_modal')
More info : https://www.kite.com/python/docs/django.views.generic.RedirectView
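A short illustration of why a route such as path('/#products', ...) cannot match: the part after # is the URL fragment, which the browser keeps to itself and does not send as part of the path the server routes on. The sketch below uses only Python's standard urllib.parse; the example.com host and the with_fragment helper are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def with_fragment(url, fragment):
    """Return url with its fragment (the part after '#') set to fragment."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(fragment=fragment))


# The fragment is a separate component from the path.
parts = urlsplit("https://example.com/viewpost/#add_data_Modal")
print(parts.path)      # /viewpost/
print(parts.fragment)  # add_data_Modal

# Appending a fragment to an already resolved path, as the redirect view does.
print(with_fragment("/viewpost/", "add_data_Modal"))  # /viewpost/#add_data_Modal
```

So the URL pattern only ever needs to describe the path, and the fragment is added when the link or redirect target is built.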

Show News Item creator or owner full name in Plone

I'm trying to show the full name for each news item in a list. For the moment I have only the user id (nickname).
Is there a simple way (in existing .pt file) to show the full name of creator or owner instead of a nickname?
The page must work for anonymous users, too. I mean - the page must be public.
Some details:
<div class="container-fluid news-list-container"
tal:define="news_items python:context.getFolderContents(contentFilter={'portal_type':['News Item'], 'sort_on': 'Date', 'sort_order': 'descending',});
Batch python:modules['Products.CMFPlone'].Batch;
b_size python:4;
b_start python:0;
b_start request/b_start | b_start;
batch python:Batch(news_items, b_size, int(b_start), orphan=0);"
tal:condition="news_items">
<div class="news-list-items">
<tal:items tal:repeat="news_item batch">
<!-- News item -->
<div class="row news-item"
tal:define="news_object python:news_item.getObject();
news_date python:news_object.getField('modification_date').getAccessor(news_object)();
news_title python:news_object.getField('title').getAccessor(news_object)();
news_description python:news_object.getField('description').getAccessor(news_object)();
news_image python:news_object.getField('image').getAccessor(news_object)();
news_url python:news_object.absolute_url();
news_creators python:news_object.getField('creators').getAccessor(news_object)(); .... ...
<tal:fullname define="membership context/portal_membership;
info python:membership.getMemberInfo(user.getId());
fullname info/fullname">
You are <span class="name" tal:content="fullname" />
</tal:fullname>
This example is taken from the Plone documentation.
You can also take inspiration from this code:
https://github.com/collective/Products.Scrawl/blob/1021047c4ef6c2655d104e8b345a24140da9e4aa/Products/Scrawl/browser/blogentry_view.pt#L32
<tal:name tal:condition="item_creator"
tal:define="author python:context.portal_membership.getMemberInfo(item_creator)">
<span i18n:translate="label_by_author">Posted by
<a href="#"
title="Read more posts by this author"
tal:attributes="href string:${context/portal_url}/author/${item_creator}"
tal:content="python:author and author['fullname'] or item_creator"
tal:omit-tag="not:author"
i18n:domain="scrawl"
i18n:name="author"
i18n:attributes="title author_title">
Bob Dobalina
</a>
</span>
</tal:name>
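Mind the possible performance issues: the lookup runs once per listed item.
A cached view method may work a lot better, e.g.:
from plone import api
from plone.memoize.view import memoize

# Method of the browser view class; the result is cached per user id.
@memoize
def userid2fullname(self, userid):
    pm = api.portal.get_tool('portal_membership')
    memberinfo = pm.getMemberInfo(userid)
    # Fall back to the user id when there is no member info or no full name.
    return memberinfo and memberinfo['fullname'] or userid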
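To see the caching and fallback behaviour outside of Plone, here is a standalone sketch. It swaps in functools.lru_cache for the Plone memoize decorator and a made-up dictionary (FAKE_MEMBERS) with a stub get_member_info for portal_membership.getMemberInfo, so it only illustrates the idea and is not Plone code.

```python
from functools import lru_cache

# Hypothetical stand-in for the member directory: user id -> member info.
FAKE_MEMBERS = {
    'bdobalina': {'fullname': 'Bob Dobalina'},
    'anon': {'fullname': ''},
}
lookups = []  # records every real (uncached) lookup


def get_member_info(userid):
    """Stub for getMemberInfo(): a dict for known users, None otherwise."""
    lookups.append(userid)
    return FAKE_MEMBERS.get(userid)


@lru_cache(maxsize=None)
def userid2fullname(userid):
    memberinfo = get_member_info(userid)
    # Same fallback as in the answer: use the id if there is no full name.
    return memberinfo and memberinfo['fullname'] or userid


names = [userid2fullname(u) for u in ['bdobalina', 'bdobalina', 'ghost', 'anon']]
print(names)         # ['Bob Dobalina', 'Bob Dobalina', 'ghost', 'anon']
print(len(lookups))  # 3, because the repeated id was served from the cache
```

Note that lru_cache keeps entries for the lifetime of the process, so a real site would need a cache with a scope that fits how often names change.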

Retain Query String In URL

I've built a web app that lives on SharePoint.
The user is asked a question at the beginning. They can answer either "Yes" or "No."
A query string is added to the URL depending on the user's answer. So there will only be two query strings: ?choice=yes or ?choice=no.
The index.html page has this:
<input type="button" value="Yes" onclick="redirect(this);" id="yes" class="btn btn-primary btn-lg">
<input type="button" value="No" onclick="redirect(this);" id="no" class="btn btn-primary btn-lg">
And this is how the query string is initially added:
function redirect(e) {
    location.href = "map.html?choice=" + e.id;
}
After the user picks an answer, they are taken to map.html, the first section they see after the question (so the URL at this point is either map.html?choice=yes or map.html?choice=no). After that, they can navigate wherever they please in the app.
Here's the issue. The navigation is pulled in from an external nav.js file (i.e. a JS include) like so:
<script type="text/javascript" src="/js/nav.js"></script>
Here's the nav.js code:
document.write(
' <div class="row text-center navigation">' +
' <ul class="nav nav-pills">' +
' <li id="nav-welcome"><a class="align" href="index.html">WELCOME</a></li>' +
' <li id="nav-map"><span>MAP</span></li>' +
' <li id="nav-about"><span>ABOUT</span></li>' +
' <li id="nav-contact"><span>CONTACT</span></li>' +
' <li id="nav-pricing"><span>PRICING</span></li>' +
' </ul>' +
' </div>'
);
How do I keep passing the chosen query string along as the user navigates the app?
There is a single page whose content changes depending on the query string.
So if the user answers "yes" at the beginning and eventually navigates to the pricing page, it will pull content specific to the "yes" answer (i.e. call a specific function).
If the user answers "no" at the beginning and eventually navigates to the pricing page, it will pull content specific to the "no" answer (i.e. call a specific function).
Thank you all in advance.

Twig date formatting

I'm trying to format a date and place each item (month, day and year) inside a div or span. I'm using a SaaS platform which provides a date like so:
2015-03-07 22:54:00
What I'm trying to do is split this date into 3 separate items and place them in a div like so:
<span class="day">07</span>
<span class="month">March</span>
<span class="year">2015</span>
Therefore I need to split the date at each - and output the parts inside a div like so:
<span class="day">{{ article.date | split('-')[2] }}</span>
<span class="month">{{ article.date | split('-')[1] }}</span>
<span class="year">{{ article.date | split('-')[0] }}</span>
This gives me an output like:
<span class="day">07 22:54:00</span>
<span class="month">03</span>
<span class="year">2015</span>
Where I get stuck is removing the time after the day and changing the month into its textual representation.
I know it can be done with PHP date functions, but I can't get that to work. The problem I'm facing is that the date needs to be split at each -.
Can anybody help me with that?

You can use the date() function, and then the date filter.
NB: these are two different things.
{% set dateTime = date(article.date) %}
<span class="day">{{ dateTime | date('d') }}</span>
<span class="month">{{ dateTime | date('F') }}</span>
<span class="year">{{ dateTime | date('Y') }}</span>
First, the date() function converts your date string to a DateTime object.
Then, the date filter does the equivalent of PHP's DateTime::format, so you can use a separate format string for each piece of the desired output.
If you need to translate the name of the month to a locale, you can pipe the result through the trans filter, but it must be enabled first.
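The same two-step idea (parse the string into a date object first, then format each part separately) can be shown outside of Twig. The following is only an analogue in Python's standard datetime module, not Twig code; the variable names are chosen for illustration, and the month name depends on the process locale (English by default).

```python
from datetime import datetime

raw = "2015-03-07 22:54:00"

# Step 1: parse the string into a datetime object (the role of Twig's date() function).
date_time = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Step 2: format each part on its own (the role of the date filter).
day = date_time.strftime("%d")    # zero-padded day of month
month = date_time.strftime("%B")  # full month name
year = date_time.strftime("%Y")   # four-digit year

print(day, month, year)
```

Because the time is part of the parsed object rather than the string, it no longer leaks into the day, which is what went wrong with the split('-') approach.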

identifying image object using css for protractor automation

I am using Protractor for AngularJS automation. I am trying to get the 'fa fa-something' text from the element structure below using a CSS selector:
<div class="Itemlistcontainer">
<ul class="itemlist sortlist ui-sortable">
<!-- ngRepeat: Item in Items | orderBy:CustomSort:false --><li ng-repeat=" Item in Items | orderBy:CustomSort:false" ng-show="!searchinput || ([Item.Name]|filter:searchinput).length" ng-class="{ 'itemdisabled' : !CanUseTask(Item) || deactivate }" class="ng-scope ui-draggable">
<div bo-attr="" bo-attr-id="Item.Id" bo-attr-title="Item.Details | html2string" class="label itemlabel" id="3d991564-a1a9-49ab-8659-a26e00fbfae6" title="Blah blah blah.">
<span>
<i ng-class="itemIconClass(Item)" style="margin-right: 5px;" class="fa fa-something"></i>
</span><span bo-text="item.Name | ellipse : 32">Item Name</span>
</div>
<!--ngRepeat: Item in Items....and the list goes on
I need to know under which Item in Items this 'fa fa-something' was found. I am using element(By.css('ul.itemlist i.itemIconClass(Item)').getAttribute('class').getText()
which is not working.

element(By.css('ul.itemlist i.itemIconClass(Item)').getAttribute('class').getText()
can't work, because itemIconClass(Item) is an Angular template expression (the value of ng-class), not a CSS class, so it cannot be used in a Protractor element selector. The call is also missing the closing parenthesis of element(...), and getText() cannot be chained onto the result of getAttribute().
I think you need:
element(By.css('ul.itemlist i.fa.fa-something')).getAttribute('class')
To determine which Item in Items this 'fa fa-something' was found under, you probably need an ID on the element; that is easier to read than parsing the class attribute to extract 'fa fa-something'.
