I need help to access and save multiple html pages - web-scraping

I need to access and save 272,549 HTML pages.
First is https://cearatransparente.ce.gov.br/portal-da-transparencia/despesas/notas-de-empenho/2839284
Last is https://cearatransparente.ce.gov.br/portal-da-transparencia/despesas/notas-de-empenho/3111833
How do I do this using R?
This is the code I tried, but it failed: every iteration saved over the same pagina.html file.
library(rvest)

for (page_result in seq(from = 2839284, to = 2839294, by = 1)) {
  link <- paste0("https://cearatransparente.ce.gov.br/portal-da-transparencia/despesas/notas-de-empenho/", page_result)
  page <- read_html(link)
  # Save each page under its own name so earlier downloads are not overwritten
  save_html(page, file = paste0("pagina_", page_result, ".html"))
}
P.S. If it helps, each page also has a print version, like this: https://cearatransparente.ce.gov.br/portal-da-transparencia/despesas/notas-de-empenho/3111833?locale=pt-BR&print=true


How do I edit this code so that I can view only text?

This is a snippet from snownews (a terminal-based news aggregator).
Problem: when I try to view an RSS feed in the terminal, I can only see text up to the first image; after that everything is blank. Since I'm using a terminal, images aren't supported anyway.
Screenshot: https://imgur.com/a/O3Gq2Dl
Here is the code where the parsing takes place. I only need to view text, without any images.
# Importing
if (($PROGRAM_NAME =~ "snow2opml") || ($ARGV[0] eq "--export")) {
    OPMLexport();
} else {
    my $parser = XML::LibXML->new();
    $parser->validation(0);   # Turn off validation from libxml
    $parser->recover(1);      # And ignore any errors while parsing
    my(@lines) = <>;
    my($input) = join("\n", @lines);
    my($doc)   = $parser->parse_string($input);
    my($root)  = $doc->documentElement();

    # Parsing the document tree using XPath
    my(@items) = $root->findnodes("//outline");
    foreach (@items) {
        my(@attrs) = $_->attributes();
        foreach (@attrs) {
            # Only print attribute xmlUrl=""
            if ($_->nodeName =~ /xmlUrl/i) {
                print $_->value . "\n";
            }
        }
    }
}
If the full code is needed, I can post it.

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

I'm trying to parse a directory containing a collection of XML files from RSS feeds.
I have similar code for another directory working fine, so I can't figure out the problem. I want to return the items so I can write them to a CSV file. The error I'm getting is:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
Here is the site I've collected RSS feeds from: https://www.ba.no/service/rss
It worked fine for: https://www.nrk.no/toppsaker.rss and https://www.vg.no/rss/feed/?limit=10&format=rss&categories=&keywords=
Here is the function for this RSS:
import os
import xml.etree.ElementTree as ET
import csv

def baitem():
    basepath = "../data_copy/bergens_avisen"
    table = []
    for fname in os.listdir(basepath):
        if fname != "last_feed.xml":
            files = ET.parse(os.path.join(basepath, fname))
            root = files.getroot()
            items = root.find("channel").findall("item")
            # print(items)
            for item in items:
                date = item.find("pubDate").text
                title = item.find("title").text
                description = item.find("description").text
                link = item.find("link").text
                table.append((date, title, description, link))
    return table
I tested with print(items) and it returned all the objects.
Could it be related to how the XML files are written?
I asked a friend, who suggested testing with a try/except statement. That revealed a .DS_Store file, which only appears on Mac computers. I'm providing the solution for those who might run into the same problem in the future.
def baitem():
    basepath = "../data_copy/bergens_avisen"
    table = []
    for fname in os.listdir(basepath):
        try:
            if fname != "last_feed.xml" and fname != ".DS_Store":
                files = ET.parse(os.path.join(basepath, fname))
                root = files.getroot()
                items = root.find("channel").findall("item")
                for item in items:
                    date = item.find("pubDate").text
                    title = item.find("title").text
                    description = item.find("description").text
                    link = item.find("link").text
                    table.append((date, title, description, link))
        except Exception as e:
            print(fname, e)
    return table
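Rather than blacklisting every non-XML filename you happen to discover, a more robust variant whitelists by extension and skips files that fail to parse. This is a sketch under the same assumptions as the code above (an RSS 2.0 channel/item layout); the function name parse_feed_dir is hypothetical:

```python
import os
import xml.etree.ElementTree as ET

def parse_feed_dir(basepath):
    """Collect (pubDate, title, description, link) tuples from every .xml file in basepath."""
    table = []
    for fname in sorted(os.listdir(basepath)):
        # Whitelist by extension instead of blacklisting names like ".DS_Store"
        if not fname.endswith(".xml") or fname == "last_feed.xml":
            continue
        try:
            root = ET.parse(os.path.join(basepath, fname)).getroot()
        except ET.ParseError as e:
            print(fname, e)  # report the bad file but keep going
            continue
        for item in root.find("channel").findall("item"):
            table.append((item.findtext("pubDate"), item.findtext("title"),
                          item.findtext("description"), item.findtext("link")))
    return table
```

findtext also returns None instead of raising when a tag is missing, which makes one malformed item less likely to abort the whole run.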

Extracting data from a web page to Excel sheet

How can I extract information from a web page into an Excel sheet?
The website is https://www.proudlysa.co.za/members.php and I would like to extract all the companies listed there and all their respective information.
The process you're referring to is called web scraping, and there are several VBA tutorials out there for you to try.
I tried creating something to grab all the pages, but I ran out of time and it had bugs. This should help you a little; you will have to repeat it on all 112 pages.
Using Chrome, go to the page.
Type javascript: in the URL bar, then paste the code below. It should extract what you need; then you just copy and paste it into Excel.
var list = $(document).find(".pricing-list");
var csv = "";
for (i = 0; list.length > i; i++) {
    var dataTags = list[i].getElementsByTagName('li');
    var dataArr = [];
    for (j = 0; dataTags.length > j; j++) {
        dataArr.push(dataTags[j].innerText.trim());
    }
    csv += dataArr.join(', ') + "<br>";
}
You will get something like this.
EDITED
Use this instead: it will automatically download each page as CSV, and you can combine the files afterwards.
Make sure to type javascript: in the URL bar before pasting and pressing Enter.
This also works in Chrome; I'm not sure about other browsers, as I don't use them much.
var list = $(document).find(".pricing-list");
var csv = "data:text/csv;charset=utf-8,";
for (i = 0; list.length > i; i++) {
    var dataTags = list[i].getElementsByTagName('li');
    var dataArr = [];
    for (j = 0; dataTags.length > j; j++) {
        dataArr.push(dataTags[j].innerText.trim());
    }
    csv += dataArr.join(', ') + "\n";
}
var a = document.createElement("a");
a.href = "" + encodeURI(csv);
a.download = "data.csv";
a.click();
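One caveat with joining fields on a comma: a cell that itself contains a comma or quote will shift the columns when Excel opens the file. A minimal escaping helper, sketched in plain JavaScript following the RFC 4180 quoting rules, could replace the dataArr.join(', ') call:

```javascript
// Quote a field if it contains a comma, quote, or newline,
// doubling any embedded quotes (RFC 4180 style)
function csvField(text) {
    var s = String(text).trim();
    if (/[",\n]/.test(s)) {
        s = '"' + s.replace(/"/g, '""') + '"';
    }
    return s;
}

// Build one CSV row from an array of cell values
function csvRow(cells) {
    return cells.map(csvField).join(",");
}
```

With this, csv += csvRow(dataArr) + "\n"; produces rows that survive commas inside company names.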

xpages view panel display computed icon

I'm trying to display an icon in a view panel based on a column value. The page displays fine if I only show the column value and/or use a static database image. If, however, I try to compute the image based on the column value, I get an HTTP 500 error. Looking at the error log I see two errors: first CLFAD0211E: Exception thrown, then CLFAD0246E: Exception occurred servicing request for .
I have reviewed this simple explanation of how to add a dynamic icon (https://www.youtube.com/watch?v=27MvLDx9X34) and other similar articles, and it still doesn't work. Below is the code for the computed icon.
var urlFull:XSPUrl = new XSPURL(database.getHttpURL());
var url = urlFull.getHost();
var path = "/icons/vwicn";
// var idx = rowData.getColumnValues().get(1); Removed for testing
var idx = "82.0"; // Hard-coded the value for testing
if (idx < 10) {
    path += ("00" + idx).left(3);
} else if (idx < 100) {
    path += ("0" + idx).left(3);
} else {
    path += idx.left(3);
}
path += ".gif";
// path = "/icons/vwicn082.gif"; I have also tried hard-coding the path value - still a no go
url = setPath(path);
url.removeAllParameters();
return url.toString();
The view panel is configured as xp:viewPanel rows="40" id="viewPanel1" var="rowData".
Any suggestions on what to look for or a better option to compute a view panel icon would be appreciated.
Cheers!!!
You have a typo: url = setPath(path); should be url.setPath(path); (there is also a case mismatch on the first line, new XSPURL versus the XSPUrl class, which may throw as well).

Python webdriver switch from main window to popup screen (not Java alert) and login

Here is the window where I need to enter a new password, repeat it, and click 'Create'.
My code so far:
createLogin = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="Item.MessageUniqueBody"]/div/div/div/div/div[2]/div[2]/a')))
createLogin.click()
time.sleep(10)
try:
    newPassword = self.driver.find_elements_by_xpath('//*[@id="editNewUser_newPassword"]')
    newPassword1 = self.driver.find_elements_by_xpath('//*[@id="editNewUser_newPasswordRepeat"]')
    newPasswordForm = self.driver.find_elements_by_xpath('//*[@id="editNewUserPasswordForm"]/table/tbody/tr[1]/td[1]')
    self.driver.switch_to.active_element(newPasswordForm)
    time.sleep(3)
    newPassword.send_keys('123')
    newPassword1.send_keys('123')
    time.sleep(2)
    # createLog = wait.until(
    #     EC.presence_of_element_located((By.XPATH, '//*[@id="editNewUserPassword_save"]')))
    # createLog.click()
    # time.sleep(5)
except NoAlertPresentException as e:
    time.sleep(2)
myAccount = wait.until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="easMyAccount1"]')))
myAccount.click()
time.sleep(5)
This is the problem: you are using find_elements_by_xpath rather than find_element_by_xpath,
plural vs. singular.
find_elements_by_xpath: returns a list of all web elements matching the locator.
find_element_by_xpath: returns the first web element matching the locator.
newPassword = self.driver.find_element_by_xpath('//*[@id="editNewUser_newPassword"]')
newPassword1 = self.driver.find_element_by_xpath('//*[@id="editNewUser_newPasswordRepeat"]')
newPasswordForm = self.driver.find_element_by_xpath('//*[@id="editNewUserPasswordForm"]/table/tbody/tr[1]/td[1]')
@gauurang's answer is right: you have to use find_element_by_xpath. Also, as your XPath shows, you have an id for locating the web elements, so it is always better to use the id over the XPath.
Your XPaths are also correct:
newPassword = self.driver.find_element_by_id('editNewUser_newPassword')
newPassword1 = self.driver.find_element_by_id('editNewUser_newPasswordRepeat')
newPasswordForm = self.driver.find_element_by_xpath('//*[@id="editNewUserPasswordForm"]/table/tbody/tr[1]/td[1]')
