previous web crawler doesn't recognize element id - css

I am new to the web crawling task. Previously I tried the following simple crawler, and it worked well.
Recently I came back to the code and tried to do more with the crawler, but browser.find_element_by_id("lst-ib") no longer works and I receive an error that says

no such element: Unable to locate element: {"method":"css selector","selector":"[id="lst-ib"]"}
(Session info: chrome=84.0.4147.89)

To work around it, I used the browser's Inspect tool to find an XPath for the Google search input box. Is it always like this? Do the ids and CSS selectors we target in a crawler change regularly, so that we have to keep updating the code?
from selenium import webdriver
url = "https://www.google.com"
browser = webdriver.Chrome(executable_path = "chromedriver")
browser.get(url)
#inputElement = browser.find_element_by_id("lst-ib")
# I replaced the previous id lookup with this xpath
inputElement = browser.find_element_by_xpath("/html/body/div/div[2]/form/div[2]/div[1]/div[1]/div/div[2]/input")
inputElement.send_keys("my input search text")
inputElement.submit()
browser.quit()

Try the XPath below:
inputElement = browser.find_element_by_xpath("//body[@id='gsr']/div[@id='viewport']/div[@id='searchform']/form[@id='tsf']/div/div/div/div/div/input[1]")
inputElement.send_keys("my input search text")
Another solution:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Chrome(executable_path=r"path of chrome driver")
wait = WebDriverWait(driver, 10)
driver.get("https://www.google.com")
inputElement = WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.XPATH, "/html/body/div/div[2]/form/div[2]/div[1]/div[1]/div/div[2]/input")))
inputElement.send_keys("my input search text")
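A side note on the original question: the lst-ib id is no longer present in Google's markup (which is why the original lookup fails), so any locator tied to it will keep breaking. A locator based on the input's name attribute tends to be more stable; here is a minimal sketch, assuming the search box still carries name="q" (that attribute could of course change as well):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path="chromedriver")
driver.get("https://www.google.com")
# wait until the box is clickable, then locate it by its name attribute
box = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.NAME, "q")))
box.send_keys("my input search text")
box.submit()
driver.quit()

So, to answer the follow-up question: yes, ids and auto-generated class names do change between page revisions, and locators based on more stable attributes (name, aria-label, data-* attributes) combined with explicit waits usually need less maintenance.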

Related

How to use wtforms.po with Flask-WTF

from wtforms.fields.simple import TextField, PasswordField
from wtforms import validators
from wtforms.ext.i18n.form import Form
class BaseForm(Form):
    LANGUAGES = ['zh']

class LoginForm(BaseForm):
    username = TextField("Username", [validators.Required()])
    psw = PasswordField("Password", [validators.Required()])
The above code works fine, the form prompt message can be translated to Chinese.
The problem I have is how to do the same with Flask-WTF instead of plain wtforms.
I tried:
from wtforms import validators
from flask.ext.wtf import Form
from wtforms.fields.simple import TextField, PasswordField
class BaseForm(Form):
    LANGUAGES = ['zh']

class LoginForm(BaseForm):
    username = TextField("Username", [validators.Required()])
    psw = PasswordField("Password", [validators.Required()])
The prompt messages are still in English. Could someone give me some advice? Thanks.
Fixed!!
http://pythonhosted.org/Flask-Babel/
add the following code to your app script
from flask.ext.babel import Babel
babel = Babel(app)
app.config['BABEL_DEFAULT_LOCALE'] = 'zh_Hans_CN'
create babel.cfg next to your application:
[python: **.py]
[jinja2: **/templates/**.html]
extensions=jinja2.ext.autoescape,jinja2.ext.with_
Copy wtforms.po / wtforms.mo into the folder "translations/zh_Hans_CN/LC_MESSAGES" (created by Flask-Babel), which sits next to your application.
Then you don't need to touch anything in Flask-WTF; it works with Flask-Babel automatically.
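To make the moving parts easier to see, here is a minimal sketch of how the pieces from this answer fit together. The file names other than babel.cfg and the wtforms.po/.mo catalogue are illustrative, and the flask.ext.* import style matches the era of this thread (current Flask uses flask_babel / flask_wtf instead):

# app.py -- assumed layout next to the application:
#   app.py
#   babel.cfg
#   translations/zh_Hans_CN/LC_MESSAGES/wtforms.po
#   translations/zh_Hans_CN/LC_MESSAGES/wtforms.mo
from flask import Flask
from flask.ext.babel import Babel
from flask.ext.wtf import Form
from wtforms import validators
from wtforms.fields.simple import TextField, PasswordField

app = Flask(__name__)
app.config['BABEL_DEFAULT_LOCALE'] = 'zh_Hans_CN'
babel = Babel(app)

class LoginForm(Form):
    username = TextField("Username", [validators.Required()])
    psw = PasswordField("Password", [validators.Required()])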

Replacing firefox driver with htmlunit driver

Hi, can anyone please tell me how to run this sample program with HtmlUnitDriver instead of FirefoxDriver?
The code below runs successfully with FirefoxDriver but fails with HtmlUnitDriver, throwing
org.openqa.selenium.NoSuchElementException: Unable to locate a node using .//*[contains(concat(' ',normalize-space(@class),' '),' gssb_e ')]
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class GoogleSuggest
{
    public static void main(String[] args) throws Exception
    {
        WebDriver driver = new FirefoxDriver();
        driver.get("http://www.google.com/webhp?complete=1&hl=en");
        WebElement query = driver.findElement(By.name("q"));
        query.sendKeys("Cheese");
        long end = System.currentTimeMillis() + 50000;
        while (System.currentTimeMillis() < end)
        {
            WebElement resultsDiv = driver.findElement(By.className("gssb_e"));
            if (resultsDiv.isDisplayed())
            {
                break;
            }
        }
        List<WebElement> allSuggestions =
            driver.findElements(By.xpath("//td[@class='gssb_a gbqfsf']"));
        for (WebElement suggestion : allSuggestions)
        {
            System.out.println(suggestion.getText());
        }
    }
}
Please tell me how to do this with HtmlUnitDriver (I am a complete beginner) and explain the reason for the difference. I would be glad if someone could post the same code adapted to HtmlUnitDriver, and also tell me how to get rid of the DefaultCssError output that appears with HtmlUnitDriver, which again was not a problem with FirefoxDriver.
My main intention is to run the whole process in the background, without opening a browser, so that everything stays invisible.
Any help with this would be appreciated.
HtmlUnitDriver only matches lowercase tag and attribute names.
Example:
Html
<input type="text" name="example" >
<INPUT type="text" name="other" >
// webdriver code
driver.findElements(By.xpath("//input"));
In the HtmlUnit case it will find only one element (name="example").
In the FirefoxDriver case it will find 2 elements.
Hope this helps you debug the code.
HtmlUnitDriver is not FirefoxDriver:
"If you test javascript using HtmlUnit the results may differ significantly from those browsers"
Take a look here
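Since the stated goal is to run everything without a visible browser, here is a minimal sketch of the HtmlUnit setup (a sketch only, not tested against the exact versions in this thread; the logger package name is an assumption based on where HtmlUnit's DefaultCssErrorHandler lives). The suggestion box is built by JavaScript, so the driver has to be created with JavaScript enabled, and the noisy CSS warnings can be silenced by turning down HtmlUnit's loggers.

import java.util.logging.Level;
import java.util.logging.Logger;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class GoogleSuggestHeadless
{
    public static void main(String[] args)
    {
        // silence HtmlUnit's CSS/JS warning output (assumed logger package)
        Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
        // true = enable the embedded JavaScript engine (it is off by default)
        WebDriver driver = new HtmlUnitDriver(true);
        driver.get("http://www.google.com/webhp?complete=1&hl=en");
        // ...same findElement / sendKeys / polling logic as the Firefox version...
        driver.quit();
    }
}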

Need a cq5 example

I am new to Adobe CQ5. I have gone through many online blogs and tutorials but could not get much out of them. Can anyone provide an Adobe CQ5 application example, with a detailed explanation, that can store and retrieve data in the JCR?
Thanks in advance.
Here's a snippet for CQ 5.4 to get you started. It inserts a content page and text (as a parsys) at an arbitrary position in the content hierarchy. The position is supplied by a workflow payload, but you could write something that runs from the command line and use any valid CRX path instead. The advantage of making it a process step is that you get a session established for you, and the navigation to the insert point has been taken care of.
import java.text.SimpleDateFormat;
import java.util.Date;
import javax.jcr.Node;
import javax.jcr.RepositoryException;
import org.apache.sling.jcr.resource.JcrResourceConstants;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Properties;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;
import org.osgi.framework.Constants;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.day.cq.workflow.WorkflowException;
import com.day.cq.workflow.WorkflowSession;
import com.day.cq.workflow.exec.WorkItem;
import com.day.cq.workflow.exec.WorkflowData;
import com.day.cq.workflow.exec.WorkflowProcess;
import com.day.cq.workflow.metadata.MetaDataMap;
import com.day.cq.wcm.api.NameConstants;
@Component
@Service
@Properties({
    @Property(name = Constants.SERVICE_DESCRIPTION,
        value = "Makes a new tree of nodes, subordinate to the payload node, from the content of a file."),
    @Property(name = Constants.SERVICE_VENDOR, value = "Acme Coders, LLC"),
    @Property(name = "process.label", value = "Make new nodes from file")})
public class PageNodesFromFile implements WorkflowProcess {

    private static final Logger log = LoggerFactory.getLogger(PageNodesFromFile.class);
    private static final String TYPE_JCR_PATH = "JCR_PATH";

    // * * *

    public void execute(WorkItem workItem, WorkflowSession workflowSession, MetaDataMap args)
            throws WorkflowException {

        //get the payload
        WorkflowData workflowData = workItem.getWorkflowData();
        if (!workflowData.getPayloadType().equals(TYPE_JCR_PATH)) {
            log.warn("unusable workflow payload type: " + workflowData.getPayloadType());
            workflowSession.terminateWorkflow(workItem.getWorkflow());
            return;
        }
        String payloadString = workflowData.getPayload().toString();

        //the text to be inserted
        String lipsum = "Lorem ipsum...";

        //set up some node info
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("d-MMM-yyyy-HH-mm-ss");
        String newRootNodeName = "demo-page-" + simpleDateFormat.format(new Date());
        SimpleDateFormat simpleDateFormatSpaces = new SimpleDateFormat("d MMM yyyy HH:mm:ss");
        String newRootNodeTitle = "Demo page: " + simpleDateFormatSpaces.format(new Date());

        //insert the nodes
        try {
            Node parentNode = (Node) workflowSession.getSession().getItem(payloadString);
            Node pageNode = parentNode.addNode(newRootNodeName);
            pageNode.setPrimaryType(NameConstants.NT_PAGE); //cq:Page
            Node contentNode = pageNode.addNode(Node.JCR_CONTENT); //jcr:content
            contentNode.setPrimaryType("cq:PageContent"); //or use MigrationConstants.TYPE_CQ_PAGE_CONTENT
                                                          //from com.day.cq.compat.migration
            contentNode.setProperty(javax.jcr.Property.JCR_TITLE, newRootNodeTitle); //jcr:title
            contentNode.setProperty(NameConstants.PN_TEMPLATE,
                "/apps/geometrixx/templates/contentpage"); //cq:template
            contentNode.setProperty(JcrResourceConstants.SLING_RESOURCE_TYPE_PROPERTY,
                "geometrixx/components/contentpage"); //sling:resourceType
            Node parsysNode = contentNode.addNode("par");
            parsysNode.setProperty(JcrResourceConstants.SLING_RESOURCE_TYPE_PROPERTY,
                "foundation/components/parsys");
            Node textNode = parsysNode.addNode("text");
            textNode.setProperty(JcrResourceConstants.SLING_RESOURCE_TYPE_PROPERTY,
                "foundation/components/text");
            textNode.setProperty("text", lipsum);
            textNode.setProperty("textIsRich", true);
            workflowSession.getSession().save();
        }
        catch (RepositoryException e) {
            log.error(e.toString(), e);
            workflowSession.terminateWorkflow(workItem.getWorkflow());
            return;
        }
    }
}
I have posted further details and discussion.
A few other points:
I incorporated a timestamp into the name and title of the content page to be inserted. That way, you can run many code-and-test cycles without cleaning up your repository, and you know which test was the most recently run. Added bonus: no duplicate file names, no ambiguity.

Adobe and Day have been inconsistent about providing constants for property values, node types, and suchlike. I used the constants that I could find, and used literal strings elsewhere.

I did not fill in properties like the last-modified date. In code for production I would do so.

I found myself confused by Node.setPrimaryType() and Node.getPrimaryNodeType(). The two methods are only rough complements; the setter takes a string but the getter returns a NodeType with various info inside it.

In my original version of this code, I read the text to be inserted from a file, rather than just using the static string "Lorem ipsum...".

Once you've worked through this example, you should be able to use the Adobe docs to write code that reads data back from the CRX.
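As a pointer in that direction, here is a hedged sketch (not from the original answer) of reading the inserted text back with the plain JCR API; the path segments simply mirror the nodes created above, and the Session would come from whatever context you run in (a workflow step, a ResourceResolver, and so on).

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class ReadTextBack
{
    // pagePath is the cq:Page node created by the workflow step above
    public static String readText(Session session, String pagePath) throws RepositoryException
    {
        Node textNode = session.getNode(pagePath + "/jcr:content/par/text");
        return textNode.getProperty("text").getString();
    }
}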
If you want to learn how to write a CQ application that can store and query data from the CQ JCR, see this article:
http://scottsdigitalcommunity.blogspot.ca/2013/02/querying-adobe-experience-manager-data.html
This provides a step-by-step guide and walks you right through the entire process, including building the OSGi bundle using Maven.
From the comments above, I see a reference to a BND file. You should stay away from CRXDE for creating OSGi bundles and use Maven instead.

Scrapy - parsing all sub-pages of a given domain

I would like to parse kickstarter.com projects using scrapy, but can't figure out how to make the spider search projects that I don't explicitly specify under start_urls. I have the first part of the scrapy code figured out (I can extract the necessary information from one website), I just can't get it to do this for all projects under the domain kickstarter.com/projects.
From what I've read, I believe that parsing is possible (1) using links on the starting page (kickstarter.com/projects), (2) using links from one project page to jump to another project, and (3) using a site map (which I don't think kickstarter.com has) to locate webpages to parse.
I've spent hours trying each of these methods but I am getting nowhere.
I've used the scrapy tutorial code and built on it.
Here is the part so far that works:
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import kickstarteritem
class kickstarter(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ["http://www.kickstarter.com/projects/brucegoldwell/dragon-keepers-book-iv-fantasy-mystery-magic"]

    def parse(self, response):
        x = HtmlXPathSelector(response)
        item = kickstarteritem()
        item['url'] = response.url
        item['name'] = x.select("//div[@class='NS-project_-running_board']/h2[@id='title']/a/text()").extract()
        item['launched'] = x.select("//li[@class='posted']/text()").extract()
        item['ended'] = x.select("//li[@class='ends']/text()").extract()
        item['backers'] = x.select("//span[@class='count']/data[@data-format='number']/@data-value").extract()
        item['pledge'] = x.select("//div[@class='num']/@data-pledged").extract()
        item['goal'] = x.select("//div[@class='num']/@data-goal").extract()
        return item
Since you're subclassing CrawlSpider, do not override parse. CrawlSpider's link crawling logic is contained within parse, which you really need.
As for the crawling itself, that's what the rules class attribute is for. I haven't tested it, but it should work:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector
from tutorial.items import kickstarteritem

class kickstarter(CrawlSpider):
    name = 'kickstarter'
    allowed_domains = ['kickstarter.com']
    start_urls = ['http://www.kickstarter.com/discover/recently-launched']

    rules = (
        Rule(
            SgmlLinkExtractor(allow=r'\?page=\d+'),
            follow=True
        ),
        Rule(
            SgmlLinkExtractor(allow=r'/projects/'),
            callback='parse_item'
        )
    )

    def parse_item(self, response):
        xpath = HtmlXPathSelector(response)
        loader = XPathItemLoader(item=kickstarteritem(), response=response)
        loader.add_value('url', response.url)
        loader.add_xpath('name', '//div[@class="NS-project_-running_board"]/h2[@id="title"]/a/text()')
        loader.add_xpath('launched', '//li[@class="posted"]/text()')
        loader.add_xpath('ended', '//li[@class="ends"]/text()')
        loader.add_xpath('backers', '//span[@class="count"]/data[@data-format="number"]/@data-value')
        loader.add_xpath('pledge', '//div[@class="num"]/@data-pledged')
        loader.add_xpath('goal', '//div[@class="num"]/@data-goal')
        yield loader.load_item()
The spider crawls the pages of the recently launched projects.
Also, use yield instead of return. It's better to keep your spider's output a generator and it lets you yield multiple items/requests without making a list to hold them.
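To illustrate why the generator style pays off, here is a small sketch (the follow-up URL and the helpers load_project_item and parse_comments are purely illustrative, not part of the answer above): a single callback can emit an item and schedule further requests without ever building a list.

from scrapy.http import Request

def parse_item(self, response):
    # emit the scraped item for this page
    yield self.load_project_item(response)  # hypothetical helper that builds the item
    # ...and, in the same callback, schedule a follow-up request
    yield Request(response.url + "/comments", callback=self.parse_comments)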

Scala and html: download an image (*.jpg, etc) to Hard drive

I've got a Scala program that downloads and parses HTML. I got the links to the image files from the HTML; now I need to save those images to my hard drive. I'm wondering what the best Scala approach would be.
my connection code:
import java.net._
import java.io._
import _root_.java.io.Reader
import org.xml.sax.InputSource
import scala.xml._
def parse(sUrl: String) = {
  var url = new URL(sUrl)
  var connect = url.openConnection
  var sorce: InputSource = new InputSource
  var neo = new TagSoupFactoryAdapter //load sUrl
  var input = connect.getInputStream
  sorce.setByteStream(input)
  xml = neo.loadXML(sorce)
  input.close
}
Then you may want to take a look at java2s. Although the solution there is plain Java, you can still adapt it to Scala syntax and just use it.
An alternative option is to use the system commands via sys.process, which is much cleaner:
import sys.process._
import java.net.URL
import java.io.File
object Downloader {
  def start(location: String): Unit = {
    val url = new URL(location)
    var path = url match {
      case UrlyBurd(protocol, host, port, path) => (if (path == "") "/" else path)
    }
    path = path.substring(path.lastIndexOf("/") + 1)
    url #> new File(path) !!
  }
}

object UrlyBurd {
  def unapply(in: java.net.URL) = Some((
    in.getProtocol,
    in.getHost,
    in.getPort,
    in.getPath
  ))
}
One way to achieve that is to collect the URLs of the images and request each one from the server (open a new connection to the image URL and write the byte stream to the hard drive).
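A minimal sketch of that idea with plain java.net / java.io streams (the object and parameter names are illustrative, and the copy loop is just one way to move the bytes):

import java.net.URL
import java.io.{BufferedInputStream, BufferedOutputStream, FileOutputStream}

object ImageSaver {
  def save(imageUrl: String, targetFile: String): Unit = {
    val in = new BufferedInputStream(new URL(imageUrl).openStream())
    val out = new BufferedOutputStream(new FileOutputStream(targetFile))
    try {
      // copy the byte stream from the connection straight to the file
      Iterator.continually(in.read()).takeWhile(_ != -1).foreach(out.write)
    } finally {
      out.close()
      in.close()
    }
  }
}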
