How to take 2 values from links Scrapy LinkExtractor

I need to collect all the links from Amazon, starting from this page:
https://www.amazon.com/s?k=guess+case&crid=2Q25FH0FOTCA4&sprefix=guess+case%2Caps%2C215&ref=nb_sb_noss
But I only need Guess cases. The links must contain two values, "Guess" and "Phone". For example:
https://www.amazon.com/Guess-Scarlett-Collection-Hard-iPhone/dp/B00QTEP0B0/ref=sr_1_2?crid=2Q25FH0FOTCA4&keywords=guess+case&qid=1650550474&sprefix=guess+case%2Caps%2C215&sr=8-2
https://www.amazon.com/Guess-GUHCP13SPCUMABK-Marble-Collection-iPhone/dp/B09J94ZMZ3/ref=sr_1_3?crid=2Q25FH0FOTCA4&keywords=guess+case&qid=1650550474&sprefix=guess+case%2Caps%2C215&sr=8-3
How can I match these links with the help of the re library?
start_urls = ['https://www.amazon.com/s?k=guess+case&crid=2Q25FH0FOTCA4&sprefix=guess+case%2Caps%2C215&ref=nb_sb_noss/']
rules = [Rule(LinkExtractor(allow=r'???' , ))...

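Just use an if statement that tests each word separately (a bare 'guess' and 'phone' in url only checks 'phone') and ignores case, since the links contain "Guess" and "iPhone":
if 'guess' in url.lower() and 'phone' in url.lower():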
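Since the question asks for a regular expression, here is a minimal sketch of one way to require both words. It assumes, as the Scrapy documentation describes, that LinkExtractor's allow parameter takes a regular expression that the absolute URL must match. The names GUESS_PHONE and is_guess_phone_link are hypothetical, and the two sample URLs are shortened versions of the ones in the question.

```python
import re

# Hypothetical pattern: two case-insensitive lookaheads, so both words
# must appear somewhere in the URL, in either order.
GUESS_PHONE = re.compile(r"(?i)(?=.*guess)(?=.*phone)")


def is_guess_phone_link(url: str) -> bool:
    """Return True when the URL contains both 'guess' and 'phone'."""
    return GUESS_PHONE.search(url) is not None


# Shortened versions of the URLs from the question.
product = "https://www.amazon.com/Guess-Scarlett-Collection-Hard-iPhone/dp/B00QTEP0B0/ref=sr_1_2"
search = "https://www.amazon.com/s?k=guess+case&crid=2Q25FH0FOTCA4"

print(is_guess_phone_link(product))  # True: contains "Guess" and "iPhone"
print(is_guess_phone_link(search))   # False: contains "guess" but not "phone"
```

The same pattern string (GUESS_PHONE.pattern) could then be tried as the allow argument, for example Rule(LinkExtractor(allow=GUESS_PHONE.pattern), ...), though that has not been verified against Amazon's live pages.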

Related

Scraping: No attribute find_all for <p>

Good morning :)
I am trying to scrape the content of this website: https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=438653&CID=102313
The text I am trying to get seems to be located inside some <p> and separated by <br>.
For some reason, whenever I try to access a <p>, I get the following error: "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?", and this happens even if I use find() instead of find_all().
My code is below (it is a very simple thing with no loop yet, I just would like to identify where the mistake comes from):
from selenium import webdriver
import time
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='MYPATH/chromedriver',options=options)
url= "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("div", class_="col-sm-12")
people_in_column = column.find_all("p").find_all("br")
Is there anything obvious I am not understanding here?
Thanks a lot in advance for your help!

find_all() returns a ResultSet (a list of elements), so you cannot call find_all() or find() on its result without iterating over it first. Loop over the ResultSet and call find() on each element instead:
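columns = soup.find_all("div", class_="col-sm-12")  # ResultSet: a list of Tag objects
for column in columns:
    paragraph = column.find("p")  # first <p> in this column, or None
    if paragraph is not None:
        people_in_column = paragraph.get_text(strip=True)
        print(people_in_column)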
Full working code as an example:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
webdriver_service = Service("./chromedriver")  # Your chromedriver path
# Pass options as well, otherwise the headless setting is never applied
driver = webdriver.Chrome(service=webdriver_service, options=options)
url = "https://public.era.nih.gov/pubroster/preRosIndex.era?AGENDA=446096&CID=102313"
driver.maximize_window()
driver.implicitly_wait(5)  # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source
driver.quit()  # the browser is no longer needed once the HTML is captured
soup = BeautifulSoup(content, "html.parser")
columns = soup.find_all("div", class_="col-sm-12")
for column in columns:
    paragraph = column.find("p")  # first <p> in this column, or None
    if paragraph is not None:
        people_in_column = paragraph.get_text(strip=True)
        print(people_in_column)
Output:
Notice of NIH Policy to All Applicants:Meeting rosters are provided for information purposes only. Applicant investigators and institutional officials must not communicate directly with study section members about an application before or after the review. Failure to observe this policy will create a serious breach of integrity in the peer review process, and may lead to actions outlined inNOT-OD-22-044, including removal of the application from immediate
review.

Turn off search continuation results in python-ldap?

Using python-ldap.search_s() function (https://www.python-ldap.org/en/python-ldap-3.3.0/reference/ldap.html#ldap.LDAPObject.search_s) with params...
base = DC=myorg,DC=local
filterstr = (&(sAMAccountName={login})(|(memberOf=CN=zone1,OU=zones,OU=datagroups,DC=myorg,DC=local)(memberOf=CN=zone2,OU=zones,OU=datagroups,DC=myorg,DC=local)))
...to try to match against a specific AD user.
Yet when I look at the result returned (with login = myuser), I see something like:
[
(u'CN=zone1,OU=zones,OU=datagroups,DC=myorg,DC=local', {u'sAMAccountName': ['myuser']}),
(None, [u'ldap://DomainDnsZones.myorg.local/DC=DomainDnsZones,DC=myorg,DC=local']),
(None, [u'ldap://ForestDnsZones.myorg.local/DC=ForestDnsZones,DC=myorg,DC=local']),
(None, [u'ldap://myorg.local/CN=Configuration,DC=myorg,DC=local'])
]
where there are multiple other hits in the list (besides the myuser sAMAccountName match) that have nothing to do with the search filter.
Looking at the docs (https://www.python-ldap.org/en/python-ldap-3.3.0/faq.html) these appear to be "search continuations" / referrals that are included when the search base is at the domain level and it says that they can be turned off by including the code like...
l = ldap.initialize('ldap://foobar')
l.set_option(ldap.OPT_REFERRALS,0)
as well as trying
ldap.set_option(ldap.OPT_REFERRALS,0)
l = ldap.initialize('ldap://foobar')
...yet adding this code does not change the behavior at all and I get the same results (see https://www.python-ldap.org/en/python-ldap-3.3.0/reference/ldap.html?highlight=set_option#ldap.set_option).
Am I misunderstanding something here? Does anyone know how to stop these from showing up? Does anyone know the structure of the tuples that this function returns (the docs do not describe it)?

I just talked to someone more familiar with python-ldap and was told that OPT_REFERRALS controls whether the client automatically follows referrals; it does not stop AD from sending them.
For now, the only approach they recommended was to filter these values out with something like:
results = l.search_s(...)  # l is the connection object returned by ldap.initialize()
results = [x for x in results if x[0] is not None]  # drop referrals, whose DN is None
Noting that the structure of the results returned from search_s() is
[
    ( dn, {
        attrname: [ value, value, ... ],
        attrname: [ value, value, ... ],
    }),
]
When it's a referral, the DN is None and the entry dict is replaced with a list of URIs.
* (Note that in the search_s call you can request specific attributes to be returned in your search too)
* (Note that since my base DN is a domain level path, using the ldap.set_option(ldap.OPT_REFERRALS,0) snippet was still useful just to stop the search_s() from actually going down the referral paths (which was adding a few seconds to the search time))
Again, I believe that this problem is due to the base DN being a domain level path (unless there is some other base_dn or search.filter I could use for that fact that the group users are scattered across various AD paths in the domain that I'm missing).
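As a rough illustration of the filtering described above, the sketch below separates real entries from referrals using only the tuple structure. The function name split_search_results is hypothetical, and the sample list is copied from the shape of the result in the question, so no LDAP server or python-ldap import is needed to run it.

```python
from typing import Any, List, Tuple


def split_search_results(results: List[Tuple[Any, Any]]):
    """Separate real entries (DN is a string) from referrals (DN is None)."""
    entries = [(dn, attrs) for dn, attrs in results if dn is not None]
    # For a referral the second tuple element is a list of URIs, not a dict.
    referrals = [uri for dn, uris in results if dn is None for uri in uris]
    return entries, referrals


# Sample data shaped like the search_s() result shown in the question.
sample = [
    ("CN=zone1,OU=zones,OU=datagroups,DC=myorg,DC=local", {"sAMAccountName": ["myuser"]}),
    (None, ["ldap://DomainDnsZones.myorg.local/DC=DomainDnsZones,DC=myorg,DC=local"]),
    (None, ["ldap://ForestDnsZones.myorg.local/DC=ForestDnsZones,DC=myorg,DC=local"]),
    (None, ["ldap://myorg.local/CN=Configuration,DC=myorg,DC=local"]),
]

entries, referrals = split_search_results(sample)
print(len(entries), len(referrals))  # prints: 1 3
```

Keeping the referral URIs in a separate list, rather than discarding them, makes it easy to log them while debugging which naming contexts AD is pointing to.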

Randomizing URL numbers in iMacros

I'm using iMacros because I want to scrape a certain site for ID's which are used in the URL, after which I want to press a button.
I know you can't use Regular Expressions or globbing in the syntax for URL GOTO.
But I figured there might be a way to enter variables into the URL GOTO=?
Preferably, I wouldn't randomize the variable at all, but have it try every page from 1 to 99999.
This is what I currently have:
VERSION BUILD=8940826 RECORDER=FX
TAB T=1
SET !ERRORIGNORE YES
SET !VAR3 ("Math.floor(Math.random()*99999 + 1);")
URL GOTO=http://example.com/id/ "randomized_variable_here"
TAG POS=1 TYPE=SPAN ATTR=TXT:press<SP>button
I have tried a few things, but I don't seem to be able to do this.
I have very little experience actually creating stuff for myself, I just modify scripts to fit my purposes, but should I look towards an HTML document or something like that to randomize that variable for me?
Thanks in advance!

It's pretty simple to get the string with a randomized variable:
' ...
SET !VAR3 EVAL("Math.floor(Math.random()*99999 + 1);")
URL GOTO=http://example.com/id/{{!VAR3}}
' ...
And the following code is for looping through [1 - 'Max:' value on the 'iMacros' sidebar]:
' ...
SET !LOOP 1
URL GOTO=http://example.com/id/{{!LOOP}}
' ...
Just play this macro in loop mode.

Rails 4: how to identify and format links, hashtags and mentions in model attributes?

In my Rails 4 app, I have a Post model, with :copy and :short_copy as custom attributes (strings).
These attributes contain copy for social media (Facebook, Twitter, Instagram, Pinterest, etc.).
I display the content of these attributes in my Posts#Show view.
Currently, URLs, #hashtags and @mentions are formatted like the rest of the text.
What I would like to do is to format them in a different fashion, for instance in another color or in bold.
I found the twitter-text gem, which seems to offer such features, but my problem is that I do NOT need (and do NOT want) to have these URLs, #hashtags and @mentions turned into real links.
Indeed, it looks like the twitter-text gem converts URLs, #hashtags and @mentions into links by default with Twitter::Autolink, as explained in this Stack Overflow question.
That is not what I am looking for: I just want to update the style of my URLs, #hashtags and @mentions.
How can I do this in Ruby / Rails?
—————
UPDATE:
Following Wes Foster's answer, I implemented the following method in post.rb:
def highlight(string)
  string.gsub!(/\S*#(\[[^\]]+\]|\S+)/, '<span class="highlight">\1</span>')
end
Then, I defined the following CSS class:
.highlight {
  color: #337ab7;
}
Last, I implemented <%= highlight(post.copy) %> in the desired view.
I now get the following error:
ArgumentError
wrong number of arguments (1 for 2..3)
<td><%= highlight(post.copy) %></td>
What am I doing wrong?
—————

I'm sure each of the following regex patterns could be improved to match even more options, however, the following code works for me:
def highlight_url(str)
  str.gsub!(/(https?:\/\/[\S]+)/, '[\1]')
end
def highlight_hashtag(str)
  str.gsub!(/\S*#(\[[^\]]+\]|\S+)/, '[#\1]')
end
def highlight_mention(str)
  str.gsub!(/\B(\@[a-z0-9_-]+)/i, '[\1]')
end
# Initial string
str = "Myself and @doggirl bought a new car: http://carpictures.com #nomoremoney"
# Pass through each
highlight_mention(str)
highlight_hashtag(str)
highlight_url(str)
puts str # > Myself and [@doggirl] bought a new car: [http://carpictures.com] [#nomoremoney]
In this example, I've wrapped the matches with brackets []. You should use a span tag and style it. Also, you can wrap all three gsub! into a single method for simplicity.
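Updated for the asker's add-on error question
It looks like the error refers to a different method named highlight: Rails already ships a highlight view helper that expects a text and a phrase, which would explain "wrong number of arguments (1 for 2..3)". Try renaming the method from highlight to new_highlight to see if that fixes the new problem.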

Meteor - Passing a jade helper into a helper function

I'm trying to populate a list with a dataset and set the selected option with a helper function that compares the current data with another object's data (the 2 objects are linked)
I made the same type of list population with static variables:
Jade-
select(name='status')
  option(value='Newly Acquired' selected='{{isCurrentState "Newly Acquired"}}') Newly Acquired
  option(value='Currently In Use' selected='{{isCurrentState "Currently In Use"}}') Currently In Use
  option(value='Not In Use' selected='{{isCurrentState "Not In Use"}}') Not In Use
  option(value='In Storage' selected='{{isCurrentState "In Storage"}}') In Storage
Coffeescript-
"isCurrentState" : (state) ->
  return @status == state
This uses a helper isCurrentState to match a given parameter to the same object that my other code is linked to so I know that part works
The code I'm trying to get to work is :
Jade-
select.loca(name='location')
  each locations
    option(value='#{siteName}' selected='{{isCurrentLocation {{siteName}} }}') #{siteName}
Coffeescript-
"isCurrentLocation": (location) ->
  return @locate == location
All the other parts are functioning 100%, but the selected part is not
I've also tried changing the way I entered the selected='' part in a manner of ways such as:
selected='{{isCurrentLocation "#{siteName}" }}'
selected='{{isCurrentLocation "#{siteName} }}'
selected='{{isCurrentLocation {{siteName}} }}'
selected='#{isCurrentLocation "{{siteName}}" }'
selected='#{isCurrentLocation {{siteName}} }'
selected='#{isCurrentLocation #{siteName} }'
Is what I'm trying to do even possible?
Is there a better way of achieving this?
Any help would be greatly appreciated
UPDATE:
Thanks @david-weldon for the quick reply. I've tried this out a bit and realised that I wasn't exactly clear about what I was trying to accomplish in my question.
I have a template "update_building" created with a parameter (a building object) with a number of attributes, one of which is "locate".
Locations is another object with a number of attributes as well, one of which is "siteName". One of the siteName values equals locate, so I need to pass in the siteName from locations to match it against the current building's locate attribute.
Though it doesn't work in the context I want to use it in, it definitely pointed me in a direction I hadn't thought of. I am looking into passing the parent template's (the building's) data context as a parameter into the locations template and using it from within the locations template. This is easily done in normal HTML Spacebars with:
{{>locations parentDataContext/variable}}
Something like that in jade would easily solve this

Short answer
selected='{{isCurrentLocation siteName}}'
Long answer
You don't really need to pass the current location because the helper should know its own context. Here's a simple (tested) example:
jade
template(name='myTemplate')
  select.location(name='location')
    each locations
      option(value=this selected=isCurrentLocation) #{this}
coffee
LOCATIONS = [
  'Newly Acquired'
  'Currently In Use'
  'Not In Use'
  'In Storage'
]
Template.myTemplate.helpers
  locations: LOCATIONS
  isCurrentLocation: ->
    @toString() is Template.instance().location.get()
Template.myTemplate.onCreated ->
  @location = new ReactiveVar LOCATIONS[1]

I looked into the data contexts some more and ended up moving the options that populate the select into a separate template with its own helper. The helper accesses the parent template's data context to determine which location the building has saved, so that the matching option can be set to selected.
Jade-
template(name="location_building_option")
  option(value='#{siteName}' selected='{{isSelected}}') #{siteName}
Coffeescript -
Template.location_building_option.helpers
  'isSelected': ->
    parent = Template.parentData(1)
    buildSite = parent.locate
    return @siteName == buildSite
Thanks @david-weldon, your answer helped me immensely to head in the right direction.
