Is there a way to get a LinkedIn post's URL in the format below from the website:
https://www.linkedin.com/feed/update/urn:li:activity:xxxxxxxxxxxxxxxxxxx/
I found a hacky way: it turns out the URN appears in the page source. Here is an example of how I got it:
>>> import requests
>>> post_link = 'https://www.linkedin.com/posts/activity-6453360792111718400-0j6d'
>>> response = requests.get(post_link).text
>>> urn_index = response.index('urn:li:activity:')
>>> finish_index = response.index('"', urn_index)
>>> activity_urn = response[urn_index:finish_index]
>>> print(f'https://www.linkedin.com/feed/update/{activity_urn}')
https://www.linkedin.com/feed/update/urn:li:activity:6453360792111718400
I don't think this is a stable method, but I couldn't find any other way with the LinkedIn API.
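The slicing above raises a ValueError when the marker is missing from the page. A slightly more defensive variant, as a minimal sketch, uses a regular expression instead. The helper name `feed_url_from_html` and the sample HTML fragment are hypothetical, and the sketch assumes the URN is followed by digits only, as in the example above.

```python
import re

# Assumption: an activity URN is "urn:li:activity:" followed by digits only.
URN_PATTERN = re.compile(r'urn:li:activity:\d+')


def feed_url_from_html(html):
    """Return the feed URL for the first activity URN in html, or None."""
    match = URN_PATTERN.search(html)
    if match is None:
        return None
    return f'https://www.linkedin.com/feed/update/{match.group()}'


# Hypothetical page-source fragment, for illustration only.
sample = '<meta content="urn:li:activity:6453360792111718400">'
print(feed_url_from_html(sample))
```

Returning None instead of raising lets the caller decide what to do when LinkedIn serves a login wall or changes the markup.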
Link of the website: https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2
How can I get the location, job type and salary details from the website?
Can you please help me locate these details in the HTML using BeautifulSoup?
The site uses a backend API to deliver the info. If you open your browser's Developer Tools, go to Network > Fetch/XHR and refresh the page, you'll see the data load as JSON from a request whose URL is similar to the one you posted.
So if we edit your URL to match the backend API URL, we can request it and parse the JSON. Unfortunately the pay amount is buried in some HTML within the JSON, so we have to get it out with BeautifulSoup and a bit of regex to match the £###,### pattern.
import requests
from bs4 import BeautifulSoup
import re
url = 'https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2'
search = 'https://awg.wd3.myworkdayjobs.com/wday/cxs/awg/AW' + url.split('/AW')[-1]  # API endpoint from Developer Tools
data = requests.get(search).json()
posted = data['jobPostingInfo']['startDate']
location = data['jobPostingInfo']['location']
title = data['jobPostingInfo']['title']
desc = data['jobPostingInfo']['jobDescription']
soup = BeautifulSoup(desc,'html.parser')
pay_text = soup.text
sterling = [x[0] for x in re.findall(r'(£[0-9]+(,[0-9]+)?)', pay_text)][0]  # first £###,### amount in the text
final = {
    'title': title,
    'posted': posted,
    'location': location,
    'pay': sterling,
}
print(final)
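The two fragile steps above (rewriting the URL and picking out the pay) can be isolated into small helpers. This is only a sketch: the names `to_api_url` and `find_sterling` are hypothetical, and the URL rule assumes the tenant is the first host label (`awg`) and the site is the first path segment (`AW`), which is what the single URL in this thread suggests.

```python
import re
from urllib.parse import urlsplit


def to_api_url(job_url):
    """Rewrite a public job URL into the backend JSON endpoint.

    Assumption: tenant = first host label, site = first path segment.
    """
    parts = urlsplit(job_url)
    tenant = parts.netloc.split('.')[0]
    site, _, rest = parts.path.lstrip('/').partition('/')
    return f'{parts.scheme}://{parts.netloc}/wday/cxs/{tenant}/{site}/{rest}'


def find_sterling(text):
    """Return every £-prefixed amount in text (empty list if none)."""
    return re.findall(r'£[0-9]+(?:,[0-9]+)*', text)


url = 'https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2'
print(to_api_url(url))
print(find_sterling('Salary: £30,000 to £35,500 per annum'))
```

Because `find_sterling` returns a list, a posting without a salary gives an empty list rather than the IndexError that indexing with `[0]` would raise.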
I am trying to scrape the URL of every company that has posted a job offer on this website:
https://jobs.workable.com/
I want to pull the info to generate some stats about this website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful, but I am open to any other approach.
Thank you in advance.
This is just pseudocode to give you an idea of what you are looking for.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='
page_ids = ['0', '10', '20', '30', '40', '50']  # can also be generated dynamically; hard-coded here

for page_id in page_ids:
    if page_id == '0':
        # initial page: requested without an offset parameter
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        # add parsing logic here
    else:
        final_url = base_url + page_id
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        # add parsing logic here
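The offsets do not have to be hard-coded. As a rough sketch, the page URLs can be generated from a page count; the helper name `page_urls` and the page size of 10 are assumptions taken from the offsets listed above, not from any documented API.

```python
from urllib.parse import urlencode

API_URL = 'https://job-board-v3.workable.com/api/v1/jobs'


def page_urls(pages, page_size=10):
    """Build the request URL for each of the first `pages` result pages."""
    urls = []
    for page in range(pages):
        params = {'query': '', 'orderBy': 'postingUpdateTime desc'}
        offset = page * page_size
        if offset:
            # the first page is requested without an offset, as above
            params['offset'] = offset
        urls.append(API_URL + '?' + urlencode(params))
    return urls


for u in page_urls(3):
    print(u)
```

Each URL can then be passed to `requests.get(u, headers=headers)` in place of the if/else branch.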
I am trying to use Python to find the final redirected URL for a URL. I tried various solutions from Stack Overflow answers, but nothing worked for me: I only get the original URL back.
To be specific, I tried the requests, urllib2 and urlparse libraries, and none of them worked as expected. Here are some of the snippets I tried:
Solution 1:
import requests
s = requests.session()
r = s.post('https://www.boots.com/search/10055096', allow_redirects=True)
print(r.history)
print(r.history[1].url)
Result:
[<Response [301]>, <Response [302]>]
https://www.boots.com/search/10055096
Solution 2:
import urlparse
url = 'https://www.boots.com/search/10055096'
try:
    out = urlparse.parse_qs(urlparse.urlparse(url).query)['out'][0]
    print(out)
except Exception as e:
    print('not found')
Result:
not found
Solution 3:
import urllib2
def get_redirected_url(url):
    opener = urllib2.build_opener(urllib2.HTTPRedirectHandler)
    request = opener.open(url)
    return request.url
print(get_redirected_url('https://www.boots.com/search/10055096'))
Result:
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
The expected URL below is the final redirected page, and that is what I want to return.
Original URL: https://www.boots.com/search/10055096
Expected URL: https://www.boots.com/gillette-fusion5-razor-blades-4pk-10055096
Solution #1 was the closest. At least it returned two responses, but the second response wasn't the final page; judging by its content, it seems to be a loading page.
The first request returns an HTML file containing JavaScript that updates the page, and requests does not execute JavaScript. You can find the updated link with:
import requests
from bs4 import BeautifulSoup
import re
r = requests.get('https://www.boots.com/search/10055096')
soup = BeautifulSoup(r.content,'html.parser')
reg = soup.find('input',id='searchBoxText').findNext('script').contents[0]
print(re.search(r'ht[\w\://\.-]+', reg).group())
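Note that `re.search(...).group()` raises AttributeError when the script contains no link. A small sketch of a more forgiving extraction step follows; the helper name `first_url_in_script` and the sample script text are hypothetical (the real script on the page may look different), and the pattern is a slightly stricter variant of the one above.

```python
import re

# Matches http(s) URLs made of word characters, dots, slashes, colons and hyphens.
URL_PATTERN = re.compile(r'https?://[\w./:-]+')


def first_url_in_script(script_text):
    """Return the first URL found in a script body, or None."""
    match = URL_PATTERN.search(script_text)
    return match.group() if match else None


# Hypothetical script content, for illustration only.
sample_script = 'window.location.href = "https://www.boots.com/gillette-fusion5-razor-blades-4pk-10055096";'
print(first_url_in_script(sample_script))
```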
I am using the Python client for the Google Cloud Vision API, with basically the same code as in the documentation: http://google-cloud-python.readthedocs.io/en/latest/vision/
>>> from google.cloud import vision
>>> client = vision.ImageAnnotatorClient()
>>> response = client.annotate_image({
... 'image': {'source': {'image_uri': 'gs://my-test-bucket/image.jpg'}},
... 'features': [{'type': vision.enums.Feature.Type.FACE_DETECTION}],
... })
The problem is that the response doesn't have an "annotations" field (as shown in the documentation); instead, it has a field for each "type". So when I try to get response.face_annotations I get
and basically I don't know how to extract the result from the Vision API response (AnnotateImageResponse) as JSON/dictionary-like data.
The version of google-cloud-vision is 0.25.1, and it was installed as part of the full google-cloud library (pip install google-cloud).
I think today is not my day
I appreciate any clarification / help
Hm. It is a bit tricky, but the API is pretty great overall. You can actually directly call the face detection interface, and it'll spit back exactly what you want - a dictionary with all the info.
from google.cloud import vision
from google.cloud.vision import types
img = 'YOUR_IMAGE_URL'
client = vision.ImageAnnotatorClient()
image = vision.types.Image()
image.source.image_uri = img
faces = client.face_detection(image=image).face_annotations
print(faces)
The answers above didn't help me, because the library's actual behavior has drifted from what the documentation describes.
The Vision response is not JSON; it is a custom class type designed for Vision calls.
After much research, I came up with this solution, and it works:
Convert the response to a protobuf message and then to JSON; after that, extracting values is simple.
import json

from google.protobuf.json_format import MessageToJson


def json_to_hash_dump(vision_response):
    """
    Convert a Vision API response to a dict by serializing
    its underlying protobuf message to JSON.

    Args:
        vision_response
    Returns:
        dict parsed from the JSON
    """
    json_obj = MessageToJson(vision_response._pb)
    # parse the JSON string into a dict
    return json.loads(json_obj)
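Once the response is a plain dict, pulling out values is ordinary dictionary access. The sketch below assumes the default JSON conversion, which renders field names in lowerCamelCase (for example `faceAnnotations`, `joyLikelihood`); the helper name `face_summaries` and the sample dict are hypothetical, so check the keys against your own output.

```python
def face_summaries(response_dict):
    """Pick a few fields from each face in a converted Vision response."""
    summaries = []
    # Assumption: keys are lowerCamelCase, as produced by the JSON conversion.
    for face in response_dict.get('faceAnnotations', []):
        summaries.append({
            'confidence': face.get('detectionConfidence'),
            'joy': face.get('joyLikelihood'),
        })
    return summaries


# Hypothetical converted response, for illustration only.
sample = {'faceAnnotations': [{'detectionConfidence': 0.98, 'joyLikelihood': 'VERY_LIKELY'}]}
print(face_summaries(sample))
```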
An alternative is to use the Google API Python client; an example is here: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/vision/api/label/label.py
I am trying to get the product rating information from target.com. The URL for the product is
http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty
After looking through response.body, I found that the rating information is not statically loaded, so I need to get it another way. Similar questions say that in order to get dynamic data, I need to:
1. Find the correct XHR and where to send the request
2. Use FormRequest to get the right JSON
3. Parse the JSON
(If I am wrong about the steps, please tell me.)
I am stuck at step 2 right now. I found that an XHR named 15258543 contains the rating distribution, but I don't know how to send a request to get the JSON: where to send it and with what parameters.
Can someone walk me through this?
Thank you!
The trickiest thing is to get that 15258543 product ID dynamically and then use it inside the URL to get the reviews. This product ID can be found in multiple places on the product page, for instance, there is a meta element that we can use:
<meta itemprop="productID" content="15258543">
Here is a working spider that makes a separate GET request to get the reviews, loads the JSON response via json.loads() and prints the overall product rating:
import json
import scrapy
class TargetSpider(scrapy.Spider):
    name = "target"
    allowed_domains = ["target.com"]
    start_urls = ["http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty"]

    def parse(self, response):
        # the product ID is exposed in a <meta itemprop="productID"> element
        product_id = response.xpath("//meta[@itemprop='productID']/@content").extract_first()
        return scrapy.Request("http://tws.target.com/productservice/services/reviews/v1/reviewstats/" + product_id,
                              callback=self.parse_ratings,
                              meta={"product_id": product_id})

    def parse_ratings(self, response):
        data = json.loads(response.body)
        print(data["result"][response.meta["product_id"]]["coreStats"]["AverageOverallRating"])
Prints 4.5585.
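The chained lookups in parse_ratings raise KeyError if the product has no review stats. As a minimal sketch, the parsing step can be pulled into a helper that returns None instead; the name `average_rating` is hypothetical, and the sample payload only mirrors the keys used above rather than a real response.

```python
import json


def average_rating(body, product_id):
    """Return AverageOverallRating for product_id, or None if it is absent."""
    data = json.loads(body)
    stats = data.get('result', {}).get(product_id, {}).get('coreStats', {})
    return stats.get('AverageOverallRating')


# Hypothetical payload shaped like the keys read in parse_ratings.
sample_body = json.dumps(
    {'result': {'15258543': {'coreStats': {'AverageOverallRating': 4.5585}}}}
)
print(average_rating(sample_body, '15258543'))
```

Inside the spider this could be called as `average_rating(response.body, response.meta["product_id"])`.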