I am trying to scrape all the objects with the same tag from a specific site (Google Scholar) with BeautifulSoup, but it doesn't scrape the objects hidden behind the "Show more" button at the end of the page. How can I fix this?
Here's an example of my code:
# -*- coding: cp1253 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup

webpage = urlopen('http://scholar.google.gr/citations?user=FwuKA4UAAAAJ&hl=el')
soup = BeautifulSoup(webpage, 'html.parser')

# Print the title of every article link on the (first) page
for t in soup.find_all('a', {"class": "gsc_a_at"}):
    print(t.text)
You have to pass pagination parameters in the request URL:
cstart - Parameter defines the result offset. It skips the given number of results. It's used for pagination. (e.g., 0 (default) is the first page of results, 20 is the 2nd page of results, 40 is the 3rd page of results, etc.).
pagesize - Parameter defines the number of results to return. (e.g., 20 (default) returns 20 results, 40 returns 40 results, etc.). Maximum number of results to return is 100.
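If you want to stay with urllib and BeautifulSoup, you can fetch those extra pages yourself by looping over cstart. A minimal sketch, assuming the citations page still accepts cstart and pagesize as GET parameters and that article titles still use the gsc_a_at class (Google may change the markup, throttle, or CAPTCHA automated requests):
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

base = 'http://scholar.google.gr/citations'
params = {'user': 'FwuKA4UAAAAJ', 'hl': 'el', 'cstart': 0, 'pagesize': 100}
headers = {'User-Agent': 'Mozilla/5.0'}  # Scholar may reject the default urllib agent

while True:
    req = Request(base + '?' + urlencode(params), headers=headers)
    soup = BeautifulSoup(urlopen(req), 'html.parser')
    titles = [a.text for a in soup.find_all('a', {'class': 'gsc_a_at'})]
    if not titles:                            # paged past the last article
        break
    for title in titles:
        print(title)
    params['cstart'] += params['pagesize']    # next page offset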
You could also use a third party solution like SerpApi to do this for you. It's a paid API with a free trial.
Example Python code (also available in other languages) to retrieve the second page of results:
from serpapi import GoogleSearch
params = {
"engine": "google_scholar_author",
"hl": "en",
"author_id": "FwuKA4UAAAAJ",
"start": "20",
"api_key": "secret_api_key"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"articles": [
{
"title": "MuseumScrabble: Design of a mobile game for children’s interaction with a digitally augmented cultural space",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:RHpTSmoSYBkC",
"citation_id": "FwuKA4UAAAAJ:RHpTSmoSYBkC",
"authors": "C Sintoris, A Stoica, I Papadimitriou, N Yiannoutsou, V Komis, N Avouris",
"publication": "Social and organizational impacts of emerging mobile devices: Evaluating use …, 2012",
"cited_by": {
"value": 69,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=6286720977869955347",
"serpapi_link": "https://serpapi.com/search.json?cites=6286720977869955347&engine=google_scholar&hl=en",
"cites_id": "6286720977869955347"
},
"year": "2012"
},
{
"title": "The effective combination of hybrid usability methods in evaluating educational applications of ICT: Issues and challenges",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:hqOjcs7Dif8C",
"citation_id": "FwuKA4UAAAAJ:hqOjcs7Dif8C",
"authors": "N Tselios, N Avouris, V Komis",
"publication": "Education and Information Technologies 13 (1), 55-76, 2008",
"cited_by": {
"value": 68,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=1046912849634390721",
"serpapi_link": "https://serpapi.com/search.json?cites=1046912849634390721&engine=google_scholar&hl=en",
"cites_id": "1046912849634390721"
},
"year": "2008"
},
...
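You can then pull the article titles out of the parsed dictionary, for example (assuming the structure shown above):
for article in results.get("articles", []):
    print(article["title"], article.get("year"))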
Check out the documentation for more details.
Disclaimer: I work at SerpApi.
In Chrome, open DevTools (F12), go to the Network tab, select 'Preserve log' and disable the cache.
Now hit the 'Show more' button.
Check the GET/POST request being sent. You will know what to do next.
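Once you can see that request in the Network panel, you can replay it from Python and parse the response the same way. The URL and parameters below are placeholders based on the question's URL; replace them with whatever DevTools actually shows for the 'Show more' call:
import requests
from bs4 import BeautifulSoup

# Placeholders: copy the real URL, query parameters and headers from DevTools.
url = 'http://scholar.google.gr/citations'
params = {'user': 'FwuKA4UAAAAJ', 'hl': 'el', 'cstart': 20, 'pagesize': 20}
headers = {'User-Agent': 'Mozilla/5.0'}

resp = requests.get(url, params=params, headers=headers, timeout=30)
soup = BeautifulSoup(resp.text, 'html.parser')
print([a.text for a in soup.find_all('a', {'class': 'gsc_a_at'})])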
I have a use case where I render a page using Redux with GraphQL API calls.
On first render, the component dispatches a default action to fetch data from GraphQL and stores it in the Redux state as below:
state = { films: {
totalCount: 6,
films: [
{
created: '2014-12-10T14:23:31.880000Z',
id: 'ZmlsbXM6MQ==',
director: 'George Lucas',
title: 'A New Hope'
},
{
created: '2014-12-12T11:26:24.656000Z',
id: 'ZmlsbXM6Mg==',
director: 'Irvin Kershner',
title: 'The Empire Strikes Back'
},
{
created: '2014-12-18T10:39:33.255000Z',
id: 'ZmlsbXM6Mw==',
director: 'Richard Marquand',
title: 'Return of the Jedi'
}
]
}
}
and the UI renders like below:
[Films demo app UI screenshot]
In each individual Film component, I have to make a service call to get the film details by id (these calls are async) and store the result in state.
The data comes back as below:
{ "data": {
"film": {
"id": "ZmlsbXM6NQ==",
"title": "Attack of the Clones",
"created": "2014-12-20T10:57:57.886000Z",
"director": "George Lucas",
"releaseDate": "2002-05-16",
"episodeID": 2,
"openingCrawl": "There is unrest in the Galactic\r\nSenate. Several thousand solar\r\nsystems have declared their\r\nintentions to leave the Republic.\r\n\r\nSenator Amidala, the former\r\nQueen of Naboo, is returning\r\nto the Galactic Senate to vote\r\non the critical issue of creating\r\nan ARMY OF THE REPUBLIC\r\nto assist the overwhelmed\r\nJedi....",
"producers": [
"Rick McCallum"
]
}
}
}
Now I have to update my state like below, so that I can show the full film data in each individual Film component:
state = { films: {
totalCount: 6,
films: [
{
"id": "ZmlsbXM6NQ==",
"title": "Attack of the Clones",
"created": "2014-12-20T10:57:57.886000Z",
"director": "George Lucas",
"releaseDate": "2002-05-16",
"episodeID": 2,
"openingCrawl": "There is unrest in the Galactic\r\nSenate. Several thousand solar\r\nsystems have declared their\r\nintentions to leave the Republic.\r\n\r\nSenator Amidala, the former\r\nQueen of Naboo, is returning\r\nto the Galactic Senate to vote\r\non the critical issue of creating\r\nan ARMY OF THE REPUBLIC\r\nto assist the overwhelmed\r\nJedi....",
"producers": [
"Rick McCallum"
]
},
{
"id": "ZmlsbXM6Mg==",
"title": "The Empire Strikes Back",
"created": "2014-12-12T11:26:24.656000Z",
"director": "Irvin Kershner",
"releaseDate": "2002-05-16",
"episodeID": 2,
"openingCrawl": "There is unrest in the Galactic\r\nSenate. Several thousand solar\r\nsystems have declared their\r\nintentions to leave the Republic.\r\n\r\nSenator Amidala, the former\r\nQueen of Naboo, is returning\r\nto the Galactic Senate to vote\r\non the critical issue of creating\r\nan ARMY OF THE REPUBLIC\r\nto assist the overwhelmed\r\nJedi....",
"producers": [
"Rick McCallum"
]
},
{
"id": "ZmlsbXM6Mw==",
"title": "Return of the Jedi",
"created": "2014-12-18T10:39:33.255000Z",
"director": "Richard Marquand",
"releaseDate": "2002-05-16",
"episodeID": 2,
"openingCrawl": "There is unrest in the Galactic\r\nSenate. Several thousand solar\r\nsystems have declared their\r\nintentions to leave the Republic.\r\n\r\nSenator Amidala, the former\r\nQueen of Naboo, is returning\r\nto the Galactic Senate to vote\r\non the critical issue of creating\r\nan ARMY OF THE REPUBLIC\r\nto assist the overwhelmed\r\nJedi....",
"producers": [
"Rick McCallum"
]
}
]
}
}
When I dispatch the action to fetch a film by id (async calls using API middleware) in the Film component, it fires and tries to update the state, but the actions keep looping and nothing works properly.
Please help me understand how to use Redux actions properly.
App CodeSandbox link: https://codesandbox.io/s/adoring-jang-bf2f8m
Check the console logs; you can see the actions looping.
Thanks.
Update: I updated the above app to @reduxjs/toolkit; here is the reference URL:
https://codesandbox.io/s/react-rtk-with-graphql-9bl8q7
You are using a highly outdated style of Redux here that will make you write four times the code for no benefit. Modern Redux does not have ACTION_TYPE constants, switch..case reducers, hand-written middleware (at least in a case like yours), hand-written immutable update logic, or hand-written action creators; none of that has been necessary since 2019. The tutorial you are following is highly outdated, and the problems you are facing right now will not be problems for you if you go with the modern style.
Please do yourself a favor and follow the official Redux tutorial. It also covers getting data from APIs in chapters 5, 7 and 8.
I am trying to interface with the Chroma SDK released by Razer and have been running into some issues. Following the documentation that Razer provides, I have been trying to change the color of my RGB mouse for a while now, and I hope that someone has the answer for me. I can successfully check that the Chroma SDK is running with:
import requests

url = 'http://localhost:54235/razer/chromasdk'
x = requests.get(url)
print(x.text)
Then I can initialize the connection by sending a POST request to the URL, following the template given on their website:
data = {
"title": "Razer Chroma SDK RESTful Test Application",
"description": "This is a REST interface test application",
"author": {
"name": "Chroma Developer",
"contact": "www.razerzone.com"
},
"device_supported": [
"keyboard",
"mouse",
"headset",
"mousepad",
"keypad",
"chromalink"],
"category": "application"
}
x = requests.post(url, json=data)
print(x.text)
This POST request returns:
{"sessionid":55105,"uri":"http://localhost:55105/chromasdk"}
Then, since the connection is initialized, I SHOULD be able to change the colors of the connected Razer devices using endpoints such as /mouse or /headset. This is where it gets funky. If I use the URLs http://localhost:54235/chromasdk/mouse, http://localhost:54235/razer/mouse, or http://localhost:54235/mouse, I get the error "Not Implemented", whereas if I use the URI provided by the previous POST request and tag /mouse onto the end of it, I get this error:
{"error":"Expecting a string","result":87}
Or if I use http://localhost:54235/razer/chromasdk/mouse, I get:
{"author":null,"category":null,"description":null,"device_supported":null,"error":"The parameter is incorrect.","result":87,"title":null}
The endpoints SHOULD follow the URL http://localhost:54235/razer/chromasdk, and I am following the documentation to a T, so what am I doing wrong?
I'm having a hard time finding a way in the 2.0 API to get a list of eVars, props, and events for a given report suite. The 1.4 version has the reportSuite.getEvents() endpoint and similar ones for eVars and props.
Please let me know if there is a way to get the same data using the 2.0 API endpoints.
The API v2.0 GitHub docs aren't terribly useful, but the Swagger UI is a bit more helpful, showing endpoints and the parameters you can push to them, and you can interact with it (logging in with your OAuth credentials) and see requests/responses.
The two API endpoints in particular you want are metrics and dimensions. There are a number of options you can specify, but to just get a dump of them all, the full endpoint URL for those would be:
https://analytics.adobe.io/api/[client id]/[endpoint]?rsid=[report suite id]
Where:
[client id] - The client id for your company. This should be the same value as the legacy username:companyid (the companyid part) from the v1.3/v1.4 API shared secret credentials, except that it is suffixed with "0". For example, if your old username:companyid was "crayonviolent:foocompany", the [client id] would be "foocompany0", because... reasons? I'm not sure what that's about, but it is what it is.
[endpoint] - Use "metrics" to get the events and "dimensions" to get the props and eVars, so you will need to make two API requests.
[rsid] - The report suite id you want to get the list of events/props/eVars from.
Example:
https://analytics.adobe.io/api/foocompany0/metrics?rsid=fooglobal
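If you are calling it from a script rather than the Swagger UI, a minimal request sketch looks like the following. The header names (an OAuth bearer token plus x-api-key) are what the 2.0 API expects as far as I know, and the token, API key, and ids are placeholders you supply:
import requests

access_token = "YOUR_OAUTH_ACCESS_TOKEN"  # placeholder
api_key = "YOUR_API_CLIENT_ID"            # placeholder
company_id = "foocompany0"
rsid = "fooglobal"

headers = {
    "Authorization": f"Bearer {access_token}",
    "x-api-key": api_key,
}

for endpoint in ("metrics", "dimensions"):
    url = f"https://analytics.adobe.io/api/{company_id}/{endpoint}"
    resp = requests.get(url, params={"rsid": rsid}, headers=headers, timeout=30)
    resp.raise_for_status()
    print(endpoint, len(resp.json()), "items returned")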
One thing to note about the responses: they aren't like the v1.3 or v1.4 methods, where you query for a list of only those specific things. They return a JSON array of objects for every single event and dimension respectively, including the native ones, calculated metrics, classifications for a given dimension, etc. AFAIK there is no baked-in way to filter the API query (not in any documentation I can find, anyway), so you will have to loop through the array and pick out the relevant ones yourself.
I don't know what language you are using, but here is a JavaScript example of what I basically do:
var i, l, v, data = { prop:[], evar: [], events:[] };
// dimensionsList - the JSON object returned from dimensions API call
// for each dimension in the list..
for (i=0,l=dimensionsList.length;i<l;i++) {
// The .id property shows the dimension id to eval
if ( dimensionsList[i].id ) {
// the ones we care about are e.g. "variables/prop1" or "variables/evar1"
// note that if you have classifications on a prop or eVar, there are entries
// that look like e.g. "variables/prop1.1" so regex is written to ignore those
v = (''+dimensionsList[i].id).match(/^variables\/(prop|evar)[0-9]+$/);
// if id matches what we're looking for, push it to our data.prop or data.evar array
v && v[1] && data[v[1]].push(dimensionsList[i]);
}
}
// metricsList - the JSON object returned from metrics API call
// basically same song and dance as above, but for events.
for (var i=0,l=metricsList.length;i<l;i++) {
if ( metricsList[i].id ) {
// events ids look like e.g. "metrics/event1"
var v = (''+metricsList[i].id).match(/^metrics\/event[0-9]+$/);
v && data.events.push(metricsList[i]);
}
}
And then the resulting data object will have data.prop, data.evar, and data.events, each an array of the respective props/eVars/events.
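If you happen to be working in Python instead, the same filtering is just a couple of regex checks (assuming dimensions_list and metrics_list hold the parsed JSON arrays from the two calls):
import re

data = {"prop": [], "evar": [], "events": []}

for dim in dimensions_list:
    # keep e.g. "variables/prop1" / "variables/evar1"; skips classifications like "variables/prop1.1"
    m = re.match(r"^variables/(prop|evar)\d+$", dim.get("id", ""))
    if m:
        data[m.group(1)].append(dim)

for metric in metrics_list:
    # keep e.g. "metrics/event1"
    if re.match(r"^metrics/event\d+$", metric.get("id", "")):
        data["events"].append(metric)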
Example object entry for a data.events[n]:
{
"id": "metrics/event1",
"title": "(e1) Some event",
"name": "(e1) Some event",
"type": "int",
"extraTitleInfo": "event1",
"category": "Conversion",
"support": ["oberon", "dataWarehouse"],
"allocation": true,
"precision": 0,
"calculated": false,
"segmentable": true,
"supportsDataGovernance": true,
"polarity": "positive"
}
Example object entry for a data.evar[n]:
{
"id": "variables/evar1",
"title": "(v1) Some eVar",
"name": "(v1) Some eVar",
"type": "string",
"category": "Conversion",
"support": ["oberon", "dataWarehouse"],
"pathable": false,
"extraTitleInfo": "evar1",
"segmentable": true,
"reportable": ["oberon"],
"supportsDataGovernance": true
}
Example object entry for a data.prop[n]:
{
"id": "variables/prop1",
"title": "(c1) Some prop",
"name": "(c1) Some prop",
"type": "string",
"category": "Content",
"support": ["oberon", "dataWarehouse"],
"pathable": true,
"extraTitleInfo": "prop1",
"segmentable": true,
"reportable": ["oberon"],
"supportsDataGovernance": true
}
I need to get a classified eVar from the Omniture real-time API, exclude some values, and then break it down by sitesection.
I tried this query:
{
"reportDescription": {
"source": "realtime",
"reportSuiteID": "**RSID**", //MY REPORT SUITE
"metrics": [{
"id": "instances"
}],
"elements": [{
"id": "evar", //MY EVAR
"top": 100,
"classification": "Real Time", //CLASSIFICATION NAME
"search": {
"type": "NOT",
"keywords": ["somevalue"] //THE VALUE TO EXCLUDE
}
},{
"id" : "sitesection",
"top" : 1
}],
"dateGranularity": "minute:1",
"dateFrom": "-1 minute"
}
}
But in the JSON response I still see "somevalue", as if it were not excluded.
The strange thing is that if I remove the "breakdown" (with sitesection), the classification filter seems to work fine.
Can I not use a classification search filter when a breakdown is used in a real-time report? I can't find any documentation about that.
Another thing: if I request a report with the classification but without any search, I receive the response, but it contains a lot of "::Unspecified::" values. The problem is that "::Unspecified::" seems to be the most recent data that Omniture receives from my web pages. I think this means classifications are not applied in real time, even though you can use them in a real-time report.
Using google-scholar and R, I'd like to find out who is citing a particular paper.
The existing packages (like scholar) are oriented towards H-index analyses: statistics on a researcher.
I want to give a target-paper as input. An example url would be:
https://scholar.google.co.uk/scholar?oi=bibs&hl=en&cites=12939847369066114508
Then R should scrape these citations pages (google scholar paginates these) for the paper, returning an array of papers which cite the target (up to 500 or more citations). Then we'd search for keywords in the titles, tabulate journals and citing authors etc.
Any clues as to how to do that? Or is it down to literally scraping each page? (which I can do with copy and paste for one-off operations).
Seems like this should be a generally useful function for things like seeding systematic reviews as well, so someone adding this to a package might well increase their H :-)
Although Google offers a bunch of APIs, a Google Scholar API is not available. So, although a web crawler for Google Scholar pages might not be difficult to develop, I do not know to what extent it might be illegal. Check this.
Alternatively, you could use a third-party solution like SerpApi. It's a paid API with a free trial. We handle proxies, solve captchas, and parse all the rich structured data for you.
Example Python code (also available in other languages):
from serpapi import GoogleSearch
params = {
"api_key": "secret_api_key",
"engine": "google_scholar",
"hl": "en",
"cites": "12939847369066114508"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
{
"position": 1,
"title": "Lavaan: An R package for structural equation modeling and more. Version 0.5–12 (BETA)",
"result_id": "HYlMgouq9VcJ",
"type": "Pdf",
"link": "https://users.ugent.be/~yrosseel/lavaan/lavaanIntroduction.pdf",
"snippet": "Abstract In this document, we illustrate the use of lavaan by providing several examples. If you are new to lavaan, this is the first document to read … 3.1 Entering the model syntax as a string literal … 3.2 Reading the model syntax from an external file …",
"publication_info": {
"summary": "Y Rosseel - Journal of statistical software, 2012 - users.ugent.be",
"authors": [
{
"name": "Y Rosseel",
"link": "https://scholar.google.com/citations?user=0R_YqcMAAAAJ&hl=en&oi=sra",
"serpapi_scholar_link": "https://serpapi.com/search.json?author_id=0R_YqcMAAAAJ&engine=google_scholar_author&hl=en",
"author_id": "0R_YqcMAAAAJ"
}
]
},
"resources": [
{
"title": "ugent.be",
"file_format": "PDF",
"link": "https://users.ugent.be/~yrosseel/lavaan/lavaanIntroduction.pdf"
}
],
"inline_links": {
"serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=HYlMgouq9VcJ",
"cited_by": {
"total": 10913,
"link": "https://scholar.google.com/scholar?cites=6338159566757071133&as_sdt=2005&sciodt=0,5&hl=en",
"cites_id": "6338159566757071133",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=2005&cites=6338159566757071133&engine=google_scholar&hl=en"
},
"related_pages_link": "https://scholar.google.com/scholar?q=related:HYlMgouq9VcJ:scholar.google.com/&scioq=&hl=en&as_sdt=2005&sciodt=0,5",
"versions": {
"total": 27,
"link": "https://scholar.google.com/scholar?cluster=6338159566757071133&hl=en&as_sdt=2005&sciodt=0,5",
"cluster_id": "6338159566757071133",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=2005&cluster=6338159566757071133&engine=google_scholar&hl=en"
},
"cached_page_link": "https://scholar.googleusercontent.com/scholar?q=cache:HYlMgouq9VcJ:scholar.google.com/&hl=en&as_sdt=2005&sciodt=0,5"
}
},
...
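To collect more than the first page (the question mentions 500 or more citations), you can page through the results. A sketch, assuming the parsed output keeps the results under an organic_results key and that start/num behave as offset/page size:
from serpapi import GoogleSearch

citing_papers = []
for offset in range(0, 500, 20):
    params = {
        "api_key": "secret_api_key",
        "engine": "google_scholar",
        "hl": "en",
        "cites": "12939847369066114508",
        "start": str(offset),
        "num": "20",
    }
    page = GoogleSearch(params).get_dict()
    results = page.get("organic_results", [])
    if not results:
        break
    citing_papers.extend(results)

titles = [r["title"] for r in citing_papers]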
Check out the documentation for more details.
Disclaimer: I work at SerpApi.