Pull the citations for a paper from Google Scholar using R

Using google-scholar and R, I'd like to find out who is citing a particular paper.
The existing packages (like scholar) are oriented towards H-index analyses: statistics on a researcher.
I want to give a target-paper as input. An example url would be:
https://scholar.google.co.uk/scholar?oi=bibs&hl=en&cites=12939847369066114508
Then R should scrape these citation pages (Google Scholar paginates them) for the paper, returning an array of papers which cite the target (up to 500 or more citations). We'd then search for keywords in the titles, tabulate journals and citing authors, etc.
Any clues as to how to do that? Or is it down to literally scraping each page? (which I can do with copy and paste for one-off operations).
Seems like this should be a generally useful function for things like seeding systematic reviews as well, so someone adding this to a package might well increase their H :-)

Although Google offers a number of APIs, a Google Scholar API is not among them. So, although a web crawler for Google Scholar pages might not be difficult to develop, I do not know to what extent it might be illegal. Check this.

Alternatively, you could use a third party solution like SerpApi. It's a paid API with a free trial. We handle proxies, solve captchas, and parse all rich structured data for you.
Example python code (available in other libraries also):
from serpapi import GoogleSearch
params = {
"api_key": "secret_api_key",
"engine": "google_scholar",
"hl": "en",
"cites": "12939847369066114508"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
{
"position": 1,
"title": "Lavaan: An R package for structural equation modeling and more. Version 0.5–12 (BETA)",
"result_id": "HYlMgouq9VcJ",
"type": "Pdf",
"link": "https://users.ugent.be/~yrosseel/lavaan/lavaanIntroduction.pdf",
"snippet": "Abstract In this document, we illustrate the use of lavaan by providing several examples. If you are new to lavaan, this is the first document to read … 3.1 Entering the model syntax as a string literal … 3.2 Reading the model syntax from an external file …",
"publication_info": {
"summary": "Y Rosseel - Journal of statistical software, 2012 - users.ugent.be",
"authors": [
{
"name": "Y Rosseel",
"link": "https://scholar.google.com/citations?user=0R_YqcMAAAAJ&hl=en&oi=sra",
"serpapi_scholar_link": "https://serpapi.com/search.json?author_id=0R_YqcMAAAAJ&engine=google_scholar_author&hl=en",
"author_id": "0R_YqcMAAAAJ"
}
]
},
"resources": [
{
"title": "ugent.be",
"file_format": "PDF",
"link": "https://users.ugent.be/~yrosseel/lavaan/lavaanIntroduction.pdf"
}
],
"inline_links": {
"serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=HYlMgouq9VcJ",
"cited_by": {
"total": 10913,
"link": "https://scholar.google.com/scholar?cites=6338159566757071133&as_sdt=2005&sciodt=0,5&hl=en",
"cites_id": "6338159566757071133",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=2005&cites=6338159566757071133&engine=google_scholar&hl=en"
},
"related_pages_link": "https://scholar.google.com/scholar?q=related:HYlMgouq9VcJ:scholar.google.com/&scioq=&hl=en&as_sdt=2005&sciodt=0,5",
"versions": {
"total": 27,
"link": "https://scholar.google.com/scholar?cluster=6338159566757071133&hl=en&as_sdt=2005&sciodt=0,5",
"cluster_id": "6338159566757071133",
"serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=2005&cluster=6338159566757071133&engine=google_scholar&hl=en"
},
"cached_page_link": "https://scholar.googleusercontent.com/scholar?q=cache:HYlMgouq9VcJ:scholar.google.com/&hl=en&as_sdt=2005&sciodt=0,5"
}
},
...
Check out the documentation for more details.
Disclaimer: I work at SerpApi.
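Once the results are retrieved, the tabulation the question asks for (keywords in titles, citing authors) is a small post-processing step. A minimal sketch, assuming each result is a dict shaped like the sample output above (`title`, and `publication_info.authors[].name`); the `tabulate_citations` helper and the sample records are hypothetical, for illustration only:

```python
from collections import Counter


def tabulate_citations(results, keywords):
    """Count citing authors and keyword hits in titles.

    `results` is assumed to be a list of dicts shaped like the sample
    JSON above; missing fields are tolerated.
    """
    author_counts = Counter()
    keyword_counts = Counter()
    for result in results:
        title = result.get("title", "").lower()
        for keyword in keywords:
            if keyword.lower() in title:
                keyword_counts[keyword] += 1
        authors = result.get("publication_info", {}).get("authors", [])
        for author in authors:
            author_counts[author["name"]] += 1
    return author_counts, keyword_counts


# Hypothetical records, for illustration only (not real search output).
sample = [
    {"title": "Structural equation modeling in practice",
     "publication_info": {"authors": [{"name": "A Author"}]}},
    {"title": "A note on latent variables",
     "publication_info": {"authors": [{"name": "A Author"}, {"name": "B Writer"}]}},
    {"title": "Untitled report"},
]
authors, hits = tabulate_citations(sample, ["structural equation", "latent"])
print(authors.most_common())
print(dict(hits))
```

Journals could be tabulated the same way from the `publication_info.summary` string, though that field is free text and would need some parsing.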

Related

Google Schema.org Math solvers structured data for multiple fields

Trying to setup schema markup for a simple math solver action with two fields. Let's say addition.
1+1=2
Here is Google's doc and example:
{
"@context": "https://schema.org",
"@type": ["MathSolver", "LearningResource"],
"name": "An awesome math solver",
"url": "https://www.mathdomain.com/",
"usageInfo": "https://www.mathdomain.com/privacy",
"inLanguage": "en",
"potentialAction": [{
"@type": "SolveMathAction",
"target": "https://mathdomain.com/solve?q={math_expression_string}",
"mathExpression-input": "required name=math_expression_string",
"eduQuestionType": ["Polynomial Equation","Derivative"]
}],
"learningResourceType": "Math solver"
}
How do we add multiple variables for two numbers?
return {
'@context': 'https://schema.org',
'@type': ['MathSolver', 'LearningResource'],
...
potentialAction: [
{
'@type': 'SolveMathAction',
target: `domain.com/?num1={num1}&num2={num2}`,
'mathExpression-input': 'required name=num1 name=num2',
eduQuestionType: ['addition', 'sum']
},
],
learningResourceType: 'Math solver'
};
Schema.org says the following about mathExpression (note: mathExpression-input doesn't seem to exist, but it does fall under Thing > Intangible > EntryPoint):
"A mathematical expression (e.g. 'x^2-3x=0') that may be solved for a specific variable, simplified, or transformed. This can take many formats, e.g. LaTeX, Ascii-Math, or math as you would write with a keyboard."
But can this be set up so that the URL params accept multiple fields within mathExpression-input instead of a single math expression?

What should the "author" field for a LinkedIn UGC post be for Showcase/Brand pages?

I am trying to specify an author for a UGC post to a showcase page. I am expecting that the author of the showcase post is the showcase page itself, which is what happens when I manually create a post, but this doesn't seem to work with the API.
Let's say I have a showcase urn:li:organizationBrand:123456. If I specify the showcase as the author ("author": "urn:li:organizationBrand:123456") I get an error about an invalid "author" field. But if I wrap the brand URN ID with "organization" instead of "organizationBrand" ("author": "urn:li:organization:123456"), it works, although I have not found this interchangeability documented anywhere.
This same workaround works for retrieving post stats (/organizationalEntityShareStatistics).
Can anyone explain what the right approach is supposed to be?
Are organization brand URNs meant to effectively be an alias of organization URNs?
You can use the organizationalEntityAcls API to find your URN. organization URNs are not necessarily interchangeable with organizationBrand URNs.
For example:
GET https://api.linkedin.com/v2/organizationalEntityAcls?q=roleAssignee
{
"paging": {
"count": 10,
"start": 0
},
"elements": [
{
"state": "APPROVED",
"role": "ADMINISTRATOR",
"roleAssignee": "urn:li:person:R8302pZx",
"organizationalTarget": "urn:li:organization:1000"
}
]
}
source: https://learn.microsoft.com/en-us/linkedin/marketing/integrations/community-management/organizations/organization-access-control#find-access-control-information
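To pick out the URNs you administer from a response like the one above, a short filter is enough. A minimal sketch, assuming the response body has been parsed into a dict with an `elements` list as in the sample; the `administered_targets` helper is hypothetical and makes no API call:

```python
def administered_targets(acl_response):
    """Return organizationalTarget URNs with an approved ADMINISTRATOR role."""
    return [
        element["organizationalTarget"]
        for element in acl_response.get("elements", [])
        if element.get("role") == "ADMINISTRATOR"
        and element.get("state") == "APPROVED"
    ]


# Parsed form of the sample response shown above.
response = {
    "paging": {"count": 10, "start": 0},
    "elements": [
        {
            "state": "APPROVED",
            "role": "ADMINISTRATOR",
            "roleAssignee": "urn:li:person:R8302pZx",
            "organizationalTarget": "urn:li:organization:1000",
        }
    ],
}
targets = administered_targets(response)
print(targets)
```

The URN returned this way is the one to try in the "author" field, rather than a hand-constructed one.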

Writing specs for grammar in atom editor

I have written my own grammar in atom. I would like to write some specs for the same, but I am unable to understand how exactly to write one. I read the Jasmine documentation, but still not very clear. Can someone please explain how to write specs for testing out grammar in atom. Thanks
Grammars are available via atom.grammars.grammarForScopeName("source.yourlanguage")
The grammar object it returns has methods you can feed code snippets (e.g. tokenizeLine, tokenizeLines).
These methods return arrays of tokens.
Testing is just verifying if these methods return what you expect.
E.g. (CoffeeScript alert):
grammar = atom.grammars.grammarForScopeName("source.yourlanguage")
{tokens} = grammar.tokenizeLine("# this is a comment line of some sort")
expect(tokens[0].value).toEqual "#"
expect(tokens[0].scopes).toEqual [
"source.yourlanguage",
"comment.line.number-sign.yourlanguage",
"punctuation.definition.comment.yourlanguage"
]
Happy testing!
Example specs
spec for MscGen (a simple language)
spec for Haskell (more complex)
The array returned by the grammar.tokenizeLine call above looks like this:
[
{
"value": "#",
"scopes": [
"source.yourlanguage",
"comment.line.number-sign.yourlanguage",
"punctuation.definition.comment.yourlanguage"
]
},
{
"value": " this is a comment line of some sort",
"scopes": [
"source.yourlanguage",
"comment.line.number-sign.yourlanguage"
]
},
{
"value": "",
"scopes": [
"source.yourlanguage",
"comment.line.number-sign.yourlanguage"
]
}
]
(Kept seeing this question pop up in the search results when I was looking for an answer to the same question - so just as well document it here.)

scrape under "show more"

I am trying to scrape all the objects with the same tag from a specific site (Google Scholar) with BeautifulSoup, but it doesn't scrape the objects under the "show more" button at the end of the page. How can I fix it?
Here's an example of my code:
# -*- coding: cp1253 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
webpage = urlopen('http://scholar.google.gr/citations?user=FwuKA4UAAAAJ&hl=el')
soup = BeautifulSoup(webpage, 'html.parser')
# Print the title of every article link on the profile page
for t in soup.find_all('a', {"class": "gsc_a_at"}):
    print(t.text)
You have to pass pagination parameters to the request url.
cstart - Parameter defines the result offset. It skips the given number of results. It's used for pagination. (e.g., 0 (default) is the first page of results, 20 is the 2nd page of results, 40 is the 3rd page of results, etc.).
pagesize - Parameter defines the number of results to return. (e.g., 20 (default) returns 20 results, 40 returns 40 results, etc.). Maximum number of results to return is 100.
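Putting the two parameters together, the page URLs can be generated up front. A minimal sketch using only the standard library; the `author_page_urls` helper and the `total` argument (how many articles you expect the profile to list) are assumptions for illustration, and the sketch builds URLs without fetching anything:

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/citations"


def author_page_urls(user_id, total, pagesize=100, hl="en"):
    """Build one URL per page of an author's article list.

    `cstart` is the result offset and `pagesize` the page length
    (capped at 100, per the parameter description above).
    """
    pagesize = min(pagesize, 100)
    urls = []
    for cstart in range(0, total, pagesize):
        query = urlencode(
            {"user": user_id, "hl": hl, "cstart": cstart, "pagesize": pagesize}
        )
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = author_page_urls("FwuKA4UAAAAJ", total=250)
for url in urls:
    print(url)
```

Each URL can then be fetched and parsed with the same BeautifulSoup loop used for the first page.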
You could also use a third party solution like SerpApi to do this for you. It's a paid API with a free trial.
Example python code (available in other libraries also) to retrieve the second page of results:
from serpapi import GoogleSearch
params = {
"engine": "google_scholar_author",
"hl": "en",
"author_id": "FwuKA4UAAAAJ",
"start": "20",
"api_key": "secret_api_key"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"articles": [
{
"title": "MuseumScrabble: Design of a mobile game for children’s interaction with a digitally augmented cultural space",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:RHpTSmoSYBkC",
"citation_id": "FwuKA4UAAAAJ:RHpTSmoSYBkC",
"authors": "C Sintoris, A Stoica, I Papadimitriou, N Yiannoutsou, V Komis, N Avouris",
"publication": "Social and organizational impacts of emerging mobile devices: Evaluating use …, 2012",
"cited_by": {
"value": 69,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=6286720977869955347",
"serpapi_link": "https://serpapi.com/search.json?cites=6286720977869955347&engine=google_scholar&hl=en",
"cites_id": "6286720977869955347"
},
"year": "2012"
},
{
"title": "The effective combination of hybrid usability methods in evaluating educational applications of ICT: Issues and challenges",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:hqOjcs7Dif8C",
"citation_id": "FwuKA4UAAAAJ:hqOjcs7Dif8C",
"authors": "N Tselios, N Avouris, V Komis",
"publication": "Education and Information Technologies 13 (1), 55-76, 2008",
"cited_by": {
"value": 68,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=1046912849634390721",
"serpapi_link": "https://serpapi.com/search.json?cites=1046912849634390721&engine=google_scholar&hl=en",
"cites_id": "1046912849634390721"
},
"year": "2008"
},
...
Check out the documentation for more details.
Disclaimer: I work at SerpApi.
In Chrome, try F12 --> Network, select 'Preserve log' and disable cache.
Now hit the show more button.
Check the GET/POST request being sent. You will know what to do next.

output directory structure in assemble

I am creating a static site using grunt.js and assemble. I have a data.json file used for building pages using assemble:
{
"articles": [
{
"author": "Brian",
"headline": "A Generation on the Hook 1",
"body": "cars, and start businesses by means of debt",
"slug": "n-hook1",
"publish_on": "2014-10-10T04:00:00+00:00",
"url": "http://example.com/2014/oct/08/n-hook1/"
},
{
"author": "Brian",
"headline": "A Generation on the Hook 2",
"body": "As millions go to college, buy homes,",
"slug": "n-hook2",
"publish_on": "2014-10-12T04:00:00+00:00",
"url": "http://example.com/2014/oct/08/n-hook2/"
}
]
}
I would like the output to be created in the following directories like this: 2014/oct/08/n-hook1/index.html. How can I create the directories in assemble?
Is this even possible with assemble.io? If there is something better, let me know. I am new to the js world and would like some direction. I did see this question but this seems to involve placing the files in different directories. Maybe I have to write a helper? If so, I am not sure where to start.
I like assemble because the generated pages leave rendering completely up to the client side, and I just supply the JSON data. Not sure if there is something better.
The grunt-assemble-permalinks plugin was the solution; it does what I need.
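For reference, the target layout follows directly from each article's `url` field. A minimal sketch of that mapping in Python (not part of assemble or the permalinks plugin; the `output_path` helper is hypothetical):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def output_path(article_url):
    """Map an article URL to the index.html path it should be written to."""
    path = urlparse(article_url).path.strip("/")
    return str(PurePosixPath(path) / "index.html")


result = output_path("http://example.com/2014/oct/08/n-hook1/")
print(result)
```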
