Google Sheets: Importing numbers from website using ImportXML - web-scraping

I have no experience coding!
I am having trouble scraping data from a website into my Google spreadsheet. I want to get the Observation number into my spreadsheet from this page.
I have tried this but honestly have no idea what I'm doing:
=IMPORTXML(A3,"//*[#id="obsstatcol"]/div/div[1]")
A3 contains the page URL above, and the rest is a mash-up of a tutorial I found and the XPath copied from the observation value I'm trying to scrape off the page.
Can anyone help me make sense of what the hell I'm trying to do and offer some advice?
Thanks in advance

Good attempt! Unfortunately, the observation number is not determined until after the page loads. That means that your formula:
=IMPORTXML(A3,"//*[@id=""obsstatcol""]/div/div[1]")
yields
{{ shared.numberWithCommas( totalObservations ) }}
which is the page's unrendered client-side template, so you cannot just use ImportXML() in this case.
However, all is not lost. I opened the network monitor with F12, and saw that the page was making a web request to this url:
https://api.inaturalist.org/v1/observations/observers?verifiable=any&quality_grade=needs_id&user_id=ericthuranira&locale=en-US
to get the observation data, which appears to be in JSON format. E.g. (formatted for readability)
{
  "total_results": 1,
  "page": 1,
  "per_page": 500,
  "results": [
    {
      "user_id": 1265521,
      "observation_count": 121,
      "species_count": 42,
      "user": {
        "id": 1265521,
        "login": "ericthuranira",
        "spam": false,
        "suspended": false,
        "created_at": "2018-10-09T11:43:22+00:00",
        "login_autocomplete": "ericthuranira",
        "login_exact": "ericthuranira",
        "name": "Eric Thuranira",
        "name_autocomplete": "Eric Thuranira",
        "orcid": null,
        "icon": "https://static.inaturalist.org/attachments/users/icons/1265521/thumb.jpeg?1580369132",
        "observations_count": 237,
        "identifications_count": 203,
        "journal_posts_count": 0,
        "activity_count": 440,
        "species_count": 150,
        "universal_search_rank": 237,
        "roles": [],
        "site_id": 1,
        "icon_url": "https://static.inaturalist.org/attachments/users/icons/1265521/medium.jpeg?1580369132"
      }
    }
  ]
}
This is not XML, so you'll need a JSON parser instead. Fortunately, somebody has made one for Google Sheets! You can easily get it for yourself by doing the following:
Paste the code from here into your script editor (Tools > Script Editor), and save it as ImportJSON. This gives you your JSON parser.
Taking the "api" URL I mentioned above for the observers, use this formula (assuming URL is in A3)
=ImportJSON(A3,"/results/observation_count","noHeaders")
And this will get you the number you want.
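If you'd like to sanity-check that endpoint outside of Sheets, here is a minimal Python sketch (using the requests library; the URL and the results[0].observation_count path are taken straight from the JSON shown above):

import requests

# Fetch the same iNaturalist endpoint the page calls behind the scenes.
url = ("https://api.inaturalist.org/v1/observations/observers"
       "?verifiable=any&quality_grade=needs_id"
       "&user_id=ericthuranira&locale=en-US")
data = requests.get(url, timeout=10).json()

# The observation count lives at results[0].observation_count (see above).
print(data["results"][0]["observation_count"])  # e.g. 121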

Related

Fetch company posts from linkedin API

I am trying to fetch a company's posts from the API. I have already applied to the Marketing Developer Platform and was approved. I already got a token with the scope r_organization_social, and I'm calling the /shares API:
https://api.linkedin.com/v2/shares?q=owners&owners=urn:li:organization:{company_ID}&sharesPerOwner=100&count=25&sharesPerOwner=10
But I'm getting the following response:
{
  "paging": {
    "start": 0,
    "count": 25,
    "links": [
      {
        "type": "application/json",
        "rel": "next",
        "href": "/v2/shares?count=25&owners=urn%3Ali%3Aorganization%3A{company_ID}&q=owners&sharesPerOwner=10&sharesPerOwner=100&start=0"
      }
    ],
    "total": 242
  },
  "elements": []
}
I tried changing the query params and the result is still the same.
This end-point worked for me:
https://api.linkedin.com/v2/ugcPosts?q=authors&authors=List(urn%3Ali%3Aorganization%3A<ID_ORGANIZATION>)
See documentation: https://learn.microsoft.com/en-us/linkedin/marketing/integrations/community-management/shares/ugc-post-api?tabs=http#sample-request-6
Disclaimer: I have no access to the LinkedIn API and couldn't test, but here are some things I noticed:
Your URL contains the parameter sharesPerOwner twice; try removing one.
The docs recommend setting sharesPerOwner to 1000 and count to 50. I'd also include the start parameter, just to make sure.
Maybe try something like this:
GET https://api.linkedin.com/v2/shares?q=owners&owners=urn:li:organization:{id}&sharesPerOwner=1000&count=50&start=0
From the API docs (https://learn.microsoft.com/en-us/linkedin/marketing/integrations/community-management/shares/share-api?tabs=http#find-shares-by-owner): "Note that the pagination excludes UGC and Direct Sponsored Content (DSC) posts". Make sure the owner you are testing actually has posts.
If this doesn't work, could you provide some information on how you are sending the request? Have you tried accessing other parts of the API?
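For reference, here is a minimal Python sketch of the ugcPosts call suggested above (untested; ACCESS_TOKEN and ORGANIZATION_ID are placeholders, and the X-Restli-Protocol-Version header is the one LinkedIn's docs ask for on this endpoint):

import requests

ACCESS_TOKEN = "token_with_r_organization_social"  # placeholder
ORGANIZATION_ID = "123456"  # placeholder

# Build the URL by hand: Restli 2.0 syntax expects the List(...) parentheses
# unencoded, so we avoid letting requests re-encode the query string.
url = ("https://api.linkedin.com/v2/ugcPosts"
       "?q=authors&authors=List(urn%3Ali%3Aorganization%3A"
       + ORGANIZATION_ID + ")")

resp = requests.get(url, headers={
    "Authorization": "Bearer " + ACCESS_TOKEN,
    "X-Restli-Protocol-Version": "2.0.0",
})
print(resp.status_code, resp.json())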

vega not plotting the data

I am brand new to Vega and I was trying to plot some charts with Vega (the Elasticsearch/Kibana plugin). Below is the simple visualization I am trying to plot. I am following the documentation to connect the existing data, but I am unable to get the visuals: the code below just shows the labeled Y and X axes with a blank plot. What am I doing wrong?
{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json"
  "data": {
    url: {
      %context%: true
      index: test-data
    }
    format: {property: "hits.hits"}
  },
  "mark": {"type":"bar"}
  "encoding": {
    "x": {"field": "DEPT", "type": "ordinal"},
    "y": {"field": "SALES", "type": "quantitative"}
  }
}
The specification needs to be valid JSON. There are numerous things in your specification that make it invalid; for example:
all strings need to be enclosed in quotes (e.g. url and format)
all items need to be separated by commas (applies to nearly every line of your specification)
Finally, even if you fix those syntax errors, check the content against the schema: in standard Vega-Lite the "url" property of "data" must be a string (the query-object form you are using is a Kibana-specific extension), and every key inside it still needs to be quoted.
I would suggest beginning with the Vega-Lite tutorials and going from there, modifying what you learn to work with your own data.
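For illustration, a sketch of what a syntactically valid version of your spec might look like (untested; note that with "format": {"property": "hits.hits"} each datum is a raw Elasticsearch hit, so the encoding fields will likely need a "_source." prefix):

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {
    "url": {
      "%context%": true,
      "index": "test-data"
    },
    "format": {"property": "hits.hits"}
  },
  "mark": {"type": "bar"},
  "encoding": {
    "x": {"field": "_source.DEPT", "type": "ordinal"},
    "y": {"field": "_source.SALES", "type": "quantitative"}
  }
}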

Vega / Kibana Custom Visualization with multiple X axis parameters

I'm trying to achieve something like this example, using Kibana and/or Vega/Vega-Lite.
The CSV file I used to create the index in Kibana was:
student1,90,80,85,95
student2,50,60,55,100
student3,40,70,50,60
At the moment I have this:
{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {
    "url": {
      "%context%": true,
      "index": "grades",
      "body": {
        "size": 5,
        "_source": ["StudentName", "test1", "test2", "test3", "test4"]
      }
    },
    "format": {"property": "hits.hits"}
  },
  "mark": "line",
  "encoding": {
    "x": {"field": "_source.test1", "type": "quantitative"},
    "y": {"field": "_source.StudentName", "type": "nominal"}
  }
}
So my problem is achieving what is in the picture. I know the "encoding" section of my Vega code isn't correct, but I'm having trouble finding a way to put multiple parameters on the X-axis.
I think this vega example would do the trick if I managed to replace the hardcoded values in its data with the data from the Kibana index. Is there any way to use the "_source" fields inside "values", or is there an option in encoding that I can use to achieve my result?
Thanks in advance.
Note: My end result will most likely have only 1 student, but I want the visualization to update in real time, hence the need to use the field.
You also posted this question on the Vega issue tracker, where answers have been posted: https://github.com/vega/vega/issues/1229#issuecomment-379593878
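For completeness: in later versions of Vega-Lite (v3+), the fold transform is one way to get several columns onto a shared axis; a sketch based on your spec (untested against Kibana; note the folded key is the literal field name, e.g. "_source.test1"):

{
  "$schema": "https://vega.github.io/schema/vega-lite/v3.json",
  "data": {
    "url": {
      "%context%": true,
      "index": "grades",
      "body": {
        "size": 5,
        "_source": ["StudentName", "test1", "test2", "test3", "test4"]
      }
    },
    "format": {"property": "hits.hits"}
  },
  "transform": [
    {
      "fold": ["_source.test1", "_source.test2", "_source.test3", "_source.test4"],
      "as": ["test", "score"]
    }
  ],
  "mark": "line",
  "encoding": {
    "x": {"field": "test", "type": "nominal"},
    "y": {"field": "score", "type": "quantitative"},
    "color": {"field": "_source.StudentName", "type": "nominal"}
  }
}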

pull the citations for a paper from google scholar using R

Using google-scholar and R, I'd like to find out who is citing a particular paper.
The existing packages (like scholar) are oriented towards H-index analyses: statistics on a researcher.
I want to give a target-paper as input. An example url would be:
https://scholar.google.co.uk/scholar?oi=bibs&hl=en&cites=12939847369066114508
Then R should scrape the citation pages for that paper (Google Scholar paginates these), returning an array of the papers that cite the target (up to 500 or more citations). Then we'd search for keywords in the titles, tabulate journals and citing authors, etc.
Any clues as to how to do that? Or is it down to literally scraping each page? (which I can do with copy and paste for one-off operations).
Seems like this should be a generally useful function for things like seeding systematic reviews as well, so someone adding this to a package might well increase their H :-)
Although Google offers a bunch of APIs, there is no Google Scholar API. So, although a web crawler for Google Scholar pages might not be difficult to develop, I do not know to what extent it might be illegal. Check this.
Alternatively, you could use a third party solution like SerpApi. It's a paid API with a free trial. We handle proxies, solve captchas, and parse all rich structured data for you.
Example Python code (clients are available in other languages as well):
from serpapi import GoogleSearch

params = {
    "api_key": "secret_api_key",
    "engine": "google_scholar",
    "hl": "en",
    "cites": "12939847369066114508"
}

search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
{
  "position": 1,
  "title": "Lavaan: An R package for structural equation modeling and more. Version 0.5–12 (BETA)",
  "result_id": "HYlMgouq9VcJ",
  "type": "Pdf",
  "link": "https://users.ugent.be/~yrosseel/lavaan/lavaanIntroduction.pdf",
  "snippet": "Abstract In this document, we illustrate the use of lavaan by providing several examples. If you are new to lavaan, this is the first document to read … 3.1 Entering the model syntax as a string literal … 3.2 Reading the model syntax from an external file …",
  "publication_info": {
    "summary": "Y Rosseel - Journal of statistical software, 2012 - users.ugent.be",
    "authors": [
      {
        "name": "Y Rosseel",
        "link": "https://scholar.google.com/citations?user=0R_YqcMAAAAJ&hl=en&oi=sra",
        "serpapi_scholar_link": "https://serpapi.com/search.json?author_id=0R_YqcMAAAAJ&engine=google_scholar_author&hl=en",
        "author_id": "0R_YqcMAAAAJ"
      }
    ]
  },
  "resources": [
    {
      "title": "ugent.be",
      "file_format": "PDF",
      "link": "https://users.ugent.be/~yrosseel/lavaan/lavaanIntroduction.pdf"
    }
  ],
  "inline_links": {
    "serpapi_cite_link": "https://serpapi.com/search.json?engine=google_scholar_cite&q=HYlMgouq9VcJ",
    "cited_by": {
      "total": 10913,
      "link": "https://scholar.google.com/scholar?cites=6338159566757071133&as_sdt=2005&sciodt=0,5&hl=en",
      "cites_id": "6338159566757071133",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=2005&cites=6338159566757071133&engine=google_scholar&hl=en"
    },
    "related_pages_link": "https://scholar.google.com/scholar?q=related:HYlMgouq9VcJ:scholar.google.com/&scioq=&hl=en&as_sdt=2005&sciodt=0,5",
    "versions": {
      "total": 27,
      "link": "https://scholar.google.com/scholar?cluster=6338159566757071133&hl=en&as_sdt=2005&sciodt=0,5",
      "cluster_id": "6338159566757071133",
      "serpapi_scholar_link": "https://serpapi.com/search.json?as_sdt=2005&cluster=6338159566757071133&engine=google_scholar&hl=en"
    },
    "cached_page_link": "https://scholar.googleusercontent.com/scholar?q=cache:HYlMgouq9VcJ:scholar.google.com/&hl=en&as_sdt=2005&sciodt=0,5"
  }
},
...
Check out the documentation for more details.
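Since you mention needing 500+ citations: results come back one page at a time, so you would loop over a start offset. A sketch (the organic_results key and 20-per-page step are assumptions to verify against the documentation):

# Collect all citing papers by paginating with the "start" offset.
all_citing = []
for start in range(0, 500, 20):
    params["start"] = start
    page = GoogleSearch(params).get_dict()
    items = page.get("organic_results", [])
    if not items:  # stop once a page comes back empty
        break
    all_citing.extend(items)

# e.g. titles for the keyword searches described in the question
titles = [item["title"] for item in all_citing]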
Disclaimer: I work at SerpApi.

"Reverse formatting" Riak search results

Let's say I have an object in the test bucket in my Riak installation with the following structure:
{
  "animals": {
    "dog": "woof",
    "cat": "miaow",
    "cow": "moo"
  }
}
When performing a search request for this object, the structure of the search results is as follows:
{
  "responseHeader": {
    "status": 0,
    "QTime": 3,
    "params": {
      "q": "animals_cow:moo",
      "q.op": "or",
      "filter": "",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": "0.353553",
    "docs": [
      {
        "id": "test",
        "index": "test",
        "fields": {
          "animals_cat": "miaow",
          "animals_cow": "moo",
          "animals_dog": "woof"
        },
        "props": {}
      }
    ]
  }
}
As you can see, the way the object is stored, the cat, cow and dog keys are nested within animals. However, when the search results come back, none of the keys are nested; they are simply joined with _.
My question is this: Is there any way provided by Riak to "reverse format" the search, and return the fields of the object in the correct (nested) format? This becomes a problem when storing and returning user data that might possibly contain _.
I do see that the latest version of Riak (a beta release) provides a search schema, but I can't tell whether it would solve this problem.
What you receive back in the search result is what the object looked like after passing through the json analyzer. If you need the data formatted differently, you can use a custom analyzer. However, this will only affect newly put data.
For existing data, you can use the id field and issue a get request for the original object, or use the solr query as input to a MapReduce job.
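For the get-request route, a minimal sketch with the Python Riak client (untested; the port, bucket name, and the shape of search_response are assumptions based on the example above):

import riak

client = riak.RiakClient(pb_port=8087)  # adjust host/port for your install
bucket = client.bucket("test")

# The parsed search result shown above (however you obtained it).
search_response = {"response": {"docs": [{"id": "test"}]}}

# Each search doc's "id" is the key of the original object; fetching it
# returns the nested structure exactly as stored, no "reverse formatting".
for doc in search_response["response"]["docs"]:
    obj = bucket.get(doc["id"])
    print(obj.data)  # e.g. {"animals": {"dog": "woof", "cat": "miaow", "cow": "moo"}}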
