I have a bibliography dataframe with article titles, authors, journals and DOIs (example below):
noms_prenoms_des_auteurs | titre_de_larticle | reference_de_larticle_doi
SOEWARTO J, CARRICONDE F, HUGOT N, BOCS S, HAMELIN C, ET MAGGIA L | Impact of Austropuccinia psidii in New Caledonia, a biodiversity hotspot | https://doi.org/10.1111/efp.12402
THIBAULT M, VIDAL E, POTTER M, DYER E, ET BRESCIA F | The red-vented bulbul (Pycnonotus cafer): serious pest or understudied invader? | https://doi.org/10.1007/s10530-017-1521-2
I want to retrieve the corresponding author for each article.
My first plan was to scrape the web pages (extracting the text or the mail icon), but the HTML classes are not the same for each site, and some sites seem to forbid scraping.
Do you have any idea how to retrieve this information?
Maybe with bibliography management packages (RefManageR, rcrossref...)?
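For reference, here is the kind of DOI lookup I was imagining, querying the Crossref REST API directly with httr and jsonlite (just a sketch; as far as I can tell Crossref lists the authors but does not reliably flag which one is the corresponding author):

library(httr)
library(jsonlite)

# Query Crossref for one DOI and pull out the author list
doi <- "10.1111/efp.12402"
resp <- GET(paste0("https://api.crossref.org/works/", doi))
meta <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# Typically a data frame with given/family names (and affiliations where available)
authors <- meta$message$author
authors[, c("given", "family")]

This gives me the author names, but not which address or e-mail belongs to the corresponding author, which is exactly the part I am stuck on.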
Thanks for your answers!
I want to extract the names of universities and their websites from this site into lists.
In Python I did it with BeautifulSoup v4:
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('https://thebestschools.org/features/best-computer-science-programs-in-the-world/')
content = BeautifulSoup(page.text, 'html.parser')

college_name = []
college_link = []

# Each school heading is an <h3 class="college"> containing a link
college_name_list = content.find_all('h3', class_='college')
for college in college_name_list:
    if college.find('a'):
        college_name.append(college.find('a').text)
        college_link.append(college.find('a')['href'])
I really like programming in Julia and since it's very similar to Python, I wanted to know if I can do web scraping in Julia too. Any help would be appreciated.
Your Python code doesn't quite work anymore.
I guess the website has been updated recently; they seem to have removed the links, as far as I can tell.
Here is a similar example using Gumbo.jl and Cascadia.jl.
I am using the built-in download command to fetch the webpage, which writes it to disk in a temporary file that I then read into a String.
It might be cleaner to use HTTP.jl, which could read it straight into a String, but for this simple example it's fine.
using Gumbo
using Cascadia

url = "https://thebestschools.org/features/best-computer-science-programs-in-the-world/"
page = parsehtml(read(download(url), String))

college_name = String[]
college_location = String[]

sections = eachmatch(sel"section", page.root)
for section in sections
    # Skip sections that don't contain a college heading
    maybe_col_heading = eachmatch(sel"h3.college", section)
    if length(maybe_col_heading) == 0
        continue
    end

    # The school name is the text of the last child of the heading
    col_heading = first(maybe_col_heading)
    name = strip(text(last(col_heading.children)))
    push!(college_name, name)

    # The location text sits in an element with class "school-location" inside the same section
    loc = first(eachmatch(sel".school-location", section))
    push!(college_location, text(loc[1]))
end

[college_name college_location]
Outputs
julia> [college_name college_location]
51×2 Array{String,2}:
"Massachusetts Institute of Technology (MIT)" "Cambridge, Massachusetts"
"Massachusetts Institute of Technology (MIT)" "Cambridge, Massachusetts"
"Stanford University" "Stanford, California"
"Carnegie Mellon University" "Pittsburgh, Pennsylvania"
⋮
"Shanghai Jiao Tong University" "Shanghai, China"
"Lomonosov Moscow State University" "Moscow, Russia"
"City University of Hong Kong" "Hong Kong"
Seems like it listed MIT twice.
Probably the filtering code in my demo isn't quite right.
But :shrug: MIT is a great university I hear.
Julia was invented there :joy:
Yes.
For web scraping, Julia has three libraries:
HTTP.jl to download the HTML source of the website (comparable to Python's requests library),
Gumbo.jl to parse the downloaded source into a hierarchical, structured object,
and Cascadia.jl to finally scrape it using a CSS selector API.
I saw from your profile that you're young (16), and your Python implementation is also correct.
Therefore, I'd suggest you try a web-scraping task with these three libraries to better understand how they work.
Unfortunately, the task you wish to do cannot yet be accomplished with Cascadia, since the h3 is inside a <span>, which is currently not an implemented SelectorType in Cascadia.jl.
I am using the newsanchor package in R to try to extract entire article content via NewsAPI. So far I have done the following:
require(newsanchor)
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
This gives me a dataframe full of info on (at most) 100 articles. These, however, do not contain the entire article text. Rather, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5173 chars? I have tried to read the documentation, but I am not really sure.
I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, the Response object section says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So the content is restricted to only 260 characters. However, test$url has the link to the source article, which you can use to scrape the entire content; but since the results are aggregated from various sources, I don't think there is one automated way to do this.
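As a rough illustration of that last point, something like this rvest sketch could pull the text for a single URL (the "p" selector is only an assumption and will differ from site to site):

library(rvest)

# Fetch one of the source articles and grab its paragraph text
article_url <- test$url[1]
page <- read_html(article_url)
paragraphs <- html_text(html_nodes(page, "p"))
full_text <- paste(paragraphs, collapse = "\n")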
Similarly to what other users have attempted here, I am trying to adapt a CSL style file to include the report number in citations of (scientific, technical) reports.
The style I am attempting to modify (and perhaps suggest updating, in compliance with the journal's manuscript preparation guide) is the one used by the scientific journal Physics in Medicine and Biology (Institute of Physics), which is based on the Harvard style:
http://www.zotero.org/styles/physics-in-medicine-and-biology
Using the online code editor http://editor.citationstyles.org/codeEditor/, I have peeked into a CSL style that does format report numbers in a bibliography, such as vancouver-author-date.csl.
It appears that the macro involved is report-details, so I have attempted this minor addition to the physics-in-medicine-and-biology style file, which I have temporarily renamed (and whose id I have changed) to physics-in-medicine-and-biology-with-report.
[...]
</info>
<bibliography>
  <layout>
    <macro name="report-details">
      <choose>
        <if type="report techreport" match="any">
          <text variable="number" prefix="Report " font-style="roman"/>
        </if>
      </choose>
    </macro>
  </layout>
</bibliography>
</style>
Upon importing into Papers for Mac (v3.4.1) and selecting this custom citation style, the formatting result is still (as for the unchanged physics-in-medicine-and-biology style):
Andreo P, Burns D T, Hohlfeld K, Huq M, Kanai T, Laitano R F, Smyth V and Vynckier S 2000 Absorbed Dose Determination in External Beam Radiotherapy, An International Code of Practice for Dosimetry Based on Standards of Absorbed Dose to Water (Vienna: International Atomic Energy Agency)
as opposed to the desired
Andreo P, Burns D T, Hohlfeld K, Huq M, Kanai T, Laitano R F, Smyth V and Vynckier S 2000 Absorbed Dose Determination in External Beam Radiotherapy, An International Code of Practice for Dosimetry Based on Standards of Absorbed Dose to Water Report 398 (Vienna: International Atomic Energy Agency)
What am I missing? Just as a cross-check, here is the record exported as BibTeX, showing the non-empty number field:
@techreport{Andreo:2000vw,
author = {Andreo, Pedro and Burns, David T and Hohlfeld, K and Huq, MS and Kanai, T and Laitano, Raffaele Fedele and Smyth, Vere and Vynckier, S},
title = {{\emph{Absorbed Dose Determination in External Beam Radiotherapy, An International Code of Practice for Dosimetry Based on Standards of Absorbed Dose to Water}}},
institution = {International Atomic Energy Agency},
year = {2000},
number = {398},
address = {Vienna}
}
Thank you for any hints.
Yours, Massimo P.
Is anyone experienced with scraping SEC 10-K and 10-Q filings? I got stuck while trying to scrape monthly realised share repurchases from these filings. Specifically, for each month from 2004 to 2014 I would like to get the following information:
1. Period;
2. Total Number of Shares Purchased;
3. Average Price Paid per Share;
4. Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs;
5. Maximum Number (or Approximate Dollar Value) of Shares that May Yet Be Purchased Under the Plans or Programs.
I have 90,000+ forms to parse in total, so it won't be feasible to do it manually.
This information is usually reported under "Part 2 Item 5 Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities" in 10-Ks and under "Part 2 Item 2 Unregistered Sales of Equity Securities and Use of Proceeds" in 10-Qs.
Here is one example of the 10-Q filings that I need to parse:
https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm
If a firm has no share repurchases, this table can be missing from the quarterly report.
I have tried to parse the HTML files with Python's BeautifulSoup, but the results are not satisfactory, mainly because these files are not written in a consistent format.
For example, the only way I can think of to parse these forms is:
from bs4 import BeautifulSoup
import requests
import unicodedata
import re

url = 'https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm'

def parse_html(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')
    tables = soup.find_all('table')
    # Heuristic: keep tables whose text mentions "Total Number of Shares ... Purchased"
    identifier = re.compile(r'Total.*Number.*of.*Shares.*\w*Purchased.*',
                            re.UNICODE | re.IGNORECASE | re.DOTALL)
    rep_tables = []
    n = len(tables) - 1
    while n >= 0:
        table = tables[n]
        remove_invalid_tags(table)
        # Normalize and strip non-ASCII characters (e.g. non-breaking spaces)
        table_text = unicodedata.normalize('NFKD', table.text).encode('ascii', 'ignore').decode('ascii')
        if re.search(identifier, table_text):
            rep_tables += [table]
        n -= 1
    return rep_tables

def remove_invalid_tags(soup, invalid_tags=['sup', 'br']):
    # Replace tags that break up the cell text (superscripts, line breaks) with spaces
    for tag in invalid_tags:
        tags = soup.find_all(tag)
        if tags:
            [x.replace_with(' ') for x in tags]
The above code only returns the messy tables that may contain the repurchase information. However, 1) it is not reliable; 2) it is very slow; 3) the subsequent steps of scraping the date/month, share price, number of shares, etc. are much more painful. I am wondering whether there are more feasible languages/approaches/applications/databases for getting such information. Thanks a million!
I'm not sure about Python, but in R there is a beautiful solution using the 'finstr' package (https://github.com/bergant/finstr).
'finstr' automatically extracts the financial statements (income statement, balance sheet, cash flow, etc.) from EDGAR using the XBRL format.
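Roughly, the workflow looks like this (a sketch based on the package README; the XBRL instance URL is a placeholder you would take from a filing's index page):

library(XBRL)
library(finstr)

# Parse the XBRL instance document of a filing (placeholder URL)
xbrl_url <- "https://www.sec.gov/Archives/edgar/data/.../xbrl-instance.xml"
xbrl_data <- xbrlDoAll(xbrl_url)

# Turn the parsed XBRL into structured financial statements
statements <- xbrl_get_statements(xbrl_data)
statements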
I'm working with some large government datasets from the Department of Transportation that are available as tab-delimited text files accompanied by data dictionaries. For example, the auto complaints file is a 670 MB file of unlabeled data (when unzipped), and comes with a dictionary. Here are some excerpts:
Last updated: April 24, 2014
FIELDS:
=======
Field# Name Type/Size Description
------ --------- --------- --------------------------------------
1 CMPLID CHAR(9) NHTSA'S INTERNAL UNIQUE SEQUENCE NUMBER.
IS AN UPDATEABLE FIELD,THUS DATA FOR A
GIVEN RECORD POTENTIALLY COULD CHANGE FROM
ONE DATA OUTPUT FILE TO THE NEXT.
2 ODINO CHAR(9) NHTSA'S INTERNAL REFERENCE NUMBER.
THIS NUMBER MAY BE REPEATED FOR
MULTIPLE COMPONENTS.
ALSO, IF LDATE IS PRIOR TO DEC 15, 2002,
THIS NUMBER MAY BE REPEATED FOR MULTIPLE
PRODUCTS OWNED BY THE SAME COMPLAINANT.
Some of the fields have foreign keys listed like so:
21 CMPL_TYPE CHAR(4) SOURCE OF COMPLAINT CODE:
CAG =CONSUMER ACTION GROUP
CON =FORWARDED FROM A CONGRESSIONAL OFFICE
DP =DEFECT PETITION,RESULT OF A DEFECT PETITION
EVOQ =HOTLINE VOQ
EWR =EARLY WARNING REPORTING
INS =INSURANCE COMPANY
IVOQ =NHTSA WEB SITE
LETR =CONSUMER LETTER
MAVQ =NHTSA MOBILE APP
MIVQ =NHTSA MOBILE APP
MVOQ =OPTICAL MARKED VOQ
RC =RECALL COMPLAINT,RESULT OF A RECALL INVESTIGATION
RP =RECALL PETITION,RESULT OF A RECALL PETITION
SVOQ =PORTABLE SAFETY COMPLAINT FORM (PDF)
VOQ =NHTSA VEHICLE OWNERS QUESTIONNAIRE
There are import instructions for Microsoft Access, which I don't have and would not use if I did. But I THINK this data dictionary was meant to be machine-readable.
My question: Is this data dictionary a standard format of some kind? I've tried to Google around, but it's hard to do so without the right terminology. I would like to import the data into R, though I'm flexible so long as it can be done programmatically.
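To make the goal concrete, here is the kind of programmatic import I have in mind, assuming the field list keeps the fixed layout shown above (both file names are placeholders):

# Pull the field names out of the dictionary: they are the second token
# on the lines that start with a field number
dict_lines <- readLines("complaints_dictionary.txt")
field_lines <- grep("^\\s*[0-9]+\\s+[A-Z]", dict_lines, value = TRUE)
field_names <- sapply(strsplit(trimws(field_lines), "\\s+"), `[`, 2)

# Apply the recovered names to the unlabeled tab-delimited data
complaints <- read.delim("complaints_data.txt", header = FALSE, sep = "\t",
                         quote = "", col.names = field_names,
                         stringsAsFactors = FALSE)

For a 670 MB file, data.table::fread() with the same col.names would probably be much faster, but the idea is the same.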