Problems with web-scraping EconPapers repository - web-scraping

I am looking for advice on how to web-scrape EconPapers, a repository of economics papers. I found some tutorials on how to do this in R (unfortunately I cannot use Python, since it does not work on my laptop), but all the examples I have seen deal with HTML tags that are easy to find in the browser's developer tools. In my case, when I open the developer tools I see a structure very different from the ones in those tutorials. I am an economist and not very familiar with these things, so I am looking for someone who can explain where to look for the information I need. The HTML code is attached; I hope someone can help.
The website I am trying to scrape is available at the following link: https://econpapers.repec.org/scripts/search.pf?ft=;wp=on;pl=;sort=date;lgc=AND;aus=;ar=on;kw=;jel=;nep=;ni=1%20year;nit=epdate

Scraping the title is pretty simple, since it sits in an <h1> tag with class colored inside the <div> tag with class bodytext (that div holds the whole main content). As an example, I scrape a single page, https://econpapers.repec.org/paper/haljournl/hal-02490178.htm:
library(rvest)
url <- "https://econpapers.repec.org/paper/haljournl/hal-02490178.htm"
pg <- read_html(url)
title <- pg %>% html_node(xpath = "//div[@class='bodytext']/h1[@class='colored']") %>% html_text()
Output:
> title
[1] "I CRACKED UP... BUT I'M NOT PROUD … \": WHEN THE RESPONSIBLE CONSUMER DEVIATES FROM HIS PERSONAL NORMS"
Other fields can be tricky to scrape, since they are not assigned any ID, name or class. You have to be creative to find a way to identify them. Here are some examples to give you the idea:
Authors: all authors' names are in italics and nothing else in the content is, so you can use that to identify them:
authors <- paste(pg %>% html_nodes(xpath = "//div[@class='bodytext']/p/i") %>% html_text(), collapse = ", ")
Output:
> authors
[1] "Sophie Martins, Stéphanie Montmasson, Fabien Rogeon"
Abstract (Date, Note, etc. work the same way): the abstract is the text node that follows the <b> tag containing the text "Abstract":
abstract <- pg %>% html_node(xpath = "//div[@class='bodytext']/p/b[contains(., 'Abstract')]/following-sibling::text()[1]") %>% html_text(trim = TRUE)
Output:
> abstract
[1] "Souvent considéré comme exemplaire de par son comportement à l'égard de l'environnement, le consommateur responsable est, lui aussi, amené à agir consciemment en désaccord avec ses normes personnelles. Cette présente recherche apporte une compréhension des raisons et circonstances qui incitent les consommateurs responsables à transgresser leurs normes personnelles et propose un éclairage sur la façon dont ces consommateurs gèrent leurs comportements déviants de leurs convictions environnementales grâce aux stratégies de coping. Au terme d'une exploration empirique s'appuyant sur la technique des incidents critiques, cette recherche met à jour l'influence de facteurs émotionnels, sociaux et situationnels dans l'apparition de comportements allant à l'encontre de leurs normes personnelles. Puis sont abordées les différentes stratégies de coping mises en place par le consommateur responsable. Celles-ci sont en ligne avec les stratégies identifiées par la recherche en management et en marketing à la différence que la culpabilité demeure peu évoquée par les répondants. Abstract : Mostly considered as an example because of their pro-environmental behaviour, responsible consumers are also called upon to act consciously in disagreement with their perso nal norms. This research provides an understanding of the reasons and circumstances that lead responsible consumers to transgress their personal norms and sheds light on how these consumers manage their behaviour that deviates from their environmental beliefs through coping strategies. After an empirical exploration based on the critical incident technique, this research reveals the influence of emotional, social and situational factors in the development of behaviours that run counter to their personal norms. The various coping strategies implemented by the responsible consumer are then discussed. 
These are in line with the strategies management and marketing research has pointed out, with the difference that respondents rarely mention guilt."
Date:
date <- pg %>% html_node(xpath = "//div[@class='bodytext']/p/b[contains(., 'Date')]/following-sibling::text()[1]") %>% html_text(trim = TRUE)
Output:
> date
[1] "2020-05-07"
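The Abstract and Date lookups differ only in the label they search for, so they could share one helper. This is a minimal sketch, assuming the page structure described above; the function name labelled_field and the sample HTML are made up for illustration and are not taken from EconPapers.

```r
library(rvest)

# Hypothetical helper: return the text node that directly follows a bold
# label (e.g. "Date" or "Abstract") inside the main content div.
labelled_field <- function(pg, label) {
  xpath <- sprintf(
    "//div[@class='bodytext']/p/b[contains(., '%s')]/following-sibling::text()[1]",
    label
  )
  pg %>% html_node(xpath = xpath) %>% html_text(trim = TRUE)
}

# Made-up HTML that imitates the structure described above
sample_html <- "<html><body><div class='bodytext'>
  <h1 class='colored'>A title</h1>
  <p><b>Date:</b> 2020-05-07<br/><b>Note:</b> Some note</p>
  <p><b>Abstract:</b> Some abstract text.</p>
</div></body></html>"
pg_demo <- read_html(sample_html)

labelled_field(pg_demo, "Date")
labelled_field(pg_demo, "Abstract")
```

A label that is not on the page gives NA rather than an error, which is convenient when papers lack some fields.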
I think you get the point. To scrape all pages, wrap this code in a loop over the paper URLs (I haven't tried it; let me know if you can't get it to work yourself).
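One way such a loop might look is sketched here. The names parse_paper and scrape_many are hypothetical, only the title and authors are collected, and the one-second pause between requests is an assumption rather than a documented EconPapers requirement.

```r
library(rvest)

# Hypothetical parser for one paper page: title and authors only
parse_paper <- function(pg) {
  data.frame(
    title = pg %>%
      html_node(xpath = "//div[@class='bodytext']/h1[@class='colored']") %>%
      html_text(trim = TRUE),
    authors = paste(
      pg %>% html_nodes(xpath = "//div[@class='bodytext']/p/i") %>% html_text(),
      collapse = ", "
    ),
    stringsAsFactors = FALSE
  )
}

# Visit each URL in turn; a page that fails to load is skipped with a warning
scrape_many <- function(urls, delay = 1) {
  rows <- lapply(urls, function(u) {
    Sys.sleep(delay)  # pause between requests to go easy on the server
    tryCatch(
      parse_paper(read_html(u)),
      error = function(e) {
        warning("Skipping ", u, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  rows <- Filter(Negate(is.null), rows)  # drop the pages that failed
  do.call(rbind, rows)                   # one row per successfully parsed page
}

urls <- c("https://econpapers.repec.org/paper/haljournl/hal-02490178.htm")
# papers <- scrape_many(urls)
```

The vector of paper URLs still has to be collected first, for example from the links on the search results page.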

Related

Bibliography management : retrieve the corresponding authors

I have a bibliography dataframe with article titles, authors, journals and DOIs (example below):

| noms_prenoms_des_auteurs | titre_de_larticle | reference_de_larticle_doi |
| --- | --- | --- |
| SOEWARTO J, CARRICONDE F, HUGOT N, BOCS S, HAMELIN C, ET MAGGIA L | Impact of Austropuccinia psidii in New Caledonia, a biodiversity hotspot | https://doi.org/10.1111/efp.12402 |
| THIBAULT M, VIDAL E, POTTER M, DYER E, ET BRESCIA F | The red-vented bulbul (Pycnonotus cafer): serious pest or understudied invader? | https://doi.org/10.1007/s10530-017-1521-2 |
I want to retrieve the corresponding author for each article.
My first plan was to scrape the web (by extracting the text or the mail icon), but the HTML classes are not the same on each site, and some sites seem to forbid scraping.
Do you have any idea how to retrieve this information?
Maybe with bibliography management packages (RefManageR, rcrossref...)?
Thanks for your answers!

Is it possible that two exact same inputs give two different outputs using the Google Cloud API (advanced)

We are working for a customer implementing a solution which uses the Google Translate API (advanced edition). We have an issue now, because we found that translating identical input results in different outputs.
For example, the Dutch input string "Goudse 48+ kaas belegen 1/16 Noord Hollandse weidemelk" is translated to French.
The first output gives: "Gouda 48+ fromage affiné 1/16 Lait de prairie de Hollande du Nord"
The second output gives: "Gouda 48+ fromage affiné 1/16 Lait des prés de Hollande du Nord"
This happens even though the two translations are requested shortly after each other. In total, within one file of ± 250 products and ± 25 columns, 267 differences appear.
Does anyone know how this is possible? Or what we can do about it?

Cloze question combining mchoice and num import in Moodle

I created a cloze question combining mchoice and num. However I cannot import the question in Moodle as it says
Error importing question Invalid embedded answers (Cloze) question (One of the answers should have a score of 100% so it is possible to get full marks for this question.).
If I turn it into a single mchoice question (deleting the num part) or into a single num question (deleting the mchoice part), it works. I could not find such an example on r-exams.org, which is why I am asking here.
This is my code:
```{r data generation, echo = FALSE, results = "hide"}
library(exams)
Fragen=data.frame(
Fragen=c(
"Vergleich Schlachtgewicht (g) männlicher und weiblicher Hühner (Hähne/Hennen) der gleichen Linie.",
"Untersuchung der Anzahl Insektenarten, welche auf unterschiedlichen Feldern vorkommen (Magerwiese, Klee, je 10 Felder).",
"Untersuchung Sulfatgehalt (mg) bei Wasserproben aus der Limmat. Die Proben wurden an zwei unterschiedlichen Stellen entnommen (Limmatquai, Werdinsel, während 14 Tagen)",
"Untersuchung Kürbisgewicht (kg) bei Düngung mit Gülle oder Kompost"),
Stichprobe1=c("Hahn","Magerwiese","Limmatquai","Guelle"),
Stichprobe2=c("Henne","Klee","Werdinsel","Kompost"),
mean1=c(2500,50,250,10),
mean2=c(2000,20,200,12),
sd1=c(300,20,50,5),
sd2=c(300,10,40,5),
n=c(20,10,14,16)
)
n=sample(4,1)
## DATA
x1=abs(round(rnorm(Fragen$n[n],Fragen$mean1[n],Fragen$sd1[n])))
x2=abs(round(rnorm(Fragen$n[n],Fragen$mean2[n],Fragen$sd2[n])))
datadf=data.frame(x1,x2)
names(datadf)=c(as.character(Fragen$Stichprobe1[n]),as.character(Fragen$Stichprobe2[n]))
write.csv(datadf, "stichproben.csv", row.names = FALSE, quote = FALSE)
alpha=0.05
ps1=shapiro.test(x1)$p.value
ps2=shapiro.test(x2)$p.value
pf=var.test(x1,x2)$p.value
if (ps1 > alpha & ps2 > alpha) {
if (pf > alpha) {
p=t.test(x1,x2,var.equal = TRUE)$p.value
}else{
p=t.test(x1,x2,var.equal = FALSE)$p.value
}
}else{
p=wilcox.test(x1,x2)$p.value
}
p
msol=c(ps1>alpha & ps2>alpha, pf>alpha,TRUE)
msol
```
Question
========
`r Fragen$Fragen[n]`
Die Daten sind im File [stichproben.csv](stichproben.csv).
Answerlist
----------
* Die Stichproben sind normalverteilt
* Die Varianzen sind homogen
* Die Stichproben sind unabhängig
* Führe den am besten geeigneten Test durch und kopiere den p-Wert ins Feld:
Solution
========
```{r solutionlist, echo = FALSE, results = "asis"}
```
Meta-information
================
exname: t-Test unabhaengig
extype: cloze
exsolution: `r mchoice2string(msol)`|`r format(p)`
exclozetype: mchoice|num
extol: `r format(0.01*p)`
New answer (Edit: 2020-06-07)
Version 2.4-0 of R/exams has been improved for better support of mchoice elements in cloze questions. Running your exams2moodle("stichproben.Rmd") yields an exercise like this in Moodle:
Caveat: By default this uses Moodle's evaluation rule for multiple-choice questions where each incorrect checkbox eliminates one correct checkbox. In principle, it is possible to change the eval rule in exams2moodle() but this does not work in all settings. Apparently, if the Moodle percentages only add up approximately but not exactly to 100%, they are not read correctly. My reading is that this is a bug in Moodle. See also below.
Old answer (2020-05-17)
Multiple-choice questions where multiple answers are correct are a bit tricky within Moodle cloze exercises. My understanding is that these were not actually allowed up to a certain point (see the discussion at https://moodle.org/mod/forum/discuss.php?d=213016). Hence, we have only examples with cloze exercises containing single-choice elements but not multiple-choice-elements.
[Note: Jargon is not unified across systems. "Single choice" in R/exams is called "multiple choice, single answer" in Moodle. And "multiple choice" in R/exams is called "multiple choice, multiple answer" in Moodle. Here, I use the shorter jargon as employed by R/exams.]
Actually, I thought that Moodle still didn't support multiple-choice questions as elements in cloze exercises. This would also be consistent with the error message you got, requesting exactly one correct answer yielding 100%.
However, it turns out that under certain conditions it actually works. First, you need to choose a MULTIRESPONSE rather than MULTICHOICE type in exams2moodle() (i.e., this could be fixed on the R/exams side). Second, the percentages of the correct answers need to sum to exactly 100%. Unfortunately, this conflicts with Moodle requiring 33.33333% as the input for 1/3 of the points. I didn't find a solution to this, other than avoiding the situation where exactly three answers are correct.
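The arithmetic behind the conflict can be checked directly. This is only an illustration: the helper share and the rounding to five decimals are assumptions that mirror the 33.33333% figure above, not code taken from R/exams or Moodle.

```r
# Percentage per correct answer, written with five decimals
# (the precision is an assumption for illustration)
share <- function(n_correct) round(100 / n_correct, 5)

share(2) * 2 == 100  # TRUE: 50 + 50 adds up exactly
share(3) * 3 == 100  # FALSE: 3 * 33.33333 falls just short of 100
```

With two or four correct answers the shares are exact, which fits the observation that only the case of exactly three correct answers fails.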
As an example I copied your code above into a file stichproben.Rmd and then ran:
set.seed(77)
exams2moodle("stichproben.Rmd", name = "stichproben", cloze = list(
cloze_mchoice_display = "MULTIRESPONSE",
eval = list(partial = TRUE, rule = "false2")
))
Note that the seed is important as it leads to only two out of three items in the multiple-choice question being correct. The eval rule is chosen such that 50% of the points are subtracted if the incorrect item is chosen. This all works as intended in Moodle.
However, running the code above with set.seed(1) beforehand leads to all three items in the multiple-choice question being correct. Then I still get the error message quoted in your question and, as pointed out above, I don't know if or how this can be avoided. Hence, personally, I would rather avoid mchoice elements in cloze questions and use several schoice elements instead.

Gravity Forms - Nested conditional merge tags

I am using Gravity Forms and an add-on called Gravity PDF from Gravity Wiz. It creates PDFs based on form entries and allows 2 levels of nested conditional tags, such as:
img1
I use a trick here: the tags are not the same (gravityform vs gravityforms). This trick allows 2 levels of nested tags, but I need 3 levels or more, such as:
img2
But this doesn't work:
img3
The closing tags cause problems.
FYI:
https://docs.gravityforms.com/conditional-shortcode/
https://gravitywiz.com/gravity-forms-conditional-shortcode/
Support told me they are not going to add this feature, so I need to find a workaround.
I read the post from Dave here: Gravity Forms - Conditional merge tags - MULTIPLE Values in a single tag
Looking into this filter may be the solution:
add_filter('gform_shortcode_conditional', function ($result, $atts, $content)
What would be the easiest solution: adding a custom tag (gravityform2), or comparing 2 values in one single tag?
[gravityforms action="conditional" relation="all"
value="{Occupation du logement:114}" operator="is" compare="était occupé à titre payant et que le précédent locataire l’a quitté depuis moins de 18 mois"
value2="{Dernier loyer du logement:115}" operator2="is" compare2="ce nouveau loyer est identique au précédent ou seulement révisé"]
You're the perfect age.
[/gravityforms]
Can someone help?
Thanks!

editing a CSL style to add the report number to report types

Similarly to what other users here have attempted, I am trying to adapt a CSL style file to specify the report number for citations of (scientific, technical) reports.
The style I am attempting to modify (and may suggest updating to comply with the journal's manuscript preparation guidelines) is the one used by the scientific journal Physics in Medicine and Biology (Institute of Physics), which is based on the Harvard style:
http://www.zotero.org/styles/physics-in-medicine-and-biology
Using the online code editor http://editor.citationstyles.org/codeEditor/ I have looked into a CSL file that does format report numbers in a bibliography, namely the vancouver-author-date.csl style.
It appears that the macro involved is report-details, so I have attempted this minor addition to the physics-in-medicine-and-biology style file, which I have temporarily renamed (changing its id as well) to physics-in-medicine-and-biology-with-report.
[...]
</info>
<bibliography>
<layout>
<macro name="report-details">
<choose>
<if type="report techreport" match="any">
<text variable="number" prefix="Report " font-style="roman"/>
</if>
</choose>
</macro>
</layout>
</bibliography>
</style>
Upon importing into Papers for Mac (v3.4.1) and selecting this custom citation style, the formatting result is still (as for the unchanged physics-in-medicine-and-biology style):
Andreo P, Burns D T, Hohlfeld K, Huq M, Kanai T, Laitano R F, Smyth V and Vynckier S 2000 Absorbed Dose Determination in External Beam Radiotherapy, An International Code of Practice for Dosimetry Based on Standards of Absorbed Dose to Water (Vienna: International Atomic Energy Agency)
as opposed to the desired
Andreo P, Burns D T, Hohlfeld K, Huq M, Kanai T, Laitano R F, Smyth V and Vynckier S 2000 Absorbed Dose Determination in External Beam Radiotherapy, An International Code of Practice for Dosimetry Based on Standards of Absorbed Dose to Water Report 398 (Vienna: International Atomic Energy Agency)
What am I missing? As a cross-check, here is the record exported as a BibTeX entry, showing the non-null number field:
@techreport{Andreo:2000vw,
author = {Andreo, Pedro and Burns, David T and Hohlfeld, K and Huq, MS and Kanai, T and Laitano, Raffaele Fedele and Smyth, Vere and Vynckier, S},
title = {{\emph{Absorbed Dose Determination in External Beam Radiotherapy, An International Code of Practice for Dosimetry Based on Standards of Absorbed Dose to Water}}},
institution = {International Atomic Energy Agency},
year = {2000},
number = {398},
address = {Vienna}
}
Thank you for any hints
Yours, Massimo P.
