There is a Moodle course with files that I wish to download automatically. I see that there is a Python solution for doing so, but I couldn't find an R solution. I am not sure which package and what type of workflow would fit here (I assume some combination of httr and rvest). Any suggestion would be appreciated.
The first problem is how to deal with Moodle's authentication. In my case, I need three items (personal id, user name, and password).
From a Google search I was able to find the following Moodle-related packages (none seems to help): SARP.moodle, moodler. Using download.file on the homepage of the course only redirects to the page shown to people who do not have access to the course. No pop-up window appears for entering credentials (and again, in my case three items are needed to log in).
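For illustration, a minimal sketch of what an rvest-based login and download might look like (the login URL, the file URL, and the three form-field names are placeholders; the real ones would have to be read off the Moodle login page):

    library(rvest)  # rvest >= 1.0 (older versions use html_session())

    login_url <- "https://moodle.example.edu/login/index.php"            # hypothetical
    file_url  <- "https://moodle.example.edu/mod/resource/view.php?id=1" # hypothetical

    s    <- session(login_url)
    form <- html_form(s)[[1]]
    form <- html_form_set(form,
                          idnumber = "YOUR_PERSONAL_ID",  # field names assumed;
                          username = "YOUR_USERNAME",     # inspect the real login form
                          password = "YOUR_PASSWORD")
    s <- session_submit(s, form)

    # follow the authenticated session to the file and write it to disk
    resp <- session_jump_to(s, file_url)
    writeBin(httr::content(resp$response, as = "raw"), "course_file.pdf")

If the course page lists many files, the same session could be passed to html_elements() first to collect all download links.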
I want to create product/project documentation in R that is going to be reviewed and discussed by a group of reviewers. There are plenty of examples of how to create book-like documents using R Markdown (e.g. https://bookdown.org/) or interactive data visualizations using R Shiny. However, I could not find any solution for user comments similar to LibreOffice Writer, MS Word, or Google Docs. I could also imagine having a split pane where one side is dedicated to the content presentation (e.g. text, graphs, code), while the other side is left for comments.
I am aware that such a setup requires a server-side component for storing comments.
Any hints on existing solutions, workarounds, and implementations are welcome.
If I understood correctly, your question isn't very R-specific. R is just code, and R files are just text; they don't support comments (besides raw # comments) or reviews. Your question is more about version control environments, which allow reviews of code. The most widely used version control system is Git, and Git has an integrated panel in RStudio.
Git allows you to split your development into branches, which are the different ideas you and your coworkers can work on independently. Once an idea is finalised, after a series of modifications known as commits, you ask for it to be merged into the "main branch". This is a "pull request".
That is where the platforms built on Git, like GitHub or GitLab, provide review systems. Basically, when a branch is done, you ask "is this OK?". Your reviewer can see the changes, try things out, and tell you whether it actually is OK.
The other advantage of these platforms is continuous integration, that is: writing tests (in R, with testthat) that are run on certain events, like "on each merge to master". It is meant to ensure that the software keeps moving forward: if a merge breaks an earlier test, you'll know it.
For beginners, in order to avoid getting lost in bash commands, GitHub Desktop is a fine GUI on top of Git.
Note: you can also rely on the usethis package, which has a few helper functions like use_git, use_gitlab_ci, use_github_action... They are not specific to reviews and comments but to GitLab and GitHub integration.
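As a small sketch, those helpers are typically run once from the project root (which CI file you need depends on your platform):

    library(usethis)

    use_git()            # initialise a Git repository for the project
    use_github()         # create and link a GitHub repository
    use_github_action()  # add a GitHub Actions workflow (e.g. R CMD check)
    # use_gitlab_ci()    # or, on GitLab, add a .gitlab-ci.yml instead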
I am setting up a database of certain events that have occurred in the past, and I need to search the internet for a number of terms to retrieve as many pages as possible that contain terms related to the events I want to document.
First I looked into achieving this using Google's Custom Search API, after reading this question:
Need to access Google Custom search api through R
I did manage to get a JSON of search results through the browser, but not through R, so I moved on.
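(For reference, the kind of request I was attempting from R looked roughly like the sketch below; the key and cx values are placeholders for my own credentials.)

    library(httr)
    library(jsonlite)

    res <- GET("https://www.googleapis.com/customsearch/v1",
               query = list(key = "YOUR_API_KEY",
                            cx  = "YOUR_SEARCH_ENGINE_ID",
                            q   = "terms related to the events"))
    stop_for_status(res)
    hits <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
    # hits$items should contain title, link, snippet, ... for each result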
When I saw that the Custom Search API was using OpenSearch, and found the rOpenSearch package for R, I wanted to try going down this path:
http://terradue.github.io/rOpenSearch/
After reading through the documentation, I found that it only provides examples for searching sites that publish OpenSearch descriptions. As I need to search as many websites as possible, it seems I would need an OpenSearch description for a search engine like Google, but I can't seem to find that anywhere.
Is there any way to search the internet via R using OpenSearch and collect the results in a data table?
If you know of a better solution to my problem, I'd appreciate if you could point me in another direction.
If I read correctly, you are looking for something called web scraping via R.
I am new to web scraping, and I use the following tools and method to scrape:
I use R (with packages RCurl, XML, etc.) to read the web pages (given a URL), and the htmlTreeParse function to parse the HTML page.
Then, in order to get the data I want, I first use the developer tools in Chrome to inspect the code.
When I know in which node the data are, I use xpathApply to get them.
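A stripped-down sketch of that usual process (the XPath expression here is illustrative, not the site's actual one):

    library(RCurl)
    library(XML)

    url  <- "http://www.sephora.fr/Parfum/Parfum-Femme/C309/2"
    page <- getURL(url)
    doc  <- htmlTreeParse(page, useInternalNodes = TRUE)

    # once the right node is known, pull the values out with XPath
    titles <- xpathApply(doc, "//h3[@class='product-title']", xmlValue)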
Usually, it works well. But I had an issue with this site: http://www.sephora.fr/Parfum/Parfum-Femme/C309/2
When you click on the link, the page loads, and in fact it is page 1 (of the products).
You have to load the URL again (by entering it a second time) in order to get page 2.
When I use the usual process to read the data, the htmlTreeParse function always gives me page 1.
I tried to understand this website a bit more:
It seems that it is built with Oracle Commerce (ATG Commerce).
The "real" URL is hidden, and when you click on a filter (for instance, you select a brand), you get a URL with a requestid: http://www.sephora.fr/Parfum/Parfum-Femme/C309?_requestid=285099
This doesn't help to figure out which selection I made.
Could you please help:
How can I access more products?
Thank you
I found the solution: Selenium! I think it is the ultimate tool for web scraping. I have posted several questions concerning web scraping; now, with RSelenium, almost everything is possible.
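A rough sketch of how RSelenium can be used here (it needs a running Selenium server/driver; the wait time and any selectors are illustrative):

    library(RSelenium)

    rD    <- rsDriver(browser = "firefox", verbose = FALSE)
    remDr <- rD$client

    remDr$navigate("http://www.sephora.fr/Parfum/Parfum-Femme/C309/2")
    Sys.sleep(5)  # let the JavaScript-rendered products load

    # hand the rendered page source to the usual parser
    page <- remDr$getPageSource()[[1]]
    doc  <- XML::htmlParse(page)

    remDr$close()
    rD$server$stop()

Because the page runs in a real browser, pagination (clicking "next") can also be scripted with findElement() and clickElement().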
I use Kimonolabs right now for scraping data from websites that share the same goal. To keep it simple, let's say these websites are online shops selling stuff (actually they are job websites with online application possibilities, but technically they look a lot like webshops).
This works great. For each website a scraper API is created that goes through the available advanced-search page to crawl all product URLs. Let's call this API the 'URL list'. Then a 'product API' is created for the product detail page that scrapes all necessary elements, e.g. the title, product text and specs like the brand, category, etc. The product API is set to crawl daily using all the URLs gathered in the 'URL list'.
Then the gathered information for all products is fetched from the Kimonolabs JSON endpoint by our own service.
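(What "fetching from the JSON endpoint" amounts to on our side is roughly the following R sketch; the endpoint URL and API key are placeholders for whichever service ends up being used.)

    library(httr)
    library(jsonlite)

    res <- GET("https://api.example-scraper.com/products",   # hypothetical endpoint
               query = list(apikey = "YOUR_API_KEY"))
    stop_for_status(res)
    products <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
    str(products)  # title, product text, brand, category, ...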
However, Kimonolabs will shut down its service at the end of February 2016 :-(. So, I'm looking for an easy alternative. I've been looking at import.io, but I'm wondering:
Does it support automatic updates (letting the API scrape hourly/daily/etc)?
Does it support fetching all product URLs from a paginated advanced-search page?
I'm tinkering around with the service. Basically, it seems to extract data via the same easy process as Kimonolabs. Only, it's unclear to me whether the pagination of URLs needed for the product API, and automatically keeping it up to date, are supported.
Are there any import.io users here who can advise whether import.io is a useful alternative for this? Maybe even give some pointers in the right direction?
Look into Portia. It's an open source visual scraping tool that works like Kimono.
Portia is also available as a service and it fulfills the requirements you have for import.io:
automatic updates, by scheduling periodic jobs to crawl the pages you want, keeping your data up-to-date.
navigation through pagination links, based on URL patterns that you can define.
Full disclosure: I work at Scrapinghub, the lead maintainer of Portia.
Maybe you want to give Extracty a try. It's a free web scraping tool that allows you to create endpoints that extract any information and return it in JSON. It can easily handle paginated searches.
If you know a bit of JS you can write CasperJS endpoints and integrate any logic that you need to extract your data. It has a similar goal to Kimonolabs and can solve the same problems (if not more, since it's programmable).
If Extracty does not solve your needs, you can check out these other market players that aim for similar goals:
Import.io (as you already mentioned)
Mozenda
Cloudscrape
TrooclickAPI
FiveFilters
Disclaimer: I am a co-founder of the company behind Extracty.
I'm not that fond of Import.io, but it seems to me it allows pagination through bulk input URLs. Read here.
So far there is not much progress in getting a whole website through the API:
Chain more than one API/Dataset: It is currently not possible to fully automate the extraction of a whole website with Chain API.
For example, if I want data that is found within category pages or paginated lists, I first have to create a list of URLs, run Bulk Extract, save the result as an import data set, and then chain it to another Extractor. Once set up, I would like to be able to do this more automatically, in one click.
P.S. If you are somewhat familiar with JS, you might find this useful.
Regarding automatic updates:
This is a beta feature right now. I'm testing this for myself after migrating from Kimonolabs... You can enable this for your own APIs by appending &bulkSchedule=1 to your API URL. Then you will see a "Schedule" tab. In the "Configure" tab, select "Bulk Extract" and add your URLs; after this the scheduler will run daily or weekly.
The task is to download the table with the names of bookmakers and their odds (here).
I cannot find the part of the source code that corresponds to these data. I tried to use the Chrome extension named SelectorGadget, unsuccessfully.
Similarly, when I want to open the matches, I run into the same problem. Thank you for any advice.
The data is not in the HTML; it is dynamically loaded via JavaScript.
From the Terms of Service:
Without prior authorisation in writing from the Provider, Visitors are not authorised to copy, modify, tamper with, distribute, transmit, display, reproduce, transfer, upload, download or otherwise use or alter any of the content of the Website.
Therefore, do not expect us to assist you with breaching their terms of use.