scraping data with pjscrape


I was scraping data with cheerio and request. However, I could not get a data table that was created with the DataTables library. I am now using PhantomJS and pjscrape. I went through the PhantomJS tutorials and did the hello world. I have started the pjscrape tutorials, but I do not understand the path:
/path/to/pjscrape.js
I am using
phantomjs pjscrapejs config.js
I am not sure what the path is supposed to be. I tried reading an example in the docs, but I did not find anything that clears this up.

It's the relative path to the pjscrape.js file compared to where you're running the phantomjs executable from.
If they're in the same directory, then just 'phantomjs pjscrape.js config.js' will do the trick.

Related

Is there a way to download file in R from a "blob" url?

First-time poster here, so please let me know if my formatting is off.
I'm trying to automate some data collection I do weekly via R. To do this I download data from the appropriate URL in Rscript and for 10/11 datasets my code is fine. My holdup is downloading data from the below blob URL. Currently, I am downloading all other files with the download.file command but have found no alternatives in base R or in packages for blob URLs. Is there a method for downloading "blob" files from R onto my computer?
doc <- "blob:https://covid.cdc.gov/3227d54b-d64d-4e03-be5a-a872a1c2025c"
I've tried running the above URL through download.file and shortening the above URL to the section following "blob:" with no luck.
A basic example of the code I am currently using:
download.file(doc,doc_path)
I really have no particular understanding of the core structure of "blob" URLs so if I am fundamentally misunderstanding something please let me know.
As a note I think the AzureStor package might be useful but haven't determined how to use it appropriately.
Any info is much appreciated.

My shinydashboard app works on my machine but not on shinyapps.io

Thanks in advance for any help!
I keep getting the same error message when trying to publish the app on shinyapps.io:
The application failed to start (exited with code 1).
I have already commented out the setwd() and library(shiny) calls, as I learned from other posts, but so far no luck. This is the screenshot of the error.
I am new to this, so any support is greatly appreciated.

It looks to me like you are using an absolute file path in your script. shinyapps.io won't understand a file path specific to your machine.
Instead, try putting the files you need to read in a folder (e.g. 'InputFiles') and put that folder in the same place as your scripts. Change your scripts to refer to files using relative file paths like: 'InputFiles/file1.csv'.
When you run the code locally make sure to set the working directory to the same directory your scripts are in. When you publish to shinyapps.io make sure to include your scripts and the 'InputFiles' directory.
Here's a great explanation of how these work: https://docs.rstudio.com/shinyapps.io/Storage.html#Storage
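The relative-path advice above can be sketched in R as follows. This is only a minimal illustration: read_input is a hypothetical helper name, and 'InputFiles' and 'file1.csv' are the example names from the answer. It assumes the working directory is the directory that contains the app scripts.

```r
# Hypothetical helper: read a CSV from a folder that sits next to the app scripts.
read_input <- function(file_name, input_dir = "InputFiles") {
  # Build the path relative to the current working directory.
  path <- file.path(input_dir, file_name)

  # Fail early with a readable message if the file was not deployed with the app.
  if (!file.exists(path)) {
    stop("Input file not found: ", path, " (working directory: ", getwd(), ")")
  }
  read.csv(path)
}
```

Calling read_input("file1.csv") then behaves the same locally and after deployment, as long as the 'InputFiles' folder is published together with the scripts.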

The solution came to me after reading Thomas's post. I had an R script (which did all the statistics and plots for my dashboard) stored in the same folder where the shiny UI and server were stored. After moving this script file to a different folder, the problem was solved. I do not quite understand why this fixed the issue, but I hope this article helps people facing similar issues.

Relative paths in R: how to avoid my computer being set on fire?

A while back I was reading an article about improving project workflow. The advice was not to use setwd or my computer would burn:
If the first line of your R script is
setwd("C:\Users\jenny\path\that\only\I\have")
I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
I started using the here package and it worked great until I started to schedule scripts using cronR. After asking this question my laptop was again threatened with arson:
If the first line of your #rstats script is wd <- here(), I will come
into your lab and SET YOUR COMPUTER ON FIRE.
Fearing for my laptop's safety I started using the method suggested in the answer to get relative file paths:
wd <- Sys.getenv("HOME")
wd <- file.path(wd, "projects", "my_proj")
This worked for me, but not for the people I was working with, who didn't have the same projects directory. So now I'm confused. What is the safest / best way to get relative file paths so that a project can be portable?
There are quite a few options: 1, 2. My requirements are to source functions/scripts and read/write csv files. Perhaps the rprojroot package is the best bet?

Create an RStudio project and then reference all files with relative paths from the project's root folder. That way, all users will open the project and automatically have the correct working directory.
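If a script is started from a subdirectory of the project, or by a scheduler, the working directory may not be the project root. One possible way to handle that in base R is to walk upward until a directory containing an .Rproj file is found. The sketch below is only an illustration of that idea: find_project_root is a hypothetical helper, and packages such as rprojroot provide more complete versions of it.

```r
# Hypothetical helper: locate the project root by searching upward for a marker file.
find_project_root <- function(start = getwd(), marker_pattern = "\\.Rproj$") {
  dir <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    # A directory counts as the root if it contains a file matching the marker.
    if (length(list.files(dir, pattern = marker_pattern)) > 0) {
      return(dir)
    }
    parent <- dirname(dir)
    # dirname() of the filesystem root returns itself, so stop there.
    if (identical(parent, dir)) {
      stop("No project root found above ", start)
    }
    dir <- parent
  }
}
```

A file can then be referenced as file.path(find_project_root(), "data", "gen01.csv"), regardless of which project subdirectory the script was started from.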

There are many ways to organize code and data for use with R. Given that the "arsonist" described in the OP has rejected at least two approaches for locating the project files in an R script, the best next step is to ask the arsonist how s/he performs this function, and adjust your code and file structures accordingly.
UPDATE: Since the "arsonists" appear to be someone who writes on Tidyverse.org (see the Tidyverse article in the OP) and the author of an answer on SO (see the additional links in the OP), your computer appears to be relatively safe.
If you are sharing code or executing it with batch processes where the "user" is someone other than you, a useful approach is to place the code, data, and configuration under version control, and develop a runbook to explain how others can retrieve the components and execute them on another computer.
As noted in the comments to the OP, there's nothing wrong with here::here() if its use can be made reliable through documentation in a runbook.
I structure all of my R code into Projects within RStudio, which are organized into a gitrepositories directory. All of the projects can be accessed as subdirectories from the gitrepositories directory. If I need to share a project, I make the project accessible to other users on GitHub.
In my R code I reference external files with paths relative to the project root directory, such as ./data/gen01.csv.

There are two parts to this question:
how to load data from a relative path, and
how to load code from a relative path
For most use cases (including when invoking tools from a CRON job or similar) the location of the data should either be specified by the user (via command line arguments, standard input or environment variables) or should be relative to the current working directory (getwd() in R).
… Unless the data is a fixed part of the project itself — more on this below.
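The lookup order described above (user-specified location first, otherwise the current working directory) could be sketched like this. resolve_data_dir and the environment variable name MY_PROJ_DATA are hypothetical names chosen for illustration only.

```r
# Hypothetical helper: decide where the data lives.
# Order: first command line argument, then an environment variable, then getwd().
resolve_data_dir <- function(args = commandArgs(trailingOnly = TRUE),
                             env_var = "MY_PROJ_DATA") {
  # 1. A directory passed on the command line, e.g. Rscript run.R /path/to/data
  if (length(args) >= 1 && nzchar(args[1])) {
    return(args[1])
  }
  # 2. A directory supplied through the environment (handy for cron jobs).
  from_env <- Sys.getenv(env_var, unset = "")
  if (nzchar(from_env)) {
    return(from_env)
  }
  # 3. Fall back to the current working directory.
  getwd()
}
```

A script would then build its paths with file.path(resolve_data_dir(), "input.csv"), where "input.csv" is again just a placeholder name.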
Loading code from a path that’s relative to other code is simply not supported by base R. For example, source('xyz.r') won’t source an xyz.r file from the project. It will always try to load it from the current working directory, whatever that happens to be. Which is pretty much never what you want. And as you’ve noticed, the ‘here’ package also doesn’t always work.
R basically only works when code is only loaded from packages. But packages aren’t suitable for all types of projects. R has no built-in solution for those other cases. I recommend using ‘box’ modules to solve this. ‘box’ provides a modern module system for R, which means that you can have R projects consisting of multiple code files (and nested sub-projects), without having to wrap them in packages. Loading code inside the same relative path in a module is as simple as
box::use(./xyz)
This always works, as you’d expect from a modern module system, and doesn’t require ‘here’ or similar hacks.
OK, back to the point about data that’s bundled with a project itself. If your project is an R package, you’d use system.file() to load that data. However, this once again doesn’t work for non-package projects. But if you use ‘box’ modules to structure your project, you can use box::file() to load data that’s associated with a module.
Packages such as ‘here’ or ‘rprojroot’, while well-intended, are essentially hacks to work around limitations in R’s handling of non-package code. The proper solution is to make non-package code into a first-class citizen of the R world, and ‘box’ does that.

You can check the docs of the RSuite package (https://RSuite.io). It provides script_path, which points to the R script that is currently being run. I use it to build relative paths with the file.path function.

Download code uploaded to projectName.meteor.com

Is there a way to download the code for the projectName app that I uploaded to projectName.meteor.com?
Is there a meteor command line interface that will accomplish this?

At the moment this is not possible through any meteor tools.
You can get the client-side code by reading the JavaScript files served from projectName.meteor.com. The files will be concatenated and minified, so they will be far from the original, although they can be somewhat helpful if you can re-beautify them.
For the server-side code you'll have to contact the people who run meteor.com, and hopefully they can help you out with that. Keep in mind that most of your code will be minified and may not look like the original.

Accessing Excel file from Sharepoint with R

I am trying to write an R script that will access an Excel file that is stored on my company's Sharepoint page so that I can make a few calculations and plot the results. I've tried various ways to do this (download.file, RCurl getURL(), gdata), but I can't seem to figure out how to do this. The URL is HTTPS and there should be a username and password required. I've gotten the closest with this code:
require(RCurl)
URL<-"https://companyname.sharepoint.com/sites/folder/_layouts/15/WopiFrame.aspx?sourcedoc={2DCC2ED7-1C13-4910-AFAD-4A9ACFF1C797}&file=myfile.xlsx&action=default"
f<-getURL(URL,verbose=T,ssl.verifyhost=F,ssl.verifypeer=F,userpwd="mylogin:mypw")
This seems to connect (although the username and password don't seem to matter) and returns
> f
[1] "<html><head><title>Object moved</title></head><body>\r\n<h2>Object moved to here.</h2>\r\n</body></html>\r\n"
However, I'm not sure what to do at this point, or even if I'm on the right track. Any help will be greatly appreciated.

I use
library(readxl)
read_excel('//companySharepointSite/project/.../ExcelFilename.xlsx', 'Sheet1', skip=1)
Note, no https:, and sometimes I have to open the file first (i.e., cut and paste //companySharepointSite/project/.../ExcelFilename.xlsx into my browser's address bar)

I found that other answers did not work for me, perhaps because I am on a Mac, which obviously does not play as well with Microsoft products such as Sharepoint.
Ended up having to split it into two pieces: first download the Excel file to disk and then separately read that Excel file.
library(httr)
library(readxl)
# the URL of your sharepoint file
file_url <- "https://yoursharepointsite/Documents/yourfile.xlsx"
# save the excel file to disk
GET(file_url,
    authenticate(active_directory_username, active_directory_password, "ntlm"),
    write_disk("tempfile.xlsx", overwrite = TRUE))
# save to dataframe
df <- read_excel("tempfile.xlsx")
df
# remove excel file from disk
file.remove("tempfile.xlsx")
This gets the job done, though I would be interested if anyone knows how to avoid the interim step of writing to disk.
N.B. Depending on your specific machine/network/Sharepoint configuration, you may also be able to just use authenticate(":",":","ntlm") per this answer.
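The download-then-read approach above can be wrapped so that the interim file does not linger in the working directory. This is only a sketch: read_sharepoint_excel is a hypothetical wrapper name, and it assumes that read_excel needs a file path, so a file is still written, but to a temporary location that is removed when the function exits.

```r
# Hypothetical wrapper around the two-step approach: download, read, clean up.
# Requires the httr and readxl packages to be installed.
read_sharepoint_excel <- function(file_url, username, password, ...) {
  # Use a temporary file instead of a fixed name in the working directory.
  tmp <- tempfile(fileext = ".xlsx")
  # Remove the temporary file when the function exits, even on error.
  on.exit(unlink(tmp), add = TRUE)

  resp <- httr::GET(file_url,
                    httr::authenticate(username, password, type = "ntlm"),
                    httr::write_disk(tmp, overwrite = TRUE))
  # Turn HTTP errors (e.g. 401, 404) into R errors instead of reading a bad file.
  httr::stop_for_status(resp)

  # Extra arguments such as sheet or skip are passed on to read_excel().
  readxl::read_excel(tmp, ...)
}
```

Whether NTLM authentication works depends on the specific SharePoint setup, as noted above.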

I was unable to accomplish this in R using hints from the answers above (I tried many approaches found on this site). However, just to highlight the response by @RyanBradley above and especially the response by @ZS27:
I instead had to use the OneDrive Desktop client (Windows) to allow me to sync the folder to my computer. Newer versions of SharePoint (like that found in MS Teams) have a sync button or feature in the document libraries / folders that interfaces with OneDrive.
This is the functional equivalent of mounting the folder as a network drive, so R interfaces with the file as if it was a part of your file system. Works for me.

You may need to map a network drive to the SharePoint library so that you can connect to it directly. Or if you don't want to map a network drive you could also place a shortcut to the folder in your startup folder.
Example file path:
\\company_sharepoint_site\ssp\site_name\sub_site_name\library_name
Example start up folder location (Windows 10):
C:\Users\USER_NAME\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup
Note direction of the slashes ("\" rather than "/") is important so that your file path is interpreted as a file location, not an internet browser location. By placing such a path in a network drive or as a shortcut in your startup folder your PC should connect to it when it boots.
# Load readxl, installing it first if it is not available
if (!require(readxl)) {
  install.packages("readxl")
  if (!require(readxl)) {
    stop("Unable to install and load readxl")
  }
}
# Define path to data
data_path <- "\\\\company_sharepoint_site\\ssp\\site_name\\sub_site_name\\library_name\\Example.xlsx"
# Pull data
df_employees <- read_xlsx(data_path)

I had exactly the same situation: I wanted to access an Excel file on a SharePoint site from R.
I searched the Internet extensively and did not find anything relevant to my requirement.
I then tried the following:
I mapped the SharePoint folder as a network drive on my local system.
I then opened the Excel file (on the SharePoint site) from my machine without using a web browser.
I copied the network path shown on my system. It is the same as the SharePoint site address, but without https/http.
The path starts with "\\", like the following: "\\sharepoint.test.com\folder\path".
Launch RStudio and select the Import Dataset option in the Environment pane.
Choose 'From Excel'. The 'Import Excel Data' form opens.
In the File/URL field, paste the network path of the SharePoint file (copied from your machine).
Click Import, and the Excel file on SharePoint is imported into R.
Make sure the path contains no URL-encoded characters (such as %20) and that backslashes are used as separators.
When importing the file, enter the folder names exactly as you see them.
For example:
Sharepoint.microsoft.com - SharePoint's domain
Department name - name of the folder
Project name - name of the folder
Sample.xlsx - name of the file
So, the path to import the dataset should be:
"\\Sharepoint.microsoft.com\Department name\Project name\Sample.xlsx".
Thank you!
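The path conversion described in this answer (drop http/https, decode sequences such as %20, use backslashes) can be sketched in R. sharepoint_url_to_unc is a hypothetical helper written for illustration. It relies on the fact that a Windows network (UNC) path begins with two backslashes, and it assumes the library is reachable as a network location.

```r
# Hypothetical helper: turn a SharePoint URL into a Windows network (UNC) path.
sharepoint_url_to_unc <- function(url) {
  # Drop the http:// or https:// scheme.
  no_scheme <- sub("^https?://", "", url)
  # Decode URL-encoded characters, e.g. %20 becomes a space.
  decoded <- utils::URLdecode(no_scheme)
  # Swap forward slashes for backslashes and prepend the leading "\\".
  paste0("\\\\", chartr("/", "\\", decoded))
}
```

Printing the result with cat() shows the path with single backslashes, as it would be pasted into the File/URL field. The same string can be passed to readxl::read_excel() if the network location is accessible.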

Try using the link in this format:
http://site/_layouts/download.aspx?SourceUrl=url-of-document-in-library

If the above doesn't work, try this syntax [note slash directions]:
"\\gov.sharepoint.com@SSL/DavWWWRoot/sites/SomePath/SomePath/SomePath/SomeFile"
See this for more info about syntax and what's going on:
Connect to a site via SSL/DavWWWRoot not usual URL? Why does this make a difference?
