The package for O'Reilly's new Learning R book (called "learningr") does not work in R v3. Fortunately, the dataset I want from the package is available on the package's GitHub page here, as english_monarchs.rda.
However, for the life of me I cannot figure out how to download the rda file. This is my best attempt:
> library(RCurl)
>
> x <- getURL("https://github.com/richierocks/learningr/blob/master/data/english_monarchs.rda"); x
[1] "\n\n\n<!DOCTYPE html>\n<html>\n <head prefix=\"og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# githubog: http://ogp.me/ns/fb/githubog#\">\n <meta charset='utf-8'>\n <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n <title>learningr/data/english_monarchs.rda at master · richierocks/learningr · GitHub</title>\n <link rel=\"search\" type=\"application/opensearchdescription+xml\" href=\"/opensearch.xml\" title=\"GitHub\" />
It goes on like this through all the HTML of the page; I cut it short since you get the point. I get the HTML but not the file itself.
Any help would be much appreciated.
Did you try clicking on "View Raw"?
There may be a better way to do this, but if you want to do this entirely automatically/within R:
library(RCurl)
## paste URL to make it easier to read code (cosmetic!)
dat_url <- paste0("https://raw.github.com/richierocks/",
                  "learningr/master/data/english_monarchs.rda")
f <- getBinaryURL(dat_url)
L <- load(rawConnection(f))
(To deal with the redirection, I downloaded the file in Firefox and then asked Firefox to copy the actual download link.)
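If you'd rather avoid RCurl entirely, base R's download.file() should also work here (a minimal sketch, assuming the raw URL built above):

# download the .rda in binary mode, then load it into the workspace
download.file(dat_url, destfile = "english_monarchs.rda", mode = "wb")
load("english_monarchs.rda")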
By the way, are you sure learningr doesn't work with R 3.+ ? I followed the installation instructions at https://github.com/richierocks/learningr/blob/master/README.md with R-devel and they seemed to work fine ...
I need to extract the body text from my corpus for text mining, as the documents currently include their reference sections, which bias my results. All coding is performed in R using RStudio. I have tried many techniques.
I have text mining code (of which only the first bit is included below), but recently found out that simply text mining a corpus of research articles is insufficient as the reference section will bias results; reference sections alone may provide another analysis, which would be a bonus.
EDIT: perhaps there is an R package that I am not aware of
My initial plan was to clean the text after converting from PDF to text, using regex commands within quanteda. As a reference I was intending to follow https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005962&rev=1 . Their method confuses me, not just in writing a parallel regex, but in how to recognize the last reference section so that portions of the text are not cut off when the word "reference" appears earlier in the document; I have been in contact with their team, but am waiting to learn more about their code since it appears they use a streamlined program now.
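For illustration, here is the sort of minimal regex sketch I have in mind (it assumes the heading sits on its own line, which may not hold for every article):

library(stringi)

# cut each plain-text article at the LAST 'References'/'Reference' heading,
# so earlier mentions of the word do not truncate the body text
strip_reference_section <- function(txt) {
  hit <- stri_locate_last_regex(txt, "(?mi)^\\s*references?\\s*$")
  if (!is.na(hit[1, "start"])) txt <- substr(txt, 1, hit[1, "start"] - 1)
  txt
}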
PubChunks and LAPDF-text were my next two options, the latter of which is referenced in the paper above. To use the PubChunks package I need to convert all of my PDF (now converted to text) files into XML. This should be straightforward, but the packages I found (fileToPDF, pdf2xml, trickypdf) did not appear to work; this seems to be a within-R concern. (Code relating to trickypdf is included below.)
For LAPDF-text, ...[see edit]... the code did not seem to run properly. There are also very limited resources out there for this package in terms of guides etc., and the developers have shifted their focus to a larger package, written in a different language, that does happen to include LAPDF-text.
EDIT: I installed Java 1.6 (SE 6) and Maven 2.0, then ran the LAPDF-text installer, which seemed to work. That being said, I am still having issues with this process and with mvn commands recognizing folders, though I am continuing to work through it.
I am guessing there is someone else out there who has done this before and got their hands dirty, since there are related research papers with similarly vague descriptions of the process. Any recommendations are greatly appreciated.
Cheers
library(quanteda)
library(pdftools)
library(tm)
library(methods)
library(stringi) # regex pattern
library(stringr) # simpler than stringi ; uses stringi on backend
setwd('C:\\Users\\Hunter S. Baggen\\Desktop\\ZS_TestSet_04_05')
files <- list.files(pattern = 'pdf$')
summary(files)
files
# Length 63
corpus_tm <- Corpus(URISource(files),
                    readerControl = list(reader = readPDF()))
corpus_tm
# documents 63
inspect(corpus_tm)
meta(corpus_tm[[1]])
# convert tm::Corpus to quanteda::corpus
corpus_q <- corpus(corpus_tm)
summary(corpus_q, n = 2)
# Add Doc-level Variables here *by folder and meta-variable year
corpus_q
head(docvars(corpus_q))
metacorpus(corpus_q)
#_________
# extract segments ~ later to remove segments
# corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)
corpus_q_refA <- corpus_reshape(corpus_q, to = "paragraphs", showmeta = TRUE)
corpus_q_refA
# Based upon Westergaard et al (15 Million texts; removing references)
corpus_q_refB <- corpus_trim(corpus_q, what = c('sentences'),
                             exclude_pattern = '^\\[\\d+\\]\\s[A-Za-z]')
corpus_q_refB # note: backslashes in the regex must be doubled inside an R string
corpus_tm[1]
sum(str_detect(corpus_q, '^Referen'))
corpus_qB <- corpus_q
RemoveRef_B <- corpus_segment(corpus_q, pattern = 'Reference', valuetype = 'regex')
cbind(texts(RemoveRef_B), docvars(corpus_qB))
# -------------------------
# Idea taken from guide (must reference guide)
setGeneric('removeCitations', function(object, ...) standardGeneric('removeCitations'))
setMethod('removeCitations', signature(object = 'PlainTextDocument'),
          function(object, ...) {
            txt <- content(object)
            # remove citations starting with '>'
            # e.g. for '>'  : citations <- grep('^[[:blank:]]*>.*', txt); if (length(citations) > 0) txt <- txt[-citations]
            # e.g. for '--' : signatureStart <- grep('^-- $', txt); if (length(signatureStart) > 0) txt <- txt[-(signatureStart:length(txt))]
            # using the Westergaard (15 million texts) removal guideline
            citations <- grep('^\\[\\d+\\]\\s[A-Za-z]', txt)
            if (length(citations) > 0) txt <- txt[-citations]
            content(object) <- txt
            object
          })
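If the method above can be made to work, I assume applying it across the tm corpus would look something like this (an untested sketch on my part):

# apply the citation remover to every PlainTextDocument in the tm corpus (untested)
corpus_tm_clean <- tm_map(corpus_tm, removeCitations)
# rebuild the quanteda corpus from the cleaned tm corpus
corpus_q_clean <- corpus(corpus_tm_clean)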
# TRICKY PDF download from github
library(pubchunks)
library(polmineR)
library(githubinstall)
library(devtools)
library(tm)
githubinstall('trickypdf') # input Y then 1 if want all related packages
# library(trickypdf)
# This time suggested I install via 'PolMine/trickypdf'
# Second attempt issue with RPoppler
install_github('PolMine/trickypdf')
library(trickypdf) # Not working
# Install failed: package 'Rpoppler' is not available (for R version 3.6.0)
Aside from the Rpoppler issue above, the initial description should be sufficient.
UPDATE: Having reached out to several research groups, the TALN-UPF researchers got back to me and provided a pdfx Java program that has allowed me to convert my PDFs easily into XML. Of course, I have now learned that PubChunks is built around its sister program, which retrieves XML from search engines, and is therefore of little use to me. That said, the TALN-UPF group will hopefully advise whether I can extract the body of each text via their other programs (Dr Inventor and Grobid). If that is possible, everything will be accomplished; if not, I will be back to regex.
So, here's the current situation:
I have 2000+ lines of R code that produces a couple dozen text files. This code runs in under 10 seconds.
I then manually paste each of these text files into a website, wait ~1 minute for the website's response (they're big text files), then manually copy and paste the response into Excel, and finally save them as text files again. This takes hours and is prone to user error.
Another ~600 lines of R code then combines these dozens of text files into a single analysis. This takes a couple of minutes.
I'd like to automate step 2--and I think I'm close, I just can't quite get it to work. Here's some sample code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
The code runs and every time I've done it "balcoResults" comes back with "Status: 200". Success! EXCEPT the file size is 0...
I don't know where the problem is, but my best guess is that the text block isn't getting filled out before the form is submitted. If I go to the website (http://hess.ess.washington.edu/math/v3/v3_age_in.html) and manually submit an empty form, it produces a blank webpage: pure white, nothing on it.
The problem with this potential explanation (and with me fixing the code) is that I don't know why the text block wouldn't be filled out. The result of set_values tells me that "text_block" has 120 characters in it, which is the correct length for textString. I don't know why these 120 characters wouldn't be pasted into the web form.
An alternative possibility is that R isn't waiting long enough to get a response from the website, but this seems less likely because a single sample (as here) runs quickly and the status code of the response is 200.
Yesterday I took the DataCamp course on "Working with Web Data in R." I've explored GET and POST from the httr package, but I don't know how to pick apart the GET response to modify the form and then have POST submit it. I've considered trying the package RSelenium, but according to what I've read, I'd have to download and install a "Selenium Server". This intimidates me, but I could probably do it -- if I was convinced that RSelenium would solve my problem. When I look on CRAN at the function names in the RSelenium package, it's not clear which ones would help me. Without firm knowledge for how RSelenium would solve my problem, or even if it would, this seems like a poor return on the time investment required. (But if you guys told me it was the way to go, and which functions to use, I'd be happy to do it.)
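For what it's worth, my rough understanding of the httr pattern is something like the following (a sketch only; the form_action line is my assumption about where rvest stores the form's action URL):

library(httr)
library(rvest)

url  <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
form <- html_form(read_html(url))[[1]]
form_action <- form$url   # assumption: older rvest stores the form's action URL here

# POST the same fields as a regular form submission
resp <- POST(form_action,
             body = list(text_block = textString, summary = "no"),
             encode = "form")
calcOutput <- content(resp, as = "text")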
I've explored SO for fixes, but none of the posts that I've found have helped. I've looked here, here, and here, to list three.
Any suggestions?
After two days of thinking, I spotted the problem: I didn't assign the result of the set_values function back to a variable (if that's the right R terminology).
Here's the corrected code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
balcoForm <- set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
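From here, looping over the input files should be straightforward; a rough sketch (the folder name, output naming, and the pause length are assumptions):

library(xml2)
library(rvest)

url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
inputFiles <- list.files("input_blocks", pattern = "txt$", full.names = TRUE)  # hypothetical folder

for (f in inputFiles) {
  textString   <- paste(readLines(f), collapse = " ")
  balcoForm    <- html_form(read_html(url))[[1]]
  balcoForm    <- set_values(balcoForm, summary = "no", text_block = textString)
  balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
  # save the raw response next to the input file
  writeLines(httr::content(balcoResults$response, as = "text"),
             sub("\\.txt$", "_out.html", f))
  Sys.sleep(5)  # small courtesy pause between submissions (assumption)
}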
I am trying to link up my excel data set to R for statistical analysis. I am running on OSX Sierra (10.12.6) with R studio (1.0.153) and Java 8 (update 144).
The function "read_excel" was able to open my excel document a week ago. When I moved the excel and the R document together to another folder, it no longer worked. Reloading the libraries has had no effect. After multiple attempts (and restarting R studio and computer), something finally worked but function "lmer" was no longer found. After reloading library "lme4", "read_excel" no longer worked!
I have also tried using "read.xlsx" and "readWorksheet(loadWorkbook(...))", which didn't work. "read.csv" also did not work properly since the commas were creating disorganized columns and I am dealing with a larger excel workbook with ongoing changes.
Reading the Stack Overflow question Importing .xlsx file into R has not resolved my issue. Please help!
Libraries loaded:
library(multcomp)
library(nlme)
library(XLConnect)
library(XLConnectJars)
library(lme4)
library(car)
library(rJava)
library(xlsx)
library(readxl)
R data file:
Dataset <- read_excel("Example.xlsx",sheet="testing")
#alternative line: Dataset <- read.xlsx("~/Desktop/My Stuff/Sample/Example.xlsx", sheet=7)
Dataset$AAA <- as.factor(Dataset$AAA)
Dataset$BBB <- as.factor(Dataset$BBB)
Dataset$CCC <- as.numeric(Dataset$CCC)
Dataset$DDD <- as.numeric(Dataset$DDD)
Dataset_lme = lmer(CCC ~ AAA + BBB + (1|DDD), data=Dataset)
Although you loaded the library, try calling the function explicitly as readxl::read_excel(path = "yourPath", sheet = 1), or even remove the sheet reference; it will automatically take the first sheet.
Also, when you moved the Excel and R files to another folder, the path probably needed to change as well.
Try changing the path, or replace it with file.choose() and select the Excel file manually.
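For example, a minimal sketch (the path is the one from the question, so adjust it to wherever the file now lives):

library(readxl)
# full path to the workbook after the move ...
Dataset <- readxl::read_excel("~/Desktop/My Stuff/Sample/Example.xlsx", sheet = "testing")
# ... or pick the file interactively
Dataset <- readxl::read_excel(file.choose(), sheet = "testing")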
You loaded the package "xlsx", which can do what you need. Maybe you're calling it incorrectly.
Dataset <- read.xlsx("Example.xlsx",sheetName="testing")
or
Dataset <- read.xlsx("Example.xlsx", sheetIndex = 1)  # sheetIndex is the numeric position of the sheet
I hope it helps.
Try loading library(tidyverse) and library(readxl), then use read_excel(). This should work.
Summary: my ultimate goal is to use rCharts, and specifically Highcharts, as part of a ReporteRs PowerPoint report automation workflow. One of the charts I would like to use is rendered as HTML in the Viewer pane in RStudio, and addPlot(function() print(myChart)) does not add it to the PowerPoint. As a workaround, I decided to try to save myChart to disk, from where I could add it to the PowerPoint.
So my question is really, How do I get my html image into my ReporteRs workflow? Either getting it saved to a disk, or getting it to be readable by ReporteRs would solve my problem.
This question is really the same as this one, but I'm using rCharts, specifically the example found here:
#if the packages are not already installed
install.packages('devtools')
require(devtools)
install_github('rCharts', 'ramnathv')
#code creates a radar chart using Highcharts
library(rCharts)
#create dummy dataframe with numbers ranging from 0 to 1
df<-data.frame(id=c("a","b","c","d","e"),val1=runif(5,0,1),val2=runif(5,0,1))
#multiply numbers by 100 to get percentages
df[,-1]<-df[,-1]*100
myChart <- Highcharts$new()
myChart$chart(polar = TRUE, type = "line",height=500)
myChart$xAxis(categories=df$id, tickmarkPlacement= 'on', lineWidth= 0)
myChart$yAxis(gridLineInterpolation= 'circle', lineWidth= 0, min= 0,max=100,endOnTick=T,tickInterval=10)
myChart$series(data = df[,"val1"],name = "Series 1", pointPlacement="on")
myChart$series(data = df[,"val2"],name = "Series 2", pointPlacement="on")
myChart
So if I try
> png(filename="~/Documents/name.png")
> plot(myChart)
Error in as.double(y) :
cannot coerce type 'S4' to vector of type 'double'
> dev.off()
I get that error.
I've looked into Highcharts documentation, as well as many other potential solutions that seem to rely on Javascript and phantomjs. If your answer relies on phantomjs, please assume I have no idea how to use it. webshot is another package I found which is even so kind as to include an install_phantomjs() function, but from what I could find, it requires you to turn your output into a Shiny object first.
My question is really a duplicate of this one, which is not itself a duplicate of this other one, because that question is about how to embed the HTML output in R Markdown, not about saving it as a file on the hard drive.
I also found this unanswered question which is also basically the same.
edit: as noted by @hrbrmstr and scores of others, radar charts are not always the best visualization tools. I find myself required to make one for this report.
The answer turned out to be in the webshot package. @hrbrmstr provided the following code, which would be run at the end of the code I posted in the question:
# If necessary
install.packages("webshot")
library(webshot)
install_phantomjs()
# Main code
myChart$save("/tmp/rcharts.html")
webshot::webshot("/tmp/rcharts.html", file="/tmp/out.png", delay=2)
This saves the plot to the folder as an html, and then takes a picture of it, which is saved as a png.
I can then run the ReporteRs workflow by using addImage(mydoc, "/tmp/out.png").
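For completeness, the ReporteRs step itself would look roughly like this (the slide layout name and output file name are assumptions; use whatever layout your template provides):

library(ReporteRs)
mydoc <- pptx()                                               # new PowerPoint document
mydoc <- addSlide(mydoc, slide.layout = "Title and Content")  # layout name is an assumption
mydoc <- addImage(mydoc, "/tmp/out.png")                      # the png produced by webshot
writeDoc(mydoc, file = "radar_report.pptx")                   # hypothetical output file name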
I am using R to extract HTML Tables from a website.
However, the language for the HTML Table is in Hindi and the text is displayed as unicodes.
Is there any way I can set/install the font family and get the actual text instead of the Unicode escape codes?
The code I follow is :
library('XML')
table<-readHTMLTable(<the html file>)
n.rows <- unlist(lapply(table, function(t) dim(t)[1]))
table[[which.max(n.rows)]]
The sample site is : http://mpbhuabhilekh.nic.in/bhunakshaweb/reports/mpror.jsp?base=wz/CP8M/wj/DP8I/wz/CoA==&vsrno=26-03-02-00049-082&year=2013&plotno=71
The output comes as :
"< U+092A>"
etc.
Note: For some reason, readHTMLTable works only when I remove the first two unwanted tables in the HTML file. So if you want to test with the file, please edit out the first two tables, or simply delete the first two table headers from the file.
Any help will be appreciated. Thanks
update:
The issue seems to be related to the locale set in R on Windows machines. I am unable to figure out how to get it working, though!
The solution I have found for this locale-related bug is to set the corresponding encoding.
library('XML')
table<-readHTMLTable(<the html file>)
n.rows <- unlist(lapply(table, function(t) dim(t)[1]))
output <- table[[which.max(n.rows)]]
for (n in names(output)) Encoding(levels(output[[n]])) <-"UTF-16"
The output in the R console might still look like gibberish, but the advantage is that once you export the dataset (say, to a CSV), it will all appear in Hindi in other editors.
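For example, exporting with an explicit encoding (the file name here is hypothetical):

# write the re-encoded table to disk; open the CSV in an editor that supports UTF-8
write.csv(output, "mpror_table.csv", row.names = FALSE, fileEncoding = "UTF-8")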