Saving text from webpage for word cloud in R - r

I'm trying to practice making word clouds in R and I've seen the process nicely explained in sites like this (http://www.r-bloggers.com/building-wordclouds-in-r/) and in some videos on YouTube. So I thought I'd pick some random long document to practice myself.
I chose the script for Good Will Hunting. It is available here (https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html). What I did is copy that into Notepad++ and start removing blank lines, names, etc. to try to clean up the data before saving. Saving as a .csv file doesn't seem to be an option so I saved it as a .txt file and R doesn't seem to want to read it in.
Both of the following lines return errors in R.
goodwillhunting <- read.csv("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
goodwillhunting <- read.table("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
My question is based on an html document what is the best way to save it to be read in to be used for something like this? I know with the rvest package you can read in webpages. The tutorials for word clouds have used .csv files so I'm not sure if that's what my end goal needs to be.
This might be a way to read in the data going that route?
test = read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html")
text = html_text(test)
Any help is appreciated!

Here's one way:
library(rvest)
library(wordcloud)
test <- read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/
award_winning/good_will_hunting.html")
text <- html_text(test)
content <- stringi::stri_extract_all_words(text, simplify = TRUE)
wordcloud(content, min.freq = 10, colors = RColorBrewer::brewer.pal(5,"Spectral"))
Which gives:

Here is a simple example:
library(wordcloud)
text = scan("fulltext.txt", character(0), strip.white = TRUE)
frequency_table = as.data.frame(table(text))
wordcloud(frequency_table$text, frequency_table$Freq)

Related

loading raw file from github in R

I am trying to load a codebook from github in R studio. The url is [here][1]. It is a md file based on the link, but I want to load its raw file. (As pic1 shows on the top right this is a tab called raw, and when I click that, it shows pic2).I try to use the link provided, but it does not work. Could anyone help to tell how to do that? Thanks a lot!
cddf<-url("https://github.com/HimesGroup/BMIN503/blob/master/DataFiles/NHANES_2007to2008_DataDictionary.md")
cd<-read.table(cddf )
Update:
[![enter image description here][2]][2]
When I changed the code :
codebook<-read.table("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/NHANES_2007to2008_DataDictionary.md",skip = 4, sep = "|", head = TRUE)
The r successfully read most of them, but the sep "|" did not work for two variables: INDHHIN2 and MCQ010. See pic. Can anyone help to figure out why? Thanks~~!
There are two issues here.
First, the raw file is available at the link https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/NHANES_2007to2008_DataDictionary.md. However, read.table is not going to be able to read that file without some help: read.table is used for tab or comma delimited files, and that's a table marked up for Markdown. This comes close:
read.table("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/NHANES_2007to2008_DataDictionary.md",
skip = 4, sep = "|", head = TRUE)
but it will still need some cleanup, to remove the first and last columns of junk it added, and to delete the first line.

Extract text from multiple PDF-files to a structured data table

I am new to this platform and I hope someone can help me.
I have imported some pdf files into Rstudio using the pdftools library. Now I want to make structured columns of this text. I just can't seem to get the structure right.
This is an example of one file added that I imported. I want to make the yellow shaded lines in a data table.
This is the outcome I would ultimately like to have.
Now I have entered the code below, but I can't get it into a data table.
library(pdftools)
library(stringr)
library(dplyr)
# load the PDF-files into Rstudio
files <- list.files(pattern = "pdf$", full.names = TRUE)
# make a list of the PDF-files
filestext <- lapply(files, pdf_text)
# remove "\n"
filestext <- str_split(filestext, pattern = "\n")
This is the result I get:
Does anyone know the easiest way to solve this?
I would also give https://sensible.so a shot. We have some great documentation and a free plan just for projects like this. Plus, when you sign up there are some tutorials to help you understand how to extract different types of data. I bet you can have this extracted into a clean JSON object in no time.

reading US census products into R

I am trying to import the SIPP 2014 panel data into r but am having some trouble.
It can be found here:
https://www.census.gov/programs-surveys/sipp/data/2014-panel/wave-1.html
Normally, this would be a pretty simple process and I could just use
data = read.csv("pu2014w1.dat")
The issue stems from the size of the dataset and the fact that I do not know what it is separated by nor how the column headers are done. Sadly, I cannot find documentation for importing this file into R.
Any help would be greatly appreciated.
It seems that the file https://thedataweb.rm.census.gov/pub/sipp/2014/pu2014w1.dat.gz
after unzipping, is a fixed-width format text file. So, to read it, we can use
library(readr)
read_delim("https://thedataweb.rm.census.gov/pub/sipp/2014/pu2014w1.sas",
delim = " ", col_names = FALSE, skip = 6) -> foo
fwf_positions(start = foo$X4, end = foo$X6) -> bar
bar[ - c(5223:5231), ] -> bar2
bar3 <- bar2 %>% mutate(width = end - begin)
foobar <- fwf_widths(bar3$width)
read_fwf("pu2014w1.dat.gz", col_positions = foobar)
Note that when reading a fixed-width text file, we need to specify the positions of the fields. I do this by manipulating the contents of the sas input file, which tells us the positions of the fields (for use with SAS). Also, I had to download the gzipped file before I could successfully read it. Typically, I think, one can read directly from the url. I'm not sure why reading from the url didn't work here.

read an Excel file embedded in a website

I would like to read automatically in R the file which is located at
https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017
This link generates the automatic download of a zipfile. This zipfile contains the Excel file I want to read in R.
Does any of you have any suggestions on this? Thanks.
Panagiotis' comment to use download.file() is generally good advice, but I couldn't make it work here (and would be curious to know why). Instead I used httr.
(Edit: got it, I reversed args of download.file()... Repeat after me: always use named args...)
Another problem with this data: it appears not to be a regular xls file, I couldn't open it with the yet excellent readxl package.
Looks like a tab separated flat file, but no success with read.table() either. readr::read_delim() made it.
library(httr)
library(readr)
r <- GET("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# Write the archive on disk
writeBin(r$content, "./data/rte_data")
rte_data <-
read_delim(
unzip("./data/rte_data", exdir = "./data/"),
delim = "\t",
locale = locale(encoding = "ISO-8859-1"),
col_names = TRUE
)
There still are parsing problems, but not sure they should be dealt with in this SO question.

In R , Reading a PDF document [duplicate]

Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext which I had reasonable success with. By default, it creates foo.txt from a give foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
then you'll have pdf lines in an array.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The tabula PDF table extractor app is based around a command line application based on a Java JAR package, tabula-extractor.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information
Set path to pdftotxt.exe and convert pdf to text
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"
for(i in 1:length(pdfFracList)){
fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
pdfSource <- paste0(reportDir,"/", fileNumber, ".pdf")
txtDestination <- paste0(reportDir,"/", fileNumber, ".txt")
print(paste0("File number ", i, ", Processing file ", pdfSource))
system(paste(exeFile, "-table" , pdfSource, txtDestination, sep = " "), wait = TRUE)
}

Resources