Extract text from multiple PDF-files to a structured data table

Extract text from multiple PDF-files to a structured data table - r

I am new to this platform and I hope someone can help me.
I have imported some pdf files into Rstudio using the pdftools library. Now I want to make structured columns of this text. I just can't seem to get the structure right.
This is an example of one file added that I imported. I want to make the yellow shaded lines in a data table.
This is the outcome I would ultimately like to have.
Now I have entered the code below, but I can't get it into a data table.
library(pdftools)
library(stringr)
library(dplyr)
# load the PDF-files into Rstudio
files <- list.files(pattern = "pdf$", full.names = TRUE)
# make a list of the PDF-files
filestext <- lapply(files, pdf_text)
# remove "\n"
filestext <- str_split(filestext, pattern = "\n")
This is the result I get:
Does anyone know the easiest way to solve this?

I would also give https://sensible.so a shot. We have some great documentation and a free plan just for projects like this. Plus, when you sign up there are some tutorials to help you understand how to extract different types of data. I bet you can have this extracted into a clean JSON object in no time.

Related

How do you download data from API and export it into a nice CSV file to query?

I am trying to figure out how to 'download' data into a nice CSV file to be able to analyse.
I am currently looking at WHO data here:
I am doing so through following documentation and getting output like so:
test_data <- jsonlite::parse_json(url("http://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple"))
head(test_data)
This gives me a rather messy list of list of lists.
For example:
I get this
It is not very easy to analyse and rather messy. How could I clean this up by using say two columns that is returned from this json_parse, information only from say dim like REGION, YEAR, COUNTRY and then the values from the column Value. I would like to make this into a nice dataframe/CSV file so I can then more easily understand what is happening.
Can anyone give any advice?

jsonlite::fromJSON gives you data in a better format and the 3rd element in the list is where the main data is.
url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- jsonlite::fromJSON(url)
data <- tmp[[3]]

Create data tables using SPSS in R

Using expss package I am creating cross tabs by reading SPSS files in R. This actually works perfectly but the process takes lots of time to load. I have a folder which contains various SPSS files(usually 3 files only) and through R script I am fetching the last modified file among the three.
setwd('/file/path/for/this/file/SPSS')
library(expss)
expss_output_viewer()
#get all .sav files
all_sav <- list.files(pattern ='\\.sav$')
#use file.info to get the index of the file most recently modified
pass<-all_sav[with(file.info(all_sav), which.max(mtime))]
mydata = read_spss(pass,reencode = TRUE) # read SPSS file mydata
w <- data.frame(mydata)
args <- commandArgs(TRUE)
Everything is perfect and works absolutely fine but it generally takes too much time to load large files(112MB,48MB for e.g) which isn't good.
Is there a way I can make it more time-efficient and takes less time to create the table. The dropdowns are created using PHP.
I have searched for this and found another library called 'haven' but I am not sure whether that can give me significance as well. Can anyone help me with this? I would really appreciate that. Thanks in advance.

As written in the expss vignette (https://cran.r-project.org/web/packages/expss/vignettes/labels-support.html) you can use in the following way:
# we need to load packages strictly in this order to avoid conflicts
library(haven)
library(expss)
spss_data = haven::read_spss("spss_file.sav")
# add missing 'labelled' class
spss_data = add_labelled_class(spss_data)

Create GIF or video from external jpg in R

I wanted to do something I thought simple, but... well, I failed repetitively.
I want to create a gif or a video (.avi) from pictures in a folder with R.
I can list the path and the names of the pictures (e.g. "./folder/1.jpg" "./folder/2.jpg" "./folder/3.jpg" "./folder/4.jpg" )
I just wanted to put them the one after the other and create a video or gif file (I will treat them frame by frame later, so the speed is not important)
I found a solution with SaveGIF, it works with plots in R but I didn't find the way to use it with external jpg.
Otherwise, there was this solution with image_animate "Animated graphics", but again, I didn't manage.
Do somebody already have a solution to do that?
Thank you very much for your help!

You can do it with the magick package, which gives you access to ImageMagick functions. For example, if the frames of your movie are in files named
frames <- paste0("folder/", 1:100, ".jpg")
then you would create a movie using
library(magick)
m <- image_read(frames)
m <- image_animate(m)
image_write(m, "movie.gif")
You could choose to write to other formats as well, just by changing the filename, or using other arguments to image_write().

A better solution in my oppinion is using the av package.
example:
list_of_frames <- list.files("your/path/to/directory", full.names=T)
av::av_encode_video(list_of_frames, framerate = 30,
output = "output_animation.mp4")

Saving text from webpage for word cloud in R

I'm trying to practice making word clouds in R and I've seen the process nicely explained in sites like this (http://www.r-bloggers.com/building-wordclouds-in-r/) and in some videos on YouTube. So I thought I'd pick some random long document to practice myself.
I chose the script for Good Will Hunting. It is available here (https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html). What I did is copy that into Notepad++ and start removing blank lines, names, etc. to try to clean up the data before saving. Saving as a .csv file doesn't seem to be an option so I saved it as a .txt file and R doesn't seem to want to read it in.
Both of the following lines return errors in R.
goodwillhunting <- read.csv("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
goodwillhunting <- read.table("C:/Users/MyName/Desktop/goodwillhunting.txt", sep="", stringsAsFactors=FALSE)
My question is based on an html document what is the best way to save it to be read in to be used for something like this? I know with the rvest package you can read in webpages. The tutorials for word clouds have used .csv files so I'm not sure if that's what my end goal needs to be.
This might be a way to read in the data going that route?
test = read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/award_winning/good_will_hunting.html")
text = html_text(test)
Any help is appreciated!

Here's one way:
library(rvest)
library(wordcloud)
test <- read_html("https://finearts.uvic.ca/writing/websites/writ218/screenplays/
award_winning/good_will_hunting.html")
text <- html_text(test)
content <- stringi::stri_extract_all_words(text, simplify = TRUE)
wordcloud(content, min.freq = 10, colors = RColorBrewer::brewer.pal(5,"Spectral"))
Which gives:

Here is a simple example:
library(wordcloud)
text = scan("fulltext.txt", character(0), strip.white = TRUE)
frequency_table = as.data.frame(table(text))
wordcloud(frequency_table$text, frequency_table$Freq)

Quotation issues reading data into R

I have some data from and I am trying to load it into R. It is in .csv files and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada data available here).
The data is coded in an unusual manner. A typical line is:
12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11
There is a " on the end of the Central-Nova but not at the beginning. So in order to read in the data, I suppressed the quotes, which worked fine for the first few files. ie.
test<-read.csv("pollresults_resultatsbureau11001.csv",header = TRUE,sep=",",fileEncoding="latin1",as.is=TRUE,quote="")
Now here is the problem: in another file (eg. pollresults_resultatsbureau12002.csv), there is a line of data like this:
12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28
Because I need to suppress the quotes, the entry "Pictou, Subd. A" makes R wants to split this into 2 variables. The data can't be read in since it wants to add a column half way through constructing the dataframe.
Excel and OpenOffice both can open these files no problem. Somehow, Excel and OpenOffice know that quotation marks only matter if they are at the beginning of a variable entry.
Do you know what option I need to enable on R to get this data in? I have >300 files that I need to load (each with ~1000 rows each) so a manual fix is not an option...
I have looked all over the place for a solution but can't find one.

Building on my comments, here is a solution that would read all the CSV files into a single list.
# Deal with French properly
options(encoding="latin1")
# Set your working directory to where you have
# unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")
# Get the file names
temp <- list.files()
# Extract the 5-digit code which we can use as names
Codes <- gsub("pollresults_resultatsbureau|.csv", "", temp)
# Read all the files into a single list named "pollResults"
pollResults <- lapply(seq_along(temp), function(x) {
T0 <- readLines(temp[x])
T0[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', T0[-1])
final <- read.csv(text = T0, header = TRUE)
final
})
names(pollResults) <- Codes
You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).
Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".
invisible(lapply(seq_along(pollResults), function(x) {
NewFilename <- paste("Corrected", temp[x], sep = "_")
write.csv(pollResults[[x]], file = NewFilename,
quote = TRUE, row.names = FALSE)
}))
Hope this helps!

This answer is mainly to #AnandaMahto (see comments to the original question).
First, it helps to set some options globally because of the french accents in the data:
options(encoding="latin1")
Next, read in the data verbatim using readLines():
temp <- readLines("pollresults_resultatsbureau13001.csv")
Following this, simply replace the first comma in each line of data with a comma+quotation. This works because the first field is always 5 characters long. Note that it leaves the header untouched.
temp[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', temp[-1])
Penultimately, write over the original file.
fileConn<-file("pollresults_resultatsbureau13001.csv")
writeLines(temp,fileConn)
close(fileConn)
Finally, simply read the data back into R:
data<-read.csv(file="pollresults_resultatsbureau13001.csv",header = TRUE,sep=",")
There is probably a more parsimonious way to do this (and one that can be iterated more easily) but this process made sense to me.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex