I'm trying to pull a numerical value from a chart that's embedded in a PDF.
I tried the two methods below, but while every other piece of information converted to xlsx fine, the line chart data never came through.
Link to the pdf:
http://blog.mass.gov/publichealth/wp-content/uploads/sites/11/2018/01/Weekly-Flu-Report-01-19-2018.pdf
The value that I need to pull into a variable comes from that line chart.
1st Method
library(pdftools)
library(stringr)
library(xlsx)
set.seed(100)
tx <- pdf_text("flureport.pdf")                      # one string per page
tx2 <- unlist(str_split(tx, "[\\r\\n]+"))            # split pages into lines
tx3 <- str_split_fixed(str_trim(tx2), "\\s{2,}", 5)  # split lines into columns on runs of spaces
write.xlsx(tx3, file = "ds.xlsx")
2nd Method
library(tm)
file <- 'flureport.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))    # reader that keeps the page layout
corpus <- VCorpus(URISource(file),
                  readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])        # character vector of text lines
c <- data.frame(corpus.array)
write.xlsx(c, file = "x.xlsx")
Neither of the xlsx files I wrote contained any of the chart information, so I couldn't fetch the value from them.
This is the solution that worked for me. I'm not sure it would work in all cases, but it did work in this particular one.
Thanks @user2554330 for mentioning OCR.
library(pdftools)
library(stringr)
library(tesseract)
library(magick)
library(magrittr)

list <- c('http://blog.mass.gov/publichealth/wp-content/uploads/sites/11/2018/01/Weekly-Flu-Report-01-19-2018.pdf')

# render every page of each PDF to a 300 dpi PNG
sapply(list, function(x)
  pdf_convert(x, format = "png", pages = NULL, filenames = NULL, dpi = 300, opw = "", upw = "", verbose = TRUE))

# pre-process the page image and run OCR on it
text <- image_read("Weekly-Flu-Report-01-19-2018_1.png") %>%
  image_resize("2000") %>%
  image_convert(colorspace = 'gray') %>%
  image_trim() %>%
  image_ocr()

a <- print(text)
# pull out every percentage (digits, optional decimals, then " %") that OCR found on the page
massili <- regmatches(a, gregexpr("\\d+(\\.\\d+){0,1} %", a))[[1]]
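From there, getting the number itself into a variable is just a matter of stripping the " %" suffix. A minimal follow-up, assuming the first match is the value I was after:
flu_value <- as.numeric(sub(" %", "", massili[1]))   # e.g. "4.1 %" -> 4.1
flu_value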
Related
Using R, I want to read the files in a folder, perform some operations on each one, plot, and save the result as csv1.
Then read the next file, perform the same operations, plot, and append the new modified dataframe to csv1 with rbind. Note that I want one plot built from all the files read in the for loop, saved as a PDF.
Right now I am using the following code, but my system crashes due to a shortage of RAM.
library(dplyr)
library(data.table)
library(stringr)

all_paths <-
  list.files(path = "/work/newplots",
             pattern = "*.*",
             full.names = TRUE)

all_filenames <- all_paths %>%
  basename() %>%
  as.list()

# read every file into memory up front
all_content <-
  all_paths %>%
  lapply(read.table,
         header = TRUE,
         skip = 60,
         sep = ',',
         encoding = "UTF-8")

file <- data.frame()
for (i in 1:length(all_filenames)) {
  all_lists <- mapply(c, all_content, i, SIMPLIFY = FALSE)
  data <- rbindlist(all_lists, fill = T)
  names(data)[1] <- "File.Path"
  x1 <- data %>% select(V1) %>% unique()
  data <- data %>%
    data.frame(str_split_fixed(data$File.Path, " ", 23)) %>%
    select(-c(File.Path)) %>%
    filter(X1 == 'Interactions')
  data <- cbind(x1, data)
  data <- data %>% select(-c(2)) %>% select(V1, X2)
  data$X2 <- as.numeric(data$X2)
  file <- write.table(data, "/work/con1_10.csv", row.names = FALSE)
  file <- append(file, data)
  p <- plot(data$X2, xlab = "Cycle number", ylab = "Interactions", type = "p")
  print(p)
  Z <- (2 * data$X2) / 20006
  px <- plot(Z, xlab = "Cycle number", ylab = "Z")
  print(px)
}
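For what it's worth, a minimal, untested sketch of a lower-memory pattern: process one file at a time, append each cleaned result to the CSV as you go, and keep a single pdf() device open so every plot lands in one file. The paths and read.table arguments are taken from the code above; the assumption that column 2 holds the interaction counts, and the per-file cleaning placeholder, are mine:
all_paths <- list.files("/work/newplots", full.names = TRUE)

pdf("/work/all_plots.pdf")   # one graphics device collects every plot
first <- TRUE
for (path in all_paths) {
  d <- read.table(path, header = TRUE, skip = 60, sep = ",", encoding = "UTF-8")
  # ... the same per-file cleaning as above would go here ...
  write.table(d, "/work/con1_10.csv", row.names = FALSE,
              append = !first, col.names = first)   # header only for the first file
  first <- FALSE
  plot(d[[2]], xlab = "Cycle number", ylab = "Interactions", type = "p")
  rm(d); gc()                                       # release this file before reading the next one
}
dev.off()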
I am new to R as of today, so this may be simple, but I cannot find a solution anywhere.
I am trying to loop through .xlsx files, format them, and then bind them into one dataset (I think that is what it's called). It works; however, a few files have different row names and numbers of rows, which breaks the formatting and ends the loop. I would like those files to be ignored when the formatting and binding happen, ideally printing the name of each skipped file to the console.
This is what I have so far; feel free to ask questions if I don't make sense.
library(readxl)
library(tidyr)
library(tidyverse)
library(janitor)

setwd("~:/Users/sam/Desktop/Information_engineering/traffic")
my_files <- list.files(pattern = ".xlsx", recursive = TRUE)

traffic_congestion = lapply(my_files, function(i){
  my_data = read_excel(i, sheet = 1, range = "A6:H1446")
  my_data_location = read_excel(i, sheet = 1, range = "A1:A2")
  my_data <- na.omit(my_data)
  my_data <- pivot_longer(my_data, cols = 2:8, names_to = "Date_day", values_to = "Amount")
  colnames(my_data) <- c("times", "dates", "Amount")
  my_data$dates <- excel_numeric_to_date(as.numeric(my_data$dates))
  my_data$times <- strftime(as.Date(my_data$times), format = "%H:%M:%S")
  my_data <- my_data %>% mutate(location = my_data_location[1])
  my_data
})
traffic_congestion = do.call("rbind.data.frame", traffic_congestion)
This is the top of the spreadsheet that I want to bind
This is the top of the spreadsheet that I don't want to bind
Try with:
library(data.table)
rbindlist(traffic_congestion, fill = TRUE)
With fill = TRUE, list elements that are missing some columns are still bound, and the missing columns are padded with NA.
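For example, with two small hypothetical data frames that only partially share columns:
library(data.table)
a <- data.frame(x = 1, y = 2)
b <- data.frame(x = 3, z = 4)
rbindlist(list(a, b), fill = TRUE)   # the missing y/z cells come back as NA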
I want to create a bigram wordcloud in R with the tau package.
I got the bigrams as numeric values in a list, so I converted them to a matrix, but it has no column names. I want them in a data frame so that I can create a bigram wordcloud with them.
Please find my code below and suggest a way out.
library(tau)
library(tm)

speech1 = Corpus(VectorSource(speech))
myDTM = TermDocumentMatrix(speech1, control = list(minWordLength = 1))
bigrams = textcnt(speech1, n = 2, method = "string")
bigrams = bigrams[order(bigrams, decreasing = TRUE)]
n = as.matrix(bigrams)
Please suggest a way to create a wordcloud from the bigrams; I was unable to do it with the weka package.
If the goal is the wordcloud, then check out this page: http://www.rpubs.com/rgcmme/PLN-09, and here is a small example adapted from it:
library(tm)
library(wordcloud)
# sample data
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
speech <- readLines(filePath)
speech1 = Corpus(VectorSource(speech))
myDTM = TermDocumentMatrix(speech1, control = list(minWordLength = 1))
myDTM_mat <- as.matrix(myDTM)
myDTM_mat_sorted <- sort(rowSums(myDTM_mat), decreasing = TRUE)
myDTM_df <- data.frame(word = names(myDTM_mat_sorted), freq = myDTM_mat_sorted)
wordcloud(myDTM_df$word,
          myDTM_df$freq,
          max.words = 100,
          random.order = FALSE)
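Since the question is specifically about bigrams, here is a hedged sketch of feeding the tau::textcnt() counts straight into wordcloud(); it assumes speech is the character vector read above:
library(tau)
library(wordcloud)

bigrams <- textcnt(speech, n = 2, method = "string")   # named counts, one entry per bigram
bigram_df <- data.frame(word = names(bigrams),
                        freq = as.integer(bigrams),
                        stringsAsFactors = FALSE)
bigram_df <- bigram_df[order(bigram_df$freq, decreasing = TRUE), ]

wordcloud(bigram_df$word,
          bigram_df$freq,
          max.words = 100,
          random.order = FALSE)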
I'm currently trying to create a function that will read many PDF files into a data frame. My ultimate goal is to have it read specific information from the PDF files and convert it into a data.frame with one insurance plan per row and columns holding the information I need, such as the individual plan price, family plan price, etc. I have been following an answer given for a similar question in the past. However, I keep getting an error. Here is a link to two different files I am practicing on (1 and 2).
My code and the error are below:
library(tm)

PDFtoDF = function(file) {
  dat = readPDF(control = list(text = "-layout"))(elem = list(uri = file),
                                                  language = "en", id = "id1")
  dat = c(as.character(dat))
  dat = gsub("^ ?([0-9]{1,3}) ?", "\\1|", dat)
  dat = gsub("(, HVOL )", "\\1 ", dat)
  dat = gsub(" {2,100}", "|", dat)
  excludeRows = lapply(gregexpr("\\|", dat), function(x) length(x)) != 6
  write(dat[excludeRows], "rowsToCheck.txt", append = TRUE)
  dat = dat[!excludeRows]
  dat = read.table(text = dat, sep = "", quote = "", stringsAsFactors = FALSE)
  names(dat) = c("Plan", "Individual", "Family")
  return(dat)
}
files <- list.files(pattern = "pdf$")
df = do.call("rbind", lapply(files, PDFtoDF))
Error in read.table(text = dat, sep = "", quote = "", stringsAsFactors =
FALSE) : no lines available in input
Before this approach, I was using the pdftools package and regular expressions. That approach worked, except it was difficult to specify a pattern for some parts of the document, such as the plan name at the top. I was hoping the methodology I'm trying now would help, since it extracts the text into separate strings for me.
Here's the best answer:
require(readtext)
df <- readtext("*.pdf")
Yes, it's that simple with the readtext package!
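readtext() gives back a data frame with one row per PDF and the whole document in a text column, so the plan-specific numbers still need to be pulled out of that column. A small sketch; the regular expression is purely hypothetical and would have to be adjusted to how the prices are actually printed:
library(readtext)
library(stringr)

df <- readtext("*.pdf")   # columns: doc_id, text (one row per PDF)
# hypothetical pattern -- adapt to the real wording/layout of the plan PDFs
individual_price <- str_match(df$text, "Individual\\s+\\$?([0-9]+(?:\\.[0-9]{2})?)")[, 2]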
I'm trying to import the list of nuclear test sites (from Wikipedia's page) into a data.frame using the code below:
library(RCurl)
library(XML)

theurl <- "https://en.wikipedia.org/wiki/List_of_nuclear_test_sites"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error = function(...){}, useInternalNodes = TRUE)

# Find the XPath (go to the webpage, right-click > Inspect Element, find the table, then right-click > Copy XPath)
myxpath <- "//*[@id='mw-content-text']/table[2]"

# Extract the table header and contents
tablehead <- xpathSApply(pagetree, paste(myxpath, "/tr/th", sep = ""), xmlValue)
results <- xpathSApply(pagetree, paste(myxpath, "/tr/td", sep = ""), xmlValue)

# Convert the character vector to a data frame
content <- as.data.frame(matrix(results, ncol = 5, byrow = TRUE))
names(content) <- c("Testing country", "Location", "Site", "Coordinates", "Notes")
However, there are multiple sub-headers that prevent the data.frame from being populated consistently. How can I fix this?
Take a look at the htmltab package. It allows you to use the subheaders for populating a new column:
library(htmltab)
tab <- htmltab("https://en.wikipedia.org/wiki/List_of_nuclear_test_sites",
               which = "/html/body/div[3]/div[3]/div[4]/table[2]",
               header = 1 + "//tr/th[@style='background:#efefff;']",
               rm_nodata_cols = FALSE)
I found this example by Carson Sievert that worked well for me:
library(rvest)
theurl <- "https://en.wikipedia.org/wiki/List_of_nuclear_test_sites"
# First, grab the page source
content <- read_html(theurl) %>%
# then extract the first node with class of wikitable
html_node(".wikitable") %>%
# then convert the HTML table into a data frame
html_table()
Have you tried this?
l.wiki.url <- getURL( url = "https://en.wikipedia.org/wiki/List_of_nuclear_test_sites" )
l.wiki.par <- htmlParse( file = l.wiki.url )
l.tab.con <- xpathSApply( doc = l.wiki.par
                          , path = "//table[@class='wikitable']//tr//td"
                          , fun = xmlValue
                          )
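l.tab.con comes back as one flat character vector of cell values, so it still needs reshaping into rows, much like the matrix() step in the question. A minimal sketch, assuming only the 5-column table of interest is matched and no cells span columns:
# reshape the flat vector of cells into rows of 5; R warns if the cell count is not a multiple of 5
nuke_sites <- as.data.frame(matrix(l.tab.con, ncol = 5, byrow = TRUE),
                            stringsAsFactors = FALSE)
names(nuke_sites) <- c("Testing country", "Location", "Site", "Coordinates", "Notes")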