How to extract bold and non-bold text from PDF using R

I am using R to extract text. The code below works well for extracting the non-bold text from a PDF, but it ignores the bold parts. Is there a way to extract both bold and non-bold text?
library(pdftools)
library(tesseract)
library(tiff)

news <- 'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
info <- pdf_info(news)
numberOfPageInPdf <- as.numeric(info[2])
numberOfPageInPdf
for (i in 1:numberOfPageInPdf) {
  bitmap    <- pdf_render_page(news, page = i, dpi = 300, numeric = TRUE)
  file_name <- paste0("page", i, ".tiff")
  file_tiff <- tiff::writeTIFF(bitmap, file_name)
  out       <- ocr(file_name)
  file_txt  <- paste0("text", i, ".txt")
  writeLines(out, file_txt)
}

I like using the tabulizer library for this. Here's a small example:
devtools::install_github("ropensci/tabulizer")
library(tabulizer)
news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
# note that you need to specify UTF-8 as the encoding, otherwise your special characters
# won't come in correctly
page1 <- extract_tables(news, guess = TRUE, pages = 1, encoding = 'UTF-8')
page1[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "" "Division: 1" "" "" "" "" "Série: A"
[2,] "" "514" "" "Fontaine 1 KBSK 1" "" "" "303"
[3,] "1" "62529 WIRIG ANTHONY" "" "2501 1⁄2-1⁄2" "51560" "CZEBE ATTILLA" "2439"
[4,] "2" "62359 BRUNNER NICOLAS" "" "2443 0-1" "51861" "PICEU TOM" "2401"
[5,] "3" "75655 CEKRO EKREM" "" "2393 0-1" "10391" "GEIRNAERT STEVEN" "2400"
[6,] "4" "50211 MARECHAL ANDY" "" "2355 0-1" "35181" "LEENHOUTS KOEN" "2388"
[7,] "5" "73059 CLAESEN PIETER" "" "2327 1⁄2-1⁄2" "25615" "DECOSTER FREDERIC" "2373"
[8,] "6" "63614 HOURIEZ CLEMENT" "" "2304 1⁄2-1⁄2" "44954" "MAENHOUT THIBAUT" "2372"
[9,] "7" "60369 CAPONE NICOLA" "" "2283 1⁄2-1⁄2" "10430" "VERLINDE TIEME" "2271"
[10,] "8" "70653 LE QUANG KIM" "" "2282 0-1" "44636" "GRYSON WOUTER" "2269"
[11,] "" "" "< 2361 >" "12 - 20" "" "< 2364 >" ""
You can also use the locate_areas function to specify a specific region if you only care about some of the tables. Note that for locate_areas to work, I had to download the file locally first; using the URL returned an error.
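For reference, here's a minimal sketch of that locate_areas workflow (the local file name is my own choice, not part of the original answer):
# download the PDF locally first, then interactively click-and-drag to select a region
local_pdf <- file.path(tempdir(), "ind01.pdf")
download.file(news, local_pdf, mode = "wb")
myArea <- locate_areas(local_pdf, pages = 1)  # returns the selected area coordinates (top, left, bottom, right)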
You'll note that each table is its own element in the returned list.
Here's an example using a custom region to just select the first table on each page:
customArea <- extract_tables(news, guess = FALSE, pages = 1, area = list(c(84, 27, 232, 569)), encoding = 'UTF-8')
This is also a more direct method than using the OCR (Optical Character Recognition) library tesseract, because you're not relying on the OCR engine to translate a pixel arrangement back into text. In digital PDFs, each text element has an x and y position, and the tabulizer library uses that information to apply table-detection heuristics and extract sensibly formatted data. You'll see you still have some clean-up to do, but it's pretty manageable.
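As an aside (my addition, not part of the original answer), pdftools can expose that positional information directly, which is a quick way to see the x/y layout that tabulizer works from:
# one row per text element, with width, height, x, y, space and text columns
# (pdf_data() requires pdftools >= 2.0)
words_page1 <- pdftools::pdf_data(news)[[1]]
head(words_page1)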
Edit: just for fun, here's a little example of starting the clean up with data.table
library(data.table)
cleanUp <- setDT(as.data.frame(page1[[1]]))
cleanUp[, `:=`(Division = as.numeric(gsub("^.*(\\d+{1,2}).*", "\\1",
                                          grep('Division', cleanUp$V2, value = TRUE))),
               Series   = as.character(gsub(".*:\\s(\\w).*", "\\1",
                                            grep('Série:', cleanUp$V7, value = TRUE))))
        ][, ID := tstrsplit(V2, " ", fixed = TRUE, keep = 1)
        ][, c("V1", "V3") := NULL
        ][-grep('Division', V2, fixed = TRUE)]
Here we've moved Division, Series, and ID into their own columns, and removed the Division header row. This is just the general idea, and would need a little refinement to apply to all 27 pages.
V2 V4 V5 V6 V7 Division Series ID
1: 514 Fontaine 1 KBSK 1 303 1 A 514
2: 62529 WIRIG ANTHONY 2501 1/2-1/2 51560 CZEBE ATTILLA 2439 1 A 62529
3: 62359 BRUNNER NICOLAS 2443 0-1 51861 PICEU TOM 2401 1 A 62359
4: 75655 CEKRO EKREM 2393 0-1 10391 GEIRNAERT STEVEN 2400 1 A 75655
5: 50211 MARECHAL ANDY 2355 0-1 35181 LEENHOUTS KOEN 2388 1 A 50211
6: 73059 CLAESEN PIETER 2327 1/2-1/2 25615 DECOSTER FREDERIC 2373 1 A 73059
7: 63614 HOURIEZ CLEMENT 2304 1/2-1/2 44954 MAENHOUT THIBAUT 2372 1 A 63614
8: 60369 CAPONE NICOLA 2283 1/2-1/2 10430 VERLINDE TIEME 2271 1 A 60369
9: 70653 LE QUANG KIM 2282 0-1 44636 GRYSON WOUTER 2269 1 A 70653
10: 12 - 20 < 2364 > 1 A NA
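To apply the same idea to all 27 pages, one option (just a sketch on my part, not from the original answer) is to wrap the clean-up in a function and run it over every extracted table; cleanPage() below is a hypothetical placeholder standing in for the data.table steps shown above:
# hypothetical wrapper: replace the body with the Division/Series/ID steps above
cleanPage <- function(tab) setDT(as.data.frame(tab))

allTables <- extract_tables(news, guess = TRUE, encoding = 'UTF-8')  # every table on every page
combined  <- rbindlist(lapply(allTables, cleanPage), fill = TRUE)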

There is no need to go through the PDF -> TIFF -> OCR loop, since pdftools::pdf_text() can read this file directly:
stringi::stri_split(pdf_text(news), regex = "\n")
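If you want the same per-page text files that the OCR loop produced, a minimal sketch (my addition) would be:
# one character string per page, taken from the PDF's embedded text layer (no OCR needed)
txt_pages <- pdftools::pdf_text(news)
for (i in seq_along(txt_pages)) {
  writeLines(txt_pages[i], paste0("text", i, ".txt"))
}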


Move a [-] symbol with condition

I'm still learning R, and you guys have been so helpful with your educational answers.
So here is my issue. It might be very basic, but I tried solutions with sub, gsub and case_when and got no results. I have a column of numbers, some of which have a [-] sign on the right, and if a value has the - I would like to move it to the front.
col<- c("1.000","100-","12.000-","12.568-", "100","150","1.000.000-")
col2<-c("A","B","C","D","E","F","G")
A<-cbind(col2,col)
A<-as.data.frame(A)
Expected result:
col2<-c("A","B","C","D","E","F","G")
col<-c("1.000","-100","-12.000","-12.568", "100","150","-1.000.000")
A<-cbind(col2,col)
A<-as.data.frame(A)
Thanks in advance!
You could do:
sub("(.*)-$", "-\\1", A$col)
#> [1] "1.000" "-100" "-12.000" "-12.568" "100" "150"
#> [7] "-1.000.000"
You can also write an ifelse that checks whether the last character in the string is a dash and, in that case, pastes it in front:
library(dplyr)
library(stringr)
A %>%
  mutate(col_edit = ifelse(str_sub(col, -1, -1) == "-",
                           paste0("-", str_sub(col, 1, -2)),
                           col))
col2 col col_edit
1 A 1.000 1.000
2 B 100- -100
3 C 12.000- -12.000
4 D 12.568- -12.568
5 E 100 100
6 F 150 150
7 G 1.000.000- -1.000.000
Using str_replace
library(stringr)
A$col <- str_replace(A$col, "^(.*)-$", "-\\1")
A$col
#[1] "1.000" "-100" "-12.000" "-12.568" "100" "150" "-1.000.000"

Unable to logically address x,y-values of rasterToPoints extraction

I am unsure how to describe this problem. I have a feeling it is trivial but I cannot get a hold of it.
I have a stack of raster objects (object NDVI). From these I extracted x and y coordinates using rasterToPoints
xycoord1 <- rasterToPoints(NDVI)
xycoord <- xycoord1[,c(1:2)]
During pre-processing I dropped several unusable pixels and ended up with:
> str(xycoord.short)
num [1:20054, 1:2] 3802292 3802523 3802755 3802987 3803218 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "x" "y"
Now I simply want to find a certain x and y coordinate.
e.g.
> which(xycoord.short[,1]==3802292)
integer(0)
But I seem unable to "get hold" of the values inside, for example, one column:
> xycoord.short[,1][1]
[1] 3802292
> xycoord.short[,1][1]==xycoord.short[,1][1]
[1] TRUE
> xycoord.short[,1][1]==3802292
[1] FALSE
Can anyone help me with this problem? I just can't find the cause. Does it have to do with the initial extraction through rasterToPoints? Thanks!
EDIT:
dput output for the first 10 rows of my xy-coordinates
xy <- structure(c(3802291.63636448, 3802523.29272274, 3802754.94908101,
3802986.60543927, 3803218.26179754, 3803449.9181558, 3803681.57451406,
3803913.23087233, 3804144.88723059, 3804376.54358886, -49690.2888476191,
-49690.2888476191, -49690.2888476191, -49690.2888476191, -49690.2888476191,
-49690.2888476191, -49690.2888476191, -49690.2888476191, -49690.2888476191,
-49690.2888476191), .Dim = c(10L, 2L), .Dimnames = list(NULL,
c("x", "y")))
EDIT2:
After posting the dput output it makes sense, as the values are obviously rounded.
Using the exact numbers works...
> any(xycoord.short[,1]==3802291.63636448)
[1] TRUE
What you have here is a rounding "problem". Your coordinates are in what we call "double" (10.3 is a double) but you're trying to subset based on an integer (say 10). What you can do here is round to n places and subset based on that.
For instance, let's check eight digits.
format(xy, digits = 8)
x y
[1,] "3802291.636" " -49690.289"
[2,] "3802523.293" " -49690.289"
[3,] "3802754.949" " -49690.289"
[4,] "3802986.605" " -49690.289"
[5,] "3803218.262" " -49690.289"
[6,] "3803449.918" " -49690.289"
[7,] "3803681.575" " -49690.289"
[8,] "3803913.231" " -49690.289"
[9,] "3804144.887" " -49690.289"
[10,] "3804376.544" " -49690.289"
So in essence, when you're looking for 3802292 it doesn't find it because it's actually 3802291.636....
You can either specify the exact coordinate correct to x decimal places, round your numbers and work on those, or specify a range of values that encompasses your desired value(s).
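For example (a sketch on my part, using the dput sample from the question), matching on a rounded value or within a tolerance both work:
# round first, or compare within a tolerance, instead of testing doubles for exact equality
which(round(xy[, "x"]) == 3802292)     # 3802291.636... rounds to 3802292, so row 1 matches
which(abs(xy[, "x"] - 3802292) < 0.5)  # any x within half a unit of the target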

Quanteda with topicmodels: removed stopwords appear in results (Chinese)

My code:
library(quanteda)
library(topicmodels)
# Some raw text as a vector
postText <- c("普京 称 俄罗斯 未 乌克兰 施压 来自 头 条 新闻", "长期 电脑 前进 食 致癌 环球网 报道 乌克兰 学者 认为 电脑 前进 食 会 引发 癌症 等 病症 电磁 辐射 作用 电脑 旁 水 食物 会 逐渐 变质 有害 物质 累积 尽管 人体 短期 内 会 感到 适 会 渐渐 引发 出 癌症 阿尔茨海默 式 症 帕金森 症 等 兔子", "全 木 手表 乌克兰 木匠 瓦列里·达内维奇 木头 制作 手表 共计 154 手工 零部件 唯一 一个 非 木制 零件 金属 弹簧 驱动 指针 运行 其他 零部件 材料 取自 桦树 苹果树 杏树 坚果树 竹子 黄杨树 愈疮木 非洲 红木 总共 耗时 7 打造 手表 不仅 能够 正常 运行 天 时间 误差 保持 5 分钟 之内 ")
# Create a corpus of the posts
postCorpus <- corpus(postText)
# Make a dfm, removing numbers and punctuation
myDocTermMat <- dfm(postCorpus, stem = FALSE, removeNumbers = TRUE, removeTwitter = TRUE, removePunct = TRUE)
# Estimate a LDA Topic Model
if (require(topicmodels)) {
  myLDAfit <- LDA(convert(myDocTermMat, to = "topicmodels"), k = 2)
}
terms(myLDAfit, 11)
The code works and I see a result. Here is an example of the output:
Topic 1 Topic 2
[1,] "木" "会"
[2,] "手表" "电脑"
[3,] "零" "乌克兰"
[4,] "部件" "前进"
[5,] "运行" "食"
[6,] "乌克兰" "引发"
[7,] "内" "癌症"
[8,] "全" "等"
[9,] "木匠" "症"
[10,] "瓦" "普"
[11,] "列" "京"
Here is the problem. All of my posts have been segmented (necessary pre-processing step for Chinese) and had stop words removed. Nonetheless, the topic model returns topics containing single-character stop terms that have already been removed. If I open the raw .txt files and do ctrl-f for a given single-character stop word, no results are returned. But those terms show up in the returned topics from the R code, perhaps because the individual characters occur as part of other multi-character words. E.g. 就 is a preposition treated as a stop word, but 成就 means "success."
Related to this, certain terms are split. For example, one of the events I am examining contains references to Russian president Putin ("普京"). In the topic model results, however, I see separate term entries for "普" and "京" and no entries for "普京". (See lines 10 and 11 in output topic 2, compared to the first word in the raw text.)
Is there an additional tokenization step occurring here?
Edit: Modified to make the example reproducible.
Here's a workaround, using a faster but "dumber" word tokeniser that splits on whitespace ("\\s"):
# fails
features(dfm(postText, verbose = FALSE))
## [1] "普" "京" "称" "俄罗斯" "未" "乌克兰" "施压" "来自" "头" "条" "新闻"
# works
features(dfm(postText, what = "fasterword", verbose = FALSE))
## [1] "普京" "称" "俄罗斯" "未" "乌克兰" "施压" "来自" "头" "条" "新闻"
So add what = "fasterword" to the dfm() call and you will get this as a result, where Putin ("普京") is not split.
terms(myLDAfit, 11)
## Topic 1 Topic 2
## [1,] "会" "手表"
## [2,] "电脑" "零部件"
## [3,] "乌克兰" "运行"
## [4,] "前进" "乌克兰"
## [5,] "食" "全"
## [6,] "引发" "木"
## [7,] "癌症" "木匠"
## [8,] "等" "瓦列里达内维奇"
## [9,] "症" "木头"
## [10,] "普京" "制作"
## [11,] "称" "共计"
This is an interesting case where quanteda's default tokeniser, built on stringi's definition of text boundaries (see stri_split_boundaries), does not work well in its default setting. It might after experimenting with the locale, but those are not currently options that can be passed to quanteda::tokenize(), which dfm() calls.
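To see the difference directly (my illustration, not from the original answer), compare stringi's word-boundary segmentation with plain whitespace splitting on the first post:
library(stringi)
# ICU word boundaries re-segment the already-segmented Chinese text (locale-dependent)
stri_split_boundaries(postText[1], type = "word", skip_word_none = TRUE)[[1]]
# plain whitespace splitting preserves the original segmentation, like "fasterword"
stri_split_regex(postText[1], "\\s+")[[1]]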
Please file this as an issue at https://github.com/kbenoit/quanteda/issues and I'll try to get a better solution working using the "smarter" word tokeniser.

Splitting a string by more than one space

I am trying to load some data into R that is in the following format (as a text file)
Name Country Age
John,Smith United Kingdom 20
Washington,George USA 50
Martin,Joseph Argentina 43
The problem I have is that the "columns" are separated by spaces such that they all line up nicely, but one row may have 5 spaces between values and the next 10 spaces. So when I load it in using read.delim I get a one column data.frame with
"John,Smith United Kingdom 20"
as the first observation and so on.
Is there any way I can either:
Load the data into R into a usable format? or
Split the character strings up into separate columns once I load it in in the one column format?
My thought was to split the character strings by spaces, except it would need to be between 2 and x spaces (so, for example, "United Kingdom" stays together and doesn't become "United" "" "Kingdom"). But I don't know if that is possible.
I tried strsplit(data.frame[,1], split = "\\s") but it returns a list of character strings like:
"John,Smith" "" "" "" "" "" "" "" "United" "" "Kingdom" "" ""...
which I don't know what to do with.
Having columns that all "line up nicely" is a typical characteristic of fixed-width data.
For the sake of this answer, I've written your three lines of data and one line of header information to a temporary file called "x". For your actual use, replace "x" with the file name/path, as you would normally use with read.delim.
Here's the sample data:
x <- tempfile()
cat("Name Country Age\nJohn,Smith United Kingdom 20\nWashington,George USA 50\nMartin,Joseph Argentina 43\n", file = x)
R has its own function for reading fixed-width data (read.fwf), but it is notoriously slow and you need to know the widths before you can get started. We can count those if the file is small, and then use something like:
read.fwf(x, c(22, 18, 4), strip.white = TRUE, skip = 1,
col.names = c("Name", "Country", "Age"))
# Name Country Age
# 1 John,Smith United Kingdom 20
# 2 Washington,George USA 50
# 3 Martin,Joseph Argentina 43
Alternatively, you can let fwf_empty from the "readr" package guess the widths for you, and then use read_fwf:
library(readr)
read_fwf(x, fwf_empty(x, col_names = c("Name", "Country", "Age")), skip = 1)
# Name Country Age
# 1 John,Smith United Kingdom 20
# 2 Washington,George USA 50
# 3 Martin,Joseph Argentina 43
You can also do this in base R, provided your values never contain two or more consecutive spaces (so that splitting on runs of at least two spaces is safe):
txt = "Name Country Age
John,Smith United Kingdom 20
Washington,George USA 50
Martin,Joseph Argentina 43"
conn = textConnection(txt)
do.call(rbind, lapply(readLines(conn), function(u) strsplit(u,'\\s{2,}')[[1]]))
# [,1] [,2] [,3]
#[1,] "Name" "Country" "Age"
#[2,] "John,Smith" "United Kingdom" "20"
#[3,] "Washington,George" "USA" "50"
#[4,] "Martin,Joseph" "Argentina" "43"

Count misspelled words in R

Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fish", "Final Countdow", "show me your s", "where is what")
Data<-cbind(Row, Content)
View(Data)
I want to create a function that tells me how many words are wrong per Row.
An intermediate step would be to have it look like this:
Row<-c(1,2,3,4,5)
Content<-c("I love cheese", "whre is the fs", "Final Countdow", "show me your s", "where is what")
MisspelledWords<-c(NA, "whre, fs", "Countdow","s",NA)
Data<-cbind(Row, Content,MisspelledWords)
I know that I have to use aspell, but I'm having problems running aspell on individual rows rather than directly on a whole file. Finally, I want to count how many words are wrong in every Row; for that I would use the code from: Count the number of words in a string in R?
Inspired by this article, here's a try with which_misspelled and check_spelling in library(qdap).
library(qdap)
# which_misspelled
n_misspelled <- sapply(Content, function(x) {
  length(which_misspelled(x, suggest = FALSE))
})
data.frame(Content, n_misspelled, row.names = NULL)
# Content n_misspelled
# 1 I love cheese 0
# 2 whre is the fs 2
# 3 Final Countdow 1
# 4 show me your s 0
# 5 where is what 0
# check_spelling
df <- check_spelling(Content, n.suggest = 0)
n_misspelled <- as.vector(table(factor(df$row, levels = Row)))
data.frame(Content, n_misspelled)
# Content n_misspelled
# 1 I love cheese 0
# 2 whre is the fs 2
# 3 Final Countdow 1
# 4 show me your s 0
# 5 where is what 0
To use aspell you have to use a file. It's pretty straightforward to use a function to dump a column to a file, run aspell and get the counts (but it will not be all that efficient if you have a large matrix/dataframe).
countMispelled <- function(words) {
  # do a bit of cleanup (if necessary): strip punctuation and collapse repeated spaces
  words <- gsub(" +", " ", gsub("[[:punct:]]", "", words))
  temp_file <- tempfile()
  writeLines(words, temp_file)
  res <- aspell(temp_file)
  unlink(temp_file)
  # return the number of misspelled words aspell flagged
  length(res$Original)
}
Data <- cbind(Data, Errors=unlist(lapply(Data[,2], countMispelled)))
Data
## Row Content Errors
## [1,] "1" "I love cheese" "0"
## [2,] "2" "whre is thed fish" "2"
## [3,] "3" "Final Countdow" "1"
## [4,] "4" "show me your s" "0"
## [5,] "5" "where is what" "0"
You might be better off using a data frame vs a matrix (I just worked with what you provided) since you can keep Row and Errors numeric that way.
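For instance, a data frame version might look like this (my sketch, reusing countMispelled() from above):
# keep Row numeric and Content character; Errors stays numeric as well
Data_df <- data.frame(Row = Row, Content = Content, stringsAsFactors = FALSE)
Data_df$Errors <- vapply(Data_df$Content, countMispelled, integer(1))
str(Data_df)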
