Is there a way to count specific words from a corpus of PDFs in R? - r

How can I count the number of occurrences of specific words in a corpus of PDFs?
I tried using text_count, but I honestly didn't understand what it returned.

First you would want to OCR the PDFs if necessary, then convert them to raw text. pdftools can help with both the OCR step and the conversion to text, but I am not sure that it can handle multiple columns.
https://cran.r-project.org/web/packages/pdftools/pdftools.pdf
Here is another post:
Use R to convert PDF files to text files for text mining
As in that post, you could use xpdf (installed via Homebrew) to convert the PDFs, as I believe it has more functionality for multiple columns and text alignment.
After you have raw text, you can use a package like tm to obtain word counts in a corpus. Let me know if this works or if you have further questions.
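For instance, once pdf_text() has given you raw text, a minimal sketch of counting a few specific words with tm might look like this (the folder name "pdfs" and the target words are placeholders):

library(pdftools)
library(tm)

files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
texts <- vapply(files, function(f) paste(pdf_text(f), collapse = " "), character(1))

# Build a corpus, normalise it lightly, and tabulate term frequencies
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
tdm <- as.matrix(TermDocumentMatrix(corpus))

target <- c("climate", "policy")                      # the specific words you care about
tdm[intersect(target, rownames(tdm)), , drop = FALSE] # counts per document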

Related

R write fixed width columns to csv (in the plain text file)

I write data frames to csv files using write.csv(). When viewed in a plain text editor, in particular vi or Notepad++, the output has no spacing between the column content and the commas, which makes it relatively hard to read. For example, the columns are not lined up down the page.
I have negative interest in using excel to view the csv files. I am definitely not looking for a suggestion for a csv viewer. Nor do I want instructions on how to modify the plain text file afterward. Padding needs to be spaces not tabs.
I am interested in how to get R to line up the columns in the plain text csv file so that they are easier to read using a non specialized plain text editor.
I could (and might) write my own routine that converts everything to some fixed width string format and print that. But, I would prefer to find that this is an option within write.csv() or similar common output library call.
[I just this moment found out about sprintf() in R, and that might be the best answer to this conundrum.]
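One direction that stays close to write.csv() is to pad every column (and its header) to a fixed width with format() and then write with write.table(); a minimal sketch, with a made-up data frame and output file name:

df <- data.frame(name = c("alpha", "b"), value = c(1.5, 12.25))

# Pad each column and its header to the same width using spaces
padded <- as.data.frame(lapply(df, function(col) format(col, width = 12)),
                        stringsAsFactors = FALSE)
names(padded) <- format(names(df), width = 12)

# write.table() rather than write.csv() so quoting can be switched off
write.table(padded, "out.csv", sep = ",", quote = FALSE, row.names = FALSE)

sprintf("%12s", ...) gives the same kind of control if you prefer printf-style formatting.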

Detect if PDF contains searchable text

I want to read the text of 16,000 PDF files into an R data frame. Most of the PDFs are no problem: pdf_text("link") gives the right output. The problem is that some PDFs have no copyable text. With them I get the following (sample) output:
"0123145456789 20ÿ\016\017ÿ\020\021\022ÿ\023\024\020\024\025\022ÿ\026\023ÿ\027\026\016\022\020\025\030ÿ\031\026\032\033\034\030ÿÿ\035\026\036\022\025\016\026\025ÿ\035 \037\025\033\022\032ÿ !\021\032\026\024\023\r\n ()*+ÿ-./.0012ÿ34ÿ5656ÿ7M9^.C=19ÿO=CDÿ-_`MQÿ-19GEHFC=19ÿ\.GW:CG7\r\n ()5+ÿ8.9:2ÿ;4ÿ5656ÿ7N10=#=:GÿJ1>ÿCD:ÿa>:.Cÿa01V.0ÿ<DECA1O9ÿ.9Aÿ\:K19A7\r\n ())+ÿ<:=0:>2ÿ5656ÿ7-1/=AÿY#191H=#Gbÿ̀:CC:Aÿ.9Aÿ>:.0cC=H:ÿF.F:>G7\r\n"
The obvious solutions are
a. do not read these texts at all, or
b. read the texts with OCR.
But since the first excludes many texts and the second is very time-consuming, I would like to recognize the problematic texts beforehand. Is there an easy way to do this in R?
text_raw <- pdf_text("link")
if (text_raw is nonsense) {       # pseudocode: this test is what I am looking for
  text <- pdf_ocr_text("link")    # OCR runs on the PDF file itself, not on the extracted text
} else {
  text <- text_raw
}
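One possible heuristic for the "text_raw is nonsense" test (an assumption on my part, not anything built into pdftools): extractions like the sample above are full of control characters, so you can flag a text when those make up more than a small share of its characters.

library(pdftools)

# Heuristic (assumption): flag text as garbled when more than ~5% of its
# characters are control characters other than ordinary newlines and tabs
looks_garbled <- function(txt, threshold = 0.05) {
  chars <- strsplit(paste(txt, collapse = ""), "")[[1]]
  if (length(chars) == 0) return(TRUE)    # empty extraction: needs OCR anyway
  ctrl <- grepl("[[:cntrl:]]", chars) & !grepl("[\r\n\t]", chars)
  mean(ctrl) > threshold
}

text_raw <- pdf_text("link")
text <- if (looks_garbled(text_raw)) pdf_ocr_text("link") else text_raw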

R text mining: replace abbreviations, numbers and symbols in German

I would like to replace the abbreviations, numbers and symbols in my text.
As my text is in German and not in English, I have problems converting it.
I tried:
review_text <- replace_abbreviation(review_text)
review_text <- replace_number(review_text)
review_text <- replace_symbol(review_text)
But this works only for English text, not for German.
What should I change so that the functions also work in German?
qdap and qdap-related packages are solely for use with the English language. If you want to work with German text, umlauts and all, packages like quanteda and udpipe can handle it, but they do not handle abbreviations and symbols. The replace_symbol function is easy to adjust: inspect the function, copy the code to create your own version, and replace the English translations with German ones (see the sketch after this answer).
The replace_abbreviation function points to a replacement table where the abbreviations are stored with their corresponding values. You need to create your own table for German.
The biggest issue is translating numbers to text. This is different for each language and is not readily available online; searching for it tends to lead to converting numbers to text in Excel. But if you can read Python, you can translate a Python function to R (or use reticulate) to solve this. See this link to a Python library on GitHub which can do this for a few languages, including German, though I'm not sure whether it can be used in a text mining context.
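To make the symbol and abbreviation part concrete, here is a minimal sketch of German replacements built on gsub(); the lookup entries are illustrative, not a complete list:

# Sketch of a German replace_symbol(): swap common symbols for German words
replace_symbol_de <- function(x) {
  symbols      <- c("%",       "&",   "@",  "\\$",    "\\+")
  replacements <- c("Prozent", "und", "an", "Dollar", "plus")
  for (i in seq_along(symbols)) {
    x <- gsub(symbols[i], paste0(" ", replacements[i], " "), x)
  }
  gsub("\\s+", " ", trimws(x))   # tidy up the extra spaces
}

# The same idea works for abbreviations via your own replacement table
abbr <- data.frame(abv = c("z.B.", "usw.", "bzw."),
                   rep = c("zum Beispiel", "und so weiter", "beziehungsweise"))
replace_abbreviation_de <- function(x, table = abbr) {
  for (i in seq_len(nrow(table))) {
    x <- gsub(table$abv[i], table$rep[i], x, fixed = TRUE)
  }
  x
}

replace_symbol_de("100% Zustimmung & mehr")   # "100 Prozent Zustimmung und mehr"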

Reading a CSV file and tokenizing it

I am a newbie in R. I have been trying to read a CSV file like this:
tweets <- read.csv("tweets.csv")
and I need to remove all punctuation, convert to lower case, and remove numbers, stop words and whitespace from the data frame 'tweets' without having to convert it into a corpus or anything like that. Nothing fancy, just straight removal. Is there any library or function that could help with this?
The reading part is what you already have:
tweets <- read.csv("tweets.csv")
For dealing with punctuation, whitespace and the rest, the alternative to a corpus is regular expressions, but that approach has limited application as it is not very generic. That is why a corpus is usually preferred: it makes it easier to apply the same cleaning to different sources. A sketch of the regular-expression route is below.
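A minimal sketch of that route, assuming the tweet text sits in a column called text (a hypothetical name) and borrowing only the stop word list from tm:

library(tm)   # used here only for stopwords() and removeWords()

tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)

clean <- tolower(tweets$text)                  # lower case
clean <- gsub("[[:punct:]]", " ", clean)       # remove punctuation
clean <- gsub("[[:digit:]]", " ", clean)       # remove numbers
clean <- removeWords(clean, stopwords("en"))   # remove English stop words
clean <- gsub("\\s+", " ", trimws(clean))      # squeeze whitespace

tweets$text_clean <- clean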

Excel text data into R

I've been researching this question, which I assume should be easy to fix, but am not having any luck. I have an Excel file where each cell is text of variable length. I want to read this into R so I can eventually do some text classification, but I am failing. I get errors when using read.table and am struggling with all the other alternatives. I've never worked with text data before, so perhaps that is the issue. I am having trouble finding good examples of importing text data into R when it isn't in corpus format.
There are dedicated packages for reading data from the Excel format. I mostly use readxl when I need to do this, but there are several others (a lot of them are described in this DataCamp tutorial, in the section Importing Excel files into R).
Another possibility (assuming you are using Windows) is to copy the cells to the clipboard and use
read.table("clipboard")
For macOS and Linux there are similar commands, but I don't know them by heart.
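A minimal readxl sketch (the file name, and the assumption that the text sits in the first column of the first sheet, are placeholders for your own workbook):

library(readxl)

docs  <- read_excel("texts.xlsx", sheet = 1, col_names = FALSE)
texts <- docs[[1]]   # first column: one character string per cell
head(texts)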
