Notice and remove erroneous spaces between letters in a word - r

I have a text file that looks like:
These are the hig hlights. Transit ioning to this, hello. I have
provided this informat ion. The man has this dis eas e. He needs to take this dos age of medicine. Fo r o ne mo nth, thro ug h this pro g ram, do this. Do no t overdose.
There are numerous words that are broken up. Is there any way to notice these errors in word structure and fix them through R?
So basically:
These are the highlights. Transitioning to this, hello. I have
provided this information. The man has this disease. He needs to take this dosage of medicine. For one month, through this program, do this. Do not overdose.
I got the text from a pdf using the following code:
library(tm)

file <- 'C:/Project/Section/SubSection/text.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file), readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
write(corpus.array, "C:/Project/Section/SubSection/text1.txt") # write() returns NULL, so there is nothing to assign
readtext <- readLines("C:/Project/Section/SubSection/text1.txt") # eval() around readLines() was a no-op
This produced the text with awkward spacing. Is there a better way to convert a PDF to a text file?
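One alternative extraction route worth trying is the pdftools package (an assumption here; it is not used in the code above): its pdf_text() reads each page into a single string and often preserves word boundaries better than readPDF with -layout:

library(pdftools)

# each element of pages is the text of one page
pages <- pdf_text('C:/Project/Section/SubSection/text.pdf')
writeLines(pages, 'C:/Project/Section/SubSection/text1.txt')

If stray spaces survive extraction anyway, a dictionary-based repair is one option. A rough sketch, assuming the hunspell package; fix_spaces is a hypothetical helper that merges a token with its neighbour whenever the token fails a spell check but the fused form passes (punctuation handling is left out):

library(hunspell)

# hypothetical helper: glue adjacent tokens back together when the
# first is not a dictionary word but the concatenation is
fix_spaces <- function(words) {
  out <- character(0)
  i <- 1
  while (i <= length(words)) {
    if (i < length(words) &&
        !hunspell_check(words[i]) &&
        hunspell_check(paste0(words[i], words[i + 1]))) {
      out <- c(out, paste0(words[i], words[i + 1]))
      i <- i + 2
    } else {
      out <- c(out, words[i])
      i <- i + 1
    }
  }
  out
}

paste(fix_spaces(c("provided", "this", "informat", "ion")), collapse = " ")
# "provided this information"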

Related

Create multiple rmarkdown reports with one dataset

I would like to create several pdf files in rmarkdown.
This is a sample of my data:
mydata <- data.frame(First = c("John", "Hui", "Jared","Jenner"), Second = c("Smith", "Chang", "Jzu","King"), Sport = c("Football","Ballet","Ballet","Football"), Age = c("12", "13", "12","13"), Submission = c("Microbes may be the friends of future colonists living off the land on the moon, Mars or elsewhere in the solar system and aiming to establish self-sufficient homes.
Space colonists, like people on Earth, will need what are known as rare earth elements, which are critical to modern technologies. These 17 elements, with daunting names like yttrium, lanthanum, neodymium and gadolinium, are sparsely distributed in the Earth’s crust. Without the rare earths, we wouldn’t have certain lasers, metallic alloys and powerful magnets that are used in cellphones and electric cars.", "But mining them on Earth today is an arduous process. It requires crushing tons of ore and then extracting smidgens of these metals using chemicals that leave behind rivers of toxic waste water.
Experiments conducted aboard the International Space Station show that a potentially cleaner, more efficient method could work on other worlds: let bacteria do the messy work of separating rare earth elements from rock.", "“The idea is the biology is essentially catalyzing a reaction that would occur very slowly without the biology,” said Charles S. Cockell, a professor of astrobiology at the University of Edinburgh.
On Earth, such biomining techniques are already used to produce 10 to 20 percent of the world’s copper and also at some gold mines; scientists have identified microbes that help leach rare earth elements out of rocks.", "Blank"))
With help from the community, I was able to arrive at a cool rmarkdown solution that would create a single html file, with all the data I want.
This is saved as Essay to Word.Rmd
```{r echo = FALSE}
# using data from above
# mydata <- mydata
# Define template (using column names from data.frame)
template <- "**First:** `r First`   **Second:** `r Second` <br>
**Age:** `r Age`
**Submission** <br>
`r Submission`"
# Now process the template for each row of the data.frame
src <- lapply(1:nrow(mydata), function(i) {
  knitr::knit_child(text=template, envir=mydata[i, ], quiet=TRUE)
})
```
# Print result to document
`r knitr::knit_child(text=unlist(src))`
This creates a single file with all the submissions.
I would like to create a single html (or preferably PDF file) for each "sport" listed in the data. So I would have all the submissions for students who do "Ballet" in one file, and a separate file with all the submissions of students who play football.
I have been looking at a few different solutions, and I found this to be the most helpful:
R Knitr PDF: Is there a posssibility to automatically save PDF reports (generated from .Rmd) through a loop?
Following suit, I created a separate R script to loop through and subset the data by sport:
library(rmarkdown)

for (sport in unique(mydata$Sport)){
  subgroup <- mydata[mydata$Sport == sport,]
  render("Essay to Word.Rmd", output_file = paste0('report.', sport, '.html'))
}
Unfortunately, this creates a separate file per sport containing ALL the students, not just those who belong to that sport. Any idea what might be going on with this code above?
Is it possible to directly create these files as PDF docs instead of html? I know I can click on each file to save them as pdf after the fact, but I will have 40 different sports files to work with.
Is it possible to add a thin line between each "submission" essay within a file?
Any help would be great, thank you!!!
This could be achieved via a parametrized report like so (your loop writes identical files because the Rmd always reads the global mydata, never your subgroup):

- Add parameters for the data and e.g. the type of sport to your Rmd.
- Inside the loop, pass your subgroup dataset to render via its params argument.
- You can add horizontal lines via ***.
- If you want PDF, use output_format = "pdf_document". Additionally, to get the document to render I had to switch the LaTeX engine via output_options.
Rmd:
---
params:
  data: null
  sport: null
---
```{r echo = FALSE}
# using data from above
data <- params$data
# Define template (using column names from data.frame)
template <- "
***
**First:** `r First`   **Second:** `r Second` <br>
**Age:** `r Age`
**Submission** <br>
`r Submission`"
# Now process the template for each row of the data.frame
src <- lapply(1:nrow(data), function(i) {
  knitr::knit_child(text=template, envir=data[i, ], quiet=TRUE)
})
```
# Print result to document. Sport: `r params$sport`
`r knitr::knit_child(text=unlist(src))`
R Script:
mydata <- data.frame(First = c("John", "Hui", "Jared","Jenner"),
                     Second = c("Smith", "Chang", "Jzu","King"),
                     Sport = c("Football","Ballet","Ballet","Football"),
                     Age = c("12", "13", "12","13"),
                     Submission = c("Microbes may be the friends of future colonists living off the land on the moon, Mars or elsewhere in the solar system and aiming to establish self-sufficient homes.
Space colonists, like people on Earth, will need what are known as rare earth elements, which are critical to modern technologies. These 17 elements, with daunting names like yttrium, lanthanum, neodymium and gadolinium, are sparsely distributed in the Earth’s crust. Without the rare earths, we wouldn’t have certain lasers, metallic alloys and powerful magnets that are used in cellphones and electric cars.", "But mining them on Earth today is an arduous process. It requires crushing tons of ore and then extracting smidgens of these metals using chemicals that leave behind rivers of toxic waste water.
Experiments conducted aboard the International Space Station show that a potentially cleaner, more efficient method could work on other worlds: let bacteria do the messy work of separating rare earth elements from rock.", "“The idea is the biology is essentially catalyzing a reaction that would occur very slowly without the biology,” said Charles S. Cockell, a professor of astrobiology at the University of Edinburgh.
On Earth, such biomining techniques are already used to produce 10 to 20 percent of the world’s copper and also at some gold mines; scientists have identified microbes that help leach rare earth elements out of rocks.", "Blank"))
for (sport in unique(mydata$Sport)){
  subgroup <- mydata[mydata$Sport == sport,]

  rmarkdown::render("test.Rmd", output_format = "html_document",
                    output_file = paste0('report.', sport, '.html'),
                    params = list(data = subgroup, sport = sport))

  rmarkdown::render("test.Rmd", output_format = "pdf_document",
                    output_options = list(latex_engine = "xelatex"),
                    output_file = paste0('report.', sport, '.pdf'),
                    params = list(data = subgroup, sport = sport))
}
In order to create a PDF directly from your Rmd file, you could use the following function in a separate R script where your data is loaded, and then use map from the purrr package to iterate over the data (in the Rmd file the output must be set to pdf_document):
library(tidyverse)
library(rmarkdown)

get_report <- function(sport){
  # the Rmd sees this filtered copy of mydata, because render()
  # evaluates the document in the calling environment by default
  mydata <- mydata %>%
    filter(Sport == sport)
  render("test.Rmd", output_file = paste0('report_', sport, '.pdf'))
}

map(unique(mydata$Sport), get_report)
Hope that is what you are looking for!

Create a Document Frequency Matrix in R

I am attempting to create a document frequency matrix in R.
I currently have a dataframe (df_2), which is made up of 2 columns:
- doc_num: details which document each term comes from
- text_token: contains each tokenized word for each document
The df's dimensions are 79,447 × 2, but those 79,447 rows cover only 400 actual documents.
I have been trying to create this dfm using the tm package.
I have tried creating a corpus (VectorSource) and then attempting to coerce that into a dfm using the appropriately named dfm() command.
However, this indicates that "dfm() only works on character, corpus, dfm, tokens objects."
I understand my data isn't currently in the correct format for the dfm command to work.
My issue is that I don't know how to get from my current point to a matrix like the one below.
Example of what I would like the matrix to look like when complete: one row per document and one column per term, where a cell value of 2 means "cat" appears twice in doc_2.
Any help on this would be greatly appreciated.
Yours sincerely.
It will be useful for you and others if all pertinent details are made available with your code - such as the fact that dfm() comes from the quanteda package.
If the underlying text is set up correctly, dfm() will directly give you what you are looking for - that is precisely what it is designed for.
Here is a simulation:
library(quanteda)
# install.packages("readtext")
library(readtext)

doc1 <- "COVID-19 can be beaten if all ensure social distance, social distance is critical"
doc2 <- "COVID-19 can be defeated through early self isolation, self isolation is your responsibility"
doc3 <- "Corona Virus can be beaten through early detection & slowing of spread, Corona Virus can be beaten, Yes, Corona Virus can be beaten"
doc4 <- "Corona Virus can be defeated through maximization of social distance"

# save the four documents into your working directory
write.table(doc1, "doc1.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
write.table(doc2, "doc2.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
write.table(doc3, "doc3.txt", sep = "\t", row.names = FALSE, col.names = FALSE)
write.table(doc4, "doc4.txt", sep = "\t", row.names = FALSE, col.names = FALSE)

getwd()
txt <- readtext(paste0("Your WD/docs", "/*"))
txt
corp <- corpus(txt)
x <- dfm(corp) # recent quanteda versions expect dfm(tokens(corp))
View(x)
If the issue is one of formatting / cleaning your data so that you can run dfm(), then you need to post a new question with the necessary details on your data.
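That said, for the two-column data frame described in the question, a minimal sketch (assuming the columns are really named doc_num and text_token) can skip the corpus step entirely, since the text is already tokenized:

library(quanteda)

# one token per row: split the tokens by document id into a list of
# character vectors, turn that list into a quanteda tokens object,
# then tabulate it into a document-feature matrix
toks <- as.tokens(split(df_2$text_token, df_2$doc_num))
x <- dfm(toks)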

A way to add a space between two Unicode characters

I am using R to do an analysis of tweets and would like to include emojis in my analysis. I have read useful resources and consulted the emoji dictionaries from both Jessica Peterka Bonetta and Kate Lyons. However, I am running into a problem when there are emojis right next to each other in tweets.
For example, if I use a tweet with multiple emojis that are spread out, I will get the results I am looking for:
x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")
x
x will return:
"Ummmm our plane <9c><88><8f> got delayed <9a><8f> and I<80><99>m kinda nervous <9f><98><96> but I<80><99>m on my way <9c><85> home <9f><8f> so that<80><99>s really exciting <80><8f> t<80>
Which when matching with Kate Lyons' emoji dictionary:
FindReplace(data = x, Var = "x", replaceData = emoticons, from="R_Encoding", to = "Name", exact = FALSE)
Will yield:
Ummmm our plane AIRPLANE got delayed WARNINGSIGN and I<80><99>m kinda nervous <9f><98><96> but I<80><99>m on my way WHITEHEAVYCHECKMARK home <9f><8f> so that<80><99>s really exciting DOUBLEEXCLAMATIONMARK t<80>
If there is a tweet with two emojis in a row, such as:
"Delayed\U0001f615\U0001f615\n.\n.\n.\n\n#flying #flight #travel #delayed #baltimore #january #flightdelay #travelproblems #bummer… "
Repeating the process with iconv from above will not work, because the run of emoji will not match the codings in the emoji dictionary. Therefore, I thought of adding a space between the two patterns, turning \U0001f615\U0001f615 into \U0001f615 \U0001f615; however, I am struggling with a proper regular expression for this.
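A minimal sketch of one such expression, assuming the emoji of interest fall into the Unicode "Symbol, other" class (\p{So}): a lookahead inserts a space after every such character that is immediately followed by another one.

x <- "Delayed\U0001f615\U0001f615"
# add a space between any two adjacent Symbol-other characters
gsub("(\\p{So})(?=\\p{So})", "\\1 ", x, perl = TRUE)
# the run becomes "\U0001f615 \U0001f615", which the dictionary lookup can match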

How to create a title and footnote for multiple pages in R? In RTF or PDF

# Report section
library(rtf)

output <- "D:/R/Reference program for R/Table_EG_chg.doc" # although this is RTF, the .doc extension lets Word open it
rtf <- RTF(output, width = 8.5, height = 11, font.size = 9, omi = c(0.5, 0.5, 0.5, 0.5))
addHeader(rtf, title = " Table14.3.2.3.1", subtitle = " Vital Signs - Absolute Values", font.size = 9, TOC.level = 0)
addTable(rtf, final, font.size = 9, row.names = FALSE, NA.string = "0", col.justify = 'L', header.col.justify = 'L',
         col.widths = c(1.75, 1.5, 1.25, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5))
addTable(rtf, as.data.frame(head(iris)), font.size = 10, row.names = FALSE, NA.string = "-")
addText(rtf, "\n\n", bold = TRUE, italic = FALSE)
done(rtf) # writes and closes the file
final is my data frame, which I need to print in the RTF output.
This is the code I have used to create the output in RTF. It works fine for the first page, but the remaining pages are missing the title and footnotes. Has anyone done this before? If so, please share the code.
This is easily done in SAS; I need it in R.
I think you are asking about the kind of listings we produce in SAS. I have tried this in R and got the output. In the code below I used a dummy dataset and applied the logic needed to get an RTF document with titles and footnotes on every page.
library(rtf)
final <- data.frame(Subject = c(1001,1002,1003,1004,1005,1006),
                    Country = c("USA","IND","CHN","JPN","SA","EUR"),
                    Age = c(50,60,51,63,73,65),
                    Sex = c("M","F","M","F","M","F"),
                    SBP = c(120,121,119,123,126,128),
                    DBP = c(80,70,75,85,89,71))
final$seq <- rep(seq(1, nrow(final), 2), each = 2)
rtf <- RTF("Table_EG_chg.rtf", width = 11, height = 5, font.size = 9, omi = c(0.5, 0.5, 0.5, 0.5))

for (i in unique(final$seq)){
  # two rows of the listing per page
  new <- final[final$seq == i, ]
  new$seq <- NULL
  name.width <- max(sapply(names(new), nchar))
  new <- format(new, justify = "centre")
  addHeader(rtf, title = "\t\t\t\t\t\t\t\t\tTable14.3.2.3.1", subtitle = "\t\t\t\t\t\t\t\t\tVital Signs - Absolute Values", font.size = 9)
  addTable(rtf, new, font.size = 9, row.names = FALSE, NA.string = "0", col.justify = 'L', header.col.justify = 'L', col.widths = c(1.75, 1.5, 1.25, 1.5, 1.5, 1.5))
  startParagraph.RTF(rtf)
  addText.RTF(rtf, paste("\n", "- Vital signs lab values are collected at the day of ICF.\n"))
  addText.RTF(rtf, "- Vital signs SBP - systolic blood pressure; DBP - diastolic blood pressure")
  endParagraph.RTF(rtf)
  addPageBreak(rtf, width = 11, height = 5, font.size = 9, omi = c(0.5, 0.5, 0.5, 0.5)) # omi = rep(0.5, 0.5, 0.5, 0.5) was a bug
}
done(rtf)
@Jaikumar Sorry it took six years for a package to come out that can finally do what you want. At the end of last year, the reporter package was released. It replicates a lot of the functionality of SAS PROC REPORT: it can do dataset listings, just like SAS, and it repeats titles and footnotes on every page without any special handling. Here is an example:
library(reporter)
library(magrittr)

# Create table
tbl <- create_table(iris) %>%
  titles("Sample Title for Iris Data") %>%
  footnotes("My footnote")

# Create report and add table to report
rpt <- create_report("test.rtf", output_type = "RTF") %>%
  add_content(tbl)

# Write the report
write_report(rpt)
It can also print in RTF, PDF, and TXT. To use PDF, just change the file name and the output_type.
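For example, a minimal sketch of the PDF variant, reusing the tbl object from above:

# same report, written as PDF instead of RTF
rpt <- create_report("test.pdf", output_type = "PDF") %>%
  add_content(tbl)
write_report(rpt)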

word frequency scatterplot in R (words as labels)

I'm currently working on a paper comparing British MPs' roles in Parliament and their roles on twitter. I have collected twitter data (most importantly, the raw text) and speeches in Parliament from one MP and wish to do a scatterplot showing which words are common in both twitter and Parliament (top right hand corner) and which ones are not (bottom left hand corner). So, x-axis is word frequency in parliament, y-axis is word frequency on twitter.
So far, I have done all the work on this paper with R. I have ZERO experience with R, up until now I've only worked with STATA.
I tried adapting this code (http://is-r.tumblr.com/post/37975717466/text-analysis-made-too-easy-with-the-tm-package), but I just can't work it out. The main problem is that the person who wrote this code uses one text document and regular expressions to demarcate which text belongs on which axis. I, however, have two separate documents (which I have saved as .txt files, corpora, and term-document matrices) that should correspond to the separate axes.
I'm sorry that a novice such as myself is bothering you with this, and I will devote more time this year to learning the basics of R so that I could solve this problem by myself. However, this paper is due next Monday and I simply can't do so much backtracking right now to solve the problem.
I would be really grateful if you could help me,
thanks very much,
Nik
EDIT: I'll put in the code that I've made, even though it's not quite in the right direction, but that way I can offer a proper example of what I'm dealing with.
I have tried implementing is.R()'s approach by putting the text in question in a csv file, with a dummy variable to classify whether it is twitter text or speech text. I follow the approach, and at the end I even get a scatterplot; however, it plots a number (I think it is the position of the word in the dataset?) rather than the word. I think the problem might be that R is handling every line in the csv file as a separate text document.
# In Excel I built a csv dataset that contains all the text, one instance
# (single tweet / speech) per line, with a dummy variable that clarifies
# whether the text is a tweet or a speech ("istwitter", 1 = twitter).
comparison_watson.df <- read.csv(file="data/watson_combo.csv", stringsAsFactors = FALSE)
# now to make a text corpus out of the data frame
comparison_watson_corpus <- Corpus(DataframeSource(comparison_watson.df))
inspect(comparison_watson_corpus)
# now to make a term-document-matrix
comparison_watson_tdm <-TermDocumentMatrix(comparison_watson_corpus)
inspect(comparison_watson_tdm)
comparison_watson_tdm <- inspect(comparison_watson_tdm)
sort(colSums(comparison_watson_tdm))
table(colSums(comparison_watson_tdm))
termCountFrame_watson <- data.frame(Term = rownames(comparison_watson_tdm))
termCountFrame_watson$twitter <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 1, ])
termCountFrame_watson$speech <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 0, ])
head(termCountFrame_watson)
zp1 <- ggplot(termCountFrame_watson)
zp1 <- zp1 + geom_text(aes(x = twitter, y = speech, label = Term))
print(zp1)
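The underlying problem: TermDocumentMatrix puts terms in rows and documents in columns, so indexing its rows by istwitter mixes up the two dimensions. Collapsing each group into a single document first keeps the bookkeeping simple: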
library(tm)

txts <- c(twitter = "bla bla bla blah blah blub",
          speech = "bla bla bla bla bla bla blub blub")

corp <- Corpus(VectorSource(txts))
term.matrix <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(term.matrix)
colnames(term.matrix) <- names(txts)
term.matrix <- as.data.frame(term.matrix)
library(ggplot2)

ggplot(term.matrix,
       aes_string(x = names(txts)[1],
                  y = names(txts)[2],
                  label = "rownames(term.matrix)")) +
  geom_text()
You might also want to try out these two buddies:
library(wordcloud)
comparison.cloud(term.matrix)
commonality.cloud(term.matrix)
You are not posting a reproducible example, so I cannot give you code, only point you to resources. Text scraping and processing is a bit difficult with R, but there are many guides. Check this and this.
In the example from One R Tip A Day you get the word list at d$word and the word frequency at d$freq.
