Tokenizing a PDF for quantitative analysis in R

I ran into an issue using the unnest_tokens function on a data_frame. I am working with PDF files that I want to compare.
text_path <- "c:/.../text1.pdf"
text_raw <- pdf_text("c:/.../text1.pdf")
text1df <- data_frame(Zeile = 1:25,
                      text_raw)
So far so good. But here comes my problem:
unnest_tokens(output = token, input = content) -> text1_long
Error: Must extract column with a single valid subscript.
x Subscript var has the wrong type function.
i It must be numeric or character.
I want to tokenize my PDF files so I can analyse word frequencies and perhaps compare multiple PDFs via word clouds.

Here is a piece of simple code. I kept your German variable names so you can copy-paste everything.
library(pdftools)
library(dplyr)
library(stringr)
library(tidytext)
file_location <- "d:/.../my_doc.pdf"
text_raw <- pdf_text(file_location)
# Zeile 12 because I only have 12 pages
text1df <- data_frame(Zeile = 1:12,
                      text_raw)
text1df_long <- unnest_tokens(text1df, output = wort, input = text_raw) %>%
  filter(str_detect(wort, "[a-z]"))
text1df_long
# A tibble: 4,134 x 2
Zeile wort
<int> <chr>
1 1 training
2 1 and
3 1 development
4 1 policy
5 1 contents
6 1 policy
7 1 statement
8 1 scope
9 1 induction
10 1 training
# ... with 4,124 more rows
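From here, the word frequencies and word clouds asked about are a short step away. A minimal sketch, assuming the wordcloud package is installed and continuing from text1df_long above:
library(wordcloud)
# Tally how often each token occurs across the whole document
wort_freq <- count(text1df_long, wort, sort = TRUE)
# Draw a word cloud; min.freq drops words that appear only once
wordcloud(words = wort_freq$wort, freq = wort_freq$n, min.freq = 2)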

Related

Same names in Columns in R

In R, I am using the read_excel function to import some files. The problem is that my files have some columns with the same name. Is there any way to force keeping the same names? (I know it's not good practice, but it's for a very specific case.)
New names:
* `44228` -> `44228...4`
* `44229` -> `44229...5`
* `44230` -> `44230...6`
* `44231` -> `44231...7`
* `44232` -> `44232...8`
I need to use a conversion factor keyed on these names, so I need to keep the original column names; they are dates.
You can use the .name_repair argument of read_excel() to control, and turn off, the checks applied to column names by tibble(). So to allow duplicate names:
library("readxl")
library("writexl") # Only needed to generate an example xlsx file
x <- data.frame(a = 1:3, a = 1:3, a = 1:3, check.names = FALSE)
write_xlsx(x, "data.xlsx")
read_xlsx("data.xlsx", .name_repair = "minimal")
#> # A tibble: 3 x 3
#> a a a
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 2 2 2
#> 3 3 3 3
Although do be aware that duplicate column names are closer to a syntax error than "bad practice", so the resulting object will behave in strange ways:
df <- read_xlsx("data.xlsx", .name_repair = "minimal")
df$a
#> [1] 1 2 3
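Since $ only ever returns the first matching column, positional or logical indexing is needed to reach the others. A small sketch using the df from above:
df[[2]]               # second column by position, whatever its name
#> [1] 1 2 3
df[names(df) == "a"]  # every column named "a", as a tibble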

R - Creating DFs (tibbles) in a loop. How to rename them, and the columns inside, to include a date? (I do it with eval(...), but is there a better solution?)

I have a loop that creates a tibble, tbl, at the end of each iteration. The loop uses a different date each time, date.
Assume:
tbl <- tibble(colA = 1:5, colB = 5:9)
date <- as.Date("2017-02-28")
> tbl
# A tibble: 5 x 2
colA colB
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
(The contents change on every iteration, but the names tbl and date, and the column names colA and colB, always stay the same.)
The output objects I want need names starting with output: outputdate1, outputdate2, etc.
The columns inside should be named colAdate1, colBdate1, then colAdate2, colBdate2, and so on.
At the moment I am using this piece of code, which works, but is not easy to read:
eval(parse(text = (
paste0("output", year(date), months(date), " <- tbl %>% rename(colA", year(date), months(date), " = 'colA', colB", year(date), months(date), " = 'colB')")
)))
It produces this code for eval(parse(...)) to evaluate:
"output2017February <- tbl %>% rename(colA2017February = 'colA', colB2017February = 'colB')"
Which gives me the output that I want:
> output2017February
# A tibble: 5 x 2
colA2017February colB2017February
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
Is there a better way of doing this? (Preferably with dplyr)
Thanks!
This avoids eval and is easier to read:
ym <- "2017February"
assign(paste0("output", ym), setNames(tbl, paste0(names(tbl), ym)))
Partial rename
If you only wanted to replace the names in the character vector old with the corresponding names in the character vector new then use the following:
assign(paste0("output", ym),
setNames(tbl, replace(names(tbl), match(old, names(tbl)), new)))
Variation
You might consider putting your data frames in a list instead of having a bunch of loose objects in your workspace:
L <- list()
L[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
.GlobalEnv could also be used in place of L (omitting the L <- list() line) if you like this style but still want to put the objects separately in the global environment.
dplyr
Here it is using dplyr and rlang but it does involve increased complexity:
library(dplyr)
library(rlang)
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
  rename(!!!setNames(names(tbl), paste0(names(tbl), ym)))
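If you are on dplyr 1.0.0 or later, rename_with() expresses the same renaming more compactly; a sketch under that assumption:
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
  rename_with(~ paste0(.x, ym))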

Get word occurrences

I am trying to get the number of occurrences of each word in a CSV file with R.
My dataset looks like this:
TITLE
1 My first Android app after a year
2 Unmanned drone buzzes French police car
3 Make anything editable with HTML5
4 Predictive vs Reactive control
5 What was it like to move to San Antonio and go through TechStars Cloud?
6 Health-care sector vulnerable to hackers, researchers say
I have tried using the function from 'Machine Learning for Hackers':
get.tdm <- function(doc.vec) {
doc.corpus <- Corpus(VectorSource(doc.vec))
control <- list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE, minDocFreq=2)
doc.dtm <- TermDocumentMatrix(doc.corpus, control)
return(doc.dtm)
}
But I get an error I don't understand:
Error: is.Source(s) is not TRUE
In addition: Warning message:
In is.Source(s) : vectorized sources must have a positive length entry
What could possibly be the problem?
This works for me (calling your data frame df):
library(tm)
doc.corpus <- Corpus(VectorSource(df))
freq <- data.frame(count=termFreq(doc.corpus[[1]]))
freq
# count
# after 1
# and 1
# android 1
# antonio 1
# anything 1
# ...
# unmanned 1
# vulnerable 1
# was 1
# what 1
# with 1
# year 1
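This works because VectorSource(df) collapses the whole data frame into a single document, so termFreq() on doc.corpus[[1]] sees every title at once. A more conventional route is one document per title plus a term-document matrix; a sketch, assuming the TITLE column shown in the question:
tdm <- TermDocumentMatrix(Corpus(VectorSource(df$TITLE)))
# rowSums adds up each term's occurrences over every title
head(sort(rowSums(as.matrix(tdm)), decreasing = TRUE))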

List of word frequencies using R

I have been using the tm package to run some text analysis.
My problem is with creating a list of words and their associated frequencies.
library(tm)
library(RWeka)
txt <- read.csv("HW.csv",header=T)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"
myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords('english'),"originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
#building the TDM
btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm))
I typically use the following code to generate a list of words in a frequency range:
frq1 <- findFreqTerms(myTdm, lowfreq=50)
Is there any way to automate this such that we get a data frame with all words and their frequencies?
The other problem I face is converting the term-document matrix into a data frame. As I am working on large samples of data, I run into memory errors.
Is there a simple solution for this?
Try this:
data("crude")
myTdm <- as.matrix(TermDocumentMatrix(crude))
FreqMat <- data.frame(ST = rownames(myTdm),
                      Freq = rowSums(myTdm),
                      row.names = NULL)
head(FreqMat, 10)
# ST Freq
# 1 "(it) 1
# 2 "demand 1
# 3 "expansion 1
# 4 "for 1
# 5 "growth 1
# 6 "if 1
# 7 "is 2
# 8 "may 1
# 9 "none 2
# 10 "opec 2
I have the following lines in R that can help to create word frequencies and put them in a table. They read a text file in .txt format, create the word frequencies, and write them out; I hope this helps anyone interested.
avisos <- scan("anuncio.txt", what = "character", sep = "\n")
avisos1 <- tolower(avisos)
avisos2 <- strsplit(avisos1, "\\W")   # split on non-word characters
avisos3 <- unlist(avisos2)
freq <- table(avisos3)
freq1 <- sort(freq, decreasing = TRUE)
temple.sorted.table <- paste(names(freq1), freq1, sep = "\t")
# Note: this writes the frequency table back over the input file
cat("Word\tFREQ", temple.sorted.table, file = "anuncio.txt", sep = "\n")
Looking at the source of findFreqTerms, it appears that the function slam::row_sums does the trick when called on a term-document matrix. Try, for instance:
data(crude)
slam::row_sums(TermDocumentMatrix(crude))
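slam::row_sums() works on the sparse matrix directly and returns a named vector, so it sidesteps the memory-hungry as.matrix() conversion that causes the errors mentioned in the question. Turning the result into the requested data frame might look like this sketch:
freqs <- slam::row_sums(TermDocumentMatrix(crude))
FreqMat <- data.frame(ST = names(freqs), Freq = freqs, row.names = NULL)
head(FreqMat[order(-FreqMat$Freq), ], 10)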
Depending on your needs, using some tidyverse functions might be a rough solution that offers some flexibility in terms of how you handle capitalization, punctuation, and stop words:
text_string <- 'I have been using the tm package to run some text analysis. My problem is with creating a list with words and their frequencies associated with the same. I typically use the following code for generating list of words in a frequency range. Is there any way to automate this such that we get a dataframe with all words and their frequency?
The other problem that i face is with converting the term document matrix into a data frame. As i am working on large samples of data, I run into memory errors. Is there a simple solution for this?'
stop_words <- c('a', 'and', 'for', 'the') # just a sample list of words I don't care about
library(tidyverse)
data_frame(text = text_string) %>%
  mutate(text = tolower(text)) %>%
  mutate(text = str_remove_all(text, '[[:punct:]]')) %>%
  mutate(tokens = str_split(text, "\\s+")) %>%
  unnest() %>%
  count(tokens) %>%
  filter(!tokens %in% stop_words) %>%
  mutate(freq = n / sum(n)) %>%
  arrange(desc(n))
# A tibble: 64 x 3
tokens n freq
<chr> <int> <dbl>
1 i 5 0.0581
2 with 5 0.0581
3 is 4 0.0465
4 words 3 0.0349
5 into 2 0.0233
6 list 2 0.0233
7 of 2 0.0233
8 problem 2 0.0233
9 run 2 0.0233
10 that 2 0.0233
# ... with 54 more rows
library(plyr)  # count(df, vars = ...) below is plyr's count()
a <- scan(file = '~/Desktop//test.txt', what = "character")
a1 <- data.frame(lst = a)
count(a1, vars = "lst")
seems to work to get simple frequencies. I've used scan because I had a txt file, but it should work with read.csv too.
Does apply(myTdm, 1, sum) or rowSums(as.matrix(myTdm)) give the ngram counts you're after?

R: load R table or csv file with the textconnection command

In a previous message,
Convert table into matrix by column names
I want to use the same approach for a CSV table or a table in R. Could you teach me how to modify the first command line?
x <- read.table(textConnection('models cores time
4 1 0.000365
4 2 0.000259
4 3 0.000239
4 4 0.000220
8 1 0.000259
8 2 0.000249
8 3 0.000251
8 4 0.000258'), header=TRUE)
library(reshape)
cast(x, models ~ cores)
Should I use the following for a data.csv file?
x <- read.csv(textConnection("data.csv"), header=TRUE)
Should I use the following for an R table named xyz?
x <- xyz(textConnection(xyz), header=TRUE)
Is it a must to have textConnection for using the cast command?
Thank you.
Several years later...
read.table and its derivatives like read.csv now have a text argument, so you don't need to mess around with textConnections directly anymore.
read.table(text = "
x y z
1 1.9 'a'
2 0.6 'b'
", header = TRUE)
The main use for textConnection is when people who ask questions on SO just dump their data onscreen, rather than writing code to let answerers generate it themselves. For example,
Blah blah blah I'm stuck here is my data plz help omg
x y z
1 1.9 'a'
2 0.6 'b'
etc.
In this case you can copy the text from the screen and wrap it in a call to textConnection, like so:
the_data <- read.table(tc <- textConnection("x y z
1 1.9 'a'
2 0.6 'b'"), header = TRUE); close(tc)
It is much nicer when questioners provide code, like this:
the_data <- data.frame(x = 1:2, b = c(2.9, 0.6), c = letters[1:2])
When you are using your own data, you shouldn't ever need to use textConnection.
my_data <- read.csv("my data file.csv") should suffice.
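As for the last question: no, textConnection is not needed for cast(); cast() just takes a data frame, however you read it in. With current packages, tidyr::pivot_wider() performs the same reshaping as cast(x, models ~ cores); a sketch, assuming the x from the question:
library(tidyr)
# One row per models value, one time column per cores value
pivot_wider(x, names_from = cores, values_from = time)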
