Calculate `tf-idf` for a data frame of documents - r

The following code
library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- austen_books() %>%
unnest_tokens(word, text) %>%
count(book, word, sort = TRUE)
book_words <- book_words %>%
bind_tf_idf(word, book, n)
book_words
taken from Term Frequency and Inverse Document Frequency (tf-idf) Using Tidy Data Principles, estimates the tf-idf in Jane Austen's works. However, this code appears to be specific to Jane Austen's books. I would like to derive, instead, the tf-idf for the following data frame:
sentences<-c("The color blue neutralizes orange yellow reflections.",
"Zod stabbed me with blue Kryptonite.",
"Because blue is your favourite colour.",
"Red is wrong, blue is right.",
"You and I are going to yellowstone.",
"Van Gogh looked for some yellow at sunset.",
"You ruined my beautiful green dress.",
"You do not agree.",
"There's nothing wrong with green.")
df=data.frame(text = sentences,
class = c("A","B","A","C","A","B","A","C","D"),
weight = c(1,1,3,4,1,2,3,4,5))

There are two things you need to change:
Since you did not set stringsAsFactors = FALSE when constructing the data.frame (it defaulted to TRUE before R 4.0.0), you need to convert text to character first.
You do not have a column named book, so you have to pick another column as the document. Since your example includes a column named class, I assume you want to calculate the tf-idf over that column.
Here is the code:
library(dplyr)
library(tidytext)
book_words <- df %>%
mutate(text = as.character(text)) %>%
unnest_tokens(output = word, input = text) %>%
count(class, word, sort = TRUE)
book_words <- book_words %>%
bind_tf_idf(term = word, document = class, n)
book_words
#> # A tibble: 52 x 6
#> class word n tf idf tf_idf
#> <fct> <chr> <int> <dbl> <dbl> <dbl>
#> 1 A blue 2 0.0769 0.288 0.0221
#> 2 A you 2 0.0769 0.693 0.0533
#> 3 C is 2 0.2 0.693 0.139
#> 4 A and 1 0.0385 1.39 0.0533
#> 5 A are 1 0.0385 1.39 0.0533
#> 6 A beautiful 1 0.0385 1.39 0.0533
#> 7 A because 1 0.0385 1.39 0.0533
#> 8 A color 1 0.0385 1.39 0.0533
#> 9 A colour 1 0.0385 1.39 0.0533
#> 10 A dress 1 0.0385 1.39 0.0533
#> # ... with 42 more rows
The documentation has helpful remarks on this; check out ?count and ?bind_tf_idf.
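To see where the numbers in the first row of that output come from, the tf-idf of "blue" in class A can be recomputed by hand. This is a minimal base R sketch, assuming the definitions described in ?bind_tf_idf (tf is the term's share of all words in the document, idf is the natural log of the number of documents divided by the number of documents containing the term). The regular expression only approximates the tokenizer behind unnest_tokens(), and the variable names are illustrative.

```r
sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")
class <- c("A", "B", "A", "C", "A", "B", "A", "C", "D")

# Lower-case and split on anything that is not a letter or an apostrophe
# (an approximation of the tidytext tokenizer, sufficient for this data)
tokens <- strsplit(tolower(sentences), "[^a-z']+")

# Pool the words of all sentences sharing a class: one "document" per class
words_by_class <- split(unlist(tokens), rep(class, lengths(tokens)))

# Term frequency: occurrences of "blue" in A divided by all words in A
tf_blue_A <- sum(words_by_class$A == "blue") / length(words_by_class$A)

# Inverse document frequency: ln(number of documents / documents containing "blue")
docs_with_blue <- sum(vapply(words_by_class, function(w) "blue" %in% w, logical(1)))
idf_blue <- log(length(words_by_class) / docs_with_blue)

tf_idf_blue_A <- tf_blue_A * idf_blue
round(c(tf = tf_blue_A, idf = idf_blue, tf_idf = tf_idf_blue_A), 4)
```

Class A contains 26 words, two of which are "blue", and "blue" occurs in three of the four classes, so the result is 2/26 times ln(4/3), in line with the 0.0769, 0.288 and 0.0221 printed by bind_tf_idf().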

Related

mutate( ) returns a matrix

After I updated RStudio today, I tried to get z-scores for a data frame using mutate() and scale(), and it now returns a matrix with a 'New names' message:
df <- df %>% group_by(participants) %>% mutate(zscore=scale(answer))
New names:
* NA -> ...8
class(df$zscore)
[1] "matrix" "array"
The z-score column should have been named 'zscore', so why is it now named '...8'? I never had any problems with this code before. Is it because of the update?
I think you just added another column without a header or read in data with a column without a header. There is no issue with your classes.
library(tidyverse)
test <- mtcars|>
group_by(cyl) |>
mutate(zscore=scale(mpg))
#class of test
class(test)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
#class of column
class(test$zscore)
#> [1] "matrix" "array"
#recreate warning
test <- test |>
bind_cols("")
#> New names:
#> * `` -> `...13`
The warning at the bottom means that I added a column without a name in the 13th position.
Part of the issue is that scale() returns a matrix. You can fix this by wrapping in as.double():
library(dplyr)
starwars2 <- starwars %>%
select(height, gender) %>%
group_by(gender) %>%
mutate(zscore = as.double(scale(height)))
Output:
# A tibble: 87 × 3
# Groups: gender [3]
height gender zscore
<int> <chr> <dbl>
1 172 masculine -0.120
2 167 masculine -0.253
3 96 masculine -2.14
4 202 masculine 0.677
5 150 feminine -0.624
6 178 masculine 0.0394
7 165 feminine 0.0133
8 97 masculine -2.11
9 183 masculine 0.172
10 182 masculine 0.146
# … with 77 more rows
But I’m not sure this explains your NA -> ...8 issue. If not, please update your question to include your data (using dput(df)) or a subset (using dput(head(df))).
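The matrix behaviour is easy to check outside of any pipeline. A minimal base R sketch with a made-up three-element vector:

```r
x <- c(1, 2, 3)

# scale() centers and scales, but always returns a one-column matrix
z <- scale(x)
is.matrix(z)
dim(z)

# as.double() drops the dim and scaling attributes, leaving a plain numeric vector
z_vec <- as.double(z)
is.matrix(z_vec)
z_vec
```

Because the mean of x is 2 and its standard deviation is 1, z_vec holds -1, 0 and 1, and it can be stored as an ordinary column by mutate().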

row bind list columns using dplyr

I would like to find a better way to bind together the results of any number of regressions after adding an identifier for each model. The code below is my current solution but is too manual for a large number of regressions. This is part of a larger tidy workflow so a solution inside of the tidyverse is preferred but whatever works is fine. Thanks
library(tidyverse)
library(broom)
model_dat=mtcars %>%
do(lm_1 = tidy(lm(disp~ wt*vs, data = .),conf.int=T),
lm_2=tidy(lm(cyl ~ wt*vs, data = .),conf.int=T ),
lm_3=tidy(lm(mpg ~ wt*vs, data = .),conf.int=T ))
df=model_dat %>%
select(lm_1) %>%
unnest(c(lm_1)) %>%
mutate(model="one") %>%
select(model,term,estimate,p.value:conf.high) %>%
bind_rows(
model_dat %>%
select(lm_2) %>%
unnest(c(lm_2)) %>%
mutate(model="two") %>%
select(model,term,estimate,p.value:conf.high)) %>%
bind_rows(
model_dat %>%
select(lm_3) %>%
unnest(c(lm_3)) %>%
mutate(model="three") %>%
select(model,term,estimate,p.value:conf.high))
It may be easier with map2: loop over the columns together with the corresponding English word for each column's position, pluck the list element, create the 'model' column from the second argument (.y, the English word), select the columns of interest, and build a single dataset by using the _dfr variant of map2.
library(purrr)
library(english)
library(dplyr)
library(broom)
map2_dfr(model_dat, as.character(english(seq_along(model_dat))),
~ .x %>%
pluck(1) %>%
mutate(model = .y) %>%
select(model, term, estimate, p.value:conf.high) )
-output
# A tibble: 12 x 6
# model term estimate p.value conf.low conf.high
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one (Intercept) -70.0 1.55e- 1 -168. 28.2
# 2 one wt 102. 8.20e- 9 76.4 128.
# 3 one vs 31.2 6.54e- 1 -110. 172.
# 4 one wt:vs -36.7 1.10e- 1 -82.2 8.82
# 5 two (Intercept) 4.31 1.28e- 5 2.64 5.99
# 6 two wt 0.849 4.90e- 4 0.408 1.29
# 7 two vs -2.19 7.28e- 2 -4.59 0.216
# 8 two wt:vs 0.0869 8.20e- 1 -0.689 0.862
# 9 three (Intercept) 29.5 6.55e-12 24.2 34.9
#10 three wt -3.50 2.33e- 5 -4.92 -2.08
#11 three vs 11.8 4.10e- 3 4.06 19.5
#12 three wt:vs -2.91 2.36e- 2 -5.40 -0.419
Or use summarise with across, unclass and then bind with bind_rows
model_dat %>%
summarise(across(everything(), ~ {
# // get the column name
nm1 <- cur_column()
# // extract the list element (.[[1]])
list(.[[1]] %>%
# // create new column by extracting the numeric part
mutate(model = english(readr::parse_number(nm1))) %>%
# // select the subset of columns, wrap in a list
select(model, term, estimate, p.value:conf.high))
}
)) %>%
# // unclass to list
unclass %>%
# // bind the list elements
bind_rows
-output
# A tibble: 12 x 6
# model term estimate p.value conf.low conf.high
# <english> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one (Intercept) -70.0 1.55e- 1 -168. 28.2
# 2 one wt 102. 8.20e- 9 76.4 128.
# 3 one vs 31.2 6.54e- 1 -110. 172.
# 4 one wt:vs -36.7 1.10e- 1 -82.2 8.82
# 5 two (Intercept) 4.31 1.28e- 5 2.64 5.99
# 6 two wt 0.849 4.90e- 4 0.408 1.29
# 7 two vs -2.19 7.28e- 2 -4.59 0.216
# 8 two wt:vs 0.0869 8.20e- 1 -0.689 0.862
# 9 three (Intercept) 29.5 6.55e-12 24.2 34.9
#10 three wt -3.50 2.33e- 5 -4.92 -2.08
#11 three vs 11.8 4.10e- 3 4.06 19.5
#12 three wt:vs -2.91 2.36e- 2 -5.40 -0.419
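Another possibility, not taken from the answers above, is to skip the do() step and keep the formulas in a named list, so that bind_rows() can create the identifier itself through its .id argument. This is a sketch under the assumption that the model labels ("one", "two", "three") can be typed by hand as list names; the object names formulas and result are illustrative.

```r
library(dplyr)
library(purrr)
library(broom)

# One formula per regression; the list names become the model labels
formulas <- list(
  one   = disp ~ wt * vs,
  two   = cyl ~ wt * vs,
  three = mpg ~ wt * vs
)

result <- formulas %>%
  # Fit each model and tidy it, including confidence intervals
  map(~ tidy(lm(.x, data = mtcars), conf.int = TRUE)) %>%
  # Stack the tidied tables; .id stores each list name in a 'model' column
  bind_rows(.id = "model") %>%
  select(model, term, estimate, p.value:conf.high)

result
```

Adding a regression then only means adding one entry to the list.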

Using word2vec to substitute less frequent words in data frame R

I have a data frame data1 with cleaned strings of text matched to their ids
# A tibble: 2,000 x 2
id text
<int> <chr>
1 decent scene guys visit spanish lady hilarious flamenco music background re…
3 movie beautiful plot depth kolossal scenes battles moral rationale br br conclusion wond…
4 fan scream killing astonishment story summarized don time move ii won regret plot ironical
5 mistake film guess minutes clunker fought hard stay seat lose hours life feeling br his…
6 phoned awful bed dog ranstuck br br positive grooming eldest daughter beeeatch br ous…
# … with 1,990 more rows
I have also created a new variable freq that gives the tf, idf and tf_idf for every word. In order, the columns of freq are id, word, n, tf, idf, tf_idf
# A tibble: 112,709 x 6
id word n tf idf tf_idf
<int> <chr> <int> <dbl> <dbl> <dbl>
1 335 starcrash 1 0.5 7.60 3.80
2 2974 carly 1 0.5 6.50 3.25
3 1796 phillips 1 0.5 5.81 2.90
4 1796 eric 1 0.5 5.40 2.70
5 1398 wilson 1 0.5 5.20 2.60
6 684 apolitical 1 0.333 7.60 2.53
7 1485 saimin 1 0.333 7.60 2.53
8 1398 charlie 1 0.5 4.77 2.38
9 2733 shouldn 1 0.5 4.71 2.36
10 2974 jones 1 0.5 4.47 2.23
# … with 112,699 more rows
I am trying to create a loop that goes through this second variable and uses word2vec to substitute, in data1, any word whose tf is lower than the mean of all the others with its closest match.
I have tried the function
replace_word <- function(x) {
x<-hunspell_suggest(x)
x<-mutate(x)
p<-system.file(package = "word2vec", "models", "example.bin")
m<-read.word2vec(p)
s<-predict(m, x, type='nearest', top_n=1)
paste0(s)
}
But when I run it it goes into an infinite loop. I originally wanted to check whether the spelling of the word was correct first, but because there are words not in the dictionary I kept on getting errors.
Because I have never done something like this before, I really don't know how to make it work. Could someone please help?
Thank you
Maybe this code is what you are looking for. You could also use a pretrained word2vec model; in the example below, the word2vec model is trained on your own data (more info at https://www.bnosac.be/index.php/blog/100-word2vec-in-r).
library(word2vec)
library(udpipe)
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
data1 <- data.frame(id = x$id, text = tolower(x$feedback), stringsAsFactors = FALSE)
str(data1)
#> 'data.frame': 500 obs. of 2 variables:
#> $ id : int 19991431 21054450 22581571 23542577 40676307 46755068 23831365 23016812 46958471 28687866 ...
#> $ text: chr "zeer leuke plek om te vertoeven , rustig en toch erg centraal gelegen in het centrum van brussel , leuk adres o"| __truncated__ "het appartement ligt op een goede locatie: op loopafstand van de europese wijk en vlakbij verschilende metrosta"| __truncated__ "bedankt bettina en collin. ik ben heel blij dat ik bij jullie heb verbleven, in zo'n prachtige stille omgeving "| __truncated__ "ondanks dat het, zoals verhuurder joffrey zei, geen last minute maar een last seconde boeking was, is alles per"| __truncated__ ...
freq <- strsplit.data.frame(data1, term = "text", group = "id", split = "[[:space:][:punct:][:digit:]]+")
freq <- document_term_frequencies(freq)
freq <- document_term_frequencies_statistics(freq)
freq <- freq[, c("doc_id", "term", "freq", "tf", "idf", "tf_idf")]
head(freq)
#> doc_id term freq tf idf tf_idf
#> 1: 19991431 zeer 1 0.03125 1.5702172 0.04906929
#> 2: 19991431 leuke 1 0.03125 1.9519282 0.06099776
#> 3: 19991431 plek 1 0.03125 2.5770219 0.08053194
#> 4: 19991431 om 2 0.06250 1.4105871 0.08816169
#> 5: 19991431 te 2 0.06250 0.9728611 0.06080382
#> 6: 19991431 vertoeven 1 0.03125 4.6051702 0.14391157
## Build word2vec model
set.seed(123456789)
w2v <- word2vec(x = data1$text, dim = 15, iter = 20, min_count = 0, lr = 0.05, type = "cbow")
vocabulary <- summary(w2v, type = "vocabulary")
## For each word, find the most similar one if it is part of the word2vec vocabulary
freq$similar_word <- ifelse(freq$term %in% vocabulary, freq$term, NA)
freq$similar_word <- lapply(freq$similar_word, FUN = function(x){
if(!is.na(x)){
x <- predict(w2v, x, type = 'nearest', top_n = 1)
x <- x[[1]]$term2
}
x
})
head(freq)
#> doc_id term freq tf idf tf_idf similar_word
#> 1: 19991431 zeer 1 0.03125 1.5702172 0.04906929 plezierig
#> 2: 19991431 leuke 1 0.03125 1.9519282 0.06099776 cafes
#> 3: 19991431 plek 1 0.03125 2.5770219 0.08053194 opportuniteit
#> 4: 19991431 om 2 0.06250 1.4105871 0.08816169 verblijven
#> 5: 19991431 te 2 0.06250 0.9728611 0.06080382 overnachten
#> 6: 19991431 vertoeven 1 0.03125 4.6051702 0.14391157 comfortabele
As for the threshold (0.5, say) below which a word gets replaced, that is up to you to define.
Going by the text of your question, I think you are looking for a way to selectively update the value of the column named word in a data frame called freq using a specialized function to find a replacement value, but only for rows where the value of tf is below a set threshold. For that, here's an example using a tidyverse approach, with some simplifications with regard to your word replacement algorithm.
library(tidyverse)
# a placeholder for your word replacement function
replace_word <- function(x) {
paste0(x, "*")
}
# Creating some simplified example data to work with
freq <- tibble(
id = c(1, 2, 3, 4, 5),
word = c("aa", "bb", "cc", "dd", "ee"),
tf = c(0.001, 0.003, 0.005, 0.007, 0.009)
)
print(freq)
# A tibble: 5 x 3
id word tf
<dbl> <chr> <dbl>
1 1 aa 0.001
2 2 bb 0.003
3 3 cc 0.005
4 4 dd 0.007
5 5 ee 0.009
# Making changes to a column using `mutate()` and `if_else()` to do so conditionally.
freq <- freq %>%
mutate(
word = if_else(tf < 0.007, replace_word(word), word)
)
print(freq)
# A tibble: 5 x 3
id word tf
<dbl> <chr> <dbl>
1 1 aa* 0.001
2 2 bb* 0.003
3 3 cc* 0.005
4 4 dd 0.007
5 5 ee 0.009
The first 3 values of word are updated with stars. Does that help?
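Since the question asks for a threshold equal to the mean tf rather than a fixed number, the same idea can be sketched in base R with a logical index. The data and the star-appending replace_word() are placeholders, as in the example above, with slightly different made-up tf values so that no value ties with the mean.

```r
# Placeholder for the real word2vec lookup
replace_word <- function(x) {
  paste0(x, "*")
}

freq <- data.frame(
  id   = 1:5,
  word = c("aa", "bb", "cc", "dd", "ee"),
  tf   = c(0.001, 0.003, 0.005, 0.007, 0.014),
  stringsAsFactors = FALSE
)

# Threshold taken from the data: the mean term frequency
threshold <- mean(freq$tf)

# Rows whose tf falls below the threshold
rare <- freq$tf < threshold

# Call the replacement function on the rare words only
freq$word[rare] <- replace_word(freq$word[rare])
freq$word
```

Unlike if_else(), which evaluates replace_word() on every row before picking values, the logical index passes only the rare words to the function, which may matter when each lookup is expensive.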

What is the equivalent of survey::svymean(~interaction()) using the srvyr package?

I need some help analyzing survey data.
Here is my code.
Data prep
library(survey)
library(srvyr)
data(api)
dclus2 <- apiclus1 %>%
as_survey_design(dnum, weights = pw, fpc = fpc)
These two snippets give me the same result.
One using the package survey
#Code
survey::svymean(~awards, dclus2)
#Results
mean SE
awardsNo 0.28962 0.033
awardsYes 0.71038 0.033
One using the package srvyr
#Code
dclus2 %>%
group_by(awards)%>%
summarise(m=survey_mean())
#Results
awards m m_se
No 0.2896175 0.0330183
Yes 0.7103825 0.0330183
I would like to get the survey mean of the variable "awards" (with levels No and Yes), subset by the variable "stype".
In the survey package, interaction() is used, e.g. svymean(~interaction(awards, stype), dclus2). How do I get the same result using the srvyr package?
Thank you for your help
How do I get the result below using the package srvyr?
#Code
svymean(~interaction(awards,stype), dclus2)
#Results
mean SE
interaction(awards, stype)No.E 0.180328 0.0250
interaction(awards, stype)Yes.E 0.606557 0.0428
interaction(awards, stype)No.H 0.043716 0.0179
interaction(awards, stype)Yes.H 0.032787 0.0168
interaction(awards, stype)No.M 0.065574 0.0230
interaction(awards, stype)Yes.M 0.071038 0.0203
You can simply imitate the recommended behavior for survey: create a new variable formed by concatenating distinct values of each of the component variables. That's all that the interaction() function is doing for svymean().
library(survey)
library(srvyr)
data(api)
# Set up design object
dclus2 <- apiclus1 %>%
as_survey_design(dnum, weights = pw, fpc = fpc)
# Create 'interaction' variable
dclus2 %>%
mutate(awards_stype = paste(awards, stype, sep = " - ")) %>%
group_by(awards_stype) %>%
summarize(
prop = survey_mean()
)
#> # A tibble: 6 x 3
#> awards_stype prop prop_se
#> <chr> <dbl> <dbl>
#> 1 No - E 0.180 0.0250
#> 2 No - H 0.0437 0.0179
#> 3 No - M 0.0656 0.0230
#> 4 Yes - E 0.607 0.0428
#> 5 Yes - H 0.0328 0.0168
#> 6 Yes - M 0.0710 0.0203
To get the various component variables split back into separate columns, you can use the separate() function from the tidyr package.
# Separate the columns afterwards
dclus2 %>%
mutate(awards_stype = paste(awards, stype, sep = " - ")) %>%
group_by(awards_stype) %>%
summarize(
prop = survey_mean()
) %>%
tidyr::separate(col = "awards_stype",
into = c("awards", "stype"),
sep = " - ")
#> # A tibble: 6 x 4
#> awards stype prop prop_se
#> <chr> <chr> <dbl> <dbl>
#> 1 No E 0.180 0.0250
#> 2 No H 0.0437 0.0179
#> 3 No M 0.0656 0.0230
#> 4 Yes E 0.607 0.0428
#> 5 Yes H 0.0328 0.0168
#> 6 Yes M 0.0710 0.0203
Created on 2021-03-30 by the reprex package (v1.0.0)
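The claim that interaction() only concatenates the distinct values of its arguments can be checked in base R without any survey design. A small sketch with made-up toy vectors:

```r
# Toy vectors standing in for the two survey variables
awards <- factor(c("No", "Yes", "Yes", "No"))
stype  <- factor(c("E", "E", "H", "M"))

# interaction() builds one factor whose labels join the inputs with "."
combined <- interaction(awards, stype)
levels(combined)
as.character(combined)

# paste() with the same separator gives identical labels
pasted <- paste(awards, stype, sep = ".")
identical(as.character(combined), pasted)
```

The levels come out with the first factor varying fastest (No.E, Yes.E, No.H, ...), which is the row order seen in the svymean() output, whereas grouping on a pasted character column sorts the labels alphabetically.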

locating specific columns in a pdf table from R

I was wondering how I could locate the 2nd & 3rd columns from the left in the table on the last page (page 18) of this pdf document.
I'm using the pdftools package. Is there a way to extract the 2nd & 3rd columns from the left, which are just numeric data?
library(pdftools)
df <- pdf_data("https://github.com/rnorouzian/m/raw/master/Kang_et_al%20(2015).pdf")[[18]]
Making use of some tidyverse packages, this could be achieved like so:
Filter for the values in the 2nd and 3rd columns. The 2nd column's values start at position x=189, the 3rd column's at x=252.
Additionally, to make sure that we only get the values, I first convert to numeric, whereby all text gets converted to NA. Note: One of the values has a comma as decimal mark, which I first had to replace with a period.
After getting the values, I reshape the dataset using pivot_wider, for which I add a row id.
Finally, I rename the columns.
library(pdftools)
#> Using poppler version 0.73.0
df <- pdf_data("https://github.com/rnorouzian/m/raw/master/Kang_et_al%20(2015).pdf")[[18]]
library(dplyr)
library(tidyr)
library(stringr)
col_x <- c(189, 252)
df %>%
mutate(value = str_replace(text, "(\\d+),(\\d+)", "\\1.\\2"),
value = as.numeric(value)) %>%
filter(x %in% col_x, !is.na(value)) %>%
select(x, value) %>%
group_by(x) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = x, values_from = value) %>%
rename(row = 1, g = 2, se = 3)
#> Warning: Problem with `mutate()` input `value`.
#> ℹ NAs introduced by coercion
#> ℹ Input `value` is `as.numeric(value)`.
#> Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
#> # A tibble: 20 x 3
#> row g se
#> <int> <dbl> <dbl>
#> 1 1 0.089 0.179
#> 2 2 0.383 0.257
#> 3 3 0.481 0.355
#> 4 4 0.496 0.356
#> 5 5 0.103 0.335
#> 6 6 0.104 0.257
#> 7 7 0.068 0.289
#> 8 8 0.43 0.359
#> 9 9 1.48 0.351
#> 10 10 1.38 0.257
#> 11 11 0.888 0.388
#> 12 12 0.570 0.314
#> 13 13 0.642 0.39
#> 14 14 1.16 0.364
#> 15 15 0.341 0.432
#> 16 16 0.607 0.299
#> 17 17 0.473 0.361
#> 18 18 0.472 0.423
#> 19 19 0.902 0.368
#> 20 20 0.245 0.363
Created on 2020-12-31 by the reprex package (v0.3.0)
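The decimal-comma repair and the numeric filtering can be tried on their own, without downloading the PDF. A minimal base R sketch; the strings are made-up stand-ins for the text column returned by pdf_data():

```r
# Made-up cell contents: labels, regular numbers and one decimal comma
text <- c("Study", "0.089", "0,43", "1.48", "SE")

# Turn a comma between digits into a decimal point
value <- sub("([0-9]+),([0-9]+)", "\\1.\\2", text)

# Non-numeric strings become NA; the coercion warning is expected here
value <- suppressWarnings(as.numeric(value))
value

# Keep only the cells that parsed as numbers
numbers_only <- value[!is.na(value)]
numbers_only
```

This is the same two-step logic as the mutate() and filter() calls above, minus the positional filter on x.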
The accepted answer is complicated, and after wasting quite a bit of time fiddling with pdf_data() output, I thought it might help to show how to extract and manipulate the vectors directly.
library(pdftools)
library(stringr)
df <- pdf_data("https://github.com/rnorouzian/m/raw/master/Kang_et_al%20(2015).pdf")[[18]]
df <- df[stringr::str_detect(df$text, "\\d"),]
data.frame(g = df$text[df$x == 189], se = df$text[df$x == 252])
