How to plot multi-word expressions with quanteda - r

I am using the quanteda package in R for textual data analysis. I am interested in plotting a keyword-in-context display produced by the kwic() command, which is useful for finding multi-word expressions in tokens.
# Remove punctuation and symbols
toks_comments <- tokens(comments_corpus, remove_punct = TRUE, remove_symbols = TRUE, padding = TRUE) %>%
tokens_remove(stopwords("spanish"), padding = TRUE)
# Get relevant keywords and phrases from dictionary
servicio <- c("servicio", "atencion", "atención", "personal", "mesera", "mesero", "muchacha", "muchacho", "joven",
"pelado", "pelada", "meseros")
# Keyword-in-context
servicio_context <- kwic(toks_comments, pattern = phrase(servicio))
View(servicio_context)
Once the previous lines have been run, I get the result shown in the photo. From that table, I am interested in graphing the "pre" and "post" columns, but I don't know how to do it. Is there a way to include the words in a multi-word wordcloud or some other frequency visualization?
[Image: output of View(servicio_context)]

You could do both a wordcloud and a frequency bar graph.
Wordcloud
library(quanteda)
library(quanteda.textplots)
# dfm() on a character vector is deprecated as of quanteda 3, so tokenise first
tokens(servicio_context$pre) %>%
dfm() %>%
textplot_wordcloud()
Bar Graph
library(ggplot2)
servicio_context %>%
ggplot(aes(x = pre)) +
geom_bar(stat = "count")
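The bar graph above counts whole "pre" strings. If the goal is the frequency of individual context words from both the "pre" and "post" columns, one possible sketch is the following. It assumes quanteda 3 or later, and the three example comments and the short keyword list are made-up stand-ins for comments_corpus and the full servicio vector.

```r
library(quanteda)
library(ggplot2)

# Hypothetical stand-in for the asker's comments_corpus
comments <- c("el servicio fue muy bueno y rapido",
              "la atencion del personal fue muy lenta",
              "muy buen servicio aunque el mesero fue lento")
toks <- tokens(comments, remove_punct = TRUE)

# Keyword-in-context with a three-token window on each side
ctx <- kwic(toks, pattern = c("servicio", "atencion", "mesero"), window = 3)
ctx_df <- as.data.frame(ctx)

# Tokenise the pre and post context strings and count individual words
context_dfm <- dfm(tokens(c(ctx_df$pre, ctx_df$post)))
freq <- topfeatures(context_dfm, n = 10)
freq_df <- data.frame(word = names(freq), n = as.numeric(freq))

# Horizontal bar chart of the most frequent context words
ggplot(freq_df, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "context word", y = "count")
```

The same context_dfm can be passed to textplot_wordcloud() from quanteda.textplots if a wordcloud is preferred over a bar chart.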

Related

How to visualize Markov chains for NLP using ggplot?

I am working on analyzing some text in R and have settled on (for the moment) Markov chains as part of my procedure. Here is an example of what I'm doing:
# Required libraries
library(stringi) # Input cleaning
library(tidyverse) # dplyr, ggplot, etc.
library(hunspell) # Spell checker
library(markovchain) # Markov chain calculation
# Input
shake <- c("To be, or not to be - that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune Or to take arms against a sea of troubles And by opposing end them.")
# Process to clean input
miniclean <- function(x = ""){
# x is character string input
words_i = x %>%
gsub(pattern = "[^[:alpha:][:space:]\']", replacement = "") %>%
#gsub(pattern = "[\n]+", replacement = "") %>% # Drop line breaks
stri_trans_tolower() %>%
strsplit(split = " ") %>%
unlist()
correct = hunspell_check(words_i)
words_o = words_i[correct]
return(words_o)
}
# Clean input
cleans <- miniclean(shake)
# Compute Markov chain using cleaned input
mark2 <- markovchainFit(cleans)
# Plot results
plot(mark2$estimate)
The base plot produces this visualization:
I would really like a bit more control over the plot (e.g., increasing arrow lengths to increase the overall size to make it more readable), but I don't see how to do it.
Ideas?
(edited to make a complete example)

How to create tables in R

I am trying to create and export a table for publication (picture attached).
I have created a table using the code below, but I could not export it as a table.
Can anyone help, please?
library(tidyverse)
library(gapminder)
data(gapminder)
median_gdp <- median(gapminder$gdpPercap)
gapminder %>%
select(-country) %>%
mutate(gdpPercap = ifelse(gdpPercap > median_gdp, "high", "low")) %>%
mutate(gdpPercap = factor(gdpPercap)) %>%
mutate(pop = pop / 1000000) -> gapminder
gapminder <- lapply(gapminder, function(x) x[sample(c(TRUE, NA),
prob = c(0.9, 0.1),
size = length(x),
replace = TRUE
)])
library(arsenal)
table_one <- tableby(continent ~ ., data = gapminder)
summary(table_one, title = "Gapminder Data", text=TRUE)
If you want to write a table to Microsoft Word, you can use the following code from the arsenal package.
write2word(table_one, "table.doc",
keep.md = TRUE,
quiet = TRUE,
title = "Your title")
You can also write a table to PDF and HTML with the arsenal package. For details, see
?write2specific
Weirdly, there doesn't seem to be a general question on this topic, though see Create a PDF table for PDFs.
Modern packages to print tables in output formats, including PDF, HTML and Word, include gt, huxtable, flextable and kableExtra.
Packages to create tables of summary statistics include skimr, summarytools and qwraps2. Some of these also have built-in output to different formats.
There are many other packages out there.
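As a minimal illustration of exporting a table to a file, the sketch below uses knitr::kable(), which kableExtra builds on. The summary table of mtcars and the output file name are placeholders chosen for the example, not part of the original question.

```r
library(knitr)

# Hypothetical summary table standing in for the publication table
tab <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)

# Render the data frame as an HTML table and write it to a file
html <- kable(tab, format = "html", digits = 1,
              caption = "Mean mpg and hp by number of cylinders")
out_file <- file.path(tempdir(), "table.html")
writeLines(html, out_file)
```

Changing format to "latex" produces LaTeX markup instead, which can be included in a PDF document.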

textplot_wordcloud group label highlight color

I'm trying to replicate some quanteda applications from this post. However, when I replicated their textplot_wordcloud() example on presidential speeches, the group labels in my output do not have highlight colors like the greyish background in the example:
Since textplot_wordcloud() is derived from comparison.cloud(), I went back to the latter's documentation to see whether it has any arguments for setting label highlight colors, but couldn't find any. Is it possible to highlight group labels in textplot_wordcloud() with colors?
The replication code is below.
library(quanteda)
data(data_corpus_inaugural)
compDfm <- dfm(corpus_subset(data_corpus_inaugural, President %in% c("Washington", "Jefferson", "Madison")),
groups = "President", remove = stopwords("english"), removePunct = TRUE)
You are looking at an old example. You should look here on the quanteda website for current plotting examples.
The function textplot_wordcloud has been rewritten and now uses only internal quanteda calls, so the reference to wordcloud::comparison.cloud is no longer really valid. As a result, you can no longer set the background color of the labels. You can still adjust the color and the size of the labels if you want to:
library(quanteda)
# Package version: 2.0.0
# See https://quanteda.io for tutorials and examples.
corpus_subset(data_corpus_inaugural,
President %in% c("Washington", "Jefferson", "Madison")) %>%
dfm(groups = "President", remove = stopwords("english"), remove_punct = TRUE) %>%
dfm_trim(min_termfreq = 5, verbose = FALSE) %>%
textplot_wordcloud(comparison = TRUE,
labelcolor = "green",
labelsize = 2)
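The code above targets quanteda 2.0.0. In quanteda 3 and later, the plotting functions live in the separate quanteda.textplots package, and passing groups or remove directly to dfm() is deprecated. A rough equivalent, assuming quanteda 3 or later with quanteda.textplots installed, might look like this:

```r
library(quanteda)
library(quanteda.textplots)

# Keep only the three presidents of interest
corp <- corpus_subset(data_corpus_inaugural,
                      President %in% c("Washington", "Jefferson", "Madison"))

# Tokenise, drop punctuation and stopwords, then build the dfm
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("english"))
dfmat <- dfm(toks)

# Group documents by president and trim rare terms
dfmat <- dfm_group(dfmat, groups = President)
dfmat <- dfm_trim(dfmat, min_termfreq = 5, verbose = FALSE)

# Comparison cloud with coloured, enlarged group labels
textplot_wordcloud(dfmat, comparison = TRUE,
                   labelcolor = "green", labelsize = 2)
```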

Visualizing PCA with large number of variables in R using ggbiplot

I am trying to visualize a PCA that includes 87 variables.
prc <-prcomp(df[,1:87], center = TRUE, scale. = TRUE)
ggbiplot(prc, labels = rownames(df[,1:87]), var.axes = TRUE)
When I create the biplot, many of the vectors overlap with each other, making it impossible to read the labels. I was wondering if there is any way to only show some of the labels at a time. For example, I think it'd be useful if I could create a few separate biplots with each one showing only a subset of the labels on the vectors.
This question seems closely related, but I don't know if it translates to the latest version of ggbiplot. I'm also not sure how to modify the original functions.
A potential solution is to use the factoextra package to visualize your PCA results. The fviz_pca_biplot() function includes a repel argument; when repel = TRUE, the plot labels are spread out to minimize overlap. The documentation also describes select.var options, such as select.var = list(contrib = 5) to display only the 5 variables with the highest contributions. There is also a select.var = list(name = ...) option that appears to let you specify exactly which subset of variables to show.
# read data
df <- mtcars[, c(1:7,10:11)]
# perform PCA
library("FactoMineR")
res.pca <- PCA(df, graph = FALSE)
# visualize
library(factoextra)
fviz_pca_biplot(res.pca, repel = TRUE, select.var = list(contrib = 5))
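To get several biplots that each show only some of the variable vectors, the select.var = list(name = ...) option can be applied to one batch of variable names at a time. The sketch below reuses the mtcars example. The batch size of three is an arbitrary choice for illustration.

```r
library(FactoMineR)
library(factoextra)

df <- mtcars[, c(1:7, 10:11)]
res.pca <- PCA(df, graph = FALSE)

# Split the variable names into batches of three (arbitrary batch size)
var_names <- colnames(df)
batches <- split(var_names, ceiling(seq_along(var_names) / 3))

# One biplot per batch, each drawing only that batch's variable vectors
plots <- lapply(batches, function(vars) {
  fviz_pca_biplot(res.pca, repel = TRUE, select.var = list(name = vars))
})

# Display the first biplot; the others are in plots[[2]], plots[[3]]
plots[[1]]
```

The PCA itself is computed once on all variables, so the individuals stay in the same positions across the plots and only the displayed vectors change.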

Plotting a matrix "by parts" in R?

I have a 50k by 50k square matrix saved to disk in a text file and I would like to produce a simple histogram to see the distribution of the values in the matrix.
Obviously, when I try to load the matrix in R using read.table(), I run into a memory error because the matrix is too big. Is there any way I could load smaller submatrices one at a time, but still produce a histogram that considers all the values of the original matrix? I can indeed load smaller submatrices, but I just overwrite the histogram from the last submatrix with the distribution of the new one.
Here's an approach. I don't have all the details because you did not provide sample data or the expected output, but one way to do this is through the read_csv_chunked function in the readr package. First, write a summarisation function, then apply it to each chunk. See below for a full reprex.
# Call the Required Libraries
library(dplyr)
library(ggplot2)
library(readr)
# First Generate Some Fake Data
temp <- tempfile(fileext = ".csv")
fake_dat <- as.data.frame(matrix(rnorm(1000*100), ncol = 100))
write_csv(fake_dat, temp)
# Now write a summarisation function
# This will be applied to each chunk that is read into
# memory
summarise_for_hist <- function(x, pos){
x %>%
mutate(added_bin = cut(V1, breaks = -6:6)) %>%
count(added_bin)
}
# Note that I manually set the cutpoints or "breaks"
# argument. You would need to refine this based on your
# data and subject matter expertise
# Read the file in chunks, applying the summary function to each chunk
small_read <- read_csv_chunked(temp, # data
DataFrameCallback$new(summarise_for_hist),
chunk_size = 200 # number of lines to read
)
Now that we have summarised our data, we can combine and plot it.
# Generate our histogram by combining all of the results
# and plotting
small_read %>%
group_by(added_bin) %>%
summarise(total = sum(n)) %>%
ggplot(aes(added_bin, total))+
geom_col()
This yields a bar chart of the total count in each bin.
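The summary function above bins only the V1 column. To count every value in the matrix, one option is to keep a running vector of bin counts in base R and add each chunk's counts to it. This is a sketch under assumptions: the file is whitespace-separated with no header, and the fixed break points from -6 to 6 cover every value (hist() stops with an error if a value falls outside them).

```r
# Fake matrix on disk, whitespace-separated with no header
set.seed(1)
path <- tempfile(fileext = ".txt")
write.table(matrix(rnorm(1000 * 100), ncol = 100), path,
            row.names = FALSE, col.names = FALSE)

# Assumed fixed bin edges; they must span all values in the file
breaks <- seq(-6, 6, by = 0.5)
counts <- numeric(length(breaks) - 1)

con <- file(path, open = "r")
repeat {
  lines <- readLines(con, n = 200)          # read 200 rows at a time
  if (length(lines) == 0) break             # end of file
  vals <- scan(text = lines, quiet = TRUE)  # all numbers in this chunk
  # Add this chunk's per-bin counts to the running total
  counts <- counts + hist(vals, breaks = breaks, plot = FALSE)$counts
}
close(con)

# Plot the accumulated histogram, labelling bars by their left bin edge
barplot(counts, names.arg = head(breaks, -1), las = 2)
```

Only one chunk of rows is held in memory at a time, so the memory cost depends on the chunk size rather than on the size of the full matrix.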
