How to visualize Markov chains for NLP using ggplot?

I am working on analyzing some text in R and have settled on (for the moment) Markov chains as part of my procedure. Here is an example of what I'm doing:
# Required libraries
library(stringi)     # input cleaning
library(tidyverse)   # dplyr, ggplot2, etc.
library(hunspell)    # spell checker
library(markovchain) # Markov chain estimation

# Input
shake <- c("To be, or not to be - that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune Or to take arms against a sea of troubles And by opposing end them.")

# Clean the input: strip non-letters, lower-case, split into words,
# and keep only words that pass the spell check
miniclean <- function(x = ""){
  # x is a character string
  words_i <- x %>%
    gsub(pattern = "[^[:alpha:][:space:]\']", replacement = "") %>%
    #gsub(pattern = "[\n]+", replacement = "") %>% # drop line breaks
    stri_trans_tolower() %>%
    strsplit(split = " ") %>%
    unlist()
  correct <- hunspell_check(words_i)
  words_o <- words_i[correct]
  return(words_o)
}

# Clean input
cleans <- miniclean(shake)

# Fit a Markov chain to the cleaned word sequence
mark2 <- markovchainFit(cleans)

# Plot results
plot(mark2$estimate)
The base plot graphics produce this visualization: [plot omitted]
I would really like a bit more control over the plot (e.g., increasing arrow lengths to increase the overall size to make it more readable), but I don't see how to do it.
Ideas?
(edited to make a complete example)
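One possible route, sketched here as an illustration rather than a definitive answer: extract the fitted transition matrix from the markovchain object (its transitionMatrix slot) and plot it yourself with ggplot2, for example as a heatmap, which sidesteps the arrow-layout problem entirely.
# Sketch: pull the transition matrix out of the fit and draw it as a
# ggplot2 heatmap (assumes mark2 from the code above)
tm <- mark2$estimate@transitionMatrix
tm_df <- as.data.frame(as.table(tm))
names(tm_df) <- c("from", "to", "prob")
ggplot(tm_df, aes(x = to, y = from, fill = prob)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
The same long data frame could also feed a network-layout package, but a heatmap avoids fiddling with arrow lengths altogether.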

Related

How to plot multi-word expressions with quanteda

I am using the quanteda package in R for textual data analysis. I am interested in plotting keyword-in-context displays using the kwic() command, which is useful for finding multi-word expressions in tokens.
# Remove punctuation and symbols
toks_comments <- tokens(comments_corpus, remove_punct = TRUE,
                        remove_symbols = TRUE, padding = TRUE) %>%
  tokens_remove(stopwords("spanish"), padding = TRUE)
# Get relevant keywords and phrases from dictionary
servicio <- c("servicio", "atencion", "atención", "personal", "mesera", "mesero",
              "muchacha", "muchacho", "joven", "pelado", "pelada", "meseros")
# Keyword-in-context
servicio_context <- kwic(toks_comments, pattern = phrase(servicio))
View(servicio_context)
Once the previous lines have been run, I get the result shown in the screenshot. From that table, I would like to graph the "pre" and "post" columns, but I don't know how to do it. Is there a way to include the words in a multi-word wordcloud or some other frequency visualization?
[Screenshot: View(servicio_context) output]
You could do both a wordcloud and a frequency bar graph.
Wordcloud
library(quanteda)
library(quanteda.textplots)
# Newer quanteda versions expect tokens input, so tokenize the "pre"
# strings before building the dfm
dfm(tokens(servicio_context$pre)) %>%
  textplot_wordcloud()
Bar Graph
library(ggplot2)
servicio_context %>%
  ggplot(aes(x = pre)) +
  geom_bar(stat = "count")
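If you want the "pre" and "post" columns in the same picture (the question asks about both), one option is to stack the two columns first. A sketch, assuming the standard kwic() column names pre and post:
library(dplyr)
library(tidyr)
# Stack "pre" and "post" context strings, count them, and plot the top 20
servicio_context %>%
  as.data.frame() %>%
  pivot_longer(c(pre, post), names_to = "side", values_to = "context") %>%
  count(context, sort = TRUE) %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = reorder(context, n), y = n)) +
  geom_col() +
  coord_flip()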

Plotting a matrix "by parts" in R?

I have a 50k by 50k square matrix saved to disk in a text file and I would like to produce a simple histogram to see the distribution of the values in the matrix.
Obviously, when I try to load the matrix in R using read.table(), I hit a memory error because the matrix is too big. Is there any way I could load smaller submatrices one at a time, but still produce a histogram that considers all the values of the original matrix? I can indeed load smaller submatrices, but each new submatrix's histogram simply overwrites the one from the previous submatrix.
Here's an approach. I don't have all the details because you did not provide sample data or the expected output, but one way to do this is through the read_csv_chunked function in the readr package. First, write your summarisation function, then apply it to each chunk. See below for a full reprex.
# Call the Required Libraries
library(dplyr)
library(ggplot2)
library(readr)
# First generate some fake data
temp <- tempfile(fileext = ".csv")
fake_dat <- as.data.frame(matrix(rnorm(1000 * 100), ncol = 100))
write_csv(fake_dat, temp)

# Now write a summarisation function; it will be applied to each
# chunk that is read into memory
summarise_for_hist <- function(x, pos){
  x %>%
    mutate(added_bin = cut(V1, breaks = -6:6)) %>%
    count(added_bin)
}
# Note that I manually set the cutpoints or "breaks" argument.
# You would need to refine this based on your data and subject
# matter expertise.

# Apply the summarisation function to each chunk as it is read
small_read <- read_csv_chunked(temp, # data
                               DataFrameCallback$new(summarise_for_hist),
                               chunk_size = 200 # number of lines to read per chunk
                               )
Now that we have summarised our data, we can combine and plot it.
# Generate our histogram by combining all of the results
# and plotting
small_read %>%
  group_by(added_bin) %>%
  summarise(total = sum(n)) %>%
  ggplot(aes(added_bin, total)) +
  geom_col()
This yields a single histogram covering every value in the file. [Plot omitted]
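If you don't know sensible breaks up front, a cheap first pass over the same chunks can find the global range first. This is my addition, not part of the answer above:
# First pass: find the global min/max chunk by chunk so that "breaks"
# can be chosen before binning
get_range <- function(x, pos) data.frame(lo = min(x$V1), hi = max(x$V1))
ranges <- read_csv_chunked(temp,
                           DataFrameCallback$new(get_range),
                           chunk_size = 200)
c(min(ranges$lo), max(ranges$hi))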

Advise a Chemist: Automate/Streamline his Voltammetry Data Graphing Code

I am a chemist dealing with a significant amount of voltammetry data recently. Let me be very clear and give some research information. I run scans from a starting voltage to an ending voltage on solid-state conductive films. These scans are saved as .txt files (name scheme: run#.txt) in a single folder. I am looking at how conductance changes as temperature changes. The LINEST line plotting current v. voltage at a given temperature gives me a line with slope = conductance. Once I have the conductances (slopes) for each scan, I plot conductance v. temperature to see the temperature-dependent conductance characteristics. I had been doing this in Excel, but have found quicker ways to get the job done using R. I am brand new to R (RStudio) and recognize that my coding is not the best. Without doubt, this process can be streamlined and sped up, which would help immensely. This is how I am performing the process currently:
# Set working directory with folder containing all .txt files for inspection
# Add all .txt files to the global environment
allruns <- list.files(pattern = ".txt")
for(i in 1:length(allruns)) assign(allruns[i], read.table(allruns[i]))
Since the voltage column (1000 values) is the same for all runs and is in column V1 of each .txt file, I assign x to be the voltage column from the first file:
x<-run1.txt$V1
All currents (these change as voltage changes) are found in the V2 column of all the .txt files, so I assign a y# to each. These are entered one at a time:
y1<-run1.txt$V2
y2<-run2.txt$V2
y3<-run3.txt$V2
# ...
yn<-runn.txt$V2
So that I can get the equation for each LINEST (one per scan, plotted with abline later). Again, entered one at a time:
run1<-lm(y1~x)
run2<-lm(y2~x)
run3<-lm(y3~x)
# ...
runn<-lm(yn~x)
To obtain a single graph with all LINEST (one for each scan ) on the same plot, without the data points showing up, I have been using this pattern of coding to first get all data points on a single plot in separate series:
plot(x, y1, col = "transparent", main = "LSV Solid Film",
     xlab = "potential(V)", ylab = "current(A)",
     xlim = rev(range(x)), ylim = range(c(y1, yn)))
par(new = TRUE)
plot(x, y2, col = "transparent", main = "LSV Solid Film",
     xlab = "potential(V)", ylab = "current(A)",
     xlim = rev(range(x)), ylim = range(c(y1, yn)))
par(new = TRUE)
plot(x, y3, col = "transparent", main = "LSV Solid Film",
     xlab = "potential(V)", ylab = "current(A)",
     xlim = rev(range(x)), ylim = range(c(y1, yn)))
# ...
par(new = TRUE)
plot(x, yn, col = "transparent", main = "LSV Solid Film",
     xlab = "potential(V)", ylab = "current(A)",
     xlim = rev(range(x)), ylim = range(c(y1, yn)))
#To obtain all LINEST lines (one for each scan, on the single graph):
abline(run1, col = "", lwd = 1)
abline(run2, col = "", lwd = 1)
abline(run3, col = "", lwd = 1)
# ...
abline(runn, col = "", lwd = 1)
# Then to get each LINEST equation:
summary(run1)
summary(run2)
summary(run3)
# ...
summary(runn)
Each time I use summary(), I copy the slope and paste it into an Excel sheet, along with the corresponding scan temperature, which I have recorded separately. I then graph the conductance v. temperature points for the film as an X-Y scatter with smooth lines to give the temperature-dependent conductance curve. That gives me a single plot of LINEST lines in R and the conductance v. temperature curve in Excel.
This technique is actually MUCH quicker than doing it all in Excel, but it could be done much more quickly and efficiently! Also, if I need to change something, this entire process needs to be re-executed with whatever change is necessary. The process takes me maybe 5 hours in Excel and 1.5 hours in R (maybe I am too slow). Nonetheless, any tips to help automate/streamline this further are greatly appreciated.
There are plenty of questions about operating on data in lists; storing a list of matrices or a list of data.frames is fast, and code that operates cleanly on one can be applied to the remaining n-1 very easily.
(Note: what I'm showing here is one technique: maintaining everything in well-compartmentalized lists. Others will suggest -- very justifiably -- that combining things into a single data.frame and adding a group variable (to identify from which file/experiment the data originated) will help with more advanced multi-experiment regression or combined plotting, such as with ggplot2. I'm not going to go into that latter technique here, not yet.)
The for(...) assign(..., read.csv(...)) pattern has long been decried; you have the important part done, so this is relatively easy:
allruns <- sapply(list.files(pattern = "*.txt"), read.table, simplify = FALSE)
(The use of sapply(..., simplify=FALSE) is similar to lapply(...), but it has a nice side-effect of naming the individual list-ified elements with, in this case, each filename. It may not be critical here but is quite handy elsewhere.)
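For instance, the names carry straight through (my illustration, using the run#.txt naming scheme from the question):
# Each element is named after the file it came from
names(allruns)
# e.g. "run1.txt" "run2.txt" "run3.txt" ...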
Extracting your invariant and variable data is simple enough:
allLMs <- lapply(allruns, function(mdl) lm(V2 ~ V1, data = mdl))
I'm using each table's V1 here instead of a once-extracted x. Though you might wonder why, I argue for keeping it like this for two reasons: (1) JUST IN CASE the V1 variable is ever even one row different, this will save you; (2) it is very easy to construct the model like this.
At this point, each object within allLMs is an lm object, meaning we might do:
summary(allLMs[[1]])
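To summarise every fitted model at once without leaving the list (a small addition of mine, not in the original answer):
# One summary object per model, still in a named list
allSummaries <- lapply(allLMs, summary)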
Plotting: I think I understand why you are using par(new=TRUE), and I have to laugh ... I had been deep in R for a while before I started using that technique. What I think you need is actually much simpler:
xlim <- rev(range(allruns[[1]]$V1))
# y-range across every run's V2 column
ylim <- range(sapply(allruns, `[`, "V2"))
# this next plot just sets up the box and axes, no points
plot(NA, type = "n", xlim = xlim, ylim = ylim,
     xlab = "potential(V)", ylab = "current(A)")
# no need to plot points with "transparent" ...
ign <- sapply(allLMs, abline, col = "black") # and other abline options ...
Copying all models into Excel, again, using lists:
out <- do.call(rbind, lapply(allLMs, function(m) summary(m)$coefficients[, 1]))
This will now be a single table with one row per model and the intercept and slope as columns. (Feel free to use similar techniques to extract the other model summary attributes, including the std. error, t value, or Pr(>|t|) (in $coefficients); or $r.squared, $adj.r.squared, etc.)
write.table(out, file = "clipboard", sep = "\t", col.names = NA)
and paste into Excel. (Or, better yet, save it to a CSV file and import that, since you might want to keep it around.)
One of the tricks to using lists for this is to persevere: keep things in lists as long as you can, so that you don't have to deal with models individually. One mantra is that if you do it once, you shouldn't have to type it again; just loop/apply/map/whatever. Don't extract too much from the lists before you have to.
Note: r2evans' answer provides good general advice and doesn't require heavy package dependencies. But it probably doesn't hurt to see alternative strategies.
The tidyverse can be quite handy for this sort of thing; here's a dummy example for illustration:
library(tidyverse)
# creating dummy data files
dummy <- function(T) {
  V <- seq(-5, 5, length = 20)
  I <- jitter(T * V + T, factor = 1)
  write.table(data.frame(V = V, I = I),
              file = paste0(T, ".txt"),
              row.names = FALSE)
}
purrr::walk(300:320, dummy)
# reading
lf <- list.files(pattern = "\\.txt")
read_one <- function(f, ...) {
  cbind(T = as.numeric(gsub("\\.txt", "", f)), read.table(f, ...))
}
m <- purrr::map_df(lf, read_one, header = TRUE, .id = "id")
head(m)
ggplot(m, aes(V, I, group = T)) +
  facet_wrap(~ T) +
  geom_point() +
  geom_smooth(se = FALSE)

models <- m %>%
  split(.$T) %>%
  map(~ lm(I ~ V, data = .))

coefs <- models %>% map_df(broom::tidy, .id = "T")

ggplot(coefs, aes(as.numeric(T), estimate)) +
  geom_line() +
  facet_wrap(~ term, scales = "free")
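Since each model is lm(I ~ V), the slope rows in coefs are the ones with term == "V"; filtering to just those gives the conductance-versus-temperature curve directly (a small extension of the answer above):
# Conductance (slope of I ~ V) against temperature
coefs %>%
  filter(term == "V") %>%
  ggplot(aes(as.numeric(T), estimate)) +
  geom_line() +
  geom_point()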

Plot LOESS (STL) decomposition using Ggvis

I want to be able to plot the three different elements of the Seasonal-Trend decomposition using Loess (STL) with ggvis.
However, I receive this error:
Error: data_frames can only contain 1d atomic vectors and lists
I am using the nottem data set.
# The Seasonal Trend Decomposition using Loess (STL) with Ggvis
# Load nottem data set
library(datasets)
nottem <- nottem
# Decompose using stl()
nottem.stl <- stl(nottem, s.window = "periodic")
# Plot decomposition
plot(nottem.stl)
Now, this is the information I am interested in. In order to make this into a plot that I can play around with, I transform it into a data frame using the xts package. So far so good.
# Transform nottem.stl to a data.frame
library(xts)
df.nottem.stl <- as.data.frame(as.xts(nottem.stl$time.series))
# Add date to data.frame
df.nottem.stl$date <- data.frame(time = seq(as.Date("1920-01-01"), by = "months", length = 240))
# Glimpse data
glimpse(df.nottem.stl)
# Plot simple line of trend
plot(df.nottem.stl$date, df.nottem.stl$trend, type = "o")
This is pretty much the plot I want. However, I want to be able to use it with Shiny, and therefore ggvis is preferable.
# Plot ggvis
df.nottem.stl %>%
  ggvis(~date, ~trend) %>%
  layer_lines()
This is where I get my error.
Any hints on what might go wrong?
First of all, your df.nottem.stl data.frame contains a data.frame in its date column, so you should be using the date$time column. Then using the layer_paths function instead of layer_lines will make it work. I always find that layer_paths works better than layer_lines.
So this will work:
library(ggvis)
df.nottem.stl%>%
ggvis(~date$time, ~trend)%>%
#for points
layer_points() %>%
#for lines
layer_paths()
Output: [ggvis plot of the trend component as connected points]
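To plot all three STL components at once, which is what the question originally asked for, one option is to reshape to long form first. A sketch, assuming the column names stl() produces (seasonal, trend, remainder):
library(dplyr)
library(tidyr)
# Long format: one row per date/component pair
long <- df.nottem.stl %>%
  mutate(date = seq(as.Date("1920-01-01"), by = "month", length.out = 240)) %>%
  pivot_longer(c(seasonal, trend, remainder),
               names_to = "component", values_to = "value")
long %>%
  group_by(component) %>%
  ggvis(~date, ~value, stroke = ~component) %>%
  layer_paths()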

Exclude Node in semPaths {semPlot}

I'm trying to plot an SEM path diagram with R.
I'm using an .out file produced by Mplus with semPaths {semPlot}.
It seems to work, but I want to remove some latent variables and I don't know how.
I am using the following syntax:
Out file from Mplus: https://www.dropbox.com/s/vo3oa5fqp7wydlg/questedMOD2.out?dl=0
outfile1 <- "questedMOD.out"
semPaths(outfile1, what = "est", intercepts = FALSE, rotation = 4,
         edge.color = "black", sizeMan = 5, esize = TRUE, structural = "TRUE",
         layout = "tree2", nCharNodes = 0, intStyle = "multi")
There may be an easier way to do this (and I'm setting aside whether it is sensible to do it), but one way is to remove nodes from the object prior to plotting.
Using the Mplus example from your earlier question, Rotate Edges in semPaths/qgraph:
library(qgraph)
library(semPlot)
library(MplusAutomation)
# This downloads an output file from Mplus examples
download.file("http://www.statmodel.com/usersguide/chap5/ex5.8.out",
outfile <- tempfile(fileext = ".out"))
# Unadjusted plot
s <- semPaths(outfile, intercepts = FALSE)
In the above call to semPaths, outfile is of class character, so the line (near the start of code for semPaths)
if (!"semPlotModel" %in% class(object))
object <- do.call(semPlotModel, c(list(object), modelOpts))
returns the object from semPlot:::semPlotModel.mplus.model(outfile). This is of class "semPlotModel".
So the idea is to create this object first, amend it and then pass this object to semPaths.
# Call semPlotModel on your Mplus file
obj <- semPlot:::semPlotModel.mplus.model(outfile)
# obj <- do.call(semPlotModel, list(outfile)) # this is more general / not just for Mplus
# Remove one factor (F1) from obj@Pars - need to check lhs and rhs columns
idx <- apply(obj@Pars[c("lhs", "rhs")], 1, function(i) any(grepl("F1", i)))
obj@Pars <- obj@Pars[!idx, ]
class(obj)
obj is now of class "semPlotModel" and can be passed directly to semPaths
s <- semPaths(obj, intercepts = FALSE)
You can use str(s) to see the structure of this returned object.
Assuming that you use the following semPaths code to print your SEM:
semPaths(obj, intercepts = FALSE) %>%
  plot()
you can use the following function to remove any node by its label:
remove_nodes_and_edges <- function(semPaths_obj, node_tbrm_vec){
  # indices of the nodes whose labels appear in node_tbrm_vec
  relevant_nodes_index <- semPaths_obj$graphAttributes$Nodes$names %in% node_tbrm_vec
  # shrink the matched nodes to nothing and blank their labels
  semPaths_obj$graphAttributes$Nodes$width[relevant_nodes_index] <- 0
  semPaths_obj$graphAttributes$Nodes$height[relevant_nodes_index] <- 0
  semPaths_obj$graphAttributes$Nodes$labels[relevant_nodes_index] <- ""
  return(semPaths_obj)
}
and use this function in the plotting pipe in the following way
semPaths(obj, intercepts = FALSE) %>%
  remove_nodes_and_edges(c("Y1", "Y2", "Y3")) %>%
  plot()
