I want to create a wordcloud with R. I want to visualize the occurrence of variable names, which may consist of more than one word as well as special characters and numbers; for example, one variable name is "S & P 500 dividend yield".
The variable names are in a text file and are not separated any further: every line of the text file contains one variable name.
I tried the following code, but the variable names are split into separate words:
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
# load the text:
text <- readLines("./Overview_used_series.txt")
docs <- Corpus(VectorSource(text))
inspect(docs)
# build a term-document matrix:
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
# generate the wordcloud:
pdf("Word cloud.pdf")
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
dev.off()
How can I treat the variable names so that they appear in the wordcloud with their original names, exactly as in the text file?
If you have a file as you specified, with one variable name per line, there is no need to use tm. You can easily create your own word frequency table to use as input. tm splits words on spaces and will not respect your multi-word variable names.
Starting from where the text is loaded, create a data.frame with a frequency column set to 1, and then aggregate it. wordcloud accepts the resulting words and frequencies directly, so you can build the wordcloud from this. Note that I adjusted the scale a bit, because long variable names might otherwise not get printed; you will get a warning message when that happens.
I'm not inserting the resulting picture.
library(wordcloud)
library(RColorBrewer)

#text <- readLines("./Overview_used_series.txt")
text <- c("S & P 500 dividend yield", "S & P 500 dividend yield", "S & P 500 dividend yield",
          "visualize", "occurence", "variable names", "visualize", "occurence",
          "variable names")
# freq = 1 adds a column with just 1's for every value.
my_data <- data.frame(text = text, freq = 1, stringsAsFactors = FALSE)
# aggregate the data: sum the frequencies per unique text value.
my_agr <- aggregate(freq ~ ., data = my_data, sum)
wordcloud(words = my_agr$text, freq = my_agr$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"), scale = c(2, .5))
Related
I am trying to create a wordcloud for my internship and I'm struggling with it; I would appreciate any help!
I have a repository of 200 PDF documents, and I have to build a wordcloud of the terms that are repeated the most. For that, I build a corpus, turn it into a TDM, and use the wordcloud command. The clouds only show single words ("country", "fiscal", "debt", etc.), and I would like to add terms consisting of several words that I have seen frequently repeated in the papers ("fiscal rules", "stability and growth pact", etc.). Because the term-document matrix only counts how often a single word is repeated, I need a way to count how many times these "blocks" or "tokens" show up in the text, but I can't find one that works from the corpus or the group of PDFs. Would anyone know a way to do this? I have tried "tokenizing" and building a dictionary, but neither seems to work when it comes to counting their frequency.
I am attaching the code showing how I have been building the wordclouds, and how I attempted to count the two-word terms.
Thank you so much in advance!
This is how I got the cloud:
library(tm)
library(readxl)
library(wordcloud)
library(RColorBrewer)

# list the PDF files in the working directory
files = list.files(pattern = "pdf$")
files

Excel_Metadata <- read_excel("Y:/Excel.xlsx", range = "A1:D203")
View(Excel_Metadata)

# build a corpus from the PDF files listed in the metadata
corp = Corpus(URISource(Excel_Metadata$files),
              readerControl = list(reader = readPDF))

files.tdm <- TermDocumentMatrix(corp,
                                control =
                                  list(removePunctuation = TRUE,
                                       stopwords = TRUE,
                                       tolower = TRUE,
                                       stemming = TRUE,
                                       removeNumbers = TRUE))
inspect(files.tdm)

matrixfiles <- as.matrix(files.tdm)
v <- sort(rowSums(matrixfiles), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

set.seed(6984)
pdf("folder/cloud.pdf")
# note: the original call had an empty `scale = ,` argument, which is a
# syntax error; it is dropped here so wordcloud uses its default scale
wordcloud(words = d$word, freq = d$freq, min.freq = 30, max.words = 150,
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Paired"), family = "serif")
dev.off()
This is what I have been trying in order to get the two-word terms:
# attempt with the corpus/tidytext packages: text_tokens needs raw text,
# not a term-document matrix, so the corpus is tidied into a data frame first
install.packages("tidytext")
library(tidytext)
library(corpus)
filesver = tidy(corp)
token = text_tokens(filesver$text, text_filter(combine = "fiscal rules"))
# the cloud is then rebuilt from the unchanged single-word TDM as before,
# which is why the combined term never shows up
matrixfiles <- as.matrix(files.tdm)
v <- sort(rowSums(matrixfiles), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
wordcloud(words = d$word, freq = d$freq, min.freq = 10, max.words = 150,
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Paired"), family = "serif")
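Since the term-document matrix above only ever counts single (stemmed) words, one possible workaround, just a sketch and not something from this thread, is to count the fixed phrases directly in the raw corpus text and append those counts to d before plotting. The phrases vector below is an assumption taken from the examples in the question:
# sketch: count fixed multi-word phrases in the raw text of `corp`
# (the phrases here are assumptions based on the question's examples)
phrases <- c("fiscal rules", "stability and growth pact")
texts <- tolower(sapply(corp, function(doc) paste(as.character(doc), collapse = " ")))
phrase_freq <- sapply(phrases, function(p) {
  hits <- gregexpr(p, texts, fixed = TRUE)
  sum(sapply(hits, function(h) sum(h > 0)))  # gregexpr returns -1 when a document has no match
})
# append the phrase counts to the single-word table before calling wordcloud
d <- rbind(d, data.frame(word = phrases, freq = phrase_freq))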
I create traceplots of my stanfit objects via mcmc_trace.
I want to rename the titles of the traceplots.
I already managed to change the titles, but I don't know how to rename the plots to Greek letters with subscripts, similar to what the expression function produces.
library(bayesplot)
array <- as.array(fit)
array[1, 1, 1:3]
dimnames(array)[[3]][1:3] <- c("alpha1", "alpha2", "alpha3")
trace <- mcmc_trace(array, pars = c("alpha1", "alpha2", "alpha3"))
I want to replace alpha1 with expression(gamma[0]), but it doesn't work.
Okay, so after a little digging, it simply seems like bayesplot::mcmc_trace doesn't have an option for Greek lettering.
However, the similar traplot function from the mcmcplots package has a greek = TRUE option.
I made up a random example for you to see:
library(mcmcplots)
nc <- 3; nr <- 1000
# plotmath-style parameter names, one per column
pnames <- c(paste0('alpha[', 1:2, ']'), 'gamma[1]')
means <- rpois(nc, 20)
fakemcmc <- coda::as.mcmc.list(
  lapply(1:3, function(i) coda::mcmc(matrix(rnorm(nc*nr, rep(means, each = nr)),
                                            nrow = nr, dimnames = list(NULL, pnames)))))
traplot(fakemcmc, greek = TRUE)
Which produces a traceplot with Greek parameter labels (image omitted).
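If you would rather stay with bayesplot, one possible workaround, a sketch I have not verified against your fit, relies on mcmc_trace's facet_args argument (which is forwarded to ggplot2's facet_wrap): name the parameters with plotmath-style strings and use a parsed labeller:
library(bayesplot)
library(ggplot2)
# rename the parameters to plotmath strings, then parse them in the facet strips
dimnames(array)[[3]][1:3] <- c("gamma[0]", "gamma[1]", "gamma[2]")
mcmc_trace(array, pars = c("gamma[0]", "gamma[1]", "gamma[2]"),
           facet_args = list(labeller = label_parsed))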
I have ranked the tokens in my texts according to a criterion, and each of them has a value. My list looks like this:
value,token
3,tok1
2.84123,tok2
1.5,tok3
1.5,tok4
1.01,tok5
0.9,tok6
0.9,tok7
0.9,tok8
0.81,tok9
0.73,tok10
0.72,tok11
0.65,tok12
0.65,tok13
0.6451231,tok14
0.6,tok15
0.5,tok16
0.4,tok17
0.3001,tok18
0.3,tok19
0.2,tok20
0.2,tok21
0.1,tok22
0.05,tok23
0.04123,tok24
0.03,tok25
0.02,tok26
0.01,tok27
0.01,tok28
0.01,tok29
0.007,tok30
I then try to produce a wordcloud with the following code:
library(tm)
library(wordcloud)
tokList = read.table("tokens.txt", header = TRUE, sep = ',')
# Create corpus
corp <- Corpus(DataframeSource(tokList))
corpPTD <- tm_map(corp, PlainTextDocument)
wordcloud(corpPTD, max.words = 50, random.order=FALSE)
Which produces this (picture omitted):
But that is not what I want. I would like a wordcloud where the tokens ("tok1", "tok2", ...) are sized according to the value in the table. So if the first token has a value of 3, I want that word to be three times bigger than an element with a value of 1.
Can somebody maybe help?
Simply dividing by the minimum value will also work (assuming that your minimum value is not zero; if it is zero, filter out the corresponding tokens first):
library(RColorBrewer)
wordcloud(tokList$token, tokList$value/min(tokList$value), max.words = 50, min.freq = 1,
          random.order = FALSE, colors = brewer.pal(6, "Dark2"), random.color = TRUE)
I want to make a tag cloud to visualize the gene frequency.
library(wordcloud)
library(RColorBrewer)

genes_snv <- read.csv("genes.txt", sep = "", header = FALSE)
wordcloud(genes_snv$V1,
          min.freq = 15,
          scale = c(5, 0.5),
          max.words = 100,
          random.order = FALSE,
          rot.per = 0.3,
          colors = brewer.pal(8, "Dark2"))
This is my code, but it converts everything to lowercase (not useful with gene names). How can I avoid this?
genes.txt starts with
Fcrl5
Etv3
Etv3
Lrrc71
Lrrc71
(...)
When the freq argument is missing, wordcloud calls tm::TermDocumentMatrix, which I guess internally calls tolower before computing the frequencies.
To avoid the calls to tm, we can supply our own frequencies; see this example:
library(wordcloud)
library(RColorBrewer)

# dummy data
set.seed(1)
genes <- c("Fcrl5", "Etv3", "Etv3", "Lrrc71", "Lrrc71")
genes <- unlist(sapply(genes, function(i) rep(i, sample(1:100, 1))))
# get frequency
plotDat <- as.data.frame(table(genes))
# plot (note: the argument is "words", not "word")
wordcloud(words = plotDat$genes, freq = plotDat$Freq,
          min.freq = 15,
          scale = c(5, 0.5),
          max.words = 100,
          random.order = FALSE,
          rot.per = 0.3,
          colors = brewer.pal(8, "Dark2"))
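With the actual file from the question, the same table() approach would apply; the path and one-gene-per-line layout below are taken from the question:
# real data: one gene symbol per line, case preserved
genes_snv <- read.csv("genes.txt", sep = "", header = FALSE)
plotDat <- as.data.frame(table(genes_snv$V1))
wordcloud(words = plotDat$Var1, freq = plotDat$Freq, min.freq = 15,
          scale = c(5, 0.5), max.words = 100, random.order = FALSE,
          rot.per = 0.3, colors = brewer.pal(8, "Dark2"))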
New to R ... struggling to produce results on 10,000 lines; the data model actually has about 1M lines. Is there a better option than a loop? I read about vectorization and attempted tapply, with no success.
The data set has a column of free-form text and a category associated with the text. I need to parse the text into distinct words and then compute statistics on how well word frequencies predict the category. I read in the data via read.table and create a data.frame called data.
This function attempts to parse the text and count the occurrences of each word:
data <- data.frame(category = c("cat1", "cat2", "cat3", "cat4"),
                   text = c("The quick brown fox",
                            "Jumps over the fence",
                            "The quick car hit a fence",
                            "Jumps brown"))

library(plyr)
parsefunc <- function(data){
  # split every text into words and collect one row per word/category pair
  finalframe <- data.frame()
  for (i in 1:nrow(data)){
    description <- strsplit(as.character(data[i, 2]), " ")[[1]]
    category <- rep(data[i, 1], length(description))
    worddataframe <- data.frame(description, category)
    finalframe <- rbind(finalframe, worddataframe)
  }
  # count rows per word/category (ddply names the count column V1)
  m1 <- ddply(finalframe, c("description", "category"), nrow)
  m2 <- ddply(m1, 'description', transform, totalcount = sum(V1), percenttotal = V1/sum(V1))
  m3 <- m2[(m2$totalcount > 10) & (m2$percenttotal > 0.8), ]
  m3
}
This will get your finalframe and do something close to your m1, m2, and m3 parts. You'll have to edit it to do exactly what you want. I used a longer data set of 40k rows to make sure it performs all right:
# long data set
data <- data.frame(Category = rep(paste0('cat', 1:4), 10000),
                   Text = rep(c('The quick brown fox', 'Jumps over the fence',
                                'The quick car hit a fence', 'Jumps brown cars'), 10000),
                   stringsAsFactors = FALSE)
# split into words
wordbag <- strsplit(data$Text, split = ' ')
# find the appropriate category for each word
categoryvar <- rep(data$Category, lengths(wordbag))
# stick them in a data frame and aggregate
newdf <- data.frame(category = categoryvar, word = tolower(unlist(wordbag)))
agg <- aggregate(list(wordcount = rep(1, nrow(newdf))),
                 list(category = newdf$category, word = newdf$word), sum)
# find the total count in the entire data set and merge it in
wordagg <- aggregate(list(totalwordcount = rep(1, nrow(newdf))),
                     list(word = newdf$word), sum)
agg <- merge(x = agg, y = wordagg, by = 'word')
# find percentages and do whatever else you need
agg$percentageofword <- agg$wordcount/agg$totalwordcount
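To reproduce the m3 filter from the original function (its thresholds: a total count above 10 and a category share above 0.8), you could finish with:
# keep words seen more than 10 times overall where a single category
# accounts for more than 80% of the occurrences (thresholds from the question)
m3 <- agg[agg$totalwordcount > 10 & agg$percentageofword > 0.8, ]
m3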