Wordcloud2: Is it possible to only show words that appear n times?

I created a beautiful word cloud with wordcloud2, but I want to show only the words that appear at least n times. How can I do it?
data <- read.table(text = 'my data', sep = ";")
dim(data)
library(tm)
documents <- Corpus(VectorSource(data$V2))
inspect(documents)
lapply(documents[1], as.character)
set.seed(1234)
tdm <- TermDocumentMatrix(documents)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)   # term frequencies, most frequent first
d <- data.frame(word = names(v), freq = v)
library(wordcloud2)
wordcloud2(d)

Finally, I used:
d <- d[1:n, ]
to keep the first n rows, i.e. the n most frequent words (d is already sorted by decreasing frequency).
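If the goal is literally "words that appear at least n times" rather than "the top n words", a minimal sketch filtering on the freq column built above (the value of n here is just an example):
n <- 5                    # example threshold; use whatever minimum count you need
d <- d[d$freq >= n, ]     # keep only words occurring at least n times
wordcloud2(d)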

Related

R obs Number Limited

I've been stuck on this for days. I want to pull data from the Binance API, which should return well over ten thousand observations, but R only gives me 1500 of them.
I was advised to use a loop, but it hasn't helped.
Any help would be greatly appreciated!
library(httr)
library(jsonlite)
library(lubridate)
library(data.table)   # for rbindlist()
library(binancer)     # for binance_klines()
# api description:
# https://github.com/binance-exchange/binance-official-api-docs/blob/master/rest-api.md#klinecandlestick-data
options(stringsAsFactors = FALSE)
url <- "https://api.binance.com"
path <- "/api/v3/exchangeInfo"
raw.result <- GET(url = url, path = path)
not.cool.data <- rawToChar(raw.result$content)
list1 <- fromJSON(not.cool.data)
symbols <- list1$symbols$symbol   # renamed from 'list', which shadows the built-in list()
klines2 <- rbindlist(lapply(
c('LTCTUSD', 'LTCBNB'),
binance_klines,
interval = '30m',
start_time = '2017-01-01',
end_time = '2021-01-08'
))
names(klines2)
sapply(klines2, function(x) length(unique(x)))
klines2
df.1 <- list.files(pattern = "2017-2021")
df.1_r <- vector(mode = "integer", length = length(klines2))   # mode must be a quoted string
tickling <- unique(klines2$symbol)
tickling
low <- c()
high <- c()
for (symbol in tickling) {
  look.at <- klines2$symbol == symbol
  low <- append(low, min(klines2$low[look.at]))     # klines2, not symbol: 'symbol' is just a string
  high <- append(high, max(klines2$high[look.at]))
}
tickling
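The 1500-row ceiling comes from the API, not from R: the klines endpoint caps how many candles a single request may return, so one call can never cover the whole 2017-2021 range. A minimal sketch of the loop idea, assuming binance_klines() accepts start_time/end_time as in the call above: split the period into windows small enough to stay under the cap, fetch each window, and stack the pieces with rbindlist().
# weekly windows keep each request well under the per-call limit at 30m resolution
windows <- seq(as.Date("2017-01-01"), as.Date("2021-01-08"), by = "week")
klines_all <- rbindlist(lapply(seq_len(length(windows) - 1), function(i) {
  binance_klines("LTCBNB", interval = "30m",
                 start_time = windows[i], end_time = windows[i + 1])
}))
nrow(klines_all)   # should now cover the whole period, not just the last 1500 rows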

KNN for text classification, but train and class have different lengths in R

Hello, I am trying to classify text. Here is the code:
library(tm)
library(class)   # for knn()
df <- read.csv("D:/AS/tokpedprepro.csv")
# sampling: shuffle the rows
set.seed(123)
df <- df[sample(nrow(df)), ]
df <- df[sample(nrow(df)), ]
# convert to corpus
dfCorpus <- Corpus(VectorSource(df$text))
inspect(dfCorpus[1:20])
#convert DTM
dtm <- DocumentTermMatrix(dfCorpus)
inspect(dtm[1:4, 3:7])
#Data Partition
df.train <- df[1:20,]
df.test <- df[21:37,]
dtm.train <- dtm[1:20,]
dtm.test <- dtm[21:37,]
df.Corpus.train <- dfCorpus[1:20]
df.corpus.test <- dfCorpus[21:37]
train.class <- df$data.class
#TFIDF
dtm.train.knn <- DocumentTermMatrix(df.Corpus.train,
  control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
dim(dtm.train.knn)
The dimension is
[1] 20 194
dtm.test.knn <- DocumentTermMatrix(df.corpus.test,
  control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
dim(dtm.test.knn)
The dimension is
[1] 17 211
Then:
knn.pred <- knn(dtm.train.knn, dtm.test.knn, train.class, k = 1)
But I get the error:
'train' and 'class' have different lengths
What should I do?
Thanks
Your train.class is train.class <- df$data.class, i.e. all 37 labels, but your dtm.train.knn is based on dfCorpus[1:20], so knn() sees 20 training rows against 37 class labels. You need to shorten train.class to match, probably as train.class <- df$data.class[1:20].
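Even with matching lengths, knn() will still complain, because dtm.train.knn (20 x 194) and dtm.test.knn (17 x 211) do not share the same columns. A minimal sketch of one common fix: rebuild the test DTM over the training vocabulary via the dictionary entry of tm's control list, then convert both to plain matrices for knn().
train.class <- df$data.class[1:20]
# restrict the test DTM to the terms seen in training
dtm.test.knn <- DocumentTermMatrix(df.corpus.test,
  control = list(dictionary = Terms(dtm.train.knn),
                 weighting = function(x) weightTfIdf(x, normalize = FALSE)))
knn.pred <- knn(as.matrix(dtm.train.knn), as.matrix(dtm.test.knn), train.class, k = 1)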

How Can I Compile Counts of Bigrams from Several .csv Files into One .csv File using R

I have a series of .csv files labeled Trigrams1A, Trigrams1B, ... Trigrams66A, Trigrams66B consisting of the counts of trigrams from a text file. My goal is to compile counts of specific trigrams from each of these files into one table. My thought was to try to create a row with the specific counts in each file and then stack the rows, but it does not seem to be working. A sample code follows:
z <- 1
while (z <= 66) {
  document <- paste("Trigrams", z, "B", ".csv", sep = "")
  mytable <- read.csv(document, header = TRUE, sep = ",")
  a <- length(mytable[which(mytable == "came here from")])
  b <- length(mytable[which(mytable == "go home with")])
  c <- length(mytable[which(mytable == "going to split")])
  d <- length(mytable[which(mytable == "i m gonna")])
  e <- length(mytable[which(mytable == "a lot of")])
  f <- length(mytable[which(mytable == "lot of money")])
  g <- length(mytable[which(mytable == "all the way")])
  h <- length(mytable[which(mytable == "i m going")])
  i <- length(mytable[which(mytable == "i promise you")])
  j <- length(mytable[which(mytable == "i want to")])
  k <- length(mytable[which(mytable == "i trust you")])
  rows <- c(a, b, c, d, e, f, g, h, i, j, k)   # note: the original dropped b here
  columns <- rbind(rows)   # this overwrites 'columns' on every pass instead of stacking
  z <- z + 1
}
How can I effectively get the counts from the tables, combine them, and then write this combination into a new table?
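No answer was recorded for this one; a minimal sketch of one way to do it, assuming each file contains the trigram strings somewhere in its columns (mirroring the whole-table comparison in the original): compute all eleven counts per file with one sapply(), stack the per-file rows with rbind(), and write the result once after the loop.
targets <- c("came here from", "go home with", "going to split", "i m gonna",
             "a lot of", "lot of money", "all the way", "i m going",
             "i promise you", "i want to", "i trust you")
counts <- NULL
for (z in 1:66) {
  mytable <- read.csv(paste0("Trigrams", z, "B", ".csv"), header = TRUE)
  # count how often each target trigram occurs anywhere in the file
  row <- sapply(targets, function(tri) sum(mytable == tri, na.rm = TRUE))
  counts <- rbind(counts, row)   # stack the rows instead of overwriting them
}
rownames(counts) <- paste0("Trigrams", 1:66, "B")
write.csv(counts, "trigram_counts.csv")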

Can't generate word cloud by cluster number using R

I am trying to generate a word cloud per cluster, but it gives the error "x must be an array of at least two dimensions". I am using twitter data -> corpus -> text mining -> document term matrix -> kmeans clustering -> word cloud for each cluster.
library(tm)
library(SnowballC)
library(XML)
library(streamR)
library(wordcloud)
library(NLP)
library(fpc)
library(cluster)
tweetsDF <- parseTweets('tweetsStream.txt', simplify = FALSE)
names(tweetsDF)
corp = Corpus(VectorSource(tweetsDF$text))
inspect(corp[1:1])
corp = Corpus(VectorSource(corp))
dtm = DocumentTermMatrix(corp)
inspect(dtm)
tdm = TermDocumentMatrix(corp)
freq = colSums(as.matrix(dtm))
length(freq)
freq= sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 14)
d = dist(t(dtm), method = "euclidean")
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color = T, shade = T, labels = 2, lines = 0)
docs1 = names(which(kfit$cluster == 2))
docs1 = as.matrix(docs1)
v1 = sort(colSums((docs1)), decreasing = TRUE)
# error: x must be an array of at least two dimensions
myNames1 = names(v1)
d1 = data.frame(word = myNames1, freq = v1)
wordcloud(d1$word, d1$freq, min.freq = 2)
(output of dput was posted as an image)
You are not collecting the term data after clustering to determine the word clouds: docs1 holds only the names of the cluster members, not their term frequencies, so there is nothing meaningful for colSums() to sum. What you want should be something like this:
library(slam)
docs1 <- which(kfit$cluster ==2)
head(docs1); length(docs1)
docs1 <- tdm[docs1, ]
head(docs1)
d1 <- data.frame(word=rownames(docs1), freq=row_sums(docs1))
head(d1)
d1 <- d1[order(d1$freq), ]
wordcloud(d1$word, d1$freq, min.freq=2)
Minimal example:
Using some built-in data, I have done kmeans clustering and generated a word cloud based on one of the clusters:
library(tm)
library(wordcloud)
library(slam)
library(cluster)   # for clusplot()
data("acq")
dtm = DocumentTermMatrix(acq)
inspect(dtm)
tdm <- TermDocumentMatrix(acq)
freq = colSums(as.matrix(dtm))
length(freq)
freq= sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 14)
d = dist(t(dtm), method = "euclidean")
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
docs1 <- which(kfit$cluster ==2)
head(docs1); length(docs1)
docs1 <- tdm[docs1, ]
inspect(docs1)
d1 <- data.frame(word=rownames(docs1), freq=row_sums(docs1))
head(d1)
d1 <- d1[order(d1$freq), ]
wordcloud(d1$word, d1$freq, min.freq=2)
As a side note: posting an image of your dput output doesn't help, as we cannot use an image to regenerate your data on our machines.

Organizing data from physics experiments for ggplot2

I am currently trying to use ggplot2 to visualize results from simple current-voltage experiments, and I managed to get good results for a single dataset.
However, I have a number of current-voltage datasets, which I read into R recursively to get the following organisation (see minimal code below):
data.frame(cbind(batch (string list), sample (string list), dataset (data.frame list)))
Edit: my data are stored in text files named batchname_samplenumber.txt, with voltage and current columns. The code I use to import them is:
require(plyr)
require(ggplot2)
#VARIABLES
regex <- "([[:alnum:]_]+).([[:alpha:]]+)"
regex2 <- "G5_([[:alnum:]]+)_([[:alnum:]]+).([[:alpha:]]+)"
#FUNCTIONS
getJ <- function(list, k) llply(list, function(i) llply(i, function(i, indix) getElement(i,indix), indix = k))
#FILES
files <- list.files("Data/",full.names= T)
#NAMES FOR FILES
paths <- llply(llply(files, basename),function(i) regmatches(i,regexec(regex,i)))
paths2 <- llply(llply(files, basename),function(i) regmatches(i,regexec(regex2,i)))
names <- llply(llply(getJ(paths, 2)),unlist)
batches <- llply(llply(getJ(paths2, 2)),unlist)
samples <- llply(llply(getJ(paths2, 3)),unlist)
#SETS OF DATA, NAMED
sets <- llply(files,function(i) read.table(i,skip = 0, header = F))
names(sets) <- names
for (i in as.list(names)) names(sets[[i]]) <- c("voltage","current")
df<-data.frame(cbind(batches,samples,sets))
And a minimal data can be generated via :
require(plyr)
batch <- list("A","A","B","B")
sample <- list(1,2,1,2)
set <- list(data.frame(voltage = runif(10), current = runif(10)),data.frame(voltage = runif(10), current = runif(10)),data.frame(voltage = runif(10), current = runif(10)),data.frame(voltage = runif(10), current = runif(10)))
df<-data.frame(cbind(batch,sample,set))
My question is: is it possible to use the data as is, plotting with a call similar to the following (which does not work)?
ggplot(data, aes(x = dataset$current, y = dataset$voltage, colour = sample)) + facet_wrap(~batch)
The more general version would be: is ggplot2 capable of handling raw physical data, as opposed to discrete statistical data (like diamonds, cars)?
With the newly-defined problem (two-column files named "batchname_samplenumber.txt"), I would suggest the following strategy:
library(plyr)
read_custom <- function(f, ...) {
  d <- read.table(f, ...)
  names(d) <- c("V", "I")
  ## extract sample and batch from the base filename
  ids <- strsplit(gsub(".txt", "", f), "_")
  d$batch <- ids[[1]][1]
  d$sample <- ids[[1]][2]
  d
}
## list files to read
files <- list.files(pattern = ".txt")
## read them all into a single data.frame
m <- ldply(files, read_custom)
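From there, the plot the question was after is straightforward (a sketch, using the V and I column names assigned inside read_custom above):
library(ggplot2)
ggplot(m, aes(x = V, y = I, colour = sample)) +
  geom_path() +
  facet_wrap(~batch)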
It's not clear how the sample names are defined with respect to the dataset. The general idea for ggplot2 is that you should group all your data in the form of a melted (long format) data.frame.
library(ggplot2)
library(plyr)
library(reshape2)
l1 <- list(batch = "b1", sample = paste("s", 1:4, sep = ""),
           dataset = data.frame(current = rnorm(10*4), voltage = rnorm(10*4)))
l2 <- list(batch = "b2", sample = paste("s", 1:4, sep = ""),
           dataset = data.frame(current = rnorm(10*4), voltage = rnorm(10*4)))
l3 <- list(batch = "b3", sample = paste("s", 1:4, sep = ""),
           dataset = data.frame(current = rnorm(10*4), voltage = rnorm(10*4)))
list_to_df <- function(l, n = 10){
  m <- l[["dataset"]]
  m$batch <- l[["batch"]]
  m$sample <- rep(l[["sample"]], each = n)
  m
}
## list_to_df(l1)
m <- ldply(list(l1, l2, l3), list_to_df)
ggplot(m) + facet_wrap(~batch) +
  geom_path(aes(current, voltage, colour = sample))
