R: Package topicmodels: LDA: Error: invalid argument

I have a question about LDA in the topicmodels package in R.
From a data frame, I created a matrix with documents as rows, terms as columns, and the number of times each term occurs in a document as the values. When I tried to run LDA, I got the error message "Error in !all.equal(x$v, as.integer(x$v)) : invalid argument type". The data contain 1675 documents and 368 terms. What can I do to make the code work?
library("tm")
library("topicmodels")
data_matrix <- data %>%
group_by(documents, terms) %>%
tally %>%
spread(terms, n, fill=0)
doctermmatrix <- as.DocumentTermMatrix(data_matrix, weightTf("data_matrix"))
lda_head <- topicmodels::LDA(doctermmatrix, 10, method="Gibbs")
Help is much appreciated!
Edit:
# Toy Data
documentstoy <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
meta1toy <- c(3,4,1,12,1,2,3,5,1,4,2,1,1,1,1,1)
meta2toy <- c(10,0,10,1,1,0,1,1,3,3,0,0,18,1,10,10)
termstoy <- c("cus","cus","bill","bill","tube","tube","coa","coa","un","arc","arc","yib","yib","yib","dar","dar")
toydata <- data.frame(documentstoy,meta1toy,meta2toy,termstoy)

I looked inside the code, and apparently the LDA() function only accepts integers as input, so you have to convert your categorical variable into integer indicator columns as below:
library('tm')
library('topicmodels')
documentstoy <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
meta1toy <- c(3,4,1,12,1,2,3,5,1,4,2,1,1,1,1,1)
meta2toy <- c(10,0,10,1,1,0,1,1,3,3,0,0,18,1,10,10)
toydata <- data.frame(documentstoy,meta1toy,meta2toy)
termstoy <- c("cus","cus","bill","bill","tube","tube","coa","coa","un","arc","arc","yib","yib","yib","dar","dar")
toy_unique <- unique(termstoy)
# Add one 0/1 indicator column per distinct term
for (i in seq_along(toy_unique)) {
  toydata[[toy_unique[i]]] <- as.integer(termstoy == toy_unique[i])
}
lda_head <- topicmodels::LDA(toydata, 10, method="Gibbs")
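The "invalid argument type" message comes from the `!all.equal(...)` check quoted in the error: when the values are not integer-like, `all.equal()` returns a character description instead of `TRUE`, and `!` cannot negate a string. A minimal sketch of one way to get integer counts, assuming only the document id and term columns of the toy data should enter the model (the choice `k = 2` is an assumption made for the tiny example, not taken from the question):

```r
# Toy data from the question: one document id and one term per row.
documentstoy <- 1:16
termstoy <- c("cus", "cus", "bill", "bill", "tube", "tube", "coa", "coa",
              "un", "arc", "arc", "yib", "yib", "yib", "dar", "dar")

# table() cross-tabulates documents against terms and returns integer counts.
counts <- unclass(table(documentstoy, termstoy))

# Reproduce the check that failed in the question; it should now be TRUE.
integer_like <- isTRUE(all.equal(as.vector(counts), as.integer(counts)))
print(dim(counts))
print(integer_like)

# Optional part: only runs if tm and topicmodels are installed.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("topicmodels", quietly = TRUE)) {
  # weighting takes the weighting function itself, not a call to it.
  dtm <- tm::as.DocumentTermMatrix(counts, weighting = tm::weightTf)
  lda_fit <- topicmodels::LDA(dtm, k = 2, method = "Gibbs")
  print(topicmodels::get_terms(lda_fit, 3))
}
```

Keeping the document id out of the count matrix matters here: a non-numeric or metadata column mixed into the matrix is one plausible way to end up with values that fail the integer check.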

Related

KNN: "no missing values are allowed" -> I do not have missing values

I am in a group project for a class. One of the people in my group ran the normalization and created the test/train sets so that we all work from the same sets (we are each using a different algorithm). I am assigned to run the KNN algorithm.
We had multiple columns with NAs, so those columns were removed (<- NULL). When I attempt to run KNN, I keep getting this error:
Error in knn(train = trainsetne, test = testsetne, cl = ne_train_target, :
no missing values are allowed
I ran which(is.na(dataset$col)) and found:
which(is.na(testsetne$median_days_on_market))
# [1] 8038 8097 8098 8100 8293 8304
When I look through the dataset, those cells do not appear to have missing data.
How can I find and fix whatever triggers "no missing values are allowed", or is there a workaround?
I am sorry if I am missing something simple. Any help is appreciated.
The code we have is listed below:
ne$pending_ratio_yy <- ne$total_listing_count_yy <- ne$average_listing_price_yy <- ne$median_square_feet_yy <- ne$median_listing_price_per_square_feet_yy <- ne$pending_listing_count_yy <- ne$price_reduced_count_yy <- ne$median_days_on_market_yy <- ne$new_listing_count_yy <- ne$price_increased_count_yy <- ne$active_listing_count_yy <- ne$median_listing_price_yy <- ne$flag <- NULL
ne$pending_ratio_mm <- ne$total_listing_count_mm <- ne$average_listing_price_mm <- ne$median_square_feet_mm <- ne$median_listing_price_per_square_feet_mm <- ne$pending_listing_count_mm <- ne$price_reduced_count_mm <- ne$price_increased_count_mm <- ne$new_listing_count_mm <- ne$median_days_on_market_mm <- ne$active_listing_count_mm <- ne$median_listing_price_mm <- NULL
ne$factor_month_date <- as.factor(ne$month_date_yyyymm)
ne$factor_median_days_on_market <- as.factor(ne$median_days_on_market)
train20ne= sample(1:20893, 4179)
trainsetne=ne[train20ne,1:10]
testsetne=ne[-train20ne,1:10]
#This is where I start to come in
ne_train_target <- ne[train20ne, 3]
ne_test_target <- ne[-train20ne, 3]
predict_1 <- knn(train = trainsetne, test = testsetne, cl=ne_train_target, k=145)
# Error in knn(train = trainsetne, test = testsetne, cl = ne_train_target, :
# no missing values are allowed
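A hedged sketch of how the missing values could be located and removed before calling knn(). The data frame `toy` and its columns are hypothetical stand-ins for trainsetne and testsetne. Note that which() reports positions within the subset it is given, so looking up the same numbers in the full dataset may land on different rows, which could explain why the cells looked complete.

```r
# Hypothetical stand-in for the question's data.
set.seed(1)
toy <- data.frame(price = rnorm(20), days = rnorm(20),
                  label = rep(c("a", "b"), 10))
toy$days[c(3, 17)] <- NA   # plant two missing values

# Count missing values in every column at once.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Positions are relative to this data frame, not to original row names.
print(which(is.na(toy$days)))

# Keep only rows that are complete in the predictor columns.
predictors <- c("price", "days")
clean <- toy[complete.cases(toy[, predictors]), ]

train_idx <- seq_len(12)
train <- clean[train_idx, predictors]
test  <- clean[-train_idx, predictors]

# class normally ships with R; guarded in case it is absent.
if (requireNamespace("class", quietly = TRUE)) {
  pred <- class::knn(train = train, test = test,
                     cl = factor(clean$label[train_idx]), k = 3)
  print(table(pred))
}
```

Dropping rows is only one option; imputing the missing values would keep the rows, at the cost of an extra modelling decision.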

Structural Topic Modeling (stm) Error in makeTopMatrix(prevalence, data) : Error creating model matrix

I'm trying to run the initial steps of this stm tutorial:
https://github.com/dondealban/learning-stm
with this dataset, which is a subset of the original one:
http://www.mediafire.com/file/1jk2aoz4ac84jn6/data.csv/file
install.packages("stm")
library(stm)
load("VignetteObjects.RData")
data <- read.csv("C:/data.csv")
head(data)
processed <- textProcessor(data$documents, metadata=data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
poliblogPrevFit <- stm(out$documents, out$vocab, K=4, prevalence=~rating+s(day),
max.em.its=200, data=out$meta, init.type="Spectral",
seed=8458159)
But I keep getting the same error
Error in makeTopMatrix(prevalence, data) : Error creating model matrix.
This could be caused by many things including
explicit calls to a namespace within the formula.
Try a simpler formula.
Could anyone please try running it on 64-bit MS Windows with R 3.5.2? I could not find similar errors anywhere.
It seems your problem is that, with the sampling you did, you ended up with a factor with just one level:
levels(meta$rating)
# [1] "Conservative"
Using a variable like this does not make any sense though, as there is no variation between cases. If you use the original data, your code works absolutely fine:
data <- read.csv("https://raw.githubusercontent.com/dondealban/learning-stm/master/data/poliblogs2008.csv")
processed <- textProcessor(data$documents, metadata = data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
levels(meta$rating)
# [1] "Conservative" "Liberal"
poliblogPrevFit <- stm(docs, vocab, K = 4, prevalence = ~rating+s(day),
max.em.its = 200, data = out$meta, init.type = "Spectral",
seed = 8458159)
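A quick way to catch this before fitting is to count the observed levels of each factor covariate. The sketch below uses hypothetical metadata and a hypothetical helper, `observed_levels()`; it also shows that base R's model.matrix() rejects a one-level factor, which is presumably the kind of failure behind the "Error creating model matrix" message.

```r
# Hypothetical metadata: in meta_bad, sampling left only one rating value.
meta_bad  <- data.frame(rating = factor(rep("Conservative", 5)), day = 1:5)
meta_good <- data.frame(rating = factor(c("Conservative", "Liberal", "Liberal",
                                          "Conservative", "Liberal")),
                        day = 1:5)

# Hypothetical helper: number of levels actually observed per factor column.
observed_levels <- function(meta) {
  is_fac <- vapply(meta, is.factor, logical(1))
  vapply(meta[is_fac], function(x) nlevels(droplevels(x)), integer(1))
}

print(observed_levels(meta_bad))
print(observed_levels(meta_good))

# A factor with a single level cannot be expanded into a design matrix.
design_fails <- inherits(
  try(model.matrix(~ rating + day, data = meta_bad), silent = TRUE),
  "try-error"
)
print(design_fails)
```

Any covariate that reports a single level should be dropped from the prevalence formula or the sample should be redrawn so that both groups are present.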

R: error in t test

https://filebin.net/3et86d1gh8cer9mu is an example subset of my data.
I am trying to apply code that already worked on similar data, but now I can't track down where it goes wrong. The code goes like this:
url <- 'https://filebin.net/3et86d1gh8cer9mu/TCA_subset_GnoG_melt.csv'
TCA_subset_GnoG_melt <- read.csv(url)
L <- data.frame()
IDs <- unique(TCA_subset_GnoG_melt$X1)
for (i in 1 : length(IDs)){
temp<-TCA_subset_GnoG_melt[(TCA_subset_GnoG_melt$X1)==IDs[i],]
temp<- na.omit(temp)
t_test_CTROL_ABC.7<- t.test(temp$value[temp$X1.1=="CTROL"], temp$value[temp$X1.1=="ABC.7"])
t_test_CTROL_ABC.8<- t.test(temp$value[temp$X1.1=="CTROL"], temp$value[temp$X1.1=="ABC.8"])
t_test_CTROL_ABC.7.8<- t.test(temp$value[temp$X1.1=="CTROL"], temp$value[temp$X1.1=="ABC7.8"])
t_test_ABC.7_ABC.8<- t.test(temp$value[temp$X1.1=="ABC.7"], temp$value[temp$X1.1=="ABC.8"])
t_test_ABC.7_ABC.7.8<- t.test(temp$value[temp$X1.1=="ABC.7"], temp$value[temp$X1.1=="ABC7.8"])
t_test_ABC.8_ABC.7.8<- t.test(temp$value[temp$X1.1=="ABC.8"], temp$value[temp$X1.1=="ABC7.8"])
LLc <- cbind(as.character(unique(IDs[i])), t_test_CTROL_ABC.7,t_test_CTROL_ABC.8,t_test_CTROL_ABC.7.8, t_test_ABC.7_ABC.8,t_test_ABC.7_ABC.7.8, t_test_ABC.8_ABC.7.8)
L<-rbind(L,LLc)
}
AA<-rownames(L)
L$names <- AA
p_value_TCA <-L[grep("p.value",L$names), ]
df <- apply(p_value_TCA ,2,as.character)
df = as.matrix(df)
The error I get is:
Error in t.test.default(temp$value[temp$X1.1 == "CTROL"], temp$value[temp$X1.1 == :
not enough 'y' observations
I don't understand it. When I run the code line by line, it works up to the creation of LLc, and then the data frame L is empty. It makes no sense to me. Help!
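With its default settings, t.test() stops with "not enough 'y' observations" when the second sample has fewer than two values. That happens when a group has a single row for some ID, or when a label in the comparison matches no rows at all (the code above uses both "ABC.7.8" in variable names and "ABC7.8" as a label, so the spelling may be worth checking against the data). A minimal sketch with toy data and a hypothetical helper, `safe_p_value()`, that returns NA instead of stopping the loop:

```r
# Hypothetical helper: NA instead of an error when a group is too small.
safe_p_value <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  if (length(x) < 2 || length(y) < 2) return(NA_real_)
  t.test(x, y)$p.value
}

# Toy rows standing in for one ID (column names follow the question).
temp <- data.frame(
  X1.1  = c("CTROL", "CTROL", "CTROL", "ABC.7", "ABC.7", "ABC.7", "ABC.8"),
  value = c(1.0, 1.2, 0.9, 2.1, 1.9, 2.3, 3.0)
)

# How many observations does each group have for this ID?
print(table(temp$X1.1))

ctrl <- temp$value[temp$X1.1 == "CTROL"]
p_ok      <- safe_p_value(ctrl, temp$value[temp$X1.1 == "ABC.7"])
p_small   <- safe_p_value(ctrl, temp$value[temp$X1.1 == "ABC.8"])   # one value
p_missing <- safe_p_value(ctrl, temp$value[temp$X1.1 == "ABC7.8"])  # no rows
print(c(p_ok, p_small, p_missing))
```

Printing the group counts per ID inside the loop is usually enough to see which ID and which comparison triggers the error.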

how to remove whitespace introduced due to tidyr separate()

I am a beginner in R and working on the titanic dataset.
I wanted to split the fullnames which are like this
into :
but the separate function in tidyr also leaves a whitespace character in the result of the split. How do I avoid this?
Thanks in advance.
Here is my code:
require('ggplot2') # visualization
require('ggthemes') # visualization
require('scales') # visualization
require('dplyr') # data manipulation
require('mice') # imputation
require('randomForest') # classification algorithm
require('tidyr')
setwd('~/Downloads/Titanic dataset/')
train <- read.csv('./train.csv')
test <- read.csv('./test.csv')
full <- bind_rows(train,test)
names<-full["Name"]
names$Name<-gsub('\\"','',names$Name)
names$Name<-gsub('\\(.*\\)','',names$Name)
names<-separate(names,Name,into =c("lastname","firstname"),sep="[\\,]")
names<-separate(names,firstname,into =c("title","firstname"),sep="[\\.]")
full<-bind_cols(names,full)
#full$title<-gsub(" ",'',full$title)
full$title <- trimws(full$title, which = "both")
rare_title<- c('Capt','Don','Dona','Jonkheer','Lady','Sir',
'the Countess','Major','Col','Rev')
full$title[full$title =="Mlle"] <- "Miss"
full$title[full$title =='Ms'] <- 'Miss'
full$title[full$title =='Mme'] <- 'Mrs'
full$title[full$title %in% rare_title] <- "rare_title"
table(full$Sex, full$title)
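The leading blank appears because the separator pattern matches only the comma or the period, so the space that follows stays in the next piece. One option is to let the separator regex consume the surrounding whitespace as well. The sketch below uses two example names in the Titanic format and is an illustration under that assumption, not a tested fix for the full dataset; names with extra periods (such as initials) may need more care.

```r
full_names <- c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina")

# Base R: split on "," or "." together with any whitespace around them.
parts  <- strsplit(full_names, "\\s*[,.]\\s*")
titles <- vapply(parts, `[`, character(1), 2)
print(titles)

# The same idea with tidyr, if it is installed.
if (requireNamespace("tidyr", quietly = TRUE)) {
  df <- data.frame(Name = full_names, stringsAsFactors = FALSE)
  df <- tidyr::separate(df, Name, into = c("lastname", "rest"),
                        sep = ",\\s*")
  # extra = "merge" keeps any later periods inside firstname.
  df <- tidyr::separate(df, rest, into = c("title", "firstname"),
                        sep = "\\.\\s*", extra = "merge")
  print(df)
}
```

Alternatively, the columns can be cleaned after the split with trimws() applied to the column itself.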

linear discriminant function error - arguments must be same length

My example dataset:
year <- c("2002","2002","2002","2004","2005","2005","2005","2006", "2006")
FA1 <- c(0.7975030, 1.5032768, 0.8805000, 1.0505961, 1.1379715, 1.1334510, 1.1359434, 0.9614926, 1.2631387)
FA2 <- c(0.7930153, 1.2862355, 0.5633592, 1.0396431, 0.9446277, 1.1944455, 1.086171, 0.767955, 1.2385361)
FA3 <- c(-0.7825210, 0.56415672, -0.9294417, 0.21485071, -0.447953,0.037978, 0.038363, -0.495383, 0.509704)
FA4 <- c(0.38829957,0.34638035,-0.06783007, 0.505020, 0.3158221,0.55505411, 0.42822783, 0.36399347, 0.51352115)
df <- data.frame(year,FA1,FA2,FA3,FA4)
I then select the data I want to use and run a DFA
library(magrittr)
library(DiscriMiner)
yeardf <- df[df$year %in% c(2002, 2005, 2006),]
yeardfd <- linDA(yeardf[,2:4],yeardf$year, validation = "crossval")
But now I get an error telling me the arguments have different lengths:
"Error in table(original = y[test], predicted = pred_class) :
all arguments must have the same length"
I looked at
length(yeardf$year)
dim(yeardf)
And it looks like they are the same.
I also checked for spelling mistakes as that seems to cause this error sometimes.
Follow-up on the answer:
The suggested answer works on my example data (which does give me the same error), but I can't quite make it work with my real code.
I first apply the transformation to selected columns in my data.frame, and then I combine the transformed columns with the variables I want to use as groups in my DFA:
library(robCompositions)
tFA19 <- cenLR(fadata.PIZ[names(FA19)])[1]
tFA19 <- cbind(fadata.PIZ[1:16],tFA19)
So I think creating my data.frame this way must be leading to my error. I tried to insert stringsAsFactors into my cbind statement, but no luck.
You need stringsAsFactors = FALSE in data.frame():
year <- c("2002","2002","2002","2004","2005","2005","2005","2006", "2006")
FA1 <- c(0.7975030, 1.5032768, 0.8805000, 1.0505961, 1.1379715, 1.1334510, 1.1359434, 0.9614926, 1.2631387)
FA2 <- c(0.7930153, 1.2862355, 0.5633592, 1.0396431, 0.9446277, 1.1944455, 1.086171, 0.767955, 1.2385361)
FA3 <- c(-0.7825210, 0.56415672, -0.9294417, 0.21485071, -0.447953,0.037978, 0.038363, -0.495383, 0.509704)
FA4 <- c(0.38829957,0.34638035,-0.06783007, 0.505020, 0.3158221,0.55505411, 0.42822783, 0.36399347, 0.51352115)
df <- data.frame(year,FA1,FA2,FA3,FA4,stringsAsFactors = FALSE)
library(magrittr)
library(DiscriMiner)
yeardf <- df[df$year %in% c(2002, 2005, 2006),]
yeardfd <- linDA(yeardf[,2:4],yeardf$year, validation = "crossval")
yeardfd
Linear Discriminant Analysis
-------------------------------------------
$functions discrimination functions
$confusion confusion matrix
$scores discriminant scores
$classification assigned class
$error_rate error rate
-------------------------------------------
$functions
2002 2005 2006
constant -345 -371 -305
FA1 228 231 213
...
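A likely reason this helps: when year is stored as a factor, subsetting the rows keeps the unused level "2004", so the grouping variable still carries four levels for three observed groups. If the grouping column is already a factor (for example after cbind() on existing data frames), droplevels() is an alternative to rebuilding the data frame. A minimal sketch on the example data; whether it resolves the real data in the follow-up is an assumption:

```r
year <- c("2002", "2002", "2002", "2004", "2005", "2005", "2005", "2006", "2006")
FA1  <- c(0.7975030, 1.5032768, 0.8805000, 1.0505961, 1.1379715,
          1.1334510, 1.1359434, 0.9614926, 1.2631387)

# Force year to be a factor, as data.frame() did by default before R 4.0.0.
df <- data.frame(year = factor(year), FA1 = FA1)

# Subsetting rows does not remove the now-unused level "2004".
yeardf <- df[df$year %in% c("2002", "2005", "2006"), ]
levels_before <- levels(yeardf$year)

# droplevels() discards levels that no longer occur in the data.
yeardf$year <- droplevels(yeardf$year)
levels_after <- levels(yeardf$year)

print(levels_before)
print(levels_after)
print(table(yeardf$year))
```

Converting the column with as.character() before passing it to linDA() would have a similar effect.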
