Issue with the length of a sparse.model.matrix in R

I've been trying to train this dataset with xgboost. However, when I turn it into a sparse matrix I get the following error message:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
I'm confused because the label was derived from the dataset, so I don't understand how it can be a different length from the sparse matrix.
From what I can tell, the data frame has 2048 rows, as does the label derived from it. However, when I turn it into a sparse matrix, 300 rows are added.
Can anyone think of a fix?
require(xgboost)
require(methods)
require(Matrix)
require(data.table)
require(vcd)
require(dplyr)
train = read.csv("French Ligue 1 train.csv", header = TRUE, stringsAsFactors = F)
test = read.csv("French Ligue 1 test.csv", header = TRUE, stringsAsFactors = F)
df <- data.table(train, keep.rownames = F)
sparse_matrix <- sparse.model.matrix(Response ~ . - 1, data = df)
output_vector = sparse_matrix[,Response] == 1
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
               eta = 1, nthread = 2, nrounds = 4, objective = "binary:logistic")
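A hedged sketch of one possible fix, assuming Response is a column of df: derive the label from the data frame itself rather than from the sparse matrix, and check that the two lengths agree before calling xgboost, since sparse.model.matrix() can drop rows (for example rows containing NA), after which the label no longer lines up.
# sketch: build the label from df, then verify it matches the design matrix
df <- df[complete.cases(df), ]                 # sparse.model.matrix() drops rows with NA
sparse_matrix <- sparse.model.matrix(Response ~ . - 1, data = df)
output_vector <- df$Response == 1              # logical label, one entry per row of df
stopifnot(length(output_vector) == nrow(sparse_matrix))
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
               eta = 1, nthread = 2, nrounds = 4, objective = "binary:logistic")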

Related

Why am I getting metaMDS error: arguments imply differing number of rows: 0, 17 for species matrix? Is this due to an incorrect matrix input?

I'm a beginner and new to Stack Overflow. I get this error message while trying to make an NMDS ordination plot for bird community count data.
Data subset:
datasubset <- "Cell_ID Treatment Count_cell Bird_Code
1 26-E15-N Single Tree Selection Cut 1 EAWP
2 26-E15-N Single Tree Selection Cut 2 ACFL
4 26-E15-N Single Tree Selection Cut 2 REVI
5 26-E15-N Single Tree Selection Cut 3 AMGO"
bird18_ <- read.table(text=datasubset, header = TRUE)
Here is my code:
library(vegan)     # vegdist(), simper(), metaMDS()
library(reshape)   # cast()
num <- nrow(bird18_)
new_data_frame <- data.frame()
for(i in 1:num){
  mult_by <- bird18_$Count_cell[i]
  interimdf <- data.frame(bird18_[i, ])
  againdf <- rbind(interimdf[rep(1, mult_by), ])
  new_data_frame <- rbind(new_data_frame, againdf)
}
bird_matrix_18<-cast(new_data_frame, Cell_ID+Treatment~Bird_Code)
vegan_species<-bird_matrix_18[,c(3:39)] #getting rid of plot info
###vegan bray-curtis
dis_matrix <- vegdist(vegan_species, method = "bray")
###Simper analysis
sim <- with(vegan_plotinfo, simper(vegan_species, Treatment))
lapply(sim, FUN=function(x){x$overall})
sim<-data.frame(unclass(summary(sim)), check.names = FALSE, stringsAsFactors = FALSE)
mds <- metaMDS(vegan_species, distance = "bray", k = 2, trymax = 20, autotransform = TRUE, noshare = 0.1, expand = TRUE, trace = 1, plot = FALSE)
Here is my error:
Error in data.frame(rrownames(x), unx, check.names = FALSE) :
arguments imply differing number of rows: 0, 17
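Two things are worth checking in a script like this (a sketch only, since just the four-row subset is shown): the row expansion can be done in a single step, and an empty species matrix is a common source of a "differing number of rows: 0, ..." message.
# one-step alternative to the for loop: repeat each row Count_cell times
expanded <- bird18_[rep(seq_len(nrow(bird18_)), bird18_$Count_cell), ]
# before vegdist()/metaMDS(), confirm the species matrix actually has rows and columns
dim(vegan_species)   # a 0 here would explain "arguments imply differing number of rows: 0, 17"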

Do you need to tokenize your text to visualize data from an LDA topic model?

I'm currently using the textmineR package to run LDA topic models on news articles from 2016-2019.
However, I am quite new to R and I don't know how to display the results from the model.
I want to show the prevalence of the 8 topics my model finds over the time period for which I have collected data. The data is structured in a dataframe, defined at the day-to-day level as %y-%m-%d.
My LDA model is made like this:
## get textmineR dtm
dtm <- CreateDtm(doc_vec = dat$fulltext, # character vector of documents
                 ngram_window = c(1, 2),
                 doc_names = dat$names,
                 stopword_vec = c(stopwords::stopwords("da"), custom_stopwords),
                 lower = T, # lowercase - this is the default value
                 remove_punctuation = T, # punctuation - this is the default
                 remove_numbers = T, # numbers - this is the default
                 verbose = T,
                 cpus = 4)
dtm <- dtm[, colSums(dtm) > 3]
dtm <- dtm[, str_length(colnames(dtm)) > 3]
############################################################
## RUN & EXAMINE TOPIC MODEL
############################################################
# Draw quasi-random sample from the pc
set.seed(34838)
model <- FitLdaModel(dtm = dtm,
                     k = 8,
                     iterations = 500,
                     burnin = 200,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 4)
# model log-likelihood
plot(model$log_likelihood, type = "l")
# topic coherence
summary(model$coherence)
hist(model$coherence,
     col = "blue",
     main = "Histogram of probabilistic coherence")
# top terms by topic
model$top_terms1 <- GetTopTerms(phi = model$phi, M = 10)
t(model$top_terms1)
# topic prevalence
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100
# prevalence should be proportional to alpha
plot(model$prevalence, model$alpha, xlab = "prevalence", ylab = "alpha")
Can anyone tell me how to plot the most prevalent topics the model finds over time?
Do I need to tokenize the text or something like that?
I hope this makes sense.
Tokenization happens in the CreateDtm function. So, it doesn't sound like that's your issue.
You can get the prevalence of topics over a set of documents by taking a mean over the columns of theta, a matrix that's part of the resulting model.
I can't give you an exact answer with your data, but I can show you a similar example with the nih_sample data that ships with textmineR:
# load the NIH sample data
data(nih_sample)
# create a dtm and topic model
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID)
m <- FitLdaModel(dtm = dtm, k = 20, iterations = 100, burnin = 75)
# aggregate theta by the year of the PROJECT_END variable
end_year <- stringr::str_split(string = nih_sample$PROJECT_END, pattern = "/")
end_year <- sapply(end_year, function(x) x[length(x)])
end_year <- as.numeric(end_year)
topic_by_year <- by(data = m$theta, INDICES = end_year, FUN = function(x){
  if (is.null(nrow(x))) {
    # a single-row group gets converted to a vector - just return it
    return(x)
  } else {
    # multiple rows: aggregate by taking column means
    return(colMeans(x))
  }
})
topic_by_year <- as.data.frame(do.call(rbind, topic_by_year))
topic_by_year$year <- as.numeric(rownames(topic_by_year)) # by() keeps the year as row names
# plot topic 10's prevalence by year
plot(topic_by_year$year, topic_by_year$t_10, type = "l")
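The same idea should carry over to day-level dates, though the column name below is an assumption: if dat has a date column whose rows line up with dtm (and hence with model$theta), you can aggregate theta by month instead of by year.
# hypothetical adaptation to day-level data; assumes a dat$date column that
# as.Date() can parse and whose rows correspond to the rows of model$theta
doc_month <- format(as.Date(dat$date), "%Y-%m")   # adjust format= if the dates use a two-digit year
topic_by_month <- by(data = model$theta, INDICES = doc_month, FUN = colMeans)
topic_by_month <- as.data.frame(do.call(rbind, topic_by_month))
# one line per topic, months along the x axis
matplot(seq_len(nrow(topic_by_month)), topic_by_month, type = "l",
        xlab = "month (index)", ylab = "mean topic prevalence")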

Trying to perform LDA with LASSO

PenalizedLDA(x = train_x, y = train_y) returns
Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :
'x' must be atomic
I'm trying to use linear discriminant analysis with the lasso on the spambase dataset from UCI. (I've added headers to the columns and, where appropriate, rescaled the columns to the interval [0,1].)
The first time I ran the code it gave an error
Error in PenalizedLDA(x = train_x, y = train_y) :
y must be a numeric vector, with values as follows: 1, 2, ....
I solved that by passing train_y as
train_y =as.list.numeric_version(training_set[,58])
When I ran it again, I got the error
Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :
'x' must be atomic
Here I got stuck.
library(penalizedLDA)
library(caTools) # sample.split()
data = read.csv("spambase.csv", header = TRUE)
new_data = data/100
new_data[,c(55,56,57,58)] = data[,c(55,56,57,58)]
new_data[,58]= factor(new_data[,58])
# Splitting dataset into Training set and Test set
set.seed(seeds)
split = sample.split(new_data$factor, SplitRatio = 0.7)
training_set = subset(new_data, split == TRUE)
test_set = subset(new_data, split == FALSE)
#scale data
training_set[-58] = scale(training_set[,-58])
test_set[-58] = scale(test_set[,-58])
train_x =training_set[,-58]
train_y =as.list.numeric_version(training_set[,58])
#Sparse linear discriminant Analysis
classifier = PenalizedLDA( x = training_set[,-58], y =training_set[,58],K = 1,lambda = "standard")
According to the help-page of PenalizedLDA(), its parameter y = should be:
A n-vector containing the class labels. Should be coded as 1, 2, . . . , nclasses, where nclasses is the number of classes.
It means that the levels of the variable of interest (position 58 in your case) should start at 1, not 0. Moreover, don't use the function as.list.numeric_version(), because it creates a list, whereas a numeric vector is required.
data = read.csv("...")
new_data = data/100
new_data[,c(55,56,57,58)] = data[,c(55,56,57,58)]
new_data[,58] = factor(new_data[,58] + 1) # in order to start at 1 and not 0
new_data[-58] = scale(new_data[,-58])
classifier = PenalizedLDA(x = new_data[,-58], y = new_data[,58], K = 1, lambda = .1)
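Two quick checks (a sketch, using the train_y from the question's code) map directly onto the two error messages above: "'x' must be atomic" appears when y is a list, and the class-label error appears when the labels are coded 0/1 instead of 1, 2, ...
# sketch: sanity-check y before calling PenalizedLDA()
is.atomic(train_y)       # FALSE when y was built with as.list.numeric_version() -> "'x' must be atomic"
sort(unique(train_y))    # the class labels should read 1 2 ... nclasses, not 0 1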

Human-readable rules from xgboost in R

I'm trying to use xgboost in R to get rules (gbtree) from my data, so I can use the rules in another system (rather than predicting with 'predict'). The input data have approx. 1500 columns and 40 million rows of binary, sparse data, and the label is a binary column, too.
library(xgboost)
library(Matrix)
m <- 100 # number of example rows
labels <- data.frame(labels = sample.int(2, m, TRUE) - 1L)
observations <- Matrix(as.matrix(data.frame(feat_01 = sample.int(2, size = m, T) - 1,
                                            feat_02 = sample.int(2, size = m, T) - 1,
                                            feat_03 = sample.int(2, size = m, T) - 1,
                                            feat_04 = sample.int(2, size = m, T) - 1,
                                            feat_05 = sample.int(2, size = m, T) - 1)), sparse = T)
dtrain <- xgb.DMatrix(data = observations, label = labels$labels)
bstResult <- xgb.train(data = dtrain,
                       nthread = 1,
                       nround = 4,
                       max_depth = 3,
                       verbose = T,
                       objective = "binary:logistic",
                       booster = 'gbtree')
xgb.dump(bstResult)
xgb.plot.tree(model = bstResult, n_first_tree = 2)
I can visualize the trees with xgb.dump or xgb.plot.tree, but I need the rules in a form like:
rule1: feat_01 == 1 & feat_02==1 & feat_03== 0 --> Label = 1
rule2: feat_01== 0 & feat_03==1 & feat_04== 1 --> Label = 0
Is this possible or am I on the wrong track?
Regards
Heiko
edit: added example and tried to make the question better
On the one hand, I think you can use the importance matrix to obtain the cover and ranking of each feature. On the other hand, xgboost builds an ensemble of weak learners by boosting, so any individual rule will be 'rare' and weak on its own.
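If the goal is to take the split logic into another system, one option (a sketch, not a full rule extractor) is xgb.model.dt.tree(), which returns every node of every tree as a row in a data.table; walking from each leaf back to the root along the Yes/No/Missing columns yields one rule per leaf.
# sketch: dump the fitted trees as a table of split conditions
tree_dt <- xgb.model.dt.tree(model = bstResult)
head(tree_dt)   # columns include Tree, Feature, Split, Yes, No, Missing; leaves have Feature == "Leaf"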

R Dataframes: Arguments imply differing number of rows

I have 11784 records split into test (2946) and train (8838) sets to run an H2O algorithm, but I got an error for the data frame I'm trying to create as the final output, which links the predictions to the IDs they were predicted for.
The error is for this line:
df_y_test <- data.frame(ID = df_labels, Status = df_y_test$predict)
Error in data.frame(ID = df_labels, Status = df_y_test$predict) :
arguments imply differing number of rows: 2946, 2950
I looked through the forums and understood that df_y_test having 2950 rows is what causes this, but I couldn't figure out why, since df_y_test is also derived from the same 'test' variable, which has only 2946 rows. I'd be happy for any guidance; the full script is posted below for reference.
data : 11784 obs of 46 variables
test: 2946 obs of 45 variables
train: 8838 obs of 46 variables
df_labels: 2946 obs of 1 variable
df_y_test: 2950 obs of 4 variables
# Load Data
data <- read.csv('Data.csv')
# Partition Data
library(caTools)
set.seed(75)
split <- sample.split(data$Status, SplitRatio = 0.75)
train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)
# Dropping the column to be predicted from Test
test <- subset(test[,-c(2)])
library(readr)
library(h2o)
# Init h2o
localh2o <- h2o.init(max_mem_size = '2g', nthreads = -1)
# convert status values (to be predicted) in second column to factors in h2o
train[,2] <- as.factor(train[,2])
train_h2o <- as.h2o(train)
test_h2o <- as.h2o(test)
# Running H2O
model <- h2o.deeplearning(x = c(1, 3:46),
                          y = 2,
                          training_frame = train_h2o,
                          activation = "RectifierWithDropout",
                          input_dropout_ratio = 0.2,
                          hidden_dropout_ratios = c(0.5, 0.5),
                          balance_classes = TRUE,
                          hidden = c(100,100),
                          nesterov_accelerated_gradient = T,
                          epochs = 15)
h2o_y_test <- h2o.predict(model, test_h2o)
# Converting to data frames from h2o
df_y_test <- as.data.frame(h2o_y_test)
df_labels <- as.data.frame(test[,1])
df_y_test <- data.frame(ID = df_labels, Status = df_y_test$predict)
write.csv(df_y_test, file="predictionsH2o.csv", row.names = FALSE)
h2o.shutdown(prompt = FALSE)
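A short diagnostic sketch for a mismatch like this (using the object names from the script above): compare the row counts at each conversion step to find out where the extra four rows appear.
# sketch: track the row counts to see where 2946 becomes 2950
nrow(test)            # expected 2946
h2o.nrow(test_h2o)    # does the H2O frame already contain the extra rows?
nrow(df_y_test)       # rows of the predictions returned by h2o.predict()
nrow(df_labels)       # expected 2946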
