R Network Package - node attributes

I've recently started to learn R, with specific interest in network modeling applications. I've made a sample dataset and would like to visualize it, eventually getting to rigorous statistical network analysis.
The example is a high school friendship network. The node attributes are found in HS1_Node_Attributes.csv and the adjacency matrix in HS1_adjacency_matrix.csv. I'm able to visualize the network, but I'm having trouble attaching the node attributes (characteristics of the people). I'm using the network package.
The error I get is as follows:
Error in set.vertex.attribute(g, vertex.attrnames[[i]], vertex.attr[[i]]) :
Inappropriate value given in set.vertex.attribute.
I have cross-referenced my example with several online tutorials and the R network package documentation. One potential problem was the type of my attribute data frame, but I confirmed it is a list, which checks out. So I'm not sure what the problem is. Everything works fine (meaning I can successfully create a network object) if I leave out the node attributes (the vertex.attr and vertex.attrnames arguments), which tells me the rest of the code is sound. My code is below.
high_school1_attributes <- read.table("HS1_Node_Attributes.csv", header = TRUE,
                                      sep = ",")
high_school1_adj <- read.table("HS1_adjacency_matrix.csv", header = TRUE,
                               row.names = 1, sep = ",")
adj1 <- as.matrix(high_school1_adj)
library("network")
high_school1_network <- network(adj1, vertex.attr = high_school1_attributes,
                                vertex.attrnames = colnames(high_school1_attributes),
                                directed = FALSE, hyper = FALSE, loops = FALSE,
                                multiple = FALSE, bipartite = FALSE)
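For what it's worth, a quick way to inspect the attribute column types, which turn out to matter here:
# Inspect each attribute column's class; with older R defaults,
# read.table() silently turns character columns into factors
sapply(high_school1_attributes, class)
str(high_school1_attributes)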

You could be automatically converting strings to factors when reading the attribute file. I can recreate the error by doing:
high_school1_attributes <- read.csv(text=
"Name,Color
Kermit,green
Piggy,pink
Gonzo,blue")
high_school1_adj <- read.csv(text=
",From,To
1,1,3
2,3,2
3,2,1",
row.names = 1)
adj1 <- as.matrix(high_school1_adj)
library("network")
high_school1_network <- network(
  adj1,
  vertex.attr = high_school1_attributes,
  vertex.attrnames = colnames(high_school1_attributes),
  directed = FALSE, hyper = FALSE, loops = FALSE,
  multiple = FALSE, bipartite = FALSE)
And can fix it by replacing the first statement with:
high_school1_attributes <- read.csv(text=
"Name,Color
Kermit,green
Piggy,pink
Gonzo,blue",
stringsAsFactors=FALSE)
You can see that this works by plotting:
library(igraph)
library(intergraph)
hs_graph <- asIgraph(high_school1_network)
plot(hs_graph, vertex.size = 8,
     vertex.color = V(hs_graph)$Color,
     vertex.label = V(hs_graph)$Name,
     edge.arrow.size = 0.25, layout = layout.fruchterman.reingold)
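As a side note, since R 4.0.0 stringsAsFactors defaults to FALSE, so recent R versions won't reproduce this error. If the attribute data frame has already been read in with factor columns, a minimal sketch for converting them in place (this lapply idiom is one alternative, not the only fix):
# Convert any factor columns of the attribute data frame back to character
high_school1_attributes[] <- lapply(
  high_school1_attributes,
  function(col) if (is.factor(col)) as.character(col) else col
)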

Related

k-fold cross validation in quanteda

I've been using the quanteda SML workflow as described in the quanteda tutorial (https://tutorials.quanteda.io/machine-learning/nb/) and found it extremely helpful for setting up my own classification task. However, instead of the fixed held-out train/test split I would like to use k-fold cross-validation. Could you point me towards the best way to implement it in the workflow? Is there an easy way to do this in quanteda?
Many thanks
I tried to add a cross validation based on this example:
https://rdrr.io/github/quanteda/quanteda.classifiers/man/crossval.html
require(quanteda)
require(quanteda.textmodels)
require(caret)
corp_movies <- data_corpus_moviereviews
summary(corp_movies, 5)
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)
# tokenize texts
toks_movies <- tokens(corp_movies, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_wordstem()
dfmt_movie <- dfm(toks_movies)
# get training set
dfmat_training <- dfm_subset(dfmt_movie, id_numeric %in% id_train)
# get test set (documents not in id_train)
dfmat_test <- dfm_subset(dfmt_movie, !id_numeric %in% id_train)
tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_training))
actual_class <- dfmat_matched$sentiment
predicted_class <- predict(tmod_nb, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class
# confusionMatrix() comes from caret, which is already loaded above
confusionMatrix(tab_class, mode = "everything", positive = "pos")
#n-fold cross validation
require(crossval)
dfmat <- dfm(toks_movies)
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)
crossval(tmod, k = 5, by_class = FALSE)
crossval(tmod, k = 5, by_class = FALSE, verbose = TRUE)
but it returns "Error in group.samples(Y) : argument "Y" is missing, with no default"
This should probably be a comment, but I cannot post them yet. I think your problem is caused by calling crossval() from the wrong package. The link you shared refers to crossval() from the remote quanteda/quanteda.classifiers package, not from the crossval package you loaded. The crossval package's function has a different definition and pipeline: it requires additional X and Y arguments, and their absence is the reason for your error.
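A minimal sketch of the fix, assuming you install the GitHub-only package (the remotes install line is an assumption; use whatever installer you prefer):
# The linked docs are for quanteda/quanteda.classifiers on GitHub
# remotes::install_github("quanteda/quanteda.classifiers")
library(quanteda.classifiers)
# crossval() from quanteda.classifiers takes a fitted textmodel directly,
# matching the usage in the question
tmod <- textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)
crossval(tmod, k = 5, by_class = TRUE)
If you'd rather stay on CRAN, caret::createFolds() will give you fold indices you can use to subset the dfm and loop the train/predict steps manually.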

How to delete vertex with degree = 0 in R (social network data)?

I want to delete vertices with the code below but get this error message:
Error in delete.vertices(n_all, V(n_all)[degree(n_all) == 0]) :
delete.vertices requires an argument of class network.
I followed some suggestions from here (remove unconnected nodes R igraph or ggnet) but was still not successful. Any solutions would be much appreciated.
require(igraph)
meta <- read.csv("vil35_meta.csv", header = TRUE, as.is = TRUE)
# `all` holds the adjacency data (read in elsewhere; not shown in the question)
n_all <- graph.adjacency(as.matrix(all), mode = "undirected", weighted = TRUE)
V(n_all)$caste <- meta$castesubcaste
V(n_all)$religion <- meta$hohreligion
V(n_all)$rooms <- meta$room_no
n_all <- delete.vertices(n_all, V(n_all)[degree(n_all)==0])
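The error message itself hints at the cause: delete.vertices() is being picked up from the network (or sna) package, which expects an object of class network, rather than from igraph. Assuming such a masking conflict is the issue, a namespace-qualified call is a minimal sketch of the fix:
# If network/sna was attached after igraph, their delete.vertices() and
# degree() mask igraph's; qualifying the calls avoids the conflict
n_all <- igraph::delete.vertices(n_all, V(n_all)[igraph::degree(n_all) == 0])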

How to customize clustering in mclust?

I am trying to use the mclust package of R to cluster some data.
Here are the steps of what I have done:
Reading data:
mydata <- read.table("\Users......", row.names = 1, sep = "\t", header = TRUE)
Using mclust:
library(mclust)
mydataModel <- Mclust(mydata)
summary(mydataModel)
It breaks the data into 7 clusters. However, I want my data to be broken into only 2 clusters. Please help me understand how to do this.
As mentioned by MrFlick, you should read the documentation by prefixing the function name with ?.
In your case, run ?Mclust in your R console to see how the default parameters have been set up.
This is what shows up when you do ?Mclust:
Mclust(data, G = NULL, modelNames = NULL,
       prior = NULL,
       control = emControl(),
       initialization = NULL,
       warn = mclust.options("warn"), ...)
All you need to do is pass the number of clusters as the second argument, G:
Mclust(mydata, 2)
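For completeness, a short sketch with the named argument and a quick look at the result (the summary and plot calls are optional additions):
library(mclust)
# G fixes the number of mixture components (clusters) at exactly 2
mydataModel <- Mclust(mydata, G = 2)
summary(mydataModel)
plot(mydataModel, what = "classification")  # visual check of the two clusters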

R: how to map test data into lsa space created by training data

I am trying to do text analysis using LSA. I've read many other posts regarding LSA on StackOverflow, but I have not found one similar to mine yet. If you know of one similar to mine, please kindly redirect me to it! Much appreciated!
Here's my reproducible code with sample data:
creating sample data train & test sets
library(data.table)  # data.table()
library(tm)          # Corpus(), TermDocumentMatrix()
library(lsa)         # lsa(), fold_in(), dimcalc_share()
library(SnowballC)   # wordStem()
library(caret)       # trainControl(), train()
sentiment = c(1,1,0,1,0,1,0,0,1,0)
length(sentiment) #10
text = c('im happy', 'this is good', 'what a bummer X(', 'today is kinda okay day for me', 'i somehow messed up big time',
         'guess not being promoted is not too bad :]', 'stayhing home is boring :(', 'kids wont stop crying QQ', 'warriors are legendary!', 'stop reading my tweets!!!')
train_data = data.table(sentiment = as.factor(sentiment), text = text)
> train_data
sentiment text
1: 1 im happy
2: 1 this is good
3: 0 what a bummer X(
4: 1 today is kinda okay day for me
5: 0 i somehow messed up big time
6: 1 guess not being promoted is not too bad :]
7: 0 stayhing home is boring :(
8: 0 kids wont stop crying QQ
9: 1 warriors are legendary!
10: 0 stop reading my tweets!!!
sentiment = c(0,1,0,0)
text = c('running out of things to say...', 'if you are still reading, good for you!', 'nothing ended on a good note today', 'seriously sleep deprived!! >__<')
test_data = data.table(sentiment = as.factor(sentiment), text = text)
> test_data
sentiment text
1: 0 running out of things to say...
2: 1 if you are still reading, good for you!
3: 0 nothing ended on a good note today
4: 0 seriously sleep deprived!! >__<
preprocessing for training data set
corpus.train = Corpus(VectorSource(train_data$text))
create a term document matrix for training set
tdm.train = TermDocumentMatrix(
  corpus.train,
  control = list(
    removePunctuation = TRUE,
    stopwords = stopwords(kind = "en"),
    stemming = function(word) wordStem(word, language = "english"),
    removeNumbers = TRUE,
    tolower = TRUE,
    weighting = weightTfIdf)
)
convert into matrix (for later use)
train_matrix = as.matrix(tdm.train)
create an lsa space using train data
lsa.train = lsa(tdm.train, dimcalc_share())
set the dimension k (I randomly picked one here because the data size is too small to create an elbow shape)
k = 6
project train matrix into the new LSA space
projected.train = fold_in(docvecs = train_matrix, LSAspace = lsa.train)[1:k,]
convert above projected data into a matrix
projected.train.matrix = matrix(projected.train,
                                nrow = dim(projected.train)[1],
                                ncol = dim(projected.train)[2])
train the random forest model (somehow this step no longer works with this small sample data... but that's okay, it won't be a big problem for this question; however, if you can help me with this error too, that'd be fantastic! I tried googling for this error but it just isn't fixed...)
trcontrol_rf = trainControl(method = "boot", p = .75, trim = T)
model_train_caret = train(x = t(projected.train.matrix), y = train_data$sentiment, method = "rf", trControl = trcontrol_rf)
preprocessing for test data set
basically I'm repeating whatever I did to the training data set, except I did not use the test set to create its own LSA space
corpus.test = Corpus(VectorSource(test_data$text))
create a term document matrix for test set
tdm.test = TermDocumentMatrix(
  corpus.test,
  control = list(
    removePunctuation = TRUE,
    stopwords = stopwords(kind = "en"),
    stemming = function(word) wordStem(word, language = "english"),
    removeNumbers = TRUE,
    tolower = TRUE,
    weighting = weightTfIdf)
)
convert into matrix (for later use)
test_matrix = as.matrix(tdm.test)
project test matrix into the trained LSA space (here's where the question is)
projected.test = fold_in(docvecs = test_matrix, LSAspace = lsa.train)
but I get an error:
Error in crossprod(docvecs, LSAspace$tk) : non-conformable arguments
I am not finding any useful Google search results regarding this error (there's only one page of results). Any help is much appreciated! Thank you!
When you build the LSA model you are using the vocabulary of the training data, but when you build the TermDocumentMatrix for the test data, you are using the vocabulary of the test data. The LSA model only knows how to handle documents tabulated against the vocabulary of the training data.
One way to remedy this is to create your test TDM with dictionary set to the vocabulary of the training data:
tdm.test = TermDocumentMatrix(
  corpus.test,
  control = list(
    removeNumbers = TRUE,
    tolower = TRUE,
    stopwords = stopwords("en"),
    stemming = TRUE,
    removePunctuation = TRUE,
    weighting = weightTfIdf,
    dictionary = rownames(tdm.train)
  )
)
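With the shared dictionary the two TDMs cover the same vocabulary, so the fold-in becomes conformable. A short sanity-check sketch using the objects from the question (the defensive row reordering is an assumption, in case term order ever differs):
# The term (row) dimension must match the training TDM before folding in
stopifnot(nrow(tdm.test) == nrow(tdm.train))
test_matrix = as.matrix(tdm.test)
# defensively align term order with the training vocabulary
test_matrix = test_matrix[rownames(train_matrix), , drop = FALSE]
projected.test = fold_in(docvecs = test_matrix, LSAspace = lsa.train)[1:k, ]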

R HTS package: combinef and aggts not working with gts object

I'm trying to apply the combinef and aggts functions from the R hts package to a time series matrix in order to obtain an optimized set of forecasts across a hierarchy. I've run the same code every month without issue, and am now seeing errors after upgrading to hts package v4.5.
Reproducible example (I can share data file offline if needed)
#Read in forecast data for all levels of hierarchy#
fcast<-read.csv("SampleHierarchyForecast.csv", header = TRUE, check.names = FALSE)
#Convert to time series#
fcast<-ts(fcast, start = as.numeric(2010.25) + (64)/12, end = as.numeric(2010.25) + (75)/12, f= 12)
#Create time series of only the bottom level of the hierarchy#
index <- c()
fcastBottom <- fcast
for (i in 1:length(fcastBottom[1, ])) {
  if (nchar(colnames(fcastBottom)[i]) != 28)
    index[i] <- i
  else
    index[i] <- 0
}
fcastBottom <- fcastBottom[, -index]
#Create grouped time series from the bottom level forecast #
GtsForecast <- gts(fcastBottom, characters = list(c(12,12), c(4)), gnames = c("Category", "Item", "Customer", "Category-Customer"))
#Use combinef function to optimally combine the full hierarchy forecast using the groups from the full hierarchy gts#
combo <- combinef(fcast, groups = GtsForecast$groups)
Warning message:
In mapply(rep, as.list(gnames), times, SIMPLIFY = FALSE) :
  longer argument not a multiple of length of shorter
traceback()
2: stop("Argument fcasts requires all the forecasts.")
1: combinef(fcast, groups = GtsForecast$groups)
There's a little bug when the combinef() function calls gts(). I've now fixed it on GitHub, so you can run your own code above without any trouble after updating to the development version.
Alternatively, if you don't want to install the newest version, you can tweak your code a bit:
combo <- combinef(fcast, groups = GtsForecast$groups, keep = "bottom")
combo <- ts(combo, start = as.numeric(2010.25) + (64)/12,
            end = as.numeric(2010.25) + (75)/12, f = 12)
colnames(combo) <- colnames(fcastBottom)
newGtsForecast <- gts(combo, characters = list(c(12,12), c(4)),
                      gnames = c("Category", "Item", "Customer",
                                 "Category-Customer"))
Aggregate <- aggts(newGtsForecast)
Hope it helps.
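For reference, a sketch of updating to the development version (the GitHub repository path here is an assumption; check the package documentation for the canonical location):
# Assumed development repository for hts; verify the path before installing
# install.packages("devtools")
devtools::install_github("earowang/hts")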
