Citation Network - What is the most effective way to present it? (R)

I have been working on a project that illustrates the relationships between authors' articles and the citations those articles receive from other authors. From that I created a matrix that records the edges between them.
Ultimately, we want to measure originality across all the articles, and we are open to additional suggestions on how to measure originality.
Below is the code I have written so far (in RStudio, using the bibliometrix and igraph packages):
library(bibliometrix)

data <- readFiles("network_science_450.bib")                      # read in the .bib data
convert <- convert2df(data, dbsource = "isi", format = "bibtex")  # convert it to a data frame
matrix <- cocMatrix(convert, Field = "CR", sep = ";")             # article-by-cited-reference matrix
sort(Matrix::colSums(matrix), decreasing = TRUE)[1:5]             # five most frequently cited references
NetMatrix <- biblioNetwork(convert, analysis = "coupling", network = "references", sep = ". ")
NetMatrixTable <- as.matrix(NetMatrix)                            # coupling matrix as a dense matrix
binary <- ifelse(NetMatrixTable > 0, 1, 0)                        # convert it into a binary matrix
binary
We have created a binary matrix to represent all of these relationships, but I was wondering whether there is a better way to present our data. We have explored Hasse diagrams as one possibility.
Our main problem is that we cannot find a way to create an adjacency matrix on which to perform further analysis; in particular, we want to apply a transitive reduction to the matrix.

I don't really understand your problem, but it looks like you want to build a sociomatrix. If so, try:
# citation data
df <- data.frame(article       = sample(LETTERS, 50, replace = TRUE),
                 cited_article = sample(LETTERS, 50, replace = TRUE))

## network creation
# 2-mode sociomatrix
df.2mode <- table(df)
df.2mode
# diag(df.2mode) <- 0
(A reproducible example is required for SO questions.)
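If what is still missing is the adjacency matrix itself, a minimal sketch (assuming the binary bibliographic-coupling matrix binary built in the question) is to hand it straight to igraph and pull the adjacency matrix back out:
library(igraph)

# build a graph from the binary coupling matrix created in the question
g <- graph_from_adjacency_matrix(binary, mode = "undirected", diag = FALSE)

# ...and recover an adjacency matrix from it for further analysis
adj <- as_adjacency_matrix(g, sparse = FALSE)
The transitive-reduction step can then be run on g (or adj) with whichever implementation you prefer.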

Related

igraph error long vectors not supported yet when trying to create adjacency matrix

I'm trying to perform a social network analysis in R, and I'm having some trouble creating adjacency matrices from very large matrices using the igraph package. One of the main matrices has 10998555876 elements (82 GB), created from a dataset with 176881 rows.
The error I get when running:
adjacency_matrix <- graph.adjacency(one_mode_matrix, mode = "undirected", weighted = TRUE, diag = TRUE)
is as follows:
Error in graph.adjacency.dense(adjmatrix, mode = mode, weighted = weighted, :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:519
The data is two-mode, so I've had to project it to get the one-mode matrix with the units I'm interested in. The code used before that point to create the matrix is:
graph <- graph.data.frame(data, directed = FALSE)   # making a graph object from the data frame
types <- bipartite.mapping(graph)$type
matrix <- as_incidence_matrix(graph, types = types) # creating the two-mode incidence matrix
one_mode_matrix <- tcrossprod(matrix)               # projecting (matrix %*% t(matrix)) to get the one-mode matrix
diag(one_mode_matrix) <- 0
mode(one_mode_matrix) <- "numeric"
adjacency_matrix <- graph.adjacency(one_mode_matrix, mode = "undirected", weighted = TRUE, diag = FALSE) # this is where things break down
Having done some research, e.g. in this thread https://github.com/igraph/rigraph/issues/255 , it looks like a limitation in base R. It seems to me (without being an expert on these things) that igraph is trying to create an object in a format that R cannot handle because it is too big. Does anybody know how to handle this issue? Perhaps there are other packages for creating adjacency matrices that would do a better job on a large matrix?
Solution, for anybody who might be interested:
I discovered that igraph can handle sparse matrices. Convert the matrix to a sparse matrix using the Matrix package like so:
sparse_matrix <- as(one_mode_matrix, "sparseMatrix")
Then make it into a graph object like this:
g <- graph_from_adjacency_matrix(sparse_matrix)
And revel in all the functionality igraph has to offer.
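A further note (a sketch only, not tested at this scale, and assuming the graph and types objects from the question): the dense 82 GB projection can be avoided entirely by keeping the incidence matrix sparse and projecting it with the Matrix package; igraph accepts the sparse result directly:
library(igraph)
library(Matrix)

# sparse two-mode incidence matrix straight from the bipartite graph
inc <- as_incidence_matrix(graph, types = types, sparse = TRUE)

# sparse one-mode projection (rows x rows); everything stays sparse
one_mode_sparse <- tcrossprod(inc)
diag(one_mode_sparse) <- 0

# igraph can build the graph directly from the sparse matrix
g <- graph_from_adjacency_matrix(one_mode_sparse, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)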

Stock price prediction based on financial news in R with SVM

I'm new to R and trying to predict the S&P 500 stock price from financial news with the help of support vector machines (SVM). I have two datasets: one is the stock market data and the other is the cleaned financial-news corpus. I converted the corpus into a document-term matrix and also applied sentiment analysis to it (once with the SentimentAnalysis package and once with the tidytext package). Now I'm stuck on getting the model running. I've found different approaches to predicting stock prices with an SVM, but none that use financial news. How can I combine the two datasets to build the model? My current code and situation is this:
library(tm)
library(SentimentAnalysis)
library(tidytext)
library(dplyr)

docs <- Corpus(DirSource(directory = "D:/Financial_News_Prediction/Edgar filings_full text/Form 8-K", recursive = TRUE))
# Cleaning steps are not shown here
# Creating DTM
dtm <- DocumentTermMatrix(docs)
dtm <- removeSparseTerms(dtm, 0.99)
dtm <- as.matrix(dtm)
# Sentiment analysis DTM
dtm.sent <- analyzeSentiment(dtm)
# Creating DTM Tidy Format
dtm.tidy <- DocumentTermMatrix(docs)
dtm.tidy <- tidy(dtm.tidy)
# Sentiment analysis Tidy DTM
sent.afinn <- dtm.tidy %>%
  inner_join(get_sentiments("afinn"), by = c(term = "word"))
sent.bing <- dtm.tidy %>%
  inner_join(get_sentiments("bing"), by = c(term = "word"))
sent.nrc <- dtm.tidy %>%
  inner_join(get_sentiments("nrc"), by = c(term = "word"))
# Data split
id_dtm <- sample(nrow(dtm), nrow(dtm) * 0.70)
dtm.train <- dtm[id_dtm, ]
dtm.test <- dtm[-id_dtm, ]
id_sp500 <- sample(nrow(SP500.Data), nrow(SP500.Data) * 0.70)
sp500.train <- SP500.Data[id_sp500, ]
sp500.test <- SP500.Data[-id_sp500, ]
That is my status quo. Now I would like to run the SVM model on the two datasets described above, but I think I need to do some classification first. I have seen approaches that work with labels such as -1/+1. My sentiment analysis classifies terms into positive and negative classes, but I just don't know how to put the two sets together to build the model. I would be very happy if somebody could help me. Thanks a lot in advance!
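One common way to combine the two sources is to aggregate the sentiment to one score per trading day, align it with the index data by date, derive an up/down label, and feed that to an SVM, e.g. with the e1071 package. Below is a minimal, hedged sketch with made-up data and column names (date, sentiment, return), none of which come from the question:
library(e1071)

# hypothetical daily data: one aggregated sentiment score and one return per date
set.seed(1)
daily <- data.frame(date      = seq.Date(as.Date("2018-01-01"), by = "day", length.out = 200),
                    sentiment = rnorm(200),
                    return    = rnorm(200, sd = 0.01))
daily$direction <- factor(ifelse(daily$return > 0, "up", "down"))

# one train/test split of the combined data (rather than two independent splits)
idx   <- sample(nrow(daily), nrow(daily) * 0.7)
train <- daily[idx, ]
test  <- daily[-idx, ]

fit  <- svm(direction ~ sentiment, data = train, kernel = "radial")
pred <- predict(fit, test)
table(pred, test$direction)
The key point is that the news features and the stock labels have to live in the same data frame, matched row by row (here by date), before any split or model fitting.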

How to loop an analysis in R while iteratively removing/replacing rows from the original dataset?

I have a CSV file (exported from Excel) with mixed data that looks similar to the sample data frame provided below.
Given the following sample data and analysis:
# Loading packages
library(cluster)
library(vegan)
size = c(5,300,500,4000,60000,2000)
diet = c('A','A','C','D','C','D')
area = c('Ae','Te','Fo','Ae','Te','Ae')
time = c('Di','No','Di','Cr','Ca','Ca')
distance = c(50,800,60,12000,150000,4200)
DF = data.frame(size,diet,area,time,distance)
row.names(DF) = c('Bird','Rat','Cobra','Dog','Human','Fish')
#Calculate Gower distance dissimilarity matrix for species in "DF"
DF.diss = daisy(DF, metric = "gower", type = list(logratio = c("size", "distance")))
attributes(DF.diss)
#Performing hierarchical cluster analysis on dissimilarity matrix
DF.Hclust = hclust(DF.diss, method = "average")
#Calculating metric for species community based on hclust tree
treeheight(DF.Hclust)
Starting with all the rows, as the example does, how would I go about rerunning the analysis while iteratively removing a row, rerunning the analysis, putting the row back, removing the next row, and so on, until the analysis has been run once with each species removed?
I am interested in calculating the treeheight metric for the whole community while removing and replacing single species, to gauge each species' contribution to the overall treeheight.
Since my actual dataset has well over 200 species, it would be great if there were a way to do this in R without having to prepare over 200 separate CSV files in which I've removed a single species and then run each one through the analysis above. Also, is it possible to output each treeheight result to a table?
You can create a loop for this:
treeheights <- matrix(-9999, nrow(DF), 1) # matrix to store the answers;
# -9999 as the starting value makes it easy to check afterwards that everything went alright

for (i in 1:nrow(DF)) {
  DF.LOO <- DF[-i, ]                      # leave one (row) out
  DF.diss.LOO <- daisy(DF.LOO, metric = "gower",
                       type = list(logratio = c("size", "distance")))
  DF.HC.LOO <- hclust(DF.diss.LOO, method = "average")
  treeheights[i, ] <- treeheight(DF.HC.LOO)
}
This goes through all the rows and always leaves one row out. Hope this helps!
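To get the results into a table, as asked, the vector of treeheights from the loop can simply be labelled with the species names, for example:
# label each result with the species that was left out
results <- data.frame(species_removed = rownames(DF),
                      treeheight      = treeheights[, 1])
results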

Extracting Class Probabilities from SparkR ML Classification Functions

I'm wondering if it's possible (using the built-in features of SparkR or any other workaround) to extract the class probabilities from some of the classification algorithms included in SparkR. Particular ones of interest are:
spark.gbt()
spark.mlp()
spark.randomForest()
Currently, when I use the predict function on these models I am able to extract the predictions, but not the actual probabilities or "confidence."
I've seen several other questions that are similar to this topic, but none that are specific to SparkR, and many have not been answered in regards to Spark's most recent updates.
I ran into the same problem, and following this answer I now use SparkR:::callJMethod to transform the probability DenseVector (which R cannot deserialize) into an Array (which R reads as a list). It's not very elegant or fast, but it does the job:
denseVectorToArray <- function(dv) {
  SparkR:::callJMethod(dv, "toArray")
}
e.g.:
Start your Spark session:
library(SparkR)
sparkR.session(master = "local")
Generate toy data:
data <- data.frame(clicked = base::sample(c(0, 1), 100, replace = TRUE),
                   someString = base::sample(c("this", "that"),
                                             100, replace = TRUE),
                   stringsAsFactors = FALSE)

trainidxs <- base::sample(nrow(data), nrow(data) * 0.7)
traindf <- as.DataFrame(data[trainidxs, ])
testdf <- as.DataFrame(data[-trainidxs, ])
Train a random forest and run predictions:
rf <- spark.randomForest(traindf,
                         clicked ~ .,
                         type = "classification",
                         maxDepth = 2,
                         maxBins = 2,
                         numTrees = 100)

predictions <- predict(rf, testdf)
Collect your predictions:
collected <- SparkR::collect(predictions)
Now extract the probabilities:
collected$probabilities <- lapply(collected$probability, denseVectorToArray)
str(collected$probabilities)
Of course, the function wrapper around SparkR:::callJMethod is a bit of overkill. You can also call it directly, e.g. with dplyr:
withprobs <- collected %>%
  rowwise() %>%
  mutate(probabilities = list(SparkR:::callJMethod(probability, "toArray"))) %>%
  mutate(prob0 = probabilities[[1]], prob1 = probabilities[[2]])

How to use weights from survey package in TermDocumentMatrix

I work a lot with samples that I want to generalize to larger populations. Most of the time, however, the samples are biased and need to be weighted with the survey package. I have not found a way to apply this kind of weight to a TermDocumentMatrix. Consider this example:
library(tm)
library(wordcloud)
set.seed(123)
# Consider this example: I have drawn a sample from a population and now have
# 1000 observations of text. In the data I also have information about gender.

# The sample
data <- rbind(data.frame(gender = "M",
                         words = sample(c("education", "money", "family",
                                          "house", "debts"),
                                        600, replace = TRUE)),
              data.frame(gender = "F",
                         words = sample(c("career", "bank", "friends",
                                          "drinks", "relax"),
                                        400, replace = TRUE)))

# I create a simple wordcloud
text <- paste(data$words, collapse = " ")
matrix <- as.matrix(
  TermDocumentMatrix(
    VCorpus(
      VectorSource(text)
    )
  )
)
wordcloud(words = rownames(matrix), freq = matrix[, 1])
This produces a wordcloud in which the terms mentioned by men are bigger because they appear more often. However, I know the true gender distribution of this population, so this wordcloud is biased.
The true gender distribution
true_gender_dist <- data.frame(gender = c("M", "F"), freq = nrow(data) * c(0.49,0.51))
With the survey package I can weight the data with the rake function
library(survey)
rake_data <- rake(design = svydesign(ids = ~1, data = data),
                  sample.margins = list(~gender),
                  population.margins = list(true_gender_dist))
In order to use the weights in analyses, visualizations, etc. (outside the survey package), I add the weights to the original data.
data_weighted <- cbind(data, data.frame(weights = weights(rake_data)))
So far so good. However, I would like to make a wordcloud that takes these weights into consideration.
My first attempt was to use the weights when making the Term Document Matrix:
text_corp <- VCorpus(VectorSource(text))
w_tdm <- TermDocumentMatrix(text_corp,
                            control = list(weighting = weights(rake_data)))
But then I get:
Error in .TermDocumentMatrix(m, weighting) : invalid weighting
Is this at all possible?
I can't comment yet, so I'll use an answer to comment on your question:
You might be interested in the R package stm (structural topic models). It makes it possible to infer latent topics while taking metadata (continuous and/or discrete covariates) into account.
You can generate different kinds of plots to check how the metavariables influence
a) the prevalence of the topics,
b) the preferred words within a topic,
c) and some more :)
Some links, if you're interested:
Paper describing the R package
R documentation
Some more Papers <-- this is a really good collection, if you want to dive into the subject some more!
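Coming back to the weighted wordcloud itself: the weighting element of TermDocumentMatrix's control list expects a term-weighting function (such as weightTfIdf), not a vector of case weights, which is why the call fails. One workaround, sketched here under the assumption that data_weighted from the question holds one word and one survey weight per respondent, is to sum the weights per term and pass those totals to wordcloud() as the frequencies:
# weighted term frequencies: sum of respondent weights per word
freq_weighted <- tapply(data_weighted$weights, data_weighted$words, sum)

wordcloud(words = names(freq_weighted),
          freq  = freq_weighted,
          min.freq = 1)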
