I am using the bibliometrix package in R and want to plot some useful graphs.
library(bibliometrix)
??bibliometrix
D<-readFiles("E:\\RE\\savedrecs.txt")
M <- convert2df(D,dbsource = "isi", format= "plaintext")
results <- biblioAnalysis(M ,sep = ";" )
S<- summary(object=results,k=10, pause=FALSE)
plot(x=results,k=10,pause=FALSE)
options(width=100)
S <- summary(object = results, k = 10, pause = FALSE)
NetMatrix <- biblioNetwork(M1, analysis = "co-occurrences", network = "author_keywords", sep = ";")
S <- normalizeSimilarity(NetMatrix, type = "association")
net <- networkPlot(S, n = 200, Title = "co-occurrence network",type="fruchterman", labelsize = 0.7, halo = FALSE, cluster = "walktrap",remove.isolates=FALSE, remove.multiple=FALSE, noloops=TRUE, weighted=TRUE)
res <- thematicMap(net, NetMatrix, S)
plot(res$map)
But the networkPlot() call above fails with this error:
Error in V<-(*tmp*, value = *vtmp*) : invalid indexing
Also, whenever I try to work with the CR field it always shows an error about unlistCR, and I cannot get the NetMatrix step to work either. Can someone help me, please?
The problem is in the data itself, not in the code you presented. When I downloaded the data from bibliometrix.com and changed M1 to M (a typo?) in the biblioNetwork() call, everything worked perfectly. Please see the code below:
library(bibliometrix)
# Plot bibliometric analysis results
D <- readFiles("http://www.bibliometrix.org/datasets/savedrecs.txt")
M <- convert2df(D, dbsource = "isi", format= "plaintext")
results <- biblioAnalysis(M, sep = ";")
S <- summary(results)
plot(x = results, k = 10, pause = FALSE)
# Plot Bibliographic Network
options(width = 100)
S <- summary(object = results, k = 10, pause = FALSE)
NetMatrix <- biblioNetwork(M, analysis = "co-occurrences", network = "author_keywords", sep = ";")
S <- normalizeSimilarity(NetMatrix, type = "association")
net <- networkPlot(S, n = 200, Title = "co-occurrence network", type = "fruchterman",
labelsize = 0.7, halo = FALSE, cluster = "walktrap",
remove.isolates = FALSE, remove.multiple = FALSE, noloops = TRUE, weighted = TRUE)
# Plot Thematic Map
res <- thematicMap(net, NetMatrix, S)
str(M)
plot(res$map)
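The question also mentions trouble with the CR (cited references) field. It may be worth confirming that the relevant fields survived the conversion before going further; a small check, assuming a standard WoS plaintext export where DE/ID hold the keywords and CR the cited references:
# confirm the fields used above and by citation analysis exist in M
head(M$DE)                              # author keywords (used for the co-occurrence network)
sum(is.na(M$CR))                        # cited references; many NAs point to an export problem
CRcount <- citations(M, field = "article", sep = ";")
head(CRcount$Cited)                     # most cited references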
I am trying to run the Monocle3 function find_gene_modules() on a cell_data_set (cds) but am getting a variety of errors. I have not had any other issues before this. I am working with an imported Seurat object. My first error stated that the number of rows was not the same between my cds and cds@preprocess_aux$gene_loadings values. I took a look and it seems my gene loadings were a list under cds@preprocess_aux@listData$gene_loadings. I then ran the following code to make a data frame version of the gene loadings:
test <- seurat@assays$RNA@counts@Dimnames[[1]]   # gene names from the Seurat counts matrix
test <- as.data.frame(test)
cds@preprocess_aux$gene_loadings <- test
rownames(cds@preprocess_aux$gene_loadings) <- cds@preprocess_aux$gene_loadings[, 1]
This created a cds@preprocess_aux$gene_loadings data frame with the same number of rows and row names as my cds. It resolved my original error but led to a new error being thrown from uwot:
15:34:02 UMAP embedding parameters a = 1.577 b = 0.8951
Error in uwot(X = X, n_neighbors = n_neighbors, n_components = n_components, :
No numeric columns found
Running traceback() produces the following information.
> traceback()
4: stop("No numeric columns found")
3: uwot(X = X, n_neighbors = n_neighbors, n_components = n_components,
metric = metric, n_epochs = n_epochs, alpha = learning_rate,
scale = scale, init = init, init_sdev = init_sdev, spread = spread,
min_dist = min_dist, set_op_mix_ratio = set_op_mix_ratio,
local_connectivity = local_connectivity, bandwidth = bandwidth,
gamma = repulsion_strength, negative_sample_rate = negative_sample_rate,
a = a, b = b, nn_method = nn_method, n_trees = n_trees, search_k = search_k,
method = "umap", approx_pow = approx_pow, n_threads = n_threads,
n_sgd_threads = n_sgd_threads, grain_size = grain_size, y = y,
target_n_neighbors = target_n_neighbors, target_weight = target_weight,
target_metric = target_metric, pca = pca, pca_center = pca_center,
pca_method = pca_method, pcg_rand = pcg_rand, fast_sgd = fast_sgd,
ret_model = ret_model || "model" %in% ret_extra, ret_nn = ret_nn ||
"nn" %in% ret_extra, ret_fgraph = "fgraph" %in% ret_extra,
batch = batch, opt_args = opt_args, epoch_callback = epoch_callback,
tmpdir = tempdir(), verbose = verbose)
2: uwot::umap(as.matrix(preprocess_mat), n_components = max_components,
metric = umap.metric, min_dist = umap.min_dist, n_neighbors = umap.n_neighbors,
fast_sgd = umap.fast_sgd, n_threads = cores, verbose = verbose,
nn_method = umap.nn_method, ...)
1: find_gene_modules(cds[pr_deg_ids, ], reduction_method = "UMAP",
max_components = 2, umap.metric = "cosine", umap.min_dist = 0.1,
umap.n_neighbors = 15L, umap.fast_sgd = FALSE, umap.nn_method = "annoy",
k = 20, leiden_iter = 1, partition_qval = 0.05, weight = FALSE,
resolution = 0.001, random_seed = 0L, cores = 1, verbose = T)
I really have no idea what I am doing wrong or how to proceed from here. Does anyone with experience with uwot know where my error is coming from? Really appreciate the help!
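For what it's worth, the object substituted in above is built from row names, so it contains only character data, and uwot::umap() needs numeric input; that may be exactly what "No numeric columns found" is pointing at. A diagnostic sketch, assuming the workaround above has been run:
str(cds@preprocess_aux$gene_loadings)                 # shows a single character column
sapply(cds@preprocess_aux$gene_loadings, is.numeric)  # expected to be all FALSE
# a numeric loadings matrix is normally (re)created by preprocess_cds(), e.g.
# cds <- preprocess_cds(cds, num_dim = 50)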
I'm using the raster package on an R server to process a large set (30000 files) of data files (10MB each).
For now, processing consists of parsing the data and subsequently rasterizing it via the rasterize function.
The data is very sparse (only along roads) but has a high resolution and large extent. I've seen temporary files of 30GB for a raster created from one of the input files.
Because of the number of files, I'm using a foreach() %dopar% approach to processing them, giving one file to each thread. I've set the raster options as follows:
rasterOptions(maxmemory = 15000000000)
rasterOptions(chunksize = 14000000000)
rasterOptions(todisk = TRUE)
This should come out to 15 GB/thread * 32 threads = 480 GB of RAM used at maximum for the rasters. Add some overhead and I would expect somewhere between 10 GB and 20 GB of the 512 GB RAM to remain.
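To spell out the arithmetic behind that expectation (assuming maxmemory is interpreted as bytes here, as in the settings above):
per_worker_cap <- 15e9                     # rasterOptions(maxmemory = ...) per worker
n_workers <- 32
per_worker_cap * n_workers / 1e9           # ~480 GB expected ceiling on a 512 GB machine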
R gobbles up RAM until only 100MB to 2GB remain and only then seems to release previously allocated memory, only to be fed straight back into R for the next raster. I checked the RAM usage repeatedly over several hours to observe this.
I'm using SpatialPolygonsDataFrames as input for rasterize, and suspected they might take up a lot of RAM as well. But when I checked their size, they were rather small, at about 100MB. Playing around with maxmemory, chunksize and only 16 threads also didn't seem to have any effect.
I also looked at the rasterize source code to see if I could find an explanation there, but that didn't get me far:
setMethod('rasterize', signature(x='SpatialPoints', y='Raster'),
function(x, y, field, fun='last', background=NA, mask=FALSE, update=FALSE, updateValue='all', filename="", na.rm=TRUE, ...){
.pointsToRaster(x, y, field=field, fun=fun, background=background, mask=mask, update=update, updateValue=updateValue, filename=filename, na.rm=na.rm, ...)
}
)
I have no clue where to find .pointsToRaster
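(For anyone else hunting for it: non-exported functions can be pulled up from the console with getAnywhere() or the ::: operator once the package is loaded, e.g.:)
getAnywhere(".pointsToRaster")     # shows the un-exported helper and the namespace it lives in
raster:::.pointsToRaster           # equivalent, via the raster namespace directly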
Does anyone have an explanation for this behaviour or ideas for things to check? Did I simply overlook something? I'd like not to use the entire RAM so that other users can still work on the server. From what I understand, my code should regulate how much RAM is used.
Here's the code I use:
library('iterators')
library('parallel')
library('foreach')
library('doParallel')
#init parallelisation
nCores = 32
cCluster = makeCluster(nCores, type = "FORK", outFile = "parseProcess")
registerDoParallel(cCluster)
foreach(j = 1:length(fileList)) %dopar% {
  #load all libraries for every thread
  library('sp')
  library('raster')
  library('spatial')
  library('gstat')
  library('rgdal')
  library('dismo')
  library('deldir')
  library('rgeos')
  library('sjmisc')

  #set raster options per thread
  rasterOptions(maxmemory = 15000000000)
  rasterOptions(chunksize = 14000000000)
  rasterOptions(todisk = TRUE)

  tmpFolder = paste0("[PATH TO STORAGE]/rtmp", j)
  dir.create(tmpFolder)
  rasterOptions(tmpdir = tmpFolder)

  #generate names for raster files
  fileName = basename(fileList[j])
  print(paste("Processing:", fileName))
  rNameMax0 = sub(pattern = ".bin", replacement = "_scan0_max.tif", fileName)
  #repeat this for all 11 scans

  rasterStorage = "[PATH TO OTHER STORAGE]" #path to raster folder
  scanList = parseFile(fileList[j]) #any memory allocated in this function should be released on function return

  #create template raster
  bounds = as.vector(t(bbox(scanList$scan0)))
  resolution = c(0.0000566, 0.0000359)
  tmp = raster(xmn = bounds[1], xmx = bounds[2], ymn = bounds[3], ymx = bounds[4], res = resolution)

  #create rasters from data
  coordinates(scanList$scan0) = ~Long+Lat
  proj4string(scanList$scan0) = WGS84CRS
  rScanMax0 = rasterize(scanList$scan0, tmp, fun = 'max', filename = paste0(rasterStorage, rNameMax0))
  rm('rScanMax0')
  #repeat for scans 1 to 4

  removeTmpFiles(h = 0.2)
  unlink(tmpFolder, recursive = TRUE, force = TRUE)
  dir.create(tmpFolder)
  rasterOptions(tmpdir = tmpFolder)

  coordinates(scanList$scan5) = ~Long+Lat
  proj4string(scanList$scan5) = WGS84CRS
  rScanMax5 = rasterize(scanList$scan5, tmp, fun = 'max', filename = paste0(rasterStorage, rNameMax5))
  rm('rScanMax5')
  #repeat for scans 6 to 10

  removeTmpFiles(h = 0.2)
  unlink(tmpFolder, recursive = TRUE, force = TRUE)
}
stopCluster(cCluster)
Here's the (gutted) code of the parseFile function:
parseFile = function(fileName){
  con = file(fileName, "rb")
  intSize = 4
  fileEndian = "little"

  #create data frames for each scan
  scan0 = data.frame(matrix(ncol = n1, nrow = 0))
  colnames(scan0) = c("Lat", "Long", ...)
  scan1 = data.frame(matrix(ncol = n2, nrow = 0))
  colnames(scan1) = c("Lat", "Long", ...)
  scan2 = data.frame(matrix(ncol = n3, nrow = 0))
  colnames(scan2) = c("Lat", "Long", ...)
  scan3 = data.frame(matrix(ncol = n4, nrow = 0))
  colnames(scan3) = c("Lat", "Long", ...)
  scan4 = data.frame(matrix(ncol = n5, nrow = 0))
  colnames(scan4) = c("Lat", "Long", ...)
  scan5 = data.frame(matrix(ncol = n6, nrow = 0))
  colnames(scan5) = c("Lat", "Long", ...)
  scan6 = data.frame(matrix(ncol = n7, nrow = 0))
  colnames(scan6) = c("Lat", "Long", ...)
  scan7 = data.frame(matrix(ncol = n8, nrow = 0))
  colnames(scan7) = c("Lat", "Long", ...)
  scan8 = data.frame(matrix(ncol = n9, nrow = 0))
  colnames(scan8) = c("Lat", "Long", ...)
  scan9 = data.frame(matrix(ncol = n10, nrow = 0))
  colnames(scan9) = c("Lat", "Long", ...)
  scan10 = data.frame(matrix(ncol = n11, nrow = 0))
  colnames(scan10) = c("Lat", "Long", ...)

  header = readBin(con, raw(), n = 36)
  i = 1
  while(i){
    blockHeader = readBin(con, integer(), n = 3, size = intSize, endian = fileEndian)
    if(...){ #check whether file ended
      break
    }
    i = i + 1
    #sort data to correct scan, assign GPS tag
    blockTrailer = readBin(con, raw(), n = 8)
  }

  #clean up
  close(con)

  #return parsed data
  returnList = list("scan0" = scan0, "scan1" = scan1, "scan2" = scan2, "scan3" = scan3, "scan4" = scan4,
                    "scan5" = scan5, "scan6" = scan6, "scan7" = scan7, "scan8" = scan8, "scan9" = scan9, "scan10" = scan10)
  return(returnList)
}
I'm also looking at the solutions posted here as another approach, but I'd still like to know why my code doesn't work as I expect it to.
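(A commonly suggested mitigation, just a sketch and not an explanation of the behaviour: drop the large per-iteration objects and trigger garbage collection explicitly at the end of each loop body, and return a small value so the master process does not collect anything big.)
# at the very end of each foreach iteration
rm(scanList, tmp)
gc()          # ask the forked worker to return freed memory sooner
NULL          # small return value, so foreach does not accumulate large results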
I'm using textmineR to fit an LDA model to documents, similar to https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html. Is it possible to get the topic label for each document in the data set?
library(textmineR)
data(nih_sample)

# create a document term matrix
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 doc_names = nih_sample$APPLICATION_ID,
                 ngram_window = c(1, 2),
                 stopword_vec = c(stopwords::stopwords("en"),
                                  stopwords::stopwords(source = "smart")),
                 lower = TRUE, remove_punctuation = TRUE, remove_numbers = TRUE,
                 verbose = FALSE, cpus = 2)

dtm <- dtm[, colSums(dtm) > 2]

set.seed(123)
model <- FitLdaModel(dtm = dtm, k = 20, iterations = 200, burnin = 180,
                     alpha = 0.1, beta = 0.05, optimize_alpha = TRUE,
                     calc_likelihood = TRUE, calc_coherence = TRUE,
                     calc_r2 = TRUE, cpus = 2)
Then I add the labels to the model:
model$labels <- LabelTopics(assignments = model$theta > 0.05, dtm = dtm, M = 1)
Now I want the topic labels for each of the 100 documents in nih_sample$ABSTRACT_TEXT.
Are you looking to label each document by the label of its most prevalent topic? If so, this is how you could do it:
# convert labels to a data frame so we can merge
label_df <- data.frame(topic = rownames(model$labels), label = model$labels, stringsAsFactors = FALSE)
# get the top topic for each document
top_topics <- apply(model$theta, 1, function(x) names(x)[which.max(x)][1])
# convert the top topics for each document so we can merge
top_topics <- data.frame(document = names(top_topics), top_topic = top_topics, stringsAsFactors = FALSE)
# merge together. Now each document has a label from its top topic
top_topics <- merge(top_topics, label_df, by.x = "top_topic", by.y = "topic", all.x = TRUE)
This kind of throws away some information that you'd get from LDA, though. One advantage of LDA is that each document can have more than one topic. Another is that we can see how much of each topic is in that document. You can see that for a single document like this:
# set the plot margins to see the labels on the bottom
par(mar = c(8.1,4.1,4.1,2.1))
# barplot the first document's topic distribution with labels
barplot(model$theta[1,], names.arg = model$labels, las = 2)
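If you want label text per document while keeping that multi-topic view, one option (a sketch; the 0.05 cutoff mirrors the threshold used for the assignments above) is to paste together the labels of every topic above the cutoff:
# collect the labels of all topics above a 5% share for each document
doc_topic_labels <- apply(model$theta, 1, function(x) {
  topics <- names(x)[x > 0.05]
  paste(model$labels[topics, 1], collapse = "; ")
})
head(doc_topic_labels)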
I am trying to produce a sankey diagram in R, which is also referred to as a river plot. I've seen the question Sankey Diagrams in R?, where a broad variety of packages producing sankey diagrams are listed. Since I have input data and know different tools/packages, I can produce such a diagram, BUT my question is: how can I prepare the input data for it?
Let's assume we would like to present how users have migrated between various states over 10 days, and have a starting data set like the one below:
data.frame(userID = 1:100,
day1_state = sample(letters[1:8], replace = TRUE, size = 100),
day2_state = sample(letters[1:8], replace = TRUE, size = 100),
day3_state = sample(letters[1:8], replace = TRUE, size = 100),
day4_state = sample(letters[1:8], replace = TRUE, size = 100),
day5_state = sample(letters[1:8], replace = TRUE, size = 100),
day6_state = sample(letters[1:8], replace = TRUE, size = 100),
day7_state = sample(letters[1:8], replace = TRUE, size = 100),
day8_state = sample(letters[1:8], replace = TRUE, size = 100),
day9_state = sample(letters[1:8], replace = TRUE, size = 100),
day10_state = sample(letters[1:8], replace = TRUE, size = 100)
) -> dt
Now, if one would like to create a sankey diagram with the networkD3 package, how should one transform this dt data.frame into the required input, so that we end up with input like in this example:
library(networkD3)
URL <- paste0(
"https://cdn.rawgit.com/christophergandrud/networkD3/",
"master/JSONdata/energy.json")
Energy <- jsonlite::fromJSON(URL)
# Plot
sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "TWh", fontSize = 12, nodeWidth = 30)
EDIT
I have found a script that prepares data for a similar situation and reproduced it, so I assume this question can be closed now:
https://github.com/mi2-warsaw/JakOniGlosowali/blob/master/sankey/sankey.R
Based on that script, the following code generates such a sankey diagram for the data.frame mentioned in the question:
fixtable <- function(...) {
  tab <- table(...)
  if (substr(colnames(tab)[1], 1, 1) == "_" &
      substr(rownames(tab)[1], 1, 1) == "_") {
    tab2 <- tab
    colnames(tab2) <- sapply(strsplit(colnames(tab2), split = " "), `[`, 1)
    rownames(tab2) <- sapply(strsplit(rownames(tab2), split = " "), `[`, 1)
    tab2[1, 1] <- 0
    # keep the seat within the same club (party)
    for (par in names(which(tab2[1, ] > 0))) {
      delta = min(tab2[par, 1], tab2[1, par])
      tab2[par, par] = tab2[par, par] + delta
      tab2[1, par] = tab2[1, par] - delta
      tab2[par, 1] = tab2[par, 1] - delta
    }
    # route the remainder through the independents ("niez.")
    for (par in names(which(tab2[1, ] > 0))) {
      tab2["niez.", par] = tab2["niez.", par] + tab2[1, par]
      tab2[1, par] = 0
    }
    for (par in names(which(tab2[, 1] > 0))) {
      tab2[par, "niez."] = tab2[par, "niez."] + tab2[par, 1]
      tab2[par, 1] = 0
    }
    tab[] <- tab2[]
  }
  tab
}
flow2 <- rbind(
  data.frame(fixtable(z = paste0(dt$day1_state, " day1"), do = paste0(dt$day2_state, " day2"))),
  data.frame(fixtable(z = paste0(dt$day2_state, " day2"), do = paste0(dt$day3_state, " day3"))),
  data.frame(fixtable(z = paste0(dt$day3_state, " day3"), do = paste0(dt$day4_state, " day4"))),
  data.frame(fixtable(z = paste0(dt$day4_state, " day4"), do = paste0(dt$day5_state, " day5"))),
  data.frame(fixtable(z = paste0(dt$day5_state, " day5"), do = paste0(dt$day6_state, " day6"))),
  data.frame(fixtable(z = paste0(dt$day6_state, " day6"), do = paste0(dt$day7_state, " day7"))),
  data.frame(fixtable(z = paste0(dt$day7_state, " day7"), do = paste0(dt$day8_state, " day8"))),
  data.frame(fixtable(z = paste0(dt$day8_state, " day8"), do = paste0(dt$day9_state, " day9"))),
  data.frame(fixtable(z = paste0(dt$day9_state, " day9"), do = paste0(dt$day10_state, " day10"))))
flow2 <- flow2[flow2[,3] > 0,]
nodes2 <- data.frame(name=unique(c(levels(factor(flow2[,1])), levels(factor(flow2[,2])))))
nam2 <- seq_along(nodes2[,1])-1
names(nam2) <- nodes2[,1]
links2 <- data.frame(source = nam2[as.character(flow2[,1])],
target = nam2[as.character(flow2[,2])],
value = flow2[,3])
sankeyNetwork(Links = links2, Nodes = nodes2,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontFamily = "Arial", fontSize = 12, nodeWidth = 40,
              colourScale = "d3.scale.category20()")
I asked a similar question a while ago, and I thought I'd post here how it can be done with tidyverse magic.
library(ggplot2)
library(ggalluvial)
library(tidyr)
library(dplyr)
library(stringr)
# The actual data preparation happens here
dt_new <- dt %>%
  gather(day, state, -userID) %>%                  # long format
  mutate(day = str_match(day, "[0-9]+")[, 1]) %>%  # extract the day number
  mutate(day = as.integer(day),                    # convert to proper data types
         state = as.factor(state))
Here is what the data dt_new looks like:
userID day state
1 1 1 d
2 2 1 d
3 3 1 g
4 4 1 a
5 5 1 a
6 6 1 d
7 7 1 d
8 8 1 b
9 9 1 d
10 10 1 e
...
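(As a side note, gather() has since been superseded; with tidyr >= 1.0 the same reshaping can be written with pivot_longer(), a sketch:)
dt_new <- dt %>%
  pivot_longer(-userID, names_to = "day", values_to = "state") %>%  # long format
  mutate(day = as.integer(str_match(day, "[0-9]+")[, 1]),           # extract the day number
         state = as.factor(state))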
Now plotting the Sankey plot:
ggplot(dt_new,
       aes(x = day, stratum = state, alluvium = userID, fill = state, label = state)) +
  geom_stratum() +
  geom_text(stat = "stratum") +
  geom_flow()
Here is the output
I'm trying to output some of my code results in knitr. The strange thing is that the code generates the error in the title, yet running round_any() separately and outputting it in knitr works fine.
knitr code
```{r, echo = FALSE, message=FALSE, warning=FALSE}
source("BooliQuery.R")
BooliQuery()
```
My code
library(digest)
library(stringi)
library(jsonlite)
library(plyr)
BooliQuery <- function(area = "stockholm", type = "lägenhet", sincesold = "", FUN = "",
                       limit = 250, offset = 0, mode = 1) {
  #raw data fetch + adjust
  lOriginal <- GETAPI(area, type, sincesold, FUN, limit, offset)
  lOriginal$AreaSize <- round_any(lOriginal$livingArea, 10, floor)
  lOriginal$PriceDiff <- lOriginal$soldPrice - lOriginal$listPrice

  #Create frame overview
  Overview.Return <- Frame.Overview(lOriginal)

  #Mode - return selector
  ifelse(mode == 1, return(Overview.Return), return(lOriginal))
}

Frame.Overview <- function(lOriginal) {
  #Aggregate mean
  listPrice <- aggregate(lOriginal, list(lOriginal$AreaSize), FUN = mean, na.rm = TRUE)
  colnames(listPrice)[1] <- "SegGroup"
  listPrice <- listPrice[, c("SegGroup", "listPrice", "soldPrice", "PriceDiff", "rent",
                             "livingArea", "constructionYear")]

  #Perform rounding
  listPrice[, c(2:5)] <- round(listPrice[, c(2:5)], digits = 0)
  listPrice[, 6] <- round(listPrice[, 6], digits = 1)
  listPrice[, 7] <- signif(listPrice[, 7], digits = 4)
  return(listPrice)
}

GETAPI <- function(area = "stockholm", type = "lägenhet", sincesold = "", FUN = "",
                   limit = 250, offset = 0) {
  #ID Info
  key <- "PRIVATE KEY"
  caller.ID <- "USERNAME"

  unix.timestamp <- as.integer(as.POSIXct(Sys.time()))
  random.string <- stri_rand_strings(n = 1, length = 16)

  #Sha1-Hash: CallerID + time + key + unique, 40-char hexadecimal
  hash.string <- paste0(caller.ID, unix.timestamp, key, random.string)
  hash.sha1 <- digest(hash.string, "sha1", serialize = FALSE)

  #Create URL
  api.string <- "https://api.booli.se/sold?q="
  url.string <- paste0(api.string, area, "&objectType=", type, "&minSoldDate=", sincesold, FUN,
                       "&limit=", limit, "&offset=", offset, "&callerId=", caller.ID,
                       "&time=", unix.timestamp, "&unique=", random.string, "&hash=", hash.sha1)

  #Parse JSON
  parsed.JSON <- fromJSON(txt = url.string)
  return(parsed.JSON$sold)
}
Running the code separately in the console works fine, so what could be wrong?
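One low-risk thing to try (a sketch, assuming the error is a failed lookup of round_any when knitr sources the file): qualify the call with its namespace so it does not depend on library() attach order inside the knitr session.
# in BooliQuery.R: round_any is exported by plyr, so a namespace-qualified
# call avoids any dependence on the search path when the chunk runs
lOriginal$AreaSize <- plyr::round_any(lOriginal$livingArea, 10, floor)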