K-means clustering in R - r

I'm a beginner in R and I followed this tutorial on K-means clustering. Now I'm trying to run the algorithm on real data, and I chose the catalog at http://exoplanet.eu/catalog/
I loaded the data like this:
d <- read.csv2(
"exoplanet.eu_catalog.csv",
header = TRUE,
sep = ","
)
With this code :
plot(
x = log(as.numeric(as.character(d$semi_major_axis))),
y = log(as.numeric(as.character(d$mass))),
xlab = "Star-exoplanet distance (log(AU))",
ylab = "Mass of exoplanets (log(M[Jupiter]))"
)
I have the following graphic :
I'd like to run the K-means clustering algorithm on this graphic to show three clusters with colors but I don't know how to proceed in R. I suppose I have to begin with :
y = log(as.numeric(as.character(d$mass)))
y <- y[!is.na(y)]
x = log(as.numeric(as.character(d$semi_major_axis)))
x <- x[!is.na(x)]
But I don't know how to format the data into a matrix in order to run kmeans(matrix, 3, nstart = 20). Any clue, please?

Since you read your file using
d <- read.csv2("exoplanet.eu_catalog.csv",
header = TRUE,
sep = ",")
your data is stored as a data frame, and you need to convert it to a matrix.
Use this code to convert a data frame into a matrix:
inMatrixForm <- data.matrix(d)
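Converting the whole data frame also pulls in every non-numeric column, and removing NA values from x and y separately (as in the question) breaks the pairing between the two vectors. A minimal sketch of one way around both issues is shown below. It assumes the catalog has the columns semi_major_axis and mass used in the question; the data frame d here is a synthetic stand-in so that the example runs on its own, and the names xy and km are only illustrative.

```r
# Synthetic stand-in for the catalog, so the sketch runs on its own.
# Assumption: the real `d` has columns `semi_major_axis` and `mass`.
set.seed(42)
d <- data.frame(
  semi_major_axis = as.character(c(rlnorm(150, -3, 0.5), rlnorm(150, 0, 0.7), NA)),
  mass            = as.character(c(rlnorm(150, -1, 0.8), rlnorm(150, 1, 1.0), 1.5))
)

# Convert both columns first, then drop incomplete rows together,
# so that each x stays paired with its own y.
xy <- cbind(
  x = log(as.numeric(as.character(d$semi_major_axis))),
  y = log(as.numeric(as.character(d$mass)))
)
xy <- xy[is.finite(xy[, "x"]) & is.finite(xy[, "y"]), ]

# kmeans() takes a numeric matrix with one row per observation.
km <- kmeans(xy, centers = 3, nstart = 20)

# Colour each point by its cluster and mark the cluster centres.
plot(
  xy,
  col = km$cluster,
  xlab = "Star-exoplanet distance (log(AU))",
  ylab = "Mass of exoplanets (log(M[Jupiter]))"
)
points(km$centers, pch = 8, cex = 2)
```

With the real catalog, the synthetic d would simply be replaced by the data frame returned by read.csv2.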

Related

Cumulative output format resulting in NAs when using dismo::predict function in R

I have run a maxent model using the dismo package in R. I am able to use the predict function to view the output format as "raw" or "cloglog" but when I try the "cumulative" format I get "NA" values and a blank output. In summary, this code works fine and produces coloured maps:
mxPred <- predict(object = mx, x = bioRastersClipBiolnoCorr, args=c("outputformat=raw"),
filename=paste0(filePath, '/maxent_predictionRAW.tif'), overwrite = TRUE)
plot(mxPred, col=rgb.tables(1000))
mxPredClog <- predict(object = mx, x = bioRastersClipBiolnoCorr, args=c("outputformat=cloglog"),
filename=paste0(filePath, '/maxent_predictionCLOG.tif'), overwrite = TRUE)
plot(mxPredClog, col=rgb.tables(1000))
This code produces no map, just NA values and a blank plot:
mxPredCumu <- predict(object = mx, x = bioRastersClipBiolnoCorr, args=c("outputformat=cumulative"),
filename=paste0(filePath, '/maxent_predictionCUMU.tif'), overwrite = TRUE)
plot(mxPredCumu, col=rgb.tables(1000))
Please help, thank you.
Only raw, cloglog, or logistic can be set as the output format. Choose raw, then convert the output to the cumulative format.
See details.
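The conversion itself is not spelled out above, so here is a rough sketch under a stated assumption: that the cumulative value of a cell is the percentage of the total raw mass held by cells whose raw value is less than or equal to its own. The function raw_to_cumulative is hypothetical (it is not part of dismo), it works on a plain numeric vector, and it renormalises over the cells it is given, which may differ from the background sample Maxent itself uses.

```r
# Hypothetical helper, not part of dismo: converts raw values to a
# cumulative-style score in (0, 100], keeping NA cells as NA.
raw_to_cumulative <- function(raw) {
  out <- rep(NA_real_, length(raw))
  ok <- !is.na(raw)
  p <- raw[ok] / sum(raw[ok])      # renormalise so the values sum to 1
  ord <- order(p)
  cum <- numeric(length(p))
  cum[ord] <- cumsum(p[ord])       # running total from lowest to highest
  cum <- ave(cum, p, FUN = max)    # tied cells share the same value
  out[ok] <- 100 * cum
  out
}

# Toy raw values with one missing cell.
raw <- c(0.1, 0.2, NA, 0.3, 0.4)
cumulative <- raw_to_cumulative(raw)
print(cumulative)
```

For a raster prediction, the same function could presumably be applied to the cell values and written back into a copy of the raster, but that step is untested here.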

How to label just one observation in hierarchical clustering tree with dendextend?

I'd like to create a hierarchical clustering tree of a relatively large dataset (>3000 obs). Unfortunately, by including so many labels at the terminal nodes, the tree looks very cluttered and contains lots of unnecessary information. So to reduce the clutter, I'd like to just label one observation of interest. I have removed all of the labels but I don't know how to retrieve and add the label that I'm interested in.
For this MWE, let's assume, I'd like to add the letter k to my dendrogram.
library(dendextend)
library(cluster)
library(tidyverse)
set.seed(1)
a <- rnorm(20)
b <- rnorm(20)
c <- rnorm(20)
df <- data.frame(a, b, c)
# label the 20 observations a to t, so that "k" is one of the leaves
rownames(df) <- letters[seq_len(nrow(df))]
my_dist <- dist(df)
my_clust <- hclust(my_dist)
my_dend <- as.dendrogram(my_clust)
plot(color_branches(my_dend, k = 3), leaflab = "none", horiz = T)
You can specify the labels with the set function. If you only want to show one, make the others the empty string.
LAB = rep("", nobs(my_dend))
LAB[15] = "N15"
my_dend = set(my_dend, "labels", LAB)
plot(color_branches(my_dend, k = 3), horiz = T)
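Note that the labels are stored in the order of the leaves in the tree, so position 15 is not necessarily observation 15. If the goal is to keep a label by name (the letter k from the question), a sketch that uses only base R could look like this. It assumes the observations carry the row names a to t, and blank_other_labels is a hypothetical helper name.

```r
# Rebuild the example data with row names a..t (assumption: these are
# the labels the question wants to see on the leaves).
set.seed(1)
df <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
rownames(df) <- letters[seq_len(nrow(df))]
my_dend <- as.dendrogram(hclust(dist(df)))

# Hypothetical helper: blank every leaf label except the ones in `keep`.
blank_other_labels <- function(node, keep) {
  if (is.leaf(node) && !(attr(node, "label") %in% keep)) {
    attr(node, "label") <- ""
  }
  node
}

# dendrapply() visits every node and passes `keep` through to the helper.
my_dend <- dendrapply(my_dend, blank_other_labels, keep = "k")
plot(my_dend, horiz = TRUE)
```

With dendextend loaded, the same effect should be reachable through set(my_dend, "labels", ifelse(labels(my_dend) == "k", "k", "")), applied to the original dendrogram.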

How to fix ‘Error in FUN(X[[i]], ...) : only defined on a data frame with all numeric variables’

I intend to draw a Q-Q plot of the data, but R tells me that the qqnorm function only works on numeric data.
As the factors include A, B, C, D and their two-, three- and four-way interactions, I have no idea how to convert them into numerical form.
The data is as follows:
Effects,Value
A,76.95
B,-67.52
C,-7.84
D,-18.73
AB,-51.32
AC,11.69
AD,9.78
BC,20.78
BD,14.74
CD,1.27
ABC,-2.82
ABD,-6.5
ACD,10.2
BCD,-7.98
ABCD,-6.25
My code is as follows:
library(readr)
data621 <- read_csv("Desktop/data621.csv")
data621_qq<-qqnorm(data621,xlab = "effects",datax = T)
qqline(data621,probs=c(0.3,0.7),datax = T)
text(data621_qq$x,data621_qq$y,names(data621),pos=4)
Your code would work if using the proper columns instead of the entire data frame. For example,
data621_qq <- qqnorm(data621$Value, xlab = "Effects", datax = TRUE)
qqline(data621$Value, probs = c(0.3, 0.7), datax = TRUE)
text(data621_qq$x, data621_qq$y, data621$Effects, pos=4)
By the way, names(data621) would give you the column names, instead of the effect names (which are stored as values in a column).

Random Graph Function in R

I have an assignment in which I have to write my own random graph function in R, with an igraph output. I've figured out that the easiest way to do this is to generate a square matrix and then build a function that creates edges between the nodes in the matrix. However, I'd like to do something special, where the edge probabilities make Sybil-like networks more likely to form. It would look like this:
My matrix is generated and visualised quite simply like this:
library(ggraph)
library(igraph)
NCols <- 20
NRows <- 20
myMat <-matrix(runif(NCols*NRows), ncol = NCols)
myMat
randomgraph <- graph_from_adjacency_matrix(myMatG, mode = "undirected", weighted = NULL, diag = TRUE, add.colnames = NULL, add.rownames = NA)
randomgraph %>%
ggraph() +
geom_node_point(colour = "firebrick4", size = 0.5, show.legend = F)
I know there are functions like Erdos-Renyi Random- (for a true random graph), Barabási-Albert Scale-Free- and Watts-Strogatz Small-World graphs. I'm trying to write my own with a unique twist.
Any advice or code snippets on how to write my own preferential attachment function for the random matrix would be greatly appreciated! Thank you!
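One possible starting point, offered only as a sketch: rather than preferential attachment, the function below uses a simple two-block model, in which every pair of nodes gets an edge with a probability that depends on whether the endpoints are "honest" or "sybil" nodes. The function name random_sybil_adjacency, the group sizes and the three probabilities are all made-up assumptions to adjust; only graph_from_adjacency_matrix comes from igraph.

```r
# Hypothetical generator: dense sybil cluster, sparser honest region,
# and only a few "attack" edges between the two groups.
random_sybil_adjacency <- function(n_honest = 15, n_sybil = 5,
                                   p_honest = 0.2, p_sybil = 0.8,
                                   p_attack = 0.02) {
  n <- n_honest + n_sybil
  group <- rep(c("honest", "sybil"), times = c(n_honest, n_sybil))

  # Edge probability for each pair, chosen by the groups of its endpoints.
  prob <- matrix(p_attack, nrow = n, ncol = n)
  prob[group == "honest", group == "honest"] <- p_honest
  prob[group == "sybil", group == "sybil"] <- p_sybil

  # Draw each undirected edge once (upper triangle), then mirror it.
  adj <- matrix(0L, nrow = n, ncol = n)
  upper <- upper.tri(adj)
  adj[upper] <- rbinom(sum(upper), size = 1, prob = prob[upper])
  adj + t(adj)
}

set.seed(123)
adj <- random_sybil_adjacency()

# Convert to an igraph object if igraph is available.
if (requireNamespace("igraph", quietly = TRUE)) {
  randomgraph <- igraph::graph_from_adjacency_matrix(adj, mode = "undirected")
}
```

Drawing only the upper triangle keeps the matrix symmetric with an empty diagonal, so the resulting graph is undirected and has no self-loops.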

Simulating a discrete distribution on a different scale in R

I'm new to R and have this question. As mentioned in the title, I have a distribution of reported dice number from students. In this task, they are given a dice with 6 faces (from 1-6) and are asked to throw it in private. The data are plotted as in the picture.
However, I wonder if it's possible that I can use this data to simulate the situation where they are given a dice with 10 faces instead (from 1-10)? How can I achieve this in R?
OK, second attempt, in case you want to use your existing six-sided die data. I use the sn package to fit a skew-normal distribution to your existing data, then scale it to represent a ten-sided die and make it discrete using round.
First I will simulate your data
set.seed(9999)
n=112
a = rnorm( 42, 3, 1 )
b = rnorm( 70, 5, 0.5 )
dat = round(c( a, b))
dat[!(dat %in% 1:6)] = NA
dat=dat[complete.cases(dat)]
hist(dat,breaks = seq(0.5, 6.5,1), col = rgb(0,0,1,0.25))
Just set dat as your existing data if you want.
Now parameterise the distribution using the sn package. (You can try fitting other distributions if you prefer.)
require(sn)
cp.est = sn.mple(y=dat,opt.method = "nlminb")$cp
dp.est = cp2dp(cp.est,family="SN")
##example to sample from the distribution and compare to existing
sim = rsn(n, xi=dp.est[1], omega=dp.est[2], alpha=dp.est[3])
sim = round(sim)
sim[!(sim %in% 1:6)] = NA
hist(sim,breaks = seq(0.5, 6.5,1), col = rgb(1,0,0,0.25), add=T)
Now scale the distribution to represent a ten-sided die.
sim = rsn(n, xi=dp.est[1], omega=dp.est[2], alpha=dp.est[3])/6*10
sim <- round(sim)
sim[!(sim %in% 1:10)] = NA
hist(sim,breaks = seq(0.5, 10.5,1), col = rgb(0,1,0,0.25))
To simulate 112 students rolling a ten-sided die and plotting the results in histogram:
n=112
res = sample(1:10, size = n, replace = T)
hist(res)
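If you would rather not fit a parametric distribution at all, a rough alternative is to stretch the observed rolls directly. The sketch below spreads each reported face k uniformly over the interval (k - 1, k), rescales (0, 6) to (0, 10) and rounds up, so every result lands on a face from 1 to 10. The vector dat here is a made-up stand-in for the reported rolls, and sim10 is just an illustrative name.

```r
set.seed(1)

# Stand-in for the 112 reported six-sided rolls (assumption: skewed
# towards high faces); replace this with the real data.
dat <- sample(1:6, size = 112, replace = TRUE, prob = c(1, 1, 2, 3, 5, 4))

# Spread each face k uniformly over (k - 1, k), giving values in (0, 6).
u <- dat - runif(length(dat))

# Stretch (0, 6) to (0, 10) and round up to get faces 1..10.
sim10 <- ceiling(u * 10 / 6)

hist(sim10, breaks = seq(0.5, 10.5, 1), col = rgb(0, 1, 0, 0.25))
```

This keeps the overall shape of the observed distribution, but it cannot say how people would actually report rolls of a ten-sided die.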
