Basic Plotting in "Modeling Techniques in Predictive Analytics" - r

I am trying to plot the x and y pairs as demonstrated below. Can someone provide me with the basic code to plot x1, y1? I've tried a number of things, including plot(x1,y1), and it's not recognizing these variables.
# The Anscombe Quartet in R
# demonstration data from
# Anscombe, F. J. 1973, February. Graphs in statistical analysis.
# The American Statistician 27: 17-21.
# define the anscombe data frame
anscombe <- data.frame(
x1 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
x2 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
x3 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
x4 = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
y1 = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26,10.84, 4.82, 5.68),
y2 = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74),
y3 = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
y4 = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89))
# show results from four regression analyses
with(anscombe, print(summary(lm(y1 ~ x1))))
with(anscombe, print(summary(lm(y2 ~ x2))))
with(anscombe, print(summary(lm(y3 ~ x3))))
with(anscombe, print(summary(lm(y4 ~ x4))))
# place four plots on one page using standard R graphics
# ensuring that all have the same scales
# for horizontal and vertical axes
pdf(file = "fig_more_anscombe.pdf", width = 8.5, height = 8.5)
par(mfrow=c(2,2),mar=c(3,3,3,1))
with(anscombe, plot(x1, y1, xlim=c(2,20),ylim=c(2,14),
pch = 19, col = "darkblue", cex = 2, las = 1))
title("Set I")
with(anscombe,plot(x2, y2, xlim=c(2,20),ylim=c(2,14),
pch = 19, col = "darkblue", cex = 2, las = 1))
title("Set II")
with(anscombe,plot(x3, y3, xlim=c(2,20),ylim=c(2,14),
pch = 19, col = "darkblue", cex = 2, las = 1))
title("Set III")
with(anscombe,plot(x4, y4, xlim=c(2,20),ylim=c(2,14),
pch = 19, col = "darkblue", cex = 2, las = 1))
title("Set IV")
dev.off()
par(mfrow=c(1,1),mar=c(5.1, 4.1, 4.1, 2.1)) # return to plotting defaults
# suggestions for the student
# see if you can develop a quartet of your own
# or perhaps just a duet...
# two very different data sets with the same fitted model

Note that the anscombe data set ships with R (in the datasets package) and does not have to be defined.
The code below sets up a 2x2 grid for plotting and then calculates the overall range for the x variables and, separately, for the y variables. Then, for i = 1, 2, 3, 4, it creates the ith formula and plots it using the calculated ranges. as.roman is used to get the Roman numeral portion of the title. Then we perform a linear regression. We could have just written fm <- lm(fo, anscombe) to calculate the regression, but had we done that, the print(summary(fm)) output would have shown the literal name fo as the formula, which is not very informative. Finally we plot the regression line using abline and print the summary.
Try this:
par(mfrow = c(2,2))
xrange <- range(anscombe[1:4])
yrange <- range(anscombe[5:8])
for(i in 1:4) {
fo <- as.formula( sprintf("y%d ~ x%d", i, i) )
plot(fo, anscombe, xlim = xrange, ylim = yrange, main = paste("Set", as.roman(i)))
fm <- do.call("lm", list(fo, quote(anscombe)))
abline(fm)
print( summary(fm) )
}
par(mfrow = c(1,1))
giving this plot (output from print(summary(...)) not shown):
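To see why the answer uses do.call rather than calling lm directly, here is a minimal sketch that compares the call stored in each fitted model. It assumes only the built-in anscombe data set; the names fm_plain and fm_call are introduced here purely for illustration.

```r
# Build the formula for the first pair, as in the loop above
fo <- as.formula(sprintf("y%d ~ x%d", 1, 1))

# Direct call: the stored call refers to the variable name `fo`
fm_plain <- lm(fo, anscombe)

# do.call evaluates `fo` first, so the stored call holds the formula itself
fm_call <- do.call("lm", list(fo, quote(anscombe)))

print(fm_plain$call)
print(fm_call$call)

# Both fits have the same coefficients; only the recorded call differs
all.equal(coef(fm_plain), coef(fm_call))
```

The first print should show fo in place of the formula, while the second should show y1 ~ x1, which is what then appears in the summary output.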

If all you want to do is plot x1 and y1, try:
plot(anscombe$x1,anscombe$y1)
or (from your code):
with(anscombe, plot(x1, y1, xlim=c(2,20),ylim=c(2,14),
pch = 19, col = "darkblue", cex = 2, las = 1))
Your above code is plotting them to a pdf file, starting at the line:
pdf(file = "fig_more_anscombe.pdf", width = 8.5, height = 8.5)
and not ending until you terminate the pdf at:
dev.off()
While the pdf device is open, every plot is written to the file, so nothing appears in the R plot window. If you have run the code multiple times, make sure no pdf devices are left open by running:
dev.off()
until you see:
Error in dev.off() : cannot shut down device 1 (the null device)
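As a rough sketch of the device handling described above, the following shows how to inspect and close open graphics devices without calling dev.off() repeatedly. The temporary file name tmp is an assumption made for this example so that nothing is written to the working directory.

```r
# Write to a temporary file so the sketch leaves nothing in the working directory
tmp <- tempfile(fileext = ".pdf")

pdf(file = tmp, width = 8.5, height = 8.5)  # from here on, plots go to the file
with(anscombe, plot(x1, y1, pch = 19, col = "darkblue"))

dev.list()      # lists the open devices; the pdf device appears here
graphics.off()  # closes every open device in one call
dev.list()      # NULL once nothing is open

file.exists(tmp)  # the pdf is only complete after its device was closed
```

After graphics.off(), a plain plot(anscombe$x1, anscombe$y1) in an interactive session should draw on screen again.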

Related

Cluster Analysis Visualisation: Colouring the Clusters by a categorical variable

Salut folks! I'm still quite new to ggplot and trying to understand, but I really need some help here.
Edit: Reproducible Data of my Dataset "Daten_ohne_Cluster_NA", first 25 rows
structure(list(ntaxa = c(2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 5, 8, 8, 7, 7, 6, 5, 5), mpd.obs.z = c(-1.779004391,
-1.721014957, -1.77727283, -1.774642404, -1.789386039, -1.983401439,
-0.875426386, -2.276052068, -2.340365105, -2.203126078, -2.394158227,
-2.278173635, -1.269075471, -1.176760985, -1.313045215, -1.164289676,
-1.247549961, -0.868174033, -2.057106804, -2.03154772, -1.691850922,
-1.224391713, -0.93993654, -0.39315089, -0.418380361), mntd.obs.z = c(-1.759874454,
-1.855202792, -1.866281778, -1.798439855, -1.739998395, -1.890847575,
-0.920672112, -1.381541177, -1.382847758, -1.394870597, -1.339878669,
-1.349541665, -0.516793786, -0.525476292, -0.557425575, -0.539534996,
-0.521299478, -0.638951825, -1.06467985, -1.033009266, -0.758380203,
-0.572401837, -0.166616844, 0.399510209, 0.314591018), pe = c(0.046370234,
0.046370234, 0.046370234, 0.046370234, 0.046370234, 0.046370234,
0.071665745, 0.118619482, 0.118619482, 0.118619482, 0.118619482,
0.118619482, 0.205838414, 0.205838414, 0.205838414, 0.205838414,
0.205838414, 0.179091659, 0.215719118, 0.215719118, 0.212092271,
0.315391478, 0.312205596, 0.305510773, 0.305510773), ECO_NUM = c(1,
6, 6, 1, 7, 6, 6, 6, 6, 6, 6, 7, 7, 6, 1, 6, 6, 6, 6, 6, 6, 7,
7, 7, 6)), row.names = c(NA, -25L), class = c("tbl_df", "tbl",
"data.frame"))
(1) I prepared my Dataframe like this:
Daten_Cluster <- Daten[, c("ntaxa", "mpd.obs.z", "mntd.obs.z", "pe", "ECO_NUM")]
(2) I threw out all the NAs with na.omit. It is 6 variables with 3811 objects each. The column ECO_NUM represents the different ecoregions as a categorical, numerically coded factor.
(3) Then I did a cluster analysis with kmeans. I used 31 groups as there are 31 ecoregions in my dataset, and the aim is to colour the plot by ecoregion later on.
Biomes_Clus <- kmeans(Daten_Cluster_ohne_NA, 31, iter.max = 10, nstart = 25)
(4) Then I followed the online instructions from datanovia.com on how to visualise a k-means cluster analysis (I always just follow these How-Tos as I have no idea how to do it all by myself). I tried to change the arguments accordingly to colour by ecoregion.
fviz_cluster(Biomes_Clus, data = Daten_Cluster_ohne_NA,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw(),
) +
stat_mean(aes(color = Daten_Cluster_ohne_NA$ECO_NUM), size = 4)
I get more than 50 warnings here, I guess one for each object, saying: In grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size) : unimplemented pch value '30'
I know that there are not enough pch-symbols for 31 groups, but I also don't need them - I just would like to have it coloured.
I also tried out the other function ggscatter and created my own color-palette (called P36) with more than 31 colours to have enough colours for the ecoregions.
ggscatter(
ind.coord, x = "Dim.1", y = "Dim.2",
color = "Species", palette = "P36", ellipse = TRUE, ellipse.type = "convex",
legend = "right", ggtheme = theme_bw(),
xlab = paste0("Dim 1 (", variance.percent[1], "% )" ),
ylab = paste0("Dim 2 (", variance.percent[2], "% )" )
) +
stat_mean(aes(color = cluster), size = 4)
The error here is that a discrete value was supplied to a continuous scale. The question is: how can I easily colour the outcome of my k-means analysis (which worked), not by the newly clustered groups but by the ecoregions (to visualise whether there is a difference between the clusters and the ecoregion groups)?
I appreciate your help, and my group partner and I would be very thankful!! :)
Greetings
Evelyn
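One possible way to approach this in base R, sketched under stated assumptions: the built-in iris data stands in for the ecoregion data (Species plays the role of ECO_NUM), and the observations are projected onto the first two principal components with prcomp rather than drawn through fviz_cluster. Converting the grouping column to a factor gives a discrete index into a colour palette, and the cross-table at the end compares the k-means clusters with the known groups.

```r
set.seed(1)  # kmeans uses random starts

dat    <- iris[, 1:4]           # numeric variables only (stand-in data)
region <- factor(iris$Species)  # known grouping, analogous to ECO_NUM

km <- kmeans(scale(dat), centers = nlevels(region), nstart = 25)
pc <- prcomp(dat, scale. = TRUE)  # 2-D projection used for plotting

pal <- rainbow(nlevels(region))   # one colour per known group
plot(pc$x[, 1], pc$x[, 2],
     col = pal[as.integer(region)],  # colour by known group, not by cluster
     pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(region), col = pal, pch = 19, bty = "n")

# Cross-tabulate clusters against the known groups
tab <- table(cluster = km$cluster, region = region)
print(tab)
```

With 31 ecoregions, the same pattern would need a palette of at least 31 colours; a single plotting symbol (pch = 19) avoids running out of pch values.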

Recreating a linear fit plot using R as could be plotted in Excel

I am not able to correctly recreate a linear fit plot in R as I could do in Excel. What am I doing wrong?
x <- data.frame(concn = c(0.25, 0.125, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000),
signal = c(0.0442, 0.0343, 0.0761, 0.144, 0.201, 0.579, 1.29, 2.09, 5.25, 10.9, 24, 55.6, 112))
fit = lm(x$concn ~ x$signal)
plot(x, pch = 16, type = "p", col = "blue" )
abline(fit)
Your order of y and x is incorrect: signal is your y and concn is your x, so it should be
fit = lm(x$signal ~ x$concn)
plot(x$signal ~ x$concn, pch = 16, type = "p", col = "blue" )
abline(fit)
> fit
Coefficients:
(Intercept)      x$concn
    0.05372      0.11198
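A slightly different way to write the same fit, shown here as a sketch, is to pass the data frame through the data argument instead of using x$ inside the formula. The coefficient is then named concn rather than x$concn, and predict can be used with new concentrations. The values 30 and 300 are made up for illustration.

```r
x <- data.frame(
  concn  = c(0.25, 0.125, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000),
  signal = c(0.0442, 0.0343, 0.0761, 0.144, 0.201, 0.579, 1.29, 2.09,
             5.25, 10.9, 24, 55.6, 112)
)

# signal is the response (y), concn the predictor (x)
fit <- lm(signal ~ concn, data = x)

plot(signal ~ concn, data = x, pch = 16, col = "blue")
abline(fit)  # fitted line drawn over the points

coef(fit)

# Predicted signal at hypothetical new concentrations
pred <- predict(fit, newdata = data.frame(concn = c(30, 300)))
pred
```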

parameters of histogram with R

First, I want the x-axis (abscissa) to show decimal numbers (for example 1.5, 2.6, ...), but when I draw the histogram with my code, the x-axis automatically shows whole numbers, as you can see in the following picture (I have circled in red what I would like to change).
How can I change the parameters so that these whole numbers are shown as decimals?
Secondly, I would like the numbers that appear on the x-axis to correspond exactly to my breaks vector.
Could someone please help me?
Here is my code:
my_data <- transform(my_data, new = as.numeric(new/1000000))
sal_hist_default = hist(my_data$new, breaks = c(1,6.3,11.6,16.9,22.2,27.5), col = "blue", border = "black", las = 1, include.lowest=TRUE,right=FALSE, main="Salary Of best category", xlab = "salaries", ylab = "num of players",xlim = c(1,27.5), ylim = c(0,600))
You should really provide sample data, but try this:
set.seed(42)
new <- rnorm(1000, 14, 3.5)
my_data <- data.frame(new)
sal_hist_default = hist(my_data$new, breaks = c(1, 6.3, 11.6, 16.9, 22.2, 27.5), col = "blue",
border = "black", las = 1, include.lowest=TRUE,right=FALSE, main="Salary Of best category",
xlab = "salaries", ylab = "num of players",xlim = c(1,27.5), ylim = c(0,600), xaxt="n")
axis(1, c(1, 6.3, 11.6, 16.9, 22.2, 27.5), c(1, 6.3, 11.6, 16.9, 22.2, 27.5))
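The idea in the answer is that xaxt = "n" suppresses the default axis, and axis() then draws tick marks exactly at the break points. A minimal variation, assuming the same simulated data, stores the breaks once in a vector (called brk here for illustration) and formats the labels with one decimal place. The filtering step is an added precaution, since hist() stops with an error when values fall outside the supplied breaks.

```r
set.seed(42)
brk <- c(1, 6.3, 11.6, 16.9, 22.2, 27.5)

new <- rnorm(1000, 14, 3.5)                    # simulated stand-in for the salaries
new <- new[new >= min(brk) & new <= max(brk)]  # keep only values covered by the breaks

h <- hist(new, breaks = brk, col = "blue", border = "black", las = 1,
          include.lowest = TRUE, right = FALSE,
          main = "Salary Of best category", xlab = "salaries",
          ylab = "num of players", xlim = range(brk),
          xaxt = "n")                          # suppress the default x-axis

# Draw ticks at the breaks, labelled with one decimal place
axis(1, at = brk, labels = formatC(brk, format = "f", digits = 1))
```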

How to automate positioning of inner labels within a stacked barplot?

I frequently have to produce stacked bar plots with labels. The way I've been coding the labels is very time intensive and I wondered if there was a way to code things more efficiently. I would like the labels to be centered on each section of the bars. I'd prefer base R solutions.
stemdata <- structure(list( #had to round some nums below for 100% bar
A = c(7, 17, 76),
B = c(14, 10, 76),
C = c( 14, 17, 69),
D = c( 4, 10, 86),
E = c( 7, 17, 76),
F = c(4, 10, 86)),
.Names = c("Food, travel, accommodations, and procedures",
"Travel itinerary and dates",
"Location of the STEM Tour stops",
"Interactions with presenters/guides",
"Duration of each STEM Tour stop",
"Overall quality of the STEM Tour"
),
class = "data.frame",
row.names = c(NA, -3L)) #4L=number of numbers in each letter vector#
# attach(stemdata)
print(stemdata)
par(mar=c(0, 19, 1, 2.1)) # this sets margins to allow long labels
barplot(as.matrix(stemdata),
beside = F, ylim = range(0, 10), xlim = range(0, 100),
horiz = T, col=colors, main="N=29",
border=F, las=1, xaxt='n', width = 1.03)
text(7, 2, "14%")
text(19, 2, "10%")
text(62, 2, "76%")
text(7, 3.2, "14%")
text(22.5, 3.2, "17%")
text(65.5, 3.2, "69%")
text(8, 4.4, "10%")
text(55, 4.4, "86%")
text(3.5, 5.6, "7%")
text(15, 5.6, "17%")
text(62, 5.6, "76%")
text(9, 6.9, "10%")
text(55, 6.9, "86%")
Staying with base R as the OP requested, we can easily automate the inner label positioning (i.e. the x coordinates) within a small function.
xFun <- function(x) x/2 + c(0, cumsum(x)[-length(x)])
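To make the arithmetic in xFun concrete, here is a small worked example using the first column of the data (7, 17, 76). Each label sits at the left edge of its segment plus half the segment's width; the variable name seg is introduced only for this illustration.

```r
xFun <- function(x) x/2 + c(0, cumsum(x)[-length(x)])

seg <- c(7, 17, 76)              # segment widths of one stacked bar

c(0, cumsum(seg)[-length(seg)])  # left edges of the segments: 0 7 24
seg / 2                          # half-widths: 3.5 8.5 38.0
xFun(seg)                        # label centres: 3.5 15.5 62.0
```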
Now, it's good to know that barplot invisibly returns the y coordinates of the bar midpoints; we can catch them by assignment (here byc <- barplot(...)).
Finally, just assemble the coordinates and labels in the data frame labs and "loop" through the text calls with sapply. (Use col="white" or col=0 for white labels, as requested in the other question.)
# barplot
colors <- c("gold", "orange", "red")
par(mar=c(2, 19, 4, 2) + 0.1) # expand margins
byc <- barplot(as.matrix(stemdata), horiz=TRUE, col=colors, main="N=29", # assign `byc`
border=FALSE, las=1, xaxt='n')
# labels
labs <- data.frame(x=as.vector(sapply(stemdata, xFun)), # apply `xFun` here
y=rep(byc, each=nrow(stemdata)), # use `byc` here
labels=as.vector(apply(stemdata, 1:2, paste0, "%")),
stringsAsFactors=FALSE)
invisible(sapply(seq(nrow(labs)), function(x) # `invisible` prevents unneeded console output
text(x=labs[x, 1:2], labels=labs[x, 3], cex=.9, font=2, col=0)))
# legend (set `xpd=TRUE` to plot beyond margins!)
legend(-55, 8.5, legend=c("Medium","High", "Very High"), col=colors, pch=15, xpd=TRUE)
par(mar=c(5, 4, 4, 2) + 0.1) # finally better reset par to default
Result
Data
stemdata <- structure(list(`Food, travel, accommodations, and procedures` = c(7,
17, 76), `Travel itinerary and dates` = c(14, 10, 76), `Location of the STEM Tour stops` = c(14,
17, 69), `Interactions with presenters/guides` = c(4, 10, 86),
`Duration of each STEM Tour stop` = c(7, 17, 76), `Overall quality of the STEM Tour` = c(4,
10, 86)), class = "data.frame", row.names = c(NA, -3L))
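As a variation on the base R answer, text() is vectorized over its coordinates and labels, so the sapply loop can be replaced by a single call. The sketch below assumes a cut-down version of the data with three short column names (A, B, C), chosen here only to keep the example compact.

```r
# Reduced stand-in for stemdata: three bars, three segments each
stemdata <- data.frame(A = c(7, 17, 76), B = c(14, 10, 76), C = c(14, 17, 69))
m <- as.matrix(stemdata)

xFun <- function(x) x/2 + c(0, cumsum(x)[-length(x)])

# barplot returns the midpoint of each bar (y coordinates when horiz = TRUE)
byc <- barplot(m, horiz = TRUE, col = c("gold", "orange", "red"),
               border = FALSE, las = 1, xaxt = "n")

xs <- apply(m, 2, xFun)          # one column of label centres per bar
ys <- rep(byc, each = nrow(m))   # repeat each bar's midpoint for its segments

# One vectorized call places all labels
text(x = as.vector(xs), y = ys, labels = paste0(as.vector(m), "%"),
     cex = 0.9, font = 2)
```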
Would you consider a tidyverse solution?
library(tidyverse) # for dplyr, tidyr, tibble & ggplot2
stemdata %>%
rownames_to_column(var = "id") %>%
gather(Var, Val, -id) %>%
group_by(Var) %>%
mutate(id = factor(id, levels = 3:1)) %>%
ggplot(aes(Var, Val)) +
geom_col(aes(fill = id)) +
coord_flip() +
geom_text(aes(label = paste0(Val, "%")),
position = position_stack(0.5))
Result:

Grouping iGraph Vertices in a weighted network by color/subgroup in R

I am struggling to group my network by the subgroups. I currently have the following network:
Current Network
I have assigned the subgroups to it, and I would like to plot all of the subgroups clustered together, to get a graph that looks like this:
Goal
Most algorithms seem to cluster based on the weights in the graph, but I want to cluster based on the node colors/labelled subgroups. This is the code I have now for this network:
#Graph with Weighted matrix
g_weighted<-graph.adjacency(WeightedMatrix, mode="undirected", weighted = TRUE)
#Make nodes different colors based on different classes
numberofclasses<-length(table(ConnectedVertexColor))
V(g_weighted)$color=ConnectedVertexColor
Node_Colors <- rainbow(numberofclasses, alpha=0.5)
for(i in 1:numberofclasses){
V(g_weighted)$color=gsub(unique(ConnectedVertexColor)[i],Node_Colors[i],V(g_weighted)$color)
}
#Plot with iGraph
plot.igraph(g_weighted,
edge.width=500*E(g_weighted)$weight,
vertex.size=15,
layout=layout.fruchterman.reingold, ##LAYOUT BY CLASS
title="Weighted Network",
edge.color=ifelse(WeightedMatrix > 0, "palegreen4","red4")
)
legend(x=-1.5, y=-1.1, c(unique(ConnectedVertexColor)), pch = 19, col=Node_Colors, bty="n")
ConnectedVertexColor is a vector that contains information about whether the node is a lipid, Nucleotide, Carb or AA. I have tried the command V(g_weighted)$community<-ConnectedVertexColor but I cannot get this to transfer into useful information for iGraph.
Thanks in advance for any advice.
Since you do not provide data, I am making a guess based on your "Current Network" picture. Of course, what you need is a layout for the graph. Below I provide two functions to create layouts that might meet your needs.
First, some data that looks a bit like yours.
EL = structure(c(1, 5, 4, 2, 7, 4, 7, 6, 6, 2, 9, 6, 3, 10,
7, 8, 3, 9, 8, 5, 3, 4, 10, 13, 12, 12, 13, 12, 13, 15, 15,
11, 11, 14, 14, 11, 11, 11, 15, 15, 11, 11, 13, 13, 11, 13),
.Dim = c(23L, 2L))
g2 = graph_from_edgelist(EL, directed = FALSE)
Groups = c(rep(1, 10), 2,2,3,3,3)
plot(g2, vertex.color=rainbow(3)[Groups])
First Layout
GroupByVertex01 = function(Groups, spacing = 5) {
Position = (order(Groups) + spacing*Groups)
Angle = Position * 2 * pi / max(Position)
matrix(c(cos(Angle), sin(Angle)), ncol=2)
}
GBV1 = GroupByVertex01(Groups)
plot(g2, vertex.color=rainbow(3)[Groups], layout=GBV1)
Second Layout
GroupByVertex02 = function(Groups) {
numGroups = length(unique(Groups))
GAngle = (1:numGroups) * 2 * pi / numGroups
Centers = matrix(c(cos(GAngle), sin(GAngle)), ncol=2)
x = y = c()
for(i in 1:numGroups) {
curGroup = which(Groups == unique(Groups)[i])
VAngle = (1:length(curGroup)) * 2 * pi / length(curGroup)
x = c(x, Centers[i,1] + cos(VAngle) / numGroups )
y = c(y, Centers[i,2] + sin(VAngle) / numGroups)
}
matrix(c(x, y), ncol=2)
}
GBV2 = GroupByVertex02(Groups)
plot(g2, vertex.color=rainbow(3)[Groups], layout=GBV2)
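The second layout function builds its coordinates group by group, which lines up with the vertex order when Groups is sorted, as it is in the example data. For group labels in arbitrary order, one possible adjustment is to write each vertex's coordinates by index. The sketch below is an assumption-laden variant, not part of the original answer: the function name GroupByVertexIndexed and the small Groups vector are hypothetical, and base R plotting is used so that it runs without igraph. The resulting matrix has one row per vertex and could be passed as layout= in the plot call above.

```r
GroupByVertexIndexed <- function(Groups) {  # hypothetical helper name
  ug <- unique(Groups)
  numGroups <- length(ug)
  GAngle <- seq_len(numGroups) * 2 * pi / numGroups  # angle of each group centre
  L <- matrix(NA_real_, nrow = length(Groups), ncol = 2)
  for (i in seq_len(numGroups)) {
    idx <- which(Groups == ug[i])                    # vertices in this group
    VAngle <- seq_along(idx) * 2 * pi / length(idx)  # spread them on a small circle
    L[idx, 1] <- cos(GAngle[i]) + cos(VAngle) / numGroups
    L[idx, 2] <- sin(GAngle[i]) + sin(VAngle) / numGroups
  }
  L  # row k holds the coordinates of vertex k
}

Groups <- c(1, 2, 1, 3, 2, 1)  # unsorted group labels (made-up example)
L <- GroupByVertexIndexed(Groups)
plot(L, col = rainbow(3)[Groups], pch = 19, asp = 1, xlab = "", ylab = "")
```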
