I have a matrix made up of 20 resampled data sets, like this (the original data has 30 observations):
## dummy data
dat <- rnorm(30, 1, 0.5)
## generate 20 sampled data
resamples <- lapply(1:20, function(i) sample(dat, replace = TRUE))
## create matrix combining all sampled data together
mat <- t(do.call(rbind, resamples))
I want to draw a dot plot showing how the 30 observations change across the 20 resampled data sets. The matplot function seems to work, but it displays digits and letters instead of points in the figure:
## draw plot
matplot(mat, type = "p", ylab = " ")
Does anyone know how to fix this problem? And how can I make the x-axis range from 1 to 30 in steps of 1? (I tried xlim, but it did not work.)
Thanks!!
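One possible fix, as a minimal sketch: by default matplot uses the digits 1 to 9 and 0, then letters, as the plotting character for successive columns, so passing pch explicitly should give ordinary points. xlim only changes the axis range, not where the ticks go, so the sketch suppresses the default axis with xaxt = "n" and draws its own with axis(). The seed, the black colour, and the axis label are assumptions added for illustration.

```r
set.seed(1)                                  # assumed seed, for reproducibility only
dat <- rnorm(30, 1, 0.5)

# 30 x 20 matrix: one column per resample
mat <- sapply(1:20, function(i) sample(dat, replace = TRUE))

# pch = 16 replaces the default digit/letter symbols with filled circles;
# xaxt = "n" suppresses the default x-axis so a custom one can be drawn
matplot(mat, type = "p", pch = 16, col = "black",
        xaxt = "n", xlab = "Observation", ylab = "")

# one tick per observation; a smaller cex.axis helps the labels fit
axis(1, at = 1:30, cex.axis = 0.6)
```

If some tick labels are still dropped because they overlap, las = 2 in the axis() call rotates them.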
Related
I have a data set of 1000 rows and 100 columns, with numbers ordered from smallest to largest left to right (these are all dates, or years in which something has happened). I want to create a scatter plot of this numeric data with each row plotted against an ordinal index of the numbers 1-100 in ascending order. So for example the dataframe is:
[1] [2] [3] [4] ... [100]
[1] 202 216 398 401 ... 2000
[2] 203 243 284 350 ... 1998
[3] 211 269 299 321 ... 2000
...
[1000] 200 247 273 300 ... 1999
I'd like to index each point in every row by 1-100, so essentially plot all rows by the numbers 1-100. Is there an easy way to do this? I'm new and self-taught in R. I've tried it with ggplot and I've also tried to convert the data frame to a matrix and use matplot, but can't quite get it right. I'm shooting for the numbers 1-100 on the y axis, and the numbers 1-2000 on the x.
Here's an example of the graph I am trying to replicate, which I created in Excel (with only 250 series).
I understand this will be quite the messy graph, but I am replicating someone else's agent based model and want to compare my graph and results with their published data.
R almost always thinks about data in columns, not rows, and for ggplot you would want long-format not wide-format data.
Let's get some sample input:
nr = 1000
nc = 100
set.seed(47)
m = matrix(sample(1:2000, size = nr * nc, replace = TRUE), ncol = nc)
# base
plot(x = c(1,2000), y = c(1,100), type = "n")
for(i in 1:nr) points(m[i, ], 1:100, cex = 0.1, pch = 20)
# ggplot
library(ggplot2)
# get data in long format
d = data.frame(x = c(t(m)), y = rep(1:100, nr))
ggplot(d, aes(x = x, y = y)) +
  geom_point(shape = '.', alpha = 0.1)
These both look pretty bad since the fake data is just uniformly distributed, but it should give you the right idea.
Here's a solution with 2 lines of plotting code. The first creates an empty plot with the specified axis limits. The second plots one row of your data matrix at a time. This might not be the most elegant solution, but it will run fast enough given the size of the data:
# generate fake data matching your example
mat <- matrix(NA, nrow=1000, ncol=100)
for(r in 1:1000) mat[r, ] <- sort(sample(0:2000, 100))
# create empty plot
plot(x=NA, y=NA, xlim=c(0,2000), ylim=c(0,100), xlab="", ylab="")
# plot your data
for(r in 1:1000) points(x=mat[r,], y=1:100, pch=20)
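For comparison, here is a loop-free sketch of the same plot. It flattens the matrix row by row and pairs every value with its within-row index, so a single plot() call draws all the points. The seed and the small cex are assumptions chosen for illustration; the fake data is built the same way as above.

```r
set.seed(47)                                    # assumed seed
# 1000 x 100 matrix, each row sorted in ascending order
mat <- t(replicate(1000, sort(sample(0:2000, 100))))

# t(mat) is 100 x 1000, so as.vector() reads mat one row at a time
x <- as.vector(t(mat))
# ordinal index 1..100, repeated once per row of mat
y <- rep(seq_len(ncol(mat)), times = nrow(mat))

plot(x, y, pch = 20, cex = 0.1,
     xlim = c(0, 2000), ylim = c(0, 100), xlab = "", ylab = "")
```

The same x and y vectors are also the long format that ggplot expects, e.g. data.frame(x = x, y = y).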
I am working with ggplot to plot bivariate data in groups along with standard ellipses of these data using a separate set of tools. These return n=100 x,y coordinates that define each ellipse, and then for each group, I would like to plot about 10-25 ellipses.
Conceptually, how can this be achieved? I can plot a single ellipse easily using geom_polygon, but I am confused how to get the data organized to make it work so multiple ellipses are plotted and guides (color, fills, linetypes, etc.) are applied per group.
In the traditional R plotting, I could just keep adding lines using a for loop.
Thanks!
UPDATE: Here is a CSV containing 100 coordinates for a single ellipse.
Data
Let's say I have three groups of bivariate data to which the ellipse fitting has been applied: Green, Red, Blue. For each group, I'd like to plot several ellipses.
I don't know how I would organize the data so that it works in the long format preferred by ggplot and preserves the group affiliations. Would a list work?
UPDATE2:
Here is a csv of raw x and y data organized into two groups: river and lake
Data
The data plot like this:
library(ggplot2)
test.data <- read.csv("ellipse_test_data.csv")
ggplot(test.data) +
  geom_point(aes(x, y, color = group)) +
  theme_classic()
I am using a package called SIBER, which fits Bayesian ellipses to the data so that groups can be compared by ellipse area, etc. The code below returns a list with one element per group of data. Each element is an n x 6 matrix (n = number of posterior draws) with one row per fitted ellipse: the first four columns are the covariance matrix Sigma in vector format and the last two are the bivariate means:
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^5 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 100 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.test <- siberMVN(siber.test, parms, priors)
First few rows of the first element in the list:
$`1.river`
Sigma2[1,1] Sigma2[2,1] Sigma2[1,2] Sigma2[2,2] mu[1] mu[2]
[1,] 1.2882740 2.407070e-01 2.407070e-01 1.922637 -15.52846 12.14774
[2,] 1.0677979 -3.997169e-02 -3.997169e-02 2.448872 -15.49182 12.37709
[3,] 1.1440816 7.257331e-01 7.257331e-01 4.040416 -15.30151 12.14947
I would like to be able to extract a random number of these ellipses and plot them with ggplot using alpha transparency.
The package SIBER has a function (addEllipse) to convert these six-value draws to a set number of x and y points that define an ellipse, but I don't know how to organize that output for ggplot. I thought there might be an elegant way to do this all internally with ggplot.
The ideal output would be something like this, but in ggplot so the ellipses could match the aesthetics of the levels of data:
Here is some code to do this on the bundled demo dataset from SIBER.
In this example we try to create some plots of the multiple samples of the posterior ellipses using ggplot2.
library(SIBER)
library(ggplot2)
library(dplyr)
library(ellipse)
Fit a basic SIBER model to the example data bundled with the package.
# load in the included demonstration dataset
data("demo.siber.data")
#
# create the siber object
siber.example <- createSiberObject(demo.siber.data)
# Calculate summary statistics for each group: TA, SEA and SEAc
group.ML <- groupMetricsML(siber.example)
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^4 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 10 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.posterior <- siberMVN(siber.example, parms, priors)
# The posterior estimates of the ellipses for each group can be used to
# calculate the SEA.B for each group.
SEA.B <- siberEllipses(ellipses.posterior)
siberDensityPlot(SEA.B, xticklabels = colnames(group.ML),
xlab = c("Community | Group"),
ylab = expression("Standard Ellipse Area " ('\u2030' ^2) ),
bty = "L",
las = 1,
main = "SIBER ellipses on each group"
)
Now we want to create some plots of some sample ellipses from these distributions. We need to create a data.frame object of all the ellipses for each group. In this example we simply take the first 10 posterior draws, assuming them to be independent of one another, but you could take a random sample if you prefer.
# how many of the posterior draws do you want?
n.posts <- 10
# decide how big an ellipse you want to draw
p.ell <- 0.95
# for a standard ellipse use
# p.ell <- pchisq(1,2)
# a list to store the results
all_ellipses <- list()
# loop over groups
for (i in 1:length(ellipses.posterior)){
  # a dummy variable to build in the loop
  ell <- NULL
  post.id <- NULL
  for ( j in 1:n.posts){
    # covariance matrix
    Sigma <- matrix(ellipses.posterior[[i]][j,1:4], 2, 2)
    # mean
    mu <- ellipses.posterior[[i]][j,5:6]
    # ellipse points
    out <- ellipse::ellipse(Sigma, centre = mu , level = p.ell)
    ell <- rbind(ell, out)
    post.id <- c(post.id, rep(j, nrow(out)))
  }
  ell <- as.data.frame(ell)
  ell$rep <- post.id
  all_ellipses[[i]] <- ell
}
ellipse_df <- bind_rows(all_ellipses, .id = "id")
# now we need the group and community names
# extract them from the ellipses.posterior list
group_comm_names <- names(ellipses.posterior)[as.numeric(ellipse_df$id)]
# split them and convert to a matrix, NB byrow = T
split_group_comm <- matrix(unlist(strsplit(group_comm_names, "[.]")),
nrow(ellipse_df), 2, byrow = TRUE)
ellipse_df$community <- split_group_comm[,1]
ellipse_df$group <- split_group_comm[,2]
ellipse_df <- dplyr::rename(ellipse_df, iso1 = x, iso2 = y)
Now to create the plots. First plot all the raw data as we want.
first.plot <- ggplot(data = demo.siber.data, aes(iso1, iso2)) +
geom_point(aes(color = factor(group):factor(community)), size = 2)+
ylab(expression(paste(delta^{15}, "N (\u2030)")))+
xlab(expression(paste(delta^{13}, "C (\u2030)"))) +
theme(text = element_text(size=15))
print(first.plot)
Now we can try to add the posterior ellipses on top and facet by group
second.plot <- first.plot + facet_wrap(~factor(group):factor(community))
print(second.plot)
# rename columns of ellipse_df to match the aesthetics
third.plot <- second.plot +
geom_polygon(data = ellipse_df,
mapping = aes(iso1, iso2,
group = rep,
color = factor(group):factor(community),
fill = NULL),
fill = NA,
alpha = 0.2)
print(third.plot)
Facet-wrapped plot of sample of posterior ellipses by group
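The long-format idea behind the code above can also be sketched without SIBER or the ellipse package. In this sketch every ellipse outline is a block of rows carrying its group label and draw number, and ggplot is told to treat each group/draw combination as one path. ellipse_points() is a hypothetical helper written for this example, and the covariance and mean values are made up; they only stand in for posterior draws.

```r
# Hypothetical helper: outline of the ellipse covering `level` of a
# bivariate normal with covariance Sigma and mean mu
ellipse_points <- function(Sigma, mu, level = 0.95, n = 100) {
  theta  <- seq(0, 2 * pi, length.out = n)
  circle <- rbind(cos(theta), sin(theta))      # unit circle, 2 x n
  r      <- sqrt(qchisq(level, df = 2))        # radius for the requested coverage
  xy     <- t(mu + r * t(chol(Sigma)) %*% circle)
  data.frame(x = xy[, 1], y = xy[, 2])
}

# Made-up draws: one row per ellipse, tagged with group and draw number
draws <- data.frame(
  grp  = rep(c("river", "lake"), each = 3),
  draw = rep(1:3, times = 2),
  s11  = c(1.3, 1.1, 1.2, 0.8, 0.9, 1.0),
  s12  = c(0.2, 0.0, 0.5, -0.3, -0.2, -0.1),
  s22  = c(1.9, 2.4, 3.0, 1.0, 1.2, 0.9),
  mu1  = c(-15.5, -15.4, -15.3, -12.0, -12.1, -11.9),
  mu2  = c(12.1, 12.3, 12.2, 9.0, 9.2, 8.9)
)

# Long format: 100 outline points per ellipse, stacked into one data frame
ell_df <- do.call(rbind, lapply(seq_len(nrow(draws)), function(k) {
  d     <- draws[k, ]
  Sigma <- matrix(c(d$s11, d$s12, d$s12, d$s22), 2, 2)
  cbind(ellipse_points(Sigma, c(d$mu1, d$mu2)), grp = d$grp, draw = d$draw)
}))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # group = one path per ellipse; colour = shared legend per group
  p <- ggplot(ell_df, aes(x, y, group = interaction(grp, draw), colour = grp)) +
    geom_path(alpha = 0.5)
  print(p)
}
```

The key design choice is that the group aesthetic identifies individual ellipses while colour identifies the data group, so the legend has one entry per group rather than one per ellipse.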
After generating a simulated data set with 20 observations in each of three classes (i.e., 60 observations total), and 50 variables, I need to plot the first two principal component score vectors, using a different color to indicate the three unique classes.
I believe I can create the simulated data set (please verify), but I am having issues figuring out how to color the classes and plot. I need to make sure the three classes appear separated in the plot (or else I need to re-run the simulated data).
#for the response variable y (60 values - 3 classes 1,2,3 - 20 observations per class)
y <- rep(c(1,2,3),20)
#matrix of 50 variables i.e. 50 columns and 60 rows i.e. 60x50 dimensions (=3000 table cells)
x <- matrix( rnorm(3000), ncol=50)
xymatrix <- cbind(y,x)
dim(x)
[1] 60 50
dim(xymatrix)
[1] 60 51
pca=prcomp(xymatrix, scale=TRUE)
How should I correctly plot and color this principal component analysis as noted above? Thank you.
If I understand your question correctly, ggparcoord in the GGally package would help you.
library(GGally)
y <- rep(c(1,2,3), 20)
# matrix of 50 variables i.e. 50 columns and 60 rows
# i.e. 60x50 dimensions (=3000 table cells)
x <- matrix(rnorm(3000), ncol=50)
xymatrix <- cbind(y,x)
pca <- prcomp(xymatrix, scale=TRUE)
# Principal components score and group label 'y'
pc_label <- data.frame(pca$x, y=as.factor(y))
# Plot the first two principal component scores of each samples
ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
However, I think it makes more sense to do PCA on x rather than on xymatrix, which includes the target y. So the following code should be more appropriate in your case.
pca <- prcomp(x, scale=TRUE)
pc_label <- data.frame(pca$x, y=as.factor(y))
ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
If you want a scatter plot of first two principal component scores, you can do it using ggplot.
library(ggplot2)
ggplot(data=pc_label) +
geom_point(aes(x=PC1, y=PC2, colour=y))
Here's a base R solution, to show how simply this can be done. First do the PCA on the x matrix only and from the resulting object get a matrix of the transformed variables which we'll call PCs.
x <- matrix(rnorm(3000), ncol=50)
pca <- prcomp(x, scale=TRUE)
PCs <- as.matrix(pca$x)
Now we can make vector of colour names based on your y for the labels.
col.labs <- rep(c("Green", "Blue", "Red"), 20)
Now just plot as a scatter, passing the colour vector to col.
plot(PCs[, 1], PCs[, 2], col=col.labs, pch=19, xlab = "Scores on PC1", ylab="Scores on PC2")
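On the part of the question about the classes appearing separated: with x drawn as pure noise, the three classes have identical distributions, so the scores would not be expected to separate. A minimal sketch of one way to simulate separable classes is to shift the mean of a few variables per class before running the PCA. The shift size, the choice of shifted columns, the seed, and the grouping of y in blocks of 20 (rather than interleaved) are all assumptions made for illustration.

```r
set.seed(1)                                   # assumed seed
y <- rep(1:3, each = 20)                      # 20 observations per class
x <- matrix(rnorm(60 * 50), ncol = 50)        # 60 x 50 noise matrix

shift <- 2                                    # hypothetical mean shift
x[y == 2, 1:10]  <- x[y == 2, 1:10]  + shift  # class 2 differs in variables 1-10
x[y == 3, 11:20] <- x[y == 3, 11:20] + shift  # class 3 differs in variables 11-20

pca <- prcomp(x, scale. = TRUE)               # PCA on the predictors only

class.cols <- c("green", "blue", "red")
plot(pca$x[, 1], pca$x[, 2], col = class.cols[y], pch = 19,
     xlab = "Scores on PC1", ylab = "Scores on PC2")
legend("topright", legend = paste("Class", 1:3), col = class.cols, pch = 19)
```

How clearly the classes separate depends on the size of the shift and on how many variables carry it, so it is worth checking the plot and adjusting those values.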
From an experiment I do, I get large files with time series data. Each column represents one series that I would like to plot in a graph. The ranges of the X and Y axis are not important as I only need it as an overview.
The problem is, I have from 150-300 columns (and 400 rows) per data frame and I am not able to figure out how to plot more than 10 graphs at once.
library(ggplot2)
library(reshape2)
csv <- read.csv(file = "CSV-File-path", header = FALSE, sep = ";", dec = ".")[,1:10]
df <- as.data.frame(csv)
plot.ts(df)
The moment I change [,1:10] to [,1:11] I get an error:
Error in plotts(x = x, y = y, plot.type = plot.type, xy.labels =
xy.labels, : cannot plot more than 10 series as "multiple"
Ideally I would like an output of a multiple paged PDF file with at least 10 graphs per page. I am fairly new to R, I hope you are able to help me.
And here is a ggplot2 way to do it:
library(ggplot2)
library(reshape2)
nrow <- 200
ncol <- 24
df <- data.frame(matrix(rnorm(nrow*ncol),nrow,ncol))
# use all columns and shorten the measure name to mvar
mdf <- melt(df,id.vars=c(),variable.name="mvar")
gf <- ggplot(mdf,aes(value,fill=mvar)) +
geom_histogram(binwidth=0.1) +
facet_grid(mvar~.)
print(gf) # print it out so we can see it
ggsave("gplot.pdf",plot=gf) # now save it as a PDF
This is what the plot looks like:
Here is one way to do it. This one groups the columns in groups of 5 and then writes them as separate pages in a single PDF. But if it were me I would be using ggplot2 and doing them in a single plot.
nrow <- 18
ncol <- 20
df <- data.frame(matrix(runif(nrow*ncol),nrow,ncol))
plots <- list()
ngroup <- 5
icol <- 1
while(icol<=ncol(df)){
print(icol)
print(length(plots))
ecol <- min(icol+ngroup-1,ncol(df))
plot.ts(df[,icol:ecol])
plots[[length(plots)+1]] <- recordPlot()
icol <- ecol+1
}
graphics.off()
pdf('plots.pdf', onefile=TRUE)
for (p in plots) {
replayPlot(p)
}
graphics.off()
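A shorter variant of the same idea, as a sketch: when a PDF device is open, each plot.ts() call starts a new page, so the plots do not need to be recorded and replayed. The columns are split into chunks of 10, the most that plot.ts will draw as "multiple". The fake data dimensions, the seed, and the output file name are assumptions for illustration.

```r
set.seed(1)                                       # assumed seed
df <- data.frame(matrix(rnorm(400 * 25), nrow = 400, ncol = 25))

# column indices in chunks of at most 10 (the plot.ts limit per page)
chunks <- split(seq_len(ncol(df)), ceiling(seq_len(ncol(df)) / 10))

out_file <- file.path(tempdir(), "series_overview.pdf")  # hypothetical file name
pdf(out_file, onefile = TRUE)
for (idx in chunks) {
  # drop = FALSE keeps a data frame even if a chunk has a single column
  plot.ts(df[, idx, drop = FALSE],
          main = sprintf("Series %d to %d", min(idx), max(idx)))
}
dev.off()
```

With 25 columns this writes three pages: two with 10 series and one with 5.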
My data are set up so that one column contains a continuous value testosterone concentration and the second column contains one of four "Kit type" values being "EIA," "RIA," "Other," or "All." I wanted to make the kit types into categories along the x axis with testosterone concentration along the y. I can't seem to figure out how to make sort of a cross between a boxplot and a scatterplot, but with only the individual data points and a median marking for each category marked on the graph?
This seemed to get the data points into categories all right, but the summarySE function does not have a median: Categorical scatter plot with mean segments using ggplot2 in R
Without data, I'm only guessing here, but ...
## create some data
set.seed(42)
n <- 100
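## Kit is made a factor explicitly so that levels() below works
## (strings are no longer converted to factors by default since R 4.0)
dat <- data.frame(Testo=rbeta(n, 2, 5),
                  Kit=factor(sample(c('EIA', 'RIA', 'Other', 'All'), size = n, replace = TRUE)))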
## show unequal distribution of points, no problem
table(dat$Kit)
## All EIA Other RIA
## 23 30 14 33
## break into individual levels
dat2 <- lapply(levels(dat$Kit), function(lvl) dat$Testo[ dat$Kit == lvl ])
names(dat2) <- levels(dat$Kit)
## parent plot
boxplot(dat2, main = 'Testosterone Levels per Kit')
## adding individual points
for (lvl in seq_along(dat2)) {
points(jitter(rep(lvl, length(dat2[[lvl]]))), dat2[[lvl]],
pch = 16, col = '#88888888')
}
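If only the individual points and a median mark per category are wanted, without the boxes, a base R sketch could use stripchart() for the jittered points and segments() for the medians. The fake data is generated as above; the axis label, the segment half-width of 0.3, and the line width are assumptions chosen for illustration.

```r
set.seed(42)
n <- 100
dat <- data.frame(
  Testo = rbeta(n, 2, 5),
  Kit   = factor(sample(c("EIA", "RIA", "Other", "All"), size = n, replace = TRUE))
)

# jittered points, one vertical strip per kit type (factor levels sit at x = 1, 2, ...)
stripchart(Testo ~ Kit, data = dat, vertical = TRUE, method = "jitter",
           pch = 16, col = "#88888888",
           ylab = "Testosterone", main = "Testosterone Levels per Kit")

# median per kit type, in factor-level order
med <- tapply(dat$Testo, dat$Kit, median)

# short horizontal line at each median
pos <- seq_along(med)
segments(x0 = pos - 0.3, y0 = med, x1 = pos + 0.3, y1 = med, lwd = 3)
```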