R, ggplot2 qqplot using 2 vectors + straight line? - r

I have 2 data.frame objects:
df1
df2
Both have one column = amount.
For example:
df1 <- data.frame(amount = c(119.00,191.41,69.00,396.80,245.00,24.50,300.00,149.77,599.01,397.65))
df2 <- data.frame(amount = c(60.00,336.38,115.37,220.01,60.00,611.88,189.78,129.98,34.90,45.00))
I want to make a Q-Q plot from the two of them and add a y = x straight line to see whether they have the same distribution.
I am using qqplot(df1$amount, df2$amount) + abline(), but it doesn't work: Error: ggplot2 doesn't know how to deal with data of class uneval
Please advise.
Also, please explain: if I get an almost straight line in the qqplot but with a "level" in it, what does that mean?

As has been pointed out, qqplot() and abline() are base R functions from the packages 'stats' and 'graphics'. There is no need to use + from the 'ggplot2' package.
It is more convenient to gather the data in a single data.frame.
df <- data.frame(
"Amount_X" = c(119.00,191.41,69.00,396.80,245.00,24.50,300.00,149.77,599.01,397.65),
"Amount_Y" = c(60.00,336.38,115.37,220.01,60.00,611.88,189.78,129.98,34.90,45.00)
)
A base R solution for the plot then would be as follows:
qqplot(df$Amount_X, df$Amount_Y)
abline(0,1)
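If a ggplot2 version is preferred, one possible sketch is to let qqplot() compute the paired quantiles without drawing them (plot.it = FALSE) and hand the result to ggplot(). The names x, y, qq and p are illustrative and not part of the original question.

```r
library(ggplot2)

x <- c(119.00, 191.41, 69.00, 396.80, 245.00, 24.50, 300.00, 149.77, 599.01, 397.65)
y <- c(60.00, 336.38, 115.37, 220.01, 60.00, 611.88, 189.78, 129.98, 34.90, 45.00)

# qqplot() with plot.it = FALSE only computes the paired quantiles
qq_raw <- qqplot(x, y, plot.it = FALSE)
qq <- data.frame(x = qq_raw$x, y = qq_raw$y)

p <- ggplot(qq, aes(x = x, y = y)) +
  geom_point() +
  # y = x reference line
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Amount_X quantiles", y = "Amount_Y quantiles")
print(p)
```

Regarding the "level": a flat run of points in a Q-Q plot usually means that several consecutive quantiles of one sample share (almost) the same value, for instance because of tied observations such as the two 60.00 entries in Amount_Y.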

Related

Represent a colored polygon in ggplot2

I am using the spatstat package because I am working on spatial patterns.
I would like to reproduce the following graph, produced with the plot.quadrattest function, in ggplot, with colors instead of numbers (the numbers are not very readable): Polygone
The numbers I want to drive the color intensity are the ones at the bottom of each box.
The test object contains the following data:
Test object
I have looked at the help of the function, as well as the code of the function but I still cannot manage it.
Ideally I would like my final figure to look like this (maybe not with the same colors haha):
Final object
Thanks in advance for your help.
Please provide a reproducible example in the future.
The package reprex may be very helpful.
To use ggplot2 for this my best bet would be to convert
spatstat objects to sf and do the plotting that way,
but it may take some time. If you are willing to use base
graphics and spatstat you could do something like:
library(spatstat)
# Data (using a built-in dataset):
X <- unmark(chorley)
plot(X, main = "")
# Test:
test <- quadrat.test(X, nx = 4)
# Default plot:
plot(test, main = "")
# Extract the `quadratcount` object (regions with observed counts):
counts <- attr(test, "quadratcount")
# Convert to `tess` (raw regions with no numbers)
regions <- as.tess(counts)
# Add residuals as marks to the tessellation:
marks(regions) <- test$residuals
# Plot regions with marks as colors:
plot(regions, do.col = TRUE, main = "")
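For the ggplot2 route, a rough sketch of the idea is shown below. It does not use spatstat objects at all: it assumes a rectangular window (the unit square), simulated points, and equally sized quadrats, and it computes Pearson residuals (observed minus expected, divided by the square root of expected) by hand. All names (pts, counts, nx, ny) are hypothetical; a polygonal window like the one in chorley would still need the sf conversion mentioned above.

```r
library(ggplot2)

set.seed(1)
# Hypothetical data: 200 points in the unit square (stand-in for a real pattern)
pts <- data.frame(x = runif(200), y = runif(200))

nx <- 4
ny <- 4

# Quadrat index of every point (factors keep empty quadrats in the table)
ix <- factor(cut(pts$x, seq(0, 1, length.out = nx + 1),
                 include.lowest = TRUE, labels = FALSE), levels = 1:nx)
iy <- factor(cut(pts$y, seq(0, 1, length.out = ny + 1),
                 include.lowest = TRUE, labels = FALSE), levels = 1:ny)

# Observed counts per quadrat
counts <- as.data.frame(table(ix = ix, iy = iy))

# Expected count under a uniform pattern (all quadrats have equal area)
expected <- nrow(pts) / (nx * ny)
counts$residual <- (counts$Freq - expected) / sqrt(expected)

# Tile centres for plotting
counts$cx <- (as.integer(counts$ix) - 0.5) / nx
counts$cy <- (as.integer(counts$iy) - 0.5) / ny

p <- ggplot(counts, aes(cx, cy, fill = residual)) +
  geom_tile(colour = "black") +
  scale_fill_gradient2() +   # diverging scale centred on zero
  coord_equal() +
  labs(x = NULL, y = NULL, fill = "Pearson\nresidual")
print(p)
```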

Error in axis(side = side, at = at, labels = labels, ...) : invalid value specified for graphical parameter "pch"

I have applied the DBSCAN algorithm to the built-in iris dataset in R, but I get an error when I try to visualise the output using plot().
Following is my code.
library(fpc)
library(dbscan)
data("iris")
head(iris,2)
data1 <- iris[,1:4]
head(data1,2)
set.seed(220)
db <- dbscan(data1,eps = 0.45,minPts = 5)
table(db$cluster,iris$Species)
plot(db,data1,main = 'DBSCAN')
Error: Error in axis(side = side, at = at, labels = labels, ...) :
invalid value specified for graphical parameter "pch"
How to rectify this error?
I have a suggestion below, but first I see two issues:
You're loading two packages, fpc and dbscan, both of which have different functions named dbscan(). This could create tricky bugs later (e.g. if you change the order in which you load the packages, different functions will be run).
It's not clear what you're trying to plot, either what the x- or y-axes should be or the type of plot. The function plot() generally takes a vector of values for the x-axis and another for the y-axis (although not always, consult ?plot), but here you're passing it a data.frame and a dbscan object, and it doesn't know how to handle it.
Here's one way of approaching it, using ggplot() to make a scatterplot, and dplyr for some convenience functions:
# load our packages
# note: only loading dbscan, not fpc, since we're not using it
library(dbscan)
library(ggplot2)
library(dplyr)
# run dbscan::dbscan() on the first four columns of iris
db <- dbscan::dbscan(iris[,1:4],eps = 0.45,minPts = 5)
# create a new data frame by binding the derived clusters to the original data
# this keeps our input and output in the same dataframe for ease of reference
data2 <- bind_cols(iris, cluster = factor(db$cluster))
# make a table to confirm it gives the same results as the original code
table(data2$cluster, data2$Species)
# using ggplot, make a point plot with "jitter" so each point is visible
# x-axis is species, y-axis is cluster, also coloured according to cluster
ggplot(data2) +
geom_point(mapping = aes(x=Species, y = cluster, colour = cluster),
position = "jitter") +
labs(title = "DBSCAN")
Here's the image it generates:
If you're looking for something else, please be more specific about what the final plot should look like.
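As one guess at "something else", the sketch below draws two of the original measurements against each other and colours the points by cluster. The choice of Petal.Length and Petal.Width for the axes is an assumption made purely for illustration, and plot_data is a hypothetical name; in the dbscan package, cluster 0 marks noise points.

```r
library(dbscan)
library(ggplot2)

# same clustering as above, with the namespace made explicit
db <- dbscan::dbscan(iris[, 1:4], eps = 0.45, minPts = 5)

# cluster 0 means "noise" in the dbscan package
plot_data <- cbind(iris, cluster = factor(db$cluster))

# scatterplot of two features, coloured by cluster, shaped by true species
p <- ggplot(plot_data,
            aes(x = Petal.Length, y = Petal.Width,
                colour = cluster, shape = Species)) +
  geom_point() +
  labs(title = "DBSCAN")
print(p)
```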

Lines in ggplot order

Using the mgcv library, I get the points to plot with:
fsb <- fs.boundary(r0=0.1, r=1.1, l=2173)
If I plot fsb with the standard graphics package and then add lines, I get:
x11()
plot(fsb)
lines(fsb$x,fsb$y)
Now I am trying the same with ggplot (this is the line from within a bigger piece of code):
tpdf <- data.frame(ts=fsb$x,ps=fsb$y)
ts=fsb$x
ps=fsb$y
geom_line(data=tpdf, aes(ts,ps), inherit.aes = FALSE)
I get a messy plot:
I think I am getting the order wrong in geom_line.
This can be solved by using geom_path:
ggplot(tpdf)+
geom_point(aes(ts,ps)) +
geom_path(aes(ts,ps))
You have a very odd way of using ggplot; I recommend you re-examine it.
data:
library(mgcv)
fsb <- fs.boundary(r0 = 0.1, r=2, l=13)
tpdf <- data.frame(ts=fsb$x,ps=fsb$y)
You'll have to specify the group parameter - for example, this
ggplot(tpdf) +
geom_point(aes(ts, ps)) +
geom_line(aes(ts, ps, group = gl(4, 40)))
gives me a plot similar to the one in base R.
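The reason both answers work is that geom_line() connects observations in order of their x value (within each group), while geom_path() connects them in the order they appear in the data. A small sketch on a made-up circle (not the fs.boundary data) shows the difference; circle, p_line and p_path are illustrative names.

```r
library(ggplot2)

# Points on a circle, stored in drawing order
theta <- seq(0, 2 * pi, length.out = 50)
circle <- data.frame(x = cos(theta), y = sin(theta))

# geom_line() reorders the points by x, which gives a zig-zag
p_line <- ggplot(circle, aes(x, y)) + geom_line() + coord_equal()

# geom_path() keeps the row order, which gives the circle
p_path <- ggplot(circle, aes(x, y)) + geom_path() + coord_equal()

print(p_line)
print(p_path)
```

Supplying a group aesthetic, as in the second answer, limits the x-sorting to each group, which is why it can also recover the outline when every group is monotone in x.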

error (unused argument) using plyr with lattice xyplot

Hello everybody on stackoverflow,
it's my first question asked here... (well, actually the first one no one had already replied to!).
I'm trying to use the lattice xyplot function to plot a big df (2362422 rows) that should be split by a variable into several subplots (each of them with about 52 panels).
This is a highly simplified reproduction of the df and of the code I'm using:
library(lattice)
library(plyr)
set.seed(1)
df <- as.data.frame(cbind(x = rnorm(30), y=(1:2), z=rnorm(30), q = c("a","b","c","d","e")))
grpro <- function () {xyplot (x ~ z| q, data=df)}
grpro()
When I try to call the grpro function with d_ply to plot all the subplots based on the y variable, with the following code
d_ply(df, .(y), grpro)
I get the following error
Error in .fun(.data[[i]], ...) : unused argument (.data[[i]])
As far as I understand, the d_ply function splits the df into several data frames, in this case two dfs based on the values "1" and "2" of y.
I assumed that my code would work on each of those pieces, since everything used in grpro should still be valid when the df is split by y.
So, where am I wrong?
Thanks a lot for your help,
MZ
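A likely explanation, offered as a sketch rather than a confirmed answer: d_ply() calls the function with each piece of the split data frame as its first argument, but grpro() is defined without any arguments, hence "unused argument". The version below lets grpro() accept the piece (the parameter name d is arbitrary) and prints the lattice object explicitly, since lattice plots created inside a function are not drawn automatically. It also builds df with data.frame() instead of cbind() so that x and z stay numeric; that change is an assumption about what was intended.

```r
library(lattice)
library(plyr)

set.seed(1)
df <- data.frame(x = rnorm(30),
                 y = rep(1:2, length.out = 30),
                 z = rnorm(30),
                 q = rep(c("a", "b", "c", "d", "e"), length.out = 30))

# grpro() now takes the piece of data handed over by d_ply()
grpro <- function(d) {
  p <- xyplot(x ~ z | q, data = d, main = paste("y =", unique(d$y)))
  print(p)       # lattice objects must be printed to be drawn
  invisible(p)
}

# one plot per value of y
d_ply(df, .(y), grpro)
```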

Add to ggplot with element of different length

I'm new to ggplot2 and I'm trying to figure out how I can add a line to an already existing plot I created. The original plot, which is the cumulative distribution of a column of data T1 from a data frame x, has about 100,000 elements in it. I have successfully plotted this using ggplot2 and stat_ecdf() with the code I posted below. Now I want to add another line using a set of (x,y) coordinates, but when I try this using geom_line() I get the error message:
Error in data.frame(x = c(0, 7.85398574631245e-07, 3.14159923334398e-06, :
arguments imply differing number of rows: 1001, 100000
Here's the code I'm trying to use:
> set.seed(42)
> x <- data.frame(T1=rchisq(100000,1))
> ps <- seq(0,1,.001)
> ts <- .5*qchisq(ps,1) #50:50 mixture of chi-square (df=1) and 0
> p <- ggplot(x,aes(T1)) + stat_ecdf() + geom_line(aes(ts,ps))
That's what produces the error from above. Now here's the code using base graphics that I used to use but that I am now trying to move away from:
plot(ecdf(x$T1),xlab="T1",ylab="Cum. Prob.",xlim=c(0,4),ylim=c(0,1),main="Empirical vs. Theoretical Distribution of T1")
lines(ts,ps)
I've seen some other posts about adding lines in general, but what I haven't seen is how to add a line when the two originating vectors are not of the same length. (Note: I don't want to just use 100,000 (x,y) coordinates.)
As a bonus, is there an easy way, similar to using abline, to add a drop line on a ggplot2 graph?
Any advice would be much appreciated.
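ggplot works with data.frames. In your call, geom_line() inherits x (100,000 rows) as its data, so the 1,001-element vectors ts and ps cannot be combined with it. Put ts and ps in their own data.frame and pass that extra data.frame to geom_line via the data argument: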
set.seed(42)
x <- data.frame(T1=rchisq(100000,1))
ps <- seq(0,1,.001)
ts <- .5*qchisq(ps,1) #50:50 mixture of chi-square (df=1) and 0
tpdf <- data.frame(ts=ts,ps=ps)
p <- ggplot(x,aes(T1)) + stat_ecdf() + geom_line(data=tpdf, aes(ts,ps))
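For the bonus question, ggplot2 offers geom_vline() for a full-height vertical line and annotate("segment", ...) for a line that stops at a given height. The sketch below is one way to combine them; the smaller sample size, the choice of the 95% point, and the names q95 and p are assumptions made for illustration only.

```r
library(ggplot2)

set.seed(42)
x <- data.frame(T1 = rchisq(1000, 1))   # smaller sample than the original, for speed
ps <- seq(0, 1, .001)
ts <- .5 * qchisq(ps, 1)
tpdf <- data.frame(ts = ts, ps = ps)
tpdf <- tpdf[is.finite(tpdf$ts), ]      # qchisq(1, 1) is Inf, so drop that row

# empirical 95% point of T1, used as the position of the drop line
q95 <- unname(quantile(x$T1, 0.95))

p <- ggplot(x, aes(T1)) +
  stat_ecdf() +
  geom_line(data = tpdf, aes(ts, ps)) +
  # full-height vertical line, similar to abline(v = 1)
  geom_vline(xintercept = 1, linetype = "dotted") +
  # drop line from the curve down to the x-axis
  annotate("segment", x = q95, xend = q95, y = 0, yend = 0.95,
           linetype = "dashed") +
  coord_cartesian(xlim = c(0, 4)) +
  labs(x = "T1", y = "Cum. Prob.")
print(p)
```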
