error (unused argument) using plyr with lattice xyplot - r

Hello everybody on stackoverflow,
it's my first question asked here... (well, actually the first one no one had already replied to!).
I'm trying to use the lattice xyplot function to plot a big df (2362422 rows), which should be split by a variable into several subplots (each of them with about 52 panels).
This is a highly simplified reproduction of the df and of the code I'm using:
library(lattice)
library(plyr)
set.seed(1)
df <- as.data.frame(cbind(x = rnorm(30), y=(1:2), z=rnorm(30), q = c("a","b","c","d","e")))
grpro <- function () {xyplot (x ~ z| q, data=df)}
grpro()
When I try to call the grpro function with d_ply to plot all the subplots based on the y variable, with the following code
d_ply(df, .(y), grpro)
I get the following error
Error in .fun(.data[[i]], ...) : unused argument (.data[[i]])
From what I understand, the d_ply function splits the df into several data frames, in this case two, based on the values "1" and "2" of y.
I assumed that grpro would then run on each of those pieces, since everything it uses should still be valid when the df is split by y.
So, where am I wrong?
Thanks a lot for your help,
MZ
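A minimal sketch of what is probably going wrong, under stated assumptions: d_ply() calls the function once per piece and passes that piece as the first argument, while grpro is defined with no arguments, which matches the "unused argument" message. The parameter name piece and the main title are made up for illustration. The data frame is also built with data.frame() rather than cbind(), because cbind() would coerce every column to character.

```r
library(lattice)
library(plyr)

set.seed(1)
# Build the data frame directly so x and z stay numeric.
df <- data.frame(x = rnorm(30), y = 1:2, z = rnorm(30),
                 q = c("a", "b", "c", "d", "e"))

# The function has to accept the piece that d_ply() hands to it.
grpro <- function(piece) {
  # lattice objects are only drawn when printed, and d_ply()
  # discards return values, so print explicitly.
  print(xyplot(x ~ z | q, data = piece,
               main = paste("y =", unique(piece$y))))
}

# One plot per value of y.
d_ply(df, .(y), grpro)
```

Each call draws on a new page of the active graphics device, so with many groups it may be more practical to open a pdf() device first.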

Related

grouping without additional packages

I'm using R to plot my data, but am unable to install packages for the moment as my workplace has put up a lot of firewalls (currently trying to get IT to get them down).
In the meantime, I was wondering whether I could plot my data in groups using only the plot() function.
I have three variables in my data: IDName, Value, and Setpoints.
I wanted to aggregate my values for each setpoint, so I used the aggregate() function. However, this aggregates all data for each setpoint, whereas I want it to aggregate per IDName. All forms of grouping seem to require a package, so I was wondering if anyone knew any workarounds.
I've supplied the code below (note that the R script is within PowerBI, but for the purposes of my question only R expertise is needed). It would also be great if you know how to colour these points according to each IDName.
# dataset <- data.frame(IDName, Value, Setpoints)
# dataset <- unique(dataset)
# Paste or type your script code here:
dat <- aggregate(Value ~ Setpoints, dataset, mean)
x <- dat$Value
y <- dat$Setpoints
z <- dataset$IDName
plot(x,y, main ="Turbidity Frequency Distribution",xlab="% Time < Turbidity level", ylab="Turbidity (NTU)")
lines(spline(x,y))
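A possible base R approach, sketched with made-up data: adding IDName to the right-hand side of the aggregate() formula gives one mean per Setpoints/IDName combination, and the factor codes of IDName can be used to pick colours. The dataset below is a hypothetical stand-in for the PowerBI data frame; only the column names come from the question.

```r
# Hypothetical stand-in for the PowerBI `dataset` data frame.
set.seed(1)
dataset <- data.frame(
  IDName    = rep(c("A", "B", "C"), each = 20),
  Setpoints = rep(1:5, times = 12),
  Value     = runif(60, 0, 100)
)

# One mean per Setpoints/IDName combination instead of one per Setpoints.
dat <- aggregate(Value ~ Setpoints + IDName, data = dataset, FUN = mean)

# Map each IDName to a colour through its factor code.
ids  <- factor(dat$IDName)
cols <- seq_along(levels(ids))

plot(dat$Value, dat$Setpoints, col = cols[ids], pch = 19,
     main = "Turbidity Frequency Distribution",
     xlab = "% Time < Turbidity level", ylab = "Turbidity (NTU)")

# One smoothed line per IDName.
for (i in seq_along(levels(ids))) {
  grp <- dat[ids == levels(ids)[i], ]
  grp <- grp[order(grp$Value), ]
  lines(spline(grp$Value, grp$Setpoints), col = cols[i])
}
legend("topleft", legend = levels(ids), col = cols, pch = 19)
```

Everything here is in the base, stats and graphics packages, so no installation is needed.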

Set common y axis limits from a list of ggplots

I am running a function that returns a custom ggplot from an input data (it is in fact a plot with several layers on it). I run the function over several different input data and obtain a list of ggplots.
I want to create a grid with these plots to compare them but they all have different y axes.
I guess what I have to do is extract the maximum and minimum y axes limits from the ggplot list and apply those to each plot in the list.
How can I do that? I guess it's through the use of ggplot_build. Something like this:
test = ggplot_build(plot_list[[1]])
> test$layout$panel_scales_x
[[1]]
<ScaleContinuousPosition>
Range:
Limits: 0 -- 1
I am not familiar with the structure of a ggplot_build and maybe this one in particular is not a standard one as it comes from a "custom" ggplot.
For reference, these plots are created with the gseaplot2 function from the enrichplot package.
I don't know how to "upload" an R object, but if that would help, let me know how to do it.
Thanks!
edit after comments (thanks for your suggestions!)
Here is an example of a gseaplot2 plot. GSEA stands for Gene Set Enrichment Analysis; it is a technique used in genomic studies. The gseaplot2 function calculates a running average and then plots it, with another bar plot at the bottom.
and here is the grid I create to compare the plots generated from different data:
I would like to have a common scale for the "Running Enrichment Score" part.
I guess I could try to recreate the gseaplot2 function and input all of the datasets and then create the grid by facet_wrap, but I was wondering if there was an easy way of extracting parameters from a plot list.
As a reproducible example (from the enrichplot package):
library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
wpgmtfile <- system.file("extdata/wikipathways-20180810-gmt-Homo_sapiens.gmt", package="clusterProfiler")
wp2gene <- read.gmt(wpgmtfile)
wp2gene <- wp2gene %>% tidyr::separate(term, c("name","version","wpid","org"), "%")
wpid2gene <- wp2gene %>% dplyr::select(wpid, gene) #TERM2GENE
wpid2name <- wp2gene %>% dplyr::select(wpid, name) #TERM2NAME
ewp2 <- GSEA(geneList, TERM2GENE = wpid2gene, TERM2NAME = wpid2name, verbose=FALSE)
gseaplot2(ewp2, geneSetID=1, subplots=1:2)
And this is how I generate the plot list (probably there is a much more elegant way):
library(enrichplot)  # gseaplot2()
library(ggpubr)      # ggarrange()

plot_list = list()
for(i in 1:3) {
  fig_i = gseaplot2(ewp2,
                    geneSetID=i,
                    subplots=1:2)
  plot_list[[i]] = fig_i
}
ggarrange(plotlist=plot_list)
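A rough sketch of the "extract the limits, then apply them to every plot" idea, for a list of plain ggplot objects. The plots below are hypothetical stand-ins built from random data. The object returned by gseaplot2 with several subplots may be a composite rather than a single ggplot, in which case the same steps would have to be applied to the relevant panel; that part is an open assumption and is not covered here.

```r
library(ggplot2)

# Hypothetical stand-ins for the plots returned by the custom function.
set.seed(1)
plot_list <- lapply(c(1, 3, 10), function(s) {
  d <- data.frame(x = 1:50, y = cumsum(rnorm(50, sd = s)))
  ggplot(d, aes(x, y)) + geom_line()
})

# Trained y range of each plot, read from its built scales.
y_ranges <- lapply(plot_list, function(p) layer_scales(p)$y$range$range)

# Smallest minimum and largest maximum across all plots.
common_ylim <- range(unlist(y_ranges))

# coord_cartesian() sets the visible window without dropping any data.
plot_list <- lapply(plot_list, function(p) {
  p + coord_cartesian(ylim = common_ylim)
})
```

coord_cartesian() is used instead of ylim() so that points outside the window are only hidden, not removed before the statistics are computed.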

R, ggplot2 qqplot using 2 vectors + straight line?

I have 2 data.frame objects:
df1
df2
Both have one column = amount.
For example:
df1 <- data.frame(amount = c(119.00,191.41,69.00,396.80,245.00,24.50,300.00,149.77,599.01,397.65))
df2 <- data.frame(amount = c(60.00,336.38,115.37,220.01,60.00,611.88,189.78,129.98,34.90,45.00))
I want to make a qqplot using both of them and add a y = x straight line to see if they have same distribution.
I am using qqplot(df1$amount, df2$amount) + abline() but it doesn't work: Error: ggplot2 doesn't know how to deal with data of class uneval
Please advise.
Also, please explain: if I get an almost straight line in the qqplot but with a "level" in it, what does that mean?
As has been pointed out, qqplot() and abline() are base R functions from the packages 'stats' and 'graphics'. There is no need to use + from the 'ggplot2' package.
It is more convenient to gather the data in a single data.frame.
df <- data.frame(
  "Amount_X" = c(119.00,191.41,69.00,396.80,245.00,24.50,300.00,149.77,599.01,397.65),
  "Amount_Y" = c(60.00,336.38,115.37,220.01,60.00,611.88,189.78,129.98,34.90,45.00)
)
A base R solution for the plot then would be as follows:
qqplot(df$Amount_X, df$Amount_Y)
abline(0,1)
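If a ggplot2 version is wanted anyway, one way to sketch it is to let the base qqplot() compute the paired quantiles without drawing (plot.it = FALSE) and pass the result to ggplot(). The axis labels are made up for illustration.

```r
library(ggplot2)

df1 <- data.frame(amount = c(119.00, 191.41, 69.00, 396.80, 245.00,
                             24.50, 300.00, 149.77, 599.01, 397.65))
df2 <- data.frame(amount = c(60.00, 336.38, 115.37, 220.01, 60.00,
                             611.88, 189.78, 129.98, 34.90, 45.00))

# qqplot() returns the paired quantiles as a list with elements x and y.
qq <- as.data.frame(qqplot(df1$amount, df2$amount, plot.it = FALSE))

p <- ggplot(qq, aes(x, y)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +  # the y = x reference line
  labs(x = "df1 amount quantiles", y = "df2 amount quantiles")
print(p)
```

Here + is valid because every term is a ggplot2 layer; the original error came from mixing + with the base functions qqplot() and abline().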

R : Bad graphic of ordered boxplot according to median

Here is what I am trying to do : I have a data.frame (data) of 160 rows with 2 variables (fact (8 groups) and response) and I want to do a boxplot of response ~ fact, ordered in increasing order of the medians.
Code :
data <- read.table("box.txt",header=T)
attach(data)
index <- order(tapply(response,fact,median))
ordered <- factor(rep(index,rep(20,8)))
boxplot(response~ordered,notch=T,names=as.character(index),xlab="treatments",ylab="response")
but on the graphic the boxes are badly plotted (not in the right order and with "false" Min, Max, etc...).
I'm using RStudio with R 3.0.2 on Windows 7.
Any clue what that means?
One reproducible and seemingly correct answer would be :
set.seed(1)
data <- data.frame(response=10*rnorm(160), fact=factor(rep(1:8), labels=letters[1:8]))
data$fact <- reorder(data$fact, data$response, median)
boxplot(response~fact, data=data, notch=TRUE, xlab="treatments", ylab="response")
Names on the ticks of the x axis are correct, without further ado.
No idea why it looks 'bad', but the order is wrong because you use order instead of rank to find the index. For the other issues you probably have to make a reproducible example.
The reproducible example is as follows, with two boxplots to compare. In my case the plot (possibly) looks bad because of the devil's ears on the notches. Regarding the OP's question, I read "badly plotted" as meaning that using order() instead of rank() caused the other problems as well (although I wouldn't know why).
data <- data.frame(response=rnorm(160), fact=factor(rep(1:8), labels=letters[1:8]))
boxplot(response~fact, data=data, notch=TRUE, xlab="treatments", ylab="response")
# Look up each row's rank by its group, so this does not rely on recycling
data$ordered <- rank(tapply(data$response, data$fact, median))[data$fact]
boxplot(response~ordered, data=data, notch=TRUE, xlab="treatments", ylab="response")

Add to ggplot with element of different length

I'm new to ggplot2 and I'm trying to figure out how I can add a line to an already existing plot I created. The original plot, which is the cumulative distribution of a column of data T1 from a data frame x, has about 100,000 elements in it. I have successfully plotted this using ggplot2 and stat_ecdf() with the code I posted below. Now I want to add another line using a set of (x,y) coordinates, but when I try this using geom_line() I get the error message:
Error in data.frame(x = c(0, 7.85398574631245e-07, 3.14159923334398e-06, :
arguments imply differing number of rows: 1001, 100000
Here's the code I'm trying to use:
library(ggplot2)
set.seed(42)
x <- data.frame(T1=rchisq(100000,1))
ps <- seq(0,1,.001)
ts <- .5*qchisq(ps,1) #50:50 mixture of chi-square (df=1) and 0
p <- ggplot(x,aes(T1)) + stat_ecdf() + geom_line(aes(ts,ps))
That's what produces the error from above. Now here's the code using base graphics that I used to use but that I am now trying to move away from:
plot(ecdf(x$T1),xlab="T1",ylab="Cum. Prob.",xlim=c(0,4),ylim=c(0,1),main="Empirical vs. Theoretical Distribution of T1")
lines(ts,ps)
I've seen some other posts about adding lines in general, but what I haven't seen is how to add a line when the two originating vectors are not of the same length. (Note: I don't want to just use 100,000 (x,y) coordinates.)
As a bonus, is there an easy way, similar to using abline, to add a drop line on a ggplot2 graph?
Any advice would be much appreciated.
ggplot works with data.frames, so you need to put ts and ps into a data.frame and then pass this extra data.frame to geom_line through its data argument:
set.seed(42)
x <- data.frame(T1=rchisq(100000,1))
ps <- seq(0,1,.001)
ts <- .5*qchisq(ps,1) #50:50 mixture of chi-square (df=1) and 0
tpdf <- data.frame(ts=ts,ps=ps)
p <- ggplot(x,aes(T1)) + stat_ecdf() + geom_line(data=tpdf, aes(ts,ps))
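For the bonus question about a drop line, a minimal sketch using geom_vline() for a full-height line and annotate("segment", ...) for a line that stops at the curve. The 95% quantile is a hypothetical choice of where to draw it, the sample is shrunk to 1000 points for speed only, and ps stops at 0.999 here because qchisq(1, 1) is Inf.

```r
library(ggplot2)

set.seed(42)
x  <- data.frame(T1 = rchisq(1000, 1))  # smaller sample, for speed only
ps <- seq(0, 0.999, 0.001)              # stop before 1 to avoid Inf
ts <- 0.5 * qchisq(ps, 1)
tpdf <- data.frame(ts = ts, ps = ps)

# Hypothetical position of the drop line: the empirical 95% quantile.
q95 <- unname(quantile(x$T1, 0.95))

p <- ggplot(x, aes(T1)) +
  stat_ecdf() +
  geom_line(data = tpdf, aes(ts, ps)) +
  # Full-height vertical line, the closest analogue of abline(v = ...).
  geom_vline(xintercept = q95, linetype = "dashed") +
  # Segment from the x axis up to the 0.95 level only.
  annotate("segment", x = q95, xend = q95, y = 0, yend = 0.95)
print(p)
```

geom_hline() and geom_abline() cover the horizontal and sloped cases in the same way.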
