R: Bad graphic of ordered boxplot according to median

Here is what I am trying to do: I have a data.frame (data) of 160 rows with 2 variables, fact (8 groups) and response, and I want to draw a boxplot of response ~ fact with the boxes in increasing order of the medians.
Code :
data <- read.table("box.txt",header=T)
attach(data)
index <- order(tapply(response,fact,median))
ordered <- factor(rep(index,rep(20,8)))
boxplot(response~ordered,notch=T,names=as.character(index),xlab="treatments",ylab="response")
But on the graphic the boxes are badly plotted (not in the right order, and with "false" min, max, etc.).
I'm using RStudio with R 3.0.2 on Windows 7.
Any clue about what this means?

One reproducible and seemingly correct answer would be:
set.seed(1)
data <- data.frame(response=10*rnorm(160), fact=factor(rep(1:8), labels=letters[1:8]))
data$fact <- reorder(data$fact, data$response, median)
boxplot(response~fact, data=data, notch=TRUE, xlab="treatments", ylab="response")
Names on the ticks of the x axis are correct, without further ado.

No idea why it looks 'bad', but the order is wrong because you use order instead of rank to find the index. For the other issues you probably have to make a reproducible example.

A reproducible example follows, with two boxplots to compare. In my case the plot (possibly) looks bad because of the "devil's ears" (notches that extend past the ends of the box). As for the OP's question, I read "bad" as meaning that using order() instead of rank() caused other mishaps as well, although I wouldn't know why.
data <- data.frame(response=rnorm(160), fact=factor(rep(1:8), labels=letters[1:8]))
boxplot(response~fact, data=data, notch=TRUE, xlab="treatments", ylab="response")
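# look up each row's rank by its group name instead of relying on recycling
data$ordered <- rank(tapply(data$response, data$fact, median))[as.character(data$fact)]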
boxplot(response~ordered, data=data, notch=TRUE, xlab="treatments", ylab="response")
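The difference between order() and rank() mentioned in the answers above can be seen on a small example. This is only a sketch on simulated data (the data frame layout and the column name fact_sorted are my own assumptions, not the OP's file): order() returns which group comes first, second, and so on, while rank() returns the position of each group, and the two are inverse permutations of each other when there are no ties.

```r
set.seed(1)
# Simulated stand-in for the OP's data: 8 groups of 20 observations
data <- data.frame(response = 10 * rnorm(160),
                   fact = factor(rep(letters[1:8], each = 20)))

med <- tapply(data$response, data$fact, median)

ord <- order(med)  # ord[i] is the index of the group with the i-th smallest median
rnk <- rank(med)   # rnk[i] is the position of group i in the sorted sequence

# Rebuild the factor with its levels listed in order of increasing median
data$fact_sorted <- factor(data$fact, levels = names(med)[ord])

boxplot(response ~ fact_sorted, data = data,
        xlab = "treatments", ylab = "response")
```

Because the levels themselves are reordered, the tick labels stay attached to the right boxes and no names= argument is needed.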

Related

uniroot gives multiple answers to equation with 1 unknown

I want to create a column in a data frame in which each row is the solution to an equation with 1 unknown (x). The other variables in the equation are provided in the other columns. In another Stack Overflow question, @flodel provided a solution, which I have tried to adapt. However, the output data frame omits some observations entirely, and others have "duplicates" with two different solutions to the same equation.
Sample of my data frame:
  Time    id        V1         V2           V3         V4
199304 79330   259.721   224.5090  0.040140442 0.08100474
201004 77520  5062.200  3245.6921  0.037812662 0.08509553
196804 23018   202.897   842.6852  0.154956206 0.12982818
197804 12319   181.430   341.4415  0.052389156 0.14196588
199404 18542 14807.000 16537.0873 -0.001394388 0.08758791
Code with the equation I want to solve. I have simplified the equation, but the issue relates to this simple equation too.
library(plyr)
library(rootSolve)
set.seed(1)
df <- adply(df, 1, summarize,
            x = uniroot.all(function(x) V1 * ((V4 - V3) / (x - V3)) - V2,
                            interval = c(-10, 10)))
How can I achieve this? If possible, it would be great to do this in an efficient manner, as my actual data frame has >1,000,000 rows.
The previous answer by @StefanoBarbi was pointing in the right direction.
Here are the plots of the functions implied by each row of your example data frame, with the solution superimposed as a red vertical line (so that we can see that yes, you're right that there is a root in the interval ...) [code below]
The problem is that the algorithm underlying uniroot() is only guaranteed to find the root of a function that is continuous on the interval. Your functions have discontinuities/singularities. (Even for a continuous function I'm sure that the algorithm could be broken with a function that was sufficiently weird to cause problems with floating-point math ...)
Even a bisection algorithm, which is more robust than Brent's method (the algorithm underlying uniroot) since it makes fewer assumptions about continuity of the derivative, could easily fail on this kind of discontinuous function. (It could be made to work for a function that is discontinuous but monotonic, but your example is neither continuous nor monotonic ...)
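To make the point about discontinuities concrete, here is a small sketch using only base R and the first row of the sample data. The bracket widths are arbitrary choices of mine. It shows that the function changes sign across the pole at x = V3, so a bracketing root finder handed an interval containing only the pole still returns a value, even though there is no root there. This may be related to the extra "solutions" reported in the question, but I have not verified that.

```r
# First row of the sample data from the question
V1 <- 259.721; V2 <- 224.5090; V3 <- 0.040140442; V4 <- 0.08100474
f <- function(x) V1 * ((V4 - V3) / (x - V3)) - V2

# The function jumps from large negative to large positive across x = V3
left  <- f(V3 - 1e-6)
right <- f(V3 + 1e-6)

# This bracket contains the pole but not the genuine root; the sign change
# is enough for uniroot() to return a value, which lies next to the pole
near_pole <- uniroot(f, interval = c(V3 - 0.01, V3 + 0.01))

# A bracket strictly to the right of the pole isolates the genuine root,
# because f is continuous there
true_root <- uniroot(f, interval = c(V3 + 0.01, 10), tol = 1e-10)$root
```

Comparing near_pole$root with V3, and near_pole$f.root with zero, is a quick way to tell a spurious result from a real one.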
Obviously your real problem is more complex than this (or you would just be using the easy analytical solution you referred to); what this means is that you need to find some way to "tame" your function. In this example, if you rearrange the function to avoid dividing by x-V3 (but without completely solving the equation) then uniroot() should work ...
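# Closed-form root of V1 * ((V4-V3)/(x-V3)) - V2 = 0, used to mark the solution
f1 <- function(L) with(L, (V1/V2)*(V4-V3) + V3)
f1(df[1,])
png("badfit.png")
par(mfrow = c(2,3), bty = "l", las = 1)
for (i in 1:nrow(df)) {
  # plot the function implied by row i over the search interval
  with(df[i,],
       curve(V1 * ((V4-V3)/(x-V3)) - V2,
             from = -10, to = 10,
             ylab = "", xlab = ""))
  abline(v = f1(df[i,]), col = 2)  # red: analytical root
  abline(h = 0, col = 4)           # blue: y = 0
}
dev.off()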
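The rearrangement suggested above (multiplying through by x-V3 so that nothing is divided by it) could look like the following sketch. The helper name solve_row is hypothetical, and mapply() is used only for brevity. For the simplified equation the rearranged function is linear in x, so this mainly illustrates the idea and is not a recipe for the real, more complex problem.

```r
# Sample rows copied from the question
df <- data.frame(
  V1 = c(259.721, 5062.200, 202.897, 181.430, 14807.000),
  V2 = c(224.5090, 3245.6921, 842.6852, 341.4415, 16537.0873),
  V3 = c(0.040140442, 0.037812662, 0.154956206, 0.052389156, -0.001394388),
  V4 = c(0.08100474, 0.08509553, 0.12982818, 0.14196588, 0.08758791)
)

# Hypothetical helper: solve the rearranged, pole-free equation for one row.
# V1 * (V4 - V3) / (x - V3) - V2 = 0  becomes  V1 * (V4 - V3) - V2 * (x - V3) = 0
solve_row <- function(V1, V2, V3, V4) {
  g <- function(x) V1 * (V4 - V3) - V2 * (x - V3)
  uniroot(g, interval = c(-10, 10), tol = 1e-10)$root
}

df$x <- mapply(solve_row, df$V1, df$V2, df$V3, df$V4)

# Closed-form root, for comparison with the numerical result
df$x_exact <- with(df, (V1 / V2) * (V4 - V3) + V3)
```

Each row now yields exactly one value, with no dropped or duplicated observations. For a million rows, a vectorised closed form (where one exists) will be far faster than calling uniroot() per row.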

grouping without additional packages

I'm using R to plot my data, but I am unable to install packages for the moment because my workplace has put up a lot of firewalls (I'm currently trying to get IT to take them down).
In the meantime, I was wondering whether I could plot my data in groups using only the plot() function.
I have three variables in my data: IDName, Value, and Setpoints.
I wanted to aggregate my values for each setpoint, so I used the aggregate() function. However, this aggregates all data for each setpoint, whereas I only want it to aggregate within each IDName. All forms of grouping seem to require a package, so I was wondering if anyone knew any workarounds.
I've supplied the code below (note that the R script is within PowerBI, but for the purposes of my question only R expertise is needed). It would also be great if you know how to colour these points according to each IDName.
# dataset <- data.frame(IDName, Value, Setpoints)
# dataset <- unique(dataset)
# Paste or type your script code here:
dat <- aggregate(Value ~ Setpoints, dataset, mean)
x <- dat$Value
y <- dat$Setpoints
z <- dataset$IDName
plot(x,y, main ="Turbidity Frequency Distribution",xlab="% Time < Turbidity level", ylab="Turbidity (NTU)")
lines(spline(x,y))
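One possible base-R approach, sketched here on made-up data (the simulated dataset is a stand-in I invented for the Power BI input, and plain lines() replaces the spline for simplicity): aggregate() accepts more than one grouping variable on the right-hand side of the formula, and plot() can colour points through the integer codes of a factor, so no extra package is needed.

```r
# Hypothetical stand-in for the Power BI 'dataset'
set.seed(1)
dataset <- data.frame(
  IDName    = rep(c("A", "B", "C"), each = 20),
  Setpoints = rep(1:5, times = 12),
  Value     = runif(60, 0, 100)
)

# Aggregate within each IDName as well as each setpoint
dat <- aggregate(Value ~ Setpoints + IDName, data = dataset, FUN = mean)

# Colour points by group using the factor's integer codes
ids <- factor(dat$IDName)
plot(dat$Value, dat$Setpoints, col = as.integer(ids), pch = 19,
     main = "Turbidity Frequency Distribution",
     xlab = "% Time < Turbidity level", ylab = "Turbidity (NTU)")

# One line per group, drawn in the matching colour
for (g in levels(ids)) {
  sub <- dat[dat$IDName == g, ]
  sub <- sub[order(sub$Value), ]
  lines(sub$Value, sub$Setpoints, col = which(levels(ids) == g))
}
legend("topleft", legend = levels(ids),
       col = seq_along(levels(ids)), pch = 19)
```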

error (unused argument) using plyr with lattice xyplot

Hello everybody on stackoverflow,
it's my first question asked here... (well, actually the first one no one had already replied to!).
I'm trying to use the lattice xyplot function to plot a big df (2362422 rows), which should be split by a variable into several subplots (each of them with about 52 panels).
This is a highly simplified reproduction of the df and of the code I'm using:
library(lattice)
library(plyr)
set.seed(1)
df <- data.frame(x = rnorm(30), y = 1:2, z = rnorm(30), q = c("a","b","c","d","e"))
grpro <- function () {xyplot (x ~ z| q, data=df)}
grpro()
When I try to call the grpro function with d_ply to plot all the subplots based on the y variable, with the following code
d_ply(df, .(y), grpro)
I get the following error
Error in .fun(.data[[i]], ...) : unused argument (.data[[i]])
From what I understand, the d_ply function splits the df into several data frames, in this case two dfs based on the values "1" and "2" of y.
I assume that my code handles that, and that every other argument used in grpro remains valid once the df is split by y.
So, where am I going wrong?
Thanks a lot for your help,
MZ
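A likely cause, sketched under the assumption that the goal is one lattice plot per value of y: d_ply() hands each piece of the split data frame to the function as its first argument, so the function has to accept an argument, and a lattice plot created inside a function has to be print()ed to be drawn. The sketch below uses base split() so that it runs without plyr. The argument name piece and the returned row count are my own additions for illustration.

```r
library(lattice)

set.seed(1)
df <- data.frame(x = rnorm(30), y = 1:2, z = rnorm(30),
                 q = c("a", "b", "c", "d", "e"))

# The function now takes the piece it is handed instead of reading the global df
grpro <- function(piece) {
  print(xyplot(x ~ z | q, data = piece))  # print() draws the trellis object
  invisible(nrow(piece))                  # returned only so the call can be checked
}

# Base-R equivalent of the split-apply step
pieces <- split(df, df$y)
sizes <- vapply(pieces, grpro, integer(1))

# With plyr loaded, the same function can be passed as: d_ply(df, .(y), grpro)
```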

How to produce leverage stats?

I know how to produce the plots using leveragePlot(), but I cannot find a way to produce a leverage statistic for each observation, like in MegaStat output.
I think you're looking for the hat values.
Use hatvalues(fit). The rule of thumb is to examine any observations 2-3 times greater than the average hat value. I don't know of a specific function or package off the top of my head that provides this info in a nice data frame, but doing it yourself is fairly straightforward. Here's an example:
fit <- lm(hp ~ cyl + mpg, data=mtcars) # a fake model
hatvalues(fit)
hv <- as.data.frame(hatvalues(fit))
mn <- mean(hatvalues(fit))
hv$warn <- ifelse(hv[, 'hatvalues(fit)'] > 3*mn, 'x3',
           ifelse(hv[, 'hatvalues(fit)'] > 2*mn, 'x2', '-'))
hv
For larger data sets you could use subset and/or order to look at just certain value ranges for the hat values:
subset(hv, warn=="x3")
subset(hv, warn%in%c("x2", "x3"))
hv[order(hv[, 'hatvalues(fit)']), ]
I actually came across a nice plot function that does this in the book R in Action but as this is a copyrighted book I will not display Kabacoff's intellectual property. But that plot would work even better for mid sized data sets.
Here is a decent hat plot though that you may also want to investigate:
plot(hatvalues(fit), type = "h")
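The same flagging can be written a little more compactly with a named column and cut(). This is just a sketch, and the names lev, ratio and top are my own. It also makes one fact explicit: for a full-rank linear model the hat values sum to the number of coefficients p, so the average hat value is p/n.

```r
fit <- lm(hp ~ cyl + mpg, data = mtcars)

h <- hatvalues(fit)
mean_h <- mean(h)  # equals p/n: number of coefficients over number of observations

# One row per observation, with the hat value and its ratio to the average
lev <- data.frame(hat = h, ratio = h / mean_h)

# Same thresholds as the nested ifelse(): more than 2x and more than 3x the average
lev$warn <- cut(lev$ratio, breaks = c(-Inf, 2, 3, Inf),
                labels = c("-", "x2", "x3"))

# Observations with the highest leverage first
top <- lev[order(-lev$hat), ]
head(top, 5)
```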

Sorting of categorical variables in ggplot

Good day, I wish to produce a graphic using ggplot2, but instead of its default alphabetical sorting of the categorical variable (in the script: letters), I want to sort by the associated value of a continuous variable (in the script: numbers).
Here is an example script:
library(ggplot2)
trial<-data.frame(letters=letters, numbers=runif(n=26,min=1,max=26))
trial<-trial[sample(1:26,26),]
trial.plot<-qplot(x=numbers, y=letters, data=trial)
trial.plot
trial<-trial[order(trial$numbers),]
trial.plot<-qplot(x=numbers, y=letters, data=trial)
trial.plot
trial.plot+stat_sort(variable=numbers)
The last line does not work.
I'm pretty sure stat_sort does not exist, so it's not surprising that it doesn't work as you think it should. Luckily, there's the reorder() function, which reorders the levels of a categorical variable depending on the values of a second variable. I think this should do what you want:
trial.plot <- qplot( x = numbers, y = reorder(letters, numbers), data = trial)
trial.plot
If you could be more specific about how you want it to look, I think the community could improve on my answer. Regardless, is this what you are looking for?
qplot(numbers, reorder(letters, numbers), data=trial)
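To see what reorder() does in the answers above, it helps to inspect the levels directly. The sketch below does that in base R and then, only if ggplot2 is installed, draws the same plot with ggplot() instead of qplot(), which is deprecated in recent ggplot2 releases. The variable names ordered_letters and p are mine.

```r
set.seed(1)
trial <- data.frame(letters = letters,
                    numbers = runif(n = 26, min = 1, max = 26))

# reorder() returns a factor whose levels are sorted by the second argument
ordered_letters <- reorder(trial$letters, trial$numbers)
levels(ordered_letters)  # letters listed by increasing 'numbers'

# Equivalent plot with ggplot(); skipped when ggplot2 is not available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(trial, aes(x = numbers, y = reorder(letters, numbers))) +
    geom_point() +
    ylab("letters")
  print(p)
}
```

The y axis follows the factor levels, not the row order of the data frame, which is why sorting the rows with order() alone had no visible effect.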

Resources