Pareto chart grouping like a histogram in R

I need to create a Pareto chart in R. Judging from the example in the qcc library, I need to group the data first.
Let's suppose my table is:
defect <- data.frame(a=c(8, 7, 6, 4, 3, 3, 3, 0, 0, 1))
If I draw a histogram, I get the grouping automatically:
hist(defect$a, breaks=c(-1:8))
But with a pareto graph I don't:
pareto.chart(defect$a, ylab = "Error frequency")
Is there a way to get the grouping and the chart without having to group the data with plyr?
I need the same result as the following, but without grouping manually.
library(plyr)  # provides count() and ddply()
bb <- count(defect, "a")  # columns: a (the value) and freq (its number of occurrences)
pareto.chart(setNames(bb$freq, bb$a), ylab = "Error frequency")

From the documentation it seems pretty clear that the pareto.chart function expects summarized data. If you don't want to use ddply(), you could use the base table() function:
pareto.chart(table(defect$a), ylab = "Error frequency")
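To see what table() hands to pareto.chart, here is a minimal base R sketch of the counts, sorted in decreasing order with cumulative percentages. The names counts and cum_pct are only illustrative. Note that table() lists only values that actually occur, so unlike hist() it produces no empty groups for 2 and 5.

```r
# Same data as in the question
defect <- data.frame(a = c(8, 7, 6, 4, 3, 3, 3, 0, 0, 1))

# Frequency of each distinct value, most frequent first
counts <- sort(table(defect$a), decreasing = TRUE)

# Cumulative percentage, the quantity a Pareto chart overlays on the bars
cum_pct <- 100 * cumsum(counts) / sum(counts)

print(counts)
print(cum_pct)
```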

Related

How to convert the interval output of discretizeDF.supervised into discrete values like 1, 2, 3, and so on

This concerns the arules package.
When I use the function for unsupervised discretization of continuous variables,
df2 <- discretizeDF(df, default = list(method = "cluster", breaks = 4, labels=c("-1", "-0.5","0","0.5")))
I get my variables back with the labels given above: -1, -0.5, 0, 0.5.
But with the supervised discretization function from the arulesCBA package,
df2 <- discretizeDF.supervised(df1, default = list(method = "caim", breaks = 3, labels=c("-1", "0","+1")))
my continuous variables come back as intervals, e.g. [1,18], [18,97], [97,infinite]. How can I convert the intervals to discrete values like -1, 0, +1?
Thank you very much in advance. I just started using R and learning ML this month. I am an MSc finance student and am doing this for my master's thesis. Please help. <3
It looks like you are not using discretizeDF.supervised as described in the manual page:
discretizeDF.supervised(formula, data, method = "mdlp", dig.lab = 3, ...)
You need a formula. Also, this function cannot set labels, since some methods create a different number of categories.
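Once a column comes back as an interval factor, one way to get codes such as -1, 0, +1 is to map the factor's integer level codes onto the values you want. The following is only a sketch: it uses base R's cut() as a stand-in for the discretizer output, the data and break points are made up, and it assumes the column ended up with exactly three intervals, which a supervised method does not guarantee.

```r
# Made-up continuous variable
x <- c(3, 12, 25, 60, 110, 150)

# Stand-in for a discretized column: a factor whose levels are intervals
intervals <- cut(x, breaks = c(1, 18, 97, Inf), include.lowest = TRUE)

# Map the 1st, 2nd and 3rd interval to -1, 0 and +1
numeric_codes <- c(-1, 0, 1)[as.integer(intervals)]

print(data.frame(x, intervals, numeric_codes))
```

If a column has a different number of intervals, the lookup vector c(-1, 0, 1) has to be adapted to nlevels() of that column.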

Using user-defined functions within "curve" function in R graphics

I need to produce normal density plots with different total areas (summing to 1). With the following function, I can specify lambda, which gives the relative area:
sdnorm <- function(x, mean=0, sd=1, lambda=1){lambda*dnorm(x, mean=mean, sd=sd)}
I then want to plot the function using different parameters. Using ggplot2, this code works:
require(ggplot2)
qplot(x, geom="blank") + stat_function(fun=sdnorm,args=list(mean=8,sd=2,lambda=0.7)) +
stat_function(fun=sdnorm,args=list(mean=18,sd=4,lambda=0.30))
But I really want to do this in base R graphics, for which I think I need to use the "curve" function. However, I am struggling to get this to work.
If you take a look at the help file, ?curve, you'll see that the first argument can be a number of different things:
The name of a function, or a call or an expression written as a function of x which will evaluate to an object of the same length as x.
This means you can specify the first argument as either a function name or an expression, so you could just do:
curve(sdnorm)
to get a plot of the function with its default arguments. Otherwise, to recreate your ggplot2 representation you would want to do:
curve(sdnorm(x, mean=8,sd=2,lambda=0.7), from = 0, to = 30)
curve(sdnorm(x, mean=18,sd=4,lambda=0.30), add = TRUE)
You can do the following in base R:
x <- seq(0, 50, 1)
plot(x, sdnorm(x, mean = 8, sd = 2, lambda = 0.7), type = 'l', ylab = 'y')
lines(x, sdnorm(x, mean = 18, sd = 4, lambda = 0.30))
EDIT: I added ylab = 'y' to re-label the y-axis.
This should get you started.
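The claim that lambda sets the relative area can be checked numerically. A small sketch using integrate() from base R, which forwards extra named arguments to the integrand. The names area1 and area2 are only illustrative.

```r
# Scaled normal density from the question
sdnorm <- function(x, mean = 0, sd = 1, lambda = 1) {
  lambda * dnorm(x, mean = mean, sd = sd)
}

# Area under each scaled curve; mean, sd and lambda are passed through to sdnorm
area1 <- integrate(sdnorm, -Inf, Inf, mean = 8,  sd = 2, lambda = 0.7)$value
area2 <- integrate(sdnorm, -Inf, Inf, mean = 18, sd = 4, lambda = 0.30)$value

print(c(area1 = area1, area2 = area2, total = area1 + area2))
```

Each area should come out close to its lambda, so the two components together cover a total area of about 1.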

How to extract outliers from box plot in R

Could you explain to me whether there is a way to extract outliers from a box plot? I have plotted a box plot and I want to extract only the outliers.
Here is the code for the box plot.
# melting down
require(reshape)
melt_nx <- melt(nx, id.vars = c("x", "y"))
boxplot(data = melt_nx, main = "NX", value ~ variable, las = 2,
par(mar = c(15, 5, 4, 2) + 0.1),
names = c("We1", "We2", "we3"))
Is it possible from the box plot to extract the outliers only?
The boxplot function returns a list, one of whose elements is named "out". These are the values that lie beyond the "whiskers". I don't know about executing par within the argument list, but if you want these particular values, then use this:
vals <- boxplot(data = melt_nx, main = "NX", value ~ variable, las = 2,
names = c("We1", "We2", "we3"))
vals$out
And do read all these help pages:
?boxplot
?boxplot.stats
?bxp
?fivenum
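The returned list also has a "group" element that says which box each outlier belongs to. A minimal sketch with made-up data (the data frame df and the name outliers are only illustrative), using plot = FALSE so that only the statistics are computed:

```r
# Made-up long-format data with one obvious outlier per group
df <- data.frame(
  variable = rep(c("We1", "We2"), each = 6),
  value    = c(1, 2, 3, 4, 5, 50, 10, 11, 12, 13, 14, -40)
)

# Compute the box plot statistics without drawing anything
vals <- boxplot(value ~ variable, data = df, plot = FALSE)

# Pair each outlier with the name of the group it came from
outliers <- data.frame(
  variable = vals$names[vals$group],
  value    = vals$out
)
print(outliers)
```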
I know this has been answered, but for me there is an alternative method using the Boxplot function from the car package. Note the capital B in the Boxplot function call.
This is the code that does it for me. It returns the row numbers of the outliers, which you can then use to filter or extract rows in your data frame:
outliers<-Boxplot(x~y, data=df, id.method="y")
Note that the extracted values are of type character. To exclude the corresponding rows, you could do something like:
df2 <- df[-as.numeric(outliers),]
Hope this helps a little

Using optional arguments (...) in a function, as illustrated with new population pyramid plot

Wanting to show the distribution of participants in a survey by level, I came upon the recently-released pyramid package and tried it. As the font on the x-axis is too large and there seem to be no other formatting choices to fix it, I realized I don't know how to add "other options" as permitted by the ... in the pyramid call.
install.packages("pyramid")
library(pyramid)
level.pyr <- data.frame(left = c(1, 4, 6, 4, 41, 17),
right = c(1, 4, 6, 4, 41, 17),
level = c("Mgr", "Sr. Mgr.", "Dir.", "Sr. Dir.", "VP", "SVP+"))
pyramid(level.pyr, Laxis = seq(2,50,6), Cstep = 1, Cgap = .5, Llab = "", Rlab = "", Clab = "Title", GL = T, Lcol = "deepskyblue", Rcol = "deepskyblue", Ldens = -1, main = "Distribution of Participants in Survey")
Agreed, the plot looks odd because the left and the right sides are the same, not male and female. But my question remains as to how to invoke the options and do something like Laxis.size = 2 or Raxis.font = "bold".
Alternatives to this new package for creating pyramid plots include plotrix, grid, and base R, as demonstrated here:
population pyramid density plot in r
By the way, if there were a ggplot method, I would welcome trying it.
Contrary to Roland's and now nrussell's guesses in the comments (apparently made without looking at the code), the dots arguments are not passed to pyramid's axis-plotting routine, even though it is a base graphics function. They are not even passed to an axis() call, although that would have seemed reasonable. The x-axis tick labels are constructed with a call to text(). You could hack the text() calls to accept a named argument of your choosing, and it would then be passed via the dots mechanism. Since you seem open to other options, I would recommend plotrix::pyramid.plot: Jim Lemon does a better job of documenting his routines, and they are more likely to follow standard R plotting conventions:
library(plotrix)
# Usage, as listed on the help page (not meant to be run as-is):
# pyramid.plot(lx,rx,labels=NA,top.labels=c("Male","Age","Female"),
#  main="",laxlab=NULL,raxlab=NULL,unit="%",lxcol,rxcol,gap=1,space=0.2,
#  ppmar=c(4,2,4,2),labelcex=1,add=FALSE,xlim,show.values=FALSE,ndig=1,
#  do.first=NULL)
with( level.pyr, pyramid.plot(lx=left, rx=right, labels=level,
gap =5, top.labels=c("", "Title", ""), labelcex=0.6))
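For reference, this is roughly what it looks like when a function does forward its dots to text(). The function draw_tick_labels below is hypothetical and is not part of pyramid or plotrix. It is only a sketch of the mechanism: named arguments supplied through ... override the defaults and are handed on to text().

```r
# Hypothetical helper: draws labels and forwards extra arguments to text()
draw_tick_labels <- function(x, y, labels, ...) {
  # Defaults, overridden by anything the caller supplies through the dots
  args <- modifyList(list(cex = 1, font = 1), list(...))
  do.call(text, c(list(x = x, y = y, labels = labels), args))
  invisible(args)  # return the settings actually used
}

pdf(NULL)   # off-screen device, so the sketch also runs non-interactively
plot.new()  # empty plot with both axes running from 0 to 1
used <- draw_tick_labels(c(0.2, 0.5, 0.8), 0.5, c("0", "20", "40"),
                         cex = 0.6, font = 2)
invisible(dev.off())
print(used)
```

A plotting function written this way lets the caller shrink or embolden the labels without the author having to anticipate every option. A function whose internal text() calls ignore the dots cannot be adjusted this way from the outside.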

Printing plot depending on variable conditions on 2 pdf pages

I am trying to print a plot grouped by a variable with 12 levels. The plot is the result of a cluster classification of sequences, using OM distance.
I print this plot on one PDF page:
pdf("YYY.pdf", height=11,width=20)
seqIplot(XXX.seq, group=XXX$variable, cex.legend = 2, cex.plot = 1.5, border = NA, sortv =XXX.om)
dev.off()
But the output is too small, so I tried to print it on 2 pages, like this:
pdf("YYY.pdf", height=11,width=20)
seqIplot(XXX.seq, group=XXX$variable, variable="1":"6", cex.legend = 2, cex.plot = 1.5, border = NA, sortv =XXX.om)
seqIplot(XXX.seq, group=XXX$variable, variable="7":"12", cex.legend = 2, cex.plot = 1.5, border = NA, sortv = XXX.om)
dev.off()
But it doesn't work. Do you know how I can ask R to split the variable's levels into two groups, so as to print 6 graphics per PDF page?
The solution is to plot separately the subset of groups you want on each page. Here is an example using the biofam data provided by TraMineR. The group variable p02r04 is religious participation which takes 10 different values.
library(TraMineR)
data(biofam)
bs <- seqdef(biofam[,10:25])
group <- factor(biofam$p02r04)
lv <- levels(group)
sel <- (group %in% lv[1:6])
seqIplot(bs[sel,], group=group[sel], sortv="from.end", withlegend=FALSE)
seqIplot(bs[!sel,], group=group[!sel], sortv="from.end")
If you are sorting the index plot with a variable, you should of course take the same subset of the sort variable, e.g. sortv=XXX.om[sel] in your case.
I don't know if I understood your question. You could post some data to help us reproduce what you want. Maybe this helps: to plot six graphs on one page you should adjust the mfrow parameter. Is that what you wanted?
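For 12 groups, the same selection can be generated page by page rather than written out by hand. A base R sketch under assumptions: the factor group below is made up, and the seqIplot call is left as a comment because it needs the TraMineR objects from the question (XXX.seq, XXX.om).

```r
# Made-up grouping variable with 12 levels, 3 sequences per level
group <- factor(rep(1:12, each = 3))
lv <- levels(group)

# Split the levels into chunks of 6, one chunk per PDF page
pages <- split(lv, ceiling(seq_along(lv) / 6))

for (p in pages) {
  sel <- group %in% p
  # Between pdf(...) and dev.off(), each iteration would draw one page, e.g.:
  # seqIplot(XXX.seq[sel, ], group = group[sel], sortv = XXX.om[sel])
  cat("page with groups:", p, "- rows selected:", sum(sel), "\n")
}
```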
pdf("test.pdf")
par(mfrow=c(3,2))
plot(1:10, 21:30)
plot(1:10, 21:30, pch=20)
hist(rnorm(1000))
barplot(VADeaths)
...
dev.off()
