Remove outlier from five-number summary statistics - r

How can I stop the fivenum function from reporting outliers as my minimum/maximum values?
I want to be able to see the upper and lower whisker values on my boxplot.
My code:
boxplot(data$`Weight(g)`)
text(y=fivenum(data$`Weight(g)`),labels=fivenum(data$`Weight(g)`),x=1.25, title(main = "Weight(g)"))

boxplot returns a named-list that includes things you can use to remove outliers in your call to fivenum:
$out includes the literal outliers. It can be tempting to use setdiff(data$`Weight(g)`, bp$out) (with bp being the value returned by boxplot), but that may be prone to problems due to R FAQ 7.31 (floating-point equality), so I recommend against it; instead,
$stats includes the numbers used for the boxplot itself without the outliers. I suggest we work with this.
(BTW, title(.) does its work via side effect and its return value is not used by text(.), so I suggest you move that call out of text(.).)
Reproducible data/code:
vec <- c(1, 10:20, 30)
bp <- boxplot(vec)
str(bp)
# List of 6
# $ stats: num [1:5, 1] 10 12 15 18 20
# $ n : num 13
# $ conf : num [1:2, 1] 12.4 17.6
# $ out : num [1:2] 1 30
# $ group: num [1:2] 1 1
# $ names: chr "1"
five <- fivenum(vec[ vec >= min(bp$stats) & vec <= max(bp$stats)])
text(x=1.25, y=five, labels=five)
title("Weight(g)")
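If the goal is only to label the whiskers and box edges that boxplot actually draws, the values in $stats can serve directly as label positions. The following is a minimal sketch on the same toy vector, not the answerer's code; it uses boxplot.stats(), the base R helper that computes the same statistics without drawing anything. One thing to be aware of: fivenum on the filtered vector recomputes the hinges from the reduced data (12.5 and 17.5 here), so those labels can sit slightly off the drawn box edges (12 and 18).

```r
vec <- c(1, 10:20, 30)

# Same statistics that boxplot() computes, but without producing a plot
st <- boxplot.stats(vec)
st$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
st$out     # points flagged as outliers

# fivenum() on the filtered data recomputes the hinges from fewer points
five <- fivenum(vec[vec >= min(st$stats) & vec <= max(st$stats)])

# Label the plot with exactly the values that were drawn
bp <- boxplot(vec, main = "Weight(g)")
text(x = 1.25, y = bp$stats[, 1], labels = bp$stats[, 1])
```

Which set of numbers to print depends on whether the labels should describe the drawn box (use $stats) or the data with the outliers excluded (use fivenum on the filtered vector).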

Related

How do I matricise a column/vector (applying a function like sum/diff/boolean)? [duplicate]

I am trying to create a data.frame from which to create a graph. I have a function and two vectors that I want to use as the two inputs. This is a bit simplified, but basically all I have is:
relGPA <- seq(-1.5,1.5,.2)
avgGPA <- c(-2,0,2)
f <- function(relGPA, avgGPA) 1/(1+exp(sum(relGPA*pred.model$coef[1],avgGPA*pred.model$coef[2])))
and all I want is a data.frame with 3 columns for the avgGPA values, and 16 rows for the relGPA values with the resulting values in the cells.
I apologize for how basic this is, but I assure you I have tried to make this happen without your assistance. I have tried following the examples on the sapply and mapply man pages, but I'm just a little too new to R to see what I'm trying to do.
Thanks!
Cannot be tested with the information offered, but this should work:
expGPA <- outer(relGPA, avgGPA, FUN=f) # See below for way to make this "work"
Another useful function when you want to generate combinations is expand.grid and this would get you the "long form":
expGPA2 <- expand.grid(relGPA, avgGPA)
expGPA2$fn <- apply(expGPA2, 1, function(x) f(x[1], x[2])) # each row arrives as one vector, so pass its elements to f explicitly
The long form is what lattice and ggplot will expect as input format for higher level plotting.
EDIT: It may be necessary to construct a more specific method for passing column references to the function, as pointed out by djhurio and (solved) by Sam Swift with the Vectorize strategy. In the case of apply, the sum function would work out of the box as described above, but the division operator would not, so here is a further example that can be generalized to more complex functions with multiple arguments. Each row arrives in the function as a single vector x, so the arguments have to be picked out of x by column position (x[1], x[2]) rather than passed as separate named parameters:
> expGPA2$fn <- apply(expGPA2, 1, function(x) x[1]/x[2])
> str(expGPA2)
'data.frame': 48 obs. of 3 variables:
$ Var1: num -1.5 -1.3 -1.1 -0.9 -0.7 ...
$ Var2: num -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...
$ fn : num 0.75 0.65 0.55 0.45 0.35 ...
- attr(*, "out.attrs")=List of 2
..$ dim : int 16 3
..$ dimnames:List of 2
.. ..$ Var1: chr "Var1=-1.5" "Var1=-1.3" "Var1=-1.1" "Var1=-0.9" ...
.. ..$ Var2: chr "Var2=-2" "Var2= 0" "Var2= 2"
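The Vectorize strategy mentioned in the EDIT can be sketched as follows. This is an illustration rather than the code from the original thread: pred.model is never shown in the question, so the coefficient vector coefs below is a made-up stand-in, and fv and expGPA.df are names introduced only for this example.

```r
# Hypothetical coefficients standing in for pred.model$coef (not from the question)
coefs <- c(0.8, -0.3)

relGPA <- seq(-1.5, 1.5, 0.2)
avgGPA <- c(-2, 0, 2)

# The question's scalar function: sum() collapses everything to one number,
# so passing f to outer() directly fails with a dimension mismatch
f <- function(relGPA, avgGPA) 1 / (1 + exp(sum(relGPA * coefs[1], avgGPA * coefs[2])))

# Vectorize() wraps f in mapply(), so f is called once per (relGPA, avgGPA) pair
fv <- Vectorize(f)

expGPA <- outer(relGPA, avgGPA, FUN = fv)   # 16 x 3 matrix
expGPA.df <- as.data.frame(expGPA)          # one column per avgGPA value
```

Vectorize trades speed for convenience, since it still calls f once per cell; for a grid of 48 values that cost is negligible.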
Edit2: (2013-01-05) Looking at this a year later, I realized that SamSwift's function could be vectorized by making its body use "+" instead of sum:
1/(1+exp( relGPA*pred.model$coef[1] + avgGPA*pred.model$coef[2])) # all vectorized fns
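With the "+" form of the body, the function works elementwise, so both the wide matrix and the long form can be built without apply. A small sketch under the same caveat as the question: pred.model is not available, so coefs is an invented placeholder, and the column names given to expand.grid are chosen here for readability.

```r
# Hypothetical coefficients standing in for pred.model$coef (not from the question)
coefs <- c(0.8, -0.3)

relGPA <- seq(-1.5, 1.5, 0.2)
avgGPA <- c(-2, 0, 2)

# "+" is elementwise, so f now returns one value per input pair
f <- function(relGPA, avgGPA) 1 / (1 + exp(relGPA * coefs[1] + avgGPA * coefs[2]))

# Wide form: rows follow relGPA, columns follow avgGPA
wide <- outer(relGPA, avgGPA, FUN = f)

# Long form: one row per combination, with named columns instead of Var1/Var2
long <- expand.grid(relGPA = relGPA, avgGPA = avgGPA)
long$fn <- f(long$relGPA, long$avgGPA)
```

Because expand.grid varies its first argument fastest, the rows of long are in the same order as the column-major cells of wide.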

simulate observations and calculate sample autocorrelation

Simulating rk (r stands for autocorrelation) for {et} where each et is iid N(0,1).
R code: simulate 100 observations of {et} and calculate r1.
Here is my code so far:
x=rnorm(100,0,1)
x
y=ts(x)
trial_r1=acf(y)[1]
trial_r1
Is my code right? How do I get r1 after running acf()?
(I'll post as an answer, both to close the question, plus to help with searching for answers to similarly-structured questions.)
When looking for what you believe is but one part of a structured return, it's useful to look at the return value in detail. One common way to do this is with str:
set.seed(42)
x <- rnorm(100, mean = 0, sd = 1)
ret <- acf(ts(x))
str(ret)
## List of 6
## $ acf : num [1:21, 1, 1] 1 0.05592 -0.00452 0.03542 0.00278 ...
## $ type : chr "correlation"
## $ n.used: int 100
## $ lag : num [1:21, 1, 1] 0 1 2 3 4 5 6 7 8 9 ...
## $ series: chr "ts(x)"
## $ snames: NULL
## - attr(*, "class")= chr "acf"
In this instance, you'll see two clusters of numbers in $acf and $lag. The latter "clearly" is just an array of incrementing integers so is not that interesting in this endeavor, but the former looks more interesting. Since the result is ultimately just a list, you can use dollar-sign subsetting (or [[, over to you) to extract what you need:
ret$acf
## , , 1
## [,1]
## [1,] 1.000000e+00
## [2,] 5.592310e-02
## [3,] -4.524017e-03
## [4,] 3.541639e-02
## [5,] 2.784590e-03
## ...snip...
In the case of your question, you should notice that the first element of this 3-dimensional array is the perfectly-autocorrelated 1, but your first real autocorrelation of concern is the second element, or 0.0559. So your first value is attainable with ret$acf[2,,] (or more formally ret$acf[2,1,1]).
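As a sanity check on the indexing, the extracted value can be compared with the lag-1 sample autocorrelation computed by hand. This is a minimal sketch: r1.manual and xc are names introduced here, and plot = FALSE is passed only to suppress the correlogram.

```r
set.seed(42)
n <- 100
x <- rnorm(n, mean = 0, sd = 1)

ret <- acf(x, plot = FALSE)   # compute the autocorrelations without drawing them
r1 <- ret$acf[2, 1, 1]        # element 1 is lag 0, element 2 is lag 1

# Lag-1 sample autocorrelation by hand: lagged cross-products of the
# centered series divided by the sum of squares
xc <- x - mean(x)
r1.manual <- sum(xc[-1] * xc[-n]) / sum(xc^2)

all.equal(r1, r1.manual)
```

The same pattern generalizes to r_k by reading ret$acf[k + 1, 1, 1].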

combining and operating on matrices twice nested in a list

If xmpl is a list where each element has an integer age and a list data, where data contains three matrices of equal size, a to c,
what is the best way to do
cor( xmpl[[:]]$data[[:]][c('a','b','c')], xmpl[[:]]$age)
where the result would be a 3 x length(a) array or list that reflects age correlated with each instance of each element of a (row 1), b (row 2), and c (row 3) across xmpl?
I am reading in matrices that represent the output of different pipelines. There are 3 of these per subject and a whole lot of subjects. Currently, I've built a list of subjects that has among other things a list of pipeline matrices.
The structure looks like:
str(xmpl)
$ :List of 4
..$ id : int 5
..$ age : num 10
..$ data :List of 3
.. ..$ a: num [1:10, 1:10] 0.782 1.113 3.988 0.253 4.118 ...
.. ..$ b: num [1:10, 1:10] 5.25 5.31 5.28 5.43 5.13 ...
.. ..$ c: num [1:10, 1:10] 1.19e-05 5.64e-03 7.65e-01 1.65e-03 4.50e-01 ...
..$ otherdata: chr "ignorefornow"
#[...]
I want to correlate every element of a across all subjects with the age of subjects. Then do the same for b and c and put the results into a list.
I think I am approaching this in a way that is awkward for R. I'm interested in what the "R way" of storing and retrieving this data would be.
Data Structure and desired output http://dl.dropbox.com/u/56019781/linked/struct-2012-12-19.svg
library(plyr)
## example structure
xmpl.mat <- function(){ matrix(runif(100),nrow=10) }
xmpl.list <- function(x){ list( id=x, age=2*x, data=list( a=x*xmpl.mat(), b=x+xmpl.mat(), c=xmpl.mat()^x ), otherdata='ignorefornow' ) }
xmpl <- lapply( 1:5, xmpl.list )
## extract
ages <- laply(xmpl,'[[','age')
data <- llply(xmpl,'[[','data')
# to get the cor for one set of matrices is easy enough
# though it would be nice to do: a <- xmpl[[:]]$data$a
x.a <- sapply(data,'[[','a')
x.a.corr <- apply(x.a,1,cor,ages)
# ...
#xmpl.corr <- list(x.a.corr,x.b.corr,x.c.corr)
# and by loop, not R like?
xmpl.corr <- list()
for (i in seq_along(data[[1]])) {
  x <- sapply(data, '[[', i)
  xmpl.corr[[i]] <- apply(x, 1, cor, ages)
}
names(xmpl.corr) <- names(data[[1]])
Final output:
str(xmpl.corr)
List of 3
$ a: num [1:100] 0.712 -0.296 0.739 0.8 0.77 ...
$ b: num [1:100] 0.98 0.997 0.974 0.983 0.992 ...
$ c: num [1:100] -0.914 -0.399 -0.844 -0.339 -0.571 ...
Here's a solution. It should be short enough.
ages <- sapply(xmpl, "[[", "age") # extract ages
data <- sapply(xmpl, function(x) unlist(x[["data"]])) # combine all matrices
corr <- apply(data, 1, cor, ages) # calculate correlations
xmpl.corr <- split(corr, substr(names(corr), 1, 1)) # split the vector
Instead of x.a, x.b, x.c you would probably want to have all of these in one list.
# First, get a list of the items in data
abc <- names(xmpl[[1]]$data) # in case the variables change in the future
names(abc) <- abc # these names will be reused for the final list; use whichever names make sense
## use lapply to keep as list, use sapply to "simplify" the list
x.data.list <- lapply(abc, function(z)
sapply(xmpl, function(xm) c(xm$data[[z]])) )
ages <- sapply(xmpl, `[[`, "age")
# Then compute the correlations. Note that on each element of x.data.list we are apply'ing per row
correlations <- lapply(x.data.list, apply, 1, cor, ages)
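To see that the two answers produce the same numbers, both can be run on the example structure from the question and compared. This is only a sketch: the seed is arbitrary, and flat, by.split, by.lapply and agree are names introduced here.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible

# Example structure from the question
xmpl.mat  <- function() matrix(runif(100), nrow = 10)
xmpl.list <- function(x) list(id = x, age = 2 * x,
                              data = list(a = x * xmpl.mat(), b = x + xmpl.mat(), c = xmpl.mat()^x),
                              otherdata = "ignorefornow")
xmpl <- lapply(1:5, xmpl.list)
ages <- sapply(xmpl, `[[`, "age")

# Approach 1: flatten all matrices per subject, correlate row-wise, split by name prefix
flat <- sapply(xmpl, function(x) unlist(x[["data"]]))   # 300 x 5, rows named a1..a100, b1.., c1..
corr <- apply(flat, 1, cor, ages)
by.split <- split(corr, substr(names(corr), 1, 1))

# Approach 2: one 100 x 5 matrix per name, then correlate each row with ages
abc <- names(xmpl[[1]]$data)
names(abc) <- abc
by.lapply <- lapply(abc, function(z) {
  m <- sapply(xmpl, function(xm) c(xm$data[[z]]))
  apply(m, 1, cor, ages)
})

# TRUE when both approaches give the same correlations for a, b and c
agree <- all(mapply(function(u, v) isTRUE(all.equal(unname(u), v)), by.split, by.lapply))
```

The split on the first character works here because the matrix names are single, distinct letters; with longer or overlapping names the lapply version is the safer choice.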

How to make a plot from summaryRprof?

This is a question for a university assignment.
I was given three algorithms to calculate the GCD, which I have already implemented. My problem is getting the Rprof results into a plot so I can compare them side by side.
From what little I understand about Rprof, summaryRprof and plot, Rprof is used like this:
Rprof() # to start
# functions here
Rprof(NULL) # to end
summaryRprof() # to print results
I understand that plot accepts many different types of input: x and y values, and something called a data frame, which I assume is a fancy word for table. To draw different lines and other elements I need to use this: http://www.harding.edu/fmccown/r/
What I can't figure out is how to get the summaryRprof results into the plot() function.
> Rprof(filename="RProfOut2.out", interval=0.0001)
> gcdBruteForce(10000, 33)
[1] 1
> gcdEuclid(10000, 33)
[1] 1
> gcdPrimeFact(10000, 33)
[1] 1
> Rprof(NULL)
> summaryRprof()
?????plot????
I have been reading on Stack Overflow and other sites that I can also try profr and proftools, although I am not very clear on their usage.
The only graph I have been able to make is one using plot(system.time(gcdFunction(10,100)))
As always any help is appreciated.
There are two packages that visualize Rprof output:
profr by Hadley Wickham
proftools by Luke Tierney
On a Unix/OS X system with graphviz installed you can use a nice Perl script by Romain Francois that is on the R Wiki page on profiling.
All this is described with examples in my Intro to HPC with R tutorials you can find e.g. here.
The gcd functions must be something you've coded separately as they aren't a part of base R. Here is a contrived example to show you how to get the data into a usable format for further processing and plotting.
First, you need to pass summaryRprof() a filename to process. In your example, this would be summaryRprof("RProfOut2.out").
This will return the summary statistics for your previous code. Since we need to do further processing on these statistics, let's assign it to a new object:
sumStats <- summaryRprof("RProfOut2.out")
This returns a list object with 4 elements:
> str(sumStats)
List of 4
$ by.self :'data.frame': 2 obs. of 4 variables:
..$ self.time : num [1:2] 1.97 0.25
..$ self.pct : num [1:2] 88.7 11.3
..$ total.time: num [1:2] 1.97 0.25
..$ total.pct : num [1:2] 88.7 11.3
$ by.total :'data.frame': 3 obs. of 4 variables:
..$ total.time: num [1:3] 1.97 0.25 0
..$ total.pct : num [1:3] 88.7 11.3 0
..$ self.time : num [1:3] 1.97 0.25 0
..$ self.pct : num [1:3] 88.7 11.3 0
$ sample.interval: num 1e-04
$ sampling.time : num 2.22
At this point, I'm assuming you are going to be interested in one of the first two data.frames. I prefer the graphics in ggplot2 over base graphics, but you can certainly achieve most things with base graphics; I just have more experience with ggplot2. Here's one approach to plotting the data for the by.self data frame that was generated:
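library(ggplot2)
library(reshape2) # provides melt()
byself <- sumStats$by.self
byself$functions <- rownames(byself)
m <- melt(byself, id.vars = "functions")
# geom_col() draws bars from the values as given; qplot() is deprecated in recent ggplot2 releases
ggplot(m, aes(functions, value, fill = variable)) + geom_col(position = "dodge")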
That makes a plot that looks like this:
And for completeness' sake, here is the contrived code I made up for the example. I know it's not very creative, but it gets the job done:
Rprof("Rprof.out", interval = 0.0001)
x <- rnorm(10000000)
y <- x ^ 2
Rprof(NULL)
sumStats <- summaryRprof("Rprof.out")
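Since the question asked about base plot(), here is a hedged sketch of the same idea with base graphics only. The workload is another contrived stand-in for the gcd functions, which are not shown in the question; prof.file and the 0.01 s interval are choices made for this example.

```r
# Write the profile to a temporary file so nothing is left in the working directory
prof.file <- tempfile(fileext = ".out")

Rprof(prof.file, interval = 0.01)
x <- rnorm(5e6)   # contrived workload standing in for the gcd functions
y <- sort(x)
Rprof(NULL)

# by.self has one row per function; the row names are the function names
byself <- summaryRprof(prof.file)$by.self

# One bar per function, labelled with the row names
barplot(byself$self.time,
        names.arg = rownames(byself),
        las = 2,
        ylab = "self time (s)",
        main = "Rprof: time spent per function")
```

To compare several implementations side by side, one option is to profile each into its own file and bind the resulting by.self data frames together before plotting.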
