This is a question for a university assignment.
I was given three algorithms to calculate the GCD that I already did. My problem is getting the Rprof results to a plot so I can compare them side by side.
From what little I understand about Rprof, summaryRprof and plot, Rprof is used like this:
Rprof() #To start
#functions here
Rprof(NULL) #TO end
summaryRprof() # to print results
I understand that plot accepts many different types of input: x and y values, and something called a data frame, which I assume is a fancy word for a table. To draw different lines and other things I need to use this: http://www.harding.edu/fmccown/r/
What I can't figure out is how to get the summaryRprof results into the plot() function.
> Rprof(filename="RProfOut2.out", interval=0.0001)
> gcdBruteForce(10000, 33)
[1] 1
> gcdEuclid(10000, 33)
[1] 1
> gcdPrimeFact(10000, 33)
[1] 1
> Rprof(NULL)
> summaryRprof()
?????plot????
I have been reading on Stack Overflow and other sites that I can also try profr and proftools, although I am not very clear on their usage.
The only graph I have been able to make is one using plot(system.time(gcdFunction(10,100)))
As always any help is appreciated.
There are two packages that visualize Rprof output:
profr by Hadley Wickham
proftools by Luke Tierney
On a Unix/OS X system with graphviz installed you can use a nice Perl script by Romain Francois that is on the R Wiki page on profiling.
All this is described with examples in my Intro to HPC with R tutorials you can find e.g. here.
The gcd functions must be something you've coded separately as they aren't a part of base R. Here is a contrived example to show you how to get the data into a usable format for further processing and plotting.
First, you need to pass summaryRprof() a filename to process. In your example, this would be summaryRprof("RProfOut2.out").
This will return the summary statistics for your previous code. Since we need to do further processing on these statistics, let's assign it to a new object:
sumStats <- summaryRprof("RProfOut2.out")
This returns a list object with 4 elements:
> str(sumStats)
List of 4
$ by.self :'data.frame': 2 obs. of 4 variables:
..$ self.time : num [1:2] 1.97 0.25
..$ self.pct : num [1:2] 88.7 11.3
..$ total.time: num [1:2] 1.97 0.25
..$ total.pct : num [1:2] 88.7 11.3
$ by.total :'data.frame': 3 obs. of 4 variables:
..$ total.time: num [1:3] 1.97 0.25 0
..$ total.pct : num [1:3] 88.7 11.3 0
..$ self.time : num [1:3] 1.97 0.25 0
..$ self.pct : num [1:3] 88.7 11.3 0
$ sample.interval: num 1e-04
$ sampling.time : num 2.22
At this point, I'm assuming you are going to be interested in one of the first two data.frames. I prefer the graphics in ggplot2 over base graphics, but you can certainly achieve most things with base graphics; I just have more experience with ggplot2. Here's one approach to plotting the data for the by.self data frame that was generated:
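library(ggplot2)
library(reshape2) # provides melt()
byself <- sumStats$by.self
byself$functions <- rownames(byself) # function names are stored as row names
m <- melt(byself, id.vars = "functions") # long form: one row per function/statistic
ggplot(m, aes(functions, value, fill = variable)) + geom_col(position = "dodge")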
That makes a plot that looks like this:
And for completeness' sake, here is the contrived code I made up for the example. I know it is not very creative, but it gets the job done:
Rprof("Rprof.out", interval = 0.0001)
x <- rnorm(10000000)
y <- x ^ 2
Rprof(NULL)
sumStats <- summaryRprof("Rprof.out")
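For the side-by-side comparison asked about in the question, one option is to profile each function into its own file and plot the sampled times with base graphics. The following is only a sketch: gcdBruteForce and gcdEuclid here are hypothetical stand-ins (the asker's implementations are not shown, and the prime-factorization variant is left out), and profile_time and the repetition count are names and values made up for the illustration.

```r
# Hypothetical stand-ins for the asker's functions (the originals are not shown).
gcdBruteForce <- function(a, b) {
  for (d in min(a, b):1) {
    if (a %% d == 0 && b %% d == 0) return(d)
  }
}
gcdEuclid <- function(a, b) {
  while (b != 0) {
    r <- a %% b
    a <- b
    b <- r
  }
  a
}

# Profile repeated calls of one function into its own file and
# return the total sampled time reported by summaryRprof().
profile_time <- function(fun, reps) {
  out <- tempfile(fileext = ".out")
  Rprof(out, interval = 0.001)
  for (i in seq_len(reps)) fun(10000, 33)
  Rprof(NULL)
  # summaryRprof() may signal an error when no samples were recorded,
  # so fall back to 0 in that case.
  tryCatch(summaryRprof(out)$sampling.time, error = function(e) 0)
}

times <- c(bruteForce = profile_time(gcdBruteForce, 20000),
           euclid     = profile_time(gcdEuclid, 20000))

# A named numeric vector is all barplot() needs for a side-by-side view.
barplot(times, ylab = "sampled time (s)", main = "Rprof sampling time")
```

A single call to a fast function usually finishes between two samples, which is why each function is called many times here.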
How can I force fivenum function to not put outliers as my maximum/minimum values?
I want to be able to see the upper and lower whisker numbers on my boxplot.
My code:
boxplot(data$`Weight(g)`)
text(y=fivenum(data$`Weight(g)`),labels=fivenum(data$`Weight(g)`),x=1.25, title(main = "Weight(g)"))
boxplot returns a named-list that includes things you can use to remove outliers in your call to fivenum:
$out includes the literal outliers. It can be tempting to use setdiff(data$`Weight(g)`, bp$out), where bp is the value returned by boxplot, but that may be prone to problems due to R FAQ 7.31 (and floating-point equality), so I recommend against this; instead,
$stats includes the numbers used for the boxplot itself without the outliers. I suggest we work with this.
(BTW, title(.) does its work via side-effect and is not used by text(.); I suggest you move that call out of text(.).)
Reproducible data/code:
vec <- c(1, 10:20, 30)
bp <- boxplot(vec)
str(bp)
# List of 6
# $ stats: num [1:5, 1] 10 12 15 18 20
# $ n : num 13
# $ conf : num [1:2, 1] 12.4 17.6
# $ out : num [1:2] 1 30
# $ group: num [1:2] 1 1
# $ names: chr "1"
five <- fivenum(vec[ vec >= min(bp$stats) & vec <= max(bp$stats)])
text(x=1.25, y=five, labels=five)
title("Weight(g)")
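As a variation on the code above, the values in bp$stats can be used as labels directly, so the numbers printed are exactly the ones the box was drawn with. This is a sketch on the same toy vector; the names drawn, whiskers and recomputed are made up for the illustration.

```r
vec <- c(1, 10:20, 30)
bp <- boxplot(vec, main = "Weight(g)")

# Column 1 of bp$stats: lower whisker, lower hinge, median,
# upper hinge, upper whisker, as drawn for the first (only) group.
drawn <- bp$stats[, 1]
text(x = 1.25, y = drawn, labels = drawn)

# Just the whisker ends, if that is all that is needed.
whiskers <- drawn[c(1, 5)]

# For comparison: fivenum() on the outlier-free data recomputes the hinges,
# so they need not match the box that was actually drawn.
recomputed <- fivenum(vec[vec >= min(bp$stats) & vec <= max(bp$stats)])
```

With this vector the whiskers agree either way, but the recomputed hinges are 12.5 and 17.5 while the box is drawn at 12 and 18, because boxplot derives its hinges from the full data including the outliers.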
I am trying to create a data.frame from which to create a graph. I have a function and two vectors that I want to use as the two inputs. This is a bit simplified, but basically all I have is:
relGPA <- seq(-1.5,1.5,.2)
avgGPA <- c(-2,0,2)
f <- function(relGPA, avgGPA) 1/(1+exp(sum(relGPA*pred.model$coef[1],avgGPA*pred.model$coef[2])))
and all I want is a data.frame with 3 columns for the avgGPA values, and 16 rows for the relGPA values with the resulting values in the cells.
I apologize for how basic this is, but I assure you I have tried to make this happen without your assistance. I have tried following the examples on the sapply and mapply man pages, but I'm just a little too new to R to see what I'm trying to do.
Thanks!
Cannot be tested with the information offered, but this should work:
expGPA <- outer(relGPA, avgGPA, FUN=f) # See below for way to make this "work"
Another useful function when you want to generate combinations is expand.grid and this would get you the "long form":
expGPA2 <-expand.grid(relGPA, avgGPA)
expGPA2$fn <- apply(expGPA2, 1, f)
The long form is what lattice and ggplot will expect as input format for higher level plotting.
EDIT: It may be necessary to construct a more specific method for passing column references to the function, as pointed out by djhurio and solved by Sam Swift with the Vectorize strategy. In the case of apply, the sum function would work out of the box as described above, but the division operator would not, so here is a further example that can be generalized to more complex functions with multiple arguments. All the programmer needs is the position of the column for the appropriate argument in the apply()-ed function, because each row is passed as a single vector x rather than as separate named arguments:
> expGPA2$fn <- apply(expGPA2, 1, function(x) x[1]/x[2])
> str(expGPA2)
'data.frame': 48 obs. of 3 variables:
$ Var1: num -1.5 -1.3 -1.1 -0.9 -0.7 ...
$ Var2: num -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...
$ fn : num 0.75 0.65 0.55 0.45 0.35 ...
- attr(*, "out.attrs")=List of 2
..$ dim : int 16 3
..$ dimnames:List of 2
.. ..$ Var1: chr "Var1=-1.5" "Var1=-1.3" "Var1=-1.1" "Var1=-0.9" ...
.. ..$ Var2: chr "Var2=-2" "Var2= 0" "Var2= 2"
Edit2: (2013-01-05) Looking at this a year later, I realized that SamSwift's function could be vectorized by making its body use "+" instead of sum:
1/(1+exp( relGPA*pred.model$coef[1] + avgGPA*pred.model$coef[2])) # all vectorized fns
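To make the difference between the sum() version and the "+" version concrete, here is a small self-contained sketch. pred.model is not available in the question, so the coefficient vector b below is a made-up placeholder, and f_sum, f_vec, wide and long are names chosen for the illustration.

```r
relGPA <- seq(-1.5, 1.5, .2)
avgGPA <- c(-2, 0, 2)
b <- c(0.8, -0.3)  # placeholder for pred.model$coef[1:2] (assumed values)

# Original form: sum() collapses its inputs to one number, so it only
# works for one (relGPA, avgGPA) pair at a time.
f_sum <- function(relGPA, avgGPA) 1 / (1 + exp(sum(relGPA * b[1], avgGPA * b[2])))

# Vectorized form: "+" works element-wise, so outer() can call it directly.
f_vec <- function(relGPA, avgGPA) 1 / (1 + exp(relGPA * b[1] + avgGPA * b[2]))

expGPA <- outer(relGPA, avgGPA, FUN = f_vec)              # 16 x 3 matrix
expGPA_sum <- outer(relGPA, avgGPA, FUN = Vectorize(f_sum))  # same values via Vectorize

# Wide form: one column per avgGPA value, one row per relGPA value.
wide <- as.data.frame(expGPA)
names(wide) <- paste0("avgGPA=", avgGPA)

# Long form for lattice/ggplot: one row per combination.
long <- expand.grid(relGPA = relGPA, avgGPA = avgGPA)
long$fn <- f_vec(long$relGPA, long$avgGPA)
```

Naming the columns in expand.grid avoids having to refer to them by position later.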
Suppose xmpl is a list where each element has an integer age and a list data, where data contains three matrices of equal size, a to c.
What is the best way to do
cor( xmpl[[:]]$data[[:]][c('a','b','c')], xmpl[[:]]$age)
where the results would be 3 x length(a) array or list that reflects age correlated with each instance of each element of a (row 1), b (row 2), and c (row 3) across xmpl.
I am reading in matrices that represent the output of different pipelines. There are 3 of these per subject and a whole lot of subjects. Currently, I've built a list of subjects that has among other things a list of pipeline matrices.
The structure looks like:
str(xmpl)
$ :List of 4
..$ id : int 5
..$ age : num 10
..$ data :List of 3
.. ..$ a: num [1:10, 1:10] 0.782 1.113 3.988 0.253 4.118 ...
.. ..$ b: num [1:10, 1:10] 5.25 5.31 5.28 5.43 5.13 ...
.. ..$ c: num [1:10, 1:10] 1.19e-05 5.64e-03 7.65e-01 1.65e-03 4.50e-01 ...
..$ otherdata: chr "ignorefornow"
#[...]
I want to correlate every element of a across all subjects with the age of subjects. Then do the same for b and c and put the results into a list.
I think I am approaching this in a way that is awkward for R. I'm interested in what the "R way" of storing and retrieving this data would be.
Data Structure and desired output http://dl.dropbox.com/u/56019781/linked/struct-2012-12-19.svg
library(plyr)
## example structure
xmpl.mat <- function(){ matrix(runif(100),nrow=10) }
xmpl.list <- function(x){ list( id=x, age=2*x, data=list( a=x*xmpl.mat(), b=x+xmpl.mat(), c=xmpl.mat()^x ), otherdata='ignorefornow' ) }
xmpl <- lapply( 1:5, xmpl.list )
## extract
ages <- laply(xmpl,'[[','age')
data <- llply(xmpl,'[[','data')
# to get the cor for one set of matrices is easy enough
# though it would be nice to do: a <- xmpl[[:]]$data$a
x.a <- sapply(data,'[[','a')
x.a.corr <- apply(x.a,1,cor,ages)
# ...
#xmpl.corr <- list(x.a.corr,x.b.corr,x.c.corr)
# and by loop, not R like?
xmpl.corr<-list()
for (i in 1:length(names(data[[1]])) ){
x <- sapply(data,'[[',i)
xmpl.corr[[i]] <- apply(x,1,cor,ages)
}
names(xmpl.corr) <- names(data[[1]])
Final output:
str(xmpl.corr)
List of 3
$ a: num [1:100] 0.712 -0.296 0.739 0.8 0.77 ...
$ b: num [1:100] 0.98 0.997 0.974 0.983 0.992 ...
$ c: num [1:100] -0.914 -0.399 -0.844 -0.339 -0.571 ...
Here's a solution. It should be short enough.
ages <- sapply(xmpl, "[[", "age") # extract ages
data <- sapply(xmpl, function(x) unlist(x[["data"]])) # combine all matrices
corr <- apply(data, 1, cor, ages) # calculate correlations
xmpl.corr <- split(corr, substr(names(corr), 1, 1)) # split the vector
Instead of x.a, x.b, x.c you would probably want to have all of these in one list.
# First, get a list of the items in data
abc <- names(xmpl[[1]]$data) # in case variables change in future
names(abc) <- abc # these are the same names that will be used for the final list. You can use whichever names make sense
## use lapply to keep as list, use sapply to "simplify" the list
x.data.list <- lapply(abc, function(z)
sapply(xmpl, function(xm) c(xm$data[[z]])) )
ages <- sapply(xmpl, `[[`, "age")
# Then compute the correlations. Note that on each element of x.data.list we are apply'ing per row
correlations <- lapply(x.data.list, apply, 1, cor, ages)
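Another way to write the same computation, sketched here with base R only, relies on cor() accepting a matrix: each matrix element becomes a column, and all 100 correlations with age come from one call. The example data follows the question's generator (without plyr); set.seed and the helper name cor_one are additions for the illustration.

```r
set.seed(1)  # only so the example is reproducible

## example structure, as in the question
xmpl.mat <- function() matrix(runif(100), nrow = 10)
xmpl.list <- function(x) list(
  id = x, age = 2 * x,
  data = list(a = x * xmpl.mat(), b = x + xmpl.mat(), c = xmpl.mat()^x),
  otherdata = "ignorefornow"
)
xmpl <- lapply(1:5, xmpl.list)

ages <- vapply(xmpl, function(s) s$age, numeric(1))

# For one matrix name: flatten each subject's matrix into a column
# (100 x n_subjects), then correlate every element with age in one call.
cor_one <- function(nm) {
  m <- vapply(xmpl, function(s) as.vector(s$data[[nm]]), numeric(100))
  drop(cor(t(m), ages))  # t(m): subjects in rows, matrix elements in columns
}

# simplify = FALSE keeps a list; the names a, b, c are carried over.
xmpl.corr <- sapply(names(xmpl[[1]]$data), cor_one, simplify = FALSE)
str(xmpl.corr)
```

vapply is used instead of sapply so that a subject with a matrix of the wrong size fails loudly rather than silently changing the shape of the result.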
I have two large data frames, a and b for which identical(a,b) is TRUE, as is all.equal(a,b), but identical(digest(a),digest(b)) is FALSE. What could cause this?
What's more, I tried to dig in deeper, by applying digest to bunches of rows. Incredibly, at least to me, there is agreement in the digest values on sub-frames all the way to the last row of the data frames.
Here is a sequence of comparisons:
> identical(a, b)
[1] TRUE
> all.equal(a, b)
[1] TRUE
> digest(a)
[1] "cac56b06078733b6fb520442e5482684"
> digest(b)
[1] "fdd5ab78ca961982d195f800e3cf60af"
> digest(a[1:nrow(a),])
[1] "e44f906723405756509a6b17b5949d1a"
> digest(b[1:nrow(b),])
[1] "e44f906723405756509a6b17b5949d1a"
Every method I can think of indicates these two objects are identical, but their digest values are different. Is there something else about data frames that can produce such discrepancies?
For further details: the objects are about 10M rows x 12 columns. Here's the output of str():
'data.frame': 10056987 obs. of 12 variables:
$ V1 : num 1 11 21 31 41 61 71 81 91 101 ...
$ V2 : num 1 1 1 1 1 1 1 1 1 1 ...
$ V3 : num 2 3 2 3 4 5 2 4 2 4 ...
$ V4 : num 1 1 1 1 1 1 1 1 1 1 ...
$ V5 : num 1.8 2.29 1.94 2.81 3.06 ...
$ V6 : num 0.0653 0.0476 0.0324 0.034 0.0257 ...
$ V7 : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
$ V8 : num 0.00653 0.00476 0.00324 0.0034 0.00257 ...
$ V9 : num 1.8 2.3 1.94 2.81 3.06 ...
$ V10: num 0.1957 0.7021 0.0604 0.1866 0.9371 ...
$ V11: num 1704 1554 1409 1059 1003 ...
$ V12: num 23309 23309 23309 23309 23309 ...
> print(object.size(a), units = "Mb")
920.7 Mb
Update 1: On a whim, I converted these to matrices. The digests are the same.
> aM = as.matrix(a)
> bM= as.matrix(b)
> identical(aM,bM)
[1] TRUE
> digest(aM)
[1] "c5147d459ba385ca8f30dcd43760fc90"
> digest(bM)
[1] "c5147d459ba385ca8f30dcd43760fc90"
I then tried converting back to a data frame, and the digest values are equal (and equal to the previous value for a).
> aMF = as.data.frame(aM)
> bMF = as.data.frame(bM)
> digest(aMF)
[1] "cac56b06078733b6fb520442e5482684"
> digest(bMF)
[1] "cac56b06078733b6fb520442e5482684"
So, b looks like the bad boy, and it has a colorful past. b came from a much bigger data frame, say B. I took only the columns of B that appeared in a and checked to see if they were equal. Well, they were equal, but had different digests. I converted the column names (from "InformativeColumnName1" to "V1", etc.), just to avoid any issues that might arise - though all.equal and identical tend to point out when column names differ.
Since I am working on two different programs and don't have simultaneous access to a and b, it is easiest for me to use the digest values to check the calculations. However, something seems to be odd in how I extract columns from a data frame and then apply digest() to it.
ANSWER:
It turns out, to my astonishment (dismay, horror, embarrassment, you name it), identical is very forgiving about attributes. I had assumed that only all.equal was forgiving about attributes.
This was discovered via Tommy's suggestion identical(d1, d2, attrib.as.set=FALSE). Running attributes(a) is a bad, bad idea: the deluge of row names took a while before Ctrl-C could interrupt it. Here is the output of names(attributes()):
> names(attributes(a))
[1] "names" "row.names" "class"
> names(attributes(b))
[1] "names" "class" "row.names"
They're in different orders! Kudos to digest() for being straight with me.
UPDATE
To aid others with this problem, it seems that simply rearranging the attributes will be adequate to get identical hash values. Since tinkering with attribute orders is new to me, this may break something, but it works in my case. Note that it is a little time consuming if the objects are big; I'm not aware of a faster method for doing this. (I'm also looking to move to using matrices or data tables instead of data frames, and this may be another incentive to avoid data frames.)
tmpA0 = attributes(a)
tmpA1 = tmpA0[sort(names(tmpA0))]
a2 = a
attributes(a2) = tmpA1
tmpB0 = attributes(b)
tmpB1 = tmpB0[sort(names(tmpB0))]
b2 = b
attributes(b2) = tmpB1
digest(a2) # e04e624692d82353479efbd713ec03f6
digest(b2) # e04e624692d82353479efbd713ec03f6
identical(b,b2, attrib.as.set = FALSE) # FALSE
identical(b,b2, attrib.as.set = TRUE) # TRUE
identical(a2,b2, attrib.as.set = FALSE) # TRUE
Without having the actual data.frames it is of course hard to know, but one difference could be the order of the attributes. identical ignores that by default, but setting attrib.as.set=FALSE can change that:
d1 <- structure(1, foo=1, bar=2)
d2 <- structure(1, bar=2, foo=1)
identical(d1, d2) # TRUE
identical(d1, d2, attrib.as.set=FALSE) # FALSE
Our digest package uses the internal R function serialize() to get what we feed to the hash-generating functions (md5, sha1, ...).
So I strongly suspect that something like an attribute differs between the two objects. Until you can construct something reproducible that does not depend on your 1e7 x 12 data set, there is little we can do.
Also, the digest() function can output intermediate results and (as of the recent 0.5.1 version) even raw vectors. That may help.
Lastly, you can always contact us (as the package maintainers / authors) off-line, which happens to be the recommended way within R land, the popularity of StackOverflow notwithstanding.
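Since digest() works on the output of serialize(), the effect of attribute order can be seen on a tiny object with base R alone, without the digest package. This is a sketch; sort_attrs is a hypothetical helper that applies the same name-sorting idea as the workaround in the question's update.

```r
d1 <- structure(1, foo = 1, bar = 2)
d2 <- structure(1, bar = 2, foo = 1)

# identical() compares attributes as a set by default ...
same_default <- identical(d1, d2)
# ... but can be told to respect their order.
same_ordered <- identical(d1, d2, attrib.as.set = FALSE)

# The serialized bytes, which are what gets hashed, keep the attribute order.
same_bytes <- identical(serialize(d1, NULL), serialize(d2, NULL))

# Hypothetical helper: rewrite the attributes in sorted-name order.
sort_attrs <- function(x) {
  a <- attributes(x)
  attributes(x) <- a[sort(names(a))]
  x
}

same_after <- identical(serialize(sort_attrs(d1), NULL),
                        serialize(sort_attrs(d2), NULL))

print(c(default = same_default, ordered = same_ordered,
        bytes = same_bytes, after_sorting = same_after))
```

Comparing the raw serialize() output this way is also a quick check for whether two objects that pass identical() will hash to the same value.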