Access to grouping variable from within d_ply - r

I just discovered the great plyr package and am taking it for a spin.
A question I have is the following: is there some way to access the grouping variables from within d_ply?
Say I have a dataframe df with columns x,y,z, and I would like to plot for each z the values x versus y. If I do the following:
plotxy = function(df, ...) {plot(df$x, df$y, ...)}
d_ply(df, .(z), plotxy(df, main=.(z)))
then the titles that show up on the plots are all "z", and not the values of the z variable. Is there a way to access those values from within d_ply?
EDIT: As @Justin pointed out, the above formulation is wrong because I am passing the whole of df to plotxy. Hence the line
d_ply(df, .(z), plotxy(df, main=.(z)))
should be
d_ply(df, .(z), plotxy, main=.(z))
in order to make sense in terms of my original question (I guess that's also what @joran was hinting at).
However, I realized something else. Even though df gets sliced along z by d_ply, the sub-dataframe that the function receives still has a z column -- simply with always the same value. Hence the problem can apparently be solved as follows:
plotxy = function(df, ...) {plot(df$x, df$y, main=as.character(df$z[1]), ...)}
d_ply(df, .(z), plotxy)

By way of example, I'll expand on Joran's concern.
df <- data.frame(x=rnorm(100), y=rnorm(100), z=letters[1:10])
Let's use your function and see what we get without plyr:
plotxy(df, main=.(z))
versus the maybe more expected(?):
plotxy(df, main=df$z)
However, in your code you are splitting your data frame on z and then sending the whole data.frame df to your function again. Instead, you could write a wrapper function:
d_ply(df, .(z), function(ply.df) plotxy(ply.df, main=unique(ply.df$z)))
This way the plotxy function is only seeing the smaller split data.frame ply.df that you pass through the wrapper function.
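To see that each piece really does carry its own z value, here is a minimal sketch of the same split-and-plot idea in base R. Note that it swaps d_ply for split() so it runs without plyr; the names pieces and key are made up for illustration.

```r
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = letters[1:10])

# Same plotting helper as in the question
plotxy <- function(d, ...) plot(d$x, d$y, ...)

# split() returns a named list with one data frame per value of z;
# every piece still contains the z column, holding a single repeated value
pieces <- split(df, df$z)

# Use the list name (the group value) as the plot title
for (key in names(pieces)) {
  plotxy(pieces[[key]], main = key)
}
```

With plyr, the wrapper function above plays the role of the loop body: it receives one piece at a time and can read the group value from that piece's z column.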

Related

How to transfer multiple columns into numeric & find correlation coefficients

I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I run them, I get the warning "NAs introduced by coercion". This wouldn't be the end of the world (I also tried suppressWarnings()). The bigger problem is that my function does not seem to convert the columns into the numeric type I need for cor(): I get the "'y' must be numeric" error when I run it. I tried putting several arguments with and without '' or "", without success.
When I run str(cor.condition.cols) I only get character strings, which makes me think that my function somehow misuses as.numeric. Any suggestions for how else I could iterate through these columns and convert them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
# convert column by column; as.numeric() on a whole matrix would drop its dimensions
res2 <- sapply(res[cor.condition.cols], as.numeric)
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
Or possibly the shorter variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
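For a self-contained check of the approach, here is a small sketch with made-up data. The values are random and only three gene columns are used; the real res.sav will of course look different.

```r
set.seed(42)

# Made-up stand-in for the real data: Condition plus three gene columns stored as text
res <- data.frame(Condition = rnorm(30))
gene_cols <- paste0("Genes", 1:3, "_Acc4")
for (nm in gene_cols) {
  res[[nm]] <- as.character(round(rnorm(30), 2))
}

# Convert the selected columns to numeric in place, one column at a time
res[gene_cols] <- lapply(res[gene_cols], as.numeric)

# One coefficient per gene column, returned as a 3 x 1 matrix
cors <- cor(res[gene_cols], res$Condition, use = "complete.obs")
cors
```

The key point is that the columns are selected by name with res[gene_cols]; a string such as "res$Genes1_Acc4" is only text and is never evaluated as a column reference.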

Quantiles of a data.frame

I have a data.frame whose columns I'd like to calculate quantiles for:
tert <- c(0:3)/3
data <- dbGetQuery(dbCon, "SELECT * FROM tablename")
quans <- mapply(quantile, data, probs=tert, name=FALSE)
But the result only contains the last element of the list that quantile returns, not the whole result. I also get the warning "longer argument not a multiple of length of shorter". How can I modify my code to make it work?
PS: The function alone works like a charm, so I could use a for loop:
quans <- quantile(a$fileName, probs=tert, name=FALSE)
PPS: What also works is not specifying probs
quans <- mapply(quantile, data, name=FALSE)
The problem is that mapply is trying to apply the given function to each of the elements of all of the specified arguments in sequence. Since you only want to do this for one argument, you should use lapply, not mapply:
lapply(data, quantile, probs=tert, name=FALSE)
Alternatively, you can still use mapply but specify the arguments that are not to be looped over in the MoreArgs argument.
mapply(quantile, data, MoreArgs=list(probs=tert, name=FALSE))
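The difference is easy to see on a small example. This is a sketch using four columns of the built-in mtcars data in place of the database table; the object names paired, full and full2 are made up.

```r
tert <- (0:3) / 3
dat <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# mapply() walks over dat and probs in parallel: column i is paired with tert[i],
# so each call returns a single quantile
paired <- mapply(quantile, dat, probs = tert, names = FALSE)

# sapply() iterates over the columns only and hands the full probs vector to each call
full <- sapply(dat, quantile, probs = tert, names = FALSE)

# MoreArgs keeps probs out of the element-wise iteration, giving the same result
full2 <- mapply(quantile, dat, MoreArgs = list(probs = tert, names = FALSE))

paired  # one value per column
full    # four quantiles per column
```

Here the number of columns happens to equal the number of probabilities, so no recycling warning appears; with a different number of columns, mapply recycles probs and warns.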
I finally found a workaround which I don't like but which kind of works. Perhaps someone can tell me the right way to do it:
q <- function(x) { quantile(x, probs=c(0:3)/3, names=FALSE) }
mapply(q, data)
This works, but I have no idea what the difference is.

Using dataframe column names as chart label in Apply

I want to create a series of x-y scatter charts, where y is always the same variable and x is each of the variables I want to check for correlation with it. As an example, let's use the mtcars dataset.
I am relatively new to R but getting better.
The code below works, and the list charts contains all the charts, except that the X axis is labelled "x" and I want it to be the name of the variable. I tried numerous combinations of xlab= and cannot get it to work.
If I use names(data) I see the names I want to use. I guess I want to reference the first element of names(data) in the first iteration of apply, the second in the second iteration, etc. How can I do that?
The next step would be to print them together in a lattice. I assume an lapply or sapply with the print function will do the trick. I'd appreciate ideas for this too; pointers are enough, I do not need a full solution.
load(mtcars)
mypanel <- function(x,y,...) {
panel.xyplot(x,data[,y],...)
panel.grid(x=-1,y=-1)
panel.lmline(x,y,col="red",lwd=1,lty=1)
}
data <- mtcars[,2:11]
charts <- apply(data,2,function(x) xyplot (mtcars[,1] ~ x, panel=mypanel,ylab="MPG"))
This all started because I was not able to use the panel function to cycle.
The code as posted did not work for me, so I modified it as follows:
mypanel <- function(x,y,...) {
panel.xyplot(x, y, ...)
panel.grid(x=-1, y=-1)
panel.lmline(x,y,col="red",lwd=1,lty=1)
}
data <- mtcars[,2:11]
charts <- lapply(names(data), function(x) { xyplot (mtcars[,1] ~ mtcars[,x],
panel=mypanel,ylab="MPG", xlab=x)})
Needed to remove the 'data[,y]' from the panel function and pass names instead of column vectors so there was something to use for a x-label.
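A related sketch, in case it helps: building the formula from each column name and passing data= lets lattice see the variable name directly. It assumes the lattice package is available; predictors and charts are just illustrative names, and type = c("p", "r") is used here instead of a custom panel function to draw points plus a regression line.

```r
library(lattice)

predictors <- names(mtcars)[2:11]

# reformulate("cyl", response = "mpg") builds the formula mpg ~ cyl
charts <- lapply(predictors, function(v) {
  xyplot(reformulate(v, response = "mpg"), data = mtcars,
         type = c("p", "r"),   # points plus a regression line
         xlab = v, ylab = "MPG")
})
names(charts) <- predictors
```

Each element is a trellis object, so print(charts[["wt"]]) draws a single chart, and looping over the list with print draws them one after another.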

Getting strings recognized as variable names in R

Related: Strings as variable references in R
Possibly related: Concatenate expressions to subset a dataframe
I've simplified the question per the comment request. Here goes with some example data.
dat <- data.frame(num=1:10,sq=(1:10)^2,cu=(1:10)^3)
set1 <- subset(dat,num>5)
set2 <- subset(dat,num<=5)
Now, I'd like to make a bubble plot from these. I have a more complicated data set with 3+ colors and complicated subsets, but I do something like this:
symbols(set1$sq,set1$cu,circles=set1$num,bg="red")
symbols(set2$sq,set2$cu,circles=set2$num,bg="blue",add=T)
I'd like to do a for loop like this:
colors <- c("red","blue")
sets <- c("set1","set2")
vars <- c("sq","cu","num")
for (i in 1:length(sets)) {
symbols(sets[[i]][,sq],sets[[i]][,cu],circles=sets[[i]][,num],
bg=colors[[i]],add=T)
}
I know you can have a variable evaluated to specify the column (like var="cu"; set1[,var]). I want to know how to get a variable to specify the data.frame itself (and another to specify the column).
Update: Ran across this post on r-bloggers which has this example:
x <- 42
eval(parse(text = "x"))
[1] 42
I'm able to do something like this now:
eval(parse(text=paste(set[[1]],"$",var1,sep="")))
In fiddling with this, I'm finding it interesting that the following are not equivalent:
vars <- data.frame("var1","var2")
eval(parse(text=paste(set[[1]],"$",var1,sep="")))
eval(parse(text=paste(set[[1]],"[,vars[[1]]]",sep="")))
I actually have to do this:
eval(parse(text=paste(set[[1]],"[,as.character(vars[[1]])]",sep="")))
Update2: The above works to output values... but not in trying to plot. I can't do:
for (i in 1:length(set)) {
symbols(eval(parse(text=paste(set[[i]],"$",var1,sep=""))),
eval(parse(text=paste(set[[i]],"$",var2,sep=""))),
circles=paste(set[[i]],".","circles",sep=""),
fg="white",bg=colors[[i]],add=T)
}
I get invalid symbol coordinates. I checked the class of set[[1]] and it's a factor. If I do is.numeric(as.numeric(set[[1]])) I get TRUE. Even if I add that above prior to the eval statement, I still get the error. Oddly, though, I can do this:
set.xvars <- as.numeric(eval(parse(text=paste(set[[i]],"$",var1,sep=""))))
set.yvars <- as.numeric(eval(parse(text=paste(set[[i]],"$",var2,sep=""))))
symbols(xvars,yvars,circles=data$var3)
Why different behavior when stored as a variable vs. executed within the symbol function?
You found one answer, i.e. eval(parse()) . You can also investigate do.call() which is often simpler to implement. Keep in mind the useful as.name() tool as well, for converting strings to variable names.
The basic answer to the question in the title is eval(as.symbol(variable_name_as_string)) as Josh O'Brien uses. e.g.
var.name = "x"
assign(var.name, 5)
eval(as.symbol(var.name)) # outputs 5
Or more simply:
get(var.name) # 5
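Applied to the loop in the question, a sketch along these lines might do what was intended. It uses mget() from base R, which looks up several objects by name at once; the names xvar, yvar, rvar and set_list are made up here, and inches = 0.3 is an arbitrary size choice.

```r
dat <- data.frame(num = 1:10, sq = (1:10)^2, cu = (1:10)^3)
set1 <- subset(dat, num > 5)
set2 <- subset(dat, num <= 5)

sets <- c("set1", "set2")
cols <- c("red", "blue")
xvar <- "sq"; yvar <- "cu"; rvar <- "num"

# mget() turns a character vector of object names into a named list of the objects
set_list <- mget(sets)

# Empty plot covering the full range, then one layer of bubbles per set
plot(range(dat[[xvar]]), range(dat[[yvar]]), type = "n", xlab = xvar, ylab = yvar)
for (i in seq_along(set_list)) {
  d <- set_list[[i]]
  # d[[xvar]] selects a column by a name held in a string
  symbols(d[[xvar]], d[[yvar]], circles = d[[rvar]],
          inches = 0.3, bg = cols[i], add = TRUE)
}
```

Using [[ ]] with a string covers the "column named by a variable" part, so no eval(parse()) is needed for either the data frame or the column.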
Without any example data, it really is difficult to know exactly what you are wanting. For instance, I can't at all divine what your object set (or is it sets) looks like.
That said, does the following help at all?
set1 <- data.frame(x = 4:6, y = 6:4, z = c(1, 3, 5))
plot(1:10, type="n")
XX <- "set1"
with(eval(as.symbol(XX)), symbols(x, y, circles = z, add=TRUE))
EDIT:
Now that I see your real task, here is a one-liner that'll do everything you want without requiring any for() loops:
with(dat, symbols(sq, cu, circles = num,
bg = c("blue", "red")[(num>5) + 1]))  # num > 5 is red, as in your set1
The one bit of code that may feel odd is the bit specifying the background color. Try out these two lines to see how it works:
c(TRUE, FALSE) + 1
# [1] 2 1
c("red", "blue")[c(F, F, T, T) + 1]
# [1] "red" "red" "blue" "blue"
If you want to use a string as a variable name, you can use assign:
var1="string_name"
assign(var1, c(5,4,5,6,7))
string_name
[1] 5 4 5 6 7
Subsetting the data and combining them back is unnecessary. So are loops since those operations are vectorized. From your previous edit, I'm guessing you are doing all of this to make bubble plots. If that is correct, perhaps the example below will help you. If this is way off, I can just delete the answer.
library(ggplot2)
# let's look at the included dataset named trees.
# ?trees for a description
data(trees)
ggplot(trees,aes(Height,Volume)) + geom_point(aes(size=Girth))
# Great, now how do we color the bubbles by groups?
# For this example, I'll divide Volume into three groups: lo, med, high
trees$set[trees$Volume<=22.7]="lo"
trees$set[trees$Volume>22.7 & trees$Volume<=45.4]="med"
trees$set[trees$Volume>45.4]="high"
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth))
# Instead of just circles scaled by Girth, let's also change the symbol
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth,pch=set))
# Now let's choose a specific symbol for each set. Full list of symbols at ?pch
trees$symbol[trees$Volume<=22.7]=1
trees$symbol[trees$Volume>22.7 & trees$Volume<=45.4]=2
trees$symbol[trees$Volume>45.4]=3
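# symbol is numeric, so use an identity scale to pass the codes straight through
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth,shape=symbol)) + scale_shape_identity()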
What works best for me is using quote() and eval() together.
For example, let's print each column using a for loop:
library(magrittr)  # provides the %>% pipe
Columns <- names(dat)
for (i in 1:ncol(dat)){
dat[, eval(quote(Columns[i]))] %>% print
}

Understanding xyplot in R

I'm an R newbie and I'm trying to understand the xyplot function in lattice.
I have a dataframe:
df <- data.frame(Mean=as.vector(abc), Cycle=seq_len(nrow(abc)), Sample=rep(colnames(abc), each=nrow(abc)))
and I can plot it using
xyplot(Mean ~ Cycle, group=Sample, df, type="b", pch=20, auto.key=list(lines=TRUE, points=FALSE, columns=2), file="abc-quality")
My question is, what are Mean and Cycle? Looking at ?xyplot I can see that this is some kind of function and I understand they are coming from the data frame df, but I can't see them with ls() and >Mean gives Error: object 'Mean' not found. I tried to replicate the plot by substituting df[1] and df[2] for Mean and Cycle respectively thinking that these would be equal but that doesn't seem to be the case. Could someone explain what data types these are (objects, variables, etc) and if there is a generic way to access them (like df[1] and df[2])?
Thanks!
EDIT: xyplot works fine, I'm just trying to understand what Mean and Cycle are in terms of how they relate to df (column labels?) and if there is a way to put them in the xyplot function without referencing them by name, like df[1] instead of Mean.
These are simply references to columns of df.
If you'd like to access them by name without mentioning df every time, you could write with(df,{ ...your code goes here... }). The ...your code goes here... block can access the columns as simply Mean and Cycle.
A more direct way to get to those columns is df$Mean and df$Cycle. You can also reference them by position as df[,1] and df[,2], but I struggle to see why you would want to do that.
The reason your xyplot call works is that it implicitly does the equivalent of with(df), where df is your third argument to xyplot. Many R functions are like this; for example, lm(y~x,obs) would also correctly pick up columns x and y from dataframe obs.
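To make this concrete, here is a small sketch with made-up numbers standing in for abc. It assumes the lattice package is available; p1, p2 and f are illustrative names.

```r
library(lattice)

# Made-up stand-in for the data frame in the question (same column order)
df <- data.frame(Mean   = c(30, 31, 29, 28, 27, 33, 32, 31, 30, 28),
                 Cycle  = rep(1:5, 2),
                 Sample = rep(c("a", "b"), each = 5))

# Mean and Cycle are not objects in the workspace; they are looked up inside df
p1 <- xyplot(Mean ~ Cycle, data = df, groups = Sample, type = "b")

# with() does the same kind of lookup for ordinary code
identical(with(df, Mean), df$Mean)

# To refer to the columns by position, build the formula from their names
f <- reformulate(names(df)[2], response = names(df)[1])   # Mean ~ Cycle
p2 <- xyplot(f, data = df, groups = Sample, type = "b")
```

Substituting df[1] directly into the formula does not behave the same way because df[1] is a one-column data frame rather than a vector; df[[1]] or df[, 1] would give the vector.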
You need to add , data=df to your call to xyplot():
xyplot(Mean ~ Cycle, data=df, # added data= argument
group=Sample, type="b", pch=20,
auto.key=list(lines=TRUE, points=FALSE, columns=2),
file="abc-quality")
Alternatively, you can with(df, ....) and place your existing call where I left the four dots.

Resources