I am trying to analyse a data frame using the hierarchical clustering function hclust in R.
I would like to pass in a vector of p values I'll write beforehand (maybe something like c(5/4, 3/2, 7/4, 9/4)) and be able to have these specified as the different p value options with Minkowski distance when I use expand.grid. Ideally, when hyperparams is viewed, it would also be clear which value of p has been used for each minkowski, i.e. they should be labelled. So for example, where (if you run my code for hyperparams) there would currently just be one minkowski under Dists, for each of the methods in Meths, there would be, if I supplied the p vector as c(5/4, 3/2, 7/4, 9/4), now instead 4 rows for Minkowski distance: minkowski, p=5/4, minkowski, p=3/2, minkowski, p=7/4, minkowski, p=9/4 (or looking something like that, making the p values clear). Any ideas?
(Note: no packages please, only base R!)
Edit: I worded it poorly before, now rewritten. Let's take the following example instead:
acc <- function(x){
first = sum(x)
second = sum(x^2)
return(list(First=first,Second=second))
}
iris0 <- iris
iris1 <- cbind(log(iris[,1:4]),iris[5])
iris2 <- cbind(sqrt(iris[,1:4]),iris[5])
Now the important bit:
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary"),
DS=c("iris0","iris1","iris2"))
Table <- Map(function(x, ds){acc(table(ds$Species, cutree(hclust(dist(get(ds)[,1:4], method=x)),3)))},tests[[1]], tests[[2]])
This will work. But now if I want to include a term like "minkowski",p=3 in expand.grid, how would I do it?
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary","minkowski,p=3"),
DS=c("iris0","iris1","iris2"))
Table <- Map(function(x, ds){acc(table(ds$Species, cutree(hclust(dist(get(ds)[,1:4], method=x)),3)))},tests[[1]], tests[[2]])
This gives an error.
In reality there should be no p argument unless method="minkowski". I have tried to use strsplit to get the first part of the expression into ds, and a switch with strsplit to get the second part and then use parse (it would return NULL if the length of the strsplit result was not 2, which should pass no argument, I think). The issue seems to be that strsplit(x, ",") fails to evaluate the vectorized x and instead tries to evaluate the character x, which is not a string. Can anyone suggest a workaround/fix or another method for including terms like minkowski,p=1.6?
We can create a 'p' value column. First, encode the value of p in the method name:
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary",
"minkowski3", "minkowski4", "minkowski5"),
DS=c("iris0","iris1","iris2"))
Then move the 'p' values into their own column in 'tests', strip them from the method names, and pass both columns to dist:
tests$p <- as.list(args(dist))$p # default value
i1 <- grepl("minkowski", tests$Dists)
tests$Dists <- sub("[0-9.]+$", "", tests$Dists)
tests$p[i1] <- rep(3:5, length.out = sum(i1))
Map(function(x, ds, p){
dist1 <- dist(get(ds)[, 1:4], method = x, p = p)
ct <- cutree(hclust(dist1), 3)
acc(table(get(ds)$Species, ct))},
as.character(tests[[1]]), as.character(tests[[2]]), tests$p )
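For the original request (a vector of p values such as c(5/4, 3/2, 7/4, 9/4), each one labelled in the grid), one base-R possibility is to build the method table by hand and cross it with the data sets, rather than parsing p back out of the name. This is only a sketch under assumptions: the names p_values, method_table, datasets, Label and results are made up for illustration, merge(..., by = NULL) is swapped in for expand.grid to get the Cartesian product, and a named list replaces get().

```r
acc <- function(x) list(First = sum(x), Second = sum(x^2))

# Named list of data sets, so no get() lookup is needed
datasets <- list(
  iris0 = iris,
  iris1 = cbind(log(iris[, 1:4]), iris[5]),
  iris2 = cbind(sqrt(iris[, 1:4]), iris[5])
)

p_values     <- c(5/4, 3/2, 7/4, 9/4)   # hypothetical user-supplied vector
base_methods <- c("euclidean", "maximum", "manhattan", "canberra", "binary")

# One row per distance variant; Label keeps the value of p visible
method_table <- data.frame(
  Label = c(base_methods, paste0("minkowski, p=", p_values)),
  Dists = c(base_methods, rep("minkowski", length(p_values))),
  p     = c(rep(2, length(base_methods)), p_values),  # 2 is dist()'s default
  stringsAsFactors = FALSE
)

# by = NULL makes merge() return every method/data set combination
tests <- merge(method_table,
               data.frame(DS = names(datasets), stringsAsFactors = FALSE),
               by = NULL)

results <- Map(function(m, p, ds) {
  d  <- datasets[[ds]]
  cl <- cutree(hclust(dist(d[, 1:4], method = m, p = p)), 3)
  acc(table(d$Species, cl))
}, tests$Dists, tests$p, tests$DS)
names(results) <- paste(tests$Label, tests$DS, sep = " | ")
```

Viewing tests then shows one labelled row per value of p for each data set. dist() ignores p for every method other than "minkowski", so passing the default there is harmless.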
I am trying to store multiple plots produced by ggplot2 into a list.
I am attempting to use the list function suggested in a previous thread, however I am having difficulty creating my own function to meet my needs.
First, I split a dataframe based on a factor into a list with the following code:
heatlist.germ <- split(heatlist.germ, f=as.factor(heatlist.germ$plot))
After which, I attempt to create a function that I can later use with lapply.
plot_data_fcn <- function (heatlist.germ) {
ggplot(heatlist.germ[[i]], aes(x=posX, y=posY, fill=germ_bin)) +
geom_tile(aes(fill=germ_bin)) +
geom_text(aes(label=germ_bin)) +
scale_fill_gradient(low = "gray90", high="darkolivegreen4") +
ggtitle(plot) +
scale_x_continuous("Position X", breaks=seq(1,30)) +
scale_y_continuous("Position Y (REVERSED)", breaks=seq(1,20))
}
heatlist.test <- lapply(heatlist.germ[[i]], plot_data_fcn)
Two main things I am trying to accomplish:
Store the 12 ggplots (hence 12 factors of plot) in a list.
Create a title called "Plot [i] Germination".
Any help would be appreciated.
I don't have your data, so I'll simplify the plotting mechanism.
The first problem is that you should not use [[i]] indexing inside your function. Just have the function deal with its data as-is; it doesn't know that its argument is (in another environment) an element within a list. It knows only the object itself.
# a simple plot function
myfunc <- function(x) ggplot(x, aes_string(names(x)[1], names(x)[2])) + geom_point()
# a list of frames, nothing fancy here
datalist <- replicate(3, mtcars, simplify = FALSE)
# just call it ...
myplots <- lapply(datalist, myfunc)
class(myplots[[1]])
# [1] "gg" "ggplot"
When myfunc is called, its argument x is just a data.frame, the function has no idea that x is the first (or second or third) frame in a list of frames.
If you want to include the nth frame with an index indicating which element it is, this is in my view "zipping" data together, so I suggest Map. (You can also use purrr::imap or related tidyverse functions.)
myfunc2 <- function(x, title = "") ggplot(x, aes_string(names(x)[1], names(x)[2])) + geom_point() + labs(title = title)
myplots <- Map(myfunc2, datalist, sprintf("Plot number %s", seq_along(datalist)))
class(myplots[[1]])
# [1] "gg" "ggplot"
To understand how Map relates to lapply, note that lapply(datalist, myfunc) is "unrolled" to something like:
myfunc(datalist[[1]])
myfunc(datalist[[2]])
myfunc(datalist[[3]])
Map, however, takes one function that must accept one or more arguments in each call, and it accepts as many lists (or vectors) as that function has arguments. The following two calls are equivalent:
lapply(datalist, myfunc) # data first, function second
Map(myfunc, datalist) # function first, data second
and a more complicated call unrolls like this:
titles <- sprintf("Plot number %d", seq_along(datalist)) # "Plot number 1", ...
Map(myfunc2, datalist, titles)
# equivalent to
myfunc2(datalist[[1]], titles[[1]])
myfunc2(datalist[[2]], titles[[2]])
myfunc2(datalist[[3]], titles[[3]])
It doesn't really matter if each of the arguments is a true list (as in datalist) or a vector (as in titles), as long as they are the same length (or length 1).
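Applied to the question's setup (split by plot, then a title of the form "Plot [i] Germination"), the same zipping pattern might look like the sketch below. The data frame germ is a toy stand-in for heatlist.germ, and describe_piece is a hypothetical placeholder that returns a small summary instead of a ggplot, so the example runs without ggplot2; in practice its body would be the plotting call with ggtitle(title).

```r
# Toy stand-in for the question's data (the real heatlist.germ is not shown)
germ <- data.frame(
  plot     = rep(1:3, each = 4),
  posX     = rep(1:2, times = 6),
  posY     = rep(c(1, 1, 2, 2), times = 3),
  germ_bin = rep(0:1, times = 6)
)

# One data frame per plot; the list is named "1", "2", "3"
pieces <- split(germ, germ$plot)

# One title per piece, built from the list names
titles <- sprintf("Plot %s Germination", names(pieces))

# Hypothetical per-piece function: receives ONE data frame and ONE title
describe_piece <- function(d, title) {
  list(title = title, n_tiles = nrow(d), germinated = sum(d$germ_bin))
}

# Map pairs pieces[[i]] with titles[[i]] and keeps the names of pieces
out <- Map(describe_piece, pieces, titles)
```

Because split() names the list by factor level, the titles follow the plot labels rather than the list positions, which matters if the plots are not numbered 1 to n.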
I am currently trying to write my first loop for lagged regressions on 30 variables. The variables are labeled rx1, rx2, ..., rx30, and the data frame is called my_num_data.
I have created a loop that looks like this:
z <- zoo(my_num_data)
for (i in 1:30)
{dyn$lm(my_num_data$rx[i] ~ lag(my_num_data$rx[i], 1)
+ lag(my_num_data$rx[i], 2))
}
But I received an error message:
Error in model.frame.default(formula = dyn(my_num_data$rx[i] ~ lag(my_num_data$rx[i], :
invalid type (NULL) for variable 'my_num_data$rx[i]'
Can anyone tell me what the problem is with the loop?
Thanks!
This produces a list, L, whose ith component has the name of the ith column of z and whose content is the regression of the ith column of z on its first two lags. Lag is the same as lag except that the sign of its argument k is reversed.
library(dyn)
z <- zoo(anscombe) # test input using builtin data.frame anscombe
Lag <- function(x, k) lag(x, -k)
L <- lapply(as.list(z), function(x) dyn$lm(x ~ Lag(x, 1:2)))
First problem: I'm pretty sure the function you're looking for is dynlm() (from the dynlm package), without the $ character. Second, $rx[i] doesn't concatenate rx and the contents of i; it selects the (single) element of $rx at index i, and since no column is named exactly rx, my_num_data$rx is NULL, which is what the error message reports. Try this (I don't have your data, so I can't test it on my machine):
library(dynlm) # provides dynlm()
results <- list()
for (i in 1:30) {
results[[i]] <- dynlm(my_num_data[,i] ~ lag(my_num_data[,i], 1)
+ lag(my_num_data[,i], 2))
}
and then list element results[[1]] will be the results from the first regression, and so on.
Note that this assumes your my_num_data data.frame ONLY consists of columns rx1, rx2, etc.
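If it helps to see the column loop without either package, here is a base-R sketch that swaps dyn$lm / dynlm for embed() plus lm(). It uses the built-in anscombe data as a stand-in for my_num_data (as the other answer does), and lagged_fit is a hypothetical helper name, not something from dyn or dynlm.

```r
# Regress a series on its own first n_lags lags, using only base R
lagged_fit <- function(x, n_lags = 2) {
  # embed() puts x_t in column 1, x_{t-1} in column 2, x_{t-2} in column 3, ...
  m <- as.data.frame(embed(as.numeric(x), n_lags + 1))
  names(m) <- c("y", paste0("lag", seq_len(n_lags)))
  lm(y ~ ., data = m)
}

# lapply over a data frame visits each column and keeps the column names
fits <- lapply(anscombe, lagged_fit)
```

fits$x1 (or fits[["x1"]]) is then the regression for column x1; with real data the names would be rx1, rx2, and so on. The first n_lags observations are dropped because their lags are not available.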
I am not super familiar with R, but it appears you are trying to increase the index of rx. Is rx a vector with values at different indices?
If not, the solution may be to concatenate a string:
for (i in 1:30){
varName <- paste0("rx", i) # builds "rx1", "rx2", ...
dyn$lm(my_num_data[[varName]] ~ lag(my_num_data[[varName]], 1)
+ lag(my_num_data[[varName]], 2))
}
Again, I may be way off here, as this is my first post and R is still pretty new to me.
I am trying to code a fixed effects regression, but I have MANY dummy variables. Basically, I have 184 variables on the RHS of my equation. Instead of writing this out, I am trying to create a loop that will pass through each column (I have named each column with a number).
This is the code I have so far, but the paste is not working. I may be totally off base using paste, but I wasn't sure how else to approach this. I am getting an error (see below).
FE.model <- plm(avg.kw ~ 0 + (for (i in 41:87) {
paste("hour.dummy",i,sep="") + paste("dummy.CDH",i,sep="")
+ paste("dummy.MA",i,sep="") + paste("DR.variable",i,sep="")
}),
data = data.reg,
index=c('Site.ID','date.hour'),
model='within',
effect='individual')
summary(FE.model)
As an example for the column names, when i=41 the names should be "hour.dummy41" "dummy.CDH41", etc.
I'm getting the following error:
Error in paste("hour.dummy", i, sep = "") + paste("dummy.CDH", i, sep = "") : non-numeric argument to binary operator
So I'm not sure if it's the paste function that is not appropriate here, or if it's the loop. I can't seem to find a way to loop through column names easily in R.
Any help is much appreciated!
Ignoring worries about fitting a model with so many terms for the moment, you probably want to generate a string, and then cast it as a formula:
# build every prefix/number combination, then collapse the names into one string
rhs <- do.call(paste, c(as.list(expand.grid(c("hour.dummy","dummy.CDH"), 41:87)), sep="", collapse=" + "))
fml <- as.formula(sprintf("avg.kw ~ %s", rhs))
FE.model <- plm(fml, ...
I've only put in two of the dummies in the second line, but you should get the idea.
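A variant of the same idea with all four prefixes from the question, sketched with outer() and reformulate() instead of sprintf(). The names prefixes and term_names are made up for illustration, and the call to plm itself is left out because the data are not available.

```r
# The four column-name stems from the question
prefixes <- c("hour.dummy", "dummy.CDH", "dummy.MA", "DR.variable")

# Every stem/number combination: "hour.dummy41", "dummy.CDH41", ...
term_names <- as.vector(outer(prefixes, 41:87, paste0))

# intercept = FALSE corresponds to the "0 +" in the question's formula
fml <- reformulate(term_names, response = "avg.kw", intercept = FALSE)
```

fml can then be passed as the first argument to plm in place of the hand-written formula, with data, index, model and effect as in the question.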
Related: Strings as variable references in R
Possibly related: Concatenate expressions to subset a dataframe
I've simplified the question per the comment request. Here goes with some example data.
dat <- data.frame(num=1:10,sq=(1:10)^2,cu=(1:10)^3)
set1 <- subset(dat,num>5)
set2 <- subset(dat,num<=5)
Now, I'd like to make a bubble plot from these. I have a more complicated data set with 3+ colors and complicated subsets, but I do something like this:
symbols(set1$sq,set1$cu,circles=set1$num,bg="red")
symbols(set2$sq,set2$cu,circles=set2$num,bg="blue",add=T)
I'd like to do a for loop like this:
colors <- c("red","blue")
sets <- c("set1","set2")
vars <- c("sq","cu","num")
for (i in 1:length(sets)) {
symbols(sets[[i]][,sq],sets[[i]][,cu],circles=sets[[i]][,num],
bg=colors[[i]],add=T)
}
I know you can have a variable evaluated to specify the column (like var="cu"; set1[,var]; I want to know how to get a variable to specify the data.frame itself (and another to evaluate the column).
Update: Ran across this post on r-bloggers which has this example:
x <- 42
eval(parse(text = "x"))
[1] 42
I'm able to do something like this now:
eval(parse(text=paste(set[[1]],"$",var1,sep="")))
In fiddling with this, I'm finding it interesting that the following are not equivalent:
vars <- data.frame("var1","var2")
eval(parse(text=paste(set[[1]],"$",var1,sep="")))
eval(parse(text=paste(set[[1]],"[,vars[[1]]]",sep="")))
I actually have to do this:
eval(parse(text=paste(set[[1]],"[,as.character(vars[[1]])]",sep="")))
Update2: The above works to output values... but not in trying to plot. I can't do:
for (i in 1:length(set)) {
symbols(eval(parse(text=paste(set[[i]],"$",var1,sep=""))),
eval(parse(text=paste(set[[i]],"$",var2,sep=""))),
circles=paste(set[[i]],".","circles",sep=""),
fg="white",bg=colors[[i]],add=T)
}
I get invalid symbol coordinates. I checked the class of set[[1]] and it's a factor. If I do is.numeric(as.numeric(set[[1]])) I get TRUE. Even if I add that above prior to the eval statement, I still get the error. Oddly, though, I can do this:
set.xvars <- as.numeric(eval(parse(text=paste(set[[i]],"$",var1,sep=""))))
set.yvars <- as.numeric(eval(parse(text=paste(set[[i]],"$",var2,sep=""))))
symbols(xvars,yvars,circles=data$var3)
Why different behavior when stored as a variable vs. executed within the symbol function?
You found one answer, i.e. eval(parse()). You can also investigate do.call(), which is often simpler to implement. Keep in mind the useful as.name() tool as well, for converting strings to variable names.
The basic answer to the question in the title is eval(as.symbol(variable_name_as_string)) as Josh O'Brien uses. e.g.
var.name = "x"
assign(var.name, 5)
eval(as.symbol(var.name)) # outputs 5
Or more simply:
get(var.name) # 5
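For the loop in the question, a sketch that combines name lookup with [[ for the columns: mget() fetches several objects by name at once, and d[[name]] selects a column from a string. bubble_args and draw_bubbles are hypothetical names; the drawing is wrapped in a function so that nothing is plotted until it is called.

```r
dat  <- data.frame(num = 1:10, sq = (1:10)^2, cu = (1:10)^3)
set1 <- subset(dat, num > 5)
set2 <- subset(dat, num <= 5)

sets   <- c("set1", "set2")
vars   <- c("sq", "cu", "num")
colors <- c("red", "blue")

# Look the data frames up by name; the result is a list named set1, set2
frames <- mget(sets)

# Pair each data frame with its colour and pick columns by string
bubble_args <- Map(function(d, col) {
  list(x = d[[vars[1]]], y = d[[vars[2]]], circles = d[[vars[3]]], bg = col)
}, frames, colors)

# Draw all sets on one set of axes
draw_bubbles <- function() {
  plot(range(dat$sq), range(dat$cu), type = "n", xlab = "sq", ylab = "cu")
  for (a in bubble_args) do.call(symbols, c(a, list(add = TRUE)))
}
```

Calling draw_bubbles() produces the plot. Storing the data frames in a named list from the start would remove the need for mget() altogether.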
Without any example data, it really is difficult to know exactly what you are wanting. For instance, I can't at all divine what your object set (or is it sets) looks like.
That said, does the following help at all?
set1 <- data.frame(x = 4:6, y = 6:4, z = c(1, 3, 5))
plot(1:10, type="n")
XX <- "set1"
with(eval(as.symbol(XX)), symbols(x, y, circles = z, add=TRUE))
EDIT:
Now that I see your real task, here is a one-liner that'll do everything you want without requiring any for() loops:
with(dat, symbols(sq, cu, circles = num,
bg = c("blue", "red")[(num>5) + 1])) # num > 5 is red, as in the question
The one bit of code that may feel odd is the bit specifying the background color. Try out these two lines to see how it works:
c(TRUE, FALSE) + 1
# [1] 2 1
c("red", "blue")[c(F, F, T, T) + 1]
# [1] "red" "red" "blue" "blue"
If you want to use a string as a variable name, you can use assign:
var1="string_name"
assign(var1, c(5,4,5,6,7))
string_name
[1] 5 4 5 6 7
Subsetting the data and combining them back is unnecessary. So are loops since those operations are vectorized. From your previous edit, I'm guessing you are doing all of this to make bubble plots. If that is correct, perhaps the example below will help you. If this is way off, I can just delete the answer.
library(ggplot2)
# let's look at the included dataset named trees.
# ?trees for a description
data(trees)
ggplot(trees,aes(Height,Volume)) + geom_point(aes(size=Girth))
# Great, now how do we color the bubbles by groups?
# For this example, I'll divide Volume into three groups: lo, med, high
trees$set[trees$Volume<=22.7]="lo"
trees$set[trees$Volume>22.7 & trees$Volume<=45.4]="med"
trees$set[trees$Volume>45.4]="high"
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth))
# Instead of just circles scaled by Girth, let's also change the symbol
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth,pch=set))
# Now let's choose a specific symbol for each set. Full list of symbols at ?pch
trees$symbol[trees$Volume<=22.7]=1
trees$symbol[trees$Volume>22.7 & trees$Volume<=45.4]=2
trees$symbol[trees$Volume>45.4]=3
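# symbol is numeric, so scale_shape_identity() tells ggplot2 to use the codes as-is
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth,pch=symbol)) + scale_shape_identity()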
What works best for me is using quote() and eval() together.
For example, let's print each column using a for loop:
library(magrittr) # provides %>%
Columns <- names(dat)
for (i in seq_len(ncol(dat))){
dat[, eval(quote(Columns[i]))] %>% print
}