It seems like every question involving loops in R is met with "Loops are bad" and "You're doing it wrong" with advice to use list, or tapply or whatnot.
I'm learning R, and have implemented the following loop to create image files for each factor level, with the # of factor levels changing each time I run it:
for(i in unique(df$factor)) {
lnam <- paste("test_", i, sep="")
assign(lnam, subset(df, factor==i))
lfile <- paste(lnam, ".png", sep="")
png(file = lfile, bg="transparent")
with(get(lnam), hist(x, main = paste("Histogram of x for ", i, " factor", sep="")))
dev.off()
}
This works. I want to expand it to perhaps run various tests on those subgroups (also output to files), etc.
Is this a valid and legitimate use of loops? Or is there a preferred way to skin this cat?
There's nothing wrong with loops in general. Sometimes, particularly when you're working with files or calling functions for their side-effects rather than their outputs, loops can be easier to follow than *apply calls. However, when you use a loop to simulate an operation that can be vectorised, it's often much slower, hence the recommendation to avoid them.
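To make the vectorisation point concrete, here is a minimal sketch with made-up numbers (the vector x and both result names are only for illustration). The loop and the vectorised expression give the same answer; the second is shorter and, on large inputs, usually faster.

```r
# Illustrative input values (not from the question)
x <- c(1.5, 2, 3.25, 4)

# Loop version: fill a preallocated vector one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorised version: the arithmetic operator works on the whole vector
squares_vec <- x^2

# Both approaches produce the same values
identical(squares_loop, squares_vec)
```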
Re your specific example, though, I'd make the following comments:
If you want to do something for each level in a factor, it's more straightforward to use levels(factor) rather than unique(factor).
You don't need to create a new data frame specifically for each factor level.
With that in mind:
for(i in levels(df$factor))
{
lf <- paste("test_", i, ".png", sep="")
png(file=lf, bg="transparent")
with(subset(df, factor == i), hist(x, ....))
dev.off()
}
In this case, a reasonable option is to use split to convert your data frame into a list of data frames, each containing the subset of rows with a specific factor level.
split_df <- split(df, df$factor)
As Colin mentioned, paste can be vectorised, so you only need to call it once.
lfile <- paste("test_", names(split_df), ".png", sep = "")
Group all your plotting code into a function.
draw_and_save_histogram <- function(data, file)
{
png(file)
with(data, hist(x))
dev.off()
}
Now you can more easily compare the difference between a plain loop and an *apply function (in this case mapply, since we need two inputs).
for(i in seq_along(split_df))
{
draw_and_save_histogram(split_df[[i]], lfile[i])
}
mapply(
draw_and_save_histogram,
split_df,
lfile
)
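Since you mentioned wanting to run tests on the subgroups and write those to files as well, the same split-then-apply pattern carries over. The following is only a sketch under assumptions: the toy df, the write_summary helper and the output file names are all hypothetical, and it writes a plain numeric summary rather than a real test result.

```r
# Toy data standing in for the asker's df (hypothetical values)
df <- data.frame(
  factor = factor(c("a", "a", "b", "b", "b")),
  x = c(1, 2, 3, 4, 5)
)

# One data frame per factor level
split_df <- split(df, df$factor)

# One output file per level, written to the session's temporary directory
out_files <- file.path(tempdir(), paste0("summary_", names(split_df), ".txt"))

# Hypothetical helper: write a numeric summary of x for one subgroup
write_summary <- function(data, file) {
  writeLines(capture.output(summary(data$x)), file)
  invisible(file)
}

# mapply returns its results; wrap it in invisible() when only
# the side effect (the files) matters
invisible(mapply(write_summary, split_df, out_files))

# Check that every file was created
file.exists(out_files)
```

Swapping write_summary for a function that runs and saves a test keeps the loop or mapply call unchanged.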
Rather than drawing lots of histograms to be saved in different files, it is often preferable to draw one plot with several panels using lattice or ggplot2.
library(lattice)
histogram(~ x | factor, df)
library(ggplot2)
ggplot(df, aes(x)) + geom_histogram() + facet_wrap(~ factor)
Related
I would like to calculate the average of values from 10 different files (at line 15 and column 2 in each file).
The first code below is working but I will have to change the number of the line depending of my needs (for example not line 15 but line 12) and I would like to summarize the code so I can change only one number.
error_m<-(file_1[15,2] + file_2[15,2] + file_3[15,2] + file_4[15,2] + file_5[15,2] + file_6[15,2] + file_7[15,2] + file_8[15,2] + file_9[15,2] + file_10[15,2])/10
I tried the code below but it does not work. The error is: Error in file_(q) : could not find function "file_".
sum_e<-data.frame(0)
q=1
for(q in 1:10)
{
sum_e<-rbind(sum_e,file_(q)[15,2])
}
sum_e2<-sum(sum_e)
error_m<-sum_e/10
Step 1: fix the immediate problem:
sum_e<-data.frame(0)
q=1
for(q in 1:10)
{
sum_e<-rbind(sum_e,get(paste0("file_", q))[15,2])
}
sum_e2<-sum(sum_e)
error_m<-sum_e2/10
Step 2: don't have a different variable for each data, when they are all structured identically. To start this, you should read them into a list and then process them as a whole.
allfiles <- list.files(path="...", pattern="\\.txt$", full.names=TRUE) # pattern is a regular expression, not a glob
list_of_frames <- lapply(allfiles, read.csv)
At this point, each element of list_of_frames is exactly one of your files, so list_of_frames[[1]] should be the same as (for example) file_1. From here, whenever you want to do "some thing" to all of them, just do it to the list inside an lapply, like this:
val_15_2 <- lapply(list_of_frames, function(df) df[15,2])
avg_15_2 <- mean(unlist(val_15_2))
Here val_15_2 is a list, which may or may not be immediately useful. If instead you know that the values are all the same size/shape (same length, same class) and you want them simplified into a vector or matrix, then you can use sapply:
val_15_2 <- sapply(list_of_frames, function(df) df[15,2])
# or even
avg_15_2 <- mean(sapply(list_of_frames, function(df) df[15,2]))
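Since the goal was to change only one number, the row and column can become function arguments. This is a sketch under assumptions: the three data frames below are invented stand-ins for the real files, and cell_mean is a hypothetical helper name.

```r
# Invented stand-ins for the files: 20 rows, second column is a multiple of 1:20
list_of_frames <- lapply(1:3, function(k) {
  data.frame(a = 1:20, b = (1:20) * k)
})

# Hypothetical helper: average the cell at (row, col) across all frames.
# vapply is like sapply but checks that each result is a single number.
cell_mean <- function(frames, row, col) {
  values <- vapply(frames, function(df) df[row, col], numeric(1))
  mean(values)
}

cell_mean(list_of_frames, 15, 2)  # (15 + 30 + 45) / 3 = 30
cell_mean(list_of_frames, 12, 2)  # (12 + 24 + 36) / 3 = 24
```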
You can use paste0 to build each variable name and get to retrieve the object with that name:
sum_e<-data.frame(0)
for(q in 1:10)
{
sum_e<-rbind(sum_e,get(paste0('file_',q))[15,2])
}
sum_e2<-sum(sum_e)
error_m<-sum_e2/10
I keep running into situations where I want to dynamically create variables using a for loop (or similar / more efficient construct using dplyr perhaps). However, it's unclear to me how to do it right now.
For example, the below shows a construct that I would intuitively expect to generate 10 variables assigned numbers 1:10, but it doesn't work.
for (i in 1:10) {paste("variable",i,sep = "") = i}
The error
Error in paste("variable", i, sep = "") = i :
target of assignment expands to non-language object
Any thoughts on what method I should use to do this? I assume there are multiple approaches (including a more efficient dplyr method). Full disclosure: I'm relatively new to R and really appreciate the help. Thanks!
I've run into this problem myself many times. The solution is the assign command.
for(i in 1:10){
assign(paste("variable", i, sep = ""), i)
}
If you wanted to get everything into one vector, you could use sapply. The following code would give you a vector from 1 to 10, and the names of each item would be "variable i," where i is the value of each item. This may not be the prettiest or most elegant way to use the apply family for this, but I think it ought to work well enough.
var.names <- function(x){
a <- x
names(a) <- paste0("variable", x)
return(a)
}
variables <- sapply(X = 1:10, FUN = var.names)
This sort of approach seems to be favored because it keeps all of those variables tucked away in one object, rather than scattered all over the global environment. This could make calling them easier in the future, preventing the need to use get to scrounge up variables you'd saved.
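If the values are not all the same type, a named list does the same job as the named vector. A minimal sketch (the object name variables and the variable1 to variable10 names simply mirror the question):

```r
# Build a named list in one step instead of assigning ten separate variables
variables <- setNames(as.list(1:10), paste0("variable", 1:10))

# Look elements up by literal name ...
variables$variable3

# ... or by a name built at run time, with no need for get()
i <- 7
variables[[paste0("variable", i)]]
```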
No need to use a loop: you can create a character expression with paste0, turn it into an unevaluated expression with parse, and finally evaluate it with eval.
eval(parse(text = paste0("variable", 1:10, "=",1:10, collapse = ";") ))
The code you have is really no more useful than a vector of elements:
x<-1
for(i in 2:10){
x<-c(x,i)
}
(Obviously, this example is trivial, could just use x<-1:10 and be done. I assume there's a reason you need to do non-vectored calculations on each variable).
I regularly used d_ply to produce exploratory plots.
A trivial example:
require(plyr)
require(ggplot2) # qplot comes from ggplot2
plot_species <- function(species_data){
p <- qplot(data=species_data,
x=Sepal.Length,
y=Sepal.Width)
print(p)
}
d_ply(.data=iris,
.variables="Species",
function(x)plot_species(x))
Which produces three separate plots, one for each species.
I would like to reproduce this behaviour using functions in dplyr.
This seems to require the reassembly of the data.frame within the function called by summarise, which is often impractical.
require(dplyr)
iris_by_species <- group_by(iris,Species)
plot_species <- function(Sepal.Length,Sepal.Width){
species_data <- data.frame(Sepal.Length,Sepal.Width)
p <- qplot(data=species_data,
x=Sepal.Length,
y=Sepal.Width)
print(p)
}
summarise(iris_by_species, plot_species(Sepal.Length,Sepal.Width))
Can parts of the data.frame be passed to the function called by summarise directly, rather than passing columns?
I believe you can work with do for this task with the same function you used in d_ply. It will print directly to the plotting window, but also saves the plots as a list within the resulting data.frame if you use a named argument (see help page, this is essentially like using dlply). I don't fully grasp all that do can do, but if I don't use a named argument I get an error message but the plots still print to the plotting window (in RStudio).
plot_species <- function(species_data){
p <- qplot(data=species_data,
x=Sepal.Length,
y=Sepal.Width)
print(p)
}
group_by(iris, Species) %>%
do(plot = plot_species(.))
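For comparison, the "pass the whole per-group data frame" idea can also be sketched in base R with split, which needs neither plyr nor dplyr. This is an illustration under assumptions rather than a dplyr answer: it uses base plot instead of qplot, writes to a PDF in the temporary directory, and the return value of plot_species is only there so the result can be checked.

```r
# Hypothetical helper: receives the whole data frame for one species
plot_species <- function(species_data) {
  plot(species_data$Sepal.Length, species_data$Sepal.Width,
       main = as.character(species_data$Species[1]),
       xlab = "Sepal.Length", ylab = "Sepal.Width")
  invisible(nrow(species_data))
}

# One multi-page PDF, one page per species
out_file <- file.path(tempdir(), "species_plots.pdf")
pdf(out_file)
rows_plotted <- vapply(split(iris, iris$Species), plot_species, integer(1))
dev.off()

rows_plotted
```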
Related: Strings as variable references in R
Possibly related: Concatenate expressions to subset a dataframe
I've simplified the question per the comment request. Here goes with some example data.
dat <- data.frame(num=1:10,sq=(1:10)^2,cu=(1:10)^3)
set1 <- subset(dat,num>5)
set2 <- subset(dat,num<=5)
Now, I'd like to make a bubble plot from these. I have a more complicated data set with 3+ colors and complicated subsets, but I do something like this:
symbols(set1$sq,set1$cu,circles=set1$num,bg="red")
symbols(set2$sq,set2$cu,circles=set2$num,bg="blue",add=T)
I'd like to do a for loop like this:
colors <- c("red","blue")
sets <- c("set1","set2")
vars <- c("sq","cu","num")
for (i in 1:length(sets)) {
symbols(sets[[i]][,sq],sets[[i]][,cu],circles=sets[[i]][,num],
bg=colors[[i]],add=T)
}
I know you can have a variable evaluated to specify the column (like var="cu"; set1[,var]; I want to know how to get a variable to specify the data.frame itself (and another to evaluate the column).
Update: Ran across this post on r-bloggers which has this example:
x <- 42
eval(parse(text = "x"))
[1] 42
I'm able to do something like this now:
eval(parse(text=paste(set[[1]],"$",var1,sep="")))
In fiddling with this, I'm finding it interesting that the following are not equivalent:
vars <- data.frame("var1","var2")
eval(parse(text=paste(set[[1]],"$",var1,sep="")))
eval(parse(text=paste(set[[1]],"[,vars[[1]]]",sep="")))
I actually have to do this:
eval(parse(text=paste(set[[1]],"[,as.character(vars[[1]])]",sep="")))
Update2: The above works to output values... but not in trying to plot. I can't do:
for (i in 1:length(set)) {
symbols(eval(parse(text=paste(set[[i]],"$",var1,sep=""))),
eval(parse(text=paste(set[[i]],"$",var2,sep=""))),
circles=paste(set[[i]],".","circles",sep=""),
fg="white",bg=colors[[i]],add=T)
}
I get invalid symbol coordinates. I checked the class of set[[1]] and it's a factor. If I do is.numeric(as.numeric(set[[1]])) I get TRUE. Even if I add that above prior to the eval statement, I still get the error. Oddly, though, I can do this:
set.xvars <- as.numeric(eval(parse(text=paste(set[[i]],"$",var1,sep=""))))
set.yvars <- as.numeric(eval(parse(text=paste(set[[i]],"$",var2,sep=""))))
symbols(xvars,yvars,circles=data$var3)
Why different behavior when stored as a variable vs. executed within the symbol function?
You found one answer, i.e. eval(parse()) . You can also investigate do.call() which is often simpler to implement. Keep in mind the useful as.name() tool as well, for converting strings to variable names.
The basic answer to the question in the title is eval(as.symbol(variable_name_as_string)) as Josh O'Brien uses. e.g.
var.name = "x"
assign(var.name, 5)
eval(as.symbol(var.name)) # outputs 5
Or more simply:
get(var.name) # 5
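If you control how the data frames are created, an alternative to get and eval is to keep them in a named list and index with strings. A minimal sketch using the example data from the question (the list name sets_list and the two lookup variables are made up for illustration):

```r
dat <- data.frame(num = 1:10, sq = (1:10)^2, cu = (1:10)^3)

# Store the subsets in a named list instead of separate variables
sets_list <- list(
  set1 = subset(dat, num > 5),
  set2 = subset(dat, num <= 5)
)

# Strings select both the data frame and the column
set_name <- "set1"
col_name <- "cu"
sets_list[[set_name]][[col_name]]
```

The same double-bracket indexing works inside a loop over names(sets_list), so no eval(parse()) is needed.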
Without any example data, it really is difficult to know exactly what you are wanting. For instance, I can't at all divine what your object set (or is it sets) looks like.
That said, does the following help at all?
set1 <- data.frame(x = 4:6, y = 6:4, z = c(1, 3, 5))
plot(1:10, type="n")
XX <- "set1"
with(eval(as.symbol(XX)), symbols(x, y, circles = z, add=TRUE))
EDIT:
Now that I see your real task, here is a one-liner that'll do everything you want without requiring any for() loops:
with(dat, symbols(sq, cu, circles = num,
bg = c("red", "blue")[(num>5) + 1]))
The one bit of code that may feel odd is the bit specifying the background color. Try out these two lines to see how it works:
c(TRUE, FALSE) + 1
# [1] 2 1
c("red", "blue")[c(F, F, T, T) + 1]
# [1] "red" "red" "blue" "blue"
If you want to use a string as a variable name, you can use assign:
var1="string_name"
assign(var1, c(5,4,5,6,7))
string_name
[1] 5 4 5 6 7
Subsetting the data and combining them back is unnecessary. So are loops since those operations are vectorized. From your previous edit, I'm guessing you are doing all of this to make bubble plots. If that is correct, perhaps the example below will help you. If this is way off, I can just delete the answer.
library(ggplot2)
# let's look at the included dataset named trees.
# ?trees for a description
data(trees)
ggplot(trees,aes(Height,Volume)) + geom_point(aes(size=Girth))
# Great, now how do we color the bubbles by groups?
# For this example, I'll divide Volume into three groups: lo, med, high
trees$set[trees$Volume<=22.7]="lo"
trees$set[trees$Volume>22.7 & trees$Volume<=45.4]="med"
trees$set[trees$Volume>45.4]="high"
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth))
# Instead of just circles scaled by Girth, let's also change the symbol
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth,pch=set))
# Now let's choose a specific symbol for each set. Full list of symbols at ?pch
trees$symbol[trees$Volume<=22.7]=1
trees$symbol[trees$Volume>22.7 & trees$Volume<=45.4]=2
trees$symbol[trees$Volume>45.4]=3
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth,pch=symbol)) + scale_shape_identity()
What works best for me is using quote() and eval() together.
For example, let's print each column using a for loop:
Columns <- names(dat)
for (i in 1:ncol(dat)){
dat[, eval(quote(Columns[i]))] %>% print
}
I have 11 lists of different length, imported into R as p1,p2,p3,...,p11. Now I want to get the rollmean (library TTR) from all lists and name the result p1y,p2y,...,p11y.
This seems to be the job for a loop, but I read that this is often not good practice in R. I tried something (foolish) like
sample=10
for (i in 1:11){
paste("p",i,"y",sep="")<-rollmean(paste("p",i,sep=""),sample)
}
which does not work.
I also tried to use it in combination with assign(), but as I understand assign can only take a variable and a single value.
As always it strikes me that I am missing some fundamental function of R.
As Manuel pointed out, your life will be easier if you combine the variables into a list. For this, you want mget (short for "multiple get").
var_names <- paste("p", 1:11, sep = "")
p_all <- mget(var_names, envir = globalenv())
Now simply use lapply to call rollmean on each element of your list.
sample <- 10
rolling_means <- lapply(p_all, rollmean, sample)
(Also, consider renaming the sample to something that isn't already a function name.)
I suggest leaving the answers as a list, but if you really like the idea of having separate rolling mean variables to match the separate p1, ..., p11 variables then use list2env.
names(rolling_means) <- paste(var_names, "y", sep = "")
list2env(rolling_means, envir = globalenv())
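Here is the whole pattern on toy data, so it can be run without the original files. Two assumptions to flag: p1 and p2 are invented vectors, and roll_mean is a hypothetical base R stand-in for rollmean (a plain unweighted window), used only so the sketch has no package dependency.

```r
# Invented inputs of different lengths
p1 <- c(1, 2, 3, 4, 5)
p2 <- c(10, 20, 30, 40)

# Collect the separate variables into one named list
var_names <- paste0("p", 1:2)
p_all <- mget(var_names)

# Stand-in for rollmean: embed() lays out each window as a row,
# rowMeans() then averages every window
roll_mean <- function(x, k) rowMeans(embed(x, k))

width <- 3
rolling_means <- lapply(p_all, roll_mean, width)
names(rolling_means) <- paste0(var_names, "y")

rolling_means$p1y  # 2 3 4
rolling_means$p2y  # 20 30
```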
You could group your lists into one and do the following
sample <- 10
mylist <- list(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11)
for(i in 1:11) assign(paste('p',i,'y',sep=''), rollmean(mylist[[i]], sample))
This can be done with ?get and ?do.call .
x1<-1:3
x2 <- seq(3.5,5.5,1)
for (i in 1:2) {
sx<- (do.call("sin",list(c(get(paste('x',i,sep='',collapse=''))))))
cat(sx)
}
Sloppy example, but you get the idea, I hope.