ggplot loop over columns and show plot in one page [duplicate] - r

This question already has answers here:
R assigning ggplot objects to list in loop
(4 answers)
Closed 5 years ago.
As you can see from the title, I ran into a problem that has already cost me an entire afternoon.
I have a data frame (which can be accessed from here) and I want to plot several columns against some other columns, one pair of columns at a time to be precise.
Therefore, I created a data.frame to store the pairs of column names that I want to plot against each other:
varname.df<-data.frame(num=c(1:9),
cust= c("custEnvironment",'custCommunity','custHuman','custEmp','custDiversity',
'custProduct','custCorp',"custtotal.index","custtotal.noHC.index"),
firm=c("firmEnvironment",'firmCommunity','firmHuman','firmEmp','firmDiversity',
'firmProduct','firmCorp',"firmtotal.index","firmtotal.noHC.index"))
## factor to character
i<-sapply(varname.df,is.factor)
varname.df[i]<-lapply(varname.df[i], as.character)
rm(i)
Then I plot each pair using ggplot2 and store the resulting figure in a list; see the code below:
## data I provided in the link above
temp<-read.xlsx('sample data.xlsx',1)
f <- list()
for(i in 1:9) { #dim(varname.df)[1]
# p[[i]]<-plot(x = SC.csr[,varname.df[i,'cust']],y = SC.csr[,varname.df[i,'firm']])
dat<-subset(temp,select = c(varname.df[i,'cust'], varname.df[i,'firm']))
pc1 <- ggplot(dat,aes(y = dat[,1], x = dat[,2])) +
# labs(title="Plot of CSR", x =colnames(dat)[2], y = colnames(dat)[1]) +
geom_point()
f[[i]]<-pc1
print(pc1)
Sys.sleep(5)
rm(pc1,pc2,pc3)
}
do.call(grid.arrange,f)
which is supposed to work according to the answers here and here. However, it gives me the exact same points in every figure, even though the for loop visibly produces a different figure at each iteration while it runs.
After debugging for nearly an afternoon, it seems that whenever I add a new ggplot object to the list, it changes the data points of all the other ggplot objects already in the list.
This is weird and frustrating because no error is thrown, yet something is clearly wrong somewhere. Any suggestion would be deeply appreciated.
-----------"EDIT"-------
problem solved here, the 3rd answer.

If you would like to work with facets you could try this:
varname.df<-data.frame(num=c(1:9),
cust= c("custEnvironment",'custCommunity','custHuman','custEmp','custDiversity',
'custProduct','custCorp',"custtotal.index","custtotal.noHC.index"),
firm=c("firmEnvironment",'firmCommunity','firmHuman','firmEmp','firmDiversity',
'firmProduct','firmCorp',"firmtotal.index","firmtotal.noHC.index"))
## factor to character
i<-sapply(varname.df,is.factor)
varname.df[i]<-lapply(varname.df[i], as.character)
rm(i)
temp<-read.xlsx('sample data.xlsx',1)
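# needs dplyr, reshape2 (for melt), stringr and ggplot2 to be loaded
# a row index is kept so that each cust value is paired only with the firm value from the same row
temp1 <- temp %>% select(all_of(varname.df$cust)) %>% mutate(row = row_number()) %>% melt(id.vars = "row", value.name = "y") %>% mutate(id = str_replace(variable, "cust", ""))
temp2 <- temp %>% select(all_of(varname.df$firm)) %>% mutate(row = row_number()) %>% melt(id.vars = "row", value.name = "x") %>% mutate(id = str_replace(variable, "firm", ""))
temp0 <- merge(temp1, temp2, by = c("id", "row")) %>% select(id, x, y)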
ggplot(temp0,aes(x=x,y=y))+geom_point()+facet_grid(.~id)+xlab("Firm")+ylab("Cust")
If you prefer to store your plots in a list and then plot them in a grid, the code below seems to do the trick (plot_grid comes from the cowplot package).
varname.df<-data.frame(num=c(1:9),
cust= c("custEnvironment",'custCommunity','custHuman','custEmp','custDiversity',
'custProduct','custCorp',"custtotal.index","custtotal.noHC.index"),
firm=c("firmEnvironment",'firmCommunity','firmHuman','firmEmp','firmDiversity',
'firmProduct','firmCorp',"firmtotal.index","firmtotal.noHC.index"))
## factor to character
i<-sapply(varname.df,is.factor)
varname.df[i]<-lapply(varname.df[i], as.character)
rm(i)
temp<-read.xlsx('sample data.xlsx',1)
f <- list()
for(i in 1:9) { #dim(varname.df)[1]
f[[i]]<-subset(temp,select = c(varname.df[i,'cust'], varname.df[i,'firm']))
}
plotlist=lapply(1:9,function(x) ggplot(f[[x]],aes(y = f[[x]][,1], x = f[[x]][,2])) +geom_point()+xlab("Firm")+
ylab("Cust"))
plot_grid(plotlist=plotlist)
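The symptom in the question is consistent with ggplot2 evaluating the expressions inside aes() only when a plot is drawn: every stored plot refers to dat[,1] and dat[,2], and by the time grid.arrange runs, dat holds the last pair. A minimal sketch of one way around this, assuming made-up data in place of the spreadsheet (temp, pairs and make_plot below are illustrative, not the asker's objects), is to build each plot inside a function call and refer to columns by name:

```r
library(ggplot2)

# Hypothetical stand-in for the spreadsheet: two cust/firm pairs only
set.seed(42)
temp <- data.frame(
  custEnvironment = rnorm(20), firmEnvironment = rnorm(20),
  custCommunity   = rnorm(20), firmCommunity   = rnorm(20)
)
pairs <- data.frame(
  cust = c("custEnvironment", "custCommunity"),
  firm = c("firmEnvironment", "firmCommunity"),
  stringsAsFactors = FALSE
)

# One function call per pair: each call has its own environment, so the
# column names seen by aes() cannot be overwritten by a later iteration
make_plot <- function(cust, firm) {
  ggplot(temp, aes(x = .data[[firm]], y = .data[[cust]])) +
    geom_point() +
    labs(x = firm, y = cust)
}
plots <- unname(Map(make_plot, pairs$cust, pairs$firm))
```

The resulting list can be handed to plot_grid(plotlist = plots) or do.call(grid.arrange, plots) in the same way as the lists above.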

Related

How do I create different plots from same data frame using the loop function?

I am supposed to create these plots for my course, but I'm very confused with using the looping function. I added an image of my dataframe and the plot I was asked to create. Does anyone know how I can get started?
I named my dataframe 'gutan' and I tried to do this so that I can loop through the dataframe:
gutan
gd <- gutan[ , i]
Here I tried to just see how it would work if I looped through it, but it did not work because I'm not sure how I can code for it to use one column as x and the other as y on the plot
i = 1
for (i in 3:8) {
plot(gd) }
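Since the data frame is only shown as an image, the following is just a sketch under assumptions: the data below is invented, and it is assumed that column 2 holds the x values and columns 3 to 8 hold the y values. The idea is to index the column inside the loop, so that i changes which column is plotted on each pass:

```r
# Hypothetical stand-in for gutan: columns 1-2 are an id and the x values,
# columns 3-8 are the measurements to plot (assumed layout)
set.seed(1)
gutan <- cbind(
  data.frame(id = 1:10, day = 1:10),
  as.data.frame(matrix(rnorm(60), nrow = 10))
)

op <- par(mfrow = c(2, 3))        # 2 x 3 grid: one panel per y column
for (i in 3:8) {
  plot(gutan[[2]], gutan[[i]],    # x stays fixed, y changes with i
       xlab = names(gutan)[2], ylab = names(gutan)[i])
}
par(op)                           # restore the previous layout
```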

ggplot within a function does not return a scatterplot with datapoints, instead a plot with dataframe values. How to fix this?

I'm writing a function where I should get 2 ggplots objects returned to me in RStudio based on two different dataframes generated within my function. However, instead I get a plot with all the dataframe values "printed" in it returned and not a normal scatterplot.
I tried:
return(list(df1, df2))
Plots<- list(df1, df2), return(Plots)
View(df1) View(df2)
ggplot without storing it into an object
Just return a single ggplot and not using list() to return two.
Print() instead of return or view.
Every result has the same outcome (picture):
As you can see on the bottom right, I do not get a scatter plot. The console does show output [1] and [[2]], but nothing else. The code itself is working perfectly.
I ran debug and got no errors, and above all, when I replaced ggplot with plot(), this DID return the preferred scatterplot. So I assume the problem is not related to the code itself.
However, I am much more familiar with customizations with ggplot than plot(), so if anyone knows how to solve this issue it would be amazing. Provided below I added some sample data and some sample code, although I'm not sure whether that is relevant with this issue.
The code I used within my function to create and return the ggplots is:
MD_filter_trial<- function(dataframe, mz_col, a = 0.00112, b = 0.01953){
MZ<- mz_col
MZR<- trunc(mz_col, digits = 0)#Either floor() or trunc() can be used for this part.
MD<- as.numeric(MZ-MZR)
MD.limit<- b + a*mz_col
dataframe<- dataframe%>%
dplyr::mutate(MD, MZ, MD.limit)%>%
dplyr::select(MD, MZ, MD.limit)
highlight_df <- dataframe %>% filter(MD >= MD.limit) #Notice how this is the exact opposite from the
MD_plot<- ggplot(data=dataframe, aes(x=MZ, y=MD))+
geom_point()+
geom_point(data=highlight_df, aes(x=MZ,y=MD), color='red')+#I added this one, so the data which will be removed will be highlighted in red.
ggtitle(paste("Unfiltered MD data - ", dataframe))
filtered<- dataframe%>%
filter(MD <= MD.limit)# As I understood: Basically all are coordinates. The maxima equation basically gives coordinates
MD_plot_2<- ggplot(data=filtered, aes(x=MZ, y=MD))+ #Filtered is basically the second dataframe, #which subsets datapoints with an Y value (which is the MD), below the linear equation MD...
geom_point()+
ggtitle(paste("Filtered MD data - ", dataframe))
N_Removed_datapoints <- nrow(dataframe) - nrow(filtered)
print(paste("Number of peaks removed:", N_Removed_datapoints))
MD_PLOTS<-list(dataframe, filtered, MD_plot, MD_plot_2)
return(MD_PLOTS)
}
Sample data:
structure(list(mz_col= c(99.0001, 99.0056, 99.0079, 99.0097, 99.0105,
99.0116, 99.0158, 99.0169, 99.019, 99.0196, 99.0207, 99.0215,
99.0239, 99.0252, 99.026, 99.0269, 99.0288, 99.0295, 99.0302,
99.0311, 99.0318, 99.0332, 99.034, 99.0346, 99.0355, 99.0376,
99.039, 99.04, 99.0405, 99.0414, 99.0421, 99.043, 99.0444, 99.0473,
99.048, 99.0517, 99.0536, 99.0547, 99.0556, 99.057, 99.0575,
99.0586, 99.0599, 99.0606, 99.0621, 99.0637, 99.0652, 99.0661,
99.0668, 99.0686, 99.0694, 99.0699, 99.0707, 99.0714, 99.072,
99.075, 99.0762, 99.0794, 99.0808, 99.0836, 99.0888, 99.0901,
99.0911, 99.092, 99.095, 99.0962, 99.1001, 99.1064, 99.1173,
99.4889, 99.5059, 99.5084, 99.5126, 99.5158, 99.5165, 99.5173,
99.5183, 99.526, 99.5266, 99.5315, 99.5345, 99.5358, 99.5402,
99.543, 99.5472, 99.548, 99.5529, 99.5572, 99.5577, 99.9408,
99.9551, 99.9599, 99.9646, 99.9718, 99.9887)), row.names = c(NA,
-95L), class = c("tbl_df", "tbl", "data.frame"))
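In your ggtitle calls perhaps you mean:
ggtitle(paste("Filtered MD data -", deparse(substitute(dataframe))))
Within a function, this takes the name of the object passed to the dataframe argument and pastes it into a string, rather than putting the whole data frame in. Note that your function overwrites dataframe before the plots are built, and substitute() only returns the caller's expression while the argument is still untouched, so capture the name in a variable on the first line of the function and use that variable in both ggtitle calls.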
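A small base R sketch of why the position of deparse(substitute()) matters in a function that reassigns its argument, as MD_filter_trial does. The function name_demo and the data frame peaks are made up for illustration:

```r
name_demo <- function(dataframe) {
  early <- deparse(substitute(dataframe))   # argument still untouched
  dataframe <- head(dataframe, 2)           # argument overwritten
  late <- deparse(substitute(dataframe))    # now deparses the data itself
  list(early = early, late = late)
}

peaks <- data.frame(mz_col = c(99.0001, 99.0056, 99.0079))
res <- name_demo(peaks)
res$early   # "peaks"
```

Here res$late holds the deparsed contents of the data frame instead of its name, which is the same kind of text that ended up in the plot title. A value captured like early can then be used as ggtitle(paste("Filtered MD data -", early)).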

performing a calculation with a `paste`d vector reference

So I have some lidar data that I want to calculate some metrics for (I'll attach a link to the data in a comment).
I also have ground plots that I have extracted the lidar points around, so that I have a couple hundred points per plot (19 plots). Each point has X, Y, Z, height above ground, and the associated plot.
I need to calculate a bunch of metrics on the plot level, so I created plotsgrouped with split(plotpts, plotpts$AssocPlot).
So now I have a data frame with a "page" for each plot, so I can calculate all my metrics by the "plot page". This works just dandy for individual plots, but I want to automate it. (yes, I know there's only 19 plots, but it's the principle of it, darn it! :-P)
So far, I've got a for loop going that calculates the metrics and puts the results in a data frame called Results. I pulled the names of the groups into a list called groups as well.
for(i in 1:length(groups)){
Results$Plot[i] <- groups[i]
Results$Mean[i] <- mean(plotsgrouped$PLT01$Z)
Results$Std.Dev.[i] <- sd(plotsgrouped$PLT01$Z)
Results$Max[i] <- max(plotsgrouped$PLT01$Z)
Results$75%Avg.[i] <- mean(plotsgrouped$PLT01$Z[plotsgrouped$PLT01$Z <= quantile(plotsgrouped$PLT01$Z, .75)])
Results$50%Avg.[i] <- mean(plotsgrouped$PLT01$Z[plotsgrouped$PLT01$Z <= quantile(plotsgrouped$PLT01$Z, .50)])
...
and so on.
The problem arises when I try to do something like:
Results$mean[i] <- mean(paste("plotsgrouped", groups[i],"Z", sep="$"))
mean() doesn't recognize the pasted string as a reference to the vector plotsgrouped$PLT27$Z, and instead fails. I've deduced that it's because it sees the quotes and thinks, "Oh, you're just some text, I can't get the mean of you." or something to that effect.
Btw, groups is a list of the 19 plot names: PLT01-PLT27 (non-consecutive sometimes) and FTWR, so I can't simply put a sequence for the numeric part of the name.
Anyone have an easier way to iterate across my test plots and get arbitrary metrics?
I feel like I have all the right pieces, but just don't know how they go together to give me what I want.
Also, if anyone can come up with a better title for the question, feel free to post it or change it or whatever.
Try with:
for(i in seq_along(groups)) {
Results$Plot[i] <- groups[[i]] # character name of the group
tempZ <- plotsgrouped[[groups[[i]]]][["Z"]] # [[ ]] works whether groups is a list or a character vector
Results$Mean[i] <- mean(tempZ)
Results$Std.Dev.[i] <- sd(tempZ)
Results$Max[i] <- max(tempZ)
Results$`75%Avg.`[i] <- mean(tempZ[tempZ <= quantile(tempZ, .75)]) # backticks: the name starts with a digit
Results$`50%Avg.`[i] <- mean(tempZ[tempZ <= quantile(tempZ, .50)])
}
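As an alternative sketch, the split list can be iterated over directly, which avoids building names by hand altogether. The data below is invented, and the helper metrics and the column names Avg75 and Avg50 are illustrative choices, not the asker's:

```r
# Hypothetical stand-in for the lidar points: three plots, 50 points each
set.seed(1)
plotpts <- data.frame(
  AssocPlot = rep(c("PLT01", "PLT27", "FTWR"), each = 50),
  Z = runif(150, 0, 30)
)
plotsgrouped <- split(plotpts, plotpts$AssocPlot)

# All metrics for one vector of heights
metrics <- function(z) {
  c(Mean    = mean(z),
    Std.Dev = sd(z),
    Max     = max(z),
    Avg75   = mean(z[z <= quantile(z, .75)]),
    Avg50   = mean(z[z <= quantile(z, .50)]))
}

# sapply returns one column per plot; transpose to get one row per plot
Results <- as.data.frame(t(sapply(plotsgrouped, function(p) metrics(p$Z))))
Results$Plot <- rownames(Results)
```

Adding another metric then only means adding one entry to the vector returned by metrics.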

Getting strings recognized as variable names in R

Related: Strings as variable references in R
Possibly related: Concatenate expressions to subset a dataframe
I've simplified the question per the comment request. Here goes with some example data.
dat <- data.frame(num=1:10,sq=(1:10)^2,cu=(1:10)^3)
set1 <- subset(dat,num>5)
set2 <- subset(dat,num<=5)
Now, I'd like to make a bubble plot from these. I have a more complicated data set with 3+ colors and complicated subsets, but I do something like this:
symbols(set1$sq,set1$cu,circles=set1$num,bg="red")
symbols(set2$sq,set2$cu,circles=set2$num,bg="blue",add=T)
I'd like to do a for loop like this:
colors <- c("red","blue")
sets <- c("set1","set2")
vars <- c("sq","cu","num")
for (i in 1:length(sets)) {
symbols(sets[[i]][,sq],sets[[i]][,cu],circles=sets[[i]][,num],
bg=colors[[i]],add=T)
}
I know you can have a variable evaluated to specify the column (like var="cu"; set1[,var]). I want to know how to get a variable to specify the data.frame itself (and another to evaluate the column).
Update: Ran across this post on r-bloggers which has this example:
x <- 42
eval(parse(text = "x"))
[1] 42
I'm able to do something like this now:
eval(parse(text=paste(set[[1]],"$",var1,sep="")))
In fiddling with this, I'm finding it interesting that the following are not equivalent:
vars <- data.frame("var1","var2")
eval(parse(text=paste(set[[1]],"$",var1,sep="")))
eval(parse(text=paste(set[[1]],"[,vars[[1]]]",sep="")))
I actually have to do this:
eval(parse(text=paste(set[[1]],"[,as.character(vars[[1]])]",sep="")))
Update2: The above works to output values... but not in trying to plot. I can't do:
for (i in 1:length(set)) {
symbols(eval(parse(text=paste(set[[i]],"$",var1,sep=""))),
eval(parse(text=paste(set[[i]],"$",var2,sep=""))),
circles=paste(set[[i]],".","circles",sep=""),
fg="white",bg=colors[[i]],add=T)
}
I get invalid symbol coordinates. I checked the class of set[[1]] and it's a factor. If I do is.numeric(as.numeric(set[[1]])) I get TRUE. Even if I add that above prior to the eval statement, I still get the error. Oddly, though, I can do this:
set.xvars <- as.numeric(eval(parse(text=paste(set[[i]],"$",var1,sep=""))))
set.yvars <- as.numeric(eval(parse(text=paste(set[[i]],"$",var2,sep=""))))
symbols(xvars,yvars,circles=data$var3)
Why different behavior when stored as a variable vs. executed within the symbol function?
You found one answer, i.e. eval(parse()) . You can also investigate do.call() which is often simpler to implement. Keep in mind the useful as.name() tool as well, for converting strings to variable names.
The basic answer to the question in the title is eval(as.symbol(variable_name_as_string)) as Josh O'Brien uses. e.g.
var.name = "x"
assign(var.name, 5)
eval(as.symbol(var.name)) # outputs 5
Or more simply:
get(var.name) # 5
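Applied to the loop in the question, a sketch along these lines might look as follows. It uses mget(), the multi-name counterpart of get(), to fetch the data frames once and then indexes the resulting list. Setting inches = FALSE and the shared axis limits are choices made here so that the two sets end up on a common scale; they are not part of the original code:

```r
dat  <- data.frame(num = 1:10, sq = (1:10)^2, cu = (1:10)^3)
set1 <- subset(dat, num > 5)
set2 <- subset(dat, num <= 5)

sets   <- c("set1", "set2")
colors <- c("red", "blue")

set_list <- mget(sets)   # named list: each data frame looked up by its name

for (i in seq_along(set_list)) {
  s <- set_list[[i]]
  symbols(s$sq, s$cu, circles = s$num, bg = colors[i],
          inches = FALSE,                          # radii in x-axis units
          xlim = range(dat$sq), ylim = range(dat$cu),
          add = i > 1)                             # first call opens the plot
}
```

Columns can still be chosen by string inside the loop with s[[vars[1]]], so no eval(parse()) is needed.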
Without any example data, it really is difficult to know exactly what you are wanting. For instance, I can't at all divine what your object set (or is it sets) looks like.
That said, does the following help at all?
set1 <- data.frame(x = 4:6, y = 6:4, z = c(1, 3, 5))
plot(1:10, type="n")
XX <- "set1"
with(eval(as.symbol(XX)), symbols(x, y, circles = z, add=TRUE))
EDIT:
Now that I see your real task, here is a one-liner that'll do everything you want without requiring any for() loops:
with(dat, symbols(sq, cu, circles = num,
bg = c("red", "blue")[(num>5) + 1]))
The one bit of code that may feel odd is the bit specifying the background color. Try out these two lines to see how it works:
c(TRUE, FALSE) + 1
# [1] 2 1
c("red", "blue")[c(F, F, T, T) + 1]
# [1] "red" "red" "blue" "blue"
If you want to use a string as a variable name, you can use assign:
var1="string_name"
assign(var1, c(5,4,5,6,7))
string_name
[1] 5 4 5 6 7
Subsetting the data and combining them back is unnecessary. So are loops since those operations are vectorized. From your previous edit, I'm guessing you are doing all of this to make bubble plots. If that is correct, perhaps the example below will help you. If this is way off, I can just delete the answer.
library(ggplot2)
# let's look at the included dataset named trees.
# ?trees for a description
data(trees)
ggplot(trees,aes(Height,Volume)) + geom_point(aes(size=Girth))
# Great, now how do we color the bubbles by groups?
# For this example, I'll divide Volume into three groups: lo, med, high
trees$set[trees$Volume<=22.7]="lo"
trees$set[trees$Volume>22.7 & trees$Volume<=45.4]="med"
trees$set[trees$Volume>45.4]="high"
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth))
# Instead of just circles scaled by Girth, let's also change the symbol
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth,pch=set))
# Now let's choose a specific symbol for each set. Full list of symbols at ?pch
trees$symbol[trees$Volume<=22.7]=1
trees$symbol[trees$Volume>22.7 & trees$Volume<=45.4]=2
trees$symbol[trees$Volume>45.4]=3
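# symbol is numeric, so scale_shape_identity() is needed to use its values directly as symbol codes
ggplot(trees,aes(Height,Volume,colour=set)) + geom_point(aes(size=Girth,shape=symbol)) + scale_shape_identity()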
What works best for me is using quote() and eval() together.
For example, let's print each column using a for loop:
Columns <- names(dat)
for (i in 1:ncol(dat)){
dat[, eval(quote(Columns[i]))] %>% print
}

How to rewrite this Stata code in R?

One of the things Stata does well is the way it constructs new variables (see example below). How to do this in R?
foreach i in A B C D {
forval n=1990/2000 {
local m = `n'-1
// create new columns from existing ones on-the-fly
generate pop`i'`n' = pop`i'`m' * (1 + trend`n')
}
}
DON'T do it in R. The reason it's messy is that it's UGLY code. Constructing lots of variables with programmatic names is a BAD THING. Names are names. They have no structure, so do not try to impose one on them. Decent programming languages have structures for this; rubbishy programming languages have tacked-on 'Macro' features and end up with this awful pattern of constructing variable names by pasting strings together. This is a practice from the 1970s that should have died out by now. Don't be a programming dinosaur.
For example, how do you know how many popXXXX variables you have? How do you know if you have a complete sequence of pop1990 to pop2000? What if you want to save the variables to a file to give to someone? Yuck, yuck, yuck.
Use a data structure that the language gives you. In this case probably a list.
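As a rough sketch of what that could look like, assuming made-up starting populations and a random trend matrix (start, trend and pop are illustrative names), each population gets one matrix in a list, with one column per year:

```r
set.seed(1)
n     <- 10
years <- as.character(1989:2000)

# Hypothetical 1989 populations, one vector per group
start <- list(A = 1:10, B = 10:1, C = 11:20, D = 20:11)

# Hypothetical growth rates: one column per year from 1990 to 2000
trend <- matrix(runif(n * 11, -0.1, 0.1), nrow = n,
                dimnames = list(NULL, years[-1]))

pop <- lapply(start, function(p0) {
  m <- matrix(NA_real_, nrow = n, ncol = length(years),
              dimnames = list(NULL, years))
  m[, "1989"] <- p0
  for (j in 2:length(years)) {
    # same recurrence as the Stata loop: pop_n = pop_(n-1) * (1 + trend_n)
    m[, j] <- m[, j - 1] * (1 + trend[, years[j]])
  }
  m
})

pop$A[, "1990"]   # population A in 1990
```

With this layout, length(pop) and colnames(pop$A) answer the "how many variables" and "is the sequence complete" questions directly.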
Both Spacedman and Joshua have very valid points. As Stata has only one dataset in memory at any given time, I'd suggest to add the variables to a dataframe (which is also a kind of list) instead of to the global environment (see below).
But honestly, the more R-ish way to do so, is to keep your factors factors instead of variable names.
I make some data as I believe it is in your R version now (at least, I hope so...)
Data <- data.frame(
popA1989 = 1:10,
popB1989 = 10:1,
popC1989 = 11:20,
popD1989 = 20:11
)
Trend <- replicate(11,runif(10,-0.1,0.1))
You can then use the stack() function to obtain a dataframe where you have a factor pop and a numeric variable year
newData <- stack(Data)
newData$pop <- substr(newData$ind,4,4)
newData$year <- as.numeric(substr(newData$ind,5,8))
newData$ind <- NULL
Filling up the dataframe is then quite easy :
for(i in 1:11){
tmp <- newData[newData$year==(1988+i),]
newData <- rbind(newData,
data.frame( values = tmp$values*(1+Trend[,i]),
pop = tmp$pop,
year = tmp$year+1
)
)
}
In this format, you'll find most R commands (selections of some years, of a single population, modelling effects of either or both, ...) a whole lot easier to perform later on.
And if you insist, you can still create a wide format with unstack()
unstack(newData,values~paste("pop",pop,year,sep=""))
Adaptation of Joshua's answer to add the columns to the dataframe :
for(L in LETTERS[1:4]) {
for(i in 1990:2000) {
new <- paste("pop",L,i,sep="") # create name for new variable
old <- get(paste("pop",L,i-1,sep=""),Data) # get old variable
trend <- Trend[,i-1989] # get trend variable
Data <- within(Data,assign(new, old*(1+trend)))
}
}
Assuming popA1989, popB1989, popC1989, popD1989 already exist in your global environment, the code below should work. There are certainly more "R-like" ways to do this, but I wanted to give you something similar to your Stata code.
for(L in LETTERS[1:4]) {
for(i in 1990:2000) {
new <- paste("pop",L,i,sep="") # create name for new variable
old <- get(paste("pop",L,i-1,sep="")) # get old variable
trend <- get(paste("trend",i,sep="")) # get trend variable
assign(new, old*(1+trend))
}
}
Assuming you have population data in vector pop1989
and data for trend in trend.
require(stringr)# because str_c has better default for sep parameter
dta <- kronecker(pop1989,cumprod(1+trend))
names(dta) <- kronecker(str_c("pop",LETTERS[1:4]),1990:2000,str_c)
