Iterate over a dataframe and create one plot for each column - r

I would like to iterate over a data frame and plot each column against a particular column such as price.
What I have done so far is:
for(i in ncol(dat.train)) {
ggplot(dat.train, aes(dat.train[[,i]],price)) + geom_point()
}
What I want is a first look at my data (approximately 300 columns) by plotting each column against the decision variable (i.e., price).
I know that there is a similar question, though I cannot really understand why the above is not really working.

You can do this with the data in long form. I have used the mtcars data to plot the other continuous variables against mpg. You have to melt the data into long form (using gather) and then use ggplot to plot these continuous variables (disp, drat, qsec, etc.) against mpg. In your case you would use price instead of mpg and melt all the other continuous variables (like disp, drat, qsec, etc. here); the remaining categorical variables can optionally be mapped to shape, color, and so on.
library(tidyverse)
mtcars %>%
gather(-mpg, -hp, -cyl, key = "var", value = "value") %>%
ggplot(aes(x = value, y = mpg, color = hp, shape = factor(cyl))) +
geom_point() +
facet_wrap(~ var, scales = "free") +
theme_bw()
EDIT:
This is another solution in case we need separate graphs for each of the variables.
Create a list of variables like this: lyst <- list("disp","hp"). You can use the colnames function to get all the variable names. Use lapply to loop through all the elements of lyst on your data frame.
lyst <- list("disp", "hp")
setwd("path") # set the working directory here; this is where the PDF is saved
pdf(file = "one.pdf")
lapply(lyst, function(i) ggplot(mtcars, aes_string(x = i, y = "mpg")) + geom_point())
dev.off()
A PDF file containing all the graphs will be generated in the working directory you have set.
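As for why the loop in the question does not work, three things stand out: for(i in ncol(dat.train)) visits only the single value ncol(dat.train) rather than every column index, dat.train[[,i]] is not valid indexing, and a ggplot object created inside a for loop is not drawn unless it is printed explicitly. Below is a minimal sketch of a corrected loop. Since dat.train is not available, mtcars stands in for it and mpg stands in for price (both are assumptions), and the output file name is made up for illustration.

```r
library(ggplot2)

dat <- mtcars      # stand-in for dat.train (assumption)
target <- "mpg"    # stand-in for price (assumption)
predictors <- setdiff(names(dat), target)

out_file <- file.path(tempdir(), "scatterplots.pdf")  # hypothetical file name
pdf(out_file)      # one page per plot
for (col in predictors) {
  # .data[[col]] looks the column up by its name
  p <- ggplot(dat, aes(x = .data[[col]], y = .data[[target]])) +
    geom_point() +
    labs(x = col, y = target)
  print(p)         # plots are not auto-printed inside a for loop
}
invisible(dev.off())
```

Each plot is printed inside the loop body, so the column name in col is resolved while it still has the intended value.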

Related

Scatterplot using ggplot

I need to create a scatterplot of count vs. depth of 12 species using ggplot.
This is what I have so far:
library(ggplot2)
ggplot(data = ReefFish, mapping = aes(count, depth))
However, how do I use geom_point(), geom_smooth(), and facet_wrap() to include a smoother as well as include just the 12 species I want from the data (ReefFish)? Since I believe what I have right now includes all species from the data.
Here is an example of part of my data:
Since I don't have access to the ReefFish data set, here's an example using the built-in mpg data set about cars. To make it work with your data set, just edit this code to replace manufacturers with species.
Filter the data
First we filter the data so that it only includes the species/manufacturers we're interested in.
# load our packages
library(ggplot2)
library(magrittr)
library(dplyr)
# set up a character vector of the manufacturers we're interested in
manufacturers <- c("audi", "nissan", "toyota")
# filter our data set to only include the manufacturers we care about
mpg_filtered <- mpg %>%
filter(manufacturer %in% manufacturers)
Plot the data
Now we plot. Your code was just about there! You just needed to add the plot elements you wanted, like so:
mpg_filtered %>%
ggplot(mapping = aes(x = cty,
y = hwy)) +
geom_point() +
geom_smooth() +
facet_wrap(~manufacturer)
Hope that helps, and let me know if you have any issues.

Box plots not appearing properly in RStudio

I am creating box plots in R, but they are appearing incorrectly. My data is based on the German Credit Dataset on Kaggle.
My code, testing two different attributes:
data %>%
ggplot(aes(x = Creditability, y = Purpose, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Purpose")
data %>%
ggplot(aes(x = Creditability, y = Account.Balance, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Account Balance")
I've tried a few different attributes, but each one results in the same error.
Edited info: Is it because the attributes have too much information? I have split the sample into test (300) vs train (700) and I am currently using train. Would it simply be because there's too much info?
Edit picture: [image: Factors]
Edit for graph error: [image: Error]
As others have explained in the comments, you cannot show boxplots where the y axis is set to be a factor. Factors are by their nature discrete variables, even if the levels are named as numbers. In order to utilize the stat function for the boxplot geom, you need the y axis to be continuous and the x axis to be discrete (or able to be separated into discrete values via the group= aesthetic).
Let me demonstrate with the mtcars dataset that ships with R:
library(ggplot2)
ggplot(mtcars, aes(x=factor(carb), y=mpg)) + geom_boxplot()
Here we can draw boxplots because the x aesthetic is forced to be discrete (via factor(carb)), while the y axis uses mpg, which is a numeric column in the mtcars dataset.
If you set both carb and mpg to be factors, you get something that should look pretty similar to what you're seeing:
ggplot(mtcars, aes(x=factor(carb), y=factor(mpg))) + geom_boxplot()
In your case, all the columns in your dataset are factors. If they are factors that can be coerced to numbers, you can turn them into continuous vectors using as.numeric(levels(column_name)[column_name]). Alternatively, you can use as.numeric(as.character(column_name)). Here's what it looks like to first convert the mtcars$mpg column to a factor of numeric values, and then back to numeric via this method.
df <- mtcars
# convert to a factor
df$mpg <- factor(df$mpg)
# back to numeric!
df$mpg <- as.numeric(levels(df$mpg)[df$mpg])
# this plot looks like it did before when we did the same with mtcars
ggplot(df, aes(x=factor(carb), y=mpg)) + geom_boxplot()
So, for your case, do this two-step process:
data$Purpose <- as.numeric(levels(data$Purpose)[data$Purpose])
data %>%
ggplot(aes(x = Creditability, y = Purpose, fill = Creditability)) +
geom_boxplot() +
ggtitle("Creditability vs Purpose")
That should work. You can follow in a similar fashion for your other variables.
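A small sketch of why the conversion above goes through levels() or as.character() rather than calling as.numeric() on the factor directly: as.numeric() on a factor returns the internal level codes, not the numbers shown in the labels. The factor f below is made up for illustration. Note also that labels that are not numbers at all (category names, for example) come back as NA with a warning, so this only helps for columns whose levels really are numeric.

```r
# A factor whose labels look like numbers (made-up example)
f <- factor(c("10", "20", "10", "30"))

codes   <- as.numeric(f)                # 1 2 1 3: internal level codes
via_chr <- as.numeric(as.character(f))  # 10 20 10 30: the labelled values
via_lev <- as.numeric(levels(f)[f])     # 10 20 10 30: same result via levels()

print(codes)
print(via_chr)
print(via_lev)
```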

How can I make a loop with tickers created in Excel in R?

I am trying to make a loop in R with tickers I have created in Excel.
I am trying to collect stock data from Yahoo finance, and the idea is that R is going to read the name of the stock from the Excel file, and then collect it from Yahoo Finance and create a graph.
In the data frame "Stocks" there are 10 different stocks listed in different columns, and I would like to run a loop so that I can get 10 different graphs. Here is the formula I have used to create a graph out of the first stock in the dataset.
Stocks %>%
  ggplot(aes(x = Date, y = NOVO.B.CO.Close)) +
  geom_line(col = "darkgreen") +
  geom_point(col = "darkgreen") +
  theme_gray() +
  ggtitle("Novo Nordiske B")
Wrapping my comments into a proper answer. It is not clear whether you want a single plot (using facet_wrap) or multiple plots combined into a single window. When using ggplot2 it is beneficial to have a single data.frame in long format, as ggplot can then handle all of the colouring and grouping based on a grouping column, similar to the example below:
library(ggplot2)
data(mtcars)
ggplot(mtcars, aes(x = mpg, y = hp, col = factor(cyl))) +
geom_point() +
geom_smooth() +
labs(col = 'Nmb. Cylinder')
From here the guide gives names for each colour, and scale_*_(manual/discrete/continuous) can be used to change specific colour palettes (eg scale_colour_discrete can be used to change the palette for factors).
When it comes to combining ggplots, the patchwork package provides a simple interface. If we assume you have vectors tickers, titles and colors, we can create a list of plots and combine them using simple addition (+).
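library(ggplot2)
library(patchwork) # lets `+` combine whole ggplots
library(purrr)
plots <- vector('list', n <- length(tickers))
base <- ggplot(Stocks, aes(x = Date)) +
  theme_gray()
for(i in seq_len(n)){
  # each layer must be joined with `+`, otherwise only base is stored
  plots[[i]] <- base +
    geom_point(aes_string(y = tickers[i]), col = colors[i]) +
    geom_line(aes_string(y = tickers[i]), col = colors[i]) +
    ggtitle(titles[i])
}
# add the plots together into a single patchwork
reduce(plots, `+`)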
However, for stocks the first option (a single long-format data frame) is likely to give a better result.
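A rough sketch of what that long-format option could look like. The Stocks data frame here is simulated, and the second column name (TICKER2.Close) is made up; only Date and NOVO.B.CO.Close come from the question. pivot_longer from tidyr stacks the per-ticker columns, and facet_wrap then draws one panel per ticker.

```r
library(ggplot2)
library(tidyr)

# Simulated stand-in for Stocks: a Date column plus one closing-price column per ticker
set.seed(1)
Stocks <- data.frame(
  Date = seq(as.Date("2021-01-01"), by = "day", length.out = 30),
  NOVO.B.CO.Close = 400 + cumsum(rnorm(30)),
  TICKER2.Close = 100 + cumsum(rnorm(30))  # hypothetical second ticker
)

# One row per (Date, ticker) pair
stocks_long <- pivot_longer(Stocks, cols = -Date,
                            names_to = "ticker", values_to = "close")

p <- ggplot(stocks_long, aes(x = Date, y = close)) +
  geom_line(col = "darkgreen") +
  geom_point(col = "darkgreen") +
  facet_wrap(~ ticker, scales = "free_y") +
  theme_gray()
p
```

With scales = "free_y" each ticker gets its own price axis, which matters when the stocks trade at very different levels.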

Want to compare all variables from two dataframes with same columns?

I have been using this code to plot all the variables in my synthetic dataset, but I would like to modify it in order to compare the dataset to the main, original dataset.
synthetic %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
I can't figure out how to make the plots go on the same, well, plots, or where to put the color values so I can tell which is which...
I tried combining both into one dataframe and using gather to inform the plots which was which but the coloring didn't work and it didn't work in general.
I tried matplot, but it told me they both have to have the same number of rows, which inclines me to believe it's not the right function for this.
The third thing I tried was:
par(mfrow=c(5,4))
i<-1
for (i in 1:26) {
plot(synthetic[i], col = "red")
points(full2[i], col = "blue")
}
But that failed as well. I would like all the plots to appear at once and not have to click through them.
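One possible way to make the combined-data-frame idea work, sketched with made-up data because synthetic and full2 are not available: stack the two data frames with a column recording where each row came from, reshape to long form, and map that source column to fill. The column names a and b and the row counts are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up stand-ins with the same columns but different numbers of rows
set.seed(42)
full2     <- data.frame(a = rnorm(200), b = runif(200))
synthetic <- data.frame(a = rnorm(150, mean = 0.2), b = runif(150))

combined <- bind_rows(list(original = full2, synthetic = synthetic),
                      .id = "source") %>%
  select(source, where(is.numeric)) %>%
  pivot_longer(-source, names_to = "key", values_to = "value")

p <- ggplot(combined, aes(value, fill = source)) +
  # overlay the two histograms instead of stacking them
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  facet_wrap(~ key, scales = "free")
p
```

All variables appear at once as facets, and the two data frames do not need the same number of rows.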

How subset a data frame by a factor and repeat a plot for each subset?

I am new to R. Forgive me if this question has an obvious answer, but I've not been able to find a solution. I have experience with SAS and may just be thinking of this problem in the wrong way.
I have a dataset with repeated measures from hundreds of subjects with each subject having multiple measurements across different ages. Each subject is identified by an ID variable. I'd like to plot each measurement (let's say body WEIGHT) by AGE for each individual subject (ID).
I've used ggplot2 to do something like this:
ggplot(data = dataset, aes(x = AGE, y = WEIGHT )) + geom_line() + facet_wrap(~ID)
This works well for a small number of subjects but won't work for the entire dataset.
I've also tried something like this:
ggplot(data=data, aes(x = AGE,y = BW, group = ID, colour = ID)) + geom_line()
This also works for a small number of subjects but is unreadable with hundreds of subjects.
I've tried to subset using code like this:
temp <- split(dataset,dataset$ID)
but I'm not sure how to work with the resulting dataset. Or perhaps there is a way to simply adjust the facet_wrap so that individual plots are created?
Thanks!
Because you want to split up the dataset and make a plot for each level of a factor, I would approach this with one of the split-apply-return tools from the plyr package.
Here is a toy example using the mtcars dataset. I first create the plot and name it p, then use dlply to split the dataset by a factor and return a plot for each level. I'm taking advantage of %+% from ggplot2 to replace the data.frame in a plot.
p = ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_line()
require(plyr)
dlply(mtcars, .(cyl), function(x) p %+% x)
This returns all the plots, one after another. If you name the resulting list object you can also call one plot at a time.
plots = dlply(mtcars, .(cyl), function(x) p %+% x)
plots[1]
Edit
I started thinking about putting a title on each plot based on the factor, which seems like it would be useful.
dlply(mtcars, .(cyl), function(x) p %+% x + facet_wrap(~cyl))
Edit 2
Here is one way to save these in a single document, one plot per page, working with the list of plots named plots. I didn't change any of the defaults in pdf, but you can certainly explore the changes you can make.
pdf()
plots
dev.off()
Updated to use package dplyr instead of plyr. This is done in do, and the output will have a named column that contains all the plots as a list.
library(dplyr)
plots = mtcars %>%
group_by(cyl) %>%
do(plots = p %+% . + facet_wrap(~cyl))
Source: local data frame [3 x 2]
Groups: <by row>
cyl plots
1 4 <S3:gg, ggplot>
2 6 <S3:gg, ggplot>
3 8 <S3:gg, ggplot>
To see the plots in R, just ask for the column that contains the plots.
plots$plots
And to save as a pdf
pdf()
plots$plots
dev.off()
A few years ago, I wanted to do something similar - plot individual trajectories for ~2500 participants with 1-7 measurements each. I did it like this, using plyr and ggplot2:
library(plyr)
library(ggplot2)
# Make sure the output directory exists
dir.create("plots", showWarnings = FALSE)
d_ply(dat, .variables = "participant_id", .fun = function(x) {
  # Generate the desired plot
  p <- ggplot(x, aes(x = phase, y = result)) +
    geom_point() +
    geom_line()
  # Save it to a file named after the participant
  # Putting it in a subdirectory is prudent
  ggsave(file.path("plots", paste0(x$participant_id[1], ".png")), plot = p)
})
A little slow, but it worked. If you want to get a sense of all participants' trajectories in one plot (like your second example - aka the spaghetti plot), you can tweak the transparency of the lines (forget coloring them, though):
ggplot(data = dat, aes(x = phase, y = result, group = participant_id)) +
geom_line(alpha = 0.3)
lapply(temp, function(X) ggplot(X, ...))
Where X is your subsetted data
Keep in mind you may have to explicitly print the ggplot object (print(ggplot(X, ..)))
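A runnable sketch of that idea, assuming mtcars split by cyl as a stand-in for split(dataset, dataset$ID); the aesthetics, the title, and the output file name are placeholders chosen for illustration, not taken from the question.

```r
library(ggplot2)

# mtcars split by cyl stands in for split(dataset, dataset$ID) (assumption)
temp <- split(mtcars, mtcars$cyl)

out_file <- file.path(tempdir(), "per_group_plots.pdf")  # hypothetical file name
pdf(out_file)
for (id in names(temp)) {
  p <- ggplot(temp[[id]], aes(x = wt, y = mpg)) +
    geom_line() +
    ggtitle(paste("cyl =", id))
  print(p)  # explicit print, so each plot becomes one page of the PDF
}
invisible(dev.off())
```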
