Creating a scatter plot using two data sets in R

Beginner here. I'm hoping to create a scatterplot using two datasets that I created using group by:
menthlth_perc_bystate <- brfss2013 %>%
group_by(state) %>%
summarise(percent_instability = sum(menthlth > 15, na.rm = TRUE) / n()) %>%
arrange(desc(percent_instability))
exercise_perc_bystate <- brfss2013 %>%
group_by(state) %>%
summarise(perc_exercise = sum(exeroft1 > 30, na.rm = TRUE) / n()) %>%
arrange(desc(perc_exercise))
I want to merge these into one dataset, total_data. Both have 54 obs.
total_data <- merge(menthlth_perc_bystate,exercise_perc_bystate,by="state")
Presumably the scatter plot would take the state's percent instability (menthlth_perc_bystate) on one axis and the state's percent exercise (exercise_perc_bystate) on the other. I tried this using ggplot but got an error:
ggplot(total_data, aes(x = total_data$menthlth_perc_bystate, y = total_data$exercise_perc_bystate)) + geom_point()
The error: Aesthetics must be either length 1 or the same as the data (54): x, y

In the aes() function of ggplot you put the bare column names from the data frame you provided for the data argument. So in your example it would be:
ggplot(total_data,
       aes(x = percent_instability,
           y = perc_exercise)) +
  geom_point()
Also, writing total_data$menthlth_perc_bystate implies you are looking for a column named menthlth_perc_bystate in the data frame total_data. That column does not exist; menthlth_perc_bystate is the name of a different data frame. Once you have merged the two data frames, the columns in the resulting data frame will be state, percent_instability, and perc_exercise.
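To check which columns survive the merge, here is a minimal sketch. brfss2013 is not loaded here, so the two summaries are replaced by small made-up data frames (the three states and their values are invented purely for illustration):

```r
library(ggplot2)

# Toy stand-ins for the two per-state summaries (values are invented)
menthlth_perc_bystate <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  percent_instability = c(0.12, 0.08, 0.10)
)
exercise_perc_bystate <- data.frame(
  state = c("Alaska", "Arizona", "Alabama"),
  perc_exercise = c(0.40, 0.35, 0.30)
)

# merge() matches rows on "state", so the differing row order does not matter
total_data <- merge(menthlth_perc_bystate, exercise_perc_bystate, by = "state")
names(total_data)

# Bare column names inside aes(); ggplot looks them up in total_data
p <- ggplot(total_data, aes(x = percent_instability, y = perc_exercise)) +
  geom_point()
```

Printing names(total_data) before plotting is a quick way to confirm what to put inside aes().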

Related

Removing plot points within a ggplot based on column value within a function

I am trying to run a function in R that plots the values of a single column in a dataframe against values from several other columns from the same dataframe. The output is several geom_point plots on a single ggplot. However, I would like to remove plot points with certain values from the plots. Specifically some of the dataframes contain 0 values that should not be plotted.
I have tried using subset() in several different ways for this problem, but I usually get an error saying
" Error in FUN(left, right) : non-numeric argument to binary operator"
I have tried simplifying the code and the 0 values are not removed at all!
## Plots #------
library(tidyverse) #loads all Hadley verse
x <- c(0,0,3,0,4)
y <- c(2,0,1,2,5)
KeyValue <- c(1,0,2,2,3)
df <- data.frame (x, y, KeyValue)
## Write function for producing graphs
PlotGr <- function(x){
xplot <- x %>%
gather(-"KeyValue", key = "var", value = "value") %>%
ggplot(subset(x,value!==0) +
aes(x = value, y = `KeyValue`)) +
geom_point(alpha=1/4) +
facet_wrap(~ var, scales = "free") +
theme_bw()
ggsave(filename= paste0(deparse(substitute(x)),'.pdf'))
}
PlotGr(df)
I am hoping to get a PDF output that contains 2 separate plots, with zero values on the y axis removed.
Currently my function is failing and I am completely lost!
Perhaps try something like:
xplot <-
  x %>%
  gather(-"KeyValue", key = "var", value = "value") %>%
  filter(value != 0) %>%
  ggplot(aes(x = value, y = `KeyValue`)) +
  ...
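For completeness, a sketch of how the whole function might look with the filter step in place. Two things are assumptions on my part: I use tidyr::pivot_longer() in place of gather(), and the names drop_zeros(), plot_gr() and the file argument are made up for this example. Note also that R has no !== operator; the comparison is !=.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- data.frame(
  x = c(0, 0, 3, 0, 4),
  y = c(2, 0, 1, 2, 5),
  KeyValue = c(1, 0, 2, 2, 3)
)

# Reshape to long format, then drop the rows whose value is 0
drop_zeros <- function(data) {
  data %>%
    pivot_longer(-KeyValue, names_to = "var", values_to = "value") %>%
    filter(value != 0)
}

# plot_gr() and its `file` argument are hypothetical names for this sketch
plot_gr <- function(data, file) {
  p <- ggplot(drop_zeros(data), aes(x = value, y = KeyValue)) +
    geom_point(alpha = 1 / 4) +
    facet_wrap(~ var, scales = "free") +
    theme_bw()
  ggsave(filename = file, plot = p)
  invisible(p)
}

long_df <- drop_zeros(df)
# plot_gr(df, "df.pdf")  # uncomment to write the PDF
```

Filtering before the data reaches ggplot() means every layer and facet sees the same reduced data, which is simpler than trying to subset inside the ggplot() call.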

How to plot a large number of density plots with different categorical variables

I have a dataset in which I have one numeric variable and many categorical variables. I would like to make a grid of density plots, each showing the distribution of the numeric variable for different categorical variables, with the fill corresponding to subgroups of each categorical variable. For example:
library(tidyverse)
library(nycflights13)
dat <- flights %>%
select(carrier, origin, distance) %>%
mutate(origin = origin %>% as.factor,
carrier = carrier %>% as.factor)
plot_1 <- dat %>%
ggplot(aes(x = distance, fill = carrier)) +
geom_density()
plot_1
plot_2 <- dat %>%
ggplot(aes(x = distance, fill = origin)) +
geom_density()
plot_2
I would like to find a way to quickly make these two plots. Right now, the only way I know how to do this is to create each plot individually, and then use grid_arrange to put them together. However, my real dataset has something like 15 categorical variables, so this would be very time intensive!
Is there a quicker and easier way to do this? I believe that the hardest part about this is that each plot has its own legend, so I'm not sure how to get around that stumbling block.
This solution returns all the plots in a list. Here we make a single function that accepts the variable you want to use as the fill, and then use lapply with a vector of all the variables you want to plot.
fill_variables <- vars(carrier, origin)
func_plot <- function(fill_variable) {
  dat %>%
    ggplot(aes(x = distance, fill = !!fill_variable)) +
    geom_density()
}
plotlist <- lapply(fill_variables, func_plot)
If you have no idea what those !! mean, I recommend watching this 5 minute video that introduces the key concepts of tidy evaluation. This is what you want to use when you create these sorts of wrapper functions to do stuff programmatically. I hope this helps!
Edit: If you want to feed an array of strings instead of a quosure, you can change !!fill_variable for !!sym(fill_variable) as follows:
fill_variables <- c('carrier', 'origin')
func_plot <- function(fill_variable) {
  dat %>%
    ggplot(aes(x = distance, fill = !!sym(fill_variable))) +
    geom_density()
}
plotlist <- lapply(fill_variables, func_plot)
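Once the plots are in a list, they can be laid out in a grid in one call. The sketch below is only an illustration under a few assumptions: it uses a small invented data frame (toy) instead of the flights data, the .data pronoun as an alternative to !!sym(), and gridExtra::grid.arrange() with its grobs argument. The function name plot_density_by() is made up for this example.

```r
library(ggplot2)
library(gridExtra)

# Toy data standing in for `dat` (values are invented)
set.seed(1)
toy <- data.frame(
  distance = c(rnorm(50, 500, 100), rnorm(50, 1500, 300)),
  carrier  = factor(rep(c("AA", "UA"), each = 50)),
  origin   = factor(rep(c("EWR", "JFK"), times = 50))
)

fill_variables <- c("carrier", "origin")

# plot_density_by() is a hypothetical helper name
plot_density_by <- function(fill_variable) {
  ggplot(toy, aes(x = distance, fill = .data[[fill_variable]])) +
    geom_density(alpha = 0.5) +
    labs(fill = fill_variable)  # label the legend with the column name
}

plotlist <- lapply(fill_variables, plot_density_by)
names(plotlist) <- fill_variables

# Each plot is drawn as its own panel, so each keeps its own legend
grid.arrange(grobs = plotlist, ncol = 2)
```

Because every panel is a complete plot, the separate legends stop being a problem; the cost is that the panels do not share axes.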
Alternative solution
As @djc wrote in the comments: "I'm having trouble passing the column names into 'fill_variables'. Right now I am extracting column names using the following code..."
You can separate the categorical and continuous variables with cat_vars <- flights[, sapply(flights, is.character)] for the categorical variables and cont_vars <- flights[, !sapply(flights, is.character)] for the continuous ones, and then pass these to the wrapper function given by mgiormenti.
Full code is given below;
library(tidyverse)
library(nycflights13)
cat_vars <- flights[, sapply(flights, is.character)]
cont_vars<- flights[, !sapply(flights, is.character)]
dat <- flights %>%
select(carrier, origin, distance) %>%
mutate(origin = origin %>% as.factor,
carrier = carrier %>% as.factor)
func_plot_cat <- function(cat_vars) {
dat %>%
ggplot(aes(x = distance, fill = !!cat_vars)) +
geom_density()
}
func_plot_cont <- function(cont_vars) {
  dat %>%
    # geom_point() needs a y aesthetic, so the continuous variable is mapped to y
    ggplot(aes(x = distance, y = !!cont_vars)) +
    geom_point()
}
plotlist_cat_vars <- lapply(cat_vars, func_plot_cat)
plotlist_cont_vars<- lapply(cont_vars, func_plot_cont)
print(plotlist_cat_vars)
print(plotlist_cont_vars)

ggplot labels for melted dataframes

I often melt my dataframes to show multiple variables on one barplot. The goal is to create a geom_bar with one bar for each variable, and one summary label for each bar.
For example, I'll do this:
mtcars$id<-rownames(mtcars)
tt<-melt(mtcars,id.vars = "id",measure.vars = c("cyl","vs","carb"))
ggplot(tt,aes(variable,value))+geom_bar(stat="identity")+
geom_text(aes(label=value),color='blue')
The result is a barplot in which the label for each bar is repeated for each case (it seems):
What I want to have is one label for each bar, like this:
A common solution is to create aggregated values to place on the graph, like this:
aggr<-tt %>% group_by(variable) %>% summarise(aggrLABEL=mean(value))
ggplot(tt,aes(variable,value))+geom_bar(stat="identity")+
geom_text(aes(label=aggr$aggrLABEL),color='blue')
or
ggplot(tt,aes(variable,value))+geom_bar(stat="identity")+
geom_text(label=dplyr::distinct(tt,value),color='blue')
However, these attempts result in errors, respectively:
For solution 1: Error: Aesthetics must be either length 1 or the same as the data (96): label, x, y
For solution 2: Error in [<-.data.frame(*tmp*, aes_params, value = list(label = list( : replacement element 1 is a matrix/data frame of 7 rows, need 96
So, what to do? Setting geom_text to stat="identity" does not help either.
What I would do is create another dataframe with the summary values of your columns. I would then refer to that dataframe in the geom_text line. Like this:
library(tidyverse) # need this for the %>%
tt_summary <- tt %>%
  group_by(variable) %>%
  summarize(total = sum(value))
ggplot(tt, aes(variable, value)) +
  geom_col() +
  geom_text(data = tt_summary, aes(label = total, y = total), nudge_y = 1) # using nudge_y bc it looks better.
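The row counts show why the original attempts failed: the melted data has 96 rows, while a per-variable summary has only 3, so a label vector of length 3 cannot be mapped onto a layer that still uses the 96-row data. Here is a self-contained sketch of that check. As an assumption on my part it builds tt with tidyr::pivot_longer() rather than melt(), and uses vjust instead of nudge_y to lift the labels.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Long format built with tidyr::pivot_longer() instead of melt()
tt <- mtcars %>%
  mutate(id = rownames(mtcars)) %>%
  select(id, cyl, vs, carb) %>%
  pivot_longer(-id, names_to = "variable", values_to = "value")

tt_summary <- tt %>%
  group_by(variable) %>%
  summarise(total = sum(value))

nrow(tt)          # one row per car per variable
nrow(tt_summary)  # one row per variable

p <- ggplot(tt, aes(variable, value)) +
  geom_col() +
  # data = tt_summary makes this layer draw one label per bar
  geom_text(data = tt_summary, aes(label = total, y = total), vjust = -0.5)
```

Giving the text layer its own data argument is what lets its aesthetics have a different length from the bars.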

How to make plot of each variable in dataframe using loop in R

I have a large dataset with 30 different variables. I want to investigate some characteristics of each variable by making a histogram for each variable.
For example, for my variable A this now looks like:
hist = qplot(A, data = full_data_noNO, geom="histogram",
binwidth = 50, fill=I("lightblue"))+
theme_light()
Now, I want to do this for all my variables. Does anyone know how I can loop through the names of all the variables in my dataframe (so A should change on each iteration)?
Also, I want to loop through all variables in this code for the same purpose:
avg_price = full_data_noNO %>%
group_by(Month, Country) %>%
dplyr::summarize(total = mean(A, na.rm = TRUE))
You could reference your variables by column number:
histograms = list()
for(i in seq_along(full_data_noNO)){
  # [[i]] returns the column as a vector for both data frames and tibbles
  histograms[[i]] = qplot(full_data_noNO[[i]], geom="histogram",
                          binwidth = 50, fill=I("lightblue"))+
    theme_light()
}
If all your variables are numeric, then you can do the following to produce a list of all plots, which you can then explore one by one with list indexing:
library(tidyverse)
list_of_plots <-
full_data_noNO %>%
map(~ qplot(x = ., geom = "histogram"))
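If you would rather loop over column names (so the axis labels stay readable) and also cover the summarise part of the question, something like the sketch below might work. It is only an illustration: mtcars stands in for full_data_noNO, with cyl and gear playing the role of Month and Country, and it uses ggplot() with the .data pronoun instead of qplot(), plus dplyr::across() for the group means.

```r
library(dplyr)
library(ggplot2)

# mtcars stands in for full_data_noNO; cyl and gear play the role of Month and Country
full_data <- mtcars
value_vars <- setdiff(names(full_data), c("cyl", "gear"))

# One histogram per column, looked up by name with the .data pronoun
histograms <- lapply(value_vars, function(v) {
  ggplot(full_data, aes(x = .data[[v]])) +
    geom_histogram(bins = 10, fill = "lightblue") +
    labs(x = v) +
    theme_light()
})
names(histograms) <- value_vars

# Group means of every remaining column in one call
avg_by_group <- full_data %>%
  group_by(cyl, gear) %>%
  summarise(across(all_of(value_vars), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")
```

After this, histograms[["mpg"]] prints a single plot, and avg_by_group holds one row per group with one mean column per variable.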

Show outliers in an efficient manner using ggplot

The actual data (and aim) I have is different, but for reproducibility I used the Titanic dataset. My aim is to create a plot of the age outliers (more than 1 SD from the mean) per class and sex.
Therefore the first thing I did is calculating the sd values and ranges:
library(dplyr)
library(ggplot2)
#Load titanic set
titanic <- read.csv("titanic_total.csv")
group <- group_by(titanic, Pclass, Sex)
#Create outlier ranges
summarise <- summarise(group, mean=mean(Age), sd=sd(Age))
summarise <- as.data.frame(summarise)
summarise$outlier_max <- summarise$mean + summarise$sd
summarise$outlier_min <- summarise$mean - summarise$sd
#Create a key
summarise$key <- paste0(summarise$Pclass, summarise$Sex)
#Create a key for the base set
titanic$key <- paste0(titanic$Pclass, titanic$Sex)
total_data <- left_join(titanic, summarise, by = "key")
total_data$outlier <- 0
Next, using a loop I determine whether the age is inside or outside the range
for (row in 1:nrow(total_data)){
if((total_data$Age[row]) > (total_data$outlier_max[row])){
total_data$outlier[row] <- 1
} else if ((total_data$Age[row]) < (total_data$outlier_min[row])){
total_data$outlier[row] <- 1
} else {
total_data$outlier[row] <- 0
}
}
Do some data cleaning ...
total_data$Pclass.x <- as.factor(total_data$Pclass.x)
total_data$outlier <- as.factor(total_data$outlier)
Now this code gives me the plot I am looking for.
ggplot(total_data, aes(x = Age, y = Pclass.x, colour = outlier)) + geom_point() +
facet_grid(. ~Sex.x)
However, this does not really seem like the easiest way to crack this problem. Any thoughts on how I can apply best practices to make this more efficient?
One way to reduce your code and make it less repetitive is to get it all into one procedure thanks to the pipe. Instead of creating a summary with the values and re-joining it with the data, you could do all of this within one mutate step:
titanic %>%
  mutate(Pclass = as.factor(Pclass)) %>%
  group_by(Pclass, Sex) %>%
  mutate(Age.mean = mean(Age),
         Age.sd = sd(Age),
         outlier.max = Age.mean + Age.sd,
         outlier.min = Age.mean - Age.sd,
         outlier = as.factor(ifelse(Age > outlier.max, 1,
                                    ifelse(Age < outlier.min, 1, 0)))) %>%
  ggplot() +
  geom_point(aes(Age, Pclass, colour = outlier)) +
  facet_grid(.~Sex)
Pclass is converted to a factor in advance, as it is a grouping factor. Then, the steps are done within the original dataframe, instead of creating two new ones. No changes are made to the original dataset however! If you want to keep the new columns, assign the result of the mutate steps to titanic or another data frame, and execute the ggplot part as the next step. Otherwise you would be assigning the figure, not the data, to that name.
To identify the outliers, one way is to work with ifelse. Alternatively, dplyr offers the nice between function; for this you would need to add rowwise after creating the min and max thresholds for outliers. Because between returns TRUE for values inside the range, it is negated here so that 1 still marks an outlier:
...
rowwise() %>%
mutate(outlier = as.factor(as.numeric(!between(Age, outlier.min, outlier.max)))) %>% ...
Plus:
Additionally, you could reduce your code even further, depending on which variables you want to keep and in which form:
titanic %>%
  group_by(Pclass, Sex) %>%
  mutate(outlier = as.factor(ifelse(Age > (mean(Age) + sd(Age)), 1,
                                    ifelse(Age < (mean(Age) - sd(Age)), 1, 0)))) %>%
  ggplot() +
  geom_point(aes(Age, as.factor(Pclass), colour = outlier)) +
  facet_grid(.~Sex)
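The two nested ifelse calls can also be collapsed into a single comparison, since "above mean + sd or below mean - sd" is the same as "further than one sd from the mean". Below is a minimal sketch on a small invented data frame (toy is not the Titanic data, and its ages are made up so the result is easy to check by hand). One caveat worth keeping in mind: mean() and sd() return NA when Age contains missing values, so on real data you may need na.rm = TRUE.

```r
library(dplyr)

# Toy data standing in for the Titanic set (values are invented)
toy <- data.frame(
  Pclass = c(1, 1, 1, 1, 2, 2, 2, 2),
  Sex    = "male",
  Age    = c(20, 30, 40, 80, 10, 12, 14, 50)
)

flagged <- toy %>%
  group_by(Pclass, Sex) %>%
  # More than one SD away from the group mean, in either direction
  mutate(outlier = factor(as.integer(abs(Age - mean(Age)) > sd(Age)))) %>%
  ungroup()
```

In this toy example only the oldest passenger in each class ends up flagged, and flagged can be passed to ggplot() exactly as in the pipelines above.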
