Looping over columns of data frame to create plots with ggplot2 - r

I am trying to overcome this. Can't get any further.
I have a data frame with factor and numeric variables. The first few rows and columns are shown below.
# A tibble: 6 x 5
  cluster SEV_D SEV_M OBS   PAN
    <int> <dbl> <dbl> <fct> <fct>
1       1     5     1 0     1
2       2     6     1 0     0
3       1     5     1 0     1
4       2     4     2 0     0
5       1     4     1 1     1
6       1     4     2 1     0
cluster=as.factor(c(1,2,1,2,1,1))
SEV_D=as.numeric(c(5,6,5,4,4,4))
SEV_M=as.numeric(c(1,1,1,2,1,2))
OBS=as.factor(c(0,0,0,0,1,1))
PAN=as.factor(c(1,0,1,0,1,0))
data<-data.frame(cluster,SEV_D,SEV_M,OBS,PAN)
I split the dataframe like this, in numeric and factor variables, keeping 'cluster' in both subsets since I need it for grouping.
data_fact <- data[, sapply(data, class) == 'factor']
data_cont <- data[, sapply(data, class) == 'numeric' | names(data) == "cluster"]
The two following snippets of code would produce the plots I want.
data_fact %>%
  group_by(cluster, OBS) %>%
  summarise(total.count = n()) %>%
  ggplot(., aes(x = cluster, y = total.count, fill = OBS)) +
  geom_bar(position = 'dodge', stat = 'identity') +
  geom_text(aes(label = total.count),
            position = position_dodge(width = 0.9), vjust = -0.2)
data_cont %>%
  group_by(cluster) %>%
  dplyr::summarise(mean = mean(SEV_D), sd = sd(SEV_D)) %>%
  ggplot(., aes(x = cluster, y = mean)) +
  geom_bar(position = position_dodge(), stat = "identity", colour = "black", size = .3) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd),
                size = .3, width = .4, position = position_dodge(.4)) +
  ggtitle("SEV_D")
My goal is to create as many graphs as variables in the data frame, looping over columns and to store such graphs in one single sheet.
My attempt was
col <- names(data_fact)[!names(data_fact) %in% "cluster"]
for (i in col) {
  data_fact %>%
    group_by(cluster, i) %>%
    summarise(total.count = n()) %>%
    ggplot(., aes(x = cluster, y = total.count, fill = i)) +
    geom_bar(position = 'dodge', stat = 'identity') +
    geom_text(aes(label = total.count),
              position = position_dodge(width = 0.9), vjust = -0.2)
}
But it throws this error:
Error in grouped_df_impl(data, unname(vars), drop) :
Column i is unknown
On top of that, that code would not display all graphs in one sheet I am afraid. Any help would be much appreciated!!!

RStudio's tidyeval cheatsheet is a good reference: https://github.com/rstudio/cheatsheets/raw/master/tidyeval.pdf
The loop fails because group_by() and aes() look for a column literally named i instead of the column whose name is stored in i. To use the value of i, convert the string to a symbol and unquote it with the !!ensym( ) construct. You also need to call print() explicitly, because ggplot objects are not drawn automatically inside a for loop.
library(dplyr)    # %>%, group_by(), summarise(), n()
library(ggplot2)
col <- names(data_fact)[!names(data_fact) %in% "cluster"]
for (i in col) {
  print(i)
  # !!ensym(i) turns the string held in i into a column symbol
  g <- data_fact %>%
    group_by(cluster, !!ensym(i)) %>%
    summarise(total.count = n()) %>%
    ggplot(., aes(x = cluster, y = total.count, fill = !!ensym(i))) +
    geom_bar(position = 'dodge', stat = 'identity') +
    geom_text(aes(label = total.count), position = position_dodge(width = 0.9), vjust = -0.2) +
    labs(title = i)
  print(g)  # plots are not auto-printed inside a loop
}
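The question also asks for all graphs on one sheet, which printing inside the loop does not give you. One possible approach, sketched here and not the only one, is to store each plot in a list and combine the list afterwards. This sketch assumes the patchwork package is installed for wrap_plots(); plot_counts() and fact_cols are hypothetical names introduced for illustration, and across(all_of()) with .data[[v]] is used as an alternative to !!ensym().

```r
library(dplyr)
library(ggplot2)
library(patchwork)  # assumption: installed; provides wrap_plots()

# Same toy data as in the question
data <- data.frame(
  cluster = factor(c(1, 2, 1, 2, 1, 1)),
  SEV_D   = c(5, 6, 5, 4, 4, 4),
  SEV_M   = c(1, 1, 1, 2, 1, 2),
  OBS     = factor(c(0, 0, 0, 0, 1, 1)),
  PAN     = factor(c(1, 0, 1, 0, 1, 0))
)

data_fact <- data[, sapply(data, is.factor)]
fact_cols <- setdiff(names(data_fact), "cluster")

# plot_counts() is a hypothetical helper: one dodged bar chart per factor column
plot_counts <- function(df, v) {
  df %>%
    group_by(across(all_of(c("cluster", v)))) %>%
    summarise(total.count = n(), .groups = "drop") %>%
    ggplot(aes(x = cluster, y = total.count, fill = .data[[v]])) +
    geom_col(position = "dodge") +
    geom_text(aes(label = total.count),
              position = position_dodge(width = 0.9), vjust = -0.2) +
    labs(title = v, fill = v)
}

# Build one plot per column and keep them in a list
plots <- lapply(fact_cols, plot_counts, df = data_fact)

# Arrange all stored plots on a single page
sheet <- wrap_plots(plots, ncol = 2)
print(sheet)
```

The same pattern should carry over to the numeric columns: write a second helper that computes mean and sd per cluster, append its plots to the list, and pass the combined list to wrap_plots().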

Related

boxplot in R is showing a vertical straight line

I have a data frame with multiple columns. I want to create two boxplots, one for each of the two professions "secretary" and "driver", but the result is not satisfying, as the picture shows. This is my data and code:
profession  ve.count.descrition  euse.count.description  Qualitative.result
secretary   0                    1                       -0.5
secretary   0                    2                       1
driver      1                    1                       -1
driver      0                    2                       0.3
data %>%
mutate(Qualitative.result = factor(Qualitative.result)) %>%
ggplot(aes(x = Profession , fill = Qualitative.result)) +
geom_boxplot()
You should not convert Qualitative.result to a factor: geom_boxplot() needs a numeric variable to summarise. Keep it numeric and map it to y. Maybe you want something like this:
library(tidyverse)
data %>%
ggplot(aes(x = Profession, y = Qualitative.result, fill = Profession)) +
geom_boxplot()
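For a version that runs on its own, here is a minimal sketch that rebuilds the four rows shown in the question. The column is capitalised as Profession here, which is an assumption made to match the plotting code (the table header in the question is lower-case).

```r
library(ggplot2)

# Toy data rebuilt from the question's table; Profession is capitalised
# here (an assumption) so that it matches the plotting code
data <- data.frame(
  Profession = c("secretary", "secretary", "driver", "driver"),
  Qualitative.result = c(-0.5, 1, -1, 0.3)
)

# y stays numeric, so each profession gets a box spanning its values
p <- ggplot(data, aes(x = Profession, y = Qualitative.result, fill = Profession)) +
  geom_boxplot()
print(p)
```

With only two observations per profession the boxes are narrow summaries, but they are boxes and no longer bare lines.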

Error in ggplot2 when using both fill and group parameters in geom_bar

There seems to be a problem with R's ggplot2 library when I include both the fill and group parameters in a bar plot (geom_bar()). I've already tried looking for answers for several hours but couldn't find one that would help. This is actually my first post here.
To give a little background, I have a dataframe named smokement (short for smoke and mental health), a categorical variable named smoke100 (smoked in the past 100 days?) with "Yes" and "No", and another categorical variable named misnervs (frequency of feelings of nervousness) with 5 possible values: "All", "Most", "Some", "A little", and "None."
When I run this code, I get this result:
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, fill = smoke100)) +
facet_wrap(~misnervs, nrow = 1)
However, the result I want is to have all grouped bar plots display their respective proportions. By reading a bit of "R for Data Science" book I found out that I need to include y = ..prop.. and group = 1 in aes() to achieve it:
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., group = 1)) +
facet_wrap(~misnervs, nrow = 1)
Finally, I try to use the fill = smoke100 parameter in aes() to display this categorical variable in color, just like I did on the first code. But when I add this fill parameter, it doesn't work! The code runs, but it shows exactly the same output as the second code, as if the fill parameter this time was somehow ignored!
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., group = 1, fill = smoke100)) +
facet_wrap(~misnervs, nrow = 1)
Does anyone have an idea of why this happens, and how to solve it? My end goal is to display each value of smoke100 (the "Yes" and "No" bars) with colors and a legend at the right, just like on the first graph, while having each grouping level of "misnervs" display their respective proportions of smoke100 ("Yes", "No") levels, just like on the second graph.
EDIT:
> dim(smokement)
[1] 35471 6
> str(smokement)
'data.frame': 35471 obs. of 6 variables:
$ smoke100: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 1 1 1 1 1 ...
$ misnervs: Factor w/ 5 levels "All","Most","Some",..: 3 4 5 4 1 5 3 3 5 5 ...
$ mishopls: Factor w/ 5 levels "All","Most","Some",..: 3 5 5 5 5 5 5 5 5 5 ...
$ misrstls: Factor w/ 5 levels "All","Most","Some",..: 3 5 5 3 1 5 3 5 1 5 ...
$ misdeprd: Factor w/ 5 levels "All","Most","Some",..: 5 5 5 5 4 5 5 5 5 5 ...
$ miswtles: Factor w/ 5 levels "All","Most","Some",..: 5 5 5 5 5 5 5 5 5 5 ...
> head(smokement)
smoke100 misnervs mishopls misrstls misdeprd miswtles
1 Yes Some Some Some None None
2 No A little None None None None
3 Yes None None None None None
4 No A little None Some None None
5 Yes All None All A little None
6 Yes None None None None None
As for the output without group = 1:
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., fill = smoke100)) +
facet_wrap(~misnervs, nrow = 1)
Besides the solution offered here, the GGally package includes a stat_prop, which introduces a new by aesthetic to specify how the proportions should be calculated:
library(GGally)
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., fill = smoke100, by = misnervs), stat = "prop") +
facet_wrap(~misnervs, nrow = 1)
And just for reference, the same can be achieved without GGally by setting fill=factor(..x..):
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., fill = factor(..x..), group = 1)) +
facet_wrap(~misnervs, nrow = 1)
DATA
misnervs <- c("All", "Most", "Some", "A little", "None")
set.seed(123)
smokement <-
data.frame(
smoke100 = sample(c("Yes", "No"), 100, replace = TRUE),
misnervs = factor(sample(misnervs, 100, replace = TRUE), levels = misnervs)
)
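Newer ggplot2 releases prefer after_stat() over the ..prop.. notation, so the second variant can also be written as in the sketch below. It assumes ggplot2 3.3.0 or later. smoke100 is turned into a factor with explicit levels here (a small change to the DATA block above) so that the legend labels can be taken from its levels; the scale_fill_discrete() call is my addition for readable legend labels.

```r
library(ggplot2)

misnervs <- c("All", "Most", "Some", "A little", "None")
set.seed(123)
smokement <- data.frame(
  smoke100 = factor(sample(c("Yes", "No"), 100, replace = TRUE), levels = c("Yes", "No")),
  misnervs = factor(sample(misnervs, 100, replace = TRUE), levels = misnervs)
)

# group = 1 makes prop relative to the whole facet; fill is taken from the
# computed bar position, so it survives the grouping
p <- ggplot(smokement) +
  geom_bar(aes(x = smoke100, y = after_stat(prop),
               fill = factor(after_stat(x)), group = 1)) +
  scale_fill_discrete(name = "smoke100", labels = levels(smokement$smoke100)) +
  facet_wrap(~misnervs, nrow = 1)
print(p)

# Sanity check: the proportions within each facet should add up to 1
bars <- layer_data(p)
panel_totals <- tapply(bars$prop, bars$PANEL, sum)
```

The reason fill = smoke100 appears to be ignored in the question is that, with group = 1, both bars belong to one group and fill is not constant within it, so the stat drops it. Deriving the fill from the computed x position sidesteps that.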
I wasn't able to get what you wanted by tweaking your call to geom_bar*, but I think this gives you what you are looking for. As you didn't provide your input dataset (for understandable reasons), I've used the diamonds tibble in my code. The changes you need to make should be obvious.
*: I'm sure it can be done: I just wasn't able to work it out.
The idea behind my solution is to pre-compute the proportions you want to plot before the call to ggplot.
group_modify takes a grouped tibble and applies the specified function to each group in turn, before returning the modified (grouped) tibble.
diamonds %>%
group_by(cut) %>%
group_modify(
function(.x, .y)
.x %>%
group_by(color) %>%
summarise(Prop=n()/nrow(.))
) %>%
ggplot() +
geom_col(aes(x=color, y=Prop, fill=color)) +
facet_wrap(~cut)
Note the switch from geom_bar to geom_col: geom_bar uses row counts, geom_col uses values in the data.
As a rough-and-ready QC, here's the equivalent of your code that produces the "all grey" plot:
diamonds %>%
ggplot() +
geom_bar(aes(x=color, y=..prop.., fill=color, group=1)) +
facet_wrap(~cut)
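The pre-computation can also be written without group_modify(). The following is a sketch of an equivalent approach using count() and a grouped mutate(); props and Prop are just illustrative names.

```r
library(dplyr)
library(ggplot2)

props <- diamonds %>%
  count(cut, color) %>%          # rows per cut/color combination
  group_by(cut) %>%
  mutate(Prop = n / sum(n)) %>%  # share of each color within its cut
  ungroup()

p <- ggplot(props) +
  geom_col(aes(x = color, y = Prop, fill = color)) +
  facet_wrap(~cut)
print(p)

# Sanity check: within each cut the proportions should add up to 1
totals <- tapply(props$Prop, props$cut, sum)
```

For the question's data, the same shape would be count(smoke100 within misnervs), i.e. swap cut for misnervs and color for smoke100.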

plotting two numeric variables in the same graph

I want to visualise two variables in the same graph.
The variables look like this:
> head(intp.trust_male)
# A tibble: 1 × 1
average_intp.trust
<dbl>
1 2.33
and
> head(intp.trust_fem)
# A tibble: 1 × 1
average_intp.trust
<dbl>
1 2.34
I have tried merge to put them in the same data frame, but it doesn't seem to work:
Q5 <- merge(intp.trust_fem, intp.trust_male)
ggplot(data = Q5)+
aes(fill = percent_owned) +
geom_sf() +
scale_fill_viridis_c()
Can anyone help me out here, please?
Thank you :)
I think what you want to do is stack your data frames. You can do this with dplyr::bind_rows. It's not clear from your question what you're trying to accomplish because percent_owned is not a variable in the data you've shown. Generally, you could do (using geom_point):
library(dplyr)
library(ggplot2)
intp.trust_male <- mutate(intp.trust_male, label = "intp.trust_male")
intp.trust_fem <- mutate(intp.trust_fem, label = "intp.trust_fem")
df <- bind_rows(intp.trust_male, intp.trust_fem)
ggplot(df, aes(x = label, y = average_intp.trust)) +
geom_point()
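As a self-contained variant, bind_rows() can add the label column itself through its .id argument when the inputs are named. The sketch below uses stand-in tibbles with the two values shown in the question; the labels "male" and "female", the column name group, and the use of geom_col() are my choices for illustration.

```r
library(dplyr)
library(ggplot2)

# Stand-ins for the two one-row tibbles in the question
intp.trust_male <- tibble(average_intp.trust = 2.33)
intp.trust_fem  <- tibble(average_intp.trust = 2.34)

# The argument names become the values of the "group" column
df <- bind_rows(male = intp.trust_male, female = intp.trust_fem, .id = "group")

p <- ggplot(df, aes(x = group, y = average_intp.trust, fill = group)) +
  geom_col()
print(p)
```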

Plotting 3 Variables on One Chart - ggplot

I have some experience with base R but am trying to learn tidyverse and ggplot. I have a dataframe with 4 columns of data. I want a simple x-y plot, where the first column of data is on the x-axis, and the data in the other 3 columns is plotted on the y-axis, resulting in 3 lines on one plot. The first 15 lines of my data look like this (sorry about the image - I don't know how to insert a sample of my data):
screen shot - first 15 rows of data
I tried to plot the second and third columns of data as follows:
ggplot(data=SWRC_SL, aes(x=SWRC_SL$pressure_head, y=SWRC_SL$UNSODA_theta)) +
geom_line(colour="red") + scale_x_log10() +
ggplot(data=SWRC_SL, aes(x=SWRC_SL$pressure_head, y=SWRC_SL$Vrugt_theta)) +
geom_line(colour="blue") + scale_x_log10()
I get this error:
Error: Don't know how to add ggplot(data = SWRC_SL, aes(x = SWRC_SL$pressure_head, y = SWRC_SL$Vrugt_theta)) to a plot
I believe I should be using something like "group=" to indicate which columns should be plotted, but I haven't been able to find an example that shows how you can use ggplot to plot data across multiple columns. What am I missing?
ggplot() is only ever called once when you create a chart. Try with the following:
ggplot() +
geom_line(data=SWRC_SL, aes(x=pressure_head, y=UNSODA_theta), colour="red") +
geom_line(data=SWRC_SL, aes(x=pressure_head, y=Vrugt_theta), colour="blue") +
scale_x_log10()
A better method would be to turn your data to long, where the UNSODA_theta and Vrugt_theta data are in the same column (say thetas), and have another column (say type_theta) indicating whether the data is for UNSODA_theta or Vrugt_theta. Then you could do the following:
ggplot(data=SWRC_SL, aes(x=pressure_head, y=thetas, colour=type_theta)) +
geom_line() +
scale_x_log10()
This is more desirable because ggplot2 will include a legend indicating what type of theta the colours are applied to.
As suggested by @Marius, the most efficient way to plot your data is to convert them into a long format.
Using tidyverse, you can use the pivot_longer function (from the tidyr package) and write the following code:
library(tidyverse)
SWRC_SL %>% pivot_longer(.,-pressure_head, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = pressure_head, y = value, color = variable))+
geom_line()+
scale_x_log10()
EDIT: Illustrating example
Using this dummy dataset:
  pressure UNSODA_theta Vrugt_theta Cassel_theta
1        0   -1.4672500   1.4119747   -2.0553118
2        1    0.5210227   0.6189239    1.4817574
3        2   -0.1587546   1.4094018    2.2796175
4        3    1.4645873   2.6888733   -0.4631109
5        4   -0.7660820   2.5865884   -1.8799346
6        5   -0.4302118   0.6690922    0.9633620
First, you pivot your data into a long format:
df %>% pivot_longer(.,-pressure, names_to = "variable", values_to = "value")
# A tibble: 45 x 3
   pressure variable      value
      <int> <chr>         <dbl>
 1        0 UNSODA_theta -1.47
 2        0 Vrugt_theta   1.41
 3        0 Cassel_theta -2.06
 4        1 UNSODA_theta  0.521
 5        1 Vrugt_theta   0.619
 6        1 Cassel_theta  1.48
 7        2 UNSODA_theta -0.159
 8        2 Vrugt_theta   1.41
 9        2 Cassel_theta  2.28
10        3 UNSODA_theta  1.46
# … with 35 more rows
Now your data are in a suitable shape for plotting with ggplot2. You can chain the ggplot command directly onto the previous command by adding a pipe (%>%) between them:
library(tidyverse)
df %>% pivot_longer(.,-pressure, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = pressure, y = value, color = variable))+
geom_line()+
scale_x_log10()
And you get the following plot with legend included:
Data example
structure(list(pressure = 0:14, UNSODA_theta = c(-1.46725002909224,
0.521022742648139, -0.158754604716016, 1.4645873119698, -0.766081999604665,
-0.430211753928547, -0.926109497377437, -0.17710396143654, 0.402011779486338,
-0.731748173119606, 0.830373167981674, -1.20808278630446, -1.04798441280774,
1.44115770684428, -1.01584746530465), Vrugt_theta = c(1.41197471231751,
0.61892394889108, 1.40940183965093, 2.68887328620405, 2.58658843344197,
0.669092199317234, -1.28523553529247, 3.49766158983416, 1.66706616676549,
1.5413273359637, 0.986600476854091, 1.51010842295293, 0.835624168230333,
1.42069464325451, 0.599753256022356), Cassel_theta = c(-2.05531181632119,
1.48175740118232, 2.27961753824932, -0.46311085383842, -1.87993463341154,
0.963361958516736, -0.0670637053409687, -2.59982761023726, 0.00319778952040447,
-0.945450500892219, -0.511452869790608, -1.73485854395378, 2.7047128618762,
-0.496698054586832, -2.40827011837962)), class = "data.frame", row.names = c(NA,
-15L))

Why is geom_bar y-axis unproportional to actual numbers?

Sorry if this question already exists - was googling for a while now already and didn't find anything.
I am relatively new to R and learning while doing all of this.
I'm supposed to create a PDF via R Markdown that analyses patient data with specific main and secondary diagnoses. For this I'm supposed to plot some numbers with ggplot (geom_bar and geom_boxplot).
So far, I retrieve data sets that include both codes via SQL and load them into data.table objects. Then I join them to get the data I need.
After this I add columns that contain substrings of those codes, and others that contain the counts of those substrings (so I can plot the occurrences of every code).
Now I want to put, for example, one of these data.tables into geom_bar or geom_boxplot. This works, but my y-axis has a weird scale that doesn't fit the numbers it should show. The proportions of the bars are also not accurate.
For example: one diagnosis appears 600 times and another one 1000 times. The y-axis shows steps of 0 - 500.000 - 1.000.000 - 1.500.000 - ....
The bar that should show 600 is tiny and the bar for 1000 goes up to 1.500.000.
If I create a new variable first, count what I need via count(), and plot that, it just works. The columns I put on the y-axis have the same data type (integer) in both variables.
So here is just how I create the data.table that I use for plotting
exazerbationsHdComorbiditiesNd <- allExazerbationsHd[allComorbiditiesNd, on="encounter_num", nomatch=0]
exazerbationsHdComorbiditiesNd <- exazerbationsHdComorbiditiesNd[, c("i.DurationGroup", "i.DurationInDays", "i.start_date", "i.end_date", "i.duration", "i.patient_num"):=NULL]
exazerbationsHdComorbiditiesNd[ , IcdHdCodeCount := .N, by = concept_cd]
exazerbationsHdComorbiditiesNd[ , IcdHdCodeClassCount := .N, by = IcdHdClass]
If I want to bar-plot now for example IcdHdClass by IcdHdCodeClassCount I do following:
ggplot(exazerbationsHdComorbiditiesNd, aes(exazerbationsHdComorbiditiesNd$IcdHdClass, exazerbationsHdComorbiditiesNd$IcdHdCodeClassCount, label=exazerbationsHdComorbiditiesNd$IcdHdCodeClassCount)) + geom_bar(stat = "identity") + geom_text(vjust = 0, size = 5)
It outputs said bar-plot with weird proportions.
If I do first:
plotTest <- count(exazerbationsHdComorbiditiesNd, exazerbationsHdComorbiditiesNd$IcdHdClass)
And then bar-plot it:
ggplot(plotTest, aes(plotTest$`exazerbationsHdComorbiditiesNd$IcdHdClass`, plotTest$n, label=plotTest$n)) + geom_bar(stat = "identity") + geom_text(vjust = 0, size = 5)
It's all perfect and works.
I checked also data-types of the columns I needed:
sapply(exazerbationsHdComorbiditiesNd, class)
sapply(plotTest, class)
In both variables the columns I need are of type character and integer.
Edit:
Unfortunately I can't post images, so here are just the links to them.
Here is a screenshot of the plot with wrong y-axis:
https://ibb.co/CbxX1n7
And here is a screenshot of the plot shown right:
https://ibb.co/Xb8gyx1
Here is some example-data that I copied out the data.table object:
Exampledata
Since you added the class counts as an additional column--rather than aggregating--what’s happening is that for each row in your data, the class counts get stacked on top of each other:
library(tidyverse)
set.seed(42)
df <- tibble(class = sample(letters[1:3], 10, replace = TRUE)) %>%
add_count(class, name = "count")
df # this is essentially what your data looks like
#> # A tibble: 10 x 2
#>    class count
#>    <chr> <int>
#>  1 a         5
#>  2 a         5
#>  3 a         5
#>  4 a         5
#>  5 b         3
#>  6 b         3
#>  7 b         3
#>  8 a         5
#>  9 c         2
#> 10 c         2
ggplot(df, aes(class, count)) + geom_bar(stat = "identity")
You could use position = "identity" so that the bars don’t get stacked:
ggplot(df, aes(class, count)) +
geom_bar(stat = "identity", position = "identity")
However, that creates a whole bunch of unnecessary layers in your plot that you can’t see. A better approach would be to drop the extra rows from your data before plotting:
df %>%
distinct(class, count)
#> # A tibble: 3 x 2
#>   class count
#>   <chr> <int>
#> 1 a         5
#> 2 b         3
#> 3 c         2
df %>%
distinct(class, count) %>%
ggplot(aes(class, count)) +
geom_bar(stat = "identity")
Created on 2019-09-05 by the reprex package (v0.3.0.9000)
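Since the question works with data.table, the same fix can be sketched there: aggregate with .N into a new table, giving one row per class, instead of adding the count as a column with :=. When the count is a column, a class with n rows stacks n bars of height n, which is why the axis runs into the hundreds of thousands. The table dt and the class codes below are made up for illustration; only the column names IcdHdClass and IcdHdCodeClassCount come from the question.

```r
library(data.table)
library(ggplot2)

# Toy stand-in for the joined table: one row per encounter (made-up codes)
set.seed(42)
dt <- data.table(IcdHdClass = sample(c("J44", "I10", "E11"), 20, replace = TRUE))

# Aggregate to one row per class instead of adding .N as a column with :=
class_counts <- dt[, .(IcdHdCodeClassCount = .N), by = IcdHdClass]

p <- ggplot(class_counts,
            aes(IcdHdClass, IcdHdCodeClassCount, label = IcdHdCodeClassCount)) +
  geom_col() +
  geom_text(vjust = 0, size = 5)
print(p)
```

Referring to columns by bare name inside aes(), as done here, is also safer than the table$column form used in the question.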
