Show the difference between two averages with ggplot in R

In my dataset I have two columns, named part_1 and part_2, that contain several numerical values.
I am required to create a graph that shows how the average varies in the two parts.
I think that the best way is to create a barplot with a bar for each part, but I'm not sure about it.
First, I created two new columns that contain the mean values for the two parts in each row:
averages <- my_data %>% mutate(avg_part1=mean(part_1,na.rm=T)) %>% mutate(avg_part2=mean(part_2,na.rm=T))
Then, I inserted the values in two new variables:
avg_part1 <- averages %>% slice(1) %>% pull(avg_part1)
avg_part2 <- averages %>% slice(1) %>% pull(avg_part2)
To create the plot I did:
to_graph<-c("First part"=avg_part1,"Second part"=avg_part2)
barplot(to_graph)
And I obtained the graph I wanted, but it doesn't look very nice.
I feel like this process is too complex and I may be able to do everything in a couple lines and without creating so many new variables, do you have any suggestions?
Also, I would prefer to create the graph with ggplot because it's better to improve the design, but I don't really know how to do it.
Thanks!

Using ggplot:
library(ggplot2)
library(dplyr)
my_data %>%
  stack(select = c(part_1, part_2)) %>%    # long format: columns "values" and "ind"
  ggplot(aes(x = ind, y = values)) +
  geom_bar(stat = "summary", fun = mean)   # each bar shows the mean of one part
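If you would rather compute the two averages explicitly in one short pipeline (without the intermediate variables from the question), something along these lines might work. This is only a sketch: the small my_data below is made-up stand-in data, and the names avg_long, part and average are my own.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for my_data (values invented for illustration)
my_data <- data.frame(
  part_1 = c(2, 4, 6, NA),
  part_2 = c(1, 3, 5, 11)
)

# One row per part, holding that part's mean (NAs ignored)
avg_long <- my_data %>%
  summarise(across(c(part_1, part_2), ~ mean(.x, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "part", values_to = "average")

# geom_col() draws the precomputed averages as bars
p <- ggplot(avg_long, aes(x = part, y = average)) +
  geom_col()
print(p)
```

Because avg_long is an ordinary data frame, you can style the plot further with the usual ggplot2 layers such as labs() or theme_minimal().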

Related

Debt/GDP graph with ggplot

I have a dataset in which the first column is named central_government_debt_percent_of_gdp and contains a list of years, then several columns with the name of some countries that contain the debt/GDP ratio for each of them in every year.
You can see some of the data at this link:
I want to create a graph that shows the evolution of the ratio for each country, with separate lines. How can I do it with ggplot?
Do I have to add a geom_line for each country?
Should I do some data manipulation first?
As some have already mentioned, it would be appreciated if you provided a reproducible example. I will still try to answer your question, based on the link you included.
You need to do some data transformation, as your data is not yet in "tidy" format. This means: You want a column for every variable, a row for every observation and a cell should contain one value. For that, you need the pivot_longer() function.
library(tidyverse)
data %>%
  pivot_longer(
    cols = austria:germania,
    names_to = "countries",
    values_to = "values") %>%
  ggplot(aes(x = central_government_debt_percent_of_gdp,
             y = values,
             color = countries)) +   # color also groups the data: one line per country
  geom_line()
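Since the original data is not available here, this is a small self-contained sketch of the same idea with made-up numbers. The data frame debt and the columns year, country and ratio are hypothetical names, not the ones from the linked file. It shows that a single geom_line() is enough once the data is in long format.

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide data: one column per country (numbers invented)
debt <- data.frame(
  year    = c(2000, 2001, 2002),
  austria = c(60, 61, 63),
  germany = c(58, 57, 59)
)

# Long format: one row per year/country combination
debt_long <- pivot_longer(debt, cols = -year,
                          names_to = "country", values_to = "ratio")

# One geom_line() call; the color aesthetic splits the data by country
p <- ggplot(debt_long, aes(x = year, y = ratio, color = country)) +
  geom_line()
print(p)
```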

Is there a modification or another function to plot two numerical variables against one string variable?

I have a data set like this one: Names of mutations and two numerical variables representing values in two conditions (CIP and TIG):
I was able to plot one variable (e.g. CIP) in these mutation using the following code:
The data frame is named Dotchart2:
dotchart(Dotchart2$`CIP resistance`,
labels = rownames((Dotchart2)), pch = 16, cex = 1, pt.cex = 2)
This appeared as follows:
Since I am comparing CIP vs TIG, I would like the same figure with a second dot for TIG on each mutation (i.e. each horizontal mutation line would carry two dots of different colors, one for the CIP value and one for the TIG value). It should look like this figure, for instance:
Could any of you provide simplified code for this?
I think you'll find your answer here.
In the link provided, @JoshO'Brien creates a dot plot using the lattice package:
autos_data <- read.table("~/Documents/R/test.txt", header=F)
library(lattice)
dotplot(V1~V2, data=autos_data)
This documentation does a thorough job of explaining and detailing graph styles (graph_type), data graphing (formula), and the data source (data=), resulting in the following:
library(lattice)
graph_type(formula, data=)
To do this easily in lattice or ggplot2 you first need to convert your data to long format. I don't have a data set handy in the right format, so I took the famous iris data set and converted it to a wide-format data set called iris_wide (see code at the bottom). I'm using tidyverse here: all of this can also be done in base R.
(To understand what's going on here you should definitely examine the iris_wide and iris_long objects.)
convert from wide to long format
library(tidyverse)
iris_long <- iris_wide %>%
pivot_longer(cols=-id,names_to="species",values_to="value")
lattice version
lattice::dotplot(id~value, data=iris_long, group=species,pch=16,
auto.key=TRUE)
ggplot version
ggplot(iris_long, aes(value,id,colour=species))+geom_point()
convert iris data from long to wide
To match your example, I'm selecting only two categories (species) and one variable (sepal length)
iris_wide <- (iris
%>% filter(Species %in% c("setosa","virginica"))
%>% select(Sepal.Length, Species)
%>% group_by(Species)
%>% mutate(id=seq(n()))
%>% pivot_wider(names_from=Species, values_from=Sepal.Length)
%>% head(10)
%>% mutate(id=LETTERS[seq(n())])
)
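Applied to data shaped like yours, the same approach might look roughly like this. I don't have your data, so the data frame resist and its values are invented, and I'm assuming the mutation names sit in a regular column called mutation rather than in the row names.

```r
library(tidyr)
library(ggplot2)

# Hypothetical data: one row per mutation, one column per condition
resist <- data.frame(
  mutation = c("mutA", "mutB", "mutC"),
  CIP      = c(4, 16, 2),
  TIG      = c(1, 2, 8)
)

# Long format: one row per mutation/condition pair
resist_long <- pivot_longer(resist, cols = c(CIP, TIG),
                            names_to = "condition", values_to = "value")

# Two dots per mutation line, colored by condition
p <- ggplot(resist_long, aes(x = value, y = mutation, colour = condition)) +
  geom_point(size = 3)
print(p)
```

If your mutation names are stored as row names, tibble::rownames_to_column() can move them into a column first.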

Saturation curve with geom_step in R

I have two problems: I would like to create a graph with multiple lines by adding the values in the columns stepwise (should kind of end up looking like multiple saturation curves). I think geom_step in the ggplot2 package should work. However, I don't know how to add the values in the columns as I go and I don't know how to add multiple lines (I will have over 100 lines) therefore both steps should be automated in some way.
This data set shows my data; it contains only the first 3 columns and the first 13 rows.
a<-c(0,1,1,1,1,1,1,0,1,0,1,0,1)
b<-c(0,1,0,0,1,0,1,0,1,0,1,0,1)
c<-c(0,1,1,0,1,0,1,1,1,1,1,1,1)
df<-data.frame(a,b,c)
Can anyone help me? I have no idea where to start.
If you're looking for cumulative sums of the data, the cumsum() function will do it for you.
It isn't completely clear to me what you're looking for, but this might take care of it:
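library(tidyverse)
a<-c(0,1,1,1,1,1,1,0,1,0,1,0,1)
b<-c(0,1,0,0,1,0,1,0,1,0,1,0,1)
c<-c(0,1,1,0,1,0,1,1,1,1,1,1,1)
df2<-data.frame(a,b,c)
df3 <- df2 %>%
  mutate_all(cumsum) %>%          # running total of each column
  rename_all(paste0, 'x') %>%     # cumulative columns become ax, bx, cx
  cbind(df2) %>%                  # keep the original 0/1 columns as well
  mutate(row = row_number()) %>%
  pivot_longer(ax:c)              # long format: columns "name" and "value"
ggplot(df3) +
  geom_step(aes(x = row, y = value, color = name))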
The data was reshaped to longer data for ease of plotting. Original data was left in as well, those are the lines that stay near the bottom of the graph.
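If you only want the cumulative curves (without the original 0/1 columns), a variant like the following might be closer to what you're after. It is a sketch under the assumption that every column in the data frame should become one line; the names cum_long, series and total are mine.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

a <- c(0,1,1,1,1,1,1,0,1,0,1,0,1)
b <- c(0,1,0,0,1,0,1,0,1,0,1,0,1)
c <- c(0,1,1,0,1,0,1,1,1,1,1,1,1)
df <- data.frame(a, b, c)

cum_long <- df %>%
  mutate(across(everything(), cumsum)) %>%   # running total per column
  mutate(row = row_number()) %>%             # x position of each step
  pivot_longer(-row, names_to = "series", values_to = "total")

# One step line per original column, however many columns there are
p <- ggplot(cum_long, aes(x = row, y = total, color = series)) +
  geom_step()
print(p)
```

With over 100 columns the legend will get crowded; adding theme(legend.position = "none") hides it.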
The output:

Need Help Incorporating Tidyr's Spread into a Function that Outputs a List of Dataframes with Grouped Counts

library(tidyverse)
Using the sample data at the bottom, I want to find counts of the Gender and FP variables, then spread these variables using tidyr::spread(). I'm attempting to do this by creating a list of dataframes, one for the Gender counts, and one for FP counts. The reason I'm doing this is to eventually cbind both dataframes. However, I'm having trouble incorporating the tidyr::spread into my function.
The function below creates a list of two dataframes with counts for Gender and FP, but the counts are not "spread."
group_by_quo=quos(Gender,FP)
DF2<-map(group_by_quo,~DF%>%
group_by(Code,!!.x)%>%
summarise(n=n()))
If I add tidyr::spread, it doesn't work. I'm not sure how to incorporate this since each dataframe in the list has a different variable.
group_by_quo=quos(Gender,FP)
DF2<-map(group_by_quo,~DF%>%
group_by(Code,!!.x)%>%
summarise(n=n()))%>%
spread(!!.x,n)
Any help would be appreciated!
Sample Code:
Subject<-c("Subject1","Subject2","Subject1","Subject3","Subject3","Subject4","Subject2","Subject1","Subject2","Subject4","Subject3","Subject4")
Code<-c("AAA","BBB","AAA","CCC","CCC","DDD","BBB","AAA","BBB","DDD","CCC","DDD")
Code2<-c("AAA2","BBB2","AAA2","CCC2","CCC2","DDD2","BBB2","AAA2","BBB2","DDD2","CCC2","DDD2")
Gender<-c("Male","Male","Female","Male","Female","Female","Female","Male","Male","Male","Male","Male")
FP<-c("F","P","P","P","F","F","F","F","F","F","F","F")
DF<-data_frame(Subject,Code,Code2,Gender,FP)
I think you misplaced the closing parenthesis. This code works for me:
library(tidyverse)
Subject<-c("Subject1","Subject2","Subject1","Subject3","Subject3","Subject4","Subject2","Subject1","Subject2","Subject4","Subject3","Subject4")
Code<-c("AAA","BBB","AAA","CCC","CCC","DDD","BBB","AAA","BBB","DDD","CCC","DDD")
Code2<-c("AAA2","BBB2","AAA2","CCC2","CCC2","DDD2","BBB2","AAA2","BBB2","DDD2","CCC2","DDD2")
Gender<-c("Male","Male","Female","Male","Female","Female","Female","Male","Male","Male","Male","Male")
FP<-c("F","P","P","P","F","F","F","F","F","F","F","F")
DF<-tibble(Subject,Code,Code2,Gender,FP) # tibble() replaces the deprecated data_frame()
group_by_quo <- quos(Gender, FP)
DF2 <- map(group_by_quo,
~DF %>%
group_by(Code,!!.x) %>%
summarise(n=n()) %>%
spread(!!.x,n))
This last part is a bit more concise using count:
DF2 <- map(group_by_quo,
~DF %>%
count(Code,!!.x) %>%
spread(!!.x,n))
And by using count the unnecessary grouping information is removed as well.
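In current tidyr, spread() is superseded by pivot_wider(), so a version along these lines may be worth considering. Treat it as a sketch: it uses a shortened, made-up DF, passes the grouping columns as character strings (group_vars is my own name) instead of quosures, and fills missing combinations with 0, which is a choice the original code does not make.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical, shortened stand-in for DF
DF <- data.frame(
  Code   = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  Gender = c("Male", "Female", "Male", "Male", "Male"),
  FP     = c("F", "P", "F", "P", "F")
)

group_vars <- c("Gender", "FP")

DF2 <- group_vars %>%
  set_names() %>%                      # name the list elements "Gender" and "FP"
  map(~ DF %>%
        count(Code, !!rlang::sym(.x)) %>%   # counts per Code and grouping column
        pivot_wider(names_from = all_of(.x),
                    values_from = n,
                    values_fill = 0))       # absent combinations become 0

DF2$Gender
DF2$FP
```

Naming the list elements makes it easier to pick out each table before combining them.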

How to Create Multiple Frequency Tables with Percentages Across Factor Variables using Purrr::map

library(tidyverse)
library(ggmosaic) # for the "happy" dataset
I feel like this should be a somewhat simple thing to achieve, but I'm having difficulty with percentages when using purrr::map together with table(). Using the "happy" dataset, I want to create a list of frequency tables for each factor variable. I would also like to have rounded percentages instead of counts, or both if possible.
I can create frequency percentages for each factor variable separately with the code below.
with(happy,round(prop.table(table(marital)),2))
However I can't seem to get the percentages to work correctly when using table() with purrr::map. The code below doesn't work...
happy%>%select_if(is.factor)%>%map(round(prop.table(table)),2)
The second method I tried was using tidyr::gather, and calculating the percentage with dplyr::mutate and then splitting the data and spreading with tidyr::spread.
TABLE<-happy%>%select_if(is.factor)%>%gather()%>%group_by(key,value)%>%summarise(count=n())%>%mutate(perc=count/sum(count))
However, since there are different factor variables, I would have to split the data by "key" before spreading using purrr::map and tidyr::spread, which came close to producing some useful output except for the repeating "key" values in the rows and the NA's.
TABLE%>%split(TABLE$key)%>%map(~spread(.x,value,perc))
So any help on how to make both of the above methods work would be greatly appreciated...
You can use an anonymous function or a formula to get your first option to work. Here's the formula option.
happy %>%
select_if(is.factor) %>%
map(~round(prop.table(table(.x)), 2))
In your second option, removing the NA values and then removing the count variable prior to spreading helps. The order in the result has changed, however.
TABLE = happy %>%
select_if(is.factor) %>%
gather() %>%
filter(!is.na(value)) %>%
group_by(key, value) %>%
summarise(count = n()) %>%
mutate(perc = round(count/sum(count), 2), count = NULL)
TABLE %>%
split(.$key) %>%
map(~spread(.x, value, perc))
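Since you mentioned wanting counts and percentages together if possible, here is one way that might do it. It is only a sketch: survey is a small invented data frame standing in for happy, and freq_table is a hypothetical helper, not a function from any package.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the "happy" data (values invented)
survey <- data.frame(
  marital = factor(c("married", "single", "married", "divorced")),
  degree  = factor(c("college", "school", "school", "school")),
  age     = c(30, 41, 25, 58)
)

# Hypothetical helper: counts and rounded proportions for one factor
freq_table <- function(x) {
  counts <- table(x)   # table() drops NA values by default
  data.frame(
    level = names(counts),
    count = as.vector(counts),
    perc  = round(as.vector(prop.table(counts)), 2)
  )
}

# One table per factor column, returned as a named list
freq_tables <- survey %>%
  select(where(is.factor)) %>%
  map(freq_table)

freq_tables$marital
```

Each list element is a plain data frame, so the results can be bound together or printed individually.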
