R - Multiple Columns on one single Scatterplot

Would you mind taking a look at this?
https://docs.google.com/spreadsheets/d/14vVWxhaQynPmnAsZHlrkkdeJTt0XlDzHc5JSd4DNF-Y/edit?usp=sharing
I have three variables: Year (2000-2017), each country's GDP over 2000-2017, and each country's soccer ranking over 2000-2017.
I would like to draw one large scatter plot with Year (2000-2017) on the x-axis and Rank on a reversed y-axis (200 at the bottom, 1 at the top), with the size of each point varying with GDP.
All I can come up with is plotting a scatter plot for one country only:
rank <- read.csv("Test1.csv", sep=",", header=TRUE)
library(ggplot2)
qplot(Year, Rank , data = rank, size = Aruba)
But I would like to fit all the countries into one scatter plot, with the y-axis reversed, and draw a linear regression through all the points if possible.
Can someone help me on this?

I am not sure how you want the regression done, but here is the graph.
Edit: because there is a country named "Rankmibia", which I have never heard of, selecting columns by prefix won't work, so I selected them by position this time.
rank <- read.csv("Test1.csv", sep=",", header=TRUE)
library(tidyr)
library(ggplot2)
library(dplyr)
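# Columns 3, 5, 7, ... hold each country's rank; stack them into one long column
r <- rank %>% select(seq(3, ncol(rank), 2)) %>% gather(id, rank)
# Column 1 is Year; columns 2, 4, 6, ... hold each country's GDP
g <- rank %>% select(1, seq(2, ncol(rank), 2)) %>% gather(country, GDP, -Year)
# Both tables are stacked in the same country order, so the rows line up
df <- cbind(g, rank = r$rank)
# Point size follows GDP, colour follows country, and the y-axis is reversed
p <- qplot(Year, rank, data = df, size = GDP, color = country) + scale_y_reverse()
ggsave("fig.png", p, width = 40, height = 20)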
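The question also asks for a single regression line through all the points. A minimal sketch of one way to do that, assuming long-format data shaped like df above (the toy data frame and its values are made up for illustration): keep size and colour inside geom_point() so that geom_smooth() fits one line to all countries together rather than one line per country.

```r
library(ggplot2)

# Hypothetical long-format data standing in for `df`: one row per country-year
set.seed(42)
toy <- expand.grid(Year = 2000:2017, country = c("A", "B", "C"))
toy$GDP  <- runif(nrow(toy), 1, 100)
toy$rank <- sample(1:200, nrow(toy), replace = TRUE)

p <- ggplot(toy, aes(x = Year, y = rank)) +
  # size and colour are mapped here only, so they do not split the regression
  geom_point(aes(size = GDP, colour = country)) +
  # one least-squares line through every point
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "black") +
  scale_y_reverse()

# The same fit as a model object, in case the coefficients are needed
fit <- lm(rank ~ Year, data = toy)
```

Printing p draws the plot, and coef(fit) returns the intercept and slope of the fitted line.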

Related

R ggplot2 Visualize categorical variable that levels appear more than once

I am trying to visualize some tennis data with ggplot2 in R.
Here are my data:
Year<-c(1999:2020)
Player <- rep("Federer",22)
Rank <-
c("Q1","3R","3R","4R","4R","W","SF","W","W","SF","F","W","SF","SF","SF","SF","3R",
"SF","W","W","4R","SF")
data <- data.frame(Year, Player, Rank)
data$Rank <- factor(data$Rank, levels = unique(data$Rank))
What I want is a diagram that looks like a bar plot but is not really one: the x-axis should show the years 1999 to 2020, each mapped to its Rank level.
My problem is that Rank, which I converted to a categorical variable, has levels that appear more than once over time, and that makes things difficult for me.
I am looking for something like the following picture from Wikipedia, with a specific color for every level of the Rank variable.
The Australian Open result is what I want to visualize.
Maybe something like this, using geom_tile() to make a kind of heatmap instead of a bar plot:
library(ggplot2)
library(ggthemes)
# One tile per year, filled according to the round reached
ggplot(data, aes(x = factor(Year), y = Player, fill = Rank)) +
  geom_tile() + scale_fill_economist()
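Since the question asks for a specific color for every level of Rank, here is a hedged variant that sets the fill colors by hand with scale_fill_manual(). The hex colors and the level ordering are arbitrary choices made for this sketch, not part of the original answer.

```r
library(ggplot2)

Year <- 1999:2020
Rank <- c("Q1","3R","3R","4R","4R","W","SF","W","W","SF","F","W","SF","SF","SF","SF","3R",
          "SF","W","W","4R","SF")
results <- data.frame(Year, Player = "Federer", Rank)

# Hypothetical colour choices, one per round, ordered from earliest exit to win
round_cols <- c("Q1" = "grey80", "3R" = "#fee8c8", "4R" = "#fdbb84",
                "SF" = "#fc8d59", "F" = "#e34a33", "W" = "#2ca25f")
results$Rank <- factor(results$Rank, levels = names(round_cols))

p <- ggplot(results, aes(x = factor(Year), y = Player, fill = Rank)) +
  geom_tile(colour = "white") +            # white borders separate the years
  scale_fill_manual(values = round_cols)   # named vector maps level -> colour
```

A level that repeats across years simply gets the same tile color each time it appears, so the repetition is not a problem for this layout.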

plotting two categorical vectors in ggridges

I have a dataset with a few organisms, which I would like to plot on the y-axis against date on the x-axis. However, I want the fluctuation of each curve to represent the abundance of the organism. That is, I would like to plot a time series of relative abundance, separated by organism, to show similar patterns over time.
However, of course, plotting just date against an organism does not yield any information on the abundance. So, my question is, is there a way to make the curve represent abundance using ggridges?
Here is my code for an example dataset:
set.seed(1)
Data <- data.frame(
Abundance = sample(1:100),
Organism = sample(c("organism1", "organism2"), 100, replace = TRUE)
)
Date = rep(seq(from = as.Date("2016-01-01"), to = as.Date("2016-10-01"), by =
'month'),times=10)
Data <- cbind(Date, Data)
ggplot(Data, aes(x = Abundance, y = Organism)) +
geom_density_ridges(scale=1.15, alpha=0.6, color="grey90")
This produces a plot with the two organisms, but I want the date on the x-axis, not the abundance, and simply swapping the variables doesn't work. I have read that you need to specify group=Date or convert the date to Julian day, but neither lets me incorporate abundance into the plot.
Does anyone have an example of a plot with date vs. a categorical variable (i.e. organism) plotted against a continuous variable in ggridges?
I really like to output from ggridges and would like to be able to use it for these visualizations. Thank you in advance for your help!
Cheers,
Anni
To use geom_density_ridges, it'll help to reshape the data to show observations in separate rows, vs. as summarized by Abundance.
library(ggplot2); library(ggridges); library(dplyr)
# uncount() repeats each row as many times as its Abundance value
Data_sum <- Data %>%
  tidyr::uncount(Abundance)
ggplot(Data_sum, aes(x = Date, y = Organism)) +
  ggridges::geom_density_ridges(scale=1, alpha=0.6, color="grey90")
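To see what the reshaping step does, here is a tiny illustration of tidyr::uncount() on a made-up two-row table (the counts data frame is hypothetical): each row is repeated as many times as its Abundance value, and the Abundance column itself is dropped.

```r
library(tidyr)

# Hypothetical summary table: one row per organism with a count
counts <- data.frame(Organism  = c("organism1", "organism2"),
                     Abundance = c(2, 3))

# One row per individual observation; the weight column is removed by default
long <- uncount(counts, Abundance)
```

After this step the density of rows along Date is proportional to abundance, which is what geom_density_ridges() estimates.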

How to make a plot showing dots representing age at the two timepoints (baseline and follow up) connected by a line (spaghetti plot)?

RStudio (ggplot) question: I need to prepare a plot with age on the x-axis, where each subject is represented by one dot per session (baseline and follow-up) with a line drawn between the two dots (a spaghetti plot), preferably sorted by age at baseline. Can anyone help me?
I want the lines to run horizontally along the x-axis (from age at timepoint 1, AgeTp1, to AgeTp2). The y-axis can be an index based on a list of individuals sorted by AgeTp1, so the result is really just a pile of horizontal lines.
IMAGE OF DATASET
Here is a simple example that you can modify to suit your purposes...
df <- data.frame(ID=c("A","A","B","B","C","C"),
age=c(20,25,22,27,21,28))
library(dplyr)
library(ggplot2)
# sort by first age for each ID
df <- df %>% group_by(ID) %>%
  mutate(index=min(age)) %>%       # baseline age of each subject
  ungroup() %>%
  mutate(index=dense_rank(index))  # 1, 2, 3, ... in order of baseline age
ggplot(df,aes(x=age,y=index,colour=ID,group=ID))+
  geom_point(size=4)+
  geom_line(size=1)
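If the data are in wide format with one row per subject, as the AgeTp1 and AgeTp2 column names in the question suggest, a sketch using geom_segment() avoids reshaping. The wide data frame below is made up for illustration, and its column names are assumed from the question text.

```r
library(ggplot2)

# Hypothetical wide data: one row per subject, one column per timepoint
wide <- data.frame(ID     = c("A", "B", "C"),
                   AgeTp1 = c(20, 22, 21),
                   AgeTp2 = c(25, 27, 28))

# Row position on the y-axis: 1 = youngest at baseline
wide$index <- rank(wide$AgeTp1, ties.method = "first")

p <- ggplot(wide, aes(y = index)) +
  # horizontal line from baseline age to follow-up age
  geom_segment(aes(x = AgeTp1, xend = AgeTp2, yend = index)) +
  geom_point(aes(x = AgeTp1)) +
  geom_point(aes(x = AgeTp2)) +
  labs(x = "Age", y = "Subject (sorted by age at baseline)")
```

Using ties.method = "first" gives every subject a distinct row even when two subjects share the same baseline age.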

Creating multiple density plots using only summary statistics (no raw data) in R

I work with a massive 4D nifti file (x - y - z - subject; MRI data) and due to the size I can't convert to a csv file and open in R. I would like to get a series of overlaying density plots (classic example here) one for each subject with the idea to just visualise that there is not much variance in density distributions across the sample.
I could however, extract summary statistics for each subject (mean, median, SD, range etc. of the variable of interest) and use these to create the density plots (at least for the variables that are normally distributed). Something like this would be fantastic but I am not sure how to do it for density plots.
Your help will be much appreciated.
So these really aren't density plots per se; they are plots of the densities of normal distributions with given means and standard deviations.
That can be done in ggplot2, but you need to expand your table of subjects and summaries into grids of points and normal densities at those points.
Here's an example. First, make up some data, consisting of subject IDs and some simulated sample averages and sample standard deviations.
library(tidyverse)
set.seed(1)
# tibble() replaces the deprecated data_frame()
foo <- tibble(Subject = LETTERS[1:10], avg=runif(10, 10,20), stdev=runif(10,1,2))
Now, for each subject we need to obtain a suitable grid of "x" values along with the normal density (for that subject's avg and stdev) evaluated at those "x" values. I've chosen plus/minus 4 standard deviations. This can be done using do. But that produces a funny data frame with a column consisting of data frames. I use unnest to explode out the data frame.
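bar <- foo %>%
  group_by(Subject) %>%
  # one grid of 50 x values per subject, spanning avg +/- 4 stdev
  do(densities=tibble(x=seq(.$avg-4*.$stdev, .$avg+4*.$stdev, length.out = 50),
                      density=dnorm(x, .$avg, .$stdev))) %>%
  # expand the list-column of data frames into ordinary rows
  unnest(densities)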
Have a look at bar to see what happened. Now we can use ggplot2 to put all these normal densities on the same plot. I'm guessing with lots of subjects you wouldn't want a legend for the plot.
bar %>%
ggplot(aes(x=x, y=density, color=Subject)) +
geom_line(show.legend = FALSE)
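The same grid of densities can also be built in base R, which may be easier to follow than do() and unnest(). This is a sketch under the same assumptions as the answer (normally distributed values, plus or minus 4 standard deviations, 50 points per subject); the three subjects and their summary statistics are made up.

```r
# Hypothetical per-subject summary statistics
foo <- data.frame(Subject = LETTERS[1:3],
                  avg     = c(10, 12, 15),
                  stdev   = c(1, 1.5, 2))

# Build the x grid and the normal density for subject i
grid_for <- function(i) {
  x <- seq(foo$avg[i] - 4 * foo$stdev[i],
           foo$avg[i] + 4 * foo$stdev[i],
           length.out = 50)
  data.frame(Subject = foo$Subject[i],
             x       = x,
             density = dnorm(x, mean = foo$avg[i], sd = foo$stdev[i]))
}

# Stack the per-subject grids into one long data frame
bar <- do.call(rbind, lapply(seq_len(nrow(foo)), grid_for))
```

The resulting bar has the same Subject, x and density columns, so the ggplot call with geom_line() works on it unchanged.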

geom_point: Put overlapping points with highest values on top of others

I am visualizing a panel dataset with geom_point where y = var1, x = year, and color = var2. The problem is that there are many overlapping points, even with horizontal jitter.
Reducing the point size or setting a low alpha value is undesirable because both reduce the visual impact of the second variable, which has a very long right skew. I would like ggplot to place the points with the highest values of var2 on top of all other overlapping points.
Reproducible example:
df <- data.frame(diamonds)
ggplot(data = df,aes(x=factor(cut),y=carat,colour=price)) +
geom_point(position=position_jitter(width=.4))+
scale_colour_gradientn(colours=c("grey20","orange","orange3"))
How does one place the points with highest values in df$price on top of an overlapping stack of points?
It looks as though grid plots in the order of the data,
library(grid)
d <- data.frame(x=c(0.5,0.52),y=c(0.6,0.6), fill=c("blue","red"),
stringsAsFactors=FALSE)
grid.newpage()
with(d,grid.points(x,y,default.units='npc', pch=21,gp=gpar(cex=5, fill=fill)))
with(d[c(2,1),], grid.points(x,y-0.2,default.units='npc', pch=21,
gp=gpar(cex=5, fill=fill)))
so I would suggest you first reorder your data.frame so that the rows you want on top come last, and pray that ggplot2 won't mess with it :)
library(ggplot2)
library(plyr)
# ascending order: the highest prices are drawn last and end up on top
df <- diamonds[order(diamonds$price), ]
# alternative with plyr
df <- arrange(diamonds, price)
last_plot() %+% df
In ggplot2 versions before 2.0.0, you can use the order aesthetic to specify the order in which points are plotted (the aesthetic was deprecated in 2.0.0, so in later versions sort the data frame instead). The last points plotted appear on top. To apply this, create a variable holding the order in which you'd like the points to be drawn; in your case you might be able to specify rank(var2).
For the reproducible example, to put the points with the highest df$price on top:
df <- data.frame(diamonds)
df$orderrank <- rank(df$price,ties.method="first")
ggplot(data = df,aes(x=factor(cut),y=carat,colour=price, order=orderrank)) +
geom_point(position=position_jitter(width=.4))+
scale_colour_gradientn(colours=c("grey20","orange","orange3"))
Here is the difference in outputs between the example in the question and the chart with specified plotting order by price:
(The jittering makes the comparison a little less clear but the difference still comes across.)
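Because ggplot2 draws points in row order, sorting the data before plotting is a version-independent way to get the same effect. A minimal sketch on the diamonds example, assuming that higher price should end up on top:

```r
library(ggplot2)

df <- as.data.frame(diamonds)

# Ascending sort: rows with the highest price come last, so they are drawn last
df <- df[order(df$price), ]

p <- ggplot(df, aes(x = cut, y = carat, colour = price)) +
  geom_point(position = position_jitter(width = .4)) +
  scale_colour_gradientn(colours = c("grey20", "orange", "orange3"))
```

Sorting in descending order instead would bury the expensive diamonds under the cheap ones.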
