In the time series data created below data, individuals (denoted by a unique ID) were sampled from 2 populations (NC and SC). All individuals have the same number of observations. I want to average the data for each respective "time point" for all individuals that belong to the same "State" (the average line) and I want to plot the average lines from each state against each other. I want it to look something like this:
library(tidyverse)
set.seed(123)
ID <- rep(1:10, each = 500)
Time = rep(c(1:500),10)
Location = rep(c("NC","SC"), each = 2500)
Var <- rnorm(5000)
data <- data.frame(
ID = factor(ID),
Time = Time,
State = Location,
Variable = Var
)
I would recommend getting familiar with the various dplyr functions. Specifically, group_by and summarise. You may want to read through: Introduction to dplyr or going through this series of blog posts.
In short, we are grouping the data by the Time and State variable and then summarizing that data with an average (i.e., mean(Variable)). To plot the data, we put Time on our x-axis, the newly created avg_var on our y-axis, and use State to represent color. These are assigned as our chart's aesthetics (i.e., aes(...). Finally, we add the line geom with geom_line() to render the lines on our visualization.
data %>%
group_by(Time, State) %>%
summarise(avg_var = mean(Variable)) %>%
ggplot(aes(x = Time, y = avg_var, color = State)) +
geom_line()
Related
I have a data set which contains 27M samples per day. I can reduce this, using count(), to 1500 samples per day, without loss.
When I come to plot, for example, histograms from this, I can use stat="identity" to process the count data considerably faster than the original data.
Is there a similar way to process the count data to obtain ridges using ggridges::geom_density_ridges(), or similar, to get the probability density without having to process the original data set?
It sounds like your current set-up is something like this (obviously with far more cases): a data frame containing a large vector of numeric measurements, with at least one grouping variable to specify different ridge lines.
We will stick to 2000 samples rather than 27M samples for demonstration purposes:
set.seed(1)
df <- data.frame(x = round(c(rnorm(1000, 35, 5), rnorm(1000, 60, 12))),
group = rep(c('A', 'B', 'C'), len = 2000))
We can reduce these 2000 observations down to ~200 by using count, and plot with geom_histogram using stat = 'identity':
df %>%
group_by(x, group) %>%
count() %>%
ggplot(aes(x, y = n, fill = group)) +
geom_histogram(stat = 'identity', color = 'black')
But we want to create density ridgelines from these 200 rows of counts rather than the original data. Of course, we could uncount them and create a density ridgeline normally, but this would be tremendously inefficient. What we can do is use the counts as weights for a density calculation. It seems that geom_density_ridges doesn't take a weight parameter, but stat_density does, and you can tell it to use the density_ridges geom. This allows us to pass our counts as weights for the density calculation.
library(ggridges)
df %>%
group_by(x, group) %>%
count() %>%
ggplot(aes(x, fill = group)) +
stat_density(aes(weight = n, y = group, height = after_stat(density)),
geom = 'density_ridges', position = 'identity')
Note that this should give us the same result as creating a ridgleine from our whole data set before counting, since our 'bins' are unique interval values. If your real data is binning continuous data before counting, you will have a slightly less accurate kernel density estimate when using count data, depending on how 'thin' your bins are.
I have several datasets and my end goal is to do a graph out of them, with each line representing the yearly variation for the given information. I finally joined and combined my data (as it was in a per month structure) into a table that just contains the yearly means for each item I want to graph (column depicting year and subsequent rows depicting yearly variation for 4 different elements)
I have one factor that is the year and 4 different variables that read yearly variations, thus I would like to graph them on the same space. I had the idea to joint the 4 columns into one by factor (collapse into one observation per row and the year or factor in the subsequent row) but seem unable to do that. My thought is that this would give a structure to my y axis. Would like some advise, and to know if my approach to the problem is effective. I am trying ggplot2 but does not seem to work without a defined (or a pre defined range) y axis. Thanks
I would suggest next approach. You have to reshape your data from wide to long as next example. In that way is possible to see all variables. As no data is provided, this solution is sketched using dummy data. Also, you can change lines to other geom you want like points:
library(tidyverse)
set.seed(123)
#Data
df <- data.frame(year=1990:2000,
v1=rnorm(11,2,1),
v2=rnorm(11,3,2),
v3=rnorm(11,4,1),
v4=rnorm(11,5,2))
#Plot
df %>% pivot_longer(-year) %>%
ggplot(aes(x=factor(year),y=value,group=name,color=name))+
geom_line()+
theme_bw()
Output:
We could use melt from reshape2 without loading multiple other packages
library(reshape2)
library(ggplot2)
ggplot(melt(df, id.var = 'year'), aes(x = factor(year), y = value,
group = variable, color = variable)) +
geom_line()
-output plot
Or with matplot from base R
matplot(as.matrix(df[-1]), type = 'l', xaxt = 'n')
data
set.seed(123)
df <- data.frame(year=1990:2000,
v1=rnorm(11,2,1),
v2=rnorm(11,3,2),
v3=rnorm(11,4,1),
v4=rnorm(11,5,2))
I have time series data where measurements of 7 variables (Var1:Var7) were taken on 15 individuals (denoted by a unique ID). These individuals were sampled from 3 different Locations. Note that the number of observations is different for each individual. I believe the individuals within each Location will be more similar to each other than individuals in other Locations, both in value and trend. For each Variable within each Location, I want to plot the average time series (to get an idea of what the group looks like as a whole) up to the point where Time is the same for each individual (so the length of the x-axis will only be as long as the shortest individual).
How can I do this and add error bars for each Time point to see how much variation exists between individuals?
Here is some sample data:
set.seed(123)
ID = factor(letters[seq(15)])
Time = c(1000,1200,1234,980,1300,1020,1180,1908,1303,
1045,1373,1111,1097,1167,1423)
df <- data.frame(ID = rep(ID, Time), Time = sequence(Time))
df$Location = rep(c("NY","WA","MA"), c(5714,7829,4798))
df[paste0('Var', c(1:7))] <- rnorm(sum(Time))
The values of all your variables are the same, so I did the following to make it more random:
for(i in 1:7) df[paste0('Var', i)] <- rnorm(sum(Time))
Then the following code gives a time-series plot for each of the 7 variables averaged over the three locations.
df %>%
pivot_longer(cols = Var1:Var7, names_to="Variable") %>%
group_by(Location, Variable, Time) %>%
summarise(mval=mean(value)) %>%
ggplot(aes(y=mval, x=Time, color=Variable)) +
geom_line() +
facet_grid(~Location) # , scales="free" # ?
I'm not sure if this is what you had in mind though.
I have a dataset with a few organisms, which I would like to plot on my y-axis, against date, which I would like to plot on the x-axis. However, I want the fluctuation of the curve to represent the abundance of the organisms. I.e I would like to plot a time series with the relative abundance separated by the organism to show similar patterns with time.
However, of course, plotting just date against an organism does not yield any information on the abundance. So, my question is, is there a way to make the curve represent abundance using ggridges?
Here is my code for an example dataset:
set.seed(1)
Data <- data.frame(
Abundance = sample(1:100),
Organism = sample(c("organism1", "organism2"), 100, replace = TRUE)
)
Date = rep(seq(from = as.Date("2016-01-01"), to = as.Date("2016-10-01"), by =
'month'),times=10)
Data <- cbind(Date, Data)
ggplot(Data, aes(x = Abundance, y = Organism)) +
geom_density_ridges(scale=1.15, alpha=0.6, color="grey90")
This produces a plot with the two organisms, however, I want the date on the x-axis and not abundance. However, this doesn't work. I have read that you need to specify group=Date or change date into julian day, however, this doesn't change the fact that I do not get to incorporate abundance into the plot.
Does anyone have an example of a plot with date vs. a categorical variable (i.e. organism) plotted against a continuous variable in ggridges?
I really like to output from ggridges and would like to be able to use it for these visualizations. Thank you in advance for your help!
Cheers,
Anni
To use geom_density_ridges, it'll help to reshape the data to show observations in separate rows, vs. as summarized by Abundance.
library(ggplot2); library(ggridges); library(dplyr)
# Uncount copies the row "Abundance" number of times
Data_sum <- Data %>%
tidyr::uncount(Abundance)
ggplot(Data_sum, aes(x = Date, y = Organism)) +
ggridges::geom_density_ridges(scale=1, alpha=0.6, color="grey90")
I have data currently structured like so:
set.seed(100)
require(ggplot2)
require(reshape2)
d<-data.frame("ID" = 1:30,
"Treatment1" = sample(0:1,30,replace = T, prob = c(0.5,0.5)),
"Score1" = rnorm(30)^2,
"Treatment2" = sample(0:1,30,replace = T,prob = c(0.3,0.7)),
"Score2" = rnorm(30)^2,
"Treatment3" = sample(0:1,30,replace = T,prob = c(0.2,0.8)),
"Score3" = rnorm(30)^2)
Where there are unique IDs, 3 different treatments (coded 1 if they received the given treatment and 0 if not), and the different scores the Ids have after each treatment period. I'm trying to create a boxplot that will illustrate the score distribution associated with each treatment period for each of the unique ids in the data set, but I'm either not melting the data properly or not coding the plot properly or both.
d.melt<-melt(d,id.vars = c("ID","Treatment1","Treatment2","Treatment3"),measure.vars = c("Score1","Score2","Score3"))
I can produce the boxplot that shows the scores separated by whether they recieved one of the three treatments with this code:
ggplot(d.melt)+
geom_boxplot(aes(x = variable,y = value,fill = factor(Treatment1)))
But this will only plot the difference in all the scores for the IDs that got treatment 1 and not the difference in scores for all of the 3 levels...
Any help getting my head around this problem would be great. Thank you in advance
The complication is that the data has pairs of columns (Treatment1, Score1, etc.) representing each treatment/score and we need to keep track of both whether a given subject received a given Treatment and their Score for each treatment. I've used one of the map functions from the purrr package (which is part of the tidyverse suite of packages) for this.
The code steps through each of the three pairs of treatments/scores, adds a column called Treatment indicating the treatment number and returns the stacked (long format) data frame.
library(tidyverse)
dr = map2_df(seq(2,ncol(d),2), seq(3,ncol(d),2),
function(t,s) {
data.frame(ID = d[,"ID"],
Treatment = gsub(".*([0-9]$)", "\\1", names(d)[t]),
Treat_Flag = d[,t],
Score = d[,s])
})
Now we plot the data using Treatment on the x-axis to mark the treatment number and color by Treat_Flag to provide separate box plots based on whether a given subject received a given treatment.
ggplot(dr, aes(Treatment, Score, colour=factor(Treat_Flag))) +
geom_boxplot() +
theme_classic() +
labs(colour="Treatment Indicator")
Here's another way to reshape the data. The code below uses functions from tidyr rather than from reshape2 (tidyr is the successor to reshape2). In the code below, gather(d, key, value, -ID) is essentially equivalent to melt(d, id.var="ID"). You can stop the chain of functions at any step to look at the intermediate outputs. This approach is probably more in keeping with the tidyverse paradigm for data reshaping, but I find it a bit less intuitive than the map approach above.
dr = gather(d, key, value, -ID) %>%
separate(key, into=c("key", "value2"), sep="(?=[0-9])") %>%
spread(key, value) %>%
rename(Treatment=value2, Treat_Flag=Treatment)