I have time series data where measurements of 7 variables (Var1:Var7) were taken on 15 individuals (denoted by a unique ID). These individuals were sampled from 3 different Locations. Note that the number of observations is different for each individual. I believe the individuals within each Location will be more similar to each other than individuals in other Locations, both in value and trend. For each Variable within each Location, I want to plot the average time series (to get an idea of what the group looks like as a whole) up to the point where Time is the same for each individual (so the length of the x-axis will only be as long as the shortest individual).
How can I do this and add error bars for each Time point to see how much variation exists between individuals?
Here is some sample data:
set.seed(123)
ID = factor(letters[seq(15)])
Time = c(1000,1200,1234,980,1300,1020,1180,1908,1303,
1045,1373,1111,1097,1167,1423)
df <- data.frame(ID = rep(ID, Time), Time = sequence(Time))
df$Location = rep(c("NY","WA","MA"), c(5714,7829,4798))
df[paste0('Var', c(1:7))] <- rnorm(sum(Time))
The values of all your variables are the same, so I did the following to make it more random:
for(i in 1:7) df[paste0('Var', i)] <- rnorm(sum(Time))
Then the following code gives a time-series plot of each of the 7 variables, averaged across individuals within each of the three locations.
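library(tidyverse)
df %>%
  # reshape Var1..Var7 into Variable/value pairs
  pivot_longer(cols = Var1:Var7, names_to = "Variable") %>%
  group_by(Location, Variable, Time) %>%
  # average across individuals at each time point
  summarise(mval = mean(value), .groups = "drop") %>%
  ggplot(aes(y = mval, x = Time, color = Variable)) +
  geom_line() +
  facet_grid(~Location)  # possibly add scales = "free" if the locations need separate axis ranges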
I'm not sure if this is what you had in mind though.
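The plot above averages over every available time point and has no error bars. A minimal sketch of those two missing pieces, cutting each group off at its shortest individual and adding an error bar, might look like the following. It assumes the cutoff is taken within each Location and that the error bar is the standard error across individuals. The names toy, len, mval and se are illustrative, and the toy data is only a small stand-in for the question's df.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(123)
# Toy stand-in for the question's data (an assumption): 6 individuals,
# 2 per Location, with unequal series lengths
lens <- c(40, 55, 30, 60, 45, 50)
toy <- data.frame(ID = rep(letters[1:6], lens), Time = sequence(lens))
toy$Location <- rep(c("NY", "WA", "MA"), c(95, 90, 95))
for (i in 1:3) toy[[paste0("Var", i)]] <- rnorm(nrow(toy))

summ <- toy %>%
  pivot_longer(starts_with("Var"), names_to = "Variable") %>%
  group_by(Location, ID) %>%
  mutate(len = max(Time)) %>%            # length of each individual's series
  group_by(Location) %>%
  filter(Time <= min(len)) %>%           # keep time points every individual in the Location has
  group_by(Location, Variable, Time) %>%
  summarise(mval = mean(value),
            se = sd(value) / sqrt(n()),  # standard error across individuals
            .groups = "drop")

p <- ggplot(summ, aes(x = Time, y = mval)) +
  geom_errorbar(aes(ymin = mval - se, ymax = mval + se), colour = "grey60") +
  geom_line() +
  facet_grid(Variable ~ Location, scales = "free_x")
p
```

With a thousand or more time points per series, geom_ribbon(aes(ymin = mval - se, ymax = mval + se), alpha = 0.3) will probably read better than individual error bars. For a single cutoff shared by all Locations, drop the group_by(Location) line before the filter and call ungroup() instead.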
Related
I have several datasets and my end goal is to make a graph from them, with each line representing the yearly variation of one item. I have joined and combined my data (which was originally in a per-month structure) into a table that contains only the yearly means for each item I want to graph (one column for the year and further columns for the yearly variation of 4 different elements).
I have one factor, the year, and 4 different variables that record yearly variation, and I would like to graph them all in the same space. I had the idea of joining the 4 columns into one (collapsing the table into one observation per row, with the year alongside it) but I seem unable to do that. My thought is that this would give a structure to my y-axis. I would like some advice, and to know whether my approach to the problem is effective. I am trying ggplot2, but it does not seem to work without a defined (or predefined) y-axis. Thanks
I would suggest the following approach. You have to reshape your data from wide to long, as in the example below, so that all variables can be plotted together. As no data was provided, this solution is sketched using dummy data. You can also swap the lines for another geom, such as points:
library(tidyverse)
set.seed(123)
#Data
df <- data.frame(year=1990:2000,
v1=rnorm(11,2,1),
v2=rnorm(11,3,2),
v3=rnorm(11,4,1),
v4=rnorm(11,5,2))
#Plot
df %>% pivot_longer(-year) %>%
ggplot(aes(x=factor(year),y=value,group=name,color=name))+
geom_line()+
theme_bw()
We could use melt from reshape2 without loading multiple other packages
library(reshape2)
library(ggplot2)
ggplot(melt(df, id.var = 'year'), aes(x = factor(year), y = value,
group = variable, color = variable)) +
geom_line()
Or with matplot from base R
matplot(as.matrix(df[-1]), type = 'l', xaxt = 'n')
axis(1, at = seq_len(nrow(df)), labels = df$year)  # label the x-axis with the years
In the time series data created below, individuals (denoted by a unique ID) were sampled from 2 populations (NC and SC). All individuals have the same number of observations. I want to average the data at each time point across all individuals that belong to the same State (giving one average line per state), and I want to plot the average lines of the two states against each other.
library(tidyverse)
set.seed(123)
ID <- rep(1:10, each = 500)
Time = rep(c(1:500),10)
Location = rep(c("NC","SC"), each = 2500)
Var <- rnorm(5000)
data <- data.frame(
ID = factor(ID),
Time = Time,
State = Location,
Variable = Var
)
I would recommend getting familiar with the various dplyr functions, specifically group_by and summarise. You may want to read through Introduction to dplyr or go through this series of blog posts.
In short, we group the data by the Time and State variables and then summarise each group with an average (i.e., mean(Variable)). To plot the data, we put Time on the x-axis and the newly created avg_var on the y-axis, and use State for color. These are assigned as the chart's aesthetics (i.e., aes(...)). Finally, we add the line geom with geom_line() to render the lines.
data %>%
group_by(Time, State) %>%
summarise(avg_var = mean(Variable)) %>%
ggplot(aes(x = Time, y = avg_var, color = State)) +
geom_line()
I have previously received help on this issue, and below is the code I was given. As a first-time R user, I find it hard to understand and adapt to my specific data set. Initially I was trying to make a scatter plot comparing rainfall (y) and humidity (x), but because the data is daily rainfall it contains a lot of zeroes, which makes the scatter plot useless. So now I am trying to create a scatter plot of the average humidity per month (x) against the sum of rainfall in that month (y). The dataset is extensive, so to make it easier I limited myself to the first 5 locations in the dataset: Albury, Badgery Creek, Cobar, Coffs Harbour and Moree, which is around 3000 rows. At the bottom is an example of the first couple of rows of the data set. Is this possible to achieve, and if it is, how would I go about adding a regression to it in order to assess the relationship? Thanks for any help
library(data.table)
library(tidyverse)
df <- as_tibble(fread('realdata.csv'))
df <- add_column(df, Month = format(as.Date(df$Date), '%B %Y'), .after = 'Date') %>%
group_by(Month) %>%
summarize(sum(`Rainfall`), mean(`Humidity3pm`))
colnames(df)[2:3] <- c('Total Rainfall (mm)', 'Average 3 PM Relative Humidity (%)')
ggplot(df, aes(x = `Total Rainfall (mm)`, y = `Average 3 PM Relative Humidity (%)`)) + geom_point()
This was the rainfall australia data I took from Kaggle: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package
This is what I'm currently up to
This is the vim attempt
First, change the date format in your file. Your file looks like this:
Date,Location,MinTemp,MaxTemp,Rainfall,Humidity3pm,Pressure3pm,Temp3pm,RainToday,RainTomorrow
1/12/08,Albury,13.4,22.9,0.6,22,1007.1,21.8,No,No
2/12/08,Albury,7.4,25.1,0,25,1007.8,24.3,No,No
3/12/08,Albury,12.9,25.7,0,30,1008.7,23.2,No,No
4/12/08,Albury,9.2,28,0,16,1012.8,26.5,No,No
5/12/08,Albury,17.5,32.3,1,33,1006,29.7,No,No
6/12/08,Albury,14.6,29.7,0.2,23,1005.4,28.9,No,No
Change it to:
Day,Month-Year,Location,MinTemp,MaxTemp,Rainfall,Humidity3pm,Pressure3pm,Temp3pm,RainToday,RainTomorrow
1,12-08,Albury,13.4,22.9,0.6,22,1007.1,21.8,No,No
2,12-08,Albury,7.4,25.1,0,25,1007.8,24.3,No,No
3,12-08,Albury,12.9,25.7,0,30,1008.7,23.2,No,No
4,12-08,Albury,9.2,28,0,16,1012.8,26.5,No,No
5,12-08,Albury,17.5,32.3,1,33,1006,29.7,No,No
6,12-08,Albury,14.6,29.7,0.2,23,1005.4,28.9,No,No
You can produce this format with vim or any simple text editor. In the "Date" column, I replaced the first "/" with "," and the second "/" with "-". Open your file in vim with the command
vim realdata.csv
Then type the following commands, pressing Enter after each one:
:%s!/!,!g
:%s!,!-!
:%s!,!-!
:%s!-!,!
Then edit the header so that it matches the converted file shown above, and save the file with:
:x!
Then run the following script in R:
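# install.packages(c('dplyr', 'ggplot2'))  # run once if the packages are not installed
library(dplyr)
library(ggplot2)
df <- read.table("realdata.csv", sep = ",", header = TRUE)
# The "Month-Year" header is read as Month.Year; compute both monthly summaries in one step
df_f <- df %>%
  group_by(Month.Year) %>%
  summarize(Rainfall_sum = sum(Rainfall, na.rm = TRUE),
            Humidity3pm_mean = mean(Humidity3pm, na.rm = TRUE))
# Average humidity on the x-axis, total rainfall on the y-axis, as asked in the question
ggplot(df_f, aes(x = Humidity3pm_mean, y = Rainfall_sum)) + geom_point()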
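The month can also be extracted inside R without editing the file, and a linear fit covers the regression part of the question. The following is only a sketch under some assumptions: it assumes the Date column is day/month/two-digit-year, as in the sample rows, and it uses a small made-up data frame (weather) in place of realdata.csv. The names weather, monthly and fit are illustrative.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for realdata.csv (an assumption), using the same column names
set.seed(1)
days <- seq(as.Date("2008-12-01"), as.Date("2009-05-31"), by = "day")
weather <- data.frame(
  Date = format(days, "%d/%m/%y"),
  Humidity3pm = runif(length(days), 20, 80)
)
# Mostly dry days, so daily rainfall contains many zeros
weather$Rainfall <- ifelse(runif(length(days)) < 0.7, 0, rexp(length(days), 0.2))

monthly <- weather %>%
  # parse the d/m/y text as a date, then keep only year and month
  mutate(Month = format(as.Date(Date, format = "%d/%m/%y"), "%Y-%m")) %>%
  group_by(Month) %>%
  summarise(Rainfall_sum = sum(Rainfall, na.rm = TRUE),
            Humidity3pm_mean = mean(Humidity3pm, na.rm = TRUE),
            .groups = "drop")

# Simple linear regression of monthly rainfall on mean 3 pm humidity
fit <- lm(Rainfall_sum ~ Humidity3pm_mean, data = monthly)
summary(fit)

ggplot(monthly, aes(x = Humidity3pm_mean, y = Rainfall_sum)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)  # overlay the fitted line
```

With the real file, grouping by Month alone pools all five locations. Adding Location to group_by() would give one point per location and month instead.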
I have a dataset with a few organisms, which I would like to plot on my y-axis, against date, which I would like to plot on the x-axis. However, I want the fluctuation of the curve to represent the abundance of the organisms. I.e I would like to plot a time series with the relative abundance separated by the organism to show similar patterns with time.
However, of course, plotting just date against an organism does not yield any information on the abundance. So, my question is, is there a way to make the curve represent abundance using ggridges?
Here is my code for an example dataset:
set.seed(1)
Data <- data.frame(
Abundance = sample(1:100),
Organism = sample(c("organism1", "organism2"), 100, replace = TRUE)
)
Date = rep(seq(from = as.Date("2016-01-01"), to = as.Date("2016-10-01"), by =
'month'),times=10)
Data <- cbind(Date, Data)
ggplot(Data, aes(x = Abundance, y = Organism)) +
geom_density_ridges(scale=1.15, alpha=0.6, color="grey90")
This produces a plot with the two organisms, however, I want the date on the x-axis and not abundance. However, this doesn't work. I have read that you need to specify group=Date or change date into julian day, however, this doesn't change the fact that I do not get to incorporate abundance into the plot.
Does anyone have an example of a plot with date vs. a categorical variable (i.e. organism) plotted against a continuous variable in ggridges?
I really like to output from ggridges and would like to be able to use it for these visualizations. Thank you in advance for your help!
Cheers,
Anni
To use geom_density_ridges, it'll help to reshape the data to show observations in separate rows, vs. as summarized by Abundance.
library(ggplot2); library(ggridges); library(dplyr)
# Uncount copies the row "Abundance" number of times
Data_sum <- Data %>%
tidyr::uncount(Abundance)
ggplot(Data_sum, aes(x = Date, y = Organism)) +
ggridges::geom_density_ridges(scale=1, alpha=0.6, color="grey90")
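To see what the uncount step does, here is a tiny illustration on a made-up two-row table (counts and expanded are hypothetical names, not part of the question's data):

```r
library(tidyr)

# Hypothetical summary table: one row per date/organism with a count
counts <- data.frame(
  Date = as.Date(c("2016-01-01", "2016-02-01")),
  Organism = c("organism1", "organism2"),
  Abundance = c(3, 1)
)

# Each row is repeated Abundance times; the Abundance column itself is dropped
expanded <- uncount(counts, Abundance)
expanded
```

After this step a date with high abundance simply appears more often, so the density that geom_density_ridges estimates over Date is higher there.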
I have data currently structured like so:
set.seed(100)
require(ggplot2)
require(reshape2)
d<-data.frame("ID" = 1:30,
"Treatment1" = sample(0:1,30,replace = T, prob = c(0.5,0.5)),
"Score1" = rnorm(30)^2,
"Treatment2" = sample(0:1,30,replace = T,prob = c(0.3,0.7)),
"Score2" = rnorm(30)^2,
"Treatment3" = sample(0:1,30,replace = T,prob = c(0.2,0.8)),
"Score3" = rnorm(30)^2)
Here there are unique IDs, 3 different treatments (coded 1 if the ID received the given treatment and 0 if not), and the scores the IDs have after each treatment period. I'm trying to create a boxplot that illustrates the score distribution associated with each treatment period for the IDs in the data set, but I'm either not melting the data properly or not coding the plot properly, or both.
d.melt<-melt(d,id.vars = c("ID","Treatment1","Treatment2","Treatment3"),measure.vars = c("Score1","Score2","Score3"))
I can produce a boxplot that shows the scores separated by whether the IDs received one of the three treatments with this code:
ggplot(d.melt)+
geom_boxplot(aes(x = variable,y = value,fill = factor(Treatment1)))
But this will only plot the difference in all the scores for the IDs that got treatment 1 and not the difference in scores for all of the 3 levels...
Any help getting my head around this problem would be great. Thank you in advance
The complication is that the data has pairs of columns (Treatment1, Score1, etc.) representing each treatment/score and we need to keep track of both whether a given subject received a given Treatment and their Score for each treatment. I've used one of the map functions from the purrr package (which is part of the tidyverse suite of packages) for this.
The code steps through each of the three pairs of treatments/scores, adds a column called Treatment indicating the treatment number and returns the stacked (long format) data frame.
library(tidyverse)
dr = map2_df(seq(2,ncol(d),2), seq(3,ncol(d),2),
function(t,s) {
data.frame(ID = d[,"ID"],
Treatment = gsub(".*([0-9]$)", "\\1", names(d)[t]),
Treat_Flag = d[,t],
Score = d[,s])
})
Now we plot the data using Treatment on the x-axis to mark the treatment number and color by Treat_Flag to provide separate box plots based on whether a given subject received a given treatment.
ggplot(dr, aes(Treatment, Score, colour=factor(Treat_Flag))) +
geom_boxplot() +
theme_classic() +
labs(colour="Treatment Indicator")
Here's another way to reshape the data. The code below uses functions from tidyr rather than from reshape2 (tidyr is the successor to reshape2). In the code below, gather(d, key, value, -ID) is essentially equivalent to melt(d, id.var="ID"). You can stop the chain of functions at any step to look at the intermediate outputs. This approach is probably more in keeping with the tidyverse paradigm for data reshaping, but I find it a bit less intuitive than the map approach above.
dr = gather(d, key, value, -ID) %>%
separate(key, into=c("key", "value2"), sep="(?=[0-9])") %>%
spread(key, value) %>%
rename(Treatment=value2, Treat_Flag=Treatment)
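In current tidyr, gather and spread are superseded by pivot_longer and pivot_wider, and the paired columns can be reshaped in a single call. A minimal sketch, assuming column names of the form text followed by digits; d2, dr2 and Period are illustrative names, and d2 is a cut-down copy of the question's structure. Note that here the flag column keeps the name Treatment and the treatment number goes into Period.

```r
library(tidyr)

set.seed(100)
# Same structure as the question's data, cut down to two treatment periods
d2 <- data.frame(ID = 1:5,
                 Treatment1 = sample(0:1, 5, replace = TRUE),
                 Score1 = rnorm(5)^2,
                 Treatment2 = sample(0:1, 5, replace = TRUE),
                 Score2 = rnorm(5)^2)

# ".value" keeps the text part (Treatment/Score) as output column names,
# while the trailing digits go into the new Period column
dr2 <- pivot_longer(d2, -ID,
                    names_to = c(".value", "Period"),
                    names_pattern = "([A-Za-z]+)([0-9]+)")
dr2
```

The result has one row per ID and period, so it can be plotted like dr above with aes(Period, Score, colour = factor(Treatment)).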