Plotting time series ggplot month-year, xaxis only show month with value? - r

I have 8 years of time series data. I am able to plot my data, but I want the x axis to show only the months I have data for.
My problem is that the x axis shows January, but I only have data for June, July and August of each year.
I would also like to add a vertical line to separate each year.
Here is how my script looks like so far:
ggplot(data=CMRB, aes(x=D, y=Densite, group = habitat)) + geom_line() + scale_x_date(date_labels ="%b%Y")+ geom_point( aes(shape=habitat),size=4, fill="white")
And my data looks like:
Annee Grille Periode Densite SE Methode espece notes notes_2
82 2004 LG1 PP2 1.8888330 0.3990163 secr brun NA
83 2004 LG1 PP3 3.8880450 0.7570719 secr brun NA
84 2004 LG1 PP4 3.3281370 0.5573953 secr brun NA
85 2005 LG1 PP1 0.2367488 NA secr brun mnka NA
86 2005 LG1 PP2 0.4791649 0.2105729 secr brun NA
87 2005 LG1 PP3 0.1597214 0.1302571 secr brun NA
habitat Mois Date D
82 humid 07 07/1/2004 2004-07-01
83 humid 08 08/1/2004 2004-08-01
84 humid 08 08/1/2004 2004-08-01
85 humid 06 06/1/2005 2005-06-01
86 humid 07 07/1/2005 2005-07-01
87 humid 08 08/1/2005 2005-08-01
D is a column I created to transform Date (which is a character) into a date format.
Does somebody know how to do that? If possible, I would also like the months without data to take up less space in the graph, to leave more room for the data from June to August.
Cheers
Nico

This should convert D into a date column (assign to CMRB$D, not to CMRB, otherwise the whole data frame is overwritten):
CMRB$D <- as.Date(CMRB$D, format = "%Y-%m-%d")
If you want to plot time-series data, I suggest using dygraphs.
For example:
library(dygraphs)
library(xts)
ts_object <- as.xts(CMRB$Densite, CMRB$D)
dygraph(ts_object)
Here's the holy grail of websites to guide you through dygraphs.
https://rstudio.github.io/dygraphs/
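
For the original ggplot2 question (show only the months that have data, and visually separate the years), one possible approach is to facet by year with a free x scale. This is only a sketch: the six rows are copied from the question's sample, and the `Annee` column is rebuilt here from `D` on the assumption that it matches the question's column.

```r
library(ggplot2)

# Six rows taken from the question's sample output
CMRB <- data.frame(
  D = as.Date(c("2004-07-01", "2004-08-01", "2004-08-01",
                "2005-06-01", "2005-07-01", "2005-08-01")),
  Densite = c(1.8888330, 3.8880450, 3.3281370,
              0.2367488, 0.4791649, 0.1597214),
  habitat = "humid"
)
CMRB$Annee <- format(CMRB$D, "%Y")  # year used for faceting

p <- ggplot(CMRB, aes(x = D, y = Densite, group = habitat)) +
  geom_line() +
  geom_point(aes(shape = habitat), size = 4) +
  # one tick per month, labelled with the abbreviated month name
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  # one panel per year; each panel only spans the dates it contains
  facet_grid(. ~ Annee, scales = "free_x", space = "free_x")
p
```

With a free x scale, each panel covers only the sampled months, so the empty months between summers take no space, and the panel borders act as the separators between years.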

Related

How to subset columns based on the value in another column in R

I'm looking to subset multiple columns based on the value (a year) that is issued elsewhere in the data. For example, I have a column reflecting various data, and another including a year. My data looks something like this:
Individual  Age 2010  weight 2010  Age 2011  Weight 2011  Age 2012  Weight 2012  Age 2013  Weight 2013  Year
A           53        50           85        100          82        102          56        90           2013
B           22        NA           23        75           NA        68           25        60           2013
C           33        65           34        64           35        70           NA        75           2010
D           NA        70           28        NA           29        78           30        55           2012
E           NA        NA           64        90           NA        NA           NA        NA           2011
I want to create new columns that reflect the data that the 'Year' column points to. For example, taking the data for 'Individual' A from 2013, and for 'Individual' D from 2012.
My end goal is to have a table that looks like:
Individual  Age  Weight
A           56   90
B           25   60
C           33   65
D           29   78
E           64   90
Is there any way to subset the years based on the years chosen in the final column?
I made a subset of your data and came up with the following (could be more elegant but this works):
library(dplyr)   # select, mutate, rename, bind_rows
library(stringr) # str_replace_all
Individual <- c("A","B","C","D","E")
Age2010 <- c(53,22,33,NA,NA)
`weight 2010` <- c(50,NA,65,70,NA)
Age2011 <- c(85,23,34,28,64)
Weight2011 <- c(100,75,64,NA,90)
# data.frame() keeps the numeric columns numeric (cbind() would coerce everything to character)
df <- data.frame(Individual, Age2010, `weight 2010`, Age2011, Weight2011, check.names = FALSE)
colnames(df) <- str_replace_all(colnames(df), " ", "") # remove spaces
# create a data frame for each year (could probably be done with an apply-style function)
df2010 <- df %>% select(Individual, contains("2010")) %>% mutate(year=2010) %>% rename(weight=weight2010, age=Age2010)
df2011 <- df %>% select(Individual, contains("2011")) %>% mutate(year=2011) %>% rename(weight=Weight2011, age=Age2011)
final <- bind_rows(df2010, df2011)
Of course, you can extend this for the remaining years in your dataset. You will then have a year variable to filter on or to use in your analyses.
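
As a possible alternative that goes straight to the table the question asks for, the value for each individual's chosen year can be picked with matrix indexing in base R. This is a sketch under some assumptions: the column names are normalised to `Age2010`, `Weight2010`, and so on (the question's headers contain spaces and inconsistent capitalisation), and `pick()` is a hypothetical helper name.

```r
# Full data from the question, with normalised column names (assumption)
df <- data.frame(
  Individual = c("A", "B", "C", "D", "E"),
  Age2010 = c(53, 22, 33, NA, NA),  Weight2010 = c(50, NA, 65, 70, NA),
  Age2011 = c(85, 23, 34, 28, 64),  Weight2011 = c(100, 75, 64, NA, 90),
  Age2012 = c(82, NA, 35, 29, NA),  Weight2012 = c(102, 68, 70, 78, NA),
  Age2013 = c(56, 25, NA, 30, NA),  Weight2013 = c(90, 60, 75, 55, NA),
  Year = c(2013, 2013, 2010, 2012, 2011)
)
years <- 2010:2013

# Hypothetical helper: for each row, take the <prefix><Year> column
pick <- function(prefix) {
  m <- as.matrix(df[paste0(prefix, years)])          # numeric matrix, one column per year
  m[cbind(seq_len(nrow(df)), match(df$Year, years))] # (row, column-of-chosen-year) pairs
}

result <- data.frame(Individual = df$Individual,
                     Age = pick("Age"),
                     Weight = pick("Weight"))
result
```

For this data, `result` reproduces the target table in the question (for example, A gets the 2013 values 56 and 90).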

Testing whether n% of data values exist in a variable grouped by posix date

I have a data frame that holds hourly observational climate data over multiple years. I have included a dummy data frame below that will hopefully illustrate my question.
dateTime <- seq(as.POSIXct("2012-01-01"),
                as.POSIXct("2012-12-31"),
                by = 60*60)
WS <- sample(0:20, length(dateTime), replace = TRUE)
WD <- sample(0:390, length(dateTime), replace = TRUE)
Temp <- sample(0:40, length(dateTime), replace = TRUE)
df <- data.frame(dateTime, WS, WD, Temp)
df$WS[df$WS > 15] <- NA
I need to group by year (or in this example, by month) to find out whether df$WS has 75% or more valid data for that month. My filtering criterion is NA, as 0 is still a valid observation. I have real NAs, as it is observational climate data.
I have tried dplyr piping with %>% to filter by a new column "Month", as well as reviewing several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put something in a longer script that works in a looping function that will go through all my stations and all the years in each station to produce a wind rose if this criteria is met for that year / station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one appears quite instructive.
First create a new variable which will denote month (and account for year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
Here is a method using dplyr. It will work even if you have missing data.
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
group_by(Month) %>%
summarise(No.Obs=sum(!is.na(WS)),
Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
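
The answers above report percentages, while the question asks for a yes/no decision per month. A minimal base R sketch of that last step follows, assuming dummy data built like the question's (the seed and the `UTC` time zone are additions here so the example is reproducible).

```r
set.seed(1)  # assumption: fixed seed so the example is repeatable

dateTime <- seq(as.POSIXct("2012-01-01", tz = "UTC"),
                as.POSIXct("2012-12-31", tz = "UTC"),
                by = "hour")
WS <- sample(0:20, length(dateTime), replace = TRUE)
WS[WS > 15] <- NA  # simulate missing observations
df <- data.frame(dateTime, WS)

# Share of non-NA wind speed values per month
df$month <- format(df$dateTime, "%Y-%m")
valid_rate <- tapply(!is.na(df$WS), df$month, mean)

# TRUE for months that meet the 75% criterion
enough <- valid_rate >= 0.75
names(enough)[enough]  # months that would get a wind rose
```

Note that `mean()` here divides by the number of rows present in each month. If whole hours can be missing from the data frame, dividing by the expected number of hours, as the dplyr answer does, is the safer choice.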

R - split data to hydrological quarters

I wish to split my data sets into year quarters according to definition of hydrological year. According to Wikipedia, "Due to meteorological and geographical factors, the definition of the water years varies". In USA, hydrological year is a period between October 1st of one year and September 30th of the next.
I use definition of hydrological year for Poland (starts at November 1st and ends at October 31st).
The sample data set looks as follows:
sampleData <- structure(list(date = structure(c(15946, 15947, 15875, 15910, 15869, 15888, 15823, 16059, 16068, 16067), class = "Date"),`example value` = c(-0.325806595888448, 0.116001346459147, 1.68884381116696, -0.480527505762716, -0.50307381813168,-1.12032214801472, -0.659699514672226, -0.547101497279717, 0.729148872679021,-0.769760735764215)), .Names = c("date", "example value"), row.names = c(NA, -10L), class = "data.frame")
For some reason, the "cut" function in my code complains that "breaks" and "labels" differ in length (but they don't). If I omit the "labels" option in cut (as below), the function works perfectly.
What is wrong with labels?
ToHydroQuarters <-function(df)
{
result <- df
yearStart <- as.numeric(format(min(df$date),'%Y'))-1
#Hydrological year in Poland starts at November 1st
DateStart <- as.Date(paste(yearStart,"-11-01",sep=""))
breaks <- seq(from=DateStart, to=max(df$date)+90, by="quarter")
breakYear <- format(breaks,'%Y')
#Please, do not create labels in such way.
#Please note that for November and December we have next hydrological year - since it started at 1st November. So, we need to check month to decide which year we have (?) or use cut function again as mentioned here: http://stackoverflow.com/questions/22073881/hydrological-year-time-series
labels <- c(paste("Winter",breakYear[1]),
paste("Spring",breakYear[2]),
paste("Summer",breakYear[3]),
paste("Autumn",breakYear[4]),
paste("Autumn",breakYear[5]))
######Here is problem - once I add labels parameter, function complains about different lengths
result$hydroYear <- cut(df$date, breaks)
result
}
First, I think it is unwise to hardcode the labels inside a function, since that is impossible to check without some kind of reproducible example. However, I can see what you're trying to achieve.
You claim that your breaks and labels have matching lengths, but the function itself doesn't always work: even without the labels, cut does not assign the last portion of the dates to any interval.
For example:
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="day")))
a <- ToHydroQuarters(df)
tail(a)
returns:
date hydroYear
971 2011-08-29 <NA>
972 2011-08-30 <NA>
973 2011-08-31 <NA>
974 2011-09-01 <NA>
975 2011-09-02 <NA>
976 2011-09-03 <NA>
Doing something like breaks <- seq(from=DateStart, to=max(df$date)+90, by="quarter"), does resolve that issue, as it forces a break to actually exist. This might solve your labelling issue that you've had in your function, but it does not make the function "generic".
Personally on the coding side I think it would be better to convert the month, and year parts separately, because it would be easier to understand. For example, you could use library(lubridate) to easily extract the month and specify the breaks and the labels as you normally would. I was thinking the function could look something like this:
thq <- function(date) {
mnth <- cut(month(date), breaks=c(1,4,7, 10, 12),
right=FALSE, include.lowest=TRUE,
labels=c("Spring", "Summer", "Autumn", "Winter"))
return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
So then using some dummy data ...
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="month")))
thq <- function(date) {
mnth <- cut(month(date), breaks=c(1,4,7, 10, 12),
right=FALSE, include.lowest=TRUE,
labels=c("Spring", "Summer", "Autumn", "Winter"))
return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
df$newdate <- thq(df$date)
Which has the following output:
date newdate
1 2009-01-01 Spring 2009
2 2009-02-01 Spring 2009
3 2009-03-01 Spring 2009
4 2009-04-01 Summer 2009
5 2009-05-01 Summer 2009
6 2009-06-01 Summer 2009
7 2009-07-01 Autumn 2009
8 2009-08-01 Autumn 2009
9 2009-09-01 Autumn 2009
10 2009-10-01 Winter 2010
11 2009-11-01 Winter 2010
12 2009-12-01 Winter 2010
13 2010-01-01 Spring 2010
14 2010-02-01 Spring 2010
15 2010-03-01 Spring 2010
16 2010-04-01 Summer 2010
17 2010-05-01 Summer 2010
18 2010-06-01 Summer 2010
19 2010-07-01 Autumn 2010
20 2010-08-01 Autumn 2010
21 2010-09-01 Autumn 2010
22 2010-10-01 Winter 2011
23 2010-11-01 Winter 2011
24 2010-12-01 Winter 2011
25 2011-01-01 Spring 2011
26 2011-02-01 Spring 2011
27 2011-03-01 Spring 2011
28 2011-04-01 Summer 2011
29 2011-05-01 Summer 2011
30 2011-06-01 Summer 2011
31 2011-07-01 Autumn 2011
32 2011-08-01 Autumn 2011
33 2011-09-01 Autumn 2011
You can shift the months using the modulo operator if it is in a weird order...
thq <- function(date) {
mnth <- cut(((month(date)+1) %% 12), breaks=c(0, 3, 6, 9, 12), # use the `date` argument, not the global df
right=FALSE, include.lowest=TRUE,
labels=c("Nov_Jan", "Feb_Apr", "May_Jul", "Aug_Oct")
)
# you will need to alter the return statement yourself, because
# I feel there is enough information for you to do it, rather than
# me changing it every time you change the question.
return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="day")))
df$new <- thq(df$date)
head(df)
output:
> head(df)
date new
1 2009-01-01 Nov_Jan 2009
2 2009-01-02 Nov_Jan 2009
3 2009-01-03 Nov_Jan 2009
4 2009-01-04 Nov_Jan 2009
5 2009-01-05 Nov_Jan 2009
6 2009-01-06 Nov_Jan 2009
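
The return statement left open above could be completed along these lines. This is a sketch, not the answerer's code: it assumes the Polish convention from the question (the hydrological year starts on November 1st and is named after the calendar year in which it ends), uses base R `format()` instead of lubridate, and `hydro_quarter` is a hypothetical name.

```r
hydro_quarter <- function(date) {
  m <- as.integer(format(date, "%m"))  # calendar month, 1-12
  y <- as.integer(format(date, "%Y"))  # calendar year

  # Shift months so that November maps to 0, December to 1, ..., October to 11
  quarter <- cut((m + 1) %% 12, breaks = c(0, 3, 6, 9, 12),
                 right = FALSE, include.lowest = TRUE,
                 labels = c("Nov_Jan", "Feb_Apr", "May_Jul", "Aug_Oct"))

  # November and December already belong to the next hydrological year
  hydro_year <- ifelse(m >= 11, y + 1L, y)

  paste(quarter, hydro_year)
}

hydro_quarter(as.Date(c("2013-10-31", "2013-11-01", "2014-01-15")))
# "Aug_Oct 2013" "Nov_Jan 2014" "Nov_Jan 2014"
```

Deriving the year from the month, rather than from the quarter label, keeps the year logic independent of whatever labels are chosen.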

Changing X-axis values in Time Series plot with R

I'm a newer R user and I need help with a time series plot. I created a time series plot, and cannot figure out how to change my x-axis values to correspond to my sample dates. My data is as follows:
Year Month Level
2009 8 350
2009 9 210
2009 10 173
2009 11 166
2009 12 153
2010 1 141
2010 2 129
2010 3 124
2010 4 103
2010 5 69
2010 6 51
2010 7 49
2010 8 51
2010 9 51
Let's say this data is saved under the name "data.csv":
data = read.table("data.csv", sep = ",", header = T)
data.ts = ts(data, frequency = 1)
plot(data.ts[, 3], ylab = "level", main = "main", axes = T)
I've also tried passing start = c(2009, 8) to the ts function, but I still get the wrong values.
When I plot this, my x axis does not correspond to August 2009 through Sept. 2010. It either increases by year or just by decimal. I've looked up many examples online and also through the ? help in R, but cannot find a way to relabel my axis values. Any help would be appreciated.
Using base R, you can accomplish this in a few steps. As described in this SO answer, you can turn your "Month" and "Year" columns into a date by combining the as.Date and paste functions and adding a day (i.e., the first day of the month, "1"). For the purposes of this answer, I will simply refer to the data you provided as df:
df$date<-with(df,as.Date(paste(Year,Month,'1',sep='-'),format='%Y-%m-%d'))
df
Year Month Level date
1 2009 8 350 2009-08-01
2 2009 9 210 2009-09-01
3 2009 10 173 2009-10-01
4 2009 11 166 2009-11-01
5 2009 12 153 2009-12-01
6 2010 1 141 2010-01-01
7 2010 2 129 2010-02-01
8 2010 3 124 2010-03-01
9 2010 4 103 2010-04-01
10 2010 5 69 2010-05-01
11 2010 6 51 2010-06-01
12 2010 7 49 2010-07-01
13 2010 8 51 2010-08-01
14 2010 9 51 2010-09-01
Then you can use your basic plot, axis, and mtext functions to control how you want to visualize the data and your axes. For instance:
xmin<-min(df$date,na.rm=T);xmax<-max(df$date,na.rm=T) #ESTABLISH X-VALUES (MIN & MAX)
ymin<-min(df$Level,na.rm=T);ymax<-max(df$Level,na.rm=T) #ESTABLISH Y-VALUES (MIN & MAX)
xseq<-seq.Date(xmin,xmax,by='1 month') #CREATE DATE SEQUENCE THAT INCREASES BY MONTH FROM DATE MINIMUM TO MAXIMUM
yseq<-round(seq(0,ymax,by=50),0) # CREATE SEQUENCE FROM 0-350 BY 50
par(mar=c(1,1,0,0),oma=c(6,5,3,2)) #CONTROLS YOUR IMAGE MARGINS
plot(Level~date,data=df,type='b',ylim=c(0,ymax),axes=F,xlab='',ylab='');box() #PLOT LEVEL AS A FUNCTION OF DATE, REMOVE AXES FOR FUTURE CUSTOMIZATION
axis.Date(side=1,at=xseq,format='%Y-%m',labels=T,las=3) #ADD X-AXIS LABELS WITH "YEAR-MONTH" FORMAT
axis(side=2,at=yseq,las=2) #ADD Y-AXIS LABELS
mtext('Date (Year-Month)',side=1,line=5) #X-AXIS LABEL
mtext('Level',side=2,line=4) #Y-AXIS LABEL
library(data.table)
library(ggplot2)
library(scales)
data<-data.table(datetime=seq(as.POSIXct("2009/08/01",format="%Y/%m/%d"),
as.POSIXct("2010/09/01",format="%Y/%m/%d"),by="1 month"),
Level=c(350,210,173,166,153,141,129,124,103,69,51,49,51,51))
ggplot(data)+
geom_point(aes(x=datetime,y=Level),col="brown1",size=1)+
scale_x_datetime(labels = date_format("%Y/%m"),breaks = "1 month")+
theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust=0.3))
Example using xts package:
library(xts)
ts1 <- xts(data$Level, as.POSIXct(sprintf("%d-%d-01", data$Year, data$Month)))
# or ts1 <- xts(data$Level, as.yearmon(data$Year + (data$Month-1)/12))
plot(ts1)
If you are using ggplot2:
library(ggplot2)
autoplot(ts1)
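
For completeness, the `ts` route from the question can also work if the series is declared as monthly. The sketch below assumes the `Level` values from the question and uses `frequency = 12` with `start = c(2009, 8)`. The default axis is suppressed and replaced with year-month labels.

```r
Level <- c(350, 210, 173, 166, 153, 141, 129, 124, 103, 69, 51, 49, 51, 51)

# Monthly series starting in August 2009 (frequency = 12, not 1)
level_ts <- ts(Level, start = c(2009, 8), frequency = 12)

# One label per observation, built from a matching monthly date sequence
dates <- seq(as.Date("2009-08-01"), by = "month", length.out = length(Level))

plot(level_ts, xaxt = "n", xlab = "", ylab = "Level")  # draw without the default x axis
axis(1, at = as.numeric(time(level_ts)),
     labels = format(dates, "%Y-%m"), las = 2)         # custom year-month ticks
```

With `frequency = 1`, as in the question, `ts` treats each row as one year, which is why the axis came out in years or decimals.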

ggplot2 + Date structure using scale X

I really need help here because I am way beyond lost.
I am trying to create a line chart showing several teams' performance over a year. I divided the year into quarters (1/1/12, 4/1/12, 8/1/12 and 12/1/12) and loaded the CSV data into R as a data frame.
Month Team Position
1 1/1/12 South Africa 56
2 1/1/12 Angola 85
3 1/1/12 Morocco 61
4 1/1/12 Cape Verde Islands 58
5 4/1/12 South Africa 71
6 4/1/12 Angola 78
7 4/1/12 Morocco 62
8 4/1/12 Cape Verde Islands 76
9 8/1/12 South Africa 67
10 8/1/12 Angola 85
11 8/1/12 Morocco 68
12 8/1/12 Cape Verde Islands 78
13 12/1/12 South Africa 87
14 12/1/12 Angola 84
15 12/1/12 Morocco 72
16 12/1/12 Cape Verde Islands 69
When I try using ggplot2 to generate the graph the fourth quarter 12/1/12 inexplicably moves to the second spot.
ggplot(groupA, aes(x=Month, y=Position, colour=Team, group=Team)) + geom_line()
I then put this plot into a variable GA in order to try to use scale_x to format the date:
GA + scale_x_date(labels = date_format("%m/%d"))
But I keep getting this Error:
Error in structure(list(call = match.call(), aesthetics = aesthetics, :
could not find function "date_format"
And if I run this code:
GA + scale_x_date()
I get this error:
Error: Invalid input: date_trans works with objects of class Date only
I am using a Mac OS X running R 2.15.2
Please help.
It's because df$Month (assuming your data.frame is df), which is a factor, has its levels in this order:
> levels(df$Month)
# [1] "1/1/12" "12/1/12" "4/1/12" "8/1/12"
The solution is to re-order the levels of your factor.
df$Month <- factor(df$Month, levels=df$Month[!duplicated(df$Month)])
> levels(df$Month)
# [1] "1/1/12" "4/1/12" "8/1/12" "12/1/12"
Edit: Alternate solution using strptime
# You could convert Month first (scale_x_date needs class Date, so wrap strptime in as.Date):
df$Month <- as.Date(strptime(as.character(df$Month), '%m/%d/%y'))
Then your code should work.
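
The ordering problem can be reproduced without any plotting. The base R sketch below uses the four date strings from the question: as text they are ordered character by character, while real `Date` values sort chronologically.

```r
Month <- c("1/1/12", "4/1/12", "8/1/12", "12/1/12")

# As a factor, the levels are sorted as text, not as dates
levels(factor(Month))

# Parsed as dates (month/day/two-digit year), the order is chronological
d <- as.Date(Month, format = "%m/%d/%y")
sort(d)
class(d)  # "Date", which is what scale_x_date() expects
```

Separately, the `could not find function "date_format"` error in the question suggests that the scales package was not attached. `date_format()` lives in scales, so `library(scales)` is needed before calling it.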
