I'm trying to drop year from a multiyear data frame and plot day-month on x axis with geom_smooth() calculated for different years.
My data structure, initially looks like this:
> str(pmWaw)
'data.frame': 52488 obs. of 5 variables:
$ date : POSIXct, format: "2014-01-01 00:00:00" "2014-01-01 00:00:00" "2014-01-01 00:00:00" "2014-01-01 01:00:00" ...
$ stacja: Factor w/ 273 levels "DsWrocKorzA",..: 26 27 129 26 27 129 26 27 129 26 ...
$ pm25 : num 100 63 NA 69 36 NA 41 31 NA 37 ...
$ pm10 : num 122 68 79 77 38 90 43 32 39 38 ...
$ season: Ord.factor w/ 4 levels "spring (MAM)"<..: 4 4 4 4 4 4 4 4 4 4 ...
Using lubridate I added year and month as separate variables:
library(lubridate)
pmWaw$year<- year(pmWaw$date)
pmWaw$month<- month(pmWaw$date)
Next, using code found here on Stack Overflow, I computed a month-and-day variable in %m-%d format:
pmWaw$month.day<-format(pmWaw$date, format="%m-%d")
#check new variable type:
> typeof(pmWaw$month.day)
[1] "character"
The data frame I ended up working with looks like this:
> head(pmWaw)
date stacja pm25 pm10 season year month month.day
1 2014-01-01 00:00:00 MzWarNiepodKom 100 122 winter (DJF) 2014 1 01-01
2 2014-01-01 00:00:00 MzWarszUrsynow 63 68 winter (DJF) 2014 1 01-01
3 2014-01-01 00:00:00 MzWarTarKondra NA 79 winter (DJF) 2014 1 01-01
4 2014-01-01 01:00:00 MzWarNiepodKom 69 77 winter (DJF) 2014 1 01-01
5 2014-01-01 01:00:00 MzWarszUrsynow 36 38 winter (DJF) 2014 1 01-01
6 2014-01-01 01:00:00 MzWarTarKondra NA 90 winter (DJF) 2014 1 01-01
> tail(pmWaw)
date stacja pm25 pm10 season year month month.day
52483 2015-12-30 22:00:00 MzWarAlNiepo 36 47 winter (DJF) 2015 12 12-30
52484 2015-12-30 22:00:00 MzWarKondrat 26 29 winter (DJF) 2015 12 12-30
52485 2015-12-30 22:00:00 MzWarWokalna 36 44 winter (DJF) 2015 12 12-30
52486 2015-12-30 23:00:00 MzWarAlNiepo 39 59 winter (DJF) 2015 12 12-30
52487 2015-12-30 23:00:00 MzWarKondrat 36 39 winter (DJF) 2015 12 12-30
52488 2015-12-30 23:00:00 MzWarWokalna 40 49 winter (DJF) 2015 12 12-30
Passing new values to ggplot gives me three issues:
ggplot(pmWaw, aes(x=month.day, y=pm25)) +
geom_jitter(alpha=0.5) +
geom_smooth()
First (minor) problem: month.day is a character variable, so ggplot won't recognize its original time-series nature. I can probably work around this by manually setting the scale labels to months.
Second (major) problem: geom_smooth() is not calculated at all, and I can't figure out why.
Third (major) problem: I can't work out how to add year as a grouping variable to get two separate smoothed lines (mostly because geom_smooth() isn't drawn at all).
My guess is that the source of all these problems lies in the way I extracted the month and day, ending up with a character variable.
Could anyone help me fix it? Any hints appreciated.
Looks like I found a solution to work with:
ggplot(pmWaw, aes(x=month.day, y=pm25, group = year)) +
geom_point(alpha=0.5) +
geom_smooth(aes(color=factor(year)))
This solves issues 2 and 3: geom_smooth() is drawn and I can distinguish the years. It's probably not the best solution, but it might be a good place to start.
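For the first issue (the character x axis), one possible approach is to re-stamp every observation into a single dummy year, so that the x variable is a real Date and both years share one scale. A likely reason geom_smooth() was missing originally is that a discrete x puts each day into its own group, though I haven't verified that against this data. The sketch below uses a made-up toy data frame in place of pmWaw; the column name common_date and the dummy year 2000 (a leap year, so Feb 29 stays valid) are my own assumptions.

```r
# Toy stand-in for pmWaw (hypothetical values): one reading per day over two years
set.seed(1)
toy <- data.frame(date = seq(as.POSIXct("2014-01-01", tz = "UTC"),
                             as.POSIXct("2015-12-31", tz = "UTC"), by = "day"))
doy <- as.numeric(format(toy$date, "%j"))
toy$pm25 <- 40 + 20 * cos(2 * pi * doy / 365) + rnorm(nrow(toy), sd = 5)
toy$year <- as.integer(format(toy$date, "%Y"))

# Re-stamp every observation into one arbitrary leap year (2000),
# keeping only month and day from the original timestamp
toy$common_date <- as.Date(paste0("2000-", format(toy$date, "%m-%d")))

# Plotting part, only run if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = common_date, y = pm25, colour = factor(year))) +
    geom_point(alpha = 0.3) +
    geom_smooth() +                                   # one smooth per year
    scale_x_date(date_breaks = "1 month", date_labels = "%b")
  # print(p)  # uncomment to draw the plot
}
```

Because common_date is a Date, scale_x_date() can label the axis by month without any manual relabelling.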
I want to create a new column that identifies which dates fall in the same week.
A data.table DATA_SET contains date information, like:
DATA_SET<- data.table(transday = seq(from = (Sys.Date()-64), to = Sys.Date(), by = 1))
For example, '2017-03-01' and '2017-03-02' are in the same week; '2017-03-01' and '2017-03-08' are both Wednesdays, but they are not in the same week.
If "2016-01-01" is in the first week of 2016 and "2017-01-01" is in the first week of 2017, both get the value 1, but they are not in the same week. So I want a unique value that specifies "the same week".
The answer to this question depends strongly on
the definition of the first day of the week (usually Sunday or Monday) and
the numbering of the weeks within the year (starting with the first Sunday, Monday, or Thursday of the year, or on 1st January, etc).
A selection of different options can be seen from the example below:
dates isoweek day week_iso week_US week_UK DT_week DT_iso lub_week lub_iso cut.Date
2015-12-25 2015-W52 5 2015-W52 51 51 52 52 52 52 2015-12-21
2015-12-26 2015-W52 6 2015-W52 51 51 52 52 52 52 2015-12-21
2015-12-27 2015-W52 7 2015-W52 52 51 52 52 52 52 2015-12-21
2015-12-28 2015-W53 1 2015-W53 52 52 52 53 52 53 2015-12-28
2015-12-29 2015-W53 2 2015-W53 52 52 52 53 52 53 2015-12-28
2015-12-30 2015-W53 3 2015-W53 52 52 53 53 52 53 2015-12-28
2015-12-31 2015-W53 4 2015-W53 52 52 53 53 53 53 2015-12-28
2016-01-01 2015-W53 5 2015-W53 00 00 1 53 1 53 2015-12-28
2016-01-02 2015-W53 6 2015-W53 00 00 1 53 1 53 2015-12-28
2016-01-03 2015-W53 7 2015-W53 01 00 1 53 1 53 2015-12-28
2016-01-04 2016-W01 1 2016-W01 01 01 1 1 1 1 2016-01-04
2016-01-05 2016-W01 2 2016-W01 01 01 1 1 1 1 2016-01-04
2016-01-06 2016-W01 3 2016-W01 01 01 1 1 1 1 2016-01-04
2016-01-07 2016-W01 4 2016-W01 01 01 2 1 1 1 2016-01-04
2016-01-08 2016-W01 5 2016-W01 01 01 2 1 2 1 2016-01-04
which is created by this code:
library(data.table)
dates <- as.Date("2016-01-01") + (-7:7)
print(data.table(
dates,
isoweek = ISOweek::ISOweek(dates),
day = ISOweek::ISOweekday(dates),
week_iso = format(dates, "%G-W%V"),
week_US = format(dates, "%U"),
week_UK = format(dates, "%W"),
DT_week = data.table::week(dates),
DT_iso = data.table::isoweek(dates),
lub_week = lubridate::week(dates),
lub_iso = lubridate::isoweek(dates),
cut.Date = cut.Date(dates, "week")
), row.names = FALSE)
The format YYYY-Www used in some of the columns is one of the ISO 8601 week formats. It includes the year which is required to distinguish different weeks in different years as requested by the OP.
The ISO week definition is the only format which ensures that each week always consists of 7 days, also across New Year. The other definitions may start or end the year with "weeks" of fewer than 7 days. Due to the seamless partitioning of the year, the ISO week-numbering year is slightly different from the traditional Gregorian calendar year, e.g., 2016-01-01 belongs to the last ISO week 53 of 2015 (2015-W53).
As suggested here, cut.Date() might be the best option for the OP.
Disclosure: I'm maintainer of the ISOweek package which was published at a time when strptime() did not recognize the %G and %V format specifications for output in the Windows versions of R. (Still today they aren't recognized on input).
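As a minimal sketch of the two year-aware options above applied to the OP's column, using only base R: a plain data.frame stands in for the data.table here, and the column names iso_week and week_start are made up for illustration. It assumes a platform where format() supports %G and %V on output.

```r
# Example dates taken from the wording of the question
DATA_SET <- data.frame(
  transday = as.Date(c("2017-03-01", "2017-03-02", "2017-03-08",
                       "2016-01-01", "2017-01-01"))
)

# Option 1: ISO 8601 year-week label (week-numbering year plus week number)
DATA_SET$iso_week <- format(DATA_SET$transday, "%G-W%V")

# Option 2: the Monday that starts each date's week, via cut()
DATA_SET$week_start <- as.Date(as.character(cut(DATA_SET$transday, "week")))

print(DATA_SET)
```

Either column gives one value per calendar week, so two dates are in the same week exactly when their values match, even across New Year.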
You can use the week() function of the lubridate package in R.
library(lubridate)
DATA_SET$week <- week(DATA_SET$transday)
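This will give you a new column week. Dates within the same week of the same year will have the same week number. Note that the number repeats every year, so it has to be combined with the year to be unique across years.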
I'm trying to plot time-series data split by year for comparison.
The problem is that I can't figure out how to force ggplot to skip missing dates in each plot.
My data structure looks like this:
> head(pmWaw)
date stacja pm25 pm10 season year month
1 2014-01-01 00:00:00 MzWarNiepodKom 100 122 winter (DJF) 2014 1
2 2014-01-01 00:00:00 MzWarszUrsynow 63 68 winter (DJF) 2014 1
3 2014-01-01 00:00:00 MzWarTarKondra NA 79 winter (DJF) 2014 1
4 2014-01-01 01:00:00 MzWarNiepodKom 69 77 winter (DJF) 2014 1
5 2014-01-01 01:00:00 MzWarszUrsynow 36 38 winter (DJF) 2014 1
6 2014-01-01 01:00:00 MzWarTarKondra NA 90 winter (DJF) 2014 1
> tail(pmWaw)
date stacja pm25 pm10 season year month
52483 2015-12-30 22:00:00 MzWarAlNiepo 36 47 winter (DJF) 2015 12
52484 2015-12-30 22:00:00 MzWarKondrat 26 29 winter (DJF) 2015 12
52485 2015-12-30 22:00:00 MzWarWokalna 36 44 winter (DJF) 2015 12
52486 2015-12-30 23:00:00 MzWarAlNiepo 39 59 winter (DJF) 2015 12
52487 2015-12-30 23:00:00 MzWarKondrat 36 39 winter (DJF) 2015 12
52488 2015-12-30 23:00:00 MzWarWokalna 40 49 winter (DJF) 2015 12
ggplot2 code I came up with is:
pmWaw %>%
ggplot(aes(x=date, y=pm25)) +
geom_jitter(alpha=0.5) +
geom_smooth() +
facet_wrap( ~ year)
Resulting plot has gaps in each year that I'd like to remove, but can't figure out how:
Try scales = "free_x" in facet_wrap(), like this:
pmWaw %>%
ggplot(aes(x=date, y=pm25)) +
geom_jitter(alpha=0.5) +
geom_smooth() +
facet_wrap( ~ year, scales = "free_x")
Basically, I'm looking at snowpack data. I want to assign a unique value to each date (column "snowday") over the period from October 15 to May 15 of the following year (the winter season, of course), about 215 days. Then I want to add a column "snowmonth" that corresponds to the sequential months of the seasonal data, as well as a "snow year" column that represents the year in which each seasonal record starts.
There are some missing dates, however. Instead of finding those dates and inserting NAs into the rows, I've opted to skip that step and go the sequential route, which can then be plotted with respect to "snowmonth".
Basically, I just need to get the "snowday" sequence of about 1:215 (+1 for leap years) down in a column, and the rest I can do myself. It looks like this:
month day year depth date yearday snowday snowmonth
12 26 1955 27 1955-12-26 360 NA NA
12 27 1955 24 1955-12-27 361 NA NA
12 28 1955 24 1955-12-28 362 NA NA
12 29 1955 24 1955-12-29 363 NA NA
12 30 1955 26 1955-12-30 364 NA NA
12 31 1955 26 1955-12-31 365 NA NA
1 1 1956 25 1956-01-01 1 NA NA
1 2 1956 25 1956-01-02 2 NA NA
1 3 1956 26 1956-01-03 3 NA NA
man<-data.table()
man <- read.delim('mansfieldstake.txt',header=TRUE, check.names=FALSE)
man[is.na(man)]<-0
man$date<-paste(man$yy, man$mm, man$dd,sep="-", collapse=NULL)
man$yearday<-NA #day of the year 1-365
colnames(man)<- c("month","day","year","depth", "date","yearday")
man$date<-as.Date(man$date)
man$yearday<-yday(man$date)
man$snowday<-NA
man$snowmonth<-NA
man[420:500,]
head(man)
output would look something like this:
month day year depth date yearday snowday snowmonth
12 26 1955 27 1955-12-26 360 73 3
12 27 1955 24 1955-12-27 361 74 3
12 28 1955 24 1955-12-28 362 75 3
12 29 1955 24 1955-12-29 363 76 3
12 30 1955 26 1955-12-30 364 77 3
12 31 1955 26 1955-12-31 365 78 3
1 1 1956 25 1956-01-01 1 79 4
1 2 1956 25 1956-01-02 2 80 4
1 3 1956 26 1956-01-03 3 81 4
I've thought about loops and all that, but it's inefficient... Leap years kind of mess things up as well. This has become more challenging than I thought. Good first project though!
I'm just looking for a simple sequence here, dropping all non-snow months. Thanks to anybody who's got input!
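If I understand correctly that snowday should be the number of days since the beginning of the season, all you need to make this column using data.table is:
library(data.table)
setDT(man)  # read.delim() returns a plain data.frame; := needs a data.table
day_one <- as.Date("1955-10-01")
man[, snowday := as.integer(date - day_one)]  # days elapsed since day_one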
If all you want is a sequence of unique values, then seq() is your best bet.
Then you can create the snowmonth using:
library(lubridate)
man[, snowmonth := floor(time_length(interval(day_one, date), unit = "month"))]
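The snippets above hard-code a single season start. To handle every season in the file at once, something along these lines might work. It is a base-R sketch on a few made-up dates, and it assumes the season runs from October 15 (snowday 1) to May 15, as described in the question, with snowmonth counted by calendar month (Oct = 1, ..., May = 8). The column name snowyear is my own.

```r
# Hypothetical sample of dates spanning two seasons plus one summer date
man <- data.frame(date = as.Date(c("1955-10-15", "1955-12-26", "1955-12-31",
                                   "1956-01-01", "1956-05-15", "1956-07-04",
                                   "1956-10-15", "1957-03-01")))

yr <- as.integer(format(man$date, "%Y"))
mo <- as.integer(format(man$date, "%m"))
dy <- as.integer(format(man$date, "%d"))

# Dates before October 15 belong to the season that started the previous year
before_start <- mo < 10 | (mo == 10 & dy < 15)
man$snowyear <- ifelse(before_start, yr - 1L, yr)

# Day count within the season; Date subtraction handles leap years
season_start <- as.Date(paste0(man$snowyear, "-10-15"))
man$snowday <- as.integer(man$date - season_start) + 1L

# Calendar-month position within the season: Oct = 1, Nov = 2, ..., May = 8
man$snowmonth <- ifelse(mo >= 10, mo - 9L, mo + 3L)

# Drop dates after May 15 and before October 15 (outside the snow season)
out_of_season <- before_start & (mo > 5 | (mo == 5 & dy > 15))
man <- man[!out_of_season, ]
```

On the rows shown in the question this reproduces the desired values (1955-12-26 gets snowday 73 and snowmonth 3, 1956-01-01 gets 79 and 4), and no loop is needed.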
I have a data set that is long format and includes exact date/time measurements of 3 scores on a single test administered between 3 and 5 times per year.
ID Date Fl Er Cmp
1 9/24/2010 11:38 15 2 17
1 1/11/2011 11:53 39 11 25
1 1/15/2011 11:36 39 11 39
1 3/7/2011 11:28 95 58 2
2 10/4/2010 14:35 35 9 6
2 1/7/2011 13:11 32 7 8
2 3/7/2011 13:11 79 42 30
3 10/12/2011 13:22 17 3 18
3 1/19/2012 14:14 45 15 36
3 5/8/2012 11:55 29 6 11
3 6/8/2012 11:55 74 37 7
4 9/14/2012 9:15 62 28 18
4 1/24/2013 9:51 82 45 9
4 5/21/2013 14:04 135 87 17
5 9/12/2011 11:30 98 61 18
5 9/15/2011 13:23 55 22 9
5 11/15/2011 11:34 98 61 17
5 1/9/2012 11:32 55 22 17
5 4/20/2012 11:30 23 4 17
I need to transform this data to short format with time bands based on month (i.e. Fall=August-October; Winter=January-February; Spring=March-May). Some bands will include more than one observation per participant, and as such, will need a "spill over" band. An example transformation for the Fl scores below.
ID Fall1Fl Fall2Fl Winter1Fl Winter2Fl Spring1Fl Spring2Fl
1 15 NA 39 39 95 NA
2 35 NA 32 NA 79 NA
3 17 NA 45 NA 28 74
4 62 NA 82 NA 135 NA
5 98 55 55 NA 23 NA
Notice that dates which are "redundant" (i.e. more than 1 Aug-Oct observation) spill over into Fall2fl column. Dates that occur outside of the desired bands (i.e. November, December, June, July) should be deleted. The final data set should have additional columns that include Fl Er and Cmp.
Any help would be appreciated!
(Link to .csv file with long data http://mentor.coe.uh.edu/Data_Example_Long.csv )
This seems to do what you are looking for, but doesn't exactly match your desired output. I haven't looked at your sample data to see whether the problem lies with your sample desired output or the transformations I've done, but you should be able to follow along with the code to see how the transformations were made.
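library(lubridate)  # provides month() and year() used below
## Convert dates to actual date formats (POSIXct stores cleanly in a data frame column)
mydf$Date <- as.POSIXct(strptime(gsub("/", "-", mydf$Date), format="%m-%d-%Y %H:%M"))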
## Factor the months so we can get the "seasons" that you want
Months <- factor(month(mydf$Date), levels=1:12)
levels(Months) <- list(Fall = c(8:10),
Winter = c(1:2),
Spring = c(3:5),
Other = c(6, 7, 11, 12))
mydf$Seasons <- Months
## Drop the "Other" seasons
mydf <- mydf[!mydf$Seasons == "Other", ]
## Add a "Year" column
mydf$Year <- year(mydf$Date)
## Add a "Times" column
mydf$Times <- as.numeric(ave(as.character(mydf$Seasons),
mydf$ID, mydf$Year, FUN = seq_along))
## Load "reshape2" and use `dcast` on just one variable.
## Repeat for other variables by changing the "value.var"
library(reshape2)
dcast(mydf, ID ~ Seasons + Times, value.var="Fluency")
# ID Fall_1 Fall_2 Winter_1 Winter_2 Spring_2 Spring_3
# 1 1 15 NA 39 39 NA 95
# 2 2 35 NA 32 NA 79 NA
# 3 3 17 NA 45 NA 29 NA
# 4 4 62 NA 82 NA 135 NA
# 5 5 98 55 55 NA 23 NA
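The Spring_2 and Spring_3 columns appear because Times is counted within ID and calendar year, so the winter observations use up positions 1 and 2 before spring is reached. One way to get per-season counters is to count within ID and season instead. Below is a base-R sketch on a subset of the question's rows; it assumes each ID covers a single school year, and the names Season, Band and wide are made up for illustration.

```r
# Subset of the question's data (IDs 1, 3 and 5; Fl score only)
mydf <- data.frame(
  ID = c(1, 1, 1, 1, 3, 3, 3, 3, 5, 5, 5, 5, 5),
  Date = c("9/24/2010 11:38", "1/11/2011 11:53", "1/15/2011 11:36",
           "3/7/2011 11:28", "10/12/2011 13:22", "1/19/2012 14:14",
           "5/8/2012 11:55", "6/8/2012 11:55", "9/12/2011 11:30",
           "9/15/2011 13:23", "11/15/2011 11:34", "1/9/2012 11:32",
           "4/20/2012 11:30"),
  Fl = c(15, 39, 39, 95, 17, 45, 29, 74, 98, 55, 98, 55, 23),
  stringsAsFactors = FALSE
)
mydf$Date <- as.POSIXct(mydf$Date, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Map months to bands; months outside the bands stay NA and are dropped
mo <- as.integer(format(mydf$Date, "%m"))
mydf$Season <- NA_character_
mydf$Season[mo %in% 8:10] <- "Fall"
mydf$Season[mo %in% 1:2]  <- "Winter"
mydf$Season[mo %in% 3:5]  <- "Spring"
mydf <- mydf[!is.na(mydf$Season), ]
mydf <- mydf[order(mydf$ID, mydf$Date), ]

# Number the observations within each ID *and* season (not ID and year)
mydf$Times <- ave(seq_along(mydf$ID), mydf$ID, mydf$Season, FUN = seq_along)

# Fix the column order of the wide table via factor levels
band_levels <- c("Fall1Fl", "Fall2Fl", "Winter1Fl", "Winter2Fl",
                 "Spring1Fl", "Spring2Fl")
mydf$Band <- factor(paste0(mydf$Season, mydf$Times, "Fl"), levels = band_levels)

# One row per ID, one column per band; empty cells come out as NA
wide <- tapply(mydf$Fl, list(ID = mydf$ID, Band = mydf$Band), function(v) v[1])
print(wide)
```

With this grouping, ID 3's June observation is dropped, since June is outside the bands, so its Spring2Fl stays NA rather than 74 as in the desired output.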
I want to generate a row (with a zero amount) for each missing month (up to the current month) in the following data frame. Can you please give me a hand with this? Thanks!
trans_date ammount
1 2004-12-01 2968.91
2 2005-04-01 500.62
3 2005-05-01 434.30
4 2005-06-01 549.15
5 2005-07-01 276.77
6 2005-09-01 548.64
7 2005-10-01 761.69
8 2005-11-01 636.77
9 2005-12-01 1517.58
10 2006-03-01 719.09
11 2006-04-01 1231.88
12 2006-05-01 580.46
13 2006-07-01 1468.43
14 2006-10-01 692.22
15 2006-11-01 505.81
16 2006-12-01 1589.70
17 2007-03-01 1559.82
18 2007-06-01 764.98
19 2007-07-01 964.77
20 2007-09-01 405.18
21 2007-11-01 112.42
22 2007-12-01 1134.08
23 2008-02-01 269.72
24 2008-03-01 208.96
25 2008-04-01 353.58
26 2008-05-01 756.00
27 2008-06-01 747.85
28 2008-07-01 781.62
29 2008-09-01 195.36
30 2008-10-01 424.24
31 2008-12-01 166.23
32 2009-02-01 237.11
33 2009-04-01 110.94
34 2009-07-01 191.29
35 2009-11-01 153.42
36 2009-12-01 222.87
37 2010-09-01 1259.97
38 2010-11-01 375.61
39 2010-12-01 496.48
40 2011-02-01 360.07
41 2011-03-01 324.95
42 2011-04-01 566.93
43 2011-06-01 281.19
44 2011-08-01 428.04
'data.frame': 44 obs. of 2 variables:
$ trans_date : Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount: num 2969 501 434 549 277 ...
You can use seq.Date and merge:
> str(df)
'data.frame': 44 obs. of 2 variables:
$ trans_date: Date, format: "2004-12-01" "2005-04-01" "2005-05-01" "2005-06-01" ...
$ ammount : num 2969 501 434 549 277 ...
> mns <- data.frame(trans_date = seq.Date(min(df$trans_date), max(df$trans_date), by = "month"))
> df2 <- merge(mns, df, all = TRUE)
> df2$ammount <- ifelse(is.na(df2$ammount), 0, df2$ammount)
> head(df2)
trans_date ammount
1 2004-12-01 2968.91
2 2005-01-01 0.00
3 2005-02-01 0.00
4 2005-03-01 0.00
5 2005-04-01 500.62
6 2005-05-01 434.30
And if you need months up to the current one, use this:
mns <- data.frame(trans_date = seq.Date(min(df$trans_date), Sys.Date(), by = "month"))
Note that it is sufficient to simply call seq instead of seq.Date if the arguments are of class Date.
If you're using xts objects, you can use timeBasedSeq and merge.xts. Assuming your original data is in an object Data:
library(xts)
# create xts object:
# no comma on the first subset (Data['ammount']) keeps column name;
# as.Date needs a vector, so use comma (Data[,'trans_date'])
x <- xts(Data['ammount'],as.Date(Data[,'trans_date']))
# create a time-based vector from 2004-12-01 to 2011-08-01. The "m" denotes
# monthly time-steps. By default this returns a yearmon class. Use
# retclass="Date" to return a Date vector.
d <- timeBasedSeq(paste(start(x),end(x),"m",sep="/"), retclass="Date")
# merge x with an "empty" xts object, xts(,d), filling with zeros
y <- merge(x,xts(,d),fill=0)