Why doesn't the cut at "3 months" produce labels as expected?
# create time series data
everyday <- seq(from = as.Date('2014-1-1'), to = as.Date('2014-12-31'), by = 'day')
# create a factor based on the quarter of the year an observation is in:
qtrs <- cut(everyday, "3 months", labels = paste0('Q', 1:4))
## Error in cut.default(unclass(x), unclass(breaks), labels = labels,
## right = right, :
## lengths of 'breaks' and 'labels' differ
The cut is every 3 months, so that would create 4 Quarters and I'd expect to need 4 labels, but the error message suggests that the length of breaks and labels differs.
qtrs <- cut(everyday, "3 months", labels = paste0('Q', 1:5))
table(qtrs)
## qtrs
## Q1 Q2 Q3 Q4 Q5
## 90 91 92 92 0
The fifth label Q5 seems to be needed and yet appears with a zero count.
The example is taken from "Data Manipulation with R" by Phil Spector,
http://www.springer.com/statistics/computational+statistics/book/978-0-387-74730-9
This does not answer your original question, but is a way to achieve (what I assume is) the same result, without cut. You may use the quarters function to extract the 'quarter' from a Date object:
table(quarters(everyday))
# Q1 Q2 Q3 Q4
# 90 91 92 92
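On the original question: as far as I can tell, cut() on a Date vector with an interval string such as "3 months" builds break points that run past the last date. That gives one more interval than expected, which would explain the empty fifth level. I have not checked this on every R version, so treat it as an assumption. A minimal sketch that sidesteps the issue by letting cut() choose its own labels, dropping empty levels, and relabelling afterwards:

```r
everyday <- seq(from = as.Date("2014-01-01"), to = as.Date("2014-12-31"), by = "day")

# Let cut() name the levels itself, then drop any level with no observations
qtrs <- droplevels(cut(everyday, "3 months"))

# Relabel whatever levels remain as Q1, Q2, ...
levels(qtrs) <- paste0("Q", seq_along(levels(qtrs)))

table(qtrs)
```

Because the labels are generated from the levels that survive droplevels(), the number of labels always matches.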
Related
I found a pet adoption dataset that includes the age of a pet when adopted. However, the age variable contains strings like "3 months" or "4 years" or "3 weeks" all in the same column. The dataset is otherwise tidy. How can I convert these variables into year values?
I've tried something like this:
for(i in i:nrow(Pet_Train$AgeuponOutcome)){
if(grepl(i, "month") == TRUE)
Pet_Train$Age_in_Years[i] == "0"
}
But I have little experience with loops/if statements/this "grepl" function I just looked up. I do have experience with tidy functions like mutate() and filter() but I'm not sure how to apply those with these many of possible argument combinations.
Since there are 27,000 instances, I'd rather not go through this by hand.
Edit:
I figured out how to use the grepl function to replace instances containing "month" with "less than a year." But is there a way to take the exact number of months and convert them into the year as a decimal?
The first two use only base R and the third uses dplyr and tidyr.
1) Use read.table to split the input column into the numeric and units parts and then multiply the numeric part by the fraction of a year that the units part represents.
PT <- data.frame(Age = c("3 months", "4 years", "3 weeks")) # input
transform(cbind(PT, read.table(text = as.character(PT$Age))),
Years = V1 * (7 / 365.25 * (V2 == "weeks") + 1/12 * (V2 == "months") + (V2 == "years")))
giving:
Age V1 V2 Years
1 3 months 3 months 0.25000000
2 4 years 4 years 4.00000000
3 3 weeks 3 weeks 0.05749487
2) Alternately the last line could be written in terms of switch:
transform(cbind(PT, read.table(text = as.character(PT$Age), as.is = TRUE)),
Years = V1 * sapply(V2, switch, weeks = 7 / 365.25, months = 1 / 12, years = 1))
3) This uses dplyr and tidyr:
library(dplyr)
library(tidyr)
PT %>%
separate(Age, c("No", "Units")) %>%
mutate(No = as.numeric(No),
Years = No * case_when(Units == "weeks" ~ 7 / 365.25,
Units == "months" ~ 1 / 12,
Units == "years" ~ 1))
giving:
No Units Years
1 3 months 0.25000000
2 4 years 4.00000000
3 3 weeks 0.05749487
lubridate-based solution:
library(tidyverse)
library(lubridate)
dat <- tibble(age_text = c("3 months", "4 years", "3 weeks")) # tibble() replaces the deprecated data_frame()
dat %>% mutate(age_in_years = duration(age_text) / dyears(1))
The answer of David Rubinger uses the lubridate package to coerce character strings to objects of class Duration.
The as.duration() function seems to recognize a variety of strings, e.g.,
age_text <- c("3 months", "4 years", "3 weeks", "52 weeks", "365 days 6 hours")
lubridate::as.duration(age_text)
[1] "7889400s (~13.04 weeks)" "126230400s (~4 years)" "1814400s (~3 weeks)"
[4] "31449600s (~52 weeks)" "31557600s (~1 years)"
However, the OP has requested to convert the strings into year values rather than seconds.
This can be achieved by using the as.numeric() function which takes a units parameter to specify the desired conversion:
as.numeric(lubridate::as.duration(age_text), units = "years")
[1] 0.25000000 4.00000000 0.05749487 0.99657769 1.00000000
Other units can be chosen as well:
as.numeric(lubridate::as.duration(age_text), units = "months")
[1] 3.0000000 48.0000000 0.6899384 11.9589322 12.0000000
as.numeric(lubridate::as.duration(age_text), units = "weeks")
[1] 13.04464 208.71429 3.00000 52.00000 52.17857
Just to expand on the comment I left, you could use ifelse. First though, here's a reproducible example of your data (always very useful for you to provide this when asking a question):
df <- data.frame("Duration" = c("3 months", "4 years", "3 weeks"))
You can then split out the units and values from this using string split:
df$Value <- as.numeric(vapply(strsplit(as.character(df$Duration), split = " "), `[`, 1, FUN.VALUE=character(1)))
df$Units <- vapply(strsplit(as.character(df$Duration), split = " "), `[`, 2, FUN.VALUE=character(1))
Finally, use nested ifelse calls, which tell R what to do when a value matches a condition and what to do when it does not. Here, if the unit is weeks, the value is divided by 52.18 (the number of weeks per year); if it is months, it is divided by 12; otherwise it is already in years.
df$Years <- ifelse(df[,'Units']=="weeks", df[,'Value']/(365.25/7), ifelse(df[,'Units']=="months", df[,'Value']/12, df[,'Value']))
And the successful output:
> df
Duration Value Units Years
1 3 months 3 months 0.25000000
2 4 years 4 years 4.00000000
3 3 weeks 3 weeks 0.05749487
Note: It would be more appropriate to do this with "days" as your unit of time, which could be done if you had dates for the first and second event (birth and adoption dates of the animal). This is because years and months are variable length units - December is longer than February, 2016 was longer than 2015 and 2017.
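To illustrate that note, here is a minimal sketch of the day-based approach. It assumes the data has birth and adoption dates; the data frame and its column names (pets, birth, adoption) are made up for the example and do not come from the original dataset.

```r
# Hypothetical data: birth and adoption dates for two animals
pets <- data.frame(
  birth    = as.Date(c("2015-01-10", "2012-06-01")),
  adoption = as.Date(c("2015-04-10", "2016-06-01"))
)

# Exact age in days, independent of month and year lengths
pets$age_days <- as.numeric(difftime(pets$adoption, pets$birth, units = "days"))

# Approximate age in years, using the same 365.25-day year as above
pets$age_years <- pets$age_days / 365.25

pets
```

The day count is exact; only the final division by 365.25 is an approximation.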
With ggplot2, I would like to create a multiplot (facet_grid) where each plot is the weekly count values for the month.
My data are like this :
day_group count
1 2012-04-29 140
2 2012-05-06 12595
3 2012-05-13 12506
4 2012-05-20 14857
For this dataset I have created two other columns, Month and Week, based on day_group:
day_group count Month Week
1 2012-04-29 140 Apr 17
2 2012-05-06 12595 May 18
3 2012-05-13 12506 May 19
4 2012-05-20 14857 May 20
Now I would like for each Month to create a barplot where I have the sum of the count values aggregated by week. So for example for a year I would have 12 plots with 4 bars (one per week).
Below is what I use to generate the plot :
ggplot(data = count_by_day, aes(x=day_group, y=count)) +
stat_summary(fun.y="sum", geom = "bar") +
scale_x_date(date_breaks = "1 month", date_labels = "%B") +
facet_grid(facets = Month ~ ., scales="free", margins = FALSE)
So far, my plot looks like this
https://dl.dropboxusercontent.com/u/96280295/Rplot.png
As you can see, the x-axis is not what I'm looking for. Instead of showing only weeks 1, 2, 3 and 4, it displays the whole month.
Do you know what I must change to get what I'm looking for ?
Thanks for your help
Okay, now that I see what you want, I wrote a small program to illustrate it. The key to your order of month problem is making month a factor with the levels in the right order:
library(dplyr)
library(ggplot2)
#initialization
set.seed(1234)
sday <- as.Date("2012-01-01")
eday <- as.Date("2012-07-31")
# List of the first day of the months
mfdays <- seq(sday,length.out=12,by="1 month")
# list of months - this is key to keeping the order straight
mlabs <- months(mfdays)
# list of first weeks of the months
mfweek <- trunc((mfdays-sday)/7)
names(mfweek) <- mlabs
# Generate a bunch of event-days, and then months, then week numbs in our range
n <- 1000
edf <-data.frame(date=sample(seq(sday,eday,by=1),n,T))
edf$month <- factor(months(edf$date),levels=mlabs) # use the factor in the right order
edf$week <- 1 + as.integer(((edf$date-sday)/7) - mfweek[edf$month])
# Now summarize with dplyr
ndf <- group_by(edf,month,week) %>% summarize( count = n() )
ggplot(ndf) + geom_bar(aes(x=week,y=count),stat="identity") + facet_wrap(~month,nrow=1)
Yielding:
(As an aside, I am kind of proud I did this without lubridate ...)
I think you have to do this but I am not sure I understand your question:
ggplot(data = count_by_day, aes(x=Week, y=count, group= Month, color=Month))
I have an age variable containing observations that follow this (inconsistent) format:
3 weeks, 2 days, 4 hours
4 weeks, 6 days, 12 hours
3 days, 18 hours
4 days, 3 hours
7 hours
8 hours
I need to convert each observation to hours using R.
I have used strsplit(vector, ',') to split the variable at each comma.
I am running into trouble because splitting each observation at the ',' yields anywhere from 1 to 3 entries per observation. I do not know how to index these entries properly so that I end up with one row for each observation.
I am guessing that once I am able to store these values in sensible rows, I can extract the numeric data from each column in a row and convert accordingly, then sum the entire row.
I am also open to any different methods of approaching this problem.
After you split your data you can parse the resulting list for the keywords defining the times like 'hours', 'weeks', 'days' and create a dataframe containing the relevant value (or 0 if there is no value for a certain keyword). You can achieve that with something like this:
library(dplyr)
vector = c("3 weeks, 2 days, 4 hours", "4 weeks, 6 days, 12 hours", "3 days, 18 hours", "4 days, 3 hours", "7 hours", "8 hours")
split_vector = strsplit(vector, ",", fixed = TRUE)
parse_string = function(i){
x = split_vector[[i]]
tibble(ID = i) %>% # tibble() replaces the deprecated data_frame()
mutate(hours = ifelse(any(grepl("hours", x)), as.numeric(gsub("\\D", "", x[grepl("hours", x)])), 0),
days = ifelse(any(grepl("days", x)), as.numeric(gsub("\\D", "", x[grepl("days", x)])), 0),
weeks = ifelse(any(grepl("weeks", x)), as.numeric(gsub("\\D", "", x[grepl("weeks", x)])), 0))
}
all_parsed = lapply(seq_along(split_vector), parse_string)
all_parsed = bind_rows(all_parsed) %>% # bind_rows() replaces rbind_all(), which was removed from dplyr
mutate(final_hours = hours + days * 24 + weeks * 7 * 24)
Hadleyverse comes to the rescue again:
library(lubridate)
library(stringr)
dat <- readLines(textConnection(" 3 weeks, 2 days, 4 hours
4 week, 6 days, 12 hours
3 days, 18 hours
4 day, 3 hours
7 hours
8 hour"))
sapply(str_split(str_trim(dat), ",[ ]*"), function(x) {
sum(sapply(x, function(y) {
bits <- str_split(str_trim(y), "[ ]+")[[1]]
duration(as.numeric(bits[1]), bits[2])
})) / 3600
})
## [1] 556 828 90 99 7 8
I whacked the data a bit to show it's also somewhat flexible in how it parses things. I really don't think the second str_trim is absolutely necessary, but I didn't have cycles to verify.
The explanation: it trims the original vector, then splits it into components (which makes a list of vectors). That list is iterated over, and the individual elements are further trimmed and split into a number and a unit. Those are passed to lubridate's duration(); the results are summed as numeric seconds and then divided by 3600 to get hours.
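For comparison, the same split-then-convert idea can be sketched in base R only. This is not from either answer above; the function name to_hours and the lookup table hours_per_unit are made up for the example, and it only knows about weeks, days and hours.

```r
# Hypothetical helper: convert strings like "3 weeks, 2 days, 4 hours" to hours
to_hours <- function(x) {
  hours_per_unit <- c(week = 7 * 24, day = 24, hour = 1)
  vapply(x, function(s) {
    # Pull out every "<number> <unit>" piece; singular and plural both match
    m <- regmatches(s, gregexpr("[0-9]+ *(week|day|hour)", s))[[1]]
    n <- as.numeric(sub("[^0-9].*$", "", m))  # the leading number
    u <- sub("^[0-9]+ *", "", m)              # the unit name
    sum(n * hours_per_unit[u])
  }, numeric(1), USE.NAMES = FALSE)
}

to_hours(c("3 weeks, 2 days, 4 hours", "4 day, 3 hours", "7 hours"))
```

For these three inputs the result should agree with the corresponding values in the lubridate output above (556, 99 and 7).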
Currently I am working on a river discharge data analysis. I have the daily discharge record from 1935 to now. I want to extract the annual maximum discharge for each hydrological year (from 01/11 to 31/10 of the next year). However, I found that the hydroTSM package can only deal with the calendar year. I tried to use the "zoo" package, but I found it difficult to compute, as each year has a different number of days. Does anyone have an idea? Thanks.
the data looks like:
01-11-1935 663
02-11-1935 596
03-11-1935 450
04-11-1935 381
05-11-1935 354
06-11-1935 312
my code:
mydata<-read.table("discharge")
colnames(mydata) <- c("date","discharge")
library(zoo)
z<-zooreg(mydata[,2],start=as.Date("1935-11-1"))
mydta$date <- as.POSIXct(dat$date)
q.month<-daily2monthly(z,FUN=max,na.rm = TRUE,date.fmt = "%Y-%m-%d",out.fmt="numeric")
q.month.plain=coredata(q.month)
z.month<-zooreg(q.month.plain,start=1,frequency=12)
With dates stored in a vector of class Date, you can just use cut() and tapply(), like this:
## Example data
df <- data.frame(date = seq(as.Date("1935-01-01"), length = 100, by = "week"),
flow = (runif(n = 100, min = 0, max = 1000)))
## Use vector of November 1st dates to cut data into hydro-years
breaks <- seq(as.Date("1934-11-01"), length=4, by="year")
df$hydroYear <- cut(df$date, breaks, labels=1935:1937)
## Find the maximum flow in each hydro-year
with(df, tapply(flow, hydroYear, max))
# 1935 1936 1937
# 984.7327 951.0440 727.4210
## Note: whenever using `cut()`, I take care to double-check that
## I've got the cuts exactly right
cut(as.Date(c("1935-10-31", "1935-11-01")), breaks, labels=1935:1937)
# [1] 1935 1936
# Levels: 1935 1936 1937
Here is a one-liner to do that.
First convert the dates to "yearmon" class. This class represents a year month as the sum of a year as the integer part and a month as the fractional part (Jan = 0, Feb = 1/12, etc.). Add 2/12 to shift November to January and then truncate to give just the years. Aggregate over those. Although the test data we used starts at the beginning of the hydro year this solution works even if the data does not start on the beginning of the hydro year.
# test data
library(zoo)
z <- zooreg(1:1000, as.Date("2000-11-01")) # test input
aggregate(z, as.integer(as.yearmon(time(z)) + 2/12), max)
This gives:
2001 2002 2003
365 730 1000
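The same shift can also be sketched in base R without zoo, for anyone who just needs a hydro-year column to group by. The function name hydro_year is made up for this example; it assumes, as above, that the hydro year starts on 1 November and is labelled by the calendar year in which it ends.

```r
# Hypothetical helper: hydrological year for a vector of Dates
hydro_year <- function(d) {
  year  <- as.integer(format(d, "%Y"))
  month <- as.integer(format(d, "%m"))
  # November and December belong to the next hydro year
  year + (month >= 11L)
}

dates <- as.Date(c("1935-10-31", "1935-11-01", "1936-10-31"))
hydro_year(dates)
# [1] 1935 1936 1936
```

The result can be used as the grouping variable in tapply() or aggregate(), for example tapply(flow, hydro_year(date), max).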
Try the xts package, which works together with zoo:
require(zoo)
require(xts)
dates = seq(Sys.Date(), by = 'day', length = 365 * 3)
y = cumsum(rnorm(365 * 3))
serie = zoo(y, dates)
# if you need to specify `start` and `end`
# serie = window(serie, start = "2015-06-01")
# xts function
apply.yearly(serie, FUN = max)
I struggle mightily with dates in R and could do this pretty easily in SPSS, but I would love to stay within R for my project.
I have a date column in my data frame and want to remove the year completely in order to leave the month and day. Here is a peek at my original data.
> head(ds$date)
[1] "2003-10-09" "2003-10-11" "2003-10-13" "2003-10-15" "2003-10-18" "2003-10-20"
> class((ds$date))
[1] "Date"
I "want" it to be.
> head(ds$date)
[1] "10-09" "10-11" "10-13" "10-15" "10-18" "10-20"
> class((ds$date))
[1] "Date"
If possible, I would love to set the first date to be October 1st instead of January 1st.
Any help you can provide will be greatly appreciated.
EDIT: I felt like I should add some context. I want to plot an NHL player's performance over the course of a season which starts in October and ends in April. To add to this, I would like to facet the plots by each season which is a separate column in my data frame. Because I want to compare cumulative performance over the course of the season, I believe that I need to remove the year portion, but maybe I don't; as I indicated, I struggle with dates in R. What I am looking to accomplish is a plot that compares cumulative performance over relative dates by season and have the x-axis start in October and end in April.
> d = as.Date("2003-10-09", format="%Y-%m-%d")
> format(d, "%m-%d")
[1] "10-09"
Is this what you are looking for?
library(ggplot2)
library(reshape2) # provides melt()
## make up data for two seasons a and b
a = as.Date("2010/10/1")
b = as.Date("2011/10/1")
a.date <- seq(a, by='1 week', length=28)
b.date <- seq(b, by='1 week', length=28)
## make up some score data
a.score <- abs(trunc(rnorm(28, mean = 10, sd = 5)))
b.score <- abs(trunc(rnorm(28, mean = 10, sd = 5)))
## create a data frame
df <- data.frame(a.date, b.date, a.score, b.score)
df
## Since I am using ggplot, I'd better create a long-format data frame
df.molt <- melt(df, measure.vars = c("a.score", "b.score"))
levels(df.molt$variable) <- c("First season", "Second season")
df.molt
Then, I am using ggplot2 for plotting the data:
## plot it
ggplot(aes(y = value, x = a.date), data = df.molt) + geom_point() +
geom_line() + facet_wrap(~variable, ncol = 1) +
scale_x_date("Date", date_labels = "%m-%d")
If you want to modify the x-axis (e.g., display format), then you'll probably be interested in scale_x_date.
You have to remember Date is a numeric format, representing the number of days passed since the "origin" of the internal date counting :
> str(Date)
Class 'Date' num [1:10] 14245 14360 14475 14590 14705 ...
This is the same as in Excel, if you want a reference. Hence the solution with format is perfectly valid.
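A small sketch to make this concrete (the variable names are mine). It also shows why the result of format() cannot keep class Date, as the question hoped: the "month-day" version is a plain character string.

```r
d <- as.Date("2003-10-09")

# Underneath, a Date is just a number of days since 1970-01-01
as.numeric(as.Date("1970-01-10"))
# [1] 9

# format() drops the year, but returns a character vector, not a Date
md <- format(d, "%m-%d")
md
# [1] "10-09"
class(md)
# [1] "character"
```

So for plotting it is usually easier to keep the full Date (or a numeric day offset) and only change how the axis labels are formatted.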
Now if you want to set the first date of a year as October 1st, you can construct some year index like this :
redefine.year <- function(x,start="10-1"){
year <- as.numeric(strftime(x,"%Y"))
yearstart <- as.Date(paste(year,start,sep="-"))
year + (x >= yearstart) - min(year) + 1
}
Testing code :
Start <- as.Date("2009-1-1")
Stop <- as.Date("2011-11-1")
Date <- seq(Start,Stop,length.out=10)
data.frame( Date=as.character(Date),
year=redefine.year(Date))
gives
Date year
1 2009-01-01 1
2 2009-04-25 1
3 2009-08-18 1
4 2009-12-11 2
5 2010-04-05 2
6 2010-07-29 2
7 2010-11-21 3
8 2011-03-16 3
9 2011-07-09 3
10 2011-11-01 4
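Building on the same idea, here is a hedged sketch of how one might get an x-axis that runs from October to April for every season: compute the number of days since the most recent October 1st. The function name season_day is made up, and I am assuming every season starts on October 1st.

```r
# Hypothetical helper: days elapsed since the most recent season start
season_day <- function(x, start = "10-01") {
  year <- as.numeric(format(x, "%Y"))
  this_start <- as.Date(paste(year, start, sep = "-"))
  prev_start <- as.Date(paste(year - 1, start, sep = "-"))
  # ifelse() drops the Date class, so convert the result back explicitly
  season_start <- as.Date(ifelse(x >= this_start, this_start, prev_start),
                          origin = "1970-01-01")
  as.numeric(x - season_start)
}

season_day(as.Date(c("2010-10-01", "2011-04-01", "2011-10-15")))
# [1]   0 182  14
```

Plotting against season_day() and faceting by season lines the seasons up on a common October-to-April axis.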