Related
so I have a dataframe with date and ranks of three tennis players. I want to convert the dates to continuous days so that 2001-01-01 is 0.
I tried this:
days <- yday(x) - 1 # so Jan 1 = day 0
total_days <- cumsum(days)
It does do the job but only per year so for 2002 it starts over, again in 2003, and so on.
I would very much appreciate some help with this.
Cheers
We could also convert to integer first and subtract
as.integer(x) - as.integer(as.Date('2001-01-01'))
[1] 92 396
data
x <- as.Date(c('2001-04-03', '2002-02-01'))
You can subtract days from '2001-01-01' to get number of days since that date.
x <- as.Date(c('2001-04-03', '2002-02-01'))
total_days <- as.numeric(x - as.Date('2001-01-01'))
total_days
#[1] 92 396
I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM) if there are data points for each hour at least (11 AM, 12, 1, 2, 3, 4 PM) within each day. I want to ultimately sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I did tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data from any of the times between 11 AM and 4 PM, but often I would only have data for e.g. 12 and 1 PM for a given day.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to #Henry Navarro answer for solving an additional problem mentioned in the question.
If I understand in proper way, another concern of the question is to find the dates such that there are data points at least for each hour of the given interval within the day. A possible way following the style of #Henry Navarro solution is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
all_hours_flags <- data.frame(days = unique(your_data$days),
all_hours_present = sapply(function(Z) (sum(unique(Z$hour_only) %in% hours_intervals) >=
length(hours_intervals)), X = your_data_by_days_list), row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now the column "all_hours_present" indicating that the data for a corresponding day contains at least one value for each hour in the given hours_intervals. And you may use this column to subset your data
subset(your_data, all_hours_present)
Try to create a new variable in your data frame with only the hour.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then, using this new variable try to do the next:
#auxiliar variable with your interval of time
your_data$aux_var<-ifelse(your_data$hour >"11:00:00" || your_data$hour<"16:00:00" ,1,0)
So, the next step is filter your data when aux_var==1
your_data[which(your_data$aux_var ==1),]
species <- c("frog1","frog1","frog1","frog1","frog1","frog1","frog1","frog1"
,"frog1","frog1","frog2","frog2","frog2","frog2","frog2",
"frog2","frog2","frog2","frog2","frog2")
month <- c(1,12,5,8,10,3,5,7,9,4,2,4,6,7,6,3,8,9,11,1)
number <- c(3,4,5,1,2,3,4,7,6,7,3,5,6,7,8,9,9,5,3,1)
a<- data.frame(species,month,number)
my data frame means I caught two kinds of frogs,frog1 and frog2 with different numbers in different months.
I would like to convert months into 4 seasons. The first season is month 1, month 12, month 2, second is 4,3,5 , third is 7,6,8 ,and forth is 10,9,11. there is order in theses 4 seasons,namely, in the first season I would like to choose month 1 first, month 12 secondly, month 2 finally,in the same way, in the second season I would choose month 4 first, month 3 secondly, month 5 eventually, and so on.For example, in the frog 1 , there are 2 months 1 and 12, I would like to pick up the month 1 instead of month 12 for the first season.
I would like to ask how do I create a column that can choose the most important month in turn for 4 seasons in two kinds of frogs.For instance,in the frog 1 , there are 2 months 1 and 12, I would like to pick up the month 1 instead of month 12 for the first season.
My expected output is :
species <- c("frog1","frog1","frog1","frog1","frog1","frog1","frog1","frog1"
,"frog1","frog1","frog2","frog2","frog2","frog2","frog2",
"frog2","frog2","frog2","frog2","frog2")
month <- c(1,12,5,8,10,3,5,7,9,4,2,4,6,7,6,3,8,9,11,1)
number <- c(3,4,5,1,2,3,4,7,6,7,3,5,6,7,8,9,9,5,3,1)
choosemonth <- c("season1","","","","season4","","","season3","","season2",
"","season2","","season3","","","","season4","","season1")
b<- data.frame(species,month,number,choosemonth)
I'm guessing a little at your final desired result, but here's how to create the season and the importance, and I'm solving for the most important month for each species
Here's a way with dplyr:
library(dplyr)
a %>%
# Season is basically just one-off quarters
mutate(season = trunc((month + 1)%%12 / 3),
# for each month the value mod 3 goes in order 2,3,1
importance = c(2,3,1)[month %% 3 + 1]) %>%
group_by(season, species) %>%
# keep only those with the max importance
filter(importance == max(importance))
EDIT: It looks like you just want to flag the value with the most importance, so here's how to do that,
a %>%
# Season is basically just one-off quarters
mutate(season = trunc((month + 1)%%12 / 3),
# for each month the value mod 3 goes in order 2,3,1
importance = c(2,3,1)[month %% 3 + 1]) %>%
mutate(choosemonth = ifelse(importance == 3, paste0('season',season + 1),''))
EDIT 2: edited one more time, was dividing into 3 seasons rather than 4
I have a dataset with dates occurring randomly. For example:
10/21/15, 11/21/15, 11/22/15, 11/28/15,11/30/15, 12/12/15...etc
I am looking to create a rolling average by time-period NOT by at the observation level. For instance if I wanted to do a moving average of the last 7 days. I would not want to look up at the last 7 rows, but rather the last 7 days
For a tiny example:
dates = c('2015-08-07', '2015-08-08','2015-08-09','2015-09-09','2015-10-10')
value = c(5,10,5,3,2)
df=data.frame(dates, value)
df$desired = c(NA,5,7.5, NA,NA)
I am obviously looking to do this for much larger dataset, but I hope you get the idea. If I was to use 7 days for example this is the result I would expect.
Notice that I don't include the current observations value into the rolling average, only the previous. I want rolling average by time period, not observation row number.
I tried looking at rollmean and dplyr but I couldnt figure it out. I don't really care how it happens though.
Thanks!
try this:
rollavgbyperiod <- function(i,window){
startdate <- dates[i]-window
enddate <- dates[i]-1
interval <- seq(startdate,enddate,1)
tmp <- value[dates %in% interval]
return(mean(tmp))
}
dates <- as.Date(dates)
window <- 7
res <- sapply(1:length(dates),function(m) rollavgbyperiod(m,window))
res[is.nan(res)] <- NA
> data.frame(dates,value,res)
dates value res
1 2015-08-07 5 NA
2 2015-08-08 10 5.0
3 2015-08-09 5 7.5
4 2015-09-09 3 NA
5 2015-10-10 2 NA
I suggest using runner package in this case. What is needed here is mean_run with k = 7 window, lagged by 1 period. Simple one-liner:
library(runner)
dates = c('2015-08-07', '2015-08-08','2015-08-09','2015-09-09','2015-10-10')
value = c(5, 10, 5, 3, 2)
mean_run(x = value, k = 7, lag = 1, idx = as.Date(dates))
#[1] NA 5.0 7.5 NA NA
Check package and function documentation
I need to convert date (m/d/y format) into 3 separate columns on which I hope to run an algorithm.(I'm trying to convert my dates into Julian Day Numbers). Saw this suggestion for another user for separating data out into multiple columns using Oracle. I'm using R and am throughly stuck about how to code this appropriately. Would A1,A2...represent my new column headings, and what would the format difference be with the "update set" section?
update <tablename> set A1 = substr(ORIG, 1, 4),
A2 = substr(ORIG, 5, 6),
A3 = substr(ORIG, 11, 6),
A4 = substr(ORIG, 17, 5);
I'm trying hard to improve my skills in R but cannot figure this one...any help is much appreciated. Thanks in advance... :)
I use the format() method for Date objects to pull apart dates in R. Using Dirk's datetext, here is how I would go about breaking up a date into its constituent parts:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
datetxt <- as.Date(datetxt)
df <- data.frame(date = datetxt,
year = as.numeric(format(datetxt, format = "%Y")),
month = as.numeric(format(datetxt, format = "%m")),
day = as.numeric(format(datetxt, format = "%d")))
Which gives:
> df
date year month day
1 2010-01-02 2010 1 2
2 2010-02-03 2010 2 3
3 2010-09-10 2010 9 10
Note what several others have said; you can get the Julian dates without splitting out the various date components. I added this answer to show how you could do the breaking apart if you needed it for something else.
Given a text variable x, like this:
> x
[1] "10/3/2001"
then:
> as.Date(x,"%m/%d/%Y")
[1] "2001-10-03"
converts it to a date object. Then, if you need it:
> julian(as.Date(x,"%m/%d/%Y"))
[1] 11598
attr(,"origin")
[1] "1970-01-01"
gives you a Julian date (relative to 1970-01-01).
Don't try the substring thing...
See help(as.Date) for more.
Quick ones:
Julian date converters already exist in base R, see eg help(julian).
One approach may be to parse the date as a POSIXlt and to then read off the components. Other date / time classes and packages will work too but there is something to be said for base R.
Parsing dates as string is almost always a bad approach.
Here is an example:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
dates <- as.Date(datetxt) ## you could examine these as well
plt <- as.POSIXlt(dates) ## now as POSIXlt types
plt[["year"]] + 1900 ## years are with offset 1900
#[1] 2010 2010 2010
plt[["mon"]] + 1 ## and months are on the 0 .. 11 intervasl
#[1] 1 2 9
plt[["mday"]]
#[1] 2 3 10
df <- data.frame(year=plt[["year"]] + 1900,
month=plt[["mon"]] + 1, day=plt[["mday"]])
df
# year month day
#1 2010 1 2
#2 2010 2 3
#3 2010 9 10
And of course
julian(dates)
#[1] 14611 14643 14862
#attr(,"origin")
#[1] "1970-01-01"
To convert date (m/d/y format) into 3 separate columns,consider the df,
df <- data.frame(date = c("01-02-18", "02-20-18", "03-23-18"))
df
date
1 01-02-18
2 02-20-18
3 03-23-18
Convert to date format
df$date <- as.Date(df$date, format="%m-%d-%y")
df
date
1 2018-01-02
2 2018-02-20
3 2018-03-23
To get three seperate columns with year, month and date,
library(lubridate)
df$year <- year(ymd(df$date))
df$month <- month(ymd(df$date))
df$day <- day(ymd(df$date))
df
date year month day
1 2018-01-02 2018 1 2
2 2018-02-20 2018 2 20
3 2018-03-23 2018 3 23
Hope this helps.
Hi Gavin: another way [using your idea] is:
The data-frame we will use is oilstocks which contains a variety of variables related to the changes over time of the oil and gas stocks.
The variables are:
colnames(stocks)
"bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC"
"emMN" "emMN.1" "chdate" "chV" "cbO" "chC" "chMN" "chMX"
One of the first things to do is change the emdate field, which is an integer vector, into a date vector.
realdate<-as.Date(emdate,format="%m/%d/%Y")
Next we want to split emdate column into three separate columns representing month, day and year using the idea supplied by you.
> dfdate <- data.frame(date=realdate)
year=as.numeric (format(realdate,"%Y"))
month=as.numeric (format(realdate,"%m"))
day=as.numeric (format(realdate,"%d"))
ls() will include the individual vectors, day, month, year and dfdate.
Now merge the dfdate, day, month, year into the original data-frame [stocks].
ostocks<-cbind(dfdate,day,month,year,stocks)
colnames(ostocks)
"date" "day" "month" "year" "bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC" "emMN" "emMX" "chdate" "chV"
"cbO" "chC" "chMN" "chMX"
Similar results and I also have date, day, month, year as separate vectors outside of the df.