I want to build a forecasting project from a time series data frame, but the time span is too large.
The data frame has this date column:
Date
2010-06-29
2010-06-30
2010-07-01
2010-07-02
How can I change it so that it only shows every 7 days?
Date
2010-06-29
2010-07-05
2010-07-12
2010-07-19
etc
dataframe.new = dataframe[seq(1, nrow(dataframe), 7),]
seq documentation: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/seq
Basically, seq(1, 100, 7) generates 1, 8, 15, ..., so the line above keeps rows 1, 8, 15, and so on.
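A minimal sketch of the row-index approach on toy data (the `daily` data frame and its `value` column are made up for illustration). The row-based version assumes one row per calendar day with no gaps; the second variant selects by date arithmetic instead, which may be safer if some days are missing.

```r
# toy daily series; `daily` and `value` are hypothetical names
daily <- data.frame(
  Date  = seq(as.Date("2010-06-29"), by = "day", length.out = 28),
  value = 1:28
)

# row positions 1, 8, 15, 22
idx <- seq(1, nrow(daily), by = 7)

# keep every 7th row
weekly <- daily[idx, ]

# variant based on the dates themselves: keep rows whose distance
# from the first date is a multiple of 7 days
days_since_start <- as.integer(daily$Date - daily$Date[1])
weekly_by_date <- daily[days_since_start %% 7 == 0, ]

print(weekly$Date)
```

On a gap-free series both variants select the same rows.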
Daniel's answer is very simple and direct.
However, it will return only data from a specified weekday, which could lead to biased results depending on the nature of your data.
You can create an index of weekdays that is balanced with random sampling of weekdays:
library(dplyr)
library(lubridate)
# example data, with the weekday name stored in its own column
df <- data.frame(date = seq.Date(from = ymd("2021/01/01"),
                                 to = ymd("2021/12/31"),
                                 by = "day"))
df$weekday <- weekdays(df$date)
# create index by sampling weekdays randomly (one random permutation per week)
set.seed(1)
index <- replicate(floor(nrow(df)/7), {sample(unique(df$weekday), replace = FALSE)}) %>%
  as.vector()
# recycle the index explicitly so it has one entry per row
index <- rep_len(index, nrow(df))
# subsetting to a roughly 7-fold smaller dataset
output <- df %>% filter(weekday == index)
# checking table of weekdays in the final dataset
table(output$weekday)
   Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday
       13         6         5         9         8        10         6
I have measured temperature at a high time resolution of 10 minutes on different urban tree species, whose reactions should be compared. I am therefore especially interested in periods of heat. The task I fail to do on my dataset is selecting complete days based on a maximum value. E.g., days with at least one measurement above 30 °C should be subsetted from my data frame in full.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether each individual measurement is above or below 30 °C. I want to use that column to tell other functions whether they should pick a day or not when producing a new data frame. If the value is above 30 °C at any time of the day, I want to include that whole date, from 00:00 to 23:59, in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example yields a data frame analogous in structure to my data:
head(Measurings)
                 Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00   28        56       377 normal             0
2 2018-05-18 01:00:00   23        65       408 normal             0
3 2018-05-18 02:00:00   29        78       324 normal             0
4 2018-05-18 03:00:00   24       157       432 normal             0
5 2018-05-18 04:00:00   32       129       794   heat             1
6 2018-05-18 05:00:00   25        27       574 normal             0
So how do I subset to get a new data frame that contains every day on which at least one entry is marked as "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how can I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions on a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that holds only the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
library(dplyr)
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
newdf <- lapply(unique(Measurings$Time2), function(x){
  rr <- Measurings %>% filter(Time2 == x) # all rows of date x
  # keep the day only if its heat30 vector contains "heat" at least once;
  # days returning NULL are dropped by bind_rows()
  if(any(rr$heat30 == "heat")) rr else NULL
}) %>% bind_rows()
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
# (format() is called explicitly so every stamp has a date part and a time part)
time_df <- as.data.frame(str_split(format(Measurings$Time, "%Y-%m-%d %H:%M:%S"), " ", simplify = TRUE))
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by keeping every day that has at least one row marked as heat
new_df <- new_measurings_df %>%
  filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 hourly observations, with temperatures drawn uniformly from 20 to 35, across about 42 days. Each observation exceeds 30 with probability 5/16, so the chance that a full day of 24 observations contains none is (11/16)^24, roughly 0.0001. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
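For comparison, here is a minimal base R sketch of the same idea on a small deterministic data set, so that one day is actually filtered out. The `toy` data frame and its values are made up for illustration, and it assumes that "day" means the calendar day in the time zone stored with the timestamps.

```r
# small deterministic example; `toy` and its values are hypothetical
toy <- data.frame(
  Time = seq(as.POSIXct("2018-05-18 00:00", tz = "CET"),
             by = "6 hours", length.out = 12),
  Temp = c(25, 28, 32, 27,   # 2018-05-18: one reading above 30
           22, 24, 26, 23,   # 2018-05-19: never above 30
           29, 31, 33, 28)   # 2018-05-20: two readings above 30
)

# calendar day of each measurement, as a character key
toy$Day <- format(toy$Time, "%Y-%m-%d")

# for every row: TRUE if any reading on the same day exceeds 30
toy$hot_day <- ave(toy$Temp > 30, toy$Day, FUN = any)

# keep complete days that contain at least one such reading
heat_days <- toy[toy$hot_day, ]
print(unique(heat_days$Day))
```

With dplyr, the same selection can likely be written as group_by(Day) followed by filter(any(Temp > 30)), which avoids the explicit loop over dates.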
I have a huge data frame of about 10 million rows in the following format in RStudio.
ID DATE reading
100845 2014-08-17 0,0,0,0,3,0,0,0,0,1,1,0,0,0,0,0,2,0,0,1,0,0,2,0
100845 2014-08-18 0,0,4,0,0,0,0,1,0,0,0,0,1,1,1,1,0,1,1,2,1,1,0,1
100845 2014-08-19 0,1,0,1,0,1,1,1,2,0,1,0,1,0,1,0,1,0,1,0,2,1,1,0
100918 2015-07-02 1,0,0,1,0,1,3,1,1,0,1,0,1,0,1,0,0,1,0,1,0,0,1,1
100920 2013-02-07 0,1,0,0,1,0,1,1,1,0,0,1,0,0,1,0,0,5,6,4,2,1,0,1
100920 2013-02-08 0,1,0,0,1,3,5,4,2,1,0,1,0,1,0,0,1,3,7,5,1,1,1,0
The 24 readings per row are hourly meter readings for one day. I would like to convert the daily dates to hourly timestamps and turn each string of readings into a column, with the ID repeated on every hourly row.
For example, I have implemented the following:
hourly <- data.frame(Hourly = seq(min(as.POSIXct(paste(df$DATE, "00:00"), tz = "")), max(as.POSIXct(paste(df$DATE, "23:00"), tz = "")), by = "hour"))
How can I fill the new hourly rows with the same IDs as in the daily format? As my full dataset is extremely big, I would appreciate a solution that runs very fast.
I can't speak to the speed of this approach on a data set as large as yours, but I think this code does the steps you want:
library(dplyr)
library(tidyr)
df2 <- df %>%
  # use separate to spread the readings across separate columns
  separate(reading, into = paste0("hour.", seq(24)), sep = ",") %>%
  # use gather to convert that wide data frame into a long one
  gather(key = hour, value = reading, hour.1:hour.24) %>%
  # make the hour marker into a number
  mutate(hour = as.numeric(gsub("hour.", "", hour)) - 1) %>%
  # order the data
  arrange(ID, DATE, hour) %>%
  # create a new column that combines the date and time stamp
  mutate(datetime = as.POSIXct(paste(DATE, hour), format = "%Y-%m-%d %H")) %>%
  # shed unneeded columns
  select(ID, datetime, reading)
Result:
> head(df2)
      ID            datetime reading
1 100845 2014-08-17 00:00:00       0
2 100845 2014-08-17 01:00:00       0
3 100845 2014-08-17 02:00:00       0
4 100845 2014-08-17 03:00:00       0
5 100845 2014-08-17 04:00:00       3
6 100845 2014-08-17 05:00:00       0
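Since speed was part of the question, here is a base R sketch of the same reshaping that works on whole vectors instead of 24 intermediate columns. It is only a sketch: the toy `daily` data frame uses 3 readings per row to stay short, the timestamps are assumed to be in UTC (so no daylight saving shifts), and I have not benchmarked it on 10 million rows.

```r
# toy input in the question's layout; only 3 readings per row to keep it short
daily <- data.frame(
  ID      = c(100845, 100845, 100920),
  DATE    = c("2014-08-17", "2014-08-18", "2013-02-07"),
  reading = c("0,3,1", "4,0,2", "1,0,5"),
  stringsAsFactors = FALSE
)

# split each string into its individual readings
parts <- strsplit(daily$reading, ",", fixed = TRUE)
n     <- lengths(parts)          # number of readings per row

# hour index 0, 1, 2, ... within each original row
hour <- sequence(n) - 1

long <- data.frame(
  ID       = rep(daily$ID, times = n),
  # midnight of each date (UTC assumed) plus the hour offset in seconds
  datetime = as.POSIXct(rep(daily$DATE, times = n), tz = "UTC") + 3600 * hour,
  reading  = as.integer(unlist(parts))
)
print(long)
```

Unlike the dplyr/tidyr version, this returns reading as an integer column rather than character.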
I'm relatively new to R but I am very familiar with Excel and T-SQL.
I have a simple dataset that has a date with time and a numeric value associated with it. What I'd like to do is summarize the numeric values by hour of the day. I've found a couple of resources for working with time types in R, but I was hoping to find a solution similar to what Excel offers (where I can call a function, pass in my date/time data, and have it return the hour of the day).
Any suggestions would be appreciated - thanks!
library(readr)
library(dplyr)
library(lubridate)
df <- read_delim('DateTime|Value
3/14/2015 12:00:00|23
3/14/2015 13:00:00|24
3/15/2015 12:00:00|22
3/15/2015 13:00:00|40',"|")
df %>%
  mutate(hour_of_day = hour(as.POSIXct(strptime(DateTime, "%m/%d/%Y %H:%M:%S")))) %>%
  group_by(hour_of_day) %>%
  summarise(meanValue = mean(Value))
Breakdown:
Convert the DateTime column (character) into a date-time, then use hour() from lubridate to pull out just the hour value and put it into a new column named hour_of_day.
> df %>%
mutate(hour_of_day = hour(as.POSIXct(strptime(DateTime, "%m/%d/%Y %H:%M:%S"))))
Source: local data frame [4 x 3]
            DateTime Value hour_of_day
1 3/14/2015 12:00:00    23          12
2 3/14/2015 13:00:00    24          13
3 3/15/2015 12:00:00    22          12
4 3/15/2015 13:00:00    40          13
The group_by(hour_of_day) call sets the groups over which mean(Value) is computed in the summarise(...) call.
This gives the result:
  hour_of_day meanValue
1          12      22.5
2          13      32.0
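If you would rather avoid extra packages, a base R sketch of the same summary is below. It reuses the four example rows; the UTC time zone is an assumption made only so the parsing does not depend on the local machine.

```r
# the same four rows as in the example
df <- data.frame(
  DateTime = c("3/14/2015 12:00:00", "3/14/2015 13:00:00",
               "3/15/2015 12:00:00", "3/15/2015 13:00:00"),
  Value = c(23, 24, 22, 40)
)

# parse the text into date-times (UTC assumed), then extract the hour
parsed <- as.POSIXct(df$DateTime, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
df$hour_of_day <- as.integer(format(parsed, "%H"))

# mean of Value within each hour of the day
by_hour <- aggregate(Value ~ hour_of_day, data = df, FUN = mean)
print(by_hour)
```

Here format(parsed, "%H") plays the role of Excel's HOUR(): it returns the hour as text, which as.integer() turns into a number.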
I'm using R's ff package with an ffdf object named MyData (dim=c(10819740,16)). I'm trying to split the Date variable into Day, Month and Year and add these 3 variables to the existing ffdf MyData.
For instance, my date column is named SalesReportDate, with VirtualVmode and PhysicalVmode = double after I converted it with as.Date(, format="%m/%d/%Y").
Example of SalesReportDate are as follow:
> B
SalesReportDate
1 2013-02-01
2 2013-05-02
3 2013-05-04
4 2013-10-06
5 2013-15-10
6 2013-11-01
7 2013-11-03
8 2013-30-02
9 2013-12-12
10 2014-01-01
I've referred to "Split date into different columns for year, month and day" and tried to apply it, but I keep getting errors and warnings.
So, is there any way for me to do this? Thanks in advance.
Credit to @jwijffels for this great solution:
require(ffbase)
MyData$SalesReportDateYear <- with(MyData["SalesReportDate"], format(SalesReportDate, "%Y"), by = 250000)
MyData$SalesReportDateMonth <- with(MyData["SalesReportDate"], format(SalesReportDate, "%m"), by = 250000)
MyData$SalesReportDateDay <- with(MyData["SalesReportDate"], format(SalesReportDate, "%d"), by = 250000)
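To show what the format() calls above produce, here is a small sketch on an ordinary in-memory data frame rather than an ffdf (the three dates are made up, and the short column names Year, Month and Day are chosen only for illustration). format() returns character strings, so as.integer() is added here in case numeric columns are wanted.

```r
# plain data.frame stand-in for the ffdf; values are hypothetical
B <- data.frame(
  SalesReportDate = as.Date(c("2013-02-01", "2013-05-02", "2014-01-01"))
)

# format() yields text such as "2013" or "02"; as.integer() makes it numeric
B$Year  <- as.integer(format(B$SalesReportDate, "%Y"))
B$Month <- as.integer(format(B$SalesReportDate, "%m"))
B$Day   <- as.integer(format(B$SalesReportDate, "%d"))
print(B)
```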