I have a data frame where one column is a date-time (chron). I would like to split this data frame into a list of data frames, split by the date part only, so that each data frame holds all the data for one day. I looked at the split function, but I am not sure how to split on only part of a column value.
Say you have this data.frame:
df <- data.frame(date=rep(seq.POSIXt(as.POSIXct("2010-01-01 15:26"), by="day", length.out=3), each=3), var=rnorm(9))
> df
date var
1 2010-01-01 15:26:00 -0.02814237
2 2010-01-01 15:26:00 -0.26924825
3 2010-01-01 15:26:00 -0.57968310
4 2010-01-02 15:26:00 0.88089757
5 2010-01-02 15:26:00 -0.79954092
6 2010-01-02 15:26:00 1.87145778
7 2010-01-03 15:26:00 0.93234835
8 2010-01-03 15:26:00 1.29130038
9 2010-01-03 15:26:00 -1.09841234
To split by day you just need:
> split(df, as.Date(df$date))
$`2010-01-01`
date var
1 2010-01-01 15:26:00 -0.02814237
2 2010-01-01 15:26:00 -0.26924825
3 2010-01-01 15:26:00 -0.57968310
$`2010-01-02`
date var
4 2010-01-02 15:26:00 0.8808976
5 2010-01-02 15:26:00 -0.7995409
6 2010-01-02 15:26:00 1.8714578
$`2010-01-03`
date var
7 2010-01-03 15:26:00 0.9323484
8 2010-01-03 15:26:00 1.2913004
9 2010-01-03 15:26:00 -1.0984123
EDIT:
The above method works with chron date-time objects too:
library(chron)
x <- chron(dates = "02/27/92", times = "22:29:56")
> x
[1] (02/27/92 22:29:56)
> as.Date(x)
[1] "1992-02-27"
EDIT 2
Making sure that as.Date doesn't shift your dates is crucial. See here:
# I'm using "DSTday" to make a sequence that steps by one entire _apparent_ day
x <- rep(seq.POSIXt(as.POSIXct("2010-03-27 00:31"), by="DSTday", length.out=3))
> x
[1] "2010-03-27 00:31:00 GMT" "2010-03-28 00:31:00 GMT" "2010-03-29 00:31:00 BST"
> as.Date(x)
[1] "2010-03-27" "2010-03-28" "2010-03-28"
The third item is in summer time (BST), and as.Date returns the date of the underlying UTC time, i.e. one hour earlier, which is still the previous day. To avoid this:
> as.Date(cut(x, "DSTday"))
[1] "2010-03-27" "2010-03-28" "2010-03-29"
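Another way to sidestep the shift is to tell as.Date which time zone to use. This is only a sketch: it assumes the timestamps are in the "Europe/London" zone, and the three-element vector and the toy data frame are made up for illustration.

```r
# Three local timestamps around the 2010 change to summer time (illustrative data).
x <- as.POSIXct(c("2010-03-27 00:31", "2010-03-28 00:31", "2010-03-29 00:31"),
                tz = "Europe/London")

# Dates of the underlying UTC instants: the third one lands on the previous day.
utc_dates <- as.Date(x, tz = "UTC")

# Dates as read off the local clock.
local_dates <- as.Date(x, tz = "Europe/London")

# The local dates can be used directly as the grouping key for split().
df <- data.frame(date = x, var = c(1, 2, 3))
by_day <- split(df, local_dates)
names(by_day)
```

Passing tz explicitly keeps the grouping independent of the session's default time zone.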
The trick is to create a vector that tells R how to split the data. So in your example we have a data frame:
dd = data.frame(x = runif(100),data= paste0(1:4, "/05/13"))
##This step will depend on your data structure
dd$date = strptime(dd$data, "%d/%m/%y")
Note that I've made the date column have classes POSIXlt and POSIXt. This allows easy manipulation of dates.
Next I'll create the variable I'm going to split on, split_date. Basically, I subtract the minimum date from all other dates (asking for the difference in seconds explicitly) and divide by the number of seconds in a day:
split_date = as.numeric(difftime(dd$date, min(dd$date), units = "secs"))/86400
Since this will result in fractions, I'll round down to the nearest day:
split_date = floor(split_date)
Now I use the split function in the standard way:
split_by_day = split(dd, split_date)
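If the day offset itself is not needed, the formatted calendar day can serve as the split vector. The sketch below is a variation on the steps above, not the same method: it parses the strings as POSIXct in UTC (an assumption made so the result is reproducible) and groups on the "%Y-%m-%d" string.

```r
set.seed(1)
dd <- data.frame(x = runif(100), data = paste0(1:4, "/05/13"),
                 stringsAsFactors = FALSE)

# Parse the day/month/year strings; UTC is assumed here for reproducibility.
dd$date <- as.POSIXct(dd$data, format = "%d/%m/%y", tz = "UTC")

# Use the calendar day as the grouping key.
split_by_day <- split(dd, format(dd$date, "%Y-%m-%d"))

# One data frame per day, each holding that day's rows.
rows_per_day <- sapply(split_by_day, nrow)
rows_per_day
```

The list names are then the dates themselves, which makes it easy to pull out a single day with split_by_day[["2013-05-02"]].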
I have the data frame below (df) from ENTSO-E showing German power prices. I created the "Hour" column with the lubridate function hour(df$date). The output was in the range (1, 2, ..., 23, 0).
# to replace 0 with 24
df["Hour"][df["Hour"]=="0"]<- "24"
I need to work on an hourly basis, so I filtered each hour from 1 to 24, but I cannot filter the replaced hour, H24.
H1 <- df %>%
filter(Hour==1)
H24 <- df %>%
filter(Hour==24)
Error in match.fun(FUN) : object 'Hour' not found
The 24 values are still in the Hour column and its class is numeric, but I cannot do any calculation with the Hour column.
class(df$Hour)
[1] "numeric"
mean(german_last_4$Hour)
[1] NA
I think the problem is with the replacement. Is there any other way to produce a result that works with H24?
date price Hour
2019-01-01 01:00:00 28.32 1
2019-01-01 02:00:00 10.07 2
2019-01-01 03:00:00 -4.08 3
2019-01-01 04:00:00 -9.91 4
2019-01-01 05:00:00 -7.41 5
2019-01-01 06:00:00 -12.55 6
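One thing worth checking, offered as a guess rather than a diagnosis: assigning the string "24" into a numeric column turns the whole column into character, after which mean() returns NA. A minimal sketch of a numeric replacement follows, using base R only and made-up prices (the data frame here is illustrative, not the ENTSO-E data).

```r
# Illustrative hourly data covering two days; prices are invented.
df <- data.frame(
  date  = as.POSIXct("2019-01-01 00:00:00", tz = "UTC") + 3600 * 0:47,
  price = seq(10, by = 0.5, length.out = 48)
)

# Hour of day as an integer, 0..23 (base R stand-in for lubridate::hour()).
df$Hour <- as.integer(format(df$date, "%H"))

# Replace 0 with 24 using a number, not a string, so the column stays numeric.
df$Hour[df$Hour == 0] <- 24L

# Subsetting and arithmetic on Hour now work as expected.
H24 <- df[df$Hour == 24, ]
class(df$Hour)
mean(df$Hour)
```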
Is there a way within R to write a function that would split subsets (for example by date) into their own data frames? For example, I have 30 days' worth of data, and I want to break it down into individual days and output each day into a new individual data frame. I can't figure out how to do it in a function. Any clues?
Example:
Dataframe: df_of_month
Output desired via a loop function of sorts:
df_of_month_day1
df_of_month_day2
df_of_month_day3
df_of_month_day4
df_of_month_day5
df_of_month_day6
etc. I've been looking for multiple ways and it's not working.
To give you an answer to your question, you would achieve this with lapply. For instance, consider the following:
Create some sample data:
df <- data.frame(Day = rep(seq.Date(from = as.Date('2010-01-01'), to = as.Date('2010-01-30'), by =1), 5))
df$somevar <- rnorm(nrow(df))
head(df)
Day somevar
1 2010-01-01 -0.946059466
2 2010-01-02 0.005897001
3 2010-01-03 -0.297566286
4 2010-01-04 -0.637562495
5 2010-01-05 -0.549800912
6 2010-01-06 0.287709994
Now, observe that unique can give you a vector with all unique dates:
unique(df$Day)
[1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" "2010-01-08" "2010-01-09" "2010-01-10"
[11] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15" "2010-01-16" "2010-01-17" "2010-01-18" "2010-01-19" "2010-01-20"
[21] "2010-01-21" "2010-01-22" "2010-01-23" "2010-01-24" "2010-01-25" "2010-01-26" "2010-01-27" "2010-01-28" "2010-01-29" "2010-01-30"
This you can pass to lapply to be used for subsetting:
lapply(unique(df$Day), function(x) df[df[,"Day"]==x,])
[[1]]
Day somevar
1 2010-01-01 -0.9460595
31 2010-01-01 -0.3434005
61 2010-01-01 -1.5463641
91 2010-01-01 -0.5192375
121 2010-01-01 -1.1780619
[[2]]
Day somevar
2 2010-01-02 0.005897001
32 2010-01-02 -1.346336688
62 2010-01-02 -0.321702391
92 2010-01-02 -0.384277955
122 2010-01-02 0.058906305
... (output omitted)
where the output of lapply is a list with the corresponding dataframes.
Needless to say, you would assign this to a name to capture all data frames in a list, as in mylist <- lapply(...). However, if you want to have them in your global environment, you can first give each data frame a name, for instance using setNames as in mylist <- setNames(mylist, paste0("df", format(unique(df$Day), format = "%Y%m%d"))), and then use list2env(mylist, envir = .GlobalEnv) to push each list element into the global environment.
However, as mentioned in the comments, this is probably not a good idea. If you want to do something to each date, consider the group-by solution with dplyr: For instance, imagine you want to get the mean by date:
library(dplyr)
df %>% group_by(Day) %>% summarize(mean_var = mean(somevar))
# A tibble: 30 x 2
Day mean_var
<date> <dbl>
1 2010-01-01 -0.907
2 2010-01-02 -0.398
3 2010-01-03 0.213
4 2010-01-04 -0.142
5 2010-01-05 -0.377
6 2010-01-06 0.404
7 2010-01-07 -0.634
8 2010-01-08 1.00
9 2010-01-09 0.378
10 2010-01-10 -0.0863
# ... with 20 more rows
where each row corresponds to the group-wise mean. This is called split-apply-combine and is worth googling. It will come up again and again.
Just for reference, in base R, you could achieve this using e.g. by, as in
by(df$somevar, df$Day, FUN = mean)
though either dplyr or data.table are probably more user-friendly.
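For completeness, the list of per-day data frames can also be built with split() in base R, and the same list then feeds the "apply" step. This is a small sketch on sample data like the above; the names days and daily_means are just illustrative.

```r
set.seed(42)
df <- data.frame(Day = rep(seq(as.Date("2010-01-01"), as.Date("2010-01-30"),
                               by = "day"), 5))
df$somevar <- rnorm(nrow(df))

# Split: one data frame per calendar day, named by the date.
days <- split(df, df$Day)

# Apply and combine: the mean of somevar within each day.
daily_means <- sapply(days, function(d) mean(d$somevar))
head(daily_means)
```

A single day is then available as days[["2010-01-02"]], without creating 30 separate objects.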
I have two data frames: A
y_m_d SNOW
1 2010-01-01 0.0
2 2010-01-02 0.0
3 2010-01-03 0.1
4 2010-01-04 0.0
5 2010-01-05 0.0
6 2010-01-06 2.3
B:
time temp
1 2010-01-01 00:00:00 20.00000
2 2010-01-01 01:00:00 18.33333
3 2010-01-01 02:00:00 17.00000
4 2010-01-01 03:00:00 25.33333
5 2010-01-01 04:00:00 23.33333
I want to combine the two data frames based on time. A is a daily record and B is an hourly record. I want to put each value of A at the beginning of its day, at 00:00:00, and leave the rest of the day blank.
The result should look like this:
time temp SNOW
1 2010-01-01 00:00:00 20.00000 0.0
2 2010-01-01 01:00:00 18.33333
3 2010-01-01 02:00:00 17.00000
4 2010-01-01 03:00:00 25.33333
5 2010-01-01 04:00:00 23.33333
6 2010-01-01 05:00:00 22.66667
Could you please give me some advice?
Thank you.
Here's a quick solution:
A$y_m_d <- as.Date(A$y_m_d)
B$SNOW <- sapply(as.Date(B$time), function(x) A[A$y_m_d==x, "SNOW"])
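Note that the two lines above put the daily value on every hour of the day. If the value should appear only on the 00:00:00 row, one possible variation is to look up the day with match() and keep the result only at midnight. The sketch below uses small made-up versions of A and B and assumes the times are in UTC.

```r
# Illustrative daily and hourly data (UTC assumed).
A <- data.frame(y_m_d = as.Date(c("2010-01-01", "2010-01-02")),
                SNOW  = c(0, 0.1))
B <- data.frame(
  time = as.POSIXct(c("2010-01-01 00:00:00", "2010-01-01 01:00:00",
                      "2010-01-01 02:00:00", "2010-01-02 00:00:00",
                      "2010-01-02 01:00:00", "2010-01-02 02:00:00"),
                    tz = "UTC"),
  temp = c(20, 18, 17, 25, 23, 22)
)

# Row of A that corresponds to each hourly observation.
idx <- match(as.Date(B$time, tz = "UTC"), A$y_m_d)

# Keep the daily value only on the row stamped exactly at midnight.
is_midnight <- format(B$time, "%H:%M:%S") == "00:00:00"
B$SNOW <- ifelse(is_midnight, A$SNOW[idx], NA)
B
```

NA is used for the blank cells so that SNOW stays numeric.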
This might not be the most efficient way in the world to do this, but it is a solution. I attempted to create data with the exact same variable types and structure as you.
# Create example data
y_m_d <- as.POSIXct(c("2010-01-01", "2010-01-02"), format="%Y-%m-%d")
SNOW <- c(0, 0.1)
time <- as.POSIXct(c("2010-01-01 00:00:00", "2010-01-01 01:00:00", "2010-01-01 02:00:00", "2010-01-02 00:00:00", "2010-01-02 01:00:00", "2010-01-02 02:00:00"), format="%Y-%m-%d %H:%M:%S")
temp <- rnorm(6, mean=20, sd=4)
A <- data.frame(y_m_d, SNOW)
B <- data.frame(time, temp)
# Check data
A
## y_m_d SNOW
## 1 2010-01-01 0.0
## 2 2010-01-02 0.1
B
## time temp
## 1 2010-01-01 00:00:00 17.52852
## 2 2010-01-01 01:00:00 12.42715
## 3 2010-01-01 02:00:00 21.79584
## 4 2010-01-02 00:00:00 19.90442
## 5 2010-01-02 01:00:00 16.40524
## 6 2010-01-02 02:00:00 16.86854
# Loop through days and construct new SNOW variable
days <- as.POSIXct(format(B$time, "%Y-%m-%d"), format="%Y-%m-%d")
SNOW_new <- c()
for (i in 1:nrow(A)) {
# Append this day's value, followed by NA for the remaining hours of the day
SNOW_new <- c(SNOW_new, A[i, "SNOW"], rep(NA, sum(days==A[i, "y_m_d"])-1))
}
# Create new data frame
C <- data.frame(B, SNOW_new)
## time temp SNOW_new
## 1 2010-01-01 00:00:00 17.52852 0.0
## 2 2010-01-01 01:00:00 12.42715 NA
## 3 2010-01-01 02:00:00 21.79584 NA
## 4 2010-01-02 00:00:00 19.90442 0.1
## 5 2010-01-02 01:00:00 16.40524 NA
## 6 2010-01-02 02:00:00 16.86854 NA
I put NA rather than a blank space because I assume you want the SNOW_new variable to be numeric, not character. But if you do want a blank space, you can just replace the NA in the rep function with a "".
First, make sure the time variables are in the right format:
A$y_m_d <- as.POSIXct(A$y_m_d, format="%Y-%m-%d")
B$time <- as.POSIXct(B$time, format="%Y-%m-%d %H:%M:%S")
The xts package is well suited to merging time series data:
#install.packages("xts")
library(xts)
A <- xts(A[,-1], order.by = A$y_m_d)
B <- xts(B[,-1], order.by = B$time)
merge(A, B)
My problem is as follows: I've got a time series with 5-Minute precipitation data like:
Datum mm
1 2004-04-08 00:05:00 NA
2 2004-04-08 00:10:00 NA
3 2004-04-08 00:15:00 NA
4 2004-04-08 00:20:00 NA
5 2004-04-08 00:25:00 NA
6 2004-04-08 00:30:00 NA
With this structure:
'data.frame': 1098144 obs. of 2 variables:
$ Datum: POSIXlt, format: "2004-04-08 00:05:00" "2004-04-08 00:10:00" "2004-04-08 00:15:00" "2004-04-08 00:20:00" ...
$ mm : num NA NA NA NA NA NA NA NA NA NA ...
As you can see, the time series begins with a lot of NA's, but there is measured precipitation further down, although riddled with single, less common NA's due to malfunction of the measuring station.
What I'm trying to achieve, is summing up the measured precipitation to hourly sums, not considering NA's.
This is what I tried so far:
sums <- aggregate(precip["mm"],
list(cut(precip$Datum, "1 hour")), sum)
Even though the timestamps are correctly aggregated to hours, all sums are 0 or NA. The sums are not even calculated if there is no NA at all.
One more thing to take into account:
Hourly precipitation sums in meteorology always describe the cumulative sum until a certain hour: The amount of precipitation at 0:00 o'clock describes the sum from 23:00 the previous day until 0:00. So I always need to sum up the previous hour.
Reproducible Example
set.seed(1120)
s <- as.POSIXlt("2004-03-08 23:00:00")
r <- seq(s, s+1e4, "30 min")
precip <- data.frame(Datum=r, mm=sample(c(1:5,NA), 6, T))
Datum mm
2004-03-08 23:00:00 4
2004-03-08 23:30:00 1
2004-03-09 00:00:00 2
2004-03-09 00:30:00 4
2004-03-09 01:00:00 1
2004-03-09 01:30:00 4
With the above example, the result I am looking for is:
Datum mm
2004-03-09 00:00:00 5
2004-03-09 01:00:00 6
2004-03-09 02:00:00 5
Try adding na.rm=TRUE:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour")), sum, na.rm=TRUE)
# Group.1 mm
# 1 2004-04-08 00:00:00 26
# 2 2004-04-08 01:00:00 35
# 3 2004-04-08 02:00:00 25
Reproducible Example
set.seed(1120)
s <- as.POSIXlt("2004-04-08 00:05:00")
r <- seq(s, s+1e4, "5 min")
precip <- data.frame(Datum=r, mm=sample(c(1:5,NA), 34, T))
addendum
To your second question: if you would like measurements taken exactly on the hour to be counted with the preceding hour, add right=TRUE:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour", right=TRUE)), sum, na.rm=TRUE)
Further Explanation
Here is a smaller, more detailed example to show how the solution works:
p <- c("2004-04-07 23:48:20", "2004-04-08 00:00:00", "2004-04-08 00:03:20")
ptime <- as.POSIXlt(p)
#[1] "2004-04-07 23:48:20 EDT" "2004-04-08 00:00:00 EDT" "2004-04-08 00:03:20 EDT"
We have three dates to separate into groups. If we use cut without any extra arguments, the second entry "2004-04-08 00:00:00 EDT" will be grouped with the third entry for hour "00:00":
cut(ptime, "1 hour")
#[1] 2004-04-07 23:00:00 2004-04-08 00:00:00 2004-04-08 00:00:00
But if we add the argument right=TRUE we can group it with the "23:00" hour:
cut(ptime, "1 hour", right=TRUE)
#[1] 2004-04-07 23:00:00 2004-04-07 23:00:00 2004-04-08 00:00:00
In this way we can specify the behavior of edge cases.
edit
With your new data the original solution produces the desired output:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour")), sum, na.rm=TRUE)
Group.1 mm
1 2004-03-08 23:00:00 5
2 2004-03-09 00:00:00 6
3 2004-03-09 01:00:00 5
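The output above labels each sum with the start of its hour, whereas the expected result in the question is labelled with the end of the hour (00:00, 01:00, 02:00). One way to get that labelling, sketched here with the six values from the question typed in by hand and UTC assumed, is to floor each timestamp to the hour and add one hour before aggregating. The names secs and hour_end are illustrative.

```r
# The six half-hourly readings from the question (UTC assumed).
precip <- data.frame(
  Datum = as.POSIXct("2004-03-08 23:00:00", tz = "UTC") + 1800 * 0:5,
  mm    = c(4, 1, 2, 4, 1, 4)
)

# Floor each timestamp to the start of its hour, then shift to the hour's end.
secs     <- as.numeric(precip$Datum)
hour_end <- as.POSIXct(floor(secs / 3600) * 3600 + 3600,
                       origin = "1970-01-01", tz = "UTC")

# Sum within each hour, ignoring NAs, labelled by the end of the hour.
sums <- aggregate(precip["mm"], list(Datum = hour_end), sum, na.rm = TRUE)
sums
```

Here a reading taken exactly at 23:00 counts toward the sum reported at 00:00, which is what the expected output in the question implies. If on-the-hour readings should instead close the preceding hour, replace floor(...) + 3600 with ceiling(...).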
You can use dplyr to calculate the hourly sums like this:
precip$hour <- strftime(precip$Datum,"%Y-%m-%d %H")
library(dplyr)
sum_hour <- precip %>% group_by(hour) %>% summarise(sum_hour = sum(mm, na.rm = TRUE))
I use an xts object. The index of the object is as below. There is one for every hour of the day for a year.
"2011-01-02 18:59:00 EST"
"2011-01-02 19:58:00 EST"
"2011-01-02 20:59:00 EST"
In columns are values associated with each index entry. What I want to do is calculate the standard deviation of the value for all Mondays at 18:59 for the complete year. There should be 52 values for the year.
I'm able to search for the day of the week using the weekdays() function, but my problem is searching for the time, such as 18:59:00 or any other time.
You can do this by using interaction to create a factor from the combination of weekdays and .indexhour, then use split to select the relevant observations from your xts object.
library(xts)
set.seed(21)
x <- .xts(rnorm(1e4), seq(1, by=60*60, length.out=1e4))
groups <- interaction(weekdays(index(x)), .indexhour(x))
output <- lapply(split(x, groups), function(x) c(count=length(x), sd=sd(x)))
output <- do.call(rbind, output)
head(output)
# count sd
# Friday.0 60 1.0301030
# Monday.0 59 0.9204670
# Saturday.0 60 0.9842125
# Sunday.0 60 0.9500347
# Thursday.0 60 0.9506620
# Tuesday.0 59 0.8972697
You can use the .index* family of functions (don't forget the '.' in front of 'index'!):
require(quantmod) # also loads xts
fxts[.indexmon(fxts)==0] # it's zero-based (!), so this gives you all the January values
fxts[.indexmday(fxts)==1] # first day of each month
fxts[.indexwday(fxts)==1] # Mondays
> fxts
value
2011-01-02 19:58:00 1
2011-01-02 20:59:00 2
2011-01-03 18:59:00 3
2011-01-09 19:58:00 4
2011-01-09 20:59:00 5
2011-01-10 18:59:00 6
2011-01-16 18:59:00 7
2011-01-16 19:58:00 8
2011-01-16 20:59:00 9
fxts[.indexwday(fxts)==1] #this gives you all the Mondays
For subsetting by time of day you use
fxts["T19:30/T20:00"] # this will give you the time period you are looking for
and here you combine weekday and time period:
fxts["T18:30/T20:00"] & fxts[.indexwday(fxts)==1] # to get a logical vector or
fxts["T18:30/T21:00"][.indexwday(fxts["T18:30/T21:00"])==1] # to get the values
> value
2011-01-03 18:59:00 3
2011-01-10 18:59:00 6
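The same selection can be cross-checked in base R by working on the timestamps directly. This is a sketch on synthetic hourly data, not on the asker's series: idx and val are made-up stand-ins for what index() and coredata() would return for an xts object, and UTC is assumed.

```r
# Four weeks of hourly timestamps starting Sunday 2011-01-02 18:59 (synthetic).
idx <- as.POSIXct("2011-01-02 18:59:00", tz = "UTC") + 3600 * 0:(24 * 7 * 4 - 1)
val <- as.numeric(seq_along(idx))

# Break the timestamps into fields; wday is 0 for Sunday, so 1 is Monday.
lt <- as.POSIXlt(idx, tz = "UTC")
is_monday <- lt$wday == 1
is_1859   <- lt$hour == 18 & lt$min == 59

# Values observed on Mondays at 18:59, and their standard deviation.
monday_1859 <- val[is_monday & is_1859]
sd(monday_1859)
```

Because wday, hour and min are plain integers, this does not depend on the locale in the way that comparing weekdays() output with "Monday" does.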