I have hourly values for precipitation that I'd like to sum up over each day.
My data (Nd_hourly) looks like this:
Datum Uhrzeit Nd
1 2013-05-01 01:00:00 0.0
2 2013-05-01 02:00:00 0.1
3 2013-05-01 03:00:00 0.0
4 2013-05-01 04:00:00 0.3
(date, time, precipitation)
and I'd like an output of Datum and the daily sum of Nd.
I computed the daily min and max temperature with the plyr package and its ddply function:
t_maxmin=ddply(t_air,.(Datum),summarize,Datum=Datum[which.max(T_Luft)],max.value=max(T_Luft),min.value=min(T_Luft))
I then tried something similar for the precipitation:
Nd_daily=ddply(Nd_hourly,.(Datum),summarize,Datum=Datum, sum(Nd_hourly))
but get the error message
Error: only defined on a data frame with all numeric variables
I assume something may be wrong with my data input? I imported data from Excel 2010 via a .txt file.
Still very new to R and programming in general, so I would really appreciate some help :)
Is this what you want?
library(plyr)
ddply(.data = Nd_hourly, .variables = .(Datum), summarize,
      sum_precip = sum(Nd))
# Datum sum_precip
# 1 2013-05-01 0.4
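For completeness, a dplyr version of the same daily sum (a sketch, assuming your data frame is called Nd_hourly as in the question):
library(dplyr)
Nd_hourly %>%
  group_by(Datum) %>%
  summarise(sum_precip = sum(Nd))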
I think @Henrik has identified your problem, but here's an alternative approach, using data.table:
# Create some fake datetime data
datetime <- seq(ISOdate(2000,1,1), ISOdate(2000,1,10), "hours")
# A data.frame with columns for date, time, and random precipitation data.
DF <- data.frame(date=format(datetime, "%Y-%m-%d"),
time=format(datetime, "%H:%M:%S"),
precip=runif(length(datetime)))
head(DF)
# date time precip
# 1 2000-01-01 12:00:00 0.9294353
# 2 2000-01-01 13:00:00 0.5082905
# 3 2000-01-01 14:00:00 0.5222088
# 4 2000-01-01 15:00:00 0.1841305
# 5 2000-01-01 16:00:00 0.9121000
# 6 2000-01-01 17:00:00 0.2434706
library(data.table)
DT <- as.data.table(DF) # convert to a data.table
DT[, list(precip=sum(precip)), by=date]
# date precip
# 1: 2000-01-01 7.563350
# 2: 2000-01-02 10.147659
# 3: 2000-01-03 10.936760
# 4: 2000-01-04 13.925727
# 5: 2000-01-05 11.415149
# 6: 2000-01-06 10.966494
# 7: 2000-01-07 12.751461
# 8: 2000-01-08 15.218148
# 9: 2000-01-09 12.213046
# 10: 2000-01-10 6.219439
There's a great introductory text on data.tables here.
Given your particular data structure, the following should do the trick.
library(data.table)
DT <- data.table(Nd_hourly)
DT[, list(Nd_daily=sum(Nd)), by=Datum]
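If Datum came out of the .txt import as character rather than Date, you may want to convert it first (an assumption about your import; grouping works either way, but Date class sorts and plots more naturally):
DT[, Datum := as.Date(Datum)]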
So I have the output of a water distribution model, which is hourly inflow and discharge values for a river. I have done 5 model runs.
reproducible example:
df <- data.frame(
  time = rep(seq(from = as.POSIXct("2012-1-1 0:00", tz = "UTC"),
                 to   = as.POSIXct("2012-1-1 23:00", tz = "UTC"),
                 by   = "hour"), 5),
  run = as.factor(rep(1:5, each = 24)),
  inflow = rep(seq(1, 300, length.out = 24), 5),
  discharge = rep(seq(1, 180, length.out = 24), 5))
In reality, of course, the values vary between runs. (And I have a lot more data: 100 runs and hourly values covering 35 years.)
So, first I would like to calculate a water scarcity factor for every run, which means something like (1 - (discharge / inflow from 6 hours before)), as the water needs 6 hours to run through the catchment.
scarcityfactor <- 1 - (discharge / lag(inflow,6))
And then I want to calculate the mean, max and min of the scarcity factors over all runs (to find out the highest, the lowest and the mean value of scarcity that could happen at every time step, according to the different model runs). So I would say I could just calculate a mean, max and min for every time step:
f1 <- function(x) c(Mean = (mean(x)), Max = (max(x)), Min = (min(x)))
results <- do.call(data.frame, aggregate(scarcityfactor ~ time,
data = df,
FUN = f1))
Can anybody help me with the code??
I believe this is what you want, if I understand the problem description correctly.
I'll use data.table:
library(data.table)
setDT(df)
# add scarcity_factor (group by run)
df[ , scarcity_factor := 1 - discharge/shift(inflow, 6L), by = run]
# group by time, excluding times for which the
# scarcity factor is missing
df[!is.na(scarcity_factor), by = time,
.(min_scarcity = min(scarcity_factor),
mean_scarcity = mean(scarcity_factor),
max_scarcity = max(scarcity_factor))]
# time min_scarcity mean_scarcity max_scarcity
# 1: 2012-01-01 06:00:00 -46.695652174 -46.695652174 -46.695652174
# 2: 2012-01-01 07:00:00 -2.962732919 -2.962732919 -2.962732919
# 3: 2012-01-01 08:00:00 -1.342995169 -1.342995169 -1.342995169
# 4: 2012-01-01 09:00:00 -0.776086957 -0.776086957 -0.776086957
# 5: 2012-01-01 10:00:00 -0.487284660 -0.487284660 -0.487284660
# 6: 2012-01-01 11:00:00 -0.312252964 -0.312252964 -0.312252964
# 7: 2012-01-01 12:00:00 -0.194826637 -0.194826637 -0.194826637
# 8: 2012-01-01 13:00:00 -0.110586011 -0.110586011 -0.110586011
# 9: 2012-01-01 14:00:00 -0.047204969 -0.047204969 -0.047204969
# 10: 2012-01-01 15:00:00 0.002210759 0.002210759 0.002210759
# 11: 2012-01-01 16:00:00 0.041818785 0.041818785 0.041818785
# 12: 2012-01-01 17:00:00 0.074275362 0.074275362 0.074275362
# 13: 2012-01-01 18:00:00 0.101356965 0.101356965 0.101356965
# 14: 2012-01-01 19:00:00 0.124296675 0.124296675 0.124296675
# 15: 2012-01-01 20:00:00 0.143977192 0.143977192 0.143977192
# 16: 2012-01-01 21:00:00 0.161047028 0.161047028 0.161047028
# 17: 2012-01-01 22:00:00 0.175993343 0.175993343 0.175993343
# 18: 2012-01-01 23:00:00 0.189189189 0.189189189 0.189189189
You can be a tad more concise by lapplying over different aggregators:
df[!is.na(scarcity_factor), by = time,
lapply(list(min, mean, max), function(f) f(scarcity_factor))]
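One caveat: the lapply version drops the column names. Naming the list of functions should bring them back (a small untested tweak, same logic):
df[!is.na(scarcity_factor), by = time,
   lapply(list(min = min, mean = mean, max = max),
          function(f) f(scarcity_factor))]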
Lastly you could think of this as reshaping with aggregation and use dcast:
dcast(df, time ~ ., value.var = 'scarcity_factor',
fun.aggregate = list(min, mean, max))
(use df[!is.na(scarcity_factor)] in the first argument of dcast if you want to exclude the meaningless rows)
A tidyverse alternative:
library(tidyverse)
df %>%
group_by(run) %>%
mutate(scarcityfactor = 1 - discharge / lag(inflow,6)) %>%
group_by(time) %>%
summarise(Mean = mean(scarcityfactor),
Max = max(scarcityfactor),
Min = min(scarcityfactor))
# # A tibble: 24 x 4
# time Mean Max Min
# <dttm> <dbl> <dbl> <dbl>
# 1 2012-01-01 00:00:00 NA NA NA
# 2 2012-01-01 01:00:00 NA NA NA
# 3 2012-01-01 02:00:00 NA NA NA
# 4 2012-01-01 03:00:00 NA NA NA
# 5 2012-01-01 04:00:00 NA NA NA
# 6 2012-01-01 05:00:00 NA NA NA
# 7 2012-01-01 06:00:00 -46.7 -46.7 -46.7
# 8 2012-01-01 07:00:00 -2.96 -2.96 -2.96
# 9 2012-01-01 08:00:00 -1.34 -1.34 -1.34
#10 2012-01-01 09:00:00 -0.776 -0.776 -0.776
# # ... with 14 more rows
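If you'd rather drop the leading all-NA rows (as the data.table answer does with its !is.na() filter), a filter step between the two groupings should do it (a sketch on the same pipeline):
df %>%
  group_by(run) %>%
  mutate(scarcityfactor = 1 - discharge / lag(inflow, 6)) %>%
  filter(!is.na(scarcityfactor)) %>%
  group_by(time) %>%
  summarise(Mean = mean(scarcityfactor),
            Max = max(scarcityfactor),
            Min = min(scarcityfactor))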
I am trying to subtract one hour from date/times within a POSIXct column that are earlier than or equal to a time stated in a different comparison dataframe for that particular ID.
For example:
#create sample data
Time<-as.POSIXct(c("2015-10-02 08:00:00","2015-11-02 11:00:00","2015-10-11 10:00:00","2015-11-11 09:00:00","2015-10-24 08:00:00","2015-10-27 08:00:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,01,02,02,03,03)
data<-data.frame(Time,ID)
Which produces this:
Time ID
1 2015-10-02 08:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 10:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 08:00:00 3
6 2015-10-27 08:00:00 3
I then have another dataframe with a key date and time for each ID to compare against. The Time in data should be compared against Comparison in ComparisonData for the particular ID it is associated with. If the Time value in data is earlier than or equal to the comparison value, one hour should be subtracted from the value in data:
#create sample comparison data
Comparison<-as.POSIXct(c("2015-10-29 08:00:00","2015-11-02 08:00:00","2015-10-26 08:30:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,02,03)
ComparisonData<-data.frame(Comparison,ID)
This should look like this:
Comparison ID
1 2015-10-29 08:00:00 1
2 2015-11-02 08:00:00 2
3 2015-10-26 08:30:00 3
In summary, the code should check all times of a certain ID to see if any are earlier than or equal to the value specified in ComparisonData and if they are, subtract one hour. This should give this data frame as an output:
Time ID
1 2015-10-02 07:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 09:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 07:00:00 3
6 2015-10-27 08:00:00 3
I have looked at similar solutions such as this, but I cannot work out how to also check each time against the comparison value for its particular ID.
I think ddply seems quite a promising option but I'm not sure how to use it for this particular problem.
Here's a quick and efficient solution using data.table. First we join the two data sets by ID, and then we just modify the Times that are less than or equal to Comparison:
library(data.table) # v1.9.6+
setDT(data)[ComparisonData, end := i.Comparison, on = "ID"]
data[Time <= end, Time := Time - 3600L][, end := NULL]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
Alternatively, we could do this in one step while joining, using ifelse (not sure how efficient this is, though):
setDT(data)[ComparisonData,
Time := ifelse(Time <= i.Comparison,
Time - 3600L, Time),
on = "ID"]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
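One caveat: base ifelse can drop the POSIXct class of the result. On a recent data.table (1.12.3+, where fifelse was introduced, if I remember right), fifelse preserves the class:
setDT(data)[ComparisonData,
            Time := fifelse(Time <= i.Comparison, Time - 3600L, Time),
            on = "ID"]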
I am sure there is a better solution than this, but I think it works.
for(i in 1:nrow(data)) {
if(data$Time[i] <= ComparisonData[data$ID[i], 1]){
data$Time[i] <- data$Time[i] - 3600
}
}
# Time ID
#1 2015-10-02 07:00:00 1
#2 2015-11-02 11:00:00 1
#3 2015-10-11 09:00:00 2
#4 2015-11-11 09:00:00 2
#5 2015-10-24 07:00:00 3
#6 2015-10-27 08:00:00 3
This iterates through every row in data.
ComparisonData[data$ID[i], 1] gets the comparison time in ComparisonData for the corresponding ID (note this relies on the rows of ComparisonData being ordered by ID). If the Time value is earlier than or equal to it, the time is reduced by 1 hour.
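For what it's worth, the loop can also be vectorised in base R with match, which additionally avoids relying on the row order of ComparisonData (a sketch using the same sample objects):
# look up each row's comparison time by ID, then subtract an hour where needed
key <- ComparisonData$Comparison[match(data$ID, ComparisonData$ID)]
data$Time <- data$Time - 3600 * (data$Time <= key)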
This one is almost a challenge!
I have the following dataframe:
tag hour val
N1 2013-01-01 00:00:00 0.3404266179
N1 2013-01-01 01:00:00 0.3274182995
N1 2013-01-01 02:00:00 0.3142598749
N2 2013-01-01 02:00:00 0.3189924887
N2 2013-01-01 04:00:00 0.3170907762
N3 2013-01-01 05:00:00 0.3161910788
N3 2013-01-01 06:00:00 0.4247638954
I need to transform it to something like this:
hour N1 N2 N3
2013-01-01 00:00:00 0.3404266179 NULL NULL
2013-01-01 01:00:00 0.3274182995 NULL NULL
2013-01-01 02:00:00 0.3142598749 0.3189924887 NULL
2013-01-01 03:00:00 NULL NULL NULL
2013-01-01 04:00:00 NULL 0.3170907762 NULL
2013-01-01 05:00:00 NULL NULL 0.3161910788
2013-01-01 06:00:00 NULL NULL 0.4247638954
As things are not that easy, my dataframe goes up to N5000, and hour has almost 200,000 entries for each N.
The timestamps are very well behaved, increasing minute by minute for every tag, so you could generate all of them with a simple command like strptime("2013-01-01 00:00:00", "%Y-%m-%d %H:%M:%S") + c(0:172800)*60 (172,800 minutes ~ 4 months). But you do not necessarily have data for every timestamp, as I show in the example.
I know I could write a function with endless loops, but is there a way to do this using only functions from R and its packages?
Thanks!
You want to use the "reshape2" package:
install.packages("reshape2")
library(reshape2)
newdf <- dcast(mydata, hour~tag)
reshape2 is a wildly powerful package that I completely fail to understand... but sometimes it has nice useful things like this that just work. :-)
UPDATED: that's "dcast" not "cast"... I mistakenly used the "reshape" not the "reshape2" package. Fixed!
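Two optional refinements, assuming hour is POSIXct: passing value.var explicitly avoids dcast's column-guessing message, and merging against the full timestamp sequence afterwards fills in the completely missing hours (like 03:00:00 above):
newdf <- dcast(mydata, hour ~ tag, value.var = "val")
allhours <- data.frame(hour = seq(min(mydata$hour), max(mydata$hour), by = "hour"))
newdf <- merge(allhours, newdf, all.x = TRUE)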
This is neither the most straightforward nor elegant solution, but it works:
An example data.frame:
df <- data.frame(tag=rep(c("N1", "N2", "N4"), c(3,2,2)),
hour=structure(c(1,2,3,3,5,6,7), class="POSIXct"),
val=runif(7))
## tag hour val
## 1 N1 1970-01-01 01:00:01 0.6645598
## 2 N1 1970-01-01 01:00:02 0.7924186
## 3 N1 1970-01-01 01:00:03 0.3813311
## 4 N2 1970-01-01 01:00:03 0.8555780
## 5 N2 1970-01-01 01:00:05 0.4480540
## 6 N4 1970-01-01 01:00:06 0.1875233
## 7 N4 1970-01-01 01:00:07 0.5755332
Now we create the resulting date column (it's just an example):
uh <- structure(1:7, class="POSIXct") # or e.g. uh <- unique(df$hour), or seq(), etc.
Then we create an "empty" resulting data frame (each val will be NA)
nr <- length(uh) # number of rows in the output
# column definitions:
(coldef <- paste("hour=uh", paste(unique(df$tag), "NA_real_", sep="=", collapse=", "), sep=", "))
## [1] "hour=uh, N1=NA_real_, N2=NA_real_, N4=NA_real_"
# create output df:
outdf <- eval(parse(text=sprintf("data.frame(list(%s))", coldef)))
Finally, let's set vals in each N* column:
for (idx in split(1:nrow(df), df$tag))
outdf[outdf$hour %in% df$hour[idx], as.character(df$tag[idx[1]])] <- df$val[idx]
You might also consider the base function reshape if you don't want to bother with another package. Using @gagolews's sample data:
> reshape(df, idvar="hour", timevar="tag", v.names="val", direction="wide")
hour val.N1 val.N2 val.N4
1 1969-12-31 19:00:01 0.8156553 NA NA
2 1969-12-31 19:00:02 0.9203821 NA NA
3 1969-12-31 19:00:03 0.8127614 0.7386737 NA
5 1969-12-31 19:00:05 NA 0.9648562 NA
6 1969-12-31 19:00:06 NA NA 0.2540216
7 1969-12-31 19:00:07 NA NA 0.5024042
I am loading a data.table from a CSV file that has date, orders, amount, etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above, 03-Jan and 06-Jan do not have any entries.
I would like to fill the missing entries with default values (say, zero for orders, amount, etc.), or carry the last value forward (e.g., 03-Jan will reuse the 02-Jan values and 06-Jan will reuse the 05-Jan values, etc.).
What is the best/optimal way to fill-in such gaps of missing dates data with such default values?
The answer here suggests using allow.cartesian = TRUE and expand.grid for missing weekdays. That may work for weekdays (since there are just 7 of them), but I'm not sure it's the right way to go for dates, especially if we are dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
to = as.Date("2013-01-07"),
by = "days")
NADayWiseOrders[J(all_dates), roll=Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
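If you want the gaps filled with zeros instead of the last value carried forward, you could join without roll and fill the NAs afterwards; setnafill needs data.table 1.12.4 or later, if I recall correctly:
filled <- NADayWiseOrders[J(all_dates)] # non-matching dates come back as NA rows
setnafill(filled, type = "const", fill = 0,
          cols = c("orders", "amount", "guests"))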
Here is how you fill in the gaps within subgroups:
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]
# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]
# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA's to 0)
require(xts)
na.locf(dt)
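And if you'd rather replace the NAs with zeros instead (a sketch; converting to a plain data.frame sidesteps data.table's stricter assignment rules):
dt2 <- as.data.frame(merge(NADayWiseOrders, alldates, by = "date", all = TRUE))
dt2[is.na(dt2)] <- 0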
I have a dataframe with a POSIXct datetime column and a column with a value.
The value column may contain periods of NA, and sometimes even gaps of several hours (no rows at all), e.g.:
t v
2014-01-01 20:00:00 1000
2014-01-01 20:15:00 2300
2014-01-01 20:30:00 1330
2014-01-01 20:45:00 NA
2014-01-01 21:00:00 NA
2014-01-01 22:15:00 NA
2014-01-01 22:30:00 1330
2014-01-01 22:45:00 3333
One can easily see that there is a period with simply no data written (between 21:00 and 22:15).
When I now apply
aggregate(data$v, list(t = cut(data$t, "1 hour")), FUN = sum)
it interprets anything missing as zero. When plotting it with ggplot2 and geom_line, the curve in that region will break down from 1000s to 10s.
I want aggregate to return NA for every hour that is not represented in the data (missing or NA itself), so that the values are not bent down to 0 and the line plot shows a gap in that period (disconnected data points).
Thanks to @JulienNavarre and @user20650, who both contributed parts of the solution, here is my final solution, which additionally handles data at non-regular times and demands at least x valid values per hour for aggregation.
data$t <- as.POSIXct(strptime(data$t,"%Y-%m-%d %H:%M:%S"))
x <- 4 # data available x times per hour
h <- 1 # aggregate to every h hours
# aggregation puts NA if data has not x valid values per hour
dataagg <- aggregate(data$v, list(t=cut(data$t, paste(h,"hours"))),
function(z) ifelse(length(z)<x*h||any(is.na(z)),NA,sum(z,na.rm=T)))
dataagg$t <- as.POSIXct(strptime(dataagg$t, '%Y-%m-%d %H:%M:%S'))
# Now fill up missing datetimes with NA
tdf <- data.frame(t = seq(min(dataagg$t), max(dataagg$t), by = paste(h, "hours")))
dataaggfinal <- merge(dataagg, tdf, by="t", all.y=T)
What you want is not quite clear, but maybe you are looking for a right join, which you can do with merge and all.y = TRUE.
Afterwards you can do your grouped sum with aggregate.
> data$t <- as.POSIXct(data$t)
>
> time.seq <- seq(min(data$t), max(data$t), by = "15 min")
>
> merge(data, as.data.frame(time.seq), by.x = "t", by.y = "time.seq", all.y = T)
t v
1 2014-01-01 20:00:00 1000
2 2014-01-01 20:15:00 2300
3 2014-01-01 20:30:00 1330
4 2014-01-01 20:45:00 NA
5 2014-01-01 21:00:00 NA
6 2014-01-01 21:15:00 NA
7 2014-01-01 21:30:00 NA
8 2014-01-01 21:45:00 NA
9 2014-01-01 22:00:00 NA
10 2014-01-01 22:15:00 NA
11 2014-01-01 22:30:00 1330
12 2014-01-01 22:45:00 3333
And the x argument to aggregate should be, in this case, the variable you want to sum, so it's data$v, not data.
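Putting both steps together (an untested sketch; note that sum over a group containing NA stays NA, so fully and partially missing hours both come out NA, matching the final solution above):
data2 <- merge(data, as.data.frame(time.seq),
               by.x = "t", by.y = "time.seq", all.y = TRUE)
aggregate(data2$v, by = list(t = cut(data2$t, "1 hour")), FUN = sum)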