I have a big data.table that contains the following columns:
timestamp, value, house
The value is cumulative, e.g. the energy consumption of that one house. Here is a small sample:
time value house
2014-10-27 11:40:00 100 2
2014-10-27 15:40:00 150 2
2014-10-27 19:40:30 160 2
2014-10-28 00:00:01 170 2
2014-10-28 20:20:20 180 2
2014-10-27 11:40:00 200 3
2014-10-27 15:40:00 300 3
2014-10-27 19:40:30 400 3
2014-10-28 00:00:01 500 3
2014-10-28 20:20:20 600 3
I want to get 3 bar charts: one with the average per-house usage per hour of the day, one with the average per-house usage per day of the week, and one with the average per-house usage per month of the year.
To get the usage within one hour of one day, I guess I should do something like
max(data$value) - min(data$value)
but per one-hour time interval and also per house. I know cut(data$time, breaks="hour") splits the times into intervals, but of course that does not take the difference of the maximum and minimum value, and it also doesn't consider which house the reading is from. On top of that I would also need the average, of course.
How can I do this?
First, I'd split the time variable into hour, day, and month parts. A convenient and quick way is to use regular expressions, for example with stringr (rl being the timestamps as a character vector):
library(stringr)
# the two digits after the space are the hour
hour <- str_extract(rl, ' [[:digit:]]{2}')
hour <- substring(hour, 2)
# the two digits between the second "-" and the space are the day of month
day <- str_extract(rl, '-[[:digit:]]{2} ')
day <- substring(day, 2, 3)
Then we need to cope with value being in cumulative form: reverse the cumsum with diff() (both from base R). diff() returns one element fewer than its input and must not run across house boundaries, so compute it per house and pad the first reading with NA:
data[, usage := c(NA, diff(value)), by = house]
Aggregated data for one of the barplots, created with data.table syntax:
data[, .(avg = mean(usage, na.rm = TRUE)), by = .(house, day)]
or with base R's aggregate(), which looks more readable:
aggregate(usage ~ house + day, data, mean)
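Putting it together, here is a minimal sketch building on the usage column above (assuming the table is called data and has a POSIXct time column; lubridate's hour(), wday() and month() replace the regex splitting):
library(data.table)
library(lubridate)

# average per-house usage by hour of day, day of week, and month of year
by_hour <- data[!is.na(usage), .(avg = mean(usage)), by = .(house, hour = hour(time))]
by_wday <- data[!is.na(usage), .(avg = mean(usage)), by = .(house, wday = wday(time, label = TRUE))]
by_month <- data[!is.na(usage), .(avg = mean(usage)), by = .(house, month = month(time, label = TRUE))]

# e.g. one bar chart per house (the barplot formula interface needs R >= 4.0)
barplot(avg ~ hour, data = by_hour[house == 2], main = "House 2: average usage per hour")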
Related
Hi everyone. I am currently plotting time series graphs in RStudio. I have created a nice time series graph, but I would like the x axis to show not the date itself but an integer counting up from the starting date of the graph.
[Image: Time Series Graph]
For example, instead of seeing 01/01/2021 I want to see day 100, as in it's the 100th day of recording data.
Do I need to create another column converting all the dates into a numerical value and then plot this?
If so, how do I do this? At the moment all I have is a Date column and the column with the values I am plotting.
[Image: Column Data]
Thanks
Assuming you want 01/01/2021 as the first day, you could use it as a reference and calculate the number of days passed since that first day of recording, then plot that; this should give you the integer counting up from the starting date.
Not sure what your data frame looks like, so hopefully this helps.
Using lubridate
library(lubridate)
df
Date
1 01/01/2021
2 02/01/2021
3 03/01/2021
4 04/01/2021
# yday() gives the day of the year, so this assumes all dates fall within one calendar year
df$days <- yday(dmy(df$Date)) - 1
Output:
Date days
1 01/01/2021 0
2 02/01/2021 1
3 03/01/2021 2
4 04/01/2021 3
Which is indeed numeric:
str(df$days)
num [1:4] 0 1 2 3
Here is a simulation of dates:
date.simulation <- as.Date(1:100, origin = "2001-01-01")
factor(date.simulation-min(date.simulation))
You just subtract the minimum date from each date. And you need it as a factor for plotting purposes.
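Applied to the df from the previous answer (a sketch, assuming the Date column holds dd/mm/yyyy strings as shown there), the same subtraction idea gives a variant that, unlike yday(), also works across year boundaries:
library(lubridate)
# days elapsed since the first recorded date
df$days <- as.numeric(dmy(df$Date) - min(dmy(df$Date)))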
I have a data frame like this called "fulldata" with over 12000 observations:
Station_id Date Prec_daily
44 2019-04-30 0.5
172 2019-05-21 0.0
82 2019-04-30 2.3
44 2019-05-07 4.7
250 2019-05-21 0.0
45 2019-05-02 NA
890 2019-04-30 0.5
82 2019-05-14 5.2
250 2019-05-07 NA
(Station_id = integer, Date = POSIXct, Prec_daily = numeric)
Station_id's stand for different weather stations across a country. The variable Prec_daily shows the daily amount of rainfall on the given date at the given station. The dates are days on which a voluntary nature observation was submitted to a platform by a participant of the project.
My goal is being able to make a prediction like "When I have precipitation of 1.5 in one day, there will be an estimated amount of XX observations that day (across the country)."
How can I do this in R? I think I need a new data frame or variable that calculates the average amount of total observations across the country for every amount of rainfall in a day.
I really struggle with that right now. So far my analysis contains a histogram of the df above and the mean of the Prec_daily variable. I also managed to make a new data frame in which the total frequency for every amount of rainfall is counted. For that I used the following code.
PrecClean <- fulldata$Prec_daily[!is.na(fulldata$Prec_daily)]
hist(PrecClean, xlim = c(0,10), ylim = c(0, 11000))
box()
mean(PrecClean)
Precfreq <- count(fulldata, vars = "Prec_daily") # plyr syntax; the dplyr equivalent would be count(fulldata, Prec_daily)
But that doesn't help me determine the estimated DAILY amount of observations for each precipitation level...
Thanks a lot in advance for any advice!
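A minimal sketch of the aggregation described above (assumptions: every row of fulldata is one submitted observation, and the mean of Prec_daily across stations serves as a day's country-wide precipitation value):
library(dplyr)

daily <- fulldata %>%
  filter(!is.na(Prec_daily)) %>%
  group_by(Date) %>%
  summarise(n_obs = n(),              # observations submitted that day
            prec = mean(Prec_daily))  # country-wide precipitation that day

# average daily observation count per precipitation level
daily %>%
  group_by(prec) %>%
  summarise(avg_obs = mean(n_obs))
Since precipitation is continuous, you will probably want to bin prec (e.g. with cut()) before the second group_by so that each level is backed by enough days.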
I'm working with a data frame including the columns 'timestamps' and 'amount'. The data can be produced like this:
sample_size <- 40
start_date = as.POSIXct("2020-01-01 00:00")
end_date = as.POSIXct("2020-01-03 00:00")
timestamps <- as.POSIXct(sample(seq(start_date, end_date, by=60), sample_size))
amount <- rpois(sample_size, 5)
df <- data.frame(timestamps=timestamps, amount=amount)
Now I'd like to plot the sum of the amount entries per timeframe (e.g. every hour, 30 min, or 20 min). The final plot would look like a histogram of the timestamps, but instead of just counting how many timestamps fell into each timeframe it should show the total amount that fell into it.
How can I approach this? I could create an extra vector with the amount of each timeframe, but don't know how to proceed.
Also I'd like to add a feature to reduce by hour, such that just one day is plotted (notice the range between start_date and end_date is two days) and each timeframe (let's say every hour) shows the total amount of data located in that hour across all days. In this case the data
2020-01-01 13:03:00 5
2020-01-02 13:21:00 10
2020-01-02 13:38:00 1
2020-01-01 13:14:00 3
would produce a bar of height sum(5, 10, 1, 3) = 19 in the timeframe 13:00-14:00. How can I implement the plotting to easily switch between these two modes (plot days/plot just one day and reduce)?
EDIT: Following the advice of @Gregor Thomas I added a grouping column like this:
df$time_group <- lubridate::floor_date(df$timestamps, unit="20 minutes")
Now I'm wondering how to ignore the dates and thus reduce by 20 minute frame (independent of date).
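A minimal sketch for both modes, building on the time_group column above (base graphics; names other than timestamps and amount are made up): for the date-preserving mode, sum amount per floored timestamp; to ignore the date, keep only the clock time of each floored timestamp and group on that instead.
library(lubridate)

# mode 1: sum per 20-minute frame, dates kept
df$time_group <- floor_date(df$timestamps, unit = "20 minutes")
agg1 <- aggregate(amount ~ time_group, df, sum)
barplot(agg1$amount, names.arg = format(agg1$time_group, "%d %H:%M"), las = 2)

# mode 2: drop the date, keep only the time of day of each frame
df$clock <- format(df$time_group, "%H:%M")
agg2 <- aggregate(amount ~ clock, df, sum)
barplot(agg2$amount, names.arg = agg2$clock, las = 2)
Switching between the two modes is then just a matter of which grouping column you aggregate on.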
I'm working with invoices. I want to calculate the pool of money claimable each month (an invoice's amount once it is past its expiration date). The point is that the invoices can be cancelled or paid. So, I would like to aggregate the value of the invoices month by month, counting each invoice only from the month containing the day after its expiration date until the month in which it was paid or cancelled, including that month.
Here is an example of my matrix
Client.Code. Invoice Expiration.Date Amount Payment.date Out.Of.Process
1: 1004773 21506000409 2016-09-28 6993.80 <NA> Current
2: 1004773 21506000670 2016-08-29 30034.62 <NA> Current
3: 1004773 21507000583 2017-10-29 3872.00 <NA> Current
4: 1005109 21601000237 2016-04-30 3594.31 <NA> Current
5: 1005109 21606000480 2016-08-29 6301.68 <NA> Current
6: 1004737 20170500125 2016-07-24 142818.72 2017-06-19 Paid
For example, the code should count the first invoice in every monthly aggregate from September 2016 onwards, and should count number six in every aggregate from July 2016 to June 2017. Number four should be counted in every month from May 2016 onwards (the day after its expiration).
Is there a way to achieve the aggregated sum of claimable invoice amounts per month that I'm looking for?
Here is one solution which aggregates Amount by expiration month. I don't know if it is exactly what you wanted, but it's close, I hope.
library(dplyr)
library(lubridate) # for the floor_date function
your_df %>% mutate(exp_date = as.Date(Expiration.Date), # you might not need this if your expiration date is already in Date format
monthly_date = floor_date(exp_date, "month")) %>%
select(Amount,monthly_date) %>%
group_by(monthly_date) %>%
summarise(aggregate_amt = sum(Amount))
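The above sums each invoice only in its expiration month. For the claimable pool described in the question, where an invoice counts in every month from the day after its expiration until the month it is paid or cancelled, one sketch (assuming a data.table called dt, Invoice as a unique key, and Payment.date being NA while an invoice is still open, so open invoices run up to today) is to expand each invoice into one row per claimable month and then sum:
library(data.table)
library(lubridate)

# first and last month in which each invoice is claimable
dt[, exp_month := floor_date(as.Date(Expiration.Date) + 1, "month")]
dt[, end_month := floor_date(fifelse(is.na(Payment.date), Sys.Date(), as.Date(Payment.date)), "month")]

# one row per invoice per claimable month, then sum the pool per month
claimable <- dt[exp_month <= end_month,
                .(month = seq(exp_month, end_month, by = "month"), Amount),
                by = Invoice]
claimable[, .(pool = sum(Amount)), by = month][order(month)]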
I have the following problem:
Suppose we have:
Idx ID StartTime EndTime
1: 1 2014-01-01 02:20:00 2014-01-01 03:42:00
2: 1 2014-01-01 14:51:00 2014-01-01 16:44:00
note: Idx is not given, but I'm simply adding it to the table view.
Now we see that the person with ID=1 used the computer from 2:20 to 3:42. What I would like to do now is convert this interval into a set of variables representing weekday and hour, with the duration (in minutes) spent in each of those periods.
Idx ID Monday-0:00 Monday-1:00 ... Wednesday-2:00 Wednesday-3:00
1: 1 40 42
For the second row we would have
Idx ID Monday-0:00 Monday-1:00 ... Wednesday-14:00 Wednesday-15:00 Wednesday-16:00
2: 1 9 60 44
Now the problem is of course that it can span over multiple hours as you can see from the second row.
I would like to do this per row and I was wondering if this is possible without too much computational effort and using data.table?
PS: it is also possible that an interval spans across midnight into the next day.
library(data.table)
library(lubridate)
#produce sample data
DT<-data.table(idx=1:100,ID=rep(1:20,5), StartTime=runif(100,60*60,60*60*365)+as.POSIXct('2014-01-01',tz='UTC')) #as.POSIXct, not ymd(): adding seconds to the Date that ymd() returns would silently add days instead
DT[,EndTime:=StartTime+runif(.N,60,60*60*8)] #one random duration per row (runif(1) would give every row the same duration)
#make fake start and end dates with same day of week and time but all within a single calendar week
origin<-as.POSIXct('1970-01-01',tz='UTC')
DT[,fakestart:=as.numeric(difftime(StartTime,origin,units="days"))%%7*60*60*24+origin]
DT[,fakeend:=as.numeric(difftime(EndTime,origin,units="days"))%%7*60*60*24+origin]
setkey(DT,fakestart,fakeend)
#check that weekdays line up
nrow(DT[weekdays(EndTime)==weekdays(fakeend)])
nrow(DT[weekdays(StartTime)==weekdays(fakestart)])
#both are 100 so we're good.
#check that fakeend > fakestart
DT[fakeend<fakestart]
#uh-oh some ends are earlier than starts, let's add 7 days to those ends
DT[fakeend<fakestart,fakeend:=fakeend+days(7)]
#make data.table with all possible labels
DTin<-data.table(start=seq(from=origin,to=DT[,floor_date(max(fakeend),"hour")],by="hour"))
DTin[,end:=start+hours(1)]
DTin[,label:=paste0(format(start,format="%A-%H:00"),' ',format(end,format="%A-%H:00"))]
#set key and use new foverlaps feature of data.table which merges by interval
setkey(DT,fakestart,fakeend)
setkey(DTin,start,end)
DTout<-foverlaps(DT,DTin,type="any")
#compute duration (in minutes) spent in each hourly interval
DTout[,dur:=60-pmax(0,difftime(fakestart,start,units="mins"))-pmax(0,difftime(end,fakeend,units="mins"))]
#cast all the rows up to columns for final result
castout<-dcast.data.table(DTout,idx+ID~label,value.var="dur",fill=0)
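As a quick sanity check (a sketch using the objects above), the per-slot durations should add back up to each original interval's length in minutes:
#total minutes per idx should match the original interval lengths
DTout[,.(total_mins=sum(dur)),by=idx]
DT[,.(idx,interval_mins=as.numeric(difftime(EndTime,StartTime,units="mins")))]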