How to plot every second timestep? [r]

There must be a very easy way to do this but I don't know what it is...
As the title says, I would like to know how to plot every second timestep of a time series in R. For example, I have half-hourly data but only want to plot the values on the hour, e.g. I have
10:00 0
10:30 1
11:00 2
11:30 3
12:00 4
I just want to plot
10:00 0
11:00 2
12:00 4

Something like
plot(x[seq_along(x) %% 2 == 1])
? (The condition seq_along(x) %% 2 == 1 keeps the odd-numbered elements, which are the on-the-hour values in your example.)
Edit: I don't know how you are plotting your data set above, but however you're doing it, you can subset your data as follows
hourlydata <- fulldata[seq(nrow(fulldata)) %% 2 == 1, ]
If you give more details someone might tell you how to figure out which time values are hourly rather than relying (as here) on the fact that they are the odd-numbered rows ...
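For example, if the times are stored as POSIXct, you can select the on-the-hour rows from the time values themselves rather than from their positions. A sketch, assuming fulldata has a POSIXct column time and a numeric column value (both column names are illustrative):
# keep only rows whose timestamp falls exactly on the hour,
# instead of relying on row positions
hourly <- fulldata[format(fulldata$time, "%M") == "00", ]
plot(hourly$time, hourly$value, type = "l")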

Slightly less verbose, and not quite as clear as Ben's solution, but you can use vector recycling and indexing with a logical vector to achieve this (as long as you're just interested in every other observation).
# Extract the data you want (assuming you want to keep
# the first observation, skip the second, and so on)
newdat <- x[c(T,F)]
plot(newdat)

Related

How do I transform half-hourly data that does not span the whole day to a Time Series in R?

This is my first question on stackoverflow, sorry if the question is poorly put.
I am currently developing a project where I predict how much a person drinks each day. I currently have data that looks like this:
The menge column represents how much water a person has actually drunk in 30 minutes (so the first value represents the amount from 8:00 until just before 8:30, etc.). This is a 1-day sample from 3 months of data. The day starts at 8 AM and ends at 8 PM.
I am trying to forecast the Time Series for each day. For example, given the first one or two time steps, we would predict the whole day and then we know how much in total the person has drunk until 8 PM.
I am trying to model this data as a Time Series object in R (Google Colab), in order to use Croston's Method for the forecasting. Using the ts() function, what should I set the frequency to knowing that:
The data is half-hourly
The data is from 8:00 till 20:00 each day (Does not span the whole day)
Would I need to make the data span the whole day by adding 0 values? Are there maybe better approaches for this? Thank you in advance.
When using the ts() function, the frequency is used to define the number of (usually regularly spaced) observations within a given time period. In your example, the observations are every 30 minutes between 8AM and 8PM, and the time period is 1 day. Choosing a period of 1 day assumes that the pattern within each day is of most interest here; you could also use 1 week.
Within each day of your data (8AM-8PM) you have 24 observations (24 half-hours), so a suitable frequency for this data would be 24.
You can also pad the data with 0 values; however, this isn't necessary and would complicate the model. If you padded the data so that it has observations for all half-hours of the day, the frequency would then be 48.
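A minimal sketch of this setup, assuming menge is the numeric vector of half-hourly amounts in time order and that you use croston() from the forecast package:
library(forecast)                    # provides croston()
drinks <- ts(menge, frequency = 24)  # 24 half-hours per 8:00-20:00 day
fc <- croston(drinks, h = 24)        # forecast one day (24 half-hours) ahead
plot(fc)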

How to make an hourly time series in R with this data?

times booked_res
11:00 23
13:00 26
15:00 27
17:00 25
19:00 28
21:00 30
So I need to use the ts() function in R to convert this frame into a time series. The column on the right is the number of people booked at each time. How should I approach this? I'm not sure about the arguments, and I don't know if the frequency should be set to 24 (hours in a day) or 10 (11:00 to 21:00) as shown above. Any help appreciated.
First, find the frequency by noting that you are taking a sample every two hours. The frequency is the inverse of the time between samples: 1/2 sample per hour, or 12 samples per day if you treat one day as the period. The data you're interested in is in the right-hand column, so you can just use that data vector rather than the entire data frame. The code to convert that into a time series is simply:
booked_res <- c(23, 26, 27, 25, 28, 30)
ts(booked_res, frequency = 12)
A simple plot with your data might be coded like this:
plot(ts(booked_res, frequency = 12),
     ylab = 'Number of people reserved',
     xlab = 'Time (in days) since start of sampling',
     main = 'Time series chart of people reservations')
UPDATE:
A ts object in R assumes a constant sampling rate, i.e. regularly spaced observations. Using a varying sample rate would make the observations irregularly spaced, so you wouldn't be able to represent them faithfully as a ts object in R.
This page on Analytics Vidhya provides a nice definition of stationary and non-stationary time series, while this page on R-bloggers gives some resources that relate to analyzing a non-stationary time series.
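If your sampling times ever become irregular, an indexed series from the zoo package is often a better fit than ts. A sketch, reusing booked_res from above with an arbitrary date:
library(zoo)
times <- as.POSIXct(paste("2017-05-01",
                          c("11:00", "13:00", "15:00", "17:00", "19:00", "21:00")))
booked_zoo <- zoo(booked_res, order.by = times)  # indexed by the actual times
plot(booked_zoo, ylab = "Number of people reserved", xlab = "Time")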

Flexible calculations in data frames

I have a little problem with R and my skills are somewhat limited.
I want to conduct two calculations in a data frame which are based on the previous row.
The first one is a count variable, additionally I want to calculate the difference between the current and the previous line.
I think the easiest way to clarify my problem is a small example:
Imagine the following table below, which consists of only two columns. user is a customer number and time is the time of a transaction of the particular user.
Now I want to create two new columns as specified in the example table:
The counter variable count, which simply counts the transactions of the user, indicating the actual number of the actual user's transaction.
The variable diff (time [s]), which is the time difference (in seconds) between the current transaction and the previous one, i.e. something like time[i] - time[i-1]; the calculation must start again for each new user, and obviously no time difference can be calculated for the first transaction of each user.
I've tried to solve this problem with a loop, however the table is very large and the calculation on the complete data set just didn't want to end.
user  time      count  diff(time[s])
A     10:00:00  1
A     10:30:00  2      1800
A     12:00:00  3      5400
A     13:00:00  4      3600
B     14:00:00  1
C     15:00:00  1
C     16:00:00  2      3600
C     17:00:00  3      3600
I would do it using the plyr package, which makes life a lot easier when it comes to data wrangling. There are ways to do this and other transformations in base R, but it's a mess of different functions with inconsistent interfaces.
library(plyr)
ddply(df, .(user), transform,
      count = seq_along(time),
      diff  = c(NA, diff(time)))
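A runnable sketch of the same idea. The data frame below reconstructs the example table; the date is arbitrary, and as.numeric(..., units = "secs") forces the differences to come out in seconds regardless of their magnitude:
library(plyr)

df <- data.frame(
  user = c("A", "A", "A", "A", "B", "C", "C", "C"),
  time = as.POSIXct(paste("2012-01-01",
           c("10:00:00", "10:30:00", "12:00:00", "13:00:00",
             "14:00:00", "15:00:00", "16:00:00", "17:00:00")))
)

ddply(df, .(user), transform,
      count = seq_along(time),                               # transaction number per user
      diff  = c(NA, as.numeric(diff(time), units = "secs"))) # NA for each user's first row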

R Table modification

How do I take the average of several entries in a column whose corresponding entries in another column are the same?
For instance, I have a large table with, say, 3 columns, two of them being time and prices, and the values under the time column repeat: say 10:30 appears 4 times. I would then need to take the average of the corresponding price entries and collapse them into a single row for 10:30 with a single price. Can someone provide me some insights?
Sample data:
time prices size
10:00 23 1
10:15 12 3
10:30 12 1
10:30 19 4
10:45 12 1
I would like to modify rows 3 and 4 merging into a single row, averaging the prices.
How about something like
tapply(prices, time, mean)
For a more complete picture, see ?tapply
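For example, with the question's sample data in a data frame myDF (as constructed in the edit below), with() lets you refer to the columns directly:
with(myDF, tapply(prices, time, mean))
which returns the mean price for each distinct time.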
But what would you like to do with the column size?
EDIT:
To take the mean of prices and the last value of size, here's one suggestion:
myDF <- data.frame(time   = c("10:00", "10:15", "10:30", "10:30", "10:45"),
                   prices = c(23, 12, 12, 19, 12),
                   size   = c(1, 3, 1, 4, 1))
theRows <- tapply(seq_len(nrow(myDF)), myDF$time, function(x) {
  data.frame(time   = head(myDF[x, "time"], 1),
             prices = mean(myDF[x, "prices"]),
             size   = tail(myDF[x, "size"], 1))
})
Reduce(function(...) rbind(..., deparse.level = FALSE), theRows)
p.s. This can be done very well using ddply -- see Paul's answer, too!
You could also take a look at the plyr package. I would use ddply for this:
ddply(df, .(time), summarise,
      mean_price = mean(prices),
      sum_size   = sum(size))
This assumes your data is in df. For a more elaborate description of plyr, please take a look at this paper in the Journal of Statistical Software.
Other alternatives include using data.table or ave.
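For completeness, a sketch of the data.table alternative, reusing myDF from above (mean of prices and sum of size per time):
library(data.table)
dt <- as.data.table(myDF)
dt[, list(mean_price = mean(prices), sum_size = sum(size)), by = time]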

Data aggregation loop in R

I am facing a problem concerning aggregating my data to daily data.
I have a data frame where NAs have been removed (Link of picture of data is given below). Data has been collected 3 times a day, but sometimes due to NAs, there is just 1 or 2 entries per day; some days data is missing completely.
I am now interested in calculating the daily mean of "dist": this means summing up the data of "dist" of one day and dividing it by number of entries per day (so 3 if there is no data missing that day). I would like to do this via a loop.
How can I do this with a loop? The problem is that sometimes I have 3 entries per day and sometimes just 2 or even 1. I would like to tell R that for every day, it should sum up "dist" and divide it by the number of entries that are available for every day.
I just have no idea how to formulate a for loop for this purpose. I would really appreciate if you could give me any advice on that problem. Thanks for your efforts and kind regards,
Jan
Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html
Edit: I used aggregate and tapply as suggested; however, the output was grouped by timestamp rather than by day:
Group.1 x
1 2006-10-06 12:00:00 636.5395
2 2006-10-06 20:00:00 859.0109
3 2006-10-07 04:00:00 301.8548
4 2006-10-07 12:00:00 649.3357
5 2006-10-07 20:00:00 944.8272
6 2006-10-08 04:00:00 136.7393
7 2006-10-08 12:00:00 360.9560
8 2006-10-08 20:00:00 NaN
The code used was:
dates <- Dis_sub$date
distance <- Dis_sub$dist
aggregate(distance, list(dates), mean, na.rm = TRUE)
tapply(distance, dates, mean, na.rm = TRUE)
Don't use a loop. Use R. Some example data :
dates <- rep(seq(as.Date("2001-01-05"),
as.Date("2001-01-20"),
by="day"),
each=3)
values <- rep(1:16,each=3)
values[c(4,5,6,10,14,15,30)] <- NA
and any of :
aggregate(values, list(dates), mean, na.rm = TRUE)
tapply(values, dates, mean, na.rm = TRUE)
gives you what you want. See also ?aggregate and ?tapply.
If you want a data frame back, you can look at the package plyr:
Data <- data.frame(dates, values)
require(plyr)
ddply(Data, "dates", summarise, mean = mean(values, na.rm = TRUE))
Keep in mind that ddply does not fully support the Date format (yet).
Look at the data.table package, especially if your data is huge. Here is some code that calculates the mean of dist by day.
library(data.table)
dt <- data.table(Data)
dt[, list(avg_dist = mean(dist, na.rm = TRUE)), by = 'date']
It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date using something like
Dis_sub$date_only <- as.Date(Dis_sub$date)
Then using Joris Meys' solution (which is the right way to do it) should work.
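Putting the two steps together, a sketch assuming Dis_sub has a date-time column date and a numeric column dist:
Dis_sub$date_only <- as.Date(Dis_sub$date)
# rows with NA dist are dropped by the formula interface,
# so each daily mean uses only the entries actually present
aggregate(dist ~ date_only, data = Dis_sub, FUN = mean)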
However, if for some reason you really want to use a loop, you could try something like
newFrame <- data.frame()
for (d in unique(Dis_sub$date_only)) {
  meanDist <- mean(Dis_sub$dist[Dis_sub$date_only == d], na.rm = TRUE)
  newFrame <- rbind(newFrame, data.frame(date = d, avg_dist = meanDist))
}
But keep in mind that this will be slow and memory-inefficient.
