Data aggregation loop in R - r

I am facing a problem concerning aggregating my data to daily data.
I have a data frame where NAs have been removed (a link to a picture of the data is given below). Data has been collected 3 times a day, but sometimes, due to NAs, there are just 1 or 2 entries per day; on some days data is missing completely.
I am now interested in calculating the daily mean of "dist": this means summing up the data of "dist" of one day and dividing it by number of entries per day (so 3 if there is no data missing that day). I would like to do this via a loop.
How can I do this with a loop? The problem is that sometimes I have 3 entries per day and sometimes just 2 or even 1. I would like to tell R that for every day, it should sum up "dist" and divide it by the number of entries that are available for every day.
I just have no idea how to formulate a for loop for this purpose. I would really appreciate if you could give me any advice on that problem. Thanks for your efforts and kind regards,
Jan
Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html
Edit: I used aggregate and tapply as suggested, however, the mean value of the data was not really calculated:
Group.1 x
1 2006-10-06 12:00:00 636.5395
2 2006-10-06 20:00:00 859.0109
3 2006-10-07 04:00:00 301.8548
4 2006-10-07 12:00:00 649.3357
5 2006-10-07 20:00:00 944.8272
6 2006-10-08 04:00:00 136.7393
7 2006-10-08 12:00:00 360.9560
8 2006-10-08 20:00:00 NaN
The code used was:
dates<-Dis_sub$date
distance<-Dis_sub$dist
aggregate(distance,list(dates),mean,na.rm=TRUE)
tapply(distance,dates,mean,na.rm=TRUE)

Don't use a loop. Use R. Some example data:
dates <- rep(seq(as.Date("2001-01-05"),
                 as.Date("2001-01-20"),
                 by="day"),
             each=3)
values <- rep(1:16,each=3)
values[c(4,5,6,10,14,15,30)] <- NA
and either of:
aggregate(values,list(dates),mean,na.rm=TRUE)
tapply(values,dates,mean,na.rm=TRUE)
gives you what you want. See also ?aggregate and ?tapply.
If you want a dataframe back, you can look at the package plyr :
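Data <- data.frame(dates, values)
library(plyr)
# summarise computes one mean per date; passing mean directly to ddply would fail on a data frame piece
ddply(Data, "dates", summarise, mean_value = mean(values, na.rm = TRUE))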
Keep in mind that ddply is not fully supporting the date format (yet).

Look at the data.table package especially if your data is huge. Here is some code that calculates the mean of dist by day.
library(data.table)
# Data is assumed to be a data frame with the columns date and dist
dt <- data.table(Data)
dt[, list(avg_dist = mean(dist, na.rm = TRUE)), by = date]

It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date using something like
Dis_sub$date_only <- as.Date(Dis_sub$date)
Then using Joris Meys' solution (which is the right way to do it) should work.
However if for some reason you really want to use a loop you could try something like
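days <- unique(Dis_sub$date_only)
newFrame <- data.frame()
for (i in seq_along(days)) {
  d <- days[i]  # index into days so that d keeps its Date class
  meanDist <- mean(Dis_sub$dist[Dis_sub$date_only == d], na.rm = TRUE)
  # bind a one-row data frame; c(d, meanDist) would turn the date into a plain number
  newFrame <- rbind(newFrame, data.frame(date = d, meanDist = meanDist))
}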
But keep in mind that this will be slow and memory-inefficient.
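For comparison with the loop, here is a minimal sketch of the non-loop route: strip the time from the timestamps, then aggregate by day. The small Dis_sub data frame below is made up for illustration (the dist values are borrowed from the output in the question), and the UTC time zone is an assumption.

```r
# Hypothetical stand-in for the asker's data: timestamps with times attached.
Dis_sub <- data.frame(
  date = as.POSIXct(c("2006-10-06 12:00:00", "2006-10-06 20:00:00",
                      "2006-10-07 04:00:00", "2006-10-07 12:00:00",
                      "2006-10-07 20:00:00"), tz = "UTC"),
  dist = c(636.5395, 859.0109, 301.8548, 649.3357, 944.8272)
)

# Keep only the calendar day, so all readings of one day share a group key.
Dis_sub$date_only <- as.Date(Dis_sub$date, tz = "UTC")

# One row per day; each mean is divided by however many entries that day has.
daily <- aggregate(dist ~ date_only, data = Dis_sub, FUN = mean)
print(daily)
```

Because the grouping key no longer contains the time of day, days with one, two or three readings are all handled the same way.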

Related

Time series analysis applicability?

I have a sample data frame like this (date column format is mm-dd-YYYY):
date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6
I want to convert this data frame into time series using ts(), but the problem is: the current data frame has multiple values for the same date. Can we apply time series in this case?
Can I convert data frame into time series, and build a model (ARIMA) which can forecast count value on a daily basis?
OR should I forecast count value based on grp, but in that case, I have to select only grp and count column of a data frame. So in that case, I have to skip date column, and daily forecast for count value is not possible?
Suppose I want to aggregate the count value on a per-day basis. I tried the aggregate function, but there we have to specify the date value, and I have a very large data set. Is any other option available in R?
Can somebody, please, suggest if there is a better approach to follow? My assumption is that time series forecasting works only for bivariate data. Is this assumption right?
It seems like there are two aspects of your problem:
i want to convert this data frame into time series using ts(), but the
problem is- current data frame having multiple values for the same
date. can we apply time series in this case?
If you are happy making use of the xts package you could attempt:
dta2$date <- as.Date(dta2$date, "%m-%d-%Y")
dtaXTS <- xts::as.xts(dta2[,2:3], dta2$date)
which would result in:
> head(dtaXTS)
           count grp
2009-01-09    54   1
2009-01-09   100   2
2009-01-09   546   3
2009-01-10    67   4
2009-01-11    80   5
2009-01-11    45   6
of the following classes:
> class(dtaXTS)
[1] "xts" "zoo"
You could then use your time series object as a univariate time series, by referring to the selected variable, or as a multivariate time series. An example using the PerformanceAnalytics package:
PerformanceAnalytics::chart.TimeSeries(dtaXTS)
Side points
Concerning your second question:
can somebody plz suggest me what is the better approach to follow, my
assumption is time series forcast is works only for bivariate data? is
this assumption also right?
IMHO, this is rather broad. I would suggest that you use the created xts object and elaborate on the model you want to utilise and why. If it's a conceptual question about the nature of time series analysis, you may prefer to post your follow-up question on CrossValidated.
Data sourced via: dta2 <- read.delim(pipe("pbpaste"), sep = "") using the provided example.
Since daily forecasts are wanted we need to aggregate to daily. Using DF from the Note at the end, read the first two columns of data into a zoo series z using read.zoo and argument aggregate=sum. We could optionally convert that to a "ts" series (tser <- as.ts(z)) although this is unnecessary for many forecasting functions. In particular, checking out the source code of auto.arima we see that it runs x <- as.ts(x) on its input before further processing. Finally run auto.arima, forecast or other forecasting function.
library(forecast)
library(zoo)
z <- read.zoo(DF[1:2], format = "%m-%d-%Y", aggregate = sum)
auto.arima(z)
forecast(z)
Note: DF is given reproducibly here:
Lines <- "date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6"
DF <- read.table(text = Lines, header = TRUE)
Updated: Revised after re-reading question.
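On the aggregation part of the question: the formula interface of aggregate groups by whatever dates occur in the data, so no date value has to be spelled out. A minimal base-R sketch using the sample data from the question (the name daily is just illustrative):

```r
Lines <- "date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6"
DF <- read.table(text = Lines, header = TRUE, stringsAsFactors = FALSE)

# The question states that the dates are in mm-dd-YYYY format.
DF$date <- as.Date(DF$date, format = "%m-%d-%Y")

# One total per calendar day; the grouping values come from the data itself.
daily <- aggregate(count ~ date, data = DF, FUN = sum)
print(daily)
```

The result has one row per day and could then be turned into a zoo or ts object for forecasting.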

converting hours:minutes with uneven column lengths - zeros

I am trying to convert a data.frame that holds amounts of time in the format hours:minutes.
I found this post useful and like the simple code approach of using the POSIXlt field type.
R: Convert hours:minutes:seconds
However, each column represents a month's worth of days, so the columns are uneven. When I try the code below, following several other SO posts, I get zeros in the one column with fewer row values.
The code is below. Note that when run, you get all zeros for feb, which has fewer data values in its rows.
rDF <- data.frame(jan=c("9:59","10:02","10:04"),
feb=c("9:59","10:02",""),
mar=c("9:59","10:02","10:04"),stringsAsFactors = FALSE)
for (i in 1:3) {
Res <- as.POSIXlt(paste(Sys.Date(), rDF[,i]))
rDF[,i] <- Res$hour + Res$min/60
}
Thank you for any suggestions to fix this issue. I'm open to a more efficient approach as well.
Best,
Leah
You could try using the package lubridate. Here we are converting your data column by column to hour-minute format (using hm), then extracting the hours, and adding the minutes divided by 60:
library(lubridate)
rDF[] <- lapply(rDF, function(x){hm(x)$hour + hm(x)$minute/60})
        jan       feb       mar
1  9.983333  9.983333  9.983333
2 10.033333 10.033333 10.033333
3 10.066667        NA 10.066667
This could easily be achieved with package lubridate's hm:
library(lubridate)
temp<-lapply(rDF,hm)
NewDF<-data.frame(jan=temp[[1]],feb=temp[[2]],mar=temp[[3]])
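A base-R alternative, as a minimal sketch. The thread does not explain the zeros; a plausible reading (an assumption) is that as.POSIXlt, when given no format, settles on a single format for the whole vector, and the entry without a time pushes the feb column to a date-only format. Passing an explicit format avoids that guess and turns the empty entry into NA. The helper name to_hours is hypothetical.

```r
# Sample data from the question; the empty string marks the missing February entry.
rDF <- data.frame(jan = c("9:59", "10:02", "10:04"),
                  feb = c("9:59", "10:02", ""),
                  mar = c("9:59", "10:02", "10:04"),
                  stringsAsFactors = FALSE)

# to_hours is a hypothetical helper name, not part of any package.
to_hours <- function(x) {
  # With an explicit format, entries that do not match (such as "") become NA.
  parsed <- strptime(x, format = "%H:%M", tz = "UTC")
  parsed$hour + parsed$min / 60
}

# Apply the helper to every column and rebuild the data frame.
res <- as.data.frame(lapply(rDF, to_hours))
print(res)
```

Each entry is parsed against the same fixed format, so a short column no longer affects how the other entries in it are read.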

Subsetting dataframe by day according to most non zero data

I have an example dataframe:
a <- c(1:6)
b <- c("05/12/2012 05:00","05/12/2012 06:00","06/12/2012 05:00",
"06/12/2012 06:00", "07/12/2012 09:00","07/12/2012 07:00")
c <-c("0","0","0","1","1","1")
df1 <- data.frame(a,b,c,stringsAsFactors = FALSE)
Firstly, I want to make sure R recognises the date and time format, so I used:
df1$b <- strptime(df1$b, "%d/%m/%Y %H:%M")
However this can't be right as R always aborts my session as soon as I try to view the new dataframe.
Assuming that this gets resolved, I want to get a subset of the data according to whichever day in the dataframe contains the most data in 'c' that is not a zero. In the above example I should be left with the two data points on 7th Dec 2012.
I also have an additional, related question.
If I want to be left with a subset of the data with the most non zero values between a certain time period in the day (say between 07:00 and 08:00), how would I go about doing this?
Any help on the above problems would be greatly appreciated.
Well, the good news is that I have an answer for you, and the bad news is that you have more questions to ask yourself. First the bad news: you need to consider how you want to treat multiple days that have the same number of non-zero values for 'c'. I'm not going to address that in this answer.
Now the good news: this is really simple.
Step 1: First, let's reformat your data frame. Since we're changing data types on a couple of the variables (b to datetime and c to numeric), we need to create a new data frame or recalibrate the old one. I prefer to preserve the original and create a new one, like so:
a <- df1$a
b <- strptime(df1$b, "%d/%m/%Y %H:%M")
c <- as.numeric(df1$c)
hour <- as.numeric(format(b, "%H"))
date <- format(b, "%x")
df2 <- data.frame(a, b, c, hour, date)
#   a                   b c hour      date
# 1 1 2012-12-05 05:00:00 0    5 12/5/2012
# 2 2 2012-12-05 06:00:00 0    6 12/5/2012
# 3 3 2012-12-06 05:00:00 0    5 12/6/2012
# 4 4 2012-12-06 06:00:00 1    6 12/6/2012
# 5 5 2012-12-07 09:00:00 1    9 12/7/2012
# 6 6 2012-12-07 07:00:00 1    7 12/7/2012
Notice that I also added 'hour' and 'date' variables. This is to make our data easily sortable by those fields for our later aggregation function.
Step 2: Now, let's calculate how many non-zero values there are for each day between the hours of 06:00 and 08:00. Since we're using the 'hour' values, this means the values of '6' and '7' (represents 06:00 - 07:59).
library(plyr)
df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate, non_zero=sum(c))
#   a                   b c hour      date non_zero
# 1 2 2012-12-05 06:00:00 0    6 12/5/2012        0
# 2 4 2012-12-06 06:00:00 1    6 12/6/2012        1
# 3 6 2012-12-07 07:00:00 1    7 12/7/2012        1
The 'plyr' package is wonderful for things like this. The 'ddply' function specifically takes data frames as both input and output (hence the "dd"), and the 'mutate' function allows us to preserve all the data while adding additional columns. In this case, we want a sum of 'c' for each day in .(date); because 'c' only holds 0s and 1s here, that sum equals the number of non-zero values. Subsetting our data by the hours is taken care of in the data argument df2[df2$hour %in% 6:7,], which says to show us the rows where the hour value is in the set {6,7}.
Step 3: The final step is just to subset the data by the max number of non-zero values. We can drop the extra columns we used and go back to our original three.
subset_df <- df2[df2$non_zero==max(df2$non_zero),1:3]
#   a                   b c
# 2 4 2012-12-06 06:00:00 1
# 3 6 2012-12-07 07:00:00 1
Good luck!
Update: At the OP's request, I am writing a new 'ddply' function that will also include a time column for plotting.
df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate, non_zero=sum(c), plot_time=as.numeric(format(b, "%H")) + as.numeric(format(b, "%M")) / 60)
subset_df <- df2[df2$non_zero==max(df2$non_zero),c("a","b","c","plot_time")]
We need to collapse the time down into one continuous variable, so I chose hours. Leaving any data in a time format will require us to fiddle with stuff later, and using a string format (like "hh:mm") will limit the types of functions you can use on it. Continuous numbers are the most flexible, so here we get the number of hours as.numeric(format(b, "%H")) and add it to the number of minutes divided by 60 as.numeric(format(b, "%M")) / 60 to convert the minutes into units of hours. Also, since we're dealing with more columns, I've switched the final subset statement to name the columns we want, rather than referring to the numbers. Once I'm dealing with columns that aren't in continuous order, I find that using names is easier to debug.
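For the first part of the question (the day with the most non-zero values, with no hour filter), a minimal base-R sketch follows. It stores the timestamps as POSIXct rather than the POSIXlt that strptime returns, and it assumes UTC; the names day, non_zero, best_day and result are illustrative only. Tied days are all kept.

```r
df1 <- data.frame(
  a = 1:6,
  b = c("05/12/2012 05:00", "05/12/2012 06:00", "06/12/2012 05:00",
        "06/12/2012 06:00", "07/12/2012 09:00", "07/12/2012 07:00"),
  c = c("0", "0", "0", "1", "1", "1"),
  stringsAsFactors = FALSE
)

# POSIXct is a plain numeric vector underneath, which data frames handle well.
df1$b <- as.POSIXct(df1$b, format = "%d/%m/%Y %H:%M", tz = "UTC")
df1$c <- as.numeric(df1$c)
df1$day <- as.Date(df1$b, tz = "UTC")

# Count the non-zero entries per day; c != 0 also works when c is not just 0/1.
non_zero <- tapply(df1$c != 0, df1$day, sum)

# Keep every day that reaches the maximum count.
best_day <- names(non_zero)[non_zero == max(non_zero)]
result <- df1[as.character(df1$day) %in% best_day, c("a", "b", "c")]
print(result)
```

With the example data this leaves the two rows from 7 December 2012, which is what the question asks for.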
Agreeing with Jack. Sounds like a corrupted installation of R. The first thing to try would be to delete the .RData file that holds the results of the prior session. It is hidden on both Mac and Windows, so unless you "reveal" the 'dotfiles' (system files), the OS file manager (Finder.app or Windows Explorer) will not show it. How you find and delete that file is an OS-specific task. It's going to be in your working directory, and you will need to do the deletion outside of R, since once R is started it will have locked access to it. It's also possible to get a corrupt .Rhistory file, but in my experience that is not usually the source of the problem.
If that is not successful, you may need to reinstall R.

R Table modification

How do I take the average of the entries in one column whose corresponding entries in another column are the same?
For instance, I have a large table with, say, 3 columns, time and prices being 2 of them, and let's say the values under the time column repeat. If 10:30 appears 4 times, I would need to take the average of the corresponding price column entries and summarize them in a single row for 10:30 with a single price. Can someone provide me some insights?
Sample data:
time prices size
10:00 23 1
10:15 12 3
10:30 12 1
10:30 19 4
10:45 12 1
I would like to modify rows 3 and 4 merging into a single row, averaging the prices.
How about something like
tapply(prices, time, mean)
For a more complete picture, see ?tapply
But what would you like to do with the column size?
EDIT:
To take the mean of prices and the last value of size, here's one suggestion:
myDF<-data.frame(time=c("10:00","10:15","10:30","10:30","10:45"),
prices=c(23,12,12,19,12),size=c(1,3,1,4,1))
theRows <- tapply(seq_len(nrow(myDF)), myDF$time, function(x) {
return(data.frame(time = head(myDF[x, "time"],1), prices = mean(myDF[x, "prices"]),
size = tail(myDF[x, "size"], 1)))
}
)
Reduce(function(...) rbind(..., deparse.level = FALSE), theRows)
p.s. This can be done very well using ddply -- see Paul's answer, too!
You could also take a look at the plyr package. I would use ddply for this:
ddply(df, .(time), summarise,
mean_price = mean(prices),
sum_size = sum(size))
This assumes your data is in df and that plyr has been loaded with library(plyr). For a more elaborate description of plyr, please take a look at this paper in the Journal of Statistical Software.
Other alternatives include using data.table, or ave.
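A base-R alternative without extra packages, as a minimal sketch: the formula interface of aggregate averages prices within each distinct time. The sample data is the one from the question; avg is an illustrative name, and what to do with size is left open here.

```r
myDF <- data.frame(time   = c("10:00", "10:15", "10:30", "10:30", "10:45"),
                   prices = c(23, 12, 12, 19, 12),
                   size   = c(1, 3, 1, 4, 1),
                   stringsAsFactors = FALSE)

# One row per distinct time, with the mean of the matching prices.
avg <- aggregate(prices ~ time, data = myDF, FUN = mean)
print(avg)
```

The two 10:30 rows collapse into a single row whose price is the mean of 12 and 19.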

R: Aggregating by dates with POSIXct?

I have some zoo series that use POSIXct index.
In order to aggregate by days I've tried these two ways:
aggregate(myzoo,format((index((myzoo)),"%Y-%m-%d")),sum)
aggregate(myzoo,as.Date(index(myzoo)),sum)
I don't know why they don't give the same output.
myzoo series had the weekends removed. The "as.Date way" seems to be OK but the "format way" aggregation gives me data on the weekends.
Why?
Which one is the right?
I've even tried it with as.POSIXct(format(...)).
As I mentioned in my comment, you need to be careful when changing the format of a timestamp that includes time with a time zone, because it can get shifted between days. Without any data, it's hard to say exactly what your problem is, but you might also try apply.daily from xts:
apply.daily(myzoo, sum)
Here's a working example:
> library(xts)
> x <- zoo(2:20, as.POSIXct("2003-02-01") + (2:20) * 7200)
> apply.daily(x, sum)
2003-02-01 22:00:00 2003-02-02 16:00:00
                 65                 144
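The day shift between time zones can be reproduced with a single timestamp. This is a minimal sketch with a made-up time and an assumed time zone (Europe/Berlin): format uses the time zone stored on the object, while as.Date converts in whatever tz it is given, and UTC is passed explicitly here to show the effect.

```r
# Half past midnight on a Monday in Berlin is still Sunday evening in UTC.
t <- as.POSIXct("2003-02-03 00:30:00", tz = "Europe/Berlin")

day_by_format <- format(t, "%Y-%m-%d")             # uses the object's own time zone
day_by_utc    <- as.Date(t, tz = "UTC")            # converts to UTC first
day_by_local  <- as.Date(t, tz = "Europe/Berlin")  # matches format()

print(day_by_format)
print(day_by_utc)
print(day_by_local)
```

If the two aggregation calls in the question resolve the day in different time zones, observations near midnight can land on different days, which would explain weekend dates appearing in only one of the results. Passing the same tz to both is one way to check.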
