Flexible calculations in data frames - r

I have a little problem with R, and my skills are somewhat limited.
I want to conduct two calculations in a data frame, each based on the previous row.
The first is a count variable; additionally, I want to calculate the difference between the current and the previous line.
I think the easiest way to clarify my problem is a small example:
Imagine the following table below, which consists of only two columns. user is a customer number and time is the time of a transaction of the particular user.
Now I want to create two new columns as specified in the example table:
The counter variable count, which simply numbers each user's transactions in order, indicating which of that user's transactions the row represents.
The variable diff (time [s]), which is the time difference [in seconds] between the current transaction and the previous one. Thus something like time[i] - time[i-1], but the calculation must restart from zero for each new user; obviously no time difference can be calculated for a user's first transaction.
I've tried to solve this problem with a loop; however, the table is very large and the calculation on the complete data set just didn't want to end.
user  time      count  diff(time[s])
A     10:00:00      1
A     10:30:00      2           1800
A     12:00:00      3           5400
A     13:00:00      4           3600
B     14:00:00      1
C     15:00:00      1
C     16:00:00      2           3600
C     17:00:00      3           3600

I would do it using the plyr package, which makes life a lot easier when it comes to data wrangling. There are ways to do this and other transformations in base R, but it's a mess of different functions with inconsistent interfaces.
library(plyr)
ddply(df, .(user), transform,
      count = seq_along(time),  # running transaction number per user
      diff  = c(NA, as.numeric(diff(time), units = "secs")))  # assumes time is POSIXct
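For reference, the same result in base R with ave is not too bad either; a minimal sketch, assuming df$time has already been converted to POSIXct:
df$count <- ave(seq_along(df$user), df$user, FUN = seq_along)   # per-user running count
df$diff  <- ave(as.numeric(df$time), df$user,                   # seconds since epoch
                FUN = function(x) c(NA, diff(x)))               # NA for each user's first row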

Related

how to iterate based on a condition, and assign aggregated value to a row in new dataframe in R

I have a large dataset of stock prices with 203,615 rows and 2 columns (price and Timestamp), in the format below:
price(USD)  Timestamp
3.5         2014-01-01 20:00:00
2           2014-01-01 20:15:00
5           2014-01-01 20:15:00
...
4           2014-01-31 23:00:00
5           2014-01-31 23:00:00
4.5         2014-01-31 23:00:00
2.3         2014-01-31 23:00:00
The timestamps range from "2014-01-01 20:00:00" to "2014-01-31 23:00:00" in 15-minute intervals (rounded to 15 min), and there are several transactions with the same timestamp.
I have to group rows into windows one day wide, calculate the min, max, and mean of the price as well as the number of rows within each window, and assign these to a row in a new data frame, iterating from the starting date ("2014-01-02 20:00:00") until the end timestamp ("2014-01-31 23:00:00") is reached.
Note: the iteration has to advance in 15-minute steps.
I have tried a while loop. Please help me with this, and suggest any packages I could use.
This is my own code, which I used to create a window of time (the prior 24 hours) to iterate over and compute min and max values for a project I was working on.
inter is the interval I worked on in the loop;
raw is the data frame name;
i is the specific row from which the datetime column was selected in raw.
I started my intervals at the 97th row (i in 97:nrow(raw)) because the stamps were taken at 15-minute intervals and I wanted a 24-hour backward window, so I needed to leave 96 intervals to pull from. I could not reach back into time I had no data for, so I started far enough into my data to leave room for those intervals.
for (i in 97:nrow(raw)) {
  inter <- raw$datetime[i] - as.difftime(24, units = 'hours')
  temp  <- raw[raw$datetime >= inter & raw$datetime <= raw$datetime[i], ]  # rows in the prior-24h window
  raw$deltaAirTemp_24[i] <- max(temp$Air.Temperature) - min(temp$Air.Temperature)
}
The key is getting into a real date-time format. Run str() on the field with the dates; if it comes back as anything but Factor, use:
as.POSIXct(yourdate$field, format = "%Y-%m-%d %H:%M:%S")
If str(yourdatecolumn) reports a Factor, wrap the field in as.character() first, i.e. as.POSIXct(as.character(yourdate$field), format = "%Y-%m-%d %H:%M:%S"), so the date is not coerced to its level number rather than a time.
Get the dates into a consistent format, then construct something like the above to extract the periods you need. difftime is in the base package and works well; you can use positive and negative intervals with it. I hope this helps!
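Applied to the price data in the question, the same windowing idea might look like the sketch below; the data frame name prices and its columns price and Timestamp (already converted with as.POSIXct) are assumptions, not names from the original post:
prices$n_rows <- NA; prices$p_min <- NA; prices$p_max <- NA; prices$p_mean <- NA
for (i in seq_len(nrow(prices))) {
  # rows falling in the one-day window ending at row i's timestamp
  win <- prices$Timestamp >  prices$Timestamp[i] - as.difftime(24, units = "hours") &
         prices$Timestamp <= prices$Timestamp[i]
  prices$n_rows[i] <- sum(win)
  prices$p_min[i]  <- min(prices$price[win])
  prices$p_max[i]  <- max(prices$price[win])
  prices$p_mean[i] <- mean(prices$price[win])
}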

Finding the Min & Max Times for Multiple Individuals

For work I have a report where I compile the number of calls, emails, and texts a person makes each day. Along with this I need to pick out the earliest (min) and latest (max) times for each of those actions. I'm wondering if there isn't an easier way for me to pull this data from the date column rather than scrolling down for each person and finding the information.
You are right, there is definitely an easier way. What we need to rely on is that Excel stores dates and times as numbers: days since 0 January 1900 (so 1 January 1900 is day 1), with times as fractions of a day (14:00, for instance, is stored as 14/24 ≈ 0.583). Therefore, finding the earliest and latest times is simply a matter of finding the min and max values within a specific day.
I'm assuming that your data is set out as in the following. If it isn't, you can just edit my formula as appropriate.
   A       B        C
1  Person  Date     Time
2  Steve   Monday   14:00
3  Steve   Monday   14:05
4  Sharon  Monday   12:00
5  Steve   Tuesday  09:00
6  Sharon  Tuesday  15:00
What we need to do is find the minimum time for Steve, given that the date is Monday. We need to use an array formula: array formulas let us 'look up' against more than one range at the same time. The formula I would use is:
=MIN(IF(A2:A6="Steve",IF(B2:B6="Monday",C2:C6)))
Instead of pressing Enter when you use this formula, you need to press Ctrl+Shift+Enter; i.e., you enter the formula above, press Ctrl+Shift+Enter, and Excel will display:
{=MIN(IF(A2:A6="Steve",IF(B2:B6="Monday",C2:C6)))}
Can you see how to add more constraints to the lookup? You can also make the 'Steve' and 'Monday' references point to cells, rather than hardcoding them into the formula, which scales better for a bigger table.
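For example, the latest time for the same person and day just swaps MIN for MAX (again entered with Ctrl+Shift+Enter):
=MAX(IF(A2:A6="Steve",IF(B2:B6="Monday",C2:C6)))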

R Studio aborting with time series data [duplicate]

I have an example dataframe:
a <- c(1:6)
b <- c("05/12/2012 05:00","05/12/2012 06:00","06/12/2012 05:00",
"06/12/2012 06:00", "07/12/2012 09:00","07/12/2012 07:00")
c <-c("0","0","0","1","1","1")
df1 <- data.frame(a,b,c,stringsAsFactors = FALSE)
Firstly, I want to make sure R recognises the date and time format, so I used:
df1$b <- strptime(df1$b, "%d/%m/%Y %H:%M")
However, this can't be right, as R always aborts my session as soon as I try to view the new data frame.
Assuming that this gets resolved, I want a subset of the data for whichever day in the data frame contains the most non-zero values in 'c'. In the above example I should be left with the two data points on 7 Dec 2012.
I also have an additional, related question.
If I want to be left with a subset of the data with the most non zero values between a certain time period in the day (say between 07:00 and 08:00), how would I go about doing this?
Any help on the above problems would be greatly appreciated.
Well, the good news is that I have an answer for you, and the bad news is that you have more questions to ask yourself. First the bad news: you need to consider how you want to treat multiple days that have the same number of non-zero values for 'c'. I'm not going to address that in this answer.
Now the good news: this is really simple.
Step 1: First, let's reformat your data frame. Since we're changing data types on a couple of the variables (b to date-time and c to numeric), we need to create a new data frame or overwrite the old one. I prefer to preserve the original and create a new one, like so:
a <- df1$a
b <- strptime(df1$b, "%d/%m/%Y %H:%M")
c <- as.numeric(df1$c)
hour <- as.numeric(format(b, "%H"))
date <- format(b, "%x")
df2 <- data.frame(a, b, c, hour, date)
# a b c hour date
# 1 1 2012-12-05 05:00:00 0 5 12/5/2012
# 2 2 2012-12-05 06:00:00 0 6 12/5/2012
# 3 3 2012-12-06 05:00:00 0 5 12/6/2012
# 4 4 2012-12-06 06:00:00 1 6 12/6/2012
# 5 5 2012-12-07 09:00:00 1 9 12/7/2012
# 6 6 2012-12-07 07:00:00 1 7 12/7/2012
Notice that I also added 'hour' and 'date' variables. This is to make our data easily sortable by those fields for our later aggregation function.
Step 2: Now, let's calculate how many non-zero values there are for each day between the hours of 06:00 and 08:00. Since we're using the 'hour' values, this means hours 6 and 7 (representing 06:00-07:59).
library(plyr)
df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate, non_zero=sum(c))
# a b c hour date non_zero
# 1 2 2012-12-05 06:00:00 0 6 12/5/2012 0
# 2 4 2012-12-06 06:00:00 1 6 12/6/2012 1
# 3 6 2012-12-07 07:00:00 1 7 12/7/2012 1
The plyr package is wonderful for things like this. The ddply function specifically takes a data frame as both input and output (hence the "dd"), and the mutate function allows us to preserve all the data while adding columns. In this case, we want a sum of c for each day in .(date). Subsetting our data by the hours is taken care of in the data argument df2[df2$hour %in% 6:7,], which selects the rows where the hour value is in the set {6,7}.
Step 3: The final step is just to subset the data by the max number of non-zero values. We can drop the extra columns we used and go back to our original three.
subset_df <- df2[df2$non_zero==max(df2$non_zero),1:3]
# a b c
# 2 4 2012-12-06 06:00:00 1
# 3 6 2012-12-07 07:00:00 1
Good luck!
Update: At the OP's request, I am writing a new 'ddply' function that will also include a time column for plotting.
df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate, non_zero=sum(c), plot_time=as.numeric(format(b, "%H")) + as.numeric(format(b, "%M")) / 60)
subset_df <- df2[df2$non_zero==max(df2$non_zero),c("a","b","c","plot_time")]
We need to collapse the time down into one continuous variable, so I chose hours. Leaving any data in a time format will require us to fiddle with it later, and using a string format (like "hh:mm") will limit the types of functions you can use on it. Continuous numbers are the most flexible, so here we get the number of hours, as.numeric(format(b, "%H")), and add the number of minutes divided by 60, as.numeric(format(b, "%M")) / 60, to convert the minutes into units of hours; 07:30, for example, becomes 7 + 30/60 = 7.5. Also, since we're dealing with more columns, I've switched the final subset statement to name the columns we want, rather than referring to their numbers. Once I'm dealing with columns that aren't in continuous order, I find that using names is easier to debug.
Agreeing with Jack. Sounds like a corrupted installation of R. The first thing to try would be to delete the .RData file that holds the results of the prior session. It is hidden on both Mac and Windows, so unless you reveal the 'dotfiles' (system files), the OS file manager (Finder.app or Windows Explorer) will not show it. How you find and delete that file is an OS-specific task. It will be in your working directory, and you will need to do the deletion outside of R, since once R is started it will have locked access to it. It's also possible to get a corrupt .Rhistory file, but in my experience that is not usually the source of the problem.
If that is not successful, you may need to reinstall R.

How to plot every second timestep? [r]

There must be a very easy way to do this, but I don't know what it is...
As the title says, I would like to know how I can plot every second timestep of a time series in R. For example, I have half-hourly data but I only want to plot the data on the hour, e.g. I have
10:00 0
10:30 1
11:00 2
11:30 3
12:00 4
I just want to plot
10:00 0
11:00 2
12:00 4
Something like
plot(x[seq_along(x) %% 2 == 1])
?
Edit: I don't know how you are plotting your data set above, but however you're doing it, you can subset your data as follows
hourlydata <- fulldata[seq(nrow(fulldata)) %% 2 == 1, ]
If you give more details someone might tell you how to figure out which time values are hourly rather than relying (as here) on the fact that they are the odd-numbered rows ...
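For example, if the timestamps live in a POSIXct column, a sketch that keys off the minutes directly rather than the row positions (the names fulldata$time and fulldata$value are assumptions):
hourly <- fulldata[format(fulldata$time, "%M") == "00", ]  # keep only on-the-hour rows
plot(hourly$time, hourly$value, type = "l")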
Slightly less verbose and not quite as clear as Ben's solution, but you can use vector recycling with logical indexing to achieve this (as long as you're just interested in every other observation):
# Extract the data you want (assuming you want to keep the
# first observation, skip the second, and so on)
newdat <- x[c(TRUE, FALSE)]  # the TRUE/FALSE pattern recycles along x
plot(newdat)

Data aggregation loop in R

I am facing a problem concerning aggregating my data to daily values.
I have a data frame from which NAs have been removed (a link to a picture of the data is given below). Data was collected 3 times a day, but sometimes, due to NAs, there are only 1 or 2 entries per day; some days the data is missing completely.
I am now interested in calculating the daily mean of "dist": summing up the "dist" values of one day and dividing by the number of entries for that day (3 if no data is missing that day). I would like to do this via a loop.
The problem is that sometimes I have 3 entries per day and sometimes just 2 or even 1. I would like to tell R that, for every day, it should sum up "dist" and divide it by the number of entries available for that day. I just have no idea how to formulate a for loop for this purpose.
I would really appreciate any advice on this problem. Thanks for your efforts and kind regards,
Jan
Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html
Edit: I used aggregate and tapply as suggested; however, the means were calculated per timestamp rather than per day:
Group.1 x
1 2006-10-06 12:00:00 636.5395
2 2006-10-06 20:00:00 859.0109
3 2006-10-07 04:00:00 301.8548
4 2006-10-07 12:00:00 649.3357
5 2006-10-07 20:00:00 944.8272
6 2006-10-08 04:00:00 136.7393
7 2006-10-08 12:00:00 360.9560
8 2006-10-08 20:00:00 NaN
The code used was:
dates<-Dis_sub$date
distance<-Dis_sub$dist
aggregate(distance,list(dates),mean,na.rm=TRUE)
tapply(distance,dates,mean,na.rm=TRUE)
Don't use a loop. Use R. Some example data :
dates <- rep(seq(as.Date("2001-01-05"),
as.Date("2001-01-20"),
by="day"),
each=3)
values <- rep(1:16,each=3)
values[c(4,5,6,10,14,15,30)] <- NA
and any of:
aggregate(values,list(dates),mean,na.rm=TRUE)
tapply(values,dates,mean,na.rm=TRUE)
gives you what you want. See also ?aggregate and ?tapply.
If you want a data frame back, you can look at the plyr package:
Data <- data.frame(dates, values)
require(plyr)
ddply(Data, "dates", summarise, mean = mean(values, na.rm = TRUE))
Keep in mind that ddply does not fully support the date format (yet).
Look at the data.table package, especially if your data is huge. Here is some code that calculates the mean of dist by day:
library(data.table)
dt <- data.table(Dis_sub)
dt[, list(avg_dist = mean(dist, na.rm = TRUE)), by = 'date']
It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date, using something like
Dis_sub$date_only <- as.Date(Dis_sub$date)
Then using Joris Meys' solution (which is the right way to do it) should work.
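Putting those two steps together, a minimal sketch using the column names from the question:
Dis_sub$date_only <- as.Date(Dis_sub$date)
aggregate(dist ~ date_only, data = Dis_sub, FUN = mean, na.rm = TRUE)  # one mean per day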
However, if for some reason you really want to use a loop, you could try something like
newFrame <- data.frame()
for (d in unique(Dis_sub$date_only)) {
  meanDist <- mean(Dis_sub$dist[Dis_sub$date_only == d], na.rm = TRUE)
  newFrame <- rbind(newFrame, data.frame(date = d, meanDist = meanDist))
}
But keep in mind that growing a data frame row by row like this is slow and memory-inefficient.
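If the loop must stay, pre-allocating the result avoids the repeated copying that rbind causes; a sketch under the same column-name assumptions:
newFrame <- data.frame(date = unique(Dis_sub$date_only), meanDist = NA_real_)
for (j in seq_len(nrow(newFrame))) {
  d <- newFrame$date[j]
  # mean of all 'dist' entries recorded on day d
  newFrame$meanDist[j] <- mean(Dis_sub$dist[Dis_sub$date_only == d], na.rm = TRUE)
}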
