Linking characters from one data.frame to other datasets - r

I have a data.frame with two columns. The first column contains various specific times during a day. The second column contains the animal behavior (behavior period) that I observed at each specific time:
Time; Behavior
10:20; feeding
10:25; feeding
10:30; resting
...
For each of those behavior periods I have an additional dataset (TimeSeries) which contains data about the actual animal movement (output from a movement sensor). Each TimeSeries has about 100 rows:
Time; Var1; Var2
10:20:01; 1345; 5232
10:20:02; 1423; 5271
...
Now I would like to link each TimeSeries with the behavior from the first dataset, so that R knows that "feeding" is related to the TimeSeries of 10:20 and 10:25, that "resting" is related to the TimeSeries of 10:30, and so on.
Afterwards I want to use this "knowledge" to calculate mean and sd from each TimeSeries. So I will have all the means and sd's from all TimeSeries for each behavior.

It is not clear whether your times are currently characters, factors, POSIXct, variables, etc. So you should first convert them (possibly in a new column) to a numeric variable, something like the number of seconds since midnight. Functions like strptime, difftime, and as.numeric may help.
Add a column to the first data frame that is just 1:nrow(firstdf). Then add a column to the second dataframe that is computed by the findInterval function:
seconddf$newcol <- findInterval( seconddf$seconds, firstdf$seconds )
Now you can merge the 2 data frames on the new columns and the finer grained times will be associated with the activity from the most recent time.
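To make the steps above concrete, here is a minimal sketch. The example rows, the helper to_seconds, and the column names seconds and period are assumptions for illustration, not part of the original data. It also assumes the times in the first data frame are sorted and that no sensor reading precedes the first behavior time (findInterval would return 0 for such rows).

```r
# Hypothetical example data mirroring the layout in the question
firstdf <- data.frame(
  Time = c("10:20", "10:25", "10:30"),
  Behavior = c("feeding", "feeding", "resting"),
  stringsAsFactors = FALSE
)
seconddf <- data.frame(
  Time = c("10:20:01", "10:20:02", "10:25:01", "10:25:02", "10:30:01", "10:30:02"),
  Var1 = c(1345, 1423, 1500, 1510, 900, 910),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "HH:MM" or "HH:MM:SS" -> seconds since midnight
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    if (length(p) == 2) p <- c(p, 0)  # no seconds given, assume :00
    sum(p * c(3600, 60, 1))
  }, numeric(1))
}

firstdf$seconds <- to_seconds(firstdf$Time)
seconddf$seconds <- to_seconds(seconddf$Time)

# Number the behavior periods, then assign each sensor row to the
# most recent period start at or before its own time
firstdf$period <- seq_len(nrow(firstdf))
seconddf$period <- findInterval(seconddf$seconds, firstdf$seconds)

# Attach the behavior to every sensor row
linked <- merge(seconddf, firstdf[, c("period", "Behavior")], by = "period")

# Mean and sd of Var1 for each TimeSeries, labelled with its behavior
per_series <- aggregate(Var1 ~ period + Behavior, data = linked,
                        FUN = function(v) c(mean = mean(v), sd = sd(v)))
```

Here per_series has one row per TimeSeries, so the means and sd's can afterwards be grouped by the Behavior column.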

Related

function in R that creates dummies for given time period

There is a data frame like this:
The first two columns in the df describe the start date (month and year) and the end date (month and year). Column names describe every single month and year of a certain time period.
I need a function/loop that inserts "1" or "0" in each cell - "1" when the date from the given column name is within the period described by the first two columns, and "0" if not.
I would appreciate any help.
You want to do two different things. (a) create a dummy variable and (b) see if a particular date is in an interval.
Making a dummy variable is the easiest one, in base R you can use ifelse. For example in the iris data frame:
iris$dummy <- ifelse(iris$Sepal.Width > 2.5, 1, 0)
Now working with dates is more complicated. In this answer we will use the library lubridate. First you need to convert all those dates from the format 'Month Year' to something that R can understand. For example for February you could do:
new_format_february_2016 <- interval(ymd('2016-02-01'), ymd('2016-03-01') - dseconds(1))
#[1] 2016-02-01 UTC--2016-02-29 23:59:59 UTC
This is February, the interval of time from the 1st of February to one second before the 1st of March. You can do the same with your start date column and your end date column.
To compare two intervals of time (so, to see if a particular month falls into your other intervals) you can do:
int_overlaps(new_format_february_2016, other_interval)
If this returns TRUE, the two intervals (one particular month and another one) overlap. This is not the same as one being inside the other, but in your case it will work. Using this you can iterate over the different columns and rows and build your dummy variable.
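As a rough sketch of that iteration, the following uses plain Date comparisons in base R instead of lubridate's int_overlaps; the overlap test is the same idea, only spelled out by hand. The data frame df, its "YYYY-MM" string format, and the helper to_month_start are assumptions, since the original data frame is not shown.

```r
# Hypothetical data: one row per period, months written as "YYYY-MM"
df <- data.frame(
  start = c("2016-01", "2016-03"),
  end   = c("2016-02", "2016-05"),
  stringsAsFactors = FALSE
)
# Hypothetical month columns to fill with 0/1
month_cols <- c("2016-01", "2016-02", "2016-03", "2016-04", "2016-05", "2016-06")

# Hypothetical helper: represent each month by its first day
to_month_start <- function(x) as.Date(paste0(x, "-01"))

p_start <- to_month_start(df$start)
p_end   <- to_month_start(df$end)
m_start <- to_month_start(month_cols)

# A month gets 1 when it lies between the start month and the end month
# (both inclusive), otherwise 0; one column per month, one row per period
dummies <- sapply(seq_along(m_start), function(j) {
  as.integer(m_start[j] >= p_start & m_start[j] <= p_end)
})
colnames(dummies) <- month_cols

result <- cbind(df, dummies)
```

With a single-row data frame sapply would return a plain vector rather than a matrix, so a real implementation would need to handle that case.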
But before doing so, I would recommend cleaning your data, as your current format is complicated to work with. To get all the power that vector types in R provide, ideally you would want to have one row per observation and one variable per column. This does not seem to be the case with your data frame. Take a look at the chapter 'Tidy data' of 'R for Data Science', especially the spreading and gathering subsection:
Tidy data

How to match dates in 2 data frames in R, then sum specific range of values up to that date?

I have two data frames: rainfall data collected daily and nitrate concentrations of water samples collected irregularly, approximately once a month. I would like to create a vector of values for each nitrate concentration that is the sum of the previous 5 days' rainfall. Basically, I need to match the nitrate date with the rain date, sum the previous 5 days' rainfall, then print the sum with the nitrate data.
I think I need to either make a function, a for loop, or use tapply to do this, but I don't know how. I'm not an expert at any of those, though I've used them in simple cases. I've searched for similar posts, but none get at this exactly. This one deals with summing by factor groups. This one deals with summing each possible pair of rows. This one deals with summing by aggregate.
Here are 2 example data frames:
# rainfall df
mm<- c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0)
date<- c(1:15)
rain <- data.frame(cbind(mm, date))
# b/c sums of rainfall depend on correct chronological order, make sure the data are in order by date.
rain <- rain[order(rain$date), ]
# nitrate df
nconc <- c(15, 12, 14, 20, 8.5) # nitrate concentration
ndate<- c(6,8,11,13,14)
nitrate <- data.frame(cbind(nconc, ndate))
I would like to have a way of finding the matching rainfall date for each nitrate measurement, such as:
match(nitrate$date[i] %in% rain$date)
(Note: Will match work with as.Date dates?) And then sum the preceding 5 days' rainfall (not including the measurement date), such as:
sum(rain$mm[j-6:j-1]
And print the sum in a new column in nitrate:
print(nitrate$mm_sum[i])
To make sure it's clear what result I'm looking for, here's how to do the calculation 'by hand'. The first nitrate concentration was collected on day 6, so the sum of rainfall on days 1-5 is 5mm.
Many thanks in advance.
You were more or less there!
nitrate$prev_five_rainfall = NA
for (i in seq_along(nitrate$ndate)) {
  day = nitrate$ndate[i]
  nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
}
Step by step explanation:
Initialize an empty result column:
nitrate$prev_five_rainfall = NA
For each line in the nitrate df: (i = 1,2,3,4,5)
for (i in seq_along(nitrate$ndate)) {
Grab the day we want the final result for:
day = nitrate$ndate[i]
Take the rainfall sum over the five preceding days and put it in the results column:
nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
Close the for loop :)
}
Disclaimer: This answer is basic in that:
It will break if nitrate's ndate < 6
It will be incorrect if some dates are missing in the rain dataframe
It will be slow on larger data
As you get more experience with R, you might use data manipulation packages like dplyr or data.table for these types of manipulations.
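One way to soften the first two caveats without extra packages is to select rainfall rows by their date value instead of by row position. This is only a sketch on the example data from the question, not a replacement for the loop above; the name in_window is made up for illustration.

```r
# Example data from the question
rain <- data.frame(
  mm   = c(0, 0, 0, 0, 5,  0, 0, 2, 0, 0,  10, 0, 0, 0, 0),
  date = 1:15
)
nitrate <- data.frame(
  nconc = c(15, 12, 14, 20, 8.5),
  ndate = c(6, 8, 11, 13, 14)
)

# For each sampling day d, sum the rain that fell on days d-5 .. d-1.
# Rows are picked by comparing date values, so gaps in rain$date or
# sampling days early in the record give a (possibly partial) sum
# rather than an indexing error.
nitrate$prev_five_rainfall <- vapply(nitrate$ndate, function(d) {
  in_window <- rain$date >= d - 5 & rain$date <= d - 1
  sum(rain$mm[in_window])
}, numeric(1))
```

For the first sample (day 6) this gives 5, which matches the 'by hand' calculation in the question.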
@nelsonauner's answer does all the heavy lifting. But one thing to note: in my actual data my dates are not numerical like they are in the example above. They are dates listed as MM/DD/YYYY, converted with the appropriate as.Date(nitrate$date, "%m/%d/%Y").
I found that the for loop above gave me all zeros for nitrate$prev_five_rainfall and I suspected it was a problem with the dates.
So I changed my dates in both data sets to numerical using the difference in number of days between a common start date and the recorded date, so that the for loop would look for a matching number of days in each data frame rather than a date. First, make a column of the start date using rep_len() and format it:
nitrate$startdate <- rep_len("01/01/1980", nrow(nitrate))
nitrate$startdate <- as.Date(nitrate$startdate, "%m/%d/%Y")
Then, calculate the difference using difftime():
nitrate$diffdays <- as.numeric(difftime(nitrate$date, nitrate$startdate, units="days"))
Do the same for the rain data frame. Finally, the for loop looks like this:
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$diffdays)) {
day = nitrate$diffdays[i]
nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)]) # 5 days
}
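As an alternative sketch, the Date columns can also be compared directly, which avoids converting to day counts and does not depend on the row positions in rain lining up with the day numbers. The dates below are made up for illustration; only the MM/DD/YYYY format comes from the text above.

```r
# Hypothetical data with real dates in MM/DD/YYYY format
rain <- data.frame(
  date = seq(as.Date("01/01/2015", "%m/%d/%Y"), by = "day", length.out = 15),
  mm   = c(0, 0, 0, 0, 5,  0, 0, 2, 0, 0,  10, 0, 0, 0, 0)
)
nitrate <- data.frame(
  date  = as.Date(c("01/06/2015", "01/08/2015", "01/11/2015",
                    "01/13/2015", "01/14/2015"), "%m/%d/%Y"),
  nconc = c(15, 12, 14, 20, 8.5)
)

nitrate$prev_five_rainfall <- NA_real_
for (i in seq_len(nrow(nitrate))) {
  d <- nitrate$date[i]
  # Date arithmetic works in whole days: d - 5 is five days before d
  in_window <- rain$date >= d - 5 & rain$date < d
  nitrate$prev_five_rainfall[i] <- sum(rain$mm[in_window])
}
```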

How do I convert timestamps to Julian time with millisecond precision using R?

I have a dataset that contains a column of timestamps. For example "16-Feb-2015 17:41:36.666" and "16-Feb-2015 17:41:36.700" are the elements in the first two rows of this column. I am reading from a data stream where a timestamp is collected at a rate of 30 instances per second. How do I convert these timestamps to their corresponding Julian time with millisecond precision? R reads this column as type factor. I would like it to be numeric. I have tried using as.character and as.numeric, but I sense there is a more specific way to handle timestamps.
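One possible approach, sketched under a few assumptions: the timestamps are in UTC, the session uses an English or C locale (so that "Feb" matches %b), and "Julian time" means the astronomical Julian Date. The variable names are made up. A factor column would first need as.character().

```r
ts <- c("16-Feb-2015 17:41:36.666", "16-Feb-2015 17:41:36.700")

# %OS reads the seconds including their fractional part
parsed <- as.POSIXct(strptime(ts, "%d-%b-%Y %H:%M:%OS", tz = "UTC"))

# Seconds since 1970-01-01 00:00:00 UTC, fractional part preserved
secs <- as.numeric(parsed)

# Julian Date: 1970-01-01 00:00 UTC corresponds to JD 2440587.5
julian_date <- secs / 86400 + 2440587.5

# Show milliseconds when printing date-times
options(digits.secs = 3)
```

If "Julian" instead means fractional days since some origin, secs / 86400 already gives days since 1970-01-01. Keeping the values as seconds is the safer representation when millisecond differences matter, because a double holding a full Julian Date has less resolution to spare.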

R: How to lag xts column by one day of the set

Imagine an intra-day set of data, e.g. hourly intervals. Thanks to Google and valuable Joshua's answers to other people, I managed to create new columns in the xts object carrying DAILY Open/High/Low/Close values. These are daily values applied on intra-day intervals so all rows of the same day have the same value in particular column. Since the HLC values are look-ahead biased, I want to move them to the next day. Let's focus on just one column called Prev.Day.Close.
Actual status:
My Prev.Day.Close column carries proper values for the current day. All "2010-01-01 ??:??" rows have the same value - Close of 2010-01-01 trading session. So at the moment it is not the PREVIOUS day, as the column name says.
What I need:
Lag the Prev.Day.Close column to the NEXT DAY OF THE SET.
I cannot lag it using lag() because it works on row (not day) basis. It must not be fixed calendar day like:
C <- ave(x$Close, .indexday(x), FUN = last)
index(C) <- index(C) + 86400
x$Prev.Day.Close <- C
Because this solution does not care about real data in the set. For example it adds new rows because the original data set has holes on weekends and holidays. Moreover, two particular days may not have the same number of intervals (rows) so the shifted data will not fit.
Desired result:
All rows of the first day in the set have NA in Prev.Day.Close because there is no previous day to get data from.
All rows of the second day have the same value in Prev.Day.Close - Any of the values I actually have in Prev.Day.Close of previous day.
The same for every next row.
If I understand correctly, here's one way to do it:
require(xts)
# sample data
dt <- .POSIXct(seq(1, 86400*4, 3600), tz="UTC")-1
x <- xts(seq_along(dt), dt)
# get the last value for each calendar day
daily.last <- apply.daily(x, last)
# merge the last value of the day with the original data set
y <- merge(x, daily.last)
# now lag the last value of the day and carry the NA forward
# y$daily.last <- na.locf(lag(y$daily.last))
y$daily.last <- lag(y$daily.last)
y$daily.last <- na.locf(y$daily.last)
Basically, you want to get the end of day values, merge them with the original data, then lag them. That will align the previous end of day values with the beginning of the day.

Creating a single timestamp from separate DAY OF YEAR, Year and Time columns in R

I have a time series dataset for several meteorological variables. The time data is logged in three separate columns:
Year (e.g. 2012)
Day of year (e.g. 261 representing 17-September in a Leap Year)
Hrs:Mins (e.g. 1610)
Is there a way I can merge the three columns to create a single timestamp in R? I'm not very familiar with how R deals with the Day of Year variable.
Thanks for any help with this!
It looks like the timeDate package can handle gregorian time frames. I haven't used it personally but it looks straightforward. There is a shift argument in some methods that allow you to set the offset from your data.
http://cran.r-project.org/web/packages/timeDate/timeDate.pdf
Because you mentioned it, I thought I'd show the actual code to merge together separate columns. When you have the values you need in separate columns you can use paste to bring them together and lubridate::mdy to parse them.
library(lubridate)
col.month <- "Jan"
col.year <- "2012"
col.day <- "23"
date <- mdy(paste(col.month, col.day, col.year, sep = "-"))
Lubridate is a great package, here's the official page: https://github.com/hadley/lubridate
And here is a nice set of examples: http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
You should get quite far using ISOdatetime. This function takes vectors of year, month, day, hour, minute, and second as input and outputs a POSIXct object which represents time. You just have to split the third column into separate hour and minute columns, and turn the day of year into an offset from 1 January, and you can use the function.
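A small sketch of both routes, assuming the three columns are called Year, DOY and HrsMins (hypothetical names) and that the times are in UTC. The first route lets strptime read the day of year with %j; the second uses ISOdatetime on 1 January and adds the day offset in seconds.

```r
# Hypothetical data in the layout described in the question
met <- data.frame(
  Year    = c(2012, 2012),
  DOY     = c(261, 1),
  HrsMins = c(1610, 5)   # 1610 = 16:10, 5 = 00:05
)

# Route 1: pad the time to four digits and parse with %j (day of year)
hhmm  <- sprintf("%04d", as.integer(met$HrsMins))
stamp <- as.POSIXct(strptime(paste(met$Year, met$DOY, hhmm),
                             "%Y %j %H%M", tz = "UTC"))

# Route 2: split hours and minutes, start from 1 January, add whole days
hour   <- met$HrsMins %/% 100
minute <- met$HrsMins %% 100
stamp2 <- ISOdatetime(met$Year, 1, 1, hour, minute, 0, tz = "UTC") +
  (met$DOY - 1) * 86400
```

Adding whole days as 86400 seconds is only safe in a time zone without daylight saving shifts, which is why UTC is used here.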
