function in R that creates dummies for given time period - r

There is a data frame like this:
The first two columns in the df describe the start date (month and year) and the end date (month and year). Column names describe every single month and year of a certain time period.
I need a function/loop that insterts "1" or "0" in each cell - "1" when the date from given column name is within the period described by the two first columns, and "0" if not.
I would appreciate any help.

You want to do two different things. (a) create a dummy variable and (b) see if a particular date is in an interval.
Making a dummy variable is the easiest one, in base R you can use ifelse. For example in the iris data frame:
iris$dummy <- ifelse(iris$Sepal.Width > 2.5, 1, 0)
Now working with dates is more complicated. In this answer we will use the library lubridate. First you need to convert all those dates to a format 'Month Year' to something that R can understand. For example for February you could do:
new_format_february_2016 <- interval(ymd('2016-02-01'), ymd('2016-03-01') - dseconds(1))
#[1] 2016-02-01 UTC--2016-02-29 23:59:59 UTC
This is February, the interval of time from the 1 of February to one second before the 1 of March. You can do the same with your start date column and you end date column.
To compare two intevals of time (so, to see if a particular month fall into your other intervals) you can do:
int_overlaps(new_format_february_2016, other_interval)
If this returns true, the two intervals (one particular month and another one) overlaps. This is not the same as one being inside another, but in your case it will work. Using this you can iterate over different columns and rows and build your dummy variable.
But before doing so, I would recommend to clean your data, as your current format is complicate to work with. To get all the power that vector types in R provides ideally you would want to have one row per observation and one variable per column. This does not seem to be the case with your data frame. Take a look to the chapter 'Tidy data' of 'R for Data Science' specially the spreading and gathering subsection:
Tidy data

Related

Comparing dates in a dataframe and appending info based on comparison result in R

so I am lost with the following problem:
I have a dataframe, in which one column contains (STARTED) the starting time of a survey, and several others information of the survey schedule of that survey participant (D5 to D10: only the planned survey dates, D17 to D50: planned send-out times of measurement per day). I'd like to create to columns that indicate now which survey day (1-6) and which measurement per day (1-6) this survey corresponds to.
First problem is the format (!)...
STARTED has the format %Y-%m-%d %H:%M:%S, D5 to D10 %d.%m.%Y and D17 to D50 %d.%m.%Y %H:%M.
I tried dmy_hms() from lubridate, parse_date_time(), and simply as.POSIXct(), but I always fail to get STARTED and the D17 to D50 section into a comparable format. Any solutions on this one?
After just separating STARTED into date & time columns, I was able to compare using ifelse() with D5 to D10 and to create the column of day running from 1 to 6.
This might be already more elegant with something like which(), but I was not able to create a vectorized version of this, as which(<<D5:D10>> == STARTED) would need to compare that per row. Does anyone have a solution for this?
And lastly, how on earth can I set up the second column indicating the measurement time? The first and last survey of the is easy, as there are also uniquely labelled, but for the other four ones I would need to compare per day whether the starting time is before the planned survey time of the following survey. I could imagine just checking whether STARTED falls in between two planned survey times just next to each other - as a POSIXct object that might work, if I can parse the different formats.
Help is greatly appreciated, thanks!
A screenshot from the beginning of the data:
Screenshot from R data using View()
For these first few rows, the intended variable day would need to be c(1,2,1,1,1,2,2) and measurement c(3,2,4,2,1,2,3).
Your other columns are not formatted with %d.%m.%Y, instead either %d.%m.%t (date only) or %d.%m.%y %H:%M. Note the change from %Y to %y.
Try:
as.Date("20.05.22", format = "%d.%m.%y")
# [1] "2022-05-20"
as.POSIXct("20.05.22 06:00", format = "%d.%m.%y %H:%M")
# [1] "2022-05-20 06:00:00 EDT"

Subset a dataframe based on numerical values of a string inside a variable

I have a data frame which is a time series of meteorological measurement with monthly resolution from 1961 till 2018. I am interested in the variable that measures the monthly average temperature since I need the multi-annual average temperature for the summers.
To do this I must filter from the "DateVaraible" column the fifth and sixth digit, which are the month.
The values in time column are formatted like this
"19610701". So I need the 07(Juli) after 1961.
I start coding for 1 month for other purposes, so I did not try anything worth to mention. I guess that .grepl could do the work, but I do not know how the "matching" operator works.
So I started with this code that works.
summersmonth<- Df[DateVariable %like% "19610101" I DateVariable %like% "19610201"]
I am expecting a code like this
summermonths <- Df[DateVariable %like% "**06**" I DateVariable%like% "**07**..]
So that all entries with month digit from 06 to 09 are saved in the new dataframe summermonths.
Thanks in advance for any reply or feedback regarding my question.
Update
Thank to your answers I got the first part, which is to convert the variable in a as.date with the format "month"(Class=char)
Now I need to select months from Juni to September .
A horrible way to get the result I wanted is to do several subset and a rbind afterward.
Sommer1<-subset(Df, MonthVar == "Mai")
Sommer2<-subset(Df, MonthVar == "Juli")
Sommer3<-subset(Df, MonthVar == "September")
SummerTotal<-rbind(Sommer1,Sommer2,Sommer3)
I would be very glad to see this written in a tidy way.
Update 2 - Solution
Here is the tidy way, as here Using multiple criteria in subset function and logical operators
Veg_Seas<-subset(Df, subset = MonthVar %in% c("Mai","Juni","Juli","August","September"))
You can convert your date variable as date (format) and take the month:
allmonths <- month(as.Date(Df$DateVariable, format="%Y%m%d"))
Note that of your column has been originally imported as factor you need to convert it to character first:
allmonths <- month(as.Date(as.character(Df$DateVariable), format="%Y%m%d"))
Then you can check whether it is a summermonth:
summersmonth <- Df[allmonths %in% 6:9, ]
Example:
as.Date("20190702", format="%Y%m%d")
[1] "2019-07-02"
month(as.Date("20190702", format="%Y%m%d"))
[1] 7
We can use anydate from anytime to convert to Date class and then extract the month
library(anytime)
month(anydate(as.character(Df$DateVariable)))

Time series (xts) strptime; ONLY month and day

I've been trying to do a time series on my dataframe, and I need to strip times from my csv. This is what I've got:
campbell <-read.csv("campbell.csv")
campbell$date = strptime(campbell$date, "%m/%d")
campbell.ts <- xts(campbell[,-1],order.by=campbell[,1])
First, what I'm trying to do is just get xts to strip the dates as "xx/xx" meaning just the month and day. I have no year for my data. When I try that second line of code and call upon the date column, it converts it to "2013-xx-xx." These months and days have no year associated with them, and I can't figure out how to get rid of the 2013. (The csv file I'm calling on has the dates in the format "9/30,10/1...etc.)
Secondly, once I try and make a time series (the third line), I am unsure what the "order.by" command is calling on. What am I indexing?
Any help??
Thanks!
For strptime, you need to provide the full date, i.e. day, month and year. In case, any of these is not provided, current ones are assumed from the system's time and appended to the incomplete date. So, if you want to retain your date format as you have read it, first make a copy of that and store in a temporary variable and then use strptime over campbell$date to convert into R readable date format. Since, year is not a concern to you, you need not bother about it even though it is automatically appended by strptime.
campbell <-read.csv("campbell.csv")
date <- campbell$date
campbell$date <- strptime(campbell$date, "%m/%d")
Secondly, what you are doing by 'the third line' (xts(campbell[,-1],order.by=campbell[,1])) command is that, your are telling to order all the data of campbell except the first column (campbell[,-1]) according to the index provided by the time data in the first column of campbell (campbell[,1]). So, it would only work given the date is in the first column.
After ordering the data according to time-series, you can replace back the campbell$date column with date to get back the date format you wanted (although here, first you have to order date also like shown below)
date <- xts(date, order.by=campbell[,1]) # assuming campbell$date is campbell[,1]
campbell.ts <- xts(campbell[,-1], order.by=campbell[,1])
campbell.ts <- cbind(date, campbell.ts)
format(as.Date(campbell$dat, "%m/%d/%Y"), "%m/%d")

Linking characters from one data.frame to other datasets

I have a data.frame with two columns. The first column contains various specific times during a day. The second column contains the animal behavior (behavior period) that I observed at each specific time:
Time; Behavior
10:20; feeding
10:25; feeding
10:30; resting
...
For each of those behavior periods I have an additional dataset (TimeSeries) which contains data about the actual animal movement (output from a movement sensor). Each TimeSeries has about 100 rows:
Time; Var1; Var2
10:20:01; 1345; 5232
10:20:02; 1423; 5271
...
Now I would like to link each TimeSeries with the behavior from the first dataset. So, that R knows that "feeding" is related to the TimeSeries of 10:20 and 10:25 and that "resting" is related to the TimeSeries of 10:30 and so on.
Afterwards I want to use this "knowledge" to calculate mean and sd from each TimeSeries. So I will have all the means and sd's from all TimeSeries for each behavior.
It is not clear whether your times are currently characters, factors, POSIXct, variables, etc. So you should first convert them (possibly in a new column) to a numeric variable, something like the number of seconds since midnight. Functions like strptime, difftime, and as.numeric may help.
Add a column to the first data frame that is just 1:nrow(firstdf). Then add a column to the second dataframe that is computed by the findInterval function:
seconddf$newcol <- findInterval( seconddf$seconds, firstdf$seconds )
Now you can merge the 2 data frames on the new columns and the finer grained times will be associated with the activity from the most recent time.

Creating a single timestamp from separate DAY OF YEAR, Year and Time columns in R

I have a time series dataset for several meteorological variables. The time data is logged in three separate columns:
Year (e.g. 2012)
Day of year (e.g. 261 representing 17-September in a Leap Year)
Hrs:Mins (e.g. 1610)
Is there a way I can merge the three columns to create a single timestamp in R? I'm not very familiar with how R deals with the Day of Year variable.
Thanks for any help with this!
It looks like the timeDate package can handle gregorian time frames. I haven't used it personally but it looks straightforward. There is a shift argument in some methods that allow you to set the offset from your data.
http://cran.r-project.org/web/packages/timeDate/timeDate.pdf
Because you mentioned it, I thought I'd show the actual code to merge together separate columns. When you have the values you need in separate columns you can use paste to bring them together and lubridate::mdy to parse them.
library(lubridate)
col.month <- "Jan"
col.year <- "2012"
col.day <- "23"
date <- mdy(paste(col.month, col.day, col.year, sep = "-"))
Lubridate is a great package, here's the official page: https://github.com/hadley/lubridate
And here is a nice set of examples: http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
You should get quite far using ISOdatetime. This function takes vectors of year, day, hour, and minute as input and outputs an POSIXct object which represents time. You just have to split the third column into two separate hour minute columns and you can use the function.

Resources