Selecting Specific Dates in R - r

I am wondering how to create a subset of data in R based on a list of dates, rather than by a date range.
For example, I have the following data set data which contains 3 years of 6-minute data.
date zone month day year hour minute temp speed gust dir
1 09/06/2009 00:00 PDT 9 6 2009 0 0 62 2 15 156
2 09/06/2009 00:06 PDT 9 6 2009 0 6 62 13 16 157
I have used breeze<-subset(data, ws>=15 & wd>=247.5 & wd<=315, select=date:dir) to select the rows which meet my criteria for a sea breeze, which is fine, but what I want to do is create a subset of the days which contain those times that meet my criteria.
I have used...
as.character(breeze$date)
trimdate<-strtrim(breeze$date, 10)
breezedate<-as.Date(trimdate, "%m/%d/%Y")
breezedate<-format(breezedate, format="%m/%d/%Y")
...to extract the dates from each row that meets my criteria so I have a variable called breezedate that contains a list of the dates that I want (not the most eloquent coding to do this, I'm sure). There are about two-hundred dates in the list. What I am trying to do with the next command is in my original dataset data to create a subset which contains only those days which meet the seabreeze criteria, not just the specific times.
breezedays<-(data$date==breezedate)
I think one of my issues here is that I am comparing one value to a list of values, but I am not sure how to make it work.

Lets assume your breezedate list looks like this and data$date is simple string:
breezedate <- as.Date(c("2009-09-06", "2009-10-01"))
This is probably want you want:
breezedays <- data[as.Date(data$date, '%m/%d/%Y') %in% breezedate]

The intersect() function (docs) will allow you to compare one data frame to another and return those records that are the same.
To use, run the following:
breezedays <- intersect(data$date,breezedate) # returns into breezedays all records that are shared between data$date and breezedate

Related

How to match data to two conditions in a loop?

I'm having trouble building a data table that matches numbers based on two conditions (ID and date). Below is an example of a table snippet containing batch data.
ID
Power
Fuel
Starting_date
Shutting_down_date
El_Bel
344
WB
1983
2030
El_Opo
256
WK
1987
2027
El_Tur
400
WB
2019
2049
The question is how do I effectively match this data so that the data in the "Power" column is matched until the last year of operation by column "Shutting_down_date" is reached.
Date
El_Bel
El_Opo
El_Tur
2017
2018
2019
2020
2021
Many thanks for any suggestions.
Let us call the first dataframe x and the second data frame y and let us further assume that each ID only occurs once in the first table. The problem is that you have a different number of years for each ID which means that they cannot be stored in a data.frame (requires all columns to have the same length). You can use a list, though:
result <- list()
for (i in 1:nrow(x)) {
id <- x[i,"ID"]
end_date <- x[i,"Shutting_down_date"]
result[[id]] <- subset(y[,c("Date",id)], Date <= end_date)
}
Then you can query the results as result[["El_Bel"]] or result$El_Bel etc.

creating columns of monthly averages in R

I have a dataframe in R where each row corresponds to a household. One column describes a date in 2010 when that household planted crops. The remainder of the dataset contains over 1000 columns describing the temperature on every day between 2007-2010 for those households.
This is the basic form:
Date 2007-01-01 2007-01-02 2007-01-03
1 2010-05-01 70 72 61
2 2010-02-10 63 59 73
3 2010-03-06 60 59 81
I need to create columns for each household that describe the monthly mean temperatures of the two months following their planting date in each of the three years prior to 2010.
For instance: if a household planted on 2010-05-01, I would need the following columns:
mean temp of 2007-05-01 through 2007-06-01
mean temp of 2007-06-02 through 2007-07-01
mean temp of 2008-05-01 through 2008-06-01
...
mean temp of 2009-06-02 through 2009-07-01
I skipped two columns, but you get the idea. Specific code would be most helpful, but in general, I am just looking for a way to pull data from specific columns based upon a date that is described by another column.
Hi #bricevk you could use the apply function. It allows you to use a function over a data either column-wise or row-wise.
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/apply
Say your data is in a object df. It applies the mean function over the columns of df . Giving you the column-wise mean. The 2 indicates the columns. This wpuld the daily average, assuming each column, is a day.
Averages <- apply(df,2,mean)
If I didn't answer this the way you would like perhaps I have not really understood your dataset. Could you try explain it more clearly?
I suggest you to use tidyverse. However, in order to be compatible with this universe, you firstly have to make your data standard, ie tidy. In your example, the things would be easier if you transformed your data in order to have your observations ordrered by rows, and columns being variables. If I correctly understood your data, you have households planting trees (the row names are dates of plantation ?), and then controls with temperature. I'd do something like :
-----------------------------------------------------------------------------
| Household ID | planting date | Date of control | Temperature controlled |
-----------------------------------------------------------------------------
firstly, have your planting date stored as another thing than a rowname, by example :
library(dplyr)
df <- tibble::rownames_to_column(data, "PlantingDate")
You also have to get your household id var you haven't specified to us.
Then you can manage to have the tidy data with tidyr, using
library(tidyr)
df <- gather(df,"DateOfControl","Temperature",-c(PlantingDate,ID))
When you'll have that, you'll be able to use the package lubridate, something like
library(lubridate)
df %>%
group_by(ID,PlantingDate,year(ControlDate),month(ControlDate)) %>%
summarise(MeanT=mean(Temperature))
could work

Time series analysis applicability?

I have a sample data frame like this (date column format is mm-dd-YYYY):
date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6
I want to convert this data frame into time series using ts(), but the problem is: the current data frame has multiple values for the same date. Can we apply time series in this case?
Can I convert data frame into time series, and build a model (ARIMA) which can forecast count value on a daily basis?
OR should I forecast count value based on grp, but in that case, I have to select only grp and count column of a data frame. So in that case, I have to skip date column, and daily forecast for count value is not possible?
Suppose if I want to aggregate count value on per day basis. I tried with aggregate function, but there we have to specify date value, but I have a very large data set? Any other option available in r?
Can somebody, please, suggest if there is a better approach to follow? My assumption is that the time series forcast works only for bivariate data? Is this assumption right?
It seems like there are two aspects of your problem:
i want to convert this data frame into time series using ts(), but the
problem is- current data frame having multiple values for the same
date. can we apply time series in this case?
If you are happy making use of the xts package you could attempt:
dta2$date <- as.Date(dta2$date, "%d-%m-%Y")
dtaXTS <- xts::as.xts(dta2[,2:3], dta2$date)
which would result in:
>> head(dtaXTS)
count grp
2009-09-01 54 1
2009-09-01 100 2
2009-09-01 546 3
2009-10-01 67 4
2009-11-01 80 5
2009-11-01 45 6
of the following classes:
>> class(dtaXTS)
[1] "xts" "zoo"
You could then use your time series object as univariate time series and refer to the selected variable or as a multivariate time series, example using PerformanceAnalytics packages:
PerformanceAnalytics::chart.TimeSeries(dtaXTS)
Side points
Concerning your second question:
can somebody plz suggest me what is the better approach to follow, my
assumption is time series forcast is works only for bivariate data? is
this assumption also right?
IMHO, this is rather broad. I would suggest that you use created xts object and elaborate on the model you want to utilise and why, if it's a conceptual question about nature of time series analysis you may prefer to post your follow-up question on CrossValidated.
Data sourced via: dta2 <- read.delim(pipe("pbpaste"), sep = "") using the provided example.
Since daily forecasts are wanted we need to aggregate to daily. Using DF from the Note at the end, read the first two columns of data into a zoo series z using read.zoo and argument aggregate=sum. We could optionally convert that to a "ts" series (tser <- as.ts(z)) although this is unnecessary for many forecasting functions. In particular, checking out the source code of auto.arima we see that it runs x <- as.ts(x) on its input before further processing. Finally run auto.arima, forecast or other forecasting function.
library(forecast)
library(zoo)
z <- read.zoo(DF[1:2], format = "%m-%d-%Y", aggregate = sum)
auto.arima(z)
forecast(z)
Note: DF is given reproducibly here:
Lines <- "date count grp
01-09-2009 54 1
01-09-2009 100 2
01-09-2009 546 3
01-10-2009 67 4
01-11-2009 80 5
01-11-2009 45 6"
DF <- read.table(text = Lines, header = TRUE)
Updated: Revised after re-reading question.

R- How to subset data within a time range without dates provided?

I am a brand new R user. I have a very large amount of data over a range of time, with time in the first column increasing by 1/32 second increments. I want to extract the section of data that falls within a specific time range; for example, all of the data between 12:11:08PM and 12:11:11PM. However, the times in this column do not have any dates. Therefore, I could not figure out how to apply lubridate, POSIXct, or any other time functions since they all required dates. Is there a way for me to subset my data with a time only function? Thank you for your time.
time id result
12:11:08 10 200
12:11:09 11 276
12:11:10 12 398
12:11:11 13 299
12:11:12 14 192
12:11:13 15 392
The problem is that POSIXct/POSIXlt stores time with an origin from "1970-01-01" with you timezone. So originally if you set the
as.POSIXct("12:11:10",format="%H:%M:%S")
it returns a full date format
[1] "2016-05-24 12:11:10 EEST"
You can easily subset time using default date settings (in this example function shows records after the "12:11:10"
df<-data.frame(time=c("12:11:08","12:11:09","12:11:10","12:11:11","12:11:12","12:11:13"),
id=c("10", "11","12","13","14","15"),
result=c("200","276","398","299","192","392"))
df[as.POSIXct(df$time,format="%H:%M:%S")>as.POSIXct("12:11:10",format="%H:%M:%S"),]
time id result
4 12:11:11 13 299
5 12:11:12 14 192
6 12:11:13 15 392
Another great solution is to use chrone library
library(chrone)
df[chron(times=df$time)>chron(times="12:11:10"), ]

Subsetting dataframe by day according to most non zero data

I have an example dataframe:
a <- c(1:6)
b <- c("05/12/2012 05:00","05/12/2012 06:00","06/12/2012 05:00",
"06/12/2012 06:00", "07/12/2012 09:00","07/12/2012 07:00")
c <-c("0","0","0","1","1","1")
df1 <- data.frame(a,b,c,stringsAsFactors = FALSE)
Firstly, I want to make sure R recognises the date and time format, so I used:
df1$b <- strptime(df1$b, "%d/%m/%Y %H:%M")
However this can't be right as R always aborts my session as soon as I try to view the new dataframe.
Assuming that this gets resolves, I want to get a subset of the data according to whichever day in the dataframe contains the most data in 'C' that is not a zero. In the above example I should be left with the two data points on 7th Dec 2012.
I also have an additional, related question.
If I want to be left with a subset of the data with the most non zero values between a certain time period in the day (say between 07:00 and 08:00), how would I go about doing this?
Any help on the above problems would be greatly appreciated.
Well, the good news is that I have an answer for you, and the bad news is that you have more questions to ask yourself. First the bad news: you need to consider how you want to treat multiple days that have the same number of non-zero values for 'c'. I'm not going to address that in this answer.
Now the good news: this is really simple.
Step 1: First, let's reformat your data frame. Since we're changing data types on a couple of the variables (b to datetime and c to numeric), we need to create a new data frame or recalibrate the old one. I prefer to preserve the original and create a new one, like so:
a <- df1$a
b <- strptime(df1$b, "%d/%m/%Y %H:%M")
c <- as.numeric(df1$c)
hour <- as.numeric(format(b, "%H"))
date <- format(b, "%x")
df2 <- data.frame(a, b, c, hour, date)
# a b c hour date
# 1 1 2012-12-05 05:00:00 0 5 12/5/2012
# 2 2 2012-12-05 06:00:00 0 6 12/5/2012
# 3 3 2012-12-06 05:00:00 0 5 12/6/2012
# 4 4 2012-12-06 06:00:00 1 6 12/6/2012
# 5 5 2012-12-07 09:00:00 1 9 12/7/2012
# 6 6 2012-12-07 07:00:00 1 7 12/7/2012
Notice that I also added 'hour' and 'date' variables. This is to make our data easily sortable by those fields for our later aggregation function.
Step 2: Now, let's calculate how many non-zero values there are for each day between the hours of 06:00 and 08:00. Since we're using the 'hour' values, this means the values of '6' and '7' (represents 06:00 - 07:59).
library(plyr)
df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate, non_zero=sum(c))
# a b c hour date non_zero
# 1 2 2012-12-05 06:00:00 0 6 12/5/2012 0
# 2 4 2012-12-06 06:00:00 1 6 12/6/2012 1
# 3 6 2012-12-07 07:00:00 1 7 12/7/2012 1
The 'plyr' package is wonderful for things like this. The 'ddply' package specifically takes data frames as both input and output (hence the "dd"), and the 'mutate' function allows us to preserve all the data while adding additional columns. In this case, we're wanting a sum of 'c' for each day in .(date). Subsetting our data by the hours is taken care of in the data argument df2[df2$hour %in% 6:7,], which says to show us the rows where the hour value is in the set {6,7}.
Step 3: The final step is just to subset the data by the max number of non-zero values. We can drop the extra columns we used and go back to our original three.
subset_df <- df2[df2$non_zero==max(df2$non_zero),1:3]
# a b c
# 2 4 2012-12-06 06:00:00 1
# 3 6 2012-12-07 07:00:00 1
Good luck!
Update: At the OP's request, I am writing a new 'ddply' function that will also include a time column for plotting.
df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate, non_zero=sum(c), plot_time=as.numeric(format(b, "%H")) + as.numeric(format(b, "%M")) / 60)
subset_df <- df2[df2$non_zero==max(df2$non_zero),c("a","b","c","plot_time")]
We need to collapse the time down into one continuous variable, so I chose hours. Leaving any data in a time format will require us to fiddle with stuff later, and using a string format (like "hh:mm") will limit the types of functions you can use on it. Continuous numbers are the most flexible, so here we get the number of hours as.numeric(format(b, "%H")) and add it to the number of minutes divided by 60 as.numeric(format(b, "%M")) / 60 to convert the minutes into units of hours. Also, since we're dealing with more columns, I've switched the final subset statement to name the columns we want, rather than referring to the numbers. Once I'm dealing with columns that aren't in continuous order, I find that using names is easier to debug.
Agreeing with Jack. Sounds like a corrupted installation of R. First thing to try would be to delete the .Rdata file that holds the results of the prior session. They are hidden in both Mac and Windows so unless you "reveal" the 'dotfiles'(system files), the OS file manager (Finder.app and Windows Explorer) will not show them. How you find and delete that file is OS-specific task. It's going to be in your working directory and you will need to do the deletion outside of R since once R is started it will have locked access to it. It's also possible to get a corrupt .history file but in my experience that is not usually the source of the problem.
If that is not successful, you may need to reinstall R.

Resources