Subset dataframe in r for a specific month and date - r

I have a dataframe that looks like this:
V1 V2 V3 Month_nr Date
1 2 3 1 2017-01-01
3 5 6 1 2017-01-02
6 8 9 2 2017-02-01
6 8 9 8 2017-08-01
and I want to take all variables from the data set that have Month=1 (January) and date from 2017-01-01 til 2017-01-31 (so end of January), which means that I want to take the dates as well. I would create a column with days but I have multiple observations for one day and this would be even more confusing. I tried it with this:
df<- filter(df,df$Month_nr == 1, df$Date > 2017-01-01 && df$Date < 2017-01-31)
but it did not work. I would appreciate so much your help! I am desperate at this point. My dataset has measurements for an entire year (from 1 to 12) and hence I filter for months.

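The problem is that you didn't put quotation marks around 2017-01-01. Written bare, 2017-01-01 is evaluated as the subtraction 2017 - 1 - 1, which gives the number 2015, and then you're comparing a string to a number. You can compare string to string instead: because the dates are written year-month-day with zero padding, string order matches chronological order, so comparing the dates as strings works. Also combine the conditions with the vectorised & rather than &&, since && is meant for single TRUE/FALSE values, and note that > and < exclude the boundary dates, so use >= and <= if 2017-01-01 and 2017-01-31 should be kept. BTW, you don't need to write df$ when using filter; you can write the column names directly, without quoting, when using the tidyverse.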
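To make the arithmetic point concrete, here is a minimal base R sketch. The data frame is rebuilt from a few columns of the rows in the question, and the names unquoted and january are only illustrative.

```r
# Unquoted, R parses this as arithmetic: 2017 - 1 - 1
unquoted <- 2017-01-01

# A few columns of the sample rows from the question, with Date kept as character
df <- data.frame(
  V1 = c(1, 3, 6, 6),
  Month_nr = c(1, 1, 2, 8),
  Date = c("2017-01-01", "2017-01-02", "2017-02-01", "2017-08-01"),
  stringsAsFactors = FALSE
)

# Zero-padded year-month-day strings sort chronologically, so comparing
# them as strings works. `&` is the vectorised "and"; `&&` is not.
january <- df[df$Date >= "2017-01-01" & df$Date <= "2017-01-31", ]
print(unquoted)
print(january)
```

This only relies on the strings being zero-padded and in year-month-day order; with any other layout the string comparison would no longer follow calendar order.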

Why do you need to have the month as well as the dates in the filter? Filtering on the dates alone works fine. However, you should convert the date column (called Date in your data) into a Date object first. You can do that as follows:
df$Date <- as.Date(df$Date, format = "%Y-%m-%d")
df_new <- subset(df, Date >= as.Date("2017-01-01") & Date <= as.Date("2017-01-31"))
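As a hedged sketch of the same idea on the sample rows from the question (the Date column name comes from the question; jan_by_range and jan_by_month are illustrative names), the date range alone selects January, and a year-month string derived from the date gives the same rows without a separate month column:

```r
df <- data.frame(
  V1 = c(1, 3, 6, 6),
  Month_nr = c(1, 1, 2, 8),
  Date = c("2017-01-01", "2017-01-02", "2017-02-01", "2017-08-01"),
  stringsAsFactors = FALSE
)

# Convert the character column to class Date
df$Date <- as.Date(df$Date, format = "%Y-%m-%d")

# Option 1: inclusive date range
jan_by_range <- df[df$Date >= as.Date("2017-01-01") &
                   df$Date <= as.Date("2017-01-31"), ]

# Option 2: derive "year-month" from the date itself
jan_by_month <- df[format(df$Date, "%Y-%m") == "2017-01", ]

print(jan_by_range)
```

Either form keeps every observation for a given day, so multiple measurements per day are not a problem.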

Related

Using lubridate with multiple date formats

I have a column of dates that was stored in the format 8/7/2001, 10/21/1990, etc. Two values are just four-digit years. I converted the entire column to class Date using the following code.
lubridate::parse_date_time(eventDate, orders = c('mdy', 'Y'))
It works great, except the values that were just years are converted to yyyy-01-01 and I want them to just be yyyy. Is there a way to keep lubridate from adding on any information that wasn't already there?
Edit: Code to create data frame
id = (1:5)
eventDate = c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011")
df <- data.frame(id, eventDate)
I do not think it is possible to convert your values to dates and keep the "yyyy" values intact. By transforming your "yyyy" values into "yyyy-01-01", lubridate is doing the right thing: dates have an order, and if other values in your column have days and months defined, all the values need to have these components too.
For example, take the data.frame below. If I ask R to order the table by the dates column, does the value in the first row ("2020") come before the value in the second row ("2020-02-28"), or after it? Since "2020" stands for the whole year, it could mean any day in that year, so how should R treat it? By filling in the first day of the year, lubridate defines the missing components and removes the ambiguity.
dates <- c("2020", "2020-02-28", "2020-02-20", "2020-01-10", "2020-05-12")
id <- 1:5
df <- data.frame(
id,
dates
)
id dates
1 1 2020
2 2 2020-02-28
3 3 2020-02-20
4 4 2020-01-10
5 5 2020-05-12
So if you want to keep the "yyyy" values intact, they probably should not live in your eventDate column alongside values with a different structure ("mm/dd/yyyy"). If it really is necessary to keep them intact, I think it is best to leave the eventDate column as character and store the parsed values as dates in another column, like this:
df$as_dates <- lubridate::parse_date_time(df$eventDate, orders = c('mdy', 'Y'))
id eventDate as_dates
1 1 10/7/2001 2001-10-07
2 2 1989 1989-01-01
3 3 <NA> <NA>
4 4 5/5/2016 2016-05-05
5 5 9/18/2011 2011-09-18
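If you also need to remember which rows were year-only, one possible approach is to store a flag next to the parsed date. The sketch below is base R only, so it swaps as.Date() in for lubridate::parse_date_time(); the year_only and display columns are hypothetical names, not something lubridate provides.

```r
eventDate <- c("10/7/2001", "1989", NA, "5/5/2016", "9/18/2011")
df <- data.frame(id = 1:5, eventDate, stringsAsFactors = FALSE)

# Flag values that consist of exactly four digits
df$year_only <- !is.na(df$eventDate) & grepl("^[0-9]{4}$", df$eventDate)

# Parse month/day/year values; year-only values become NA at this step
df$as_dates <- as.Date(df$eventDate, format = "%m/%d/%Y")

# Fill year-only values with 1 January so they can still be sorted
df$as_dates[df$year_only] <-
  as.Date(paste0(df$eventDate[df$year_only], "-01-01"))

# For display, show just the year where that is all that was known
df$display <- ifelse(df$year_only,
                     format(df$as_dates, "%Y"),
                     format(df$as_dates))
print(df)
```

The as_dates column stays sortable, while display and year_only preserve the fact that the day and month were never recorded.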

In R, finding the start and end dates for each interval after using diff()

I am using diff() to find the difference in variables down a column. However, I would also like to display the dates the difference is found over.
For example:
Dates <- c("2017-06-07","2017-06-10","2017-06-15","2017-07-07","2017-07-12","2017-07-18")
Variable<-c(5,6,7,8,9,3)
dd<-diff(Dates)
dv<-diff(Variable)
I'd like to find a way to add columns for the start and end date for each interval, so "06-07" as the start and "06-10" as the end date for the difference between the first 2 variables. Any ideas?
The OP has requested to add columns for the start and end date for each interval.
This can be accomplished by using the head() and tail() functions:
# data provided by OP
Dates <- c("2017-06-07","2017-06-10","2017-06-15","2017-07-07","2017-07-12","2017-07-18")
Variable<-c(5,6,7,8,9,3)
start <- head(Dates, -1) # take all Dates except the last one
end <- tail(Dates, -1L) # take all Dates except the first one
dd <- diff(as.Date(Dates)) # coercion to class Date required for date arithmetic
dv <- diff(Variable)
# create data.frame of intervals
intervals <- data.frame(start, end, dd, dv)
intervals
start end dd dv
1 2017-06-07 2017-06-10 3 days 1
2 2017-06-10 2017-06-15 5 days 1
3 2017-06-15 2017-07-07 22 days 1
4 2017-07-07 2017-07-12 5 days 1
5 2017-07-12 2017-07-18 6 days -6
Note that intervals has 5 rows while the vector of breakpoints Dates it was constructed from has a length of 6.
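A small variation, offered as a sketch rather than a correction: if the start and end columns should themselves be of class Date (for later filtering or plotting), convert once up front and take the day count as a plain number. The column name days is my own choice.

```r
Dates <- as.Date(c("2017-06-07", "2017-06-10", "2017-06-15",
                   "2017-07-07", "2017-07-12", "2017-07-18"))
Variable <- c(5, 6, 7, 8, 9, 3)

intervals <- data.frame(
  start = head(Dates, -1),                          # all dates except the last
  end   = tail(Dates, -1),                          # all dates except the first
  days  = as.numeric(diff(Dates), units = "days"),  # interval length as a number
  dv    = diff(Variable)                            # change in Variable
)
print(intervals)
```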
Are you after the difference in dates?
diff(as.Date(Dates, format = "%Y-%m-%d"))

Checking if dates are between a range [duplicate]

I have a column with start dates and end dates (plus times).
Then I'd have 31 separate columns, one for each day of the month that contains a 1 or 0 if the start and end dates encompass the day in the column.
I have converted the date values into dates using strptime. I know how to use difftime.
The bit i'm stuck on is actually doing the comparison and checking whether the start date is before or after the date of the column. e.g. i want to know if the start and end date includes the 1st of Jan, then the 2nd of Jan......if the start date is the 5th, i should return 0 for those 2 columns but I don't know how to make the comparison.
Added some sample data
Col 1 Start Date: 01/01/2015 17:00:00
Col 2 End Date: 14/01/2015 10:55:00
Col 3 Jan-01: 1
Col 3 Jan-02: 1
So in columns 3, i'd want to check if start and end date encompasses the 1st of Jan.
The start date can be at some point on the 1st of Jan, e.g. 4pm. If this is the case, i'd like column 3 to return 0.5 days.
Col 1 Start Date: 01/01/2015 00:00:00
Col 2 End Date: 01/01/2015 16:00:00
Col 3 Jan-01: 0.6667
Col 3 Jan-02: 0
Hopefully this is now more clear. I think the complexity of having time and not just returning a Boolean result means this is not a duplicate question.
Since you haven't provided any reproducible data, I have created some just to illustrate comparing two dates, giving 1 if TRUE and 0 otherwise.
Since you mentioned strptime, I am using the same here.
Syntax: ifelse(date1 < date2, 1,0)
> ifelse(strptime(as.Date("2015-12-16"), format = "%Y-%m-%d") < strptime(as.Date("2015-12-17"), format = "%Y-%m-%d"),1,0)
[1] 1
> ifelse(strptime(as.Date("2015-12-18"), format = "%Y-%m-%d") < strptime(as.Date("2015-12-17"), format = "%Y-%m-%d"),1,0)
[1] 0
You can use the same logic to compare two dates.
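For the fractional part of the question (how much of each day the start and end times cover), the comparison can be extended to an overlap calculation. This is only a sketch under stated assumptions: timestamps are day/month/year as in the question's sample, everything is in UTC, and day_fraction is a hypothetical helper name.

```r
fmt <- "%d/%m/%Y %H:%M:%S"
start <- as.POSIXct("01/01/2015 00:00:00", format = fmt, tz = "UTC")
end   <- as.POSIXct("01/01/2015 16:00:00", format = fmt, tz = "UTC")

# Hypothetical helper: fraction of calendar day `day` ("YYYY-MM-DD")
# that lies inside the interval [start, end]
day_fraction <- function(day, start, end, tz = "UTC") {
  day_start <- as.POSIXct(paste(day, "00:00:00"), tz = tz)
  day_end   <- day_start + 24 * 60 * 60
  overlap <- as.numeric(difftime(min(end, day_end),
                                 max(start, day_start),
                                 units = "secs"))
  max(0, overlap) / 86400   # negative overlap means the day is not covered
}

days <- format(seq(as.Date("2015-01-01"), as.Date("2015-01-02"), by = "day"))
coverage <- vapply(days, day_fraction, numeric(1), start = start, end = end)
print(round(coverage, 4))
```

With the second sample row from the question (00:00 to 16:00 on 1 January) this gives 16/24 for Jan-01 and 0 for Jan-02. Days with daylight saving changes would need more care than this sketch takes.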

R Studio aborting with time series data [duplicate]

I have an example dataframe:
a <- c(1:6)
b <- c("05/12/2012 05:00","05/12/2012 06:00","06/12/2012 05:00",
"06/12/2012 06:00", "07/12/2012 09:00","07/12/2012 07:00")
c <-c("0","0","0","1","1","1")
df1 <- data.frame(a,b,c,stringsAsFactors = FALSE)
Firstly, I want to make sure R recognises the date and time format, so I used:
df1$b <- strptime(df1$b, "%d/%m/%Y %H:%M")
However this can't be right as R always aborts my session as soon as I try to view the new dataframe.
Assuming that this gets resolves, I want to get a subset of the data according to whichever day in the dataframe contains the most data in 'C' that is not a zero. In the above example I should be left with the two data points on 7th Dec 2012.
I also have an additional, related question.
If I want to be left with a subset of the data with the most non zero values between a certain time period in the day (say between 07:00 and 08:00), how would I go about doing this?
Any help on the above problems would be greatly appreciated.
Well, the good news is that I have an answer for you, and the bad news is that you have more questions to ask yourself. First the bad news: you need to consider how you want to treat multiple days that have the same number of non-zero values for 'c'. I'm not going to address that in this answer.
Now the good news: this is really simple.
Step 1: First, let's reformat your data frame. Since we're changing data types on a couple of the variables (b to datetime and c to numeric), we need to either create a new data frame or overwrite columns in the old one. I prefer to preserve the original and create a new one, like so:
a <- df1$a
b <- strptime(df1$b, "%d/%m/%Y %H:%M")
c <- as.numeric(df1$c)
hour <- as.numeric(format(b, "%H"))
date <- format(b, "%x")
df2 <- data.frame(a, b, c, hour, date)
# a b c hour date
# 1 1 2012-12-05 05:00:00 0 5 12/5/2012
# 2 2 2012-12-05 06:00:00 0 6 12/5/2012
# 3 3 2012-12-06 05:00:00 0 5 12/6/2012
# 4 4 2012-12-06 06:00:00 1 6 12/6/2012
# 5 5 2012-12-07 09:00:00 1 9 12/7/2012
# 6 6 2012-12-07 07:00:00 1 7 12/7/2012
Notice that I also added 'hour' and 'date' variables. This is to make our data easily sortable by those fields for our later aggregation function.
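One detail about Step 1 that may be worth knowing: strptime() returns class POSIXlt, which is list-based, whereas POSIXct stores each timestamp as a single number and is generally the more convenient class for a data frame column. data.frame() converts POSIXlt to POSIXct for you, but assigning with df1$b <- strptime(...) does not. Whether that is related to the crash in the question is an assumption on my part; the sketch below simply shows the POSIXct route, with UTC chosen arbitrarily as the time zone.

```r
b_chr <- c("05/12/2012 05:00", "05/12/2012 06:00", "06/12/2012 05:00",
           "06/12/2012 06:00", "07/12/2012 09:00", "07/12/2012 07:00")

b_lt <- strptime(b_chr, "%d/%m/%Y %H:%M", tz = "UTC")            # POSIXlt
b_ct <- as.POSIXct(b_chr, format = "%d/%m/%Y %H:%M", tz = "UTC") # POSIXct

df1 <- data.frame(a = 1:6,
                  b = b_ct,
                  c = as.numeric(c("0", "0", "0", "1", "1", "1")))

# Helper columns derived from the POSIXct column
df1$hour <- as.numeric(format(df1$b, "%H"))
df1$date <- as.Date(df1$b, tz = "UTC")

print(class(b_lt))
print(class(df1$b))
print(df1)
```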
Step 2: Now, let's calculate how many non-zero values there are for each day between the hours of 06:00 and 08:00. Since we're using the 'hour' values, this means the values of '6' and '7' (represents 06:00 - 07:59).
library(plyr)
df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate, non_zero=sum(c))
# a b c hour date non_zero
# 1 2 2012-12-05 06:00:00 0 6 12/5/2012 0
# 2 4 2012-12-06 06:00:00 1 6 12/6/2012 1
# 3 6 2012-12-07 07:00:00 1 7 12/7/2012 1
The 'plyr' package is wonderful for things like this. The 'ddply' function specifically takes data frames as both input and output (hence the "dd"), and the 'mutate' function allows us to preserve all the data while adding additional columns. In this case, we want a sum of 'c' for each day in .(date). Subsetting our data by the hours is taken care of in the data argument df2[df2$hour %in% 6:7,], which says to show us the rows where the hour value is in the set {6,7}.
Step 3: The final step is just to subset the data by the max number of non-zero values. We can drop the extra columns we used and go back to our original three.
subset_df <- df2[df2$non_zero==max(df2$non_zero),1:3]
# a b c
# 2 4 2012-12-06 06:00:00 1
# 3 6 2012-12-07 07:00:00 1
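For the first part of the question (the whole day with the most non-zero values, with no hour filter), a base R sketch without plyr could look like the following. It swaps ave() in for ddply()/mutate, assumes UTC, and, like the steps above, does nothing special about ties: every day that reaches the maximum is kept.

```r
df <- data.frame(
  a = 1:6,
  b = as.POSIXct(c("05/12/2012 05:00", "05/12/2012 06:00", "06/12/2012 05:00",
                   "06/12/2012 06:00", "07/12/2012 09:00", "07/12/2012 07:00"),
                 format = "%d/%m/%Y %H:%M", tz = "UTC"),
  c = as.numeric(c("0", "0", "0", "1", "1", "1"))
)

# Calendar day of each row, as a "YYYY-MM-DD" string
df$day <- format(df$b, "%Y-%m-%d")

# Per-day count of non-zero values, repeated on every row of that day
df$non_zero <- ave(as.numeric(df$c != 0), df$day, FUN = sum)

# Keep the rows of the day (or days) with the highest count
best_day <- df[df$non_zero == max(df$non_zero), c("a", "b", "c")]
print(best_day)
```

On the sample data this leaves the two rows from 7 December 2012, which is the result the question asks for.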
Good luck!
Update: At the OP's request, I am writing a new 'ddply' function that will also include a time column for plotting.
df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate, non_zero=sum(c), plot_time=as.numeric(format(b, "%H")) + as.numeric(format(b, "%M")) / 60)
subset_df <- df2[df2$non_zero==max(df2$non_zero),c("a","b","c","plot_time")]
We need to collapse the time down into one continuous variable, so I chose hours. Leaving any data in a time format will require us to fiddle with stuff later, and using a string format (like "hh:mm") will limit the types of functions you can use on it. Continuous numbers are the most flexible, so here we get the number of hours as.numeric(format(b, "%H")) and add it to the number of minutes divided by 60 as.numeric(format(b, "%M")) / 60 to convert the minutes into units of hours. Also, since we're dealing with more columns, I've switched the final subset statement to name the columns we want, rather than referring to the numbers. Once I'm dealing with columns that aren't in continuous order, I find that using names is easier to debug.
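As a quick worked example of the hours-plus-minutes arithmetic, using two made-up timestamps (not from the question) and UTC as an assumed time zone:

```r
# Two illustrative timestamps: 06:00 and 07:30
b <- as.POSIXct(c("2012-12-06 06:00", "2012-12-07 07:30"), tz = "UTC")

# Hours plus minutes converted to a fraction of an hour
plot_time <- as.numeric(format(b, "%H")) + as.numeric(format(b, "%M")) / 60
print(plot_time)
```

Here 07:30 becomes 7.5, so the values can go straight onto a numeric axis.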
Agreeing with Jack. Sounds like a corrupted installation of R. The first thing to try would be to delete the .RData file that holds the results of the prior session. Such files are hidden on both Mac and Windows, so unless you "reveal" the dotfiles (system files), the OS file manager (Finder.app or Windows Explorer) will not show them. How you find and delete that file is an OS-specific task. It's going to be in your working directory, and you will need to do the deletion outside of R, since once R is started it will have locked access to it. It's also possible to get a corrupt .Rhistory file, but in my experience that is not usually the source of the problem.
If that is not successful, you may need to reinstall R.

