How to find probability of dataset in R - r

I have a dataset like the one below (only a small part is shown). How can I calculate the probability of rain by month and show it with barplot?
Date Rain Today
2020-01-01 Yes
2020-01-02 No
2020-01-03 Yes
2020-01-04 Yes
2020-01-05 No
... ...
2020-12-31 Yes

EDIT: Correct answer in the comments
I don't know why you would want to use a barplot for this, but, from this post, you can use a dplyr pipeline to do something like this:
library(dplyr)
df %>%
group_by(month = format(Date, "%Y-%m")) %>%
summarise(probability = mean(`Rain Today` == 'Yes'))
This groups your data by month. Because `Rain Today` == 'Yes' is TRUE or FALSE for each day, its mean within a month is the proportion of days on which it rained, which is the estimated probability of rain for that month.
Thank you everyone in the comments for pointing it out. I hope this helps

The lubridate package has some great functions that help you deal with dates.
install.packages("lubridate")
df$month <- lubridate::month(df$Date)
tapply(df[,"Rain Today"]=="Yes", df$month, mean)
You may need to execute df$Date <- as.Date(df$Date) first if it's currently stored as character rather than a date.
If you don't want to have any dependencies, then I think you can get what you want like this:
df$month <- substr(df$Date, start=6, stop=7) #Get the 6th and 7th characters of your date strings, which correspond to the "month" part
tapply(df[,"Rain Today"]=="Yes", df$month, mean)
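Since the question mentions barplot, here is a minimal base-R sketch of how the monthly proportions could be plotted. The data frame rain and its values are made up for illustration; only the column names follow the question.

```r
# Toy data (hypothetical values; same column names as in the question)
rain <- data.frame(
  Date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03",
                   "2020-01-04", "2020-01-05", "2020-02-01", "2020-02-02")),
  `Rain Today` = c("Yes", "No", "Yes", "Yes", "No", "No", "Yes"),
  check.names = FALSE  # keep the space in the column name
)

# Proportion of rainy days per calendar month ("01", "02", ...)
month <- format(rain$Date, "%m")
prob <- tapply(rain[["Rain Today"]] == "Yes", month, mean)

# One bar per month; bar height is the estimated probability of rain
barplot(prob, ylim = c(0, 1), xlab = "Month", ylab = "P(rain)")
```

With data spanning several years, grouping on format(rain$Date, "%Y-%m") instead would keep the years apart.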

Related

Change the date in R and put it into condition

I have two columns with dates in my dataset. Now I want them to have the same format. The first one looks like this: yyyymmdd (so, January 1st 2015 is 20150101) and the second one looks like this: dd/mm/yyyy (so, January 1st 2015 is 01/01/2015). Can anyone help me with that?
After that, I want a third column that gives the value 1 if the years of both dates are the same or if one date is earlier than the other.
Hope anyone can help me out!
Here is how we could solve this:
We could use the parse_date_time() function from the lubridate package. The orders argument lists the formats to try (here year-month-day and day-month-year), and the whole call is wrapped in ymd() to get a date without hours and minutes:
library(tibble)
# example data
df<- tibble( x_date = "20150101",
y_date= "01/01/2015")
library(lubridate)
library(dplyr)
df %>%
mutate(across(contains("date"), ~ymd(parse_date_time(., orders = c('ymd', 'dmy')))))
x_date y_date
<date> <date>
1 2015-01-01 2015-01-01
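The question also asks for a flag column, which the answer above does not cover. Here is a minimal base-R sketch, assuming both columns are character strings in the formats described. It reads the condition as "same year, or x_date earlier than y_date", which is an assumption; the example rows and the column name flag are made up.

```r
# Hypothetical example rows in the two formats from the question
dates <- data.frame(
  x_date = c("20150101", "20140615", "20160301"),
  y_date = c("01/01/2015", "20/06/2015", "15/02/2015"),
  stringsAsFactors = FALSE
)

# Parse each column with its own explicit format
dates$x_date <- as.Date(dates$x_date, format = "%Y%m%d")
dates$y_date <- as.Date(dates$y_date, format = "%d/%m/%Y")

# 1 if both dates fall in the same year or x_date is earlier, else 0
same_year <- format(dates$x_date, "%Y") == format(dates$y_date, "%Y")
x_earlier <- dates$x_date < dates$y_date
dates$flag <- as.integer(same_year | x_earlier)
```

If the intended condition is different (for example y_date earlier than x_date), only the x_earlier line needs to change.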

Is there an R function to split data by month?

I have data in the following format:
content, date
Hello, 2019-05-11T23:59:02+00:00
Amazing, 2019-01-08T20:22:02+00:00
Come on, 2018-11-15T10:52:45+00:00
We won, 2018-08-25T16:33:23+00:00
This is only a sample of the data, whereas I have over 1 million rows with "dates" in between August 2018 and May 2019. I would like to split my data into 10 different data frames, with each one representing a specific month (i.e. 1 = August 2018, 2 = September 2018,...,10 = May 2019).
I tried using a dplyr group-by method and also performing a loop but did not find any success. I also tried codes from other posts but to no avail.
Any help is much appreciated. I am new to Stack Overflow so apologies if I did not adhere to any form code of conduct.
Thank you in advance!
The lubridate package has functions which will meet your needs. The key here is to make them Dates (or POSIX).
require(tidyverse)
require(lubridate)
df <- data.frame(content=c('H','A'),
date=c('2019-05-11T23:59:02+00:00', '2019-01-08T20:22:02+00:00'))
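df <- df %>% # %<>% needs magrittr attached, so assign explicitly
mutate(date=ymd_hms(date)) %>%
mutate(monthGroup=floor_date(as_date(date), unit='month')) # as_date() makes monthGroup a Date, comparable with ymd()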
You can either manually filter for each month using that information or put it in a loop/apply to make the computer do it.
df %>%
filter(monthGroup==ymd('2019-05-01'))
Another way without using floor_date()
df <- data.frame(content=c('H','A'),
date=c('2019-05-11T23:59:02+00:00', '2019-01-08T20:22:02+00:00'))
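Get all April 2019 dates; that is, dates that came before 1 May 2019 and on or after 1 April 2019. The bounds are given a time zone so that they are date-times like the date column.
df %>%
mutate(date=ymd_hms(date)) %>%
filter(date<ymd('2019-05-01', tz='UTC') &
date>=ymd('2019-04-01', tz='UTC'))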
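To get one data frame per month in a single step, as the question asks, base R's split() may be enough. This is a minimal sketch that assumes every timestamp is an ISO 8601 string with the same UTC offset, as in the sample; the names msgs and by_month are made up.

```r
# Sample rows from the question
msgs <- data.frame(
  content = c("Hello", "Amazing", "Come on", "We won"),
  date = c("2019-05-11T23:59:02+00:00", "2019-01-08T20:22:02+00:00",
           "2018-11-15T10:52:45+00:00", "2018-08-25T16:33:23+00:00"),
  stringsAsFactors = FALSE
)

# The first 7 characters of an ISO 8601 timestamp are "YYYY-MM"
msgs$month <- substr(msgs$date, 1, 7)

# Named list with one data frame per month, sorted by month label
by_month <- split(msgs, msgs$month)
names(by_month)
by_month[["2019-05"]]
```

Each element of by_month can then be used like any other data frame, for example by_month[["2018-08"]] for August 2018.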

How to select rows from a dataset between two dates?

I have a fairly large dataset (35 variables and 65,000 rows) and I would like to split it in three according to specific dates. I have information about animals before and after a surgery. I'm currently using the dplyr package. Below I show what my dataset looks like; I only give an example because dput on my dataset produces something really large and unreadable. As in the example, I have several dates at which measurements were taken for an individual. The information about each individual is completed by the surgery date, which is unique for each individual. As in the example, measurements were taken over several years.
Name Date Measurement Surgery_date
Pierre 2016-03-15 5.12 2017-03-21
Pierre 2017-03-16 4.16 2017-03-21
Pierre 2017-08-09 5.08 2017-03-21
Paul 2016-07-03 5.47 2017-03-25
Paul 2016-09-30 4.98 2017-03-25
Paul 2017-04-12 4.51 2017-03-25
For the moment I've been careful to store both the measurement dates and the surgery dates as dates, using the lubridate package. Then I tried to sort my data using the dplyr package. I tried filter and select, but neither gave the expected results.
data1$Date <- parse_date_time(data1$Date, "d/m/y")
data1$Date <- ymd(data1$Date)
data1$Surgery_date <- parse_date_time(data1$Surgery_date, "d/m/y")
data1$Surgery_date <- ymd(data1$Surgery_date)
before_surgery <- data1
before_surgery <- dplyr::as_tibble(before_surgery)
before_surgery <- before_surgery %>%
filter(Date > Surgery_date)
before_surgery <- before_surgery %>%
select(Date < Surgery_date)
Either way, no row is deleted. When I try (by the same means) to obtain dates after surgery, no row is actually selected.
I have checked my file to be sure there are actually dates before and after the surgery date (if not, this result would have been normal), and I can confirm both kinds of dates are in the dataset.
I have just put here the example of the dates before surgery, assuming it works on the same pattern for the dates after surgery.
Thank you in advance for those who will take time to read me. I'm sorry if the question is quite similar to other ones but I have not been able to figure a solution on my own...
EDIT: To be more specific, the ultimate goal is to have three separate datasets. The first would cover all measurements taken before the surgery, the second the day of the surgery itself + 5 days (but I'll try to handle this one later on), and the third would cover measurements taken after the surgery.
The solution to what you are asking is straightforward, because you can in fact filter on dates and compare dates in multiple columns. Please try the code below and confirm for yourself that this works as you would expect. If this approach does not work on your own dataset, please share more about your data and processing because there is probably an error in your code. (One error I already saw: you can't use select(Date < Surgery_date). You need to use filter).
This is how I would approach your problem. As you can see, the code is very straightforward.
library(dplyr)
library(lubridate)
df <- data.frame(
Name = c(rep('Pierre', 3), rep('Paul', 3)),
Date = c('2016-03-15', '2017-03-26', '2017-08-09', '2016-07-03', '2016-09-30', '2017-04-12'),
Measurement = c(5.12, 4.16, 5.08, 5.47, 4.98, 4.51),
Surgery_date = c(rep('2017-03-21', 3), rep('2017-03-25', 3))
) %>%
mutate(Surgery_date = ymd(Surgery_date),
Date = ymd(Date))
df %>%
filter(Date < Surgery_date)
df %>%
filter(Date > Surgery_date & Date < (Surgery_date + days(5)))
df %>%
filter(Date > Surgery_date)
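The three filters above overlap (rows in the 5-day window also appear in the "after" set). If three non-overlapping datasets are wanted, one option is to label each row once and then split. This is a base-R sketch under the assumption that the window means the surgery day through day 5 inclusive; the column name period and its labels are made up.

```r
# Same example rows as above, with the columns stored as Date
df <- data.frame(
  Name = c(rep("Pierre", 3), rep("Paul", 3)),
  Date = as.Date(c("2016-03-15", "2017-03-26", "2017-08-09",
                   "2016-07-03", "2016-09-30", "2017-04-12")),
  Measurement = c(5.12, 4.16, 5.08, 5.47, 4.98, 4.51),
  Surgery_date = as.Date(c(rep("2017-03-21", 3), rep("2017-03-25", 3)))
)

# Whole days between the measurement and the surgery (negative = before)
days_since <- as.numeric(difftime(df$Date, df$Surgery_date, units = "days"))

# (-Inf, -1] = before, (-1, 5] = surgery day to day 5, (5, Inf) = after
df$period <- cut(days_since, breaks = c(-Inf, -1, 5, Inf),
                 labels = c("before", "surgery_window", "after"))

# One data frame per period
parts <- split(df, df$period)
```

parts$before, parts$surgery_window and parts$after then hold the three datasets, and every row lands in exactly one of them.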

Aggregate hourly data for each month of the year

I've looked around for something similar, but couldn't find anything. I have an airport data set which looks something like this (I rounded the hours):
Date Arrival_Time Departure_Time ...
2017-01-01 13:00 14:00 ...
2017-01-01 16:00 17:00 ...
2017-01-01 17:00 18:00 ...
2017-01-01 11:00 12:00 ...
The problem is that for some months, there isn't a flight for a specific time which means I have missing data for some hour. How can I extract hourly arrivals for each hour of every month so that there are no missing values?
I've tried using dplyr and doing the following:
arrivals <- allFlights %>% group_by(month(Date), Arrival_Time) %>%
summarise(n()) %>%
na.omit()
but the problem clearly arises because group_by cannot fill in my missing data. I end up with data for every month, but no entries for some hours (e.g. no entry for month 1, hour 22:00).
I could currently get my answer by filtering out every month in its own list, and then fully merging them with a complete list of hours, but that's really slow as I have to do this 12 times. Ideally I'm trying to end up with something like this:
Hour Month January February March ... December
00:00 1 ### ### ### ... ###
01:00 1 ### ### ### ... ###
...
00:00 12 ### ### ### ... ###
23:00 12 ### ### ### ... ###
where ### is the number of flights for that hour of that month. Is there a nice way of doing this?
Note: I was thinking if I could somehow join every month's hours with my complete list of hours, and replace all na's with 0's, then that would work, but I couldn't figure out how to do it properly.
Hopefully the question makes sense. I'd gladly clarify if anything is unclear.
EDIT:
If you want to try it with the nycflights13 package, you could reproduce my attempt with the following code:
allFlights <- nycflights13::flights
allFlights$arr_time <- format(strptime(substr(as.POSIXct(sprintf("%04.0f", allFlights$arr_time), format="%H%M"), 12, 16), '%H:%M'), '%H:00')
arrivals <- allFlights %>% filter(carrier == "MQ") %>% group_by(month, arr_time) %>% summarise(n()) %>% na.omit()
Notice how arrivals doesn't have anything for month 1, hour 02:00, 03:00, etc. What I'm trying to do is have this be a complete data set with the missing hours filled in as 0.
I think you can use the code below to generate what you need.
library(dplyr) # for left_join()
library(stringr)
dim_month_hour<-data.frame(expand.grid(hour=paste(str_pad(seq(0,23,1),2,"left","0"),"00",sep=":"),month=sort(unique(allFlights$month)),stringsAsFactors=F))
arrivals_full<-left_join(dim_month_hour,arrivals,by=c("hour"="arr_time","month"="month"))
arrivals_full[is.na(arrivals_full$`n()`),"n()"]<-0
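The same zero-filling can also be sketched in base R without a join: table() counts every combination of factor levels, including those that never occur. The toy data frame flights below is made up for illustration and only has two months.

```r
# Toy data (hypothetical): a few arrivals in months 1 and 2
flights <- data.frame(
  month = c(1, 1, 1, 2),
  arr_time = c("13:00", "13:00", "16:00", "11:00")
)

# All 24 hour labels, "00:00" to "23:00"
hours <- sprintf("%02d:00", 0:23)

# Fixing the factor levels makes table() report unseen combinations as 0
counts <- as.data.frame(
  table(hour  = factor(flights$arr_time, levels = hours),
        month = factor(flights$month, levels = 1:2)),
  responseName = "n"
)
```

counts has one row per hour and month (48 rows here), with n equal to 0 where no flight arrived.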
Is this what you're trying to do? I'm not sure if I'm aggregating exactly how you want, but the !is.na should do what you're looking for.
arrivals <- allFlights %>% group_by(month(Date), Arrival_Time) %>%
rowwise() %>%
summarise(month = plyr::count(!is.na(Arrival_Time)))
Edit: I may not be clear. Do you want a zero to show for hours where there are no data?
So I'm circling back to this. There's a cool package called padr that will "pad" the date/time entries with NAs for missing values. Because there is a time_hour field, you can use pad.
library(padr)
allFlightsPad <- allFlights %>% pad
Then you can summarize from there. See this page for info.

extracting "yearmon" from character format and calculating age in R

I have two columns of data:
DoB: yyyy/mm
Reported date: yyyy/mm/dd
Both are in character format.
I'd like to calculate an age, by subtracting DoB from Reported Date, without adding a fictional day to the DoB, so that the age comes out as 28.5 (meaning 28 and a half years old).
Please can someone help me with the coding, I'm struggling!
Many thanks from an R newbie.
library(lubridate)
a <- "2010/02"
b <- "2014/12/25"
age <- ymd(b) - ymd(paste0(a, "/01")) # I don't think this can be done without adding a fictional day
age <- as.numeric(age, units = "days") / 365.25 # convert the difference from days to years
What would you want the age to be if the dates are:
DoB: 2015/01
Reported date: 2015/01/30
As suggested, lubridate is a great package for working with dates. You probably want some version using difftime. You can also still use ymd for the yyyy/mm values by setting truncated=1, which allows the last field (the day) to be missing.
df <- data.frame(DoB = c("1987/08", "1994/04"),
Report_Date = c("2015/03/05","2014/07/04"))
library(lubridate)
df$age_years <- with(df,
as.numeric(
difftime(ymd(Report_Date),
ymd(DoB, truncated=1)
)/365.25))
df
DoB Report_Date age_years
1 1987/08 2015/03/05 27.59206023
2 1994/04 2014/07/04 20.25735797
Unfortunately difftime doesn't have a 'years' unit so you also will need to divide the 'days' output that you get back.
Use the "yearmon" class in zoo. It represents time as years + fraction (where fraction is in the set 0, 1/12, ..., 11/12) and so does not require that fictitious days be added:
library(zoo)
as.yearmon("2012/01/10", "%Y/%m/%d") - as.yearmon("1983/07", "%Y/%m")
giving:
[1] 28.5
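The same whole-month arithmetic can be sketched without any package by counting elapsed months directly. This assumes the strings follow the yyyy/mm and yyyy/mm/dd layouts from the question and ignores the day of the month; the helper names year_of and month_of are made up.

```r
dob      <- c("1983/07", "1987/08")
reported <- c("2012/01/10", "2015/03/05")

# Hypothetical helpers: pull the year and month out of "yyyy/mm..." strings
year_of  <- function(x) as.integer(substr(x, 1, 4))
month_of <- function(x) as.integer(substr(x, 6, 7))

# Whole months between the two year-months, then converted to years
months_elapsed <- (year_of(reported) - year_of(dob)) * 12 +
  (month_of(reported) - month_of(dob))
age_years <- months_elapsed / 12
```

For the first pair this gives 342 months, i.e. 28.5 years, matching the yearmon result above.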

Resources