How to select rows from a dataset between two dates? - r

I have a fairly large dataset (35 variables and 65,000 rows) and I would like to split it into three parts according to specific dates. I have information about animals before and after a surgery. I'm currently using the dplyr package. Below is what my dataset looks like. I only give an example because running dput on my dataset produces something really large and unreadable. As in the example, I have several dates at which measurements were taken for an individual. The information about each individual is completed by the surgery date, which is unique to that individual. As in the example, measurements were taken over several years.
Name Date Measurement Surgery_date
Pierre 2016-03-15 5.12 2017-03-21
Pierre 2017-03-16 4.16 2017-03-21
Pierre 2017-08-09 5.08 2017-03-21
Paul 2016-07-03 5.47 2017-03-25
Paul 2016-09-30 4.98 2017-03-25
Paul 2017-04-12 4.51 2017-03-25
For the moment I've been careful to store both the measurement dates and the surgery dates in date format, using the lubridate package. Then I tried to subset my data using the dplyr package. I've tried filter and select, but neither gave the expected results.
data1$Date <- parse_date_time(data1$Date, "d/m/y")
data1$Date <- ymd(data1$Date)
data1$Surgery_date <- parse_date_time(data1$Surgery_date, "d/m/y")
data1$Surgery_date <- ymd(data1$Surgery_date)
before_surgery <- data1
before_surgery <- dplyr::as_tibble(before_surgery)
before_surgery <- before_surgery %>%
filter(Date > Surgery_date)
before_surgery <- before_surgery %>%
select(Date < Surgery_date)
Either way no row is deleted. When I try (by the same means) to obtain dates after surgery, no row is actually selected.
I have checked my file to be sure there actually are dates both before and after the surgery date (otherwise this result would have been normal), and I can confirm that both kinds of dates are in the dataset.
I have just put here the example of the dates before surgery, assuming it works on the same pattern for the dates after surgery.
Thank you in advance to those who take the time to read this. I'm sorry if the question is quite similar to other ones, but I have not been able to figure out a solution on my own...
EDIT: To be more specific, the ultimate goal is to have three separate datasets. The first one would cover all measurements taken before the surgery, the second the day of the surgery itself + 5 days (but I'll try to handle this one later on), and the third one would cover measurements taken after the surgery.

The solution to what you are asking is straightforward, because you can in fact filter on dates and compare dates in multiple columns. Please try the code below and confirm for yourself that this works as you would expect. If this approach does not work on your own dataset, please share more about your data and processing because there is probably an error in your code. (One error I already saw: you can't use select(Date < Surgery_date), because select() picks columns, not rows. You need to use filter().)
This is how I would approach your problem. As you can see, the code is very straightforward.
library(dplyr)      # %>%, mutate(), filter()
library(lubridate)  # ymd(), days()

df <- data.frame(
  Name = c(rep('Pierre', 3), rep('Paul', 3)),
  Date = c('2016-03-15', '2017-03-26', '2017-08-09', '2016-07-03', '2016-09-30', '2017-04-12'),
  Measurement = c(5.12, 4.16, 5.08, 5.47, 4.98, 4.51),
  Surgery_date = c(rep('2017-03-21', 3), rep('2017-03-25', 3))
) %>%
  mutate(Surgery_date = ymd(Surgery_date),
         Date = ymd(Date))

# measurements taken before the surgery
df %>%
  filter(Date < Surgery_date)

# measurements taken strictly between the surgery date and 5 days after it
df %>%
  filter(Date > Surgery_date & Date < (Surgery_date + days(5)))

# all measurements taken after the surgery
df %>%
  filter(Date > Surgery_date)
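If the three datasets should not overlap, the comparisons can be made mutually exclusive. The following is only a sketch in base R (no packages), using the example data from this answer. The names `measurements`, `window_end`, `before`, `around` and `after` are made up for illustration, and it assumes the middle window includes both the surgery day and the fifth day after it.

```r
# Example data from the answer, with both date columns stored as Date
measurements <- data.frame(
  Name = c(rep("Pierre", 3), rep("Paul", 3)),
  Date = as.Date(c("2016-03-15", "2017-03-26", "2017-08-09",
                   "2016-07-03", "2016-09-30", "2017-04-12")),
  Measurement = c(5.12, 4.16, 5.08, 5.47, 4.98, 4.51),
  Surgery_date = as.Date(c(rep("2017-03-21", 3), rep("2017-03-25", 3)))
)

# Assumption: the "surgery + 5 days" window is inclusive on both ends
window_end <- measurements$Surgery_date + 5

# Three mutually exclusive subsets: every row lands in exactly one of them
before <- measurements[measurements$Date < measurements$Surgery_date, ]
around <- measurements[measurements$Date >= measurements$Surgery_date &
                         measurements$Date <= window_end, ]
after  <- measurements[measurements$Date > window_end, ]
```

With dplyr the same conditions can be passed to filter() unchanged.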

Related

How to find probability of dataset in R

I have a dataset like the one below (only a small part is shown). How can I use barplot to calculate the probability of rain by month?
Date Rain Today
2020-01-01 Yes
2020-01-02 No
2020-01-03 Yes
2020-01-04 Yes
2020-01-05 No
... ...
2020-12-31 Yes
EDIT: Correct answer in the comments
I don't know why you would want to use a barplot for this, but, from this post, you can use dplyr pipelines to do something like this:
library(dplyr)
df %>%
group_by(month = format(Date, "%Y-%m")) %>%
summarise(probability = mean(`Rain Today` == 'Yes'))
This groups your data by month and takes the mean of the logical vector `Rain Today` == 'Yes', which is the proportion of days in each month on which it rained.
Thank you everyone in the comments for pointing it out. I hope this helps
The lubridate package has some great functions that help you deal with dates.
install.packages("lubridate")
df$month <- lubridate::month(df$Date)
tapply(df[,"Rain Today"]=="Yes", df$month, mean)
You may need to run df$Date <- as.Date(df$Date) first if it's currently stored as character rather than as a date.
If you don't want to have any dependencies, then I think you can get what you want like this:
df$month <- substr(df$Date, start=6, stop=7) #Get the 6th and 7th characters of your date strings, which correspond to the "month" part
tapply(df[,"Rain Today"]=="Yes", df$month, mean)
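As a self-contained check of the tapply() approach, here is a small sketch that uses only the six rows visible in the question (the rows hidden behind "..." are not included, so the numbers are not the real monthly probabilities). The names `rain` and `prob` are made up for illustration.

```r
# Only the rows shown in the question; the elided rows are left out
rain <- data.frame(
  Date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03",
                   "2020-01-04", "2020-01-05", "2020-12-31")),
  `Rain Today` = c("Yes", "No", "Yes", "Yes", "No", "Yes"),
  check.names = FALSE  # keep the space in the column name
)

# Two-digit month as a grouping key
rain$month <- format(rain$Date, "%m")

# Mean of a logical vector = share of TRUE values, i.e. share of rainy days
prob <- tapply(rain[, "Rain Today"] == "Yes", rain$month, mean)
prob
```

Passing `prob` to barplot() would then draw one bar per month.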

R function for finding difference of dates with exceptions?

I was wondering if there is a function for finding the difference between an issue date and a maturity date when there are two maturity date sources. I want to prioritize the dates in maturity date source 1 and subtract the issue date from them to find the difference. Then, where my dataset is missing dates from maturity date source 1, such as in lines 5 & 6, I want to use the dates from maturity date source 2 to fill in the rest. I have tried the code below, but am unsure how to incorporate the data from maturity date source 2 without changing everything else. I have attached a picture for reference. Thank you in advance.
df$Maturity_Date_source_1 <- as.Date(c(df$Maturity_Date_source_1))
df$Issue_Date <- as.Date(c(df$Issue_Date))
df$difference <- (df$Maturity_Date_source_1 - df$Issue_Date) / 365.25
df$difference <- as.numeric(c(df$difference))
An option would be to coalesce the columns and then do the difference
library(dplyr)
df %>%
mutate(difference = as.numeric((coalesce(Maturity_Date_source_1,
Maturity_Date_source_2) - Issue_Date)/365.25))
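For readers who prefer to avoid dplyr, the same fallback logic can be sketched in base R. The data below is invented for illustration (the original picture is not available); only the column names come from the question, and `bonds`, `maturity` and `missing_1` are hypothetical names.

```r
# Invented example rows; source 1 is missing in the second row
bonds <- data.frame(
  Issue_Date             = as.Date(c("2020-01-01", "2020-01-01")),
  Maturity_Date_source_1 = as.Date(c("2025-01-01", NA)),
  Maturity_Date_source_2 = as.Date(c("2026-01-01", "2030-01-01"))
)

# Start from source 1 and fill its gaps from source 2
maturity <- bonds$Maturity_Date_source_1
missing_1 <- is.na(maturity)
maturity[missing_1] <- bonds$Maturity_Date_source_2[missing_1]

# Date - Date gives a difference in days; convert to years as in the question
bonds$difference <- as.numeric(maturity - bonds$Issue_Date) / 365.25
bonds
```

Indexing is used instead of ifelse() because ifelse() drops the Date class from its result.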

R calculating time differences in a (layered) long dataset

I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out.
I'm working with data from a website showing for each customer (ID) their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps.
The df is structured as follows, in a long format:
df <- data.frame("Customer" = c(1, 1, 1, 2, 3, 3),
"Visit" =c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited the site three times.
"Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25", "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
Note: In the real dataset the timestamp isn't a factor but some other haggard character-type abomination which I should probably first try to convert into a POSIXct format somehow.
What I would like to do here is to create a df that displays per customer their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g., second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.
Because the number of visits can vary drastically (e.g. one or 256 visits) I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns which I could then subtract and average.
I'm at a bit of a loss how to best approach this type of problem, thanks a bunch!
Using dplyr:
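library(dplyr)
df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%  # character/factor timestamps cannot be subtracted
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = difftime(Timestamp, lag(Timestamp), units = "hours")) %>%  # fixed units across groups
  summarise(Mean_Difference = mean(Difference, na.rm = TRUE))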
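Due to the grouping, the first value of Difference for any customer is NA (including those with only one visit), so those values are dropped by na.rm = TRUE when taking the mean. A customer with a single visit has no intervals left, so the mean for that customer is NaN.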
Using base R (no extra packages):
1. Sort the data, ordering by customer ID, then by timestamp.
2. Calculate the time difference between consecutive rows (using the diff() function), grouping by customer ID (tapply() does the grouping).
3. Find the average.
4. Squish that into a data.frame.
# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp)
# not debugged
df <- df[order(df$Customer, df$Timestamp),]
# 2 apply a diff.
# if you want to force the time units to seconds, convert
# the timestamp to numeric first.
# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# convert to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)
# 3 find the averages
diffs.mean <- lapply(diffs, mean)
# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)
# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df
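To connect this to the stated goal (drop single-visit customers, then take a grand mean), here is a compact base R sketch on the example data from the question. The names `visits`, `avg_hours` and `grand_mean` are made up, and the timestamps are assumed to be in UTC so that daylight saving changes do not affect the intervals.

```r
# Example data from the question
visits <- data.frame(
  Customer = c(1, 1, 1, 2, 3, 3),
  Timestamp = c("2019-12-31 12:13:25", "2019-12-31 16:13:25",
                "2020-01-05 10:13:25", "2019-11-12 15:18:42",
                "2019-11-13 19:22:35", "2019-12-10 19:43:55"),
  stringsAsFactors = FALSE
)

# Assumption: timestamps are UTC
visits$Timestamp <- as.POSIXct(visits$Timestamp, tz = "UTC")
visits <- visits[order(visits$Customer, visits$Timestamp), ]

# Mean gap between consecutive visits per customer, in hours
# (as.numeric() on POSIXct gives seconds, hence the division by 3600)
avg_hours <- vapply(
  split(as.numeric(visits$Timestamp), visits$Customer),
  function(x) mean(diff(x)) / 3600,
  numeric(1)
)

# Customers with a single visit have no gaps and yield NaN; drop them
avg_hours <- avg_hours[!is.na(avg_hours)]

# Grand mean across the remaining customers
grand_mean <- mean(avg_hours)
```

`hist(avg_hours)` would be one way to look at the distribution afterwards.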
You haven't shown your timestamp strings, but when you need to wrangle them, the lubridate package is your friend.

Filter Data by Seasonal Ranges Over Several Years Based on Month and Day Column in R Studio

I am trying to filter a large dataset to contain results between a range of days and months over several years to evaluate seasonal objectives. My season is defined from 15 March through 15 September. I can't figure out how to filter the days so that they are only applied to March and September and not the other months within the range. My dataframe is very large and contains proprietary information, but I think the most important information is that the dates are described by the columns SampleDate (date formatted as %y%m%d), day (numeric), and month (numeric).
I have tried filtering using multiple conditions like so:
S1 <- S1 %>%
filter((S1$month >= 3 & S1$day >=15) , (S1$month<=9 & S1$day<=15 ))
I also attempted to set ranges using between for every year that I have data with no luck:
S1 %>% filter(between(SampleDate, as.Date("2010-03-15"), as.Date("2010-09-15") &
as.Date("2011-03-15"), as.Date("2011-09-15")&
as.Date("2012-03-15"), as.Date("2012-09-15")&
as.Date("2013-03-15"), as.Date("2013-09-15")&
as.Date("2014-03-15"), as.Date("2014-09-15")&
as.Date("2015-03-15"), as.Date("2015-09-15")&
as.Date("2016-03-15"), as.Date("2016-09-15")&
as.Date("2017-03-15"), as.Date("2017-09-15")&
as.Date("2018-03-15"), as.Date("2018-09-15")))
I am pretty new to R and can't find any solution online. I know there must be a somewhat simple way to do this! Any help is greatly appreciated!
Maybe something like this:
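library(data.table)
library(stringr)  # provides str_sub()
df <- setDT(df)
# convert a date like this '2020-01-01' into this '01-01'
# (`date` stands for your date column, e.g. SampleDate)
df[, month_day := str_sub(as.character(date), 6, 10)]
# zero-padded 'mm-dd' strings sort in calendar order, so string comparison works
df[month_day >= '03-15' & month_day <= '09-15']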
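The attempt in the question fails because `day >= 15` and `day <= 15` are applied to every month, which leaves only the 15th of each month. Comparing a combined month-day key avoids that. Below is a base R sketch of the same idea with invented sample dates; `samples`, `month_day` and `in_season` are hypothetical names, and SampleDate is assumed to already be of class Date.

```r
# Invented dates around the season boundaries, across several years
samples <- data.frame(
  SampleDate = as.Date(c("2010-03-14", "2010-03-15", "2011-06-30",
                         "2012-09-15", "2012-09-16", "2013-12-01"))
)

# Zero-padded "mm-dd" key: the year is dropped, so one range covers all years
month_day <- format(samples$SampleDate, "%m-%d")

# Keep 15 March through 15 September, inclusive
in_season <- samples[month_day >= "03-15" & month_day <= "09-15", , drop = FALSE]
in_season
```

A numeric key such as `month * 100 + day` compared against 315 and 915 would work the same way with the existing month and day columns.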

Sort date column and convert csv file to a time -series

I need your help. I am new to R, I have this csv file shorturl.at/chDK9 with the "All Share Index" from the Nigerian stock exchange, formatted in a matrix, with the months as rows and the years as columns.
I am trying to do 4 things:
1. Reshape the data into four columns: Date, Month, Year, ASI
2. The period should be a date column in the format 01-2013 for January 2013 and so on.
3. Arrange the data by date, oldest to newest
4. Convert the data to a time-series type for analysis (xts preferably)
So far I have solved 1 & 2 above.
Please see my code below
rASI <- ASI_conv_to_USD_2003_2018
gathered.rASI <- gather(rASI, Month, ASI, -Year)
gathered.rASI$Date <- format(as.Date(paste0(gathered.rASI$Month, gathered.rASI$Year, "01"), format="%b%Y%d"), "%m-%Y")
ASI <- select(gathered.rASI, Date, Month, Year, ASI)
Created on 2020-06-04 by the reprex package (v0.3.0)
I do not know what I am doing wrong, but the date column still shows as a chr. How do I make the date column function as a proper date?
Any help would be greatly appreciated.
Data:
Year,January,February,March,April,May,June,July,August,September,October,November,December
2003,104.904946,108.036674,106.6532671,106.1211644,110.6369777,114.3109402,109.7382693,120.7042254,129.0513061,141.9747008,140.2999274,147.4647619
2004,168.4931751,184.3675093,171.8948949,194.2243976,209.6846881,218.4302457,204.5201028,179.6591854,171.788925,176.3957704,175.7856172,180.1624481
2005,174.3600786,165.874575,156.2704949,165.9616111,162.3373385,162.9130468,165.5409489,177.6973735,190.975969,200.5254592,189.5253288,187.4381323
2006,184.2754864,187.0039216,184.1151874,183.9374803,195.3248086,207.753217,220.2425152,261.5902624,257.3486166,257.9713924,257.9644269,262.3660079
2007,290.763576,321.9563671,344.0977116,373.70341,397.1224052,408.8450816,422.9554882,404.1068702,405.3995157,413.592025,462.4500768,498.6259673
2008,465.9093801,564.6059512,542.1712123,511.539673,507.3090565,481.7790407,457.4977173,411.7628813,398.2089436,312.9651073,284.4105236,240.5413384
2009,151.4739254,160.8334365,136.7210055,147.8068088,203.1480164,183.6687179,169.4245226,152.975866,150.2860646,146.6946313,142.143901,141.1054878
2010,152.3225241,155.1887111,175.6850474,178.6050908,176.5795117,171.5144595,174.5049291,163.103972,154.3394041,169.2037838,167.046543,166.6141118
2011,179.0501835,173.3762495,163.0327771,164.2939247,168.9634855,165.1146804,158.8889704,141.5247531,132.2063595,139.799399,128.3830306,132.7185019
2012,133.3492814,129.4949163,132.8047714,142.0467784,142.1346216,138.9576042,148.4574482,152.9350934,167.5144256,170.2584385,170.6456267,180.8386037
2013,205.1867431,213.04438,216.0144928,215.3981965,243.4601263,232.9424155,244.1989566,233.469857,235.6526892,242.2584675,250.74636,266.2963273
2014,261.3308857,254.8076651,249.6006828,247.9683695,267.1803131,273.6744186,271.1943568,267.5533724,265.4434783,241.8539225,209.9881459,206.9083582
2015,176.4899701,152.4243544,161.5512468,176.6316031,174.6074809,170.307101,153.589313,151.067888,158.9094935,148.4871247,140.5468193,145.7620865
2016,121.710687,125.041883,128.7848346,127.5440712,140.7794402,104.7709381,89.631776,90.34052373,92.97916325,89.3927422,83.19668309,88.25819376
2017,85.43474979,83.04616393,83.42762792,84.35732766,96.74749098,108.4396857,120.8084876,116.2751597,116.1014906,120.1450704,124.20491,125.1822913
2018,145.2937418,141.8812705,136.0134688,135.2162844,124.7488623,125.4006552,121.2306533,114.014232,107.1321563,106.06426,100.7971596,102.5464927
Here is one way to do it: gather your data (i.e., change it from wide to long), create a date variable, and only then convert the result to xts.
library(dplyr)  # provides %>%, mutate(), select(), arrange()
## This assumes that you already have written the data frame (as in your example)
myxts <- ASI_conv_to_USD_2003_2018 %>%
## gather changes the data from wide to long
tidyr::gather("month","value",-Year) %>%
## dmy creates the date variable
mutate(dat = paste0("01 ",month," ",Year) %>% lubridate::dmy()) %>%
## keep only the date and the value
select(dat, value) %>%
## sort by date (not compulsory)
arrange(dat) %>%
## convert to xts (note that xts::as_xts() is deprecated)
timetk::tk_xts(select=value,date_var=dat)
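On the question of why the Date column shows as chr: format() always returns character strings, so wrapping as.Date() in format() turns the dates back into text. One option is to keep a real Date column for sorting and a formatted label only for display. The sketch below does this in base R on the first three months of 2003 and 2004 from the posted data; the names `wide`, `long` and `Label` are made up for illustration.

```r
# First three months of 2003 and 2004 from the posted data
wide <- data.frame(
  Year     = c(2003, 2004),
  January  = c(104.904946, 168.4931751),
  February = c(108.036674, 184.3675093),
  March    = c(106.6532671, 171.8948949)
)

# Wide to long: one row per Year/Month combination
month_cols <- setdiff(names(wide), "Year")
long <- data.frame(
  Year  = rep(wide$Year, times = length(month_cols)),
  Month = rep(month_cols, each = nrow(wide)),
  ASI   = unlist(wide[month_cols], use.names = FALSE)
)

# Real Date column (first of the month); month.name avoids locale-dependent parsing
long$Date <- as.Date(sprintf("%d-%02d-01",
                             as.integer(long$Year),
                             match(long$Month, month.name)))

# Sort oldest to newest on the Date, not on the label
long <- long[order(long$Date), ]

# "01-2003"-style label for display only; this column is character by design
long$Label <- format(long$Date, "%m-%Y")
str(long)
```

If the xts package is installed, `xts::xts(long$ASI, order.by = long$Date)` would then build the time series from the Date column.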
