How to select rows from a dataset between two dates? - r
I have a fairly large dataset (35 variables and 65,000 rows) and I would like to split it into three parts according to specific dates. I have information about animals before and after a surgery. I'm currently using the dplyr package. Below is what my dataset looks like. I only give an example, because running dput on my real dataset produces something really large and unreadable. As in the example, I have several dates at which measurements were taken for an individual. Each individual also has a surgery date, which is unique to that individual. As in the example, measurements were taken over several years.
Name Date Measurement Surgery_date
Pierre 2016-03-15 5.12 2017-03-21
Pierre 2017-03-16 4.16 2017-03-21
Pierre 2017-08-09 5.08 2017-03-21
Paul 2016-07-03 5.47 2017-03-25
Paul 2016-09-30 4.98 2017-03-25
Paul 2017-04-12 4.51 2017-03-25
So far I've been careful to store both the measurement dates and the surgery dates in a date format, using the lubridate package. Then I tried to sort my data with the dplyr package. I tried filter and select, but neither gave the expected result.
data1$Date <- parse_date_time(data1$Date, "d/m/y")
data1$Date <- ymd(data1$Date)
data1$Surgery_date <- parse_date_time(data1$Surgery_date, "d/m/y")
data1$Surgery_date <- ymd(data1$Surgery_date)
before_surgery <- data1
before_surgery <- dplyr::as_tibble(before_surgery)
before_surgery <- before_surgery %>%
filter(Date > Surgery_date)
before_surgery <- before_surgery %>%
select(Date < Surgery_date)
Either way, no row is deleted. When I try (by the same means) to obtain the dates after surgery, no row is selected at all.
I have checked my file to be sure there actually are dates both before and after the surgery date (otherwise this result would have been normal), and I can confirm that both kinds of dates are present in the dataset.
I have just put here the example of the dates before surgery, assuming it works on the same pattern for the dates after surgery.
Thank you in advance for those who will take time to read me. I'm sorry if the question is quite similar to other ones but I have not been able to figure a solution on my own...
EDIT: To be more specific, the ultimate goal is to have three separate datasets. The first would cover all measurements taken before the surgery, the second the day of the surgery itself + 5 days (but I'll try to handle this one later on), and the third would cover measurements taken after the surgery.
The solution to what you are asking is straightforward, because you can in fact filter on dates and compare dates across columns. Please try the code below and confirm for yourself that it works as you would expect. If this approach does not work on your own dataset, please share more about your data and processing, because there is probably an error in your code. (One error I already saw: you can't use select(Date < Surgery_date). select() picks columns, not rows, so you need to use filter().)
This is how I would approach your problem. As you can see, the code is very straightforward.
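library(dplyr)
library(lubridate)

df <- data.frame(
  Name = c(rep('Pierre', 3), rep('Paul', 3)),
  Date = c('2016-03-15', '2017-03-26', '2017-08-09', '2016-07-03', '2016-09-30', '2017-04-12'),
  Measurement = c(5.12, 4.16, 5.08, 5.47, 4.98, 4.51),
  Surgery_date = c(rep('2017-03-21', 3), rep('2017-03-25', 3))
) %>%
  mutate(Surgery_date = ymd(Surgery_date),
         Date = ymd(Date))

# Measurements strictly before the surgery
df %>%
  filter(Date < Surgery_date)

# Measurements within the five days after the surgery (both bounds exclusive)
df %>%
  filter(Date > Surgery_date & Date < (Surgery_date + days(5)))

# All measurements strictly after the surgery (this includes the five-day window)
df %>%
  filter(Date > Surgery_date)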
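The EDIT in the question asks for three non-overlapping datasets. A minimal sketch of one way to get them in a single pass, using base R only and the example data from the question. The column name period, the labels, and the choice to make the window run from the surgery day through day 5 inclusive are assumptions of this sketch, not something the question pins down.

```r
# Example data from the question, with both columns stored as Date
df <- data.frame(
  Name = rep(c("Pierre", "Paul"), each = 3),
  Date = as.Date(c("2016-03-15", "2017-03-16", "2017-08-09",
                   "2016-07-03", "2016-09-30", "2017-04-12")),
  Measurement = c(5.12, 4.16, 5.08, 5.47, 4.98, 4.51),
  Surgery_date = as.Date(rep(c("2017-03-21", "2017-03-25"), each = 3))
)

# Whole days between measurement and surgery (negative = before surgery)
days_since <- as.numeric(df$Date - df$Surgery_date)

# (-Inf, -1] = before, (-1, 5] = surgery day to day 5, (5, Inf) = after
df$period <- cut(days_since,
                 breaks = c(-Inf, -1, 5, Inf),
                 labels = c("before", "surgery_window", "after"))

# One data frame per period; an empty period gives a zero-row data frame
parts <- split(df, df$period)
```

parts$before, parts$surgery_window and parts$after are then the three datasets. If every row lands in the same period on the real data, checking class(df$Date) and class(df$Surgery_date) is a reasonable first step, since the comparison only behaves as expected when both columns were parsed as dates.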
Related
How to find probability of dataset in R
I have a dataset that looks like the one below (this is a small part). How can I use barplot to show the probability of rain by month?
Date Rain Today
2020-01-01 Yes
2020-01-02 No
2020-01-03 Yes
2020-01-04 Yes
2020-01-05 No
... ...
2020-12-31 Yes
EDIT: Correct answer in the comments.
I don't know why you would want to use a scatterplot for this, but, from this post, you can use a dplyr pipeline to do something like this:
library(dplyr)
df %>%
  group_by(month = format(Date, "%Y-%m")) %>%
  summarise(probability = mean(`Rain Today` == 'Yes'))
This groups your data by month and then takes the mean of the "it rained" indicator, which is the share of days in that month on which it rained. Thank you everyone in the comments for pointing it out. I hope this helps.
The lubridate package has some great functions that help you deal with dates.
install.packages("lubridate")
df$month <- lubridate::month(df$Date)
tapply(df[,"Rain Today"]=="Yes", df$month, mean)
You may need to execute df$Date <- as.Date(df$Date) first if it's currently stored as character rather than a date.
If you don't want to have any dependencies, then I think you can get what you want like this:
df$month <- substr(df$Date, start=6, stop=7) # Get the 6th and 7th characters of your date strings, which correspond to the "month" part
tapply(df[,"Rain Today"]=="Yes", df$month, mean)
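Since the question asks for a barplot, here is a minimal sketch that computes the monthly share of rainy days and then plots it, in base R only. The six rows are made-up illustration data (not the asker's), and the object name prob is an assumption.

```r
# Made-up illustration data; check.names = FALSE keeps the space in "Rain Today"
df <- data.frame(
  Date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03",
                   "2020-01-04", "2020-02-01", "2020-02-02")),
  `Rain Today` = c("Yes", "No", "Yes", "Yes", "No", "Yes"),
  check.names = FALSE
)

# Share of rainy days per month: the mean of a TRUE/FALSE vector is a proportion
prob <- tapply(df$`Rain Today` == "Yes", format(df$Date, "%Y-%m"), mean)

# Draw the bars only in an interactive session, so a script run opens no device
if (interactive()) {
  barplot(prob, ylim = c(0, 1), xlab = "Month", ylab = "Share of rainy days")
}
```

Here prob holds one value per month, named by the "%Y-%m" label, which barplot uses as the bar labels.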
R function for finding difference of dates with exceptions?
I was wondering if there is a function for finding the difference between an issue date and a maturity date, but with two maturity date sources. For example, I want to prioritize the dates in maturity date source 1 and subtract the issue date from them to find the difference. Then, where my dataset is missing dates from maturity date source 1, such as in lines 5 & 6, I want to use dates from maturity date source 2 to fill in the rest. I have tried the code below, but am unsure how to incorporate the data from maturity date source 2 without changing everything else. I have attached a picture for reference. Thank you in advance.
df$Maturity_Date_source_1 <- as.Date(c(df$Maturity_Date_source_1))
df$Issue_Date <- as.Date(c(df$Issue_Date))
df$difference <- (df$Maturity_Date_source_1 - df$Issue_Date) / 365.25
df$difference <- as.numeric(c(df$difference))
An option would be to coalesce the columns and then do the difference:
library(dplyr)
df %>%
  mutate(difference = as.numeric((coalesce(Maturity_Date_source_1, Maturity_Date_source_2) - Issue_Date)/365.25))
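For comparison, the same fallback can be sketched without dplyr by filling the gaps through indexing, which keeps the Date class intact. The three rows are made-up illustration data, and the helper names maturity and missing_1 are assumptions.

```r
# Made-up illustration data: source 1 is missing in the second row
df <- data.frame(
  Issue_Date = as.Date(c("2020-01-01", "2020-01-01", "2019-01-01")),
  Maturity_Date_source_1 = as.Date(c("2021-01-01", NA, "2020-01-01")),
  Maturity_Date_source_2 = as.Date(c("2025-01-01", "2022-01-01", NA))
)

# Start from source 1 and fall back to source 2 only where source 1 is NA
maturity <- df$Maturity_Date_source_1
missing_1 <- is.na(maturity)
maturity[missing_1] <- df$Maturity_Date_source_2[missing_1]

# Date minus Date is a difference in days; divide to get approximate years
df$difference <- as.numeric(maturity - df$Issue_Date) / 365.25
```

Rows where both sources are missing simply stay NA in difference.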
R calculating time differences in a (layered) long dataset
I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out. I'm working with data from a website showing, for each customer (ID), their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps. The df is structured as follows, in a long format:
df <- data.frame("Customer" = c(1, 1, 1, 2, 3, 3),
                 "Visit" = c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited the site three times.
                 "Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25", "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
Note: In the real dataset the timestamp isn't a factor but some other haggard character-type abomination which I should probably first try to convert into a POSIXct format somehow.
What I would like to do here is to create a df that displays per customer their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g., second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.
Because the number of visits can vary drastically (e.g. one or 256 visits) I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns which I could then subtract and average. I'm at a bit of a loss how to best approach this type of problem, thanks a bunch!
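Using dplyr:
library(dplyr)
df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>% # the example stores the timestamps as text
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = difftime(Timestamp, lag(Timestamp), units = "hours")) %>%
  summarise(mean(Difference, na.rm = TRUE))
Due to the grouping, the first value of Difference for any customer is NA (including for customers with only one visit), so those values are dropped when the mean is taken. A customer with a single visit has no differences left and ends up with NaN.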
Using base R (no extra packages):
1. sort the data, ordering by customer id, then by timestamp.
2. calculate the time difference between consecutive rows (using the diff() function), grouping by customer id (tapply() does the grouping).
3. find the average.
4. squish that into a data.frame.
# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp) # not debugged
df <- df[order(df$Customer, df$Timestamp),]
# 2 apply a diff.
# if you want to force the time units to seconds, convert
# the timestamp to numeric first.
# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# convert to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)
# 3 find the averages
diffs.mean <- lapply(diffs, mean)
# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)
# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df
You haven't shown your timestamp strings, but when you need to wrangle them, the lubridate package is your friend.
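The question also asks what to do with single-visit customers and for a grand mean. A minimal base R sketch of that part, run on the example data from the question: it reports hours, marks single-visit customers as NA (rather than 0, so they do not pull the grand mean down), and fixes the time zone to UTC. Those three choices and the names avg_gap_hours and grand_mean_hours are assumptions of this sketch.

```r
# Example data from the question
df <- data.frame(
  Customer = c(1, 1, 1, 2, 3, 3),
  Visit = c(1, 2, 3, 1, 1, 2),
  Timestamp = c("2019-12-31 12:13:25", "2019-12-31 16:13:25",
                "2020-01-05 10:13:25", "2019-11-12 15:18:42",
                "2019-11-13 19:22:35", "2019-12-10 19:43:55")
)

# Parse the text timestamps; a fixed time zone avoids daylight-saving jumps
df$Timestamp <- as.POSIXct(df$Timestamp, tz = "UTC")
df <- df[order(df$Customer, df$Timestamp), ]

# Mean gap between consecutive visits, in hours, per customer
seconds_by_customer <- split(as.numeric(df$Timestamp), df$Customer)
avg_gap_hours <- sapply(seconds_by_customer, function(x) {
  if (length(x) < 2) NA_real_ else mean(diff(x)) / 3600
})

# Grand mean over the customers that have at least two visits
grand_mean_hours <- mean(avg_gap_hours, na.rm = TRUE)
```

avg_gap_hours is a named numeric vector (names are the customer ids), so it can go straight into hist() to look at the distribution.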
Filter Data by Seasonal Ranges Over Several Years Based on Month and Day Column in R Studio
I am trying to filter a large dataset to contain results between a range of days and months over several years to evaluate seasonal objectives. My season is defined from 15 March through 15 September. I can't figure out how to filter the days so that they are only applied to March and September and not the other months within the range. My dataframe is very large and contains proprietary information, but I think the most important information is that the dates are described by these columns: SampleDate (date formatted as %y%m%d), day (numeric), and month (numeric).
I have tried filtering using multiple conditions like so:
S1 <- S1 %>% filter((S1$month >= 3 & S1$day >=15) , (S1$month<=9 & S1$day<=15 ))
I also attempted to set ranges using between for every year that I have data, with no luck:
S1 %>% filter(between(SampleDate, as.Date("2010-03-15"), as.Date("2010-09-15") &
  as.Date("2011-03-15"), as.Date("2011-09-15")&
  as.Date("2012-03-15"), as.Date("2012-09-15")&
  as.Date("2013-03-15"), as.Date("2013-09-15")&
  as.Date("2014-03-15"), as.Date("2014-09-15")&
  as.Date("2015-03-15"), as.Date("2015-09-15")&
  as.Date("2016-03-15"), as.Date("2016-09-15")&
  as.Date("2017-03-15"), as.Date("2017-09-15")&
  as.Date("2018-03-15"), as.Date("2018-09-15")))
I am pretty new to R and can't find any solution online. I know there must be a somewhat simple way to do this! Any help is greatly appreciated!
Maybe something like this:
library(data.table)
library(stringr)
df <- setDT(df)
# convert a date like this '2020-01-01' into this '01-01'
# (date is the date column, SampleDate in the question)
df[,`:=`(month_day = str_sub(as.character(date), 6, 10))]
df[month_day >= '03-15' & month_day <= '09-15']
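The first attempt in the question drops too much because the day conditions apply to every month: day >= 15 together with day <= 15 keeps only the 15th of each month. A minimal base R sketch of an alternative that works on the numeric month and day columns the asker already has, by combining them into one sortable number. The six rows are made-up illustration data and the names month_day and in_season are assumptions.

```r
# Made-up illustration data with the three columns described in the question
S1 <- data.frame(
  SampleDate = as.Date(c("2010-03-14", "2010-03-15", "2011-06-01",
                         "2012-09-15", "2012-09-16", "2013-12-25"))
)
S1$month <- as.integer(format(S1$SampleDate, "%m"))
S1$day <- as.integer(format(S1$SampleDate, "%d"))

# 15 March becomes 315 and 15 September becomes 915, whatever the year
month_day <- S1$month * 100 + S1$day

# Keep 15 March through 15 September (inclusive) of every year
in_season <- S1[month_day >= 315 & month_day <= 915, ]
```

Because the year never enters the comparison, no per-year list of ranges is needed.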
Sort date column and convert csv file to a time -series
I need your help. I am new to R. I have this csv file shorturl.at/chDK9 with the "All Share Index" from the Nigerian stock exchange, formatted as a matrix, with the years as rows and the months as columns. I am trying to do 4 things:
1. Reshape the data to four columns for Date, Month, Year, ASI
2. The period should be a date column in the format 01-2013 for January 2013 and so on.
3. Arrange the data by the date, oldest to newest
4. Convert the data to a time-series type for analysis (xts preferably)
So far I have solved 1 & 2 above. Please see my code below:
rASI <- ASI_conv_to_USD_2003_2018
gathered.rASI <- gather(rASI, Month, ASI, -Year)
gathered.rASI$Date <- format(as.Date(paste0(gathered.rASI$Month, gathered.rASI$Year, "01"), format="%b%Y%d"), "%m-%Y")
ASI <- select(gathered.rASI, Date, Month, Year, ASI)
Created on 2020-06-04 by the reprex package (v0.3.0)
I do not know what I am doing wrong, but the date column still shows as a chr. How do I make the date column function as a proper date? Any help would be greatly appreciated.
Data:
Year,January,February,March,April,May,June,July,August,September,October,November,December
2003,104.904946,108.036674,106.6532671,106.1211644,110.6369777,114.3109402,109.7382693,120.7042254,129.0513061,141.9747008,140.2999274,147.4647619
2004,168.4931751,184.3675093,171.8948949,194.2243976,209.6846881,218.4302457,204.5201028,179.6591854,171.788925,176.3957704,175.7856172,180.1624481
2005,174.3600786,165.874575,156.2704949,165.9616111,162.3373385,162.9130468,165.5409489,177.6973735,190.975969,200.5254592,189.5253288,187.4381323
2006,184.2754864,187.0039216,184.1151874,183.9374803,195.3248086,207.753217,220.2425152,261.5902624,257.3486166,257.9713924,257.9644269,262.3660079
2007,290.763576,321.9563671,344.0977116,373.70341,397.1224052,408.8450816,422.9554882,404.1068702,405.3995157,413.592025,462.4500768,498.6259673
2008,465.9093801,564.6059512,542.1712123,511.539673,507.3090565,481.7790407,457.4977173,411.7628813,398.2089436,312.9651073,284.4105236,240.5413384
2009,151.4739254,160.8334365,136.7210055,147.8068088,203.1480164,183.6687179,169.4245226,152.975866,150.2860646,146.6946313,142.143901,141.1054878
2010,152.3225241,155.1887111,175.6850474,178.6050908,176.5795117,171.5144595,174.5049291,163.103972,154.3394041,169.2037838,167.046543,166.6141118
2011,179.0501835,173.3762495,163.0327771,164.2939247,168.9634855,165.1146804,158.8889704,141.5247531,132.2063595,139.799399,128.3830306,132.7185019
2012,133.3492814,129.4949163,132.8047714,142.0467784,142.1346216,138.9576042,148.4574482,152.9350934,167.5144256,170.2584385,170.6456267,180.8386037
2013,205.1867431,213.04438,216.0144928,215.3981965,243.4601263,232.9424155,244.1989566,233.469857,235.6526892,242.2584675,250.74636,266.2963273
2014,261.3308857,254.8076651,249.6006828,247.9683695,267.1803131,273.6744186,271.1943568,267.5533724,265.4434783,241.8539225,209.9881459,206.9083582
2015,176.4899701,152.4243544,161.5512468,176.6316031,174.6074809,170.307101,153.589313,151.067888,158.9094935,148.4871247,140.5468193,145.7620865
2016,121.710687,125.041883,128.7848346,127.5440712,140.7794402,104.7709381,89.631776,90.34052373,92.97916325,89.3927422,83.19668309,88.25819376
2017,85.43474979,83.04616393,83.42762792,84.35732766,96.74749098,108.4396857,120.8084876,116.2751597,116.1014906,120.1450704,124.20491,125.1822913
2018,145.2937418,141.8812705,136.0134688,135.2162844,124.7488623,125.4006552,121.2306533,114.014232,107.1321563,106.06426,100.7971596,102.5464927
Here might be a way out: gather your data (i.e., change it from wide to long), create a date variable, and only then convert the result to xts.
library(dplyr)
## This assumes that you already have written the data frame (as in your example)
myxts <- ASI_conv_to_USD_2003_2018 %>%
  ## gather changes the data from wide to long
  tidyr::gather("month","value",-Year) %>%
  ## dmy creates the date variable
  mutate(dat = paste0("01 ",month," ",Year) %>% lubridate::dmy()) %>%
  ## keep only the date and the value
  select(dat, value) %>%
  ## sort by date (not compulsory)
  arrange(dat) %>%
  ## convert to xts (note that xts::as_xts() is deprecated)
  timetk::tk_xts(select=value,date_var=dat)
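On the "still shows as a chr" part of the question: format() always returns character, so wrapping as.Date() in format(..., "%m-%Y") turns the dates back into text. A minimal base R sketch that keeps a real Date column for sorting and adds the 01-2013 style label as a separate column. It uses the first two years and two months of the posted data; the names wide, long and Label are assumptions, and the month is looked up with month.name rather than parsed, so the result does not depend on the locale.

```r
# First two years and two months of the posted data, in the original wide shape
wide <- data.frame(
  Year = c(2003, 2004),
  January = c(104.904946, 168.4931751),
  February = c(108.036674, 184.3675093)
)

# Wide to long: unlist() walks the month columns one after another
long <- data.frame(
  Year = rep(wide$Year, times = ncol(wide) - 1),
  Month = rep(names(wide)[-1], each = nrow(wide)),
  ASI = unlist(wide[-1], use.names = FALSE)
)

# Build a real Date from the year and the position of the month name
long$Date <- as.Date(sprintf("%d-%02d-01",
                             as.integer(long$Year),
                             match(long$Month, month.name)))

# Sort oldest to newest on the Date, then add the display label as text
long <- long[order(long$Date), ]
long$Label <- format(long$Date, "%m-%Y")
```

From there, something like xts::xts(long$ASI, order.by = long$Date) should give the time series, assuming the xts package is installed.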