I have the following date vector:
d <- c("30/5/15", "6/6/15", "23/5/15")
I would like to convert it to 2, 3, 1, with the smallest rank for the oldest date and the largest for the newest.
I tried rank(d), but it seems to rank based on the day part only, and in reverse: it returns 3, 1, 2.
Convert to Date class, then numeric, then rank:
d <- c("30/5/15", "6/6/15", "23/5/15")
rank(as.numeric(as.Date(d, "%d/%m/%y")))
#[1] 2 3 1
Suggestions from comments (both sketched below):
drop as.numeric, since rank() can handle Date objects directly, although being explicit might be preferable.
use the lubridate package: library(lubridate); rank(dmy(d))
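A quick sketch of both suggestions, using the d defined above:
library(lubridate) # only needed for the dmy() variant

d <- c("30/5/15", "6/6/15", "23/5/15")

rank(as.Date(d, "%d/%m/%y")) # rank() handles Date objects directly
#[1] 2 3 1
rank(dmy(d))                 # same result with lubridate's dmy()
#[1] 2 3 1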
Convert the data into Date format and then rank it. Internally, a Date is stored as a numeric value, so rank() can work on it directly.
d <- c("30/5/15", "6/6/15", "23/5/15")
rank(as.Date(d,'%d/%m/%y'))
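To see the numeric values a Date is stored as (which is what rank() effectively orders by):
unclass(as.Date(d, '%d/%m/%y')) # days since 1970-01-01
#[1] 16585 16592 16578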
I was wondering if there is a function for finding the difference between an issue date and a maturity date, but with two maturity date sources. For example, I want to prioritize the dates in maturity date source 1 and subtract the issue date from them to find the difference. Then, if my dataset is missing dates from maturity date source 1, such as in lines 5 & 6, I want to use dates from maturity date source 2 to fill in the rest. I have tried the code below, but am unsure how to incorporate the data from maturity date source 2 without changing everything else. I have attached a picture for reference. Thank you in advance.
df$Maturity_Date_source_1 <- as.Date(c(df$Maturity_Date_source_1))
df$Issue_Date <- as.Date(c(df$Issue_Date))
df$difference <- (df$Maturity_Date_source_1 - df$Issue_Date) / 365.25
df$difference <- as.numeric(c(df$difference))
An option would be to coalesce the two columns and then take the difference:
library(dplyr)
df %>%
  mutate(difference = as.numeric((coalesce(Maturity_Date_source_1,
                                           Maturity_Date_source_2) - Issue_Date) / 365.25))
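A minimal reproducible sketch of the same idea, with made-up values (the column names follow the question; the dates are invented):
library(dplyr)

df <- data.frame(
  Issue_Date             = as.Date(c("2010-01-15", "2011-06-01", "2012-03-10")),
  Maturity_Date_source_1 = as.Date(c("2020-01-15", NA,           "2022-03-10")),
  Maturity_Date_source_2 = as.Date(c("2020-02-01", "2021-06-01", "2022-04-01"))
)

df %>%
  mutate(difference = as.numeric((coalesce(Maturity_Date_source_1,
                                           Maturity_Date_source_2) - Issue_Date) / 365.25))
# row 2 falls back to Maturity_Date_source_2 because source 1 is NA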
I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out.
I'm working with data from a website showing for each customer (ID) their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps.
The df is structured as follows, in a long format:
df <- data.frame("Customer" = c(1, 1, 1, 2, 3, 3),
"Visit" =c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited the site three times.
"Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25", "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
Note: In the real dataset the timestamp isn't a factor but some other haggard character-type abomination which I should probably first try to convert into a POSIXct format somehow.
What I would like to do here is to create a df that displays per customer their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g., second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.
Because the number of visits can vary drastically (e.g. one or 256 visits) I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns which I could then subtract and average.
I'm at a bit of a loss how to best approach this type of problem, thanks a bunch!
Using dplyr:
library(dplyr)

df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>% # convert from character first
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = Timestamp - lag(Timestamp)) %>%
  summarise(Mean_Interval = mean(Difference, na.rm = TRUE))
Due to the grouping, the first value of Difference for any customer is NA (including customers with only one visit), so it is dropped by na.rm = TRUE when taking the mean.
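If you also want the result in fixed units (e.g. hours, as the question mentions) and a 0 instead of NaN for single-visit customers, a sketch along the same lines (the column name Mean_Hours is just illustrative):
library(dplyr)

df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = as.numeric(Timestamp - lag(Timestamp), units = "hours")) %>%
  summarise(Mean_Hours = mean(Difference, na.rm = TRUE)) %>%
  mutate(Mean_Hours = if_else(is.nan(Mean_Hours), 0, Mean_Hours)) # single-visit customers get 0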
Using base R (no extra packages):
sort the data, ordering by customer Id, then by timestamp.
calculate the time difference between consecutive rows (using the diff() function), grouping by customer id (tapply() does the grouping).
find the average
squish that into a data.frame.
# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp)
# not debugged
df <- df[order(df$Customer, df$Timestamp),]
# 2 apply a diff.
# if you want to force the time units to seconds, convert
# the timestamp to numeric first.
# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# convert to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)
# 3 find the averages
diffs.mean <- lapply(diffs, mean)
# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)
# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df
You haven't shown the messy timestamp strings from your real dataset, but when you need to wrangle them, the lubridate package is your friend.
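For example, a hedged sketch (the mixed formats here are hypothetical, since the real strings weren't shown):
library(lubridate)

# parse_date_time() tries each candidate format in order,
# which helps with inconsistent character timestamps
x <- c("2019-12-31 12:13:25", "31/12/2019 16:13")
parse_date_time(x, orders = c("Ymd HMS", "dmY HM"))
#[1] "2019-12-31 12:13:25 UTC" "2019-12-31 16:13:00 UTC"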
Just started learning R and I would like to convert a string of a total of 10 characters (yymmdd and some random numbers) to date format.
Example:
Numbers
1. 6010111234
2. 7012245675
3. 9201015678
4. 0404125689
Desired outcome:
Numbers Dates
1. 6010111234 1960-10-11
2. 7012245675 1970-12-24
3. 9201015678 1992-01-01
4. 0404125689 2004-04-12
I am able to do this easily in Excel with the DATE, LEFT and RIGHT formulas:
DATE(LEFT(Numbers,2), RIGHT(LEFT(Numbers,4),2), RIGHT(LEFT(Numbers,6),2))
I have also tried as.Date(substr(df$Numbers, 1, 6), format = "%y%m%d").
However, the results are not what I want: I get 4-5 digit numbers instead of dates.
Can anyone help? Thanks!
If you don't like which dates as.Date(..., format = '%y%m%d') places in the 20th versus the 21st century, you can write your own variant:
nums <- c('6010111234', '7012245675', '9201015678', '0404125689')
breakpoint <- '30'
dplyr::if_else(substr(nums, 1, 2) >= breakpoint,
               as.Date(paste0('19', substr(nums, 1, 6)), '%Y%m%d'),
               as.Date(paste0('20', substr(nums, 1, 6)), '%Y%m%d')
)
#"1960-10-11" "1970-12-24" "1992-01-01" "2004-04-12"
dplyr::if_else is used because base ifelse() coerces Dates to numeric; see e.g. this question and the sketch below.
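A quick illustration of that coercion, using the first and last dates from the example above:
d1 <- as.Date("1960-10-11")
d2 <- as.Date("2004-04-12")

ifelse(TRUE, d1, d2)         # base ifelse() drops the Date class and returns the underlying day count
dplyr::if_else(TRUE, d1, d2) # if_else() keeps the Date class
#[1] "1960-10-11"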
I am new to R and have the following data of user names and their usage dates for a product (truncated output):
Name, Date
Jane, 01-24-2016 10:02:00
Mary, 01-01-2016 12:18:00
Mary, 01-01-2016 13:18:00
Mary, 01-02-2016 13:18:00
Jane, 01-23-2016 10:02:00
I would like to do some analysis on the differences between dates, in particular the number of days between usages for each user. I'd like to plot a histogram to determine whether there is a pattern among users.
How do I compute the difference between dates for each user in R?
Are there any other visualizations besides histograms I should explore?
Thanks
Try this, assuming your data frame is df:
## in case you have different column names
colnames(df) <- c("Name", "Date")
## you might also have Date as factors when reading in data
## the following ensures it is character string
df$Date <- as.character(df$Date)
## convert to Date object
## see ?strptime for various available format
## see ?as.Date for Date object
df$Date <- as.Date(df$Date, format = "%m-%d-%Y %H:%M:%S")
## reorder, so that date are ascending (see Jane)
## this is necessary, otherwise negative number occur after differencing
## see ?order on ordering
df <- df[order(df$Name, df$Date), ]
## take day lags per person
## see ?diff for taking difference
## see ?tapply for applying FUN on grouped data
## as.integer() makes output clean
## if unsure, compare with: lags <- with(df, tapply(Date, Name, FUN = diff))
lags <- with(df, tapply(Date, Name, FUN = function (x) as.integer(diff(x))))
For your truncated data (with 5 rows), I get:
> lags
$Jane
[1] 1
$Mary
[1] 0 1
lags is a list. If you want Jane's information, do lags$Jane. To get a histogram, do hist(lags$Jane). Furthermore, if you simply want a histogram for all clients, overlooking individual differences, use hist(unlist(lags)); unlist() collapses the list into a single vector.
comments:
regarding your request for a good reference to R, see CRAN: R intro and advanced R;
using tapply with multiple indices? You can try the trick of using paste to first construct an auxiliary index (see the sketch after these comments);
Er, it looks like I made things more complicated than necessary, by using density and the central limit theorem, etc., for visualization. So I removed my other answer.
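A minimal sketch of that paste() trick, assuming a hypothetical second grouping column Group alongside Name:
## combine the two grouping columns into one auxiliary index,
## then use it as the grouping argument of tapply()
## (df$Group is hypothetical; any second grouping column works)
idx <- paste(df$Name, df$Group, sep = "_")
lags <- tapply(df$Date, idx, FUN = function(x) as.integer(diff(x)))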
We can use data.table with lubridate
library(lubridate)
library(data.table)
setDT(df1)[order(mdy_hms(Date)), .(Diff=as.integer(diff(as.Date(mdy_hms(Date))))), Name]
# Name Diff
#1: Mary 0
#2: Mary 1
#3: Jane 1
If there are several grouping variables, e.g. "ID", we can place them in the by:
setDT(df1)[order(mdy_hms(Date)),
           .(Diff = as.integer(diff(as.Date(mdy_hms(Date))))),
           by = .(Name, ID)]
I have two columns of data:
DoB: yyyy/mm
Reported date: yyyy/mm/dd
Both are in character format.
I'd like to calculate an age, by subtracting DoB from Reported Date, without adding a fictional day to the DoB, so that the age comes out as 28.5 (meaning 28 and a half years old).
Please can someone help me with the coding, I'm struggling!
Many thanks from an R newbie.
library(lubridate)
a <- "2010/02"
b <- "2014/12/25"
# I don't think this can be done without adding a fictional day
age <- ymd(b) - ymd(paste0(a, "/01"))
age <- as.numeric(age / 365.25)
What would you want the age to be if the dates are:
DoB: 2015/01
Reported date: 2015/01/30
As suggested, lubridate is a great package for working with dates. You probably want some version using difftime. You can also still use ymd for the yyyy/mm values by setting truncated = 1, meaning that the trailing (day) field may be missing.
df <- data.frame(DoB = c("1987/08", "1994/04"),
                 Report_Date = c("2015/03/05", "2014/07/04"))
library(lubridate)
df$age_years <- with(df,
                     as.numeric(
                       difftime(ymd(Report_Date),
                                ymd(DoB, truncated = 1))/365.25))
df
      DoB Report_Date   age_years
1 1987/08  2015/03/05 27.59206023
2 1994/04  2014/07/04 20.25735797
Unfortunately difftime doesn't have a 'years' unit, so you also need to divide the 'days' output you get back by 365.25.
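For instance, a small sketch that reproduces the first row above:
library(lubridate)
# difftime units are limited to secs/mins/hours/days/weeks, hence the
# manual division by 365.25 to express the result in years
as.numeric(difftime(ymd("2015/03/05"), ymd("1987/08", truncated = 1),
                    units = "days")) / 365.25
#[1] 27.59206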
Use the "yearmon" class in zoo. It represents time as years + fraction (where fraction is in the set 0, 1/12, ..., 11/12) and so does not require that fictitious days be added:
library(zoo)
as.yearmon("2012/01/10", "%Y/%m/%d") - as.yearmon("1983/07", "%Y/%m")
giving:
[1] 28.5
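A hedged sketch applying the same idea to the df from the previous answer (same example rows):
library(zoo)

df <- data.frame(DoB = c("1987/08", "1994/04"),
                 Report_Date = c("2015/03/05", "2014/07/04"))

# yearmon keeps only year + month, so no fictional day is ever added;
# the result is in years, at month resolution (multiples of 1/12)
df$age_years <- as.yearmon(df$Report_Date, "%Y/%m/%d") - as.yearmon(df$DoB, "%Y/%m")
df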