I have data like this:
customer_id Last_city First_city recent_date
1020 Jaipur Gujarat 20130216
1021 Delhi Lucknow 20130129
1022 Mumbai Punjab 20130221
and I want to find the number of days between recent_date and today (for every record).
The difftime function calculates time differences in days, hours, minutes, etc.
First, you need to parse the date string into a date representation (e.g. Date or POSIXct), then compare that to the current date/time.
# create dummy data.frame for testing
df <- data.frame("customer_id"=1020, "Last_city"="Jaipur",
"First_city"="Gujarat", "recent_date"="20130216",
stringsAsFactors = FALSE)
now <- Sys.Date()
# parse date into date type (Note: %Y=4-digit year, %y=2-digit year)
df$date = as.Date(df$recent_date, format = "%Y%m%d")
# next calculate the difference between recent date and current time
df$diff = as.double(difftime(now, df$date, units = c("days")))
> df
customer_id Last_city First_city recent_date date diff
1 1020 Jaipur Gujarat 20130216 2013-02-16 1604
If you want the difference in weeks instead:
> as.double(difftime(now, df$date, units = c("weeks")))
[1] 229.1429
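The same steps apply to every row at once, because as.Date and date subtraction are vectorised. Here is a minimal sketch using the three records from the question; the fixed reference date `ref` and the column name `days_since` are assumptions made only so the result is reproducible (in practice you would use Sys.Date()).

```r
# the three records from the question (only the columns needed here)
df <- data.frame(customer_id = c(1020, 1021, 1022),
                 recent_date = c("20130216", "20130129", "20130221"),
                 stringsAsFactors = FALSE)

# hypothetical fixed reference date; replace with Sys.Date() for "today"
ref <- as.Date("2013-03-01")

# parse all rows in one call, then subtract; the result is in days
df$date <- as.Date(df$recent_date, format = "%Y%m%d")
df$days_since <- as.numeric(ref - df$date)
df
```

Subtracting two Date objects always gives a difference in days, so difftime is only needed when you want other units.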
Related
So i have a dataset in R:
IncidentID Time Vehicle
19002 4:48 Car
19003 12:30 Motorcycle
19004 14:00 Car
19005 9:30 Bicycle
And I'm trying to filter out some data, since it's quite a large dataset. The above is just a few example rows.
I want to filter the data by time. Say I want the rows where Time is between 12pm and 6pm (18:00 in 24-hour format); then I would have:
IncidentID Time Vehicle
19003 12:30 Motorcycle
19004 14:00 Car
I did:
incident <- read.csv("incident.csv")
afternoon_incident <- incident[which(incident$Time >= 12 && incident$Time <= 18),]
But I'm getting the error saying:
1: In Ops.factor(web$Time, 6:0) : ‘>=’ not meaningful for factors
2: In Ops.factor(web$Time, 12:0) : ‘<=’ not meaningful for factors
You can use lubridate to convert the Time field into a time object and then extract the hour for filtering:
library(lubridate)
incident$Time <- hm(as.character(incident$Time))
incident[which(hour(incident$Time) >= 12 & hour(incident$Time) <= 18), ]
You need to first convert the Time into actual date-time object using as.POSIXct and then compare.
As you want to subset based on hour, we can extract only the hour part of the data using format and keep the rows whose hour is between 12 and 18 (note that hour <= 18 also keeps times from 18:01 to 18:59). Using base R, we can do
df$hour <- as.numeric(format(as.POSIXct(df$Time, format = "%H:%M"), "%H"))
subset(df, hour >= 12 & hour <= 18)
# IncidentID Time Vehicle hour
#2 19003 12:30 Motorcycle 12
#3 19004 14:00 Car 14
You can remove the hour column later if not needed.
For a general solution, we can create a date-time column and then compare
df$datetime <- as.POSIXct(df$Time, format = "%H:%M")
subset(df, datetime >= as.POSIXct("12:30:00", format = "%T") &
datetime <= as.POSIXct("18:30:00", format = "%T"))
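If the upper bound has to be exactly 18:00, one option is to compare minutes since midnight instead of the hour. This is only a sketch in base R: the `minutes` column and the `afternoon` name are made up for illustration, and the sample rows are the ones from the question. It also uses the vectorised `&` rather than `&&`, and as.character() in case read.csv turned Time into a factor.

```r
incident <- data.frame(IncidentID = c(19002, 19003, 19004, 19005),
                       Time = c("4:48", "12:30", "14:00", "9:30"),
                       Vehicle = c("Car", "Motorcycle", "Car", "Bicycle"),
                       stringsAsFactors = FALSE)

# split "H:MM" into hour and minute parts and convert to minutes since midnight
parts <- strsplit(as.character(incident$Time), ":", fixed = TRUE)
incident$minutes <- vapply(parts,
                           function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
                           numeric(1))

# keep rows from 12:00 up to and including 18:00
afternoon <- incident[incident$minutes >= 12 * 60 & incident$minutes <= 18 * 60, ]
afternoon
```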
I have data for hospitalisations that records date of admission and the number of days spent in the hospital:
ID date ndays
1 2005-06-01 15
2 2005-06-15 60
3 2005-12-25 20
4 2005-01-01 400
4 2006-06-04 15
I would like to create a dataset of days spent at the hospital per year, and therefore I need to deal with cases like ID 3, whose stay at the hospital goes over the end of the year, and ID 4, whose stay at the hospital is longer than one year. There is also the problem that some people already have a record in the next year, and I would like to add the 'surplus' days to those when this happens.
So far I have come up with this solution:
library(lubridate)
year_end <- as.Date(paste(year(data$date), "12-31", sep = "-"), format = "%Y-%m-%d")
# days remaining in the admission year, capped at the length of the stay
days_left <- as.numeric(year_end - as.Date(data$date))
ndays_new <- ifelse(days_left < data$ndays, days_left, data$ndays)
However, I can't think of a way to get those 'surplus' days that go over the end of the year and assign them to a new record starting in the next year. Can anyone point me to a good solution? I use dplyr, so solutions with that package would be especially welcome, but I'm willing to try any other tool if needed.
My solution isn't compact. But, I tried to employ dplyr and did the following. I initially changed column names for my own understanding. I calculated another date (i.e., date.2) by adding ndays to date.1. If the years of date.1 and date.2 match, that means you do not have to consider the following year. If the years do not match, you need to consider the following year. ndays.2 is basically ndays for the following year. Then, I reshaped the data using do. After filtering unnecessary rows with NAs, I changed date to year and aggregated the data by ID and year.
library(dplyr)
rename(mydf, date.1 = date, ndays.1 = ndays) %>%
mutate(date.1 = as.POSIXct(date.1, format = "%Y-%m-%d"),
date.2 = date.1 + (60 * 60 * 24) * ndays.1,
ndays.2 = ifelse(as.character(format(date.1, "%Y")) == as.character(format(date.2, "%Y")), NA,
date.2 - as.POSIXct(paste0(as.character(format(date.2, "%Y")),"-01-01"), format = "%Y-%m-%d")),
ndays.1 = ifelse(ndays.2 %in% NA, ndays.1, ndays.1 - ndays.2)) %>%
do(data.frame(ID = .$ID, date = c(.$date.1, .$date.2), ndays = c(.$ndays.1, .$ndays.2))) %>%
filter(complete.cases(ndays)) %>%
mutate(date = as.numeric(format(date, "%Y"))) %>%
rename(year = date) %>%
group_by(ID, year) %>%
summarise(ndays = sum(ndays))
# ID year ndays
#1 1 2005 15
#2 2 2005 60
#3 3 2005 7
#4 3 2006 13
#5 4 2005 365
#6 4 2006 50
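Another way to think about it, which also copes with stays spanning more than two calendar years, is to expand each stay into one row per day and then count days per ID and year. The sketch below uses base R only (not dplyr) and assumes the admission date counts as the first day of the stay; the names `stays`, `per_stay` and `all_days` are illustrative.

```r
stays <- data.frame(ID = c(1, 2, 3, 4, 4),
                    date = as.Date(c("2005-06-01", "2005-06-15", "2005-12-25",
                                     "2005-01-01", "2006-06-04")),
                    ndays = c(15, 60, 20, 400, 15))

# one row per hospital day, labelled with the calendar year it falls in
per_stay <- lapply(seq_len(nrow(stays)), function(i) {
  days <- stays$date[i] + seq_len(stays$ndays[i]) - 1
  data.frame(ID = stays$ID[i], year = as.integer(format(days, "%Y")))
})
all_days <- do.call(rbind, per_stay)
all_days$ndays <- 1

# count the days per ID and year; records in the same year are summed together
result <- aggregate(ndays ~ ID + year, data = all_days, FUN = sum)
result <- result[order(result$ID, result$year), ]
result
```

For this data it gives the same totals as the output above. Expanding to daily rows is simple but memory-hungry for very long stays or very large tables.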
I have a data frame
> df
Age year sex
12 80210 F
13 9123 M
I want to convert the year value 80210 to 26june1982. How can I do this so that the new data frame contains the year column in day-month-year format, converted from Julian days?
You can convert Julian dates to dates using as.Date and specifying the appropriate origin:
as.Date(8210, origin=as.Date("1960-01-01"))
#[1] "1982-06-24"
However, 80210 needs an origin pretty long ago.
You should subtract the origin from the year column.
as.Date(c(80210,9123)-80210,origin='1982-06-26')
[1] "1982-06-26" "1787-11-08"
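To make the role of the origin concrete, here is a small sketch. It assumes the day counts are relative to 1960-01-01 (the origin used in the answer above) and, separately, works out which origin would be needed for 80210 to land on 1982-06-26. The variable names are illustrative only.

```r
julian_days <- c(8210, 9123)

# day counts relative to an assumed origin of 1960-01-01
converted <- as.Date(julian_days, origin = "1960-01-01")

# day-month-year text, as asked for in the question
dmy <- format(converted, "%d-%m-%Y")

# which origin would make 80210 correspond to 1982-06-26?
implied_origin <- as.Date("1982-06-26") - 80210
check <- as.Date(80210, origin = implied_origin)
```

Here `dmy[1]` is "24-06-1982", and `check` is 1982-06-26 by construction. Whether 80210 or 8210 is the intended value is something only the data source can settle.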
There are some options for doing this job in the R package date.
See for example on page 4, the function date.mmddyy, which says:
Given a vector of Julian dates, this returns them in the form “10/11/89”, “28/7/54”, etc.
Try this code:
age = c(12,13)
year = c(8210,9123)
sex = c("F","M")
df = data.frame(age, year, sex) # no cbind(): it would coerce every column to character
library(date)
date = date.mmddyy(year, sep = "/")
df2 = transform(df,year=date) #hint provided by jilber
df2
age year sex
1 12 6/24/82 F
2 13 12/23/84 M
I would like to count the number of Sundays, Mondays, Tuesdays, ..., Saturdays in the year 2001, taking the following dates {1 Jan, 5 April, 13 April, 25 Dec and 26 Dec} as public holidays and counting them as Sundays. How can I do it in R? Thanks.
Here is a version run in a Lithuanian locale (sekmadienis is Sunday):
dates <- as.Date("2001-01-01") + 0:364
wd <- weekdays(dates)
idx <- which(dates %in% as.Date(c("2001-01-01", "2001-04-05",
"2001-04-13", "2001-12-25", "2001-12-26")))
wd[idx] <- "sekmadienis"
table(wd)
wd
antradienis ketvirtadienis penktadienis pirmadienis sekmadienis šeštadienis trečiadienis
51 51 51 52 57 52 51
Try the following:
# get all the dates you need
dates <- seq(from=as.Date("2001-01-01"), to=as.Date("2001-12-31"), by="day")
# makes sure the dates are in POSIXlt format
dates <- strptime(dates, "%Y-%m-%d")
# get rid of the public holidays
pub <- strptime(c(as.Date("2001-01-01"),
as.Date("2001-04-05"),
as.Date("2001-04-13"),
as.Date("2001-12-25"),
as.Date("2001-12-26")), "%Y-%m-%d")
dates <- dates[which(!dates%in%pub)]
# To see the day of the week
weekdays <- dates$wday
# Now, count the number of Mondays for example:
length(which(weekdays == 1))
For details, see the documentation for DateTimeClasses. Remember to add 5 to your count of Sundays.
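Since weekdays() returns locale-dependent names, a locale-independent variant is to work with the numeric weekday from POSIXlt (0 = Sunday, ..., 6 = Saturday) and recode the holidays to 0 directly. This is a sketch combining the two ideas above; the English labels are chosen here for readability.

```r
dates <- seq(as.Date("2001-01-01"), as.Date("2001-12-31"), by = "day")
holidays <- as.Date(c("2001-01-01", "2001-04-05", "2001-04-13",
                      "2001-12-25", "2001-12-26"))

# numeric day of week: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# treat the public holidays as Sundays
wday[dates %in% holidays] <- 0

counts <- table(factor(wday, levels = 0:6,
                       labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")))
counts
# Sun Mon Tue Wed Thu Fri Sat
#  57  52  51  51  51  51  52
```

These counts agree with the Lithuanian table above (57 Sundays including the five holidays).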
I need to convert a date (m/d/y format) into 3 separate columns on which I hope to run an algorithm (I'm trying to convert my dates into Julian Day Numbers). I saw this suggestion to another user for separating data out into multiple columns using Oracle. I'm using R and am thoroughly stuck on how to code this appropriately. Would A1, A2, ... represent my new column headings, and what would the format difference be with the "update set" section?
update <tablename> set A1 = substr(ORIG, 1, 4),
A2 = substr(ORIG, 5, 6),
A3 = substr(ORIG, 11, 6),
A4 = substr(ORIG, 17, 5);
I'm trying hard to improve my skills in R but cannot figure this one...any help is much appreciated. Thanks in advance... :)
I use the format() method for Date objects to pull apart dates in R. Using Dirk's datetext, here is how I would go about breaking up a date into its constituent parts:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
datetxt <- as.Date(datetxt)
df <- data.frame(date = datetxt,
year = as.numeric(format(datetxt, format = "%Y")),
month = as.numeric(format(datetxt, format = "%m")),
day = as.numeric(format(datetxt, format = "%d")))
Which gives:
> df
date year month day
1 2010-01-02 2010 1 2
2 2010-02-03 2010 2 3
3 2010-09-10 2010 9 10
Note what several others have said; you can get the Julian dates without splitting out the various date components. I added this answer to show how you could do the breaking apart if you needed it for something else.
Given a text variable x, like this:
> x
[1] "10/3/2001"
then:
> as.Date(x,"%m/%d/%Y")
[1] "2001-10-03"
converts it to a date object. Then, if you need it:
> julian(as.Date(x,"%m/%d/%Y"))
[1] 11598
attr(,"origin")
[1] "1970-01-01"
gives you a Julian date (relative to 1970-01-01).
Don't try the substring thing...
See help(as.Date) for more.
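julian() also accepts an origin argument, so the day count can be taken relative to any date you like. A short sketch, where the second input string and the 2000-01-01 origin are arbitrary choices for illustration:

```r
x <- c("10/3/2001", "1/15/2002")
d <- as.Date(x, format = "%m/%d/%Y")

# days since the default origin, 1970-01-01
days_1970 <- as.numeric(julian(d))

# days since a different origin, here 2000-01-01
days_2000 <- as.numeric(julian(d, origin = as.Date("2000-01-01")))
```

as.numeric() just drops the "origin" attribute that julian() attaches to its result.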
Quick ones:
Julian date converters already exist in base R, see eg help(julian).
One approach may be to parse the date as a POSIXlt and to then read off the components. Other date / time classes and packages will work too but there is something to be said for base R.
Parsing dates as string is almost always a bad approach.
Here is an example:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
dates <- as.Date(datetxt) ## you could examine these as well
plt <- as.POSIXlt(dates) ## now as POSIXlt types
plt[["year"]] + 1900 ## years are with offset 1900
#[1] 2010 2010 2010
plt[["mon"]] + 1 ## and months are on the 0 .. 11 interval
#[1] 1 2 9
plt[["mday"]]
#[1] 2 3 10
df <- data.frame(year=plt[["year"]] + 1900,
month=plt[["mon"]] + 1, day=plt[["mday"]])
df
# year month day
#1 2010 1 2
#2 2010 2 3
#3 2010 9 10
And of course
julian(dates)
#[1] 14611 14643 14862
#attr(,"origin")
#[1] "1970-01-01"
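Note that julian() counts days from 1970-01-01 by default. If what you need is the astronomical Julian Day Number, a common approach is to add the Julian Day Number of 1970-01-01, which is 2440588. A minimal sketch, with that offset written out as a named constant (the second date is added here only as a well-known check value):

```r
dates <- as.Date(c("2010-01-02", "2000-01-01"))

# Julian Day Number of 1970-01-01, the default origin of julian()
jdn_epoch <- 2440588

# days since 1970-01-01 shifted to Julian Day Numbers
jdn <- as.numeric(julian(dates)) + jdn_epoch
jdn
```

2000-01-01 has Julian Day Number 2451545, which makes a handy sanity check for the offset.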
To convert a date (m/d/y format) into 3 separate columns, consider this df:
df <- data.frame(date = c("01-02-18", "02-20-18", "03-23-18"))
df
date
1 01-02-18
2 02-20-18
3 03-23-18
Convert to date format
df$date <- as.Date(df$date, format="%m-%d-%y")
df
date
1 2018-01-02
2 2018-02-20
3 2018-03-23
To get three separate columns with year, month and day:
library(lubridate)
df$year <- year(ymd(df$date))
df$month <- month(ymd(df$date))
df$day <- day(ymd(df$date))
df
date year month day
1 2018-01-02 2018 1 2
2 2018-02-20 2018 2 20
3 2018-03-23 2018 3 23
Hope this helps.
Hi Gavin: another way [using your idea] is:
The data-frame we will use is oilstocks which contains a variety of variables related to the changes over time of the oil and gas stocks.
The variables are:
colnames(stocks)
"bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC"
"emMN" "emMN.1" "chdate" "chV" "cbO" "chC" "chMN" "chMX"
One of the first things to do is change the emdate field, which is an integer vector, into a date vector.
realdate<-as.Date(emdate,format="%m/%d/%Y")
Next we want to split emdate column into three separate columns representing month, day and year using the idea supplied by you.
dfdate <- data.frame(date = realdate)
year <- as.numeric(format(realdate, "%Y"))
month <- as.numeric(format(realdate, "%m"))
day <- as.numeric(format(realdate, "%d"))
ls() will include the individual vectors, day, month, year and dfdate.
Now merge the dfdate, day, month, year into the original data-frame [stocks].
ostocks<-cbind(dfdate,day,month,year,stocks)
colnames(ostocks)
"date" "day" "month" "year" "bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC" "emMN" "emMX" "chdate" "chV"
"cbO" "chC" "chMN" "chMX"
Similar results and I also have date, day, month, year as separate vectors outside of the df.
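For anyone who wants to try this without the original file, here is a self-contained sketch of the same steps on a tiny stand-in data frame. The two rows, the values in them, and the assumption that emdate is stored as m/d/Y text are invented for illustration; the real stocks data has many more columns.

```r
# hypothetical stand-in for the stocks data frame
stocks <- data.frame(emdate = c("01/02/2010", "02/03/2010"),
                     emV = c(100, 200),
                     stringsAsFactors = FALSE)

# parse the text dates, then pull out the components
realdate <- as.Date(stocks$emdate, format = "%m/%d/%Y")
dfdate <- data.frame(date = realdate)
year <- as.numeric(format(realdate, "%Y"))
month <- as.numeric(format(realdate, "%m"))
day <- as.numeric(format(realdate, "%d"))

# put the new columns in front of the original ones
ostocks <- cbind(dfdate, day, month, year, stocks)
colnames(ostocks)
```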