Hi, this is a two-part question.
How do I create an auto-incrementing data frame of dates?
I want to automatically create a data frame with a column "dates" whose values are at one-month intervals from 2011-05-01 (1st May 2011) until today (2015-12-01).
Output:
S.no. Date
1 2011-05-01
2 2011-06-01
3 2011-07-01
. .
56 2015-12-01
Second, I have a data frame with each customer's name and expiry date, for example:
names<-c("Tom","David")
expiryDate<-as.Date(c("2011-05-22","2011-06-19"))
df<-data.frame(names,expiryDate)
df
Name Expirydate
Tom 2011-05-22
David 2011-06-19
I want to process the expiry dates to check whether each customer is active in each month.
Name 2011-05-01 2011-06-01 2011-07-01 ... (till 2015-12-01)
Tom TRUE FALSE FALSE
David TRUE TRUE FALSE
As @Roland mentioned, you can use seq.Date to generate a sequence of dates:
DateColumns <- seq.Date(as.Date("2011/05/01"), as.Date("2015/12/1"), by = "1 month")
DateColumnvalues <- t(sapply(df$expiryDate, function(x) x > DateColumns))
x <- data.frame(DateColumnvalues, row.names = df$names)
colnames(x) <- as.character(DateColumns)
This generates a sequence of dates (DateColumns) for the 1st of every month and then uses sapply to check whether each expiryDate is later than each of those dates.
The first line of the code answers the first part of your question as well.
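For the first part on its own, a minimal sketch of the date data frame might look like the following. The names date_df, S.no and Date are only illustrative, taken from the output layout shown in the question.

```r
# Monthly sequence from 2011-05-01 to 2015-12-01 (both ends included)
dates <- seq(as.Date("2011-05-01"), as.Date("2015-12-01"), by = "month")

# Serial-number column plus the dates themselves
date_df <- data.frame(S.no = seq_along(dates), Date = dates)

head(date_df, 3)
nrow(date_df)
```

Because the sequence is built from Date objects, changing the end date to Sys.Date() should extend it automatically up to the current month.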
Related
I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM) if there are data points for each hour at least (11 AM, 12, 1, 2, 3, 4 PM) within each day. I want to ultimately sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data from any of the times between 11 AM and 4 PM, but often I would only have data for e.g. 12 and 1 PM for a given day.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
This complements @Henry Navarro's answer by solving an additional problem mentioned in the question.
If I understand correctly, the question also asks for the dates on which there is at least one data point for each hour of the given interval. One possible way, following the style of @Henry Navarro's solution, is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
# take the days from the list names so they line up with split()'s sorted groups
all_hours_flags <- data.frame(days = ymd(names(your_data_by_days_list)),
  all_hours_present = sapply(your_data_by_days_list,
    function(Z) all(hours_intervals %in% Z$hour_only)), row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now a column "all_hours_present" indicating whether the data for the corresponding day contain at least one value for each hour in hours_intervals. You can use this column to subset your data:
subset(your_data, all_hours_present)
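The question ultimately asks for the sum of mat.3 per day over the time window. A minimal base-R sketch of that last step, on made-up toy data (the values, and the helper names in_window, complete, complete_days, kept and daily_sums, are assumptions for illustration only), could be:

```r
# Toy data (hypothetical values): 19 May has readings for 11, 12 and 13 h,
# 20 May is missing the 11 h reading
your_data <- data.frame(
  times = as.POSIXct(c("2017-05-19 11:01:00", "2017-05-19 12:01:00",
                       "2017-05-19 13:01:00", "2017-05-20 12:01:00",
                       "2017-05-20 13:01:00"), tz = "UTC"),
  mat.3 = c(1, 2, 3, 10, 20)
)

hours_intervals <- 11:13  # narrowed window, as in the answer above
your_data$hour_only <- as.numeric(format(your_data$times, "%H"))
your_data$days <- as.Date(format(your_data$times, "%Y-%m-%d"))

# Keep only rows inside the window
in_window <- your_data[your_data$hour_only %in% hours_intervals, ]

# Flag the days on which every hour of the window occurs at least once
complete <- tapply(in_window$hour_only, in_window$days,
                   function(h) all(hours_intervals %in% h))
complete_days <- as.Date(names(complete)[complete])

# Sum mat.3 per complete day
kept <- in_window[in_window$days %in% complete_days, ]
daily_sums <- aggregate(mat.3 ~ days, data = kept, FUN = sum)
daily_sums
```

Only 19 May survives here, because 20 May has no reading in the 11 o'clock hour.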
Try creating a new variable in your data frame that holds only the time of day.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then use this new variable as follows:
# auxiliary variable flagging rows inside your time interval (bounds inclusive)
your_data$aux_var<-ifelse(your_data$hour >= "11:00:00" & your_data$hour <= "16:00:00", 1, 0)
The next step is to filter your data where aux_var == 1:
your_data[which(your_data$aux_var ==1),]
I have data in the following format:
quotes <- read.csv(text = "
id,ts,origin,product,bid,ask,nextts
1,2016-10-18 20:20:54.733,SourceA,Dow,1.09812,1.0982,
2,2016-10-18 20:20:55.093,SourceA,Ftse,7010.5,7011.5,
3,2016-10-18 20:20:55.149,SourceA,Dow,18159.0,18161.0,
4,2016-10-18 20:20:55.871,SourceA,Ftse,18159.0,18161.0,")
How can I populate the column 'nextts' with the value of ts from the next row that has the same origin and the same product? Essentially, I want to join the data to itself (matching on product and origin) and capture the value of ts.
I found the following answer, but it is a strict lead/lag without any criteria:
Return next row in a dataframe R
First ensure that ts is character or POSIXct rather than factor by explicitly converting it as shown here or by using the as.is=TRUE argument to read.csv. Then use ave with the indicated function to shift by group.
quotes$ts <- as.character(quotes$ts)
transform(quotes, nextts = ave(ts, origin, product, FUN = function(x) c(x[-1], NA)))
giving:
id ts origin product bid ask nextts
1 1 2016-10-18 20:20:54.733 SourceA Dow 1.09812 1.0982 2016-10-18 20:20:55.149
2 2 2016-10-18 20:20:55.093 SourceA Ftse 7010.50000 7011.5000 2016-10-18 20:20:55.871
3 3 2016-10-18 20:20:55.149 SourceA Dow 18159.00000 18161.0000 <NA>
4 4 2016-10-18 20:20:55.871 SourceA Ftse 18159.00000 18161.0000 <NA>
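As a follow-up sketch, once nextts is filled in you may want the gap to the next quote in seconds. The helper parse_ts and the column gap_secs are hypothetical names, the data is a trimmed copy of the sample above, and the rows are assumed to be in time order within each origin/product group.

```r
# Trimmed copy of the sample data; as.is = TRUE keeps ts as character
quotes <- read.csv(text = "id,ts,origin,product
1,2016-10-18 20:20:54.733,SourceA,Dow
2,2016-10-18 20:20:55.093,SourceA,Ftse
3,2016-10-18 20:20:55.149,SourceA,Dow
4,2016-10-18 20:20:55.871,SourceA,Ftse", as.is = TRUE)

# Shift ts up by one position within each origin/product group
quotes$nextts <- ave(quotes$ts, quotes$origin, quotes$product,
                     FUN = function(x) c(x[-1], NA))

# Parse with fractional seconds (%OS) and take the difference in seconds
parse_ts <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")
quotes$gap_secs <- as.numeric(difftime(parse_ts(quotes$nextts),
                                       parse_ts(quotes$ts), units = "secs"))
quotes
```

The last quote of each group has no successor, so both nextts and gap_secs are NA there.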
I have an Excel file whose date column runs from 1/1/15 to 12/31/15. I want to change every 15 (the year) to 14, so that the dates run from 1/1/14 to 12/31/14. How can I do that in R? Right now I use the replace function to change the dates manually, but there are 150000 more records....
If you don't want to convert to 'Date' class and want to keep the same format, one option is sub. Here we match a trailing 15 (the last two characters) and replace it with 14.
sub('15$', '14', v1)
#[1] "1/1/14" "12/31/14" "1/1/14"
data
v1 <- c('1/1/15', '12/31/15', '1/1/14')
You could use lubridate where you can just subtract 'x' number of years.
library(lubridate)
# some random 2015 dates
df <- data.frame(dates = mdy("01/13/2015", "02/25/2015"))
# subtract 1 year
df$dates <- with(df, dates - years(1))
df
dates
1 2014-01-13
2 2014-02-25
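If the dates arrive as m/d/yy text, as in the question, a base-R round trip is another option. This is only a sketch under the assumption that the column looks like v1 below; the names d, lt and shifted are illustrative.

```r
# Dates as text in m/d/yy form, as exported from the spreadsheet (assumed)
v1 <- c("1/1/15", "12/31/15")

# Parse to Date; %y reads the two-digit year 15 as 2015
d <- as.Date(v1, format = "%m/%d/%y")

# POSIXlt exposes the year as a field, so it can be decremented directly
lt <- as.POSIXlt(d)
lt$year <- lt$year - 1
shifted <- as.Date(lt)

# Back to text; note that format() pads day and month with leading zeros
format(shifted, "%m/%d/%y")
```

One caveat: 29 February has no counterpart in a non-leap year, so such a date would roll over to 1 March with this approach.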
I have this data with 4000 observations, so this is head(both):
kön gdk age fbkurs pers stterm
1 man FALSE 69 FALSE 1941-12-23 2011-01-19
2 man NA 70 FALSE 1942-02-11 2012-01-19
3 kvinna NA 65 FALSE 1942-06-04 2007-09-01
4 kvinna TRUE 68 FALSE 1943-04-04 2011-09-01
5 kvinna NA 65 FALSE 1943-10-30 2008-09-01
6 man FALSE 70 TRUE 1944-01-27 2013-09-01
I want to create a new column based on the column named 'stterm'.
stterm contains different dates that I would rather label as terms, for example VT10, VT11, etc. I would like to call the new column regyear.
I have tried to enter:
regyear <- factor(both$stterm, levels = c("2007-09-01"="HT07" "2008-09-01"="HT09" "2009-01-19"="VT09" "2009-09-01"="HT09" "2010-01-19"="VT10" "2010-09-01"="HT10" "2011-01-19"="VT11"
"2011-09-01"="HT11" "2012-01-19"="VT12" "2012-09-01"="HT12" "2013-01-19"="VT13" "2013-09-01"="HT13" "2014-01-19"="VT14"))
but when I do, I get the following error message:
Error: unexpected string constant in "regyear<- factor(both$stterm, levels = c("2007-09-01"='HT07' "2008-09-01""
What should I do to make them right?
Your code relies on quite a bit of hard-coding, which may be prone to mistakes and will be tedious if you have many dates which you wish to map to periods.
Here are some alternatives, where your dates first are converted to class Date using as.Date. This makes it easier to extract and map months to the periods "VT" or "HT", and to extract the year.
In the first example, I use cut which "divides the range of x into intervals and codes the values in x according to which interval they fall.":
# some dates which are converted to proper R dates
dates <- as.Date(c("2006-09-01", "2007-02-01", "2008-09-01", "2009-01-19"))
# extract month
month <- as.integer(format(dates, "%m"))
# extract year
year <- format(dates, "%y")
# cut the months into intervals and label the levels
term <- cut(x = month, breaks = c(0, 8, 12), labels = c("VT", "HT"))
# paste 'term' and 'year' together
paste0(term, year)
# [1] "HT06" "VT07" "HT08" "VT09"
In the second example, findInterval is used to create a numerical vector of interval indices. This vector is used to extract elements from a 'period' vector. The periods are then pasted with year as above.
paste0(c("VT", "HT")[findInterval(x = month, vec = c(1, 9))], year)
# [1] "HT06" "VT07" "HT08" "VT09"
Finally, a similar, more 'manual' method, which is less convenient if you have many 'breaks' and intervals to which you wish to map your dates:
paste0(c("VT", "HT")[as.integer(month > 8) + 1], year)
# [1] "HT06" "VT07" "HT08" "VT09"
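Applied to the data frame from the question, the same idea could be wrapped in a small helper. This is a sketch: to_term is a hypothetical name, and the toy both below only reproduces the stterm values from head(both).

```r
# Map a date to its term label: months 1-8 give "VT", months 9-12 give "HT"
to_term <- function(d) {
  d <- as.Date(d)
  month <- as.integer(format(d, "%m"))
  paste0(ifelse(month > 8, "HT", "VT"), format(d, "%y"))
}

# Toy stand-in for the real data, using the stterm values shown in head(both)
both <- data.frame(stterm = c("2011-01-19", "2012-01-19", "2007-09-01",
                              "2011-09-01", "2008-09-01", "2013-09-01"))

both$regyear <- factor(to_term(both$stterm))
both
```

No lookup table is needed, so new term dates are labelled without touching the code.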
Another relevant Q&A here.
You could do it like this:
both$regyear<- factor(both$stterm, labels = c("2007-09-01"="HT07","2008-09-01"="HT08",
"2011-01-19"="VT11","2011-09-01"="HT11",
"2012-01-19"="VT12","2013-09-01"="HT13"))
There were several problems in your original code:
It did not create a new variable in your data frame: regyear<- factor(both$stterm, ... should be both$regyear<- factor(both$stterm, ...
There were no commas between the levels/labels.
There were too many levels for the given example dataset (see these instructions on how to give a reproducible example).
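If you prefer to keep an explicit date-to-label mapping, one way to sketch it is a named lookup vector passed to both levels and labels. The name lookup and the shortened example values are assumptions for illustration.

```r
# Example term dates (shortened for illustration)
stterm <- c("2011-01-19", "2007-09-01", "2011-01-19")

# Named vector: names are the dates, values are the labels
lookup <- c("2007-09-01" = "HT07", "2011-01-19" = "VT11")

# levels says which values to expect, labels says what to call them
regyear <- factor(stterm, levels = names(lookup), labels = lookup)
regyear
```

With levels given explicitly, a date that is missing from lookup becomes NA instead of silently shifting the other labels.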
I need to convert a date (m/d/y format) into 3 separate columns on which I hope to run an algorithm. (I'm trying to convert my dates into Julian Day Numbers.) I saw this suggestion for another user for separating data out into multiple columns using Oracle. I'm using R and am thoroughly stuck on how to code this appropriately. Would A1, A2... represent my new column headings, and what would the format difference be with the "update set" section?
update <tablename> set A1 = substr(ORIG, 1, 4),
A2 = substr(ORIG, 5, 6),
A3 = substr(ORIG, 11, 6),
A4 = substr(ORIG, 17, 5);
I'm trying hard to improve my skills in R but cannot figure this one out... any help is much appreciated. Thanks in advance... :)
I use the format() method for Date objects to pull apart dates in R. Using Dirk's datetxt, here is how I would go about breaking up a date into its constituent parts:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
datetxt <- as.Date(datetxt)
df <- data.frame(date = datetxt,
year = as.numeric(format(datetxt, format = "%Y")),
month = as.numeric(format(datetxt, format = "%m")),
day = as.numeric(format(datetxt, format = "%d")))
Which gives:
> df
date year month day
1 2010-01-02 2010 1 2
2 2010-02-03 2010 2 3
3 2010-09-10 2010 9 10
Note what several others have said; you can get the Julian dates without splitting out the various date components. I added this answer to show how you could do the breaking apart if you needed it for something else.
Given a text variable x, like this:
> x
[1] "10/3/2001"
then:
> as.Date(x,"%m/%d/%Y")
[1] "2001-10-03"
converts it to a date object. Then, if you need it:
> julian(as.Date(x,"%m/%d/%Y"))
[1] 11598
attr(,"origin")
[1] "1970-01-01"
gives you a Julian date (relative to 1970-01-01).
Don't try the substring thing...
See help(as.Date) for more.
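Note that julian() counts days from 1970-01-01, which is not the astronomical Julian Day Number. If the latter is what you need, a sketch of the conversion is below; it relies on 1970-01-01 having Julian Day Number 2440588, and the names unix_epoch_jdn and jdn are illustrative.

```r
# Dates as m/d/Y text, parsed to Date
x <- c("10/3/2001", "1/1/2000")
d <- as.Date(x, format = "%m/%d/%Y")

# Julian Day Number of 1970-01-01, the origin used by R's Date class
unix_epoch_jdn <- 2440588

# as.numeric(d) is the number of days since 1970-01-01
jdn <- as.numeric(d) + unix_epoch_jdn
jdn
```

For 2000-01-01 this gives 2451545, the commonly quoted Julian Day Number for that date.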
Quick ones:
Julian date converters already exist in base R, see eg help(julian).
One approach may be to parse the date as a POSIXlt and to then read off the components. Other date / time classes and packages will work too but there is something to be said for base R.
Parsing dates as string is almost always a bad approach.
Here is an example:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
dates <- as.Date(datetxt) ## you could examine these as well
plt <- as.POSIXlt(dates) ## now as POSIXlt types
plt[["year"]] + 1900 ## years are with offset 1900
#[1] 2010 2010 2010
plt[["mon"]] + 1 ## and months are on the 0 .. 11 interval
#[1] 1 2 9
plt[["mday"]]
#[1] 2 3 10
df <- data.frame(year=plt[["year"]] + 1900,
month=plt[["mon"]] + 1, day=plt[["mday"]])
df
# year month day
#1 2010 1 2
#2 2010 2 3
#3 2010 9 10
And of course
julian(dates)
#[1] 14611 14643 14862
#attr(,"origin")
#[1] "1970-01-01"
To convert a date (m/d/y format) into 3 separate columns, consider the df:
df <- data.frame(date = c("01-02-18", "02-20-18", "03-23-18"))
df
date
1 01-02-18
2 02-20-18
3 03-23-18
Convert to date format
df$date <- as.Date(df$date, format="%m-%d-%y")
df
date
1 2018-01-02
2 2018-02-20
3 2018-03-23
To get three separate columns with year, month and day:
library(lubridate)
df$year <- year(ymd(df$date))
df$month <- month(ymd(df$date))
df$day <- day(ymd(df$date))
df
date year month day
1 2018-01-02 2018 1 2
2 2018-02-20 2018 2 20
3 2018-03-23 2018 3 23
Hope this helps.
Hi Gavin: another way [using your idea] is:
The data frame we will use is stocks, which contains a variety of variables related to the changes over time of the oil and gas stocks.
The variables are:
colnames(stocks)
"bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC"
"emMN" "emMN.1" "chdate" "chV" "cbO" "chC" "chMN" "chMX"
One of the first things to do is change the emdate field, which is an integer vector, into a date vector.
realdate<-as.Date(emdate,format="%m/%d/%Y")
Next we want to split the emdate column into three separate columns representing month, day and year, using the idea you supplied.
dfdate <- data.frame(date=realdate)
year=as.numeric (format(realdate,"%Y"))
month=as.numeric (format(realdate,"%m"))
day=as.numeric (format(realdate,"%d"))
ls() will include the individual vectors, day, month, year and dfdate.
Now merge the dfdate, day, month, year into the original data-frame [stocks].
ostocks<-cbind(dfdate,day,month,year,stocks)
colnames(ostocks)
"date" "day" "month" "year" "bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC" "emMN" "emMX" "chdate" "chV"
"cbO" "chC" "chMN" "chMX"
Similar results and I also have date, day, month, year as separate vectors outside of the df.
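The same steps can be sketched end to end on a toy data frame. The values below are made up, only two of the original columns are kept, and emdate is assumed to be stored as m/d/Y text.

```r
# Toy stand-in for stocks (hypothetical values, only two of the columns)
stocks <- data.frame(emdate = c("01/02/2010", "02/03/2010", "09/10/2010"),
                     emV = c(100, 110, 120),
                     stringsAsFactors = FALSE)

# Parse the text dates, then pull out the components with format()
realdate <- as.Date(stocks$emdate, format = "%m/%d/%Y")
parts <- data.frame(date  = realdate,
                    day   = as.numeric(format(realdate, "%d")),
                    month = as.numeric(format(realdate, "%m")),
                    year  = as.numeric(format(realdate, "%Y")))

# Put the new columns in front of the original ones
ostocks <- cbind(parts, stocks)
colnames(ostocks)
```

Building the components inside a data frame avoids leaving loose day, month and year vectors in the workspace.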