R dataframe: Getting value from next row subject to criteria - r

I have data in the following format:
quotes <- read.csv(text = "
id,ts,origin,product,bid,ask,nextts
1,2016-10-18 20:20:54.733,SourceA,Dow,1.09812,1.0982,
2,2016-10-18 20:20:55.093,SourceA,Ftse,7010.5,7011.5,
3,2016-10-18 20:20:55.149,SourceA,Dow,18159.0,18161.0,
4,2016-10-18 20:20:55.871,SourceA,Ftse,18159.0,18161.0,")
How can I populate the column 'nextts' with the value of ts in the next row where origin is the same and product is the same? Essentially, I want to join the data on itself (subject to it being the same product and origin) and capture the value of ts.
I found the following answer, "Return next row in a dataframe R", but it is a strict lead/lag without any criteria.

First ensure that ts is character or POSIXct rather than factor by explicitly converting it as shown here or by using the as.is=TRUE argument to read.csv. Then use ave with the indicated function to shift by group.
quotes$ts <- as.character(quotes$ts)
transform(quotes, nextts = ave(ts, origin, product, FUN = function(x) c(x[-1], NA)))
giving:
id ts origin product bid ask nextts
1 1 2016-10-18 20:20:54.733 SourceA Dow 1.09812 1.0982 2016-10-18 20:20:55.149
2 2 2016-10-18 20:20:55.093 SourceA Ftse 7010.50000 7011.5000 2016-10-18 20:20:55.871
3 3 2016-10-18 20:20:55.149 SourceA Dow 18159.00000 18161.0000 <NA>
4 4 2016-10-18 20:20:55.871 SourceA Ftse 18159.00000 18161.0000 <NA>
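The grouped shift can be checked on a cut-down copy of the data. This is only a sketch: the shortened timestamps and the helper name next_in_group are made up for illustration, and ts is kept as character so that ave() returns character values.

```r
# Cut-down version of the quotes data (timestamps shortened for readability)
quotes <- data.frame(
  id      = 1:4,
  ts      = c("20:20:54.733", "20:20:55.093", "20:20:55.149", "20:20:55.871"),
  origin  = "SourceA",
  product = c("Dow", "Ftse", "Dow", "Ftse"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shift a vector up by one and pad the end with NA
next_in_group <- function(x) c(x[-1], NA)

# ave() applies the helper within each origin/product group and
# puts the results back in the original row order
quotes$nextts <- ave(quotes$ts, quotes$origin, quotes$product,
                     FUN = next_in_group)
print(quotes)
```

The last row of each origin/product group gets NA because there is no later row to take ts from.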

Related

R how to replace/gsub a vector of values by another vector of values in a datatable

I have data with dates in a format that is not directly usable. The data are either annual, quarterly or monthly. Annual dates are stored correctly, quarterly ones are in the form 1Q2010, and monthly ones in the form JAN2010.
So something like
library(tidyverse)
library(data.table)
MWE <- data.table(date=c("JAN2020","FEB2020","1Q2020","2020"),
value=rnorm(4,2,1))
> MWE
date value
1: JAN2020 2.5886057
2: FEB2020 0.5913031
3: 1Q2020 1.6237973
4: 2020 1.4093762
I want to have them in a standard format. I think a reasonably readable way to do that is to replace the non-standard elements, so that these elements:
Date_Brute <- c("JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC","1Q","2Q","3Q","4Q")
Replaced by these ones
Date_Standardisee <- c("01-01","01-02","01-03","01-04","01-05","01-06","01-07", "01-08","01-09","01-10","01-11","01-12","01-01","01-04","01-07","01-10")
Now I think gsub does not work with vectors. I have found this answer that suggests using stringr::str_replace_all, but I have not been able to make it work in a data.table.
I am open to other functions that replace one vector with another, but I would like to avoid, for instance, slicing the data and using specific date-parsing functions.
Desired output :
> MWE
date value
1: 01-01-2020 2.5886057
2: 01-02-2020 0.5913031
3: 01-01-2020 1.6237973
4: 2020 1.4093762
You can try lubridate::parse_date_time(), which takes a vector of candidate formats to attempt in the conversion:
library(lubridate)
library(data.table)
MWE[, date := parse_date_time(date, orders = c("bY","qY", "Y"))]
date value
1: 2020-01-01 -0.4948354
2: 2020-02-01 1.0227036
3: 2020-01-01 2.6285688
4: 2020-01-01 1.9158595
We can use grep with as.yearqtr and as.yearmon to convert those 'date' elements into Date class and further change it to the specified format
library(zoo)
library(data.table)
MWE[grep('Q', date), date := format(as.Date(as.yearqtr(date,
'%qQ %Y')), '%d-%m-%Y')]
MWE[grep("[A-Z]", date), date := format(as.Date(as.yearmon(date)), '%d-%m-%Y')]
-output
MWE
# date value
#1: 01-01-2020 0.8931051
#2: 01-02-2020 2.9813625
#3: 01-01-2020 1.1918638
#4: 2020 2.8001267
Another option is fcoalesce with myd from lubridate
library(lubridate)
MWE[, date := fcoalesce(format(myd(date, truncated = 2), '%d-%m-%Y'), date)]
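The lookup-vector idea from the question can also be done in base R by looping over the two vectors with sub(). This is a sketch, not one of the answers above: it assumes each code appears only at the start of the string, and it adds the "-" separator that the desired output shows before the year. The names dates and out are made up for illustration.

```r
Date_Brute <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG",
                "SEP", "OCT", "NOV", "DEC", "1Q", "2Q", "3Q", "4Q")
Date_Standardisee <- c("01-01", "01-02", "01-03", "01-04", "01-05", "01-06",
                       "01-07", "01-08", "01-09", "01-10", "01-11", "01-12",
                       "01-01", "01-04", "01-07", "01-10")

dates <- c("JAN2020", "FEB2020", "1Q2020", "2020")

# Apply each pattern/replacement pair in turn; strings without a
# matching prefix (such as the annual "2020") are left untouched
out <- dates
for (i in seq_along(Date_Brute)) {
  out <- sub(paste0("^", Date_Brute[i]),
             paste0(Date_Standardisee[i], "-"),
             out)
}
print(out)
```

Inside a data.table the same loop could be wrapped in a function and called as MWE[, date := f(date)], where f is whatever name you give the wrapper.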

R Joining multiple time series with map-function

I have a problem to join time-series-dataframes with a map-function. I have 25 dataframes with cryptocurrency time series data.
ls(pattern="USD")
[1] "ADA.USD" "BCH.USD" "BNB.USD" "BTC.USD" "BTG.USD" "DASH.USD" "DOGE.USD" "EOS.USD" "ETC.USD" "ETH.USD" "IOT.USD"
[12] "LINK.USD" "LTC.USD" "NEO.USD" "OMG.USD" "QTUM.USD" "TRX.USD" "USDT.USD" "WAVES.USD" "XEM.USD" "XLM.USD" "XMR.USD"
[23] "XRP.USD" "ZEC.USD" "ZRX.USD"
Every object is a dataframe which stands for a cryptocurrency expressed in USD. Every dataframe has 2 columns: Date and Close (closing price). For example, the dataframe "BTC.USD" stands for Bitcoin in USD:
head(BTC.USD)
# A tibble: 6 x 2
Date Close
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
4 2016-01-03 431.
5 2016-01-04 433.
Now I want to join them all into one dataframe by Date with a map-function:
lst1 <- mget(ls(pattern = "USD"))
df <- map(.x = lst1,.f = full_join(by="Date"))
But it doesn't work:
Error in UseMethod("full_join") :
no applicable method for 'full_join' applied to an object of class "character"
Can somebody help me?
The result of mget is a list of characters, which is why full_join fails with an error.
Try this:
map(lst1, function(x) {full_join(tibble(x),head(BTC.USD),by="Date")}) # Full join might fail because lst1 has no column called Date.
Also, in the result of mget in the lst1 (that you have) there is no column called Date
Creating a lst1 tibble with Date Column:
DateVec=c("2015-12-31")
map(lst1, function(x) {full_join(tibble(x,Date=DateVec),head(BTC.USD),by="Date")})
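If the elements of lst1 are in fact the data frames themselves (mget() returns the objects, not their names), a successive full join can be sketched in base R with Reduce() and merge(). The two small data frames below are made-up stand-ins for the real ones, and the sketch assumes every data frame has exactly the columns Date and Close, in that order.

```r
# Made-up stand-ins for two of the 25 data frames
ADA.USD <- data.frame(Date  = as.Date(c("2016-01-01", "2016-01-02")),
                      Close = c(1, 2))
BTC.USD <- data.frame(Date  = as.Date(c("2016-01-01", "2016-01-03")),
                      Close = c(430, 431))
lst1 <- list(ADA.USD = ADA.USD, BTC.USD = BTC.USD)

# Rename each Close column after its list element so the columns
# remain distinguishable after joining
renamed <- Map(function(d, nm) setNames(d, c("Date", nm)), lst1, names(lst1))

# Full outer join, two data frames at a time
joined <- Reduce(function(x, y) merge(x, y, by = "Date", all = TRUE), renamed)
print(joined)
```

With dplyr and purrr loaded, purrr::reduce(renamed, dplyr::full_join, by = "Date") should be the equivalent call; map() applies a function to each element separately, so it cannot combine the elements with each other.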

Auto incrementing dates into vector in r

Hi this is a two part question.
How can I create an auto-incrementing data frame of dates?
I want to auto create a data frame with column "dates" with values in one month intervals from 2011-05-01 (1st May 2011) till today (2015-12-01).
Output:
S.no. Date
1 2011-05-01
2 2011-06-01
3 2011-07-01
. .
55 2015-12-01
Second I have a data frame with customer name and his expiry date for example:
names<-c("Tom","David")
expiryDate<-as.Date(c("2011-05-22","2011-06-19"))
df<-data.frame(names,expiryDate)
df
Name Expirydate
Tom 2011-05-22
David 2011-06-19
I want to process the expiry dates to check whether customer is active in that month.
Name 2011-05-01 2011-06-01 2011-07-01 ... (till 2015-12-01)
Tom TRUE FALSE FALSE
David TRUE TRUE FALSE
As @Roland mentioned, you can use seq.Date to generate a sequence of dates:
DateColumns <- seq.Date(as.Date("2011/05/01"), as.Date("2015/12/1"), by = "1 month")
DateColumnvalues <- t(sapply(df$expiryDate, function(x) x > DateColumns))
x <- data.frame(DateColumnvalues, row.names = df$names)
colnames(x) <- DateColumns
This generates a sequence of dates (DateColumns) for the 1st of every month and then uses sapply to check whether each expiryDate is later than those dates.
The first line of the code would answer first part of your question as well.
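For the first part of the question, the monthly sequence can be wrapped in a data frame with a running index. This is a minimal sketch: the column names follow the question's desired output, and the object name monthly is made up.

```r
# First of every month from May 2011 through December 2015
dates <- seq(as.Date("2011-05-01"), as.Date("2015-12-01"), by = "month")

# Running index plus the dates, as in the desired output
monthly <- data.frame(S.no. = seq_along(dates), Date = dates)

head(monthly, 3)
tail(monthly, 1)
```

To run up to the current date instead of a fixed end date, Sys.Date() can be passed as the second argument to seq().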

Merge Records Over Time Interval

Let me begin by saying this question pertains to R (the statistical programming language), but I'm open to straightforward suggestions for other environments.
The goal is to merge outcomes from dataframe (df) A to sub-elements in df B. This is a one to many relationship but, here's the twist, once the records are matched by keys they also have to match over a specific frame of time given by a start time and duration.
For example, a few records in df A:
OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal
And from df B:
OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00
The desired outcome from the merge would be:
OBS ID Time Outcome
1 01 10:12:10 Normal
3 02 10:12:45 Weird
Desired result: dataframe B with outcomes merged in from A. Notice observations 2 and 4 were dropped because although they matched IDs on records in A they did not fall within any of the time intervals given.
Question
Is it possible to perform this sort of operation in R and how would you get started? If not, can you suggest an alternative tool?
Set up data
First set up the input data frames. We create two versions of the data frames: A and B just use character columns for the times and At and Bt use the chron package "times" class for the times (which has the advantage over "character" class that one can add and subtract them):
LinesA <- "OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal"
LinesB <- "OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00"
A <- At <- read.table(textConnection(LinesA), header = TRUE,
colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE,
colClasses = c("numeric", rep("character", 2)))
# in At and Bt convert times columns to "times" class
library(chron)
At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
sqldf with times class
Now we can perform the calculation using the sqldf package. We use method="raw" (which does not assign classes to the output), so we must assign the "times" class to the output "Time" column ourselves:
library(sqldf)
out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration",
method = "raw")
out$Time <- times(as.numeric(out$Time))
The result is:
> out
OBS ID Time Outcome
1 1 01 10:12:10 Normal
2 3 02 10:12:45 Weird
With the development version of sqldf this can be done without using method="raw" and the "Time" column will automatically be set to "times" class by the sqldf class assignment heuristic:
library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration")
sqldf with character class
It's actually possible to avoid the "times" class by performing all time calculations in sqlite on character strings, employing sqlite's strftime function. The SQL statement is unfortunately a bit more involved:
sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
where strftime('%s', Time) - strftime('%s', StartTime)
between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
EDIT:
A series of edits which fixed grammar, added additional approaches and fixed/improved the read.table statements.
EDIT:
Simplified/improved final sqldf statement.
here is an example:
# first, merge by ID
z <- merge(A[, -1], B, by = "ID")
# convert string to POSIX time
z <- transform(z,
s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) -
as.numeric(strptime("00:00:00", "%H:%M:%S")),
tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))
# subset by time range
subset(z, s_t < tim & tim < s_t + dur)
the output:
ID StartTime Duration Outcome OBS Time s_t dur tim
1 1 10:12:06 00:00:10 Normal 1 10:12:10 1321665126 10 1321665130
2 1 10:12:06 00:00:10 Normal 2 10:12:15 1321665126 10 1321665135
7 2 10:12:30 00:00:30 Weird 3 10:12:45 1321665150 30 1321665165
OBS #2 looks to be in the range. Does that make sense?
Merge the two data.frames together with merge(). Then subset() the resulting data.frame with the condition time >= startTime & time <= startTime + Duration or whatever rules make sense to you.
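The merge-then-subset idea can be sketched in base R without any extra packages by converting the HH:MM:SS strings to seconds. The helper to_seconds is made up for this illustration, and the sketch assumes all times fall on the same day and that both interval endpoints are inclusive.

```r
A <- data.frame(
  ID        = c("01", "02", "01", "02"),
  StartTime = c("10:12:06", "10:12:30", "10:15:12", "10:45:00"),
  Duration  = c("00:00:10", "00:00:30", "00:01:15", "00:00:02"),
  Outcome   = c("Normal", "Weird", "Normal", "Normal"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  OBS  = 1:4,
  ID   = c("01", "01", "02", "01"),
  Time = c("10:12:10", "10:12:17", "10:12:45", "10:13:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "HH:MM:SS" -> seconds since midnight
to_seconds <- function(hms) {
  parts <- strsplit(hms, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# One-to-many merge on ID, then keep rows whose Time lies in the interval
m     <- merge(B, A, by = "ID")
start <- to_seconds(m$StartTime)
end   <- start + to_seconds(m$Duration)
tm    <- to_seconds(m$Time)

res <- m[tm >= start & tm <= end, c("OBS", "ID", "Time", "Outcome")]
res <- res[order(res$OBS), ]
print(res)
```

With the data from the question this keeps observations 1 and 3 only, since 10:12:17 falls one second after the end of the 10:12:06 to 10:12:16 interval.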

Split date data (m/d/y) into 3 separate columns

I need to convert a date (m/d/y format) into 3 separate columns on which I hope to run an algorithm (I'm trying to convert my dates into Julian Day Numbers). I saw this suggestion for another user for separating data out into multiple columns using Oracle. I'm using R and am thoroughly stuck on how to code this appropriately. Would A1, A2... represent my new column headings, and what would the format difference be with the "update set" section?
update <tablename> set A1 = substr(ORIG, 1, 4),
A2 = substr(ORIG, 5, 6),
A3 = substr(ORIG, 11, 6),
A4 = substr(ORIG, 17, 5);
I'm trying hard to improve my skills in R but cannot figure this one...any help is much appreciated. Thanks in advance... :)
I use the format() method for Date objects to pull apart dates in R. Using Dirk's datetxt, here is how I would go about breaking up a date into its constituent parts:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
datetxt <- as.Date(datetxt)
df <- data.frame(date = datetxt,
year = as.numeric(format(datetxt, format = "%Y")),
month = as.numeric(format(datetxt, format = "%m")),
day = as.numeric(format(datetxt, format = "%d")))
Which gives:
> df
date year month day
1 2010-01-02 2010 1 2
2 2010-02-03 2010 2 3
3 2010-09-10 2010 9 10
Note what several others have said; you can get the Julian dates without splitting out the various date components. I added this answer to show how you could do the breaking apart if you needed it for something else.
Given a text variable x, like this:
> x
[1] "10/3/2001"
then:
> as.Date(x,"%m/%d/%Y")
[1] "2001-10-03"
converts it to a date object. Then, if you need it:
> julian(as.Date(x,"%m/%d/%Y"))
[1] 11598
attr(,"origin")
[1] "1970-01-01"
gives you a Julian date (relative to 1970-01-01).
Don't try the substring thing...
See help(as.Date) for more.
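Putting the two steps together for m/d/y input, a short sketch might look like this. The second date and the object names d, parts and jd are made up for illustration; everything used is base R.

```r
x <- c("10/3/2001", "1/15/2010")

# Parse the m/d/y text into Date objects
d <- as.Date(x, format = "%m/%d/%Y")

# Split into three numeric columns using format()
parts <- data.frame(
  month = as.integer(format(d, "%m")),
  day   = as.integer(format(d, "%d")),
  year  = as.integer(format(d, "%Y"))
)

# Days since 1970-01-01; as.numeric() drops the "origin" attribute
jd <- as.numeric(julian(d))

print(cbind(parts, julian = jd))
```

Note that julian() counts days from 1970-01-01 by default, which is not the same as the astronomical Julian Day Number; a different origin can be supplied through its origin argument.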
Quick ones:
Julian date converters already exist in base R, see eg help(julian).
One approach may be to parse the date as a POSIXlt and to then read off the components. Other date / time classes and packages will work too but there is something to be said for base R.
Parsing dates as string is almost always a bad approach.
Here is an example:
datetxt <- c("2010-01-02", "2010-02-03", "2010-09-10")
dates <- as.Date(datetxt) ## you could examine these as well
plt <- as.POSIXlt(dates) ## now as POSIXlt types
plt[["year"]] + 1900 ## years are with offset 1900
#[1] 2010 2010 2010
plt[["mon"]] + 1 ## and months are on the 0 .. 11 interval
#[1] 1 2 9
plt[["mday"]]
#[1] 2 3 10
df <- data.frame(year=plt[["year"]] + 1900,
month=plt[["mon"]] + 1, day=plt[["mday"]])
df
# year month day
#1 2010 1 2
#2 2010 2 3
#3 2010 9 10
And of course
julian(dates)
#[1] 14611 14643 14862
#attr(,"origin")
#[1] "1970-01-01"
To convert a date (m/d/y format) into 3 separate columns, consider the following df:
df <- data.frame(date = c("01-02-18", "02-20-18", "03-23-18"))
df
date
1 01-02-18
2 02-20-18
3 03-23-18
Convert to date format
df$date <- as.Date(df$date, format="%m-%d-%y")
df
date
1 2018-01-02
2 2018-02-20
3 2018-03-23
To get three separate columns with year, month and day:
library(lubridate)
df$year <- year(ymd(df$date))
df$month <- month(ymd(df$date))
df$day <- day(ymd(df$date))
df
date year month day
1 2018-01-02 2018 1 2
2 2018-02-20 2018 2 20
3 2018-03-23 2018 3 23
Hope this helps.
Hi Gavin: another way [using your idea] is:
The data-frame we will use is oilstocks which contains a variety of variables related to the changes over time of the oil and gas stocks.
The variables are:
colnames(stocks)
"bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC"
"emMN" "emMN.1" "chdate" "chV" "cbO" "chC" "chMN" "chMX"
One of the first things to do is change the emdate field, which is an integer vector, into a date vector.
realdate<-as.Date(emdate,format="%m/%d/%Y")
Next we want to split emdate column into three separate columns representing month, day and year using the idea supplied by you.
> dfdate <- data.frame(date=realdate)
year=as.numeric (format(realdate,"%Y"))
month=as.numeric (format(realdate,"%m"))
day=as.numeric (format(realdate,"%d"))
ls() will include the individual vectors, day, month, year and dfdate.
Now merge the dfdate, day, month, year into the original data-frame [stocks].
ostocks<-cbind(dfdate,day,month,year,stocks)
colnames(ostocks)
"date" "day" "month" "year" "bpV" "bpO" "bpC" "bpMN" "bpMX" "emdate" "emV" "emO" "emC" "emMN" "emMX" "chdate" "chV"
"cbO" "chC" "chMN" "chMX"
Similar results and I also have date, day, month, year as separate vectors outside of the df.
