Remove duplicate rows from xts object - r

I am having trouble deleting duplicated rows in an xts object. I have an R script that downloads tick data for a currency and converts it to an xts object in OHLC format. The script pulls new data every 15 minutes, covering everything from today's first trade to the last recorded trade. The previously downloaded data is stored in an .RData file and loaded; the new data is then appended to it and the combined result overwrites the old .RData file.
Here is an example of what my data looks like:
.Open .High .Low .Close .Volume .Adjusted
2012-01-07 00:00:11 6.69683 7.01556 6.38000 6.81000 48387.58 6.81000
2012-01-08 00:00:09 6.78660 7.20000 6.73357 7.11358 57193.53 7.11358
2012-01-09 00:00:57 7.08362 7.19100 5.81000 6.32570 148406.85 6.32570
2012-01-10 00:01:01 6.32687 6.89000 6.00100 6.36000 110210.25 6.36000
2012-01-11 00:00:07 6.44904 7.13800 6.41266 6.90000 99442.07 6.90000
2012-01-12 00:01:02 6.90000 6.99700 6.33700 6.79999 140116.52 6.79999
2012-01-13 00:02:01 6.78211 6.80400 6.40000 6.41000 60228.77 6.41000
2012-01-14 00:00:23 6.42000 6.50000 6.23150 6.31894 25392.98 6.31894
Now if I run the script again, the new data is added to the xts object:
.Open .High .Low .Close .Volume .Adjusted
2012-01-07 00:00:11 6.69683 7.01556 6.38000 6.81000 48387.58 6.81000
2012-01-08 00:00:09 6.78660 7.20000 6.73357 7.11358 57193.53 7.11358
2012-01-09 00:00:57 7.08362 7.19100 5.81000 6.32570 148406.85 6.32570
2012-01-10 00:01:01 6.32687 6.89000 6.00100 6.36000 110210.25 6.36000
2012-01-11 00:00:07 6.44904 7.13800 6.41266 6.90000 99442.07 6.90000
2012-01-12 00:01:02 6.90000 6.99700 6.33700 6.79999 140116.52 6.79999
2012-01-13 00:02:01 6.78211 6.80400 6.40000 6.41000 60228.77 6.41000
2012-01-14 00:00:23 6.42000 6.50000 6.23150 6.31894 25392.98 6.31894
2012-01-14 00:00:23 6.42000 6.75000 6.22010 6.57157 75952.01 6.57157
As you can see, the last line has the same date as the second-to-last line. I want to keep the last row for the last date and delete the second-to-last row. When I try the following code to delete the duplicated rows, it does not work:
xx <- mt.xts[!duplicated(mt.xts$Index),]
xx
.Open .High .Low .Close .Volume .Adjusted
I do not get any result. How can I delete duplicate data entries in an xts object using the Index as the indicator of duplication?

Shouldn't it be index(mt.xts) rather than mt.xts$Index?
The following seems to work.
# Sample data
library(xts)
x <- xts(
  1:10,
  rep( seq.Date( Sys.Date(), by="day", length=5 ), each=2 )
)
# Remove rows with a duplicated timestamp
y <- x[ ! duplicated( index(x) ), ]
# Remove rows with a duplicated timestamp, but keep the latest one
z <- x[ ! duplicated( index(x), fromLast = TRUE ), ]

In my case,
x <- x[! duplicated( index(x) ), ]
did not work as intended, because every row ended up with a slightly different timestamp, so no index values were actually duplicated. Deduplicating on the data values instead may help if the previous solution did not:
x <- x[! duplicated( coredata(x) ), ]
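If the timestamps differ slightly from row to row but you only want one row per calendar day, as in the OHLC data in the question, a hedged variation is to deduplicate on the date part of the index instead of the full timestamp (mt.xts here stands for the object from the question):
# as.Date() drops the time of day, so two rows recorded on the same
# calendar day compare as duplicates even if their timestamps differ;
# fromLast = TRUE keeps the most recently added row for each day
mt.xts <- mt.xts[ ! duplicated( as.Date(index(mt.xts)), fromLast = TRUE ), ]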

Related

getSymbols: downloading data for multiple symbols and exporting adjusted prices to a CSV file

quantmod newbie here,
My end goal is to have a CSV file of monthly stock prices. I've downloaded the data with getSymbols using this code:
Symbols <- c("DIS", "TSLA","ATVI", "MSFT", "FB", "ABT","AAPL","AMZN",
"BAC","NFLX","ADBE","WMT","SRE","T","MS")
Data <- new.env()
getSymbols(c("^GSPC",Symbols),from="2015-01-01",to="2020-12-01"
,periodicity="monthly",
env=Data)
The lines above work fine. Now I need to create a data frame that only includes the adjusted prices for all the symbols, with a date column of course.
any help, please? :)
Desired output would be a table with a date column and one adjusted-price column per ticker (the original post included a screenshot).
Another straightforward way to get your monthly data:
tickers <- c('AMZN','FB','GOOG','AAPL')
getSymbols(tickers,periodicity="monthly")
head(do.call("merge.xts",c(lapply(mget(tickers),"[",,6),all=FALSE)),3)
AMZN.Adjusted FB.Adjusted GOOG.Adjusted AAPL.Adjusted
2012-06-01 228.35 31.10 288.9519 17.96558
2012-07-01 233.30 21.71 315.3032 18.78880
2012-08-01 248.27 18.06 341.2658 20.46477
Note that the logical argument all = FALSE is the equivalent of an inner join: you only get rows where all of your stocks have prices. all = TRUE fills data that is not available with NAs (an outer join).
To write the file (where monthlyPrices is the merged xts object from above) you can use:
write.zoo(monthlyPrices,file = 'filename.csv',sep=',',quote=FALSE)
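Putting this together with the Data environment from the question, a minimal hedged sketch that builds monthlyPrices from the adjusted columns and writes it out might look like the following (the output file name is illustrative; Ad() is quantmod's helper for extracting the *.Adjusted column):
library(quantmod)   # also loads xts/zoo, which provide merge() and write.zoo()
# eapply() applies Ad() to every object stored in the Data environment,
# then merge() joins the per-ticker adjusted series by date
monthlyPrices <- do.call(merge, eapply(Data, Ad))
write.zoo(monthlyPrices, file = "monthlyPrices.csv", sep = ",", quote = FALSE)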
First get your data from the environment:
require(quantmod)
# your code
dat <- mget(ls(Data), env=Data)
Then extract the data from the objects:
newdat <- as.data.frame(sapply( names(dat), function(x) coredata(dat[[x]])[,1] ))
Note that this takes the opening values (see coredata(dat[[x]])[,1]); the objects contain more columns, e.g.:
names(dat[["AAPL"]])
[1] "AAPL.Open" "AAPL.High" "AAPL.Low" "AAPL.Close"
[5] "AAPL.Volume" "AAPL.Adjusted"
Last, get the dates (this assumes identical dates for all symbols):
rownames(newdat) <- index(dat[["AAPL"]])
# OR, more universal, by extracting from the complete list:
rownames(newdat) <-
as.data.frame( sapply( names(dat), function(x) as.character(index(dat[[x]])) ) )[,1]
head(newdat, 3)
AAPL ABT ADBE AMZN ATVI BAC DIS FB GSPC MS
2015-01-01 27.8475 45.25 72.70 312.58 20.24 17.99 94.91 78.58 2058.90 39.05
2015-02-01 29.5125 44.93 70.44 350.05 20.90 15.27 91.30 76.11 1996.67 33.96
2015-03-01 32.3125 47.34 79.14 380.85 23.32 15.79 104.35 79.00 2105.23 35.64
MSFT NFLX SRE T TSLA WMT
2015-01-01 46.66 49.15143 111.78 33.59 44.574 86.27
2015-02-01 40.59 62.84286 112.38 33.31 40.794 84.79
2015-03-01 43.67 67.71429 108.20 34.56 40.540 83.93
Writing the csv:
write.csv(newdat, "file.csv")
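Since the question asked for adjusted prices rather than opening prices, a hedged variation of the extraction above that pulls the *.Adjusted column instead (using quantmod's Ad() helper; the output file name is illustrative) could be:
# Ad() selects the *.Adjusted column by name; [, 1] turns it into a plain vector
newdat.adj <- as.data.frame(sapply(names(dat), function(x) coredata(Ad(dat[[x]]))[, 1]))
rownames(newdat.adj) <- index(dat[["AAPL"]])
write.csv(newdat.adj, "file_adjusted.csv")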

How to save data column of zoo object to matrix?

I am downloading some data using the R package tseries:
require('tseries')
tickers<- c('JPM','AAPL','MSFT','FB','GE');
prices = matrix(NA,nrow=40,ncol=6)
startdate<-'2015-02-02'
enddate<-'2015-03-30'# 40 rows dim()
for (i in 1:5) {
  prices[, i] <- get.hist.quote(
    instrument = tickers[i],
    start = startdate,
    end = enddate,
    quote = 'AdjClose',
    provider = 'yahoo')
}
colnames(prices)<-c('JPM','AAPL','MSFT','FB','GE');
I want to construct a matrix holding the adjusted close prices and the date information, but I don't know how to access the zoo date column. When I construct a zoo object using get.hist.quote(), I can view the object together with its dates, but when I save the values into the matrix, the date column is missing.
Here Map applied to get.hist.quote creates a zoo object for each ticker. Then we use zoo's multi-way merge to combine them all into a final zoo object, prices:
prices <- do.call(merge,
                  Map(get.hist.quote, tickers,
                      start = startdate,
                      end = enddate,
                      quote = 'AdjClose',
                      provider = 'yahoo'))
I would probably keep all the series in a zoo object. This can be done with the following code, which also avoids your for loop. You can always convert this object to a matrix with as.matrix() afterwards.
prices <-lapply(tickers, get.hist.quote, start=startdate, end=enddate, quote='AdjClose')
prices <- Reduce(cbind, prices)
names(prices) <- tickers
prices <- as.matrix(prices)
head(prices)
JPM AAPL MSFT FB GE
2015-02-02 55.10 118.16 40.99 74.99 23.99
2015-02-03 56.35 118.18 41.31 75.40 24.25
2015-02-04 56.01 119.09 41.54 75.63 23.94
2015-02-05 56.40 119.94 42.15 75.61 24.28
2015-02-06 57.51 118.93 42.11 74.47 24.30
2015-02-09 57.44 119.72 42.06 74.44 24.42
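The index of a zoo object holds the dates, and as.matrix() keeps them only as row names, which is why a plain numeric matrix seems to lose the date column. If you want the dates as an explicit column, one hedged option, reusing the merged prices object from either answer before the as.matrix() call, is:
# Data frame with an explicit Date column next to the prices
prices.df <- data.frame(Date = index(prices), coredata(prices))
head(prices.df)
# Or keep the matrix form and read the dates back from the row names
prices.mat <- as.matrix(prices)
head(rownames(prices.mat))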

In R, how do I convert this CSV data to xts?

I am trying to read in a CSV file and convert it to xts format. However, I am running into an issue because the CSV has the date and time in separate columns.
2012.10.30,20:00,1.29610,1.29639,1.29607,1.29619,295
2012.10.30,20:15,1.29622,1.29639,1.29587,1.29589,569
2012.10.30,20:30,1.29590,1.29605,1.29545,1.29574,451
2012.10.30,20:45,1.29576,1.29657,1.29576,1.29643,522
2012.10.30,21:00,1.29643,1.29645,1.29581,1.29621,526
2012.10.30,21:15,1.29621,1.29644,1.29599,1.29642,330
I am trying to pull it in with
euXTS <- as.xts(read.zoo(file="EURUSD15.csv", sep=",", format="%Y.%m.%d", header=FALSE))
But it gives me this warning message, so I think I somehow have to attach the time stamp, but I am not sure of the best way to do that.
Warning message:
In zoo(rval3, ix) :
Some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
It is better to use read.zoo to read your time series directly into a zoo object, which is easily coerced to an xts one:
library(xts)
ts.z <- read.zoo(text='2012.10.30,20:00,1.29610,1.29639,1.29607,1.29619,295
2012.10.30,20:15,1.29622,1.29639,1.29587,1.29589,569
2012.10.30,20:30,1.29590,1.29605,1.29545,1.29574,451
2012.10.30,20:45,1.29576,1.29657,1.29576,1.29643,522
2012.10.30,21:00,1.29643,1.29645,1.29581,1.29621,526
2012.10.30,21:15,1.29621,1.29644,1.29599,1.29642,330',
sep=',',index=1:2,tz='',format="%Y.%m.%d %H:%M")
as.xts(ts.z)
V3 V4 V5 V6 V7
2012-10-30 20:00:00 1.29610 1.29639 1.29607 1.29619 295
2012-10-30 20:15:00 1.29622 1.29639 1.29587 1.29589 569
2012-10-30 20:30:00 1.29590 1.29605 1.29545 1.29574 451
2012-10-30 20:45:00 1.29576 1.29657 1.29576 1.29643 522
2012-10-30 21:00:00 1.29643 1.29645 1.29581 1.29621 526
2012-10-30 21:15:00 1.29621 1.29644 1.29599 1.29642 330
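Applied to the file from the question, a hedged sketch along the same lines (assuming EURUSD15.csv contains exactly the seven comma-separated columns shown above, no header row, and that the five data columns are open, high, low, close and volume) would be:
library(xts)
# index=1:2 pastes the date and time columns together to form the index
eu.z <- read.zoo("EURUSD15.csv", sep = ",", header = FALSE,
                 index = 1:2, tz = "", format = "%Y.%m.%d %H:%M")
euXTS <- as.xts(eu.z)
colnames(euXTS) <- c("Open", "High", "Low", "Close", "Volume")
head(euXTS)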

R quantmod::getFinancials

I'm using the quantmod package. I've got a vector of tickers like this:
c("AAPL","GOOG","IBM","GS","AMZN","GE")
and I want to create a function to calculate the EBIT margin of a stock (= operating income / total revenue). So for a given stock, I use the following piece of code, which only works for GE (provided a ".f" is added at the end of the ticker):
require(quantmod)
getFinancials("GE",period="A")
ebit.margin <- function(stock.ticker.f){
return(stock.ticker$IS$A["Operating Income",]/stock.ticker$IS$A["Total Revenue",])
}
ebit.margin("GE")
I would like to generalize this function in order to then use the apply function. There are several difficulties:
when applying the quantmod::getFinancials function to a ticker, the financial statements of the stock are saved in the default environment. viewFinancials then has to be used to get and print the financial statements. I need a way to access the financial statements directly inside the function
The function's argument is a string like "GE.f", but it would be more convenient to enter the ticker directly ("GE"). I've tried to use paste0 and gsub to build a string like "GE.f", but it doesn't work because the resulting string doesn't belong to the financials class.
To sum up, I'm a bit lost...
It's easier if you use auto.assign=FALSE
s <- c("AAPL","GOOG","IBM","GS","AMZN","GE")
fin <- lapply(s, getFinancials, auto.assign=FALSE)
names(fin) <- s
lapply(fin, function(x) x$IS$A["Operating Income", ] / x$IS$A["Total Revenue",])
#$AAPL
#2012-09-29 2011-09-24 2010-09-25 2009-09-26
# 0.3529596 0.3121507 0.2818704 0.2736278
#
#$GOOG
#2012-12-31 2011-12-31 2010-12-31 2009-12-31
# 0.2543099 0.3068724 0.3540466 0.3514585
#
#$IBM
#2012-12-31 2011-12-31 2010-12-31 2009-12-31
# 0.2095745 0.1964439 0.1974867 0.1776439
#
#$GS
#2012-12-31 2011-12-31 2010-12-31 2009-12-31
#0.2689852 0.1676678 0.2804621 0.3837401
#
#$AMZN
#2012-12-31 2011-12-31 2010-12-31 2009-12-31
#0.01106510 0.01792957 0.04110630 0.04606471
#
#$GE
#2012-12-31 2011-12-31 2010-12-31 2009-12-31
#0.11811969 0.13753327 0.09415548 0.06387029
Another option is to load your tickers into a new environment.
tickers <- new.env()
s <- c("AAPL","GOOG","IBM","GS","AMZN","GE")
lapply(s, getFinancials,env=tickers)
sapply(ls(envir = tickers),
       function(x) {
         x <- get(x, envir = tickers)  ## fetch the object by its name
         x$IS$A["Operating Income", ] / x$IS$A["Total Revenue", ]
       })
AAPL.f AMZN.f GE.f GOOG.f GS.f IBM.f
2012-09-29 0.3529596 0.01106510 0.11811969 0.2543099 0.2689852 0.2095745
2011-09-24 0.3121507 0.01792957 0.13753327 0.3068724 0.1676678 0.1964439
2010-09-25 0.2818704 0.04110630 0.09415548 0.3540466 0.2804621 0.1974867
2009-09-26 0.2736278 0.04606471 0.06387029 0.3514585 0.3837401 0.1776439
EDIT
No need to use ls and get; just use the handy eapply (thanks @GSee), which applies FUN to the named values from an environment and returns the results as a list:
eapply(tickers, function(x)
x$IS$A["Operating Income", ] / x$IS$A["Total Revenue",])

R: convert email addresses into unique integers

R beginner here with what seems to be a pretty simple problem:
I have a number of email logs that I have read into R in the format:
>log1
Date Time From To
1 2000-01-01 00:00:00 bob#mail.com test1#mail.com
2 2000-01-02 01:00:00 carolyn #mail.com test2#mail.com
3 2000-01-03 02:00:00 chris#mail.com test3#mail.com
4 2000-01-04 03:00:00 chris #mail.com test4#mail.com
5 2000-01-05 04:00:00 alan#mail.com test5#mail.com
6 2000-01-06 05:00:00 alan.#mail.com test6#mail.com
I need to change log1$From and log1$To into a globally unique numeric identifier, such that when I read in other logs later, any given email address receives the same identifier as in previous logs.
I have tried:
id <- as.numeric(as.character(log1[,3]))
id <- as.numeric(levels(log1[,3]))
id <- strtoi(charToRaw(log1[,4]), base=16)
Would some kind soul please help me out – Thanks!
Apologies, I should probably have included this:
Date=c( "01/01/2000" ,"02/01/2000" ,"03/01/2000", "04/01/2000" ,"05/01/2000" ,"06/01/2000","07/01/2000","08/01/2000",
"09/01/2000","10/01/2000","11/01/2000", "12/01/2000" ,"13/01/2000", "14/01/2000", "15/01/2000","16/01/2000"
,"17/01/2000","18/01/2000","19/01/2000","20/01/2000","01/01/2000","02/01/2000")
Time=c("00:00:00","01:00:00","02:00:00", "03:00:00" ,"04:00:00" ,"05:00:00", "06:00:00" ,"07:00:00", "08:00:00", "09:00:00" ,"10:00:00",
"11:00:00", "12:00:00","13:00:00", "14:00:00","15:00:00","16:00:00","17:00:00","18:00:00","19:00:00","00:00:00" ,"00:00:00")
From=c("bob.shults#mail.com","carolyn.green#mail.com","chris.long#mail.com","christi.nicolay#mail.com","alan.aronowitz#mail.com","alan.comnes#mail.com",
"dab#sprintmail.com","ana.correa#mail.com","andrew.fastow#mail.com","elena.kapralova#mail.com","bob.shults#mail.com","carolyn.green#mail.com",
"chris.long#mail.com","christi.nicolay#mail.com","alan.aronowitz#mail.com","alan.comnes#mail.com","dab#sprintmail.com","ana.correa#mail.com",
"andrew.fastow#mail.com","elena.kapralova#mail.com","bob.shults#mail.com","bob.shults#mail.com")
To=c("ana.correa#mail.com","test2#mail.com","test3#mail.com","test4#mail.com","test5#mail.com","test6#mail.com","test7#mail.com",
"test8#mail.com","test9#mail.com","test10#mail.com","test11#mail.com","test12#mail.com","test13#mail.com","test14#mail.com",
"test15#mail.com","test16#mail.com","test17#mail.com","test18#mail.com","test19#mail.com","test20#mail.com","ana.correa#mail.com","ana.correa#mail.com")
log<-data.frame(Date=Date,Time=Time,From=From,To=To)
Attempt at using MD5 to generate globally unique identifiers. Note how the identifier for ana.correa#mail.com is a correct match within ID_to but is not within ID_from:
library(digest)  # provides hmac()

ID_to <- data.frame()
ID_from <- data.frame()
for (i in 1:nrow(log)) {
  to <- as.numeric(paste('0x', substr(rep(hmac('secret', log[i,4], algo='md5'), 2), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep=""))
  ID_to <- rbind(ID_to, to)
  from <- as.numeric(paste('0x', substr(rep(hmac('secret', log[i,3], algo='md5'), 2), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep=""))
  ID_from <- rbind(ID_from, from)
}
ID_to[,3]<-paste(ID_to[,1],ID_to[,2], sep="")
ID_from[,3]<-paste(ID_from[,1],ID_from[,2], sep="")
edgelist<-data.frame(ID_from[,3],log[,3],ID_to[,3],log[,4],log[,1],log[,2])
print(edgelist)
ID_from...3. log...3. ID_to...3. log...4. log...1. log...2.
27488842661591306920 bob.shults#mail.com 18727221862165338513 ana.correa#mail.com 01/01/2000 00:00:00
38124472891255273775 carolyn.green#mail.com 1251903296725454474 test2#mail.com 02/01/2000 01:00:00
29070047663451376630 chris.long#mail.com 17074276751156451031 test3#mail.com 03/01/2000 02:00:00
8261398433828474582 christi.nicolay#mail.com 1563683670909194033 test4#mail.com 04/01/2000 03:00:00
18727221862165338513 alan.aronowitz#mail.com 26735368323826533112 test5#mail.com 05/01/2000 04:00:00
5680838251168988404 alan.comnes#mail.com 2923605896229594830 test6#mail.com 06/01/2000 05:00:00
2351312285811012730 dab#sprintmail.com 17171333544033270402 test7#mail.com 07/01/2000 06:00:00
328278708432069254 ana.correa#mail.com 33840664403556851587 test8#mail.com 08/01/2000 07:00:00
1127901879852039037 andrew.fastow#mail.com 1973548136161209824 test9#mail.com 09/01/2000 08:00:00
7349515121496417787 elena.kapralova#mail.com 5680838251168988404 test10#mail.com 10/01/2000 09:00:00
27488842661591306920 bob.shults#mail.com 328278708432069254 test11#mail.com 11/01/2000 10:00:00
38124472891255273775 carolyn.green#mail.com 1127901879852039037 test12#mail.com 12/01/2000 11:00:00
29070047663451376630 chris.long#mail.com 27488842661591306920 test13#mail.com 13/01/2000 12:00:00
8261398433828474582 christi.nicolay#mail.com 38124472891255273775 test14#mail.com 14/01/2000 13:00:00
18727221862165338513 alan.aronowitz#mail.com 29070047663451376630 test15#mail.com 15/01/2000 14:00:00
5680838251168988404 alan.comnes#mail.com 8261398433828474582 test16#mail.com 16/01/2000 15:00:00
2351312285811012730 dab#sprintmail.com 2351312285811012730 test17#mail.com 17/01/2000 16:00:00
328278708432069254 ana.correa#mail.com 7349515121496417787 test18#mail.com 18/01/2000 17:00:00
1127901879852039037 andrew.fastow#mail.com 41762759923562968495 test19#mail.com 19/01/2000 18:00:00
7349515121496417787 elena.kapralova#mail.com 24894056753582090007 test20#mail.com 20/01/2000 19:00:00
27488842661591306920 bob.shults#mail.com 18727221862165338513 ana.correa#mail.com 01/01/2000 00:00:00
27488842661591306920 bob.shults#mail.com 18727221862165338513 ana.correa#mail.com 02/01/2000 00:00:00
Attempt at levels/factor method:
Getting an error:
log <- union(levels(log[,3]), levels(log[,4]))
>Error in emails[, 3] : incorrect number of dimensions
You can use MD5 to generate globally unique identifiers since it has a very low probability of collisions, but since its output is 128-bit you need a few numbers to represent it (four integers in 32-bit R, two integers in 64-bit R). This should be easy to deal with using short numeric vectors, though.
Here is how you can generate such a vector of four integers for an email address (or any other string for that matter):
library(digest)
email <- 'test1#gmail'
as.numeric(paste('0x', substr(rep(hmac('secret56f8a7', email, algo='md5'), 4), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep=''))
You could use algo='crc32' and obtain just one integer, but this isn't recommended since collisions are far more likely with CRC.
You need to create a unique ID for every email address in your logs. One way would be to calculate the CRC checksum of every address and use that as an identifier, but it will be a very long number. Or you could implement a hash map in R and make the email address the key.
I think this will do what you want, and it's efficient, and you can do it using only base packages...
Procedure:
1. Convert both columns to factors.
2. Union the factor levels, in exactly the same way, so that each email address has a unique ID in the factor levels.
3. Change the entries in each column to the number corresponding to their factor level. As a result, we can identify the times when "test1#gmail.com" sent and received emails by simply looking up "1" in both columns.
log1$From <- as.factor(log1$From)
log1$To <- as.factor(log1$To)
emails <- union(levels(log1$From), levels(log1$To))
levels(log1$From) <- emails
levels(log1$To) <- emails
log1$From <- as.numeric(log1$From)
log1$To <- as.numeric(log1$To)
It will probably be a good idea to keep a record of the original email addresses, as I have done here. Then if you were interested in, say, which emails test1#gmail.com sent:
log1[log1$From == which(emails == "test1#gmail.com"), ]
should do the trick! You can write a procedure to make that look much cleaner as well...
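To satisfy the requirement that the same address gets the same identifier across future logs, one hedged sketch is to persist the master vector of known addresses between runs and assign IDs with match(); the .rds file name is illustrative, and log1 is assumed to hold the raw From/To columns as read in (before the factor conversion above):
id_file <- "email_ids.rds"
known <- if (file.exists(id_file)) readRDS(id_file) else character(0)

# Append addresses we have not seen before; earlier positions never change,
# so previously assigned IDs stay stable across runs
new_addresses <- setdiff(unique(c(as.character(log1$From), as.character(log1$To))), known)
known <- c(known, new_addresses)
saveRDS(known, id_file)

# match() returns each address's position in the master vector,
# so the same address always maps to the same integer in every log
log1$FromID <- match(as.character(log1$From), known)
log1$ToID   <- match(as.character(log1$To), known)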
