Sorting of data frame per date and string value - r

I am downloading data from Bloomberg and then updating it in R. The data frame looks as follows.
ticker date PX_LAST PE_RATIO
1 SMI Index 2014-09-30 8835.14 20.3692
2 SMI Index 2014-10-31 8837.78 20.3753
3 DAX Index 2014-09-30 9474.30 16.6487
4 DAX Index 2014-10-31 9326.87 16.3896
5 SMI Index 2014-11-28 9150.46 21.0962
6 SMI Index 2014-12-31 8983.37 20.6990
7 DAX Index 2014-11-28 9980.85 17.5388
8 DAX Index 2014-12-31 9805.55 16.8639
Now I would like to have this data frame sorted by ticker (not in alphabetical order, but in the order in which the tickers first appear) and then by date, so that the end result would be:
ticker date PX_LAST PE_RATIO
1 SMI Index 2014-09-30 8835.14 20.3692
2 SMI Index 2014-10-31 8837.78 20.3753
3 SMI Index 2014-11-28 9150.46 21.0962
4 SMI Index 2014-12-31 8983.37 20.6990
5 DAX Index 2014-09-30 9474.30 16.6487
6 DAX Index 2014-10-31 9326.87 16.3896
7 DAX Index 2014-11-28 9980.85 17.5388
8 DAX Index 2014-12-31 9805.55 16.8639

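There's a function chgroup() in the data.table package that does exactly what you're looking for. It groups identical values of a vector together while preserving the order in which they first appear. It is only available for character vectors (ch for character). Rows within each group keep their original order, so this relies on the dates already being sorted within each ticker, as they are here.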
require(data.table)
DF[chgroup(DF$ticker), ]
# ticker date PX_LAST PE_RATIO
# 1 SMIIndex 2014-09-30 8835.14 20.3692
# 2 SMIIndex 2014-10-31 8837.78 20.3753
# 5 SMIIndex 2014-11-28 9150.46 21.0962
# 6 SMIIndex 2014-12-31 8983.37 20.6990
# 3 DAXIndex 2014-09-30 9474.30 16.6487
# 4 DAXIndex 2014-10-31 9326.87 16.3896
# 7 DAXIndex 2014-11-28 9980.85 17.5388
# 8 DAXIndex 2014-12-31 9805.55 16.8639
If your ticker column is a factor, convert it to character type first.

You could try order(), after converting ticker to a factor whose levels follow the order of first appearance:
df$ticker <- factor(df$ticker, levels=unique(df$ticker))
df1 <- df[with(df, order(ticker, date)),]
row.names(df1) <- NULL
df1
# ticker date PX_LAST PE_RATIO
#1 SMI Index 2014-09-30 8835.14 20.3692
#2 SMI Index 2014-10-31 8837.78 20.3753
#3 SMI Index 2014-11-28 9150.46 21.0962
#4 SMI Index 2014-12-31 8983.37 20.6990
#5 DAX Index 2014-09-30 9474.30 16.6487
#6 DAX Index 2014-10-31 9326.87 16.3896
#7 DAX Index 2014-11-28 9980.85 17.5388
#8 DAX Index 2014-12-31 9805.55 16.8639
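For a self-contained check, the sketch below rebuilds the sample data and sorts it without changing the type of the ticker column. It uses match() against unique() to rank each ticker by where it first appears; the helper variable names first_seen and sorted are mine, and this is only one possible alternative to the factor conversion above.

```r
# Sample data from the question (ticker kept as character)
df <- data.frame(
  ticker = c("SMI Index", "SMI Index", "DAX Index", "DAX Index",
             "SMI Index", "SMI Index", "DAX Index", "DAX Index"),
  date = as.Date(c("2014-09-30", "2014-10-31", "2014-09-30", "2014-10-31",
                   "2014-11-28", "2014-12-31", "2014-11-28", "2014-12-31")),
  PX_LAST = c(8835.14, 8837.78, 9474.30, 9326.87,
              9150.46, 8983.37, 9980.85, 9805.55),
  PE_RATIO = c(20.3692, 20.3753, 16.6487, 16.3896,
               21.0962, 20.6990, 17.5388, 16.8639),
  stringsAsFactors = FALSE
)

# Rank each ticker by its first appearance, then break ties by date
first_seen <- match(df$ticker, unique(df$ticker))
sorted <- df[order(first_seen, df$date), ]
row.names(sorted) <- NULL
sorted
```

Because the date is an explicit sort key here, the result does not depend on the input already being in date order within each ticker.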

Related

Create multiple lagged variables using a zoo object

I need to create 'n' variables on the fly, each holding a lag of the original variable, for lags 1 to 'n'. Something like this:
OrigVar
DatePeriod, value
2/01/2018,6
3/01/2018,4
4/01/2018,0
5/01/2018,2
6/01/2018,4
7/01/2018,1
8/01/2018,6
9/01/2018,2
10/01/2018,7
Lagged 1 variable
2/01/2018,NA
3/01/2018,6
4/01/2018,4
5/01/2018,0
6/01/2018,2
7/01/2018,4
8/01/2018,1
9/01/2018,6
10/01/2018,2
11/01/2018,7
Lagged 2 variable
2/01/2018,NA
3/01/2018,NA
4/01/2018,6
5/01/2018,4
6/01/2018,0
7/01/2018,2
8/01/2018,4
9/01/2018,1
10/01/2018,6
11/01/2018,2
12/01/2018,7
Lagged 3 variable
2/01/2018,NA
3/01/2018,NA
4/01/2018,NA
5/01/2018,6
6/01/2018,4
7/01/2018,0
8/01/2018,2
9/01/2018,4
10/01/2018,1
11/01/2018,6
12/01/2018,2
13/01/2018,7
and so on
I tried using the shift function and various other functions. With most of those that worked for me, the lagged variables ended at the last date of the original variable. In other words, the length of the lagged variable is the same as that of the original variable.
What I am looking for is for the new lagged variable to be shifted down by the 'kth' lag and the data series to be extended by 'k' elements, including the index.
The reason I need this is to be able to compute the value of the dependent variable beyond the in-sample period, using the regression coefficients and the corresponding lagged variable values.
y1 <- Lag(ciresL1_usage_1601_1612, shift = 1)
head(y1)
2016-01-02 2016-01-03 2016-01-04 2016-01-05 2016-01-06 2016-01-07
NA -5171.051 -6079.887 -3687.227 -3229.453 -2110.368
y2 <- Lag(ciresL1_usage_1601_1612, shift = 2)
head(y2)
2016-01-02 2016-01-03 2016-01-04 2016-01-05 2016-01-06 2016-01-07
NA NA -5171.051 -6079.887 -3687.227 -3229.453
tail(y2)
2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
-2316.039 -2671.185 -4100.793 -2043.020 -1147.798 1111.674
tail(ciresL1_usage_1601_1612)
2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
-4100.793 -2043.020 -1147.798 1111.674 3498.729 2438.739
Is there a way to do it relatively easily? I know I can do it by looping: adding 'k' rows to a new vector and reloading the data into it with the values shifted appropriately. But I don't want to use that method unless I have to. I am quietly confident that there has to be a better way to do it than this!
By the way, the object is a zoo object with daily dates as the index.
Best regards
Deepak
Convert the input zoo object to zooreg and then use lag.zooreg like this:
library(zoo)
# test input
z <- zoo(1:10, as.Date("2008-01-01") + 0:9)
zr <- as.zooreg(z)
lag(zr, -(0:3))
giving:
lag0 lag-1 lag-2 lag-3
2008-01-01 1 NA NA NA
2008-01-02 2 1 NA NA
2008-01-03 3 2 1 NA
2008-01-04 4 3 2 1
2008-01-05 5 4 3 2
2008-01-06 6 5 4 3
2008-01-07 7 6 5 4
2008-01-08 8 7 6 5
2008-01-09 9 8 7 6
2008-01-10 10 9 8 7
2008-01-11 NA 10 9 8
2008-01-12 NA NA 10 9
2008-01-13 NA NA NA 10
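To make the mechanics explicit, here is a base R sketch of the same idea without zoo: lag k pads k NA values at the front, and the date index is extended by n days at the end. The function name extended_lags is hypothetical, and the sketch assumes a sorted, gap-free daily index.

```r
# Hypothetical helper: build lags 1..n of x, extending a daily index by n days
extended_lags <- function(x, dates, n) {
  all_dates <- seq(min(dates), by = "day", length.out = length(x) + n)
  lags <- sapply(seq_len(n), function(k) {
    # k leading NAs, then the data, then NAs up to the extended length
    c(rep(NA, k), x, rep(NA, n - k))
  })
  colnames(lags) <- paste0("lag", seq_len(n))
  data.frame(date = all_dates, lags)
}

x <- 1:10
dates <- as.Date("2008-01-01") + 0:9
res <- extended_lags(x, dates, 3)
tail(res, 4)
```

With irregular dates (for example, business days only) the extension rule is ambiguous, which is why the zooreg conversion above matters: it tells lag() what the regular spacing is.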

Calculation using the date function

I need to get the sales for the last month of each year.
Month Sales
01-03-2018 2351
01-06-2018 4522
01-09-2018 3632
01-12-2018 6894
01-03-2019 5469
01-06-2019 6546
01-09-2019 7885
01-12-2019 6597
01-03-2020 7845
01-06-2020 6894
01-09-2020 5469
01-12-2020 6546
01-03-2021 2351
01-06-2021 4522
01-09-2021 3632
01-12-2021 6546
01-03-2022 7885
01-06-2022 6597
01-09-2022 7845
01-12-2022 6894
Here I want to find the sales for the 12th month (December) of every year.
Output should be as follows:
Month Sales
01-12-2018 6894
01-12-2019 6597
01-12-2020 6546
01-12-2021 6546
01-12-2022 6894
I can select every fourth row from the table, but I want to do it using a date function. Please help.
Make sure your column Month is set as a date variable and use format to get the month, i.e.
#Make sure it is a date variable
df$Month <- as.POSIXct(df$Month, format = '%d-%m-%Y')
df[format(df$Month, '%m') == 12,]
which gives,
Month Sales
4 2018-12-01 6894
8 2019-12-01 6597
12 2020-12-01 6546
16 2021-12-01 6546
20 2022-12-01 6894
One way with startsWith:
#Month needs to be of character type
df[startsWith(df$Month, '01-12-'), ]
# Month Sales
#4 01-12-2018 6894
#8 01-12-2019 6597
#12 01-12-2020 6546
#16 01-12-2021 6546
#20 01-12-2022 6894
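A self-contained sketch of the date-based approach, assuming Month is stored as text in day-month-year form as in the question. It parses with as.Date() rather than as.POSIXct(), which sidesteps time zone handling, and compares the month as a string; the shortened sample and the name december are only for illustration.

```r
df <- data.frame(
  Month = c("01-03-2018", "01-06-2018", "01-09-2018", "01-12-2018",
            "01-03-2019", "01-06-2019", "01-09-2019", "01-12-2019"),
  Sales = c(2351, 4522, 3632, 6894, 5469, 6546, 7885, 6597),
  stringsAsFactors = FALSE
)

# Parse day-month-year text into Date values
df$Month <- as.Date(df$Month, format = "%d-%m-%Y")

# Keep only the rows whose month component is December
december <- df[format(df$Month, "%m") == "12", ]
december
```

Unlike selecting every fourth row, this keeps working if a quarter is missing or the rows are reordered.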

R: Subtract values in one dataframe based on Reference ID in second dataframe

I have searched hard for an answer and attempted many times to figure out a way to achieve what I am after. I have two XTS dataframes: 1.) Account data 2.) Reference data. Here is some sample data (NOTE: Sample data is not in XTS form but same idea applies):
#Accounts Data
df1 <- data.frame(index = c("2015-08-31","2015-07-31","2015-06-30","2015-05-31","2015-04-30","2015-03-31","2015-02-28","2015-01-31","2014-12-31","2014-11-30"),
"account1"=runif(10, -5.5,5.5), "account2"=runif(10, -5.5,5.5),"account3"=runif(10, -5.5,5.5))
#Reference Data
df2 <- data.frame(index = c("2015-08-31","2015-07-31","2015-06-30","2015-05-31","2015-04-30","2015-03-31","2015-02-28","2015-01-31","2014-12-31","2014-11-30"),
"CC456"=runif(10, -5.5,5.5),"EE789"=runif(10, -5.5,5.5),"AA123"=runif(10, -5.5,5.5))
#Accounts and Reference IDs Table
df3 <- data.frame("accounts"=c("account1","account2","account3"),"AccountRef"=c("CC456","EE789","AA123"))
so you have df1 (accounts):
index account1 account2 account3
1 2015-08-31 4.324357 4.313650 -0.94839158
2 2015-07-31 -2.334594 -5.412580 -0.03622573
3 2015-06-30 -3.209001 -3.278428 2.89787864
4 2015-05-31 -4.182272 -1.413639 -3.08145877
5 2015-04-30 -5.169491 1.525406 1.07483264
6 2015-03-31 1.601116 -3.539175 0.50540123
7 2015-02-28 2.843629 -3.127028 -1.56990909
8 2015-01-31 -4.037052 -5.276741 2.14786518
9 2014-12-31 5.165133 2.122590 5.36112955
10 2014-11-30 -0.516639 -5.399451 -3.85471675
df2 contains the data that I need to subtract from the accounts (with ref IDs as column names):
index CC456 EE789 AA123
1 2015-08-31 5.4784459 4.804948 -3.2405529
2 2015-07-31 3.3577544 -2.300360 -2.8951527
3 2015-06-30 1.7617227 5.207737 -4.8039332
4 2015-05-31 -5.3399975 -5.431412 -3.9897288
5 2015-04-30 5.3813128 1.123664 1.0381041
6 2015-03-31 4.5766378 3.536293 1.5906431
7 2015-02-28 -0.6529657 4.187261 0.2982024
8 2015-01-31 -3.1963028 4.215060 -3.7630705
9 2014-12-31 -0.7919399 5.022231 0.3401043
10 2014-11-30 0.9654926 2.702521 1.8169502
and df3 is the lookup table that tells me which column in df2 belongs to each account in df1, so that I can transform df1 into the difference df1 - df2.
The issue is that I have about 200 accounts, and 1200 Ref IDs. Essentially I want to create a new dataframe which is the result of subtracting each column in df1 by the corresponding refID in df2.
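One possible approach, sketched on plain data frames: use df3 to pick the reference column for each account, then subtract the two column sets in a single step. This assumes df1 and df2 have the same dates in the same row order, and that every name in df3 exists as a column in df1 or df2. The small fixed values below replace runif() so the result can be checked by hand.

```r
df1 <- data.frame(index = c("2015-08-31", "2015-07-31"),
                  account1 = c(4, -2), account2 = c(1, 5), account3 = c(0, 3))
df2 <- data.frame(index = c("2015-08-31", "2015-07-31"),
                  CC456 = c(1, 1), EE789 = c(2, 2), AA123 = c(3, 3))
df3 <- data.frame(accounts = c("account1", "account2", "account3"),
                  AccountRef = c("CC456", "EE789", "AA123"),
                  stringsAsFactors = FALSE)

# Rows must line up by date before subtracting
stopifnot(identical(df1$index, df2$index))

# Column-wise difference: each account minus its mapped reference column
diffs <- df1[df3$accounts] - df2[df3$AccountRef]
names(diffs) <- df3$accounts
result <- cbind(index = df1$index, diffs)
result
```

If the lookup columns in df3 are factors, wrap them in as.character() before indexing. Several accounts may map to the same reference ID, since df2[df3$AccountRef] simply repeats that column.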

How to do a BETWEEN merge the data.table way?

I have two data.tables that are each 5-10GB in size. They look similar to the following.
library(data.table)
A <- data.table(
person = c(1,1,1,2,3,3,3,3,4,4),
datetime = c(
'2015-04-06 14:22:18',
'2015-04-07 02:55:32',
'2015-11-21 10:16:05',
'2015-10-03 13:37:29',
'2015-02-26 23:51:56',
'2015-05-16 18:21:44',
'2015-06-02 04:07:43',
'2015-11-28 15:22:36',
'2015-01-19 04:10:22',
'2015-01-24 02:18:11'
)
)
B <- data.table(
person = c(1,1,3,4,4,5),
datetime2 = c(
'2015-04-06 14:24:59',
'2015-11-28 15:22:36',
'2015-06-02 04:07:43',
'2015-01-19 06:10:22',
'2015-01-24 02:18:18',
'2015-04-06 14:22:18'
)
)
A$datetime <- as.POSIXct(A$datetime)
B$datetime2 <- as.POSIXct(B$datetime2)
The idea is to find rows in B where the datetime is within 0-10 minutes of a matching row in A (matching is done by person) and mark them in A. The question is how can I do it most efficiently using data.table?
One plan is to join the two data tables based on person only, then calculate the time difference and find rows where the time difference is between 0 and 600 seconds, and finally outer join the latter with A:
setkey(A,person)
AB <- A[B,.(datetime,
datetime2,
diff = difftime(datetime2, datetime, units = "secs"))
, by = .EACHI]
M <- AB[diff < 600 & diff > 0]
setkey(A, person, datetime)
setkey(M, person, datetime)
M[A,]
Which gives us the correct result:
person datetime datetime2 diff
1: 1 2015-04-06 14:22:18 2015-04-06 14:24:59 161 secs
2: 1 2015-04-07 02:55:32 <NA> NA secs
3: 1 2015-11-21 10:16:05 <NA> NA secs
4: 2 2015-10-03 13:37:29 <NA> NA secs
5: 3 2015-02-26 23:51:56 <NA> NA secs
6: 3 2015-05-16 18:21:44 <NA> NA secs
7: 3 2015-06-02 04:07:43 <NA> NA secs
8: 3 2015-11-28 15:22:36 <NA> NA secs
9: 4 2015-01-19 04:10:22 <NA> NA secs
10: 4 2015-01-24 02:18:11 2015-01-24 02:18:18 7 secs
However, I am not sure if this is the most efficient way. Specifically, I am using AB[diff < 600 & diff > 0] which I assume will run a vector search not a binary search, but I cannot think of how to do it using a binary search.
Also, I am not sure if converting to POSIXct is the most efficient way of calculating time differences.
Any ideas on how to improve efficiency are highly appreciated.
data.table's rolling join is perfect for this task:
B[, datetime := datetime2]
setkey(A,person,datetime)
setkey(B,person,datetime)
B[A,roll=-600]
person datetime2 datetime
1: 1 2015-04-06 14:24:59 1428319338
2: 1 NA 1428364532
3: 1 NA 1448090165
4: 2 NA 1443868649
5: 3 NA 1424983916
6: 3 NA 1431789704
7: 3 2015-06-02 04:07:43 1433207263
8: 3 NA 1448713356
9: 4 NA 1421629822
10: 4 2015-01-24 02:18:18 1422055091
The only difference from your expected output is that the bounds are inclusive: the rolling join matches time differences from 0 up to and including 10 minutes (see row 7, an exact match). If that is a problem, you can simply drop the exact matches afterwards.
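A minimal sketch of that clean-up step, on a cut-down version of the sample data. The column name diff_secs is my own, and the explicit tz = "UTC" is an assumption made only to keep the example reproducible. After the rolling join, matches outside the strict range 0 < diff < 600 are blanked out.

```r
library(data.table)

A <- data.table(
  person = c(1, 3, 4),
  datetime = as.POSIXct(c("2015-04-06 14:22:18", "2015-06-02 04:07:43",
                          "2015-01-24 02:18:11"), tz = "UTC")
)
B <- data.table(
  person = c(1, 3, 4),
  datetime2 = as.POSIXct(c("2015-04-06 14:24:59", "2015-06-02 04:07:43",
                           "2015-01-24 02:18:18"), tz = "UTC")
)

# Copy the B timestamp into a join column so datetime2 survives the join
B[, datetime := datetime2]
setkey(A, person, datetime)
setkey(B, person, datetime)

# For each row of A, roll the next B timestamp backwards by up to 600 seconds
res <- B[A, roll = -600]

# Enforce strict bounds: blank out matches with diff == 0 or diff == 600
res[, diff_secs := as.numeric(datetime2) - as.numeric(datetime)]
res[!(diff_secs > 0 & diff_secs < 600), `:=`(datetime2 = NA, diff_secs = NA)]
res
```

The filter runs on the join result, which has one row per row of A, so it is a single vector scan on a table no larger than A.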

Can I cross tab dates, grouped by year?

I cleared one hurdle, with some help from SO and thought the next hurdle would be easier. What I really have is start and end dates in a data frame:
require(lubridate)
demo <- read.table(text = "
start end num
2010-12-31 <NA> 35
2013-04-01 <NA> 34
2015-06-02 <NA> 34
2015-06-15 2012-12-31 34
2015-01-30 2011-12-31 33
2014-04-15 2013-12-31 33
2014-05-28 2013-12-31 33
2014-06-02 <NA> 33
2015-06-17 <NA> 33
2015-06-25 <NA> 33
2015-06-24 <NA> 32
2013-07-31 <NA> 32
2013-08-31 <NA> 32
2015-04-27 <NA> 31
2015-05-07 <NA> 31
2013-12-30 <NA> 31
2014-11-21 <NA> 30
2013-12-20 2013-06-30 30
",header = TRUE, sep = "")
demo$start <- as.Date(demo$start, '%Y-%m-%d')
demo$end <- as.Date(demo$end, '%Y-%m-%d')
I can get a table of start years, or a table of end years, with table(year(demo$end)) or table(year(demo$start)) which is a lovely start. But what I really want to know is something more like: for each year, how many entries that started have not yet ended? So count is.na() for each start year.
I thought I could use aggregate() for that, like this:
aggregate(is.na(end) ~ year(start), demo, FUN = length)
However, that seems to be counting every observation, not just the observations for which the end date is NA.
You can use table with multiple arguments to give you 2-way or multi-way tables:
> with(demo, table( year=format(demo$start, "%Y"), Not.missing = !is.na(end) ) )
      Not.missing
year   FALSE TRUE
  2010     1    0
  2013     4    1
  2014     2    2
  2015     6    2
You could also use lubridate::year instead of the format call.
If you need the number of NA values for each 'year', use sum, because is.na(end) is a logical vector and summing it counts the TRUE values. length, by contrast, returns the total number of elements per year regardless of whether they are TRUE or FALSE.
aggregate(cbind(end=is.na(end)) ~ cbind(year=year(start)), demo, FUN = sum)
# year end
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
Or we can use data.table. We convert the 'data.frame' to a 'data.table' with setDT(demo), use is.na(end) in i to keep only the rows with a missing end date, group by the year of the 'start' column, and take .N, the number of rows in each group.
library(data.table)
setDT(demo)[is.na(end), list(end = .N) , list(year=year(start))]
# year end
#1: 2010 1
#2: 2013 4
#3: 2015 6
#4: 2014 2
Here is another option:
library(dplyr)
library(lubridate)
demo %>% subset(is.na(end)) %>% group_by(year(start)) %>% summarise(n=length(end))
#Source: local data frame [4 x 2]
#
# year(start) n
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
This is pretty straightforward. With your original data (demo), subset to only get the NA in your end column. Afterwards (and using year() from the lubridate package), group by each year, and get the summary of the number of NAs present in the end column. This will return a data.frame object.
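For comparison, the same count can be sketched in base R with tapply(), with no extra packages. The reduced sample data and the names start_year and open_by_year are mine; the point is only that sum() over a logical vector counts the TRUE values.

```r
demo <- data.frame(
  start = as.Date(c("2010-12-31", "2013-04-01", "2013-07-31", "2013-12-20",
                    "2014-04-15", "2014-06-02", "2015-06-02", "2015-06-15")),
  end = as.Date(c(NA, NA, NA, "2013-06-30",
                  "2013-12-31", NA, NA, "2012-12-31"))
)

# Year of each start date as text, so lubridate is not required
start_year <- format(demo$start, "%Y")

# sum() of a logical vector counts the TRUE values, i.e. the missing end dates
open_by_year <- tapply(is.na(demo$end), start_year, sum)
open_by_year
```

Replacing sum with length here reproduces the problem from the question: every row of a year is counted, missing end date or not.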
